Optimising the FF16 hot paths

ff16
performance
develop
engineering
A profiling-driven pass over the FF16 solver: ~2.7-3.5x faster on the benchmark scenarios, achieved entirely through bit-identical C++ changes guided by Rprof and native sampling.
Published

June 18, 2026

plant @develop f8dbc4c

This post summarises a profiling-driven optimisation pass over the FF16 solver (PR #471). The work was guided by Rprof() plus native /usr/bin/sample profiles of the run_plant_benchmarks() FF16 scenarios, and the governing constraint was that every accepted change be bit-identical against HEAD (or its trajectory shift quantified) before being kept. The headline result: roughly 2.7-3.5x on the FF16 benchmarks, with the science provably unchanged.

Headline numbers

Measured as one-iteration bench::mark() timings (FF16 scenario):

| Stage | scm | build_schedule |
|---|---|---|
| Original baseline (4b5c43d1) | 242 ms | 656 ms |
| After the optimisation pass | ~85 ms | ~186 ms |
| Speedup | ~2.7x | ~3.5x |

The gains accumulated across ~15 individual changes, each profiled and verified in turn rather than landed as one big rewrite.

Where the time was going

Native profiling pointed at a handful of hot regions, all on the per-node, per-step physiology path:

  1. Crown assimilation quadrature: std::function callable overhead and redundant canopy-shape work dominated the integrand.
  2. Spline / interpolator lookups: repeated, bounds-checked evaluations and per-point binary searches during assimilation.
  3. Competition / environment recomputation: inverse-height competition ratios recomputed every call.
  4. Gradient calculation: physiology copied and recomputed via a per-call Internals allocation.
  5. String/map lookups: state and aux variables addressed by name on the hot path.
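
To make the first item concrete, here is a minimal sketch of the two ways an integrand can reach a quadrature routine. It is not the plant code: the trapezoid rule, the function names and the integrand used in the note below are all hypothetical stand-ins. The type-erased version pays an indirect call per evaluation, while the templated version lets the compiler see, and usually inline, the lambda body.

```cpp
#include <cmath>
#include <functional>

// Type-erased integrand: every f(x) goes through std::function's
// indirect call, which the optimiser generally cannot inline.
double trapezoid_erased(const std::function<double(double)>& f,
                        double a, double b, int n) {
  const double h = (b - a) / n;
  double sum = 0.5 * (f(a) + f(b));
  for (int i = 1; i < n; ++i) sum += f(a + i * h);
  return sum * h;
}

// Templated integrand: the callable's concrete type is a template
// parameter, so the lambda body is visible at the call site.
template <typename F>
double trapezoid_direct(F&& f, double a, double b, int n) {
  const double h = (b - a) / n;
  double sum = 0.5 * (f(a) + f(b));
  for (int i = 1; i < n; ++i) sum += f(a + i * h);
  return sum * h;
}
```

Calling both with the same lambda, for example `[eta](double z) { return 1.0 - std::pow(z, eta); }` with an illustrative `eta`, performs the same floating-point operations in the same order, so the two results are expected to match exactly. That property is what makes this kind of change a candidate for a bit-identical optimisation.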

How it was done

Each row below is an independently profiled, bit-identical change. Where machine drift between sessions made cumulative ratios unreliable, the honest figure was re-measured as a same-session A/B/A comparison (documented in the tracking note).

| Target | Change | Effect |
|---|---|---|
| Assimilation quadrature | Direct-lambda integrand (drop std::function); eta-specialised canopy shape (issue #465); ratio-based q() path | Largest single contributor: ~1.4-1.5x from the canopy-shape specialisation alone |
| Spline lookups | Inline interpolator accessors + unchecked eval; hoist spline.max() out of the integrand; O(1)-guess + nudge lookup replacing per-point binary search | ~3.5x on build_schedule traces back largely here |
| Competition | Cache the inverse-height competition ratio; dependent-aux height-inverse cache | ~1.7x cumulative on competition-bound paths |
| Gradients | Reuse a thread-local scratch Individual (no per-call Internals alloc) | Removed per-gradient heap churn |
| State/rate addressing | Direct hot-path state/aux indices instead of string/map lookups | ~1.9x on the indexed paths |
| Allocation derivatives | Dedup pow() and reuse area_sapwood/mass_sapwood from net_mass_production_dt | Bit-identical, timing-neutral hygiene; kept for clarity |
| FF16 strategy | Inline competition helpers + area_leaf into ff16_strategy.h; inline util::is_finite into a header | ~1.2-1.3x from removing call overhead on the tightest loops |
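
The "guess + nudge" lookup in the spline row can be sketched as follows. This is an illustration under stated assumptions rather than the interpolator's actual code: the function names are hypothetical, the knots are assumed strictly increasing with at least two entries, the query is assumed finite and inside the knot range, and the guess is assumed to come from the query's linear position in that range (the post does not say how the real guess is formed). A binary-search version is included as the reference that the fast path has to reproduce.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reference lookup: index i such that xs[i] <= x < xs[i + 1],
// with x == xs.back() mapped to the final interval.
std::size_t find_interval_bsearch(const std::vector<double>& xs, double x) {
  auto it = std::upper_bound(xs.begin(), xs.end(), x);  // first knot > x
  std::size_t i = static_cast<std::size_t>(it - xs.begin());
  if (i == 0) return 0;                                  // x below first knot
  return std::min(i - 1, xs.size() - 2);
}

// Guess + nudge: jump to the interval suggested by x's linear position,
// then walk left or right until the bracketing condition holds.
// On evenly spaced knots the guess is already right (O(1)); on uneven
// knots the cost is proportional to how far off the guess is.
std::size_t find_interval_guess(const std::vector<double>& xs, double x) {
  const std::size_t last = xs.size() - 2;  // index of the final interval
  double frac = (x - xs.front()) / (xs.back() - xs.front());
  frac = std::max(0.0, std::min(1.0, frac));  // keep the cast in range
  std::size_t i =
      static_cast<std::size_t>(frac * static_cast<double>(last + 1));
  if (i > last) i = last;
  while (i > 0 && x < xs[i]) --i;            // nudge left
  while (i < last && x >= xs[i + 1]) ++i;    // nudge right
  return i;
}
```

Because both functions return the same interval index for every in-range query, whatever evaluates the spline afterwards sees identical inputs, and only the cost of finding the interval changes.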

Notes on measurement discipline

A recurring theme, and the reason the tracking note is so long, is that these runs are short (hundreds of ms) with a ~10% run-to-run noise floor, and that the macOS /usr/bin/sample profiler perturbs Rprof totals whenever it attaches successfully. Several apparent “regressions” were measurement artefacts:

  • A Sampler-Effect Audit isolated the inflation caused by concurrent sampling; the sample-off figures are the ones to trust for small changes.
  • Cross-session comparisons were re-grounded with same-session A/B/A re-measurements before any speedup was claimed.
  • An LTO experiment (issue #470) showed no measurable gain and was declined rather than kept speculatively.
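
The arithmetic behind a same-session A/B/A comparison can be sketched as below. This assumes A/B/A means timing the baseline, then the candidate, then the baseline again. The struct, the function names and the acceptance rule are hypothetical and are not taken from the tracking note; only the ~10% noise floor comes from the measurements described above.

```cpp
#include <cmath>

struct AbaResult {
  double speedup;  // mean baseline time divided by candidate time
  double drift;    // relative disagreement between the two baseline runs
};

// a1_ms and a2_ms are the baseline timings taken before and after the
// candidate timing b_ms, all within the same session.
AbaResult aba_compare(double a1_ms, double b_ms, double a2_ms) {
  const double a_mean = 0.5 * (a1_ms + a2_ms);
  return {a_mean / b_ms, std::fabs(a1_ms - a2_ms) / a_mean};
}

// Hypothetical acceptance rule: the two baseline runs must agree to within
// the noise floor, and the speedup must exceed it.
bool is_credible(const AbaResult& r, double noise_floor) {
  return r.drift <= noise_floor && r.speedup >= 1.0 + noise_floor;
}
```

With made-up timings of 100 ms, 95 ms and 104 ms, the apparent speedup is about 1.07x, which sits inside a 10% noise floor and would not be claimed under this rule.
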
Note

Bit-identical optimisations that turned out to be timing-neutral were deliberately kept where they improved clarity (e.g. the pow() dedup), and declined where they added complexity for no measured benefit (LTO). “Faster” was never allowed to mean “different answer” — the FF16 reference baselines gate the whole pass.
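
"Bit-identical" is a stricter test than numerical equality, and a minimal sketch of such a check looks like this. The helper names are hypothetical, and the post does not say how the reference baselines are actually compared; this only shows what the term means for a vector of doubles.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes 64-bit doubles");

// Copy a double's storage into a 64-bit integer; memcpy avoids the
// undefined behaviour of casting between pointer types.
std::uint64_t bits_of(double x) {
  std::uint64_t u;
  std::memcpy(&u, &x, sizeof u);
  return u;
}

// True only if both vectors have the same length and every pair of
// elements has exactly the same bit pattern.
bool bit_identical(const std::vector<double>& a,
                   const std::vector<double>& b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (bits_of(a[i]) != bits_of(b[i])) return false;
  }
  return true;
}
```

Unlike `==`, this distinguishes `0.0` from `-0.0` and rejects results that differ only in the last bit, such as `0.1 + 0.2` versus `0.3`, so a reordered floating-point sum that looks harmless would still fail the gate.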