Optimising the FF16 hot paths
plant @develop f8dbc4c
This post summarises a profiling-driven optimisation pass over the FF16 solver (PR #471). The work was guided by Rprof() plus native /usr/bin/sample profiles of the run_plant_benchmarks() FF16 scenarios, and the governing constraint was that every accepted change be bit-identical against HEAD (or its trajectory shift quantified) before being kept. The headline result: roughly 2.7-3.5x on the FF16 benchmarks, with the science provably unchanged.
Headline numbers
Measured as one-iteration bench::mark() timings (FF16 scenario):
| Stage | scm | build_schedule |
|---|---|---|
| Original baseline (4b5c43d1) | 242 ms | 656 ms |
| After the optimisation pass | ~85 ms | ~186 ms |
| Speedup | ~2.7x | ~3.5x |
The gains accumulated across ~15 individual changes, each profiled and verified in turn rather than landed as one big rewrite.
Where the time was going
Native profiling pointed at a handful of hot regions, all on the per-node, per-step physiology path:
- Crown assimilation quadrature: std::function callable overhead and redundant canopy-shape work dominated the integrand.
- Spline / interpolator lookups: repeated, bounds-checked evaluations and per-point binary searches during assimilation.
- Competition / environment recomputation: inverse-height competition ratios recomputed every call.
- Gradient calculation: physiology copied and recomputed via a per-call Internals allocation.
- String/map lookups: state and aux variables addressed by name on the hot path.
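To make the first item concrete, here is a minimal sketch of why a type-erased integrand costs more than a directly passed lambda. It is not the solver's code: the trapezium rule, the function names `integrate_erased` and `integrate_direct`, and the toy integrand are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <functional>

// Type-erased version: every integrand evaluation goes through
// std::function's indirect call, which the compiler usually cannot inline.
inline double integrate_erased(const std::function<double(double)>& f,
                               double a, double b, int n) {
  assert(n > 0);
  const double h = (b - a) / n;
  double sum = 0.5 * (f(a) + f(b));  // trapezium rule end points
  for (int i = 1; i < n; ++i) sum += f(a + i * h);
  return sum * h;
}

// Templated version: the callable's concrete type is part of the
// signature, so the optimiser can see through the call and inline it.
template <typename F>
inline double integrate_direct(F&& f, double a, double b, int n) {
  assert(n > 0);
  const double h = (b - a) / n;
  double sum = 0.5 * (f(a) + f(b));  // same rule, same evaluation order
  for (int i = 1; i < n; ++i) sum += f(a + i * h);
  return sum * h;
}
```

A call such as `integrate_direct([](double z) { return z * z; }, 0.0, 1.0, 1000)` evaluates the same rule as the erased version. Inlining can still change floating-point contraction on some targets, so a change of this kind needs the same check against the reference baselines as any other.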
How it was done
Each row below is an independently profiled, bit-identical change. Where machine drift between sessions made cumulative ratios unreliable, the honest figure was re-measured as a same-session A/B/A comparison (documented in the tracking note).
| Target | Change | Effect |
|---|---|---|
| Assimilation quadrature | Direct-lambda integrand (drop std::function); eta-specialised canopy shape (issue #465); ratio-based q() path | Largest single contributor: ~1.4-1.5x from the canopy-shape specialisation alone |
| Spline lookups | Inline interpolator accessors + unchecked eval; hoist spline.max() out of the integrand; O(1)-guess + nudge lookup replacing per-point binary search | ~3.5x on build_schedule traces back largely here |
| Competition | Cache the inverse-height competition ratio; dependent-aux height-inverse cache | ~1.7x cumulative on competition-bound paths |
| Gradients | Reuse a thread-local scratch Individual (no per-call Internals alloc) | Removed per-gradient heap churn |
| State/rate addressing | Direct hot-path state/aux indices instead of string/map lookups | ~1.9x on the indexed paths |
| Allocation derivatives | Dedup pow() and reuse area_sapwood/mass_sapwood from net_mass_production_dt | Bit-identical, timing-neutral hygiene; kept for clarity |
| FF16 strategy | Inline competition helpers + area_leaf into ff16_strategy.h; inline util::is_finite into a header | ~1.2-1.3x from removing call overhead on the tightest loops |
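The "O(1)-guess + nudge" row can be sketched as follows. This is an illustration under stated assumptions, not the interpolator's actual code: the names `find_interval` and `interp_linear` are hypothetical, the knots are assumed strictly increasing, and linear interpolation stands in for the spline evaluation to keep the example short.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Locate i with x[i] <= q < x[i+1], clamped to the first/last interval.
// Assumes x is strictly increasing with at least two knots.
inline std::size_t find_interval(const std::vector<double>& x, double q) {
  assert(x.size() >= 2);
  assert(std::isfinite(q));
  const std::size_t last = x.size() - 2;  // index of the final interval
  const double span = x.back() - x.front();

  // O(1) guess: pretend the knots are evenly spaced.
  double frac = (q - x.front()) / span;
  if (frac < 0.0) frac = 0.0;
  if (frac > 1.0) frac = 1.0;
  std::size_t i = static_cast<std::size_t>(frac * static_cast<double>(last));

  // Nudge: walk a few steps to correct for uneven spacing.
  while (i > 0 && q < x[i]) --i;
  while (i < last && q >= x[i + 1]) ++i;
  return i;
}

// Linear interpolation on top of the lookup (stand-in for a spline eval).
inline double interp_linear(const std::vector<double>& x,
                            const std::vector<double>& y, double q) {
  assert(x.size() == y.size());
  const std::size_t i = find_interval(x, q);
  const double t = (q - x[i]) / (x[i + 1] - x[i]);
  return y[i] + t * (y[i + 1] - y[i]);
}
```

The guess is exact for evenly spaced knots and the nudge loops then do no work. For badly clustered knots the walk can approach a linear scan, so whether this beats a binary search depends on how the knots are actually distributed.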
Notes on measurement discipline
A recurring theme — and the reason the tracking note is so long — is that these runs are short (hundreds of ms) with a ~10% run-to-run noise floor, and the macOS /usr/bin/sample profiler perturbs Rprof totals when it attaches successfully. Several apparent “regressions” were measurement artefacts:
- A Sampler-Effect Audit isolated the inflation caused by concurrent sampling; the sample-off figures are the ones to trust for small changes.
- Cross-session comparisons were re-grounded with same-session A/B/A re-measurements before any speedup was claimed.
- An LTO experiment (issue #470) showed no measurable gain and was declined rather than kept speculatively.
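The same-session A/B/A idea can be written down as a small calculation. The sketch below is one plausible way to score such a comparison and is not taken from the tracking note: the struct, the function name `compare_aba`, and the acceptance rule are assumptions, with the default noise floor set to the ~10% figure quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct AbaResult {
  double speedup;   // mean of the two A timings divided by the B timing
  double drift;     // relative disagreement between the two A runs
  bool   credible;  // true when the change exceeds both drift and noise
};

// a1_ms and a2_ms time the baseline before and after the candidate (b_ms),
// all within one session, so slow machine drift shows up as a1 != a2.
inline AbaResult compare_aba(double a1_ms, double b_ms, double a2_ms,
                             double noise_floor = 0.10) {
  assert(a1_ms > 0.0 && b_ms > 0.0 && a2_ms > 0.0);
  const double a_mean = 0.5 * (a1_ms + a2_ms);
  const double drift = std::fabs(a1_ms - a2_ms) / a_mean;
  const double speedup = a_mean / b_ms;
  // Hypothetical rule: only trust a change larger than drift and noise.
  const double change = std::fabs(speedup - 1.0);
  const bool credible = change > std::max(drift, noise_floor);
  return {speedup, drift, credible};
}
```

With made-up timings, `compare_aba(100.0, 50.0, 100.0)` reports a 2x speedup with zero drift, whereas `compare_aba(100.0, 95.0, 104.0)` reports a change of about 7%, which sits inside the noise floor and would not be claimed.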
Bit-identical optimisations that turned out to be timing-neutral were deliberately kept where they improved clarity (e.g. the pow() dedup), and declined where they added complexity for no measured benefit (LTO). “Faster” was never allowed to mean “different answer” — the FF16 reference baselines gate the whole pass.