Optimising the FF16 hot paths

ff16

performance

develop

engineering

A profiling-driven pass over the FF16 solver: ~2.7-3.5x faster on the benchmark scenarios, achieved entirely through bit-identical C++ changes guided by Rprof and native sampling.

Published

June 18, 2026

plant @develop f8dbc4c

This post summarises a profiling-driven optimisation pass over the FF16 solver (PR #471). The work was guided by Rprof() plus native /usr/bin/sample profiles of the run_plant_benchmarks() FF16 scenarios. The governing constraint was that every change had to be verified bit-identical against HEAD, or have its trajectory shift quantified, before it was kept. The headline result is a speedup of roughly 2.7-3.5x on the FF16 benchmarks, with model output checked against the reference baselines rather than assumed unchanged.
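To make "bit-identical" concrete, here is a minimal sketch of a bitwise comparison between a reference trajectory and a candidate one. The helper names (bits_of, first_bit_mismatch) are hypothetical and are not taken from the plant codebase; the actual gate in the PR may well be implemented differently, for example on the R side.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reinterpret a double as its raw 64-bit pattern. memcpy is the portable
// way to do this without violating strict aliasing.
inline std::uint64_t bits_of(double x) {
  static_assert(sizeof(double) == sizeof(std::uint64_t),
                "sketch assumes 64-bit doubles");
  std::uint64_t u;
  std::memcpy(&u, &x, sizeof u);
  return u;
}

// Hypothetical helper: index of the first element whose bit pattern
// differs between the two vectors, or -1 if every element matches.
inline std::ptrdiff_t first_bit_mismatch(const std::vector<double>& reference,
                                         const std::vector<double>& candidate) {
  assert(reference.size() == candidate.size());
  for (std::size_t i = 0; i < reference.size(); ++i) {
    if (bits_of(reference[i]) != bits_of(candidate[i])) {
      return static_cast<std::ptrdiff_t>(i);
    }
  }
  return -1;
}
```

A bitwise test is deliberately stricter than a tolerance check: it flags a one-ulp difference, and it also distinguishes 0.0 from -0.0, which operator== would treat as equal.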

Headline numbers

Measured as one-iteration bench::mark() timings (FF16 scenario):

Stage	`scm`	`build_schedule`
Original baseline (`4b5c43d1`)	242 ms	656 ms
After the optimisation pass	~85 ms	~186 ms
Speedup	~2.7x	~3.5x

The gains accumulated across ~15 individual changes, each profiled and verified in turn rather than landed as one big rewrite.

Where the time was going

Native profiling pointed at a handful of hot regions, all on the per-node, per-step physiology path:

Crown assimilation quadrature — std::function callable overhead and redundant canopy-shape work dominated the integrand.
Spline / interpolator lookups — repeated, bounds-checked evaluations and per-point binary searches during assimilation.
Competition / environment recomputation — inverse-height competition ratios recomputed every call.
Gradient calculation — physiology copied and recomputed via a per-call Internals allocation.
String/map lookups — state and aux variables addressed by name on the hot path.
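As an illustration of the first item in the list, the sketch below contrasts a type-erased integrand with a templated one. It uses a plain trapezoid rule as a stand-in; integrate_erased and integrate_direct are hypothetical names, and the solver's real quadrature rule and signatures are not shown in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Type-erased version: every f(x) goes through std::function's indirect
// call, which the compiler usually cannot inline into the loop.
inline double integrate_erased(const std::function<double(double)>& f,
                               double a, double b, std::size_t n) {
  assert(n > 0);
  const double h = (b - a) / static_cast<double>(n);
  double sum = 0.5 * (f(a) + f(b));
  for (std::size_t i = 1; i < n; ++i) {
    sum += f(a + h * static_cast<double>(i));
  }
  return sum * h;
}

// Templated version: the lambda's concrete type is visible at the call
// site, so the integrand can be inlined. The arithmetic and its order
// are unchanged, which is what keeps the result reproducible.
template <typename F>
inline double integrate_direct(F&& f, double a, double b, std::size_t n) {
  assert(n > 0);
  const double h = (b - a) / static_cast<double>(n);
  double sum = 0.5 * (f(a) + f(b));
  for (std::size_t i = 1; i < n; ++i) {
    sum += f(a + h * static_cast<double>(i));
  }
  return sum * h;
}
```

Because both versions perform the same operations in the same order, they would normally agree bit for bit. Compiler settings such as floating-point contraction can still change that once the integrand is inlined, which is one reason to verify each change rather than assume it.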

How it was done

Each row below groups related changes by target; every individual change was profiled and verified bit-identical on its own. Where machine drift between sessions made cumulative ratios unreliable, the figure was re-measured as a same-session A/B/A comparison (documented in the tracking note).

Target	Change	Effect
Assimilation quadrature	Direct-lambda integrand (drop `std::function`); eta-specialised canopy shape (issue #465); ratio-based `q()` path	Largest single contributor — ~1.4-1.5x from the canopy-shape specialisation alone
Spline lookups	Inline interpolator accessors + unchecked eval; hoist `spline.max()` out of the integrand; O(1)-guess + nudge lookup replacing per-point binary search	Source of much of the ~3.5x on `build_schedule`
Competition	Cache the inverse-height competition ratio; dependent-aux height-inverse cache	~1.7x cumulative on competition-bound paths
Gradients	Reuse a thread-local scratch `Individual` (no per-call `Internals` alloc)	Removed per-gradient heap churn
State/rate addressing	Direct hot-path state/aux indices instead of string/map lookups	~1.9x on the indexed paths
Allocation derivatives	Dedup `pow()` and reuse `area_sapwood`/`mass_sapwood` from `net_mass_production_dt`	Bit-identical, timing-neutral hygiene; kept for clarity
FF16 strategy	Inline competition helpers + `area_leaf` into `ff16_strategy.h`; inline `util::is_finite` into a header	~1.2-1.3x from removing call overhead on the tightest loops
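The "O(1)-guess + nudge" entry in the spline row can be sketched as follows. This is one plausible reading, assuming strictly increasing knots and a guess taken from linear position within the knot range. The function names are hypothetical, and the real interpolator may seed its guess differently (for example from the previous lookup).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: binary search for the interval i with xs[i] <= x < xs[i+1],
// clamped so that x == xs.back() maps to the last interval.
inline std::size_t interval_bsearch(const std::vector<double>& xs, double x) {
  assert(xs.size() >= 2 && x >= xs.front() && x <= xs.back());
  const auto it = std::upper_bound(xs.begin(), xs.end(), x);
  const std::size_t i = static_cast<std::size_t>(it - xs.begin()) - 1;
  return std::min(i, xs.size() - 2);
}

// Guess + nudge: estimate the interval from the relative position of x,
// then step left or right until the bracketing interval is found.
inline std::size_t interval_guess_nudge(const std::vector<double>& xs, double x) {
  assert(xs.size() >= 2 && x >= xs.front() && x <= xs.back());
  const std::size_t last = xs.size() - 2;  // index of the last interval
  const double frac = (x - xs.front()) / (xs.back() - xs.front());
  std::size_t i = std::min(
      static_cast<std::size_t>(frac * static_cast<double>(last + 1)), last);
  while (i > 0 && x < xs[i]) --i;            // guess was too far right
  while (i < last && x >= xs[i + 1]) ++i;    // guess was too far left
  return i;
}
```

For near-uniform knots the guess lands on or next to the right interval, so the nudge loops run zero or one step. For heavily clustered knots the nudge can degrade towards a linear scan, so whether this beats a binary search depends on the knot layout and is worth profiling.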

Notes on measurement discipline

A recurring theme — and the reason the tracking note is so long — is that these runs are short (hundreds of ms) with a ~10% run-to-run noise floor, and the macOS /usr/bin/sample profiler perturbs Rprof totals when it attaches successfully. Several apparent “regressions” were measurement artefacts:

A sampler-effect audit isolated the timing inflation caused by concurrent sampling; for small changes, the figures taken with sample switched off are the ones to trust.
Cross-session comparisons were re-grounded with same-session A/B/A re-measurements before any speedup was claimed.
An LTO experiment (issue #470) showed no measurable gain and was declined rather than kept speculatively.
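The same-session A/B/A idea from the list above can be sketched as a small decision rule: time the baseline, then the candidate, then the baseline again, and only trust the speedup if the two baseline runs agree. The struct, the function names, and the acceptance rule are assumptions made for illustration; the 10% default mirrors the noise floor quoted above, and the tracking note may use different criteria.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a set of timings (taken by value so the caller's data is untouched).
inline double median(std::vector<double> v) {
  assert(!v.empty());
  std::sort(v.begin(), v.end());
  const std::size_t n = v.size();
  return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

struct AbaResult {
  double drift;    // relative disagreement between the two baseline runs
  double speedup;  // mean baseline median divided by candidate median
  bool credible;   // baselines agree and the change exceeds the noise floor
};

// Hypothetical rule: accept a speedup claim only when the bracketing
// baseline runs agree to within noise_floor and the measured change is
// larger than noise_floor.
inline AbaResult compare_aba(const std::vector<double>& a_before,
                             const std::vector<double>& b,
                             const std::vector<double>& a_after,
                             double noise_floor = 0.10) {
  const double a1 = median(a_before);
  const double a2 = median(a_after);
  const double a_mean = 0.5 * (a1 + a2);
  const double drift = std::fabs(a1 - a2) / a_mean;
  const double speedup = a_mean / median(b);
  const bool credible =
      drift <= noise_floor && std::fabs(speedup - 1.0) > noise_floor;
  return {drift, speedup, credible};
}
```

The second baseline run is what separates a real gain from machine drift: if the baseline itself moved by more than the noise floor during the session, the comparison is discarded instead of being reported as a speedup or a regression.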

Note

Bit-identical optimisations that turned out to be timing-neutral were deliberately kept where they improved clarity (e.g. the pow() dedup), and declined where they added complexity for no measured benefit (LTO). “Faster” was never allowed to mean “different answer” — the FF16 reference baselines gate the whole pass.