4-Phase autoresearch campaign: CPU infrastructure → NURBS kernels → Algorithmic limits → GPU offload
Systematic exploration from infrastructure tuning to GPU offload, revealing both wins and fundamental limits.
9 CAD models benchmarked across all phases. Median of 5 measured runs (1 warmup), checksums verified.
Compiler flags, memory allocators, parallelization, and profile-guided optimization. 5 experiments, 2 winners stacked to 11.4%.
Hypothesis: Aggressive compiler optimizations should improve NURBS math performance.
Why it failed: GCC 11.5 already optimizes these kernels well at -O2. -O3 grows code size (I-cache pressure) without corresponding vectorization gains, because the NURBS kernels' irregular memory patterns limit auto-vectorization. -flto lengthened link time and hurt runtime through further code expansion.
How: Zero code changes — just LD_PRELOAD. jemalloc's thread-local caching and size-class bins handle SMLib's frequent small allocations (SmPolyVertex, SmPolyEdge, SmPolyFace, SmTArray resizes) far better than glibc's ptmalloc2.
Largest per-model improvements: geolux -9.3%, platte -6.9%, high_heel_shoe -5.4%
How: Three-step build: instrument → train on representative workload (225 .gcda profiles) → rebuild with profile data. PGO optimizes branch prediction in N_BasisFindSpan (binary search) and NURBS subdivision paths.
Combined with jemalloc: Orthogonal improvements stack almost fully. PGO fixes code layout and branch prediction; jemalloc fixes allocation overhead.
| Experiment | Tess (ms) | Change | Read (ms) | Status |
|---|---|---|---|---|
| Baseline (-O2) | 895.5 | — | 225.3 | Baseline |
| -O3 -march=native -ffast-math -flto | 918.9 | +2.6% | 229.6 | Failed |
| jemalloc (LD_PRELOAD) | 840.0 | -6.2% | 221.0 | Win |
| tcmalloc (LD_PRELOAD) | 844.3 | -5.7% | 220.3 | Win |
| Parallel (9 threads) | — | -57% wall time | — | Win |
| PGO alone | 835.7 | -6.7% | 216.2 | Win |
| PGO + jemalloc | 793.8 | -11.4% | 213.7 | Best |
The biggest wins came from orthogonal approaches: jemalloc addresses memory-system bottlenecks while PGO addresses branch prediction and code layout. Standard compiler flags (-O3, -flto) did not help because the bottleneck is memory access patterns and branch misprediction, not instruction selection.
Code-level optimization of the NURBS evaluation hot path. 5 experiments delivering 13.4% additional speedup (23.8% cumulative).
Vectorized three hot paths using vmulq_f64 (2-wide double), vfmaq_f64 (fused multiply-add), and broadcast+multiply for derivative outer products.
Why small gains: Typical cubic NURBS have degree 3, so inner loops are only 4 iterations. NEON's 2-wide doubles give limited benefit at this trip count.
Replaced the custom smos_Sqrt(a) macro (calling sm_sqrt) with sm_sqrt_fast() that emits ARM's hardware FSQRT instruction directly. Eliminates function call overhead across ~299 call sites.
Added __builtin_prefetch to all three branches of SmOwningTopology::GetAll(). Prefetches the next linked-list node before processing the current one, hiding memory latency.
The big winner. Enabled the existing CreateUVTrimCurvesParallel() code path via SM_USE_TBB. UV trim curve computation per face is independent and safely parallelizable.
Race condition found: Full TBB parallelism (face cache + UV trim) caused checksum mismatches. CreateFaceTessCache shares mutable state that cannot be made thread-safe without major refactoring. Only the UV trim path was kept.
Added QuadBBoxDistSq() helper for early rejection in FindPolygonTriangleGuess(). Skips expensive triangle-point distance computation when the control point quad's AABB is already farther than the current best distance.
| Experiment | Tess (ms) | Change | Status |
|---|---|---|---|
| P2 Baseline (PGO+jemalloc) | 787.2 | — | Baseline |
| NEON SIMD basis eval | 783.7 | -0.44% | Win |
| Inline hardware sqrt | 770.6 | -2.10% | Win |
| Prefetch topology traversal | 777.2 | -1.27% | Win |
| TBB full parallel | 403.1 | -48.8% (wrong checksums) | Rejected |
| TBB UV trim only | 681.0 | -13.48% | Win |
| BBox prune (clean) | 770.5 | -2.12% | Win |
| Combined (all 5) | 681.9 | -13.37% | Best |
Seven experiments targeted the remaining hotspots. None produced reproducible improvements. Critical finding: PGO build-to-build variance dominates small optimizations.
PGO builds exhibit ~1.5% variance between builds due to non-deterministic profile data collection. Any optimization delivering less than ~3% improvement cannot be reliably distinguished from noise. All Phase 3 targets were individually <3% of CPU time.
| Experiment | Tess (ms) | Initial Result | Recheck | Status |
|---|---|---|---|---|
| TBB face-count threshold (>30) | 681.6 | -0.05% | — | Noise |
| N_BasisFindSpan O(1) uniform knot | 681.1 | -0.13% | — | Noise |
| AngleBetween: atan2 replacement | 672.6 | -1.36% | +2.0% regression | PGO Artifact |
| SubdivideToTolerances lazy metric | 1700.2 | +149% | — | Failed |
| Memory arena for tess temporaries | — | Already exists (SmArenaMemBlockMgr) | — | Skipped |
| Row-level AABB pruning | 673.0 | -1.30% | Regression | PGO Artifact |
| atan2 in sm_ComputeNetConstants | 675.0 | -1.02% | Regression | PGO Artifact |
| Run | Tess (ms) | Notes |
|---|---|---|
| P2 combined baseline | 681.920 | Phase 2 report value |
| P3 baseline recheck (same code) | 672.249 | -1.4% from PGO build variance alone |
| P3 exp3 recheck (atan2) | 685.995 | +2.0% regression vs recheck |
| P3 combined (all "winners") | 687.506 | +2.3% regression vs recheck |
| P3 combined-367 r2 (repeat) | 686.811 | +2.2% regression vs recheck |
Phase 3 proves that the remaining CPU hotspots (N_SrfBasisDerivs 10.8%, N_ComputeDerivativeMatrix 7.7%) are tight numerical loops where the computation itself is irreducible. There is no algorithmic shortcut to bypass core NURBS evaluation. This motivates the GPU offload in Phase 4.
CUDA kernels for NURBS B-spline surface evaluation on Blackwell GPU (sm_100, 152 SMs). 29-295x kernel speedup on multi-face models, with machine-epsilon FP64 accuracy.
| Model | Faces | Eval Points | CPU (ms) | GPU FP64 (ms) | Speedup FP64 | GPU FP32 (ms) | Speedup FP32 |
|---|---|---|---|---|---|---|---|
| sphere | 1 | 1,024 | 0.04 | 0.067 | 0.7× | 0.053 | 0.8× |
| torus | 1 | 1,024 | 0.04 | 0.073 | 0.6× | 0.055 | 0.8× |
| mouse | 4 | 4,096 | 0.14 | 0.080 | 1.8× | 0.055 | 2.6× |
| rot | 6 | 6,144 | 0.39 | 0.055 | 7.1× | 0.056 | 6.9× |
| offsetbox | 11 | 11,264 | 0.22 | 0.070 | 3.2× | 0.058 | 3.8× |
| blade_hub | 46 | 47,104 | 1.77 | 0.087 | 20.3× | 0.060 | 29.8× |
| platte | 122 | 124,928 | 12.69 | 0.106 | 119.4× | 0.065 | 193.8× |
| high_heel_shoe | 411 | 420,864 | 5.16 | 0.180 | 28.6× | 0.085 | 60.8× |
| geolux | 238 | 243,712 | 20.35 | 0.135 | 151.1× | 0.069 | 295.3× |
| Model | CPU Deriv (ms) | GPU Deriv (ms) | Speedup |
|---|---|---|---|
| sphere | 0.18 | 0.085 | 2.1× |
| torus | 0.18 | 0.086 | 2.1× |
| mouse | 0.36 | 0.087 | 4.1× |
| rot | 0.35 | 0.075 | 4.7× |
| offsetbox | 0.81 | 0.087 | 9.3× |
| blade_hub | 6.69 | 0.122 | 54.7× |
| platte | 10.07 | 0.156 | 64.4× |
| high_heel_shoe | 12.18 | 0.311 | 39.2× |
| geolux | 16.59 | 0.215 | 77.3× |
| Model | FP64 Max Error | FP64 Avg Error | FP32 Max Error | FP32 Avg Error |
|---|---|---|---|---|
| sphere | 2.23e-15 | 5.83e-16 | 1.91e-06 | 5.93e-07 |
| torus | 7.38e-15 | 1.16e-15 | 5.40e-06 | 1.58e-06 |
| blade_hub | 1.91e-13 | 3.14e-14 | 1.10e-04 | 2.01e-05 |
| platte | 1.72e-13 | 1.61e-14 | 9.94e-05 | 9.42e-06 |
| high_heel_shoe | 6.11e-14 | 6.99e-15 | 3.84e-05 | 4.45e-06 |
| geolux | 6.33e-14 | 6.46e-15 | 2.37e-05 | 3.87e-06 |
FP64 errors sit at machine-epsilon level, i.e. agreement with the CPU evaluator to within roundoff across all models.
Well within tessellation tolerance (chord height = 0.1). FP32 is viable for tessellation with 2x throughput gain.
GPU becomes faster than CPU at ~3,000 evaluation points (~3 surfaces at 32×32 grid).
18 experiments across 4 phases. Every experiment documented — wins, failures, and noise — for full research credibility.
| Experiment | Phase | Tess (ms) | Change vs Original | Status |
|---|---|---|---|---|
| Original Baseline (-O2) | 0 | 895.5 | — | Baseline |
| -O3 -march=native -ffast-math -flto | 1 | 918.9 | +2.6% | Failed |
| jemalloc (LD_PRELOAD) | 1 | 840.0 | -6.2% | Win |
| tcmalloc (LD_PRELOAD) | 1 | 844.3 | -5.7% | Win |
| PGO alone | 1 | 835.7 | -6.7% | Win |
| PGO + jemalloc | 1 | 793.8 | -11.4% | Best P1 |
| NEON SIMD basis eval | 2 | 783.7 | -12.5% | Win |
| Inline hardware sqrt | 2 | 770.6 | -13.9% | Win |
| Prefetch topology traversal | 2 | 777.2 | -13.2% | Win |
| TBB full parallel (wrong results) | 2 | 403.1 | — | Rejected |
| TBB UV trim only | 2 | 681.0 | -24.0% | Win |
| AABB bbox prune | 2 | 770.5 | -14.0% | Win |
| Phase 2 Combined | 2 | 681.9 | -23.8% | Best CPU |
| TBB face-count threshold | 3 | 681.6 | -0.05% | Noise |
| Uniform knot fast path | 3 | 681.1 | -0.13% | Noise |
| AngleBetween atan2 | 3 | 672.6 | PGO artifact | Failed |
| Lazy metric evaluation | 3 | 1700.2 | +149% | Failed |
| Row-level AABB pruning | 3 | 673.0 | PGO artifact | Failed |
| atan2 in sm_ComputeNetConstants | 3 | 675.0 | PGO artifact | Failed |
| GPU NURBS eval (peak) | 4 | 0.069ms | 295× kernel | GPU Win |
What we learned about optimizing a 970K-line C++ solid modeling library on ARM.
PGO + jemalloc: zero code changes required. Just build system configuration and an LD_PRELOAD.
The existing CreateUVTrimCurvesParallel() was already written but disabled. Enabling safe parallelism was the biggest single win.
Optimizations below the ~3% noise floor (set by ~1.5% PGO build variance) are unmeasurable. Phase 3 proved this conclusively with 7 failed experiments.
NURBS evaluation is embarrassingly parallel. Blackwell FP64 delivers 151x speedup with machine-epsilon accuracy.
The combined -O3 -march=native -ffast-math -flto build regressed. The bottleneck was memory access patterns and branch misprediction, not instruction selection.
Full TBB parallelism gave 48.8% speedup but corrupted results. Correctness verification (checksums) caught it.