VMX on the 360 for optimization of vector math
The VMX version was slower than its C counterpart, and out of order (so broken)
Missing out-of-order logic
- No instruction reordering
- No store-forwarding hardware
- Smaller caches, slower memory
- No L3 cache
- LHS
- L2 Miss
- Expensive, non-pipelined instructions
- Branch mispredict penalty
Load Hit Store
- Store to a memory location, then load from it shortly after: the load has to wait for the store to drain out to the cache
- Casts, changing register set, aliasing
- Passing by value, or by reference
- On PC, instruction reordering and store-forwarding hardware hide this
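
As a rough illustration of the aliasing case above, here is a minimal sketch of a loop that stores and reloads through a pointer on every iteration, next to a version that accumulates in a local. The function names are hypothetical and not from the talk, and whether a given compiler actually emits the store/reload pair depends on its alias analysis.

```cpp
#include <cassert>
#include <cstddef>

// Writes *total on every iteration. Because total may alias values, the
// compiler may have to store and reload it each time round the loop,
// which is the store-then-load pattern behind a load-hit-store.
void sum_through_pointer(const int* values, std::size_t count, int* total) {
    assert(values != nullptr && total != nullptr);
    *total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        *total += values[i];
    }
}

// Accumulates in a local, which can stay in a register, and stores once.
void sum_in_register(const int* values, std::size_t count, int* total) {
    assert(values != nullptr && total != nullptr);
    int local = 0;
    for (std::size_t i = 0; i < count; ++i) {
        local += values[i];
    }
    *total = local;
}
```

Both functions produce the same result; the second simply gives the compiler no reason to touch memory inside the loop.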
L2 Miss
- Loading from location, checks cache
- Cost ~610 cycles to load cache line
- Hot cold split
- Reduce in-memory data size
- Use cache coherent structures
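
A hot/cold split might look something like the sketch below. The particle types and field names are invented for illustration; the only figure taken from these notes is the 128-byte cache line.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rarely touched data: kept out of the cache lines the hot loop pulls in.
struct ParticleCold {
    char debug_name[64];
    int  spawn_frame;
};

// Data touched every frame: small and stored contiguously.
struct ParticleHot {
    float x, y, z;
    float vx, vy, vz;
};

// With 4-byte floats, five hot records fit in one 128-byte cache line.
static_assert(sizeof(ParticleHot) * 5 <= 128, "hot record grew too large");

struct ParticleSystem {
    std::vector<ParticleHot>  hot;   // hot[i] and cold[i] describe the same particle
    std::vector<ParticleCold> cold;
};

// Walks only the hot array, so every line fetched is full of useful data.
void integrate(ParticleSystem& ps, float dt) {
    assert(ps.hot.size() == ps.cold.size());
    for (ParticleHot& p : ps.hot) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}
```

The split also reduces in-memory size for the loop's working set, since the 68 or so cold bytes per particle never enter the cache during integration.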
Expensive Instructions
- Non-pipelined instructions
- Stalls hardware threads
Branch Mispredict
- Mispredicting branch
- 23-24 cycle delay
- Know how the compiler implements branches
- Reduce total branch count for task
- Refactor calculations to remove branches
- Unroll
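
One way to refactor a calculation so it has no branches is to express it as selects. The sketch below uses a portable stand-in with the same semantics as the PowerPC fsel instruction (first operand >= 0 picks the second, otherwise the third). The helper names are hypothetical, and whether the ternary is lowered to a select rather than a branch is up to the compiler and target.

```cpp
#include <cassert>

// Stand-in for fsel: returns a when test >= 0, otherwise b.
// A NaN test falls through to b, matching fsel.
inline float select_ge_zero(float test, float a, float b) {
    return (test >= 0.0f) ? a : b;
}

// Straightforward version: two data-dependent branches.
inline float clamp_branchy(float v, float lo, float hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

// Same result expressed as two selects, with no control flow of its own.
inline float clamp_select(float v, float lo, float hi) {
    assert(lo <= hi);
    v = select_ge_zero(v - lo, v, lo);  // v >= lo ? v : lo
    v = select_ge_zero(hi - v, v, hi);  // v <= hi ? v : hi
    return v;
}
```

Checking the generated assembly is the only way to confirm the branches are really gone, which ties in with knowing how the compiler implements branches.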
Profiling!!!
360 Tools
- PIX CPU instruction trace
- LibPMCPB counters
- XbPerfView sampling capture
Other Platforms
- SN Tuner, VTune
Think laterally
- Inline functions
- Pass and return in registers (__declspec(passinreg))
- __restrict (compiler released from being ultra careful about aliasing)
- const
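
A minimal sketch of the restrict hint follows. __restrict is a compiler extension rather than standard C++, so it is wrapped in a macro here; the NOALIAS macro and the scale_add function are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// __restrict is supported by MSVC, GCC and Clang; fall back to nothing elsewhere.
#if defined(_MSC_VER) || defined(__GNUC__) || defined(__clang__)
#define NOALIAS __restrict
#else
#define NOALIAS
#endif

// Promises the compiler that dst and src never overlap, so it is free to
// keep loaded values in registers and reorder the loads and stores.
inline void scale_add(float* NOALIAS dst, const float* NOALIAS src,
                      float scale, std::size_t count) {
    assert(dst != src);  // the promise is the caller's responsibility
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] += src[i] * scale;
    }
}
```

Passing overlapping buffers to a function declared this way is undefined behavior, so the hint is only safe where the caller can guarantee it.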
Compiler options
- Inline
- Prefer speed over size
- Fast floating point over precise
- 360 (/Ou removes divide-by-zero checks, /Oc runs a second code scheduling pass)
- Reduce parameter counts
- Prefer 32, 64, 128 bit parameters
- Isolate constants
- Avoid virtual if feasible
Know your cache architecture
- Cross core sharing policy (L2 shared, L1 single)
- Prefetch mechanisms (dcbt, dcbz128)
- L2 1 MB, L1 32 KB
- Cache line 128 bytes
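
To make the prefetch idea concrete, here is a sketch that walks an array one 128-byte line at a time and requests the next line before working on the current one. It uses the GCC/Clang __builtin_prefetch builtin as a stand-in for dcbt, with a no-op fallback on other compilers. The PREFETCH macro, the function name, and the one-line-ahead distance are all assumptions; the right distance has to come from profiling.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kCacheLineBytes = 128;  // cache line size from the notes

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)0)  // no-op where the builtin is unavailable
#endif

// Sums the array in cache-line-sized blocks, prefetching one block ahead.
float sum_with_prefetch(const float* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    const std::size_t per_line = kCacheLineBytes / sizeof(float);
    float total = 0.0f;
    for (std::size_t base = 0; base < count; base += per_line) {
        const std::size_t next = base + per_line;
        if (next < count) {
            PREFETCH(data + next);  // hint only; results do not depend on it
        }
        const std::size_t end = (next < count) ? next : count;
        for (std::size_t i = base; i < end; ++i) {
            total += data[i];
        }
    }
    return total;
}
```

For a simple linear walk like this the gain may be small or nil; the pattern matters more when the next address is known early but the access pattern is irregular.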
Know your instruction set
- 360 specific (VMX, slow instructions, fsel, vsel, vcmp*, vrlimi)
- PS3 (AltiVec)
- PC (SSE2-4.1 and friends)
What went wrong
- Correctness
- Guessed at 1 perf issue
- SIMD vs straight float
- Memory access and L2 usage unchanged
- Branch behavior exactly the same
Image Analysis
- Gaussian Mixture Model
- Profiling showed 86% of time in the pixel cost function