Direct X, XNA, etc: Out of Order: Making In-Order Processors Play Nicely (presented by Allan Murphy)

Thursday, March 26, 2009

VMX on the 360 for optimization of vector math

Slower than C counterpart, and out of order (so broken)

Missing out of order logic

no instruction reordering
no store forward hardware
smaller caches, slower memory
no L3 cache

LHS
L2 Miss
Expensive, non pipelined instructions
Branch mispredict penalty

Load Hit Store

Store to a memory location, then load from it shortly after; the load stalls while the store is flushed through to cache
Casts, changing register set, aliasing
Passing by value, or by reference
On PC, instruction reordering and store-forwarding hardware hide this
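A minimal sketch of the pass-by-reference case noted above, assuming a simple summation loop. The function names are hypothetical and not from the talk. In the first version the running total lives in memory, and because `values` and `total` could alias, the compiler has to store and reload it on every iteration, which is the store-then-load pattern that hurts on an in-order core. The second version keeps the total in a local and stores once.

```cpp
#include <cstddef>

// LHS-prone: *total is written and then read back every iteration,
// because values[] might alias *total (both point at float).
void sum_through_pointer(const float* values, std::size_t count, float* total) {
    *total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        *total += values[i];
    }
}

// Friendlier: accumulate in a local, which can stay in a register,
// and write the result to memory a single time at the end.
void sum_in_register(const float* values, std::size_t count, float* total) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        acc += values[i];
    }
    *total = acc;
}
```

Both functions compute the same result; only the memory traffic inside the loop differs.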

L2 Miss

Loading from location, checks cache
Cost ~610 cycles to load cache line
Hot cold split
Reduce in-memory data size
Use cache coherent structures
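One way to picture the hot/cold split is sketched below. The particle types and field names are invented for illustration and are not from the talk. Fields touched every frame go in a small "hot" record, and rarely used fields move to a separate "cold" array, so each cache line fetched during the update carries more useful data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Cold data: rarely read, kept out of the per-frame working set.
struct ParticleCold {
    char name[64];
    std::uint32_t spawn_frame;
    std::uint32_t debug_flags;
};

// Hot data: everything the per-frame update needs, and nothing else.
struct ParticleHot {
    float x, y, z;
    float vx, vy, vz;
    std::uint32_t cold_index;  // index into the cold array, when needed
};

static_assert(sizeof(ParticleHot) <= 32, "hot record should stay small");

// The update walks only the hot array, so no cache lines are spent
// on names or debug flags.
void integrate(std::vector<ParticleHot>& hot, float dt) {
    for (std::size_t i = 0; i < hot.size(); ++i) {
        hot[i].x += hot[i].vx * dt;
        hot[i].y += hot[i].vy * dt;
        hot[i].z += hot[i].vz * dt;
    }
}
```

With the cold fields folded back in, each record would be about 100 bytes and roughly one particle would fit per 128-byte line; split out, several hot records share each line.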

Expensive Instructions

non pipelined instructions
Stalls hardware threads

Branch Mispredict

Mispredicting branch
23-24 cycle delay
Know how the compiler implements branches
Reduce total branch count for task
Refactor calculations to remove branches
Unroll
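As a rough illustration of refactoring a calculation to remove branches, the sketch below replaces an if/else with a mask-based select and a conditional increment with adding the comparison result. The function names are hypothetical, and whether the generated code is actually branch-free depends on the compiler and target, so check the disassembly.

```cpp
#include <cstddef>
#include <cstdint>

// Returns a when test < 0, otherwise b, without an if/else in the source.
std::int32_t select_if_negative(std::int32_t test, std::int32_t a, std::int32_t b) {
    // mask is all ones (-1) when test < 0, and all zeros otherwise.
    const std::int32_t mask = -static_cast<std::int32_t>(test < 0);
    return (a & mask) | (b & ~mask);
}

// Counts elements above a threshold by adding the comparison result
// (0 or 1) instead of branching around an increment.
std::size_t count_above(const std::int32_t* v, std::size_t n, std::int32_t threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        count += static_cast<std::size_t>(v[i] > threshold);
    }
    return count;
}
```

The same select idea applies to floats and vectors on hardware that provides select instructions.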

Profiling!!!

360 Tools

PIX CPU instruction trace
LibPMCPB counters
XbPerfView sampling capture

Other Platforms

SN Tuner, VTune

Think laterally

Inline functions
pass and return in register (__declspec(passinreg))
__restrict (compiler released from being ultra careful about aliasing)
const
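A small sketch of the restrict point, assuming the `__restrict` spelling that MSVC, GCC and Clang all accept in C++. The function itself is a made-up example. By promising that the three arrays do not overlap, the caller lets the compiler keep values in registers and schedule loads ahead of stores instead of reloading after every write.

```cpp
#include <cstddef>

// out, a and b are promised not to overlap. Breaking that promise is
// undefined behaviour, so the caller is responsible for keeping it.
void scale_add(float* __restrict out,
               const float* __restrict a,
               const float* __restrict b,
               float k,
               std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * k + b[i];
    }
}
```

Without the qualifier the compiler must assume a write to `out[i]` could change `a[i + 1]` or `b[i + 1]`, which limits reordering and unrolling.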

Compiler options

Inline
Prefer speed over size
Fast floating point over precise
360 (/Ou removes the divide-by-zero check, /Oc runs a second code scheduling pass)
Reduce parameter counts
Prefer 32, 64, 128 bit parameters
Isolate constants
Avoid virtual if feasible

Know your cache architecture

Cross core sharing policy (L2 shared, L1 single)
Prefetch mech (dcbt, dcbz128)
L2 1 MB, L1 32 KB
Cache line 128 byte
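The figures above turn into some handy arithmetic. The sketch below uses only the sizes from the notes (128-byte lines, 32 KB L1, 1 MB L2). The helper and type names are hypothetical. It works out how many lines each cache holds, how many lines a buffer touches, and shows one way to keep a small block from straddling two lines.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 128;           // bytes per cache line
constexpr std::size_t kL1Bytes   = 32 * 1024;     // L1 data cache
constexpr std::size_t kL2Bytes   = 1024 * 1024;   // shared L2

constexpr std::size_t kL1Lines = kL1Bytes / kCacheLine;  // 256 lines
constexpr std::size_t kL2Lines = kL2Bytes / kCacheLine;  // 8192 lines

// Number of cache lines covered by `size` bytes starting at `address`.
constexpr std::size_t lines_touched(std::uintptr_t address, std::size_t size) {
    return size == 0
        ? 0
        : static_cast<std::size_t>((address + size - 1) / kCacheLine
                                   - address / kCacheLine + 1);
}

// Aligning a block to the line size guarantees it never straddles two lines.
struct alignas(kCacheLine) MatrixBlock {
    float m[16];  // 64 bytes of payload, padded out to one full line
};

static_assert(sizeof(MatrixBlock) == kCacheLine, "one block per cache line");
```

A 16-byte read at offset 120 touches two lines, while the same read at an aligned address touches one, which matters when each miss costs hundreds of cycles.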

Know your instruction set

360 specific (VMX, slow instructions, fsel, vsel, vcmp*, vrlimi)
PS3 (AltiVec)
PC (SSE2-4.1 and friends)

What went wrong

Correctness
Guessed at 1 perf issue
SIMD vs straight float
Memory access and L2 usage unchanged
Branch behavior exactly the same

Image Analysis

Gaussian Mixture Model
Profiling showed 86% of time in the pixel cost function
