Thursday, March 26, 2009

Out of Order: Making In-Order Processors Play Nicely (presented by Allan Murphy)

VMX on the 360 for optimization of vector math

Slower than C counterpart, and out of order (so broken)

Missing out of order logic

  • no instruction reordering
  • no store-forwarding hardware
  • smaller caches, slower memory
  • no L3 cache

 

  • LHS
  • L2 Miss
  • Expensive, non-pipelined instructions
  • Branch mispredict penalty

Load Hit Store

  • Store to memory location, then load, flush the L2 cache
  • Casts, changing register set, aliasing
  • Passing by value, or by reference
  • On PC, instruction reordering and store-forwarding hardware hide this
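
A minimal sketch of the pass-by-reference case above, in C++. The function names are hypothetical and not from the talk. The first version updates its total through a reference, so the compiler generally has to store and reload it on every pass because it might alias the input. The second keeps the running sum in a local and returns it by value.

```cpp
#include <cstddef>

// Hypothetical example. `total` may alias `values`, so each iteration
// typically becomes load total, add, store total: a store followed
// shortly by a load of the same address.
void SumThroughReference(const float* values, std::size_t count, float& total) {
    total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i];
    }
}

// Same arithmetic, but the running sum is a local that can stay in a
// register for the whole loop. Nothing is written to memory inside the loop.
float SumInRegister(const float* values, std::size_t count) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum += values[i];
    }
    return sum;
}
```

Whether the first form actually stalls depends on the compiler and target, so it is worth confirming in a profile or in the disassembly.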

L2 Miss

  • Loading from location, checks cache
  • Cost ~610 cycles to load cache line
  • Hot cold split
  • Reduce in-memory data size
  • Use cache coherent structures
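
One way to picture the hot/cold split, as a sketch only. The particle types and field names are invented for illustration. Fields touched every frame go in a small "hot" struct, rarely used fields go in a parallel "cold" array, and the update loop then pulls fewer cache lines through L2.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout: only what the per-frame loop reads and writes.
struct ParticleHot {
    float x, y, z;
    float vx, vy, vz;
};

// Hypothetical rarely-touched data, kept out of the hot loop's cache lines.
struct ParticleCold {
    char debugName[64];
    unsigned spawnFrame;
    unsigned ownerId;
};

struct ParticleSystem {
    std::vector<ParticleHot> hot;
    std::vector<ParticleCold> cold;  // cold[i] describes hot[i]
};

// Walks only the hot array, so every byte fetched is a byte used.
void Integrate(ParticleSystem& ps, float dt) {
    for (ParticleHot& p : ps.hot) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}
```

The cost is that code needing both halves has to index two arrays, so the split only pays off when the hot loop dominates.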

Expensive Instructions

  • Non-pipelined instructions
  • Stalls hardware threads

Branch Mispredict

  • Mispredicting branch
  • 23-24 cycle delay
  • Know how the compiler implements branches
  • Reduce total branch count for task
  • Refactor calculations to remove branches
  • Unroll
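
A small sketch of "refactor calculations to remove branches", using plain C++ rather than any 360 intrinsic. All names are hypothetical. The first function branches on data, the second folds the comparison into the arithmetic, and `Select` shows the mask trick that vector select instructions perform per lane.

```cpp
#include <cstddef>
#include <cstdint>

// Branchy form: the outcome depends on the data, so it is hard to predict.
std::size_t CountAboveBranchy(const int* v, std::size_t n, int threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] > threshold) {
            ++count;
        }
    }
    return count;
}

// Branch removed from the source: the comparison result (0 or 1) is added.
std::size_t CountAboveBranchless(const int* v, std::size_t n, int threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        count += static_cast<std::size_t>(v[i] > threshold);
    }
    return count;
}

// Returns a when cond is true, otherwise b, using a bit mask instead of a jump.
inline std::uint32_t Select(bool cond, std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);  // all ones or zero
    return (a & mask) | (b & ~mask);
}
```

A compiler may already generate branch-free code for the first form, or reintroduce a branch in the second, which is why the notes say to know how the compiler implements branches.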

Profiling!!!

360 Tools

  • PIX CPU instruction trace
  • LibPMCPB counters
  • XbPerfView sampling capture

Other Platforms

  • SN Tuner, vTune

Think laterally

  • Inline functions
  • pass and return in register (__declspec(passinreg))
  • __restrict (compiler released from being ultra careful about aliasing)
  • const
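
A sketch of what the restrict hint buys, assuming a compiler that accepts the `__restrict` extension (MSVC, GCC and Clang do, but it is not standard C++). The function name is made up for illustration.

```cpp
#include <cstddef>

// Hypothetical example. __restrict is the caller's promise that dst and src
// never overlap, so the compiler does not have to assume a store to dst[i]
// could change a later src[j], and is free to reorder or batch the loads.
void Scale(float* __restrict dst, const float* __restrict src,
           std::size_t count, float k) {
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = src[i] * k;
    }
}
```

The promise is not checked. Calling this with overlapping buffers is undefined behaviour, so it belongs on functions whose callers are under control.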

 

Compiler options

  • Inline
  • Prefer speed over size
  • Fast floating point over precise
  • 360 (/Ou removes div-by-zero checks, /Oc runs a second code scheduling pass)
  • Reduce parameter counts
  • Prefer 32, 64, 128 bit parameters
  • Isolate constants
  • Avoid virtual if feasible

Know your cache architecture

  • Cross core sharing policy (L2 shared, L1 single)
  • Prefetch mech (dcbt, dcbz128)
  • L2 1 MB, L1 32 KB
  • Cache line 128 bytes
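
A rough sketch of prefetching one cache line ahead. It uses `__builtin_prefetch` from GCC and Clang as a stand-in for the PowerPC dcbt instruction, which is an assumption made so the example builds on a desktop compiler. The `PREFETCH` macro and the function name are hypothetical. The 128-byte line size is the figure from the list above.

```cpp
#include <cstddef>

// Stand-in for dcbt: a hint only, compiled to nothing on other compilers.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)0)
#endif

constexpr std::size_t kCacheLineBytes = 128;  // line size from the notes
constexpr std::size_t kFloatsPerLine = kCacheLineBytes / sizeof(float);

// Sums the array one cache line at a time, requesting the next line
// before working through the current one.
float SumWithPrefetch(const float* data, std::size_t count) {
    float sum = 0.0f;
    for (std::size_t base = 0; base < count; base += kFloatsPerLine) {
        const std::size_t next = base + kFloatsPerLine;
        if (next < count) {
            PREFETCH(data + next);  // only hint addresses inside the array
        }
        const std::size_t end = (next < count) ? next : count;
        for (std::size_t i = base; i < end; ++i) {
            sum += data[i];
        }
    }
    return sum;
}
```

How far ahead to prefetch depends on how much work is done per line, so the one-line distance here is a starting point to tune with a profiler.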

Know your instruction set

  • 360 specific (VMX, slow instructions, fsel, vsel, vcmp*, vrlimi)
  • PS3 (AltiVec)
  • PC (SSE2-4.1 and friends)

What went wrong

  • Correctness
  • Guessed at 1 perf issue
  • SIMD vs straight float
  • Memory access and L2 usage unchanged
  • Branch behavior exactly the same

Image Analysis

  • Gaussian Mixture Model
  • Profiling showed 86% of time in the pixel cost function
