Thursday, March 26, 2009

The PlayStation 3’s SPUs in the Real World (presented by Michiel van der Leeuw)

  • Things they did on the SPUs (post-mortem)
  • What worked and didn’t work
  • Practical advice
  • Food for thought

3 years, ~120-person team with 27 programmers

  • Cinematic
  • Dense
  • Realistic
  • Intense


  • 6 x 3.2 GHz processors
  • Local mem per SPU
  • Very fast DMA


  • Core Requirements
    • Animation
    • AI
    • Skinning
    • Physics
    • Compression/Decompression
    • etc.


  • Light probe sampling
  • ~2500 static light probes per level
  • 9x3 spherical harmonics (9 coefficients x 3 color channels) stored in a KD-tree
  • Sample lighting: blend the 4 closest light probes, rotate into view space
  • bake lights into level
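The probe lookup above can be sketched roughly like this. The notes don't say how the 4 probes are weighted, so inverse-distance weighting is an assumption, a brute-force sort stands in for the KD-tree query, and the rotation into view space is left out. `blend_light_probes` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
SHCoeffs = List[List[float]]  # 9 coefficients (SH bands 0-2), each an RGB triple


def blend_light_probes(pos: Vec3,
                       probes: Sequence[Tuple[Vec3, SHCoeffs]],
                       k: int = 4) -> SHCoeffs:
    """Blend the k nearest probes' SH coefficients with inverse-distance weights."""
    if not probes:
        raise ValueError("no light probes to sample")
    # Brute-force nearest-neighbour search; a KD-tree query would replace this sort.
    nearest = sorted(probes, key=lambda p: math.dist(pos, p[0]))[:k]
    weights = []
    for probe_pos, probe_sh in nearest:
        d = math.dist(pos, probe_pos)
        if d < 1e-6:
            # Sample point sits on a probe: use that probe's lighting directly.
            return [list(rgb) for rgb in probe_sh]
        weights.append(1.0 / d)
    total = sum(weights)
    blended = [[0.0, 0.0, 0.0] for _ in range(9)]
    for w, (_, probe_sh) in zip(weights, nearest):
        for i in range(9):
            for c in range(3):
                blended[i][c] += (w / total) * probe_sh[i][c]
    return blended
```

Blending SH coefficients linearly is cheap because the result is still a valid set of 9x3 coefficients, which can then be rotated and evaluated per pixel or vertex.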

Particle simulation

  • 250 particle systems per frame
  • 150 drawn
  • 3000 particles updated
  • 200 collision ray casts
  • System has grown over time

Done on SPU

  • Vertex generation
  • Particle simulation inner loop
  • Initialization & deletion of particles
  • High-level management / glue
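A rough sketch of what a particle inner loop with initialization and deletion might look like. The structure-of-arrays layout, swap-remove deletion, gravity-only integration and the `ParticleBatch` name are all assumptions for illustration, not details from the talk.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

GRAVITY = -9.81  # assumed units: metres per second squared, y is up

Vec3 = Tuple[float, float, float]


@dataclass
class ParticleBatch:
    # Structure-of-arrays layout: one dense list per attribute, the kind of
    # layout that streams well through a small local store.
    x: List[float] = field(default_factory=list)
    y: List[float] = field(default_factory=list)
    z: List[float] = field(default_factory=list)
    vx: List[float] = field(default_factory=list)
    vy: List[float] = field(default_factory=list)
    vz: List[float] = field(default_factory=list)
    age: List[float] = field(default_factory=list)
    life: List[float] = field(default_factory=list)

    def _arrays(self):
        return (self.x, self.y, self.z, self.vx, self.vy, self.vz,
                self.age, self.life)

    def __len__(self) -> int:
        return len(self.age)

    def spawn(self, pos: Vec3, vel: Vec3, life: float) -> None:
        """Initialize a particle by appending one value to every attribute array."""
        for arr, value in zip(self._arrays(), (*pos, *vel, 0.0, life)):
            arr.append(value)

    def _kill(self, i: int) -> None:
        """Swap-remove: move the last particle into slot i so arrays stay dense."""
        for arr in self._arrays():
            arr[i] = arr[-1]
            arr.pop()

    def update(self, dt: float) -> None:
        """Inner loop: age, cull and integrate every live particle."""
        i = 0
        while i < len(self):
            self.age[i] += dt
            if self.age[i] >= self.life[i]:
                self._kill(i)
                continue  # slot i now holds a different particle; revisit it
            self.vy[i] += GRAVITY * dt
            self.x[i] += self.vx[i] * dt
            self.y[i] += self.vy[i] * dt
            self.z[i] += self.vz[i] * dt
            i += 1
```

Swap-remove keeps the arrays packed, so the inner loop never has to skip dead slots.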

Not done on SPU

  • Updating the global scene graph
  • Starting & stopping sounds

Image Post Processing

Effects done on SPU

  • Motion blur
  • Depth of field
  • Bloom

SPUs assist the RSX with post-processing

  • RSX prepares low-res image buffers
  • RSX triggers an interrupt to start the SPUs
  • SPUs perform image operations
  • RSX already starts the next frame
  • SPU results are consumed by the RSX early in the next frame
  • Similar to PhyreEngine now
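The one-frame hand-off above can be modelled roughly like this. A thread pool stands in for the SPUs, `submit` stands in for the interrupt, and `rsx_prepare_lowres` / `spu_post_process` are hypothetical placeholders; the point is only that frame N picks up the SPU output kicked off during frame N-1.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional, Tuple


def rsx_prepare_lowres(frame: int) -> List[float]:
    """Hypothetical stand-in for the RSX writing a low-res buffer for `frame`."""
    return [float(frame)] * 4


def spu_post_process(buf: List[float]) -> List[float]:
    """Hypothetical stand-in for SPU image work (bloom, DoF, ...)."""
    return [v * 2.0 for v in buf]


def run_frames(num_frames: int) -> List[Tuple[int, List[float]]]:
    """Return (frame, post-processed buffer consumed during that frame)."""
    consumed = []
    pending: Optional[Future] = None
    with ThreadPoolExecutor(max_workers=1) as spus:
        for frame in range(num_frames):
            # Early in this frame the RSX picks up the previous frame's SPU result.
            if pending is not None:
                consumed.append((frame, pending.result()))
            buf = rsx_prepare_lowres(frame)
            # Kick the SPU job and keep going without waiting for it.
            pending = spus.submit(spu_post_process, buf)
    # The last frame's job completes at executor shutdown but is not consumed here.
    return consumed
```

For example, `run_frames(3)` gives `[(1, [0.0, 0.0, 0.0, 0.0]), (2, [2.0, 2.0, 2.0, 2.0])]`: each frame uses the previous frame's SPU work, which is the one frame of latency that lets the RSX and SPUs run in parallel.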

SPUs are compute-bound

  • Bandwidth is no issue
  • Code can be optimized

Our trade-off: RSX vs. SPU time

  • SPUs take longer
  • SPUs look better
  • RSX was the bottleneck

Bloom and Lens Reflection

  • 13% on one SPU
  • Depth-dependent intensity response curve
  • 7x7 Gaussian blur
  • Upscaling results from different levels
  • Internal lens reflection
  • Result buffer
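A 7x7 Gaussian blur is usually done as two separable 1D passes, 14 taps per pixel instead of 49. A minimal sketch, assuming sigma = 1.5 and clamp-to-edge sampling (neither is given in the notes); the response curve, upscaling and lens reflection steps are not shown.

```python
import math
from typing import List

Image = List[List[float]]


def gaussian_kernel(radius: int = 3, sigma: float = 1.5) -> List[float]:
    """Normalized 1D Gaussian weights; radius 3 gives the 7 taps of a 7x7 blur."""
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]


def blur_rows(img: Image, kernel: List[float]) -> Image:
    """Convolve every row with `kernel`, clamping reads at the image border."""
    r = len(kernel) // 2
    width = len(img[0])
    out = []
    for row in img:
        new_row = []
        for x in range(width):
            acc = 0.0
            for i, k in enumerate(kernel):
                xx = min(max(x + i - r, 0), width - 1)
                acc += k * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def gaussian_blur_7x7(img: Image, sigma: float = 1.5) -> Image:
    """Separable 7x7 Gaussian: a horizontal pass, then a vertical pass."""
    k = gaussian_kernel(3, sigma)
    horizontal = blur_rows(img, k)
    return transpose(blur_rows(transpose(horizontal), k))
```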

Waypoint cover maps –> depth map

IBL Sampling

SPU work cost a lot of development time

The code is future-proof: it scales to more cores and supports the features they require.

The future is memory-local and excessively parallel

SPUs are just one of these ‘new architectures’

Optimize for the concept

Keep code portable

Parallelization of code takes time

Treat CPU as cluster

Think in workloads / jobs
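Treating the CPU as a cluster and thinking in jobs can be sketched like this: split the data into independent chunks, copy each chunk into a private buffer (standing in for a DMA into local store), process it, and write back a disjoint range. `run_jobs`, `CHUNK` and the thread pool are hypothetical; with Python threads this shows the structure only, not SPU-style speedups.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

CHUNK = 256   # hypothetical items per job, standing in for a local-store budget
WORKERS = 6   # one worker per available SPU


def run_jobs(data: List[float], kernel: Callable[[float], float],
             workers: int = WORKERS, chunk: int = CHUNK) -> List[float]:
    """Split `data` into independent jobs and run them on a worker pool."""
    out = [0.0] * len(data)

    def job(start: int) -> None:
        local = data[start:start + chunk]          # "DMA in": private copy of the chunk
        result = [kernel(v) for v in local]        # compute only on local data
        out[start:start + len(result)] = result    # "DMA out": write back a disjoint range

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any exception from a job
        list(pool.map(job, range(0, len(data), chunk)))
    return out
```

Because each job touches only its own chunk, the same code runs unchanged on 2, 6 or more workers, which is the portability argument from the talk.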

Build latency into algorithms

Don’t optimize too early
