Direct X, XNA, etc

Thursday, May 13, 2010

Silverlight on Windows Phone 7 – Performance Tuning

Recently, I have been reviewing some of the content from MIX10, specifically pertaining to the talks on the new Windows Phone. I came upon the talk from Seema Ramchandani (http://live.visitmix.com/MIX10/Sessions/CL60). This was full of good tips on performance tools and tuning for Silverlight on the phone. I wanted to share the items that stood out to me.

First, splitting work between a UI thread (which runs on the phone CPU) and a render thread (which feeds the GPU) is key to a smooth, rich interface. The UI thread should be reserved primarily for input, with animations and other calculations handled on the render thread.

UI Virtualization

For instance, the ListBox shows 6 rows by default (with the default styles). What is hidden is that the 5 rows above and the 5 rows below are also in the visual tree. These are not recreated as the UI is scrolled but simply reused.

The Loop of all Loops (render loop)

A timer triggers the render loop every 33 milliseconds and attempts to draw. First we check for property changes (i.e. movement). Then the visual tree is walked twice: the first pass measures “how big” everything is, and the second pass arranges (and clips) the items to be rendered. Next we queue up the rendering changes. Only one back buffer is rasterized per frame. And then the buffer is rendered.

Tools

Enable frame rate counter (in main)
Enable redraw regions (to see what is being redrawn in different colors)

We should only be drawing the changes, and the redraw regions flag will show you what this is. Lots of colors is bad :).

Threads

Render thread
UI thread
Child threads

Rasterization
Media Decoding

Less is more – keep the app lean for best performance (goes without saying)
Limit activity on UI thread
Leverage render thread
Leverage GPU
Debug, Debug, Debug

Active input on the device can take 15-25% of the device CPU! Thus, we need the render thread to keep the application responding in real time.

Quick way to check if we are using the GPU. Enable the frame rate counter. If this is rendered when you run the application, then the GPU is being used.

Use CacheMode="BitmapCache" to force bitmap caching for GPU optimization. Certain items will cache automatically (e.g. double animations).

Order is important in the XAML: keep the cached bitmaps close together (fewer textures), which will increase GPU performance.

Media and effects in the same frame == bad!

Max texture size on desktop and phone is 2048 in any direction.

60 fps in the CTP, but will be limited to 30 fps max on release build (to keep battery life as high as possible).

Thursday, January 21, 2010

Global Game Jam

Had a friend mention the Global Game Jam to me today. I was still able to register for the New York City location. The event takes place between January 29th and 31st. The NYC location currently has 66 jammers registered. I will post updates as this takes place.

Global Game Jam Site

Saturday, May 9, 2009

Project: GPU replacement

Recently, my otherwise perfect XPS laptop experienced a GPU failure. It was time for me to upgrade my laptop anyway, so I went ahead and got a 1730 to replace this 1710. I then contacted a Dell reseller in CA for a replacement GPU (these things are $680 from Dell) and was able to secure a new one for $400. This is an Nvidia 7950 GTX, the largest GPU available for the XPS M1710. I received the new GPU a few days ago and installed it today. It required pretty much stripping the laptop, but all is working fine now. Pics can be found here. BTW, the 1730 I purchased has dual 8800 GTX (for a grand total of 1GB video memory) with SLI. It runs well. :)

Thursday, March 26, 2009

The Beauty of Destruction (Pete Isensee, Microsoft)

C++ Destructor Definition

One
Special
Deterministic –> called at well defined times
Automatic –> object out of scope or delete
Symmetric –> constructor fits
Member –> part of a class
Function
With
- A special name (~)
- No parameters
- No return type
Designed to
- Give last rites
- Before object death

C# uses finalizers differently: they are called by the GC and are non-deterministic. The same is true in Java.

When destructors are called

Global or static object, called when the program terminates
Array elements destructed in reverse order
STL container, unspecified order
delete operator
out of scope
temp objects
exception thrown (stack unwinding)
explicitly
exit()
abort (does not call destructor)

Order of destruction

Rule of thumb: Reverse order of construction
Specifically
- Destructor body
- Data members in reverse order of declaration
- Direct non-virtual base classes in reverse order
- Virtual base classes in reverse order
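
To make the ordering concrete, here is a small sketch of my own (not from the talk). The names `Tracer`, `Base`, `Widget` and the global `g_log` are hypothetical; they only exist to record the order in which destructors run for one object with a base class and two data members.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical global log so the order of destructor calls can be inspected.
static std::vector<std::string> g_log;

struct Tracer {
    std::string name;
    explicit Tracer(std::string n) : name(std::move(n)) {}
    ~Tracer() { g_log.push_back("~" + name); }
};

struct Base {
    ~Base() { g_log.push_back("~Base"); }
};

struct Widget : Base {
    Tracer first{"first"};    // declared first, destroyed last among members
    Tracer second{"second"};  // declared second, destroyed first among members
    ~Widget() { g_log.push_back("~Widget body"); }
};

// Builds and destroys one Widget, then returns the recorded order.
std::vector<std::string> destruction_order() {
    g_log.clear();
    { Widget w; }  // w goes out of scope here
    return g_log;
}
```

Calling `destruction_order()` should give the destructor body first, then `second`, then `first`, then the base class, which is the reverse of construction.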

Implicit Destructors

not specified by programmer
inline by default
public
recommended for struct-like POD-only objects
for everything else, avoid implicit destructors
- better debugging
- improved perf analysis

Trivial Destructors

Implicit
Not virtual
All direct base classes have trivial dtors
All non-static members have trivial dtors
Destructors that never do anything

Virtual Destructors

Guarantee that derived classes get cleaned up
Rule of thumb: if class has virtual functions, dtor should be virtual
- if delete on Base* could ever point to a Derived*
Perf: Obj with any virtual funcs includes a vtable ptr
Idiom exceptions: mixin classes
Pure virtual dtor signals an abstract class (virtual ~T() = 0; the dtor still needs a definition)
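
A minimal sketch of both points, with hypothetical class names (`Shape`, `Square`, `Marker`) that are mine and not from the slides. The first half shows a derived destructor being reached through a base pointer; the second shows a pure virtual destructor that still has to be defined because derived destructors call it.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

static int g_derived_dtor_calls = 0;

struct Shape {
    virtual ~Shape() = default;  // virtual: delete via Shape* reaches the derived dtor
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    ~Square() override { ++g_derived_dtor_calls; }
    double area() const override { return side * side; }
};

// Abstract class whose only pure virtual function is its destructor.
struct Marker {
    virtual ~Marker() = 0;
};
Marker::~Marker() {}  // still needs a definition; derived dtors call it

struct Concrete : Marker {};

static_assert(std::is_abstract<Marker>::value, "pure virtual dtor makes Marker abstract");

// Destroys a Square through a Shape pointer and reports how often ~Square ran.
int destroy_through_base() {
    g_derived_dtor_calls = 0;
    std::unique_ptr<Shape> s = std::make_unique<Square>(2.0);
    s.reset();  // deletes through Shape*
    return g_derived_dtor_calls;
}
```

If `~Shape` were not virtual, deleting a `Square` through a `Shape*` would be undefined behavior, which is the case the rule of thumb protects against.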

Partial Construction & Destruction

Dtors are only called for fully constructed objects
if a ctor throws, obj was not fully constructed
- obj dtor will not be called
- but fully constructed subobjects will be destroyed
Always use RAII with ctors
- Resource Acquisition Is Initialization
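
Here is a sketch of what happens when a constructor throws. `Resource` and `Loader` are hypothetical names of mine; the event log is only there so the sequence can be checked.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

static std::vector<std::string> g_events;

// RAII member: acquires in the ctor, releases in the dtor.
struct Resource {
    std::string name;
    explicit Resource(std::string n) : name(std::move(n)) {
        g_events.push_back("acquire " + name);
    }
    ~Resource() { g_events.push_back("release " + name); }
};

struct Loader {
    Resource file{"file"};
    Resource buffer{"buffer"};
    Loader() {
        // Both members are fully constructed at this point; Loader is not.
        throw std::runtime_error("ctor failed");
    }
    ~Loader() { g_events.push_back("~Loader"); }  // never runs in this example
};

std::vector<std::string> partial_construction() {
    g_events.clear();
    try {
        Loader l;
    } catch (const std::runtime_error&) {
        g_events.push_back("caught");
    }
    return g_events;
}
```

The members are released in reverse order before the exception reaches the handler, and `~Loader` never appears in the log. Had `Loader` held raw pointers instead of RAII members, nothing would have cleaned them up.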

Virtual Functions in Destructors

Virtual functions are not virtual inside dtors
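
A quick illustration, using hypothetical `Animal` and `Dog` classes of my own: by the time the base destructor runs, the derived part of the object is already gone, so the call resolves to the base version.

```cpp
#include <cassert>
#include <string>

static std::string g_last_name;

struct Animal {
    virtual ~Animal() { g_last_name = name(); }  // resolves to Animal::name here
    virtual std::string name() const { return "Animal"; }
};

struct Dog : Animal {
    std::string name() const override { return "Dog"; }
};

// Destroys a Dog and returns what the base destructor saw from name().
std::string name_seen_in_base_dtor() {
    g_last_name.clear();
    { Dog d; }
    return g_last_name;
}
```

`name_seen_in_base_dtor()` returns "Animal", even though the object was a `Dog` for its whole lifetime.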

C++ Exception Handling

Destructors : Exceptions :: Spock : Kirk
Wrap any function that acquires a resource in a class where dtor releases the resource
Never allow an exception to exit a dtor
- Best: don’t throw in dtor
- OK: wrap throwing code in a try/catch
Good advice even if you don’t use C++EH
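
The "OK" option above might look something like this sketch. `FileGuard` and `flush_or_throw` are hypothetical names; the flush step stands in for any cleanup that can fail. The point is only that the try/catch keeps the failure inside the destructor, so an exception already in flight is not disturbed.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>

// Hypothetical cleanup step that can fail.
inline void flush_or_throw(bool fail) {
    if (fail) throw std::runtime_error("flush failed");
}

class FileGuard {
public:
    FileGuard(std::FILE* f, bool fail_on_flush) : f_(f), fail_(fail_on_flush) {}
    FileGuard(const FileGuard&) = delete;
    FileGuard& operator=(const FileGuard&) = delete;
    ~FileGuard() {
        try {
            flush_or_throw(fail_);
        } catch (...) {
            ++swallowed;  // record and continue; nothing escapes the dtor
        }
        if (f_) std::fclose(f_);  // the resource is released either way
    }
    static int swallowed;

private:
    std::FILE* f_;
    bool fail_;
};
int FileGuard::swallowed = 0;

// Throws while a guard is alive; returns how many dtor failures were contained.
int guarded_scope() {
    FileGuard::swallowed = 0;
    try {
        FileGuard g(std::tmpfile(), true);
        throw std::logic_error("work failed");  // unwinding runs ~FileGuard
    } catch (const std::logic_error&) {
        // the original exception arrives intact
    }
    return FileGuard::swallowed;
}
```

If the destructor let its own exception escape during unwinding, the program would call `std::terminate`, which is why the slides say never to allow it.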

Multithreading

You are responsible for protecting objects and their contents
Sharing an object across threads
- Use shared_ptr
- or some other reference counting
- or otherwise ensure only one thread can destroy
Protect shared memory (global counters, ref counts) in dtor

delete and Destructors

delete p is a two-step process

Explicit Destructors

Destructors can be called directly
Avoid 99.9% of the time
Very powerful for custom memory scenarios
Examples
- w / placement new
- STL allocators
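
For the placement new case, a minimal sketch (the `Particle` type and the live counter are hypothetical). A normal `delete p` runs the destructor and then frees the memory; with placement new the storage is managed separately, so only the destructor is called by hand.

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

static int g_live = 0;

struct Particle {
    std::string tag;
    explicit Particle(std::string t) : tag(std::move(t)) { ++g_live; }
    ~Particle() { --g_live; }
};

// Returns {live objects while constructed, live objects after the dtor call}.
std::pair<int, int> placement_roundtrip() {
    g_live = 0;
    alignas(Particle) unsigned char storage[sizeof(Particle)];
    Particle* p = new (storage) Particle("spark");  // construct in caller-owned storage
    int live_during = g_live;
    p->~Particle();  // explicit destructor call; storage itself is not freed
    return {live_during, g_live};
}
```

Here the storage is a stack buffer, so there is nothing to free afterwards. In a custom pool, the block would go back to the pool at this point.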

std::allocator

Allocators enable custom STL container memory
Two key destructive functions

shared_ptr

Templated non-intrusive deterministically referenced-counted smart pointer

shared_ptr deleters

Deleter : a functor called on the stored raw pointer when ref count hits zero
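
A small sketch of a deleter in use. `Handle`, `open_handle` and `close_handle` are a hypothetical C-style API I made up for illustration; only `std::shared_ptr` itself is real.

```cpp
#include <cassert>
#include <memory>

static int g_closed = 0;

struct Handle { int id; };

// Hypothetical C-style API standing in for a real resource library.
Handle* open_handle(int id) { return new Handle{id}; }
void close_handle(Handle* h) { ++g_closed; delete h; }

// Returns how many times the deleter ran once all owners are gone.
int deleter_demo() {
    g_closed = 0;
    {
        std::shared_ptr<Handle> a(open_handle(7), close_handle);
        std::shared_ptr<Handle> b = a;  // ref count is now 2
        assert(a.use_count() == 2);
        a.reset();                      // count drops to 1; deleter not run yet
        assert(g_closed == 0);
    }                                   // b destroyed, count hits zero
    return g_closed;
}
```

The deleter runs exactly once, when the last owner goes away, and it receives the stored raw pointer.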

Performance

Destructors are called a LOT
they are invisible in code
streamline common dtors
the best dtor is empty
inlining
profile

The Rendering Technology of KillZone 2 (Michal Valient)

How we made Killzone 2 run @ 30FPS

Deferred shading
Diet for render targets
Dirty lighting tricks
Rendering, memory and SPUs

Deferred shading

not forward rendering
Geometry pass – fill the GBuffer (all material info for lighting)
loading depth map, normal / bump map, albedo (diffuse color and texture), shininess (reflective materials)
Lighting pass – accumulate info (only light, no textures)

GBuffer

RGBA FP16 buffers proved to be too much
Moved to RGBA8
- 4xRGBA8 + D24S8 – 18.4 MB
- 2xMSAA (Quincunx) – 36.8 MB
Memory reused by later rendering stages
- Low res pass, post processing, HUD
View space position computed from depth buffer
Normal.z = sqrt(1 - Normal.x^2 - Normal.y^2)
- No neg z, but does not cause problems
- 2xFP16 compressed to RGBA8 on write
Motion vectors – screen space
Albedo – material diffuse color
Roughness – specular exponent in log range
Specular intensity – single channel only
Sun Shadow – pre-rendered sun shadows (offline light map)
- Mixed with real-time sun shadows
Lighting accumulation buffer (LAB)
- Geometry pass fills in indirect lighting terms
  - Stored in lightmaps and IBLs
  - Adds ambient color, scene reflections
- Lighting pass adds contribution of each light
Glow – contains HDR luminance of LAB
- Used to reconstruct HDR RGB for bloom
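
The memory figures and the normal trick above can be checked with a little arithmetic. This is my own sketch, assuming a 1280x720 target and treating the four RGBA8 targets plus the D24S8 depth/stencil surface as five 32-bit surfaces; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bytes for `targets` 32-bit surfaces at width x height with `samples` per pixel.
constexpr std::uint64_t gbuffer_bytes(std::uint64_t width, std::uint64_t height,
                                      std::uint64_t targets, std::uint64_t samples) {
    return width * height * targets * samples * 4;
}

// Rebuilds z from a unit normal's x and y, assuming z is non-negative.
inline float reconstruct_normal_z(float x, float y) {
    float zz = 1.0f - x * x - y * y;
    return std::sqrt(zz > 0.0f ? zz : 0.0f);  // clamp guards against rounding error
}
```

`gbuffer_bytes(1280, 720, 5, 1)` comes to 18,432,000 bytes, and doubling the samples for 2xMSAA gives 36,864,000 bytes, which lines up with the 18.4 MB and 36.8 MB on the slide. Storing only x and y of the normal works because the sign of z is assumed positive in view space.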

Lighting pass

Most expensive pass
- 100+ dynamic lights per frame
- 10+ shadow casting lights per frame
- AA means more of everything
Optimization
- Avoid hard work
- Work less for MSAA
- Precompute sun shadow offline
- Approximate

Avoid hard work

Don’t run shaders
- Use early z/stencil cull unit
- Depth bounds test is the new cool
- Enable conditional rendering
Optimized light shaders
- For each combination of light features
Fade out shadows for small lights
Remove small objects from shadow map

Lighting pass and MSAA

MSAA facts
- Each sample has to be lit
- Samples of non-edge pixel are equal
KZ2 solution – in shader supersampling
- Run at 1280x720 not 2560x720
- Light two samples in one go

Shadow map filtering distribution

Motivation
- Define filtering quality per pixel rather than per sample.
Split filter coordinates into disjoint sets
- One set per pixel sample
MSAA is almost as fast as non-MSAA

Sunlight

Fullscreen directional light
We divide screen into depth slices
Each depth slice is lit separately
- Different shadow properties
- Used depth bounds test
Use sun shadow from GBuffer
- Stencil mark pixels completely in shadow
  - Skip expensive sunlight shader
- Also mixed with real-time shadows

Sunlight rendering – Fake MSAA

Used only in distant pixels
Cut down lighting cost
- Run lighting equation on closest sample only
Is this wrong?
- It's a hack
- Works correctly against background
- The edges are still partially anti-aliased
- Distant scenery is heavily post processed

Sunlight – shadow map rendering

Generate shadow map for each depth slice
Common approach
- Align shadow map to view direction
- Pros – max shadow map usage
- Result – shadow map shimmering
Fix

GPU driven memory allocation system

Push Buffer building

Multiple SPUs building PB in parallel
Additional SPUs generating data
- Skinning, particles – VB
- IBL interpolation – textures
Common solutions
- Ring Buffering
  - Issue with out of order allocations
- Double Buffering
  - Too much memory

KZ2 render memory allocator

Fixed mem pool
- 22MB block – split into 256k blocks
Each block has associated AllocationID
- Specified by client during allocation
- Only whole block can be allocated
Global FreeID identifies free blocks
- Updated as RSX consumes ‘Free’ marker
Lockless, out of order, memory allocation
- From PPU and/or SPU
- Simple table walk (fast!)
Allows immediate memory reuse
- We generate push-buffer just in time for RSX
- Block can be reused right after RSX consumption
Can allocate memory for skinning early…
- and still free at correct point in frame
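
To get my head around the scheme, I sketched a toy version. This is heavily simplified and mostly guesswork: it is single-threaded (the real one is lockless across PPU and SPUs), and the rule "a block is free when its AllocationID is less than or equal to the global FreeID" is my assumption about how the table walk decides. `BlockPool` and its members are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

class BlockPool {
public:
    static constexpr std::size_t kBlockSize = 256 * 1024;
    static constexpr std::size_t kBlockCount = 88;  // 22 MB / 256 KB
    static constexpr int kNoBlock = -1;

    // Tags one free block with `allocation_id` and returns its index.
    // IDs are assumed to be greater than the current FreeID.
    int allocate(std::uint32_t allocation_id) {
        for (std::size_t i = 0; i < kBlockCount; ++i) {
            if (ids_[i] <= free_id_) {  // assumed rule: id <= FreeID means free
                ids_[i] = allocation_id;
                return static_cast<int>(i);
            }
        }
        return kNoBlock;  // pool exhausted
    }

    // Simulates the GPU consuming a 'Free' marker for `allocation_id`.
    void consume_free_marker(std::uint32_t allocation_id) {
        if (allocation_id > free_id_) free_id_ = allocation_id;
    }

private:
    std::array<std::uint32_t, kBlockCount> ids_{};  // 0 = never allocated
    std::uint32_t free_id_ = 0;
};
```

With this rule, bumping the FreeID releases every block tagged at or below it in one step, and the next `allocate` can reuse those blocks right away, which matches the "immediate memory reuse" bullet at least in spirit.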

Direct3D 11 Tessellation Deep Dive (presented by Matt Lee)

High fidelity characters seem a bit out of reach of real-time apps (games).

10 to 30K chars not out of reach for 360/PS3

Striving for Cinematic Quality Characters

think in terms of triangles currently

Catmull-Clark subdivision surfaces

Industry standard subdivision surface scheme

Modern implementations don’t require too smooth

Direct3D11

Realtime rendering of Catmull-Clark
3 new pipeline stages
Hardware design removes bandwidth bottlenecks from current implementations
Better use of multi-core processors and improved shader management

(dynamic shader linking)

Direct3D11 Pipeline

Hull Shader
Tessellator
Domain Shader

Tessellation Data Flow

Hull shader – executed per patch
Tessellator – executed per patch (fixed-function), generates triangles
Domain shader – executed per tessellated vertex

Hull Shader

patch control points (input)
output to Domain Shader
two phases per patch (control points, patch constant)
patch constant output to tessellator (modifies behavior of tessellator)
control points go to domain shader

Tessellator

state from D3D API
input from Patch constant phase
generates tessellated triangles
out to later stages

Domain Shader

Hull Shader and Tessellator output
Smooth surface evaluated
one vertex
Control points are in GPU (saves bandwidth)
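
The domain shader's job of evaluating the smooth surface at one vertex can be sketched on the CPU. This is not HLSL and not the SDK sample; it is a plain C++ illustration of my own, assuming a bicubic Bezier patch with 4x4 control points. `Vec3`, `bernstein3` and `eval_patch` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Cubic Bernstein basis weights for parameter t in [0, 1].
inline std::array<float, 4> bernstein3(float t) {
    float s = 1.0f - t;
    return {{s * s * s, 3.0f * s * s * t, 3.0f * s * t * t, t * t * t}};
}

// Evaluates a bicubic Bezier patch (4x4 control points) at (u, v).
inline Vec3 eval_patch(const std::array<std::array<Vec3, 4>, 4>& cp, float u, float v) {
    std::array<float, 4> bu = bernstein3(u);
    std::array<float, 4> bv = bernstein3(v);
    Vec3 p{0.0f, 0.0f, 0.0f};
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float w = bu[i] * bv[j];  // tensor-product weight of control point (i, j)
            p.x += w * cp[i][j].x;
            p.y += w * cp[i][j].y;
            p.z += w * cp[i][j].z;
        }
    }
    return p;
}
```

In the real pipeline the tessellator supplies the (u, v) for each generated vertex and the hull shader supplies the control points, so this function is roughly what runs once per tessellated vertex.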

Where to use

LOD of terrain
Bezier patches from higher-order surfaces (Catmull-Clark)

Catmull-Clark

Baked into content early
Goal is real time
Disadvantage (have to build offline and huge memory issues)

Loop-Schaefer approximation (D3D11), others exist

Benefits

Content creation easier
Save memory
Easier LOD (doesn’t require sep meshes) so no use for MIP maps?

Pipeline

offline Load control mesh
offline compute adjacency for each quad
offline compute texture tangent space for each vertex
rt Morph & skin the quad mesh in the VS
rt convert quad mesh into patches in hull shader
rt Evaluate patches using domain shader
rt apply displacement map

tangent patches (fix up surface normals); extraordinary vertex (valence < 4 or > 4)

Available in March 2009 DX SDK (today), next release in June 09

SubD11 sample

Optimization will be performed when hardware is finalized

Current shader design is not expected to perform well on hardware.

Out of Order: Making In-Order Processors Play Nicely (presented by Allan Murphy)

VMX on the 360 for optimization of vector math

Slower than C counterpart, and out of order (so broken)

Missing out of order logic

no instruction reordering
no store forward hardware
smaller caches, slower memory
no l3 cache

LHS
L2 Miss
Expensive, non pipelined instructions
Branch mispredict penalty

Load Hit Store

Store to memory location, then load, flush the L2 cache
Casts, changing register set, aliasing
Passing by value, or by reference
On PC, instruction reordering and store-forwarding hardware hide this
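
One pattern that fits the aliasing bullet is accumulating through a pointer inside a loop. This is my own example, not one from the talk, and whether the first version actually stalls depends on the compiler and target; `Stats` and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

struct Stats { float total = 0.0f; };

// Writes through `out` every iteration. Because `v` might alias `out->total`,
// the compiler may store and reload it each pass, the pattern that risks a
// load-hit-store on an in-order core.
void sum_through_pointer(const float* v, std::size_t n, Stats* out) {
    out->total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) out->total += v[i];
}

// Accumulates in a local (likely a register) and stores once at the end.
void sum_in_local(const float* v, std::size_t n, Stats* out) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += v[i];
    out->total = total;
}
```

Both functions produce the same sum; the second simply gives the compiler no reason to round-trip through memory inside the loop.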

L2 Miss

Loading from location, checks cache
Cost ~610 cycles to load cache line
Hot cold split
Reduce in-memory data size
Use cache coherent structures
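
A hot/cold split might look like this sketch. The `Enemy*` types are hypothetical; the idea is that the per-frame loop only touches the small hot struct, so more useful data arrives with each 128-byte cache line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Cold data: touched rarely (tools, debugging), kept out of the hot array.
struct EnemyCold {
    std::string debug_name;
    int spawn_id;
};

// Hot data: read every frame, packed so many entries share a cache line.
struct EnemyHot {
    float x, y, z;
    float health;
};

static_assert(sizeof(EnemyHot) == 16, "eight EnemyHot fit in a 128-byte line");

struct Enemies {
    std::vector<EnemyHot> hot;    // iterated in the update loop
    std::vector<EnemyCold> cold;  // same index, separate memory
};

// Per-frame style query that reads only the hot array.
std::size_t count_alive(const Enemies& e) {
    std::size_t alive = 0;
    for (const EnemyHot& h : e.hot) alive += (h.health > 0.0f) ? 1 : 0;
    return alive;
}
```

With the name and spawn data stored inline, each enemy would span far more bytes and the same loop would pull in cache lines full of data it never reads.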

Expensive Instructions

non pipelined instructions
Stalls hardware threads

Branch Mispredict

Mispredicting branch
23-24 cycle delay
Know how the compiler implements branches
Reduce total branch count for task
Refactor calculations to remove branches
Unroll
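
As an example of refactoring a calculation to remove branches, here is a clamp written both ways. This is a generic mask-select sketch of my own (in the spirit of instructions like fsel, but plain integer C++); the function names are hypothetical, and a good compiler may already turn the branchy version into something similar.

```cpp
#include <cassert>
#include <cstdint>

// Branching version: each compare feeds a conditional jump.
inline std::int32_t clamp_branchy(std::int32_t v, std::int32_t lo, std::int32_t hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

// Mask-based select: returns a when cond is true, b otherwise, with no jump.
inline std::int32_t select_i32(bool cond, std::int32_t a, std::int32_t b) {
    std::int32_t mask = -static_cast<std::int32_t>(cond);  // all ones or zero
    return (a & mask) | (b & ~mask);
}

// Same result as clamp_branchy, built from two selects.
inline std::int32_t clamp_branchless(std::int32_t v, std::int32_t lo, std::int32_t hi) {
    v = select_i32(v < lo, lo, v);
    return select_i32(v > hi, hi, v);
}
```

As the Profiling heading says, measure before and after; the branchless form does more arithmetic, so it only wins when the branch was actually being mispredicted.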

Profiling!!!

360 Tools

PIX CPU instruction trace
LibPMCPB counters
XbPerfView sampling capture

Other Platforms

SN Tuner, VTune

Think laterally

Inline functions
pass and return in register (__declspec(passinreg))
__restrict (compiler released from being ultra careful)
const

Compiler options

Inline
Prefer speed over size
Fast floating point over precise
360 (/Ou removing div by zero, /Oc runs a second code scheduling pass)
Reduce parameter counts
Prefer 32, 64, 128 bit parameters
Isolate constants
Avoid virtual if feasible

Know your cache architecture

Cross core sharing policy (L2 shared, L1 single)
Prefetch mech (dcbt, dcbz128)
L2 1 MB, L1 32 KB
Cache line 128 bytes

Know your instruction set

360 specific (VMX, slow instructions, fsel, vsel, vcmp*, vrlimi)
PS3 (altivec)
PC (SSE2-4.1 and friends)

What went wrong

Correctness
Guessed at 1 perf issue
SIMD vs straight float
Memory access and L2 usage unchanged
Branch behavior exactly the same

Image Analysis

Gaussian Mixture Model
Profiling showed (86% time in pixel cost function)