Recently, my otherwise perfect XPS laptop experienced a GPU failure. It was time for me to upgrade my laptop anyway, so I went ahead and got a 1730 to replace this 1710. I then contacted a Dell reseller in CA for a replacement GPU (these things are $680 from Dell) and was able to secure a new one for $400. It is an Nvidia 7950 GTX, the largest GPU available for the XPS M1710. I received the new GPU a few days ago and installed it today. It required pretty much stripping the laptop, but all is working fine now. Pics can be found here. BTW, the 1730 I purchased has dual 8800 GTX GPUs (for a grand total of 1GB video memory) with SLI. It runs well. :)
Saturday, May 9, 2009
Thursday, March 26, 2009
The Beauty of Destruction (Pete Isensee, Microsoft)
C++ Destructor Definition
- One
- Special
- Deterministic –> called at well defined times
- Automatic –> object out of scope or delete
- Symmetric –> constructor fits
- Member –> part of a class
- Function
- With
- A special name (~)
- No parameters
- No return type
- Designed to
- Give last rites
- Before object death
C# uses finalizers instead: they are called by the GC, so they are non-deterministic. Same in Java.
When destructors are called
- Global or static object, called when the program terminates
- Array elements are destroyed in reverse order
- STL container elements, unspecified order
- delete operator
- out of scope
- temp objects
- exception thrown (stack unwinding)
- explicitly
- exit()
- abort (does not call destructor)
Order of destruction
- Rule of thumb: Reverse order of construction
- Specifically
- Destructor body
- Data members in reverse order of declaration
- Direct non-virtual base classes in reverse order
- Virtual base classes in reverse order
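The ordering rules above can be observed directly. Here is a minimal sketch, assuming hypothetical `Tracer`, `Base` and `Derived` types (none of them are from the talk), that records each destructor as it runs:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Records the order in which destructors run (hypothetical helper).
std::vector<std::string>& dtor_log() {
    static std::vector<std::string> entries;
    return entries;
}

struct Tracer {
    std::string name;
    explicit Tracer(std::string n) : name(std::move(n)) {}
    ~Tracer() { dtor_log().push_back(name); }
};

struct Base {
    Tracer baseMember{"Base::member"};
    ~Base() { dtor_log().push_back("~Base body"); }
};

struct Derived : Base {
    Tracer first{"Derived::first"};
    Tracer second{"Derived::second"};
    ~Derived() { dtor_log().push_back("~Derived body"); }
};

// Destroys one Derived object and returns the recorded order.
std::vector<std::string> destruction_order() {
    dtor_log().clear();
    { Derived d; }  // d goes out of scope here
    return dtor_log();
}
```

The log should read: destructor body, members in reverse declaration order, then the base class (its body, then its members).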
Implicit Destructors
- not specified by programmer
- inline by default
- public
- recommended for struct-like POD-only objects
- for everything else, avoid implicit destructors
- better debugging
- improved perf analysis
Trivial Destructors
- Implicit
- Not virtual
- All direct base classes have trivial dtors
- All non-static members have trivial dtors
- Destructors that never do anything
Virtual Destructors
- Guarantee that derived classes get cleaned up
- Rule of thumb: if class has virtual functions, dtor should be virtual
- if delete on Base* could ever point to a Derived*
- Perf: Obj with any virtual funcs includes a vtable ptr
- Idiom exceptions: mixin classes
- Pure virtual dtor signals an abstract class (virtual ~T() = 0, but it still needs a body)
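A small sketch of the "delete on Base*" case, assuming hypothetical `Shape` and `Triangle` classes. Because `~Shape` is virtual, deleting through the base pointer also runs the derived destructor:

```cpp
#include <cassert>

struct Shape {
    virtual ~Shape() {}              // virtual: derived dtors run on delete via Shape*
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    int* destroyed;                  // counter owned by the caller
    explicit Triangle(int* counter) : destroyed(counter) {}
    ~Triangle() override { ++*destroyed; }
    int sides() const override { return 3; }
};

// Returns how many times ~Triangle ran when deleting through a Shape*.
int derived_dtor_runs_via_base_pointer() {
    int destroyed = 0;
    Shape* s = new Triangle(&destroyed);
    delete s;
    return destroyed;
}
```

Without the `virtual` on `~Shape`, the same `delete s` would be undefined behavior.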
Partial Construction & Destruction
- Dtors are only called for fully constructed objects
- if a ctor throws, obj was not fully constructed
- obj dtor will not be called
- but fully constructed subobjects will be destroyed
- Always use RAII with ctors
- Resource Acquisition Is Initialization
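To illustrate the partial construction rule, here is a sketch with hypothetical `Resource` and `Widget` types and two global counters added only for observation. The constructor of `Widget` throws, so `~Widget` never runs, but the already constructed `Resource` member is still released:

```cpp
#include <cassert>
#include <stdexcept>

static int g_released = 0;           // times ~Resource ran
static int g_widget_dtor_calls = 0;  // times ~Widget ran

struct Resource {                    // RAII wrapper: release happens in the dtor
    Resource() {}
    ~Resource() { ++g_released; }
};

struct Widget {
    Resource res;                    // fully constructed before the ctor body runs
    Widget() { throw std::runtime_error("ctor failed"); }
    ~Widget() { ++g_widget_dtor_calls; }
};

// Returns true if the exception from the Widget ctor was caught.
bool construct_and_fail() {
    try {
        Widget w;
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

This is why raw resources acquired in a constructor body leak on a throw, while RAII members do not.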
Virtual Functions in Destructors
- Virtual functions are not virtual inside dtors
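A short sketch of that rule, with hypothetical `Base` and `Derived` classes. By the time `~Base` runs, the `Derived` part is already gone, so the call to `name()` resolves to `Base::name`:

```cpp
#include <cassert>
#include <string>

struct Base {
    std::string* out;
    explicit Base(std::string* o) : out(o) {}
    virtual ~Base() { *out += name(); }           // resolves to Base::name here
    virtual const char* name() const { return "Base"; }
};

struct Derived : Base {
    explicit Derived(std::string* o) : Base(o) {}
    ~Derived() override { *out += name(); *out += ","; }  // still Derived::name
    const char* name() const override { return "Derived"; }
};

// Returns the names each destructor saw, in call order.
std::string names_seen_in_dtors() {
    std::string seen;
    { Derived d(&seen); }
    return seen;
}
```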
C++ Exception Handling
- Destructors : Exceptions :: Spock : Kirk
- Wrap any function that acquires a resource in a class where dtor releases the resource
- Never allow an exception to exit a dtor
- Best: don’t throw in dtor
- OK: wrap throwing code in a try/catch
- Good advice even if you don’t use C++EH
Multithreading
- You are responsible for protecting objects and their contents
- Sharing an object across threads
- Use shared_ptr
- or some other reference counting
- or otherwise ensure only one thread can destroy
- Protect shared memory (global counters, ref counts) in dtor
delete and Destructors
- delete p is a two-step process
Explicit Destructors
- Destructors can be called directly
- Avoid 99.9% of the time
- Very powerful for custom memory scenarios
- Examples
- with placement new
- STL allocators
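As a sketch of the placement new case (the `Particle` type and its counter are hypothetical): `delete p` normally runs the destructor and then frees the memory, so when the memory is owned by someone else, only the first of those two steps is performed, by calling the destructor explicitly.

```cpp
#include <cassert>
#include <new>

struct Particle {
    int* live;                       // count of live particles, owned by the caller
    explicit Particle(int* counter) : live(counter) { ++*live; }
    ~Particle() { --*live; }
};

// Constructs and destroys a Particle in a stack buffer; true if both steps ran.
bool placement_roundtrip() {
    int live = 0;
    alignas(Particle) unsigned char storage[sizeof(Particle)];

    Particle* p = new (storage) Particle(&live);  // construct in caller-owned memory
    const bool constructed = (live == 1);

    p->~Particle();                  // explicit dtor call; no delete, nothing to free
    return constructed && live == 0;
}
```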
std::allocator
- Allocators enable custom STL container memory
- Two key destructive functions
shared_ptr
- Templated, non-intrusive, deterministically reference-counted smart pointer
shared_ptr deleters
- Deleter : a functor called on the stored raw pointer when ref count hits zero
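A minimal sketch of a custom deleter, using `std::shared_ptr` from C++11 (the talk predates it and most likely referred to the TR1 or Boost version). The lambda and the counter are illustrative only:

```cpp
#include <cassert>
#include <memory>

// Returns how many times the deleter ran once the last owner went away,
// or -1 if it ran too early.
int deleter_calls_after_last_owner() {
    int calls = 0;
    {
        std::shared_ptr<int> a(new int(42), [&calls](int* p) {
            ++calls;                 // observe the call
            delete p;                // then release the raw pointer
        });
        std::shared_ptr<int> b = a;  // ref count is now 2
        a.reset();                   // ref count 1: deleter must not run yet
        if (calls != 0) return -1;
    }                                // b destroyed: ref count hits zero
    return calls;
}
```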
Performance
- Destructors are called a LOT
- they are invisible in code
- streamline common dtors
- the best dtor is empty
- inlining
- profile
The Rendering Technology of KillZone 2 (Michal Valient)
How we made Killzone 2 run @ 30FPS
- Deferred shading
- Diet for render targets
- Dirty lighting tricks
- Rendering, memory and SPUs
Deferred shading
- not forward rendering
- Geometry pass – fill the GBuffer (all material info for lighting)
- loading depth map, normal / bump map, albedo (diffuse color and texture), shininess (reflective materials)
- Lighting pass – accumulate info (only light, no textures)
GBuffer
- RGBA FP16 buffers proved to be too much
- Moved to RGBA8
- 4xRGBA8 + D24S8 – 18.4 MB
- 2xMSAA (Quincunx) – 36.8 MB
- Memory reused by later rendering stages
- Low res pass, post processing, HUD
- View space position computed from depth buffer
- Normal.z = sqrt(1 – Normal.x² – Normal.y²)
- No neg z, but does not cause problems
- 2xFP16 compressed to RGBA8 on write
- Motion vectors – screen space
- Albedo – material diffuse color
- Roughness – specular exponent in log range
- Specular intensity – single channel only
- Sun Shadow – pre-rendered sun shadows (offline light map)
- Mixed with real-time sun shadows
- Lighting accumulation buffer (LAB)
- Geometry pass fills in indirect lighting terms
- Stored in lightmaps and IBLs
- Adds ambient color, scene reflections
- Lighting pass adds contribution of each light
- Glow – contains HDR luminance of LAB
- Used to reconstruct HDR RGB for bloom
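The memory figures and the normal reconstruction above can be checked with a little arithmetic. This sketch assumes a 1280x720 target (the resolution mentioned later in these notes), 4 bytes per pixel for each of the five targets, and decimal megabytes; the function names are hypothetical:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bytes used by 4x RGBA8 + 1x D24S8 render targets.
std::uint64_t gbuffer_bytes(unsigned width, unsigned height, unsigned samples) {
    const unsigned targets = 5;        // four colour targets plus depth/stencil
    const unsigned bytesPerPixel = 4;  // RGBA8 and D24S8 are both 32 bits
    return std::uint64_t(width) * height * samples * targets * bytesPerPixel;
}

// Rebuilds z from the two stored components of a unit-length normal.
// Only the non-negative root is returned, matching "no neg z" in the notes.
float reconstruct_normal_z(float x, float y) {
    const float zz = 1.0f - x * x - y * y;
    return std::sqrt(zz > 0.0f ? zz : 0.0f);  // clamp guards against rounding
}
```

1280 x 720 x 5 x 4 is 18,432,000 bytes, which matches the 18.4 MB figure; two samples per pixel doubles it to 36.8 MB.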
Lighting pass
- Most expensive pass
- 100+ dynamic lights per frame
- 10+ shadow casting lights per frame
- AA means more of everything
- Optimization
- Avoid hard work
- Work less for MSAA
- Precompute sun shadow offline
- Approximate
Avoid hard work
- Don’t run shaders
- Use early z/stencil cull unit
- Depth bounds test is the new cool
- Enable conditional rendering
- Optimized light shaders
- For each combination of light features
- Fade out shadows for small lights
- Remove small objects from shadow map
Lighting pass and MSAA
- MSAA facts
- Each sample has to be lit
- Samples of non-edge pixel are equal
- KZ2 solution – in shader supersampling
- Run at 1280x720 not 2560x720
- Light two samples in one go
Shadow map filtering distribution
- Motivation
- Define filtering quality per pixel rather than per sample.
- Split filter coordinates into disjoint sets
- One set per pixel sample
- MSAA is almost as fast as non-MSAA
Sunlight
- Fullscreen directional light
- We divide screen into depth slices
- Each depth slice is lit separately
- Different shadow properties
- Used depth bounds test
- Use sun shadow from GBuffer
- Stencil mark pixels completely in shadow
- Skip expensive sunlight shader
- Also mixed with real-time shadows
Sunlight rendering – Fake MSAA
- Used only on distant pixels
- Cut down lighting cost
- Run lighting equation on closest sample only
- Is this wrong?
- It's a hack
- Works correctly against background
- The edges are still partially anti-aliased
- Distant scenery is heavily post processed
Sunlight – shadow map rendering
- Generate shadow map for each depth slice
- Common approach
- Align shadow map to view direction
- Pros – max shadow map usage
- Result – shadow map shimmering
- Fix
- Remove shadow map rotation
- Align shadow maps to world instead of view
- Remove sub-pixel movement
- Cons – unused shadow map space
GPU driven memory allocation system
Push Buffer building
- Multiple SPUs building PB in parallel
- Additional SPUs generating data
- Skinning, particles – VB
- IBL interpolation – textures
- Common solutions
- Ring Buffering
- Issue with out of order allocations
- Double Buffering
- Too much memory
KZ2 render memory allocator
- Fixed mem pool
- 22MB block – split into 256k blocks
- Each block has associated AllocationID
- Specified by client during allocation
- Only whole block can be allocated
- Global FreeID identify free blocks
- Updated as RSX consumes ‘Free’ marker
- Lockless, out of order, memory allocation
- From PPU and/or SPU
- Simple table walk (fast!)
- Allows immediate memory reuse
- We generate the push-buffer just in time for RSX
- Block can be reused right after RSX consumption
- Can allocate memory for skinning early…
- and still free at correct point in frame
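The notes do not say how the table walk works, so the following is only one possible reading, sketched with C++11 atomics. It assumes AllocationIDs grow over the frame, that a block counts as free once its ID is less than or equal to the global FreeID, and that `consume_free_marker` stands in for the RSX consuming a 'Free' marker. The `BlockPool` class and all its names are hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

class BlockPool {
public:
    static constexpr std::size_t kBlockCount = 88;        // 22 MB / 256 KB
    static constexpr std::size_t kNoBlock = kBlockCount;  // "pool exhausted"

    BlockPool() : freeId_(0) {
        for (std::size_t i = 0; i < kBlockCount; ++i) ids_[i].store(0);
    }

    // Claims one whole block and tags it with allocationId.
    // allocationId must be greater than the current FreeID.
    std::size_t allocate(std::uint32_t allocationId) {
        const std::uint32_t freeId = freeId_.load(std::memory_order_acquire);
        for (std::size_t i = 0; i < kBlockCount; ++i) {   // simple table walk
            std::uint32_t current = ids_[i].load(std::memory_order_relaxed);
            // Free blocks are claimed with a compare-and-swap, so no lock is held.
            if (current <= freeId &&
                ids_[i].compare_exchange_strong(current, allocationId)) {
                return i;
            }
        }
        return kNoBlock;
    }

    // Everything tagged with an ID <= id becomes reusable.
    void consume_free_marker(std::uint32_t id) {
        freeId_.store(id, std::memory_order_release);
    }

private:
    std::atomic<std::uint32_t> ids_[kBlockCount];
    std::atomic<std::uint32_t> freeId_;
};
```

With this scheme, memory tagged early in the frame (for skinning, say) is released the moment the marker for that ID is consumed, without any per-block free call.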
Direct3D 11 Tessellation Deep Dive (presented by Matt Lee)
High fidelity characters seem a bit out of reach of real-time apps (games).
10 to 30K chars not out of reach for 360/PS3
Striving for Cinematic Quality Characters
think in terms of triangles currently
Catmull-Clark subdivision surfaces
- Industry standard subdivision surface scheme
Modern implementations don’t require too smooth
Direct3D11
- Realtime rendering of Catmull-Clark
- 3 new pipeline stages
- Hardware design removes bandwidth bottlenecks from current implementations
- Better use of multi-core processors and improved shader management (dynamic shader linking)
Direct3D11 Pipeline
- Hull Shader
- Tessellator
- Domain Shader
Tessellation Data Flow
- Hull shader – executed per patch
- Tessellator – executed per patch (fixed-function), generates triangles
- Domain shader – executed per tessellated vertex
Hull Shader
- patch control points (input)
- output to Domain Shader
- two phases per patch(control points, patch constant)
- patch constant output to tessellator (modifies behavior of tessellator)
- control points go to domain shader
Tessellator
- state from D3D API
- input from Patch constant phase
- generates tessellated triangles
- out to later stages
Domain Shader
- Hull Shader and Tessellator output
- Smooth surface evaluated
- one vertex
- Control points are in GPU (saves bandwidth)
Where to use
- LOD of terrain
- Bezier patches from higher-order surfaces (Catmull-Clark)
Catmull-Clark
- Baked into content early
- Goal is real time
- Disadvantage (have to build offline and huge memory issues)
Loop-Schaefer approximation (D3D11), others exist
Benefits
- Content creation easier
- Save memory
- Easier LOD (doesn’t require sep meshes) so no use for MIP maps?
Pipeline
- offline Load control mesh
- offline compute adjacency for each quad
- offline compute texture tangent space for each vertex
- rt Morph & skin the quad mesh in the VS
- rt convert quad mesh into patches in hull shader
- rt Evaluate patches using domain shader
- rt apply displacement map
Tangent patches (fix up surface normals), extraordinary vertices (valence < 4 or > 4)
Available in March 2009 DX SDK (today), next release in June 09
SubD11 sample
Optimization will be performed when hardware is finalized
Current shader design is not expected to perform well on hardware.
Out of Order: Making In-Order Processors Play Nicely (presented by Allan Murphy)
VMX on the 360 for optimization of vector math
Slower than C counterpart, and out of order (so broken)
Missing out of order logic
- no instruction reordering
- no store forward hardware
- smaller caches, slower memory
- no l3 cache
- LHS
- L2 Miss
- Expensive, non pipelined instructions
- Branch mispredict penalty
Load Hit Store
- Store to memory location, then load, flush the L2 cache
- Casts, changing register set, aliasing
- Passing by value, or by reference
- On PC, instruction reordering and store-forwarding hardware hide this
L2 Miss
- Loading from location, checks cache
- Cost ~610 cycles to load cache line
- Hot cold split
- Reduce in-memory data size
- Use cache coherent structures
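A hot/cold split can be sketched as follows. The `EntityHot` and `EntityCold` layouts are made up for illustration; the idea is that the per-frame loop touches only the small hot structs, so more of them fit in each 128-byte cache line:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rarely used data lives apart so it never pollutes the cache in hot loops.
struct EntityCold {
    char debugName[64];
    float spawnPosition[3];
    std::uint32_t flags;
};

// Data read every frame, kept small and contiguous.
struct EntityHot {
    float x, y, z;
    std::uint32_t coldIndex;  // index into the cold array when it is needed
};

static_assert(sizeof(EntityHot) == 16, "eight hot entries per 128-byte line");

// A typical hot loop: walks only the hot array.
float sum_x(const std::vector<EntityHot>& hot) {
    float total = 0.0f;
    for (std::size_t i = 0; i < hot.size(); ++i) total += hot[i].x;
    return total;
}
```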
Expensive Instructions
- non pipelined instructions
- Stalls hardware threads
Branch Mispredict
- Mispredicting branch
- 23-24 cycle delay
- Know how the compiler implements branches
- Reduce total branch count for task
- Refactor calculations to remove branches
- Unroll
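Two common ways to refactor a calculation so it has no branch are sketched below, with hypothetical function names. Whether they actually beat the branchy version depends on the compiler and target, so this is something to profile rather than assume:

```cpp
#include <cassert>
#include <cstdint>

// Branchy version: one conditional jump per element.
int count_above_branchy(const int* v, int n, int threshold) {
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (v[i] > threshold) ++count;
    }
    return count;
}

// Branch-free version: the comparison yields 0 or 1, which is added directly.
int count_above_branchless(const int* v, int n, int threshold) {
    int count = 0;
    for (int i = 0; i < n; ++i) {
        count += (v[i] > threshold);
    }
    return count;
}

// Mask-based select, similar in spirit to fsel/vsel: a if cond, else b.
std::int32_t select_branchless(bool cond, std::int32_t a, std::int32_t b) {
    const std::int32_t mask = -static_cast<std::int32_t>(cond);  // 0 or all ones
    return (a & mask) | (b & ~mask);
}
```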
Profiling!!!
360 Tools
- PIX Cpu instruction trace
- LibPMCPB counters
- XbPerfView sampling capture
Other Platforms
- SN Tuner, vTune
Think laterally
- Inline functions
- pass and return in register (_declspec(passinreg))
- _restrict (compiler released from being ultra careful)
- const
Compiler options
- Inline
- Prefer speed over size
- Fast floating point over precise
- 360 (/Ou removing div by zero, /Oc runs a second code scheduling pass)
- Reduce parameter counts
- Prefer 32, 64, 128 bit parameters
- Isolate constants
- Avoid virtual if feasible
Know your cache architecture
- Cross core sharing policy (L2 shared, L1 single)
- Prefetch mech (dcbt, dcbz128)
- L2 1MB, L1 32Kb
- Cache line 128 byte
Know your instruction set
- 360 specific (VMX, slow instructions, fsel, vsel, vcmp*, vrlimi)
- PS3 (altivec)
- PC (SSE2-4.1 and friends)
What went wrong
- Correctness
- Guessed at 1 perf issue
- SIMD vs straight float
- Memory access and L2 usage unchanged
- Branch behavior exactly the same
Image Analysis
- Gaussian Mixture Model
- Profiling showed 86% of time in the pixel cost function
The PlayStation 3’s SPUs in the Real World (presented by Michiel van der Leeuw)
- Things they did on the SPUs (post-mortem)
- What worked and didn’t work
- Practical advice
- Food for thought
3 Years ~120 team size with 27 programmers
- Cinematic
- Dense
- Realistic
- Intense
- 6 x 3.2 Ghz processor
- Local mem per SPU
- Very fast DMA
- Core Requirements
- Animation
- AI
- Skinning
- Physics
- Compression/Decompression
- etc
Graphics
- Light probe sampling
- ~2500 static light probes per level
- 9x3 Spherical Harmonics in KD-tree
- sample light, blend 4 closest light probes, rotate in view space
- bake lights into level
Particle simulation
- 250 particle systems per frame
- 150 drawn
- 3000 particles updated
- 200 collision ray casts
- System grown over time
Refactored
- Vertex generation
- Particle simulation inner loop
- Initialization & deletion of particles
- High-level management / glue
Not done on SPU
- Updated global scene graph
- Starting & stopping sounds
Image Post Processing
Effects done on SPU
- Motion blur
- Depth of field
- Bloom
SPUs assist the RSX with post-processing
- RSX prepares low-res image buffers
- RSX triggers interrupt to start SPUs
- SPUs perform image operations
- RSX already starts next frame
- Result in SPU processed by RSX early in next frame
- Similar to PhyreEngine now
SPUs are compute-bound
- Bandwidth no issue
- Code can be optimized
Our trade-off: RSX vs SPU time
- SPUs take longer
- SPUs look better
- RSX was the bottleneck
Bloom and Lens Reflection
- 13% on one SPU
- Depth-dependent intensity response curve
- 7x7 Gaussian blur
- Upscaling results from different levels
- Internal Lens Reflection
- Result buffer
Waypoint cover maps –> depth map
IBL Sampling
SPU cost a lot of dev time
Code is future proof, scales to more cores, supports the items they require.
The future is memory-local and excessively parallel
SPUs are just one of these ‘new architectures’
Optimize for the concept
Keep code portable
Parallelization of code takes time
Treat CPU as cluster
Think in workloads / jobs
Build latency in algorithms
Don’t optimize too early
Lockless Programming in Games (Bruce Dawson, Microsoft)
Current Hardware
- 360 – 6 hardware threads
- PS3 – 9 hardware threads
- Windows – Quad cores not uncommon
- Point being multi-core is here to stay
Multithreading is mandatory if you want to harness the available power. If not you are really wasting the advanced features of the hardware.
Multithreaded programming is easy if you don’t share data. :) Of course this is not usually an option.
Best way to share data between threads is by using locks. This is important. Lockless is not a one-size fits all approach.
Lockless programming typically involves a job queue, for example an STL queue. The problem is that STL queues are not thread safe, so we have to make them safe. :)
Solution, use critical section to block off the code.
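A minimal sketch of that locked queue. It uses `std::mutex` from C++11 as a portable stand-in for a Win32 critical section, and the `LockedQueue` class itself is hypothetical:

```cpp
#include <cassert>
#include <mutex>
#include <queue>

template <typename T>
class LockedQueue {
public:
    void push(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);  // released when lock leaves scope
        queue_.push(value);
    }

    // Pops into out and returns true, or returns false if the queue is empty.
    // empty(), front() and pop() happen under one lock so no other thread
    // can slip in between them.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<T> queue_;
};
```

The lock is held only for the queue operation itself, which keeps contention low; the job should be executed after `try_pop` returns.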
Bad things
- Acquiring and releasing locks takes time
- Deadlocks
- Contention – waiting, holding locks too long
- Priority inversions – system threads on 360 do this (too often)
Use locks carefully or lockless
- Safely share data without locks (no deadlocks or priority inversion)
- Cons
- Very limited, tricky, generally not portable
sList (singly linked list) InterlockedPushEntrySList
This is NOT a queue! This is a stack! Don’t use on 360!!!!!
One writer, one reader (singleton) (works on paper, not in real world)
Read data (cpu to L2), write (cpu to L2)
Writes can happen before getting put in L2 cache
Happens on reads too (second read could come from L1)
read and write can pass each other
On PowerPC, reads and writes can pass each other, but on x86 only a load can pass a store
Reads not passing writes would basically disable L1, huge perf hit
publisher / subscriber model
ExportBarrier – no passing sign (stop sign) HANDLE BOTH reads and writes
Compilers are just as evil, rearrange code (single threaded)
Compiler/CPU reordering barriers needed
_ReadWriteBarrier(); x86
__lwsync(); PowerPC (both cpu and compiler)
Positioning is crucial (barrier between writes)
write-release semantics is the name
read-acquire semantics is the name
reader needs both read / write
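The publish/subscribe pattern with write-release and read-acquire can be sketched with C++11 `std::atomic` instead of the platform barriers named above. The `Mailbox` type and function names are hypothetical; `publish` is meant to run on the writer thread and `try_read` on the reader thread:

```cpp
#include <atomic>
#include <cassert>

struct Mailbox {
    int payload = 0;                   // plain data being shared
    std::atomic<bool> ready{false};    // flag that publishes it
};

// Writer: fill in the data first, then set the flag with release semantics
// so the payload write cannot be reordered after the flag write.
void publish(Mailbox& box, int value) {
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Reader: check the flag with acquire semantics so the payload read
// cannot be reordered before the flag read.
bool try_read(Mailbox& box, int& out) {
    if (!box.ready.load(std::memory_order_acquire)) return false;
    out = box.payload;
    return true;
}
```

The position of the barrier matters exactly as the notes say: it sits between the data write and the flag write on one side, and between the flag read and the data read on the other.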
Dekker’s / Peterson’s Algorithm
MemoryBarrier
- x86 __asm xchg Barrier, eax
- x64 __faststorefence()
- PowerPC __sync()
what about volatile
standard volatile…..NO
doesn’t prevent CPU reordering, and all variables would need to be tagged volatile
VC++ is better, but doesn’t prevent hardware reordering on 360
Acts as read-acquire / write-release on x86/x64 and Itanium
atomic<T> in C++0x
Double checked locking – singleton
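A sketch of double-checked locking for a singleton, written with C++11 atomics rather than the Interlocked or barrier intrinsics from the talk. The `Registry` class and its construction counter are hypothetical, and the instance is intentionally never freed:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class Registry {
public:
    static Registry* instance() {
        // First check, no lock: acquire pairs with the release store below.
        Registry* p = s_instance.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(s_mutex);
            // Second check, under the lock: another thread may have won the race.
            p = s_instance.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Registry();
                // Publish only after the object is fully constructed.
                s_instance.store(p, std::memory_order_release);
            }
        }
        return p;
    }

    static int constructions() { return s_constructions; }

private:
    Registry() { ++s_constructions; }  // only ever runs under s_mutex

    static std::atomic<Registry*> s_instance;
    static std::mutex s_mutex;
    static int s_constructions;
};

std::atomic<Registry*> Registry::s_instance{nullptr};
std::mutex Registry::s_mutex;
int Registry::s_constructions = 0;
```

Without the acquire/release pair, a second thread could see a non-null pointer before the constructor's writes are visible, which is the classic failure of this pattern.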
InterlockedXxx
doesn’t work on 360
its a full barrier on x86/x64/Itanium
InterlockedXxx Acquire/Release are portable (preferred)
Uses
- Reference counts
- Setting a flag
- Publish/Subscribe
- SLists
- XMCore on 360
- Double checked locking
Export, import, full barriers
Prefer to use locks!!!!
use lockless when locks are too costly
http://msdn.microsoft.com/en-us/library/bb310595(VS.85).aspx
Keynote: Discovering new development opportunities (Satoru Iwata, Nintendo)
The day starts out with a packed room, all waiting to hear the keynote to be delivered by Satoru Iwata, President of Nintendo. With the unquestionable success of Nintendo products worldwide, this talk is to pull back the curtain a bit on some of the ideas and methods that have led to this success.
Iwata started off by presenting the obligatory numbers slides showing how much success both the Wii and DS have shown in recent history. No one can take that away from them, they have several successful platforms currently.
Iwata then started to discuss, arguably the most important developer at Nintendo, Mr. Miyamoto. He explained that Miyamoto is one of the main reasons for the continued success. He explained in somewhat detailed terms, how Miyamoto’s development style has achieved this success.
Mr. Miyamoto first starts with a core concept, as do most software projects. One of the differences comes from where Miyamoto pulls the new ideas from. He is fascinated with studying humans and their behaviors, specifically when they are doing something that makes them happy. He will draw on this to come up with the concept for the new piece of software that he is attempting to create. For example, he got a dog for his family, and out of this was born Nintendogs.
Of course, having a good idea for a game concept and following it through to the release of a successful title are 2 very different things. This brings up the next key point: Miyamoto’s software development style. He typically will form a very small team (sometimes even just one developer) and they will work on a prototype, or rather a series of prototypes. At this stage, the graphics are very crude (boxes). They will work on this for however long it takes to perfect the core concept. At this stage, not even the president of the company will ask how things are going, or when this will be ready for the next stage. It should be noted that sometimes at this stage, work will be done but then shelved. This could happen for various reasons, but almost always, at least some of this will be used at a later time.
If the prototyping has met Miyamoto’s satisfaction, only at this stage will others be brought in on the game. This is where the polish (graphics and such) comes in, but the core gameplay is pretty much guaranteed at this point. This avoids the situation where, after time has been spent on polish, a core gameplay element requires a rewrite. That almost never happens with this style.
Nintendo also has some unique “playtest” elements to the project. They do not conduct formal playtests. Instead, Miyamoto will “kidnap” a non-technical employee and have them play the game with no help. He then checks to see how it works out. If they are able to understand and play with no help, the dev team has done their job. If not, it's a failure and the game will be readjusted.
Next, Iwata unveiled the new Virtual console with larger SD support and options to run from SD. Also some game demos of future titles were shown. Also, Rhythm Heaven was introduced, and he gave everyone in attendance a free copy, before this can be bought.
Thursday, January 8, 2009
Indexer example in C#
using System;

namespace Indexer
{
    class Program
    {
        static void Main(string[] args)
        {
            TestObject obj1 = new TestObject();
            TestObject obj2 = new TestObject();
            obj1.AddData(new[] { "this", "is", "test", "one", "!?" });
            obj2.AddData(new[] { "this", "is", "another", "test", "!" });

            // output to check object
            OutputObject(obj1, "OBJ1");
            OutputObject(obj2, "OBJ2");

            // write through the indexer
            obj1[1] = "was";
            obj2[1] = "used to be";

            // output to check object
            OutputObject(obj1, "OBJ1");
            OutputObject(obj2, "OBJ2");
        }

        // Reads every element back through the indexer.
        public static void OutputObject(TestObject obj, string name)
        {
            for (int i = 0; i < 5; i++)
                Console.WriteLine(string.Format("{0}: {1}", name, obj[i]));
        }
    }

    class TestObject
    {
        private readonly string[] store = new string[5];

        // The indexer: lets callers use obj[index] like an array.
        public string this[int index]
        {
            get { return store[index]; }
            set { store[index] = value; }
        }

        // Copies the first five strings of objData into the store.
        public void AddData(string[] objData)
        {
            for (int i = 0; i < 5; i++)
                store[i] = objData[i];
        }
    }
}
Tuesday, January 6, 2009
Memory alignment
Memory bandwidth can quickly become the bottleneck in a system. Take, for instance, a processor with a 32-bit memory bus that fetches an int, which happens to be 32 bits wide. If the int is aligned on a 32-bit boundary, the processor can fetch the value in one cycle. A misaligned int would straddle two 32-bit words and need more than one fetch.
Many of the enumerations in DirectX end with a value named x_FORCE_DWORD set to 0x7FFFFFFF. In binary this value is 1111111111111111111111111111111 (31 one bits). This guarantees the enum will be at least 32 bits in size.
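Both points can be checked in a few lines of C++. The enum and struct below are hypothetical stand-ins (not actual DirectX types), and the helper names are made up for this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a DirectX-style enum with a FORCE_DWORD sentinel.
enum HypotheticalFormat {
    HF_UNKNOWN = 0,
    HF_RGBA8 = 1,
    HF_FORCE_DWORD = 0x7fffffff  // forces an underlying type of at least 32 bits
};

// A char followed by a 32-bit int: the compiler pads so value stays aligned.
struct Padded {
    char tag;
    std::int32_t value;
};

// Counts the one bits in v.
int count_set_bits(std::uint32_t v) {
    int bits = 0;
    while (v != 0) {
        bits += static_cast<int>(v & 1u);
        v >>= 1;
    }
    return bits;
}

// True if p sits on a multiple of alignment bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

On typical platforms `sizeof(Padded)` comes out as 8 rather than 5 because of the padding after `tag`, though the exact figure is up to the compiler.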