Advanced Scenegraph Rendering Pipeline
Markus Tavenrath – NVIDIA – matavenrath@nvidia.com
Christoph Kubisch – NVIDIA – ckubisch@nvidia.com
2
SceneGraph Rendering
• The traditional approach is to render while traversing a SceneGraph
• Scene complexity increases
– Deep hierarchies make traversal expensive
– Large objects are split into many small pieces, which increases the draw call count
– Rendering is unsorted, which causes many state changes
• The CPU becomes the bottleneck when rendering such scenes
models courtesy of PTC
Introduction SceneGraph SceneTree ShapeList Renderer
3
Overview
[Figure: pipeline overview, SceneGraph -> SceneTree -> ShapeList -> Renderer. The SceneGraph contains groups G0, G1, transforms T0 to T3 and shapes S0 to S2. The SceneTree contains the same nodes plus the duplicated instances G1', T2', T3', S1', S2'. The ShapeList and the RendererCache both hold S0, S1, S2, S1', S2'. Legend: Gi = Group, Ti = Transform, Si = Shape]
4
SceneGraph
[Figure: example SceneGraph with groups G0, G1, transforms T0 to T3 and shapes S0 to S2. Legend: Gi = Group, Ti = Transform, Si = Shape]
• The SceneGraph is a DAG
• No unique path to a node
– Cannot efficiently cache path-dependent data per node
• Traversal runs over 14 nodes for rendering
• 6 Transform nodes are processed
– 6 matrix/matrix multiplications and inversions
• Nodes are usually 'large' and not laid out linearly in memory
– Each node access most likely generates at least one cache miss
5
SceneTree construction
[Figure: the SceneGraph (G0, G1, T0 to T3, S0 to S2) is expanded into a SceneTree that additionally contains G1', T2', T3', S1', S2'. Legend: Gi = Group, Ti = Transform, Si = Shape]
Mapping from SceneGraph node to SceneTree instances, first step shown:
G0 -> (G0), T0 -> (T0), T1 -> (T1), S0 -> (S0), G1 -> (G1), T2 -> (T2), S1 -> (S1), T3 -> (T3), S2 -> (S2)
Mapping once nodes that are reached over a second path have received a second instance:
G0 -> (G0), T0 -> (T0), T1 -> (T1), S0 -> (S0), G1 -> (G1,G1'), T2 -> (T2,T2'), S1 -> (S1,S1'), T3 -> (T3,T3'), S2 -> (S2,S2')
• Observer-based synchronization
6
SceneTree
• The SceneTree has a unique path to each node
• Store accumulated attributes such as transforms or visibility in each node
• Trade memory for performance
– 64 bytes per node; 100k nodes ~ 6 MB
– Transforms are stored in a separate vector
• Traversal still processes 14 nodes
[Figure: SceneTree with G0, G1, T0 to T3, S0 to S2 and the duplicated instances G1', T2', T3', S1', S2'. Legend: Gi = Group, Ti = Transform, Si = Shape]
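The expansion from a shared SceneGraph into a SceneTree with per-instance accumulated data can be sketched as follows. This is a minimal illustration, not the authors' implementation: `GraphNode`, `TreeNode`, `SceneTree`, `expand` and `buildSceneTree` are hypothetical names, and a single float offset stands in for a full transform matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical SceneGraph node: children may be shared, so the graph is a DAG.
struct GraphNode {
    std::string name;
    float       localOffset = 0.0f;            // stand-in for a local transform
    std::vector<const GraphNode*> children;
};

// Hypothetical SceneTree node: one instance per path, with accumulated data.
struct TreeNode {
    const GraphNode* source;
    int   parent;              // index into SceneTree::nodes, -1 for the root
    float accumulatedOffset;   // stand-in for the accumulated transform
};

struct SceneTree {
    std::vector<TreeNode> nodes;
    // SceneGraph node -> all of its SceneTree instances, e.g. G1 -> (G1, G1')
    std::map<const GraphNode*, std::vector<int>> instances;
};

// Depth-first expansion: every path to a node creates its own instance.
inline int expand(const GraphNode& g, int parent, float parentOffset, SceneTree& tree) {
    const int index = static_cast<int>(tree.nodes.size());
    tree.nodes.push_back({&g, parent, parentOffset + g.localOffset});
    tree.instances[&g].push_back(index);
    // Copy the value before recursing: the vector may reallocate below.
    const float accumulated = tree.nodes[index].accumulatedOffset;
    for (const GraphNode* child : g.children)
        expand(*child, index, accumulated, tree);
    return index;
}

inline SceneTree buildSceneTree(const GraphNode& root) {
    SceneTree tree;
    expand(root, -1, 0.0f, tree);
    return tree;
}
```

A shape referenced from two transforms ends up with two tree instances, each carrying its own accumulated value, which is the memory-for-performance trade described on the slide.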
7
SceneTree: invalidating the attribute cache
• Keep dirty flags per node
• Keep one dirty vector per flag
• SceneGraph change notifications invalidate nodes
– If the node is not dirty yet, mark it dirty and add it to the dirty vector
– O(1) operation, no sorting required upon changes
• Before rendering a frame, process the dirty vector
[Figure: SceneTree in which T1, T3 and T3' are marked dirty; the dirty vector holds T1, T3 and T3']
8
SceneTree: validating the attribute cache
• Walk through the dirty vector
– Node marked dirty -> search for the topmost dirty node on the path to the root
– Validate the subtree starting at that topmost dirty node
• Validation example
– T3 is dirty: traverse up to the root node
    • T3 is the top dirty node, validate the T3 subtree
– T3' is dirty: traverse up to the root node
    • T1 is the top dirty node, validate the T1 subtree
– T1 is not dirty any more
    • No work to do
[Figure: SceneTree with the dirty nodes T1, T3 and T3' highlighted; the dirty vector holds T1, T3 and T3']
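The invalidate/validate scheme from the last two slides can be sketched like this. It is a simplified illustration under stated assumptions: `Node`, `SceneTree`, `markDirty`, `validateSubtree` and `validate` are hypothetical names, only one dirty flag is modeled, and a float sum stands in for the matrix accumulation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical SceneTree node storage; the field names are illustrative only.
struct Node {
    int   parent = -1;            // -1 marks the root
    std::vector<int> children;
    float local = 0.0f;           // stand-in for a local transform
    float world = 0.0f;           // cached accumulated value
    bool  dirty = false;
};

struct SceneTree {
    std::vector<Node> nodes;
    std::vector<int>  dirtyVector;

    // O(1): mark the node once and remember it; no sorting, no traversal.
    void markDirty(int n) {
        if (!nodes[n].dirty) {
            nodes[n].dirty = true;
            dirtyVector.push_back(n);
        }
    }

    // Recompute cached values for the subtree rooted at n and clear its flags.
    int validateSubtree(int n) {
        Node& node = nodes[n];
        node.world = (node.parent < 0 ? 0.0f : nodes[node.parent].world) + node.local;
        node.dirty = false;
        int visited = 1;
        for (int c : node.children) visited += validateSubtree(c);
        return visited;
    }

    // Walk the dirty vector; for each node that is still dirty, find the
    // topmost dirty ancestor and validate from there.
    // Returns the number of nodes that were recomputed.
    int validate() {
        int visited = 0;
        for (int n : dirtyVector) {
            if (!nodes[n].dirty) continue;   // already handled via an ancestor
            int top = n;
            for (int p = nodes[n].parent; p >= 0; p = nodes[p].parent)
                if (nodes[p].dirty) top = p;
            visited += validateSubtree(top);
        }
        dirtyVector.clear();
        return visited;
    }
};
```

Because a validated subtree clears the flags of all its descendants, later entries in the dirty vector that were covered by an earlier entry cost only one flag check.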
9
SceneTree to ShapeList
• Add events for ShapeList generation
– addShape(Shape)
– removeShape(Shape)
[Figure: the shapes of the SceneTree (S0, S1, S2, S1', S2') are collected in the flat ShapeList. Legend: Gi = Group, Ti = Transform, Si = Shape]
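One possible shape of an event-driven ShapeList is sketched below. Only the event names `addShape` and `removeShape` come from the slide; the `ShapeList` class, the `ShapeId` handle and the swap-with-last removal are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

using ShapeId = int;   // hypothetical handle for a SceneTree shape instance

// Flat list of shapes, kept in sync through add/remove events
// instead of a SceneTree traversal.
class ShapeList {
public:
    void addShape(ShapeId shape) {
        if (slot_.count(shape)) return;          // already listed
        slot_[shape] = shapes_.size();
        shapes_.push_back(shape);
    }

    // Swap-with-last removal keeps the list dense without shifting elements.
    void removeShape(ShapeId shape) {
        auto it = slot_.find(shape);
        if (it == slot_.end()) return;           // unknown shape, nothing to do
        const std::size_t index = it->second;
        const ShapeId last = shapes_.back();
        shapes_[index] = last;
        slot_[last] = index;
        shapes_.pop_back();
        slot_.erase(shape);
    }

    const std::vector<ShapeId>& shapes() const { return shapes_; }

private:
    std::vector<ShapeId> shapes_;
    std::unordered_map<ShapeId, std::size_t> slot_;   // shape -> index in shapes_
};
```

Both events run in constant time on average, so the renderer can iterate a dense array every frame while scene edits stay cheap.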
10
Summary
• SceneGraph to SceneTree synchronization
– Store accumulated data per node instance
• SceneTree to ShapeList synchronization
– Avoid SceneTree traversal
• Next: efficient data structures for the renderer, based on the ShapeList
11
Renderer Data Structures
[Figure: renderer data structures]
Program
– Shader: Vertex
– Shader: Fragment
– ParameterDescription: Camera
– ParameterDescription: Light
– ParameterDescription: Matrices
– ParameterDescription: Material

ParameterDescription "Material"
Name      Type     Arraysize
ambient   vec3     2
diffuse   vec3     2
specular  vec3     2
texture   Sampler  0

Shape
– Program: 'colored'
– Geometry: Shape1
– ParameterData: Camera1
– ParameterData: Lightset 1
– ParameterData: Transform 1
– ParameterData: red
12
Example Parameter Grouping
Parameters:
• Shader-independent globals, e.g. camera
• Object parameters, e.g. position/rotation/scaling
• Material handles, e.g. textures and buffers
• Material raw values, e.g. float, int and bool
• Light, e.g. light sources and shadow maps
• Shader-dependent globals, e.g. environment map
Frequency labels used in the figure: always, frequent, constant, rare
13
Rendering structures
[Figure: the shapelist ('colored', 'textured', 'colored', 'colored', 'textured') is grouped by Program into "Shapes colored" (three 'colored' shapes) and "Shapes textured" (two 'textured' shapes); within each group the shapes are sorted by ParameterData]
14
ParameterData Cache
• The cache is one big char[] holding all ParameterData
• ParameterData are sorted by first usage
• Parameters are converted to the target-API datatype, e.g.
– int8 to int32, TextureHandle to bindless texture...
• Updating parameters is only a playback of data in memory, with no conditionals
• Filter for used parameters to reduce the cache size
[Figure: "Parameters colored" cache holding red and blue; "Parameters textured" cache holding wood and marble]
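The idea of a pre-converted byte cache with branch-free playback can be illustrated with the sketch below. `ParameterCache` and its methods are hypothetical, and `playback` copies into a caller-provided destination where a real renderer would hand the bytes to the graphics API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical cache: all parameter values live in one contiguous byte buffer,
// already converted to the type the target API expects.
class ParameterCache {
public:
    // Append an int8 parameter widened to int32; returns its byte offset.
    std::size_t addInt8AsInt32(std::int8_t value) {
        const std::int32_t converted = value;
        return append(&converted, sizeof(converted));
    }

    // Append three floats (e.g. a color); returns the byte offset.
    std::size_t addVec3(float x, float y, float z) {
        const float v[3] = {x, y, z};
        return append(v, sizeof(v));
    }

    // "Playback": copy a range of cached bytes to the destination without
    // branching on parameter types.
    void playback(std::size_t offset, std::size_t size, void* destination) const {
        assert(offset + size <= bytes_.size());
        std::memcpy(destination, bytes_.data() + offset, size);
    }

    std::size_t size() const { return bytes_.size(); }

private:
    std::size_t append(const void* data, std::size_t size) {
        const std::size_t offset = bytes_.size();
        bytes_.resize(offset + size);
        std::memcpy(bytes_.data() + offset, data, size);
        return offset;
    }

    std::vector<char> bytes_;   // the "big char[]"
};
```

The conversion cost is paid once when a parameter is added; the per-frame path is a plain memory copy.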
15
Vertex Attribute Cache
• One big char[] with vertex attribute pointers
– Bindless pointers, VBOs or VAB streams
• Each set of attributes is stored only once
• Ordered by first usage
• The attributes required by a program are known
– Store only the used attributes in the cache
– Useful for special passes such as a depth pass, where only pos is required
[Figure: "Attributes colored" cache holding three (pos, normal) attribute sets]
16
Renderer Cache complete
[Figure: complete renderer cache for the program 'colored': Parameters (Phong red, Phong blue), Shapes (three 'colored' shapes), Attributes (three (pos, normal) sets)]
foreach (shape) {
    if (visible(shape)) {
        if (changed(parameters)) render(parameters);
        if (changed(attributes)) render(attributes);
        render(shape);
    }
}
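The loop above only resubmits state that changed since the previous shape. A small sketch of that filtering, with counters standing in for real API calls, might look as follows; `CachedShape`, `RenderStats` and `renderShapes` are hypothetical names, and the integer ids are assumed to index the parameter and attribute caches.

```cpp
#include <cassert>
#include <vector>

// Hypothetical renderer-cache entry: ids into the parameter and attribute caches.
struct CachedShape {
    int  parameters;
    int  attributes;
    bool visible;
};

struct RenderStats {
    int parameterUpdates = 0;
    int attributeUpdates = 0;
    int draws            = 0;
};

// Walks the sorted shape list and only re-submits state that differs from
// the previously submitted one. The counters stand in for real API calls.
inline RenderStats renderShapes(const std::vector<CachedShape>& shapes) {
    RenderStats stats;
    int currentParameters = -1;   // -1: nothing submitted yet
    int currentAttributes = -1;
    for (const CachedShape& shape : shapes) {
        if (!shape.visible) continue;
        if (shape.parameters != currentParameters) {
            currentParameters = shape.parameters;
            ++stats.parameterUpdates;
        }
        if (shape.attributes != currentAttributes) {
            currentAttributes = shape.attributes;
            ++stats.attributeUpdates;
        }
        ++stats.draws;
    }
    return stats;
}
```

Sorting by first usage pays off here: the longer the runs of shapes that share parameters or attributes, the fewer updates are issued.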
17
Achievements
• CPU boundedness improved (application side)
– Recomputation of attributes (transforms)
– Deep hierarchies: traversal expensive
– Unsorted rendering, many state changes
• CPU boundedness remaining (OpenGL usage)
– Large objects split into many small pieces, increased draw call count
[Figure: pipeline stages SceneTree, ShapeList, Renderer, RendererCache and the OpenGL implementation]
18
Enabling Hardware Scalability
• Avoid data redundancy
– Data stored once, referenced multiple times
– Update only once (fewer host-to-GPU transfers)
• Increase batching potential
– Further cuts API calls
– Less driver CPU work
• Minimize CPU/GPU interaction
– Allow the GPU to update its own data
– Lower API usage when the scene changes little
– E.g. GPU-based culling
19
OpenGL Research Framework
• Avoids classic SceneGraph design
• Geometry
– Vertex/IndexBuffer
– BoundingBox
– Divided into parts (CAD features)
• Material
• Node hierarchy
• Object
– Node and Geometry reference
– For each Geometry part
    • Material reference
    • Enabled state
Model shown (courtesy of PTC):
• 99,000 total parts, 3.8 Mtris, 5.1 Mverts
• 700 geometries, 128 materials
• 2,000 objects
20
Performance baseline
• Kepler Quadro K5000, i7
• VBO bind and drawcall per part, i.e. 99,000 drawcalls:
  scene draw time > 38 ms (CPU bound)
• VBO bind per geometry, drawcall per part:
  scene draw time > 14 ms (CPU bound)
• All subsequent techniques raise performance significantly:
  scene draw time < 6 ms
  1.8 ms with occlusion culling
21
Drawcall Reduction
• MultiDraw (1.x)
– Render ranges from the current VBO/IBO
– Single drawcall for many distinct objects
– Reduces overhead for low-complexity objects
• ARB_draw_indirect (4.x)
• ARB_multi_draw_indirect
– Store drawcall information on the GPU or the host
– Let the GPU create/modify GPU buffers
DrawElementsIndirect
{
    GLuint count;
    GLuint instanceCount;
    GLuint firstIndex;
    GLint  baseVertex;
    GLuint baseInstance;
}
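As an illustration of how such command records could be filled on the host, here is a sketch that mirrors the struct with fixed-width types so it compiles without OpenGL headers. `DrawElementsIndirectCommand`, `Part` and `buildCommands` are hypothetical names; in a real application the resulting array would be uploaded into a buffer bound to GL_DRAW_INDIRECT_BUFFER and consumed by glMultiDrawElementsIndirect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the DrawElementsIndirect layout from the slide with fixed-width types.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices
    std::uint32_t instanceCount;  // 0 skips the draw
    std::uint32_t firstIndex;     // offset into the index buffer, in indices
    std::int32_t  baseVertex;     // added to every fetched index
    std::uint32_t baseInstance;   // usable as a per-draw index
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "commands must be tightly packed");

// Hypothetical description of one part inside a shared index/vertex buffer.
struct Part {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// One command per part; baseInstance carries the part's position in the list.
inline std::vector<DrawElementsIndirectCommand>
buildCommands(const std::vector<Part>& parts) {
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(parts.size());
    for (std::size_t i = 0; i < parts.size(); ++i) {
        commands.push_back({parts[i].indexCount, 1u, parts[i].firstIndex,
                            parts[i].baseVertex, static_cast<std::uint32_t>(i)});
    }
    return commands;
}
```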
22
Drawing Techniques
– All techniques use multidraw capabilities to render across gaps
– BATCHED uses a CPU-generated list of combined parts with the same state
    • The object's part cache must be rebuilt based on material/enabled state
– INDIVIDUAL stays on the per-part level
    • No caches; assignment or cmd buffers can be updated directly
[Figure: a geometry with the parts a, b, c]
• Parts with different materials in a geometry: a, b, c
• Grouped and "grown" drawcalls: a+b, c
• Single call, with the material/matrix assignment encoded via a vertex attribute
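The "grown" drawcalls of the BATCHED technique can be sketched as a merge of adjacent index ranges that share a material. This is an assumption-laden illustration: `Part`, `DrawRange` and `batchParts` are hypothetical, and only directly adjacent ranges are merged here, whereas ranges separated by gaps would go into the same multidraw list as separate entries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical part of a geometry: a contiguous index range plus its state.
struct Part {
    std::size_t firstIndex;
    std::size_t indexCount;
    int         material;
    bool        enabled;
};

struct DrawRange {
    std::size_t firstIndex;
    std::size_t indexCount;
    int         material;
};

// Merge enabled parts that are adjacent in the index buffer and share a
// material into one "grown" range (a, b, c -> a+b, c).
inline std::vector<DrawRange> batchParts(const std::vector<Part>& parts) {
    std::vector<DrawRange> ranges;
    for (const Part& part : parts) {
        if (!part.enabled) continue;
        if (!ranges.empty()) {
            DrawRange& last = ranges.back();
            const bool adjacent = last.firstIndex + last.indexCount == part.firstIndex;
            if (adjacent && last.material == part.material) {
                last.indexCount += part.indexCount;   // grow the previous range
                continue;
            }
        }
        ranges.push_back({part.firstIndex, part.indexCount, part.material});
    }
    return ranges;
}
```

The result depends on each part's material and enabled state, which is why this kind of cache has to be rebuilt when either changes.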
23
Parameters
• Group parameters by frequency of change
• Generating shader strings allows different storage backends for "uniforms"
– OpenGL 2: uniforms
– OpenGL 3, 4: buffers
– NVIDIA bindless technology...
Effect "Phong" {
    Group "material" (many) {
        vec4 "ambient"
        vec4 "diffuse"
        vec4 "specular"
    }
    Group "view" (few) {
        mat4 "viewProjTM"
    }
    Group "object" (many) {
        mat4 "worldTM"
    }
    ... Code ...
}
24
Parameters
• GL2 approach:
– Avoid many small uniforms
– Use arrays of uniforms, grouped by frequency of update and tightly packed
uniform mat4 worldMatrices[2];
uniform vec4 materialData[8];
#define matrix_world      worldMatrices[0]
#define matrix_worldIT    worldMatrices[1]
#define material_diffuse  materialData[0]
#define material_emissive materialData[1]
#define material_gloss    materialData[2].x
// GL3 can use floatBitsToInt and friends
// for free reinterpret casts within macros
...
wPos = matrix_world * oPos;
...
// in fragment shader
color = material_diffuse + material_emissive;
...
25
Parameters
• GL4 approach:
– TextureBufferObject (TBO) for matrices
– UniformBufferObject (UBO) with array data to save costly binds
– Assignment indices passed as a vertex attribute
in vec4 oPos;
uniform samplerBuffer matrixBuffer;
uniform materialBuffer {
    Material materials[512];
};
in ivec2 vAssigns;
flat out ivec2 fAssigns;
// in vertex shader
fAssigns = vAssigns;
worldTM = getMatrix (matrixBuffer, vAssigns.x);
wPos = worldTM * oPos;
...
// in fragment shader
color = materials[fAssigns.y].color;
...
26
OpenGL 4.x approach
setupSceneMatrixAndMaterialBuffer (scene);
foreach (obj in scene) {
    if ( isVisible(obj) ) {
        setupDrawGeometryVertexBuffer (obj);
        // iterate over the different materials used
        foreach ( batch in obj.materialCaches ) {
            // x = matrix index, y = material index, matching vAssigns.x / fAssigns.y in the shader
            glVertexAttribI2i (indexAttr, matrixIndex, batch.materialIndex);
            glMultiDrawElements (GL_TRIANGLES, batch.counts, GL_UNSIGNED_INT,
                                 batch.offsets, batch.numUsed);
        }
    }
}
27
Per drawcall vertex attribute
Vertex attribute fetch index:
glVertexAttribDivisor == 0 : VArray[ gl_VertexID + baseVertex ]
glVertexAttribDivisor != 0 : VArray[ gl_InstanceID / VDivisor + baseInstance ]
With instanceCount = 1 and divisor 1, gl_InstanceID is 0, so the second case becomes
VArray[ 0 / 1 + baseInstance ]
[Figure: "Position & Normal" VertexBuffer (divisor:0) and "Material & Matrix Index" VertexBuffer (divisor:1), next to a MultiDrawIndirect buffer with the entries { ..., instanceCount = 1, baseInstance = 0 } and { ..., instanceCount = 1, baseInstance = 1 }. The vertex attributes fetched for the last vertex in the second drawcall use baseInstance = 1]
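The two fetch rules can be written down as plain functions to check the arithmetic. The function names are hypothetical and the formulas simply restate the ones on the slide; this models the index computation only, not actual GPU attribute fetching.

```cpp
#include <cassert>
#include <cstdint>

// Index used to fetch a per-vertex attribute (divisor == 0),
// following the formula on the slide.
constexpr std::uint32_t perVertexIndex(std::uint32_t vertexId, std::int32_t baseVertex) {
    return static_cast<std::uint32_t>(static_cast<std::int64_t>(vertexId) + baseVertex);
}

// Index used to fetch an instanced attribute (divisor != 0).
constexpr std::uint32_t perInstanceIndex(std::uint32_t instanceId,
                                         std::uint32_t divisor,
                                         std::uint32_t baseInstance) {
    return instanceId / divisor + baseInstance;
}

// With instanceCount = 1 the instance id is always 0, so the fetched element
// is selected by baseInstance alone.
static_assert(perInstanceIndex(0, 1, 0) == 0, "first drawcall reads element 0");
static_assert(perInstanceIndex(0, 1, 1) == 1, "second drawcall reads element 1");
```

This is what lets baseInstance act as a per-drawcall index into the material/matrix assignment buffer.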
28
OpenGL 4.2+ indirect approach
...
foreach ( obj in scene.objects ) {
    ...
    // instead of glVertexAttribI2i calls and a loop
    // we use the baseInstance for the attribute
    // bind special assignment buffer as vertex attribute
    glBindBuffer ( GL_ARRAY_BUFFER, obj->assignBuffer );
    glVertexAttribIPointer ( indexAttr, 2, GL_INT, . . . );
    // draw everything in one go
    glMultiDrawElementsIndirect ( GL_TRIANGLES, GL_UNSIGNED_INT,
                                  obj->indirectOffset, obj->numIndirects, 0 );
}
29
Vertex Setup
• ARB_vertex_attrib_binding (VAB)
– Avoids many buffer changes
– Separates format from data
– Bind multiple vertex attributes to one buffer
• NV_vertex_buffer_unified_memory (VBUM)
– Allows very fast switching through GPU pointers
/* set up once, similar to glVertexAttribPointer,
   but with the relative offset last */
glVertexAttribFormat (ATTR_NORMAL, 3, GL_FLOAT, GL_TRUE, offsetof(Vertex,normal));
glVertexAttribFormat (ATTR_POS, 3, GL_FLOAT, GL_FALSE, offsetof(Vertex,pos));
// bind to stream
glVertexAttribBinding (ATTR_NORMAL, 0);
glVertexAttribBinding (ATTR_POS, 0);
// switch single stream buffer
glBindVertexBuffer (0, bufID, 0, sizeof(Vertex));

// NV_vertex_buffer_unified_memory
// enable once and set stride
glEnableClientState (GL_VERTEX...NV);...
glBindVertexBuffer (0, 0, 0, sizeof(Vertex));
// switch single buffer via pointer
glBufferAddressRangeNV (GL_VERTEX...,0,bufADDR, bufSize);
30
Effect on CPU time
[Figure: bar chart of CPU time in microseconds (axis 0 to 1000 us, lower is better) for VBO, VAB, VAB+VBUM, BINDLESS INDIRECT HOST (VAB+VBUM) and BINDLESS INDIRECT GPU (VAB+VBUM); ~2400 drawcalls, GL4 BATCHED style]
• NV_bindless_multidraw_indirect
– Vertex/index setup inside the MultiDrawIndirect command
– One GL call to draw the entire scene
– GPU benefit depends on triangles per drawcall (> ~500)
31
[Figure: stacked bar chart of GPU and CPU time in microseconds (axis 0 to 6000 us, lower is better) for GL4 INDIRECT HOST INDIVIDUAL, GL4 INDIRECT GPU INDIVIDUAL, GL4 BATCHED and GL2 BATCHED, each measured as K and KB]
K = Kepler 5000, regular VBO
KB = Kepler 5000, VBUM + VAB
hw drawcalls: 99,000 (INDIVIDUAL), 2,400 (BATCHED)
sw drawcalls: 2,000 (INDIVIDUAL), 2,400 (BATCHED)
• Bindless (green) always reduces CPU time and may help framerate/GPU a bit
32
[Figure: same GPU/CPU time chart and legend as on slide 31]
• MultiDrawIndirect achieves almost 20 million drawcalls per second (2000 VBO changes, "only" 1/3 of the performance lost)
• GPU-buffered commands save a lot of CPU time
• Scene-dependent! INDIVIDUAL could be as fast if there is enough work per drawcall
33
[Figure: same GPU/CPU time chart and legend as on slide 31]
• GL2 uniforms beat the paletted UBO a bit on the GPU, but are slower on the CPU side (1 glUniform call with 8x vec4, vs. indexed UBO)
• Scene-dependent! GL4 is better when more materials change per object
34
Recap
• Share geometry buffers for batching
• Group parameters for fast updating
• Use MultiDraw/Indirect to keep objects independent or to remove additional loops
– baseInstance provides a unique index/assignment per drawcall
• Use bindless to reduce validation overhead and add flexibility
35
GPU Culling Basics
• GPU-friendly processing
– Matrix and bbox buffer, object buffer
– XFB/Compute or "invisible" rendering
– Vs. old techniques: a single GPU job for ALL objects!
• Results
– "Readback": GPU to host
    • Can use the GPU to pack the results into a bit stream
– "Indirect": GPU to GPU
    • Set DrawIndirect's instanceCount to 0 or 1
Example visibility results: 0,1,0,1,1,1,0,0,0
buffer cmdBuffer {
    Command cmds[];
};
...
cmds[obj].instanceCount = visible;
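Both ways of consuming the culling result can be mimicked on the CPU to make the data flow concrete. In the deck this work runs on the GPU; the sketch below is only a host-side model, and `Command`, `applyVisibility` and `packBits` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical command record; only instanceCount matters for this sketch.
struct Command {
    std::uint32_t count;
    std::uint32_t instanceCount;
};

// "Indirect" variant: visibility toggles instanceCount between 0 and 1.
inline void applyVisibility(std::vector<Command>& cmds,
                            const std::vector<int>& visible) {
    assert(cmds.size() == visible.size());
    for (std::size_t i = 0; i < cmds.size(); ++i)
        cmds[i].instanceCount = visible[i] ? 1u : 0u;
}

// "Readback" variant: pack one bit per object, 32 objects per word.
inline std::vector<std::uint32_t> packBits(const std::vector<int>& visible) {
    std::vector<std::uint32_t> words((visible.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < visible.size(); ++i)
        if (visible[i]) words[i / 32] |= 1u << (i % 32);
    return words;
}
```

For the example sequence 0,1,0,1,1,1,0,0,0 the packed word has bits 1, 3, 4 and 5 set, so the readback transfers a single 32-bit value instead of nine integers.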
36
Occlusion Culling
• OpenGL 4.2+
– Depth pass
– Rasterize "invisible" bounding boxes
    • Disable color/depth writes
    • Geometry shader creates the three visible box sides
    • Depth buffer discards occluded fragments (earlyZ...)
    • Fragment shader writes output: visible[objindex] = 1
Algorithm by Evgeny Makarov, NVIDIA
// GLSL fragment shader
// from ARB_shader_image_load_store
layout(early_fragment_tests) in;
buffer visibilityBuffer {
    int visibility[];
};
flat in int objID;
void main() {
    visibility[objID] = 1;
}
// buffer would have been cleared to 0 before
[Figure: bounding boxes tested against the depth buffer; passing bbox fragments enable the object]
37
Temporal Coherence
• Exploit that the majority of objects don't change much relative to the camera
• Draw each object only once (vertex/drawcall-bound)
– Render the last visible objects, fully shaded: (last)
– Test all objects against the current depth: (visible)
– Render the newly added visible objects: (~last) & (visible); none if no spatial changes were made
– Update: (last) = (visible)
[Figure: frame f – 1 shows the camera and the last visible objects; in frame f the camera has moved, bboxes that pass the depth test are (visible), occluded bboxes stay invisible, and the objects not drawn before are the new visible ones]
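The set logic of the temporal scheme fits into a few bit operations. The sketch below uses one bit per object in a single 64-bit word and takes both masks as inputs; in the actual frame the (visible) mask only becomes available after the last-visible objects have been drawn. `FrameResult` and `temporalCull` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// One bit per object; a single 64-bit word keeps the sketch short.
struct FrameResult {
    std::uint64_t drawFirst;   // drawn before the occlusion test: (last)
    std::uint64_t drawNew;     // drawn after the occlusion test: (~last) & (visible)
    std::uint64_t nextLast;    // carried over to the next frame: (visible)
};

// last:    objects found visible in the previous frame
// visible: objects whose bounding boxes pass the depth test in this frame
constexpr FrameResult temporalCull(std::uint64_t last, std::uint64_t visible) {
    return FrameResult{last, ~last & visible, visible};
}
```

When the visible set is unchanged between frames, drawNew is empty, which matches the "none, if no spatial changes made" note on the slide.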
38
Culling Readback vs Indirect
[Figure: stacked bar chart of GPU and CPU time in microseconds (axis 0 to 2500 us, lower is better) for readback, indirect and NVindirect, GL4 BATCHED style. Labels in chart order: 37% faster with culling, 33% faster with culling, 37% faster with culling]
• For readback results, the CPU has to wait for the GPU to become idle
• In the "draw new visible" phase, indirect cannot benefit from knowing in advance that there is "nothing to set up/draw"; it still processes "empty" lists
• NV_bindless_multidraw_indirect saves CPU time and a bit of GPU time
• Scene-dependent, e.g. triangles per drawcall and number of "invisible" objects
39
Culling Results
• Temporal culling is very useful for object/vertex-boundedness
– Can also be applied to the Z-pass...
• Readback vs Indirect
– The readback variant is "easier" to make faster (no setups...), but it syncs!
– The NV_bindless_multidraw benefit depends on the scene (VBO changes and primitives per drawcall)
• Working towards a GPU-autonomous system
– (NV_bindless)/ARB_multidraw_indirect as the mechanism for the GPU to create its own work; research and feature work in progress
40
• Thank you!
– Contact
    • ckubisch@nvidia.com
    • matavenrath@nvidia.com
glFinish();
41
NVIDIA Bindless Technology
• Family of extensions to use native handles/addresses
• NV_vertex_buffer_unified_memory
• NV_bindless_multidraw_indirect
• NV_shader_buffer_load/store
– Pointers in GLSL
• NV_bindless_texture
– No more unit restrictions
– References inside buffers
// GLSL with true pointers
uniform MyStruct* mystructs;
// API
glUniformui64NV (bufferLocation, bufferADDR);
texHDL = glGetTextureHandleNV (tex);
// later instead of glBindTexture
glUniformHandleui64NV (texLocation, texHDL);
// GLSL
// can also store textures in resources
uniform materialBuffer {
    sampler2D manyTextures [LARGE];
};
42
Culling: Readback vs Indirect
[Figure: stacked bar chart of time in microseconds (axis 0 to 2500) for Readback GPU, Readback CPU, Indirect GPU, Indirect CPU, NVIndirect GPU and NVIndirect CPU, broken down into the phases 1. Frustum Cull, 2. Update Internals, 3. Draw Last Visible, 4. Occlusion Cull, 5. Update Internals, 6. Draw New Visible]
Frame rates (labels in chart order):
with culling   without culling
432 fps        315 fps
387 fps        289 fps
429 fps        313 fps
• Indirect: nothing "new" to draw, but the CPU doesn't know; it still sets things up and the GPU runs through an "empty" cmd buffer
• For readback results, the CPU has to wait for the GPU to become idle
• The special bindless indirect version can save a lot of CPU cost and a bit of GPU cost when drawing the scene with a single big cmd buffer

Advanced Scenegraph Rendering Pipeline

  • 1.
    Advanced Scenegraph RenderingPipeline Markus Tavenrath – NVIDIA – [email protected] Christoph Kubisch – NVIDIA – [email protected]
  • 2.
    2  Traditional approachis render while traversing a SceneGraph  Scene complexity increases – Deep hierarchies, traversal expensive – Large objects split up into a lot of little pieces, increased draw call count – Unsorted rendering, lot of state changes  CPU becomes bottleneck when rendering those scenes SceneGraph Rendering models courtesy of PTC Introduction SceneGraph SceneTree ShapeList Renderer
  • 3.
    3 Overview Introduction SceneGraph SceneTreeShapeList Renderer G0 T0 T1 T2 S1 S2 G1 T3 S0 G0 T0 T1 T2 S1 S2 G1 T3 S0 T2‘ S1‘ S2‘ G1‘ T3‘ ShapeList S1 S2S0 S1‘ S2‘ RendererCache S1 S2S0 S1‘ S2‘ SceneGraph SceneTree ShapeList Renderer Gi Group Ti Transform Si Shape
  • 4.
    4 SceneGraph G0 T0 T1 T2 S1 S2 G1 T3 S0 SceneGraph is DAG  No unique path to a node – Cannot efficiently cache path-dependent data per node  Traversal runs over 14 nodes for rendering.  Processed 6 Transform Nodes – 6 matrix/matrix multiplications and inversions  Nodes are usually ‚large‘ and not linear in memory – Each node access generates at least one, most likely cache misses Gi Group Ti Transform Si Shape Introduction SceneGraph SceneTree ShapeList Renderer
  • 5.
    5 SceneTree construction G0 T0 T1 T2 S1S2 G1 T3 S0 Gi Group Ti Transform Si Shape Introduction SceneGraph SceneTree ShapeList Renderer G0 T0 T1 T2 S1 S2 G1 T3 S0 T2‘ S1‘ S2‘ G1‘ T3‘ G0 -> (G0) T0 -> (T0) T1 -> (T1) S0 -> (S0) G1 -> (G1) T2 -> (T2) S1 -> (S1) T3 -> (T3) S2 -> (S2) G0 -> (G0) T0 -> (T0) T1 -> (T1) S0 -> (S0) G1 -> (G1,G1‘) T2 -> (T2,T2‘) S1 -> (S1,S1‘) T3 -> (T3,T3‘) S2 -> (S2,S2‘)  Observer based synchronization
  • 6.
    6  SceneTree hasunique path to each node  Store accumulated attributes like transforms or visibility in each Node  Trade memory for performance – 64-byte per node, 100k nodes ~6MB – Transforms stored separate vector  Traversal still processes 14 nodes. Gi Group Ti Transform Si Shape G0 T0 T1 T2 S1 S2 G1 T3 S0 T2‘ S1‘ S2‘ G1‘ T3‘ SceneTree Introduction SceneGraph SceneTree ShapeList Renderer
  • 7.
    7 SceneTree invalidate attributescache  Keep dirty flags per node  Keep dirty vector per flag  SceneGraph change notifications invalidated nodes – If not dirty, mark dirty and add to dirty vector – O(1) operation, no sorting required upon changes  Before rendering a frame process dirty vectorDirty vector T1T3 T3‘ G0 T0 T1 T2 S1 S2 G1 T3 S0 T2‘ S1‘ S2‘ G1‘ T3‘ Introduction SceneGraph SceneTree ShapeList Renderer
  • 8.
    8 T3‘ T1 T2‘T3 SceneTree validate attributecache  Walk through dirty vector — Node marked dirty -> search top dirty — Validate subtree from top dirty  Validation example — T3 dirty, traverse up to root node  T3 top dirty node, validate T3 subtree — T3‘ dirty, traverse up to root node  T1 top dirty node, validate T1 subtree — T1 not dirty  No work to do Dirty vector T1T3 T3‘ G0 T0 T1 T2 S1 S2 G1 T3 S0 T3‘ S1‘ S2‘ G1‘ T3 T1T3‘ Introduction SceneGraph SceneTree ShapeList Renderer
  • 9.
    9 SceneTree to ShapeList Add Events for ShapeList generation – addShape(Shape) – removeShape(Shape) ShapeList S1 S2S0 S1‘ S2‘ Gi Group Ti Transform Si Shape G0 T0 T1 T2 S1 S2 G1 T3 S0 T2‘ S1‘ S2‘ G1‘ T3‘ Introduction SceneGraph SceneTree ShapeList Renderer
  • 10.
    10  SceneGraph toSceneTree synchronization – Store accumulated data per node instance  SceneTree to ShapeList synchronization – Avoid SceneTree traversal  Next: Efficient data structure for renderer based on ShapeList Summary Introduction SceneGraph SceneTree ShapeList Renderer
  • 11.
    11 Renderer Data Structures Program Shader Vertex Shader Fragment ParameterDescription Camera ParameterDescription Light ParameterDescription Matrices ParameterDescription Material ambient diffuse specular texture NameType Arraysize vec3 vec3 vec3 Sampler 2 2 2 0 ParamaterDescriptionShape Program ‚colored‘ Geometry Shape1 ParameterData Camera1 ParameterData Lightset 1 ParameterData Transform 1 ParameterData red Introduction SceneGraph SceneTree ShapeList Renderer
  • 12.
    12 Example Parameter Grouping Shaderindependent globals, i.e. camera Object parameters, i.e. position/rotation/scaling Material handles, i.e. textures and buffers Material raw values, i.e. float, int and bool Light, i.e. light sources and shadow maps Shader dependent globals, i.e. environment map always frequent constant rare Introduction SceneGraph SceneTree ShapeList Renderer Parameters Frequency
  • 13.
    13 Shapes coloredGroup by Program shapelist Rendering structures ‚colored‘ ‚textured‘‚colored‘ ‚colored‘ ‚textured‘ ‚colored‘ ‚colored‘‚colored‘ Shapes textured ‚textured‘ ‚textured‘ Introduction SceneGraph SceneTree ShapeList Renderer Sort by ParameterData
  • 14.
    14 ParameterData Cache  Cacheis a big char[] with all ParameterData.  ParameterData are sorted by first usage.  Parameters are converted to Target-API datatype, i.e. — Int8 to int32, TextureHandle to bindless texture...  Updating parameters is only playback of data in memory, no conditionals.  Filter for used parameters to reduce cache size Parameters colored red blue Parameters textured wood marble Introduction SceneGraph SceneTree ShapeList Renderer
  • 15.
    15 Vertex Attribute Cache Big char[] with vertex attribute pointers — Bindless pointers, VBOs or VAB streams  Each set of attributes stored only once  Ordered by first usage  Attributes required by program are known — Store only used attributes in Cache — Useful for special passes like depth pass where only pos is required attributes colored pos normal pos normal pos normal Introduction SceneGraph SceneTree ShapeList Renderer
  • 16.
    16 Renderer Cache complete Parameters colored Phongred Phong blue Shapes colored ‚colored‘ ‚colored‘‚colored‘ Attributes colored pos normal pos normal pos normal Introduction SceneGraph SceneTree ShapeList Renderer foreach(shape) { if (visible(shape)) { if (changed(parameters)) render(parameters); if (changed(attributes)) render(attributes); render(shape); } }
  • 17.
    17  CPU boundednessimproved (application) – Recomputation of attributes (transforms) – Deep hierarchies: traversal expensive – Unsorted rendering, lot of state changes  CPU boundedness remaining (OpenGL usage) – Large objects split up into a lot of little pieces, increased draw call count Achievements ShapeList Renderer SceneTree RendererCache OpenGL implementation Renderer
  • 18.
    18  Avoid dataredundancy – Data stored once, referenced multiple times – Update only once (less host to gpu transfers)  Increase batching potential – Further cuts api calls – Less driver CPU work  Minimize CPU/GPU interaction – Allow GPU to update its own data – Lower api usage when scene is changed little – E.g. GPU-based culling Enabling Hardware Scalability
  • 19.
    19  Avoids classicSceneGraph design  Geometry – Vertex/IndexBuffer – BoundingBox – Divided into parts (CAD features)  Material  Node Hierarchy  Object – Node and Geometry reference – For each Geometry part  Material reference  Enabled state model courtesy of PTC OpenGL Research Framework  99000 total parts, 3.8 Mtris, 5.1 Mverts  700 geometries, 128 materials  2000 objects
  • 20.
    20  Kepler QuadroK5000, i7  vbo bind and drawcall per part, i.e. 99 000 drawcalls scene draw time > 38 ms (CPU bound)  vbo bind per geometry, drawcalls per part scene draw time > 14 ms (CPU bound)  All subsequent techniques raise perf significantly scene draw time < 6 ms 1.8 ms with occlusion culling Performance baseline
  • 21.
    21  MultiDraw (1.x) –Render ranges from current VBO/IBO – Single drawcall for many distinct objects – Reduces overhead for low complexity objects  ARB_draw_indirect (4.x)  ARB_multi_draw_indirect – Store drawcall information on GPU or HOST – Let GPU create/modify GPU buffers Drawcall Reduction DrawElementsIndirect { GLuint count; GLuint instanceCount; GLuint firstIndex; GLint baseVertex; GLuint baseInstance; }
  • 22.
    22 – All usemultidraw capabilites to render across gaps – BATCHED use CPU generated list of combined parts with same state  Object‘s part cache must be rebuilt based on material/enabled state – INDIVIDUAL stay on per-part level  No caches, can update assignment or cmd buffers directly Drawing Techniques a b c a+b c Parts with different materials in geometry Grouped and „grown“ drawcalls Single call, encode material/matrix assignment via vertex attribute
  • 23.
    23  Group parametersby frequency of change  Generating shader strings allows different storage backend for „uniforms“ Parameters Effect "Phong { Group „material" (many) { vec4 "ambient" vec4 "diffuse" vec4 "specular" } Group „view" (few) { vec4 „viewProjTM„ } Group „object" (many) { mat4 „worldTM„ } ... Code ... }  OpenGL 2 uniforms  OpenGL 3,4 buffers  NVIDIA bindless technology...
  • 24.
    24  GL2 approach: –Avoid many small uniforms – Arrays of uniforms, grouped by frequency of update, tightly-packed Parameters uniform mat4 worldMatrices[2]; uniform vec4 materialData[8]; #define matrix_world worldMatrices[0] #define matrix_worldIT worldMatrices[1] #define material_diffuse materialData[0] #define material_emissive materialData[1] #define material_gloss materialData[2].x // GL3 can use floatBitsToInt and friends // for free reinterpret casts within // macros ... wPos = matrix_world * oPos; ... // in fragment shader color = material_diffuse + material_emissive; ...
  • 25.
    25  GL4 approach: –TextureBufferObject (TBO) for matrices – UniformBufferObject (UBO) with array data to save costly binds – Assignment indices passed as vertex attribute Parameters in vec4 oPos; uniform samplerBuffer matrixBuffer; uniform materialBuffer { Material materials[512]; }; in ivec2 vAssigns; flat out ivec2 fAssigns; // in vertex shader fAssigns = vAssigns; worldTM = getMatrix (matrixBuffer, vAssigns.x); wPos = worldTM * oPos; ... // in fragment shader color = materials[fAssigns.y].color; ...
  • 26.
    26 setupSceneMatrixAndMaterialBuffer (scene); foreach (objin scene) { if ( isVisible(obj) ) { setupDrawGeometryVertexBuffer (obj); // iterate over different materials used foreach ( batch in obj.materialCaches) { glVertexAttribI2i (indexAttr, batch.materialIndex, matrixIndex); glMultiDrawElements (GL_TRIANGLES, batch.counts, GL_UNSIGNED_INT , batch.offsets,batched.numUsed); } } } OpenGL 4.x approach
  • 27.
    27 glVertexAttribDivisor == 0: VArray[ gl_VertexID + baseVertex ] glVertexAttribDivisor != 0 : VArray[ gl_InstanceID / VDivisor + baseInstance ] VArray[ 0 / 1 + baseInstance ] Material & Matrix Index VertexBuffer (divisor:1) Position & Normal VertexBuffer (divisor:0) ... instanceCount = 1 baseInstance = 0 ... instanceCount = 1 baseInstance = 1 MultiDrawIndirect Buffer Per drawcall vertex attribute vertex attributes fetched for last vertex in second drawcall baseinstance = 1
  • 28.
    28 OpenGL 4.2+ indirectapproach ... foreach ( obj in scene.objects ) { ... // instead of glVertexAttribI2i calls and a loop // we use the baseInstance for the attribute // bind special assignment buffer as vertex attribute glBindBuffer ( GL_ARRAY_BUFFER, obj->assignBuffer); glVertexAttribIPointer (indexAttr, 2, GL_INT, . . . ); // draw everything in one go glMultiDrawElementsIndirect ( GL_TRIANGLES, GL_UNSIGNED_INT, obj->indirectOffset, obj->numIndirects, 0 ); }
  • 29.
    29  ARB_vertex_attrib_binding (VAB) – Avoidsmany buffer changes – Separates format from data – Bind multiple vertex attributes to one buffer  NV_vertex_buffer_unified_ memory (VBUM) – Allows very fast switching through GPU pointers Vertex Setup /* setup once, similar to glVertexAttribPointer but with relative offset last */ glVertexAttribFormat (ATTR_NORMAL, 3, GL_FLOAT, GL_TRUE, offsetof(Vertex,normal)); glVertexAttribFormat (ATTR_POS, 3, GL_FLOAT, GL_FALSE, offsetof(Vertex,pos)); // bind to stream glVertexAttribBinding (ATTR_NORMAL, 0); glVertexAttribBinding (ATTR_POS, 0); // switch single stream buffer glBindVertexBuffer (0, bufID, 0, sizeof(Vertex)); // NV_vertex_buffer_unified_memory // enable once and set stride glEnableClientState (GL_VERTEX...NV);... glBindVertexBuffer (0, 0, 0, sizeof(Vertex)); // switch single buffer via pointer glBufferAddressRangeNV (GL_VERTEX...,0,bufADDR, bufSize);
  • 30.
    30 0 200 400 600 800 1000 VBO VAB VAB+VBUMBINDLESS INDIRECT HOST VAB+VBUM BINDLESS INDIRECT GPU VAB+VBUM Timeinmicroseocnds[us] – Vertex/Index setup inside MultiDrawIndirect command NV_bindless_multidraw_indirect one GL call to draw entire scene GPU benefit depends on triangles per drawcall (> ~ 500) NV_bindless_multidraw_indirect ~ 2400 drawcalls, GL4 BATCHED style Lower is Better Effect on CPU time
  • 31.
    31 0 1000 2000 3000 4000 5000 6000 K KB KKB K KB K KB GL4 INDIRECTHOST INDIVIDUAL GL4 INDIRECTGPU INDIVIDUAL GL4 BATCHED GL2 BATCHED Timeinmicroseconds[us] Bindless (green) always reduces CPU, and may help framerate/GPU a bit K = Kepler 5000, regular VBO KB = Kepler 5000, VBUM + VAB Lower is Better GPU CPU 99.000 2.400 hw drawcalls 2.000 2.400 sw drawcalls
  • 32.
    32 0 1000 2000 3000 4000 5000 6000 K KB KKB K KB K KB GL4 INDIRECTHOST INDIVIDUAL GL4 INDIRECTGPU INDIVIDUAL GL4 BATCHED GL2 BATCHED Timeinmicroseconds[us] GPU CPU MultiDrawIndirect achieves almost 20 Mio drawcalls per second (2000 VBO changes, „only“ 1/3 perf lost). GPU-buffered commands save lots of CPU time Lower is Better 99.000 2.400 hw drawcalls 2.000 2.400 sw drawcalls Scene-dependent! INDIVIDUAL could be as fast if enough work per drawcall K = Kepler 5000, regular VBO KB = Kepler 5000, VBUM + VAB
  • 33.
    33 0 1000 2000 3000 4000 5000 6000 K KB KKB K KB K KB GL4 INDIRECTHOST INDIVIDUAL GL4 INDIRECTGPU INDIVIDUAL GL4 BATCHED GL2 BATCHED Timeinmicroseconds[us] GL2 uniforms beat paletted UBO a bit in GPU, but are slower on CPU side. (1 glUniform call with 8x vec4, vs indexed UBO) Lower is Better GPU CPU Scene-dependent! GL4 better when more materials changed per object K = Kepler 5000, regular VBO KB = Kepler 5000, VBUM + VAB 99.000 2.400 hw drawcalls 2.000 2.400 sw drawcalls
  • 34.
    34  Recap
        ▪ Share geometry buffers for batching
        ▪ Group parameters for fast updating
        ▪ MultiDraw/Indirect for keeping objects independent or removing additional loops
          – baseInstance to provide unique index/assignments for drawcall
        ▪ Bindless to reduce validation overhead/add flexibility
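The recap points about shared geometry buffers and baseInstance can be sketched on the CPU side as follows. This is a minimal illustration, not the presenters' code: the record layout is the standard one consumed by glMultiDrawElementsIndirect, while `MeshRange` and `build_batched_commands` are hypothetical names introduced here, and the meshes are assumed to sit back to back in one vertex buffer and one index buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Record layout read by glMultiDrawElementsIndirect. */
typedef struct {
    uint32_t count;         /* number of indices to draw */
    uint32_t instanceCount; /* 0 skips the draw, 1 draws it once */
    uint32_t firstIndex;    /* offset into the shared index buffer, in indices */
    int32_t  baseVertex;    /* value added to every fetched index */
    uint32_t baseInstance;  /* used here as the per-draw object index */
} DrawElementsIndirectCommand;

/* Hypothetical description of one mesh stored in the shared buffers. */
typedef struct {
    uint32_t indexCount;
    uint32_t vertexCount;
} MeshRange;

/* Fill cmds[0..n) for meshes packed back to back in shared buffers.
 * Returns the total number of indices covered by all commands. */
size_t build_batched_commands(const MeshRange *meshes, size_t n,
                              DrawElementsIndirectCommand *cmds)
{
    assert(meshes != NULL && cmds != NULL);
    uint32_t firstIndex = 0;
    int32_t baseVertex = 0;
    for (size_t i = 0; i < n; ++i) {
        cmds[i].count = meshes[i].indexCount;
        cmds[i].instanceCount = 1;
        cmds[i].firstIndex = firstIndex;
        cmds[i].baseVertex = baseVertex;
        cmds[i].baseInstance = (uint32_t)i; /* unique index per drawcall */
        firstIndex += meshes[i].indexCount;
        baseVertex += (int32_t)meshes[i].vertexCount;
    }
    return firstIndex;
}
```

The filled array would be uploaded to an indirect buffer and submitted with a single multi-draw call. How the shader turns baseInstance into a matrix or material lookup is not shown; one common option is an instanced vertex attribute.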
  • 35.
    35  GPU Culling Basics
        ▪ GPU-friendly processing
          – Matrix and bbox buffer, object buffer
          – XFB/Compute or "invisible" rendering
          – Vs. old techniques: single GPU job for ALL objects!
        ▪ Results
          – "Readback" GPU to Host
            ▪ Can use GPU to pack into bit stream
          – "Indirect" GPU to GPU
            ▪ Set DrawIndirect's instanceCount to 0 or 1
        [Figure: per-object visibility results 0,1,0,1,1,1,0,0,0]
        buffer cmdBuffer{ Command cmds[]; };
        ...
        cmds[obj].instanceCount = visible;
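Both result paths can be mimicked on the CPU to make the data flow concrete. The sketch below is an assumption-laden illustration: `Command` is given the standard indirect-draw layout, and `apply_visibility_indirect` and `pack_visibility_bits` are hypothetical helper names. In the real pipeline this work runs on the GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Indirect draw record; only instanceCount is touched by culling. */
typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;
} Command;

/* "Indirect" path: write 0 or 1 into each command's instanceCount. */
void apply_visibility_indirect(const uint8_t *visible, size_t n, Command *cmds)
{
    assert(visible != NULL && cmds != NULL);
    for (size_t obj = 0; obj < n; ++obj) {
        cmds[obj].instanceCount = visible[obj] ? 1u : 0u;
    }
}

/* "Readback" path: pack one bit per object into 32-bit words.
 * words must hold (n + 31) / 32 entries. Returns the word count. */
size_t pack_visibility_bits(const uint8_t *visible, size_t n, uint32_t *words)
{
    assert(visible != NULL && words != NULL);
    size_t wordCount = (n + 31) / 32;
    memset(words, 0, wordCount * sizeof(uint32_t));
    for (size_t obj = 0; obj < n; ++obj) {
        if (visible[obj]) {
            words[obj / 32] |= (uint32_t)1 << (obj % 32);
        }
    }
    return wordCount;
}
```

For the pattern 0,1,0,1,1,1,0,0,0 shown on the slide, objects 1, 3, 4 and 5 keep instanceCount 1, and the packed word is 0x3A. Packing shrinks the readback to one bit per object.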
  • 36.
    36  Occlusion Culling
        ▪ OpenGL 4.2+
          – Depth-Pass
          – Raster "invisible" bounding boxes
            ▪ Disable Color/Depth writes
            ▪ Geometry Shader to create the three visible box sides
            ▪ Depth buffer discards occluded fragments (earlyZ...)
            ▪ Fragment Shader writes output: visible[objindex] = 1
        [Figure: depth buffer; passing bbox fragments enable the object]
        Algorithm by Evgeny Makarov, NVIDIA
        // GLSL fragment shader
        // from ARB_shader_image_load_store
        layout(early_fragment_tests) in;
        buffer visibilityBuffer{ int visibility[]; };
        flat in int objID;
        void main(){ visibility[objID] = 1; }
        // buffer would have been cleared
        // to 0 before
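The logic of the pass can be emulated in software to show what the early depth test and the fragment shader do together. This is only a CPU sketch under stated assumptions: `BoxFragment` and `mark_visible_objects` are hypothetical names, the depth comparison is assumed to be less-or-equal with smaller values closer, and the rasterizer that produces the box fragments is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one rasterized bounding-box fragment. */
typedef struct {
    int x, y;     /* pixel position */
    float depth;  /* fragment depth, smaller is closer (assumption) */
    int objID;    /* object the bounding box belongs to */
} BoxFragment;

/* visibility[] must have been cleared to 0 before the pass.
 * Depth writes are disabled, so depthBuffer is only read. */
void mark_visible_objects(const float *depthBuffer, int width, int height,
                          const BoxFragment *frags, size_t fragCount,
                          int *visibility)
{
    for (size_t i = 0; i < fragCount; ++i) {
        const BoxFragment *f = &frags[i];
        assert(f->x >= 0 && f->x < width && f->y >= 0 && f->y < height);
        size_t pixel = (size_t)f->y * (size_t)width + (size_t)f->x;
        /* Early depth test: occluded fragments never reach the "shader". */
        if (f->depth <= depthBuffer[pixel]) {
            visibility[f->objID] = 1; /* same write as the GLSL main() */
        }
    }
}
```

A single passing fragment is enough to enable an object, which is why the shader can write a constant 1 without synchronization.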
  • 37.
    37  Temporal Coherence
        ▪ Exploit that the majority of objects don't change much relative to the camera
        ▪ Draw each object only once (vertex/drawcall-bound)
          – Render last visible, fully shaded: (last)
          – Test all against current depth: (visible)
          – Render newly added visible, none if no spatial changes made: (~last) & (visible)
          – (last) = (visible)
        [Figure: frame f – 1 and frame f after the camera moved. Labels: last visible, bboxes occluded, bboxes pass depth (visible), new visible, invisible, visible.]
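The bit logic of the four steps can be written down directly. The sketch below assumes one bit per object stored in 32-bit words. `temporal_coherence_step` is a hypothetical helper that covers only the mask bookkeeping; drawing and depth testing are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 32-bit word. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Call after (last) was drawn and all bboxes were tested into (visible).
 * Writes (~last) & (visible) to newlyVisible, then sets (last) = (visible).
 * Returns the number of newly visible objects still to be drawn. */
size_t temporal_coherence_step(uint32_t *last, const uint32_t *visible,
                               uint32_t *newlyVisible, size_t wordCount)
{
    assert(last != NULL && visible != NULL && newlyVisible != NULL);
    size_t newCount = 0;
    for (size_t w = 0; w < wordCount; ++w) {
        newlyVisible[w] = ~last[w] & visible[w];
        newCount += popcount32(newlyVisible[w]);
        last[w] = visible[w];
    }
    return newCount;
}
```

With a static camera and scene, the second call with the same visible mask returns 0, which is the "none, if no spatial changes made" case.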
  • 38.
    38  Culling Readback vs Indirect
        [Chart: GPU and CPU time in microseconds [us], 0 to 2500, lower is better, GL4 BATCHED style. Bars: readback (37% faster with culling), indirect (33% faster with culling), NVindirect (37% faster with culling).]
        ▪ For readback results, the CPU has to wait for GPU idle
        ▪ In the "draw new visible" phase, indirect cannot benefit from "nothing to setup/draw" in advance and still processes "empty" lists
        ▪ NV_bindless_multidraw_indirect saves CPU and a bit of GPU time
        ▪ Scene-dependent, i.e. triangles per drawcall and # of "invisible"
  • 39.
    39  Culling Results
        ▪ Temporal culling very useful for object/vertex-boundedness
          – Can also apply for Z-pass...
        ▪ Readback vs Indirect
          – Readback variant "easier" to be faster (no setups...), but syncs!
          – NV_bindless_multidraw benefit depends on scene (VBO changes and primitives per drawcall)
        ▪ Working towards GPU autonomous system
          – (NV_bindless)/ARB_multidraw_indirect as mechanism for GPU creating its own work, research and feature work in progress
  • 40.
  • 41.
    41  NVIDIA Bindless Technology
        ▪ Family of extensions to use native handles/addresses
        ▪ NV_vertex_buffer_unified_memory
        ▪ NV_bindless_multidraw_indirect
        ▪ NV_shader_buffer_load/store
          – Pointers in GLSL
        ▪ NV_bindless_texture
          – No more unit restrictions
          – References inside buffers
        // GLSL with true pointers
        uniform MyStruct* mystructs;
        // API
        glUniformui64NV (bufferLocation, bufferADDR);
        texHDL = glGetTextureHandleNV (tex);
        // later instead of glBindTexture
        glUniformHandleui64NV (texLocation, texHDL);
        // GLSL
        // can also store textures in resources
        uniform materialBuffer { sampler2D manyTextures [LARGE]; };
  • 42.
    42  Culling Readback vs Indirect
        [Chart: time in microseconds, 0 to 2500. Bars: Readback GPU, Readback CPU, Indirect GPU, Indirect CPU, NVIndirect GPU, NVIndirect CPU. Stacked phases: 1. Frustum Cull, 2. Update Internals, 3. Draw Last Visible, 4. Occlusion Cull, 5. Update Internals, 6. Draw New Visible. Frame rates in bar order: 432 fps with culling, 315 without; 387 fps with culling, 289 without; 429 fps with culling, 313 without.]
        ▪ For readback results, the CPU has to wait for GPU idle
        ▪ Nothing "new" to draw, but the CPU doesn't know and still sets things up; the GPU runs through an "empty" cmd buffer
        ▪ Special bindless indirect version can save lots of CPU and a bit of GPU cost for drawing the scene with a single big cmd buffer
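The frame rates on this slide can be turned into the relative gains quoted on the earlier readback-vs-indirect chart. The helpers below are a small worked calculation, not part of the original material; the function names are hypothetical, and the three fps pairs are assumed to be listed in the order readback, indirect, NVindirect.

```c
#include <assert.h>

/* Relative gain of culling in percent: (fpsWith / fpsWithout - 1) * 100. */
double culling_speedup_percent(double fpsWith, double fpsWithout)
{
    assert(fpsWith > 0.0 && fpsWithout > 0.0);
    return (fpsWith / fpsWithout - 1.0) * 100.0;
}

/* Frame time in microseconds for a given frame rate. */
double frame_time_us(double fps)
{
    assert(fps > 0.0);
    return 1.0e6 / fps;
}
```

For 432 vs 315 fps this gives about 37.1 percent, for 387 vs 289 fps about 33.9 percent, and for 429 vs 313 fps about 37.1 percent, which matches the 37%, 33% and 37% quoted earlier if the middle value is truncated rather than rounded.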