Name

    NV_mesh_shader

Name String

    GL_NV_mesh_shader

Contact

    Christoph Kubisch, NVIDIA (ckubisch 'at' nvidia.com)
    Pat Brown, NVIDIA (pbrown 'at' nvidia.com)

Contributors

    Yury Uralsky, NVIDIA
    Daniel Koch, NVIDIA
    Sahil Parmar, NVIDIA

Status

    Shipping

Version

    Last Modified Date:     March 6, 2019
    NVIDIA Revision:        7

Dependencies

This extension can be applied to OpenGL GLSL versions 4.50 (#version 450) and higher.

This extension can be applied to OpenGL ES ESSL versions 3.20 (#version 320) and higher.

This extension is written against the GLSL 4.50.6 Specification (Compatibility Profile), dated April 14, 2016.

This extension interacts with GLSL 4.60 and KHR_vulkan_glsl.

This extension interacts with NV_viewport_array2.

This extension interacts with NV_stereo_view_rendering.

This extension interacts with NVX_multiview_per_view_attributes.

This extension interacts with ARB_shader_draw_parameters.

This extension interacts with EXT_clip_cull_distance.

Overview

This extension provides a new mechanism allowing applications to use two new programmable shader types -- the task and mesh shader -- to generate collections of geometric primitives to be processed by fixed-function primitive assembly and rasterization logic. When the task and mesh shaders are dispatched, they replace the standard programmable vertex processing pipeline, including vertex array attribute fetching, vertex shader processing, tessellation, and the geometry shader processing.

Both new shader types have execution environments similar to that of compute shaders, where a collection of shader invocations form a work group and cooperate to produce a set of outputs. Unlike traditional vertex, tessellation, and geometry shaders that typically process a vertex or primitive at a time, the mesh and task shaders process and generate a batch of primitives at once.

The optional task shader pre-processes geometry and generates a variable number of mesh shader tasks.
The mesh shader evaluates the geometry corresponding to its task and emits a mesh -- a collection of vertices arranged into point, line, or triangle primitives. The primitives emitted by the mesh shader are then processed by fixed-function primitive assembly and rasterization logic and generate fragments that will be processed by the fragment shader.

Work is submitted to the mesh pipeline by an API launch that spawns a one-dimensional array of tasks, much as the API dispatch for compute spawns a three-dimensional array of compute shader work groups. If a task shader is present, each task generated by this launch spawns a task shader work group. If no task shader is present, each task generated by the launch spawns a mesh shader work group.

When a task shader work group is executed, its invocations execute in parallel and evaluate geometry associated with the task. The task shader has no built-in or user-defined input variables other than the built-ins identifying the work group and invocation being executed. The task shader can use that information to read properties of the geometry associated with the task from memory, using shader storage buffers, textures, or other resources. The task shader determines the number of mesh shader tasks that should be spawned for the task it is processing and writes the task count to the built-in variable gl_TaskCountNV. Additionally, the task shader can compute further properties of the geometry it processes and write them to task memory through user-defined output variables qualified with "taskNV"; these values can be read as inputs by all of the mesh shaders that it spawns. The task shader can be used to drive level-of-detail calculations for procedurally generated geometry, to perform coarse-level culling for batches of static or dynamic geometry, and for other forms of work reduction or amplification.
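The coarse-level culling use case described above can be sketched as follows. This is an illustrative sketch only, not part of the extension: the buffer "MeshletBounds", the uniform block "CullParams", the helper sphereVisible(), and the choice of 32 meshlets per task are all hypothetical, and the frustum planes are assumed to point inward.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

// One task shader work group of 32 invocations per launched task.
layout(local_size_x = 32) in;

// Hypothetical application data: one bounding sphere per meshlet.
layout(std430, binding = 0) readonly buffer MeshletBounds {
    vec4 spheres[];              // xyz = center, w = radius (assumed layout)
};

// Hypothetical culling parameters.
layout(std140, binding = 0) uniform CullParams {
    vec4 frustumPlanes[6];       // inward-facing planes (assumption)
    uint meshletCount;
};

// Task memory: readable by every mesh shader work group this task spawns.
taskNV out Task {
    uint meshletIDs[32];
} OUT;

shared uint survivorCount;

// Conservative sphere-vs-frustum test.
bool sphereVisible(vec4 s)
{
    for (int i = 0; i < 6; ++i) {
        if (dot(frustumPlanes[i].xyz, s.xyz) + frustumPlanes[i].w < -s.w)
            return false;
    }
    return true;
}

void main()
{
    uint meshlet = gl_GlobalInvocationID.x;

    if (gl_LocalInvocationIndex == 0u)
        survivorCount = 0u;
    memoryBarrierShared();
    barrier();

    // Each invocation tests one meshlet and appends survivors to task memory.
    if (meshlet < meshletCount && sphereVisible(spheres[meshlet])) {
        uint slot = atomicAdd(survivorCount, 1u);
        OUT.meshletIDs[slot] = meshlet;
    }
    memoryBarrierShared();
    barrier();

    // A single invocation publishes the number of mesh tasks to spawn.
    if (gl_LocalInvocationIndex == 0u)
        gl_TaskCountNV = survivorCount;
}
```

Compacting the surviving IDs into task memory means each spawned mesh shader work group handles exactly one visible meshlet, so culled meshlets cost no mesh shader work.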
When a mesh shader work group is executed, its invocations execute in parallel to evaluate geometry corresponding to its task and emit a mesh for further processing by subsequent pipeline stages. As with task shaders, mesh shaders have no built-in inputs other than those identifying the work group and invocation being executed, and must fetch their inputs explicitly from memory. The mesh shader invocations collectively must produce a mesh, which consists of:

  * a primitive count, written to the built-in output gl_PrimitiveCountNV;

  * a collection of vertex attributes, where each vertex in the mesh has a set of built-in and user-defined per-vertex output variables and blocks;

  * a collection of primitive attributes, where each of the gl_PrimitiveCountNV primitives in the mesh has a set of built-in and user-defined per-primitive output variables and blocks; and

  * an array of vertex index values written to the built-in output array gl_PrimitiveIndicesNV, where each output primitive has a set of one, two, or three indices that identify the output vertices in the mesh used to form the primitive.

The number of primitives and vertices emitted by the mesh shader can be variable, but the mesh shader must specify maximum vertex and primitive counts. There are implementation-dependent limits on the number of vertices and primitives emitted by the mesh shader, and there are also implementation-dependent limits on the total amount of memory consumed by a mesh. In the initial implementation of this extension, implementation limits are sufficiently low that complex geometry will need to be decomposed into multiple tasks.

A typical mesh shader used to render static triangle data might operate in three phases. The first phase fetches vertex position data and local index data of the primitives that the mesh represents. The index data would have been prepared offline to leverage vertex re-use within the mesh. In the second phase, triangles would be culled and output primitive indices written.
Finally, other vertex attributes of the surviving subset of vertices would be loaded and computed. During this process, the invocations would sometimes work on a per-vertex and sometimes on a per-primitive level. Additionally, mesh shaders include infrastructure to allow a single mesh shader work group to compute a mesh with multiple "views" (e.g., left and right eye views for stereoscopic rendering), using a "view index" similar to the view IDs used in the OVR_multiview (OpenGL and OpenGL ES) and VK_KHR_multiview (Vulkan) extensions. Unlike those extensions, the programming model here does not run separate shader invocations for each view but instead allows shaders to designate individual outputs as "per-view". When a mesh shader completes, its primitives will be processed separately for each view with fragments directed at separate layers of the framebuffer. For each view, outputs designated as per-view (such as position) will take on values written for that view and all other outputs will take on a single shared value written for all views. 
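To make the four components of a mesh listed in the overview concrete (primitive count, per-vertex attributes, per-primitive attributes, and index array), here is a minimal sketch of a mesh shader that emits one hard-coded triangle. The output names vertexColor and triangleNormal and the constant tables are illustrative choices; a real shader would fetch its data from memory.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

// A single invocation is enough for one triangle.
layout(local_size_x = 1) in;
layout(triangles, max_vertices = 3, max_primitives = 1) out;

// User-defined per-vertex output: one instance per mesh vertex.
layout(location = 0) out vec3 vertexColor[];

// User-defined per-primitive output: one instance per mesh primitive.
layout(location = 1) perprimitiveNV out vec3 triangleNormal[];

// Illustrative constant data.
const vec2 kPositions[3] = vec2[3](vec2(-0.5, -0.5),
                                   vec2( 0.5, -0.5),
                                   vec2( 0.0,  0.5));
const vec3 kColors[3]    = vec3[3](vec3(1.0, 0.0, 0.0),
                                   vec3(0.0, 1.0, 0.0),
                                   vec3(0.0, 0.0, 1.0));

void main()
{
    for (uint v = 0u; v < 3u; ++v) {
        // Built-in and user-defined per-vertex attributes.
        gl_MeshVerticesNV[v].gl_Position = vec4(kPositions[v], 0.0, 1.0);
        vertexColor[v] = kColors[v];

        // Three indices per triangle, referring to the mesh's own vertices.
        gl_PrimitiveIndicesNV[v] = v;
    }

    // Per-primitive attribute for primitive 0.
    triangleNormal[0] = vec3(0.0, 0.0, 1.0);

    // Number of primitives actually emitted (at most max_primitives).
    gl_PrimitiveCountNV = 1u;
}
```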
[Figure: Mesh Processing Pipeline. Work launched from the application ("Launch Mesh Tasks", Fig 3.1) flows through the optional Task Shader, Task Generation, and the Mesh Shader into Rasterization, where it joins primitives from the Conventional Vertex Pipeline, and then through the Fragment Shader and Per-Fragment Operations to the Framebuffer. The Task, Mesh, and Fragment Shaders can access Image Load/Store, Atomic Counter, Shader Storage, Texture Fetch, and Uniform Block resources.]

Mapping to SPIR-V
-----------------

For informational purposes (non-normative), the following is an expected way for an implementation to map GLSL constructs to SPIR-V constructs:

    task shader -> TaskNV Execution model
    mesh shader -> MeshNV Execution model

    shared qualifier -> Workgroup Storage Class (existing)

    points layout qualifier -> OutputPoints Execution Mode (existing)
    lines layout qualifier -> OutputLinesNV Execution Mode
    triangles layout qualifier -> OutputTrianglesNV Execution Mode
    max_vertices layout qualifier -> OutputVertices Execution Mode (existing)
    max_primitives layout qualifier -> OutputPrimitivesNV Execution Mode
    local_size_(xyz) layout qualifiers -> LocalSize Execution Mode (existing)
    local_size_(xyz)_id layout qualifiers -> LocalSizeId Execution Mode (existing)

    perprimitiveNV auxiliary storage qualifier -> PerPrimitiveNV Decoration
    perviewNV auxiliary storage qualifier -> PerViewNV Decoration
    taskNV auxiliary storage qualifier -> PerTaskNV Decoration

    gl_WorkGroupSize -> WorkgroupSize decorated OpVariable (existing)
    gl_WorkGroupID -> WorkgroupId decorated OpVariable (existing)
    gl_LocalInvocationID -> LocalInvocationId decorated OpVariable (existing)
    gl_GlobalInvocationID -> GlobalInvocationId decorated OpVariable (existing)
    gl_LocalInvocationIndex -> LocalInvocationIndex decorated OpVariable (existing)

    gl_TaskCountNV -> TaskCountNV decorated OpVariable
    gl_PrimitiveCountNV -> PrimitiveCountNV decorated OpVariable
    gl_PrimitiveIndicesNV -> PrimitiveIndicesNV decorated OpVariable

    gl_Position -> Position decorated OpVariable (existing)
    gl_PositionPerViewNV -> PositionPerViewNV decorated OpVariable (existing extension)
    gl_PointSize -> PointSize decorated OpVariable (existing)
    gl_ClipDistance -> ClipDistance decorated OpVariable (existing)
    gl_ClipDistancePerViewNV -> ClipDistancePerViewNV decorated OpVariable
    gl_CullDistance -> CullDistance decorated OpVariable (existing)
    gl_CullDistancePerViewNV -> CullDistancePerViewNV decorated OpVariable

    gl_PrimitiveID -> PrimitiveId decorated OpVariable (existing)
    gl_Layer -> Layer decorated OpVariable (existing)
    gl_LayerPerViewNV -> LayerPerViewNV decorated OpVariable
    gl_ViewportIndex -> ViewportIndex decorated OpVariable (existing)
    gl_ViewportMask -> ViewportMaskNV decorated OpVariable (existing extension)
    gl_ViewportMaskPerViewNV -> ViewportMaskPerViewNV decorated OpVariable (existing extension)

    gl_MeshViewCountNV -> MeshViewCountNV decorated OpVariable
    gl_MeshViewIndicesNV -> MeshViewIndicesNV decorated OpVariable

    gl_DrawID -> DrawIndex decorated OpVariable (existing 1.3, extension)

    gl_MeshPerVertexNV -> block name, not needed
    gl_MeshPerPrimitiveNV -> block name, not needed

    writePackedPrimitiveIndices4x8NV -> OpWritePackedPrimitiveIndices4x8NV()

Modifications to the OpenGL Shading Language Specification, Version 4.50.6

Including the following line in a shader can be used to control the language features described in this extension:

    #extension GL_NV_mesh_shader : <behavior>

where <behavior> is as specified in section 3.3.

A new preprocessor #define is added to the OpenGL Shading Language:

    #define GL_NV_mesh_shader 1

Modify the introduction to Chapter 2, Overview of OpenGL Shading (p. 7)

(modify first paragraph)

... Currently, these processors are the vertex, tessellation control, tessellation evaluation, geometry, fragment, compute, task, and mesh processors.

(modify second paragraph)

...
The specific languages will be referred to by the name of the processor they target: vertex, tessellation control, tessellation evaluation, geometry, fragment, compute, task, or mesh. Insert new sections at the end of Chapter 2 (p. 9) Section 2.7, Task Processor The task processor is a programmable unit that operates in conjunction with the mesh processor to produce a collection of primitives that will be processed by subsequent stages of the graphics pipeline. The task and mesh processors form a primitive processing pipeline that can be used instead of the conventional primitive processing pipeline that includes the vertex, tessellation control, tessellation evaluation, and geometry processors. Compilation units written in the OpenGL Shading Language to run on this processor are called task shaders. When a set of task shaders is successfully compiled and linked, they result in a task shader executable that runs on the task processor. A task shader has access to many of the same resources as fragment and other shader processors, including textures, buffers, image variables, and atomic counters. The task shader has no fixed-function inputs other than variables identifying the specific work group and invocation; any vertex attributes or other data required by the task shader must be fetched from memory. The only fixed output of the task shader is a task count, identifying the number of mesh shader work groups to spawn. The task shader can write additional outputs to task memory, which can be read by all of the mesh shader work groups it spawns. A task shader operates on a group of work items called a work group. A work group is a collection of shader invocations that execute the same code, potentially in parallel. An invocation within a work group may share data with other members of the same work group through shared variables and issue memory and control barriers to synchronize with other members of the same work group. 
Section 2.8, Mesh Processor The mesh processor is a programmable unit that operates in conjunction with the task processor to produce a collection of primitives that will be processed by subsequent stages of the graphics pipeline. The task and mesh processors form a primitive processing pipeline that can be used instead of the conventional primitive processing pipeline that includes the vertex, tessellation control, tessellation evaluation, and geometry processors. Compilation units written in the OpenGL Shading Language to run on this processor are called mesh shaders. When a set of mesh shaders is successfully compiled and linked, they result in a mesh shader executable that runs on the mesh processor. A mesh shader has access to many of the same resources as fragment and other shader processors, including textures, buffers, image variables, and atomic counters. The only inputs available to the mesh shader are variables identifying the specific work group and invocation and any outputs written to task memory by the task shader that spawned the mesh shader's work group. Any vertex attributes or other data required by the mesh shader must be fetched from memory. The invocations of the mesh shader work group write an output mesh, comprising a set of primitives with per-primitive attributes, a set of vertices with per-vertex attributes, and an array of indices identifying the mesh vertices that belong to each primitive. The primitives of this mesh are then processed by subsequent graphics pipeline stages, where the outputs of the mesh shader form an interface with the fragment shader. A mesh shader operates on a group of work items called a work group. A work group is a collection of shader invocations that execute the same code, potentially in parallel. An invocation within a work group may share data with other members of the same work group through shared variables and issue memory and control barriers to synchronize with other members of the same work group. 
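The mesh shader's side of the task memory interface might look like the sketch below. It assumes a task shader that declares a matching "taskNV out Task" block, a hypothetical "Positions" storage buffer holding exactly 32 points per meshlet, and that gl_WorkGroupID.x enumerates the mesh tasks spawned by one task shader work group; none of these are mandated by the text above.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 32) in;
layout(points, max_vertices = 32, max_primitives = 32) out;

// Task memory input; type and qualification must match the task
// shader's output block (assumed to exist in the same program).
taskNV in Task {
    uint meshletIDs[32];
} IN;

// Hypothetical application data: 32 point positions per meshlet.
layout(std430, binding = 1) readonly buffer Positions {
    vec4 positions[];
};

void main()
{
    // Which meshlet this work group was spawned for (assumed indexing).
    uint meshlet = IN.meshletIDs[gl_WorkGroupID.x];
    uint lane    = gl_LocalInvocationIndex;

    // All vertex data is fetched explicitly from memory.
    gl_MeshVerticesNV[lane].gl_Position  = positions[meshlet * 32u + lane];
    gl_MeshVerticesNV[lane].gl_PointSize = 1.0;

    // One index per point primitive.
    gl_PrimitiveIndicesNV[lane] = lane;

    if (lane == 0u)
        gl_PrimitiveCountNV = 32u;
}
```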
Modify Section 3.6, Keywords (p. 18)

(add to the end of the list of keywords, p. 19)

    perprimitiveNV  perviewNV  taskNV

Modify Section 3.8.2, Dynamically Uniform Expressions and Uniform Control Flow (p. 21)

(modify third paragraph of this section)

An invocation group is the complete set of invocations collectively processing a particular compute, task, or mesh shader workgroup, or a graphical operation, where the scope ...

Modify Section 4.3, Storage Qualifiers (p. 43)

(modify table of base storage qualifiers, p. 43)

    Qualifier           Meaning
    ------------------  -----------------------------------------------
    shared              variable storage for compute, task, and mesh
                        shaders shared across all work items in a
                        local work group

(add to table of auxiliary storage qualifiers, p. 44)

    Auxiliary Storage
    Qualifier           Meaning
    ------------------  -----------------------------------------------
    perprimitiveNV      mesh shader outputs with per-primitive instances
    perviewNV           mesh shader outputs with per-view instances
    taskNV              generic outputs for task shader work groups

Modify Section 4.3.4, Input Variables (p. 46)

(modify third paragraph, p. 47, to treat all mesh shader outputs as "arrayed" interfaces)

Some inputs and outputs are arrayed ... Geometry shader inputs, tessellation control shader inputs and outputs, tessellation evaluation inputs, and mesh shader outputs all have an additional level of arrayness relative to other shader inputs and outputs. Component limits for these arrayed interfaces (e.g., gl_MaxTessControlInputComponents) are limits for a single instance and not for the entire interface.

(insert before the last paragraph, p. 47, "Fragment shader inputs get")

Task shaders do not permit user-defined input variables and do not form a formal interface with any previous shader stage. See section 7.1 "Built-In Variables" for a description of built-in task shader input variables.
All other input to a task shader is retrieved explicitly through image loads, texture fetches, loads from uniforms, uniform buffers, or shader storage buffers, or other user supplied code. Redeclaration of built-in input variables in task shaders is not permitted.

Mesh shaders form an interface with task shaders and support a collection of input variables in task memory. All user-defined mesh shader inputs must be declared as members of a single interface block qualified with the "taskNV" qualifier. Mesh shaders do not support user-defined inputs declared outside interface blocks or without "taskNV" and do not support more than one input interface block. In addition to user-defined inputs, mesh shaders support the built-in input variables described in section 7.1.

User-defined mesh shader input variables are filled with the values of matching user-defined output variables written by the task shader. As with other input variables, mesh shader inputs in task memory must be declared using the same type and qualification as task memory outputs from the previous (task) shader stage. It is a compile-time error to use the "taskNV" qualifier with inputs in any stage other than the mesh shader.

All other input to a mesh shader is retrieved explicitly through image loads, texture fetches, loads from uniforms, uniform buffers, or shader storage buffers, or other user supplied code. Redeclaration of built-in input variables in mesh shaders is not permitted.

(modify last paragraph, p. 47)

Fragment shader inputs get... The auxiliary storage qualifiers centroid, sample, and perprimitiveNV can also be applied, as well as...

(modify first paragraph, p. 48)

Fragment shader inputs that are signed or unsigned integers, integer vectors, or any double-precision floating-point type must be qualified with the interpolation qualifier flat or with the auxiliary storage qualifier perprimitiveNV.

(add a new example to the second paragraph, p. 
48)

    perprimitiveNV in vec3 triangleNormal;

(modify third paragraph, p. 48)

The fragment shader inputs form an interface with the mesh shader or last active shader in the conventional vertex processing pipeline (e.g., vertex, tessellation evaluation, geometry). ... Also, interpolation qualification (e.g., flat) and auxiliary qualification other than "perprimitiveNV" (e.g. centroid) may differ. ...

Modify Section 4.3.6, Output Variables (p. 49)

(modify last paragraph, p. 49 to add task and mesh shaders)

It is a compile-time error to declare a vertex, tessellation evaluation, tessellation control, geometry, task, or mesh shader output that contains any of the following: ...

(insert before the next-to-last paragraph "The order of execution", p. 50)

Task shader output variables may be used to write values in task memory that can be read by the mesh shader invocations for the tasks that it spawns. All user-defined task shader outputs must be declared as members of a single interface block qualified with the "taskNV" qualifier. Task shaders do not support user-defined outputs declared outside interface blocks or without "taskNV" and do not support more than one output block. It is a compile-time error to use the "taskNV" qualifier in output declarations in any other shader stage.

Mesh shader output variables may be used to write per-vertex or per-primitive data. Output variables qualified with "perprimitiveNV" have separate instances for each primitive in the output mesh; all other output variables have separate instances for each vertex in the output mesh. It is a compile-time error to use the "perprimitiveNV" qualifier in output declarations in any other shader stage. Both types of output variables are arrayed (see "arrayed" under 4.3.4, Inputs) and each per-vertex or per-primitive output variable (or output block, see interface blocks below) needs to be declared as an array.
For example,

    out float vertexColor[];                     // per-vertex color
    perprimitiveNV out vec3 triangleNormal[];    // per-triangle normal

Each element of such an array corresponds to one vertex or primitive of the output mesh. Each array can optionally have a size declared. The array size will be set by (or if provided must be consistent with) the output layout declaration(s) establishing the maximum number of vertices and primitives in the output mesh.

When checking a mesh shader against implementation limits on the total number of output variable components, the compiler adds the number of per-vertex outputs for a single vertex instance and the number of per-primitive outputs for a single primitive instance. Unlike tessellation control shaders, a mesh shader invocation may write to outputs for any vertex or primitive.

Mesh shader outputs qualified with "perviewNV" are considered to be per-view and arrayed with a second additional level of arrayness. Each non-block output variable must be declared as an array with at least two dimensions. For output block members, one level of arrayness applies to the block declaration and a second applies to the block member declaration. For example,

    perviewNV out float perViewVertexColor[][];

    out PerVertexBlock {
      perviewNV vec2 perViewTextureCoord[];
    } v[];

For non-block output variables, each element in the outer (leftmost) dimension of such an array corresponds to one vertex or primitive of the output mesh, as described immediately above. Each element in the second (next-to-leftmost) dimension corresponds to a single view of the output primitive or vertex. The array dimension corresponding to the view number can optionally have a size declared. The array size will be set to (or if provided must be consistent with) the maximum number of views supported by the implementation given by the constant gl_MaxMeshViewCountNV.
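A sketch of writing a per-view output follows. The output name viewTint and the colors are hypothetical. The view dimension is sized explicitly with gl_MaxMeshViewCountNV, and the example assumes that element i of the view dimension holds the value for the view listed in gl_MeshViewIndicesNV[i], a detail specified elsewhere in the extension rather than in this section.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 1) in;
layout(triangles, max_vertices = 3, max_primitives = 1) out;

// Per-view, per-vertex output: [vertex][view].
layout(location = 0) perviewNV out vec4 viewTint[][gl_MaxMeshViewCountNV];

const vec2 kPositions[3] = vec2[3](vec2(-0.5, -0.5),
                                   vec2( 0.5, -0.5),
                                   vec2( 0.0,  0.5));

void main()
{
    for (uint v = 0u; v < 3u; ++v) {
        // Shared by all views: a single value per vertex.
        gl_MeshVerticesNV[v].gl_Position = vec4(kPositions[v], 0.0, 1.0);

        // Separate value per active view (assumed indexing, see lead-in).
        for (uint i = 0u; i < gl_MeshViewCountNV; ++i) {
            bool firstView = (gl_MeshViewIndicesNV[i] == 0u);
            viewTint[v][i] = firstView ? vec4(1.0, 0.0, 0.0, 1.0)
                                       : vec4(0.0, 0.0, 1.0, 1.0);
        }

        gl_PrimitiveIndicesNV[v] = v;
    }
    gl_PrimitiveCountNV = 1u;
}
```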
When using per-view outputs, all view instances of per-view outputs count separately against implementation limits on the total number of output components. Additionally, values for extra views will be stored in the upper end of the set of available locations for mesh shader outputs. A compile- or link-time error will be generated if extra storage required for extra per-view outputs leaves the compiler unable to assign locations for all outputs or includes a location already consumed by an active output variable with an associated "location" layout qualifier. (modify the next-to-last and last paragraph, p. 50) The order of execution of tessellation control, task, and mesh shader invocations relative to the other invocations for the same input patch or local work group is undefined unless the built-in function barrier() is used to provide some control over relative execution order. When a shader invocation calls barrier(), ... Because tessellation control, task, and mesh shader invocations execute in undefined order between barriers, the values of output variables will sometimes be undefined. ... Modify Section 4.3.8, Shared Variables (p. 52) (modify first paragraph of the section, p. 52) The shared qualifier is used to declare variables that have storage shared between all work items in a compute, task, or mesh shader local work group. Variables declared as shared may only be used in compute, task, or mesh shaders. ... (modify last paragraph of the section, p. 52) There is a limit to the total size of all variables declared as shared in a single shader stage. This limit, expressed in units of basic machine units may be determined by using the OpenGL API to query the value of MAX_COMPUTE_SHARED_MEMORY_SIZE (compute shaders), MAX_TASK_SHARED_MEMORY_SIZE_NV (task shaders), or MAX_MESH_SHARED_MEMORY_SIZE_NV (mesh shaders) Modify Section 4.3.9, Interface Blocks, p. 52 (rework grammar rules, p. 
53, to allow "taskNV", "perprimitiveNV", and "perviewNV" to qualify blocks)

    interface-qualifier:
        in-block-qualifiers(_opt) "in"
        out-block-qualifiers(_opt) "out"
        uniform
        buffer

    // Note: Not shown for simplicity, but memory qualifiers may also be used

    in-block-qualifiers:
        patch
        taskNV
        perprimitiveNV

    out-block-qualifiers:
        out-block-qualifier
        out-block-qualifier out-block-qualifiers

    out-block-qualifier:
        patch
        taskNV
        perprimitiveNV
        perviewNV

Modify Section 4.4, Layout Qualifiers, p. 57

(modify the layout qualifier table, pp. 58-59)

    Layout Qualifier   | Qualifier | Individual | Block | Block  | Allowed interfaces
                       | only      | variable   |       | Member |
    -------------------+-----------+------------+-------+--------+--------------------
    local_size_x =     |           |            |       |        | compute in
    local_size_y =     |     X     |            |       |        | mesh in
    local_size_z =     |           |            |       |        | task in
    -------------------+-----------+------------+-------+--------+--------------------
    max_vertices =     |     X     |            |       |        | geometry out
                       |           |            |       |        | mesh out
    -------------------+-----------+------------+-------+--------+--------------------
    max_primitives =   |     X     |            |       |        | mesh out
    -------------------+-----------+------------+-------+--------+--------------------
    [ points ]         |           |            |       |        |
    [ lines ]          |     X     |            |       |        | mesh out
    [ triangles ]      |           |            |       |        |

Add new Section 4.4.1.5, Task Shader Inputs, p. 67

(note: the content of this section is nearly identical to the content of section 4.4.1.4, Compute Shader Inputs)

There are no layout location qualifiers for task shader inputs.

Layout qualifier identifiers for task shader inputs are the work group size qualifiers:

    layout-qualifier-id :
        local_size_x = integer-constant-expression
        local_size_y = integer-constant-expression
        local_size_z = integer-constant-expression

These task shader input layout qualifiers behave identically to the equivalent compute shader qualifiers and specify a fixed local group size used for each task shader work group. If no size is specified in any of the three dimensions, a default size of one will be used.
If the fixed local group size of the shader in any dimension is greater than the maximum size supported by the implementation for that dimension, a compile-time error results. Also, if such a layout qualifier is declared more than once in the same shader, all those declarations must set the same set of local workgroup sizes and set them to the same values; otherwise a compile-time error results. If multiple task shaders attached to a single program object declare a fixed local group size, the declarations must be identical; otherwise a link-time error results.

Furthermore, if a program object contains any task shaders, at least one must contain an input layout qualifier specifying a fixed local group size for the program, or a link-time error will occur.

Note that task shaders do not currently support multi-dimensional work groups; the maximum value for local_size_y and local_size_z will be one.

Add new Section 4.4.1.6, Mesh Shader Inputs, p. 67

(note: the content of this section is nearly identical to the content of section 4.4.1.4, Compute Shader Inputs)

There are no layout location qualifiers for mesh shader inputs.

Layout qualifier identifiers for mesh shader inputs are the work group size qualifiers:

    layout-qualifier-id :
        local_size_x = integer-constant-expression
        local_size_y = integer-constant-expression
        local_size_z = integer-constant-expression

These mesh shader input layout qualifiers behave identically to the equivalent compute shader qualifiers and specify a fixed local group size used for each mesh shader work group. If no size is specified in any of the three dimensions, a default size of one will be used. If the fixed local group size of the shader in any dimension is greater than the maximum size supported by the implementation for that dimension, a compile-time error results.
Also, if such a layout qualifier is declared more than once in the same shader, all those declarations must set the same set of local workgroup sizes and set them to the same values; otherwise a compile-time error results. If multiple mesh shaders attached to a single program object declare a fixed local group size, the declarations must be identical; otherwise a link-time error results. Furthermore, if a program object contains any mesh shaders, at least one must contain an input layout qualifier specifying a fixed local group size for the program, or a link-time error will occur. Note that mesh shaders do not currently support multi-dimensional work groups; the maximum value for local_size_y and local_size_z will be one. Modify section 4.4.2.1, Transform Feedback Layout Qualifiers, p. 69 (add a new paragraph at the end of the section, p. 71) Transform feedback is not supported to capture the outputs of task and mesh shaders. Use of transform feedback layout qualifiers in these shader types will result in a compile-time error. Add new Section 4.4.2.5, Mesh Shader Outputs, p. 75 Mesh shaders can have three additional types of output layout identifiers: an output primitive type, a maximum output vertex count, and a maximum output primitive count. The primitive type, vertex and primitive count identifiers are allowed only on the interface qualifier out, not on an output block, block member, or variable declaration. The layout qualifier identifiers for mesh shader outputs are layout-qualifier-id : points lines triangles max_vertices = integer-constant-expression max_primitives = integer-constant-expression The primitive type identifiers "points", "lines", and "triangles" are used to specify the type of output primitive produced by the mesh shader, and only one of these is accepted. 
At least one mesh shader (compilation unit) in a program must declare an output primitive type, and all mesh shader output primitive type declarations in a program must declare the same primitive type. It is not required that all mesh shaders in a program declare an output primitive type.

The vertex count identifier "max_vertices" is used to specify the maximum number of vertices the shader will ever emit for the invocation group. At least one mesh shader (compilation unit) in a program must declare a maximum output vertex count, and all mesh shader output vertex count declarations in a program must declare the same count. It is not required that all mesh shaders in a program declare a count.

The primitive count identifier "max_primitives" is used to specify the maximum number of primitives the shader will ever emit for the invocation group. At least one mesh shader (compilation unit) in a program must declare a maximum output primitive count, and all mesh shader output primitive count declarations in a program must declare the same count. It is not required that all mesh shaders in a program declare a count.

The intrinsically declared output block gl_MeshVerticesNV[] and any user-defined output variables or blocks not qualified with "perprimitiveNV" will be sized by the "max_vertices" output declaration.

The intrinsically declared output block gl_MeshPrimitivesNV[] and any user-defined output variables or blocks qualified with "perprimitiveNV" will be sized by the "max_primitives" output declaration.

The intrinsically declared array gl_PrimitiveIndicesNV[] will be sized according to the primitive type and "max_primitives" declarations, where the size is:

  * the value of "max_primitives" if "points" is declared,

  * two times the value of "max_primitives" if "lines" is declared, or

  * three times the value of "max_primitives" if "triangles" is declared.
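The sizing rules above can be illustrated with a line-emitting mesh shader. With "lines", max_vertices = 64, and max_primitives = 32, the arrays gl_MeshVerticesNV, gl_MeshPrimitivesNV, and gl_PrimitiveIndicesNV get 64, 32, and 2 * 32 = 64 elements respectively. The output names lineColor and lineWeight and the generated geometry are illustrative only, and a work group size of 32 is assumed to be within implementation limits.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 32) in;

// Output layout identifiers are allowed only on the interface qualifier "out".
layout(lines) out;
layout(max_vertices = 64, max_primitives = 32) out;

// Unsized per-vertex output: implicitly sized to max_vertices (64).
layout(location = 0) out vec4 lineColor[];

// Explicitly sized per-primitive output: must equal max_primitives (32).
layout(location = 1) perprimitiveNV out float lineWeight[32];

void main()
{
    uint lane = gl_LocalInvocationIndex;   // one line segment per invocation
    uint v0   = 2u * lane;
    uint v1   = v0 + 1u;

    // Spread 32 vertical segments across clip space.
    float x = (float(lane) + 0.5) / 32.0 * 2.0 - 1.0;
    gl_MeshVerticesNV[v0].gl_Position = vec4(x, -0.5, 0.0, 1.0);
    gl_MeshVerticesNV[v1].gl_Position = vec4(x,  0.5, 0.0, 1.0);
    lineColor[v0] = vec4(1.0);
    lineColor[v1] = vec4(1.0);

    lineWeight[lane] = 1.0;

    // Two indices per line primitive: entries 2*lane and 2*lane + 1.
    gl_PrimitiveIndicesNV[2u * lane]      = v0;
    gl_PrimitiveIndicesNV[2u * lane + 1u] = v1;

    if (lane == 0u)
        gl_PrimitiveCountNV = 32u;
}
```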
For outputs declared without an array size, including intrinsically declared outputs (e.g., gl_MeshVerticesNV), a layout must be declared before any use of the method length() or other array use that requires its size to be known.

It is a compile-time error if an output array is declared with an explicit size that does not match the array size derived from the layout qualifier.

Modify Section 4.5, Interpolation Qualifiers, p. 83

(modify first paragraph of the section, p. 83)

The presence of and type of interpolation is controlled by the above interpolation qualifiers as well as the auxiliary storage qualifiers centroid and sample. The auxiliary storage qualifiers "patch", "taskNV", and "perprimitiveNV" are not used for interpolation; it is a compile-time error to use interpolation qualifiers with those auxiliary storage qualifiers. The auxiliary storage qualifier "perviewNV" may not be used when declaring fragment shader inputs, but can be used with interpolation qualifiers in the declaration of mesh shader outputs.

(add a new paragraph at the end of the section, p. 84)

A variable qualified with the auxiliary storage qualifier "perprimitiveNV" will also not be interpolated. Instead, it will use the same per-primitive value for all fragments generated by each primitive. Such a variable can also be qualified with an interpolation qualifier with centroid or sample, but those qualifications will mean the same thing as only qualifying with "perprimitiveNV".

Modify Section 7.1, Built-In Language Variables (p. 120)

(insert after the first paragraph and variable list, p. 
123)

In the task language, built-in variables are intrinsically declared as:

    const uvec3 gl_WorkGroupSize;
    in uvec3 gl_WorkGroupID;
    in uvec3 gl_LocalInvocationID;
    in uvec3 gl_GlobalInvocationID;
    in uint  gl_LocalInvocationIndex;

    in uint gl_MeshViewCountNV;
    in uint gl_MeshViewIndicesNV[];

    out uint gl_TaskCountNV;

In the mesh language, built-in variables are intrinsically declared as:

    const uvec3 gl_WorkGroupSize;
    in uvec3 gl_WorkGroupID;
    in uvec3 gl_LocalInvocationID;
    in uvec3 gl_GlobalInvocationID;
    in uint  gl_LocalInvocationIndex;

    in uint gl_MeshViewCountNV;
    in uint gl_MeshViewIndicesNV[];

    out uint gl_PrimitiveCountNV;
    out uint gl_PrimitiveIndicesNV[];

    out gl_MeshPerVertexNV {
        vec4  gl_Position;
        perviewNV vec4  gl_PositionPerViewNV[];    // NVX_multiview_per_view_attributes
        float gl_PointSize;
        float gl_ClipDistance[];
        perviewNV float gl_ClipDistancePerViewNV[][];
        float gl_CullDistance[];
        perviewNV float gl_CullDistancePerViewNV[][];
    } gl_MeshVerticesNV[];

    perprimitiveNV out gl_MeshPerPrimitiveNV {
        int gl_PrimitiveID;
        int gl_Layer;
        perviewNV int gl_LayerPerViewNV[];
        int gl_ViewportIndex;
        int gl_ViewportMask[];                     // NV_viewport_array2
        perviewNV int gl_ViewportMaskPerViewNV[][];
    } gl_MeshPrimitivesNV[];

(modify the discussion of the built-in variables shared with compute shaders, which starts on p. 123)

The built-in constant gl_WorkGroupSize is a compute, task, or mesh shader constant containing the local work-group size of the shader. The size ...

The built-in variable gl_WorkGroupID is a compute, task, or mesh shader input variable containing the three-dimensional index of the global work group that the current invocation is executing in. ...

The built-in variable gl_LocalInvocationID is a compute, task, or mesh shader input variable containing the three-dimensional index of the local work group within the global work group that the current invocation is executing in. ...
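In core GLSL, the work-group built-ins declared above are related by simple arithmetic for compute shaders. The C sketch below models those relationships, assuming they carry over unchanged to task and mesh shaders; this extension shares the built-ins with compute shaders but does not restate the formulas, so treat that as an assumption. The UVec3 type and both function names are hypothetical stand-ins, not GLSL or API names.

```c
#include <assert.h>

/* Hypothetical stand-in for GLSL's uvec3. */
typedef struct { unsigned x, y, z; } UVec3;

/* Models gl_GlobalInvocationID =
 *     gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
 * as defined for compute shaders in core GLSL (assumed here to apply
 * to task and mesh shaders as well). */
UVec3 global_invocation_id(UVec3 group_id, UVec3 group_size, UVec3 local_id)
{
    UVec3 g;
    /* A local ID must lie inside the local work group. */
    assert(local_id.x < group_size.x);
    assert(local_id.y < group_size.y);
    assert(local_id.z < group_size.z);
    g.x = group_id.x * group_size.x + local_id.x;
    g.y = group_id.y * group_size.y + local_id.y;
    g.z = group_id.z * group_size.z + local_id.z;
    return g;
}

/* Models gl_LocalInvocationIndex, the one-dimensional form of
 * gl_LocalInvocationID (x varies fastest, then y, then z). */
unsigned local_invocation_index(UVec3 group_size, UVec3 local_id)
{
    return local_id.z * group_size.x * group_size.y
         + local_id.y * group_size.x
         + local_id.x;
}
```

For a one-dimensional work group of size 32, invocation 5 of work group 3 would have a global invocation ID of (101, 0, 0) and a local invocation index of 5 under this model.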
The built-in variable gl_GlobalInvocationID is a compute, task, or mesh shader input variable containing the global index of the current work item. This value uniquely identifies this invocation from all other invocations across all local and global work groups initiated by the current DispatchCompute or DispatchMeshTasksNV call or by a previously executed task shader. ...

The built-in variable gl_LocalInvocationIndex is a compute, task, or mesh shader input variable that contains the one-dimensional representation of the gl_LocalInvocationID.

(modify discussion of gl_PrimitiveID, gl_Layer, and gl_ViewportIndex to allow as a mesh output, pp. 125-127)

The output variable gl_PrimitiveID is available only in the geometry and mesh languages and provides a single integer that serves as a primitive identifier. This is then available to fragment shaders as the fragment input gl_PrimitiveID, which will select the written primitive ID from the provoking vertex in the primitive being shaded when using a geometry shader or from the appropriate per-primitive output value when using a mesh shader. If a fragment shader using gl_PrimitiveID is active and a geometry or mesh shader is also active, the geometry or mesh shader must write to gl_PrimitiveID or the fragment shader input gl_PrimitiveID is undefined. ...

The variable gl_Layer is available as an output variable in the geometry and mesh languages and an input variable in the fragment language. In the geometry and mesh languages, it is used to select a specific layer (or face and layer of a cube map) of a multi-layer framebuffer attachment. When using a geometry shader, the actual layer used will come from one of the vertices in the primitive being shaded. Which vertex the layer comes from is discussed in section 11.3.4.6 "Layer and Viewport Selection" of the OpenGL Specification. It might be undefined, so it is best to write the same layer value for all vertices of a primitive.
When using a mesh shader, the actual layer will come from the appropriate per-primitive output value written by the mesh shader. ...

The input variable gl_Layer in the fragment language will have the same value that was written to the output variable gl_Layer in the geometry or mesh language. If the geometry or mesh stage does not dynamically assign ... If the geometry or mesh stage makes no static assignment to gl_Layer, the input value... Otherwise, the fragment stage will read the same value written by the geometry or mesh stage, even if...

The variable gl_ViewportIndex is available as an output variable in the geometry and mesh languages and an input variable in the fragment language. In the geometry and mesh language, it provides the ... Primitives generated by the geometry or mesh shader will undergo viewport transformation and scissor testing using the viewport transformation and scissor rectangle selected by the value of gl_ViewportIndex. When using a geometry shader, the viewport index used will come from one of the vertices in the primitive being shaded. However, which vertex the viewport index comes from is implementation-dependent, so it is best to use the same viewport index for all vertices of the primitive. When using a mesh shader, the viewport index used will come from the appropriate per-primitive output value written by the mesh shader. If a geometry or mesh shader does not assign a value to gl_ViewportIndex, ... If a geometry or mesh shader statically assigns a value to gl_ViewportIndex...

The input variable gl_ViewportIndex in the fragment stage will have the same value that was written to the output variable gl_ViewportIndex in the geometry or mesh stage. If the geometry or mesh stage does not dynamically assign... If the geometry or mesh stage makes no static assignment... Otherwise, the fragment stage will read the same value written by the geometry or mesh stage, even if...
(insert new paragraphs before the seventh paragraph, starting with "Fragment shaders output values", p. 127, describing new task and mesh built-in variables)

The input variable gl_MeshViewCountNV is only available in the mesh and task languages and defines the number of views processed by the current mesh and task shader invocations. When using the multi-view API feature, the primitives emitted by the mesh shader will be processed separately for each enabled view and sent to a different layer of a layered render target. Mesh shader outputs qualified with "perviewNV" are declared as arrays with separate values for each view. To ensure defined results, mesh shaders must write values for array elements zero through gl_MeshViewCountNV-1 for each such per-view output.

The input variable gl_MeshViewIndicesNV is only available in the mesh and task languages. This variable is an array where each element holds the view number of one of the views being processed by the current mesh and task shader invocations. The array elements with indices greater than or equal to the value of gl_MeshViewCountNV are undefined. If the value of gl_MeshViewIndicesNV[i] is <j>, then any outputs qualified with "perviewNV" will take on the value of array element <i> when processing primitives for view index <j>.

The output variable gl_TaskCountNV is only available in the task language and defines the number of subsequent mesh shader work groups to generate upon completion of the task shader.

The output variable gl_PrimitiveCountNV is only available in the mesh language and defines the number of primitives in the output mesh produced by the mesh shader that should be processed by subsequent pipeline stages.

The output array variable gl_PrimitiveIndicesNV[] is only available in the mesh language. Depending on the output primitive type declared using a layout qualifier, each group of one (points), two (lines), or three (triangles) array elements specifies the indices of the vertices making up the primitive.
All index values must be in the range [0, N-1], where N is the value of the "max_vertices" layout qualifier. Out-of-bounds index values will result in undefined behavior.

The mesh shader output block members gl_PositionPerViewNV[], gl_ClipDistancePerViewNV[][], gl_CullDistancePerViewNV[][], gl_LayerPerViewNV[], and gl_ViewportMaskPerViewNV[][] are per-view versions of the single-view variables with equivalent names that lack the "PerViewNV" suffix:

    Per-View Variable               Single-View Variable
    ----------------------------    --------------------
    gl_PositionPerViewNV[]          gl_Position
    gl_ClipDistancePerViewNV[][]    gl_ClipDistance[]
    gl_CullDistancePerViewNV[][]    gl_CullDistance[]
    gl_LayerPerViewNV[]             gl_Layer
    gl_ViewportMaskPerViewNV[][]    gl_ViewportMask[]

All of these outputs are considered arrayed, with separate values for each view. The view number is used to index in the first dimension of these arrays.

For all of these variables, if a shader statically assigns a value to any element of a per-view array, it may not statically assign a value to the equivalent single-view variable in any mesh shader compilation unit.

As with the gl_ClipDistance[] and gl_CullDistance[] arrays, the second dimension of gl_ClipDistancePerViewNV[][] and gl_CullDistancePerViewNV[][] is predeclared as unsized and must be sized by the shader, either by redeclaring it with a size or by indexing it only with integral constant expressions. The size determines the number and set of enabled clip or cull distances and can be at most gl_MaxClipDistances or gl_MaxCullDistances, respectively. The number of varying components consumed by these arrays will match the size of the array, and shaders writing to either array must write all enabled distances, or clipping/culling results will be undefined.

(modify the fifth paragraph, p. 129)

The gl_PerVertex, gl_MeshPerVertexNV, and gl_MeshPerPrimitiveNV blocks can be redeclared in a shader to explicitly indicate what subset of the fixed pipeline interface will be used. ...
(modify the sixth paragraph, p. 129)

This establishes the output interface the shader will use with the subsequent pipeline stage. It must be a subset of the built-in members of gl_PerVertex, gl_MeshPerVertexNV, or gl_MeshPerPrimitiveNV. ...

Modify Section 7.3, Built-In Constants (p. 136)

Add to the end of the long list of constants that makes up this section:

    const int gl_MaxMeshViewCountNV = 4;

Add new Section 8.xx, Mesh Shader Functions, after section 8.15, p. 187

These functions are only available in mesh shaders.

Insert a syntax/description table similar to the previous section.

Syntax:

    void writePackedPrimitiveIndices4x8NV(uint indexOffset, uint packedIndices)

Description:

Interprets <packedIndices> as four 8-bit unsigned int values and stores them into the gl_PrimitiveIndicesNV array starting from the provided <indexOffset>, which must be a multiple of four. Lower bytes are stored at lower addresses in the array. The write operations must not exceed the size of the gl_PrimitiveIndicesNV array.

Modify Section 8.16, Shader Invocation Control Functions, p. 186

(modify first paragraph of the section, p. 186)

The shader invocation control function is available only in tessellation control, compute, task, and mesh shaders. It is used to control the relative execution order of multiple shader invocations used to process a patch (in the case of tessellation control shaders) or a local work group (in the case of compute, task, and mesh shaders), which are otherwise executed with an undefined relative order.

(modify the last paragraph, p. 186)

For compute, task, and mesh shaders, the barrier() function may be placed within flow control, but that flow control must be uniform flow control. ...

Modify Section 8.17, Shader Memory Control Functions, p. 187

(modify table of functions, p. 187)

    void memoryBarrierShared()

        Control the ordering of memory transactions to shared variables issued within a single shader invocation. Only available in compute, task, and mesh shaders.
    void groupMemoryBarrier()

        Control the ordering of all memory transactions issued within a single shader invocation, as viewed by other invocations in the same work group. Only available in compute, task, and mesh shaders.

(modify last paragraph, p. 187)

... all of the above variable types. The functions memoryBarrierShared() and groupMemoryBarrier() are available only in compute, task, and mesh shaders; the other functions are available in all shader types.

(modify last paragraph, p. 188)

... When using the function groupMemoryBarrier(), this ordering guarantee applies only to other shader invocations in the same compute, task, or mesh shader work group; all other memory barrier functions provide the guarantee to all other shader invocations. ...

Interactions with GLSL 4.60 and KHR_vulkan_glsl

If GLSL 4.60 or KHR_vulkan_glsl is supported, the layout qualifiers "local_size_x_id", "local_size_y_id", and "local_size_z_id" are supported in mesh and task shaders, as in compute shaders.

In the big layout qualifier table in section 4.4, add:

    Layout Qualifier   | Qualifier | Individual | Block | Block  | Allowed interfaces
                       |   only    |  variable  |       | Member |
    -------------------+-----------+------------+-------+--------+--------------------
    local_size_x_id =  |           |            |       |        | compute in
    local_size_y_id =  |     X     |            |       |        | mesh in
    local_size_z_id =  |           |            |       |        | task in
                       |           |            |       |        | (SPIR-V generation
                       |           |            |       |        |  only)

No changes are required to the spec language describing these layout qualifiers, since the language doesn't specifically reference compute shaders and the mesh/task support should be identical.

Interactions with NV_viewport_array2

If NV_viewport_array2 is not supported, remove gl_ViewportMask[] from the gl_MeshPerPrimitiveNV block declaration.
Interactions with NV_stereo_view_rendering

Mesh shaders support a fully generic set of per-view positions and viewport masks, so we include no support for the more limited gl_SecondaryPositionNV and gl_SecondaryViewportMaskNV[] built-ins from NV_stereo_view_rendering.

Interactions with NVX_multiview_per_view_attributes

If NVX_multiview_per_view_attributes is not supported, remove gl_PositionPerViewNV[] from the gl_MeshPerVertexNV block declaration and remove gl_ViewportMaskPerViewNV[][] from the gl_MeshPerPrimitiveNV block declaration.

If NVX_multiview_per_view_attributes is supported, it is a compile-time error for a mesh shader to make a static assignment to gl_PositionPerViewNV[] as well as to either of gl_Position or gl_SecondaryPositionNV.

If NVX_multiview_per_view_attributes is supported, it is a compile-time error for a mesh shader to make a static assignment to gl_ViewportMaskPerViewNV[][] as well as to either of gl_ViewportMask[] or gl_SecondaryViewportMaskNV[].

Interactions with ARB_shader_draw_parameters

If ARB_shader_draw_parameters is supported, the task and mesh shaders will also have the following built-in input:

    in int gl_DrawIDARB;

The variable gl_DrawIDARB is a vertex, task, and mesh language input variable that holds the integer index of the drawing command to which the current vertex belongs (see "Shader Inputs" in section 11.1.3.9 of the OpenGL Graphics System Specification) or, in the task and mesh languages, to which the current task or mesh workgroup belongs. If the vertex or workgroup is not invoked by a Multi* form of a draw command, then the value of gl_DrawIDARB is zero.

Interactions with EXT_clip_cull_distance

If implemented with OpenGL ES ESSL and EXT_clip_cull_distance is not supported, remove references to gl_ClipDistance, gl_CullDistance, gl_ClipDistancePerViewNV and gl_CullDistancePerViewNV.

Issues

(1) What are the matching requirements between mesh outputs declared with "perprimitiveNV" and fragment shader inputs?
What should we do with interpolation and other auxiliary storage qualifiers on per-primitive values?

RESOLVED: In the initial implementation of this extension, reading per-primitive mesh shader outputs in a fragment shader would return incorrect/undefined values if the fragment shader input has no special qualification. As a result, we require that mesh shader outputs qualified with "perprimitiveNV" be matched with fragment shader inputs qualified with "perprimitiveNV" and vice versa.

We currently allow any of the interpolation and related auxiliary storage qualifiers (e.g., flat, centroid) on fragment shader inputs qualified with "perprimitiveNV". These qualifiers have no effect. This resolution is consistent with the core GLSL specification language that allows (and ignores) auxiliary storage qualifiers such as "sample" or "centroid" to be used on inputs qualified by "flat", despite the fact that the storage qualifiers are meaningless for flat-shaded attributes.

(2) How do "arrayed" outputs and blocks work for mesh shaders? Do you have to declare an array dimension? If you do declare an array dimension, how is it checked?

RESOLVED: The rules for mesh shader outputs are the same as for arrayed inputs and outputs in tessellation control, tessellation evaluation, and geometry shaders. When declaring an "arrayed" block, the size is optional. If omitted, the size is taken from the maximum vertex or primitive counts declared using layout qualifiers ("max_vertices" and "max_primitives"). If a size is provided, it must match the limits specified by the layout qualifiers.

(3) How are location layout qualifiers handled in mesh and task shaders? Do we support some sort of layout or offset qualifier for task memory?

RESOLVED: For mesh shader outputs, the "location" layout qualifier is supported and is used for interface matching with the fragment shader.
Locations assigned to mesh shader outputs have the same semantics as locations assigned to vertex, tessellation control, tessellation evaluation, and geometry shader outputs. As with tessellation control shaders, mesh shader outputs are "arrayed" with separate instances of each variable or block for each output vertex or primitive. These multiple instances do not consume separate locations for each vertex/primitive.

For task shader outputs (used as mesh shader inputs), we've chosen not to support any location or offset layout qualifiers. Instead, we limit task and mesh shaders to use at most one block qualified by "taskNV" and do not allow non-block variables to use "taskNV". With a single block where member declarations need to match between stages, any internal offsets/locations can be assigned by the compiler without any external annotation.

(4) For mesh shaders supporting multiple views, how do applications specify the set of views that should be produced?

RESOLVED: Ignoring mesh shaders, there are significant differences in how multiple views are handled in OpenGL and Vulkan. OVR_multiview (OpenGL ES) specifies the view count using the "num_views" layout qualifier, where shaders will implicitly use views 0 through num_views-1. VK_KHR_multiview (Vulkan) provides no view information in the shader, other than references to a view index. Instead, the Vulkan render pass specifies a bitfield identifying the set of views to produce. In the Vulkan algorithm, there is no explicit notion of a view count in the shader, and the view mask is not known at shader compile time.

For mesh shaders in OpenGL, we use the same OVR_multiview "num_views" layout qualifier to specify the view count. Unlike multiview vertex shaders, multiview mesh shaders are not run separately for each view. The "num_views" layout qualifier is used only to determine array sizes for outputs qualified with "perviewNV".
For mesh shaders in Vulkan, the view mask of the render pass is used to determine the storage requirements of per-view attributes and controls the values of the gl_MeshViewCountNV and gl_MeshViewIndicesNV built-ins.

(5) For outputs declared with "perviewNV", which are arrays with separate elements for each view, what are the rules for array sizing and indexing? Do you have to declare an array dimension? If you do declare an array dimension, how is it checked?

RESOLVED: The rules for per-view mesh shader outputs are the same as for arrayed inputs and outputs in tessellation control, tessellation evaluation, and geometry shaders, as well as the per-vertex and per-primitive mesh shader output arrays. When declaring an output qualified with "perviewNV", an extra array dimension needs to be used for indexing across views. The array size in that dimension is optional. If omitted, the size is taken from the implementation-dependent maximum view count. If provided, the size must match the maximum view count.

Given that the view count on Vulkan is inferred at *run time* from the view mask in the render pass, we can't use that derived view count for SPIR-V code generation and compile-time error checking. Because of this, we have chosen to use the *maximum* view count for sizing per-view arrays, which is known at compile time.

(6) What built-ins should be provided for multi-view mesh shaders?

RESOLVED: We provide per-view versions of gl_Position, gl_ClipDistance[], and gl_CullDistance[] in the built-in block gl_MeshPerVertexNV:

    perviewNV vec4  gl_PositionPerViewNV[];
    perviewNV float gl_ClipDistancePerViewNV[][];
    perviewNV float gl_CullDistancePerViewNV[][];

Because these per-view built-ins refer to the same attributes as the equivalent standard built-ins, we prohibit the static use of a per-view built-in and its standard equivalent in a single shader.
We considered instead allowing shaders to redeclare output blocks to add "perviewNV" qualification to existing built-ins, such as:

    out gl_PerVertex {
        perviewNV vec4 gl_Position[];
    } v[];

This approach was rejected because modifying the basic types of built-in variables could result in new declarations that are inconsistent with the basic definitions built into the compiler.

(7) For multi-view, how do we broadcast mesh shader outputs to multiple layers or viewports, where at least some outputs have per-view values?

RESOLVED: In the OpenGL and Vulkan multi-view extensions, the programming model has logically separate shader invocations for each view. These extensions have a view ID/index built-in that can be used to determine which view is being processed by a given invocation. If a hardware platform is capable of compiling a multi-view shader to correctly process multiple views in a single shader invocation, the implementation is free to perform such an optimization.

For mesh shaders, a transparent optimization that combines invocations for N different views is significantly more problematic. Separate invocations could produce structurally different output (e.g., different primitive counts or different topology), which would be more difficult to "broadcast". To simplify matters, we instead use a programming model where there is a single work group that processes all views at once. For per-view attributes, the mesh shader is responsible for computing separate output values for each view.

(8) Should the gl_NumWorkGroups built-in be supported in task or mesh shaders, as with compute shaders?

RESOLVED: No, this isn't worth the trouble. If required, an application can pass a workgroup count manually via a uniform.

If we were to support such a thing, it would be necessary to figure out how gl_NumWorkGroups would interact with a nonzero starting task index in the draw call.
For compute shaders, if you dispatched five workgroups with DispatchCompute, they would always be numbered 0..4 and have values less than gl_NumWorkGroups. If you called glDrawMeshTasksNV with <first> set to 3 and <count> set to 5, the work groups would be numbered 3..7 and it would be necessary to decide if gl_NumWorkGroups should be 5 or 8.

Revision History

    Version 7, March 6, 2019 (pknowles)
    - Added EXT_clip_cull_distance interactions.

    Version 6, October 22, 2018 (sparmar)
    - Fix typo for per-primitive fragment shader input example

    Version 5, October 5, 2018 (pbrown)
    - Add an interaction with GLSL 4.60 and GL_KHR_vulkan_glsl to allow the use of "local_size_[xyz]_id" where applicable.

    Version 4, October 4, 2018 (pbrown)
    - Fix incorrect layout qualifier table entries. "local_size_[xyz]" is legal in task shaders.

    Version 3, September 18, 2018 (pbrown)
    - Additional edits preparing for publication.

    Version 2, September 11, 2018 (pbrown)
    - Miscellaneous edits preparing for publication.

    Version 1 (ckubisch, pbrown)
    - NVIDIA internal revisions.