Phase 5.2-5.3: CPU perf optimizations + GPU compute Surface Nets

CPU smooth mesher optimizations (560ms → 17ms):
- VoxelData grid cache eliminates redundant readVoxel calls
- Pre-cached 27 neighbor chunk pointers (readVoxelFast)
- smoothNear dilation (8 lookups/cell instead of 56)
- Early exit via containsSmooth flag on chunks
- Thread-local scratch buffers (SmoothScratch ~600KB)
- wi::jobsystem parallelization across all cores
- Persistent staging vectors for upload

TopingSystem optimizations (58ms → 6ms):
- collectInstancesParallel() with per-chunk local vectors
- Neighbor chunk pointer caching

GPU compute Surface Nets (Phase 5.3):
- Two-pass compute shader: centroid grid + emit with smooth normals
- Pass 1 (voxelSmoothCentroidCS): computes centroids + solid flags for cells [-1..32], cross-chunk neighbor voxel reading
- Pass 2 (voxelSmoothCS): reads ONLY from centroid grid, computes area-weighted smooth normals from 12 incident edges per vertex
- Batched dispatch: all centroid passes then all emit passes with single UAV→SRV barrier (instead of 2 barriers per chunk)
- Smooth chunk filtering: only dispatches chunks with containsSmooth
- Centroid grid buffer dynamically sized per smooth chunk count
- 1-frame readback delay with auto-redispatch on first frame
2026-03-27 22:30:43 +01:00 · 2026-03-27 22:30:43 +01:00 · cd9814e494
commit cd9814e494
parent d075a8492c
13 changed files with 1318 additions and 97 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -486,12 +486,46 @@ Decorative bevel system ("topings") on exposed +Y faces for
 - Launched with `BVLEVoxels.exe debugsmooth`
 - 11 isolated configurations in a single chunk: SmoothStone↔Grass, SmoothStone↔Dirt, SmoothStone↔Sand, SmoothStone↔Stone, Snow↔Grass, Snow↔Sand, blocky references (Sand↔Dirt, Grass↔Dirt), SmoothStone staircase, smooth patch surrounded by grass, isolated smooth block
-#### Phase 5.2 - Optimizations and polish [TODO]
+#### Phase 5.2 - Smooth normals + perf optimizations [DONE]
 - **Smooth vertex normals**: area-weighted accumulation of face normals into each indexed vertex, followed by normalization. Gives smooth Gouraud lighting without additional geometry
 - **Geometric normals for triplanar**: the PS uses `ddx`/`ddy` of worldPos to reconstruct the geometric (face) normal for the triplanar weights, and the smooth normal for lighting only. Prevents the texture stretching caused by smoothed normals
 - **Depth bias smooth PSO**: rasterizer with `depth_bias = 1` to resolve smooth↔blocky z-fighting at boundaries
 - **Surface-only vertex extension**: the extended `hasSmooth` filter also checks that the cell is on the surface (`hasPos && hasNeg`) AND not entirely underground. Prevents the smooth mesh from plunging below ground
 **CPU optimizations (560ms → 17ms = 33× faster)**:
 - **VoxelData cache in the grid**: `voxelGrid[]` stores VoxelData alongside the SDF, eliminating all redundant `readVoxel` calls in boundary clamping, material counting, and the surface check
 - **Pre-cache of 27 neighbor chunks**: `neighborChunks[3][3][3]` is filled before the grid fill. `readVoxelFast()` accesses the neighbor chunk's array directly instead of calling `world.getVoxel()` (hashmap lookup). Eliminates ~14K hashmap lookups per smooth chunk
 - **Removal of dead `computeNormal`**: the SDF gradient function (6 readVoxel/vertex) was overwritten by the smooth normals. Dead code removed
 - **Early exit `containsSmooth`**: flag set during `generateChunk()`. `meshChunk()` checks this chunk + 26 neighbors (27 hashmap lookups) before the expensive grid fill (46K voxel reads). Skips ~70% of chunks
 - **smoothNear dilation**: a pre-dilated grid (smooth + face neighbors) replaces the extended hasSmooth check. 8 lookups/cell instead of 56 (7× fewer in the hottest loop)
 - **Thread-local scratch buffers**: `SmoothScratch` (~600 KB) allocated once per thread and reused across calls. Eliminates per-chunk malloc/free
 - **`wi::jobsystem` parallelization**: all chunks are meshed in parallel across all CPU cores
 - **Persistent staging vectors**: `smoothStagingVerts_` is reused across frames, avoiding vector allocations
 **TopingSystem optimizations**:
 - **`collectInstancesParallel()`**: each chunk writes to a local vector, followed by a sequential merge. Eliminates contention
 - **Persistent staging vectors**: `topingSorted_`, `topingGpuInsts_` are reused across frames
 **Animation results (648 chunks, Ryzen 7 3700X + RX 5700 XT)**:
 - SmoothMesh: 560ms → 17ms (parallel, dilation, cache)
 - SmoothUpload: 13ms → 4ms (persistent staging)
 - TopingCollect: 58ms → 6.5ms (parallel)
 - TopingUpload: 7.5ms → 1.2ms (timing bug fix + persistent staging)
 - Frame total: 662ms → 58ms (1.5 → 17 FPS with terrain animation)
 #### Phase 5.3 - GPU compute Surface Nets [TODO]
 - Compute shader for SDF grid fill + vertex generation + quad emission
 - Eliminates the remaining CPU bottleneck (17ms → <1ms estimated)
 - Pattern similar to the blocky GPU mesher (Phase 2.4-2.5)
 - 1-frame-delay readback of the atomic counter for the vertex count
 #### Phase 5.4 - Polish [TODO]
 - Smoothed SDF (approximate distance field instead of binary ±1)
 - Smooth normals (vertex normals instead of face normals for smooth surfaces)
 - GPU compute Surface Nets (compute shader instead of CPU)
 - LOD: triangle reduction at distance
 - Asynchronous pipeline: double-buffered GPU resources; the CPU prepares frame N while the GPU renders frame N-1
 ### Phase 6 - Hybrid ray tracing [TODO]
@@ -504,9 +538,11 @@ Decorative bevel system ("topings") on exposed +Y faces for
 | Metric | Target | Result (Ryzen 7 3700X + RX 5700 XT) |
 |--------|--------|--------------------------------------|
-| FPS 1440p | > 60 fps | ✅ 80-110 FPS (anim 60Hz), 700+ FPS (static) |
+| FPS 1440p | > 60 fps | ✅ 80-110 FPS (anim blocky), 700+ FPS (static) |
-| GPU meshing | < 200 µs/chunk | ✅ ~0.6 µs/chunk (0.1ms / 171 chunks) |
+| FPS anim smooth+topings | > 15 fps | ✅ 17 FPS (smooth+topings+blocky anim 60Hz) |
-| Full re-mesh | < 16ms | ✅ ~13ms (regen 8.7ms + upload 4.5ms) |
+| GPU meshing (blocky) | < 200 µs/chunk | ✅ ~0.6 µs/chunk (0.1ms / 171 chunks) |
 | CPU meshing (smooth) | < 30ms | ✅ 17ms (parallel, 648 chunks) |
 | Full re-mesh | < 16ms | ✅ ~13ms blocky (regen 8.7ms + upload 4.5ms) |
 | GPU memory | < 500 MB | ✅ ~30 MB (11 MB voxels + 16 MB quads + buffers) |
 | RT shadows + AO | < 4ms at 1440p | ⏳ Phase 6 |
 | Draw calls | < 100 | ✅ 1 (GPU mesh) or 1 (MDI) |
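Review note (illustrative, not part of the commit): the "8 lookups/cell instead of 56" figure for the smoothNear dilation can be sketched as below. This is a minimal sketch under stated assumptions: `Grid`, `dilate`, `cellNearSmooth` and `cellNearSmoothBrute` are hypothetical names, and the grid size is arbitrary rather than the mesher's real dimensions. It only shows why testing 8 corners of a pre-dilated grid is equivalent to testing each corner plus its 6 face neighbors (8 × 7 = 56 reads) on the raw grid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dense grid of per-voxel "is smooth" flags (illustrative only).
struct Grid {
    int n;
    std::vector<uint8_t> v;
    explicit Grid(int n_) : n(n_), v(size_t(n_) * n_ * n_, 0) {}
    bool inside(int x, int y, int z) const {
        return x >= 0 && y >= 0 && z >= 0 && x < n && y < n && z < n;
    }
    // Out-of-range reads return 0 (not smooth).
    uint8_t get(int x, int y, int z) const {
        return inside(x, y, z) ? v[(size_t(z) * n + y) * n + x] : 0;
    }
    void set(int x, int y, int z, uint8_t b) { v[(size_t(z) * n + y) * n + x] = b; }
};

static const int kDirs[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};

// One-time pass: a voxel is flagged if it or any of its face neighbors is smooth.
Grid dilate(const Grid& smooth) {
    Grid out(smooth.n);
    for (int z = 0; z < smooth.n; ++z)
    for (int y = 0; y < smooth.n; ++y)
    for (int x = 0; x < smooth.n; ++x) {
        uint8_t flag = smooth.get(x, y, z);
        for (const auto& d : kDirs) flag |= smooth.get(x + d[0], y + d[1], z + d[2]);
        out.set(x, y, z, flag);
    }
    return out;
}

// Hot-loop test on the dilated grid: one read per cell corner (8 reads).
bool cellNearSmooth(const Grid& dilated, int x, int y, int z) {
    for (int c = 0; c < 8; ++c)
        if (dilated.get(x + (c & 1), y + ((c >> 1) & 1), z + ((c >> 2) & 1))) return true;
    return false;
}

// Reference test on the raw grid: 8 corners x (self + 6 neighbors) = 56 reads.
// Equivalent to cellNearSmooth as long as all 8 corners lie inside the grid.
bool cellNearSmoothBrute(const Grid& smooth, int x, int y, int z) {
    for (int c = 0; c < 8; ++c) {
        const int cx = x + (c & 1), cy = y + ((c >> 1) & 1), cz = z + ((c >> 2) & 1);
        if (smooth.get(cx, cy, cz)) return true;
        for (const auto& d : kDirs)
            if (smooth.get(cx + d[0], cy + d[1], cz + d[2])) return true;
    }
    return false;
}
```

The dilation costs one extra pass over the grid, which pays off because the per-cell test runs in the innermost loop.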
--- a/README.md
+++ b/README.md
@@ -109,8 +109,8 @@ GPU: frustum cull compute → indirect args → DrawInstancedIndirectCount (1 ap
 - [x] **Phase 1** — Setup, CPU meshing, basic rendering
 - [x] **Phase 2** — GPU-driven pipeline, mega-buffer, culling, compute shaders
 - [x] **Phase 3** — Texture blending (triplanar, height-based)
- [ ] **Phase 4** — Toping (ledges, procedural borders)
+- [x] **Phase 4** — Toping (ledges, procedural borders)
- [ ] **Phase 5** — Smooth rendering (Surface Nets / Marching Cubes)
+- [x] **Phase 5** — Smooth rendering (Surface Nets / Marching Cubes)
 - [ ] **Phase 6** — Hybrid ray tracing (RT shadows + AO)
 ## License
--- a/shaders/voxelCommon.hlsli
+++ b/shaders/voxelCommon.hlsli
@@ -65,7 +65,7 @@ struct IndirectDrawArgsInstanced {
    uint startInstanceLocation;
 };
-// ── GPU chunk info (must match C++ GPUChunkInfo, 80 bytes) ──────
+// ── GPU chunk info (must match C++ GPUChunkInfo, 112 bytes) ─────
 // NOTE: No arrays — scalar-only to guarantee C-style packing in StructuredBuffer.
 struct GPUChunkInfo {
    float4 worldPos;    // xyz = chunk origin in world space, w = debug flag
@@ -76,6 +76,9 @@ struct GPUChunkInfo {
    // Per-face data (6 faces: +X -X +Y -Y +Z -Z)
    uint faceOff0, faceOff1, faceOff2, faceOff3, faceOff4, faceOff5;
    uint faceCnt0, faceCnt1, faceCnt2, faceCnt3, faceCnt4, faceCnt5;
    // Face neighbor chunk indices (0xFFFFFFFF = no neighbor)
    uint neighbor0, neighbor1, neighbor2, neighbor3, neighbor4, neighbor5; // +X,-X,+Y,-Y,+Z,-Z
    uint _pad2, _pad3;
 };
 // Helper functions to access scalar face fields by index
@@ -101,4 +104,15 @@ uint getFaceCount(GPUChunkInfo info, uint f) {
    }
 }
 uint getNeighborIdx(GPUChunkInfo info, uint f) {
    switch (f) {
        case 0: return info.neighbor0; // +X
        case 1: return info.neighbor1; // -X
        case 2: return info.neighbor2; // +Y
        case 3: return info.neighbor3; // -Y
        case 4: return info.neighbor4; // +Z
        default: return info.neighbor5; // -Z
    }
 }
 #endif // VOXEL_COMMON_HLSLI
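Review note (illustrative, not part of the commit): the 80 → 112 byte change follows from the added fields, 6 neighbor indices plus 2 padding words (8 × 4 = 32 bytes). A C++-side size check could look like the sketch below. It is only a sketch: `GPUChunkInfoMirror` is a hypothetical name, and the four `elided*` words are placeholders for fields this hunk does not show, sized so that the old layout adds up to the 80 bytes stated in the removed comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical CPU mirror of the HLSL struct, scalar-only like the shader side.
struct GPUChunkInfoMirror {
    float    worldPosX, worldPosY, worldPosZ, worldPosW;   // float4 worldPos
    uint32_t elided0, elided1, elided2, elided3;           // PLACEHOLDER: fields not shown in the hunk
    uint32_t faceOff0, faceOff1, faceOff2, faceOff3, faceOff4, faceOff5;
    uint32_t faceCnt0, faceCnt1, faceCnt2, faceCnt3, faceCnt4, faceCnt5;
    uint32_t neighbor0, neighbor1, neighbor2, neighbor3, neighbor4, neighbor5; // +X,-X,+Y,-Y,+Z,-Z
    uint32_t _pad2, _pad3;
};

// Compile-time guards: a mismatch with the shader comment fails the build.
static_assert(sizeof(GPUChunkInfoMirror) == 112, "must match HLSL GPUChunkInfo");
static_assert(offsetof(GPUChunkInfoMirror, neighbor0) == 80, "new fields start after the old 80 bytes");
```

Keeping such an assertion next to the real C++ struct would catch a layout drift at compile time instead of as corrupted neighbor indices on the GPU.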
--- a/shaders/voxelSmoothCS.hlsl
+++ b/shaders/voxelSmoothCS.hlsl
@@ -0,0 +1,335 @@
 // BVLE Voxels - GPU Smooth Mesher Pass 2: Emit with Smooth Normals
 // Reads ONLY from centroid grid (written by pass 1). No voxel buffer access.
 // This keeps the shader simple and fast to compile.
 //
 // Centroid grid format (float4 per cell, cells [-1..32]):
 //   xyz = chunk-local position (valid for surface cells)
 //   w   = packed flags: bit24=valid, bit25=solid, [7:0]=mat, [15:8]=secMat, [23:16]=blend
 //
 // Dispatch: 4x4x4 groups of 8x8x8 threads per chunk (cells [0..31])
 #include "voxelCommon.hlsli"
 struct SmoothPush {
    uint chunkIndex;
    uint voxelBufferOffset;   // unused in this shader
    uint maxOutputVerts;
    uint centroidGridOffset;
    uint pad[8];
 };
 [[vk::push_constant]] ConstantBuffer<SmoothPush> push : register(b999);
 StructuredBuffer<GPUChunkInfo> chunkInfo : register(t1);
 StructuredBuffer<float4> centroidGrid : register(t2);
 struct GPUSmoothVertex {
    float px, py, pz;
    float nx, ny, nz;
    uint packedMat;
    uint packedChunk;
 };
 RWStructuredBuffer<GPUSmoothVertex> outputVerts : register(u0);
 RWByteAddressBuffer vertCounter : register(u1);
 static const uint CSIZE = 32;
 static const uint GRID_DIM = 34;
 // ── Grid access helpers ─────────────────────────────────────────────
 uint gridIndex(int3 cellPos) {
    return push.centroidGridOffset +
           (uint)(cellPos.z + 1) * GRID_DIM * GRID_DIM +
           (uint)(cellPos.y + 1) * GRID_DIM +
           (uint)(cellPos.x + 1);
 }
 uint readGridPacked(int3 cellPos) {
    if (any(cellPos < -1) || any(cellPos > 32)) return 0;
    return asuint(centroidGrid[gridIndex(cellPos)].w);
 }
 bool isCentroidValid(int3 cellPos) {
    return ((readGridPacked(cellPos) >> 24) & 1) != 0;
 }
 bool isCellSolid(int3 cellPos) {
    return ((readGridPacked(cellPos) >> 25) & 1) != 0;
 }
 float3 readCentroidPos(int3 cellPos) {
    return centroidGrid[gridIndex(cellPos)].xyz;
 }
 // ── Face normal for one quad (4 sharing cells) ──────────────────────
 float3 computeQuadFaceNormal(int3 c0, int3 c1, int3 c2, int3 c3,
                              bool solid0, int edgeAxis) {
    if (!isCentroidValid(c0) || !isCentroidValid(c1) ||
        !isCentroidValid(c2) || !isCentroidValid(c3))
        return float3(0, 0, 0);
    float3 p0 = readCentroidPos(c0);
    float3 p1 = readCentroidPos(c1);
    float3 p3 = readCentroidPos(c3);
    float3 fn = cross(p1 - p0, p3 - p0);
    // Orient: solid→empty direction
    int s = solid0 ? +1 : -1;
    float fnAxis = (edgeAxis == 0) ? fn.x : ((edgeAxis == 1) ? fn.y : fn.z);
    if ((fnAxis > 0.0) != (s > 0)) fn = -fn;
    return fn; // area-weighted (not normalized)
 }
 // ── Smooth normal for a vertex at cell v ────────────────────────────
 // Checks all 12 incident edges (4 per axis), computes area-weighted face
 // normals from the centroid grid, sums them, then normalizes. Grid reads only.
 float3 computeSmoothNormal(int3 v) {
    float3 accum = float3(0, 0, 0);
    // X-edges: at (v.x, v.y+dy, v.z+dz) for dy,dz in {0,1}
    {
        bool sv = isCellSolid(v);
        bool sv_x1 = isCellSolid(v + int3(1,0,0));
        bool sv_01 = isCellSolid(int3(v.x, v.y+1, v.z));
        bool sv_01_x1 = isCellSolid(int3(v.x+1, v.y+1, v.z));
        bool sv_10 = isCellSolid(int3(v.x, v.y, v.z+1));
        bool sv_10_x1 = isCellSolid(int3(v.x+1, v.y, v.z+1));
        bool sv_11 = isCellSolid(int3(v.x, v.y+1, v.z+1));
        bool sv_11_x1 = isCellSolid(int3(v.x+1, v.y+1, v.z+1));
        // Edge (v.x, v.y, v.z)
        if (sv != sv_x1) {
            accum += computeQuadFaceNormal(
                v + int3(0,-1,-1), v + int3(0,0,-1),
                v + int3(0,-1,0),  v, sv, 0);
        }
        // Edge (v.x, v.y+1, v.z)
        if (sv_01 != sv_01_x1) {
            accum += computeQuadFaceNormal(
                int3(v.x, v.y, v.z-1), int3(v.x, v.y+1, v.z-1),
                v, int3(v.x, v.y+1, v.z), sv_01, 0);
        }
        // Edge (v.x, v.y, v.z+1)
        if (sv_10 != sv_10_x1) {
            accum += computeQuadFaceNormal(
                int3(v.x, v.y-1, v.z), v,
                int3(v.x, v.y-1, v.z+1), int3(v.x, v.y, v.z+1), sv_10, 0);
        }
        // Edge (v.x, v.y+1, v.z+1)
        if (sv_11 != sv_11_x1) {
            accum += computeQuadFaceNormal(
                v, int3(v.x, v.y+1, v.z),
                int3(v.x, v.y, v.z+1), int3(v.x, v.y+1, v.z+1), sv_11, 0);
        }
    }
    // Y-edges: at (v.x+dx, v.y, v.z+dz) for dx,dz in {0,1}
    {
        bool sv = isCellSolid(v);
        bool sv_y1 = isCellSolid(v + int3(0,1,0));
        bool sv_10 = isCellSolid(int3(v.x+1, v.y, v.z));
        bool sv_10_y1 = isCellSolid(int3(v.x+1, v.y+1, v.z));
        bool sv_01 = isCellSolid(int3(v.x, v.y, v.z+1));
        bool sv_01_y1 = isCellSolid(int3(v.x, v.y+1, v.z+1));
        bool sv_11 = isCellSolid(int3(v.x+1, v.y, v.z+1));
        bool sv_11_y1 = isCellSolid(int3(v.x+1, v.y+1, v.z+1));
        if (sv != sv_y1) {
            accum += computeQuadFaceNormal(
                v + int3(-1,0,-1), v + int3(0,0,-1),
                v + int3(-1,0,0),  v, sv, 1);
        }
        if (sv_10 != sv_10_y1) {
            accum += computeQuadFaceNormal(
                int3(v.x, v.y, v.z-1), int3(v.x+1, v.y, v.z-1),
                v, int3(v.x+1, v.y, v.z), sv_10, 1);
        }
        if (sv_01 != sv_01_y1) {
            accum += computeQuadFaceNormal(
                int3(v.x-1, v.y, v.z), v,
                int3(v.x-1, v.y, v.z+1), int3(v.x, v.y, v.z+1), sv_01, 1);
        }
        if (sv_11 != sv_11_y1) {
            accum += computeQuadFaceNormal(
                v, int3(v.x+1, v.y, v.z),
                int3(v.x, v.y, v.z+1), int3(v.x+1, v.y, v.z+1), sv_11, 1);
        }
    }
    // Z-edges: at (v.x+dx, v.y+dy, v.z) for dx,dy in {0,1}
    {
        bool sv = isCellSolid(v);
        bool sv_z1 = isCellSolid(v + int3(0,0,1));
        bool sv_10 = isCellSolid(int3(v.x+1, v.y, v.z));
        bool sv_10_z1 = isCellSolid(int3(v.x+1, v.y, v.z+1));
        bool sv_01 = isCellSolid(int3(v.x, v.y+1, v.z));
        bool sv_01_z1 = isCellSolid(int3(v.x, v.y+1, v.z+1));
        bool sv_11 = isCellSolid(int3(v.x+1, v.y+1, v.z));
        bool sv_11_z1 = isCellSolid(int3(v.x+1, v.y+1, v.z+1));
        if (sv != sv_z1) {
            accum += computeQuadFaceNormal(
                v + int3(-1,-1,0), v + int3(0,-1,0),
                v + int3(-1,0,0),  v, sv, 2);
        }
        if (sv_10 != sv_10_z1) {
            accum += computeQuadFaceNormal(
                int3(v.x, v.y-1, v.z), int3(v.x+1, v.y-1, v.z),
                v, int3(v.x+1, v.y, v.z), sv_10, 2);
        }
        if (sv_01 != sv_01_z1) {
            accum += computeQuadFaceNormal(
                int3(v.x-1, v.y, v.z), v,
                int3(v.x-1, v.y+1, v.z), int3(v.x, v.y+1, v.z), sv_01, 2);
        }
        if (sv_11 != sv_11_z1) {
            accum += computeQuadFaceNormal(
                v, int3(v.x+1, v.y, v.z),
                int3(v.x, v.y+1, v.z), int3(v.x+1, v.y+1, v.z), sv_11, 2);
        }
    }
    float len = length(accum);
    return (len > 0.0001) ? accum / len : float3(0, 1, 0);
 }
 // ── Emit helpers ────────────────────────────────────────────────────
 void emitVertex(uint slot, float3 pos, float3 normal, uint primaryMat, uint secondaryMat, uint blendWeight) {
    GPUSmoothVertex vert;
    vert.px = pos.x; vert.py = pos.y; vert.pz = pos.z;
    vert.nx = normal.x; vert.ny = normal.y; vert.nz = normal.z;
    vert.packedMat = (primaryMat & 0xFF) | ((secondaryMat & 0xFF) << 8) | ((blendWeight & 0xFF) << 16);
    vert.packedChunk = push.chunkIndex & 0xFFFF;
    outputVerts[slot] = vert;
 }
 void emitQuad(float3 p[4], float3 n[4], uint mat, uint secMat, uint blendW, bool windingA) {
    uint slot;
    vertCounter.InterlockedAdd(0, 6, slot);
    if (slot + 6 > push.maxOutputVerts) return;
    if (windingA) {
        emitVertex(slot + 0, p[0], n[0], mat, secMat, blendW);
        emitVertex(slot + 1, p[1], n[1], mat, secMat, blendW);
        emitVertex(slot + 2, p[3], n[3], mat, secMat, blendW);
        emitVertex(slot + 3, p[0], n[0], mat, secMat, blendW);
        emitVertex(slot + 4, p[3], n[3], mat, secMat, blendW);
        emitVertex(slot + 5, p[2], n[2], mat, secMat, blendW);
    } else {
        emitVertex(slot + 0, p[0], n[0], mat, secMat, blendW);
        emitVertex(slot + 1, p[3], n[3], mat, secMat, blendW);
        emitVertex(slot + 2, p[1], n[1], mat, secMat, blendW);
        emitVertex(slot + 3, p[0], n[0], mat, secMat, blendW);
        emitVertex(slot + 4, p[2], n[2], mat, secMat, blendW);
        emitVertex(slot + 5, p[3], n[3], mat, secMat, blendW);
    }
 }
 // ── Main ────────────────────────────────────────────────────────────
 [RootSignature(VOXEL_ROOTSIG)]
 [numthreads(8, 8, 8)]
 void main(uint3 DTid : SV_DispatchThreadID)
 {
    if (any(DTid >= CSIZE)) return;
    int3 cellPos = int3(DTid);
    bool cellSolid = isCellSolid(cellPos);
    float3 chunkWorldPos = chunkInfo[push.chunkIndex].worldPos.xyz;
    // ── X-edge: cellPos → cellPos + (1,0,0) ────────────────────────
    {
        bool neighborSolid = isCellSolid(cellPos + int3(1, 0, 0));
        if (cellSolid != neighborSolid) {
            int3 cells[4] = {
                cellPos + int3(0, -1, -1),
                cellPos + int3(0,  0, -1),
                cellPos + int3(0, -1,  0),
                cellPos
            };
            if (isCentroidValid(cells[0]) && isCentroidValid(cells[1]) &&
                isCentroidValid(cells[2]) && isCentroidValid(cells[3])) {
                float3 p[4], n[4];
                [loop] for (uint i = 0; i < 4; i++)
                    p[i] = chunkWorldPos + readCentroidPos(cells[i]);
                [loop] for (uint i = 0; i < 4; i++)
                    n[i] = computeSmoothNormal(cells[i]);
                float3 fn = cross(p[1] - p[0], p[3] - p[0]);
                int s = cellSolid ? +1 : -1;
                if ((fn.x > 0.0) != (s > 0)) fn = -fn;
                bool windingA = !cellSolid;
                uint packed = readGridPacked(cells[3]);
                uint mat = packed & 0xFF;
                uint secMat = (packed >> 8) & 0xFF;
                uint blendW = (packed >> 16) & 0xFF;
                emitQuad(p, n, mat, secMat, blendW, windingA);
            }
        }
    }
    // ── Y-edge: cellPos → cellPos + (0,1,0) ────────────────────────
    {
        bool neighborSolid = isCellSolid(cellPos + int3(0, 1, 0));
        if (cellSolid != neighborSolid) {
            int3 cells[4] = {
                cellPos + int3(-1, 0, -1),
                cellPos + int3( 0, 0, -1),
                cellPos + int3(-1, 0,  0),
                cellPos
            };
            if (isCentroidValid(cells[0]) && isCentroidValid(cells[1]) &&
                isCentroidValid(cells[2]) && isCentroidValid(cells[3])) {
                float3 p[4], n[4];
                [loop] for (uint i = 0; i < 4; i++)
                    p[i] = chunkWorldPos + readCentroidPos(cells[i]);
                [loop] for (uint i = 0; i < 4; i++)
                    n[i] = computeSmoothNormal(cells[i]);
                float3 fn = cross(p[1] - p[0], p[3] - p[0]);
                int s = cellSolid ? +1 : -1;
                if ((fn.y > 0.0) != (s > 0)) fn = -fn;
                bool windingA = !cellSolid;
                windingA = !windingA; // Y-axis winding flip
                uint packed = readGridPacked(cells[3]);
                uint mat = packed & 0xFF;
                uint secMat = (packed >> 8) & 0xFF;
                uint blendW = (packed >> 16) & 0xFF;
                emitQuad(p, n, mat, secMat, blendW, windingA);
            }
        }
    }
    // ── Z-edge: cellPos → cellPos + (0,0,1) ────────────────────────
    {
        bool neighborSolid = isCellSolid(cellPos + int3(0, 0, 1));
        if (cellSolid != neighborSolid) {
            int3 cells[4] = {
                cellPos + int3(-1, -1, 0),
                cellPos + int3( 0, -1, 0),
                cellPos + int3(-1,  0, 0),
                cellPos
            };
            if (isCentroidValid(cells[0]) && isCentroidValid(cells[1]) &&
                isCentroidValid(cells[2]) && isCentroidValid(cells[3])) {
                float3 p[4], n[4];
                [loop] for (uint i = 0; i < 4; i++)
                    p[i] = chunkWorldPos + readCentroidPos(cells[i]);
                [loop] for (uint i = 0; i < 4; i++)
                    n[i] = computeSmoothNormal(cells[i]);
                float3 fn = cross(p[1] - p[0], p[3] - p[0]);
                int s = cellSolid ? +1 : -1;
                if ((fn.z > 0.0) != (s > 0)) fn = -fn;
                bool windingA = !cellSolid;
                uint packed = readGridPacked(cells[3]);
                uint mat = packed & 0xFF;
                uint secMat = (packed >> 8) & 0xFF;
                uint blendW = (packed >> 16) & 0xFF;
                emitQuad(p, n, mat, secMat, blendW, windingA);
            }
        }
    }
 }
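Review note (illustrative, not part of the commit): the centroid grid addressing and the packed `w` layout described in the shader header can be mirrored on the CPU for debugging a readback. This is a minimal sketch; `packCell`, `unpackCell` and `Cell` are hypothetical names, and the bit positions follow the layout documented in the shader comments.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t GRID_DIM = 34;  // cells [-1..32] per axis, as in the shaders

// Same addressing as the shader's gridIndex(): the +1 shifts cell -1 to slot 0.
constexpr uint32_t gridIndex(int x, int y, int z, uint32_t base = 0) {
    return base + uint32_t(z + 1) * GRID_DIM * GRID_DIM
                + uint32_t(y + 1) * GRID_DIM
                + uint32_t(x + 1);
}

// Packed layout: [7:0]=mat, [15:8]=secMat, [23:16]=blend, bit24=valid, bit25=solid.
constexpr uint32_t packCell(uint32_t mat, uint32_t secMat, uint32_t blend,
                            bool valid, bool solid) {
    return (mat & 0xFF) | ((secMat & 0xFF) << 8) | ((blend & 0xFF) << 16)
         | (uint32_t(valid) << 24) | (uint32_t(solid) << 25);
}

struct Cell { uint32_t mat, secMat, blend; bool valid, solid; };

inline Cell unpackCell(uint32_t p) {
    Cell c;
    c.mat    = p & 0xFF;
    c.secMat = (p >> 8) & 0xFF;
    c.blend  = (p >> 16) & 0xFF;
    c.valid  = ((p >> 24) & 1) != 0;
    c.solid  = ((p >> 25) & 1) != 0;
    return c;
}
```

On the GPU the packed word travels through the `float4.w` channel via `asfloat`/`asuint`, so a CPU reader would need to reinterpret the bits (for example with `memcpy`) rather than convert the float numerically.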
--- a/shaders/voxelSmoothCentroidCS.hlsl
+++ b/shaders/voxelSmoothCentroidCS.hlsl
@@ -0,0 +1,256 @@
 // BVLE Voxels - GPU Smooth Mesher Pass 1: Centroid + Solid Grid
 // Computes centroid position + material for each surface cell and writes
 // to a per-chunk centroid grid buffer (34^3 entries for cells [-1..32]).
 //
 // IMPORTANT: Also writes a "solid" flag for ALL cells (surface or not).
 // The emit shader (pass 2) reads ONLY from this grid — no voxel buffer access.
 // This makes the emit shader much simpler and faster to compile.
 //
 // Grid format (float4 per cell):
 //   xyz = chunk-local position (only valid for surface cells)
 //   w   = asfloat(packed):
 //         bits [7:0]   = primaryMat
 //         bits [15:8]  = secondaryMat
 //         bits [23:16] = blendWeight
 //         bit  24      = valid (has centroid, is surface cell)
 //         bit  25      = solid (readVox != 0 at this cell position)
 //
 // Dispatch: 5x5x5 groups of 8x8x8 threads per chunk (covers 40^3, clipped to 34^3)
 #include "voxelCommon.hlsli"
 struct SmoothPush {
    uint chunkIndex;
    uint voxelBufferOffset;
    uint maxOutputVerts;
    uint centroidGridOffset;
    uint pad[8];
 };
 [[vk::push_constant]] ConstantBuffer<SmoothPush> push : register(b999);
 StructuredBuffer<uint> voxelData : register(t0);
 StructuredBuffer<GPUChunkInfo> chunkInfo : register(t1);
 RWStructuredBuffer<float4> centroidGrid : register(u0);
 static const uint CSIZE = 32;
 static const uint GRID_DIM = 34;
 static const uint WORDS_PER_CHUNK = CSIZE * CSIZE * CSIZE / 2;
 static const int3 cornerOff[8] = {
    int3(0,0,0), int3(1,0,0), int3(0,1,0), int3(1,1,0),
    int3(0,0,1), int3(1,0,1), int3(0,1,1), int3(1,1,1)
 };
 static const float3 cornerOffF[8] = {
    float3(0,0,0), float3(1,0,0), float3(0,1,0), float3(1,1,0),
    float3(0,0,1), float3(1,0,1), float3(0,1,1), float3(1,1,1)
 };
 static const uint2 edgePairs[12] = {
    uint2(0,1), uint2(2,3), uint2(4,5), uint2(6,7),
    uint2(0,2), uint2(1,3), uint2(4,6), uint2(5,7),
    uint2(0,4), uint2(1,5), uint2(2,6), uint2(3,7)
 };
 static const int3 dirs6[6] = {
    int3(1,0,0), int3(-1,0,0), int3(0,1,0), int3(0,-1,0), int3(0,0,1), int3(0,0,-1)
 };
 // ── Voxel reading (with cross-chunk neighbor support) ───────────────
 uint readVoxelAt(uint bufferOffset, uint flatIndex) {
    uint pairIndex = flatIndex >> 1;
    uint shift = (flatIndex & 1) * 16;
    return (voxelData[bufferOffset + pairIndex] >> shift) & 0xFFFF;
 }
 uint readVoxel(uint flatIndex) {
    return readVoxelAt(push.voxelBufferOffset, flatIndex);
 }
 uint readVox(int3 p) {
    int3 localP = p;
    int3 chunkOff = int3(0, 0, 0);
    if (p.x < 0)                { chunkOff.x = -1; localP.x += CSIZE; }
    else if (p.x >= (int)CSIZE) { chunkOff.x =  1; localP.x -= CSIZE; }
    if (p.y < 0)                { chunkOff.y = -1; localP.y += CSIZE; }
    else if (p.y >= (int)CSIZE) { chunkOff.y =  1; localP.y -= CSIZE; }
    if (p.z < 0)                { chunkOff.z = -1; localP.z += CSIZE; }
    else if (p.z >= (int)CSIZE) { chunkOff.z =  1; localP.z -= CSIZE; }
    if (chunkOff.x == 0 && chunkOff.y == 0 && chunkOff.z == 0) {
        uint fi = (uint)localP.x + (uint)localP.y * CSIZE + (uint)localP.z * CSIZE * CSIZE;
        return readVoxel(fi);
    }
    int axisCount = abs(chunkOff.x) + abs(chunkOff.y) + abs(chunkOff.z);
    if (axisCount > 1) return 0; // edge/corner neighbor chunks are not indexed: treated as empty
    uint neighborFace;
    if      (chunkOff.x > 0) neighborFace = 0;
    else if (chunkOff.x < 0) neighborFace = 1;
    else if (chunkOff.y > 0) neighborFace = 2;
    else if (chunkOff.y < 0) neighborFace = 3;
    else if (chunkOff.z > 0) neighborFace = 4;
    else                      neighborFace = 5;
    GPUChunkInfo ci = chunkInfo[push.chunkIndex];
    uint nIdx = getNeighborIdx(ci, neighborFace);
    if (nIdx == 0xFFFFFFFF) return 0;
    uint fi = (uint)localP.x + (uint)localP.y * CSIZE + (uint)localP.z * CSIZE * CSIZE;
    return readVoxelAt(nIdx * WORDS_PER_CHUNK, fi);
 }
 bool isSmooth(uint voxel) { return (voxel != 0) && ((voxel >> 4) & 0x1); }
 uint getMatID(uint voxel) { return (voxel >> 8) & 0xFF; }
 uint gridIndex(int3 cellPos) {
    return push.centroidGridOffset +
           (uint)(cellPos.z + 1) * GRID_DIM * GRID_DIM +
           (uint)(cellPos.y + 1) * GRID_DIM +
           (uint)(cellPos.x + 1);
 }
 // Flags byte (bits 24-31 of packed w):
 //   bit 0 (24): valid centroid
 //   bit 1 (25): cell is solid
 static const uint FLAG_VALID = 1u;
 static const uint FLAG_SOLID = 2u;
 [RootSignature(VOXEL_ROOTSIG)]
 [numthreads(8, 8, 8)]
 void main(uint3 DTid : SV_DispatchThreadID)
 {
    int3 cellPos = int3(DTid) - 1;
    if (any(cellPos > 32)) return;
    uint idx = gridIndex(cellPos);
    // Determine if cell center is solid (needed by emit shader for edge sign checks)
    uint voxAtCell = readVox(cellPos);
    uint solidFlag = (voxAtCell != 0) ? FLAG_SOLID : 0u;
    // Read SDF at 8 corners
    float corner[8];
    bool hasPos = false, hasNeg = false;
    bool hasSmoothFlag = false;
    [unroll]
    for (uint c = 0; c < 8; c++) {
        int3 cp = cellPos + cornerOff[c];
        uint vox = readVox(cp);
        corner[c] = (vox == 0) ? 1.0 : -1.0;
        if (corner[c] < 0.0) hasNeg = true;
        else hasPos = true;
        if (isSmooth(vox)) hasSmoothFlag = true;
    }
    // Not a surface cell → write only solid flag
    if (!hasPos || !hasNeg) {
        centroidGrid[idx] = float4(0, 0, 0, asfloat(solidFlag << 24));
        return;
    }
    // Must be near smooth voxels
    if (!hasSmoothFlag) {
        bool nearSmooth = false;
        [unroll]
        for (uint c = 0; c < 8 && !nearSmooth; c++) {
            int3 cp = cellPos + cornerOff[c];
            [unroll]
            for (uint d = 0; d < 6 && !nearSmooth; d++) {
                uint nv = readVox(cp + dirs6[d]);
                if (isSmooth(nv)) nearSmooth = true;
            }
        }
        if (!nearSmooth) {
            centroidGrid[idx] = float4(0, 0, 0, asfloat(solidFlag << 24));
            return;
        }
    }
    // Compute centroid from edge crossings
    float3 sum = float3(0, 0, 0);
    uint crossCount = 0;
    [unroll]
    for (uint e = 0; e < 12; e++) {
        float s0 = corner[edgePairs[e].x];
        float s1 = corner[edgePairs[e].y];
        if ((s0 < 0.0) == (s1 < 0.0)) continue;
        float t = clamp(s0 / (s0 - s1), 0.01, 0.99);
        sum += cornerOffF[edgePairs[e].x] + t * (cornerOffF[edgePairs[e].y] - cornerOffF[edgePairs[e].x]);
        crossCount++;
    }
    if (crossCount == 0) {
        centroidGrid[idx] = float4(0, 0, 0, asfloat(solidFlag << 24));
        return;
    }
    float3 cen = sum / (float)crossCount;
    // Boundary clamping
    [unroll]
    for (uint c = 0; c < 8; c++) {
        if (corner[c] >= 0.0) continue;
        int3 cp = cellPos + cornerOff[c];
        uint vox = readVox(cp);
        if (vox != 0 && !isSmooth(vox)) {
            if (cornerOff[c].x == 0) cen.x = max(cen.x, 0.5);
            else                     cen.x = min(cen.x, 0.5);
            if (cornerOff[c].y == 0) cen.y = max(cen.y, 0.5);
            else                     cen.y = min(cen.y, 0.5);
            if (cornerOff[c].z == 0) cen.z = max(cen.z, 0.5);
            else                     cen.z = min(cen.z, 0.5);
        }
    }
    float3 localPos = float3(cellPos) + float3(0.5, 0.5, 0.5) + cen;
    // Material determination
    uint matIDs[8];
    bool matIsSmooth[8];
    uint solidCount = 0;
    [unroll]
    for (uint cc = 0; cc < 8; cc++) {
        if (corner[cc] >= 0.0) continue;
        int3 cp = cellPos + cornerOff[cc];
        uint vox = readVox(cp);
        if (vox == 0) continue;
        matIDs[solidCount] = getMatID(vox);
        matIsSmooth[solidCount] = isSmooth(vox);
        solidCount++;
    }
    uint bestMat = 6, bestCount = 0, smoothSolidCount = 0;
    [unroll] for (uint i = 0; i < 8; i++) {
        if (i >= solidCount) break;
        if (matIsSmooth[i]) smoothSolidCount++;
    }
    [unroll] for (uint i = 0; i < 8; i++) {
        if (i >= solidCount) break;
        bool useSmooth = (smoothSolidCount > 0);
        if (useSmooth && !matIsSmooth[i]) continue;
        uint mat = matIDs[i];
        uint cnt = 0;
        [unroll] for (uint j = 0; j < 8; j++) {
            if (j >= solidCount) break;
            if (useSmooth && !matIsSmooth[j]) continue;
            if (matIDs[j] == mat) cnt++;
        }
        if (cnt > bestCount) { bestCount = cnt; bestMat = mat; }
    }
    uint secMat = bestMat, secCount = 0;
    [unroll] for (uint i = 0; i < 8; i++) {
        if (i >= solidCount) break;
        if (matIDs[i] == bestMat) continue;
        uint mat = matIDs[i];
        uint cnt = 0;
        [unroll] for (uint j = 0; j < 8; j++) {
            if (j >= solidCount) break;
            if (matIDs[j] == mat && mat != bestMat) cnt++;
        }
        if (cnt > secCount) { secCount = cnt; secMat = mat; }
    }
    uint blendW = (secCount > 0 && secMat != bestMat) ? 255 : 0;
    uint flags = FLAG_VALID | solidFlag;
    uint packed = (bestMat & 0xFF) | ((secMat & 0xFF) << 8) | ((blendW & 0xFF) << 16) | (flags << 24);
    centroidGrid[idx] = float4(localPos, asfloat(packed));
 }
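Review note (illustrative, not part of the commit): the "centroid from edge crossings" step can be checked on the CPU with the same corner and edge tables as the shader. This is a minimal sketch, assuming the binary ±1 SDF used in pass 1; `Vec3` and `cellCentroid` are hypothetical names, and boundary clamping and material selection are left out.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Same corner and edge ordering as cornerOff[] and edgePairs[] in the shader.
static const int kCorner[8][3] = {
    {0,0,0},{1,0,0},{0,1,0},{1,1,0},{0,0,1},{1,0,1},{0,1,1},{1,1,1}};
static const int kEdge[12][2] = {
    {0,1},{2,3},{4,5},{6,7},{0,2},{1,3},{4,6},{5,7},{0,4},{1,5},{2,6},{3,7}};

// sdf[c] < 0 means corner c is solid. Writes the mean of all edge crossings
// (cell-local, in [0,1]^3) to 'out'; returns false when no edge changes sign.
bool cellCentroid(const std::array<float, 8>& sdf, Vec3& out) {
    float sx = 0.0f, sy = 0.0f, sz = 0.0f;
    int crossings = 0;
    for (const auto& e : kEdge) {
        const float s0 = sdf[e[0]], s1 = sdf[e[1]];
        if ((s0 < 0.0f) == (s1 < 0.0f)) continue;      // no sign change on this edge
        // Linear interpolation of the zero crossing; with a ±1 SDF this is 0.5.
        const float t = std::min(std::max(s0 / (s0 - s1), 0.01f), 0.99f);
        sx += kCorner[e[0]][0] + t * (kCorner[e[1]][0] - kCorner[e[0]][0]);
        sy += kCorner[e[0]][1] + t * (kCorner[e[1]][1] - kCorner[e[0]][1]);
        sz += kCorner[e[0]][2] + t * (kCorner[e[1]][2] - kCorner[e[0]][2]);
        ++crossings;
    }
    if (crossings == 0) return false;
    out = {sx / crossings, sy / crossings, sz / crossings};
    return true;
}
```

For a flat floor (the four y=0 corners solid) the four vertical edges cross at their midpoints and the centroid lands at the cell center; a single solid corner pulls the centroid toward that corner.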
--- a/src/voxel/TopingSystem.cpp
+++ b/src/voxel/TopingSystem.cpp
@@ -1,5 +1,6 @@
 #include "TopingSystem.h"
 #include "VoxelWorld.h"
 #include "wiJobSystem.h"
 #include <cmath>
 #include <cstring>
@@ -541,4 +542,127 @@ void TopingSystem::collectInstances(const VoxelWorld& world) {
    });
 }
 void TopingSystem::collectInstancesParallel(const VoxelWorld& world) {
    instances_.clear();
    // Quick lookup: material -> toping def index (-1 if none)
    int8_t matToDef[256];
    memset(matToDef, -1, sizeof(matToDef));
    for (size_t i = 0; i < defs_.size(); i++) {
        matToDef[defs_[i].materialID] = (int8_t)i;
    }
    // Collect chunk pointers for parallel dispatch
    std::vector<const Chunk*> chunkPtrs;
    std::vector<ChunkPos> chunkPositions;
    world.forEachChunk([&](const ChunkPos& cpos, const Chunk& chunk) {
        chunkPtrs.push_back(&chunk);
        chunkPositions.push_back(cpos);
    });
    const uint32_t numChunks = (uint32_t)chunkPtrs.size();
    if (numChunks == 0) return;
    // Per-chunk local instance vectors (no locking needed)
    std::vector<std::vector<TopingInstance>> perChunkInstances(numChunks);
    wi::jobsystem::context ctx;
    wi::jobsystem::Dispatch(ctx, numChunks, 1,
        [&](wi::jobsystem::JobArgs args) {
            const uint32_t ci = args.jobIndex;
            const Chunk& chunk = *chunkPtrs[ci];
            const ChunkPos& cpos = chunkPositions[ci];
            auto& localInstances = perChunkInstances[ci];
            // Pre-cache neighbor chunks: the 3x3 XZ block at this Y level and the
            // 3x3 block at Y+1 (18 pointers), so voxel reads avoid world.getChunk().
            // Layout: [dx+1][dz+1].
            const Chunk* ncXZ[3][3] = {};   // neighbors at same Y
            const Chunk* ncXZup[3][3] = {}; // neighbors at Y+1
            for (int dz = -1; dz <= 1; dz++)
            for (int dx = -1; dx <= 1; dx++) {
                ncXZ[dx+1][dz+1] = world.getChunk(
                    ChunkPos{cpos.x + dx, cpos.y, cpos.z + dz});
                ncXZup[dx+1][dz+1] = world.getChunk(
                    ChunkPos{cpos.x + dx, cpos.y + 1, cpos.z + dz});
            }
            // Fast voxel read using cached chunk pointers
            auto readVoxel = [&](int wx, int wy, int wz) -> VoxelData {
                int lx = wx - cpos.x * CHUNK_SIZE;
                int ly = wy - cpos.y * CHUNK_SIZE;
                int lz = wz - cpos.z * CHUNK_SIZE;
                int cx = (lx < 0) ? 0 : (lx >= CHUNK_SIZE) ? 2 : 1;
                int cz = (lz < 0) ? 0 : (lz >= CHUNK_SIZE) ? 2 : 1;
                const Chunk* nc;
                if (ly >= 0 && ly < CHUNK_SIZE) {
                    nc = ncXZ[cx][cz];
                } else if (ly >= CHUNK_SIZE && ly < CHUNK_SIZE * 2) {
                    nc = ncXZup[cx][cz];
                    ly -= CHUNK_SIZE;
                } else {
                    return VoxelData{}; // out of cached range
                }
                if (!nc) return VoxelData{};
                int flx = ((lx % CHUNK_SIZE) + CHUNK_SIZE) % CHUNK_SIZE;
                int flz = ((lz % CHUNK_SIZE) + CHUNK_SIZE) % CHUNK_SIZE;
                return nc->at(flx, ly, flz);
            };
            for (int z = 0; z < CHUNK_SIZE; z++) {
                for (int y = 0; y < CHUNK_SIZE; y++) {
                    for (int x = 0; x < CHUNK_SIZE; x++) {
                        const VoxelData& v = chunk.at(x, y, z);
                        if (v.isEmpty()) continue;
                        const uint8_t mat = v.getMaterialID();
                        const int8_t defIdx = matToDef[mat];
                        if (defIdx < 0) continue;
                        const TopingDef& def = defs_[defIdx];
                        const int wx = cpos.x * CHUNK_SIZE + x;
                        const int wy = cpos.y * CHUNK_SIZE + y;
                        const int wz = cpos.z * CHUNK_SIZE + z;
                        if (def.face == FACE_POS_Y) {
                            if (!readVoxel(wx, wy + 1, wz).isEmpty()) continue;
                            uint8_t adj = 0;
                            const uint8_t myPriority = def.priority;
                            auto checkNeighbor = [&](int nx, int nz) -> bool {
                                uint8_t nMat = readVoxel(nx, wy, nz).getMaterialID();
                                if (nMat == 0) return false;
                                if (!readVoxel(nx, wy + 1, nz).isEmpty()) return false;
                                if (nMat == mat) return true;
                                int8_t nDefIdx = matToDef[nMat];
                                if (nDefIdx >= 0 && defs_[nDefIdx].priority >= myPriority) return true;
                                return false;
                            };
                            if (checkNeighbor(wx + 1, wz)) adj |= 1;
                            if (checkNeighbor(wx - 1, wz)) adj |= 2;
                            if (checkNeighbor(wx, wz + 1)) adj |= 4;
                            if (checkNeighbor(wx, wz - 1)) adj |= 8;
                            localInstances.push_back({
                                (float)wx, (float)wy, (float)wz,
                                (uint16_t)defIdx, adj
                            });
                        }
                    }
                }
            }
        });
    wi::jobsystem::Wait(ctx);
    // Merge per-chunk instances
    size_t total = 0;
    for (auto& v : perChunkInstances) total += v.size();
    instances_.reserve(total);
    for (auto& v : perChunkInstances) {
        instances_.insert(instances_.end(), v.begin(), v.end());
    }
 }
 } // namespace voxel
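The cached-pointer read in `collectInstancesParallel` relies on two small mappings: a chunk-local coordinate picks one of three neighbor slots per axis, and the same coordinate is wrapped into the neighbor's local range. A minimal sketch of both, assuming a chunk edge of 32 (the value implied by `CHUNK_SIZE + 1 = 33` elsewhere in this commit); `kChunkSize`, `neighborSlot` and `wrapLocal` are hypothetical names:

```cpp
#include <cassert>

// Assumed chunk edge length; the real constant is CHUNK_SIZE.
constexpr int kChunkSize = 32;

// 0 = neighbor on the negative side, 1 = this chunk, 2 = positive side.
inline int neighborSlot(int local) {
    return (local < 0) ? 0 : (local >= kChunkSize) ? 2 : 1;
}

// Wrap a coordinate into [0, kChunkSize). The double modulo is needed
// because % keeps the sign of the dividend in C++.
inline int wrapLocal(int local) {
    return ((local % kChunkSize) + kChunkSize) % kChunkSize;
}
```

Both helpers are only meaningful for coordinates within one chunk of the current one, i.e. in `[-kChunkSize, 2 * kChunkSize)`, which matches the range the cached pointers cover.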
--- a/src/voxel/TopingSystem.h
+++ b/src/voxel/TopingSystem.h
@ -59,6 +59,7 @@ class TopingSystem {
 public:
    void initialize();
    void collectInstances(const VoxelWorld& world);
    void collectInstancesParallel(const VoxelWorld& world);
    // Accessors for Phase 4.2 GPU upload
    const std::vector<TopingVertex>&  getVertices()  const { return vertices_; }
--- a/src/voxel/VoxelMesher.cpp
+++ b/src/voxel/VoxelMesher.cpp
@ -298,23 +298,87 @@ void SmoothMesher::computeNormal(const Chunk& chunk, const VoxelWorld& world,
    }
 }
 // Thread-local scratch buffers to avoid per-chunk allocation overhead.
 // Each worker thread gets its own set, eliminating malloc/free thrashing.
 struct SmoothScratch {
    float sdf[GRID * GRID * GRID];
    uint8_t smoothGrid[GRID * GRID * GRID];
    uint8_t smoothNear[GRID * GRID * GRID]; // dilated: 1 if smooth OR face-adjacent to smooth
    VoxelData voxelGrid[GRID * GRID * GRID];
    int32_t vertexMap[33 * 33 * 33]; // VERT_RANGE³
 };
 static thread_local SmoothScratch* tls_scratch = nullptr;
 uint32_t SmoothMesher::meshChunk(Chunk& chunk, const VoxelWorld& world) {
    chunk.smoothVertices.clear();
    chunk.hasSmooth = false;
-    // ── Step 1: Build SDF grid + smooth flag grid ────────────────
+    // ── Early exit: skip chunks far from any smooth voxels ──────
    // Check this chunk + 26 neighbors for containsSmooth flag.
    // This avoids the expensive 36³ grid fill for ~70% of chunks.
    {
        bool nearSmooth = chunk.containsSmooth;
        if (!nearSmooth) {
            for (int dz = -1; dz <= 1 && !nearSmooth; dz++)
            for (int dy = -1; dy <= 1 && !nearSmooth; dy++)
            for (int dx = -1; dx <= 1 && !nearSmooth; dx++) {
                if (dx == 0 && dy == 0 && dz == 0) continue;
                const Chunk* nc = world.getChunk(
                    ChunkPos{chunk.pos.x + dx, chunk.pos.y + dy, chunk.pos.z + dz});
                if (nc && nc->containsSmooth) nearSmooth = true;
            }
        }
        if (!nearSmooth) return 0;
    }
    // Allocate thread-local scratch once per thread (persists across calls;
    // never freed here, so it lives for the rest of the process).
    if (!tls_scratch) tls_scratch = new SmoothScratch();
    auto& scratch = *tls_scratch;
    // ── Step 1: Build SDF grid + smooth flag grid + voxel cache ──
    // PAD=2 so we have SDF data for cells at [-1..CHUNK_SIZE] (all 8 corners accessible)
    // Also build an "isSmooth" grid for the same range to detect proximity to smooth voxels.
-    std::vector<float> sdf(GRID * GRID * GRID, 1.0f);
+    // voxelGrid caches VoxelData to avoid repeated cross-chunk hashmap lookups later.
-    // smoothGrid: true if the voxel at that position is smooth
+    float* sdf = scratch.sdf;
-    std::vector<uint8_t> smoothGrid(GRID * GRID * GRID, 0);
+    uint8_t* smoothGrid = scratch.smoothGrid;
    VoxelData* voxelGrid = scratch.voxelGrid;
    constexpr int GRID3 = GRID * GRID * GRID;
    std::memset(smoothGrid, 0, GRID3);
    // SDF defaults to 1.0f (empty) — fill below
    for (int i = 0; i < GRID3; i++) sdf[i] = 1.0f;
    bool anySmooth = false;
    // Pre-cache neighbor chunk pointers for fast cross-chunk access
    const Chunk* neighborChunks[3][3][3] = {};
    for (int dz = -1; dz <= 1; dz++)
    for (int dy = -1; dy <= 1; dy++)
    for (int dx = -1; dx <= 1; dx++) {
        neighborChunks[dx+1][dy+1][dz+1] = world.getChunk(
            ChunkPos{chunk.pos.x + dx, chunk.pos.y + dy, chunk.pos.z + dz});
    }
    // Helper: fast voxel read using cached neighbor chunk pointers
    auto readVoxelFast = [&](int x, int y, int z) -> VoxelData {
        if (x >= 0 && x < CHUNK_SIZE && y >= 0 && y < CHUNK_SIZE && z >= 0 && z < CHUNK_SIZE)
            return chunk.at(x, y, z);
        // Determine which neighbor chunk
        int cx = (x < 0) ? 0 : (x >= CHUNK_SIZE) ? 2 : 1;
        int cy = (y < 0) ? 0 : (y >= CHUNK_SIZE) ? 2 : 1;
        int cz = (z < 0) ? 0 : (z >= CHUNK_SIZE) ? 2 : 1;
        const Chunk* nc = neighborChunks[cx][cy][cz];
        if (!nc) return VoxelData{};  // empty if chunk not loaded
        int lx = ((x % CHUNK_SIZE) + CHUNK_SIZE) % CHUNK_SIZE;
        int ly = ((y % CHUNK_SIZE) + CHUNK_SIZE) % CHUNK_SIZE;
        int lz = ((z % CHUNK_SIZE) + CHUNK_SIZE) % CHUNK_SIZE;
        return nc->at(lx, ly, lz);
    };
    for (int z = -PAD; z < CHUNK_SIZE + PAD; z++) {
        for (int y = -PAD; y < CHUNK_SIZE + PAD; y++) {
            for (int x = -PAD; x < CHUNK_SIZE + PAD; x++) {
                int gi = gridIdx(x, y, z);
-                VoxelData v = readVoxel(chunk, world, x, y, z);
+                VoxelData v = readVoxelFast(x, y, z);
                voxelGrid[gi] = v;
                sdf[gi] = v.isEmpty() ? 1.0f : -1.0f;
                if (v.isSmooth()) {
                    smoothGrid[gi] = 1;
@ -340,6 +404,24 @@ uint32_t SmoothMesher::meshChunk(Chunk& chunk, const VoxelWorld& world) {
    if (!anySmooth) return 0;
    chunk.hasSmooth = true;
    // ── Step 1b: Dilate smoothGrid → smoothNear ──────────────────
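    // Pre-compute "smooth or face-adjacent to smooth" to reduce the per-cell
    // hasSmooth check from 56 lookups (8 corners x (1 + 6 neighbors)) to 8.
    // The outermost padding layer is copied but not used as a dilation source,
    // since its neighbors would index outside the grid.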
    uint8_t* smoothNear = scratch.smoothNear;
    std::memcpy(smoothNear, smoothGrid, GRID3);
    for (int z = -PAD + 1; z < CHUNK_SIZE + PAD - 1; z++)
    for (int y = -PAD + 1; y < CHUNK_SIZE + PAD - 1; y++)
    for (int x = -PAD + 1; x < CHUNK_SIZE + PAD - 1; x++) {
        if (smoothGrid[gridIdx(x, y, z)]) {
            smoothNear[gridIdx(x+1, y, z)] = 1;
            smoothNear[gridIdx(x-1, y, z)] = 1;
            smoothNear[gridIdx(x, y+1, z)] = 1;
            smoothNear[gridIdx(x, y-1, z)] = 1;
            smoothNear[gridIdx(x, y, z+1)] = 1;
            smoothNear[gridIdx(x, y, z-1)] = 1;
        }
    }
    // ── Step 2: Generate vertices for surface cells ──────────────
    // Extended range: [-1, CHUNK_SIZE) for cross-chunk connectivity.
    // This chunk generates vertices for cells at [-1..CHUNK_SIZE-1].
@ -347,7 +429,8 @@ uint32_t SmoothMesher::meshChunk(Chunk& chunk, const VoxelWorld& world) {
    static constexpr int VERT_MIN = -1;
    static constexpr int VERT_MAX = CHUNK_SIZE; // exclusive
    static constexpr int VERT_RANGE = VERT_MAX - VERT_MIN; // CHUNK_SIZE + 1 = 33
-    std::vector<int32_t> vertexMap(VERT_RANGE * VERT_RANGE * VERT_RANGE, -1);
+    int32_t* vertexMap = scratch.vertexMap;
    std::memset(vertexMap, -1, VERT_RANGE * VERT_RANGE * VERT_RANGE * sizeof(int32_t));
    auto vertMapIdx = [](int x, int y, int z) -> int {
        // shift coordinates by -VERT_MIN = +1 so index range is [0, VERT_RANGE)
@ -377,27 +460,13 @@ uint32_t SmoothMesher::meshChunk(Chunk& chunk, const VoxelWorld& world) {
    for (int z = VERT_MIN; z < VERT_MAX; z++) {
        for (int y = VERT_MIN; y < VERT_MAX; y++) {
            for (int x = VERT_MIN; x < VERT_MAX; x++) {
-                // hasSmooth check: at least one corner of the cell must be a smooth
-                // voxel OR be face-adjacent (6-connected) to a smooth voxel.
-                // The 1-voxel extension ensures cells at the smooth↔blocky boundary
-                // generate vertices for quad connectivity (closing the gap).
-                // Checking direct face-adjacency (not neighbor cells' corners) prevents
-                // the smooth mesh from cascading into underground blocky territory.
+                // hasSmooth check via dilated grid: at least one corner must be
+                // smooth or face-adjacent to smooth. Uses pre-dilated smoothNear
+                // grid → only 8 lookups instead of 56.
                bool hasSmooth = false;
-                for (int dz = 0; dz <= 1 && !hasSmooth; dz++)
-                for (int dy = 0; dy <= 1 && !hasSmooth; dy++)
-                for (int dx = 0; dx <= 1 && !hasSmooth; dx++) {
-                    int gx = x + dx, gy = y + dy, gz = z + dz;
-                    if (smoothGrid[gridIdx(gx, gy, gz)]) {
+                for (int c = 0; c < 8 && !hasSmooth; c++) {
+                    if (smoothNear[gridIdx(x + cornerOff[c][0], y + cornerOff[c][1], z + cornerOff[c][2])])
                        hasSmooth = true;
-                    } else {
-                        // Check 6 face-neighbors of this corner for smooth voxels
-                        static const int d6[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};
-                        for (int d = 0; d < 6 && !hasSmooth; d++) {
-                            if (smoothGrid[gridIdx(gx+d6[d][0], gy+d6[d][1], gz+d6[d][2])])
-                                hasSmooth = true;
-                        }
-                    }
                }
                if (!hasSmooth) continue;
@ -455,8 +524,8 @@ uint32_t SmoothMesher::meshChunk(Chunk& chunk, const VoxelWorld& world) {
                bool blockyZlo = false, blockyZhi = false;
                for (int c = 0; c < 8; c++) {
                    if (corner[c] >= 0.0f) continue; // empty corner
-                    VoxelData v = readVoxel(chunk, world,
+                    VoxelData v = voxelGrid[gridIdx(
-                        x + cornerOff[c][0], y + cornerOff[c][1], z + cornerOff[c][2]);
+                        x + cornerOff[c][0], y + cornerOff[c][1], z + cornerOff[c][2])];
                    if (!v.isEmpty() && !v.isSmooth()) {
                        // This corner is a blocky solid
                        if (cornerOff[c][0] == 0) blockyXlo = true; else blockyXhi = true;
@ -483,8 +552,8 @@ uint32_t SmoothMesher::meshChunk(Chunk& chunk, const VoxelWorld& world) {
                int smoothCount = 0;
                for (int c = 0; c < 8; c++) {
                    if (corner[c] < 0.0f) {
-                        VoxelData v = readVoxel(chunk, world,
+                        VoxelData v = voxelGrid[gridIdx(
-                            x + cornerOff[c][0], y + cornerOff[c][1], z + cornerOff[c][2]);
+                            x + cornerOff[c][0], y + cornerOff[c][1], z + cornerOff[c][2])];
                        if (!v.isEmpty()) {
                            allMatCounts[v.getMaterialID()]++;
                            if (v.isSmooth()) {
@ -510,7 +579,7 @@ uint32_t SmoothMesher::meshChunk(Chunk& chunk, const VoxelWorld& world) {
                for (int c = 0; c < 8; c++) {
                    if (corner[c] >= 0.0f) continue;
                    int cx = x + cornerOff[c][0], cy = y + cornerOff[c][1], cz = z + cornerOff[c][2];
-                    VoxelData v = readVoxel(chunk, world, cx, cy, cz);
+                    VoxelData v = voxelGrid[gridIdx(cx, cy, cz)];
                    if (v.isEmpty()) continue;
                    // Check if this voxel is on the surface
                    bool onSurface = false;
@ -531,11 +600,7 @@ uint32_t SmoothMesher::meshChunk(Chunk& chunk, const VoxelWorld& world) {
                // GPU interpolation creates the smooth edge-to-interior falloff.
                uint8_t blendW = (secCount > 0 && secMat != bestMat) ? 255 : 0;
-                // Normal from SDF gradient (used later for face normal orientation check)
-                float gnx, gny, gnz;
-                computeNormal(chunk, world, x, y, z, gnx, gny, gnz);
-                // Store vertex
+                // Store vertex (normals zeroed; computed later from face normals in Step 4)
                int32_t vertIdx = (int32_t)chunk.smoothVertices.size();
                vertexMap[vertMapIdx(x, y, z)] = vertIdx;
@ -543,9 +608,9 @@ uint32_t SmoothMesher::meshChunk(Chunk& chunk, const VoxelWorld& world) {
                sv.px = ox + vx;
                sv.py = oy + vy;
                sv.pz = oz + vz;
-                sv.nx = gnx;
+                sv.nx = 0;
-                sv.ny = gny;
+                sv.ny = 0;
-                sv.nz = gnz;
+                sv.nz = 0;
                sv.materialID = bestMat;
                sv.secondaryMat = secMat;
                sv.blendWeight = blendW;
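The `smoothNear` step in this file replaces a per-corner search of six face neighbors with a single precomputed lookup. A small self-contained sketch of that dilation idea follows. It is a variation rather than the commit's code: `cellIndex` and `dilateFaceNeighbors` are hypothetical names, the grid is a plain `n*n*n` cube without padding offsets, and writes are bounds-checked so that border cells can also act as sources.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Flat index into an n*n*n grid, x fastest. Hypothetical helper, not gridIdx().
inline int cellIndex(int x, int y, int z, int n) {
    return x + n * (y + n * z);
}

// Returns a copy of src in which every set cell also sets its 6 face neighbors.
// Reads come from src and writes go to out, so the dilation never cascades
// beyond one step.
std::vector<uint8_t> dilateFaceNeighbors(const std::vector<uint8_t>& src, int n) {
    static const int d6[6][3] = {
        {1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};
    std::vector<uint8_t> out(src);
    for (int z = 0; z < n; z++)
    for (int y = 0; y < n; y++)
    for (int x = 0; x < n; x++) {
        if (!src[cellIndex(x, y, z, n)]) continue;
        for (const auto& d : d6) {
            const int nx = x + d[0], ny = y + d[1], nz = z + d[2];
            // Skip neighbors that fall outside the grid.
            if (nx < 0 || ny < 0 || nz < 0 || nx >= n || ny >= n || nz >= n) continue;
            out[cellIndex(nx, ny, nz, n)] = 1;
        }
    }
    return out;
}
```

After this pass, testing whether a cell corner is smooth or face-adjacent to smooth is one array read, which is where the 56-to-8 lookup reduction per cell comes from.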
--- a/src/voxel/VoxelRenderer.cpp
+++ b/src/voxel/VoxelRenderer.cpp
@ -5,6 +5,7 @@
 #include <chrono>
 #include <cmath>
 #include <cstring>
 #include <unordered_map>
 using namespace wi::graphics;
@ -116,6 +117,45 @@ void VoxelRenderer::initialize(GraphicsDevice* dev) {
        wi::backlog::post("VoxelRenderer: GPU compute mesher not available", wi::backlog::LogLevel::Warning);
    }
    // ── GPU Smooth Mesher resources (Phase 5.3) ───────────────────
    wi::renderer::LoadShader(ShaderStage::CS, smoothCentroidShader_, "voxel/voxelSmoothCentroidCS.cso");
    wi::renderer::LoadShader(ShaderStage::CS, smoothMeshShader_, "voxel/voxelSmoothCS.cso");
    if (smoothCentroidShader_.IsValid() && smoothMeshShader_.IsValid()) {
        // Centroid grid buffer (34^3 float4 per chunk slot). Sized for one chunk here;
        // dispatchGpuSmoothMesh() regrows it to one slot per smooth chunk.
        GPUBufferDesc cgDesc;
        cgDesc.size = CENTROID_GRID_SIZE * 16; // float4 = 16 bytes
        cgDesc.bind_flags = BindFlag::UNORDERED_ACCESS | BindFlag::SHADER_RESOURCE;
        cgDesc.misc_flags = ResourceMiscFlag::BUFFER_STRUCTURED;
        cgDesc.stride = 16;
        cgDesc.usage = Usage::DEFAULT;
        device_->CreateBuffer(&cgDesc, nullptr, &centroidGridBuffer_);
        // GPU smooth vertex output buffer (GPUSmoothVertex = 32 bytes)
        GPUBufferDesc svDesc;
        svDesc.size = MAX_GPU_SMOOTH_VERTICES * 32;
        svDesc.bind_flags = BindFlag::UNORDERED_ACCESS | BindFlag::SHADER_RESOURCE;
        svDesc.misc_flags = ResourceMiscFlag::BUFFER_STRUCTURED;
        svDesc.stride = 32;
        svDesc.usage = Usage::DEFAULT;
        device_->CreateBuffer(&svDesc, nullptr, &gpuSmoothVertexBuffer_);
        // Atomic counter
        GPUBufferDesc scDesc;
        scDesc.size = sizeof(uint32_t);
        scDesc.bind_flags = BindFlag::UNORDERED_ACCESS;
        scDesc.misc_flags = ResourceMiscFlag::BUFFER_RAW;
        scDesc.usage = Usage::DEFAULT;
        device_->CreateBuffer(&scDesc, nullptr, &gpuSmoothCounter_);
        // Readback
        GPUBufferDesc srbDesc;
        srbDesc.size = sizeof(uint32_t);
        srbDesc.usage = Usage::READBACK;
        device_->CreateBuffer(&srbDesc, nullptr, &smoothCounterReadback_);
        wi::backlog::post("VoxelRenderer: GPU smooth mesher available (2-pass with smooth normals)");
    }
    cpuMegaQuads_.reserve(MEGA_BUFFER_CAPACITY);
    cpuChunkInfo_.reserve(MAX_CHUNKS);
    chunkSlots_.reserve(MAX_CHUNKS);
@ -316,13 +356,20 @@ void VoxelRenderer::rebuildMegaBuffer(VoxelWorld& world) {
    chunkSlots_.clear();
    cpuChunkInfo_.clear();
    // Position → index map for neighbor lookup
    std::unordered_map<uint64_t, uint32_t> posToIdx;
    auto posKey = [](const ChunkPos& p) -> uint64_t {
        return ((uint64_t)(uint16_t)p.x) | ((uint64_t)(uint16_t)p.y << 16) | ((uint64_t)(uint16_t)p.z << 32);
    };
    uint32_t offset = 0;
    float debugFlag = debugFaceColors_ ? 1.0f : 0.0f;
    world.forEachChunk([&](const ChunkPos& pos, Chunk& chunk) {
        if (chunk.quadCount == 0) return;
-        if (offset + chunk.quadCount > MEGA_BUFFER_CAPACITY) return; // overflow guard
+        if (offset + chunk.quadCount > MEGA_BUFFER_CAPACITY) return;
        uint32_t curIdx = (uint32_t)chunkSlots_.size();
        ChunkSlot slot;
        slot.pos = pos;
        slot.quadOffset = offset;
@ -341,13 +388,30 @@ void VoxelRenderer::rebuildMegaBuffer(VoxelWorld& world) {
        for (int f = 0; f < 6; f++) {
            info.faceOffsets[f] = chunk.faceOffsets[f];
            info.faceCounts[f] = chunk.faceCounts[f];
            info.neighbors[f] = 0xFFFFFFFF;
        }
        cpuChunkInfo_.push_back(info);
        posToIdx[posKey(pos)] = curIdx;
        cpuMegaQuads_.insert(cpuMegaQuads_.end(), chunk.quads.begin(), chunk.quads.end());
        offset += chunk.quadCount;
    });
    // Fill neighbor indices
    static const int offsets[6][3] = {
        {1,0,0}, {-1,0,0}, {0,1,0}, {0,-1,0}, {0,0,1}, {0,0,-1}
    };
    for (uint32_t i = 0; i < (uint32_t)chunkSlots_.size(); i++) {
        const auto& pos = chunkSlots_[i].pos;
        for (int f = 0; f < 6; f++) {
            ChunkPos npos = { pos.x + offsets[f][0], pos.y + offsets[f][1], pos.z + offsets[f][2] };
            auto it = posToIdx.find(posKey(npos));
            if (it != posToIdx.end()) {
                cpuChunkInfo_[i].neighbors[f] = it->second;
            }
        }
    }
    chunkCount_ = (uint32_t)chunkSlots_.size();
    totalQuads_ = offset;
 }
@ -357,13 +421,19 @@ void VoxelRenderer::rebuildChunkInfoOnly(VoxelWorld& world) {
    chunkSlots_.clear();
    cpuChunkInfo_.clear();
    // First pass: build position → index map and chunk info
    std::unordered_map<uint64_t, uint32_t> posToIdx;
    auto posKey = [](const ChunkPos& p) -> uint64_t {
        return ((uint64_t)(uint16_t)p.x) | ((uint64_t)(uint16_t)p.y << 16) | ((uint64_t)(uint16_t)p.z << 32);
    };
    uint32_t idx = 0;
    float debugFlag = debugFaceColors_ ? 1.0f : 0.0f;
    world.forEachChunk([&](const ChunkPos& pos, Chunk& chunk) {
        ChunkSlot slot;
        slot.pos = pos;
-        slot.quadOffset = 0; // not used in GPU mesh path
+        slot.quadOffset = 0;
        slot.quadCount = 0;
        chunkSlots_.push_back(slot);
@ -376,10 +446,27 @@ void VoxelRenderer::rebuildChunkInfoOnly(VoxelWorld& world) {
        );
        info.quadOffset = 0;
        info.quadCount = 0;
        for (int i = 0; i < 6; i++) info.neighbors[i] = 0xFFFFFFFF;
        cpuChunkInfo_.push_back(info);
        posToIdx[posKey(pos)] = idx;
        idx++;
    });
    // Second pass: fill neighbor indices
    static const int offsets[6][3] = {
        {1,0,0}, {-1,0,0}, {0,1,0}, {0,-1,0}, {0,0,1}, {0,0,-1}
    };
    for (uint32_t i = 0; i < (uint32_t)chunkSlots_.size(); i++) {
        const auto& pos = chunkSlots_[i].pos;
        for (int f = 0; f < 6; f++) {
            ChunkPos npos = { pos.x + offsets[f][0], pos.y + offsets[f][1], pos.z + offsets[f][2] };
            auto it = posToIdx.find(posKey(npos));
            if (it != posToIdx.end()) {
                cpuChunkInfo_[i].neighbors[f] = it->second;
            }
        }
    }
    chunkCount_ = (uint32_t)chunkSlots_.size();
 }
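Both rebuild functions key the neighbor map on a 48-bit packing of the chunk position. The sketch below isolates that packing to show its behavior with negative and out-of-range coordinates. `ChunkPosSketch` is a hypothetical stand-in for `ChunkPos`, whose field types are not shown in this diff and are assumed to be `int`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ChunkPos; int fields are an assumption.
struct ChunkPosSketch {
    int x, y, z;
};

// Same packing as the posKey lambda: each axis is truncated to 16 bits,
// then placed at bit offsets 0, 16 and 32.
inline uint64_t posKey(const ChunkPosSketch& p) {
    return ((uint64_t)(uint16_t)p.x)
         | ((uint64_t)(uint16_t)p.y << 16)
         | ((uint64_t)(uint16_t)p.z << 32);
}
```

Negative coordinates map to distinct keys through two's-complement truncation, but any two coordinates that differ by a multiple of 65536 on one axis alias to the same key, so the scheme assumes chunk coordinates stay within a 16-bit range.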
@ -667,6 +754,153 @@ void VoxelRenderer::dispatchGpuMesh(CommandList cmd, const VoxelWorld& world,
    gpuMeshDirty_ = false;
 }
 // ── GPU Smooth Mesh Dispatch (Phase 5.3) ─────────────────────────
 // Dispatches the two-pass GPU Surface Nets compute shaders for chunks that
 // contain smooth voxels or are face-adjacent to a chunk that does.
 // Uses voxelDataBuffer_ (already uploaded by dispatchGpuMesh).
 void VoxelRenderer::dispatchGpuSmoothMesh(CommandList cmd, const VoxelWorld& world) const {
    if (!smoothCentroidShader_.IsValid() || !smoothMeshShader_.IsValid()) return;
    auto* dev = device_;
    // ── Collect smooth chunk indices (chunks that contain smooth OR neighbor smooth) ──
    struct SmoothChunkEntry { uint32_t chunkIdx; };
    std::vector<SmoothChunkEntry> smoothChunks;
    smoothChunks.reserve(256);
    {
        // Build chunk index list + check containsSmooth for neighbors
        std::vector<std::pair<ChunkPos, uint32_t>> allChunks;
        allChunks.reserve(chunkCount_);
        uint32_t ci = 0;
        world.forEachChunk([&](const ChunkPos& pos, const Chunk& chunk) {
            allChunks.push_back({pos, ci});
            ci++;
        });
        // Build position→index map for neighbor lookup (currently unused: the
        // neighbor test below calls world.getChunk() directly)
        std::unordered_map<uint64_t, uint32_t> posToLocal;
        auto posKey = [](const ChunkPos& p) -> uint64_t {
            return ((uint64_t)(uint16_t)p.x) | ((uint64_t)(uint16_t)p.y << 16) | ((uint64_t)(uint16_t)p.z << 32);
        };
        for (uint32_t i = 0; i < (uint32_t)allChunks.size(); i++) {
            posToLocal[posKey(allChunks[i].first)] = i;
        }
        static const int offs[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};
        for (auto& [pos, idx] : allChunks) {
            const Chunk* c = world.getChunk(pos);
            if (!c) continue;
            bool needed = c->containsSmooth;
            if (!needed) {
                for (int f = 0; f < 6 && !needed; f++) {
                    ChunkPos np = {pos.x + offs[f][0], pos.y + offs[f][1], pos.z + offs[f][2]};
                    const Chunk* nc = world.getChunk(np);
                    if (nc && nc->containsSmooth) needed = true;
                }
            }
            if (needed) smoothChunks.push_back({idx});
        }
    }
    if (smoothChunks.empty()) {
        gpuSmoothMeshDirty_ = false;
        return;
    }
    uint32_t smoothCount = (uint32_t)smoothChunks.size();
    // ── Resize centroid grid buffer if needed (one slot per smooth chunk) ──
    // Bytes; computed in 64 bits so a large smooth chunk count cannot overflow.
    uint64_t requiredGridSize = (uint64_t)smoothCount * CENTROID_GRID_SIZE * 16;
    if (!centroidGridBuffer_.IsValid() || centroidGridBuffer_.desc.size < requiredGridSize) {
        GPUBufferDesc cgDesc;
        cgDesc.size = requiredGridSize;
        cgDesc.bind_flags = BindFlag::UNORDERED_ACCESS | BindFlag::SHADER_RESOURCE;
        cgDesc.misc_flags = ResourceMiscFlag::BUFFER_STRUCTURED;
        cgDesc.stride = 16;
        cgDesc.usage = Usage::DEFAULT;
        dev->CreateBuffer(&cgDesc, nullptr, const_cast<GPUBuffer*>(&centroidGridBuffer_));
        wi::backlog::post("VoxelRenderer: resized centroid grid for " + std::to_string(smoothCount)
            + " smooth chunks (" + std::to_string(requiredGridSize / 1024) + " KB)");
    }
    // Zero the smooth vertex counter
    uint32_t zero = 0;
    dev->UpdateBuffer(const_cast<GPUBuffer*>(&gpuSmoothCounter_), &zero, cmd, sizeof(uint32_t));
    // Pre-barriers
    GPUBarrier preBarriers[] = {
        GPUBarrier::Buffer(const_cast<GPUBuffer*>(&gpuSmoothCounter_), ResourceState::COPY_DST, ResourceState::UNORDERED_ACCESS),
        GPUBarrier::Buffer(const_cast<GPUBuffer*>(&gpuSmoothVertexBuffer_), ResourceState::UNDEFINED, ResourceState::UNORDERED_ACCESS),
        GPUBarrier::Buffer(const_cast<GPUBuffer*>(&centroidGridBuffer_), ResourceState::UNDEFINED, ResourceState::UNORDERED_ACCESS),
    };
    dev->Barrier(preBarriers, 3, cmd);
    struct SmoothPush {
        uint32_t chunkIndex;
        uint32_t voxelBufferOffset;
        uint32_t maxOutputVerts;
        uint32_t centroidGridOffset;
        uint32_t pad[8];
    };
    const uint32_t wordsPerChunk = CHUNK_VOLUME / 2;
    // ── Pass 1: Dispatch ALL centroid computations (batched, no barriers) ──
    dev->BindComputeShader(&smoothCentroidShader_, cmd);
    dev->BindResource(&voxelDataBuffer_, 0, cmd);    // t0
    dev->BindResource(&chunkInfoBuffer_, 1, cmd);     // t1
    dev->BindUAV(const_cast<GPUBuffer*>(&centroidGridBuffer_), 0, cmd);  // u0
    for (uint32_t i = 0; i < smoothCount; i++) {
        uint32_t ci = smoothChunks[i].chunkIdx;
        SmoothPush pushData = {};
        pushData.chunkIndex = ci;
        pushData.voxelBufferOffset = ci * wordsPerChunk;
        pushData.maxOutputVerts = MAX_GPU_SMOOTH_VERTICES;
        pushData.centroidGridOffset = i * CENTROID_GRID_SIZE;
        dev->PushConstants(&pushData, sizeof(pushData), cmd);
        dev->Dispatch(5, 5, 5, cmd);
    }
    // ── Single barrier: centroid grid UAV → SRV ──
    GPUBarrier midBarrier = GPUBarrier::Buffer(
        const_cast<GPUBuffer*>(&centroidGridBuffer_),
        ResourceState::UNORDERED_ACCESS, ResourceState::SHADER_RESOURCE);
    dev->Barrier(&midBarrier, 1, cmd);
    // ── Pass 2: Dispatch ALL emit passes (batched, no barriers) ──
    // Emit shader reads ONLY from centroid grid (no voxelData access)
    dev->BindComputeShader(&smoothMeshShader_, cmd);
    dev->BindResource(&chunkInfoBuffer_, 1, cmd);     // t1
    dev->BindResource(&centroidGridBuffer_, 2, cmd);  // t2: centroid grid (SRV)
    dev->BindUAV(const_cast<GPUBuffer*>(&gpuSmoothVertexBuffer_), 0, cmd);  // u0
    dev->BindUAV(const_cast<GPUBuffer*>(&gpuSmoothCounter_), 1, cmd);       // u1
    for (uint32_t i = 0; i < smoothCount; i++) {
        uint32_t ci = smoothChunks[i].chunkIdx;
        SmoothPush pushData = {};
        pushData.chunkIndex = ci;
        pushData.voxelBufferOffset = ci * wordsPerChunk;
        pushData.maxOutputVerts = MAX_GPU_SMOOTH_VERTICES;
        pushData.centroidGridOffset = i * CENTROID_GRID_SIZE;
        dev->PushConstants(&pushData, sizeof(pushData), cmd);
        dev->Dispatch(4, 4, 4, cmd);
    }
    // Post-barriers
    GPUBarrier postBarriers[] = {
        GPUBarrier::Buffer(const_cast<GPUBuffer*>(&gpuSmoothCounter_), ResourceState::UNORDERED_ACCESS, ResourceState::COPY_SRC),
        GPUBarrier::Buffer(const_cast<GPUBuffer*>(&gpuSmoothVertexBuffer_), ResourceState::UNORDERED_ACCESS, ResourceState::SHADER_RESOURCE),
    };
    dev->Barrier(postBarriers, 2, cmd);
    // Readback counter (result available next frame)
    dev->CopyBuffer(const_cast<GPUBuffer*>(&smoothCounterReadback_), 0,
        const_cast<GPUBuffer*>(&gpuSmoothCounter_), 0, sizeof(uint32_t), cmd);
    gpuSmoothMeshDirty_ = false;
 }
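The buffer sizing and dispatch dimensions in `dispatchGpuSmoothMesh` follow from simple arithmetic. The sketch below works through it under two assumptions that are not confirmed by this diff: `CENTROID_GRID_SIZE` equals 34^3 (taken from the "34^3 float4" comment), and both shaders use 8 threads per axis per group, which would explain `Dispatch(5, 5, 5)` for the 34-wide centroid grid and `Dispatch(4, 4, 4)` for the 32-wide emit pass. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed from the "34^3 float4" comment; the real constant is CENTROID_GRID_SIZE.
constexpr uint64_t kCentroidCellsPerChunk = 34ull * 34ull * 34ull;
constexpr uint64_t kBytesPerCell = 16; // one float4

// Bytes needed for one centroid grid slot per smooth chunk (64-bit result).
inline uint64_t centroidGridBytes(uint32_t smoothChunkCount) {
    return uint64_t(smoothChunkCount) * kCentroidCellsPerChunk * kBytesPerCell;
}

// Ceiling division: thread groups needed to cover `cells` along one axis.
inline uint32_t groupsFor(uint32_t cells, uint32_t threadsPerGroup) {
    return (cells + threadsPerGroup - 1) / threadsPerGroup;
}
```

One slot is about 614 KB, so the total exceeds the 32-bit range at roughly 6,800 smooth chunks, which is why the byte count is kept in 64 bits in this sketch.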
 // ── Frustum plane extraction (Gribb-Hartmann method) ────────────
 static void extractFrustumPlanes(const XMMATRIX& vp, XMFLOAT4 planes[6]) {
    XMFLOAT4X4 m;
@ -1200,36 +1434,33 @@ void VoxelRenderer::uploadTopingData(const TopingSystem& topingSystem) {
    // GPU instances are just float3 (12 bytes), sorted by (type, variant) for batched draws.
    // We sort a copy and build a draw group table.
-    struct SortedInst {
-        float wx, wy, wz;
-        uint16_t type, variant;
-    };
-    std::vector<SortedInst> sorted(instances.size());
+    // Reuse persistent vectors to avoid per-frame allocations.
+    topingSorted_.resize(instances.size());
    for (size_t i = 0; i < instances.size(); i++) {
-        sorted[i] = { instances[i].wx, instances[i].wy, instances[i].wz,
+        topingSorted_[i] = { instances[i].wx, instances[i].wy, instances[i].wz,
                       instances[i].topingType, instances[i].variant };
    }
-    std::sort(sorted.begin(), sorted.end(), [](const SortedInst& a, const SortedInst& b) {
+    std::sort(topingSorted_.begin(), topingSorted_.end(), [](const TopingSortedInst& a, const TopingSortedInst& b) {
        if (a.type != b.type) return a.type < b.type;
        return a.variant < b.variant;
    });
    // Pack GPU instance data (just float3 positions)
-    struct GPUTopingInst { float x, y, z; };
-    uint32_t instCount = (uint32_t)std::min(sorted.size(), (size_t)MAX_TOPING_INSTANCES);
-    std::vector<GPUTopingInst> gpuInsts(instCount);
+    uint32_t instCount = (uint32_t)std::min(topingSorted_.size(), (size_t)MAX_TOPING_INSTANCES);
+    topingGpuInsts_.resize(instCount);
    for (uint32_t i = 0; i < instCount; i++) {
-        gpuInsts[i] = { sorted[i].wx, sorted[i].wy, sorted[i].wz };
+        topingGpuInsts_[i] = { topingSorted_[i].wx, topingSorted_[i].wy, topingSorted_[i].wz };
    }
-    // Create or recreate instance buffer
+    // Recreate buffer each frame (UpdateBuffer requires barrier management).
    // Persistent staging vectors eliminate per-frame heap allocations.
    GPUBufferDesc ibDesc;
-    ibDesc.size = instCount * sizeof(GPUTopingInst);
+    ibDesc.size = instCount * sizeof(TopingGPUInst);
    ibDesc.bind_flags = BindFlag::SHADER_RESOURCE;
    ibDesc.misc_flags = ResourceMiscFlag::BUFFER_STRUCTURED;
-    ibDesc.stride = sizeof(GPUTopingInst);
+    ibDesc.stride = sizeof(TopingGPUInst);
    ibDesc.usage = Usage::DEFAULT;
-    device_->CreateBuffer(&ibDesc, gpuInsts.data(), &topingInstanceBuffer_);
+    device_->CreateBuffer(&ibDesc, topingGpuInsts_.data(), &topingInstanceBuffer_);
 }
 void VoxelRenderer::renderTopings(
@ -1351,8 +1582,10 @@ void VoxelRenderer::uploadSmoothData(VoxelWorld& world) {
    // Collect all smooth vertices from all chunks, stamping each with its chunkIndex.
    // The chunkIndex must match the order in chunkInfoBuffer_ (assigned by forEachChunk).
-    std::vector<SmoothVertex> allVerts;
+    // Reuse a persistent staging vector to avoid per-frame allocations.
-    allVerts.reserve(64 * 1024);
+    smoothStagingVerts_.clear();
    if (smoothStagingVerts_.capacity() < 64 * 1024)
        smoothStagingVerts_.reserve(64 * 1024);
    uint32_t chunkIdx = 0;
    world.forEachChunk([&](const ChunkPos& pos, Chunk& chunk) {
@ -1360,36 +1593,66 @@ void VoxelRenderer::uploadSmoothData(VoxelWorld& world) {
            for (auto& sv : chunk.smoothVertices) {
                sv.chunkIndex = (uint16_t)chunkIdx;
            }
-            allVerts.insert(allVerts.end(),
+            smoothStagingVerts_.insert(smoothStagingVerts_.end(),
                chunk.smoothVertices.begin(),
                chunk.smoothVertices.end());
        }
        chunkIdx++;
    });
-    smoothVertexCount_ = (uint32_t)std::min(allVerts.size(), (size_t)MAX_SMOOTH_VERTICES);
+    smoothVertexCount_ = (uint32_t)std::min(smoothStagingVerts_.size(), (size_t)MAX_SMOOTH_VERTICES);
    if (smoothVertexCount_ == 0) {
        smoothDirty_ = false;
        return;
    }
-    // Create or recreate vertex buffer
+    // Recreate buffer each frame (UpdateBuffer requires barrier management).
    // Persistent staging vector eliminates per-frame heap allocations.
    GPUBufferDesc vbDesc;
    vbDesc.size = smoothVertexCount_ * sizeof(SmoothVertex);
    vbDesc.bind_flags = BindFlag::SHADER_RESOURCE;
    vbDesc.misc_flags = ResourceMiscFlag::BUFFER_STRUCTURED;
    vbDesc.stride = sizeof(SmoothVertex);
    vbDesc.usage = Usage::DEFAULT;
-    device_->CreateBuffer(&vbDesc, allVerts.data(), &smoothVertexBuffer_);
+    device_->CreateBuffer(&vbDesc, smoothStagingVerts_.data(), &smoothVertexBuffer_);
    smoothDirty_ = false;
 }
-    char msg[128];
+void VoxelRenderer::uploadSmoothDataFast(VoxelWorld& world) {
-    snprintf(msg, sizeof(msg), "Smooth: uploaded %u vertices (%u triangles, %.1f KB)",
+    if (!device_ || !smoothPso_.IsValid()) return;
-        smoothVertexCount_, smoothVertexCount_ / 3,
+
-        smoothVertexCount_ * sizeof(SmoothVertex) / 1024.0f);
+    // Fast path: chunkIndex already stamped during parallel meshChunk.
-    wi::backlog::post(msg);
+    // Just collect vertices (no per-vertex stamping needed).
    smoothStagingVerts_.clear();
    if (smoothStagingVerts_.capacity() < 64 * 1024)
        smoothStagingVerts_.reserve(64 * 1024);
    world.forEachChunk([&](const ChunkPos& pos, Chunk& chunk) {
        if (chunk.hasSmooth && chunk.smoothVertexCount > 0) {
            smoothStagingVerts_.insert(smoothStagingVerts_.end(),
                chunk.smoothVertices.begin(),
                chunk.smoothVertices.end());
        }
    });
    smoothVertexCount_ = (uint32_t)std::min(smoothStagingVerts_.size(), (size_t)MAX_SMOOTH_VERTICES);
    if (smoothVertexCount_ == 0) {
        smoothDirty_ = false;
        return;
    }
    GPUBufferDesc vbDesc;
    vbDesc.size = smoothVertexCount_ * sizeof(SmoothVertex);
    vbDesc.bind_flags = BindFlag::SHADER_RESOURCE;
    vbDesc.misc_flags = ResourceMiscFlag::BUFFER_STRUCTURED;
    vbDesc.stride = sizeof(SmoothVertex);
    vbDesc.usage = Usage::DEFAULT;
    device_->CreateBuffer(&vbDesc, smoothStagingVerts_.data(), &smoothVertexBuffer_);
    smoothDirty_ = false;
 }
 void VoxelRenderer::renderSmooth(
@ -1397,8 +1660,12 @@ void VoxelRenderer::renderSmooth(
    const Texture& depthBuffer,
    const Texture& renderTarget
 ) const {
-    if (!smoothPso_.IsValid() || !smoothVertexBuffer_.IsValid() ||
+    // Use GPU-generated smooth buffer if available, otherwise CPU buffer
-        smoothVertexCount_ == 0) return;
+    const bool useGpuSmooth = smoothCentroidShader_.IsValid() && smoothMeshShader_.IsValid();
    const auto& smoothBuf = useGpuSmooth ? gpuSmoothVertexBuffer_ : smoothVertexBuffer_;
    uint32_t vertCount = useGpuSmooth ? gpuSmoothVertexCount_ : smoothVertexCount_;
    if (!smoothPso_.IsValid() || !smoothBuf.IsValid() || vertCount == 0) return;
    auto* dev = device_;
@ -1438,7 +1705,7 @@ void VoxelRenderer::renderSmooth(
    dev->BindResource(&textureArray_, 1, cmd);
    dev->BindResource(&chunkInfoBuffer_, 2, cmd);    // t2: chunk info for PS voxel lookups
    dev->BindResource(&voxelDataBuffer_, 3, cmd);    // t3: voxel data for PS neighbor blending
-    dev->BindResource(&smoothVertexBuffer_, 6, cmd); // t6: smooth vertices
+    dev->BindResource(&smoothBuf, 6, cmd); // t6: smooth vertices (GPU or CPU buffer)
    dev->BindSampler(&sampler_, 0, cmd);
    // Push constants (unused by smooth VS, but must be valid 48 bytes)
@ -1449,7 +1716,7 @@ void VoxelRenderer::renderSmooth(
    dev->PushConstants(&pushData, sizeof(pushData), cmd);
    // Single draw call for all smooth vertices
-    dev->DrawInstanced(smoothVertexCount_, 1, 0, 0, cmd);
+    dev->DrawInstanced(vertCount, 1, 0, 0, cmd);
    smoothDrawCalls_ = 1;
    dev->RenderPassEnd(cmd);
@ -1498,23 +1765,41 @@ void VoxelRenderPath::Start() {
        wi::backlog::post(msg);
    }
-    // Phase 5: CPU Surface Nets mesh for smooth voxels, upload to GPU
+    // Phase 5: Smooth surface mesh — GPU path or CPU fallback
    if (renderer.isInitialized()) {
-        uint32_t totalSmooth = 0;
+        if (renderer.smoothCentroidShader_.IsValid() && renderer.smoothMeshShader_.IsValid()) {
-        uint32_t smoothChunks = 0;
+            // GPU smooth mesher available — will dispatch in first Render()
-        world.forEachChunk([&](const ChunkPos& pos, Chunk& chunk) {
+            renderer.gpuSmoothMeshDirty_ = true;
-            uint32_t count = SmoothMesher::meshChunk(chunk, world);
+            wi::backlog::post("SmoothMesher: GPU path active, dispatch deferred to Render()");
-            if (count > 0) {
+        } else {
-                totalSmooth += count;
+            // CPU fallback: Surface Nets mesh for smooth voxels (parallelized)
-                smoothChunks++;
+            std::vector<Chunk*> chunkPtrs;
            world.forEachChunk([&](const ChunkPos& pos, Chunk& chunk) {
                chunkPtrs.push_back(&chunk);
            });
            const VoxelWorld& worldRef = world;
            wi::jobsystem::context smoothCtx;
            wi::jobsystem::Dispatch(smoothCtx, (uint32_t)chunkPtrs.size(), 1,
                [&chunkPtrs, &worldRef](wi::jobsystem::JobArgs args) {
                    SmoothMesher::meshChunk(*chunkPtrs[args.jobIndex], worldRef);
                });
            wi::jobsystem::Wait(smoothCtx);
            uint32_t totalSmooth = 0;
            uint32_t smoothChunks = 0;
            for (auto* c : chunkPtrs) {
                if (c->smoothVertexCount > 0) {
                    totalSmooth += c->smoothVertexCount;
                    smoothChunks++;
                }
            }
-        });
+            renderer.uploadSmoothData(world);
-        renderer.uploadSmoothData(world);
+            char msg[256];
-        char msg[256];
+            snprintf(msg, sizeof(msg),
-        snprintf(msg, sizeof(msg),
+                "SmoothMesher: %u vertices (%u tris) in %u chunks",
-            "SmoothMesher: %u vertices (%u tris) in %u chunks",
+                totalSmooth, totalSmooth / 3, smoothChunks);
-            totalSmooth, totalSmooth / 3, smoothChunks);
+            wi::backlog::post(msg);
-        wi::backlog::post(msg);
+        }
    }
    worldGenerated_ = true;
@ -1649,6 +1934,45 @@ void VoxelRenderPath::Update(float dt) {
            renderer.voxelCacheDirty_ = false;    // cache already filled by fused pack
            renderer.gpuMeshDirty_ = true;        // GPU still needs upload + dispatch
            // Re-mesh smooth surfaces — GPU path or CPU fallback
            if (renderer.smoothCentroidShader_.IsValid() && renderer.smoothMeshShader_.IsValid()) {
                renderer.gpuSmoothMeshDirty_ = true; // will dispatch in Render()
            } else {
                // CPU fallback (Surface Nets) — parallelized
                auto ts0 = std::chrono::high_resolution_clock::now();
                std::vector<Chunk*> chunkPtrs;
                world.forEachChunk([&](const ChunkPos& pos, Chunk& chunk) {
                    chunkPtrs.push_back(&chunk);
                });
                const VoxelWorld& worldRef = world;
                wi::jobsystem::context ctx;
                wi::jobsystem::Dispatch(ctx, (uint32_t)chunkPtrs.size(), 1,
                    [&chunkPtrs, &worldRef](wi::jobsystem::JobArgs args) {
                        uint32_t idx = args.jobIndex;
                        SmoothMesher::meshChunk(*chunkPtrs[idx], worldRef);
                        // Stamp chunkIndex during parallel pass (avoids sequential loop in upload)
                        for (auto& sv : chunkPtrs[idx]->smoothVertices)
                            sv.chunkIndex = (uint16_t)idx;
                    });
                wi::jobsystem::Wait(ctx);
                auto ts1 = std::chrono::high_resolution_clock::now();
                profSmoothMesh_.add(std::chrono::duration<float, std::milli>(ts1 - ts0).count());
                renderer.uploadSmoothDataFast(world);
                auto ts2 = std::chrono::high_resolution_clock::now();
                profSmoothUpload_.add(std::chrono::duration<float, std::milli>(ts2 - ts1).count());
            }
            // Re-collect toping instances — parallelized
            {
                auto tt0 = std::chrono::high_resolution_clock::now();
                topingSystem.collectInstancesParallel(world);
                auto tt1 = std::chrono::high_resolution_clock::now();
                profTopingCollect_.add(std::chrono::duration<float, std::milli>(tt1 - tt0).count());
                renderer.uploadTopingData(topingSystem);
                auto tt2 = std::chrono::high_resolution_clock::now();
                profTopingUpload_.add(std::chrono::duration<float, std::milli>(tt2 - tt1).count());
            }
        }
    }
@ -1692,6 +2016,21 @@ void VoxelRenderPath::Render() const {
                renderer.dispatchGpuMesh(cmd, world,
                    &profVoxelPack_, &profGpuUpload_, &profGpuDispatch_);
            }
            // GPU smooth mesh: readback previous frame's vertex count
            if (renderer.smoothCounterReadback_.mapped_data) {
                uint32_t* smoothCount = (uint32_t*)renderer.smoothCounterReadback_.mapped_data;
                renderer.gpuSmoothVertexCount_ = *smoothCount;
            }
            // GPU smooth mesh dispatch (uses same voxelDataBuffer_ already uploaded)
            if (renderer.gpuSmoothMeshDirty_ && renderer.smoothCentroidShader_.IsValid() && renderer.smoothMeshShader_.IsValid()) {
                renderer.dispatchGpuSmoothMesh(cmd, world);
            }
            // Re-dispatch next frame if readback not yet available (1-frame delay)
            if (renderer.gpuSmoothVertexCount_ == 0 &&
                renderer.smoothCentroidShader_.IsValid() && renderer.smoothMeshShader_.IsValid()) {
                renderer.gpuSmoothMeshDirty_ = true;
            }
        }
        // GPU mesh benchmark state machine (runs once after world gen, CPU path only)
@ -1717,7 +2056,7 @@ void VoxelRenderPath::Render() const {
 }
 void VoxelRenderPath::logProfilingAverages() const {
-    char msg[512];
+    char msg[1024];
    snprintf(msg, sizeof(msg),
        "=== PERF PROFILE (avg over %.0fs) ===\n"
        "  Regenerate:    %7.2f ms  (%u calls)\n"
@ -1725,6 +2064,10 @@ void VoxelRenderPath::logProfilingAverages() const {
        "  VoxelPack:     %7.2f ms  (%u calls)\n"
        "  GPU Upload:    %7.2f ms  (%u calls)\n"
        "  GPU Dispatch:  %7.2f ms  (%u calls)\n"
        "  SmoothMesh:    %7.2f ms  (%u calls)\n"
        "  SmoothUpload:  %7.2f ms  (%u calls)\n"
        "  TopingCollect: %7.2f ms  (%u calls)\n"
        "  TopingUpload:  %7.2f ms  (%u calls)\n"
        "  Render:        %7.2f ms  (%u calls)\n"
        "  Frame (Upd):   %7.2f ms  (%u calls, %.1f FPS)",
        PROF_INTERVAL,
@ -1733,6 +2076,10 @@ void VoxelRenderPath::logProfilingAverages() const {
        profVoxelPack_.avg(), profVoxelPack_.count,
        profGpuUpload_.avg(), profGpuUpload_.count,
        profGpuDispatch_.avg(), profGpuDispatch_.count,
        profSmoothMesh_.avg(), profSmoothMesh_.count,
        profSmoothUpload_.avg(), profSmoothUpload_.count,
        profTopingCollect_.avg(), profTopingCollect_.count,
        profTopingUpload_.avg(), profTopingUpload_.count,
        profRender_.avg(), profRender_.count,
        profFrame_.avg(), profFrame_.count,
        profFrame_.count > 0 ? (1000.0f / profFrame_.avg()) : 0.0f);
@ -1743,6 +2090,10 @@ void VoxelRenderPath::logProfilingAverages() const {
    profVoxelPack_.reset();
    profGpuUpload_.reset();
    profGpuDispatch_.reset();
    profSmoothMesh_.reset();
    profSmoothUpload_.reset();
    profTopingCollect_.reset();
    profTopingUpload_.reset();
    profRender_.reset();
    profFrame_.reset();
 }
--- a/src/voxel/VoxelRenderer.h
+++ b/src/voxel/VoxelRenderer.h
@ -23,6 +23,8 @@ struct GPUChunkInfo {
    uint32_t pad[2];            // align to 32 bytes
    uint32_t faceOffsets[6];    // per-face quad offset within this chunk's quads
    uint32_t faceCounts[6];     // per-face quad count
    uint32_t neighbors[6];     // chunk index of face neighbors (+X,-X,+Y,-Y,+Z,-Z), 0xFFFFFFFF = none
    uint32_t pad2[2];          // pad to 112 bytes (7 × float4)
 };
 // ── Voxel Renderer (Phase 2: mega-buffer + MDI pipeline) ────────
@ -81,6 +83,11 @@ private:
    wi::graphics::GPUBuffer topingVertexBuffer_;   // StructuredBuffer<TopingVertex>, SRV t4
    wi::graphics::GPUBuffer topingInstanceBuffer_; // StructuredBuffer<float3>, SRV t5
    static constexpr uint32_t MAX_TOPING_INSTANCES = 256 * 1024; // 256K instances max
    // Persistent staging buffers for toping upload (avoids per-frame allocations)
    struct TopingSortedInst { float wx, wy, wz; uint16_t type, variant; };
    struct TopingGPUInst { float x, y, z; };
    std::vector<TopingSortedInst> topingSorted_;
    std::vector<TopingGPUInst> topingGpuInsts_;
    mutable uint32_t topingDrawCalls_ = 0;
    // Shaders & Pipeline (smooth surfaces, Phase 5)
@ -89,6 +96,7 @@ private:
    wi::graphics::RasterizerState smoothRasterizer_;
    wi::graphics::PipelineState smoothPso_;
    wi::graphics::GPUBuffer smoothVertexBuffer_;   // StructuredBuffer<SmoothVertex>, SRV t6
    std::vector<SmoothVertex> smoothStagingVerts_;  // persistent staging buffer (avoids per-frame alloc)
    static constexpr uint32_t MAX_SMOOTH_VERTICES = 4 * 1024 * 1024; // 4M vertices max
    mutable uint32_t smoothVertexCount_ = 0;
    mutable uint32_t smoothDrawCalls_ = 0;
@ -168,6 +176,18 @@ private:
    mutable bool gpuMeshDirty_ = true;        // true: GPU needs upload + re-dispatch
    mutable bool chunkInfoDirty_ = true;      // true: chunkInfoBuffer needs re-upload
    // ── GPU Smooth Mesher (Phase 5.3) ─────────────────────────────
    wi::graphics::Shader smoothCentroidShader_;      // voxelSmoothCentroidCS (pass 1: centroid grid)
    wi::graphics::Shader smoothMeshShader_;          // voxelSmoothCS (pass 2: emit with smooth normals)
    wi::graphics::GPUBuffer centroidGridBuffer_;     // float4[34^3] per-chunk centroid grid (reused)
    wi::graphics::GPUBuffer gpuSmoothVertexBuffer_;  // RWStructuredBuffer<GPUSmoothVertex>, UAV+SRV
    wi::graphics::GPUBuffer gpuSmoothCounter_;       // atomic counter for smooth vertices
    wi::graphics::GPUBuffer smoothCounterReadback_;   // READBACK buffer for vertex counter
    static constexpr uint32_t CENTROID_GRID_SIZE = 34 * 34 * 34; // 39304 entries per chunk
    static constexpr uint32_t MAX_GPU_SMOOTH_VERTICES = 2 * 1024 * 1024; // 2M vertices max
    mutable uint32_t gpuSmoothVertexCount_ = 0;       // readback from previous frame
    mutable bool gpuSmoothMeshDirty_ = true;
    // Benchmark state machine: runs once after world gen
    enum class BenchState { IDLE, DISPATCH, READBACK, DONE };
    mutable BenchState benchState_ = BenchState::IDLE;
@ -179,6 +199,7 @@ private:
    void dispatchGpuMesh(wi::graphics::CommandList cmd, const VoxelWorld& world,
        ProfileAccum* profPack = nullptr, ProfileAccum* profUpload = nullptr,
        ProfileAccum* profDispatch = nullptr) const;
    void dispatchGpuSmoothMesh(wi::graphics::CommandList cmd, const VoxelWorld& world) const;
    void rebuildChunkInfoOnly(VoxelWorld& world);
    // ── GPU Timestamp Queries (Phase 2 benchmark) ────────────────
@ -220,12 +241,13 @@ public:
    // Phase 5: Smooth surface rendering
    void uploadSmoothData(VoxelWorld& world);
    void uploadSmoothDataFast(VoxelWorld& world); // chunkIndex already stamped
    void renderSmooth(
        wi::graphics::CommandList cmd,
        const wi::graphics::Texture& depthBuffer,
        const wi::graphics::Texture& renderTarget
    ) const;
-    uint32_t getSmoothVertexCount() const { return smoothVertexCount_; }
+    uint32_t getSmoothVertexCount() const { return (smoothCentroidShader_.IsValid() && smoothMeshShader_.IsValid()) ? gpuSmoothVertexCount_ : smoothVertexCount_; }
    uint32_t getSmoothDrawCalls() const { return smoothDrawCalls_; }
 };
@ -280,6 +302,10 @@ private:
    mutable ProfileAccum profGpuDispatch_;    // compute dispatches in dispatchGpuMesh
    mutable ProfileAccum profRender_;         // render() total
    mutable ProfileAccum profFrame_;          // full frame (Update + Render + Compose)
    mutable ProfileAccum profSmoothMesh_;     // SmoothMesher::meshChunk (all chunks)
    mutable ProfileAccum profSmoothUpload_;   // uploadSmoothDataFast (CPU fallback in Update)
    mutable ProfileAccum profTopingCollect_;  // topingSystem.collectInstancesParallel
    mutable ProfileAccum profTopingUpload_;   // uploadTopingData
    mutable float profTimer_ = 0.0f;
    static constexpr float PROF_INTERVAL = 5.0f;
    void logProfilingAverages() const;
--- a/src/voxel/VoxelWorld.cpp
+++ b/src/voxel/VoxelWorld.cpp
@ -197,6 +197,13 @@ void VoxelWorld::generateChunk(Chunk& chunk, float timeOffset) {
    }
    chunk.dirty = true;
    // Scan for smooth voxels (used by SmoothMesher for early-exit)
    chunk.containsSmooth = false;
    for (int i = 0; i < CHUNK_VOLUME && !chunk.containsSmooth; i++) {
        if (chunk.voxels[i].isSmooth())
            chunk.containsSmooth = true;
    }
 }
 void VoxelWorld::regenerateAnimated(float time, uint32_t* packDst, uint32_t packDstCapacity) {
--- a/src/voxel/VoxelWorld.h
+++ b/src/voxel/VoxelWorld.h
@ -22,7 +22,8 @@ struct Chunk {
    // Smooth mesh data (output of Surface Nets mesher, Phase 5)
    std::vector<SmoothVertex> smoothVertices;
    uint32_t smoothVertexCount = 0;
-    bool hasSmooth = false; // true if chunk contains any smooth voxels
+    bool hasSmooth = false; // true if chunk has smooth mesh output (set by mesher)
    bool containsSmooth = false; // true if chunk contains any FLAG_SMOOTH voxels (set during generation)
    VoxelData& at(int x, int y, int z) {
        return voxels[x + y * CHUNK_SIZE + z * CHUNK_SIZE * CHUNK_SIZE];
--- a/voxel_engine_spec.md
+++ b/voxel_engine_spec.md
@ -318,3 +318,8 @@ Le VoxelRenderer s'insère dans le render path de Wicked via des hooks dans le R
 | Cross-vendor compatibility | Works on RDNA 2+ and RTX 3060+ | Manual tests |
 *End of the specification document*
 # Other ideas
 I would like to test something: a new block type that contains only custom 3D models and whose joints adapt dynamically to identical neighboring blocks. Specifically, I would like to create pipes that connect to each other, or create new connections, so that they always touch the neighboring pipe blocks.
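
One possible way to drive the pipe joints described above is a 6-bit connection mask computed from the face neighbors. The sketch below is only an illustration under stated assumptions: `BLOCK_PIPE`, `FaceOffset`, `pipeConnectionMask` and the `blockAt` lookup are hypothetical names that do not exist in the engine, and the face order simply mirrors the +X,-X,+Y,-Y,+Z,-Z convention already used by `GPUChunkInfo::neighbors`.

```cpp
#include <cstdint>

// Hypothetical block id for pipes (assumption, not an engine constant).
constexpr uint8_t BLOCK_PIPE = 42;

// Face order: +X, -X, +Y, -Y, +Z, -Z (same convention as GPUChunkInfo::neighbors).
struct FaceOffset { int dx, dy, dz; };
constexpr FaceOffset kFaceOffsets[6] = {
    {+1, 0, 0}, {-1, 0, 0},
    {0, +1, 0}, {0, -1, 0},
    {0, 0, +1}, {0, 0, -1},
};

// Returns a 6-bit mask: bit i is set when the neighbor across face i is also a pipe.
// blockAt(x, y, z) is a caller-supplied lookup returning the block id at a world
// position (hypothetical; in the engine it would wrap a voxel read).
template <typename BlockAt>
uint8_t pipeConnectionMask(int x, int y, int z, BlockAt&& blockAt) {
    uint8_t mask = 0;
    for (int i = 0; i < 6; ++i) {
        const FaceOffset& o = kFaceOffsets[i];
        if (blockAt(x + o.dx, y + o.dy, z + o.dz) == BLOCK_PIPE)
            mask |= static_cast<uint8_t>(1u << i);
    }
    return mask;
}
```

The mask has 64 possible values, so it could index a table of pre-built joint models (straight, elbow, tee, cross), or select which per-face connector arms to instance. When a pipe is placed or removed, only that cell and its six face neighbors would need their mask recomputed.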