# Reduction Constraints

[Free download link] cannbot-skills: CANNBot is a series of agents for CANN development that improve development efficiency; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when a kernel needs row-wise reductions, normalization, softmax, or quantization statistics inside a `vf` stage.

## Goal

Choose the right reduction idiom so that:

- the reduction stays in registers when possible to reduce UB traffic
- multi-pass flows keep intermediate state persistent across tile passes
- streaming stats follow the correct update order for numerical stability

## 1. Row sum normalization

Single-pass pattern:

- `sum = RegList.cadd()`
- `dup(sum)`
- then divide row registers

Files to study:

- `agent/example/kernels/a5/matmul_rowwise_norm.py`

## 2. Tiled softmax (three-pass)

Use when the full `S` dimension does not fit in one tile.

- pass 1: store float logits and track row max with `cmax()` + `dup()`
- pass 2: reload logits, subtract duplicated row max, `exp()`, accumulate row sum, and store exponentials
- pass 3: reload exponentials and divide by duplicated row sum before the final cast

## 3. Streaming softmax-style MLA stats

Single-pass online update per tile:

- `curr_max = max(prev_max, qk_tile.amax(-1))`
- `p_tile = exp(qk_tile - curr_max)`
- `row_sum = prev_sum * exp(prev_max - curr_max) + p_tile.sum(-1)`
- if the tile must be materialized in fp8, update `row_sum` from the float tile first and cast only after the float reduction is complete

Streaming MLA with final normalized `output:[B,H,Dn]`:

- rescale the running numerator by `exp(prev_max - curr_max)` before adding the current block contribution
- keep the numerator in float across all blocks
- apply `output / row_sum` only once after the loop
- if the value path intentionally uses `p.half().float()`, update `row_sum` from the float `exp(...)` tile before the cast, then cast only for the downstream cube consume path

Files to study:

- `agent/example/kernels/a5/test_mla_entire.py`
- `agent/example/kernels/a2/flash_attn_full.py`

## 4. Absmax scaling and quantization

- `abs()` then `cmax()` for row/block scalar
- divide row by duplicated scalar
- optionally emit scale via duplicated scalar to 64-lane row

Files to study:

- `agent/example/kernels/a5/matmul_chunk_absmax_norm128.py`
- `agent/example/kernels/a5/matmul_kmkn_blockwise_quant128.py`

## 5. Two-pass rowwise normalization

Use when the `N` dimension is too large for single-pass normalization.

- pass 1: matmul + row-stat accumulation into persistent UB buffer + temporary output write
- pass 2: reload temporary output and normalize by accumulated stats

The row-stat buffer must persist in UB across all `N`-tile passes within one `M` tile.

Files to study:

- `agent/example/kernels/a5/matmul_rowwise_norm_large_nk.py`
- `agent/example/kernels/a5/matmul_rowwise_l2_norm.py`

## 6. Running state across inner-loop iterations (a2 pattern)

When a kernel must accumulate a statistic (like running max for online softmax) across tiles in the inner loop, the first iteration cannot be special-cased, because the DSL has no conditional logic (`if first_iteration`).

**Solution**: identity-element initialization + unconditional update.

Initialize the running buffer to the identity element of the accumulation operation before the inner loop using `dup`, then apply the update unconditionally every iteration:

| Accumulation | Identity element | Update operation | Example |
|---|---|---|---|
| running max | `neg_large` (finite negative sentinel) | `vmax(running, running, tile)` | online softmax max tracking |
| running sum | `0.0` | `add(running, running, tile)` | online softmax sum accumulation |
| running product | `1.0` | `mul(running, running, tile)` | decay chain products |

**Why this works**: with `neg_large` chosen below every valid tile value, `max(neg_large, x) = x`, `0 + x = x`, and `1 × x = x`, so the first iteration naturally produces the correct initial value without special-casing.

**Choosing the right tensor format for the update operation**:

The update (`vmax`, `add`, etc.) is a binary element-wise operation between two UB tensors. Both must have matching stride layouts, and the operation must cover all intended elements.

On a2 without registers, `cmax` outputs dense scalars in `[M, 1]` format. Operate on this format directly:

```python
# Correct: vmax on [64, 1] covers all 64 rows
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # both [HALF_M, 1]
```

Do NOT broadcast to `[M, 8]` first and then attempt `vmax` between two `[M, 8]` buffers: `blk_stride=0` makes that operation cover only 1/8 of the elements. See `agent/references/constraints/vec-reduction-a2.md` section 5 for the detailed proof.

**Lifetime and reset rules**:

- The running buffer must be reset at the beginning of each **outer** loop iteration (each new M-tile gets fresh running stats)
- It persists across all **inner** loop iterations (N-tiles accumulate into it)
- UB is per-sub-block and persistent, so no special lifetime management is needed

**Complete pattern**:

```python
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)

with auto_sync():
    for gmt in range(mt_begin, mt_end):           # outer: M-tiles
        dup(ub_rmax_s, neg_large)                 # reset per M-tile with the running-max identity
        for nt in range(0, tiles_n):              # inner: N-tiles
            # ... compute tile, get ub_max_s via cmax ...
            vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # accumulate
            brcb(ub_max, ub_rmax_s, ...)          # broadcast AFTER update
            # ... subtract, exp, store ...
```

On a5 with `Reg` / `RegList`, the same pattern uses register-level `dup` and `vmaxs` instead of UB-level operations. The identity-element principle is the same.

Files to study:

- `agent/example/kernels/a2/flash_attn_score_iter.py`: validated running max pattern on a2
- `agent/example/kernels/a5/test_mla_entire.py`: streamed running max/sum in a5 register pipeline

## 7. a2 vec reduction (no registers)

On a2, `Reg` / `RegList` are not available. Reductions use UB-to-UB operations:

- `cmax` / `cadd` reduce 64 elements to 1 scalar per repeat (dense output)
- The dense output must be broadcast via `brcb` before use in `sub` / `div`
- For buffers wider than 64 columns, first merge with `vmax` / `add` to 64
- Complete pattern: `vmax → cmax → brcb → sub` (sliced for repeat alignment)

Read: `agent/references/constraints/vec-reduction-a2.md`

## 8. General rules

- keep reductions in registers when possible (a5 with `vf`)
- on a2, use the `cmax → brcb` UB-to-UB pattern instead
- use `dup()` to broadcast a scalar reduction result back to full-row width before element-wise operations
- for multi-pass flows, decide upfront which UB buffers persist across passes and which are reused
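
The streaming update order from section 3 and the identity-element initialization from section 6 can be checked on the host before writing any kernel code. The following is a minimal sketch in plain Python, not DSL code: `NEG_LARGE`, `streaming_softmax_stats`, and `reference_stats` are hypothetical names introduced here for illustration, and the sentinel value is an assumption that must sit below every valid logit in the real kernel.

```python
import math

# Hypothetical finite sentinel, assumed to be below every valid logit.
NEG_LARGE = -1.0e30


def streaming_softmax_stats(row, tile_size):
    """Return (running_max, running_sum) for one row, processed tile by tile."""
    running_max = NEG_LARGE  # identity element for max
    running_sum = 0.0        # identity element for sum
    for start in range(0, len(row), tile_size):
        tile = row[start:start + tile_size]
        # Update the max first, unconditionally (no first-iteration branch).
        curr_max = max(running_max, max(tile))
        # Rescale the previous sum to the new max, then add this tile's terms.
        rescale = math.exp(running_max - curr_max)
        running_sum = running_sum * rescale + sum(
            math.exp(x - curr_max) for x in tile
        )
        running_max = curr_max
    return running_max, running_sum


def reference_stats(row):
    """Two-pass reference: global max first, then the sum of exponentials."""
    row_max = max(row)
    return row_max, sum(math.exp(x - row_max) for x in row)
```

On the first tile the previous sum is `0.0`, so the rescale factor has no effect and the update reduces to the plain per-tile sum. Comparing the streamed result against `reference_stats` for several tile sizes is a cheap way to confirm the update order before porting it to `vmax` / `add` on UB buffers or registers.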