CANN/EasyAsc DSL Reduction Constraints

张建站

2026/6/3 21:48:31

10 min read

# Reduction Constraints

[Free download link] cannbot-skills: CANNBot is a series of agents for CANN development that improve development efficiency; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.

Read this file when a kernel needs row-wise reductions, normalization, softmax, or quantization statistics inside a `vf` stage.

## Goal

Choose the right reduction idiom so that:

- the reduction stays in registers when possible to reduce UB traffic
- multi-pass flows keep intermediate state persistent across tile passes
- streaming stats follow the correct update order for numerical stability

## 1. Row sum normalization

Single-pass pattern:

- `sum = RegList.cadd()`
- `dup(sum)`
- then divide row registers

Files to study:

- `agent/example/kernels/a5/matmul_rowwise_norm.py`

## 2. Tiled softmax (three-pass)

Use when the full `S` dimension does not fit in one tile.

- pass 1: store float logits and track row max with `cmax()` + `dup()`
- pass 2: reload logits, subtract duplicated row max, `exp()`, accumulate row sum, and store exponentials
- pass 3: reload exponentials and divide by duplicated row sum before the final cast

## 3. Streaming softmax-style MLA stats

Single-pass online update per tile:

- `curr_max = max(prev_max, qk_tile.amax(-1))`
- `p_tile = exp(qk_tile - curr_max)`
- `row_sum = prev_sum * exp(prev_max - curr_max) + p_tile.sum(-1)`
- if the tile must be materialized in fp8, update `row_sum` from the float tile first and cast only after the float reduction is complete

Streaming MLA with final normalized `output:[B,H,Dn]`:

- rescale the running numerator by `exp(prev_max - curr_max)` before adding the current block contribution
- keep the numerator in float across all blocks
- apply `output / row_sum` only once after the loop
- if the value path intentionally uses `p.half().float()`, update `row_sum` from the float `exp(...)` tile before the cast, then cast only for the downstream cube consume path

Files to study:

- `agent/example/kernels/a5/test_mla_entire.py`
- `agent/example/kernels/a2/flash_attn_full.py`

## 4. Absmax scaling and quantization

- `abs()` then `cmax()` for row/block scalar
- divide row by duplicated scalar
- optionally emit scale via duplicated scalar to 64-lane row

Files to study:

- `agent/example/kernels/a5/matmul_chunk_absmax_norm128.py`
- `agent/example/kernels/a5/matmul_kmkn_blockwise_quant128.py`

## 5. Two-pass rowwise normalization

Use when the `N` dimension is too large for single-pass normalization.

- pass 1: matmul + row-stat accumulation into persistent UB buffer + temporary output write
- pass 2: reload temporary output and normalize by accumulated stats

The row-stat buffer must persist in UB across all `N`-tile passes within one `M` tile.

Files to study:

- `agent/example/kernels/a5/matmul_rowwise_norm_large_nk.py`
- `agent/example/kernels/a5/matmul_rowwise_l2_norm.py`

## 6. Running state across inner-loop iterations (a2 pattern)

When a kernel must accumulate a statistic (like running max for online softmax) across tiles in the inner loop, the DSL has no conditional logic (`if first_iteration`).

Solution: identity-element initialization + unconditional update.

Initialize the running buffer to the identity element of the accumulation operation before the inner loop using `dup`, then apply the update unconditionally every iteration:

| Accumulation | Identity element | Update operation | Example |
| --- | --- | --- | --- |
| running max | `neg_large` (finite negative sentinel) | `vmax(running, running, tile)` | online softmax max tracking |
| running sum | `0.0` | `add(running, running, tile)` | online softmax sum accumulation |
| running product | `1.0` | `mul(running, running, tile)` | decay chain products |

Why this works: with `neg_large` chosen below every valid tile value, `max(neg_large, x) = x`, `0 + x = x`, and `1 × x = x`, so the first iteration naturally produces the correct initial value without special-casing.

Choosing the right tensor format for the update operation:

The update (`vmax`, `add`, etc.) is a binary element-wise operation between two UB tensors. Both must have matching stride layouts, and the operation must cover all intended elements.

On a2 without registers, `cmax` outputs dense scalars in `[M, 1]` format. Operate on this format directly:

```python
# Correct: vmax on [64, 1] covers all 64 rows
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # both [HALF_M, 1]
```

Do NOT broadcast to `[M, 8]` first and then attempt `vmax` between two `[M, 8]` buffers: `blk_stride=0` makes that operation cover only 1/8 of the elements. See `agent/references/constraints/vec-reduction-a2.md` section 5 for the detailed proof.

Lifetime and reset rules:

- The running buffer must be reset at the beginning of each outer loop iteration (each new M-tile gets fresh running stats)
- It persists across all inner loop iterations (N-tiles accumulate into it)
- UB is per-sub-block and persistent, so no special lifetime management is needed

Complete pattern:

```python
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)

with auto_sync():
    for gmt in range(mt_begin, mt_end):           # outer: M-tiles
        dup(ub_rmax_s, neg_large)                 # reset per M-tile with the running-max identity
        for nt in range(0, tiles_n):              # inner: N-tiles
            # ... compute tile, get ub_max_s via cmax ...
            vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # accumulate
            brcb(ub_max, ub_rmax_s, ...)          # broadcast AFTER update
            # ... subtract, exp, store ...
```

On a5 with `Reg`/`RegList`, the same pattern uses register-level `dup` and `vmaxs` instead of UB-level operations. The identity-element principle is the same.

Files to study:

- `agent/example/kernels/a2/flash_attn_score_iter.py`: validated running max pattern on a2
- `agent/example/kernels/a5/test_mla_entire.py`: streamed running max/sum in a5 register pipeline

## 7. a2 vec reduction (no registers)

On a2, `Reg`/`RegList` are not available. Reductions use UB-to-UB operations:

- `cmax`/`cadd` reduce 64 elements to 1 scalar per repeat (dense output)
- The dense output must be broadcast via `brcb` before use in `sub`/`div`
- For buffers wider than 64 columns, first merge with `vmax`/`add` to 64
- Complete pattern: `vmax → cmax → brcb → sub` (sliced for repeat alignment)

Read: `agent/references/constraints/vec-reduction-a2.md`

## 8. General rules

- keep reductions in registers when possible (a5 with `vf`)
- on a2, use the `cmax → brcb` UB-to-UB pattern instead
- use `dup()` to broadcast a scalar reduction result back to full-row width before element-wise operations
- for multi-pass flows, decide upfront which UB buffers persist across passes and which are reused
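
The streaming update order from section 3 can be checked outside the DSL with a host-side reference. The sketch below is plain Python, not EasyAsc code: it handles a single row, uses one scalar value per key for brevity, and the function names and the `-1.0e30` sentinel are assumptions made for illustration. It rescales both the running sum and the running numerator by `exp(prev_max - curr_max)` before adding the current block, and divides only once after the loop.

```python
import math


def streaming_softmax_output(scores, values, block):
    """Online softmax-weighted sum over one row, processed block by block."""
    neg_large = -1.0e30  # assumed finite sentinel below every valid score
    prev_max = neg_large
    row_sum = 0.0
    numer = 0.0  # running numerator, kept in float across all blocks

    for start in range(0, len(scores), block):
        qk_tile = scores[start:start + block]
        v_tile = values[start:start + block]

        curr_max = max(prev_max, max(qk_tile))
        scale = math.exp(prev_max - curr_max)  # rescale factor for old state
        p_tile = [math.exp(s - curr_max) for s in qk_tile]

        # Rescale the old state first, then add the current block contribution.
        row_sum = row_sum * scale + sum(p_tile)
        numer = numer * scale + sum(p * v for p, v in zip(p_tile, v_tile))
        prev_max = curr_max

    return numer / row_sum  # normalize once, after the loop


def reference_output(scores, values):
    """Full-row softmax-weighted sum, used only to validate the streaming form."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    return sum(pi * vi for pi, vi in zip(p, values)) / sum(p)


scores = [0.5, -1.0, 3.0, 2.0, -0.25, 1.5, 0.0]
values = [1.0, 2.0, -1.0, 0.5, 4.0, -2.0, 3.0]
for block in (1, 2, 3, 7):
    diff = abs(streaming_softmax_output(scores, values, block)
               - reference_output(scores, values))
    print(block, diff < 1e-9)
```

On the first block the rescale factor underflows to zero, so the sentinel contributes nothing and no first-iteration branch is needed. A reference like this is one way to compare a kernel's block-wise results against the full-row computation for different tile sizes.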

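
The identity-element rule from section 6 can likewise be illustrated on the host. This is a minimal plain-Python sketch, not DSL code: `accumulate_rows`, `NEG_LARGE`, and the sample tile statistics are hypothetical names and data chosen for the example. The outer loop stands in for M-tiles (the reset with `dup`), the inner loop for N-tiles (the unconditional update).

```python
NEG_LARGE = -3.0e38  # assumed finite sentinel below every valid tile value


def accumulate_rows(tiles_per_m_tile, identity, update):
    """Accumulate one running statistic per M-tile with no first-iteration branch."""
    results = []
    for tiles in tiles_per_m_tile:      # outer loop: one entry per M-tile
        running = identity              # reset with the identity element
        for tile_stat in tiles:         # inner loop: N-tiles
            running = update(running, tile_stat)  # unconditional update
        results.append(running)
    return results


# Per-tile statistics for two M-tiles (made-up values).
tile_stats = [[-2.0, 5.0, 1.0], [-7.0, -3.0]]

running_max = accumulate_rows(tile_stats, NEG_LARGE, max)
running_sum = accumulate_rows(tile_stats, 0.0, lambda a, b: a + b)
running_prod = accumulate_rows(tile_stats, 1.0, lambda a, b: a * b)

print(running_max)   # [5.0, -3.0]
print(running_sum)   # [4.0, -10.0]
print(running_prod)  # [-10.0, 21.0]
```

The second M-tile ends with a running max of -3.0 only because the buffer is reset per outer iteration; without the reset it would carry over 5.0 from the first M-tile.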