[Free download link] pto-isa: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. This repository offers high-performance, cross-platform tile operations across Ascend platforms. Project address: https://gitcode.com/cann/pto-isa

# PTO Tile Library

Parallel Tile Operation (PTO) is a virtual ISA for tile-oriented programming defined by Ascend CANN. This repository provides PTO Tile instruction implementations, examples, tests, and documentation to help developers migrate and optimize operators more smoothly across different Ascend generations.

## News

- 2025-12-27: PTO Tile Library is officially open-sourced.
- 2026-01-30: Added reduction instructions and MX instructions.
- 2026-02-28: Added convolution instructions, quantization instructions, and inter-kernel communication instructions.
- 2026-03-30: Added support for Ascend A5, asynchronous communication instructions, and CostModel performance simulation.
- 2026-04-02: Local engineering workflow improved with pre-commit checks, documentation build verification, and CPU-SIM validation updates.

## Project Positioning

The PTO ISA is built on Ascend's underlying hardware and software abstractions and defines more than 90 standard tile instructions. It uses a higher-level tile programming model to bridge implementation differences across generations.
Its goal is not to hide low-level capabilities, but to raise the abstraction level while preserving room for performance tuning.

- Unified cross-generation tile abstraction: reduces migration cost across different Ascend generations.
- Balances portability and performance: guarantees correct behavior under fixed tile shapes while preserving tuning dimensions such as tile size, tile shape, and instruction ordering.
- Designed for frameworks, operators, and toolchains: serves as a common interface for upper-layer frameworks, operator implementations, and compiler toolchains.
- Continuously extensible: defines 90 standard operations today, with ongoing implementation and ecosystem integration.

In addition to compute and ...

```bash
# CPU Simulator (recommended first step)
python3 tests/run_cpu.py --clean --verbose

# Run GEMM demo
python3 tests/run_cpu.py --demo gemm --verbose

# Run Flash Attention demo
python3 tests/run_cpu.py --demo flash_attn --verbose

# Run a single ST testcase
python3 tests/script/run_st.py -r sim -v a3 -t tadd -g TADDTest.case_float_64x64_64x64

# One-click build and run recommended tests
./build.sh --run_all --a3 --sim
```

For more complete build, test, and scripting details, see the Getting Started Guide and Test Guide.

## Recommended Examples

- Auto Mode Add example: a good first example for understanding how PTO instructions are organized
- GEMM performance example: useful for understanding tile-level operator optimization
- Flash Attention example: useful for understanding complex operators and performance tuning

## Recommended Learning Path

1. Start from simple examples to understand how PTO instructions organize tile-level computation and data movement.
2. Verify functionality and correctness in CPU simulation to build intuition about instruction semantics and results.
3. Port the code to Ascend hardware to validate correctness and collect performance data. See the msprof tool.
4. Identify performance bottlenecks (CUBE Bound / MTE Bound / Vector Bound) and start optimization and tuning.
See Performance Optimization.

This repository also demonstrates how standard tile operations can be mapped to different pipeline implementations through template parameters:

- Tile Programming Model: understand static tile shapes, dynamic tile masks, and data organization
- Events and Synchronization: understand set/wait flag and pipeline synchronization
- General Conventions: understand general PTO programming rules and constraints
- PTO Instruction List: browse the standard operations defined by the PTO ISA

## Documentation Navigation

### ISA and Programming Model

- ISA Overview: entry point and navigation for PTO ISA documentation
- PTO Instruction List: browse PTO standard operations by category
- Tile Programming Model: understand tile shapes, masks, and the programming model
- Events and Synchronization: understand event recording, waiting, and synchronization
- General Conventions: review naming, constraints, and common rules

### Development and Optimization

- Developer Documentation Index: browse documentation for extending PTO Tile Lib
- Performance Optimization: review performance analysis and tuning guidance
- Documentation Build Guide: learn how to build the MkDocs site locally

## Examples and Performance References

### GEMM

- Reference implementation: `kernels/manual/a2a3/gemm_performance/`
- Detailed analysis and tuning notes: High-Performance GEMM Operator Example

### Flash Attention

- Operator implementation and tuning notes: A2/A3 version, A5 version
- A5 build guide, with A5 performance numbers still pending: Flash Attention Performance Kernel (A5)
- S0: query sequence length (number of rows in Q/O)
- S1: key/value sequence length (number of rows in K/V)

Ascend 910B2 multi-core comparison, using `torch_npu` as the baseline:

| Sequence length | PTO time (us) | torch_npu time (us) | PTO TFLOPS | torch_npu TFLOPS | PTO speedup |
| --- | --- | --- | --- | --- | --- |
| 1024 | 20.960 | 58.461 | 25.61 | 9.18 | 2.79x |
| 2048 | 32.461 | 70.801 | 66.16 | 30.33 | 2.18x |
| 4096 | 88.902 | 118.302 | 96.62 | 72.61 | 1.33x |
| 8192 | 292.626 | 353.147 | 117.42 | 97.30 | 1.21x |
| 16384 | 909.058 | 1118.462 | 151.19 | 122.88 | 1.23x |
| 32768 | 3262.645 | 3646.173 | 168.50 | 150.78 | 1.12x |

### Communication Instruction Bandwidth

- Reference implementation: `kernels/manual/a2a3/tget_bandwidth/`
- Detailed analysis and build/run guide: TGET / TGET_ASYNC Bandwidth Comparison Example

This example measures point-to-point remote-read bandwidth on Ascend A2/A3 and compares `TGET` (synchronous, via UB staging) with `TGET_ASYNC` (asynchronous, direct transfer through the DMA engine).

### GEMM AllReduce Fused Compute-Communication

- Reference implementation: `kernels/manual/a2a3/gemm_ar/`
- Detailed analysis and tuning notes: High-Performance GEMM AllReduce Fused Operator Example

This example shows how PTO communication primitives can be fused with compute kernels to overlap GEMM and AllReduce within one operator pipeline.

## Platform Support

- Ascend A2 (Ascend 910B)
- Ascend A3 (Ascend 910C)
- Ascend A5 (Ascend 950)
- CPU (x86_64 / AArch64)

For more details, see include/README.md.

## Roadmap

Planned future features:

| Feature | Description | Scope | Progress / target completion |
| --- | --- | --- | --- |
| PTO Auto Mode | BiSheng compiler support for automatic tile buffer allocation and synchronization insertion. | Compiler / toolchain | Ongoing |
| PTO Tile Fusion | BiSheng compiler support for automatic tile operation fusion. | Compiler / toolchain | Ongoing |
| PTO-AS | Bytecode support for PTO ISA. | Compiler / toolchain | Ongoing |
| Convolution extension | PTO ISA support for convolution kernels. | ISA extension | Ongoing |
| Collective communication extension | Add asynchronous communication instructions for Ccu and Roce, and add the TPREFECTH (AIV direct-drive) communication instruction. | Communication ISA extension | 2026 Q2 |
| System scheduling extension | PTO ISA support for SPMD/MPMD programming schedules. | ISA extension | Planned |
| Micro-instructions | Support expressing high-performance operators through micro-instructions, together with a foundational high-performance micro-instruction library. | ISA extension / operator development | 2026 Q2 |
| Base instructions | Further optimize A5 instruction performance, add Pooling-related base instructions, and enhance convolution, quantization, and Fixpipe instruction capabilities. | ISA extension | 2026 Q2 |
| CostModel | Support CostModel performance simulation for A5 instructions. | Toolchain / performance modeling | 2026 Q2 |
| CPU-SIM | Keep CPU-SIM in sync with instruction enhancements. | CPU simulation | 2026 Q2 |

## Directory Structure

Key directories are listed below:

```
├── include/            # Public PTO headers and interfaces
│   └── pto/            # Common types, ISA interfaces, and CPU/NPU implementations
├── kernels/            # Kernels and operator implementations
│   ├── manual/         # Hand-optimized implementations and performance examples
│   └── custom/         # Custom operator examples
├── docs/               # ISA, programming model, getting started, and doc site sources
│   ├── isa/            # Instruction references and category indexes
│   ├── coding/         # Developer and performance optimization docs
│   ├── assembly/       # PTO-AS assembly syntax and specification
│   └── mkdocs/         # MkDocs config and source files
├── demos/              # Auto Mode, baseline, and torch_jit examples
├── tests/              # CPU / NPU tests, scripts, and test entry points
│   ├── cpu/            # CPU simulation tests
│   ├── npu/            # SoC-specific NPU tests
│   └── script/         # Test build and execution scripts
├── scripts/            # Build, install, and release scripts
├── cmake/              # Shared CMake configuration and packaging logic
├── build.sh            # One-click build and run entry script
└── CMakeLists.txt      # Top-level CMake configuration
```

## Related Information

- Contributing Guide: contribution workflow and development guidelines
- Security and Vulnerability Disclosure: process for reporting security issues
- Release Notes: version updates and release history
- License: CANN Open Software License Agreement Version 2.0
- PyPTO: an upper-layer programming framework in the PTO ecosystem
- PTOAS: PTO assembler and compiler backend for PTO workflows
- pto-dsl: Pythonic frontend and JIT workflow exploration for PTO

## Contact Us

- Issue reporting: submit problems through repository Issues
- Feature requests: share suggestions through Issues or discussion channels
- Code contributions: contribute through Pull Requests
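
As a quick sanity check on the Flash Attention comparison table in the Examples and Performance References section, the PTO speedup column can be recomputed from the two timing columns. The sketch below assumes that speedup is defined as the `torch_npu` time divided by the PTO time; the README does not state the formula. The names `ROWS`, `speedup`, and `check_rows` are illustrative and not part of the repository.

```python
# Timings copied from the Ascend 910B2 Flash Attention comparison table.
# Each entry: (sequence length, PTO time in us, torch_npu time in us, reported speedup)
ROWS = [
    (1024, 20.960, 58.461, 2.79),
    (2048, 32.461, 70.801, 2.18),
    (4096, 88.902, 118.302, 1.33),
    (8192, 292.626, 353.147, 1.21),
    (16384, 909.058, 1118.462, 1.23),
    (32768, 3262.645, 3646.173, 1.12),
]


def speedup(pto_us, baseline_us):
    """Assumed definition: baseline time divided by PTO time."""
    return baseline_us / pto_us


def check_rows(rows, tol=0.005):
    """Return (seq_len, recomputed, reported, matches) for every table row."""
    results = []
    for seq_len, pto_us, baseline_us, reported in rows:
        recomputed = round(speedup(pto_us, baseline_us), 2)
        matches = abs(recomputed - reported) <= tol
        results.append((seq_len, recomputed, reported, matches))
    return results


if __name__ == "__main__":
    for seq_len, recomputed, reported, matches in check_rows(ROWS):
        status = "ok" if matches else "MISMATCH"
        print(f"S={seq_len:>6}  recomputed={recomputed:.2f}x  reported={reported:.2f}x  {status}")
```

Under this assumed definition, the recomputed ratios agree with the reported speedup column to two decimal places. The same pattern can be reused to compare new measurements, for example once A5 numbers become available.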