# asc.language.basic.load_data

pyasc: this project provides an operator programming interface for Python users and supports accelerated computing on Ascend AI processors. The interfaces correspond one-to-one with Ascend C and follow native Python syntax. Project URL: https://gitcode.com/cann/pyasc

```
asc.language.basic.load_data(dst: LocalTensor, src: LocalTensor, params: LoadData2DParams) → None
asc.language.basic.load_data(dst: LocalTensor, src: GlobalTensor, params: LoadData2DParams) → None
asc.language.basic.load_data(dst: LocalTensor, src: LocalTensor, params: LoadData2DParamsV2) → None
asc.language.basic.load_data(dst: LocalTensor, src: GlobalTensor, params: LoadData2DParamsV2) → None
asc.language.basic.load_data(dst: LocalTensor, src: LocalTensor, params: LoadData3DParamsV1) → None
asc.language.basic.load_data(dst: LocalTensor, src: LocalTensor, params: LoadData3DParamsV2) → None
asc.language.basic.load_data(dst: LocalTensor, src: LocalTensor, params: LoadData3DParamsV2Pro) → None
```

The size of a fractal matrix depends on the data type of the source and destination operands:

| Data type | A1/A2 | B1/B2 |
| --- | --- | --- |
| uint8_t / int8_t | 16*32 | 32*16 |
| uint16_t / int16_t / half / bfloat16_t | 16*16 | 16*16 |
| uint32_t / int32_t / float | 16*8 | 8*16 |

The following data paths are supported: GM->A1, GM->B1, GM->A2, GM->B2, A1->A2, B1->B2.

## Corresponding Ascend C function prototypes

```cpp
template <typename T>
__aicore__ inline void LoadData(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData2DParams& loadDataParams)

template <typename T>
__aicore__ inline void LoadData(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const LoadData2DParams& loadDataParams)

template <typename T>
__aicore__ inline void LoadData(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData2DParamsV2& loadDataParams)

template <typename T>
__aicore__ inline void LoadData(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const LoadData2DParamsV2& loadDataParams)

template <typename T>
__aicore__ inline void LoadData(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData3DParamsV2Pro& loadDataParams)

template <typename T, const IsResetLoad3dConfig &defaultConfig = IS_RESER_LOAD3D_DEFAULT_CONFIG, typename U = PrimT<T>, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true>
__aicore__ inline void LoadData(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData3DParamsV1<U>& loadDataParams)

template <typename T, const IsResetLoad3dConfig &defaultConfig = IS_RESER_LOAD3D_DEFAULT_CONFIG, typename U = PrimT<T>, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true>
__aicore__ inline void LoadData(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData3DParamsV2<U>& loadDataParams)
```

## Parameters

- `dst`: destination operand, of type LocalTensor. It is the target tensor of the load. Supported TPosition values are VECIN/VECCALC/VECOUT. The start address must be 32-byte aligned.
- `src`: source operand, of type LocalTensor or GlobalTensor. A LocalTensor source means a 2D-style transfer between different local storage units on the chip. A GlobalTensor source means a 2D-style load from Global Memory into a LocalTensor. The element data type must match that of `dst`.
- `params`: one of the structures described below.

### LoadData2DParams

| Field | Description | Range | Default |
| --- | --- | --- | --- |
| start_index | Fractal matrix ID: the index of the fractal in the source operand at which the transfer starts. 0 is the first fractal matrix of the source operand. Unit: 512B. | [0, 65535] | 0 |
| repeat_times | Number of iterations. Each iteration processes 512B of data. | [1, 255] | |
| src_stride | Between adjacent iterations, the interval between the start address of the previous source fractal and the start address of the next one. Unit: 512B. | [0, 65535] | 0 |
| sid | Reserved. Set to 0. | | |
| dst_gap | Between adjacent iterations, the interval between the end address of the previous destination fractal and the start address of the next one. Unit: 512B. | [0, 65535] | 0 |
| if_transpose | Whether to transpose each fractal matrix. | | false |
| addr_mode | Reserved. Set to 0. | | |

### LoadData2DParamsV2

| Field | Description | Range | Default |
| --- | --- | --- | --- |
| m_start_position | Start position in the M dimension. | [0, 65535] | 0 |
| k_start_position | Start position in the K dimension. | [0, 65535] | 0 |
| m_step | Step in the M dimension. | [0, 65535] | 0 |
| k_step | Step in the K dimension. | [0, 65535] | 0 |
| src_stride | Source operand stride. | [-2147483648, 2147483647] | 0 |
| dst_stride | Destination operand stride. | [0, 65535] | 0 |
| if_transpose | Whether to enable transposition. | | false |
| sid | Stream ID. | [0, 255] | 0 |

### LoadData3DParamsV2Pro

| Field | Description | Range | Default |
| --- | --- | --- | --- |
| channel_size | Channel size. | [0, 65535] | 0 |
| en_transpose | Whether to enable transposition. | | false |
| en_small_k | Whether to enable the small-K optimization. | | false |
| filter_size_w | Whether to enable the filter-width optimization. | | false |
| filter_size_h | Whether to enable the filter-height optimization. | | false |
| f_matrix_ctrl | Whether to enable matrix control. | | false |
| ext_config | Extended configuration. | [0, 18446744073709551615] | 0 |
| filter_config | Filter configuration. | [0, 18446744073709551615] | 0x10101010101 |

### LoadData3DParamsV1

| Field | Description | Range | Default |
| --- | --- | --- | --- |
| pad_list | Padding list, in the order [padding_left, padding_right, padding_top, padding_bottom]. | each element [0, 255] | |
| l1_h | Height of the source operand. | [1, 32767] | |
| l1_w | Width of the source operand. | [1, 32767] | |
| c1_index | Start of the convolution window in the C1 dimension of the source tensor. | [0, 4095] | |
| fetch_filter_w | Start position of the convolution window in the W dimension of the filter. | [0, 254] | |
| fetch_filter_h | Start position of the convolution window in the H dimension of the filter. | [0, 254] | |
| left_top_w | Start of the convolution window in the W dimension of the source tensor. | [-255, 32767] | |
| left_top_h | Start of the convolution window in the H dimension of the source tensor. | [-255, 32767] | |
| stride_w | Sliding stride of the convolution kernel in the W dimension. | [1, 63] | |
| stride_h | Sliding stride of the convolution kernel in the H dimension. | [1, 63] | |
| filter_w | Width of the convolution kernel. | [1, 255] | |
| filter_h | Height of the convolution kernel. | [1, 255] | |
| dilation_filter_w | Dilation factor of the convolution kernel in the W dimension. | [1, 255] | |
| dilation_filter_h | Dilation factor of the convolution kernel in the H dimension. | [1, 255] | |
| jump_stride | Step by which the destination operand address increases between iterations. | [1, 127] | |
| repeat_mode | Iteration mode. | [0, 1] | 0 |
| repeat_time | Number of iterations. | [1, 255] | |
| c_size | Control parameter for the channel-unrolling optimization. | [0, 1] | 0 |
| pad_value | Padding fill value. Must have the same data type as `src`. | | 0 |

### LoadData3DParamsV2

| Field | Description | Range | Default |
| --- | --- | --- | --- |
| pad_list | Padding list, in the order [padding_left, padding_right, padding_top, padding_bottom]. | each element [0, 255] | |
| l1_h | Height of the source operand. | [1, 32767] | |
| l1_w | Width of the source operand. | [1, 32767] | |
| channel_size | Channel size. Alignment constraints depend on the data type and the platform. | | |
| k_extension | Extension length in the K dimension. | [1, 65535] | |
| m_extension | Extension length in the M dimension. | [1, 65535] | |
| k_start_pt | Start position in the K dimension. | [0, 65535] | |
| m_start_pt | Start position in the M dimension. | [0, 65535] | |
| stride_w | Sliding stride of the convolution kernel in the W dimension. | [1, 63] | |
| stride_h | Sliding stride of the convolution kernel in the H dimension. | [1, 63] | |
| filter_w | Width of the convolution kernel. | [1, 255] | |
| filter_h | Height of the convolution kernel. | [1, 255] | |
| dilation_filter_w | Dilation factor of the convolution kernel in the W dimension. | [1, 255] | |
| dilation_filter_h | Dilation factor of the convolution kernel in the H dimension. | [1, 255] | |
| en_transpose | Whether to enable transposition (bool). | | false |
| pad_value | Padding fill value. Must have the same data type as `src`. | | 0 |
| filter_size_w | Whether to add 256 elements on top of filter_w. | | false |
| filter_size_h | Whether to add 256 elements on top of filter_h. | | false |
| f_matrix_ctrl | FeatureMap matrix control switch. | | false |

## Constraints

- The data of `dst` and `src` must meet the start-address alignment requirements. See the documentation for details.
- Leave settings that are unused, or that you do not want to change, at their default values. This helps performance.

## Examples

The examples assume that the `asc` package has been imported. Constructor arguments are passed positionally, in the field order of the corresponding structure above.

2D transfer within Local Memory (Local -> Local):

```python
@asc.jit
def kernel_load_data_l2l(x: asc.GlobalAddress) -> None:
    x_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECIN, addr=0, tile_size=512)
    y_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECOUT, addr=0, tile_size=512)
    # start_index=0, repeat_times=4, all other fields 0
    params = asc.LoadData2DParams(0, 4, 0, 0, 0, 0, 0)
    asc.load_data(y_local, x_local, params)
```

2D transfer from Global Memory to Local Memory (Global -> Local):

```python
@asc.jit
def kernel_load_data_g2l(x: asc.GlobalAddress) -> None:
    x_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECIN, addr=0, tile_size=512)
    y_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECOUT, addr=0, tile_size=512)
    x_gm = asc.GlobalTensor()
    x_gm.set_global_buffer(x)
    params = asc.LoadData2DParams(0, 4, 0, 0, 0, 0, 0)
    asc.load_data(y_local, x_local, params)  # Local -> Local
    asc.load_data(x_local, x_gm, params)     # Global -> Local
```

2D transfer within Local Memory, V2 version (Local -> Local):

```python
@asc.jit
def kernel_load_data_l2l_v2(x: asc.GlobalAddress) -> None:
    x_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECIN, addr=0, tile_size=512)
    y_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECOUT, addr=0, tile_size=512)
    # m_step=16, k_step=16, if_transpose=False, all other fields 0
    params_v2 = asc.LoadData2DParamsV2(0, 0, 16, 16, 0, 0, False, 0)
    asc.load_data(y_local, x_local, params_v2)
```

2D transfer from Global Memory to Local Memory, V2 version (Global -> Local):

```python
@asc.jit
def kernel_load_data_g2l_v2(x: asc.GlobalAddress) -> None:
    x_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECIN, addr=0, tile_size=512)
    y_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECOUT, addr=0, tile_size=512)
    x_gm = asc.GlobalTensor()
    x_gm.set_global_buffer(x)
    params_v2 = asc.LoadData2DParamsV2(0, 0, 16, 16, 0, 0, False, 0)
    asc.load_data(y_local, x_local, params_v2)  # Local -> Local
    asc.load_data(x_local, x_gm, params_v2)     # Global -> Local
```

3D transfer within Local Memory, V2Pro version (Local -> Local):

```python
@asc.jit
def kernel_load_data_3d_v2pro(x: asc.GlobalAddress) -> None:
    x_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECIN, addr=0, tile_size=512)
    y_local = asc.LocalTensor(dtype=asc.float16, pos=asc.TPosition.VECOUT, addr=0, tile_size=512)
    # channel_size=16, all switches off, ext_config=0, filter_config at its default
    params_3d_v2_pro = asc.LoadData3DParamsV2Pro(16, False, False, False, False, False, 0, 0x10101010101)
    asc.load_data(y_local, x_local, params_3d_v2_pro)
```

3D transfer within Local Memory with LoadData3DParamsV1. This example and the next are written as unit tests; `mock_launcher_run` and `MockTensor` are test helpers that are not defined on this page.

```python
def test_load_data_v1(mock_launcher_run):

    @asc.jit
    def kernel_load_data_v1(x: asc.GlobalAddress) -> None:
        x_local = asc.LocalTensor(
            dtype=asc.float16,
            pos=asc.TPosition.VECIN,
            addr=0,
            tile_size=512,
        )
        y_local = asc.LocalTensor(
            dtype=asc.float16,
            pos=asc.TPosition.VECOUT,
            addr=0,
            tile_size=512,
        )
        x_gm = asc.GlobalTensor()
        x_gm.set_global_buffer(x)
        params_3d_v1 = asc.LoadData3DParamsV1(
            [0, 0, 0, 0], 16, 16, 0, 0, 0, 0, 0, 1, 1, 3, 3, 1, 1, 1, 0, 1, 0, 0,
        )
        asc.load_data(y_local, x_local, params_3d_v1)

    x = MockTensor(asc.float16)
    kernel_load_data_v1[1](x)
    assert mock_launcher_run.call_count == 1
```

3D transfer within Local Memory with LoadData3DParamsV2:

```python
def test_load_data_v2(mock_launcher_run):

    @asc.jit
    def kernel_load_data_v2(x: asc.GlobalAddress) -> None:
        x_local = asc.LocalTensor(
            dtype=asc.float16,
            pos=asc.TPosition.VECIN,
            addr=0,
            tile_size=512,
        )
        y_local = asc.LocalTensor(
            dtype=asc.float16,
            pos=asc.TPosition.VECOUT,
            addr=0,
            tile_size=512,
        )
        x_gm = asc.GlobalTensor()
        x_gm.set_global_buffer(x)
        params_3d_v2 = asc.LoadData3DParamsV2(
            [0, 0, 0, 0], 16, 16, 16, 16, 16, 0, 0, 1, 1, 3, 3, 1, 1, False, 0, False, False, False,
        )
        asc.load_data(y_local, x_local, params_3d_v2)

    x = MockTensor(asc.float16)
    kernel_load_data_v2[1](x)
    assert mock_launcher_run.call_count == 1
```

Content statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
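The fractal sizes and the `LoadData2DParams` ranges described above can be checked with plain Python, without an Ascend device. The sketch below is illustrative only: `fractal_shape`, `Load2DConfig` and `iteration_offsets` are hypothetical helpers and are not part of pyasc. The offset arithmetic is one reading of the field descriptions: `src_stride` is measured between fractal start addresses, and `dst_gap` between the end of one fractal and the start of the next, both in 512B units. The `repeat_times` default of 1 is also an assumption, since the page gives none.

```python
from dataclasses import dataclass
from typing import List, Tuple

FRACTAL_BYTES = 512  # 16*32*1 == 16*16*2 == 16*8*4 == 512 bytes per fractal


def fractal_shape(elem_bytes: int, buffer: str) -> Tuple[int, int]:
    """Return (rows, cols) of one fractal matrix, per the sizes listed above."""
    if elem_bytes not in (1, 2, 4):
        raise ValueError(f"unsupported element size: {elem_bytes}")
    if buffer not in ("A1", "A2", "B1", "B2"):
        raise ValueError(f"unsupported buffer: {buffer}")
    other = 32 // elem_bytes  # 32, 16 or 8 elements
    # A1/A2 use 16 x other; B1/B2 use other x 16.
    return (16, other) if buffer in ("A1", "A2") else (other, 16)


@dataclass(frozen=True)
class Load2DConfig:
    """Hypothetical stand-in for LoadData2DParams (not the pyasc class)."""
    start_index: int = 0
    repeat_times: int = 1  # assumed default; the reference lists none
    src_stride: int = 0
    dst_gap: int = 0

    def __post_init__(self) -> None:
        # Ranges copied from the LoadData2DParams description.
        checks = {
            "start_index": (self.start_index, 0, 65535),
            "repeat_times": (self.repeat_times, 1, 255),
            "src_stride": (self.src_stride, 0, 65535),
            "dst_gap": (self.dst_gap, 0, 65535),
        }
        for name, (value, lo, hi) in checks.items():
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} outside [{lo}, {hi}]")


def iteration_offsets(cfg: Load2DConfig) -> List[Tuple[int, int]]:
    """Byte offsets (src, dst) of the fractal touched by each iteration."""
    offsets = []
    for i in range(cfg.repeat_times):
        # Source: start-to-start stride, beginning at fractal start_index.
        src = (cfg.start_index + i * cfg.src_stride) * FRACTAL_BYTES
        # Destination: one fractal plus dst_gap between consecutive starts.
        dst = i * (1 + cfg.dst_gap) * FRACTAL_BYTES
        offsets.append((src, dst))
    return offsets
```

For example, `iteration_offsets(Load2DConfig(start_index=2, repeat_times=2, src_stride=3, dst_gap=1))` gives `[(1024, 0), (2560, 1024)]` under this reading, and `Load2DConfig(repeat_times=0)` raises `ValueError` because `repeat_times` must lie in [1, 255].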