Memory part 4: NUMA support

October 17, 2007

This article was contributed by Ulrich Drepper

[Editor's note: welcome to part 4 of Ulrich Drepper's "What every programmer should know about memory"; this section discusses the particular challenges associated with non-uniform memory access (NUMA) systems. Those who have not read part 1, part 2, and part 3 may wish to do so now. As always, please send typo reports and the like to lwn@lwn.net rather than posting them as comments here.]

5 NUMA Support

In Section 2 we saw that, on some machines, the cost of access to specific regions of physical memory differs depending on where the access originated. This type of hardware requires special care from the OS and the applications. We will start with a few details of NUMA hardware, then we will cover some of the support the Linux kernel provides for NUMA.

5.1 NUMA Hardware

Non-uniform memory architectures are becoming more and more common. In the simplest form of NUMA, a processor can have local memory (see Figure 2.3) which is cheaper to access than memory local to other processors. The difference in cost for this type of NUMA system is not high, i.e., the NUMA factor is low.

NUMA is also, and especially, used in big machines. We have described the problems of having many processors access the same memory. For commodity hardware all processors would share the same Northbridge (ignoring the AMD Opteron NUMA nodes for now; they have their own problems).
This makes the Northbridge a severe bottleneck since all memory traffic is routed through it. Big machines can, of course, use custom hardware in place of the Northbridge but, unless the memory chips used have multiple ports (i.e., they can be used from multiple busses), there still is a bottleneck. Multiport RAM is complicated and expensive to build and support and, therefore, it is hardly ever used.

The next step up in complexity is the model AMD uses where an interconnect mechanism (HyperTransport in AMD's case, technology they licensed from Digital) provides access for processors which are not directly connected to the RAM. The size of the structures which can be formed this way is limited unless one wants to increase the diameter (i.e., the maximum distance between any two nodes) arbitrarily.

Figure 5.1: Hypercubes

An efficient topology for the nodes is the hypercube, which limits the number of nodes to $2^C$ where $C$ is the number of interconnect interfaces each node has. Hypercubes have the smallest diameter for all systems with $2^n$ CPUs. Figure 5.1 shows the first three hypercubes. Each hypercube has a diameter of $C$ which is the absolute minimum. AMD's first-generation Opteron processors have three HyperTransport links per processor. At least one of the processors has to have a Southbridge attached to one link, meaning, currently, that a hypercube with $C=2$ can be implemented directly and efficiently.
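The hypercube properties stated above (at most $2^C$ nodes, diameter $C$) can be checked with a small sketch. This is an illustration only, not something from the article: the function names are hypothetical, and the model assumes every node uses all $C$ links for the interconnect, with each node labelled by a $C$-bit number and linked to the nodes that differ from it in exactly one bit.

```python
from collections import deque


def hypercube_neighbors(node: int, dims: int) -> list[int]:
    """Return the nodes that differ from `node` in exactly one of `dims` bits."""
    return [node ^ (1 << bit) for bit in range(dims)]


def hypercube_diameter(dims: int) -> int:
    """Longest shortest path (in hops) between any two of the 2**dims nodes."""
    node_count = 1 << dims
    diameter = 0
    for start in range(node_count):
        # Breadth-first search gives the hop count from `start` to every node.
        hops = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in hypercube_neighbors(current, dims):
                if neighbor not in hops:
                    hops[neighbor] = hops[current] + 1
                    queue.append(neighbor)
        diameter = max(diameter, max(hops.values()))
    return diameter


def summarize(max_dims: int) -> list[tuple[int, int, int]]:
    """List (links per node, node count, diameter) for C = 1 .. max_dims."""
    return [(c, 1 << c, hypercube_diameter(c)) for c in range(1, max_dims + 1)]


if __name__ == "__main__":
    for links, nodes, diameter in summarize(3):
        print(f"C={links}: {nodes} nodes, diameter {diameter}")
```

For the three hypercubes of Figure 5.1 this yields 2, 4, and 8 nodes with diameters 1, 2, and 3, which is why a link given up to the Southbridge costs a whole dimension.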
The next generation is announced to have four links, at which point $C=3$ hypercubes will be possible.

This does not mean, though, that larger accumulations of processors cannot be supported. There are companies which have developed crossbars allowing larger sets of processors to be used (e.g., Newisys's Horus). But these crossbars increase the NUMA factor and they stop being effective at a certain number of processors.

The next step up means connecting groups of CPUs and implementing a shared memory for all of them. All such systems need specialized hardware and are by no means commodity systems. Such designs exist at several levels of complexity. A system which is still quite close to a commodity machine is the IBM x445 and similar machines. They can be bought as ordinary 4U, 8-way machines with x86 and x86-64 processors. Two (at some point up to four) of these machines can then be connected to work as a single machine with shared memory. The interconnect used introduces a significant NUMA factor which the OS, as well as applications, must take into account.

At the other end of the spectrum, machines like SGI's Altix are designed specifically to be interconnected. SGI's NUMAlink interconnect fabric is very fast and has a low latency; both of these are requirements for high-performance computing (HPC), especially when Message Passing Interfaces (MPI) are used.
The drawback is, of course, that such sophistication and specialization is very expensive. They make a reasonably low NUMA factor possible but, with the number of CPUs these machines can have (several thousands) and the limited capacity of the interconnects, the NUMA factor is actually dynamic and can reach unacceptable levels depending on the workload.

More commonly used are solutions where clusters of commodity machines are connected using high-speed networking. But these are not NUMA machines; they do not implement a shared address space and therefore do not fall into any category which is discussed here.

5.2 OS Support for NUMA

To support NUMA machines, the OS has to take the distributed nature of the memory into account. For instance, if a process is run on a given processor, the physical RAM assigned to the process's address space should come from local memory. Otherwise each instruction has to access remote memory for code and data. There are special cases to be taken into account which are only present in NUMA machines. The text segment of DSOs is normally present exactly once in a machine's physical RAM. But if the DSO is used by processes and threads on all CPUs (for instance, the basic runtime libraries like `libc`) this means that all but a few processors have to have remote accesses. The OS ideally would "mirror" such DSOs into each processor's physical RAM and use local copies. This is an optimization, not a requirement, and generally hard to implement.
It might not be supported, or only in a limited fashion.

To avoid making the situation worse, the OS should not migrate a process or thread from one node to another. The OS should already try to avoid migrating processes on normal multi-processor machines because migrating from one processor to another means the cache content is lost. If load distribution requires migrating a process or thread off of a processor, the OS can usually pick an arbitrary new processor which has sufficient capacity left. In NUMA environments the selection of the new processor is a bit more limited. The newly selected processor should not have higher access costs to the memory the process is using than the old processor; this restricts the list of targets. If there is no free processor matching that criterion available, the OS has no choice but to migrate to a processor where memory access is more expensive.

In this situation there are two possible ways forward. First, one can hope the situation is temporary and the process can be migrated back to a better-suited processor. Alternatively, the OS can also migrate the process's memory to physical pages which are closer to the newly-used processor. This is quite an expensive operation. Possibly huge amounts of memory have to be copied, albeit not necessarily in one step. While this is happening the process, at least briefly, has to be stopped so that modifications to the old pages are correctly migrated.
There is a whole list of other requirements for page migration to be efficient and fast. In short, the OS should avoid it unless it is really necessary.

Generally, it cannot be assumed that all processes on a NUMA machine use the same amount of memory such that, with the distribution of processes across the processors, memory usage is also equally distributed. In fact, unless the applications running on the machines are very specific (common in the HPC world, but not outside) the memory use will be very unequal. Some applications will use vast amounts of memory, others hardly any. This will, sooner or later, lead to problems if memory is always allocated local to the processor where the request is originated. The system will eventually run out of memory local to nodes running large processes.

In response to these severe problems, memory is, by default, not allocated exclusively on the local node. To utilize all the system's memory the default strategy is to stripe the memory. This guarantees equal use of all the memory of the system. As a side effect, it becomes possible to freely migrate processes between processors since, on average, the access cost to all the memory used does not change. For small NUMA factors, striping is acceptable but still not optimal (see data in Section 5.4).

This is a pessimization which helps the system avoid severe problems and makes it more predictable under normal operation.
But it does decrease overall system performance, in some situations significantly. This is why Linux allows the memory allocation rules to be selected by each process. A process can select a different strategy for itself and its children. We will introduce the interfaces which can be used for this in Section 6.

5.3 Published Information

The kernel publishes, through the `sys` pseudo file system (sysfs), information about the processor caches below

/sys/devices/system/cpu/cpu*/cache

In Section 6.2.1 we will see interfaces which can be used to query the size of the various caches. What is important here is the topology of the caches. The directories above contain subdirectories (named `index*`) which list information about the various caches the CPU possesses. The files `type`, `level`, and `shared_cpu_map` are the important files in these directories as far as the topology is concerned. For an Intel Core 2 QX6700 the information looks as in Table 5.1.

Table 5.1: sysfs Information for Core 2 CPU Caches

What this data means is as follows:

- Each core has three caches: L1i, L1d, L2.
- The L1d and L1i caches are not shared with anybody; each core has its own set of caches.
This is indicated by the bitmap in `shared_cpu_map` having only one set bit.
- The L2 cache on `cpu0` and `cpu1` is shared, as is the L2 on `cpu2` and `cpu3`. The knowledge that `cpu0` to `cpu3` are cores comes from another place that will be explained shortly.

If the CPU had more cache levels, there would be more `index*` directories.

For a four-socket, dual-core Opteron machine the cache information looks like Table 5.2:

Table 5.2: sysfs Information for Opteron CPU Caches

As can be seen these processors also have three caches: L1i, L1d, L2. None of the cores shares any level of cache. The interesting part for this system is the processor topology. Without this additional information one cannot make sense of the cache data. The `sys` file system exposes this information in the files below

/sys/devices/system/cpu/cpu*/topology

Table 5.3 shows the interesting files in this hierarchy for the SMP Opteron machine.

Table 5.3: sysfs Information for Opteron CPU Topology

Taking Table 5.2 and Table 5.3 together we can see that no CPU has hyper-threads (the `thread_siblings` bitmaps have one bit set), that the system in fact has four processors (`physical_package_id` 0 to 3), that each processor has two cores, and that none of the cores share any cache. This is exactly what corresponds to earlier Opterons.

What is completely missing in the data provided so far is information about the nature of NUMA on this machine. Any SMP Opteron machine is a NUMA machine.
For this data we have to look at yet another part of the `sys` file system which exists on NUMA machines, namely in the hierarchy below

/sys/devices/system/node

This directory contains a subdirectory for every NUMA node on the system. In the node-specific directories there are a number of files. The important files and their content for the Opteron machine described in the previous two tables are shown in Table 5.4.

Table 5.4: sysfs Information for Opteron Nodes

This information ties all the rest together; now we have a complete picture of the architecture of the machine. We already know that the machine has four processors. Each processor constitutes its own node, as can be seen from the bits set in the value of the `cpumap` file in the `node*` directories. The `distance` files in those directories contain a set of values, one for each node, which represent the cost of memory accesses at the respective nodes. In this example all local memory accesses have the cost 10, and all remote accesses to any other node have the cost 20. {This is, by the way, incorrect. The ACPI information is apparently wrong since, although the processors used have three coherent HyperTransport links, at least one processor must be connected to a Southbridge. At least one pair of nodes must therefore have a larger distance.} This means that, even though the processors are organized as a two-dimensional hypercube (see Figure 5.1), accesses between processors which are not directly connected are not more expensive. The relative values of the costs should be usable as an estimate of the actual difference of the access times.
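As a rough illustration of how the `cpumap` and `distance` files described above might be decoded, here is a minimal sketch. The function names are hypothetical, and it assumes `cpumap` holds comma-separated hexadecimal groups of eight digits each (most significant group first) and `distance` holds whitespace-separated integers. The sample values in the usage note are made up to match the description in the text (local cost 10, remote cost 20); they are not copied from Table 5.4.

```python
from pathlib import Path


def parse_cpumask(text: str) -> list[int]:
    """Decode a sysfs CPU bitmap such as '00000000,0000000c' into CPU numbers.

    Assumes every comma-separated group is zero-padded to eight hex digits,
    so simply joining the groups gives the full bitmap value.
    """
    value = int(text.strip().replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def parse_distance(text: str) -> list[int]:
    """Split one node's distance line into a list of per-node access costs."""
    return [int(field) for field in text.split()]


def numa_factor(distances: list[int], node: int) -> float:
    """Ratio of the most expensive remote access to the local access cost."""
    return max(distances) / distances[node]


if __name__ == "__main__":
    # On a machine without this directory the loop simply does nothing.
    node_root = Path("/sys/devices/system/node")
    for node_dir in sorted(node_root.glob("node[0-9]*")):
        cpus = parse_cpumask((node_dir / "cpumap").read_text())
        distances = parse_distance((node_dir / "distance").read_text())
        print(node_dir.name, "cpus:", cpus, "distances:", distances)
```

With the made-up input `parse_cpumask("0000000c")` the result is CPUs 2 and 3, and `numa_factor(parse_distance("10 20 20 20"), 0)` gives 2.0. As the text cautions, such a ratio is only as trustworthy as the firmware tables behind it.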
The accuracy of all this information is another question.

5.4 Remote Access Costs

The distance is relevant, though. In [amdccnuma] AMD documents the NUMA cost of a four socket machine. For write operations the numbers are shown in Figure 5.3.

Figure 5.3: Read/Write Performance with Multiple Nodes

Writes are slower than reads; this is no surprise. The interesting parts are the costs of the 1- and 2-hop cases. The two 1-hop cases actually have slightly different costs. See [amdccnuma] for the details. The fact we need to remember from this chart is that 2-hop reads and writes are 30% and 49% (respectively) slower than 0-hop reads. 2-hop writes are 32% slower than 0-hop writes, and 17% slower than 1-hop writes. The relative position of processor and memory nodes can make a big difference. The next generation of processors from AMD will feature four coherent HyperTransport links per processor. In that case a four socket machine would have a diameter of one. With eight sockets the same problem returns, with a vengeance, since the diameter of a hypercube with eight nodes is three.

All this information is available but it is cumbersome to use.
In Section 6.5 we will see an interface which makes accessing and using this information easier.

The last piece of information the system provides is in the status of a process itself. It is possible to determine how the memory-mapped files, Copy-On-Write (COW) pages and anonymous memory are distributed over the nodes in the system. {Copy-On-Write is a method often used in OS implementations when a memory page has one user at first and then has to be copied to allow independent users. In many situations the copying is unnecessary, at all or at first, in which case it makes sense to only copy when either user modifies the memory. The operating system intercepts the write operation, duplicates the memory page, and then allows the write instruction to proceed.} Each process has a file `/proc/PID/numa_maps`, where PID is the ID of the process, as shown in Figure 5.2.

00400000 default file=/bin/cat mapped=3 N3=3
00504000 default file=/bin/cat anon=1 dirty=1 mapped=2 N3=2
00506000 default heap anon=3 dirty=3 active=0 N3=3
38a9000000 default file=/lib64/ld-2.4.so mapped=22 mapmax=47 N1=22
38a9119000 default file=/lib64/ld-2.4.so anon=1 dirty=1 N3=1
38a911a000 default file=/lib64/ld-2.4.so anon=1 dirty=1 N3=1
38a9200000 default file=/lib64/libc-2.4.so mapped=53 mapmax=52 N1=51 N2=2
38a933f000 default file=/lib64/libc-2.4.so
38a943f000 default file=/lib64/libc-2.4.so anon=1 dirty=1 mapped=3 mapmax=32 N1=2 N3=1
38a9443000 default file=/lib64/libc-2.4.so anon=1 dirty=1 N3=1
38a9444000 default anon=4 dirty=4 active=0 N3=4
2b2bbcdce000 default anon=1 dirty=1 N3=1
2b2bbcde4000 default anon=2 dirty=2 N3=2
2b2bbcde6000 default file=/usr/lib/locale/locale-archive mapped=11 mapmax=8 N0=11
7fffedcc7000 default stack anon=2 dirty=2 N3=2

Figure 5.2: Content of `/proc/PID/numa_maps`

The important information in the file is the values for `N0` to `N3`, which indicate the number of pages allocated for the memory area on nodes 0 to 3. It is a good guess that the program was executed on a core on node 3. The program itself and the dirtied pages are allocated on that node. Read-only mappings, such as the first mapping for `ld-2.4.so` and `libc-2.4.so` as well as the shared file `locale-archive`, are allocated on other nodes.

As we have seen in Figure 5.3, the read performance across nodes falls by 9% and 30% respectively for 1- and 2-hop reads. For execution, such reads are needed and, if the L2 cache is missed, each cache line incurs these additional costs. All the costs measured for large workloads beyond the size of the cache would have to be increased by 9%/30% if the memory is remote to the processor.

Figure 5.4: Operating on Remote Memory

To see the effects in the real world we can measure the bandwidth as in Section 3.5.1 but this time with the memory being on a remote node, one hop away. The result of this test when compared with the data for using local memory can be seen in Figure 5.4. The numbers have a few big spikes in both directions which are the result of a problem of measuring multi-threaded code and can be ignored. The important information in this graph is that read operations are always 20% slower. This is significantly slower than the 9% in Figure 5.3, which is, most likely, not a number for uninterrupted read/write operations and might refer to older processor revisions.
Only AMD knows.

For working set sizes which fit into the caches, the performance of write and copy operations is also 20% slower. For working sets exceeding the size of the caches, the write performance is not measurably slower than the operation on the local node. The speed of the interconnect is fast enough to keep up with the memory. The dominating factor is the time spent waiting on the main memory.

What every programmer should know about memory, part 5: What programmers can do (CSDN blog)
https://blog.csdn.net/u013669912/article/details/161550476

Reference

Memory part 4: NUMA support
https://lwn.net/Articles/254445/