While reading the moderngpu code I came across the term CTA (Cooperative Thread Array). I didn't know what it meant, so I looked it up:
The PTX Cooperative Thread Array (CTA) is conceptually and functionally the same as a block in CUDA or a workgroup in OpenCL.
The Thread Hierarchy section of the CUDA PTX ISA document explains that, essentially, CTA means a CUDA block. Also note that it's actually not a "Compute Thread Array", but rather a "Cooperative Thread Array" (!).
CTA is just another way of saying thread block; Nvidia calls it a CTA.
From the above, CTA is simply the PTX-level name for a thread block.
CTA = Thread Block
PTX Programming Model
Thread Hierarchy
The batch of threads that executes a kernel is organized as a grid. A grid consists of either cooperative thread arrays or clusters of cooperative thread arrays as described in this section and illustrated in Figure 1 and Figure 2. Cooperative thread arrays (CTAs) implement CUDA thread blocks and clusters implement CUDA thread block clusters.
Figure 1. Grid with CTAs
Figure 2. Grid with clusters
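To make the hierarchy concrete, here is a small host-side model in Python (not GPU code). It is only a sketch that assumes a one-dimensional grid of equally sized CTAs; the names `cta_id`, `tid`, `threads_per_cta`, and `ctas_per_cluster` are illustrative and are not PTX identifiers. The first function mirrors the familiar CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def global_thread_id(cta_id: int, tid: int, threads_per_cta: int) -> int:
    """Linear index of a thread in a 1-D grid of equally sized CTAs."""
    if not 0 <= tid < threads_per_cta:
        raise ValueError("tid must lie inside the CTA")
    return cta_id * threads_per_cta + tid


def cluster_of(cta_id: int, ctas_per_cluster: int) -> tuple:
    """Return (cluster index, rank of the CTA inside its cluster)."""
    return divmod(cta_id, ctas_per_cluster)


if __name__ == "__main__":
    # Thread 5 of CTA 2, with 128 threads per CTA.
    print(global_thread_id(2, 5, 128))
    # CTA 5 in a grid whose clusters hold 4 CTAs each.
    print(cluster_of(5, 4))
```

In real PTX the corresponding values come from special registers such as `%ctaid`, `%ntid`, and `%tid`, which can have up to three dimensions; the model above flattens everything to one dimension for brevity.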
Cooperative Thread Arrays
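The Parallel Thread Execution (PTX) programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array (CTA) is an array of threads that execute a kernel concurrently or in parallel.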
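Threads within a CTA can communicate with each other. To coordinate this communication, one can specify synchronization points where threads wait until all threads in the CTA have arrived.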
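The idea of a synchronization point can be sketched on the CPU with ordinary Python threads. This is an analogy, not GPU code: `run_cta` is a hypothetical helper, the `shared` list stands in for CTA shared memory, and `threading.Barrier` plays the role that `__syncthreads()` plays in CUDA or `bar.sync` in PTX.

```python
import threading


def run_cta(num_threads: int) -> list:
    """Model one CTA: each thread publishes a value, waits, then reads a neighbour."""
    shared = [0] * num_threads                 # stands in for CTA shared memory
    result = [0] * num_threads
    barrier = threading.Barrier(num_threads)   # the synchronization point

    def worker(tid: int) -> None:
        shared[tid] = tid * tid                # phase 1: write own slot
        barrier.wait()                         # wait until every thread has written
        result[tid] = shared[(tid + 1) % num_threads]  # phase 2: read neighbour's slot

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(run_cta(4))
```

Without the barrier, a thread could read its neighbour's slot before the neighbour has written it; with the barrier, every read in phase 2 is guaranteed to see the value written in phase 1.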
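Threads within a CTA execute in SIMT (single-instruction, multiple-thread) fashion in groups called warps. A warp is a maximal subset of threads from a single CTA such that the threads execute the same instructions at the same time. Threads within a warp are numbered sequentially. The warp size is a machine-dependent constant; typically, a warp has 32 threads. Some applications can maximize performance with knowledge of the warp size, so PTX includes a run-time immediate constant, WARP_SZ, which may be used in any instruction where an immediate operand is allowed.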
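The split of a CTA into warps is plain integer arithmetic, sketched below in Python as a host-side model. It assumes a linear (one-dimensional) thread index and a warp size of 32; the function names are illustrative, and the Python constant `WARP_SZ` merely borrows the PTX name rather than reading the value from hardware.

```python
WARP_SZ = 32  # assumed value; on real hardware the warp size is machine-dependent


def warp_and_lane(tid: int, warp_size: int = WARP_SZ) -> tuple:
    """Return (warp index within the CTA, lane index within the warp)."""
    return divmod(tid, warp_size)


def warps_per_cta(threads_per_cta: int, warp_size: int = WARP_SZ) -> int:
    """Number of warps needed to cover the CTA, rounding up."""
    return (threads_per_cta + warp_size - 1) // warp_size


if __name__ == "__main__":
    print(warp_and_lane(70))   # thread 70 of a CTA
    print(warps_per_cta(100))  # a CTA of 100 threads
```

The rounding up in `warps_per_cta` is why CTA sizes are usually chosen as multiples of the warp size: a CTA of 100 threads still occupies four warps, and the last warp is only partly filled.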
References
- https://stackoverflow.com/questions/17649570/ptx-what-is-a-cta
- https://docs.nvidia.com/cuda/parallel-thread-execution/
