CTAs (Cooperative Thread Arrays) in CUDA

While reading the moderngpu source code I came across the term CTA (Cooperative Thread Array). I did not know what it meant, so I looked it up:

The PTX Cooperative Thread Array(CTA) is conceptually and functionally the same as a block in CUDA or a workgroup in OpenCL.

The Thread Hierarchy section of the CUDA PTX ISA document explains that, essentially, CTA means a CUDA block. Also note that it's actually not a "Compute Thread Array", but rather a "Cooperative Thread Array" (!).

CTA is just another way of saying thread block; NVIDIA calls it a CTA.

From the answers above, CTA is simply the PTX-level name for a thread block.

CTA = Thread Block

PTX Programming Model

Thread Hierarchy

The batch of threads that executes a kernel is organized as a grid. A grid consists of either cooperative thread arrays or clusters of cooperative thread arrays as described in this section and illustrated in Figure 1 and Figure 2. Cooperative thread arrays (CTAs) implement CUDA thread blocks and clusters implement CUDA thread block clusters.

Figure 1. Grid with CTAs

Figure 2. Grid with clusters
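To make the grid/CTA hierarchy concrete, here is a minimal host-side C++ sketch of how a thread's position in a one-dimensional grid follows from its CTA index and its index inside the CTA. The names `LaunchShape`, `global_thread_index` and `ctas_needed` are hypothetical helpers made up for illustration; they only mirror the arithmetic usually written in a kernel as `blockIdx.x * blockDim.x + threadIdx.x`, and they ignore clusters and multi-dimensional launches.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a one-dimensional launch (illustration only).
struct LaunchShape {
    std::size_t ctas_per_grid;    // gridDim.x in CUDA, %nctaid.x in PTX
    std::size_t threads_per_cta;  // blockDim.x in CUDA, %ntid.x in PTX
};

// Global linear index of a thread, given its CTA index (blockIdx.x / %ctaid.x)
// and its index inside that CTA (threadIdx.x / %tid.x).
std::size_t global_thread_index(const LaunchShape& shape,
                                std::size_t cta_id,
                                std::size_t tid_in_cta) {
    assert(cta_id < shape.ctas_per_grid);
    assert(tid_in_cta < shape.threads_per_cta);
    return cta_id * shape.threads_per_cta + tid_in_cta;
}

// Number of CTAs needed to give each of n elements its own thread
// (ceiling division, so the last CTA may be only partly used).
std::size_t ctas_needed(std::size_t n, std::size_t threads_per_cta) {
    assert(threads_per_cta > 0);
    return (n + threads_per_cta - 1) / threads_per_cta;
}
```

For example, covering 1000 elements with 256 threads per CTA takes 4 CTAs, and thread 5 of CTA 2 gets global index 2 * 256 + 5 = 517.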

Cooperative Thread Arrays

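The Parallel Thread Execution (PTX) programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array (CTA) is an array of threads that execute a kernel concurrently or in parallel.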

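Threads within a CTA can communicate with each other. To coordinate that communication, one can specify synchronization points at which threads wait until all threads in the CTA have arrived.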
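The role of such a synchronization point can be sketched with a CTA-wide sum. The following is a sequential C++ emulation, not GPU code: `cta_reduce_sum` is a hypothetical helper, the vector stands in for the CTA's shared memory, and the thread count is assumed to be a power of two. On a real GPU the marked spot would hold `__syncthreads()` in CUDA (`bar.sync` in PTX); here, finishing the inner loop plays that role.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential emulation of a tree reduction inside one CTA (illustration only).
// `shared` holds one value per thread; its size must be a power of two.
int cta_reduce_sum(std::vector<int> shared) {
    const std::size_t n = shared.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // One phase: every "thread" with tid < stride adds its partner's value.
        // Reads touch slots >= stride and writes touch slots < stride, so the
        // threads of a single phase do not interfere with each other.
        for (std::size_t tid = 0; tid < stride; ++tid) {
            shared[tid] += shared[tid + stride];
        }
        // Synchronization point: no thread may start the next phase until
        // all threads of the CTA have finished this one.
    }
    return shared[0];
}
```

Without the barrier between phases, a thread could read a partner slot that has not yet been updated, which is why the waiting behaviour described above matters.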

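Threads within a CTA execute in SIMT (single-instruction, multiple-thread) fashion in groups called warps. A warp is a maximal subset of threads from a single CTA such that the threads execute the same instructions at the same time. Threads within a warp are numbered sequentially. The warp size is a machine-dependent constant; typically, a warp has 32 threads. Some applications can maximize performance by knowing the warp size, so PTX includes a run-time immediate constant, WARP_SZ, which may be used in any instruction where an immediate operand is allowed.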
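Because threads in a warp are numbered sequentially, a thread's warp and its position inside that warp follow from simple integer arithmetic. The C++ sketch below assumes a one-dimensional CTA and a warp size of 32; `kWarpSize`, `WarpPosition`, `warp_position` and `warps_per_cta` are hypothetical names chosen for illustration, and real code should query the warp size (`warpSize` in CUDA, `WARP_SZ` in PTX) rather than hard-code it.

```cpp
#include <cassert>
#include <cstddef>

// Assumed warp size for this sketch; the real value is machine-dependent.
constexpr std::size_t kWarpSize = 32;

struct WarpPosition {
    std::size_t warp_id;  // index of the warp within the CTA
    std::size_t lane_id;  // index of the thread within its warp
};

// Map a thread's linear index inside a one-dimensional CTA to (warp, lane).
WarpPosition warp_position(std::size_t tid_in_cta,
                           std::size_t warp_size = kWarpSize) {
    assert(warp_size > 0);
    return {tid_in_cta / warp_size, tid_in_cta % warp_size};
}

// Number of warps a CTA occupies; the last warp may be only partly filled.
std::size_t warps_per_cta(std::size_t threads_per_cta,
                          std::size_t warp_size = kWarpSize) {
    assert(warp_size > 0);
    return (threads_per_cta + warp_size - 1) / warp_size;
}
```

Under these assumptions, thread 70 of a CTA sits in warp 2 at lane 6, and a CTA of 100 threads occupies 4 warps, the last of which has only 4 active threads. This is one reason CTA sizes are usually chosen as multiples of the warp size.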
