|SM||Compute Unit, CU||One of many parallel vector processors in a GPU, each containing parallel ALUs. All wavefronts in a workgroup are assigned to the same CU.|
|Kernel||Kernel||Function launched on the GPU and executed by many parallel workers. Kernels can run concurrently with code on the CPU.|
|Warp||Wavefront||Collection of operations that execute in lockstep, run the same instructions, and follow the same control-flow path. Individual lanes can be masked off. Think of this as a vector thread. A 64-wide wavefront is a 64-wide vector op.|
|Thread Block||Workgroup||Group of wavefronts that are on the GPU at the same time. Can synchronize together and communicate through local memory.|
|Thread||Work Item / Thread||Individual lane in a wavefront. On AMD GPUs, it must run in lockstep with the other work items in the wavefront. Lanes can be individually masked off. GPU programming models can treat this as a separate thread of execution, though forward progress is not necessarily guaranteed at sub-wavefront granularity.|
|Sub-partition of SM||SIMD||Each SM and each CU contains 4 of these.|
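
To make the wavefront and work-item rows concrete, here is a minimal plain-Python model (it does not run on a GPU). It sketches two things: how a work item's index inside its workgroup splits into a wavefront index and a lane index, and how a divergent branch is handled by issuing both sides to the whole wavefront under an execution mask. The names `WAVE_SIZE`, `locate_work_item`, and `run_wavefront` are made up for this illustration, and the 64-wide wavefront is an assumption taken from the table above (NVIDIA warps are 32 wide).

```python
WAVE_SIZE = 64  # assumed wavefront width for this sketch


def locate_work_item(local_id, wave_size=WAVE_SIZE):
    """Split a work item's index within its workgroup into
    (wavefront index, lane index inside that wavefront)."""
    return divmod(local_id, wave_size)


def run_wavefront(values, wave_size=WAVE_SIZE):
    """Model one wavefront executing
        if x is even: x = x // 2
        else:         x = 3 * x + 1
    in lockstep. Returns (results, mask, number of passes issued)."""
    assert len(values) == wave_size
    out = list(values)

    # Every lane evaluates the predicate; the result is the execution mask.
    mask = [v % 2 == 0 for v in values]
    passes = 0

    # "Then" side: issued once for the whole wavefront if any lane needs it.
    # Lanes whose mask bit is off are masked out and write nothing.
    if any(mask):
        passes += 1
        for lane in range(wave_size):
            if mask[lane]:
                out[lane] = values[lane] // 2

    # "Else" side: issued with the inverted mask if any lane needs it.
    if not all(mask):
        passes += 1
        for lane in range(wave_size):
            if not mask[lane]:
                out[lane] = 3 * values[lane] + 1

    return out, mask, passes


if __name__ == "__main__":
    # Work item 130 of a workgroup sits in wavefront 2, lane 2.
    print(locate_work_item(130))

    # Mixed even/odd inputs diverge, so both sides of the branch are issued.
    _, _, passes = run_wavefront(list(range(WAVE_SIZE)))
    print("divergent passes:", passes)

    # Uniform inputs take one side only, so a single pass is issued.
    _, _, passes = run_wavefront([2] * WAVE_SIZE)
    print("uniform passes:", passes)
```

In this model a divergent wavefront pays for both sides of the branch, while a uniform one pays for only one, which is why the table describes a 64-wide wavefront as a 64-wide vector op rather than 64 independent threads.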
ROCm does not currently support managed memory.
Scalar Unit && Scalar Registers (todo) https://www.youtube.com/watch?v=uu-3aEyesWQ&list=PLx15eYqzJifehAxhWRD6T35GZwAqM9IK4&index=5&t=332s
AMD ROCm Profiler
Similar to NVIDIA's ncu, but it exposes far fewer hardware counters than ncu does. The public counters are:
RDNA white paper: