# ELE 655 Microprocessor System Design

#### Section 3 – Data Level Parallelism

Class 3 – GPU

- GPU overview
  - Massively parallel HW solution to massively parallel Data problems
  - Originated as co-processors for Graphics applications
  - Converged with other SIMD platforms as the instructions set and hardware became more capable of control operations
  - Cheapest path to huge numbers of parallel processors

- GPU Basics
  - Thread
    - In normal computing terms this is a series of instructions that can be run independently
    - In GPU world sequence of instructions that can be independently executed in a single SIMD lane
    - Each with its own PC and registers
    - E.g. one iteration of the body of a parallelized loop

- GPU Basics
  - SIMD Thread (CUDA Thread)
    - A group of 32 threads
    - Often just referred to as a thread
    - Eg. Thread 14, instruction 12

- GPU Basics
  - Thread Block
    - A group of 16 SIMD Threads
    - Can communicate between threads via local memory

- GPU Basics
  - Grid
    - A group of 16 Thread Blocks

- GPU Basics
  - Example
    - Vector-vector multiply with 8192 elements / vector
    - Fully independent
    - SIMD thread 32 independent elements
    - Thread block 16 SIMD threads
       → 512 elements / thread block
    - 8192/512 = 16 Thread blocks = 1 Grid

- GPU Basics
  - Multi-threaded SIMD processor
    - Multi-lane processor (16, 32 lanes)
    - Load/Store, FPU
    - Operates on 1 Thread Block
    - Large register file
      - 16 SIMD Threads/Block x 32 threads/SIMD Thread x 64 registers thread = 32,768 32-bit registers
    - Local memory

- GPU Basics
  - Multi-threaded SIMD processor 16 lanes



- GPU Basics
  - Operation
    - Grid is created by the compiler
    - Thread Block Scheduler
      - Determines how many thread blocks are needed
      - Assigns thread blocks from the grid to multithreaded SIMD processors
      - Continues until all blocks are assigned
    - Thread Scheduler (in each multithreaded SIMD processor)
      - Warp Scheduler in Nvidia
      - Selects the next thread (instruction) to execute
      - Based on data and resource availability
      - Dynamic Scheduling

- GPU Basics
  - Operation
    - Execution
      - Each multithreaded SIMD processor must
        - Load 32 elements of each of 2 vectors
        - Perform the required operation
        - Store 32 elements of the result

GPU Basics

•



- GPU Basics
  - Operation



- GPU Basics
  - Implementation
  - Nvidia Fermi GTX 480
    - 16 SIMD processors
    - 16 SIMD Lanes
    - Local and Global Memory
    - Thread Block Scheduler
      - GigaThread



#### GPU ISA

- Instruction set is an abstraction instead of direct implementation
  - PTX Parallel Thread Execution
  - Simplifies compiler compatibility across HW variations
  - Usually map 1-1 with the HW instructions
  - Can represent multiple HW instructions in 1 PTX instruction
  - Uses virtual registers
    - Compiler assigns resources
  - Format
    - Opcode.type d,a,b,d
    - 8/16/32/64 bit operands
    - 1 bit predicate registers

| Basic Type       | Fundamental Type Specifiers |
|------------------|-----------------------------|
| Signed integer   | .s8, .s16, .s32, .s64       |
| Unsigned integer | .u8, .u16, .u32, .u64       |
| Floating-point   | .f16, .f32, .f64            |
| Bits (untyped)   | .b8, .b16, .b32, .b64       |
| Predicate        | .pred                       |

#### • GPU ISA

| Group           | Instruction                                                                                                                                                      | Example                          | Meaning                              | Comments                   |  |  |
|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------|--------------------------------------|----------------------------|--|--|
|                 | arithmetic .type = .                                                                                                                                             | .s32, .u32, .f32, .s64, .u64, .f | 64                                   |                            |  |  |
|                 | add.type                                                                                                                                                         | add.f32 d, a, b                  | d = a + b;                           |                            |  |  |
|                 | sub.type                                                                                                                                                         | sub.f32 d, a, b                  | d = a - b;                           |                            |  |  |
|                 | mul.type                                                                                                                                                         | mul.f32 d, a, b                  | d = a * b;                           |                            |  |  |
|                 | mad.type                                                                                                                                                         | mad.f32 d, a, b, c               | d = a * b + c;                       | multiply-add               |  |  |
|                 | div.type                                                                                                                                                         | div.f32 d, a, b                  | d = a / b;                           | multiple microinstructions |  |  |
|                 | rem.type                                                                                                                                                         | rem.u32 d, a, b                  | d = a % b;                           | integer remainder          |  |  |
| Arithmetic      | abs.type                                                                                                                                                         | abs.f32 d, a                     | d =  a ;                             |                            |  |  |
| runnieue        | neg.type                                                                                                                                                         | neg.f32 d, a                     | d = 0 - a;                           |                            |  |  |
|                 | min.type                                                                                                                                                         | min.f32 d, a, b                  | d = (a < b)? a:b;                    | floating selects non-NaN   |  |  |
|                 | max.type                                                                                                                                                         | max.f32 d, a, b                  | d = (a > b)? a:b;                    | floating selects non-NaN   |  |  |
|                 | setp.cmp.type                                                                                                                                                    | setp.lt.f32 p, a, b              | p = (a < b);                         | compare and set predicate  |  |  |
|                 | numeríc .cmp = eq, r                                                                                                                                             | ne, lt, le, gt, ge; unordered cm | np = equ, neu, ltu, leu,             | gtu, geu, num, nan         |  |  |
|                 | mov.type                                                                                                                                                         | mov.b32 d, a                     | d = a;                               | move                       |  |  |
|                 | selp.type                                                                                                                                                        | selp.f32 d, a, b, p              | d = p? a: b;                         | select with predicate      |  |  |
|                 | cvt.dtype.atype                                                                                                                                                  | cvt.f32.s32 d, a                 | d = convert(a);                      | convert atype to dtype     |  |  |
|                 | special .type = .f3                                                                                                                                              | 2 (some .f64)                    |                                      |                            |  |  |
|                 | rcp.type                                                                                                                                                         | rcp.f32 d, a                     | d = 1/a;                             | reciprocal                 |  |  |
|                 | sqrt.type                                                                                                                                                        | sqrt.f32 d, a                    | d = sqrt(a);                         | square root                |  |  |
| Special         | rsqrt.type                                                                                                                                                       | rsqrt.f32 d, a                   | d = 1/sqrt(a);                       | reciprocal square root     |  |  |
| Function        | sin.type                                                                                                                                                         | sin.f32 d, a                     | d = sin(a);                          | sine                       |  |  |
|                 | cos.type                                                                                                                                                         | cos.f32 d, a                     | d = cos(a);                          | cosine                     |  |  |
|                 | lg2.type                                                                                                                                                         | 1g2.f32 d, a                     | d = log(a)/log(2)                    | binary logarithm           |  |  |
|                 | ex2.type                                                                                                                                                         | ex2.f32 d, a                     | d = 2 ** a;                          | binary exponential         |  |  |
|                 | logic.type = .pred,.b32, .b64                                                                                                                                    |                                  |                                      |                            |  |  |
|                 | and.type                                                                                                                                                         | and.b32 d, a, b                  | d = a & b;                           |                            |  |  |
|                 | or.type                                                                                                                                                          | or.b32 d, a, b                   | d = a   b;                           |                            |  |  |
| Logical         | xor.type                                                                                                                                                         | xor.b32 d, a, b                  | d = a ^ b;                           |                            |  |  |
|                 | not.type                                                                                                                                                         | not.b32 d, a, b                  | d = ~a;                              | one's complement           |  |  |
|                 | cnot.type                                                                                                                                                        | cnot.b32 d, a, b                 | d = (a==0)? 1:0;                     | C logical not              |  |  |
|                 | shl.type                                                                                                                                                         | shl.b32 d, a, b                  | d = a << b;                          | shift left                 |  |  |
|                 | shr.type                                                                                                                                                         | shr.s32 d, a, b                  | d = a >> b;                          | shift right                |  |  |
|                 | memory.space = .global, .shared, .local, .const; .type = .b8, .u8, .s8, .b16, .b32, .b64                                                                         |                                  |                                      |                            |  |  |
|                 | ld.space.type                                                                                                                                                    | ld.global.b32 d, [a+off]         | d = *(a+off);                        | load from memory space     |  |  |
| Memory          | st.space.type                                                                                                                                                    | st.shared.b32 [d+off], a         | *(d+off) = a;                        | store to memory space      |  |  |
| Access          | tex.nd.dtyp.btype                                                                                                                                                | tex.2d.v4.f32.f32 d, a, b        | d = tex2d(a, b);                     | texture lookup             |  |  |
|                 | atom one on tuna                                                                                                                                                 | atom.global.add.u32 d,[a], b     | atomic { $d = *a; *a = con(*a, b)$ . | ,                          |  |  |
|                 | <pre>atom.spc.op.type atom.global.cas.b32 d,[a], b, cop(*a, b); } operation atom.op = and, or, xor, add, min, max, exch, cas; .spc = .global; .type = .b32</pre> |                                  |                                      |                            |  |  |
| Control<br>Flow | branch                                                                                                                                                           | 0p bra target                    | if (p) goto target;                  | conditional branch         |  |  |
|                 | call                                                                                                                                                             | call (ret), func, (params)       | ret = func(params);                  | call function              |  |  |
|                 | ret                                                                                                                                                              | ret                              | return;                              | return from function call  |  |  |
|                 |                                                                                                                                                                  |                                  | wait for threads                     | barrier synchronization    |  |  |
|                 | bar.sync<br>exit                                                                                                                                                 | bar.sync d<br>exit               | exit:                                | terminate thread execution |  |  |
|                 | CALL                                                                                                                                                             | CAIL                             | exiti                                | terminate uncau execution  |  |  |

- GPU ISA
  - CUDA example DAXPY
    - Format for a function call
      - Name <<< dimGrid, dimBlock>>> (... param list...)
    - Identifiers
      - Each block has an identifier blockldx
      - Each thread has an identifier inside the block threadIdx
      - BlockDim = dimBlock

#### • GPU ISA

• PTX example DAXPY – 1 thread

R8 now points to my thread ID in memory relative to the base address

shl.u32 R8, blockIdx, 9
add.u32 R8, R8, threadIdx
shl.u32 R8, R8, threadIdx
shl.u32 R8, R8, 3
id.global.f64 RD0, [X+R8]; RD0 = X[i]
id.global.f64 RD2, [Y+R8]; RD2 = Y[i]
mul.f64 RD0, RD0, RD4
add.f64 RD0, RD0, RD2
; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0; Y[i] = sum (X[i]\*a + Y[i])

- GPU ISA
  - Conditional Branching
    - Each Lane as its own mask bit to determine whether to execute the current instruction or not
    - Each Thread has its own stack to keep track of branch return addresses
    - Assembler optimizes
      - Use branch instructions for complex situations
        - Branch diverges when only some lanes branch
      - Use predicates for simple situations
        - Only lanes with predicate=1 execute

- GPU ISA
  - Conditional Branching
    - Regardless of how the branch is handled all lanes stay synchronized to the same instructions
    - This leads to potential inefficiency when few lanes are actually executing

#### GPU ISA

Conditional Branching

if (X[i] != 0)X[i] = X[i] - Y[i]else X[i] = Z[i]

ld.global.f64 RDO, [X+R8] setp.neq.s32 P1, RDO, #0 @!P1, bra ELSE1, \*Push

ld.global.f64 RD2, [Y+R8] sub.f64 RDO, RDO, RD2 st.global.f64 [X+R8], RD0 @P1, bra ENDIF1, \*Comp

ELSE1: ld.global.f64 RDO, [Z+R8] ; RDO = Z[i] st.global.f64 [X+R8], RD0 ENDIF1: <next instruction>, \*Pop

; RDO = X[i]; P1 is predicate register 1 ; Push old mask, set new mask bits ; if P1 false, go to ELSE1 ; RD2 = Y[i]; Difference in RDO ; X[i] = RDO; complement mask bits ; if P1 true, go to ENDIF1 ; X[i] = RDO; pop to restore old mask

- GPU Memory
  - Each thread (lane) has a private memory block
  - Private Memory
    - Off chip
    - Not shared with anyone
    - Holds stack, private variables, ...
    - Can be cached to speed up access
  - Each Multithreaded SIMD processor has local memory
  - Local Memory
    - On chip
    - Shared by SIMD lanes
    - Not shared across multithreaded SIMD processors
    - Allocates portions of Local Memory to each thread block
      - Private to a thread block

- GPU Memory
  - External memory available to the whole GPU
  - GPU memory
    - Available to the Host for R/W
    - Pipelined accesses by the GPU
    - Latency hidden by multithreading
  - Special memory hardware
    - Coalesce memory accesses from individual threads in a SIMD thread into a single pipelined access
    - Hold some requests to group requests to the sar



GPU Memory



- FERMI multithreaded SIMD processor core
  - Dual scheduler dual dispatch
  - 2 sets of 16 Lanes
  - 16 L/S units
  - 4 Special Function Units (SFU)
  - Looks a little like a superscalar with 2 ALU, 1 L/S and 1 SFU
    - But here we are executing 32 threads in each of 2 of the execution units in 2 clock cycles \*\*

#### FERMI – multithreaded SIMD processor core

| SIMD thread scheduler                                        | SIMD thread scheduler                                        |  |
|--------------------------------------------------------------|--------------------------------------------------------------|--|
| Instruction dispatch unit                                    | Instruction dispatch unit                                    |  |
|                                                              |                                                              |  |
| SIMD thread 8 instruction 11<br>SIMD thread 2 instruction 42 | SIMD thread 9 instruction 11<br>SIMD thread 3 instruction 33 |  |
| SIMD thread 14 instruction 95                                | SIMD thread 15 instruction 95                                |  |
|                                                              |                                                              |  |
| SIMD thread 8 instruction 12                                 | SIMD thread 9 instruction 12                                 |  |
| SIMD thread 14 instruction 96                                | SIMD thread 3 instruction 34                                 |  |
| SIMD thread 2 instruction 43                                 | SIMD thread 15 instruction 96                                |  |

Time

#### • FERMI – multithreaded SIMD processor core

| -                     |   | Instruction      | on cache     | 9                     |     |  |
|-----------------------|---|------------------|--------------|-----------------------|-----|--|
| SIMD Thread Scheduler |   |                  | SIM          | SIMD Thread Scheduler |     |  |
| Dispatch unit         |   | Dispatch unit    |              |                       |     |  |
| ÷                     |   |                  | <u> </u>     | +                     |     |  |
|                       | F | Register file (S | 32,768×      | 32-bit)               |     |  |
|                       | + |                  |              | +                     |     |  |
| SIM<br>Lan            |   |                  | SIMD<br>Lane | LD/ST<br>LD/ST        | SFU |  |
| SIM                   |   |                  | SIMD<br>Lane | LD/ST                 | 5-0 |  |
| SIM                   |   |                  | SIMD<br>Lane | LD/ST                 |     |  |
| SIM                   |   |                  | SIMD<br>Lane | LD/ST                 | SFU |  |
| SIM                   |   |                  | SIMD<br>Lane | LD/ST                 |     |  |
| SIM                   |   |                  | SIMD<br>Lane | LD/ST<br>LD/ST        | SFU |  |
| SIM<br>Lan            |   |                  | SIMD<br>Lane | LD/ST                 | SFU |  |
| SIM                   |   |                  | SIMD<br>Lane | LD/ST<br>LD/ST        | 5-0 |  |

27

- FERMI multithreaded SIMD processor core
  - Fast Double-Precision Floating Point
    - 2x single precision
  - Caches for GPU memory
    - L1 Data and Memory
      - Memory array is shared with Local Memory
      - Split is programmable 16KB/48KB or 48KB/16KB
    - L2 Unified
      - 768KB
  - 64 bit addressing
    - All memories

FP unit

• FERMI – multithreaded SIMD processor core



#### GPU Vs. Vector Processor



30