# ELE 455/555 Computer System Engineering

# Section 4 – Parallel Processing Class 3 – GPUs

GPU Video

- Graphics Processing Unit (GPU)
  - Optimized processor for computing 2D and 3D graphics objects
    - 2D/3D graphics
    - Images
    - Video
  - Used in
    - Window-based operating systems
    - Graphical user interfaces
    - Video games
  - Highly parallel, highly multithreaded multiprocessor

Graphics Pipeline



- Graphics Pipeline
  - Vertex
    - Location (point) in 3D space on the graphics object
    - Includes: location, color, texture, motion, ... information
  - Vertex Shader
    - Operations performed on each vertex
      - Transform 3D location to 2D screen location
      - Includes "Z" processing to emulate depth on the screen
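The vertex shader's 3D-to-2D transform can be sketched in plain C as a perspective divide. This is a minimal illustration rather than the pipeline's actual math: the `Vec3` and `ScreenVertex` types, the `project_vertex` name, and the `focal` parameter are all assumptions made for the example, and the vertex is assumed to already be in camera space with z > 0.

```c
#include <assert.h>

/* Hypothetical types for illustration; they do not come from the slides. */
typedef struct { float x, y, z; } Vec3;
typedef struct { float sx, sy, depth; } ScreenVertex;

/* Project one camera-space vertex onto a width x height pixel screen.
 * focal is an assumed focal length in pixels. */
ScreenVertex project_vertex(Vec3 v, float focal, int width, int height)
{
    assert(v.z > 0.0f);                          /* vertex must be in front of the camera */
    ScreenVertex out;
    out.sx = width  * 0.5f + focal * v.x / v.z;  /* perspective divide by depth */
    out.sy = height * 0.5f - focal * v.y / v.z;  /* screen y grows downward */
    out.depth = v.z;                             /* kept so later stages can do "Z" processing */
    return out;
}
```

Because the same function runs on every vertex with no dependence between vertices, this kind of work maps naturally onto many parallel threads.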

- Graphics Pipeline
  - Geometry
    - Point, line or triangle created from the vertices
    - Includes: location, color, texture, motion, ... information
  - Geometry Shader
    - Operations performed on each geometry
      - Creation of the geometry
      - Combine or divide geometries
      - Add or delete geometries based on detail requirements (zoom)

- Graphics Pipeline
  - Pixel
    - Smallest render unit on the screen
  - Pixel Shader
    - Operations performed on each pixel
      - Color
      - Texture mapping
      - Lighting
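As one possible illustration of per-pixel lighting, the sketch below scales a surface color by a simple diffuse (Lambertian) term. The slides do not name a lighting model, so the model itself, the `Rgb` and `Dir3` types, and the `shade_pixel` name are assumptions; `normal` and `to_light` are assumed to be unit vectors.

```c
/* Hypothetical types for illustration; they do not come from the slides. */
typedef struct { float r, g, b; } Rgb;
typedef struct { float x, y, z; } Dir3;

static float dot3(Dir3 a, Dir3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Diffuse shading for one pixel: brightness follows the cosine of the
 * angle between the surface normal and the direction to the light. */
Rgb shade_pixel(Rgb surface, Dir3 normal, Dir3 to_light)
{
    float intensity = dot3(normal, to_light);
    if (intensity < 0.0f)
        intensity = 0.0f;                 /* surface faces away from the light */
    Rgb out = { surface.r * intensity,
                surface.g * intensity,
                surface.b * intensity };
    return out;
}
```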


### History

- 1990s Video Graphics Array controller (VGA)
  - Memory controller used to paint to the screen
- 2000 Integration allowed most of the processing to happen in the GPU using fixed hardware
  - Triangle setup, rasterization, texture mapping
- Fixed hardware was replaced with programmable hardware
- Programmable hardware was consolidated into a multithreaded multiprocessor architecture
- 2010 Additional capability added to support general computing operations

ELE 455/555 – Spring 2016

- Application Programming Interface (API)
  - Allow programmers to write to the API and not be concerned about the underlying hardware
  - Allow the underlying hardware to progress at a rapid pace
  - OpenGL
    - Open standard
    - Broadly available
  - DirectX
    - Microsoft APIs

- Heterogeneous system
  - GPUs used as co-processors for the main CPU
  - Early implementation



- Heterogeneous system
  - GPUs used as co-processors for the main CPU
  - Current implementation



- Heterogeneous system
  - GPUs used as co-processors for the main CPU
  - Current wireless implementation

[Figure: OMAP5432 block diagram. Recoverable blocks: two ARM Cortex-A15 MPCore CPUs (up to 2 GHz), two ARM Cortex-M4 cores, L2 cache, POWERVR SGX544-MPx 3D graphics, TI 2D graphics, image signal processor, audio processor, system DMA, L3 network-on-chip interconnect, boot/secure ROM and L3 RAM, M-Shield system security technology (SHA-1/SHA-2/MD5, DES/3DES, RNG, AES, PKA, secure WDT, keys, crypto DMA), multi-pipe display sub-system (DSS)]

#### **OMAP5432**

Unified GPU Architecture



- Unified GPU Architecture
  - Built from a parallel array of unified processors
  - Tightly coupled with fixed function processors
    - rasterization, compression, video decoding, ...
  - Focus is on executing large numbers of parallel threads on large numbers of cores
  - Utilize multithreading to hide memory latency instead of using multi-level caches



- Unified GPU Architecture
  - Streaming Processor Core
    - Pipelined
    - Superscalar
    - Highly Multithreaded
      - 96 Concurrent Threads
      - Hardware managed
    - 1024 32-bit registers



#### Nvidia Tesla

- Unified GPU Architecture
  - Streaming Multiprocessor
    - 8 Streaming Processor Cores (SP)
    - 2 Special Function Units (SFU)
      - Transcendentals (sin, cos, log, exp, ...)
    - Instruction Cache
    - Constant Cache
    - Multithreaded Issue unit
    - Shared memory
  - Texture Processor Cluster
    - 2 Streaming Multiprocessors
    - Controller
    - Texture Unit
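The per-core figures above imply the thread capacity of the larger units. As a rough worked calculation, assuming every SP runs its full 96 threads and that an SP's registers are divided evenly among them (an assumption; the slides do not say how registers are allocated):

$$
\begin{aligned}
\text{threads per SM} &= 8 \times 96 = 768 \\
\text{threads per TPC} &= 2 \times 768 = 1536 \\
\text{registers per thread} &\approx 1024 / 96 \approx 10.7
\end{aligned}
$$

Under that assumption each thread gets about 10 whole registers, which is why anything larger spills to per-thread local memory.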

#### Nvidia Tesla





### Unified GPU Architecture




- Programming CUDA
  - C-like code is written in serial fashion
  - Code calls parallel kernels
    - Kernels are parallelizable functions, blocks, or programs
  - Kernels execute across parallel processors as a set of threads
  - Threads are organized into Thread Blocks
    - Sets of concurrent threads that can work together
      - Through barrier synchronization or memory shared within the block
  - Independent thread blocks are grouped together as a Grid
    - Can be executed in parallel

- Programming CUDA
  - 3 Abstractions
    - Thread Groups
    - Shared Memories
    - Barrier Synchronization
  - Kernels
    - Functions or entire programs whose operations can be done in parallel
    - The kernel launch specifies the # of blocks in a grid and the # of threads/block
  - Blocks ← blockIdx
  - Threads ← threadIdx
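The block and thread indices above combine into one global element index. The sketch below shows the usual arithmetic for a one-dimensional grid in plain C; the helper names `blocks_for` and `global_index` are hypothetical, and in real CUDA the inputs would come from the built-in variables `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```c
/* Number of blocks needed so that blocks * threads_per_block >= n
 * (integer ceiling division). */
int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Global element index handled by one thread of a 1-D grid. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}
```

For 1000 elements and 256 threads per block this gives 4 blocks, so the last block has 24 threads with no element to process; kernels usually guard against that with an `i < n` check.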



Programming – CUDA

Computing y = ax + y with a serial loop:

```c
void saxpy_serial(int n, float alpha, float *x, float *y)
{
    /* One iteration per element; iterations are independent of each other */
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0f, x, y);
```
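For comparison, a parallel version assigns one thread to each element. The sketch below emulates that idea in plain C so it runs without a GPU: the two loops in `saxpy_emulated` only stand in for the hardware that would run the threads concurrently. The names `saxpy_thread` and `saxpy_emulated` are hypothetical. In actual CUDA the per-thread body would be a `__global__` function that reads `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, launched with the `<<<nblocks, threads_per_block>>>` syntax.

```c
/* Work done by ONE thread: compute a single element of y. */
static void saxpy_thread(int n, float alpha, const float *x, float *y,
                         int block_idx, int block_dim, int thread_idx)
{
    int i = block_idx * block_dim + thread_idx;  /* this thread's element */
    if (i < n)                                   /* last block may have spare threads */
        y[i] = alpha * x[i] + y[i];
}

/* Serial emulation of launching a 1-D grid of thread blocks. */
void saxpy_emulated(int n, float alpha, const float *x, float *y,
                    int threads_per_block)
{
    int nblocks = (n + threads_per_block - 1) / threads_per_block;
    for (int b = 0; b < nblocks; ++b)                /* blocks: independent of each other */
        for (int t = 0; t < threads_per_block; ++t)  /* threads within a block */
            saxpy_thread(n, alpha, x, y, b, threads_per_block, t);
}
```

Note that the explicit `for` loop over `i` from the serial version is gone from the per-thread body; the grid of threads replaces it.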



- All parallelization is handled by the processor
  - Thread management is handled by hardware

- Programming CUDA
  - Synchronization
    - A synchronization barrier can be created with `__syncthreads()`
    - No thread can pass the barrier until all threads reach the barrier
    - Applies to threads within a block
  - Thread blocks cannot be directly synchronized
    - Blocks must be able to operate independently
    - Can coordinate by using atomic memory operations
    - Can synchronize grids (sequentially dependent kernels run one after another)
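To see why a block-level barrier matters, consider a tree-style sum inside one thread block. The plain C sketch below emulates it serially: each pass of the inner loop represents all threads of the block doing one step, and the end of that pass plays the role that `__syncthreads()` would play on a GPU. The function name `block_sum` and the block size of 8 are assumptions for illustration.

```c
enum { BLOCK_DIM = 8 };  /* assumed threads per block; must be a power of two */

/* Sum BLOCK_DIM values the way a thread block might, in log2(BLOCK_DIM) steps. */
float block_sum(const float *in)
{
    float shared[BLOCK_DIM];  /* stands in for the block's shared memory */

    /* Step 0: every thread t copies one element into shared memory. */
    for (int t = 0; t < BLOCK_DIM; ++t)
        shared[t] = in[t];
    /* barrier: all loads must finish before any thread reads a neighbor */

    for (int stride = BLOCK_DIM / 2; stride > 0; stride /= 2) {
        /* Threads 0..stride-1 each add in the element stride slots away. */
        for (int t = 0; t < stride; ++t)
            shared[t] += shared[t + stride];
        /* barrier: the next step reads values written in this one */
    }
    return shared[0];
}
```

Without the barrier between steps, a fast thread could read a partial sum that a slower thread has not written yet.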

- GPU Memory Considerations
  - Each thread has its own context
    - PC, registers
  - Each thread has its own private local memory
    - For anything that does not fit in its registers, including the stack
  - Each thread block has a shared memory
    - Visible to all threads in the block
    - Exists as long as the block exists
    - On-chip RAM
  - All threads have access to global memory
    - Grids pass data via global memory
    - DRAM
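One way to picture grids passing data through global memory is a two-pass sum: a first grid writes one partial result per block, and a second grid, run afterward, combines them. The plain C sketch below emulates both passes serially; the function names and the `partial` array (standing in for a global-memory buffer) are hypothetical.

```c
/* Grid 1: each block sums its own slice of in[] and writes ONE value to
 * partial[], which stands in for an array in global memory (DRAM).
 * partial must have room for ceil(n / block_dim) entries. */
void partial_sums_grid(const float *in, int n, int block_dim, float *partial)
{
    int nblocks = (n + block_dim - 1) / block_dim;
    for (int b = 0; b < nblocks; ++b) {       /* blocks never talk to each other */
        float acc = 0.0f;                     /* stands in for on-chip per-block state */
        for (int t = 0; t < block_dim; ++t) {
            int i = b * block_dim + t;
            if (i < n)
                acc += in[i];
        }
        partial[b] = acc;                     /* only the result goes to global memory */
    }
}

/* Grid 2: runs after grid 1 has finished and reads its results back. */
float final_sum_grid(const float *partial, int nblocks)
{
    float total = 0.0f;
    for (int b = 0; b < nblocks; ++b)
        total += partial[b];
    return total;
}
```

The per-block accumulator disappears when its block ends, matching the lifetime of shared memory, while `partial` persists between the two grids.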

GPU Memory Considerations



- Using the GPU as a SIMT Multi-processor
  - Architecture already supports multiple kernels (programs) running via multiple threads
  - Define a Warp to be a set of threads executing the same instruction
  - Tesla
    - 32 threads/warp
    - Hardware managed
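With 32 threads per warp, the warp a thread belongs to follows from its index within the block. The plain C helpers below sketch that arithmetic, assuming consecutive thread indices are grouped into the same warp; the helper names are hypothetical.

```c
enum { WARP_SIZE = 32 };  /* threads per warp, as on Tesla */

/* Which warp of its block a thread falls into. */
int warp_of(int thread_idx)
{
    return thread_idx / WARP_SIZE;
}

/* The thread's position (lane) inside its warp, 0..31. */
int lane_of(int thread_idx)
{
    return thread_idx % WARP_SIZE;
}

/* Warps needed to hold a block; the last warp may be only partly full. */
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}
```

A block of 100 threads therefore occupies 4 warps, with the last warp using only 4 of its 32 lanes, which is one reason block sizes are often chosen as multiples of 32.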



- Performance
  - Matrix multiplication
  - GPU: 1.35GHz
  - CPU: 2.4GHz



- Performance
  - Matrix factorization
  - GPU: 1.35GHz
  - CPU: 2.4GHz



- Nvidia Kepler
  - 192 cores per SMX

[Figure: Kepler SMX block diagram. Recoverable blocks: instruction cache, warp schedulers feeding 8 dispatch units, register file (65,536 x 32-bit), rows of cores, DP units, LD/ST units, and SFUs, interconnect network, 64 KB shared memory / L1 cache, 48 KB read-only data cache, 16 texture (Tex) units]

Huang Video