# ELE 655 Microprocessor System Design

#### Section 3 – Data Level Parallelism

Class 1 – VLIW

- SIMD overview
  - Leverage large parallel data problems
    - Matrix oriented scientific computing
    - Image and audio processing
  - Can be more energy efficient
    - Fewer instructions
    - Simpler logic than dynamic scheduling with speculation, ...

- SIMD overview
  - 3 variations
    - Vector processors
    - SIMD extensions to MIMD or SISD processors
    - Graphics Processor Units (GPUs)

SIMD overview



MIMD expansion: 2 more cores per chip every 2 years

SIMD expansion: 2x operations every 4 years

- Vector Architecture
  - Read sets of data into "vector" registers
  - Operate on registers
  - Store results back to memory

- Vector Architecture
  - Memory accesses
    - Memory naturally provides data serially
    - Vector registers read/write data serially then operate in parallel

- Vector Architecture
  - Instruction Set
  - Fig 4.3

- Vector Architecture
  - Instruction Example

| Instruction | Operands | Comment                  |
|-------------|----------|--------------------------|
| L.D         | F0,a     | ; load scalar a          |
| LV          | V1,Rx    | ; load vector X          |
| MULVS.D     | V2,V1,F0 | ; vector-scalar multiply |
| LV          | V3,Ry    | ; load vector Y          |
| ADDVV.D     | V4,V2,V3 | ; add                    |
| SV          | Ry,V4    | ; store the result       |

- 6 instructions vs approximately 600 in MIPS for 64 iterations
  - Even worse if we unroll loops
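
For comparison, the six-instruction vector sequence computes Y = a × X + Y. A minimal scalar sketch of the same computation in C follows; the function name `daxpy` and its signature are illustrative and not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the vector sequence: Y = a * X + Y.
 * The slides use n = 64; any length works here. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        /* MULVS.D followed by ADDVV.D, one element at a time */
        y[i] = a * x[i] + y[i];
    }
}
```

Each loop iteration costs a scalar processor several instructions (loads, multiply, add, store, index update, branch), which is where the instruction count for 64 iterations comes from.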

- Vector Architecture
  - Execution time
    - Length of operands
    - Structural hazards between operations
    - Data dependencies between operations
  - VMIPS
    - Execution units take 1 element / clock cycle
    - Pipelining $\rightarrow$ execution time = fill + vector length $\cong$ vector length

- Vector Architecture
  - Convoy
    - Vector instructions that could potentially execute together
    - Limited by structural hazards
    - Limited by issue width
    - We add the restriction one convoy must finish before the next starts
  - Chaining
    - Within a convoy:
      - A vector operation starts as soon as the first element (or any dependent element) of its source operand is available
      - Solves any RAW dependencies in a convoy

- Vector Architecture
  - Chime
    - Unit of time to execute one convoy
    - Simplified to ignore fill, chaining delays, and issue delays
    - A sequence of m convoys on vectors of length n takes m chimes
      - Requires approximately $m \times n$ clock cycles
  - Execution time = # of convoys $\times$ chime

Vector Architecture

| Instruction | Operands | Comment                 |
|-------------|----------|-------------------------|
| LV          | V1,Rx    | ;load vector X          |
| MULVS.D     | V2,V1,F0 | ;vector-scalar multiply |
| LV          | V3,Ry    | ;load vector Y          |
| ADDVV.D     | V4,V2,V3 | ;add two vectors        |
| SV          | Ry,V4    | ;store the sum          |

Convoys:

| Convoy | Instruction 1 | Instruction 2 |
|--------|---------------|---------------|
| 1      | LV            | MULVS.D       |
| 2      | LV            | ADDVV.D       |
| 3      | SV            |               |

Convoy 1: chaining allows the V1 RAW dependency (LV $\rightarrow$ MULVS.D)

Convoy 2: chaining allows the V3 RAW dependency (LV $\rightarrow$ ADDVV.D)

For 64 element vectors, requires  $64 \times 3 = 192$  clock cycles
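
The chime estimate is a single multiplication. A small C sketch, with a hypothetical helper name, makes the arithmetic explicit; it ignores start-up and issue overhead, as the chime model does.

```c
#include <assert.h>

/* Chime approximation: each convoy takes one chime, and one chime is
 * roughly vector_length clock cycles (fill and issue overhead ignored). */
unsigned long chime_cycles(unsigned long vector_length, unsigned long convoys)
{
    assert(vector_length > 0);
    return vector_length * convoys;
}
```

Calling `chime_cycles(64, 3)` gives the 192 clock cycles quoted for the three-convoy example.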

- Vector Architecture
  - Overhead not included in chime measurements
  - Issue width
    - Single issue requires an extra clock for a second instruction, ...
  - Start up time
    - Pipeline fill
    - 6 clks FP add, 7 clks FP mult, 20 clks FP div, 12 for vector load
  - Chaining delay
    - Directly tied to start up time
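
To see how much start-up adds to a single vector operation, the sketch below applies the rule "execution time = fill + vector length" to the latencies listed above. The enum and function names are illustrative; the model covers one isolated, fully pipelined operation and does not attempt to model chained convoys.

```c
#include <assert.h>

/* Start-up latencies quoted in the slides, in clock cycles. */
enum {
    STARTUP_FP_ADD   = 6,
    STARTUP_FP_MUL   = 7,
    STARTUP_FP_DIV   = 20,
    STARTUP_VEC_LOAD = 12
};

/* One pipelined vector operation: fill the pipeline, then retire one
 * element per clock. */
unsigned vector_op_cycles(unsigned startup, unsigned vector_length)
{
    assert(vector_length > 0);
    return startup + vector_length;
}
```

For a 64-element vector, an FP add takes 6 + 64 = 70 clocks and a vector load takes 12 + 64 = 76 clocks under this model, so the overhead is roughly 10 to 20 percent at this vector length.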

- Vector Architecture Enhancements
  - Multiple Lanes



ELE 655 - Fall 2015

- Vector Architecture Enhancements
  - Multiple Lanes
  - No inter-lane communication
  - Optimize
    - Performance vs. cost
    - Clock speed vs. complexity
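
A rough way to quantify the benefit of lanes: if the elements of a vector are spread evenly across the lanes, one chime shrinks from n clocks to about n / lanes clocks. The C sketch below assumes exactly that and ignores start-up; the function name is hypothetical.

```c
#include <assert.h>

/* Clock cycles for one chime when vector elements are interleaved across
 * `lanes` parallel lanes. Rounds up when the length is not a multiple of
 * the lane count. */
unsigned chime_cycles_with_lanes(unsigned vector_length, unsigned lanes)
{
    assert(lanes > 0);
    return (vector_length + lanes - 1) / lanes;
}
```

With 64-element vectors, going from 1 lane to 4 lanes reduces a chime from 64 clocks to 16 clocks under these assumptions.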



- Vector Architecture Enhancements
  - Vector Length Register
  - Real code does not have data in nice 64 word segments
  - Real code may not even know the vector length until run time
  - VLR indicates the vector length

- Vector Architecture Enhancements
  - Strip Mining
    - Use VLR to manage vector lengths > hardware configuration
      - Break actual length into an integer number (n) of natural length + what's left
      - Execute the "what's left" by setting VLR < natural
      - Loop n times with VLR = natural
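
The strip-mining steps above can be sketched in C for a Y = a × X + Y loop. This is an illustration under stated assumptions: `MVL` stands for the natural (maximum) vector length and is set to 64 to match the slides, the inner function stands in for one vector instruction sequence executed with VLR = `vl`, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* natural (maximum) vector length, assumed to be 64 */

/* Stands in for one vector sequence executed with VLR = vl. */
static void daxpy_strip(size_t vl, double a, const double *x, double *y)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        y[i] = a * x[i] + y[i];
}

/* Strip-mined Y = a*X + Y for any n. Returns the number of strips run. */
size_t daxpy_strip_mined(size_t n, double a, const double *x, double *y)
{
    size_t low = 0;
    size_t vl = n % MVL;      /* the "what's left" piece runs first */
    size_t strips = 0;

    for (size_t j = 0; j <= n / MVL; j++) {
        if (vl > 0) {
            daxpy_strip(vl, a, x + low, y + low);
            strips++;
        }
        low += vl;
        vl = MVL;             /* every later strip uses the natural length */
    }
    return strips;
}
```

For n = 130 the loop runs three strips of lengths 2, 64, and 64. Running the short strip first keeps the loop body identical for every later iteration.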



- Vector Architecture Enhancements
  - Vector Mask Register
    - Determines which elements in a vector actually get saved
    - 1 bit for each word in the vector
    - Operate on the vector as normal
    - Only writeback results for words with the mask set

- Vector Architecture Enhancements
  - Vector Mask Register
    - Use VMR to skip elements in the vector
    - Consider:

```
/* subtract Y[i] only where X[i] is non-zero */
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
```

This loop cannot normally be vectorized because the "if" makes the body conditional

- Vector Architecture Enhancements
  - Vector Mask Register
  - Use vector mask register to "disable" elements and allow the "if"

| Instruction | Operands | Comment                       |
|-------------|----------|-------------------------------|
| LV          | V1,Rx    | ;load vector X into V1        |
| LV          | V2,Ry    | ;load vector Y                |
| L.D         | F0,#0    | ;load FP zero into F0         |
| SNEVS.D     | V1,F0    | ;sets VM(i) to 1 if V1(i)!=F0 |
| SUBVV.D     | V1,V1,V2 | ;subtract under vector mask   |
| SV          | Rx,V1    | ;store the result in X        |

- GFLOPS rate decreases: masked-off elements still take execution time but produce no useful result
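
The masked sequence can be emulated in C to show why the useful work per clock drops. This is a sketch, not VMIPS semantics in full: the mask is modelled as a `bool` array, the function name is hypothetical, and the vector length is capped at an assumed 64 elements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Emulates SNEVS.D followed by SUBVV.D under mask:
 * X[i] = X[i] - Y[i] only where X[i] != 0.
 * Returns how many elements had their mask bit set. */
size_t masked_subtract(size_t n, double *x, const double *y)
{
    enum { MAX_VL = 64 };      /* assumed hardware vector length */
    bool vm[MAX_VL];
    size_t active = 0;

    assert(n <= MAX_VL);

    for (size_t i = 0; i < n; i++)       /* SNEVS.D: build the mask */
        vm[i] = (x[i] != 0.0);

    for (size_t i = 0; i < n; i++) {     /* SUBVV.D: every element is processed */
        double diff = x[i] - y[i];
        if (vm[i]) {                     /* only masked-on results are written back */
            x[i] = diff;
            active++;
        }
    }
    return active;
}
```

The second loop still visits all n elements, so the time spent is the same whether the mask has 64 bits set or 2. Only `active` of those subtractions count as useful FLOPs.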

- Vector Architecture Enhancements
  - Banked Memory
  - We can execute a new calculation every clock cycle via pipelining
  - But can we provide data at the full clock rate?
    - Caching
    - Memory banking
      - Allow multiple loads and stores in parallel
      - Scatter-Gather → independent bank addressing
      - Multiple processors → independent instruction streams
      - → Large number of banks
        - Cray T932 = 1024 banks

- Vector Architecture Enhancements
  - Striding
  - How do we work with a multidimensional array structure?
    - In memory the array is linear (row major or column major)
    - Subsequent elements on the minor axis are separated by the major axis length: the stride
    - Add a register to hold the stride value (computed at run time)
    - Can lead to bank contention
      - Separate stride accesses end up in the same bank  $\rightarrow$  stall

A stall occurs when:

$\dfrac{\text{Number of banks}}{\gcd(\text{Stride},\ \text{Number of banks})} < \text{Bank busy time}$
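
A stall check along these lines can be sketched in C. The sketch assumes that a strided access stream visits banks / gcd(stride, banks) distinct banks before returning to the first one, and that a stall occurs when that count is smaller than the bank busy time in clocks. The function names, and the 8-bank, 6-clock figures in the note below, are illustrative and not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks a strided access stream cycles through. */
unsigned banks_touched(unsigned stride, unsigned banks)
{
    assert(stride > 0 && banks > 0);
    return banks / gcd_u(stride, banks);
}

/* True when the stream comes back to a bank before it is ready again. */
bool stride_stalls(unsigned stride, unsigned banks, unsigned bank_busy_cycles)
{
    return banks_touched(stride, banks) < bank_busy_cycles;
}
```

With 8 banks and a 6-clock bank busy time, stride 1 cycles through all 8 banks and does not stall, while stride 32 hits the same bank on every access and stalls each time.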

- Vector Architecture Enhancements
  - Scatter-Gather
    - Sparse matrices are very common
    - Use a vector to indicate the non-zero elements of the matrix

for (i=0; i<n; i++) A[K[i]] = A[K[i]] + C[M[i]];

where A and C are the sparse matrices, and K and M are index vectors that designate their non-zero elements

- Special instructions LVI, SVI to gather and scatter
  - Use one operand (non-zero element vector) as an index to a base address

- Vector Architecture Enhancements
  - Scatter-Gather
    - Add elements of two sparse matrices

Ra, Rc, Rk, Rm are the base addresses of the vectors

| Instruction | Operands      | Comment                   |
|-------------|---------------|---------------------------|
| LV          | Vk, Rk        | ; load K                  |
| LVI         | Va, (Ra + Vk) | ; load A[K[ ]] - gather   |
| LV          | Vm, Rm        | ; load M                  |
| LVI         | Vc, (Rc + Vm) | ; load C[M[ ]] - gather   |
| ADDVV.D     | Va, Va, Vc    | ; add vectors             |
| SVI         | (Ra + Vk), Va | ; store A[K[ ]] - scatter |
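
The gather, add, scatter sequence can be mirrored step by step in C. This is a sketch under assumptions: the index vectors hold element indices rather than byte offsets, K contains no repeated index, the vector length is capped at an assumed 64, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VL 64  /* assumed hardware vector length */

/* A[K[i]] = A[K[i]] + C[M[i]] for i in [0, n), done in vector-style phases. */
void sparse_add(size_t n, double *a, const double *c,
                const size_t *k, const size_t *m)
{
    double va[MAX_VL], vc[MAX_VL];

    assert(n <= MAX_VL);

    for (size_t i = 0; i < n; i++) va[i] = a[k[i]];  /* LVI Va,(Ra+Vk): gather  */
    for (size_t i = 0; i < n; i++) vc[i] = c[m[i]];  /* LVI Vc,(Rc+Vm): gather  */
    for (size_t i = 0; i < n; i++) va[i] += vc[i];   /* ADDVV.D Va,Va,Vc        */
    for (size_t i = 0; i < n; i++) a[k[i]] = va[i];  /* SVI (Ra+Vk),Va: scatter */
}
```

Splitting the work into four separate loops mirrors the vector instructions: once gathered, the data sits densely in `va` and `vc`, so the add itself runs like any other vector operation.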

- Vector Processor modern example
  - Intel Phi Co-processor



- Vector Processor modern example
  - Intel Phi Co-processor

| Specification                     | Value                                                               |
|-----------------------------------|---------------------------------------------------------------------|
| Silicon Cores                     | 57-61                                                               |
| Silicon Max Freq                  | 1.05-1.25 GHz                                                       |
| Double Precision Peak Performance | 1003-1220 GFLOPS                                                    |
| GDDR5 Devices                     | 24-32                                                               |
| Memory Capacity                   | 6-8 GB                                                              |
| Memory Channels                   | 12-16                                                               |
| Form Factors                      | Passive, active, dense form factor (DFF), no thermal solution (NTS) |
| Memory BW Peak                    | 240-352 GB/s                                                        |
| Total Cache                       | 28.5-30.5 MB                                                        |
| Board TDP                         | 225-300 Watts                                                       |

- Vector Processor modern example
  - Intel Phi Co-processor
    - Core block diagram (figure not reproduced); labels recoverable from it:
      - 4 threads per core (T0 IP to T3 IP), in-order decode with uCode
      - 32KB code cache and 32KB data cache, each with an L1 TLB and miss handling; L2 TLB
      - Instruction fetch at 16B/cycle (2 IPC), two pipes (Pipe 0, Pipe 1)
      - Register files: scalar RF, x87 RF, VPU RF
      - Execution units: ALU 0, ALU 1, x87, VPU (512b SIMD)
      - x86 specific logic < 2% of core + L2 area

- Vector Processor modern example
  - Intel Phi Co-processor
  - 16 x 32bit or
  - 8 x 64bit



- Vector Processor modern example
  - Intel Phi Co-processor
    - Vector Mask

Source registers (four 32-bit elements per row, MSB on the left):

| Register | MSB        |            |            | LSB        | Bytes         |
|----------|------------|------------|------------|------------|---------------|
| zmm0     | 0x00000003 | 0x00000002 | 0x00000001 | 0x00000000 | 15 through 0  |
|          | 0x00000007 | 0x00000006 | 0x00000005 | 0x00000004 | 31 through 16 |
|          | 0x0000000B | 0x0000000A | 0x00000009 | 0x00000008 | 47 through 32 |
|          | 0x0000000F | 0x0000000E | 0x0000000D | 0x0000000C | 63 through 48 |
| zmm1     | 0x0000000F | 0x0000000F | 0x0000000F | 0x0000000F | 15 through 0  |
|          | 0x0000000F | 0x0000000F | 0x0000000F | 0x0000000F | 31 through 16 |
|          | 0x0000000F | 0x0000000F | 0x0000000F | 0x0000000F | 47 through 32 |
|          | 0x0000000F | 0x0000000F | 0x0000000F | 0x0000000F | 63 through 48 |
| zmm2     | 0xAAAAAAAA | 0xAAAAAAAA | 0xAAAAAAAA | 0xAAAAAAAA | 15 through 0  |
|          | 0xBBBBBBBB | 0xBBBBBBBB | 0xBBBBBBBB | 0xBBBBBBBB | 31 through 16 |
|          | 0xCCCCCCCC | 0xCCCCCCCC | 0xCCCCCCCC | 0xCCCCCCCC | 47 through 32 |
|          | 0xDDDDDDDD | 0xDDDDDDDD | 0xDDDDDDDD | 0xDDDDDDDD | 63 through 48 |

k3 = 0x8F03 (1000 1111 0000 0011)

vpaddd zmm2 {k3}, zmm0, zmm1

Sums written under the mask (masked = element not written):

| MSB        |            |            | LSB        | Bytes         |
|------------|------------|------------|------------|---------------|
| masked     | masked     | 0x00000010 | 0x0000000F | 15 through 0  |
| masked     | masked     | masked     | masked     | 31 through 16 |
| 0x0000001A | 0x00000019 | 0x00000018 | 0x00000017 | 47 through 32 |
| 0x0000001E | masked     | masked     | masked     | 63 through 48 |

Resulting destination register:

| Register | MSB        |            |            | LSB        | Bytes         |
|----------|------------|------------|------------|------------|---------------|
| zmm2     | 0xAAAAAAAA | 0xAAAAAAAA | 0x00000010 | 0x0000000F | 15 through 0  |
|          | 0xBBBBBBBB | 0xBBBBBBBB | 0xBBBBBBBB | 0xBBBBBBBB | 31 through 16 |
|          | 0x0000001A | 0x00000019 | 0x00000018 | 0x00000017 | 47 through 32 |
|          | 0x0000001E | 0xDDDDDDDD | 0xDDDDDDDD | 0xDDDDDDDD | 63 through 48 |
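
The masked add in this example can be emulated in plain C to check the element values. This is a software sketch of the write-mask behaviour only (masked-off destination elements keep their old contents); the function name is hypothetical and no intrinsics are used.

```c
#include <assert.h>
#include <stdint.h>

#define ZMM_LANES 16  /* 16 x 32-bit elements in a 512-bit register */

/* dst[i] = a[i] + b[i] where bit i of mask is 1; other elements of dst
 * are left unchanged. Bit 0 of the mask controls the LSB element. */
void masked_add_u32(uint32_t dst[ZMM_LANES], uint16_t mask,
                    const uint32_t a[ZMM_LANES], const uint32_t b[ZMM_LANES])
{
    assert(dst && a && b);
    for (int i = 0; i < ZMM_LANES; i++) {
        if ((mask >> i) & 1u) {
            dst[i] = a[i] + b[i];
        }
    }
}
```

With zmm0 holding 0 through 15, zmm1 holding 0xF in every element, and k3 = 0x8F03, elements 0, 1, 8 through 11, and 15 are updated and the rest of zmm2 keeps its previous fill pattern.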

- Vector Processor modern example
  - Intel Phi Co-processor
  - Peak performance

Peak performance = clock freq x 8 lanes (64bit mode) x 2 FLOPS/clock x # cores

1 GHz x 8 lanes x 2 FLOPS/clock = 16 GFLOPS / core

60 cores $\rightarrow$ 960 GFLOPS
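
The same arithmetic as a small C helper, so other clock rates and core counts can be tried. The function and parameter names are illustrative.

```c
#include <assert.h>

/* Peak GFLOPS = clock (GHz) x lanes x FLOPs per lane per clock x cores. */
double peak_gflops(double clock_ghz, unsigned lanes,
                   unsigned flops_per_clock, unsigned cores)
{
    assert(clock_ghz > 0.0);
    return clock_ghz * lanes * flops_per_clock * cores;
}
```

`peak_gflops(1.0, 8, 2, 1)` gives 16 GFLOPS for one core and `peak_gflops(1.0, 8, 2, 60)` gives 960 GFLOPS for 60 cores, matching the figures above.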

- Vector Processor modern example
  - Intel Phi Co-processor
    - Swizzle / Broadcast
      - 4 word permutations (16 bytes in 32 bit mode, 32 bytes in 64 bit mode)
      - Converts when the operand is loaded

| $S_2 S_1 S_0$ | Function: 4 x 32 bits                        | Usage               |
|---------------|----------------------------------------------|---------------------|
| 000           | no swizzle                                   | zmm0 or zmm0 {dcba} |
| 001           | swap (inner) pairs                           | zmm0 {cdab}         |
| 010           | swap with two-away                           | zmm0 {badc}         |
| 011           | cross-product swizzle                        | zmm0 {dacb}         |
| 100           | broadcast a element across 4-element packets | zmm0 {aaaa}         |
| 101           | broadcast b element across 4-element packets | zmm0 {bbbb}         |
| 110           | broadcast c element across 4-element packets | zmm0 {cccc}         |
| 111           | broadcast d element across 4-element packets | zmm0 {dddd}         |
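
The swizzle patterns in the table can be emulated in C for the 32-bit case. The sketch assumes the letter convention implied by the table: `{dcba}` is the identity, so the rightmost letter names the source for the lowest element of each 4-element packet, and `a` is the lowest element. The function name and the string-based pattern argument are illustrative, not an Intel API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ZMM_LANES 16  /* 16 x 32-bit elements, handled as four 4-element packets */

/* pattern: four letters from "abcd", written MSB-first as in the Usage
 * column (for example "cdab"). dst and src must not overlap. */
void swizzle_u32(uint32_t dst[ZMM_LANES], const uint32_t src[ZMM_LANES],
                 const char *pattern)
{
    assert(strlen(pattern) == 4);
    for (int base = 0; base < ZMM_LANES; base += 4) {
        for (int pos = 0; pos < 4; pos++) {
            char letter = pattern[3 - pos];   /* rightmost letter feeds element 0 */
            assert(letter >= 'a' && letter <= 'd');
            dst[base + pos] = src[base + (letter - 'a')];
        }
    }
}
```

For a packet holding a = 0, b = 1, c = 2, d = 3, the pattern `"cdab"` produces 1, 0, 3, 2 from the lowest element up (inner pairs swapped), and `"aaaa"` copies the lowest element of each packet into all four positions.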

- Vector Processor modern example
  - Intel Phi Co-processor
    - SP transcendental instructions supported in hardware
      - Exponent
      - Logarithm
      - Reciprocal
      - Square root operations.

- Vector Processor modern example
  - Intel Phi Co-processor
    - Standard instruction format
    - vop v0{mask}, v1, v2|mem{swizzle}