# **NEON Advanced SIMD Instructions**

#### References

 NEON Programmer's Guide DEN0018 (NPG) – read this first!

#### NEON Programmer's Guide

- Contents
- Preface
- 2: Compiling NEON Instructions
- 4: NEON Intrinsics
- 5: Optimizing NEON Code
- 6: NEON Code Examples with Intrinsics
- 7: NEON Code Examples with Mixed Operations
- 8: NEON Code Examples with Optimization
- A: NEON Microarchitecture
- B: Operating System Support
- C: NEON and VFP Instruction Summary
- D: NEON Intrinsics Reference

- Instr. Functionality: ARM Arch. Ref. Manual
  - Load/Store: 4.11
  - Register Transfer: 4.12
  - Data Processing: 4.13, 4.14
- ARM C Language Extensions IHI0053 (ACLE)
- ARM NEON Intrinsics Reference IHI0073 (NIR)
- Performance: Cortex-A72 Software Optimization Guide UAN0016

# **BASIC ASIMD INSTRUCTIONS: INDEPENDENT LANES**

#### Bitwise Logic










#### Shifts



 VSLI – shift left and insert, leaving lower bits unchanged



 VSRI – shift right and insert, leaving upper bits unchanged
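The two shift-and-insert operations can be sketched in portable C for a single 8-bit lane. This is only a scalar model, not NEON code; the function names `vsli_u8_lane` and `vsri_u8_lane` are made up for illustration.

```c
#include <stdint.h>

/* Scalar model of one 8-bit lane of VSLI.8 Dd, Dm, #n (0 <= n <= 7):
   shift the source left by n and insert it into the destination,
   leaving the destination's low n bits unchanged. */
uint8_t vsli_u8_lane(uint8_t dest, uint8_t src, unsigned n)
{
    uint8_t keep = (uint8_t)((1u << n) - 1u);        /* low n bits of dest */
    return (uint8_t)(((unsigned)src << n) | (dest & keep));
}

/* Scalar model of one 8-bit lane of VSRI.8 Dd, Dm, #n (1 <= n <= 8):
   shift the source right by n and insert it into the destination,
   leaving the destination's high n bits unchanged. */
uint8_t vsri_u8_lane(uint8_t dest, uint8_t src, unsigned n)
{
    uint8_t inserted = (uint8_t)(0xFFu >> n);        /* bits written by the source */
    return (uint8_t)(((unsigned)src >> n) | (dest & (uint8_t)~inserted));
}
```

For example, inserting `0x01` shifted left by 4 into `0xFF` keeps the low nibble `0xF` and gives `0x1F`.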



# Bitwise Logic and Move

VORN – Bitwise OR not



VMOV - Move



VMVN – Move Not. Invert all bits.

VTST – If element-wise AND is non-zero, set all element bits to 1





#### Math



VSUB



- VNEG Negate
  - Destination = -1 \* source
- VMAX write larger of two source lanes to destination lane
- VMIN write smaller of two source lanes to destination lane



# SIMD Math with Multiply




- VFMA Fused Multiply Accumulate
  - Products not rounded before adds, so better accuracy
- VFMS Fused Multiply Subtract
  - Products not rounded before subtracts, so better accuracy

#### **Absolute Values**

VABS – Absolute Value



VABD – Absolute Value of Difference


 VABA – Absolute Value of Difference and Accumulate
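As a rough illustration, the per-lane behavior of VABD and VABA on unsigned 8-bit lanes can be modeled in scalar C. The function names are hypothetical. Note that this non-widening VABA accumulates in the same 8-bit lane, so the sum wraps modulo 256.

```c
#include <stdint.h>

/* Scalar model of one lane of VABD.U8: |a - b|. */
uint8_t vabd_u8_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

/* Scalar model of one lane of VABA.U8: acc + |a - b|,
   accumulated in the same 8-bit lane (wraps modulo 256). */
uint8_t vaba_u8_lane(uint8_t acc, uint8_t a, uint8_t b)
{
    return (uint8_t)(acc + vabd_u8_lane(a, b));
}

/* Apply VABA to all eight lanes of a 64-bit D register. */
void vaba_u8_model(uint8_t acc[8], const uint8_t n[8], const uint8_t m[8])
{
    for (int lane = 0; lane < 8; lane++)
        acc[lane] = vaba_u8_lane(acc[lane], n[lane], m[lane]);
}
```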



VCNT – Count set (1) bits

# **Bitwise Multiplex Operations**

- Bitwise: each lane is one bit wide
- Copy bits specified by mask register from source register to destination register


- VBIT: Bitwise Insert if True
  - Qm is mask register
  - VBIT Qd, Qn, Qm: If Qm[i] is one, copy Qn[i] to Qd[i]
- VBIF: Bitwise Insert if False
  - Qm is mask register
  - VBIF Qd, Qn, Qm: If Qm[i] is zero, copy Qn[i] to Qd[i]
- VBSL: Bitwise Select
  - Qd is mask register
  - VBSL Qd, Qn, Qm: If Qd[i] is one, copy Qn[i] to Qd[i], else copy Qm[i] to Qd[i]
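The three multiplex operations differ only in which register holds the mask. A minimal scalar sketch on one 32-bit word (a Q register would hold four such words) might look like this; the function names are invented for illustration.

```c
#include <stdint.h>

/* VBIT Qd, Qn, Qm: where mask m is 1, copy n into d. */
uint32_t vbit_model(uint32_t d, uint32_t n, uint32_t m)
{
    return (d & ~m) | (n & m);
}

/* VBIF Qd, Qn, Qm: where mask m is 0, copy n into d. */
uint32_t vbif_model(uint32_t d, uint32_t n, uint32_t m)
{
    return (d & m) | (n & ~m);
}

/* VBSL Qd, Qn, Qm: d is the mask; where d is 1 take n, else take m. */
uint32_t vbsl_model(uint32_t d, uint32_t n, uint32_t m)
{
    return (n & d) | (m & ~d);
}
```

Because VBSL overwrites its own mask, it suits cases where the mask is not needed afterwards, while VBIT and VBIF preserve the mask in Qm.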



#### **Compare and Absolute Compare**





- Compare: Compare elements
  - VCop {Qd,} Qn, Qm
    - Inputs Qn[i], Qm[i]: integer (8, 16, 32) or float (F32)
    - Output Qd[i]: integer as wide as input
    - Compares Qn[i] and Qm[i]
    - Is result true?
      - Yes: Qd[i] = 11...11
      - No: Qd[i] = 00...00
  - Instructions: VCEQ, VCGE, VCGT, VCLE, VCLT

- Absolute Compare: Compare absolute values of elements
  - VACop {Qd,} Qn, Qm
    - Inputs Qn[i], Qm[i]: must be float (F32)
    - Output Qd[i]: integer as wide as input (32 bits)
    - Compares |Qn[i]| and |Qm[i]|
    - Is result true?
      - Yes: Qd[i] = 11...11
      - No: Qd[i] = 00...00
  - Instructions: VACGE, VACGT, VACLE, VACLT
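The all-ones / all-zeros result convention can be sketched for a single 32-bit lane as follows. This is a scalar model with made-up function names, shown for VCGE, VCGT and VACGT only.

```c
#include <stdint.h>

/* Scalar model of one lane of VCGE.S32: all ones if n >= m, else all zeros. */
uint32_t vcge_s32_lane(int32_t n, int32_t m)
{
    return (n >= m) ? 0xFFFFFFFFu : 0u;
}

/* Scalar model of one lane of VCGT.F32: all ones if n > m, else all zeros. */
uint32_t vcgt_f32_lane(float n, float m)
{
    return (n > m) ? 0xFFFFFFFFu : 0u;
}

/* Scalar model of one lane of VACGT.F32: compares |n| and |m|. */
uint32_t vacgt_f32_lane(float n, float m)
{
    float abs_n = (n < 0.0f) ? -n : n;
    float abs_m = (m < 0.0f) ? -m : m;
    return (abs_n > abs_m) ? 0xFFFFFFFFu : 0u;
}
```

The resulting lane masks are in exactly the form expected by the bitwise select instructions, which is why compares produce all ones rather than a single 1.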

#### Vector Load/Store

#### Table A4-13 Extension register load/store instructions

| Instruction           | See                 | Operation                                                                                                         |
|-----------------------|---------------------|-------------------------------------------------------------------------------------------------------------------|
| Vector Load Multiple  | VLDM on page A8-626 | Load 1-16 consecutive 64-bit registers (Adv. SIMD and VFP)<br>Load 1-16 consecutive 32-bit registers (VFP only)   |
| Vector Load Register  | VLDR on page A8-628 | Load one 64-bit register (Adv. SIMD and VFP)<br>Load one 32-bit register (VFP only)                               |
| Vector Store Multiple | VSTM on page A8-784 | Store 1-16 consecutive 64-bit registers (Adv. SIMD and VFP)<br>Store 1-16 consecutive 32-bit registers (VFP only) |
| Vector Store Register | VSTR on page A8-786 | Store one 64-bit register (Adv. SIMD and VFP)<br>Store one 32-bit register (VFP only)                             |

# Memory

- VLD
- VST
- VLDM
- VSTM
- VLDR
- VSTR
- VPOP
- VPUSH

# Move (See NPG, Appendix C)

- VMOV
- VDUP
- VEXT
- VMVN
- VREV
- VSWP
- VTRN
- VUZP
- VZIP

# Structure Load/Store Instructions with Element De-Interleaving/Interleaving

# Arrays and Structures

#### Memory



#### Array of structures

struct { uint8\_t Red, Green, Blue; } image[N];

- Not a great fit for load/store register instructions
- Could rewrite code to rearrange data in memory into a structure of arrays: struct { uint8\_t Red[N], Green[N], Blue[N]; } image;
  - Is better fit for load/store instructions
  - Requires significant code modifications

#### "Structure Load" De-Interleaves From Memory Into Registers



Loading RGB data with a structure load.

Instructions: NPG, page C-63

A structure load de-interleaves **n**-element structures from memory into **n** separate registers.
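A scalar sketch of what a 3-element structure load does for 8-bit RGB data is shown below. It assumes the multiple-structure form with three D registers, which reads eight 3-byte pixels; the function name `vld3_u8_model` is hypothetical.

```c
#include <stdint.h>

/* Scalar model of VLD3.8 {D0, D1, D2}, [Rn]:
   read 8 three-byte structures (24 bytes) and de-interleave them,
   so that D0 holds all first elements, D1 all second, D2 all third. */
void vld3_u8_model(const uint8_t mem[24],
                   uint8_t d0[8], uint8_t d1[8], uint8_t d2[8])
{
    for (int lane = 0; lane < 8; lane++) {
        d0[lane] = mem[3 * lane + 0];   /* e.g. Red   */
        d1[lane] = mem[3 * lane + 1];   /* e.g. Green */
        d2[lane] = mem[3 * lane + 2];   /* e.g. Blue  */
    }
}
```

After the load, each register holds one color component for eight consecutive pixels, which is the layout the lane-wise instructions want.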

#### "Structure Store" Interleaves From Registers Into Memory



#### **Another View**



- Have support for 2, 3 and 4 element structures
- How well can it work?
  - Wide interfaces between NEON registers and memory
  - L1 Data Cache
    - 128 bit interface

#### Load Structure

- Multiple Structure Access e.g. {D0, D1}
- Single Structure Access e.g. {D0[2], D1[2]}
- Single Structure Load to all lanes e.g. {D0[], D1[]}


| Load single element      |                                                               |
|--------------------------|---------------------------------------------------------------|
| Multiple elements        | VLD1 (multiple single elements) on page A8-602                |
| To one lane              | VLD1 (single element to one lane) on page A8-604              |
| To all lanes             | VLD1 (single element to all lanes) on page A8-606             |
| Load 2-element structure |                                                               |
| Multiple structures      | VLD2 (multiple 2-element structures) on page A8-608           |
| To one lane              | VLD2 (single 2-element structure to one lane) on page A8-610  |
| To all lanes             | VLD2 (single 2-element structure to all lanes) on page A8-612 |
| Load 3-element structure |                                                               |
| Multiple structures      | VLD3 (multiple 3-element structures) on page A8-614           |
| To one lane              | VLD3 (single 3-element structure to one lane) on page A8-616  |
| To all lanes             | VLD3 (single 3-element structure to all lanes) on page A8-618 |
| Load 4-element structure |                                                               |
| Multiple structures      | VLD4 (multiple 4-element structures) on page A8-620           |
| To one lane              | VLD4 (single 4-element structure to one lane) on page A8-622  |
| To all lanes             | VLD4 (single 4-element structure to all lanes) on page A8-624 |

#### **Store Structure**

| Store single element      |                                                                |
|---------------------------|----------------------------------------------------------------|
| Multiple elements         | VST1 (multiple single elements) on page A8-768                 |
| From one lane             | VST1 (single element from one lane) on page A8-770             |
| Store 2-element structure |                                                                |
| Multiple structures       | VST2 (multiple 2-element structures) on page A8-772            |
| From one lane             | VST2 (single 2-element structure from one lane) on page A8-774 |
| Store 3-element structure |                                                                |
| Multiple structures       | VST3 (multiple 3-element structures) on page A8-776            |
| From one lane             | VST3 (single 3-element structure from one lane) on page A8-778 |
| Store 4-element structure |                                                                |
| Multiple structures       | VST4 (multiple 4-element structures) on page A8-780            |
| From one lane             | VST4 (single 4-element structure from one lane) on page A8-782 |

## Structure VLD/VST and Operand Syntax

| Elements per structure | Multiple structures: load or store every element of every structure | Single structure to one lane: load one element into one lane in each register | Single structure to all lanes: load multiple copies of structure elements into multiple registers |
|------------------------|----------------------------------------------------------------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| 1 (no interleaving)    | {D0, D1}                                                             | {D0[2], D1[2]}                                                                | {D0[], D1[]}                                                                                      |
| 2                      |                                                                      |                                                                               |                                                                                                   |
| 3                      |                                                                      |                                                                               |                                                                                                   |
| 4                      |                                                                      |                                                                               |                                                                                                   |



# **Multiple 2-Element Structure Access**

- VLD2, VST2 provide access to multiple 2-element structures
  - List can contain 2 or 4 registers
  - Transfer multiple consecutive 8, 16, or 32-bit 2-element structures





### **Multiple 3/4-Element Structure Access**

- VLD3/4, VST3/4 provide access to 3 or 4-element structures
  - Lists contain 3 or 4 registers; registers may optionally be spaced two apart (e.g. D0, D2, D4) for building 128-bit vectors
  - Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures





# UNCOMMON ASIMD INSTRUCTIONS AND FEATURES

#### Table Lookup



- **Table extension:** VTBX Dd, list, Dm
  - Same as VTBL, but doesn't overwrite Dd[i] if Dm[i] is not in range
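The difference between the two lookups can be sketched with a scalar model. The function names and the flat `table` array (standing in for a list of 1 to 4 D registers) are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of VTBL.8 Dd, {list}, Dm.
   Each index lane selects one table byte; an out-of-range index writes 0.
   table_len is 8, 16, 24 or 32 bytes (a list of 1 to 4 D registers). */
void vtbl_u8_model(uint8_t d[8], const uint8_t *table, size_t table_len,
                   const uint8_t index[8])
{
    for (int i = 0; i < 8; i++)
        d[i] = (index[i] < table_len) ? table[index[i]] : 0;
}

/* Scalar model of VTBX.8 Dd, {list}, Dm.
   Same lookup, but an out-of-range index leaves d[i] unchanged. */
void vtbx_u8_model(uint8_t d[8], const uint8_t *table, size_t table_len,
                   const uint8_t index[8])
{
    for (int i = 0; i < 8; i++)
        if (index[i] < table_len)
            d[i] = table[index[i]];
}
```

Keeping out-of-range lanes intact is what lets several VTBX instructions be chained to cover a table larger than four registers.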

# Vector Reciprocal and Reciprocal Square Root

The NEON instruction set does not include:

- division operation (use VRECPE and VRECPS instead to perform Newton-Raphson iteration)
- square root operation (use VRSQRTE and VRSQRTS and multiply instead)

Approximate with the estimate instruction, then refine with one or more step instructions. References:

- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka14282.html
- https://github.com/thenifty/neon-guide

Estimate reciprocal:

float32x4\_t v = { 1.0, 2.0, 3.0, 4.0 };
float32x4\_t reciprocal = vrecpeq\_f32(v);
// => reciprocal = { 0.998046875, 0.499023438, 0.333007813, 0.249511719 }

More accurate (estimate plus one refinement step)

float32x4\_t v = { 1.0, 2.0, 3.0, 4.0 };
float32x4\_t reciprocal1 = vrecpeq\_f32(v);
float32x4\_t reciprocal2 = vmulq\_f32(vrecpsq\_f32(v, reciprocal1), reciprocal1);
// => reciprocal2 = { 0.999996185, 0.499998093, 0.333333015, 0.249999046 }
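The refinement arithmetic can be checked on any host with a scalar model. The sketch below assumes the step instructions compute `2 - a*b` (VRECPS) and `(3 - a*b) / 2` (VRSQRTS) per lane; the function names are invented, and the starting estimate is passed in rather than modeled.

```c
/* Scalar model of one lane of VRECPS: 2 - a*b. */
float vrecps_lane(float a, float b)
{
    return 2.0f - a * b;
}

/* Scalar model of one lane of VRSQRTS: (3 - a*b) / 2. */
float vrsqrts_lane(float a, float b)
{
    return (3.0f - a * b) / 2.0f;
}

/* Newton-Raphson for 1/d:  x <- x * (2 - d*x). */
float refine_recip(float d, float estimate, int steps)
{
    float x = estimate;
    for (int i = 0; i < steps; i++)
        x = x * vrecps_lane(d, x);
    return x;
}

/* Newton-Raphson for 1/sqrt(d):  x <- x * (3 - d*x*x) / 2. */
float refine_rsqrt(float d, float estimate, int steps)
{
    float x = estimate;
    for (int i = 0; i < steps; i++)
        x = x * vrsqrts_lane(d * x, x);
    return x;
}
```

Each step roughly squares the relative error, so starting from an estimate good to about 3 decimal digits, two steps are typically enough for single precision.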

# Lane Changes

#### Merge Lanes with Pairwise Reduction Operations



#### Change Lane Width with Instruction "Shape" Modifiers



#### Normal

Both operands and results are the same width. Example:

VADD.I16 Q0, Q1, Q2

#### Narrow –N

Operands are the same width. Number of bits in each result element is half the number of bits in each operand element. Example:

VADDHN.I16 D0, Q1, Q2

#### Long –L

Operands are the same width. Number of bits in each result element is double the number of bits in each operand element. Example:

VADDL.S16 Q0, D2, D3

#### Wide –W

Result and first operand are twice the width of the second operand. Example:

VADDW.S16 Q0, Q1, D4
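The effect of each shape on lane width is easiest to see in the C types of a per-lane model. The sketch below uses hypothetical function names and models signed lanes for the long and wide forms.

```c
#include <stdint.h>

/* Normal, VADD.I16: 16-bit operands, 16-bit result (wraps modulo 2^16). */
uint16_t vadd_i16_lane(uint16_t n, uint16_t m)
{
    return (uint16_t)(n + m);
}

/* Narrow, VADDHN.I16: 16-bit operands, result is the high 8 bits of the sum. */
uint8_t vaddhn_i16_lane(uint16_t n, uint16_t m)
{
    return (uint8_t)((uint16_t)(n + m) >> 8);
}

/* Long, VADDL.S16: 16-bit operands, 32-bit result (cannot overflow). */
int32_t vaddl_s16_lane(int16_t n, int16_t m)
{
    return (int32_t)n + (int32_t)m;
}

/* Wide, VADDW.S16: 32-bit first operand, 16-bit second operand, 32-bit result.
   Unsigned arithmetic is used so that wraparound is well defined in C. */
int32_t vaddw_s16_lane(int32_t n, int16_t m)
{
    return (int32_t)((uint32_t)n + (uint32_t)(int32_t)m);
}
```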

## Modifiers for Instruction Operation

| Modifier | Action                    | Example                | Description                                                                                                                                                                                                                                                       |
|----------|---------------------------|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| None     | Basic operation           | VADD.I16 Q0, Q1, Q2    | The result is not modified                                                                                                                                                                                                                                        |
| Q        | Saturation                | VQADD.S16 D0, D2, D3   | Each element in the result vector is set to either the maximum or minimum if it exceeds the representable range. The range depends on the type (number of bits and sign) of the elements. The sticky QC bit in the FPSCR is set if saturation occurs in any lane. |
| H        | Halved                    | VHADD.S16 Q0, Q1, Q4   | Each element shifted right by one place (effectively a divide by two with truncation). VHADD can be used to calculate the mean of two inputs.                                                                                                                     |
| D        | Doubled before saturation | VQDMULL.S16 Q0, D1, D3 | This is commonly required when multiplying numbers in Q15 format, where an additional doubling is needed to get the result into the correct form.                                                                                                                |
| R        | Rounded                   | VRSUBHN.I16 D0, Q1, Q3 | The instruction rounds the result to correct for the bias caused by truncation. This is equivalent to adding 0.5 to the result before truncating.                                                                                                                 |
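A per-lane scalar sketch of four of the modifiers in the table is given below for signed 16-bit lanes. The function names are made up, and the sticky QC flag is not modeled.

```c
#include <stdint.h>

/* Divide by two, rounding toward minus infinity, without
   right-shifting a negative value. */
int32_t floor_half(int32_t v)
{
    return (v - (v & 1)) / 2;
}

/* Q, VQADD.S16: add, then clamp to the int16_t range. */
int16_t vqadd_s16_lane(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* H, VHADD.S16: add at full precision, then halve. */
int16_t vhadd_s16_lane(int16_t a, int16_t b)
{
    return (int16_t)floor_half((int32_t)a + (int32_t)b);
}

/* R with H, VRHADD.S16: add 1 before halving so the result is rounded. */
int16_t vrhadd_s16_lane(int16_t a, int16_t b)
{
    return (int16_t)floor_half((int32_t)a + (int32_t)b + 1);
}

/* Q with D, VQDMULL.S16: double the 32-bit product, then saturate.
   Only (-32768) * (-32768) can exceed the int32_t range. */
int32_t vqdmull_s16_lane(int16_t a, int16_t b)
{
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    return (doubled > INT32_MAX) ? INT32_MAX : (int32_t)doubled;
}
```

In Q15, 16384 represents 0.5, and `vqdmull_s16_lane(16384, 16384)` yields 536870912, which is 0.25 in Q31. That is the "correct form" the table refers to.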

# Example – adding all lanes

- Input in Q0 (D0 and D1)
- u16 input values

- Now Q0 contains 4x u32 values (with 15 headroom bits)
- Reducing/folding operation needs 1 bit of headroom
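The folding sequence can be sketched as a scalar model. The instruction choice here is an assumption, since the slide's figure is not available: a pairwise add long (such as VPADDL.U16) is assumed for the first step, followed by two pairwise folds.

```c
#include <stdint.h>

/* Assumed first step (e.g. VPADDL.U16 Q0, Q0):
   add adjacent u16 pairs into u32 lanes. Each sum needs at most
   17 bits, leaving 15 bits of headroom in a 32-bit lane. */
void vpaddl_u16_model(const uint16_t in[8], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + (uint32_t)in[2 * i + 1];
}

/* Sum all eight u16 lanes of Q0 by repeated pairwise folding. */
uint32_t add_all_lanes_u16(const uint16_t in[8])
{
    uint32_t q0[4];
    vpaddl_u16_model(in, q0);                 /* 8 x u16 -> 4 x u32 */

    /* Each further fold adds two lanes, so it uses 1 bit of headroom. */
    uint32_t d0[2] = { q0[0] + q0[1], q0[2] + q0[3] };   /* 4 -> 2 */
    return d0[0] + d0[1];                                /* 2 -> 1 */
}
```

With 15 headroom bits available and only two more folds needed, the 32-bit lanes cannot overflow even when every input is 0xFFFF.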




# Image Format Information

# A Digression: YUV Color Space



Full-Color Image

- Y: Luminance (brightness)
  - Y components alone give gray-scale image (no color)
- U,V: Chrominance (color)
  - U: Blue projection
  - V: Red projection







V Component: **Red Projection**

#### References

- https://softpixel.com/~cwright/programming/colorspace/yuv/
- https://en.wikipedia.org/wiki/YUV

By User:Brianski - Concept from en:Image:YUV components.jpg, original public domain image at en:Image:Barns\_grand\_tetons.jpg, Public Domain, https://commons.wikimedia.org/w/index.php?curid=2792866

# A Further Digression: YUV Chrominance Subsampling

#### Original Image



- Retina in human eye has far more brightness sensors (rods) than color sensors (cones)
  - → Worse spatial resolution for color than brightness (except in very center of vision)
- Color in digital images is often spatially sub-sampled
  - Removes information we can't see, saving time and space
  - Good explanation: https://www.impulseadventure.com/photo/chroma-subsampling.html
- 4:2:0 (aka 2x2) subsampling
  - Average together chroma values of 4 adjacent pixels
  - Reduces chrominance resolution by half horizontally and half vertically compared with luminance resolution
- Example:
  - 1 MPixel image needs to represent 3 million elements: 1M Y, 1M U, 1M V
  - Subsampling reduces it to 1.5 million elements: 1M Y, 0.25M U, 0.25M V
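A minimal sketch of 4:2:0 subsampling for one chroma plane follows, assuming a row-major 8-bit plane with even width and height and a rounded mean. The function names and the rounding choice are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Average each 2x2 block of a chroma plane (U or V).
   src is width x height, dst is (width/2) x (height/2), both row-major.
   width and height are assumed to be even. */
void subsample_420(const uint8_t *src, size_t width, size_t height,
                   uint8_t *dst)
{
    for (size_t y = 0; y < height; y += 2) {
        for (size_t x = 0; x < width; x += 2) {
            unsigned sum = src[y * width + x]
                         + src[y * width + x + 1]
                         + src[(y + 1) * width + x]
                         + src[(y + 1) * width + x + 1];
            dst[(y / 2) * (width / 2) + x / 2] = (uint8_t)((sum + 2) / 4);
        }
    }
}

/* Elements stored for a 4:2:0 image: full-resolution Y plus
   quarter-resolution U and V. */
size_t yuv420_elements(size_t pixels)
{
    return pixels + 2 * (pixels / 4);
}
```

For a 1 MPixel image, `yuv420_elements(1000000)` gives 1500000, matching the count above.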

Original Image



Reconstructed Subsampled Image



# Structures and Arrays in Memory

 In SIMD, want to work on same component from multiple pixels simultaneously. Must first load them from memory.



- Is a structure of arrays
- All Y's are adjacent, so loading is quick



- RGB image data
  - typedef struct { uint8\_t R, G, B; } RGB\_t;
  - RGB\_t RGB\_Image[WIDTH\*HEIGHT];
- Is an array of structures
- R's are separated by G and B (and padding?), so loading is harder and slower

#### Swap Registers

Now can swap red and blue easily:

VSWP d0, d2

Swapping the contents of registers d0 and d2.
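Putting the pieces together, a scalar sketch of an RGB to BGR conversion for eight pixels is shown below. It assumes the sequence is a 3-element structure load, a register swap, and a 3-element structure store; the function name is hypothetical and the arrays stand in for D registers.

```c
#include <stdint.h>

/* Scalar model of: VLD3.8 {d0,d1,d2}, [ptr] ; VSWP d0, d2 ; VST3.8 {d0,d1,d2}, [ptr]
   Converts 8 interleaved RGB pixels (24 bytes) to BGR in place. */
void swap_red_blue_8px(uint8_t pixels[24])
{
    uint8_t d0[8], d1[8], d2[8];

    /* Structure load: de-interleave so d0 = R, d1 = G, d2 = B. */
    for (int lane = 0; lane < 8; lane++) {
        d0[lane] = pixels[3 * lane + 0];
        d1[lane] = pixels[3 * lane + 1];
        d2[lane] = pixels[3 * lane + 2];
    }

    /* VSWP d0, d2: exchange the whole red and blue registers. */
    for (int lane = 0; lane < 8; lane++) {
        uint8_t tmp = d0[lane];
        d0[lane] = d2[lane];
        d2[lane] = tmp;
    }

    /* Structure store: re-interleave the registers back into memory. */
    for (int lane = 0; lane < 8; lane++) {
        pixels[3 * lane + 0] = d0[lane];
        pixels[3 * lane + 1] = d1[lane];
        pixels[3 * lane + 2] = d2[lane];
    }
}
```

Because the load already separated the components, the swap is a single register exchange rather than eight byte shuffles.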
