Did you just nerd-snipe yourself into porting this to NEON? No. That turned out to be a lie. Despite knowing almost nothing about NEON, I couldn’t resist and blew my weekend trying to parse ARM documentation. Here’s a NEON follow-up to Parsing numbers into base-10 decimals with SIMD.

The full code can be found here.

Benchmark Previews

Speedup over rust_decimal’s parser for varying number lengths: (figure: dec_speedup)

Speedup by parsing in batches: (figure: dec_speedup)

What’s different?

Much of the parser is a one-to-one port. All of the basic arithmetic operators, comparison operators, and shuffle instructions appear in both SSE and NEON.

The two differences are the lack of a movemask equivalent and the lack of a direct match for the horizontal multiply-add instructions.

I would love to get suggestions on how to improve this code! I’ve never written anything meaningful with NEON, so I may be missing some useful instructions.

Replicating movemask in NEON

This article describes an effective way to replicate movemask in NEON, as long as the input vector lanes are either all zeros or all ones.

Essentially, you can:

  1. Reinterpret the byte mask as 16-bit elements, shift each element right by 4, and narrow to 8-bit elements
  2. This packs the bottom 4 bits of the upper byte and the top 4 bits of the lower byte of each 16-bit element into a single output byte
  3. Move the resulting 64-bit vector into a u64
  4. Find the first/last set bit, and divide by 4

Giving the code

// Set lanes to all ones if they're equal to dot, otherwise all zeros
let is_eq_dot = vceqq_u8(number, dot);

// Cast to 16-bit elements (purely at type level)
let is_eq_dot_16 = vreinterpretq_u16_u8(is_eq_dot);

// Shift and narrow elements
let is_eq_dot_4x_vec_mask = vshrn_n_u16(is_eq_dot_16, 4);

// Extract the bottom 64 bits
let exploded_dot_mask = vget_lane_u64(vreinterpret_u64_u8(is_eq_dot_4x_vec_mask), 0);

// Find first set bit and adjust
let dot_idx = exploded_dot_mask.trailing_zeros() / 4;
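
For reference, here’s the same trick wrapped into a self-contained helper (my own sketch, not code from the parser): it loads a 16-byte chunk, builds the mask, and returns 16 when no dot is present. I’m using the const-generic form of the immediate arguments from current std::arch::aarch64.

use std::arch::aarch64::*;

// Hypothetical helper: index of the first b'.' in a 16-byte chunk, or 16 if absent
#[target_feature(enable = "neon")]
unsafe fn first_dot_index(chunk: &[u8; 16]) -> u32 {
    let number = vld1q_u8(chunk.as_ptr());
    let dot = vdupq_n_u8(b'.');

    // Lanes become all ones where the byte equals '.', all zeros otherwise
    let is_eq_dot = vceqq_u8(number, dot);

    // Shift-and-narrow: 4 mask bits per input byte, packed into a u64
    let mask = vget_lane_u64::<0>(vreinterpret_u64_u8(vshrn_n_u16::<4>(
        vreinterpretq_u16_u8(is_eq_dot),
    )));

    // trailing_zeros() of 0 is 64, so a chunk with no dot returns 16
    mask.trailing_zeros() / 4
}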

Replicating the horizontal adds

You can look at the SIMDe port of _mm_madd_epi16 and see that it involves splitting the vector up, doing a few multiplies, and concatenating the two halves.
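
Roughly, that style of emulation looks like the following sketch (my own approximation of the approach, not SIMDe’s actual code):

use std::arch::aarch64::*;

// Emulate _mm_madd_epi16: multiply i16 lanes pairwise and add adjacent
// products into i32 lanes
#[target_feature(enable = "neon")]
unsafe fn madd_epi16(a: int16x8_t, b: int16x8_t) -> int32x4_t {
    let lo = vmull_s16(vget_low_s16(a), vget_low_s16(b));   // a0*b0 .. a3*b3
    let hi = vmull_s16(vget_high_s16(a), vget_high_s16(b)); // a4*b4 .. a7*b7
    vpaddq_s32(lo, hi)                                      // add adjacent products
}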

The direct port looks pretty slow, and since half of our multiplies are by one, it feels wasteful to:

  1. Do a lot of work to split a vector in half
  2. Multiply half the numbers by 1, and half by 10
  3. Recombine the vectors

Looking through the various NEON multiply-accumulate intrinsics, I found a family that stands out. This family of instructions lets you:

  1. Multiply either the high half or low half of an integer vector by some scalar N
  2. Accumulate those results into an integer vector of double width

So, for example, I could write (in pseudocode):

// Start with a vector of 4 u32s
let vector: [u32; 4] = [1, 2, 3, 4];

// Take the bottom two elements and widen into a vector with 2 u64s
let lower_half: [u64; 2] = widen(vector[0..2]);

// Multiply high elements of vector by two, and accumulate into 64-bit elements of lower_half
let mult_acc: [u64; 2] = vmlal_high_n_u32(lower_half, vector, 2);

assert_eq!(mult_acc, [1 + 2 * 3, 2 + 2 * 4])

With a bit of shuffling, this exactly replicates our usage of _mm_madd_epi16 and friends!

// Start with a vector of 4 u32
let vector: [u32; 4] = [1, 2, 3, 4];

// Shuffle so that the elements we want to multiply are in the higher half
// Since we're multiplying with a pattern [10, 1, 10, 1],
// we put odd digits in the high half and even ones in the lower half

let shuffled = shuffle(vector, shuffle_mask);

assert_eq!(shuffled, [2, 4, 1, 3]);

let lower_half: [u64; 2] = widen(shuffled[0..2]);

assert_eq!(lower_half, [2, 4]);

let mult_acc: [u64; 2] = vmlal_high_n_u32(lower_half, shuffled, 10);

assert_eq!(mult_acc, [12, 34])
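
Written with real intrinsics (a minimal sketch of my own, not lifted from the parser), the same example looks like this:

use std::arch::aarch64::*;

// Combine the digit pairs (1, 2) and (3, 4) into [12, 34]
#[target_feature(enable = "neon")]
unsafe fn pairwise_demo() -> [u64; 2] {
    // Pre-shuffled so the digits to be multiplied by 10 sit in the high half
    let digits: [u32; 4] = [2, 4, 1, 3];
    let shuffled = vld1q_u32(digits.as_ptr());

    // Widen the low half ([2, 4]) into 64-bit lanes
    let lower_half = vmovl_u32(vget_low_u32(shuffled));

    // lower_half[i] + 10 * high[i]  ->  [12, 34]
    let mult_acc = vmlal_high_n_u32(lower_half, shuffled, 10);

    let mut out = [0u64; 2];
    vst1q_u64(out.as_mut_ptr(), mult_acc);
    out
}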

Shuffle games

There is a lot of shuffling that has to happen to make this work. Every operation requires that we re-shuffle numbers into the correct position, and each multiply-accumulate more or less shuffles the high and low half together.

Let’s instead do just one shuffle at the beginning so everything ends up in the right spot. Each multiply-accumulate step:

  • Is invariant to patterns that are the same on the lower and upper half
  • Rearranges the vector so that elements n and n + width/2 are next to each other
  • Doubles the element width and halves the vector length

Since the operations on wider vectors treat the two halves symmetrically, we can start with the final step and work backward.

We want to end up with [d1d2d3d4d5....d16]. We know that the input to this must then be [d1d2...d8, d9d10...d16] since multiply-accumulate on a two-element vector will concatenate the two halves.

Repeating on the new first element [d1d2..d8], we know that it will be d1d2d3d4 and d5d6d7d8 concatenated. The input must look like [d1d2d3d4, _, d5d6d7d8, _].

As above, d1d2d3d4 will come from concatenating d1d2 and d3d4 - meaning our input to this layer of the operations is [d1d2, _, _, _, d3d4, _, _, _].

Finally, since d1d2 must come from d1 and d2, the input looks like [d1, _, _, _, d3, _, _, _, d2, _, _, _, d4, _, _, _]. The same logic fills the remaining indices.

For example:

// The initial number is 1234567887654321
initial = [1, 2, 3, 4, 5, 6, 7, 8, 8, 7, 6, 5, 4, 3, 2, 1]

// Shuffle the number to the starting point
shuffled = [
    1, 8, 5, 4, 3, 6, 7, 2,
    2, 7, 6, 3, 4, 5, 8, 1
];

// Multiply the lower half by 10 and sum with the upper half
mul_10 = [
    12, 87, 56, 43,
    34, 65, 78, 21
];

// Multiply the lower half by 100 and sum with the upper half
mul_100 = [
    1234, 8765,
    5678, 4321,
];

// Multiply the lower half by 10000 and sum with the upper half
mul_10000 = [
    12345678,
    87654321,
];

// Multiply the lower half by 100000000 and sum with the upper half
finished = 1234567887654321;
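
Putting the pieces together, here’s a minimal end-to-end sketch of the pipeline (my own reconstruction, using the example’s "multiply the lower half" convention, a shuffle index table I derived from it, and digits already converted from ASCII; the function name is hypothetical):

use std::arch::aarch64::*;

// Parse 16 digit values (0-9) into a u64
#[target_feature(enable = "neon")]
unsafe fn parse_16_digits(digits: &[u8; 16]) -> u64 {
    // One up-front shuffle: d1 -> lane 0, d2 -> lane 8, d3 -> lane 4, d4 -> lane 12, ...
    const IDX: [u8; 16] = [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15];

    let initial = vld1q_u8(digits.as_ptr());
    let shuffled = vqtbl1q_u8(initial, vld1q_u8(IDX.as_ptr()));

    // 16 x u8 -> 8 x u16: lower[i] * 10 + upper[i]
    let acc_16 = vmlal_u8(
        vmovl_u8(vget_high_u8(shuffled)),
        vget_low_u8(shuffled),
        vdup_n_u8(10),
    );

    // 8 x u16 -> 4 x u32: lower[i] * 100 + upper[i]
    let acc_32 = vmlal_n_u16(vmovl_u16(vget_high_u16(acc_16)), vget_low_u16(acc_16), 100);

    // 4 x u32 -> 2 x u64: lower[i] * 10_000 + upper[i]
    let acc_64 = vmlal_n_u32(vmovl_u32(vget_high_u32(acc_32)), vget_low_u32(acc_32), 10_000);

    // Final combine: lower * 100_000_000 + upper
    vgetq_lane_u64::<0>(acc_64) * 100_000_000 + vgetq_lane_u64::<1>(acc_64)
}

Note that this sketch multiplies the lower half at each step, to match the example; as the next section discusses, the actual implementation flips which half gets multiplied.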

High and low vector sets

Looking at some code from the actual implementation,

let lower = vmovl_u16(vget_low_u16(acc_16));
let acc_32 = vmlal_high_n_u16(lower, acc_16, 1_00);

An astute reader will notice that, unlike the example, we’re multiplying the upper half and accumulating into the lower half. Why? The benchmarks are significantly better this way, but the generated code offers no satisfying explanation.
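
For comparison, the example-style formulation (multiply the lower half, accumulate into the widened upper half) would look roughly like this; this is my reconstruction, and presumably the shape behind the second assembly listing below:

let upper = vmovl_u16(vget_high_u16(acc_16));
let acc_32 = vmlal_n_u16(upper, vget_low_u16(acc_16), 1_00);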

The generated code for a given multiply-accumulate is quite similar across both versions:

-- Multiplies the high elements and accumulates into the low
do_parse(__Uint8x16_t):
        mov     w8, #57600
        movk    w8, #1525, lsl #16
        ushll   v1.2d, v0.2s, #0
        dup     v2.4s, w8
        umlal2  v1.2d, v0.4s, v2.4s
        fmov    x0, d1
        ret

-- Multiplies the low elements and accumulates into the high
do_parse_highlow(__Uint8x16_t):
        mov     w8, #57600
        movk    w8, #1525, lsl #16
        dup     v1.2s, w8
        umull   v1.2d, v0.2s, v1.2s
        uaddw2  v0.2d, v1.2d, v0.4s
        fmov    x0, d0
        ret

LLVM makes some pretty different decisions about the operations to do here (interestingly, GCC generates almost identical code). The first call has an almost 1-1 correspondence with the intrinsics we used.

For the second, LLVM decides to split the multiply and addition into two separate operations - totally different from what we wrote! Is this faster, or is it a performance bug in the optimiser?

Benchmarks

Comparing the performance of parsing a single number against rust_decimal, you get:

| Test | NS/number | NS/digit | Simd relative improvement | Integer simd relative improvement |
|---|---|---|---|---|
| Simd parser | 2.9 | N/A | N/A | N/A |
| Simd integer parser | 2 | N/A | N/A | N/A |
| Two digit Decimal | 2.06 | 0.71 | 0.8 | 1 |
| Four digits | 3.1 | 1.83 | 1.07 | 1.57 |
| Six digits | 3.72 | 1.5 | 1.29 | 1.88 |
| Eight digits | 4.42 | 0.55 | 1.53 | 2.23 |
| 12 digits | 6.79 | 0.56 | 2.35 | 3.43 |
| 16 digits | 9.42 | 0.59 | 3.27 | 4.76 |

And in graph format with just the relative performance: (figure: dec_speedup)

For the batching results,

| Batch size | Decimal NS | Integer NS | Decimal NS/number | Integer NS/number |
|---|---|---|---|---|
| 1 | 2.9 | 2.0 | 2.9 | 2.0 |
| 2 | 4.2 | 2.9 | 2.1 | 1.45 |
| 4 | 10.4 | 6.3 | 2.6 | 1.6 |
| 6 | 13.4 | 8.75 | 2.23 | 1.46 |
| 8 | 19.84 | 13.9 | 2.5 | 1.74 |
| 12 | 33 | 25.9 | 2.75 | 2.16 |
| 16 | 43.7 | 33.6 | 2.73 | 2.10 |

And a relative performance graph: (figure: parse_speedup)

At least on my laptop, the performance improvements could be better. There are a lot of confounding factors at play here:

  1. Different architectures with different instruction sets. There’s a lot more ceremony around doing the multiply-accumulates on NEON than on x86, and this probably contributes to the mediocre batching results.
  2. I’m testing on an M1 laptop vs. an x86 workstation. The M1 appears significantly better at the scalar loops than my workstation.
  3. Microbenchmarks like this can really hit random microarchitecture edges, for better or worse. I’m calling small functions on repeated data in very tight loops, so the branch predictors and caches are perfectly trained.

The next step is to test this parser in real workloads that aren’t predictable and repeated. Tests like that accomplish two things: more realistic results for the parser itself, and better data on how it interacts with surrounding code.