Did you just nerd-snipe yourself into porting this to NEON? No. That turned out to be a lie. Despite knowing almost nothing about NEON, I couldn’t resist and blew my weekend trying to parse ARM documentation. Here’s a NEON followup to Parsing numbers into base-10 decimals with SIMD
The full code can be found here.
Speedup over rust_decimal’s parser for varying number lengths:
Speedup by parsing in batches:
Much of the parser is a one-to-one port. All of the basic arithmetic operators, comparison operators, and shuffle instructions appear in both SSE and NEON.
The two differences are the lack of a
and no match for the horizontal
I would love to get suggestions on how to improve this code! I’ve never written anything meaningful with NEON, so I may be some useful instructions.
Replicating movemask in NEON
This article describes an effective way to replicate movemask in NEON, as long as the input vector lanes are either all zeros or all ones.
Essentially, you can:
- Shift a vector of 16-bit elements and narrow it to a vector of 8-bit elements
- Use this to put the bottom 4 bits of the upper byte and the top 4 bits of the lower byte in the lower byte of each 16-bit element
- Move that value into a u64
- Find the first/last set bit, and divide by 4
Giving the code
// Set lanes to all ones if they're equal to dot, otherwise all zeros let is_eq_dot = vceqq_u8(number, dot); // Cast to 16-bit elements (purely at type level) let is_eq_fot_16 = vreinterpretq_u16_u8(is_eq_dot); // Shift and narrow elements let is_eq_dot_4x_vec_mask = vshrn_n_u16(is_eq_fot_16, 4); // Extract the bottom 64 bits let exploded_dot_mask = vget_lane_u64(vreinterpret_u64_u8(is_eq_dot_4x_vec_mask), 0); // Find first set bit and adjust let dot_idx = exploded_dot_mask.trailing_zeros() / 4;
Replicating the horizontal adds
You can look at the SIMDe port
_mm_madd_epi16 and see that it involves splitting the vector up, doing a few multiplies, and concatenating the two halves.
The direct port looks pretty slow, and since half of our multiplies are by one, it feels wasteful to:
- Do a lot of work to split a vector in half
- Multiply half the numbers by 1, and half by 10
- Recombine the vectors
Looking through the various NEON multiply-accumulate intrinsic, I found a family that sticks out. This family of instructions lets you:
- Multiply either the high half or low half of an integer vector by some scalar N
- Accumulate those results into an integer vector of double width
So, for example, I could write (in pseudocode):
// Start with a vector of 4 u32s let vector: [u32; 4] = [1, 2, 3, 4]; // Take the bottom two elements and widen into a vector with 2 u64s let lower_half: [u64; 2] = widen(vector[0..2]); // Multiply high elements of vector by two, and accumulate into 64-bit elements of lower_half let mult_acc: [u64; 2] = vmlal_high_n_u32(lower_half, vector, 2); assert_eq!(mult_acc, [1 + 2 * 3, 2 + 2 * 4])
With a bit of shuffling, this exactly replicates our useage of
_mm_madd_epi16 and friends!
// Start with a vector of 4 u32 let vector: [u32; 4] = [1, 2, 3, 4]; // Shuffle so that the elements we want to multiply are in the higher half // Since we're multiplying with a pattern [10, 1, 10, 1], // We put even digits in the high half and odd ones in the lower half let shuffled = shuffle(vector, shuffle_mask); assert_eq!(shuffled, [2, 4, 1, 3]); let lower_half: [u64; 2] = widen(vector[0..2]); assert_eq!(lower_half, [2, 4]); let mult_acc: [u64; 2] = vmlal_high_n_u32(lower_half, shuffled, 10); assert_eq!(mult_acc, [12, 34])
There is a lot of shuffling that has to happen to make this work. Every operation requires that we re-shuffle numbers into the correct position, and each multiply-accumulate more or less shuffles the high and low half together.
Let’s instead do just one shuffle at the beginning so everything ends up in the right spot. Each multiply-accumulate shuffle
- Is invariant to patterns that are the same on the lower and upper half
- Rearranges the vector so that elements
n + width/2are next to each other
- Doubles the element width and halves the vector length
Since the operations on wider vectors will respect symmetric operations across the halfs, we can start with the final steps and work backward.
We want to end up with
[d1d2d3d4d5....d16]. We know that the input to this must then be
since multiply-accumulate on a two-element vector will concatenate the two halves.
Repeating on the new first element
[d1d2..d8], we know that it will be
The input must look like
[d1d2d3d4, _, d5d6d7d8, _].
d1d2d3d4 will come from concatenating
d3d4 - meaning our input to this layer of the operations is
[d1d2, _, _, _, d3d4, _, _, _, _].
d1d2 must come from
d2, the input looks like
[d1, _, _, _, _, d3, _, _, _, d2, _, _, _, d4, _, _, _].
The same logic fills the remaining indices.
// The initial number is 1234567887654321 initial = [1, 2, 3, 4, 5, 6, 7, 8, 8, 7, 6, 5, 4, 3, 2, 1] // Shuffle the number to the starting point shuffled = [ 1, 8, 5, 4, 3, 6, 7, 2, 2, 7, 6, 3, 4, 5, 8, 1 ]; // Multiply the lower half by 10 and sum with the upper half mul_10 = [ 12, 87, 56, 43, 34, 65, 78, 21 ]; // Multiply the lower half by 100 and sum with the upper half mul_100 = [ 1234, 8765, 5678, 4321, ]; // Multiply the lower half by 10000 and sum with the upper half mul_10000 = [ 12345678, 87654321, ]; // Multiply the lower half by 100000000 and sum with the upper half finished = 1234567887654321;
High and low vector sets
Looking at some code from the actual implementation,
let lower = vmovl_u16(vget_low_u16(acc_16)); let acc_32 = vmlal_high_n_u16(lower, acc_16, 1_00);
An astute reader would notice that, unlike the example, we’re multiplying the upper half and accumulating into the lower half. Why is that? The benchmarks are significantly better, but there’s no satisfactory explanation from the generated code.
The generated code for a given multiply-accumulate is quite similar across both version:
-- Multiplies the high elements and accumulates into the low do_parse(__Uint8x16_t): mov w8, #57600 movk w8, #1525, lsl #16 ushll v1.2d, v0.2s, #0 dup v2.4s, w8 umlal2 v1.2d, v0.4s, v2.4s fmov x0, d1 ret -- Multiples the low low elements and accumulates into the high do_parse_highlow(__Uint8x16_t): mov w8, #57600 movk w8, #1525, lsl #16 dup v1.2s, w8 umull v1.2d, v0.2s, v1.2s uaddw2 v0.2d, v1.2d, v0.4s fmov x0, d0 ret
LLVM makes some pretty different decisions about the operations to do here (interestingly, GCC generates almost identical code). The first call has an almost 1-1 correspondence with the intrinsics we used.
For the second, LLVM decides to split out the multiply and addition into two operations - totally different that what we wrote! Is this faster, or is it a performance bug in the optimiser?
Comparing the performance of parsing a single number to rust_decimal you get:
|Test||NS/number||NS/digit||Simd relative improvement||Integer simd relative improvement|
|Simd integer parser||2||N/A||N/A||N/A|
|Two digit Decimal||2.06||0.71||0.8||1|
And in graph format with just the relative performance:
For the batching results,
|Batch size||Decimal NS||Integer NS||Decimal NS/number||Integer NS/number|
And a relative performance graph:
At least on my laptop, the performance improvements could be better. There are a lot of confounding factors at play here:
- Different architectures with different instruction sets. There’s a lot more ceremony around doing the multiply-accumulates on NEON than on x86, and this probably contributes to the mediocre batching results
- I’m testing on an M1 laptop vs. an X86 workstation. The M1 appears significantly better at the scalar loops than my workstation.
- Microbenchmarks like this can really hit random microarchitecture edges, for better or worse. I’m calling small functions in repeated data in very tight loops. The branch predictors and caches are perfectly trained
The following steps need to be testing this parser in real workloads that aren’t predictable and repeated. These legitimate tests accomplish two things- they get more realistic results for the parser itself and better data on how this interacts with surrounding code.