Parsing numbers into base-10 decimals with SIMD

When parsing number-dense JSON, much of the overhead is parsing numbers instead of parsing the JSON itself. For the typical case, most of the overhead within number parsing is converting into a form mantissa * 10^-exponent, not transforming into a proper floating point value.

Looking through high-quality number parsers such as rust_decimal’s or simd_json’s, they tend to avoid SIMD or only use it after ensuring there’s a long enough series of decimal characters. They also don’t support parsing in batches and concern themselves with problems like

How do I avoid reading off the end of the number?
How can I give precise error messages?
How do I handle extremely long decimals?
How do I handle separators?

In realistic parses, most of the above never apply! Invalid data is rare, long numbers are rare, weird separators are rare, and most numbers aren’t at the end of a buffer (with padding, none are). We can relax the above constraints and get something friendlier to a SIMD implementation. First, we assume:

All inputs have an extra 16 bytes of data off the end
Precise error reporting is not required
Inputs are not longer than 16 bytes
Inputs only contain digits and potentially 1 ‘.’
The sign is already processed (it’s just a simple check and pointer adjustment)
Optionally, we can parse many numbers at once

While these restrictions appear limiting, they’ll handle almost all actual cases, and a fallback can take the rest. With these considerations, we can go about writing a vectorized parser.

The full code can be found here.

Benchmark Previews

Since everyone’s favorite section is the benchmarks at the end, here’s a preview of what’s to come!

Speedup over rust_decimal’s parser for varying number lengths: dec_speedup

Relative performance per number when parsing in batches: parse_speedup

Our lord and savior pshufb

Before going into code, let’s take a brief tour to talk about the workhorse of the parser, the pshufb instruction. pshufb is an instruction that allows you to rearrange bytes in a vector by using it as a lookup table, with a second vector passed as indices. For a pseudocode example,

let input = [1, 2, 3, 4];
let control = [2, 3, 2, 3];

assert_eq!(
    [input[2], input[3], input[2], input[3]],
    pshufb(input, control)
);

Finally, if the uppermost bit of an input is set, the instruction fills that element with a zero:

// 255 is all bits set to one, so pshufb returns a zero
let input = [1, 2, 3, 4];
let control = [2, 3, 255, 3];

assert_eq!(
    [input[2], input[3], 0, input[3]],
    pshufb(input, control)
);

Pshufb lets us implement all sorts of shifts, semi-shifts, and masks with one instruction!

Converting to a 16-digit number

The first step is normalizing all inputs to a 16-digit number with leading zeros to fill the space. We can do this in two easy steps - subtract out the ASCII ‘0’, and shift so that the number is 16 digits padded with leading zeros:

// Given a pointer to the start of our integer, load into a vector register
// Since we've assumed there are 16 bytes of data after this pointer,
// this is ok to do
let string_as_vector = _mm_loadu_si128(
    input_bytes.as_ptr() as *const __m128i
);

// Leaving out some intrinsics for brevity - subtract out '0'
// so '0'..'9' are in the range 0..9
let adjusted_digits = string_as_vector - '0';

// There's no variable-byte vector shift that takes a dynamic shift count
// as far as I know.
// One would significantly improve performance
// as it would pull a lookup table out of the fast path

// Use a shuffle to implement a shift-by-variable-bytes
// _mm_shuffle_epi8 calls pshufb
// Each mask shuffles digits to the end, filling the front with zeros
// for example, given a number with three digits, we'd see:
// input = [4, 1, 2, _]
// mask = [255, 0, 1, 2]
// pshufb (input, mask) == [0, 4, 1, 2]
// It's important that we're in the 0..9 range instead of '0'..'9'
// since shuffle only lets us fill the front with zeros
let shifted_digits = _mm_shuffle_epi8(
    adjusted_digits,
    GLOBAL_SHUFFLE_MASKS[input_bytes.len()]
);

Handling the Decimal point

A number with a decimal point will now look something like [0, 0, ..., 0, 1, 2, '.' - '0', 1, 2]. The goal now is to remove this point (if it exists) and record the location (which gives the exponent) ¹.

It’s straightforward to discover the possible location of a decimal point:

// Mark points that are equal to the point
let is_eq_point = _mm_cmpeq_epi8('.' - '0', shifted_digits);

// Convert this mask into an integer bitmask
// Add an implicit dot at the end
// It would be great if x86 had instruction to query the first set byte
let is_eq_point_bits = _mm_movemask_(is_eq_point) | 0xffff_0000;

// The first set bit is the index
let point_idx = is_eq_point_bits.trailing_zeros();

As with the length shift, we then use this to select a shuffle mask that shifts everything ahead of the dot down and fills the front with zeros.

let compressed_number = _mm_shuffle_epi8(
    shifted_digits,
    POINT_SHUFFLE_MASKS[point_idx]
);

We calculate the exponent from a lookup table. The difficulty comes from having to handle the case of a trailing point and no point. Those will give point indices 15 and 16, different point shuffle masks, but exponents of zero.

15.saturating_sub(point_idx) accurately calculates the exponent, but that turns out to be more expensive than just doing a lookup table, especially when computing many at once. We already have a nontrivial amount of integer arithmetic and relatively few loads, so shifting work from integer ports to load ports gives more parallelism.

let exponent = EXPONENT_TABLE[point_idx];

Validating the result

In a valid decimal, everything left at this point is bytes in the range 0..9. We do a simple check of the digits vector to see if there are any invalid characters and if so, exit early.

// There's no unsigned integer comparison for x86 vectors, so we can:
// 1. Do two signed comparisons.
// 2. Alternatively, there's an unsigned integer max operation.
//   a. We take the max of 9 with the vector
//   b. Search for anything not equal to 9
//
// 2 uses less instructions, and the dependency chain is the same length

let max_of_nine = _mm_max_epu8(compressed_number, 9);

// If all digits are valid, this gives zeros.
// subtraction has higher throughput than equality comparison
// so we do this instead ofn the obvious cmpeq
let is_eq_nine = _mm_sub_epi8(max_of_9, 9);

// Test for any zeros, and return if so
if _mm_test_all_zeros(is_eq_nine) != 1 {
    return false;
}

Here, we don’t perform any meaningful error discovery, just exit. Since errors are usually sporadic, we can pass the input through a more fully featured parser later to get better error reporting

Computing the integer component

We now have a 16-digit valid integer and the exponent - all that’s left is to convert this integer into a u64

We do this mostly with a series of horizontal multiply-adds. There’s a class of integer instructions that multiply adjacent entries and sum into a larger integer size. For example,

let a: [u16; 4] = [10, 1, 10, 1];
let b: [u16; 4] = [1, 2, 3, 4];

let c: [u32; 2] = multiply_add(a, b);

// [10*1 + 1*2, 10*3 + 1*4]
assert_eq!(c, [12, 34])

We use these multiply-adds to repeatedly multiply adjacent pairs by a multiple of ten and sum to get the mantissa ²:

let acc = compressed_number;
// Take pairs of u8s (digits) and multiply the more significant one by 10,
// and accumulate into pairwise u16
let mul_1_10 = _mm_setr_epi8(
    10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1
);
let acc = _mm_maddubs_epi16(acc, mul_1_10);

// Take pairs of u16s (not digits, but two digits each)
// multiply the more significant by 100 and add to get pairwise u32
let mul_1_100 = _mm_setr_epi16(100, 1, 100, 1, 100, 1, 100, 1);
let acc = _mm_madd_epi16(acc, mul_1_100);

// We now have pairwise u32s
// but there are no methods to multiply and horizontally add
// them. Doing it outright is *very* slow.
// We know that nothing yet can be larger than 2^16, so we pack the u16s
// into the first and second half of the vector
// Each vector half will now be identical.

let acc = _mm_packs_epi32(acc, acc);

// Multiply the top half by 10000 and the bottom half by 1
let mul_1_10000 = _mm_setr_epi16(10000, 1, 10000, 1, 10000, 1, 10000, 1);
let acc = _mm_madd_epi16(acc, mul_1_10000);

Letting the first two 32-bit elements of the vector be [a, b], our integer is equal to a*1e8 + b. To finish, we take the bottom 64 bits out from the vector and do the transformation:

    let as_u64 =  _mm_cvtsi128_si64(acc) as u64;

    let small_bottom = as_u64 >> 32;
    let large_half = as_u64 as u32 as u64;

    let mantissa = 100000000 * large_half + small_bottom;

And we’re done!

Initial benchmarks

We’re going to take the rust_decimal string parser as a decent representation of fast string parsers (the simd_json number parser is similar as well).

For the microbenchmark setup, I parse a number of varying length with both the SIMD parser and rust_decimal. I’m doing so with a forked rust_decimal that removes some of the postprocessing and just computes mantissa + exponent. I also compare using the SIMD parser in a mode where it assumes the input is an integer. We get the results (SIMD parser is constant time so we only have one row):

Test	NS/number	NS/digit	Simd relative improvement	Integer simd relative improvement
Simd parser	3.4	N/A	N/A	N/A
Simd integer parser	1.8	N/A	N/A	N/A
Two digit Decimal	2.7	1.35	0.8	1.5
Four digits	7.3	1.83	2.14	4.05
Six digits	9.1	1.5	2.67	5.05
Eight digits	9.5	1.18	2.79	5.27
12 digits	10.9	0.9	3.2	6.05
16 digits	15	0.93	4.4	8.3

And in graph format with just the relative performance: dec_speedup

Those are pretty good results! It’s not surprising that parsing in a vectorized manner is much faster, especially on the longest numbers. There’s still more we can do to improve the performance of the vectorized parser

Batching

Computers are really good at doing things simultaneously. We can see that already with the vectorized version. Further, CPUs are also able to issue multiple independent instructions at once.

In the vectorized parser, there’s little parallel work. Preparing and computing the mantissa is the most work; here, most instructions depend on the previous one.

Now you might be thinking “But wait, vgatherps! The decoder can run ahead of execution and find more independent instructions.” and that’s absolutely correct.

Let’s consider what the decode timeline of running one parse after another looks like:

serial

Eventually, the CPU will finish decoding and issuing all the instructions from number A and start issuing instructions from parse B, which execute in parallel with anything remaining from A. However, the decoder must get through parse A to start issuing instructions from B. In actual code, the reorder buffers might already be full, preventing the decoder from running ahead in the first place!

If we were to interleave the parsing of A and B, we’d have a timeline resembling

parallel

There’s immediately realizable parallelism!

It’s pretty straightforward to change the code to do this, and I’ll direct those curious to the actual function being benchmarked

Comparing the relative performance per-number, we get

Batch size	Decimal NS	Integer NS	Decimal NS/number	Integer NS/number
1	3.4	1.8	3.4	1.8
2	5.1	3.3	2.55	1.65
4	8.5	5.4	2.13	1.35
6	12	7.7	2	1.28
8	15.7	9.8	1.96	1.22
12	24.3	15.5	2.02	1.29
16	38.2	26.1	2.39	1.63

And a relative performance graph: parse_speedup

Initially, up to 8 interleaved numbers, we get significant and steady performance improvements. Sadly, more interleaving hurts us. LLVM starts barfing on the given inputs and spilling to the stack since there are too many live variables.

It’s somewhat unfortunate, as the code is almost entirely transformable to a non-parallel version that maintains fewer live variables. I suspect that optimizing strangely wide and shallow execution graphs is a rare use case and hasn’t seen much attention.

What’s next?

The elephant in the room is figuring out if this optimized parser is faster in real-world settings. It’s one thing to be faster in a very tight microbenchmark loop; replicating performance improvements in production is more challenging. Microbenchmarks can exacerbate coincidental interactions with microarchitecture that wouldn’t be relevant in the real world. Especially in parsing, there’s lots of complex, branchy, unpredictable surrounding code that complicates an otherwise clean sequence of instructions.

I’m experimenting with adding this to the rust port of simd-json, and adding unsafe fast-path parsers to rust_decimal. Hopefully we’ll see the desired real-world gains!

^{1. It’s possible to hoist the decimal discovery to happen before shifting to the end. In theory, this would introduce more ILP as a few more codepaths could execute in parallel with the length shift. In practice, this had a neutral to negative impact on benchmarks!↩}

^{This is largely taken from http://0x80.pl/articles/simd-parsing-int-sequences.html#core-i7-results↩}