Even the best branch predictor is only as good as the input it gets. Naively spinning on a signal variable, a common technique in lock-free programming, leads the branch predictor to pessimize your software.
Why is this? Usually, the pattern looks like:
// load_signal_for_new_data usually reports nothing new if the reader is caught up,
// so the branch predictor predicts that we re-enter this loop,
// never issuing the load of your data while spamming load requests for the signal
while (load_signal_for_new_data() != has_new_data) {}
load_new_data();
//...
Since another core is presumably writing to both the signal and the data, we expect each of them to miss the local caches when there is actually new data, but to stay cached until then.
Combine that with a branch predictor charging full steam ahead through as many iterations of the new-data check as it can, and we have a cache problem: the CPU never speculatively loads the new data in parallel with the signal, so we suffer the two cache misses in serial.
A benchmark for the problem
The classic benchmark for measuring communication delay is a ping-pong style program. One thread sends a message to another, and that thread sends a message back using the same method. The exchange is timed to measure the round-trip time of sending a message.
I want to test the smallest replication of this:
- Thread 1 writes some value to a shared data structure
- Thread 1 signals there is new data
- Thread 2 polls for data from thread 1 until there is something new
- Thread 2 reads the new data from thread 1
- Thread 2 writes the data it got from thread 1 into another shared data structure
- Thread 2 signals there is new data
- Thread 1 polls for data from thread 2 until there is something new
- Thread 1 reads the new data from thread 2
One can see why this is called ping-pong - the threads are just bouncing a message back and forth. This process is looped over and timed to measure the latency of sending a ‘message’.
On a per-send basis, we expect to wait for 4 cache misses:
- Thread 2 sees there is new data from thread 1
- Thread 2 reads data from thread 1
- Thread 1 sees there is new data from thread 2
- Thread 1 reads data from thread 2
We expect to get a cache miss on both seeing new data and reading the data itself since the branch predictor won’t predict we get new data.
Here’s the pseudocode of what that might look like:
void pong(struct data *in, struct data *out, int runs) {
    // Serves to create a data dependency between read and write
    int read_value = 0;
    for (int i = 0; i < runs; i++) {
        uint64_t start = get_cycle_counter();
        store_data(out, read_value);
        set_out_has_new(out);
        while (!check_for_new_data(in)) {}
        read_value = load_new_data(in);
        uint64_t end = get_cycle_counter();
        // Timestamp logging code
    }
}

void ping(struct data *in, struct data *out, int runs) {
    // Serves to create a data dependency between read and write
    int read_value = 0;
    for (int i = 0; i < runs; i++) {
        while (!check_for_new_data(in)) {}
        read_value = load_new_data(in);
        store_data(out, read_value);
        set_out_has_new(out);
    }
}

int main() {
    // Flags start cleared so the first poll doesn't see stale data
    struct data pong_in = {0}, ping_in = {0};
    // pong reads from pong_in and writes to ping_in; ping does the reverse
    start_thread(pong, &pong_in, &ping_in);
    start_thread(ping, &ping_in, &pong_in);
}
You can find the real code here
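For concreteness, here is one way the helpers above could be filled in with C11 atomics. This is only a sketch under assumed names - the has_new flag and payload field are hypothetical, and the real code may differ in layout and memory ordering.

#include <stdatomic.h>

// Hypothetical layout: the flag and the payload sit on separate cache
// lines, matching the two-miss behaviour described earlier
struct data {
    _Alignas(64) atomic_int has_new;
    _Alignas(64) int payload;
};

static inline void store_data(struct data *d, int value) {
    d->payload = value;
}

static inline void set_out_has_new(struct data *d) {
    // Release: the payload write becomes visible before the flag flips
    atomic_store_explicit(&d->has_new, 1, memory_order_release);
}

static inline int check_for_new_data(struct data *d) {
    // A plain acquire load, so the spin loop itself only issues loads
    return atomic_load_explicit(&d->has_new, memory_order_acquire);
}

static inline int load_new_data(struct data *d) {
    int value = d->payload;
    // Re-arm the flag; safe because ping-pong strictly alternates
    atomic_store_explicit(&d->has_new, 0, memory_order_relaxed);
    return value;
}

Because check_for_new_data is a pure load, the spin loop issues no stores - only the stream of speculative loads that the prefetch and lfence tweaks below try to tame.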
Prefetching
One immediate way to solve our cache problem is to issue a prefetch whenever we look for new data.
while (!check_for_new_data(in)) {
    prefetch_data(in);
}
This won’t solve the misprediction penalty when we succeed, but it will ensure that the two cache misses on finding a message happen in parallel.
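The prefetch_data helper isn't pinned down by the pseudocode; assuming the hypothetical struct layout sketched earlier, one plausible definition uses the GCC/Clang __builtin_prefetch builtin:

static inline void prefetch_data(const struct data *d) {
    // Ask for the payload's cache line as a read, with high temporal
    // locality, without stalling the pipeline on the result
    __builtin_prefetch(&d->payload, 0 /* read */, 3 /* high locality */);
}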
Reducing load spam
Right now, our spin loop has the CPU issuing as many loads as it can, many of them speculative thanks to the branch predictor. This spam can slow down the cache subsystem when the writing core tries to acquire the cache line for writing.
The pause instruction was intended to solve this problem - it informs the CPU that this is a spin-wait loop and yields more time to a hyperthread. However, pause has a high latency on some Intel models, which makes it unreliable for use in a latency-sensitive setting.
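For comparison, the pause-based version of the spin would just drop the hint into the loop body:

while (!check_for_new_data(in)) {
    _mm_pause(); // spin-wait hint; _mm_pause comes from <immintrin.h>
}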
There's another instruction which fits the bill - lfence! Lfence specifically prevents later instructions from being issued until previous instructions have completed locally[1], which leaves us with only one in-flight load (plus prefetch) of the signal at a time.
While the original intent of this instruction is to ensure ordering on non-temporal loads, it works well for our instruction-limiting purposes.
Similarly to adding a prefetch, the change is just:
while (!check_for_new_data(in)) {
    lfence();
}
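If you're wondering what lfence() expands to: on x86 with GCC or Clang it can simply wrap the corresponding intrinsic - a sketch, the real code may spell it differently:

#include <immintrin.h>

// Emits a single LFENCE: later instructions don't issue until all
// prior instructions have completed locally
static inline void lfence(void) {
    _mm_lfence();
}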
Benchmarks
Since everybody loves benchmarks, here’s a preview of the final results (cycles per ping-pong):
|            | No prefetch | Prefetch |
|------------|-------------|----------|
| No lfence  | 430         | 345      |
| Lfence     | 390         | 315      |
In short, both prefetch and lfence are effective in achieving their stated goals. Issuing an lfence specifically reduces the number of loads attempted by the ping process per round from about 650 to about 40. However, while prefetch helped the most, it didn’t seem to cut the time as much as one would expect from halving effective cache misses.
Currently, I’m in a coronavirus safe haven and don’t have access to my Linux machine with more involved performance monitoring tools. When I get home, I’ll follow this up with a more specific investigation of how these changes affect performance events.
1. For stores, this specifically means that the store has been committed to the store buffer, not that it is visible to other cores yet. There are some other instructions (prefetch, clflush, sfence, probably more) that are not ordered with regard to lfence.