-
Notifications
You must be signed in to change notification settings - Fork 570
Performance of FASTER in C#
The goal of this article is to provide details on the performance of FASTER in C#. For lack of a significantly better alternative in C#, we use ConcurrentDictionary, the standard in-memory .NET hash data structure, as a rough baseline to showcase the performance.
It is important to clarify at the outset that comparing FASTER to ConcurrentDictionary is not entirely fair to either system, because of the different target applications. FASTER indexes data larger than memory, is backed by a page-allocated (hybrid) log with no small-object allocations, does not support in-place deletes (instead, uses expiry-based garbage collection), uses latch-free primitives, and is specially optimized for large datasets. While FASTER supports arbitrary data types, it performs its best when dealing with blittable value types. On the other hand, ConcurrentDictionary is a pure in-memory data structure focused on ease-of-use, read performance, and handling arbitrary objects. There are more advanced hash tables proposed in research as well, but they are usually available in C++. We refer readers to our SIGMOD paper for related work and more detailed comparisons to data structures (in C++) that may be better baselines, such as the Intel TBB concurrent hash map (pure in memory), Masstree (in memory), and RocksDB (larger than memory). That said, this post will help readers better understand the upper bounds of performance that you may achieve with C# FASTER, when your dataset (or working set) fits entirely in main memory, consists of value types, and has updates.
We use the project /cs/benchmark/FASTER.benchmark.csproj
as the driver for experiments. We experiments with a large dataset (250 million keys) and a small dataset (2.5 million keys) that fits in L3 cache or above. The dataset size is selected using the constant kUseSmallData
.
This benchmark deals only with value types. We define two struct value types: Key
and Value
, each storing 8 bytes of data. We first pre-load all keys into the system, and then, over a 30 second period, we issue a uniform random workload mix of reads and blind updates. The small keys and values serve to stress the data structure rather than memory bandwidth. The uniform random workload incurs little contention on individual record access. See here for evaluations with larger values and/or a Zipf distribution, which show similar trends as the results in this article.
With FASTER, we size the hybrid log such that the entire dataset fits in the mutable region of the log; thus, storage is not involved in this experiment. The benchmark project uses the FASTER implementation directly, instead of going through code-gen which would produce identical code.
With ConcurrentDictionary, we define a custom key equality comparer (to avoid the expensive default object comparer) and size the data structure with concurrency level equal to number of threads used in the experiment. We set <gcConcurrent enabled="true"/>
in App.config
, without which update-intensive runs lead to long pauses during GC. We found that ConcurrentDictionary is optimized for specific types such as Int32
and Int64
; we cover this case separately below.
For each dataset, we vary the percentage of reads as 0%, 50%, 90-99%, and 100%. For each read percentage setting, we vary the number of threads from 1 to 72 to test throughput scalability. We spray threads evenly across the two NUMA sockets on our test machine. The machine specs are shown in the image below:
Use constant kUseSmallData
in FASTER.benchmark
to choose large or small dataset. Use -r
parameter to set the percentage of reads. The -n
parameter sets threads to spray evenly across two NUMA sockets. The -b
parameter chooses to run either FASTER (0) or ConcurrentDictionary (1). The -t
parameter sets the number of threads for the experiment. For example:
// FASTER
FASTER.benchmark.exe -b 0 -t 1 -n 1 -r 0
FASTER.benchmark.exe -b 0 -t 2 -n 1 -r 0
...
// ConcurrentDictionary
FASTER.benchmark.exe -b 1 -t 1 -n 1 -r 0
FASTER.benchmark.exe -b 1 -t 2 -n 1 -r 0
...
We use the large dataset and measure the throughput of the two systems during the initial data loading phase. Results are shown below.
As expected, with FASTER, there is first a slight drop as we move from one to two threads (the two threads are pinned on different sockets), and then loading throughput increases steadily until it hits the contention of allocation on the tail of the hybrid log. FASTER then retains a fairly stable data load throughput with increasing number of cores. ConcurrentDictionary get around 500K ops/sec for this workload. The 10-50X performance boost with FASTER is attributed to the avoidance of fine grained object allocations and the latch-free paged design of the index and record log.
Same graph in log scale:
FASTER uses paged record allocation, new cache-friendly latch-free data structures, and in-place updates (in the mutable region of the log), which combine to provide over two orders of magnitude better performance, with linear scalability. Record-level concurrency in the mutable region is handed off to users in FASTER; for 8-byte values, our user code does not take a lock during a blind update. Note that with ConcurrentDictionary, when we use all cores, throughput drops slightly because additional cores that were previously being used for garbage collection are no longer available for free. FASTER does not incur GC (when you use value types), because of its paged design.
Same graph in log scale:
Interestingly, we found that with 90% reads, at 72 threads, the performance of FASTER was 139M ops/sec, while ConcurrentDictionary achieved 1.32M ops/sec. At 95% reads, FASTER gets 137M ops/sec, whereas ConcurrentDictionary achieved 3.62M ops/sec. This shows that with even a small fraction of updates, FASTER becomes a good option to consider if it fits your usage scenario. With a very high read fraction of 99%, ConcurrentDictionary gets 14.87M ops/sec, while FASTER achieves 139M ops/sec. As expected, with very few updates, ConcurrentDictionary starts doing better as it avoids the overheads and contention related to updates.
Same graph in log scale:
For a read-only workload, you don't get any benefit from FASTER, which is expected: there are no concurrency bottlenecks in such a workload, and there are no writes to memory. Both systems are limited by the number of read cache misses and give similar performance. Note that some public evaluations have reported that FASTER is around 4 times better for a read-only workload: we attribute that result to the use of the default key equality comparer in those experiments, which used a more expensive object-based hash code and equality comparison. For a read-only workload, both systems are limited by the number of read cache misses and give very similar performance as expected.
The figure below shows the data from above, for 72 threads, with the percentage of reads on the x-axis. When a non-trivial fraction of updates are involved in your workload, FASTER provides orders-of-magnitude better performance. With very high read percentages above 90%, the performance of the two systems begins to converge as expected.
As mentioned earlier, ConcurrentDictionary specially optimizes for the case that the value is one of a few special types, such as Int32
and Int64
. The previous experiments used a black-box Value
that does not meet these criteria. For completeness, we investigate this scenario by replacing both Key
and Value
with Int64
for ConcurrentDictionary. The results are shown below. We see that in this case, the update scalability of ConcurrentDictionary is much better than before, as it performs in-place updates of nodes in hash bucket chains, with no GC involvement.
For completeness, we re-run the experiments with the smaller dataset (2.5 million keys). Since the data fits in the higher levels of cache, performance is generally better for both systems. Otherwise, the trends are similar.
Article Update Log
9/14: Added info on using Int64 as value type with ConcurrentDictionary, text cleanup/re-org
9/13: Initial version of article