|
gpurhh
GPU Robin Hood Hashing — header-only CUDA library
|
-std=c++20 support in nvcc (CUDA 12.0+) and cuda::std namespaced operations in libcu++ (CUDA 13.0 was the first version where this combination worked cleanly on the test system).Within the implementation, we stick to C++17-shape idioms by default and only reach for C++20 features where they demonstrably improve the code. The C++20 uses today are:
std::bit_width in next_pow2 (host-side, compile-time).std::bit_cast for the empty_key_byte extraction and the memsettability static_assert lambda.cuda::std::bit_cast (libcu++) for the runtime bit-reinterpret in default_hash. The libcu++ version is mandatory here because std::bit_cast is host-only constexpr in libstdc++; calling it from a __device__ function is a hard error in CUDA 13.x.Concepts on the Hash template parameter, std::source_location in place of __FILE__ / __LINE__ macros, ranges, and coroutines are all deferred until they pay for themselves.
mod becomes a mask. The constructor takes min_capacity (in slots) and rounds up to the next power of two, never below bucket_size. Users targeting a specific load factor compute the slot count themselves (e.g. min_capacity = ceil(N / target_load_factor) for an expected N insertions).The "capacity is a power of two" constraint is not just a convenience — it's wired into multiple operations. Anyone touching these sites should know that breaking the invariant would require switching every & mask back to a % capacity, which could be slower on certain GPU (multi-cycle integer divmod vs. a single AND).
In include/gpurhh/hash_table.cuh:
HashTable constructor**: next_pow2(min_capacity) rounds the user-supplied capacity up; capacity_mask_ = capacity_ - 1 is cached for use across the hot paths. Capacity can grow by up to 2× over what the user asked for. The trade-off is a one-cycle mask vs. a multi-cycle modulus on every probe step.home_slot = hash(key) & capacity_mask_ (and home_bucket = home_slot / bucket_size). Replacing & with % would also require capacity_mask_ to be replaced by the raw capacity, doubling the mod cost on every probe.bucket_idx = (home_bucket + probe) & bucket_mask, where bucket_mask = num_buckets - 1. Since num_buckets = capacity / bucket_size and both are powers of two, the bucket count is also pow-2 and admits a mask.resident_probe = (bucket_idx - resident_home) & bucket_mask — this handles wrap-around correctly precisely because bucket_mask = num_buckets - 1.static_assert((tile_size & (tile_size - 1)) == 0, ...). Used by tiles_per_block = block_size / tile_size and by cg::tiled_partition<tile_size>, which itself requires a pow-2 tile size.In benchmarks (benchmarks/benchmarks.cuh):
map_to_range): folds cuRAND bits to [0, range) via x & (range - 1) when range is a power of two, falling back to x % range otherwise. The benchmark script always uses key_range = capacity, so the mask path is taken in practice.A subtler dependency lives in the default hash function. With a pow-2 mask, only the low bits of hash(key) reach capacity_mask_ — the high bits are thrown away. If a user supplied an identity hash on structured keys (e.g. dense integer IDs), low-bit clustering would map directly into slot clustering and probe lengths would explode. The default default_hash<uint32_t> is fmix32, a finalizer-style mixer that spreads bits well, so the low bits of hash(k) are pseudo-uniform even when keys are not. Users supplying their own Hash should preserve this property; passing a non-mixing hash with a pow-2 capacity is the textbook way to break a fast hash table.