gpurhh
GPU Robin Hood Hashing — header-only CUDA library
CacheLineBytes and WarpSize as table-level template parameters) don't preclude AMD support, but we haven't written or tested it, as this would require porting to HIP, which would be a substantial effort. The same algorithm with AMD's 64-byte cache line and 64-thread wavefront would give BucketSize = 8 and TilesPerWarp = 8.

cudaStream_t through the constructor and destructor. Today only clear() takes a stream; the constructor's cudaMalloc is synchronous and dominates construction cost, so threading a stream through the slot-init memset would buy nothing while leaving the constructor's overall behavior confusingly half-async. The destructor doesn't take a stream either; destruction is a host-side lifecycle event, and the caller is expected to synchronize relevant streams before letting the table go out of scope.
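
The AMD figures quoted above (BucketSize = 8, TilesPerWarp = 8) fall out of plain integer division if one assumes an 8-byte slot, one bucket per cache line, and one thread per slot within a tile. The sketch below checks that arithmetic at compile time. BucketGeometry and SlotBytes are hypothetical names introduced for illustration; they are not gpurhh's API, and the 8-byte slot size is an assumption rather than something stated here.

```cpp
#include <cstddef>

// Hypothetical helper, NOT part of gpurhh: derives bucket geometry from the
// two hardware parameters named in the text.
// Assumptions: a slot is SlotBytes wide (8 by default), a bucket fills exactly
// one cache line, and a tile has one thread per slot in a bucket.
template <std::size_t CacheLineBytes, std::size_t WarpSize, std::size_t SlotBytes = 8>
struct BucketGeometry {
    static_assert(CacheLineBytes % SlotBytes == 0,
                  "cache line must hold a whole number of slots");
    // Number of slots that fit in one cache line.
    static constexpr std::size_t BucketSize = CacheLineBytes / SlotBytes;

    static_assert(WarpSize % BucketSize == 0,
                  "warp must split into a whole number of tiles");
    // Number of bucket-wide tiles that fit in one warp or wavefront.
    static constexpr std::size_t TilesPerWarp = WarpSize / BucketSize;
};

// AMD case from the text: 64-byte cache line, 64-thread wavefront.
using AmdGeometry = BucketGeometry<64, 64>;
static_assert(AmdGeometry::BucketSize == 8, "64 / 8 slots per bucket");
static_assert(AmdGeometry::TilesPerWarp == 8, "64 / 8 tiles per wavefront");
```

Under the same assumptions, a 128-byte cache line with a 32-thread warp would give BucketSize = 16 and TilesPerWarp = 2. Whether those match the library's actual defaults is not confirmed here.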