c++ - CUDA programming: Memory access speed and memory usage: thread-local variables vs. shared-memory variables vs. numeric literals?
This question already has an answer here:
- Using constants with CUDA (2 answers)
Suppose I have an array of several fixed numerical values that are accessed multiple times by multiple threads within the same block. What are the pros and cons, in terms of access speed and memory usage, of storing these values in:
Thread-local memory:
double x[3] = {1,2,3};
Shared memory:
__shared__ double x[3] = {1,2,3};
Numeric literals: directly hardcode these values in the expressions where they appear
Thanks!
TL;DR
Use __constant__ double x[3]; // ... initialization ...
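For reference, a minimal sketch of how such a constant array is typically filled from the host with cudaMemcpyToSymbol (the symbol name x and the values {1,2,3} are taken from the question; the helper function name is hypothetical):

__constant__ double x[3];

// Hypothetical host-side helper: copies the values into the
// __constant__ symbol before launching any kernel that reads x.
void initConstants() {
    const double host_x[3] = {1.0, 2.0, 3.0};
    cudaMemcpyToSymbol(x, host_x, sizeof(host_x));
}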
First, you have to know where the variable resides
In the question:
- Thread-local memory: double x[3] = {1,2,3};
This is imprecise. It depends on how your code accesses x[]: x[] can reside in either registers or local memory.
Since there are no type qualifiers, the compiler will try its best to put things in registers:
An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. In some cases the compiler might choose to place it in local memory.
But when it can't, it will put them in local memory:
- Arrays for which it cannot determine that they are indexed with constant quantities (illustrated in the sketch below),
- Large structures or arrays that would consume too much register space,
- Any variable if the kernel uses more registers than available (this is also known as register spilling).
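A hedged illustration of the first rule (the two kernels below are hypothetical, not from the question): constant indices let the compiler keep the array in registers, while a run-time index likely forces it into local memory.

__global__ void constantIndex(double *out) {
    double x[3] = {1.0, 2.0, 3.0};
    // Indices are compile-time constants, so the compiler can keep x in registers.
    out[threadIdx.x] = x[0] + x[1] + x[2];
}

__global__ void dynamicIndex(const int *idx, double *out) {
    double x[3] = {1.0, 2.0, 3.0};
    // The index is only known at run time, so x will likely be placed in (slow) local memory.
    out[threadIdx.x] = x[idx[threadIdx.x]];
}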
You don't want x in local memory; it's slow. In your situation,
an array of several fixed numerical values accessed multiple times by multiple threads within the same block
both __constant__ and __shared__ can be good choices.
For a complete description of this topic, check: CUDA Toolkit Documentation: Variable Type Qualifiers
Then, consider speed & availability
Hardcoded
The numbers will be embedded in the instructions. You may expect some performance improvement, but you'd better benchmark your program before and after doing this.
Register
It's fast, but scarce. Consider a block of 16x16 threads with a maximum of 64K registers per block: each thread can use at most 64K / 256 = 256 registers. (Well, maybe not that scarce; it should be enough for most kernels.)
Local memory
It's slow. However, a thread can use up to 512KB of local memory.
The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses...
Shared memory
It's fast, but scarce. Typically 48KB per block (less than the registers!).
Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
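One caveat about the declaration in the question: a __shared__ array cannot be initialized in its declaration; some thread has to fill it in at run time. A minimal sketch (the kernel is hypothetical):

__global__ void useShared(double *out) {
    __shared__ double x[3];
    if (threadIdx.x == 0) {
        // One thread fills the array; an initializer on the
        // __shared__ declaration would not compile.
        x[0] = 1.0; x[1] = 2.0; x[2] = 3.0;
    }
    __syncthreads(); // make the values visible to the whole block
    out[threadIdx.x] = x[threadIdx.x % 3];
}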
Constant memory
It's fast in a different way (see below): it highly depends on the cache, and the cache is scarce. Typically 8KB ~ 10KB of cache per multiprocessor.
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
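To make "fast in a different way" concrete, here is a hedged sketch of the two access patterns described in the quote above (the kernel is hypothetical):

__constant__ double x[3];

__global__ void constantAccess(double *out) {
    // Fast: all threads of a warp read the same address, so the constant
    // cache can serve the whole warp with a single broadcast.
    double a = x[0];
    // Slower: threads read up to 3 different addresses, so the request is
    // split into as many separate requests as there are distinct addresses.
    double b = x[threadIdx.x % 3];
    out[threadIdx.x] = a + b;
}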