c++ - CUDA programming: Memory access speed and memory usage: thread-local variables vs. shared-memory variables vs. numeric literals?
This question already has an answer here:
- Using constants with CUDA (2 answers)
Suppose I have an array of several fixed numerical values that are accessed multiple times by multiple threads within the same block. What are the pros and cons, in terms of access speed and memory usage, of storing these values in:
Thread-local memory:
double x[3] = {1,2,3};
Shared memory:
__shared__ double x[3] = {1,2,3};
Numeric literals: directly hardcode these values in the expressions where they appear
Thanks!
TL;DR
Use __constant__ double x[3]; // ... initialization ...
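For reference, a minimal sketch of how such a constant array is typically filled from the host with cudaMemcpyToSymbol (the symbol name x and the values {1,2,3} are taken from the question; the helper function name is hypothetical):

__constant__ double x[3];

// Hypothetical host-side helper: copies the values into the
// __constant__ symbol before launching any kernel that reads x.
void initConstants() {
    const double host_x[3] = {1.0, 2.0, 3.0};
    cudaMemcpyToSymbol(x, host_x, sizeof(host_x));
}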
First, you have to know where the variable resides
In the question:
- Thread-local memory: double x[3] = {1,2,3};
This is imprecise. It depends on how your code accesses x[]: x[] can reside in either registers or local memory.
Since there are no type qualifiers, the compiler will try its best to put things in registers:
An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. In some cases the compiler might choose to place it in local memory.
But when it can't, it will put them in local memory:
- Arrays for which it cannot determine that they are indexed with constant quantities (illustrated in the sketch below),
- Large structures or arrays that would consume too much register space,
- Any variable if the kernel uses more registers than available (this is also known as register spilling).
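A hedged illustration of the first rule (the two kernels below are hypothetical, not from the question): constant indices let the compiler keep the array in registers, while a run-time index likely forces it into local memory.

__global__ void constantIndex(double *out) {
    double x[3] = {1.0, 2.0, 3.0};
    // Indices are compile-time constants, so the compiler can keep x in registers.
    out[threadIdx.x] = x[0] + x[1] + x[2];
}

__global__ void dynamicIndex(const int *idx, double *out) {
    double x[3] = {1.0, 2.0, 3.0};
    // The index is only known at run time, so x will likely be placed in (slow) local memory.
    out[threadIdx.x] = x[idx[threadIdx.x]];
}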
You don't want x in local memory; it's slow. In your situation,
an array of several fixed numerical values accessed multiple times by multiple threads within the same block
both __constant__ and __shared__ can be good choices.
For a complete description of this topic, check: CUDA Toolkit Documentation: Variable Type Qualifiers
Then, consider speed & availability
Hardcoded
The numbers will be embedded in the instructions. You may expect some performance improvement, but you'd better benchmark your program before and after doing this.
Register
It's fast, but scarce. Consider a block of 16x16 threads with a maximum of 64K registers per block: each thread can use at most 64K / 256 = 256 registers. (Well, maybe not that scarce; it should be enough for most kernels.)
Local memory
It's slow. However, a thread can use up to 512KB of local memory.
The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses...
Shared memory
It's fast, but scarce. Typically 48KB per block (less than the registers!).
Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
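One caveat about the declaration in the question: a __shared__ array cannot be initialized in its declaration; some thread has to fill it in at run time. A minimal sketch (the kernel is hypothetical):

__global__ void useShared(double *out) {
    __shared__ double x[3];
    if (threadIdx.x == 0) {
        // One thread fills the array; an initializer on the
        // __shared__ declaration would not compile.
        x[0] = 1.0; x[1] = 2.0; x[2] = 3.0;
    }
    __syncthreads(); // make the values visible to the whole block
    out[threadIdx.x] = x[threadIdx.x % 3];
}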
Constant memory
It's fast in a different way (see below): it highly depends on the cache, and the cache is scarce. Typically 8KB ~ 10KB of cache per multiprocessor.
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
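To make "fast in a different way" concrete, here is a hedged sketch of the two access patterns described in the quote above (the kernel is hypothetical):

__constant__ double x[3];

__global__ void constantAccess(double *out) {
    // Fast: all threads of a warp read the same address, so the constant
    // cache can serve the whole warp with a single broadcast.
    double a = x[0];
    // Slower: threads read up to 3 different addresses, so the request is
    // split into as many separate requests as there are distinct addresses.
    double b = x[threadIdx.x % 3];
    out[threadIdx.x] = a + b;
}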