Processing Units
Describe in your own words the tradeoffs between
a CPU
and GPU. Do you even
need a CPU if you have a GPGPU?
Suggested Answer
The CPU is the traditional locus of control in a computer: it
orchestrates the peripheral devices and processes input and output.
Modern-day GPUs evolved from ASICs designed for specific tasks
(usually of a graphical nature) into programmable devices that can in
principle compute anything a CPU can, with a particular aptitude for
parallelizable tasks. Yet even a general-purpose GPU cannot displace
the CPU, as it is not designed to drive the peripherals, or even to
manage main memory, on its own.
GPU, what is it good for?
Judge which of the following problems you can use a GPU to help solve. Sketch your intuition and what role the GPU could play. Be creative!
- Multiplying a matrix with a (compatible) vector
- Typesetting a text document (e.g. TeX involving line breaking)
- Downscaling a video (e.g. from 1080p to 480p)
- Parsing the syntax of a C file
- Playing chess
- Computing a checksum of a data stream
- Emulating an Intel 8086 processor
Suggested Answer
To a certain degree, this is a trick question, since you can employ the GPU to solve any of these problems, albeit not always efficiently, or even while utilizing the strengths a GPU has at its disposal.
We will set aside this pedantic view and instead concentrate on meaningful applications:
- Yes, matrix multiplication is one of the canonical examples of a problem that you can solve more efficiently in parallel.
- No, typesetting involves a lot of interdependencies that make it difficult to parallelize. For instance, deciding to break a line one word earlier or later on the first page might propagate further line breaks through the document, so that the final output might have an entire page more or less.
- Yes, each group of pixels can be downscaled in parallel, as can each frame of the video (in practice, video codecs are more sophisticated than storing pixels directly, but the point remains).
- No, parsing the syntax of a text document is traditionally sequential, especially in a language like C where declarations affect the validity of the parsing process. To which degree this is an issue depends on which part of the parsing pipeline we consider: It should be obvious that if we split a file up into regular chunks and process these independently, any macro definitions at the top of the file will pose difficulties in parsing the later parts of the file if these are not known. Similarly, if we are also interested in type checking that function calls have the right arity, the declarations must be known beforehand. Then there are also C-specific complications, such as the ambiguity of an expression like (A) * B, which depends on whether A is a type and B is a pointer, or whether they are two arithmetical expressions. Yet, there has been research into how GPUs can help speed up parsing of context-free grammars (specifically using the CYK algorithm), which could be employed to parse a C document as well. The intuition here could be that since context-free languages are recognized by non-deterministic pushdown automata, we can try to exploit the non-determinism as a parallelization strategy. For programming languages, this will probably not end up being that useful, since most programming languages are restricted/designed to be parsed deterministically, so as to be efficiently recognized by just a CPU.
- No, as deciding what move to make when playing chess is naively a
problem that grows exponentially in size. You do not absolve yourself
from the weight of the problem by checking the legality of each move
in parallel, since the explosion of combinations that each side can
choose from remains the main bottleneck.
Yet, in recent years heuristics such as Monte Carlo tree search that combine neural networks (GPU! GPU! GPU!) with classical symbolic approaches have proven very successful, resulting in the necessary breakthrough for machine victory over humans in the game of Go.
- No, as traditionally a checksum (e.g. one from the SHA family) processes blocks of data from the input, where each intermediate checksum depends on that of its predecessor. Furthermore, when processing a stream of data (e.g. a network download), we don't have random access, which makes parallelizing more difficult.
- No, since the 8086 is a single-core processor, meaning that all emulated computations have to be executed in order of occurrence. Perhaps a GPU could be useful to facilitate an anachronistic kind of branch prediction?
Single instruction, multiple something?
Sometimes people get confused about the difference between SIMD and SIMT. Try to explain the difference, but also how the two relate to one another.
Perhaps reading up on
the historical
context of this terminology can be useful.
Suggested Answer
SIMT
is a variation
of SIMD.
Usually SIMD executes instructions simultaneously using special
instruction sets: For instance, the x86 SSE instruction ADDPS
performs multiple additions at once by treating the source and
destination registers as vectors of four single-precision floats
(its scalar counterpart ADDSS adds just one). SIMT, on the other
hand, has multiple threads (of execution, not necessarily
operating-system threads) receive the same instructions, each with
distinct memory and registers to operate on.
OpenCL Kernels I
Explain why the kernel could be slow, and briefly sketch how to improve the performance.
__kernel void
discretize(__global float *f_dis, int n)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    float a = 0;
    float b = 2 * M_PI;
    float h = (b - a) / n;

    if (i < n && j < n) {
        float x = a + i * h;
        float y = a + j * h;
        /* In OpenCL C, sin is type-generic, so no sinf needed. */
        f_dis[i * n + j] = -2 * sin(x + y);
    }
}
Suggested Answer
Each separate memory access to f_dis slows down the GPU: work-items
with consecutive global IDs in dimension 0 differ in i, so they write
to locations that are n floats apart, and the accesses cannot be
coalesced. The solution is to coalesce accesses by having
neighbouring work-items write to neighbouring addresses, combining
accesses into as few memory transactions as possible.
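A sketch of one possible fix, assuming the usual mapping where dimension 0 varies fastest across work-items in a work-group: swap which global ID selects the row and which the column, so that neighbouring work-items address neighbouring elements.

```c
__kernel void
discretize(__global float *f_dis, int n)
{
    /* Dimension 0 varies fastest across work-items, so let it
     * select the column, i.e. the contiguous index. */
    int j = get_global_id(0);
    int i = get_global_id(1);
    float a = 0;
    float b = 2 * M_PI;
    float h = (b - a) / n;

    if (i < n && j < n) {
        float x = a + i * h;
        float y = a + j * h;
        /* Consecutive j now means consecutive addresses, so the
         * writes of a work-group coalesce into few transactions. */
        f_dis[i * n + j] = -2 * sin(x + y);
    }
}
```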
OpenCL Kernels II
What concurrency-related mistake has been made in the following implementation of (square) matrices transposition?
__kernel void
transpose(__global float *mat, int m)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    if (i <= j) return;
    float tmp = mat[i * m + j];
    mat[i * m + j] = mat[j * m + i];
    mat[j * m + i] = tmp;
}
Modify the code to solve the problem.
Warning (2026-04-30): The code block had a mistake
that was not intended to be part of the question (hint: it is a
possible solution, but explain why). This code has been
stricken through for future reference.
Suggested Answer
We have a race condition: as intended (the question was posed without
the accidental if (i <= j) return; guard, cf. the warning above), the
work-items for positions (i, j) and (j, i) concurrently read and
write the same two elements of mat, so one swap can overwrite the
other's result. To avoid this, we can add another output parameter:
__kernel void
transpose(__global float *mat, int m, __global float *out)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    out[j * m + i] = mat[i * m + j];
}
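As the warning above asks: why is the accidental if (i <= j) return; guard itself a possible solution? With the guard, exactly one work-item, the one with i > j, touches each off-diagonal pair, and the diagonal needs no swap at all, so no two work-items ever access the same element. A sketch of that in-place variant:

```c
__kernel void
transpose(__global float *mat, int m)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    /* Only the work-item with i > j swaps the pair (i, j)/(j, i);
     * each element is then read and written by exactly one
     * work-item, so no race can occur. */
    if (i <= j) return;
    float tmp = mat[i * m + j];
    mat[i * m + j] = mat[j * m + i];
    mat[j * m + i] = tmp;
}
```

The price is that the access pattern is less regular than in the out-of-place version, but it avoids the second buffer.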