Processing Units
Describe in your own words the tradeoffs between
a CPU
and GPU. Do you even
need a CPU if you have a GPGPU?
Suggested Answer
The CPU is the traditional locus of control in a computer: it
orchestrates the peripheral devices and processes input and output.
Modern-day GPUs evolved from ASICs designed for specific tasks
(usually of a graphical nature) into programmable devices that can in
principle compute anything a CPU can, with a particular aptitude for
parallelizable tasks. Yet even a general-purpose GPU cannot displace
the CPU, as it is not designed to drive the peripherals, or even to
manage main memory, on its own.
GPU, what is it good for?
Judge which of the following problems you can use a GPU to help solve. Sketch your intuition and what role the GPU could play. Be creative!
- Multiplying a matrix with a (compatible) vector
- Typesetting a text document (e.g. TeX involving line breaking)
- Downscaling a video (e.g. from 1080p to 480p)
- Parsing the syntax of a C file
- Playing chess
- Computing a checksum of a data stream
- Emulating an Intel 8086 processor
Suggested Answer
To a certain degree, this is a trick question, since you can employ the GPU to solve any of these problems, albeit not always efficiently, or even while utilizing the strengths a GPU has at its disposal.
We will set aside this pedantic view and instead concentrate on meaningful applications:
- Yes, matrix multiplication is one of the canonical examples of a problem that you can solve more efficiently in parallel.
- No, typesetting involves a lot of interdependencies that make it difficult to parallelize. For instance, deciding to break a line one word earlier or later on the first page might propagate further line breaks through the document, so that the final output might have an entire page more or less.
- Yes, each group of pixels can be downscaled in parallel, as can each frame of the video (in practice, video codecs are more sophisticated than storing pixels directly, but the point remains).
- No, parsing the syntax of a text document is traditionally sequential, especially in a language like C where declarations affect the validity of the parsing process. To which degree this is an issue depends on which part of the parsing pipeline we consider: It should be obvious that if we split a file up into regular chunks and process these independently, any macro definitions at the top of the file will pose difficulties in parsing the later parts of the file if these are not known. Similarly, if we are also interested in type checking that function calls have the right arity, the declarations must be known beforehand. Then there are also C-specific complications, such as the ambiguity of an expression like (A) * B, which depends on whether A is a type and B is a pointer, or whether they are two arithmetical expressions. Yet, there has been research into how GPUs can help speed up parsing of context-free grammars (specifically using the CYK algorithm), which could be employed to parse a C document as well. The intuition here could be that since context-free languages are recognized by non-deterministic pushdown automata, we can try to exploit the non-determinism as a parallelization strategy. For programming languages, this will probably not end up being that useful, since most programming languages are restricted/designed to be parsed deterministically, so as to be efficiently recognized by just a CPU.
- No, as deciding what move to make when playing chess is naively a
problem that grows exponentially in size. You do not absolve yourself
from the weight of the problem by checking the legality of each move
in parallel, since the explosion of combinations that each side can
choose from remains the main bottleneck.
Yet, in recent years heuristics such as Monte Carlo tree search that combine neural networks (GPU! GPU! GPU!) with classical symbolic approaches have proven very successful, resulting in the necessary breakthrough for machine victory over humans in the game of Go.
- No, as traditionally a checksum (e.g. one from the SHA family) processes blocks of data from the input, where each intermediate checksum depends on that of its predecessor. Furthermore, when processing a stream of data (e.g. a network download), we don't have random access, which makes parallelizing more difficult.
- No, since the 8086 is a single-core processor, meaning that all emulated computations have to be executed in order of occurrence. Perhaps a GPU could be useful to facilitate an anachronistic kind of branch prediction?
Single instruction, multiple something?
Sometimes people get confused about the difference between SIMD and SIMT. Try to explain the difference, but also how the two relate to one another.
Perhaps reading up on
the historical
context of this terminology can be useful.
Suggested Answer
SIMT
is a variation
of SIMD.
Usually SIMD executes instructions simultaneously using special
instruction sets: For instance, the x86 SSE instruction ADDPS
performs multiple additions at once by treating the source and
destination registers as vectors of four single-precision floats
(its scalar counterpart ADDSS adds just one). SIMT, on the other
hand, has multiple threads (of execution, not necessarily
operating-system threads) receive the same instructions, each with
distinct memory and registers to operate on.
OpenCL Kernels I
Explain why the kernel could be slow, and briefly sketch how to improve the performance.
__kernel void
discretize(__global float *f_dis, int n)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    float a = 0;
    float b = 2 * M_PI;
    float h = (b - a) / n;

    if (i < n && j < n) {
        float x = a + i * h;
        float y = a + j * h;
        /* In OpenCL C, sin is type-generic, so no sinf needed. */
        f_dis[i * n + j] = -2 * sin(x + y);
    }
}
Suggested Answer
Each separate memory access to f_dis slows down the GPU: work-items
with consecutive global IDs in dimension 0 differ in i, so they write
to locations that are n floats apart, and the accesses cannot be
coalesced. The solution is to coalesce accesses by having
neighbouring work-items write to neighbouring addresses, combining
accesses into as few memory transactions as possible.
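A sketch of one possible fix, assuming the usual mapping where dimension 0 varies fastest across work-items in a work-group: swap which global ID selects the row and which the column, so that neighbouring work-items address neighbouring elements.

```c
__kernel void
discretize(__global float *f_dis, int n)
{
    /* Dimension 0 varies fastest across work-items, so let it
     * select the column, i.e. the contiguous index. */
    int j = get_global_id(0);
    int i = get_global_id(1);
    float a = 0;
    float b = 2 * M_PI;
    float h = (b - a) / n;

    if (i < n && j < n) {
        float x = a + i * h;
        float y = a + j * h;
        /* Consecutive j now means consecutive addresses, so the
         * writes of a work-group coalesce into few transactions. */
        f_dis[i * n + j] = -2 * sin(x + y);
    }
}
```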
OpenCL Kernels II
What concurrency-related mistake has been made in the following implementation of (square) matrices transposition?
__kernel void
transpose(__global float *mat, int m)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    if (i <= j) return;
    float tmp = mat[i * m + j];
    mat[i * m + j] = mat[j * m + i];
    mat[j * m + i] = tmp;
}
Modify the code to solve the problem.
Warning (2026-04-30): The code block had a mistake
that was not intended to be part of the question (hint: it is a
possible solution, but explain why). This code has been
stricken through for future reference.
Suggested Answer
We have a race condition: as intended (the question was posed without
the accidental if (i <= j) return; guard, cf. the warning above), the
work-items for positions (i, j) and (j, i) concurrently read and
write the same two elements of mat, so one swap can overwrite the
other's result. To avoid this, we can add another output parameter:
__kernel void
transpose(__global float *mat, int m, __global float *out)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    out[j * m + i] = mat[i * m + j];
}
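As the warning above asks: why is the accidental if (i <= j) return; guard itself a possible solution? With the guard, exactly one work-item, the one with i > j, touches each off-diagonal pair, and the diagonal needs no swap at all, so no two work-items ever access the same element. A sketch of that in-place variant:

```c
__kernel void
transpose(__global float *mat, int m)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    /* Only the work-item with i > j swaps the pair (i, j)/(j, i);
     * each element is then read and written by exactly one
     * work-item, so no race can occur. */
    if (i <= j) return;
    float tmp = mat[i * m + j];
    mat[i * m + j] = mat[j * m + i];
    mat[j * m + i] = tmp;
}
```

The price is that the access pattern is less regular than in the out-of-place version, but it avoids the second buffer.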