Parallel Computing '26: Exercise Sheet 3
OpenCL and GPUs

Submission Deadline: (a few days longer than last time)

Processing Units

Describe in your own words the tradeoffs between a CPU and GPU. Do you even need a CPU if you have a GPGPU?

Suggested Answer The CPU is the traditional locus of control in a computer: it orchestrates the peripheral devices and processes input and output. Modern-day GPUs evolved from ASICs designed for specific (usually graphical) tasks into programmable devices that can in principle compute anything a CPU can, with a particular aptitude for parallelizable tasks. Yet even a general-purpose GPU cannot displace the CPU, as it is not designed to manage peripherals, run an operating system, or drive the largely sequential control flow that ties a system together.

GPU, what is it good for?

Judge for which of the following problems you could use a GPU to help with the solution. Sketch your intuition and the role the GPU could play. Be creative!

  1. Multiplying a matrix with a (compatible) vector
  2. Typesetting a text document (e.g. TeX involving line breaking)
  3. Downscaling a video (e.g. from 1080p to 480p)
  4. Parsing the syntax of a C file
  5. Playing chess
  6. Computing a checksum of a data stream
  7. Emulating an Intel 8086 processor
Suggested Answer

To a certain degree, this is a trick question, since you can employ a GPU to solve any of these problems, albeit not always efficiently, or even in a way that utilizes the strengths a GPU has at its disposal.

We will set this pedantic view aside and instead concentrate on meaningful applications:

  1. Matrix-vector multiplication is a natural fit: each work-item can compute one entry of the result independently.
  2. Typesetting is a poor fit: line breaking is an inherently sequential optimization problem, although a GPU could help rasterize the final pages.
  3. Downscaling video is a natural fit: every output pixel can be computed independently from a small neighborhood of input pixels.
  4. Parsing is a poor fit: it is sequential and dominated by data-dependent branching.
  5. Chess is a partial fit: the search tree is best managed on the CPU, but the GPU can evaluate many candidate positions in parallel.
  6. Checksums depend on the algorithm: a chained checksum such as CRC serializes on the previous state, while tree-shaped reductions parallelize well.
  7. Emulating an 8086 is a poor fit: a single, sequential instruction stream leaves nothing to parallelize.
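For the first item, a minimal OpenCL sketch (kernel name and parameters are our own choice, not part of the exercise): each work-item computes one entry of the result vector independently.

```c
/* Hypothetical kernel: y = A * x for an n x n row-major matrix A. */
__kernel void
matvec(__global const float *A, __global const float *x,
       __global float *y, int n)
{
  int i = get_global_id(0);
  if (i >= n) return;

  /* Each work-item owns one row of A and one entry of y. */
  float acc = 0;
  for (int j = 0; j < n; j++)
    acc += A[i * n + j] * x[j];
  y[i] = acc;
}
```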

Single instruction, multiple something?

Sometimes people get confused about the difference between SIMD and SIMT. Try to explain the difference, but also how the two relate to one another.

Perhaps reading up on the historical context of this terminology can be useful.

Suggested Answer SIMT is a variation of SIMD. SIMD executes one instruction on several data elements simultaneously, using special instruction sets: for instance, the SSE instruction ADDPS performs four single-precision additions at once by treating the source and destination registers as vectors (ADDSS, by contrast, is its scalar counterpart). SIMT, on the other hand, has multiple threads (of execution, not necessarily operating-system threads) receive the same instruction stream, while each thread has its own registers and memory addresses to operate on; threads that diverge at a branch are masked off until the paths reconverge.

OpenCL Kernels I

Explain why the kernel could be slow, and briefly sketch how to improve the performance.

__kernel void
discretize(__global float *f_dis, int n)
{
  int i = get_global_id(0);
  int j = get_global_id(1);

  float a = 0;
  float b = 2 * M_PI;
  float h = (b - a) / n;

  if (i < n && j < n) {
    float x = a + i * h;
    float y = a + j * h;
    /* In OpenCL C, sin is type-generic, so no sinf needed. */
    f_dis[i * n + j] = -2 * sin(x + y);
  }
}
Suggested Answer The store to f_dis is not coalesced: work-items that are consecutive in dimension 0 (consecutive i) write addresses that are n floats apart, so each warp/wavefront issues many separate memory transactions. To coalesce the accesses, index the array so that consecutive work-items touch consecutive addresses, e.g. store to f_dis[j * n + i] instead (the value -2 * sin(x + y) is symmetric in i and j, so the result is unchanged).
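Assuming the usual mapping, where dimension 0 is the fastest-varying one within a work-group, swapping the index roles in the store restores coalescing; this is a sketch of the fix, not the only possible one:

```c
__kernel void
discretize(__global float *f_dis, int n)
{
  int i = get_global_id(0);
  int j = get_global_id(1);

  float a = 0;
  float b = 2 * M_PI;
  float h = (b - a) / n;

  if (i < n && j < n) {
    float x = a + i * h;
    float y = a + j * h;
    /* Consecutive i (dimension 0) now write consecutive addresses,
       so a warp's stores combine into few memory transactions. */
    f_dis[j * n + i] = -2 * sin(x + y);
  }
}
```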

OpenCL Kernels II

What concurrency-related mistake has been made in the following implementation of (square) matrix transposition?

__kernel void
transpose(__global float *mat, int m)
{
  int i = get_global_id(0);
  int j = get_global_id(1);

  /* stricken (see warning below): if (i <= j) return; */
  float tmp = mat[i * m + j];
  mat[i * m + j] = mat[j * m + i];
  mat[j * m + i] = tmp;
}

Modify the code to solve the problem.

Warning (2026-04-30): The code block had a mistake that was not intended to be part of the question (hint: it is a possible solution, but explain why). This code has been stricken through for future reference.

Suggested Answer We have a race condition: without any restriction, the work-items (i, j) and (j, i) both read and write the same pair of elements of mat concurrently, so the final values depend on how their accesses interleave. One fix is the stricken guard, which lets exactly one work-item per off-diagonal pair perform the swap. Alternatively, we can avoid in-place updates altogether by adding a separate output parameter:
__kernel void
transpose(__global float *mat, int m, __global float *out)
{
  int i = get_global_id(0);
  int j = get_global_id(1);

  out[j * m + i] = mat[i * m + j];
}

Remember to submit your answers in groups of two via Brightspace! If you cannot find a group on your own, please reach out and we will try to pair you up.