Terminology: Concurrency and Parallelism
Give an example of a problem that can be solved concurrently but not
in parallel, and one that can be solved in parallel but not concurrently.
If you think one of these doesn't exist, explain why.
Suggested Answer
Concurrency is a semantic property, while parallelism is an
implementation choice.
Recall that coroutines (or, in their weaker form, generators) are a canonical
example of non-parallel concurrency. They can be implemented using
parallelism, or by suspending the execution of one computation and
handing control over to a different coroutine.
In the worst case, this is a restriction, as with old, single-core
operating systems that lacked the means to implement a preemptive
scheduler and couldn't physically execute multiple processes at once. On the
other hand, coroutines can serve as a means of modularization that
makes complex programs easier to maintain. The related notion
of lazy evaluation in languages like Haskell also makes use of this.
Under the definition given in the lecture, parallelism without
concurrency doesn't make sense, since any units of work that are
done at the same time must also be able to be done at the same time.
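To make the non-parallel flavor of this concrete, here is a minimal sketch of a generator in plain C (the name next_fib and the static-state encoding are our own illustration, not from the lecture). Caller and generator interleave cooperatively on a single thread; control is handed back and forth, but nothing ever runs in parallel:

```c
#include <stdio.h>

/* A toy generator: each call "resumes" where the previous call left
 * off (the state lives in static variables) and "yields" the next
 * Fibonacci number. */
static unsigned next_fib(void)
{
    static unsigned a = 0, b = 1;
    unsigned out = a;
    unsigned next = a + b;
    a = b;
    b = next;
    return out;
}

int main(void)
{
    /* Control alternates between main and the generator:
     * concurrency without parallelism. */
    for (int i = 0; i < 10; ++i)
        printf("%u\n", next_fib());
    return 0;
}
```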
Speedup Arithmetic
Consider two programs, one sequential and one parallel. Executing them one after another on a shared-memory multicore system yields the following results:
| # Cores | Seq. | Par. |
|---|---|---|
| 1 | 27s | 34s |
| 2 | 27s | 18s |
| 4 | 27s | 10s |
| 8 | 27s | 7s |
Compute the speedup and efficiency for 2, 4 and 8 cores and explain
the results. Briefly comment on why on a single-core system the
sequential version is faster than the parallel version.
Suggested Answer
The speedups are 27/18 = 1.5, 27/10 = 2.7, and 27/7 ≈ 3.86,
and the efficiencies are 75%, 67.5%, and 48.2%, respectively.
The efficiency decreases with the number of cores, as predicted by
Amdahl's law: the sequential fraction of the program limits how much
additional cores can contribute.
The sequential program is faster on a single-core system because the
parallel version pays the overhead of (futile) parallelization,
e.g. for creating and synchronizing threads, without any additional
cores to exploit.
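For reference, these numbers follow the standard definitions of speedup and efficiency (assuming the lecture states them the same way):

$$S(p) = \frac{T_\text{seq}}{T_\text{par}(p)}, \qquad E(p) = \frac{S(p)}{p}, \qquad \text{e.g. } E(8) = \frac{27/7}{8} = \frac{27}{56} \approx 48.2\%$$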
Parallel Execution and Data Races
Consider the following tasks and explain whether or not data races could occur. Assume multiple actors are doing the same operation in parallel. If they can occur, suggest the most appropriate means of preventing them.
- Adding a value to a binary tree.
- Finding a value in a binary tree.
- Incrementing an integer that fits in a machine word.
- Creating a new file on a multi-user system.
- Replacing an existing file on a multi-user system.
Suggested Answer
- There is a data race, since two threads could attempt to mutate the same node, leading to a race condition. You can avoid this using some sort of locking discipline, or by finding an adequate lock-free algorithm.
- There is no data race, since all actors only read values in the data structure.
- There is a data race, unless you use an atomic read-modify-write operation: primitives like "fetch and add" (e.g. C11's atomic_fetch_add) do exactly that.
- It depends on the operating system and file system, but on UNIX, `open` (with the `O_CREAT | O_EXCL` flags) promises to create the file atomically: if two processes attempt to create the same file, only one should succeed.
- Again it depends on the operating system, but replacing a regular file is not an atomic operation. You have to protect it with some sort of lock (e.g. file locks or POSIX semaphores). You can replace a file atomically by `rename`-ing it, but it remains a race which process gets to execute the system call last.
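A minimal POSIX sketch of the last two points (the file names are hypothetical). The exclusive create fails for every racing process but one, and the rename publishes the new content atomically:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Atomic creation: O_CREAT | O_EXCL fails with EEXIST if the
     * file already exists, so exactly one of several racing
     * processes wins. */
    int fd = open("target.tmp", O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");  /* another process created it first */
        return 1;
    }

    const char msg[] = "new content\n";
    write(fd, msg, strlen(msg));
    close(fd);

    /* Atomic replacement: rename() swaps the file in one step, so
     * readers see either the old or the new "target", never a
     * half-written file. Which racing process wins remains a race. */
    if (rename("target.tmp", "target") != 0) {
        perror("rename");
        return 1;
    }
    return 0;
}
```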
OpenMP: Parallelization
Judge whether or not you can correctly parallelize these loops using
OpenMP just by adding a #pragma omp, without
adjusting the program. Briefly explain your answer, and if
you think it is possible, give the pragma you would use.
1. `for (int i = 0; i < N; ++i) { a[i] = a[i] * a[i+1]; }`
2. `for (int i = 0; i < N-1; i += 2) { a[i] = a[i] * a[i+1]; }`
3. `bool any = false; for (int i = 0; i < N; ++i) { any = any || c[i]; }`
Suggested Answer
1. No, you cannot, since neighboring elements of the array `a` form data dependencies that prevent parallelization.
2. Yes: the even-indexed elements depend on their own value and that of their odd-indexed successors, while the odd-indexed elements are never modified. `#pragma omp parallel for`
3. Yes, we can even use `reduction`, since `||` over truth values forms a monoid, with `false` being the neutral element: `#pragma omp parallel for reduction(||: any)`
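As a compilable sketch of the third case (the array contents are made up for illustration; compile with -fopenmp):

```c
#include <stdbool.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    bool c[N];
    for (int i = 0; i < N; ++i)
        c[i] = (i == 421);  /* hypothetical data: a single true entry */

    bool any = false;
    /* Each thread accumulates into a private copy of `any`, starting
     * from the neutral element false; OpenMP combines the copies
     * with || when the loop ends. */
    #pragma omp parallel for reduction(||: any)
    for (int i = 0; i < N; ++i)
        any = any || c[i];

    printf("%s\n", any ? "found" : "not found");
    return 0;
}
```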
OpenMP: Reduction
The lectures introduced reduction. Explain in your own words
what the advantage is of reducing data instead of manually iterating.
Specifically, go into the dangers and difficulties that you
can avoid dealing with if your problem matches the reduction
pattern.
To illustrate this, come up with a (simple) problem where
you can use reduction, and contrast this with one where
you cannot.
Hint
What algebraic properties does the reduction operation need to exhibit?
Suggested Answer
As it is quasi-declarative, OpenMP has more freedom to optimize the
execution and use a more efficient approach to accumulate the result
(recall the tree strategy mentioned in the lectures). Furthermore, it
avoids having to use a shared accumulator that requires some kind of
manual synchronization.
Problem 3 from the previous exercise is a positive example. You
cannot use reduction to divide the elements of a list in order of
occurrence, since division is not associative: (8 / 4) / 2 = 1, but
8 / (4 / 2) = 4.
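To illustrate the difficulty that reduction hides, here is a minimal sketch of the manual alternative (our own example, summing hypothetical data): every update to the shared accumulator has to be synchronized by hand, which is both easy to get wrong and serializes the updates.

```c
#include <stdio.h>

#define N 1000

int main(void)
{
    int a[N];
    for (int i = 0; i < N; ++i)
        a[i] = i;  /* hypothetical data */

    long sum = 0;
    /* Without reduction(+: sum), the shared accumulator must be
     * protected explicitly; dropping the atomic below would be a
     * data race, and keeping it serializes the additions. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        #pragma omp atomic
        sum += a[i];
    }

    printf("%ld\n", sum);  /* prints 499500 */
    return 0;
}
```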
OpenMP: Schedulers
As you might know, is_prime(n) is computationally expensive,
while is_coprime(n, m) is cheap: to determine whether two positive
integers n and m are coprime, you just have to check that
gcd(n, m) = 1.
In the following program, both functions are used:
```c
#define N 128

void
foo(int in[N], bool out[N])
{
    for (int i = 0; i < N; ++i) {
        if (i & 1) { /* i is odd */
            out[i] = is_prime(in[i]);
        } else { /* i is even */
            out[i] = is_coprime(in[i], in[(i+1)%N]);
        }
    }
}
```
Explain what the general efficiency issue is when you attempt
to execute the program in parallel using OpenMP. Which of the
three schedulers
mentioned in the lecture (static, dynamic
and guided) should you use and why? Write out the OpenMP
pragma you would use.
Hint
What do the respective schedulers do when a thread runs out of work?
Suggested Answer
The work load is not evenly balanced, so some kind of non-static
scheduler is necessary to prevent some threads from sitting idle
while others are still working through expensive iterations. Whether
to use dynamic or guided shouldn't matter much, assuming that the
effort to check for primality is much more expensive than the
scheduling overhead.
`#pragma omp parallel for schedule(dynamic)`
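Applied to the loop from the exercise, the body of foo becomes (a sketch, with is_prime and is_coprime as declared above):

```c
void
foo(int in[N], bool out[N])
{
    /* dynamic: each thread grabs the next iteration (or chunk) as
     * soon as it finishes its current one, so threads that drew
     * cheap is_coprime iterations don't idle while others are stuck
     * in expensive is_prime calls. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < N; ++i) {
        if (i & 1) {
            out[i] = is_prime(in[i]);                 /* expensive */
        } else {
            out[i] = is_coprime(in[i], in[(i+1)%N]);  /* cheap */
        }
    }
}
```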