Real-time Linux (Xenomai)
Radboud University Nijmegen

Exercise #10: Measuring Jitter and Latency
Note: This exercise is intended for a real hardware platform, although the program for the scheduling measurements can be tried first on VMware.
Introduction
In real-time programming one usually has to guarantee that certain deadlines are always met. Hence, predictability of timing is crucial. In this exercise we investigate a few aspects of predictability by measuring scheduling jitter and interrupt latency.
Objectives
The primary objectives of this exercise are:
- To get experience with timing measurements
- To investigate the predictability and timing characteristics of Xenomai by measuring scheduling jitter and interrupt latency.
Description
Terminology
We start with a definition of a number of terms, using a figure from the presentation Real Time in Embedded Linux Systems by Petazzoni & Opdenacker:
Interrupt latency: the time between the occurrence of the interrupt and the start of the interrupt handler.
Scheduler latency: the time elapsed after completion of the handler and before execution of the scheduler.
Scheduling latency, also called task latency: the time between the occurrence of the interrupt that makes a sleeping task runnable again and the moment the task is resumed.
Scheduling jitter: the unwanted variation of the release times of a periodic task. It can be characterized in various ways, such as an interval around the desired release time, a maximal deviation from the desired time point, or a standard deviation from the mean value.
Pentium hardware and performance
In the lab we use PCs with standard Pentium hardware, which is optimized for throughput, i.e., the number of instructions per time unit, at the cost of predictable latency. This means that we get a high average instruction speed rather than a guaranteed low execution time of a specific instruction. The optimizations used by Pentium hardware make most instructions execute very fast. Occasionally, however, a single instruction might take much longer to execute than it would without optimization. This is problematic for real-time systems that need to guarantee hard real-time deadlines.
Unpredictability
Execution time
We list a number of reasons that contribute to the unpredictability of the execution time of instructions:
- Optimization techniques used in CPU design:
  - Pipelining is used in modern Intel processors to run a set of instructions in parallel.
  - Branch prediction: when trying to execute more instructions in parallel, branch instructions pose a problem, because not all information might be available yet to decide which instructions to execute. Instead of waiting until this information is available, the CPU makes an educated guess (prediction) about the result of the branch instruction and already starts executing independent instructions from that branch in parallel. If the guess was wrong, the results are thrown away and execution continues in the other branch. If the guess was right, throughput has been increased.
  - Out-of-order execution: when a certain instruction has to wait for available data, subsequent instructions are already executed (out-of-order) and the results are re-ordered later to reconstruct the logical order.
  These techniques are rather complex and they make it difficult to predict exactly when a single instruction will be executed.
- CPU cache
  Caches are used because of price-performance optimizations:
  - cache memory is about 10 times faster than RAM
  - RAM is much cheaper than cache memory
  Often a relatively small part of memory is used most frequently, so by using a small piece of fast memory (the cache), access to the slower RAM occurs infrequently and the machine becomes faster at relatively low cost. But caching increases unpredictability: instructions are fast if the data is in the cache, but become slower if the data is not in the cache (a so-called "cache miss"), because (1) some space in the cache has to be freed, and (2) the data has to be fetched from the slower RAM and stored into the cache.
- Address Translation Cache (ATC): used by the memory management unit to get quick access to the address translation table in memory. A miss in this cache, however, costs additional time.
- Paging implements virtual memory: it uses RAM as a cache for hard disk memory. Here a cache miss is called a "page fault" and also costs additional time.
- Direct Memory Access (DMA) is a hardware mechanism which allows data transfers in parallel with the instructions executed by the CPU. Having a DMA channel reduces the CPU overhead for data transfers, but the DMA controller may occupy the entire memory bus, asynchronously from what the CPU is doing. Hence a process that needs access to RAM might have to wait a long time for a DMA operation to complete.
- Interrupts: interrupt handling can preempt the CPU at any moment, which adds to the time before an instruction completes.
An example of a worst-case execution scenario for a single instruction:

what happens                                  time cost in nanoseconds
-----------------------------------------------------------------------
execution time of the instruction                                  1 ns
instruction cache miss                                            50 ns
instruction ATC miss                                             500 ns
data cache miss                                                1 000 ns
paging needed for instruction and data                    90 000 000 ns
many interrupts during execution                             100 000 ns
one big DMA transfer                                      10 000 000 ns
-----------------------------------------------------------------------
total                                                    100 101 551 ns
Hence, the execution of an instruction which basically costs a few nanoseconds might take more than 100 milliseconds.
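This variation can already be observed with a small experiment (our own illustration, not part of the exercise): timing the same trivial operation many times on a normal Linux system typically shows a best case well under a microsecond, with occasional much slower outliers caused by cache misses, interrupts, and the like.

/* Measure the same back-to-back clock_gettime call N times and report
   the best and worst observed duration; link with -lrt on older glibc. */
#include <stdio.h>
#include <time.h>

#define N 100000

int main(void)
{
    struct timespec t0, t1;
    long best = -1, worst = -1, d;
    int i;

    for (i = 0; i < N; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        d = (t1.tv_sec - t0.tv_sec) * 1000000000L
            + (t1.tv_nsec - t0.tv_nsec);
        if (best < 0 || d < best) best = d;
        if (d > worst) worst = d;
    }
    printf("best: %ld ns, worst: %ld ns\n", best, worst);
    return 0;
}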
Interrupt latency
Besides the causes mentioned above, there are a few additional factors that contribute to the unpredictability of interrupt latency:
- The CPU has to finish the current instruction (some instructions take 10 TSC clock ticks)
- Context saving time
- Time to switch to the ISR (read the IRQ table, set up the ISR stack)
- Interrupt masking: when interrupts are disabled at the moment an interrupt occurs, servicing has to wait until they are enabled again
- Interrupt priority: the servicing of low priority interrupts is delayed while a higher priority interrupt is being serviced
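For reference, the TSC mentioned above is a per-CPU counter of clock ticks that can be read from user space; a minimal sketch of our own, using GCC inline assembly for x86:

#include <stdint.h>

/* Read the Time Stamp Counter (TSC): a 64-bit counter of CPU clock
   ticks, returned by the rdtsc instruction in EDX:EAX. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}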
Scheduling latency
For scheduling latency the same factors apply as for interrupt latency. In this case, however, extra latency can be caused by the scheduler having to wait for the Linux kernel to complete some other task before it can execute.
When we schedule a task periodically, each cycle of the task starts late because of the scheduling latency. Some part of the scheduling latency is sporadic, but other parts, such as the context saving time, recur in every cycle. Thus there is a fixed part of the scheduling latency that recurs each cycle. This means that when we take the difference between two adjacent cycle start times, this fixed part cancels out! Hence the differences between the release times of the periodic task show a smaller variation than the absolute scheduling latencies.
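To see why (a small derivation of our own): if the i-th release of a task with period P happens at time t_i = i*P + f + e_i, where f is the fixed part of the scheduling latency and e_i the varying part, then the difference between adjacent releases is t_{i+1} - t_i = P + (e_{i+1} - e_i): the fixed part f has cancelled out, and only the variation remains.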
Load on a Linux system
To investigate how the load of a system affects jitter and latency, we
will put some load on the system when performing measurements. We
discuss a number of ways to monitor and to add various types of load on
a Linux system.
I/O network load
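- Load tool: flood ping with big packets (this is also how the combined.sh script of Exercise 10c generates network load):
  ping -f localhost -s 65000 >/dev/null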
I/O disk load
- Monitor tools:
  - in a terminal: iotop (needs the IO_ACCOUNTING option in the kernel)
  - in a GUI: gnome system monitor
- Load tools
  using dd:
  - big write, no read: dd if=/dev/zero of=/tmp/bigfile
  - big read, no write: dd if=/dev/hda1 of=/dev/null bs=1000k
    ==> watch out when using /dev/hda1: a typo can delete your hard disk!!!
  - big read and write: dd if=/dev/hda1 of=/tmp/bigfile bs=1000k
    ==> watch out when using /dev/hda1: a typo can delete your hard disk!!!
  using 'ls -lR':
  - while true; do ls -lR / > /tmp/list ; done &> /dev/null
CPU load
- Monitor tools: top, htop
- Load tools:
  - CPU burn-in: this program heats up any x86 CPU to the maximum possible operating temperature that is achievable by using ordinary software.
  - Hackbench: a benchmark for measuring the performance, overhead, and scalability of the Linux scheduler.
  - while true; do cat /proc/interrupts >/dev/null ; done
    => high CPU load in kernel mode
  - while true; do (( 3+4 )) >/dev/null ; done
    => high CPU load in user mode
Memory load - swapping
- Monitor tools: top, htop
- Load tool: use memoryload.c:
#include <stdlib.h>

/* Keep allocating memory without ever freeing it, so that the system
   runs out of RAM and eventually starts swapping. */
int main(void)
{
    while (1) {
        malloc(10240);
    }
}
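Compile and run it in the usual way (nothing Xenomai-specific here), e.g. gcc -o memoryload memoryload.c && ./memoryload, and watch the memory and swap usage grow in top or htop.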
Exercises
Exercise 10a.
Write a program to collect data about the real periodic scheduling of a task and plot this data.
Approach:
- Write a periodic real-time task which is scheduled every 100 microseconds and which executes 10 000 times.
- At the beginning of each run of the task, read the current time with "rt_timer_read" (which returns a value of type RTIME) and store this in a global array.
- After reading all values, compute the difference between each pair of successive times (which should be close to 100 000 ns) and store these differences in another array.
- Write this last array to a comma-separated file (.csv) of the form:
  0,value[0]
  1,value[1]
  2,value[2]
  ....
  For instance, by write_RTIMES("time_diff.csv",nsamples,time_diff);
void write_RTIMES(char *filename, unsigned int number_of_values, RTIME *time_values)
{
    unsigned int n = 0;
    FILE *file;

    file = fopen(filename, "w");
    while (n < number_of_values) {
        /* one "index,value" line per measurement */
        fprintf(file, "%u,%llu\n", n, time_values[n]);
        n++;
    }
    fclose(file);
}
- Open this .csv file in a spreadsheet (e.g., in Excel) and plot the data in a graph showing the measured value on the y-axis and the number of the measurement on the x-axis. [The comma in the .csv file should lead to two columns; if this does not work - maybe because of language settings - try replacing the comma by ";".]
Try the measurements first in VMware and next on a Linux PC. Describe the results.
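A minimal sketch of such a measurement program is shown below. It assumes the Xenomai 2 native skin and reuses the write_RTIMES helper from above; the names measure, NSAMPLES, and PERIOD_NS are our own. With the default one-shot timer the period argument of rt_task_set_periodic is in nanoseconds; check this for your configuration.

/* Sketch only: periodic measurement task for Exercise 10a (Xenomai 2
   native skin assumed). Compile with the flags reported by xeno-config
   (--xeno-cflags and --xeno-ldflags). */
#include <stdio.h>
#include <sys/mman.h>
#include <native/task.h>
#include <native/timer.h>

#define NSAMPLES  10000
#define PERIOD_NS 100000             /* 100 microseconds */

static RTIME times[NSAMPLES];

void measure(void *arg)
{
    unsigned int i;

    /* First release now, then every 100 us. */
    rt_task_set_periodic(NULL, TM_NOW, PERIOD_NS);
    for (i = 0; i < NSAMPLES; i++) {
        times[i] = rt_timer_read();  /* current time in ns */
        rt_task_wait_period(NULL);   /* sleep until the next release */
    }
}

int main(void)
{
    RT_TASK task;
    static RTIME time_diff[NSAMPLES - 1];
    unsigned int i;

    mlockall(MCL_CURRENT | MCL_FUTURE);   /* avoid page faults */
    rt_task_create(&task, "measure", 0, 99, T_JOINABLE);
    rt_task_start(&task, &measure, NULL);
    rt_task_join(&task);

    /* Differences of successive release times: close to 100 000 ns. */
    for (i = 0; i < NSAMPLES - 1; i++)
        time_diff[i] = times[i + 1] - times[i];
    write_RTIMES("time_diff.csv", NSAMPLES - 1, time_diff);
    return 0;
}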
Exercise 10b.
Use the spreadsheet to calculate the average value of the measured periods, the maximal and minimal deviation from the desired period (100 000 ns), and the standard deviation of the timing differences.
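If you want to cross-check the spreadsheet numbers, the same statistics can also be computed in C. A sketch of our own, where time_diff and RTIME are as in Exercise 10a and desired is the period of 100 000 ns:

#include <stdio.h>
#include <math.h>            /* link with -lm */
#include <native/timer.h>    /* for RTIME, as in Exercise 10a */

/* Call as: print_stats(time_diff, nsamples, 100000.0); */
void print_stats(const RTIME *d, unsigned int n, double desired)
{
    double sum = 0.0, mean, var = 0.0, dev;
    double max_dev = -1e18, min_dev = 1e18;
    unsigned int i;

    for (i = 0; i < n; i++) {
        sum += (double)d[i];
        dev = (double)d[i] - desired;   /* deviation from desired period */
        if (dev > max_dev) max_dev = dev;
        if (dev < min_dev) min_dev = dev;
    }
    mean = sum / n;

    for (i = 0; i < n; i++)             /* variance around the mean */
        var += ((double)d[i] - mean) * ((double)d[i] - mean);

    printf("mean %.1f ns, max deviation %.0f ns, min deviation %.0f ns, "
           "stddev %.1f ns\n", mean, max_dev, min_dev, sqrt(var / n));
}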
Exercise 10c.
Use the following combined.sh script to put Linux under a big load:
#!/bin/sh
ping -f localhost -s 65000 >/dev/null & # network load
while true; do cat /proc/interrupts >/dev/null ; done & # cpu load
while true; do ls -lR / > /tmp/list ; done &> /dev/null # disk load
- Look at the contents of this shell script (e.g. by "cat combined.sh"), run the script, and monitor the load increase.
- An option is to run the script in one console (./combined.sh) and to switch to another console with Ctrl-Alt-Fx to perform some monitoring. To stop the load, return to the first console and stop the script with Ctrl-C.
- An alternative is to start the script in the background (./combined.sh &), perform the monitoring on the same console, look at the running processes and their PIDs with "ps -a", and stop the load processes explicitly with "kill -9 <pid>".
- Note that not all monitoring tools are available under VMware.
- Start the shell script and then rerun the jitter measurements concurrently with the additional load. (Stop the load processes afterwards.)
- Describe and analyze the results.
Exercise 10d.
Use the special parallel port cable to connect two PCs running Xenomai to each other. This special cable connects, in both directions, the data line D0 of one machine to the interrupt line S6 of the other machine. Now write a program to measure the interrupt latency of a PC as follows:
- On both PC 1 and PC 2, keep data line D0 high.
- On PC 1, measure the time t1 and generate an interrupt on PC 2 by pulling data line D0 low and then high again.
- On PC 2, write an interrupt handler for the parallel port which generates an interrupt on PC 1 by pulling data line D0 low and then high again.
- On PC 1, write an interrupt handler for the parallel port which measures the current time t2.
Repeat this measurement 10 000 times, once every 100 us, and compute in each case the interrupt latency.
Similar to the exercise on scheduling jitter, write the results to a file, plot the measured latencies in a graph, and calculate the average latency and the standard deviation.
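A minimal sketch of the PC 1 side is given below, assuming the Xenomai 2 native skin, a parallel port at base address 0x378 with IRQ 7, and the write_RTIMES helper from Exercise 10a; all port addresses, bit masks, and names are our own assumptions and must be checked against the actual hardware. In user space, blocking on an RT_INTR object with rt_intr_wait takes the role of the interrupt handler, so the measured difference t2 - t1 is the full round trip, including the wakeup of the waiting task.

/* Sketch only: PC 1 side of the latency measurement (Xenomai 2 native
   skin assumed); run as root so that ioperm and rt_intr_create succeed. */
#include <stdio.h>
#include <sys/io.h>                   /* ioperm, outb */
#include <sys/mman.h>
#include <native/task.h>
#include <native/timer.h>
#include <native/intr.h>

#define LPT_BASE 0x378                /* data register; control at base+2 */
#define LPT_IRQ  7
#define NSAMPLES 10000

static RTIME latency[NSAMPLES];

void measure(void *arg)
{
    RT_INTR intr;
    RTIME t1;
    int i;

    rt_intr_create(&intr, "parport", LPT_IRQ, 0);
    rt_intr_enable(&intr);
    outb(0x10, LPT_BASE + 2);         /* enable parallel port interrupts */
    outb(0xFF, LPT_BASE);             /* keep D0 high */
    rt_task_set_periodic(NULL, TM_NOW, 100000);  /* one round per 100 us */

    for (i = 0; i < NSAMPLES; i++) {
        t1 = rt_timer_read();
        outb(0xFE, LPT_BASE);         /* pull D0 low ...                 */
        outb(0xFF, LPT_BASE);         /* ... and high again: IRQ on PC 2 */
        rt_intr_wait(&intr, TM_INFINITE); /* PC 2's answer interrupts us */
        latency[i] = rt_timer_read() - t1;  /* round-trip time */
        rt_task_wait_period(NULL);
    }
    rt_intr_delete(&intr);
}

int main(void)
{
    RT_TASK task;

    mlockall(MCL_CURRENT | MCL_FUTURE);
    ioperm(LPT_BASE, 3, 1);           /* access to the parallel port */
    rt_task_create(&task, "latency", 0, 99, T_JOINABLE);
    rt_task_start(&task, &measure, NULL);
    rt_task_join(&task);
    write_RTIMES("latency.csv", NSAMPLES, latency); /* from Exercise 10a */
    return 0;
}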
Last Updated: 26 September 2008 (Jozef Hooman)
Created by: Harco Kuppens, h.kuppens@cs.ru.nl