**Parallel** and **Distributed** Systems / 1 Pavel Krömer, Dept. of Computer Science, VSB – Technical University of 100 Ostrava EP., 



- Course introduction and organization
- Motivation
- Basic concepts of parallelism

Introduction

### Contact

- Location
  - office: FEI EA444
  - office hours: per e-mail/phone
  - office phone: +420597325898
- E-communication
  - pavel.kromer@vsb.cz



## **Course objectives**

- Introduction to the basic concepts of parallel and distributed platforms and algorithms
  - full range of parallel architectures (edge, multicore, distributed (HPC), cloud, edge and fog)
- Working knowledge of basic parallel algorithms and parallelization strategies
  - methods to solve computationally extensive problems from different application areas
- Practical introduction of selected parallel languages / language extensions
  - practical development of parallel codes

### Credit

- Active participation on lectures and seminars / tutorials
- Submission of a project on an assigned topic
  - list of indicative themes will be available
  - each topic individually approved with instructor
- Deadline
  - end of semester for final years

Q.IJ.E.S.I.I.O.N.S

Motivation

| Model         | Release                       | Price    | Fab             | Chiplets                          | Cores         | Core                  | (GHZ)  |       | Cache in MB |      | MB  | 1    |  |  |  |  |         |       |     |     |   |    |     |   |
|---------------|-------------------------------|----------|-----------------|-----------------------------------|---------------|-----------------------|--------|-------|-------------|------|-----|------|--|--|--|--|---------|-------|-----|-----|---|----|-----|---|
|               | date                          | (USD)    |                 |                                   | (Threads)     | config <sup>[i]</sup> | Base   | Boost | L1          | L2   | L3  |      |  |  |  |  |         |       |     |     |   |    |     |   |
| Mainstrea     | Mainstream Enterprise         |          |                 |                                   |               |                       |        |       |             |      |     |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9124 🖄        |                               | \$1,083  | TSMC            | 4 × <u>CCD</u><br>1 × <u>I/OD</u> | 16 (32)       | 4 × 4                 | 3.0    | 3.7   | 1           | 16   | 64  |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9224 🖄        |                               | \$1,825  |                 |                                   | 04 (40)       | 4 × 6                 | 2.5    | 3.7   | 1.5         | 24   | 04  |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9254 🖄        | November                      | \$2,299  |                 |                                   | 24 (48)       | 4 × 6                 | 2.9    | 4.15  | 1.5         |      | 128 |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9334 🖄        | 10, 2022                      | \$2,990  | N5              |                                   |               | 4 × 8                 | 2.7    | 3.9   |             |      | 120 |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9354 🖄        |                               | \$3,420  |                 | 8 × <u>CCD</u>                    | 32 (64)       | 0 × 4                 | 2.05   | 0.75  | 2           | 32   | 256 |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9354P 🖉       |                               | \$2,730  | 1 × <u>I/OD</u> | 1 × <u>I/OD</u>                   |               | 8 × 4                 | 3.25   | 3.75  |             |      | 200 | 1    |  |  |  |  |         |       |     |     |   |    |     |   |
| Performa      | ince Enterpr                  | rise     |                 |                                   |               |                       |        |       |             |      |     | -    |  |  |  |  |         |       |     |     |   |    |     |   |
| 9174F 🖄       | November<br>10, 2022          | \$3,850  |                 |                                   |               |                       | 4.1    | 4.4   |             | 10   | 256 |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9184X 🖄       | June 13,<br>2023 \$4,928      |          |                 | 16 (32)                           | 16 (32) 8 × 2 | 3.55                  | 4.2    | 1 16  | 16          | 768  |     |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9274F 🖉       | November                      | \$3,060  | TSMC            | 8 × <u>CCD</u><br>1 × <u>I/OD</u> | 24 (48)       | 8 × 3                 | 4.05   | 4.3   | 1.5         | 24   |     |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9374F 🖄       | 10, 2022                      | \$4,860  | N5              |                                   |               |                       | 3.85   | 4.3   |             |      | 256 |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9384X 🖄       | June 13,<br>2023 \$5,529      | _        |                 | 32 (64)                           | 8 × 4         | 3.1                   | 3.9    | 2     | 32          | 768  |     |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9474F 🖄       | November<br>10, 2022          | \$6,780  |                 |                                   |               |                       |        |       |             |      |     |      |  |  |  |  | 48 (96) | 8 × 6 | 3.6 | 4.1 | 3 | 48 | 256 | 1 |
| Cloud & I     | НРС                           |          |                 |                                   |               |                       |        |       |             |      |     |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9454 🖉        |                               | \$5,225  |                 |                                   | 48 (06)       | 0 × 0                 | 0.75   | 2.0   | 2           | 40   |     |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9454P 🖉       |                               | \$4,598  |                 |                                   | 48 (96)       | 8 × 6                 | 2.75   | 3.8   | 3           | 48   |     |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9534 🖄        |                               | \$8,803  |                 | 8 × <u>CCD</u><br>1 × <u>I/OD</u> |               |                       | 2.45   | 3.7   |             |      | 256 |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9554 🖄        | November                      | \$9,087  |                 | 1.00                              | 64 (128)      | 8 × 8                 | 2.4    | 0.75  | 4           | 64   |     | 1    |  |  |  |  |         |       |     |     |   |    |     |   |
| 9554P 🖉       | 10, 2022 \$7,104              | \$7,104  |                 |                                   |               |                       | 3.1    | 3.75  |             |      |     |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9634 🖄        | \$11,80<br>\$10,62<br>\$14,75 | \$10,304 | TSMC            | N5 12 ×                           | /IC           | 84 (168)              | 12 × 7 | 2.25  | 3.7         | 5.25 | 84  |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9654 🖄        |                               | \$11,805 | N5              |                                   |               |                       | 0.4    | 3.7   |             |      | 384 |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9654P 🖉       |                               | \$10,625 |                 | <u>CCD</u><br>1 × <u>I/OD</u>     | 96 (192)      | 12 × 8                | 2.4    |       | 6           | 96   |     |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9684X 🖄       |                               | \$14,756 |                 |                                   |               |                       |        | 2.55  | 3.7         |      |     | 1152 |  |  |  |  |         |       |     |     |   |    |     |   |
| 9734 🖄        |                               | \$9,600  | 8.4             | 8 x CCD                           | 8 x CCD       | 112 (224)             | 16 x 7 | 2.2   | 3.0         | 7    | 112 |      |  |  |  |  |         |       |     |     |   |    |     |   |
| 9754S 🖄       | 2023                          | \$10,200 |                 | 1 x I/OD                          | 128 (128)     | 16 0                  | 2.25   | 2.4   | 0           | 100  | 256 |      |  |  |  |  |         |       |     |     |   |    |     |   |
| <u>9754</u> 🗗 |                               | \$11,900 |                 |                                   | 128 (256)     | 16 x 8                |        | 3.1   | 8           | 128  |     |      |  |  |  |  |         |       |     |     |   |    |     |   |

AMD 5nm EPYC Genoa • EPYC 9374F, announced 10. 11. 2022 • 5nm Zen 4 core architecture • up to 96 cores/192 threads • 3.85 GHz / 4.3 GHz, 256 MB L3 cache designed for liquid-cooled spaces

| Model Release |                       | Price<br>(USD) | Fab                            | Chiplets                          | Cores Core<br>(Threads) config <sup>[i]</sup> |          | Clock rate<br>(GHz) |       | Cache in M |      | МВ   |     |  |
|---------------|-----------------------|----------------|--------------------------------|-----------------------------------|-----------------------------------------------|----------|---------------------|-------|------------|------|------|-----|--|
|               | date                  | (050)          |                                |                                   | (Threads)                                     | connges  | Base                | Boost | L1         | L2   | L3   |     |  |
| Mainstrea     | Mainstream Enterprise |                |                                |                                   |                                               |          |                     |       |            |      |      |     |  |
| 9124 🖄        |                       | \$1,083        |                                |                                   | 16 (32)                                       | 4 × 4    | 3.0                 | 3.7   | 1          | 16   | 64   |     |  |
| 9224 🖄        | \$1,825               |                | 4 × <u>CCD</u>                 | 24 (48)                           | 4 × 6                                         | 2.5      | 3.7                 | 1.5   | 24         | 04   |      |     |  |
| 9254 🖄        | November              | \$2,299        | TSMC<br>N5                     | 1 × <u>I/OD</u>                   | 24 (40)                                       | 4 × 6    | 2.9                 | 4.15  | 1.5        | 24   | 128  |     |  |
| 9334 🖉        | 10, 2022              | \$2,990        |                                |                                   | <u>CCD</u> 32 (64)                            | 4 × 8    | 2.7                 | 3.9   |            | 32   | 120  |     |  |
| 9354 🖄        |                       | \$3,420        |                                | 8 × <u>CCD</u>                    |                                               | 0 4      | 0.05                | 3.75  | 2          |      | 256  |     |  |
| 9354P 🖄       |                       | \$2,730        |                                | 1 × <u>I/OD</u>                   |                                               | 8 × 4    | 3.25                | 5.75  |            |      | 256  | 1   |  |
| Performa      | ince Enterpi          | rise           |                                |                                   |                                               |          |                     |       |            |      |      | -   |  |
| 9174F ピ       | November<br>10, 2022  | \$3,850        |                                |                                   | 16 (32)                                       | 8 × 2    | 4.1                 | 4.4   | 1          | 16   | 256  |     |  |
| 9184X IZ      | June 13,<br>2023      | \$4,928        |                                |                                   | 10 (52) 0 ^ 2                                 | 3.55     | 4.2                 |       |            | 768  |      |     |  |
| 9274F 🖉       | November              | \$3,060        | TSMC                           | 8 × <u>CCD</u>                    | 24 (48)                                       | 8 × 3    | 4.05                | 4.3   | 1.5        | 24   | 256  |     |  |
| 9374F 🖉       | 10, 2022              | \$4,860        | N5                             | N5 1 × <u>I/OD</u>                |                                               | 8 × 4    | 3.85                | 4.3   | 2          | 32   | 200  |     |  |
| 9384X 🖄       | June 13,<br>2023      | \$5,529        |                                |                                   | 32 (64)                                       |          | 3.1                 | 3.9   |            |      | 768  |     |  |
| 9474F ⊠       | November<br>10, 2022  | \$6,780        |                                |                                   |                                               | 48 (96)  | 8 × 6               | 3.6   | 4.1        | 3    | 48   | 256 |  |
| Cloud & I     | НРС                   |                |                                |                                   |                                               |          |                     |       |            |      |      |     |  |
| 9454 🖉        |                       | \$5,225        |                                |                                   | 49 (06)                                       | 0 × 0    | 0.75                | 2.0   | 2          | 40   |      |     |  |
| 9454P 🖉       |                       | \$4,598        |                                |                                   | 48 (96)                                       | 8 × 6    | 2.75                | 3.8   | 3          | 48   |      |     |  |
| 9534 🖄        |                       | \$8,803        |                                | 8 × <u>CCD</u><br>1 × <u>I/OD</u> |                                               |          | 2.45                | 3.7   |            |      | 256  |     |  |
| 9554 🖄        | November              | \$9,087        |                                |                                   | 64 (128)                                      | 8) 8 × 8 | 2.1                 | 2 75  | 4          | 64   |      |     |  |
| 9554P 🖄       | 10, 2022              | \$7,104        |                                |                                   |                                               |          | 3.1                 | 3.75  |            |      |      |     |  |
| 9634 🖄        | \$10,304              | \$10,304       |                                |                                   |                                               |          |                     |       | 5.25 84    |      |      |     |  |
| 9654 🖄        | \$11,8                |                | <sup>′</sup> ~ 10.000 EUR on a |                                   |                                               |          |                     | 4     |            | 384  |      |     |  |
| 9654P 🖄       |                       | \$10,          |                                | <b>C</b> 7                        | CZ e-shop                                     |          |                     |       |            | 6 96 |      |     |  |
| 9684X 🖄       |                       | \$14,7         |                                |                                   |                                               |          |                     | 3.7   |            |      | 1152 |     |  |
| 9734 🖄        | June 13,              | \$9            |                                | 8 X CCD                           |                                               | AT-      | 2.2                 | 3.0   | 7          | 112  |      |     |  |
| 9754S 🖄       | 2023                  | \$10,200       |                                | 1 x I/OD                          | 128 (128)                                     | 16 x 8   | 2.25                | 3.1   | 8          | 128  | 256  |     |  |
| <u>9754</u> 🖄 |                       | \$11,900       |                                | 1 4 1/00                          | 128 (256)                                     | 16 x 8   | 2.25                | 5.1   | 0          | 120  |      |     |  |

AMD 5nm EPYC Bergamo announced June 2023 • 5nm Zen 4 core architecture 128 cores and 256 threads • 2.2(5) GHz / 3.(1) GHz • giant TDP (up to 320-360W) clock rate vs core count opt. versions

### Zen 5 microarchitecture in 2024

#### NVIDIA TeslaV100 Tensor Core GPU (Volta)

- 5,120 CUDA cores; 640 Tensor cores; 32GB HDM2;
- Deep learning: TensorFlow, PyTorch, theano, Caffe2; Applications: OpenFoam etc.

NVIDIA A100 Tensor Core GPU (Ampere)

• 6,912 CUDA cores; FP64/32/16/BFLOAT16/INT8, 432 tensor cores; 40/80GB HDM2;

NVIDIA H100 Tensor Core GPU (Hopper)

• 18,432 CUEA cores; FP64/32/16/BFLOAT16/INT8, 640 tensor cores; 80GB HDM2;

GEFORCE RTA

#### **TITAN RTX (2019)**

• 4,608 CUDA cores; 576 Tensor cores; 24 GB GDDR6; NVIDIA's CUDA-X AI SDK; 70,000 CZK RTX 4090 (2023)

• 16,384 CUDA cores; 512 Tensor cores (4th gen); 24 GB GDDR6X; cca 54,000 CZK

42 Years of Microprocessor Trend Data



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2017 by K. Rupp

### Moore's Law: The number of transistors on microchips doubles every two years

Moore's law describes the empirical regularity that the number of transistors on integrated circuits doubles approximately every two years. This advancement is important for other aspects of technological progress in computing – such as processing speed or the price of computers.



Edge and Fog computing Fog computing pushes intelligence down to the local area network level of network architecture, processing data in a fog node or IoT gateway. Edge computing pushes the intelligence, processing power and communication capabilities of an edge gateway or appliance directly into devices like programmable automation controllers (PACs)

 Typical applications
 Intelligent Transportation Management (ITS)

FCC ID: OWS

Type FO FORM 2S CL200

Sm

AR 60 min.

DS

- Industrial & Commercial Networking
- Smart Metering for Utilities
- Autonomous Vehicles

### Parallelism and Concurrency

- Multitasking OS
  - Allows the execution of more than one process (computer programme + resources in memory) at a time



- The number of tasks can exceed the number of physical cores / threads
- Multitasking OS has all the components necessary to handle this situation by task switching

| • | Process |
|---|---------|
|---|---------|

Register

Counter

Heap

Code

Stack

- runnable software is in the process model organized into sequential processes
- a running program, including the current values of the program counter, registers, and variables. It has a virtual CPU and shares physical CPU(s) via task-switching
- processes are independent (isolated by the OS) and communicate via inter-process communication (IPC)
- Task-switching
  - when more processes require a shared resource (CPU), fast task switching can give an illusion of parallelism (aka. pseudo-parallelism, concurrency)



### Process states

- a process can be in one of several internal states:
  - Running -> Waiting: process blocks for input
  - Waiting -> Ready: input becomes available
  - Ready -> Running: scheduler picks the process and allocates CPU
  - Running -> Ready: scheduler removes process from CPU
- all is done according to a specific scheduling algorithm

### • Thread

- a unit of (parallel) execution within a process (1-N)
- each thread in a process shares its resources (that makes them more lightweight)
- communication between threads is cheaper than IPC
- problem of one thread can easily affect other threads, too (and the whole process might have to be killed)
- Properties
  - each thread has its own program counter, registers, stack
  - and shares data space, address space, resources (and limits)

| d | in     | Process |        |      |
|---|--------|---------|--------|------|
|   | Thread | Thread  | Thread | Time |

| Single Thread   | Multi Threaded  |                 |  |  |  |
|-----------------|-----------------|-----------------|--|--|--|
| Неар            | He              | ар              |  |  |  |
| Registers Stack | Registers Stack | Registers Stack |  |  |  |
| Code            | Code            |                 |  |  |  |
| Thread          | Thread          | Thread          |  |  |  |

## Threads vs. processes

#### Process

- Processes are heavyweight operations
- Each process has its own memory space
- Inter-process communication is slow as processes have different memory addresses
- Context switching between processes is more expensive
- Processes don't share memory with other processes

### Thread

- Threads are lighter weight operations
- Threads use the memory of the process they belong to
- Inter-thread communication can be faster than inter-process communication because threads of the same process share memory with the process they belong to
- Context switching between threads of the same process is less expensive
- Threads share memory with other threads of the same process

## Inter-process communication (IPC)

- Communication between cooperating processes, preferably in a well-structured way and not using interrupts. It must deal with:
  - information passing,
  - clash of activities
  - dependencies
- Usually implemented by
  - shared memory
  - message passing
  - -> pipes and named pipes; message queueing; semaphores; shared memory; and sockets
- IPC concepts and issues
  - synchronization, critical regions, mutual exclusion, semaphores, mutexes, monitors, barriers
  - race conditions, deadlocks, busy waiting

# Inter-process communication (IPC)

#### • Deadlocks

- a set of processes is deadlocked if each process in the set is waiting for an event that only another process in the set can cause.
- a situation in which one process, A, waits for a resource exclusively used by another process, B, which, in turn, waits for another resource owned by A. Both processes wait forever.

#### Conditions for a deadlock are:

- mutual exclusion,
- hold and wait,
- no preemption, and
- circular wait
- Dealing with deadlocks:
  - do nothing
  - detection and recovery
  - avoidance by careful resource allocation
  - preventing (one) of the conditions for deadlock



### Parallelism vs. concurrency

- Parallelism vs. concurrency
  - Parallelism: using multiple processors/cores running at the same time. Property of the machine (parallel machine).
  - Concurrency: non-determinacy due to interleaving threads. Property of the application (concurrent tasks).

|             |          | Concurrency                  |                        |  |  |
|-------------|----------|------------------------------|------------------------|--|--|
|             |          | sequential                   | concurrent             |  |  |
| Parallelism | serial   | Traditional programming      | Traditional<br>OS      |  |  |
| Parallelism | parallel | Deterministic<br>parallelism | General<br>parallelism |  |  |

### Parallel vs. sequential programme

brings more challenges, including IPC, scalability, portability

```
Sequential vector add
```

```
void vector add(double* a, double *b, const unsigned int len)
ſ
  for (unsigned int i = 0; i < len; i++)</pre>
    a[i] += b[i];
}
int main (void)
  const unsigned int len = 200;
  double a[len];
  double b[len];
  rand_vec(a, len);
  rand_vec(b, len);
  vector_add(a, b, len);
        Summary
  retur
}
```

Parallel vector add (pseudocode)

```
void vector add p(double* a, double *b, const unsigned int len)
                                                   ſ
                                                     const unsigned int from = THREAD_ID == 1 ? 0: len / 2;
                                                     const unsigned int to = THREAD_ID == 1 ? len/2 : len;
                                                     for (unsigned int i = 0; i < len; i++)</pre>
                                                       a[i] += b[i];
                                                   }
                                                   int main (void)
                                                   ſ
                                                     const unsigned int len = 200;
                                                      double a[len];
                                                     double b[len];
                                                     rand_vec(a, len);
                                                     rand_vec(b, len);
                                                     vector_add_p(a, b, len); // Parallel section

    parallel code is intuitive, but requires additional attention and
```

# Parallel vs. sequential programming

- Summary
  - parallel code is intuitive\*, but requires additional attention
  - brings more challenges, including IPC, scalability, portability
- Parallel programming is a two-step process
  - design a work-efficient, low-span parallel algorithm
  - implement it on the target hardware
- In reality: many systems require different code to implement the algorithm efficiently
  - huge effort to generate efficient portable parallel code.

\* a manual implementation of Quicksort in MPI can be 1700 lines of code, and about the same in CUDA

## Parallel vs sequential programming

- Problem solving based on parallel thinking
  - Recognizing true dependences (cw. sequential programming)
  - Parallel algorithm design
    - operations on aggregates: map/reduce/scan
    - divide & conquer, contraction
    - viewing computation as DAG (based on dependences)