To dramatically speed up your computations, you can write a program that uses multiple CPU cores in parallel instead of a single core. Depending on the type of problem, the speed-up can scale linearly with the number of cores: on a quad-core machine, for example, your program could run up to four times faster. On clusters with hundreds or thousands of CPUs, the benefits of parallel computing can be substantial. Various techniques are available to implement parallel computing, such as OpenMP, Python multiprocessing, or MPI.
Parallel computing methods
One of the easiest ways to parallelize a program is with OpenMP. It is available for C, C++, and Fortran, and is supported by most modern compilers.
To activate OpenMP on shared-memory systems (i.e., on a single computer), you only need to set a compiler flag. Adding a single line above a for-loop will then run that loop in parallel. The only limitation in this situation is the number of cores in the CPU.
The example below illustrates a simulation of falling and bouncing particles, parallelized at the physics step.
The particle positions are initialized using the C++ random number generator. This part is kept serial because the random number generator is not thread-safe.
Check the code: Intialize_particles.txt
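As a rough sketch of what such a serial initialization might look like (the function and variable names below are illustrative, not taken from Intialize_particles.txt), the positions are filled by a single generator shared across one loop:

#include <cstddef>
#include <random>
#include <vector>

// Illustrative layout: particle positions stored in two coordinate vectors.
void initialize_particles(std::vector<double>& x, std::vector<double>& y)
{
    // One generator is shared by the whole loop; std::mt19937 is not
    // thread-safe, so this initialization stays serial.
    std::mt19937 gen(12345);
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = dist(gen);
        y[i] = dist(gen);
    }
}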
The actual physics is done by continuously updating the particle positions. By adding a #pragma statement above the for-loop, this part is run in parallel.
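As an illustrative sketch (the variable names and the exact physics are placeholders, not the actual example code), the parallelized update loop could look like this:

#include <vector>

// Hypothetical update step: gravity plus a bounce off the floor.
// The only OpenMP-specific addition is the #pragma line above the loop.
void update_positions(std::vector<double>& x, std::vector<double>& y,
                      std::vector<double>& vx, std::vector<double>& vy,
                      double dt)
{
    const double g = 9.81;
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(x.size()); ++i) {
        vy[i] -= g * dt;      // apply gravity
        x[i]  += vx[i] * dt;  // advance positions
        y[i]  += vy[i] * dt;
        if (y[i] < 0.0) {     // bounce off the floor
            y[i]  = -y[i];
            vy[i] = -vy[i];
        }
    }
}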
Compiling programs that use OpenMP requires a compiler flag. Its name depends on the compiler, but it is usually of the form -fopenmp or -openmp. For example, the code above can be compiled with:
g++ main.cpp -fopenmp
See these tutorials for more information on how to program with OpenMP.
The example above was relatively straightforward to run in parallel because the particles were not interacting with each other.
Check the code: simulate_physics.txt
Interactions complicate things because multiple cores then access the positions of the same particles simultaneously, which can lead to undesired effects such as race conditions. There are also limits to how much speed-up can be obtained, because there is always some overhead associated with parallel computing. For instance, the example above does not scale linearly with the number of threads; in fact, the program can become slower when more threads are added. It is therefore better to profile your program to see what the actual speed-up is, especially if you are considering RAC applications at Compute Canada.
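As a small illustration of the kind of problem shared data can cause (this is not part of the bouncing-particle example), consider summing a quantity over all particles. Without the reduction clause below, every thread would update the same accumulator at the same time, a race condition; with it, OpenMP gives each thread a private copy and combines them at the end.

#include <vector>

// Illustrative only: total kinetic energy of all particles (mass = 1).
double total_kinetic_energy(const std::vector<double>& vx,
                            const std::vector<double>& vy)
{
    double total = 0.0;
    // The reduction clause avoids a race on the shared accumulator.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < static_cast<long>(vx.size()); ++i) {
        total += 0.5 * (vx[i] * vx[i] + vy[i] * vy[i]);
    }
    return total;
}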
A plot of the above example on an Intel Xeon E5520 processor reveals that while a speed-up of almost 2x can be obtained by using four threads, there is no gain when using more.
The plot was generated by timing the runtime of the program with different numbers of threads, which can be specified using the environment variable OMP_NUM_THREADS.
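For example, assuming the executable produced by the compile command above is the default a.out, the runtime with four threads could be measured with:

export OMP_NUM_THREADS=4
time ./a.out

Repeating this with different values of OMP_NUM_THREADS gives the timings for the plot.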