MPI is a protocol for passing messages between the nodes of a cluster. Using the capabilities provided by an MPI implementation, we can combine several independent computers to solve problems that are too large or complicated to be solved on a single computer. The cluster we've built so far makes multiple computers look (mostly) like one very large computer. Not entirely, of course; that's where MPI comes in. To use all the nodes to solve a single problem, you need to do some special programming using MPI.
There are several implementations of MPI available, some free and some commercial. One of the most popular is OpenMPI, and that's what we'll use as our MPI implementation. Check the OpenMPI web site for the latest version and start there, unless you have a specific reason to use an older release. Keep in mind that once you establish a version of OpenMPI on your cluster, some users will always depend on that version, so you will end up supporting it, along with every version you add after it.
At the time I rebuilt my Pi cluster, the latest version of OpenMPI available was 3.1.2, and we downloaded the source when we gathered all the initial software. All we need to do now is build it. A word of warning: OpenMPI has extensive optimizations and a self-tuning feature that runs when you build it. This optimizes the code for your specific cluster, but it can take a long time to finish. Be prepared for a long build cycle. Here's how I built the software. The output from the configure and make steps is extensive, so it's available in the Command Output subpage.
admin@baker:~> cd /apps/source/openmpi-3.1.2/
admin@baker:/apps/source/openmpi-3.1.2> /usr/bin/time ./configure --prefix=/apps/openmpi-3.1.2 --with-slurm --enable-heterogeneous --disable-dlopen |& tee configure.log
499.47user 235.47system 11:27.92elapsed 106%CPU (0avgtext+0avgdata 41060maxresident)k
0inputs+234944outputs (0major+19268558minor)pagefaults 0swaps
admin@baker:/apps/source/openmpi-3.1.2> /usr/bin/time make -j4 |& tee make.log
4843.28user 657.95system 35:09.42elapsed 260%CPU (0avgtext+0avgdata 75112maxresident)k
83560inputs+314136outputs (61major+30312069minor)pagefaults 0swaps
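One thing the listing above doesn't show is the install step. Configure and make don't copy anything into /apps by themselves, so you need an install pass along these lines (the exact invocation isn't captured in the output above) before the module file below will have anything to point at:
admin@baker:/apps/source/openmpi-3.1.2> make install |& tee install.log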
My configuration options for OpenMPI are not very complicated. I enable heterogeneous mode because I may want to add new and different hardware to this cluster at some point. I also explicitly disable the dlopen() function, because it wasn't available or didn't work correctly on earlier Raspberry Pi builds. I should probably check whether it's available and working now, but as you can see from the configure and build times, that's an expensive (in terms of time) experiment to run.
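If you ever need to double-check which options a particular OpenMPI installation was built with, ompi_info reports the original configure line. A quick check (assuming the openmpi module described below is loaded, so the binary is on your PATH) looks like this:
admin@baker:~> ompi_info | grep -i "configure command"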
Now that our software is installed, we need a module file for it. We can use our standard template as a starting point, but OpenMPI is a little more complicated. OpenMPI isn't just a normal application; it's an application development environment. We need to define the compilers we will use and where the include files are. OpenMPI is usually used with C and C++ compilers, as well as Fortran compilers. We haven't installed any Fortran compilers yet, so we'll skip that part for now and just focus on C/C++. This is what my initial module file looks like:
#%Module1.0
## openmpi modulefile
##
proc ModulesHelp { } {
    puts stderr "This module sets up the user environment for OpenMPI"
}
module-whatis "Configures a user environment to use OpenMPI"
set curMod [module-info name]
conflict openmpi
setenv CC mpicc
setenv CXX mpiCC
setenv MPI_ROOT /apps/openmpi-3.1.2
setenv MPI_HOME /apps/openmpi-3.1.2
prepend-path PATH /apps/openmpi-3.1.2/bin
prepend-path INCLUDE /apps/openmpi-3.1.2/include
prepend-path LIBPATH /apps/openmpi-3.1.2/lib64
prepend-path LD_LIBRARY_PATH /apps/openmpi-3.1.2/lib64
prepend-path MANPATH /apps/openmpi-3.1.2/share/man
Save this as /apps/modules-4.2.0/picluster/openmpi/openmpi-3.1.2, and save the following file as /apps/modules-4.2.0/picluster/openmpi/.version:
#%Module1.0
##
set ModulesVersion "openmpi-3.1.2"
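With both files in place, it's worth a quick sanity check (assuming /apps/modules-4.2.0/picluster is already on your MODULEPATH) that the module system sees the new files and that loading the module puts the OpenMPI tools on your path:
admin@baker:~> module avail openmpi
admin@baker:~> module add openmpi
admin@baker:~> which mpicc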
Now we can use the openmpi module to verify that MPI itself works. Save the following code as shownodes.c. This is a fairly simple MPI program that uses a couple of common MPI functions (send and receive) to print out every node we're running on. This verifies that our MPI compiler is working and that MPI can talk between the nodes.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"
/*
* Simple MPI test program that uses a MPI_Send/MPI_Recv to print out the name of
* the node of every MPI process.
*/
int main (int argc, char *argv[])
{
    int id, np;
    char name[MPI_MAX_PROCESSOR_NAME];
    char message[128];
    int namelen;
    MPI_Status status;
    int i;
    void nodesused ();

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &np);
    MPI_Comm_rank (MPI_COMM_WORLD, &id);
    MPI_Get_processor_name (name, &namelen);

    /* ------------------------------------------------------------------- */
    /* The following code prints the name of the run directory: */
    if (id == 0)
    {
        printf ("\n");
        printf ("Running in directory\n");
        system ("/bin/pwd");
        printf ("\n");
        fflush (stdout);
    }

    /* ------------------------------------------------------------------- */
    /* The following code tests MPI_Send / MPI_Recv: */
    nodesused ();

    /* ------------------------------------------------------------------- */
    MPI_Finalize ();
    return (0);
}

/* --------------------------------------------------------------------- */
void nodesused (void)
{
    int id, np;
    char name[MPI_MAX_PROCESSOR_NAME];
    int namelen;
    MPI_Status status;
    int i;
    char message[128];

    MPI_Comm_size (MPI_COMM_WORLD, &np);
    MPI_Comm_rank (MPI_COMM_WORLD, &id);
    MPI_Get_processor_name (name, &namelen);

    sprintf (message, "Process %2d out of %2d running on host %s", id, np, name);
    /* uncomment for testing: */
    /* printf ("[%2d] %s\n", id, message); */

    if (id == 0)
    {
        printf ("\n");
        /* print own message first: */
        printf ("[%2d] %s\n", id, message);
        /* receive and print messages from other processes: */
        for (i = 1; i < np; i++)
        {
            MPI_Recv (message, 128, MPI_CHAR, i, i, MPI_COMM_WORLD, &status);
            printf ("[%2d] %s\n", id, message);
        }
        fflush (stdout);
    }
    else /* (id > 0) */
    {
        MPI_Send (message, strlen(message)+1, MPI_CHAR, 0, id, MPI_COMM_WORLD);
    }
}
Use the following file as the job script. It requests 16 processor cores, so it should run four processes on every node in our cluster, using every available core. Save the following file as shownodes.sh.
#!/bin/sh
#SBATCH --job-name shownodes
#SBATCH -o shownodes.out.%J
## #SBATCH -N 4
#SBATCH -n 16
/bin/echo
/bin/echo Execution host: `hostname`.
/bin/echo Directory: `pwd`
/bin/echo Date: `date`
module add openmpi
mpirun ./shownodes
echo Job end time: `date`
sleep 10
Now compile the code and submit it to the job manager. Here's what it looks like:
admin@baker:~/dev> module add openmpi
admin@baker:~/dev> module add slurm
admin@baker:~/dev> mpicc -o shownodes shownodes.c
admin@baker:~/dev> sbatch shownodes.sh
Submitted batch job 42
admin@baker:~/dev> squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
42 prod shownode admin R 0:03 4 compute[01-04]
admin@baker:~/dev> squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
admin@baker:~/dev> cat shownodes.out.42
Execution host: compute01.
Directory: /home/admin/dev
Date: Tue Nov 20 20:43:31 CST 2018
Running in directory
/home/admin/dev
[ 0] Process 0 out of 16 running on host compute01
[ 0] Process 1 out of 16 running on host compute01
[ 0] Process 2 out of 16 running on host compute01
[ 0] Process 3 out of 16 running on host compute01
[ 0] Process 4 out of 16 running on host compute02
[ 0] Process 5 out of 16 running on host compute02
[ 0] Process 6 out of 16 running on host compute02
[ 0] Process 7 out of 16 running on host compute02
[ 0] Process 8 out of 16 running on host compute03
[ 0] Process 9 out of 16 running on host compute03
[ 0] Process 10 out of 16 running on host compute03
[ 0] Process 11 out of 16 running on host compute03
[ 0] Process 12 out of 16 running on host compute04
[ 0] Process 13 out of 16 running on host compute04
[ 0] Process 14 out of 16 running on host compute04
[ 0] Process 15 out of 16 running on host compute04
Job end time: Tue Nov 20 20:43:32 CST 2018
admin@baker:~/dev>
Everything looks good. We've built a real cluster!
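As an aside, if you ever want to test MPI without going through the job manager, mpirun can be pointed at hosts directly. A minimal sketch (assuming your home directory is shared across the nodes and you can SSH to the compute nodes without a password) would be:
admin@baker:~/dev> mpirun -np 4 -H compute01,compute02,compute03,compute04 ./shownodes
That launches one process on each listed host, which is enough to confirm that MPI itself can reach every node.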
A few things to note here. In my job script, I've specified the name of the output file as shownodes.out.%J. This overrides the default output file name of slurm-<job_number>.out, which is quite handy for sorting your output files.
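SLURM supports several other substitutions in that file name, and you can also split stderr into its own file. I don't use these in this script; they're shown just as an illustration:
#SBATCH -o shownodes.out.%j
#SBATCH -e shownodes.err.%j
Here %j expands to the job ID, and the -e directive sends standard error to a separate file.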
You can see two lines that specify how many nodes or cores we want. The job manager looks for #SBATCH at the start of a line to identify directives that tell it how the job should be run, and you can easily disable a directive by adding extra # characters at the beginning of the line. I have two directives that tell the job manager how many machines the job should use:
## #SBATCH -N 4
#SBATCH -n 16
The first line (commented out) asks for four nodes. The second line asks for 16 cores. Both allocate all nodes in our little cluster, but the first line will allocate a single core on each node, and the second will allocate all (four) cores on each node. The first comes in handy when you need to use all the memory on a node and aren't as concerned about processing cores.
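You can also combine the two ideas and spell out the layout explicitly with SLURM's --ntasks-per-node option (again, just an illustration; this script doesn't need it):
#SBATCH -N 4
#SBATCH --ntasks-per-node=4
That asks for four nodes and four tasks on each of them, sixteen in total.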
At the beginning of the job script, I print out the host that we're running on, the directory we're running in, and the date. At the end of the job, I print the date again. That gives me another place to track how much time the job took (in addition to what the job manager tells me). It can come in handy if you have to pay for your compute time, and the job manager tells you something that doesn't seem reasonable. The execution hostname is handy for tracking down problems if your job crashes. The running directory is handy to know, since some job managers don't start from your current directory. SLURM does, but explicitly printing out the current working directory can help you quickly resolve file not found errors.
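If SLURM accounting happens to be configured on your cluster (an assumption; we haven't covered setting it up), sacct will give you the scheduler's own view of the same timing after the job finishes:
admin@baker:~/dev> sacct -j 42 --format=JobID,JobName,Elapsed,State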
I explicitly add the openmpi module. When your job starts running, it has a default login environment with no modules loaded. You need to explicitly load any modules you need to use.
At the end of the job, I run a sleep command (sleep 10) for 10 seconds. I only do this for test jobs. It gives me a little time to examine the job in the queue, and lets me experiment on a running job (suspending it, cancelling it, modifying job parameters on the fly, and so on).
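A few of the standard SLURM commands you can try during that window, using job 42 from the run above as an example:
admin@baker:~/dev> scontrol suspend 42
admin@baker:~/dev> scontrol resume 42
admin@baker:~/dev> scontrol update JobId=42 TimeLimit=00:10:00
admin@baker:~/dev> scancel 42
The scontrol update line changes a job parameter (here, the time limit) while the job is queued or running.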
Let's move on to a little testing and gather some metrics.