This document describes techniques for testing your network configuration in case of MPI problems.
If MPI accelerated simulations with CST STUDIO SUITE do not work as expected, the cause might be an erroneous network configuration. If you encounter problems, please work carefully through this troubleshooting guide.
Test Connection Between MPI Daemons
Select Interconnect Used for MPI Communication
Enable MPI Debug Output (Windows only)
MPI Client Nodes need to "see" each other on the network. To check whether this is the case, please execute the command
ping <nodeB>
on <nodeA> and the command
ping <nodeA>
on <nodeB>.
If both hosts can communicate over the network, you should see a reply from the remote node and no error message.
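The mutual ping test above can also be scripted. The following is a minimal Python sketch of such a check; the hostnames and the helper function are our own placeholders, not part of CST STUDIO SUITE:

```python
# Sketch of the mutual ping test; "nodeA"/"nodeB" are placeholders.
import platform
import subprocess

def can_ping(host: str) -> bool:
    """Return True if a single ping to `host` succeeds."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"  # packet count flag
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    except OSError:  # ping executable not available
        return False

if __name__ == "__main__":
    for host in ("nodeA", "nodeB"):  # replace with your hostnames
        print(host, "is reachable" if can_ping(host) else "is NOT reachable")
```

Run the script on both nodes so the connectivity is verified in both directions.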
Make sure DNS is correctly configured unless you are using a static name resolution mechanism via a host file. DNS is correctly configured if you can resolve hostnames to IP addresses.
To check whether this is the case, please execute the command:
nslookup <nodeA>
nslookup will tell you which DNS server it is using and to which IP address the hostname <nodeA> is resolved. Check that you can also do a reverse lookup on this IP, i.e. execute the command
nslookup <IP-you-got>
Perform these lookups for both hostnames on both nodes.
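The forward and reverse lookups can be combined into one small scripted check. The sketch below uses Python's standard resolver functions; "localhost" merely stands in for your real hostnames:

```python
# Sketch of the forward/reverse DNS check using the system resolver.
# "localhost" is a placeholder for your real hostnames (<nodeA>, <nodeB>).
import socket

def check_dns(hostname: str) -> None:
    ip = socket.gethostbyname(hostname)    # forward lookup
    print(f"{hostname} resolves to {ip}")
    name, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    print(f"{ip} resolves back to {name}")

if __name__ == "__main__":
    for host in ("localhost",):  # replace with your node hostnames
        check_dns(host)
```

As with nslookup, run the check for both hostnames on both nodes.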
Make sure that the CST STUDIO SUITE installation folders can be accessed on all nodes and that the user account you use for your MPI simulations has the permission to execute programs located in these folders. To check this, execute the following program on each of the nodes:
Windows: <CST_DIR>\CSTMPIClusterInfo.exe
Linux: <CST_DIR>/Linux32/CSTMPIClusterInfo
It should tell you the version of your installation of CST STUDIO SUITE.
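If you prefer a scripted variant of this permission check, the following Python sketch tests whether a given file exists and is executable by the current user; the path shown is only a placeholder for your <CST_DIR>:

```python
# Sketch of a permission check; the path is a placeholder for <CST_DIR>.
import os

def is_executable(path: str) -> bool:
    """True if `path` is a file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)

if __name__ == "__main__":
    tool = "/opt/cst/Linux32/CSTMPIClusterInfo"  # placeholder path
    print(tool, "is executable" if is_executable(tool) else "is missing or not executable")
```

Run it under the same user account that you use for MPI simulations, since permissions may differ between accounts.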
The next step is to check whether the MPI daemon can start processes on each of the nodes. Log in to the first node and try to execute a program on the second node. For example, execute the following command in your CST STUDIO SUITE installation folder:
Windows: mpiexec -n 1 -host <nodeB> hostname
which should give you the hostname of <nodeB>.
If that works, try the same with the CSTMPIClusterInfo.exe:
Windows: mpiexec -n 1 -host <nodeB> <CST_DIR>\CSTMPIClusterInfo.exe
Check that you can log in from the head node to all other nodes without a password, i.e. execute
ssh <nodeB> hostname
which should give you the hostname of <nodeB> without prompting for a password or anything else.
On Linux, likewise check whether the MPI daemon can start processes on each of the nodes. Log in to the first node and try to execute a program on the second node. To execute a program remotely on e.g. <nodeB>, first create a file named 'mpd.hosts' containing just a single line with the hostname on which you want to run a program, in this case <nodeB>; the location of this file is specified on the command line later on. Then execute the following command in the Linux32 subfolder of your CST STUDIO SUITE installation:
./mpirun -r ssh -f /your/path/mpd.hosts -n 1 -host <nodeB> hostname
which should give you the hostname of <nodeB>. If that works, try the same with the CSTMPIClusterInfo program:
./mpirun -r ssh -f /your/path/mpd.hosts -n 1 -host <nodeB> <CST_DIR>/Linux32/CSTMPIClusterInfo
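Since mpd.hosts is just a plain list of hostnames, one per line, it can also be generated with a short script. A minimal sketch (the hostnames are placeholders):

```python
# Sketch: generate an mpd.hosts file; the hostnames are placeholders.
def write_mpd_hosts(path: str, hostnames: list[str]) -> None:
    """Write one hostname per line, the format expected by mpirun -f."""
    with open(path, "w") as f:
        for host in hostnames:
            f.write(host + "\n")

if __name__ == "__main__":
    # one line for the single-node test above; pass two hostnames
    # for the two-node CST MPI Performance Test
    write_mpd_hosts("mpd.hosts", ["nodeB"])
```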
Next, try to execute the CST MPI Performance Test by running it from the installation directory. On Linux, first create a file 'mpd.hosts' containing two lines with the hostnames of the two nodes on which you want to run the test.
MS Windows (32 Bit):
mpiexec -hosts 2 <nodeA> <nodeB> <CST_DIR>\CSTMPIPerformanceTest.exe
MS Windows (64 Bit):
mpiexec -hosts 2 <nodeA> <nodeB> <CST_DIR>\AMD64\CSTMPIPerformanceTest_AMD64.exe
Linux (32 Bit):
mpirun -r ssh -f /your/path/mpd.hosts
-n 1 -host <nodeA> <CST_DIR>/Linux32/CSTMPIPerformanceTest :
-n 1 -host <nodeB> <CST_DIR>/Linux32/CSTMPIPerformanceTest
Linux (64 Bit):
mpirun -r ssh -f /your/path/mpd.hosts
-n 1 -host <nodeA> <CST_DIR>/LinuxAMD64/CSTMPIPerformanceTest_AMD64 :
-n 1 -host <nodeB> <CST_DIR>/LinuxAMD64/CSTMPIPerformanceTest_AMD64
This tool first tries to establish an MPI connection between the two nodes. In case of success, it measures the latency and bandwidth of your interconnect.
If this tool works, MPI enabled simulations with CST STUDIO SUITE should also work.
The best available interconnect is normally chosen automatically. However, for troubleshooting it can be useful to select the interconnect used by MPI manually. This can be done via the environment variable I_MPI_FABRICS.
To select TCP/IP (Ethernet) as interconnect, set I_MPI_FABRICS to "tcp":
MS Windows (64 Bit):
set I_MPI_FABRICS=tcp
mpiexec -hosts 2 <nodeA> <nodeB> <CST_DIR>\AMD64\CSTMPIPerformanceTest_AMD64.exe
Linux (64 Bit):
export I_MPI_FABRICS=tcp
mpirun -r ssh -f /your/path/mpd.hosts
-n 1 -host <nodeA> <CST_DIR>/LinuxAMD64/CSTMPIPerformanceTest_AMD64 :
-n 1 -host <nodeB> <CST_DIR>/LinuxAMD64/CSTMPIPerformanceTest_AMD64
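For scripted troubleshooting, the fabric can also be selected by setting the environment variable programmatically before launching the process. The Python sketch below illustrates this; the mpirun command line, paths, and hostnames are placeholders copied from the examples above:

```python
# Sketch: run a command with I_MPI_FABRICS set in its environment.
# Paths, hostnames, and the mpirun command line are placeholders.
import os
import subprocess

def run_with_fabric(fabric: str, command: list[str]) -> int:
    """Run `command` with I_MPI_FABRICS=<fabric>; return its exit code."""
    env = dict(os.environ, I_MPI_FABRICS=fabric)
    return subprocess.run(command, env=env).returncode

if __name__ == "__main__":
    cmd = [
        "./mpirun", "-r", "ssh", "-f", "/your/path/mpd.hosts",
        "-n", "1", "-host", "nodeA", "CSTMPIPerformanceTest_AMD64", ":",
        "-n", "1", "-host", "nodeB", "CSTMPIPerformanceTest_AMD64",
    ]
    try:
        print("exit code:", run_with_fabric("tcp", cmd))
    except OSError:
        print("mpirun not found on this machine")
```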
To select InfiniBand as interconnect, set I_MPI_FABRICS to "dapl" (Windows) or "ofa" (Linux):
MS Windows (64 Bit):
set I_MPI_FABRICS=dapl
mpiexec -hosts 2 <nodeA> <nodeB> <CST_DIR>\AMD64\CSTMPIPerformanceTest_AMD64.exe
Linux (64 Bit):
export I_MPI_FABRICS=ofa
mpirun -r ssh -f /your/path/mpd.hosts
-n 1 -host <nodeA> <CST_DIR>/LinuxAMD64/CSTMPIPerformanceTest_AMD64 :
-n 1 -host <nodeB> <CST_DIR>/LinuxAMD64/CSTMPIPerformanceTest_AMD64
If the simulation cannot be started even though the CST MPI Performance Test works, try enabling debug logging of the smpd service. To do so, log in to one of your compute nodes, go to the installation folder of the CST MPI Service, and tell the smpd service to produce logging output:
smpd -traceon <logfilename>
Then try to start an MPI enabled simulation on that node. After running the simulation, disable logging again by executing:
smpd -traceoff
Then have a look at <logfilename> and/or send it to CST support.
See also: MPI Simulation Overview, MPI Cluster Dialog, MPI Installation