FAQ

[SX-Aurora TSUBASA] NEC MPI for SX-Aurora TSUBASA FAQ

Question

About NEC MPI Operating Procedures

  1. How can we get the physical VE number?

  2. How can we redirect the output of MPI processes, when using NEC MPI with NQSV qsub?

  3. Can we redirect the output of the mpirun or mpiexec command, for example "mpirun ... > output.txt", in order to save the output of the MPI program?

  4. Why does this error happen, when executing a batch job?
    host1: mpid(0): bind or listen failed in listen_port_range: Address already in use
    mpirun: cannot find fifo file: /tmp/mpi2mpid_fifo.xx_xxxxxxx; jid xxxxxxx
    mpirun: fatal error : cannot find fifo file (not created by mpid)

  5. Why does this warning happen when "source necmpivars.sh" or "source necmpivars.csh" is executed?
    necmpivars.sh: Warning: invalid argument. LD_LIBRARY_PATH is not updated.
    Note: "necmpivars.sh [gnu|intel] [version]" format should only be used
    at runtime in order to use VH MPI shared libraries other than those specified
    by RUNPATH embedded in a MPI program executable by the MPI compile command.
    In other cases, "source /opt/nec/ve/mpi/2.x.0/bin/necmpivars.sh"should be used without arguments.
    version is a directory name in the following directory:
    /opt/nec/ve/mpi/2.x.0/lib64/vh/gnu (if gnu is specified)
    /opt/nec/ve/mpi/2.x.0/lib64/vh/intel (if intel is specified)

  6. For program execution, can we use a different version of NEC MPI from the one used at compile/link time?

  7. Why can't certain kinds of arrays in a Fortran program be passed to MPI non-blocking procedures?
    The note in "2.13 Nonblocking MPI Procedures in Fortran Programs" of the NEC MPI User's Guide says
    that incorrect results or abnormal behavior may occur during program execution
    if one of the following items is passed as the actual argument for the communication buffer or I/O buffer of a non-blocking MPI procedure:
    • array section
    • array expression
    • array pointer
    • assumed shape array

  8. I specified the number of VEs and the number of processes for the program execution, but the number of processes assigned to each VE differs from VE to VE. Why does this happen?
    #PBS --venode=125
    #PBS --venum-lhost=8
    mpirun -np 1000 ./a.out

    When I submit a request under the conditions above, it seems that MPI processes are assigned as follows.
    • 8 processes are not allocated to each VE.
    • In particular, more than 8 processes are allocated to VE0 through VE3 of job:0015.

  9. Is there any possibility for MPI communication performance to be improved on 8VE models with Intel Xeon processors (A300-8, A311-8, B300-8, A500-64, A511-64, etc.)?

About PROGINF/FTRACE

  1. When thread parallelism is performed, FTRACE shows that the amount of calculation has decreased. Why does it happen?

  2. If the program terminates abnormally, can the standard error, PROGINF, and FTRACE results of all ranks still be output completely?

About InfiniBand

  1. Is there any possibility of a problem in an MPI program execution over InfiniBand with different generations of InfiniBand HCAs?

Answer

About NEC MPI Operating Procedures

  1. How can we get the physical VE number?
    If you set the -v option for mpirun, the process generation information is output.
    You can find the physical VE number in this information.

    % mpirun -v -np 8 a.out
    mpid: Creating 8 process of './a.out' on VE 0 of local host

    This indicates that 8 processes have been created on physical VE number 0.

    For interactive execution, you can also find it by referring to the environment variable VE_NODE_NUMBER (for example, with getenv("VE_NODE_NUMBER") in the program).

  2. How can we redirect the output of MPI processes, when using NEC MPI with NQSV qsub?
    You can use mpisep.sh to redirect the output of each rank to a file.
    export NMPI_SEPSELECT=4
    mpirun -np 4 /opt/nec/ve/bin/mpisep.sh ./a.out
    With the above procedure, the standard output and standard error are redirected to the file "std.Universe number:rank number".
    (The universe number is normally 0; it becomes 1 or greater for processes generated by MPI_Comm_spawn and similar functions.)
    Details are described in 3.3 Standard Output and Standard Error of MPI Programs of the NEC MPI User's Guide.
    If the NEC MPI process manager is hydra in the queue settings of the NQSV system you are using,
    you can control the output destination by setting the environment variable NMPI_OUTPUT_COLLECT when executing the MPI program, as follows.

    • NMPI_OUTPUT_COLLECT=ON
      The output of the MPI program is output to the standard output and standard error output of the MPI execution command.
    • NMPI_OUTPUT_COLLECT=OFF (Default)
      The output of the MPI program is output for each logical node, as with mpd.

  3. Can we redirect the output of the mpirun or mpiexec command, for example "mpirun ... > output.txt", in order to save the output of the MPI program?
    It is not possible to save the output of an MPI program to a single file by redirecting the output of the mpirun or mpiexec command when executing through NQSV.
    However, if the hydra process manager is used, the output of the mpirun or mpiexec command can be redirected to save the output of the MPI program by setting NMPI_OUTPUT_COLLECT=ON, as in the example below.
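    A minimal sketch of such an NQSV job script, assuming the queue is configured to use the hydra process manager (the executable name a.out and the file name output.txt are illustrative):

    #!/bin/sh
    #PBS -b 1
    export NMPI_OUTPUT_COLLECT=ON    # collect the output of all ranks into mpirun's stdout/stderr
    mpirun -np 4 ./a.out > output.txt 2>&1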

  4. Why does this error happen, when executing a batch job?
    host1: mpid(0): bind or listen failed in listen_port_range: Address already in use
    mpirun: cannot find fifo file: /tmp/mpi2mpid_fifo.xx_xxxxxxx; jid xxxxxxx
    mpirun: fatal error : cannot find fifo file (not created by mpid)

    The TCP/IP port used by the MPI daemon most likely could not be allocated. By default, NEC MPI uses 10 ports on each node, 25257 through 25266, one port for each MPI daemon, to wait for and accept incoming requests. Therefore, if more than 10 MPI daemon processes are running on a node, the default 10 ports are insufficient. In this case, the number of ports should be increased.
    For example, if 20 MPI daemon processes will be running at the same time, the number of ports that NEC MPI uses can be increased to 20 by specifying NMPI_PORT_RANGE=25257:25276. The enlarged port range can be set as a system-wide configuration, or for each request, for example with the #PBS -v option as in the sketch below.
    See also 4.8.2 Firewall in the SX-Aurora TSUBASA Installation Guide.
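    A minimal sketch of setting the enlarged port range per request in an NQSV job script (the range value follows the example above; the other directives are illustrative):

    #PBS -b 2
    #PBS -v NMPI_PORT_RANGE=25257:25276
    mpirun -np 16 ./a.out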

  5. Why does this warning happen when "source necmpivars.sh" or "source necmpivars.csh" is executed?
    necmpivars.sh: Warning: invalid argument. LD_LIBRARY_PATH is not updated.
    Note: "necmpivars.sh [gnu|intel] [version]" format should only be used
    at runtime in order to use VH MPI shared libraries other than those specified
    by RUNPATH embedded in a MPI program executable by the MPI compile command.
    In other cases, "source /opt/nec/ve/mpi/2.x.0/bin/necmpivars.sh"should be used without arguments.
    version is a directory name in the following directory:
    /opt/nec/ve/mpi/2.x.0/lib64/vh/gnu (if gnu is specified)
    /opt/nec/ve/mpi/2.x.0/lib64/vh/intel (if intel is specified)

    When necmpivars.sh or necmpivars.csh is sourced from another shell script that was invoked with arguments, or after the "set" command has been executed, those positional parameters are passed to the NEC MPI scripts, which causes the warning. You can safely ignore the warning; see 3.11 Miscellaneous (19) in the NEC MPI User's Guide. The sketch below illustrates how the warning arises.
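    A minimal illustration, assuming a hypothetical wrapper script run.sh invoked as "./run.sh input.dat":

    #!/bin/bash
    # run.sh (hypothetical): the caller's positional parameter "input.dat" is also visible
    # to the sourced script, which treats it as an invalid [gnu|intel] argument and warns.
    source /opt/nec/ve/mpi/2.x.0/bin/necmpivars.sh
    mpirun -np 4 ./a.out "$1"

    The warning is harmless in this case: LD_LIBRARY_PATH is simply left unchanged and the program runs with the libraries specified by the RUNPATH embedded at link time.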

  6. For program execution, can we use a different version of NEC MPI from the one used at compile/link time?
    By default, most of the NEC MPI libraries, including the main functionality, are statically linked.
    Therefore, the library of the version used at compile/link time (call it version A) is used regardless of which NEC MPI setup script is sourced when the MPI program is run.
    If you want to change the NEC MPI version when running an MPI program, you need to dynamically link all libraries by specifying the -shared-mpi option at compile and link time.
    With this procedure, you can switch to another NEC MPI version (version B) and execute the MPI program by sourcing the version B setup script at run time, as in the sketch below.
    Please note that if all NEC MPI libraries are dynamically linked, MPI communication performance may be lower than with static linking.
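    A minimal sketch, assuming the NEC MPI C compiler wrapper mpincc and illustrative version directories 2.xx.0 (version A) and 2.yy.0 (version B) under /opt/nec/ve/mpi:

    # compile and link with version A, dynamically linking all NEC MPI libraries
    source /opt/nec/ve/mpi/2.xx.0/bin/necmpivars.sh
    mpincc -shared-mpi prog.c -o a.out

    # run with version B by sourcing its setup script before execution
    source /opt/nec/ve/mpi/2.yy.0/bin/necmpivars.sh
    mpirun -np 4 ./a.out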

  7. Why can't certain kinds of arrays in a Fortran program be passed to MPI non-blocking procedures?
    The behavior described in 2.13 Nonblocking MPI Procedures in Fortran Programs of the NEC MPI User's Guide is due to the MPI specifications (the argument specifications of MPI procedures) and compiler optimization.
    The compiler optimization in question is not unique to the NEC compilers,
    but refers to widely and commonly performed optimizations such as instruction reordering and the reduction of memory accesses (register optimization).
    This is explained in the MPI specification (MPI: A Message-Passing Interface Standard, Version 3.1, June 4, 2015),
    17.1.17 Problems with Code Movement and Register Optimization.

  8. I specified the number of VEs and the number of processes for the program execution, but the number of processes assigned to each VE differs from VE to VE. Why does this happen?
    In order to allocate the same number of processes to all VEs, you need to explicitly specify VEs as hosts with the -venode option.
    If you do not specify it, VHs are regarded as hosts and processes are evenly allocated to the VHs.
    As a result, if the number of VEs per VH differs from one VH to another, the number of processes allocated to each VE also differs.
    Options of mpirun, including -venode, are explained in the NEC MPI User's Guide, 3.2.2 Runtime Options.
    In this example, VEs are not explicitly specified as hosts, so VHs are regarded as hosts.
    On the other hand, the NQSV conditions above allocate the following resources:
    • 15 VHs with 8VEs
    • 1 VH with 5VEs

    As a result, the 1000 processes are distributed as evenly as possible over the 16 VHs.
    Since 1000 / 16 = 62.5, 63 processes are allocated to each of the first 15 VHs, and the remaining 55 processes are allocated to the last VH (job: 0015).
    Consequently, more than 8 processes are allocated to each VE of the last VH.
    When the number of VEs per VH and the number of processes per VH are uneven as in this case,
    you can make MPI allocate 8 processes to each of the 125 VEs (1000 / 125 = 8) by specifying the VEs as hosts with mpirun -venode -np 1000 ./a.out, as in the sketch below.
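    A minimal sketch of the corrected request, using the values from this example:

    #PBS --venode=125
    #PBS --venum-lhost=8
    mpirun -venode -np 1000 ./a.out

    With -venode, the 125 allocated VEs themselves are treated as hosts, so each VE receives exactly 1000 / 125 = 8 processes.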

  9. Is there any possibility for MPI communication performance to be improved on 8VE models with Intel Xeon processors (A300-8, A311-8, B300-8, A500-64, A511-64, etc.)?
    Communication performance may be improved by changing to execution with 4 VEs per logical node. For example:

    * Interactive execution

    mpirun -ve 0-7 -np 64 ve.out

    would be:

    NMPI_EXEC_LNODE=ON mpirun -host host_0 -ve 0-3 -np 32 -host host_0/A -ve 4-7 -np 32 ve.out

    * NQSV batch execution
    #PBS -b 2
    #PBS --venum-lhost=8
    mpirun -np 128 ve.out

    would be:

    #PBS -b 4
    #PBS --venum-lhost=4
    #PBS --use-hca=2
    mpirun -np 128 ve.out

About PROGINF/FTRACE

  1. When thread parallelism is performed, FTRACE shows that the amount of calculation has decreased. Why does it happen?
    When a function that is automatically parallelized or parallelized with OpenMP is compiled without the -ftrace option,
    the performance information of threads other than the master thread is not included in the analysis result.
    Likewise, when the parallel functions of NLC (NEC Numeric Library Collection) are used, the performance information of the threads other than the master thread that are generated by those functions is not included in the analysis result.
    Please refer to 2.4 Notes in the PROGINF/FTRACE User's Guide.
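    A minimal sketch, assuming the NEC Fortran compiler nfort and an OpenMP-parallelized routine in sub.f90 (file names are illustrative); every parallelized function is compiled with -ftrace so that the work of all threads is included in the analysis result:

    nfort -ftrace -fopenmp -c sub.f90      # parallelized routine, compiled with -ftrace
    nfort -ftrace -fopenmp -c main.f90
    nfort -ftrace -fopenmp main.o sub.o -o a.out
    ./a.out                                # execution produces the FTRACE analysis data
    # display the result afterwards with the ftrace command (see the PROGINF/FTRACE User's Guide)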

  2. If the program terminates abnormally, can the standard error, PROGINF, and FTRACE results of all ranks still be output completely?
    If the program terminates abnormally, the output of the standard error, PROGINF, and FTRACE results is not guaranteed.
    Therefore, all or part of these outputs may be missing.

About InfiniBand

  1. Is there any possibility of a problem in an MPI program execution over InfiniBand with different generations of InfiniBand HCAs?
    An MPI program cannot be executed between different generations of InfiniBand HCAs in certain cases.
    Because of incompatibilities in InfiniBand communication between EDR HCAs and HDR HCAs (HDR100), and between EDR HCAs and NDR HCAs, an MPI program cannot be executed across such combinations. Several Aurora models are equipped with different generations of InfiniBand HCAs: EDR, HDR (HDR100), and NDR. The table below lists the Aurora models and their equipped InfiniBand HCAs.

    Aurora models and equipped InfiniBand HCAs

    EDR                HDR                NDR
    (Rack Mount)       (Rack Mount)       (Rack Mount)
    A300-2             A311-4             C401-8
    A300-4             A311-8
    A300-8             B300-8
    (Supercomputer)    A412-8
    A500-64            B401-8
    A511-64            B302-8

Product Name

SX-Aurora TSUBASA Software

Note

Revision history

2021/09/30 New release
2021/12/24 Updated information on questions and answers about NEC MPI Operating Procedures
2023/09/28 Updated information on questions and answers about NEC MPI, PROGINF/FTRACE, InfiniBand

  • Content ID: 4150101113
  • Release date: 2021/09/30
  • Last updated: 2023/09/28

