The file mpi_demo.c is given to illustrate the fundamental structure of a C MPI program.
To compile an MPI program, use mpicc, a wrapper around gcc that automatically adds the headers and libraries MPI requires:
mpicc -g -Wall -o mpi_demo mpi_demo.c
Use mpiexec to run the MPI program:
mpiexec -n <number-of-processes> ./mpi_demo
Here the -n option sets the number of processes to start.
Every name defined by MPI follows two rules:
- It starts with MPI_.
- For variables and functions, the first character after the prefix is capitalized; for macros and constants, all characters are capitalized.
By following this naming convention, one can easily tell whether an object is defined by MPI.
MPI_Init and MPI_Finalize
MPI_Init initializes the MPI system:
int MPI_Init(
    int*     argc_p,   /* in/out */
    char***  argv_p    /* in/out */
);
The parameters argc_p and argv_p are pointers to argc and argv. Pass NULL for both if they are not used.
After all MPI operations have finished, call MPI_Finalize to clean up:
int MPI_Finalize(void);
All MPI operations should be invoked only between MPI_Init and MPI_Finalize.
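Putting these together, a minimal sketch of the overall program structure (only the standard mpi.h API is assumed; the body in the middle is a placeholder):

#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);   /* start up MPI */

    /* ... all MPI calls go here ... */

    MPI_Finalize();           /* shut down MPI */
    return 0;
}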
An MPI communicator is a collection of processes that can send messages to each other. The special global communicator MPI_COMM_WORLD, created by MPI_Init, consists of all the running processes.
A communicator must be specified whenever a message is sent or received. The purpose of an MPI communicator is to create a "communication space", preventing messages from being accidentally sent to or received from the wrong processes.
The functions MPI_Comm_size and MPI_Comm_rank return the number of processes in, and the rank of the current process within, the given communicator comm.
int MPI_Comm_size(
    MPI_Comm  comm,        /* in  */
    int*      comm_sz_p    /* out */
);

int MPI_Comm_rank(
    MPI_Comm  comm,        /* in  */
    int*      my_rank_p    /* out */
);
For example, mpi_demo.c uses
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
to get the number of processes and the rank of the current process in the global communicator MPI_COMM_WORLD.
MPI_Send
MPI_Send and MPI_Recv form a pair for point-to-point communication. MPI_Send sends a message to the receiver:
int MPI_Send(
    void*         msg_buf_p,     /* in */
    int           msg_size,      /* in */
    MPI_Datatype  msg_type,      /* in */
    int           dest,          /* in */
    int           tag,           /* in */
    MPI_Comm      communicator   /* in */
);
The first three parameters msg_buf_p, msg_size and msg_type determine the content of the message, and the last three parameters dest, tag and communicator determine the destination of the message.
msg_buf_p is a pointer to the message content. msg_size and msg_type specify how much data should be sent: msg_size elements of type msg_type. For instance, mpi_demo.c uses msg_size = strlen(greeting)+1 and msg_type = MPI_CHAR, which tells MPI to send strlen(greeting)+1 characters (including the terminating '\0').
Since an MPI function cannot take a C type keyword as an argument, MPI provides predefined constants of type MPI_Datatype to represent the C types.
communicator and dest specify which receiver (identified by the communicator and the rank within it) should get the message. tag is used to distinguish messages sent by the same source.
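A sketch of the sender side, mirroring mpi_demo.c (the greeting buffer, my_rank and comm_sz are illustrative names assumed to be set up as shown earlier):

char greeting[100];
sprintf(greeting, "Greetings from process %d of %d!", my_rank, comm_sz);
/* send strlen(greeting)+1 elements of type MPI_CHAR to rank 0,
   with tag 0, inside MPI_COMM_WORLD */
MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);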
MPI_Recv receives the message sent by MPI_Send:
int MPI_Recv(
    void*         msg_buf_p,     /* out */
    int           buf_size,      /* in  */
    MPI_Datatype  buf_type,      /* in  */
    int           source,        /* in  */
    int           tag,           /* in  */
    MPI_Comm      communicator,  /* in  */
    MPI_Status*   status_p       /* out */
);
To receive from a specific source/tag, the corresponding parameters must be set on the receiver side. However, there are cases where multiple senders send messages in an unpredictable order, that is, we do not know which sender will send first.
In such cases the wildcard constants MPI_ANY_SOURCE and MPI_ANY_TAG can be passed as the source and tag parameters, which allows messages from any source or with any tag (within the specified communicator) to be received. There is no wildcard for the communicator; it must always be specified.
When source/tag are filled with wildcards, status_p can be used to find out the actual source/tag by examining its fields:
MPI_Status status;
status.MPI_SOURCE;
status.MPI_TAG;
Retrieving the amount of data received requires a function call:
int MPI_Get_count(
    MPI_Status*   status_p,   /* in  */
    MPI_Datatype  type,       /* in  */
    int*          count_p     /* out */
);
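A sketch of the receiver side that puts the wildcards, the status object and MPI_Get_count together (the buffer size of 100 is an illustrative assumption):

char greeting[100];
MPI_Status status;
int count;

/* accept a message from any sender, carrying any tag */
MPI_Recv(greeting, 100, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

int actual_source = status.MPI_SOURCE;    /* who actually sent it */
int actual_tag    = status.MPI_TAG;       /* which tag it carried */
MPI_Get_count(&status, MPI_CHAR, &count); /* how many chars were received */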
MPI_Datatype
The predefined MPI_Datatype constants and the C types they represent:
MPI_Datatype | C Datatype
---|---
MPI_CHAR | signed char
MPI_SHORT | signed short int
MPI_INT | signed int
MPI_LONG | signed long int
MPI_LONG_LONG | signed long long int
MPI_UNSIGNED_CHAR | unsigned char
MPI_UNSIGNED_SHORT | unsigned short int
MPI_UNSIGNED | unsigned int
MPI_UNSIGNED_LONG | unsigned long int
MPI_FLOAT | float
MPI_DOUBLE | double
MPI_LONG_DOUBLE | long double
MPI_BYTE |
MPI_PACKED |
The standard output is available to all processes, while standard input can only be read by the process with rank 0 (the "controller" process). The user must therefore write their own functions to pass the input from process 0 to the other processes, as sketched below.
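A minimal sketch of such a function, using only the MPI_Send/MPI_Recv calls introduced above (the integer n and the function name get_input are illustrative assumptions):

/* process 0 reads an int from stdin and forwards it to every other process */
void get_input(int my_rank, int comm_sz, int* n_p) {
    if (my_rank == 0) {
        scanf("%d", n_p);
        for (int dest = 1; dest < comm_sz; dest++)
            MPI_Send(n_p, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(n_p, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}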
MPI_Reduce collects the data from all the processes and reduces it with a predefined operator, encapsulating the work of manually constructing a propagation tree:
int MPI_Reduce(
    void*         input_data_p,   /* in  */
    void*         output_data_p,  /* out */
    int           count,          /* in  */
    MPI_Datatype  datatype,       /* in  */
    MPI_Op        operator,       /* in  */
    int           dest_process,   /* in  */
    MPI_Comm      comm            /* in  */
);
To use collective communication properly, note the following rules:
- dest_process must be set to the same value in every process.
- Since no tag is provided in collective communication, if multiple collective calls are made, the matching between them depends purely on the order of the calls, for example:
Time | Process 0 | Process 1 | Process 2
---|---|---|---
0 | a = 1, c = 2 | a = 1, c = 2 | a = 1, c = 2
1 | MPI_Reduce(&a, &b, ...) | MPI_Reduce(&c, &d, ...) | MPI_Reduce(&a, &b, ...)
2 | MPI_Reduce(&c, &d, ...) | MPI_Reduce(&a, &b, ...) | MPI_Reduce(&c, &d, ...)
Assuming process 0 is configured as the destination, the final output will be b = 1 + 2 + 1 = 4 and d = 2 + 1 + 2 = 5, because the first MPI_Reduce call in each process is matched with the first call in every other process, regardless of which variables are passed.
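A typical use is computing a global sum of per-process values on process 0 (local_sum, global_sum and my_rank are illustrative; each process is assumed to have already computed its own local_sum):

double global_sum = 0.0;

/* combine every process's local_sum with MPI_SUM; only rank 0 receives the result */
MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

if (my_rank == 0)
    printf("Total: %f\n", global_sum);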
When using MPI_Reduce, output_data_p is only meaningful on the one process chosen as dest_process. To make all the processes get the output, use MPI_Allreduce:
int MPI_Allreduce(
    void*         input_data_p,   /* in  */
    void*         output_data_p,  /* out */
    int           count,          /* in  */
    MPI_Datatype  datatype,       /* in  */
    MPI_Op        operator,       /* in  */
    MPI_Comm      comm            /* in  */
);
The only difference in the argument list is that MPI_Allreduce does not take dest_process, since every process gets the output.
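The global-sum example above then becomes (same illustrative variables; every process ends up with global_sum):

MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);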
MPI_Op
The list of reduction operators supported by MPI:
Operator | Interpretation
---|---
MPI_MAX | maximum
MPI_MIN | minimum
MPI_SUM | sum
MPI_PROD | product
MPI_LAND | logical and
MPI_BAND | bitwise and
MPI_LOR | logical or
MPI_BOR | bitwise or
MPI_LXOR | logical exclusive or
MPI_BXOR | bitwise exclusive or
MPI_MAXLOC | maximum and location of maximum
MPI_MINLOC | minimum and location of minimum
There are two basic approaches to distributing the components of the data among the processes:

Block partition computes how many components each process should own and then divides the data into blocks of contiguous components. For example, if components with ids 1~12 are divided among 3 processes, the block partition first computes that each process gets 4 components, and then divides from the start:

Process ID | Process 1 | Process 2 | Process 3
---|---|---|---
Data | 1, 2, 3, 4 | 5, 6, 7, 8 | 9, 10, 11, 12

Cyclic partition assigns the components to the processes one at a time in round-robin order, so that in the same example process 1 gets 1, 4, 7, 10, process 2 gets 2, 5, 8, 11 and process 3 gets 3, 6, 9, 12 (see the sketch below).
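A minimal sketch of how each process could compute which components it owns under either scheme (n, comm_sz and my_rank are assumed to be the total component count, the process count and this process's rank; n is assumed divisible by comm_sz for simplicity):

int local_n = n / comm_sz;            /* components per process */

/* block partition: a contiguous range of local_n components */
int block_first = my_rank * local_n;
int block_last  = block_first + local_n - 1;

/* cyclic partition: every comm_sz-th component, starting at my_rank */
for (int i = my_rank; i < n; i += comm_sz) {
    /* component i belongs to this process */
}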
Data distribution is the reverse of data collection; accordingly, broadcast is the reverse of reduce:
int MPI_Bcast(
    void*         data_p,       /* in/out */
    int           count,        /* in     */
    MPI_Datatype  datatype,     /* in     */
    int           source_proc,  /* in     */
    MPI_Comm      comm          /* in     */
);
which broadcasts data_p from source_proc to all processes in the communicator comm.
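For instance, the input-distribution function sketched earlier collapses to a single call (n and my_rank are the same illustrative variables):

/* rank 0 reads n from stdin; afterwards every process holds the same value */
if (my_rank == 0)
    scanf("%d", &n);
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);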
MPI_Bcast simply sends the same data to all processes. To distribute different components of the data to different processes, the MPI_Scatter function is provided:
int MPI_Scatter(
    void*         send_buf_p,  /* in  */
    int           send_count,  /* in  */
    MPI_Datatype  send_type,   /* in  */
    void*         recv_buf_p,  /* out */
    int           recv_count,  /* in  */
    MPI_Datatype  recv_type,   /* in  */
    int           src_proc,    /* in  */
    MPI_Comm      comm         /* in  */
);
Correspondingly, the MPI_Gather function collects all the components:
int MPI_Gather(
    void*         send_buf_p,  /* in  */
    int           send_count,  /* in  */
    MPI_Datatype  send_type,   /* in  */
    void*         recv_buf_p,  /* out */
    int           recv_count,  /* in  */
    MPI_Datatype  recv_type,   /* in  */
    int           dest_proc,   /* in  */
    MPI_Comm      comm         /* in  */
);
and MPI_Allgather gathers the components to all processes:
int MPI_Allgather(
    void*         send_buf_p,  /* in  */
    int           send_count,  /* in  */
    MPI_Datatype  send_type,   /* in  */
    void*         recv_buf_p,  /* out */
    int           recv_count,  /* in  */
    MPI_Datatype  recv_type,   /* in  */
    MPI_Comm      comm         /* in  */
);
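A sketch of scattering a vector held on process 0 into equal blocks and gathering the blocks back (full_vec, local_vec and local_n are illustrative names; the vector length is assumed divisible by comm_sz):

/* each process receives local_n doubles from process 0's full_vec */
MPI_Scatter(full_vec, local_n, MPI_DOUBLE,
            local_vec, local_n, MPI_DOUBLE,
            0, MPI_COMM_WORLD);

/* ... each process works on its own local_vec block ... */

/* collect the blocks back into full_vec on process 0, in rank order */
MPI_Gather(local_vec, local_n, MPI_DOUBLE,
           full_vec, local_n, MPI_DOUBLE,
           0, MPI_COMM_WORLD);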
Use MPI_Type_create_struct to build a derived datatype that consists of individual elements with different basic types:
int MPI_Type_create_struct(
    int            count,                     /* in  */
    int            array_of_blocklengths[],   /* in  */
    MPI_Aint       array_of_displacements[],  /* in  */
    MPI_Datatype   array_of_types[],          /* in  */
    MPI_Datatype*  new_type_p                 /* out */
);
Before using the datatype we must first commit it, and free it with MPI_Type_free when it is no longer needed:

int MPI_Type_commit(
    MPI_Datatype*  new_mpi_t_p   /* in/out */
);

int MPI_Type_free(
    MPI_Datatype*  old_mpi_t_p   /* in/out */
);
Here is a demo to illustrate the workflow:
/* assume the values to package are double a, double b and int n */
double a, b;
int    n;

MPI_Aint a_addr, b_addr, n_addr;
int array_of_blocklengths[3] = {1, 1, 1};
MPI_Aint array_of_displacements[3];
MPI_Datatype array_of_types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};

/* displacements are byte offsets measured from the first element */
MPI_Get_address(&a, &a_addr);
array_of_displacements[0] = 0;
MPI_Get_address(&b, &b_addr);
array_of_displacements[1] = b_addr - a_addr;
MPI_Get_address(&n, &n_addr);
array_of_displacements[2] = n_addr - a_addr;

MPI_Datatype input_mpi_t;
MPI_Type_create_struct(3, array_of_blocklengths, array_of_displacements,
                       array_of_types, &input_mpi_t);
MPI_Type_commit(&input_mpi_t);
// ...
MPI_Type_free(&input_mpi_t);
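Once committed, the derived type can be used like any predefined MPI_Datatype; for example, a single broadcast can move a, b and n together (a sketch reusing the variables above):

/* one call distributes all three values according to their in-memory layout */
MPI_Bcast(&a, 1, input_mpi_t, 0, MPI_COMM_WORLD);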