Delft University of Technology
Challenge the future
Parallel programming
MPI IO
Jan Thorbecke
2
This course is partly based on the MPI courses developed by
Rolf Rabenseifner at the High-Performance Computing-Center
Stuttgart (HLRS), University of Stuttgart in collaboration with the
EPCC Training and Education Centre, Edinburgh Parallel Computing
Centre, University of Edinburgh.
http://www.hlrs.de/home/
https://www.epcc.ed.ac.uk
CSC – IT Center for Science Ltd.
https://www.csc.fi
https://research.csc.fi
http://mpitutorial.com
Acknowledgments
2
3
Contents
Parallel IO
POSIX-IO
single MPI-task does all
each MPI-task one file
MPI-IO
Independent
exercise: exa1 MPI_File_write_at
exercise: exa2 MPI_File_set_view
collective
exercise: mpiio_2D MPI_File_write_all
Examples
Parallel IO
4
Parallel I/O
...1101101010101101110110010101010101001010101...
I/O (Input/output) is needed in all programs but is often overlooked
Mapping problem: how to convert internal structures and domains to
files, which are streams of bytes
Transport problem: how to get the data efficiently from hundreds to
thousands of nodes on the supercomputer to physical disks
Parallel IO
5
Parallel I/O
Good I/O is non-trivial
Performance, scalability, reliability
Ease of use of output (number of files, format)
Portability
One cannot achieve all of the above - one needs to decide what is most
important
New challenges
Number of tasks is rising rapidly
The size of the data is also rapidly increasing
The need for I/O tuning is algorithm & problem specific
Without parallelization, I/O will become a scalability bottleneck for
practically every application!
IO layers
6
I/O layers
[Diagram: I/O software stack - applications on top of high-level I/O libraries, MPI-IO and POSIX syscalls, all on top of the parallel file system.]
High-level: the application needs to write or read data from disk
Intermediate: libraries or system tools for I/O
high-level libraries: HDF5, NetCDF
MPI-I/O
POSIX system calls (fwrite / WRITE)
Low-level: the parallel filesystem enables the actual parallel I/O
Lustre, GPFS, PVFS, dCache
POSIX IO
7
POSIX I/O
Built in language constructs for performing I/O
WRITE/READ/OPEN/CLOSE in Fortran
stdio.h routines in C (fopen, fread, fwrite, ...)
No parallel ability built in - all parallel I/O schemes have to be
programmed manually
Binary output not necessarily portable
C and Fortran binary output not necessarily compatible
Non-contiguous access difficult to implement efficiently
Contiguous access can be very fast
File System Fundamentals
Single logical file, e.g. /work/example
File automatically divided into stripes
Stripes are written/read from across multiple drives
to achieve fast bandwidth reading or writing to disk
8
File decomposition – 2 Megabyte Stripes
9
[Diagram: the file is split into 2 MB stripes; the Lustre client sends stripes 3-0, 5-0, 7-0, 11-0, 3-1, 5-1, 7-1, 11-1 round-robin to OST 3, OST 5, OST 7 and OST 11 and on to disk.]
Spokesperson, basically serial I/O
One process performs I/O.
Data Aggregation or Duplication
Limited by single I/O process.
Easy to program
Pattern does not scale.
Time increases linearly with amount of data.
Time increases with number of processes.
Care has to be taken when doing the "all to one" kind of communication at scale
Can be used for a dedicated IO server (not easy to program) for small amounts of data
10
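A minimal sketch of this spokesperson pattern in C (the buffer size and the file name out.bin are made up for the illustration): every rank sends its data to rank 0 with MPI_Gather, and rank 0 alone writes it with POSIX I/O.

/* Spokesperson pattern (sketch): rank 0 gathers everything and writes it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nlocal = 1024;               /* elements per rank (example value) */
    double local[1024], *global = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < nlocal; i++) local[i] = rank;   /* fill with some data */

    if (rank == 0) global = malloc((size_t)nlocal * nprocs * sizeof(double));

    /* all-to-one communication: the spokesperson collects all data */
    MPI_Gather(local, nlocal, MPI_DOUBLE,
               global, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {                        /* serial I/O: time grows with data and ranks */
        FILE *fp = fopen("out.bin", "wb");
        fwrite(global, sizeof(double), (size_t)nlocal * nprocs, fp);
        fclose(fp);
        free(global);
    }
    MPI_Finalize();
    return 0;
}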
One file per process
All processes perform I/O to
individual files.
Limited by file system.
Easy to program
Pattern does not scale at large
process counts.
Number of files creates
bottleneck with metadata
operations.
Number of simultaneous
disk accesses creates
contention for file system
resources.
11
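A minimal file-per-process sketch (the file name pattern out.%04d.bin and the data are invented for the example): every rank opens and writes its own file with plain POSIX calls.

/* File-per-process pattern (sketch): every rank writes its own file. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char fname[64];
    int data[1000];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 1000; i++) data[i] = rank;

    snprintf(fname, sizeof(fname), "out.%04d.bin", rank);  /* one file per rank */
    FILE *fp = fopen(fname, "wb");
    fwrite(data, sizeof(int), 1000, fp);    /* easy to program, but the number of files
                                               makes metadata the bottleneck at scale */
    fclose(fp);

    MPI_Finalize();
    return 0;
}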
Shared File
Each process performs I/O to a
single file which is shared.
Performance
Data layout within the shared
file is very important.
At large process counts
contention can build for file
system resources (OST).
Programming language might not
support it
C/C++ can work with fseek
No real Fortran standard
12
A little bit of all, using a subset of
processes
Aggregation to a processor group which processes the data.
Serializes I/O in group.
I/O process may access independent files.
Limits the number of files accessed.
Group of processes perform parallel I/O to a shared file.
Increases the number of shares to increase file system
usage.
Decreases number of processes which access a shared file
to decrease file system contention.
13
UM per Timestep I/O
UM already had implemented a definable number of asynchronous I/O servers
Each handling a certain number of files (Fortran I/O units)
14
When doubling the number of cores, ideally compute time and amount of I/O per core is reduced to half
I/O time should scale - but it doesn't - how come?
The single I/O server per file becomes overwhelmed
Increasing number of smaller packets
I/O server collects data in a prescribed order, compute tasks wait for completion
I/O Performance : To keep in mind
There is no “One Size Fits All” solution to the I/O problem.
Many I/O patterns work well for some range of parameters.
Bottlenecks in performance can occur in many locations.
(Application and/or File system)
Going to extremes with an I/O pattern will typically lead to
problems.
I/O is a shared resource. Expect timing variation
15
BeeGFS
Installed on DelftBlue
beegfs-ctl --getentryinfo .
beegfs-ctl --liststoragepools
16
MPI-IO
17
MPI-IO
18
MPI I/O
MPI I/O was introduced in MPI-2
Defines parallel operations for reading and writing files
I/O to only one file and/or to many files
Contiguous and non-contiguous I/O
Individual and collective I/O
Asynchronous I/O
Portable programming interface
Potentially good performance
Easy to use (compared with implementing the same algorithms on your
own)
Used as the backbone of many parallel I/O libraries such as parallel
NetCDF and parallel HDF5
By default, binary files are not necessarily portable
Basic concepts MPI-IO
19
Basic concepts in MPI I/O
File handle
data structure which is used for accessing the file
File pointer
position in the file where to read or write
can be individual for all processes or shared between the processes
accessed through file handle
File view
part of a file which is visible to a process
enables efficient non-contiguous access to file
Collective and independent I/O
collective: MPI coordinates the reads and writes of processes
independent: no coordination by MPI
Logical view
20
Logical view / Physical view
[Diagram: the MPI processes of a communicator access the file in its logical view, which is the scope of MPI-I/O; the physical view of the file on disk is addressed only by hints.]
21
Definitions
[Diagram: a file starting with a displacement (number of header bytes) is tiled with the filetypes of processes 0, 1 and 2; each filetype is built from etypes (elementary datatypes) and may contain holes; the tiling defines the view of each process, and the three views together cover elements 0-9 of the file.]
22
Comments on Definitions
file - an ordered collection of typed data items
etype - the unit of data access and positioning / offsets
  - can be any basic or derived datatype (with non-negative, monotonically non-decreasing, non-absolute displacements)
  - generally contiguous, but need not be
  - typically the same at all processes
filetype - the basis for partitioning a file among processes
  - defines a template for accessing the file
  - different at each process
  - the etype, or derived from the etype (displacements: non-negative, monotonically non-decreasing, non-absolute, multiples of the etype extent)
view - each process has its own view, defined by: a displacement, an etype, and a filetype
  - the filetype is repeated, starting at the displacement
offset - position relative to the current view, in units of etype
23
Default View
Default:
displacement = 0, etype = MPI_BYTE, filetype = MPI_BYTE
i.e. each process has access to the whole file
A sequence of MPI_BYTE matches any datatype (see MPI-3.0, Section 13.6.5)
Binary I/O (no ASCII text I/O)
MPI_FILE_OPEN(comm, filename, amode, info, fh)
[Diagram: with the default view, processes 0, 1 and 2 each see the whole file, bytes 0 1 2 3 4 5 6 7 8 9.]
A simple MPI-IO program in C
MPI_File fh;
MPI_Status status;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
bufsize = FILESIZE/nprocs;
nints = bufsize/sizeof(int);

MPI_File_open(MPI_COMM_WORLD, "FILE",
              MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);
MPI_File_read(fh, buf, nints, MPI_INT, &status);
MPI_File_close(&fh);
24
[Diagram: the file is divided into four bufsize-large parts, read by ranks 0, 1, 2 and 3.]
Write instead of Read
Use MPI_File_write or MPI_File_write_at
Use MPI_MODE_WRONLY or MPI_MODE_RDWR as the flags to MPI_File_open
If the file doesn't exist previously, the flag MPI_MODE_CREATE must be passed to MPI_File_open
We can pass multiple flags by using bitwise-or '|' in C, or addition '+' or IOR in Fortran
If not writing to a file, using MPI_MODE_RDONLY might have a performance benefit. Try it.
25
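For comparison, a small complete write counterpart of the read example above, as a sketch only: the file name datafile and the buffer contents are illustrative; each rank writes its own block with MPI_File_write_at after opening the file with MPI_MODE_CREATE | MPI_MODE_WRONLY.

/* Write counterpart of the read example (sketch): each rank writes its own
   contiguous block with an explicit offset. */
#include <mpi.h>

#define NINTS 1000

int main(int argc, char **argv)
{
    int rank, buf[NINTS];
    MPI_File fh;
    MPI_Status status;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NINTS; i++) buf[i] = rank;

    /* combine flags with bitwise-or; MPI_MODE_CREATE because the file may not exist yet */
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    offset = (MPI_Offset)rank * NINTS * sizeof(int);   /* this rank's block, in bytes */
    MPI_File_write_at(fh, offset, buf, NINTS, MPI_INT, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}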
Parallel write
26
program output
  use mpi
  implicit none
  integer :: err, i, myid, file, intsize
  integer :: status(mpi_status_size)
  integer, parameter :: count=100
  integer, dimension(count) :: buf
  integer(kind=mpi_offset_kind) :: disp
  call mpi_init(err)
  call mpi_comm_rank(mpi_comm_world, myid, &
       err)
  do i = 1, count
    buf(i) = myid * count + i
  end do
  ...
  ...
  call mpi_file_open(mpi_comm_world, 'test', &
       mpi_mode_wronly + mpi_mode_create, &
       mpi_info_null, file, err)
  call mpi_type_size(mpi_integer, intsize, err)
  disp = myid * count * intsize
  call mpi_file_seek(file, disp, &
       mpi_seek_set, err)
  call mpi_file_write(file, buf, count, &
       mpi_integer, status, err)
  call mpi_file_close(file, err)
  call mpi_finalize(err)
end program output
Multiple processes write to a
binary file test.
First process writes integers 1-
100 to the beginning of the file,
etc.
Example: parallel write
Note !
File (and total data) size depends
on number of processes in this
example
27
Writing with Explicit Offsets
writes count elements of datatype from memory buf to the file,
starting at offset * units of etype from the beginning of the view
the elements are stored into the locations of the current view
the sequence of basic datatypes of datatype (= signature of datatype)
must match contiguous copies of the etype of the current view
MPI_FILE_WRITE_AT(fh,offset,buf,count,datatype,status)
MPI_File_seek and MPI_File_write combined in one call
Exercise: MPI-IO exa1:
Four processes write a file in parallel
28
MPI–IO Exercise 1: Four processes write a file in parallel
each process should write its rank (as one character) ten times
to the offsets = my_rank + i * size_of_MPI_COMM_WORLD, i=0..9
Result: “0123012301230123012301230123012301230123“
Each process uses the default view
[Diagram: the MPI processes of a communicator (ranks 0, 1, 2, 3) each repeatedly write their own rank into the shared file.]
please, use skeleton:
cp ~/MPI/course/C/mpi_io/mpi_io_exa1_skel.c my_exa1.c
cp ~/MPI/course/F/mpi_io/mpi_io_exa1_skel.f my_exa1.f
see also login-slides
Sol.
cd MPI_IO
cp mpi_io_exa1_skel.c my_exa1.c
29
mpi_io_exa1.c
[Diagram: independent writes - processes P0, P1 and P2 write from their user buffers (via temp buffers and communication) into the file on disk(s), which is striped over OST 0, OST 1 and OST 2.]
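One possible way to fill in the skeleton (a sketch, not necessarily identical to the distributed solution; the output file name is made up): with the default view, offsets are in bytes, so each rank writes its rank character at my_rank + i*size.

/* mpi_io_exa1 (sketch): each rank writes its rank character ten times. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank, size, i;
    char c;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    c = '0' + my_rank;                                   /* rank as one character */
    MPI_File_open(MPI_COMM_WORLD, "my_test_file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    for (i = 0; i < 10; i++)                             /* default view: offsets in bytes */
        MPI_File_write_at(fh, (MPI_Offset)(my_rank + i * size),
                          &c, 1, MPI_CHAR, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

With four processes this produces "0123" repeated ten times, matching the expected result above.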
File Views
Provides a set of data visible and accessible from an open file
A separate view of the file is seen by each process, described by the triple (displacement, etype, filetype)
The user can change a view during the execution of the program - but this is a collective operation
A linear byte stream, represented by the triple (0, MPI_BYTE, MPI_BYTE), is the default view.
30
31
Set/Get File View
Set view
changes the process’s view of the data
local and shared file pointers are reset to zero
collective operation
etype and filetype must be committed
datarep argument is a string that specifies the format
in which data is written to a file:
“native”, “internal”, “external32”, or user-defined
same etype extent and same datarep on all processes
Get view
returns the process’s view of the data
MPI_FILE_SET_VIEW(fh, disp, etype, filetype, datarep, info)
MPI_FILE_GET_VIEW(fh, disp, etype, filetype, datarep)
MPI_File_set(get)_view
MPI_File_set_view
32
File view
MPI_File_set_view(fhandle, disp, etype, filetype, datarep, info)
disp Offset from beginning of file. Always in bytes
etype Basic MPI type or user defined type
Basic unit of data access
Offsets in I/O commands in units of etype
filetype Same type as etype or user defined type constructed of etype
Specifies which part of the file is visible
datarep Data representation, sometimes useful for portability
“native”: store in same format as in memory
info Hints for implementation that can improve performance
MPI_INFO_NULL: No hints
File View
33
File view
A file view defines which portion of a file is “visible” to a process
File view defines also the type of the data in the file (byte, integer,
float, ...)
By default, file is treated as consisting of bytes, and process can access
(read or write) any byte in the file
A file view consists of three components
displacement : number of bytes to skip from the beginning of file
etype : type of data accessed, defines unit for offsets
filetype : portion of file visible to process
same as etype or MPI derived type consisting of etypes
Default file view: etype=MPI_INT, filetype=MPI_INT
the process sees all elements of the file: 1 2 3 4 5 6 7 8
Non-contiguous file view: etype=MPI_INT, filetype=MPI_Type_vector(4, 1, 2, MPI_INT, &filetype);
the process sees elements: 1 3 5 7
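Written out in C, a sketch of the non-contiguous view above (the file name viewfile is illustrative, and the file is assumed to already contain the eight integers 1..8):

/* Sketch of the vector file view: etype MPI_INT, every 2nd int visible. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Datatype filetype;
    MPI_File fh;
    MPI_Status status;
    int buf[4];

    MPI_Init(&argc, &argv);

    MPI_Type_vector(4, 1, 2, MPI_INT, &filetype);  /* 4 blocks of 1 int, stride 2 ints */
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "viewfile", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    /* disp = 0 here; another process would typically shift by a multiple of sizeof(int) */
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_read(fh, buf, 4, MPI_INT, &status);   /* returns 1 3 5 7 for the file above */

    printf("%d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}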
MPI_File_set_view (picture 1)
34
[Diagram: P0, P1 and P2 read from the file on disk(s); each process's view is determined by its disp, etype and filetype.]
MPI_File_set_view (picture 2)
35
[Diagram: a second example of P0, P1 and P2 reading from the file on disk(s) through views defined by disp, etype and filetype.]
36
File view for non-contiguous
data
MPI_TYPE_VECTOR(4, 1, 4, MPI_INTEGER, filetype)
2D-array distributed column-wise
File
...
INTEGER, PARAMETER :: count = 4
INTEGER, DIMENSION(count) :: buf
...
CALL MPI_TYPE_VECTOR(4, 1, 4, MPI_INTEGER, filetype, err)
CALL MPI_TYPE_COMMIT(filetype, err)
disp = myid * intsize
CALL MPI_FILE_SET_VIEW(file, disp, MPI_INTEGER, filetype, 'native', &
     MPI_INFO_NULL, err)
CALL MPI_FILE_WRITE(file, buf, count, MPI_INTEGER, status, err)
...
37
Storing multidimensional arrays
in files
Domain decomposition for 2D-
array
File
MPI_TYPE_CREATE_SUBARRAY(...)
...
 integer, dimension(2) :: sizes = (/4, 4/)
 integer, dimension(2) :: subsizes = (/2, 2/)
 integer, dimension(2,2) :: buf
 ...
 call mpi_cart_coords(mpi_comm_world, myid, 2, starts, err)
 call mpi_type_create_subarray(2, sizes, subsizes, starts, &
      mpi_order_c, mpi_integer, filetype, err)
 call mpi_type_commit(filetype, err)
 call mpi_file_set_view(file, 0, mpi_integer, filetype, 'native', &
      mpi_info_null, err)
 call mpi_file_write(file, buf, count, mpi_integer, status, err)
...
Exercise: MPI-IO exa2:
Using fileviews and individual filepointers
38
MPI–IO Exercise 2: Using fileviews and
individual filepointers
Copy to your local directory:
cp ~/MPI/course/C/mpi_io/mpi_io_exa2_skel.c my_exa2.c
cp ~/MPI/course/F/mpi_io/mpi_io_exa2_skel.f my_exa2.f
Tasks:
Each MPI-process of my_exa2 should write one character to a
file:
process “rank=0” should write an ‘a’
process "rank=1" should write a 'b'
...
Use a 1-dimensional fileview with MPI_TYPE_CREATE_SUBARRAY
The pattern should be repeated 3 times, i.e., four processes
should write: abcdabcdabcd
Please, substitute “____” in your my_exa2.c / .f
Compile and run your my_exa2.c / .f
Exercise: MPI-IO exa2:
Using fileviews and individual filepointers
39
MPI–IO Exercise 2: Using fileviews and
individual filepointers, continued
etype = MPI_CHARACTER / MPI_CHAR
[Diagram: file displacement = 0 (no header bytes); each of the four processes has a filetype containing its own character (a, b, c or d) with holes for the other three; tiling the file with these filetypes gives abcdabcdabcd, and each process's view contains only its own three characters.]
see also login-slides
Sol.
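A possible solution sketch for exercise 2 (the distributed skeleton/solution may differ in details such as the file name): a 1-dimensional subarray filetype places each rank's character at position rank out of every size characters, and the view repeats for each of the 3 characters written.

/* mpi_io_exa2 (sketch): 1-D subarray fileview, individual file pointer. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    char buf[3];
    MPI_File fh;
    MPI_Datatype filetype;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf[0] = buf[1] = buf[2] = 'a' + rank;              /* pattern repeated 3 times */

    /* filetype: out of "size" characters, this rank owns the one at position "rank" */
    int sizes[1]    = { size };
    int subsizes[1] = { 1 };
    int starts[1]   = { rank };
    MPI_Type_create_subarray(1, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_CHAR, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "my_test_file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_CHAR, filetype, "native", MPI_INFO_NULL);
    MPI_File_write(fh, buf, 3, MPI_CHAR, &status);      /* view repeats along the file */

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

With four processes the file then contains abcdabcdabcd.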
MPI IO
The MPI interface supports two types of I/O:
Independent
each process handling its own I/O independently
supports derived data types (unlike POSIX IO)
Collective
I/O calls must be made by all processes participating in a particular I/O
sequence
Calls using the "shared file, all write" strategy are optimized dynamically by
the MPI library.
40
Collective IO with MPI-IO
MPI_File_read_all, MPI_File_read_at_all, …
_all indicates that all processes in the group specified by the
communicator passed to MPI_File_open will call this function
Each process specifies only its own access information – the
argument list is the same as for the non-collective functions
MPI-IO library is given a lot of information in this case:
Collection of processes reading or writing data
Structured description of the regions
The library has some options for how to use this data
Noncontiguous data access optimizations
Collective I/O optimizations
41
Collective read: two-phase IO
42
[Diagram: two-phase collective read - P0, P1 and P2 read contiguous pieces of the file on disk(s) (striped over OST 0, OST 1 and OST 2) into temp buffers and redistribute them via communication into the user buffers.]
Two techniques : Data sieving and
Aggregation
Data sieving is used to combine lots of small accesses into a
single larger one
Reducing number of operations important (latency)
A system buffer/cache is one example
Aggregation refers to the concept of moving data through
intermediate nodes
Different numbers of nodes performing I/O (transparent to the user)
Both techniques are used by MPI-IO and triggered with HINTS.
43
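A sketch of how such hints can be passed; the hint names romio_ds_read and romio_cb_read below are ROMIO-specific (values "enable", "disable", "automatic"), and whether they are honored depends on the MPI implementation.

/* Sketch: passing hints that steer data sieving and collective buffering. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_read", "enable");    /* data sieving for reads (ROMIO hint) */
    MPI_Info_set(info, "romio_cb_read", "automatic"); /* collective buffering / aggregation  */

    MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY, info, &fh);
    /* ... independent or collective reads ... */
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}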
Data Sieving read
44
user's request for non-contiguous data from a file
read a contiguous chunk into a temp buffer in memory
copy the requested portions into the user buffer
Data Sieving write
45
user's non-contiguous write of data to a file
read a contiguous chunk from the file into a temp buffer in memory
copy the user's data into the temp buffer
write the contiguous chunk back to the file on disk(s)
Aggregation: only P1 reads
46
[Diagram: aggregation - only P1 reads from the file on disk(s) (striped over OST 0, OST 1 and OST 2) into its temp buffer and distributes the data via communication to the user buffers of P0, P1 and P2.]
Collective IO
47
Collective operations
I/O can be performed collectively by all processes in a communicator
MPI_File_read_all
MPI_File_write_all
MPI_File_read_at_all
MPI_File_write_at_all
Same parameters as in independent I/O functions
MPI_File_read, MPI_File_write, MPI_File_read_at,
MPI_File_write_at
All processes in communicator that opened file must call function
Performance potentially better than for individual functions
Even if each processor reads a non-contiguous segment, in total the read is
contiguous
Collective IO example
48
Collective I/O
Collective writes can be over a hundred times faster than individual
writes for large arrays!
...
 integer, dimension(2) :: sizes = (/4, 4/)
 integer, dimension(2) :: subsizes = (/2, 2/)
 integer, dimension(2,2) :: buf
 ...
 call mpi_cart_coords(mpi_comm_world, myid, 2, starts, err)
 call mpi_type_create_subarray(2, sizes, subsizes, starts, &
      mpi_order_c, mpi_integer, filetype, err)
 call mpi_type_commit(filetype, err)
 call mpi_file_set_view(file, 0, mpi_integer, filetype, 'native', &
      mpi_info_null, err)
 call mpi_file_write_all(file, buf, count, mpi_integer, status, err)
...
MPIIO hints
49
Giving hints to MPI I/O
Hints may enable the implementation to optimize performance
MPI 2 standard defines several hints via MPI_Info object
MPI_INFO_NULL : no info
Functions MPI_Info_create, MPI_Info_set allow one to create and set hints
Some implementations allow setting of hints via environment variables
e.g. MPICH_MPIIO_HINTS
Example: for file “test.dat”, in collective I/O aggregate data to 32 nodes
export MPICH_MPIIO_HINTS="test.dat:cb_nodes=32"
Effect of hints on performance is implementation and application
dependent
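The cb_nodes hint from the example set programmatically through an MPI_Info object instead of the environment variable (a sketch; the effect is implementation dependent):

/* Sketch: ask for 32 collective-buffering aggregator nodes for test.dat. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "32");   /* hint: aggregate collective I/O to 32 nodes */

    MPI_File_open(MPI_COMM_WORLD, "test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective writes (MPI_File_write_all, ...) ... */
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}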
Exercise: MPI-IO mpiio_2D_r/w.c:
collective operations
50
Use a number of MPI tasks N such that sqrt(N) is an integer
review MPI_Cart and MPI_Type_create_subarray
MPI-IO for writing a 2D array (32x32) to disk
Without running the program, draw a picture of how the output will look.
nx=32;
mpi_dims[0] = mpi_dims[1] = sqrt(nproc);
int ndims = 2;
int dimsf[2] = {nx, nx};
for (i=0; i < 2; i++) {
count[i] = nx / mpi_dims[i];
offset[i] = coords[i] * count[i];
bufsize *= count[i];
}
MPI_Type_create_subarray(ndims, dimsf, count, offset,
MPI_ORDER_C,MPI_INT, &filetype);
51
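A possible end-to-end sketch of the write part of this exercise around the snippet above (the actual mpiio_2D_w.c may differ; the output file name is made up):

/* mpiio_2D write (sketch): each rank writes its block of a 32x32 array.
   Compile with -lm for sqrt(); requires sqrt(N) to be an integer. */
#include <mpi.h>
#include <math.h>

int main(int argc, char **argv)
{
    const int nx = 32, ndims = 2;
    int nproc, rank, i;
    int mpi_dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    int dimsf[2] = {nx, nx}, count[2], offset[2];
    int buf[32 * 32];
    MPI_Comm cart;
    MPI_Datatype filetype;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    mpi_dims[0] = mpi_dims[1] = (int)sqrt((double)nproc);
    MPI_Cart_create(MPI_COMM_WORLD, ndims, mpi_dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, ndims, coords);

    int bufsize = 1;
    for (i = 0; i < ndims; i++) {
        count[i]  = nx / mpi_dims[i];      /* 8 with N = 16              */
        offset[i] = coords[i] * count[i];  /* block origin in the file   */
        bufsize  *= count[i];              /* 64 local elements for N=16 */
    }
    for (i = 0; i < bufsize; i++) buf[i] = rank;

    MPI_Type_create_subarray(ndims, dimsf, count, offset,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(cart, "mpiio_2D.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, bufsize, MPI_INT, &status);  /* collective write */
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}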
Picture of mpiio_2D_r output N=16
52
[Diagram: the 32 x 32 output array split into a 4 x 4 grid of 8 x 8 blocks; the blocks written by the ranks with Cartesian coordinates (0,0), (0,1), (1,0) and (1,1) are marked, the remaining blocks are left as question marks to fill in.]
N=16 nx=32
count[i]=nx/dims[i] => count[0]=8; count[1]=8
offset[i]=coords[i]*count[i] => offset[0]=x*8; offset[1]=y*8
bufsize *= count[i] => 8*8
MPI_Type_create_subarray()
53
54
[Diagram: the resulting file contents - consecutive groups of rows of eight values filled with 0s, then 1s, 2s, 3s, 4s, ..., showing where each rank's 8 x 8 block lands; the grid of question marks is to be filled in as part of the exercise.]
Example: 3D-Finite Difference
3D domain decomposition to distribute the work
MPI_Cart_create to set up cartesian domain
Hide communication of halo-areas with computational work
MPI-IO for writing snapshots to disk
use local and global derived types to exclude halo areas and write into a global snapshot file
Tutorial about domain decomposition:
http://www.nccs.nasa.gov/tutorials/mpi_tutorial2/
55
Problem we want to solve
We have a 2-dimensional domain on a 2-dimensional processor grid
Each local subdomain has a halo (ghost cells).
The data (without halo) is going to be stored in a single file,
which can be re-read by any processor count
Here is an example with a 2x3 processor grid:
56
[Diagram: global domain of nx x ny points on a px x py processor grid; each local subdomain is lnx x lny.]
Approach for writing the file
First step is to create the MPI 2 dimensional processor grid
Second step is to describe the local data layout using a MPI
datatype
Then we create a "global MPI datatype" describing how the data
should be stored
Finally we do the I/O
57
Basic MPI setup
nx=512; ny=512 ! Global Domain Size
call MPI_Init(mpierr)
call MPI_Comm_size(MPI_COMM_WORLD, mysize, mpierr)
call MPI_Comm_rank(MPI_COMM_WORLD, myrank, mpierr)

dom_size(1)=2; dom_size(2)=mysize/dom_size(1)
lnx=nx/dom_size(1) ; lny=ny/dom_size(2) ! Local Domain size
periods=.false. ; reorder=.false.
call MPI_Cart_create(MPI_COMM_WORLD, dim, dom_size, periods, &
     reorder, comm_cart, mpierr)
call MPI_Cart_coords(comm_cart, myrank, dim, my_coords, mpierr)

halo=1
allocate (domain(0:lnx+halo, 0:lny+halo))
58
Creating the local data type
gsize(1)=lnx+2; gsize(2)=lny+2
lsize(1)=lnx; lsize(2)=lny
start(1)=1; start(2)=1
call MPI_Type_create_subarray(dim, gsize, lsize, &
     start, MPI_ORDER_FORTRAN, MPI_INTEGER, &
     type_local, mpierr)
call MPI_Type_commit(type_local, mpierr)
59
[Diagram: the lnx x lny interior of the local array, starting at (1,1), excluding the halo.]
Use a subarray datatype to describe the noncontiguous layout in memory
Pass this datatype as argument to MPI_File_write_all
And now the global datatype
gsize(1)=nx; gsize(2)=ny
lsize(1)=lnx; lsize(2)=lny
start(1)=lnx*my_coords(1); start(2)=lny*my_coords(2)
call MPI_Type_create_subarray(dim, gsize, lsize, &
     start, MPI_ORDER_FORTRAN, MPI_INTEGER, &
     type_domain, mpierr)
call MPI_Type_commit(type_domain, mpierr)
60
[Diagram: the global nx x ny domain on the px x py processor grid.]
Now we put it all together
call MPI_Info_create(fileinfo, mpierr)
call MPI_File_delete('FILE', MPI_INFO_NULL, mpierr)
call MPI_File_open(MPI_COMM_WORLD, 'FILE', &
     IOR(MPI_MODE_RDWR,MPI_MODE_CREATE), fileinfo, fh, &
     mpierr)

disp=0 ! Note : INTEGER(kind=MPI_OFFSET_KIND) :: disp
call MPI_File_set_view(fh, disp, MPI_INTEGER, &
     type_domain, 'native', fileinfo, mpierr)
call MPI_File_write_all(fh, domain, 1, type_local, &
     status, mpierr)
call MPI_File_close(fh, mpierr)
61
Example
1024x1024x512 sized snapshots (2.1 GB) are written to disk; 16
in total (one every 100 time steps).
stripe size is 1MB
stripe count is 4 or 16
At 1024 cores each MPI task writes a 2 MB portion to disk
Interlagos 32 core nodes at 2.1 GHz
62
Storage into file per MPI task
63
Each MPI domain has a non-contiguous storage view into the snapshot file.
This is transparently handled by MPI-IO
[Plot: IO-time in seconds (0-100) versus number of cores (32-4096) for 3D_FD on the XE6 (Interlagos 2.1 GHz), comparing IO with 4 OSTs and IO with 16 OSTs.]
IO-time collective buffering
64
[Plot annotations: parallel IO build-up (more nodes are used to stream data to disks), collective buffering, communication overhead for many small messages.]
And there is more
http://docs.cray.com
Search for MPI-IO : „Getting started with MPI I/O“, „Optimizing MPI-
IO for Applications on CRAY XT Systems“
Search for lustre (a lot for admins but not only)
Message Passing Toolkit
Man pages (man mpi, man <mpi_routine>, ...)
MPICH2: http://www.mcs.anl.gov/research/projects/mpich2/
65
Summary
66
Parallel I/O - summary
POSIX
single reader/writer, all read/write, subset read/write
user is responsible for communication
MPI I/O
MPI library is responsible for communication
file views enable non-contiguous access patterns
collective I/O can enable the actual disk access to remain contiguous