p-TOMCAT Performance

Glenn

The efficiency of a parallel program on P processors is defined as E = T1 / (P × Tp), where T1 is the time for the program to execute on one processor and Tp is the time taken to execute on P processors.

The speedup is then defined as S = P × E = T1 / Tp.
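
As a worked example of these definitions, the short Python sketch below computes speedup and efficiency from two total run times. The function name is made up for illustration, and the sample values are the 1- and 4-processor totals from the Hector T42 table further down; none of this is part of the model code.

```python
def speedup_and_efficiency(t1, tp, p):
    """Return (S, E) for a run taking tp on p processors and t1 on one."""
    s = t1 / tp   # S = T1 / Tp
    e = s / p     # E = T1 / (P * Tp) = S / P
    return s, e


# Hector T42 totals in minutes: 421.0 on 1 processor, 113.5 on 4 processors.
s, e = speedup_and_efficiency(421.0, 113.5, 4)
print(f"speedup {s:.1f}, efficiency {e:.0%}")  # speedup 3.7, efficiency 93%
```

This matches the "3.7 : 93%" entry in the 4-processor row of that table.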

The following data were obtained by turning on the CPU timing code in the main program (see mod_switch.f90). All times were taken from process 0. Times vary slightly between processes, so the figures below are illustrative.


Hector timings

T42 - model version 1.5.

The following timings were made on the Hector machine (Cray XT). The model was run for 10 days starting from 1/1/1996. The compiler options used were: -r8 -byteswapio -extend_source -O3 -ffast-math -inline -trapuv -TENV:X=2 -LIST:options=ON -I/work/n02/n02/emgdc/netcdf/include. In all runs, netcdf output was every 6hrs and creation of restart files was turned off. All species apart from 46:52 were output. Input analyses were 6hrly. The times for the 1 processor run were from a run using MPI but with 1 processor. Note that asad_pts (jpnl) was set to 4; setting jpnl=1 made very little difference.

Click here for graph of speedup against number of CPUs.

NCPUs nproci/nprock Speedup & efficiency Total time (mins) Advection Chemistry Convection Ini/Fini Emissions PBL Patch size
1 1 / 1 -- 421.0 141.3 190.5 46.1 3.7 1.3 38.5 128x64
2 1 / 2 1.8 : 91% 232.2 79.6 100.6 23.6 2.2 0.55 25.6 64x64
4 2 / 2 3.7 : 93% 113.5 37.3 49.0 10.7 2.5 0.24 13.7 64x32
8 2 / 4 6.7 : 83% 63.3 20.8 20.4 5.5 9.2 0.14 7.1 32x32
16 4 / 4 11 : 71% 37.1 14.7 10.7 3.2 4.6 0.11 3.7 32x16
32 4 / 8 21 : 67% 19.8 7.9 5.2 1.5 2.6 0.10 2.4 16x16
64 8 / 8 39 : 61% 10.7 3.2 2.8 0.65 1.4 0.11 2.2 16x8

 

HPCx Timings

The following timings were made on the HPCx machine. The model was run for 10 days starting from 1/3/97. The compiler options in use were: -qinitauto=FF -q64 -O3 -qarch=pwr4 -qtune=pwr4 -qstrict. In all runs netcdf output was every 6hrs; all species were output apart from 46:52. Input analyses were 6hrly. The times for the 1 processor run were from a run using MPI but with 1 processor.

Click here for graph of speedup against number of CPUs.

NCPUs nproci/nprock Speedup/efficiency Total time (mins) Advection Chemistry Convection Ini/Fini Emissions PBL Patch size
1 1 / 1 -- 656 171 335 52.0 4.0 1.2 85.0 128x64
2 2 / 1 1.99 : 99% 328 85.3 169.4 26.1 4.3 0.58 42.7 64x64
4 2 / 2 3.9 : 97% 167.5 41.9 86.7 13.5 3.3 0.4 21.7 64x32
8 4 / 2 7.3 : 91% 89.7 23.9 44.9 7.0 2.6 0.31 11.0 32x32
16 4 / 4 12.8 : 80% 51.3 14.3 21.0 4.0 6.2 0.26 5.8 32x16
32 8 / 4 23.7 : 74% 27.7 7.5 9.8 1.7 4.9 0.25 3.5 16x16
64 8 / 8 38.8 : 61% 16.9 4.2 5.1 0.9 3.7 0.26 2.7 16x8
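
The same speedup definition can be applied to each component separately to see which parts of the model limit scaling. The sketch below does this for the 1- and 64-processor rows of the HPCx table above. The variable names are illustrative, and the per-component speedup is a derived quantity rather than something reported in the original timings.

```python
# Component times in minutes, copied from the HPCx table above.
components = ["Advection", "Chemistry", "Convection", "Ini/Fini", "Emissions", "PBL"]
t_1cpu = [171.0, 335.0, 52.0, 4.0, 1.2, 85.0]   # 1 processor
t_64cpu = [4.2, 5.1, 0.9, 3.7, 0.26, 2.7]       # 64 processors

# Speedup of each component taken on its own: T1 / T64.
component_speedup = {
    name: t1 / t64 for name, t1, t64 in zip(components, t_1cpu, t_64cpu)
}

for name, s in component_speedup.items():
    print(f"{name:<10s} speedup on 64 CPUs: {s:5.1f}")
```

With these figures the chemistry speeds up by slightly more than a factor of 64, while the Ini/Fini time is almost unchanged between 1 and 64 processors.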

 

CSAR Timings

T21 - 64 longitudes, 32 latitudes (model version 1.2)

All times are in minutes and were obtained from a run timed over 5 days. Unless otherwise stated, these runs were done on Green using -O2 optimisation. The costs of the ini/fini routines can be significant. In these runs all tracers apart from 46:52 were output to the netcdf file. Output was every 6hrs. Input analyses were 6hrly.

The times for the 1 processor run were from a run using MPI but with 1 processor.

Click here for graph of speedup .v. no. of cpus.

NCPUs nproci/nprock Speedup/efficiency Total time (mins) Advection Chemistry Convection Ini/Fini Emissions PBL Patch size
1 -- -- 280 65 156 33 2.3 0.40 23.0 64x32
2 1 / 2 1.99 : 99% 141.0 31.2 79.4 15.4 1.4 0.22 13.4 64x16
4 2 / 2 3.9 : 97% 72.1 16.4 39.8 7.4 1.3 0.17 7.0 32x16
8 2 / 4 7.0 : 87% 40.1 8.9 16.7 3.6 7.3 0.14 3.4 32x8
16 4 / 4 13 : 79% 22.1 5.3 9.0 1.5 4.2 0.13 1.9 16x8
32 4 / 8 22 : 70% 12.5 3.1 4.2 0.70 2.9 0.15 1.4 16x4
64 8 / 8 34 : 53% 8.2 2.0 2.27 0.46 2.3 0.15 1.0 8x4
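
To put a number on how significant the Ini/Fini cost becomes, the sketch below computes its share of the total time for each row of the T21 table above. The list names are illustrative; the values are copied from the table.

```python
# Values in minutes, copied from the T21 (model version 1.2) table above.
ncpus = [1, 2, 4, 8, 16, 32, 64]
total = [280.0, 141.0, 72.1, 40.1, 22.1, 12.5, 8.2]
ini_fini = [2.3, 1.4, 1.3, 7.3, 4.2, 2.9, 2.3]

# Ini/Fini time as a percentage of the total run time.
share = [100.0 * i / t for i, t in zip(ini_fini, total)]

for p, pct in zip(ncpus, share):
    print(f"{p:3d} CPUs: Ini/Fini is {pct:4.1f}% of the total")
```

On these numbers the share rises from under 1% on one processor to more than a quarter of the run on 64 processors.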

T42 - 128 longitudes, 64 latitudes (model version 1.2)

Click here for graph of speedup .v. no. of cpus.

NCPUs nproci/nprock Speedup/efficiency Total time (mins) Advection Chemistry Convection Ini/Fini Emissions PBL Patch size
1 -- -- 1298 364.5 640 178.7 11.3 2.46 100.7 128x64
2 1 / 2 1.94 : 97% 668.3 185.9 324.4 90.1 6.9 1.5 59.6 128x32
4 2 / 2 3.8 : 96% 336.6 92.9 162.4 45.0 5.5 0.77 30.0 64x32
8 2 / 4 7.2 : 91% 178.8 45.6 67.0 21.3 29.0 0.55 15.3 64x16
16 4 / 4 13.7 : 85% 94.9 25.7 35.2 10.3 15.4 0.49 7.7 32x16
32 4 / 8 25 : 78% 51.9 14.6 16.2 5.1 10.3 0.56 5.2 32x8
64 8 / 8 44 : 68% 29.8 8.6 8.4 1.9 6.7 0.52 3.6 16x8
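
The patch sizes in the table above are consistent with dividing the longitudes evenly among nproci processors and the latitudes evenly among nprock processors. The sketch below reproduces them under that assumption, which is inferred from the table rather than taken from the model source; the function name is hypothetical.

```python
def patch_size(nlon, nlat, nproci, nprock):
    """Grid points per processor, assuming an even split in each direction."""
    if nlon % nproci or nlat % nprock:
        raise ValueError("grid does not divide evenly among processors")
    return nlon // nproci, nlat // nprock


# T42 grid: 128 longitudes by 64 latitudes.
for nproci, nprock in [(1, 2), (2, 2), (2, 4), (4, 4), (4, 8), (8, 8)]:
    nx, ny = patch_size(128, 64, nproci, nprock)
    print(f"{nproci * nprock:2d} CPUs ({nproci} / {nprock}): {nx}x{ny}")
```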

N.B. A 128-processor run at T42 of v1.2 crashed with a bus error in the gatherrow routine in the PBL scheme. I don't intend to debug this code, as we intend to replace the PBL scheme in the near future.

T42 - Green but using -O3 optimisation instead of -O2

Turning on -O3 instead of -O2 improves the runtime on Green: the 1-processor total falls from 1298 to 1210 minutes. The impact is especially noticeable in the convection code (which is essentially a big matrix multiply), where the 1-processor time falls from 178.7 to 135.5 minutes.

Click here for a graph of speedup .v. cpus when using -O3 on green

NCPUs nproci/nprock Speedup/efficiency Total time (mins) Advection Chemistry Convection Ini/Fini Emissions PBL Patch size
1 -- -- 1210 346.3 631 135.5 9.9 2.5 84.9 128x64
2 1 / 2 2.0 : 100% 599.7 159.5 323.1 58.8 6.1 1.4 50.8 128x32
4 2 / 2 4.0 : 100% 299.8 78.7 160.3 27.6 5.5 0.79 26.9 64x32
8 2 / 4 7.7 : 96% 156.8 36.2 67.5 10.2 28 0.57 14.3 64x16
16 4 / 4 14.5 : 90% 83.6 21.5 35.5 4.2 14.4 0.49 7.4 32x16
32 4 / 8 26 : 82% 45.9 12.6 16.3 1.9 9.7 0.57 4.7 32x8
64 8 / 8 44 : 68% 27.8 8.5 8.4 0.84 6.7 0.52 3.3 16x8
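
To see where -O3 helps most, the sketch below compares the 1-processor component times from the -O2 and -O3 T42 tables on Green. The names are illustrative and the percentages are derived here, not quoted from the original measurements.

```python
# 1-processor component times in minutes on Green at T42, from the tables above.
components = ["Advection", "Chemistry", "Convection", "Ini/Fini", "Emissions", "PBL"]
t_o2 = [364.5, 640.0, 178.7, 11.3, 2.46, 100.7]   # -O2
t_o3 = [346.3, 631.0, 135.5, 9.9, 2.5, 84.9]      # -O3

# Percentage reduction in time when moving from -O2 to -O3
# (a negative value means the component got slower).
reduction = {
    name: 100.0 * (o2 - o3) / o2 for name, o2, o3 in zip(components, t_o2, t_o3)
}

for name, pct in reduction.items():
    print(f"{name:<10s} reduction with -O3: {pct:+5.1f}%")
```

On these figures convection shows the largest reduction, at roughly a quarter of its -O2 time, while chemistry changes by only a percent or two.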

T42 - Newton (-O2 optimisation)

The model is much faster on Newton, by a factor of 3 or more. As a consequence, communication, I/O and the less efficient parts of the program have more impact on efficiency at higher processor counts.

Click here for a graph of speedup .v. cpus on Newton.

NCPUs nproci/nprock Speedup/efficiency Total time (mins) Advection Chemistry Convection Ini/Fini Emissions PBL Patch size
1 -- -- 250.7 65.4 135.1 19.7 3.1 1.0 26.4 128x64
4 2 / 2 3.5 : 87% 72.2 21.0 35.0 5.8 1.7 0.36 8.3 64x32
8 2 / 4 6.6 : 82% 38.2 10.6 15.0 2.8 5.1 0.30 4.3 64x16
16 4 / 4 11.7 : 73% 21.4 6.2 7.9 1.6 2.9 0.30 2.5 32x16
32 4 / 8 22 : 69% 11.27 3.5 3.7 0.54 1.4 0.30 1.8 32x8
64 8 / 8 33 : 51% 7.67 2.2 2.0 0.21 1.8 0.30 1.2 16x8
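
The "3x or more" figure can be checked by dividing the Green (-O2) T42 totals by the Newton totals at matching processor counts. The sketch below does this with values copied from the two tables; the dictionary names are illustrative.

```python
# Total run times in minutes at T42 with -O2, keyed by number of CPUs.
green_o2 = {1: 1298.0, 4: 336.6, 8: 178.8, 16: 94.9, 32: 51.9, 64: 29.8}
newton = {1: 250.7, 4: 72.2, 8: 38.2, 16: 21.4, 32: 11.27, 64: 7.67}

# How many times faster Newton is than Green at each processor count.
ratio = {p: green_o2[p] / newton[p] for p in newton}

for p, r in ratio.items():
    print(f"{p:2d} CPUs: Newton is {r:.1f}x faster than Green")
```

On these totals the ratio stays above 3 at every processor count and is smallest on 64 processors.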

 


Old timings

T21 - 64 longitudes, 32 latitudes (alpha version, no PBL)

All times are in minutes and were obtained from a run timed over 6 hours.

Click here for graph of speedup .v. no. of cpus.

NCPUs nproci/nprock Speedup/efficiency Total time (mins) Advection Chemistry Convection Ini/Fini Emiss Patch
1 -- -- 14 3.7 8.1 2.0 0.08 0.11 64x32
2 2 / 1 1.9 : 95% 7.1 1.95 4.0 0.95 0.058 0.10 32x32
4 4 / 1 3.2 : 81% 4.16 1.43 2.0 0.5 0.05 0.11 16x32
8 8 / 1 5.4 : 67% 2.42 0.90 1.1 0.18 0.08 0.11 8x32
16 8 / 2 9.3 : 58% 1.41 0.46 0.60 0.08 0.051 0.11 8x16
32 8 / 4 13 : 39% 1.10 0.32 0.28 0.04 0.14 0.12 8x8
64 8 / 8 15 : 23% 0.93 0.22 0.14 0.02 0.10 0.15 8x4

Unfortunately, because these times include the startup cost of the model, they don't give a particularly accurate picture of the cost of some components. In particular, the Ini/Fini parts of the model (routines which read and write forcing files and write restart and netcdf files) and the emissions code have relatively high startup costs.

The following table shows the time taken by the model for a run over 5 days with the 16 CPU configuration used above.

NCPUs nproci/nprock Total time (mins) Advection Chemistry Convection Ini/Fini Emiss Patch
16 8 / 2 20 9.2 10.5 1.6 1.4 0.13 8x16
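
One way to see the effect of startup costs is to scale the 6-hour component times for the 16-CPU run up to 5 days (20 six-hour periods) and compare them with the 5-day measurements. The sketch below does this. The linear scaling is a simplifying assumption made here, and the names are illustrative.

```python
# 16-CPU (8 / 2) T21 component times in minutes, from the two tables above.
six_hour = {"Advection": 0.46, "Chemistry": 0.60, "Convection": 0.08,
            "Ini/Fini": 0.051, "Emiss": 0.11}
five_day = {"Advection": 9.2, "Chemistry": 10.5, "Convection": 1.6,
            "Ini/Fini": 1.4, "Emiss": 0.13}

periods = 5 * 24 // 6  # number of 6-hour periods in 5 days

# What each component would cost if the 6-hour time simply repeated.
extrapolated = {name: t * periods for name, t in six_hour.items()}

for name in six_hour:
    print(f"{name:<10s} extrapolated {extrapolated[name]:5.2f}  "
          f"measured {five_day[name]:5.2f}")
```

With these numbers the emissions time barely grows between the 6-hour and 5-day runs, which is consistent with most of it being a one-off cost, whereas advection and convection scale almost exactly with run length.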

 

T42 - 128 longitudes, 64 latitudes (alpha version; no PBL)

Times are for a 6-hour run. Click here for a graph of speedup .v. no. of CPUs.

NCPUs nproci/nprock Speedup/efficiency Total time (mins) Advection Chemistry Convection Ini/Fini Emiss Patch
1 -- -- 55 17.4 28.6 8.5 0.4 0.42 128x64
2 2 / 1 1.9 : 96% 28.7 9.5 14.0 4.4 0.27 0.37 64x64
4 2 / 2 3.4 : 84% 16.3 5.8 7.4 2.3 0.23 0.35 64x32
8 4 / 2 5.6 : 69% 9.9 4.0 3.7 1.1 0.23 0.44 32x32
16 4 / 4 10 : 63% 5.4 1.7 1.9 0.53 0.63 0.40 32x16
32 4 / 8 16 : 50% 3.5 1.1 0.87 0.27 0.30 0.63 32x8
64 4 / 16 20 : 32% 2.7 0.81 0.43 0.13 0.20 0.64 32x4

As for the T21 runs, these costs include the startup costs of the model. A more realistic timing over 5 days is:

NCPUs nproci/nprock Total time (mins) Advection Chemistry Convection Ini/Fini Emiss Patch
32 4 / 8 53 22 16 5 5 5 32x8