Centre for Atmospheric Science
How to use p-TOMCAT on our local Dobson cluster
The Dobson cluster consists of Sun Opteron-based servers. Each machine has two dual-core processors, so p-TOMCAT (or MPI) sees four processors per machine. The machines are connected by a high-speed, low-latency InfiniBand network, which reduces communication delays compared with ordinary Ethernet. Timings have shown the model runs more than 20% faster on Dobson than on HPCx.
You can log in to dobson (using rlogin, telnet or ssh), which is the front-end machine of the cluster, but you cannot log in to the other compute nodes. The /home and /data directories are mounted on all machines, so you can refer to files in these directories in your jobs.
An important point to note is that the cluster machines are based on x86 (AMD Opteron) processors, unlike the julius, caesar & rome servers, which use SPARC processors. Therefore you MUST recompile your programs before you can run them on dobson. Also, the binary format on these machines is little-endian rather than the big-endian format of the other servers. If you write binary (unformatted) files from your Fortran programs on rome (say), you need to add the appropriate compiler option (e.g. -fconvert=big-endian with gfortran, or your compiler's equivalent byte-order flag) to read them correctly on dobson.
Setting up your environment
Dobson runs Solaris 10, so to access all the new software correctly you will need to add some directories to your PATH variable. The directories you need to add are:
You also need to add the following directories to your MANPATH to correctly access the man pages:
Lastly, to access the batch queue commands and man pages, edit your .profile file and add this at the end of the file:
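The directory names above are site-specific, but the mechanism is the same in every case: append to PATH and MANPATH in your .profile and re-export them. A minimal sketch — the /opt/sge prefix here is a hypothetical install path, not dobson's real one; substitute the actual directories listed above:

```shell
# Sketch only: /opt/sge is a placeholder prefix, not dobson's real
# install path. Add lines like these to the end of your ~/.profile.
PATH=$PATH:/opt/sge/bin        # batch queue commands
MANPATH=$MANPATH:/opt/sge/man  # batch queue man pages
export PATH MANPATH
```

The change takes effect at your next login, or immediately if you source the file with `. ~/.profile`.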
All the files required to run p-TOMCAT have been downloaded from HPCx and are in the directories under /home/tomcat/public/data. The sample dobson tom.run script has the correct directory names in the namelist for the model.
The forcing files are in /home/tomcat/public/data/ECMWF. We have all the forcing files from 1995 to the present day, and I will keep them in step with the forcing files on HPCx. If anyone notices any missing files, or wants any earlier years, please let me know.
Queues
There are two queues available on the cluster. Each queue supports running jobs in parallel, but they have different limits.
As on HPCx, the top of the tom.run script needs some job control instructions. The format for the dobson batch queue system is the same as NQS (for those who remember NQS batch systems).
In order to make your job run in the right queue, you need to specify how many processors the job should use (you can also specify the queue name).
Here's the top of the tom.run script for dobson with the suggested options:
The options are:
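As a rough guide to what such a header looks like: the qstat output shown later in this document has the Grid Engine layout, so job control lines take the '#$' directive form. The sketch below is generic and illustrative only — the parallel environment name ('mpi'), the processor count and the mail address are placeholders; the real suggested options are the ones listed above.

```shell
#!/bin/sh
# Illustrative Grid Engine job header -- NOT dobson's exact settings.
#$ -N tomcat           # job name; log output goes to tomcat.o<job-id>
#$ -pe mpi 16          # parallel environment (placeholder name) and no. of processors
#$ -cwd                # start the job in the directory it was submitted from
#$ -M you@example.com  # address for job mail (placeholder)
#$ -m be               # send mail at the beginning and end of the job
```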
As described above, there are two batch queues available on dobson. Since dobson is our local cluster, the queues can be configured however we decide; if anyone thinks a different queue setup would be better, let me know.
Use the tom.run.dobson script, downloadable from the main p-TOMCAT website. This run script already has the job control options specified above; change any you need to (e.g. email address, number of processors).
Also, check the namelist variables and make sure the experiment directory is correct. The directories in the namelist are different in the tom.run.dobson job and should already be correct, but if you have any custom settings in your HPCx jobs you will need to change them here.
Once the run script is ready, you submit the job with the 'qsub' command:
qsub tom.run.dobson
or whatever your run script filename is called.
To check on the status of your job, use the 'qstat' command. e.g.
dobson$ qstat
job-ID  prior    name    user   state  submit/start at      queue              slots  ja-task-ID
---------------------------------------------------------------------------------------
    28  0.55500  tomcat  glenn  r      04/24/2007 21:57:04  email@example.com     16
If your job is not running you can use 'qstat -j' followed by the job-id (e.g. qstat -j 28) which then gives more details and the reason why the job isn't running.
By default qstat will only list your jobs. If you want to see all the jobs in the batch queues then do: qstat -u "*".
To delete a queued or running job, use the 'qdel' command, e.g. qdel 28.
For more details of these commands, see their man pages.
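Putting the commands above together, a typical submit/monitor/delete cycle looks like this (job-ID 28 and the tom.run.dobson filename are simply the examples used in this document):

```shell
qsub tom.run.dobson   # submit the run script; prints the new job-ID
qstat                 # list your own jobs and their state
qstat -u "*"          # list everyone's jobs in the batch queues
qstat -j 28           # detailed status, and why job 28 is not running
qdel 28               # delete job 28, whether queued or running
```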
NOTE! It is very important that you do NOT attempt to look at the log file being generated by the model while your job is running. For example, the job above sends its output to the file tomcat.o28. Due to the way parallel I/O is handled by the system, if you look at this file in any way while the job is running, say with 'more' or 'tail', no further output will be written to it. Do NOT look at the contents of the file until the job has finished, otherwise you will lose the rest of the output: the job will continue until finished, but no further log output will be written.
Some other commands which might be useful: