Bigben
- System Architecture
- Access To Bigben
- Getting an account on bigben
- Connecting to bigben
- Changing your bigben password
- Changing your login shell
- Accounting on bigben
- Files On Bigben
- Transferring Files
- Compilers
- MPI
- Running A Job
- Scheduling policies
- Batch access (the qsub command)
- Sample job script
- Flexible walltime requests
- Other qsub options
- Interactive access
- Monitoring And Killing Jobs
- Checking the Status of BigBen
- Debugging
- Debugging strategy
- Compiler options
- TotalView
- Memory
- Core files
- Exception handling
- Little endian versus big endian
- Recl specifier for Fortran direct access IO
- Optimization
- CrayPat
- CrayPat and the Gnu compilers
- Mflops
- Cache
- Compilers and compiler options
- IO optimization
- Using two processors per node
- small_pages option
- Gnu malloc routines
- Third-party software
- Scalability
- Optimization assistance
- Software Packages
- Cray Documentation
- Portland Group Web Site
- Bigben and the TeraGrid
- Acknowledgement in Publications
- Reporting A Problem
System Architecture
BigBen is a Cray XT3 MPP system with 2068 compute nodes linked by a custom-designed interconnect. Twenty-two dedicated IO processors are also connected to this network. Each compute node has two 2.6 GHz AMD Opteron processors. Each compute processor has its own cache, but the two processors on a node share 2 Gbytes of memory and the network connection.
Each compute processor runs the Catamount operating system. Catamount is a subset of Unix, and consequently not all Unix system calls are available on the compute processors. For example, you cannot use threads on bigben's compute processors, and thus OpenMP is not available on bigben. However, typical computational science applications do not rely on system calls and hence should easily port to bigben. If you want to know if a particular Unix call is supported on Catamount, send email to remarks@psc.edu.
There are multiple front end processors, which are also AMD Opteron processors and which run SuSE Linux. Logins are to one of these front end processors, not to the compute processors.
BigBen is primarily intended to run applications with very high levels of parallelism or concurrency (512-4136 processors).
Access to Bigben
Getting an account on bigben
There are two types of grants available on bigben: development grants and production grants. Development grants are available as precursors to large requests or for work which will exploit the unique architectural capabilities of bigben. Production grants are large awards for users with extensive computational requirements.
To apply for a production or a development grant on bigben you must fill out the online POPS proposal form. This form allows three types of request: Start-up, Medium and Large. The Medium and Large requests are for production grants that ask for 30,001-500,000 Service Units (SUs) and for 500,001 and above SUs respectively. The Start-up request should be used for development grants or for production grants that ask for 30,000 or fewer SUs.
Connecting to bigben
To connect to bigben you must ssh to
Changing your bigben password
You must change your bigben password within 30 days of the date on the initial password sheet. If you don't, logins will be disabled on your account. Contact PSC User Services if this happens. We will also disable your password if you do not change it at least once a year. We send out an email notice warning you that your password is about to be disabled in the latter case.
Use the kpasswd command to change your PSC Kerberos password. Do not use the passwd command to change your password. You have the same password on all PSC production systems. If you change your password on one PSC system using kpasswd you will change it on all other PSC production systems.
See the general PSC password policies.
Changing your login shell
You can change your default login shell with the /usr/psc/bin/chsh command. When doing so, specify a shell from the /usr/psc/shells directory.
Accounting on bigben
One processor-hour on bigben is one SU.
If you have more than one account, use the qsub option -W group_list to indicate to which account you want a job to be charged. The use of this option is discussed in the "Batch access" subsection of this document. To change your default account you must send email to remarks@psc.edu with this request.
User accounting data is available with the xbanner command. Account information including the initial SU allocation for a grant, the number of unused SUs remaining for a grant and the date of the last job that charged to a grant are displayed.
Accounting information for grants is also available at the PSC Grant Management System on the Web at https://grants.psc.edu/arms. You will need your PSC Kerberos password to access this system. This system can provide more detailed information than xbanner, although some of the information is only available to grant PIs. The system has extensive internal documentation.
Files on Bigben
File Systems
File systems are file storage spaces directly connected to a system. There are currently two such areas available to you on bigben. Together these two areas have over 200 Tbytes of space.
- /usr/users/n/username
- This is your home directory. The numeral 'n' will be replaced by an integer and 'username' will be replaced by your userid. You can also refer to this directory as $HOME. Your $HOME directory has a 5-Gbyte quota. $HOME is visible to all the compute processors and all the front end processors. $HOME is backed up daily, although it is still a good idea to store your important $HOME files to golem. Golem, PSC's file archival system, is discussed below.
- $SCRATCH
- This is bigben's scratch area, to be used as working space for your running jobs. $SCRATCH is implemented using the Lustre parallel file system and is visible to all the compute processors and all the front end processors. There are no quotas for $SCRATCH, so you must police your own disk usage. $SCRATCH is not backed up, and we will delete $SCRATCH files if we need space to keep jobs running. You should therefore not treat $SCRATCH as permanent storage: save copies of your files to either your local site or to golem as soon as you can after you create them. Golem, PSC's file archival system, is discussed below.
File Repositories
File repositories are file storage spaces which are not directly connected to a front end processor or compute processors. You cannot, for example, in a program open a file that resides in a file repository. You must use explicit file copy commands to move files to and from the repository. You currently have one file repository available to you on bigben: golem, PSC's file archiver.
golem
Golem runs Cray's DMF file archival system. It is a combination tape-and-disk archival system.
The far program can be used to transfer files between golem and bigben.
You should transfer your files between golem and bigben outside of your batch jobs. Otherwise your jobs will be holding compute processors while your files are being transferred.
If you need to store a file to golem that is 2 Tbytes or larger, first contact User Services so that special arrangements can be made to store your file.
Transferring Files
You can use kftp or scp to transfer files between bigben and your local machine. For performance reasons kftp is the preferred method. As was mentioned above, far should be used for file transfers between bigben and golem. Golem is on the TeraGrid so you can also use TeraGrid methods of file transfer between golem and your local machine. The TeraGrid is discussed below.
Compilers
Both the Portland Group and the Gnu compilers are installed on bigben. The Unified Parallel C compiler (UPC) is also installed. The standard compilers on bigben are Portland Group compilers.
Compiling for execution on the front end nodes
You should only run small test programs on the front ends. All production runs must be done on bigben's compute processors.
To create an executable that runs on the front ends with the Portland Group compilers, use the commands pgcc, pgf90 and pgCC for C, Fortran or C++ programs respectively.
Using the Gnu compilers gcc and g++ will create executables that will run on the front ends. Executables created with these commands will not run on bigben's compute processors.
UPC
UPC is an extension of the C programming language designed for high-performance computing on parallel machines.
Documentation on the use of UPC on bigben is available at http://www.psc.edu/general/software/packages/upc/upc.html.
MPI and Portland Group compilers
To use MPI, you must use the compiler wrappers cc, ftn and CC for C, Fortran or C++ respectively. For example, a command to compile and link a C MPI program would look like
cc -o hellompi hellompi.c
These wrappers create executables which only run on bigben's compute processors. They will not run on the front ends.
MPI and Gnu compilers
Use the same compiler wrappers cc and CC to compile an MPI program with the Gnu gcc or g++ compilers instead of the Portland Group compilers. We have not installed a Gnu Fortran compiler on bigben for performance reasons. Before you issue the wrapper command, switch from the Portland Group programming environment, which is loaded by default, to the Gnu programming environment:
module switch PrgEnv-pgi PrgEnv-gnu
Executables created in this way will run on bigben's compute processors only and not on the front ends.
Inlining and the Gnu Compilers
The Gnu compilers cannot inline functions if they are not in the source file that the compiler is currently compiling. To see how to guarantee that your functions get inlined see below the "Compilers and compiler options" subsection of the "Optimization" section.
Using the -Mcache_align or -fastsse options
If you are creating 32-bit binaries and any of your routines are compiled with the Portland Group -Mcache_align option, you must compile all of your routines with that option. This includes system libraries. Always check the documentation for any packages you use to see if they are compiled with -Mcache_align. Note that the -fastsse option, which we recommend below in the "Optimization" section, turns on the -Mcache_align option. Thus, if you compile one of your routines with -fastsse, you must compile all of them with this option.
Other compiler information
The man pages for the Portland Group compilers are pgf90, pgcc and pgCC for the Fortran, C and C++ compilers respectively. Portland Group has a Web site with information on its compilers, including porting and optimization tips.
The man page for the Gnu C and C++ compilers is gcc. There is also a Web site for the Gnu C and C++ compilers.
MPI
MPI is available on bigben. The MPI programming model is the one most users will use on bigben. The version of MPI on bigben supports all of MPI 2.0, except for the dynamic process functions. MPI programs will only run on bigben's compute processors.
To enable your programs to use MPI, you must include the MPI header file in your source code. For Fortran, use this include directive in your source:
include 'mpif.h'
For C and C++, use this include directive:
#include <mpi.h>
To avoid name conflicts you should include mpi.h in your C++ programs before you include stdio.h or iostream.
You do not need to explicitly specify the MPI library with an -l option if you use the MPI wrapper commands mentioned above.
If your MPI programs receive error messages about overflowing MPI buffers there are two approaches you can take to resolve this problem. The error message may mention an MPI variable whose value you can increase. The MPI variables and their default values are described in the intro_mpi man page.
However, this approach does not always work, either because no variable is mentioned in the overflow error message or because you cannot set the variable to a large enough value to solve the problem. The other approach to the problem requires you to set variable MPICH_PTL_SEND_CREDITS to -1. You may choose to use this approach first both because of its simplicity and because it handles most cases of the problem.
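The second approach can be sketched as follows. This is a minimal example in Bourne-shell syntax, assuming the variable is set in the job script before the program is launched; the executable name in the comment is only illustrative. In csh, which the sample job script in this document uses, the equivalent would be: setenv MPICH_PTL_SEND_CREDITS -1

```shell
# Bourne-shell syntax; in csh the equivalent is:
#   setenv MPICH_PTL_SEND_CREDITS -1
MPICH_PTL_SEND_CREDITS=-1
export MPICH_PTL_SEND_CREDITS

# Assumption: the variable must already be in the environment when the
# program is launched, for example (illustrative executable name):
#   pbsyod -size $PBS_O_SIZE ./hellompi
```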
Running a Job
The Portable Batch Scheduler (PBS) controls all access to bigben's compute processors, for both batch and interactive jobs. Bigben has two queues--the batch queue and the debug queue. Interactive jobs can be run in the batch or debug queue.
The debug queue is turned on every day from 8:00 a.m. through 8:00 p.m. However, on the weekend we reserve the right to preempt without notice the nodes allocated to the debug queue if they are needed to run large jobs. Jobs can be submitted to the debug queue at any time, but they will only run between 8:00 a.m. and 8:00 p.m.
Jobs in the debug queue can request a maximum of 16 processors (8 nodes) and a maximum walltime of 30 minutes. 16 processors (8 nodes) in total are allocated to the debug queue. The scheduling policies for the debug queue are designed to prevent a single user from dominating the processors allocated to the debug queue.
You should use the debug queue for your debugging runs. Do not run a debugging run on any of bigben's front ends.
The debug queue is intended to be used in the classic debugging cycle in which you run a debugging job, check its output and then submit another debugging job. You should not flood the debug queue with jobs nor should you chain jobs through the debug queue by having a debug job submit its successor.
The debug queue should not be used for production runs that use only a few processors.
The batch queue is always on. Your batch queue jobs can request as many processors as are available on the system. The maximum walltime in the batch queue is 18 hours.
Scheduling policies for the batch queue are designed to favor large jobs, while giving jobs of all sizes a chance to get on the machine. The policies are also designed to prevent a single user from flooding the queue or dominating the processors allocated to the batch queue.
If you have suggestions or comments about the scheduling policies on bigben or find that these policies are not meeting your computing needs send email to remarks@psc.edu.
Batch access
You use the qsub command to submit a job script to PBS. A PBS job script consists of PBS directives, comments and executable commands. The last line of your job script must end with a newline character.
Sample job script
A sample job script is
#!/bin/csh
#PBS -l size=4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q debug
set echo
# move to my $SCRATCH directory
cd $SCRATCH
# copy executable to $SCRATCH
cp $HOME/hellompi .
# run my executable
pbsyod -size $PBS_O_SIZE ./hellompi
The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used.
The next four lines are PBS directives.
- #PBS -l size=4
- The first directive requests 4 processors. If you request M processors your job will run on ceiling(M/2) nodes. If you request an odd number of processors you only use one processor on the last node allocated to your job but you are still charged for use of the entire node. Jobs cannot share nodes.
By default your processes are allocated across your processors in an SMP-style manner. Your first process is allocated to a processor on your first node. Then your second process is allocated to the other processor on your first node. Your next two processes are allocated to the two processors on your second node. This procedure of allocating your next two processes to each successive node continues until all your processes have been allocated.
You can change this allocation scheme by setting the value of the variable MPICH_RANK_REORDER_METHOD before you run your program. If you set the variable PMI_DEBUG before you run your program your standard output will display how your processes were allocated across your nodes. See the mpi man page for more information about these two variables.
You can also specify the value for size as -l size=N:M instead of -l size=M, where N is the number of nodes you want and M is the number of processors. This approach is useful if you want to run on only one processor per node because you want each of the processors you are running on to have access to all the available memory on a node or you want to reduce processor contention on your nodes. For example, the specification -l size=1024:1024 indicates that you want to run on 1024 nodes, but use only one processor per node. You must also use the option -SN to pbsyod if you want to run on one processor per node. See the discussion of pbsyod below. If you use only one processor per node you are charged for the use of the entire node. Jobs cannot share nodes. If you specify an illegitimate combination of values for nodes and processors your job will be rejected as soon as you submit it.
- #PBS -l walltime=5:00
- The second directive requests 5 minutes of wallclock time. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.
- #PBS -j oe
- The next directive combines your .o and .e output into one file, in this case your .o file. This will make your program easier to debug.
- #PBS -q debug
- The final PBS directive requests that your job be run in the debug queue. The batch queue is the default queue. Thus, if you want to run in the batch queue you can omit this directive.
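The size and walltime directives together determine the most a job can be charged. The sketch below combines the rule that one processor-hour is one SU with the whole-node charging described above. The function name estimate_sus is hypothetical, and the sketch assumes that a whole node is billed as two processors and that the charge is computed for the full requested walltime, so the result is an upper bound rather than an exact charge.

```shell
# estimate_sus SIZE WALLTIME
#   SIZE     is either M (processors) or N:M (nodes:processors)
#   WALLTIME is SS, MM:SS or HH:MM:SS
# Prints the SUs charged if the job holds its nodes for the full walltime.
# Assumption: each node is billed as two processors.
estimate_sus() {
    size=$1
    walltime=$2
    case $size in
        *:*) nodes=${size%%:*} ;;              # N:M form: N nodes requested
        *)   nodes=$(( ($size + 1) / 2 )) ;;   # M form: ceiling(M/2) nodes
    esac
    echo "$walltime" | awk -F: -v n="$nodes" '{
        s = 0
        for (i = 1; i <= NF; i++) s = s * 60 + $i   # walltime in seconds
        printf "%.2f\n", 2 * n * s / 3600           # processor-hours
    }'
}
```

For example, estimate_sus 5 1:00:00 prints 6.00: five processors occupy three nodes, so six processors are charged for one hour.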
The remaining lines in the script are comments or command lines.
- set echo
- This command causes your batch output to display each command next to its corresponding output. This will make your program easier to debug. If you are using the Bourne shell or one of its descendants use 'set -x' instead of 'set echo'.
- Comment lines
- The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must begin in column one of your script file. The remaining lines in the sample script are executable commands.
- pbsyod
- The pbsyod command is used to launch your executable on your compute processors. Only programs executed with pbsyod are executed on your compute processors. All other commands are executed on a front end processor. Thus, you must use pbsyod to run your executable or it will run on the front end, where it will probably not work. If it does work, it will degrade system performance. The need to use pbsyod applies to debug jobs as well as production jobs and to serial programs as well as parallel programs. Also, you can only use pbsyod to run an executable created with a compiler. You cannot run a script with pbsyod. The pbsyod command actually calls the yod command. You can find information on the options available for pbsyod by looking at the man page for yod.
The -size option to pbsyod indicates on how many processors you want your program to run. The variable PBS_O_SIZE contains the value you gave for the -l size PBS specification. Thus, if you gave the value of -l size as -l size=M you can specify the value of the pbsyod -size option as $PBS_O_SIZE. However, if you gave the value of -l size as -l size=N:M then you must specify the value of the pbsyod -size option as $PBS_NPROCS. You can of course also use this value in the case where you specify -l size as -l size=M. In addition, if you want to run on one processor per node you must include the -SN option to pbsyod when you issue your pbsyod command.
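The rules for matching the pbsyod options to the -l size request can be summarized in a small helper. This is only a sketch: the function name pbsyod_command is hypothetical, it prints the command line rather than running it, and it assumes that the relative order of -SN and -size does not matter to pbsyod.

```shell
# pbsyod_command SIZE EXECUTABLE
#   SIZE is the value given to "-l size", either M or N:M.
# Prints (does not run) a matching pbsyod command line.
pbsyod_command() {
    spec=$1
    exe=$2
    case $spec in
        *:*)
            req_nodes=${spec%%:*}
            req_procs=${spec##*:}
            if [ "$req_nodes" -eq "$req_procs" ]; then
                # One processor per node: -SN is required.
                printf '%s\n' "pbsyod -SN -size \$PBS_NPROCS $exe"
            else
                printf '%s\n' "pbsyod -size \$PBS_NPROCS $exe"
            fi
            ;;
        *)
            # Plain "-l size=M": PBS_O_SIZE holds M.
            printf '%s\n' "pbsyod -size \$PBS_O_SIZE $exe"
            ;;
    esac
}
```

For the sample script above, pbsyod_command 4 ./hellompi prints the same pbsyod line that the script uses.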
Within your batch script the variable PBS_O_WORKDIR is set to the directory from which you issued your qsub command. The variable PBS_NNODES is set to the number of nodes you requested.
After you create your batch script you can submit your script to PBS with the qsub command.
qsub myscript.job
Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.
Each of your jobs also generates a job_XXXX_console.log file, where 'XXXX' is replaced by your job's jobid. This file is placed in the job_console_logs directory off of your home directory. This file may be useful to us if we need to debug a problem with your job. Please do not remove files from this directory.
You can also specify PBS directives as command-line options to qsub. Thus, you could omit the PBS directives in the sample script above and submit the script with
qsub -l size=4 -l walltime=5:00 -j oe -q debug myscript.job
Command-line options override PBS directives included in your script.
Flexible walltime requests
Two other qsub options are available for specifying your job's walltime request.
-l walltime_min=HH:MM:SS
-l walltime_max=HH:MM:SS
You can use these two options instead of "-l walltime" to make your walltime request flexible or malleable. A flexible walltime request can improve your job's turnaround. To accommodate large jobs, the system actively drains nodes to create dynamic reservations. The nodes being drained for these reservations create backfill up to the reservation start time that may be used by other jobs. Using flexible walltime limits increases the opportunity for your job to run on backfill nodes. When your job starts, the system selects a walltime value within the limits you've specified and it does not change during your job's execution.
For example, if your job requests 64 processors and a range of walltime between 2 and 4 hours and a 64-processor slot is available for 3 hours, your job could run in this slot with a walltime request of 3 hours. If your job had asked for a fixed walltime request of 4 hours it would not be started.
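The reasoning in this example can be written out as a small function. This is an illustration only, not the scheduler's actual algorithm: the function name assigned_walltime is hypothetical, times are in whole hours for simplicity, and it assumes the system picks the largest walltime within your limits that fits the available slot.

```shell
# assigned_walltime MIN_HOURS MAX_HOURS SLOT_HOURS
# Prints the walltime a job would get in a backfill slot of SLOT_HOURS,
# or prints nothing and returns 1 if the job cannot start in that slot.
# Assumption: the largest value within [MIN, MAX] that fits is chosen.
assigned_walltime() {
    min_hours=$1
    max_hours=$2
    slot_hours=$3
    if [ "$slot_hours" -lt "$min_hours" ]; then
        return 1                 # slot is shorter than the minimum request
    elif [ "$slot_hours" -ge "$max_hours" ]; then
        echo "$max_hours"        # slot is long enough for the full request
    else
        echo "$slot_hours"       # job is trimmed to fit the slot
    fi
}
```

With the numbers from the example, assigned_walltime 2 4 3 prints 3, while a fixed 4-hour request (assigned_walltime 4 4 3) prints nothing because the job cannot start in the 3-hour slot.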
If you run a job with a flexible walltime request you can determine the actual walltime your job was assigned by examining the Resource_List.walltime field of the output of the qstat -f command. The command
qstat -f $PBS_JOBID
will give this output for the current job. You can capture this output to find the value of the Resource_List.walltime field.
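One way to capture the value is sketched below in Bourne-shell syntax. The function names are hypothetical, and the sketch assumes that the qstat -f output contains a line of the form "Resource_List.walltime = HH:MM:SS"; check the output on the system before relying on this layout.

```shell
# Reads "qstat -f" output on standard input and prints the walltime value.
# Assumption: the field appears as "Resource_List.walltime = HH:MM:SS".
walltime_from_qstat() {
    awk -F' = ' '/Resource_List\.walltime/ { print $2; exit }'
}

# Converts SS, MM:SS or HH:MM:SS to a number of seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{
        s = 0
        for (i = 1; i <= NF; i++) s = s * 60 + $i
        print s
    }'
}
```

Inside a job script these could be combined as walltime_to_seconds "$(qstat -f $PBS_JOBID | walltime_from_qstat)" and the result passed to your program as an argument.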
You may need to provide this value to your program so that your program can make appropriate decisions about writing checkpoint files. In the above example, you would tell your program that it is running for 3 hours and thus should begin writing checkpoint files sufficiently in advance of the 3-hour limit that the file writing is completed when the limit is reached. The function mpi_wtime can be used to track how long your program has been running so that it writes checkpoint files to make sure you save results from your program's processing.
You may also want to save time at the end of your job to allow your job to transfer files after your program ends but before your job ends. You can use the -tlimit option to pbsyod for this purpose.
For more information on the -tlimit option see the yod man page. If you want assistance on the procedures needed to capture your job's actual walltime or to determine when your job should write checkpoint files send email to remarks@psc.edu.
Other qsub options
Besides those mentioned in the sample script above, there are several other options to qsub that may be useful. See man qsub for a complete list.
- -m a|b|e|n
- Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent; this is the default.
- -M userlist
- Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job.
- -v variable_list
- This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on bigben.
- -r y|n
- Indicates whether or not a job should be automatically restarted if it fails due to certain system states. If there is a system failure, jobs that were running will be restarted unless -r is set to "n". The default is to restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
- -W group_list=charge_id
- Indicates to which charge-id you want a job to be charged. You can see your valid charge-ids by grepping for your entry in the /etc/group file. Replace 'charge_id' in the above option with the charge-id you want your job to be charged to. Your default charge-id is indicated by the group field in your entry in the /etc/passwd file. The fourth field in your entry in the /etc/passwd file is your group-id. If you grep for this number in the /etc/group file the first field of the output is your default charge-id. If you want to switch your default charge-id send email to remarks@psc.edu. If you only have one grant on bigben you do not need to use this option. This option can only be specified as a command-line option.
- -W depend=dependency:jobid
- Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
after: this job can be scheduled after job jobid begins execution.
afterok: this job can be scheduled after job jobid finishes successfully.
afternotok: this job can be scheduled after job jobid finishes unsuccessfully.
afterany: this job can be scheduled after job jobid finishes in any state.
before: this job must begin execution before job jobid can be scheduled.
beforeok: this job must finish successfully before job jobid begins.
beforenotok: this job must finish unsuccessfully before job jobid begins.
beforeany: this job must finish in any state before job jobid begins.
Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
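The lookup of your default charge-id described under the -W group_list option above can be sketched as follows. The function name default_charge_id is hypothetical. Unlike a plain grep for the group-id number, the sketch compares the number only against the third (group-id) field of each /etc/group entry, which avoids accidental matches elsewhere on a line.

```shell
# default_charge_id USER [PASSWD_FILE] [GROUP_FILE]
# Prints the name of the group whose id is the fourth field of USER's
# passwd entry. The file arguments default to /etc/passwd and /etc/group.
default_charge_id() {
    lookup_user=$1
    passwd_file=${2:-/etc/passwd}
    group_file=${3:-/etc/group}
    # Fourth field of the passwd entry is the numeric group-id.
    gid=$(awk -F: -v u="$lookup_user" '$1 == u { print $4; exit }' "$passwd_file")
    [ -n "$gid" ] || return 1
    # First field of the matching group entry is the charge-id.
    awk -F: -v g="$gid" '$3 == g { print $1; exit }' "$group_file"
}
```

On a front end, default_charge_id followed by your userid would print the charge-id that jobs are charged to when -W group_list is not given.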
Interactive access
A form of interactive access is available by using the -I option to qsub. For example, the command
qsub -I -q debug -l walltime=10:00 -l size=2
requests interactive access in the debug queue to 2 processors for 10 minutes.
The system will respond with a message similar to
qsub: waiting for job 54.bigben.psc.edu to start
Your qsub request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
When your job starts you will receive the message
qsub: job 54.bigben.psc.edu ready
and then your shell prompt. You can use the -M and -m options to qsub to have the system send you email when your job has started.
At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are all connected to your terminal. To run on the compute processors allocated to your interactive job you must use the pbsyod command.
When you are finished with your interactive session type ^D. The system will respond
qsub: job 54.bigben.psc.edu completed
When you use qsub -I you are charged for the total time that you hold your processors whether you are computing or not. Thus, as soon as you are done running executables you should type ^D.
Monitoring and Killing Jobs
The qstat -a command is used to display the status of the PBS queue. It includes running and queued jobs. For each job in the queue it shows the amount of walltime and number of nodes and processors requested. This information can be useful in predicting when your job might run. The -f option to qstat provides you with more extensive status information for a single job.
The shownids command, located in /usr/local/bin, shows you the status of all the compute nodes on bigben. A nid is a node id. The output of shownids shows the number of nodes in certain types of states. Enabled nodes are all nodes available to PBS for scheduling. Allocated nodes are those enabled nodes that are currently running jobs. Free nodes are those enabled nodes that are currently free. The allocated and free categories of nodes have separate entries for the batch and the debug queue. You can use the output from shownids and qstat -a to determine when your jobs might start.
The qdel command is used to kill queued and running jobs.
qdel 54
The argument to qdel is the jobid of the job you want to kill. If you cannot kill a job that you want to kill send email to remarks@psc.edu.
Bigben Status
Check the current state of bigben--free nodes, scratch space, scheduled drains, etc.--on the bigben status page.
A Web-based monitor for bigben is also available. It shows the current status of the nodes on the machine and which jobs are running on the machine and where.
Debugging
Debugging strategy
Your first few runs should be on a small version of your problem. Your first run should not be for your largest problem size. It is easier to solve code problems if you are using fewer nodes. This strategy should be followed even if you are porting a working code from another system.
You should use the debug queue for your debugging runs. Do not run a debugging run on any of bigben's front ends. You should always run a bigben program with qsub and pbsyod.
The debug queue is intended to be used in the classic debugging cycle in which you run a debugging job, check its output and then submit another debugging job. You should not flood the debug queue with jobs nor should you chain your jobs through the debug queue by having a debug job submit its successor.
The debug queue should not be used for production runs that use only a few processors.
Compiler options
Several compiler options can be useful to you when you are debugging your program. If you use the -g option to the Portland Group or the Gnu compilers, the error messages the system provides when your code fails will probably be more informative. For example, you will probably be given the line number of the source code statement that caused the failure. Once you have a production version of your code, do not use the -g option; if you do, your code will run slower.
The -Mbounds option to the Portland Group compilers will cause your code to report if it exceeds an array bounds while running.
Variables are not automatically initialized on bigben. This can cause program failures if you port a code to bigben from a platform where variables are automatically initialized and you do not insert code to initialize those variables on bigben.
If you use the options -Wall and -O to the Gnu compilers you will be warned about many cases of uninitialized variables. There are no corresponding options for the Portland Group compilers. However, for debugging purposes you can temporarily switch from a Portland Group compiler to a Gnu compiler if you think uninitialized variables may be a cause of your program failure. Similarly, you could switch from a Gnu compiler to a Portland Group compiler to use the -Mbounds option.
There are many options to the Portland Group and the Gnu compilers which may assist you in debugging your program. For more information about these options see the appropriate man pages.
TotalView
The TotalView debugger is available on bigben. Online documentation is available with information on the steps to follow to run TotalView.
Memory
If you are using both processors on a node you have available about 870 Mbytes of memory per processor for an MPI code and about 950 Mbytes of memory per processor for a non-MPI code. When you are using two processors on a node one of the processors cannot poach memory from the other processor. If you are using only one processor on a node you have about 1850 Mbytes of memory available for that processor for an MPI code and about 1920 Mbytes of memory available for a non-MPI code.
You can reduce the amount of memory an MPI code uses by adjusting the values of certain MPI environment variables. The variable MPICH_UNEX_BUFFER_SIZE has the largest default value. If your code exceeds the size of an MPI buffer you will receive a diagnostic error that will tell you what variable to change. See the man page intro_mpi and the MPI discussion above for more information about MPI and memory usage.
You can also reduce the size of your executable by using the strip command to strip your executable of its symbol tables. However, if you do this you cannot use a debugger and the core dump messages you receive if your program fails will probably be less informative. The strip program is described in the strip man page.
If you try to load a program with pbsyod that is too big, the load will fail. If you try to allocate more memory than is available while your program is running, your program will fail. However, if you are using error checking your program can catch this error and continue executing.
The default allocation for heap memory is all of available memory, so you cannot increase this value. The heap_info function can be used to check how much free heap memory is left. See the heap_info man page for more information.
You can also use the heapmax utility to tell what your maximum heap memory usage is. To use heapmax link your program with /usr/local/packages/heapmax/heapmax_cat.o if you are using the Portland Group malloc routines and with /usr/local/packages/heapmax/heapmax_gnu.o if you are using the Gnu malloc routines. You use the Gnu malloc routines by linking with the option -lgmalloc, whether you are using the Portland Group compilers or the Gnu compilers. Below in the "Gnu malloc routines" subsection we recommend using the Gnu malloc routines in certain situations to improve your program's performance.
When you run your program after linking with the heapmax library heapmax will write output to standard output that shows the maximum amount of heap memory your program uses. To see this value you should look at the line in your output labelled "megabytes maximum heap allocation".
The default allocation for stack memory is 16 Mbytes. You can change this with the -stack option to pbsyod. See the man page for yod for more information. If your program exceeds its stack memory allocation, it will fail.
Core files
If your program fails and generates a core file, the default behavior of the system is to not keep the core file in order to save space. If you want to keep the core file a job generates you can set the environment variables CORE_ACTION_FIRST and CORE_ACTION_OTHER to FULL in your script before your pbsyod command.
For example, if you are using the C-shell you could use the commands
setenv CORE_ACTION_FIRST FULL
setenv CORE_ACTION_OTHER FULL
to set these environment variables. If you are using a different shell you would substitute the equivalent commands for setting environment variables.
See the core man page for more information on generating core files.
Exception handling
For performance reasons the Portland Group compilers will, by default, generate programs that will run through most exceptions. Segmentation fault is an example of an exception that these programs will not run through.
During program development you may want to trap exceptions. You can do this by using the option
-Ktrap=fe,denorm
when you link your program. Once your code is debugged you may want to revert to the default behavior if you have determined that any exceptions that could be run through will not affect the results of your program.
For more information about the -Ktrap option see the pgcc, pgf90 or pgCC man pages.
Little endian versus big endian
The data bytes in a binary floating point number or a binary integer can be stored in a different order on different machines. Bigben is a little endian machine, which means that the low-order byte of a number is stored in the memory location with the lowest address for that number while the high-order byte is stored in the highest address for that number. The data bytes are stored in the reverse order on a big endian machine.
If your machine has Tcl installed you can tell whether the machine is little endian or big endian by issuing the command
echo 'puts $tcl_platform(byteOrder)' | tclsh
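If Tcl is not installed, the same check can be made in other ways. The following Python sketch is only an illustration and is not specific to bigben: it shows the two byte layouts for a four-byte integer and decodes raw big endian 8-byte floating point values. The function names are hypothetical, and the sketch does not handle the record markers that Fortran unformatted files contain.

```python
import struct
import sys

value = 0x01020304

# Explicit layouts: '<' is little endian, '>' is big endian.
little = struct.pack("<I", value)  # low-order byte stored first
big = struct.pack(">I", value)     # high-order byte stored first


def host_byte_order():
    """Return 'little' or 'big' for the machine running this script."""
    return sys.byteorder


def read_big_endian_doubles(raw):
    """Decode a bytes object of big endian 8-byte floats into a tuple."""
    count = len(raw) // 8
    return struct.unpack(">%dd" % count, raw[: count * 8])
```

The two packed forms of the same integer are byte-for-byte reversals of each other, which is the difference between the two kinds of machines described above.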
You can read a big endian file on bigben
if you are using Portland Group Fortran by
using the compiler option
Recl specifier for Fortran direct access IO
The units of the recl specifier in the Fortran open statement on bigben are bytes. On some other machines the default units are four-byte words.
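When you port a code, the recl value therefore may need to change. The following Python sketch shows the arithmetic only; the function names are hypothetical, and you should check the documentation of the other machine to confirm which units it uses.

```python
def recl_in_bytes(n_values, bytes_per_value):
    """Record length when recl is measured in bytes, as on bigben."""
    return n_values * bytes_per_value


def recl_in_words(n_values, bytes_per_value, word_bytes=4):
    """Record length on a machine whose recl units are four-byte words."""
    total = n_values * bytes_per_value
    if total % word_bytes:
        raise ValueError("record is not a whole number of words")
    return total // word_bytes
```

For a record of 100 eight-byte values, this gives a recl of 800 on bigben and 200 on a machine that counts in four-byte words.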
Optimization
CrayPat
The CrayPat tool is available on bigben to enable you to collect performance data about your code. You use CrayPat to instrument your code and to display the performance data generated by your program.
Follow the steps below to use CrayPat with the Portland Group compilers.
- Configure your software environment to use CrayPat with the command
module load craypat
- Compile and link your program. When you compile and link you must use the -Mprof=func option.
- After you compile and link your program you must instrument it with the command
pat_build -g mpi hellompi hellompi_instr
- You then run the instrumented version of your code using pbsyod. In this example the name of the instrumented executable is hellompi_instr. Before you execute your program you must set the environment variable PAT_RT_TRACE_HOOKS to 1 and load the craypat module. If you want to collect Mflops and cache miss data you must also set PAT_RT_HWPC to 1 before you run your program.
- After your code runs you can display the instrumented output using the command
pat_report directory
where you replace 'directory' by the name of the directory in which CrayPat placed its output files. The name of this directory will be written to your stdout at the end of the execution of your program. Before you run pat_report you must have the craypat module loaded.
You primarily use the -d and -b options to pat_report to control the content and appearance of the output from pat_report. For example, the option pair
-d time%@5.0,mflops,counter2,counter3
-b function
will report on the exclusive time spent in each of your subprograms, the Mflop rate for each subprogram, and the total cache accesses and cache misses for each subprogram. The report will list all your traced subprograms in descending order by time spent in each subprogram as long as the subprogram consumes 5.0% or more of your program's total execution time.
The option pair
-d time%@5.0,mflops,counter2,counter3
-b function,pes
will produce a similar report, but it will display this output broken out for each of your processors. This can help you determine if you have a performance problem on a particular processor.
Once you load the craypat module you have access to man pages for craypat, pat, pat_build, pat_report and pat_hwpc. You can use the man pages for pat_build and pat_report to determine which options to pat_build and pat_report you should use to create the output report you want. You can use the man page for pat_hwpc to determine if you want to collect the data from a different set of hardware counters by changing the value to which you set PAT_RT_HWPC.
You do not want to have the craypat module loaded when you create the production version of your executable. You want the production version of your executable to be uninstrumented.
CrayPat and the Gnu compilers
You can also use CrayPat with the Gnu compilers. You must follow the steps outlined just above but with several changes. First, instead of the compiler option -Mprof=func you must use the compiler option -finstrument-functions. Second, you must ensure that your .o files are available when you run pat_build by keeping them after you compile and having them in your working directory when you run pat_build.
Mflops
You can measure the Mflops performance of your code using CrayPat. If your rate for a single processor is less than 1000 Mflops (1 Gflops) you should make an effort to optimize your code.
Cache
A technical description of the cache for bigben's processors is available online. Bigben's processors are Model 285 dual-core AMD Opteron 2.6 GHz processors. You can measure your code's cache miss rate using CrayPat. If your code's cache miss rate is greater than 10% you should make an effort to optimize your code's cache performance.
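The two rules of thumb from the "Mflops" and "Cache" subsections can be applied to raw counts as follows. This Python sketch is only an illustration: the function names are hypothetical, and it assumes you have already extracted a floating point operation count, a run time in seconds, and cache access and miss totals from your CrayPat report.

```python
MFLOPS_TARGET = 1000.0   # 1 Gflops per processor, from the "Mflops" subsection
MISS_RATE_TARGET = 0.10  # 10% cache miss rate, from the "Cache" subsection


def mflops(flop_count, seconds):
    """Millions of floating point operations per second."""
    return flop_count / seconds / 1.0e6


def cache_miss_rate(accesses, misses):
    """Fraction of cache accesses that missed."""
    return misses / accesses


def needs_tuning(flop_count, seconds, accesses, misses):
    """Return the list of thresholds from this guide that the run fails."""
    reasons = []
    if mflops(flop_count, seconds) < MFLOPS_TARGET:
        reasons.append("mflops")
    if cache_miss_rate(accesses, misses) > MISS_RATE_TARGET:
        reasons.append("cache")
    return reasons
```

An empty list means the run meets both thresholds; it does not mean that no further optimization is possible.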
Compilers and compiler options
The Portland Group compilers will, in general, generate more efficient code on bigben than the Gnu compilers. However, there are cases where the Gnu compilers generate faster code. Thus, you may want to try both types of compilers on your code to see which one performs better.
The following Portland Group compiler options may result in faster code.
-fastsse -O3 -Mnontemporal -Mprefetch=distance:8,nta
We have found the -fastsse option to be the most useful. You should not use this option when you are debugging your program, only when you are ready to run your program in production mode.
The Portland Group Web site has optimization tips for its compilers.
The following Gnu compiler options may result in faster code.
-O2
-O3
-funroll-loops
-march=opteron
-mtune=opteron
-fprefetch-loop-arrays
-finline-functions
-mfpmath=sse
You should try these options to see if any of them speed up your code. Some codes will perform better with the option -O2 instead of -O3.
Inlining when using the Gnu compilers can increase your code's performance, especially if you have small functions, medium-sized functions called inside loops or functions to which you pass constant or predictable arguments. However, the Gnu compilers cannot inline a function if it is in a different source file from the source file the compiler is currently compiling. To guarantee that your function gets inlined you should move the entire function into a header file and include that header file in the source file you are compiling. You must also add the keyword 'inline' before the declaration of the inlined function's return type. You must use at least optimization level -O1 for inlining to work.
If your source code calls a lot of string.h functions you can force them
to be inlined with the option
IO optimization
There are several steps you can take to improve your application IO performance on bigben. If your program reads or writes data in small chunks you can use the iobuf library. If your program reads or writes files that are 1 Gbyte or larger, you can use file striping so that your file IO will be done in parallel.
iobuf
The iobuf library adds a layer of buffering to your program. The result will be that your program will have fewer IO operations to disk and the IO operations will handle larger volumes of data. This should speed up your program.
You do not need to change your source code to use iobuf. Your program's IO calls are intercepted by the iobuf library. However, you will need to load the iobuf module before you link your code. To load the iobuf module issue the command
module load iobuf
Then you must link or relink your program. Once you have loaded the iobuf module you can access the iobuf man page.
The behavior of iobuf is controlled by a number of environment variables. After you set the appropriate variables you run your program.
The most important of the variables you can set to control iobuf's behavior is IOBUF_PARAMS. The value of IOBUF_PARAMS is a list of filenames or filename patterns and the iobuf parameters you want to set for each filename or filename pattern. See the iobuf man page for information on the format of the value of IOBUF_PARAMS and a description of the parameters you can set for each filename or filename pattern. The man page also discusses the other environment variables you can set to control the behavior of iobuf.
There are special steps you must follow when using iobuf and low-level C IO routines. See the iobuf man page for more information.
Which IOBUF_PARAMS parameters are best to use is very application dependent. We have found that a buffer count of 4 and a buffer size of 12 Mbytes can result in improved performance. We recommend using the noflush parameter for all files and setting prefetch to 1 for input files. We also recommend turning on default buffering for standard output using the %stdout filename pattern. Two other parameters that have been found useful are ignoreflush and shared. However, you will have to investigate whether any of these recommendations are suitable for your application. If you want assistance in using iobuf send email to remarks@psc.edu.
File striping
If your program reads or writes large files you should use $SCRATCH. Your $HOME space is limited. In addition, the $SCRATCH file space is implemented using the Lustre parallel file system. A program that uses $SCRATCH can perform parallel IO and this can significantly improve its performance.
A Lustre file system is created from an underlying set of file systems called Object Storage Targets (OSTs). Your program can read from and write to multiple OSTs concurrently. Your goal should be to use as many OSTs as possible concurrently. This is how you can use Lustre as a parallel file system. Bigben currently has 32 OSTs.
A striped file is one that is spread across multiple OSTs. Thus, striping a file is one way for you to be able to use multiple OSTs concurrently. However, striping is not suitable for all files. Whether it is appropriate for a file depends on the IO structure of your program.
For example, if each of your processors writes to its own file you should not stripe these files. If each file is placed on its own OST then as each processor writes to its own file you will achieve a concurrent use of the OSTs because of the IO structure of your program. File striping in this case could actually lead to an IO performance degradation because of the contention between the processors as they perform IO to the pieces of their files spread across the OSTs.
An application ideally suited to file striping would be one in which there is a large volume of IO but a single processor performs all the IO. In this situation you will need to use striping to be able to use multiple OSTs concurrently.
However, there are other disadvantages besides possible IO contention to striping and these must be considered when making your striping decisions. Many interactive file commands such as ls -l or unlink will take longer for striped files. Also, striped files are more at risk for data loss due to hardware failure. If a file is spread across several OSTs a hardware failure of any of them will result in the loss of part of the data in that file. You may choose to lose all of a small number of files rather than parts of all of a large number of your files.
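To make the idea of a file spread across OSTs more concrete, the following Python sketch maps byte offsets in a file to stripes. It is an illustration under stated assumptions, not a description of bigben's configuration: it assumes fixed-size chunks of 1 Mbyte placed round-robin across the file's stripes, the function names are hypothetical, and the indices it returns are positions within the file's own stripe list rather than physical OST numbers.

```python
MBYTE = 1024 * 1024  # assumed chunk size of 1 Mbyte


def stripe_index(offset, stripe_count, stripe_size=MBYTE):
    """Stripe (0 .. stripe_count-1) holding byte `offset`, assuming round-robin placement."""
    return (offset // stripe_size) % stripe_count


def stripes_touched(start, length, stripe_count, stripe_size=MBYTE):
    """Set of stripe indices touched by one contiguous IO request."""
    first = start // stripe_size
    last = (start + length - 1) // stripe_size
    if last - first + 1 >= stripe_count:
        # The request covers at least one full round, so every stripe is used.
        return set(range(stripe_count))
    return {chunk % stripe_count for chunk in range(first, last + 1)}
```

Under these assumptions a single 8 Mbyte write to a file striped over 32 OSTs touches 8 of them, while the same write to an unstriped file touches only one. This is why a single writer benefits from striping and why many writers sharing striped files can contend with each other.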
You use the lfs setstripe command to set the striping parameters for a file. You have to set the striping parameters for a file before you create it.
The format of the lfs setstripe command is
lfs setstripe filename stripe-size OST-start stripe-count
We recommend that you always set the stripe size parameter to 0 and the starting OST parameter to -1. This will result in the default stripe size of 1 Mbyte and assign your starting OST in a round-robin fashion. A value of -1 for the stripe count means the file should be spread across all the available OSTs. Since the Lustre file system on bigben currently has 32 OSTs you could also specify the stripe count parameter as 32 to have the file spread across all available OSTs.
For example, the command
lfs setstripe bigfile.out 0 -1 -1
sets the stripe count for bigfile.out to be all available OSTs. This command would be suitable for the situation where you have one processor which is writing all your data.
The command
lfs setstripe manyfiles.out 0 -1 1
has a stripe count of 1. Each file will be placed on its own OST. This is suitable for the situation where each processor writes its own file and you do not want to stripe these files.
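If you generate your job scripts from a program, the two commands above can be assembled as in the following Python sketch. It only builds the argument list in the positional form given in this section (filename, stripe size, starting OST, stripe count) and does not run anything. The function names are hypothetical, and the lfs syntax on other Lustre installations may differ.

```python
def stripe_count_for(single_writer):
    """-1 (all OSTs) when one processor does all the IO, 1 when each processor writes its own file."""
    return -1 if single_writer else 1


def setstripe_command(path, stripe_count, stripe_size=0, ost_start=-1):
    """Argument list for: lfs setstripe filename stripe-size OST-start stripe-count.

    The defaults follow the recommendations in this section: stripe size 0
    (the default size) and starting OST -1 (round-robin assignment).
    """
    return ["lfs", "setstripe", path,
            str(stripe_size), str(ost_start), str(stripe_count)]
```

The resulting list could be passed to subprocess.run on a front end processor, or joined with spaces and written into a job script before the file is created.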
You can also specify a directory instead of a filename in the lfs setstripe command. The result will be that each file created in that directory will have the indicated striping. You can override this striping by issuing an lfs setstripe command for individual files within that directory.
If you are using striping and the iobuf library we recommend setting the iobuf buffer size to 1 Mbyte. We also recommend setting the iobuf prefetch parameter to your stripe count.
The kind of striping that is best for your files is very application dependent. Your application will probably fall between the two extreme cases discussed above. You will therefore need to experiment with several approaches to see which is best for your application.
There is a man page for lfs on bigben. Online Cray documentation about file striping is also available. If you want assistance with what striping strategy to follow send email to remarks@psc.edu.
Using two processors per node
As was mentioned in the "System Architecture" section, the two processors on each of bigben's compute nodes share access to the node's memory and the node's network connection. The contention that can result from this arrangement can lead to performance degradation for your program.
To determine if your code has this problem you should run your program using one processor per node and then using two processors per node. If your code runs at least 30% slower in the latter case, node contention may be causing your code's performance degradation. How to run on one and on two processors per node is described in the "Batch access" subsection in the discussion of the -l size PBS specification.
To determine if memory contention is causing your performance slowdown you should use CrayPat to collect cache performance data for either a one or two processor per node run. How to use CrayPat to collect cache performance data is described in the "CrayPat" subsection. If your cache miss rate is 10% or greater then memory contention is probably the cause of your slowdown and you should invest effort to improve the data locality of your code in order to decrease your cache miss rate.
If your program has a contention slowdown but does not have a high cache miss rate then network contention is probably the cause of your slowdown. Network transfers within a node result in less contention than network transfers between nodes. Thus changing the layout of your processes across your nodes using the MPICH_RANK_REORDER_METHOD variable to reduce the number of between-node transfers may reduce your network contention. How to change the layout of your processes is described in the "Batch access" subsection in the discussion of the -l size PBS specification.
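The diagnostic procedure described in this subsection can be summarized as a small decision function. The following Python sketch uses the 30% and 10% thresholds given above; the function names and the returned strings are hypothetical, and the result is only a starting point for your investigation.

```python
def slowdown_percent(t_one_per_node, t_two_per_node):
    """How much slower the two-processors-per-node run is, in percent."""
    return 100.0 * (t_two_per_node - t_one_per_node) / t_one_per_node


def diagnose(t_one_per_node, t_two_per_node, cache_miss_rate):
    """Apply the thresholds from this subsection to two timings and a miss rate.

    `cache_miss_rate` is a fraction (0.10 means 10%) taken from CrayPat output.
    """
    if slowdown_percent(t_one_per_node, t_two_per_node) < 30.0:
        return "no significant node contention"
    if cache_miss_rate >= 0.10:
        return "memory contention likely"
    return "network contention likely"
```

The two timings should come from runs that use the same total number of processors and the same input, so that the only difference is the number of processors used per node.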
If you want assistance in dealing with contention slowdown issues send email to remarks@psc.edu. In order to assist you we will need to see your CrayPat output.
small_pages option
The small_pages option to the pbsyod command runs your program with smaller memory pages. This may improve the performance of your program or it may decrease its performance, depending on the characteristics of your program. You should try your program with and without small_pages and then choose the best approach for your program. For more information on small_pages see the man page for yod.
Gnu malloc routines
The Gnu suite of malloc routines may perform better than the default suite of malloc routines, especially if your code does a lot of small mallocs. To use these routines you must include the option -lgmalloc when you link. You must use this option with the Portland Group compilers and also with the Gnu compilers to get the Gnu routines.
Third-party software
Third-party routines can often perform better than routines you code yourself. You should investigate whether there is a third-party routine available to replace any of the routines you have written yourself.
For example, we recommend the FFTW library for FFTs. For linear algebra routines we recommend the ACML library.
Scalability
How well your code scales is an important determinant of how well it will perform on bigben. To measure your code's scalability you should time the execution of your program using the built-in time command. You should make several runs, increasing your job's number of processors from run to run. If you keep the total problem size fixed and your code scales well, your execution time should decrease nearly in inverse proportion to your job's number of processors. If you instead keep the work done per processor constant, your execution time should remain nearly constant as you add processors.
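Timings from a series of runs can be reduced to a single efficiency figure. The following Python sketch shows one common way to do this; the function names are hypothetical. It covers two kinds of experiment: runs in which the total amount of work is held fixed, and runs in which the work per processor is held fixed.

```python
def speedup(t_base, t_p):
    """Ratio of the baseline run time to the run time on more processors."""
    return t_base / t_p


def fixed_total_efficiency(p_base, t_base, p, t_p):
    """Efficiency when the total work is the same in both runs.

    With ideal scaling the time falls by the factor p_base / p, giving 1.0.
    """
    return (t_base * p_base) / (t_p * p)


def fixed_per_proc_efficiency(t_base, t_p):
    """Efficiency when the work per processor is the same in both runs.

    With ideal scaling the time does not change, giving 1.0.
    """
    return t_base / t_p
```

An efficiency close to 1.0 indicates good scaling. A value that drops steadily as you add processors suggests that communication or load imbalance is growing, which you can investigate with the MPI timing data from CrayPat.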
If your code does not scale well an examination of the amount of time spent in your MPI calls can help you determine the cause of this poor performance. You can get MPI timing data from CrayPat.
If your MPI_Barrier time increases as the number of processors increases you might have a load imbalance problem. You should redistribute your program's work across your processors.
If your MPI_Send, MPI_Recv or MPI_Reduce times increase as your number of processors increases, try to restructure your code to reduce the amount of communication between processors.
Optimization assistance
If you would like to optimize your code so that it can run on at least 1000 processors, you can get optimization assistance from PSC through the PSC's Scaling Advantage Program. This program includes consulting assistance from PSC, special queue handling if necessary, and service unit discounts, all of which are designed to enable you to scale your code to run on at least 1000 processors as quickly as possible. Send email to remarks@psc.edu if you would like to participate in this program. If you want optimization assistance but do not think your program will qualify for the Scaling Advantage Program you should still send email to remarks@psc.edu.
Software Packages
A list of software packages available on bigben is available online. If you would like us to install a package that is not in this list send email to remarks@psc.edu.
Cray Documentation
Cray XT3 documentation is available online at
http://docs.cray.com/cgi-bin/craydoc.cgi?mode=SiteMap;f=xt3_sitemap
Portland Group Web Site
The Portland Group Web site
http://www.pgroup.com
has information on the Portland Group compilers, including porting and optimization tips.
Bigben and the TeraGrid
Bigben is on the TeraGrid. Thus, you have additional methods of connecting to bigben, of transferring files to and from bigben and of running jobs on bigben. For information on using the TeraGrid see the general online documentation for the TeraGrid and the PSC-specific online TeraGrid documentation.
Acknowledgement in Publications
PSC requests that a copy of any publication (preprint or reprint) resulting from research done on bigben be sent to the PSC Allocation Coordinator. We also request that you include an acknowledgement of PSC in your publication.
Reporting a Problem
You have several options for reporting problems on bigben.
- You can call the User Services Hotline at 1-800-221-1641 from 9:00 a.m. until 8:00 p.m., Eastern time, on weekdays, and 9:00 a.m. until 4:00 p.m., Eastern time, on Saturdays.
- You can send email to remarks@psc.edu.