Guidelines on disk space and memory
From David Young,
You will have to run these jobs on the SGI Altix. It is a shared memory machine, where a single CPU can get to large amounts of memory. The Cray XD1 is a distributed memory machine, where no CPU can get to more than 8 GB of memory.
Your home directory has a disk quota. When you first get an account, that quota is set to 1 GB of disk space. I can increase it up to 20 GB at no charge if you ask for more. To have more than 20 GB in your home directory, your research group can purchase additional space (but there is another option).
The 1 GB limit on individual files applies only to interactive work on the login node. To create a file larger than 1 GB, connect to a compute node and create it there.
There is a shared /scratch disk system. You can make a directory on /scratch and use much more disk space there. The Altix /scratch area is currently 2.8 terabytes (2800 GB), and we have funding to add another 4 terabytes in the next few months. However, there is an important caveat about using /scratch: it is only for work that is in progress right now. Once the calculation completes, you must transfer your output files off of /scratch and back to a machine on campus.
Any files left on /scratch are automatically deleted one week after the calculation completes.
The Altix has various queues that allow jobs to use different amounts of memory.
The large-serial queue allows single CPU jobs to use 35 GB of memory. This limit was chosen because most of the Altix nodes can run jobs of this size. The Altix has just one node capable of running jobs up to 140 GB of memory, which can be accessed through the special queue.
The special queue is there so that people can occasionally get access to larger resources. Access to this queue is by request, because we can run a few larger jobs, but we don't have the resources to let everyone run such large jobs all the time.
There are several things you need to know about using the special queue.
1. Once we give you permission to use it, you will have access to the special queue for the next six months.
2. The special queue allows jobs to run for a very long time. However, when we do maintenance shutdowns we don't turn off the queue far enough in advance for jobs to finish before the shutdown. Thus it is your responsibility to watch for shutdown announcements and know if your job has time to complete, or turn on any checkpointing capability in your application. The Linux operating system does not have operating system level checkpointing.
In order to get access to the special queue, send an email to David Young (firstname.lastname@example.org) requesting special queue access.
Please include a short description of the type of work, and your best estimate of how much memory, CPU time, and disk space your jobs will require.
Please note that the supercomputers are running a 64-bit operating system.
As such, some applications will require as much as twice the amount of memory that they would require on a 32-bit desktop computer. If in doubt, you can do test calculations in the other queues, then find out the memory usage by using the "tracejob" command to get the queue log entry after your calculation has completed.
More guidelines on disk space and memory
From David Young,
Your home directory has a limit of 1 GB by default. I can increase this up to 20 GB if you just drop me an email requesting it.
If you need more disk space, you can create a directory on the /scratch drive and do your work there. This is a 2.8 TB disk area (we have a 4 TB expansion on order), and you can see how much is presently available with the "df" command. Note that /scratch is for jobs presently running in the queues. I often write my script to copy from my home directory to /scratch, run the job, then copy results back. At any rate, you must get the data off of /scratch when the job finishes. Any files left on /scratch are automatically deleted two weeks after the job completes.
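The copy-in, run, copy-out pattern described above can be sketched in Python. This is only an illustration, not the script used on the Altix: the function name run_on_scratch, its arguments, and the "job_" directory prefix are hypothetical, and it assumes Python 3.8 or later (for the dirs_exist_ok option).

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_on_scratch(input_dir, results_dir, command, scratch_root="/scratch"):
    """Stage inputs to scratch, run the job there, then copy everything back.

    All names here are hypothetical; adapt them to your own job.
    """
    input_dir = Path(input_dir)
    results_dir = Path(results_dir)

    # Create a private, uniquely named work directory under the scratch root.
    work_dir = Path(tempfile.mkdtemp(prefix="job_", dir=str(scratch_root)))

    # Stage in: copy the input files from the home directory to scratch.
    shutil.copytree(input_dir, work_dir, dirs_exist_ok=True)

    # Run the calculation with the scratch directory as the working directory.
    completed = subprocess.run(command, cwd=work_dir)

    # Stage out: copy the whole work directory (inputs and outputs) back.
    shutil.copytree(work_dir, results_dir, dirs_exist_ok=True)

    # Remove the scratch copy only after the copy back has succeeded, so a
    # failed copy never costs you the results.
    shutil.rmtree(work_dir)
    return completed.returncode
```

A call might look like run_on_scratch(Path.home() / "myjob", Path.home() / "myjob_results", ["my_program", "input.dat"]), where the directory and program names are placeholders. If the copy back raises an error, the work directory is left on /scratch so you can retrieve the files by hand before the automatic cleanup.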
The small and medium queues have a limit on disk usage. The large queues do not have a disk limit (as long as you don't put one in your script).
The SGI Altix is set up to have enough memory that you don't need virtual memory. The queue system guarantees that you get the amount of memory that you request. Seymour Cray once said, "Why have virtual memory, when you can have the real thing."
I know that large memory jobs have been slow getting through the queues. We have an additional 288 GB of memory on order (shipped from the factory yesterday), which will allow the machine to have several of these 100 GB jobs running at once.
It is also a valuable skill for a computational researcher to be able to predict in advance how much memory (or disk, or CPU time) a given size job will require to run. You can get a paper and pen estimate from a complexity calculation (time complexity, memory complexity, etc.). This is taught in the computer science curriculum in courses on data structures and algorithms. You will also find an example calculation like this on the Altix in the file /opt/asn/doc/gaussian/Estimating_CPU_time.pdf
That example is written for a chemistry code, but the same principle and equations can be applied to any computational program.
If you don't know the memory complexity of your algorithm, the second part of the document shows you how to compute it.
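As a rough illustration of this kind of estimate (not the procedure from the PDF, which is not reproduced here), the sketch below assumes memory grows as a simple power law, memory = c * N**p, where N is the problem size. The function names and the test-run numbers are made up for the example. Real programs also have fixed overhead that a pure power law ignores, so treat the result as a ballpark figure.

```python
import math


def fit_exponent(n1, mem1, n2, mem2):
    """Estimate the exponent p in memory ~ c * N**p from two small test runs."""
    return math.log(mem2 / mem1) / math.log(n2 / n1)


def predict_memory(n_ref, mem_ref, n_target, exponent):
    """Scale a measured memory figure up to a larger problem size."""
    return mem_ref * (n_target / n_ref) ** exponent


# Hypothetical test runs: N=1000 used 0.5 GB and N=2000 used 2.0 GB.
p = fit_exponent(1000, 0.5, 2000, 2.0)

# Extrapolate from the smaller run to a production job of N=8000.
estimate_gb = predict_memory(1000, 0.5, 8000, p)
print(f"exponent ~ {p:.2f}, estimated memory ~ {estimate_gb:.1f} GB")
```

With these made-up numbers, memory quadruples when N doubles, so the fitted exponent is about 2 and the N=8000 job comes out near 32 GB. If you already know the memory complexity of your algorithm, skip the fit and pass the known exponent to predict_memory directly.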