Using A CPU Farm
A front end
This will be the machine that you log onto.
Disk
There should be “a lot” of disk space available that you can
access from the front end and the nodes.
Many farm nodes.
These are the CPUs that do the work.
They will often also have their own local disk.
A network
The nodes will be connected to the front end by a network.
The network capacity can be the limiting factor.
The front end
When using the farm you will spend most of your
time on the front end.
Typically this will have the same OS as the nodes.
You can compile code here.
You submit jobs from the front end to the nodes.
You manage the disk on the front end.
You might take a quick look at the output here.
Remember the front end will have many other users,
so try and be as undisruptive as possible.
The nodes
The nodes are where your CPU time occurs.
Usually they will have local disk.
Using this will cut down on network traffic.
Improves farm performance.
Be careful about how much space is available.
On some farms the same box may be several
nodes.
Dual CPU machines
Hyperthreading.
They will have high memory, but watch out for
programs with very high memory usage, they may
not play well together.
Jobs on a farm
qsub my_job.scr
Listing the jobs submitted to a farm.
To list the jobs use the command qstat
This will tell you
The job name
The job ID
Its status
The running time
The owner.
Use qstat -u username to see the jobs belonging to
a particular user.
There are other useful switches
See the man page.
Job Status
Running
A job that is currently running on a node.
You will be able to see how long it has been running with qstat.
Queued
A job that is waiting for a free node.
Terminated
A job that is finished. You won’t see these in qstat
Suspended/Error
Something has happened to the job and it’s in an error state.
This is probably your fault.
But it could be a system error, so it’s worth restarting these
once.
Deleting a job
qdel jobid
source ~/env_script.csh
cd $TMPDIR                           # Use the local disk.
cp ~/input_data/my_data .            # Copy needed data to local disk.
$SNO_CODE/snoman.exe -c mycmd.cmd    # Run my analysis code.
cp result.ntp ~/output_data/         # Copy my results back to the data disk.
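The same copy in, run, copy out pattern can be sketched with a check after each step. This is only an illustration: it is written in Bourne shell rather than the csh the slides source, and the function name run_job, its argument order and its exit codes are all assumptions, not part of any batch system.

```shell
#!/bin/sh
# run_job INPUT RESULT OUTDIR COMMAND [ARGS...]   (hypothetical helper)
# Copies INPUT to a scratch directory on local disk, runs COMMAND there,
# then copies the file RESULT back to OUTDIR. INPUT and OUTDIR should be
# absolute paths. The body is a subshell, so the caller's directory is
# left unchanged.
run_job() (
    input=$1 result=$2 outdir=$3
    shift 3

    # Use the node's local disk; fall back to /tmp if TMPDIR is not set.
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || exit 1
    cd "$workdir" || exit 1

    status=0
    if ! cp "$input" .; then
        status=2                      # could not fetch the input
    elif ! "$@"; then
        status=3                      # the analysis itself failed
    elif ! cp "$result" "$outdir"/; then
        status=4                      # could not save the result
    fi

    # Leave the local disk clean for the next job.
    cd / && rm -rf "$workdir"
    exit "$status"
)
```

A call might look like run_job ~/input_data/my_data result.ntp ~/output_data $SNO_CODE/snoman.exe -c mycmd.cmd. A non-zero exit code tells you which step went wrong.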
Job Master Script
You can put much of what we discussed in the
last two lectures into action.
Writing multiple command files and shell scripts.
Running system programs and analysing their
output.
Examining the output of your analysis programs.
You can put limits on the number of jobs in
two ways.
Using the sleep command when too many jobs
are submitted.
Using a cronjob
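The sleep approach can be sketched as a small master script. This is a sketch under assumptions: it assumes qstat -u prints one line per job containing your user name (output differs between batch systems), and the names MAX_JOBS, POLL_SECONDS, count_my_jobs and submit_throttled are made up for the example.

```shell
#!/bin/sh
# Hypothetical limits; tune them for your farm.
MAX_JOBS=${MAX_JOBS:-20}          # most jobs we allow ourselves at once
POLL_SECONDS=${POLL_SECONDS:-60}  # how long to sleep before looking again

# Count our jobs: lines of qstat output that mention our user name.
# Assumption: one such line per job. Check this against your own qstat.
count_my_jobs() {
    me=${USER:-$(id -un)}
    qstat -u "$me" | grep -c -- "$me"
}

# Submit every script given as an argument, sleeping while we are at
# the limit so the farm is never flooded.
submit_throttled() {
    for script in "$@"; do
        while [ "$(count_my_jobs)" -ge "$MAX_JOBS" ]; do
            sleep "$POLL_SECONDS"
        done
        qsub "$script"
    done
}
```

A call such as submit_throttled job_*.scr would then feed the jobs in gradually. Polling once a minute keeps the load on the front end low.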
Cronjobs
A cronjob is a job that runs at a scheduled
time.
Your cronjobs are controlled by your crontab.
Not allowed on all systems (including RAL).
To edit your crontab use
crontab -e
Your $EDITOR variable decides which editor is used.
You need to exit the editor for the change to take
effect.
A typical crontab.
# My program
0 * * * * my_program > /dev/null 2>> /dev/null
#End of crontab.
Time to run job: the five fields before the command (0 * * * *).
Redirect output. Otherwise you’ll get an email.
Comment to end crontab. Need a newline at the end of each command.
Time is specified by five variables.
m h d m w (minute, hour, day of month, month, day of week)
* is a wild card that means any value.
When the system time equals this time the job will
run.
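The matching rule can be illustrated with a few lines of shell. This is a simplified sketch, not how cron is implemented: it only handles plain numbers and *, while real cron also accepts lists, ranges and steps and treats the two day fields specially. The function names field_matches and cron_matches are made up for the example.

```shell
#!/bin/sh
# field_matches SPEC VALUE: succeed if SPEC is "*" or the same number as VALUE.
field_matches() {
    [ "$1" = "*" ] || [ "$1" -eq "$2" ]
}

# cron_matches "M H DOM MON DOW" MINUTE HOUR DAY MONTH WEEKDAY
# Succeeds when every one of the five fields matches the given time.
# The body is a subshell so that "set -f" does not leak to the caller.
cron_matches() (
    set -f                                  # keep "*" from expanding to file names
    set -- $1 "$2" "$3" "$4" "$5" "$6"      # split the time spec into five words
    [ "$#" -eq 10 ] || exit 2               # spec did not have exactly five fields
    field_matches "$1" "$6" && field_matches "$2" "$7" &&
    field_matches "$3" "$8" && field_matches "$4" "$9" &&
    field_matches "$5" "${10}"
)
```

With the entry from the typical crontab, cron_matches "0 * * * *" 0 13 5 7 2 succeeds because the minute is 0 and every other field is a wild card, so the job runs at the start of every hour.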
The GRID