Computing Kitchen

Everyday computational science


Reading Grid Engine QAcct

08 Sep 2020 - adolgert

How to read performance characteristics from grid engine output.


Does anybody use Grid Engine any more? I work with people using Univa Grid Engine. Some folks work with Slurm. This post is a tutorial on how to read the accounting data for a job. It uses a particular job that failed as an example.

The qacct output won’t tell you what’s wrong with the operating system, so it couldn’t have told us how to fix this problem, but it can point to whether the problem lies in the application. That’s a useful skill to have.

UGE qacct gives performance information about three subsystems: CPU, memory, and input/output (I/O). We read it in two passes: an initial health check, then a deeper dive.

The problem run

The use case here is a job that sometimes succeeds and sometimes fails. The qacct output below shows a failed run, edited to remove local information.

qname        all.q               
hostname     hostname
group        Domain Users        
owner        borlaug             
project      proj_wheat            
department   defaultdepartment   
jobname      MEPS_RX_s2_d440     
jobnumber    55123964            
taskid       6                   
pe_taskid    NONE                
account      sge                 
priority     0                   
cwd          /projects/borlaug/pipeline/scripts
submit_host  hostname
submit_cmd   qsub -cwd -q all.q -l m_mem_free=200G -l fthread=2 -l h_rt=24:00:00 -N MEPS_RX_s2_d440 -P proj_wheat -t 1-6:1 /projects/borlaug/pipeline/ -x 2 -d 440 -s MEPS_RX
qsub_time    08/31/2020 12:37:59.279
start_time   08/31/2020 12:37:55.967
end_time     08/31/2020 13:04:24.859
granted_pe   NONE                
slots        1                   
failed       0    
deleted_by   NONE
exit_status  137                 
ru_wallclock 1588.892     
ru_utime     68.535       
ru_stime     1259.848     
ru_maxrss    19217024            
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    25683421            
ru_majflt    185                 
ru_nswap     0                   
ru_inblock   839408              
ru_oublock   40                  
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     25505               
ru_nivcsw    29955               
wallclock    1589.132     
cpu          1328.383     
mem          14404.325         
io           0.021             
iow          1.220             
ioops        5362                
maxvmem      24.666G
maxrss       0.000
maxpss       0.000
arid         undefined
jc_name      NONE
bound_cores  NONE

Let’s read this!
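Before reading field by field, it helps to get the output into a form you can compute on. Here is a minimal sketch of parsing qacct’s key-value layout into a Python dict; `parse_qacct` is a name I made up, and it assumes the one-field-per-line format shown above.

```python
def parse_qacct(text):
    """Parse qacct's key-value layout into a dict of strings.

    Each line holds a field name, whitespace, then the value; values
    like submit_cmd may themselves contain spaces, so split only once.
    """
    record = {}
    for line in text.strip().splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


sample = """\
exit_status  137
ru_wallclock 1588.892
ru_utime     68.535
ru_stime     1259.848
"""
job = parse_qacct(sample)
```

From here you can convert individual fields to numbers as you inspect them, e.g. `float(job["ru_stime"])`.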

Initial health check


For the initial health check on the CPU, we find the percentage of time the process spends doing calculations. This percentage comes from three metrics: user time (ru_utime), system time (ru_stime), and wallclock time (ru_wallclock).

For a single-threaded process, user time plus system time should nearly equal the wallclock time. The remainder is time the application spent in line behind other applications, waiting to be scheduled on a CPU. The way our scheduler works, this remainder should be small; if it’s more than ten percent of wallclock, something else on the node is hogging it.

The computational intensity of a process is user time divided by wallclock time. For a single process, this typically runs 15% to 40%, distributed like a bell curve. If it’s under ten percent, the application isn’t doing well.

The qacct above has (system time = 1260, user time = 69, wallclock = 1589). That system time is mostly disk access. For a process that doesn’t do much math (say it reads a file, divides by 10, and writes it out), system time can be as large as user time. This system time, as a fraction of wallclock, is off the charts.
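The arithmetic behind that verdict is short. This sketch plugs in the numbers from the qacct above; the ten-percent thresholds are the rules of thumb from the text, not universal constants.

```python
# Numbers from the failed run's qacct.
utime, stime, wallclock = 68.535, 1259.848, 1588.892

# Computational intensity: fraction of wallclock spent doing real math.
# Healthy single-process jobs here run roughly 15%-40%.
intensity = utime / wallclock          # about 4.3%: under ten percent

# Fraction of wallclock spent in the kernel, mostly disk access here.
system_fraction = stime / wallclock    # about 79%: off the charts
```

An intensity below ten percent, combined with system time dominating the run, is what points the finger at I/O rather than computation.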


For the initial health check on the memory, we look at the resident set size, which is ru_maxrss. That’s in bytes, for this setup. Because of the ways the operating system attempts to minimize memory use, counting the memory used by a process can be difficult. Much of its memory is shared with other threads or processes. The resident set size and proportional set size (PSS) are attempts to give a good estimate.

For this qacct, the ru_maxrss is within the requested memory limit, so memory didn’t kill the process. We do see that the memory allocated is significant, though, so the program must have run and done some work. When you ask how much memory a program should use, you generally start with: how big are the data sets, and how many times do I copy them and modify those copies? The large memory number here means the data probably got read in.
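Checking a memory value against the requested limit means parsing Grid Engine’s suffixed memory strings, such as the `m_mem_free=200G` on the qsub line and the `maxvmem` of 24.666G. A small sketch, assuming Grid Engine’s binary interpretation of the suffixes (G = 2^30); `parse_mem` is a name I made up.

```python
def parse_mem(s):
    """Parse a Grid Engine memory string like '200G' or '24.666G' into bytes."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    s = s.strip()
    if s[-1].upper() in units:
        return int(float(s[:-1]) * units[s[-1].upper()])
    return int(float(s))


requested = parse_mem("200G")        # from -l m_mem_free=200G on the qsub line
peak_virtual = parse_mem("24.666G")  # maxvmem from the qacct above
```

Here `peak_virtual` is well under `requested`, which is the quick way to rule out the memory limit as the killer.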


Input and output can be to any device, be that a file, the network, or the screen. On this cluster, it’s almost always files. We look to ru_inblock and ru_oublock for those numbers, which are also summarized in the “io” number near the bottom of qacct. A block is 512 bytes, so you take ru_inblock = 839408 blocks, multiply by 512 bytes per block to get bytes, then divide by 1024^3 to get GB. Here, it’s 0.4 GB. Some input was read, but not a lot. It looks like more than the amount needed to read the executable and libraries, but not much over that. I’d have to run an experiment to be sure; the experiment would load just the libraries and quit.

Meanwhile, output is 40 blocks, which is 20 KB. That’s enough for log files.
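The block arithmetic above, written out:

```python
BLOCK = 512  # qacct counts I/O in 512-byte blocks

in_bytes = 839408 * BLOCK   # ru_inblock
out_bytes = 40 * BLOCK      # ru_oublock

in_gb = in_bytes / 1024**3  # about 0.4 GB read
out_kb = out_bytes / 1024   # 20 KB written: log-file sized
```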

Summary of initial health check

The application looks like it started and like it started reading input files. It may have allocated memory to store inputs. The time in system could mean it’s waiting on complicated multithreading stuff or networking stuff, but that isn’t in this program, so our default guess is that it was waiting on input files.

Deeper Dive


There is a weak CPU metric in the qacct that happens to be helpful here: context switches. A computer runs all the programs we ask of it by switching among them, and each time your application is switched out, that’s a context switch. A voluntary switch (ru_nvcsw) happens when the process gives up the CPU on its own, usually because it’s waiting on something like I/O; an involuntary switch (ru_nivcsw) happens when the scheduler takes the CPU away.

The involuntary context switch count is therefore an ambiguous read: a high count can mean your code is sailing happily along, or it can mean it’s locked in reads. As a rule of thumb, involuntary context switches are usually far fewer than voluntary ones. You can get a sense for your own code by comparing a good run with a bad run; context-switching rates vary widely across types of applications.

In this particular case, involuntary switches for the failing run are way, way higher. That tells me that it’s locked in reads or writes.
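The rule of thumb reduces to one ratio. Plugging in the numbers from this qacct:

```python
# Voluntary and involuntary context switches from the failed run.
nvcsw, nivcsw = 25505, 29955

# Rule of thumb from the text: involuntary switches should be far
# fewer than voluntary ones. A ratio near or above 1 is suspicious.
ratio = nivcsw / nvcsw
```

Here the ratio exceeds 1, which is what makes this run stand out against a healthy one.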

If further digging might help, you could try: comparing this qacct against one from a run that succeeded, watching the live process with htop or strace, or checking the file server’s dashboard while the job runs. Those are good starts.


First, don’t worry that memory for maxrss and maxpss isn’t always reported. Different operating systems store and report these in different ways, and system upgrades can change them. Sometimes they aren’t available through certain queries until the application has finished. Don’t worry that maxrss and maxpss are zero.

Meanwhile, maxvmem can be interesting. This is not the total amount of memory your process used; it’s the amount your process threatened to use. If you have a dataframe and make a copy, the operating system creates a region of virtual memory for that copy, but it’s only an IOU for real memory. The IOU gets called in only if your code modifies values in that memory. This is called copy-on-write. It’s great for working with dataframes because it greatly improves performance, but that kind of table manipulation can lead to very large virtual memory allocations, sometimes twenty times the real memory the application uses.

This qacct shows virtual memory is at least fifty times the real memory usage. That’s not a problem. It shows some early allocation happened in the program.
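You can watch the IOU mechanism directly. This is a Linux-leaning sketch, not part of the original analysis: it reserves anonymous virtual memory with mmap (virtual size grows, resident size doesn’t), then touches every page so the kernel must supply real memory. It assumes `resource.getrusage` reports a max-RSS high-water mark, as it does on Linux.

```python
import mmap
import resource


def maxrss():
    # High-water mark of resident set size (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


SIZE = 256 * 1024 * 1024  # reserve 256 MB

before = maxrss()
buf = mmap.mmap(-1, SIZE)        # an IOU: virtual memory, no real pages yet
reserved = maxrss()

# Touch one byte per page: the IOU is called in, page by page.
for offset in range(0, SIZE, 4096):
    buf[offset] = 1
touched = maxrss()
```

After the mmap call, `reserved` barely moves from `before`; after the touch loop, `touched` jumps by roughly the full 256 MB. That gap between virtual and resident memory is exactly why maxvmem can dwarf maxrss without indicating a problem.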

Digging even deeper, we look at what qacct calls “faults,” but these aren’t problems. They are computer lingo for the number of times the program touched memory that wasn’t yet mapped in. Minor faults (ru_minflt) are satisfied from memory the kernel already has on hand; major faults (ru_majflt) require reading from disk, which is slow.

For this qacct, 185 is a lot of major faults. I’m more sure of that because we have an example of a happy version of the program that has five or so. That’s an indicator the program sat around doing so little that the operating system decided its pages weren’t worth keeping in memory.


There isn’t more detail about I/O in qacct. It would be great to see how much goes to the shared drives, right? That’s hard to get from the operating system, though. It has been more helpful to look at the Qumulo dashboard for activity while the program is running; you can narrow in on a particular directory and get a good sense of the activity.

Deep dive summary

The deep dive supports the initial health check: the page faults are high and the involuntary context switches are high. We don’t know why the program isn’t making headway on I/O, but it isn’t.

There were several places where I couldn’t give you a rule of thumb for how big a number should be; it depends on the application. If you want to learn what a good number looks like, occasionally look at qacct for a job that ran well, and see whether the numbers make sense to you, given what you know the program does.
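That habit of comparing against a known-good run can be mechanized. A sketch: put a good run’s numbers next to a suspect run’s and flag the fields that moved the most. The failed-run values come from the qacct above and the majflt count of five from the text; the other “good” values here are made up for illustration.

```python
# Known-good run (ru_majflt from the text; the rest are illustrative).
good = {"ru_utime": 900.0, "ru_stime": 45.0, "ru_majflt": 5.0,
        "ru_nivcsw": 800.0, "ru_nvcsw": 24000.0}

# Failed run, from the qacct above.
bad = {"ru_utime": 68.5, "ru_stime": 1259.8, "ru_majflt": 185.0,
       "ru_nivcsw": 29955.0, "ru_nvcsw": 25505.0}

# Flag any field that grew by more than an order of magnitude.
suspicious = {k: bad[k] / good[k] for k in good if bad[k] > 10 * good[k]}
```

With these numbers, system time, major faults, and involuntary context switches all get flagged, which matches the story the deep dive told.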


I hope that was fun. I shouldn’t skip the reveal of what actually happened to the program to make it fail intermittently. Our capable infrastructure team figured out quickly that nodes were deteriorating because they ran out of memory associated with cgroups. It looks like they used a technique similar to the one I described here, but applied to the whole system: look at CPU, memory and I/O. Start at the top level and drill down. Fixing that fixed the problem with this application, and that likely fixed many more applications that had been failing on the cluster but weren’t reported, so you did a good deed with detailed reporting.

The nodes were running out of a dedicated part of memory that’s used to enforce the allocation of space to each application. I’m not sure how our application experienced that problem. It could have seen one of two things: the kernel may have killed it outright when the node ran short (the exit status of 137 means the process died from a SIGKILL), or its reads and writes may have stalled while the node struggled.

I’m open to suggestions on which of these is more likely. Running the code with htop open and strace attached would have answered the question.