Everyday computational science
08 Sep 2020 - adolgert
How to read performance characteristics from grid engine output.
Does anybody use Grid Engine any more? I work with people using Univa Grid Engine. Some folks work with Slurm. This post is a tutorial on how to read the accounting data for a job. It uses a particular job that failed as an example.
The qacct won’t tell you what’s wrong with the operating system, so it wouldn’t have told you how to fix the problem, but it can offer some pointers to whether the problem is in the application. That’s a skill that’s available and helpful.
UGE Qacct gives three performance information about three subsystems: CPU, memory, and Input/Output (I/O). We read these in two ways, an initial health check, and then a deeper dive.
The use case here is a run of a job, where sometimes it succeeds and sometimes it fails. The Qacct output below shows the failed run. It’s edited to remove local information.
============================================================== qname all.q hostname hostname group Domain Users owner borlaug project proj_wheat department defaultdepartment jobname MEPS_RX_s2_d440 jobnumber 55123964 taskid 6 pe_taskid NONE account sge priority 0 cwd /projects/borlaug/pipeline/scripts submit_host hostname submit_cmd qsub -cwd -q all.q -l m_mem_free=200G -l fthread=2 -l h_rt=24:00:00 -N MEPS_RX_s2_d440 -P proj_wheat -t 1-6:1 /projects/borlaug/pipeline/run_shell.sh worker.py -x 2 -d 440 -s MEPS_RX qsub_time 08/31/2020 12:37:59.279 start_time 08/31/2020 12:37:55.967 end_time 08/31/2020 13:04:24.859 granted_pe NONE slots 1 failed 0 deleted_by NONE exit_status 137 ru_wallclock 1588.892 ru_utime 68.535 ru_stime 1259.848 ru_maxrss 19217024 ru_ixrss 0 ru_ismrss 0 ru_idrss 0 ru_isrss 0 ru_minflt 25683421 ru_majflt 185 ru_nswap 0 ru_inblock 839408 ru_oublock 40 ru_msgsnd 0 ru_msgrcv 0 ru_nsignals 0 ru_nvcsw 25505 ru_nivcsw 29955 wallclock 1589.132 cpu 1328.383 mem 14404.325 io 0.021 iow 1.220 ioops 5362 maxvmem 24.666G maxrss 0.000 maxpss 0.000 arid undefined jc_name NONE bound_cores NONE
Let’s read this!
For the initial health check on the CPU, we find the percentage of time the process is doing calculations. This percentage is in three metrics:
ru_wallclock - The total seconds the process has run, or real-world time.
ru_utime - This user time (utime) is the number of seconds that a CPU was running the code you wrote in your script. This counts the time of each thread, so it can be as large as the wallclock time multiplied by the number of threads.
ru_stime - This is system time, and it’s the number of seconds the operating system was working on behalf of the application. That means the application, somewhere under one of the functions you called in Python, Stata, or R, called a function that called the operating system to read a file, write to an internet connection, wait on a signal, or other system service.
For a single-threaded process, user time, added to system time, should almost equal the wallclock time. That last time duration is time the application was in line behind some other application, waiting to get scheduled by the CPU. The way our scheduler works, this time should be small. If it’s not under ten percent, something else in the system is hogging it.
The computational intensity of a process is user time divided by wallclock time. For a single process, this can be 15% to 40%, with a bell curve. If it’s under ten percent, the application isn’t doing well.
The qacct that you sent has (system time=1260, user time=69, wallclock=1600). That system time is mostly disk access. For a process that doesn’t do much math, (so it reads a file, divides by 10, and writes it), the system time can be as large as the user time. This system time is off the charts, as a fraction of wallclock.
For the initial health check on the memory, we look at the resident set size, which is ru_maxrss. That’s in bytes, for this setup. Because of the ways the operating system attempts to minimize memory use, counting the memory used by a process can be difficult. Much of its memory is shared with other threads or processes. The resident set size and proportional set size (PSS) are attempts to give a good estimate.
For this qacct, the ru_rss is within the requested memory limits, so that didn’t kill the process. We do see that the memory allocated is significant, though, so the program must have run and done some work. When you want to ask, how much memory should this program use, you generally start with, how big are the data sets, and how many times do I copy them and make changes to those copies? So the large memory number means data probably got read in.
Input and output can be to any device, be that a file, the network, or the screen. On this cluster, it’s almost always about files. We look to ru_inblock and ru_outblock for those numbers, but they are also summarized in the summary “io” number at the bottom of qacct. A block is 512 bytes, so you take the number ru_inblock = 839408 blocks, multiply by 512 bytes / block. That gives you bytes, so divide by 1024^3 to get GB. Here, it’s 0.4 GB. Some input was read, but not a lot. It looks like more than the amount needed to read the executable and libraries, but it’s not a lot over that. I’d have to run an experiment to be sure. The experiment would load just libraries and quit.
Meanwhile, output is 40 blocks, which is 20kb. That’s enough for log files.
The application looks like it started and like it started reading input files. It may have allocated memory to store inputs. The time in system could mean it’s waiting on complicated multithreading stuff or networking stuff, but that isn’t in this program, so our default guess is that it was waiting on input files.
There is a weak CPU metric in the qacct that happens to be helpful here. A computer runs all the programs we ask it to run by switching from one to the next. Each CPU cycles through which programs are active, and each time your application switches out, it’s called a context switch.
ru_nvcsw—The number of voluntary context switches. These are times that the program invites the operating system to switch to another thread, if it wants. You don’t think your program does this, but it’s implied in most calls to the operating system that, if you’re calling the operating system, you’re OK with a context switch.
ru_nivcsw—Involuntary context switches happen when the operating system pauses your program’s thread of execution in order to let some other thread have a chance. This can happen when your program is in a tight mathematical loop, but it can also happen when your program is in one of the operating system calls that are marked as non-interruptible. Reading and writing from drives is non-interruptible.
We therefore have a weird read on the involuntary context switch. It either means your code is sailing happily or that it’s locked in reads. For a rule of thumb, the involuntary context switches are usually far less than the voluntary ones. You can get a sense, for your code, by comparing a good run with a bad run. Context switching rates can be all over the board for different types of applications.
In this particular case, involuntary switches for the failing run are way, way higher. That tells me that it’s locked in reads or writes.
If further digging might help, you would try:
Run htop when the process starts, and watch it in htop. Run strace through htop by hitting the ‘s’ key to show you the system calls it makes.
Run the Python profiler or set up a https://poormansprofiler.org.
Those are good starts.
First, don’t worry that memory for maxrss and maxpss isn’t always reported. Different operating systems store and report these different ways, and system upgrades can change those. Sometimes they aren’t available through certain retrievals until the application is over. Don’t worry that maxrss and maxpss are zero.
Meanwhile, maxvmem can be interesting. This is not the total amount of memory your process used. It’s the total amount of memory your process threatened to use. If you have a dataframe, and you make a copy, the operating system creates a section of virtual memory for that copy, but it’s only an IOU for real memory. That IOU only gets called in if your code modifies values in that memory. It’s called copy-on-write. That feature is great for working with dataframes because it greatly increases performance. That kind of table manipulation can lead to very large virtual memory allocations, though. They can be twenty times the real memory the application uses.
This qacct shows virtual memory is at least fifty times the real memory usage. That’s not a problem. It shows some early allocation happened in the program.
Digging even deeper, we look at what Qacct calls “faults,” but these aren’t problems. They are computer lingo for a count of how many times the program requested memory.
ru_minflt—There is another min fault every time the CPU asks for the next value in cache. All data is read into cache, and the CPU brings bits of it close by, into what are called registers. Reads to cache are happening all the time, and this number will usually be very large, which it is here. Fine.
ru_majflt—A major fault is more important. This means that the program used some data in memory, didn’t use it for a little while, and then the operating system took it away for a bit. The major fault is when the application has to ask the operating system for its data back. These happen, and they can add up for programs that live in the background of an operating system, but you won’t see a lot of them in a happy application. Some, but not a lot.
For this qacct, 185 is a lot of faults. I’m more sure of that because we have an example of a happy version of the program that has five or so. That’s an indicator the program was hanging around doing nothing the operating system considered useful enough to keep in memory.
I don’t see more about I/O. It would be great to see how much is to the shared drives, right? That’s hard to get from the operating system, though. It has been more helpful to look at the Qumulo dashboard for activity while the program is running. You can narrow in on a particular directory and get a good sense of activity.
The deep dive supports the high-level because the page faults are high and involuntary context switches are high. We don’t know why the program isn’t making headway on I/O, but it isn’t.
There were a bunch of times I couldn’t give you a rule of thumb on how big a number should be. It depends on the application. If you want to learn what’s a good number, occasionally take a look at qacct for a job that ran well. See if the numbers make sense to you, given what you know the program does.
I hope that was fun. I shouldn’t skip the reveal of what actually happened to the program to make it fail intermittently. Our capable infrastructure team figured out quickly that nodes were deteriorating because they ran out of memory associated with cgroups. It looks like they used a technique similar to the one I described here, but applied to the whole system: look at CPU, memory and I/O. Start at the top level and drill down. Fixing that fixed the problem with this application, and that likely fixed many more applications that had been failing on the cluster but weren’t reported, so you did a good deed with detailed reporting.
The nodes were running out of a dedicated part of memory that’s used to enforce allocation of space to each application. I’m not sure how your application experienced that problem. It could have seen one of two things.
A failure for a read() call to return quickly. That would explain the system time and involuntary context switches.
A failure in memory allocation, which is an alloc() call. That would explain why the operating system would kill the application with a SIGKILL, which is a -9 signal, which is an exit status of 137.
I’m open to suggestions on which of these is more likely. Running the code and starting htop, with strace on, would have answered the question.