Richard L. Sites discusses his new book Understanding Software Dynamics, which offers expert methods and advanced tools for understanding complex, time-constrained software dynamics in order to improve reliability and performance. Philip Winston spoke with Sites about the five fundamental computing resources (CPU, memory, disk, network, and locks), as well as methods for observing and reasoning when investigating performance problems using the open-source utility KUtrace.
This transcript was automatically generated. To suggest improvements in the text, please contact email@example.com and include the episode number and URL.
Philip Winston 00:01:10 This is Philip Winston with Software Engineering Radio. Today, my guest is Dr. Richard Sites. Dr. Sites has spent most of his career at the boundary between hardware and software with a particular interest in CPU-software performance interactions. His past work includes VAX Microcode, DEC Alpha co-Architect, and inventing the hardware performance counters you see in many CPUs today. He has done low-overhead microcode and software tracing at DEC, Adobe, Google, and Tesla. Dr. Sites earned his PhD at Stanford in 1974. He holds 66 patents and is a member of the US National Academy of Engineering. Let’s start at the top. What are software dynamics and what benefits are there in striving to understand them?
Richard L. Sites 00:02:00 Software dynamics refers to different programs, or different threads, or a single program, or the operating system, all interacting with each other. The contrast would be with static software: a program that you start, and it runs, and it finishes. And each time you run it, it does sort of the same thing at about the same speed, like benchmarks. But real software more and more today is time-sensitive and has lots of user-facing work to be done or responses to give. And that dynamically ends up interacting with all the other things running on your computer, not just standalone like a benchmark. So, if you look at something like Activity Monitor, or top, or Task Manager, depending on your operating system, you’ll find there’s like 300 different programs running. So, software dynamics refers to the interactions between all of these, and to trying to get the responses back to something that’s time-sensitive — a person or robot or something in motion that needs responses quite quickly.
Philip Winston 00:03:05 When did you first become interested in software dynamics? Was there a particular project or problem you can recall that set you off in this direction?
Richard L. Sites 00:03:15 That’s a good question. When I was at Digital Equipment, I got interested in careful tracing of what was going on in a single program. And that turned into being able to trace what was going on in an operating system — in this case, the VMS operating system — and one of the questions that the VMS designers had was sometimes the operating system would not respond to an interrupt very quickly at all. It would appear to be out to lunch for a while. So, by doing a microcode-based tracing of all of the instructions being executed, I got to find that when that happened, the swapper program had just started up and was holding onto the CPU and not taking any interrupts. And that was a real simple thing to fix once they knew what the dynamics were, but they had never been able to observe it before. So, that was around 1980, 1981.
Philip Winston 00:04:11 So, do you feel that early software engineers say in the 1970s knew more about hardware than engineers typically know today?
Richard L. Sites 00:04:22 Oh, certainly. In the 70s, lots of people wrote in assembly language. Optimizing compilers weren’t very good. And so anyone who paid much attention to performance had to know a lot about what the real machine was. But it was also a much simpler environment; you were typically running just one program at a time.
Philip Winston 00:04:42 So, who is the target audience for the book?
Richard L. Sites 00:04:45 There’s sort of two target audiences. One is graduate students, interested in software performance and the other software professionals who are actively writing complex software, for instance, at places like Google or Facebook or Amazon that have lots of interactions with people or with machinery.
Philip Winston 00:05:06 So, I’m curious, performance is obviously a major concern with understanding these dynamics, but are there any other goals that might lead us to want to understand this runtime behavior in detail? Is it strictly performance?
Richard L. Sites 00:05:19 To my mind it is. I mean, that’s what the book is about. The industry has lots of tools, observation tools, and software and hardware help to understand the average performance of simple programs, and almost no tools to understand what delays are when you care about response time and you have 30 or 40 different programs running. So, I’ve tried to look at the harder problem of understanding the dynamics in a very complex environment, which is also the environment you would find in simple embedded controllers. The embedded controller for Tesla autopilot has about 75 different programs running at once. And it has responses that it needs to make essentially every video frame.
Philip Winston 00:06:06 So, there’s a difference between the average case and, I guess, maybe not quite the worst case. You mentioned that tail latency is typically one measurement used to find these slower cases. Can you explain a little bit more about what tail latency is?
Richard L. Sites 00:06:20 Sure. If you have something like a piece of a program that’s responding to requests for email messages from users all over the world, and a user sitting there and says, I want to look at my next message and it pops up. I want to look at my next message it pops up. Let me look at my next message. And there’s a four second delay, and then it pops up. I’m interested in that variance in the things that on occasion are slow, even though the average performance is very good. Some of those slow responses are just annoying, but some of them are life-threatening when you’re dealing with big machinery.
Philip Winston 00:06:57 Okay. I think that’s a good introduction. The book is centered somewhat around what you call the four fundamental computing resources, I guess the hardware resources, which are the CPU, memory, disk, and network. And then you add locks and maybe queues as critical software resources. Before we dive into these, there’s a utility you discuss in the book, which is available on your GitHub site called KUtrace. Can you tell me a little bit about what prompted you to write this utility? When did you have the idea for it and just kind of, how did it get developed?
Richard L. Sites 00:07:34 Sure. The idea came about around 2006, when I was working at Google and we had intermittent delays in web search and finding advertisements to deliver and all sorts of the software services. And no one knew why those delays occurred. So, I decided to build an observation tool that would show us at least what was happening in Gmail or in search or whatever. And from my previous experience, I knew that doing something like tracing every function call inside the operating system or tracing every piece of code in hundreds of applications, that would be much, much too slow because the delays occurred usually during the busiest hour of the day in live data centers. They weren’t things that we could find by running offline, by running canned test programs and stuff. So, I came up with the idea of tracing all of the transitions between user mode and kernel mode, every operating system service call, every interrupt, every fault, every context switch, and worked with one of the Linux kernel people at Google to build an implementation that would trace just those transitions and trace with very low overhead, less than 1% of slowdown of the CPU.
Richard L. Sites 00:08:59 Because my experience at Google was that if you went to the people whose job was to run the data centers and said, I have this great observation tool that has 10% overhead, so everything will be 10% slower, it’s a really short conversation. They just say no. And if you say it’s about a 1% overhead, it’s also a short conversation. They say, sure, we can’t measure a 1% difference anyway. And if you say a number in between, that’s a long conversation, and then the answer is no.
Philip Winston 00:09:28 Yeah, that makes a lot of sense. And what really interested me about these chapters on KUtrace is that you discuss in detail basically all of the design decisions behind what you did. It’s almost like a walkthrough of your thought process and the pretty extensive engineering that had to go into it. I’m going to get back to this if we have some time near the end, but I wanted to touch on all of the fundamental resources at least a little bit first. So, the first resource you talk about is CPUs. You have a chapter where you give a great history lesson on CPU features. For example, you mention that paged virtual memory first appeared in 1962 in the Manchester Atlas machine. Reading all of these descriptions of the features that seem to be additively growing on each other, I’m wondering: do CPUs always get more complicated over time, or has the trend ever been reversed? For example, people claim that ARM chips today are simpler than x86. Do you feel it’s true that some things do get simpler?
Richard L. Sites 00:10:33 It can happen in waves that things get more and more complicated. New instructions or additive features are added, and then performance gets too slow, or the power dissipation gets too large, or the clock cycle keeps getting longer and longer. And then there’s sort of a step function, and somebody says, “oh, well, we can do things much simpler.” John Cocke did that by inventing RISC machines, after the complex-instruction machines just got slower and slower. I’m not sure I would say today’s ARMs are less complicated than x86, just because that architecture, including the 64-bit version, has grown and grown and grown. But we do as an industry go through periodic simplifications. DEC went through that with the VAX architecture, which turned out to be big and slow after a while. And the MicroVAX architecture was a subset that could be implemented more simply and more cheaply. And that extended the life of the VAX architecture by several years.
Philip Winston 00:11:33 Yeah. I guess people talk about the pendulum swinging back and forth with architecture, both hardware and software. In the book you explain how the hardware and the compiler can subvert your attempts to measure how long individual instructions take. So, if I wrote a for loop to do an operation 10,000 times and time that loop, what are some less obvious ways that the compiler or the hardware might make my timings inaccurate?
Richard L. Sites 00:12:03 I’m going to give a little context first. The first section of the book: for a graduate class, part of the purpose is to get a bunch of grad students who’ve come from different backgrounds all on the same page. Some of them will know a whole lot about CPUs. Some will know about memory or disk. And after the first four weeks, everyone knows a fair amount about all of those. So, for the timing on an instruction, I give them the exercise of how fast a single add instruction is. You can read some time base, which we’ll talk about, I’m sure, and do a whole bunch of adds, and read the time base again, subtract and divide, and say here’s how long it took. So, I lead the students into lots of mistakes by giving them a program that does this. It’s, you know, a little short 20-line kind of program, but it has a few flaws.
Richard L. Sites 00:12:51 If you compile it unoptimized and run it, you get some number like six or 10 cycles per add instruction. And if you compile it optimized and run it, you get some number like zero cycles per add instruction. And the reason is that in the optimized form, the GCC compiler, or most any other optimizing compiler, takes out the entire loop because the result of all the adds is not used anywhere. And that’s sort of leading the reader into the idea that you need to be careful that what you think you’re measuring is what you’re actually measuring.
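The measurement discipline Sites describes — read a time base, do many adds, read again, subtract, divide, and make sure the result is actually used — can be sketched in Python. (The book's exercise uses C and GCC; this is just an illustrative sketch of the same methodology, and the function name and loop count are invented here. Python's interpreter won't delete the loop, but the consume-the-result rule is exactly what protects the C version.)

```python
import time

def ns_per_add(n=1_000_000):
    """Time a loop of n integer adds using a fast time base."""
    total = 0
    start = time.perf_counter_ns()            # read the time base
    for i in range(n):
        total += i                            # the add being measured
    elapsed = time.perf_counter_ns() - start  # read again and subtract
    # Consume the result: in a compiled language, if `total` were never
    # used, an optimizing compiler could delete the whole loop, and the
    # measurement would report "zero cycles per add".
    assert total == n * (n - 1) // 2
    return elapsed / n                        # divide by iteration count

print(f"{ns_per_add():.1f} ns per add (loop overhead included)")
```

Note that this times the whole loop body, not just the add; separating loop overhead from the instruction of interest is one of the further pitfalls the book's exercise walks through.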
Philip Winston 00:13:28 Yeah. I’ve run into that myself trying to time instructions. And I think I went down that road of feeling like I needed to print out some final sum or something, to tell the compiler that I actually needed that result. And there’s a number of other pitfalls and tricks you cover. When I started my career, CPUs always ran at a fixed frequency. Today it seems like the clock frequency can vary dramatically over time. What challenges does this pose for timing or tracing operations? And do real CPUs in data centers vary the frequency, or do they tend to lock it down to something?
Richard L. Sites 00:14:07 Varying the clock frequency is a technique for reducing power consumption and therefore heat generation. I think it first started with Intel SpeedStep. One of the things that gets heavily used when you’re doing careful performance measurements is some time base that counts fairly quickly: the cycle counter. The 1976 Cray-1 computer had a cycle counter that simply incremented once every cycle, and it was a 64-bit register. You could literally read the cycle counter, read it a second time, and subtract, and you would get a difference of one — one cycle. So, when we did the Alpha architecture at DEC in 1992, I included a cycle counter in the architecture so that any program could read it. And a year or two later, cycle counters started showing up all across the industry. And they would count each clock cycle that the CPU executed.
Richard L. Sites 00:15:10 And then a few years later, when SpeedStep came along, the effect was that when the CPU clock was slowed down to save power, the time for one cycle slowed down. And if you were using the cycle counter to measure wall-clock time, suddenly it got way out of whack compared to wall-clock time. And that matters, for instance, in the early Google file system, GFS. The cycle counter was used, along with a multiply and an add, to reconstruct the time of day, and that was used to timestamp files. And if you ever ran on a machine where time appeared to go backwards, the file system would crash. And the effect when SpeedStep came in was that they could not use it; they had to keep running the clock at a constant rate, otherwise the software would get confused and crash. Subsequent to that, people created the so-called constant-rate cycle counter, which actually just counts time, at the same rate, independent of the power saving. Typically it would count at a hundred megahertz, incrementing once every 10 nanoseconds. And that gives a much more stable time base.
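The distinction between a stable, constant-rate time base and adjustable wall-clock time is visible even from Python's `time` module (a small sketch; the underlying tick rates and clock sources are platform-dependent):

```python
import time

# time.monotonic_ns() is a constant-rate time base: it never jumps or
# goes backwards, much like the constant-rate cycle counter described
# above. time.time_ns() is wall-clock time, which NTP or an operator
# can adjust, so deltas taken from it can occasionally come out
# negative — the failure mode that crashed the early GFS timestamps.
t0 = time.monotonic_ns()
t1 = time.monotonic_ns()
print("monotonic delta (ns):", t1 - t0)   # always >= 0
```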
Philip Winston 00:16:22 Yeah. In my work I’ve run into that situation. I think it was the RDTSC instruction on x86. And you had to also worry about whether your program had moved from one CPU to another, and whether the clocks were synchronized across CPUs. I just remember there were a lot of pitfalls there. So, that’s a little bit about CPUs. There’s a lot more detail in the book, especially about the history and the complexity. So, let’s move on and talk about memory. The chapter on memory had a lot of information about caching and the complexities of caching. The difference between an algorithm that fights with the cache versus one that’s very cache-aware can be extremely large. Do you feel this is something a lot of software could do better? Is cache awareness something that is often ignored?
Richard L. Sites 00:17:15 A lot of software is not very sensitive to cache behavior, but some important software is. So, if you’re looking at the inner loops of a matrix multiply or something, it makes a huge difference. If you’re looking at the Linux operating system, running the operating system code isn’t terribly sensitive to cache behavior, except when it’s doing something like a bulk move of a bunch of data from one place to another place. So, it’s sort of a mixed bag. On the other hand, caches are essentially a speed-up mechanism, and they’re wonderful when they work as intended and when the software uses them as intended. But if you end up, perhaps by mistake, with software that defeats the caching mechanisms, what happens is your performance just falls off a cliff. And that happens all over this industry, not just with caches; it happens with networks.
Richard L. Sites 00:18:12 if you have magic hardware that offloads a TCP packet assembly or something, maybe that hardware handles eight different active streams. But if you have nine, suddenly the performance drops by a factor of a hundredth. So, all of these speed-up mechanisms, as chips get more complicated and issue instructions out of order and five instructions that are declined, they’re wonderful until you step off the edge of the cliff. And to know about that, you have to actually understand a little bit about what the hardware is doing so that you recognize what you’ve done to yourself when you step off the cliff.
Philip Winston 00:18:48 So, one thing that interested me was all the different types of caches — different cache levels, sizes, associativity. Is it possible to have an algorithm that’s sort of roughly cache-aware, but not tuned to a specific CPU? Is there sort of a spectrum of cache awareness?
Richard L. Sites 00:19:08 Yeah. The main thing is, when you’re accessing multiple pieces of data, to have them stored near each other. And if you have some huge amount of data, hundreds of megabytes, when you go to access part of it, try to access other parts nearby rather than being totally scattered. That’s the main thing.
Philip Winston 00:19:32 A term I’ve come across is structure-of-arrays versus array-of-structures. And I guess structure-of-arrays means what you’re saying: that the same type of data is sort of packed in without anything in between. Have you heard that terminology before?
Richard L. Sites 00:19:48 Not recently. I heard it a lot in the seventies. If you have something like six parallel arrays and you’re going for one item in each of the six, if they are literally separate arrays, then you’re looking at six different cache accesses. If you have an array of elements where all six pieces are physically together in memory, then you may be looking at one cache access or one cache miss. I have a quote I want to throw in here that’s from Donald Knuth. It’s in the book in Chapter Two. The quote is, “People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like. Otherwise the programs they write will be pretty weird.”
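The six-parallel-arrays situation can be sketched in Python. The standard `array` module stores elements contiguously like a C array, so a structure-of-arrays keeps each field packed, while an array-of-structures (here, a list of tuples) keeps one record's fields together. This is only a layout illustration — Python adds per-object overhead that C does not — but the two shapes mirror the cache-access trade-off described above:

```python
from array import array

n = 8

# Structure-of-arrays: each field is its own packed, contiguous array.
# Scanning one field across all records touches sequential memory,
# which is the cache-friendly pattern for field-at-a-time work.
soa_x = array("d", range(n))
soa_y = array("d", range(n))

# Array-of-structures: the fields of one record sit together, so
# touching every field of one record can be a single cache access.
aos = [(float(i), float(i)) for i in range(n)]

sum_soa = sum(soa_x)                  # walk one packed field
sum_aos = sum(x for x, _ in aos)      # same field, strided over records
assert sum_soa == sum_aos == 28.0     # same answer; only layout differs
```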
Philip Winston 00:20:34 Yeah, definitely. I think that awareness of hardware is a huge theme in the book. Continuing on memory for a little bit, there was a section about the precharge cycle of DRAM, row versus column access of memory. I’ve definitely witnessed the impact of caching on my software, but I’ve never thought about DRAM access at this level of detail. Have you seen issues where these hardware details affect performance, or is it less significant than, say, caching?
Richard L. Sites 00:21:06 I’ve seen instances where it does affect performance. DRAM (Dynamic Random Access Memory) isn’t truly random access. Because of the internal implementation of the transistors, if you read someplace that’s near where you last read in a particular bank of RAM, it’ll be faster than if you are always scattered about, reading just a few items here and there. So, the effect is much like caching: the DRAM chips internally cache like a thousand bytes in one access, and if you reuse bytes within that, it’s faster than if you go to a completely different group of a thousand bytes.
Philip Winston 00:21:44 Yeah, I guess the term locality of access jumps to mind related to this. So, that’s a little bit about CPUs and memory. Let’s move on to talking about disk. You have disks as the third fundamental computing resource. You include a lot of details about both hard disks and Solid State Disks (SSDs). Let’s talk mostly about SSDs here, since that’s increasingly what people are using, at least in their own machines. So, like with memory, you discuss several ways that hardware and low-level software can subvert your attempts to make simple measurements. Can you mention some of the ways here that would subvert your ability to measure how long a disk access would take?
Richard L. Sites 00:22:29 An SSD access?
Philip Winston 00:22:30 Yeah, I think for an SSD.
Richard L. Sites 00:22:33 Yeah. When you go to access, let’s say, a 4K block that you want to read off of an SSD, there are all these mechanisms under the covers that are quote helping unquote you. The operating system file system almost surely has a cache of recently accessed storage data, and so you may do a read and simply hit in the file cache and never go to the device. Most SSDs actually have a small RAM, standard RAM, inside the SSD package, and they will read from the flash memory into the RAM and then supply data from the RAM. This is most useful when you’re writing, to buffer up a whole bunch of writes and then write them off to the flash transistors all at once. But you may find that you do reads that hit in the RAM that’s inside the Solid State Drive, and don’t suffer the 10 or 50 or a hundred microseconds to get to the real flash transistors. So, everyone has their finger in the pie, trying to speed things up and occasionally slowing things down.
Philip Winston 00:23:43 So, reading about the specific electrical properties of SSDs, and again the charge cycles, I guess I got a little confused on what the difference between DRAM and SSD is. Is the underlying technology totally different? Of course, SSDs keep their data when the power’s off, but other than that, are there similarities between the two?
Richard L. Sites 00:24:05 They’re really completely different. The flash transistors can hold the value that you set in them, a one or a zero, for 10 years or more, but they wear out. If you write them a hundred thousand times, they stop being able to separate ones from zeros; the amount of charge that’s stored inside the floating gate degrades over time. I’m not sure that fully answered your question.
Philip Winston 00:24:32 Yeah, well, that’s definitely a huge difference. I think that what I really liked about the book is that it packed in a lot of the details, the hardware details that I had come across at various points in my career, but it packed them into one section. So, even the, in the hardest drive section, I thought it was really interesting to read about all of those details put together.
Richard L. Sites 00:24:54 I should say one other thing about SSDs. When you write an SSD, the actual write of the flash transistors assumes that they’ve already been set to all ones, and then you selectively change some of them to zeros. And the erase cycle that sets them to all ones takes a long time, like 10 milliseconds, and most flash chips, while they are doing an erase cycle, can’t do anything else. And the effect the application programmer can see is that if you’re doing writes to an SSD, reads that are intermixed may now and then be completely delayed by an extra 10 milliseconds, because the chip can’t do any reads while it’s doing an erase cycle. And that really is noticeable in data center performance and in some other real-time contexts.
Philip Winston 00:25:46 Yeah, that’s definitely a super low-level detail. And I guess when I first started to read the chapter, I assumed that SSDs were going to have more or less, you know, perfect performance compared to hard disk drives. So, it was pretty interesting to hear that they have their own peculiarities that can surface. So, that was CPUs, memory, disks; let’s move on to network. The networking chapters talk a lot about remote procedure calls. When I think of accessing a resource over the network, I’m usually thinking about HTTP REST. Are remote procedure calls something different, or is REST a type of remote procedure call?
Richard L. Sites 00:26:25 Remote procedure calls are used to connect together lots of machines that are sharing work, and they don’t show up much if you just have one computer, or a small number of computers that don’t interact. A remote procedure call is like a procedure call within a single program, you know, where procedure A calls procedure B, except that B is running on a different machine somewhere, typically in the same room, but sometimes across the country. And the arguments to that call are shipped across the network to the other machine, where it runs procedure B and gets some answer, and the answer is shipped back over the network to the caller, procedure A, which then continues. And that can be incredibly useful for something like a web search at Google, where the computer that gets a search from a user immediately fans it out to a hundred other machines, using a remote procedure call for each of those machines to do a piece of the work. And those fan out again, to another 20 machines each or something, so there’s 2,000 machines. And then the answers come back and are merged together, across the 2,000 machines, the hundred machines, the one machine, and then an HTML page is put together and sent to the user, all in a quarter of a second or so.
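The fan-out-and-merge pattern Sites describes can be sketched with a thread pool standing in for the remote machines. (A toy sketch: `leaf_search` and the shard counts are invented here for illustration; a real system would issue network RPCs to other hosts rather than call local functions.)

```python
from concurrent.futures import ThreadPoolExecutor

def leaf_search(shard, query):
    # Stand-in for the RPC to one leaf machine searching its shard.
    return f"hits({shard},{query})"

def root_search(query, num_shards=4):
    # The root fans the query out in parallel and merges the answers,
    # a miniature of the 1 -> 100 -> 2,000 machine fan-out above. The
    # overall response time is set by the SLOWEST leaf, which is why
    # tail latency dominates in this architecture.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        results = pool.map(lambda s: leaf_search(s, query), range(num_shards))
        return list(results)   # merged, in shard order

print(root_search("weather"))
```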
Philip Winston 00:27:47 So, specifically, remote procedure calls could be implemented with different networking technologies. Are you just using it as kind of a generic term for any type of call to a remote machine, or are you specifically talking about a certain type?
Richard L. Sites 00:28:00 No, just any generic call. And most of the networking chapter is about waiting on what the other machines are doing, and being able to understand who’s waiting when. And the same could apply to remote access to files, if you have a distributed file system across many machines.
Philip Winston 00:28:22 Okay. I said we’re not going to talk too much about KUtrace yet, but in the chapters about networking you have a long section, I think, talking about RPC IDs and how you need to record those IDs in order to do a trace. Can you talk a little bit more about that? Because I wasn’t totally clear on how you were able to deduce so much information from just really short IDs.
Richard L. Sites 00:28:46 Okay. If you look at something — I’ll pick a disaster that I didn’t work on at all: the US government’s rollout of signing up for Obamacare. That was a set of computers that performed very poorly, and were usually not working, put together by about 30 different companies, none of whom had any responsibility for the entire works actually delivering signups to citizens. But they were all connected together, so that whatever a citizen did would send messages between lots of different computers. And when you’re trying to figure out why some response either doesn’t happen at all or happens very slowly, you need some way of figuring out which message relates to which, in this case a citizen’s request or carriage return or whatever. And so giving all of the messages some kind of identifying number, which keeps changing (every message has a different number), is an underpinning that’s absolutely necessary if you want to do any kind of performance analysis of where all the time went. So, it can be just a simple number, you know, a 32- or 64-bit number.
Philip Winston 00:29:58 I see. Yeah. So, you’re recording these on the different machines and that allows you to trace what work was done on behalf of that call.
Richard L. Sites 00:30:06 Yeah. And with the messages between the machines, each message transmitted over the network includes that particular ID number.
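The mechanism is small enough to sketch directly: every outgoing request carries a fresh ID, and the response echoes the same ID, so a trace tool on either machine can pair them up. (A minimal sketch under stated assumptions; the dictionary message format and function names are invented here, not KUtrace's wire format.)

```python
import itertools

_rpc_ids = itertools.count(1)   # every message gets a distinct number

def make_request(method, payload):
    """Tag an outgoing request with a fresh ID before it is sent."""
    return {"rpc_id": next(_rpc_ids), "method": method, "payload": payload}

def make_response(request, result):
    # The response carries the SAME ID, which is what lets a trace
    # tool match this response to its request and attribute all the
    # intervening work and waiting to one logical operation.
    return {"rpc_id": request["rpc_id"], "result": result}

req = make_request("lookup", "inbox")
resp = make_response(req, "42 messages")
assert resp["rpc_id"] == req["rpc_id"]
```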
Philip Winston 00:30:14 I see. Okay. That makes sense. How about this term slop you used in network communications? It sounds like a very informal term, but how do you measure it and how do you decrease it?
Richard L. Sites 00:30:27 Yeah. Well, if you have two machines connected with something like an Ethernet, and Machine A sends a message or request to Machine B, and Machine B gets that and works on it and sends an answer back to Machine A, and Machine A gets the answer — that whole round trip takes a long time, and you’re concerned about understanding what’s going on. You might look at the time on Machine A when it sent the request and the time, also on Machine A, when the response came back, and then go over to Machine B and look at when the request came in and when Machine B sent the response. And maybe on Machine A the whole works took 200 milliseconds, and on Machine B, between the time it got the request and the time it sent its answer, there was only 150 milliseconds; we’ll do all this in milliseconds.
Richard L. Sites 00:31:19 So, the sender sees 200 milliseconds; the server in this case sees 150 milliseconds. And the question is, where did the other 50 milliseconds go? That’s the slop. It’s the difference between the elapsed time the caller sees and the elapsed time the callee sees. And if the slop is a few microseconds, that’s perfectly normal. And if it’s tens or hundreds of milliseconds, somebody dropped the ball somewhere: maybe within the kernel on the machine sending the request, maybe in the network hardware, maybe in the kernel on the receiving machine, or maybe the receiving machine’s application program didn’t get around to asking for the next piece of work. And whenever there’s a delay like that and you talk to a bunch of software programmers, it’s always easy to point at somebody else’s problem, and it’s hard to figure out where the actual time went.
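The bookkeeping can be written down directly. With send and receive times t1 and t4 read on the caller's clock, and arrival and reply times t2 and t3 read on the server's clock, slop is the caller-side elapsed time minus the server-side elapsed time (a direct transcription of the example above; timestamps in milliseconds):

```python
def slop(t1, t4, t2, t3):
    """Caller-side elapsed (t4 - t1) minus server-side elapsed (t3 - t2).

    A few microseconds is normal; tens of milliseconds means time is
    hiding in a kernel, the network hardware, or a tardy application.
    """
    return (t4 - t1) - (t3 - t2)

# The transcript's numbers: the caller sees 200 ms, the server 150 ms.
missing = slop(t1=0, t4=200, t2=25, t3=175)
assert missing == 50   # 50 ms unaccounted for somewhere in between
```

Note that t1/t4 and t2/t3 come from different clocks, so slop is meaningful even before the two clocks are aligned; only the two elapsed-time differences matter.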
Philip Winston 00:32:14 This might be related: earlier this year, I saw Facebook released an open source hardware implementation of a time card that contained a miniature atomic clock chip. They presumably use this to keep time synchronized between servers in their data centers. You go into some detail about how we can synchronize traces from different machines if the clocks differ. Do you feel that tightly synchronized clocks aren’t necessary? Are they worth the effort of having customized hardware? Or can we just deal with the clocks differing by a certain amount?
Richard L. Sites 00:32:49 I’m not a fan of expensive high-resolution clock hardware. Google data centers, for instance, have a GPS receiver on the roof or something, and then the GPS time is forwarded via software and networks, within a data center room that might be an acre or something, to all the machines. And some other data center in some other state has its own GPS receiver, et cetera. But if you have only one, it’s a single point of failure; suddenly the whole building doesn’t know what time it is. So, in fact, you need like three of them, and then you need to figure out which one to actually believe if they’re different. And there are also places like Facebook, and papers from Stanford, about very, very careful hardware that can keep clocks on different CPU boxes synchronized within a few nanoseconds of each other. And for understanding the dynamics of application software, I found all that to be unnecessary.
Richard L. Sites 00:33:49 It’s good enough to simply use whatever hundred-megahertz kind of cycle-counter clock there is on one machine, and whatever one there is on another machine, and they’ll differ: you know, the time of day might differ by 10 milliseconds or so, and it might drift, so that after an hour it differs by 11 milliseconds. But if you have time-stamped interactions between those machines, and you have some that don’t have big delays (big delays are uncommon in individual round-trip interactions), then you can, in software, from a whole bunch of timestamps, align the clocks between the two machines in order to make sense of a trace of what was happening. And you can pretty easily achieve five or 10 microsecond alignment. So, one of the things I encourage the readers to do, and walk them through, is that you don’t really need expensive, fancy clock hardware. You can do perfectly well with different machines that have slightly different clock speeds, and align them in software.
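The core of that software alignment can be sketched with the classic minimum-delay estimate: pick the round trip with the least slop, where no hidden delay is distorting the timestamps, and use the midpoint formula to estimate the offset between the two clocks. (This is a simplification of the book's procedure, which also fits the drift between the clock rates over time; the function name is invented here.)

```python
def estimate_offset(samples):
    """Estimate (machine B's clock - machine A's clock).

    samples: one (t1, t2, t3, t4) tuple per round trip A -> B -> A,
    where t1/t4 are read on machine A and t2/t3 on machine B.
    """
    # Use the round trip with the smallest slop: the less unaccounted
    # delay, the less room there is for error in the estimate.
    best = min(samples, key=lambda s: (s[3] - s[0]) - (s[2] - s[1]))
    t1, t2, t3, t4 = best
    # NTP-style midpoint formula: assumes the two one-way delays of
    # this best round trip are roughly symmetric.
    return ((t2 - t1) + (t3 - t4)) / 2

# B's clock runs 10 units ahead of A; one-way delay is 1 unit each way.
assert estimate_offset([(0, 11, 12, 3)]) == 10.0
```

Subtracting this offset from machine B's timestamps puts both machines' trace events on one common timeline, which is what makes a merged cross-machine trace readable.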
Philip Winston 00:34:52 Yeah, and you did walk through that in pretty extensive detail. It seemed like not incredibly fancy, but it was definitely using statistics and algorithms that were more than someone would come up with just off the top of their head. So, those are the four major hardware resources: CPU, memory, disk, and network. You include locks as, I guess, the fifth major resource. Why are software locks almost as important as the hardware resources? And do you feel this is new, or has this been changing over time? Or would you have always included locks as a primary resource?
Richard L. Sites 00:35:31 Software locks are used to keep multiple threads of execution from going through the same critical section simultaneously. If two things go through something like the code that reserves an airplane seat simultaneously, they might both get the same seat. So, software locks weren't around in the 1950s, but they've become really important these days. When you have large machines doing lots of different work, you have operating systems that run the same operating system image on four different cores on a single processor chip. There are pieces of the operating system where you need to be sure that two different cores aren't updating some internal data structure simultaneously. So, there are software locks all over. I once did a search through the Google code base when I was there. The whole code base is searchable, of course, since it's a search company. And there were like 135,000 different software locks declared. Most of the delay in real-time responses in that environment is delay waiting on locks. It's not waiting on all the other things that the book talks about. So, yeah, they're important.
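The airplane-seat example can be made concrete in a few lines. This is a minimal sketch (the data structure and names are invented here) showing why the check-then-assign pair has to sit inside one critical section:

```python
import threading

seats = {"12A": None, "12B": None}   # seat -> passenger, None means free
seats_lock = threading.Lock()

def reserve(seat, passenger):
    """Reserve a seat if it is still free; return True on success."""
    # Without the lock, two threads could both observe the seat as
    # free and both believe they got it. The lock makes the
    # check-then-assign sequence atomic.
    with seats_lock:
        if seats[seat] is None:
            seats[seat] = passenger
            return True
        return False
```

The first caller wins and every later caller for the same seat is refused, no matter how the threads interleave.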
Philip Winston 00:36:52 You also talk about queues. I assume that queues are often implemented with a lock. So, is this just a special case of locks or is there anything about queues which deserves to be focused on as its own different resource?
Richard L. Sites 00:37:06 I didn't make the context for the chapter on queues quite clear enough. I'm specifically interested in work that is done in pieces: a little piece is done, and then the package of work to be done is placed on a software queue. Then later, some worker program picks that piece of work up off the queue, does the next step or next piece of the work, and puts it on a queue for some other thread. And eventually, after four or five steps, the work is completed, and then the results are sent out or the response is done or whatever. So, queues themselves have some locking at the very bottom of the design to make sure that two different things aren't being put on a single queue simultaneously. But the chapter on queueing is more about the next level: if you have pieces of work getting queued up, and they get stuck in queues too long, that's a source of delay.
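That pipeline-of-queues shape is easy to sketch with Python's standard library; the stage functions below are placeholders, not anything from the book:

```python
import queue
import threading

def worker(inq, outq, step):
    """Pull work items from inq, apply one processing step, pass them on."""
    while True:
        item = inq.get()
        if item is None:          # sentinel: shut this stage down
            outq.put(None)        # propagate shutdown to the next stage
            break
        outq.put(step(item))

# A two-stage pipeline: each hand-off between queues is a place
# where work can sit and accumulate latency.
q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=worker, args=(q0, q1, lambda x: x * 2)).start()
threading.Thread(target=worker, args=(q1, q2, lambda x: x + 1)).start()

for item in [1, 2, 3]:
    q0.put(item)
q0.put(None)

results = []
while (r := q2.get()) is not None:
    results.append(r)
```

The time an item spends sitting in `q1` waiting for the second worker is exactly the kind of delay the queueing chapter is about: every piece of code is correct, yet latency can accumulate at each hand-off.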
Philip Winston 00:38:04 You briefly mention lock-free programming, where special CPU instructions like compare-and-swap are used. I felt like a lot was made about these algorithms a number of years ago, but lately I've not been reading as much. Do lock-free algorithms solve all the problems of locks, or what problems still remain?
Richard L. Sites 00:38:24 They don't remove the need for locks, but they can give you some low-level pieces that don't have to lock and wait, as you would when some other thread is using a software lock that you need. They're just instructions that atomically, within a single instruction, move two pieces of data around instead of just one piece. And they guarantee that two different CPU cores aren't moving the same two pieces simultaneously such that they get shuffled out of order.
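Python doesn't expose a hardware compare-and-swap, so the sketch below simulates the semantics with a lock (the class and method names are invented for this example); in C or C++ the same retry loop would use a real atomic instruction:

```python
import threading

class AtomicInt:
    """Simulates compare-and-swap semantics. On real hardware the
    entire compare_and_swap operation is a single atomic instruction,
    not a lock; the lock here only stands in for that atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True       # we won; our update is published
            return False          # another thread changed it first

def increment(counter):
    # The classic lock-free retry loop: read, compute, try to publish,
    # and retry if some other thread got there in between.
    while True:
        old = counter.load()
        if counter.compare_and_swap(old, old + 1):
            return
```

Even with four threads hammering the counter, every increment lands exactly once, because a failed compare-and-swap simply retries rather than overwriting another thread's update.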
Philip Winston 00:38:58 So, you feel that lock-free algorithms…?
Richard L. Sites 00:39:00 Yeah. Lock-free algorithms are important at a very low level. And the underlying hardware instructions are in all machines now.
Philip Winston 00:39:09 Okay. That makes sense. So, we’ve talked about these five fundamental computing resources, maybe six, if you count queues separately, and we’ve talked a little bit about KUtrace, two other big sections in the book are about observing and reasoning. One of your refrains in the book is asking people to predict what they expect to find before measuring it. Why is this prediction step helpful? And when did you start doing this yourself or fall into the habit of trying to make predictions about performance measurements?
Richard L. Sites 00:39:42 So, to answer the second part first: I started making predictions when I took Don Knuth's Fundamental Algorithms class, and we counted cycles in this fake MIX processor. If you don't know how many cycles, or how fast, or how much time something should be taking, then you run some program on some computer, you get some performance numbers, and you say, okay, that's what it does. And you have no basis to question whether that makes any sense. So, for instance, there's the question of how fast an add is, where I lead the students into optimized code that simply deletes the loop and suggests an add takes zero cycles. If you haven't written down ahead of time that you think an add might take one cycle, you get students who say, oh, an add takes zero cycles, and turn that in as the answer on their homework. So, the point is to first raise a reader's awareness that you can actually estimate, within a factor of 10, how long things should take for almost anything. And then you have a little touchstone: if you then go run some program and measure it a little bit, and the measurement you got is wildly different from your estimate, then there's some learning to be done. You might learn that your thought process for the estimate was way off. You might learn that the program is way off. You might learn that it's a little bit of each. So, I think this is a really important professional step for software programmers who care about performance.
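The estimate-then-measure discipline can be practiced on something trivial. The figures in the comments below are rough, hedged back-of-envelope numbers, not authoritative measurements:

```python
import time

N = 1_000_000

# Estimate FIRST, before running anything: an interpreted Python loop
# iteration costs very roughly 50-100 ns on a ~GHz machine, so a
# million adds should take on the order of 0.05-0.1 seconds,
# give or take a small factor.
total = 0
t0 = time.perf_counter()
for i in range(N):
    total += i
elapsed = time.perf_counter() - t0

# If the measurement were 100x off the estimate in either direction,
# that would be the signal to dig in: maybe the work was optimized
# away entirely (the "an add takes zero cycles" trap), or maybe
# something else on the machine was interfering.
print(f"{N} adds in {elapsed:.3f}s, about {elapsed / N * 1e9:.0f} ns per iteration")
```

The point isn't the absolute number; it's that having written down an expected range beforehand turns the measurement into a check on your understanding rather than just a number.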
Philip Winston 00:41:13 I can definitely see that. So, how would you say this is related to the scientific method? Like making a hypothesis, doing some tests, looking at the data. It sounds like, as engineers, we shift into doing a little bit of science and then shift back into engineering. Do you see a connection between the two?
Richard L. Sites 00:41:32 I think that's true. The estimate is a bit like a hypothesis. If you're looking at some piece of biology and you think that some protein has some action, you make that a hypothesis, and then you try to design experiments to test it. In this case, you make an estimate of speed or performance, and then you see what happens and compare. If you tried to do science with no hypothesis, just saying, "let's do a bunch of experiments and see what happens," with no idea what the results mean, you wouldn't make progress very quickly.
Philip Winston 00:42:08 Yeah. I can definitely tell in my own work, sometimes when I’m running against the limit of what I understand, I’ll sort of get this anticipatory feeling like, well, at least I’m going to learn something here with my next task, because it just has to reveal something. Another mental model from the book that almost sounds too simple to consider a model but actually I think is helpful: As you say, when your software is running too slowly, it’s either not running, or it’s running but running slowly. Why is it worth keeping those two as separate possibilities? And I guess it could be a combination of the two also.
Richard L. Sites 00:42:45 Oh, they're separate because the way you fix them is completely different. If you have a program that's occasionally slow doing some operation, it could be because, in the slow instances, it's executing a whole lot more code. You know, it goes off and does some subroutine call you weren't expecting to happen, and that only happens now and then, and it goes off and does a lot more work. That's one choice. The second choice is: it's executing exactly the same code as the fast instances, but there's something else sharing the hardware that's interfering with that code, some other program or the operating system that's making it run more slowly than normal. And then the third choice is that it's not running at all. As an industry, we have lots of tools and profilers and things that pay attention to where the CPU time is going, but we're very weak on tools that say, "oh, you're not executing at all, and here's why." So, in the case where you're executing more code than normal, you need to find what the extra code path is; in the case of executing the same code but slowly, you need to find what other program or piece of the operating system is interfering, and how it's interfering. Is it thrashing the cache? Is it taking over major portions of the CPU that you're trying to use? Is it loading down the network? It's only one of five things. And if you're not running at all, then you need to go understand why the program isn't executing, what it is that it's waiting for, and then go fix why the thing it's waiting for took too long. So, in some cases you fix the program you're working on, and in some cases you fix other programs.
Philip Winston 00:44:29 Yeah. I think I remember from the book one of the examples of executing code that you didn't expect: it was actually preparing a debug value, or preparing some information, that was then not even used. And so, the investigation to find this case was difficult, but the solution was actually very simple in terms of just not doing that extraneous work. So, I can see how that's a very different case from one where it's executing exactly what you expect, but slowly. So, yeah, they're definitely different.
Richard L. Sites 00:45:00 And that was a real example from Google that took us about a month to track down: why some service would go out to lunch for a little while. And we eventually found, oh, there's this big piece of debug code that's running, and then the results are thrown away. This happens in large software. Nobody's a bad programmer. You just end up with things like that after a while.
Philip Winston 00:45:22 Yeah. And so you definitely feel like you're discovering these things through the traces. So, one thing I enjoyed was you mentioned the difference between batch processing (or, I guess, pipeline processing or data processing) versus user-facing transactions, and how, for instance, your ideal CPU utilization is different in those cases. Have you dealt with both of those types of cases, or are software dynamics more of a concern with one of them?
Richard L. Sites 00:45:59 Yeah. The software dynamics are more of a concern in time-sensitive code. A lot of our industry focuses on simple programs that start and run and stop, and they model them with benchmarks that run on empty machines. The whole point of the benchmark is that if you run it five times on a particular machine in a particular configuration, you should get five answers, five time measurements, that are about the same, and then the marketing people take over from there. But that's not a very good model at all of software that's on the other end of your cell phone, or in your cell phone, where you're waiting for something to happen. Programs that run in the background or run in batch, nobody's waiting on them particularly strongly. You know, they can run for a couple of hours; it doesn't matter if it takes two hours or two and a half hours. That's a very different environment from: I hit carriage return and I want something to happen on my screen. In that environment, with the time-sensitivity, you never want the CPU to be 100, or even 90, or even 80% busy. While in the benchmarking environment, or the high-performance physics environment where you're doing lots and lots of matrix calculations, the goal is to make the CPUs 100% busy. So, they're very different environments.
Philip Winston 00:47:19 Yeah, and that's a distinction I've run into also: you're either trying to soak up all of the hardware resources available, or you're trying to reserve some for when you have a spike in usage. So, you have two neat examples in the book. One was, I think, something you were investigating or found documented: an IBM 7010 from 1964. This was one of the earliest cases you found of someone using the type of tracing techniques you talk about to investigate a real performance problem (I assume it was performance). And then maybe in the next chapter, or later in that chapter, you talk about some of your work investigating a specific performance problem in Gmail in 2006. So, these examples are more than 40 years apart. What can you say about the process of investigation that was the same, and what was different? We don't have time to talk about the details of the investigations, but I'm just interested: were you left thinking that the process itself has remained much the same, or were the processes wildly different?
Richard L. Sites 00:48:31 I think the processes are surprisingly similar. I should say a word about tracing versus other observations. If you're dealing with problems that are reproducibly slow, you can go find those and fix them sort of working offline; you don't have to deal with a user-facing, time-sensitive environment. But if you have occasional hiccups in time-sensitive software, you don't know when they're going to occur. And if you don't know when they're going to occur, you need to watch for quite an interval of time. You need to watch everything that's going on, and then hope that you catch some of these hiccups so you can track down what the root cause is and fix it. There are a lot of observation tools that do logging and profiling and such, which sort of merge together a lot of data and give you some aggregate numbers; but to really see these anomalous executions, you need to trace everything that's happening over on the order of a few minutes.
Richard L. Sites 00:49:36 That's hard to do. It's particularly hard to do with tiny enough overhead that you're not just distorting what you're trying to learn about. And that difficulty of tracing what's going on has been the thing that's constant from the '50s to now. The IBM 7010 people built a whole box of hardware to watch the program counter value on some instruction bus, every cycle, for seconds. And it was a one-off pile of hardware someplace like Rochester, New York. That was the only way they could see what the programs were really doing. And it's the same thing now: it's really hard to build low-enough-overhead tracing software. You get lots of high-overhead tracing software instead, and then you can't use it in a real-time environment.
Philip Winston 00:50:24 Yeah, I had forgotten that they built custom hardware to observe the machine. Well, I think we're going to start wrapping up. Are there any resources you'd like to point out where people can learn more about the book or about yourself? I'll put any links you mention in the show notes so people can look them up there.
Richard L. Sites 00:50:44 Okay, the two main places where the book is available are on the Pearson or Addison-Wesley website, which is called informit.com. That website, in addition to selling the book, has all of the code that goes with the book and is starting to have reviews. The other place is Amazon, which I think is just now getting their first shipments of boxes of books.
Philip Winston 00:51:11 Okay, that's great. And this is being recorded in December 2021, so that's the timeframe we're talking about. How about yourself? Any other links or resources to recommend?
Richard L. Sites 00:51:21 No, I’m not really on social media very much. I am on LinkedIn.
Philip Winston 00:51:34 Okay. I’ll definitely add that to the show notes. Well, thanks so much for being on the episode. I really enjoyed reading the book. You have a lot of great technical detail that we didn’t get into here in the episode. And I would say that some of the chapters read somewhat like a mystery or a thriller. So, it was really interesting to go through those examples. Do you have anything else you want to mention?
Richard L. Sites 00:51:58 Yeah. Some of the readers may enjoy the 40+ index entries under "screw-ups." There are lots of examples of real-world screw-ups in the book.
Philip Winston 00:52:07 Yeah, I remember this. Okay. Well thanks a lot. This is Philip Winston for Software Engineering Radio. Thanks for listening.
[End of Audio]