Jay Gordon: You're listening to the On-Call Nightmares podcast. Every week I bring you conversations with technologists who've spent time on call. This week was Datadog Dash here in New York City, so I got a chance to catch up with a friend of mine, Jason Yee. He's a technical evangelist at Datadog, and we did something a little bit different than the normal On-Call Nightmares conversation: we looked at more conceptual ideas around on-call data that you may want to, uh, keep and analyze, and what to measure when you're coming up with a monitoring solution. So that being said, like I said, I got to stop by the Datadog Dash conference, learned a lot of new things, and took a couple of workshops there. And as always, I think Datadog is doing some really interesting stuff to help you really get the most out of your IT solutions' data. Being able to correlate metrics to big incidents, things like that, I feel is really helpful for people. So let's get into that conversation. But before we do: much like everyone who's been on the podcast before, some of them I found, and some found me. And if you want to be one of those people who finds me to be on, just send me an email. It's oncallnightmares at gmail dot com, or you can tweet me at OnCallNightmare or at jaydestro. So let's get into the conversation with Jason and hear what he has to say. Well, welcome back to On-Call Nightmares. And today I'm lucky that not only do I have a person to be a guest, but he's actually in the house. We're drinking some coffee, we're chatting, and then when we're done, I think we're gonna go eat breakfast. Jason Yee is here. He is a technical evangelist over at Datadog, and before that he spent his time at O'Reilly publishing, helping people and, you know, really getting out the word about what he's been doing, which it seems like... what is it, almost like 12, 13 years?

Jason Yee: In tech? I've been in tech probably longer than that.

Jay Gordon: Really? Yeah, I'm sorry. I did a quick look at your bio on LinkedIn and I just didn't get the numbers quick enough. So how long have you been in technology?

Jason Yee: I mean, as a full-time job, ever since college. So, uh, coming up on probably 20 years.

Jay Gordon: Wow. So you're another 20-year person. The only reason I wasn't sure is, you know, you spent time as a chef, uh, cooking. And I know that was, like, a part of your career, you know, prior, or your experience. So I always like to kind of make sure that I get everything in there. But yeah, we'll get there. Jason is someone that I guess I kind of met on the road in the world of DevOps Days and such. And luckily enough, he's in town for Datadog's Dash convention, which I went to this past week, and I have to say it was a good time, lots of great information. And we'll talk more about Datadog as we kind of get into the conversation. I want to start off kind of where I normally start off, but I want to preface this conversation: Jason and I are going to have more of a focused conversation on how people can get better information to help their on call. And I think that being part of Datadog, ultimately, that's the goal, right? To get better on-call information. Uh, we'll talk more about that. But first, let's talk about how you got started in technology. So tell me a little bit about, like, where did this path start for you?

Jason Yee: I mean, I guess if you trace it all the way back to when I was a kid, my dad had bought an Apple IIe, knowing, or having some inkling, that computers were the future. And, you know, back then you'd get these floppy disks, and you'd load them up and you'd play games. But one of the cool things that he did is he got me this magazine when I was a kid, and in the back was always a couple of BASIC programs. And because it was a printed magazine, it literally was just printed text, right?

Jay Gordon: BASIC magazine or something like that? Like, I remember there were a few of those. I'm sorry for interrupting, but I remember, like, back in, I guess it was the eighties, there were magazines that came with, like, just code in the back, and you copied it yourself. Yeah. And then I also remember books, like big, thick books, or mostly binders, of just code that you literally had to sit there and copy and type.

Jason Yee: Which is great, though, right? Because that's, you know, the

Jay Gordon: first stage of open source. Yeah, it was all about muscle memory and being able to remember the way the commands were formatted and how you put everything together.

Jason Yee: But it was also, because it was that early open source, there was the notion of, I don't know what this command means, especially as you were starting to render graphics onto a screen. You know, it was easy to wrap your mind around if/else statements and logic. Sure, but when it came to rendering graphics, most of that wasn't explained. And so you ended up just going in there, and you're like, well, what if I just changed these values to something completely random? Let me make it bigger. Let me make it smaller.

Jay Gordon: Let me turn it blue instead of white. Yeah, those first early days really got, I think, people saying, I can do this. And I think before the days of even, like, how do I get something on the web, it was, how do I at least change that color from white to blue, or green to yellow, or something?

Jason Yee: And when it's a value that doesn't say whether it's blue or green or whatever, you just start tweaking these numbers like, Okay, it's getting a little more red when you go the other way.

Jay Gordon: So let's fast forward a little. You go to school, and what was your focus when you went to school?

Jason Yee: Yeah. So I started out in university, went to Purdue, thinking I'd be an engineer, a mechanical engineer. Like most Asian kids growing up, you're good at math and science, and they're like, you should be an engineer or you should be a doctor. So I went the engineering route. But all along my path, I had always been building websites and writing these basic applications for friends, a lot of web applications, and in those days it was just, you know, Perl, pretty much. And I realized that I was cutting classes to, you know, program things and build websites and do things for friends who were having little side jobs and things. And so at that point it was like, well, why should I be paying for a college education that I'm not using? At the very least, I should change majors. And, you know, it was the early days of JavaScript, and DHTML is what we called it back then, and it was exciting and it was interesting. So I thought I'd go down that route, changed my major into design, and ended up graduating in graphic design right as the first dot-com crash hit.

Jay Gordon: Sure, that's about when I started to have to find a real job.

Jason Yee: Yeah, and designing things, pushing pixels around, at that point was not considered a real job. So I'm doing these interviews, and people were like, oh, you can program. We need, you know, back-end developers much more than we need designers. And so I decided not to fight that and just ended up going for it and becoming a developer primarily. But again, as you know, back in the day, developers were full stack in the sense that, like, we built our own servers, you installed Apache or nginx on it just to get a website up and running. There was no just, you know, oh, I'm a Python programmer building a website, or PHP. It was literally, like, soup to nuts. Everything

Jay Gordon: together. There was the "I have to figure out how to make the...", you know, "I have to scan the picture that I want." So there was working with a scanner to get the photos or the different elements that you needed, because it was very, very hard to go online and rip a picture off from somebody else. And then you had to add all those elements and understand how they would fit together. And we're talking about before there were ways of being able to test your code before you deployed it. So yeah, and hell, deploy was just like, let's FTP and upload. So you start moving a little further in your career. Eventually you go into evangelism and advocacy. After years of being a developer and then working at O'Reilly, you start really helping people learn more about the technology that they work with. Um, part of that has been seeing people recover from failure a lot, I would imagine, especially where you are at Datadog right now. So, um, let's talk about a little thing called failure, and part of that is me reminding our listeners that we have rules to this podcast. And like we mentioned in the beginning, Jason isn't necessarily going to be giving us a particular on-call nightmare. He's going to talk about on-call nightmares in general, which is something that his company over at Datadog works to help reduce. But the rules are still the rules, and I still need to tell them because it's part of this silly podcast. And here's how it goes. One: don't incriminate yourself, because nobody wants you to get in trouble. Two: don't incriminate others, because we always make sure that we're blameless in how we do these retrospectives. And then: help us learn, because this podcast is in general a retrospective. So with that being said, in your time working at Datadog, what has been the ultimate nightmare you'd see from most people who have on-call issues? Like, when they come and say to you, Jason, you know, I saw your talk today, how do we get out of this mess?

Jason Yee: Oh, that's a great question. Um, I have to think a little bit on that. Uh, you know, largely when people come to me, it's after that's happened, and so it really is setting up for that next outage that they think they're gonna have. They've sort of done a little bit of a retrospective in their head, often not a formalized one. And then they come to me and they say, here are the things that we didn't have that we wish we had. How do we get that?

Jay Gordon: Do you know what the most common "wish we had" is?

Jason Yee: A lot of times it's simply the ability to go from an alert that says some metric is out of bounds to figuring out, you know, and it's the common thing whenever you're troubleshooting an issue, how do you find what caused that? And oftentimes that comes down to trying to correlate what's changed in the system in a rapid way. There's always, you know, this divide, and we talk about it in the DevOps world a lot, between devs making changes and ops trying to keep things reliable, and there's still often this big disconnect: something's changed, and I can't figure out what the change was. It was probably a code change more than a config change in the system that we own. It was probably something on the dev side, but we can't see that. How can we see that? How can we get metrics around what they've been doing and what changes they've made in the system, to help understand how it's impacting our systems?

Jay Gordon: So ultimately, the big starter is, how do you correlate my problems? I can see that, because ultimately we need to kind of find out when it started and when it finished, if we don't actually know what caused it or what was part of a cascading failure. Yeah, so let's move a little more into the process. You've talked to this person and they said to you, you know, Jason, um, I don't know what to measure. And that's a common issue that I've found, that people don't necessarily know what to measure, so they go to a template. And, um, how do you really look at avoiding templatizing the data that you need, and being able to measure things properly? Like, how do you get that process started?

Jason Yee: Yeah, I mean, to be fair, templates are not bad by any means. We have customers that I think are leading the way in doing things really, really well. And one of the things that a lot of them have been doing is starting off with templates, building platforms where, when a developer creates a service, out of the box they get dashboards. And, you know, you do mention templates: templates can be awful. They can be things that do not matter at all. But the good ones are traditionally measuring things around what's been called the four golden signals. And for the listeners, the four golden signals are latency, errors, traffic, and saturation. So I like to say "LETS" to help me remember; it's a good acronym. Latency being, you know, what's the response time? How are users actually experiencing this? Errors being not just error rates, but what sort of errors are we getting? And that could be either from your logging system or captured as metrics. Ideally both would be good, so you can capture not only the rates, but what they are. Traffic is a big one. Traffic oftentimes is super useful in figuring out cascading failures, right? As you see surges in traffic, where something dies and that cascades over and now it's hammering another system, that's often a place where cascading failures start. And then similarly saturation: what do your resources look like? So, um, again, that could be on the system side, CPU, memory, I/O. But on the service side, you know, it could be just about the same. How much am I reaching out to my service dependencies? How much are things reaching out to mine?
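
As a rough illustration of the four golden signals Jason lists, here is a minimal Python sketch that derives latency, errors, traffic, and saturation from a window of request records. The `Request` type, the `golden_signals` function, the choice of p95 for latency, and the use of CPU as the saturation measure are all assumptions made for this example; they are not taken from the conversation or from any Datadog API.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    duration_ms: float
    status: int


def golden_signals(requests, window_s, cpu_used, cpu_total):
    """Summarize a window of requests as latency, errors, traffic, saturation."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Latency: nearest-rank 95th percentile of response time.
    rank = max(1, ceil(0.95 * len(durations)))
    # Errors: share of requests that ended in a server error (5xx).
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "error_rate": errors / len(requests),
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_s,
        # Saturation: fraction of an assumed resource (CPU here) in use.
        "saturation": cpu_used / cpu_total,
    }


# Twenty requests over ten seconds, two of which failed.
sample = [Request(float(i), 500 if i > 18 else 200) for i in range(1, 21)]
signals = golden_signals(sample, window_s=10, cpu_used=3, cpu_total=4)
```

A service-side view of saturation, as Jason suggests, could swap the CPU numbers for something like in-flight calls to a dependency against its connection limit.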

Jay Gordon: Yeah, I heard a common theme, and I'm hearing it more and more lately. It's about browser-based latency and render time, and how we're not just thinking about infrastructure, but the actual time to build things in the browser and how applications respond to that, and how that's becoming, actually, I think, a greater point of monitoring. Now it's less about how much memory utilization there is. It's less about I/O and IOPS and stuff. And it seems more like, this is how long it took for the browser to do X for this user. And that happens in mobile, that happens in, you know, standard web. And so I'm wondering, you know, there's a lot of numbers coming at an operator, a DevOps team. I don't want to call a person "DevOps"; that's not the right word. Anyhow, how do you begin giving them the right measures out of all that data to pay attention to? Because I've felt that memory, things like these, they're supposed to be used. So how do you get someone past metric overload?

Jason Yee: Yeah. I mean, if I understand the question, it's essentially, now that we're monitoring end to end, yeah, there is an overwhelming amount of metrics, and it does become, what should I actually look at? You know, the common advice that I give everyone is measure everything. As much as there's a metric overload, our systems these days are huge. They're built to handle lots of data. At Datadog, one of the ways that we've designed the system is that we'd like you to send everything that you've got, because at some point you'll need it. And we've done a lot around what we call metrics and logs and traces without limits, and it's a fun marketing phrase, but what it really means for us is, send us everything and then only index what you need. That allows you to have everything. So now you're measuring everything, but then only alert on what matters. And when it comes down to end to end, really, all of your alerts should be on that front-end side, not necessarily the code of the applications there, but thinking of the front end as the end user and how they're experiencing it.

Jay Gordon: Yeah, less about how much memory you're using on a server, and more about how the server responds when the person wants to add an item to the cart?

Jason Yee: Yeah, if I have a ton of customers and all of my systems are maxed out and they look like they're gonna burn down, but all of my customers are happy and I can keep adding more customers, then let the servers burn.
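
The "measure everything, alert only on what matters" idea from this exchange can be sketched in a few lines of Python. Everything here is hypothetical: the metric names, the thresholds, and the `breached_alerts` helper are invented for illustration and do not describe how Datadog monitors are configured.

```python
def breached_alerts(metrics, thresholds):
    """Return the names of alertable metrics whose value exceeds its limit.

    `metrics` holds everything that was collected; `thresholds` lists only
    the user-facing metrics that are allowed to page someone.
    """
    return [
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    ]


# Everything is collected, including a CPU reading that looks alarming...
collected = {
    "cpu_percent": 99.0,
    "checkout_latency_p95_ms": 180.0,
    "checkout_error_rate": 0.001,
}
# ...but only what the end user experiences is wired to an alert.
user_facing = {
    "checkout_latency_p95_ms": 500.0,
    "checkout_error_rate": 0.01,
}

quiet = breached_alerts(collected, user_facing)
noisy = breached_alerts({**collected, "checkout_latency_p95_ms": 900.0}, user_facing)
```

In the first call the maxed-out CPU pages nobody because customers are unaffected; in the second, slow checkouts do trigger an alert, and the CPU reading is still there for the investigation.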

Jay Gordon: The fallacy of the 100%, if you will. And I've found that to be the case now with SRE. And that was another huge, huge thing, kind of a theme in the talks yesterday and the day before: SRE practice and methodology is being built into more and more tools to help you actually sort through the noise. And that's the big thing that I've found, you know, setting all those service level indicators and determining the outage allowances that you can have in your budget. That seems to be part of the process, and more tools are saying, we know this exists, SLOs are a thing, SLIs are a thing, we're gonna build this into the tool. I think there's actually a tool now, Blameless, that literally is just an SRE kind of tool for having all of those. It's not like patching it into your existing Jira or something like that. And so I like that people are starting to think, how do we use the big concepts to take those metrics and make them more useful?

Jason Yee: Well, the thing that I love about setting SLIs and SLOs is it gives us a better definition, which eventually just gives us more freedom, right? Because for engineers in the past it was like, whatever went down, and it didn't matter if you went down for five minutes out of the year and you had what would now be an amazing, you know, SLO. It was like, you went down, you should feel bad, right? Yeah. And now that we have this idea of service level objectives and error budgets, it's like, okay, it's more of, like, I'm spending a little money, right? It's okay to spend money. But if you're blowing your paycheck every week, that's awful.

Jay Gordon: So don't go through your error budget and not be prepared to say, you know what, that's all the outage for the month, I don't know how I'm gonna be able to ship anything new that may have an error or may cause a potential outage.

Jason Yee: But also, don't feel bad if you go down for a little bit of time. You know, you get things back up. It's okay. Don't feel bad about that. It's just part of life. Failures happen.
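
To put numbers on the error-budget idea discussed above, here is a small Python sketch. The function names and the 30-day window are assumptions chosen for the example; the arithmetic is simply the window length multiplied by the fraction of time the SLO allows to be unavailable.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of downtime an availability SLO permits over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Budget left after the downtime already 'spent' in this window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def availability(downtime_minutes, window_days):
    """Achieved availability given total downtime over the window."""
    return 1 - downtime_minutes / (window_days * 24 * 60)


# A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
monthly_budget = error_budget_minutes(0.999)
# After a 10-minute outage there is still budget left to spend on shipping.
left = budget_remaining(0.999, downtime_minutes=10)
# Jason's "five minutes out of the year" works out to better than five nines.
five_minutes_a_year = availability(5, window_days=365)
```

Read this way, an outage is a withdrawal from the budget: small ones are expected, and the signal to slow down is the remaining balance approaching zero.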

Jay Gordon: Yeah, I think that's one of the things that has been one of the more positive parts of technology in the last five years: the acceptance and embracing of failure. In the beginning of the DevOps period, and it's been about 10 years now of the DevOps period, as I want to call it, we started reminding people, you know, you're going to break things here and there. It's gonna happen. But I feel like it took a little while to remind people that it's okay, it's gonna happen. And it wasn't until people started creating tools that actually generated that failure on a randomized basis. Like, you know, sure, we had Apache Benchmark and really, really basic tools to test latency. But now we have tools that can specifically, you know, attack memory, or attack the actual packet rate that's going in and out, and that to me is giving us far greater ability. Yeah, and so here's the way I'm thinking about it, and let me ask you this question. I led up to it a lot with all this, and I apologize. We're at a point right now where measurement based on clicks is so important, you know, how long it takes. How do you believe people make that determination within the organization that this is acceptable, but this is unacceptable? Like, where do SLIs and SLOs start really happening organically? When do those things start? How do you start determining that these are the metrics we're gonna go by?

Jason Yee: Yeah. I mean, that's a conversation that every org needs to have between business and engineering. Sure. Obviously there's a push and pull, where business would say we should never go down, we should have 100% uptime, that's our SLO. But it really comes down to, I think, numbers: figuring out what's the comfortability, doing that calculation of how much money we are losing in downtime. Or when it comes to setting SLAs with your customers, if there's the idea of having chargebacks, where you're gonna have to pay your customers back, what's the acceptability there?

Jay Gordon: Tammy Butow came up with a really good formula on Tuesday. Do you remember what the formula was?

Jason Yee: Yeah. So the formula is essentially, she had a bunch of letters, and I'm going to get the letters wrong. But essentially the formula is: how much money am I losing in downtime, like lost opportunity, adding that to how much money am I losing due to engineers not working on other things, right? So the time lost. Um, and then reputation along with that, you know, your brand. And then one of the interesting ones that I hadn't thought of was engineer attrition. The more that you have downtime, and the more that it's just an awful place to work because you're constantly firefighting, you're not gonna be able to hire new engineers, and your workforce is gonna get depleted. And so those were all the things that essentially added up. And, you know, a lot of those are fuzzy numbers, and you'll have to think about that within your organization.
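
Jason notes he does not remember the exact letters, so the following Python sketch is not Tammy Butow's formula as presented. It only adds up the four components he names (lost revenue, lost engineering time, reputation, attrition), and every parameter name and figure is a hypothetical placeholder for the fuzzy numbers an organization would have to agree on.

```python
def downtime_cost(
    outage_hours,
    revenue_per_hour,
    engineers_responding,
    engineer_hourly_cost,
    reputation_cost,
    attrition_cost,
):
    """Add up the cost components of an outage described in the conversation."""
    # Lost opportunity: revenue that did not come in while the service was down.
    lost_revenue = outage_hours * revenue_per_hour
    # Time lost: engineers firefighting instead of doing planned work.
    lost_engineering = outage_hours * engineers_responding * engineer_hourly_cost
    # Reputation and attrition are fuzzy estimates supplied by the business.
    return lost_revenue + lost_engineering + reputation_cost + attrition_cost


# Illustrative figures only: a two-hour outage handled by five engineers.
estimate = downtime_cost(
    outage_hours=2,
    revenue_per_hour=10_000,
    engineers_responding=5,
    engineer_hourly_cost=100,
    reputation_cost=5_000,
    attrition_cost=2_000,
)
```

Even with rough inputs, a single agreed-upon figure like this gives business and engineering something concrete to weigh against the cost of reliability work.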

Jay Gordon: So what's the business value behind shipping X number of features per month as opposed to keeping the service always available and intact? And I suppose that's where you start really having those conversations about where the business value is to one part of the company as opposed to the other. You know, if the product team says, well, we need to get these features shipped, it's gotta be a concern for the engineering side to say, well, you know, we're having spotty outages here. How do we spend time getting your new features out if we can't get these outages fixed? So I can see where you say, like, there's different values to different parts of the business. I guess it's just about making a conversation happen internally.

Jason Yee: That's really what it comes down to: getting that conversation started, actually discussing with business people in your organization, with leadership in your organization. And all of this is around, you know, you mentioned Tammy. It's because on Tuesday Tammy and I led a chaos workshop at Dash, and with chaos it always comes down to people being like, I love this, I'd love to do chaos engineering, but there's no way leadership is gonna let me intentionally kill our stuff. How do I start that? And the number one advice is always, well, start the conversation, right? Like, you have to start somewhere. But having these numbers, trying to come up with these numbers, and again, they're fuzzy numbers. But if you can get business to agree on what those numbers would be, you at least now have a formula to add things up and figure out how much things will cost. Yeah, I

Jay Gordon: like this idea behind chaos engineering as a practice, and I like to jokingly say that it's about making sure you know what you don't know, because you don't know that you don't know. And I know it's a really weird way of putting it, but literally, it's about finding the things you don't know exist. And I saw a correlation between data and failure and how the two actually really work together to be able to provide, and I don't like to say a root cause, and we'll get into that in a second, because root cause to me is still a term that I find a little fuzzy. I know it's controversial now, which is weird. But, you know, once we start finding what is, I guess, the trigger of the cascading failure, we all want to look more into it, look deeper. So let's talk about this term, root cause. I'm gonna bring it up just shortly because I know it came up a bunch yesterday. Uh, tell me what your feelings are. Is root cause a true term that we should be using now? Or do you think that there's something that better identifies these problems?

Jason Yee: Yeah. So, um, I'm gonna put out a little bit of a maybe, I'm-on-the-fence kind of answer. A lot of times when we talk about root cause within the sense of a postmortem or a retrospective, we're thinking in larger DevOps terms of technology and people. Uh, so in that sense, there's rarely ever a singular root cause, because people are complex. And, you know, as John Allspaw always says, nobody goes to work with the intention of killing everything and crashing all the systems, except for maybe disgruntled employees. But the idea is that everybody goes to work with the best of intentions, trying to do the best job that they can. And that's on the people side. That said, on the technology side, I think that there often are singular root causes. We can track things back to them. For example, since this is, you know, On-Call Nightmares and people are sharing their own personal stories, here's one of mine. I set this up ages ago, and I will gladly take the blame for this, because this was, I think, my first job out of college. I was setting up a file server. It was actually a Linux box that was connected to two SANs, huge RAID arrays. But it was a Windows shop, so all of those RAID arrays were formatted at the time in, um, FAT32. Okay, because, you know, this was ages and ages ago. And so it was cool, it was up and running. People were putting their files on there, and everyone was happy that they could connect, you know, this network drive to their Windows machine, and we were saving everything there, and I had massively increased the amount of storage that this nonprofit could use. Um, I didn't do any sort of shutdown attacks, like with chaos engineering. I had actually just set this thing up, and it was up and running, and had been up and running for about a week. Sure. And then at one point, it rebooted. The ultimate test.
Yeah, and I actually hadn't realized that at some point those volumes had gotten added to the fstab in Linux, and Linux was like, oh, yeah, I don't know what these are. There's something very wrong with your volumes here. I was like, cool. Uh, and it was like, it's not going to start. So what do you do? Well, you run a file system check on it, right? fsck, which everyone jokes about because it's one letter off from being something bad. And that something bad did happen, because it now thinks that, okay, these were, at the time I think it was ext2 or 3. And so it just starts rewriting bits, and about 10 minutes into this I'm like, this is taking a long time. Something is very wrong. Stop. It turns out that it's corrupted all the info.

Jay Gordon: You find a lot in lost+found and start crying. Yeah.

Jason Yee: A lot in lost+found that is unrecoverable. So about half of the data that people had decided to store on this file system was now dead. We went back, and a lot of it we rebuilt from people's laptops, because they had various versions and copies locally. But yeah, that was an awful weekend for me, particularly because, you know, this happened right as a weekend was starting. It rebooted, and I spent a weekend in the server room thinking that I was about to be fired, not knowing how to fix it. Thankfully, the organization was really blameless. My boss was like, stuff happens. This is new to us. Nobody's ever done this at the org. We've never had, you know, network-attached storage. So, you know, we'll deal with it.

Jay Gordon: A learning experience like any other. Yeah. So we're gonna start winding down a little. You know, those of you who are listening, you might be getting to your subway stop if you're like me and you're a New Yorker, or if you're driving in your car, you might be getting ready to get off at your exit soon to go to whatever place of business you may be going to. So I want to start wrapping things up with Jason. And, um, I want to ask, what is a big piece of advice you could give someone who's building an on-call team right now? Oh,

Jason Yee: man, so much advice, but just one piece. I

Jay Gordon: want one good piece.

Jason Yee: If you're building that team and you're hiring people, hire not for culture fit but for culture extension, right? Don't hire people that are exactly like you. Hire people that are different from you, but similar in their approach: who are blameless, who are forgiving, who are humble but still excited to learn. Because that's really the job, trying to stay humble and always trying to learn.

Jay Gordon: Very awesome. So tell me a bit, for a second, about what Datadog's got new and what you're doing lately.

Jason Yee: Oh, man, we announced so many new things yesterday.

Jay Gordon: I know, a lot of new, like, APM products.

Jason Yee: Yes. So we've got, well, I mentioned earlier everything without limits. That used to be just on our log side, where we ingest all of your logs, we format it for you in a nice JSON structure, and then you can tell us what to index or not, and you only get charged for what you index. Now we've done that to metrics and traces as well. So all of your custom metrics you can send in, and then at any point you can turn off the indexing and we don't charge for it. And then you can turn it back on when you think you want to investigate something. So that's a big thing for us. Another huge one is that we just rolled out a real user monitoring, um, function, or feature. So we added synthetics earlier, which is essentially, I like to call it robots: you can get robots to simulate human actions that check what a user would experience. But now we can actually start to measure what your actual users are experiencing. You can send that information in, and the nice thing is, now you can also correlate that with everything.

Jay Gordon: So browser-based stuff, like we were talking about earlier, the browser. Very, very cool. So thanks again, Jason, for being here. If you'd like to find Jason on Twitter, it's pretty easy, it's gitbisect. I'll make sure that's in the notes. Anything else you kind of wanna say goodbye with?

Jason Yee: No, just, everyone who's out there, you know, like I was mentioning before, stay humble, stay eager to learn. It's the best way to get through this industry and improve your career and have a great time while you're doing it.

Jay Gordon: Oh, yeah, and measure everything

Jason Yee: and measure everything.

Jay Gordon: Great. We'll be right back to wrap up the podcast. Thank you so much, Jason. Thanks. That's it for this week. Thanks so much, Jason. Always love talking with you. Really appreciated your time. You can find Jason on Twitter at gitbisect. And obviously, at any of those big conferences where you see Datadog, you may see Jason. So stop by, say hello, and thank him for all the cool stories that he told on today's episode. So, other than that, just a reminder: if you'd like to be on a future podcast episode, just send that email to oncallnightmares at gmail dot com, or send a tweet at OnCallNightmare or at jaydestro. So remember, next week I hope to bring you another conversation with a technologist who's spent time on call. I'll see ya.