R for the Rest of Us Podcast Episode 22: Alex Gold

Transcript

[00:00:00] David Keyes: Well, I'm delighted to be joined today by Alex Gold. Alex leads the solution engineering and support teams at Posit.

Outside of work, Alex is a lifelong martial arts enthusiast who also loves doing handstands and is very into home improvement. Alex, thanks for joining me today.

[00:00:42] Alex Gold: It's great to be here, David. Thanks for, uh, thanks for having me.

[00:00:45] David Keyes: Yeah, so maybe you could start out by just telling me a little bit about kind of how you got into the position that you're at now.

We're going to talk about your book, DevOps for Data Science, but there are two parts of that, right? There's the DevOps and the data science. I know you were telling me before we got on that you did the data science thing before you did the DevOps thing. So can you kind of tell us a bit about your background?

[00:01:08] Alex Gold: Yeah, very much so. So my background: as an undergrad, I did math and econ. Like so many data science people, right, I had a social science background. And my first few jobs out of college, I was doing sort of DC think tank kind of jobs, uh, doing some, like, econometrics work and some just general policy work that had nothing to do with statistics or programming.

But I did some programming and some statistics, and then I started and dropped out of an econ PhD program, did some more time in the think tank world, then moved into sort of data science proper, and was an R programmer, and led a data science team in a few different fields, areas, working on political campaigns, healthcare, federal consulting, those kinds of things.

And then about five and a half years ago, I came to Posit as a solutions engineer. And solutions engineering at Posit really is this place where, like, DevOps and data science get smushed together, because we're helping Posit's customers figure out how to get Posit's tools, you know, Workbench, Connect, those kind of things, installed, deployed, configured, and usable for data scientists. How to create a great environment for data scientists. So that was, you know, what I did all day, every day. And then after a little while of that, I moved into management. So now I lead the team.

And so my day to day really is not the topic of the book. The book started when I was still an individual contributor. And it's been really fun, as I've transitioned out of doing that work day to day, to get to keep a foot in it by working on the book. So that's sort of my story.

[00:02:36] David Keyes: Cool. That's helpful. So as I mentioned, the book is called DevOps for Data Science, um, published with CRC Press. Maybe just starting out: for me, before, you know, I looked at the book, and I think probably for a lot of people who are in kind of more traditional data science roles, they might not be super familiar with what DevOps means.

So how do you describe it? And I guess specifically, how do you think about it in the context of data science work?

[00:03:04] Alex Gold: Yeah, DevOps is a super slippery term. Just to go back in history, right, uh, you go back to, like, the... at least the story is that back then everybody was doing waterfall software development.

I think that's pretty apocryphal, I don't think that was really the way of it, but, like, there was not a codified alternative that people would do. So if you were writing a piece of software, at least, again, apocryphal, but, like, the story goes, you would spend nine months gathering all the requirements for the piece of software.

And then you'd leave for two years and you'd write the software, and then you'd come back and, look, it was not exactly what people wanted. It would have missed some of the requirements. The requirements had moved in that interim. That's bad in a lot of ways. And so in the 90s, a bunch of people codified what's now the agile software movement, right?

And the idea there, and I think a lot of people in data science are familiar with this idea. A lot of data science teams, I think, aspire to an agile way of working: deliver small increments of work frequently. Do it a lot, right? Like, deliver a thing once a week, once a month, even once a quarter, right? Much faster.

And you get more feedback then from the customer. You understand better. Are you actually building something that's valuable? I will just put a plug in here that if you're on a data science team that doesn't work like this, this might be step one is just trying to sort of move your work in a more agile direction.

[00:04:17] David Keyes: Why though, let me interrupt you. Why do you think that's more valuable than working in a kind of more traditional, like, waterfally sort of way?

[00:04:26] Alex Gold: Yeah, I mean, it's really interesting. So agile came out in the nineties, it took the world sort of by storm, and all software now is agile, in that everybody says they're doing agile, whether they are or are not.

I think the main innovation of agile is that frequent feedback loop with the consumer of your software or your data science project, right? That it's not a, I'm going to talk to you once, try and gather everything from your mind, go away for a long time and come back. It's, we're having a conversation, and as I iterate, I'm gonna see, is this the next thing you want?

Is this the next thing you want? And I think the fundamental sort of observation is that priorities change, needs change, the world changes. And so the longer you, and this is very much a people thing, not a technical thing, but like, the longer you go off and work on something without checking if it's right, the sort of more right you have to have gotten that initial prompt and that's really hard to do.

And even if you get it perfect, maybe things change in the meantime. And so that's sort of the fundamental, like, motivation behind agile is that, you know, check in a lot with, with your, your stakeholders because things may change while you're not looking.

So if you're trying to deliver things in an agile way, though, there's a problem, which is you can start writing software in an agile way. So you're like, you know, if you're thinking about this from, like, a Git framework, right, you're doing, like, merge requests of modest size and you're delivering them frequently, you're checking them frequently.

But there's this whole other part of it, which is then like, how do you deliver the software? How do you get the software to your user? How do you test it? How do you make sure it's working properly? And not, you know, right? There's the piece of this that's like, how do you talk to the, to the, I say customer, but I mean it in sort of the broad sense of

[00:06:06] David Keyes: whoever,

[00:06:07] Alex Gold: whoever cares about what you're doing.

And so there's the question of how do you do that with them? But there's also the, like, how do you run your test suite? Whatever that is. Again, do you have a test suite? How do you make sure that the software is in some sense correct? And that's where DevOps comes in. So DevOps is this set of tools, practices, and systems to make sure that if you're building things in an agile way, you can deliver them in an agile way.

And then, you know, in many cases, it's going into a production system, and it has to stay up in an agile way. So it's sort of, once you've written a bunch of code, that code is not very valuable if it just sits on your computer. And so what do you have to do to put that actual thing... the code is not the thing.

It's a thing, right? For a data scientist, it's a report. It's a Shiny app. Right? Like, nobody wants to read your Shiny code. They want to actually have the app. And so how do you actually get the app in front of them? How do you make sure it is up? How do you make sure it stays up? How do you make sure that when there is an issue, you can diagnose it and fix it?

These are the kinds of things that DevOps concerns itself with. And so I think one of the observations I've had over my time at Posit is that obviously there's this huge world of DevOps for pure software engineering. The relationship between software engineering and data science is, like, I think of it as the relationship between architecture and archaeology.

Like, data science is very different than pure software engineering. And so there's a lot of, like, principles in DevOps that apply to data science, but the actual way you do it, the tools you use, and in particular because almost all data scientists are using R or Python, there was a lot of space to just be like, well, let's talk specifically about what you need as a data scientist to accomplish these sort of DevOps-y goals, uh, in your work life.

[00:07:54] David Keyes: So in the context of kind of, you know, as you described, like, pure software engineering, there's this idea of putting something in production. And in that context, it's a bit clearer. Like, you know, if you're producing a web app and people are working with it, then it needs to run on the website. Or say you're producing, like, an app that runs on someone's computer: you're, like, packaging up the app and people download it.

It's very clear kind of like what putting something into production in that context means. I think it's a little bit sometimes less clear for people who work in kind of data science, what it means to put something in production. And I'm actually going to read a little quote that I like from your book.

You said, "If you're a data scientist putting your work in front of someone else's eyes, you are in production. And, I believe, if you're in production, this book is for you." So can you talk about what you see? I mean, you have obviously a pretty broad definition of what it means to put something in production.

Can you talk about what it means in this context of people doing data science?

[00:08:54] Alex Gold: Yeah, I like what I wrote there. This piece to me is really, really critical, because I think so many data scientists work in an environment where they're like, oh, but this isn't real data.

Like, we're not Netflix. We're not putting a thousand models into production to orchestrate. Like, this isn't real. We're working from CSV files. Maybe on a good day, we have access to a database. And I think the reality, from my observation working at Posit and working with hundreds of people doing real data science, making a difference in the world in some way, is that, like, a lot of data science is not super technically sophisticated in terms of what production means, right?

A lot of people are working off of CSV files on their hard drive. A lot of people are just working on, maybe they have a Postgres database that they're pulling stuff from. Yes, there are organizations that are very sophisticated, you know, streaming data, event, Kafka-driven things. There are people who are using some of the incredible capabilities of, like, Spark to do really sophisticated, huge data problems and modeling across terabytes all up.

Like, yes, there are people who are doing that, but most people who are doing data science that matters are doing it on modestly sized data. Putting it into production means, like, writing a report that goes to another human or, you know, putting up a Shiny app, hosting a model, but, like, one, in most cases.

And so, like, I think there's a real temptation as a data scientist. I felt this a lot. I remember, shortly after I came to Posit, I had a coffee with Jenny Bryan, who many people who listen to this podcast will know, right, for setting their computers on fire, or at least threatening to. Um, and we had this conversation where it's like, yeah, everybody's just pulling stuff from Google Sheets.

Like, that's what they're doing. That's what data science is. And I think, in the five and a half years I've been at Posit, the world has matured a little. Like, most people now also have a database. That's how far we've gotten. It's like, five and a half years ago, it was like, yeah, like, 75 percent of people don't even have a database.

Now it's like 75 percent of people have a database, but they don't necessarily have a lot more than that. And I think, I think as a data scientist, it's really easy to underestimate the realness of the work that you're doing just because it's not super technically sophisticated. I feel really strongly that we need to sort of combat that sense that you have to be doing crazy streaming data. That that's the only thing that counts as production.

[00:11:17] David Keyes: So I'm thinking about, like, some clients that we've worked with. It's an organization called the Child Welfare Partnership. It's here in Portland. They do trainings across the state of Oregon for child welfare workers; think of, like, social workers, people who work in, like, the foster care system, that type of thing.

And they also have an evaluation team who conducts surveys of people who go through this. And one thing we helped them to do was put together a set of reports that, you know, they can like pull their data, which ironically does live in Google Sheets. Um, and it's not ironic, it's what we're doing. Yeah, yeah, exactly.

So they pull the data, they can now pull the data down whenever they want and generate reports, that they then share with the trainers or the kind of management staff. So in your mind, do you consider that kind of putting something into production, that type of workflow?

[00:12:08] Alex Gold: Yeah, absolutely. I mean, to me, I would say absolutely.

And I think this gets to my, maybe you'd call it a bias. Maybe you'd call it, like, because I'm old and now I'm in management or something. Like, I think a lot of young people in data science, and this is particularly, like, I had this experience when I was managing a data science team, that a lot of the junior people are really excited about the technical aspects of data science, and index really heavily on, am I doing something technically sophisticated?

And I understand why. If you're a young person, you're trying to learn a lot, like, it's exciting to do the technically sophisticated stuff. What really matters, you know, I'm an old person now, I manage a team, like, what matters is, are you creating value for your organization, for the world?

Is what you're doing valuable? And I think, very often, overly focusing on technical sophistication means that you're maybe not doing the most valuable thing. Uh, an example that I think of from my past was I was working on healthcare data. It was, um, in-hospital data.

And one of the things we were looking at were adverse events after childbirth. So postpartum hemorrhage is, like, a bad thing that can happen to a person who has just given birth. You want to avoid it or manage it if it occurs. And so we got really excited about having this data and we were like, oh, we could model postpartum hemorrhage and look at risks for that.

And we built a pretty decent model that gave some, you know, there was some predictive power in terms of who was going to have a postpartum hemorrhage, but there was absolutely no way to operationalize this model, right? There was no way to get this in front of nurses on the labor and delivery floor, right?

Like, there was no way to use this. And so, like, that was a case where, like, focusing on what we could do technically was a complete distraction from providing value, because there was just no path to value for this work, even if it was technically interesting and, like, cool that we could model this, right? And you can totally say, like, okay, maybe it's a good idea to take this on and show you can model it, then maybe you can build, right, like, there's a case here that, like, that's how you could build, uh, you know, a case to that. But, you know, I think also it's really important to be paying attention to that value.

And to me, just to, I sort of, to loop back to your question, right? Like, value and in production are, like, synonyms for me. Like, if you are providing value to somebody, you're in production, because you've now inherited a responsibility to do a good job at what you're doing.

And to me, that's what in production means, right? It's really about, does somebody care that I get this right? And if they care about getting it right, you're in production, and you need to start thinking about some of both the technical and non-technical systems to make sure that what you're doing is correct and stays correct and stays available to the people who want it, right?

That it's that sort of, like, contract you're making with your users: that they care about what you're doing. You have a responsibility to do it well, keep it available, and keep it correct.

[00:14:58] David: Hi, David here. Did you know that R for the Rest of Us does consulting work? We help organizations to communicate more effectively and efficiently with beautiful parameterized reports, interactive websites, and custom R packages. Learn more about how we can help your organization at rfortherestofus.com/consulting.

[00:15:20] David Keyes: So what are some of the main things then, you know, if DevOps can help you think about keeping something in production?

And I like how you talked about that being sort of synonymous with the idea of creating value. What are some things then that DevOps can help people do to ensure that things continue to be in production slash provide value.

[00:15:38] Alex Gold: Yeah, I think, depending on, like, what source you're talking about, right?

Like, DevOps has a bunch of, like, pillars that, you know, you can talk about. And, like, I will just editorialize here for a minute that, like, there are so many tools for DevOps that I think the term has really been co-opted by people trying to sell you their DevOps tool. And so it's a little hard to talk about what is DevOps.

Because so many people have tried to appropriate that term and be like, DevOps is when you're using this tool. And it's like, no, all of that is wrong, right? Like, no, it's a set of processes and conventions and, yes, tools. But it's much more a way of thinking about the world, I think, than it is any particular tool or set of tools.

And I think that's why it's so mushy, because both it is inherently not exactly one thing, it's whatever your organization needs, but also there are a lot of people trying to co-opt the term for their own purposes, and that makes it, the water's really muddy. But I do think there are a handful of things that are applicable.

And so just, just to talk for a moment about my book and how I think about this, right? Because like DevOps and like administration, server administration, Linux administration, tightly coupled in, in a lot of ways, right? And so in the book, there are sort of three sections. And the first section is really about like, what can you do as a data scientist to help make your stuff DevOps ready or, you know, production ready, production grade, right?

The second section of the book is about, like, okay, you have regrettably been pressed into service as a server admin. How do you do that? Cause that happens to most data scientists, I think, at some point, that it's like, oh, you have to administer the server now. Okay. Have fun, right? Be scared. So that's sort of what the second section of the book is about.

And then the third section of the book is, you're a data scientist. You work at an organization that has a sophisticated IT administrative function. How do you communicate with those people? And to me, I really enjoyed writing sections one and three of the book. Those are things that I like. I'm like, yes, these are great.

These are good things for data scientists to know. The middle section of the book was, like, really obligatory. And I think it's a good overview of all these topics. I wish I had had it a few years ago when I was first learning about these things. But it's the section that, like, I wish I didn't have to write this section, because I just wish everybody could, like, do their part, and then could, like, communicate to the IT admins.

Although, on the other hand, I will say, if you find you like that middle bit, that's where, like, most of the people on my team at Posit have come from, is they're folks who have taken on that server administrative task, including myself, and been like, oh, that was kind of fun. I kind of like that. So there's a little plug there.

But in terms of what you can learn as a data scientist, right? I think there are a bunch of just fundamental principles here, which is like, capture your environments as code. Think about the architecture of your project. Include logging and monitoring as a standard part of what you're doing, and think about security up front, and then think about how you're going to deploy this thing the first time and over time from the very beginning.

And then there's a bunch of things you can learn and this gets pretty nitty gritty. Uh, but like, then how do you think about communicating your needs to an IT admin? So to me, like, there's a bunch of things that people can adopt, best practices they can adopt to make their data science work more ready for production.

That's, like, really what the first section of the book is about: like, you code all day in R or Python. You're writing a Shiny app, you're creating a Quarto document, you're writing an API or building a model. What should you be thinking about ahead, so that when you then go to production, you're not like, oh, I did not think about that part of it?
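One of the principles listed above, including logging as a standard part of the work, can be sketched in a few lines of Python (one of the two languages named in the conversation). This is only an illustrative sketch, not code from the book: the function name summarize_scores and the logger name report_job are hypothetical, and it uses nothing beyond Python's standard logging and statistics modules.

```python
import logging
from statistics import mean

# Hypothetical logger name for an illustrative reporting job.
logger = logging.getLogger("report_job")


def summarize_scores(scores):
    """Return the mean of the numeric scores, logging what happened along the way."""
    logger.info("received %d rows", len(scores))
    clean = [s for s in scores if isinstance(s, (int, float))]
    dropped = len(scores) - len(clean)
    if dropped:
        # A warning leaves a trail to follow when a report looks wrong later.
        logger.warning("dropped %d non-numeric rows", dropped)
    if not clean:
        logger.error("no usable rows; cannot compute a summary")
        raise ValueError("no numeric scores to summarize")
    result = mean(clean)
    logger.info("mean score is %.2f", result)
    return result


if __name__ == "__main__":
    # Send log records to the console with a timestamp and severity level.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    summarize_scores([4, 5, "n/a", 3])
```

The point of the sketch is that the log lines are written as part of the normal work, so when a report looks wrong weeks later there is already a record of how many rows arrived and how many were dropped.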

[00:19:12] David Keyes: Yeah. So let me take the first one you mentioned. So think of your environment as code. So, tell me if I'm wrong, but I'm guessing what you're talking about is, like, the classic scenario. Say I work in RStudio Desktop, um, and I work locally. I've got versions of packages that, you know, are based on whatever the last time I ran install.packages was.

I write some code, works great for me, I pass it off to somebody else, they're like, hey, this isn't working the same way on my system. And so, you talk about thinking about environment as code in that scenario. And I'm not even talking about, I mean, that example I just gave is not even, like, kind of production, in the sense that it's just like, let me share it with my coworker. But how can environments as code help me think about that problem?

And then building on that, how does it help to think about kind of when I want to deploy this in some way, ensuring that that will work?

[00:20:08] Alex Gold: Yeah, David, I think that's like, that's a great example. We've all been there. And like, sometimes your co worker is just future you. Cause like, past you does not answer the phone when you're like, why the hell doesn't this run?

I don't know what's going on. Yeah. So, I mean, the fundamental idea, right, is, and I think most people who spend a lot of time coding in R are pretty comfortable with the idea that, like, you write code because it encapsulates your ideas in a rigorous way that makes them repeatable, right?

Like, there's nothing fundamentally wrong with, instead of coding in R, you could just click around in Excel. The biggest difference is the way coding makes things concrete, specific, and repeatable, right? And so, like, environments as code is taking that one layer out, in that you think about the kitchen that you're writing in. I use a kitchen analogy here often, which is, like, you know, if you think of your code as a recipe for what you want to cook, right?

Your data is like fresh produce you're bringing in, right? It's like, this is the good stuff. This is fresh, right? And then there's this question of, like, where do you do all that work? Do you have the right pantry ingredients? Do you have a knife? Is that knife sharp? And so, you know, you can, like, again, to just belabor the kitchen example, right?

Like, I think many people have had the experience of going to an Airbnb and discovering that the kitchen is disappointingly, uh, appointed, right? And so, like, that experience of, like, I love cooking this recipe at home, but I'm now at this Airbnb and I, like, can't roast the carrots because the roasting pan is one of those, like, stupid little tiny ones.

And, like, sure. Yeah. Right. Like, that's sort of what we're talking about. And so the idea of environments as code is, in addition to having your data and your code, which specifies your recipe, like, you actually encapsulate in some meaningful way what is the environment where all this cooking happens, right?

If you've just captured the data and the code, you actually don't have enough to reliably reproduce what you've done, right? That's part of it, but only part of it. And I think pieces to think about here is your environment where you're going to do your data science, where you're going to cook, has a bunch of layers to it.

There's the layer of your R and Python packages, right? Like, do you have the right version of dplyr in your environment? Right? Like you said, if you're just installing things locally, every time you run install.packages in any project on your system, you're updating dplyr for every project on your entire system. You come back a year later or you try and share it with somebody else. And it's like, well, that used to run, but, like, you know, they deprecated a function. That happens, right? And you can be in real trouble, but there are also things below that layer.

And this is where it starts to get into, okay, maybe those things aren't your job, but they are kind of your, if they don't work, it's your problem.

[00:22:53] David Keyes: Yeah.

[00:22:53] Alex Gold: And so you have to start thinking about things like system dependencies and system libraries that you might rely on. You'll start thinking about the operating system in certain contexts.

You don't have to think about the literal hardware. And so understanding those layers is important. For me, as a data scientist, the piece that you really should be responsible for is capturing a package environment. Like, if you're creating a Shiny app or a report in Quarto, like, you really are responsible for capturing the set of packages that you need to run that report again. Nobody else can do that for you. I think you could argue, like, other people, maybe, you know, if you're in an organization that has, like, a server environment, probably somebody else should deal with, like, the operating system and some of those things, but, like, you need to deal with your packages.

And so understanding how a tool like renv works to capture what is the environment of this project and to keep it safe and separate on your system from the other things you're doing, like, that to me is one of the very most basic things you can do as a data scientist to sort of production-grade your work.

[00:23:57] David Keyes: So just to go back, and we'll have you show in just a minute, um, an example of the renv package. The idea there, and tell me if I'm capturing this accurately, but the idea, I think, is you want to make sure that your packages work so, you know, it's not just like, I run code, it works on my system; it's, when I pass it to somebody else, it's going to work.

The kitchen metaphor there is, with using a tool like renv, which basically ensures that everybody is using the same set of packages, it would be kind of having, you know, to make your roasted carrots, well, you don't just need, like, the carrots and, you know, salt and pepper, and oil. You also need to ensure that you have the right, um, roasting pans or whatever.

And so renv is basically like putting all of the tools that you need into, like, a big, you know, container you get at the Container Store or wherever, and you carry it with you to the, you know, Airbnb that you go to, to ensure that it's going to work. Is that...?

[00:24:59] Alex Gold: Yeah, that's exactly right.

And there's actually a subtle distinction here that I think is really important, which is that it's not actually the things. It is not the packages. It's the list of packages. And this actually is a really important distinction, and it makes it better, right? It makes it better because what renv provides is just the right amount of information so that somebody else, some other system, whether it's you in the future, whether it's somebody else on their system, whether it's on a server, can reconstitute that package environment so that it works for them.

So, like, here's an example, right? Let's say you, uh, I like this roasted carrots example, um, let's say you are going to go to Europe, right? And they have the different shaped plugs, and your roasted carrots are done in a toaster oven, right? It'd be really bad if you sent them a toaster oven that had the American plug, and then they couldn't use it, right?

Instead, if you said, oh, you need a toaster oven that goes up to, you know, 450 degrees, that's more what renv does, because you actually don't, right? Like, and people... I'm belaboring this point because people sometimes make this mistake: they try and actually share their packages, and that's actually a bad idea, because if you are using a different operating system, if you end up using a different version of R, if you are, for example, going from desktop to server, you're going to run into all kinds of messes trying to literally share the packages.

Instead, what you want to share is, like, a manifest of, like, this is exactly what you need. And then let them rebuild it on their side in exactly the right way. And so the distinction between it being the actual objects versus the list actually is meaningful.

[00:26:40] David Keyes: Yeah, that makes sense. So maybe it is, like you said, you know, a very detailed list of all the tools you need to make your carrots versus...

[00:26:52] Alex Gold: And again, not to mention, right, like, carrying around a toaster oven is a pain in the ass. Like, carrying around all of your packages, like, that's a lot of size, right? Like, it's, like, megabytes, perhaps even gigabytes of data. Sending a manifest is, you know, it's a few kilobytes worth of text. And so for portability, and again, like, if you're going on to the next phase of these things, right?

Using tools like Git, GitHub Actions, these kinds of things, right? Like, sending a manifest works really nicely because you can have those tools rebuild the package environment for you based on the manifest, as opposed to having to, like, go to Git LFS to share an entire package environment, right?

[00:27:31] David Keyes: Right, right. Well, maybe since we're talking about it, do you want to show the example that you've put together to show how renv works?

Hey, David here. Just wanted to let you know that at this point in the conversation, we switched to a screencast. Now obviously, showing code doesn't work very well in an audio podcast, so if you want to see the rest of this conversation, check out the video version of this podcast on YouTube. You can find a link to that in the show notes.
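For readers of the transcript, the "share the list, not the packages" idea from the conversation can be sketched in Python. To be clear, this is not the example shown in the screencast and it is not how renv itself is implemented (renv is an R tool); it is a small illustration that assumes only Python's standard importlib.metadata and json modules. The function names and the manifest.json file name are made up for the sketch.

```python
import json
from importlib import metadata


def capture_manifest():
    """Return {package name: version} for every distribution installed in this environment."""
    manifest = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata is missing or broken
            manifest[name] = dist.version
    return dict(sorted(manifest.items()))


def write_manifest(path, manifest):
    """Write the manifest as a small JSON text file that can be committed or emailed."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)


def read_manifest(path):
    """Read a manifest back into a {name: version} dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # "manifest.json" is a hypothetical file name chosen for this sketch.
    write_manifest("manifest.json", capture_manifest())
```

The file this writes holds only names and version strings, which is the "few kilobytes worth of text" side of the toaster oven comparison: someone else can use it to rebuild the environment on their own operating system instead of receiving the installed packages themselves.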

Cool. So that, you know, I mean, I brought up the very simple example of, like, you want to share this with a colleague, it works in that context. But then you also mentioned in the context of, you know, if you're putting it on GitHub, say you're running GitHub Actions. Like, I have clients, for example, who will, you know, generate reports from Quarto or R Markdown documents that just render to HTML. And then they want to share those. So you could set this up so that when you, you know, push to GitHub, say, it'll run the GitHub Action. It'll use renv to, like, install the, or I guess, I don't know if it's technically install, but use the right versions of packages and kind of build your report. Is that an accurate...?

[00:28:44] Alex Gold: That's exactly right. And so, like, there is an example in the, actually, if you go through the whole book, there are sort of a set of labs, and part of the lab will take you through this process of standing up GitHub Actions and then rebuilding an R environment inside the GitHub Action.

Because basically the way to think about GitHub Actions is it's like an ephemeral computing environment that comes up with nothing in it. And so you have to install R,

[00:29:08] David Keyes: Run renv

[00:29:09] Alex Gold: to get your packages, and then you can do something there and push that off to something else. But GitHub Actions is a very ephemeral computing environment.

And so on the one hand you're like, well, it doesn't have anything I need. But actually that's a good thing, right? It's a blank slate where, if you capture your environment in code, you get to build it back up exactly right every time. And so you always know it's gonna be exactly the environment that you need to run the same way for everybody.

[00:29:32] David Keyes: Yeah. Well, and to loop back to the initial conversation, I mean, that's that kind of thinking in the context of DevOps. Because, you know, that's your production, right? It's, like, having a Quarto document and producing, you know, an HTML file from it. But to do that, you need to ensure that you're always using the same versions, and especially in a context where you're running something like GitHub Actions, you know, you've got to ensure that it's going to work correctly on what is, like, somebody else's computer, which is, you know, just a computer in the cloud.

[00:30:06] Alex Gold: Exactly. Exactly. Yeah.

[00:30:06] David Keyes: Yeah.
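
To make the blank-slate idea concrete, here is a hedged sketch of an R script that a GitHub Actions step might run after R and Quarto have been installed on the runner. It is not the workflow from the book's labs. The file name report.qmd and the function name build_report are assumptions for illustration, and the sketch assumes both renv and the quarto R package are recorded in the project's renv.lock.

```r
# Hypothetical build script for an ephemeral CI machine that starts empty.
# Assumes R and the Quarto CLI are already installed on the runner, and
# that renv.lock is committed alongside the report.

build_report <- function(input = "report.qmd", project = ".") {
  # Step 1: rebuild the package library exactly as recorded in renv.lock.
  renv::restore(project = project, prompt = FALSE)

  # Step 2: render the Quarto document using those restored packages.
  quarto::quarto_render(input = input)

  invisible(input)
}

# In CI this would be called once, for example:
# build_report("report.qmd")
```

Because the runner starts with nothing, every run goes through the same restore step, so the report is always rendered against the versions recorded in the lockfile.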

[00:30:07] Alex Gold: And the other place I think you see this a lot is, some organizations have patterns around using Docker containers to run projects. And again, there are lots of, like, philosophies about how to use Docker containers in a, um, data science context.

My personal recommendation is, however you're going to build your container, whether you put the packages into the container or load them in at runtime, either way, using renv is the right way to do that. Not a bunch of install.packages commands, because that doesn't control any of this versioning and where is it coming from and all that kind of stuff.

And so you can get yourself in a lot of trouble with just sort of hacking your way through this with bare install.packages commands. Using renv is really the way to do this.
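
As an illustration of what the lockfile controls that a bare install.packages() call does not, the sketch below reads an renv.lock file and lists each package's recorded version and where it came from. The function name summarize_lockfile is hypothetical, and the sketch assumes the jsonlite package is installed; it relies only on the Package, Version, Source, and Repository fields that renv writes for each entry.

```r
# Sketch: list the versions and origins pinned in an renv.lock file.
# Assumes jsonlite is installed. summarize_lockfile is a made-up helper,
# not part of renv.

summarize_lockfile <- function(path = "renv.lock") {
  lock <- jsonlite::fromJSON(path, simplifyVector = FALSE)
  pkgs <- lock$Packages

  # Return a field as a string, or NA when the entry does not record it.
  field <- function(p, name) {
    if (is.null(p[[name]])) NA_character_ else as.character(p[[name]])
  }

  data.frame(
    package    = vapply(pkgs, field, character(1), "Package", USE.NAMES = FALSE),
    version    = vapply(pkgs, field, character(1), "Version", USE.NAMES = FALSE),
    source     = vapply(pkgs, field, character(1), "Source", USE.NAMES = FALSE),
    repository = vapply(pkgs, field, character(1), "Repository", USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Example use, in a project that has an renv.lock:
# summarize_lockfile("renv.lock")
```

A plain install.packages("somepackage") call records none of this; it installs whatever version the configured repository serves on the day it runs, which is why two container builds made months apart can differ.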

[00:30:49] David Keyes: Yeah. And it seems like the classic thing where, for people who have never experienced this problem, they might be listening, thinking, like, what are these guys talking about?

But anybody who has experienced it would be like, "Oh, I know exactly why you would want to do that."

[00:31:05] Alex Gold: Yes, this is one of those things, and I think there are a lot of problems like this. Honestly, this is one of the problems with writing this book: it's very hard to convince people who haven't hit these problems that they need solutions.

I hope people who have experienced these problems read the book, or people read the book before they've experienced these problems and then run into them and think, like, "Oh, I see it." Um, I do think these are problems that are hard to understand until you've lived them. And then you look at me, you're like, "Oh, wow, that is a big problem that it would be nice to have a convenient solution for."

Um, so if you haven't experienced this yet, it's probably just a matter of time.

[00:31:41] David Keyes: Well, I will say the book is great, and it's got a lot of good solutions for people who have experienced those problems, and hopefully kind of preemptive solutions for people who have not yet experienced them. Um, and we've obviously only touched the top level, um, in terms of... Yeah, we got through the first chapter.

Exactly. So if people do want to learn more about the book and DevOps and data science, where should they go?

[00:32:06] Alex Gold: Yeah, the whole book is available for free online at do4ds.com. You can buy copies in print or EPUB, but the whole book is just available as a Quarto book online at do4ds.com. I do also occasionally blog about related topics at alexkgold.space: things like why you should drop out of a PhD program, or some of the things more germane to my day-to-day work, like management and hiring, if you're interested in that. Um, or thinking about, like, data science management. I have some blog posts on that.

Um, but yeah, the book is at do4ds.com.

[00:32:42] David Keyes: Is the book built with renv and, uh, with GitHub Actions?

[00:32:46] Alex Gold: Oh yeah. The repo's on GitHub. It is there, and you can read the actions that I'm using to publish it. Um, so yeah.

[00:32:53] David Keyes: We'll post a link both to the book itself as well as to the GitHub repository in case people do want to check that out. Um, Alex, thanks again for joining me. This was a really interesting conversation. I appreciate you taking the time.

[00:33:06] Alex Gold: Thank you. It's been a, been an absolute pleasure. Uh, thanks for having me.
