R for the Rest of Us Podcast Episode 14: Will Landau
In this episode, I’m joined by Will Landau, a statistician and software developer currently working with Eli Lilly and Company. Will specializes in Bayesian methods, high-performance computing, and reproducible workflows. He is the creator of the {targets} R package, a pipeline tool for reproducible computation in statistics and data science. The package became part of rOpenSci in early 2021.
Will talks about his journey into R and using it for open source projects. He gives a detailed account of {targets} - its origin and how it works as a reproducible analysis pipeline tool.
Learn more about Will by visiting his website and connect with him on LinkedIn and X (@wmlandau).
Learn More
If you want to receive emails when we publish new podcast episodes, sign up for the R for the Rest of Us newsletter. And if you're ready to learn R, check out our courses.
Audio Version
Video Version
In the video version, Will gives a code walkthrough of how the {targets} package works.
Resources Discussed
Important resources mentioned: Get started with the {targets} R package in four minutes
Transcript
[00:00:00] David Keyes: I'm joined today by Will Landau. Will is a statistician and software developer in the life sciences. He earned his PhD in statistics at Iowa State in 2016, and he specializes in Bayesian methods, high performance computing, and reproducible workflows. Will is the creator of the targets R package, a reproducible analysis pipeline tool, which became part of rOpenSci in early 2021.
[00:00:51] Will, welcome and thanks for joining.
[00:00:54] Will Landau: Glad to be here. Thanks for having me.
[00:00:56] David Keyes: Um, well, I want to start out just by asking some kind of basic questions. I'm curious kind of how you first got into R.
[00:01:03] Will Landau: Well, I got into R because at the time it was pretty much right in front of me. I was in my third or fourth year of undergrad, and I was just discovering how much I love to code. And I was just getting into statistics, and that kind of gradually started to intersect when our homework assignments used R. And there was a mix of reactions.
[00:01:29] I mean, people come to statistics, I'm sure, you know, from all kinds of backgrounds: economics, computer science, um, a lot of other different backgrounds. Um, some people really love to code and some people struggle with it a bit. I think that for me, it hit me right at the right time.
[00:01:50] And I loved it. I loved how the computational problems intersected with the statistical ones. And, um, I wanted to carry that as a focus through, you know, whatever I did after that. And in grad school at Iowa State, I got really lucky. Um, there was just such a strong R community. There were a lot of people doing really cool things in visualization and package development.
[00:02:19] And professors Di Cook and Heike Hofmann were there. Um, to give you an idea, they were the mentors of Hadley Wickham when he was at Iowa State. And they kept that community going. And the grad students around me kept that community going. Uh, there was just a lot of collegiality and a lot of excitement around the stuff that we were doing in R.
[00:02:42] It was, uh, there were just a lot of positive vibes and interesting projects, and I couldn't really help but grow to like it more every day. And, you know, now at work, I'm among colleagues who use R all the time. We're basically an R shop, and what I do falls under, you know, classical experimental design a lot of the time.
[00:03:10] So it's still a good fit for the job.
[00:03:15] David Keyes: Great. And I know you work at, uh, Eli Lilly, which I should have mentioned before. For obvious reasons, we can't talk about, you know, the details of what you do. Um, so I wonder if you could give me kind of an overview. I mean, you talked briefly, in broad strokes, about the work that you do there, but I wonder if you could also talk about kind of the daily work that you do with R as well, especially on the open source side, and what that looks like.
[00:03:43] Will Landau: Sure. Well, there are two sides that I quite like about the open source work that I do. And one is statistical. You know, not all of the models that I develop or contribute to are, you know, available as open source packages, but, um, some are, and there's this, uh, great collaboration that I'm part of.
[00:04:08] Uh, there's this ASA group called openstatsware, where we have a work stream to implement a Bayesian, uh, MMRM. And we're developing this package, pooling, um, all kinds of knowledge from across our companies. And I've, um, taken the lead on the implementation of this package. It's called {brms.mmrm}.
[00:04:30] It's very much a group effort. And, um, that's a model that comes up a lot in my line of work. It's this repeated measures, uh, model of experimental data. And it's, uh, it's non-competitive, and it's used by so many of us that we just want to get a solid implementation that we all agree on.
[00:04:52] Um, and that's been extremely rewarding. And, um, you know, with that package, we have a CRAN release or two, and we have, uh, done a lot of the interface work. And we're going to move on to the deeper statistical pieces of, you know, borrowing from historical data sources and stuff pretty soon.
[00:05:16] Um, and the other side of that is the infrastructure and the workflow tools that I work on, like targets, which we'll get into. These are tools just to make your life easier and to manage, um, the analyses and the analysis workflows that go into either developing a model like that or running really any kind of data analysis project.
[00:05:45] And so, I mean, there are things that I do that a lot of people may not be able to relate to, and, um, I like to be upfront about that. But we all face a common set of workflow problems, no matter what we're using to analyze our data, whether it's a Bayesian model or a machine learning method, or just some data transformation in an ETL.
[00:06:10] David Keyes: Yeah, yeah. And I'll say, I mean, you know, speaking for myself, hearing you talk about the Bayesian work that you do, that's not something that I am even familiar with. I wouldn't even know where to start with that. But, you know, as I was mentioning to you before we started, targets as a package is something I've heard about and that has intrigued me, but it's not something that I've ever had the opportunity, or, you know,
[00:06:34] I guess I've chosen not to take the opportunity, to use it. So, um, I'm curious if you could, at a very high level, give me an overview of what targets is and talk about where the idea for it originally came from.
[00:06:52] Will Landau: Great. So targets is what I'd call a reproducible analysis pipeline tool. So in a typical project you have some data sets that you want to analyze, and you'll produce model objects for the fitted models, and then you'll summarize those models. Or maybe you're not doing modeling. Maybe you start from the datasets and you want to run some visualizations to explore the data, to see, um, what a scatterplot of one variable looks like against another.
[00:07:23] And maybe you want to summarize those results in a bunch of downstream reports in R Markdown or Quarto. There are these, uh, steps of a flow in a data analysis project. And if we think of a data analysis as a project, that's when it becomes big enough, in that sense, that targets comes in and starts to help.
[00:07:52] And hopefully when we get into an example, we can start to think about, well, what kinds of projects, um, targets is originally intended for. But, um, the idea came when I needed a package like that to do my dissertation work in grad school. That's the first time I started to want a tool like targets.
[00:08:13] I was developing this large Bayesian model for, uh, for this agronomy problem. And on top of all the writing and all of the model development, I needed to run this model on some pretty big data sets as far as genomics was concerned. And each time that I started to run the model, I had to wait three or four hours for it to complete.
[00:08:39] And I had a bunch of those analyses. And then meanwhile, I had these other moving parts, like the dissertation that I was writing that depended on those results. And it was all a lot to keep track of. And the runtime of these models blocked me from updating one part of the project while I had to focus on another. It was just very hard to wrap my head around it all and to go about my day putting together something that ended up being a dissertation.
[00:09:11] And my advisor, Jarad Niemi, watched me struggle through all of this the whole time. And he mentioned to me that I should be using a tool called Make to organize and wrap my head around this stuff. I was a bit too far into the project to transition to a thing like that at the time, but then when that was all wrapped up and I defended, I immediately
[00:09:35] went to start saying, okay, how do I make this easier for other grad students like me? And how do I make this easier for my future self? And how do I make this easier for anybody who's developing any kind of project, whether it's like a dissertation, or whether it's something even a little bit more lightweight, or something that maybe has less of a runtime burden but just has hundreds of artifacts that you want to produce, like in an ETL workflow?
[00:10:12] And that's when I started searching for existing tools. There's this developer, uh, scientist, statistician by the name of Rich FitzJohn, who had developed something that was almost what I was looking for. It was a package called remake that he had developed between 2014 and 2016.
[00:10:35] And that package is like the Make tool, except designed completely for R. And the concepts that he introduced there, um, just completely blew my mind. The way that he took something that's usually language agnostic and made it completely focused on R and extremely friendly. And the reason that I originally got into developing pipeline toolkits was that he had moved on to other things, and at a certain point he was no longer interested in developing
[00:11:08] remake. And so at that point I created a project called drake, which was kind of like remake, but initially it attempted to pursue a greater degree of scale in a couple of ways. And that was when I was really first getting my start as an R package developer, and that was a journey. I learned an absolute ton from it.
[00:11:35] But drake got to the point where I was faced with the hard limitations of my original design choices. And so, at a certain point, it became time to take the lessons that I learned developing drake and to start something completely new. And that became targets, which I started to develop in 2020.
[00:11:59] And I was secretly working on it on nights and weekends for about six months. And then I open sourced it and, um, released it to rOpenSci, and the rest, as they say, is recent history.
[00:12:14] David Keyes: Yeah, that's great. So, from my kind of naive understanding, I think about, for example, the types of projects that we do, which don't actually involve any modeling. Typically what it involves is taking some raw data. We usually have, um, an R script file where we take the raw data and do, you know, cleaning and transformation on it.
[00:12:36] Um, then we typically save that, usually as an RDS file. And then we do a lot of reporting, um, so a lot of parameterized reporting, that type of thing. So then our R Markdown or Quarto files read in that kind of cleaned, you know, RDS file, um, and use that for the reporting. And we've developed a structure where, you know, we basically follow the structure that R packages use, where we'll put our raw data in a folder called data-raw.
[00:13:11] And then our clean data goes just in the data folder. Um, but what it sounds like you're saying is that targets is essentially kind of a formalization of some of the processes that we have kind of informally put together. Where, for example, maybe it takes a while to run the code to clean your data,
[00:13:33] um, and, you know, the way we deal with that is we run it once and save it as an RDS file. But with targets, you could set it up so that it kind of always knows: have we run that cleaning and transformation code, is that up to date? And then if that code changes, it'll kind of like rerun it, right?
[00:13:58] And so it helps to, again, I guess, really formalize that process: it both speeds up your kind of runtime, but it also makes sure that everything is up to date. Am I accurately understanding targets at a high level?
[00:14:15] Will Landau: Yeah. Yeah. That's a great summary.
[00:14:18] And I like your use of the word transformation in particular, because this is where we start to think about a project-oriented workflow, like the one that you described with working with data and generating reports, as a series of transformations. You're transforming those datasets
[00:14:37] into reports, or maybe you're transforming those datasets into human-readable summaries, and then transforming those human-readable summaries into those reports. It's this mapping that's a really useful aspect of this whole process to think about, and to be pedantic about.
[00:14:58] And to think about just which one of those transformations is which, and what exactly is going on in each one of those transformations. And what are the set of inputs that you need? And what is the one output that this transformation produces in each case?
[00:15:15] And I think that that's kind of how targets makes this whole process a bit more pedantic. And then there's the process of skipping a step that's already up to date and only running it if the upstream code or dependencies changed. For somebody who's running these extreme kinds of Bayesian or machine learning models, that's a really critical time saver.
[00:15:42] And it's the kind of problem that you don't know you're going to have until you're faced with a kind of analysis that takes forever to run that you might have to stop in the middle. But there is something for everyone in that, even if you're just, um, churning away at a couple of data sets and producing some reports.
[00:16:05] I mean, you might have, uh, a couple of those, or you might have a few hundred data sets and a few hundred reports, and to be able to see right in front of you that everything is up to date is vouching for the reproducibility of the project, in the most first-principles sense that I can really think of.
[00:16:27] It's saying that if you were to reproduce or recreate this analysis from scratch, here are the results that you get. And here's evidence that that is synchronized. That's a kind of reproducibility that was missing, I think, for a long time. And it complements the kind of reproducibility that you find in being able to describe in your own words exactly what's going on, let's say inside an R Markdown or a Quarto report for the analysis.
[00:17:00] Certainly.
[00:17:01] David Keyes: Yeah, that makes sense. Well, you know, I have a bunch of questions about Targets, but I actually think what would be most useful at this point is to have you give us a brief demo and then I can ask those questions. Um, so I'll let you, um, if you just want to put your screen up and then we can watch that.
Will Landau: So this example comes from this targets-four-minutes repository that I have on GitHub. And it comes with a video, and it comes with a
Posit Cloud
workspace where users can log in and access the same workflow without having to install R or any packages on their machine. The code that I'm showing right now is actually a bit of a modification.
[00:17:48] So I took out the comments, so this screen is a bit less busy than what you'll see in the, uh, hosted version. And then I added a Quarto report at the very end. And, uh, just to start from the beginning, um, this is a special kind of file that you see in front of us, which we'll get to eventually.
[00:18:14] But what I'd like to do is start with the functions that go into this data analysis project, because from our discussion about the word transformation and how it applies to data analysis, functions in R are just such a natural building block to describe transformations, especially pure ones, where they don't write files in and of themselves but return an object at the end that really has everything that you might care about in an analysis.
[00:18:48] And usually, I've found, these functions either produce data sets, or they fit models, or they create human-readable summaries, or they generate, uh, some visualizations or reports. And I think if you're going from one of those types of artifacts to another one of those types of artifacts, a function is a really natural tool to describe that.
[00:19:12] Um, and it's not, uh, a tool that's often used, but once you get used to describing the work that you're doing in functions, even if it seems overly pedantic at first, it's hard to stop once you get going. Miles McBain has an excellent post about, um, a function-based diet in programming.
[00:19:39] And he talks about targets' predecessor, drake, but he also makes a really compelling case that this is an approach that fits a lot of data analysis workflows. So I'll go right ahead and I'll load...
[00:19:52] David Keyes: Actually, can I just ask you: if you're starting a new project and you're like, okay, I'm going to use targets for this,
[00:19:57] how do you just start, even before you get to this point? What would that look like? Because obviously you have some stuff set up here, which is great. Um, but I'm thinking, okay, I'm starting a new project and I'm like, yes, this time I'm going to try out targets.
[00:20:12] Um, how does that, how does that work?
[00:20:14] Will Landau: So I think the first thing is to identify the steps in the workflow, just conceptually: what data sets do I start with? What summaries do I want to create of those data sets? And what reports do I want to produce? That's something that would help to get really clear on.
[00:20:29] And then to think about, okay, what is transforming into what? So if I have a data set and I have a model, what are all the steps of going from one artifact to another artifact? And that's where the functions come in. So before you even get started with targets, um, I would do what you do, which is to, um, write a bunch of standalone R code and put it in an R folder of your project,
[00:20:58] and express that in terms of the functions that describe the data analysis. And then develop and test those functions on your own, play around with them, see what works on small test cases. I found that to be a really useful place to start before putting it into a targets pipeline.
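(To make that concrete, here is a minimal sketch of the kind of functions file Will is describing. The function names and the ozone-versus-temperature model are illustrative, not necessarily the ones in his repository.)

```r
# R/functions.R -- illustrative sketch of a function-based workflow.
# These names are hypothetical; the demo's own functions may differ.

get_data <- function(file) {
  # Read the raw CSV and drop incomplete rows.
  na.omit(read.csv(file))
}

fit_model <- function(data) {
  # Ordinary least squares: ozone explained by temperature (assumed columns).
  lm(Ozone ~ Temp, data = data)
}

plot_model <- function(model, data) {
  # Scatterplot of the raw data with the fitted regression line overlaid.
  plot(data$Temp, data$Ozone)
  abline(model)
}
```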
[00:21:16] David Keyes: So people will typically then start by, like, having a project that they've done in a non-targets workflow, and then kind of think about applying targets to that. Is that right?
[00:21:30] Will Landau: Yes and no. I mean, for some kinds of projects that don't really make explicit use of custom functions, the transition can be really hard. But if you start a project that you will want to build a targets pipeline into,
[00:21:45] and if you've written these custom functions from the beginning, then it becomes much easier. And when you add targets, it doesn't need to be the kind of project where you've already put all those pieces together. You can start with just a few small functions that you experiment with yourself, and once you get an idea that they're, you know, working okay in very small test cases, then it's a good, uh, point to put them in the targets pipeline.
[00:22:16] David Keyes: Um, that makes sense. I'll let you go ahead then with your demo.
[00:22:21] Will Landau: Sure. So just to demonstrate what I'm doing with these functions, I'm just going to source them into this session real quick. And I'm going to say that we're going to start with the data file. This is just the airquality data set in base R.
[00:22:37] And if I were to think about just developing and testing these functions, I might sort of define some example data. And if I were to just walk through the code for a scaled-down version of this pipeline, I might say, okay, let's transform the file into this data object that we're going to use.
[00:22:56] So data, and I use the get_data() function, which I wrote. And the data is going to be this airquality data set from base R. And let's say I want to fit just an ordinary least squares regression model to the data. I'd have something like model, just to get used to testing the functions, experimenting with them, and getting a feel for how, um, you know, how these transformations that we've created are used.
[00:23:26] So you might say, uh, let's fit the model to the data. And sure enough, we get, um, the, uh, model coefficients for the fitted model and so on. So assuming I've done all this, which is completely outside of targets and is a good practice for a data analysis workflow anyway, we could then move on to targets.
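(Roughly, the interactive testing Will walks through looks like the sketch below. The file path and function names are placeholders for whatever the project actually uses.)

```r
# Try the functions interactively before wiring them into a pipeline.
# "data.csv" is a placeholder path; get_data() and fit_model() are the
# hypothetical helpers sourced from the R/ folder.
source("R/functions.R")

file  <- "data.csv"
data  <- get_data(file)   # read and clean the raw data
model <- fit_model(data)  # fit the ordinary least squares model

summary(model)            # inspect the fitted coefficients
```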
[00:23:49] And what that's going to involve is taking these functions, these reusable building blocks, this shorthand for the different tasks of the analysis, and turning that into the actual analysis. And that's where this _targets.R file comes in. You don't need to write it from scratch.
[00:24:11] There's this function in targets called use_targets() that creates one. I'm not gonna run it now, but if I were to run use_targets(), it would create one of these files if it doesn't already exist, and it would be heavily commented to provide directions for what options to set, et cetera.
[00:24:31] But this target script looks and feels like you are starting a session and actually running all of the tasks, with an emphasis on the setup part. So when you begin an analysis, whether it's inside targets or outside targets, you start by loading some packages, setting some options, loading some reusable functions that you've defined, and then going and doing your work.
[00:24:58] And that's why this targets file is the way it is. So I start by loading some packages that I'm going to need to define the pipeline. I'm going to set some options, and this packages option lists the packages that the tasks in the pipeline are going to need when they eventually run.
[00:25:18] David Keyes: So then instead of running, like, library(dplyr), library(ggplot2), library(readr), you run tar_option_set(), which essentially does that same thing. Is that right?
[00:25:29] Will Landau: Yeah. The packages option in tar_option_set() registers the packages that are going to load when they need to. Um, so they're not going to load right away, but in the actual steps of the pipeline, they're going to be, uh, useful for performing the work.
[00:25:45] David Keyes: Because I saw, for example, in the functions in your R folder, you don't actually, you know, load your packages there.
[00:25:51] So it sounds like, by setting them as part of your pipeline, it will then use those packages as necessary, including in those functions. Is that accurate?
[00:26:01] Will Landau: Yeah. Yeah, exactly. Um, and the function script is loaded in the next step, in this tar_source() function. What this does is it runs all the R code scripts that are inside this R folder in the file system of the project,
[00:26:20] and it has the effect of loading all those functions into the environment where this script is running. And, um, if those scripts actually had the analysis, it would run that, but standard practice is to just have, uh, the functions themselves.
[00:26:36] And so it's actually really quick. This is all part of populating your workspace. And then where this starts to differ from a regular analysis is from line 10 to the very end. These are a list of what are called target objects, and you have this call to the tar_target() function for each one.
[00:26:58] And each one of these is a step in the analysis. It's a single task that is going to be run in the pipeline when you run it, and so it'll have a name. It'll have a command, which is just an R expression that's going to run when the target runs. You can have a bunch of other options. In the development version of targets,
[00:27:22] you can even add a human-readable description. This is a feature that's not quite in the CRAN release yet, but is going to be in the next one. Um, likewise for other steps in the pipeline. And so these expressions, like get_data(file), they're not going to be run just yet when you run this list.
[00:27:41] These are quoted, and they're going to be run later. Instead of just saying get_data() in the R console, like we did before, we're saying, okay, this is a task that's going to run later when everything's ready to run. And this is a really pedantic way of laying out all the steps in the pipeline, all the inputs and all the outputs.
[00:28:04] And the inputs are what you would read as the function arguments or the global variables in the expression here. And the output is going to look and feel like the object that you've given in the name.
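(Put together, the _targets.R script Will is describing looks roughly like this sketch. The target names, the data file, and the Quarto step are illustrative; tar_quarto() comes from the {tarchetypes} package rather than {targets} itself.)

```r
# _targets.R -- a minimal sketch of the pipeline described above.
library(targets)
library(tarchetypes)  # assumed here for tar_quarto()

# Packages the pipeline's tasks will load when they eventually run.
tar_option_set(packages = c("dplyr", "ggplot2"))

# Run every script in R/ so the custom functions are available.
tar_source()

# The last value of the script: the list of target objects.
list(
  tar_target(file, "data.csv", format = "file"),  # track the raw data file
  tar_target(data, get_data(file)),               # clean/transform the data
  tar_target(model, fit_model(data)),             # fit the regression model
  tar_target(plot, plot_model(model, data)),      # downstream visualization
  tar_quarto(report, "report.qmd")                # render the Quarto report
)
```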
[00:28:17] David Keyes: I was just going to say, it sounds like, starting on line 10, you're not, like, running the code, but you're laying out these are the steps that we're going to go through. So first the tar_option_set() says, these are the packages that we're going to need for our project.
[00:28:31] Then tar_source() runs all the code in the R folder, which typically has your functions. Then you define all of your steps, and I assume we'll get to this in a second, but down below there's going to be a function or something that will actually, like, run the pipeline. Is that an overall accurate summary of how this works?
[00:28:51] Will Landau: Exactly. Exactly. It's a lot of table setting, and all of that table setting isn't always apparent right away, but it does pay off. And I'm going to describe the ways that it pays off right now. So one of the things that this allows you to do, if you're really pedantic about setting up the pipeline, is you can call functions to inspect the pipeline.
[00:29:16] So, for example, tar_glimpse() is a function, and one of a small handful of views that targets gives you to let you see the dependency graph of the pipeline. And here it is. Let me move things around just so it's easier to see. So what this is, like I said, is a dependency graph.
[00:29:42] It's a visualization of the natural left-to-right flow of the pipeline. It's going to show you what is going to happen when you run it. So in an analysis pipeline where you have, let's say, hundreds of steps, where you import data sets and write reports and you have steps that are even downstream of that,
[00:30:04] it's helpful to have a tool like this to wrap your head around what you're doing, because sometimes code bases can get out of hand, and a visualization like this can kind of rein it in. So no matter what you have here, and no matter how you list the order of these steps, you're going to get this dependency graph that shows the left-to-right flow of what's going to happen.
[00:30:27] So we need to track the input file first. Then we're going to load the input file, we're going to model it, and we're going to take the data and the model, and with these arrows, it's showing you that that's going to be incorporated into the plot, and the plot and the model are going to the, to the report.
[00:30:46] And this graph is going to look the same no matter how you order things in the pipeline. So I could put the file down here below the data. And if I rerun this, it's going to give us the same graph. And the reason it does that is because there's this behind the scenes magic called static code analysis.
[00:31:07] And what it does is it parses the expressions that you give in the command, and it looks at the global variables it finds. And so if I look at this expression, get_data(file), where does the object file come from? Either it's a global object that you've defined up here, or it's the name of another target.
[00:31:28] And that's how this pipeline knows that the data depends on the file. And this is something I learned from Rich FitzJohn: that we can use static code analysis to construct one of these graphs, and it's going to help the pipeline later on run the correct targets in the correct order.
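(A couple of the inspection helpers mentioned here, as a sketch; both read the pipeline defined in _targets.R without running it.)

```r
library(targets)

# Draw the dependency graph of the pipeline. The edges come from static
# code analysis of each target's command, so the order of the list in
# _targets.R does not matter.
tar_glimpse()

# A plain data frame of each target and its command, without the graph.
tar_manifest()
```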
[00:31:45] David Keyes: Yeah, that's great.
[00:31:46] I mean, that's so helpful to be able to see it. And honestly, seeing that right there, it's like, oh yeah, that is basically what we do in our consulting work. And we know, because we do it over and over and over, but having something like that, that you can actually see, is really helpful just to keep everyone on the same page as well.
[00:32:06] Will Landau: Yeah, to keep everyone who's working on it on the same page. It's helpful. Um, I'm glad you brought it up, because it's helpful for collaboration, it's helpful to document the workflow if you're sharing it with your colleagues, and it's helpful for your future self, if you work on a project for a while and then leave for six months and then come back to it later.
[00:32:27] Sometimes it's hard to look at this code and wonder what you did. I mean, with the dependency graph and the report at the end, I've gotten myself out of that jam in about 10 minutes, where it would have taken a few hours or even longer just to remember what I did, if I want to update the project without breaking it.
[00:32:48] David Keyes: Yeah. So how do you then run everything, say, to, you know, import the data, do the modeling, the plots, the reporting?
[00:32:57] How does that happen?
[00:32:58] Will Landau: There is a function called tar_make(). Both tar_make() and tar_glimpse() read this script, but tar_make() in particular is the thing that runs this pipeline. It expects the last object returned by this script to be this list of target objects; that's what it reads and parses to understand what to do, and then it goes ahead and runs the correct targets in the correct order.
[00:33:27] And right now it's spinning up a background reproducible process to isolate the pipeline environment from everything else. And it's running the file first, and then the data, and then the model, and so on. And this is what all the table setting led up to.
[00:33:47] And I think it goes to show just how important that table setting is, that we've gotten about 45 minutes into this podcast and this is the first time I'm running the pipeline. It's a lot of upfront investment at first, until it becomes habit, but the payoff is tremendous.
[00:34:06] And then the table setting becomes a natural part of the development process once you get used to it.
[00:34:13] David Keyes: And do you typically have that tar_make() within, um, that _targets.R file, or do you just run that in the console?
[00:34:22] Will Landau: You can run it in the console. It's important that it exists separately from _targets.R,
[00:34:28] because this script is actually instructions for tar_make(). And if I were to put tar_make() somewhere in here, it would call itself recursively in an infinite loop. And I know I shortcutted to run parts of this code, uh, a few minutes ago interactively, but this is really not for interactive use.
[00:34:46] This is for, uh, a function like tar_make() or tar_glimpse(). Right. And so tar_make(), you can run it in your console. Um, in the most recent version of targets, you can actually run it as an RStudio job as well. So, borrowing from the interface of Quarto, you can say as_job = TRUE. And what this will do is it'll trigger a background job.
[00:35:09] And when it emits output, it'll tell you what each of the results of processing the targets are. Um, so because I've already run the pipeline and I haven't changed anything since I ran it last, it skips everything, and it takes, uh, far less time to run. And meanwhile, this whole time while the job is running, your R console is completely free to do whatever else you need to do.
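(In code, that part of the walkthrough is roughly the following. The as_job argument is the one Will mentions; it only works in sufficiently recent versions of targets and inside an RStudio session.)

```r
library(targets)

# First run: builds every target in dependency order.
tar_make()

# Second run with nothing changed: every target is skipped, so it
# finishes almost immediately.
tar_make()

# Run the pipeline as a background RStudio job instead, leaving the
# console free (recent versions of targets only).
tar_make(as_job = TRUE)
```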
[00:35:34] David Keyes: Well, yeah, I mean, you can see there it was, you know, 21 seconds the first time, but then the second time it was like half of a second. Because it's, I don't know what the right terminology is. It's cached everything essentially. Um, so that it doesn't have to run everything again. It can use what it's already done.
[00:35:52] Will Landau: Exactly. Exactly. And, um, even a little bit of time savings is noticeable, just to make your life easier. It doesn't have to be, you know, several hours. It can be just a few seconds or a few minutes, and the more targets you have, the more it adds up. And so anything big or small can really benefit from the peace of mind that that gives you, uh, to be able to flow through your work more smoothly.
[00:36:20] David Keyes: Yeah. That makes sense. Um, was there anything else in terms of the code walkthrough that you wanted to show before I ask you some kind of broader questions about targets?
[00:36:29] Will Landau: I'll just, uh, mention that, um, on this business of a target being up to date or not, there are a lot more functions in targets that allow you to inspect the state of the workflow and the progress of the workflow.
[00:36:40] The documentation website has a long list of utilities for that. And even in the dependency graphs, that information is color-coded in functions like tar_visnetwork(). Whereas we had tar_glimpse() here, with just the targets in the pipeline all colored gray, this tar_visnetwork() graph shows the nodes color-coded based on whether they're up to date or not.
[00:37:08] And that includes the targets in the pipeline and the functions here that you see on the left as triangles.
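(As a sketch, the status-aware graph mentioned here, plus tar_outdated() as a non-graphical way to ask the same question:)

```r
library(targets)

# Full graph with nodes color-coded by status (up to date vs. outdated),
# including the functions feeding each target, drawn as triangles.
tar_visnetwork()

# The same status information as a plain character vector of
# outdated target names; character(0) means everything is current.
tar_outdated()
```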
[00:37:14] David Keyes: So in other words, if you see something in that tar_visnetwork() graph that's not up to date, you would want to run tar_make() again, which would bring it up to date. Is that right? [Will Landau: Right.] Uh, well, this is great. This is really helpful. Um, let me ask you some kind of broader questions now.
[00:37:33] Um, so I'm curious, like, in your mind, how complex does a project need to be to benefit from using targets?
[00:37:41] Will Landau: That's a hard question, because there really isn't a hard and fast rule, and it can depend on personal preference. But I would say anything even a little bit complicated, even a pipeline with four or five steps, or even three or four computationally intense steps, is worth doing in targets once you, uh, experience the time savings. And I didn't even get to the features in targets that abstract away files as R objects.
[00:38:23] You mentioned that it's, you know, common to manually save, uh, the output data as an RDS file and then read it back later. In targets, there are more convenient functions to access data in that, uh, in that cache, or that data store. And that in itself is a convenience that's hard to move away from, even for really small, uh, workflows, once you've experienced it.
[00:38:50] Projects can also be complex in a bunch of different ways. So sometimes the computation time is a burden. Sometimes you just have a lot of data objects to output and a lot of downstream tasks that read in those data objects, and targets can smooth things out in either one of those cases.
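(The convenience functions Will alludes to for the data store are, roughly, tar_read() and tar_load(); the target names below are the illustrative ones from the earlier sketch.)

```r
library(targets)

# Retrieve results from the _targets/ data store instead of managing
# RDS files by hand.
model <- tar_read(model)   # return the stored object for one target
tar_load(c(data, plot))    # load several targets into the environment
```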
[00:39:11] David Keyes: Yeah, I mean, I think about, you know, for the work that we do, it'd probably be less on the kind of runtime issues. You know, it's not that it's going to take a huge amount of time, although there are some cases where, where that happens, but it's more about making sure everything is up to date. You know, for example, like we create some reports and then the client says, oh, yeah, here's some, some new data.
[00:39:33] Well, the way we do it right now, we have to make sure manually that someone goes in and reruns the data cleaning and importing code, the transformation code, so that it spits out that new RDS file, so that when we generate the reports again, it's using the up-to-date data. Um, it seems like, you know, using targets would make it so that every time we run tar_make(), it will just check, okay, is everything up to date, so that before it generates those reports, it just automatically runs that code.
[00:40:05] Um, so I can, I can definitely see the benefit of it.
[00:40:08] Will Landau: Yeah. It's one of those things where you don't realize how much mental bookkeeping you're doing just to remember the state of the project. It's much easier to have a tool automatically tell you. And it's hard to move away from that once you've experienced it.
[00:40:24] David Keyes: Yeah, that absolutely makes sense. Because a lot of times what we have to do, too, is then explain to the client, um, how everything works. Because they, in some cases, will actually take over the project and then run it themselves. And it's only when we're actually sitting down and going through, okay, first you do this, then you do that, you know, we'll have a really long readme sometimes. And it seems like, in some ways, targets is a really long readme, um, in code, in a way that, um, you know, can facilitate people continuing to run the code, uh, in a way that is efficient and hopefully makes more sense to them than the long readmes that we give them.
[00:41:02] Will Landau: Yeah, that's a great point. And I've seen users of targets insert dependency graph visualizations in their READMEs. So there's this, uh, JavaScript library called mermaid.js that produces really nice static graphs, and there's this tar_mermaid() function in targets to produce one of those
[00:41:24] visualizations of the dependency graph, but in a static format. And it's possible to embed that into an R Markdown document and share that. And with the description field of each target, which, um, TJ Mahr and Noam Ross, uh, nudged me to implement in targets, I think that could, um, be even more powerful for communicating how the pipeline is set up to somebody else, either who's just consuming the results or sharing in the development.
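(One way this embedding is commonly done, sketched under the assumption of an R Markdown document rendered to GitHub-flavored markdown: a chunk with the option results = "asis" writes the tar_mermaid() output into a mermaid fenced block.)

```r
# Inside an R Markdown chunk with the option results = "asis":
# emit the dependency graph as a mermaid.js fenced code block that
# GitHub renders as a static diagram.
cat(c("```mermaid", targets::tar_mermaid(), "```"), sep = "\n")
```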
[00:41:57] David Keyes: Yeah, that makes sense. I have a couple last questions for you. Um, first, I've heard people describe switching to targets as kind of the equivalent of the efficiency gains that you get, um, if you switch from an R-script-based workflow to using something like R Markdown or Quarto.
[00:42:16] I'm curious if that, um, comparison resonates with you.
[00:42:20] Will Landau: It does. And I remember when I discovered that I could use R Markdown to do homework assignments in grad school, and I just remember how much nicer it was than having an R script over here, maybe some comments on the code, and then my description in my own words and prose in another place to, you know, show the other parts of my work and my thought process. To have those woven together, in a lot of situations, was just really nice.
[00:42:56] And it solved a problem that I didn't know I had until the problem was actually solved. And targets, I think, is similar. Not everybody is immediately faced with these crushingly long runtimes, and so maybe the friction is just enough to fly under the radar. It's still a problem that you may not know you have until it's actually solved.
[00:43:24] And although those are different problems, they're hidden in similar ways.
[00:43:28] David Keyes: Yeah, that makes sense. Great. Well, one last question. Uh, I'm curious what changes you kind of anticipate making to targets in the future? I know you talked about adding the description, um, field. Are there other things you're thinking about as you move forward with targets?
[00:43:43] Will Landau: So in general, most of targets, I think, is in a really good spot. I think that there will always be bug fixes and little features and just general maintenance to do, and that's just part of every actively maintained package. And I'm in it for the long haul. Over the past year, I've focused most of the development on trying to make targets a viable cloud computing pipeline tool.
[00:44:20] And that kind of work has been really important to me and my colleagues. Um, that's kind of where the project is at the moment.
[00:44:30] David Keyes: Great. Well, Will, thank you very much. This has been really useful. It's given me a lot to think about.
[00:44:35] Um, it's honestly, it's inspired me to think about potentially trying out targets. Um, so thank you again for taking the time to chat. I really appreciate it.
[00:44:43] Will Landau: Glad to be here. Thanks for having me on.
[00:44:45] David Keyes: That's it for today's episode. I hope you learned something new about how you can use R. Do you know anyone else who might be interested in this episode? Please share it with them. If you're interested in learning R, check out R for the rest of us. We've got courses to help you no matter whether you're just starting out with R or you've got years of experience.
[00:45:04] Do you work for an organization that needs help communicating effectively with data? Check out our consulting services at rfortherestofus.com slash consulting. We work with clients to make high quality data visualization, beautiful reports made entirely with R, interactive maps, and much, much more. And before we go, one last request.
[00:45:25] Do you know anyone who's using R in a unique and creative way? We're always looking for new guests for the R for the Rest of Us pod. If you know someone who would be a good guest, please email me at david at rfortherestofus.com. Thanks for listening and we'll see you next time.