
R for the Rest of Us Podcast Episode 18: Miles McBain

In this episode, I speak with Miles McBain, a data scientist and R package developer from Brisbane, Australia, about patterns and anti-patterns in data analysis reuse. Miles shares his journey from a generalist software developer to a data science specialist, his passion for R, and the evolution of his coding practices. We delve into the intricacies of code reuse in data analysis, discussing common pitfalls to avoid, the benefits of creating reusable code packages, the process of breaking down large codebases, and how teams can evolve their coding practices to enhance efficiency and maintainability.

Listen to the Audio Version

Watch the Video Version

You can also watch the conversation on YouTube.

Important resources mentioned:

Learn more about Miles McBain on his website and connect with him on Fosstodon and GitHub (@milesmcbain).

Learn More

If you want to receive emails to help you on your R Journey, sign up for the R for the Rest of Us newsletter.

If you're ready to learn R, check out our courses.

Transcript

[00:00:00] David Keyes: Well, I am delighted to be joined today by Miles McBain. Miles is a computer scientist turned data scientist, R package developer, and open source enthusiast. Miles, welcome and thanks for joining.

[00:00:39] Miles McBain: Thank you. Thanks for the invitation.

[00:00:42] David Keyes: I know you're based in Australia. Where exactly in Australia are you located?

[00:00:45] Miles McBain: I'm coming to you from Brisbane. So that was where we had useR! 2018 and I was very proud to be part of that.

[00:00:51] David Keyes: That's great. So tell me a bit about your background. I'm curious how you came to use R and what your daily use of R looks like today.

[00:01:01] Miles McBain: Yeah, I was a software developer for a little while. I was working in Townsville in North Queensland and I moved to Brisbane and I guess I became kind of aware that like I was a bit of a generalist in software development and I didn't have any kind of specialization.

And you know, I felt like I wasn't really in control of the direction of my career because I was just sort of this generalist and I got moved around from team to team and different things. This was maybe like 2012 or something like that. It was shortly after that famous quote about how data scientist is going to be the sexiest job of the 21st century or something.

Um, and yeah, I was tossing up between two specializations at the time. I could either go into cyber security or go into data science.

And I decided to go with data science mainly because I had really fond memories of statistics and probability, which I felt was a little bit unusual. A lot of people didn't enjoy those subjects, but for some reason I did. And so, yeah, I decided, yep, going to do my master's in stats. Um, and then early on in the course, a lot of the stuff was to do with using Minitab and GUI-like stats programs. And since I was already a programmer, I was just like, "no, this doesn't feel good." I tried a few different things and I already knew Python at that time, actually.

And I remember I tried to use Python, um, and I think maybe the Python ecosystem at that time was a bit immature or whatever, but I just remember trying to pip install packages and just getting this spinning "resolving dependencies" thing, that classic Python environment hell. And I was like, this doesn't feel really good. And actually one of the programming languages I learned and really liked up until that point was Ruby. I don't know if you've ever used Ruby, but Ruby had this excellent gem install thing, which was way better than Python. It just always worked. Then I saw some buzz about R and in particular, people were making these really nice looking ggplots and stuff. And yeah, I sat down, I tried R and installed R packages. It just worked, right? And it was that same gem install experience. And I was like, great, I can work with this. So that was pretty much how I came to R. And as I learned more about it, it seemed very quirky and a little bit inconsistent. But at the same time, I was aware that it almost had a lot fewer rules compared to other programming languages. And that appealed to me. A lot of the things that I create, I always like trying to probe it. Like, what is the limit here? What will R actually let me do?

[00:03:13] David Keyes: Well, that's interesting. Actually, if I can pick up on that, anecdotally, I've heard people who come from a more kind of computer sciencey background, which I know is where you come from as well, get to R and they're like, "what is this? Like, I don't like this." And there's a whole kind of like common trope of computer science developers ragging on R for being quirky.

But you had the opposite reaction. You kind of like that. I'm curious why that may have been the case.

[00:03:35] Miles McBain: Well, it just felt very productive, I guess. I guess it felt like whatever I could imagine that I wanted to do there was less kind of stopping me. At first it felt really weird to be working with no scalars, only vectors.

But then, you realize how powerful it is that nearly everything you want to do with data is already vectorized. It feels super productive. It's like, "Oh, I don't have to think about that now." The other thing was like, immutable by default.

In a lot of other programming languages there's always this thing about like, "Oh, should this be mutable or should this be immutable?" And you can get yourself into a real tangle if you forget like which mode you're in when you're calling functions and stuff like that, if they modify their arguments or not. But with R that whole question just goes away and I didn't really realize at that time. I was a bit early in my programming journey, but I later came to realize that this idea of immutability and the sort of like simplifications and guarantees that creates is like a core kind of benefit of like functional programming style.
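The two properties described here, vectorization and immutability of function arguments, can be seen in a few lines of base R. This is a minimal sketch with made-up values and a hypothetical helper, `add_offset()`. It covers ordinary vectors only; environments and reference-style objects such as R6 behave differently.

```r
# Vectorised arithmetic: the conversion applies to every element, no loop needed.
temps_c <- c(18.5, 21.0, 25.3)
temps_f <- temps_c * 9 / 5 + 32

# Copy-on-modify: changing `x` inside the function does not touch
# the vector the caller passed in.
add_offset <- function(x, offset) {
  x[1] <- x[1] + offset  # modifies the function's own copy only
  x
}

original <- c(1, 2, 3)
shifted <- add_offset(original, 10)

original  # unchanged: 1 2 3
shifted   # 11 2 3
```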

[00:04:26] David Keyes: Are there other ways you think coming from your more computer sciencey background gives you a different perspective than other R users?

[00:04:36] Miles McBain: Um, I suppose one. I've done a fair bit of teaching of R to people who are coming from a sciencey background. And I think probably one of the main advantages that you have coming from a more computer science background is you're really at home with the idea of creating your own functions and procedures for things. And R, at its core, is a functional programming language. And so everything's geared towards this idea that you're going to create your own procedures and you're going to pass them around, and that's part of how you create these abstractions. S3 methods and things like that are effectively just procedures with attributes. I feel like people who come to R from a more sciencey background are less in tune with the idea that, "Hey, you can just make your own procedures for things." They're more like, "okay, teach me the procedure and I will follow it."

So they write these scripts that are long series of instructions and this copy pasted code, which I'm sure we'll get to soon. But they don't realize that you can just write a recipe for that and reuse that everywhere.

And I felt like I was a bit more primed to take advantage of that because I had come from a computer science background and already knew the power of creating functions for things.
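As a small illustration of writing the recipe once and passing it around, the sketch below replaces a repeated script step with a function and hands that function to `lapply()`. The `standardise()` helper and the `measurements` data are hypothetical, invented for this example.

```r
# Script style: the same steps copy-pasted for each column.
# height_z <- (height - mean(height)) / sd(height)
# weight_z <- (weight - mean(weight)) / sd(weight)

# Function style: write the recipe once.
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

measurements <- data.frame(
  height = c(150, 160, 170, 180),
  weight = c(55, 62, 70, 81)
)

# Functions are values, so the recipe can be passed to another function
# that applies it to every column.
standardised <- as.data.frame(lapply(measurements, standardise))
```

Every column now goes through exactly the same code, so a fix to `standardise()` reaches all of them at once.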

[00:05:40] David Keyes: Yeah, that makes a lot of sense. You've always struck me as way more on the computer sciencey end of the R users who I come across. So it's interesting to hear your perspective as someone there.

[00:05:53] Miles McBain: Yeah, well you picked it correctly. That is where I started from.

[00:05:56] David Keyes: Right.

[00:05:56] Miles McBain: I wouldn't say I identify heavily with that now, though.

[00:05:59] David Keyes: Yeah. But I think even just in terms of your perspective, I mean, you know, we're going to talk about this blog post that you wrote. I think there are elements in there that reflect that perspective, even some of the things you were just talking about in terms of just making your own functions by default, but then also realizing what the limits are or what issues you might run into by doing this.

I think that's something that people who aren't necessarily coming from a computer sciencey background wouldn't necessarily anticipate in the ways that you clearly have. So maybe we can actually dive into that. The article that you wrote is called "Patterns and anti-patterns of data analysis reuse."

Um, I'm curious, maybe starting out, what do you consider an anti pattern of data analysis reuse? What does that mean to you?

[00:06:50] Miles McBain: Well, I feel like Jenny Bryan did a really good job of introducing the idea of patterns and anti patterns to the R community with her talk, Code Smells and Feels. I think she referenced a bit of Martin Fowler's work from the software world, you know, patterns and anti patterns, and if you haven't seen that talk, it's absolutely spectacular, and I highly recommend people go and check out that talk. That was at useR! 2018, by the way.

[00:07:10] David Keyes: Oh, wow. Okay.

[00:07:11] Miles McBain: So, it's this idea that there are these things that software frameworks and programming languages might lead you to do, and they might seem like a good idea at the time, but there are hidden costs.

And it doesn't necessarily mean bugs. It could be performance penalties or, a classic one in concurrency, deadlocking, where there are certain anti patterns in the ways that you manage parallel processing that mean you can get your program into a situation where it's effectively locked, with two parts of the program each waiting for a piece of information from the other. And there are patterns to avoid that happening. So an anti pattern is this idea that there are ways you can design your systems and design your code that can lead to bad outcomes, and there are ways you can design for better outcomes. I just kind of took that concept because I saw a similar thing happening with code reuse. I feel like there's a discipline that I think is kind of sitting between data science, data engineering, and software development. I want to call it data science engineering, but it's not like plumbing the data. It's like, okay, so we are analysts, we are doing data analysis every day.

We've got lots of projects and context switching. How do we bring some sort of engineering view to design that rather than just having this organically created mess. I feel like you have to be in that organically created mess a few times before you start to realize, like, the patterns that create that organic mess and ways you can get around them. And so this was the kind of idea. It's like, yeah, there are things that I've seen people do and I myself have done. They seem like a good idea at the time in terms of like managing your code and redoing the same sorts of analysis over and over again. But I guess the main theme is like quite often complexity is not managed very well.

And so complexity can ramp up. And it might be to do with like, you add more people to the team or you are doing more context switching than you were. And all of a sudden the strategy you were using falls apart.

And, um, yeah. The analogy I like to use is the technical debt idea. And for people who aren't familiar with that, it's kind of like servicing your car, right? Most people have the concept that, "okay, you get your car serviced every three months, because if you don't, the wheels might fall off or something catastrophic might happen and then you have no car for a long time." With technical debt or process debt or complexity debt in code reuse, a similar thing can happen. If you don't address it and acknowledge it and take steps to mitigate it in an ongoing way, then at some point the complexity will get very hard to manage and potentially catastrophic things will happen.

[00:09:40] David Keyes: Yeah. And you talk about kind of four stages that you've identified. I'll walk through them, and then you tell me what I interpreted correctly or incorrectly.

So the first stage that you talk about is copying and pasting code from one project to the next. So say you're working on one project, you write some code, then you work on another project. You're like, "oh, hey, that code that I wrote for that last project could apply here. Let's move it over there." But then you realize, if you make changes in project number two, then you also should probably go back and make those changes in project number one. Did I get the first step right? Anything you'd add there?

[00:10:19] Miles McBain: So I have these multiple copies of copies. And to me, like, I think I even said it in an article, it almost looks like a virus, like, replicating, and these copies acquire, like, mutations. And then what becomes unclear at times is which one should I use. So I've taken some methodology, some data analysis stuff that I did, and then I used that on project A, and then I copy pasted that on project B, and now project C needs to ramp up. Do I use project A or do I use project B? Or do I like copy paste from both and try to like merge them together? So copy pasting definitely saves you time because it saves you writing the code and saves you having to like think. But the hidden complexity cost comes later when you have all these like subtly different but similar versions of the same thing and you're trying to decide which you should use. And I think in the article I used a real example of like bugs that you've squashed suddenly like reappearing and you're like, "I've copied the wrong version that had the bug. And now I've reintroduced the bug into our workflow." And that's kind of frustrating.

[00:11:20] David Keyes: Yeah, and so then you talk about the next stage being to kind of make a template project with TODOs sprinkled throughout. Can you talk about how that differs from just copying and pasting?

[00:11:32] Miles McBain: Yeah. As soon as you start to realize how copy pasting fails to manage complexity, you recognize the problem, which is that there are too many versions all spread out and they're all different, and what you need to do is centralize. Um, it's a stage I went through, and it's because package development feels hard, modules or whatever feel hard.

And it's like, no, I don't need all that complexity. I can just create a template for myself. Um, and I think this is something that people are reasonably familiar with from like just creating text documents and stuff. If you've ever had to like do like a mail merge or to create a document that had to go to like multiple people or you ever had to like create the same document for different purposes.

You create a structure and then you have, within that structure, the places where you're going to place the data, the content that differs. It's a fairly obvious idea and it can work pretty well. And so teams try and apply that to their code. So they create this template. And then the problem with templates is that we want to ask more and more and more of our templates. So we want to have this one template, but actually, if it could just do this, then we'd be right.

And then we put in the ability to do that and say, "oh, if the template could just do this", and that happens lots and lots of times. And then you have this situation where you're starting to inject a lot of complexity into the template itself. And the template is becoming this pseudo programming framework where you might be using the mustaches, like the templating stuff, to parameterize your templates.

And then you might have special syntax that replaces things you put in there with other things or computes things that the template builds on, and now you've effectively built something like a programming language almost, except you didn't mean to do it and you didn't design it. It kind of just grew, and yeah, templates can get really complicated. And particularly if you're working with a few different people, there's always this tension of like, "well I want the template to do this for me, so I'm going to put this feature in the template." Templates are kind of hard to test because they have this explosion of parameters and things, and so if I add a feature to the template, I'm going to test that feature works, but I might not test all the cases that affect what you need to do with it. So you have this thing of people stepping on each other's toes and changing the template and not realizing that that breaks how someone else was using the template. That's the sort of complexity hell you find yourself in there.
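To make the "mustaches" concrete, here is a minimal sketch of a code template with `{{placeholder}}` slots and a tiny renderer written in base R. The template text, the `render_template()` function, and the parameter names are all hypothetical. Real projects often use a templating package instead, but the idea is the same.

```r
# A hypothetical analysis template with mustache-style placeholders.
template <- c(
  "data <- read.csv('{{input_file}}')",
  "data <- subset(data, year >= {{start_year}})",
  "summary(data${{outcome}})"
)

# Replace each {{name}} with its value; this is the whole "templating engine".
render_template <- function(template, values) {
  for (name in names(values)) {
    placeholder <- paste0("{{", name, "}}")
    template <- gsub(placeholder, as.character(values[[name]]), template, fixed = TRUE)
  }
  template
}

script <- render_template(
  template,
  list(input_file = "survey.csv", start_year = 2020, outcome = "score")
)
writeLines(script)
```

Each new requirement tends to add another placeholder or a conditional section, and the only way to check the result is to render the template and run the generated script. A function with three arguments would give the same parameterization while staying directly testable.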

[00:13:48] David Keyes: Yeah, so it sounds like with both approaches, the kind of just simple copying and pasting and templates, the main issue is that the code just kind of metastasizes and gets way more complex than you originally intended. And, uh, that complexity then becomes its own beast that you have to manage.

And so any of the benefits that you might gain from not having to rewrite the code from scratch are either mitigated or reduced by the fact that you have to maintain that complexity. Is that accurate?

[00:14:21] Miles McBain: I think that's right. And you have to maintain this templating framework that you created, which in itself is not going to be as well designed as the R programming language. So it's like, "well, why didn't we just write it in code to start with?" Because we didn't have a way to manage it. And that's where we get to the next stage, which is like, "okay, can we create our own kind of personal universe of functions and packages?"

[00:14:44] David Keyes: Well, but even before that, in your article you actually talk about the next step being to make one package, like a single package, one benefit being that it forces better practices than using something like a template or copying and pasting. Can you talk about what you mean by that?

[00:15:02] Miles McBain: Yeah. Okay. So I think immediately, as soon as you create a package, and I highly encourage people who want to centralize code to do that, you're sort of triggered. I mean, if you're reading any material like, you know, Hadley's and Jenny's and other contributors' great R Packages book, then immediately you're confronted with all sorts of stuff about, "oh, okay, there's stuff in here about what to do with the documentation." And so you're encouraged to document your work properly and to have nice HTML documentation that people can consume. Or you're encouraged to do some level of unit testing, which wasn't really potentially even possible with the template. And you might be encountering that idea for the first time. Then you have multiple people contributing to this code base with unit tests being written. That might even lead you down the path of continuous integration, where the unit tests have to pass in order to make the changes or something like that. So there's this path that creating packages puts you on. And yeah, a lot of that can only improve the output.
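The documentation and unit testing habits mentioned here might look like the following for a single function. This is a sketch only: `pct_change()` and the file names in the comments are hypothetical, the `#'` lines are roxygen2-style comments, and the tests use the testthat package, so that part is skipped if testthat is not installed.

```r
# R/pct_change.R (hypothetical file inside the package)

#' Percentage change between two values
#'
#' @param old Baseline value(s); must be non-zero.
#' @param new New value(s).
#' @return Numeric vector of percentage changes.
#' @export
pct_change <- function(old, new) {
  if (any(old == 0)) stop("`old` must be non-zero")
  (new - old) / old * 100
}

# tests/testthat/test-pct_change.R (hypothetical test file)
if (requireNamespace("testthat", quietly = TRUE)) {
  library(testthat)

  test_that("pct_change() handles increases and decreases", {
    expect_equal(pct_change(100, 125), 25)
    expect_equal(pct_change(c(50, 200), c(25, 300)), c(-50, 50))
  })

  test_that("pct_change() rejects a zero baseline", {
    expect_error(pct_change(0, 10), "non-zero")
  })
}
```

Once tests like these exist, a continuous integration service can run them on every proposed change, which is the path described above.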

[00:15:56] David Keyes: So how do you decide at what point, when you've written some code, it makes sense to put it into a package. I mean, obviously it sounds like you tend to go that direction. So if say you're working with someone and they're deciding, you know, is this code that's worth putting into a package?

What's your, your rubric for making that decision with them?

[00:16:15] Miles McBain: Hmm. I have a very low threshold for that, I guess. First of all, it's like, is this thing reusable? And sometimes there is a little bit of work to separate the reusable parts from the parts that are only specific to what they're doing at that time. So sometimes they'll say, "no, no, I can't put this into a package because it's too specific to the project that I'm working on now." But often that's just a case of refactoring what they're doing a little bit, where you're like, okay, hang on, so we can separate the project specific stuff from the domain specific stuff, and we can make that into a package. There's ways you can go about that.

You can have functions that take arguments, or take even other functions that tell them how to do the domain specific part or the very specific part of what they need to do. And that takes a little bit of practice, right? But it's about looking at what's happening and seeing, like, okay, so what is the part that is genuinely reusable? And what is the part that is specific to this thing? And I think rather than like, looking at code that's reusable, it's actually a bit about, like, looking at concepts that are reusable, you know, so an example might be like, "okay, it seems like we are always like writing these like same or similar like SQL queries to get this like stuff from this database. Maybe we could, like, wrap over those with some parameters, and that would simplify our code a lot, rather than having to have essentially the same SQL copy pasted, modified slightly for this specific case. Maybe we can, like, parameterize that, and then wrap that up in a package." Something like that.

That's a really simple example.
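The SQL example can be sketched as a small function that builds a parameterized query. Everything named here is an assumption made for illustration: the `incidents` table, its columns, and the `incidents_query()` function. The `?` placeholder syntax also varies between database backends.

```r
# Build a parameterised query for a hypothetical `incidents` table.
# Values travel separately from the SQL text rather than being pasted into it.
incidents_query <- function(from, to, incident_type = NULL) {
  from <- as.Date(from)
  to <- as.Date(to)
  stopifnot(!is.na(from), !is.na(to), from <= to)

  sql <- "SELECT * FROM incidents WHERE incident_date BETWEEN ? AND ?"
  params <- list(format(from), format(to))

  # Optional filter: only added when the caller asks for it.
  if (!is.null(incident_type)) {
    sql <- paste(sql, "AND incident_type = ?")
    params <- c(params, list(incident_type))
  }

  list(sql = sql, params = params)
}

query <- incidents_query("2023-01-01", "2023-03-31", incident_type = "structure fire")

# With an open DBI connection `con`, the query could then be run as:
# DBI::dbGetQuery(con, query$sql, params = query$params)
```

Each project then calls `incidents_query()` with its own dates instead of carrying a slightly edited copy of the SQL.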

[00:17:39] David Keyes: Yeah. Hmm.

[00:17:41] Miles McBain: The concepts would be like our core datasets. So it'd be like, okay, when I was at the Queensland Fire Service, our core datasets were about incidents. So if we're talking about incidents that are happening and the location where they happened and the type of thing that they were, that's a concept that is reusable across all our work. And in my current work, I work with not for profits, like charities. So something that would be reusable across them would be like, okay, charities are always at some stage gonna ask people for money, right? They have a variety of ways that they want to determine how much that should be. Um, and so the concept of asking people for money, that's shared across all charities. And so there can be a package that's like, calculate how much to ask someone.

So I guess what I'm saying is often it's not like stare at the code and somehow extract the common bits. It's more like identify the concepts that the code is wrapping up and identify the shared concepts and then pull out the code for that.

[00:18:42] David Keyes: That makes a ton of sense. One thing I was wondering about too is, you know, I know I talk to people sometimes and I'll encourage them to build a package and they'll be like, "oh, that's too much work. I'm going to spend so much time just putting it together. Is it really worth it? Is it really going to save me time to do that?"

How do you answer that type of question?

[00:19:02] Miles McBain: Well, it's just not a lot of work. I mean, there are people, like, I think Jim Hester, I can remember a great talk from him. He created an R package in 20 minutes. It's literally like devtools, create project, or create package, or something like that. I can't actually remember what it is right now, but it's a one liner.

And you'll get a package skeleton and then you can create your functions, run check, and then you've got a package. So there's actually not a lot to it these days. I think the difficulty is actually feeling confident and understanding what you're doing. So that's the work. The work is understanding: what is an R package? What do I need to do? Why do I need to do it? So it's more like understanding the structure, like, this is where the code goes in the R folder. And this is where the tests go in the tests folder. And this is what happens when I run check. Um, so it's more like a kind of understanding deficit rather than like, "Oh, I have to do like a ton of work now to write this package."
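To show how little is in a package skeleton, the sketch below writes one by hand into a temporary directory using only base R. The package name `incidenttools` and the function `response_minutes()` are hypothetical. In practice a helper such as `usethis::create_package()` generates this layout in one call, and `devtools::check()` then runs the checks.

```r
# Build a minimal package skeleton by hand in a temporary directory.
pkg <- file.path(tempdir(), "incidenttools")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg, "tests", "testthat"), recursive = TRUE, showWarnings = FALSE)

# DESCRIPTION holds the package metadata.
writeLines(c(
  "Package: incidenttools",
  "Title: Helpers for Incident Analysis",
  "Version: 0.0.1",
  "Description: Shared functions for recurring incident analyses.",
  "License: GPL-3"
), file.path(pkg, "DESCRIPTION"))

# NAMESPACE lists what the package exports.
writeLines("export(response_minutes)", file.path(pkg, "NAMESPACE"))

# Function code goes in the R/ folder.
writeLines(c(
  "response_minutes <- function(dispatched, arrived) {",
  "  as.numeric(difftime(arrived, dispatched, units = \"mins\"))",
  "}"
), file.path(pkg, "R", "response_minutes.R"))

# Tests go under tests/, here in the layout the testthat package expects.
writeLines(c(
  "test_that(\"response_minutes() returns minutes\", {",
  "  start <- as.POSIXct(\"2023-05-01 10:00:00\", tz = \"UTC\")",
  "  expect_equal(response_minutes(start, start + 480), 8)",
  "})"
), file.path(pkg, "tests", "testthat", "test-response_minutes.R"))

list.files(pkg, recursive = TRUE)
```

A complete testthat setup also needs a small `tests/testthat.R` runner file, which `usethis::use_testthat()` creates.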

[00:19:56] David Keyes: Right.

[00:19:56] Miles McBain: The good thing about R is that once you understand how to make one package, you understand how to make 20 packages. It's the same process every time.

So, that would be how I'd encourage people. I would say, look, there's a little bit of learning to do up front to make your first one. But after you make your first one, every single other package will feel very easy and very quick.

[00:20:14] David Keyes: Yeah. Yeah, I mean, I've always been surprised. I remember when I was first learning to make packages. I don't remember exactly when it was. And, it seemed very scary from the outside, but once I actually dove in and did it, I was like, oh, this is basically just creating functions with a few additional things tacked on top of that.

But,

[00:20:33] Miles McBain: Yeah. And, and so that is actually a thing you identified there. Creating the functions is sometimes the hard thing.

And we talked about that a little bit before, about how people coming from a science background are like, "Oh, you know, functions, what is that?" It seems a bit mysterious, and particularly the thing that breaks people's brains is functions being passed to other functions and stuff like that. But once you get over the hurdle of what a function is, it's a pretty core idea that gets reused everywhere, including in package development.

[00:21:01] David Keyes: Mm-hmm.

[00:21:02] Miles McBain: So maybe if we're like trying to help people get to the point of creating packages, it's like, "well, hang on, are you comfortable with functions?"

[00:21:07] David Keyes: Yeah, yeah. Maybe this is specific to me, but one thing I see when people are learning to write functions, you were sort of hinting at this, is they'll end up writing massive functions. Like, say, to clean my data, and it

[00:21:21] Miles McBain: Yeah.

[00:21:21] David Keyes: is, like, 200 lines of code.

[00:21:24] Miles McBain: Yeah.

[00:21:26] David Keyes: Yeah, I'm curious, like, how you explain the benefit of kind of breaking that into, you know, say, multiple functions.

How do you talk to people about why that may or may not be the best approach?

[00:21:37] Miles McBain: Yeah. I might have even talked about that in the article. I can't remember, but the first package that you write is a do everything package. I feel like that would be self revealing in a way. Because that massive function that they write with all those parameters, it's basically a template. That's what I wrote in the article. So the issue that you'll have is that maintaining that thing is going to be really challenging. With all these different parameters and combinations of parameters, you're not really going to be able to test it. And you're going to have trouble. And because of that, you're going to have that situation with the template where you're like, "oh, okay, someone changed the function and that now breaks my work. Um, so I'll just use this different version of the function that I know works for mine." And you're back to having multiple copies of the same thing.

So I guess I'd point out that the more complicated you make the function, the more likely it is that it won't be able to be reused.
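One way to picture the alternative to a 200-line, many-parameter cleaning function is a handful of single-purpose steps that can be tested and reused separately. The functions and the `raw` data frame below are hypothetical examples, not code from the article.

```r
# Anti-pattern: one function whose flags multiply the cases to test.
# clean_data(df, drop_missing = TRUE, fix_names = TRUE, dedupe = FALSE, ...)

# Smaller single-purpose steps, each testable on its own.
fix_names <- function(df) {
  names(df) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(df)))
  df
}

drop_missing <- function(df, cols) {
  df[stats::complete.cases(df[, cols, drop = FALSE]), , drop = FALSE]
}

dedupe <- function(df) {
  df[!duplicated(df), , drop = FALSE]
}

raw <- data.frame(
  `Donor ID` = c(1, 1, 2, 3),
  `Gift Amount` = c(20, 20, NA, 50),
  check.names = FALSE
)

# Each project composes only the steps it needs.
clean <- dedupe(drop_missing(fix_names(raw), cols = "gift_amount"))
```

A project that must keep duplicates simply leaves out `dedupe()`, instead of asking for another flag on a shared mega-function.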

[00:22:28] David Keyes: That's interesting.

[00:22:30] Miles McBain: Yeah, and I don't know, this might be a bit high level for people trying to create their first functions. But the thing I think is, the reason why creating functions is a good and useful thing is because it lets you express what you're trying to do in terms of the domain. So using my Queensland fire example, I can write functions that talk about calculating response times and finding the locations of things that fall within areas, and the code that I am writing has function names that say that, "find things in this" and "time between this" and whatever.

So for someone in our domain, they might not even know R, but they can be like, "Oh yeah, I see what you're doing because I understand that core to this domain is points and times and road networks and all that sort of stuff." And that is represented there in the code. So that's kind of the core power of creating a little vocabulary for yourself out of functions: functions can now represent the domain knowledge really clearly. And so for people just starting out, I don't know if there's an easy way to demonstrate that. But I saw Hadley actually give a pretty good workshop on stringr where he was like, this is what it looks like if you're just throwing regexes at this thing.

You know,

[00:23:44] David Keyes: Mm

[00:23:45] Miles McBain: Whereas, if we break this down into a series of verbs that you can call as a kind of workflow on your strings, it's much more clear than just regex, you know.

[00:23:55] David Keyes: Yeah. Yeah.

[00:23:57] Miles McBain: I think that was a pretty good example because regex is as hard as it gets to parse, right? Maybe something like that might be good.
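The idea of a small domain vocabulary built from functions, like the response time and location examples described above, can be sketched as follows. The function names, the sample incidents, and the rough bounding box are all invented for illustration. Real spatial work would more likely use a dedicated package than a simple box test.

```r
# Hypothetical domain helpers for incident data.
response_minutes <- function(dispatched, arrived) {
  as.numeric(difftime(arrived, dispatched, units = "mins"))
}

# TRUE for points inside a rectangular area given as xmin/xmax/ymin/ymax.
within_area <- function(lon, lat, area) {
  lon >= area$xmin & lon <= area$xmax & lat >= area$ymin & lat <= area$ymax
}

incidents <- data.frame(
  dispatched = as.POSIXct(c("2023-05-01 10:00:00", "2023-05-01 11:30:00"), tz = "UTC"),
  arrived = as.POSIXct(c("2023-05-01 10:08:00", "2023-05-01 11:45:00"), tz = "UTC"),
  lon = c(153.02, 146.80),
  lat = c(-27.47, -19.26)
)

# Rough illustrative box, not an official boundary.
brisbane_area <- list(xmin = 152.9, xmax = 153.2, ymin = -27.6, ymax = -27.3)

# The analysis code now reads in the terms of the domain.
incidents$response_min <- response_minutes(incidents$dispatched, incidents$arrived)
in_brisbane <- incidents[within_area(incidents$lon, incidents$lat, brisbane_area), ]
```

Someone who knows the domain but not R can still follow the last two lines, which is the point being made about vocabulary.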

[00:24:04] David Keyes: Yeah. So, we've been talking about making a package and the benefit that that offers compared to copying and pasting or making a template. But in your article, you actually talk about potential downsides of making a single package, which in many ways follows the same path as the other two approaches, which is it metastasizes, it explodes, it tries to do everything.

So can you talk a little bit about the downside or what that looks like when a package kind of blows up and what your solution to that is?

[00:24:38] Miles McBain: Yeah, I mean, what it looks like when a package blows up, and this has happened in my last two jobs, I've seen this exact same thing happen. So I'm not sure if it necessarily happens everywhere or if it's just places I'm involved with, but the namespace becomes so big that people kind of carve out their own little niche inside the namespace.

It's almost like there's sub packages within the package, right?

And people are like, "well, this is the area that I know about. And so I'm not really going to step too much outside of that. Like, these are the functions that I use and the ones that I know about. And if I can't find what I'm looking for in here, I'll just assume that it doesn't exist."

And then I'll be like, "okay, I better add this to the package." And so you get this situation happening where you end up with a bunch of similar functions in the package that do similar things and even have similar kinds of code in them. And you're like, oh, okay, this is not ideal. So basically, just seeing the edges of this massive thing is hard. And so understanding what it can and can't do is sometimes challenging. So people just make assumptions because it seems too hard to figure it out.

And then you get to other things like, well, if we're doing the right thing and we're writing tests for this thing, then the test suite is going to be enormous. Um, and so then it's going to take a while to run. And so then people might go, you know what, I'll just skip running that right now because it takes a while to run. And then test failures accumulate. And then now you want to submit one little tiny change to a function. And you do the right thing and you run the test suite. And there's like a dozen test failures. And a lot of them are in stuff that you've got no idea about.

[00:26:17] David Keyes: Right.

[00:26:17] Miles McBain: It gums up the works, slows everything. You know, so basically I think once things get to a certain size, um, they almost encourage the accumulation of technical debt. And the thing that encouraged the accumulation of that technical debt was the fact that the test suite was too large, so it wasn't getting run regularly. Or the namespace was so large that no one was taking the time to read it and understand it properly.

[00:26:42] David Keyes: Yeah. And so your solution is to make what you call a verse of packages. So, you know, similar to the

[00:26:48] Miles McBain: universe, a tidyverse,

[00:26:50] David Keyes: Right. So could you maybe give an example of a verse of packages that you've seen that work well together? I guess I'm wondering specifically like how have you seen the different types of packages broken down and what does each of 'em do typically?

[00:27:07] Miles McBain: So I think in that article, um, I linked to a blog post by Emily Riederer and she had a really nice way of thinking about it where she's like, just imagine each package is like a specialist team member that you don't have. What happens in a data science team, at least in my experience, is you have to wear like a few hats because you don't necessarily have like, the cloud compute specialist or the database specialist or, you know, like you don't have someone who can like, um, yeah, figure that all out and package it up for you.

So you end up having to do that. And what you want to do is just cram: how do I provision cloud services? How do I store stuff in the cloud? How do I, you know, do all that stuff? And then you want to like put that into a package and then you want to forget about it. Because it's not like your core work, right?

But you do need that functionality. So I think that's a really good way to think about it. So I think about my last job and we had like, you know, we had a data package that was called like QFES data.

And that was like the package I was talking about before, where, um, you can just have, at your fingertips, like give me the incidents, give me the station locations. You have functions that give you all that stuff. We had a visualization package that had like commonly used like visualization layers. Um, so these are like things for, um, interactive maps and for ggplots, like little geoms and layers that help us build up our maps like really quickly. We did a lot of mapping in that job. There might be like more domain specific stuff. So we were creating Shiny applications. And we needed to like have the ability to like cache things on AWS rather than locally.

So we have like a little cache module that we wrote, like a little bit of domain specific thing that makes our particular use case easier. So it's stuff like that, I guess. Data and viz and all that stuff are pretty obvious domains, but then I guess they could also be broken down further, always, depending on, like, how big that package gets and how much of that thing you're doing. So, like I said, we combined, like, static ggplot viz with web ish viz in one package.

And part of the reason we did that is because we wanted the ability that whether you were running and creating an interactive map or a static map, we wanted them to look identical.

So we needed like some level of like parity between the features, but you can see another team going like actually we're going to go interactive viz is its own thing and static viz is its own thing and they're two separate packages.
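A minimal sketch of the parity idea Miles describes, in base R. Every name here is hypothetical and invented for illustration; the real package presumably wrapped ggplot2 and an interactive mapping library, which this sketch replaces with plain lists. The point is only that both renderers read their styling from one shared function, so a static map and an interactive map cannot drift apart:

```r
# Hypothetical sketch: none of these names come from the packages discussed.

# One source of truth for map styling, shared by both renderers.
map_palette <- function() {
  c(incident = "#D55E00", station = "#0072B2")
}

# Stand-in for a static (ggplot-style) layer: just a description of the layer.
static_layer_spec <- function(kind) {
  list(renderer = "static", colour = map_palette()[[kind]])
}

# Stand-in for an interactive (web map) layer with the same styling.
interactive_layer_spec <- function(kind) {
  list(renderer = "interactive", colour = map_palette()[[kind]])
}

# Both renderers agree on the colour because neither defines its own.
identical(
  static_layer_spec("station")$colour,
  interactive_layer_spec("station")$colour
)
```

A team that splits static and interactive viz into two packages would likely need a third, shared package (or a dependency between the two) to keep this single source of truth.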

[00:29:20] David Keyes: Yeah, that makes sense. What criteria do you use to decide when to take one large package and break it up into multiple packages?

[00:29:32] Miles McBain: Oh yeah, I mean, I think it's just vibe, it's a feeling, but it's also when I start to see the technical debt being created, because that's a signal that things have gotten too big and too complicated. So if I start to see cases where the test suite hasn't been run when someone's contributed a feature. Or there is like duplication or like I want to say non orthogonality, but I don't know, people in computer science talk about orthogonality for like, design, um, but they often use it in the wrong way and people that come from a statistics background actually will understand it better where like if you have an orthogonal basis, you know, you don't have this kind of like correlation between features and functions.

And so you can see like a sort of non orthogonality or a correlation creeping into the package where there's things that do like almost the same thing but not quite and you know there is maybe like one concept in the middle that could be like pulled out and now both these things would be separated. So yeah, it's looking and seeing the signs of bad stuff happening.
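A minimal sketch of that kind of overlap, in base R. The function names and the response-time example are hypothetical, made up for illustration rather than taken from any package mentioned here. Two functions do almost the same thing, and the "one concept in the middle" (cleaning the input) can be pulled out so each function is left with a single job:

```r
# Hypothetical example: all names here are illustrative assumptions.

# Before: two functions that each repeat the same cleaning step.
mean_response_time_dup <- function(times) {
  times <- times[!is.na(times) & times >= 0]
  mean(times)
}
median_response_time_dup <- function(times) {
  times <- times[!is.na(times) & times >= 0]
  median(times)
}

# After: the shared concept is pulled out into its own function.
clean_times <- function(times) {
  times[!is.na(times) & times >= 0]
}
mean_response_time <- function(times) mean(clean_times(times))
median_response_time <- function(times) median(clean_times(times))

# Missing and negative values are dropped before summarising.
times <- c(4, 6, NA, -1, 8)
mean_response_time(times)    # 6
median_response_time(times)  # 6
```

After the change, a fix to the cleaning rule happens in one place instead of two.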

[00:30:31] David Keyes: Yeah. I mean, it seems like it's a very similar process between deciding when to go from that kind of template to a package and then when to take a single package and break it into a collection of packages.

[00:30:44] Miles McBain: Right. I think it is, at least for me anyway, I think it is a very similar process of like looking at code and deciding to put it in a function and deciding how many functions that should be.

And then looking at a set of functions and deciding whether they should be a package and how many packages they should be.

Yeah, it's about trying to like make the complexity of dealing with that manageable.

[00:31:00] David Keyes: Yeah, that makes sense. So, kind of wrapping up, you say in the article, you're a bit downbeat, you say it's sort of inevitable that people or organizations will go through the stages you outlined. I'm curious why you think it's inevitable.

[00:31:14] Miles McBain: Well, I don't know exactly how downbeat, um, but I am saying there is an inevitability to it, I think. In some cases there's always an inevitability to things, and it's good to acknowledge that. Like, I'll get back to your question in a second, I promise.

But, um, when I first started using Git, um, I used like a GUI tool. And people always say, no, use git on the command line, use git on the command line. But I think actually there's kind of an inevitable reaction to the complexity of git that in the beginning it feels good to hide away that complexity. And it's actually just better to acknowledge that inevitability rather than trying to like beat the people over the head with the command line and say they should use that from the start. And that's sort of the perspective I'm taking here. And as we've discussed, even initially, the complexity and understanding what a function is seems hard and overwhelming. So people are kind of on a path, and as they progress down that path, they're going to have different solutions available to them to manage this problem of like reusing data analysis. And what complicates it is teams need to do this together. So you might have some people on the team who are like, let's go, functions, packages, and you might have other people on the team that are like, oh, we don't need that, we don't need that template, it's fine.

Um, and so the team is kind of on a journey together of like everyone getting on the same level of understanding about what's the best way to do something. So those two things combine, to me, to mean that there's always going to be a bit of a journey that people and teams need to go on together to arrive at this place, and you'd be very lucky to get dropped into a team of experienced, seasoned data scientists who could all be like, yep, perfect, let's just make packages and let's go. I'm not trying to be downbeat about that or fatalistic about that. I'm more like, I think it's good to acknowledge and not try and like set the bar impossibly high, um, for people and not try and feel bad about the fact that we have to go on this journey, I guess.

[00:32:58] David Keyes: I mean, do you think there's value in going on the journey because people then see the downsides and appreciate the good sides of the approach that you lay out at the end of your article?

[00:33:07] Miles McBain: I think so. By reading the article and having conversations like this and just getting that content out there, I feel like you can at least short circuit the journey a little bit.

People might read and be like, what's this guy talking about? You know, templates are fine. But then they'll hit that point and be like, oh wait, this is what they were talking about. I'm in that complexity hell and I've made the template too complicated. This is exactly it. And then they might reach that realization a little bit faster, I guess. That's what I'm hoping for.

[00:33:33] David Keyes: Yeah, that's great. Um, great. Well, this was really enlightening. And for me, the types of folks that I tend to work with are very much on the kind of first one or two steps. So hopefully this will give some people some food for thought in terms of what might come after that. So, um, thanks again, Miles, for taking the time to chat. I really appreciate it.

[00:33:54] Miles McBain: Yeah. Thanks very much.

By David Keyes, August 29, 2024