R for the Rest of Us Podcast Episode 4: Abdoul Madjid
In this episode I speak with Abdoul Madjid about his journey to R. To learn more about Abdoul, you can find him on Twitter and on Mastodon. His data visualization, done as part of Tidy Tuesday, is on GitHub.
Learn More
If you want to receive emails when we publish new podcast episodes, sign up for the R for the Rest of Us newsletter. And if you're ready to learn R, check out our courses.
Audio Version
Video Version
The video version has a walkthrough of how he made his COVID-19 map using R.
Transcript
[00:00:00] David Keyes: Well, I'm joined today by Abdoul Madjid. Abdoul is a full stack developer who works for FirstEcho. That's a smart business integration company based in France. He writes code in R and Python that helps summarize news for their clients. And on the side, he does some amazing, amazing data visualization in R, including, uh, making some fantastic looking maps, which is what we're going to talk about today.
[00:00:32] The reason, um, I reached out to you was you made this visualization, um, it's based on data throughout 2021 showing, uh, COVID-19, um, rates, um, in all the states in the U.S. So let's go ahead and look at your code. Um, ah, that's really interesting. Well, you're only loading one package here. I know you use a couple other packages down below.
[00:01:06] Um, and then go down here to where we're actually, um, bringing in this data. So, oh wait, sorry, let me restart my session. I was working on something before. All right, so I'll load that, and then let me go down where I'm bringing in the data. So, COVID data, so you're bringing in this data, um, directly from, uh, the New York Times, which has published their data on GitHub.
[00:01:41] So just, if I take a look at this data, super long data, date, state, um, FIPS, which I don't know if you actually use, but then cases and deaths, um, okay. Um, and then you have, um, USA states, which I'll just take a look at, which looks like all you're doing is bringing in that data and you just have the state and then the population, uh, of that state.
[00:02:13] Abdoul Madjid: Yes.
[00:02:14] David Keyes: Okay, so next thing you do, talk me through these lines here. Line 63 through 65, because what I understand is you're bringing in the data, you're bringing in the geospatial data. So talk about how you did that.
[00:02:33] Abdoul Madjid: My initial ambition, when I was about to make that data viz, was to display this map at county level.
[00:02:46] Um, my ambition wasn't initially to put, uh, the data at, uh, state level, but, uh, with my initial ambition, I quickly faced a problem. In the United States, you have approximately 3,000 counties. So, to display, um, a maps facet with 3,000 counties for, uh, each of the 365 days, it requires a lot of computing resources.
[00:03:31] Unfortunately, my computer is not that powerful. So I quickly turned my interest to state level data. Even with that compromise to use state level data, I faced another problem. I need to use, uh, a shapefile. Why? Because with a high resolution shapefile, [inaudible]. So I had to choose, uh, a shapefile with low resolution.
[00:04:17] So I turned my interest to the albersusa USA states shapefile because, uh, this package by Bob Rudis offers, uh, a low resolution state shapefile. So I generally use that when I want to make a maps facet with low resolution. So it's why.
[00:04:54] David Keyes: And just so I understand, yeah, I'm sorry to interrupt you, but just so I understand, you're saying when you're making this map, or the set of maps, if each map was very high resolution, it would take a really long time to make them, right?
[00:05:10] Abdoul Madjid: Yes, and in most cases, [inaudible]. Okay. So, that's why I, I carefully chose my shapefile too. Yes.
[00:05:22] David Keyes: Got you. And so this code here, um, if I understand you correctly, is what's actually bringing in, um, your shapefile.
[00:05:34] And so if I run that, okay, perfect. So if I run that, I can see what's returned here is, um, some information.
[00:05:44] Um, and then most important, because it's a shapefile, it has this geometry column. Um, okay, go ahead and then talk about what you did next after you brought that data in.
[00:05:58] Abdoul Madjid: Yes. The next line is to transform the projection of, uh, the United States, uh, um,
[00:06:14] David Keyes: From the shapefile?
[00:06:16] Abdoul Madjid: Yes, to transform the projection of the shapefile.
[00:06:18] So it's just for that that I use the st_transform function of the sf package.
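The reprojection step described here can be sketched as follows. This is a minimal illustration, not the code from the episode: the toy polygon, the object names, and the target projection (EPSG:5070, a US Albers equal-area projection) are all assumptions chosen so the example runs without downloading a shapefile.

```r
library(sf)

# Toy polygon standing in for a state boundary (hypothetical coordinates),
# stored as longitude/latitude (EPSG:4326).
box <- st_polygon(list(rbind(
  c(-124, 42), c(-117, 42), c(-117, 46), c(-124, 46), c(-124, 42)
)))
toy_states <- st_sf(name = "Toy state", geometry = st_sfc(box, crs = 4326))

# Reproject the geometry. EPSG:5070 is an assumption for this sketch,
# not necessarily the projection used in the original script.
toy_states_proj <- st_transform(toy_states, crs = 5070)

# The attribute columns are unchanged; only the geometry and its CRS differ.
st_crs(toy_states_proj)
```

Plotting `toy_states_proj` with `geom_sf()` would then draw the shape in the projected coordinate system, which is what produces the curved graticule lines mentioned in the conversation.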
[00:06:27] David Keyes: So. If I just actually add this, for example, and if I then, let's say I pipe this into
[00:06:38] ggplot, uh, wait, did I do that right?
[00:06:47] Oh, you know what? Sorry. It's display. Hold on. Let me change it. So it actually,
[00:06:54] uh, okay. So that's what displays, um, if you don't change that. Oh, and actually, first thing I notice is, one nice thing about this shapefile is it brings Alaska and Hawaii and puts them down here, which is obviously not where they're located, but for making a map, um, it's a little easier. So if I then add what you added here, this is going to change the projection.
[00:07:26] So in other words, it's going to change the way that it'll show up. So, ah, interesting. So you can even see, actually, the, the longitude and latitude lines, how it's kind of curved, um, cool. That does look much better. Okay. Um, and so then you save this as an object, USA states geom, um, which we'll see in a second.
[00:07:56] I think you use later on when you're actually making the map. Um, all right, cool. So, um, here, um, this is where you're actually bringing in the COVID data or not bringing it in. You're doing a bit of manipulation on it. So, um, if we can just talk through this and let me even run it kind of a couple lines at a time.
[00:08:23] So, for example, if we run down to here, um, you are grouping by state and by the FIPS code, which is just an identifier, and then arranging by date. Um, talk about this, because actually I wasn't sure that I quite got what was going on here.
[00:08:49] I know you wrote this code a while ago.
[00:08:52] Abdoul Madjid: Each,
[00:09:02] generally, the New York, uh, the New York Times publishes, uh, I mean, the cumulative, yes, cumulative number of cases. So I need to, for each day, go to the previous day's number to make the subtraction, to have the real number of cases for this, this unique day.
[00:09:37] So that is why I made this manipulation: a lag to take the value of the previous row, and after that make the difference to have the number, the exact number of cases for this day.
[00:09:54] David Keyes: Okay. And so all of this down here is, um, part of calculating that, right?
[00:10:00] Abdoul Madjid: Yes. The number of cases.
[00:10:02] David Keyes: Taking it from the cumulative totals and breaking it so it's, you know, by, uh, day.
[00:10:11] Abdoul Madjid: Yes.
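The cumulative-to-daily step discussed above can be sketched with dplyr as below. The data values and object names are made up for illustration; only the grouping by state and FIPS code, the ordering by date, and the use of a lag come from the conversation.

```r
library(dplyr)

# Toy cumulative counts (hypothetical numbers), one row per state and day.
covid_toy <- tibble(
  state = rep(c("Oregon", "Ohio"), each = 3),
  fips  = rep(c("41", "39"), each = 3),
  date  = rep(as.Date(c("2021-01-01", "2021-01-02", "2021-01-03")), times = 2),
  cases = c(100, 130, 145, 200, 260, 300)
)

daily_toy <- covid_toy %>%
  group_by(state, fips) %>%
  arrange(date, .by_group = TRUE) %>%
  # Daily count = today's cumulative total minus the previous day's total.
  # The first day of each state has no previous row, so it becomes NA.
  mutate(daily_cases = cases - lag(cases)) %>%
  ungroup()

daily_toy
```

Grouping before calling `lag()` matters: without it, the first day of one state would be subtracted from the last day of another.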
[00:10:12] David Keyes: Okay. And interestingly, you're not actually, so now you're working with the COVID, you started out with a COVID data, data frame. So this here, and that's actually not, that's not geospatial data at all. So we're just working with a straight data frame now or a table technically. Um, so I, I'm assuming we'll go back later on and use this, uh, USA States when you, we actually want to.
[00:10:42] So it seems like your overall workflow is bring in the kind of raw data, then bring in the geometry, the shape file, do some manipulation on the, the, the data to get it, you know, by day. And then I'm assuming we'll go back and kind of combine that with the geospatial data to make your maps. Is that overall kind of how you work?
[00:11:06] Okay, cool. Um, so this, I'm actually not going to spend a ton of time on this, but what I understand you to be doing is creating, um, a rolling average.
[00:11:22] Abdoul Madjid: Yes, it's that.
[00:11:23] David Keyes: Um, the rolling mean, right?
[00:11:26] Abdoul Madjid: Yes, it's that. When I first tried to plot this viz with fresh data, I saw something that, that
[00:11:39] pushed me to, uh, compute the number of cases with a rolling mean. I think this is a problem that I have with the United States data. When I take the, uh, original number of cases published, I have a problem. It's not really a problem, but it is something that is not really coherent.
[00:12:11] It's that, uh, generally, on weekend days, we have a number of cases that is sometimes almost null, almost equal to zero, which is not coherent, right? So I chose to compute a rolling mean, which I think will better fit the number of cases for that day. So that is why I chose to compute that,
[00:12:43] uh, that, um, rolling average for each day. So
[00:12:52] David Keyes: yeah,
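A small sketch of the smoothing idea, using `rollmean()` from the zoo package, which is the package named later in the conversation. The daily counts are invented, and the 7-day trailing window is an assumption: the transcript does not say which window length or alignment was used.

```r
library(zoo)

# Hypothetical daily counts with two near-empty "weekend" reports followed
# by a catch-up day.
daily_cases <- c(70, 70, 70, 70, 70, 0, 0, 140, 70, 70)

# Trailing 7-day mean. k = 7 and align = "right" are assumptions for this
# sketch. fill = NA pads the first six days, which lack a full window.
rolling_cases <- rollmean(daily_cases, k = 7, fill = NA, align = "right")

rolling_cases
```

The zeros and the catch-up spike are averaged out, so the smoothed series no longer drops to zero on the under-reported days.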
[00:12:53] Abdoul Madjid: After that I chose to compute what I call incidence rates. In French, we call it the incidence rate: we compute the number of cases per, uh, 100,000 people. Something like that. So there I made another, uh, manipulation.
[00:13:25] I divide that, uh, number by the number of inhabitants for each state, because if I didn't, uh, do it that way, I think, uh, the incidence that I have won't reflect the state of emergency for each state, because 100,000 cases in New York state doesn't have the same importance as 100,000 cases in Oklahoma state.
[00:14:05] So this is why I use that way to do that.
[00:14:12] David Keyes: And so that's the USA states that you brought in initially, which has that population. So I can see here, you join that in order to be able to use that population data to calculate, right? Cool. So you join it, then you calculate that incidence rate, um, cases per 100,000.
[00:14:34] And then here, um, it looks like you're actually putting observations into groups because if, um, let me actually bring up the larger version of it. But if I look at your, whoops. Um, hold on. Let me just download it and open it up.
[00:15:01] Because if I look at your original plot, if I look at the legend, for example, it has these very, you, you break states into different groups and you use that for your legend. So it looks like here, you're actually calculating that, um, you're breaking the incidence rate into those groups. And so if I even open up your data, I can see that
[00:15:30] each row has an incidence rate that is one of those categories.
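The join, the per-100,000 calculation, and the grouping into legend categories could look roughly like this. Everything concrete here is an assumption made for illustration: the column names, the round population figures, and the break points (the conversation only confirms that a "greater than 50" category exists).

```r
library(dplyr)

# Hypothetical smoothed daily cases and round-number populations.
cases_toy <- tibble(
  state    = c("New York", "Oklahoma"),
  cases_rm = c(4000, 4000)
)
population_toy <- tibble(
  state = c("New York", "Oklahoma"),
  pop   = c(20000000, 4000000)
)

incidence_toy <- cases_toy %>%
  left_join(population_toy, by = "state") %>%
  mutate(
    # Cases per 100,000 inhabitants, so states of different size compare fairly.
    incidence = cases_rm / pop * 100000,
    # Assumed break points; intervals are closed on the right by default.
    incidence_group = cut(
      incidence,
      breaks = c(-Inf, 5, 10, 25, 50, Inf),
      labels = c("<= 5", "5-10", "10-25", "25-50", "> 50")
    )
  )

incidence_toy
```

With these toy numbers the same raw count lands in different categories for the two states, which is the point made above about New York and Oklahoma.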
[00:15:41] Okay, perfect. And so just to reiterate, like you're doing every, this is just, not just, but this is tidyverse stuff. Like nothing, nothing particular yet, um, that's geospatial at this point. Um, like all you've done is just, well, with the exception maybe of rollmean, which obviously comes from a different package.
[00:16:02] It comes from the zoo package. And I guess there's some stuff with lubridate, which is, well, lubridate's a tidyverse package. You're pretty much, um, just still working, you know, in the tidyverse. Um, okay, cool. Let's go down to where you're actually making your map. So, um, you, so it looks like at first what you're doing is defining some things here that you're going to use later on.
[00:16:33] So you define, for example, background color, which I'm guessing Is that kind of tan, um, background that you use. Um, you defined, uh, the font family that you defined, um, as we talked about before we recorded was one that I didn't have access to. So I changed the font to be able to, to render it here. Um, and then this is your caption, which shows up.
[00:17:03] Okay, so, uh, oh,
[00:17:11] weird. Doesn't like that. Those little, huh, okay. I don't know why I didn't like those symbols that I copied. Alright, so this is your plot. And let me, um, before we even get to the actual plotting of it, let me actually just take these lines and just run them because you're actually starting now. This is where the, the geospatial data that you saved here looks like it's coming back in.
[00:17:47] Um, and so what I see you doing is starting with that geospatial data, and then you're joining the COVID cases RM data so that you have that, mostly that incidence rate is what you want to use. Um, so you join it, and then you create, um, this fancy date, which is, um, using a lubridate, um, function, and those are, I assume, like, right there, it's going to show up.
[00:18:22] Um, so, uh, um,
[00:18:31] walk me through the rest of the, the visualization here. And we can even take it kind of like line by line, if that makes sense. Although, let's see how long it takes to run it. But like, um, let me actually comment that out so we're not saving. Um, so if I run it to this point, let me even run it, see how long it takes.
[00:18:54] Might take too long. That's the problem with geospatial data, right? I mean, you talked about it before, is it can be a bit slow, um, to, to run. Um, but, so my understanding here is you're running the ggplot function and then geom_sf, which is just a geom, like geom_bar or geom_histogram or whatever. But it's specifically for shapefiles.
[00:19:26] Um, right, and so this allows you to plot this. Um, actually, while this is running, quick question. Could you have done it the opposite way? In other words, could I have done covid cases rm and then left join
[00:19:49] USA states by equals
[00:19:56] state equals name? Um, what happens if I, if I do that?
[00:20:04] Abdoul Madjid: Yeah, that way, I think. Um, it's not that I think, I know that in that case it, uh, generally doesn't work because, uh, it won't keep the geometry column. It won't show an error, but, uh, it won't, uh, retrieve my geometry, and so it won't work.
[00:20:37] It's something that, uh, I have had trouble with, uh, several times, so it's something that I avoid. Generally, I take my data frame with my geometry, and I left join it with, uh, the data frame with my data, so it is a practice that I avoid.
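The effect of join order can be checked on a toy example. The objects below are hypothetical stand-ins (a single point instead of state polygons); the sketch only shows which class each join order returns, which is the behaviour discussed in this part of the conversation.

```r
library(sf)
library(dplyr)

# Toy geometry table (one point standing in for a state shape) and a plain
# data table with a made-up incidence value.
states_geom <- st_sf(
  name = "Oregon",
  geometry = st_sfc(st_point(c(-120.5, 44)), crs = 4326)
)
cases <- tibble(state = "Oregon", incidence = 12.3)

# Geometry table on the left: the result is still an sf object.
geom_first <- left_join(states_geom, cases, by = c("name" = "state"))

# Plain table on the left: the result is a regular tibble, not an sf object.
data_first <- left_join(cases, states_geom, by = c("state" = "name"))

class(geom_first)
class(data_first)
```

If the data-first order is unavoidable, passing the result through `st_as_sf()` is one possible way to restore the sf class, provided the geometry column survived the join. Starting from the geometry table, as described above, avoids the extra step.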
[00:20:59] David Keyes: Yeah, I mean, I asked because I've had the same issue and it took me a while to figure out that that's the way you have to do, I mean, there are probably ways around it, but you're, it's just easier.
[00:21:08] And I can even see, for example, down here, if I, if I start with the non shapefile and do a left join. Now, the result is a tbl, whereas, for example, if I do this, the result is a simple feature collection, an sf object. And so, yeah, that makes sense. Okay, um, believe it or not, this actually showed up. Um, but of course it's not actually very far into the code.
[00:21:38] So, what it is, um, and specifically you haven't done, we haven't done any of the small multiples that you do here, where you have multiple maps. So, this is just one map, and everybody is in the greater than 50, because I assume there's like one observation somewhere that's probably for each state in the greater than 50.
[00:22:03] In any case, this isn't what you want to do, but this just shows that we can start to create a map. Um, so, um, talk about this line and, um, what line 126, what that does.
[00:22:19] Abdoul Madjid: At this line I say that I want my facets to be by the dates. There it is with fancy date, because I have made some manipulation with my date. If it was on the raw date,
[00:22:41] I would have [inaudible] printed dates. So there, because I want my facets to be on each day, that's why I want, uh, facet_wrap by this fancy date. And as I want my strip to be on the bottom, I, uh, set the parameter strip.position to bottom. So it is what I have done at that time.
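A minimal sketch of faceting a map by date with the strip label underneath each panel. The geometry is a toy unit square repeated for two days, and the column names `fancy_date` and `incidence_group` are assumptions that echo the names used in the conversation.

```r
library(sf)
library(ggplot2)

# The same toy shape repeated for two days, with a made-up category per day.
square <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
toy_map <- st_sf(
  fancy_date      = c("Jan 1", "Jan 2"),
  incidence_group = c("10-25", "> 50"),
  geometry        = st_sfc(square, square)
)

# One small map per date; strip.position = "bottom" moves each date label
# below its panel.
p <- ggplot(toy_map) +
  geom_sf(aes(fill = incidence_group)) +
  facet_wrap(vars(fancy_date), strip.position = "bottom")

p
```

The `facet_wrap()` call is the same one that would be used with a bar chart or any other geom.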
[00:23:23] David Keyes: And so, just to be clear, the strip. Is the, is the date. So that's where that label then, um, shows up on the, on the bottom of each one. Okay. So this is going to take way too long. So I think what I might do,
[00:23:43] um, I'm just thinking if I can do it, like do a filter. So name equals equals, uh, here I'll do Oregon 'cause we're in, see if that, I'm trying to come up with a way to actually be able to run the code and have it show up, uh, in the time we're talking. Um, let's see if that, how long that takes. I could also actually do,
[00:24:16] do if
[00:24:21] month date equals, so I just do January, see if that works. Wait, let me, yeah. So this is just going to show January in Oregon. Uh, let's see if we can get that to work. Okay, perfect. So, obviously it looks kind of different, um, because it's just showing Oregon, but you can see, that it is faceting based on the date.
[00:24:54] So we have Jan 1, Jan 2, etc. Cool. I really, I mean, to be honest, like, that's the thing I really like most about this map is how you use that faceting to, to, um, make the, you know, all of those small multiples maps. And it's really cool because you're doing that with, you know, this one line of code. That's the same exact thing you do if you're making a bar, a faceted bar chart or anything like that.
[00:25:19] You can do the same thing. Okay. Let's talk about, um, this portion now. So I will say this was actually new to me. Um, the colorspace fill or, yeah, this particular function. So talk about, uh, what's going on in between lines, uh, what, 129 and 140?
[00:25:43] Abdoul Madjid: So, it's my main color palette package in R.
[00:25:52] The colorspace package offers more than color palettes. It, uh, offers, um, the same advantages as the native scale functions, but with additional color palettes, which is the main reason that I regularly use that package, and it has explicit functions to fill the values we map with the ggplot mapping system.
[00:26:28] So there, because I chose to make my number of COVID cases discrete by cutting it into several categories, I can specify there that my values are, uh, discrete. So that's why I use scale_fill_discrete, and as I say that my values are sequential, I can choose the
[00:27:01] sequential scale function. If I suppose that my values were diverging, I would have used scale_fill_discrete_diverging. So colorspace gives explicit names to the functions, uh, to the scales, the color system. So it has the same advantage as the ggplot native scale functions.
[00:27:37] David Keyes: And so I even did, so what I did is, 'cause you also have code in there that's dealing with your legend, but I actually took out all of that.
[00:27:44] Almost, um, I took out one other thing. And so if I just run that, I can see palette that you end up using. Let me actually compare that. So it looks similar. The one thing actually, um, here, let me leave in this rev equals true. I assume that's for reverse, right? Reverses the palette. Is that right? Um, I see.
[00:28:12] So there it goes. Yeah, it gets darker as the rate goes up. Okay, cool. So yeah, I can see how that makes, um, a nice palette. Um, And the rest of this then, if I understand your code correctly, deals with your legend up here. Um, so talk about, um, you made a bunch of tweaks to the legend. Um, and so, you know, for example here, if I do, if I just run it without those tweaks, this is what the legend looks like.
[00:28:47] So talk to me about how you got from that default legend to this legend here.
[00:28:54] Abdoul Madjid: The, uh, guide_legend function offers, um, different parameters to set how we want our legend to look. So there I wanted my legend to be at the top, so I specify that at the bottom, in legend.position. And I want my legend, uh, keys to be,
[00:29:20] David Keyes: So that's here, right?
[00:29:22] Abdoul Madjid: Yes.
[00:29:22] David Keyes: The legend position. Okay.
[00:29:24] Abdoul Madjid: And I want my legend keys to be displayed on one row. So this is why I specify the parameter nrow equal to one. And I can also specify the key height and key width; I want my key to be a square, so I give a certain value to have a square, and I want my key label to be at the right, so I edit the label.position parameter.
[00:30:04] And I made some, uh, modifications to the theme to customize my label text and my legend title text. So that is what I've done, uh, in the guide_legend function.
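The palette and legend choices described above can be sketched as follows. This is an illustration under assumptions rather than the original code: the bar chart stands in for the map, the "Reds" palette and the key sizes are guesses, and the square keys are set through `theme()` elements (`legend.key.width`, `legend.key.height`) instead of `guide_legend()` arguments, which recent ggplot2 releases appear to prefer.

```r
library(ggplot2)
library(colorspace)

# Toy data with an ordered, discrete incidence category (made-up values).
groups <- c("<= 5", "10-25", "> 50")
toy <- data.frame(
  state = c("A", "B", "C"),
  incidence_group = factor(groups, levels = groups)
)

p <- ggplot(toy, aes(x = state, y = 1, fill = incidence_group)) +
  geom_col() +
  scale_fill_discrete_sequential(
    palette = "Reds",              # assumed palette name
    rev = TRUE,                    # reverse the palette order
    guide = guide_legend(nrow = 1) # all legend keys on a single row
  ) +
  theme(
    legend.position   = "top",
    legend.key.width  = unit(0.5, "cm"), # equal width and height give
    legend.key.height = unit(0.5, "cm")  # square keys
  )

p
```

The same scale and theme calls could be added to a `geom_sf()` map unchanged, since the fill scale does not depend on the geom.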
[00:30:28] David Keyes: So if I just look at this, obviously I can see here it's in one row. You said you made those squares by setting the key height and key width to be the same. The position, the label position is on the right. Um, what, I think the label theme, what, what does that do? So I know you're changing the font. Um, but what, and I guess I'm curious, like why, what that does and why it's here versus down in the theme section.
[00:31:01] And that's, that's today. Oh, are you making,
[00:31:05] Abdoul Madjid: I think that if I have,
[00:31:05] David Keyes: oh, go ahead. Go ahead.
[00:31:09] Abdoul Madjid: I had done it in the theme function, I think, I don't know why, but I think it will work, no matter if I put it in the theme or the guide_legend. So I think it's something that I should try and see, but I think even if I set the text in the theme, I think that will work.
[00:31:34] We can try it if we want to. Yes.
[00:31:37] David Keyes: Let's try it. Yeah. Yeah. So I'll, I left the nrow, um, as is, because I, unless I'm, uh, misunderstanding, I think you have to have that there, but, um, okay, so it is actually giving, it's saying key height and key width are not defined in the element hierarchy. So I assume that means they can't go there.
[00:32:04] Abdoul Madjid: So let's try,
[00:32:06] David Keyes: let's see if these other,
[00:32:11] all right. So it doesn't like label position.
[00:32:18] Maybe it is the case that you do have to define. Okay, it doesn't like these things.
[00:32:22] Abdoul Madjid: You can't, you cannot put that. Uh, but I think you can just edit, um, the label, uh, text, uh, font family. What, uh, I was talking about is that if you just, you remove the family in the label and the
[00:32:56] David Keyes: Yeah.
[00:32:59] Move, remove that
[00:33:03] Abdoul Madjid: and the [inaudible].
[00:33:08] Um, yes. And you, you run it. It reproduces the same result, because we had set the font family in the theme function.
[00:33:28] David Keyes: I see. So it wasn't actually necessary is what you're saying. Okay. Um, but it, great. So it looks like all these things, and probably the reason they're in the guide legend here, is they're specific to the legend itself.
[00:33:45] And obviously it doesn't show up quite correctly here because of aspect ratio, but you can see here. Okay, um, cool. So I really like how, yeah, how you kind of tweak the legend, because I think I, I myself am guilty of this when I do, When I make plots, oftentimes I'll just kind of leave the default legend or just tweak them a little bit, but you showed how, you know, with a few lines of code, you're really able to, to make some significant changes, which is great.
[00:34:16] Um, so last thing I want to ask you about is, um, let me actually save this. So, um,
[00:34:29] so now I have this, um, COVID evolution plot saved as object here. Um, So you actually combine these two things, because if I look back at your original, whoops, that's not what I want, uh, at this, you, um, actually combine, so you have your plot, which you made here, which would be everything highlighted, that, right?
[00:35:01] And then it looks like you're adding the title as well as a separate caption. Um, and you're doing that with, uh, the cowplot package. Talk about how that works.
[00:35:23] Abdoul Madjid: Generally, the dedicated place for the caption in ggplot, uh, doesn't fit with what we want to do. This is generally the case when we have empty space on our plot where we want to put our, uh, caption. Well, I think that problem can be solved with an additional, um, annotation when we are working with a plot that consists of two continuous dimensions, but when we are working on maps or faceted plots, it's a little bit hard to do that with, uh, an annotation. In that case,
[00:36:20] I use the cowplot, uh, ggdraw, uh, [inaudible] function, because generally when you use facets and you want to set a title and caption, it doesn't, uh, take the parameters you define in your theme. I don't know if you have tried that in the past [inaudible].
[00:36:59] [inaudible] that way it takes the parameters that we define in the theme. So I think we can try, uh, live and see what it will give if we set, with labs, the title and the caption in the, uh, previous plot.
[00:37:24] David Keyes: Oh, I see. So put the, you're saying if we put this in there, um, all right, let me do this so I'm not saving it. Just run it.
[00:37:47] So what do you, um, what's the issue that you usually see?
[00:37:54] Abdoul Madjid: First of all, when you look at the plot, you see all that white space that we have in our plot with the ggplot function. When we set the panel, when we set the plot background, we don't have that white space.
[00:38:14] David Keyes: Mm-Hmm. .
[00:38:15] Abdoul Madjid: You can try and see.
[00:38:19] David Keyes: Are you talking about like here or, um, where, where specifically Inside do you mean wherever
[00:38:25] Abdoul Madjid: inside there where Yes.
[00:38:27] That
[00:38:30] David Keyes: Okay. Here. Yeah. Let me run, I try to save it as
[00:38:34] Abdoul Madjid: a PNG or PDF or something like that.
[00:38:39] David Keyes: Okay. Um, here, let me.
[00:38:44] ggsave, COVID evolution plot. Um, evolution plot.
[00:39:00] Oh, sorry. Uh,
[00:39:20] so, do you see the, uh, the issue here?
[00:39:34] Abdoul Madjid: Yeah, I don't see the problem I'm talking about. Can you save the final plot with the [inaudible] function and
[00:39:44] David Keyes: Sure. Yeah. Let me run that and then,
[00:40:00] oh, okay. So it's probably not working because you set. Uh, some things particular to the, probably the size, I'm guessing maybe because I'm just doing Oregon, something like that? Does that seem plausible?
[00:40:24] I don't know why it looks so different.
[00:40:30] Oh, well, first of all, I need to run that without the, that.
[00:40:42] Abdoul Madjid: Oh, it's not working.
[00:40:46] David Keyes: Uh, that's okay. That's okay. Um, let's not worry about the specifics of it. Here, I'm just going to clear this out. It seems like the most important thing is when you do this, you are, um, combining plot, uh, the maps, the, the small multiples maps. And if I even bring up your original one, you combine the small multiples maps, and then you add that title and add the caption and bring it all together into a single thing.
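Combining a finished plot with a separately positioned title and caption can be sketched with cowplot as below. The placeholder plot uses the built-in `mtcars` data instead of the maps, and the label text, coordinates, and sizes are arbitrary assumptions chosen only to show the layout idea.

```r
library(ggplot2)
library(cowplot)

# Placeholder faceted plot standing in for the small-multiples maps.
maps_plot <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_wrap(vars(cyl))

# ggdraw() creates an empty canvas with coordinates running from 0 to 1.
# The plot fills most of it, leaving a band at the top for the title and
# one at the bottom for the caption.
combined <- ggdraw() +
  draw_plot(maps_plot, x = 0, y = 0.08, width = 1, height = 0.82) +
  draw_label("Title text", x = 0.5, y = 0.95, fontface = "bold", size = 16) +
  draw_label("Caption text", x = 0.98, y = 0.03, hjust = 1, size = 8)

combined
```

Because the title and caption are drawn on the canvas rather than through `labs()`, they can sit anywhere in the empty space around the faceted panels.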
[00:41:24] Um, cool. And just to kind of recap, it seems like, you know, the nice thing about making maps with R is if you already use, and you're familiar with ggplot, all of the things that you know how to do in ggplot apply here. So, for example, facet_wrap. If you've made small multiples with any other geom, you can do it with geom_sf as well.
[00:41:48] Choosing a palette you can do with colorspace or any other package that you, that you use. Are there other, any other benefits to you? Like, have you ever done, made maps in any other tool and any other things that you see as benefits to making maps in R?
[00:42:05] Abdoul Madjid: Oh, I think the only reason that I use that, I think it's pretty simple to make that with ggplot and geom_sf.
[00:42:15] This is the main reason that I use, uh, I use that to make my, my maps. I tried to make them with Python. It is a little bit harder than in R, so I, I keep making my maps with R.
[00:42:31] David Keyes: Yeah. It seems like R for data visualization, including maps, um, is really a great tool.