Advanced Variable Creation

Your Turn

  1. Remove the “x2018_19” portion of the race_ethnicity variable using str_remove()

  2. Convert all instances of the race_ethnicity variable to more meaningful observations (e.g. turn “american_indian_alaska_native” into “American Indian/Alaskan Native”) using any of the following:

  • recode()

  • if_else()

  • case_when()

Learn More

Bob Rudis has a semi-complex article on how he uses case_when() to work with data related to his work on internet security.

You might also find this video by Sharon Machlis on case_when() helpful.

And, in case you think I’m the only one who loves case_when(), check out this love letter to the function by Matt Kerlogue.

Have any questions? Put them below and we will help you out!

Atlang Mompe

April 18, 2021

Hi David, when I run this code: mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_2019")), nothing happens?

Lucilla Piccari

April 18, 2021

Hello, I have the same issue... As well as if I run:

enrollment_by_race_ethnicity_18_19 %>% mutate(race_ethnicity = recode(race_ethnicity, "x2018_2019_american_indian_alaska_native" = "American Indian/Alaska Native"))

enrollment_by_race_ethnicity_18_19 %>% mutate(race_ethnicity = if_else(race_ethnicity == "x2018_2019_american_indian_alaska_native", "American Indian/Alaska Native", race_ethnicity))

Erin Guthrie

April 20, 2021

I think you are doing what I am doing...it's "x2018_19"...I had "x2018_2019" and it could not match to it! Maybe this helps you all???

Erin Guthrie

April 20, 2021

Did you ever figure this out? I am having the same issue.

David Keyes

April 20, 2021

Any chance one of you could record a short video of yourself attempting to do this and getting an error? You can use Loom or something similar and post a link. I want to help folks debug but it's hard without seeing what's going on.

Erin Guthrie

April 20, 2021

I will record a video of this tonight...once I figure out how to record a video :-)...but it is odd because there are no errors, it just doesn't seem to want to change the information!

Erin Guthrie

April 20, 2021

I think my issue was super simple and me not paying attention...I was asking to remove the string "x2018_2019" instead of "x2018_19". I fixed this and now everything works as intended! Duh!

Jody Oconnor

April 20, 2021

Erin you figured it out! Human brain issue ("x2018_2019" processes exactly the same as "x2018_19"!). The example solution code uses "x2018_2019".

My case_when 'NA' issue is now solved as well (since there were no cases that didn't include "x2018_19"). Yay!
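
A minimal sketch of the two symptoms described in this thread, using made-up values modelled on the column names above (the toy vector is an assumption, not the course data). It illustrates that str_remove() returns its input unchanged, with no error, when the pattern matches nothing, and that case_when() then returns NA because no condition is met.

```r
library(dplyr)
library(stringr)

# Hypothetical values, modelled on the names discussed in this thread
race_ethnicity <- c("x2018_19_asian", "x2018_19_white")

# Wrong pattern: nothing matches, so the input comes back unchanged
# and no error or warning is raised
unchanged <- str_remove(race_ethnicity, "x2018_2019_")

# Correct pattern: the prefix is removed
cleaned <- str_remove(race_ethnicity, "x2018_19_")

# A quick check of whether a pattern matches anything at all
any_match <- any(str_detect(race_ethnicity, "x2018_2019_"))

# case_when() gives NA for values that meet none of its conditions,
# so the uncleaned values all become NA
recoded_unchanged <- case_when(
  unchanged == "asian" ~ "Asian",
  unchanged == "white" ~ "White"
)

# With the prefix gone, the same conditions match
recoded_cleaned <- case_when(
  cleaned == "asian" ~ "Asian",
  cleaned == "white" ~ "White"
)
```

Checking a pattern with str_detect() before removing it is one way to catch this kind of mismatch early.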

Lucilla Piccari

April 21, 2021

(Facepalm) You are right! I must have gone over the code a thousand times... I guess the machine is always right :D

Thank you!!

Jody Oconnor

April 20, 2021

I'm getting the same thing for str_remove. The script runs with no error message. But it just doesn't change the data frame.

Also, when I use case_when all the race_ethnicity values are changed to 'NA's. Here is the code I'm running:

race_ethnicity_enrollment_18_19 % select(-contains("percent")) %>%
  select(-contains("grade")) %>%
  select(-contains("kindergarten")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = replace_na(number_of_students, "0")) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_2019_")) %>%
  mutate(race_ethnicity = case_when(
    race_ethnicity == "american_indian_alaska_native" ~ "American Indian/Alaskan Native",
    race_ethnicity == "asian" ~ "Asian",
    race_ethnicity == "black_african_american" ~ "Black/African American",
    race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
    race_ethnicity == "multiracial" ~ "Multi-Racial",
    race_ethnicity == "native_hawaiian_pacific_islander" ~ "Pacific Islander",
    race_ethnicity == "white" ~ "White"))

Jody Oconnor

April 20, 2021

Not sure why the first line of code I pasted before got truncated, but this is what is actually in the first line:

race_ethnicity_enrollment_18_19 %

Jody Oconnor

April 20, 2021

argh, it truncated the first line again, but anyway, that line works properly (it creates the new object from the enrollment_18_19 data frame). Everything works until the str_remove line.

David Keyes

April 20, 2021

Please correct me if I'm wrong, but it seems like the issue is that your original race_ethnicity_enrollment_18_19 data frame hasn't changed, correct? If so, the reason why is you need to assign it back to itself in order to change the data frame. For example:

race_ethnicity_enrollment_18_19 <- race_ethnicity_enrollment_18_19 %>% YOURCODEHERE

Does that help?
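
A minimal sketch of the difference, assuming a small made-up tibble named enrollment in place of the course data: running a pipeline without assignment only prints the result, while assigning it back saves the change.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the course data frame
enrollment <- tibble(race_ethnicity = c("x2018_19_asian", "x2018_19_white"))

# Without assignment: the result is printed, but `enrollment` is unchanged
enrollment %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_19_"))

# The stored column still carries the prefix at this point
still_prefixed <- enrollment$race_ethnicity

# With assignment: the cleaned result is saved back under the same name
enrollment <- enrollment %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_19_"))
```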

Jody Oconnor

April 21, 2021

Hi David, my problem was completely due to trying to remove the string “x2018_2019” instead of “x2018_19”. I think you will want to fix that in your solution for this lesson so it doesn't throw more people off. Thanks for your responsiveness :)

Atlang Mompe

April 21, 2021

Hi Erin, shew, great that you figured it out. I am only getting to this now, thanks for sharing your solution.

Abby Isaacson

April 21, 2021

I'm stuck with an error that it can't find my race_ethnicity variable (I see it in my data frame):

Error: Problem with mutate() input race_ethnicity.
x object 'race_ethnicity' not found
ℹ Input race_ethnicity is str_remove(race_ethnicity, "x2018_19").
Run rlang::last_error() to see where the error occurred.

CODE:

enrollment_by_race_ethnicity_18_19 % select(-contains("grade")) %>%
  select(-contains("kindergarten")) %>%
  select(-contains("percent")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  summarize(total = sum(number_of_students)) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_19"))

David Keyes

April 21, 2021

I think this was a typo on my part. Can you remove the summarize line and try again?
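
A minimal sketch of why the summarize() line leads to "object 'race_ethnicity' not found", using a made-up two-row tibble (an assumption, not the course data): an ungrouped summarize() returns only the columns it creates, so there is no race_ethnicity column left for the later mutate().

```r
library(dplyr)
library(stringr)

# Hypothetical long-format data, shaped like the output of pivot_longer()
long_data <- tibble(
  race_ethnicity = c("x2018_19_asian", "x2018_19_white"),
  number_of_students = c(10, 20)
)

# summarize() without group_by() collapses the data to one row and keeps
# only the columns it creates
totals <- long_data %>%
  summarize(total = sum(number_of_students))

names(totals)  # only "total" remains

# Leaving out summarize() keeps race_ethnicity available to mutate()
cleaned <- long_data %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_19_"))
```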

Abby Isaacson

April 21, 2021

yes solved, thanks! I had pounded it out on my own, but that didn't work either. Removal was key!

Abby Isaacson

April 22, 2021

I also see that you mentioned it in the solution video.

Harold Stanislaw

April 22, 2021

Is it just me or does the syntax for recode seem backwards compared to normal programming convention? I would have expected the new value to be left of the equals sign and the old value to be on the right side of the equals sign. Are there other instances in R where we would need to watch out for this? (Or maybe I'm the one who's backwards!)

David Keyes

April 22, 2021

You're not the only one who thinks this, Harold. I always forget the order and have to either look it up or try it one way (usually unsuccessfully at first) and then the other.
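
For reference, a minimal sketch of the argument order on a made-up vector (the values are illustrative only). In recode() the old value goes on the left of the equals sign and the new value on the right, whereas functions such as mutate() and rename() put the new name on the left.

```r
library(dplyr)

# Hypothetical values for illustration
x <- c("asian", "white", "multiracial")

# recode(): old value on the left, new value on the right.
# Values that are not listed ("multiracial" here) are left as they are.
with_recode <- recode(x, "asian" = "Asian", "white" = "White")

# case_when(): condition on the left of ~, new value on the right;
# TRUE ~ x keeps values that meet no earlier condition
with_case_when <- case_when(
  x == "asian" ~ "Asian",
  x == "white" ~ "White",
  TRUE ~ x
)
```

Both calls give the same result here, so case_when() is an option if the recode() order is hard to remember.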

Harold Stanislaw

April 22, 2021

Judging by the warning messages generated in the video, am I correct in concluding that R is smart enough to exclude the NA entries from being recoded inappropriately (e.g., when using the TRUE option in case_when)? I ask because SPSS isn't so smart, depending upon how the recoding is specified. Also, if one wanted to recode the NA entries, the missing = new value option could be used, correct? I'm just checking my understanding of the help page.

David Keyes

April 22, 2021

Yes, that's exactly right. And if you want to do something with the NA value, you can use the is.na() function to test for them.
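
A small sketch for checking NA handling on your own data, using a made-up vector with one missing value (an assumption for illustration). In this sketch recode() leaves NA alone unless .missing is supplied, while a TRUE fallback in case_when() does pick up NA unless an is.na() test comes first.

```r
library(dplyr)

# Hypothetical values with one missing entry
x <- c("asian", NA, "white")

# recode() leaves NA as NA by default
recode_default <- recode(x, "asian" = "Asian")

# .missing supplies a replacement for NA
recode_missing <- recode(x, "asian" = "Asian", .missing = "Not reported")

# In case_when(), NA == "asian" is NA, which counts as not matched,
# so the NA entry falls through to the TRUE branch
case_true <- case_when(
  x == "asian" ~ "Asian",
  TRUE ~ "Other"
)

# Testing is.na() first keeps missing values missing
case_na_first <- case_when(
  is.na(x) ~ NA_character_,
  x == "asian" ~ "Asian",
  TRUE ~ "Other"
)
```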

Isaac Macha

April 22, 2021

Hello David. I have run the code successfully but it only shows the changes in the tibble and not in the data view. How do I reflect the change in the data frame? What may be the issue?

David Keyes

April 22, 2021

Have you assigned the result of your pipeline to an object? Here's some pseudo code to show what I mean:

new_data_frame <- old_data_frame %>% code_to_make_changes()

Isaac Macha

April 22, 2021

Hello David. Thanks for the feedback. It works now.

Eduardo Rodriguez

April 28, 2021

Hi David, thanks again for the great instruction. Out of curiosity, is there a function similar to Excel's substitute function? In Excel I would use the substitute function to substitute " " for every "_" and wrap a proper function around it to capitalize each word. Something like =proper(substitute(B2, "_", " ")). case_when() is great but is potentially time consuming if there are 50 different instances.

David Keyes

April 28, 2021

Yup, check out str_replace()!

Eduardo Rodriguez

April 29, 2021

Thanks David! I like that R has the case_when function - it makes it so much easier to deal with long if statements.

For tasks where I'm simply removing underscores and capitalizing words, is it okay to do something like this? Or does it go against programming convention? Or take away customizability?

enrollment_by_race_ethnicity_17_18 % select(-contains(c("grade","kindergarten","percent"))) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2017_18_")) %>%
  mutate(race_ethnicity = str_to_title(str_replace_all(race_ethnicity, "_", " ")))

David Keyes

April 29, 2021

If you're talking about the last line in your code, that's totally fine. I generally separate out things into multiple lines so I'd have one line that removes underscores and then another to make things title case, but you can do it however works best for you!
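
A minimal sketch of the two-line version described above, assuming a made-up tibble with the prefix already removed: one mutate() replaces the underscores and a second converts to title case.

```r
library(dplyr)
library(stringr)

# Hypothetical values with the year prefix already removed
enrollment <- tibble(
  race_ethnicity = c("american_indian_alaska_native", "hispanic_latino")
)

enrollment <- enrollment %>%
  # Step 1: replace every underscore with a space
  mutate(race_ethnicity = str_replace_all(race_ethnicity, "_", " ")) %>%
  # Step 2: capitalize the first letter of each word
  mutate(race_ethnicity = str_to_title(race_ethnicity))
```

This yields labels such as "Hispanic Latino". Labels that need a slash, such as "Hispanic/Latino", would still need recode() or case_when().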

Carolyn Ford

April 23, 2022

This code:

enrolment_by_race_ethnicity % select(-contains("grade")) %>%
  select(-contains("kindergarten")) %>%
  select(-contains("percent")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_19_")) %>%
  mutate(race_ethnicity = recode(race_ethnicity,
    "american_indian_alaska_native" = "American Indian/Alaska Native",
    "asian" = "Asian",
    "black_african_american" = "Black/African American",
    "hispanic_latino" = "Hispanic/Latino",
    "multiracial" = "Multiracial",
    "native_hawaiian_pacific_islander" = "Native Hawaiian/Pacific Islander",
    "white" = "White"))

... produces this error - I don't understand why the arguments are "unused":

Error in mutate():
! Problem while computing race_ethnicity = recode(...).
Caused by error in recode():
! unused arguments (american_indian_alaska_native = "American Indian/Alaska Native", asian = "Asian", black_african_american = "Black/African American", hispanic_latino = "Hispanic/Latino", multiracial = "Multiracial", native_hawaiian_pacific_islander = "Native Hawaiian/Pacific Islander", white = "White")

David Keyes

April 25, 2022

Sorry, I can't reproduce your issue. Could you please post your full code from this section as a GitHub gist and post the link? Thanks!
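
One thing that may be worth ruling out, offered only as an assumption since the cause was not confirmed in this thread: another loaded package may define its own recode() that masks dplyr's version. The sketch below, on a made-up vector, checks where a bare recode() call resolves and then calls dplyr's version explicitly.

```r
library(dplyr)

# Which package does a bare recode() call resolve to in this session?
# (If another package that defines recode() was loaded after dplyr,
# its name would appear here instead.)
recode_source <- environmentName(environment(recode))

# Hypothetical values for illustration
x <- c("asian", "white")

# Prefixing with dplyr:: makes sure dplyr's recode() is the one used
labelled <- dplyr::recode(x, "asian" = "Asian", "white" = "White")
```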

Andrew Paquin

April 23, 2023

Hi David, I noticed that you left the following line (from the last lesson) in your code, and that you added the various methods of mutation after it in the pipeline: summarize(number_proficient = sum(number_proficient, na.rm = TRUE)). I did the same thing and it didn't work. That summarize line resulted in a tiny table that showed only the total number of students in the districts, and it did not contain any instances of "x2018_19" to remove, so I got an error message. I'm not sure why it worked for you to leave it in but not for me.

Andrew Paquin

April 23, 2023

Actually, disregard that. I just watched the solution video, and you dealt with it there. All good.

Daved Fared

April 26, 2023

Hi David, when I followed the solution guide video and used the same exact code lines that you did, I got a table with race_ethnicity being all NA. Any idea what could have happened?

David Keyes

April 26, 2023

Can you post your code as a gist and share the link so I can review it?

Kiana Robinson

May 15, 2023

This is my code (error message follows):

enrollment_by_race_ethnicity_18_19 % select(-contains("grade")) %>%
  select(-contains("percent")) %>%
  select(-contains("kindergarten")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_19_")) %>%
  mutate(race_ethnicity = case_when(
    race_ethnicity == "american_indian_alaska_native" ~ "American Indian/Alaska Native",
    race_ethnicity == "asian" ~ "Asian",
    race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian/Pacific Islander",
    race_ethnicity == "black_african_american" ~ "Black",
    race_ethnicity == "hispanic_latino" ~ "Hispanic",
    race_ethnicity == "white" ~ "White",
    race_ethnicity == "multiracial" ~ "Multiracial"))

This is the error message:

Error in case_when():
! Failed to evaluate the left-hand side of formula 1.
Caused by error:
! object 'race_ethnicity' not found
Run rlang::last_trace() to see where the error occurred.

David Keyes

May 15, 2023

Can you please post your entire R script file as a GitHub gist and post the link? I'm having trouble isolating the issue.