

Heads up! There were some changes to R after I made this lesson. If you're having trouble with getting it to work, check out the solutions section for a video that explains what might be going on for you.

Your Turn

Create a function to clean each year of enrollment data, then use bind_rows() to bind them together

Arguments you’ll need to use:

  • Data year

  • Text to remove in the str_remove() line
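As a rough sketch of the shape this exercise is asking for (not the course's official solution), a function with those two arguments might look like the code below. The tiny data frames, the column prefixes, and the argument names `raw_data`, `data_year`, and `remove_text` are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-ins for two years of raw enrollment data
enrollment_17_18 <- tibble(
  district_id = c(1, 2),
  x2017_18_asian = c("10", "-"),
  x2017_18_white = c("30", "5")
)

enrollment_18_19 <- tibble(
  district_id = c(1, 2),
  x2018_19_asian = c("12", "4"),
  x2018_19_white = c("-", "6")
)

# One function, reused for every year: only the year label and the
# text to strip from the column names change between calls
clean_enrollment_data <- function(raw_data, data_year, remove_text) {
  raw_data %>%
    pivot_longer(
      cols = -district_id,
      names_to = "race_ethnicity",
      values_to = "number_of_students"
    ) %>%
    mutate(number_of_students = na_if(number_of_students, "-")) %>%
    mutate(number_of_students = as.numeric(number_of_students)) %>%
    mutate(number_of_students = replace_na(number_of_students, 0)) %>%
    mutate(race_ethnicity = str_remove(race_ethnicity, remove_text)) %>%
    mutate(year = data_year)
}

# Clean each year, then stack the results
enrollment_by_race_ethnicity <- bind_rows(
  clean_enrollment_data(enrollment_17_18, "2017-2018", "x2017_18_"),
  clean_enrollment_data(enrollment_18_19, "2018-2019", "x2018_19_")
)
```

The real data has more columns and needs the extra cleaning steps from the lesson, but the pattern of "arguments in, cleaned data frame out" stays the same.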

Learn More

The best place to start learning more about creating your own functions is Chapter 19 of R for Data Science. The materials for the Stat 545 course also have a nice section on writing functions, as does this lesson from Kelly Bodwin.

Have any questions? Put them below and we will help you out!


Abby Isaacson


April 25, 2021

Darn, my race and ethnicity column/variable now lists only NAs, whereas the non-function code worked fine:

clean_enrollment_data %
  select(-contains("percent")) %>%
  select(-contains("grade")) %>%
  select(-contains("kindergarten")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, r_e_text_remove)) %>%
  mutate(race_ethnicity = case_when(
    race_ethnicity == "_american_indian_alaska_native" ~ "AI/AN",
    race_ethnicity == "_asian" ~ "Asian",
    race_ethnicity == "_native_hawaiian_pacific_islander" ~ "Native Hawaiian/Pacific Islander",
    race_ethnicity == "_black_african_american" ~ "Black/African American",
    race_ethnicity == "_hispanic_latino" ~ "Hispanic/Latino",
    race_ethnicity == "_white" ~ "White",
    race_ethnicity == "_multiracial" ~ "Multi-racial")) %>%
  group_by(district_id) %>%
  mutate(pct = number_of_students / sum(number_of_students)) %>%
  ungroup() %>%
  mutate(data_year)
}

enrollment_by_race_ethnicity_17_18 <- clean_enrollment_data(data_raw = enrollment_17_18, data_year = "2017-18", r_e_text_remove = "x2017-18")

enrollment_by_race_ethnicity_18_19 <- clean_enrollment_data(data_raw = enrollment_18_19, data_year = "2018-19", r_e_text_remove = "x2018-19")

Abby Isaacson


April 25, 2021

Also, this has happened before to me and others: I'm not sure why the first line of copied code gets truncated on this website, so it's hard to show you what my first 2 lines of code are. I can log into Charlie's office hours Monday and share my screen, perhaps.

clean_enrollment_data %

Abby Isaacson


April 25, 2021

Yeah, it cut it off again.

Abby Isaacson


April 25, 2021

I see one of my errors on the last line with mutate (but I still get the NAs): mutate(year = data_year)

David Keyes


April 26, 2021

Sorry about the cutting-off issue. Very annoying, I know. I assume you got this error fixed based on our email, but let me know if not!

Lucilla Piccari


April 29, 2021

I got the same "NA" issue. Any fixes?

Thanks!

Abby Isaacson


April 29, 2021

Yes, the solution was literally an underscore where I had a dash! Check your syntax:)

Lucilla Piccari


May 7, 2021

It was! Those misspellings... :D

Megan Ruxton


April 29, 2021

Having the same issue with NAs, and my syntax seems to be right. I'll post in a couple of comment chunks since there seems to be a limit.

clean_enrollment_data%
  select(-contains("grade")) %>%
  select(-contains("kindergarten")) %>%
  select(-contains("percent")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(number_of_students = as.numeric((number_of_students))) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, remove_text)) %>%
  mutate(race_ethnicity = case_when(
    race_ethnicity == "american_indian_alaska_native" ~ "American Indian/Alaska Native",
    race_ethnicity == "asian" ~ "Asian",
    race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian/Pacific Islander",
    race_ethnicity == "black_african_american" ~ "Black/African American",
    race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
    race_ethnicity == "white" ~ "White",
    race_ethnicity == "multiracial" ~ "Multiracial")) %>%

Megan Ruxton


April 29, 2021

  group_by(district_id) %>%
  mutate(pct = number_of_students/sum(number_of_students)) %>%
  ungroup() %>%
  mutate(year = data_year) )

enrollment_by_race_ethnicity_17_18 <- clean_enrollment_data(raw_data = enrollment_17_18, data_year = "2017-2018", remove_text = "x2017_2018_")

enrollment_by_race_ethnicity_18_19 <- clean_enrollment_data(raw_data = enrollment_18_19, remove_text = "x2018_2019_", data_year = "2018-2019")

Megan Ruxton


April 29, 2021

Hoping it's something silly I missed because it's driving me nuts!

David Keyes


April 30, 2021

I'm having trouble with your code. Can you post your whole R script file as a GitHub gist and share the link?

Megan Ruxton


April 30, 2021

https://gist.github.com/mruxton/4674fa36cc2f382e6f8a456c21ed3118

Let me know if you have issues with that, but I think it gives you my code!

Megan Ruxton


May 1, 2021

Aha, it was something small, thank you! And the video was helpful, that's a good system for debugging. Thanks!

Extremely basic question that I'm sure you've explained before but I've forgotten. What keystrokes are you using when you are running code in these videos? Example: in the Solutions video, you say "let me run that" and then run only a few lines of code. How are you doing that (you aren't selecting lines and then clicking Run)?

Related: when David is doing this, he is seeing output (e.g., a tibble) in the Console. What setting needs to be changed for this?

Related: In the solutions video, David says, "let me pop this open" at about 3:16 to open a new dataframe he's just made. What keystrokes were done to do that?

David Keyes


December 2, 2021

If you click on any object in the environment pane, it will open in the viewer. You can also do command + click (Mac) or control + click (Windows) on an object name in your code to make it appear in the viewer. Hope that helps!

CTRL clicking is a huge help! thank you

David Keyes


December 2, 2021

To run code, I typically use the keyboard shortcut command + enter (control + enter on Windows). You can see all keyboard shortcuts here.

On where output appears, by default code run in an R script file goes in the console while code run in an RMarkdown document goes below that code chunk. You can change it, though. See here.

Let's say you want to run the function for 20 years worth of data - so you have 20 input tables and have to run the function 20 times. Can you create a function of functions? How would you pass the arguments? I know that I want to replace the year argument with every year between 2000 and 2019 - how does that work? I'm sure it can be done somehow.

Charlie Hadley


February 15, 2022

That's a great question, Sara. The easiest and most useful method for solving this kind of problem is iteration: applying your function over all the values (years) of interest. I've recorded a short video explaining how this example script works, which iterates over each year in the gapminder dataset.

The R for Data Science book chapter on Iteration provides more information about the {purrr} package and how the map_df() function works. Let me know if this helps!
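To give a feel for what iteration looks like in code, here is a minimal sketch of applying one function to several years with `map_df()`. It is not the script from the video: the `population` data frame and the `summarise_year()` function are invented for this example.

```r
library(dplyr)
library(purrr)

# Hypothetical data: one row per country per year
population <- tibble(
  year = c(2000, 2000, 2001, 2001, 2002, 2002),
  country = c("A", "B", "A", "B", "A", "B"),
  pop = c(10, 20, 11, 21, 12, 22)
)

# A function that handles ONE year at a time
summarise_year <- function(data_year) {
  population %>%
    filter(year == data_year) %>%
    summarise(year = data_year, total_pop = sum(pop))
}

# map_df() calls summarise_year() once per year and
# row-binds the results into a single data frame
years <- 2000:2002
yearly_totals <- map_df(years, summarise_year)
```

The same idea scales to 20 years by changing `years` to `2000:2019`. When two things vary together (say, a year label and a column prefix), `map2()` from the same package iterates over two vectors in parallel.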

Thank you for the video. I can see that this is a whole new area to explore!

JULIO VERA DE LEON


April 29, 2022

Unfortunately I'm getting the same error with the replace_na() function:

Error in mutate(): ! Problem while computing number_of_students = replace_na(number_of_students, 0). Caused by error in vec_assign(): ! Can't convert replace to match type of data .

I'm guessing it is related to the readxl package and how it interprets the values of some variables.

The value for 2017-2018 has to be 0, and for the 2018-2019 dataset has to be "0".

Charlie Hadley


May 3, 2022

Thanks again for commenting! I'm going to link to your earlier comment in the course where I explained the issue.

Sandra Obradovic


May 3, 2022

Hi Charlie, I'm having the same issue as Julio. I tried to follow the link you embedded, but it just takes me to the 'Going Deeper with R' landing page. It would be useful to understand this better.

Charlie Hadley


May 5, 2022

Hi Sandra and Julio,

Sorry the link doesn't work. I'm copying and pasting what I wrote on the "Binding Data Frames" lesson previous to this one:

Thanks for reporting this! It’s due to a change in how replace_na() works: it is no longer allowed to “change data types”, which means we need to use 0 if the column is numeric and "0" if the column is a character column.

In case you’re interested, you can see the documentation for this change in the NEWS.md file for the package. But please note the language used here is quite technical.
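A small sketch may make the type rule concrete. The two vectors below are invented for illustration, and the error branch assumes a recent version of tidyr in which `replace_na()` refuses to change types, as described above.

```r
library(tidyr)

students_chr <- c("12", NA, "7")   # character, as some spreadsheets are read in
students_num <- c(12, NA, 7)       # numeric

# The replacement has to match the column's type
filled_chr <- replace_na(students_chr, "0")
filled_num <- replace_na(students_num, 0)

# Mixing types (character data, numeric replacement) raises an error,
# which is caught here so the script keeps running
mismatch <- tryCatch(
  replace_na(students_chr, 0),
  error = function(e) "error"
)

# Converting to numeric first means a numeric 0 is the right replacement
filled_after_convert <- replace_na(as.numeric(students_chr), 0)
```

In a pipeline this means the order of the steps matters: use `"0"` if `replace_na()` runs while the column is still character, and `0` if it runs after `as.numeric()`.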

Thanks,

Charlie

Niger Sultana


May 5, 2022

Hi, I'm having trouble practicing with the code posted at this link (https://rfortherestofus.com/2018/09/making-small-multiples-in-r/). I don't know how to download the data and code from GitHub. This might be a very silly question, but a video showing how to download data and code posted on GitHub would help me understand how the code works.

Cheers, Niger

Niger Sultana


May 5, 2022

I figured that out. Thank you.

Charlie Hadley


May 9, 2022

I'm glad it's working for you now!

Delia Ayled Serna Guerrero


May 14, 2022

Hello! When I try to convert it into a function, it tells me there is an error in mutate(): it can't convert 'replace' to match the type of "data".

But this worked without problems when it was not in function form.

Delia Ayled Serna Guerrero


May 14, 2022

I tried the suggestions in the comments, but it's still not working.

Delia Ayled Serna Guerrero


May 15, 2022

Hello, when I run the function without the "remove text" argument it works, but as soon as I add it, it doesn't.

Charlie Hadley


May 16, 2022

Hello Delia,

Here's a video explaining how to fix the function. Let me know if you have any questions.

Thanks,

Charlie

Delia Ayled Serna Guerrero


May 16, 2022

Finally it works! Thanks!!

I have two questions (which I'm happy to bring up in an OH or live session, but am writing here so I don't forget):

(1) is it right that when we make a function, we'd never have a character string as an argument? For instance, example_function <- function(raw_data, name_of_column), but not example_function <- function("raw_data", "name_of_column"). If we want to have arguments that run as strings in the code, we'd use "" when we actually introduced those arguments while using the function, right? As in, example_function(raw_data_filename, "This is a column name and it's a string.")?

(2) I tried to add a fourth argument to my function that assigned the function's output to a new named dataframe, but it didn't work. Can you assign something within the function, or is that not possible? Here's what I had tried:

import_enrollment_by_year <- function(data_to_clean, xyear_yr_, year_year, dataframe_name) {
  dataframe_name %
    select(-contains(c("percent","grade","kindergarten"))) %>%
    pivot_longer(cols = -district_id,
                 names_to = "race_ethnicity",
                 values_to = "number_of_students") %>%
    mutate(number_of_students = na_if(number_of_students, "-")) %>%
    mutate(number_of_students = as.numeric(number_of_students)) %>%
    mutate(number_of_students = replace_na(number_of_students,0)) %>%
    mutate(race_ethnicity = str_remove(race_ethnicity, xyear_yr_)) %>%
    mutate(race_ethnicity = case_when(
      race_ethnicity == "american_indian_alaska_native" ~ "American Indian/Alaskan Native",
      race_ethnicity == "asian" ~ "Asian",
      race_ethnicity == "white" ~ "White",
      race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
      race_ethnicity == "multiracial" ~ "Multiracial",
      race_ethnicity == "black_african_american" ~ "Black/African American",
      race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian or Pacific Islander",
      TRUE ~ race_ethnicity )) %>%
    group_by(district_id) %>%
    mutate(pct = number_of_students / sum(number_of_students)) %>%
    ungroup() %>%
    mutate(year = year_year) %>%
    arrange(district_id, race_ethnicity)
}

Charlie Hadley


November 4, 2022

Hi Julia,

  1. Function arguments and strings

Function argument names aren't written as strings, but you can pass strings as argument values, e.g.

library(tidyverse)

title_a_chart <- function(the_title){
  
  ggplot() +
    labs(title = the_title)
  
}

title_a_chart("A title")

  2. Assignment in functions

Your code isn't working because the start of your pipe doesn't use the data_to_clean argument, but the dataframe_name argument:

dataframe_name %
select(-contains(c("percent","grade","kindergarten")))

It's advisable not to have functions create assignments inside themselves that you intend to use elsewhere (what we would call global assignments).

Instead, we assign the output of a function call to an object:

get_diet <- function(diet){
  
  msleep %>% 
    filter(vore == diet)
  
}

carni_mammals <- get_diet("carni")

Cheers, Charlie

Josh Gutwill


November 7, 2022

I keep getting this error: Error in FUN(left) : invalid argument to unary operator

Here's my code:

clean_enrollment_data %
  select(!contains("grade")) %>%
  select(!contains("percent")) %>%
  select(!contains("kindergarten")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, string_to_remove)) %>%
  mutate(race_ethnicity = case_when(
    race_ethnicity == "american_indian_alaska_native" ~ "Native American or Alaskan",
    race_ethnicity == "asian" ~ "Asian",
    race_ethnicity == "native_hawaiian_pacific_islander" ~ "Hawaiian or Pacific Islander",
    race_ethnicity == "black_african_american" ~ "Black or African American",
    race_ethnicity == "hispanic_latino" ~ "Hispanic/Latinx",
    race_ethnicity == "white" ~ "White",
    TRUE ~ "Multiracial")) %>%
  group_by(district_id) %>%
  mutate(pct = number_of_students / sum(number_of_students)) %>%
  ungroup() %>%
  mutate(year = data_year)
}

enrollment_by_race_ethnicity_18_19 < - clean_enrollment_data(raw_data = enrollment_18_19, string_to_remove = "x2018_19_", data_year = "2018-2019")

When I copy and paste David's code, I also get an error, though it's a different one: Error in mutate(): ! Problem while computing number_of_students = replace_na(number_of_students, 0). Caused by error in vec_assign(): ! Can't convert replace to match type of data . Run rlang::last_error() to see where the error occurred.

Charlie Hadley


November 8, 2022

Hi Josh,

Unfortunately, this error is due to a change in the behaviour of the replace_na() function. The provided solution relies on the old behaviour, in which replace_na() was allowed to change the type of the data, in this case from character to numeric.

To fix this error you'll need to replace this line

 mutate(number_of_students = replace_na(number_of_students, 0))

With

mutate(number_of_students = replace_na(number_of_students, "0"))

Cheers,

Charlie

Kirstin O'Dell


November 14, 2022

I thought in a prior video we were told to tidy our data in script and only use markdown for the output/reporting we want to do. I'm seeing that for this we're using markdown to tidy the data inside of the function. I'm confused as to which to be using. Can functions created in script be used in markdown files?

You're right that I generally recommend doing cleaning/tidying in an R script file and reporting in RMarkdown. However, I sometimes switch it up when I'm making examples (in this video, I didn't want to have to switch between two file types because I wanted to keep the focus on functions). But, in real projects, I keep R files for cleaning/tidying and Rmd files for reporting.

On your other question, yes, functions can be used anywhere you can run R code (R script files or code chunks in Rmd files). Hope that helps!

Kirstin O'Dell


November 15, 2022

It does. Thanks! To clarify, is it 1) functions are translatable, meaning I can set one up in a script file and run it in a Rmd file or 2) I can set up and run a function within a single script or Rmd file but not between them?

Yes, you can run functions across R and Rmd files. I went a bit in depth in explaining it in this video. Hope it helps. If you still have questions, please let me know.
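One common way to share a function between files, which may or may not be the approach shown in the video, is to keep it in an R script and call `source()` from the Rmd chunk. The sketch below writes the script to a temporary file only so that it is self-contained; the file name and the `add_year()` function are made up.

```r
# Write a small function to a script file
# (a temp file here; in a real project this would be a .R file you wrote yourself)
script_path <- file.path(tempdir(), "functions.R")
writeLines(
  c(
    "add_year <- function(df, data_year) {",
    "  df$year <- data_year",
    "  df",
    "}"
  ),
  script_path
)

# In an Rmd code chunk (or another script), source() runs the file,
# which defines add_year() in the current session
source(script_path)

enrollment <- data.frame(district_id = c(1, 2))
enrollment_with_year <- add_year(enrollment, "2018-2019")
```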

Kirstin O'Dell


November 17, 2022

Thank you! Very helpful to see the example.

Andrew Paquin


April 30, 2023

Hi David, I watched the "Heads Up" video above. If I understand it correctly, the function we created for the 18-19 data can't be used for the 17-18 data because of 1) the changes to the na_if command, and 2) the fact that the columns are in different formats (dbl and chr) in the two datasets. Is that what's happening?

David Keyes


April 30, 2023

Yep, exactly!
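One possible workaround for the dbl/chr mismatch, offered as a sketch rather than as what the video recommends, is to convert the column to character at the start so every later step sees the same type regardless of year. The two small data frames and the `clean_counts()` function are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical columns: one year read in as character, the other as numeric
students_18_19 <- tibble(number_of_students = c("15", "-", NA))
students_17_18 <- tibble(number_of_students = c(20, NA, 3))

clean_counts <- function(raw_data) {
  raw_data %>%
    # Make the type predictable first, so na_if() always compares text with text
    mutate(number_of_students = as.character(number_of_students)) %>%
    mutate(number_of_students = na_if(number_of_students, "-")) %>%
    # Now convert to numeric, and only then fill NAs with a numeric 0
    mutate(number_of_students = as.numeric(number_of_students)) %>%
    mutate(number_of_students = replace_na(number_of_students, 0))
}

counts_18_19 <- clean_counts(students_18_19)
counts_17_18 <- clean_counts(students_17_18)
```

With this ordering the same function body runs on both inputs, at the cost of one extra conversion step.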

Kiana Robinson


May 17, 2023

What does this line of code do?

race_ethnicity_remove_text = "x2018_19_"

Why was it included in the function statement? This is confusing.

David Keyes


May 17, 2023

It removes the text "x2018_19_" from the race_ethnicity variable. We do this because the function has to work with both 2018/2019 data and 2017/2018 data (which would have the text "x2017_18_"). So, by making it an argument, we can make the function work for both years. Does that make sense?
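A short illustration of what that argument controls, using made-up column names: when the text passed in matches the prefix, `str_remove()` strips it; when it differs by even one character, nothing is removed.

```r
library(dplyr)
library(stringr)

# Hypothetical column names after pivot_longer()
column_names <- c("x2018_19_asian", "x2018_19_white")

# The pattern matches the prefix, so it is stripped
removed <- str_remove(column_names, "x2018_19_")

# A dash where an underscore should be: nothing matches, nothing is removed
not_removed <- str_remove(column_names, "x2018-19_")

# With the prefix still attached, no case_when() condition is TRUE,
# and without a fallback every value becomes NA
labels <- case_when(
  not_removed == "asian" ~ "Asian",
  not_removed == "white" ~ "White"
)
```

This is one way the all-NA race_ethnicity column reported earlier in the comments can arise, so the remove-text argument is worth checking first when that happens.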

Zain Asaf


May 22, 2023

Hi Charlie and Dan, I am having the same issues as a couple of people with the replace_na line:

I get the following message Error in mutate(): ℹ In argument: number_of_students = replace_na(number_of_students, "0"). Caused by error in vec_assign(): ! Can't convert replace to match type of data .

However, I have used the following code for that line: mutate(number_of_students = replace_na(number_of_students, "0")), i.e. I have put the "0" in quotation marks, as suggested. Here is the code I have used. Note, I just labeled the column "ethnicity", not "race_ethnicity".

clean_enrollment_data%
  select(-contains("grade")) %>%
  select(-contains("kindergarten")) %>%
  select(-contains("percent")) %>%
  pivot_longer(cols = -district_id,
               names_to = "ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-" )) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(number_of_students = replace_na(number_of_students, "0")) %>%
  mutate(ethnicity = str_remove(ethnicity, ethnicity_remove_text)) %>%
  mutate(ethnicity = case_when(
    ethnicity == "american_indian_alaska_native" ~ "native american",
    ethnicity == "native_hawaiian_pacific_islander" ~ "pacific islander",
    ethnicity == "hispanic_latino" ~ "latino",
    ethnicity == "black_african_american" ~ "african american",
    ethnicity == "white" ~ "white",
    ethnicity == "asian" ~ "Asian",
    ethnicity == "multiracial" ~ "Multiracial" )) %>%
  group_by(district_id) %>%
  mutate(pct = number_of_students / sum(number_of_students)*100) %>%
  ungroup() %>%
  mutate(year = "data_year")

David Keyes


May 22, 2023

Can you please post your full code as a gist and share the link?