Going Deeper with R
Functions
This lesson is called Functions, part of the Going Deeper with R course.
Heads up! There were some changes to R after I made this lesson. If you're having trouble with getting it to work, check out the solutions section for a video that explains what might be going on for you.
Your Turn
Create a function to clean each year of enrollment data, then use bind_rows() to bind them together.
Arguments you’ll need to use:
- Data year
- Text to remove in the str_remove() line
Learn More
The best place to start learning more about creating your own functions is Chapter 19 of R for Data Science. The materials for the Stat 545 course also have a nice section on writing functions, as does this lesson from Kelly Bodwin.
Abby Isaacson
April 25, 2021
Darn, my race and ethnicity column/variable now lists only NAs, whereas the non-function code worked fine:
clean_enrollment_data % select(-contains("percent")) %>%
  select(-contains("grade")) %>%
  select(-contains("kindergarten")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, r_e_text_remove)) %>%
  mutate(race_ethnicity = case_when(
    race_ethnicity == "_american_indian_alaska_native" ~ "AI/AN",
    race_ethnicity == "_asian" ~ "Asian",
    race_ethnicity == "_native_hawaiian_pacific_islander" ~ "Native Hawaiian/Pacific Islander",
    race_ethnicity == "_black_african_american" ~ "Black/African American",
    race_ethnicity == "_hispanic_latino" ~ "Hispanic/Latino",
    race_ethnicity == "_white" ~ "White",
    race_ethnicity == "_multiracial" ~ "Multi-racial")) %>%
  group_by(district_id) %>%
  mutate(pct = number_of_students / sum(number_of_students)) %>%
  ungroup() %>%
  mutate(data_year)
}
enrollment_by_race_ethnicity_17_18 <- clean_enrollment_data(data_raw = enrollment_17_18, data_year = "2017-18", r_e_text_remove = "x2017-18")
enrollment_by_race_ethnicity_18_19 <- clean_enrollment_data(data_raw = enrollment_18_19, data_year = "2018-19", r_e_text_remove = "x2018-19")
Abby Isaacson
April 25, 2021
Also, this has happened before to me and others, I'm not sure why the first line of copied code gets truncated on this website so it's hard to show you what my first 2 lines of code are. I can log into Charlie's office hours Monday and share screen perhaps.
clean_enrollment_data %
Abby Isaacson
April 25, 2021
yeah it cut it off again.
Abby Isaacson
April 25, 2021
I see one of my errors on the last line with mutate (but I still get the NAs): mutate(year = data_year)
David Keyes
April 26, 2021
Sorry about the cutting off issue. Very annoying I know. I assume you got this error fixed based on our email, but let me know if otherwise!
Lucilla Piccari
April 29, 2021
I got the same "NA" issue. Any fixes?
Thanks!
Abby Isaacson
April 29, 2021
Yes, the solution was literally an underscore where I had a dash! Check your syntax:)
Lucilla Piccari
May 7, 2021
It was! Those misspellings... :D
Megan Ruxton
April 29, 2021
Having the same issue with NAs, and my syntax seems to be right. I'll post in a couple of comment chunks since there seems to be a limit.
clean_enrollment_data% select(-contains("grade")) %>%
  select(-contains("kindergarten")) %>%
  select(-contains("percent")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(number_of_students = as.numeric((number_of_students))) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, remove_text)) %>%
  mutate(race_ethnicity = case_when(
    race_ethnicity == "american_indian_alaska_native" ~ "American Indian/Alaska Native",
    race_ethnicity == "asian" ~ "Asian",
    race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian/Pacific Islander",
    race_ethnicity == "black_african_american" ~ "Black/African American",
    race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
    race_ethnicity == "white" ~ "White",
    race_ethnicity == "multiracial" ~ "Multiracial")) %>%
Megan Ruxton
April 29, 2021
group_by(district_id) %>% mutate(pct = number_of_students/sum(number_of_students)) %>% ungroup() %>% mutate(year = data_year) )
enrollment_by_race_ethnicity_17_18 <-clean_enrollment_data(raw_data = enrollment_17_18, data_year = "2017-2018", remove_text = "x2017_2018_")
enrollment_by_race_ethnicity_18_19 <- clean_enrollment_data(raw_data = enrollment_18_19, remove_text = "x2018_2019_", data_year = "2018-2019")
Megan Ruxton
April 29, 2021
Hoping it's something silly I missed because it's driving me nuts!
David Keyes
April 30, 2021
I'm having trouble with your code. Can you post your whole R script file as a GitHub gist and share the link?
Megan Ruxton
April 30, 2021
https://gist.github.com/mruxton/4674fa36cc2f382e6f8a456c21ed3118
Let me know if you have issues with that, but I think it gives you my code!
Megan Ruxton
May 1, 2021
Aha, it was something small, thank you! And the video was helpful, that's a good system for debugging. Thanks!
Matt M
December 2, 2021
Extremely basic question that I'm sure you've explained before but I've forgotten. What keystrokes are you using when you are running code in these videos? Example: in the Solutions video, you say "let me run that" and then run only a few lines of code. How are you doing that (you aren't selecting lines and then clicking Run)?
Related: when David is doing this, he is seeing output (e.g., a tibble) in the Console. What setting needs to be changed for this?
Matt M
December 2, 2021
Related: In the solutions video, David says, "let me pop this open" at about 3:16 to open a new dataframe he's just made. What keystrokes were done to do that?
David Keyes
December 2, 2021
If you click on any object in the environment pane, it will open in the viewer. You can also do command + click (Mac) or control + click (Windows) on an object name in your code to make it appear in the viewer. Hope that helps!
Matt M
December 3, 2021
CTRL clicking is a huge help! thank you
David Keyes
December 2, 2021
To run code, I typically use the keyboard shortcut command + enter (control + enter on Windows). You can see all keyboard shortcuts here.
On where output appears, by default code run in an R script file goes in the console while code run in an RMarkdown document goes below that code chunk. You can change it, though. See here.
Sara Kidd
February 14, 2022
Let's say you want to run the function for 20 years worth of data - so you have 20 input tables and have to run the function 20 times. Can you create a function of functions? How would you pass the arguments? I know that I want to replace the year argument with every year between 2000 and 2019 - how does that work? I'm sure it can be done somehow.
Charlie Hadley
February 15, 2022
That's a great question, Sara. The easiest and most useful method for solving this kind of problem is iteration: applying your function over all the values (years) of interest. I've recorded a short video explaining how this example script works, which iterates over each year in the gapminder dataset.
The R for Data Science book chapter on Iteration provides more information about the {purrr} package and how the map_df() function works. Let me know if this helps!
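As a rough illustration of the iteration idea described above (not the example script from the video), the sketch below applies one function to several years of data with map_df(). The list raw_by_year, the function add_year(), and the toy values are all hypothetical stand-ins for the real enrollment data and cleaning function.

```r
library(dplyr)
library(purrr)

# Hypothetical raw data: one small data frame per year, stored in a named list
raw_by_year <- list(
  "2000" = tibble(district_id = c(1, 2), number_of_students = c(10, 20)),
  "2001" = tibble(district_id = c(1, 2), number_of_students = c(15, 25))
)

# Hypothetical stand-in for a cleaning function that takes the data and its year
add_year <- function(raw_data, data_year) {
  raw_data %>%
    mutate(year = data_year)
}

# Apply the function once per year and row-bind the results into one data frame
all_years <- map_df(names(raw_by_year), function(yr) {
  add_year(raw_data = raw_by_year[[yr]], data_year = yr)
})

all_years
```

With 20 years of data, only the list would grow; the call to map_df() would stay the same.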
Sara Kidd
February 15, 2022
Thank you for the video. I can see that this is a whole new area to explore!
JULIO VERA DE LEON
April 29, 2022
Unfortunately I'm getting the same error with the replace_na() function:
Error in `mutate()`: ! Problem while computing `number_of_students = replace_na(number_of_students, 0)`. Caused by error in `vec_assign()`: ! Can't convert `replace` to match type of `data`.
I'm guessing it has to be related with the readxl package and how it interprets the values of some variables.
The value for 2017-2018 has to be 0, and for the 2018-2019 dataset has to be "0".
Charlie Hadley
May 3, 2022
Thanks again for commenting! I'm going to link to your earlier comment in the course where I explained the issue.
Sandra Obradovic
May 3, 2022
Hi Charlie, I'm having the same issue as Julio. I tried to follow the link you embedded, but it just takes me to the 'Going Deeper with R' landing page. It would be useful to understand this better.
Charlie Hadley
May 5, 2022
Hi Sandra and Julio,
Sorry the link doesn't work. I'm copying and pasting what I wrote on the "Binding Data Frames" lesson previous to this one:
Thanks for reporting this! It’s due to a change in how replace_na() works: it is no longer allowed to “change data types”, which means we need to use 0 if the column is numeric and "0" if the column is a character column.
In case you’re interested, you can see the documentation for this change in the NEWS.md file for the package. But please note the language used here is quite technical.
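The type rule described above can be tried on two small vectors. This is a minimal sketch assuming a recent version of tidyr; counts_chr and counts_num are made-up values, not the course data.

```r
library(tidyr)

# Made-up example vectors: the same counts stored as character and as numeric
counts_chr <- c("12", NA, "30")
counts_num <- c(12, NA, 30)

# The replacement has to match the type of the vector it fills
filled_chr <- replace_na(counts_chr, "0")  # character vector, character replacement
filled_num <- replace_na(counts_num, 0)    # numeric vector, numeric replacement

# Mixing the types is what triggers the error in recent versions of tidyr;
# tryCatch() is used here only so the script keeps running
mismatch <- tryCatch(
  replace_na(counts_chr, 0),
  error = function(e) "type mismatch"
)
```

In a pipeline, this means the replacement value depends on whether replace_na() runs before or after as.numeric().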
Thanks,
Charlie
Niger Sultana
May 5, 2022
Hi, I'm having trouble practicing the code posted at this link (https://rfortherestofus.com/2018/09/making-small-multiples-in-r/). I do not know how to download the data and code from GitHub. It might be a very silly question, but a video showing how to download data and code posted on GitHub would help me understand how the code works.
Cheers Niger
Niger Sultana
May 5, 2022
I figured out that. Thank you.
Charlie Hadley
May 9, 2022
I'm glad it's working for you now!
Delia Ayled Serna Guerrero
May 14, 2022
Hello! When I try to convert it to a function, it tells me there is an error in mutate(): it can't convert 'replace' to match type of "data".
But this worked without problems when not in function form.
Delia Ayled Serna Guerrero
May 14, 2022
I tried the suggestions on the comments but still not working.
Delia Ayled Serna Guerrero
May 15, 2022
Hello, when I run the function without the "remove text" argument it works, but as soon as I add it, it doesn't.
Charlie Hadley
May 16, 2022
Hello Delia,
Here's a video explaining how to fix the function. Let me know if you have any questions.
Thanks,
Charlie
Delia Ayled Serna Guerrero
May 16, 2022
Finally it works! Thanks!!
Julia Nee
November 3, 2022
I have two questions (which I'm happy to bring up in an OH or live session, but am writing here so I don't forget):
(1) is it right that when we make a function, we'd never have a character string as an argument? For instance, example_function <- function(raw_data, name_of_column), but not example_function <- function("raw_data", "name_of_column"). If we want to have arguments that run as strings in the code, we'd use "" when we actually introduced those arguments while using the function, right? As in, example_function(raw_data_filename, "This is a column name and it's a string.")?
(2) I tried to add a fourth argument to my function that assigned the function's output to a new named dataframe, but it didn't work. Can you assign something within the function, or is that not possible? Here's what I had tried:
import_enrollment_by_year <- function(data_to_clean, xyear_yr_, year_year, dataframe_name) {
  dataframe_name % select(-contains(c("percent","grade","kindergarten"))) %>%
    pivot_longer(cols = -district_id,
                 names_to = "race_ethnicity",
                 values_to = "number_of_students") %>%
    mutate(number_of_students = na_if(number_of_students, "-")) %>%
    mutate(number_of_students = as.numeric(number_of_students)) %>%
    mutate(number_of_students = replace_na(number_of_students,0)) %>%
    mutate(race_ethnicity = str_remove(race_ethnicity, xyear_yr_)) %>%
    mutate(race_ethnicity = case_when(
      race_ethnicity == "american_indian_alaska_native" ~ "American Indian/Alaskan Native",
      race_ethnicity == "asian" ~ "Asian",
      race_ethnicity == "white" ~ "White",
      race_ethnicity == "hispanic_latino" ~"Hispanic/Latino",
      race_ethnicity == "multiracial" ~"Multiracial",
      race_ethnicity == "black_african_american" ~ "Black/African American",
      race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian or Pacific Islander",
      TRUE ~ race_ethnicity
    )) %>%
    group_by(district_id) %>%
    mutate(pct = number_of_students / sum(number_of_students)) %>%
    ungroup() %>%
    mutate(year = year_year) %>%
    arrange(district_id, race_ethnicity)
}
Charlie Hadley
November 4, 2022
Hi Julia,
Function argument names aren't written as strings, but you can pass strings as the values of arguments when you call the function.
Your code isn't working because the start of your pipe uses the dataframe_name argument rather than the data_to_clean argument.
It's advisable not to have functions make assignments inside themselves that you intend to use elsewhere (what we would call global assignments).
Instead, we assign the output of a function call to an object.
Cheers, Charlie
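The code examples from the reply above did not survive on this page, so what follows is only a sketch of the two points it makes, using a hypothetical function label_year() and toy data: argument names are bare, string values are supplied in the call, and the result is assigned outside the function.

```r
library(dplyr)

# Argument names (raw_data, data_year) are written without quotes
label_year <- function(raw_data, data_year) {
  raw_data %>%
    mutate(year = data_year)
}

# Toy data standing in for one year of enrollment data
enrollment_17_18 <- tibble(district_id = c(1, 2),
                           number_of_students = c(10, 20))

# The string is passed as a value in the call, and the output is
# assigned to a new object here rather than inside the function
enrollment_by_year_17_18 <- label_year(raw_data = enrollment_17_18,
                                       data_year = "2017-18")
```

Choosing the name of the result at the call site keeps the function reusable: the same function can fill as many differently named objects as needed.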
Josh Gutwill
November 7, 2022
I keep getting this error: Error in FUN(left) : invalid argument to unary operator
Here's my code:
clean_enrollment_data % select(!contains("grade")) %>%
  select(!contains("percent")) %>%
  select(!contains("kindergarten")) %>%
  pivot_longer(cols = -district_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-")) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(number_of_students = replace_na(number_of_students, 0)) %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, string_to_remove)) %>%
  mutate(race_ethnicity = case_when(
    race_ethnicity == "american_indian_alaska_native" ~ "Native American or Alaskan",
    race_ethnicity == "asian" ~ "Asian",
    race_ethnicity == "native_hawaiian_pacific_islander" ~ "Hawaiian or Pacific Islander",
    race_ethnicity == "black_african_american" ~ "Black or African American",
    race_ethnicity == "hispanic_latino" ~ "Hispanic/Latinx",
    race_ethnicity == "white" ~ "White",
    TRUE ~ "Multiracial")) %>%
  group_by(district_id) %>%
  mutate(pct = number_of_students / sum(number_of_students)) %>%
  ungroup() %>%
  mutate(year = data_year)
}
enrollment_by_race_ethnicity_18_19 < - clean_enrollment_data(raw_data = enrollment_18_19, string_to_remove = "x2018_19_", data_year = "2018-2019")
When I copy and paste David's code, I also get an error, though it's a different one:
Error in `mutate()`: ! Problem while computing `number_of_students = replace_na(number_of_students, 0)`. Caused by error in `vec_assign()`: ! Can't convert `replace` to match type of `data`. Run `rlang::last_error()` to see where the error occurred.
Charlie Hadley
November 8, 2022
Hi Josh,
Unfortunately, this error is due to a change in the behaviour of the replace_na() function. In the provided solution it is showing the now deprecated behaviour of being allowed to change the type of data, in this case from character to numeric.
To fix this error you'll need to replace this line
With
Cheers,
Charlie
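On the first error in the question above ("invalid argument to unary operator"), one possible cause, assuming the space in `< -` is really in the script and not an artifact of the comment form, is that R reads `x < - y` as the comparison "x is less than negative y" instead of an assignment. The sketch below reproduces that with a made-up data frame.

```r
# Made-up data frame with a character column, like the cleaned enrollment data
enrollment <- data.frame(race_ethnicity = c("asian", "white"))

# With a space, `< -` is parsed as "less than" followed by a unary minus.
# Negating a character column is not allowed, so this raises an error;
# tryCatch() captures the message so the script keeps running.
result <- tryCatch(
  enrollment < - enrollment,
  error = function(e) conditionMessage(e)
)

# Without the space, `<-` is the usual assignment operator
enrollment_copy <- enrollment
```

If this is what happened, removing the space between `<` and `-` in the assignment line would be the fix for that first error.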
Kirstin O'Dell
November 14, 2022
I thought in a prior video we were told to tidy our data in script and only use markdown for the output/reporting we want to do. I'm seeing that for this we're using markdown to tidy the data inside of the function. I'm confused as to which to be using. Can functions created in script be used in markdown files?
David Keyes
November 15, 2022
You're right that I generally recommend doing cleaning/tidying in an R script file and reporting in RMarkdown. However, I sometimes switch it up when I'm making examples (in this video, I didn't want to have to switch between two file types because I wanted to keep the focus on functions). But, in real projects, I keep R files for cleaning/tidying and Rmd files for reporting.
On your other question, yes, functions can be used anywhere you can run R code (R script files or code chunks in Rmd files). Hope that helps!
Kirstin O'Dell
November 15, 2022
It does. Thanks! To clarify, is it 1) functions are translatable, meaning I can set one up in a script file and run it in a Rmd file or 2) I can set up and run a function within a single script or Rmd file but not between them?
David Keyes
November 15, 2022
Yes, you can run functions across R and Rmd files. I went a bit in depth in explaining it in this video. Hope it helps. If you still have questions, please let me know.
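One common way to share a function between files, which may or may not be the approach shown in the video, is to define it in an R script and load that script with source(). The sketch below writes a tiny script to a temporary folder so it is self-contained; the file name functions.R and the function double_it() are made up for illustration.

```r
# Write a small script containing a function definition
# (in a real project this would be an existing .R file)
script_path <- file.path(tempdir(), "functions.R")
writeLines("double_it <- function(x) { x * 2 }", script_path)

# In another script, or in an RMarkdown code chunk, load the function...
source(script_path)

# ...and then use it like any other function
doubled <- double_it(21)
```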
Kirstin O'Dell
November 17, 2022
Thank you! Very helpful to see the example.
Andrew Paquin
April 30, 2023
Hi David, I watched the "Heads Up" video above. If I understand it correctly, the function we created for the 18-19 data can't be used for the 17-18 data because of 1) the changes to the na_if command, and 2) the fact that the columns are in different formats (dbl and chr) in the two datasets. Is that what's happening?
David Keyes
April 30, 2023
Yep, exactly!
Kiana Robinson
May 17, 2023
What does this line of code do?
race_ethnicity_remove_text = "x2018_19_"
Why was it included in the function statement? This is confusing.
David Keyes
May 17, 2023
It removes the text "x2018_19_" from the race_ethnicity variable. We do this because the function has to work with both 2018/2019 data and 2017/2018 data (which would have the text "x2017_18_"). So, by putting it as an argument, we can make the function work for both years. Does that make sense?
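To see why the text is passed in as an argument, here is a small sketch with a hypothetical function strip_year_prefix() and toy data (not the function from the lesson). The same function handles both years because the prefix to remove is supplied in each call.

```r
library(dplyr)
library(stringr)

# Hypothetical function: the text to remove is an argument, not hard-coded
strip_year_prefix <- function(raw_data, race_ethnicity_remove_text) {
  raw_data %>%
    mutate(race_ethnicity = str_remove(race_ethnicity, race_ethnicity_remove_text))
}

# Toy data: each year's values carry a different prefix
toy_17_18 <- tibble(race_ethnicity = c("x2017_18_asian", "x2017_18_white"))
toy_18_19 <- tibble(race_ethnicity = c("x2018_19_asian", "x2018_19_white"))

clean_17_18 <- strip_year_prefix(toy_17_18, race_ethnicity_remove_text = "x2017_18_")
clean_18_19 <- strip_year_prefix(toy_18_19, race_ethnicity_remove_text = "x2018_19_")
```

After both calls, the two data frames share the same race_ethnicity values, which is what lets a later case_when() or bind_rows() treat them alike.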
Zain Asaf
May 22, 2023
Hi Charlie and Dan, I am having the same issues as a couple of people with the replace_na line:
I get the following message:
Error in `mutate()`: ℹ In argument: `number_of_students = replace_na(number_of_students, "0")`. Caused by error in `vec_assign()`: ! Can't convert `replace` to match type of `data`.
However, I have used the following code for that line: mutate(number_of_students = replace_na(number_of_students, "0")) i.e. I have put the "0" in quotation marks, as suggested. Here is the code I have used. Note, I just labeled the column "ethnicity" not "race_ethnicity".
clean_enrollment_data%
  select(-contains("grade")) %>%
  select(-contains("kindergarten")) %>%
  select(-contains("percent")) %>%
  pivot_longer(cols = -district_id,
               names_to = "ethnicity",
               values_to = "number_of_students") %>%
  mutate(number_of_students = na_if(number_of_students, "-" )) %>%
  mutate(number_of_students = as.numeric(number_of_students)) %>%
  mutate(number_of_students = replace_na(number_of_students, "0")) %>%
  mutate(ethnicity = str_remove(ethnicity, ethnicity_remove_text)) %>%
  mutate(ethnicity = case_when(
    ethnicity == "american_indian_alaska_native" ~ "native american",
    ethnicity == "native_hawaiian_pacific_islander" ~"pacific islander",
    ethnicity == "hispanic_latino" ~ "latino",
    ethnicity == "black_african_american" ~ "african american",
    ethnicity == "white" ~ "white",
    ethnicity == "asian" ~ "Asian",
    ethnicity == "multiracial" ~ "Multiracial"
  )) %>%
  group_by(district_id)%>%
  mutate(pct = number_of_students / sum(number_of_students)*100) %>%
  ungroup() %>%
  mutate(year = "data_year")
David Keyes
May 22, 2023
Can you please post your full code as a gist and share the link?
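Without the full script it is hard to be sure, but two things in the pipeline posted above stand out. First, as.numeric() runs before replace_na(), so the column is already numeric by then and the replacement would need to be 0 rather than "0". Second, "data_year" in quotes stores that literal text instead of the argument's value. The sketch below shows one possible ordering with a hypothetical function add_counts_and_year() and toy data.

```r
library(dplyr)
library(tidyr)

# Hypothetical, cut-down version of the cleaning function
add_counts_and_year <- function(raw_data, data_year) {
  raw_data %>%
    # The column is still character here, so comparing against "-" is fine
    mutate(number_of_students = na_if(number_of_students, "-")) %>%
    # After this step the column is numeric...
    mutate(number_of_students = as.numeric(number_of_students)) %>%
    # ...so the replacement must be numeric as well
    mutate(number_of_students = replace_na(number_of_students, 0)) %>%
    # Unquoted: uses the value passed in, not the text "data_year"
    mutate(year = data_year)
}

# Toy data with a "-" standing in for a missing count
toy <- tibble(district_id = c(1, 1),
              number_of_students = c("5", "-"))

result <- add_counts_and_year(toy, data_year = "2018-2019")
```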