Functions

This lesson is called Functions, part of the Going Deeper with R course. This lesson is called Functions, part of the Going Deeper with R course.

Transcript

Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.

View code shown in video

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)
library(readxl)
library(janitor)

# Create Directories ------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# https://www.oregon.gov/ode/educator-resources/assessment/Pages/Assessment-Group-Reports.aspx

# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/pagr_schools_math_tot_raceethnicity_2122.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/TestResults2019/pagr_schools_math_tot_raceethnicity_1819.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_tot_raceethnicity_1819.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2018/pagr_schools_math_raceethnicity_1718.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1718.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2017/pagr_schools_math_raceethnicity_1617.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1617.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2016/pagr_schools_math_raceethnicity_1516.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1516.xlsx")


# Import, Tidy, and Clean Data --------------------------------------------

clean_math_proficiency_data <- function(raw_data) {
  
  read_excel(path = raw_data) |> 
    clean_names() |> 
    filter(student_group == "Total Population (All Students)") |> 
    filter(grade_level == "Grade 3") |> 
    select(academic_year, school_id, contains("number_level")) |> 
    pivot_longer(cols = starts_with("number_level"),
                 names_to = "proficiency_level",
                 values_to = "number_of_students") |> 
    mutate(proficiency_level = case_when(
      proficiency_level == "number_level_4" ~ "4",
      proficiency_level == "number_level_3" ~ "3",
      proficiency_level == "number_level_2" ~ "2",
      proficiency_level == "number_level_1" ~ "1"
    )) |> 
    mutate(number_of_students = parse_number(number_of_students)) |> 
    group_by(school_id) |> 
    mutate(pct = number_of_students / sum(number_of_students, na.rm = TRUE)) |> 
    ungroup()
  
}


third_grade_math_proficiency_2021_2022 <-
  clean_math_proficiency_data(raw_data = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx")

third_grade_math_proficiency_2018_2019 <-
  clean_math_proficiency_data(raw_data = "data-raw/pagr_schools_math_tot_raceethnicity_1819.xlsx")


third_grade_math_proficiency <- 
  bind_rows(third_grade_math_proficiency_2018_2019,
            third_grade_math_proficiency_2021_2022)

Your Turn

Solution

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)
library(readxl)
library(janitor)

# Create Directories ------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# https://www.oregon.gov/ode/reports-and-data/students/Pages/Student-Enrollment-Reports.aspx

# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20222023.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20222023.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20212022.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20212022.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20202021.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20202021.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20192020.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20192020.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20182019.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20182019.xlsx")

# Import, Tidy, and Clean Data -----------------------------------------------------

clean_enrollment_data <- function(excel_file,
                                  sheet_name) {
  
  read_excel(path = excel_file,
             sheet = sheet_name) |> 
    clean_names() |> 
    select(1, 3, 7:19) |> 
    select(-contains("percent")) |> 
    set_names("district_institution_id",
              "school_institution_id",
              "american_indian_alaska_native",
              "asian",
              "native_hawaiian_pacific_islander",
              "black_african_american",
              "hispanic_latino",
              "white",
              "multi_racial") |> 
    pivot_longer(cols = -c(district_institution_id, school_institution_id),
                 names_to = "race_ethnicity",
                 values_to = "number_of_students") |> 
    mutate(race_ethnicity = case_when(
      race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native",
      race_ethnicity == "asian" ~ "Asian",
      race_ethnicity == "black_african_american" ~ "Black/African American",
      race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
      race_ethnicity == "multiracial" ~ "Multi-Racial",
      race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian Pacific Islander",
      race_ethnicity == "white" ~ "White",
      race_ethnicity == "multi_racial" ~ "Multiracial"
    )) |> 
    mutate(number_of_students = parse_number(number_of_students)) |> 
    group_by(district_institution_id, race_ethnicity) |> 
    summarize(number_of_students = sum(number_of_students, na.rm = TRUE)) |> 
    ungroup() |> 
    group_by(district_institution_id) |> 
    mutate(pct = number_of_students / sum(number_of_students)) |> 
    ungroup() |> 
    mutate(year = sheet_name) 
  
}

enrollment_by_race_ethnicity_2022_2023 <- 
  clean_enrollment_data(excel_file = "data-raw/fallmembershipreport_20222023.xlsx",
                        sheet_name = "School 2022-23") 

enrollment_by_race_ethnicity_2021_2022 <-
  clean_enrollment_data(excel_file = "data-raw/fallmembershipreport_20212022.xlsx",
                        sheet_name = "School 2021-22") 

enrollment_by_race_ethnicity <- 
  bind_rows(enrollment_by_race_ethnicity_2021_2022,
            enrollment_by_race_ethnicity_2022_2023)

Create a function to clean each year of enrollment data.
To check that your function works, create enrollment_by_race_ethnicity_2021_2022 and enrollment_by_race_ethnicity_2022_2023 data frames and then bind them together with bind_rows().

This exercise is challenging! Use the starter code below to help you if you need to.

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)
library(readxl)
library(janitor)

# Create Directories ------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# https://www.oregon.gov/ode/reports-and-data/students/Pages/Student-Enrollment-Reports.aspx

# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20222023.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20222023.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20212022.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20212022.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20202021.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20202021.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20192020.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20192020.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20182019.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20182019.xlsx")

# Import, Tidy, and Clean Data -----------------------------------------------------

clean_enrollment_data <- function(excel_file,
                                  sheet_name) {
  
  read_excel(path = YOURCODEHERE,
             sheet = YOURCODEHERE) |> 
    clean_names() |> 
    
    # I've selected by column position rather than names 
    # because the column names vary in the data between years
    # but they're always in the same positions
    select(1, 3, 7:19) |> 
    
    select(-contains("percent")) |> 
    set_names("district_institution_id",
              YOURCODEHERE,
              YOURCODEHERE,
              YOURCODEHERE,
              YOURCODEHERE,
              YOURCODEHERE,
              YOURCODEHERE,
              YOURCODEHERE,
              YOURCODEHERE) |> 
    pivot_longer(cols = -c(district_institution_id, school_institution_id),
                 names_to = "race_ethnicity",
                 values_to = "number_of_students") |> 
    mutate(race_ethnicity = case_when(
      race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native",
      race_ethnicity == "asian" ~ "Asian",
      race_ethnicity == "black_african_american" ~ "Black/African American",
      race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
      race_ethnicity == "multiracial" ~ "Multi-Racial",
      race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian Pacific Islander",
      race_ethnicity == "white" ~ "White",
      race_ethnicity == "multi_racial" ~ "Multiracial"
    )) |> 
    mutate(number_of_students = parse_number(number_of_students)) |> 
    group_by(district_institution_id, race_ethnicity) |> 
    summarize(number_of_students = sum(number_of_students, na.rm = TRUE)) |> 
    ungroup() |> 
    group_by(district_institution_id) |> 
    mutate(pct = number_of_students / sum(number_of_students)) |> 
    ungroup() |> 
    mutate(year = sheet_name) 
  
}

enrollment_by_race_ethnicity_2021_2022 <- 
  clean_enrollment_data(excel_file = YOURCODEHERE,
                        sheet_name = YOURCODEHERE) 

enrollment_by_race_ethnicity_2022_2023 <-
  clean_enrollment_data(excel_file = YOURCODEHERE,
                        sheet_name = YOURCODEHERE) 

enrollment_by_race_ethnicity <- 
  bind_rows(enrollment_by_race_ethnicity_2021_2022,
            enrollment_by_race_ethnicity_2022_2023)

Learn More

If you want to learn more about the global options I showed in this lesson, the video from another lesson is below.

Transcript

Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.

Have any questions? Put them below and we will help you out!

You need to be signed-in to comment on this post. Login.

Danisha Lequet • December 17, 2023

enrollment_2022_2023 <- read_excel(path = "data-raw/fallmembershipreport_20222023.xlsx", sheet = "School 2022-23") |> clean_names() math_scores_2021_2022 <- read_excel ("data-raw/pagr schools math tot raceethnicity_2122.xlsx") |> clean_names()

math_scores_2021_2022 |> filter(student_group == "Total Population (All Students)") |> filter(grade_level == "Grade 3") glimpse select(academic_year, school_id, contains("number_level_")) |> view() pivot_longer(cols = contains("number_level_")), view() names_to = "proficiency_levels", values_to = "number_of_students")

third_grade_math_proficiency_2021_2022 |> mutate(proficiency_level = str_remove(proficiency_level, pattern = "number_level_")) third_grade_math_proficiency_2021_2022 |> mutate(proficiency_level = recode(proficiency_level, "number_level_4" = "4", "number_level_3" = "3", "number_level_2" = "2", "number_level_1" = "1"))

third_grade_math_proficiency_2021_2022 |> mutate(proficiency_level = case_when( proficiency_level == "number_level_4" ~"4" proficiency_level == "number_level_4" ~"3" proficiency_level == "number_level_4" ~"2" proficiency_level == "number_level_4" ~"1"

))

third_grade_math_proficiency_2021_2022 |> mutate(proficiency_level = parse_number(proficiency_level))

enrollment_by_race_ethnicity_2022_2023 <- enrollment_2022_2023 |> select(district_institution_id, x2022_23_american_indian_alaska_native:x2022_23_percent_multi_racial) |> select(-contains("percent")) |> pivot_longer(cols = -district_institution_id, names_to = "race_ethnicity", values_to = "number_of_students")

mutate(number_of_students = parse_number(number_of_students) bind_rows(third_grade_proficiency_2018_2019)

gss_cat |> view() third_grade_math_proficiency_2021_2022 |> group_by(school_id) mutate(pct = number_of_students / sum(number_of_students, na.rm =TRUE

))

third_grade_math_proficiency_2021_2022 |> enrollment_by_race_ethnicity_2022_2023 <- enrollment_2022_2023 |> select(district_institution_id, school_institution_id, x2022_23_american_indian_alaska_native:x2022_23_multi_racial) |> select(-contains("percent")) |> pivot_longer(cols = -c(district_institution_id, school_institution_id), names_to = "race_ethnicity", values_to = "number_of_students") |> mutate(race_ethnicity = str_remove(race_ethnicity, pattern = "x2022_23_")) |> mutate(race_ethnicity = case_when( race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native", race_ethnicity == "asian" ~ "Asian", race_ethnicity == "black_african_american" ~ "Black/African American", race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino", race_ethnicity == "multiracial" ~ "Multi-Racial", race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian Pacific Islander", race_ethnicity == "white" ~ "White", race_ethnicity == "multi_racial" ~ "Multiracial" ))

Can you review my code and apply any corrective changes. There is something I am not incorporating correctly. The data export is not displaying as expected.

Gracielle Higino Coach • December 17, 2023

Hi Danisha! It's not clear to me what you are trying to assign to object math_scores_2021_2022. I suspect the main error is there. Also, I noticed you're doing a filter operation that looks more like a select combined with mutate. When in doubt, run your code line by line from scratch, starting a new R session, until an error appears. Try to copy-paste the code displayed in the "Your turn" section and fill in the blanks to see if this process works for you.

Eric Juskewitz • January 24, 2024

A question for code readability:

ATM we use set_names() to get a column name which we later on change with mutate(race_ethnicity = case_when(...)) Would it be cleaner to immediately change them to the final out come or is the code more readable/understandable that way?

David Keyes Founder • January 25, 2024

To be honest, this is really just a personal preference thing. Whatever works best for you is great, as long as you understand what is going on (and, by your comment, I'm pretty sure you do!).

Rachel Udow • April 28, 2024

Incorporating functions will be a gamechanger for me! I have a related workflow question. In my work, I have several use cases where I implement similar cleaning steps across many raw assessment files; however, there is slight variation between the raw files due to differences in variable names, order, response options, etc. over time (a separate issue to tackle!).

In this case, what would you recommend for script structure? I'm thinking one R script file for establishing the functions, and another R script file that cleans each assessment by year -- starting with glimpse() or something similar to understand the structure of each file, then carrying out cleaning steps specific to that file/not standard across files, and finally implementing the functions that are standard across files. Does this make sense, or is there a better/cleaner way to approach this combined approach of working with functions and also implementing "extra" cleaning outside the functions? Thanks!

David Keyes Founder • April 29, 2024

Your approach sounds like a good one to me! Oftentimes what I do is to make a function to clean data. I run this on multiple files, but then I have some additional code that I run on individual files to handle differences. I recorded a video to show you what this looks like. Hope this helps!

Rachel Udow • May 1, 2024

Thank you for the video response, David -- that was really helpful and generous of you! If I'm understanding correctly, it sounds like when there are just minor variations in the cleaning that needs to be done by data source to get to the same end structure, this can be accomplished efficiently by adding some if/case_when "caveats" to the function(s) involved. I'll look back at my Week 8 project and see if this would make sense as an update -- I'm working w/ chronic absenteeism data from 5 different states and currently have separate functions set up for cleaning each of the states on an annual basis, which are ultimately being merged into one long data frame.

Also, that map function you showed with obtn_years is going to be a gamechanger for me! Thanks!

Haifa Haroon • April 24, 2025

Just as a heads up, because it was throwing me off with the number of fields, race_ethnicity == "multi_racial" ~ "Multiracial" is included twice in the starter code (assuming it wasn't thrown in intentionally :)