Skip to content
R for the Rest of Us Logo

Going Deeper with R

Tidy Data Rule #3: Every Cell is a Single Value

Transcript

Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.

View code shown in video
# Load Packages -----------------------------------------------------------

library(tidyverse)
library(janitor)
library(readxl)
library(fs)

# Create Directory --------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# download.file(url = "https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/pagr_schools_math_tot_raceethnicity_2122.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/TestResults2019/pagr_schools_math_tot_raceethnicity_1819.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_tot_raceethnicity_1819.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2018/pagr_schools_math_raceethnicity_1718.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1718.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2017/pagr_schools_math_raceethnicity_1617.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1617.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2016/pagr_schools_math_raceethnicity_1516.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1516.xlsx")

# Import Data -------------------------------------------------------------

math_scores_2021_2022 <-
  read_excel("data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx") |> 
  clean_names()

third_grade_math_proficiency_2021_2022 <-
  math_scores_2021_2022 |> 
  filter(student_group == "Total Population (All Students)") |> 
  filter(grade_level == "Grade 3") |> 
  select(academic_year, school_id, contains("number_level_")) |> 
  pivot_longer(cols = contains("number_level_"),
               names_to = "proficiency_level",
               values_to = "number_of_students")

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = str_remove(proficiency_level,
                                        pattern = "number_level_"))

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = recode(proficiency_level, 
                                    "number_level_4" = "4",
                                    "number_level_3" = "3",
                                    "number_level_2" = "2",
                                    "number_level_1" = "1"))

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = if_else(proficiency_level == "number_level_4", 
                                     true = "4", 
                                     false = proficiency_level)) |> 
  mutate(proficiency_level = if_else(proficiency_level == "number_level_3", 
                                     true = "3", 
                                     false = proficiency_level)) |> 
  mutate(proficiency_level = if_else(proficiency_level == "number_level_2", 
                                     true = "2", 
                                     false = proficiency_level)) |> 
  mutate(proficiency_level = if_else(proficiency_level == "number_level_1", 
                                     true = "1", 
                                     false = proficiency_level))

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = case_when(
    proficiency_level == "number_level_4" ~ "4",
    proficiency_level == "number_level_3" ~ "3",
    proficiency_level == "number_level_2" ~ "2",
    proficiency_level == "number_level_1" ~ "1"
  ))

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = parse_number(proficiency_level))

Your Turn

  1. Remove the “x2022_23_” portion of the race_ethnicity variable using str_remove().

  2. Convert all instances of the race_ethnicity variable to more meaningful observations (e.g. turn “american_indian_alaska_native” into “American Indian/Alaska Native”) using recode(), if_else(), or case_when().

Use the following starter code to help you if necessary.

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)
library(readxl)
library(janitor)

# Create Directories ------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# https://www.oregon.gov/ode/reports-and-data/students/Pages/Student-Enrollment-Reports.aspx

# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20222023.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20222023.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20212022.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20212022.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20202021.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20202021.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20192020.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20192020.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20182019.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20182019.xlsx")

# Import Data -------------------------------------------------------------

enrollment_2022_2023 <- read_excel(path = "data-raw/fallmembershipreport_20222023.xlsx",
                                   sheet = "School 2022-23") |> 
  clean_names()

# Tidy and Clean Data -----------------------------------------------------

enrollment_by_race_ethnicity_2022_2023 <-
  enrollment_2022_2023 |> 
  select(district_institution_id, x2022_23_american_indian_alaska_native:x2022_23_percent_multi_racial) |> 
  select(-contains("percent")) |> 
  pivot_longer(cols = -district_institution_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") 

Learn More

Bob Rudis has a semi-complex article on how he uses case_when() to work with data related to his work on internet security.

You might also find this video by Sharon Machlis on case_when() helpful.

And, in case you think I’m the only one who loves case_when() check out this love letter to the function by Matt Kerlogue.

Have any questions? Put them below and we will help you out!

You need to be signed-in to comment on this post. Login.

Joseph Lehman

Joseph Lehman • April 23, 2024

We may get to this in the future or already covered it, but is there a way to drop characters either at the beginning or end of a cell? I am thinking something like the "RIGHT" and "LEFT" functions in Excel.

David Keyes

David Keyes Founder • April 24, 2024

Yup, check out this quick video! Code I used in it is below:

library(tidyverse)
library(palmerpenguins)

penguins |> 
  select(sex) |> 
  mutate(sex_clipped = str_sub(sex, start = 1, end = 1))

Hope this helps!

Raouf Kilada

Raouf Kilada • November 1, 2024

Hey David, I am trying to do exactly wht you showed inthe video: third_grade_math_proficiency_2021_2022 |> group_by(school_id) |> mutate(pct = number_of_students / sum(number_of_students, na.rm = TRUE))

I got this error:

Error in mutate(): In argument: pct = number_of_students/sum(number_of_students, na.rm = TRUE). In group 1: school_id = 2. Caused by error in sum(): ! invalid 'type' (character) of argument Run rlang::last_trace() to see where the error occurred.

Gracielle Higino

Gracielle Higino Coach • November 4, 2024

Hi Raouf!

I'd need to take a look at your code to be able to debug it properly (feel free to send it to me by email), but it seems like R in interpreting the variable number_of_students as a character. Try adding the following after your group_by:

mutate(number_of_students = as.numeric(number_of_students)) |> 

So your code would become this:

third_grade_math_proficiency_2021_2022 |>
  group_by(district_institution_id) |> 
  mutate(number_of_students = as.numeric(number_of_students)) |> 
  mutate(pct = number_of_students / sum(number_of_students, na.rm = TRUE))
Da'Shon Carr

Da'Shon Carr • April 17, 2025

I cleaned up my variables before doing the pivot longer and used rename_with str_remove. It seemed to work this way also.

enrollment_2022_2023 |> select(district_institution_id, school_institution_id, x2022_23_american_indian_alaska_native:x2022_23_multi_racial) |> select(-contains("percent")) |> rename_with(~str_remove(.,"x2022_23_")) |> pivot_longer(cols = -c(district_institution_id, school_institution_id), names_to = "2022_23_race_ethnicity", values_to = "number_of_students") ->enrollment_by_race_ethnicity_2022_2023

Course Content

44 Lessons