Skip to content
Coming soon: Ally. Your guide to the world of AI and R. Learn More →
R for the Rest of Us Logo

R in 3 Months (Fall 2025)

Tidy Data Rule #3: Every Cell is a Single Value

Transcript

Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.

View code shown in video
# Load Packages -----------------------------------------------------------

library(tidyverse)
library(janitor)
library(readxl)
library(fs)

# Create Directory --------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# download.file(url = "https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/pagr_schools_math_tot_raceethnicity_2122.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/TestResults2019/pagr_schools_math_tot_raceethnicity_1819.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_tot_raceethnicity_1819.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2018/pagr_schools_math_raceethnicity_1718.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1718.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2017/pagr_schools_math_raceethnicity_1617.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1617.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2016/pagr_schools_math_raceethnicity_1516.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1516.xlsx")

# Import Data -------------------------------------------------------------

math_scores_2021_2022 <-
  read_excel("data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx") |> 
  clean_names()

third_grade_math_proficiency_2021_2022 <-
  math_scores_2021_2022 |> 
  filter(student_group == "Total Population (All Students)") |> 
  filter(grade_level == "Grade 3") |> 
  select(academic_year, school_id, contains("number_level_")) |> 
  pivot_longer(cols = contains("number_level_"),
               names_to = "proficiency_level",
               values_to = "number_of_students")

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = str_remove(proficiency_level,
                                        pattern = "number_level_"))

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = recode(proficiency_level, 
                                    "number_level_4" = "4",
                                    "number_level_3" = "3",
                                    "number_level_2" = "2",
                                    "number_level_1" = "1"))

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = if_else(proficiency_level == "number_level_4", 
                                     true = "4", 
                                     false = proficiency_level)) |> 
  mutate(proficiency_level = if_else(proficiency_level == "number_level_3", 
                                     true = "3", 
                                     false = proficiency_level)) |> 
  mutate(proficiency_level = if_else(proficiency_level == "number_level_2", 
                                     true = "2", 
                                     false = proficiency_level)) |> 
  mutate(proficiency_level = if_else(proficiency_level == "number_level_1", 
                                     true = "1", 
                                     false = proficiency_level))

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = case_when(
    proficiency_level == "number_level_4" ~ "4",
    proficiency_level == "number_level_3" ~ "3",
    proficiency_level == "number_level_2" ~ "2",
    proficiency_level == "number_level_1" ~ "1"
  ))

third_grade_math_proficiency_2021_2022 |> 
  mutate(proficiency_level = parse_number(proficiency_level))

Your Turn

  1. Remove the “x2022_23_” portion of the race_ethnicity variable using str_remove().

  2. Convert all instances of the race_ethnicity variable to more meaningful observations (e.g. turn “american_indian_alaska_native” into “American Indian/Alaska Native”) using recode(), if_else(), or case_when().

Use the following starter code to help you if necessary.

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)
library(readxl)
library(janitor)

# Create Directories ------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# https://www.oregon.gov/ode/reports-and-data/students/Pages/Student-Enrollment-Reports.aspx

# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20222023.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20222023.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20212022.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20212022.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20202021.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20202021.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20192020.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20192020.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20182019.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20182019.xlsx")

# Import Data -------------------------------------------------------------

enrollment_2022_2023 <- read_excel(path = "data-raw/fallmembershipreport_20222023.xlsx",
                                   sheet = "School 2022-23") |> 
  clean_names()

# Tidy and Clean Data -----------------------------------------------------

enrollment_by_race_ethnicity_2022_2023 <-
  enrollment_2022_2023 |> 
  select(district_institution_id, x2022_23_american_indian_alaska_native:x2022_23_percent_multi_racial) |> 
  select(-contains("percent")) |> 
  pivot_longer(cols = -district_institution_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students")

Learn More

Bob Rudis has a semi-complex article on how he uses case_when() to work with data related to his work on internet security.

You might also find this video by Sharon Machlis on case_when() helpful.

And, in case you think I’m the only one who loves case_when() check out this love letter to the function by Matt Kerlogue.

Have any questions? Put them below and we will help you out!

You need to be signed-in to comment on this post. Login.

Joseph Lehman

Joseph Lehman • April 23, 2024

We may get to this in the future or already covered it, but is there a way to drop characters either at the beginning or end of a cell? I am thinking something like the "RIGHT" and "LEFT" functions in Excel.

David Keyes

David Keyes Founder • April 24, 2024

Yup, check out this quick video! Code I used in it is below:

library(tidyverse)
library(palmerpenguins)

penguins |> 
  select(sex) |> 
  mutate(sex_clipped = str_sub(sex, start = 1, end = 1))

Hope this helps!

Raouf Kilada

Raouf Kilada • November 1, 2024

Hey David, I am trying to do exactly wht you showed inthe video: third_grade_math_proficiency_2021_2022 |> group_by(school_id) |> mutate(pct = number_of_students / sum(number_of_students, na.rm = TRUE))

I got this error:

Error in mutate(): In argument: pct = number_of_students/sum(number_of_students, na.rm = TRUE). In group 1: school_id = 2. Caused by error in sum(): ! invalid 'type' (character) of argument Run rlang::last_trace() to see where the error occurred.

Gracielle Higino

Gracielle Higino Coach • November 4, 2024

Hi Raouf!

I'd need to take a look at your code to be able to debug it properly (feel free to send it to me by email), but it seems like R in interpreting the variable number_of_students as a character. Try adding the following after your group_by:

mutate(number_of_students = as.numeric(number_of_students)) |> 

So your code would become this:

third_grade_math_proficiency_2021_2022 |>
  group_by(district_institution_id) |> 
  mutate(number_of_students = as.numeric(number_of_students)) |> 
  mutate(pct = number_of_students / sum(number_of_students, na.rm = TRUE))
Da'Shon Carr

Da'Shon Carr • April 17, 2025

I cleaned up my variables before doing the pivot longer and used rename_with str_remove. It seemed to work this way also.

enrollment_2022_2023 |> select(district_institution_id, school_institution_id, x2022_23_american_indian_alaska_native:x2022_23_multi_racial) |> select(-contains("percent")) |> rename_with(~str_remove(.,"x2022_23_")) |> pivot_longer(cols = -c(district_institution_id, school_institution_id), names_to = "2022_23_race_ethnicity", values_to = "number_of_students") ->enrollment_by_race_ethnicity_2022_2023

gene trevino

gene trevino • August 4, 2025

I keep getting NA in the race_ethnicity variable.

enrollment_by_race_ethnicity_2022_2023 %>% mutate(race_ethnicity = str_remove(race_ethnicity, pattern = "x2022_23")) %>% mutate(race_ethnicity = case_when( race_ethnicity == "american_indian_alaska_native" ~ "AIAN", race_ethnicity == "asian" ~ "A", race_ethnicity == "black_african_american" ~ "B", race_ethnicity == "hispanic_latino" ~ "HL", race_ethnicity == "multiracial" ~ "MR", race_ethnicity == "native_hawaiian_pacicif_islander" ~ "H", race_ethnicity == "white" ~ "WHITE", race_ethnicity == "multi_racial" ~ "M"))

HELP !

David Keyes

David Keyes Founder • August 5, 2025

Can you please post your full code all the way from data import so I can see what might be going on?

Annie Borland

Annie Borland • October 23, 2025

Hi,

Could you please give an example of how to write the .default line when using case_when?

I.e. how do you leave the values that you haven't set a condition for unchanged?

Gracielle Higino

Gracielle Higino Coach • October 22, 2025

Hi Annie! Sure! You can set the .default argument to a single value (e.g., .default = "unknown"), or to the corresponding value in a column (e.g., .default = column_name). Try running the code below to see what happens to the dataset in each case:

library(tidyverse)
library(palmerpenguins)

# example of use of case_when with .default argument set to a single value

penguins |> 
  mutate(
    size_category = case_when(
      body_mass_g < 3500 ~ "small",
      body_mass_g >= 3500 & body_mass_g < 4500 ~ "medium",
      body_mass_g >= 4500 ~ "large",
      .default = "unknown"
    )) |> View()

# another example using another variable as the .default

penguins |> 
  mutate(
    size_category = case_when(
      body_mass_g < 3500 ~ "small",
      body_mass_g >= 3500 & body_mass_g < 4500 ~ "medium",
      body_mass_g >= 4500 ~ "large",
      .default = species
    )) |> View()

If you want to recode a column instead of creating a new one and keep the original values if no condition is fulfilled, then you can do this:

penguins |> 
  mutate(
    body_mass_g = case_when(
    body_mass_g < 3500 ~ 1,
    body_mass_g >= 3500 & body_mass_g < 4500 ~ 2,
    .default = body_mass_g
    )) |> View()

The only thing is that the columns must contain the same value type in all rows. So you can't mix characters and numbers, and you need to be careful with that when creating your conditions. For example, if you want to keep the original values and they are numeric, your conditions should only contain numeric values to replace the original ones, just like the last example above.

Lorenzo Dragani

Lorenzo Dragani • December 28, 2025

I believe there is a small typo in the slide about case_when(): .default must be written with = (i.e., .default = "…") rather than with ~, because ~ is only used for condition–value pairs in case_when()

David Keyes

David Keyes Founder • January 6, 2026

Thanks! Will update it.

Course Content

128 Lessons