Dealing with Missing Data
This lesson is called Dealing with Missing Data, part of the R in 3 Months (Spring 2025) course. This lesson is called Dealing with Missing Data, part of the R in 3 Months (Spring 2025) course.
Transcript
Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.
View code shown in video
# Load Packages -----------------------------------------------------------
library(tidyverse)
library(fs)
library(readxl)
library(janitor)
# Create Directories ------------------------------------------------------
dir_create("data-raw")
# Download Data -----------------------------------------------------------
# https://www.oregon.gov/ode/educator-resources/assessment/Pages/Assessment-Group-Reports.aspx
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/pagr_schools_math_tot_raceethnicity_2122.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/TestResults2019/pagr_schools_math_tot_raceethnicity_1819.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_tot_raceethnicity_1819.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2018/pagr_schools_math_raceethnicity_1718.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_raceethnicity_1718.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2017/pagr_schools_math_raceethnicity_1617.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_raceethnicity_1617.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2016/pagr_schools_math_raceethnicity_1516.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_raceethnicity_1516.xlsx")
# Import Data -------------------------------------------------------------
math_scores_2021_2022 <-
read_excel(path = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx") |>
clean_names()
# Tidy and Clean Data -----------------------------------------------------
third_grade_math_proficiency_2021_2022 <-
math_scores_2021_2022 |>
filter(student_group == "Total Population (All Students)") |>
filter(grade_level == "Grade 3") |>
select(academic_year, school_id, contains("number_level")) |>
pivot_longer(cols = starts_with("number_level"),
names_to = "proficiency_level",
values_to = "number_of_students") |>
mutate(proficiency_level = case_when(
proficiency_level == "number_level_4" ~ "4",
proficiency_level == "number_level_3" ~ "3",
proficiency_level == "number_level_2" ~ "2",
proficiency_level == "number_level_1" ~ "1"
))
third_grade_math_proficiency_2021_2022 |>
count(number_of_students) |>
view()
third_grade_math_proficiency_2021_2022 |>
mutate(number_of_students = na_if(number_of_students, "*")) |>
mutate(number_of_students = na_if(number_of_students, "-")) |>
mutate(number_of_students = na_if(number_of_students, "--")) |>
mutate(number_of_students = as.numeric(number_of_students))
Your Turn
Convert all of the missing values in the
number_of_students
variable to NA usingna_if().
Convert all of the NA values you just made to 0 using
replace_na()
.
Use this code to get started.
# Load Packages -----------------------------------------------------------
library(tidyverse)
library(fs)
library(readxl)
library(janitor)
# Create Directories ------------------------------------------------------
dir_create("data-raw")
# Download Data -----------------------------------------------------------
# https://www.oregon.gov/ode/reports-and-data/students/Pages/Student-Enrollment-Reports.aspx
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20222023.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20222023.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20212022.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20212022.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20202021.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20202021.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20192020.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20192020.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20182019.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20182019.xlsx")
# Import Data -------------------------------------------------------------
enrollment_2022_2023 <- read_excel(path = "data-raw/fallmembershipreport_20222023.xlsx",
sheet = "School 2022-23") |>
clean_names()
# Tidy and Clean Data -----------------------------------------------------
enrollment_by_race_ethnicity_2022_2023 <-
enrollment_2022_2023 |>
select(district_institution_id, school_institution_id,
x2022_23_american_indian_alaska_native:x2022_23_multi_racial) |>
select(-contains("percent")) |>
pivot_longer(cols = -c(district_institution_id, school_institution_id),
names_to = "race_ethnicity",
values_to = "number_of_students") |>
mutate(race_ethnicity = str_remove(race_ethnicity, pattern = "x2022_23_")) |>
mutate(race_ethnicity = case_when(
race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native",
race_ethnicity == "asian" ~ "Asian",
race_ethnicity == "black_african_american" ~ "Black/African American",
race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
race_ethnicity == "multiracial" ~ "Multi-Racial",
race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian Pacific Islander",
race_ethnicity == "white" ~ "White",
race_ethnicity == "multi_racial" ~ "Multiracial"
))
Learn More
The best place to learn about replace_na()
is on the tidyr
website, which has an excellent documentation page about the function.na_if()
comes from the dplyr
package so check it out there.
I referenced the read_csv()
function having an na argument early on in the video. The read_excel()
function has an na argument as well. The na
arguments in these two functions can help you deal with missing data on import. However, replace_na()
and na_if()
are good to know because you don’t always have this option!
If you want to go deep on exploring missing data in your datasets in R, there is a package called naniar
that will help. Allison Horst also has a really nice tutorial on using it here.
Have any questions? Put them below and we will help you out!
Course Content
127 Lessons
You need to be signed-in to comment on this post. Login.
Jorge Eduardo Baquero • April 30, 2024
Hi David, I wonder what would be the difference between
third_grade_math_proficiency_2021_2022 |> mutate(....) |> mutate(....) |> mutate(....)
and
third_grade_math_proficiency_2021_2022 |> mutate(old_variable1 = new_variable1, old_variable2 = new_variable2, old_variable3 = new_variable3)
I remember reading (or watching) somewhere that is essentially the same. Is it?
David Keyes Founder • April 30, 2024
Yes, you're right that they are the same. I tend to do multiple
mutate()
calls, though I'm not totally sure why.Alagammai Meyyappan • April 27, 2025
Quick q - Does using as.numeric() or parse_number() in R automatically change non-numeric characters into NA?
Gracielle Higino Coach • April 29, 2025
Yes! However, if the variable is a factor,
as.numeric()
will return the underlying numeric (integer) representation, butparse_number()
won't work. For example, I mutated thenumber_of_students
variable into a factor and got values like 570, 464 and 2 for the corresponding values of 0, 6 and 1 in the original variable, but got just an error when trying toparse_number()
. Check this out: