Advanced Summarizing
This lesson is called Advanced Summarizing, part of the Going Deeper with R course. This lesson is called Advanced Summarizing, part of the Going Deeper with R course.
Transcript
Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.
View code shown in video
# Load Packages -----------------------------------------------------------
library(tidyverse)
library(fs)
library(readxl)
library(janitor)
# Create Directories ------------------------------------------------------
dir_create("data-raw")
# Download Data -----------------------------------------------------------
# https://www.oregon.gov/ode/educator-resources/assessment/Pages/Assessment-Group-Reports.aspx
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/pagr_schools_math_tot_raceethnicity_2122.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/TestResults2019/pagr_schools_math_tot_raceethnicity_1819.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_tot_raceethnicity_1819.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2018/pagr_schools_math_raceethnicity_1718.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_raceethnicity_1718.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2017/pagr_schools_math_raceethnicity_1617.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_raceethnicity_1617.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2016/pagr_schools_math_raceethnicity_1516.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_raceethnicity_1516.xlsx")
# Import Data -------------------------------------------------------------
math_scores_2021_2022 <-
read_excel(path = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx") |>
clean_names()
# Tidy and Clean Data -----------------------------------------------------
third_grade_math_proficiency_2021_2022 <-
math_scores_2021_2022 |>
filter(student_group == "Total Population (All Students)") |>
filter(grade_level == "Grade 3") |>
select(academic_year, school_id, contains("number_level")) |>
pivot_longer(cols = starts_with("number_level"),
names_to = "proficiency_level",
values_to = "number_of_students") |>
mutate(proficiency_level = case_when(
proficiency_level == "number_level_4" ~ "4",
proficiency_level == "number_level_3" ~ "3",
proficiency_level == "number_level_2" ~ "2",
proficiency_level == "number_level_1" ~ "1"
)) |>
mutate(number_of_students = parse_number(number_of_students))
third_grade_math_proficiency_2021_2022 |>
group_by(school_id) |>
mutate(pct = number_of_students / sum(number_of_students, na.rm = TRUE)) |>
ungroup()
Your Turn
Create a new variable called
pct
that shows each race/ethnicity as a percentage of all students in each district. This will require two steps.You'll need to use
group_by()
andsummarize()
to calculate the number of students in each race/ethnicity group in each district.You’ll need to use
group_by()
andmutate()
to calculate the percentage of students in each race/ethnicity group in each district.
Don’t forget to ungroup()
at the end of each step. Use this code to get started:
# Load Packages -----------------------------------------------------------
library(tidyverse)
library(fs)
library(readxl)
library(janitor)
# Create Directories ------------------------------------------------------
dir_create("data-raw")
# Download Data -----------------------------------------------------------
# https://www.oregon.gov/ode/reports-and-data/students/Pages/Student-Enrollment-Reports.aspx
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20222023.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20222023.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20212022.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20212022.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20202021.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20202021.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20192020.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20192020.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20182019.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20182019.xlsx")
# Import Data -------------------------------------------------------------
enrollment_2022_2023 <- read_excel(path = "data-raw/fallmembershipreport_20222023.xlsx",
sheet = "School 2022-23") |>
clean_names()
# Tidy and Clean Data -----------------------------------------------------
enrollment_by_race_ethnicity_2022_2023 <-
enrollment_2022_2023 |>
select(district_institution_id, school_institution_id,
x2022_23_american_indian_alaska_native:x2022_23_multi_racial) |>
select(-contains("percent")) |>
pivot_longer(cols = -c(district_institution_id, school_institution_id),
names_to = "race_ethnicity",
values_to = "number_of_students") |>
mutate(race_ethnicity = str_remove(race_ethnicity, pattern = "x2022_23_")) |>
mutate(race_ethnicity = case_when(
race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native",
race_ethnicity == "asian" ~ "Asian",
race_ethnicity == "black_african_american" ~ "Black/African American",
race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
race_ethnicity == "multiracial" ~ "Multi-Racial",
race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian Pacific Islander",
race_ethnicity == "white" ~ "White",
race_ethnicity == "multi_racial" ~ "Multiracial"
)) |>
mutate(number_of_students = parse_number(number_of_students))
Learn More
Daniel Carter has a nice walkthrough of using group_by()
and mutate()
.
If you forget to ungroup()
every once in a while, you’re joining an illustrious group.
ugh foiled by a missing ungroup() once again #rstats
— Andrew Heiss (🐘 @[email protected]) (@andrewheiss) November 25, 2019
When in doubt, try ungroup() #rstats
— Ben Casselman (@bencasselman) October 4, 2019
To my #rstats friends: Practice safe stats. Remember to dplyr::ungroup() after you're done with your within-group operations. pic.twitter.com/r4JblvgSjd
— Hlynur Hallgríms (@hlynur) July 19, 2018
Need a cheery reminder to use ungroup()
? Here you go!
Don't forget to bring dplyr::ungroup() to the party 🎁🥳 #rstats
— Allison Horst (@allison_horst) November 21, 2019
Thanks to @apreshill for inspiring this one! pic.twitter.com/gsf66KXJ2d
Have any questions? Put them below and we will help you out!
Course Content
44 Lessons
You need to be signed-in to comment on this post. Login.
Jorge Eduardo Baquero • April 30, 2024
The need for ungrouping using
ungroup()
disappears when usingsummarise (..., .by = variable)
. Am I right?David Keyes Founder • April 30, 2024
Yes, that's right!
Raouf Kilada • November 5, 2024
Hi again, I got the same "sum" error in the excercise: enrollment_by_race_ethnicity_2022_2023 |>
summarize()
: ℹ In argument:number_of_students = sum(number_of_students, na.rm = TRUE)
. ℹ In group 1:district_institution_id = 1894
andrace_ethnicity = "American Indian Alaska Native"
. Caused by error insum()
: ! invalid 'type' (character) of argument Runrlang::last_trace()
to see where the error occurred.suggeions?
Gracielle Higino Coach • November 5, 2024
Hi Raouf!
number_of_students
must be a numeric variable. Notice the last line of code in the Your turn section:It asks R to mutate the variable
number_of_students
to a numeric variable using the functionparse_number()
. Try your code with that and let us know if it helps!