Advanced Summarizing
This lesson is called Advanced Summarizing, part of the R in 3 Months (Fall 2025) course. This lesson is called Advanced Summarizing, part of the R in 3 Months (Fall 2025) course.
Transcript
Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.
Loading transcript...
View code shown in video
# Load Packages -----------------------------------------------------------
library(tidyverse)
library(fs)
library(readxl)
library(janitor)
# Create Directories ------------------------------------------------------
dir_create("data-raw")
# Download Data -----------------------------------------------------------
# https://www.oregon.gov/ode/educator-resources/assessment/Pages/Assessment-Group-Reports.aspx
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/pagr_schools_math_tot_raceethnicity_2122.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/TestResults2019/pagr_schools_math_tot_raceethnicity_1819.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_tot_raceethnicity_1819.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2018/pagr_schools_math_raceethnicity_1718.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_raceethnicity_1718.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2017/pagr_schools_math_raceethnicity_1617.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_raceethnicity_1617.xlsx")
#
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2016/pagr_schools_math_raceethnicity_1516.xlsx",
# mode = "wb",
# destfile = "data-raw/pagr_schools_math_raceethnicity_1516.xlsx")
# Import Data -------------------------------------------------------------
math_scores_2021_2022 <-
read_excel(path = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx") |>
clean_names()
# Tidy and Clean Data -----------------------------------------------------
third_grade_math_proficiency_2021_2022 <-
math_scores_2021_2022 |>
filter(student_group == "Total Population (All Students)") |>
filter(grade_level == "Grade 3") |>
select(academic_year, school_id, contains("number_level")) |>
pivot_longer(cols = starts_with("number_level"),
names_to = "proficiency_level",
values_to = "number_of_students") |>
mutate(proficiency_level = case_when(
proficiency_level == "number_level_4" ~ "4",
proficiency_level == "number_level_3" ~ "3",
proficiency_level == "number_level_2" ~ "2",
proficiency_level == "number_level_1" ~ "1"
)) |>
mutate(number_of_students = parse_number(number_of_students))
third_grade_math_proficiency_2021_2022 |>
group_by(school_id) |>
mutate(pct = number_of_students / sum(number_of_students, na.rm = TRUE)) |>
ungroup()
Your Turn
Create a new variable called
pctthat shows each race/ethnicity as a percentage of all students in each district. This will require two steps.You'll need to use
group_by()andsummarize()to calculate the number of students in each race/ethnicity group in each district.You’ll need to use
group_by()andmutate()to calculate the percentage of students in each race/ethnicity group in each district.
Don’t forget to ungroup() at the end of each step. Use this code to get started:
# Load Packages -----------------------------------------------------------
library(tidyverse)
library(fs)
library(readxl)
library(janitor)
# Create Directories ------------------------------------------------------
dir_create("data-raw")
# Download Data -----------------------------------------------------------
# https://www.oregon.gov/ode/reports-and-data/students/Pages/Student-Enrollment-Reports.aspx
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20222023.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20222023.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20212022.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20212022.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20202021.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20202021.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20192020.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20192020.xlsx")
#
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20182019.xlsx",
# mode = "wb",
# destfile = "data-raw/fallmembershipreport_20182019.xlsx")
# Import Data -------------------------------------------------------------
enrollment_2022_2023 <- read_excel(path = "data-raw/fallmembershipreport_20222023.xlsx",
sheet = "School 2022-23") |>
clean_names()
# Tidy and Clean Data -----------------------------------------------------
enrollment_by_race_ethnicity_2022_2023 <-
enrollment_2022_2023 |>
select(district_institution_id, school_institution_id,
x2022_23_american_indian_alaska_native:x2022_23_multi_racial) |>
select(-contains("percent")) |>
pivot_longer(cols = -c(district_institution_id, school_institution_id),
names_to = "race_ethnicity",
values_to = "number_of_students") |>
mutate(race_ethnicity = str_remove(race_ethnicity, pattern = "x2022_23_")) |>
mutate(race_ethnicity = case_when(
race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native",
race_ethnicity == "asian" ~ "Asian",
race_ethnicity == "black_african_american" ~ "Black/African American",
race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
race_ethnicity == "multiracial" ~ "Multi-Racial",
race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian Pacific Islander",
race_ethnicity == "white" ~ "White",
race_ethnicity == "multi_racial" ~ "Multiracial"
)) |>
mutate(number_of_students = parse_number(number_of_students))
Learn More
Daniel Carter has a nice walkthrough of using group_by() and mutate().
If you forget to ungroup() every once in a while, you’re joining an illustrious group.
ugh foiled by a missing ungroup() once again #rstats
— Andrew Heiss (🐘 @[email protected]) (@andrewheiss) November 25, 2019
When in doubt, try ungroup() #rstats
— Ben Casselman (@bencasselman) October 4, 2019
To my #rstats friends: Practice safe stats. Remember to dplyr::ungroup() after you're done with your within-group operations. pic.twitter.com/r4JblvgSjd
— Hlynur Hallgríms (@hlynur) July 19, 2018
Need a cheery reminder to use ungroup()? Here you go!
Don't forget to bring dplyr::ungroup() to the party 🎁🥳 #rstats
— Allison Horst (@allison_horst) November 21, 2019
Thanks to @apreshill for inspiring this one! pic.twitter.com/gsf66KXJ2d
Have any questions? Put them below and we will help you out!
Course Content
128 Lessons
You need to be signed-in to comment on this post. Login.
Jorge Eduardo Baquero • April 30, 2024
The need for ungrouping using
ungroup()disappears when usingsummarise (..., .by = variable). Am I right?David Keyes Founder • April 30, 2024
Yes, that's right!
Raouf Kilada • November 5, 2024
Hi again, I got the same "sum" error in the excercise: enrollment_by_race_ethnicity_2022_2023 |>
summarize(): ℹ In argument:number_of_students = sum(number_of_students, na.rm = TRUE). ℹ In group 1:district_institution_id = 1894andrace_ethnicity = "American Indian Alaska Native". Caused by error insum(): ! invalid 'type' (character) of argument Runrlang::last_trace()to see where the error occurred.suggeions?
Gracielle Higino Coach • November 5, 2024
Hi Raouf!
number_of_studentsmust be a numeric variable. Notice the last line of code in the Your turn section:It asks R to mutate the variable
number_of_studentsto a numeric variable using the functionparse_number(). Try your code with that and let us know if it helps!Mike LeVan • April 21, 2025
Picky mathematician coming in - shouldn't we call these "proportions" and not "percentages"? We could multiply by 100 to turn them into percentages. Just saying as someone that has seen several students get confused by the difference.
David Keyes Founder • April 21, 2025
Oh man, caught out by the mathematician! Yes, you're probably right, but hopefully the meaning is still clear. :)
Kaela Scott • November 7, 2025
I'm struggling to understanding the point of ungroup for the exercise. If I do not include the two ungroup() lines, the calculations are the same. Does using group() a second time "override" the first group? Thanks!
Gracielle Higino Coach • November 11, 2025
Hi Kaela! Using
group()a second time does not "override" the first group, but usingungroup()does! So in this case you are telling R do calculate something based on a group, then you ask R to undo the grouping and regroup the data based on other conditions, then calculating something based on this new group.However,
summarise()"drops" one group on your final tibble, and in some cases (especially when you have only one group) this may lead to the same result to when you don't useungroup()before the nextgroup()- but because that's not always the case, you need to ungroup every time you group.Notice that, in this particular case, at the end of the first summarise you get a group for
district_institution_id- you can see that if you run just the first part of the pipeline:You get this on the console:
Your next step is to ungroup, and then group by
district_institution_idagain - that's why this first ungroup does not make a difference in this case. If you wanted to do the second summary based on another grouping variable, you'd get different results.Finally, the last ungroup is in place to guarantee that the final tibble is not grouped at all for future use. If you don't ungroup, you might use that object for other summaries without realizing you're calculating things based on a group that was designated back when you created the object.
Let me know if it's still not clear!