Advanced Summarizing

This lesson is called Advanced Summarizing, part of the R in 3 Months (Fall 2025) course. This lesson is called Advanced Summarizing, part of the R in 3 Months (Fall 2025) course.

Transcript

Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.

View code shown in video

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)
library(readxl)
library(janitor)

# Create Directories ------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# https://www.oregon.gov/ode/educator-resources/assessment/Pages/Assessment-Group-Reports.aspx

# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/pagr_schools_math_tot_raceethnicity_2122.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/Documents/TestResults2122/TestResults2019/pagr_schools_math_tot_raceethnicity_1819.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_tot_raceethnicity_1819.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2018/pagr_schools_math_raceethnicity_1718.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1718.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2017/pagr_schools_math_raceethnicity_1617.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1617.xlsx")
# 
# download.file("https://www.oregon.gov/ode/educator-resources/assessment/TestResults2016/pagr_schools_math_raceethnicity_1516.xlsx",
#               mode = "wb",
#               destfile = "data-raw/pagr_schools_math_raceethnicity_1516.xlsx")


# Import Data -------------------------------------------------------------

math_scores_2021_2022 <-
  read_excel(path = "data-raw/pagr_schools_math_tot_raceethnicity_2122.xlsx") |> 
  clean_names()


# Tidy and Clean Data -----------------------------------------------------

third_grade_math_proficiency_2021_2022 <-
  math_scores_2021_2022 |> 
  filter(student_group == "Total Population (All Students)") |> 
  filter(grade_level == "Grade 3") |> 
  select(academic_year, school_id, contains("number_level")) |> 
  pivot_longer(cols = starts_with("number_level"),
               names_to = "proficiency_level",
               values_to = "number_of_students") |> 
  mutate(proficiency_level = case_when(
    proficiency_level == "number_level_4" ~ "4",
    proficiency_level == "number_level_3" ~ "3",
    proficiency_level == "number_level_2" ~ "2",
    proficiency_level == "number_level_1" ~ "1"
  )) |> 
  mutate(number_of_students = parse_number(number_of_students))

third_grade_math_proficiency_2021_2022 |> 
  group_by(school_id) |> 
  mutate(pct = number_of_students / sum(number_of_students, na.rm = TRUE)) |> 
  ungroup()

Your Turn

Solution

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)
library(readxl)
library(janitor)

# Create Directories ------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# https://www.oregon.gov/ode/reports-and-data/students/Pages/Student-Enrollment-Reports.aspx

# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20222023.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20222023.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20212022.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20212022.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20202021.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20202021.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20192020.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20192020.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20182019.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20182019.xlsx")

# Import Data -------------------------------------------------------------

enrollment_2022_2023 <- read_excel(path = "data-raw/fallmembershipreport_20222023.xlsx",
                                   sheet = "School 2022-23") |> 
  clean_names()

# Tidy and Clean Data -----------------------------------------------------

enrollment_by_race_ethnicity_2022_2023 <-
  enrollment_2022_2023 |> 
  select(district_institution_id, x2022_23_american_indian_alaska_native:x2022_23_percent_multi_racial) |>  
  select(-contains("percent")) |> 
  pivot_longer(cols = -district_institution_id,
               names_to = "race_ethnicity",
               values_to = "number_of_students") |> 
  mutate(race_ethnicity = str_remove(race_ethnicity, pattern = "x2022_23_")) |> 
  mutate(race_ethnicity = case_when(
    race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native",
    race_ethnicity == "asian" ~ "Asian",
    race_ethnicity == "black_african_american" ~ "Black/African American",
    race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
    race_ethnicity == "multi_racial" ~ "Multiracial",
    race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian Pacific Islander",
    race_ethnicity == "white" ~ "White"
  )) |> 
  mutate(number_of_students = parse_number(number_of_students))

enrollment_by_race_ethnicity_2022_2023 |> 
  group_by(district_institution_id, race_ethnicity) |>
  summarize(number_of_students = sum(number_of_students, na.rm = TRUE)) |> 
  ungroup() |> 
  group_by(district_institution_id) |>
  mutate(pct = number_of_students / sum(number_of_students, na.rm = TRUE)) |> 
  ungroup()

Create a new variable called pct that shows each race/ethnicity as a percentage of all students in each district. This will require two steps.
You'll need to use group_by() and summarize() to calculate the number of students in each race/ethnicity group in each district.
You’ll need to use group_by() and mutate() to calculate the percentage of students in each race/ethnicity group in each district.

Don’t forget to ungroup() at the end of each step. Use this code to get started:

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)
library(readxl)
library(janitor)

# Create Directories ------------------------------------------------------

dir_create("data-raw")

# Download Data -----------------------------------------------------------

# https://www.oregon.gov/ode/reports-and-data/students/Pages/Student-Enrollment-Reports.aspx

# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20222023.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20222023.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20212022.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20212022.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20202021.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20202021.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20192020.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20192020.xlsx")
# 
# download.file("https://www.oregon.gov/ode/reports-and-data/students/Documents/fallmembershipreport_20182019.xlsx",
#               mode = "wb",
#               destfile = "data-raw/fallmembershipreport_20182019.xlsx")

# Import Data -------------------------------------------------------------

enrollment_2022_2023 <- read_excel(path = "data-raw/fallmembershipreport_20222023.xlsx",
                                   sheet = "School 2022-23") |> 
  clean_names()

# Tidy and Clean Data -----------------------------------------------------

enrollment_by_race_ethnicity_2022_2023 <-
  enrollment_2022_2023 |> 
  select(district_institution_id, school_institution_id,
         x2022_23_american_indian_alaska_native:x2022_23_multi_racial) |> 
  select(-contains("percent")) |> 
  pivot_longer(cols = -c(district_institution_id, school_institution_id),
               names_to = "race_ethnicity",
               values_to = "number_of_students") |> 
  mutate(race_ethnicity = str_remove(race_ethnicity, pattern = "x2022_23_")) |> 
  mutate(race_ethnicity = case_when(
    race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native",
    race_ethnicity == "asian" ~ "Asian",
    race_ethnicity == "black_african_american" ~ "Black/African American",
    race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
    race_ethnicity == "multiracial" ~ "Multi-Racial",
    race_ethnicity == "native_hawaiian_pacific_islander" ~ "Native Hawaiian Pacific Islander",
    race_ethnicity == "white" ~ "White",
    race_ethnicity == "multi_racial" ~ "Multiracial"
  )) |> 
  mutate(number_of_students = parse_number(number_of_students))

Learn More

Daniel Carter has a nice walkthrough of using group_by() and mutate().

If you forget to ungroup() every once in a while, you’re joining an illustrious group.

ugh foiled by a missing ungroup() once again #rstats
— Andrew Heiss (🐘 @[email protected]) (@andrewheiss) November 25, 2019

When in doubt, try ungroup() #rstats
— Ben Casselman (@bencasselman) October 4, 2019

To my #rstats friends: Practice safe stats. Remember to dplyr::ungroup() after you're done with your within-group operations. pic.twitter.com/r4JblvgSjd
— Hlynur Hallgríms (@hlynur) July 19, 2018

Need a cheery reminder to use ungroup()? Here you go!

Don't forget to bring dplyr::ungroup() to the party 🎁🥳 #rstats

Thanks to @apreshill for inspiring this one! pic.twitter.com/gsf66KXJ2d
— Allison Horst (@allison_horst) November 21, 2019

Have any questions? Put them below and we will help you out!

You need to be signed-in to comment on this post. Login.

Jorge Eduardo Baquero • April 30, 2024

The need for ungrouping using ungroup() disappears when using summarise (..., .by = variable). Am I right?

David Keyes Founder • April 30, 2024

Yes, that's right!

Raouf Kilada • November 5, 2024

Hi again, I got the same "sum" error in the excercise: enrollment_by_race_ethnicity_2022_2023 |>

group_by(district_institution_id, race_ethnicity) |>
summarize(number_of_students = sum(number_of_students, na.rm = TRUE)) Error in summarize(): ℹ In argument: number_of_students = sum(number_of_students, na.rm = TRUE). ℹ In group 1: district_institution_id = 1894 and race_ethnicity = "American Indian Alaska Native". Caused by error in sum(): ! invalid 'type' (character) of argument Run rlang::last_trace() to see where the error occurred.

suggeions?

Gracielle Higino Coach • November 5, 2024

Hi Raouf! number_of_students must be a numeric variable. Notice the last line of code in the Your turn section:

mutate(number_of_students = parse_number(number_of_students))

It asks R to mutate the variable number_of_students to a numeric variable using the function parse_number(). Try your code with that and let us know if it helps!

Mike LeVan • April 21, 2025

Picky mathematician coming in - shouldn't we call these "proportions" and not "percentages"? We could multiply by 100 to turn them into percentages. Just saying as someone that has seen several students get confused by the difference.

David Keyes Founder • April 21, 2025

Oh man, caught out by the mathematician! Yes, you're probably right, but hopefully the meaning is still clear. :)

Kaela Scott • November 7, 2025

I'm struggling to understanding the point of ungroup for the exercise. If I do not include the two ungroup() lines, the calculations are the same. Does using group() a second time "override" the first group? Thanks!

Gracielle Higino Coach • November 11, 2025

Hi Kaela! Using group() a second time does not "override" the first group, but using ungroup() does! So in this case you are telling R do calculate something based on a group, then you ask R to undo the grouping and regroup the data based on other conditions, then calculating something based on this new group.

However, summarise() "drops" one group on your final tibble, and in some cases (especially when you have only one group) this may lead to the same result to when you don't use ungroup() before the next group() - but because that's not always the case, you need to ungroup every time you group.

Notice that, in this particular case, at the end of the first summarise you get a group for district_institution_id - you can see that if you run just the first part of the pipeline:

enrollment_by_race_ethnicity_2022_2023 |> 
  group_by(district_institution_id, race_ethnicity) |>
  summarize(number_of_students = sum(number_of_students, na.rm = TRUE))

You get this on the console:

# A tibble: 1,470 × 3
# Groups:   district_institution_id [210]

Your next step is to ungroup, and then group by district_institution_id again - that's why this first ungroup does not make a difference in this case. If you wanted to do the second summary based on another grouping variable, you'd get different results.

Finally, the last ungroup is in place to guarantee that the final tibble is not grouped at all for future use. If you don't ungroup, you might use that object for other summaries without realizing you're calculating things based on a group that was designated back when you created the object.

Let me know if it's still not clear!