Going Deeper with R
Advanced Summarizing
This lesson is called Advanced Summarizing, part of the Going Deeper with R course.
Heads up! I’m using the slice_max() function in this lesson, which only exists in newer versions of the dplyr package (1.0.0 and later). To update, type install.packages("dplyr"). Running install.packages("tidyverse") installs any tidyverse packages you are missing, but it may not upgrade a copy of dplyr that is already installed.
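If you are not sure whether your installed dplyr is recent enough, a quick check along these lines may help. This is only a sketch: it assumes dplyr is already installed, and the variable names are made up for illustration.

```r
# Look up the installed dplyr version (errors if dplyr is not installed)
installed <- utils::packageVersion("dplyr")

# slice_max() was added in dplyr 1.0.0
has_slice_max <- installed >= "1.0.0"

# A second check that does not rely on version numbers:
# is slice_max among the functions dplyr exports?
exports_slice_max <- "slice_max" %in% getNamespaceExports("dplyr")

print(installed)
print(has_slice_max)
print(exports_slice_max)
```

If either check prints FALSE, updating dplyr should make slice_max() available.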
Your Turn
Create a new variable called pct that shows each race/ethnicity as a percentage of all students in each district.
You’ll need to use group_by() and mutate().
Don’t forget to ungroup() at the end!
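The group_by(), mutate(), ungroup() pattern described above can be sketched on a small made-up data frame. The column names match the ones used in the comments on this lesson, but the numbers are hypothetical and the lesson's real data has more districts and groups.

```r
library(dplyr)

# Hypothetical toy data, not the lesson's real enrollment file
enrollment <- tibble(
  district_id = c(1, 1, 1, 2, 2),
  race_ethnicity = c("Asian", "Hispanic/Latino", "White", "Asian", "White"),
  number_of_students = c(10, 30, 60, 5, 15)
)

enrollment_pct <- enrollment %>%
  # Do the calculation separately within each district
  group_by(district_id) %>%
  # sum() is computed per district, so pct adds up to 1 within each district
  mutate(pct = number_of_students / sum(number_of_students)) %>%
  # Drop the grouping so later steps operate on the whole data frame
  ungroup()

print(enrollment_pct)
```

Because mutate() keeps every row, the result still has one row per district and race/ethnicity combination, now with a pct column alongside the counts.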
Learn More
Daniel Carter has a nice walkthrough of using group_by() and mutate().
If you forget to ungroup() every once in a while, you’re joining an illustrious group.
ugh foiled by a missing ungroup() once again #rstats
— Andrew Heiss (@andrewheiss) November 25, 2019
When in doubt, try ungroup() #rstats
— Ben Casselman (@bencasselman) October 4, 2019
To my #rstats friends: Practice safe stats. Remember to dplyr::ungroup() after you're done with your within-group operations. pic.twitter.com/r4JblvgSjd
— Hlynur Hallgríms (@hlynur) July 19, 2018
Need a cheery reminder to use ungroup()? Here you go!
Don't forget to bring dplyr::ungroup() to the party 🎁🥳 #rstats
Thanks to @apreshill for inspiring this one! pic.twitter.com/gsf66KXJ2d
— Allison Horst (@allison_horst) November 21, 2019
Atlang Mompe
April 18, 2021
Hi David, I still get the error that race_ethnicity is not found.
I am running your code, so I may be doing something wrong: everything works until I try to mutate race_ethnicity using str_remove(). I also noticed that x2018_2019 may need a closing quotation mark. Your code currently has only the opening one: mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_2019)) %>%
I tried to use your code (see below), and I am not sure what is going on, as the race_ethnicity object is not found:
    enrollment_by_race_ethnicity_18_19 % select(-contains("grade")) %>%
      select(-contains("kindergarten")) %>%
      select(-contains("percent")) %>%
      pivot_longer(cols = -district_id,
                   names_to = "race_ethnicity",
                   values_to = "number_of_students") %>%
      mutate(number_of_students = na_if(number_of_students, "-")) %>%
      mutate(number_of_students = replace_na(number_of_students, 0)) %>%
      mutate(number_of_students = as.numeric(number_of_students)) %>%
      mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_2019"))
      mutate(race_ethnicity = case_when(
        race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native",
        race_ethnicity == "asian" ~ "Asian",
        race_ethnicity == "black_african_american" ~ "Black/African American",
        race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
        race_ethnicity == "multiracial" ~ "Multi-Racial",
        race_ethnicity == "native_hawaiian_pacific_islander" ~ "Pacific Islander",
        race_ethnicity == "white" ~ "White"
      )) %>%
      group_by(district_id) %>%
      mutate(pct = number_of_students / sum(number_of_students)) %>%
      ungroup()
Thanks, Atty
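One possible cause, judging only from the code as posted: there is no %>% between the str_remove() step and the case_when() step. When mutate() is called without a data frame piped into it, R looks for a standalone object named race_ethnicity and reports that it cannot be found. The sketch below reproduces that on hypothetical toy data. It is a guess at the problem, not a confirmed diagnosis.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for the pivoted enrollment data
toy <- tibble(race_ethnicity = c("x2018_2019_asian", "x2018_2019_white"))

# Broken: nothing is piped into mutate(), so it never receives a data frame
# and R searches for an object called race_ethnicity instead of a column
broken <- tryCatch(
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_2019_")),
  error = function(e) e
)
print(conditionMessage(broken))

# Fixed: the data frame is piped in, so race_ethnicity refers to the column
fixed <- toy %>%
  mutate(race_ethnicity = str_remove(race_ethnicity, "x2018_2019_"))
print(fixed)
```

If this is the issue, adding %>% at the end of the str_remove() line should let the rest of the pipeline run.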
Allison Brenner
April 21, 2021
I have a question about the "My Turn" example. You say that you use summarize() in the first part (instead of mutate()) because you aren't actually adding a new variable. I'm still having trouble understanding this conceptually and, more broadly, the difference between mutate() and summarize() for functions that can be called in both. I know you aren't adding a new variable in your example, but you are overwriting/replacing one. Could you do the same by using summarize() to create a total in a new variable? Would that not add to the data frame?
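A small side-by-side comparison may help with the question above. In this sketch (hypothetical data and object names), the same sum() call is used in both functions: summarize() collapses each group to a single row, while mutate() keeps every original row and repeats the group total on each one.

```r
library(dplyr)

# Hypothetical toy data
students <- tibble(
  district_id = c(1, 1, 2, 2, 2),
  number_of_students = c(10, 30, 5, 15, 30)
)

# summarize(): returns one row per district, with only the grouping
# column and the new summary column
totals <- students %>%
  group_by(district_id) %>%
  summarize(total = sum(number_of_students))

# mutate(): returns all five original rows, with the district total
# added as a new column and repeated within each district
with_totals <- students %>%
  group_by(district_id) %>%
  mutate(total = sum(number_of_students)) %>%
  ungroup()

print(totals)
print(with_totals)
```

So summarize() does create a new variable, but it returns a new, smaller data frame instead of adding a column to the existing rows.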
Abby Isaacson
April 22, 2021
I had several of these problems happen for me too, including console processing code in blue but nothing happening in the data frame to change names (several tries). At one point when I ran David's solution code, all race_ethnicity turned to NA in the entire data frame. I still don't know exactly what happened to make it work, but I do have one small question:
Do the underscores (_) matter in the removing/renaming? For example, with the variable "x2018-19_asian", if I remove x2018-19 (not x2018-19_), but then only mutate "asian" (not _asian), what happens to that underscore?
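A quick experiment suggests the underscore does matter. The sketch below uses hypothetical names in the style discussed in this thread: str_remove() deletes only the exact pattern it is given, so a leftover underscore stays in the string, and a later case_when() that tests for "asian" then finds no match and returns NA. That would be consistent with the all-NA result described above, though it is not confirmed as the cause.

```r
library(dplyr)
library(stringr)

# Hypothetical column names, not taken from the real data file
nms <- c("x2018_2019_asian", "x2018_2019_white")

# Pattern without the trailing underscore: the underscore is left behind
without_underscore <- str_remove(nms, "x2018_2019")
print(without_underscore)

# Pattern with the trailing underscore: only the group name remains
with_underscore <- str_remove(nms, "x2018_2019_")
print(with_underscore)

# case_when() needs an exact match, so "_asian" does not equal "asian"
# and every unmatched value becomes NA
recoded <- case_when(
  without_underscore == "asian" ~ "Asian",
  without_underscore == "white" ~ "White"
)
print(recoded)
```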
Vuk Sekicki
April 26, 2021
Is there a difference between slice_max() and top_n() ?
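For a basic "keep the largest rows" task the two functions return the same rows. top_n() has been superseded by slice_max() and slice_min() since dplyr 1.0.0, and the main practical difference is the argument naming: slice_max() takes the ranking column as order_by, while top_n() calls it wt. A minimal sketch on hypothetical data:

```r
library(dplyr)

# Hypothetical toy data
enrollment <- tibble(
  race_ethnicity = c("Asian", "Hispanic/Latino", "White", "Multi-Racial"),
  number_of_students = c(10, 30, 60, 20)
)

# Newer interface: the two rows with the largest number_of_students
top_slice <- enrollment %>%
  slice_max(order_by = number_of_students, n = 2)

# Older interface: same idea, with the ranking column passed as wt
top_old <- enrollment %>%
  top_n(n = 2, wt = number_of_students)

print(top_slice)
print(top_old)
```

For the smallest rows, slice_min() is the counterpart, whereas top_n() relies on a negative n.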
David Keyes Founder
October 27, 2021
Huh, very strange. We can definitely discuss more this week!