R in 3 Months (Fall 2025)

Use Color to Highlight Findings

View code shown in video
# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)

# Create Directory --------------------------------------------------------

dir_create("data")

# Download Data -----------------------------------------------------------

# download.file("https://github.com/rfortherestofus/going-deeper-v2/raw/main/data/third_grade_math_proficiency.rds",
#               mode = "wb",
#               destfile = "data/third_grade_math_proficiency.rds")

# Import Data -------------------------------------------------------------

third_grade_math_proficiency <- 
  read_rds("data/third_grade_math_proficiency.rds") |> 
  select(academic_year, school, school_id, district, proficiency_level, number_of_students) |> 
  mutate(is_proficient = case_when(
    proficiency_level >= 3 ~ TRUE,
    .default = FALSE
  )) |> 
  group_by(academic_year, school, district, school_id, is_proficient) |> 
  summarize(number_of_students = sum(number_of_students, na.rm = TRUE)) |> 
  ungroup() |> 
  group_by(academic_year, school, district, school_id) |> 
  mutate(percent_proficient = number_of_students / sum(number_of_students, na.rm = TRUE)) |> 
  ungroup() |> 
  filter(is_proficient == TRUE) |> 
  select(academic_year, school, district, percent_proficient) |> 
  rename(year = academic_year) |> 
  mutate(percent_proficient = case_when(
    is.nan(percent_proficient) ~ NA,
    .default = percent_proficient
  ))

# Plot --------------------------------------------------------------------

top_growth_school <- 
  third_grade_math_proficiency |>
  filter(district == "Portland SD 1J") |> 
  group_by(school) |> 
  mutate(growth_from_previous_year = percent_proficient - lag(percent_proficient)) |> 
  ungroup() |> 
  drop_na(growth_from_previous_year) |>
  slice_max(order_by = growth_from_previous_year,
            n = 1) |> 
  pull(school)

third_grade_math_proficiency |>
  filter(district == "Portland SD 1J") |>
  mutate(highlight_school = case_when(
    school == top_growth_school ~ "Y",
    .default = "N"
  )) |> 
  mutate(school = fct_relevel(school, top_growth_school, after = Inf)) |>
  ggplot(aes(x = year,
             y = percent_proficient,
             group = school,
             color = highlight_school)) +
  geom_line() +
  scale_color_manual(values = c(
    "N" = "grey80",
    "Y" = "orange"
  ))

Your Turn

Highlight the district in your line chart that had the largest increase in its Hispanic/Latino population between 2021-2022 and 2022-2023.

Learn More

The Datawrapper blog has an amazing blog post by Lisa Charlotte Muth on using color effectively in data viz.

The issue that I ran into where the lines weren't visible because there were too many is called overplotting. Claus Wilke discusses overplotting in Chapter 18 of his book Fundamentals of Data Visualization (the chapter is about overplotting with points, but the concepts are the same).
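One common way to reduce overplotting in a highlighted line chart is to draw the grey context lines in one layer and the highlighted line in a second layer on top. The sketch below uses made-up data (the `toy` tibble and the `highlighted` school name are hypothetical) and is an alternative to the fct_relevel() approach used in the lesson code, not a replacement for it. It assumes ggplot2 3.4.0 or later for the linewidth argument.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data: 30 schools measured over three years
set.seed(123)
toy <- tibble(
  school = rep(paste("School", 1:30), each = 3),
  year = rep(c("2019-2020", "2020-2021", "2021-2022"), times = 30),
  percent_proficient = runif(90, min = 0.2, max = 0.8)
)

# Hypothetical school to emphasize
highlighted <- "School 7"

# Layers are drawn in the order they are added, so the orange line
# ends up on top of the grey ones
p <- ggplot(mapping = aes(x = year, y = percent_proficient, group = school)) +
  geom_line(data = filter(toy, school != highlighted), color = "grey80") +
  geom_line(data = filter(toy, school == highlighted), color = "orange", linewidth = 1)

p
```

Because each layer gets its own data, the highlighted line never disappears behind the others, no matter how the schools are ordered.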

Have any questions? Put them below and we will help you out!

Matt Newman • May 6, 2024

Really interesting set of lessons here! But I kept making dumb mistakes in my code by confusing "top_growth_hispanic" with "highlight_district." Could you walk through why both steps were needed to make these customizations work? Why create the "highlight_district" variable versus just plotting with the "top_growth_hispanic" value? Thank you!

top_growth_hispanic <-
     enrollment_by_race_ethnicity |> 
     filter(race_ethnicity == "Hispanic/Latino") |> 
     group_by(district) |> 
     mutate(hispanic_growth = pct - lag(pct)) |> 
     ungroup() |> 
     drop_na(hispanic_growth) |> 
     slice_max(order_by = hispanic_growth,
               n=1) |> 
     pull(district)

enrollment_by_race_ethnicity |> 
     filter(race_ethnicity == "Hispanic/Latino") |> 
     mutate(highlight_district = case_when(
          district == top_growth_hispanic ~ "Y",
          .default = "N"
     )) |> 

Patrick Cullen • May 6, 2025

Is the order_by and n = 1 needed here?

slice_max(order_by = growth_from_previous_year, n = 1)

This seems to have the same result:

slice_max(growth_from_previous_year)

Am I correct that this is indexing and ranking based on the max value? So if I wanted the second highest, then I could use order_by and n = 2?

Gracielle Higino Coach • May 6, 2025

Hi Patrick! David tries to be very explicit when writing code, so he often spells out function arguments even when leaving them off would produce the same result. That is what is happening here: order_by is the first argument of slice_max() after the data, so slice_max(growth_from_previous_year) matches the column to order_by by position, and n defaults to 1 when you supply neither n nor prop. Naming order_by still helps readability, because it makes clear which variable drives the ranking.

Now about the n argument: it defines the number of rows you want in the slice. n = 1 gets you one row, and n = 2 gets you the top two rows (not only the second highest). You can also use proportions instead of numbers, using the prop argument (e.g., prop = 0.5).
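To see these arguments side by side, here is a minimal sketch on made-up data (the `growth` tibble and its values are hypothetical, not the course dataset). The last step shows one possible way to get only the second-highest row, which slice_max() does not do on its own.

```r
library(dplyr)

# Hypothetical toy data, not the course dataset
growth <- tibble(
  school = c("A", "B", "C", "D"),
  growth_from_previous_year = c(0.10, 0.25, 0.05, 0.20)
)

# The named and positional forms return the same single row (school B)
explicit <- growth |> slice_max(order_by = growth_from_previous_year, n = 1)
implicit <- growth |> slice_max(growth_from_previous_year)
identical(explicit, implicit)

# n = 2 returns the top two rows (B, then D)
top_two <- growth |> slice_max(growth_from_previous_year, n = 2)

# One way to keep only the second-highest row:
# take the top two, then keep the smaller of those
second_highest <- top_two |> slice_min(growth_from_previous_year, n = 1)

# prop = 0.5 keeps the top half of the rows (2 of the 4 here)
top_half <- growth |> slice_max(growth_from_previous_year, prop = 0.5)
```

Note that slice_max() keeps ties by default (with_ties = TRUE), so n = 1 can return more than one row if several rows share the maximum value.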

Lorenzo Botto • December 9, 2025

Hi, very interesting and useful. Quick question: before using the lag function, would it be safer to order the year variable, to be sure that we compute the difference between the later year and the earlier year? Or is it redundant? Thank you

Gracielle Higino Coach • December 12, 2025

Hi Lorenzo! That's an interesting question! lag() uses whatever order the rows are in within each group, and group_by() on its own does not sort them, so it's always good to check the order when doing time series operations. In this case, adding arrange() does not change the results, most likely because the rows are already in year order within each school: the summarize() step in the import code returns rows sorted by the grouping variables, starting with academic_year. If you do include arrange() after group_by(), note that it ignores the grouping unless you set .by_group = TRUE.

Try to spot the differences in the results and in the data frame with the three code options below:

# Arranging year within groups

third_grade_math_proficiency |>
  filter(district == "Portland SD 1J") |> 
  group_by(school) |> 
  arrange(year, .by_group = TRUE) |>
  View()


# Arranging year in descending order within groups. This one actually yields a different result when we try to get the top_growth_school!

third_grade_math_proficiency |>
  filter(district == "Portland SD 1J") |> 
  group_by(school) |> 
  arrange(desc(year), .by_group = TRUE) |>
  View()

# Arranging year on the whole dataset

third_grade_math_proficiency |>
  filter(district == "Portland SD 1J") |> 
  group_by(school) |> 
  arrange(year) |>
  View()
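The effect of row order on lag() is easier to see on a tiny example where the years are deliberately out of order. This is a sketch on made-up data (the `scores` tibble is hypothetical). It also shows the order_by argument of dplyr's lag(), which is one possible way to make the calculation independent of row order.

```r
library(dplyr)

# Hypothetical toy data; school A's years are deliberately out of order
scores <- tibble(
  school = c("A", "A", "B", "B"),
  year = c("2021-2022", "2020-2021", "2020-2021", "2021-2022"),
  percent_proficient = c(0.50, 0.40, 0.30, 0.45)
)

# Without sorting, lag() follows the rows as they come,
# so school A appears to drop by 0.10
unsorted <- scores |>
  group_by(school) |>
  mutate(growth = percent_proficient - lag(percent_proficient)) |>
  ungroup()

# Sorting by year first gives the intended later-minus-earlier difference
sorted <- scores |>
  arrange(school, year) |>
  group_by(school) |>
  mutate(growth = percent_proficient - lag(percent_proficient)) |>
  ungroup()

# Alternatively, tell lag() which variable defines the order;
# the rows stay where they were
with_order_by <- scores |>
  group_by(school) |>
  mutate(growth = percent_proficient - lag(percent_proficient, order_by = year)) |>
  ungroup()
```

Comparing the growth column across the three results shows that only the unsorted version reports a decline for school A.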

Let us know what other questions you have!