Use Color to Highlight Findings

This lesson is called Use Color to Highlight Findings, part of the R in 3 Months (Fall 2025) course.

View code shown in video

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)

# Create Directory --------------------------------------------------------

dir_create("data")

# Download Data -----------------------------------------------------------

# download.file("https://github.com/rfortherestofus/going-deeper-v2/raw/main/data/third_grade_math_proficiency.rds",
#               mode = "wb",
#               destfile = "data/third_grade_math_proficiency.rds")

# Import Data -------------------------------------------------------------

third_grade_math_proficiency <- 
  read_rds("data/third_grade_math_proficiency.rds") |> 
  select(academic_year, school, school_id, district, proficiency_level, number_of_students) |> 
  mutate(is_proficient = case_when(
    proficiency_level >= 3 ~ TRUE,
    .default = FALSE
  )) |> 
  group_by(academic_year, school, district, school_id, is_proficient) |> 
  summarize(number_of_students = sum(number_of_students, na.rm = TRUE)) |> 
  ungroup() |> 
  group_by(academic_year, school, district, school_id) |> 
  mutate(percent_proficient = number_of_students / sum(number_of_students, na.rm = TRUE)) |> 
  ungroup() |> 
  filter(is_proficient == TRUE) |> 
  select(academic_year, school, district, percent_proficient) |> 
  rename(year = academic_year) |> 
  mutate(percent_proficient = case_when(
    is.nan(percent_proficient) ~ NA,
    .default = percent_proficient
  ))

# Plot --------------------------------------------------------------------

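# Find the Portland school with the largest year-over-year gain in proficiency.
# lag() works on the current row order within each school.
top_growth_school <- 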
  third_grade_math_proficiency |>
  filter(district == "Portland SD 1J") |> 
  group_by(school) |> 
  mutate(growth_from_previous_year = percent_proficient - lag(percent_proficient)) |> 
  ungroup() |> 
  drop_na(growth_from_previous_year) |>
  slice_max(order_by = growth_from_previous_year,
            n = 1) |> 
  pull(school)

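# Flag the top-growth school, move it to the last factor level so it is drawn
# on top, and color it orange against grey lines for every other school
third_grade_math_proficiency |>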
  filter(district == "Portland SD 1J") |>
  mutate(highlight_school = case_when(
    school == top_growth_school ~ "Y",
    .default = "N"
  )) |> 
  mutate(school = fct_relevel(school, top_growth_school, after = Inf)) |>
  ggplot(aes(x = year,
             y = percent_proficient,
             group = school,
             color = highlight_school)) +
  geom_line() +
  scale_color_manual(values = c(
    "N" = "grey80",
    "Y" = "orange"
  ))
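
The fct_relevel() step in the code above is easy to overlook. Here is a minimal sketch of what it does to the factor levels, using made-up school names (Alder, Birch, and Cedar are not in the lesson's data). As far as I understand, ggplot2 draws grouped lines in the order of the grouping factor's levels, so moving the highlighted school to the last level should put its line on top of the grey ones.

```r
library(forcats)

# Made-up school names, purely for illustration
school <- c("Alder", "Birch", "Cedar", "Alder", "Birch", "Cedar")
top_growth_school <- "Birch"

# fct_relevel() accepts a character vector and treats it as a factor with
# alphabetical levels; after = Inf moves the named level to the end
school_releveled <- fct_relevel(school, top_growth_school, after = Inf)

# The highlighted school is now the last level
levels(school_releveled)
```

Without this step the highlighted line keeps its alphabetical position and can end up hidden underneath grey lines that are drawn after it.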

Your Turn

Solution

# Load Packages -----------------------------------------------------------

library(tidyverse)
library(fs)

# Create Directory --------------------------------------------------------

dir_create("data")

# Download Data -----------------------------------------------------------

# download.file("https://github.com/rfortherestofus/going-deeper-v2/raw/main/data/enrollment_by_race_ethnicity.rds",
#               mode = "wb",
#               destfile = "data/enrollment_by_race_ethnicity.rds")

# Import Data -------------------------------------------------------------

enrollment_by_race_ethnicity <-
  read_rds("data/enrollment_by_race_ethnicity.rds") |> 
  select(-district_institution_id)  |> 
  select(year, district, everything()) |> 
  mutate(year = case_when(
    year == "School 2021-22" ~ "2021-2022",
    year == "School 2022-23" ~ "2022-2023",
  ))

# Plot --------------------------------------------------------------------

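# Find the district with the largest increase in Hispanic/Latino share.
# lag() works on the current row order within each district.
top_growth_district <- 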
  enrollment_by_race_ethnicity |> 
  filter(race_ethnicity == "Hispanic/Latino") |> 
  group_by(district) |> 
  mutate(growth_from_previous_year = pct - lag(pct)) |> 
  ungroup() |> 
  drop_na(growth_from_previous_year) |>
  slice_max(order_by = growth_from_previous_year,
            n = 1) |> 
  pull(district)

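# Flag the top-growth district, move it to the last factor level so it is
# drawn on top, and color it orange against grey lines for every other district
enrollment_by_race_ethnicity |> 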
  filter(race_ethnicity == "Hispanic/Latino") |> 
  mutate(highlight_district = case_when(
    district == top_growth_district ~ "Y",
    .default = "N"
  )) |> 
  mutate(district = fct_relevel(district, top_growth_district, after = Inf)) |>
  ggplot(aes(x = year, 
             y = pct,
             group = district,
             color = highlight_district)) +
  geom_line() +
  scale_color_manual(values = c(
    "N" = "grey80",
    "Y" = "orange"
  ))

The task for this Your Turn: highlight the district in your line chart that had the largest increase in its Hispanic/Latino population between 2021-2022 and 2022-2023.

Learn More

The Datawrapper blog has an amazing post by Lisa Charlotte Muth on using color effectively in data viz.

The issue that I ran into where the lines weren't visible because there were too many is called overplotting. Claus Wilke discusses overplotting in Chapter 18 of his book Fundamentals of Data Visualization (the chapter is about overplotting with points, but the concepts are the same).

Have any questions? Put them below and we will help you out!

Matt Newman • May 6, 2024

Really interesting set of lessons here! But I kept making dumb mistakes in my code by confusing "top_growth_hispanic" with "highlight_district." Could you walk through why both steps were needed to make these customizations work? Why create the "highlight_district" variable versus just plotting with the "top_growth_hispanic" value? Thank you!

top_growth_hispanic <-
  enrollment_by_race_ethnicity |> 
  filter(race_ethnicity == "Hispanic/Latino") |> 
  group_by(district) |> 
  mutate(hispanic_growth = pct - lag(pct)) |> 
  ungroup() |> 
  drop_na(hispanic_growth) |> 
  slice_max(order_by = hispanic_growth,
            n = 1) |> 
  pull(district)

enrollment_by_race_ethnicity |> 
  filter(race_ethnicity == "Hispanic/Latino") |> 
  mutate(highlight_district = case_when(
    district == top_growth_hispanic ~ "Y",
    .default = "N"
  ))
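
A minimal sketch of why both objects are needed, using a made-up data frame (the districts A and B and their pct values are invented, and the names toy, top_growth, and flagged are hypothetical). After pull(), the first object is just a single character string naming one district. ggplot2 maps aesthetics from columns of the data, so it needs a value for every row. The highlight column is that per-row value, built by comparing each row's district against the string.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up data: two districts, two years each
toy <- tribble(
  ~district, ~year,       ~pct,
  "A",       "2021-2022", 0.10,
  "A",       "2022-2023", 0.12,
  "B",       "2021-2022", 0.20,
  "B",       "2022-2023", 0.30
)

# Step 1: reduce the data to the NAME of the top-growth district.
# The result is a length-one character vector, not a column.
top_growth <-
  toy |>
  group_by(district) |>
  mutate(growth = pct - lag(pct)) |>
  ungroup() |>
  drop_na(growth) |>
  slice_max(order_by = growth, n = 1) |>
  pull(district)

# Step 2: turn that single name into a per-row flag that can be
# mapped to color inside aes()
flagged <-
  toy |>
  mutate(highlight_district = case_when(
    district == top_growth ~ "Y",
    .default = "N"
  ))

top_growth
flagged
```

In this sketch top_growth holds one value while flagged has a "Y" or "N" on every row, which is the shape aes(color = highlight_district) needs.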

Patrick Cullen • May 6, 2025

Is the order_by and n = 1 needed here?

slice_max(order_by = growth_from_previous_year, n = 1)

This seems to have the same result:

slice_max(growth_from_previous_year)

Am I correct that this is indexing and ranking based on the max value? So if I wanted the second highest, then I could use order_by and n = 2?

Gracielle Higino Coach • May 6, 2025

Hi Patrick! David always tries to be very explicit when writing code, so he often names function arguments even when omitting them would produce the same result. Here, order_by is the first argument of slice_max() after the data, so slice_max(growth_from_previous_year) passes the same column by position, and n defaults to 1 when neither n nor prop is given. That is why both versions return the same row. Naming order_by is still a good habit, because it makes clear to anyone reading the code which variable drives the ranking.

As for the n argument, it sets the number of rows you want in the slice: n = 1 gets you one row, and n = 2 gets you the top two rows (not only the second highest). You can also use proportions instead of numbers, using the prop argument (e.g., prop = 0.5).
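
The points above can be sketched on a small made-up table (the school names and growth values are invented for illustration). The sketch compares the named and positional calls, then shows one possible way to keep only the second-highest row, by taking the top two and keeping row 2. It assumes there are no ties in the ranking column.

```r
library(dplyr)
library(tibble)

# Made-up growth values, no ties
growth <- tribble(
  ~school, ~growth_from_previous_year,
  "Alder", 0.05,
  "Birch", 0.12,
  "Cedar", 0.08
)

# Named and positional calls pick the same row
named      <- slice_max(growth, order_by = growth_from_previous_year, n = 1)
positional <- slice_max(growth, growth_from_previous_year)

# n = 2 returns the top TWO rows, highest first
top_two <- slice_max(growth, order_by = growth_from_previous_year, n = 2)

# To keep only the second highest, take row 2 of the top two
second_only <- slice(top_two, 2)
```

If ties are possible, slice_max() may return more rows than n unless with_ties = FALSE is set, so the row-2 trick would need more care.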

Lorenzo Botto • December 9, 2025

Hi, very interesting and useful. Quick question: before using the lag function, would it be safer to order the year variable, to be sure that we compute the difference between the later year and the earlier year? Or is it redundant? Thank you.

Gracielle Higino Coach • December 12, 2025

Hi Lorenzo! That's an interesting question! group_by() does not sort the rows, so lag() works on whatever order the rows already have within each group. In this case, adding an arrange() step does not change the results, presumably because the rows are already in year order within each district. It's always good to check when doing time series operations, though! If you do include an arrange(), make sure the years end up sorted within each group, for example with arrange(district, year), or with arrange(year, .by_group = TRUE) after group_by().
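
A minimal sketch of the ordering issue, using made-up data in which district A's rows are deliberately out of year order (the values and the object names are invented for illustration). It compares lag() on the unsorted rows, lag() after an explicit arrange(), and lag() with its order_by argument, which is another way to make the intended order explicit.

```r
library(dplyr)
library(tibble)

# Made-up data; district A's rows are intentionally in the wrong order
toy <- tribble(
  ~district, ~year,       ~pct,
  "A",       "2022-2023", 0.12,
  "A",       "2021-2022", 0.10,
  "B",       "2021-2022", 0.20,
  "B",       "2022-2023", 0.30
)

# No sorting: for district A the difference is taken in the wrong direction
unsorted <-
  toy |>
  group_by(district) |>
  mutate(growth = pct - lag(pct)) |>
  ungroup()

# Sort by district and year first, then lag within each district
sorted <-
  toy |>
  arrange(district, year) |>
  group_by(district) |>
  mutate(growth = pct - lag(pct)) |>
  ungroup()

# Alternative: leave the rows where they are and tell lag() how to order them
with_order_by <-
  toy |>
  group_by(district) |>
  mutate(growth = pct - lag(pct, order_by = year)) |>
  ungroup()
```

In the unsorted version the growth for district A lands on the 2021-2022 row with a negative sign, while the other two versions attach a positive growth to the 2022-2023 row.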