Data Cleaning with R

Missing, Implicit, or Misplaced Grouping Variables

This lesson is called Missing, Implicit, or Misplaced Grouping Variables, part of the Data Cleaning with R course.

Since this course was created, the separate() and separate_rows() functions have been superseded by the separate_wider_delim() and separate_longer_delim() functions, respectively. If you have code with separate() and separate_rows() that works, no need to change anything!
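
As a rough illustration of the newer functions mentioned above, here is a minimal sketch. The `toy` data frame and its column names are made up for this example and are not part of the course materials; it assumes tidyr 1.3.0 or later is installed.

```r
library(tidyr)

# Toy data invented for this sketch (hypothetical, not from the course)
toy <- data.frame(
  id = 1:2,
  codes = c("a-b", "c-d")
)

# One new column per piece, the job separate() used to do
wide <- separate_wider_delim(toy, codes, delim = "-", names = c("first", "second"))

# One new row per piece, the job separate_rows() used to do
long <- separate_longer_delim(toy, codes, delim = "-")

wide
long
```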

Your Turn

Solution

## %######################################################%##
#                                                          #
####          Missing, implicit, or misplaced           ####
####           grouping variables - your turn           ####
#                                                          #
## %######################################################%##

# Load the `primates2017` dataset bundled with 📦 `unheadr`
# Create a new column that groups the different species by taxonomic family.
# In biology, taxonomic families all end in the suffix "DAE"
# How many different ways can you identify the embedded subheaders in these data?


# load packages -----------------------------------------------------------
library(unheadr)
library(dplyr)
library(stringr)

# load data ---------------------------------------------------------------
data("primates2017")

# untangle subheaders to new column ---------------------------------------
primates2017 %>%
  untangle2(regex = "DAE$", orig = scientific_name, new = family)

# How many different ways can you identify the embedded subheaders in these data?
# 2, because families are also the only strings in all upper case
primates2017 %>%
  untangle2(regex = "^[A-Z]+$", orig = scientific_name, new = family) %>%
  mutate(family = str_to_title(family)) # change to title case
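
To see what untangling embedded subheaders amounts to, the same result can be approximated by hand with dplyr, tidyr, and stringr. This is only a sketch of the idea, not necessarily how untangle2() works internally, and the `toy` tibble below is invented data that merely mimics the layout of primates2017.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy data invented for this sketch; it only mimics the layout of primates2017
toy <- tibble(
  scientific_name = c(
    "CEBIDAE", "Cebus albifrons", "Saimiri sciureus",
    "HOMINIDAE", "Gorilla gorilla", "Pan troglodytes"
  )
)

tidy_toy <- toy %>%
  # copy subheader rows into a new column, leaving NA everywhere else
  mutate(family = if_else(str_detect(scientific_name, "DAE$"),
                          scientific_name, NA_character_)) %>%
  # carry each family name down to the species rows beneath it
  fill(family) %>%
  # drop the rows that only held the subheader
  filter(!str_detect(scientific_name, "DAE$"))

tidy_toy
```

Spelling the steps out this way ties the exercise back to fill() from earlier in the course: the subheader is flagged with a regular expression, filled downward, and then removed from the original column.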

Load the primates2017 dataset bundled with 📦 unheadr and create a new column that groups the different species by taxonomic family.

In biology, taxonomic families all end in the suffix “DAE”.

How many different ways can you identify the embedded subheaders in these data?

Learn More

To learn more about the unheadr package, check out its documentation website.
