Data Cleaning with R

31 lessons

Welcome to Data Cleaning with R
What is Data Cleaning?
Course Logistics and Materials
Data Organization
Data Organization Best Practices
Tidy Data
Grouping and Indicator Variables
NA and Empty Values
Data Sharing Best Practices
Restructuring Data
Tidyverse Refresher
Working with Columns with across()
Pivoting Data
coalesce() and fill()
Regular Expressions
What are Regular Expressions?
Understanding and Testing Regular Expressions
Literal Characters and Metacharacters
Metacharacters: Quantifiers
Metacharacters: Alternation, Special Sequences, and Escapes
Combining Metacharacters
Regex in R
Regular Expressions and Data Cleaning, Part 1
Regular Expressions and Data Cleaning, Part 2
Common Issues
Common Issues in Data Cleaning
Unusable Variable Names
Whitespace
Letter Case
Missing, Implicit, or Misplaced Grouping Variables
Compound Values
Duplicated Values
Broken Values
Empty Rows and Columns
Parsing Numbers
Putting Everything Together

Regex in R

This lesson is called Regex in R, part of the Data Cleaning with R course.

Your Turn

##%######################################################%##
#                                                          #
####               Regex in R - Your Turn               ####
#                                                          #
##%######################################################%##


# Match the following regular expressions against the test vector below using `str_detect`.
## Can you explain the matches?

# Regular expressions

# 1. ^dog
# 2. ^[a-z]+$
# 3. \\d

test_vector <- c("Those dogs are small.","dogs and cats",
                 "34","(34)","rat","watchdog","placemat",
                 "BABY","2011_April","mice")

Solution

# load packages -----------------------------------------------------------
library(stringr)

# create test vector ------------------------------------------------------
test_vector <- c("Those dogs are small.","dogs and cats",
                 "34","(34)","rat","watchdog","placemat",
                 "BABY","2011_April","mice")

# evaluate if each element contains a match -------------------------------
# ^dog: strings that start with the literal characters "dog"
str_detect(test_vector,"^dog") # only element 2 matches; "watchdog" contains "dog" but does not start with it

# ^[a-z]+$: strings made up entirely of lowercase letters (no spaces, digits, or punctuation)
str_detect(test_vector,"^[a-z]+$") # elements 5, 6, 7, and 10 match

# \\d: strings that contain at least one digit
str_detect(test_vector,"\\d") # elements 3, 4, and 9 match
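
str_detect returns one TRUE or FALSE per element, so the matching positions have to be read off by counting. As an optional check that is not part of the original solution, the sketch below uses base R's which() to list the matching positions and stringr's str_subset() to return the matching strings themselves. The names dog_idx, lower_idx, and digit_idx are made up for this illustration.

```r
library(stringr)

test_vector <- c("Those dogs are small.", "dogs and cats",
                 "34", "(34)", "rat", "watchdog", "placemat",
                 "BABY", "2011_April", "mice")

# positions of the elements that match each pattern
# (dog_idx, lower_idx, digit_idx are illustrative names)
dog_idx   <- which(str_detect(test_vector, "^dog"))
lower_idx <- which(str_detect(test_vector, "^[a-z]+$"))
digit_idx <- which(str_detect(test_vector, "\\d"))

# the matching strings themselves, rather than their positions
lower_matches <- str_subset(test_vector, "^[a-z]+$")

print(dog_idx)
print(lower_idx)
print(digit_idx)
print(lower_matches)
```

Comparing the printed positions with the comments in the solution is a quick way to confirm that each explanation of the matches holds.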

Learn More

To learn more about the stringr package, check out the documentation website. There is also a stringr cheatsheet. You might also check out Chapter 14 of R for Data Science, as well as this blog post by Hugo Toscano on working with strings in R.

Have any questions? Put them below and we will help you out!

Alberto Cabrera

January 7, 2024

Need your help figuring out how the following regex works: "\[.+\]|\?". It was developed by Albert Rapp as part of a mutate program to pull years from the following strings: "1973 [YR1973]" "1974 [YR1974]" "1975 [YR1975]" "1976 [YR1976]"

wd_data %>% mutate( year = year %>% str_remove("\[.+\]|\?") %>% str_to_title()) %>% pull(year)

It is not clear to me how this regular expression worked in extracting just the first 4 characters associated with year. Thanks

David Keyes Founder

January 8, 2024

I'm sorry but I'm not sure we can answer this for you. The code you provided doesn't work for me. You might consider reaching out to Albert directly.

Alberto Cabrera

January 8, 2024

Thanks!
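
For readers with the same question, here is a minimal sketch of how a pattern like this behaves with str_remove. It rests on two assumptions that are not confirmed in the thread: the backslashes are doubled (R string literals require "\\[" to produce the regex \[), and the vector year below is a stand-in for the year column of wd_data, which is not available here. Under those assumptions the pattern does not extract the first four characters. It deletes the bracketed part, and the year is what is left over.

```r
library(stringr)

# stand-in for the year column in the question (wd_data itself is not available)
year <- c("1973 [YR1973]", "1974 [YR1974]", "1975 [YR1975]", "1976 [YR1976]")

# assumption: backslashes doubled so the string is a valid R literal
#   \\[   a literal opening bracket
#   .+    one or more of any character
#   \\]   a literal closing bracket
#   |     or
#   \\?   a literal question mark
pattern <- "\\[.+\\]|\\?"

# str_remove deletes the first match in each string, here "[YR1973]" and so on;
# the space before the bracket is not part of the match, so it remains
removed <- str_remove(year, pattern)

# str_trim drops the leftover trailing space
cleaned <- str_trim(removed)

print(removed)
print(cleaned)
```

The "\\?" alternative only applies to strings that contain a question mark, and none of the four examples do.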
