Merging Data

This lesson is called Merging Data, part of the Going Deeper with R course.

Your Turn

Solution

# Packages used by this excerpt
library(tidyverse)
library(readxl)
library(janitor)

clean_enrollment_data <- function(raw_data, data_year, race_ethnicity_remove_text) {
  raw_data %>% 
    select(-contains("grade")) %>% 
    select(-contains("kindergarten")) %>% 
    select(-contains("percent")) %>% 
    pivot_longer(cols = -district_id,
                 names_to = "race_ethnicity",
                 values_to = "number_of_students") %>% 
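    mutate(number_of_students = na_if(number_of_students, "-")) %>% 
    # Convert to numeric before filling NAs: recent tidyr versions will not
    # replace NAs in a character column with a numeric 0
    mutate(number_of_students = as.numeric(number_of_students)) %>% 
    mutate(number_of_students = replace_na(number_of_students, 0)) %>% 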
    mutate(race_ethnicity = str_remove(race_ethnicity, race_ethnicity_remove_text)) %>% 
    mutate(race_ethnicity = case_when(
      race_ethnicity == "american_indian_alaska_native" ~ "American Indian Alaska Native",
      race_ethnicity == "asian" ~ "Asian",
      race_ethnicity == "black_african_american" ~ "Black/African American",
      race_ethnicity == "hispanic_latino" ~ "Hispanic/Latino",
      race_ethnicity == "multiracial" ~ "Multi-Racial",
      race_ethnicity == "native_hawaiian_pacific_islander" ~ "Pacific Islander",
      race_ethnicity == "white" ~ "White"
    )) %>% 
    group_by(district_id) %>% 
    mutate(pct = number_of_students / sum(number_of_students)) %>% 
    ungroup() %>% 
    mutate(year = data_year)
}

enrollment_by_race_ethnicity_18_19 <- clean_enrollment_data(raw_data = enrollment_18_19, 
                                                            data_year = "2018-2019",
                                                            race_ethnicity_remove_text = "x2018_19_")

enrollment_by_race_ethnicity_17_18 <- clean_enrollment_data(raw_data = enrollment_17_18, 
                                                            data_year = "2017-2018",
                                                            race_ethnicity_remove_text = "x2017_18_")

enrollment_by_race_ethnicity <- bind_rows(enrollment_by_race_ethnicity_17_18,
                                          enrollment_by_race_ethnicity_18_19)

# Import Oregon School and District Data -----------------------------------------------------

oregon_districts <- read_excel(path = "data-raw/oregon-districts.xlsx",
                               sheet = "Sheet1") %>%
  clean_names()

# Join District Data -------------------------------------------------------------------------

enrollment_by_race_ethnicity <- left_join(enrollment_by_race_ethnicity,
                                          oregon_districts,
                                          by = c("district_id" = "attending_district_institutional_id"))

The Your Turn exercise has three steps:

1. Download the oregon-districts.xlsx file into the data-raw folder. The URL to download the file is https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/oregon-districts.xlsx
2. Import a new data frame called oregon_districts from oregon-districts.xlsx.
3. Merge the oregon_districts data frame into the enrollment_by_race_ethnicity data frame so you can see the names of the districts.

Learn More

If you do nothing else, check out the tidyexplain project mentioned throughout the video. The animations are incredible for grasping how joins work.

This blog post by Crystal Lewis does a great job explaining joins.

Two other references to help you understand joining data are Chapter 13 of R for Data Science and Chapter 15 of Stat 545.

Have any questions? Put them below and we will help you out!

Daniel Sossa

March 24, 2021

Hello, if we have two data frames with many columns each and we want to add only ONE column from df2 to df1 (instead of merging the whole thing), what should we do? Is there an argument in the left_join() function to do that?
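One common approach is to trim df2 down to the join key plus the one column you want before joining, rather than relying on an argument of left_join(). The sketch below uses made-up data frames and column names (id, score, district_name, county) purely for illustration.

```r
library(dplyr)

# Hypothetical example data: df2 has more columns than we want to bring over
df1 <- tibble(id = c(1, 2, 3), score = c(90, 85, 70))
df2 <- tibble(id = c(1, 2, 3),
              district_name = c("A", "B", "C"),
              county = c("X", "Y", "Z"))

# Keep only the key (id) and the single column of interest from df2,
# then join that slimmed-down data frame onto df1
df1_with_name <- df1 %>%
  left_join(df2 %>% select(id, district_name), by = "id")
```

The result has the columns of df1 plus district_name only; county never enters the join.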

Jody Oconnor

May 6, 2021

When I download the file, I end up with something in my data-raw folder that cannot be opened by Excel or by R. This is the error message:

> or_districts <- read_excel(path = "data-raw/oregon_districts.xlsx",
                             sheet = "Sheet1")

Error: Evaluation error: zip file 'data-raw/oregon_districts.xlsx' cannot be opened.

If I go to the url directly and download it to my laptop then move it into the data-raw folder I am able to open it. How can I get R to download a usable file directly? Thanks!
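A likely cause, especially on Windows, is that download.file() wrote the file in text mode, which can corrupt binary files such as .xlsx. A minimal sketch, assuming that is the problem here: pass mode = "wb" so the file is written as binary. The helper name download_binary() is hypothetical and not part of the course code.

```r
# Hypothetical helper: always download in binary mode ("wb") so that
# files such as .xlsx are written byte-for-byte
download_binary <- function(url, destfile) {
  # Create the destination folder if it does not exist yet
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
  download.file(url = url, destfile = destfile, mode = "wb")
  invisible(destfile)
}

# Usage (requires internet access), with the URL from the Your Turn steps:
# download_binary(
#   url = "https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/oregon-districts.xlsx",
#   destfile = "data-raw/oregon-districts.xlsx"
# )
```

Note that the file name in the read_excel() call then has to match the downloaded name exactly, including the hyphen in oregon-districts.xlsx.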

Atlang Mompe

May 9, 2021

Hi David, Out of curiosity, how would you do an anti-join in order to keep 4, 4y? Thanks, Atty
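Assuming the question refers to the small x and y tables used in the tidyexplain animations (recreated below as an approximation), swapping the order of the arguments does it: anti_join(y, x) keeps the rows of y that have no match in x.

```r
library(dplyr)

# Approximate recreation of the tidyexplain example tables (an assumption)
x <- tibble(id = c(1, 2, 3), x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), y = c("y1", "y2", "y4"))

# Rows of y whose id does not appear in x
only_in_y <- anti_join(y, x, by = "id")
```

Here only_in_y contains the single row with id 4 and y4, because ids 1 and 2 are present in both tables.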

Atlang Mompe

May 9, 2021

Hi David,

When I import the district data - I cannot read sheet 1, so I ended up using this code without specifying sheet 1: oregon_districts <- read_excel(path = "data-raw/oregon-districts.xlsx").

Also, I noticed that maybe you omitted some steps in your final code (enrollment_by_race_ethnicity) - it is not in the solution code you shared, but it is in the video.

Here is my code, I wonder why I could not read my excel Sheet 1.

# Download Oregon District File
download.file(url = "https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/oregon-districts.xlsx",
              mode = "wb",
              destfile = "data-raw/oregon-districts.xlsx")

# Import Oregon District Data
oregon_districts <- read_excel(path = "data-raw/oregon-districts.xlsx") %>%
  clean_names()

# Join Race/Ethnicity and Oregon District Data
enrollment_by_race_ethnicity <- left_join(enrollment_by_race_ethnicity,
                                          oregon_districts,
                                          by = c("district_id" = "attending_district_institutional_id"))

Thanks, Atty
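When a sheet name does not seem to work, it can help to list the sheet names readxl actually sees, since the name has to match exactly. The sketch below uses an example workbook that ships with readxl as a stand-in for the lesson file, so the sheet names it prints are not the ones in oregon-districts.xlsx.

```r
library(readxl)

# Stand-in workbook bundled with readxl (not the lesson file)
path <- readxl_example("datasets.xlsx")

# List the sheet names exactly as they are stored in the workbook
sheet_names <- excel_sheets(path)

# A sheet can also be selected by position instead of by name
first_sheet <- read_excel(path, sheet = 1)
```

Leaving out the sheet argument, as in the code above, reads the first sheet by default, which is why that version works.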

Matt M

December 2, 2021

The code shown below the solution video appears to have an error and does not match the code in the solution video. On line 38, the x data frame of left_join() is missing: "enrollment_by_race_ethnicity" should come before the comma, with oregon_districts after, right? (That's what was done in the solution video.)

Matt M

December 2, 2021

The animations are fantastic. The slide text describing full_join() and inner_join() was confusing for me: the two slides appear to use the same text (I’ve checked multiple times and can’t find a difference).

The animation of inner_join() makes it look like it does NOT keep “all rows and all columns from both x and y”. It only does so IF there is a match.

Is this a correct interpretation of inner_join(): “Only returns rows where there are matches between x and y”?
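That reading matches how dplyr behaves: inner_join() returns only rows whose key appears in both tables, while full_join() keeps every row from both and fills the gaps with NA. A small sketch with made-up tables (modeled loosely on the animations) shows the difference in row counts.

```r
library(dplyr)

# Made-up tables: ids 1 and 2 are shared, 3 is only in x, 4 is only in y
x <- tibble(id = c(1, 2, 3), x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), y = c("y1", "y2", "y4"))

# Only the rows with a match in both tables (ids 1 and 2)
inner_result <- inner_join(x, y, by = "id")

# All rows from both tables (ids 1, 2, 3, 4), with NA where there is no match
full_result <- full_join(x, y, by = "id")
```

Both results have the same columns (id, x, y); they differ in which rows survive.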

JULIO VERA DE LEON

April 30, 2022

Hi! When I try to use the read_excel function I get this error:

Error: filepath: /Users/julio/Library/Mobile Documents/com~apple~CloudDocs/R/R for the Rest of Us/Going Deeper with R/going-deeper-dk/data-raw/oregon-districts.xls libxls error: Unable to open file

This is my code: oregon_districs <- read_excel(path = "data-raw/oregon-districts.xls", sheet = "Sheet1")

If I open it with excel I get a warning that the format and extension of the file don't match. In the end I can open it and see the values if I press I want to open it anyway.
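One possible cause is that the file is really an .xlsx workbook saved with an .xls extension, which would fit the Excel warning about format and extension not matching. read_excel() looks at the extension first when guessing the format, so a sketch of a workaround, under that assumption, is to call read_xlsx() directly. The example below simulates the situation with a workbook that ships with readxl; it does not use the lesson file.

```r
library(readxl)

# Simulate the problem: copy a real .xlsx workbook bundled with readxl
# to a temporary file that has a mismatched .xls extension
mislabeled_path <- tempfile(fileext = ".xls")
file.copy(readxl_example("datasets.xlsx"), mislabeled_path)

# read_xlsx() treats the file as xlsx without guessing from the extension
districts_demo <- read_xlsx(path = mislabeled_path, sheet = 1)
```

Renaming the downloaded file so it ends in .xlsx, as in the lesson's URL, would be the simpler fix if this is indeed what happened.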

Sara Cifuentes

May 26, 2022

Dear Charlie and David, I do not know what I'm doing wrong. I used inner_join() to join a data frame that has 72500 observations and 253 variables (x) with a data frame that has 718232 observations and 2 variables (y). I joined them on the variable they share ("annotation"). I expected to get 72,508 observations and 254 variables in the new matrix, but a matrix with 718,232 observations and 254 variables was generated.

gene_ecoli_mat <- inner_join(matrix_profile_ecoli, ecoli_pangenome_annot, by = "Annotation")

I thought there would be a matrix where the 72508 observations are shared.

Thanks in advance,

Sara
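One possible explanation is that the key is not unique in y: if an annotation appears several times in y, inner_join() returns one row per matching pair, so the result can have as many rows as y. The sketch below reproduces this with tiny made-up tables (the column names value and group are hypothetical) and shows one way to check for repeated keys.

```r
library(dplyr)

# Made-up tables: "geneA" appears three times in y
x <- tibble(annotation = c("geneA", "geneB"), value = c(1, 2))
y <- tibble(annotation = c("geneA", "geneA", "geneA", "geneB"),
            group = c("g1", "g2", "g3", "g1"))

# Each row of x is repeated once per matching row of y: 4 rows, not 2
joined <- inner_join(x, y, by = "annotation")

# Check which keys are repeated in y
repeated_keys <- y %>%
  count(annotation) %>%
  filter(n > 1)

# If one row per key is what you want, de-duplicate y before joining
y_unique <- distinct(y, annotation, .keep_all = TRUE)
joined_unique <- inner_join(x, y_unique, by = "annotation")
```

Whether dropping the repeated rows is appropriate depends on what they mean in the data; distinct() simply keeps the first row for each key.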

Matt Kropp

November 9, 2022

I often combine multiple datasets by name. Many times the names are slightly different (things like Jr., AJ vs. A.J., or Mike vs. Michael). Do you have any suggestions for handling this?
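One approach, sketched here under simple assumptions, is to build a normalized key in both datasets and join on that instead of the raw name. The helper normalize_name() and the example tables are hypothetical; the rules it applies (lowercase, strip punctuation, drop a trailing jr/sr) are only a starting point.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: reduce a name to a simplified key for matching
normalize_name <- function(name) {
  name %>%
    str_to_lower() %>%                    # "A.J. Smith" -> "a.j. smith"
    str_remove_all("[[:punct:]]") %>%     # "a.j. smith" -> "aj smith"
    str_remove("\\s+(jr|sr)$") %>%        # drop a trailing "jr" or "sr"
    str_squish()                          # tidy up stray whitespace
}

# Made-up example data
roster <- tibble(name = c("A.J. Smith", "Ken Griffey Jr.", "Mike Jones"),
                 team = c("Red", "Blue", "Green"))
stats <- tibble(name = c("AJ Smith", "Ken Griffey", "Michael Jones"),
                points = c(10, 20, 30))

# Add the normalized key to both tables and join on it
roster_clean <- roster %>% mutate(name_key = normalize_name(name))
stats_clean <- stats %>%
  mutate(name_key = normalize_name(name)) %>%
  select(name_key, points)

combined <- left_join(roster_clean, stats_clean, by = "name_key")
```

In this example the first two names match after normalization, while Mike vs. Michael still does not. Cases like that usually need a manual lookup table of nicknames or approximate string matching, for example with a package such as fuzzyjoin.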

Hatem Kotb

January 13, 2023

In the solutions code lines 51-53 seem redundant?
