Going Deeper with R
-
Welcome to Going Deeper with R
-
Advanced Data Wrangling and AnalysisOverview
-
Importing Data
-
Tidy Data
-
Reshaping Data
-
Dealing with Missing Data
-
Changing Variable Types
-
Advanced Variable Creation
-
Advanced Summarizing
-
Binding Data Frames
-
Functions
-
Merging Data
-
Renaming Variables
-
Quick Interlude to Reorganize our Code
-
Exporting Data
-
Advanced Data VisualizationData Visualization Best Practices
-
Tidy Data
-
Pipe Data Into ggplot
-
Reorder Plots to Highlight Findings
-
Line Charts
-
Use Color to Highlight Findings
-
Declutter
-
Use the scales Package for Nicely Formatted Values
-
Use Direct Labeling
-
Use Axis Text Wisely
-
Use Titles to Highlight Findings
-
Use Color in Titles to Highlight Findings
-
Use Annotations to Explain
-
Tweak Spacing
-
Customize Your Theme
-
Customize Your Fonts
-
Try New Plot Types
-
Advanced RMarkdownAdvanced Markdown Text Formatting
-
Tables
-
Advanced YAML
-
Inline R Code
-
Making Your Reports Shine: Word Edition
-
Making Your Reports Shine: HTML Edition
-
Making Your Reports Shine: PDF Edition
-
Presentations
-
Dashboards
-
Other Formats
-
Wrapping UpYou Did It!
Your Turn
You’ll be working with data on Oregon school enrollment by race/ethnicity.
Create a new project. Make sure you put it somewhere you’ll be able to find it again later!
Download the two files using the download.file() function into a data-raw folder (which you’ll need to create).
- 2018-2019 Enrollment by Race/Ethnicity Data File: https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx
- 2017-2018 Enrollment by Race/Ethnicity Data File: https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-17-18.xlsx
Create a new R script file where you’ll do all of your data cleaning work
Import the two spreadsheets into two data frames (enrollment_17_18
and enrollment_18_19
)
Heads Up!
If you have issues opening the spreadsheet, note that you may need to add the argument mode = "wb"
to the download.file()
function, as follows (read more about why here). You can see the full code in the solutions section below if necessary.
Solutions
Learn More
You can read about all of the arguments for the download.file()
function here.
To learn more about importing Excel files, check out the readxl
package documentation. You’ll see, for example, ways to download only certain ranges of cells, which can be helpful when you have messy Excel data!
I’ve also written an article about cleaning messy data in R. There are many packages to deal with messy data (which often comes in the form of Excel spreadsheets), and I go through several in the post.
Have any questions? Put them below and we will help you out!
You must be logged in to post a comment.
fyi – I was able to download both .xlsx files, but when creating objects with the 17-18 data I got this error message:
> download.file(url = “https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-17-18.xlsx”,
+ destfile = “data-raw/enrollment-17-18.xlsx”)
trying URL ‘https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-17-18.xlsx’
Content type ‘application/octet-stream’ length 47088 bytes (45 KB)
downloaded 45 KB
> enrollment_17_18 %
+ clean_names()
Error: Evaluation error: zip file ‘data-raw/enrollment-17-18.xlsx’ cannot be opened.
I was able to download the file to my computer, then copy it into the ‘data-raw’ folder manually and turn it into an object with R script successfully. So I’m moving on with the assignment but thought you would want to know.
I did not have the same problem with the 18-19 data.
Thanks for the heads up! Did you try adding mode = “wb” (see orange box above)? That may help with this issue.
looking back at my script… I did use mode = “wb” for the 18-19 data download, but not for the 17-18 dataset. So that is probably why one worked and the other didn’t.
I am having a problem just downloading the files. My code (with or without the mode = “wb”,):
download.file(url = “https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx”,
destfile = “data-raw-gd/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx”)
and I get these two errors:
Warning messages:
1: In download.file(url = “https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx”, :
URL https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx: cannot open destfile ‘data-raw/enrollment-18-19.xlsx’, reason ‘No such file or directory’
2: In download.file(url = “https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx”, :
download had nonzero exit status
This also happens with my own project file links (from Tidy Tuesday).
Do you have a folder called data-raw-gd? That’s where the
download.file()
function is trying to put the file it downloads, but if that folder doesn’t exist, it won’t work.yes, I created it before I tried to download.
Actually, take a look back at your destfile argument:
destfile = “data-raw-gd/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx”
It’s going to try to put the data in the data-raw-gd/going-deeper/raw/master/data-raw/ folder. Try this instead and let me know if it works:
download.file(url = "https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx",
destfile = "enrollment-18-19.xlsx")
That will put the file in the root of the project you’re working in. You did create a completely new project for this course, correct?
ah! I was copying too much of the url including forward slashes that were not right. Simple but frustrating, thanks!
this is what worked:
download.file(url=”https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx”,
destfile= “data-raw-gd/enrollment-18-19.xlsx”)
And I went through the steps to create a new project in a new Directory, but I am still saving everything in an “R in 3 mo” folder. I think I have some troubleshooting yet to do about keeping projects truly separate.
Yeah, I would recommend one RStudio project for each distinct project. It’s a bit artificial now because you’re working on “projects” that are actually just materials for each course. We can discuss more this week in our live session!
The excel file I am downloading is not getting opened in excel. The following dialog box appears-“file format or file extension not valid. Verify the file has not been corrupted and that file extension matches the format”.
You recommended data cleaning in R script but reporting in RMarkdown. What if I want my report to be 100% reproducible by someone else? Should I then do all my data cleaning within RMarkdown?
Either an R script file or an RMarkdown file will be reproducible because other people can run your code so it shouldn’t matter on that front. Others do their data cleaning in RMarkdown and that works fine as well. I just have a preference to do it in an R script file. In the end, it’s up to you!
At about 9:38 you said “let me highlight all of these” because keyboard shortcuts are faster. Did you use CTRL A to select all or something else to select just that chunk/those lines?
Yup, command (Mac) or control (Windows) and A to select all lines in an R script file.
Hi David,
What was your key at 12:38 second, that hash tag showed up before two lines R code ?
Hi Niger,
David is using code comments, I’ve recorded a very quick video demonstrating how they work.
Thanks,
Charlie
Hi I was trying to import data from two excels(field_data_AA and predictor season13July) where 4 spread sheets (name are Autumn, Spring, Winter and Summer. I found following messages. I first run code with Spring and data imported then run code with all seasons? It did not work. Can you suggest me please how can I fix cell reference in excel files ?
> field_data_2018
> field_data_2019 <- read_excel(path = "Data-raw/predictor season13July.xlsx",
+ sheet = "Autumn","Spring","Winter","Summer")
Error: Can't guess format of this cell reference:
Spring
In addition: Warning message:
Cell reference follows neither the A1 nor R1C1 format. Example:
Spring
NAs generated.
Cheers
Niger
Hello Niger,
The read_excel() function will only allow you to read in one sheet at a time, eg
read_excel(path = "Data-raw/predictor season13July.xlsx", sheet = "Autumn")
You would need to import each of these sheets separately and to then combine them together with bind_rows(). As you’re a member of R in 3 Months I’d recommend asking about how to combine these in a submission and I can help you.
Thanks,
Charlie
Hello whenever i try to run my code it keeps giving me an error mesage;
source(“~/.active-rstudio-document”, echo=TRUE)
Error in source(“~/.active-rstudio-document”, echo = TRUE) :
~/.active-rstudio-document:6:107: unexpected symbol
5:
6: download.file(url=”https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-mode = “wb
^
Hi Esther,
This error usually comes from an unexpected character in your code. Could you copy and paste the code you’re trying to run please?
Thanks,
Charlie
library(tidyverse)
library(janitor)
library(readxl)
download.file(url=”https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-mode = “wb”,
destfile = “data-raw/enrollment-18-19.xlsx”)
download.file(url= “https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-mode = “wb”,
destfile = “data-raw/enrollment-17-18.xlsx”)
#import data
enrollment_18_19 <- read_excel(path = "data-raw/enrollment-18-19.xlsx",
sheet = "Sheet 1")
enrollment_17_18 <- read_excel(path= "data-raw/enrollment-17-18.xlsx",
sheet = "Sheet 1")
enrollment_18_19 <- read_excel(path = "data-raw/enrollment-18-19.xlsx",
sheet = "Sheet 1")
enrollment_17_18 <- read_excel(path = "data-raw/enrollment-17-18.xlsx",
sheet = "Sheet 1")
Hello Esther,
Ah! Okay, this is a slightly difficult thing to explain.
There are two types of double quotation characters, the curly ones you’re using “ and the straight one ” which are what you need to use in R code. Almost without exception the curly bracket cannot be used in programming languages.
The next issue is a missing closing quotation mark at the end of the URL and the argument “mode”. The code you need to write is as follows:
library(tidyverse)
library(janitor)
library(readxl)
download.file(url = "https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-18-19.xlsx",
mode = "wb",
destfile = "data-raw/enrollment-18-19.xlsx")
download.file(url = "https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-17-18.xlsx",
mode = "wb",
destfile = "data-raw/enrollment-17-18.xlsx")
#import data
enrollment_18_19 <- read_excel(path = "data-raw/enrollment-18-19.xlsx", sheet = "Sheet 1") enrollment_17_18 <- read_excel(path= "data-raw/enrollment-17-18.xlsx", sheet = "Sheet 1") enrollment_18_19 <- read_excel(path = "data-raw/enrollment-18-19.xlsx", sheet = "Sheet 1") enrollment_17_18 <- read_excel(path = "data-raw/enrollment-17-18.xlsx", sheet = "Sheet 1")
Thanks,
Charlie
Thank you Charlie
What’s the difference between the “destfile” argument and “dest” in read_excel, which seems to have also put my downloads into the folder I directed them to? Are they two ways of doing the same thing?
Hi Julia,
Could you help me understand your question a little bit? The read_excel() function doesn’t have arguments contains “dest” – see here (https://www.rdocumentation.org/packages/readxl/versions/0.1.1/topics/read_excel). The download.file() function has an argument called destfile that determines where the file is downloaded.
Cheers, Charlie
Yes! I used this code:
download.file(url = “https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/enrollment-17-18.xlsx”,
dest = “data-raw/enrollment-17-18.xlsx”)
…and it downloaded the data to inside my “data-raw” folder. I was confused then when I watched the solutions video, because yeah, it seems like “dest” should not be accepted because it’s not an allowed argument; only “destfile” should work I would think.
Julia!
Thanks for teaching me something new! It turns out that R does partial matching for arguments. I hate that it does this 🤣 I tweeted about it to see if this is news to other people too https://twitter.com/charliejhadley/status/1587152103209930754?s=20&t=YR6Yy953f4nX0O7HZnyh9Q
I’m having trouble reading in the files. This is the error message:
> enrollment_18_19 <- read_excel(path = "data-raw/enrollment-18-19.xlsx",
+ sheet = "Sheet 1")
Error in utils::unzip(zip_path, list = TRUE) :
error -103 with zipfile in unzGetCurrentFileInfo
Please disregard. I was able to get it to work by adding the mode = “wb”. I didn’t realize the problem was with the download.
I was able to download the files, but when I go to view them before import, I receive an error that “the file format or extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.”
Disregard! I had noticed the wb code addition in the comments, but I realized I needed to add it to the download not the import.
Would we use the same process to download a Google Sheet file?
It’s a similar process. Check out the googlesheets4 package.
Thanks David!
Thanks for the note on Markdown vs Script, I thought it was only me 😁.
Agree with your suggestion, makes sense 🙏🏼