Import Data Again
This lesson is called Import Data Again, part of the Getting Started With R course.
View code shown in video
library(tidyverse)

coffee_ratings <- read_csv(
  "coffee_ratings.csv",
  na = c("No rating", "Unknown", "8", "1", "3", "4", "23", "47"),
  col_types = cols(total_cup_points = "d", harvest_year = "i")
)

coffee_ratings
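To see what the na and col_types arguments do without needing the course file, here is a minimal sketch. The inline data, the toy_csv and toy names, and the column names are all made up for illustration, and the I() wrapper for literal data assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical inline data standing in for a CSV file.
toy_csv <- I("farm,points,year\nA,90.5,2014\nB,No rating,Unknown\nC,82,2016\n")

toy <- read_csv(
  toy_csv,
  # Strings listed here become NA in every column, not just one.
  na = c("No rating", "Unknown"),
  # "c" = character, "d" = double, "i" = integer.
  col_types = cols(farm = "c", points = "d", year = "i")
)

toy
```

Printing toy shows the type of each column under its name, which is a quick way to confirm that the arguments did what you expected.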
Your Turn
Adjust your read_csv() code so that you import the data again.
Use the na argument to tell read_csv() what should be treated as missing in the altitude_mean_meters column.
Use the col_types argument to make sure that altitude_mean_meters gets imported as numeric data.
Examine your data again to confirm that the changes were made successfully.
The code that I created and that you can use to get started is below.
library(tidyverse)

coffee_ratings <- read_csv(
  "coffee_ratings.csv",
  na = c("No rating", "Unknown", "8", "1", "3", "4", "23", "47"),
  col_types = cols(total_cup_points = "d", harvest_year = "i")
)

coffee_ratings
Caitlin McLemore • March 3, 2026
What is the difference between the data types of number, double, and integer?
Gracielle Higino Coach • March 6, 2026
Hi Caitlin! Both double and integer are types of numbers. Integers (you may also see the notations Int, Int16, Int32, or Int64) are whole numbers without decimals, such as 123. They can represent variables such as years, age, and (to be very niche here) sample plots in space, and they can often be thought of as categorical variables. Doubles are numbers with decimal digits, like 123.5. They are often thought of as continuous variables and can represent height, length, and weight.
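A small base R sketch of the distinction described in this reply. The variable names year and points are made up for illustration.

```r
# A whole number written with an L suffix is stored as an integer.
year <- 2014L
typeof(year)

# Without the suffix, R stores the value as a double, even if it looks whole.
points <- 90
typeof(points)

# Both integers and doubles count as numeric.
is.numeric(year)
is.numeric(points)

# Converting a double to an integer drops the decimal part.
as.integer(123.5)
```

In readr, the "number" column type (col_number(), shortcut "n") is a more forgiving parser that ignores characters such as currency symbols and grouping commas, and it also produces doubles.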
Eda Akpek • March 13, 2026
I am trying to do the practice exercise for altitude_mean_meters. The code shows that it's executing, but I don't see any changes in the histogram.
Gracielle Higino Coach • March 14, 2026
Hi Eda! Do you want to share the exact code you used? The change in the histogram is really subtle, but if you define what should be read as NA in the dataset, you should see differences when you open the little "toggle" for more info. You're right that changing the column type doesn't change much in this case. Still, I suggest that you play a bit with the arguments to see what happens in each case. What happens if you don't designate the NA values for that column? What happens if you do? What if you change the column type?
Eda Akpek • March 16, 2026
I figured it out! I set the altitude_mean_meters as an integer, and then, because it changed -999.00 to -999, I was able to use -999 in my code to set as "NA."
Gracielle Higino Coach • March 16, 2026
Awesome! 🎉
Al-Afroza Sultana • March 26, 2026
Hi, as this is a small dataset, we could check the data manually by sorting it in ascending or descending order to find any abnormalities. But how do we find abnormal values in a comparatively large dataset using R? Thanks.
Gracielle Higino Coach • March 26, 2026
There are a few methods to do that! Here's an informative tutorial to inspect outliers with basic min and max functions and with plots: https://statsandr.com/blog/outliers-detection-in-r/
Plotting is one of the most effective ways to detect outliers and discrepancies - very useful in my area, where we deal with millions of biodiversity data points!
One other tool you can take a look at is OpenRefine, which can be integrated with your R scripts: https://openrefine.org/ OpenRefine is great for detecting typos.
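As a rough illustration of the min/max and plotting checks mentioned in this reply, here is a base R sketch. The altitude vector is invented for the example, and the 1.5 multiplier is just the default of boxplot.stats(), not a rule from the lesson.

```r
# Hypothetical altitude values in meters; -999 and 190164 are planted oddities.
altitude <- c(1200, 1350, 1500, 1100, 1700, -999, 1450, 190164)

# The minimum and maximum often reveal impossible values right away.
range(altitude)

# summary() adds the quartiles, median, and mean.
summary(altitude)

# boxplot.stats() returns the points that fall more than 1.5 times the
# box length beyond the box, which is what a boxplot draws as outliers.
suspects <- boxplot.stats(altitude)$out
suspects

# The same information as a plot.
boxplot(altitude, main = "Altitude (hypothetical data)")
```

None of these calls requires sorting or scrolling through the rows, so the same checks work on a dataset of any size.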
Lalitha Vaishnavi Subramanyan • May 1, 2026
Adding 1, 4, 23, and so on from harvest_year to na worked here. But in a dataset where those are legitimate values in another column, this approach may fail and corrupt data. Is there another approach that avoids this pitfall?
Gracielle Higino Coach • May 7, 2026
Great question! Yes, in this case you could skip the na assignment when reading the data and deal with it on a case-by-case basis instead, perhaps using mutate(), case_when(), or na_if(). See some examples here: https://dplyr.tidyverse.org/reference/na_if.html#ref-examples
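A minimal sketch of that column-by-column approach, assuming dplyr is installed. The coffee tibble, its values, and the number_of_bags column are made up for illustration. The point is that 1 is a placeholder in one column but a real value in the other.

```r
library(dplyr)

# Hypothetical data: 1, 4, and 23 are placeholders in harvest_year,
# but 1 is a legitimate count in number_of_bags.
coffee <- tibble(
  harvest_year   = c(2014L, 1L, 2016L, 23L),
  number_of_bags = c(1L, 250L, 1L, 30L)
)

# na_if() replaces a single value with NA in the one column you name.
one_value <- coffee %>%
  mutate(harvest_year = na_if(harvest_year, 1L))

# For several placeholder values, if_else() with %in% is one option.
several_values <- coffee %>%
  mutate(
    harvest_year = if_else(
      harvest_year %in% c(1L, 4L, 23L),
      NA_integer_,
      harvest_year
    )
  )

several_values
```

Because each mutate() call names the column it changes, the values in number_of_bags are left untouched.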