One of the challenges for people learning R is this: there are often many ways to accomplish the same goal. This is, of course, also a benefit (if you don’t like the first approach, just try another one!). But when you’re just starting out and don’t have much experience with R, multiple solutions to a problem are more likely to be confusing than helpful. This dilemma often shows up in one of the most common questions I get from new R users, which I tweeted about last year.
The responses to my tweet were thoughtful, as always with the R community. Here’s a summary of my thoughts, supported by what people helpfully shared. I’m intentionally focusing on the responses that are most helpful for new R users, the main folks I work with.
Is this package designed to be user friendly?
The first thing I think about when I look at a package is how user friendly it is. This is why I always try to use tidyverse packages first. Given the range of packages that are part of the tidyverse, it is quite likely that one of them does what you need. And, given the focus on naming functions with English verbs (e.g.
filter()), learning a tidyverse package is likely to be easier for newbies than packages without this focus on user friendliness.
If there isn’t a core tidyverse package to do what I need, I then look for packages that do what I need while working with the tidyverse. Because the tidyverse collection of packages all follow a consistent style, the grammar that you learn working with the tidyverse will translate to any package that works with it. For example, because all tidyverse packages have the data frame as their first argument, it makes it possible to use the pipe to create a series of functions in one pipeline of code (for example, you might filter to only include survey respondents over 50, then calculate average height by various age groups). Choosing a package that is tidyverse-friendly enables you to work in this way, knowing that the package will play nicely with everything else you’re using.
Is there an organization behind this package?
The tidyverse packages also benefit from having a team of talented developers (the folks at RStudio) working full-time on them. This organizational infrastructure makes it possible to create packages that follow a consistent structure, and ensure that the packages will be around for the long term.
Another organization helping to develop R packages is rOpenSci. This organization supports the development of packages that benefit open science. rOpenSci packages go through a code review process that ensures they meet quality standards. RStudio’s Alison Hill recommended rOpenSci packages in response to my original tweet.
Does this package have clear documentation?
In her response to me, Alison Hill spoke positively of rOpenSci packages because they have a vignette and README. Let’s look at each of these in turn.
The README that Alison Hill mentions is a broad overview of the package that typically has a few parts (these are from Hadley Wickham and Jenny Bryan’s R Packages book):
1. A paragraph that describes the high-level purpose of the package.
2. An example that shows how to use the package to solve a simple problem.
3. Installation instructions, giving code that can be copied and pasted into R.
4. An overview that describes the main components of the package. For more complex packages, this will point to vignettes for more details.
A vignette is like a book chapter or an academic paper: it can describe the problem that your package is designed to solve, and then show the reader how to solve it.
Vignettes walk the user through how they might use the package. They provide clear demonstration of how to use same part of the package. A great example is the pivoting vignette from the
tidyr package, which demonstrates how to reshape data.
Having at least one vignette (many packages have several) is a good indication that the package developer cares about making the package accessible to new users.
Many packages today have websites that display the README, vignettes, and documentation of how each function in the package work. These are typically built using a package called
pkgdown (yes, a package to display information about packages).
pkgdown site, something Alison mentions, is, again, an indicator that the package developer has taken the time to document and share information about how her package works.
When was this package most recently updated?
There are a few ways to check when packages were most recently updated. If a package has been released on The Comprehensive R Archive Network (aka CRAN), which means you can install it using the
install.packages() function, a Google search of the package name will typically bring up its page on the CRAN website. From there, you can look at the published date, which shows when the package was most recently updated (the name technically refers to when that version of the package was published, which is why it’s called published and not last updated or something similar).
Some packages are not released on CRAN. You can often find them on GitHub. If you’re not familiar with GitHub, it is a place where developers share the code they use to develop software. One nice thing about GitHub is that, in addition to the actual code that makes up a package, a lot of metadata about the code is visible. If you Google and get to what’s known as a repository (i.e. a place on GitHub where the code for a single package is stored), there are two places to look to see when the code was most recently updated.
If a package has a releases section on the right, that will tell you when the most recent version of the package was released. Below, for example, you can see that the
janitor package was most recently released in April 2020.
If the package does not use releases, simply look at the top blue row in the main table. This shows the last commit (i.e. the last time the developer made changes to the code). As you can see with
janitor this was June 3 (this means that the developer has made changes to the package since the most recent release; these changes will presumably form part of a future release).
Is the package developer responsive to user feedback?
High-quality packages tend to be updated regularly in response to user feedback. The team that develops tidyverse packages is well known for taking user feedback seriously and making changes as a result. It’s why, for exmple, the
tidyr package developed the
pivot_wider() functions after many users talked about how confusing their predecessors
How do you spot a package that has gone stale and whose developer is not responding to users? Within GitHub, there is an issues tab in each repository. Think of issues as to dos. With any public repository, all issues are visible and anyone create new issues. If you take a look at the issues in a GitHub-hosted package, you can see how many open (i.e. pending) and closed (i.e. completed) issues there are (51 and 255 respectively in the screenshot below).
Having a lot of closed issues means that the developer has made changes. Having a lot of open issues means there are requests for changes, either from users or things the developer has added herself. The fact that there are a lot of open issues doesn’t necessarily mean the developer is ignoring feedback — the package may be so popular that a ton of requests come in! But, you can click on any individual issue to see discussion of it. If issues are being submitted and then ignored, there’s a good chance the package is not being actively developed. If issues are being submitted and responded to, that’s a good sign, as Sharon Machlis pointed out.
How widely used is this package?
A last, though definitely least important, suggestion: look at how widely used the package is. As Fernando Rodríguez-Sánchez pointed out, the
packagemetrics package (developed by rOpenSci) can help with this.
packagemetrics package, you can use the
package_list_metrics() function to see information about any package. For example, this output, taken from the
packagemetrics README shows details about several table-making packages. In particular, I would focus on two metrics: how many downloads the package has (assuming it is released on CRAN) and how many stars it has on GitHub. These show up in the
stars variables respectively. For both, higher numbers indicate wider use.
Other ways to evaluate R packages
My original tweet prompted a wide variety of responses. I’ve only touched on a few of them here. Some things, like test coverage, continuous integration, and reading the source code are good ideas, but not things that most newcomers to R are capable of doing.
A final way to evaluate R packages is something that is probably the one I actually use the most: seeing others I know and trust in the R community post about them (I’m not alone in this). When I see Hadley Wickham, chief scientist at RStudio and the main force behind the tidyverse, post about the
gt package, for instance, I know that it’s going to be good.
Of course, it’s hard to know who to trust when you first start out. My advice? Follow the #rstats hashtag on Twitter. It’s where the R community is primarily found. After you’ve followed #rstats for a few weeks, you’ll see who continues to pop up. Follow them and see what they talk about.
Good luck finding packages to help you on your R journey!