R for the Rest of Us Podcast Episode 3: Tom Mock
In this episode, I speak with Tom Mock. Tom works at Posit (formerly RStudio) where he helped customers use that company's suite of products. Tom also loves building tables. I spoke with Tom about how he learned R, how he uses it today, and why he is so into making tables in R. You can connect with Tom on Twitter.
Learn More
If you want to receive emails when we publish new podcast episodes, sign up for the R for the Rest of Us newsletter. And if you're ready to learn R, check out our courses.
Video Version
The video version has a code walkthrough of how to make high-quality tables in R with the {gt} package.
Resources Discussed
The blog post on making tables that we reference in our conversation is here: https://themockup.blog/posts/2020-09-04-10-table-rules-in-r/
Transcript
[00:00:00] David: Hi, I'm David Keyes and I run R for the Rest of Us. You may think of R as a tool for complex statistical analysis, but it's much more than that. From data visualization to efficient reporting to improving your workflow, R can do it all. On this podcast, I talk with people about how they use R in unique and creative ways.
[00:00:18] Join me and learn how R can help you.
[00:00:22] I'm joined today by Tom Mock. You prefer Tom, right? Not Thomas. I used to always call you Thomas, but now it's Tom. Correct?
[00:00:31] Tom: Tom. Tom's great. That's more than fine.
[00:00:33] David: All right. Cool. And Tom is the customer enablement lead at RStudio. Before that you did a PhD. I'm actually blanking on what your PhD was in. Oh, neurobiology. Is that right?
[00:00:48] So welcome, Tom. Maybe you could start out by just telling folks what your role at RStudio entails, what it means to be a customer enablement lead.
[00:00:57] Tom: Yeah, absolutely. I'm sitting kind of between some of our customer success reps, who work directly with the different accounts or customers we have, and our product and solution engineering team. The reason why it's customer enablement is that I'm specifically applying, like, how can we integrate open source for people who've bought into our professional products?
[00:01:16] The only way everything works is that we have all the open source stuff, which is amazing, and it also works in our professional products. And we have to make sure that things that work swimmingly outside of the professional products work really, really well inside the pro products too. So a lot of the time I spend is in documentation or advocacy, making sure things work, or extending things to make sure that you get a good experience.
[00:01:37] Yeah.
[00:01:39] David: Cool. Well, in addition to doing all of that, you have a really nice blog where you've written a bunch of different posts on a bunch of topics. And for a while you were doing a series on tables, which I got really interested in. We're gonna be talking today about one post in particular, where you took some guidelines that Jon Schwabish, a data viz communication expert, wrote about
[00:02:04] and figured out how to implement them in R. Before we get to that, though, just a bit of background about you. Can you give us a bit of background about when you started using R and why you made that switch?
[00:02:16] Tom: Yeah, absolutely. I, I think the first line of R code I wrote probably would've been around 2016. But then started using it much more in earnest in 2017. Ironically, my first experience didn't go well, which I, I think is true for a lot of learners, whether it's programming or anything, statistics or anything else.
[00:02:35] And a lot of that was due to how the course was presented to us, in terms of intentionally constraining you to think about it as only a statistical tool and not really giving you the guidelines of, like, oh yeah, you can do a lot more with this than just, you know, a linear model or an ANOVA. Constraining it to that made it really hard to learn, as opposed to using it as a general purpose programming tool.
[00:02:59] David: Gotcha. And so that was in grad school, right? And you were at University of North...
[00:03:04] Tom: University of North Texas. That was second year of grad school. We had the statistics course and it was taught with R. We didn't use RStudio. I think we used R Commander as the interface, which is pretty similar to using the R console. So not really scripting, but more just interactive programming, which was a rough experience at the start.
[00:03:25] But teaching myself and kind of learning with the R for DS community, my third year of my PhD was a lot more fun and a lot more rewarding.
[00:03:34] David: So you talked about the R for DS community, that being the R for data science community, talk about kind of what that was. And you know, what it looked like for you being involved with that and how that contributed to your learning.
[00:03:46] Tom: Totally. Yeah. So around that same time, when I was trying to find a community, it was 2017, third year of grad school. I was really trying to figure out what I could do that was not becoming a professor: trying to get out of the grad school pipeline, still finish the PhD, but get more into data science.
[00:04:03] So I found R. I had some knowledge of it from before, even though it was pretty limited. And then Hadley Wickham and Garrett Grolemund released the R for Data Science textbook. Around the same time, Jesse Mostipak created kind of a book club based around group reading of that book and kind of interacting.
[00:04:23] And it really flourished. The community grew to be much more than just a book club, but that was kind of the initial creation.
[00:04:32] David: Cool. And you talked about, you know, when you first learned it, it was really just focused on the statistics piece of it. To what degree did being involved with R for Data Science help open up your perspective in terms of what R was capable of?
[00:04:46] Tom: Yeah, absolutely. I mean, number one, R for DS teaches the tidyverse: dplyr, ggplot2, tidyr, purrr, all these packages. They let you do a lot more than the kind of built-in statistical functions, and it's really more about doing data analysis, exploratory data analysis, graphing, and working with data.
[00:05:05] And that was what I was really passionate about. Like, you know, I didn't necessarily want to do just statistics. I wanted to improve my workflow, in terms of I didn't want to do statistics in CSTA, then copy-paste the results to Excel, copy-paste the results from Excel, after I'd done some summaries, into PowerPoint, and then also copy-paste them into Word. Like, I'm working across four or five different softwares at this point.
[00:05:30] And every time you switch, there's that huge switching cost and loss in functionality. So with R for DS and that community, you get exposed to, you know, web scraping and iterating across functions and repeating things, or creating multiple reports from the same thing, which is all things that R is really, really good at, but it's not necessarily statistics. You know, it's like general programming, or just using functions and functional programming in software, along with your data analysis.
[00:06:01] But all in one pipeline, and not having to pay that switching cost going in and out of different softwares.
[00:06:08] David: Cool. And you talked about before that you were using CSTA, which I'll admit I'm not actually familiar with. You talked about being able to improve your workflow and centralize everything within R. I'm curious if there are other things that changed for you moving from CSTA to R.
[00:06:26] Tom: Yeah. I think number one is that CSTA is roughly similar to SPSS, in terms of you have a GUI, or graphical user interface. You say, like, I'm running an ANOVA. I click on ANOVA, how many degrees of freedom, type in some things, what are the variables, read it in from a table, and then click execute.
[00:06:47] And it runs all the analysis. I think the biggest thing for me was that every time I did that, there was like 10 minutes of. You know, five to 10 to 15 minutes of like actually getting the commands to run in the correct order, execute, you know, I'm at 25 to 45 minutes of just trying to get my analysis to rerun.
[00:07:07] And it was essentially the same analysis every few weeks as we were adding on things or, or doing new projects. When I switched over to R you know, I just re executed the script. So I took what was essentially almost an hour of work and was able to basically just rerun re-execute the code with new data and get, Hey, here's the summary of everything.
[00:07:28] Here's the preliminary statistics, here's the summary statistics and a graphic. And I could show that to my boss in a few minutes as opposed to an hour. I think that answered the question, but let me know if I went off the rails there.
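The re-run workflow Tom describes can be sketched in a few lines of base R. This is a minimal illustration rather than his actual analysis: the helper `summarise_outcome()`, the column names, and the data are all invented for the example.

```r
# Hypothetical helper: summarise one numeric outcome by group.
summarise_outcome <- function(data, outcome, group) {
  # Split the outcome values into one vector per group.
  groups <- split(data[[outcome]], data[[group]])
  data.frame(
    group = names(groups),
    n = vapply(groups, length, integer(1)),
    mean = vapply(groups, mean, numeric(1)),
    sd = vapply(groups, sd, numeric(1)),
    row.names = NULL
  )
}

# Invented data standing in for one batch of measurements.
batch <- data.frame(
  condition = c("control", "control", "treated", "treated"),
  score = c(10, 12, 15, 19)
)

summary_table <- summarise_outcome(batch, outcome = "score", group = "condition")
print(summary_table)
```

When a new batch arrives, only the data frame changes; the same call produces the updated summary, which is the reproducibility gain over clicking through a GUI each time.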
[00:07:39] David: No, no, that's great. I mean, I think that type of reproducibility is a huge thing when moving from a graphical user interface to a scripted process like with R. Cool. So I know you've said that one of the most difficult things for you when you were learning R was actually importing data.
[00:07:59] Can you talk about what that, like, what that looked like and, and how you eventually figured it out?
[00:08:04] Tom: Right. Yeah. So, I mean, I think it's very key, and again, I'll pull on an experience that Jesse Mostipak has talked about. When you first get a computer, you know, you use it for Netflix and email and other things. You don't necessarily think, oh yeah, my computer is a way to do programming.
[00:08:21] So if you have data, you just double-click on the file and it opens up. Now, when I'm opening it with R or Python or another programming language, it's much more about trying to understand: what's a file path? Which directory am I in? How do I get the file from on disk to in memory? And that's a change in your mental model.
[00:08:42] That's really useful in terms of like what's on disk exists and what's in memory exists, but like you're moving between them. But it's very different than just double click on the file and, and get going. When I was initially learning R and the reason I got frustrated is we never talked about that step.
[00:08:58] Like we just assumed that you were always in the right working directory, which is never the case. When you get to your own analysis, like you have a project or a sub folder that you're working from and doing things only interactive in the R console meant that you really didn't have state of like your script.
[00:09:17] It was not that great, 'cause you're just executing commands and you don't really get to see the history or write comments. It's just the interactive portion. So using things like R Markdown, or even just R scripts, where you can write things out and then execute them, that was a big step for me.
[00:09:36] David: Yeah. I mean, I know when I first started learning, I had done some HTML and CSS, but not programming in a way similar to what I know now in R. I didn't understand that, okay, there was, like, a CSV file, say, that lived on my hard drive, but it wasn't actually in R as an object until I ran the code read_csv. That didn't make any sense. I was like, wait, there's external data, but it's not in R?
[00:10:07] And that whole process was very confusing. So I think it's interesting to think about how so many of the things that become second nature, especially as you become more experienced, are, or can be, incredibly confusing for someone starting out.
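The disk-versus-memory distinction discussed here can be made concrete with a short base R sketch. The file name, object name, and data are invented, and the file is written to a temporary directory only so the example is self-contained. It uses base `read.csv()`; the `read_csv()` that David mentions comes from the readr package and plays the same role.

```r
# Create a small CSV on disk (invented data, temporary location).
csv_path <- file.path(tempdir(), "mouse_weights.csv")
write.csv(
  data.frame(id = 1:3, weight = c(21.5, 19.8, 23.1)),
  csv_path,
  row.names = FALSE
)

getwd()                # the directory that relative file paths are resolved against
file.exists(csv_path)  # checks whether the file is on disk

# The data only becomes an R object once it is read into memory.
mouse_weights <- read.csv(csv_path)
str(mouse_weights)
```

Using a full path, as above, sidesteps the working-directory question; a relative path such as `"data/mouse_weights.csv"` would only work if `getwd()` pointed at the right project folder.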
[00:10:23] Tom: Yeah. I mean, just last week, I saw a Twitter thread from someone trying to learn Python for data analysis. And they literally walked through that process live, being like, okay, cool. I'm gonna learn Python. I want to, you know, import a data file. It's like, okay, how do I do that? Google "import data into this."
[00:10:42] And it's like, load pandas. It's like, okay, what's pandas? And you go down this rabbit hole of now having to build up mental models on mental models. But again, R for DS, the book, walks through how to do all of that stuff. And that was really helpful for me learning, having it typed out like that, step by step.
[00:11:02] David: Yeah, that makes sense. So you talked about finding value moving to R because it was, it could do kind of more general purpose programming stuff.
[00:11:14] So I'm curious, not wanting to start an R versus Python flame war, but just, what value do you see? Like, why do that in R? What do you get out of R? Maybe not in contrast to Python, but what makes you love R so much?
[00:11:30] Tom: Yeah. I would say the startup cost. You know, in 2017, when I was like, hey, I'm gonna learn data science, I installed R and I installed Python. And the introductory Python tutorials I found were general software, right? Like, make this do this thing, open a file, transfer it over there,
[00:11:51] rename, et cetera. And I was like, I need to learn data analysis, 'cause if I'm gonna learn it during PhD school, I have to learn concurrently while I'm doing my actual work. I had to use the software for my work. I think a lot of folks actually use R as a scaffold to learn Python. They learn R because the startup cost is lower.
[00:12:11] 'Cause you can just install R, install RStudio, install a package, and you don't have to immediately think about what's a virtual environment, or what's a project-specific library, or how many versions of R versus how many versions of Python are on my laptop. Both are super powerful. The customers I work with are very multilingual.
[00:12:31] There are some teams that are full Python, some teams that are full R, or a mix of both. Both are really good at data analysis and there's overlap. I think the pandas library for data analysis in Python is actually really similar to a lot of base R, with a mix of some of the tidyverse principles.
[00:12:48] So there's, again, cross learning. If you can build up a mental model of programming in R or in Python, your switching cost from R to Python or Python to R is much lower than trying to learn them both at the same time. So learn one, learn it well, use it to be powerful, be happy. And if you want to learn the other one, then you can probably learn it a lot faster
[00:13:11] the second time around.
[00:13:12] David: Yeah, that makes sense. You know, it's interesting. On my website, which is a WordPress website, I occasionally have to do some PHP.
[00:13:20] Tom: Yeah.
[00:13:21] David: I, I don't know PHP, but you know, a couple years ago, or few years ago, I was like, I, I don't even know what this is like, this doesn't make any sense to me.
[00:13:29] Whereas now, having done some R, I still don't know PHP, but I can look at it and kind of understand what's going on. So I think that makes a lot of sense, you know, using one language to scaffold your learning of another. So now, having gotten through the struggles of learning, you run TidyTuesday.
[00:13:48] TidyTuesday, for anybody who doesn't know, is a kind of social data project. You release a new data set every week. And then people take the data and do a bit of analysis. I shouldn't say a bit, they do significant analysis. And then they mostly produce data visualizations, sometimes tables, different things.
[00:14:08] And then you'll see those posted under the TidyTuesday hashtag on Twitter. So I'm curious, first of all, how your background informed you starting TidyTuesday. And if you had had TidyTuesday when you were learning, in what ways would that have changed your journey learning R?
[00:14:24] Tom: Yeah. I mean, if I'm being honest, TidyTuesday was part of my journey in learning R. Like, when I was, you know, 2017, 2018, I'm going through R for DS, I'm doing my own statistical analysis and general programming for my PhD. But at a certain point, I was like, hey, I have to talk to people in interviews and interview as a data science, you know, person as opposed to a neurobiology person, like, if I'm stepping out of that persona.
[00:14:50] And if I talk to them about behavioral analysis in mice, it may not land very well. And I'm applying to companies that, you know, sold pizza or did marketing. Like, again, they don't care about mice. They want to know about their specific domain or applying the principles. So for me, I was like, I need to build a portfolio.
[00:15:11] How can I build a portfolio? I can get random data sets from the internet, analyze them, graph them, do machine learning. And that'll be my portfolio. The third step of that, you know, I've got the idea. I found some data. I was like, oh my gosh, this is an absolute mess to get these data sets out. Or you find tutorials and they're like, oh yeah, just load the data.
[00:15:31] And I'm like, where's the data? You don't have it listed anywhere. You just have code. I don't know where your data is. So for me, it was very much around: grab a data set, get some basic summary of that, like, here's a description of the columns and rows, and an article that you could kind of build off of.
[00:15:49] Like I'm building a new mental model every single week of like, what is this data? I don't know anything about solar power, but let's learn about solar power this week, or let's learn about best sellers and popularity of books over time. So my initial go at, it was like, I was learning for myself. And like, if, bring some people along with me, similar to what Jesse was doing with the R for DS, like she started as wanting to learn R in a book club with a group, you know, I just wanted to create a scaffold for myself that others could benefit from as well.
[00:16:20] So four years later, you know, it's less about me learning the core and more about trying to give back and create scaffolding behind me, and not just tear it down once I'm done, in terms of not giving that scaffolding away.
[00:16:36] David: Yeah, that makes a lot of sense. And for anybody who hasn't looked, you know, being able to go on Twitter... Do people post stuff other places? I mean, I just see...
[00:16:46] Tom: There's some folks who do LinkedIn and, and ultimately I say, it's like, even if you never post it, you can just work locally. Like, I think there's like 2000 forks of the repo. So it means like, I know people are. Using it, even if they're not posting and that's, the purpose is just like, use it. If you want to get feedback, then you can ask for feedback or post it and be like, look what I've done.
[00:17:10] You know, I'm learning with you or I'd like feedback or, or whatever.
[00:17:14] David: Yeah. And I mean, for, for those who do post publicly for other people, then being able to go and see, you know, here's what they produced this week. Here's a link to their code that for people learning is so helpful. I know, you know, students who I work with, I always point them to tidy Tuesday as a way to just see other people's code.
[00:17:35] So, alright. Let's switch to talking about tables which is the, the main topic for today. So, I guess, first question, why do you care so much about tables?
[00:17:45] Tom: Yeah. I mean, for me, I kind of joke, in retrospect, it was like a form of control and pandemic self-care. I was very much cooped up in my own house. Like, we couldn't go anywhere. Texas was not doing a good job of containing the virus or suggesting good practices at the time. So my wife and I stayed in. You know, we do puzzles and we do other things.
[00:18:08] And I had more time on my hands because I wasn't doing as much elsewhere. So, you know, the gt package, or grammar of tables package, was really ramping up at that point. So it was like, I'll learn about gt and dive into it. And that first jump was like, oh, this is actually very similar to ggplot or other things in the tidyverse, where the syntax makes sense.
[00:18:33] I can be very powerful very quickly, but you still have to learn in terms of, I, there wasn't that much learning material at the point that I saw. So I was like, well, you know, I can probably do something, kind of share some details about the package and learn a lot along the way. Ultimately, I was just like, I've kind of gone way far in my spectrum of like, I should always use data viz to, should I always use tables to somewhere in the middle now.
[00:18:59] And ultimately it's like with, with tables, I think a lot of people will see them out in the wild. And they're just miserable. Like in terms of you've got an entire page, it's all black and white. There's no real delineation of lines and it's just a mess. It's like, well, if you're gonna gimme that, I'd rather just have the CSV file or, or whatever.
[00:19:17] If you're just giving me 10 pages of PDF of a table, I, I want the raw data. I'm not gonna be able to be very powerful with this, but using tables as a summary statistic or an extension of data, visualization is exceptionally powerful. You think of like a bar chart, right? A horizontal bar chart is just a table with one column.
[00:19:39] David: Right.
[00:19:39] Tom: It's all it is. It's like you, you have a number, your X axis is the labels. Or, sorry, your Y axis is labels and your X axis is how big are the bars with a table? You could take that bar chart, throw it in there as two columns. Like what is the grouping and what is the bar? And then you have five or six other columns that describe that data or give context.
[00:20:02] That would be really, really hard to graph together. It's really hard to encode more than three or four variables at a time versus a table. You're having literal representations of both the visualization and the data itself in small sections. Again like summary statistics or, you know, 10 to 20 rows is fine.
[00:20:23] If you, if you organize.
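Tom's point about a table carrying a grouping column plus several context columns can be sketched with gt. This is only an illustration under stated assumptions: the data are invented, the gt package is assumed to be installed, and the sketch formats the numbers without drawing the bars themselves.

```r
library(gt)

# Invented summary data: one row per group, plus context columns.
sales <- data.frame(
  region = c("North", "South", "East", "West"),
  revenue = c(120, 95, 143, 88),
  orders = c(310L, 240L, 365L, 205L),
  growth = c(0.04, -0.02, 0.09, 0.01)
)

sales_table <- sales |>
  gt(rowname_col = "region") |>
  tab_header(
    title = "Revenue by region",
    subtitle = "Invented figures for illustration"
  ) |>
  fmt_currency(columns = revenue, decimals = 0) |>
  fmt_percent(columns = growth, decimals = 0) |>
  cols_label(revenue = "Revenue", orders = "Orders", growth = "Growth")

sales_table
```

A bar chart of this data could show revenue by region; the table shows the same grouping alongside orders and growth, which would be harder to encode together in a single plot.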
[00:20:26] David: Yeah, I think, I think what I like about your treatment of tables is that you kind of, put them on the same level as data viz and it, cuz I think you're right. A lot of people just assume, okay, I'm gonna spend a bunch of time making some nice graphs and then I'll just dump the, the, the data, the rest, you know, the rest of the data, you know, all of the data in a table.
[00:20:46] And I think what I liked about the way you just discussed the two is that tables, in a way, are a form of data viz. I mean, they're a way that you visually present data. And so taking the same level of care in thinking about how you present that can yield really great results, as I think some of the tutorials you've written show. What makes R well suited to making tables?
[00:21:12] Tom: Yeah, I think first and foremost, R is really good for data analysis, whether that's statistics, machine learning, or just general exploratory data analysis, amongst other things. So you're already doing a lot of your work there. Also, there are really rich publishing tools. If you think of things like R Markdown, you can take your analysis and publish it into a book or a website or a journal article pretty quickly.
[00:21:38] So there's a benefit to being able to even just have tables in your toolbox. Like, it's just good to have. As far as why I think R is really well suited, it's just like, we have people who have strong and useful opinions about data visualization. Similar people have extended that into tables and are actually, like you were saying, and I agree with, starting to care about those tables.
[00:22:01] It's going from "oh, this is the most basic LaTeX table I can create" to one that's visually appealing, that is useful as a storytelling tool or as a visualization tool. And it can kind of extend what you do with, like, a scatter plot. You can have an accompanying summary table that is telling you additional details, more than what, you know, a different graphic could tell you.
[00:22:25] David: Yeah, that makes a lot of sense. So let's talk about the package landscape within R. You talked about gt, and we'll, in a minute, walk through some code that uses gt. But I'm curious if there are other packages that you find useful and that you go to when you're making different types of tables.
[00:22:42] Tom: Yeah, absolutely. It's very much similar to what you do in data visualization. You kind of have to pick a fork in the path for what you're intending to do. Do you want a static graphic, like a ggplot that, you know, is outputting a PNG, a literal static image? Or do you want an interactive graphic that you're almost treating as a small application?
[00:23:05] Is it a story or is an application trying to tell a story? And that's two different kind of use cases and it's similar in tables. You have static tables that are really good for doing summaries or for small use cases. And then you have interactive tables that wrap things like JavaScript. That are actually small scale applications in the form of a table.
[00:23:30] So for static tables, I really like gt. I think it's very expressive. I was able to kind of hack against it and extend it a lot. And it has, you know, some dependencies that I'm happy taking on and can work with. But there are other packages, like kableExtra, that are really, really good at doing LaTeX.
[00:23:50] So if you are part of a group trying to publish to PDF and you wanna do static tables, kableExtra is really strong. Or for groups that are like, I wanna create static tables, but I have to get them into Word or Excel or something else... I'm trying to think, the formattable table...
[00:24:09] David: Flex...
[00:24:10] Tom: flextable, that's the one.
[00:24:11] Yeah, flextable's really good at that. I haven't used it as much, but I like what they do with it. They've got good documentation. It's just that almost all the work I do today is HTML based. So I like gt really well, 'cause I can output to an actual image or I can embed the HTML. So as far as static packages, those are the ones I like. When you make the decision that you want to show more than 30 lines of data, or you really want to make something where people can interact with it or explore, shifting over to interactive tables with reactable or the DT package is really nice.
[00:24:52] You can have thousands of lines of data in there. And of course, you know, you could, you know, just package up the, the file and let people explore it in R or Python or whatever else. But by putting it into an interactive table you can actually get the user without having to do anything else besides interact with it, to answer some of their own questions, they can filter it or sort the data.
[00:25:15] Or there are even group-by summaries that can be done within the table, which is really nice. Yeah, I think that covers most of the landscape, but I'll just say that, just like there's more than one way to do a plot and there are extension packages beyond ggplot in R, there are, I think, like a dozen table-making packages in R. So maybe you really like gt, and I really like gt, but maybe you have a specific use case, and something like gtsummary, for doing actual statistical summaries in a table, is better for you.
[00:25:47] So yeah, it's using gt in the back end, but it's a different interface or a different API for you to work with.
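The interactive features Tom mentions (sorting, filtering, and grouped summaries inside the table) can be sketched with reactable. This is a minimal example under stated assumptions: the reactable package is installed, and the built-in `mtcars` data set stands in for real data. The column choices are arbitrary.

```r
library(reactable)

# An interactive table: rows grouped by cylinder count, with a
# per-group mean shown for miles per gallon.
cars_table <- reactable(
  mtcars[, c("cyl", "mpg", "hp", "wt")],
  groupBy = "cyl",
  filterable = TRUE,
  sortable = TRUE,
  defaultPageSize = 10,
  columns = list(
    mpg = colDef(aggregate = "mean", format = colFormat(digits = 1))
  )
)

cars_table
```

The result is an HTML widget, so it renders in an R Markdown HTML document or the RStudio viewer, and readers can sort and filter without running any code themselves.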
[00:25:54] David: Yeah, that makes sense. And thinking about what your output format is is probably the most important thing, 'cause I know we have had to use, well, not had to, I really like the flextable package a lot, and we use that mostly because a lot of the clients that we work with, at least at times, need to be able to export to Word when they work in R Markdown.
[00:26:16] I know there's some work to make gt able to do that, but at this point flextable is pretty much your only option. So, yeah, like you said, there are a ton of great options.
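The Word export David describes can be sketched with flextable. This is a minimal example under stated assumptions: the flextable package (and its officer dependency) is installed, and the data and column labels are invented. It writes a standalone .docx file rather than knitting from R Markdown.

```r
library(flextable)

# Invented summary data for illustration.
survey <- data.frame(
  program = c("A", "B", "C"),
  respondents = c(42L, 37L, 51L),
  satisfied = c(0.81, 0.74, 0.88)
)

survey_table <- flextable(survey) |>
  set_header_labels(
    program = "Program",
    respondents = "Respondents",
    satisfied = "Satisfied"
  ) |>
  autofit()

# Write the table to a Word document in a temporary location.
docx_path <- tempfile(fileext = ".docx")
save_as_docx(survey_table, path = docx_path)
```

In an R Markdown document with Word output, printing `survey_table` in a chunk is typically enough; the explicit `save_as_docx()` call is only for producing a file directly.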
[00:26:27] Well, thanks Tom so much for chatting today. I really appreciate it. And it's great to learn about all the different things that you can do with tables.
[00:26:35] So thanks again.
[00:26:36] Tom: yeah, absolutely. Thanks for having me.
[00:26:39] Thanks again for listening. I hope you found this conversation interesting. If you have any feedback, I'd love to hear it: david@rfortherestofus.com. Thanks.