
R for the Rest of Us Podcast Episode 19: Crystal Lewis

In this episode, I chat with Crystal Lewis about data management and her book titled ‘Data Management in Large-Scale Education Research’. Crystal, a freelance research data management consultant, shares insights on good planning and systematic implementation of practices that are key to effective data management. She discusses the importance of automated data validation, and outlines a structured approach to data cleaning. Additionally, Crystal reflects on her experience writing an open-source book with Bookdown and navigating the publishing process.

Listen to the Audio Version

Watch the Video Version

You can also watch the conversation on YouTube.

Learn more about Crystal Lewis by visiting her website and connect with her on X (@Cghlewis), LinkedIn, GitHub, and Fosstodon.

Learn More

If you want to receive emails to help you on your R Journey, sign up for the R for the Rest of Us newsletter.

If you're ready to learn R, check out our courses.

Transcript

[00:00:00] David Keyes: I am delighted to be joined today by Crystal Lewis. Crystal is a freelance research data management consultant who helps other people to understand how to organize, document, and share their data. She's been wrangling data in the field of education for over 10 years and is also a co-organizer for R-Ladies St. Louis. Crystal, thanks for joining me today.

[00:00:50] Crystal Lewis: Thanks for having me, David. Good to see you.

[00:00:53] David Keyes: Yeah, good to see you too. And I should also say you have done work for R for the Rest of Us, which has been great because we love having someone who enjoys data management and data cleaning as much as you do.

[00:01:05] Crystal Lewis: Absolutely.

[00:01:06] David Keyes: Maybe you could start out by just telling us, like, what does a freelance research data management consultant do?

What does that mean?

[00:01:15] Crystal Lewis: That's a great question. And it's a great question because I don't know that there's a lot of other people that have that specific title, at least no one that I know. Um, but essentially, the work I do falls into three kinds of buckets. So I work with mostly, um, university faculty who have these kinds of large-scale, federally funded grants in the world of education, and they're looking for help.

So one is I do help with data wrangling. So maybe they've collected data over a couple of years and they haven't really done much with it, and so it's been sitting there and it's very messy, and they need to analyze or share it, and so they need someone to kind of come in and quickly help them clean up and document that data. So that's one bucket of what I do. Um, and I also do consulting work. So that's more, "Hey, I got a new grant. I need to plan for how I'm going to manage data throughout the lifecycle of this project. I need someone to help me think through that process.

Can we set up some meetings and do consultations around that?" Um, and then my third bucket is training. So maybe you have a lab and you've got 15 people in that lab and you're all kind of doing things different ways and you need someone to come in and do a training around how can we kind of standardize our practices.

Um, and so that's the other bucket that I do. Um, and so I've been doing this for two years and I've gotten to meet lots of different, um, cool researchers at different universities across the U.S. And, um, I've really enjoyed the work, so

[00:02:46] David Keyes: Cool. And when you talk to people who maybe are less familiar with your work, how do you kind of succinctly describe what data management is? Because, you know, as you just said, like there can be kind of different things that you do. So what's the kind of definition that you use to describe data management?

[00:03:05] Crystal Lewis: Yeah, sure. Um, so I would say that data management is any process associated with, um, the collection, the organization, the storage, and the preservation of data. And so that's a lot, right? So I think a lot of times when I talk to people, their first reaction to the word data management is something that happens after collection.

Data cleaning, right? I think that's what most people I talk to think. They're like, oh, data management, so you mean how you clean data. And I'm like, well, actually, data management begins long before you ever collect data, in the planning process. Sometimes you're given the data set and you don't have that luxury of planning, but if you're collecting your own original data, you want to start that planning process long before your project ever starts. And so you plan things like, what do I want the data set to actually look like in the end? What kind of quality assurance, quality control procedures can I implement during data collection to get the kind of data that I want?

Um, for instance, one of my favorite tools to create during a planning process is a data dictionary. I talk a lot about data dictionaries. Anybody that's heard me talk about data management hears me talk about data dictionaries. Um, and that tool is an excellent planning tool because you can lay out exactly what you want your final data set to look like.

What are the variables going to be? How am I going to name them? What variable types are they going to be? You know, numeric, character, date, what are the allowable values of those? And then you can use that throughout your project to kind of plan for where you're going. You can build your tools to align with what you expect, you know, and you can use that in your data cleaning and your data validation process as well. And so I try to get people to think more about, um, data management as a process. Not just something that just happens at the end of a project.
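As an illustration of the kind of data dictionary Crystal describes, here is a minimal sketch in R; the variable names, types, and constraints are invented for the example:

```r
library(tibble)

# Hypothetical data dictionary: one row per planned variable,
# laying out what the final data set should look like
data_dictionary <- tribble(
  ~variable,    ~type,       ~allowed_values,            ~description,
  "student_id", "character", "unique, non-missing",      "Study-assigned student ID",
  "grade",      "numeric",   "1 to 5",                   "Student grade level",
  "svy_date",   "date",      "2023-09-01 to 2024-06-30", "Survey completion date",
  "read_score", "numeric",   "0 to 100",                 "Reading assessment score"
)

data_dictionary
```

A table like this can then drive both the build of the collection tools and the validation checks at cleaning time.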

[00:04:49] David Keyes: Yeah, that makes a ton of sense. I mean, I'm, you know, thinking about projects that we have worked on, or possibly you have worked on, where people are always like, well, can we clean this data in R? And the answer is always yes, of course you can. But in many cases, the more thought you can put into things up front, the less kind of coding, you know, reshaping, whatever you have to do on the back end.

So it seems like what you're saying is a lot of the data management work that you do is both on the back end in terms of that data cleaning, but also on the front end, helping people to avoid getting into the situations where they have to have, you know, a thousand lines of R code to,

[00:05:29] Crystal Lewis: Yes.

[00:05:29] David Keyes: do that data cleaning.

Is that, is that accurate?

[00:05:31] Crystal Lewis: It's very accurate because not only does it create less work, but it's possible that depending on how you collect the data, if you didn't plan to collect your data, you might have data that you can't even use. Right? Like maybe you can't even interpret what somebody inputted into your tool. Maybe it was like an open ended text value and you're like, I have no idea what this means.

So you might even lose data, not only just having more work. So it's definitely worth putting that work in.

[00:05:57] David Keyes: Yeah. I'm curious, you know, when you first start talking to a new client, are there any kind of main points that you highlight? I know you talked about data dictionaries. Are there other kind of main points that you highlight with them when thinking about how to manage their data?

[00:06:17] Crystal Lewis: Um, yeah, so a lot of what I talk about is this planning process. It's like putting structures in place so that your whole project is more organized. So we talked about the data dictionary. Some of it is just, like, basic file organization, right? Like making sure that you develop a project structure that people know how to use and find things in, and file naming,

Um, so that you're using the correct version of files. Um, and some of it's more, I would almost even call it project management, but that ultimately affects data management. So things like assigning roles and responsibilities, so that everybody on your team knows exactly what they're doing and pieces of data management don't get dropped, because someone's like, oh, I thought, you know, someone over here was doing it, but no, actually, it's my responsibility. And so putting all those systems into place early on leads to better data, right? So, um, I try to talk to people about all these different processes that they can implement that might help them get better data, ultimately.

[00:07:18] David Keyes: Yeah, it's interesting how much of the, you know, getting good data ultimately is really about like the kind of non-data pieces that...

[00:07:26] Crystal Lewis: I feel like 50 percent of what I talk about is just project management stuff. But if you think about it, like every person that works with the data set impacts that data. So it's not always like the data person. So like a project coordinator is usually the person that's organizing the collection of data.

And then there's data collectors out in the field and the way they collect data impacts the data. And so all these people have a hand in the data and everyone impacts the quality of it.

[00:07:52] David Keyes: Yeah, that makes sense. So you recently published a book. It's called Data Management in Large Scale Education Research. Now, obviously, you know, what you're talking about can go beyond education research, but I'm curious what makes large scale education research so in need of good data management to the point that you would write a whole book about it?

[00:08:13] Crystal Lewis: Yeah, that's fair. Yeah, I mean, there's a lot of existing materials out there around research data management, but a lot of them are, um, you know, discipline agnostic or they're like STEM focused. And what makes education research so unique is it's, um, typically human participant data. So then you've got this whole, um, issue of privacy and confidentiality.

And then a lot of these education projects are what I would call large scale. So they're typically these federally funded multi year longitudinal projects. And you're collecting data on many different types of participants, so you've got like parents and teachers and students and schools, and so you have all these levels of information coming in.

And then not only is it a lot of data, but you're often collecting data in a lot of different ways. So we've got people collecting data on paper in the field, electronic surveys, people assessing classrooms and collecting data on tablets, you're getting data from school districts. It's all these different types of things you have to think about.

And not only think about, but track and make sure that you're not losing things and it's all quality data. And so it's just a lot more to consider than, I would say, in maybe a normal, um, science-y lab type of project. Yeah, yeah.

[00:09:27] David Keyes: Yeah, can you give an example of a project? You know, what does the project do? What does the data look like? That type of thing, just to help us kind of put it into context.

Crystal Lewis: It's essentially, most of the projects I work with are essentially kind of what I just mentioned. They'll be funded by a federal funder like, um, IES, which is the Institute of Education Sciences, or NIH, or NSF, or NIJ, those big federal funders. Um, and typically, not all the projects, but a lot of the projects are evaluating some sort of program in schools, the efficacy of a program. And so they are not only implementing some sort of, um, program in schools, that's like a whole different arm of the project, is they implement the program. But the other arm is then evaluating it. And so, to evaluate it, they're typically collecting, yeah, things like I mentioned, like surveys and observations. So people actually go in the classroom sometimes and will collect observation data on teachers or students or classrooms as a whole. And they will collect school district data. So they'll actually ask the district for individual-level data on students and assessments. So like different types of math or reading assessments will be collected.

[00:10:44] Crystal Lewis: And all that data is collected in various ways, right? And so there's so much to think about, because sometimes you're having to hand-enter data from paper, because if you're working with little students, a lot of times your data is going to be on paper. Kids can't go into Qualtrics and do these things, right?

So, um, so you've got the paper data. You've got to think about the quality of that. You've got surveys going out through REDCap or Qualtrics. You've got the district data coming in, which has its own, you know, array of messiness. Um, and so there's just a lot to think about in this

[00:11:17] David Keyes: So I'm imagining a project would be something like, um, you know, a new technique for teaching kids reading or something like that. And that's funded and then, you know, like you said, there's the implementation side where they, the teachers or whoever else is involved in actually doing it. But then there's the evaluation side, which is looking at like, does this new method of teaching reading actually lead to better outcomes, however those are defined?

Is that kind of a, a typical,

[00:11:45] Crystal Lewis: And a lot of the projects I work on, not all of them obviously, but a lot of them are randomized controlled trials. So a portion of the participants are getting the intervention, and a portion are not getting the intervention, or they're getting a different intervention. And then there's this whole kind of pre and post assessment of the intervention throughout the study.

And a lot of times there's cohorts. So another level of, like, complexity, right? So we've got longitudinal data, and we've got cohorts of new people coming in every year. It's so complex, and that's why data management is so important for these projects. Yeah, yeah.

[00:12:18] David Keyes: So I'm curious, when you talk about data management in this context, what does this look like? Is it typically a set of Excel files? What is it? I mean, I know it varies obviously by project, but I'm curious kind of what it concretely looks like.

[00:12:35] Crystal Lewis: Um, so yes, that varies greatly. So most of the tools that I've interacted with are REDCap, if you've ever used that (it's more common, I would say, in health fields, but education researchers use it too, because it's a fairly secure tool to collect data), and Qualtrics. Those are two big tools that people use, and then paper data.

Um, and so the paper data, people enter that in a variety of systems. So some people come back to the office and enter it in REDCap, they might enter it into Excel, they might enter it into a database like Access or FileMaker or something like that. Um, and then eventually what happens is everything gets exported, right?

And that varies. It could be a CSV, it could be an Excel file, it could be an SPSS file, and then that's all saved into some other file system, and the data wrangling happens from there. Um, and so your ultimate goal is to have, uh, linking variables across all those. So, you know, primary and foreign keys, so you can kind of link all that data as needed across students, across teachers,

Um, across schools and so forth.

[00:13:39] David Keyes: Got it. Yeah, I mean, I'm just thinking, this is not something I planned to ask you, but about the context of projects that we work on where, you know, it is a series of Excel files for the most part. Um, and one of the challenges there, especially, you know, you just talked about needing to link multiple sources of data.

Um, it can be complex because, I mean, just as a simple example, like, we're working on a project where we have, um, trainings that are done, and there's an identifier for the cohort for the training. And sometimes when it gets read in, it gets read in as numeric, and sometimes it gets read in

[00:14:23] Crystal Lewis: as, um,

[00:14:24] David Keyes: Character.

And so then sometimes when we try and join, it's complicated. Like, I mean, obviously, again, you can do it because we work in R, you can always do it, but it's complicated. Um, I guess in my mind, I'm always like, oh, if I could just get people to work in, like, a database, then it would be better.

Can you talk to me about that? Because I'm sure you've seen cases where people do use a database, you know, because I think, like, oh, well, then you can say this variable has to be this data type. Um, what are the pros and cons of working in that way versus working in the, you know, kind of collection of Excel files?

[00:15:08] Crystal Lewis: Yeah. So, if you use a tool like REDCap, you could essentially build this kind of database on the back end where things are linked all within REDCap. Although I don't know, I say that, but actually REDCap has its own limitations. But you could build everything into a system where you can collect everything in one system and then link it within the system.

Um, but like I mentioned, with all these different types of data that you're collecting, it's quite tricky to use one tool for everything when it comes down to it, because one tool just doesn't end up meeting the needs of every type of data you need to collect. Um, with that said, I tell people it's actually okay to use different tools, because even if it's not a database, you can build in validation. Now, Excel is different; I would not recommend entering data into Excel, because we know the limitations of Excel. But if you use Qualtrics, for instance, you can build validation into Qualtrics, right?

So you can say, I only want this variable to be numeric. I only want this variable to be in the range of one to 100, or something like that. So most tools besides Excel (and you can even build it in Excel with more complicated kinds of formulas) let you build that validation in. So I tell people, as long as you're building in that data validation and you're being consistent across your tools, building things consistently, then it's okay to have different tools and then export out to one file format where you're eventually going to link those together.
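On the R side, one way to guard against the kind of type mismatch David mentioned earlier, where a linking variable reads in as numeric in one file and character in another, is to pin down the key's type at import. A hypothetical sketch (the file and column names are invented for illustration):

```r
library(readr)
library(dplyr)

# Force the cohort identifier to one type in every exported file,
# so the join key always matches
trainings <- read_csv("trainings.csv",
                      col_types = cols(cohort_id = col_character()))
outcomes  <- read_csv("outcomes.csv",
                      col_types = cols(cohort_id = col_character()))

# Now the join can't fail on a numeric-vs-character key mismatch
joined <- left_join(trainings, outcomes, by = "cohort_id")
```

Declaring column types at import, rather than coercing later, also surfaces unexpected values (a letter in a supposedly numeric column, say) as parsing problems right away.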

[00:16:34] David Keyes: That makes sense. And I saw you recently posted, um, here, I'll quote from what you wrote: "Tools aren't the key to better data management. Good planning and consistent implementation of practices is." Can you talk more about what you mean there?

[00:16:47] Crystal Lewis: Yeah, I, I just, I hear this whole thing about tools a lot. People are like, we just need better tools. We just need better tools for data collection. We just need better tools that allow us to do exactly what we want to do. To me, we have tools and they may not all be perfect, but if we just do the pieces that I'm talking about, where we plan, we build in data validation, we do things in a more systematic way, then the tools, it doesn't really matter too much.

Yeah. Like, when I talk to people about data management, I'm talking to people who use SPSS, SAS, Stata; everybody uses a variety of things. And I'm not like, well, you know what, you don't use R, you're up the creek. Like, I love R, of course. But I want people to know that, like, whatever tool you're using, you can still do good data management.

Um, it's just, you need to do things in a more systematic, planful way.

[00:17:40] David Keyes: Yeah, that makes sense. So, let's actually talk, um, you just brought up data validation, and, you know, I think you were talking about it in the context of, I don't know, say you are using Excel, where, you know, you set, like, the values in this column are numeric and can only have a range from one to a hundred or whatever.

But you also, I think in your book, you talk about doing data validation kind of like on the back end, like after you kind of import your data into R. What does that look like?

[00:18:13] Crystal Lewis: Yeah. So, data validation, typically when I use data validation in the context of data management, typically I am talking about the data cleaning process. But then, yeah, I will sometimes also mention how you can build it into tools as well. And so when I talk about data validation in the context of a data cleaning process, what I'm trying to suggest to people is that they need to do what I've heard other people call one last, final data sanity check.

Right? So you've done your data cleaning, and maybe you've even been doing checks throughout your data cleaning process to make sure everything's going okay. But before you are, like, putting the seal of approval on that data set for use, you need to do this one final data validation check.

And by one, I mean it's multiple checks. Um, and I think that the one package that actually really, um, changed or transformed the way I think about data validation is the pointblank package. I don't know if you use that much in your work, and there's a lot of data validation packages out there, obviously, and they're all probably great, but for some reason, pointblank is the one that just really impacted me.

So before finding pointblank, for data validation or data checks I would typically just kind of run summary statistics, you know, and see if different variables fall outside of a range or, um, you know, if their variable type wasn't as expected. But it was all done with my eyes, right? So I'm just looking at it and scanning and

[00:19:39] David Keyes: Right.

[00:19:40] Crystal Lewis: The thing about humans is you miss things, right?

And so it wasn't until I really started building checks, like validation checks, into my data using the pointblank package (but again, you could use any package) that it became more systematic. And so then I started building these actual checks into all my cleaning, um, syntax files. And I would build that around, remember the data dictionary? I would build it around the data dictionary, right?

So every variable in my data dictionary has these checks that I need to confirm, and I would build all my validation around, you know, are these variables meeting their specific type? Are they within the right ranges? You know, are there duplicates in the variables? There shouldn't be duplicates and things like that.

And so those would be my final data validation checks before I allow people to use it for analysis or something like that.
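A minimal sketch of the kind of final validation pass Crystal describes, using the pointblank package she mentions; the table and column names here are hypothetical:

```r
library(pointblank)

# Hypothetical cleaned data set to validate
survey_data <- data.frame(
  student_id = c("S01", "S02", "S03"),
  read_score = c(88, 95, 72)
)

# Build checks that mirror the data dictionary: types, ranges, duplicates
agent <- create_agent(tbl = survey_data) |>
  col_is_numeric(columns = read_score) |>                           # expected type
  col_vals_between(columns = read_score, left = 0, right = 100) |>  # allowed range
  rows_distinct(columns = student_id) |>                            # no duplicate IDs
  interrogate()

agent  # printing the agent shows which validation steps passed or failed
```

Running `interrogate()` executes every declared step against the data, so the checks are systematic rather than done "with your eyes."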

[00:20:27] David Keyes: That makes sense. And I'm thinking about, um, a project that I've worked on. It's called Oregon by the Numbers. I've done it for, I don't know, five or six years; I import data every year that gets sent to me by the client. And I mean, just as an example of a simple check that, um, Thomas, one of the other consultants who I work with, helped me with: we use the assertr package.

[00:20:51] Crystal Lewis: Yeah. And I know there's lots of others. Yeah, I think it's very similar.

[00:20:56] David Keyes: Yeah, exactly. Functionally, I think it's really similar. I mean, just as an example, we need to make sure that when the data comes in, almost all the data, yeah, all the data has one column that's like the location, and it's almost always by county.

And so we run a check that's like, does this column only have, you know, the names of these 36 counties? If it doesn't, something's wrong. You know, usually what that means is it was misspelled, um, in the original data that was given to us. And that's the kind of thing where, like, I got worried after I did this for several years, because I was like, oh man, what if someone misspells this and then, like, something is off in all the visualizations? 'Cause I'm producing, like, hundreds of visualizations for them.

And so having that validation, like, I still worry, but I worry a little bit less, because I feel like there's at least that level of check. And I imagine it's a similar situation in terms of the data that you work with, just knowing there is that validation in

[00:21:59] Crystal Lewis: Yeah, and I do recommend, like I talked about that being the last thing that you do in the data cleaning process, but I do recommend, if you're collecting longitudinal data, it's possible that you're reusing a syntax every wave that you collect that same data. And so I do recommend also adding, just like you said, adding that to the top of the script.

So that when you read in the data, like, you expect it to have the same format every wave, but if something changed, a variable was added or something like that, it's good to check that before you start cleaning your data and it's not doing what you think it's doing, right? So,
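A hypothetical version of the county check David describes, written with assertr and placed at the top of a script as Crystal suggests; the column names are invented, and the county list is truncated for the example:

```r
library(assertr)

# Hypothetical incoming data, as it might arrive from the client
county_data <- data.frame(
  county = c("Baker", "Benton", "Clackamas"),
  value  = c(10, 20, 30)
)

expected_counties <- c("Baker", "Benton", "Clackamas")  # all 36 in a real project

# Run right after import: the pipeline stops with an error if the
# expected columns are missing or a county name is misspelled
county_data |>
  verify(has_all_names("county", "value")) |>
  assert(in_set(expected_counties), county)
```

If every check passes, assertr returns the data unchanged, so these lines can sit at the head of a cleaning pipeline without altering anything downstream.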

[00:22:31] David Keyes: Yeah, that makes sense. So I asked you about data validation, which is in some ways like jumping to the end of the cleaning process. Although, like you said, you can also do it right when you're reading in your data. But, um, you know, as you said before, data management is more than data cleaning, but data cleaning is an important part.

So when you think about data cleaning, what are the kind of most common steps that you think about? Because I think data cleaning can mean very different things to different people. So from your perspective, yeah, what are those most common steps?

[00:23:04] Crystal Lewis: Yeah. So in the book, I talk about sort of three phases of a data set. So the first phase is your raw data, and this is your untouched data that comes directly from a source. So maybe you're downloading it from somewhere or you're extracting it in another way. Then the second phase is your clean data set.

Um, and so even with some of these best practices in data management implemented throughout your project, um, often you still need to do some additional processing to get your data in a format that you feel comfortable using or sharing. Um, and so this clean data set is, um, minimally altered for any future purpose.

Um, so we're not going to get deep into what variables you need for a specific analysis or a certain, um, structure you need for a specific visualization. That happens later. The second phase is just kind of getting it into a general clean format, and we'll talk about that more in a minute. And then the third phase is when you create your data sets for a specific purpose.

So you are creating a data set for a specific analysis, and maybe you're creating a bunch of variables specific to that analysis. Or you need a data set for a specific visualization. And so you're restructuring the data for that specific visualization. Those are all created from your general clean, clean data set.

And so what I talk about in the book, and what I do in my own work, is clean data for general purposes of sharing with future users. And so in many ways, creating that general clean data set is a very personalized process. So the way I might clean a data set for an electronically collected teacher survey might look very different from the way I would clean a data set that I receive from a school district that has student record data in it.

With that said, all data sets, when they are cleaned for general purposes, still need to ultimately meet a set of data quality criteria. So in the book, I review seven data quality criteria that all data sets should meet. And so let me walk you through those real quick. So the first criterion that we're working towards is that the data set should be analyzable.

And what I mean by that is that it should make a rectangle. It should be machine-readable. So that means that the first row of your data should be your variable names, and the remaining cells should be made up of values. And also that you have a unique identifier that identifies unique cases in your data as well.

And also that your data meets a series of organizational rules that we would expect. The second indicator is that your data should be complete. So if you've collected the data, it should exist in your data set. You shouldn't be missing anybody; no one should have gotten dropped along the way. Um, and also you shouldn't have duplicate information, if that shouldn't exist in your data.

Um, and same with variables, not just cases, but also variables, so making sure that you didn't accidentally drop variables when you were downloading your raw data or something like that. Uh, the third indicator is interpretable, and what I mean by that is that your variable names should be both human and machine readable.

So the variable name should make sense to people who are using your data, and it should also not include things like special characters or other things that machines have a difficult time interpreting. The fourth indicator is valid, and by that I mean that your variables should conform to the constraints that you lay out in your data dictionary.

So, um, if your variables are supposed to be numeric, that they are actually numeric in your data set. Um, if your variables should fall within a certain range, that they actually fall within that range in the data set. Um, and things like that. The next indicator is accurate. Now, accuracy is a tough thing to judge.

You don't always know whether something someone reported is actually accurate or not, but you can check things within a data set or across data sets to see, um, if there's accuracy or not. So, consider something like: if a student is in 2nd grade in a dataset, then they should be associated with a 2nd grade teacher.

Um, and so you can check things like that, um, for consistency across variables. Um, the next indicator is consistent. By consistency, I mean two things. So one, values should be consistent within a variable. So maybe all dates should be in the same format. And I also mean consistency across data sets.

So if you're collecting the same form across several waves of data collection, then you should be collecting the variables in the same way. Um, and then the last indicator is de-identified. So if you promised confidentiality to your participants, then their information, their, um, direct and indirect identifiers, should be considered in the dataset.

There should be no names, um, no emails, no social security numbers, no things that directly identify participants. Those should be removed and replaced with, um, a unique random ID that you've assigned to them for your project. Um, and indirect identifiers are also considered: what combinations of variables might be able to re-identify someone, especially when you're publicly sharing data.

Um, so to that end, using that data quality indicator list, I go through a series of, uh, cleaning steps to create a data set that meets those criteria. And in the book, I give a checklist of 19 steps that you want to consider in a data cleaning process. Um, and again, not every data set will need to go through all 19 steps.

You know, we might have a fairly clean data set that only needs a couple of those steps to be wrangled into an acceptable format. Um, but others might need all 19 steps. It really just depends. And so, just to give you a glimpse of what that checklist looks like, you know, step number one is always to access your data, whether you're importing it into R or something like that.

Step number two is always to, um, review your data, kind of like what we talked about with the data validation question. Um, you want to make sure you know what you're working with and what kind of wrangling needs to be done. And then from there, I list out a series of other steps that don't necessarily have to go in a specific order.

It's more just kind of checking through to make sure you've considered each of these kinds of wrangling steps. So we've got like renaming variables, we've got recoding variables, restructuring data, um, de-identifying data, merging, all those kinds of things are considered in the checklist. And so what steps you actually use is going to depend on your specific data set, but ultimately we will end up with data sets that all meet the data quality criteria that I just spoke about.

And that's, um, what's really cool about kind of using the data quality indicator list as your guide while you're cleaning.
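A few of the checklist steps Crystal names — renaming variables, recoding values, fixing types — could look something like this in R. This is only an illustrative sketch with made-up variable names and a made-up missing-value code, not an excerpt from her checklist or code:

```r
library(dplyr)

# Hypothetical raw survey export with unhelpful names and a -99 missing code
svy <- tibble(q1 = c("1", "2", "-99"), GENDER = c(1, 2, 1))

cleaned <- svy |>
  rename(stress = q1, gender = GENDER) |>   # renaming variables
  mutate(
    stress = na_if(stress, "-99"),          # recoding a missing-value code to NA
    stress = as.integer(stress),            # fixing the variable type
    gender = factor(gender, levels = c(1, 2),
                    labels = c("male", "female"))  # labeling categorical codes
  )
```

Which of these steps a given data set needs depends on what the review step turns up, which is exactly why she frames it as a checklist rather than a fixed pipeline.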

[00:29:44] David Keyes: For those of us who work in R and work with the tidyverse, I think it's really easy to kind of conflate data cleaning and data tidying. But it sounds like what you're describing isn't that. To me, it doesn't sound like tidying at all. That sounds very much just kind of on the cleaning end.

Right.

[00:30:05] Crystal Lewis: You think of Hadley's, you know, exact format of how, like, a file should look, right? Um, but in education research in particular, data doesn't always fit into that tidy format, you know. Sometimes people need their data to be in this really wide format, where, like, variables repeat over waves of data.

Um, and so I'm not necessarily trying to tidy, um, data in a way that would make it easiest for an analysis or easiest for a graph. I'm trying to clean it into the format it needs to be in for its purpose. So even if I give someone a wide data set, it's cleaned up, and then they can restructure it as needed.
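The restructuring Crystal leaves to the data's users — going from a wide file with wave-repeated variables to a long, tidy layout — is a one-step job with tidyr. The file, column names, and wave naming scheme below are hypothetical:

```r
library(tidyr)

# Hypothetical wide file: the same score collected at two waves
wide <- tibble::tibble(
  study_id = c(101, 102),
  score_w1 = c(10, 12),
  score_w2 = c(14, 15)
)

# A recipient who prefers a long/tidy layout can restructure it themselves
long <- pivot_longer(
  wide,
  cols         = starts_with("score_"),
  names_to     = "wave",
  names_prefix = "score_w",
  values_to    = "score"
)
```

This is why a clean wide data set is still a perfectly good deliverable: the reshape is cheap, while the cleaning that precedes it is not.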

[00:30:46] David Keyes: That makes sense. So you talked before and you said, you know, we said tools are not the key to better data management. And I assume also to better data cleaning, but I know that you like R in particular. So I'm curious, um, you know, what, what is it about R that you think makes it uniquely good for data cleaning?

[00:31:07] Crystal Lewis: Yeah. Um, obviously because it's open source, I'm a freelance worker now. So buying proprietary software is not necessarily something I want to add to the budget. Um, so that's one, but also just sharing data with clients is, it's really helpful to have an open source format. So I don't have to be like, Oh, you don't have Stata.

Well, then you can't open my data set, or whatever it is. Um, and so it's nice to be able to share that with anyone and know that they will be able to open it. Um, but also I love the tidyverse. So I started learning R, and I learned base R, and I was like, no, it did not stick with me. Um, and so it wasn't until many years later, when the tidyverse started becoming more popular and I started learning that,

that I was like, this, this is intuitive. This makes sense to me. And not only that, but I was able to access all the functions I needed to do the type of data cleaning I'm talking about. I really didn't have to start building new functions to meet my needs. And then I guess the third thing is, if I need to do something more complex, which does happen with some really complicated data, then I have the ability to build more complex functions if I need to, which I don't know if I would have in some of the more proprietary tools.

Like, I don't know if SPSS allows you to do that, you know, write your own complex functions like that. So, all that, and the community, of course. Um, I've learned so much from everybody in the R community. I'd say, like, I don't know, 40 percent of what I've learned is kind of just, like, self-learning and, you know, going through different reading materials.

And then, like, 60 percent is just learning from what other people share, which is amazing.

[00:32:51] David Keyes: Yeah. Well, it's interesting. Cause I thought, I thought one of the things you would say would be because it's a code based language that, you know,

[00:33:00] Crystal Lewis: And it is, it's just, I forgot that.

[00:33:02] David Keyes: Oh, okay. Sorry.

[00:33:03] Crystal Lewis: Yeah, no, that's me not thinking about it. So, yes, 100 percent. I guess why that didn't come to my head is I've used SPSS, I've used Stata, and I've used SAS, and I coded in all of them. And so it wasn't necessarily unique; R didn't give me that uniqueness, because I did write syntax in all those programs.

But yes, that's huge, obviously. Like, no matter what program you use, we want your processes to be reproducible. And so, yes, being able to do that is huge.

[00:33:32] David Keyes: Yeah, and I guess that, I mean, that comparison makes sense if you're comparing it to something like Excel, like a point-and-click tool.

[00:33:38] Crystal Lewis: Yes, yes, don't clean in Excel. And yeah, I always tell people like some people are not ready to make the leap from Excel a tool. And what I tell people is if you're using Excel, you need to be so thorough in documenting exactly what your process is in order for people to be able to reproduce what you've done.

And even with that, you'll never be able to match the reproducibility of syntax.

[00:34:05] David Keyes: yeah, that makes sense. Um, so let's talk briefly about your book. You wrote your book in a way that it has, uh, you know, a free online version. It's open source. So can you talk about how you wrote the book and, and put it online?

[00:34:22] Crystal Lewis: Yeah, so I, um, I started with, I hope somebody at my desk was like shaking around, shake the desk. Um, I wrote the book in Bookdown, and I think I started with that because I wasn't, I wasn't 100 percent sure where I was going as far as, well, was I going to integrate any code at all, um, or anything like that.

Um, and so Bookdown was the tool most people I knew who had written open-source books had used. Um, and so I was familiar with it. I also knew a lot of people who I could reach out to for help if I needed to. A lot of people have GitHub repos with their Bookdown code. And so it was something I felt comfortable, um, trying to dig into. Uh, in hindsight, I probably didn't have to do that, because I ended up not really putting any code in the book.

Um, but I don't regret it, because it did create a really beautiful product to put on the internet. And so I think it worked out in the end. Um, and it's got, you know, cool features like searchability and things like that. So I'm glad I did it. It was a tricky process, and I'm very thankful for people like you and several other people who I was able to reach out to for assistance when I needed it.

[00:35:35] David Keyes: That's interesting to hear, too, that you, you know, you wrote a book with Bookdown that doesn't have code in it,

[00:35:42] Crystal Lewis: which is not common.

[00:35:43] David Keyes: of Yeah, but it speaks to how, you know, powerful a tool like Bookdown or, you know, same thing if you're writing a book with Quarto, like you don't need to have code. I mean, I've done, like, I have, like, an internal, like, Art for the Rest of Us handbook that's written as a quarto book.

It has zero code, but it works. It's pretty straightforward to make it work, and it sounds like that was kind of the benefit for you.

[00:36:08] Crystal Lewis: Yeah. And I, and it kept like, I had this whole workflow going for all my data cleaning where, you know, you're pushing the GitHub and you're, you're versioning things and it allowed me to have that kind of same workflow in my writing process as well, which was nice.

[00:36:22] David Keyes: Yeah, that makes sense. How did you work with your publisher, um, to go from, you know, the book down version to something that they would to be able to

[00:36:32] Crystal Lewis: Yeah, good question. So, don't get me wrong, Bookdown was great for exactly what I needed, but if I was a little more savvy with it, I might have had an easier time. So, uh, CRC provides you a style file, um, that kind of matches up to their style. And so typically what you would do is integrate that style file into your Bookdown project, and then you would print to a PDF. I had, um, several people try to help me with this, and I was never able to make it work. And so I ended up rendering to, um, a Word document, um, and having to kind of split out each chapter as its own unique document. So it took a little bit more time, but it worked great for me and my purposes. So it's not a big deal.

Um, but yeah, so if you're a little more savvy than I am with Bookdown, you can pretty easily integrate that style file in there and just, um, export it right out.

[00:37:28] David Keyes: That sounds amazing. I was thinking about my experience, um, because I unfortunately had to export to Word documents and do the editing process with my publisher. And then, because I really wanted it to live online, I had to take the finalized versions and put them back into what started as a Bookdown project and then became a Quarto project. And it was, it was a lot of work.

[00:37:54] Crystal Lewis: I think you and I had a similar experience there. And at the end of the day, I was happy to do it because I felt more comfortable about it than the PDF process that I couldn't get to work very well. But I'm pretty sure the people who got the PDF process to work, it was much quicker and smoother.

[00:38:12] David Keyes: Yeah, yeah. Um, any other kind of benefits or drawbacks that you think about when it comes to writing a book in that way?

[00:38:20] Crystal Lewis: Um, honestly, I don't even know what other tools I would want to use to publish. an open access book online. I'm just not even that familiar because everyone I know uses BookTown or Quarto, um, and, and mostly everyone I know that does open access books are doing some sort of code based book. And so I, I don't know a lot of people who aren't doing like something like I did and publishing open access.

And that's probably why I don't know a lot of other tools I would use.

[00:38:49] David Keyes: That makes sense. Um, well if people want to read your book and or find out more about you and the work that you do, where would be good places for them to go?

[00:39:01] Crystal Lewis: Yeah, so it's online. It's at, well, maybe we can link to it. It's basically a shortened version of data management in educationresearch. com. Um, and then I also have a website, um, if you want to learn more about my work. I also have a blog on there, um, where I share posts about various data management topics.

And then I share, um, slides from talks I've given and things like that. So you can kind of read various parts of my work on either of those websites.

[00:39:30] David Keyes: Great. And we'll link, um, both of those, uh, down below in the show notes so people can check

[00:39:36] Crystal Lewis: Yeah, yeah.

[00:39:38] David Keyes: Great. Um, well, Crystal, thank you very much for taking the time to chat with me today. I really appreciate it.

[00:39:44] Crystal Lewis: It was super fun to chat with you, David.

Sign up for the newsletter

Get blog posts like this delivered straight to your inbox.


David Keyes
By David Keyes
October 10, 2024
