Chapter 6

Putting the R in CER

Downloaded by UNIV OF FLORIDA on November 23, 2017 | http://pubs.acs.org Publication Date (Web): November 20, 2017 | doi: 10.1021/bk-2017-1260.ch006

How the Statistical Program R Transforms Research Capabilities

Jordan Harshman,*,1 Ellen Yezierski,2 and Sara Nielsen3

1Department of Chemistry, University of Nebraska-Lincoln, Lincoln, Nebraska 68588, United States
2Department of Chemistry and Biochemistry, Miami University, Oxford, Ohio 45056, United States
3Department of Chemistry, Hanover College, Hanover, Indiana 47243, United States
*E-mail: [email protected]

When researchers employ quantitative methods in their investigations, they can choose among many programs to conduct their analyses. In this chapter, we argue that the statistical programming language R demonstrates the greatest utility for these analyses. R supports one of the widest varieties of statistical techniques of any program and has the potential to transform how researchers analyze their data. Throughout the chapter, we discuss the significant benefits of using R to analyze data more efficiently and effectively by re-conceptualizing data visualizations, defining custom functions, writing programmatic loops, and enhancing reproducibility and documentation.

Introduction

As scientists in chemistry education research (CER), we are tasked with presenting evidence of the effects of pedagogical interventions, students’ knowledge, skills, affect, and other complex constructs that cannot be directly observed.

© 2017 American Chemical Society. Gupta; Computer-Aided Data Analysis in Chemical Education Research (CADACER): Advances and Avenues; ACS Symposium Series; American Chemical Society: Washington, DC, 2017.

When our research questions warrant quantitative measurements, we are often held to far different standards than the r = 0.9999 set by our analytical chemistry colleagues because of the nature of our widely varying subjects: human beings. In this research context, it becomes crucially important to identify and effectively report all possible evidence in our investigations. By extension, we would also posit that, for a number of reasons discussed throughout this chapter, the statistical programs used to analyze data play an important role in what evidence can be identified and brought to bear throughout the analytic process.

Throughout this chapter, we argue that one program, R, can serve the needs of researchers in CER better than existing alternatives. To make this argument, we first need to address a statement we hear commonly: “The choice between R, SAS, Stata, Excel, SPSS, Matlab, JMetric, and all other programs is really just about preference.” We liken this to the belief that someone’s choice of automobile is just about preference. To some extent this is true, as you need to feel comfortable operating the vehicle; some prefer manual versus automatic, and some like certain designs and colors. But foremost is the consideration of function. A small commuter car might be your favorite for getting around town, but it is not a good option for someone who needs to haul a couple of tons of gravel. Similarly, common programs such as SPSS and Excel are great for many quantitative needs, but they often provide limited options, techniques, and efficiency in comparison to R. Additionally, and perhaps more importantly, we contend that the choice of statistical program plays a role in a researcher’s selection of statistical techniques and visualizations. The choice of one may seem unrelated to the choice of the other, but there are plenty of examples of two things that theoretically should not affect each other, yet do.
For example, consider three decades’ worth of evidence suggesting that use of a credit card versus cash increases the propensity to spend (1, 2), decouples reward (purchase) from pain (cost) (3), and can affect the way that consumers perceive products (4). With statistical programs, it is possible that defaults and frustrating procedures can likewise affect the methods used in research. Consider the 46-year-old problem (5) of confusing principal components analysis (PCA) with exploratory factor analysis (EFA). It is possible that this problem has been exacerbated by the fact that the default extraction method in SPSS for a “Factor Analysis” actually conducts a PCA. In a similar line of thought, researchers who exclusively used Excel to produce their visualizations may have been less likely to use boxplots – one of the most fundamental displays of data – because in prior versions of Excel it was very time consuming to manipulate a stacked bar chart to look like a boxplot (6). R is not excluded from these problems, as it comes with defaults and frustrations just like other programs. The point is to recognize that a researcher’s choice of program can impact what analyses are and are not conducted. While this should not be the case, many of the researchers who use applied statistics are not statisticians by training. Because of this limited training, they may be more likely to accept a program’s defaults simply because they do not know the consequences of each choice. It is worth stressing again that users can copy and paste R code from another resource without fully realizing the ramifications, but we contend that many of the features we discuss here help facilitate thoughtful consideration of analysis procedures.

It is in light of this recognition that we unveil the thesis for this chapter: Using R does not just allow researchers to perform techniques not available in other programs. Rather, we hope to convince the reader that R has the ability to transform how researchers view data analysis. But before we can present the argument, we will first summarize what R is, list its advantages and disadvantages, and then include four sections that describe how use of R can transform how researchers see visualizations of data via graphing packages, analyses via custom functions, analyses via programmatic loops, and documentation via interactive notebooks.


What is R?

While there exists a program you can download called “R,” R is technically a programming language, not a program. When you download R (https://www.r-project.org/), the program is simply a text editor with built-in menus and the ability to display graphs produced by the R language. Many users currently run R in a front-end program called RStudio® (https://www.rstudio.com/home/), which contains a number of features that make writing code and managing results more efficient while opening up many additional features not available in R alone. R is open-source software, meaning that the source code is public and that anyone can build, change, or remove features and functions in their own local copies. The base R language is copyrighted and governed by a non-profit organization. Over the years, researchers and programmers have contributed 9,886 packages (as of January 11, 2017), all of which contain additional functions and capabilities. These are also open-source contributions, available on mirrors of the Comprehensive R Archive Network (CRAN).

Advantages and Disadvantages of Programming in R

Broadly speaking, many will be quick to point out the biggest advantage of R: because it is open-source, it is free to use and modify anywhere in the world. While certainly a huge advantage, free does not necessarily mean good. It would not be surprising if researchers favored a paid statistical program over a free one if the paid program had enhanced features and capabilities. In this case, however, it is rare to find a statistical technique or visualization that another program can perform and R cannot. Basic statistical functions, data manipulation, and graphics are part of base R, and more advanced techniques are found in the nearly 10,000 additional packages (think of these as add-ons). Additionally, users can write their own functions in R: if no function exists to carry out something the user wants to do, they can write one themselves. Writing your own function sounds daunting, but in a later section we will demonstrate that it is not nearly as difficult as it sounds to customize R to produce exactly the output a researcher wants to examine. This is made considerably easier by the propensity of coders to share their code so that individual users do not have to reinvent the wheel, a core philosophy of the open-source movement. Lastly, a relatively recent overhaul of the graphics system has made R a top contender in the production of high-quality, completely customizable graphics, which gives researchers a huge advantage in telling their stories with data effectively.

For its use in CER, there are two main disadvantages to R. The first is the generally steep learning curve. It is not “like learning a new language”; it is learning a new language, which takes time and practice. For a user with previous coding experience, the process will likely be shorter than for a beginning coder. To overcome this barrier, we recommend that new R users learn one function at a time and gradually combine those functions for enhanced capabilities. We believe, however, that R is the last statistical program a researcher will have to learn because of its very wide array of capabilities. The other main disadvantage is in the eye of the beholder: if anyone can develop and publish new functions and features, how can researchers trust that functions do as they advertise and that results are accurate? First, while anyone can build a package, releasing one on CRAN requires meeting many standards of design and documentation. Second, because R is open-source, everyone has access to what these functions do, down to the source code that defines them. Therefore, with proper expertise, anyone can read the code, find out exactly what a function does, and compare it to what the function’s authors claim it does. This is something researchers usually cannot do with commercial programs, whose source code is proprietary and generally not released to the public. Lastly, many R packages are written by statistical experts in academia and accompany publications that go through peer review. There are also strong incentives to produce accurate packages, because mistakes could damage the authors’ reputations.
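This transparency is easy to see in practice. Because R functions are ordinary objects, typing a function’s name without parentheses prints its definition; a minimal sketch using base R’s sd:

```r
# Printing a function object reveals its source code.
# Base R's sd() is itself written in R:
print(sd)

# The printed definition shows that sd() is the square root of var(),
# so a skeptical user can verify the formula directly:
x <- c(2, 4, 6, 8)
stopifnot(all.equal(sd(x), sqrt(var(x))))
```

The same inspection works for most package functions, which is exactly the audit that closed-source programs make impossible.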

Data and R Code Presented

Throughout this chapter, we will primarily reference hypothetical data. In the spirit of the open-source philosophy, we have included all of the code used to produce the various figures and analyses discussed in this chapter. Readers are encouraged to download R and RStudio® to conduct these analyses themselves, because there is simply not enough space in this chapter to display all of the code. We are therefore encouraging an interactive reading experience that will give the reader a “feel” for working in R. Whenever the bolded and italicized phrase TryThis(#) appears in the text, there will be a section in the supplemental R files containing the code relating to that section. Supplemental files can be accessed at http://bit.ly/2jGmIfy and include the following files (unzip the folder prior to opening):

1. Benefits of R Supplemental.R – contains all TryThis examples
2. kmeans example.Rmd – produces an interactive notebook when run in RStudio®
3. JACS.csv – data file containing the most common word pairs of JACS titles


Transforming Data Visualization

Data visualizations are too often seen as mere substitutes for long tables rather than as “instruments for reasoning about quantitative information” (7). Visualizations can reveal (and conceal) information, which greatly affects the evidence authors present, for better or for worse. A frequently used example is the Anscombe quartet (8), TryThis(#1). In these four sets of data (11 observations each), every set of x values has μ = 9.0 and σ = 3.3, and every set of y values has μ = 7.5 and σ = 2.0. Simply reporting these means and standard deviations, however, fails to reveal the clear patterns shown in Figure 1. As this exemplifies, reporting only means and standard deviations runs the risk of concealing additional evidence that may support or refute a researcher’s conclusions. We also encourage readers to look into world-renowned data visualization expert Edward Tufte’s famous Challenger example of how tragedy may have been avoided if more effective displays of information had been available to key decision makers (9).
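Conveniently, Anscombe’s quartet ships with base R as the data frame anscombe, so the matching summary statistics can be checked in a few lines (a minimal sketch; TryThis(#1) contains the full example):

```r
# Base R includes Anscombe's quartet as the data frame `anscombe`
# (columns x1..x4 and y1..y4, 11 observations each).
round(sapply(anscombe[, 1:4], mean), 2)  # all four x means are 9.0
round(sapply(anscombe[, 1:4], sd), 2)    # all four x SDs are 3.32
round(sapply(anscombe[, 5:8], mean), 2)  # all four y means are ~7.5
round(sapply(anscombe[, 5:8], sd), 2)    # all four y SDs are ~2.03

# Yet plotting each pair (e.g., plot(anscombe$x1, anscombe$y1))
# reveals four very different patterns, as in Figure 1.
```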

Figure 1. Anscombe’s quartet.

So how does R incorporate graphical design principles and transform how researchers view data visualizations compared to other programs? Many programs give the user the option to create one of several types of graphs, such as line graphs, bar charts, scatter plots, etc. Perhaps in earlier education you learned about these same types of graphs. However, the goal of data visualization is not to force data into a limited number of graph types; it is to present the evidence that tells the story represented by the data. For example, imagine 100 students taking a pretest and a posttest. We have created such a data set in TryThis(#2). To measure change from pre to post, researchers commonly report a change in means, portrayed in text, in a table, or in a graph such as the so-called “dynamite plot” shown in Figure 2A. Here, we can see that students increased their scores from the pretest (μ = 75.30%, σ = 17.85%) to the posttest (μ = 77.55%, σ = 17.80%). The story told by Figure 2A is one of no difference or, if there is a difference, a small one. Dynamite plots such as this have been heavily criticized as a poor representation of data (10). Now consider Figure 2B. It is difficult to label it as a particular “type” of graph, and if it is a type, most programs have no built-in function to produce it. This illustration tells a story that is much more descriptive of the data themselves (not just summaries of them). Two boxplots, one for pre and one for post scores, show aggregated results on top of the individual 100 students’ scores. Each line connects an individual student’s pre score to that same student’s post score. This graphic shows that (1) all but 8 students either gained or remained stagnant from pre to post, (2) these 8 students seem to be outliers, declining in performance by 30-40 points, and (3) there may be a significant ceiling effect for this test. None of these observations is apparent in the dynamite plot on the left, but all are revealed in the plot on the right.

Figure 2. (A) Dynamite plot of pre/post outcomes versus (B) lines representing individual student scores from pre to post.

While much, much more could be said about the advantages and disadvantages of certain visualizations over others, we want to discuss how R’s graphics system can actually encourage researchers to see graphics as a means of displaying data and telling stories. To accomplish this, we need to investigate the syntax used in plotting. Figures 2A and 2B were produced with a package called ggplot2 (11), which is now widely used in the R community. The “gg” stands for “grammar of graphics,” indicating that it, like R, is not a series of options and functions but a language. Code 1 shows the generic syntax for ggplot. In the first line, the ggplot function is called to indicate the start of a graphic. The name of the data set, data in this case, is provided along with aesthetics, aes. These aesthetics map the Var1 variable to the x-axis and the Var2 variable to the y-axis. Line 2 then maps different geometries onto those aesthetics, such as points (geom_point), lines (geom_line), boxplots (geom_boxplot), and many others. In other words, the syntax requires that the user first consider the data to be displayed, as opposed to a type of graph.
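Code 1 is not reproduced in this text, but a sketch consistent with the two-line pattern just described might look like the following (the data set data and the variables Var1 and Var2 are placeholders, not real research data):

```r
library(ggplot2)

# Hypothetical data standing in for `data` in Code 1
data <- data.frame(Var1 = c(1, 2, 3, 4),
                   Var2 = c(2.1, 3.9, 6.2, 7.8))

# Line 1: declare the data and map variables to aesthetics
# Line 2: layer one or more geometries onto those aesthetics
p <- ggplot(data, aes(x = Var1, y = Var2)) +
  geom_point() +
  geom_line()
```

Swapping geom_point() for geom_boxplot() changes the representation without touching the data mapping, which is the reverse of the choose-a-chart-type-first workflow found in many other programs.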



This is in contrast to many other programs that require a user to first choose pre-determined geometries (a type of graph) and then map those geometries onto the data. The ggplot2 package applies exactly the reverse process: data are identified first (line 1) and then mapped onto one or more geometries (line 2). Defining the data first, followed by the geometries, might seem like a small difference, but it is more aligned with the fundamental reasons researchers show visualizations of data, because it lets researchers think about the best visual display rather than the best of the available options. As an analogy, consider being asked “how was your day?” at the end of a work day. One option is to give an aggregate summary of the day’s events by choosing a “type” of day: Was it a good day? Bad? Unproductive? Busy? None of these labels depicts the actual interactions and events of the day. As a result, if someone only reports that they’ve had a bad day, many important details are not accessible, and the consumer of this limited information may conclude things that were not actually observed. Thus, instead of choosing a “type of day,” a more effective way of communicating is to describe some of the individual events of the day and let the listener determine their own labels for the day. As with the ggplot syntax, it is more informative to first define the values (the events of the day) and then represent them in some way (using various parts of speech in a sentence). The same is true of data, and the ggplot2 syntax encourages researchers to remember that the goal is to display raw data when possible. Too often, as in the dynamite plot shown previously, what is displayed is a visual representation of a parameter of the whole data set, not the individual data themselves.
Advanced capabilities in R, such as jittering (adding small random noise to prevent over-plotting), transparency, and facet plotting (creating multiples of the same graph broken down by group), all help researchers effectively display hundreds or even thousands of data points without inducing cognitive overload.

We focus almost exclusively on quantitative information, but data visualization is important in qualitative settings as well. A text mining package called tm (12, 13) has gained popularity recently. While R is not the program of choice for coding qualitative data, this package provides a number of useful tools for exploring large sets of short-answer responses. R also allows for word clouds via the wordcloud package (14), and we will discuss a unique visualization called a chord diagram (15). Originally used in bioinformatics to visualize genome sequences, this visualization can be used with text to show not just how frequently words appear (as a word cloud does) but also how frequently words appear next to each other. To illustrate the potential power of this visualization, we used R to visualize the titles of every article in the Journal of the American Chemical Society (JACS) printed from January 1, 2016 through December 31, 2016 (volume 138, issues 1-51), TryThis(#3). In this example, we used a package that does web scraping (automated access and import of information from servers into R) to easily create a list containing all the article titles. Then, we reformatted the data so that we could accurately count how many times each unique word pair showed up in an article title. For space concerns, this part is not included in the supplemental file; instead, we have included a raw data file that can be imported. As of issue 51, a total of 2,478 articles were published in 2016 (about 48 articles per issue). We then made a chord diagram to show the common word pairings that appeared in titles. The chord diagram (Figure 3) is read by finding the side of a particular link that is furthest from the circle’s edge, which represents the first word in a pair; following the link to the side that is closer to the circle’s edge gives the second word in the pair. You can gauge how many times that word pair was mentioned in that order by reading the axis. For example, the thickest link stretches from “metal-organic” to “framework”, indicating that this word pair is the most common in 2016 JACS titles; the axis indicates that 64 titles contain it. We’ve highlighted the most common word pairs/phrases appearing in JACS titles in a darker color. Doing so reveals that the most common topics explicitly mentioned in article titles were metal-organic frameworks (64 articles) and total (enantioselective) synthesis (51 articles). This visualization is particularly effective at displaying short phrases of three or more words, as can be seen in the phrase “covalent organic framework” (19 articles), which we’ve also highlighted in a darker color. While this particular example of JACS article titles is limited because we cannot always infer much from a title, the technique would prove useful in a variety of settings, such as analyzing open-ended responses where phrases are expected to be commonly used. For example, if students are asked to explain why a balloon’s volume increases at higher temperature, the proportion of students who include “increase temperature, increase volume” versus “increase kinetic energy of molecules” in their explanations could be meaningful.
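The word-pair counting behind this analysis is omitted from the supplemental file for space, but the core idea can be sketched in base R (the titles below are invented placeholders, not actual JACS titles):

```r
# Hypothetical article titles (placeholders, not real JACS titles)
titles <- c("Total Synthesis of a Natural Product",
            "Metal-Organic Framework Catalysis",
            "Total Synthesis Strategies")

# Split each title into lowercase words, then paste each word with
# its neighbor to form adjacent word pairs (bigrams)
pairs <- unlist(lapply(strsplit(tolower(titles), "\\s+"), function(w) {
  if (length(w) < 2) return(character(0))
  paste(w[-length(w)], w[-1])
}))

# Tabulate and sort: the most frequent pairs drive the chord diagram
sort(table(pairs), decreasing = TRUE)
```

In the real analysis, these counts (stored in JACS.csv) feed the chord diagram, where link thickness encodes pair frequency.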

Figure 3. Chord diagram of 2016 JACS article titles.

Creating custom visualizations in R could be an entire book’s worth of content, so instead of showing additional visualizations here, we encourage readers to visit the following resources to browse some unique visualizations produced in R.






• The R Graph Gallery (http://www.r-graph-gallery.com/) – Each graph on this page comes with the code used to build it; some are purely artistic, others offer unique ways to visualize actual data
• ggplot2 Help Topics (http://docs.ggplot2.org/current/) – Lists the current geometries available in the ggplot2 package and how to use them
• Plotly Integration with R (https://plot.ly/r/) – Plotly is a web-based graphing site that offers full integration with R
• Quick-R Graphs (http://www.statmethods.net/graphs/) – Useful for making plots using base R as opposed to ggplot2

Functions and Programmatic Loops

The best way to show the benefits of R for chemistry education research is to walk through an analytic problem and show how R leads to efficient and robust research via functions, loops, and reproducibility. Before we introduce and investigate a problem, we will first provide a brief tutorial on functions and loops in R, as these will be used throughout the example research investigation. The basics of how a function works are shown in Code 2. A function we’ve called myfun is defined to take two arguments, argument1 and argument2. This toy function starts with the { in Line 1 and ends with the } in Line 4. Line 2 instructs R to define an object, x, as the sum of whatever is entered in argument1 divided by the mean of whatever is in argument2. Line 3 simply tells R to return (print) the value of x at the end of the function. The function is exemplified by defining two sets of numbers, a and b, in Lines 5 and 6, and running the function with its arguments in two different orders (Lines 7 and 9). Line 7 essentially tells R to replace every instance of argument1 in the function with the object a, which resolves to a set of 3 numbers, and every instance of argument2 with the object b, which resolves to a different set of 3 numbers. Therefore, Line 2 starts as object x being defined as sum(argument1)/mean(argument2), which evaluates to sum(a)/mean(b): R computes the sum of 5, 2, and 8 (15) divided by the mean of 7, 3, and 0 (3.33), which equals 4.5, and this value is displayed as a result of the return function in Line 3. The same happens in reverse if the user enters b for argument1 and a for argument2, as shown in Line 9. You can try this yourself in TryThis(#4).
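For readers without the figure at hand, here is a sketch of Code 2 consistent with the walkthrough above (the function and argument names are those used in the text; the line-number comments refer to the lines of Code 2, not of this chapter):

```r
myfun <- function(argument1, argument2) {   # Line 1: open the function body
  x <- sum(argument1) / mean(argument2)     # Line 2: define x
  return(x)                                 # Line 3: return the value of x
}                                           # Line 4: close the function body

a <- c(5, 2, 8)                             # Line 5
b <- c(7, 3, 0)                             # Line 6

myfun(a, b)  # Line 7: sum(a)/mean(b) = 15/3.33 = 4.5
myfun(b, a)  # Line 9: sum(b)/mean(a) = 10/5 = 2
```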



Programmatic loops can add greater functionality to existing and custom functions in R. The basic concept behind a programmatic loop, called a for loop, is that some chunk of code is re-run multiple times, each time changing something. A simple example of a for loop is shown in Code 3. First, the object c is defined as 10 numbers (Line 1). The boundaries of the loop are defined in Line 2, which tells R to run the code inside the curly brackets a total of 10 times (1:10 is shorthand for “every integer from 1 through 10, inclusive”). The i represents an index: the first time the code is run, i is equal to the first element of the vector given. In this case, the vector given is 1:10, so for the first run through, i = 1. The second time the code is run, i is equal to the second element of the vector, and so on; the second run through evaluates to i = 2, the third to i = 3, and so on. In the first run through i = 1, so Line 3 evaluates to x
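Code 3 is not reproduced here, and the text’s description of its Line 3 is cut off, but a minimal for loop matching the structure just described might look like the following (the loop body is a hypothetical stand-in):

```r
c <- c(4, 8, 15, 16, 23, 42, 7, 3, 9, 11)  # Line 1: the object c holds 10 numbers
result <- numeric(10)                       # storage for the loop's output

for (i in 1:10) {                           # Line 2: i takes the values 1, 2, ..., 10
  result[i] <- c[i] * 2                     # hypothetical Line 3: use the i-th element
}

result                                      # one entry per pass through the loop
```

Note that the source names the vector c, which also happens to be the name of R’s combine function; R resolves the two by context, but choosing a different object name is generally safer.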