Editorial pubs.acs.org/cm
Substance over Subjectivity: Moving beyond the Histogram
Clear communication of scientific data is an essential component of research and, by extension, of the scientific process as a whole. Researchers must be adept at expressing their work in both written and graphical form, with data communicated in a manner that is free from subjectivity and supported by the proper mathematical, statistical, and scientific analysis. Specifically, appropriate statistical treatment of data can remove ambiguity in results, allowing other researchers to understand the claims being made and giving them the proper footing to take the next scientific step. The language of statistics is steeped in mathematics and may be intimidating to researchers without substantial or recent mathematical training.
The aim of this editorial is to briefly explore the Average Shifted Histogram (ASH) as a straightforward, intuitive tool that is vastly superior to the standard histogram as a method of visually communicating the distribution of values in relatively small data sets. The appearance of the familiar histogram is determined by both the bin width and the bin origin, the latter being the left-hand edge of the left-most bin (ref 1); however, these parameters are ultimately chosen by the researcher and have a significant impact on the appearance of the histogram. This subjectivity may be inappropriate when the goal is a quantitative analysis that is free of bias. The ASH eliminates this subjectivity, is well-known to those with a background in statistics or mathematics, and is an underused but extraordinarily useful tool for chemists and materials scientists.
Histograms are a type of frequency plot that aims to estimate the probability distribution for a set of data. To demonstrate the susceptibility of standard histograms to distortion, we first investigate the effects of bin width on the appearance of the histogram for a given data set in Figure 1.
The data set was generated by randomly drawing 20 numbers from a Gaussian distribution (mean of 0.0 and standard deviation of 1.0). It is clear that the appearance of the histogram is fundamentally dependent on the bin width. The bin width must be small enough to show the distribution as a discrete function but not so small that each bin captures only individual points and the histogram merely returns the original data set. At the opposite extreme, the bin width must be large enough to sufficiently “fill” the bins, thereby capturing the local changes in probability density; to further complicate matters, it cannot be so large as to smooth out significant portions of the data’s domain. Properly managing this balance of factors in choosing the bin width captures the essence of the histogram. Thankfully, several rules of thumb have been established to estimate the optimal bin width, including Scott’s and Silverman’s reference rules for bin width calculation (refs 2−4). For data sets that are expected to be normally distributed, we suggest using Scott’s rule, as it determines the optimal bin width, h (by minimizing the mean squared error), given by
$$h = \frac{3.5\,\sigma}{n^{1/3}}$$
where σ is the sample standard deviation and n is the number of data points. The underlying mathematics behind Scott’s rule and others can be read about in refs 2, 3, and 4. Figure 2 shows that, for small data sets, Scott’s rule produces histograms that, while keeping one from overinterpreting the data, offer too few bins to allow useful visualization. As we will see in the next section, histograms with few bins are also subject to dramatic shape changes with a shift of the bin origin.
Figure 1. Four histograms made from the same data but with different bin widths, indicated here, as is traditionally done, by the number of bins that fit within the data limits. Below each histogram is a series of vertical lines that show the actual positions of each data point, known as a “rug” plot.
Figure 2. Using the same data as Figure 1, this histogram is produced by following Scott’s rule, applied to bin widths (ref 2).
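Scott's rule, h = 3.5σ/n^(1/3), is simple enough to evaluate by hand, but a short script makes the small-sample behavior easy to explore. The following is a minimal sketch in Python using only the standard library; the function name `scott_bin_width`, the random seed, and the generated sample are illustrative assumptions and are not the data used for the figures.

```python
import math
import random
import statistics


def scott_bin_width(data):
    """Return the Scott's-rule bin width h = 3.5 * sigma / n**(1/3)."""
    n = len(data)
    sigma = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    return 3.5 * sigma / n ** (1.0 / 3.0)


# Hypothetical stand-in for the editorial's data: 20 draws from N(0, 1).
rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
sample = [rng.gauss(0.0, 1.0) for _ in range(20)]

h = scott_bin_width(sample)
# Number of bins of width h needed to span the data limits.
n_bins = math.ceil((max(sample) - min(sample)) / h)
print(f"h = {h:.3f}; bins needed to span the data: {n_bins}")
```

For n = 20 and σ close to 1, h comes out near 1.3, so only a handful of bins span the data, which is consistent with the sparse histogram described for Figure 2.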
Published: September 13, 2016 © 2016 American Chemical Society
DOI: 10.1021/acs.chemmater.6b03430 Chem. Mater. 2016, 28, 5973−5975
Chemistry of Materials
Next, bin origin and its impacts on histogram appearance are described. Figure 3 illustrates three different histograms with
Figure 3. Using the bin width determined from Scott’s rule, these histograms vary only in bin origin, indicated as a fraction of bin width. All four are made from the same data as Figures 1 and 2.
Figure 4. Four ASHs produced using 4 shifts, 8 shifts, 16 shifts, and 64 shifts as indicated.
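The dependence on bin origin, and the way averaging over shifted origins removes it, can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation behind the figures or the web application. It assumes the usual definition of the ASH as the average of m histograms that share a bin width h but whose origins are offset from one another by h/m. The function names, the seed, the bin width of 1.3, and the origin shifts are all hypothetical choices.

```python
import math
import random


def histogram_density(data, x, width, origin):
    """Density estimate at x from one histogram with the given bin width and origin."""
    k = math.floor((x - origin) / width)  # index of the bin that contains x
    lo = origin + k * width               # left edge of that bin
    count = sum(1 for v in data if lo <= v < lo + width)
    return count / (len(data) * width)


def ash_density(data, x, width, n_shifts, origin=0.0):
    """Average of n_shifts histograms whose origins are offset by width / n_shifts."""
    step = width / n_shifts
    total = sum(
        histogram_density(data, x, width, origin + j * step)
        for j in range(n_shifts)
    )
    return total / n_shifts


# Hypothetical data: 20 draws from N(0, 1); the seed is arbitrary.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(20)]
width = 1.3  # assumed bin width, roughly Scott's rule for n = 20 and sigma = 1

# A single histogram: the estimate at x = 0 depends on where the bins start.
for frac in (0.0, 1 / 3, 2 / 3):
    d = histogram_density(sample, 0.0, width, frac * width)
    print(f"origin shifted by {frac:.2f} h: density at 0 = {d:.3f}")

# The ASH: averaging over shifted origins removes that choice.
for m in (4, 8, 16, 64):
    print(f"ASH with {m} shifts: density at 0 = {ash_density(sample, 0.0, width, m):.3f}")
```

Evaluating `ash_density` on a fine grid of x values gives the ASH curve; with `n_shifts=1` it reduces to the ordinary histogram.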
histograms but not in others. The shaded blue area comprises all the histograms used to make the ASH added together, with the darkest value representing where all the histograms overlap. The thickness of the edge gradient gives a visualization of the error in the ASH. Expressing data in an ASH lets the data speak for themselves, removing subjectivity that may arise in histogram construction due to arbitrary or careless selection of bin edge position.
We have built a web application for computing and displaying average-shifted histograms, which can be found and used at http://maverick.chem.ualberta.ca/plot/ash. More details regarding the usage and capabilities of this web application can be found in the Supporting Information.
Recently, ASHs have been used in the literature to analyze efficiency results of organic photovoltaics (OPVs; ref 5). The researchers were comparing different architectures of OPV devices, with the aim of determining which underlying factors had a significant effect on the resulting device efficiencies. Due to the relatively small number of samples (