
Using Similarity Metrics to Quantify Differences in High-Throughput Datasets: Application to X-Ray Diffraction Patterns
Efrain Hernandez-Rivera, Shawn P. Coleman, and Mark A. Tschopp
ACS Comb. Sci., Just Accepted Manuscript • DOI: 10.1021/acscombsci.6b00142 • Publication Date (Web): 01 Dec 2016



Using Similarity Metrics to Quantify Differences in High-Throughput Datasets: Application to X-Ray Diffraction Patterns

Efraín Hernández-Rivera, Shawn P. Coleman, and Mark A. Tschopp

U.S. Army Research Laboratory, Weapons and Materials Research Directorate, Aberdeen Proving Ground, MD 21005
E-mail: [email protected]; [email protected]

Abstract

The objective of this research is to demonstrate how similarity metrics can be used to quantify differences between sets of diffraction patterns. A set of 49 similarity metrics is implemented to analyze and quantify similarities between different Gaussian-based peak responses, as a surrogate for different characteristics in X-ray diffraction (XRD) patterns. A methodological approach was used to identify and demonstrate how sensitive these metrics are to expected peak features. By performing hierarchical clustering analysis, it is shown that most behaviors lead to unrelated metric responses. For instance, the results show that the Clark metric is consistently one of the most sensitive metrics to synthetic single peak changes. Furthermore, as an example of its utility, a framework is outlined for analyzing structural changes due to size convergence and isotropic straining, as calculated through the virtual XRD patterns.


Introduction

X-ray powder diffraction (XRD) is a common characterization technique used to identify the structure of materials and provide insight into how chemistry, alloy composition, and materials processing influence structural evolution [1]. The use of XRD as a non-destructive characterization tool allows researchers to observe and quantify structural evolution throughout an experiment. For example, researchers have coupled XRD characterization with mechanical alloying studies to observe the formation of nanocrystalline solid solutions from elemental powders [2-4]. Similarly, researchers have coupled XRD characterization with heat treatment studies to understand phase stability, phase transformations, and precipitation [5]. More recently, high entropy alloy studies used XRD to identify the structural and phase changes at varying temperatures and compositions [6].

XRD methods are not limited to experimental studies of materials. Material models, including atomistic simulations, have the necessary spatial and atomic information to compute XRD patterns. Simulated XRD patterns enable direct comparison between material models and experiments, and they can provide insight into underlying structure and mechanisms that conventional experimental XRD techniques are unable to capture and/or isolate. For example, traditional characterization techniques often are unable to properly analyze complex nanostructured materials [7]; however, coupled computational and experimental data have provided insight into the structure of binary glassy materials [8] and similarly the origin of microstrain within nanocrystalline metals [9,10].

Both experimental and computational XRD studies are impacted by the White House's Materials Genome Initiative (MGI) [11,12], which has motivated a shift towards high-throughput (HiTp) approaches to materials science. HiTp data mining coupled to the renewed interest in materials informatics is becoming the foundation for materials-by-design and discovery [13-16]. With the embrace of data- and information-driven materials development, methods to analyze and compare large-scale datasets are needed to address the challenges put forward by the MGI. For HiTp XRD studies, the gap between big data and knowledge has raised interest in developing efficient ways of comparing patterns and linking them to process-structure-property relationships [14,17,18].

Similarity metrics are routinely applied to HiTp datasets [19,20] to quantify how similar two (or more) responses are to one another. Similarity metrics can be classified as (a) bin-to-bin, where bin $i$ of response $x$ is compared to the corresponding bin $i$ of response $y$, i.e., $s(x, y) \to f(x_i, y_i)$, or (b) cross-bin, where bin $i$ of a response is compared to all bins of the second response (e.g., quadratic form [21] and Earth mover [22] distances), i.e., $s(x, y) \to f(x_i, y_0, \ldots, y_i, \ldots, y_n)$.
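The distinction between the two classes can be sketched in a few lines. The example below is only illustrative: the bin-to-bin measure is a plain L1 distance, and the cross-bin measure is a quadratic-form distance whose weight matrix $a_{ij} = 1 - |i - j|/(n - 1)$ is an assumed, commonly used choice rather than one taken from this work. The function names and the ten-bin delta peaks are hypothetical.

```python
import math

def l1_distance(x, y):
    """Bin-to-bin: bin i of x is compared only with bin i of y."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def quadratic_form_distance(x, y):
    """Cross-bin: every bin difference interacts with every other bin.

    The weight a_ij = 1 - |i - j| / (n - 1) is an assumption made for
    illustration; other ground-distance weightings are possible.
    """
    n = len(x)
    z = [xi - yi for xi, yi in zip(x, y)]
    q = 0.0
    for i in range(n):
        for j in range(n):
            a_ij = 1.0 - abs(i - j) / (n - 1)
            q += z[i] * a_ij * z[j]
    return math.sqrt(max(q, 0.0))  # guard against tiny negative round-off

def delta_peak(n, center):
    """Hypothetical single-bin peak used to make the contrast obvious."""
    return [1.0 if i == center else 0.0 for i in range(n)]

reference = delta_peak(10, 2)
near = delta_peak(10, 3)   # peak moved by one bin
far = delta_peak(10, 8)    # peak moved by six bins

l1_near, l1_far = l1_distance(reference, near), l1_distance(reference, far)
qf_near = quadratic_form_distance(reference, near)
qf_far = quadratic_form_distance(reference, far)
```

In this sketch the bin-to-bin distance cannot tell the small shift from the large one once the peaks no longer overlap, whereas the cross-bin distance grows with the shift, at the price of a double loop over all bins.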

Under the MGI paradigm, where big data is expected to play a crucial role, the efficiency of bin-to-bin similarity metrics will enable rapid evaluation of large XRD pattern datasets. The added computational cost of cross-bin similarity metrics or whole-pattern kernel methods [15] can become significant for HiTp XRD studies, in some cases prohibitively so. Furthermore, these metrics often employ some form of bin-to-bin metric in their measurement of similarity. Hence, in these studies and others, relatively inexpensive bin-to-bin similarity metrics are often not considered in comparison to the cross-bin similarity metrics. That being said, using bin-to-bin similarity metrics may prove necessary for HiTp datasets, owing to their minimal computational expense.

Similarity metrics can quantify changes in XRD patterns that correspond to structural changes due to experimental or modeling conditions. For example, the $L_1$-norm, cosine, and Pearson correlation metrics have been used, with relative success, to efficiently analyze XRD patterns and cluster their similarities [23,24]. However, as pointed out by Hattrick-Simpers et al. [25], these metrics perform poorly with straightforward XRD pattern changes, like peak shifts, which limits their ability to identify the common features associated with phase transformations. Fortunately, there is a multitude of similarity metrics researchers can utilize, which have been compiled in an encyclopedia of distances by Deza and Deza [26]. In fact, Cha [27] outlined and implemented a subset of 49 bin-to-bin similarity metrics, which are used herein. Recent studies focus on systematically comparing how metrics perform in measuring similarities between datasets of increasing size and/or dimension [28]. However, a comprehensive study of 1-D/line datasets is required to understand how each of these metrics captures common XRD features.

Therefore, this work performs a systematic study of how common XRD peak features are captured by the 49 similarity metrics outlined by Cha [27]. By quantifying the sensitivity of these similarity metrics to various XRD features, this work provides insights into the best approach to identify changes in diffraction profiles. The results show that certain metrics, such as Clark, are most consistently sensitive to small net changes in synthetic single peaks. Thus, these metrics can efficiently differentiate changes in diffraction profiles due to size effects or lattice strain. The significance of this research is that these different measures can have broad applicability towards quantifying the convergence to a desired state or similarity between different underlying microstructures and crystal structures.

Methodology

The Methodology section is organized as follows. The Similarity and Distance Measures subsection first introduces the various similarity metrics and describes how they are calculated throughout this work. The Synthetic Single-Peak Analysis subsection then describes how various peak features are modeled (using variations of a Gaussian distribution) to approximate single-peak pattern behavior. The XRD Multi-Peak Analysis subsection then describes the methods used to examine each metric's ability to quantify pattern convergence and isotropic strain in virtual XRD patterns.

Similarity and Distance Measures

This work provides a framework for analyzing similarities between XRD datasets, as measured by the 49 similarity metrics outlined by Cha. A description of each metric is included in the Supplementary Information and the code is available in Ref. 29. Cha groups these metrics into families, where family members share a basic mathematical operation. These mathematical family-based operations are summarized in Table 1. For instance, the $L_1$ family uses the absolute value of the difference between distributions in all of its measures. Likewise, the inner product family uses the dot product (or inner product) in all of its measures. While members within each family have the same functional origin, this work shows that they often yield uncorrelated similarity responses.

A generic response of the family operations is illustrated in Figure 1. The blue peaks represent two patterns before ($x$) and after ($y$) they have been shifted, narrowed, split, and had noise added (from left to right, respectively). The family-based metric response is obtained by applying the corresponding mathematical operation, yielding the different magenta curves. Visually inspecting the responses, it is evident that some families are more sensitive to specific peak features. For example, for the case of noise (the right-most peaks), the Shannon entropy and $\chi^2$ metrics are much less sensitive than the intersection and inner product metrics, while the $L_{n=1}$ metric accentuates the noise. The numerical inserts correspond to the calculated similarity between the two blue curves for each of the 4 peak features. Hence, for the $\chi^2$ family, peak shifting and noise addition lead to the least (0.0) and most (0.96) similar peaks.

Table 1: Similarity metric families and their shared mathematical operation. These metrics are used to perform bin-to-bin comparison between responses $x = \{x_0, \ldots, x_i, \ldots, x_n\}$ and $y = \{y_0, \ldots, y_i, \ldots, y_n\}$.

Family               Operator                      Description
Minkowski            $\sqrt[p]{|x_i - y_i|^p}$     Distance of order $p$ between two distributions
$L_1$                $|x_i - y_i|$                 Absolute value of distribution distance
Intersection         $\min(x_i, y_i)$              Minimum between each distribution element
Inner Product        $x_i y_i$                     Dot product between distributions
Fidelity             $\sqrt{x_i y_i}$              Square root of the inner product
$\chi^2$             $(x_i - y_i)^2$               Square of the $L_1$
Shannon's Entropy    $\ln(x_i / y_i)$              Natural logarithm of distribution ratio
Combinations         (none)                        Mix of multiple ideas from previous metrics
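As a rough illustration of Table 1, the per-bin operators can be written as small functions and summed over all bins, following the Figure 1 description of the family response as a summation of individual bin values. This is a sketch only: the dictionary name, the helper `family_response`, and the five-bin inputs are hypothetical, the Minkowski and Combinations families are left out, and the Shannon entropy operator assumes strictly positive bins.

```python
import math

# Per-bin operators from Table 1. Minkowski reduces to the L1 operator per
# bin and Combinations has no single operator, so both are omitted here.
FAMILY_OPERATORS = {
    "L1": lambda xi, yi: abs(xi - yi),
    "Intersection": lambda xi, yi: min(xi, yi),
    "Inner product": lambda xi, yi: xi * yi,
    "Fidelity": lambda xi, yi: math.sqrt(xi * yi),
    "Chi-squared": lambda xi, yi: (xi - yi) ** 2,
    "Shannon entropy": lambda xi, yi: math.log(xi / yi),  # needs xi, yi > 0
}

def family_response(name, x, y):
    """Sum the chosen per-bin operator over all bins of x and y."""
    op = FAMILY_OPERATORS[name]
    return sum(op(xi, yi) for xi, yi in zip(x, y))

# Hypothetical unit-sum responses: a peaked pattern and a flat one.
x = [0.1, 0.2, 0.4, 0.2, 0.1]
y = [0.2, 0.2, 0.2, 0.2, 0.2]
responses = {name: family_response(name, x, y) for name in FAMILY_OPERATORS}
```

The individual metrics within each family combine these operators with different normalizations, so the raw sums above are building blocks rather than the final similarity values.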

When comparing data in the form of scalars, vectors, images, large multidimensional datasets, binary datasets, or even text, it is important to define a quantitative measure of how similar one dataset is to another known dataset of equal size. Mathematically, this is often represented through use of either similarity or distance (dissimilarity) measures. While the relationship between similarity and distance measures can be very straightforward, it is useful to normalize these values to fit expected properties of two datasets to better allow for comparisons. For example, as two datasets become increasingly similar, a normalized distance measure should approach zero, while a normalized similarity measure should approach one. This normalization allows for multiple metrics to be easily compared in the same manner.

[Figure 1 panels, left to right: Shift, Broadening, Split, Noise]

Figure 1: Generic metric response for different family forms as peaks become shifted, narrowed, split and added noise (from left to right, respectively, separated by vertical lines). The top two plots show a set of synthetic peaks before ($x$) and after ($y$) being modified. The numbers correspond to the similarity value for each peak feature, which is just the summation of the individual bin similarities. The Minkowski and Combination families are not shown because of commonality to the $L_1$ or lack of generic mathematical operation, respectively.

Since the goal of this work is to quantitatively describe similarity between XRD patterns, the framework implemented herein focuses on employing similarity metrics, instead of distance metrics. Similarity metrics ($s$) are expected to satisfy the following properties [30,31]:

• Constraint (limited range): $s \le s_0$
• Reflexivity: $s(x, y) = s_0$ if $x = y$
• Symmetry: $s(x, y) = s(y, x)$
• Triangle Inequality: $s(x, y) \le s(x, z) + s(y, z)$

where $x$, $y$ and $z$ are $n$-sized vectors (e.g., $x = \{x_0, \ldots, x_i, \ldots, x_n\}$) representing the response distributions (e.g., diffraction intensities). While several similarity metrics are derived independently of a distance metric (e.g., the Cosine metric), many can be derived from the definition of the distance metric, $d$. For example, a common approach used to transform a distance metric into its similarity counterpart is a linear model, i.e.,

$$ s(x, y) = 1 - d(x, y). \tag{1} $$

Of course, Equation 1 works best when $d$ is bounded and normalized, i.e., $0 \le d \le 1$. Other conversion approaches are used when this linear approach performs poorly due to an unbounded $d$. For instance, Shepard [32] proposed

$$ s(x, y) = e^{-d(x, y)}, \tag{2} $$

which clearly limits similarity values to $\{s \in \mathbb{R} : 0 \le s \le 1\}$ despite a potentially unbounded $d$.
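A minimal sketch of Equations 1 and 2 follows, assuming an L1 distance between unit-sum patterns. The halving of the L1 distance is an assumption introduced here so that $0 \le d \le 1$ holds for such inputs; the function names and the three-bin patterns are hypothetical.

```python
import math

def l1_distance(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def similarity_linear(d):
    """Equation 1: s = 1 - d, intended for 0 <= d <= 1."""
    return 1.0 - d

def similarity_shepard(d):
    """Equation 2: s = exp(-d), which stays in (0, 1] for any d >= 0."""
    return math.exp(-d)

# Hypothetical unit-sum patterns.
x = [0.7, 0.2, 0.1]
y = [0.1, 0.2, 0.7]

# Halved so that 0 <= d <= 1 for non-negative unit-sum inputs (assumption).
d_xy = 0.5 * l1_distance(x, y)
d_yx = 0.5 * l1_distance(y, x)
d_xx = 0.5 * l1_distance(x, x)

s_linear = similarity_linear(d_xy)
s_shepard = similarity_shepard(d_xy)
```

Both conversions return $s_0 = 1$ for identical inputs (reflexivity) and inherit symmetry from the underlying distance; only the exponential form remains bounded when $d$ is not.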

In this work, two variants of distance to similarity conversion methods were considered,

$$ s(x, y) = \exp\left( -\frac{d(x, y)}{d_{\max} - d(x, y)} \right) \tag{3} $$

and

$$ s(x, y) = 1 - \frac{d(x, y)}{d_{\max}} \tag{4} $$

where $d_{\max}$ is the absolute maximum expected distance between the two distributions. The latter equation was proposed by Niblack [21] as the bin-to-bin component of the quadratic-form distance calculations. Using $d_{\max}$ to normalize these equations enables better comparison across all metrics because the scaling is unambiguous. Similarity metrics that have been derived independently from a distance metric also required normalization by the maximum similarity measurement to ensure that $s(x, x) = s_0 = 1$ (e.g., the Kulczynski similarity).

In general, the conversion approach given by Equation 3 leads to much more sensitive metrics because of the exponential decay; this characteristic may be advantageous for detecting extremely fine changes in peaks, but generally tends to oversensitize the metrics' response and hence is not considered herein. Therefore, the linear relationship (Equation 4) was chosen to convert distance to similarity metrics. The maximum expected distances can be obtained in several ways, one of which is outlined in the Supplementary Information. Herein, $d_{\max}$ was assigned as the largest distance metric calculated after comparing all members within the dataset. For example, in the case of peak shifting, $d_{\max}$ is given by the maximum shift between peak centers, i.e.,

$$ d_{\max} = |f(\Delta(2\theta_0)_{\max})| = |I(2\theta_0^{\max}) - I(2\theta_0^{0})| $$

where $I(2\theta_0)$ is the intensity of a peak centered around $2\theta_0$ and $I(2\theta_0^{\max})$ is the peak intensity for the maximum shift. To compare XRD patterns, the distance between all set members is calculated and $d_{\max}$ is defined as the maximum measured absolute value.
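The two conversions in Equations 3 and 4 can be compared on a toy dataset. The sketch below is an illustration under stated assumptions: distances are L1 distances to a single reference pattern (a simplification of comparing all set members), $d_{\max}$ is the largest of those distances, and the three-bin patterns and function names are hypothetical.

```python
import math

def l1_distance(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def similarity_linear(d, d_max):
    """Equation 4: s = 1 - d / d_max."""
    return 1.0 - d / d_max

def similarity_exponential(d, d_max):
    """Equation 3: s = exp(-d / (d_max - d)); s tends to 0 as d -> d_max."""
    if d >= d_max:
        return 0.0  # limiting value, avoids division by zero
    return math.exp(-d / (d_max - d))

# Hypothetical three-bin patterns; the first one is the reference.
patterns = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
reference = patterns[0]
distances = [l1_distance(reference, p) for p in patterns]
d_max = max(distances)  # largest distance observed within this dataset

linear = [similarity_linear(d, d_max) for d in distances]
exponential = [similarity_exponential(d, d_max) for d in distances]
```

Both conversions map the reference to 1 and the most distant pattern to 0; for the intermediate pattern the exponential form returns the lower similarity, which is consistent with the text's remark that Equation 3 is the more sensitive of the two.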

It should be noted that not all the metrics outlined by Cha [27] meet the symmetry condition, i.e., strictly speaking they are not similarity metrics. While ideal similarity metrics must possess symmetry (i.e., $s(x, y) = s(y, x)$), some of the metrics surveyed by Cha and implemented herein depend on the order of the operation. For example, Pearson $\chi^2$ clearly depends on whether $y$ is compared against $x$, or vice versa. Nonetheless, all the metrics surveyed are included in this work for completeness. If desired, some of these asymmetric metrics can be adapted in such a way that symmetry is achieved (e.g., adding Pearson $\chi^2$ and Neyman $\chi^2$ effectively yields a metric that satisfies the outlined definition of a similarity metric).
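The asymmetry, and the effect of adding the two $\chi^2$ forms, can be checked numerically. The sketch below uses the standard definitions of the Pearson and Neyman $\chi^2$ distances; the function names and the three-bin inputs are hypothetical, and all bins are assumed strictly positive so that no division by zero occurs.

```python
def pearson_chi2(x, y):
    """Pearson chi-squared distance: sum of (x_i - y_i)^2 / y_i."""
    return sum((xi - yi) ** 2 / yi for xi, yi in zip(x, y))

def neyman_chi2(x, y):
    """Neyman chi-squared distance: sum of (x_i - y_i)^2 / x_i."""
    return sum((xi - yi) ** 2 / xi for xi, yi in zip(x, y))

def symmetrized_chi2(x, y):
    """Sum of both forms; swapping x and y only swaps the two terms."""
    return pearson_chi2(x, y) + neyman_chi2(x, y)

# Hypothetical strictly positive, unit-sum patterns.
x = [0.5, 0.3, 0.2]
y = [0.25, 0.25, 0.5]

forward = pearson_chi2(x, y)
backward = pearson_chi2(y, x)
```

Here `forward` and `backward` differ, while `symmetrized_chi2(x, y)` and `symmetrized_chi2(y, x)` agree, because the Neyman form of one ordering is the Pearson form of the other.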

Synthetic Single-Peak Analysis

An initial study examines how well each similarity metric captures common XRD features when applied to a single synthetic peak, modeled by a Gaussian distribution. A Gaussian distribution was deemed to be an appropriate surrogate to effectively and efficiently model peaks from experimental XRD patterns because of their shared expected data characteristics. Further, these synthetic peaks are relatively straightforward to systematically manipulate. In fact, many XRD peak profile analysis methods implement Gaussian or hybrid Gaussian-Lorentzian (pseudo-Voigt approximations) fitting routines to quantify peak location, peak broadening, and peak intensity.

Synthetic peaks were used to test six types of peak features that are common in XRD patterns, which are discussed in the following sections. A Gaussian peak $G(2\theta_i, \sigma_i)$ is subjected to the outlined changes and compared to an original peak $G(2\theta_0, \sigma_0)$ through the use of similarity metrics. For instance, in the case of peak shifting, an original peak is shifted as $\theta_0 \to \theta_0^{\max}$ towards larger diffraction angles and all peaks along this transition are compared to this original peak, as shown in Figure 2. Hence, the procedure quantifies how these peaks become less similar to the original peak as a function of peak shifting.

A brief description of the different peak features explored and why they are considered in this study follows. In each of these instances, the equations for obtaining the intensity of the second peak $I(\theta_i, \sigma_i)$ are given and this intensity is then normalized by the cumulative sum of the intensity, to guarantee that the peaks retain properties required from probability density functions.
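The normalization step can be sketched as follows. The angular grid (20° to 70° in 0.05° steps) is an assumption made for illustration, since the sampling used in the study is not stated here; the function names are likewise hypothetical.

```python
import math

def gaussian(two_theta, center, sigma):
    """Unnormalized Gaussian G(center, sigma) evaluated at each angle."""
    return [math.exp(-0.5 * ((t - center) / sigma) ** 2) for t in two_theta]

def normalize(intensity):
    """Divide by the cumulative sum so that the bins add up to one."""
    total = sum(intensity)
    return [v / total for v in intensity]

# Hypothetical grid: 20 to 70 degrees in 0.05 degree steps (1001 bins).
two_theta = [20.0 + 0.05 * k for k in range(1001)]

# Original peak with the parameters quoted in the text for Figure 2.
peak = normalize(gaussian(two_theta, center=40.0, sigma=0.5))
```

After this step every pattern is non-negative and sums to one, which is the property the bin-to-bin metrics rely on when they treat patterns as discrete probability distributions.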

[Figure 2 panels: (1) Peak Shifting, (2) Peak Splitting, (3) Peak Broadening, (4) Peak Background, (5) Peak Noise, (6) Peak Shape]

Figure 2: Peak features examined in the present study include six different behaviors common within XRD patterns: (1) peak shifting, (2) peak splitting, (3) peak broadening, (4) peak background, (5) peak noise, and (6) peak shape. The original Gaussian curve is shown in dark blue; the remainder of the curves represent different curves as the peak feature is changed.

(1) Peak Shifting

The simplest case to consider is that of peak shifting from its original diffraction angle. The following shift from the original peak $G(2\theta_0, \sigma_0)$ is considered by calculating a new pattern,

$$ I(\Delta(2\theta_0)_i) = G(2\theta_0 + \Delta(2\theta_0)_i, \sigma_0), \tag{5} $$

which is shown in Figure 2 for $2\theta_0 = 40^\circ$, $\sigma_0 = 0.5^\circ$, and $\Delta(2\theta_0)_i = \{0^\circ, \ldots, 15^\circ\}$. These peak center angle limits are chosen to capture enough of the peak tails, which contribute to the overall similarity metrics. An example of experimentally observed peak shifting can occur due to elastic straining of the lattice microstructure.
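To make the shifting procedure concrete, the sketch below generates shifted peaks with Equation 5, measures them against the original with the Clark distance, and converts to similarity with the linear relation of Equation 4. It is a sketch under assumptions: the angular grid and the handful of shift values are hypothetical, and skipping bins where both patterns are zero is a choice made here to avoid division by zero.

```python
import math

def gaussian_pdf(two_theta, center, sigma):
    """Gaussian peak normalized to unit sum over the grid."""
    raw = [math.exp(-0.5 * ((t - center) / sigma) ** 2) for t in two_theta]
    total = sum(raw)
    return [v / total for v in raw]

def clark_distance(x, y):
    """Clark distance: sqrt of sum of (|x_i - y_i| / (x_i + y_i))^2.

    Bins where both inputs are zero are skipped (assumption made here).
    """
    acc = 0.0
    for xi, yi in zip(x, y):
        if xi + yi > 0.0:
            acc += (abs(xi - yi) / (xi + yi)) ** 2
    return math.sqrt(acc)

two_theta = [20.0 + 0.05 * k for k in range(1001)]  # hypothetical grid
original = gaussian_pdf(two_theta, 40.0, 0.5)

shifts = [0.0, 0.5, 1.0, 5.0, 15.0]  # degrees, a subset of the 0 to 15 range
distances = [
    clark_distance(original, gaussian_pdf(two_theta, 40.0 + s, 0.5))
    for s in shifts
]
d_max = max(distances)
similarities = [1.0 - d / d_max for d in distances]  # Equation 4
```

In this sketch the Clark similarity drops sharply already at the smallest non-zero shift, because every bin in the tails contributes a relative difference close to one; this is in line with the sensitivity the text attributes to the Clark metric.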

(2) Peak Splitting

The second case considered is that of peak splitting,

$$ I(2\theta_{\delta,i}) = \frac{1}{2} \left[ G(2\theta_0, \sigma_0) + G(2\theta_{\delta,i}, \sigma_0) \right] \tag{6} $$

where $2\theta_{\delta,i} = 2\theta_0 + 2\delta\theta_i$ is the center of the split peak's second peak. This case is similar to the peak shifting case except the peak decomposes into two peaks that each comprise 50% of the integral of the original peak. In this manner, part of the peak stays at the original position $(2\theta_0, \sigma_0)$ while the second peak shifts away from this original peak by $2\delta\theta_i$.

(3) Peak Broadening

The third case considered herein is that of peak broadening as compared to the original peak $G(2\theta_0, \sigma_0)$ with $\sigma_0 = 0.5^\circ$,

$$ I(\sigma_i) = G(2\theta_0, \sigma_i) \quad \text{for} \quad \sigma_i = \{0.5^\circ, \ldots, 5^\circ\}, \tag{7} $$

which transitions from a narrow peak to a broad peak.
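Equations 6 and 7 translate directly into two small generators. The sketch below assumes the same hypothetical angular grid as before, normalizes each Gaussian to unit sum before mixing so that the split pattern also sums to one, and uses an illustrative separation of 4°; none of these choices are taken from the source.

```python
import math

def gaussian_pdf(two_theta, center, sigma):
    """Gaussian peak normalized to unit sum over the grid."""
    raw = [math.exp(-0.5 * ((t - center) / sigma) ** 2) for t in two_theta]
    total = sum(raw)
    return [v / total for v in raw]

def split_peak(two_theta, center, sigma, separation):
    """Equation 6: equal-weight sum of the original and a displaced peak."""
    first = gaussian_pdf(two_theta, center, sigma)
    second = gaussian_pdf(two_theta, center + separation, sigma)
    return [0.5 * (a + b) for a, b in zip(first, second)]

def broadened_peak(two_theta, center, sigma_i):
    """Equation 7: same center as the original, larger standard deviation."""
    return gaussian_pdf(two_theta, center, sigma_i)

two_theta = [20.0 + 0.05 * k for k in range(1001)]  # hypothetical grid
original = gaussian_pdf(two_theta, 40.0, 0.5)
split = split_peak(two_theta, 40.0, 0.5, separation=4.0)  # illustrative value
broad = broadened_peak(two_theta, 40.0, 5.0)
```

Once the two halves of the split peak are well separated, each has half the height of the original, while the broadened peak keeps its center and trades height for width.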

(4) Peak Background

The fourth case considered is the background signal of the patterns. An exponential equation is used to simulate the decay in the background signal that is evident at lower diffraction angles,

$$ I(b_i) = G(2\theta_0, \sigma_0) + b_i \left[ 1 - \exp\left( -\frac{\beta_0}{2\theta} \right) \right] \tag{8} $$

where $\beta_0 = 75^\circ$ and $b_i$ is selected to vary the maximum background response-to-peak ratio from 0 to 0.16. This behavior might not be as crucial as others because there are multiple ways of correcting for background (e.g., better sample holders and computational background removal schemes). That being said, understanding how different metrics handle a gradual change in the background may provide a better understanding of the metrics' general response behavior.
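A minimal sketch of Equation 8 follows. It assumes a unit-height Gaussian, so that $b_i$ is approximately the background-to-peak ratio, and the same hypothetical angular grid as in the other sketches; the function names are illustrative.

```python
import math

def gaussian(two_theta, center, sigma):
    """Unit-height Gaussian evaluated at each angle."""
    return [math.exp(-0.5 * ((t - center) / sigma) ** 2) for t in two_theta]

def with_background(two_theta, center, sigma, b_i, beta_0=75.0):
    """Equation 8: Gaussian plus b_i * (1 - exp(-beta_0 / two_theta)).

    The result is normalized to unit sum, as for the other peak features.
    """
    peak = gaussian(two_theta, center, sigma)
    raw = [
        g + b_i * (1.0 - math.exp(-beta_0 / t))
        for g, t in zip(peak, two_theta)
    ]
    total = sum(raw)
    return [v / total for v in raw]

two_theta = [20.0 + 0.05 * k for k in range(1001)]  # hypothetical grid
clean = with_background(two_theta, 40.0, 0.5, b_i=0.0)
pattern = with_background(two_theta, 40.0, 0.5, b_i=0.16)
```

With $\beta_0 = 75^\circ$ the background term is largest at the low-angle end of the grid and decays towards higher angles, so the added signal is not uniform across bins.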

(5) Peak Noise

The fifth case considered is the noise amplitude in the signal for the patterns. Noise was introduced by

$$ I(n_i) = G(2\theta_0, \sigma_0) + n_i P\{n\} \tag{9} $$

where $n_i$ scales the amplitude of the noise and the noise vector is represented by the Poisson distribution ($P\{n\}$). Here, the Poisson noise vector added to the pattern remains fixed and the noise is merely scaled by amplitude $n_i$ for the different patterns. While other approaches to understand the aleatory nature of noise exist, this approach provides insight into the metrics' ability to measure dissimilarities due to noise.
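The fixed-vector construction of Equation 9 can be sketched with the standard library alone. Several choices below are assumptions rather than details from the source: the Poisson rate `lam`, the random seed, the amplitudes, the angular grid, and the use of Knuth's multiplication method as a simple Poisson sampler.

```python
import math
import random

def gaussian(two_theta, center, sigma):
    """Unit-height Gaussian evaluated at each angle."""
    return [math.exp(-0.5 * ((t - center) / sigma) ** 2) for t in two_theta]

def poisson_sample(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def noisy_patterns(two_theta, center, sigma, amplitudes, lam=1.0, seed=0):
    """Equation 9: one Poisson vector is drawn, then scaled by each n_i."""
    rng = random.Random(seed)
    noise = [poisson_sample(rng, lam) for _ in two_theta]  # drawn once, fixed
    peak = gaussian(two_theta, center, sigma)
    patterns = []
    for n_i in amplitudes:
        raw = [g + n_i * p for g, p in zip(peak, noise)]
        total = sum(raw)
        patterns.append([v / total for v in raw])  # unit-sum normalization
    return patterns

two_theta = [20.0 + 0.05 * k for k in range(1001)]  # hypothetical grid
patterns = noisy_patterns(two_theta, 40.0, 0.5, amplitudes=[0.0, 0.01, 0.05])
```

Because the noise vector is drawn once and only its amplitude changes, differences between the returned patterns reflect the scaling $n_i$ alone and not a new random realization.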

(6) Peak Shape

The sixth and final case considered is that of peak shape. As previously discussed, XRD patterns are usually fit to the pseudo-Voigt formula, which is a mix of the Gaussian and Lorentzian distributions,

$$ I(\eta_i) = \eta_i L(2\theta_0, \sigma_0) + (1 - \eta_i) G(2\theta_0, \sigma_0), \tag{10} $$

where $\eta_i$ controls the peak shape transition. In this study, synthetic peaks are varied from a purely Gaussian peak ($\eta_i = 0$) to a Lorentzian distribution ($\eta_i = 1$).
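The mixing in Equation 10 can be sketched as below. Two assumptions are made for illustration: the Lorentzian uses the same width parameter $\sigma_0$ as its half-width, and both components are unit-height before mixing, with the unit-sum normalization applied afterwards. The function name and grid are hypothetical.

```python
import math

def pseudo_voigt(two_theta, center, width, eta):
    """Equation 10: eta * Lorentzian + (1 - eta) * Gaussian, then unit sum.

    Both components share one width parameter (assumption made here).
    """
    raw = []
    for t in two_theta:
        u = (t - center) / width
        lorentz = 1.0 / (1.0 + u * u)
        gauss = math.exp(-0.5 * u * u)
        raw.append(eta * lorentz + (1.0 - eta) * gauss)
    total = sum(raw)
    return [v / total for v in raw]

two_theta = [20.0 + 0.05 * k for k in range(1001)]  # hypothetical grid
shapes = {
    eta: pseudo_voigt(two_theta, 40.0, 0.5, eta) for eta in (0.0, 0.5, 1.0)
}
```

The Lorentzian end of the transition carries far more weight in the tails than the Gaussian end, which is where bin-to-bin metrics that use relative differences are expected to respond most strongly.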

XRD Multi-Peak Analysis

XRD patterns with multiple peaks are generated using two different atomistic simulation tools to explore the capability of a metric to identify more complex XRD profile behavior. First, virtual diffraction patterns are created using the XRDcompute package [33] within the molecular dynamics framework, LAMMPS [34]. The LAMMPS XRD computation uses kinematic diffraction theory to compute diffraction intensities in 3-D reciprocal space based on all atoms within the simulations. The diffraction intensities are then binned by their corresponding scattering angle to generate $2\theta$ profiles. For the purpose of computing similarities, the data is converted into a probability distribution function by normalizing by the cumulative sum of the different intensity bins (i.e., $\bar{x}_i = \hat{x}_i / \sum_i \hat{x}_i$). This procedure guarantees that the patterns have the properties outlined in the previous sections, which enables us to effectively bound the metrics as shown in the Supplementary Information.

The first multi-peak case study uses similarity metrics to quantify size effects inherent to the LAMMPS XRD computation, to achieve pattern convergence. Because the LAMMPS XRD computation calculates diffraction intensities across 3-D reciprocal space, it inherently finds points near Bragg conditions that have non-zero intensities due to the truncation of the intensity calculations over the finite number of atoms. These non-zero intensities located near Bragg peaks form into relrods in 3-D space, which are observed experimentally during diffraction studies of small volumes. However, most studies are typically interested in exploring bulk structures whose XRD patterns do not contain effects stemming from relrods. Thus, to model these bulk XRD patterns using the LAMMPS XRD utility, the computation should be expanded over larger simulation cells until the diffraction pattern converges towards a representative bulk XRD pattern. To negate the relrod effect entirely, the LAMMPS XRD computation may require large supercells (100,000s to 1,000,000s of atoms). In this work, similarity metrics are used to quantify the rate of convergence for XRD patterns generated from different crystal structures as a function of unit cell replication (i.e., simulation cell volume or number of atoms).

The second multi-peak case study uses XRD patterns created using the RIETAN-FP method [35] found in the Visualization for Electronic and Structural Analysis software package (VESTA) [36]. The RIETAN-FP method computes diffraction intensities strictly at Bragg conditions that are found based on the atomic position and crystal symmetry group of the unit cell. The diffraction intensities are computed using kinematic approaches that take advantage of the Debye scattering equations. Because the RIETAN-FP method computes diffraction intensities only at Bragg conditions, no relrods will be observed and size convergence is not a concern. This case study explores the outlined metrics' ability to capture peak changes associated with straining of a unit cell. For this case, face-centered cubic Cu simulations are initially created with a 3.615 Å lattice parameter [37]. Then, subsequent simulations increase the lattice parameter to mimic elastic straining of the unit cell and to calculate the corresponding XRD patterns. Systematically increasing the lattice parameter causes coordinated peak displacements associated with elastic strain, which are further examined using the similarity metrics.
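The coordinated peak displacement can be estimated directly from Bragg's law for a cubic lattice, $\lambda = 2 d_{hkl} \sin\theta$ with $d_{hkl} = a / \sqrt{h^2 + k^2 + l^2}$. The sketch below is independent of RIETAN-FP and only predicts peak positions, not intensities. The Cu K$\alpha$ wavelength, the strain values, and the list of reflections are assumptions introduced here; only the 3.615 Å lattice parameter comes from the text.

```python
import math

CU_K_ALPHA = 1.5406  # angstrom; assumed wavelength, not stated in the text

def two_theta_cubic(a, hkl, wavelength=CU_K_ALPHA):
    """Bragg angle 2-theta (degrees) for plane (h, k, l) of a cubic lattice."""
    h, k, l = hkl
    d_spacing = a / math.sqrt(h * h + k * k + l * l)
    sin_theta = wavelength / (2.0 * d_spacing)
    if sin_theta > 1.0:
        raise ValueError("reflection not accessible at this wavelength")
    return 2.0 * math.degrees(math.asin(sin_theta))

a0 = 3.615                                    # angstrom, fcc Cu
reflections = [(1, 1, 1), (2, 0, 0), (2, 2, 0)]  # first fcc-allowed peaks
strains = [0.0, 0.005, 0.01]                  # hypothetical isotropic strains

positions = {
    strain: [two_theta_cubic(a0 * (1.0 + strain), hkl) for hkl in reflections]
    for strain in strains
}
```

Under these assumptions every peak moves to lower angle as the lattice parameter grows, and higher-angle reflections move further than lower-angle ones, so a strained pattern is not a rigid translation of the unstrained one.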

Results and Discussion

This section features application of the proposed framework to quantitatively compare XRD patterns. First, single-peak patterns are systematically modified to capture a metric's sensitivity to 1-D peak features. Secondly, the framework is implemented to assess how similarity metrics can be used to quantify differences between XRD patterns containing multiple peaks. The latter employs two case studies to understand supercell structural convergence and isotropic straining evolution. The resulting heat bands were analyzed by hierarchical clustering to observe whether responses behave equivalently across different peak features.

Synthetic Single Peaks

Systematic changes were used to identify which similarity metrics are most and least sensitive in analyzing expected XRD features. Figure 3 shows an example of how all metrics behave as a function of peak shift. Each curve corresponds to a metric's response as a function of peak center separation. Clearly, there is a wide range over which these metrics can capture dissimilarity between two synthetic peaks. However, many of these metrics share similar response characteristics. For instance, one set of metrics is extremely sensitive to the slightest deviation between peak centers (shown in black), while others are not very sensitive (shown in red). Metrics in the peak-shift-insensitive group (red) have a similarity measure of $s = 1$ despite the fact that the peak centers are separated by approximately 8 times the standard deviation (i.e., little overlap, $\Delta(2\theta_0) \sim 4^\circ = 8\sigma_0$). Another metric group (shown in pink) decreases monotonically in similarity, but at a relatively slow rate in comparison to the majority of similarity metrics (black curves). While these trends are visually apparent from the curves, condensing the 1-D information to heat bands (i.e., 1-D heat maps) is necessary for comparing all the metrics over all peak features. A few representative heat bands are shown as three plots above the 1-D curves in Figure 3. Each heat band corresponds to the evolution of a metric (shown as dashed curves) as the peak feature is altered, e.g., the Kulczynski metric's sensitivity to peak shift is represented by the red dashed curve (Figure 3, below) and the $S_{\mathrm{Kul}}$ heat band (Figure 3, above).
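The peak-shift experiment can be sketched with two Gaussian peaks and a pair of simple similarity measures. This is an illustration under stated assumptions rather than the procedure used in the study: the peak center (40°), the angular grid, and the choice of cosine and intersection similarity as stand-ins for the 49 metrics are all assumptions. The width $\sigma_0 = 0.5^\circ$ follows from the $4^\circ = 8\sigma_0$ relation quoted above.

```python
import math

SIGMA = 0.5  # degrees; from the 4 deg = 8 sigma relation in the text


def gaussian_peak(grid, center, sigma=SIGMA):
    """Gaussian peak sampled on `grid`, normalized so the intensities sum to 1."""
    values = [math.exp(-0.5 * ((x - center) / sigma) ** 2) for x in grid]
    total = sum(values)
    return [v / total for v in values]


def cosine_similarity(p, q):
    """Normalized inner product of two patterns (1 means identical shape)."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))


def intersection_similarity(p, q):
    """Overlap of two sum-normalized patterns (1 means identical, 0 means disjoint)."""
    return sum(min(a, b) for a, b in zip(p, q))


# Hypothetical grid and reference peak position, chosen only for illustration.
grid = [30.0 + 0.01 * i for i in range(2001)]   # 30 to 50 degrees
reference = gaussian_peak(grid, 40.0)

shifts = [0.0, 0.5, 1.0, 2.0, 4.0]              # peak center separation in degrees
curves = {"cosine": [], "intersection": []}
for shift in shifts:
    shifted = gaussian_peak(grid, 40.0 + shift)
    curves["cosine"].append(cosine_similarity(reference, shifted))
    curves["intersection"].append(intersection_similarity(reference, shifted))

for name, values in curves.items():
    print(name, [round(v, 4) for v in values])
```

Each list in `curves` is one similarity-versus-shift curve; coloring its entries by value would give the corresponding heat band.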

The heat bands for the set of 49 similarity metrics across the six outlined XRD features are shown in Figure 4. Similarity metrics are grouped by family as described by Cha, and the exact metric ordering corresponds to the order used to define these in the Supplementary Information. Several aspects of interest can be identified by comparing the different peak features shown in Figure 4. First, a metric's response is not necessarily uniform across members of the same family. For instance, the Kulczynski metric of the $L_1$ family is largely insensitive to peak shift, while the other family members are able to capture dissimilarities after a $\sim 1.5^\circ$ shift. Furthermore, the vast majority of the metrics explored are very insensitive to peak shape transition, while highly sensitive to peak shifting. This can be observed from the large range of low similarity values (shown in blue) within the heat map as compared to the other behaviors.

Figure 3: (bottom panel) Similarity as a function of peak shift for the complete set of metrics studied. While the evolution of some similarity metrics is similar, other evolution curves deviate substantially. An example of the different criteria used to rank order the various metrics is shown in blue by the intersection of $\Delta(2\theta_0)$ values with similarity values of 0.00, 0.25, 0.50, and 0.75 (shown as blue lines) on the blue similarity evolution curve. (top panel) Heat bands for three metric responses with different degrees of peak shift sensitivity. The bands are representative of the black, blue and red dashed-line similarity responses.

To quantify the sensitivity of the various similarity metrics, each metric was analyzed using a set of four criteria, which are shown in Figure 3 as the blue lines intersecting the blue curve. These criteria determine a scalar value that reflects the amount of change in the pattern required to produce similarity values of 0, 0.25, 0.50, or 0.75. The metrics can then be rank ordered by these values to define the sensitivity to each peak feature. For instance, it could be important to use a metric that is very sensitive to peak shifting but less sensitive to peak broadening, such as the Clark metric. Using this analysis method, higher rankings indicate a metric that is more sensitive to a specific peak behavior and lower rankings indicate less sensitivity. The resulting metric rankings are outlined in the Supplementary Information.

Comparing heat bands in Figure 4 from different XRD features shows that no similarity metric is universally the least or most sensitive to all XRD peak features. This is shown by the large variability in the 1-D heat bands across the different features. For instance, the Minkowski family is very sensitive to peak background and relatively insensitive to peak shape transition. If a peak undergoes both changes simultaneously, a synergistic metric response is expected. This is true for cases where noise is coupled to each of the other peak behaviors, as shown in the Supplementary Information. Therefore, when considering the use of a metric for experimentally measured XRD data, this coupling effect must be taken into account.
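The four threshold criteria can be illustrated by locating where a similarity curve first falls to a given value. The sketch below is one plausible reading of that procedure, not the authors' implementation: the linear interpolation between samples, the two hypothetical response curves, and the function name `change_at_threshold` are assumptions introduced here.

```python
def change_at_threshold(changes, similarities, threshold):
    """Return the feature change at which similarity first drops to `threshold`.

    Linearly interpolates between neighbouring samples; returns None when the
    curve never reaches the threshold over the sampled range.
    """
    for i, s in enumerate(similarities):
        if s <= threshold:
            if i == 0:
                return changes[0]
            x0, x1 = changes[i - 1], changes[i]
            s0, s1 = similarities[i - 1], s   # s0 > threshold >= s1 here
            return x0 + (s0 - threshold) * (x1 - x0) / (s0 - s1)
    return None


# Hypothetical responses to a peak shift (degrees): one fast and one slow decay.
changes = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
responses = {
    "fast_metric": [1.0, 0.75, 0.5, 0.25, 0.0, 0.0, 0.0],
    "slow_metric": [1.0, 1.0, 0.9, 0.8, 0.6, 0.4, 0.2],
}

thresholds = [0.75, 0.50, 0.25, 0.0]
criteria = {
    name: [change_at_threshold(changes, sims, t) for t in thresholds]
    for name, sims in responses.items()
}

# For one criterion, a smaller required change means a more sensitive metric.
reached = [n for n in criteria if criteria[n][0] is not None]
ranking_075 = sorted(reached, key=lambda n: criteria[n][0])
print(criteria)
print(ranking_075)
```

A metric that never reaches a threshold over the sampled range returns `None` here; how such cases were ranked in the study is not stated, so they are simply left out of the ordering.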

Figure 4: 1-dimensional heat bands corresponding to the set of metrics assessed in this study. Each map corresponds to a peak behavior, where the bands are grouped by metric family. (Color bar: similarity, from 0 to 1.)

Metric Ranking Across Features

An integral-based ranking was performed (Figure 5) to understand the sensitivity of different similarity metrics to peak features. Calculated similarities as a function of peak feature were numerically integrated to determine the area under each curve, $\sum_{i=1}^{N} s_{m,i}/\Delta n$, where $s_{m,i}$ is the similarity measured by metric $m$ at bin $i$ and $\Delta n$ is the bin width for the peak feature of interest. This analysis quantifies the metric's sensitivity for each peak feature over the full feature range considered. Therefore, if the area under the similarity curve is small, the metric is deemed to be very sensitive to a peak feature, and vice versa. The metrics were then ranked in order of increasing integrals, as shown in the Supplementary Information. While several criteria were considered in the ranking procedures, the integral criterion is presented here because it is able to capture metric uniqueness, whereas the other criteria yield more clustered sensitivities. Nonetheless, similar analyses for all four criteria were performed and the remaining figures are available in the Supplementary Information.
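The integral criterion can be sketched as a sum over bins followed by a sort. This is a minimal illustration with hypothetical metric names and similarity values. It assumes a Riemann sum scaled by the bin width and assigns rank 1 to the smallest area; because all curves for one feature share the same bin width, the constant scaling does not change the ordering.

```python
def area_under_curve(similarities, bin_width):
    """Riemann-sum estimate of the area under a sampled similarity curve."""
    return sum(similarities) * bin_width


def rank_by_area(curves, bin_width):
    """Rank metrics by increasing area; rank 1 is the smallest area (most sensitive)."""
    areas = {name: area_under_curve(s, bin_width) for name, s in curves.items()}
    ordered = sorted(areas, key=areas.get)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# Hypothetical similarity curves for one peak feature, sampled at equal bins.
curves = {
    "metric_A": [1.0, 0.2, 0.0, 0.0],
    "metric_B": [1.0, 0.9, 0.8, 0.6],
    "metric_C": [1.0, 0.5, 0.3, 0.1],
}

ranks = rank_by_area(curves, bin_width=0.5)
print(ranks)
```

Repeating `rank_by_area` for each of the six peak features gives one rank per metric per feature, which is the input needed for comparing rankings across features.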

Figure 5: Metric ranking linkage across six different expected peak features. The highlighted metrics correspond to those that are highly ranked for a feature, but lowly ranked for another (i.e., more and less sensitive, respectively). (Vertical axis: rank, from low to high.)

Metrics of interest (MI) were identified by calculating the maximum rank ($R$) transition across the different peak features, i.e.,

$$\mathrm{MI} = \{ m \in M : \max(R_m) - \min(R_m) \geq R_{\mathrm{thresh}} \} \tag{11}$$

where $R_m$ is a metric's rank for a particular peak feature. A threshold rank difference of $R_{\mathrm{thresh}} = 30$ was applied in this analysis to identify metrics that are sensitive to one (or more) peak features and relatively insensitive to one (or more) other peak features. The peak features were ordered in such a way that features can be attributed to structural changes (shifting, broadening and splitting) and to the instrumental/experimental setup (background, noise and shape). One caveat to this classification is that shape transition clearly has both material and instrument contributions. The MI were then clustered by overall behavior, as shown in Figure 5. The blue cluster corresponds to metrics that are very sensitive to peak shifting and broadening, but less so to the other features. The green cluster highlights a group that is ranked as more sensitive to background, noise and shape, and less sensitive to peak shifting, broadening and splitting. Red curves show a less clear pattern between structural and instrumental features. This analysis provides a useful way of highlighting which metrics can best capture a particular feature. For instance, the Wave Hedges metric could be useful when optimizing an instrumental setup by reducing noise and background, since it is more sensitive to these features than to the structurally dependent peak features.
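The selection rule of eq 11 amounts to a filter on the spread of each metric's ranks. The sketch below applies it to hypothetical rank data; the metric names and rank values are invented for illustration, and only the threshold of 30 comes from the text.

```python
R_THRESH = 30  # rank-difference threshold used in the text


def metrics_of_interest(ranks, r_thresh=R_THRESH):
    """Return metrics whose rank varies by at least r_thresh across peak features."""
    return sorted(
        name for name, r in ranks.items() if max(r) - min(r) >= r_thresh
    )


# Hypothetical ranks (1 to 49) for six features, ordered as:
# shifting, broadening, splitting, background, noise, shape.
ranks = {
    "metric_A": [2, 5, 40, 45, 44, 38],     # spread 43: selected
    "metric_B": [20, 22, 25, 21, 19, 24],   # spread 6: not selected
    "metric_C": [47, 30, 10, 12, 9, 16],    # spread 38: selected
}

selected = metrics_of_interest(ranks)
print(selected)
```

Metrics with a small spread respond similarly to every feature, so they are filtered out; the ones that remain are those that can discriminate between feature types.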

Multi-Peak Convergence to Optimal Supercell

The LAMMPS XRD computation can be beneficial in uncovering subtle structural differences. However, because of how the algorithm is implemented in LAMMPS (more details in Coleman et al., ref 33), a large simulation cell domain is required to compute bulk XRD patterns and minimize artifacts due to relrods in the patterns, which increases computational cost. This is required, though, to enable virtual diffraction patterns over multimillion atom simulations. Hence, the ability to analyze a system with sufficient fidelity at a reasonable computational cost (i.e., a minimal number of atoms) is often desired. The framework outlined in previous sections was implemented to assess how similarity metrics can quantitatively capture supercell size convergence for different crystallographic systems (and different degrees of symmetry). The systematic increase in supercell size in conjunction with the outlined framework enables the optimal supercell size to be determined in terms of the tradeoff between minimizing diffraction artifacts and maximizing computational efficiency.

Herein, similarities between a set of computed XRD patterns were measured as the supercell size (and number of atoms) increases. Three crystal structures were explored to analyze the influence of symmetry on pattern convergence. Unit cells of body-centered cubic (bcc) Ta, face-centered cubic (fcc) Cu and diamond (dia) C were minimized using the potentials developed by Purja Pun et al. for Cu-Ta (ref 38) and Baskes for C (ref 39). The minimized structures were then systematically replicated to compute the XRD pattern as the system size increases. For each crystal structure, the unit cell was replicated the same number of times, $\{N \times N \times N : N \in [10, 20, 30, 40, 50, 60, 70]\}$. The resulting converging XRD patterns for the bcc case are shown in Figure 6, alongside the similarity heat map as measured by Clark. As the system size increases, the diffraction peaks become sharper (more defined) as the relrod artifacts diminish. As shown, the Clark metric is able to capture how similar two patterns are to each other, but it does not indicate whether a converged supercell volume is achieved. The inner product metric exhibits an intrinsic convergence-like behavior that is useful for determining the optimal supercell volumes. Hence, as the supercell volume increases, the relrod effects are minimized and the diffraction intensities are accentuated, resulting in the inner product converging to a maximum value.

The convergence results for the different crystal structures are shown in Figure 7. The $N \times N$ heat map for each metric shows the rate of convergence to a stable pattern, as shown in the bottom heat maps in Figure 7. This set for $N$ was chosen because the higher symmetry crystal structure (bcc) was deemed to converge at $70 \times 70 \times 70$.

Figure 6: XRD pattern evolution as the Ta (bcc) supercell volume increases. The XRD responses converge as the number of atoms increases. The Clark heat map is included to illustrate the transition from identical to completely dissimilar XRD patterns. (Panels: bcc Ta patterns; Clark heat map.)

It should be noted that while fcc and dia seem to converge at similar rates, the dia convergence will be more expensive as there are more atoms per unit cell. In the case of dia (i.e., lower symmetry crystal structure), computing the XRD curves from XRDcompute for the $70 \times 70 \times 70$ cell volume is quite expensive. Hence, while the lower symmetry cases did not converge to an optimized cell size, this analysis shows how the convergence rate is affected by the crystal structure symmetry, as shown in Figure 7. A feature observed across the lower symmetry crystal structures (fcc and dia) is that the responses begin to converge at the same number of unit cell replications. Analyzing convergence as a function of cell size can be misleading, though, since the number of atoms is drastically different across the different structures. This analysis shows that the inner product metric could prove useful for measuring response convergence for any number of probability distribution functions. Furthermore, this convergence test suggests that a minimum of 500,000 atoms should be considered to minimize size effects, even for high symmetry structures such as bcc. For the lower symmetry structures considered (i.e., fcc and dia), supercells containing over 1,000,000 atoms should be considered.
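The atom counts behind these recommendations can be checked with simple arithmetic. The sketch below assumes that conventional cubic unit cells were replicated (2 atoms for bcc, 4 for fcc, 8 for diamond); the text does not state which cell was used, so the totals are illustrative.

```python
# Atoms per conventional cubic unit cell (assumed cell choice, see lead-in).
ATOMS_PER_CELL = {"bcc": 2, "fcc": 4, "dia": 8}
REPLICATIONS = [10, 20, 30, 40, 50, 60, 70]  # N values from the text


def atom_count(structure, n):
    """Number of atoms in an N x N x N supercell of the given structure."""
    return ATOMS_PER_CELL[structure] * n ** 3


def smallest_replication(structure, min_atoms, candidates=REPLICATIONS):
    """Smallest sampled N whose supercell holds at least min_atoms, or None."""
    for n in candidates:
        if atom_count(structure, n) >= min_atoms:
            return n
    return None


for structure in ATOMS_PER_CELL:
    counts = {n: atom_count(structure, n) for n in REPLICATIONS}
    print(structure, counts)

print(smallest_replication("bcc", 500_000))
print(smallest_replication("fcc", 1_000_000))
```

Under this assumption the same replication $N$ gives four times as many atoms for dia as for bcc, which is why comparing convergence by $N$ alone can mislead.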

Figure 7: Pattern similarity convergence as a function of number of atoms as calculated for the inner product metric. The inner product metric exhibits a behavior that lends itself to convergence studies. As expected, higher symmetry crystals (e.g., bcc) converge at a faster rate. (Panels: bcc, fcc, dia.)

Multi-Peak Isotropic Cell Straining

The second multi-peak case study used similarity metrics to identify peak changes caused by idealized elastic strain. The idealized elastic straining is simulated by increasing the lattice parameter from a reference Cu lattice parameter (ref 37) and remapping atoms accordingly. This approach allows the flexibility to analyze metric responses due solely to the unit cell volumetric changes without introducing artifacts (i.e., shifting of atoms due to force equilibrations as calculated by interatomic potentials). The resulting XRD responses are shown in Figure 8, where the VESTA software package calculated the expected XRD pattern changes of the strained unit cell, not considering internal atomic relaxations. As expected from Bragg's law, the peaks are shifted to smaller diffraction angles as the cell is expanded.

Figure 8: (Top) Calculated XRD patterns of a hydrostatically strained Cu unit cell. Diffraction peaks shift to lower diffraction angles and the width of the peaks narrows. (Bottom) Renormalized similarity of the (a) most and (b) least sensitive members of each family of metrics.

The renormalized similarity metrics, shown in Figure 8, quantify how these XRD patterns diverge as a function of isotropic straining. Instead of plotting all 49 metrics, only the most and least sensitive metrics for each family are included. The Inner Product and Minkowski families yield highly sensitive responses to straining regardless of the metric used. This is useful for identifying when a microstructure experiences measurable straining. However, it is also useful to measure the amount of straining as captured by the XRD response. For such a case, the $\chi^2$ family (or the least sensitive of the $L_1$ family, the Kulczynski metric) is better suited, since a smooth transition from identical to completely dissimilar responses is observed. Finally, the Clark metric is identified as the most sensitive from its family, but is not as sensitive as the other families. The presence of multiple peaks helps to explain this trend. The rank ordering with the integral criterion (Figure 5) is in better agreement with Figure 8 than the rank ordering via the 75% sensitivity criterion in the Supplementary Information. For instance, Jaccard was identified as more sensitive than Clark for the integral criterion, which is captured in the multi-peak straining analysis. Nonetheless, Clark provides a good balance between sensitivity to XRD pattern divergence and a diffusive transition from identical patterns to completely dissimilar patterns, making it an effective metric for analyzing XRD patterns.

Multi-Peak Pattern Clustering Analysis

A comprehensive understanding of a metric's performance when evaluating multi-peak patterns can be achieved by performing a hierarchical clustering analysis. An equivalent analysis for the single-peak case study is presented in the Supplementary Information. The metric response dendrograms for the two individual case studies are shown in Figure 9. These dendrograms capture how the metrics cluster with convergence and straining, which are analogous to peak broadening and peak shifting, respectively. The clustering in these dendrograms was performed using a Euclidean distance between the sampled metrics. It is observed that the metrics identified with the clustering analysis do not cluster according to the family of similarity metrics, and the clusters are not correlated between the two case studies. A number of metrics fall in the lower distance clusters (e.g., the Euclidean and angle-based metrics), in part because these metrics are all very sensitive to changes in the multi-peak patterns, which may or may not be desired. Additionally, the two individual cases can be combined and clustered (by restricting the straining to $\varepsilon = 7\%$) to perform a combined hierarchical clustering analysis (not shown), which showed that clustering of metrics heavily depends on the peak features in the multi-peak patterns. The lack of universal clustering patterns is not surprising, as the single-peak patterns show that the resulting heat band clusters share no obvious commonality across the peak features. Hence, the relationship observed by Cha for the random peak distribution (i.e., that certain metrics cluster in responses) is not observed in this study, in part because of the wide range of variation in the patterns.
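The clustering step can be sketched as agglomerative clustering of metric response vectors under a Euclidean distance. The text specifies the distance but not the linkage rule, so average linkage is assumed here; the metric names and response values are hypothetical, and the small pure-Python routine stands in for whatever clustering software was actually used.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two response vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(responses, n_clusters=2):
    """Merge the closest clusters (average linkage) until n_clusters remain."""
    clusters = [[name] for name in responses]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [
                    euclidean(responses[a], responses[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                d = sum(dists) / len(dists)  # average pairwise distance
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarity responses of four metrics over four strain levels.
responses = {
    "metric_A": [1.0, 0.2, 0.0, 0.0],
    "metric_B": [1.0, 0.3, 0.1, 0.0],
    "metric_C": [1.0, 0.9, 0.8, 0.6],
    "metric_D": [1.0, 1.0, 0.9, 0.7],
}

groups = agglomerate(responses, n_clusters=2)
print(groups)
```

Stopping at two clusters mirrors the two-cluster distance threshold used to color the dendrograms; recording the merge distances instead of stopping early would give the full dendrogram.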

Figure 9: Hierarchical clustering analysis of the sampled metrics for the supercell size (left) and strain (right) case studies. The clustering was performed using a Euclidean distance ($L_2$ norm) between the sampled metrics. The red and green branches of the dendrograms correspond to a distance threshold between cluster members for a size of two clusters, meant to indicate that these two clusters are substantially different from each other. This is also evident from the heat bands, which show the similarity between the initial cell (left column) and the larger supercell or strained configurations. (Axes: supercell size $N$; strain $\varepsilon$. Color bar: similarity.)

Summary

A framework for employing similarity metrics to quantify differences between single- and multi-peak XRD patterns is presented. The set of 49 metrics outlined by Cha was used in measuring similarities between XRD patterns. Several commonly observed XRD features were systematically applied to quantify the ability of similarity metrics to capture changes in a synthetic single-peak pattern. Multi-peak patterns, more representative of experimental XRD, were then produced for supercell convergence and unit cell straining studies. Hierarchical clustering analysis was performed on both single- and multi-peak studies, showing how metrics cluster as a function of the XRD peak features. Based on the metric families outlined by Cha, it is shown that certain families are better suited to capturing common XRD features. Several key observations were made in this study:

• no metric was found to be universally most or least sensitive across all peak features,

• the metric response behavior is not homogeneous across members of a given family,

• the relrod artifacts from the LAMMPS XRD computation require large supercells for convergence of the XRD pattern (e.g., 500,000 atoms even for the higher symmetry bcc),

• the Clark metric yields a good balance between sensitivity and smooth similarity measurement,

• the metric sensitivity to multi-peak pattern changes is in agreement with the sensitivity to the single-peak feature changes using the integral criterion,

• the clustering analysis reinforces the observation that family members yield heterogeneous responses and that 1-D studies cannot be easily extrapolated to multi-peak cases, and

• the clustering behavior observed by Cha was specific to the random peak pattern.


As was shown, structural differences can be readily captured by virtual XRD. Hence, a potential application for the framework is that of validating interatomic potentials developed for classical molecular dynamics. It was demonstrated that for a high-fidelity validation, a large supercell volume must be used with the LAMMPS XRD-compute algorithm to reduce artifacts stemming from small cell dimensions and to approximate bulk XRD patterns (e.g., the symmetry-dependent pattern convergence study). This study shows that care must be taken when choosing a metric to study similarities between sets of XRD patterns, as their ability to identify XRD features is largely heterogeneous and depends on peak features.

Supporting Information Available

The following files are available free of charge. The Supplementary Information contains the equations for the various similarity metrics used herein, the ranking for the various similarity metrics for different peak features and criteria, and additional analysis regarding the similarity metrics.

Acknowledgement

This research was supported in part by an appointment to the Postgraduate Research Participation Program at the U.S. Army Research Laboratory (ARL) administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and ARL.

References

(1) Thomas, J. M. Centenary: The birth of X-ray crystallography. Nature 2012, 491, 186–187.

(2) Darling, K.; Roberts, A.; Armstrong, L.; Kapoor, D.; Tschopp, M.; Kecskes, L.; Mathaudhu, S. Influence of Mn solute content on grain size reduction and improved strength in mechanically alloyed Al-Mn alloys. Materials Science and Engineering: A 2014, 589, 57–65.

(3) Atwater, M. A.; Roy, D.; Darling, K. A.; Butler, B. G.; Scattergood, R. O.; Koch, C. C. The thermal stability of nanocrystalline copper cryogenically milled with tungsten. Materials Science and Engineering: A 2012, 558, 226–233.

(4) Darling, K.; VanLeeuwen, B.; Koch, C.; Scattergood, R. Thermal stability of nanocrystalline Fe-Zr alloys. Materials Science and Engineering: A 2010, 527, 3572–3580.

(5) Kainuma, R.; Satoh, N.; Liu, X.; Ohnuma, I.; Ishida, K. Phase equilibria and Heusler phase stability in the Cu-rich portion of the Cu-Al-Mn system. Journal of Alloys and Compounds 1998, 266, 191–200.

(6) Jasiewicz, K.; Cieslak, J.; Kaprzyk, S.; Tobola, J. Relative crystal stability of AlxFeNiCrCo high entropy alloys from XRD analysis and formation energy calculation. Journal of Alloys and Compounds 2015, 648, 307–312.

(7) Billinge, S. J. L.; Levin, I. The Problem with Determining Atomic Structure at the Nanoscale. Science 2007, 316, 561–565.

(8) Biswas, P.; Tafen, D. N.; Drabold, D. A. Experimentally constrained molecular relaxation: The case of glassy GeSe2. Phys. Rev. B 2005, 71, 054204.

(9) Markmann, J.; Yamakov, V.; Weissmüller, J. Validating grain size analysis from X-ray line broadening: A virtual experiment. Scripta Materialia 2008, 59, 15–18.

(10) Stukowski, A.; Markmann, J.; Weissmüller, J.; Albe, K. Atomistic origin of microstrain broadening in diffraction data of nanocrystalline solids. Acta Materialia 2009, 57, 1648–1654.

Page 31 of 34

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

ACS Combinatorial Science

(11) Materials Genome Initiative for global competitiveness. Executive Oce of the President, National Science and Technology Council, 2011.

(12) de Pablo, J. J.; Jones, B.; Kovacs, C. L.; Ozolins, V.; Ramirez, A. P. The Materi-

Current

als Genome Initiative, the interplay of experiment, theory and computation.

Opinion in Solid State and Materials Science 2014, 18, (13) Rajan, K. Materials informatics.

99  117.

Materials Today 2005, 8,

38  45.

(14) Rajan, K. Materials Informatics: The Materials Gene and Big Data.

of Materials Research 2015, 45, (15) LeBras, R.;

Damoulas, T.;

Dover, R. B. In

Annual Review

153169.

Gregoire, J. M.;

Sabharwal, A.;

Gomes, C. P.;

van

Principles and Practice of Constraint Programming  CP 2011: 17th In-

ternational Conference, CP 2011, Perugia, Italy, September 12-16, 2011. Proceedings ; Lee, J., Ed.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2011; pp 508522.

(16) Ermon, S.; Le Bras, R.; Suram, S. K.; Gregoire, J. M.; Gomes, C. P.; Selman, B.; van Dover, R. B. Pattern Decomposition with Complex Combinatorial Constraints: Application to Materials Discovery. Proceedings of the 29th International Conference on Articial Intelligence. 2015; pp 636643.

(17) Fischer, C. C.; Tibbetts, K. J.; Morgan, D.; Ceder, G. Predicting crystal structure by merging data mining with quantum mechanics.

Nat Mater 2006, 5,

641646.

(18) Curtarolo, S.; Hart, G. L.; Nardelli, M. B.; Mingo, N.; Sanvito, S.; Levy, O. The highthroughput highway to computational materials design.

Nat Mater 2013, 12,

191201.

(19) Kusne, A. G.; Gao, T.; Mehta, A.; Ke, L.; Nguyen, M. C.; Ho, K.-M.; Antropov, V.; Wang, C.-Z.; Kramer, M. J.; Long, C.; Takeuchi, I. On-the-y machine-learning for high-throughput experiments: search for rare-earth-free permanent magnets.

Reports 2014, 4,

6367.

30

ACS Paragon Plus Environment

Scientic

ACS Combinatorial Science

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 34

(20) Potyrailo, R.; Rajan, K.; Stoewe, K.; Takeuchi, I.; Chisholm, B.; Lam, H. Combinatorial and High-Throughput Screening of Materials Libraries: Review of State of the Art.

Combinatorial Science 2011, 13,

ACS

579633.

(21) QBIC project: querying images by content, using color, texture, and shape. 1993; pp 173187.

(22) Rubner, Y.; Tomasi, C.; Guibas, L. J. The Earth Mover's Distance as a Metric for Image Retrieval.

International Journal of Computer Vision 2000, 40,

99121.

(23) Long, C. J.; Hattrick-Simpers, J.; Murakami, M.; Srivastava, R. C.; Takeuchi, I.; Karen, V. L.; Li, X. Rapid structural mapping of ternary metallic alloy systems using the combinatorial approach and cluster analysis.

2007, 78,

Review of Scientic Instruments

072217.

(24) Kusne, A.; Keller, D.; Anderson, A.; Zaban, A.; Takeuchi, I. High-throughput determination of structural phase diagram and constituent phases using GRENDEL.

otechnology 2015, 26,

Nan-

444002.

(25) Hattrick-Simpers, J. R.; Gregoire, J. M.; Kusne, A. G. Perspective:

Composition

structureproperty mapping in high-throughput experiments: Turning data into knowledge.

APL Mater. 2016, 4,

(26) Deza, M.; Deza, E.

053211.

Encyclopedia of Distances ; Springer-Verlag Berlin Heidelberg, 2009;

pp 1590.

(27) Cha, S.-H. Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.

International Journal of Mathematical Models and Methods in

Applied Sciences 2007, 4,

300307.

(28) Shirkhorshidi, A.; Aghabozorgi, S.; Wah, T. A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data.

PLoS ONE 2015, 10, e0144059.

31

ACS Paragon Plus Environment

Page 33 of 34

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

ACS Combinatorial Science

(29) Tschopp, M.; Hernandez, E. Quantifying Similarity and Distance Measures for Vectorbased Datasets:

Histograms, Signals, and Probability Distribution Functions.

TN-XXXX 2017,

ARL-

120.

(30) Gan, G.; Ma, C.; Wu, J.

Data Clustering: Theory, Algorithms, and Applications ; SIAM

Series on Statistics and Applied Mathematics, 2007; Chapter 6, pp 67106.

(31) Goshtasby, A.

tion ;

Image Registration, Advances in Computer Vision and Pattern Recogni-

Springer: London, UK, 2012; pp 766.

(32) Shepard, R. Toward a universal law of generalization for psychological science.

1987, 237,

Science

13171323.

(33) Coleman, S.; Spearot, D.; Capolungo, L. Virtual diraction analysis of Ni [010] symmetric tilt grain boundaries.

neering 2013, 21,

Modelling and Simulation in Materials Science and Engi-

055020.

(34) Plimpton, S. Fast Parallel Algorithms for Short-Range Molecular-Dynamics.

of Computational Physics 1995, 117,

Journal

119.

(35) Izumi, F.; Momma, K. Three-Dimensional Visualization in Powder Diraction. APPLIED CRYSTALLOGRAPHY XX. 2007; pp 1520.

(36) Momma, K.; Izumi, F.

VESTA:

and structural analysis.

(37) RWG, W.

a three-dimensional visualization system for electronic

Journal of Applied Crystallography 2008, 41,

Crystal Structures,

653658.

2nd ed.; Interscience Publishers: New York, New York,

1963; pp 783.

(38) Purja Pun, G.; Darling, K.; Kecskes, L.; Mishin, Y. Angular-dependent interatomic potential for the Cu-Ta system and its application to structural stability of nanocrystalline alloys.

Acta Materialia 2015, 100,

377391.

32

ACS Paragon Plus Environment

ACS Combinatorial Science

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(39) Baskes, M. Modied embedded-atom potentials for cubic materials and impurities.

Rev B 1992, 46,

27272742.

33

ACS Paragon Plus Environment

Page 34 of 34

Phys