Using Data Analysis To Evaluate and Compare Chemical Syntheses


Full Paper

Using Data Analysis to Evaluate and Compare Chemical Syntheses
Dustin Kaiser, Jianbo Yang, and Georg Wuitschik
Org. Process Res. Dev., Just Accepted Manuscript • DOI: 10.1021/acs.oprd.8b00199 • Publication Date (Web): 24 Jul 2018



Organic Process Research & Development

Using Data Analysis to Evaluate and Compare Chemical Syntheses

Dustin Kaiser, Jianbo Yang, Georg Wuitschik‡

‡ [email protected], F. Hoffmann-La Roche Ltd., Basel, Process Chemistry & Catalysis, Small Molecules Technical Development.


Table of Contents Graphic:


Abstract

We present ChemPager, a freely available tool for systematically evaluating chemical syntheses. By processing and visualizing chemical data, the impact of past changes is uncovered and future work guided. The tool calculates commonly used metrics such as Process Mass Intensity (PMI), Volume-Time Output and production costs. In addition, a set of scores is introduced that aims to measure crucial but elusive characteristics such as process robustness, design and safety. Our tool employs a hierarchical data layout built on common software for data entry (Excel, Google Sheets etc.) and visualization (Spotfire). With all project data stored in one place, cross-project comparison and data aggregation become possible, as does cross-linking with other data sources or visualizations.

Keywords: Data Analysis; Process Development; Scoring Functions; Robustness; Cost Calculator

Introduction

Data analysis has become a pivotal tool in all areas of business to gain insight, evaluate performance, and make informed decisions. 1 Raw data is often unwieldy and has to be processed and visualized. 2 Chemists can choose among a plethora of indicators, such as greenness or costs, to judge aspects of their chemical processes. Many applications of data science in process chemistry have been reported for this purpose, ranging from PMI prediction, 3 process greenness evaluation, 4, 5, 6 process safety evaluation, 7 scoring of potential regulatory starting materials, 8 and definition of a good manufacturing process, 9 to evaluation of the entire process based on a score. 10 These, and commercially available solutions, are mostly implemented as self-contained spreadsheets, which limits customization, data mining and reusability. 11 Re-entering the same data over and over again leads to data silos. 12 It also compromises productivity and decreases data quality.

Our goal was to create a reusable data source that enables any chemist to review the history of a process, assess its current state and make informed decisions on where to focus future development work. 13 Minimizing the burden of manual data entry and generating a maximum of knowledge for all stakeholders is paramount for the adoption of any tool among those providing the data. ChemPager, a contraction of “chemistry one-pager”, is currently based on Google Sheets or Microsoft Excel for data entry and Tibco Spotfire for visualization. It is used at Roche in chemical development and is available in full for free as part of the supporting information. 14


Figure 1: Landing Page of the ChemPager application using data of the HBsAg project. This page provides an overview of a given project’s development history, providing key measures such as Process Mass Intensity (PMI), solvent distribution, production cost, Volume-Time Yield and a performance score.

ChemPager’s landing page is shown in Figure 1. It displays a brief overview on the left and two rows of visualizations. These summarize the history and current state of a given project. The top row displays the change of Process Mass Intensity (PMI), overall yield and Volume-Time Yield, as well as an aggregation of scores for robustness, economy, safety, greenness and project difficulty. The bottom row provides information on the total number of steps and the longest linear sequence, as well as production costs, which include costs for reagents, solvents, equipment and labor. Details of solvent usage are displayed on the bottom right. At the top, a line of tabs enables the user to analyze these indicators on a step-wise level and plan further development efforts. Users can interact with visualizations and change the scope of the analysis to a subset of steps or campaigns with a simple mouse click. The tool highlights the major cost contributors, indicates which steps have the highest cost-saving potential, provides scenario analyses and aids in planning future campaigns. It can be customized to deliver cross-project KPI charts and highlight trends in any subset of past and present projects.


Figure 2: Data Layout of the ChemPager application. The raw data is stored in a hierarchical database layout, in which projects form the top layer. One project can have a number of campaigns, one campaign a number of steps, one step a number of batches and one batch a number of materials and reactors. This data is combined with data from other databases and processed by the main ChemPager application. The modular approach simplifies making changes and swapping of components. Once structured, the data can be cross-linked with other databases or used for Machine Learning.
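The hierarchy described in the Figure 2 caption can be illustrated with a small Python sketch. This is not ChemPager's actual schema: all class and field names (for example `mass_kg` and `volume_l`) are hypothetical and chosen only to mirror the project > campaign > step > batch > material/reactor nesting.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Material:
    name: str
    mass_kg: float  # hypothetical field, not taken from the paper


@dataclass
class Reactor:
    name: str
    volume_l: float  # hypothetical field, not taken from the paper


@dataclass
class Batch:
    batch_id: str
    materials: List[Material] = field(default_factory=list)
    reactors: List[Reactor] = field(default_factory=list)


@dataclass
class Step:
    name: str
    batches: List[Batch] = field(default_factory=list)


@dataclass
class Campaign:
    name: str
    steps: List[Step] = field(default_factory=list)


@dataclass
class Project:
    name: str
    campaigns: List[Campaign] = field(default_factory=list)


def total_material_mass_kg(project: Project) -> float:
    """Walk the hierarchy top-down and sum the mass of every material."""
    return sum(
        material.mass_kg
        for campaign in project.campaigns
        for step in campaign.steps
        for batch in step.batches
        for material in batch.materials
    )


# Usage: one project, one campaign, one step, one batch, two materials.
demo = Project(
    "demo",
    [Campaign("c1", [Step("s1", [Batch(
        "b1",
        [Material("solvent", 10.0), Material("reagent", 2.5)],
        [Reactor("R1", 100.0)],
    )])])],
)
```

Because every record sits at a fixed level of the tree, an aggregate such as the total material mass of a project can be computed by walking the layers, which is one way to picture the cross-project aggregation mentioned in the abstract.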

ChemPager is built within Tibco Spotfire and can handle data from a variety of sources such as Microsoft Excel, Google Sheets and a number of database formats. 15 We chose a modular approach in which data is acquired, stored, processed, and visualized separately. This makes the application easier to maintain and adapt while providing a common platform for reusing the data (see Figure 2). In the following paragraphs, we present the workflow, structure, and included analyses of the ChemPager tool. We describe the foundation of the scoring functions and present an example for the application of ChemPager using data from the Hepatitis B viral expression inhibitor RG7834. 16

Characterizing chemical processes

Common key performance indicators (KPIs) of organic process chemistry include the E factor 17 or PMI, 5 costs per kg final product, overall yield, information on step count and longest linear sequence, as well as Volume-Time Outputs. 9, 18 They can be calculated from data found in batch records or representative laboratory experiments. While hard to quantify, soft factors like process robustness are at least as important for judging the quality of a chemical process. 19 Insights into these seemingly intangible qualities may be derived from commonly available data, if combined intelligently. We attempt to do this by calculating scores for robustness, safety, economy, greenness and project difficulty (Figure 3, left). Scores are intended to quantify an otherwise subjective judgement using a model supplied with standardized inputs. As such, the scores’ ability to replicate expert judgement relies on the quality of the training data and model construction. In our case, each score is a weighted sum of sub-scores (Figure 3, middle). These are calculated from the input data shown on the right in Figure 3 using scoring functions. Most scoring functions in ChemPager are capped linear functions of input data with interval scale. For example, the scoring function for a step’s PMI is constructed as follows: steps that have a PMI of less than 5 kg/kg step product are assigned a sub-score of 2, and everything above 25 kg/kg step product is assigned a sub-score of 0. PMI values in between are linearly interpolated. For ordinal and nominal data, the scores are discrete functions, yielding one of two or more pre-defined values. This can be rationalized as “getting bonus points” if a criterion is met.
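The capped linear PMI scoring function and the weighted sum of sub-scores described above can be sketched in a few lines of Python. The thresholds (5 and 25 kg/kg) and the maximum sub-score of 2 are taken from the text. The function names, the sub-score names and the weights in the usage example are hypothetical, since the actual weights are not given here.

```python
from typing import Dict


def capped_linear(value: float, best: float, worst: float, max_score: float) -> float:
    """Return max_score at or below `best`, 0 at or above `worst`,
    and a linearly interpolated value in between."""
    if value <= best:
        return max_score
    if value >= worst:
        return 0.0
    return max_score * (worst - value) / (worst - best)


def pmi_sub_score(pmi_kg_per_kg: float) -> float:
    """PMI sub-score as described in the text: 2 below 5 kg/kg, 0 above 25 kg/kg."""
    return capped_linear(pmi_kg_per_kg, best=5.0, worst=25.0, max_score=2.0)


def weighted_score(sub_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine sub-scores into one score as a weighted sum.
    The weights are an assumption of this sketch, not values from the paper."""
    return sum(weights[name] * score for name, score in sub_scores.items())


# Usage with hypothetical sub-score names and weights.
example_sub_scores = {"pmi": pmi_sub_score(15.0), "other": 2.0}
example_weights = {"pmi": 0.5, "other": 1.5}
example_total = weighted_score(example_sub_scores, example_weights)
```

A step with a PMI of 15 kg/kg lies halfway between the two thresholds and therefore receives half of the maximum sub-score under this construction.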

Figure 3: The composition of the robustness, greenness, safety, and economy scores is outlined schematically. All scores are built from weighted campaign averages of per-step sub-scores. The sub-scores themselves are built from a reduced number of input values. All scores are evaluated on a step and campaign level, with the exception of project difficulty. OHC: 20 Occupational Hazard Category


As an example, the construction of the robustness score and its sub-scores is discussed in detail below. 21 To our knowledge, this is the first proposed score to quantify process robustness. Process robustness can be defined as the influence of input changes on outputs. Translated into our model, processes that deliver consistent yield and purity over a wide range of input variable combinations are considered more robust. 22 Processes that deliver consistent quality and yield from batch to batch, independent of scale and equipment, are more robust than processes that lead to batch-to-batch variations in yield and quality or even to deviations, reworks, ad-hoc changes or unexpected behavior 23 in the plant. Thus, we use the absolute percentage difference of yield and purity between laboratory experiment and plant outcome, $\Delta_{\text{yield}}$ and $\Delta_{\text{purity}}$, to build the “Lab vs. Batch” sub-score. 24 Equally, we consider the standard deviation of yield ($\sigma_{\text{yield}}$) and purity ($\sigma_{\text{purity}}$) over all batches produced within a campaign. In both sub-scores, the difference in yield is added to the difference in purity, which receives a weight of four. The arbitrary weight of four was set because we deem a nominally equal change in purity to be more relevant than a change in yield. This sum is then entered into a linear equation and evaluated as follows for the two sub-scores:

$$\text{Lab vs. Plant Sub-Score} = \begin{cases} 3, & \text{for } 2.8 - 19.8 \cdot \dfrac{4 \cdot \Delta_{\text{purity}} + \Delta_{\text{yield}}}{5} > 3 \\ 0, & \text{for } \dots \end{cases}$$

Batch vs. Batch Sub-Score, based on $4 \cdot \sigma_{\text{purity}} + \sigma_{\text{yield}}$: 0, for