Adversarial Controls for Scientific Machine Learning

In Focus. ACS Chem. Biol. 2018, 13, 2819−2821. DOI: 10.1021/acschembio.8b00881. Published October 19, 2018.


Kangway V. Chuang and Michael J. Keiser*


Department of Pharmaceutical Chemistry, Department of Bioengineering and Therapeutic Sciences, Institute for Neurodegenerative Diseases, and Bakar Institute for Computational Health Sciences, University of California San Francisco, 675 Nelson Rising Lane, San Francisco, California 94158, United States

ABSTRACT: New machine learning methods to analyze raw chemical and biological data are now widely accessible as open-source toolkits. This positions researchers to leverage powerful, predictive models in their own domains. We caution, however, that the application of machine learning to experimental research merits careful consideration. Machine learning algorithms readily exploit confounding variables and experimental artifacts instead of relevant patterns, leading to overoptimistic performance and poor model generalization. In parallel to the strong control experiments that remain a cornerstone of experimental research, we advance the concept of adversarial controls for scientific machine learning: the design of exacting and purposeful experiments to ensure that predictive performance arises from meaningful models.

Advances in computational power, algorithmic development, and data accessibility have sparked explosive growth in machine learning (ML).1 Sophisticated and powerful pattern recognition tools have become easily accessible and simple to deploy out-of-the-box.2 Popular scientific computing environments such as MATLAB, Mathematica, and R are now equipped for ML, and open-source libraries such as scikit-learn,2 TensorFlow,3 and PyTorch4 provide versatile frameworks for customized scientific machine learning development. Scientific researchers have more flexibility and power over their data than ever before.

Unfortunately, the democratization of ML within the scientific community has propagated the misconception that ML on "big data" may stand in for hypothesis-driven research.5 Although ML algorithms such as deep neural networks automatically learn hidden and complex relationships from large datasets,6 pattern recognition does not guarantee meaningful learning, and algorithms can unintentionally exploit experimental artifacts or confounding variables that prevent model generalization. To best harness ML for scientific discovery, we argue that a hypothesis-driven approach is now more critical than ever.

In a seminal piece published in 1964, physicist J. R. Platt describes the method of strong inference as an effective model of scientific inquiry.7 Platt argues that an iterative process of generating, testing, and excluding alternative hypotheses allows researchers to rapidly traverse the chain of logical reasoning in science. The falsification of multiple and often opposing hypotheses is inherently an adversarial process, as the researcher must design experiments to threaten all working and alternative hypotheses.8 As a case study, Platt points to the field of molecular biology, where exacting and unequivocal experiments designed to rule out alternative hypotheses had spurred rapid advancement. Today, the development of control experiments remains a cornerstone of experimental research, particularly at the forefront of scientific inquiry.

ML now ventures into uncharted territory. Although often considered a tool, ML itself is an experimental science.9 As such, strong inference should inform the applications of machine learning to new questions. Many challenges face researchers in applied ML, including choices in data processing, study design, model selection, and model validation. Yet, procedures to determine the origin of model performance are frequently overlooked during model development and testing. This is unfortunate because models trained from large biological and chemical datasets only provide actionable scientific insights when they learn truly salient patterns. Indeed, techniques enabling model interpretability, or the ability to provide explanations in terms understandable to humans, have been developed to explore this concept.10 Integrating these considerations and our own experiences, we advocate for a framework of adversarial controls in the application of ML, wherein purposeful experiments are devised to systematically eliminate competing hypotheses and reveal what patterns the models are learning. We find this framework helpful in our own work for the design and refinement of ML models, and particularly as a useful means to flag instances of unintended pattern recognition.

1. Opening the Black Box: Does Your Model Make Scientific Sense? Machine learning is inherently quantitative. Researchers empirically train an ML model through the iterative optimization of a user-defined objective function that minimizes the average prediction error across training examples. Similarly, we typically assess the resulting model by quantitative performance measures such as squared error, accuracy, precision, recall, and F-score. Although indispensable, the exclusive use of quantitative evaluation obfuscates the underlying mechanisms of performance, as such metrics reveal only how well, rather than how, the model achieves its goal. In contrast, we advocate for additional logical measures to interrogate the inner workings of models, which ultimately motivate the formulation of competing alternative hypotheses.

At its simplest, manual inspection of individual test cases can reveal model failures and flag problems in data quality or model training.11 For instance, convolutional neural networks enable rapid cellular image classification. Inspecting cases of the most pronounced correct and incorrect predictions provides a simple sanity check: Does the classifier correctly predict the "easy" (to a human) examples and miss the "hard" examples? Are misclassified images out of focus, or do they contain cells that are unusually deformed or compressed? Are the results in line with, or a challenge to, the researcher's domain intuition?

More systematically, interpretability methods specific to particular ML algorithms serve as further logical checks: feature coefficients of linear models,12 nearby instances in k-nearest neighbors, feature importances for random forest models,13 and saliency maps for deep neural networks14 make it possible to check a model concretely against the researcher's intuition and expert knowledge of the underlying physical process. In the cell image classification example above, does the model learn from batch differences in fluorescence intensity when the primary distinguishing feature should be cell shape? Does the model operate on features of the cells at all, or draw from some other aspect of the image entirely? Extracting intelligible explanations from trained models and performing simple sanity checks can flag unintended pattern recognition, to protect against drawing inadvertently false scientific conclusions.
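As a minimal sketch of these inspection steps, the snippet below (Python, using scikit-learn, one of the libraries cited above) pulls out the most confidently predicted correct and incorrect test cases for manual review and ranks random forest feature importances against domain intuition. The synthetic data and feature names are illustrative placeholders, not part of the original study.

```python
# Sketch: manual inspection of confident predictions and a feature-importance
# sanity check. The dataset here is synthetic and stands in for real
# descriptor or image-derived features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: in practice X would be molecular or image features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# 1) Pull out the most confidently predicted test cases, right and wrong,
#    so a domain expert can inspect them (e.g., view the underlying images).
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)
confidence = np.abs(proba - 0.5)
correct = pred == y_test

most_confident_correct = np.argsort(-(confidence * correct))[:5]
most_confident_wrong = np.argsort(-(confidence * ~correct))[:5]
print("Inspect these confident hits:", most_confident_correct)
print("Inspect these confident misses:", most_confident_wrong)

# 2) Check which features the model leans on, against domain intuition.
ranked = sorted(zip(model.feature_importances_, feature_names), reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```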

2. The Method of Multiple Models: Is a Confounding Variable Driving the Prediction? Variables inherent to experimental design, such as when a subset of samples was collected, can confound ML models, especially if the variables are shortcuts to solving the task at hand. Moreover, models that exploit confounding variables can sometimes attain strong quantitative performance, but in a way that is not scientifically meaningful or generalizable. Without procedures to detect spurious correlations or irrelevant learning,15 the scientific conclusions that are reached may not be sound. Drawing from Chamberlin,8 we advocate for the explicit formulation of multiple alternative, competing models and for the purposeful design of falsification experiments that detect confounding variables.

Alternative explanations can take many forms. For example, does your image classifier exploit details in the periphery rather than the object of interest?16 In a model of drug action, are predictions capitalizing on properties such as the solubility or molecular weight of the compound, rather than the key information about the drug−protein interaction? Stating an explicit, confounding hypothesis frequently inspires a control experiment. For instance, the hypothesis "we can achieve equivalent molecular-target binding prediction performance using corresponding author name and publication date instead of chemical structure" can be falsified by training an alternative model on these features alone and then checking whether its performance is competitive. As Platt reminds us, what experiment(s) could be performed to disprove these alternative hypotheses?
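As a sketch of such a competing-model control, one can train one model on the molecular descriptors and a second model on suspected confounders alone, then compare cross-validated performance. The column names (assay_batch, measurement_year) and data below are hypothetical assumptions for illustration only.

```python
# Sketch: compare a descriptor-based model against a confounder-only model.
# If the confounder-only model rivals the descriptor model, predictions may
# be driven by experimental design rather than chemistry.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    # Placeholder molecular descriptors.
    **{f"descriptor_{i}": rng.normal(size=n) for i in range(10)},
    # Suspected confounders tied to experimental design, not to chemistry.
    "assay_batch": rng.integers(0, 4, size=n),
    "measurement_year": rng.integers(2012, 2019, size=n),
})
y = rng.integers(0, 2, size=n)  # placeholder binding labels

descriptor_cols = [c for c in df.columns if c.startswith("descriptor_")]
confounder_cols = ["assay_batch", "measurement_year"]

model = RandomForestClassifier(n_estimators=200, random_state=0)
descriptor_score = cross_val_score(model, df[descriptor_cols], y, cv=5).mean()
confounder_score = cross_val_score(model, df[confounder_cols], y, cv=5).mean()

print(f"Descriptor-only model accuracy: {descriptor_score:.2f}")
print(f"Confounder-only model accuracy: {confounder_score:.2f}")
```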

3. Outperforming the Straw Model: Does the Model Break When You Remove What Matters? Physiologist W. A. H. Rushton observed that "a theory which cannot be mortally endangered cannot be alive."7 Taking the method of multiple models to its logical conclusion, a final adversarial control focuses on designing experiments crafted to disprove the working hypothesis. Specifically, we advocate for straw models that lack central aspects of the working hypothesis in order to establish those aspects' significance.9 A straw model is a superficially similar variant of the working model that excludes the working hypothesis by design. If the working model cannot outperform the straw model, neither model can be accepted. Likewise, an experimental design that does not distinguish the working model from an obviously false one cannot anchor a chain of scientific reasoning.

For example, methods in ligand-based virtual screening commonly employ ML algorithms to predict target-binding affinities from molecular descriptors. The implicit working hypothesis underlying such a model can be stated as, "molecular features are important for predicting binding affinity." This formalization suggests a straightforward falsification experiment: study what happens when you delete molecular features from the dataset prior to training the model (a process termed "feature ablation"). Similarly, what happens when you replace molecular features with unique random values (random barcodes), randomly shuffle the relationships between molecules and their protein targets, or inject noisy examples?17−19 If the model performs equally well after these manipulations, the working hypothesis cannot be accepted. These results thus provide a quantitative performance baseline for significance. In addition, they provide a qualitative logical check (referring back to the first adversarial control, above), because their application to specific test cases can often be interpreted intuitively.
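As a minimal sketch of two of the straw-model controls named above (label shuffling and random-barcode features), assuming a descriptor matrix X and binding labels y, the working model can be compared against retrained variants in which the signal has been deliberately destroyed. The data here are synthetic placeholders.

```python
# Sketch: straw-model baselines for a descriptor-based binding classifier.
# Straw model 1 shuffles the molecule-to-target relationships (y-scrambling);
# straw model 2 replaces every molecule's features with a random "barcode".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
rng = np.random.default_rng(0)

# Working model: descriptors paired with their true labels.
working = cross_val_score(model, X, y, cv=5).mean()

# Straw model 1: shuffle labels so any real structure-activity link is broken.
y_shuffled = rng.permutation(y)
scrambled = cross_val_score(model, X, y_shuffled, cv=5).mean()

# Straw model 2: replace features with unique random values per molecule.
X_barcode = rng.normal(size=X.shape)
barcode = cross_val_score(model, X_barcode, y, cv=5).mean()

print(f"Working model accuracy:  {working:.2f}")
print(f"Label-shuffled accuracy: {scrambled:.2f}")
print(f"Random-barcode accuracy: {barcode:.2f}")
# If the straw models match the working model, its performance cannot be
# attributed to meaningful molecular features.
```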



CONCLUDING REMARKS

Here, we have outlined three types of control procedures that can be used to evaluate ML models. These adversarial controls test whether models align with scientific domain intuition, assess alternative explanations for how they work, and evaluate whether working models strictly reflect their working hypotheses. These controls are not intended to be comprehensive, nor do they stand in for best practices or other quantitative baselines such as stratified cross-validation or detecting overfitting. Rather, we outline this framework with the motivation that a hypothesis-driven approach remains critical for scientific ML. Importantly, such a hypothesis-driven approach will need to be developed within the domain under exploration, with an understanding of the nuances of the experimental details and the context of the findings. Only by adhering to strong and domain-specific models of scientific inquiry will ML approaches yield new insights. This is the difference between ML merely generating new data and ML generating new knowledge. Machine learning is here to stay, but it will be up to the scientific community to exercise due diligence in thoughtfully integrating these methods and ensuring their rigor.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Kangway V. Chuang: 0000-0002-0652-8071
Michael J. Keiser: 0000-0002-1240-2192


ACKNOWLEDGMENTS

We would like to thank the Paul G. Allen Foundation for funding.

REFERENCES

(1) Jordan, M. I., and Mitchell, T. M. (2015) Machine Learning: Trends, Perspectives, and Prospects. Science 349 (6245), 255−260.
(2) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011) Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 12 (Oct), 2825−2830.
(3) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., and Isard, M. (2016) TensorFlow: A System for Large-Scale Machine Learning. OSDI 16, 265−283.
(4) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017) Automatic Differentiation in PyTorch. NIPS Workshop.
(5) Anderson, C. End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine. https://www.wired.com/2008/06/pb-theory/ (accessed 05/11/2013).
(6) LeCun, Y., Bengio, Y., and Hinton, G. (2015) Deep Learning. Nature 521 (7553), 436−444.
(7) Platt, J. R. (1964) Strong Inference. Science 146 (3642), 347−353.
(8) Chamberlin, T. C. (1965) The Method of Multiple Working Hypotheses. Science 148 (3671), 754−759.
(9) Langley, P. (1988) Machine Learning as an Experimental Science. Mach. Learn. 3 (1), 5−8.
(10) Doshi-Velez, F., and Kim, B. (2017) Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1605.08695.
(11) Ribeiro, M. T., Singh, S., and Guestrin, C. (2016) "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM; pp 1135−1144.
(12) James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning: With Applications in R; Springer Texts in Statistics; Springer.
(13) Breiman, L. (2001) Random Forests. Mach. Learn. 45 (1), 5−32.
(14) Simonyan, K., Vedaldi, A., and Zisserman, A. (2013) Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034.
(15) Kaufman, S., Rosset, S., Perlich, C., and Stitelman, O. (2012) Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Trans. Knowl. Discovery Data 6 (4), 1−21.
(16) Kuehlkamp, A., Becker, B., and Bowyer, K. (2017) Gender-from-Iris or Gender-from-Mascara? In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE; pp 1151−1159.
(17) Cohen, P. R., and Howe, A. E. (1988) How Evaluation Guides AI Research: The Message Still Counts More than the Medium. AI Magazine 9 (4), 35.
(18) Cohen, P. R., and Howe, A. E. (1989) Toward AI Research Methodology: Three Case Studies in Evaluation. IEEE Trans. Syst. Man Cybern. 19 (3), 634−646.
(19) Tropsha, A. (2010) Best Practices for QSAR Model Development, Validation, and Exploitation. Mol. Inf. 29 (6−7), 476−488.
