Chapter 9

Downloaded by PENNSYLVANIA STATE UNIV on August 11, 2014 | http://pubs.acs.org Publication Date (Web): August 6, 2014 | doi: 10.1021/bk-2014-1164.ch009

Back to the Future: CAS and the Shape of Chemical Information To Come

Roger J. Schenck* and Kevin R. Zapiecki

Chemical Abstracts Service, 2540 Olentangy River Road, Columbus, Ohio 43202

*E-mail: [email protected]

Chemical Abstracts Service (CAS), the only organization in the world whose objective is to find, collect and organize all publicly disclosed chemistry, has been a leader in providing scientists with access to chemical information for more than 100 years. CAS relied on a group of globally situated volunteer abstractors from 1907 until the early 1990s. CAS now keeps pace with the explosion in newly disclosed chemistry with more than 500 scientists working at the CAS headquarters in Columbus, Ohio, who are supported in turn by that same number of scientists working in locations around the world. CAS has designed computer applications both for database-building efforts and service delivery. In 1984, STN was developed for professional searchers to access scientific and technical databases. With the introduction of SciFinder in 1995, CAS developed the first chemical information analysis tool specifically targeted to help chemists working in the lab. Since then, CAS has leveraged rapid changes in technology and evolving sources of disclosed chemistry, to fulfill its mission to provide the world’s best digital research environment to search, retrieve, analyze and link chemical information. This chapter describes how CAS has adapted to the phenomenal growth in published research to continuously support scientific discoveries and will close with some thoughts about the future of chemical information.

© 2014 American Chemical Society In The Future of the History of Chemical Information; McEwen, L., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 2014.

Overview of CAS

In 1907, E. J. Crane established the importance of indexes, not just abstracts, as part of Chemical Abstracts, starting with author and subject indexes (1). Since there was little control over the nomenclature systems used in the early chemical literature, Carleton Curran and Austin Patterson of Chemical Abstracts devised a systematic method of naming substances in 1916 (2). They surveyed the organic chemical literature for common practices, established an order of precedence for chemical functionality and instituted the use of inverted index names. Inverted names became popular as a way to group similar classes of compounds in an alphabetical printed index (3). Chemical Abstracts came to be recognized as a leader in the development of chemical substance nomenclature. In 1937, Chemical Abstracts published its one-millionth abstract (4).

Around the time of the Seventh Collective period (1962-1966), Chemical Abstracts staff were struggling to keep pace with the substances reported in the chemical literature (5). Before 1965, structures were hand drawn and the substances were subsequently named. Manual comparisons were done to determine whether an incoming substance had been previously indexed. At the same time, computer technology was emerging, and Chemical Abstracts Service research staff brought computers to bear on the problem. The CAS Chemical Registry System was introduced in 1965 as an internal production system that replaced the redundant and very expensive task of naming known compounds. By using a unique CAS Registry Number to identify each chemical substance, the system went on to benefit chemical research, health and safety information, and the communication of chemical information. There are now more than 85 million organic and inorganic substances in CAS REGISTRY (as of April 2014) (6), which makes it the world’s largest substance database.
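One practical property of a CAS Registry Number is that its final digit is a check digit, so a mistyped number can often be caught before it is used in a search. The rule is not described in this chapter; the Python sketch below assumes the publicly documented convention, in which the check digit equals the sum of the preceding digits, each weighted by its position counted from the right, taken modulo 10. The function name `is_valid_cas_rn` is illustrative and is not part of any CAS product.

```python
import re


def is_valid_cas_rn(cas_rn: str) -> bool:
    """Return True if cas_rn is well formed and its check digit matches.

    Assumed format: 2 to 7 digits, a hyphen, 2 digits, a hyphen, 1 check digit.
    """
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


if __name__ == "__main__":
    # 7732-18-5 is water; 179324-69-7 is the number this chapter cites for Velcade.
    for candidate in ("7732-18-5", "179324-69-7", "7732-18-4"):
        print(candidate, is_valid_cas_rn(candidate))
```

A check of this kind only confirms that a number is internally consistent; it does not show that the number has actually been assigned to a substance in CAS REGISTRY.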
Introduced in 1980, CAS ONLINE made it possible for users (primarily information specialists) to search the CAS REGISTRY database (7). Using a command language, users communicated their search strategies to the system. Users with a specific model of an intelligent graphics terminal could select structure features from a menu and then assemble them on the terminal monitor using a graphics tablet and stylus. These terminals could display answers with consistently drawn structure diagrams.

CAS content speeds the pace of scientific discovery through two platforms: STN and SciFinder. In 1983, CAS partnered with FIZ Karlsruhe (in Germany) and was represented in Japan by the Japan Science and Technology Agency (JST) to form an international online network. STN, the Scientific and Technical Information Network, was launched the next year. STN made databases accessible through distributed processing on a global scale. Initially, only CAS databases and Physics Briefs were accessible. Over time, STN grew to include many scientific databases from a range of information providers. STN databases are uniquely integrated so researchers can consult multiple databases with a single query. A new web-based platform, with a project-oriented workflow and enhanced search power, precision and usability, was recently released and continues to be developed.

CAS introduced SciFinder in 1995 as a research tool to give scientists direct access to CAS databases with no prerequisite to learn a command language (8). With its intuitive, graphical interface, SciFinder simplified the exploration of the world’s scientific literature, patents and substance information, making this activity part of the scientific research process. CAS recognized the possibilities of the Internet to speed and simplify access to original journal articles and patents. CAS Full Text Options (originally called ChemPort) was introduced to CAS and STN electronic services in 1997. Today it provides access to full-text journal articles and patents from more than 7,400 electronic journals from nearly 360 participating publishers (9). CAS Full Text Options also provides links to full-text electronic patent documents from five offices: USPTO (U.S. Patent and Trademark Office), Espacenet (European Patent Office), SIPO (State Intellectual Property Office of the P.R.C.), JPO (Japanese Patent Office) and KIPRIS (Korea Intellectual Property Rights Information Service).

Addressing the Information Needs of Scientists

In the late 1960s, with the advent of computer technologies, CAS investigated chemical information products and services beyond what was already available in CAS REGISTRY and the CA File on STN. The market drove CAS to consult chemists and information professionals to better understand their needs. Beyond the need for access to chemical substance information and the literature from which those substances were selected, there was a clear opportunity for CAS to provide scientists with much more targeted information. The desire for a collection of chemical reactions that included both standard, trusted reactions and novel synthetic techniques was front and center among the customers interviewed. This was the beginning of a rich suite of additional chemical information currently available to scientists in the CAS databases.

CAS is the only organization in the world whose objective is to find, collect and organize all publicly disclosed substance information. CAS currently covers more than 10,000 active journals (10) and patents from 63 patent authorities (11). This scientific literature and these patents come from 180 countries in 50 languages (12). CAS has developed seven core databases that cover the most current scientific information: chemical substances (CAS REGISTRY), references (CAplus), Markush (MARPAT), reactions (CASREACT), chemical suppliers (CHEMCATS), regulated chemicals (CHEMLIST) and Chemical Industry Notes (CIN).

CAplus covers international journals, patents, patent families, technical reports, books, conference proceedings and dissertations from all areas of chemistry, biochemistry, chemical engineering and related sciences from 1907 to the present. There are more than 38 million records as of April 2014.
In addition, over 180,000 records for pre-1907 patent and journal references are available, from sources such as the American Chemical Society (ACS), the Royal Society of Chemistry (RSC) and Chemisches Zentralblatt (9). Other benefits of CAplus include abstracts of foreign language references (patent and journal) that are translated into English. CAplus also assures that patent records from nine major patent offices worldwide are available online within two days of the patent’s issuance and are fully indexed by CAS scientists in 27 days or less from the date of issue (13).

Voicing a clear need to leave no stone unturned when searching for prior art and freedom to operate, information professionals pushed CAS to develop a database of generic structures selected from patent applications. To address this need, CAS developed MARPAT, a database of Markush structures derived from patent applications. Introduced on STN in 1990, MARPAT was designed as an extension of the information provided in the CAS REGISTRY and CAplus databases to perform comprehensive patent substance searching (8). There are more than one million searchable Markush structures derived from patents covered by CAS from 1988 to the present.

CASREACT was introduced on STN in 1988 and has been available in SciFinder since the launch of that product in 1995. CASREACT offers access to current reaction information found in literature covering synthetic organic chemistry. The literature includes journals and patents from 1840 to the present. There are currently more than 58 million single- and multi-step reactions, and more than 13 million synthetic preparations in SciFinder (14).

CHEMCATS, introduced on STN in 1995, is a chemical catalog database containing information about commercially available chemicals and worldwide suppliers. It contains more than 65 million commercially available products, more than 990 chemical catalogs, more than 880 suppliers and more than 27 million unique CAS Registry Numbers (15).

After the passage of the Toxic Substances Control Act (TSCA) by the U.S. Congress in 1976, regulatory officers began asking CAS for access to an electronic version of the TSCA Inventory and other national inventories like the EINECS Inventory in Europe.
CHEMLIST, the regulated chemicals database, is available on STN and in SciFinder. It was originally built from data in the 1985 TSCA inventory of more than 308,000 regulated substances (16). It is the most accurate source of substance and regulatory information with validated CAS Registry Numbers and the world’s most extensive collection of chemical names, consisting of systematic, trade and common names from 14 national chemical inventories. Seeking current business information from the chemical enterprise worldwide, CAS introduced a database called Chemical Industry Notes (CIN) on STN in 1989. It was built from 100 trade journals (including bibliographic data, abstracts, indexing and CAS Registry Numbers). CIN offers chemical business news related to production, pricing, sales, facilities, products and processes, corporate activities, government activities and people. Today, CIN contains an estimated 1.7 million records drawn from 80 sources from 1974 to the present, including both domestic and foreign journals, trade magazines, newspapers and newsletters (17).

Trusting CAS for Current and Comprehensive Information

As well as covering chemistry in its broadest sense, the CAS databases are kept current so chemists can discover information sooner than from other scientific information providers. While the identification and approval process for new projects within research organizations typically requires a comprehensive review prior to moving forward, it still remains possible that, during the lifetime of a project, information can become available that could alter the scope of the project or even ruin it. Specific types of information affecting these efforts include:

• Recent publication of parallel or more advanced research efforts by competitors using the same approach and goals as the current project.
• Recent publication of key processes in the project by academic researchers or companies that limits patentability of the approach and/or enables competitor workarounds.
• New patent filings by competitors preventing freedom-to-operate for key processes in the project.
• Identification of old publications or patents (not identified previously) that limit the patentability of current efforts (i.e., prior art).

It is important that scientists have access to up-to-date information. There is intense competition to publish research first. The sooner research is published by reliable sources, the sooner it can help scientists plan and generate new scientific ideas and concepts.

In the mid-1960s, as CAS REGISTRY was being designed and implemented, chemists and computer scientists at CAS needed to estimate the pace and size of future growth: how many substances might chemists ultimately synthesize, and how fast? Initial estimates ranged from six to twelve million substances. Some predicted that when chemists had finally synthesized all possible substances, that is, when they had combined all atoms in all synthetically accessible combinations, CAS REGISTRY might reach 25 million substances. While it took CAS 33 years to register its first ten million substances from the published literature (18), in December of 2012, just 18 months after reaching 60 million small molecules, CAS registered its 70 millionth substance (19).

Where are CAS analysts seeing these new substances? Patent filings, especially from the Asia Pacific region, have exploded during the past eleven years. In 2012, CAS saw a spike in Chinese patent applications unlike any in its history. Covering 63 patent authorities, the CAS databases reflect patent activity around the globe through the years. Figure 1 shows Chinese patent growth as a major force in the Asia Pacific region and worldwide. In 2013 alone, patents from the Asia Pacific countries accounted for more than 67 percent of the patent publications seen by CAS, and China contributed around 65 percent of that region’s patent output.
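The growth figures quoted above are easier to compare once they are put on a common footing. The short Python sketch below uses only the numbers stated in the text (33 years for the first ten million substances, 18 months for the step from 60 to 70 million, and the 67 and 65 percent patent shares). The derived values are rough, because the source figures are themselves rounded ("more than", "around"), and the function name is an illustrative choice.

```python
def registration_rate(millions: float, years: float) -> float:
    """Average registration rate in millions of substances per year."""
    return millions / years


# First ten million substances: 33 years.
early_rate = registration_rate(10, 33)
# From 60 million to 70 million substances: 18 months.
recent_rate = registration_rate(10, 18 / 12)
speedup = recent_rate / early_rate

# Patent shares for 2013 as quoted in the text.
asia_pacific_share = 0.67      # share of patent publications seen by CAS
china_share_of_region = 0.65   # China's share of the Asia Pacific output
china_share_worldwide = asia_pacific_share * china_share_of_region

print(f"Early rate:  {early_rate:.2f} million substances per year")
print(f"Recent rate: {recent_rate:.2f} million substances per year")
print(f"Speed-up:    about {speedup:.0f}x")
print(f"China's implied share of all patent publications: {china_share_worldwide:.1%}")
```

On these figures, the recent registration rate is roughly 22 times the early average, and China alone would account for a little over two fifths of the patent publications CAS saw in 2013.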

Figure 1. Asian patent growth over the past 11 years. The black bars show total worldwide patent growth; the grey bars show the contribution to worldwide growth from the Asia Pacific region (China included); the white bars show China only. Source: CAplus database.

For drug discovery scientists, knowing what is being patented is important for freedom-to-operate and intellectual property concerns. Every day, CAS scientists add more than 3,000 substances from Chinese patent applications alone. SciFinder and STN searchers have access to this novel patent information up to three months sooner than their competition.

The Future of Chemistry Research

At the inception of any research effort, whether it is a commercial drug development project or the potential subject for a PhD dissertation, researchers need to know what has been done in the past. They must find out what has worked, what hasn’t worked and who else is working in the area of research. Before the 1970s, days, sometimes weeks, were spent in the library searching printed Chemical Abstracts indexes, and other compendia, to uncover what had been accomplished in the past. Extensive notes documenting the literature search were kept. Original journal articles, if not held in the local library, were acquired through interlibrary loans or document delivery services. Figure 2 is a visual representation of the relative time spent fetching (search and acquisition) relevant chemistry research and original literature versus the time spent reading and absorbing that literature (evaluation).

Figure 2. Content innovation and technology have significantly simplified scientific literature searches and provided a new area of opportunity: EVALUATION. Note: This graph is qualitative not quantitative.

With the advent in the 1970s of computer-based searching systems, time spent in the library began to shrink. Not all major reference works were available electronically, so library time was still necessary. Because of the intricacies of online searching systems, researchers often had to explain their questions to information experts who would then query online databases. As the secondary information industry moved through the 1980s and into the 1990s, searching became more efficient. More and more chemical information products were made available in electronic form. The primary literature was beginning to be delivered electronically in formats like PDF. In the mid-nineties, CAS developed SciFinder, a researcher’s tool that was simple to use. Chemists were no longer required to understand the nature of arcane printed indexes or the sometimes complex search commands necessary to use online databases – they could search for themselves, find useful answers quickly, and access electronic versions of patents and journal articles – all from their own computer. So, today, the time required to search and acquire scientific information has been greatly reduced. A new problem has arisen – too many answers are resulting from the explosion in worldwide scientific publishing. CAS is currently developing features and functions in its products that take advantage of that content to reduce the time it takes scientists to evaluate a collection of patents and literature articles. The problem that CAS needs to solve now is not getting more answers but getting the best answers.

So what is CAS doing to help researchers get to the most relevant literature quickly? CAS is adding more context to its records so scientists have more information that points to the right answers. Access to comprehensive and timely scientific information is vital. CAS, with its comprehensive, timely and high quality content, helps organizations avoid wasted, unproductive efforts by discovering business-critical information as early as possible. The search and acquisition time has been reduced, and now CAS is finding ways to drastically cut the evaluation time. Some of those enhancements are described below.

Experimental Procedures and Reaction Transformations

CAS provides access to more than 58 million single- and multi-step reactions and synthetic preparations (20), as well as associated experimental procedures for reactions, through SciFinder. Experimental procedures help scientists find useful reactions and the most relevant publications. CAS provides access to millions of experimental procedures from other sources including English-language translations from German and Japanese patents, the Shanghai Institute of Organic Chemistry, Chinese Journal of Organic Chemistry and Acta Chimica Sinica, hundreds of Springer journals and all ACS Publications journals in addition to English-language patents from the United States Patent and Trademark Office, European Patent Office, and the World Intellectual Property Organization (2000 to the present).

The group-by-reaction-transformation feature in SciFinder saves users time when reviewing reaction answer sets: by grouping single-step reaction answers by transformation type, it speeds the evaluation of synthesis options and preferred pathways. It classifies answers in a way that is meaningful to synthetic chemists and allows a user to easily manage and evaluate large, comprehensive answer sets.
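The idea of grouping an answer set by transformation type can be illustrated with a small Python sketch. It is not SciFinder's implementation: the record layout, the `transformation` labels and the function name are all hypothetical, and the sketch assumes each single-step reaction already carries a transformation label.

```python
from collections import defaultdict


def group_by_transformation(reactions):
    """Group reaction records by their (hypothetical) transformation label.

    Returns a list of (label, [reaction ids]) pairs, largest group first,
    with ties broken alphabetically by label.
    """
    groups = defaultdict(list)
    for reaction in reactions:
        groups[reaction["transformation"]].append(reaction["id"])
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


# Made-up answer set, for illustration only.
answers = [
    {"id": "rxn-1", "transformation": "ester hydrolysis"},
    {"id": "rxn-2", "transformation": "amide formation"},
    {"id": "rxn-3", "transformation": "ester hydrolysis"},
]

for label, ids in group_by_transformation(answers):
    print(f"{label}: {len(ids)} reaction(s) -> {', '.join(ids)}")
```

Presenting a handful of labelled groups in place of a flat list of reactions is what lets a chemist dismiss or open a whole class of answers at once.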

Bioactivity and Target Indicators

Scientists working in the drug discovery arena, such as medicinal chemists, are experts in diseases, the protein pathways involved in those diseases, and small molecules or biologics that may inhibit, or enhance, protein expression. The essence of drug discovery is in identifying and validating druggable protein targets, designing lead molecules that affect their behavior and decorating that drug lead to maximize its efficacy. In 2011, CAS began adding bioactivity indicators and target indicators to the small molecules in CAS REGISTRY. Bioactivity indicators are a defined set of approximately 260 bioactivity terms, much like therapeutic indications. A term is assigned to a substance in CAS REGISTRY when there is a high probability that the bioactivity indicator was reported for that substance in a journal article or patent. For instance, Velcade (CAS Registry Number 179324-69-7) is associated with bioactivity terms like antitumor agents and biological radio sensitizers. Target indicators are assigned in the same manner. Thus, Velcade is associated with the target indicators Akt kinase and 26S proteasome. These bioactivity and target indicators guide drug discovery scientists to new uses for known drugs, possible side effects and the original literature where this pharmaceutical information was reported.

Relevancy Ranking

Relevancy ranking speeds access to desired results for researchers. In the past, users sometimes had to perform multiple searches and refine them to obtain an answer set of manageable size. With relevancy ranking in both STN and SciFinder, the best answers are pushed to the top, which leads to fewer follow-up searches.
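The chapter does not describe how STN or SciFinder compute relevance, so the Python sketch below is only a toy illustration of the general idea: score each record against the query and sort so that the highest scores come first. The scoring rule (a count of query terms found in the title), the record layout and the sample titles are all assumptions made for this example.

```python
def rank_by_relevance(query, records):
    """Return records sorted by how many query terms appear in each title."""
    terms = set(query.lower().split())

    def score(record):
        words = set(record["title"].lower().split())
        return len(terms & words)

    # sorted() is stable, so records with equal scores keep their original order.
    return sorted(records, key=score, reverse=True)


# Made-up records, for illustration only.
records = [
    {"id": "A", "title": "Synthesis of proteasome inhibitors"},
    {"id": "B", "title": "Proteasome inhibitors as antitumor agents"},
    {"id": "C", "title": "Patent trends in the Asia Pacific region"},
]

for record in rank_by_relevance("antitumor proteasome inhibitors", records):
    print(record["id"], record["title"])
```

Even with a scoring rule this crude, the record matching all three query terms moves to the top, which is the behavior that reduces the need for follow-up searches.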

Conclusion

Access to comprehensive and timely scientific information is vital for the advancement of science. For centuries scientists have routinely published their research; their conclusions may then be reviewed, confirmed and used by other scientists. Discoveries lead to more discoveries and science advances. CAS has been the repository of that research for more than 100 years. CAS is cognizant of the fact that along with more information available in its databases comes the concern of navigating too many answers. CAS analysts are not only indexing and abstracting the important chemical content in reputable scientific publications including articles and patents, but also offering new content and functionality that helps searchers quickly winnow a large collection of CAS records down to a useful and manageable set for their research. Recent notable content additions include graphical abstracts, experimental procedures for reactions, experimental and predicted properties, bioactivity and target indicators, citations and relevancy ranking capabilities.

In some sense, CAS has come full circle. The first issue of Chemical Abstracts, published on January 1, 1907 (8), contained 502 abstracts. Its purpose was more than raising the visibility of the American chemical enterprise. It was to summarize the growing volume of research papers being published worldwide for quick review. For many years, Chemical Abstracts was produced by a team of volunteer abstractors located around the world. Today, although CAS indexes well over a million documents on an annual basis, it continues to do so with the support of a team located around the globe. And, from the CAS customers’ perspective, it strives to develop database content and features that enable researchers, information professionals and patent searchers to winnow a massive collection of published information down to what’s important for the problem at hand…just like what happened in 1907.

Many generations of scientists, information professionals, educators and students have used services from CAS, from printed Chemical Abstracts to STN and SciFinder. With knowledge gleaned from the CAS databases, scientists have begun their research efforts knowing what has been done before them, and in time, have contributed their own discoveries. In turn, CAS continues to include those discoveries in its databases.

References

1. Schenck, R. J. Back to the Future. Presented at the Fall 2012 ACS National Meeting.
2. Patterson, A.; Curran, C. J. Am. Chem. Soc. 1917, 39, 1623–38.
3. Crane, E. J. The Chemical Abstracts Service - Good Buy or Good-by. Chem. Eng. News 1955, 33 (26), 2753.
4. Crane, E. J. Why Indexers Turn Gray. Chem. Eng. News 1937, 15 (8), 175.
5. CAS Report Highlights Progress. Chem. Eng. News 1962, 40 (22), 90–97.
6. REGISTRY counter on the www.cas.org website (accessed April 2014).
7. CAS offers new online service. Chem. Eng. News 1980, 58 (40), 34–35.
8. Shively, E. CAS Surveys Its First 100 Years. Chem. Eng. News 2007, 85 (24), 41–53.
9. http://www.cas.org/fulltext/cas-full-text-options (accessed April 2014).
10. http://www.cas.org/content/references (accessed April 2014).
11. http://www.cas.org/content/references/patworld (accessed April 2014).
12. http://www.cas.org/about-cas/cas-fact-sheets/registry-fact-sheet (accessed April 2014).
13. http://www.cas.org/content (accessed April 2014).
14. http://www.cas.org/content/reactions (accessed April 2014).
15. http://www.cas.org/content/chemical-suppliers (accessed April 2014).
16. http://www.cas.org/File%20Library/Training/STN/DBSS/chemlist.pdf (accessed April 2014).
17. http://www.cas.org/File%20Library/Training/STN/DBSS/cin.pdf (accessed April 2014).
18. Toussant, M. A Scientific Milestone. Chem. Eng. News 2009, 87 (37), 3.
19. http://www.cas.org/news/product-news/70-millionth-substance (accessed April 2014).
20. http://www.cas.org/products/scifinder/content-details (accessed April 2014).
