Extracting Information From the Literature By Text ... - ACS Publications

We define S&T text mining as the extraction of useful infor- mation from the technical literature. Three major components fall under our definition—...
1 downloads 0 Views 29MB Size
When coupled with expert human analysts, text mining can extract useful information from technical material.

he rapid advancement of science and technology (S&T) depends on how efficiently knowledge, including both infrastructure (authors, journals, institutions) and thematic (technical thrusts, relationships) information, is extracted from the literature. Questions of interest center around what S&T is being performed, who is performing it, where is it being performed, and what messages and undiscovered information can be extracted from the literature. Expert analysts can then judge what is not being done and recommend what should be done differently. In the past, the technical community used the thorough but inefficient approach of visually scanning printed and electronic technical literature to identify relevant documents and then reading them to extract the information. New techniques now semi-automatically preselect relevant literature and order the

Ronald N. Kostoff D Ronald A. DeMarco Office of Naval Research

J U LY 1 , 2 0 0 1 / A N A LY T I C A L C H E M I S T R Y

371 A

technical concepts and their relationships, providing a framework for an integrated analysis. These techniques are encompassed under the umbrella of S&T text mining. This article defines text mining, describes its major components, and shows its myriad applications to support all types of S&T functions. Text mining can benefit researchers, managers, sponsors, administrators, evaluators, and oversight organizations. It can serve as a catalyst to enhance peer review, metrics, road mapping, and other decision-making processes. For example, comprehensive road maps for strategic planning could be constructed, serving as a foundation for international policy assessment. Text mining can support workshops and reviews by identifying the key performers in related disciplines. To illustrate some text mining products, examples from diverse literatures, but mainly analytical chemistry, will be presented.

According to the dictionary . . . We define S&T text mining as the extraction of useful information from the technical literature. Three major components fall under our definition—information retrieval, information processing, and information integration. Information retrieval is the extraction of records from the source technical literatures. High-quality information retrieval produces both com-

prehensive and highly relevant records and is the foundation for text mining. However, the most sophisticated information processing cannot compensate for low-quality and -quantity core records. Information processing is the extraction of patterns, including bibliometrics, computational linguistics, and clustering, from the retrieved records. For multifield structured records (records from the Science Citation Index with author, title, journal, and address fields) with some free-text fields (such as abstracts), bibliometrics is the extraction of the technical discipline infrastructure (authors, journals, organizations) from the core records. Computational linguistics is the computer-based extraction of technical themes and their relationships. Computational linguistics is a complex task for technical literature analysis, because the technical phraseology is a foreign language to the computer. Clustering is the combining of common technical themes that can be executed as phrase patterns or actual document groupings. Information integration is the synergistic combination of information processing computer output with the reading of the retrieved relevant records. The information processing output serves as a framework for the analysis, and the insights from reading the records enhance the skeleton structure to provide

Detection

Separation

Sensitivity

Compounds

Ions

Electrode

Spectra

Measurements

Water

Mass Spectrometry

Proteins

Detection Limits

Experiments

Chromatography

Spectrometry

Table 1. Dimensionalized co-occurrence matrix for the highest-frequency phrases from Analytical Chemistry.

Detection

397

62

71

44

24

30

21

28

30

24

23

39

20

23

18

Separation

62

215

19

31

12

7

11

8

12

15

22

19

9

21

13

Sensitivity

71

19

175

20

9

21

14

12

8

9

17

16

6

7

10

Compounds

44

31

20

155

4

3

14

6

10

9

1

15

4

12

14

Ions

24

12

9

4

136

10

16

11

7

20

9

7

10

3

14

Electrode

30

7

21

3

10

131

1

12

3

2

1

10

7

0?

1

Spectra

21

11

14

14

16

1

129

8

6

8

6

6

13

7

13

Measurements

28

8

12

6

11

12

8

127

6

4

7

7

9

3

4

Water

30

12

8

10

7

3

6

6

118

6

1

7

2

9

5

Mass Spectrometry

24

15

9

9

20

2

8

4

6

117

14

5

7

5

13

Proteins

23

22

17

1

9

1

6

7

1

14

113

5

11

4

12

Detection Limits

39

19

16

15

7

10

6

7

7

5

5

107

3

8

7

Experiments

20

9

6

4

10

7

13

9

2

7

11

3

104

6

7

Chromatography

23

21

7

12

3

0?

7

3

9

5

4

8

6

98

6

Spectrometry

18

13

10

14

14

1

13

4

5

13

12

7

7

6

97

Matrix elements are co-occurrence frequencies.

a logical integrated product. Additional background in text mining is available (1, 2).

A myriad of applications Text mining can substantially improve the breadth and relevance of records retrieved from databases. There are many approaches to information retrieval; annual conferences focus on comparing various techniques for their comprehensiveness and the S/N of retrieved records (3, 4). Most high-quality methods include some type of relevance feedback. This is an iterative method in which a test query is generated, records are retrieved, and then patterns from relevant and irrelevant records are used to modify the query for increased comprehensiveness and precision. These patterns are typically linguistic phrase and phrase-combination patterns, but they could also include infrastructure patterns such as author/journal/organization (5). The infrastructure of a technical discipline consists of the authors, journals, organizations, and other groups or facilities that contribute to the advancement and maintenance of the discipline. Traditionally, this information is obtained by simple extraction from retrieved records without text mining (6). However, text mining can identify these infrastructure elements and provide their specific relationships to the discipline or subdiscipline. This information is valuable for inviting the right people and discipline combinations to workshops and reviews, as well as planning site visits.

Computational linguistics Analyzing phrase patterns by computational linguistics identifies technical themes, relationships among themes, relationships within the infrastructure, and technical taxonomies. These are important for understanding the structure of a discipline and the linkages among people, organizations, and subdisciplines and for estimating the adequacies and deficiencies of S&T in subtechnology areas. (Subtechnology refers to a smaller part of a larger technology; for example, the weapons system on a military aircraft.) Literature-based discovery consists of examining relationships among linked literatures and discovering relationships or promising opportunities not obtainable from reading each literature separately. For example, literature 1 has a central theme “a” and subthemes “b”; literature 2 has central themes “b” and subthemes “c”. From these combinations, linkages can be generated through the “b” themes that connect both literatures. The disjointed components of the two linked literatures (the components “a” and “c” of literatures 1 and 2 whose intersection is zero) are candidates for discovery, because themes such as “c” in literature 2 could not have been obtained from reading literature 1 alone. Successful performance of this generic approach can introduce ideas from one discipline to a disparately related discipline and thereby identify promising new S&T opportunities and research directions. Some initial applications of literature-based discovery have been found in the medical literature (2, 7–13). An early study focused on identifying treatments for Raynaud’s syndrome, a circulatory disease (7 ). Assume that two literatures can be gen-

erated, the first containing potential treatments with a central theme “a” and subthemes “b”, and the second containing Raynaud’s syndrome papers with central theme(s) “b” and subthemes “c.” One interesting discovery was that dietary eicosapentaenoic acid (theme “a” from the first literature) decreases blood viscosity (theme “b” from both literatures) and alleviates symptoms of Raynaud’s disease (theme “c” from the second literature). There was no mention of eicosapentaenoic acid in the Raynaud’s syndrome literature, but the acid was linked to the disease through the blood viscosity themes in both literatures. Subsequent medical experiments confirmed the validity of this literature-based discovery (14). A Web site (http://kiwi. uchicago.edu) overviews the process used to generate this discovery and contains software that allows the user to experiment with the technique (8, 15). This next section provides illustrative examples of S&T text mining techniques, starting with a query developed for a recent aircraft S&T study. We then show some bibliometrics results, most of which were obtained from a database of papers published recently in Analytical Chemistry. Additional computational linguistics examples are taken from a variety of sources that are related to analytical chemistry when possible.

Record retrieval query, aircraft technology In the typical S&T text mining analysis (performed by Kostoff), the starting point is a record retrieval query. A query development example is provided from a recent text mining study of aircraft S&T literature to illustrate an important point about query complexity (16). The study’s focus was the S&T of the aircraft platform. Our philosophy was to start with the term Aircraft, then add terms that would expand the number of aircraft S&T papers retrieved and eliminate irrelevant papers. Two databases were examined: the Science Citation Index (SCI; covers basic research; 5300 journals accessed) and the Engineering Compendex (EC; covers technology development; 2600 journals accessed). The SCI record retrieval query required 207 terms (separate phrases and phrase combinations) and 3 iterations to develop, whereas the EC query required 13 terms and only 1 iteration. The SCI query retrieved 4346 relevant records, while the EC query retrieved 15,673 relevant records (16). Because of the technology focus of EC, most papers retrieved using an Aircraft or Helicopter-type query term focused on the S&T of the aircraft platform, whereas the SCI yielded many papers that focused on the science that could be performed from the aircraft—these papers were not aligned with the goals of the study. No adjustments were required to the EC query, whereas with the SCI query, many NOT Boolean terms were required to eliminate aircraft papers not aligned with the main study objectives. It is analogous to the selection of a mathematical coordinate system for solving a physical problem. If the grid lines are well aligned with the physical problem, the equations will be relatively simple. If the grid lines are not well aligned, the equations will contain numerous terms that require translation between the geometry of the physical problem and the geometry of the coordinate system.

J U LY 1 , 2 0 0 1 / A N A LY T I C A L C H E M I S T R Y

373 A

An important message from the aircraft study is that the information retrieval query size depends on the objectives of the study and the contents of the database relative to those objectives. The query size should not be predetermined but should result from comprehensive and precise objectives. Another important message is that substantial manual labor is required to examine the thousands of detailed technical phrases that result from the computational linguistics analyses of the free text, and judgements about the usefulness of these phrases in the final query must be made. Because these queries are applied to multidiscipline source databases such as SCI, these phrases must be understood in other technical disciplines for successful query development. Thus, developing a query for a specific technical subdiscipline requires broader technical knowledge than just the target discipline alone.

This example illustrates some of the limitations of metrics in general, and bibliometrics in particular. These institutions are large, and one would expect a large output from them. There is no indication of efficiency, that is, output per unit of resources. There is no indication of output quality, other than the papers met the criteria for publication in Analytical Chemistry. Organizational citation data would begin to address the quality issue. Because of space limitations, organizational subunits could not be listed. Thus, the high achievements of a subunit may not be reflective of the institution overall.

Technical themes, taxonomies, aircraft

The analysis of phrase patterns allows technical themes to be identified and technical taxonomies to be generated. In the recent aircraft S&T study (16), phrase pattern analysis and taxonomy generation Most prolific authors showed that global levels of emphasis of As a simple example of a bibliometrics aircraft subtechnologies could be estioutput, records of the 2000 most recentmated and judgements of technology ly published articles in Analytical Chem“adequacy” and “deficiency” could be istry (June 1998–August 2000) were exmade. tracted from SCI. There were 5072 Single-word, adjacent double-word, authors listed. The most prolific authors, and adjacent triple-word phrases were exand the number of papers on which they tracted from the abstracts of the retrieved were listed, include Ramsey, J. M. (19), papers, and their frequencies of occurSmith, R. D. (18), Wang, J. (17), Jacobrence in the text were computed. A taxonClustering son, S. C. (14), Yeung, E. S. (12), Anomy whose categories covered the techniphrases, authors, derson, G. A. (11), Umezawa, Y. (11), cal scope of the phrases was manually Carr, P. W. (11), Guillame, Y. C. (10), generated, and the phrases and their assoor organizationsciated occurrence frequencies were placed Peyrin, E. (10), and Sweedler, J. V. (10). These are rather impressive numbers for a the appropriate categories. The frecan be useful in two-year publication period. Most auquencies in each category were summed, thors have only one or two publications. for understanding andproviding an estimate of the level of emdisplaying Previous technical discipline studies show phasis of each technical category. These linkages. that author distribution functions range summed category frequencies were comfrom 1/N 2 to 1/N 3 (16–18), where N is pared with their counterparts extracted the number of papers per author. The aufrom “requirements” and “needs” docuthor distribution function for this examments, and estimates of aircraft technolople is closer to 1/N 3. gy adequacy and deficiency were made. One cautionary note—the technology perspective of aircraft Most prolific institutions S&T was a function of the database record field (abstract, keyThe most prolific organizations, and the number of articles on words, and title) examined, because different record fields were which they were listed from the 2000 Analytical Chemistry used for different purposes by the authors of the papers. Mulpapers, include Oak Ridge (45) and Sandia (17) National Lab- tiple-record fields must be examined to provide a balanced peroratories, Iowa State (27), Indiana (25), Texas A&M (20), spective of the overall technology. Texas Tech (17), Cornell Universities (15), the Universities of Specifically, aircraft longevity and maintenance were a focus California (all campuses, including Los Alamos and Lawrence of the aircraft keyword field analysis. These topics were not evLivermore National Laboratories) (83), Michigan (36), Texas ident in the literature using high-frequency abstract phrases; (32), Tokyo (31), Washington (27), Alberta (26), North Car- the lower-frequency abstract phrases had to be accessed. Also, olina (25), Florida (23), Illinois (22), Lund (18), and Ten- the abstract phrases from the aircraft study heavily emphasized nessee (16). laboratory and flight test phenomena, whereas a noticeable ab-

374 A

A N A LY T I C A L C H E M I S T R Y / J U LY 1 , 2 0 0 1

sence existed of any test facilities and testing phenomena in the keywords. Indexers may view much of the testing as a means to the end, and their keywords reflect the ultimate objectives or applications rather than the detailed approaches for reaching these objectives. However, high-performance aircraft characteristics, a category conspicuously absent from the keywords, were also emphasized in the abstract phrases. In fact, the presence of longevity descriptors (aging aircraft, maintenance, reliability, durability) and the absence of high-performance descriptors in the keywords provide a very different picture of aircraft literature than do the presence of high-performance descriptors and the absence of longevity and maintenance descriptors in the abstract phrases. Keywords are “summary judgements” made by the author and indexer about the main focus of the paper and represent a higher-level description of the contents than the actual words in the paper and abstract. Thus, the keyword indexer may decide that an abstract that describes “corrosion” research is really describing “maintenance” as the paper’s broader objective. Another explanation is that maintenance and longevity issues are politically popular now, and authors and indexers may be applying (consciously or subconsciously) this spin to attract reader interest. The above findings have strong implications for Web-based information retrieval. The Web is an undefined conglomeration of keyword, abstract, and other unstructured fields and many different databases with widely varying contents. Yet, these findings demonstrate that the record retrieval queries and technical perspectives obtained are strong functions of specific fields examined. The credibility of record retrieval and analysis quality from any Web text mining must be questioned, and improvement will require intensive and substantial work. The aircraft and other studies show conclusively that, for high-quality text mining, close involvement of the technical expert(s) is required at all stages, especially for the computational linguistics and taxonomy generation portions. To ensure that multiple perspectives are incorporated into the study and data anomalies are detected, many technical experts with diverse backgrounds are needed, in addition to text mining experts who have analyzed many different disciplines.

Identifying themes, theme relationships, and taxonomies All single-, double-, and triple-word phrases in the 2000 most recent Analytical Chemistry paper abstracts were examined, providing an interesting picture of the field. Phrases that appeared in the abstracts are capitalized. The main emphasis was on Analysis Separation Methods for Detecting Compound Mass and Concentrations. In order of emphasis (phrase frequency), the main Techniques in this Measurement-based discipline include Mass Spectrometry, Capillary Electrophoresis, Chromatography (mainly Liquid, but some Gas), and hybrid combinations of these techniques. Also, in order of emphasis, the main Mass Spectrometry techniques include Electrospray Ionization, Matrix-Assisted Laser

Desorption, Tandem, Time-of-Flight, Ion Cyclotron Resonance, and Quadrupole Ion Trap. Other high-emphasis Analysis techniques include Cyclic Voltammetry, Capillary Electrochromatography, Atomic Force Microscopy, Micellar Electrokinetic Chromatography, Laser-Induced Fluorescence Detection, Atomic Absorption Spectrometry, Supercritical Fluid Chromatography, X-ray Photoelectron Spectroscopy, and Scanning Electrochemical Microscopy. Mention of theoretical approaches and analyses is negligible. The main types of Compounds (Analytes) examined include, in order of emphasis, Amino Acids, Organic Compounds, Aromatic Hydrocarbons, Glucose Oxidase, Hydrogen Peroxide, Ascorbic Acid, and Horseradish Peroxidase. These Analytes reflect the strong biomedical thrust of analytical chemistry today. Although phrase frequency analysis identifies major and obscure technical themes, it does not identify the relationships among the themes or among the themes and the infrastructure. To achieve these goals, some type of co-occurrence or clustering analysis is required (2, 19, 20). In the aircraft study described earlier, the clustering generation and taxonomy generation were performed manually. The experts assigned phrases to categories. A more quantitative approach to defining categories, or clusters, is based on co-occurrence analysis in which a domain is selected (e.g., sentence, paragraph, abstract), and the number of times multiple elements co-occur in that domain (the co-occurrence frequency) is the key indicator of the relatedness of those elements. Thus, if the retrieved record database consists of 100 records, and Analytical and Chemistry co-occur in 30 of those records, then the co-occurrence frequency of these 2 terms is 30. A symmetrical matrix can be constructed with phrase headings along the rows and columns, and the matrix element (Analytical, Chemistry) would have a value of 30. For analysis purposes, it is very useful to nondimensionalize the matrix elements so that the co-occurrence frequency can be related to the total occurrence frequency of each phrase in the total text. Co-occurrence matrixes do not have to be symmetrical. Author–phrase, author–organization, and organization–phrase matrices could also be constructed. Analysis of these matrix patterns is very useful for understanding linkages that may not be immediately obvious from reading the literature. Table 1 shows a dimensionalized co-occurrence matrix for the highest-frequency phrases from the Analytical Chemistry retrieved records. Clustering phrases, authors, or organizations can be useful for understanding and displaying linkages. Many statistical packages are available for clustering; finding an appropriate algorithm is relatively straightforward. Defining the appropriate association metrics and cluster hierarchies is much more complex. Phrase clustering has a fractal characteristic, wherein clusters can be appropriately formed in different phrase frequency regimes. The cluster structures are similar, but high-frequency phrase clusters contain relatively less detail than low-frequency clusters. High-frequency phrases tend to be single words with modest detail; low-frequency phrases tend to be multiple words, with correspondingly greater detail.

J U LY 1 , 2 0 0 1 / A N A LY T I C A L C H E M I S T R Y

375 A

Table 2. Cluster analysis for Table 1. Number of records

CL2

CL4

CL8

CL16

175 107 94 90 50 215 155 93 70 50 42 41 44 136 117 113 104 77 70 36 63 31 52 37 35 131 127 77 69 62 52 45 40 93 61 58 40 35 34 33 118 37 36 89 39 36 96 82 32 87 44 41 37 33 52 47 42 36 32

A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F A–F G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P G–P

A–C A–C A–C A–C A–C A–C A–C A–C A–C A–C A–C A–C A–C D–F D–F D–F D–F D–F D–F D–F D–F D–F D–F D–F D–F G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N G–N OP OP OP OP OP

AB AB AB AB AB AB AB AB AB AB AB AB AB DE DE DE DE DE DE DE DE DE F F F G–I G–I G–I G–I G–I G–I G–I G–I G–I G–I G–I G–I G–I G–I G–I JK JK JK JK JK JK L–N L–N L–N L–N L–N L–N L–N L–N OP OP OP OP OP

A A A A A B B B B B B B C D D D D D D D E E F F F G G G G G G G G H H H H I I I J J J K K K L L L M M M N N O P P P P

Sensitivity Detection Limits Resolution Capillary Electrophoresis Acids Separation Compounds Mixture Temperature Mobile Phase Liquid Chromatography Stationary Phase Particles Ions Mass Spectrometry Proteins Experiments Laser Peptides Dissociation Instrument Plasma Complexes Distribution Assay Electrode Measurements Membrane Sensor Cell Film Enzyme Glucose pH Oxidation Reduction Aqueous Solution Theory Equation Diffusion Water Methanol Sodium Reaction Carbon Oxygen Molecules Fluorescence Dye Rate Surfaces Adsorption Transfer Kinetics Polymer Volume Stability Flow Coating

Unit clusters (for example A, B) are defined at the lowest level in the hierarchy (CL16). Merged character clusters (for example AB) represent a merger of unit clusters at higher levels in the hierarchy.

A clustering analysis was performed for the co-occurrence matrix in Table 1. A subset of the results is shown in Table 2, as a four-level hierarchical structure. The choice of the 4 levels, consisting of 2, 4, 8, and 16 clusters, was made after detailed inspection of the dendogram. (A dendogram is a tree-type schematic of the clustering process that shows how each phrase is linked to other phrases in the clustering process.) In the following discussion, phrases that appear in the clusters are capitalized. We are only going to discuss the top two levels and make references to specific clusters in the bottom two levels as appropriate. The top hierarchical level, CL2, was the computer output for a two-cluster assignment. The first cluster’s members have the attribute A–F (ABCDEF; it is the merger of the smaller clusters A, B, C, D, E, F that are defined separately at the 16 cluster fourth hierarchical level) in column CL2. The cluster has a central theme of physical Separation of Compounds in Mixtures to obtain composition Distributions and is focused around physical Separation techniques such as Electrophoresis, Chromatography, and Spectrometry. The second cluster’s members have the attribute G–P in column CL2 with a central theme of direct Reaction Rate Measurements of compounds in mixtures to obtain composition or specific compound distributions. The cluster is focused around Membrane or Film-Coated Sensor-Electrode techniques that use Surface Adsorption, electron-Transfer Kinetics, and Reaction-Rate dependent processes. The first cluster contains no specific mention of any theory-related terms and concentrates on experimental techniques (Capillary Electrophoresis, Liquid Chromatography, Mass Spectrometry, Assay), experimental tools (Laser, Spectrometer, Instrument, Detector), experimental processes and issues (Detection Limits, Sensitivity, Resolution, Separation, Experiments), and experimental variables (Temperature, Pressure, Ionization, Dissociation). These experiment-specific phrases tend to have high frequencies. The second cluster contains mainly experiment-based phrases (Electrodes, Sensor, Cell, Films, Substrate, Solutions, Dye, Polymers) that occur with high frequencies. Theory-related terms appear concentrated in the I and N subclusters, all of which have relatively low frequencies in the mid-30s. Thus, the discipline, as portrayed by Analytical Chemistry, appears to be heavy on experiments and light on theory. Of all the disciplines we have text mined, ana-

To obtain a more detailed technical understanding of the lytical chemistry has the strongest imbalance between expericlusters and their contents, the lower-frequency phrases in each ment and theory, as perceived through the literature. The Mass Spectrometry subcluster (D–F) focuses on Pro- cluster need to be identified. A different matrix element with a teins and Peptides; the Membrane-Coated Electrode subcluster nondimensional quantity that remains relatively invariant to dis(G) focuses on Enzyme and Glucose; and the element Reaction tance from the matrix origin is required. In addition, a different approach for clustering the low-frequency subcluster (JK) focuses on Water, phrases in the sparse matrix regions is reMethanol, Sodium, Carbon, and Oxygen. quired—one that relates the very detailed These focal areas span both clusters and low-frequency phrases to the more generreflect the strong emphasis of analytical al high-frequency phrases that define the chemistry on biological, biomedical, and cluster structure. In this way, the low-frebiochemical processes. quency phrases can be placed in their apThe CL4 hierarchical level was the propriate cluster taxonomy categories. computer output for a four-cluster assignThe following high-frequency phrases ment. The first cluster, attribute A–C, fowere selected as generic themes for the cuses on the solution-based physical Seplow-frequency phrases: Electrophoresis, aration processes of Electrophoresis and Capillary Electrophoresis, ChromatograChromatography. The second cluster, atphy, Liquid Chromatography, Spectromtribute D–F, focuses on the Ion massetry, Mass Spectrometry, Film, Membased Separation process of Mass Specbrane, Theory, Fluorescence, Kinetics, trometry. Although these two clusters can and Polymer. Four types of results were be differentiated on the basis of fundaobtained. Many of the lower-frequency mental physical operational characterisphrases were closely associated with only tics, their linkage under the first cluster in one higher-frequency phrase. For examthe top-level hierarchy reflects the potenple, the higher-frequency phrase Chrotial for tandem operation. Liquid Chromatography was associated with the matography and Capillary Electrophorelower-frequency phrases Atomic Emissis have been combined operationally as sion Detection, Pulsed Amperometric capillary electrochromatography, exploitDetection, Block Copolymers, Ionic ing the best features of both processes; Liquids, Vegetable Oil, Triglycerides, capillary electrochromatography has been combined with Mass Spectrometry to exHuman expertise Gas–Liquid Partition Coefficients, and Immobilized Metal Affinity. The phrases ploit the best features of those processes. is required to in this category, on average, tend to be The third cluster in the second hierarlonger and more detailed than in any of chical level, attribute G–N, focuses on the sift out the the other categories, and they also tend analysis of Surface Reaction, Adsorption, to be the lowest-frequency. and Transfer processes at Membrane or nuggets from the large amount Some lower-frequency phrases were Film-Coated Electrodes to identify Species concentrations in mixtures. The fourth of background unique to a second-tier cluster. For example, the higher-frequency phrases Fluorescluster in the second hierarchical level, atclutter. cence and Kinetics were associated with tribute OP, focuses on the Stability of the lower-frequency phrases Anisotropy Polymer-Coated sensors and cells under and Intensity, C18-Modified Silica, and different Flow and cell-Volume conditions. The contents of the 4-level taxonomy Determining the Excited State. A few described above are based on the 97 techlower-frequency phrases were unique to a nical phrases with the highest frequency. As such, they are valu- first-tier cluster. For example, the lower-frequency phrases Capable for displaying the overall thematic structure of analytical ture Negative Ion, Ginseng, Hydroxylations and N-Oxidations, chemistry, but are limited in describing the categories in any de- Methyl Vinyl Ketone, and Separation of Proteins were associattail. To obtain the more detailed low-frequency phrases associ- ed with the higher-frequency phrases Chromatography and ated with any of the clusters, numerical indicators are used to Mass Spectrometry in the first-tier cluster. All first-tier clusters identify the low-frequency phrases that occur in close physical shared a few lower-frequency phrases. For example, the lowerproximity to the high-frequency phrases. These low-frequency frequency phrases Alleles, Aptamer, Colon Adenocarcinoma detailed phrases are then placed in the same clusters as the high- Cell, Emitted Fluorescent Light, and Laser Beam Scanning are frequency phrases, allowing both the “needles-in-haystacks” associated with the higher-frequency phrases Electrophoresis (the low-frequency phrases) and the more generic descriptors and Fluorescence. As a general rule, the low-frequency phrases (the high-frequency phrases) to co-exist. It also places the low- in this category tend to be relatively generic, at least compared frequency phrases in the broader high-frequency context. with phrases in the other three categories.

J U LY 1 , 2 0 0 1 / A N A LY T I C A L C H E M I S T R Y

377 A

Barriers to text mining Despite the myriad potential applications of text mining, the surface of this powerful technique has barely been scratched because many barriers impede its widespread implementation. One problem is the lack of incentive. A substantial effort is required to obtain high-quality information retrieval and text mining. The computer can produce thousands of phrases and phrase patterns from the core text. Human expertise is required to sift out the nuggets from the large background clutter. Unfortunately, few, if any, rewards exist for expending the effort on high-quality text mining, and there are essentially no penalties for low-quality text mining. In addition, the “not invented here” syndrome is a strong disincentive for making a substantial effort to determine S&T performed elsewhere. Many S&T personnel are unaware of the processes, tools, and subsequent potential benefits of high-quality information retrieval and text mining. Database availability restricts what can be obtained from text mining. More than $500 billion of S&T is being performed globally on an annual basis, and only a very modest fraction of it is documented (21). Of that which is documented, only a fraction is accessed by the major S&T databases (SCI, EC, National Technical Information Service reports). Of this accessed and documented S&T, only a fraction is available to users because of cost, restricted access, inclusion of data fields not uniform across databases, lack of awareness, and software that is not user-friendly. A major problem with accessibility is that the database developers, not the S&T sponsors or the users, determine the contents of the databases. Of the available accessed, documented S&T, only a modest fraction is available to the information processing software due to poor information retrieval techniques and poor text-to-phrase conversion techniques. Database development, data input quality and structure, and data dissemination require horizontal cooperation among global entities and vertical cooperation among the full spectrum of S&T sponsors, database developers, journal publishers and editors, researchers, and managers. There is no coordinated agreement and support for the full data development and dissemination cycle. The paradox exists that cooperation among competitors is required for the common good. Lastly, text mining is not treated as an integral part of the S&T strategic management process (22). Rather, it is treated as an ad hoc add-on, isolated from other management decisions. The downside of such an approach is that the data available from ordinary business operations are driving the study objectives, rather than the study objectives driving the data necessary to quantify the business performance metrics. One benefit that has accrued from text mining studies, from an organization’s long-range strategic viewpoint, are the broadened horizons and perspectives of the technical experts who participated in the process and who can use this expanded knowledge to better support, conduct, and manage S&T. Although the text mining tools, processes, protocols, and products are important, they are of lesser importance to the organization’s long-term strategic health, relative to the expert with advanced capabilities.

378 A

A N A LY T I C A L C H E M I S T R Y / J U LY 1 , 2 0 0 1

The views expressed in this article are solely those of the authors and do not necessarily represent those of the U.S. Navy.

Ronald N. Kostoff is a physical science administrator and Ronald A. DeMarco is acting associate technical director at the Office of Naval Research. Kostoff’s interests include research evaluation and text mining. DeMarco’s research interests include the synthetic chemistry of fluorinated molecules. Address correspondence about this article to Kostoff at Office of Naval Research, 800 N. Quincy St., Arlington, VA 22217 ([email protected]).

References (1) (2)

(3)

(4)

(5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20)

(21) (22)

Losiewicz, P.; Oard, D.; Kostoff, R. N. J. Intell. Inform. Syst. 2000, 15 (2), 99–119. Hearst, M. A. Proceedings of ACL ’99, the 37th Annual Meeting of the Association of Computational Linguistics; University of Maryland, June 20–26, 1999; ACL: New Brunswick, NJ, 1999 (also available at http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html). Proceedings SIGIR 99, Proceedings of the Conference on the 22nd International Conference on Research and Development in Information Retrieval; Hearst, M., Gey, F., Tong, R., Eds.; ACM Press: New York, 1999. NIST Special Publication 500 240; Proceedings of the Sixth Text Retrieval Conference (TREC-6), NTIS, order number PB98-148166; Voorhees, E. M., Harman, D. K., Eds.; Department of Commerce, National Institute of Standards and Technology: Gaithersburg, MD, 1998. Kostoff, R. N.; Eberhart, H. J.; Toothman, D. R. J. Inf. Sci. 1997, 23 (4), 301–311. Braun, T.; Bujdoso, E.; Schubert, A. Literature of Analytical Chemistry: A Scientometric Evaluation; CRC Press: London, 1987. Swanson, D. R. Perspect. Biol. Med. 1986, 30 (1), 7–18. Swanson, D. R.; Smalheiser, N. R. Artif. Intell. 1997, 91 (2), 183–203. Swanson, D. R. Abstr. Pap. Am. Chem. S. 1999, 217:010-CINF. Smalheiser, N. R.; Swanson, D. R. Neurosci. Res. Commun. 1994, 15 (1), 1–9. Smalheiser, N. R.; Swanson, D. R. Comput. Meth. Prog. Bio. 1998, 57 (3), 149–153. Smalheiser, N. R.; Swanson, D. R. Arch. Gen. Psychiat. 1998, 55 (8), 752–753. Kostoff, R. N. Technovation 1999, 19 (10), 593–604. Gordon, M. D.; Lindsay, R. K. J. Am. Soc. Inform. Sci. 1996, 47 (2), 116–128. Finn, R. The Scientist 1998, 12 (10), 12–13. Kostoff, R. N.; Green, K. A.; Toothman, D. R.; Humenik, J. J. Aircraft 2000, 37 (4), 727–730. Kostoff, R. N.; Eberhart, H. J.; Toothman, D. R. Inform. Process. Manag. 1998, 34 (1), 69–85. Kostoff, R. N.; Braun, T.; Schubert, A.; Toothman, D. R.; Humenik, J. J. Chem. Inform. Comput. Sci. 2000, 40, 19–39. Kostoff, R. N.; Eberhart, H. J.; Toothman, D. R. J. Am. Soc. Inform. Sci. 1999, 50 (5), 427–447. Zamir, O. Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. Ph.D. Thesis, University of Washington, 1999. Kostoff, R. N. The Scientist 2000, 14 (9), 6. Kostoff, R. N.; Geisler, E. Technol. Anal. Strate. Manag. 1999, 11 (4), 493–52.