A: The Sound of Talking - Analytical Chemistry (ACS Publications)

Jun 2, 2011 - all which isn't talking is mere typing and all typing's typing to oneself but the very tone of talking is telling. Raymond E. Dessy. Ana...
A/C

WebWorks

The Sound of Talking

all which isn't talking is mere typing and all typing's typing to oneself but the very tone of talking is telling (7)

Prelude

Guttural sounds and organized speech preceded writing. Reputations in science depend upon the spoken and the written word. Nevertheless, when it comes to science, our network computers are mute. Despite inexpensive soundcards and a decade of research in text-to-voice and voice-to-text conversion, a Webchemist's exposure to sound is usually limited to cute sucking sounds as a file is deleted, the chatter of c|net radio, or folksongs from the Mudcat Cafe. That silence may be transformed into a sonata. This column explores demos, downloads, and new dimensions in sound for the lab. Would you like to shift font and emphasis on voice command, dictate e-mail, translate it into another language, listen to voicetext, lecture in foreign languages, or even hear data in surround sound? Listen!

Exposition

Speech-aware software enables the acoustic dimension. Speech recognition (voice-to-text) and speech synthesis (text-to-voice) are currently "in crescendo" because computer processors with speeds of greater than 150 MHz are available to meet the challenges. IBM's TV ads during the Nagano Olympics, and the hype, have also helped. Speech recognition may be speaker-dependent or -independent. However, until recently, PC software could handle only discrete utterances, which have approximately 100-ms gaps between words. Continuous speech analysis is now possible. For example, speaker verification finds application in secure access systems. Command voice recognition lets one word replace a series of keystrokes, providing control and navigation. Voice dictation into a native-language text format is commonplace, and the conversion of your words into a foreign tongue (cross-language) is now possible. The ultimate goal, of course, is a natural language front end for your WebTools. Speech synthesis lets them talk back, providing status information. Unlike with phone menus, we can always barge in on the talking computer, leaving us still in charge.

Development

The analog sound is usually captured at 11-kHz rates. A frequency analysis is performed over 10-ms windows, and the slice is labeled according to basic sound content, or phoneme (fo-nem). In spoken language, an initial sound will directly influence the sound that follows. Thus, a statistical model is often used to correctly identify connected sounds. The software word decoder then attempts to assign word identity, evaluating grammar/syntax relationships with the two adjacent words. Predefined dictionaries of 16-64 kB are common, along with backup dictionaries of 128-256 kB. Discipline-specific vocabularies of 16-64 kB are available for professions such as medicine or law. Unfortunately, a chemistry adjunct is not, to the best of my knowledge, yet available. The word assignment algorithm assigns a firm status to most words in real time. Where ambiguity exists, the words are labeled infirm. The latter words are largely resolved in relevant time as sentences are completed. Our own minds resolve unrecognized words from their context this way. Some software packages claim an input capacity of 125 words per minute (wpm) with 10% error rates without training, and

Voice-to-command/control: It is simple to map the sound shape of a single word to a string of ASCII command characters. For example, rather than a finger-twisting Ctrl+Shift+\, merely uttering "circumflex" brings up special font characters. Such long words have a sensuousness and are easily recognized. Saying "Italics Times Roman" is quicker than four mouse motions involving modality shifts from keyboard to mouse or scroll ball. Voice posting to a form from a fixed-text prompt menu is trivial. How does it work? See http://www.commandcorp.com/incube welcome.html.

Voice-to-text: Software costing $50-$500 and a 166-200 MHz MMX processor with 32 MB of memory are needed for continuous speech. A quality noise-canceling, directional, close-talking microphone and earphone headset are mandatory; inexpensive pendant mikes and speakers are unsatisfactory. Many standard soundcards for analog-to-digital and digital-to-analog conversions are available. The combination of software and microphone provides a text manipulator appropriate for e-mail dictation, text insertion into standard word processors, and text editing. This hands-free environment is important for lab work, providing a place where the Internet and Intranet are new reagents. Cross-language support tools for Web searches and e-mail are becoming available.
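The word-to-keystroke mapping described in the sidebar can be pictured as a simple lookup table. The sketch below is only an illustration under stated assumptions: the spoken phrases come from the text, but the table name COMMAND_TABLE, the helper functions, and the keystroke sequences are hypothetical placeholders, not the behavior of any particular commercial package. It also assumes the recognizer has already turned the sound into a text label.

```python
from typing import Dict, List, Optional

# Hypothetical command table: recognized utterance -> keystroke sequence.
# The phrases are taken from the column; the keystrokes are illustrative only.
COMMAND_TABLE: Dict[str, List[str]] = {
    "circumflex": ["Ctrl+Shift+\\"],
    "italics times roman": ["Ctrl+I", "Ctrl+Shift+F", "Times Roman", "Enter"],
}


def normalize(utterance: str) -> str:
    """Lower-case the recognized text and collapse runs of whitespace."""
    return " ".join(utterance.lower().split())


def lookup_command(utterance: str) -> Optional[List[str]]:
    """Return the keystroke sequence for a recognized utterance, or None."""
    return COMMAND_TABLE.get(normalize(utterance))


if __name__ == "__main__":
    for spoken in ("Circumflex", "Italics  Times Roman", "benzene"):
        print(spoken, "->", lookup_command(spoken))
```

An utterance that is not in the table returns None, so a caller could fall back to dictation mode rather than issue a wrong command.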

Analytical Chemistry News & Features, May 1, 1998 341 A


5% error rates with "enrollment", a euphemism for training the software package. However, most software reviews report 75-100 wpm and a larger error rate. These errors must be edited manually. Nevertheless, examine the phrases "I couldn't hear here due to two too many of their words there" or "I'm working on a new polymer with Polly Ester", and one's respect for these systems increases. English grammar and homonyms provide formidable chasms and challenges. You can examine some commercial offerings and demos at http://www.dragonsys.com/demos/demos.html; http://www.software.ibm.com/is/voicetype/; http://www.research.microsoft.com/research/srg/install.htm; and http://www.lhs.com/kurzweil/.

Text-to-voice: This auxiliary provides "eyes-free operation", command confirmation, remote e-mail access, keyboard note-taking while listening to a text playback, and implementation of various cartoon-image agents with attention-getting voices. A large look-up table of symbolic phonetic representations for root words begins the pronunciation process. In English, these representations might involve 35-50 phonemes. Fluid transitions to yield smooth speech require cutting and splicing at phoneme "centers", producing diphones, and ~400 elements. Some approaches use even smaller voice fragments for greater fluidity, and these segment databases are very large. Lexical and syntactic principles are used in the speech construction. Prosody (or emphasis) and intonation must be added for proper understanding. Finally, variables such as pitch limits, spectral characteristics, vocal-tract specifications, breathiness, and rate yield gender and voice type. The digital data is played back at 8-11 kHz. File formats are available for the various Wintel, Apple, and Unix platforms (wav, aiff, and au). Many Western European languages are supported. You can explore demos and offerings at http://www.bell-labs.com/projects/ ft