Software makes speech recognition more accurate and reduces correction time

Society has witnessed a vast increase in the use of pattern recognition, artificial intelligence, and big data analytics in government, business, and other fields.  Increasingly, the issue is quickly finding and verifying the relatively few errors made by more accurate pattern recognition.  End users often find confidence scores and other techniques designed to indicate relative accuracy difficult to understand.  Output comparison can help find errors by comparing recognition output from different recognizers.  It is a relatively easy-to-understand indicator of possible error and need for review. 

Company stated in 2005, "While there is also a proliferation of software programs for conversion, interpretation, and analysis of speech, audio, text, image, or other data input, there is a continued reliance upon manual processes for interpretation of the same data input.  Among other drawbacks, none of these programs have a single graphical user interface to present or compare results from computers, humans, or between both computers and humans."   

U.S. patents and patent pending

Beginning in 1998, company developed software for dictation, transcription, speech recognition, text to speech, audio file conversion, and other speech and language processing.  Company has written software to integrate with Dragon (Nuance), Microsoft, and IBM speech recognition, AT&T text to speech, Philips microphones, Sony and Olympus handheld recorders, Infinity transcriptionist foot pedals, X-keys programmable keypads, and ProNexus telephony.  Company now operates as a virtual business out of Terre Haute, Indiana, southwest of Indianapolis.  

In addition to dictation and transcription utilities, software has included:

--Workflow manager
--SpeechMax™ speech-oriented HTML session file editor
--SweetSpeech™ nonadaptive speech engine and toolkit
--SpeechServers™ speech recognition and other speech processing

Customers have included health care organizations, doctors, lawyers, law enforcement, solo transcriptionists, transcription companies, and large and small businesses in the U.S. and overseas.  Latest software versions are compatible with Windows XP, with limited testing on Vista and later Windows versions. 

SpeechMax™ . . . . More than a word processor™

SpeechMax™ is a speech-oriented, multilingual, multiwindow HTML document editor.  Software provides a single interface for synchronizing, reviewing, and correcting output from a wide variety of manual or automatic processing of audio, text, or image data.  It supports data entry and review for a wide variety of processes, including dictation, manual transcription, editing speech recognition text, real-time and desktop server-based speech recognition, text to speech, and other speech and language processing.  Interface can also display output from data processed by two or more manual and/or automatic processes.  Each document window supports annotation window(s) for text or audio comments aligned to document window input audio, text, or image.  Interface supports aligning a web page or video segment to specific document text. 

Functionality includes:
--Multiwindow display
--Tile multiple windows horizontally, vertically, or in cascade
--Differences highlighted for rapid error detection
--Synchronize (normalize) session file tags
--Create identical segment numbers
--Compare synchronized transcribed outputs
--Tab navigate to synchronized segment
--File compare after synchronize audio, text, or image
--Text correct after output comparison 
--Create composite best-guess result from > 2 files
--Highlight differences between best-guess and other files
--Compare synchronized text in multiple languages
--Synchronized display text flow for R to L and L to R text
--Compare text of multiple drafts or translations
--Support for file compare w/o synchronization (as in a standard word processor, comparing text strings without reference to input data)
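The composite best-guess and difference-highlighting features above can be illustrated with word-level voting across aligned recognizer outputs.  A minimal Python sketch, assuming the outputs are already synchronized word for word (function and data names here are illustrative, not the product's API):

```python
from collections import Counter

def best_guess(aligned_outputs):
    """Majority vote per word position across aligned recognizer outputs.

    aligned_outputs: list of token lists, one per recognizer, already
    synchronized so position i in each list covers the same audio.
    Returns (composite, flags) where flags[i] is True when the
    recognizers disagreed at position i (a candidate for review).
    """
    composite, flags = [], []
    for words in zip(*aligned_outputs):
        counts = Counter(words)
        word, _ = counts.most_common(1)[0]  # most frequent candidate wins
        composite.append(word)
        flags.append(len(counts) > 1)       # any disagreement => review
    return composite, flags

# Example: three engines transcribing the spoken "the cat sat on the mat"
outputs = [
    "the cat sat on the mat".split(),
    "the hat sat on the mat".split(),
    "the cat sat in the mat".split(),
]
guess, review = best_guess(outputs)
# guess  -> ['the', 'cat', 'sat', 'on', 'the', 'mat']
# review -> [False, True, False, True, False, False]
```

With only two outputs there is no majority to break a disagreement, which is why the flagged positions are routed to a human editor rather than resolved automatically.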

Use WordCheck to detect speech recognition mistakes and errors from other computer- and/or manually-processed audio, image, or text data.  Reduce speech editor review time by up to 80% with 90% accurate speech recognition.  Software's novel interface also supports text/audio redaction and comparison of scrambled audio-aligned text (ScrambledSpeech™) transcribed from the same audio.       

WordCheck™ for Speech Recognition--Why New Techniques are Needed  

Spelling error vs. recognition error
A spell checker detects manual transcription errors, but speech engine misrecognitions are almost always spelled correctly.  The rare exception is a word that is misspelled in the speech recognition lexicon (dictionary).  A speaker proofing text who remembers what was said, or a speech recognition editor replaying audio synchronized to text, can detect a misrecognized word or phrase.  As with manual transcription, the speaker should check the final document before distribution.  See The Joint Commission, Division of Health Care Improvement, "Transcription Translates to Patient Risk," Quick Safety (April 2015).

Using text compare with speech recognition
With text compare, the user can quickly identify text differences in red in two or more document windows, navigate directly to these differences, play back linked audio, and select correct text from a dropdown menu or enter text by keyboard or voice.  With highly accurate recognition, text compare can substantially reduce the text and audio that the speech editor must review, and typically enables the editor to return a document to the speaker with accuracy comparable to gold-standard manual transcription.

More time savings with highly accurate recognition
Speech engines tend to misrecognize the same audio differently.  When comparing output from less accurate speech engines, there are more text differences between the engines.  With more accurate recognition, there are fewer differences.  As a result, editor review time decreases more significantly as accuracy increases.  At 100% accuracy with two speech engines, there are no differences and no audio for speech editor review. 

Identical misrecognition
When a speech engine misrecognizes a word, the transcribed text is usually correctly spelled, e.g., where the speaker dictated "that" and the speech engine transcribed "hat".  The rare exception is where the word is misspelled in the speech spelling dictionary (lexicon).  Reasons for speech recognition errors are poorly understood.  See Which Words are Hard to Recognize? Prosodic, Lexical, and Disfluency Factors that Increase Speech Recognition Error Rates. 

Two or more speech engines rarely make an identical mistake, for example, where the speaker says "that" but both engines transcribe "hat".  In that case, the text compare process does not detect a difference.  Identical misrecognitions occur less frequently with highly accurate speech recognition. 
Nuance® case study reports up to 98% accuracy for physician users of Dragon® Medical.   With this high rate of accuracy, identical misrecognitions are 2% or less.  See Voice Recognition Software Dictation Test 
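The 2% figure is an upper bound; a rough calculation suggests the true rate of identical misrecognitions is far lower.  For illustration, assume two engines each with a 2% word error rate and statistically independent errors (a simplifying assumption, since real engines share training data and acoustic modeling techniques):

```python
p_err = 0.02                 # word error rate of each engine (98% accuracy)

# Probability that BOTH engines err on the same word, assuming
# independent errors:
p_both_err = p_err * p_err   # 0.0004, i.e. 0.04% of words

# An identical misrecognition also requires both wrong outputs to be
# the SAME wrong word, so its rate is at most p_both_err -- far below
# the 2% ceiling set by the overall error rate.
```

Correlated errors (shared acoustics, shared vocabulary bias) push the real number above this idealized bound, which is why the text treats the overall error rate as the conservative limit.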

How text compare works  
The screenshot below shows a transcribed dictation for a chest x-ray report.  Speech utterances (phrases separated by a short pause) are delimited by vertical purple lines in the dual-document view below.  Utterance (audio phrase) boundaries are established by SpeechMax™ analysis of the utterance waveform.  Synchronization (normalization) results in display of utterance text from two or more speech engines arising from the same audio.  Synchronized utterances each have the same audio start/stop times.

In a preferred method, the editor reviews text and listens to audio representing only the differences.  The editor can immediately tab to text differences in red, play back audio, select correct output for the final version, and so on until complete.  The editor returns a copy to the dictating speaker for review.  The speaker has the option to make changes and approve for final distribution, or return the document to the editor for additional editing.  Annotation (comment) windows linked to specific document text enable the speaker to indicate the specific changes requested.  After any additional editing, the process can format, copy into a template, or otherwise prepare the document for distribution.

Sometimes errors corrected by the speaker will represent identical misrecognitions, where both speech engines mistranscribe (misrecognize) the same word the same way.  There are no identical misrecognitions in the two texts below.  More commonly, two speech engines mistranscribe the same word differently so that both are incorrect (e.g., one engine transcribes "mouse" and another "douse" for the spoken "house").  The text differences in red indicate that there is a difference for editor review.  Where the engines misrecognize the same word differently, the flagged difference in each text reflects the same speech audio, so the editor reviews only that single word audio tag--in a 10-word sentence, about 10% (1 x 10%) of the audio.  Where the engines each misrecognize a different word in a 10-word sentence, the editor plays back two different audio tags, about 20% (2 x 10%) of the word audio.  
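The synchronized comparison described above can be sketched as follows.  The segment tuples and function name are illustrative assumptions for this sketch, not the actual SpeechMax™ session file format:

```python
def flag_differences(segments_a, segments_b):
    """Compare two synchronized transcriptions utterance by utterance.

    Each segment is (start_ms, stop_ms, text).  Synchronization means
    both engines share identical utterance boundaries, so segments can
    be compared pairwise.  Returns the segments an editor must review:
    (start_ms, stop_ms, text_a, text_b) for each mismatch.
    """
    to_review = []
    for (s1, e1, t1), (s2, e2, t2) in zip(segments_a, segments_b):
        assert (s1, e1) == (s2, e2), "segments not synchronized"
        if t1 != t2:
            to_review.append((s1, e1, t1, t2))  # difference => review
    return to_review

# Two engines transcribing the same two utterances of a report:
engine_a = [(0, 900, "the chest x-ray shows"), (900, 1800, "no acute disease")]
engine_b = [(0, 900, "the chest x-ray shows"), (900, 1800, "no cute disease")]
review = flag_differences(engine_a, engine_b)
# Only the second utterance needs playback; the first is skipped entirely.
```

Matching segments drop out of the editor's queue, which is where the review-time savings come from.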


Hours/minutes saved by using text compare to correct errors depends upon two key factors--time required to review speech audio and speech recognition accuracy . . .    

Text edit time without error-spotting
American Health Information Management Association (AHIMA) position paper indicates that it takes a speech recognition editor 2-3 minutes to listen to and edit a minute of speech recognition text (AHIMA October 2003).  With 1 hour (60 minutes) of dictation audio, it can take up to 180 minutes to edit the document using traditional speech recognition editor techniques.

Fewer text differences => more editor time savings
As speech engines become more accurate, there are more time savings.  With 90% accuracy in both speech engines, there is an estimated reduction of up to 80% in audio review time for the speech editor using text compare.  Speech editor can reduce review time to as little as 36 minutes--a time savings of up to 144 minutes (.8 x 180).  Speaker saves time because speech editor has edited most, if not all, errors--leaving speaker with few, if any, corrections to make. 
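The arithmetic behind these estimates, using the AHIMA editing ratio cited above and the company's estimated 80% reduction:

```python
dictation_min = 60        # one hour of dictation audio
edit_ratio = 3            # up to 3 editing minutes per audio minute (AHIMA)
reduction = 0.80          # estimated review-time cut at 90% accuracy

traditional = dictation_min * edit_ratio      # 180 minutes of editing
with_compare = traditional * (1 - reduction)  # 36 minutes with text compare
saved = traditional - with_compare            # 144 minutes saved
```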

* Expected audio review time refers to expected percentage decrease in speech editor audio review time when reviewing text with differences.  Speech editor corrects most or all errors for dictating speaker.  Text returned to speaker generally will have about the same error rate as manual transcription with highly accurate speech recognition.  When there are no recognition errors, speech editor review time drops to zero with no differences being detected for review.  No recognition errors are highlighted when speech engines make the same mistake (identical misrecognition).  Bar graph represents transcription by two different speech engines with respective recognition errors representing different speech audio.    

More efficient correction makes organization more efficient 
Some employers require that speech recognition speakers self-correct because it saves on back-end editing costs.  There are trade-offs, because self-correction limits a speaker's ability to do other work.  Some physicians, for example, say that they could spend more time with patients, perform more surgeries, or read more x-rays or EKGs if they spent less time correcting their dictation.  Other speech recognition users such as physician assistants, nurses, lawyers, law enforcement, government workers, and others face similar challenges.

Limits of confidence scores
Some propose using confidence scores as an indicator of accuracy for speech recognition and other pattern recognition, including big data analytics and various forms of artificial intelligence.  In company experience, these scores have limited value: the typical end user rarely understands how the process generates these scores or what they mean.  Company has found that it is easier for the typical editor to understand the implication of text differences in red than a computer-generated "confidence" score.  Because of this, text compare is more likely to result in significant editor efficiency gains and improved turnaround time.     

A technology whose time has come
With improved software and faster computers in recent years, speech recognition has become highly accurate for many dictating speakers.  Speech recognition accuracy of up to 98% is obtainable for some speakers.  Occasionally there is even 100% accuracy.  Even with 90% accurate recognition, there are significant potential speech editor productivity gains compared to use with less accurate earlier recognition technology (see bar graph above). 

Using WordCheck™ with other pattern recognition or manual processing
Company's patented process finds recognition mistakes using the principle that output differences by two pattern recognition processes indicate a mistake by one or both.  Same logic applies to differences between three or more texts, as well as nonspeech audio, text, or image pattern recognition.  Examples include:
--Audio mining
--Speaker identification
--Machine translation
--Natural language understanding
--Facial recognition
--Fusion biometrics (e.g., facial + speaker recognition)
--Computer-aided diagnosis (CAD) for medical purposes
--Manually-generated transcription or translation with audio-tagged text (as may be useful for education)

SpeechMax™ Text Comparison--Summary

Navigate to errors quickly with SpeechMax™
--Text differences in red
--Supports display synchronized audio-tagged text
--Recognition differences indicate recognition error
--Text compare identifies recognition differences
--Correctionist (editor) reviews audio, corrects if required
--No highlighting if identical misrecognition 
--Speech editor may review only differences
--Speaker expected to review entire text

Other novel features
--Voice correction by second speaker of first speaker text
--Train both original speaker and correcting speaker
--Train multiple speakers
--Use for web-based, multispeaker collaborative document
--My AV Notebook™=voice-oriented electronic scrapbook
--Use for speech-oriented multimedia, karaoke, lecture 
--Edit document text and audio tags to improve audio quality
--ScrambledSpeech™ control reorders content and protects privacy and confidentiality

For additional information, see About


Nonadaptive speech recognition user profile training
--SpeechSplitter™ segments audio before transcription
--Text manually transcribed from audio
--Need > 6-8 hours text/audio to create individual profile 
--Accuracy nonadaptive > adaptive recognition, see below.
--Create acoustic and language models, lexicon
--Also create small group model for meetings or videos
--Best result if same small-group speakers over time
--Use data/models for other speech/language processing

Includes speech engine and do-it-yourself (DIY) toolkit
--Create user profile tuned to speaker's speaking style
--DIY tools for novice with minimal/no lexical expertise
--Use text to speech phonetic generator for lexicon
--Speech engine automatic lexical questions generator
--Use for virtually all languages or dialects
--Create profiles for speech impaired or medical diagnostics
--Create profile for robot speaker or imaginary languages

One-of-a-kind high-level tools for nonexperts
This do-it-yourself toolkit includes easy-to-use techniques for transcriptionists, local IT personnel, and other users to create the acoustic model, language model, and lexicon for nonadaptive speech recognition, as well as speaker-independent and speaker-dependent software.  Among other aids, it includes a text to speech (TTS) phonetic generator for phonemes.  The user listens to speech, enters a phoneme spelling for words and subwords into the TTS phonetic generator, plays back synthetic speech generated from the phonemes, compares the sound to the recorded speech, and modifies the phonemes as required until the synthetic speech best matches the recorded speech.  Experience indicates that consistent phoneme use is more important than the actual representation chosen.  Software supports creation of speech recognition profiles based upon user-specific data, data from small groups (e.g., a board of directors or speakers in a video), or large groups (to create a speaker-independent group model).  Use SharedProfile™ data to create models for other automatic speech and language processing such as voice commands, text to speech, machine translation, speaker ID, and natural language processing.  High-level tools, such as the automatic pronunciation generator and automatic questions generator, reduce the need for expensive lexical expertise when creating speech user profiles. 

Higher accuracy with nonadaptive speech recognition
With nonadaptive techniques, models more accurately reflect single-speaker speech and background and channel noise.  Microsoft research (below) indicates that "massively speaker specific" nonadaptive models have higher accuracy than industry-dominant speaker adaptive technique. 
SweetSpeech™ model builder also supports custom, personalized speech user profile creation for conversation or for medical diagnostics for Parkinsonism, autism, patients with cleft lip, and other diseases affecting speech.

Increased accuracy in the bar graph below refers to an 8.6% error reduction for nonadaptive recognition compared to the adaptive technique with less than 8 hours of training data, and a 12.3% error reduction with greater than 15 hours of training.  Word error rate (%) is about 1% lower for the nonadaptive speaker-specific model across different levels of speaker training data.  Word error rate is estimated at about 7% with training data of 12,000 sentences for the nonadaptive technique, and 8% for the adaptive model with the same level of training data.  Similarly, there is a relative error reduction of 45% and 52.2% for the nonadaptive approach compared to the speaker-independent model.
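Relative error reduction, the metric quoted above, is derived from the two word error rates.  Using the rounded ~7% and ~8% figures from the text (the published 12.3% figure presumably comes from unrounded error rates):

```python
def relative_error_reduction(wer_baseline, wer_new):
    """Fraction of the baseline's errors eliminated by the new system."""
    return (wer_baseline - wer_new) / wer_baseline

# ~8% adaptive vs. ~7% nonadaptive WER at 12,000 training sentences:
rer = relative_error_reduction(0.08, 0.07)    # ~0.125, i.e. ~12.5%
```

Note that a 1-point absolute drop in word error rate yields a much larger relative reduction when the baseline error rate is already low.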

For additional information, see About


Server-based speech recognition
--Dragon, IBM, Microsoft, SAPI 5.x, SweetSpeech™
--TransWaveX™ returns text or audio-linked text
--CompleteProfile™ enrolls speaker with audio/text
--SpeechTrainer™ supports iterative corrective training
--Server transcribes audio, compares text to verbatim
--Continues until target accuracy achieved or iteration limit reached
--Create SweetSpeech™ single speaker or group profile  
--Integrates with Command!™ workflow system (including  CallDictate™ telephony server)

Error correction using text compare
The review above describes error spotting using dual-engine text compare.  Error correction is followed by SpeechServers™ repetitive iterative training of the speech user profile using the SpeechTrainer™ service.  If a word is incorrectly recognized, the audio is replayed by the server and retranscribed until it is accurate, or for a preset number of iterations. 
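The iterative corrective training loop can be sketched as follows.  The `transcribe` and `train` callables stand in for the server's recognizer and profile-update steps (hypothetical names for this sketch, not the product API):

```python
def corrective_training(audio, verbatim, transcribe, train, max_iters=5):
    """Retrain a speech user profile until audio transcribes correctly.

    transcribe(audio) returns the recognizer's current transcript;
    train(audio, verbatim) updates the user profile from corrected text.
    Loops until the transcript matches the verbatim text or the preset
    iteration cap is reached.  Returns the number of training passes.
    """
    for i in range(max_iters):
        if transcribe(audio) == verbatim:
            return i            # target accuracy reached
        train(audio, verbatim)  # corrective training pass
    return max_iters            # iteration limit reached

# Toy stand-ins: the "engine" starts wrong and is fixed by one pass.
state = {"text": "hat"}
iters = corrective_training(
    audio=b"...",
    verbatim="that",
    transcribe=lambda a: state["text"],
    train=lambda a, v: state.update(text=v),
)
# iters -> 1
```

The iteration cap matters because some audio (heavy noise, out-of-vocabulary words) may never reach the target accuracy no matter how many passes are run.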

For additional information, see About


Customers include transcription companies, doctors and hospitals, law firms, software developers, manufacturers, insurers, schools and universities, law enforcement, government, and large and small businesses. 

"We selected their SpeechServers™ running with Dragon NaturallySpeaking . . . .  we recommend it to anyone looking for cost-effective software for server-based  speech recognition."  Robert Duffy, Programmer, Columbia University Medical Center, New York, NY

"After implementation of web transfer server and web services for transcription, there was no significant interruption in service for the doctors or transcriptionists.  The typing ladies think the new server is wonderful . . . . "   Ian Yates, Medical I.T. Ltd., Queensland, Australia

"Since installing your product, acWAVE™ we have processed over 1,000,000 conversions (on one server alone) without a problem. . . . Thank you for such a great product."  Scott D. Stuckey, ASP Product Manager, Voice Systems, Inc. Tampa, FL



Various publications have reviewed company software.  Most recently, software was described in a compilation of articles edited by William Meisel, editor of industry trade journal Speech Strategy News.  Compilation also includes articles written by contributors from Nuance, Microsoft, IBM, and other companies.





Price, terms, specifications, and availability are subject to change without notice. Custom Speech USA, Inc. trademarks are indicated.   Other marks are the property of their respective owners. Dragon and NaturallySpeaking® are licensed trademarks of Nuance® (Nuance Communications, Inc.)