VoiceXML
Kevin Curran, Computer Lecturer - Magee College
VoiceXML is a language for creating voice-user interfaces, particularly for the telephone. It uses speech recognition and touchtone (DTMF keypad) for input, and pre-recorded audio and text-to-speech synthesis (TTS) for output. It is based on the World Wide Web Consortium's (W3C's) Extensible Markup Language (XML), and leverages the web paradigm for application development and deployment. By having a common language, application developers, platform vendors, and tool providers can all benefit from code portability and reuse.
With VoiceXML, speech recognition application development is greatly simplified by using familiar web infrastructure, including tools and Web servers. Instead of using a PC with a Web browser, any telephone can access VoiceXML applications via a VoiceXML "interpreter" (also known as a "browser") running on a telephony server. Whereas HTML is commonly used for creating graphical Web applications, VoiceXML can be used for voice-enabled Web applications.
There are two schools of thought regarding the use of VoiceXML:
One popular type of application is the voice portal, a telephone service where callers dial a phone number to retrieve information such as stock quotes, sports scores, and weather reports. Voice portals have received considerable attention lately, and demonstrate the power of speech recognition-based telephone services. Voice portals, however, are certainly not the only application for VoiceXML. Other application areas, including voice-enabled intranets and contact centers, notification services, and innovative telephony services, can all be built with VoiceXML.
By separating application logic (running on a standard Web server) from the voice dialogs (running on a telephony server), VoiceXML and the voice-enabled Web allow for a new business model for telephony applications known as the Voice Service Provider. This permits developers to build phone services without having to buy or run equipment.
Although VoiceXML was originally designed for building telephone services, other applications, such as speech-controlled home appliances, are starting to be developed.
The rapid growth of the Web was due largely to its open architecture and high-level common interfaces to differing computing resources. HTML and HTTP hide much of the complexity of building interactive applications. Just as an HTML developer doesn't need to know how bits paint the screen of a web user's PC, VoiceXML shields developers from many of the complexities of telephony platforms.
VoiceXML has features to control audio output; audio input; presentation logic and control flow; event handling; and basic telephony connections. These and other features are described as follows:
Dialogs <menu>, <form>
Beyond the scope of the language are application logic, state management, dialog generation and sequencing, database operations, and interfaces to legacy systems (e.g., "screen scraping"). These are handled by traditional Web application programming techniques.
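This division of labor can be illustrated with a minimal sketch: the VoiceXML page only collects the caller's input, then hands it to a program on the Web server, which does the real work (a database lookup, for instance) and returns the next VoiceXML page. The form name, the field name `code`, the prompt wording, and the URL `http://www.example.com/weather` are all hypothetical, chosen purely for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="weather">
    <!-- Built-in "digits" field type: the caller speaks or keys in digits.
         The field name "code" is an assumption for this sketch. -->
    <field name="code" type="digits">
      <prompt>Please say or key in your area code.</prompt>
    </field>
    <!-- Visited after the field has been filled: pass the collected value
         to a (hypothetical) server-side program, which replies with
         the next VoiceXML page. -->
    <block>
      <submit next="http://www.example.com/weather" namelist="code"/>
    </block>
  </form>
</vxml>
```

Everything behind that URL, such as looking up the forecast and generating the reply page, is ordinary Web application programming and lies outside VoiceXML itself.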
A VoiceXML application consists of several components, as shown in Figure 1:

Figure 1: Components of a VoiceXML Application
Figure 2 shows the relationship between a traditional Web application and a voice-enabled Web application.

Figure 2: Relationship Between a Traditional Web Application and a Voice-Enabled Web Application
To see how VoiceXML works, let's start with a very simple example of a basic menu. Following the architecture shown in Figure 1, a caller dials the telephone number of this simple voice portal. The call is routed to the VoiceXML telephony server. The appropriate VoiceXML page (in this case menu.vxml) is fetched via HTTP from the application (web) server, and interpretation begins.
Example 1: menu.vxml
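A minimal sketch of what menu.vxml might contain, reconstructed from the description in this section rather than copied from an original listing. The file names weather.vxml and news.vxml and the exact prompt wording are assumptions (only sports.vxml is named in the text), and the layout is chosen so that the menu occupies lines 4 through 10.

```xml
<?xml version="1.0"?>
<vxml version="1.0">

  <menu>
    <!-- Assumed targets: only sports.vxml is named in the text -->
    <prompt>Choose from <enumerate/></prompt>
    <choice next="sports.vxml">sports</choice>
    <choice next="weather.vxml">weather</choice>
    <choice next="news.vxml">news</choice>
  </menu>
</vxml>
```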
The first line of Example 1 indicates that it complies with W3C's XML version 1.0. Line 2 is the top-level VoiceXML element, which contains dialogs in the form of <menu>s or <form>s. It also indicates compliance with VoiceXML version 1.0. Lines 4 through 10 contain a menu consisting of a prompt and three choices. The contents of the <choice> elements are used by the VoiceXML interpreter to tell the automatic speech recognition (ASR) engine what to listen for, in this case the words sports, weather, or news. The content is also used to construct a prompt if the <enumerate> element is included. A speech synthesis engine would render the text as audio.
The user interaction would be as follows:
Computer: Choose from sports, weather, news.
Human: Sports.
The VoiceXML interpreter then fetches the file sports.vxml and the process continues.
But what if the user asked for help, didn't say something appropriate, or said nothing at all? VoiceXML has language elements that allow a dialog designer to handle these circumstances. Here's the same menu example embellished to handle "unexpected" responses:
Example 2: menu.vxml (embellished)
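A sketch of how the embellished menu might look, assuming the standard <noinput>, <nomatch>, and <help> event handlers are used. The message wording is taken from the sample interaction in this section; the file names weather.vxml and news.vxml are assumptions (only sports.vxml is named in the text).

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <menu>
    <prompt>Choose from <enumerate/></prompt>
    <!-- Assumed targets: only sports.vxml is named in the text -->
    <choice next="sports.vxml">sports</choice>
    <choice next="weather.vxml">weather</choice>
    <choice next="news.vxml">news</choice>

    <!-- The caller said nothing: explain, then replay the menu prompt -->
    <noinput>You must say something. <reprompt/></noinput>

    <!-- The caller said something outside the three choices -->
    <nomatch>Please speak clearly and try again. <reprompt/></nomatch>

    <!-- The caller asked for help -->
    <help>
      If you would like sports scores, say sports.
      For local weather reports, say weather,
      or for the latest news, say news.
    </help>
  </menu>
</vxml>
```

The <reprompt/> element asks the interpreter to play the menu prompt again after the handler's message, which is why the first two handlers end with the list of choices while the help message does not.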
The user interaction might be:
Computer: Choose from sports, weather, news.
Human: (user says nothing)
Computer: You must say something. Choose from sports, weather, news.
Human: Tbilisi
Computer: Please speak clearly and try again. Choose from sports, weather, news.
Human: Help
Computer: If you would like sports scores, say sports. For local weather reports, say weather, or for the latest news, say news.
Human: Sports
VoiceXML is a powerful yet simple language for building voice dialogs. It leverages web architecture, tools, and technology to enable innovative new telephone applications. Thanks to the standardization efforts of the VoiceXML Forum and the W3C, it is gaining widespread adoption, especially among the 350-plus members of the VoiceXML Forum. New language features in the recently published draft of VoiceXML 2.0, and new call control features currently under development, promise an even richer voice-enabled Web.
Chris "Dr. CT" Bajorek clearly delineates where he stands: "Touchtone access to pre-recorded information powered the first wave of IVR; Voice portals and speech recognition are now powering the second."
Noted industry analyst Peter Davidson provides a solid foundation for understanding voice portals and touches on what the Big 3 Telecom DSP board vendors (Brooktrout, Intel/Dialogic, Natural MicroSystems) are doing in this space.
This article recaps a recent online seminar called "Buy Or Host?: The Business Case For Voice Portals" in which Intel/Dialogic employees Peter Gavalakis and Paul Gibilisco offered an online slideshow and interactive Q&A session to clarify the future of speech-based apps and their technologies.
CT Mag's latest "future of IVR" article is about how IVR vendors are repositioning themselves as companies that don't sell IVR.
It's simple. The dudes in Parsippany see big opportunity for voice portals and their experienced application developers.
Learn how companies are using Brooktrout Software technology to provide voice-enabled Wireless Web solutions. Also hear from technology partner Lernout & Hauspie, which discusses how text-to-speech and speech recognition technologies fit in here.
Free white paper from Brooktrout.
Nuance looks at how voice and the Web are integrating.
Here is an excellent VoiceXML tutorial
The topic that you should concentrate on is a VoiceXML directional navigation application. Basically, you ring a number on your mobile while driving a car and the application directs you to a destination. Should you get lost, you can return to local landmarks by interacting with the VoiceXML application throughout. This application should be quite easy to develop, and it can be as complicated and as detailed as you wish to make it. Come and see me for a more detailed explanation.
To contact Author: Email: [email protected].