VoiceXML Project

Kevin Curran, Computer Lecturer - Magee College

What is VoiceXML?

written by Kenneth G. Rehor

Introduction

VoiceXML is a language for creating voice-user interfaces, particularly for the telephone. It uses speech recognition and touchtone (DTMF keypad) for input, and pre-recorded audio and text-to-speech synthesis (TTS) for output. It is based on the World Wide Web Consortium's (W3C's) Extensible Markup Language (XML), and leverages the web paradigm for application development and deployment. With a common language, application developers, platform vendors, and tool providers can all benefit from code portability and reuse.

With VoiceXML, speech recognition application development is greatly simplified by using familiar web infrastructure, including tools and Web servers. Instead of using a PC with a Web browser, any telephone can access VoiceXML applications via a VoiceXML "interpreter" (also known as a "browser") running on a telephony server. Whereas HTML is commonly used for creating graphical Web applications, VoiceXML can be used for voice-enabled Web applications.

There are two schools of thought regarding the use of VoiceXML:

As a way to voice-enable a Web site, or
As an open-architecture solution for building next-generation interactive voice response telephone services.

One popular type of application is the voice portal, a telephone service where callers dial a phone number to retrieve information such as stock quotes, sports scores, and weather reports. Voice portals have received considerable attention lately, and demonstrate the power of speech recognition-based telephone services. These, however, are certainly not the only application for VoiceXML. Other application areas, including voice-enabled intranets and contact centers, notification services, and innovative telephony services, can all be built with VoiceXML.

By separating application logic (running on a standard Web server) from the voice dialogs (running on a telephony server), VoiceXML and the voice-enabled Web allow for a new business model for telephony applications known as the Voice Service Provider. This permits developers to build phone services without having to buy or run equipment.

While VoiceXML was originally designed for building telephone services, other applications, such as speech-controlled home appliances, are starting to be developed.

VoiceXML Features

The rapid growth of the Web was due largely to its open architecture and high-level common interfaces to differing computing resources. HTML and HTTP hide much of the complexity of building interactive applications. Just as an HTML developer doesn't need to know how bits paint the screen of a web user's PC, VoiceXML shields developers from many of the complexities of telephony platforms.

VoiceXML has features to control audio output; audio input; presentation logic and control flow; event handling; and basic telephony connections. These and other features are described as follows:

Dialogs <menu>, <form>
Audio Output <prompt>
- Speech synthesis controls (text-to-speech, or TTS) <emp>, <pros>, etc.
- Pre-recorded audio (files or streams) <audio>
Audio Input
- Speech recognition (ASR)
- Audio recording <record>
- Touchtone (Dual-tone Multi-Frequency, or DTMF) <dtmf>
Presentation logic
- Control flow <if>, <else>, etc.
- ECMAScript client-side scripting <script>
- Server-side/dynamic content generation <submit>
Event handling
- Bad input <noinput>, <nomatch>
- Shorthand <help>
- <catch>, <throw>
Basic Connection Control
- Call transfer and bridging <transfer>
- Disconnect <disconnect>
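
Several of these features can be seen working together in a single form. The following is a minimal sketch, not taken from the original article: the field name zip and the URL http://www.example.com/forecast are hypothetical, and it assumes an interpreter that supports the built-in digits field type described in the VoiceXML 1.0 specification.

```xml
<?xml version="1.0"?>
<vxml version="1.0">

    <form id="forecast">
        <!-- type="digits" is a built-in grammar: the caller may speak
             the digits or key them in by touchtone -->
        <field name="zip" type="digits">
            <prompt>Say or key in your five digit zip code.</prompt>

            <!-- Event handling for silence, unrecognized input, and help -->
            <noinput>I did not hear anything. <reprompt/></noinput>
            <nomatch>I did not understand. <reprompt/></nomatch>
            <help>Enter the zip code of the area whose forecast you want.</help>

            <!-- Presentation logic: runs once the field has a value -->
            <filled>
                <if cond="zip.length != 5">
                    <prompt>A zip code has five digits.</prompt>
                    <!-- Reset the field so the caller is asked again -->
                    <clear namelist="zip"/>
                <else/>
                    <!-- Hypothetical URL: hands the value to the application server -->
                    <submit next="http://www.example.com/forecast" namelist="zip"/>
                </if>
            </filled>
        </field>
    </form>

</vxml>
```

The form only collects and checks the input. Looking up the forecast is left to the application server, which would answer the <submit> with another VoiceXML page.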

Beyond the scope of the language are application logic, state management, dialog generation and sequencing, database operations, and interfaces to legacy systems (e.g., "screen scraping"). These are handled by traditional Web application programming techniques.

Architecture

A VoiceXML application consists of several components, as shown in Figure 1:

Application Server: Typically a Web server, which runs the application logic, and may contain a database or interfaces to an external database or transaction server.
VoiceXML Telephony Server: A platform that runs a VoiceXML interpreter that acts as a client to the application server. The interpreter understands VoiceXML dialogs and controls speech and telephony resources. These resources include ASR, TTS, audio play and record functions, as well as a telephone network interface.
Internet-style network: A TCP/IP-based packet network that connects the application server and telephony server via HTTP.
Telephone Network: Typically the Public Switched Telephone Network (PSTN), but could be a private telephone network (e.g., a PBX), or a VoIP packet network.
Caller: Any telephone that can connect to the telephone network.

Figure 1: Components of a VoiceXML Application

Figure 2 shows the relationship between a traditional Web application, and a voice-enabled Web application.

Figure 2: Relationship Between a Traditional Web Application and a Voice-Enabled Web Application

A Basic Menu Example

To see how VoiceXML works, let's start with a very simple example of a basic menu. Following the architecture shown in Figure 1, a caller dials the telephone number of this simple voice portal. The call is routed to the VoiceXML telephony server. The appropriate VoiceXML page (in this case menu.vxml) is fetched via HTTP from the application (web) server, and interpretation begins.

Example 1: menu.vxml
 
 1    <?xml version="1.0"?>
 2    <vxml version="1.0">
 3
 4        <menu>
 5            <prompt> Choose from <enumerate/></prompt>
 6
 7            <choice next="sports.vxml"> sports </choice>
 8            <choice next="weather.vxml"> weather </choice>
 9            <choice next="news.vxml"> news </choice>
10        </menu>
11
12    </vxml>

The first line of Example 1 indicates that the document complies with W3C's XML version 1.0. Line 2 is the top-level VoiceXML element, which contains the dialogs (<menu> or <form> elements); its version attribute indicates compliance with VoiceXML version 1.0. Lines 4 through 10 contain a menu consisting of a prompt and three choices. The contents of the <choice> elements are used by the VoiceXML interpreter to tell the ASR engine what to listen for, in this case the words sports, weather, or news. The same contents are also used to construct a prompt if the <enumerate> element is included. A speech synthesis engine would render the text as audio.

The user interaction would be as follows:

Computer: Choose from sports, weather, news.
Human: Sports.

The VoiceXML interpreter then fetches the file sports.vxml and the process continues.
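
The same menu can also be offered by touchtone. The following is a minimal sketch rather than part of the original example; it assumes the interpreter supports the dtmf attribute of <menu> and the _prompt and _dtmf variables of <enumerate> as described in the VoiceXML 1.0 specification. The file names are carried over from Example 1.

```xml
<?xml version="1.0"?>
<vxml version="1.0">

    <!-- dtmf="true" asks the interpreter to assign the keys 1, 2, 3
         to the choices in document order -->
    <menu dtmf="true">
        <prompt>
            <!-- The template is repeated once per choice: _prompt holds
                 the choice's text, _dtmf the key assigned to it -->
            <enumerate>
                For <value expr="_prompt"/>, press <value expr="_dtmf"/>.
            </enumerate>
        </prompt>

        <choice next="sports.vxml"> sports </choice>
        <choice next="weather.vxml"> weather </choice>
        <choice next="news.vxml"> news </choice>
    </menu>

</vxml>
```

Under these assumptions the caller would hear something like "For sports, press 1. For weather, press 2. For news, press 3." and could either press a key or say one of the words.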

But what if the user asked for help, didn't say something appropriate, or said nothing at all? VoiceXML has language elements that allow a dialog designer to handle these circumstances. Here's the same menu example embellished to handle "unexpected" responses:

Example 2: menu.vxml (embellished)
 
    <?xml version="1.0"?>
    <vxml version="1.0">

        <menu>
            <prompt> Choose from <enumerate/></prompt>

            <choice next="sports.vxml"> sports </choice>
            <choice next="weather.vxml"> weather </choice>
            <choice next="news.vxml"> news </choice>

            <!-- Played when the caller asks for help -->
            <help>
                If you would like sports scores, say sports.
                For local weather reports, say weather, or
                for the latest news, say news.
            </help>

            <!-- Played when the caller says nothing -->
            <noinput>You must say something.</noinput>

            <!-- Played when the caller says something unrecognized -->
            <nomatch>Please speak clearly and try again.</nomatch>

        </menu>

    </vxml>

The user interaction might be:

Computer: Choose from sports, weather, news.
Human: (user says nothing)
Computer: You must say something. Choose from sports, weather, news.
Human: Tbilisi
Computer: Please speak clearly and try again. Choose from sports, weather, news.
Human: Help
Computer: If you would like sports scores, say sports. For local weather reports, say weather, or for the latest news, say news.
Human: Sports
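
Handlers can also be made to escalate when the same problem keeps recurring. The following is a minimal sketch, not from the original article, assuming the count attribute of event handlers as described in the VoiceXML 1.0 specification; the wording of the prompts is illustrative only.

```xml
<?xml version="1.0"?>
<vxml version="1.0">

    <menu>
        <prompt> Choose from <enumerate/></prompt>

        <choice next="sports.vxml"> sports </choice>
        <choice next="weather.vxml"> weather </choice>
        <choice next="news.vxml"> news </choice>

        <!-- First silence: a short nudge, then the menu prompt again -->
        <noinput count="1">
            You must say something. <reprompt/>
        </noinput>

        <!-- Second silence: spell out the options -->
        <noinput count="2">
            Please say sports, weather, or news. <reprompt/>
        </noinput>

        <!-- Third and later silences: give up and end the call -->
        <noinput count="3">
            <disconnect/>
        </noinput>
    </menu>

</vxml>
```

The interpreter counts how often each event has occurred in the dialog and picks the handler with the highest count that does not exceed it, so the third handler also covers any later silences.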

Summary

VoiceXML is a powerful yet simple language for building voice dialogs. It leverages web architecture, tools, and technology to enable innovative new telephone applications. Thanks to the standardization efforts of the VoiceXML Forum and the W3C, it is gaining widespread adoption, especially among the 350-plus members of the VoiceXML Forum. New language features in the recently published draft of VoiceXML 2.0, and new call control features currently under development, promise an even richer voice-enabled Web.

Other Resources

VoiceXML Forum
IBM alphaWorks - VoiceXML
Motorola MIX
SpeechWorks
Conversa
Lernout and Hauspie

Recommended next step for students contemplating a thesis in this area

Visit a voice portal such as VoxBuilder, TellMe, or BeVocal. I recommend VoxBuilder.

The topic that you should concentrate on is a VoiceXML directional navigation application. Basically, you ring a number on your mobile while driving a car and the application directs you to a destination. Should you get lost, you can return to local landmarks by interacting with the VoiceXML application throughout. A basic version of this application is quite easy to develop, and it can be as complicated and as detailed as you wish to make it. Come and see me for a more detailed explanation.
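
As a possible starting point, the first dialog of such an application might simply ask for a landmark and pass it to the server. The following is only a sketch under stated assumptions: the landmark names, the value strings, and the URL http://www.example.com/directions are all hypothetical, and working out the route is left entirely to the application server.

```xml
<?xml version="1.0"?>
<vxml version="1.0">

    <form id="navigate">
        <field name="destination">
            <prompt>Where would you like to go?</prompt>

            <!-- Each option is something the caller can say; value is
                 the string stored in the field (hypothetical landmarks) -->
            <option value="city_hall">city hall</option>
            <option value="train_station">train station</option>
            <option value="airport">airport</option>

            <help>Say the name of a landmark, for example, train station.</help>
            <nomatch>Sorry, I do not know that place. <reprompt/></nomatch>
        </field>

        <!-- Runs after the field is filled: hand the choice to the
             application server (hypothetical URL) -->
        <block>
            <submit next="http://www.example.com/directions" namelist="destination"/>
        </block>
    </form>

</vxml>
```

The server would reply with a further VoiceXML page that reads out the next direction, and the same pattern could be repeated for each step of the journey.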


To contact Author: Email: [email protected].