VoiceXML Project

Check out Wirelessdevnet.com for the full text of this article.

The speedy growth of the multimedia Internet, combined with new wireless smartphones, is changing the ways people communicate and access information via text or voice messaging. While the devices we use to do both are converging at the desktop and in our pockets, we still have to consider which medium to use for input vs. output.

The traditional telephone heritage has been DTMF input and pre-recorded speech output for Interactive Voice Response (IVR) applications. This is because automatic speech recognition (ASR) and text-to-speech (TTS) technologies were still being developed. But now that these technologies have improved significantly, the use of voice input and output is being heavily promoted for the wireless phone market as a way for consumers to tap into the exploding self-service information and transaction resources available from the World Wide Web.

A voice portal is the interface between a caller and an information source - it's the point of entry for a person using an IVR or speech recognition system. When augmented with VoiceXML, the voice portal can host a much wider variety of information, literally funneling any web-based data from your servers out to callers.

VoiceXML is a technology standard developed and managed by the VoiceXML Forum (www.voicexml.org). It builds upon the work of earlier technologies such as VoxML from Motorola and SpeechML from IBM to create a standardized way to interact with services through a voice interface. The VoiceXML Forum aims to drive the market for voice- and phone-enabled Internet access by promoting a standard specification for VoiceXML (often abbreviated VXML), a markup language used to create Web content and services that can be accessed by phone. The 1.0 release of the VoiceXML Specification can be downloaded from the VoiceXML Forum website.

Why Voice?

Perhaps the first question that may arise is: "Why do we need a markup language for voice commands?" The answer is becoming increasingly obvious as some members of the technology community have expressed their displeasure with textual wireless interfaces such as WAP. Wireless communication devices have the disadvantage of small screens, limited input capabilities, and limited processing power. They have obviously been huge successes as voice communication conduits; however, it remains to be seen how the public will accept them as data delivery vehicles.

One alternative to the textual interface offered by technologies such as WAP is what was originally known as an IVR, or Interactive Voice Response, system. Historically, these systems have been very proprietary and therefore unsuitable for allowing access to Web-based content. VoiceXML allows you to define a "tree" of voice dialogs that steps the user through a selection process. The user interacts with these voice dialogs through the oldest interface known to mankind: the voice! Powerful speech recognition software resides on the server to convert the user's spoken selection (e.g. "Yes" or "No") into a textual selection. This process is akin to selecting a hyperlink on a traditional Web page. Dialog selections result in the playback of audio responses (either prerecorded or dynamically generated using server-side text-to-speech conversion).
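The yes/no selection step described above might look roughly like the following in VoiceXML 1.0. This is only a sketch: the form and field names and the target document mail.vxml are assumptions made for illustration, not part of the original article.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="confirm_mail">
    <!-- type="boolean" asks the platform's built-in grammar to accept yes or no -->
    <field name="wants_mail" type="boolean">
      <prompt>Would you like to check your mail?</prompt>
      <filled>
        <!-- By this point the recognizer has turned the spoken answer into true or false -->
        <if cond="wants_mail">
          <!-- mail.vxml is a hypothetical document name -->
          <goto next="mail.vxml"/>
        <else/>
          <prompt>Okay, goodbye.</prompt>
          <exit/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

The branch inside the filled handler plays the role that a hyperlink click plays on a Web page: the recognized answer decides which dialog comes next.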

From a business viewpoint, voice applications open up a host of new revenue opportunities. Perhaps the most obvious revenue opportunity comes from the increased number of minutes we will all be spending on our wireless phones. In addition, advertising will become as commonplace through these services as it currently is on traditional media (Web, TV, radio, etc.). As voice services are added to your traditional carrier plan, there will clearly be a market for pay-as-you-go premium services (information lookups, email, contact databases, etc.). It's not hard to imagine most consumers opting to listen to a 15-second ad in exchange for free access to these premium services! Because VoiceXML is XML-based, it is yet another technology driving the move towards content distribution and management in XML. See the article on this topic at the Wireless Developer Network (where most of this page was taken from). Within two years, it is very likely that content providers will offer both WAP- and voice-accessible sites for their wireless customers. By that point, a manageable architecture using XML will clearly be required.

VoiceXML Applications

VoiceXML is an XML application that defines a tree-like structure that the user can traverse using voice commands. The VoiceXML Document Type Definition (DTD) is the document that defines the "grammar" of a valid VoiceXML application. An integral component of every VoiceXML application is the text-to-speech and speech-to-text processing engine that runs on the server. These products are available from a variety of vendors including IBM, Motorola, and SpeechWorks. Readers familiar with WML will find VoiceXML vaguely familiar as well, because both are XML-based markup languages that define a group of elements enabling a user to traverse information. For instance, common tags supported by VoiceXML include <form>, <var>, and <menu>. All VoiceXML "documents" begin and end with the <vxml> tag. Before diving in, we'd be remiss if we didn't at least kick things off with a simple "Hello World!" example.

<?xml version='1.0'?>
<vxml version="1.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

As you can see, this is a very simple example that uses the <block> element to present some text to the user. VoiceXML defines two types of dialogs that the application uses to interface with the user: forms and menus. A form is used to present information to the user or to retrieve information from the user. A menu is a specialized form that forces the user to choose a specific option and then branches based on the option that was chosen. A form generally includes a directive that tells the application where to jump based on the user's input. A menu uses multiple <choice> elements to define where to transition based on the user's selection. The following example prompts the user to select from a variety of options and jumps to a different VoiceXML dialog based on the selected option.

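[Menu example: a prompt followed by two choices, "Mail" and "Operator". The surrounding markup is missing from this copy.]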

The choice options within the menu call additional VoiceXML "routines" that define further tasks. A catch-all handler element (in VoiceXML, <nomatch>) handles every response that is neither a "Mail" nor an "Operator" command. In this case, it includes a command that simply orders the server to replay the items in the prompt list.
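A menu of the kind just described could be sketched as follows in VoiceXML 1.0. This is an illustrative reconstruction under assumptions, not the article's original listing: the target documents mail.vxml and operator.vxml are hypothetical, and the use of <nomatch> with <reprompt/> is one plausible way to get the "replay the prompt" behavior.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <menu>
    <!-- enumerate reads the choice labels back to the caller -->
    <prompt>Say one of: <enumerate/></prompt>
    <!-- The next targets below are placeholders for illustration only -->
    <choice next="mail.vxml">Mail</choice>
    <choice next="operator.vxml">Operator</choice>
    <!-- Anything that is neither Mail nor Operator lands here -->
    <nomatch>
      <reprompt/>
    </nomatch>
  </menu>
</vxml>
```

Each choice pairs a spoken phrase with a destination, so adding a menu option is a one-line change.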

By no means is this a comprehensive tutorial, but it should give you a feel for the way in which voice commands are defined and processed on the server. Any real application of this sort would have to factor in scalability and server load, given the processing power required to do speech-to-text conversion continually. In addition, traditional server development work (database and messaging system access, spatial queries, etc.) would be needed to back the prompt choices.
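To illustrate how a dialog might hand its result to server-side code, here is a minimal form sketch in VoiceXML 1.0. The grammar file tickers.gram, the field name, and the URL http://www.example.com/quote are all assumptions made for the example. The server script behind that URL would be ordinary Web development and is not shown.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="stock_lookup">
    <field name="ticker">
      <prompt>Which stock symbol would you like a quote for?</prompt>
      <!-- tickers.gram is a hypothetical grammar file listing the accepted symbols -->
      <grammar src="tickers.gram"/>
    </field>
    <block>
      <!-- Once the field is filled, send its value to a hypothetical server script -->
      <submit next="http://www.example.com/quote" namelist="ticker"/>
    </block>
  </form>
</vxml>
```

The server would be expected to answer with another VoiceXML document, for instance one that reads the quote aloud, much as a Web server answers a form post with a new HTML page.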

Potential Applications

The potential applications of this technology abound. Currently, voice "portals" such as TellMe, Quack, and BeVocal are springing up to offer voice access to stock quotes, movie and restaurant listings, and daily news. Just as traditional Web portals redefined themselves over time into personalized services, these voice portals will eventually offer access to your email (to both check and send emails using your voice!). Ironically, voice technologies are over a century ahead of the Web in one area: instant messaging (isn't that what Alexander Graham Bell's original phone call basically amounted to?).

These voice-related technologies will also be among the first location-based services to appear on the market because of the mobile nature of the end user. For instance, currently the TellMe portal can automatically retrieve movie listings for the nearest theater based on the number you are calling from. For the majority of location-based applications, this type of service is accurate enough. In the future, one could imagine integration with the FCC-mandated E911 positioning or even a GPS in the handset for driving directions that actually talk you through the twists and turns required to reach a destination.
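A lookup keyed on the caller's number could be sketched along these lines. This assumes a platform that exposes the VoiceXML 1.0 session variable session.telephone.ani (later versions of the specification name this differently). The listings URL and variable name are hypothetical, and this is not a description of how TellMe implements its service.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="nearest_theater">
    <!-- Copy the caller's number (ANI) from the session into a form variable -->
    <var name="caller_number" expr="session.telephone.ani"/>
    <block>
      <prompt>Looking up movie listings near you.</prompt>
      <!-- A hypothetical server script maps the number to an area and returns listings -->
      <submit next="http://www.example.com/listings" namelist="caller_number"/>
    </block>
  </form>
</vxml>
```

Mapping a phone number to a location is coarse, and for a mobile caller it reflects where the number is registered rather than where the caller currently is, which is why handset positioning would be the natural next step.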

Benefits of Voice Processing

Separate from the discussion of VoiceXML is a look at the benefits of voice processing technologies in general. Despite the advent of technologies such as WAP, the fact remains that accessing textual content over a small phone display is difficult and, in some applications, rather unnatural. Once any amount of data entry is added, the phone quickly becomes an impractical interface. Voice technologies, on the other hand, take advantage of the very interface that phones were designed to serve and will undoubtedly be accepted more readily by the general public. VoiceXML, specifically, is a well-structured, uniform way to build logic trees that customers can use to access the information of interest to them. Look for tools and services based on VoiceXML to become increasingly popular in the near future.

Perhaps the biggest disadvantage of voice-based technologies is the rigid structure that they impose on the end user. While a textual interface (e.g. WML) can support popular tools such as search engines and online browsing of catalogs or information, voice technologies are much better at delivering a specific, pinpointed bit of information to an end user (e.g. a stock quote, a movie time, or a restaurant location). In addition, while convenient, a voice portal such as TellMe can be maddeningly slow when you are forced to drill through several layers of options before finding exactly what you want. One interesting combination of the textual interface and the voice interface is the tool known as a "voice browser" (for a demo of a popular voice browser, visit Conversa). A voice browser allows the user to "speak" links to quickly traverse textual content, which may be a great compromise, particularly in automotive or other hands-free applications.

Other Resources

VoiceXML Forum
IBM alphaWorks - VoiceXML
Motorola MIX
SpeechWorks
Conversa
Lernout and Hauspie

Task

Visit a voice portal such as VoxBuilder, TellMe, or BeVocal.

The topic that you should concentrate on is a VoiceXML directional navigation application. Basically, you ring a number on your mobile while driving a car and the application directs you to a destination. Should you get lost, you can return to local landmarks by interacting with the VoiceXML application along the way. A basic version of this application is quite easy to develop, and it can be as complicated and as detailed as you wish to make it. Come and see me for a more detailed explanation.


To contact Author: Email: [email protected].