There's been an industry shift from proprietary approaches to developing speech-enabled applications toward strategies and architectures based on industry standards. Standards-based approaches offer developers of speech software a number of advantages: they make applications portable, leverage existing Web infrastructure, promote interoperability among speech vendors, increase developer productivity (knowledge of a speech vendor's low-level API and resource management is not required), and easily accommodate application types such as multimodal applications. Multimodal applications can overcome some of the limitations of a single-mode application (GUI or voice), enhancing the user's experience by allowing interaction through multiple modes (speech, pen, keyboard, etc.) within a session, depending on the user's context.
VoiceXML, Call Control eXtensible Markup Language (CCXML), and Speech Application Language Tags (SALT) are emerging XML specifications from standards bodies and industry consortia that are directed at supporting telephony and speech-enabled applications. The purpose of this article is to present an overview of VoiceXML, CCXML, and SALT and their architectural roles in developing telephony as well as speech-enabled and multimodal applications.
Before I discuss VoiceXML, CCXML, and SALT in detail, let's consider a possible architectural deployment that employs these specifications. At a high level, there are two main architectural components: a document server and a speech/telephony platform. Each interfaces with a number of secondary servers, such as an Automated Speech Recognition (ASR) server, a Text-to-Speech (TTS) server, and data stores.
In this architecture a document server generates the documents in response to requests from the speech/telephony platform. The document server leverages a Web application infrastructure to interface with back-end data stores (message stores, user profile databases, content servers) to generate VoiceXML, CCXML, and SALT documents. Typically, the overall Web application infrastructure separates the core service logic (the business logic) from the presentation details (VoiceXML, CCXML, SALT, HTML, WML) to provide a more extensible application architecture. The application infrastructure is also responsible for maintaining application dialog state in a form that's separate from a particular presentation language mechanism.
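To make the separation of service logic from presentation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular product: the function names (get_account_summary, render_vxml, render_html), the in-memory user store, and the prompt wording are all hypothetical. The same business-logic result is handed to two independent renderers, one producing a VoiceXML 2.0 document and one producing an HTML fragment.

```python
import xml.etree.ElementTree as ET
from html import escape


def get_account_summary(user_id: str) -> dict:
    """Business logic (hypothetical): knows nothing about presentation."""
    # Assumption: a real deployment would query a back-end data store here.
    fake_store = {"alice": {"name": "Alice", "new_messages": 3}}
    return fake_store[user_id]


def summary_sentence(summary: dict) -> str:
    """Presentation-neutral wording shared by both renderers."""
    return f"Hello {summary['name']}. You have {summary['new_messages']} new messages."


def render_vxml(summary: dict) -> str:
    """Voice presentation: wrap the sentence in a VoiceXML 2.0 form."""
    vxml = ET.Element(
        "vxml", {"version": "2.0", "xmlns": "http://www.w3.org/2001/vxml"}
    )
    form = ET.SubElement(vxml, "form", {"id": "summary"})
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    prompt.text = summary_sentence(summary)  # ElementTree escapes the text
    return ET.tostring(vxml, encoding="unicode")


def render_html(summary: dict) -> str:
    """Visual presentation: the same data as an HTML paragraph."""
    return f"<p>{escape(summary_sentence(summary))}</p>"


summary = get_account_summary("alice")
print(render_vxml(summary))
print(render_html(summary))
```

Because neither renderer touches the data store, supporting another presentation language (WML, for instance) would mean adding one more rendering function rather than changing the service logic.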
To process incoming calls, the speech/telephony platform requests documents from the document server using HTTP. A VoiceXML or CCXML browser that resides on the platform interprets the VoiceXML and CCXML documents to interact with users on a phone. Typically, the platform interfaces with the PSTN (Public Switched Telephone Network) and media servers (ASR, TTS) and provides VoIP (SIP, H.323) support. An ASR server accepts speech input from the user, uses a grammar to recognize words from the user's speech, and generates a textual equivalent that the platform uses to decide the next action to take, depending on the script. A TTS server accepts markup text and generates synthesized speech for presentation to a user. In this deployment, a SALT browser on a mobile device interprets SALT documents. Figure 1 is a diagram illustrating such an architecture.
[FIGURE 1 OMITTED]
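The request-and-interpret cycle on the platform side can be sketched in a few lines of Python. This is a deliberately simplified model, not how any real VoiceXML browser is implemented: the document server is replaced by an in-memory dictionary, the URLs under docs.example are made up, and the names fetch and run_call are hypothetical. The loop fetches a document, collects the prompt text that a real platform would pass to a TTS server, and follows a goto transition to the next document until none remains.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional

VXML_NS = "{http://www.w3.org/2001/vxml}"

# Stand-in for the document server: URL -> VoiceXML document (hypothetical).
DOCUMENTS = {
    "http://docs.example/welcome.vxml": (
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">'
        "<form><block><prompt>Welcome.</prompt>"
        '<goto next="http://docs.example/menu.vxml"/></block></form></vxml>'
    ),
    "http://docs.example/menu.vxml": (
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">'
        "<form><block><prompt>Main menu.</prompt></block></form></vxml>"
    ),
}


def fetch(url: str) -> str:
    """Pretend HTTP GET; a real platform would send the request over the network."""
    return DOCUMENTS[url]


def run_call(start_url: str, fetch_doc: Callable[[str], str] = fetch) -> List[str]:
    """Fetch and interpret documents until no further transition is found."""
    spoken: List[str] = []
    url: Optional[str] = start_url
    while url is not None:
        root = ET.fromstring(fetch_doc(url))
        # Prompt text is what the platform would hand to the TTS server.
        for prompt in root.iter(VXML_NS + "prompt"):
            spoken.append(prompt.text or "")
        # Follow the first <goto next="..."> if the document has one.
        goto = root.find(".//" + VXML_NS + "goto")
        url = goto.get("next") if goto is not None else None
    return spoken


print(run_call("http://docs.example/welcome.vxml"))
```

Passing the fetch function as a parameter keeps the interpreter loop independent of the transport, so the in-memory dictionary could be swapped for a real HTTP client without touching the loop itself.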
VoiceXML
Now that you have an overall understanding of the architecture in which these specifications can be used, let's begin by discussing VoiceXML. VoiceXML can be viewed as another presentation language in your architecture, alongside HTML and WML. It is a dialog-based XML language that leverages the Web development paradigm for developing interactive voice applications for devices such as phones and cell phones. It's a self-contained presentation language designed to accept user input in the form of DTMF (touch tones produced by a phone) and speech, and to...