Enhancing Web Accessibility Via the Vox Portal and a Web Hosted Dynamic HTML<->VoxML Converter
Stuart Goose, Mike Newman
Multimedia Department, Siemens Corporate Research
755 College Road East, Princeton, NJ 08540, USA
Information and Communication Networks, Siemens AG
Schertlinstrasse 8, D-81379 Munich, Germany
Information and Telecommunication Research, Siemens AG
3 rue Blaise Pascal, 22300 Lannion, France
Interactive voice browsers offer an alternative paradigm that enables both sighted and visually impaired users to access the WWW. In addition to the desktop PC, voice browsers afford ubiquitous mobile access to the WWW using a wide range of consumer devices. This technology can facilitate a safe, "hands-free" browsing environment which is of importance both to car drivers and various mobile and technical professionals. By providing voice-mediated access, information providers can reach a wider audience and leverage existing investment in their WWW content. In this paper we describe the Vox Portal, a scalable VoxML client, and a WWW server hosted dynamic HTML<->VoxML converter.
Keywords: Structure, browsing, accessibility, speech, VoxML.
1. Introduction and Motivation
The World Wide Web (WWW) has enjoyed phenomenal growth over recent years and now accounts for a significant proportion of all Internet traffic. The unmitigated success of the WWW bears testimony to the previously unsatisfied need for a system able to integrate and deliver distributed information. The profile of hypermedia has been raised significantly by the WWW, which has endorsed hypermedia as an appropriate technology for accessing and navigating information spaces. Users can access a wealth of information and associated services over the WWW, ranging from international news to local restaurant menus.
Interactive voice browsers that make extensive use of speech synthesis and recognition offer an alternative paradigm that enables both sighted and visually impaired users to access the WWW. This technology can facilitate a safe, "hands-free" browsing environment which is of importance both to car drivers and various mobile and technical professionals. By providing voice mediated access, information providers can reach a wider audience and leverage existing investment in their WWW content.
The interest in ubiquitous computing has escalated in recent times. Ubiquitous, or pervasive, computing is the attempt to break away from the traditional desktop interaction paradigm by distributing computational power and resources into devices in the environment surrounding the user. The primary aim of the Vox Portal is to support ubiquitous voice driven access to multiple information services from a range of devices. This technology has the potential to increase the global user community exponentially. Subscribers will enjoy greater convenience and flexibility by being able to access dynamic information at any time from anywhere.
VoxML is a new standard markup language for specifying the dialogs of interactive voice response applications that feature speech synthesis and recognition technologies. The Vox Portal is a carrier class VoxML client. Rather than expecting WWW publishers to translate their HTML content and services to VoxML, we have built a dynamic HTML<->VoxML converter that bridges the void between the voice and WWW domains. HTML forms are translated to VoxML, thus enabling the interactive input of data into the respective fields. Because the completed form data must be submitted to the originating WWW server, the converter has to be bi-directional. Voice driven interactive support for HTML form input affords the user mobile access to a variety of compelling e-commerce WWW services such as financial services, on-line purchasing, arranging travel, route finding, etc.
Described in this paper is the Vox Portal that uses a WWW server hosted dynamic HTML<->VoxML converter. A survey of the related work is discussed in section 2. A justification of the importance of document structure is outlined in section 3. An overview of the system architecture and operation is provided in section 4. Section 5 proposes areas for further research and provides some concluding remarks.
2. Related Work
Making extensive use of earcons [7, 6], Albers et al. suggest a variety of acoustic embellishments to a standard WWW browser for conveying and reaffirming to the user the dynamic behavior of the browser.
Much hypermedia research has focused on the seamless integration of media within a unified framework. Due to the application scenarios and the delivery devices targeted, our emphasis is exclusively on the audio medium. In comparison with its visual counterpart, little work has been conducted on interactive audio-only hypermedia systems. The Hyperspeech system was the first to demonstrate such an approach. Arons manually transcribed several recorded interviews, analyzed their structure and generated corresponding audio nodes and links. Unlike our system, Hyperspeech requires that documents be pre-recorded in audio prior to use.
To access computer-mediated information, blind people until recently relied largely upon Braille output devices and software known as screen readers. A screen reading program applies various techniques to gain access to the textual content of application software and employs speech synthesis technology to speak this information to the user. Although far from perfect, screen readers provided visually impaired people with a tool for hearing the content of the screen until graphical user interfaces (GUIs) became commonplace. The advent of the GUI made the task of screen reading more complex, thus inspiring research into GUIs for the blind [17, 18]. Petrie et al. have conducted preliminary evaluations on input and output schemes to identify favorable hypermedia system interfaces for blind users. As screen readers are independent of the application software, they can also be used to read the text displayed within a visual WWW browser. In this case the screen reader extracts only the text, as it is not concerned with, or aware of, the underlying HTML. As a result, the speech output generated communicates the raw content to the listener but fails to impart any information regarding the structure of the document. The importance of document structure as an aid to understanding, orientation and navigation is elaborated upon in section 3.
Several researchers have since attempted to address this shortcoming. Raman has integrated the Emacspeak system with Emacs W3 to create a browser that intersperses appropriate spoken descriptions of the HTML document structure along with content. Asakawa et al. explain how Home Page Reader obtains its data from the Netscape browser to achieve a similar result. Also designed specifically for visually impaired computer users, the pwWebSpeak browser parses HTML documents in order to augment the audio rendering with structural descriptions. Wynblatt et al. selected the car radio as the interface metaphor for their voice browser (WIRE), which provides drivers with access to email and the WWW. With the exception of WIRE, these systems are for sedentary computer users, although pwWebSpeak has support for single user telephone access.
Goose et al. describe a proxy-based interactive service (DICE) into which multiple users can simultaneously dial and use touch tones or voice commands to browse dynamically generated audio renditions of both email and WWW documents. The DICE audio renderings are also imbued with rich structural descriptions of the document. Web-On-Call offers telephone access to WWW sites rendered using audio, but this solution requires documents to be specially prepared on the server side. Since only a small proportion of sites offer this service, this is not a generic solution for telephony WWW browsing. WebGalaxy supports natural language queries and navigation of the WWW. Users can formulate rich and flexible queries, but the domains currently supported are limited to weather, air travel and tourism in Boston.
Goose et al. also report a voice browser that judiciously applies three-dimensional (3D) audio technology. A new 3D conceptual model of the HTML document structure and exploitation of the "cocktail party effect" facilitate a variety of novel features for improving comprehension of structure, orientation and navigation. Unfortunately, current telephone devices preclude the use of 3D audio technology.
Tim Berners-Lee is quoted as saying "The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect." In early 1997 the Web Accessibility Initiative (WAI) was introduced by the W3C to promote this theme through publishing guidelines and developing associated tools. A voice-centric extension to HTML has also been proposed: Aural Cascading Style Sheets (ACSS). A Voice Browser Working Group was also initiated in 1998 to reach a consensus on the appropriate markup language requirements for speech technology. Work on extending accessibility principles to Java is being pursued by Sun and IBM with Java Accessibility and by Microsoft with Active Accessibility. Lazzaro provides a good review of accessibility issues, products and related organizations.
This literature review confirms that much progress has been made in the area of voice browsing, but to the authors' knowledge the Vox Portal is the first standards-based (VoxML) carrier class voice browsing system with a generic solution for dynamic HTML<->VoxML conversion to be reported. The conversion service handles HTML documents through to version 4 and includes full support for sophisticated features, such as the description of and interactive input to HTML forms, tables and the linearization of arbitrary depth HTML framesets.
3. Deriving Structure From Documents
The native document description language of the WWW is called Hypertext Markup Language (HTML). At a quick glance, a sighted user can assimilate the document structure of a richly graphical HTML page as rendered by a visual WWW browser. This is possible as much of the context is conveyed implicitly through the document structure and layout of the information. A user can then apply their understanding of the HTML document structure to aid orientation, navigation, and ultimately, the location of relevant information.
Given that structure is obviously a key aid to the comprehension of a visual document, it is of paramount importance to convey this to the user of a voice browser. It is clear that most of the context would be lost if a document were "rendered" by simply sending the raw text of a document to a text-to-speech synthesizer. The Siemens WIRE and DICE voice browsers and the HTML<->VoxML converter all apply an analytical algorithm to an HTML document, or frameset, to elicit both the structure and context. This analysis also distinguishes between "content sections" and "navigation sections" based on a link density metric calculation. To a sighted user the difference is clear visually; this technique informs the listener as to whether the section is mainly a menu of links or contains mostly text.
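The paper does not give the formula for the link density metric, so the following Python sketch is only one plausible reading: it assumes density is the share of non-whitespace text characters that fall inside anchor elements, and it assumes a threshold of 0.5 separates the two section types. The names link_density and classify_section are illustrative and do not come from the converter.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # greater than 0 while inside an <a> element
        self.link_chars = 0    # non-whitespace characters of anchor text
        self.total_chars = 0   # non-whitespace characters of all text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        count = len("".join(data.split()))  # ignore whitespace
        self.total_chars += count
        if self.anchor_depth > 0:
            self.link_chars += count


def link_density(html_fragment):
    """Assumed metric: anchor text characters / all text characters."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


def classify_section(html_fragment, threshold=0.5):
    """Label a section; the 0.5 threshold is an assumption of this sketch."""
    if link_density(html_fragment) >= threshold:
        return "navigation"
    return "content"
```

Under these assumptions a menu made up entirely of links has a density of 1.0 and is announced as a navigation section, whereas a paragraph containing a single short link falls well below the threshold and is announced as content.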
Although intended to represent document structure, HTML has also evolved to include constructs for visual specifications. Consequently, no clear distinction exists between the document structure and its presentation view. Many authors strive to design aesthetic and intuitive graphical HTML pages. In order to achieve this goal some authors purposefully select alternative HTML constructs to fashion a custom graphical view of the structure of the page, as opposed to employing the HTML constructs originally designated for specifying the logical structure. One typical example of this is the selection of a large font to customize the appearance of a section heading in favor of the standard HTML header construct. While this practice is entirely legitimate, algorithms that analyze the HTML document structure must attempt to identify such behavior to determine the author's logical intent. Such problems have been documented together with recommendations and guidelines for publishers [23, 19, 11, 26].
Once the document has been analyzed, the converter generates VoxML output which is forwarded to the Vox Portal. An audio rendering is then produced which combines the use of descriptions, earcons [7, 6] and the features of a speech synthesis engine, such as multiple voices, prosody, announcements and pausing, to make structural elements of the document explicit to the listener. The aesthetics of the audio rendition can simultaneously help reduce the monotony factor and enhance comprehension.
4. Standards-based Architectural Support for Mobile Voice Browsing
VoxML proved to be an obvious candidate with which to deliver value-added services because the subscribers interact with the Vox Portal using telephones and other audio enabled devices. This section begins with a topological overview of how such audio enabled devices can connect and access Internet and WWW services via the Vox Portal. Following this is a brief description of the Vox Portal and HTML<->VoxML converter. The operation and features of the voice browser are then explained.
4.1 Four Tier Architecture
Over recent years, three tier architectures have become commonplace and well integrated with the WWW. A typical three tier architecture comprises a Client (Tier 1), a Server (Tier 2) and a Database (Tier 3). The architecture described below adds a further tier to this model to create a Client-Agent-WWW Server-Database architecture. Figure 1 is a context diagram that depicts (in the region labelled Voice Client) how various devices can connect to the Vox Portal; how the Vox Portal interacts with the VoxML-Agent (in the region labelled Agent); and how the VoxML-Agent communicates with WWW Servers (in the region labelled Server).
Figure 1: Multi-tiered architecture context diagram.
4.2 Overview of the Vox Portal Client
The client consists of three main components: a Voice Terminal, a Voice over IP (VoIP) Gateway and a Vox Portal. The Voice Terminal is typically a device such as a traditional or cellular telephone. This terminal is connected through the appropriate native local network with the audio channel carried to the PSTN. A digital channel is then routed through a VoIP gateway. The VoIP gateway maps the circuit switched domain to the packet switched domain. The audio channel data is then sent through the packet switched network (LAN/WAN) using the H.323 protocol. With the aid of an auxiliary component, called a Gatekeeper (not shown in Figure 1), the H.323 session is routed to the Vox Portal.
The Vox Portal is a scalable, multi-user client able to interface with the VoxML-Agent. The Vox Portal acts as a gateway that maps H.323 sessions to HTTP sessions. Although any media type can be carried by the H.323 protocol, audio is the only medium currently supported. Interactivity is facilitated by interpreting voice and/or out-of-band DTMF (Dual-Tone Multi-Frequency) signals. As alluded to earlier, the Vox Portal currently subscribes to the standard VoxML v1.1 for specifying the nature of the interaction and behavior of the speech technologies. It is likely that the Vox Portal will evolve in the future to support the new VoiceXML standard as promulgated by the VoiceXML Forum. The Vox Portal also includes necessary support for issues and services such as management, accounting, user profiles, and internationalization.
4.3 Web Hosted Dynamic HTML<->VoxML Converter
The VoxML-Agent component is responsible for retrieving HTML documents, transforming them into VoxML documents and returning the VoxML content to the requesting HTTP(VoxML) client. Since the core functional components are hosted by a WWW server, the attributes of multi-user support, horizontal scalability, security and administration are inherited. The VoxML-Agent component bears similarities to a proxy server, with the major difference being that the HTML content undergoes analysis and transformation prior to being returned to the client. Additionally, proxy servers do not support session handling, which can be necessary when accessing some WWW sites.
4.3.1 VoxML-Agent Architecture
The VoxML-Agent is architected with modular and reusable components. It consists of a primary HTTP entry point, an HTTP client plus HTML parser, core HTML to VoxML transformation logic, core Form processing logic, and various fixed documents such as the VoxML DTD. Figure 2 depicts the high-level architecture of the VoxML-Agent and its external interfaces.
Figure 2: VoxML-Agent high-level architecture diagram.
The entry point of the VoxML-Agent application is the HTTP Query Processor which interprets the request of the HTTP (VoxML) client. The HTTP Query Processor supports two query types: FetchURL and PostForm. Clearly other queries are possible, but these two cover the majority of the HTTP requests usually encountered when browsing the WWW. Once the query is categorized as FetchURL or PostForm, the HTTP Query Processor delegates tasks to the appropriate modules.
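The delegation performed by the HTTP Query Processor can be pictured with the minimal Python sketch below. The paper names only the two query types; the query-string parameters (query, url, action), the agent URL and the handler functions are hypothetical, and the handlers are placeholders that merely report what they would do.

```python
from urllib.parse import parse_qs, urlsplit


def fetch_url(params):
    # Placeholder: a real agent would download and convert the document.
    return ("fetch", params["url"][0])


def post_form(params):
    # Placeholder: a real agent would hand the fields to form processing.
    return ("form", params["action"][0])


# Hypothetical mapping from query type to the module that handles it.
HANDLERS = {"FetchURL": fetch_url, "PostForm": post_form}


def dispatch(request_url):
    """Categorize a client request and delegate it to the matching handler."""
    params = parse_qs(urlsplit(request_url).query)
    kind = params.get("query", [""])[0]
    handler = HANDLERS.get(kind)
    if handler is None:
        raise ValueError(f"unsupported query type: {kind!r}")
    return handler(params)
```

For example, a request to the hypothetical address http://agent.example/voxml?query=FetchURL&url=http%3A%2F%2Fwww.example.com%2F would be routed to fetch_url. Keeping the mapping in a table means further query types could be added without touching the dispatch logic.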
The HTTP client and HTML parser module is responsible for carrying out the Agent HTTP GET or POST methods. The HTTP client supports SSL (HTTPS) and interoperability with Proxy Servers. The parsing capabilities allow this module to perform the majority of functions found in a typical WWW browser, such as automatically constructing nested framesets and combining XML/XSL to produce HTML.
The HTML-2-VoxML Core is the component in which the HTML document analysis and VoxML construction functionality reside. The HTML document analysis remains independent of the VoxML document construction. The HTML document structure and content analysis is performed as outlined in section 3. On completion of the HTML analysis, a VoxML version of the document is generated. To ensure compliance with the VoxML standard, the VoxML XML document generated is validated against the VoxML DTD before being transferred to the Vox Portal.
The Form Processing Core is the module responsible for constructing the POST or GET data of HTML forms that were converted to VoxML and returned by the VoxML client. This module does not itself perform the POST or GET. Instead, it takes the information returned by the VoxML client and invokes the appropriate POST or GET method, as expected by the originating WWW server that hosts the form; the request itself is carried out by the HTTP client module.
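As a rough illustration of this step, the Python sketch below rebuilds a form submission from field values returned by a client, without sending it. The function name build_form_request and the example URL are assumptions, and the sketch covers only URL-encoded forms; it is not the converter's own code.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit
from urllib.request import Request


def build_form_request(action, method, fields):
    """Return an unsent Request that reproduces an HTML form submission."""
    encoded = urlencode(fields)
    if method.upper() == "POST":
        # POST: the encoded fields travel in the request body.
        return Request(
            action,
            data=encoded.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    # GET: in this sketch the encoded fields replace the action's query string.
    parts = urlsplit(action)
    return Request(urlunsplit(parts._replace(query=encoded)), method="GET")
```

Returning an unsent request object mirrors the division of labor described above: the form logic decides what has to be sent, and a separate HTTP client decides how and when to send it.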
4.3.2 VoxML-Agent Session Control
Since the VoxML-Agent is a multi-user (or multi-session) middle-tier component, it is responsible for keeping track of the Client-Agent HTTP session and any associated established Agent-Server HTTP session. This is achieved through the combination of session cookies and the help of the WWW server that hosts the VoxML-Agent. After the first HTTP request by the VoxML client, the VoxML-Agent's hosting WWW server inserts a session ID in the cookie field of the HTTP response header. The VoxML-Agent services can also be used without cookies. However, the Agent-Server connection may fail in situations where the requested URL resides at a WWW site that employs session cookies. The use of such cookies is common on WWW sites that host free e-mail, financial services and other pre-paid services.
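One way to picture this bookkeeping is the Python sketch below, which ties each Client-Agent session ID to its own cookie jar for the Agent-Server side. The cookie name VOXSESSION and the SessionTable class are hypothetical; in the system described here the session ID is issued by the hosting WWW server rather than by application code.

```python
import uuid
from http.cookiejar import CookieJar
from http.cookies import SimpleCookie
from urllib.request import HTTPCookieProcessor, build_opener

COOKIE_NAME = "VOXSESSION"  # hypothetical cookie name


class SessionTable:
    """Maps a Client-Agent session ID to per-session Agent-Server state."""

    def __init__(self):
        self._sessions = {}

    def open_session(self):
        """Create a session with its own cookie jar and URL opener."""
        session_id = uuid.uuid4().hex
        jar = CookieJar()
        opener = build_opener(HTTPCookieProcessor(jar))
        self._sessions[session_id] = (jar, opener)
        return session_id

    def resolve(self, cookie_header):
        """Return (session_id, is_new) for a request's Cookie header value."""
        cookie = SimpleCookie(cookie_header or "")
        morsel = cookie.get(COOKIE_NAME)
        if morsel is not None and morsel.value in self._sessions:
            return morsel.value, False
        return self.open_session(), True

    def opener_for(self, session_id):
        """Opener whose requests carry the cookies gathered for this session."""
        return self._sessions[session_id][1]
```

Because every session owns a separate cookie jar, cookies set by a remote WWW site for one caller are never replayed on behalf of another, which is the property a multi-user middle tier needs.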
4.3.3 Overview of Converter Operation
Illustrated in Figure 3 is a simplified flow of events that transpire during a typical VoxML-Agent session. The shadowed boxes represent VoxML client states that can prompt an action, the circles represent actions performed by the VoxML-Agent and the diamonds represent basic decisions that result in branching to different actions.
Figure 3: VoxML-Agent interaction flow diagram.
Using the flow diagram in Figure 3 as a guide, the following example describes the process of navigating to a WWW URL, completing a Form and then submitting it to the originating WWW server. Each step is labelled in the diagram from 1-7, respectively.
Step 1: A VoxML client (HTTP(VoxML)) performs an HTTP request to the VoxML-Agent to fetch a URL.
Step 2: The VoxML-Agent processes the request by dynamically creating an HTTP client to fetch the designated URL.
Step 3: The document and its constituent parts, such as frames, are downloaded.
Step 4a: If a "time-out" error occurs, a VoxML page describing the error is generated and returned to the VoxML client.
Step 4b: Otherwise, while the document is downloading, a VoxML document describing the status is generated and returned to the VoxML client. A new status document is periodically generated and sent to the client until the download succeeds or the process "times-out".
Step 4c: Upon successful download, the HTML document is analyzed and a VoxML representation is created. The VoxML document is then returned to the VoxML client for rendering.
Step 5: At the time of rendition, the VoxML client will allow the user to follow links and complete forms.
Step 6: If the user chooses to follow a link, the process described from Step 1 to Step 6 is repeated.
Step 7: If the user completes and submits a form using the VoxML client, it is returned and handled by a process on the VoxML-Agent.
Step 8: The Form Processor of the VoxML-Agent accepts the data returned by the VoxML client and recreates the appropriate GET or POST method expected by the originating WWW server. Starting from Step 3, the cycle is then repeated.
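The branching in Steps 4a to 4c can be sketched in Python as a generator that yields one response per client poll. All names, the 30 second time-out and the 2 second polling interval are assumptions of this sketch; the responses are plain tuples standing in for VoxML documents, and the download is represented by a callable so that no network access is needed.

```python
import time


def serve_fetch(poll_download, convert, timeout_s=30.0, poll_s=2.0,
                clock=time.monotonic, sleep=time.sleep):
    """Yield (kind, payload) responses until the download ends or times out.

    poll_download() returns the HTML text once available, otherwise None.
    convert(html) stands in for the HTML analysis and VoxML generation.
    """
    deadline = clock() + timeout_s
    while True:
        html = poll_download()
        if html is not None:
            yield ("document", convert(html))      # Step 4c
            return
        if clock() >= deadline:
            yield ("error", "time-out")            # Step 4a
            return
        yield ("status", "still downloading")      # Step 4b
        sleep(poll_s)
```

Injecting the clock and sleep functions keeps the sketch testable without real delays; a deployed agent would more likely track the download in a background task keyed by the session.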
4.4 The Browsing Experience
An essential ingredient of hypermedia documents is the link, and, in the context of the WWW, a link can either point to another place within the same document (intra-document link) or another document entirely (inter-document link). Petrie notes that users can become disorientated during navigation without a mechanism for disambiguating these two link types. Moreover, Landow advocates the use of a "rhetoric of arrival and departure" when authoring hypermedia documents for mitigating the effects of disorientation during navigation. In order to navigate the WWW a user must be cognizant of the links. Once users were aware of the convention, empirical tests indicated that the combination of a distinct earcon, followed by a specific synthesized voice reserved for announcing the anchor text, enabled them to identify the presence of links correctly every time. Two sonically related earcons are used for link notification, thus enabling the listener to distinguish easily between the two link types.
It is a challenge to create an intuitive graphical user interface, but it is notoriously difficult to design an intuitive telephone user interface. We have implemented many of Resnick's recommendations for spoken feedback and keypad mappings. To reduce complexity, we sought an interface that required minimal interaction. The activity of browsing in an audio-only environment is quite different in comparison with its visual counterpart. The serial nature of audio gives rise to a "listen and interrupt" paradigm.
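Selecting between the two earcons requires deciding which type of link an anchor represents. The Python sketch below shows one simple rule, assumed here rather than taken from the paper: a link is intra-document when it resolves to the current document's URL and carries a fragment identifier. The earcon file names are placeholders.

```python
from urllib.parse import urldefrag, urljoin

# Placeholder file names for the two sonically related earcons.
EARCONS = {"intra": "earcon_intra.wav", "inter": "earcon_inter.wav"}


def link_type(document_url, href):
    """Classify a link as 'intra' (same document) or 'inter' (other document)."""
    target, fragment = urldefrag(urljoin(document_url, href))
    current, _ = urldefrag(document_url)
    if target == current and fragment:
        return "intra"
    return "inter"


def earcon_for(document_url, href):
    """Pick the earcon to play before the anchor text is announced."""
    return EARCONS[link_type(document_url, href)]
```

Resolving the href against the document URL first means that relative references such as #sec2 and absolute references to the same page are treated alike.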
Akin to bookmarks, a number of favorites can be set in advance and accessed through the Vox Portal. Browsing commences by selecting a favorite. Once downloaded, the document is processed by the converter and rendered as previously described. Each link encountered is announced and becomes the active link. A link remains traversable until the next link is reached. The minimal pattern of interaction for browsing can thus be reduced to selecting favorite(s) and following the active link(s).
To relieve the user of listening to every document in its entirety, a selection of browsing modes allows different perspectives of the document to be generated. In addition to the entire document being rendered, a second mode announces each link anchor; this is useful if a document is frequently used as an index to subsequent documents. A third mode announces only the section headings, a convenient mechanism for rapidly scanning a document. A fourth mode skips over dense link clusters and renders the content. As the browsing mode can be changed dynamically, a user can combine these approaches to navigate more efficiently within and across document boundaries. In addition, a history list of the documents visited is maintained. The user typically traverses backward or forward through the history listening to the titles being announced, and selects the desired document to revisit.
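The four browsing modes can be viewed as filters over a linear sequence of document elements, as in the Python sketch below. The (kind, text) item representation, the mode names and the rule that three or more consecutive links form a dense cluster are all assumptions made for illustration.

```python
from itertools import groupby


def render(items, mode, cluster_size=3):
    """Select the items to be spoken for a given browsing mode.

    items is a list of (kind, text) pairs with kind in
    {"heading", "text", "link"}.
    """
    if mode == "full":
        return list(items)
    if mode == "links":
        return [item for item in items if item[0] == "link"]
    if mode == "headings":
        return [item for item in items if item[0] == "heading"]
    if mode == "content":
        kept = []
        # Group consecutive items into runs of links and runs of non-links.
        for is_link, run in groupby(items, key=lambda item: item[0] == "link"):
            run = list(run)
            if is_link and len(run) >= cluster_size:
                continue  # assumed definition of a dense link cluster
            kept.extend(run)
        return kept
    raise ValueError(f"unknown browsing mode: {mode!r}")
```

Since each mode is a pure function of the same analyzed document, switching modes during playback only requires re-filtering the items from the current position.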
5. Future Work and Conclusions
Naturally, a voice browser cannot challenge a traditional WWW browser in delivering a rich multimedia presentation. Hence traditional WWW browsers are unlikely to be supplanted by voice browsing technologies. However, voice browsers can provide an appropriate interface for many people in many situations. Voice browsing technologies still face many interesting research challenges. The effectiveness of all voice browsers is compromised when faced with unsympathetically authored HTML documents [23, 19, 11, 26], namely those that lack explicit logical structure, contain an abundance of graphics, rely on client side scripting or have unclear associations between form labels and their corresponding fields. In addition, more research is necessary to take this technology beyond the current generation of voice browsers.
In this paper we described a standards-based (VoxML) carrier class voice browsing system called the Vox Portal and an associated generic solution for dynamic, bi-directional HTML<->VoxML conversion. This generic conversion service unites the voice and WWW arenas and also relieves publishers from translating their HTML content and services to VoxML. The conversion service handles HTML documents through to version 4 and includes full support for sophisticated features, such as the description of and interactive input to HTML forms, tables and the linearization of arbitrary depth HTML framesets.
Interactive voice browsers offer an alternative paradigm that enables both sighted and visually impaired users to access the WWW. In addition to the desktop PC, voice browsers afford ubiquitous mobile access to the WWW using a wide range of consumer devices. This technology can facilitate a safe, "hands-free" browsing environment which is of importance both to car drivers and various mobile and technical professionals. By providing voice mediated access, information providers can reach a wider audience and leverage existing investment in their WWW content.
Thanks are due to the following members of technical staff at SCR and SRIT who have contributed to the work described in this paper: Steffen Rusitschka, Sreedhar Kodlahalli, Denis Perraud, Laurent Strullu and Philippe Menard.
S. Goose and C. Möller, A 3D Audio Only Interactive Web Browser: Using Spatialization to Convey Hypermedia Document Structure, Proceedings of the ACM International Conference on Multimedia, pp. 363-371, October 1999.
H. Petrie, S. Morley, P. McNally, A. O'Neill and D. Majoe, Initial Design and Evaluation of an Interface to Hypermedia Systems for Blind Users, Proceedings of the ACM International Conference on Hypertext, pp. 48-56, April 1997.
Stuart Goose received B.Sc. (1993) and Ph.D. (1997) degrees in Computer Science from the University of Southampton, England. At Siemens he leads a research group with multiple projects exploring and applying various aspects of Internet, mobility, multimedia, hypermedia, speech and audio technologies.
Mike Newman received his B.S. in Electrical Engineering (1992) from Temple University and his M.S. in Computer Science (1997) from Penn State University.
Claus Schmidt received a Dipl. Informatiker degree from FH Furtwangen, Germany (1990) and an M.Sc. degree in Computer Science from Leicester Polytechnic, England. He leads a group that develops next generation voice data convergence applications for telecommunication networks. His focus is on Voice over IP, multimedia and Web enabled telephony.
Laurent Hu graduated from the École Nationale Supérieure des Télécommunications de Bretagne in 1992. At Siemens, he leads a service developing VoIP products, including the Vox Portal server (SURPASS family). He worked first for Thomson in the field of signal processing (publication at GRETSI '93 and OCEAN '93). He then joined the France Telecom Research Center in 1993, where he was in charge of IP and ATM network design and ATM equipment specification. He has taken part in the standardization process of ATM equipment, as Chairman of the Q10/15 within ITU-T and as a delegate in the ETSI body. He has been awarded three European patents in the domain of ATM access networks for mobiles.