Automating Web Navigation with the WebVCR
Vinod Anupam, Juliana Freire, Bharat Kumar, Daniel Lieuwen
Bell Laboratories, 600 Mountain Ave., Murray Hill, NJ 07974, USA
Recent developments in Web technology such as the inclusion of scripting languages, frames, and the growth of dynamic content, have made the process of retrieving Web content more complicated, and sometimes tedious. For example, Web browsers do not provide a method for a user to bookmark a frame-based Web site once the user navigates within the initial frameset. Also, some sites, such as travel sites and online classifieds, require users to go through a sequence of steps and fill out a sequence of forms in order to access their data. Using the bookmark facilities implemented in all popular browsers, often it is not possible to create a shortcut to access such data, and these steps must be manually repeated every time the data is needed. However, hard-to-reach pages are often the best candidates for a shortcut, because significantly more effort is required to reach them than to reach a standard page with a well-defined URL.
The WebVCR system addresses this problem by letting users record and replay a series of browsing steps in smart bookmarks -- shortcuts to Web content that require multiple steps to be retrieved. It provides a VCR-style interface to transparently record and replay users' actions. Creating and updating smart bookmarks is a simple process involving only the usual browsing actions and requiring no programming by the user. In addition to saving users time by providing shortcuts to hard-to-reach Web content, smart bookmarks can be used as building blocks for many interesting Web applications and new e-commerce services.
In this paper, we describe the WebVCR and the techniques it uses to record and replay smart bookmarks, as well as our experiences in building the system. We also discuss some applications that are simplified/enabled by smart bookmarks.
Keywords: affiliate programs, bookmarks, dynamic content, electronic commerce, notification, personalization, smart bookmarks, tutorials, Web clipping, wrappers
The growing trend of making the Web more interactive and personalized, together with the explosion of dynamic content, has led to the wide use of scripting languages, frames, cookies, and forms. As a result, the process of retrieving Web content has become more complicated, and can sometimes be tedious. For example, Web browsers do not provide a method for a user to bookmark a frame-based Web site once the user navigates away from the initial frameset. Also, some sites require users to go through a sequence of steps in order to access their data. For example, in order to find out the available flights and fares for a certain itinerary, one needs to log in at a travel Web site (by filling out a form with login id and password), and enter the itinerary information to retrieve the available fares. These steps cause dynamic pages to be generated, often with session-ids encoded in the URL or embedded inside the page. Using the bookmark facilities implemented in all popular browsers, it is not possible to create a shortcut to the list of available flights. Consequently, in order to track the cost of a trip, these steps must be repeated multiple times. Such pages are often the best candidates for a shortcut, because significantly more effort is required to reach them than to reach a standard page that has a well-defined URL.
In order to address this shortcoming, we built the WebVCR system. WebVCR presents a VCR-style interface to record and play browsing steps. It is very simple to use: a user needs only instruct the system to start recording and go on with his usual navigation. Once he reaches the desired final page, he can stop recording and save the sequence of browsing steps in a smart bookmark to be replayed at a later time. Smart bookmarks are shortcuts to Web content that require multiple browsing steps to be retrieved -- they may be saved in bookmark lists, or mailed to others like any other bookmark.
Figure 1: Travelocity main page
Figure 2: Travelocity login
Figure 3: Itinerary form (to specify origin, destination, dates, etc)
Figure 4: List of alternative flights
Example 1.1 (Navigating travelocity.com) Consider the following scenario. Juliana plans to attend the WWW9 conference and she is looking for flights from Newark to Amsterdam, that leave from Newark May 14th and return from Amsterdam on May 20th. She must take the following steps:
- Go to http://www.travelocity.com
- Choose the Find/Book a Flight option (Figure 1),
- Login (Figure 2),
- Specify details of itinerary (Figure 3).
This series of steps produces a page with a list of alternative flights (Figure 4) whose URL is something like:
Bookmarking this URL is not useful, since once the session times out, the URL can no longer be used to access the page. However, it is likely that a single visit to this page will be insufficient. It may take weeks or months to find an acceptable fare. Using the WebVCR, these steps can be saved and later replayed with a single click -- saving Juliana a lot of clicking and time.
Whereas the ability to create shortcuts to hard-to-reach Web content can be a time saver for a user, it is especially useful for applications that consume such content. Significant effort has already been invested into developing techniques to build wrappers to extract information from HTML pages (see e.g., [KM98,SA99,Ade98]) and more recently to query XML documents (see e.g., [DFF+99]). However, issues involving the actual retrieval of the data have been largely overlooked. Currently, in order to automate the retrieval process, one must write a program (an access wrapper) in a general-purpose language such as Perl or Java, or in a more specialized language such as WebL [KM98], to perform the required navigation. However, especially in the context of Web integration systems, this is not always practical. Given the rate at which Web sites change, maintaining a large number of access wrappers can be very time consuming. The WebVCR can be used to quickly create access wrappers to Web content. Creating and updating these wrappers is a simple process involving only the usual browsing actions.
As a result, a number of applications can be greatly simplified by the WebVCR. For example, casual users can easily put together personal portals (such as http://my.yahoo.com) with information retrieved from sites of their choice (e.g., their bank balance, weather report, etc.) [ABFK99]. This and other new applications enabled by WebVCR are described in Section 3.
The paper is organized as follows. In Section 2 we give an overview of the WebVCR system and its methodology. Applications simplified and/or enabled by WebVCR are described in Section 3. Implementation details and our experiences are presented in Section 4. Related work is discussed in Section 5, and we conclude in Section 6 with future directions we plan to pursue.
The record-play facility provided by WebVCR allows users to save shortcuts to Web pages that do not have a well-defined (static) URL. In this section we describe the methodology behind WebVCR, and illustrate one of its uses: a personal WebVCR that lets casual Web users create smart bookmarks.
The main idea behind WebVCR is to transparently record a sequence of browsing steps that can be saved, and automatically replayed later. We break down this functionality into three coarse functions:
Recording: Storing the user's browsing information. Once notification about an action is received, enough information must be saved so that the step can be replayed later. Since a smart bookmark may visit several (static and dynamic) pages, at each page the WebVCR must be able to identify the correct action needed to retrieve the next page. Our initial implementation uses the DOM signature of the active (clicked/modified) object (e.g., document.links[4], which represents the fifth link in the current document), as well as other information available about the object (see Section 4.1 for details). Note that storing only the DOM signature of the objects is not enough. Take for example sites such as http://amazon.com that may display a different number of ads each time a page is visited. If a new ad contains a link (or a form), the DOM signatures of all subsequent links (or forms) change -- if the selection of the object is based solely on the DOM, an incorrect object may be chosen during replay.
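The fragility of a purely positional DOM signature can be illustrated with a minimal sketch. It models a page as a plain list of (text, URL) links; the page contents, the ad URL, and the helper name `link_at` are illustrative assumptions and not part of the WebVCR itself.

```python
from typing import List, Optional, Tuple

Link = Tuple[str, str]  # (link text, URL)

def link_at(page: List[Link], dom_index: int) -> Optional[Link]:
    """Return the link at a recorded DOM index, or None if out of range."""
    if 0 <= dom_index < len(page):
        return page[dom_index]
    return None

# Page as seen during recording: the user clicks "Books", at index 1.
recorded_page = [("Home", "/home"), ("Books", "/books"), ("Music", "/music")]
recorded_index = 1
recorded_link = recorded_page[recorded_index]

# The same page at replay time, with one extra ad link inserted at the top.
replay_page = [("Ad: cheap flights", "http://ads.example/1")] + recorded_page

# Selecting by DOM index alone now picks the wrong link.
by_index = link_at(replay_page, recorded_index)

# Also checking the recorded link text recovers the intended link.
by_text = next((l for l in replay_page if l[0] == recorded_link[0]), None)
```

Here `by_index` is the "Home" link rather than the recorded "Books" link, which is why additional properties of the object are stored alongside its position.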
There are many possible ways to implement a WebVCR system depending on the choice of notification system, where and how smart bookmarks are replayed, and where and how they are stored. Because of space limitations, in this paper we restrict our discussion to two different architectures, client-based and server-based. The differences and tradeoffs between these architectures are discussed in Section 2.2. We also restrict our discussion of implementation details to Netscape Navigator. In what follows, we illustrate how Web traversals are recorded and replayed using a client-based implementation, the personal WebVCR.
The architecture of the personal WebVCR is shown in Figure 5. The personal WebVCR uses a Java applet in conjunction with the user's browser to record and replay smart bookmarks. The applet can be installed on the end-user's desktop, or downloaded whenever required from a Web site hosting this applet.
Figure 5: Client-based architecture
Figure 6: Recording smart bookmarks
The user starts the WebVCR by loading the WebVCR starting page into a browser window (MainWindow). The starting page opens a new browser window (AppletWindow) and loads an HTML page containing the WebVCR applet. The reason for loading the WebVCR applet in its own browser window is to make the applet persistent (while the user is recording/playing smart bookmarks in the MainWindow). The applet, which has standard VCR-style buttons (see Figure 7), is then started. The recording process is depicted in Figure 6. To record a smart bookmark, the user traverses the Web to the desired starting point for the smart bookmark and clicks on the Record button in the applet. Clicking on the Record button causes two actions to take place (which are transparent to the user): (1) the applet records the current URL as the starting location of the smart bookmark; and (2) the applet inserts event handlers on all elements in the MainWindow that the user might operate on. From then on, as the user navigates via link traversals or form submissions, each action triggers an event handler that causes the applet to record the corresponding action. Whenever a new page is loaded, the applet re-inserts the event handlers (Step 2 above). As shown in Figure 8, the applet window keeps the user informed of his progress.
When the user finally reaches the desired page, he clicks on the Stop button and the applet stops recording. The user can then play or step through the recorded smart bookmark. During replay, the WebVCR applet uses the steps recorded in the smart bookmark to inform the browser which action to take in order to retrieve the next page. For example, for link traversals, the corresponding URL is loaded into the browser; for form submissions, the values input by the user (and recorded in the smart bookmark) are used to fill the form before submitting it.
A set of smart bookmarks can be concatenated. For example, one may create a smart bookmark for login at a specific site, and a number of others to perform distinct after-login activities. However, there is a requirement that the first step in each sequence of smart bookmarks must have a well-defined URL. For example, if the user has been browsing inside a frameset such that the current URL doesn't reflect the content in the frames, then replay will not work properly.
Once a smart bookmark is recorded, the user also has the option of saving it into an HTML file that contains a representation of the smart bookmark along with a reference to the WebVCR applet. This HTML page can be bookmarked like any other Web page, and can be added to the browser's bookmark/favorites list. If the user loads that HTML page into a browser, the WebVCR applet starts and automatically replays the entire recorded smart bookmark, thus providing the user with one-click access to the final page.
Figure 7: Screenshot of WebVCR applet when first started up
Figure 8: Screenshot of WebVCR applet while recording steps
In the discussion above, we described an implementation of a client-based WebVCR that uses a WebVCR applet and browser to record and play smart bookmarks. However, for applications such as Web clipping for wireless access (see Section 3), a tool is needed that does not require the use of a full-fledged browser on the client side, and that minimizes the communication between client and server. For such an application, a server-side process that receives a request, performs the replay, and ships only (some section of) the final page is more appropriate.
- Privacy: A WebVCR server has access to all information recorded in the smart bookmark, as well as everything that is sent to and downloaded from the destination site during record and replay. The client-based architecture, in contrast, offers greater privacy to the user, since record/replay occur at the user's desktop, and the information recorded in the smart bookmark is also stored locally.
- Implementation complexity: In client-based architectures, since record and replay are done via the user's browser, destination sites requiring cookies or secure access (HTTPS) pose no problems. In contrast, server-based implementations must provide special support for these features.
- Ease of use and convenience: In server-based architectures, the user does not need to install/download the application, but needs to access the third party's Web server every time the smart bookmarking functionality is required.
- Security: Because of security restrictions imposed by browsers, in client-based architectures, the WebVCR applet must be granted certain privileges. For example, UniversalBrowserAccess is required since the applet needs to read/modify pages downloaded from different domains, and UniversalFileAccess is required if the user desires to save bookmarks into HTML files for later access. It is our experience that some users are not comfortable with accepting certificates and granting such privileges to applets.
- Secure connections: In server-based architectures, HTTPS connections must be handled properly. In order to provide an end-to-end secure connection, the WebVCR server must open two distinct secure connections: one to the destination, and one to the client.
Figure 9: Server-based architecture
It is worth pointing out that a hybrid architecture consisting of a combination of server-based and client-based components is also possible: a user may create a smart bookmark with a personal WebVCR on his desktop, and later replay this smart bookmark on the desktop from his personal digital assistant (PDA) using a wireless modem.
Some applications of smart bookmarks are evident, for example: a user can record smart bookmarks for any task he might need to perform multiple times, such as accessing local weather information, searching used car classifieds, checking for best airfares to a particular destination, filling up an online shopping basket, etc. However, a variety of other applications can also be built using smart bookmark technology. We list some of these applications.
Web Personalization and Mobile Access Smart bookmarks were described in passing in our previous work on Web personalization [ABFK99], where we propose a new approach to personalization that allows users to specify the contents of a personal home page, much like in MyYahoo (http://my.yahoo.com) but with more choices: a personal page is built from a set of general queries over information from multiple Web sites. For instance, the user can specify that an arbitrary Web page (or section thereof) be embedded in the personal home page as a frame (or layer). The specification of the content of each frame can be a URL, a smart bookmark, or a more complex query that retrieves the desired content.
The ability to easily create access wrappers makes it possible to produce highly personalized Web clipping services when combined with the ability to extract parts of pages. The ability to return only select portions of the final Web page is very desirable for mobile clients (e.g., smart phones and PDAs) that have limited bandwidth. Furthermore, this can be combined with phone browsing technology (e.g., [Aea97,Voi]) to make this personalized content available via phone.
Smarter Affiliate Programs and Permission Marketing Many sites offer affiliate programs, where they give third-party sites commissions from sales originated in those sites (see e.g., [Ama]). Using smart bookmarks to produce complex orders, affiliate programs can be made more valuable, both to the merchant who ultimately ships the items and to the consumer who uses the service. For instance, currently, a recipe site can put a link to a merchant site selling ingredients used in the recipe or to a product on that site which is needed in the recipe. In the latter case, the user clicks on the product link and then makes a second click on the resulting page at the merchant site to add the item to the shopping cart. However, affiliate programs cannot make it really simple to order all the items in the recipe unless the merchant site has already produced such a bundle. With the WebVCR, staff of the affiliate programs can produce a smart bookmark that will load a user's shopping cart with exactly the right items for the recipe from the merchant's site. The user can remove any unneeded items or add any other desired items before checking out. The increased ease of purchasing makes impulse buying of ingredients to make the recipe more likely. By producing more recipes and corresponding smart bookmarks, the affiliate site can add significant value to the merchant's offerings and gain significant revenue.
Similarly, bundles of offers (for example, clothing suggestions, possibly with the ability to see them tried on, party supplies, or gift basket items) can be produced by a merchant employing permission marketing and sent as email or placed on a personal Web page for individual customers. The user can easily order the items as above -- the difference being that the bookmark is customized with a particular customer in mind rather than as a more general offering. Given the simplicity of producing smart bookmarks, creative bundling options are significantly easier to develop than the alternatives, which require server-side programming -- creating smart bookmarks requires no programming, only an intuitive VCR-style interface. This makes it possible to do many more experiments on what kind of promotions really work. This is crucial in permission marketing -- examples of improving response rates from 3% to 40% by repeated experimentation have been reported in [GP99].
Tutorials Smart bookmarks can be used as tutorials of how to use a site. The WebVCR has a step facility that allows users to take their time at each page encountered during a traversal. This can help them learn how the site is structured for particular kinds of use. For example, an online customer care representative for travelocity.com can instruct customers how to navigate the travelocity site for specific tasks (e.g., booking a flight) by remotely creating smart bookmarks and emailing them to customers (possibly including some further explanation in the email).
Web Site Testing Another useful application for the WebVCR is in Web site testing. Smart bookmarks can be used not only as a test suite to test Web site functionality, but also to test how well a site responds to high volume of hits -- for example, by firing multiple smart bookmarks simultaneously.
The WebVCR applet has a very small footprint, so that it is practical for users to experiment with the system without experiencing large delays for downloading the Java code. It achieves that in part by not duplicating functionality provided by the browser (e.g., instead of implementing an HTML parser, it uses the user browser's DOM API to locate page elements).
The rest of this section describes the implementation as well as issues we encountered while building the system. Certain details are omitted to simplify the presentation, for example we consider user actions to be only link traversals or form submissions, though other kinds of actions (e.g., button clicks) can also be handled.
For link steps, the WebVCR records the following information: text associated with the link; URL that the link refers to; the target name (if present) in which the resulting Web page should be displayed; and DOM location of the link. Form steps contain: name of the form; DOM location of the form; action associated with the form; method associated with the form (GET/POST); and all the elements in the form. For each form element, further recorded information includes: element name; index of the element in the form (e.g., 3rd element); type of element (e.g., text, password); properties of the element (see below); and type-specific properties (e.g., values for text fields, checked flag for checkboxes and radioboxes, selection index or option lists for selections).
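The recorded fields listed above could be modelled roughly as follows. This is a minimal sketch for illustration, not the WebVCR's actual storage format: the class names, field names, and the sample login values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LinkStep:
    text: str                     # text associated with the link
    url: str                      # URL the link refers to
    dom_index: int                # DOM location, e.g. index into document.links
    target: Optional[str] = None  # target window/frame name, if present

@dataclass
class FormElement:
    name: str                     # element name
    index: int                    # position of the element within the form
    type: str                     # e.g. "text", "password", "checkbox"
    mode: str = "stored"          # playback property: prompt, stored, encrypted
    value: Optional[str] = None   # type-specific property (here: a text value)

@dataclass
class FormStep:
    name: str                     # name of the form
    dom_index: int                # DOM location of the form
    action: str                   # action associated with the form
    method: str                   # "GET" or "POST"
    elements: List[FormElement] = field(default_factory=list)

# Hypothetical login step; all values are made up for illustration.
login = FormStep(
    name="login",
    dom_index=0,
    action="/login",
    method="POST",
    elements=[
        FormElement(name="user", index=0, type="text", value="user1"),
        FormElement(name="password", index=1, type="password", mode="prompt"),
    ],
)
```

Keeping the DOM location next to the descriptive fields (text, name, action) is what lets replay fall back on the latter when the former no longer points at the right object.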
Figure 10: Smart bookmark steps to login at http://www.travelocity.com
There are different modes for storing user-specified information in smart bookmarks. For instance, the user is able to specify that password fields (e.g., Figure 2) are either prompted for when needed during replay, or are stored encrypted in the smart bookmark, whereas fields like the origin and destination of a flight (Figure 3) can often be stored in plain text. Accordingly, each attribute has one of the following properties to guide the WebVCR during playback: prompt (ask the user for the attribute value); stored (use the value that is stored in plain text); encrypted (use the value that is stored encrypted).
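A playback routine might resolve an attribute value from its property along the following lines. This is a sketch under stated assumptions: the function name `resolve_value` and the callback parameters are hypothetical, and `toy_decrypt` is only a base64 stand-in (not real encryption) so that the example runs without a cryptographic library.

```python
import base64
from typing import Callable, Optional

def resolve_value(mode: str,
                  stored: Optional[str],
                  prompt: Callable[[], str],
                  decrypt: Callable[[str], str]) -> str:
    """Return the value to place in a form element during playback."""
    if mode == "prompt":
        return prompt()                 # ask the user at replay time
    if stored is None:
        raise ValueError(f"mode {mode!r} needs a stored value")
    if mode == "stored":
        return stored                   # plain-text value from the bookmark
    if mode == "encrypted":
        return decrypt(stored)          # value kept encrypted in the bookmark
    raise ValueError(f"unknown mode: {mode!r}")

def toy_decrypt(blob: str) -> str:
    # Placeholder only: base64 is an encoding, not encryption.
    return base64.b64decode(blob.encode("ascii")).decode("utf-8")

origin = resolve_value("stored", "EWR", prompt=lambda: "", decrypt=toy_decrypt)
password = resolve_value(
    "encrypted",
    base64.b64encode(b"secret").decode("ascii"),
    prompt=lambda: "",
    decrypt=toy_decrypt,
)
typed = resolve_value("prompt", None, prompt=lambda: "typed-by-user",
                      decrypt=toy_decrypt)
```

Passing the prompt and decryption routines in as callbacks keeps the mode dispatch independent of how the user is asked or how values are protected.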
The recording process is as follows. When the user presses the record button in the applet window, the applet uses LiveConnect [Fla98] to set event handlers on all clickable elements in the page displayed in the browser (i.e., onclick handlers for links, onsubmit handlers for forms, etc.). If there are already event handlers present in the page, the new handlers are chained to the existing handlers to ensure proper replay.
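The chaining of a recording handler onto a page's existing handler can be sketched as below. The sketch is plain Python rather than LiveConnect, and it assumes one particular ordering (record first, then run the page's own handler and preserve its return value); the text does not specify the order, and the names `chain` and `page_handler` are hypothetical.

```python
from typing import Callable, List, Optional

# A handler receives an event description and returns False to cancel the
# browser's default action (mirroring the onclick/onsubmit convention).
Handler = Callable[[str], bool]

def chain(existing: Optional[Handler],
          record: Callable[[str], None]) -> Handler:
    """Wrap an existing handler so the event is recorded before it runs."""
    def chained(event: str) -> bool:
        record(event)              # assumption: recording happens first
        if existing is None:
            return True            # no page handler: allow the default action
        return existing(event)     # keep the page's own behaviour and result
    return chained

recorded: List[str] = []
page_calls: List[str] = []

def page_handler(event: str) -> bool:
    # Stand-in for a handler that the page author had already installed.
    page_calls.append(event)
    return False

handler = chain(page_handler, recorded.append)
result = handler("click link 4")
plain = chain(None, recorded.append)("submit form 0")
```

Preserving the original handler's return value matters because a page script may rely on it, for example to cancel a submission after client-side validation.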
When an event fires, the applet records all the necessary information for the event. It must then wait until the following page is loaded to repeat the process of adding handlers and waiting for events. The WebVCR adds onload handlers to each page to detect when a page has been fully loaded. However, if the page has already loaded when the onload handler is added, the event will never fire. Thus, in addition to onload handlers, the WebVCR uses a separate thread that polls the browser window to check whether the document changed.
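The polling fallback can be sketched as a loop that repeatedly reads some observable property of the window until it differs from the previous value. This sketch assumes that a change of location is what signals a new document; `wait_for_change`, `FakeWindow`, and the timing parameters are hypothetical, and the real system polls the browser from a separate thread rather than from a blocking call.

```python
import time
from typing import Callable, List

def wait_for_change(read_location: Callable[[], str],
                    previous: str,
                    timeout: float = 5.0,
                    interval: float = 0.05) -> bool:
    """Poll until the location differs from `previous` or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_location() != previous:
            return True
        time.sleep(interval)
    return False

class FakeWindow:
    """Stand-in for a browser window whose location changes over time."""
    def __init__(self, urls: List[str]) -> None:
        self._urls = list(urls)
        self._reads = 0

    def location(self) -> str:
        url = self._urls[min(self._reads, len(self._urls) - 1)]
        self._reads += 1
        return url

window = FakeWindow(["/a", "/a", "/b"])
changed = wait_for_change(window.location, "/a", timeout=1.0, interval=0.001)
stuck = wait_for_change(lambda: "/a", "/a", timeout=0.01, interval=0.001)
```

Polling costs a little extra work but covers the case where the page finished loading before an onload handler could be attached.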
Figure 11: Playing a smart bookmark
The recorded steps are replayed as described in Figure 11. Each step is executed depending on its type. For URL steps (lines 4-5), the stored href is fetched and loaded into the browser. For link traversals (lines 6-8), the recorded properties of the link (DOM location, text, URL) are used to determine the href of the page to be fetched (see below for details on the heuristics used to find the closest match). The page is displayed in the target window specified in the step. Finally, for form submissions (lines 9-11), the recorded properties of the form (name, DOM location, element names and types) are used to determine the appropriate form to be submitted. Attribute values specified as stored or encrypted are read from the smart bookmark, and the user is prompted for attribute values specified as prompt -- these values are used to set the values of the form elements, and the form is then submitted.
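The dispatch on step type described above can be sketched as follows. This is an illustrative reconstruction from the prose, not the algorithm of Figure 11: the dictionary layout, `FakeBrowser`, the `find_link` and `prompt` callbacks, and the sample URLs are assumptions, and both waiting for page loads and decrypting encrypted values are omitted.

```python
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]

class FakeBrowser:
    """Stand-in for the browser; it only logs what it is asked to do."""
    def __init__(self) -> None:
        self.log: List[str] = []

    def load(self, url: str, target: str = "_self") -> None:
        self.log.append(f"load {url} in {target}")

    def submit(self, form_name: str, values: Dict[str, str]) -> None:
        self.log.append(f"submit {form_name} {sorted(values.items())}")

def replay(steps: List[Step], browser: FakeBrowser,
           find_link: Callable[[Step], str],
           prompt: Callable[[str], str]) -> None:
    for step in steps:
        kind = step["kind"]
        if kind == "url":
            browser.load(step["href"])                 # stored href
        elif kind == "link":
            href = find_link(step)                     # closest matching link
            browser.load(href, step.get("target", "_self"))
        elif kind == "form":
            values = {}
            for el in step["elements"]:
                if el["mode"] == "prompt":
                    values[el["name"]] = prompt(el["name"])
                else:
                    values[el["name"]] = el["value"]   # stored value
            browser.submit(step["name"], values)
        else:
            raise ValueError(f"unknown step kind: {kind!r}")

steps: List[Step] = [
    {"kind": "url", "href": "http://travel.example/"},
    {"kind": "link", "text": "Find/Book a Flight", "url": "/flights",
     "target": "main"},
    {"kind": "form", "name": "login", "elements": [
        {"name": "user", "mode": "stored", "value": "user1"},
        {"name": "password", "mode": "prompt", "value": None},
    ]},
]
browser = FakeBrowser()
replay(steps, browser, find_link=lambda s: s["url"],
       prompt=lambda name: "typed-" + name)
```

In this sketch `find_link` simply returns the recorded URL; in practice it is the place where the matching heuristics for changed pages would be applied.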
During replay, the applet must also detect when a new page is loaded (line 2). The process used is similar to that used for recording. The applet inserts an onload handler in the Web page to detect when the page has been completely loaded. In addition, the applet polls the DOM structures (created by the browser) at regular intervals to check if a sufficient portion of the page has loaded. This is determined currently by checking if the link/form at the recorded DOM location is available, though more sophisticated reasoning is possible.
Since Web pages may change after a smart bookmark is recorded, special care must be taken to ensure that smart bookmarks are correctly replayed. In what follows, we describe some error-correction heuristics that are required to make the replay robust in the presence of changes to the page structure. Even though we limit our discussion to link traversals, similar techniques can be applied for other kinds of smart bookmark steps as well.
During the replay of a link step, the WebVCR first accesses the properties of the link in the currently loaded page that has the same DOM location as the recorded link. If the URL and the target of this link are the same as the recorded information, a match is declared, and this link is used for the replay. However, occasionally there will not be a match. There are several reasons for this. For example:
- The DOM location may have changed. A Web page may have ads that appear before the link that the user has recorded. Since ads that appear in the page usually differ from one traversal to the next, the number of links embedded in the ads may also differ, and consequently, the DOM location of the recorded link may change.
- The URL may have changed. Some sites encode session information in the URLs. Consequently, when one logs out and then logs in, the corresponding URL will have changed, and thus, using the same URL is likely to result in a server error. However, the text associated with the link can still be used to perform the match.
- The link text may have changed. For example, the cnn.com Web page has a link pointing to the daily almanac, where the text associated with the link refers to the current date. Hence, the link text changes daily, while the URL associated with the link remains the same.
These changes do not pose a problem to a user browsing the Web since the user can easily determine which link he wants to follow, but they do present a challenge to a system that performs automatic navigation. We use the following heuristics in order to find the closest match for a recorded link step (note that a number of the steps given below can be combined for efficiency):
1. Attempt to locate a link in the last retrieved page at the DOM location stored in the current smart bookmark step. If the link exists, its target matches the stored target, and either its URL or its text matches the step, then use that link.
2. Otherwise, if there is a unique link in the page whose target, URL, and text match those of the stored link, use that link.
3. Otherwise, if there is a unique link in the page whose target and URL match those of the stored link, use that link.
4. Otherwise, if there is a unique link in the page whose target and text match those of the stored link, use that link.
5. Otherwise, if the link corresponds to a CGI script (e.g., its URL contains a "?"), then find all links that match the stored URL up to the first occurrence of a "?" and store them in a set of candidate links, which we denote L.
6. Eliminate any elements of L whose parameter names do not match the stored version. For instance, if the stored URL has parameters named x and y, then http://xyz.com/script?x=20&y=32 matches, but http://xyz.com/script?x=10&z=12 does not, since it has a parameter named z that does not appear in the stored version.
7. For each parameter in the stored version whose value matches the corresponding parameter value in at least one element of L, eliminate all elements of L with a non-matching value for the same parameter.
8. If L is a singleton set, use that element.
9. Otherwise, the playback can either be aborted, or the link present at the recorded DOM location can be used to try to proceed through the playback (our implementation uses the latter). However, the playback might fail later in the sequence, or the sequence might traverse pages different from what the user had recorded.
Steps 1-4 are self-explanatory. Steps 5-8 deal with the case where the link refers to a CGI script. Step 6 handles the case in which the same CGI script is used for a variety of tasks: a candidate must carry the same set of variables as the stored link, even if their values differ because session information is encoded in them. Step 7 is used to differentiate between variables that specify the task to perform and those which encode session information. For instance, menu=7 almost certainly implies a task to perform rather than encoding session information. Thus, if a match is found, all non-matching candidates are eliminated.
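The matching heuristics can be put together in one function, sketched below. This is one reading of the steps rather than the WebVCR's code: the `Link` class and `find_link` are hypothetical names, repeated parameter names in a query string are collapsed to their last value, and the final fallback returns the link at the recorded DOM location (or None if the page has no such position).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from urllib.parse import parse_qsl

@dataclass(frozen=True)
class Link:
    text: str
    url: str
    target: Optional[str] = None

def _params(url: str) -> Dict[str, str]:
    query = url.split("?", 1)[1] if "?" in url else ""
    return dict(parse_qsl(query, keep_blank_values=True))

def _unique(links: List[Link]) -> Optional[Link]:
    return links[0] if len(links) == 1 else None

def find_link(page: List[Link], rec: Link, dom_index: int) -> Optional[Link]:
    at_index = page[dom_index] if 0 <= dom_index < len(page) else None
    # Step 1: same DOM location, same target, and URL or text agrees.
    if (at_index is not None and at_index.target == rec.target
            and (at_index.url == rec.url or at_index.text == rec.text)):
        return at_index
    # Steps 2-4: a unique link matching target plus URL and/or text.
    same_target = [l for l in page if l.target == rec.target]
    for keep in (
        lambda l: l.url == rec.url and l.text == rec.text,
        lambda l: l.url == rec.url,
        lambda l: l.text == rec.text,
    ):
        hit = _unique([l for l in same_target if keep(l)])
        if hit is not None:
            return hit
    if "?" in rec.url:
        base = rec.url.split("?", 1)[0]
        stored = _params(rec.url)
        # Step 5: candidates agree with the stored URL up to the first "?".
        cands = [l for l in page if l.url.split("?", 1)[0] == base]
        # Step 6: drop candidates whose parameter names differ.
        cands = [l for l in cands if set(_params(l.url)) == set(stored)]
        # Step 7: if some candidate shares a stored value, require it.
        for name, value in stored.items():
            if any(_params(l.url)[name] == value for l in cands):
                cands = [l for l in cands if _params(l.url)[name] == value]
        # Step 8: accept a single surviving candidate.
        hit = _unique(cands)
        if hit is not None:
            return hit
    # Step 9: fall back to whatever sits at the recorded DOM location.
    return at_index

rec = Link("Flights", "http://xyz.com/script?menu=7&sid=abc")
page = [
    Link("Ad", "http://ads.example/1"),
    Link("Hotels", "http://xyz.com/script?menu=3&sid=zzz"),
    Link("Book flights", "http://xyz.com/script?menu=7&sid=zzz"),
    Link("Other", "http://xyz.com/script?menu=7&z=1"),
]
chosen = find_link(page, rec, 1)
```

In this made-up page both the link text and the session id have changed and an ad has shifted the DOM positions, yet the task parameter menu=7 singles out the "Book flights" link, as step 7 intends.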
These heuristics are hard-coded in our current implementation of the WebVCR. In the future, we plan to let users manipulate them, for example, by choosing the order in which they are applied. The robustness of smart bookmarks can be further improved by letting users define their own matching rules for the various steps in smart bookmarks.
Also, it might be possible to skip some steps during replay. For example, if the replay is currently at step i, and it can be determined that a subsequent step j has a well-defined (static) URL, then intermediate steps between i and j can be skipped, thus compressing the bookmark. In Example 1.1, the Find/Book a Flight step can be skipped, and the WebVCR can go directly to the login page. Another example case is searching for cars in Yahoo classifieds, where some of the intermediate form submissions can be skipped as the information entered into forms in preceding steps ends up encoded in the URL, for example: http://classifieds.yahoo.com/yc?ce_mk=&ck=Toyota&za=and&ce_sl=&cc=automobiles&cr=New+York+City&cs=time+2&g=&cf=1 encodes two form submissions, one entering a zip code and one entering the car make Toyota.
Automatically compressing smart bookmarks is made difficult by the fact that the Web server could be maintaining (and updating) some server state during the entire interaction. However, if either no such server state is maintained, or it is not essential for the interaction, reasonable heuristics can be used for compression.
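One simple compression heuristic, valid only under the assumption that no essential server state is built up by the skipped steps, is to start the replay at the last step known to land on a static URL. The sketch below is hypothetical: the tuple layout, the `compress` function, and the example URLs are illustrative, and deciding whether a URL is truly static is left outside the sketch.

```python
from typing import List, Optional, Tuple

# (description of the step, static URL it lands on, or None if dynamic)
Step = Tuple[str, Optional[str]]

def compress(steps: List[Step]) -> List[Step]:
    """Replace everything up to the last static landing page by one load."""
    for i in range(len(steps) - 1, -1, -1):
        static_url = steps[i][1]
        if static_url is not None:
            return [("load", static_url)] + steps[i + 1:]
    return list(steps)  # nothing static: keep the bookmark unchanged

steps: List[Step] = [
    ("open main page", "http://travel.example/"),
    ("click Find/Book a Flight", "http://travel.example/login"),
    ("submit login form", None),
    ("submit itinerary form", None),
]
compressed = compress(steps)
```

This mirrors the Example 1.1 case, where the Find/Book a Flight click is skipped and replay begins directly at the login page.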
HTTP authentication One limitation of solely using the browser for recording smart bookmarks is that some user actions cannot be recorded in the client. For example, it is not possible to detect when HTTP authentication takes place, and since the values entered by the user are not available through the DOM API, such interaction cannot be recorded by the applet. One way to handle this scenario is to have a proxy that intercepts the HTTP authentication messages during recording, so that they can be recorded by the WebVCR applet. During playback, the WebVCR applet can inform the proxy to directly perform the authentication with the destination Web site (without going through the browser). In the current implementation, smart bookmarks that require HTTP authentication can be recorded, but during replay, the WebVCR blocks at the HTTP authentication stage until the user enters the proper values and clicks OK.
State information The HTTP protocol is stateless; however, Web sites usually maintain some kind of state (e.g., session information). This is accomplished in various ways, the common ones being 1) embedding session ids in URLs, 2) encoding session information in hidden variables inside HTML forms, or 3) storing appropriate cookies on the user's machine. The first two cases are handled easily by heuristics mentioned earlier in the paper. The third case presents some interesting problems. If a user records a smart bookmark with cookies turned off, but replays it with cookies turned on (or vice versa), he might see completely different pages during record and replay. The problem is aggravated if smart bookmarks are shared among users. In the general case (if the two sets of pages are radically different), there is not much that can be done. However, there are specific, common cases that can be handled.
Signed applets During both recording and replay of smart bookmarks, WebVCR needs to access and modify the Web pages being navigated. By default, to prevent unauthorized snooping of a user's Web activity, the browser only allows such access to Web pages retrieved from the same domain as the WebVCR applet. Hence, the WebVCR code needs to be digitally signed with a certificate from a trusted third party (e.g., Verisign), and the user needs to explicitly grant the requested privileges before WebVCR can be used.
Users might not mind granting privileges to WebVCR to access Web documents from other domains during record and replay, but might hesitate to grant privileges that allow the WebVCR applet to modify their files (which is required if a user wants to save smart bookmarks). A less convenient mechanism, which doesn't require such a privilege, is to display the recorded bookmark in a browser window, and ask the user to use the browser to save the smart bookmark into an HTML file.
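The default access restriction that motivates signing can be pictured with a simplified check like the one below. This is a sketch under the assumption that two pages count as coming from the same place when their scheme, host, and port match; actual browser rules have more detail, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies where a page came from."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, parts.hostname or "", port)


def same_origin(url_a, url_b):
    """True if unsigned code loaded from url_a could access the page at url_b."""
    return origin(url_a) == origin(url_b)
```

Under this check, an unsigned applet can only touch pages served from its own site, which is why recording navigation on arbitrary sites requires the extra privileges described above.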
Automatic refresh Accessing some sites (e.g., cnn.com) retrieves pages that may be redirected after some time to a different URL (e.g., due to a META tag with an HTTP-EQUIV value of "REFRESH"). While recording a smart bookmark, it is not possible to automatically distinguish between this case and the case where the user simply typed a different URL in the location bar (or pulled one from his bookmark list). However, during replay, the WebVCR must distinguish between these cases, since it must execute a step for the latter but not for the former. In our current implementation, the default is to assume that a refresh took place; if the user wants to create a smart bookmark with disconnected steps, he must explicitly specify so.
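The default policy just described can be sketched as follows. The record structure and the action names are hypothetical and chosen only for illustration; the point is that a URL change is treated as an automatic refresh unless the user has marked it as a disconnected step.

```python
from dataclasses import dataclass


@dataclass
class UrlChange:
    """A URL change observed during recording that no user action explains."""
    url: str
    # Default: assume the page refreshed itself. The user must set this flag
    # explicitly to mark a step that jumps to an unrelated URL.
    disconnected: bool = False


def replay_actions(changes):
    """Decide, for each recorded URL change, what replay should do."""
    actions = []
    for change in changes:
        if change.disconnected:
            # The user typed or picked the URL: replay must load it.
            actions.append(("load", change.url))
        else:
            # The page redirects on its own: replay only waits for it.
            actions.append(("wait_for", change.url))
    return actions
```

For example, a recording that contains one automatic refresh followed by one user-marked disconnected step yields a `wait_for` action followed by a `load` action.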
Smart bookmarks were described in passing in [ABFK99] as a basic building block of a personalization platform. In this paper, we give a detailed description of the idea underlying smart bookmarks, and of the methodology and implementation of a system that provides this functionality.
There is a huge literature on tools and techniques to build wrappers for Web sites (e.g., [SA99,KM98,HGMC+97,Ade98]). However, the main focus of previous work is on extracting information from Web pages. While it is possible to add extraction functionality to the WebVCR, its major emphasis is to automate the retrieval of hard-to-reach pages.
Internet Explorer (IE) version 5 has introduced the Intellisense Technology, a feature built into the browser that records values of certain form elements every time a form is filled out. If the same form is later loaded in the browser, IE displays, under the elements, a list of previously entered values from which the user may choose. Note that not all form elements are supported; for example, the values for elements such as radio buttons or pull-down lists are not recorded.
Recently, there has been a proliferation of personalization and notification services (e.g., [Lia,Cal,Min]). The WebVCR and smart bookmarks can be used to simplify as well as extend these services. For example, Mind-it [Min] is a notification service that allows users to specify pages or sections of pages that they would like monitored. When the pages change, Mind-it alerts the users. However, Mind-it is not able to track hard-to-reach pages (e.g., whose URLs contain session ids, or that are only reachable via a sequence of dynamically generated pages). Similar limitations apply to the other services that we are aware of.
In this paper, we describe the WebVCR system and the techniques it uses to create and replay smart bookmarks -- shortcuts to Web content that require multiple browsing steps to be retrieved. The WebVCR presents a VCR-style interface to record and play browsing steps, requiring no programming by the user. Creating and updating smart bookmarks is a simple process involving only the usual browsing actions. Smart bookmarks may be saved in bookmark lists, or mailed to others like any other bookmark. They offer an easy means to auto-navigate the Web, thus simplifying the retrieval of hard-to-reach Web content. Besides saving users time, smart bookmarks may also be used to create a variety of new e-commerce services.
For future work, we plan to address many of the issues described in Section 4, such as the recording of HTTP authentication steps. In addition, we plan to extend the WebVCR to support editing as well as parameterization of smart bookmarks, for example, to let users create template smart bookmarks and provide different input values at each interaction. Other future extensions include a server-side version to enable the creation of services like Mind-it [Min] and CallTheShots [Cal], as well as Web site testing, and the integration of the WebVCR with existing extraction tools.
- V. Anupam, Y. Breitbart, J. Freire, and B. Kumar.
Personalizing the Web using site descriptions.
In DEXA - Workshop on Internet Data Management (IDM), pages 732-738, 1999.
- B. Adelberg.
NoDoSe - a tool for semi-automatically extracting structured and semi-structured data from text documents.
In Proc. SIGMOD, pages 283-294, 1998.
- D. Atkins et al.
Integrated Web and telephone service creation.
Bell Labs Technical Journal, 2(1):19-35, 1997.
- Amazon affiliate program.
- A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu.
A query language for XML.
In Proc. of WWW, pages 77-91, 1999.
- H. Davulcu, J. Freire, M. Kifer, and I. Ramakrishnan.
A layered architecture for querying dynamic web content.
In Proc. SIGMOD, pages 491-502, 1999.
- D. Flanagan.
- S. Godin and D. Peppers.
Permission Marketing : Turning Strangers into Friends, and Friends into Customers.
Simon & Schuster, 1999.
- J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and E. Aranha.
Extracting semistructured information from the Web.
In Proceedings of the Workshop on Management of Semistructured Data, 1997.
- T. Kistler and H. Marais.
WebL: a programming language for the Web.
In Proc. of WWW, 1998.
- B. Krulwich.
Automating the Internet: agents as user surrogates.
IEEE Internet Computing, July-August 1997.
- Liaison Technology. http://www.liaison.com/.
- Mind-it. http://www.netmind.com/.
- A. Sahuguet and F. Azavant.
Building light-weight wrappers for legacy Web data-sources using W4F.
In Proc. of VLDB, pages 738-741, 1999.
- VoiceXML Forum. http://www.voicexml.org/.
Vinod Anupam is a Research Scientist in the Systems and Software Research Center at Bell Labs, Lucent Technologies, in Murray Hill, NJ. He received a Ph.D. in Computer Sciences from Purdue University, USA in 1994, and a Bachelor's in Computer Science from Birla Institute of Technology and Science, India in 1988. His research interests include Collaborative Computing (specifically synchronous and asynchronous multi-user Web-based interaction), Internet and Web security, and Electronic Commerce.
Juliana Freire is a Member of the Technical Staff in the Database Systems Research Department at Bell Laboratories, Lucent Technologies. She received a BS from Federal University of Ceará (Brazil), and MS and Ph.D. from the University at Stony Brook, all in Computer Science. Her early research focussed on optimizing evaluation of Datalog programs, and recently she has been working on various issues related to integrating and querying heterogeneous data sources, such as the ones found in the Web.
Bharat Kumar is a Member of the Technical Staff at Bell Laboratories, Lucent Technologies. His research interests are in querying and integrating information from Web sources having limited query capabilities. He is also working on tools and techniques to support easy specification and efficient execution of CRM (Customer Relationship Management) treatments for customer contacts over different media. He received a B. Tech. in Computer and Information Science from the Indian Institute of Technology, Delhi (India), and MS and Ph.D. degrees in Computer Science from The Ohio State University.
Daniel F. Lieuwen is a Member of Technical Staff in the Database Systems Research Department at Bell Laboratories, a division of Lucent Technologies. Lieuwen attended Calvin College, Grand Rapids, Michigan, where he studied mathematics and computer science (and a fair bit of German). He received the M.S. and Ph.D. degrees in Computer Science from the University of Wisconsin-Madison. He joined Bell Laboratories in 1992. His early research foci were object-oriented databases (particularly Ode), main-memory databases, and active databases. More recently, he has been working on topics related to materialized views, Directory Enabled Networks, and the Internet.
- The number at the right-hand side of the URL is a session id.
- Issues regarding implementation of WebVCR using Microsoft Internet Explorer are discussed in Section 4.
- Smart bookmarks can be made available through unique URLs generated by the server.
- Some notification services such as Mind-it [Min] use this technique.
- Note that the UniversalFileAccess requirement can be relaxed if the user is willing to have the HTML dumped to the browser, and then saved to file from the browser itself.
- For example, if the document is to be displayed in a particular frame, the target specifies the frame.
- The decryption key can be entered once for each WebVCR session.