In this subsection, we examine five techniques of gathering session information required for the decision engine, and discuss their pros and cons. A summary is presented in Figure 6. It is expected that a combination of these techniques will be required in a DFP toolkit, if it is to be deployed to support personalization for a broad variety of web sites. After presenting the techniques we make some general remarks.
Content generation scripts send high-level semantics to decision engine. Assuming that all pages that need to be tracked are generated via executable scripts/programs (a reasonable assumption for large sites), an obvious approach to obtaining meaningful semantic information is to create or modify these scripts so that they gather or create the desired information and then pass it on to the decision engine. The primary advantage of this approach is that the people developing the web pages have the best idea of the intended semantics of the pages, and thus of what the decision engine should receive. For this reason, we expect this to be the approach of choice when creating new web sites. Further advantages are that the actual HTTP requests and responses do not need to be transformed or parsed, and that HTTPS connections can be handled. The primary disadvantage concerns legacy sites, where modifying all the existing scripts to generate the high-level semantic data would be quite expensive. Another disadvantage is that maintenance of the site becomes more cumbersome.
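As an illustration of this first technique, the following is a minimal sketch, not taken from any actual DFP toolkit. It assumes a hypothetical `SemanticEvent` record and a caller-supplied `send` function standing in for whatever transport reaches the decision engine; the page type and attribute names are likewise invented for the example.

```python
import json
from dataclasses import dataclass, field, asdict
from html import escape


@dataclass
class SemanticEvent:
    """Hypothetical high-level description of one page view."""
    session_id: str
    page_type: str  # chosen by the page author, e.g. "product_detail"
    attributes: dict = field(default_factory=dict)


def encode_event(event):
    """Serialize the event for transmission to the decision engine."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def render_product_page(session_id, product, send):
    """Content generation script: report page semantics, then build the page.

    `send` is any callable that delivers bytes to the decision engine
    (a socket write, an HTTP POST, a queue put). It is an assumption here.
    """
    event = SemanticEvent(
        session_id,
        "product_detail",
        {"sku": product["sku"], "price": product["price"]},
    )
    send(encode_event(event))
    return "<html><body><h1>{}</h1></body></html>".format(escape(product["name"]))
```

Because the script itself states what the page means, nothing downstream has to parse the generated HTML.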
So how can the DFP approach be used in large legacy web sites? In such cases, the only solution might be to try to extract meaningful information from the raw HTTP requests/responses. There are various ways to do so, some of which we discuss below.
Content generation scripts send raw HTML to decision engine. This is a variation of the approach mentioned earlier; in this case, however, only the raw HTTP requests/responses are forwarded by the scripts. This can be done by injecting the same (small) block of code into the scripts that generate each and every page of the web site. The advantage is that converting a legacy site to this approach is straightforward (assuming that it was implemented with a server-side scripting language such as JSP or ASP), since the only function the extra piece of code performs is to forward appropriate data (HTTP request and/or response) to the decision engine. However, since the injected code is uniform and generic, it cannot extract high-level semantic information from each page. This means that detailed knowledge of what information to extract for specific (categories of) pages, and how to extract it, needs to be built either into the decision engine or into some other process. Depending on the level of information to be extracted, this can cause maintenance problems whenever the structure of the corresponding pages changes. Moreover, this approach is sensitive to the language and/or platform in which the web site is implemented; e.g., if the CGI scripts are written in C++, it may be hard to know where to inject the code block, making the approach infeasible.
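The uniform injected block can be sketched as follows. This is only an analogy in Python, where a decorator plays the role of the code pasted into every script; the names `forward_raw` and `send`, and the choice to swallow forwarding errors, are assumptions made for the example.

```python
import functools


def forward_raw(send):
    """Return a decorator that forwards raw request/response pairs via `send`.

    `send` is a hypothetical callable that delivers data to the decision
    engine. The wrapper is identical for every page, so it carries no
    page-specific knowledge.
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapped(request):
            response = handler(request)
            try:
                send({"request": request, "response": response})
            except Exception:
                # Assumption: a monitoring failure must not break page delivery.
                pass
            return response
        return wrapped
    return decorator
```

Any semantic interpretation of the forwarded pair has to happen on the receiving side, which is where the maintenance burden described above arises.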
Proxies. A proxy can be inserted between a company's web site and the end user. The proxy is responsible for tracking user requests, extracting the site responses, and contacting the decision server to determine the appropriate intervention strategy. An advantage is that HTML page transformation is not required. However, there are several disadvantages to this approach. First, if SSL tunneling is being used, then the proxy will need to serve as the receiving end of the tunnel, and will need to perform encryption/decryption of the web traffic. Moreover, it would also need to extract higher-level semantic information from the HTML. Lastly, the use of one or more proxies may have an impact on scalability, because the proxy servers can become a bottleneck. It will be important to have enough proxies to cover the anticipated load on the web site.
Web Server Extensions. Most popular web servers (Apache, Netscape Enterprise, Microsoft IIS) have an API (Apache modules, Netscape's NSAPI, Microsoft's ISAPI) that can be used to extend the functionality provided by the server. In particular, these can be used to attach monitoring hooks to the web server itself, thus gaining low-level access to all web interactions. The advantage is that no transformation of the generated HTML response is required, and secure connections can be handled. The disadvantage of needing to extract higher-level semantic information from HTML responses still applies. Moreover, writing server extensions is tricky (since they should not impact reliability or scalability) and server-specific.
We conclude this subsection with some general remarks about these techniques and our experience with two of them.
We first consider session tracking. Three techniques are commonly used for tracking a session in web sites: encoding the session ID into the URLs sent and requested by the customer, placing cookies on the customer machine, and placing the session ID into a hidden form field. (The last technique requires that all pages transmitted to the user are generated via form submissions.) In order for the decision engine to know which session a page request belongs to, the session ID must be passed to the engine along with other page information. The session ID can be sent explicitly, or it can be sent as it occurs in the HTML of the requested page, in which case the encoding scheme used by the web site can be used to extract the session ID.
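The three session-tracking schemes can be illustrated with a short sketch that recovers the session ID from each carrier. The key name `sid` is a hypothetical choice; a real site would use whatever name its own encoding scheme defines.

```python
from html.parser import HTMLParser
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlsplit

SESSION_KEY = "sid"  # hypothetical parameter / cookie / field name


def from_url(url):
    """Session ID encoded as a query parameter of the requested URL."""
    values = parse_qs(urlsplit(url).query).get(SESSION_KEY)
    return values[0] if values else None


def from_cookie(cookie_header):
    """Session ID carried in the Cookie request header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get(SESSION_KEY)
    return morsel.value if morsel is not None else None


class _HiddenFieldFinder(HTMLParser):
    """Record the value of the first hidden <input> named SESSION_KEY."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            self.value is None
            and tag == "input"
            and (attributes.get("type") or "").lower() == "hidden"
            and attributes.get("name") == SESSION_KEY
        ):
            self.value = attributes.get("value")


def from_hidden_field(page_html):
    """Session ID embedded in a hidden form field of the page."""
    finder = _HiddenFieldFinder()
    finder.feed(page_html)
    return finder.value
```

Whichever function applies, its result is what accompanies the page information sent to the decision engine.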
We now turn to the issue of scalability. In particular, how do the above techniques work when a web site is supported by a web server farm rather than a single web server? There are two main issues. First, in the case of a web server farm there may also need to be a farm of decision engines. Because the log of a given customer session will generally be maintained in the main memory of a single decision engine, it will be important that all decisions about that session be made by the same decision engine, even if different web servers are being used to serve the pages. This can be accomplished by encoding the decision engine ID inside the session ID. A load-balancing strategy can be implemented to distribute customer sessions across the decision engines. Furthermore, in applications such as MIHU, if all of the decision engines reach saturation then the system can decide for some customers that they will not receive any MIHU decisions. This permits a graceful degradation of service in the face of unexpectedly high load.
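A minimal sketch of this routing idea follows. It assumes a hypothetical session ID layout of the form `<engine id>.<random token>`, a least-loaded assignment policy, and a reserved marker `none` for sessions that receive no decisions because every engine is saturated. None of these details come from the text; they only make the mechanism concrete.

```python
import uuid

SEP = "."           # assumed separator; engine IDs must not contain it
NO_ENGINE = "none"  # assumed marker for sessions that get no decisions


class EnginePool:
    """Assign new sessions to decision engines, tracking per-engine load."""

    def __init__(self, capacities):
        # capacities maps engine ID -> maximum number of concurrent sessions
        self.capacities = dict(capacities)
        self.active = {engine: 0 for engine in self.capacities}

    def new_session(self):
        """Return a session ID that embeds the chosen engine's ID."""
        token = uuid.uuid4().hex
        candidates = [
            e for e in self.capacities if self.active[e] < self.capacities[e]
        ]
        if not candidates:
            # Graceful degradation: the session is served without decisions.
            return NO_ENGINE + SEP + token
        # Least-loaded policy; ties go to the first engine listed.
        engine = min(candidates, key=lambda e: self.active[e] / self.capacities[e])
        self.active[engine] += 1
        return engine + SEP + token


def engine_for(session_id):
    """Recover the engine ID from a session ID, or None if unassigned."""
    engine, _, _ = session_id.partition(SEP)
    return None if engine == NO_ENGINE else engine
```

Any web server in the farm can call `engine_for` on an incoming session ID and forward page information to the same engine, without shared state between web servers.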
The second scalability issue concerns how the added expense of transmitting information from web server to decision engine will impact performance. In all cases except for proxies, the processing involved in transmitting to the decision engine can be performed on the web server. Thus, each server will be more loaded, but no architectural problems arise. In the case of proxies (and wrapper scripts if they are implemented on separate machines) there is a possibility of the proxy becoming a bottleneck.
Importantly, any of the monitoring methods presented above can be phased into the site: e.g., initially the tracking can focus only on part of the site, and only on part of the relevant data.
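Phasing in can be as simple as a filter placed in front of the transmission step. The sketch below assumes events are plain dictionaries with a `path` key; the prefixes and field names are hypothetical and would be widened as the deployment grows.

```python
# Hypothetical initial scope: two site areas and two fields.
TRACKED_PREFIXES = ("/catalog/", "/cart/")
TRACKED_FIELDS = ("session_id", "path")


def filter_event(event):
    """Return the reduced event to transmit, or None if the page is untracked."""
    if not event["path"].startswith(TRACKED_PREFIXES):
        return None
    return {key: event[key] for key in TRACKED_FIELDS if key in event}
```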
Rick Hull