A Client-Aware Dispatching Algorithm for Web Clusters Providing Multiple Services
Copyright is held by the author/owner.
WWW10, May 1-5, 2001, Hong Kong.
Keywords: Load balancing, Dispatching algorithms, Clusters
Geographically distributed Web server systems [21,10] and proxy server systems aim to decrease user latency time through network access redistribution and reduction of the amount of data transferred, respectively. In this paper we consider a different Web system, namely the Web cluster, which uses a tightly coupled distributed architecture. From the user's point of view, any HTTP request to a Web cluster is presented to a logical (front-end) server that acts as a representative for the Web site. This component, called the Web switch, retains the transparency of the parallel architecture for the user, guarantees backward compatibility with Internet protocols and standards, and distributes all client requests to the Web and back-end servers. Cluster architectures with Web switch dispatcher(s) have been adopted, with different solutions, in various academic and commercial Web clusters, e.g. [2,7,12,17,20]. Valuable recent surveys are in [23,25].
One of the main operational aspects of any distributed system is the availability of a mechanism that shares the load over the server nodes. Numerous global scheduling^1 algorithms have been proposed for multi-node architectures executing parallel or distributed applications. Unlike geographically distributed Web sites, where the dispatching role is taken by system components (e.g., the DNS) that have only limited control on client requests [14,17], the Web cluster with a single Web switch controlling all the workload is a very robust architecture for fronting Web arrivals, which tend to occur in waves with intervals of heavy peaks. The motivations for a new dispatching policy come from two main considerations on Web service distributions. First, the service time of HTTP requests may have very large or infinite variance even for traditional Web publishing sites. Second, the heterogeneity and complexity of the services and applications provided by Web sites are continuously increasing: traditional sites with mostly static content are being integrated with recent Web commerce and transactional sites combining dynamic and secure services. Web switch policies that dispatch requests in a highly heterogeneous and dynamic system by taking server states into account require expensive and hard-to-tune mechanisms for monitoring and evaluating the load on each server, gathering the results, combining them, and taking real-time decisions. For these reasons, we propose a client-aware policy (CAP) that, in its pure version, takes dynamic decisions by looking only at client requests instead of server states. CAP partitions the Web cluster services into broad classes on the basis of the expected impact that each request may have on the main server components, that is, the network interface, CPU, and disk. It then schedules the client load by taking into account the service class to which each request belongs.
Under workload characteristics that resemble those experienced by real Web sites, we demonstrate through simulation and prototype experiments that this simple client-aware policy is much more effective than state-of-the-art dynamic algorithms when applied to Web clusters providing heterogeneous services such as static, dynamic, and secure information. By using CAP, the 90-percentile of page latency time can be half that of commonly used dynamic Web switch policies, such as Weighted Round-Robin and LARD. Moreover, CAP guarantees stable results for a wide range of Web sites because it does not need the hard tuning of parameters for each type of Web site that most server-aware policies require.
The remainder of this paper is organized as follows. In Section 2, we outline the typical architecture of a Web cluster with a focus on Web switches. In Section 3, we discuss some related work on global scheduling algorithms for the Web switch and propose the CAP policy. In Section 4, we present a detailed simulation model for the Web cluster and discuss the results for various classes of Web sites. In Section 5, we describe a prototype layer-7 Web switch that implements CAP and other policies, and analyze experimental results in a controlled environment. In Section 6, we give our final remarks.

Various academic and commercial products confirm the increasing interest in these distributed Web architectures. In the IBM TCP router, all HTTP requests reach the Web switch, which distributes them by modifying the destination IP address of each incoming packet: the Web switch replaces its own IP address with the private address of the selected Web server. Magicrouter, Distributed Packet Rewriting, and Cisco LocalDirector are other Web cluster architectures relying on a Web switch that receives the totality of client requests. In particular, Magicrouter is a mechanism for fast packet interposing in which a user-level process acting as a switchboard intercepts network packets and modifies them by changing address and checksum fields. Cisco LocalDirector rewrites the IP header information of each incoming packet according to a dynamic table mapping each session to the server to which it has been redirected. Unlike the TCP router, which modifies only the client-to-server packets and lets the servers modify outgoing IP packets, the Magicrouter and LocalDirector Web switches can be defined as gateways because they intervene even on server-to-client packets.
An evolution of the TCP router architecture is represented by the IBM Network Dispatcher, which does not require modification of the packet addresses because packet forwarding to the cluster nodes is done at the MAC address level. A different forwarding approach to configuring a Web system with multiple servers uses the if-config-alias option, which is available on most UNIX platforms. This architecture publicizes the same secondary IP address of all Web servers as the single virtual IP address, namely ONE-IP, of the Web cluster. This is achieved by letting the servers of the cluster share the same IP address as their secondary address, which is used only for the request distribution service.

Layer-4 Web switches work at the TCP/IP level. Since packets pertaining to the same TCP connection must be assigned to the same server node, the Web switch has to maintain a binding table to associate each client TCP session with the target server. The switch examines the header of each inbound packet and, on the basis of the flag field, determines whether the packet pertains to a new or an existing connection. Layer-4 Web switch algorithms are content information blind, because they choose the target server when the client establishes the TCP/IP connection, before the HTTP request is sent out. Global scheduling algorithms executable at a layer-4 Web switch range from static algorithms (say, random, round-robin) to dynamic algorithms that take into account either client network information (say, client IP address, TCP port), or server state information (say, number of active connections, least loaded server), or a combination of both. Layer-7 Web switches can instead establish a complete TCP connection with the client and inspect the HTTP request content before deciding about dispatching. In such a way, they can deploy content-aware distribution, by letting the Web switch examine the HTTP request and then route it to the target server.
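A minimal sketch of such a layer-4 binding table follows, assuming a random content-blind choice at connection setup; all names and the simplified flag handling are illustrative, not a description of any specific product:

```python
import random

class Layer4Switch:
    """Content-blind dispatching sketch: bind each TCP connection to a server.

    The target is chosen when the SYN arrives, before any HTTP data is seen;
    later packets of the same connection are looked up in the binding table.
    """
    def __init__(self, servers):
        self.servers = servers
        self.bindings = {}            # (client_ip, client_port) -> server

    def route(self, client_ip, client_port, syn=False, fin=False):
        key = (client_ip, client_port)
        if syn:                       # new connection: content-blind choice
            self.bindings[key] = random.choice(self.servers)
        server = self.bindings[key]
        if fin:                       # connection teardown: drop the binding
            del self.bindings[key]
        return server
```

A dynamic layer-4 policy would replace `random.choice` with, say, a least-connections rule; the binding table itself is unchanged.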
The selection mechanism (usually referred to as delayed binding) can be based on the Web service/content requested, such as the URL content, SSL identifiers, and cookies. Many techniques exist to realize dispatching at the granularity of the session or of the single Web object request. Scheduling algorithms deployed at layer-7 may use either client information (such as session identifiers, file type, file size) or a combination of client and server state information. The potential advantages of layer-7 Web switches include the possibility of using specialized Web server nodes and partitioning the Web content among clusters of heterogeneous servers, and of achieving higher cache hit rates, for example through affinity-based scheduling algorithms such as the LARD policy. On the other hand, layer-7 routing introduces additional processing overhead at the Web switch and may cause this entity to become the system bottleneck. To overcome this drawback, design alternatives for scalable Web server systems that combine content-blind and content-aware request distribution have been proposed in [6,26]. These architectural solutions are outside the scope of this paper, which focuses on the dispatching algorithms for Web switches.
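As an illustration of content-aware dispatching, a simplified locality-aware (LARD-style) assignment could look like the following sketch; the threshold handling is condensed and the parameter values are assumptions, not the ones from the LARD paper:

```python
def lard_dispatch(target, assignment, load, t_low, t_high, servers):
    """Simplified locality-aware dispatch: keep a target object on its
    current node unless that node is highly loaded; t_low and t_high are
    illustrative thresholds for lowly/highly loaded conditions."""
    node = assignment.get(target)
    if node is None or load[node] >= t_high:
        # (Re)assign: prefer a lowly loaded node, else the least loaded one.
        candidates = [s for s in servers if load[s] < t_low]
        node = min(candidates or servers, key=lambda s: load[s])
        assignment[target] = node
    load[node] += 1                   # one more active request on that node
    return node
```

Because repeated requests for the same `target` land on the same node, that node's cache is likely to hold the object, which is the source of LARD's cache-hit advantage discussed above.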
Server-aware algorithms route requests on the basis of some server state information, such as load condition, latency time, availability, or network utilization. Client-aware algorithms route requests on the basis of some client information: layer-4 Web switches can use only basic client network information, such as the IP address and TCP port, whereas layer-7 Web switches can examine the entire HTTP request and take decisions on the basis of detailed information about the content of the client request. Client- and server-aware algorithms route requests on the basis of both client and server state information. Actually, most of the existing client-aware algorithms belong to this last class: although the most important information is the client request, these policies combine it with some information about the server loads, with the main goal of avoiding assignments to overloaded servers. The Web switch cannot use highly sophisticated algorithms because it has to take fast decisions for dozens or hundreds of requests per second. To prevent the Web switch from becoming the primary bottleneck of the Web cluster, static algorithms are the fastest solution because they do not rely on the current state of the system at decision time; for this very reason, however, they can make poor assignment decisions. Dynamic algorithms have the potential to outperform static algorithms by using some state information to guide dispatching, but they require mechanisms that collect and analyze that information, thereby incurring potentially expensive overheads.
In this paper, we consider three widely used dispatching policies that are based on client and/or server information: Weighted Round Robin (WRR), Locality Aware Request Distribution (LARD), and StaticPartitioning. WRR has proved to be the layer-4 policy that guarantees the best load sharing in most simulations and experiments from several research groups. We do not expect LARD to work well in a site providing heterogeneous services, but we have chosen it because we are not aware of other layer-7 dispatching algorithms proposed by the research community. StaticPartitioning uses dedicated servers for specific services or multiple Web sites (co-location); it is the most representative example of a client-aware algorithm working at layer-7 in commercial Web switches [1,19].
WRR is a variation of the round-robin policy. It associates with each server a dynamically evaluated weight that is proportional to the server load state. Periodically, the Web switch gathers this load information from the servers and recomputes the weights. WRR is actually a class of dynamic policies that use some information about the system state. The first issue to address with any server state aware policy is how to compute the load state information, because it is not immediately available at the Web switch. The three main factors that affect the latency time are the loads on the CPU, disk, and network resources. Typical load measures are the number of active processes on the server, the mean disk response time, and the hit latency time, that is, the mean time spent by each request at the server. In particular, the load indexes we consider are the number of active processes at each server (WRR_num policy) and the mean service time of the requests (WRR_time policy). Additional information on WRR can be found in the literature.
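As a sketch of the mechanism, the following toy dispatcher recomputes weights from the number of active processes (the WRR_num load index). The weight formula and constants are illustrative assumptions, not the ones used in the paper:

```python
class WeightedRoundRobin:
    """WRR sketch: weights are recomputed periodically from a server load
    index (here, the number of active processes, as in the WRR_num variant)."""
    def __init__(self, servers):
        self.servers = servers
        self.weights = {s: 1 for s in servers}
        self.credits = dict(self.weights)

    def update_weights(self, active_processes):
        # Less loaded servers get larger weights; +1 avoids division by zero.
        # The scale factor 10 is an arbitrary illustrative choice.
        for s in self.servers:
            self.weights[s] = max(1, round(10 / (active_processes[s] + 1)))
        self.credits = dict(self.weights)

    def dispatch(self):
        # Serve each server as many times as its weight, then refill credits.
        for s in self.servers:
            if self.credits[s] > 0:
                self.credits[s] -= 1
                return s
        self.credits = dict(self.weights)
        return self.dispatch()
```

Between two calls to `update_weights`, the switch keeps dispatching with stale weights, which is exactly why the length of the gathering interval matters so much in the experiments of Section 4.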
If we consider Web clusters of homogeneous servers, the main goal of the proposed policies is to increase disk cache hit rates, for example through the LARD policy or other affinity-based scheduling algorithms [26,29]. The LARD policy is a content-based request distribution that aims to improve the cache hit rate at the Web cluster nodes. The principle of LARD is to direct all requests for a given Web object to the same server node. This increases the likelihood of finding the requested object in the disk cache of that node. We use the LARD version with the multiple hand-off mechanism that works for the HTTP/1.1 protocol. LARD assigns all requests for a target file to the same node until that node reaches a certain utilization threshold. At this point, the request is assigned to a lowly loaded node, if one exists, or to the least loaded node. To this purpose, LARD defines two threshold parameters: one denoting the upper bound of a lowly loaded condition, and one denoting the lower bound of a highly loaded condition.

The simulation experiments will confirm the intuition that the StaticPartitioning policy, although useful from the system management point of view, achieves poor server utilization because resources that are not utilized cannot be shared among all clients. To motivate the CAP policy, let us classify Web services into four main categories.
- Web publishing: sites providing static information (e.g., HTML pages with some embedded objects) and dynamic services that do not intensively use server resources (e.g., result or product display requests). The content of a dynamic request is not known at the instant of the request; however, it is generated from database queries whose arguments are known beforehand.
- Web transaction: sites providing dynamic content generated from (possibly complex) database queries built from user data provided through an HTML form. This is a disk-bound service because it makes intensive use of disk resources.
- Web commerce: sites providing static, dynamic, and secure information. For security reasons, some dynamically generated content may need a secure channel, which in most cases is provided by the SSL protocol. Cryptography makes intensive use of CPU resources; hence, Web commerce services are disk and/or CPU bound.
- Web multimedia: sites providing streaming audio and video services. In this paper, we do not consider this type of application, which is often implemented through specialized servers and network connections.
Suppose that server A has already received one request of type CB and one of type DCB, while server B has received one request of type N and one of type DB. The sequence of subsequent requests to the Web cluster is shown in Figure 2. Under the CAP assignment, servers A and B receive a similar number of requests for each class of service, while this does not happen under RR or LARD. For example, in the case of RR, server A receives four intensive requests that stress the CPU and/or disk, while server B receives only one CPU-bound request. In the case of LARD, we suppose that the requests of type DB and CB are assigned to server A and those of the other types to server B. This dispatching results in server A receiving two CPU-bound and two disk-bound requests, while server B receives only one request of type DCB.
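The per-class assignment that produces this balance can be sketched as follows; the URL-suffix classification rules are illustrative assumptions, not the ones used in the prototype:

```python
from itertools import cycle

# Service classes from the example above: N (network-bound static content),
# CB (CPU bound), DB (disk bound), DCB (disk and CPU bound).
def classify(url):
    """Map a URL to a broad service class; suffix rules are hypothetical."""
    if url.endswith(".cgi.ssl"):
        return "DCB"                  # secure dynamic request
    if url.endswith(".ssl"):
        return "CB"                   # secure static request
    if url.endswith(".cgi"):
        return "DB"                   # database-backed dynamic request
    return "N"                        # plain static content

class CapSwitch:
    """CAP sketch: one round-robin assignment per service class, so every
    server receives a similar mix of light and resource-intensive requests."""
    def __init__(self, servers):
        self.rr = {c: cycle(servers) for c in ("N", "CB", "DB", "DCB")}

    def dispatch(self, url):
        return next(self.rr[classify(url)])
```

Note that the switch keeps no server state at all: the only dynamic information is the position of each per-class round-robin pointer.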
CAP does not require the hard tuning of parameters that is typical of most dynamic policies, because the service classes are decided in advance and the scheduling choice is determined statically once the URL has been classified.

As the focus is on Web cluster performance, we did not model the details of the external network. To prevent the bridge(s) to the external network from becoming a potential bottleneck for the Web cluster throughput, we assume that the system is connected to the Internet through one or more large-bandwidth links that do not use the same Web switch connection to the Internet.
Each server in the cluster is modeled as a separate component with its own CPU, central memory, hard disk, and network interface. All of these components are CSIM processes with their own queuing systems, which allow requests to wait if the server, disk, or network is busy. We use real parameters to set up the system. For example, the disk is parameterized with the values of a real fast disk (IBM Deskstar 34GXP): a transfer rate of 20 MBps, a controller delay of 0.05 msec, a seek time of 9 msec, and a rotation speed of 7200 RPM. The main memory transfer rate is set to 100 MBps. The network interface is a 100 Mbps Ethernet card. The back-end servers are replicated database servers that reply to dynamic requests. The Web server software is modeled as an Apache-like server, where an HTTP daemon waits for client connection requests. As required by the HTTP/1.1 protocol, each HTTP process serves all the files specified in a Web request.
The client-server interactions are modeled at the level of TCP connections, including packets and ACK signals. Each client is a CSIM process that, after activation, enters the system and issues its first connection request to the Web switch of the Web cluster. The entire period of connection to the site, namely the Web session, consists of one or more page requests. At each request, the Web switch applies the routing algorithm and assigns the connection to a server, which dedicates a new HTTP process to that connection. Each client request is for a single HTML page that typically contains a number of embedded objects, and it may include some computation or database search, and possibly a secure connection. The client submits a new request only after it has received the complete answer, that is, the HTML page and all (possible) embedded objects. A user think time between two page requests models the time required to analyze the requested object and decide about a new request.
It is worth observing that, for a fair comparison of the Web switch algorithms, our models account for the difference in overhead between dispatching requests at layer-4 and at layer-7 Web switches; for example, the delays introduced by layer-7 switching are modeled with values taken from the literature.

The workload model incorporates the heavy-tailed distributions observed in real Web traces [4,8,9,15]. Random variables generated by these distributions can assume extremely large values with non-negligible probability.
- The number of page requests per client session, that is, the number of consecutive Web requests a user submits to the Web site, is modeled according to an inverse Gaussian distribution.
- The time between the retrieval of two successive Web pages by the same client, namely the user think time, is modeled through a Pareto distribution.
- The number of embedded objects per page request, including the base HTML page, is also obtained from a Pareto distribution [9,22].
- The inter-arrival time of hit requests, that is, the time between the retrieval of two successive hit requests from the servers, is modeled by a heavy-tailed Weibull distribution.
- The distribution of the file sizes requested from a Web server is a hybrid function, whose body is modeled according to a lognormal distribution and whose tail according to a heavy-tailed Pareto distribution.

A summary of the distributions and parameters used in our experiments is given in Table 1.
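As an illustration, the distributions above can be sampled with the standard-library generators (the inverse Gaussian, which the stdlib lacks, is implemented by hand via the Michael-Schucany-Haas transformation). All parameter values below are illustrative placeholders, not the ones used in the paper:

```python
import math
import random

rng = random.Random(42)

def inverse_gaussian(mu, lam):
    """Sample from an inverse Gaussian (Wald) distribution using the
    Michael-Schucany-Haas transformation."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = mu + (mu * mu * y) / (2 * lam) - (mu / (2 * lam)) * math.sqrt(
        4 * mu * lam * y + (mu * y) ** 2)
    return x if rng.random() <= mu / (mu + x) else mu * mu / x

def sample_session():
    """Draw one synthetic Web session; every parameter is a placeholder."""
    pages = max(1, round(inverse_gaussian(mu=4.0, lam=9.0)))
    think_time = rng.paretovariate(1.4)               # user think time (s)
    objects = max(1, round(rng.paretovariate(1.2)))   # objects per page
    file_size = rng.lognormvariate(8.0, 1.0)          # body of size dist. (B)
    inter_arrival = rng.weibullvariate(1.0, 0.5)      # hit inter-arrival (s)
    return pages, think_time, objects, file_size, inter_arrival
```

The Pareto draws are the ones that produce the occasional extremely large values mentioned above, which is what makes per-request service times so variable.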
To characterize the different Web services classified as in Section 3.2, we have also modeled the impact of a secure channel on server performance (that is, the presence of CPU-bound requests) and the impact of intensive database queries (that is, disk-bound requests). Our model includes all the main CPU and transmission overheads due to SSL protocol interactions, such as key material negotiation, server authentication, and encryption and decryption of key material and Web information. The CPU service time consists of the encryption of the server secret key with a public key algorithm such as RSA, the computation of the Message Authentication Code through a hash function such as MD5 or SHA, and data encryption through a symmetric key algorithm such as DES or Triple-DES. Most of the CPU overhead is caused by data encryption (for large files) and by the public key algorithm (RSA), which is required at least once per client session, when the client has to authenticate the server. The transmission overhead is due to the server certificate (2048 bytes) sent by the server to the client, the server hello and close messages (73 bytes), and the SSL record header (about 29 bytes per record). Table 2 summarizes the throughput of the encryption algorithms used in the secure workload model.
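A rough cost model in this spirit charges one RSA operation per new session plus throughput-limited MAC and encryption work over the payload; all the constants below are illustrative assumptions, not the values of Table 2:

```python
# Illustrative per-operation costs (assumptions, not the paper's Table 2):
RSA_MS = 30.0          # one RSA operation per new session (ms)
MD5_MBPS = 200.0       # MAC computation throughput (MB/s)
DES_MBPS = 50.0        # symmetric encryption throughput (MB/s)

def ssl_cpu_time_ms(file_size_bytes, new_session):
    """Rough CPU cost of serving one file over SSL: an RSA operation for
    new sessions, plus MAC and encryption work proportional to the payload."""
    mb = file_size_bytes / 1e6
    cost = (mb / MD5_MBPS + mb / DES_MBPS) * 1000.0   # per-byte work, in ms
    if new_session:
        cost += RSA_MS                                 # once per session
    return cost
```

Even with these placeholder numbers, the model reflects the observation above: for small files the per-session RSA cost dominates, while for large files the symmetric encryption term takes over.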
We compare the performance of different scheduling policies for Web clusters under three main classes of workload.
- Web publishing: the site contains static and lightly dynamic documents. A static document resides on the disk of the Web server; it is not modified over a relatively long time interval and is always cacheable. The cache of each node is set to 15% of the total size of the Web site document tree. A lightly dynamic document is cacheable with probability 0.3.
- Web transaction: the site contains 60% static documents and 40% dynamically created documents. Database queries to the back-end servers require intensive disk use, and their results are not cacheable.
- Web commerce: the site contains 30% static requests, 30% lightly dynamic requests, and various combinations for the remaining 40% of the requests.
The most complex policy to tune is WRR, which is sensitive to the adopted load metric and to the interval chosen for gathering server state information. To show how difficult it is to optimally tune the parameters of some dynamic policies, in Figure 3 we show the sensitivity of WRR_num and WRR_time to the gathering interval, for an almost static workload (Web publishing) and a highly variable workload (Web commerce). As a performance metric, we use the 90-percentile of page latency time, that is, the page latency limit that the Web site guarantees with 0.9 probability. The gathering interval has a strong influence on performance, especially for sites with heterogeneous services. If not well tuned, a dynamic policy such as WRR can behave worse than static policies such as RAN and RR. In both instances, the number of active connections (WRR_num) seems to be the best load metric, and short gathering intervals seem preferable to longer ones. In the remainder of the paper, we use the number of active processes as the load metric and assume that we are always able to choose the best parameters for the WRR policy; we refer to it simply as WRR.
In the simulations, we consider an ideal StaticPartitioning algorithm that, for any workload scenario, is able to partition the dedicated servers proportionally to the arrival percentage of requests for each class of services.
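Such an ideal proportional partitioning can be sketched as a largest-remainder allocation; the class names and arrival shares below are illustrative:

```python
def partition_servers(n_servers, arrival_share):
    """Allocate dedicated servers proportionally to each service class's
    share of arrivals, using largest-remainder rounding (idealized sketch)."""
    quotas = {c: n_servers * p for c, p in arrival_share.items()}
    alloc = {c: int(q) for c, q in quotas.items()}
    left = n_servers - sum(alloc.values())
    # Hand out any remaining servers to the largest fractional remainders.
    ordered = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in ordered[:left]:
        alloc[c] += 1
    return alloc
```

The weakness noted above is visible in the sketch itself: once a server is allocated to one class, it sits idle during bursts of another class, because the partition is fixed regardless of the instantaneous load.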
In the next experiments, we consider only requests for lightly dynamic documents, which have a lower probability of being found in the disk cache. Figure 4(b) shows that CAP and LARD behave similarly, even if LARD still performs slightly better. However, comparing this figure with Figure 4(a), we can observe that, while CAP and WRR maintain similar results, LARD is heavily penalized by the lower cache hit rate.
- Alteon WebSystems, Alteon 780 Series, in www.alteonwebsystems.com/products/
- E. Anderson, D. Patterson, E. Brewer, ``The Magicrouter, an application of fast packet interposing'', unpublished Tech. Rep., Computer Science Department, University of Berkeley, May 1996.
- Apache docs., in www.apache.org/docs/mod/
- M.F. Arlitt, C.L. Williamson, ``Internet Web servers: Workload characterization and performance implications'', IEEE/ACM Trans. on Networking, vol. 5, no. 5, Oct. 1997, pp. 631-645.
- M. Aron, P. Druschel, W. Zwaenepoel, ``Efficient support for P-HTTP in cluster-based Web servers'', Proc. USENIX 1999, Monterey, CA, June 1999.
- M. Aron, D. Sanders, P. Druschel, W. Zwaenepoel, ``Scalable content-aware request distribution in cluster-based network servers'', Proc. USENIX 2000, San Diego, CA, June 2000.
- L. Aversa, A. Bestavros, ``Load balancing a cluster of Web servers using Distributed Packet Rewriting'', Proc. of IEEE IPCCC'2000, Phoenix, AZ, February 2000.
- P. Barford, A. Bestavros, A. Bradley, M.E. Crovella, ``Changes in Web client access patterns: Characteristics and caching implications'', World Wide Web, Jan. 1999.
- P. Barford, M.E. Crovella, ``A performance evaluation of Hyper Text Transfer Protocols'', Proc. of ACM Sigmetrics '99, Atlanta, Georgia, May 1999, pp. 188-197.
- V. Cardellini, M. Colajanni, P.S. Yu, ``Dynamic load balancing on scalable Web server systems'', Proc. of MASCOTS'2000, San Francisco, Aug. 2000.
- T.L. Casavant, J.G. Kuhl, ``A taxonomy of scheduling in general-purpose distributed computing systems'', IEEE Trans. on Software Engineering, vol. 14, no. 2, Feb. 1988, pp. 141-154.
- Cisco's LocalDirector, in www.cisco.com
- A. Cohen, S. Rangarajan, H. Slye, ``On the performance of TCP splicing for URL-aware redirection'', Proc. of 2nd USENIX Symposium on Internet Technologies and System, Boulder, CO, Oct. 1999.
- M. Colajanni, P.S. Yu, D. Dias, ``Redirection algorithms for load sharing in distributed Web server systems'', IEEE Trans. on Parallel and Distributed Systems, vol. 9, no. 6, pp. 585-600, June 1998.
- M.E. Crovella, A. Bestavros, ``Self-similarity in World Wide Web traffic: Evidence and possible causes'', IEEE/ACM Trans. on Networking, vol. 5, no. 6, Dec. 1997, pp. 835-846.
- O.P. Damani, P.E. Chung, Y. Huang, C. Kintala, Y.-M. Wang, ``ONE-IP: Techniques for hosting a service on a cluster of machines'', Proc. of 6th Intl. World Wide Web Conf., Santa Clara, CA, Apr. 1997.
- D.M. Dias, W. Kish, R. Mukherjee, R. Tewari, ``A scalable and highly available Web server'', Proc. of 41st IEEE Comp. Society Int. Conf., Feb. 1996.
- R.S. Engelschall, ``Load balancing your Web site'', Web Techniques Magazine, Vol. 3, May 1998.
- F5 Networks Inc. BigIP Version 3.0, in www.f5labs.com
- G.D.H. Hunt, G.S. Goldszmidt, R.P. King, R. Mukherjee, ``Network Web switch: A connection router for scalable Internet services'', Proc. of 7th Int. World Wide Web Conf., Brisbane, Australia, April 1998.
- A. Iyengar, J. Challenger, D. Dias, P. Dantzig, ``High-Performance Web Site Design Techniques'', IEEE Internet Computing, vol. 4 no. 2, March/April 2000.
- B.A. Mah, ``An empirical model of HTTP network traffic'', Proc. of IEEE Int. Conf. on Computer Communication, Kobe, Japan, April 1997.
- R. Mukherjee, ``A scalable and highly available clustered Web server'', in High Performance Cluster Computing: Architectures and Systems, Volume 1, Rajkumar Buyya (ed.), Prentice Hall, 1999.
- V.S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, E. Nahum, ``Locality-aware request distribution in cluster-based network servers'', In Proc. of 8th ACM Conf. on Arch. Support for Progr. Languages, San Jose, CA, Oct. 1998.
- T. Schroeder, S. Goddard, B. Ramamurthy, ``Scalable Web server clustering technologies'', IEEE Network, May-June 2000, pp. 38-45.
- J. Song, E. Levy-Abegnoli, A. Iyengar, D. Dias, ``Design alternatives for scalable Web server accelerators'', Proc. 2000 IEEE Int. Symp. on Perf. Analysis of Systems and Software, Austin, TX, Apr. 2000.
- Webstone Mindcraft Inc., Webstone2.01, in www.mindcraft.com/webstone/ws201index.html
- C.-S. Yang, M.-Y. Luo, ``A content placement and management system for distributed Web-server systems'', Proc. of IEEE 20th Int. Conf. on Distributed Computing Systems, Taipei, Taiwan, Apr. 2000.
- X. Zhang, M. Barrientos, J.B. Chen, M. Seltzer, ``HACC: An architecture for cluster-based Web servers'', Proc. 3rd USENIX Windows NT Symp., Seattle, July 1999.
Footnotes

1. In this paper we use the definition of global scheduling given in [11], and use dispatching as a synonym.