Fractal Characterization of Web Workloads
Daniel A. Menascé
Dept. of Computer Science, George Mason University
Fairfax, VA 22030 USA
+1 703 993 1537
menasce@cs.gmu.edu

Daniel Barbará
Dept. of Information and Software Engineering, George Mason University
Fairfax, VA 22030 USA
+1 703 993 1627
dbarbara@gmu.edu

Bruno D. Abrahão, Virgílio A.F. Almeida, Flávia P. Ribeiro
Dept. of Computer Science, Universidade Federal de Minas Gerais
Belo Horizonte, MG 31270 Brazil
+55 31 3499 5860
{brut,virgilio,flavia}@dcc.ufmg.br
Abstract:
Keywords: Web workload characterization, e-commerce, fractal clustering, fractal dimension, k-means clustering.
Word count: 6,300.
1 Introduction
Understanding the workload of Web and e-commerce sites is a fundamental step in sizing the IT infrastructure that supports these sites and in planning for their evolution so that Quality of Service (QoS) goals are met within cost constraints [12]. The first attempts at characterizing the workload of Web sites focused on information-providing sites and considered only the stream of HTTP requests coming to the sites [3,4,8].
Some of the characteristics analyzed in previous studies are: file size distributions, file popularity distribution, self-similarity in HTTP traffic, reference locality, and user request patterns. A number of studies of different Web sites found that file sizes have heavy-tailed distributions and that the relationship between object popularity and frequency of access follows a Zipf distribution. Other studies of different Web site environments demonstrated long-range dependencies in the user request process, primarily resulting from strong correlations in the user requests.
More recently [13], the authors proposed a characterization approach for e-commerce workloads that takes into account the user model and its interactions with the e-commerce site. In particular, they proposed a hierarchical view of the workload consisting of a session layer, a function layer, and a protocol (i.e., HTTP) layer. A session is a sequence of requests of different types made by a single customer during a single visit to a site. During a session, a customer requests the execution of various e-business functions such as browse, search, select, add to the shopping cart, register, and pay. A request to execute an e-business function may generate many HTTP requests to the site. For example, several images may have to be retrieved to display the page that contains the results of the execution of an e-business function. Using this approach to characterize the workload of online bookstores and auction sites, they were able to show, among other things, that i) most sessions last less than 1,000 sec, ii) more than 70% of the functions performed are product selection functions as opposed to product ordering functions, iii) the popularity of search terms follows a Zipf distribution, iv) there is a very strong correlation in the arrival process at the HTTP request level, indicated by a Hurst parameter with a value of 0.9273, v) at least 16% of the requests are generated by robots, vi) 88% of the sessions have less than 10 requests to execute e-business functions, and vii) the distribution of session length, measured in number of requests to execute e-business functions, is heavy-tailed, especially for sites subject to requests generated by robots.
One of the main potential benefits of properly characterizing the workload is to improve the quality of a customer's experience at the site. Once one understands what customers do and what navigational patterns they follow, one can improve their experience and even attempt to dynamically identify a customer with a well-known behavior in order to provide a better experience for that customer, i.e., personalized services, and perhaps generate more profit for e-commerce sites.
Menascé and Almeida proposed two approaches to characterize user sessions: Customer Behavior Model Graphs (CBMGs) and Customer Visit Models (CVMs) [14]. The CBMG is a state transition graph, in which the nodes correspond to states in the session (e.g., browsing, searching, selecting, checking out, and paying) and the arcs correspond to transitions between states. Probabilities are associated with transitions as in a Markov Chain. In [15], a clustering-based method was presented to process HTTP logs and obtain clusters of CBMGs with ``similar'' patterns (e.g., heavy buyers, occasional buyers).
A Customer Visit Model (CVM) represents sessions as a collection of session vectors, one per session. A session vector for the j-th session indicates the number of times that each of the different functions (e.g., search, browse, add to cart, etc.) was invoked during the session. In this paper, we use clustering techniques based on the CVM rather than on the CBMG, since the CVM is a more compact representation of the workload than a CBMG. This reduces the number of attributes that have to be considered. More importantly, we use the fractal dimension [17] to reduce the dimensionality of the space and extract the most relevant attributes that should be used to characterize the workload. Besides using distance-based clustering algorithms (e.g., k-means, minimum spanning tree), as done in [15], we use fractal clustering (FC) [5]. One of the motivations for using fractal clustering is to improve the ``quality'' of the clusters so that one can assign meaningful interpretations to each of them. Fractal clustering is quite appropriate for finding sets of points that are ``similar'' with respect to a fractal dimension. This implies that clusters do not have to be shaped as hyperspheres, as is the case with distance-based clustering approaches. In this paper, we compare the results obtained with fractal clustering and distance-based clustering on a dataset generated from an e-commerce workload.
The rest of this paper is organized as follows. Section 2 briefly summarizes the results of previous work relevant to this paper. Section 3 discusses the fractal dimension and its application to Web workloads and shows, through the use of pair-count plots, that the workload analyzed is fractal in nature. Section 4 presents the results of performing attribute selection (i.e., dimension reduction) on our Web workloads. Section 5 presents the results of applying fractal clustering to our Web workloads and offers an interpretation of the resulting clusters. Section 6 compares FC with k-means clustering on the same dataset. Finally, Section 7 presents some concluding remarks.
2 Related Work
While there are many references on workload characterization in the WWW focusing on information provider Web sites [4], there is much less work on workload characterization for e-commerce sites.
In [7], the notion of a session, which consists of many individual HTTP requests, is introduced. Reference [14] presents several models (e.g., the Customer Behavior Model Graph and the Customer Visit Model), shows how user sessions can be represented by these models, and explains how to obtain them from HTTP logs. In [13], the authors use a hierarchical model to characterize e-commerce workloads from a real Web store log. This method takes into account not only requests for files but also user sessions consisting of requests for e-commerce functions (e.g., add to cart and pay). In addition, they detect and characterize entities widely mentioned in this work, such as shopbots and crawlers. They also present the summary and the characteristics of the Web store log we are using in this work.
In the database field, the fractal dimension has proven to be a suitable tool for estimating the selectivity of spatial joins and range queries [6] and for indexing and feature selection [17], among others. A good coverage of fractal dimensions, including the correlation fractal dimension D2, is given by Belussi and Faloutsos [6]. Besides presenting the concepts behind those metrics, they also show an efficient algorithm to compute them.
In [10], the pair-count exponent power law is presented, along with its application to estimating the selectivity of spatial join queries. As a special case, it includes the correlation fractal dimension D2. Since our datasets closely follow this power law, we were motivated to examine the fractal properties of our workload.
Last, the method used to perform attribute selection in our datasets is described in [17]. In addition to presenting the algorithm, the authors also show results using real and synthetic databases. We use the Fractal Clustering (FC) algorithm proposed by Barbará et al. [5] to cluster datasets of n-dimensional points and group them into clusters not restricted in shape.
3 Fractal Dimension of Web Workloads
Many phenomena observed in nature seem to have chaotic behavior, but many, under close inspection, exhibit patterns that repeat themselves as the observation scale varies. These patterns are called fractals and the phenomenon is called self-similarity [5,11]. Self-similarity has been observed in many phenomena related to computer systems and to Web and e-commerce workloads. For example, the number of bytes retrieved from a Web server in time slots of duration t is a self-similar process over a wide range of values of t [8]. So is the number of HTTP requests arriving at an e-commerce site over a time slot of duration t [13].
One of the goals of this work is to use fractal properties to refine the characterization of e-commerce workloads. These workloads are characterized at higher levels of abstraction, i.e., sessions as opposed to HTTP requests. For that purpose, we analyzed large HTTP logs of an online bookstore. The logs correspond to 15 days in which 955,818 HTTP requests were processed. Entries corresponding to images, errors, etc. were deleted and the URLs of the remaining entries in the log were mapped to one of the 12 e-business functions defined in Table 1.
Function Name  Description 
Acc  Account Login and Creation 
Add  Add items to the cart 
Browse  Navigate through product categories 
Help  Obtain help 
Home  Request the site's home page 
Info  Obtain product information 
Pay  Pay for an item 
Postpay  Request a summary of items paid for 
Prepay  Submit payment info 
Robot  Request the file robots.txt 
Search  Search for products based on keywords 
Undef  Undefined function 
Table 1: List of E-business Functions
Then, sessions within the log were identified using a combination of session ids generated by the online bookstore and a 30-minute inactivity threshold [14]. A session vector was generated for each session. This vector indicates the number of times that each of the 12 different functions (e.g., Search, Browse, Add) was invoked during the session. The resulting dataset is what we call the complete dataset. We define the session length as the total number of requests to execute e-business functions during the session, i.e., the sum of the components of the session vector.
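The sessionization step just described can be sketched as follows. This is an illustrative sketch, not the scripts actually used on the bookstore log; in particular, the input format (pre-parsed (session_id, timestamp, function) tuples, sorted by timestamp) is an assumption.

```python
from collections import defaultdict

FUNCTIONS = ["Acc", "Add", "Browse", "Help", "Home", "Info", "Pay",
             "Postpay", "Prepay", "Robot", "Search", "Undef"]
IDX = {f: i for i, f in enumerate(FUNCTIONS)}
TIMEOUT = 30 * 60  # 30-minute inactivity threshold, in seconds

def session_vectors(entries):
    """entries: iterable of (session_id, timestamp, function) tuples,
    assumed sorted by timestamp.  Returns a list of 12-component
    session vectors; a gap longer than TIMEOUT within the same
    session id starts a new session."""
    last_seen = {}  # session_id -> timestamp of last request
    current = {}    # session_id -> vector being accumulated
    vectors = []
    for sid, ts, func in entries:
        if sid in current and ts - last_seen[sid] > TIMEOUT:
            vectors.append(current.pop(sid))  # close the stale session
        vec = current.setdefault(sid, [0] * len(FUNCTIONS))
        vec[IDX[func]] += 1
        last_seen[sid] = ts
    vectors.extend(current.values())          # flush open sessions
    return vectors
```

The session length of each resulting vector is simply `sum(vec)`.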
We derived three additional datasets from the complete dataset. One, called UpTo50, only includes session vectors with session length not exceeding 50. By considering only sessions whose length does not exceed 50, we exclude sessions generated by robots and a few sessions generated by human beings as well. It should be noted, however, that the UpTo50 dataset contains 99.5% of the sessions of the complete log. This dataset can be used to focus on human-generated requests only.
Another dataset, called Add, includes sessions in which customers had an intention to buy, indicated by at least one execution of the Add function, i.e., at least one item was placed in the shopping cart. The last dataset, called Pay, includes sessions in which the Pay function was executed at least once, i.e., sessions in which a purchase occurred.
As a preliminary investigation of the fractal nature of our datasets, we used the method in [10] to obtain the pair-count plot, or PC-plot, of the dataset. The pair count, PC(r), of a dataset for a given radius r is defined as the number of pairs of points in the dataset that are within a distance r of each other. The study in [10] showed that PC(r) follows a power law for many real datasets over a usable range of distances. In other words, PC(r) = c x r^p, where c is a constant and the exponent p is called the pair-count exponent. The PC-plot is then a plot of log PC(r) vs. log r. Clearly, if PC(r) follows a perfect power law, the PC-plot is a straight line for a given range of distances. The pair-count exponent p is the slope of the straight line in the PC-plot and is also the ``correlation fractal dimension'' [10,17] discussed later in this section. This is also called the intrinsic dimensionality of the dataset.
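The pair count and the slope of the PC-plot follow directly from these definitions. The sketch below uses a brute-force O(n^2) pair count and an ordinary least-squares fit on the log-log points; [10] describes far more efficient techniques for large datasets.

```python
import math
from itertools import combinations

def pair_count(points, r):
    """PC(r): number of point pairs within Euclidean distance r."""
    return sum(1 for p, q in combinations(points, 2)
               if math.dist(p, q) <= r)

def pc_exponent(points, radii):
    """Slope of log PC(r) vs. log r over the given radii, fitted by
    ordinary least squares, i.e., the pair-count exponent."""
    xs, ys = [], []
    for r in radii:
        pc = pair_count(points, r)
        if pc > 0:  # skip radii with no pairs (log undefined)
            xs.append(math.log(r))
            ys.append(math.log(pc))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```

As a sanity check, points spread uniformly along a line yield an exponent close to 1, the intrinsic dimensionality of a line.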
Figure 1 shows the PC-plot for the complete dataset. The distance metric used is the Euclidean distance and the natural logarithm was used for both axes. As expected [10], similar results were obtained with the Manhattan distance [9] and when the dataset was sampled. As can be seen, for distances whose natural logarithm ranges from 1 to 5.5, the PC-plot is essentially linear with a slope (i.e., correlation fractal dimension) of approximately 3.86.
Figure 1: PCplot for the "Complete" dataset using Euclidean distance
A closer look at the PC-plot in Figure 1 reveals that the plot has some inflection points, which indicates the presence of more than one fractal cluster. In order to better analyze the fractal dimension of the dataset, we can refer to Figure 2, which uses the approach in [6]. The idea is to partition the space into a grid of d-dimensional cells of side equal to s. We then count the number of points in the i-th cell of side s and compute the frequency, C_i, with which points fall within cell i, i.e., the count of points in the cell divided by the total number of points in the dataset. The ``correlation fractal dimension'' D2 is then defined as

    D2 = ∂ log ( Σ_i C_i² ) / ∂ log s     (1)

for the range in which the dataset is self-similar [10,17]. The correlation fractal dimension is quite useful in data mining and measures the probability that two points chosen at random will be within a certain distance of each other [5].
Figure 2 plots the log of the sum of the squares of the cell occupancy frequencies vs. the log of the cell side s. Thus, the slope of this graph in selected ranges is the fractal dimension of the dataset. The figure shows that the dataset is self-similar but that there is more than one range with different values of D2.
Figure 2: Log of the sum of cell occupancy frequency vs. cell side for the complete dataset
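Equation (1) can be estimated numerically by gridding the space at several cell sides and fitting the slope of log Σ C_i² vs. log s. The following is a simplified sketch of that box-counting approach, not the authors' implementation; [6] gives an efficient algorithm.

```python
import math
from collections import Counter

def sum_sq_freq(points, s):
    """Sum over grid cells of side s of the squared occupancy
    frequency C_i (cell count divided by total points)."""
    cells = Counter(tuple(int(x // s) for x in p) for p in points)
    n = len(points)
    return sum((c / n) ** 2 for c in cells.values())

def correlation_dimension(points, sides):
    """D2: slope of log sum-of-squared-frequencies vs. log cell
    side, fitted by least squares over the given range of sides."""
    xs = [math.log(s) for s in sides]
    ys = [math.log(sum_sq_freq(points, s)) for s in sides]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```

For points uniformly spread on a line, the estimate is close to 1, as expected for a one-dimensional manifold embedded in the plane.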
4 Attribute Selection and Correlation Analysis
Our original dataset has dimension equal to 12 since there are 12 e-business functions in the session vector. Some of the attributes may not be relevant when attempting to group sessions into clusters. We examine in this section how one can select the relevant attributes in a dataset. Since we intend to use fractal clustering (see Section 5), which places points in clusters by taking into account the fractal dimension of each cluster, we use the fractal dimension as the criterion for attribute selection.
The basic process works as follows. We start by computing the fractal dimension D of the complete dataset. Then we examine one attribute (say attribute i) at a time and compute the partial fractal dimension pD(i), defined as the fractal dimension of the dataset using all attributes except attribute i. We then select the attribute j whose removal changes the fractal dimension the least, i.e., the attribute with the largest partial fractal dimension pD(j), set D equal to pD(j), and remove attribute j from the dataset. The process continues until all attributes have been removed.
Figure 3 shows the results of applying this process to the complete dataset. The y-axis value for a given attribute is the partial fractal dimension before the attribute is removed. The order in which attributes are removed as a result of executing the procedure outlined above is from right to left. So, as the figure shows, removing Postpay through Undef does not significantly change the fractal dimension of the dataset. Similar results for the UpTo50, Add, and Pay datasets are shown in Figs. 4, 5, and 6, respectively. These results indicate that fractal clustering analysis of these datasets can be carried out with a number of attributes much smaller than the original 12.
Figure 3: Attribute selection of the ``Complete'' dataset
Figure 4: Attribute selection of the ``UpTo50'' dataset
Figure 5: Attribute selection of the ``Add'' dataset
Figure 6: Attribute selection of the ``Pay'' dataset
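The backward-elimination procedure described above can be sketched generically. The routine below takes the fractal-dimension estimator as a parameter, since any implementation of the correlation dimension can be plugged in; the toy estimator used in the usage example is only a stand-in for testing.

```python
def attribute_elimination(data, fd):
    """Backward elimination by fractal dimension, after [17].
    data: list of equal-length tuples; fd: function returning the
    (correlation) fractal dimension of a dataset.  At each step the
    attribute whose removal changes the dimension the least, i.e.,
    the one with the largest partial fractal dimension, is dropped.
    Returns attribute indices in order of removal, each paired with
    the partial fractal dimension measured just before removal."""
    remaining = list(range(len(data[0])))
    removed = []
    while remaining:
        best_i, best_pd = None, float("-inf")
        for i in remaining:
            keep = [j for j in remaining if j != i]
            # Partial fractal dimension: dataset projected on all
            # remaining attributes except i.
            pd = (fd([tuple(row[j] for j in keep) for row in data])
                  if keep else 0.0)
            if pd > best_pd:
                best_i, best_pd = i, pd
        remaining.remove(best_i)
        removed.append((best_i, best_pd))
    return removed
```

With a crude stand-in estimator that just counts non-constant attributes, a constant column is eliminated first, mirroring how the real procedure drops attributes that contribute nothing to the intrinsic dimensionality.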
Finding correlations between attributes is important in explaining user behavior. These correlations can be found through the use of association rules [1]. The data is assumed to be a basket of items, which in its simplest form is a vector of binary values: an item is either in the basket or it is not. This technique finds rules of the form X → Y, where X and Y are itemsets in the data, i.e., sets of items. The rule comes with two measures: the support, which is the probability of the itemset X ∪ Y in the data set (the number of baskets in which X and Y occur together divided by the total number of baskets), and the confidence, which is the probability that Y occurs in baskets already containing X (the number of baskets in which X and Y occur together divided by the number of baskets containing X). The rules by themselves do not indicate correlation.
However, a different measure, called the lift of the rule, defined as the ratio between the confidence and the probability of the consequent (in our example the itemset Y), is useful in establishing correlations. A lift greater than 1 indicates a positive correlation and a lift less than 1 a negative correlation.
In our logs, we took the sessions as the baskets and, as items, the presence or absence of each function (the attributes of the log). For instance, one can have an item indicating whether the session used the browse function or not.
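Support, confidence, and lift follow directly from the counts described above. A minimal sketch, treating each session as a set of function names:

```python
def rule_metrics(baskets, x, y):
    """Support, confidence, and lift of the rule X -> Y over a list
    of baskets (sets of items).  Lift > 1 indicates positive
    correlation between X and Y; lift < 1, negative correlation."""
    x, y = set(x), set(y)
    n = len(baskets)
    n_x = sum(1 for b in baskets if x <= b)          # baskets with X
    n_y = sum(1 for b in baskets if y <= b)          # baskets with Y
    n_xy = sum(1 for b in baskets if (x | y) <= b)   # baskets with both
    support = n_xy / n
    confidence = n_xy / n_x
    lift = confidence / (n_y / n)
    return support, confidence, lift
```

For example, over the baskets {browse, info}, {browse, info}, {search}, {browse}, the rule browse → info has support 0.5, confidence 2/3, and lift 4/3 (positive correlation).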
The most interesting correlations were found in the Pay dataset and are shown in Table 2.

Table 2: Best rules for paying customers in the Web log
The first two indicate a strong correlation between browsing and visiting the information pages (the customer either uses both or none). The third one indicates that when the rule is broken (the customer does not use the browsing function, but visits the information pages), then the search function is commonly used.
5 Fractal Clustering and Interpretation
This section presents the results of performing FC on the online bookstore dataset and offers an interpretation of the resulting clusters. The FC algorithm used is described in [5]. There, the authors presented a novel algorithm for clustering data points based on exploiting self-similarity. This algorithm, named FC for Fractal Clustering, incrementally processes each point by placing it in the cluster in which it causes the minimum fractal impact, that is, the cluster whose fractal dimension changes the least when the point is added to it.
Since the workload datasets exhibit a high degree of self-similarity, we wanted to use FC to cluster them, showing the advantage that FC exhibits over other clustering methods such as k-means.
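The ``minimum fractal impact'' assignment can be illustrated with a deliberately simplified sketch: here the fractal dimension of a cluster is estimated from box counts at just two scales, whereas the actual FC algorithm [5] maintains a layered grid structure and processes points far more efficiently. The grid sides and the seeding of clusters are assumptions of this sketch.

```python
import math
from collections import Counter

def d2_estimate(points, s1=0.5, s2=1.0):
    """Two-scale box-count estimate of the correlation fractal
    dimension: slope of log sum-of-squared cell frequencies
    between grid sides s1 and s2."""
    def ssf(s):
        cells = Counter(tuple(int(x // s) for x in p) for p in points)
        n = len(points)
        return sum((c / n) ** 2 for c in cells.values())
    return ((math.log(ssf(s2)) - math.log(ssf(s1)))
            / (math.log(s2) - math.log(s1)))

def assign(point, clusters):
    """Place point in the (pre-seeded) cluster whose estimated
    fractal dimension changes the least when the point is added,
    i.e., the cluster of minimum fractal impact."""
    best, best_impact = None, float("inf")
    for c in clusters:
        impact = abs(d2_estimate(c + [point]) - d2_estimate(c))
        if impact < best_impact:
            best, best_impact = c, impact
    best.append(point)
    return clusters.index(best)
```

A point lying on the same line as a one-dimensional cluster barely changes that cluster's dimension, so it is absorbed there rather than by a two-dimensional cluster.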
The results of using FC over the complete set of Web logs are shown in Table 3 in the form of the centroids of the three clusters found. The last column indicates the percentage of points in each cluster. The three clusters correspond to human sessions, since the robot sessions were considered as outliers by FC (and form a fourth very small cluster). The clustering was done on a reduced attribute set of five attributes: add, browse, home, info, and search. The centroids suggest the presence of three groups of human sessions, whose main difference is the intensity with which they use the functions.
Cluster  add  browse  home  info  search  % 
1  0.251  0.772  0.778  0.811  0.944  96.6 
2  3.148  6.312  1.963  6.553  5.100  2.57 
3  5.027  8.699  3.208  10.321  8.732  0.83 
Table 3: Centroids found by FC on the complete web logs
6 Comparing Distance-Based Clustering with FC
Clustering techniques based on the definition of a distance (e.g., Euclidean or Manhattan) are commonly used. Some examples of these algorithms include k-means, minimum spanning tree, and others [9]. The purpose of this section is to compare the results obtained with k-means and FC clustering with respect to the quality of the clusters obtained and the interpretations derived from them.
The k-means algorithm chooses k points as the initial cluster centroids and adds each remaining point to the closest cluster using a predefined distance metric. Every time a point is added to a cluster, the coordinates of the centroid of the cluster have to be recomputed [9]. Point allocation to clusters may have to be repeated until no points change their cluster allocation or a maximum number of passes is performed.
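The procedure just described can be sketched in a few lines (plain Lloyd-style k-means; an actual study would normally rely on a vetted library implementation):

```python
import math
import random

def kmeans(points, k, max_passes=20, seed=0):
    """Plain k-means: pick k initial centroids, assign each point to
    the nearest centroid, recompute centroids, and repeat until
    assignments stop changing or max_passes is reached."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_passes):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster emptied
                centroids[c] = tuple(
                    sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids
```

On two well-separated groups of session-like vectors, the algorithm recovers the groups; on elongated or irregularly shaped clusters, however, its hypersphere bias shows, which is exactly the contrast with FC drawn in this section.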
We present in Tables 4, 5, and 6 the results of applying k-means clustering to our e-commerce workload, where, as before, each point is a session vector. Table 4 shows the results of applying k-means clustering to the complete log using 5, 6, and 7 clusters. Only seven of the twelve attributes were used in the clustering process: acc, add, browse, home, info, pay, and search. Using more attributes did not improve the quality of the clustering process. For each number of clusters, the table shows the coordinates of the centroid of each cluster. These coordinates represent the average number of executions of each of the seven e-business functions by sessions represented by that cluster. Column 9 indicates the percentage of sessions in each cluster. The next column, S, indicates the sum of the number of executions of each of the seven e-business functions for that cluster. This sum can be considered as the session length restricted to these seven functions. The last column of the table presents a possible interpretation for the type of sessions represented by each cluster.
Cluster  acc  add  browse  home  info  pay  search  %  S  Interpr. 
5 clusters  
1  0.7  5.2  4.0  1.8  4.9  0.0  232.7  2.0  249  shopbots 
2  3.8  2.7  2.4  1.9  2.2  1.0  2.9  0.2  17  buyers 
3  0.1  0.1  0.5  0.0  0.8  0.0  2.9  60.7  4  hit&run 
4  0.1  0.1  0.8  1.2  0.7  0.0  0.8  34.5  4  hit&run 
5  59.0  53.5  1384.2  118.6  1290.7  0.0  81.2  0.0  2987  crawlers 
6 clusters  
1  1.2  0.3  0.8  0.5  0.7  0.0  94.8  5.9  98  shopbots 
2  3.8  2.7  2.4  1.9  2.2  1.0  2.9  0.2  17  buyers 
3  0.0  0.1  0.5  0.0  0.8  0.0  1.3  56.9  3  hit&run 
4  0.0  0.1  0.8  1.2  0.7  0.0  0.8  32.4  4  hit&run 
5  59.0  53.5  1384.2  118.6  1290.7  0.0  81.2  0.0  2987  crawlers 
6  0.7  5.2  4.0  1.8  5.0  0.0  4.6  1.9  21  chm 
7 clusters  
1  1.2  0.0  0.5  0.3  0.4  0.0  122.0  4.7  124  shopbots 
2  3.8  2.7  2.4  1.9  2.2  1.0  2.9  0.3  17  buyers 
3  0.0  0.0  0.5  0.0  0.8  0.0  1.3  55.2  3  hit&run 
4  0.0  0.0  0.7  1.2  0.6  0.0  0.7  30.6  3  hit&run 
5  59.0  53.5  1384.2  118.6  1290.7  0.0  81.2  0.0  2987  crawlers 
6  0.9  7.8  6.0  2.2  7.5  0.0  6.3  0.8  31  chm 
7  0.2  1.5  1.3  0.8  1.3  0.0  1.3  8.4  6  chm 
Table 4: Results of k-means clustering for the complete log.
We used the labels below in Tables 4, 5, and 6 to indicate the following categories of sessions:
 shopbots: sessions generated by shopbots that send requests to various sites for price comparison purposes. As indicated in [2], shopbots are characterized by relatively large sessions and a very large percentage of search requests as compared to other requests.
 crawlers: a typical crawler requests a site home page, parses it, and then follows the links present in that page. It then repeats the process for each new link found in all pages retrieved. As indicated in [2], sessions generated by crawlers tend to be very long (thousands of requests) and tend to cover most e-business functions, except pay and postpay.
 buyers: sessions with buying activity (i.e., at least one execution of the pay function).
 hit&run: these sessions are very small and do not show any significant interest from the customer in the site, as indicated by very little product selection (e.g., browse, info, search) activity and very little product ordering (e.g., acc, add, pay) activity.
 chm: these sessions characterize customers who seem to have changed their minds with respect to buying as indicated by add to cart activity not followed by a checkout (i.e., pay) activity.
 bots: these sessions include a mix of shopbots and crawlers.
 info: sessions dominated by info requests with negligible paying activity.
 browsers: sessions dominated by browse requests with negligible paying activity.
 searchers: sessions dominated by search requests with negligible paying activity.
The characterization of sessions could be useful for planning purposes. It helps to answer questions such as: What would be the response time if the percentage of bots increased by 100%? How could one reduce the percentage of chm customers? What kind of customers usually buy? Is there a common reason why customers change their minds?
In each of the three cases (k = 5, 6, and 7) in Table 4, there is one cluster of sessions labeled as ``buyers''. One interesting observation is that the centroid of the ``buyers'' cluster is the same regardless of the number of clusters selected: (3.8, 2.7, 2.4, 1.9, 2.2, 1.0, 2.9). The analysis of the complete log also indicates that shopbots and crawlers seem to be very well separated for k = 5, 6, and 7. The coordinates of the centroid of the ``crawlers'' cluster are the same in all three cases and the coordinates of the ``shopbots'' centroid are not much different. The non-robot clusters, referred to here as human clusters, are separated into three groups when k = 5. One is composed of buyers and the other two belong to the hit&run category. When the number of clusters is increased to 6 and 7, we start to see a new type of customer: those who change their minds. In fact, with 7 clusters we can see that 9.2% of the customers changed their minds, 0.3% bought something, 85.8% are non-serious customers (hit&run), and 4.7% are robots.
Cluster  acc  add  browse  home  info  pay  search  %  S  D 
5 clusters  
1  82.4  74.8  1368.8  165.9  841.3  0.0  84.5  0.0  2618  bots 
2  1121.0  1074.0  6955.0  2147.0  16167.0  0.0  1122.0  0.0  28586  bots 
3  1.0  6.5  6.7  2.2  9.5  0.0  8.3  7.9  34  chm 
4  0.6  1.7  1.5  1.0  1.6  0.0  1.2  57.4  8  chm 
5  4.0  2.8  2.5  1.5  2.8  1.0  2.8  2.2  18  buyers 
7 clusters  
1  16.7  14.7  336.9  30.8  418.5  0.0  17.9  0.1  835  bots 
2  91.1  84.3  1533.9  183.0  966.9  0.0  93.1  0.0  2952  bots 
3  1.1  10.4  7.6  2.8  10.7  0.0  12.8  3.1  45  chm 
4  0.6  1.4  1.4  1.0  1.4  0.0  0.8  74.4  6  chm 
5  4.0  2.8  2.5  1.5  2.8  1.0  2.8  3.3  18  buyers 
6  1121.0  1074.0  6955.0  2147.0  16167.0  0.0  1122.0  0.0  28586  bots 
7  0.8  4.1  3.1  1.6  4.2  0.0  5.1  19.1  19  chm 
Table 5: Results of k-means clustering for the Add log.
Table 5 shows the results when only sessions that added at least one item to the cart are considered. In this case, the ``buyers'' centroid in the Add log is the same for 5 and 7 clusters and is very similar to the ``buyers'' centroid in the complete log. We can see that this log does not provide a clean distinction between shopbots and crawlers. Still, the bots with a very large number of requests manage to remain isolated for both k = 5 and k = 7.
Cluster  acc  add  browse  home  info  pay  search  %  S  D 
1  0.2  1.1  1.4  0.7  9.7  0.01  1.6  2.6  15  info 
2  0.1  0.2  0.7  0.7  0.1  0.00  0.3  51.4  2  hit&run 
3  0.0  0.0  0.1  0.1  1.0  0.00  0.6  42.2  2  hit&run 
4  0.3  1.1  10.4  1.7  3.7  0.02  2.0  1.7  19  browsers 
5  0.2  1.2  0.6  0.7  1.1  0.01  12.7  2.0  17  searchers 
Table 6: Results of k-means clustering for the UpTo50 log and 5 clusters.
Finally, Table 6 shows the results of applying k-means when robots are excluded. The results show two ``hit&run'' clusters and three clusters that show add to cart activity but with product selection dominated by either info, browse, or search requests.
Looking at the results of Table 3, obtained when FC was performed, we were able to derive a relationship between the sum S of the variables browse, info, and search and the add variable. This relationship would not have revealed itself without performing FC. We then performed Chi-square tests over the add variable and the sum S, along with a linear regression between these variables. The results are summarized in Table 7. The interesting fact is that the variables are correlated in all cases (each cluster and the complete set), but while the correlation is positive in Cluster 1 and the complete set, it is negative in both of the other clusters (the slope of the regression curve is negative).
Clusters 2 and 3 have significantly more add activity than cluster 1 (see Table 3). In these two cases, as customers increase their product selection activity, they tend to add fewer items to the shopping cart (negative slope). Another interesting difference between the results obtained with FC and k-means can be seen by comparing Table 6 for k-means without robots and Table 3 for FC without robots. In the first case, there is very little separation due to add activity, while in the second case the centroids provide a very clear distinction based on add to cart activity. It may be more useful, from a management standpoint, to separate customers based on their product ordering behavior.
Cluster  Slope  Constant  Correlation 
Cluster 1  3.885e-02  0.153  Correlated 
Cluster 2  -0.179  6.362  Correlated 
Cluster 3  -0.206  10.754  Correlated 
Complete  3.885e-02  0.153  Correlated 
Table 7: Correlations between add and S in the complete set and each of the clusters
7 Conclusion
The interactions of customers of Web and e-commerce sites are represented by sessions, which are sequences of requests of different types made by a single customer during a single visit to a site. During a session, a customer requests the execution of various e-business functions such as browse, search, select, add to the shopping cart, register, and pay. In this paper, we used the Customer Visit Model (CVM) [14] to represent an e-commerce site workload. A CVM represents sessions as a collection of session vectors, one per session. A session vector for the j-th session indicates the number of times that each of the different functions (e.g., search, browse, add to cart, etc.) was invoked during the session.
We used clustering techniques to group ``similar'' sessions and to understand the behavior of users within each cluster. In this paper, we proposed a fractal-based approach to characterize Web workloads. First, we showed how the fractal dimension of the dataset can be used to reduce the number of attributes, i.e., the dimension of the session vector. By analyzing variations in the fractal dimension, the method selects those attributes that are more relevant to the workload characterization. This process leads to a reduced complexity of the Web workload characterization and a better understanding of real workloads.
Then, fractal-based clustering algorithms were used to identify groups of user sessions that share some common features. We compared the results of the fractal clustering with those obtained by traditional clustering techniques based on the k-means algorithm. Additional contributions include the following:
 The identification of the fractal dimension of actual workload datasets.
 The derivation of association rules that help explain important user navigation patterns. For example, for the online bookstore analyzed, it was found that there is a strong correlation between browsing and visiting the information pages (the customer either uses both or none) for customers who buy. It was also found that, for these users, the search function was commonly used when the customer does not use the browsing function but visits the information pages. This type of information can be used to guide a redesign of the Web site in order to improve user experience.
 The use of the information provided by the fractal-based approach to assign meaningful interpretations to groups of Web users.
 The use of a compact workload representation (i.e., the CVM) to understand an actual ecommerce workload.
Obtaining logs from actual e-commerce sites is not trivial due to the proprietary nature of these logs. We were fortunate to be given logs from a real site, which shall remain unnamed for obvious reasons. The results reported in this paper should not be seen as general results that apply to many e-commerce sites but as an application of a methodology, based on fractal clustering, to characterize and reduce the complexity of Web and e-commerce workloads.
Acknowledgements
The authors would like to thank Ping Chen and Julia Couto for their help in running some of the experiments.
Bibliography
 1
 R. Agrawal, T. Imielinski, and A. Swami, ``Mining Association Rules between Sets of Items in Large Databases,'' Proc. ACM SIGMOD Conference on Management of Data, Washington, DC, May 1993.
 2
 V. Almeida, D. A. Menascé, R. Riedi, F. P. Ribeiro, R. Fonseca, and W. Meira, Jr., ``Analyzing Web Robots and their Impact on Caching,'' Proc. Sixth Workshop on Web Caching and Content Distribution, June 2001, pp. 299-310.
 3
 V. Almeida, M. Crovella, A. Bestavros, and A. Oliveira, ``Characterizing Reference Locality in the WWW,'' Proc. IEEE/ACM International Conference on Parallel and Distributed Systems (PDIS), December 1996.
 4
 M. Arlitt and C. Williamson, ``Web Server Workload Characterization: The Search for Invariants,'' Proc. 1996 ACM SIGMETRICS Conference on Measurement of Computer Systems, May 1996.
 5
 D. Barbará and P. Chen, ``Using the fractal dimension to cluster datasets,'' Proc. Int'l Conf. on Knowledge Discovery and Data Mining, 2000, pp. 260-264.
 6
 A. Belussi and C. Faloutsos, ``Estimating the Selectivity of Spatial Queries using the `Correlation' Fractal Dimension,'' Proc. 21st Int. Conf. Very Large Data Bases (VLDB), Umeshwar Dayal, Peter M. D. Gray, Shojiro Nishio, eds., Morgan Kaufmann, November 1995, pp. 299-310.
 7
 L. Cherkasova and P. Phaal, ``Session Based Admission Control: A Mechanism for Improving the Performance of an Overloaded Web Server,'' HP Labs Technical Report HPL98119, 1998.
 8
 M. Crovella and A. Bestavros, ``Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes,'' IEEE/ACM Transactions on Networking, Vol. 5, December 1997, pp. 836-846.
 9
 B. Everitt, Cluster Analysis, 4th ed., Oxford University Press, Oxford, 2001.
 10
 C. Faloutsos, B. Seeger, A. Traina, and C. Traina, ``Spatial join selectivity using power laws,'' Proc. 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 177-188.
 11
 B. B. Mandelbrot, The Fractal Geometry of Nature, W. H. Freeman, New York, 1983.
 12
 D. Menascé and V. Almeida, Capacity Planning for Web Services: metrics, models, and methods, Prentice Hall, NJ, 2002.
 13
 D. A. Menascé, V. Almeida, R. Riedi, F. Ribeiro, R. Fonseca, and W. Meira Jr., ``In Search of Invariants for E-Business Workloads,'' Proc. ACM Conference on Electronic Commerce 2000, Minneapolis, MN, Oct. 2000.
 14
 D. Menascé and V. Almeida, Scaling for Ebusiness: Technologies, Models and Performance and Capacity Planning, Prentice Hall, NJ, May, 2000.
 15
 D. Menascé, V. Almeida, R. Fonseca, and M. Mendes, ``A Methodology for Workload Characterization for E-commerce Servers,'' Proc. 1st ACM Conference on Electronic Commerce, Denver, CO, Nov. 1999, pp. 119-128.
 16
 M. Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W. H. Freeman and Company, NY, 1990.
 17
 C. Traina, A. Traina, L. Wu, and C. Faloutsos, ``Fast feature selection using fractal dimension,'' Proc. XV Brazilian Database Symposium, October 2000.
Authors Vitae
Daniel A. Menascé is a Professor of Computer Science, codirector of the ECenter for Ebusiness, and Director of the MS in Ecommerce program at George Mason University.
Bruno D. Abrahão is a student in the BS in Computer Science program at the Federal University of Minas Gerais, Brazil.
Daniel Barbará is an Associate Professor of Information and Software Engineering at George Mason University.
Virgilio Almeida is a Professor of Computer Science at the Federal University of Minas Gerais, Brazil.
Flávia Ribeiro is a Ph.D. student in the Computer Science Department at the Federal University of Minas Gerais, Brazil.