A Simulation-Based Evaluation of an Exponential Growth Copying Model for the Web-Graph

A Simulation-Based Evaluation of an Exponential Growth Copying Model for the Web-Graph

Antonios Kogias
Dept of Informatics, University of Athens
Panepistimiopolis, 15771
Athens, Greece
Dimosthenis Anagnostopoulos
Dept of Informatics, University of Athens
Panepistimiopolis, 15771
Athens, Greece

ABSTRACT

We evaluate the behavior of the Exponential Growth Copying (EGC) model for the web-graph and analyze the effect of individual parameters on its overall effectiveness through simulation. We derive the in and out degree probability distributions of the resulting graphs for various parameter values, measure them against the corresponding empirical power-laws on the real web and suggest appropriate values to improve EGC effectiveness. We conclude that the EGC model provides a very good approximation of the in-degree distribution while the out-degree distribution is plausible only when specific values of EGC parameters are employed.

Keywords

Web-graph, Models, Simulation.

1. INTRODUCTION

Valid models of the web are important, both for creating WWW-like representations where new applications can be tested (searching, indexing, compression etc.) and for improving the understanding of content creation on the web, predicting its evolution and the emergence of important new phenomena. The web-graph is an abstraction of the WWW: each HTML page is a vertex and each hyperlink a directed edge, defined by origin vertex (tail) and destination vertex (head). The out (in) degree of a vertex is the number of edges having as tail (head) this specific vertex. Crawls report [1, 2, 4] that “power-laws” exist for in and out-degrees of web-graph vertices: The probability that a vertex has in (out) degree x is proportional to x-k for some k > 0. The values of x for the in-degree and out-degree have been reported to be 2.1 and 2.7 respectively.

Among the various models that have been proposed for web-graph evolution, the Evolving Copying Models are highly regarded as they were explicitly designed to model the WWW. Their authors show that they have a large number of complete bipartite subgraphs, as has been observed in the crawls, whereas several other models do not. Their development was based on very realistic intuitions about the WWW [3][5]. In this paper, we study the behavior of the Exponential Growth Copying model (named henceforth EGC) [3] and analyze the effect of individual parameters on its overall effectiveness. We derive the in and out degree distributions of the resulting graphs for various parameter values and measure them against the reported power-laws for in and out degrees. We draw conclusions about the model behavior by measuring the error distribution between experimental and empirical results, using the mean error as a measure of performance. Finally, we suggest appropriate values for EGC parameters for more “web-like” graphs.

2. EGC MODEL SIMULATION

Copying models in general are based on the intuition [5] that, although some page creators may create content and link to other sites regardless of the already represented topics in the web, many will be drawn to existing topics of interest to them and link to pages within some of these existing topics. In [3] the intuition for the EGC is extended with the fact that, due to the exponential growth of the WWW, a page creator will not be aware of the existence of pages created in the most recent “epoch” of pages (e.g. the last week or fortnight). At each time step a new epoch of vertices arrive, the size of which is a constant fraction of the current graph. Each of these vertices may be linked only with vertices of previous epochs. The EGC model is formally described parametrically, using a constant “growth” factor p > 0, the “self-loop” factor γ > 1, the “tail copy” factor γ' in (0, 1) and the “out-degree” factor d > 0.

We simulated the operation of the EGC as follows: the model starts with a small initial graph, a parameter vector (p, γ, γ', d) and the terminating condition N (number of vertices in the graph). Let n denote the current number of vertices in the graph up to the last epoch and v the number of vertices to be added in the next epoch. The value of v is decided by random sampling from the binomial distribution with parameters p and n. For each new vertex the algorithm creates γ self-loop edges; then d new edges are created and each one’s head and tail are set to selected vertices: the head of a new edge is chosen uniformly among the heads of all existing edges up to the last epoch (i.e. preferential attachment on in-degree among the n existing vertices) and the tail with probability γ' is one of the existing edges up to the last epoch (i.e. preferential attachment on out-degree among n existing vertices) while with probability 1 - γ' the tail is chosen randomly among the vertices of the new epoch. When all new edges have been created, the new epoch “arrives” and is added to the graph; new epochs are generated until n > N.

EGC was implemented in Matlab. Each simulation run was made for a number of epochs until at least 10,000 vertices were generated in the resulting graph. Initial conditions were the same for all runs, specifically one vertex with γ self-loop vertices. We chose characteristic values for the model parameters p, γ, γ', d. For each parameter vector, 5 independent replications were made. To ensure reliability, we also made a number of experiments with 1,000,000 vertices where no significant variation from the 10,000 vertices results was observed. We thus experimented with 135 different combinations of [p, γ, γ', d] parameters, to explore a wider range of parameter values, which, as later presented, have a strong impact on EGC efficiency. We chose low, intermediate and high values of [p, γ, γ', d]; specifically {0.05, 0.5, 0.95} for p and γ', {2, 5, 10} for γ and {1, 2, 3, 4, 5} for d. In and out degrees of all vertices (removing self-loop edges) and the corresponding degree probability distributions for each run are then computed, and averaged over the 5 runs of each parameter vector. Finally, we compare these with the empirical power-law degree probability distributions by means of error distributions for each parameter vector.

In-degree probability plots, in all 135 cases, keep a steady form of “power-law” [Figure 1], while out-degree probability plots deviate from this form in many cases [Figures 2, 3]. The examination of the impact of individual parameters p, γ, γ', d on EGC results shows that p does not affect mean errors, while as γ increases, they tend to increase also. As γ' increases, the out-degree error tends to decrease. Both errors decrease as d increases. Analyzing the impact of [p, γ, γ', d] on EGC, we conclude that parameter p does not have any significant importance, γ should be kept as low as possible, γ' should be as high as possible and the greater d is the better. Overall the EGC model provides a very good approximation of the in-degree distribution. It is not, however, analogously efficient when approximating the out-degree. It gives better results when γ' -> 1, γ -> 2 and d -> 5. Especially concerning d, we consider it as a drawback of the EGC model, since empirical values for d in the web are between 3 and 4.

Figure 1. ‘power-law’ in-degree probability distribution [γ'=0.05, p=0.5, γ=2, d=3]

Figure 2. ‘no power-law’ out-degree probability distribution [γ'=0.05, p=0.5, γ=2, d=3]

Figure 3. ‘power-law’ out-degree probability distribution [γ'=0.95, p=0.5, γ=2, d=3]

3. CONCLUSIONS

We have presented a simulation-based evaluation of the EGC model, starting from an initial graph of 1 vertex with γ self-loop edges, for a wide variety of [p, γ, γ', d] values. We concluded that the in-degree probability distribution of the resulting graph, for all parameter vectors, is very close to the empirical distribution of the WWW (power-law with factor –2.1). However, the out-degree probability distribution displays sensitivity to the parameter γ', only approximating the power-law with factor -2.7 when γ' = 0.95. Moreover, both in and out degree approximations tend to improve with the increase of d over 4, but this is problematic because this value in the web is between 3 and 4. We also concluded that the value of parameter p does not have any significant impact on the results, while parameter γ should be kept as low as possible.

4. REFERENCES

  1. A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener. Graph structure in the web: Experiments and models. Proc. 9th WWW Conf, pp. 309-320, 2000.
  2. J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. The web as a graph: Measurements, models and methods. Proc. Intrnl Conf on Combinatorics & Computing, pp.1-18,1999.
  3. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal. Stochastic models for the web-graph. Proc. 41st Annual Symp on Foundations of Computer Science, 2000.
  4. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. Trawling the web for emerging cyber-communities. Proc. 8th WWW Conf, pp. 403-416, 1999.
  5. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. Extracting large-scale knowledge bases from the web. Proc. 25th VLDB Conf, pp. 639-650, 1999.