An Entropy Approach to Unintrusive Targeted Advertising on the Web
John A. Tomlin
IBM Research Division
Almaden Research Center, K53/802
650 Harry Road
San Jose, CA 95120-6099
tomlin@almaden.ibm.com
Abstract
This paper describes the formulation of a new model for unintrusive targeted advertising on the web, extending the linear programming approach taken by Langheinrich et al. [10]. A feature of our model is that it avoids unrealistic solutions of the type which may show ads to only a too-narrow group of users. This is accomplished by using a statistically derived entropy maximization model, which incorporates a form of randomization in associating advertisements with targetable groups of users, as well as considering clickthrough probability. It is then shown that this nonlinear entropy model can be embedded in larger models for the purpose of optimal management of web advertisement portfolios by agencies or brokerages.
Keywords: World-Wide Web, Electronic Commerce, Online Advertising, Optimization, Statistics, Entropy, Equilibrium
1. Introduction
This paper outlines the formulation of some entropy models for directed advertising on the web ([9]), and in particular for unintrusive targeted advertising. The initial aim is to build an optimization model which can be used to compute frequencies for showing advertisements to groups of users so as to optimize clickthroughs or revenue in a stable way. A feature of the model is that it is formulated to avoid unrealistic, "overtargeted" solutions. To illustrate what this means, and to motivate the rest of this paper, let us consider a very simple example.
Suppose we have 100 identical banner ads to be presented to two distinguishable types, or groups, of users, who view the page on which the ad may be displayed in equal numbers, and who have estimated clickthrough probabilities of 51% and 49%. How many of the ads should we show to each type of user to maximize the expected number of clickthroughs?
If we let x_{1}, x_{2} be the number of ads shown to users of type 1 and 2, this problem can be expressed as a tiny linear program (LP):

Maximize  0.51 x_{1} + 0.49 x_{2}

subject to  x_{1} + x_{2} = 100

x_{i} ≥ 0
The obvious "optimal" solution is x_{1}=100, x_{2}=0, in other words to show all the ads to the first group of users, to achieve an expected 51 clickthroughs. The second group are shown no ads at all.
Is this solution realistic or desirable? If the clickthrough probabilities and ratio of users were known with perfect accuracy, perhaps. But in the real world such data are uncertain. Suppose the uncertainty in the clickthrough probabilities is a modest 5%. Then in the worst case the actual probabilities might be 46% and 54%, the coefficients in the function to be maximized would be 0.46 and 0.54, rather than 0.51 and 0.49, and the "optimal" solution would be completely different: show all of the ads to the second group of users (x_{1}=0, x_{2}=100). This would result in 54 expected clickthroughs, whereas our previous solution with x_{1}=100 would result in only 46. These drastic differences in solution are clearly unsatisfactory, and we shall refer to "all-or-nothing" solutions of this type as "overtargeted".
It might be thought that one way out of this dilemma is to specify a minimum allocation of ads to each group, demanding that (say) each group is shown at least 10 ads (i.e. x_{i} ≥ 10). However, once each group has been assigned its minimum 10 ads, the remaining 80 will again be assigned on an all-or-nothing basis as before, and our solution will be (x_{1}=90, x_{2}=10), only slightly more stable under changes in the probabilities than before. We might improve the situation by increasing the lower bound further, but we have no guidance on how much.
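The arithmetic of this instability is easy to check. The sketch below (plain Python; the helper names `lp_optimum` and `expected_clicks` are our own illustrative devices, not part of any ad-serving system) computes the LP optimum by inspection and evaluates it against perturbed probabilities:

```python
# Sketch of the tiny LP's all-or-nothing behavior.

def lp_optimum(c1, c2, total=100, lower=0):
    """Solve: maximize c1*x1 + c2*x2 s.t. x1 + x2 = total, x_i >= lower.
    With one balance constraint the LP optimum is a corner point:
    everything above the floor goes to the larger coefficient."""
    if c1 >= c2:
        return (total - lower, lower)
    return (lower, total - lower)

def expected_clicks(x, c):
    """Expected clickthroughs of allocation x under true probabilities c."""
    return x[0] * c[0] + x[1] * c[1]

# Plan against the estimates (0.51, 0.49): show all 100 ads to group 1.
plan = lp_optimum(0.51, 0.49)
print(plan)                                             # (100, 0)
# If the true probabilities turn out to be (0.46, 0.54), the plan earns only:
print(round(expected_clicks(plan, (0.46, 0.54)), 2))    # 46.0
# A floor of 10 ads per group barely helps:
plan10 = lp_optimum(0.51, 0.49, lower=10)
print(plan10)                                           # (90, 10)
print(round(expected_clicks(plan10, (0.46, 0.54)), 2))  # 46.8
```

The corner-point behavior is intrinsic to the LP, not to the solver: any floor simply shifts the corner.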
Faced with the above situation, the common-sense approach is to "more or less evenly distribute the ads between the groups of users, but with a bias toward the group(s) with the higher clickthrough probability". We must then ask how this is to be expressed mathematically. The results of section 3 of this paper suggest that instead of solving the tiny linear program (LP) above, we should solve the following tiny nonlinear program (NLP):

Maximize  0.51 x_{1} + 0.49 x_{2} − θ ( x_{1} ln(x_{1}) + x_{2} ln(x_{2}) )

subject to  x_{1} + x_{2} = 100

x_{i} ≥ 0

where θ is a positive constant (here θ = 0.5, the mean of the two clickthrough probabilities; this choice is explained in section 3).
The solution (to 4 figures) to this NLP is x_{1}=51.00, x_{2}=49.00, yielding an expected number of 50.00 clickthroughs, one less than in the LP optimum, but much more robust under changes in the data. Under the worst-case 5% variation in probabilities considered above, the extreme cases, when the probabilities are (0.56, 0.44) and (0.46, 0.54), yield solutions of (55.97, 44.03) and (46.01, 53.99) with expected clickthroughs of 50.72 and 50.32, a far more stable outcome. The nonlinear terms in the NLP objective have accomplished the goal of "more or less evenly distributing the ads between the groups of users", which we might more formally describe as randomization, but the linear terms still give us a "bias toward the group(s) with the higher clickthrough probability". While the concept of randomizing may at first appear odd in the advertising context, any web surfer will have observed that portals do in fact serve up mixtures of ads.
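For a problem with a single balance constraint, this NLP optimum can be written in closed form: the first-order conditions give x_{i} proportional to exp(c_{i}/θ), a softmax over the clickthrough estimates. A minimal sketch (the function name, and the default of taking θ as the mean of the coefficients, are our choices for illustration):

```python
import math

def entropy_allocation(c, total=100.0, theta=None):
    """Maximize sum_i c[i]*x[i] - theta * sum_i x[i]*ln(x[i])
    subject to sum_i x[i] = total.  The first-order conditions give
    x[i] proportional to exp(c[i]/theta).  theta defaults to the mean
    of c, the choice used for the tiny example in the text."""
    if theta is None:
        theta = sum(c) / len(c)
    w = [math.exp(ci / theta) for ci in c]
    s = sum(w)
    return [total * wi / s for wi in w]

print([round(v, 2) for v in entropy_allocation([0.51, 0.49])])  # [51.0, 49.0]
# Perturbed probabilities barely move the allocation:
print([round(v, 2) for v in entropy_allocation([0.56, 0.44])])  # [55.97, 44.03]
print([round(v, 2) for v in entropy_allocation([0.46, 0.54])])  # [46.01, 53.99]
```

Since the softmax varies smoothly with its inputs, small errors in the probabilities now produce only small shifts in the allocation, which is exactly the stability observed above.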
The first priority of this paper is then to formulate optimization models which avoid "overtargeted" solutions by a technique which allows for randomization, as well as considering clickthrough probability, and which is suitable for much larger (and less contrived) problems than our example, involving many types of users and ads. We use a statistical argument in deriving this model, following the lead given in other modeling areas, notably transportation planning. These models also have the advantage that efficient algorithms are available for their solution, making them attractive in practice.
A further advantage of this formulation is that it provides a fundamental building block for more far-reaching objectives in web-based advertising. In particular we will describe how this linear-logarithmic model can be embedded in a larger model for the purpose of optimal management of web advertisement portfolios by agencies or brokerages.
In the next section we move beyond the highly simplified LP model described above, and outline the quite general linear programming model formulated by Langheinrich et al. [10] for their ADWIZ system. This is followed by a derivation of the entropy approach and a description of the form of the solutions. We then describe how this model may be embedded in a prototype multi-time-period planning model for an advertising campaign. Finally we summarize some of the many interesting areas of further exploration associated with the entropy modeling approach.
2. The ADWIZ LP Model
Langheinrich et al. [10] have described a system for "unintrusive customization" of web advertising, unintrusive in the sense that only impersonal information, such as use of a search keyword or accessing of a page, is required. This system consists of a number of elements. Those of most interest to us here are a Selection Engine and a Learning System. The first of these components uses a set of weights in a Relevancy Computation to compute the probability of displaying a particular ad to a user, the user being associated with certain keywords or pages.
The component of principal interest here is a "learning" system, which computes display weights based on current estimates of clickthrough probabilities, the keyword list and the ad inventory. The components and formulation of the model (but with the indices reversed here) are as follows:
Indices
i = 1,...,m  The keywords 
j = 1,...,n  The ads available 
Data
h_{j}  The desired display rate for ad j (normalized over all ads) 
k_{i}  The keyword input probabilities (normalized over all keywords) 
c_{ij}  Clickthrough probability for ad j and keyword i 
Variables
d_{ij}  The display probability of ad j for keyword i. 
Constraints

Σ_{j=1}^{n} d_{ij} = 1    (i = 1,...,m)

Σ_{i=1}^{m} k_{i} d_{ij} = h_{j}    (j = 1,...,n)    (1)

d_{ij} ≥ 0    (i = 1,...,m; j = 1,...,n)

Maximize

Σ_{i=1}^{m} Σ_{j=1}^{n} c_{ij} k_{i} d_{ij}
Note that the first equation in (1) above can be multiplied through by k_{i}, followed by a change of variable:

x_{ij} = k_{i} d_{ij}

to produce the equivalent constraints:

Σ_{j=1}^{n} x_{ij} = k_{i}    (i = 1,...,m)

Σ_{i=1}^{m} x_{ij} = h_{j}    (j = 1,...,n)    (2)

x_{ij} ≥ 0    (i = 1,...,m; j = 1,...,n)

and objective:

Maximize  Σ_{i=1}^{m} Σ_{j=1}^{n} c_{ij} x_{ij}    (3)
This is a classic "transportation problem" (see e.g. [3]), and we shall work with this and subsequent models in this form. The model may be complicated in practice by contractual requirements setting a minimum on certain ad display rates. Such constraints, of the form

x_{ij} ≥ L_{ij}

where L_{ij} is a constant, do not significantly complicate the model. This is fortunate because, although efficient algorithms exist for general linear programs, there exist algorithms which are orders of magnitude faster for the transportation problem (see [8]). Since these models may become quite large, or need to be solved frequently, this is an important practical consideration.
Despite the virtues of this model, simplicity and fast solution time, it suffers from much the same weakness as the LP model in the introduction. Linear programming theory (see [3]) tells us that a model like this, with (m+n) linear constraints on mn variables, has optimal solutions in which at most (m+n) of the variables are away from their lower bound (usually zero). Thus for realistically large values of m and n we expect only a small fraction of the variables to be nonzero. In other words we expect the same phenomenon of overtargeting, on an even larger (if less extreme) scale. Once again we must regard such solutions as unrealistic and unsatisfactory, with many of the ads not being shown at all for most keywords. Langheinrich et al. [10] recognized this, and resorted to the technique of applying rather arbitrary lower bounds on all the variables. As we saw in the small example, this can remove the immediate symptom, where many or most ad/keyword combinations never appear, but it is far from clear in what sense such solutions can be described as "optimal".
3. A Statistical Theory
In this section we shall employ a methodology used in Traffic Theory for the distribution of vehicles from origins (sources) to destinations (sinks). Following the description (but not the notation) of Wilson [13] in the traffic context, let us define a total number of trips A_{i} originating at origin node i, and require B_{j} trips to end at destination j. Let x_{ij} be the number of trips for the origin-destination pair (i,j); further, let c_{ij} be some generalized cost or impedance of travelling from i to j. Then clearly we require:
Σ_{j=1}^{n} x_{ij} = A_{i}    (i = 1,...,m)

Σ_{i=1}^{m} x_{ij} = B_{j}    (j = 1,...,n)    (4)

x_{ij} ≥ 0    (i = 1,...,m; j = 1,...,n)
and for feasibility we must have:
X = Σ_{i=1}^{m} A_{i} = Σ_{j=1}^{n} B_{j}    (5)
where X is the total number of trips.
We shall also assume for the time being that another constraint is satisfied:
Σ_{i=1}^{m} Σ_{j=1}^{n} c_{ij} x_{ij} = C    (6)
The basic assumption is that the probability of the distribution {x_{ij}} occurring is proportional to the number of states w(x_{ij}) of the system which give rise to this distribution. The number of distinct arrangements of individuals which give rise to a distribution {x_{ij}} is:
w(x_{ij}) = X! / ∏_{ij} x_{ij}!    (7)
Now finding the maximum of this function is equivalent to finding the maximum of the log of w, which after applying Stirling's formula, and neglecting constant terms, requires us to:
Maximize  − Σ_{i=1}^{m} Σ_{j=1}^{n} x_{ij} ln(x_{ij})    (8)
subject to (4) and (6). The function appearing in (8) is an entropy function.
Without going into detail, the solutions of this entropy maximization problem can be shown to be of the form:

x_{ij} = a_{i} A_{i} b_{j} B_{j} exp(− β c_{ij})    (9)

where β is the Lagrange multiplier for equation (6). In practice it is not necessary to know the value of β. An efficient iterative (scaling) procedure is available (see [11] and [5]) which enables us to solve the problem without having to resort to more expensive general nonlinear programming methods. It can also be shown that for practical values of the A_{i}, B_{j} this maximum is sharp ([13]).
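As a concrete illustration of the scaling procedure, the sketch below balances a matrix of the product form (9) to prescribed row and column totals by alternating row and column rescaling (classical iterative proportional fitting). The data and the multiplier value are made up, and this is a bare-bones version rather than the tuned algorithms of [11] or [5]:

```python
import math

def scale_entropy_solution(A, B, c, beta, iters=200):
    """Balance x[i][j] = r[i] * s[j] * exp(-beta * c[i][j]) so that row
    sums match A and column sums match B, i.e. compute a solution of the
    product form (9) by alternating row/column scaling."""
    m, n = len(A), len(B)
    K = [[math.exp(-beta * c[i][j]) for j in range(n)] for i in range(m)]
    r = [1.0] * m
    s = [1.0] * n
    for _ in range(iters):
        for i in range(m):           # rescale rows to hit A
            r[i] = A[i] / sum(K[i][j] * s[j] for j in range(n))
        for j in range(n):           # rescale columns to hit B
            s[j] = B[j] / sum(K[i][j] * r[i] for i in range(m))
    return [[r[i] * s[j] * K[i][j] for j in range(n)] for i in range(m)]

# Two buckets of users, two ad groups, illustrative costs and multiplier:
x = scale_entropy_solution(A=[60, 40], B=[50, 50],
                           c=[[0.1, 0.9], [0.8, 0.2]], beta=2.0)
print([round(sum(row), 4) for row in x])                             # [60.0, 40.0]
print([round(x[0][j] + x[1][j], 4) for j in range(2)])               # [50.0, 50.0]
```

Each sweep costs O(mn), and every entry stays strictly positive throughout, which is the mechanism by which the entropy model rules out all-or-nothing solutions.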
The maximum entropy solution discussed above is one form of equilibrium solution.
The applicability of this model to the targeted advertising problem should now be clear. If we take trips originating at origins as analogous to "buckets" of users, with A_{i} users in each bucket, and the destinations as groups of ads to be shown, with B_{j} ads of each type, and consider the c_{ij} as clickthrough probabilities with C as the total number of clicks required, we obtain an unnormalized extension of the model discussed in section 2. Once again, contractual lower bounds may be imposed on particular ad/user pairings (as long as feasibility is not lost). However, there is no "allornothing" nature to the solution, and overtargeting may be avoided without any requirement for arbitrary lower bounds. Indeed, since the model involves the logarithm of the x_{ij}'s, they must necessarily be positive.
Another statistical model of interest here has also arisen in traffic theory, and is perhaps even more appropriate (see [12]). We proceed by analogy with the Helmholtz free-energy function, which is at a minimum for a system in equilibrium under conditions of constant volume and temperature. This function is of the form:

F = E − K ln w    (10)

where K is a constant, E is the internal energy, and w is the number of states as defined in (7). We are interested in maximizing clickthrough, or equivalently, minimizing non-clickthrough probability. Again using Stirling's formula, and defining

c̄_{ij} = [ max_{pq} c_{pq} ] − c_{ij}    (11)
if we identify these "cost" values as the analogue of energy we obtain:

F = constant + Σ_{i=1}^{m} Σ_{j=1}^{n} x_{ij} ( c̄_{ij} + θ ln(x_{ij}) )    (12)

Here θ is a constant, replacing K, whose value is yet to be determined. We now assert that the equilibrium distribution is that which minimizes F subject to (4). The constraint (6) is no longer needed, and the parameter θ accommodates a range of cases, from the extreme of θ = 0, which gives us the unnormalized version of the LP model in section 2, to a completely proportional model, giving the solution

x_{ij} = A_{i} B_{j} / X    (13)

when θ is taken to be arbitrarily large. The general form of the solution to this model can be shown to be of the form
x_{ij} = â_{i} b̂_{j} exp(− c̄_{ij} / θ)    (14)

which is of the same form as (9). We must then choose a value for θ. It is shown in [12] that under certain assumptions the weighted mean of the cost values (as defined in (11)) provides a good fit for the analogous traffic problem, and we shall initially assume so here. This allows us to use an iterative procedure (in θ) to solve the problem. A good initial value for θ has proved to be simply the mean of these cost values for some models, and sometimes this is even a good enough estimate to obtain good agreement between the model and real data. (We used this value in the tiny NLP example in the introduction.) Once again the solution to (14) for any θ can be obtained by an efficient iterative (scaling) procedure.
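The role of θ can be checked numerically. The sketch below computes solutions of the form (14) by the same alternating-scaling idea, on made-up data; as θ grows the result approaches the proportional solution (13), while a small θ drives it toward the all-or-nothing transportation LP optimum:

```python
import math

def free_energy_solution(A, B, cbar, theta, iters=500):
    """Solution of the form (14): x[i][j] = a[i]*b[j]*exp(-cbar[i][j]/theta),
    with a, b scaled so row sums match A and column sums match B.
    Bare-bones alternating scaling; cbar plays the role of (11)."""
    m, n = len(A), len(B)
    K = [[math.exp(-cbar[i][j] / theta) for j in range(n)] for i in range(m)]
    a, b = [1.0] * m, [1.0] * n
    for _ in range(iters):
        for i in range(m):
            a[i] = A[i] / sum(K[i][j] * b[j] for j in range(n))
        for j in range(n):
            b[j] = B[j] / sum(K[i][j] * a[i] for i in range(m))
    return [[a[i] * b[j] * K[i][j] for j in range(n)] for i in range(m)]

A, B = [60.0, 40.0], [70.0, 30.0]
cbar = [[0.0, 0.4], [0.3, 0.1]]   # illustrative values of max(c) - c

# Large theta: close to the proportional solution A_i * B_j / X of (13),
# so x[0][0] is near 60*70/100 = 42.
print(free_energy_solution(A, B, cbar, theta=100.0)[0][0])

# Small theta: close to the transportation LP optimum, which puts
# x[0][0] at its maximum value of 60.
print(free_energy_solution(A, B, cbar, theta=0.05)[0][0])
```

In this sense θ interpolates continuously between pure proportional randomization and pure (overtargeted) optimization.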
Thus far we have stated this form of the statistical model as a minimization problem. Once θ has been chosen this is of course equivalent to the maximization problem:
Maximize  Σ_{i=1}^{m} Σ_{j=1}^{n} x_{ij} ( c_{ij} − θ ln(x_{ij}) )    (15)
subject to (4), and we shall also use the model in this form.
This form of the statistical model offers significant advantages over that stated in (4)-(8). The constraints are those of the classical transportation problem, and the rather arbitrary constraint (6) has been replaced by a parameter θ in the linear-logarithmic objective function, for which we have some rationale for assigning a value. In either case, we have a self-contained, easily solvable, constrained optimization model which can be embedded in more complex models which we may now consider building for the management of web advertising campaigns.
Note that we have made no assumptions on how the "buckets" of users are defined. They may correspond to search keywords, states or histories. Similarly, the assigning of the ads to groups may be by individual or classes of ad. The key pieces of data are the number of users or ads in each bucket or group and the clickthrough probabilities. The question of maximizing revenue then naturally arises, and can be answered by applying revenue weights to the c_{ij} term in the objective. This will be seen in the more general model of the next section.
4. An Embedded Model
We will consider the environment in which our models are to be used, the ad supply chain, to be made up of three segments:

- Advertisers, who hire the agencies to display their ads as effectively as possible to users at the various properties.
- Agencies/Brokerages, who choose and display ads at a property, using what information is available on the users (if any).
- Properties, the particular pages, typically at a portal, where banner ads are displayed by the agency/broker(s).
For concreteness, let us formulate a model which considers only the first two of these specifically: an agency and a number of advertisers who wish to present ads to users in (at least some of) the same buckets. We also broaden the model to multiple time periods. The aim of the agency is to obtain ads from the advertisers which will maximize its net revenue, given the expected number of users in each bucket per time period, and the clickthrough probabilities for ads in each time period. The components of this model are:
Indices
i = 1,...,m  The buckets of users 
j = 1,...,n  The ad types available 
k = 1,...,K  The advertisers 
t = 1,...,T  Time periods 
Data
A_{it}  The number of expected users in bucket i in period t. 
c_{ijt}  Clickthrough probability for ad j by user i in period t. 
R_{ijt}  Revenue from clickthrough for ad j by user i in period t. 
D_{ijt}  Revenue (or Cost) for displaying ad j by user i in period t. 
P_{jt}^{+}  Penalty for shortfall of shown ads of type j at end of period t 
P_{jt}^{−}  Penalty for excess of shown ads of type j at end of period t 
M_{jkt}  Agency's cost of obtaining ads of type j from advertiser k to be shown in period t, 
U_{jkt}  Upper limit on ads of type j from advertiser k in period t. 
L_{jkt}  Lower limit on ads of type j from advertiser k in period t. 
θ_{t}  Entropy weight for period t. 
Variables
x_{ijt}  The displays of ad j for user type i in period t. 
y_{jkt}  The number of ads j bought by advertiser k for display in period t. 
z_{jt}  The number of ads j shown to all users in period t. 
s_{jt}^{+}  Inventory of unshown ads of type j at end of period t 
s_{jt}^{−}  Excess of shown ads of type j at end of period t 
Constraints
Material Balance:
s_{jt}^{+} − s_{jt}^{−} = s_{j,t−1}^{+} − s_{j,t−1}^{−} + Σ_{k=1}^{K} y_{jkt} − z_{jt}    for all j,t

Σ_{j=1}^{n} z_{jt} = Σ_{i=1}^{m} A_{it}    for all t    (16)

Supply and Demand:

Σ_{j=1}^{n} x_{ijt} = A_{it}    for all i,t

Σ_{i=1}^{m} x_{ijt} = z_{jt}    for all j,t    (17)

Bounds:

s_{jt}^{+}, s_{jt}^{−}, x_{ijt}, z_{jt} ≥ 0    for all i,j,t

L_{jkt} ≤ y_{jkt} ≤ U_{jkt}    for all j,k,t    (18)

Maximize

Σ_{i,j,t} x_{ijt} ( D_{ijt} + R_{ijt} c_{ijt} − θ_{t} ln(x_{ijt}) )
− Σ_{j,t} ( P_{jt}^{+} s_{jt}^{+} + P_{jt}^{−} s_{jt}^{−} )
− Σ_{j,k,t} M_{jkt} y_{jkt}    (19)
Although the broader questions of costing of web advertising (see [1] and [7]) are beyond the scope of this paper, note that by allowing revenues (or costs) to be associated with ads that are clicked on or otherwise (via the D_{ijt} and R_{ijt} coefficients), as well as marketing costs M_{jkt}, considerable flexibility is provided.
This prototype model needs some further specifications to complete it. For example, dummy ads may need to be added to make sure that the second material balance constraint can be satisfied, but the above description gives us a sufficiently detailed model to examine its structure. Clearly it can grow quite large quite quickly if T becomes large, or if m or n are taken to be large. However the model has a definite structure. If we let H represent an m by n transportation problem coefficient matrix, the structure of the entire problem coefficient matrix is of the form:

    [ A^{(0)}                    ]
    [ A^{(1)}   H                ]
    [ A^{(2)}        H           ]
    [    ·               ·       ]
    [ A^{(T)}               H    ]

where the A^{(k)} are coefficients corresponding to the y_{jkt}, z_{jt}, s_{jt}^{+} and s_{jt}^{−} variables only, and each diagonal block H corresponds to the x_{ijt} variables of one time period. This structure is well known in the optimization community to be amenable to a technique known as Generalized Benders Decomposition (see [6]). Without going into too much technical detail, for this model the decomposition leads to solving multiple subproblems of the form (4), (15) in only the x_{ijt} variables, with the right-hand sides B_{jt} now determined by the "master" z_{jt} values, and a "master" problem in the y, z, s^{+}, s^{−} variables, with constraints derived from the A^{(k)} matrices and the "cuts" generated by the subproblems. We expect the overall solution procedure to be very efficient, even for large problems.
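To make the subproblem structure concrete: once the master problem fixes the z_{jt}, the x-part separates into one entropy transportation problem per period, each solvable by the scaling idea of section 3. A toy sketch (two buckets, two ad types, two periods; all data, and the payoff coefficients g standing in for D + R·c, are illustrative assumptions):

```python
import math

def entropy_transport(A, B, g, theta, iters=300):
    """Maximize sum_ij x_ij * (g_ij - theta * ln x_ij) subject to row sums
    A and column sums B.  The optimum has the product form
    x_ij = a_i * b_j * exp(g_ij / theta), found by alternating scaling."""
    m, n = len(A), len(B)
    K = [[math.exp(g[i][j] / theta) for j in range(n)] for i in range(m)]
    a, b = [1.0] * m, [1.0] * n
    for _ in range(iters):
        for i in range(m):
            a[i] = A[i] / sum(K[i][j] * b[j] for j in range(n))
        for j in range(n):
            b[j] = B[j] / sum(K[i][j] * a[i] for i in range(m))
    return [[a[i] * b[j] * K[i][j] for j in range(n)] for i in range(m)]

# Master decisions z[t][j] (ads of type j shown in period t) fix the column
# totals; each period then becomes an independent subproblem.
A = [[30.0, 20.0], [25.0, 25.0]]   # expected users per bucket, per period
z = [[28.0, 22.0], [24.0, 26.0]]   # master's ads shown per type, per period
g = [[0.4, 0.2], [0.1, 0.5]]       # illustrative per-display payoffs

subplans = [entropy_transport(A[t], z[t], g, theta=0.3) for t in range(2)]
for t, x in enumerate(subplans):
    print(t, [round(sum(row), 3) for row in x])   # row sums reproduce A[t]
```

Each subproblem is small and cheap, so in the decomposition the bulk of the work stays in the master problem over the y, z and s variables.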
It is important to note that the Generalized Benders approach can be applied in a wider context. It is not limited in applicability to the maximum entropy submodels we have considered. Any procedure for displaying groups of ads to buckets of users which can notionally be expressed as an optimization problem in our x_{ijt} variables, subject to constraints only on those variables, is a candidate for this treatment. Thus some form of Markovian model (see [2]) may be just as amenable as the entropy model(s) we have discussed.
5. Refinements and Extensions
There are many possible extensions to the embedded targeting model described above, to encompass more of the ad supply chain. It is relatively straightforward to extend it to consider properties on multiple portals by stratifying the buckets of users (say by adding an index p) and considering not only variables x_{pijt} etc., but stratified revenues R_{pijt} etc. Such a model, incorporating agencies, advertisers and properties/portals, will still have the basic matrix structure shown above, and be amenable to treatment by decomposition.
There are other statistical techniques, again grounded in transportation studies, which might well be considered for application in targeted advertising applications. One of these is the "intervening opportunities" model (see [13]) which ranks, in this context, groups of ads in decreasing attractiveness for each bucket of users, and using a probability that opportunities of a certain rank will be passed up, constructs an exponential decay model for associating users with groups of ads.
It should be noted that the model(s) formulated here deliberately assume (since they are "unintrusive") that very little is known about individual users: only the bucket to which they belong and the clickthrough probabilities for that bucket of users. If we relax the unintrusiveness requirement it may well be that we can stratify users by information level: some with the information level we have used above, some with limited information available through cookies, and others for which a detailed click-trail is available. Once again, it is possible to extend the model to accommodate this stratification, modifying it to vary the weight on the entropy terms for the different strata, without losing the matrix structure which promises efficient solution.
Other extensions of this model involve examining the structure of the costs which we have so far considered constant. Especially when multiple advertisers and multiple portals are considered, there is an opportunity to use the model to evaluate some forms of nonlinear pricing for yield management (see [4] and [14]).
6. Conclusion
We have shown that entropy models, and potentially other modeling techniques from other areas of Operations Research, can provide useful tools in the context of web advertising campaigns, either in a single timeslice application, or more generally, at a strategic level in the management over time of entire campaigns and portfolios.
As in any large scale optimization application, data availability is critical. This work is at an early stage. The next step is the acquisition of suitable data to test and calibrate the models we have described.
Acknowledgements
The author is indebted to Prabhakar Raghavan and Sridhar Rajagopalan for valuable discussions in the course of this work.
References
 [1]
 R. Briggs and N. Hollis, "Advertising on the Web: Is there response before clickthrough?", Journal of Advertising Research, 37(2), 33-45 (1997).
 [2]
 Moses Charikar, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins, "On Targeting Markov Segments", in Proceedings of the ACM Symposium on Theory of Computing, ACM Press (1999).
 [3]
 G.B. Dantzig, Linear Programming and Extensions, Princeton University Press, Princeton, NJ, (1963).
 [4]
 S. Daudel and G. Vialle, Yield Management: Applications to Air Transport, Institut du Transport Aerien, Paris (1994).
 [5]
 Sven Erlander, Optimal Spatial Interaction and the Gravity Model, Lecture Notes in Economics and Mathematical Systems 173, Springer-Verlag, Berlin (1980).
 [6]
 A. M. Geoffrion, "Generalized Benders Decomposition", J. Optimization Theory and its Applications 10, 237-260 (1972).
 [7]
 Bill Harvey, "The Expanded ARF Model: Bridge to the Accountable Advertising Future", Journal of Advertising Research, 37(2), 11-20 (1997).
 [8]
 J.L. Kennington and R.V. Helgason, Algorithms for Network Programming, Wiley, NY, (1980).
 [9]
 Y. Kohda and S. Endo, "Ubiquitous Advertising on the WWW: Merging Advertisement on the Browser", Fifth International World Wide Web Conference, Paris, France, May (1996). See also: http://www5conf.inria.fr/fich_html/papers/P52/Overview.html
 [10]
 Marc Langheinrich, Atsuyoshi Nakamura, Naoki Abe, Tomonari Kamba, and Yoshiyuki Koseki, "Unintrusive Customization Techniques for Web Advertising", NEC Corporation, C&C Media Research Laboratories, Kawasaki, Kanagawa, Japan. Proceedings of WWW8 (1999). Available at: http://www8.org/w8papers/2bcustomizing/unintrusive/unintrusive.html
 [11]
 R.B. Potts and R.M. Oliver, Flows in Transportation Networks, Academic Press, New York (1972).
 [12]
 J.A. Tomlin and S.G. Tomlin, "Traffic Distribution and Entropy", Nature 220, 974-976 (1968).
 [13]
 A.G. Wilson, "A statistical theory of spatial distribution models", Transportation Research 1, 253-269 (1967).
 [14]
 R.B. Wilson, Nonlinear Pricing, Oxford University Press, Oxford and NY, (1993).
Vitae
John Tomlin gained his B.Sc.(Hons) and Ph.D. in mathematics at the University of Adelaide, South Australia. Since then he has worked continuously in the areas of Operations Research, applied optimization and optimization software. He joined the IBM Research Division in 1987. He is a member of ACM, INFORMS and the Mathematical Programming Society.