An Entropy Approach to Unintrusive Targeted Advertising on the Web

John A. Tomlin
IBM Research Division
Almaden Research Center, K53/802
650 Harry Road
San Jose, CA 95120-6099
tomlin@almaden.ibm.com

Abstract

This paper describes the formulation of a new model for unintrusive targeted advertising on the web, extending the linear programming approach taken by Langheinrich et al. [10]. A feature of our model is that it avoids unrealistic solutions of the type which show ads only to too narrow a group of users. This is accomplished by using a statistically derived entropy maximization model, which incorporates a form of randomization in associating advertisements with targetable groups of users, as well as considering click-through probability. It is then shown that this nonlinear entropy model can be embedded in larger models for the purpose of optimal management of web advertisement portfolios by agencies or brokerages.

Keywords: World-Wide Web, Electronic Commerce, On-line Advertising, Optimization, Statistics, Entropy, Equilibrium


1.  Introduction

This paper outlines the formulation of some entropy models for directed advertising on the web ([9]), and in particular for unintrusive targeted advertising. The initial aim is to build an optimization model which can be used to compute frequencies for showing advertisements to groups of users to stably optimize click-throughs or revenue. A feature of the model is that it is formulated to avoid unrealistic, "over-targeted'' solutions. To illustrate what this means, and to motivate the rest of this paper, let us consider a very simple example.

Suppose we have 100 identical banner ads to be presented to two distinguishable types, or groups, of users, who view the page on which the ad may be displayed in equal numbers, and who have estimated click-through probabilities of 51% and 49%. How many of the ads should we show to each type of user to maximize the expected number of click-throughs?

If we let x1, x2 be the number of ads shown to users of type 1 and 2, this problem can be expressed as a tiny linear program (LP):

Maximize     0.51 x1 + 0.49 x2
subject to   x1 + x2 = 100
             x1, x2 ≥ 0

The obvious "optimal" solution is x1=100, x2=0, in other words to show all the ads to the first group of users, to achieve an expected 51 click-throughs. The second group are shown no ads at all.

Is this solution realistic or desirable? If the click-through probabilities and ratio of users were known with perfect accuracy, perhaps. But in the real world such data are uncertain. Suppose the uncertainty in the click-through probabilities is a modest 5%. Then in the worst case the actual probabilities might be 46% and 54%, the coefficients in the function to be maximized would be 0.46 and 0.54, rather than 0.51 and 0.49, and the "optimal" solution would be completely different - to show all of the ads to the second group of users (x1=0, x2=100). This would result in 54 expected click-throughs, whereas our previous solution with x1=100 would result in only 46. These drastic differences in solution are clearly unsatisfactory, and we shall refer to "all-or-nothing" solutions of this type as "over-targeted".

It might be thought that one way out of this dilemma is to specify a minimum allocation of ads to each group - to demand that (say) each group is shown at least 10 ads (i.e. xi ≥ 10). However the effect of this is that once each group has been assigned its minimum 10 ads, the remaining 80 will again be assigned on an all-or-nothing basis as before, and our solution will be (x1=90, x2=10) - only slightly more stable under changes in the probabilities than before. We might improve the situation by increasing the lower bound further, but we have no guidance on how much.

Faced with the above situation, the commonsense approach is to "more or less evenly distribute the ads between the groups of users, but with a bias toward the group(s) with the higher click-through probability". We must then ask how this is to be expressed mathematically. The results of section 3 of this paper suggest that instead of solving the tiny linear program (LP) above, we should solve the following tiny non-linear program (NLP):

Maximize     0.51 x1 + 0.49 x2 - 0.5 x1 ln(x1) - 0.5 x2 ln(x2)
subject to   x1 + x2 = 100
             x1, x2 ≥ 0

The solution (to 4 figures) to this NLP is x1=51.00, x2=49.00, yielding an expected number of 50.02 click-throughs - about one less than in the LP optimum, but much more robust under changes in the data. Under the worst case 5% variation in probabilities considered above, the extreme cases - when the probabilities are (0.56, 0.44) and (0.46, 0.54) - yield solutions of (55.97, 44.03) and (46.01, 53.99) with expected click-throughs of 50.72 and 50.32 - a far more stable outcome. The nonlinear terms in the NLP objective have accomplished the goal of "more or less evenly distributing the ads between the groups of users", which we might more formally describe as randomization, but the linear terms still give us a "bias toward the group(s) with the higher click-through probability". While the concept of randomizing may at first appear odd in the advertising context, any web surfer will have observed that portals do in fact serve up mixtures of ads.
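The NLP above has a closed-form solution that makes these numbers easy to check: the Lagrange conditions give each xi proportional to exp(ci/θ), with θ = 0.5 here. A minimal Python sketch (the function name is ours):

```python
import math

# Closed-form solution of the tiny entropy NLP:
#   maximize   c1*x1 + c2*x2 - theta*(x1*ln(x1) + x2*ln(x2))
#   subject to x1 + x2 = total,  xi >= 0
# The Lagrange conditions give xi proportional to exp(ci/theta).
def entropy_split(c, total, theta):
    w = [math.exp(ci / theta) for ci in c]
    s = sum(w)
    return [total * wi / s for wi in w]

x = entropy_split([0.51, 0.49], 100.0, 0.5)
clicks = 0.51 * x[0] + 0.49 * x[1]
print(x, clicks)  # approximately [51.0, 49.0] and 50.02 expected click-throughs
```

Re-running `entropy_split` with the perturbed probabilities (0.56, 0.44) and (0.46, 0.54) reproduces the stable solutions quoted above.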

The first priority of this paper is then to formulate optimization models which avoid "over-targeted'' solutions by a technique which allows for randomization, as well as considering click-through probability, and which is suitable for much larger (and less contrived) problems than our example, involving many types of users and ads. We use a statistical argument in deriving this model, following the lead given in other modeling areas - notably transportation planning. These models also have the advantage that efficient algorithms are available for their solution, making them attractive in practice.

A further advantage of this formulation is that it provides a fundamental building block for more far-reaching objectives in web-based advertising. In particular we will describe how this linear-logarithmic model can be embedded in a larger model for the purpose of optimal management of web advertisement portfolios by agencies or brokerages.

In the next section we move beyond the highly simplified LP model described above, and outline the quite general linear programming model formulated by Langheinrich et al.[10] for their ADWIZ system. This is followed by a derivation of the entropy approach and a description of the form of the solutions. We then describe how this model may be embedded in a prototype multi-time-period planning model for an advertising campaign. Finally we summarize some of the many interesting areas of further exploration associated with the entropy modeling approach.


2.  The ADWIZ LP Model

Langheinrich et al.[10] have described a system for "unintrusive customization" of web advertising - unintrusive in the sense that only impersonal information, such as use of a search keyword, or accessing of a page, is required. This system consists of a number of elements. Those of most interest to us here are a Selection Engine and a Learning System. The first of these components uses a set of weights in a Relevancy Computation to compute the probability of displaying a particular ad to a user, the user being associated with certain keywords or pages.

The component of principal interest here is a "learning" system, which computes display weights based on current estimates of click-through probabilities, the keyword list and the ad inventory. The components and formulation of the model (but with the indices reversed here) are as follows:

Indices
i = 1,...,m The keywords
j = 1,...,n The ads available
Data
hj The desired display rate for ad j (normalized over all ads)
ki The keyword input probabilities (normalized over all keywords)
cij Click-through probability for ad j and keyword i
Variables
dij The display probability of ad j for keyword i.
Constraints

Σj dij = 1           (i = 1,...,m)

Σi ki dij = hj       (j = 1,...,n)        (1)

dij ≥ 0              (i = 1,...,m; j = 1,...,n)

Maximize

Σi Σj cij ki dij

Note that the first equation in (1) above can be multiplied through by ki, followed by a change of variable:

δij = ki dij

to produce the equivalent constraints:

Σj δij = ki          (i = 1,...,m)

Σi δij = hj          (j = 1,...,n)        (2)

δij ≥ 0              (i = 1,...,m; j = 1,...,n)

and objective:


Maximize     Σi Σj cij δij        (3)

This is a classic "transportation problem" (see e.g. [3]), and we shall work with this and subsequent models in this form. The model may be complicated in practice by contractual requirements setting a minimum on certain ad display rates. Such constraints, of the form

δij ≥ Lij

where Lij is a constant, do not significantly complicate the model. This is fortunate, because although efficient algorithms exist for general linear programs, there exist algorithms which are orders of magnitude faster for the transportation problem (see [8]). Since these models may become quite large, or need to be solved frequently, this is an important practical consideration.

Despite the virtues of this model - simplicity and fast solution time - it suffers from much the same weakness as the LP model in the introduction. Linear programming theory (see [3]) tells us that a model like this, with (m+n) linear constraints on mn variables, has optimal solutions in which at most (m+n) of the variables are away from their lower bound (usually zero). Thus for realistically large values of m and n we expect only a small fraction of the variables to be nonzero. In other words we expect the same phenomenon of over-targeting, on an even larger (if less extreme) scale. Once again we must regard such solutions as unrealistic and unsatisfactory, with many of the ads not being shown at all for most keywords. Langheinrich et al.[10] recognized this, and resorted to the technique of applying rather arbitrary lower bounds on all the variables. As we saw in the small example, this can remove the immediate symptom - where many or most ad/keyword combinations never appear - but it is far from clear in what sense such solutions can be described as "optimal".
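The over-targeting phenomenon is easy to see in the smallest possible case. In a 2×2 transportation problem the row and column sums fix all but one degree of freedom, and a linear objective pushes that free variable to an endpoint of its feasible interval, where some ad/keyword cell is exactly zero. A pure-Python illustration (all numbers hypothetical):

```python
# Over-targeting in a 2x2 transportation LP.  With x11 = t, feasibility
# forces x12 = A1 - t, x21 = B1 - t, x22 = A2 - B1 + t, so the objective
# is linear in t and the optimum lies at an endpoint of the feasible
# interval, where at least one cell is zero.
A1, A2 = 60.0, 40.0               # users per keyword bucket (hypothetical)
B1, B2 = 50.0, 50.0               # required displays per ad (hypothetical)
c = [[0.05, 0.04], [0.03, 0.06]]  # click-through probabilities

def objective(t):
    x = [[t, A1 - t], [B1 - t, A2 - B1 + t]]
    return sum(c[i][j] * x[i][j] for i in range(2) for j in range(2)), x

lo = max(0.0, B1 - A2)            # feasible interval for t
hi = min(A1, B1)
best_val, best_x = max(objective(lo), objective(hi))
print(best_x)  # at least one cell is exactly zero: an all-or-nothing split
```

Here the optimum is x = [[50, 10], [0, 40]]: ad 1 is never shown for keyword 2, precisely the vertex behaviour predicted by LP theory.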


3.  A Statistical Theory

In this section we shall employ a methodology used in Traffic Theory for the distribution of vehicles from origins (sources) to destinations (sinks). Following the description (but not the notation) of Wilson [13] in the traffic context, let us define a total number of trips Ai originating at origin node i, and require Bj trips to end at destination j. Let xij be the number of trips for the origin-destination pair (i,j), further let cij be some generalized cost or impedance of travelling from i to j. Then clearly we require:

Σj xij = Ai          (i = 1,...,m)

Σi xij = Bj          (j = 1,...,n)        (4)

xij ≥ 0              (i = 1,...,m; j = 1,...,n)

and for feasibility we must have:


X = Σi Ai = Σj Bj        (5)

where X is the total number of trips.

We shall also assume for the time being that another constraint is satisfied:

Σi Σj cij xij = C        (6)

The basic assumption is that the probability of the distribution {xij} occurring is proportional to the number of states w(xij) of the system which give rise to this distribution. The number of distinct arrangements of individuals which give rise to a distribution {xij} is:

w(xij) = X! / Πij (xij!)        (7)

Now finding the maximum of this function is equivalent to finding the maximum of the log of w, which after applying Stirling's formula, and neglecting constant terms, requires us to:


Maximize     - Σi Σj xij ln(xij)        (8)

subject to (4) and (6). The function appearing in (8) is an entropy function.

Without going into detail, the solutions of this entropy maximization problem can be shown to be of the form:

xij = ai Ai bj Bj exp(-λ cij)        (9)

where λ is the Lagrange multiplier for equation (6). In practice it is not necessary to know the value of λ. An efficient iterative (scaling) procedure is available (see [11] and [5]) which enables us to solve the problem without having to resort to more expensive general nonlinear programming methods. It can also be shown that for practical values of the Ai, Bj this maximum is sharp ([13]).
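The iterative scaling procedure mentioned above can be sketched in a few lines of Python. Given the kernel exp(-λ cij), it alternately rescales rows and columns until the marginals (4) are matched; the data and the value of λ below are hypothetical:

```python
import math

# Bi-proportional (iterative) scaling for the entropy model (9):
# find x_ij = a_i * b_j * exp(-lam * c_ij) matching row sums A_i
# and column sums B_j.  lam is taken as given here.
def scale(A, B, c, lam, iters=500):
    m, n = len(A), len(B)
    k = [[math.exp(-lam * c[i][j]) for j in range(n)] for i in range(m)]
    a, b = [1.0] * m, [1.0] * n
    for _ in range(iters):
        for i in range(m):                     # match row sums
            a[i] = A[i] / sum(k[i][j] * b[j] for j in range(n))
        for j in range(n):                     # match column sums
            b[j] = B[j] / sum(k[i][j] * a[i] for i in range(m))
    return [[a[i] * b[j] * k[i][j] for j in range(n)] for i in range(m)]

A = [60.0, 40.0]                  # users per bucket (hypothetical)
B = [50.0, 50.0]                  # ads per group (hypothetical)
c = [[0.05, 0.04], [0.03, 0.06]]
x = scale(A, B, c, lam=10.0)
print(x)  # every x_ij > 0: no all-or-nothing cells
```

Because the kernel is strictly positive and the row and column totals balance, the procedure converges, and every cell of the solution is strictly positive, in contrast to the LP vertex solutions.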

The maximum entropy solution discussed above is one form of equilibrium solution.

The applicability of this model to the targeted advertising problem should now be clear. If we take trips originating at origins as analogous to "buckets" of users, with Ai users in each bucket, and the destinations as groups of ads to be shown, with Bj ads of each type, and consider the cij as click-through probabilities with C as the total number of clicks required, we obtain an unnormalized extension of the model discussed in section 2. Once again, contractual lower bounds may be imposed on particular ad/user pairings (as long as feasibility is not lost). However, there is no "all-or-nothing" nature to the solution, and over-targeting may be avoided without any requirement for arbitrary lower bounds. Indeed, since the model involves the logarithm of the xij's, they must necessarily be positive.

Another statistical model of interest here has also arisen in traffic theory, and is perhaps even more appropriate (see [12]). We proceed by analogy with the Helmholtz free-energy function, which is at a minimum for a system in equilibrium in conditions of constant volume and temperature. This function is of the form:

F = E - K ln w        (10)

where K is a constant, E is the internal energy, and w is the number of states as defined in (7). We are interested in maximizing click-through, or equivalently, minimizing non-click-through probability. Again using Stirling's formula, and defining

c̄ij = [ maxpq cpq ] - cij        (11)

if we identify these "cost" values as the analogue of energy we obtain:


F = constant + Σi Σj xij ( c̄ij + θ ln(xij) )        (12)

Here θ is a constant, replacing K, whose value is yet to be determined. We now assert that the equilibrium distribution is that which minimizes F subject to (4). The constraint (6) is no longer needed, and the parameter θ accommodates a range of cases, from the extreme of θ = 0, which gives us the unnormalized version of the LP model in section 2, to a completely proportional model, giving the solution

xij = Ai Bj / X (13)

when θ is taken to be arbitrarily large. The general form of the solution to this model can be shown to be of the form


xij = âi b̂j exp( -c̄ij / θ )        (14)

which is of the same form as (9). We must then choose a value for θ. It is shown in [12] that under certain assumptions the weighted mean of the cost values (as defined in (11)) provides a good fit for the analogous traffic problem, and we shall initially assume the same here. This allows us to use an iterative procedure (in θ) to solve the problem. A good initial value for θ has proved to be simply the mean of these cost values for some models, and sometimes this is even a good enough estimate to obtain good agreement between the model and real data. (We used this value in the tiny NLP example in the introduction.) Once again the solution to (14) for any θ can be obtained by an efficient iterative (scaling) procedure.

Thus far we have stated this form of the statistical model as a minimization problem. Once θ has been chosen this is of course equivalent to the maximization problem:


Maximize     Σi Σj xij ( cij - θ ln(xij) )        (15)

subject to (4), and we shall also use the model in this form.

This form of the statistical model offers significant advantages over that stated in (4)-(8). The constraints are those of the classical transportation problem, and the rather arbitrary constraint (6) has been replaced by a parameter θ in the linear-logarithmic objective function, for which we have some rationale in assigning a value. In either case, we have a self-contained, easily solvable, constrained optimization model which can be embedded in more complex models which we may now consider building for the management of web advertising campaigns.

Note that we have made no assumptions on how the "buckets" of users are defined. They may correspond to search keywords, states or histories. Similarly, the assigning of the ads to groups may be by individual or classes of ad. The key pieces of data are the number of users or ads in each bucket or group and the click-through probabilities. The question of maximizing revenue then naturally arises, and can be answered by applying revenue weights to the cij term in the objective. This will be seen in the more general model of the next section.
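Model (15) has optimal solutions of the form xij = âi b̂j exp(cij/θ), so the same bi-proportional scaling applies, with the kernel now built from the click-through probabilities themselves. The sketch below (hypothetical data) uses the mean cost value as the simple initial choice of θ suggested above, and also checks the limiting behaviour: as θ grows the kernel flattens and the solution tends to the proportional distribution (13):

```python
import math

# Scaling solution of model (15): maximize sum x_ij(c_ij - theta*ln x_ij)
# subject to the transportation constraints (4).  The optimum has the form
# x_ij = a_i * b_j * exp(c_ij / theta).
def solve(A, B, c, theta, iters=500):
    m, n = len(A), len(B)
    k = [[math.exp(c[i][j] / theta) for j in range(n)] for i in range(m)]
    a, b = [1.0] * m, [1.0] * n
    for _ in range(iters):
        for i in range(m):
            a[i] = A[i] / sum(k[i][j] * b[j] for j in range(n))
        for j in range(n):
            b[j] = B[j] / sum(k[i][j] * a[i] for i in range(m))
    return [[a[i] * b[j] * k[i][j] for j in range(n)] for i in range(m)]

A, B = [60.0, 40.0], [50.0, 50.0]              # hypothetical bucket/ad sizes
c = [[0.05, 0.04], [0.03, 0.06]]
theta = sum(v for row in c for v in row) / 4   # mean cost value
x = solve(A, B, c, theta)
# As theta grows the kernel flattens and x_ij tends to A_i*B_j/X, eq. (13).
x_flat = solve(A, B, c, theta=1e6)
print(x, x_flat)
```

Revenue maximization fits the same sketch by replacing cij with a revenue-weighted coefficient in the kernel.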


4.  An Embedded Model

We will consider the environment in which our models are to be used - the ad supply chain - to be made up of three segments:

Advertisers
who hire the agencies to display their ads as effectively as possible to users at the various properties.
Agencies/Brokerages
who choose and display ads at a property, using what information there is available on the users (if any).
Properties
The particular pages, typically at a portal, where banner ads are displayed by the agency/broker(s).

For concreteness, let us formulate a model which considers only the first two of these specifically - an agency and a number of advertisers who wish to present ads to users in (at least some of) the same buckets. We also broaden the model to multiple time periods. The aim of the agency is to obtain ads from the advertisers which will maximize their net revenue, given the expected number of users in each bucket per time period, and the click-through probabilities for ads in each time period. The components of this model are:

Indices
i = 1,...,m The buckets of users
j = 1,...,n The ad types available
k = 1,...,K The advertisers
t = 1,...,T Time periods

Data
Ait The number of expected users in bucket i in period t.
cijt Click-through probability for ad j by user i in period t.
Rijt Revenue from click-through for ad j by user i in period t.
Dijt Revenue (or Cost) for displaying ad j by user i in period t.
Pjt+ Penalty for shortfall of shown ads of type j at end of period t.
Pjt- Penalty for excess of shown ads of type j at end of period t.
Mjkt Agency's cost of obtaining ads of type j from advertiser k to be shown in period t.
Ujkt Upper limit on ads of type j from advertiser k in period t.
Ljkt Lower limit on ads of type j from advertiser k in period t.
θt Entropy weight for period t.

Variables
xijt The displays of ad j for user type i in period t.
yjkt The number of ads j bought by advertiser k for display in period t.
zjt The number of ads j shown to all users in period t.
sjt+ Inventory of unshown ads of type j at end of period t
sjt- Excess of shown ads of type j at end of period t

Constraints

Material Balance:

sjt+ - sjt- = sj,t-1+ - sj,t-1- + Σk yjkt - zjt        for all j,t        (16)

Σj zjt = Σi Ait        for all t

Supply and Demand:

Σj xijt = Ait        for all i,t        (17)

Σi xijt = zjt        for all j,t

Bounds:

sjt+, sjt-, xijt, zjt ≥ 0        for all i,j,t        (18)
Ljkt ≤ yjkt ≤ Ujkt               for all j,k,t

Maximize

Σi,j,t xijt ( Dijt + Rijt cijt - θt ln(xijt) ) - Σj,t ( Pjt+ sjt+ + Pjt- sjt- ) - Σj,k,t Mjkt yjkt        (19)

Although the broader questions of costing of web advertising (see [1] and [7]) are beyond the scope of this paper, note that by allowing revenues (or costs) to be associated with ads that are clicked on or otherwise (via the Dijt and Rijt coefficients), as well as marketing costs Mjkt, considerable flexibility is provided.

This prototype model needs some further specifications to complete it. For example dummy ads may need to be added to make sure that the second material balance constraint can be satisfied, but the above description gives us a sufficiently detailed model to examine its structure. Clearly it can grow quite large quite quickly if T becomes large, or if m or n are taken to be large. However the model has a definite structure. If we let H represent an m by n transportation problem coefficient matrix, the structure of the entire problem coefficient matrix is of the form:


| A(0)  A(1)  A(2)  · · ·  A(T) |
|        H                      |
|              H                |
|                  · · ·        |
|                          H    |

where the A(k) are coefficients corresponding to the yjkt, zjt, sjt+ and sjt- variables only. This structure is well known in the optimization community to be amenable to a technique known as Generalized Benders Decomposition (see [6]). Without going into too much technical detail, for this model the decomposition leads to solving multiple subproblems of the form (4), (15) in only the xijt variables, with the right hand sides Bjt now determined by the "master" zjt values, and a "master" problem in the y, z, s+, s- variables, with constraints derived from the A(k) matrices and the "cuts" generated by the subproblems. We expect the overall solution procedure to be very efficient, even for large problems.

It is important to note that the Generalized Benders approach can be applied in a wider context. It is not limited in applicability to the maximum entropy submodels we have considered. Any procedure for displaying groups of ads to buckets of users which can notionally be expressed as an optimization problem in our xijt variables, subject to constraints only on those variables, is a candidate for this treatment. Thus some form of Markovian model (see [2]) may be just as amenable as the entropy model(s) we have discussed.


5.  Refinements and Extensions

There are many possible extensions to the embedded targeting model described above, to encompass more of the ad supply chain. It is relatively straightforward to extend it to consider properties on multiple portals by stratifying the buckets of users (say by adding an index p) and considering not only variables xpijt etc., but stratified revenues Rpijk etc. Such a model, incorporating agencies, advertisers and properties/portals will still have the basic matrix structure shown above, and be amenable to treatment by decomposition.

There are other statistical techniques, again grounded in transportation studies, which might well be considered for application in targeted advertising applications. One of these is the "intervening opportunities" model (see [13]) which ranks, in this context, groups of ads in decreasing attractiveness for each bucket of users, and using a probability that opportunities of a certain rank will be passed up, constructs an exponential decay model for associating users with groups of ads.

Note that the model(s) formulated here deliberately assume (since they are "unintrusive") that very little is known about individual users - only the bucket to which they belong and the click-through probabilities for that bucket of users. If we relax the unintrusiveness requirement it may well be that we can stratify users by information level - some with the information level we have used above, some with limited information available through cookies, and others for which a detailed click-trail is available. Once again, it is possible to extend the model to accommodate this stratification, modifying it to vary the weight on the entropy terms for the different strata, without losing the matrix structure which promises efficient solution.

Other extensions of this model involve examining the structure of the costs which we have so far considered constant. Especially when multiple advertisers and multiple portals are considered, there is an opportunity to use the model to evaluate some forms of nonlinear pricing for yield management (see [4] and [14]).


6.  Conclusion

We have shown that entropy models, and potentially other modeling techniques from other areas of Operations Research, can provide useful tools in the context of web advertising campaigns, either in a single time-slice application, or more generally, at a strategic level in the management over time of entire campaigns and portfolios.

As in any large scale optimization application, data availability is critical. This work is at an early stage. The next step is the acquisition of suitable data to test and calibrate the models we have described.


Acknowledgements

The author is indebted to Prabhakar Raghavan and Sridhar Rajagopalan for valuable discussions in the course of this work.


References

[1]
R. Briggs and N. Hollis, "Advertising on the Web: Is there response before click-through?" Journal of Advertising Research, 37(2), 33-45 (1997).
[2]
Moses Charikar, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins, "On Targeting Markov Segments", in Proceedings of the ACM Symposium on Theory of Computing, ACM Press (1999).
[3]
G.B. Dantzig, Linear Programming and Extensions, Princeton University Press, Princeton, NJ, (1963).
[4]
S. Daudel and G. Vialle, Yield Management: Applications to Air Transport, Institut du Transport Aerien, Paris (1994).
[5]
Sven Erlander, Optimal Spatial Interaction and the Gravity Model, Lecture Notes in Economics and Mathematical Systems 173, Springer-Verlag, Berlin (1980).
[6]
A. M. Geoffrion, "Generalized Benders Decomposition", J. Optimization Theory and its Applications 10, 237-260, (1972).
[7]
Bill Harvey, "The Expanded ARF Model: Bridge to the Accountable Advertising Future", Journal of Advertising Research, 37(2), 11-20 (1997).
[8]
J.L. Kennington and R.V. Helgason, Algorithms for Network Programming, Wiley, NY, (1980).
[9]
Y. Kohda and S. Endo, "Ubiquitous Advertising on the WWW: Merging Advertisement on the Browser", Fifth International World Wide Web Conference, Paris, France, May (1996). See also: http://www5conf.inria.fr/fich_html/papers/P52/Overview.html
[10]
Marc Langheinrich, Atsuyoshi Nakamura, Naoki Abe, Tomonari Kamba, and Yoshiyuki Koseki, "Unintrusive Customization Techniques for Web Advertising", NEC Corporation, C&C Media Research Laboratories, Kawasaki, Kanagawa, Japan. Proceedings of WWW8 (1999). Available at: http://www8.org/w8-papers/2b-customizing/unintrusive/unintrusive.html
[11]
R.B. Potts and R.M. Oliver, Flows in Transportation Networks, Academic Press, New York (1972).
[12]
J.A. Tomlin and S.G. Tomlin, "Traffic Distribution and Entropy", Nature 220, 974-976 (1968).
[13]
A.G. Wilson, "A statistical theory of spatial distribution models", Transportation Research 1, 253-269 (1967).
[14]
R.B. Wilson, Nonlinear Pricing, Oxford University Press, Oxford and NY, (1993).

Vitae


John Tomlin gained his B.Sc.(Hons) and Ph.D. in mathematics at the University of Adelaide, South Australia. Since then he has worked continuously in the areas of Operations Research, applied optimization and optimization software. He joined the IBM Research Division in 1987. He is a member of ACM, INFORMS and the Mathematical Programming Society.