On the same lines as the above proof, using the same Lemma below and (III.41) and (III.42), we now have

$$\mu=\frac{\alpha-1}{\alpha-2}\cdot\frac{x^{2-\alpha}-1}{x^{1-\alpha}-1}\nearrow$$

and

$$\mu^{*}=\frac{\alpha-2}{\alpha-3}\cdot\frac{x^{3-\alpha}-1}{x^{2-\alpha}-1}\nearrow$$

in x, hence μ ↗ ⇒ x ↗ ⇒ μ* ↗. This settles the case α > 3, since there both formulae (III.41) and (III.42) apply and the Lemma below again yields that μ and μ* strictly increase in x.

α = 2
From (III.34) and (III.39) we have

$$\mu=\frac{x\ln x}{x-1} \tag{III.43}$$

$$\mu^{*}=\frac{x-1}{\ln x} \tag{III.44}$$

Now

$$\frac{d\mu}{dx}=\frac{x-1-\ln x}{(x-1)^{2}}>0$$

since ln x < x − 1 always (since e^t > t + 1 for all t > 0, hence for all x > 1). Also

$$\frac{d\mu^{*}}{dx}=\frac{\ln x-\frac{x-1}{x}}{(\ln x)^{2}}>0$$

since ln x > 1 − 1/x. This follows from the fact that the function θ(x) = ln x + 1/x − 1 is minimal in x = 1 and θ(1) = 0. Hence μ ↗ ⇒ μ* ↗.
α = 3

From (III.33) and (III.40) we have

$$\mu=\frac{2x}{x+1} \tag{III.45}$$

$$\mu^{*}=\frac{x\ln x}{x-1} \tag{III.46}$$

Obviously μ strictly increases in x and by the above μ* strictly increases in x. So μ ↗ ⇒ μ* ↗. □

Lemma III.2.2.4: The function
$$f(x)=\frac{x^{a+1}-1}{x^{a}-1} \tag{III.47}$$

increases if a and a + 1 have the same sign and decreases if a and a + 1 have opposite sign.

Proof:

$$f'(x)=\frac{x^{a-1}\left(x^{a+1}-ax-x+a\right)}{\left(x^{a}-1\right)^{2}}$$
So f'(x) has the same sign as

$$\varphi(x)=x^{a+1}-ax-x+a.$$

Now

$$\varphi'(x)=(a+1)x^{a}-(a+1)$$

which is zero in x = 1. Furthermore

$$\varphi''(x)=a(a+1)x^{a-1},$$

so

$$\varphi''(1)=a(a+1).$$

So if a(a+1) < 0 then φ has a maximum in 1 and φ(1) = 0. So φ(x) < 0 for all x and hence f decreases. If a(a+1) > 0 then φ has a minimum in 1. So φ(x) > 0 for all x and hence f increases. □
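A small numerical spot-check of the Lemma (a Python sketch added here for illustration; the values of a and x are arbitrary choices, not taken from the text):

```python
# Sketch: tabulate f(x) = (x**(a+1) - 1)/(x**a - 1) of Lemma III.2.2.4.
f = lambda x, a: (x**(a + 1) - 1) / (x**a - 1)
xs = [1.5, 2.0, 3.0, 5.0]
print([round(f(x, 0.5), 3) for x in xs])   # a, a+1 both > 0: increasing
print([round(f(x, -0.5), 3) for x in xs])  # a < 0 < a+1: decreasing
```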
As will become clear from the sequel (Table III.1) we conjecture the following, but we are not able to find a proof for it.

Conjecture III.2.2.5: For each fixed μ we have that μ* is an increasing function of α.

Finally, we will calculate the TTT average μ* in function of the TT average μ. The calculation of μ* in function of μ and α requires the solution (for x) of the equation (cf. (III.41)), if α ≠ 2,
$$(\alpha-1)\left(x^{2-\alpha}-1\right)-\mu(\alpha-2)\left(x^{1-\alpha}-1\right)=0 \tag{III.48}$$

and of (cf. (III.43)), if α = 2,

$$\ln x+\frac{\mu}{x}-\mu=0. \tag{III.49}$$
Once x is found we have then μ* using the formulae: if α ≠ 2, α ≠ 3 (cf. (III.42))

$$\mu^{*}=\frac{\alpha-2}{\alpha-3}\cdot\frac{x^{3-\alpha}-1}{x^{2-\alpha}-1} \tag{III.50}$$

if α = 2 (cf. (III.44))

$$\mu^{*}=\frac{x-1}{\ln x} \tag{III.51}$$

and if α = 3 (cf. (III.46))

$$\mu^{*}=\frac{x\ln x}{x-1} \tag{III.52}$$
In Egghe (2003a) we solved (III.48), (III.49) for x using the MATHCAD 4.0 software package. Note, however, that α and μ cannot be taken fully independently of each other: there is a lot of freedom but we have the restriction (II.32) of Theorem II.2.1.2.1, if α > 2:

$$\mu<\frac{\alpha-1}{\alpha-2} \tag{III.53}$$

In the next table (of μ* in function of μ and α) we hence have the following restrictions:
α = 2.5 ⇒ μ < 3,  α = 3 ⇒ μ < 2,  α = 3.5 ⇒ μ < 1.6667.
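The procedure above (solve (III.48) or (III.49) for x, then evaluate (III.50)-(III.52)) is easy to automate. The following Python sketch (not the MATHCAD code used originally; it assumes SciPy is available and the bracketing interval is an illustrative choice) reproduces entries of Table III.1:

```python
# Sketch: solve (III.48)/(III.49) for x given mu and alpha, then get mu*
# via (III.50)-(III.52). mu must respect restriction (III.53) if alpha > 2.
from math import log
from scipy.optimize import brentq

def mu_of_x(x, a):
    if a == 2.0:
        return x * log(x) / (x - 1)
    if a == 3.0:
        return 2 * x / (x + 1)
    return (a - 1) / (a - 2) * (x**(2 - a) - 1) / (x**(1 - a) - 1)

def mu_star_of_x(x, a):
    if a == 2.0:
        return (x - 1) / log(x)
    if a == 3.0:
        return x * log(x) / (x - 1)
    return (a - 2) / (a - 3) * (x**(3 - a) - 1) / (x**(2 - a) - 1)

def mu_star(mu, a):
    # find x = p_m with mu_of_x(x, a) = mu (bracket endpoints illustrative)
    x = brentq(lambda x: mu_of_x(x, a) - mu, 1.0 + 1e-9, 1e9)
    return mu_star_of_x(x, a)

print(round(mu_star(2.0, 2.5), 4))  # 2.7320, as in Table III.1
```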
Table III.1 Values of μ* in function of μ and α. Reprinted from Egghe (2003), Table 1, p. 608. Copyright John Wiley & Sons Limited. Reproduced with permission.

μ*        α = 1.5   α = 2     α = 2.5   α = 3     α = 3.5
μ = 1.2   1.2148    1.2151    1.2157    1.2164    1.2178
μ = 1.5   1.5838    1.5984    1.6177    1.6479    1.7010
μ = 2     2.3317    2.4612    2.7320    -         -
μ = 2.5   3.2492    3.7279    5.8541    -         -
μ = 3     4.3333    5.5996    -         -         -
μ = 3.5   5.5894    8.3955    -         -         -
Another way to produce (μ, μ*)-relations, that is simpler than the above method, is by inputting given values of x = p_m > 1 and α > 1 into (III.48) (or rather (III.33)) or (III.49) (or (III.34)) for μ and (III.50) or (III.51) or (III.52) for μ*, so that no numerical solution as above is necessary. In this way we obtain Table III.2.
Table III.2 Values of μ and μ* in function of x = p_m and α.

(μ, μ*)   α = 1.5         α = 2           α = 2.5         α = 3           α = 3.5
x = 1.5   (1.225, 1.242)  (1.216, 1.233)  (1.208, 1.225)  (1.200, 1.216)  (1.192, 1.208)
x = 2     (1.414, 1.471)  (1.386, 1.443)  (1.359, 1.414)  (1.333, 1.386)  (1.309, 1.359)
x = 3     (1.732, 1.911)  (1.648, 1.820)  (1.570, 1.732)  (1.500, 1.648)  (1.438, 1.570)
x = 4     (2, 2.333)      (1.848, 2.164)  (1.714, 2)      (1.600, 1.848)  (1.505, 1.714)
x = 5     (2.236, 2.745)  (2.012, 2.485)  (1.821, 2.236)  (1.667, 2.012)  (1.545, 1.821)
IV LOTKAIAN CONCENTRATION THEORY
IV.1 INTRODUCTION

Concentration theory studies the degree of inequality in a set of positive numbers. It is not surprising that the historic roots of concentration theory lie in econometrics where one (early in the twentieth century) felt the need to express degrees of income inequality in a social group, e.g. a country. Hereby one expresses the "gap" between richness and poverty. One of the first papers on this topic is Gini (1909), on whose measure we will report later.
The reader of this book will now easily understand that concentration theory takes an important role in informetrics as well. Indeed, as is clear from Chapter I, informetrics deals with the inequality in IPPs, i.e. in production of the sources or, otherwise stated, the inequality between the number of items per source. As we have seen, Lotkaian informetrics expresses a large difference between these production numbers. Just to give the most obvious example: if we have Lotka's law (with exponent α = 2, just to fix the ideas): f(n) = C/n², then f(2) = C/4, f(3) = C/9, f(4) = C/16 and so on, where f(n) denotes the number of sources with n items. It is clear that, expressed per production class n, there is a large difference between the number f(n) of sources in these classes. Zipf's law is also a power law, hence it also expresses a large difference but now between the numbers g(r), r = 1,2,3,..., where g(r) denotes the number of items in the source on rank r (where sources are ranked in decreasing order of their productivity). It is clear that all examples of sources and items, given in Chapter I, can be the subject of a concentration study. The skewness of these examples was apparent and hence one should be able to measure it.
Generalizing the above examples, we can say that we have a decreasing sequence of positive numbers x_1, x_2, ..., x_N, N ∈ ℕ, and we want to describe the degree of inequality between these numbers, otherwise stated, the degree of concentration: a high concentration will be where one has a few very large numbers x_1, x_2, ... and many small numbers ..., x_{N−1}, x_N. It is clear that this must be formalized. We will use techniques developed in econometrics but we will also report on the "own" developments that have been executed in informetrics itself. Under the "own" developments we can count the so-called 80/20-rule and the law of Price. The main part of this chapter, however, will be the study of the Lorenz curve which was developed in econometrics around 1905 (cf. the historic reference Lorenz (1905)).
Let us briefly (and intuitively) describe these concepts here, before studying them more rigorously in the further sections. The simplest technique is the 80/20-rule which states that only 20% of the most productive sources produce 80% of all items. Of course, this is just a simplification of reality: it is the task for informetricians, in each case, to determine the real share of the items in the most productive sources: 20% of the most productive sources might produce 65% of all items but this could as well be 83.7%! Also, we do not have to consider 20% of the most productive sources: any percentage can be considered. So, generalizing, we can formulate the problem: for any x ∈ ]0,1[ determine θ ∈ ]0,1[ such that 100x% of the most productive sources produce 100θ% of all items. We can even ask to determine θ as a function of x. This "generalized 80/20-rule" could be called the determination of "normalized" percentiles since both x and θ belong to the interval [0,1], while in the calculation of percentiles, one of these numbers is replaced by actual quantities (of items or sources). Since both x and θ denote fractions this technique is (sometimes) called an arithmetic way of calculating concentration (see Egghe and Rousseau (1990a)).
In this sense we can call the law of Price a geometric way of calculating concentration. The historic formulation (see De Solla Price (1971, 1976) and implicit in De Solla Price (1963)) states that, if there are T sources, the √T = T^{1/2} most productive sources produce 50% (i.e. a fraction 1/2) of all items. For evident reasons, this principle is also called Price's square root law. It is clear how to extend this principle: let θ ∈ ]0,1[, then the T^θ most productive sources produce a fraction θ of all the items. This is called Price's law of concentration and we will
investigate in what cases in informetrics this is true. Also this principle could be generalized stating that for ε ∈ ]0,1[ the top T^ε sources produce a fraction θ of all the items and we can ask for a relation between ε and θ.
Both general formulations of the 80/20-rule (in terms of x and θ) and of the law of Price (in terms of ε and θ) involve two numbers. We could wonder if we can construct a function F such that, for any decreasing vector X = (x_1, x_2, ..., x_N) with positive coordinates, the value F(X) = F(x_1,...,x_N) is a good measure of the concentration in X. It is clear that an "absolute" good value for F(X) does not exist but we can determine requirements for the value of F(X) in comparison with values F(X') for other vectors X' as above, i.e. to give relative value judgements. Let us formulate some "natural" requirements.
(i) F(X) should be maximal for the most concentrated situation, namely for a vector X of the type X = (x,0,...,0) where x > 0.

(ii) F(X) should be minimal for the least concentrated situation, namely for a vector X of the type X = (x,x,...,x) where x > 0.

In terms of wealth or poverty, (i) states that X = (x,0,...,0) must have the highest concentration value (given F), since one source (e.g. person) has everything and the other sources have nothing. Condition (ii) states that if everybody has the same amount (e.g. money), the concentration value should be minimal (and preferably zero).

(iii) F(X) should be equal to F(cX) where, for X = (x_1,...,x_N), the vector cX is defined as (cx_1,...,cx_N), for all c > 0.

Condition (iii) is called the scale-invariant property and is requested since describing the concentration of income (i.e. describing wealth and poverty) should be independent of the used currency (€, $, Yen, ...) which all are interrelated via a scale factor. The next property is also very important:
(iv) F(X') > F(X) if X' is constructed from X by "taking (in X) an amount away from a poor person and giving it to a rich person". In other words, we require F(X') > F(X) in case X = (x_1,...,x_N) and

$$X'=(x_{1},\ldots,x_{i}+h,\ldots,x_{j}-h,\ldots,x_{N}) \tag{IV.1}$$

if 0 < h ≤ x_j. Condition (iv) is called the transfer principle and was introduced in 1920 by Dalton - see Dalton (1920). It is clear that the transfer principle is a very natural requirement and in Egghe and Rousseau (1991) we showed that it comprises properties such as

(v) "If the richest source gets richer, inequality must rise": for all X = (x_1,...,x_N) (as always, ordered decreasingly) and h > 0 we have that, if X' = (x_1+h, x_2,...,x_N), then F(X') > F(X). The same can be said if the "poorest source gets poorer".

(vi) The principle of nominal increase: if, given a vector X = (x_1,...,x_N), the production of each source is increased with the same amount h > 0, inequality is smaller: for X' = (x_1+h,...,x_N+h) we have F(X') < F(X).

We have, deliberately, used the econometric terminology, just to illustrate these principles but it is clear that these principles are universally required in any application of concentration theory.

An ingenious invention of Lorenz, however, is the construction of a Lorenz curve (denoted L(X)) of a given decreasing vector X; these Lorenz curves have the property that any measure F satisfying L(X) < L(X') ⇒ F(X) < F(X') automatically satisfies all the above principles. In words: any function F on vectors X = (x_1,...,x_N) which agrees with the Lorenz order satisfies the above requirements for concentration measures and hence can be called a "good" concentration measure. The Lorenz curves have, in addition, the property that any generalized 80/20-rule can be seen on these curves, hence they comprise this aspect of concentration theory as well.
Therefore, in the next section we will develop Lorenz concentration theory for vectors X = (x_1,...,x_N) as above. Then we will check for the existence of such good concentration measures and we will describe their properties. We are then at the challenge of developing Lotkaian concentration theory, i.e. to calculate Lorenz curves and good concentration measures if the vector X is of power type, i.e.

$$X=\left(f(1),f(2),\ldots,f(n_{max})\right) \tag{IV.2}$$

where f is Lotka's size-frequency function (n_max is the maximal production of a source) or if

$$X=\left(g(1),g(2),\ldots,g(T)\right) \tag{IV.3}$$

where g is Zipf's rank-frequency function (T is the total number of sources). We will see that exact calculations of these discrete versions of the laws of Lotka and Zipf are, also in this application, not possible. As explained in Chapter I, also here we will encounter problems of evaluating discrete sums, which is not possible. The previous three chapters fully showed the power of the use of the continuous versions of the laws of Lotka and Zipf, i.e. the functions (II.27) and (II.89). But we face here the need of extending Lorenz concentration theory from discrete vectors X = (x_1,...,x_N) to continuous functions h on an interval [1, x_m]: for h = f (Lotka's function) we have x_m = p_m and for h = g (Zipf's function) we have x_m = T + 1 (see Chapter II). Even for a general (non-Zipfian) rank-frequency function g, defined on [0,T] (see (II.8)) we can apply such a theory on the interval [1, T+1] by replacing r ∈ [0,T] by 1 + r ∈ [1, T+1] as we did for Zipf's law. If we can extend the Lorenz concentration theory to functions h : [1, x_m] → ℝ⁺, we can apply it to the functions f of Lotka and g of Zipf in an identical way since both f and g are functions like h. This will be done from Section IV.3 on: based on our insights in the discrete case we will define the Lorenz curve L(h) of a general function h on an interval [1, x_m] and determine its special form if h is a power law. Our approach will be simpler than (but equivalent with) earlier approaches, e.g. of Atkinson (1970) and Gastwirth (1971, 1972). We will determine three important good concentration measures based on L(h), namely the Gini index, the coefficient of variation and Theil's measure, and calculate their values for power laws h. These results can then be applied to the power laws of
Lotka (f) and of Zipf (g) and we will determine the crucial role of the exponents α and β in this matter (α in (II.27) and β in (II.89)). We will be able to present concrete formulae for the mentioned concentration measures in function of α and β, hereby proving concentration properties in function of α and β. This is the real heart of Lotkaian concentration theory. From this Lorenz concentration theory we will also determine the generalized 80/20-rule in Lotkaian informetrics and we will also show that Price's law of concentration is (exactly) valid if we have Zipf's law.

We close this introduction with an important remark: although, once Lorenz concentration theory is developed for general power functions h, we can apply it to f (Lotka) and g (Zipf) in a mathematically identical way, the interpretation of concentration theory on f is completely different from the one on g. The latter theory is the most important one since it describes the inequality among the sources (expressed by their ranks r) with respect to their item production. It is this application that is always studied, also in econometrics. But, as said, in a mathematically equivalent way, we can study concentration aspects of the function f, hereby calculating the inequality between the different source productivity levels j (as in (II.27)). This is a functionality that - as far as we are aware - only occurs in informetrics: Zipf's law also occurs in linguistics, econometrics, physics, ... but Lotka's law is a regularity that is only studied in our field. Because of the fact that it comprises Zipf's law and even Mandelbrot's law (Chapter II) and the fact that Lotka's law is a simple power function, we think it plays a central role in informetrics (hence the writing of this book!) and hence we think it is worthwhile to study also concentration properties of Lotka's function, in addition to the study of the concentration properties of Zipf's function.
IV.2 DISCRETE CONCENTRATION THEORY

In order to describe the concentration (inequality) of a vector X = (x_1, x_2, ..., x_N) we have to introduce the Lorenz curve as a universal tool in the construction of a wide range of concentration measures. We will suppose that all x_i > 0 (although extension to the case that some x_i are negative is possible - we will not go into this since we do not need it in this book) and that X is decreasing (this can always be achieved: if X is not decreasing we can reorder it decreasingly).
The Lorenz curve of X, denoted by L(X), is constructed by linearly connecting the points (0,0) and

$$\left(\frac{i}{N},\ \sum_{j=1}^{i}a_{j}\right) \tag{IV.4}$$

i = 1,...,N, where

$$a_{j}=\frac{x_{j}}{\sum_{k=1}^{N}x_{k}} \tag{IV.5}$$

Note that the last point (for i = N) is (1,1). Since X decreases we have that L(X) is a concave polygonal curve that increases from (0,0) to (1,1). Its form is depicted in Fig. IV.1.

Fig. IV.1 General form of a discrete Lorenz curve.
The power of this ingenious construction lies in the fact that it describes all aspects of concentration in one graph: the higher L(X), the more concentrated X is. Let us illustrate this. The vector of "no" concentration is X = (x,x,...,x) with x > 0 and L(X) is the straight line connecting (0,0) with (1,1), the lowest possible curve (since a Lorenz curve is concave). The vector of "highest" concentration is X = (x,0,...,0) with x > 0 and L(X) is the curve connecting (0,0) with (1/N, 1) and (1/N, 1) with (1,1), the highest possible curve. It is easy to see that if X' is constructed from X via an elementary transfer as in (IV.1), we have that L(X') > L(X) (here > means ≥ and not equal as curves). A theorem of Muirhead (see also Egghe and Rousseau (1991)) says that, conversely, L(X') > L(X) implies that X' can be derived from X via a finite number of elementary transfers (as in (IV.1)). Further, we trivially see that cX = (cx_1,...,cx_N) (c > 0) has the same Lorenz curve as X. So, the Lorenz curve is the right tool to describe concentration. It is now clear that any function C that respects the Lorenz order, i.e. which satisfies, for all vectors X, X',
$$L(X)<L(X')\Rightarrow C(X)<C(X') \tag{IV.6}$$

can be called a good measure of concentration. We are now in a position to construct good concentration measures: we only have to satisfy (IV.6) and then we are guaranteed of all the properties described in (i)-(vi) in Section IV.1. Do such good concentration measures exist? It turns out that there are many (see Egghe and Rousseau (1990c, 1991) and references therein). One evident good measure is the area under the Lorenz curve or, in its normalized form, the Gini index G (Gini (1909)):

$$G(X)=2\,\{\text{area under }L(X)\}-1$$

$$G(X)=2\int_{0}^{1}L(X)(y)\,dy-1 \tag{IV.7}$$
The next measures are based on a result of Hardy, Littlewood and Polya (1928, 1952) that states (in our terminology) that L(X) ≤ L(X') implies that

$$\sum_{i=1}^{N}\varphi(a_{i})\le\sum_{i=1}^{N}\varphi(a_{i}') \tag{IV.8}$$

for all convex continuous functions φ. Here X = (x_1,...,x_N), X' = (x'_1,...,x'_N) and (a_1,...,a_N) and (a'_1,...,a'_N) are defined as in (IV.5). From this result, applied to φ(x) = x², we have that
$$V^{2}(X)=N\sum_{i=1}^{N}a_{i}^{2}-1 \tag{IV.9}$$

is a good concentration measure. Note that (IV.9) is equivalent with (by (IV.5)):

$$V^{2}(X)=N\sum_{j=1}^{N}\frac{x_{j}^{2}}{\left(\sum_{k=1}^{N}x_{k}\right)^{2}}-1$$

$$V^{2}(X)=\frac{\sigma^{2}}{\mu^{2}} \tag{IV.10}$$
being the quotient of the variance and the square of the average of X. For this reason one calls V = V(X) the variation coefficient and, because of the above, it is a good concentration measure. If we take φ(x) = x ln x we find Theil's measure (Theil (1967))

$$Th(X)=\ln N+\sum_{i=1}^{N}a_{i}\ln a_{i} \tag{IV.11}$$

For other good concentration measures we refer the reader to Egghe and Rousseau (1990a, 1990c, 1991).
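The discrete measures above are straightforward to compute. The following self-contained Python sketch (the vector X is an arbitrary illustration, not data from the text) implements the Lorenz points and the measures G, V² and Th of (IV.4)-(IV.11):

```python
# Sketch: discrete Lorenz curve (IV.4)-(IV.5) and the measures
# G (IV.7), V^2 (IV.9)-(IV.10), Th (IV.11). Pure Python, no dependencies.
from math import log

def lorenz_points(X):
    N, total = len(X), sum(X)
    pts, cum = [(0.0, 0.0)], 0.0
    for i, x in enumerate(sorted(X, reverse=True), start=1):
        cum += x / total
        pts.append((i / N, cum))
    return pts

def gini(X):
    pts = lorenz_points(X)
    # trapezoidal area under the polygonal Lorenz curve, then (IV.7)
    area = sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
    return 2 * area - 1

def v2(X):   # squared variation coefficient, (IV.9)
    N, total = len(X), sum(X)
    return N * sum((x / total) ** 2 for x in X) - 1

def theil(X):  # Theil's measure, (IV.11)
    N, total = len(X), sum(X)
    return log(N) + sum((x / total) * log(x / total) for x in X)

X = [9, 4, 2, 1]
print(gini(X), v2(X), theil(X))  # 0.40625 0.59375 0.2828...
```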
So far for general discrete Lorenz concentration theory. Since we want to have a Lotkaian concentration theory, we are at a point to interpret L(X), G(X), V²(X) and Th(X) for X as in (IV.2) or (IV.3). This is not easy for two reasons. First of all the linear parts in L(X) destroy the power law and make calculations on L(X) (e.g. for G(X)) difficult. Secondly, as already remarked in Chapter I, evaluating discrete sums as in (IV.9) and (IV.11), ending up with analytic formulae, is not possible. Using continuous functions is a solution to both problems: if we can define a Lorenz curve for a continuous function, we may fully work with this function and we do not have to introduce linear parts; also, finite sums are replaced by integrals, which can be evaluated. This will be worked out from the next section on.
In Egghe (1987b) we made an attempt to calculate G (or rather Pratt's measure (N/(N−1))G, which is close to G for large N and which will not have a purpose in our continuous theory to follow) for some informetric functions including the ones of Zipf and Lotka. The results are, necessarily, approximative. Since we will obtain exact results in the continuous case, we will not go into these results here.
IV.3 CONTINUOUS CONCENTRATION THEORY

IV.3.1 General theory

Let h : [1, x_m] → ℝ⁺ denote any positive decreasing continuous function on the interval [1, x_m], where x_m > 1. Based on the discrete case we define the Lorenz curve of h, denoted L(h), as given by the set of points, for x ∈ [1, x_m],

$$\left(\frac{x-1}{x_{m}-1},\ \frac{\int_{1}^{x}h(x')\,dx'}{\int_{1}^{x_{m}}h(x')\,dx'}\right) \tag{IV.12}$$

In other words, putting y = (x−1)/(x_m−1) ∈ [0,1] (hence x = y(x_m−1)+1), the Lorenz curve L(h) of h is the function

$$L(h)(y)=\frac{\int_{1}^{y(x_{m}-1)+1}h(x')\,dx'}{\int_{1}^{x_{m}}h(x')\,dx'} \tag{IV.13}$$
This approach on defining the Lorenz curve is direct (on h) and similar to the definition of L(X) for discrete vectors X. It is different (in its formulation) from the definitions in Atkinson (1970) and Gastwirth (1971, 1972) in econometrics but all definitions are mathematically the same. In addition, our approach allows for finite x_m, which is not the case in the other approaches.

It follows that

$$L(h)'(y)=\frac{(x_{m}-1)\,h\left(y(x_{m}-1)+1\right)}{\int_{1}^{x_{m}}h(x')\,dx'} \tag{IV.14}$$

and that

$$L(h)''(y)=\frac{(x_{m}-1)^{2}\,h'\left(y(x_{m}-1)+1\right)}{\int_{1}^{x_{m}}h(x')\,dx'} \tag{IV.15}$$

hence L(h) is a concavely increasing function from (0,0) to (1,1) (since h > 0, h' ≤ 0). Its general form is depicted in Fig. IV.2. Since L(h)' is continuous we have that L(h) is a C¹ (i.e. a smooth) function (see e.g. Apostol (1957) or Protter and Morrey (1977)).

Fig. IV.2 General form of a continuous Lorenz curve.
As in the discrete case such a Lorenz curve can be used to measure the concentration in the (continuous) set of values h(x) for x ∈ [1, x_m]. Based on the discrete results, we can also define C to be a good measure of (continuous) concentration if we have the following implication for all functions h and h' as above:

$$L(h)<L(h')\Rightarrow C(h)<C(h') \tag{IV.16}$$

(L(h) < L(h') means: L(h) ≤ L(h') and L(h) ≠ L(h') as functions).

It is now (as in the previous section) the task to find such good measures. In Egghe (2002a), based on an inequality in Hardy, Littlewood and Polya (1928), we proved that any measure of the type

$$C(h)=\int_{0}^{1}\varphi\left(L(h)'(y)\right)dy \tag{IV.17}$$

where φ is a continuous convex function, satisfies (IV.16) (since L(h) is a C¹ function). This result is the continuous extension of (IV.8) and yields the continuous extensions of (IV.9) and (IV.11): we define, for h as above:

$$V^{2}(h)=\int_{0}^{1}\left[L(h)'\right]^{2}(y)\,dy-1 \tag{IV.18}$$

$$Th(h)=\int_{0}^{1}L(h)'(y)\ln\left(L(h)'(y)\right)dy \tag{IV.19}$$

Of course (IV.16) is also satisfied for the Gini index

$$G(h)=2\int_{0}^{1}L(h)(y)\,dy-1 \tag{IV.20}$$

being twice the area under the Lorenz curve of h, minus 1.
Having (IV.13) at our disposition, we can also give answers to the general 80/20-rule: determine, given x ∈ ]0,1[, the number θ ∈ ]0,1[ such that the 100x% of most productive sources produce 100θ% of the items. Indeed, it suffices to take x = y and θ = L(h)(y) as in (IV.13). Hence for all values x = y ∈ ]0,1[, the generalized 80/20-rule is given by the value L(h)(y), read on the graph of the Lorenz curve of h. This will be applied to power law types so that the 80/20-rule for power laws follows from its Lorenz concentration theory, to be developed in the next subsection.
IV.3.2 Lotkaian continuous concentration theory

IV.3.2.1 Lorenz curves for power laws

It is the purpose to apply continuous concentration theory to the functions of Lotka (II.84):

$$f(j)=\frac{C}{j^{\alpha}} \tag{IV.21}$$

α > 0, j ∈ [1, p_m], and of Zipf (II.89)

$$g(r)=\frac{E}{r^{\beta}} \tag{IV.22}$$

r ∈ [1, T+1]. Both are power laws and since T → ∞ if and only if p_m → ∞ in case (IV.22) applies (see Note II.4.2.2 (1)), we have that both functions f and g are defined on a bounded interval (T < ∞, p_m < ∞) or on an unbounded interval (T = ∞, p_m = ∞). Therefore we study the Lorenz curve of a general power function

$$h(x)=\frac{K}{x^{\gamma}} \tag{IV.23}$$

x ∈ [1, x_m], x_m > 1, K > 0, γ > 0. By (IV.13) we have that
$$L(h)(y)=\frac{\left(y(x_{m}-1)+1\right)^{1-\gamma}-1}{x_{m}^{1-\gamma}-1} \tag{IV.24}$$

if γ ≠ 1 and

$$L(h)(y)=\frac{\ln\left(y(x_{m}-1)+1\right)}{\ln x_{m}} \tag{IV.25}$$

if γ = 1. In the previous subsection we have not dealt with the case x_m = ∞. Since we have the application on the functions of Lotka and Zipf in mind, where p_m = ∞ and T = ∞ are also studied (cf. the existence theory for the function of Lotka - Subsection II.2.1), we have to include the case x_m = ∞. We define the Lorenz curve L(h) for h as in (IV.23) with x_m = ∞ as the limit of expression (IV.24) for x_m → ∞:

$$\lim_{x_{m}\to\infty}\frac{\left(y(x_{m}-1)+1\right)^{1-\gamma}-1}{x_{m}^{1-\gamma}-1}$$

It turns out that this limit is always equal to 1, except if γ < 1. In this case we have

$$\lim_{x_{m}\to\infty}L(h)(y)=y^{1-\gamma} \tag{IV.26}$$
In the case of Lotka's law we hence can use (IV.26) only in the case α < 1 (implying T = A = ∞, by (II.20) and (II.21) and the fact that here p_m = x_m → ∞). In the case of the law of Zipf we can use (IV.26) only in the case β < 1, hence for α > 2, since

$$\beta=\frac{1}{\alpha-1} \tag{IV.27}$$

as was proved in Subsection II.4.2. We can immediately apply (IV.26) to the Zipf function to prove the generalized 80/20-rule in Lotkaian informetrics.
Theorem IV.3.2.1.1: Let β < 1, hence α > 2. Let μ denote the average number of items per source. Then

$$\lim_{T\to\infty}L(g)(y)=y^{\frac{1}{\mu}} \tag{IV.28}$$

where g is the Zipf function (IV.22). In words: a fraction y ∈ [0,1] of the top sources produces a fraction y^{1/μ} of the items, i.e. the generalized 80/20-rule for power laws.

Proof: In formula (IV.26), for each finite T, y equals (r−1)/T (by (IV.22)), the fraction (in [1, T+1]) of the top sources (since g decreases). By definition, L(g)(y) is the fraction of items produced by these sources. So L(g)(y) represents the generalized 80/20-rule. But

$$1-\gamma=1-\beta=\frac{\alpha-2}{\alpha-1}$$

(for g) by (IV.27). Invoke (II.30):

$$\mu=\frac{\alpha-1}{\alpha-2}$$

in the limiting case p_m → ∞. But by Note II.4.2.2 (1): p_m → ∞ iff T → ∞, which is the case here since we take lim_{T→∞}. This proves (IV.28). □
This result is not new: it appears implicitly in Gastwirth (1972) (p. 307, Table 1), we think for the first time. It can also be read (implicitly) in Burrell (1992b), p. 22. See also Egghe (1986) for a discrete version of this result in case α = 2 and Gupta (1989) for an application of it. Another reference on the use of Lorenz curves in the study of the generalized 80/20-rule (but not in Lotkaian informetrics) is Burrell (1985).
The following corollary to Theorem IV.3.2.1.1 can be read in Egghe (1993a).

Corollary IV.3.2.1.2 (Egghe (1993a)): Under the assumptions of Theorem IV.3.2.1.1, if we have two such situations with μ₁ < μ₂, then L(g₁)(y₁) = L(g₂)(y₂) implies y₁ > y₂. In words: if both IPPs satisfy Zipf's law as in the theorem and if the average production in the first IPP is smaller than in the second one then, in order to have the same fraction of items, we need a larger fraction of top sources (which produce these items) in the first IPP than in the second one.

Proof: μ₁ < μ₂ implies 1/μ₁ > 1/μ₂. Further, by (IV.28), L(g₁)(y₁) = L(g₂)(y₂) implies

$$y_{1}^{\frac{1}{\mu_{1}}}=y_{2}^{\frac{1}{\mu_{2}}}$$

whence y₁ > y₂ (since y₁, y₂ < 1). □
Table IV.1 shows the values of y in function of μ for L(g)(y) = 0.8 (i.e. an 80/100y-rule), based on (IV.28).
This table also illustrates Corollary IV.3.2.1.2. This corollary is also true in some non-Lotkaian informetric theories as shown in Egghe (1993a), but in this paper also examples of non-Lotkaian informetric theories are given where Corollary IV.3.2.1.2 is false.

Corollary IV.3.2.1.2 implies for instance that, in a library, the higher the average number of borrowings per book, the smaller the fraction of the (most popular) books needed in order to have a fixed (say 80%) fraction of all borrowings, hence the larger the fraction of the least popular books that yield the other fraction (say 20%) of all borrowings. This implies that in libraries with a high average number of borrowings per book (e.g. public libraries in comparison with scientific libraries) one can weed more of the low-popular books, hereby only losing a (low) fixed percentage of all borrowings.
Table IV.1 Values of y in function of μ for 80% of the items. Reprinted from Egghe (1993a), Table 3, p. 373. Copyright John Wiley & Sons Limited. Reproduced with permission.

μ     y
1     0.80
2     0.64
3     0.51
4     0.41
5     0.33
6     0.26
7     0.21
8     0.17
9     0.13
10    0.11
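Table IV.1 follows directly from (IV.28): solving y^{1/μ} = 0.8 gives y = 0.8^μ. A one-line Python check (illustrative only) reproduces the column:

```python
# Sketch: Table IV.1 via (IV.28): L(g)(y) = y**(1/mu) = 0.8, hence y = 0.8**mu.
for mu in range(1, 11):
    print(mu, round(0.8**mu, 2))  # 1 -> 0.80, 2 -> 0.64, ..., 10 -> 0.11
```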
Let us now go back to the general case x_m < ∞ or x_m = ∞ and prove the following theorem on Lorenz curves for power functions.
Theorem IV.3.2.1.3 (Egghe (2004c)): The Lorenz curve L(h) for a general power law as in (IV.23) is strictly increasing in γ.

Proof: Since we keep x_m fixed here it suffices to show (by (IV.12)) that γ₁ < γ₂ implies that

$$\frac{\int_{1}^{x}\frac{dx'}{x'^{\gamma_{1}}}}{\int_{1}^{x_{m}}\frac{dx'}{x'^{\gamma_{1}}}}\le\frac{\int_{1}^{x}\frac{dx'}{x'^{\gamma_{2}}}}{\int_{1}^{x_{m}}\frac{dx'}{x'^{\gamma_{2}}}}$$

for all x ∈ [1, x_m]. This is equivalent with

$$\int_{1}^{x}\frac{dx'}{x'^{\gamma_{1}}}\,\int_{x}^{x_{m}}\frac{dy'}{y'^{\gamma_{2}}}\le\int_{1}^{x}\frac{dx'}{x'^{\gamma_{2}}}\,\int_{x}^{x_{m}}\frac{dy'}{y'^{\gamma_{1}}}$$

For this it suffices to show that, for all x' ∈ [1,x] and all y' ∈ [x, x_m],

$$\frac{1}{x'^{\gamma_{1}}}\cdot\frac{1}{y'^{\gamma_{2}}}\le\frac{1}{x'^{\gamma_{2}}}\cdot\frac{1}{y'^{\gamma_{1}}}$$

or

$$x'^{\gamma_{2}-\gamma_{1}}\le y'^{\gamma_{2}-\gamma_{1}}.$$

Since γ₂ > γ₁ we can denote γ₂ = γ₁ + ε where ε > 0. In this notation, the above inequality reads

$$x'^{\varepsilon}\le y'^{\varepsilon}$$

which is trivially true since x' ≤ y' always. □
Important remark IV.3.2.1.4: It is clear that the above theorem applies to Lotka's law (IV.21) and to Zipf's law (IV.22). So we have that L(f) is strictly increasing in α (i.e. the inequality increases with α): hereby one examines the inequality in the numbers f(j), j ∈ [1, p_m]. As mentioned before: much more important, from an informetric point of view (but mathematically the same), is the study of the inequality in the numbers g(r), r ∈ [1, T+1], since these numbers express the source productivity. So, Theorem IV.3.2.1.3 gives that L(g) is strictly increasing in β, hence the inequality increases with β. The latter result can be interpreted in function of α by using (IV.27). So we have the result that L(g) decreases in α. Since this is a source of confusion (see Wagner-Dobler and Berg (1995), Rao (1988), Yoshikane and Kageura (2004) (see (*) p. 438)) we state these results (in α) explicitly as a corollary.
Corollary IV.3.2.1.5: Let f and g be as above. Then

(i) L(f) strictly increases in α
(ii) L(g) strictly increases in β
(iii) L(g) strictly decreases in α

It is clear from Rao (1988) that Rao refers to expression (i), so the criticism in Wagner-Dobler and Berg (1995) is not in order, especially since they discuss the inequality in Lotka's law (hence (i)) and not the one in Zipf's law (being (ii)). In the same way, Yoshikane and Kageura (2004) confuse (i) and (iii).
This dual interpretation of inequality is typical in informetrics: one can consider the size-frequency function f but also the rank-frequency function g, hereby interchanging the role of sources and items. In this sense Corollary IV.3.2.1.5 is not surprising. We will experience the same differences when we express the measures G, V² and Th for g in terms of β and α.
IV.3.2.2 Concentration measures for power laws

In this subsection we will calculate the measures G, V² and Th for general power laws h on [1, x_m], x_m < ∞ and x_m = ∞, and then we will interpret the results for the functions of Lotka and Zipf.
IV.3.2.2.1 Calculation of the Gini index

The easiest measure to calculate is the Gini index G (formula (IV.20)), being twice the area under the Lorenz curve, minus 1. Let γ ≠ 1 and x_m < ∞. Then (IV.24) implies for the Gini index of h, denoted G(h):

$$G(h)=2\int_{0}^{1}L(h)(y)\,dy-1$$

$$G(h)=\frac{2}{x_{m}^{1-\gamma}-1}\int_{0}^{1}\left(\left(y(x_{m}-1)+1\right)^{1-\gamma}-1\right)dy-1$$

$$G(h)=\frac{2}{\left(x_{m}^{1-\gamma}-1\right)\left(x_{m}-1\right)}\left[\frac{x_{m}^{2-\gamma}-1}{2-\gamma}-\left(x_{m}-1\right)\right]-1 \tag{IV.29}$$
The limiting value for x_m → ∞ is

$$\lim_{x_{m}\to\infty}G(h)=\frac{2}{2-\gamma}-1=\frac{\gamma}{2-\gamma} \tag{IV.30}$$

but here γ is restricted to γ < 1. This is in accordance with the calculation of G(h) using (IV.26), which is also restricted to γ < 1 and which also yields (IV.30), as can readily be checked.

So, for Lotka's function f we have, if p_m < ∞:

$$G(f)=\frac{2}{\left(p_{m}^{1-\alpha}-1\right)\left(p_{m}-1\right)}\left[\frac{p_{m}^{2-\alpha}-1}{2-\alpha}-\left(p_{m}-1\right)\right]-1 \tag{IV.31}$$
and for Zipf's function g we have (if T < ∞)

$$G(g)=\frac{2}{\left((T+1)^{1-\beta}-1\right)T}\left[\frac{(T+1)^{2-\beta}-1}{2-\beta}-T\right]-1 \tag{IV.32}$$

G(g) in function of α yields (using (IV.27))

$$G(g)=\frac{2}{T\left((T+1)^{\frac{\alpha-2}{\alpha-1}}-1\right)}\left[\frac{\alpha-1}{2\alpha-3}\left((T+1)^{\frac{2\alpha-3}{\alpha-1}}-1\right)-T\right]-1 \tag{IV.33}$$
For the limiting values we have for g, based on (IV.30):

$$\lim_{T\to\infty}G(g)=\frac{\beta}{2-\beta} \tag{IV.34}$$

$$=\frac{1}{2\alpha-3} \tag{IV.35}$$
(β < 1, equivalently, α > 2). Result (IV.35) was already proved in Burrell (1992b), where only this limiting case (and α > 2) is considered. These results are in agreement with Corollary IV.3.2.1.5, as they should be. We leave it to the reader to calculate G(h) for γ = 1, using formula (IV.25).

We refer to Egghe (1987b) for some discrete calculations of Pratt's measure (Pratt (1977)) which is essentially the same as the Gini index (see Carpenter (1979)).
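The limiting formula (IV.35) is easy to verify numerically. A Python sketch (assuming SciPy; α is an arbitrary choice) integrates the limiting Lorenz curve (IV.26) directly:

```python
# Sketch: numeric check of (IV.30)/(IV.35) for the limiting Gini index.
from scipy.integrate import quad

alpha = 2.5
beta = 1 / (alpha - 1)              # (IV.27)
L = lambda y: y**(1 - beta)         # limiting Lorenz curve (IV.26)
G_num = 2 * quad(L, 0, 1)[0] - 1
print(G_num, 1 / (2 * alpha - 3))   # both 0.5
```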
IV.3.2.2.2 Calculation of the variation coefficient

We present two methods: one based on (IV.18) for general h and one based on the formula V² = σ²/μ², which only yields V² for Zipf's function g.

1. First method

By (IV.18),

$$V^{2}(h)=\int_{0}^{1}\left[L(h)'\right]^{2}(y)\,dy-1,$$

where L(h) is given by (IV.24) (γ ≠ 1) and (IV.25) (γ = 1). If γ ≠ 1, we have

$$L(h)'(y)=\frac{(1-\gamma)(x_{m}-1)}{\left(x_{m}^{1-\gamma}-1\right)\left(y(x_{m}-1)+1\right)^{\gamma}} \tag{IV.36}$$

If γ = 1, we have

$$L(h)'(y)=\frac{x_{m}-1}{\left(\ln x_{m}\right)\left(y(x_{m}-1)+1\right)} \tag{IV.37}$$

Let now γ ≠ 1, γ ≠ 1/2. Then, by (IV.36),
$$V^{2}(h)=\int_{0}^{1}\frac{(1-\gamma)^{2}(x_{m}-1)^{2}}{\left(x_{m}^{1-\gamma}-1\right)^{2}\left(y(x_{m}-1)+1\right)^{2\gamma}}\,dy-1$$

$$V^{2}(h)=\frac{(1-\gamma)^{2}\left(x_{m}-1\right)\left(x_{m}^{1-2\gamma}-1\right)}{\left(x_{m}^{1-\gamma}-1\right)^{2}\left(1-2\gamma\right)}-1 \tag{IV.38}$$

If γ = 1/2, then we have

$$V^{2}(h)=\frac{\left(x_{m}-1\right)\ln x_{m}}{4\left(\sqrt{x_{m}}-1\right)^{2}}-1 \tag{IV.39}$$

If γ = 1, then (IV.37) implies

$$V^{2}(h)=\frac{\left(x_{m}-1\right)^{2}}{\left(\ln x_{m}\right)^{2}x_{m}}-1 \tag{IV.40}$$

So we have for Lotka's function f, if p_m < ∞ (only the case α ≠ 1/2, α ≠ 1 is given):

$$V^{2}(f)=\frac{(1-\alpha)^{2}\left(p_{m}-1\right)\left(p_{m}^{1-2\alpha}-1\right)}{\left(p_{m}^{1-\alpha}-1\right)^{2}\left(1-2\alpha\right)}-1 \tag{IV.41}$$

and for Zipf's function g, if T < ∞ and β ≠ 1/2, β ≠ 1:

$$V^{2}(g)=\frac{(1-\beta)^{2}\,T\left((T+1)^{1-2\beta}-1\right)}{\left((T+1)^{1-\beta}-1\right)^{2}\left(1-2\beta\right)}-1 \tag{IV.42}$$

which takes the following form in α > 1, α ≠ 2, α ≠ 3:
$$V^{2}(g)=\frac{(\alpha-2)^{2}\,T\left((T+1)^{\frac{\alpha-3}{\alpha-1}}-1\right)}{(\alpha-1)(\alpha-3)\left((T+1)^{\frac{\alpha-2}{\alpha-1}}-1\right)^{2}}-1 \tag{IV.43}$$

using (IV.27). Based on (IV.39) and (IV.40) the reader can give the analogous formulae for β = 1/2, β = 1 (hence α = 3, α = 2).
2. Second method

This method only applies to the calculation of V²(g) since we use the formula V²(g) = σ²/μ², where σ² and μ are the variance and average of the Zipf function g, which can be calculated using the Lotka function f/T as the weight function (cf. formula (II.20)):

$$\sigma^{2}=\int_{1}^{p_{m}}j^{2}\,\frac{f(j)}{T}\,dj-\mu^{2} \tag{IV.44}$$

So

$$V^{2}(g)=\frac{1}{\mu^{2}T}\int_{1}^{p_{m}}j^{2}f(j)\,dj-1 \tag{IV.45}$$

where μ = A/T, the average number of items per source. We have by (II.34) and (II.35), if α > 1, α ≠ 2, that
$$T=\frac{C}{\alpha-1}\left(1-p_{m}^{1-\alpha}\right) \tag{IV.46}$$

and that

$$A=\frac{C}{\alpha-2}\left(1-p_{m}^{2-\alpha}\right) \tag{IV.47}$$

Also, if α ≠ 3,

$$\int_{1}^{p_{m}}j^{2}f(j)\,dj=\int_{1}^{p_{m}}\frac{C}{j^{\alpha-2}}\,dj=\frac{C}{\alpha-3}\left(1-p_{m}^{3-\alpha}\right) \tag{IV.48}$$
(IV.46), (IV.47) and (IV.48) in (IV.45) now give

$$V^{2}(g)=\frac{(\alpha-2)^{2}}{(\alpha-1)(\alpha-3)}\cdot\frac{\left(1-p_{m}^{1-\alpha}\right)\left(1-p_{m}^{3-\alpha}\right)}{\left(1-p_{m}^{2-\alpha}\right)^{2}}-1 \tag{IV.49}$$
and we have the task of proving that (IV.49) and (IV.43) are the same. Since we have Zipf's law, (II.91) implies that

$$p_{m}=\left(\frac{C}{\alpha-1}\right)^{\frac{1}{\alpha-1}}$$

This and (II.34) or (IV.46) give

$$T+1=\frac{C}{\alpha-1}=p_{m}^{\alpha-1} \tag{IV.50}$$
Formula (IV.50) is the link between (IV.49) and (IV.43). We leave it to the reader to consider the cases α = 2 and α = 3.
3. Limiting case

For x_m → ∞, formula (IV.38) gives, if γ < 1/2:

$$\lim_{x_{m}\to\infty}V^{2}(h)=\frac{(1-\gamma)^{2}}{1-2\gamma}-1=\frac{\gamma^{2}}{1-2\gamma} \tag{IV.51}$$

This gives for V²(g), if β = γ < 1/2 (implying α > 3):

$$\lim_{T\to\infty}V^{2}(g)=\frac{\beta^{2}}{1-2\beta}=\frac{1}{(\alpha-1)(\alpha-3)} \tag{IV.52}$$

using again (IV.27). Note again that these formulae are in agreement with Corollary IV.3.2.1.5, as they should be. Formula (IV.51) also follows by direct calculation based on (IV.26) (here we also find the restriction γ < 1/2).
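Again a quick numerical confirmation is possible. The following Python sketch (assuming SciPy; γ is an arbitrary choice below 1/2) checks (IV.51) by integrating [L(h)']² for the limiting Lorenz curve (IV.26):

```python
# Sketch: check (IV.51), lim V^2(h) = gamma**2/(1 - 2*gamma) for gamma < 1/2,
# using L(h)(y) = y**(1-gamma), so L(h)'(y) = (1-gamma)*y**(-gamma).
from scipy.integrate import quad

gamma = 0.4
Lp = lambda y: (1 - gamma) * y**(-gamma)
V2_num = quad(lambda y: Lp(y)**2, 0, 1)[0] - 1
print(V2_num, gamma**2 / (1 - 2*gamma))  # both 0.8
```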
IV.3.2.2.3 Calculation of Theil's measure

We have to evaluate (IV.19) with L(h)' as in (IV.36) (γ ≠ 1) or as in (IV.37) (γ = 1). We will limit ourselves to the general case γ ≠ 1, leaving the other calculation to the reader. The calculation is straightforward but tedious. We have

$$Th(h)=\int_{0}^{1}L(h)'(y)\ln\left(L(h)'(y)\right)dy$$

$$=\int_{0}^{1}\frac{(1-\gamma)(x_{m}-1)}{\left(x_{m}^{1-\gamma}-1\right)\left(y(x_{m}-1)+1\right)^{\gamma}}\,\ln\!\left[\frac{(1-\gamma)(x_{m}-1)}{\left(x_{m}^{1-\gamma}-1\right)\left(y(x_{m}-1)+1\right)^{\gamma}}\right]dy$$

$$=\ln\!\left[\frac{(1-\gamma)(x_{m}-1)}{x_{m}^{1-\gamma}-1}\right]\frac{(1-\gamma)(x_{m}-1)}{x_{m}^{1-\gamma}-1}\int_{0}^{1}\frac{dy}{\left(y(x_{m}-1)+1\right)^{\gamma}}-\frac{\gamma(1-\gamma)(x_{m}-1)}{x_{m}^{1-\gamma}-1}\int_{0}^{1}\frac{\ln\left(y(x_{m}-1)+1\right)}{\left(y(x_{m}-1)+1\right)^{\gamma}}\,dy$$

$$=\ln\!\left[\frac{(1-\gamma)(x_{m}-1)}{x_{m}^{1-\gamma}-1}\right]-\frac{\gamma(1-\gamma)}{x_{m}^{1-\gamma}-1}\int_{1}^{x_{m}}\frac{\ln z}{z^{\gamma}}\,dz$$

using the substitution z = y(x_m − 1) + 1 (and the fact that ∫₀¹ L(h)'(y)dy = 1).
Now we use the following integral results, which can readily be checked:

$$\int\frac{dz}{z^{\gamma}}=\frac{z^{1-\gamma}}{1-\gamma}$$

$$\int\frac{\ln z}{z^{\gamma}}\,dz=\frac{z^{1-\gamma}}{1-\gamma}\ln z-\frac{z^{1-\gamma}}{(1-\gamma)^{2}}$$

We then obtain

$$Th(h)=\ln\!\left[\frac{(1-\gamma)(x_{m}-1)}{x_{m}^{1-\gamma}-1}\right]-\frac{\gamma\,x_{m}^{1-\gamma}\ln x_{m}}{x_{m}^{1-\gamma}-1}+\frac{\gamma}{1-\gamma} \tag{IV.53}$$
The measure Th(h) for x_m = ∞ (restricted to γ < 1, see (IV.26)) can be obtained in two ways: using (IV.26), noting that L(h)'(y) = (1−γ)y^{−γ} and reperforming the above calculation of Th(h), or, more simply, taking lim_{x_m→∞} Th(h) in (IV.53). Both ways give the formula (for γ < 1)

$$Th(h)=\ln(1-\gamma)+\frac{\gamma}{1-\gamma} \tag{IV.54}$$
For Zipf's function (IV.22) (the most important case) this gives (for β < 1)

$$Th(g)=\ln(1-\beta)+\frac{\beta}{1-\beta} \tag{IV.55}$$

In terms of Lotka's exponent α we have β = 1/(α−1) (hence α > 2):

$$Th(g)=\ln\left(\frac{\alpha-2}{\alpha-1}\right)+\frac{1}{\alpha-2} \tag{IV.56}$$

Note also that Th(g) increases in β and decreases in α, as predicted by Corollary IV.3.2.1.5.
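A numerical check of (IV.54)-(IV.55) is immediate (Python sketch assuming SciPy; β is an illustrative value):

```python
# Sketch: check Th = ln(1-beta) + beta/(1-beta) for the limiting Lorenz
# curve L(y) = y**(1-beta), whose derivative is (1-beta)*y**(-beta).
from math import log
from scipy.integrate import quad

beta = 0.4
Lp = lambda y: (1 - beta) * y**(-beta)
Th_num = quad(lambda y: Lp(y) * log(Lp(y)), 0, 1)[0]
print(Th_num, log(1 - beta) + beta / (1 - beta))  # agree (about 0.1558)
```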
Note: Theil's measure is (as the other measures G and V²) calculated on the Lorenz curve, hence on normalized data in terms of abscissa and ordinate (see Figs. IV.1 and IV.2). It is different from the classical formula for the entropy of a distribution. To see this difference, let us first look at the discrete case. Theil's measure of the vector X = (x_1,...,x_N) is given by (IV.11), where the a_i are given by (IV.5). In the same notation, the entropy of the vector X is defined as (use log = ln)

$$H=H(X)=-\sum_{i=1}^{N}a_{i}\ln a_{i} \tag{IV.57}$$

Hence we have the relation

$$Th(X)+H(X)=\ln N \tag{IV.58}$$

or

$$Th(X)=-H(X)+\ln N \tag{IV.59}$$

We notice two differences between Th and H: although they are linearly related, the relation is decreasing. In other words, H increases if and only if Th decreases. Now Th is a good measure of concentration (inequality), hence H is a good measure of dispersion (i.e. of equality), used e.g. in biology to measure diversity (see Rousseau and Van Hecke (1999)). Further |Th| ≠ H due to the fact that Th is calculated on the normalized Lorenz curve while H is not.
In the same way Th and H are different in the continuous case. We calculated Th already above. In the same way Lafouge and Michel (2001) calculate H (in fact, our proof of the Th formula was based on their proof) via the formula (f = Lotka function)

$$H(f)=-\int_{1}^{\infty}f(j)\ln\left(f(j)\right)dj \tag{IV.60}$$

For α > 1, they find

$$H(f)=-\ln(\alpha-1)+\frac{1}{\alpha-1}+1 \tag{IV.61}$$

(note that H(f) decreases with α, in agreement with Corollary IV.3.2.1.5 since −H(f) is a good concentration measure). The result, apparently, was first stated by Yablonsky (1980). For H(g), the same formula (IV.61) applies but with α replaced by β > 1 (hence 1 < α < 2).
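The entropy formula (IV.61) can be checked the same way (a Python sketch assuming SciPy; α is illustrative and the density used is the normalized Lotka function, an assumption consistent with (IV.60)):

```python
# Sketch: numeric check of (IV.60)-(IV.61) for the normalized Lotka density
# f(j) = (alpha - 1) * j**(-alpha) on [1, infinity).
from math import log
from scipy.integrate import quad

alpha = 2.5
f = lambda j: (alpha - 1) * j**(-alpha)
H_num = quad(lambda j: -f(j) * log(f(j)), 1, float("inf"))[0]
print(H_num, -log(alpha - 1) + 1/(alpha - 1) + 1)  # both about 1.2612
```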
IV.3.3 A characterization of Price's law of concentration in terms of Lotka's law and of Zipf's law

In Theorem IV.3.2.1.1 we expressed the arithmetic fraction of the sources in [1, T+1] by numbers of the form (r−1)/T, boiling down to an arithmetic fraction in [0,T]. For a geometric fraction (needed in the formulation of Price's law) we cannot use the interval [0,T] since T^θ, θ ∈ [0,1], has the minimal value 1 (for θ = 0). This, once more, shows the value of the use of the interval [1, T+1]: here a geometric fraction is perfectly expressed by the form (T+1)^θ, θ ∈ [0,1]. This explains the used intervals in the following theory on Price's law.
The following proposition is a continuous extension of a result in Egghe and Rousseau (1986).

Proposition IV.3.3.1: Denote by G(r), for r ∈ [1, T+1], the cumulative number of items produced by the sources in the interval [1,r]. Then the following assertions are equivalent:

(i)

$$G(r)=B\ln r \tag{IV.62}$$

r ∈ [1, T+1], where B is a constant,

(ii) The law of Price is valid: for every θ ∈ ]0,1[ (hence also for [0,1]), the top (T+1)^θ sources produce a fraction θ of the items.

Proof: (i) ⇒ (ii)

Since we have the top (T+1)^θ (θ ∈ ]0,1[) sources, their cumulative production is given by (definition of G and since (T+1)^θ ∈ [1, T+1]):

$$G\left((T+1)^{\theta}\right)=B\ln\left((T+1)^{\theta}\right)$$

$$G\left((T+1)^{\theta}\right)=\theta\,G(T+1)$$
But G(T+1) denotes the total number of items, hence the top (T+1)^θ sources produce a fraction θ of the items.

(ii) ⇒ (i)

Let r ∈ [1, T+1] be arbitrary. Hence there exists θ ∈ [0,1] such that r = (T+1)^θ. By (ii) and the definition of G we have that

$$G\left((T+1)^{\theta}\right)=\theta\,G(T+1)$$

for θ ∈ ]0,1[. This is also true for θ = 1 and for θ = 0 (since we work with source densities). But r = (T+1)^θ implies ln r = θ ln(T+1). Hence

$$G(r)=\frac{\ln r}{\ln(T+1)}\,G(T+1)$$

$$G(r)=B\ln r$$

where B = G(T+1)/ln(T+1), a constant. □
Corollary IV.3.3.2 (Egghe (2004c)): We have the equivalence of the following statements:

(i) Price's law is valid
(ii) Zipf's law (IV.22) is valid for β = 1
(iii) Lotka's law (IV.21) is valid for α = 2 and

$$p_{m}=C \tag{IV.63}$$

Proof: The equivalence of (ii) and (iii) was proved in Subsection II.4.2.
As to the equivalence of (i) and (ii) it suffices, by Proposition IV.3.3.1, to show that Zipf's law for β = 1 is equivalent with G(r) = B ln r, r ∈ [1, T+1], where B is a constant. In other words we have to show that

$$g(r)=\frac{E}{1+r}$$

r ∈ [0,T], is equivalent with

$$G(r)=B\ln(1+r)$$

r ∈ [0,T]. This follows from formula (II.9) in Subsection II.1.2 or from Theorem II.2.2.1 and formula (II.48). □
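Proposition IV.3.3.1 is easy to illustrate directly (a Python sketch; T and B are arbitrary illustrative values): with G(r) = B ln r, the top (T+1)^θ sources produce exactly the fraction θ of the items:

```python
# Sketch: Price's law under G(r) = B*ln(r) (Proposition IV.3.3.1).
from math import log

T, B = 10_000.0, 3.0
G = lambda r: B * log(r)
for theta in (0.25, 0.5, 0.75):
    print(theta, G((T + 1)**theta) / G(T + 1))  # fraction equals theta
```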
We have the following (for historic reasons - see further) remarkable corollary.

Corollary IV.3.3.3 (Allison, Price, Griffith, Moravcsik and Stewart (1976) for Price's square root law and Egghe (2004c) for the general case): If α = 2 then Price's law is equivalent with Lotka's law for which the following relation is valid:

$$p_{m}=C \tag{IV.64}$$

Proof: This follows readily from the previous corollary. □

The result in Corollary IV.3.3.3 was found in Allison, Price, Griffith, Moravcsik and Stewart (1976) after a long approximative calculation (and limited to Price's law for θ = 1/2, i.e. Price's square root law; that the result is valid for the general law of Price was first proved, exactly, in Egghe (2004c)). This paper apparently (see editor's note) grew out of lengthy and frequently heated correspondence between these authors on the validity of Price's square root
law in case of Lotka's law. We hereby show that a long debate on this issue is not necessary since even a more general result can be proved in an exact way.
Note also that the results in Proposition IV.3.3.1 and Corollary IV.3.3.2 are exact formulations of a "feeling" of De Solla Price (in De Solla Price (1963)) that Lotka's law, Price's law and Zipf's law (there called Pareto's law) describe "something like the approximate law of Fechner or Weber in experimental psychology, wherein the true measure of the response is taken not by the magnitude of the stimulus but by its logarithm; we must have equal intervals of effort corresponding to equal ratios of numbers of publications" (p. 50).
We refer the reader to Glanzel and Schubert (1985) for a discrete characterization of Price's (square root, i.e. θ = 1/2) law. Further discrete calculations on Lotka's law in connection with Price's law can be found in Egghe (1987a). Practical investigations on the validity of Price's law can be found in Berg and Wagner-Dobler (1996), Nicholls (1988) and Gupta, Sharma and Kumar (1998). We also give the reference Bensman (1982) for general examples of concentration in citation analysis and in library use.
IV.4 CONCENTRATION THEORY OF LINEAR THREE-DIMENSIONAL INFORMETRICS

Three-dimensional informetrics was described in Chapter III, which was almost entirely devoted to linear three-dimensional informetrics. The composition of the two IPPs was proved to be a situation of positive reinforcement, i.e. where the rank-frequency function of the first IPP, denoted by g₁, was "reinforced" as g = φ∘g₁, with φ a strictly increasing function such that φ(x) > x, and where g is the rank-frequency function of the composed IPP. In positive reinforcement we hence keep the source set and we reinforce the production of these sources. Also in Chapter III, one studies Type/Token-Taken (TTT) informetrics where the use (in the second IPP) of the items (in the first IPP) was studied. Both cases yield a new IPP of which we have proved informetric properties in Chapter III.
It is, hence, natural to investigate the concentration properties of the positively reinforced IPP as well as the concentration properties of TTT informetrics. This will be the topic of this section.

IV.4.1 The concentration of positively reinforced IPPs

The reader is requested to re-read Subsection III.1.3.1. To put it simply: the rank-frequency function g₁ of the first IPP is composed with a strictly increasing function φ given by (III.5) or (III.6). Let us denote by L(g₁) the Lorenz curve of g₁ and by L(g) = L(φ∘g₁) the Lorenz curve of g = φ∘g₁.
Theorem IV.4.1.1 (Fellman (1976)):

(i) L(φ∘g₁) ≥ L(g₁) if φ(x)/x is increasing
(ii) L(φ∘g₁) = L(g₁) if φ(x)/x is constant
(iii) L(φ∘g₁) ≤ L(g₁) if φ(x)/x is decreasing

Strict inequalities apply in (i) and (iii) when (ii) is not the case.
Proof: Note that T₁ equals the total number of sources in the first as well as in the composed IPP (only the productivity is changed, by φ). By (IV.13) we have:

$$L(\varphi\circ g_{1})(y)=\frac{\int_{1}^{yT_{1}+1}(\varphi\circ g_{1})(x')\,dx'}{\int_{1}^{T_{1}+1}(\varphi\circ g_{1})(x')\,dx'} \tag{IV.65}$$

and

$$L(g_{1})(y)=\frac{\int_{1}^{yT_{1}+1}g_{1}(x')\,dx'}{\int_{1}^{T_{1}+1}g_{1}(x')\,dx'} \tag{IV.66}$$
for y ∈ [0,1]. Note that here, as explained in Section IV.1, we use g₁ as a rank-frequency function on [1, T₁+1] (replace r by r+1) in order to be able to apply the general continuous Lorenz theory, the Lorenz curve being defined as in (IV.13) for a general positive decreasing function h on an interval of the form [1, x_m]. Note that
$$\int_{1}^{T_{1}+1}g_{1}(x')\,dx'=A_{1} \tag{IV.67}$$

the total number of items in the first IPP, and that

$$\int_{1}^{T_{1}+1}(\varphi\circ g_{1})(x')\,dx'=\int_{1}^{T_{1}+1}g(x')\,dx'=A \tag{IV.68}$$

the total number of items in the composed IPP. Also
$$\int_{1}^{yT_{1}+1}g_{1}(x')\,dx'=T_{1}\int_{0}^{y}g_{1}(y'T_{1}+1)\,dy' \tag{IV.69}$$

and

$$\int_{1}^{yT_{1}+1}(\varphi\circ g_{1})(x')\,dx'=T_{1}\int_{0}^{y}(\varphi\circ g_{1})(y'T_{1}+1)\,dy' \tag{IV.70}$$

(use the substitution x' = y'T₁ + 1). Denote by μ₁ = A₁/T₁ the average number of items per source in the first IPP and by μ = A/T₁ the average number of items per source in the composed IPP. Then (IV.67)-(IV.70) in (IV.65) and (IV.66) yield (writing, for short, g₁(y') for g₁(y'T₁+1))

$$L(\varphi\circ g_{1})(y)=\frac{1}{\mu}\int_{0}^{y}\varphi\left(g_{1}(y')\right)dy' \tag{IV.71}$$

$$L(g_{1})(y)=\frac{1}{\mu_{1}}\int_{0}^{y}g_{1}(y')\,dy' \tag{IV.72}$$
Now

$$\Delta(y)=L(\varphi\circ g_{1})(y)-L(g_{1})(y)=\int_{0}^{y}\left[\frac{\varphi\left(g_{1}(y')\right)}{\mu}-\frac{g_{1}(y')}{\mu_{1}}\right]dy' \tag{IV.73}$$

Note that Δ(0) = Δ(1) = 0 and further

$$\Delta'(y)=\frac{\varphi\left(g_{1}(y)\right)}{\mu}-\frac{g_{1}(y)}{\mu_{1}} \tag{IV.74}$$
Suppose (i). Then, since φ(x)/x increases, φ(g₁(y))/g₁(y) decreases since g₁ decreases. If we had

$$\frac{\varphi\left(g_{1}(y)\right)}{\mu}<\frac{g_{1}(y)}{\mu_{1}} \tag{IV.75}$$

for all y ∈ [0,1], then by (IV.73) and using continuity and positivity of the functions we would have that Δ(1) < 0. If we had (IV.75) but with < replaced by > for all y ∈ [0,1] then we would have Δ(1) > 0. Since Δ(1) = 0 we hence have, by continuity of the function φ(g₁)/g₁, the existence of a point y₀ ∈ ]0,1[ such that

$$\frac{\varphi\left(g_{1}(y_{0})\right)}{\mu}=\frac{g_{1}(y_{0})}{\mu_{1}} \tag{IV.76}$$

Since φ(g₁(y))/g₁(y) decreases we hence have, by (IV.76) and (IV.74), that Δ'(y) ≥ 0 on ]0,y₀[ and Δ'(y) ≤ 0 on ]y₀,1[. Hence, since Δ(0) = Δ(1) = 0, we have that Δ(y) ≥ 0 on [0,1], hence

$$L(\varphi\circ g_{1})(y)\ge L(g_{1})(y)$$

for all y ∈ [0,1], proving (i). (iii) is proved in the same way and, by (IV.74) and (IV.76), (ii) is trivial. By the above proof, it is clear that strict inequalities apply in cases (i) and (iii) when (ii) is not the case. □
The above theorem (in case φ(x)/x increases), restricted to the discrete case, was proved in Rousseau (1992a), but Burrell (1992a) pointed out the existence of the general result of Fellman (1976). This theorem gives the complete situation of positive reinforcement. Based on Theorem IV.4.1.1 we do not always have a more concentrated situation after positive reinforcement. Indeed, it is clear that, although φ implies an increase in the number of items per source of
the first IPP, this does not always imply an increase in concentration. Indeed, examples abound: one can add items to each source so that they all have equal production, in which case we even have equality, hence the lowest concentration that is possible (and L(φ∘g₁)(y) = y on [0,1]). Another, more realistic case is given by adding to all productions the same amount of items so that we are in situation (vi) in Section IV.1 (the principle of nominal increase), hence a decrease of concentration, hence an L(φ∘g₁) which is lower than L(g₁).
The following corollary trivially follows from Theorem IV.4.1.1.

Corollary IV.4.1.2: Let φ be a power law of the type φ(x) = bx^a with a, b positive parameters. Then L(φ∘g₁) > L(g₁) iff a > 1, L(φ∘g₁) = L(g₁) iff a = 1 and L(φ∘g₁) < L(g₁) iff a < 1.

Proof: Note that here φ(x)/x = bx^{a−1} which increases iff a > 1, is constant iff a = 1 and decreases iff 0 < a < 1. Since these are mutually exclusive cases and since we comprise all cases (since φ is a power law), we have that the conditions are necessary and sufficient in this case (hence the validity of the "iff" statements). □
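The corollary is easily illustrated numerically. The following Python sketch (assuming SciPy; the Zipf-type g₁ and the exponent a = 2 are illustrative choices, not data from the text) compares the two Lorenz curves at a few points:

```python
# Sketch: Corollary IV.4.1.2 with phi(x) = x**a, a = 2 > 1, applied to an
# illustrative rank-frequency function g1(r) = 1/r on [1, T+1].
from scipy.integrate import quad

T, a = 100.0, 2.0
g1 = lambda r: 1.0 / r
g = lambda r: g1(r) ** a                 # (phi o g1)(r)

def L(h, y):                             # continuous Lorenz curve, cf. (IV.65)
    return quad(h, 1, y * T + 1)[0] / quad(h, 1, T + 1)[0]

for y in (0.1, 0.5, 0.9):
    print(y, L(g, y) > L(g1, y))         # True: L(phi o g1) > L(g1)
```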
Note that we proved in Chapter III that a power law for φ is the case that occurs in linear three-dimensional Lotkaian informetrics (cf. (III.5) and (III.6)).

This is the time to give an example of positive reinforcement. It was presented in Rousseau (1992a), being at the same time a real-life example of a Lorenz curve construction.
Table IV.2 shows the available number of CDs in a local record library and the number of loans during the year 1990.
Table IV.2 Availability and 1990 loans of CDs in the Public Library of Puurs (Belgium). Reprinted from Rousseau (1992a), Table 2, p. 393. Copyright John Wiley & Sons Limited. Reproduced with permission.

Category                Number of CDs   Number of loans
CLASSICAL MUSIC
Orchestral              595             1,582
Concertos               340             649
Soloists (instr.)       313             627
Ensembles               151             185
Vocal, secular          639             1,120
Vocal, religious        330             563
Various                 107             395
NON-CLASSICAL MUSIC
Spoken recordings       148             764
Amusement music         1,274           5,615
Film music              427             2,290
Jazz                    491             817
Pop                     2,549           17,510
Ethnic music            235             339
Country and folk        244             672
Various                 475             1,387
We first consider the loans per category. Ranking loan data from highest to lowest (for the construction of the Lorenz curve) yields the following sequence:
17,510; 5,615; 2,290; 1,582; 1,387; 1,120; 817; 764; 672; 649; 627; 563; 395; 339; 185.
Now we calculate cumulative partial sums, that is the number of loans in the highest category, the sum of the first and the second category, the sum of the first three categories and so on. This gives the following sequence:
17,510; 23,125; 25,415; 26,997; 28,384; 29,504; 30,321; 31,085; 31,757; 32,406; 33,033; 33,596; 33,991; 34,330; 34,515.
We see that the total number of loans equals 34,515.
The Lorenz curve of these data is the curve which connects the points with coordinates
(0,0), (1/15, 17,510/34,515), (2/15, 23,125/34,515), (3/15, 25,415/34,515), (4/15, 26,997/34,515),..., (15/15, 34,515/34,515) = (1,1)
Fig. IV.3 Lorenz curve of loans per category. Reprinted from Egghe and Rousseau (2001), Fig. 2.15, p. 53. Copyright Europa Publications. Reproduced with permission from Taylor & Francis.
We now do the same for the availability categories.
We obtain the following decreasing availability sequence:
The corresponding sequence of cumulative partial sums is:
2,549; 3,823; 4,462; 5,057; 5,548; 6,023; 6,450; 6,790; 7,120; 7,433; 7,677; 7,912; 8,063; 8,211; 8,318.
Consequently, this Lorenz curve connects the points with coordinates:
(0,0), (1/15, 2,549/8,318), (2/15, 3,823/8,318),..., (1,1).
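These hand computations are easily automated. A Python sketch (data taken from Table IV.2) reproduces the cumulative sums and Lorenz points for the loans; the availability data work identically:

```python
# Sketch: cumulative loan sums and Lorenz points of the Puurs example.
loans = [1582, 649, 627, 185, 1120, 563, 395,
         764, 5615, 2290, 817, 17510, 339, 672, 1387]
xs = sorted(loans, reverse=True)
total = sum(xs)                     # 34,515
cum = 0
for i, x in enumerate(xs, 1):
    cum += x
    print(i / 15, cum / total)      # the Lorenz-curve points
```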
Here the availability curve is situated completely under the loans curve (see Fig. IV.4). This means that overall availability is more balanced (less concentrated) than loans.
IV.4.2 Concentration properties of Type/Token-Taken informetrics

In the positive reinforcement model, the rank-frequency function g₁ was composed to g = φ∘g₁, the rank-frequency function of the positively reinforced IPP, and in the previous subsection we proved properties of L(g) versus L(g₁).

In Type/Token-Taken (TTT) informetrics it was the size-frequency function that was changed. Let f denote the size-frequency function of the "start" IPP and f* the one in the TTT version. By (III.11):

$$f^{*}(j)=j\,f(j) \tag{IV.77}$$

for all j ∈ [1, p_m]. Note that in TTT informetrics we change the sources, not the production densities j ∈ [1, p_m], so the same interval is valid for the start IPP and for the one in TTT version.
Fig. IV.4 Lorenz curves of loans (a) and availability (b) per category. Reprinted from Egghe and Rousseau (2001), Fig. 2.16, p. 55. Copyright Europa Publications. Reproduced with permission from Taylor & Francis.
In this connection it is natural (cf. the dual approach of positive reinforcement versus TTT) to compare L(f*) with L(f), the Lorenz curves of f* respectively f. Unlike Theorem IV.4.1.1 we have only one possible result here: the TTT Lorenz curve is always below the Lorenz curve of the original size-frequency function.
Theorem IV.4.2.1 (Egghe): Let f be the size-frequency function of any IPP and let f* be the size-frequency function of its TTT version. Then
$$L(f^{*})\le L(f) \tag{IV.78}$$

Proof: According to (IV.13) and (IV.77) we have

$$L(f)(y)=\frac{\int_{1}^{y(p_{m}-1)+1}f(x')\,dx'}{\int_{1}^{p_{m}}f(x')\,dx'} \tag{IV.79}$$

and

$$L(f^{*})(y)=\frac{\int_{1}^{y(p_{m}-1)+1}x'f(x')\,dx'}{\int_{1}^{p_{m}}x'f(x')\,dx'} \tag{IV.80}$$

(note, as explained above, that both formulae use the same p_m). We will have proved (IV.78) if we can show that

$$\int_{x'=1}^{y(p_{m}-1)+1}\int_{x''=1}^{p_{m}}x'f(x')f(x'')\,dx'\,dx''\le\int_{x'=1}^{y(p_{m}-1)+1}\int_{x''=1}^{p_{m}}x''f(x')f(x'')\,dx'\,dx''$$

for all y ∈ [0,1]. Deleting the double integral over [1, y(p_m−1)+1] in both sides, this boils down to showing that

$$\int_{x'=1}^{y(p_{m}-1)+1}\int_{x''=y(p_{m}-1)+1}^{p_{m}}x'f(x')f(x'')\,dx'\,dx''\le\int_{x'=1}^{y(p_{m}-1)+1}\int_{x''=y(p_{m}-1)+1}^{p_{m}}x''f(x')f(x'')\,dx'\,dx'' \tag{IV.81}$$

But, for all (x', x'') ∈ [1, y(p_m−1)+1] × [y(p_m−1)+1, p_m] we have that x' ≤ x'', hence (IV.81) is trivially true and so is (IV.78). □
Note that, in case

$$f(j)=\frac{C}{j^{\alpha}}$$

j ∈ [1, p_m], the Lotka function, we have that

$$f^{*}(j)=\frac{C}{j^{\alpha-1}}$$

so that Theorem IV.4.2.1 is in agreement with Corollary IV.3.2.1.5 (i): also from this corollary it follows that L(f*) ≤ L(f) since α − 1 < α. Note, however, that Theorem IV.4.2.1 is valid for general size-frequency functions f.
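For a Lotkaian f the inequality (IV.78) is also easy to observe numerically (a Python sketch assuming SciPy; α, p_m and y are illustrative values):

```python
# Sketch: L(f*) <= L(f) (Theorem IV.4.2.1) for a Lotka function, with
# f*(j) = j * f(j) as in (IV.77).
from scipy.integrate import quad

alpha, pm, y = 2.5, 100.0, 0.3
f = lambda j: j**(-alpha)
fs = lambda j: j * f(j)

def L(h, y):                        # continuous Lorenz curve (IV.13)
    return quad(h, 1, y*(pm - 1) + 1)[0] / quad(h, 1, pm)[0]

print(L(fs, y), "<=", L(f, y))      # the TTT curve lies below
```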
V LOTKAIAN FRACTAL COMPLEXITY THEORY

V.1 INTRODUCTION

The most natural and exact way in which complexity of a system of relations can be described is by indicating its dimension in space. As an example we can consider a citation analysis study of how n journals are citing another k journals (possibly, journals can appear in both sets; both sets can even be equal, in which case n = k, but this is not a requirement). For each journal in the first set we can look at all the references that appear in the articles in this journal and in this way we can count the number of times one finds a reference to an article that was published in a journal that belongs to the second set. We hence obtain n vectors with k coordinates, hence an n × k matrix in which entry a_ij (i ∈ {1,...,n}, j ∈ {1,...,k}) denotes the number of references, appearing in articles of journal i, to articles of journal j. Such a situation can be considered as a cloud of n points that belong to k-dimensional space ℝ^k (ℝ denotes the real numbers). We here have the problem of visualizing (or at least describing) this k-dimensional cloud of points. We will not describe the techniques of multivariate statistics that can be used here in order to reach this goal: one uses dimension-reducing techniques such as principal components analysis, cluster analysis or multidimensional scaling in order to visualise the cloud of points in two dimensions, i.e. on a computer screen or a sheet of paper - see Egghe and Rousseau (1990a) for an extensive account on this, including several examples from informetrics.
The purpose of the present chapter is not to reduce the dimension of such a situation but rather to describe it mathematically. We will also limit our attention to the description of the dimension of IPPs, henceforth called the complexity of IPPs. Our main goal in this book, of course, will be the study of complexity of Lotkaian IPPs, i.e. where we have a size-frequency
function f that is of the power law type (II.27). So, here, we do not study (as in the example above) which source (journal) cites which source (journal) but the situation "how many sources (e.g. journals) have how many items (e.g. citations)", and this from the viewpoint of the "dimensionality" of the situation. In other words we will describe (Lotkaian) IPPs as fractals. A fractal is a subset of the k-dimensional space ℝ^k (k = 1, 2, 3, 4, ...) but, dependent on its shape, it does not necessarily incorporate the full k-dimensionality of ℝ^k. A simple example is a straight line in ℝ^k, which is a one-dimensional subset of ℝ^k, or a plane in ℝ^k, which is a two-dimensional subset of ℝ^k.
In the next section we will study the dimensionality of general subsets of ℝ^k: e.g. explain why a straight line has dimension 1 and a plane dimension 2, but we will also study "strange" subsets of ℝ^k for which we have to conclude that their dimension is not an integer but a general number in ℝ⁺ (the positive real numbers): such sets are called proper fractals. The reason for this study is their interpretation in terms of IPPs: it will be seen (in Section V.3) that IPPs can be interpreted as proper fractals. Lotkaian IPPs will be characterized as special self-similar fractals, i.e. fractals (subsets of ℝ^k) which are composed of a certain number of identical copies of themselves (up to a reduced scale). Therefore our study of fractals in the next section will focus mainly on such self-similar fractals and we will see that, for such sets, it is easy to calculate their fractal dimension (also introduced in the next section).
As said, in Section V.3 we will interpret IPPs as fractals and show the special self-similarity property of Lotkaian IPPs interpreted as fractals. It is now clear that the Lotka exponent α must play a central role in this: we will show that α − 1 is the fractal dimension of the self-similar fractal associated with a Lotkaian IPP. This, again, shows the importance of power laws.
V.2 ELEMENTS OF FRACTAL THEORY

In this section we will briefly review the most important aspects of fractal theory that we will need further on. For a more detailed account of these matters we refer the reader to Feder (1988), Falconer (1990) or Mandelbrot (1977a), the founding father of fractal theory and also the one who formulated the law of Mandelbrot - see Chapters I and II; in fact his
argument given in Subsection I.3.5 will lead us directly to the interpretation of random texts as fractals, which we will generalise to general Lotkaian IPPs - see further.
The concept of a fractal and of its fractal dimension will become clear once we have studied the "dimensionality" (or fractal aspects) of simple sets such as a line segment or a rectangle or a parallelepiped. This will be given in the next subsection.
V.2.1 Fractal aspects of a line segment, a rectangle and a parallelepiped
Let us take an arbitrary line segment. Suppose we cut this line segment into M ∈ ℕ equal, non-overlapping pieces. Each piece has a length which is a fraction r = 1/M < 1 of the length of the original line segment. The number r is called the scaling factor: the original line segment is reduced by a scale r, which yields the smaller line segments of which we need N = M = 1/r to cover the original line segment.

Suppose now we have a rectangle. We now reduce this rectangle with a scaling factor r = 1/M < 1 as above (i.e. obtaining a similar rectangle of which the sides are scaled with a factor r with respect to the original rectangle). It is now clear that we need N = M² = (1/r)² of these rectangles to cover the original rectangle in a non-overlapping way (except for the boundaries). In the same way, the reduction of a parallelepiped with a scale factor r = 1/M < 1 needs N = M³ = (1/r)³ of these scale-reduced parallelepipeds to cover the original one in a non-overlapping way (except for the boundaries).
It is hence clear that dimensionality is recovered by reducing the scale of the figure and then examining how many of these reduced-scale figures can cover the original one. This, obviously, only works with so-called self-similar figures, where the definition of self-similarity is as indicated above: there exists a scaling factor r < 1 such that a certain number of these "reduced" figures (with scale r) recover the original figure in a non-overlapping way (except for some boundary points). We will now see how this works for a less trivial (but still self-similar) figure: the triadic von Koch curve.
V.2.2 The triadic von Koch curve and its fractal properties. Extension to general self-similar fractals
We first explain the construction of the triadic von Koch curve, which is a figure in the plane ℝ². Its construction is depicted in Fig. V.1.
We start with a line segment (phase n = 0). For phase n = 1 this line segment is reduced to a line segment of length 1/3 of the original one (hence scaled with a factor r = 1/3) and then applied 4 times as indicated in Fig. V.1. For phase n = 2, we do the same on each of the 4 line segments as we did when going from phase 0 to phase 1. We continue in the same way in the other phases n = 3, 4, ...: in the limit we obtain the so-called triadic von Koch curve.
The curve has the property that a scale factor r = 1/M = 1/3 reduces the triadic von Koch curve to a curve which is congruent with the first fourth part of the curve itself, and hence we need N = 4 of these curves to cover the original one. Putting

N = \left(\frac{1}{r}\right)^{D_s} = M^{D_s}

as we did in the case of a line segment (D_s = 1), a rectangle (D_s = 2) and a parallelepiped (D_s = 3), we now have 4 = 3^{D_s}, hence

D_s = \frac{\ln 4}{\ln 3} \approx 1.26186    (V.1)

a non-integer number. The number D_s is called the similarity dimension of the triadic von Koch curve (which is indeed a self-similar fractal).
Fig. V.1 Example of a self-similar fractal: the triadic von Koch curve. Reprinted from Feder (1988), Fig. 2.8, p. 16. Reproduced with kind permission of Kluwer Academic Publishers.
The similarity dimension of any self-similar fractal in any k-dimensional space ℝ^k can be calculated this way: if we reduce the fractal with a scale r = 1/M < 1 and if we then need N identical copies of this reduced figure to recover the original figure, we define its similarity dimension to be

D_s = \frac{\ln N}{\ln M} = \frac{\ln N}{-\ln r}    (V.2)
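As an aside, formula (V.2) is trivial to evaluate; the following minimal Python sketch (ours, not part of the original text) reproduces the dimensions of the figures discussed in this section:

```python
import math

def similarity_dimension(n_copies: int, scale_ratio: float) -> float:
    """Similarity dimension D_s = ln N / (-ln r) of a self-similar figure
    covered by n_copies copies of itself, each scaled by scale_ratio
    (formula (V.2) with N = n_copies and r = scale_ratio)."""
    return math.log(n_copies) / -math.log(scale_ratio)

print(similarity_dimension(3, 1 / 3))   # line segment cut in 3 pieces: D_s = 1
print(similarity_dimension(9, 1 / 3))   # rectangle, sides scaled by 1/3: D_s = 2
print(similarity_dimension(27, 1 / 3))  # parallelepiped: D_s = 3
print(similarity_dimension(4, 1 / 3))   # triadic von Koch curve: D_s ≈ 1.26186
```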
Although we have seen above that D_s for line segments, rectangles and parallelepipeds really is the dimension of these geometrical figures, we still have to indicate what can be the definition of a fractal dimension in general and how (V.2) can be used in this connection for self-similar fractals. This will be done in the next subsection.
V.2.3 Two general ways of expressing fractal dimensions

Let F be any (non-empty) subset of ℝ^k (k = 1, 2, 3, 4, ...). We want to give an intuitively appealing definition of its fractal dimension, of course in agreement with our ideas on dimension for familiar sets such as "smooth" lines, surfaces, volumes etc. We will limit ourselves to two (non-equivalent) possible definitions.
V.2.3.1 The Hausdorff-Besicovitch dimension
Let U_i ⊆ ℝ^k, i ∈ ℕ, be a countable (or finite) cover of F, i.e.

F \subseteq \bigcup_{i=1}^{\infty} U_i

The cover (U_i)_{i∈ℕ} is called an ε-cover of F if, for each i ∈ ℕ, 0 < |U_i| ≤ ε, where |U_i| denotes the diameter of U_i in ℝ^k, i.e.

|U_i| = \sup\{\, \|x - y\| \;:\; x, y \in U_i \,\}
where ‖·‖ denotes the norm in ℝ^k. Define, for every s > 0 and ε > 0:

\mathcal{H}^s_{\varepsilon}(F) = \inf\left\{ \sum_{i=1}^{\infty} |U_i|^s \;:\; (U_i)_{i\in\mathbb{N}} \text{ is an } \varepsilon\text{-cover of } F \right\}    (V.3)

This means: we look at all ε-covers of F and minimise the sum of the s-th powers of the diameters. It is clear that the numbers \mathcal{H}^s_{\varepsilon}(F) increase as ε decreases, since the latter reduces the number of possible covers. Hence the limit

\mathcal{H}^s(F) = \lim_{\varepsilon \to 0} \mathcal{H}^s_{\varepsilon}(F)    (V.4)

always exists as a value ≥ 0 or ∞. \mathcal{H}^s(F) is called the s-dimensional Hausdorff measure of F.
Let (U_i)_{i∈ℕ} be an ε-cover of F and let t > s > 0. Then, obviously,

\sum_{i=1}^{\infty} |U_i|^t \le \varepsilon^{t-s} \sum_{i=1}^{\infty} |U_i|^s    (V.5)

Hence (V.5) implies, by (V.3), that

\mathcal{H}^t_{\varepsilon}(F) \le \varepsilon^{t-s}\, \mathcal{H}^s_{\varepsilon}(F)    (V.6)

This inequality has an important consequence: if \mathcal{H}^s(F) < ∞ then \mathcal{H}^t(F) = 0 for all t > s (let ε → 0). By reversing the roles of t and s in (V.6) we also see that if \mathcal{H}^s(F) > 0 and t < s, then \mathcal{H}^t(F) = +∞.
Thus the graph of \mathcal{H}^s(F) as a function of s (see Fig. V.2) shows that there is exactly one critical value of s at which \mathcal{H}^s(F) jumps from +∞ to 0. This critical value of s is called the Hausdorff-Besicovitch dimension and is denoted as s = dim_H(F).

Fig. V.2 Graph of \mathcal{H}^s(F) as a function of s. Reprinted from Falconer (1990), Fig. 2.3, p. 28. Copyright John Wiley & Sons Limited. Reproduced with permission.

In other words:

\dim_H(F) = \inf\{\, s : \mathcal{H}^s(F) = 0 \,\} = \sup\{\, s : \mathcal{H}^s(F) = +\infty \,\}    (V.7)
The intuitive interpretation of this is clear. Let e.g. F be a line segment. Then s = dim_H(F) = 1, \mathcal{H}^t(F) = 0 for t > 1 (e.g. for t = 2: the "area" of a line segment is 0), \mathcal{H}^t(F) = +∞ for t < 1, and \mathcal{H}^1(F) happens to be the length of this line segment. Let F be a rectangle. Then s = dim_H(F) = 2, \mathcal{H}^t(F) = 0 for t > 2 (e.g. for t = 3: the "volume" of a rectangle is 0), \mathcal{H}^t(F) = +∞ for t < 2 (e.g. t = 1: the "length" of a rectangle is +∞), and \mathcal{H}^2(F) is proportional (proportionality factor 4/π - see Falconer (1990)) to the area of this rectangle. This proportionality factor arises because, in (V.3), we use (e.g. for disks of radius r) (2r)^s, which equals 4r² for s = 2, while the area of a disk is πr². Let F be a parallelepiped. Then s = dim_H(F) = 3, \mathcal{H}^t(F) = 0 for t > 3, \mathcal{H}^t(F) = +∞ for t < 3, and \mathcal{H}^3(F) is proportional (proportionality factor 6/π - see Falconer (1990)) to the volume of this parallelepiped. This proportionality factor arises because, in (V.3), we use (e.g. for balls of radius r) (2r)^3, which equals 8r³ for s = 3, while the volume of a ball is (4/3)πr³. In general, \mathcal{H}^s(F), for s = dim_H(F) = n, equals c_n vol_n(F), where vol_n(F) is the n-dimensional volume of F (if it exists) and c_n is a constant depending only on n (see Falconer (1990)), but we do not need this result in this book.
Hence dim_H(F) represents the idea of a fractal dimension for any subset F of ℝ^n, for every n ∈ ℕ. It is recognized as the perfect description of the dimensional complexity of a set F. However, its disadvantage is that it is (usually) very difficult to determine dim_H(F). The next definition of dimension is in this sense more popular, both from the point of view of its relative ease of mathematical calculation and of its empirical estimation.
V.2.3.2 The box-counting dimension
For a general non-empty bounded subset F of ℝ^k (k ∈ ℕ) one can define a new dimension of F, inspired both by the similarity dimension for self-similar fractals and by the Hausdorff-Besicovitch dimension. Note that, as in V.2.3.1, we do not need F to be self-similar. The rough link between the Hausdorff-Besicovitch dimension dim_H(F) and formula (V.2) is as follows (an exact definition of the new dimension will be given after this heuristic argument). \mathcal{H}^s_{\varepsilon}(F) (formula (V.4)) is, for small ε, by (V.3), close to N_ε(F)ε^s, N_ε(F) being the smallest number of sets of diameter (in ℝ^k) at most ε which can cover F. \mathcal{H}^s(F) itself is a measure of the "s-dimensional length" of F (called length for s = 1, area for s = 2, volume for s = 3, ...), which is a constant denoted c. Hence N_ε(F)ε^s ≈ c, i.e.

N_\varepsilon(F) \approx c\, \varepsilon^{-s}    (V.8)
a power law in ε. Hence

\ln N_\varepsilon(F) \approx \ln c - s \ln \varepsilon    (V.9)

and so

\frac{\ln N_\varepsilon(F)}{-\ln \varepsilon} \approx s + \frac{\ln c}{-\ln \varepsilon}    (V.10)

Letting ε → 0 the above reduces to

s \approx \frac{\ln N_\varepsilon(F)}{-\ln \varepsilon}

showing the link with (V.2) (N = N_ε(F), the number of sets, related to the "scaling" factor r = ε). Therefore we formally define, for any non-empty bounded set F ⊆ ℝ^k,

\dim_B(F) = \lim_{\varepsilon \to 0} \frac{\ln N_\varepsilon(F)}{-\ln \varepsilon}    (V.11)

if this limit exists, and we call dim_B(F) the box-counting dimension. Formula (V.11) clearly expresses the meaning of a fractal dimension: the number of sets (one shows that one can limit oneself to balls, cubes, ...) of diameter ε that intersect a set F is an indication of how spread out or irregular the set is when measured with a scale ε (= resolution level). The dimension reflects how rapidly the irregularities of F develop as ε → 0.
We mentioned already the popularity of the box-counting dimension. It can e.g. be used in the calculation of the fractal dimension (using dim_B) of coastlines or country borders. Indeed, for a selected set of small ε-values we graph the points (ε, N_ε(F)) on a log-log scale. Based on (V.9) we see that these points lie (more or less) on a straight line with slope −s, where s = dim_B(F) (by (V.11)). Linear regression on these points then gives an estimate for this slope, hence for dim_B(F). An example is given in Feder (1988) (see Fig. V.3), where this method reveals that the fractal dimension of the coastline of Norway is close to 1.52, a proper fractal: although we speak about a (coast)line, its fractal dimension is much larger than 1 and even above 1.5.
Fig. V.3 The number of "boxes" of size ε needed to cover the coastline of Norway, as a function of ε. The straight line is the result of a regression analysis yielding a slope −s ≈ −1.52. Reprinted from Feder (1988), Fig. 2.7, p. 15. Reproduced with kind permission of Kluwer Academic Publishers.

This technique can be compared with the experience of flying over the coastline of Norway. The above shows that, when we lower our height, say by 50% (expressed by halving our measuring diameter ε), we need far more than double the number of squares to cover the coastline, as expressed by (V.8):

N_{\varepsilon/2}(F) \approx c\left(\frac{\varepsilon}{2}\right)^{-s} = c\, \varepsilon^{-s}\, 2^s

N_{\varepsilon/2}(F) \approx N_\varepsilon(F)\cdot 2^s \approx 2.87\, N_\varepsilon(F)

for s = dim_B(coastline) ≈ 1.52.
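The regression method just described can be illustrated numerically. The sketch below (Python with NumPy; our own illustrative code, with hypothetical helper names) builds the vertices of the triadic von Koch curve of Subsection V.2.2, counts occupied grid boxes at several scales ε and fits the slope of ln N_ε against ln(1/ε); the estimate should come out close to ln 4/ln 3 ≈ 1.26:

```python
import numpy as np

def koch_points(level=8):
    """Vertices of the triadic von Koch curve after `level` construction phases."""
    pts = np.array([0.0 + 0.0j, 1.0 + 0.0j])
    rot = np.exp(1j * np.pi / 3)  # 60-degree rotation building the middle "tent"
    for _ in range(level):
        a, b = pts[:-1], pts[1:]
        d = (b - a) / 3.0
        new = np.column_stack([a, a + d, a + d + d * rot, a + 2 * d]).ravel()
        pts = np.append(new, pts[-1])
    return pts

def box_count(pts, eps):
    """Number of eps-boxes of a square grid containing at least one point."""
    return len(np.unique(np.floor(np.column_stack([pts.real, pts.imag]) / eps), axis=0))

pts = koch_points()
eps = np.array([3.0 ** -k for k in range(2, 6)])   # box sizes 1/9 ... 1/243
counts = np.array([box_count(pts, e) for e in eps])
slope, _ = np.polyfit(np.log(1.0 / eps), np.log(counts), 1)
print(slope)  # close to ln 4 / ln 3 ≈ 1.26
```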
In a way it is a pity that the Hausdorff-Besicovitch dimension and the box-counting dimension, although related, are not always equal. However, for self-similar fractals (and we will only need these in Lotkaian informetrics) we have the following important non-trivial result.
Theorem V.2.3.2.1 (see e.g. Falconer (1990)): Let F be a self-similar fractal. Then

\dim_H(F) = \dim_B(F) = D_s    (V.12)
For the intricate proof we refer the reader to Falconer (1990), p. 118-120. The result in Falconer (1990) holds for more general fractals, namely those where more than one similarity is allowed: here we only use the case where one similarity applies to the whole curve, in which case the Hausdorff-Besicovitch and box-counting dimensions equal the similarity dimension - formula (V.2).
The above theorem makes life much easier since we will have no problems in calculating fractal dimensions, once we have shown the self-similarity of a system. This we will show to be the case in Lotkaian informetrics in Section V.3.
V.3 INTERPRETATION OF LOTKAIAN IPPs AS SELF-SIMILAR FRACTALS

Let us reformulate the construction of the triadic von Koch curve (Subsection V.2.2) as follows: let each phase (phases n = 0, 1, 2, 3, 4, 5 are depicted in Fig. V.1) be interpreted as discrete time t ∈ ℕ. The construction in Subsection V.2.2 is equivalent with the following statements:
(a) The number of line segments grows exponentially in time t, proportionally with 4^t (in fact equal to 4^t for the triadic von Koch curve).
(b) The length of each line segment is the same for every line segment (for fixed t) and 1/length grows exponentially in time, proportionally with 3^t.

If we generalise the numbers above by replacing 4 (in (a)) by a₁ > 1 and replacing 3 (in (b)) by a₂ > 1, we see that, up to the names "line segments" (to be replaced by "sources") and "1/length" (to be replaced by "number of items"), we have the reformulation of the Naranan assumptions, given in Subsection I.3.4; hence Naranan's assumptions describe a self-similar fractal. In Subsection I.3.4 we explained the derivation of Lotka's law as size-frequency function in such systems for continuous time t, and we added our own argument yielding the same law of Lotka, using discrete time t ∈ ℕ. In both cases we showed that the size-frequency function f was of Lotka (i.e. power) type as in formula (I.54), the Lotka exponent equalling

\alpha = 1 + \frac{\ln a_1}{\ln a_2}    (V.13)

(see formula (I.55)). Hence

\alpha - 1 = \frac{\ln a_1}{\ln a_2}    (V.14)
From the above we can conclude that Lotkaian IPPs can be interpreted as self-similar fractals with (by Theorem V.2.3.2.1 and formula (V.2)) fractal dimension

D_s = \alpha - 1    (V.15)

as given by formula (V.14). This result is important enough to be formulated as a theorem.
Theorem V.3.1 (Egghe (2004d), based on Naranan (1970)): Suppose we have an informetric system (IPP) of Lotkaian type, i.e. where the size-frequency function f satisfies

f(j) = \frac{C}{j^{\alpha}}    (V.16)

where C, α > 0, j ≥ 1 (here we take the maximal density ρ_m = ∞ as in Subsection II.2.1.1). Such a system can be interpreted as a self-similar fractal with fractal dimension D_s = α − 1.
Note that it follows from the above theorem and Corollary I.3.4.1.2 that, in case the growth rate of the sources equals the growth rate of the items in Naranan's model (i.e. a₁ = a₂), hence in case α = 2, we have a self-similar fractal of dimension 1. This, once more, shows the special role of the exponent α = 2 within informetrics.
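A minimal sketch of (V.13)-(V.15), assuming only the two Naranan growth rates a₁ and a₂ (Python; our own illustration, not part of the original text):

```python
import math

def lotka_exponent(a1: float, a2: float) -> float:
    """Lotka exponent (V.13) of a Naranan-type system in which the number of
    sources grows proportionally with a1**t and the number of items per source
    with a2**t (a1, a2 > 1)."""
    return 1.0 + math.log(a1) / math.log(a2)

def fractal_dimension(a1: float, a2: float) -> float:
    """Fractal (similarity) dimension D_s = alpha - 1 of the associated
    self-similar fractal, cf. (V.14) and (V.15)."""
    return math.log(a1) / math.log(a2)

# Triadic von Koch curve: a1 = 4, a2 = 3 => alpha ≈ 2.26, D_s ≈ 1.26
print(lotka_exponent(4, 3), fractal_dimension(4, 3))
# Equal growth rates (a1 = a2) => alpha = 2 and fractal dimension 1
print(lotka_exponent(2, 2), fractal_dimension(2, 2))
```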
Similarly to (but unrelated to) the theory of concentration, we can draw the conclusion that the fractal dimension increases (linearly) with α - see (V.15). In terms of the Zipf exponent β, using (II.56) and (V.15), we see that

\beta = \frac{1}{D_s}    (V.17)
From the above we see that a power function gives rise to a self-similar fractal where α (formula (V.16)) equals D_s + 1 if we interpret the power function as a size-frequency function (Lotka). In terms of the rank-frequency function (Zipf) we have formula (V.17). Hence in no case can the exponent itself (α, −α, β or −β) be interpreted as a fractal dimension, contrary to the assertions in Katz (1999) and Stewart (1989). In Katz (1999), the author uses both interpretations (Lotka and Zipf) and hence, since in general α ≠ β, we cannot have that both α and β represent the fractal dimension of the system; in fact neither does, and we have to use formulae (V.15) and (V.17) for their interpretation in terms of fractal dimensions (see also Fairthorne (1969) and Mandelbrot (1967)).
In Subsection I.4.3 we gave examples (from the literature) of social networks whose connectivity size-frequency function (i.e. the number of nodes with k edges) is Lotkaian with high exponents α between 2 and 4. Hence these systems, by (V.15), have fractal dimensions between 1 and 3, which is rather high.
The obtained results are valid in any Lotkaian informetrics (i.e. IPP) interpretation. They are confirmed in random texts, which are a special example of an IPP (cf. the linguistics example in Section I.1), by Mandelbrot (1977b) - see also Mandelbrot (1967). We recall the conclusions of Subsection I.3.5. Suppose we have N letters in our alphabet and that a text is a string of letters and blanks. Each letter has an equal chance of occurrence (a simplification of reality!) denoted by p. Since there are also blanks we have p = P(letter) < 1/N. Such a text is considered as a self-similar fractal of which the "phases" are (cf. Egghe and Rousseau (1990a)):

n = 0: the empty text
n = 1: the N letters = the N words, say A, B, C, ...
n = 2: the N² words AA, AB, ..., BA, BB, ...

and so on. At each level n we have P(word) = p^n and N^n words. So this system can be considered as a self-similar fractal with a scaling factor p and a multiplicator N to go from each level n to the next level n + 1. By (V.2) we have that its fractal dimension is given by
D_s = \frac{\ln N}{-\ln p}    (V.18)
Combining (V.18) with the result (I.58), obtained in Subsection I.3.5, we hence have that β = 1/D_s, the same result as in (V.17). Because Subsection I.3.5 showed that the law of Mandelbrot holds in this case, hence, by Theorem II.2.2.3, also Lotka's law and formula (II.56): β = 1/(α − 1), we recover that α − 1 = D_s, the fractal dimension of this special system; hence we recover formula (V.15), which was, of course, more generally valid (Theorem V.3.1).
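For the random-text case, (V.18) and (V.17) can be evaluated directly; a small sketch of ours, with an arbitrary choice of alphabet size and letter probability:

```python
import math

def random_text_dimension(n_letters: int, p_letter: float) -> float:
    """Fractal dimension (V.18) of a random text over n_letters equiprobable
    letters, each with probability p_letter (< 1/n_letters, since the blank
    also needs positive probability)."""
    assert 0 < p_letter < 1.0 / n_letters
    return math.log(n_letters) / -math.log(p_letter)

# e.g. 26 letters, each with probability 0.9/27 (the blank takes the rest)
d_s = random_text_dimension(26, 0.9 / 27)
print(d_s, 1.0 / d_s)  # fractal dimension D_s and Zipf exponent beta = 1/D_s (V.17)
```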
Theorem V.3.1 is a central result in Lotkaian informetrics. It expresses, in a mathematically exact way, that the power laws, which are characterized by their scale-free (self-similarity) property (Theorem I.3.2.2 and Corollary I.3.2.3), give rise to systems (IPPs) in which the same size-frequency distribution is found at low or high productivity (number of items per source) values. To recall Huberman (2001), referring to the size-frequency distribution of website sizes, we can conclude that the same size-frequency distribution is found when looking at sites that have between 1,000 and 2,000 pages as when looking at sites that have a different size range, say from 10 to 100 pages. In other words (in intuitive fractal terms), zooming in or out in the scale at which one studies the web (as in the case of examining coastlines from a plane at different heights), one keeps obtaining the same result, as is also the case in every self-similar fractal.
VI
LOTKAIAN INFORMETRICS OF SYSTEMS IN WHICH ITEMS CAN HAVE MULTIPLE SOURCES
VI.1 INTRODUCTION

This is a unique chapter in many senses. It deals with informetric systems in which items can have multiple sources. The most obvious application is, of course, the case where items are articles which are written by (possibly) more than one author (i.e. multiple sources). We underline that the goal of this chapter is not to study the informetric functions describing the number of authors per paper (for this, see Subsection I.4.4) but the number of papers per author, although the former will be used in the latter. Note that the terminology of this application coincides with the terminology used in the historical paper Lotka (1926), in which Lotka's size-frequency function was introduced for the first time. As remarked in Chapter I, however, Lotka circumvented the problem of dealing with multiple authorship by only giving a credit (of 1) to the senior author (the other authors were not given any credit); hence, in fact, Lotka treated the articles as single-author articles. In the next Section VI.2 we will study other, more realistic ways of crediting sources in systems where items can have multiple sources. Let us here just mention two possibilities: in an item (e.g. article) with n ∈ ℕ sources (e.g. authors), each author could be given a credit of 1 (each author of the article is hence fully recognized, but in this way an article gives, in total, n author credits, so a different weight compared to articles with a different number of authors), called the "total counting" system; or, alternatively, each author is given a credit of 1/n (keeping hereby the total author "weight" of each paper equal to 1 but making an author credit dependent on the total number of authors of an article), called the "fractional counting" system. We will see that the total counting
system is very different from the fractional counting system, which causes a serious problem, e.g. for evaluation studies. This fact and solutions for this problem will be given in Section VI.2.
Although interesting and important in itself, the topic of Section VI.2 will not be the main issue of this chapter. The fact that we encounter multiple sources for an item in informetrics makes this field rather unique: apart from artificial examples that could be produced in any field, we note that this problem is not encountered in other -metrics sciences (or at least we have not seen a paper introducing this topic): classical econometrics deals with wealth and poverty expressed e.g. by the income of people and it is clear that salaries are single sourced; in biometrics each animal belongs to one species; in linguistics the "token" (= item) is uniquely linked with one "type" (= source) since here a source is a word (as such) and an item is the use of this word (e.g. in a text). Even in informetrics, the possibility that items have multiple sources, is a rare phenomenon. In Bradfordian terminology, a source is a journal and an item is an article in such a journal: it is clear that the article is published in only one journal. In citation analysis a reference is uniquely linked with the citing article and the same for citations. In library sciences the borrowing of a book (= item) is uniquely linked with this book (= source), obviously. The reader can check the many other examples given in Chapter I to find out that they are all examples of single source items (except, as said, the case of authors and their publications (and some derived examples, see Section VI.2) and some artificial examples, discussed in Subsection 1.4.4). This special place of the author/publication system makes us think of a possible special informetrics theory that is behind it. In Subsections II.2.2 and II.4.2 we showed that Lotka's law is equivalent with functions such as the ones of Mandelbrot, Zipf, Leimkuhler and Bradford, but how can this be when Lotka's function represents a multiple-source system for items while the other functions represent single-source systems for items? Of course we can answer this question: in the theories developed so far (hence also in Subsections II.2.2 and II.4.2) we have always used the same informetric system (i.e. IPP) for all these laws, in fact comparable with what Lotka did by treating the articles as single-authored (as mentioned above). Of course, this leaves us with the problem of deriving informetrics theories for the multiple-source framework for items. Here we immediately encounter the problem of choosing the scoring system for sources in such a multiple-source framework for items, e.g. (but these are not the only possibilities) the total scoring system or the fractional scoring
system (briefly discussed above and more exhaustively in Section VI.2 to come). Let us discuss the problem and possibilities of solution for both scoring systems separately since they are completely different.
Problem VI.1.1 (Total scoring system)
To fix the ideas, let us consider the discrete case. In this case the only values for "the number of items per source" are the natural numbers n = 1, 2, 3, .... So the size-frequency function f(n), describing the number of sources with n items, has the same arguments as the one we have discussed throughout this book. The same elementary observations, developed in Subsection I.3.1, can be made here: f > 0 and f decreasing in n; hence also Proposition I.3.1.1 is valid: there exists a constant D ≥ 0 such that lim_{n→∞} f(n) = D. We can refer to Kretschmer and Rousseau (2001), where f(n), the number of authors with n papers, in the total counting system, is not decreasing, but only in those cases where there are more than 50 authors per paper: in all other cases, where the number of authors per paper is below 40 (still very high!), f(n) is empirically seen to be decreasing and even of Lotka type.
We remark here that Egghe (1995) extends the success-breeds-success principle, introduced in Subsection I.3.6, to the case where items can have multiple sources and where, possibly, non-decreasing size-frequency functions can occur. The model is, however, theoretical and does not yield analytic forms of these functions (understandable, in a way, because of the intricate nature of such exceptional size-frequency functions - see e.g. Fig. 1, p. 612 in Kretschmer and Rousseau (2001)).
Also the following argument, based on the results of Subsection I.3.2, can be convincing in accepting that the size-frequency function of the total counting system is Lotkaian.
Argument VI.1.1.1 (Egghe): Lotkaian informetrics where items can only have one source implies Lotkaian informetrics where items can have multiple sources and where the total scoring system for sources is applied.
Proof: Let f denote the size-frequency function in the single-source case and f_t the size-frequency function in the multiple-source case. We make the following plausible assumption: there exists a number p > 1 such that the number of sources with n ∈ ℕ items in the single-source case equals the number of sources with pn items in the multiple-source case (one could even say that p will be around the average number of sources per item (> 1!) in the multiple-source IPP, but this is not needed further in the argument). Hence
f_t(pn) = f(n)    (VI.1)

for all n ∈ ℕ. Hence (VI.1) implies, for any n, m ∈ ℕ:

f_t(pn \cdot pm) = f(pnm) = E\, f(pn)\, f(m) = E^2\, f(p)\, f(n)\, f(m)    (VI.2)
using formula (I.29), Corollary I.3.2.3 and Proposition I.3.2.5 on f (given that f is Lotkaian). Hence (VI.2) implies, using (VI.1) again, that

f_t(pn \cdot pm) = F\, f_t(pn)\, f_t(pm)    (VI.3)

where F = E² f(p), a universal constant in this multiple-source IPP. We gave this heuristic argument for m, n ∈ ℕ for reasons of clarity, but we could, in the continuous setting, assume

f_t(px) = f(x)

for all x ∈ ℝ⁺, leading to

f_t(px \cdot py) = F\, f_t(px)\, f_t(py)    (VI.4)
for all x, y ∈ ℝ⁺. Since p > 1 > 0 we also have that pℝ⁺ = ℝ⁺ (this improves the above discrete argument, since pℕ is not equal to ℕ), so that we have, for every x, y ∈ ℝ⁺:

f_t(xy) = F\, f_t(x)\, f_t(y)    (VI.5)

showing that f_t satisfies the product property (I.29) and hence, by Proposition I.3.2.5 and Corollary I.3.2.3, that f_t is Lotkaian. □
The reason why we called the above observation an "Argument" and not a "Proposition" or a "Theorem" is that, formally, it is not possible to compare an informetric system where items have only one source with an informetric system where items can have multiple sources, since they do not occur in one IPP. One could, of course, consider an IPP in which papers can have multiple authors as an IPP in which papers have only one author, by simply replacing a paper with k authors by k papers with 1 author (each time taking one of these k authors). A similar approach was followed in Egghe (1994b), where the size-frequency function of the multiple-author case was calculated based on the knowledge of the size-frequency function of the single-author case. The mathematical tool used was convolution, a technique that we will encounter again in this chapter. In Egghe (1994b) we showed that a Lotkaian single-author IPP implies (approximately) a Lotkaian multiple-author IPP. We think, however, that the above Argument VI.1.1.1 is simpler and more convincing for accepting that a size-frequency function for IPPs where items can have multiple sources, with a total counting system, is Lotkaian (i.e. is a decreasing power function).
The most important topic in this chapter (and also the most intricate one) is the determination of the size-frequency function for informetric systems where items can have multiple sources and where we use a fractional counting system. Also this problem will be introduced here (but explained rigorously in Section VI.3).
Problem VI.1.2 (Fractional scoring system)
To mention right away the most important feature of this problem: the size-frequency function is never decreasing. To put things right: we have an IPP where items can have multiple sources (e.g. articles can have more than one author) and where the source scores are credited in a fractional way. Example: an author A has published a paper authored by 3 authors (author A inclusive), a paper authored by 2 authors (author A inclusive) and a single-authored (of course by author A) paper. Then this author receives a credit of 1/3 + 1/2 + 1 = 11/6.

As is clear now, in principle any score which is a positive rational number is possible. Denoting by ℚ⁺ the set of positive rational numbers (i.e. numbers of the form a/b where a, b ∈ ℕ), we hence have that, in such systems, the size-frequency function f is a function on ℚ⁺, namely q → f(q), where q ∈ ℚ⁺. Without any experimentation we can already conclude that values such as f(1/16), f(1/12), ... will be smaller than e.g. f(1/2), f(1), f(2), since a score of e.g. q = 1/16 can only come from one paper with 16 authors or (even more unlikely) 2 papers with 32 authors, and so on, while a score of q = 1 can come from one single-authored paper, two papers with 2 authors, three papers with 3 authors, four papers with 4 authors, or 3 papers: one with 2 authors and two with 4 authors (and so on). Many other examples can be given and this makes it very clear that f(q) cannot be a decreasing function of q. An immediate consequence of this is that f(q) cannot be a power function (i.e. Lotka's law), as remarked by Rousseau (1992b). This also shows that the heuristic argument in Bookstein (1990b), stating that if Lotka's law is valid in one crediting system (e.g. total scores) then Lotka's law is valid in any other crediting system (e.g. fractional scores), is not correct. We showed above that the difference between both size-frequency functions is fundamental.
It now seems that we are beyond Lotkaian informetrics!
The interesting point is that we are in Lotkaian informetrics more than ever! In fact, as is clear from the above, two aspects play a role here: first of all we still have that, overall, for very large q (hence high production), f(q) must be low; next, we can remark that rational values q with high denominators are the result of being a co-author of one or more papers with many authors (hence also leading to small values of f(q)). In other words, not only the number of papers (counted in a total way) per author is important here but also the number of authors per paper, the dual function (see Subsection I.4.4).
As to the number of papers (counted in the total way) per author (more generally: the number of items per source), Argument VI.1.1.1, given under Problem VI.1.1, sufficiently shows that we can suppose Lotka's law as size-frequency function in this case. As to the number of authors per paper, Subsection I.4.4 sufficiently explained why also here a Lotka function can be used. Of course these remarks still leave open how to apply these two (dual) Lotka functions in order to model the fractional size-frequency function f(q) for q ∈ ℚ⁺. This will be done in Section VI.3 in a, we think, satisfactory way, so that it even sheds more light on the power of Lotkaian informetrics rather than indicating a breakdown of it (Rousseau (1992b)).
VI.2 CREDITING SYSTEMS AND COUNTING PROCEDURES FOR SOURCES AND "SUPER SOURCES" IN IPPs WHERE ITEMS CAN HAVE MULTIPLE SOURCES
As explained in the previous section, the most important example of an IPP in which items can have multiple sources is the case of articles (items) that can be written by more than one author (source). This, in turn, produces a substantial generalization: one could consider the country affiliation of each author and in this way consider countries as sources for articles. A credit for a country in an article (independent of the crediting system for authors - to be discussed in Subsection VI.2.1) is then the sum of the credits of the authors of this article that are affiliated to this country. In this way, countries are treated with a different crediting system than authors, and it could also be considered as a generalization of the author crediting system: indeed, only when the country affiliations of the authors in an article are all different does the country crediting system coincide with the author crediting system. The same could be said when considering research labs, languages, .... This extension is basic and since we do not want to lose this important generalization we will consider it separately from authors as sources and call countries (labs, ...) "super sources". We will then develop a theory around super source credits which includes source crediting systems because of the above mentioned generalization. The next subsection, however, is devoted to the definitions of the different source crediting systems from which the super source credits can be derived.
VI.2.1 Overview of crediting systems for sources
We will present a rather exhaustive list of source crediting systems, for the sake of completeness. Since some of these definitions use the rank of an author in a paper, we will, for clarity reasons, use this terminology. Each time we suppose that we have an article with A authors.
VI.2.1.1 First or senior author count
First author count gives a credit of 1 to the first author and no credit (i.e. a credit of 0) to the other A - 1 authors (Cole and Cole (1973)). This method is also known as straight counting. This is, clearly, an imperfect system since every author should at least receive a positive credit. However it can be seen as an easy-to-use sampling method in a system in which author ranks are given randomly. If author-ranks are given according to the alphabetic order of the names, this method clearly causes a bias in the sample; the same is true with any other lexicographic (or even any non-random) ordering system. The popularity of the system also comes from the fact that the Science Citation Index presents citation data (not publication data) by first author only (this has been improved in the Web of Science (Web of Knowledge) where one can search on every author).
Senior author count is the same as first author count but here only the senior author receives a credit of 1 and the other A - 1 authors receive no credit. This method was used by Lotka in his historic paper Lotka (1926). In some disciplines, the senior author is the last one in the list. For a variant of this method: see VI.2.1.6.
VI.2.1.2 Total author count
This is the method in which each of the A authors of an article receives a credit of 1. This method is also called normal or standard counting. It is a fair method in the sense that each author receives the same credit. This is also the case in the next method. The present method, however, increases the weight of articles with many authors since the total author count is A. This is remedied in the next method.
VI.2.1.3 Fractional author count

Here each of the A authors receives a credit of 1/A (De Solla Price (1981)). This counting method is sometimes called adjusted counting. Each author receives the same weight and, independent of the total number A of authors, the weight of each article is 1.

VI.2.1.4 Proportional author count
Less used is the proportional author count (Van Hooydonk (1997)). If an author has rank r in the author list of an article (with A authors in total, hence r ∈ {1, 2, ..., A}), then this author receives a credit of

\frac{2}{A}\left(1 - \frac{r}{A+1}\right)    (VI.6)

It is easy to see that (VI.6) is the normalized form of a score A + 1 − r (so r = 1: score A, ..., r = A: score 1), so that the sum of all numbers in (VI.6) for r = 1, 2, ..., A is 1. So also this method gives an equal weight of 1 to each article, independent of the total number A of authors.
VI.2.1.5 Pure geometric author count
If an author has rank r in the author list of an article (with A authors in total), then this author receives a credit of

\frac{2^{A-r}}{2^A - 1}    (VI.7)

(Egghe, Rousseau and Van Hooydonk (2000)). It is easy to see that (VI.7) is the normalized form of a score 2^{A−r} (so r = 1: score 2^{A−1}, ..., r = A: score 1), so that the sum of the numbers in (VI.7) for r = 1, 2, ..., A is 1. So also this method gives an equal weight of 1 to each article, independent of the total number A of authors.
VI.2.1.6 Noblesse Oblige
As in senior author counting, the senior author receives the highest score, but does not get everything: here the senior author receives a score of 1/2 while the other A − 1 authors receive an equal credit of 1/(2(A−1)). Also in this method the total weight of an article is 1, independent of the total number A of authors.
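For reference, the author crediting systems of this subsection can be summarized in code; the following Python sketch (ours; the function name credits and the scheme labels are hypothetical) uses exact rational arithmetic so the article weights can be verified:

```python
from fractions import Fraction

def credits(A: int, scheme: str):
    """Credit of each rank r = 1..A in an article with A authors, for the
    crediting systems of Subsection VI.2.1."""
    if scheme == "first":         # first/senior author count (VI.2.1.1)
        return [Fraction(1 if r == 1 else 0) for r in range(1, A + 1)]
    if scheme == "total":         # total author count (VI.2.1.2)
        return [Fraction(1) for _ in range(A)]
    if scheme == "fractional":    # fractional author count (VI.2.1.3)
        return [Fraction(1, A) for _ in range(A)]
    if scheme == "proportional":  # proportional author count, formula (VI.6)
        return [Fraction(2 * (A + 1 - r), A * (A + 1)) for r in range(1, A + 1)]
    if scheme == "geometric":     # pure geometric author count, formula (VI.7)
        return [Fraction(2 ** (A - r), 2 ** A - 1) for r in range(1, A + 1)]
    if scheme == "noblesse":      # Noblesse Oblige (VI.2.1.6); senior = rank 1 here
        if A == 1:
            return [Fraction(1)]
        return [Fraction(1, 2)] + [Fraction(1, 2 * (A - 1))] * (A - 1)
    raise ValueError(scheme)

# every scheme except "first" and "total" gives each article a total weight of 1
for s in ["fractional", "proportional", "geometric", "noblesse"]:
    assert sum(credits(4, s)) == 1
print(credits(3, "proportional"))  # [1/2, 1/3, 1/6]
```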
VI.2.2 Crediting systems for super sources

The above Subsection VI.2.1 also yields the different crediting systems for super sources, by simply adding the credits of the sources that belong to this super source. Example: let an article have 5 authors having credits c_i (i = 1, 2, 3, 4, 5) (c_i is determined by one of the methods discussed in Subsection VI.2.1). Suppose that authors 1, 2 and 4 belong to the same super source (e.g. country, lab, ...) and that authors 3 and 5 do not belong to this super source. Then the score (in this article) of this super source is c₁ + c₂ + c₄.
This is the only way we will consider scores for super sources, derived from scores for sources. There is a variant of this method where a super source receives a credit of 1 if it appears in one or more of the sources and of 0 otherwise. This can then be normalized by replacing 1 by 1/s, where s is the total number of different super sources. This is used in Nederhof and Moed (1993) if s ≠ 1. If s = 1 then they set the score of this paper at 0; this is called the fractionated scoring system (used to study the multinationality of papers).
VI.2.3 Counting procedures for super sources in an IPP
This subsection is devoted to introducing counting procedures for super sources (e.g. countries, labs, universities, or even authors themselves) in IPPs, i.e. where several items (e.g. articles) are involved. Note that this study automatically contains the study of counting procedures for sources (e.g. authors) as remarked in the introduction of Section VI.2. We will limit ourselves to the study (and comparison) of the total counting system, the fractional counting system and the proportional counting system.
In each case we suppose we have N items and that item i (i = 1, ..., N) is produced by a_i sources, and we fix one super source c of which we want to calculate the "score" in the system (the classical application being: N articles, where article i is written by a_i authors, and we check which of these authors have an affiliation in country c). We will always suppose a_i > 0 (e.g. excluding anonymous articles).
The key variable in the study of the various counting procedures for super sources is d_{irc} (i = 1, ..., N; r = 1, ..., a_i), where d_{irc} = 1 if c occurs at rank r in article i and d_{irc} = 0 otherwise. We have that

\sum_{r=1}^{a_i} d_{irc} = a_i(c)    (VI.8)

denotes the number of occurrences of super source c in item i ∈ {1, ..., N}.
VI.2.3.1 Total counting
The total score of super source c in this system, using total counting (note the different uses of the word "total" here: total score refers to the sum of all scores of super source c, independent of the counting method per item, while total counting refers to the counting method of VI.2.1.2), denoted W_T(c), equals

W_T(c) = \sum_{i=1}^{N} \sum_{r=1}^{a_i} d_{irc} = \sum_{i=1}^{N} a_i(c)    (VI.9)

Noting that the total number of scores in this system (over all super sources), denoted W_T, equals

W_T = \sum_{c} W_T(c) = \sum_{c} \sum_{i=1}^{N} \sum_{r=1}^{a_i} d_{irc} = \sum_{i=1}^{N} a_i    (VI.10)

we can define the relative score (or contribution) of super source c in this system as
Q_T(c) = \frac{W_T(c)}{W_T} = \frac{\sum_{i=1}^{N} a_i(c)}{\sum_{i=1}^{N} a_i}    (VI.11)
VI.2.3.2 Fractional counting
The total score of super source c in this system, using fractional counting, denoted W_F(c), equals

W_F(c) = \sum_{i=1}^{N} \sum_{r=1}^{a_i} \frac{d_{irc}}{a_i} = \sum_{i=1}^{N} \frac{a_i(c)}{a_i}    (VI.12)

Noting that the total number of scores in this system (over all super sources), denoted W_F, equals

W_F = \sum_{c} W_F(c) = \sum_{c} \sum_{i=1}^{N} \sum_{r=1}^{a_i} \frac{d_{irc}}{a_i} = N    (VI.13)

we can define the relative score (or contribution) of super source c in this system as

Q_F(c) = \frac{W_F(c)}{W_F} = \frac{1}{N} \sum_{i=1}^{N} \frac{a_i(c)}{a_i}    (VI.14)
VI.2.3.3 Proportional counting
The total score of super source c in this system, using proportional counting, denoted W_P(c), equals (use (VI.6), where A is replaced by a_i for each item i)

W_P(c) = \sum_{i=1}^{N} \sum_{r=1}^{a_i} d_{irc}\, \frac{2}{a_i}\left(1 - \frac{r}{a_i + 1}\right) = 2 \sum_{i=1}^{N} \sum_{r=1}^{a_i} \frac{d_{irc}}{a_i} - 2 \sum_{i=1}^{N} \sum_{r=1}^{a_i} \frac{r\, d_{irc}}{a_i (a_i + 1)}

W_P(c) = \sum_{i=1}^{N} \frac{2}{a_i} \left( a_i(c) - \frac{R(i,c)}{a_i + 1} \right)    (VI.15)

by (VI.8) and denoting

R(i,c) = \sum_{r=1}^{a_i} r\, d_{irc}    (VI.16)

the sum of all ranks occupied by super source c in article i. Note that the total number of scores in this system (over all super sources), denoted W_P, equals

W_P = \sum_{c} W_P(c) = 2 \sum_{i=1}^{N} \frac{a_i}{a_i} - 2 \sum_{i=1}^{N} \frac{\sum_{r=1}^{a_i} \sum_{c} r\, d_{irc}}{a_i (a_i + 1)} = 2N - N = N    (VI.17)

since

\sum_{r=1}^{a_i} \sum_{c} r\, d_{irc} = \frac{a_i (a_i + 1)}{2}    (VI.18)

being the sum of all ranks in article i, i.e. the sum of the numbers 1, 2, ..., a_i. Hence we can now define the relative score of super source c in this system as
Q_P(c) = \frac{W_P(c)}{W_P} = \frac{1}{N} \sum_{i=1}^{N} \frac{2}{a_i} \left( a_i(c) - \frac{R(i,c)}{a_i + 1} \right)    (VI.19)

Q_P(c) = 2\, Q_F(c) - \frac{2}{N} \sum_{i=1}^{N} \frac{R(i,c)}{a_i (a_i + 1)}    (VI.20)

by (VI.14).
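The three relative scores can be computed directly from the ranked affiliation lists of the items; the following Python sketch is our own illustration (the function name relative_scores and the sample data are hypothetical):

```python
def relative_scores(articles, c):
    """Relative score of super source c under total (VI.11), fractional (VI.14)
    and proportional (VI.19) counting. Each article is the ordered list of
    super sources (e.g. country codes) of its authors, rank 1 first."""
    N = len(articles)
    qt_num = qt_den = qf = qp = 0.0
    for ranks in articles:
        a = len(ranks)                                        # a_i
        a_c = sum(1 for s in ranks if s == c)                 # a_i(c), cf. (VI.8)
        R = sum(r for r, s in enumerate(ranks, 1) if s == c)  # R(i,c), cf. (VI.16)
        qt_num += a_c
        qt_den += a
        qf += a_c / a
        qp += (2 / a) * (a_c - R / (a + 1))
    return qt_num / qt_den, qf / N, qp / N

# illustrative data: one 3-author article and one single-author article
arts = [["c1", "c1", "c2"], ["c2"]]
print(relative_scores(arts, "c1"))  # Q_T = 0.5, Q_F = 1/3, Q_P = 5/12
```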
Extensive examples illustrating the large difference between these counting methods will follow. The following proposition will be handy for constructing examples for the proportional counting method, based on existing examples for the fractional counting method.
Proposition VI.2.3.1 (Egghe, Rousseau and Van Hooydonk (2000)): For any system of super sources (i.e. representing a system of collaboration between authors, countries, labs, ...), denoted as system I, we can construct another one with the same super sources, denoted as system II, such that the proportional counting result in system II equals the fractional counting result in system I as well as in system II. Furthermore, also the total counting results are the same in both systems. In symbols: for every super source c:

Q_P^{II}(c) = Q_F^{II}(c) = Q_F^{I}(c)    (VI.21)

Q_T^{I}(c) = Q_T^{II}(c)    (VI.22)
Proof: Take any system of N items. Consider the i-th item. In this item, super source c appears a_i(c) times, by notation (VI.8), out of the total of a_i sources. Now mirror this situation so that we obtain an item with 2a_i sources, where, if super source c has a rank r in system I, it also has a rank 2a_i − r + 1 in system II (besides the existing rank r). Now super source c appears 2a_i(c) times in item i in system II. Do this for every item i = 1, ..., N. Let us denote by Q_X^{I}(c) and Q_X^{II}(c) the relative scores of super source c in systems I and II, where X can be T, F or P. We have, for every super source c:

Q_P^{II}(c) = \frac{1}{N} \sum_{i=1}^{N} \frac{2}{2a_i} \left( 2a_i(c) - \frac{R^{II}(i,c)}{2a_i + 1} \right)

where R^{II}(i,c) denotes the sum of all ranks of super source c in item i in system II. By construction, this is always equal to (2a_i + 1)\, a_i(c). Hence

Q_P^{II}(c) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{a_i} \left( 2a_i(c) - a_i(c) \right) = \frac{1}{N} \sum_{i=1}^{N} \frac{a_i(c)}{a_i} = Q_F^{I}(c) = \frac{1}{N} \sum_{i=1}^{N} \frac{2a_i(c)}{2a_i} = Q_F^{II}(c)

We also have

Q_T^{II}(c) = \frac{\sum_{i=1}^{N} 2a_i(c)}{\sum_{i=1}^{N} 2a_i} = Q_T^{I}(c)    □
VI.2.4 Inequalities between Q_T(c) and Q_F(c) and consequences for the comparison of Q_T(c), Q_F(c) and Q_P(c)
We have the following theorem, describing completely the interrelation of Q_T(c) and Q_F(c).
Theorem VI.2.4.1 (Egghe and Rousseau (1996b,d)): Q_T(c) > Q_F(c) if and only if the slope of the regression line of a_i(c)/a_i over a_i (i = 1, ..., N) is strictly positive. If we replace > by <, then this inequality is equivalent with a strictly negative slope.
Proof: The slope of the regression line of the cloud of points (a_i, a_i(c)/a_i), i = 1, ..., N, equals (see e.g. Egghe and Rousseau (2001), p. 56, a reference on statistics applied to the field of library and information science)

\frac{\dfrac{1}{N} \sum_{i=1}^{N} a_i \dfrac{a_i(c)}{a_i} - \left( \dfrac{1}{N} \sum_{i=1}^{N} a_i \right) \left( \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{a_i(c)}{a_i} \right)}{s^2}

(s² denoting the variance of (a_1, ..., a_N), hence positive). So this slope is strictly positive if and only if

\frac{1}{N} \sum_{i=1}^{N} a_i(c) > \left( \frac{1}{N} \sum_{i=1}^{N} a_i \right) \left( \frac{1}{N} \sum_{i=1}^{N} \frac{a_i(c)}{a_i} \right)

whence

\frac{\sum_{i=1}^{N} a_i(c)}{\sum_{i=1}^{N} a_i} > \frac{1}{N} \sum_{i=1}^{N} \frac{a_i(c)}{a_i}

i.e. Q_T(c) > Q_F(c), by (VI.11) and (VI.14). The other assertion is proved in the same way. □
As an immediate consequence of Theorem VI.2.4.1 we have that if a_i(c)/a_i increases as a function of a_i, then Q_T(c) > Q_F(c), since then the regression line of a_i(c)/a_i over a_i increases (see also Egghe and Rousseau (1996b) for a proof of this evident fact). In Egghe and Rousseau (1996b,d) the above result was applied to determine the relation between the global and average impact factor (which is beyond the topic of this book) and to many other fields, even outside informetrics (see Egghe and Rousseau (1996d)).
It is now clear that a universally valid inequality between Q_T and Q_F does not exist. The following example makes this explicit.

Example VI.2.4.2

Let N = 3 and suppose we have the following countries appearing as author affiliations in these 3 articles:
[Table: the country affiliations, among c₁, c₂ and c₃, of the authors of the three articles.]

Then it is easy to see that Q_T(c₁) > Q_F(c₁) and that Q_T(c₂) < Q_F(c₂).
Hence, using Proposition VI.2.3.1, we also have an example where Q_T(c₁) > Q_P(c₁) and Q_T(c₂) < Q_P(c₂).

Also between Q_F(c) and Q_P(c) no universal inequality exists. Take a country c which always occupies the last rank, i.e. R(i,c) = a_i, with a_i(c) = 1 and a_i > 1 for all i = 1, ..., N. Then, by (VI.20):
Q_P(c) = 2\, Q_F(c) - \frac{2}{N} \sum_{i=1}^{N} \frac{1}{a_i + 1} < 2\, Q_F(c) - \frac{1}{N} \sum_{i=1}^{N} \frac{1}{a_i}

since 2/(a_i + 1) > 1/a_i for a_i > 1. Since a_i(c) = 1 for all i = 1, ..., N, this implies that Q_P(c) < 2Q_F(c) − Q_F(c) = Q_F(c). Take now a country c′ for which a_i(c′) = 1 and R(i,c′) = 1 for all i = 1, ..., N (c′ always occupies the first rank). Then, again using (VI.20):
Q_P(c′) = 2\, Q_F(c′) - \frac{2}{N} \sum_{i=1}^{N} \frac{1}{a_i (a_i + 1)} > 2\, Q_F(c′) - \frac{1}{N} \sum_{i=1}^{N} \frac{1}{a_i} = Q_F(c′)

again since a_i(c′) = 1 for all i = 1, ..., N.
We can even construct examples that yield the following paradox.

Paradox VI.2.4.3: There exist examples of countries a and b such that

Q_T(b) < Q_T(a) \quad \text{and} \quad Q_F(a) < Q_F(b)    (VI.23)

and the same is true with F replaced by P.
and the same is true with F replaced by P. Proof: Once an example showing (VI.23) is given, we also have an example of (VI.23) with F replaced by P, using Proposition VI.2.3.1. So we only show (VI.23). We will also indicate the system behind the construction of such non-trivial (but very realistic!) examples. First of all we will limit ourselves to authors (sources) (otherwise said: super sources are sources or, equivalently, a{ (c) = 1 for each i = 1,...,N).
Let us consider the general situation of three authors a, b and c in a system with x + y + z single-author articles, namely x with a as single author, y with b as single author, and z with c as single author; x′ + y′ + z′ articles with two authors, namely x′ with b and c as authors, y′ with a and c as authors and z′ with a and b as authors; and finally α articles with a, b and c as authors. So, for this system, N = x + y + z + x′ + y′ + z′ + α. We will try to determine the seven unknowns x, y, z, x′, y′, z′ and α such that (VI.23) becomes valid.
Table VI.1 A collaboration pattern of three authors a, b and c, leading to (VI.23).

Article numbers    a    b    c
1-5                X
6-12                    X
13                           X
14-18                   X    X
19-26              X         X
27-28              X    X
This can be done by expressing Q_T(a), Q_F(a), Q_T(b), Q_F(b) in terms of the above variables and then requiring (VI.23). One (but there are many!) solution is given as follows: α = 0, x = x′ = 5, y = 7, y′ = 8, z = 1, z′ = 2, i.e. we have the collaboration pattern of Table VI.1.
We have the following scores, as is readily seen:

Q_T(a) = 0.3488    Q_F(a) = 0.3571
Q_T(b) = 0.3256    Q_F(b) = 0.3750
Q_T(c) = 0.3256    Q_F(c) = 0.2679
Note that Q_T(b) < Q_T(a) < Q_F(a) < Q_F(b), hence (VI.23) is satisfied. Also, author a ranks first according to total counts, and only second according to fractional counts.

So we have a real paradox here: although Q_T(a) < Q_F(a), author a scores better in the T-system than in the F-system. □
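Reusing the relative_scores sketch given after (VI.20), the scores of Table VI.1 can be reproduced numerically:

```python
# Articles of Table VI.1 as author lists (ranks do not matter for Q_T and Q_F).
arts = ([["a"]] * 5 + [["b"]] * 7 + [["c"]]
        + [["b", "c"]] * 5 + [["a", "c"]] * 8 + [["a", "b"]] * 2)
for who in "abc":
    q_t, q_f, _ = relative_scores(arts, who)
    print(who, round(q_t, 4), round(q_f, 4))
# a 0.3488 0.3571
# b 0.3256 0.375
# c 0.3256 0.2679
```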
These results show the large difference between the total, fractional and proportional scoring systems. This means that one must be very careful when drawing conclusions on source or super source ranking according to one of these scoring systems since these rankings can be (completely) different in another scoring system. Yet, there are some positive results that can be mentioned, which can be handy to overcome the above mentioned difficulties in practical field work.
VI.2.5 Solutions to the anomalies
It is conceivable that scientists doing evaluations do not want to get involved in the problems and anomalies sketched in the previous sections. So, how can one exclude or at least diminish their influence?
We will show that Q T , QF and QP are close to each other if the number of collaborators is high. This gives a partial solution to the problems encountered in the previous subsection,
although we are not convinced that one should rely on it for a complete solution. Next we will show, based on earlier findings (Egghe and Rousseau (1996d)), that a complete solution exists if one replaces the arithmetic average by the geometric mean.
VI.2.5.1 Partial solutions
The fact that Q_T(c) and Q_F(c) can be close to each other is shown in the unpublished note of Kranakis and Kranakis (1988) (the result can also be found in Egghe and Rousseau (1990a)).
Theorem VI.2.5.1.1 (Kranakis and Kranakis (1988)): For any super source c,

\frac{m}{M}\, Q_T(c) \;\le\; Q_F(c) \;\le\; \frac{M}{m}\, Q_T(c)    (VI.24)

where m = min{a_1, ..., a_N}, M = max{a_1, ..., a_N}.

Proof: By (VI.11) and (VI.14) we have

Q_F(c) = \frac{1}{N} \sum_{i=1}^{N} \frac{a_i(c)}{a_i} \le \frac{1}{Nm} \sum_{i=1}^{N} a_i(c) \le \frac{M}{m}\, \frac{\sum_{i=1}^{N} a_i(c)}{\sum_{i=1}^{N} a_i} = \frac{M}{m}\, Q_T(c)

(using \sum_{i=1}^{N} a_i \le NM), and analogously for the lower bound; hence (VI.24). □

Inequality (VI.24) shows that, if m = M, i.e. if the number of coauthors is the same in all articles, then Q_F(c) = Q_T(c). Further, if m and M are approximately equal (M/m ≈ 1), then Q_F(c) ≈ Q_T(c). As to Q_F(c) versus Q_P(c) we have the following result.
Theorem VI.2.5.1.2 (Egghe, Rousseau and Van Hooydonk (2000)): For any c such that a_i(c) = 1 for every i = 1, ..., N, the following inequality holds:

\left| Q_F(c) - Q_P(c) \right| \;\le\; \frac{1}{N} \sum_{i=1}^{N} \frac{a_i - 1}{a_i (a_i + 1)}    (VI.25)
Proof: If a_i(c) = 1 for every i = 1, ..., N, we have, by (VI.14) and (VI.19):

Q_F(c) - Q_P(c) = \frac{2}{N} \sum_{i=1}^{N} \frac{R(i,c)}{a_i (a_i + 1)} - \frac{1}{N} \sum_{i=1}^{N} \frac{1}{a_i}

where R(i,c) ∈ {1, ..., a_i}, since a_i(c) = 1 for every i = 1, ..., N. So each term

\frac{2 R(i,c)}{a_i (a_i + 1)} - \frac{1}{a_i}

can be bounded as follows:

-\frac{a_i - 1}{a_i (a_i + 1)} \;\le\; \frac{2 R(i,c)}{a_i (a_i + 1)} - \frac{1}{a_i} \;\le\; \frac{a_i - 1}{a_i (a_i + 1)}

Consequently, since a_i ≥ 1:

\left| Q_F(c) - Q_P(c) \right| \le \frac{1}{N} \sum_{i=1}^{N} \frac{a_i - 1}{a_i (a_i + 1)}    □
Corollary VI.2.5.1.3 (Egghe, Rousseau and Van Hooydonk (2000)): Under the assumptions of the above theorem we have

\left| Q_F(c) - Q_P(c) \right| \;\le\; \text{average of } \left\{ \frac{1}{a_1 + 1}, \ldots, \frac{1}{a_N + 1} \right\}    (VI.26)
Corollary VI.2.5.1.4 (Egghe, Rousseau and Van Hooydonk (2000)): Let m = min{a_1, ..., a_N} as before. Then

\left| Q_F(c) - Q_P(c) \right| \le \frac{1}{m + 1}    (VI.27)

Hence Q_F(c) ≈ Q_P(c) if m is high. Together with the result found earlier, we have that, if m is high and M/m ≈ 1:

Q_T \approx Q_F \approx Q_P    (VI.28)
VI.2.5.2 Complete solution to the encountered anomalies
In this section we will show that the use of geometric instead of arithmetic averages eliminates the occurrence of these anomalies. Recall that the use of a geometric mean was already suggested in Egghe and Rousseau (1996d). Let us first define these geometric versions (denoted with a superscript g).
By dividing numerator and denominator of (VI.11) by N we have

Q_T(c) = \frac{\text{arithmetic average of } \{a_1(c), \ldots, a_N(c)\}}{\text{arithmetic average of } \{a_1, \ldots, a_N\}}

Hence we define (replace "arithmetic" by "geometric" in the above formula):

Q_T^{g}(c) = \frac{\left( a_1(c)\, a_2(c) \cdots a_N(c) \right)^{1/N}}{\left( a_1\, a_2 \cdots a_N \right)^{1/N}}    (VI.29)
By (VI.14), Q_F is:

Q_F(c) = \text{arithmetic average of } \left\{ \frac{a_1(c)}{a_1}, \ldots, \frac{a_N(c)}{a_N} \right\}

Consequently, we define:

Q_F^{g}(c) = \left( \frac{a_1(c)}{a_1} \cdot \frac{a_2(c)}{a_2} \cdots \frac{a_N(c)}{a_N} \right)^{1/N}    (VI.30)
It is clear that

Q_T^{g}(c) = Q_F^{g}(c)    (VI.31)

for any system and any c. Hence all rankings based on the total counting method are the same as those based on the fractional counting method. Consequently, all ambiguities are gone!
We strongly advise researchers to start working with these geometric variants of Q_T and Q_F, not only in this application but in all applications as described in Egghe and Rousseau (1996d).
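The identity (VI.31) is immediate in code as well; a one-function sketch of ours:

```python
import math

def q_geometric(a_c, a):
    """Geometric relative score (VI.29) = (VI.30): the geometric mean of the
    ratios a_i(c)/a_i, identical whether derived from total or fractional counts.
    Note: a single item in which c does not occur (a_i(c) = 0) makes this score 0."""
    n = len(a)
    return math.prod(ac / ai for ac, ai in zip(a_c, a)) ** (1.0 / n)

# a super source occurring once in a 3-author item and twice in a 4-author item
print(q_geometric([1, 2], [3, 4]))  # (1/3 * 2/4) ** (1/2) ≈ 0.4082
```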
VI.2.6 Conditional expectation results on Q_T(c), Q_F(c) and Q_P(c)
The total counting system (T) is the simplest one from a probabilistic point of view: given N, a_1,...,a_N, we only need to know Σ_{i=1}^N a_i(c), while for the fractional counting system (F) we need to know each a_i(c) (see (VI.11) and (VI.14)). In other words, for (T) we only need to know the total appearance of super source c in the system while for (F) we need to know how many times c appears in each item i = 1,...,N. So (F) determines (T) uniquely but, per (T)-situation, there are many (F)-situations compatible with the given (T)-situation and hence also different values of Q_F(c), per fixed Q_T(c). We can now ask: what is the average (i.e. expectation) of
all the possible Q_F(c) values, given the fixed Q_T(c) value; in other words, what is the conditional expectation, denoted E_T(Q_F(c)), given Q_T(c)? This conditional expectation is the same notion as the one encountered in Subsection I.3.6.4 to define the evolution in time of stochastic processes, but the application is, of course, completely different.
We have the following theorem.
Theorem VI.2.6.1 (Egghe (1999b)):

E_T(Q_F(c)) = Q_T(c)   (VI.32)
for every super source c.
Proof: Using linearity of conditional expectations (being averages) we have:

E_T(Q_F(c)) = E_T((1/N) Σ_{i=1}^N a_i(c)/a_i) = (1/N) Σ_{i=1}^N (1/a_i) E_T(a_i(c))   (VI.33)

where E_T(a_i(c)) denotes the average of all a_i(c) values, given Q_T(c). Now Q_T(c), by (VI.11), is the overall probability, for c, to occupy one of the Σ_{i=1}^N a_i positions available in the N items. Then the probability to have a_i(c) occurrences of c in the a_i places in item i is (binomial distribution)

C(a_i, a_i(c)) Q_T(c)^{a_i(c)} (1 − Q_T(c))^{a_i − a_i(c)}

(C(·,·) denoting a binomial coefficient). Hence
E_T(a_i(c)) = Σ_{a_i(c)=0}^{a_i} a_i(c) C(a_i, a_i(c)) Q_T(c)^{a_i(c)} (1 − Q_T(c))^{a_i − a_i(c)}

But

a_i(c) C(a_i, a_i(c)) = a_i C(a_i − 1, a_i(c) − 1)

as is readily seen. So, by (VI.33),

E_T(Q_F(c)) = (1/N) Σ_{i=1}^N (a_i/a_i) Q_T(c) Σ_{a_i(c)=1}^{a_i} C(a_i − 1, a_i(c) − 1) Q_T(c)^{a_i(c)−1} (1 − Q_T(c))^{a_i − a_i(c)}

the last sum equalling 1, being the total chance of a binomial distribution. Hence, E_T(Q_F(c)) = Q_T(c).
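As a quick plausibility check of Theorem VI.2.6.1 (a Monte Carlo sketch of our own, not part of the original proof), one can draw each a_i(c) from the binomial distribution used in the proof and average the resulting Q_F(c) values:

    import random

    random.seed(1)
    a = [2, 3, 5, 8]        # hypothetical a_i, i = 1,...,N
    q_t = 0.3               # the fixed Q_T(c) value
    n_runs = 100_000

    def binom(n, p):
        # number of successes in n Bernoulli trials with success probability p
        return sum(random.random() < p for _ in range(n))

    total = 0.0
    for _ in range(n_runs):
        a_c = [binom(ai, q_t) for ai in a]
        total += sum(ac / ai for ac, ai in zip(a_c, a)) / len(a)  # (VI.14)

    print(total / n_runs)   # close to 0.3, illustrating E_T(Q_F(c)) = Q_T(c)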
The proportional and the fractional scoring system can be compared in a similar way. Indeed, the proportional scoring system (P) is finer than the fractional (F) one: given the a_i(c) (in (F)), there are several possibilities for the ranks that c occupies in item i, all leading to different Q_P(c) scores, agreeing with one set a_i(c), i = 1,...,N. But given the ranks of c in each item i (i.e. the (P)-situation), we can easily determine a_i(c) and hence the (F)-situation. As in the previous case we can ask: given a fixed (F)-situation a_i(c), i = 1,...,N, what is the average of all Q_P(c) values that agree with this (F)-situation? Again this is a conditional expectation, which we denote by E_F(Q_P(c)). We have the following theorem.
Theorem VI.2.6.2 (Egghe (1999b)):

E_F(Q_P(c)) = Q_F(c)   (VI.34)

for every super source c.
Proof: Using the linearity of the conditional expectation (being an average), formula (VI.20) yields

E_F(Q_P(c)) = 2 Q_F(c) − (2/N) Σ_{i=1}^N E_F(R(i,c))/(a_i(a_i + 1))   (VI.35)

since Q_F(c) is fixed here. Here E_F(R(i,c)) denotes the average of the R(i,c) values, given the fixed (F)-situation, i.e. given the numbers a_i(c), i = 1,...,N. If we can show that
E_F(R(i,c)) = a_i(c)(a_i + 1)/2   (VI.36)
then (VI.35) yields

E_F(Q_P(c)) = 2 Q_F(c) − (1/N) Σ_{i=1}^N a_i(c)/a_i = Q_F(c)
using (VI.14). Hence we only have to prove (VI.36). For any situation compatible with the given (F)-situation a_i(c), i = 1,...,N and leading to R(i,c), we construct another situation, also compatible with the given (F)-situation a_i(c), i = 1,...,N but leading to R'(i,c) (the sum of the ranks of c in item i in this second situation), for which we have that

(R(i,c) + R'(i,c))/2 = a_i(c)(a_i + 1)/2   (VI.37)

In this way all possible (P)-situations are partitioned into groups of two satisfying (VI.37), which is a constant, given the (F)-situation. Hence this shows (VI.36), since E_F(R(i,c)) averages all these constants a_i(c)(a_i + 1)/2. This leaves us with the task to construct the R'-situation such that (VI.37) is valid. Given the R(i,c)-situation (i = 1,...,N), denote by A(i,c) ⊂ {1,...,a_i} the
set of ranks that c occupies in this situation. The second situation is constructed as follows: let now c occupy the ranks a_i − r + 1, for every r ∈ A(i,c). The sum of the occupied ranks is now

R'(i,c) = Σ_{r∈A(i,c)} (a_i − r + 1)

R'(i,c) = (a_i + 1) Σ_{r∈A(i,c)} 1 − Σ_{r∈A(i,c)} r

R'(i,c) = (a_i + 1) a_i(c) − R(i,c)
proving (VI.37) and hence the theorem.
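The pairing argument is easy to verify exhaustively for small parameters; the following sketch (our own illustration, with hypothetical values of a_i and a_i(c)) checks (VI.36) and (VI.37) directly:

    from itertools import combinations

    a_i, a_ic = 6, 2    # hypothetical a_i and a_i(c)

    # all possible rank sets A(i,c) of size a_i(c) within {1,...,a_i}
    rank_sets = list(combinations(range(1, a_i + 1), a_ic))

    # (VI.36): the rank sums R(i,c) average to a_i(c)(a_i + 1)/2
    mean_r = sum(sum(A) for A in rank_sets) / len(rank_sets)
    print(mean_r, a_ic * (a_i + 1) / 2)      # both 7.0

    # (VI.37): the reversal r -> a_i - r + 1 complements each rank sum
    for A in rank_sets:
        r_prime = sum(a_i - r + 1 for r in A)
        assert sum(A) + r_prime == a_ic * (a_i + 1)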
Corollary VI.2.6.3 (Egghe (1999b)):
E_T(Q_P(c)) = Q_T(c)   (VI.38)
for every super source c.
Proof: By the above theorems we have
E_T(Q_P(c)) = E_T(E_F(Q_P(c))) = E_T(Q_F(c)) = Q_T(c).
Here we use the fact that E_T ∘ E_F = E_T since T is coarser than F; see also Chow and Teicher (1978) for a more theoretical treatment of this relation and of conditional expectations in general.
The above results allow us to use the simpler Q_T(c) values if we are dealing with the averages E_T(Q_F(c)) or E_T(Q_P(c)). This was utilized in Egghe (1999b) to prove a relationship between the fraction of multinational publications (i.e. multi-authored publications where there are at least two different country affiliations) of a country and the fractional score of this
country, confirming a related empirical result of Nederhof and Moed (1993) on the relation between the fraction of multinational publications of a country and its fractionated score (cf. Subsection VI.2.2 for its definition).
This closes Section VI.2, where the different crediting systems and counting procedures for sources and super sources have been discussed. Although each of these systems has its own merits and pitfalls, we can - with little or no discussion - consider the fractional crediting system as one of the most important systems to count the contribution of a source or super source to the production of an item. Indeed, each source contributes an equal weight, namely 1/(the number of sources that produce the item), and we think this equality among the sources best reflects, in modern times, the contribution of each source to the production of an item (and even when equality is not the best way, it is, in most cases, not possible to derive a more optimal way). In addition, the fractional counting system gives a total weight of 1 to each produced item and, hence, the weight of each item, unlike in the case of the total counting system, is independent of the number of sources producing it.
Therefore the next section is important. It gives an answer to the modelling of the size-frequency function f when fractional counting is applied. As explained in Section VI.1 the numbers f(q) with q ∈ ℚ⁺ (the positive rationals, being the set of possible production scores of sources in this fractional scoring system) are not decreasing, hence a discrete or continuous Lotka-type function of the form

f(x) = C/x^α   (VI.39)

with x ∈ ℕ, ℚ⁺ or ℝ⁺, is not the correct form of such frequency functions (cf. also the arguments in Rousseau (1992b)). Yet, the next section will show that we are able to model f(q) for the most important values of q ∈ ℚ⁺, using Lotkaian informetrics. In fact, we will show that empirical data on f(q), in the fractional counting procedure, can be approximated very accurately by assuming two Lotka functions: one for the number of authors with m ∈ ℕ articles, counted totally (cf. Argument VI.1.1.1) and one for the number of articles with n authors (cf. Subsection I.4.4). No doubt we are here in an important part of Lotkaian informetrics: using Lotka's law both in its original and in its dual formalism!
VI.3 CONSTRUCTION OF FRACTIONAL SIZE-FREQUENCY FUNCTIONS BASED ON TWO DUAL LOTKA LAWS
VI.3.1 Introduction
We start by presenting the important data set of Rao (1995), see also Egghe and Rao (2002a), partially reproduced in Appendix II for the reader's convenience, on experimental fractional frequency scores of authors in mathematics. It concerns data from the Mathematical Reviews of the year 1990. The table shows the total fractional score of an author in this year. This means, e.g., that an author with 3 articles, one with 1 author (of course this author), one with 2 authors in total and one with 3 authors in total, will appear in this table as an author with a total fractional score of 1 + 1/2 + 1/3 = 11/6 ≈ 1.83. The irregular shape of the graph (Fig. A.1) (also in Appendix II) is clear (note that we showed the rank of the n-th fraction (abscissa) and its corresponding fraction of authors (ordinate) rather than the actual value of the n-th fraction: this enhances the distinguishing level of the graph, which will be even more important in the comparison of theoretical and experimental values further on). Some intuitive explanations of the values can be given. The value of 1 is the highest and the graph decreases as a function of the entire values 1, 2, 3, 4, 5, ... . If we look at the reciprocal scores 1/5, 1/4, 1/3, 1/2, 1, we see that the graph increases. Both regularities can be explained using the size-frequency function "number of authors with n = 1, 2, 3, ... articles", which is decreasing (see again Section VI.1), and the dual size-frequency function "number of articles with n = 1, 2, 3, ... authors", which is (or is supposed to be) decreasing (cf. Subsection I.4.4), certainly in fields where the number of co-authors is not very high (as is the case in mathematics - in this sense, mathematics resembles fields in the humanities and social sciences, a fact that is also found to be true in citation analysis and aging studies!). The fact that the dual size-frequency function decreases implies that the number of authors with fractional score ..., 1/3, 1/2, 1 is increasing, at least for those authors with only one article in the database. The problem arises with the fact that authors can have several publications and that, as explained above, their fractional scores are added. This disturbs the simple reasoning given above, using the two size-frequency functions. Indeed, as also indicated in Rousseau (1992b), a fractional score of,
say 1/2, can come from one paper with 2 authors but also from two papers with 4 authors, or even 4 papers with 8 authors, or even 3 papers, two with 8 authors and one with 4 authors, ... . This "decomposability" of rational numbers is linked with the number of divisors of the denominator (i.e. a score 1/7 can only come from papers with at least 7 authors since 7 is a prime number) but this is not the whole story: a score of 3/2 has again several decompositions (3/2 = 1 + 1/2 but also 1/2 + 1/2 + 1/2 and so on). The point is that we can use a typical probabilistic technique to handle all possible decompositions, namely convolutions. Intuitively (an exact argument will be given in Subsection VI.3.2) we proceed as follows. Let φ(j) denote the fraction of authors with j articles (counted in the total way, cf. Section VI.1). Let ψ(i) denote the fraction of articles with i authors. φ and ψ are the dual size-frequency functions that will be supposed to be Lotkaian, but we do not need this at the moment. It is clear that ψ(i) determines f_1(z), being the fraction of authors that have a fractional score z = 1/i, supposing these authors to have written just one article. In fact, such an author receives a score z = 1/i if this author published one paper in which there are i authors. So f_1(z) = f_1(1/i) is related to the
occurrence of articles with i authors, hence with ψ(i). The exact relation will be given in the sequel. We then go from 1 paper to j papers: the j-fold convolution of f_1, denoted f_1 ⊗ ... ⊗ f_1 (j times), is then the distribution that describes the fraction of authors, given that they have j articles, with a given fractional score z: (f_1 ⊗ ... ⊗ f_1)(z) is the fraction of authors with a fractional score z, given that they have j papers. This condition on the number of published papers can then be lifted using the theorem of total probability:

f(z) = Σ_{j=1}^∞ (f_1 ⊗ ... ⊗ f_1)(z) φ(j)   (VI.40)

(the convolution taken j times)
and here f(z) is the overall fraction of authors with a fractional score z. This is, then, theoretically the function we are looking for. The problem is to evaluate f qualitatively and quantitatively and to see how it can match irregular fractional frequency data such as those of Rao (1995), presented in Appendix II.
In the next subsection we will consider the case z ∈ ℝ⁺, i.e. studying fractional scores in a continuous way. We will present the exact formula for f_1 in this case and we will confirm (VI.40). However, in this case, we are not able to explain the irregular values of the experimental f(q) for q ∈ ℚ⁺. We indicate that, under some extra conditions, f is increasing on [0,1] and decreasing on [1,∞[, showing at least the "overall" tendency that was also discussed above. We consider these results (for z ∈ ℝ⁺) as not satisfactory (unlike the other results obtained so far in continuous Lotkaian informetrics!) since we want to give an explanation for the large differences between the experimental values f(q) for q ∈ ℚ⁺.
Of course, using only rational values, and knowing the irregularity of the values in q, we know that an analytical expression for f(q) is not possible. It is, however, also impossible to give individual results on f(q) for all q ∈ ℚ⁺ (or even limited, say, to ℚ⁺ ∩ [0,2]) since these sets are infinite. In Subsection VI.3.3 we apply a special technique to approximate (VI.40) for grouped data, hereby having a finite number of possible f(q), q ∈ ℚ⁺, each q being the midpoint of an interval representing a grouping. The technique is possible if the groups (of scores q) are not too small and if the values q ∈ ℚ⁺ considered are not too large. We will indicate in Subsection VI.3.3 how to group data and in any case we will limit ourselves to q ≤ 2 (the most important case since it contains the crucial value 1). We will see that the found values of (VI.40) (for this finite set of arguments) are very close to the data of Rao (see Appendix II) when grouped in a similar way.
VI.3.2 A continuous attempt: z ∈ ℝ⁺
Since we will reserve the notation f for the fractional size-frequency function that we want to determine, we will use another notation for the dual size-frequency functions. Let ψ = ψ(y): [1, p_m] → ℝ⁺ denote the continuous size-frequency distribution describing the
density of papers with y as author density. From ψ we can determine the function f_1(z), being the distribution of authors with a fractional score z from one paper, in the fractional scoring system. We have the following proposition (notations from Chapter II).
Proposition VI.3.2.1 (Egghe (1993b)): For every z ∈ [a(0), 1] we have that

f_1(z) = ψ(1/z)/(z²μ)   (VI.41)

where μ is the average number of authors per paper.
Proof: Since ψ = ψ(y) is a distribution of the density of papers with y as author density, we have that ψT is the corresponding size-frequency function (T is the total number of sources, here papers). This is the function f in (II.78) and we have, by Theorem II.3.1:

h(z) = (ψT)(1/z)/z²   (VI.42)

for all z ∈ [a(0), 1], where h is the density function of items per source (i.e. h(z) is the density of authors with paper density z in case an author has only 1 article). Hence, since f_1 is a distribution, h = Af_1, where A equals the total number of items (authors), by definition of f_1. Hence

A f_1(z) = T ψ(1/z)/z²
with μ = A/T. This result is hence a direct application of the new function h introduced in Section II.3. The only difference between the functions in II.3 and the ones here is that, in II.3, we are dealing with functions while here we are dealing with distributions (needed because we will have to apply some results of probability theory in the sequel). That explains the μ in (VI.41) which is lacking in (II.78). Note that (VI.41) was proved directly in Egghe (1993b), but with the use of approximations. The proof above shows that (VI.41) is an exact result in continuous informetrics theory, based on the new function h introduced in Section II.3.
We now go from f, to f, the overall fractional size-frequency distribution that we are seeking.
Proposition VI.3.2.2 (Egghe (1993b)): Let φ: ℕ → ℝ⁺ be the fraction of authors with n papers. Then, for every z ∈ ℝ⁺, we have that the fractional size-frequency distribution equals

f(z) = Σ_{j=1}^∞ (f_1 ⊗ ... ⊗ f_1)(z) φ(j)   (VI.43)

(the convolution taken j times), where f_1 is as in Proposition VI.3.2.1, where ⊗ denotes convolution, and where we assume that the fractional score distributions in a paper are independent and identically distributed (i.i.d.), being the distribution f_1.
Proof: Let N be the random variable of the number of papers and Y(j) the fractional score from the j-th paper. Then

f(z) = P(overall fractional score = z)
= P(Y(1) + Y(2) + ... + Y(N) = z)

= Σ_j P(Y(1) + Y(2) + ... + Y(N) = z | N = j) · P(N = j)

by the theorem of total probability (see e.g. Chung (1974), Blom (1989)). Hence

f(z) = Σ_{j=1}^∞ P(Y(1) + Y(2) + ... + Y(j) = z) φ(j)
using the independence of N with respect to the Ys. The distribution of Y(1) + Y(2) + ... + Y(j) is given by the convolution of the individual distributions, given the independence of the Ys (see again Chung (1974) or Blom (1989)), and this becomes the j-fold convolution of f_1, using the assumption of identical distributions for the Ys.
Convolutions appear frequently in informetrics as a calculation tool, see Rousseau (1998) for an overview - see also Lafouge (1995), Egghe (1994b).
Note VI.3.2.3: Although it is relatively easy to accept that all Ys have the same distribution, the fact that they are independent is less sure. It is indeed not certain that a certain score in paper 2 (say) is independent of the score in paper 1, due to collaboration habits. However, we have to suppose independence for the intricate model to work. The assumption can be considered as a simplification which is acceptable in this first attempt to model fractional frequencies.
In Egghe (1993b) one then shows that, under some realistic conditions, formula (VI.43) yields a function f which is increasing on [0,1] and decreasing on [1,∞[. Since the result is technical and only valid under extra assumptions, and since it does not reveal any of the irregularities of the fractional scores for q ∈ ℚ⁺ (as is also the case with the lognormal fitting by Rousseau (1992b) of such irregular graphs), we do not give the proof here and we go directly to the
discrete variant of (VI.43) and then use groupings around certain rational numbers q, which will yield a good resemblance with the irregular experimental fractional size-frequency data.
VI.3.3 A rational attempt: q ∈ ℚ⁺
We will now use the following discrete functions:
ψ = ψ(n) being the fraction of papers with n authors, n = 1, 2, 3, ...
f_1 = f_1(q) being the fraction of authors with a fractional score of q (q = 1, 1/2, 1/3, ...) in 1 paper
φ = φ(j) being the fraction of authors with j papers, j = 1, 2, 3, ...
We have the following variant of Proposition VI.3.2.1 which we will prove in two ways: one (heuristic) proof uses Proposition VI.3.2.1 and one proof (the easiest one) is direct: we will prove the discrete variant of (VI.41).
Proposition VI.3.3.1 (Egghe and Rao (2002a)): With f_1 and ψ as above we have

f_1(q) = ψ(1/q)/(μq)   (VI.44)

for all q = 1, 1/2, 1/3, ... and where μ = A/T is the average number of authors per paper.
First (heuristic) proof: In (VI.41) the continuous ψ, now denoted ξ to avoid confusion, is the density derived from our present (discrete) function ψ, i.e. ξ = −ψ' > 0. Hence (VI.41) reads
f_1(z) = ξ(1/z)/(z²μ) = (1/μ) (d/dz) ψ(1/z)

since (d/dz) ψ(1/z) = −ψ'(1/z)/z² = ξ(1/z)/z². In our discrete case we replace the derivative (d/dz) ψ(1/z) by (1/z) ψ(1/z), hence

f_1(z) = ψ(1/z)/(zμ),

hence (VI.44) if we replace the continuous z by the rational q. This proof is necessarily heuristic since one cannot simply replace discrete variable functions by continuous ones and vice versa, but it shows that (VI.41) and (VI.44) are not in contradiction with each other (as one could expect at first glance).
Second (exact) proof:

f_1(q) = (total # authorships with fractional score q in 1 paper) / (total # authorships)

f_1(q) = ((1/q) · (total # papers with 1/q authors)) / (total # authorships)

f_1(q) = ((total # papers) / (q · total # authorships)) · ((total # papers with 1/q authors) / (total # papers))
f_1(q) = (1/(qμ)) · (fraction of papers with 1/q authors)

f_1(q) = ψ(1/q)/(μq).

Note that q = 1/n, n ∈ ℕ, necessarily. This shows that the fractional score of an author in 1 paper is directly derivable from ψ. Together with its dual function φ we will be able to derive the theoretical, discrete fractional frequency distribution.
Proposition VI.3.3.2 (Egghe and Rao (2002a)): Let f(q), for q ∈ ℚ⁺, denote the overall fractional frequency distribution, i.e. the fraction of authors with a total fractional score q. Then

f(q) = Σ_{j=1}^∞ (f_1 ⊗ ... ⊗ f_1)(q) φ(j)   (VI.45)

(the convolution taken j times), where f_1 is as in Proposition VI.3.3.1 and where ⊗ denotes (discrete) convolution. Here we assume that the fractional score distributions in a paper are independent and identically distributed, being the function f_1.
Proof: The proof is exactly the same as the one given in Proposition VI.3.2.2.
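Formula (VI.45) can also be evaluated numerically once φ and ψ are chosen. The sketch below is our own; it anticipates the Lotka choice (VI.46) made further on, and the truncations (papers at n_max authors, authors at j_max papers, scores above q_max dropped) are our own cut-offs, not part of the model:

    from fractions import Fraction
    from math import pi

    n_max, j_max, q_max = 20, 6, Fraction(2)

    psi = {n: 6 / (pi**2 * n**2) for n in range(1, n_max + 1)}
    mu = sum(n * p for n, p in psi.items())       # average authors per paper

    # (VI.44): f_1(1/n) = psi(n)/(mu*(1/n)) = n*psi(n)/mu
    f1 = {Fraction(1, n): n * p / mu for n, p in psi.items()}

    def convolve(d1, d2):
        # discrete convolution of two score distributions, capped at q_max
        out = {}
        for q1, p1 in d1.items():
            for q2, p2 in d2.items():
                if q1 + q2 <= q_max:
                    out[q1 + q2] = out.get(q1 + q2, 0.0) + p1 * p2
        return out

    f, conv = {}, {Fraction(0): 1.0}
    for j in range(1, j_max + 1):
        conv = convolve(conv, f1)             # j-fold convolution of f_1
        phi_j = 6 / (pi**2 * j**2)            # fraction of authors with j papers
        for q, p in conv.items():
            f[q] = f.get(q, 0.0) + p * phi_j  # (VI.45)

    for q, p in sorted(f.items()):
        if p > 2e-3:
            print(q, round(p, 5))

The printed values already show the characteristic spikes at q = 1/n and at small sums of such unit fractions.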
The model (VI.45), with f_1 as in (VI.44), in which both the size-frequency distributions φ and ψ appear, should be capable of explaining the various irregular experimental f(q) data. The problem still is: how to deal with f(q), where q ∈ ℚ⁺, an infinite set? First of all we will limit ourselves to q ≤ 2, hence to ℚ⁺ ∩ [0,2], since this contains the most interesting (and irregular) values, including the increase in the values 1/4, 1/3, 1/2, 1 and then the decrease to 2.
But ℚ⁺ ∩ [0,2] is still an infinite set. The trick is to allow only a few values q ∈ ℚ⁺ ∩ [0,2]. This can be done by slightly deviating from the fractional counting system, which boils down to a grouping of scores. We will hence replace f_1 by a function g_1, which is derived from f_1 (in an exact way) and which only allows for a finite number of rational values q. We will show three degrees of refinement (and the increasing complexity) in the sense that, in the finer cases, we allow for more (but still a finite number of) qs. These theoretical models are then compared with the experimental results, in which analogous groupings will be executed. The comparisons will speak for themselves.
Of course we still have to determine the distributions φ and ψ, the dual size-frequency functions. As discussed and explained in Section VI.1 we will use power laws (Lotka laws) for both of these functions and, to make things as simple as possible, we will use the power α = 2; hence, since Σ_{n=1}^∞ 1/n² = π²/6:

φ(n) = ψ(n) = (6/π²)(1/n²)   (VI.46)

for n = 1, 2, ... . Note that φ and ψ are equal as functions but not in their interpretation: φ(n) is the fraction of authors with n articles and ψ(n) is the fraction of articles with n authors. Note that (VI.46) in (VI.44) gives that we use the following function f_1:

f_1(q) = 6q/(π²μ)   (VI.47)
The case i = 2: allowing an author score of 1/2 or 1 in 1 paper

Although too rough, this case will very simply illustrate the methodology that we will apply here to yield discrete fractional frequency distributions. Note that we will limit ourselves to fractional scores q ≤ 2, as explained above.
In this simple model, an author receives a score 1 if he/she is an author of a single-authored paper. If he/she is an author of a multi-authored paper, this author receives a score 1/2. Let us call g_1 the author distribution of fractional scores in 1 paper. By definition
g_1(1) = f_1(1)   (VI.48)

g_1(1/2) = Σ_{n=2}^∞ f_1(1/n)   (VI.49)

= 1 − f_1(1)   (VI.50)

since f_1 is a distribution. Using (VI.47) this yields

g_1(1) = 6/(π²μ)   (VI.51)

g_1(1/2) = 1 − 6/(π²μ)   (VI.52)
We now apply (VI.45), but with f_1 replaced by g_1 and for the values q = 1/2, 1, 3/2, 2, the only possible scores (not exceeding 2). This gives
f(1/2) = g_1(1/2) φ(1)   (VI.53)

f(1) = g_1(1) φ(1) + (g_1(1/2))² φ(2)   (VI.54)

f(3/2) = 2 g_1(1/2) g_1(1) φ(2) + (g_1(1/2))³ φ(3)   (VI.55)

f(2) = (g_1(1))² φ(2) + 3 (g_1(1/2))² g_1(1) φ(3) + (g_1(1/2))⁴ φ(4)   (VI.56)
where φ is given by (VI.46).
These values are then compared with the corresponding grouped data from Rao's table in Appendix II, grouped as follows:

score 1/2 corresponds to grouping the data in the interval ]0, 0.75]
score 1 corresponds to grouping in ]0.75, 1.25]
score 3/2 corresponds to ]1.25, 1.75]
score 2 corresponds to ]1.75, 2.25].

It is clear that we take the interval ]0, 0.75] for 1/2 (and not ]0.25, 0.75]) since, in our model, all fractional scores (in one paper) smaller than 1/2 are transformed into 1/2.
Of course, these groupings are not a perfect analogue of our simplified model since an author with an overall score of 1/4, being the result of participation in 2 8-authored papers, is classified into the score 1 in the model while it is classified in the interval ]0, 0.75] in the grouping. This difference is there, but it will diminish in the next cases, where we allow for smaller fractions. Also, if we find good results in this setting, this will indicate that the above difference is not destroying the similar nature of both simplifications.
The (only) parameter μ is determined by requiring f(1/2) to be exact:

f(1/2) = (1 − 6/(π²μ)) (6/π²) = (# in ]0, 0.75]) / (total #) = 18,892/46,853
(see the table in Appendix II). This yields μ = 1.80537576. We now have Table VI.2 and the graph in Fig. VI.1, comparing theoretical and experimental fractional frequency distributions.
From the above we see that, although we only compare 4 fractions, both the theoretical and experimental graphs follow the same pattern. This will become clearer in the next (more important and more interesting) cases.
Table VI.2  Distribution of overall fractional scores (case of i = 2).
Reprinted from Egghe and Rao (2002a), Table 1, p. 794. Copyright John Wiley & Sons Limited. Reproduced with permission.

q      Theoretical (f)   Experimental
1/2    0.4033047         0.4033047
1      0.2715154         0.345911681
3/2    0.0875965         0.079546667
2      0.0545977         0.081360852

Fig. VI.1  Theoretical and experimental fractional frequency distributions (case of i = 2). The numbers in the abscissa refer to the ranks of the fractional scores, not to the fractional scores themselves.
Reprinted from Egghe and Rao (2002a), Fig. 2, p. 794. Copyright John Wiley & Sons Limited. Reproduced with permission.
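The case i = 2 is small enough to recompute directly. The sketch below (our own code; the bisection solver is our choice) determines μ from f(1/2) = 18,892/46,853 and then evaluates (VI.53)-(VI.56); it reproduces the theoretical column of Table VI.2 up to the rounding of the quoted data:

    from math import pi

    C = 6 / pi**2                      # the constant in (VI.46)
    phi = lambda j: C / j**2
    target = 18_892 / 46_853           # observed fraction of scores in ]0, 0.75]

    def f_half(mu):                    # f(1/2) = g_1(1/2)*phi(1), cf. (VI.53)
        return (1 - C / mu) * C

    lo, hi = 1.0, 10.0                 # f_half is increasing in mu on [1, 10]
    for _ in range(60):
        mid = (lo + hi) / 2
        if f_half(mid) < target:
            lo = mid
        else:
            hi = mid
    mu = (lo + hi) / 2                 # ~1.8054
    g1, gh = C / mu, 1 - C / mu        # (VI.51) and (VI.52)

    print(round(gh * phi(1), 7))                              # f(1/2), (VI.53)
    print(round(g1 * phi(1) + gh**2 * phi(2), 7))             # f(1),   (VI.54)
    print(round(2*gh*g1*phi(2) + gh**3 * phi(3), 7))          # f(3/2), (VI.55)
    print(round(g1**2*phi(2) + 3*gh**2*g1*phi(3)
                + gh**4*phi(4), 7))                           # f(2),   (VI.56)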
The case i = 3: allowing an author score of 1/3, 1/2 or 1 in 1 paper

This will be the first interesting case. Here an author receives a score 1 if he/she is an author of a single-authored paper, a score 1/2 if he/she is an author of a 2-authored paper and a score 1/3 if he/she is an author of a j-authored paper, for all j ≥ 3. Now we have
g_1(1) = f_1(1) = 6/(π²μ)   (VI.57)

g_1(1/2) = f_1(1/2) = 3/(π²μ)   (VI.58)

g_1(1/3) = Σ_{n=3}^∞ f_1(1/n) = 1 − 9/(π²μ)   (VI.59)
Possible overall fractional scores (in ]0,2]) are 1/3, 1/2, 2/3, 5/6, 1, 7/6, 4/3, 3/2, 5/3, 11/6, 2 (theoretical), to be compared with grouped data (from the table in Appendix II) in the intervals ]0, 5/12], ]5/12, 7/12], ]7/12, 9/12], ]9/12, 11/12], ]11/12, 13/12], ]13/12, 15/12], ]15/12, 17/12], ]17/12, 19/12], ]19/12, 21/12], ]21/12, 23/12], ]23/12, 25/12].
We have the following formulas from which the theoretical fractional frequency distribution can be calculated.
f(1/3) = g_1(1/3) φ(1)   (VI.60)

f(1/2) = g_1(1/2) φ(1)   (VI.61)

f(2/3) = (g_1(1/3))² φ(2)   (VI.62)

f(5/6) = 2 g_1(1/3) g_1(1/2) φ(2)   (VI.63)

f(1) = g_1(1) φ(1) + (g_1(1/2))² φ(2) + (g_1(1/3))³ φ(3)   (VI.64)

f(7/6) = 3 (g_1(1/3))² g_1(1/2) φ(3)   (VI.65)

f(4/3) = 2 g_1(1) g_1(1/3) φ(2) + 3 (g_1(1/2))² g_1(1/3) φ(3) + (g_1(1/3))⁴ φ(4)   (VI.66)

f(3/2) = 2 g_1(1) g_1(1/2) φ(2) + (g_1(1/2))³ φ(3) + 4 (g_1(1/3))³ g_1(1/2) φ(4)   (VI.67)

f(5/3) = 3 (g_1(1/3))² g_1(1) φ(3) + 6 (g_1(1/3))² (g_1(1/2))² φ(4) + (g_1(1/3))⁵ φ(5)   (VI.68)

f(11/6) = 6 g_1(1/3) g_1(1/2) g_1(1) φ(3) + 4 (g_1(1/2))³ g_1(1/3) φ(4) + 5 (g_1(1/3))⁴ g_1(1/2) φ(5)   (VI.69)

f(2) = (g_1(1))² φ(2) + 3 (g_1(1/2))² g_1(1) φ(3) + 4 (g_1(1/3))³ g_1(1) φ(4) + (g_1(1/2))⁴ φ(4) + 10 (g_1(1/3))³ (g_1(1/2))² φ(5) + (g_1(1/3))⁶ φ(6)   (VI.70)

Again, the parameter μ is determined by
f(1/3) = g_1(1/3) φ(1) = (1 − 9/(π²μ)) (6/π²) = (# in ]0, 5/12]) / (total #) = 6,508/46,853

yielding μ = 1.1819488123. We obtain the remarkable Table VI.3 and the graph in Fig. VI.2.
The agreement between the theoretical and experimental results is remarkable. This proves that the two dual Lotka laws are capable of modelling fractional frequency distributions, although we are also convinced that other decreasing size-frequency distributions, to be used for φ and ψ, can serve the same purpose as well. The model is also capable of explaining "generally felt" inequalities, i.e. inequalities of the type f(1) > f(1/2), f(7/6) < f(3/2), and so on.
For a list, see Egghe and Rao (2002a), but we must remark that some results might depend on the method of grouping. This is certainly the case for the following result: for all n ∈ ℕ it follows from (VI.57), (VI.58), (VI.59) and (VI.45) (with f_1 replaced by g_1) that

lim_{μ→∞} f(q) = 0, if q ≠ n/3

lim_{μ→∞} f(q) = 6/(π²n²), if q = n/3   (VI.71)

The explanation for this last regularity is as follows: for extremely high μ, the chance to have a paper with fewer than 3 authors is very small. In this case one can only receive a fractional score of 1/3 per paper. Hence the only overall scores that are possible are q = n/3, if an author has n papers. The probability for this last event is φ(n) = 6/(π²n²), by (VI.46), which explains (VI.71).
Table VI.3  Distribution of overall fractional scores (case of i = 3).
Reprinted from Egghe and Rao (2002a), Table 2, p. 795. Copyright John Wiley & Sons Limited. Reproduced with permission.

q      Theoretical (f)   Experimental
1/3    0.1389322         0.1389322
1/2    0.1562373         0.2530468
2/3    0.0079701         0.0112693
5/6    0.0294265         0.0236911
1      0.323324          0.3125520
7/6    0.0027311         0.0097112
4/3    0.0389478         0.0192304
3/2    0.0417687         0.0340426
5/3    0.0059575         0.0049303
11/6   0.0129368         0.0072781
2      0.0483318         0.070959

Fig. VI.2  Theoretical and experimental fractional frequency distributions (case of i = 3). The numbers in the abscissa refer to the ranks of the fractional scores, not to the fractional scores themselves.
Reprinted from Egghe and Rao (2002a), Fig. 3, p. 795. Copyright John Wiley & Sons Limited. Reproduced with permission.
More universal inequalities between fractional score frequencies are obtained in Egghe (1996), where the SBS principle is extended to systems in which items can have multiple sources and where one uses fractional scores. The model leads to irregular fractional size-frequency functions but a match with experimental data is not really possible.
Table VI.4  Distribution of overall fractional scores (case of i = 4).
Reprinted from Egghe and Rao (2002a), Table 3, p. 798. Copyright John Wiley & Sons Limited. Reproduced with permission.

q       Theoretical (f)   Experimental
1/4     0.036241009       0.036241009
1/3     0.1039555         0.102233051
1/2     0.1564714         0.250570935
7/12    0.0030927         0.003052099
2/3     0.0044441         0.01274198
3/4     0.0046533         0.00420464
5/6     0.013455          0.019827973
11/12   0.0003526         0.001323288
1       0.3220503         0.309905449
13/12   0.0010631         0.001686125
7/6     0.0015435         0.005698675
5/4     0.0100909         0.003905833
4/3     0.0290481         0.018654088
17/12   0.000205          0.001899558
3/2     0.0417575         0.053358376
19/12   0.0024309         0.000661644
5/3     0.0035013         0.003478966
7/4     0.0033829         0.001430004
11/6    0.0093664         0.007000619
23/12   0.0004929         0.000277464
2       0.0476329         0.0697428726
Fig. VI.3  Theoretical and experimental fractional frequency distributions (case of i = 4). The numbers in the abscissa refer to the ranks of the fractional scores, not to the fractional scores themselves.
Reprinted from Egghe and Rao (2002a), Fig. 4, p. 798. Copyright John Wiley & Sons Limited. Reproduced with permission.
In Egghe and Rao (2002a) we further discuss the cases i = 4 and i = 5, but the complexity rapidly increases. The agreement with the experimental fractional frequency values remains remarkable. We only show the results in the case i = 4, i.e. where we allow an author score of 1/4, 1/3, 1/2 or 1 in 1 article (Table VI.4 and Fig. VI.3).

An attempt, comparable with the concrete calculation of discrete convolutions, has been undertaken in Burrell and Rousseau (1995) in some special cases (e.g. articles are written by 1 or 2 authors, or a situation where articles can only have 1, 2, 4 or 8 authors). All these restrictions are needed to limit the number of rational numbers q to be considered, a necessity. The approach in Subsection VI.3.3 is more fruitful since it can handle any real situation.
The next Chapter VII (Section VII.4) gives another application of Lotka's law in the case that articles are written by multiple authors. There we calculate the author rank distribution, e.g. in the case that the number of authors per paper follows a Lotka function. Since this topic is totally different from the ones discussed in this chapter and since Chapter VII is devoted to Lotkaian applications, we deferred this study to Chapter VII.
VII FURTHER APPLICATIONS IN LOTKAIAN INFORMETRICS

VII.1 INTRODUCTION

The purpose of this book was to present a unified theory of Lotkaian informetrics so that the informetrics community (and even beyond informetrics) has a tool that can be used in further applications. Indeed, although the law of Lotka and its equivalent expressions (Chapter II) are interesting in themselves, we should always try to apply them in other informetrics issues. If this can be done we are far ahead of research that always starts from zero and tries to explain certain regularities without referring to earlier results. In this sense the Chapters III, IV, V and VI could be considered as applications in Lotkaian informetrics but, because of their special importance and far-reaching consequences in informetrics, we devoted a special chapter to them.
This book could end here were it not that there exist further interesting applications of the Lotka function in diverse topics. They are gathered in this closing chapter. We start, in Section VII.2 with a warning that some regularities found in informetrics (exact or approximate) are not explained using informetric laws but simply by plain mathematics. We give two examples: one on the well-known arcs at the end of a Leimkuhler curve which are exact exponential functions on a log-scale (i.e. linear functions of the original variable) and one on a relation (called type-token identity) - see Chen and Leimkuhler (1989), which is not correct but which is - mathematically - explainable in an approximative way.
Section VII.3 deals with a regularity that is, in essence, also not informetric in nature but a probabilistic argument leads to the empirically verified power law. It concerns the graph of Wallace (1986) on the relation between the number of articles per journal and the journal
median citation age. We will prove that the graph is below a decreasing power function with exponent 2, but this is not a consequence of Lotkaian theory but of the Central Limit Theorem in probability theory. Only the fact that the cloud of points becomes thinner for high numbers of articles per journal is explained by Lotka's law.
Section VII.4 deals with another topic on multi-authored articles (cf. Chapter VI), namely on the distribution of the rank of an author in such an m-authored paper. We show that, if the number of authors per paper follows Lotka's law (cf. Subsection 1.4.4 and Chapter VI), the author rank distribution follows the same Lotka law. We further determine author ranks, using author seeds, i.e. a universal number indicating the general place of an author name as e.g. expressed by the alphabetic order.
A very important application of Lotkaian informetrics is given in Section VII.5. There we determine the so-called "first-citation distribution", i.e. (e.g. in a bibliography) the overall cumulative distribution of the time period between the publication of an article and the time it receives its first citation. We can explain it using an exponentially decreasing age function (i.e. time distribution of all citations) and a Lotka law for the number of citations for an article in this bibliography. Note that both functions are applicable in one model, due to the arguments given in Subsection 1.3.4.2. The application is remarkable since it is capable of explaining concave as well as S-shaped cumulative distribution functions and - what is even more interesting - the difference between the two types is characterized in terms of the Lotkaian exponent a . Hence Lotka's size-frequency function is decisive in this matter. In this section we also mention an application of aging functions (as the exponential one) to the explanation of the relation of the Price index in function of the mean and median reference age.
The closing Section VII.6 deals with the intricate problem of determining the rank-frequency and size-frequency functions for N-grams and N-word phrases. We derive these functions using Zipf's function as rank-frequency function for the constituting parts of the N-gram (letters) or of the N-word phrase (words) and using a technique of N-product space, being the Cartesian product of the IPPs of the constituting parts. The theory presented here is exact in the sense that no approximations are used and in this sense improves earlier results of the author. Using the size-frequency function we will also determine (as in Chapter III) formulae
for the Type/Token average μ_N and the Type/Token-Taken average μ*_N of N-grams and N-word phrases, and the values of μ_N and μ*_N are compared. This final section also gives rise to (hard) problems concerning N-grams and N-word phrases.
VII.2 EXPLAINING "REGULARITIES"

Real regularities need explanations. Sometimes these explanations are elementary and are not of an informetric nature (i.e. we do not need Lotka's law or subsequent results to explain them). The researcher should be able to make a distinction between the various types of explanations. In this section we will give two examples of regularities that can be explained via plain (simple) mathematics.
VII.2.1 The arcs at the end of a Leimkuhler curve
One of the simplest regularities ever found in informetrics, but which is not an informetric regularity at all, is the fact that, at the end of a Leimkuhler curve, one detects "arcs". One obtains a Leimkuhler curve when graphing the cumulative number G(r) of items in the first (largest) r sources, versus log r (any log can be used, e.g. In). The graph looks as in Fig.VII. 1 and can be found, for example, in Warren and Newill (1967), Brookes (1973), Praunlich and Kroll (1978), Wilkinson (1973) and Summers (1983).
Fig. VII.1  A Leimkuhler curve, with arcs for large r.
Although the graph (without the arcs) has an equation of the form (see formula (II.43))
G(r) = a ln(1 + br)   (VII.1)
which is certainly an informetric regularity (see Chapter II or Egghe (1989, 1990a), Egghe and Rousseau (1990a)), the arcs, apparently deviating from (VII.1), are not informetric in nature. Indeed, there are frequently several high-ranking sources that provide the same number of items: there might be a large number of sources with three items, a larger number with two items, and an even larger number of sources with only one item each. Because increases of G(r) at these ranks are linear in r (per group of equal productivity), the graph of G versus log r is exponential (per group of equal productivity). These exponential graphs become more visible as the groups of sources with equal productivity become longer. This explains the arcs near the end of a Leimkuhler curve.
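The mechanism is easy to see numerically. A small sketch (with invented productivity numbers, not data from the cited studies): within each group of equal productivity, G grows linearly in r and hence exponentially in log r.

    import math

    # a few productive sources, then large groups with 3, 2 and 1 item(s) each
    counts = [40, 25, 15, 9, 6] + [3] * 30 + [2] * 80 + [1] * 300

    G, cum = [], 0
    for c in counts:
        cum += c
        G.append(cum)

    # sample (r, ln r, G(r)) triples: inside each constant-productivity group,
    # G is linear in r, which plots as an exponential arc against ln r
    for r in (5, 20, 35, 70, 115, 250, 415):
        print(r, round(math.log(r), 3), G[r - 1])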
Hence, this phenomenon has a purely mathematical explanation and has nothing to do with informetric aspects such as formula (VII.1) or the so-called Groos droop (see Groos (1967)).
VII.2.2 A "type-token identity" of Chen and Leimkuhler
In our terminology a type-token identity could also be named a source-item identity. In our notation (A = total number of items, T = total number of sources) the Chen and Leimkuhler identity (see Chen and Leimkuhler (1989)) is the following (any log can be used):
T/A + log T/log A = 1   (VII.2)
Chen and Leimkuhler comment on this identity as "approximately true". But, both as a possible exact or approximate identity, we are interested in its explanation, certainly since such a possible identity completely belongs to the informetric theory developed in this book.
Let us suppose Lotkaian informetrics. Theorem II.2.1.2.1 says that for any A > T > 0 and any 1 < α < 2, there exists a number p_m > 1 such that the Lotka function
f(j) = C/j^α   (VII.3)

j ∈ [1, p_m], satisfies (II.20) and (II.21), hence that, even in the special case of Lotkaian informetrics (and even restricted to 1 < α < 2), any value A > T > 0 can occur. This shows that (VII.2) is not valid since we can take A as large as we wish, while keeping T fixed. So, informetrically, we strongly have to reject (VII.2) since we have a proof of the contrary result even in a special part of informetrics (namely Lotkaian informetrics).
Yet Table 1, p. 46 in Chen and Leimkuhler (1989) is remarkable since the 38 IPPs (texts in this case) considered there all have a left hand side for (VII.2) between 0.946 and 1.036! We checked the values for T and A of these 38 texts and we can conclude that, approximately, A ≈ 20,000 and T ≈ 3,500 (magnitude estimation). So we could say that A = μT with μ ≈ 6 (a relation that is more or less also true for their Table 2). But A = μT implies for the left hand side of (VII.2)
1/μ + log T/(log μ + log T)   (VII.4)

which takes the value 0.987 for T = 3,500 and μ = 6. Furthermore, in all cases where μ is large and log T ≫ log μ we have that (VII.4) is close to 1 (= the limit of (VII.4) for μ → ∞ and log T/log μ → ∞). Hence the regularity (VII.2) is indeed approximately true, due to mathematical properties of the surface (in coordinates (x,y,z) ∈ ℝ³)
z = 1/y + log x/(log y + log x)   (VII.5)
It is our feeling that further informetric assumptions to prove (VII.2) are not necessary: in Chen and Leimkuhler (1989), SBS arguments (see Subsection I.3.6), although yielding an approximation of (VII.2), only serve to yield A = μT with μ large and with log μ ≪ log T and, again, the real explanation comes from the mathematical properties of the surface (VII.5).
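A quick numerical sketch (our own) of this computation:

    from math import log

    def lhs(T, mu):
        # left hand side of (VII.2) with A = mu*T, i.e. expression (VII.4)
        A = mu * T
        return T / A + log(T) / log(A)

    print(round(lhs(3_500, 6), 3))            # 0.987, the value quoted above
    for mu, T in [(6, 10**6), (50, 10**9)]:
        print(mu, T, round(lhs(T, mu), 4))    # approaches 1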
VII.3 PROBABILISTIC EXPLANATION OF THE RELATIONSHIP BETWEEN CITATION AGE AND JOURNAL PRODUCTIVITY

In Wallace (1986) one can read the following hypothesis: "For a given subject literature, the median citation ages of the journals contributing to that literature will vary inversely with the productivity of those journals, where productivity is measured in terms of the number of articles contributed by each journal". The validity of this hypothesis needs verification, not in the least because the hypothesis describes an important property: it would mean that the size of the journal (number of articles) (which is visible at first glance) determines the scientific active life span (in terms of citations to a journal) of a journal, which is not so easy to see. Wallace produces the graph in Fig. VII.2 and comments "that the observed relationship is clearly a systematic one", although a functional relation does not seem to be the case here (an abscissa can have many ordinates).
This graph can be described as follows (Wallace (1986)). The highly productive journals all have relatively low median citation ages, while the journals with high median citation ages are all relatively unproductive. Unproductive journals do not all have high median citation ages (in fact those that do form a minority!) nor are journals with small median citation ages all highly productive (again: they form a minority in fact!). In fact, what Wallace describes is an area filled with points with increasing density for smaller coordinates and where there is a decreasing upper curve that looks hyperbolic or, more generally, a decreasing curve going from +∞ to 0 when the abscissa goes from 0 to +∞. Denoting such a hypothetic curve by y = f(x), we hence are looking for a mathematical explanation of a relation of the form y ≤ f(x) with f of the form described above.
Fig. VII.2  Journal productivity plotted against journal median citation age.
Reprinted from Wallace (1986), Fig. 3, p. 143. Copyright John Wiley & Sons Limited. Reproduced with permission.
In our study we will replace the abscissa by the "journal mean citation age", leaving open the problem as such, but we are convinced that a solution to the problem where "median" is replaced by "mean" also explains the original problem to a great extent. We also fix the subject, represented as a set of journals, as did Wallace. Considering all mean citation ages of each article in each journal, we can calculate their average: the mean citation age of the field, denoted μ.
Let us now consider the journals in this set that consist of A articles (A ∈ ℕ, fixed). Let us call this the A-subfield of the entire field. This A-subfield again can be considered as
consisting of articles (as we did above with the entire field). As above, this yields the mean citation age μ_A of the A-subfield. The hypothesis of Wallace, rephrased in this terminology, is that μ_A is a decreasing function of A. To show that this hypothesis is not necessarily linked with Figure VII.2, and that the regularity in Figure VII.2 can be explained in a non-informetric way, we will assume that μ_A is constant in A, i.e., that μ_A does not depend on A. In other words: μ_A = μ. We assume this for the sake of simplicity but also (and more importantly) because explaining Figure VII.2 with this assumption is the most "spectacular" explanation because, in our arguments, we deny Wallace's hypothesis, but we will nevertheless be able to explain the regularity of Figure VII.2. In the same way we assume that the variances σ²_A are also A-independent: σ²_A = σ², for all A ∈ ℕ.
For each A ∈ ℕ fixed and for each journal with A articles (hence belonging to the A-subfield), we have that this journal's mean citation age is a number which is the mean of a sample of A articles. The Central Limit Theorem (CLT) (see e.g. Blom (1989), Chung (1974), Chow and Teicher (1978)) then yields that this mean belongs to a 100(1−α)% confidence interval (around μ) of the form

]μ − Z(α) σ/√(A−1), μ + Z(α) σ/√(A−1)[   (VII.6)

where Z(α) is the abscissa such that, on the graph of the Gaussian distribution (i.e. the standard normal distribution), the tails determined by Z(α) and −Z(α) have a total area of α.
6 ,
g
VA-I In general, the values of Z(a) can be read from the table of the standard normal distribution, which is available in any book on statistics or probability theory.
Expression (VII.6) contains the key to the explanation of the graph in Figure VII.2. Indeed, for each fixed A ∈ ℕ (i.e. ordinate in Fig. VII.2) we have that the "sample" journals with A articles have a mean citation age between the values given by (VII.6), for 100(1−α)% sure. The lower A, the larger this confidence interval. From (VII.6), it follows that the deviation from μ to the right is equal to

m = Z(α) σ/√(A−1)   (VII.7)
where m is the abscissa in Figure VII.2. Because A is the ordinate, we will invert (VII.7), yielding
A = Z(α)² σ²/m² + 1

or, more simply,

A = E_α/m² + 1   (VII.8)
where E_α is a constant, decreasing with α. Equation (VII.8) is the decreasing graph at the right side of Fig. VII.2 (described above as y = f(x)). It turns out to be a decreasing power function with exponent 2, but, as indicated, unrelated to Lotka's function.
The "fading away" effect in Fig. VII.2 has two causes. First of all, in the abscissa, when going from low values of m to high values of m, horizontally at each fixed A, the effect is explained by the different values of a and corresponding probabilities 1 - a. Of course, the left part of (VII.6) is cut off by the requirement that m > 0 (in (VII.7)). Secondly, in the ordinate, when going from low values of A to high values of A, the effect is explained by the law of Lotka on the number of journals with A articles (hence decreasing in A). This is the only informetric part in the explanation!
Note also that from (VII.8) we have that

lim_{m→∞} A = 1   (VII.9)

and

lim_{m→0} A = ∞   (VII.10)

all in agreement with the graph in Fig. VII.2. In short, Fig. VII.2, but with "median" replaced by "mean", is explained by high variances of small samples and, conversely, by small variances of large samples, as expressed by (VII.6).
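A minimal sketch (with invented field parameters μ and σ, and the standard 95% value Z(0.05) = 1.96) of (VII.7) and (VII.8):

    from math import sqrt

    mu, sigma, Z = 8.0, 4.0, 1.96
    E = (Z * sigma) ** 2               # the constant E_alpha in (VII.8)

    for A in (2, 5, 10, 50, 200):
        m = Z * sigma / sqrt(A - 1)    # rightward deviation (VII.7)
        print(A, round(mu + m, 2))     # upper end of the confidence interval

    # the bounding curve of Fig. VII.2: A = E/m**2 + 1
    for m in (0.5, 1.0, 2.0, 4.0):
        print(m, round(E / m**2 + 1, 1))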
Unrelated to this topic, we note that in Egghe (1997) another probabilistic explanation is given for the relation between the Price index (the fraction of references that are not older than a certain fixed age) and the mean (or median) reference age. Lotka's function does not appear in this explanation (therefore we do not give the argument here) but we can indicate that we are also dealing here with a thick cloud of points (as in the above problem), hence with a relation and not a function. The overall functional tendency is explained by using a decreasing exponential aging distribution (for the number of references to t years ago), having only one parameter, but the full relation (the cloud) is explained by replacing the exponential aging distribution by a lognormal aging distribution which contains two parameters (this extra parameter allows for the explanation of the full relation and not just the functional tendency).
VII.4 GENERAL AND LOTKAIAN THEORY OF THE DISTRIBUTION OF AUTHOR RANKS IN MULTI-AUTHORED PAPERS

VII.4.1 General theory
It is clear that, if there are m authors in a paper, each of the m author ranks is occupied, hence that the distribution of author ranks is a uniform distribution, being p(k|m) = 1/m, k = 1,...,m. If one is interested in the overall author rank distribution,
incorporating a variable number of authors per paper, the distribution of author ranks is not uniform anymore since an author rank m cannot occur in k-authored papers with k < m.
We now wonder what the author ranking distribution is. It is clear that the distribution ψ = ψ(m) of the papers with m authors (ψ is the same as the function studied in Subsection VI.3.3) is involved here. Indeed, we have the following easy proposition (all the results in this section are proved in Egghe, Liang and Rousseau (2003)).
Proposition VII.4.1.1: The author ranking distribution, p(k), k = 1, 2, ..., is given by

p(k) = Σ_{m=k}^∞ ψ(m)/m   (VII.11)
Proof: This result immediately follows from the theorem of total probability (see e.g. Grimmett and Stirzaker (1985) or Feldman and Fox (1991)). Indeed:
p(k) = Σ_{m=k}^∞ p(k|m) ψ(m) = Σ_{m=k}^∞ ψ(m)/m   (VII.12)

where p(k|m) denotes the conditional probability of rank k, given that there are m ≥ k authors. As explained above, p(k|m) = 1/m for all k = 1,...,m, the uniform distribution on the set {1,...,m}. Note that Σ_{k=1}^∞ p(k) = 1, as is easily seen, using that ψ is a distribution.
Of course, expression (VII.11) is hard to evaluate analytically since a discrete infinite sum is appearing. We can, however, proceed as follows.
Definition VII.4.1.2 (see e.g. Apostol (1957)): Consider two sequences (a(n))_{n∈ℕ} and (b(n))_{n∈ℕ}. One writes that
a(n) = O(b(n))   (VII.13)

if there exist numbers n₀ and C such that, for all n ≥ n₀:

|a(n)| ≤ C |b(n)|   (VII.14)

Intuitively, this means that the sequence (a(n))_{n∈ℕ} does not grow faster than the sequence (b(n))_{n∈ℕ} and, in the case that lim_{n→∞} b(n) = 0 (which will be the case in our application), that the sequence (a(n))_{n∈ℕ} resembles the sequence (b(n))_{n∈ℕ} in the sense of variability in n (and, of course, then also lim_{n→∞} a(n) = 0). The O-notation is originally due to E. Landau.
We have the following result.

Proposition VII.4.1.3: If the distribution of the number of authors per article, ψ(m), is given by a Lotka distribution (cf. Subsection I.4.4), then the author ranking distribution, p(m), is related to ψ(m) by

p(m) = O(ψ(m))   (VII.15)

Proof: Let ψ be the Lotka distribution

ψ(m) = C/m^a

with a > 0. Then, by the above proposition,

p(m) = Σ_{n=m}^∞ C/n^{a+1}.
Hence, by the integral test (cf. Apostol (1957) or Protter and Morrey (1977)) and the fact that the series Σ_{n=1}^∞ C/n^{a+1} is convergent (since a > 0), we have:

Σ_{n=m}^∞ C/n^{a+1} ≤ C/m^{a+1} + ∫_m^∞ (C/x^{a+1}) dx = C/m^{a+1} + C/(a m^a).

So,

p(m) ≤ (C/m^a)(1/m + 1/a) ≤ ψ(m)(1 + 1/a),

or p(m) = O(ψ(m)).
This proves the theorem.

If we adopt the continuous model of author densities, as in Subsection VI.3.2, we have the following result.

Proposition VII.4.1.4: In the continuous setting, the author ranking density distribution p(k), k ∈ ℝ⁺, is given by
p(k) = ∫_k^∞ ψ(x)/(x − 1) dx   (VII.16)

where ψ is the continuous size-frequency distribution of Subsection VI.3.2.
Proof: The theorem of total probability is also valid for density functions (see e.g. Grimmett and Stirzaker (1985)), so that

p(k) = ∫_k^∞ p(k|x) ψ(x) dx

where p(k|x) is the conditional density function for the continuous rank k, given that we can pick ranks in the interval [1,x] (ranks are ≥ 1 since ranks are item densities and all item densities are ≥ 1, see e.g. formula (II.10)). Hence p(k|x) = 1/(x − 1), the uniform density on [1,x], and (VII.16) follows. Note that

∫_1^∞ p(k) dk = 1

as it should, and as follows easily from an application of Fubini's theorem (see Apostol (1957) or Protter and Morrey (1977)) and the fact that ψ is a distribution.
If ψ(x) = c/x^a = (a − 1)/x^a, the continuous Lotka distribution (c = a − 1 since ψ is a distribution), (VII.16) has the form

p(k) = ∫_k^∞ (a − 1)/(x^a (x − 1)) dx   (VII.17)

which we cannot evaluate, let alone relate to ψ as was done in the discrete case (Proposition VII.4.1.3).
VII.4.2 Modelling the author rank distribution using seeds
Assume that each author, A, has a characteristic number s_A ∈ [0,1], where s_A is equal to the probability that another author comes before A in the byline of an article. This characteristic number will be called a 'seed'.
We will next solve the problem of determining p(k,s): the probability for an author with seed number s to be the k-th author (in general); or, more specifically, p(k,s|m): the probability for an author with seed s to be the k-th author in a publication with m authors. In this connection we have the following result.
Proposition VII.4.2.1:

p(k,s|m) = C(m−1, k−1) s^{k−1} (1−s)^{m−k}   (VII.18)

p(k,s) = Σ_{m=k}^∞ C(m−1, k−1) s^{k−1} (1−s)^{m−k} ψ(m)   (VII.19)

where ψ(m) is the probability that a paper has m authors and C(m−1, k−1) denotes a binomial coefficient.
Proof: Consider an author A, with seed s. Hence

P(an author is before A in an author list) = s
P(an author is after A in an author list) = 1 − s

Author A has rank k in an article with m authors, m being at least equal to k, if and only if k − 1 authors precede A and m − k follow A. We can describe this as follows. As the article has m authors, m − 1 co-authors are chosen at random. Each of them ends up before A with probability s (we refer to this as a 'success' in a Bernoulli trial). So author A ends up at rank k if there are k − 1 successes (and consequently m − k 'failures'). This shows that the situation can be described by a binomial distribution:

p(k,s|m) = C(m−1, k−1) s^{k−1} (1−s)^{m−k}
The second formula, where the total number of authors of an article is not given, follows by the law of total probability:

p(k,s) = Σ_{m=k}^∞ p(k,s|m) ψ(m) = Σ_{m=k}^∞ C(m−1, k−1) s^{k−1} (1−s)^{m−k} ψ(m)
These formulae, based on seeds, are very important since it is relatively easy to determine seeds for authors, e.g. based on alphabetic ranking of authors' names. This is explained in the next subsection.
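The following sketch evaluates (VII.18) and (VII.19) numerically, assuming, purely for illustration, a Lotka distribution for \psi(m) with exponent 3 and a finite cutoff of the series; these are not values from the text.

```python
# Sketch of Proposition VII.4.2.1; alpha and M_MAX are illustration choices.
import numpy as np
from scipy.special import comb

alpha, M_MAX = 3.0, 2000
m_all = np.arange(1, M_MAX + 1)
psi = 1.0 / m_all**alpha
psi /= psi.sum()                               # psi(m), normalized Lotka weights

def p_k_s_given_m(k, s, m):
    """(VII.18): probability that a seed-s author is k-th among m authors."""
    return comb(m - 1, k - 1) * s**(k - 1) * (1 - s)**(m - k)

def p_k_s(k, s):
    """(VII.19): law of total probability over the number of authors m >= k."""
    return sum(p_k_s_given_m(k, s, m) * psi[m - 1] for m in range(k, M_MAX + 1))

# An author with a small seed (few colleagues sort before them) is
# almost always first author:
print([round(p_k_s(k, 0.1), 4) for k in (1, 2, 3)])
```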
VII.4.3 Finding a seed based on alphabetical ranking of authors
In this section we introduce a method of finding a seed for an author. First we will introduce an injection between an author's name and a number in the set [0,1] \cap \mathbb{Q}.
We will work with the standard western alphabet, consisting of 26 letters, but the method applies to any other alphabet or lexicographical ordering consisting of symbols with a fixed rank. We add a 0 (zero) symbol to the alphabet, so that we have an alphabet of 27 symbols. Let S denote the set of all concatenations of a finite or infinite number of symbols. Then S_1...S_n \in S (a concatenation of a finite number of non-zero symbols) represents an arbitrary name. The injection
f : S \to [0,1]
is defined as:
f(S_1 S_2 ... S_n) = \sum_{i=1}^{n} \frac{|S_i|}{27^i}     (VII.20)

where |S_i| denotes the rank of the symbol S_i. An equivalent way of defining the function f is:
f(S_1 S_2 ... S_n) = 0.|S_1|...|S_n|

where 0.|S_1|...|S_n| denotes a number in the 27-ary number system. For clarity's sake each number |S_i| must be expressed by two digits, since otherwise a 1 followed by a 2 could be confused with 12. Hence 1 must be written as 01, 2 as 02, and so on.
Examples:

1. f(A) = 0.|A| = \frac{1}{27} = 27^{-1}

2. f(AZZZZZZ...) = \frac{1}{27} + \sum_{i=2}^{\infty} \frac{26}{27^i} = \frac{1}{27} + \frac{1}{27} = \frac{2}{27} = 0.|B| = f(B)

This second example shows that, in the same way as we identify 0.19999... with 0.2 \in \mathbb{R}, AZZZZZZ... is identified with B.

3. f(ZZZZ...) = \sum_{i=1}^{\infty} \frac{26}{27^i} = \frac{26}{27} \cdot \frac{1}{1 - \frac{1}{27}} = 1

4. Note that f(00...0*000...) = \frac{rank(*)}{27^i}, where * denotes any symbol from the alphabet, placed after (i-1) zeros. Of course, such a string does not represent a real name. Clearly, the limit of this expression, for i \to \infty, is zero.

5. The image of any real name belongs to [0,1] \cap \mathbb{Q}.
The function f, restricted to 'real names' (the subset of finite concatenations of the form S_1...S_n), is an injection. Indeed, let
f(S_1 S_2 ... S_n) = f(T_1 T_2 ... T_m)

Hence,

\sum_{i=1}^{n} \frac{|S_i|}{27^i} = \sum_{j=1}^{m} \frac{|T_j|}{27^j}

Assume now that S_1...S_n \ne T_1...T_m (this is: assume that f is not an injection). Let k \in \{1,...,\min(m,n)\} be the first rank for which S_k \ne T_k. There is no loss of generality in assuming that |S_k| \ge |T_k| + 1. Hence

\frac{|S_k|}{27^k} - \frac{|T_k|}{27^k} \ge \frac{1}{27^k}     (VII.21)

But, since the first k - 1 symbols agree and the f-values are equal, we must have

\frac{|S_k|}{27^k} - \frac{|T_k|}{27^k} = \sum_{j=k+1}^{m} \frac{|T_j|}{27^j} - \sum_{j=k+1}^{n} \frac{|S_j|}{27^j} \le \sum_{j=k+1}^{m} \frac{26}{27^j} < \sum_{j=k+1}^{\infty} \frac{26}{27^j} = \frac{1}{27^k}

the strict inequality holding since m is finite. This is in contradiction with (VII.21) and the fact that f(S_1 S_2...S_n) = f(T_1 T_2...T_m). Hence the function f is an injection on the subset of 'real names'.
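A minimal computational sketch of the injection (VII.20) follows; the alphabet dictionary and the example names are illustrative choices, not taken from the text.

```python
# Sketch of the injection (VII.20): 27 symbols (0, A..Z), seed in [0,1].
ALPHABET = {ch: i + 1 for i, ch in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ")}
ALPHABET["0"] = 0                            # the added zero symbol

def seed(name: str) -> float:
    """f(S1 S2 ... Sn) = sum_i |S_i| / 27^i, with |S_i| the rank of S_i."""
    return sum(ALPHABET[ch] / 27**(i + 1) for i, ch in enumerate(name.upper()))

print(seed("A"))                             # 1/27 = 0.037037...
print(seed("B") - 2 / 27)                    # 0.0: f(B) = 2/27, cf. Example 2
print(seed("EGGHE") < seed("ROUSSEAU"))      # alphabetical order is preserved
```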
How can the actual occurrence of letters be taken into account? One suggestion is to use a telephone directory. This suggestion, however, only works for scientists 'speaking' the same language, because the use of letters differs between languages. Otherwise one needs an 'international' directory (perhaps that of a city such as New York or Los Angeles). Then a name listed on page 345 of 1,557 pages (this is just an example) would get a seed equal to 345/1,557 = 0.22158. Further refinements (using lines within a page) are possible. Averaging would be necessary for popular names.
Having determined a seed for an author, Proposition VII.4.2.1 then gives the probabilities for such an author to be at rank k. These distributions can hence be tested by using a group of scientists with the same surname (hence the same seed) in a field where alphabetic ranking of authors is customary. This is generally the case for pure mathematicians, logicians, statisticians and theoretical physicists (see Harsanyi (1993)).
This model is valid in all cases where a seed can be given, not just in the case of alphabetical name ordering. Indeed, there exist many ways and conventions for ranking co-authors (see Harsanyi (1993), Hoen, Walvoort and Overbeke (1998) or Feger (2000)). A seed can, e.g., be derived from the importance of the author. Indeed, assume that author ranking always occurs according to 'importance'. 'Importance' could then be calculated from the number of publications, the number of citations, or even the age of scientists (Liang, Kretschmer, Guo and Beaver (2001)).
Another suggestion is to calculate a seed from 'older' publications (calculating an average rank) and to use this to 'predict' the author rank distribution in 'newer' ones.
Note that a seed is always a number in the interval [0,1], so that observations must always be transformed to the unit interval in order to obtain a seed.
In the next subsection we will discuss an application of Lotka's function to the cumulative distribution of first-citations in a set of articles. It is one of the most intriguing applications of Lotka's law and, what is more, this application of Lotka's law also shows its explanatory value for the different forms of such a first-citation distribution, again showing that Lotka's power law is at the basis of many informetric regularities.
VII.5
THE FIRST-CITATION DISTRIBUTION IN LOTKAIAN INFORMETRICS
VII.5.1 Introduction
The fact that an article receives citations is an indication of its visibility and its use by other researchers. The importance of citation analysis is apparent from the many articles on this
subject, e.g. in the journals Scientometrics, Research Policy, ... . This visibility (in terms of the articles of a research group, in a certain time period) is indeed measured using citations, and these quantitative data are used by policy makers to determine the degree to which this research is subsidized. Important research tools in this matter are the citation indexes produced by the Institute for Scientific Information (ISI), founded by E. Garfield. They can be used, e.g., to determine the times at which an article receives its citations, and hence one can also determine, for each article in a field, the time of its first citation. Over the whole field one can then talk about the first-citation distribution, which (in its cumulative form) is the topic of this section. The first-citation distribution is the cumulative distribution of the time period between the publication of an article (in a field) and the time it receives its first citation.
Knowing first-citation times is very important since they measure "response times" and the speed of scientific communication in a field. The first-citation time t_1 is also very important for an article since, at this time, the article shifts its status from "unused" to "used", and the smaller t_1 is, the more we can say, in general, that the article under study is important and becomes visible early in the scientific world. One could also say (with Moed and Van Raan (1986) and Schubert and Glanzel (1986)) that t_1 is a measure of immediacy, although not directly related to the immediacy index defined by ISI in their Journal Citation Reports (JCR) - see also Egghe and Rousseau (1990a), where general aspects of citation analysis are treated as well.
The reason why we will deal with cumulative first-citation distributions is that they are easier to handle (yielding also simpler formulae - see further), it is more common to do so (cf. Rousseau (1994a) and Gupta and Rousseau (1999)) and, finally, cumulative processes are more important here since their limiting supremum value is the fraction \gamma of eventually cited articles; hence 1 - \gamma is the fraction of the articles that are never cited, an important indicator.
Although the first-citation distribution is important, there are not many papers studying it. In Glanzel (1992) and Glanzel and Schoepflin (1995) one studies ith Harmonic Mean Response Times, being the harmonic means of the time elapsed between the publication date and the date of the ith (i = 1, 2, ...) citation of the papers. Of course, i = 1 represents first-citation. Basically only Rousseau (1994a) develops a model for the first-citation distribution. His arguments are based on the definition of two differential
equations leading to two different models (the one not implied by the other). One model is based on the differential equation

\Phi'(t_1) = B e^{-p t_1} \left(1 - \Phi(t_1)\right)     (VII.23)

where \Phi(t_1) is the cumulative first-citation distribution and B and p are constants, leading to

\Phi(t_1) = 1 - e^{-k\left(1 - e^{-b t_1}\right)}     (VII.24)

where k and b are other constants.
This model fits well in situations where the first-citation data are concave in t_1, as is, e.g., the case for the data (derived by Rousseau) appearing in Motylev (1981) on references in the Russian scientific literature to Russian language library science periodicals. But, as is easy to see, model (VII.24) can only handle concave cases since \Phi''(t_1) < 0 always. There exist, however, cases where the first-citation process is S-shaped, i.e. starts in a convex way and then proceeds in a concave way, the two parts being separated by an inflection point. In fact, Rousseau himself points this out by collecting data on first citations of JACS articles in JACS (JACS = Journal of the American Chemical Society). This resulted in an S-shaped cloud of points (see Figs. 2 and 3 in Rousseau (1994a), or see further Fig. VII.4 where the data have been re-used) for which (VII.24) is unsuited. This led Rousseau to develop a second model, based on the differential equation
\Phi'(t_1) = B e^{-p t_1}\, \Phi(t_1)\left(1 - \Phi(t_1)\right)     (VII.25)

leading to

\Phi(t_1) = \frac{1}{1 + M e^{b e^{-p t_1}}}     (VII.26)
where M, b and p are constants. This function is capable of fitting the S-shaped cases very well, as can be seen in Rousseau (1994a).
We want to make the following remarks on these models. The work of Rousseau has some explanatory value since it is derived (in a mathematical way) from the differential equations (VII.23) and (VII.25). This technique is customary in sciences like physics and chemistry, giving indications about the dynamics of first-citation processes, but often such equations are merely expressions of widely accepted and experimentally verified properties. In equation (VII.23), the first-citation rate is proportional to the fraction of the (at that time) still uncited articles, but with a proportionality factor decreasing in t_1. In (VII.25), this rate is multiplied by \Phi(t_1) itself, for which a rationale seems to be missing.
The most important drawback of Rousseau's results is that neither of the two models is capable of modelling all existing first-citation relations: the first one is needed to model the concave cases while the second one is needed for the S-shaped cases. This leads to two different rationales (coming from the different differential equations (VII.23) and (VII.25)) for the two types of first-citation relations. This is not wrong in itself, but in this way we lack a rationale for the different behavior.
Last but not least, one can wonder whether the observed first-citation regularities cannot be explained using elementary informetric tools. For citation times we have the well-known aging distributions, the simplest one being the exponential aging distribution, see e.g. Egghe and Rao (1992a) or Subsection I.3.4.2, and for the number of citations we can use a size-frequency function of the power type (i.e. Lotka). This will be done in the next subsection: in other words, we will put the theory of first-citation distributions within Lotkaian informetrics. There we will also see that Lotka's exponent \alpha plays a crucial role in the explanation of the different shapes (concave or S-shaped) of the first-citation distribution.
In Subsection VII.5.3 we will then make explicit calculations of the model for the existing data sets, i.e. the ones that were already used in Rousseau (1994a) in the test of his models (VII.24) and (VII.26).
In Subsection VII.5.4 we will extend the first-citation distribution to the nth (n = 2, 3, ...) citation distribution and to general aging distributions.
VII.5.2 Derivation of the model
Let
c(t) = b a^t     (VII.27)

(b > 0, 0 < a < 1, a, b constants and t \ge 0) be the probability density function of the citations to an article, t time units after its publication. Since c is a probability density function we have

\int_0^{\infty} c(t)\, dt = -\frac{b}{\ln a} = 1     (VII.28)

(since 0 < a < 1), hence b = -\ln a > 0. If an individual article in the bibliography (or field) receives M citations in total, then

n(t) = -(\ln a)\, M\, a^t     (VII.29)
describes the number of citations to this article, t time units after publication. The distribution of the M-values will be taken to be Lotkaian:

\varphi(M) = \frac{C}{M^{\alpha}}     (VII.30)

for M \in [1,+\infty[, hence with maximal item density \rho_m = \infty as in Subsection II.2.1.1. We will take \alpha > 1. This implies that, since \varphi is a probability density function,
\int_1^{\infty} \varphi(M)\, dM = \int_1^{\infty} \frac{C}{M^{\alpha}}\, dM = \frac{C}{\alpha-1} = 1     (VII.31)

(VII.31) in (VII.30) gives
\varphi(M) = \frac{\alpha-1}{M^{\alpha}}     (VII.32)
Note that we do not only cover the case of Proposition II.2.1.1.1 (\alpha > 2, hence A < \infty) but also the case 1 < \alpha \le 2, where we have A = \infty but T < \infty, hence \mu = A/T = \infty; this is no objection to this model: only \alpha > 1 is needed for the existence of \varphi as in (VII.32). We have the following remarkable result.
Theorem VII.5.2.1 (Egghe (2000c)): If we have an exponential aging function as in (VII.29) (supposed to be the same for each article) and a Lotka size-frequency distribution as in (VII.32), then

\Phi(t_1) = \gamma \left(1 - a^{t_1}\right)^{\alpha-1}     (VII.33)

where \Phi is the cumulative first-citation distribution (over all articles), t_1 \ge 0, and \gamma is the fraction of ever cited articles.
Proof: Let the article have M citations in total. Then the time t_1 at which this article receives its first citation is defined by

\int_0^{t_1} n(t)\, dt = 1

This yields:

M = \frac{1}{1 - a^{t_1}}     (VII.34)
Since \varphi(M) is the fraction (among the at least once cited articles) of the articles with M citations, \gamma times the fraction of the articles with M (given by (VII.34)) or more citations will have their first citation at t \le t_1 (by (VII.29)). Hence the cumulative first-citation distribution, \Phi(t_1), is given by (using (VII.32))

\Phi(t_1) = \gamma \int_M^{\infty} \frac{\alpha-1}{x^{\alpha}}\, dx = \gamma M^{1-\alpha}     (VII.35)

with M replaced as in (VII.34). This yields, since \alpha > 1:

\Phi(t_1) = \gamma \left(1 - a^{t_1}\right)^{\alpha-1}

We have the following important property of \Phi.
Theorem VII.5.2.2 (Egghe (2000c)): The cumulative first-citation distribution \Phi is concave if and only if 1 < \alpha \le 2, and is S-shaped (i.e. first convex, then concave) if and only if \alpha > 2.
Proof: It is easy to see that \Phi'(t_1) > 0 for all t_1 > 0 and that \Phi''(t_1) = 0 if and only if a^{t_1} = \frac{1}{\alpha-1}, i.e. if and only if

t_1 = \frac{\ln\left(\frac{1}{\alpha-1}\right)}{\ln a}     (VII.36)

If 1 < \alpha \le 2, then (\alpha-1)a^{t_1} < 1 for all t_1 > 0, so that \Phi''(t_1) < 0 everywhere and \Phi is concave. If \alpha > 2, then the value in (VII.36) is positive; any value of t_1 below it yields \Phi'' > 0 and any value of t_1 above it yields \Phi'' < 0, so that \Phi is S-shaped with (VII.36) as inflection point. Note also that \Phi(0) = 0 and

\lim_{t_1 \to \infty} \Phi(t_1) = \gamma     (VII.37)

as it should be (since we consider \Phi with respect to all (also the never cited) articles).
That an S-shaped curve occurs only for large values of \alpha, and that the S-shape becomes more and more apparent the larger \alpha is (as is clear from (VII.36)), is intuitively clear. The higher \alpha, the more inequality we have (in this case between the different total numbers of citations per article) - see Corollary IV.3.2.1.5 - so, relatively speaking, there are fewer articles with a large number M of citations. In other words, using (VII.34), there are not many cases of low values of t_1, and hence the first-citation process starts in a convex way, meaning that \Phi increases very slowly. This remark makes clear the involvement of Lotka's \alpha in the study of the first-citation process, besides the aging rate a, whose involvement in the first-citation process is much more evident.
The model (VII.33) is easy to use in practical examples. We will try to find the parameters in (VII.33) that are best in the sense of nonlinear regression, given a set of data. We can estimate \gamma from the data, hence using (VII.33) as a 2-parameter model (in a and \alpha). Of course, we can also consider (VII.33) as a 3-parameter model where \gamma too is determined by the optimization process.
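Such a nonlinear-regression fit can be sketched as follows; the 'true' parameter values, the noise level and the time grid are illustration choices (with real data, t and the data vector would be the observed first-citation times and cumulative fractions), and scipy's curve_fit is one possible optimizer, not necessarily the one used in the original study.

```python
# Sketch: 3-parameter nonlinear fit of (VII.33) on synthetic data.
import numpy as np
from scipy.optimize import curve_fit

def first_citation(t, a, alpha, gamma):
    """Cumulative first-citation distribution (VII.33)."""
    return gamma * (1.0 - a**t)**(alpha - 1.0)

t = np.linspace(0.5, 48, 96)                       # e.g. biweekly observations
true = first_citation(t, 0.95, 3.5, 0.68)          # illustrative S-shaped case
data = true + np.random.default_rng(0).normal(0, 0.004, t.size)

(a, alpha, gamma), _ = curve_fit(first_citation, t, data,
                                 p0=(0.9, 2.0, 0.5),
                                 bounds=([0, 1, 0], [1, 10, 1]))
print(a, alpha, gamma)        # recovered parameters; fix gamma for a 2-parameter fit

# For alpha > 2 the fitted curve is S-shaped, with inflection point (VII.36):
if alpha > 2:
    print(np.log(1.0 / (alpha - 1.0)) / np.log(a))
```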
VII.5.3 Testing of the model
We will present two examples: one where the empirical data imply that \Phi is entirely concave, and one where the empirical data imply that \Phi is S-shaped. This way we can test all aspects of the model.
VII.5.3.1 First Example: Motylev (1981) data
These citation data concern references in the Russian scientific literature to Russian language library science periodicals, published by Motylev (1981). The data were re-used in Rousseau (1994a). We will use them here for testing our model. The data are presented in Table VII.1.
Note that in this case only about 30% of all articles are cited. Here a substantial improvement occurred when using a 3-parameter fit instead of a 2-parameter fit. We obtained a = 0.956, \alpha = 1.746, \gamma = 0.486. The latter value is far above 0.303, which is plausible from visual inspection of the graph: after 16 years the graph is far from horizontal; hence adding further years
will result in more cited papers (increasing the low number of 30%). The calculated model is shown in Fig. VII.3. The fit is very good, giving an R^2 (in nonlinear regression) of over 0.97.
Table VII.1  First-citation data of Motylev.
Reprinted from Egghe (2000c), Table 2, p. 353. Reproduced with kind permission of Springer Science and Business Media.

Year | Number of articles cited for the first time | Fraction of column 2 | Cumulative fraction of first-citations
1  | 10 | 0.018 | 0.018
2  | 33 | 0.060 | 0.078
3  | 18 | 0.033 | 0.111
4  | 17 | 0.031 | 0.142
5  | 14 | 0.026 | 0.168
6  |  4 | 0.007 | 0.175
7  |  0 | 0.000 | 0.175
8  | 10 | 0.018 | 0.193
9  |  7 | 0.013 | 0.206
10 |  7 | 0.013 | 0.219
11 |  8 | 0.015 | 0.234
12 |  9 | 0.016 | 0.250
13 |  4 | 0.007 | 0.257
14 | 13 | 0.024 | 0.281
15 |  6 | 0.011 | 0.292
16 |  6 | 0.011 | 0.303
Fig. VII.3
Three-parameter fitting of the Motylev data.
Reprinted from Egghe (2000c), Fig. 2, p. 354. Reproduced with kind permission of Springer Science and Business Media.
VII.5.3.2 Second Example: JACS to JACS data of Rousseau
These data were collected by Rousseau in JACS (Journal of the American Chemical Society) for the year 1975. The first citations to these articles were followed in JACS during a 4-year period (in biweekly issues). Since the table is long (see Rousseau (1994a) or Egghe (2000c)) we do not present it here. It follows from this table that, after 4 years, 67.6% of the articles are cited. Hence we use \gamma = 0.676 in a 2-parameter fit (the 3-parameter fit was very similar to the 2-parameter one, due to the fact that the data are more complete than in the previous example: for large t_1, the graph is nearly horizontal). We obtain a = 0.955, \alpha = 3.641 with an R^2 (in nonlinear regression) of 0.999. See Fig. VII.4, where the S-shape is clear and the model fits very well.
Note that the high value of \alpha = 3.641 is typical in the sciences (genetics, chemistry, ...) and is higher than in e.g. library science, as considered in the previous example (there \alpha = 1.746). Note (Chapters IV, V) that high values of \alpha imply high levels of concentration (inequality) and of fractal complexity. This is another way of interpreting "hard" sciences!
Fig. VII.4
Two-parameter fitting of the Rousseau data.
Reprinted from Egghe (2000c), Fig. 3, p. 355. Reproduced with kind permission of Springer Science and Business Media.
VII.5.4 Extensions of the first-citation model
If we look at the exponential aging distribution (VII.27), hence with b = -\ln a (see (VII.28)), we see that its cumulative form equals:

C(t) = \int_0^t (-\ln a)\, a^{t'}\, dt' = 1 - a^t     (VII.38)

and hence the cumulative first-citation distribution (VII.33) can be written as

\Phi(t_1) = \gamma \left(C(t_1)\right)^{\alpha-1}     (VII.39)
Denote \Phi^* = \frac{\Phi}{\gamma}, which is the cumulative first-citation distribution where one only considers the population of ever cited articles. Note that now, because of (VII.37), we have \lim_{t_1 \to \infty} \Phi^*(t_1) = 1. We now have

\Phi^*(t_1) = \left(C(t_1)\right)^{\alpha-1}     (VII.40)
This result will be extended to general aging distributions C and to \Phi^*_n, the cumulative nth citation distribution, as follows.
Theorem VII.5.4.1 (Egghe and Rao (2001)): Let \Phi^*_n be the cumulative nth citation distribution, i.e. the distribution of the times at which an article receives its nth citation, where we only consider the articles that, eventually, receive n citations. Let C be the cumulative aging distribution, i.e. the distribution of the times of the citations to an article, supposed to be the same for all articles. Let the distribution of the number of citations received be given by (VII.30) (Lotkaian). Then we have

\Phi^*_n(t) = \left(C(t)\right)^{\alpha-1}     (VII.41)
Proof: The nth citation is received at the time t for which (M = total number of citations to an article)

M\, C(t) = n     (VII.42)

and at that time all articles with M' \ge M citations evidently (since M' C(t) \ge n) have at least n citations, hence received their nth citation earlier. Their fraction (among the at least once cited articles) is

\int_M^{\infty} \frac{\alpha-1}{x^{\alpha}}\, dx = M^{1-\alpha}     (VII.43)

Hence (VII.42) and (VII.43) imply that

\left(\frac{C(t)}{n}\right)^{\alpha-1}     (VII.44)
is the cumulative nth citation distribution, conditional with respect to the set of at least once cited articles (because we do not use \gamma here as we did in Theorem VII.5.2.1). For obtaining the distribution \Phi^*_n it suffices to calculate

\lim_{t \to \infty} \left(\frac{C(t)}{n}\right)^{\alpha-1} = \left(\frac{1}{n}\right)^{\alpha-1}     (VII.45)

Hence (VII.44), divided by (VII.45), is the cumulative nth citation distribution, conditional with respect to the set of (eventually) n times cited articles. So

\Phi^*_n(t) = \left(C(t)\right)^{\alpha-1}
Note that \Phi^*_1 = \Phi^*, by notation. We have the following remarkable corollary.

Corollary VII.5.4.2:
1. \Phi^*_n = \Phi^*_1 for all n \in \mathbb{N}
2. If \alpha = 2 then \Phi^*_n = C, i.e. the cumulative nth citation distribution equals the cumulative citation distribution.
Proof: This follows readily from the previous theorem. The previous results allow researchers to check nth citation models for aging functions other than the exponential one. This was done in Egghe and Rao (2001) using the lognormal distribution as aging function c. The results are good but not better than those of Egghe (2000c), and the calculations are, of course, more intricate (note also that the lognormal distribution has 2 parameters while the exponential one has only 1; it is always better to use as few parameters as possible, if good fits can be obtained).
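Theorem VII.5.4.1 (and part 1 of the corollary) can be illustrated by simulation: the empirical distribution of the nth citation time, among articles eventually cited at least n times, agrees with (C(t))^{\alpha-1} independently of n. In the following sketch the values of \alpha, a, n and the sample size are illustration choices; the nth citation time is computed from the deterministic relation (VII.42), as in the proof.

```python
# Simulation sketch of Theorem VII.5.4.1 with C(t) = 1 - a^t.
import numpy as np

rng = np.random.default_rng(1)
alpha, a, n = 3.0, 0.9, 3

M = rng.pareto(alpha - 1, 200_000) + 1     # Lotkaian citation totals (VII.30)
M = M[M > n]                               # articles eventually cited n times
t_n = np.log(1 - n / M) / np.log(a)        # n-th citation time: M * C(t) = n

for t in (5.0, 10.0, 20.0):
    print(t, np.mean(t_n <= t), (1 - a**t)**(alpha - 1))   # empirical vs (VII.41)
```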
In Egghe and Rao (2002b) the same study was done, but for most-recent references instead of first citations: the dual aspect. The advantage of this is that it is easier to collect data: one can collect reference data from each journal article, while for the first-citation data the citation indexes of ISI are needed. In addition, following earlier work of Stinson (1981) and Stinson and Lancaster (1987), one can expect the same results when studying references (synchronous study) as when studying citations (diachronous study).
These assumptions were tried out in Egghe and Rao (2002b) and confirmed: most-recent-reference distributions, using (VII.33) (but with \gamma = 1) and using c as a synchronous aging distribution, were tested on most-recent-reference data from JASIS (Journal of the American Society for Information Science) and from JACS (Journal of the American Chemical Society): the results are similar to those in Egghe (2000c), also with very good fits of the obtained models.
VII.6
ZIPFIAN THEORY OF N-GRAMS AND OF N-WORD PHRASES: THE CARTESIAN PRODUCT OF IPPs
VII.6.1 N-grams and N-word phrases
N-grams are strings of N letters; N-word phrases are strings of N words. In this sense both objects can be compared: they belong to the N-fold Cartesian product of the space of letters (for N-grams) or of the space of words (for N-word phrases). Symbolically, N-grams and N-word phrases can be written as a vector (a_1, a_2, ..., a_N), where each a_i is an arbitrary letter or an arbitrary word.
It is clear that N-word phrases are very important: they form the basis of linguistic expression and allow for more complex ideas than the single words on their own. This, in turn, implies that their use in information retrieval (IR) is basic in the refinement of searches: if the 2-word phrase "electric vehicle" is used in a query (supposing truncation at the end of each word), then this query is superior to the Boolean input "electric AND vehicle", in which one not only receives information on electric vehicles (as wanted) but also on electricity in vehicles, which is a totally different subject and which, most probably, yields the majority of the retrieved articles. Hence, pre-coordinative indexing techniques for such N-word phrases (concepts) and post-coordinative retrieval techniques with proximity operators are basic in any literature database construction (see e.g. Salton and McGill (1987)).
N-grams are less known, both as a concept and in their use. Let us indicate some elements here. N-grams can be generated from words or, more generally, from texts (i.e. consisting of several words). They are generated (from a word) as follows. Let us consider the word SUBSTRING. From this word one can generate the following 2-grams:
*S, SU, UB, BS, ST, TR, RI, IN, NG, G*
or the following 3-grams:
**S, *SU, SUB, UBS, BST, STR, TRI, RIN, ING, NG*, G**
These are examples of "redundant" N-grams (Willett (1979)), in which one shifts from left to right by 1 symbol and where one completes the beginning and ending N-grams. These can be compared with non-redundant N-grams (e.g. for N = 3: SUB STR ING). The advantage of redundant N-grams over non-redundant ones is that all letters in a word (or text) appear in every place i = 1,...,N of an N-gram, and hence that the overall occurrence of letters in redundant N-grams is independent of the given place i \in \{1,...,N\} (this is also why, at the beginning and at the end, one makes use of the asterisk *). Since this will be a crucial property in our models (to be developed in this section), it is clear that we will, henceforth, only consider redundant N-grams. Also in applications (which we will describe now) redundant N-grams are the most important. We refer to Cohen (1995), Damashek (1995), Robertson and Willett (1998), Grossman and Frieder (1998) (Section 3.4), Yannakoudakis, Tsomokos and Hutton (1990) and Nelson and Downie (2001) for a more elaborate description of the applications of N-grams, which we will briefly indicate now.
The basic idea behind N-grams is that, for a word or a text, one creates a sequence of N-grams which contains more information than the word in itself: e.g. similar words (e.g. woman, women) are different, but their N-gram patterns are very similar. This similarity can be expressed in different ways, e.g. by constructing, for each word or text, a vector in which each coordinate is linked with one N-gram (hence the length of this vector equals the total number of different N-grams) and the coordinate value equals the total number of times this N-gram occurs in the word or the text. In this way two words (or texts) can be compared using vector similarity measures such as (e.g.) the cosine measure of Salton (see e.g. Salton and McGill (1987), Blair (1990)); a small computational sketch of this construction follows after the list below. This gives the following opportunities of application.
1. In indexing: documents can be clustered in the indexing process by using a similarity measure for their titles (clustering is done using multivariate analysis techniques - see e.g. Egghe and Rousseau (1990a)). Also, from the N-gram pattern of a document one can automatically generate a series of key words ("highlights", "stems") for indexing purposes. This technique generalizes stemming using truncation (e.g. "compression" and "textcompression" will be similar by using N-grams, not by using truncation).

2. In IR: via given key words one generates similar key words using the N-gram technique, with which one can expand the query. This too can be considered as a generalization of truncation. N-grams (or more general search keys) can also be used in catalographic retrieval - see Subsection III.1.3.2.

3. Error detection and correction: if there is an error in a word, its N-gram pattern is very similar to that of the correct word, and hence the system can propose a correction. This is also applied to the correction of surnames.

4. Compression of texts: unused byte-codes, reserved for 1-grams, can be linked with frequently occurring N-grams (N > 1).

5. Identification of languages: different languages have different N-gram patterns.

6. Authorship determination: different authors have characteristic N-gram patterns in their texts.

7. Classification of subjects: as for languages and authors, different subjects have characteristic N-gram patterns.

8. Speech recognition: here one uses N-grams of words (i.e. N-word phrases) instead of N-grams of letters.

9. Indexation and retrieval of music: in monophonic music one delimits intervals indicating the change (up or down) from one tone to another. N-grams are formed by these intervals (instead of letters) as building blocks.
N-grams are extremely useful (but not exclusively!) in Asian languages because of their special structure (truncation is not efficient there). Note also that all these techniques are language and subject independent (in contrast with many indexing and IR techniques).
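The following sketch illustrates the redundant N-gram construction and the vector/cosine comparison described above; the helper names are hypothetical, and '*' pads the beginning and end as in the SUBSTRING example.

```python
# Sketch: redundant N-grams and cosine similarity of N-gram vectors.
from collections import Counter
from math import sqrt

def redundant_ngrams(word: str, n: int) -> list:
    padded = "*" * (n - 1) + word + "*" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[g] * v[g] for g in u)
    return dot / (sqrt(sum(c * c for c in u.values())) *
                  sqrt(sum(c * c for c in v.values())))

print(redundant_ngrams("SUBSTRING", 2))   # ['*S', 'SU', ..., 'NG', 'G*']
a = Counter(redundant_ngrams("WOMAN", 2))
b = Counter(redundant_ngrams("WOMEN", 2))
print(cosine(a, b))                       # high similarity despite the spelling change
```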
This should convince the reader that N-grams and N-word phrases are important tools in information science. Hence their informetric properties should be revealed. It is, however, clear that N-grams and N-word phrases consist of N objects, each of which also has informetric properties. Indeed, N-word phrases consist of N words whose informetric (linguistic) properties are covered by Zipf's law and hence, by Section II.4, are of Lotkaian nature. We will show in Subsection VII.6.3 that letters, building N-grams, can also be supposed to satisfy Zipf's law, certainly in cases where there are many letters, as in Asian languages. We can now wonder whether, from Zipf's law for words or letters, we can deduce a rank-frequency function for N-word phrases and N-grams, by interpreting them as "Cartesian products" of the single objects. This will also be done in Subsection VII.6.3, while in Subsection VII.6.4 we will deduce the size-frequency function derived from this
rank-frequency function (using the general relations in Chapter II). This size-frequency function will then, in Subsection VII.6.5, be used to calculate the average number \mu_N of N-gram or N-word phrase occurrences per N-gram (N-word phrase) type, as well as its Type/Token-Taken variant \mu^*_N (cf. Chapter III); a comparison between \mu_N and \mu^*_N for various values of N and Lotka's exponent \alpha (from the size-frequency function for single letters or words) is made. In the next Subsection VII.6.2 we will present an argument for 2-word phrases, based on Mandelbrot's argument for words in random texts: i.e. we will extend the argument in Subsection I.3.5 to 2-word phrases. This has interest in itself but will also yield material to compare with the results of Subsection VII.6.3, which will be far more general. Since the arguments of VII.6.3 do not depend on those of Subsection VII.6.2, the reader can skip VII.6.2 without any problem. We have added VII.6.2 since it extends Mandelbrot's argument and since its result supports the one obtained in Subsection VII.6.3.
VII.6.2 Extension of the argument of Mandelbrot to 2-word phrases
We will use the same tools and notation as in Subsection I.3.5. The results are from Egghe (1999a). We suppose we have an alphabet of N letters, which have an equal chance of occurrence. Words are formed with these letters and are separated by a blank. We put p for the probability of occurrence of each letter. Since there are also blanks, we have (cf. I.3.5) p < \frac{1}{N}. The probability of a 2-word phrase consisting of k letters (in total) hence is

P_2(k) = p^k (1 - Np)^2     (VII.46)

(1 - Np is the probability of a blank). Since this is decreasing in k, we can estimate the rank r of this phrase as follows (cf. the similar argument given in Subsection I.3.5):

N^2 + 2N^3 + ... + (k-2)N^{k-1} \le r \le N^2 + 2N^3 + ... + (k-1)N^k     (VII.47)
(since, for every i, there are (i-1)N^i possible 2-word phrases consisting of i letters in total: the blank can indeed be at i-1 places). A long calculation, only involving sums of geometric series, yields

N + 2N^2 + ... + iN^i = \frac{N}{(N-1)^2}\left(iN^{i+1} - (i+1)N^i + 1\right)     (VII.48)

for all i. Applied in (VII.47) this yields

(k-2)N^{k-1} - (k-1)N^{k-2} \le \left(\frac{N-1}{N}\right)^2 r - 1 \le (k-1)N^k - kN^{k-1}     (VII.49)
To fix r, we take the average of both sides:

\left(\frac{N-1}{N}\right)^2 r - 1 \approx \frac{1}{2}\left[(k-2)N^{k-1} - (k-1)N^{k-2} + (k-1)N^k - kN^{k-1}\right] = \frac{1}{2} N^{k-2}\left[(N^2-1)(k-1) - 2N\right]

\left(\frac{N-1}{N}\right)^2 r - 1 \approx \frac{1}{2} N^{k-2} (N^2-1)(k-1)     (VII.50)

for large r (hence k). This is the (average) rank of 2-word phrases consisting of k letters in total. From (VII.46) it follows that

P_2(r) = p^k (1-Np)^2 = p^{k-1}\, p\, (1-Np)^2

Hence
k - 1 = \frac{\ln\left(\frac{P_2(r)}{p(1-Np)^2}\right)}{\ln p}     (VII.51)
Formula (VII.51) in (VII.50) yields

\left(\frac{P_2(r)}{p(1-Np)^2}\right)^{\frac{\ln N}{\ln p}} \ln\left(\left(\frac{P_2(r)}{p(1-Np)^2}\right)^{\frac{\ln N}{\ln p}}\right) = \frac{2(\ln N)(N-1)}{N(N+1)}\, r     (VII.52)

for large r. So

\left(\frac{P_2(r)}{p(1-Np)^2}\right)^{\frac{\ln N}{\ln p}} = \xi(\gamma r)     (VII.53)
with

\gamma = \frac{2(\ln N)(N-1)}{N(N+1)}

a fixed number, and where \xi is the inverse of the function x \to x \ln x. We finally have
P_2(r) = \frac{M}{\left(\xi(\gamma r)\right)^{\beta}}     (VII.54)

where

M = p(1-Np)^2
and

\beta = -\frac{\ln p}{\ln N}     (VII.55)
Note that we refind formula (I.58) for \beta equal to \frac{1}{D_s}, where D_s is the fractal dimension of the text consisting of single words (see also formulae (V.17) and (V.18)). It is not easy to express the fractal dimension of the text consisting of 2-word phrases, since (VII.54) is not a power law: the exponent \beta acts on \xi(\gamma r), where \xi is the inverse of the function x \to x \ln x. This is left as an open problem.
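Numerically, (VII.54) is easy to evaluate: one only has to invert x \mapsto x \ln x, e.g. with Newton's method. In the following sketch the alphabet size N and letter probability p are illustration choices subject to p < 1/N.

```python
# Sketch: evaluating (VII.54)-(VII.55) by inverting x -> x ln x.
import math

def xi(c, x0=2.0, iters=60):
    """Solve x * ln(x) = c for x >= 1 (Newton's method; x ln x is convex)."""
    x = x0
    for _ in range(iters):
        x -= (x * math.log(x) - c) / (math.log(x) + 1.0)
    return x

N, p = 26, 0.03                            # illustration: 26 letters, p < 1/N
M = p * (1 - N * p)**2                     # constant of (VII.54)
beta = -math.log(p) / math.log(N)          # exponent (VII.55)
gamma = 2 * math.log(N) * (N - 1) / (N * (N + 1))

for r in (10, 100, 1000):
    print(r, M / xi(gamma * r)**beta)      # P_2(r), decreasing in r
```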
The above argument has been given for historical reasons and since we will be able to compare the result with the ones in Subsection VII.6.3, but it has many pitfalls:

(i) it is only valid for N = 2 (i.e. 2-word phrases), although an extension to general N is feasible, but intricate,
(ii) it only works for N-word phrases, not N-grams,
(iii) the argument uses many approximations (for large ranks r),
(iv) it supposes that all letters have the same probability,
(v) it supposes that all N-word phrases occur and (because of (iv)) in equal quantities,
(vi) it supposes that letters follow independently in words.
So the above cannot be considered as a definitive treatment of the rank distribution of N-grams and N-word phrases. In the next subsection we will give a general treatment in which the above pitfalls do not occur anymore (except for (vi): independence will still be supposed, and this is to be considered as a simplification of an otherwise too intricate problem).
VII.6.3 The rank-frequency function of N-grams and N-word phrases based on Zipf's law for N = 1
In this subsection we will determine the rank-frequency function for general N-grams and N-word phrases (N = 2, 3, 4, ...), given the rank-frequency function for letters (i.e. 1-grams) and for words (i.e. 1-word phrases). In this formulation the pitfalls (i), (ii) of the previous argument (Subsection VII.6.2) are removed. Also (iii) will not be a problem since we will derive exact results (no approximations). Of course, as is always done in the theories of this book, we will work within the continuous framework (Chapter II).
As to (iv), and for N-word phrases, it is clear that we can suppose Zipf's law for single-word occurrences. The reasons have been explained before: Zipf's law is a power law, a very common distribution in linguistics (Chapter I), and is fully a part of Lotkaian informetrics (Section II.4), because of Subsection II.4.2. Note that the Zipf function to use has the form (denoting it g_1 instead of g, to express that it is the rank-frequency function for words, i.e. 1-word phrases):

g_1(r) = \frac{E}{(1+r)^{\beta}}     (VII.56)

r \in [0,T], where \beta, E > 0 are constants (see Subsection II.4.2). By Proposition II.1.2.2 and the properties of the function U we have that

\int_0^T g_1(r)\, dr = A     (VII.57)

So we have that

P_1(r) = \frac{E/A}{(1+r)^{\beta}}     (VII.58)

is the rank-frequency density function for words. We will also substitute r' = 1 + r (and then replace r' by r) so that we have
P_1(r) = \frac{D}{r^{\beta}}     (VII.59)

with r \in [1, T+1] (cf. formula (II.89)) and D = \frac{E}{A}.
As to (iv) for N-grams, we would like to use (VII.59) also for the occurrence of letters. In Egghe (2000a) we checked 4 data sets of letter occurrence. Here we present two of them. The first are data of Dewey (as presented in Heaps (1978), p. 200) on the occurrence of the 26 letters in English. The graph is given in Fig. VII.5 (see also Egghe (2000a)).
Although not perfect, we can use (VII.59) as a first approximation. We have already noted the importance of N-grams for Asian languages, and also in this context we have data: we used the website of Beckman (see Beckman (1999)) on the occurrence of more than 13,000 Chinese symbols. Fig. VII.6 gives the graph of the occurrence of the 100 most frequent Chinese symbols (see also Egghe (2000a)).
It is clear that these data are very close to a power law. In conclusion: also for the study of N-grams we can and will use the power law (VII.59).
Using (VII.59) also alleviates pitfall (v) a lot. The skew distribution of single words and of letters, as expressed by (VII.59), is also decisive for which N-word phrases or N-grams are more popular (occur more often) and which are not.
Finally we can discuss (vi). To reduce the complexity of the problem we will suppose that letter occurrence in N-grams is independent of the place in the N-gram. Similarly, we will suppose that the word distribution in N-word phrases is independent of the place of the word in the N-word phrase. In symbols this means that, for every i = 1,...,N-1, we have

P(r_i \mid r_{i+1},...,r_N) = P_1(r_i)     (VII.60)

(and we put P(r_N) = P_1(r_N)), i.e. the probability to have a letter (for N-grams) or a word (for N-word phrases) with rank r_i on the ith place in the N-gram (N-word phrase), given that on the jth place
(j = i+1,...,N) we have a letter or word with rank r_j, is simply P_1(r_i) (as in (VII.59)), unconditional with respect to the letters or words occurring on the places i+1,...,N. Note that (VII.60) also implies that the same rank-frequency distribution applies to each place i, which is a natural assumption for N-word phrases and which is clearly true for N-grams since we use redundant N-grams: by construction, the same letter distribution applies on each place.
The independence assumption is certainly not satisfied in practical examples, but it is nevertheless worthwhile to develop a model in this case because it can serve as a first approximation of real examples. In fact, as indicated in Egghe (2000b), dropping the independence assumption leads to non-analytic formulae for the rank-frequency function of N-word phrases and N-grams. Of course, with the above described generalizations we reach a much more general model than the one used in Subsection VII.6.2.
Fig. VII.5
Letter occurrence, Dewey data (English).
Reprinted from Egghe (2000a), Fig. 1, p. 241. Reproduced with kind permission of Kluwer Academic Publishers.
Fig. VII.6
Occurrence of the 100 most frequently used Chinese characters.
Reprinted from Egghe (2000a), Fig. 3b, p. 244. Reproduced with kind permission of Kluwer Academic Publishers.

We have the following theorem.
Theorem VII.6.3.1 (Egghe (2004f)): Let N \in \mathbb{N} be fixed and assume (VII.59) to be valid. Denote by P_N(r) the rank-frequency probability density function of N-word phrases or N-grams. Then r \in [0, T^N] and

P_N(r) = \frac{D^N}{\left(\xi_N^{-1}\left(r + (-1)^{N-1}\right)\right)^{\beta}}     (VII.61)

where \xi_N^{-1} denotes the inverse of \xi_N and where \xi_N is the function

\xi_N(y) = \sum_{i=0}^{N-1} (-1)^{N-1-i}\, \frac{y \ln^i(y)}{i!}     (VII.62)

and \ln^i(y) = \ln(y) \cdots \ln(y) (i times), the ith power of \ln(y).
Proof: Since ranks are determined by (decreasing) productivity we have that x = P_N(r), where

r = vol\{(r_1,...,r_N) \mid P(r_1,...,r_N) \ge x\}     (VII.63)

where P(r_1,...,r_N) denotes the probability of occurrence of an N-gram or N-word phrase for which the ith letter (respectively word) has rank r_i (in the single occurrence), i = 1,...,N. Here vol(S) denotes the volume of the N-dimensional set S. Now, by the definition of conditional probability density (cf. Grimmett and Stirzaker (1985), p. 61), repeatedly used:

P(r_1,...,r_N) = P(r_1 \mid r_2,...,r_N)\, P(r_2,...,r_N)
= P(r_1 \mid r_2,...,r_N)\, P(r_2 \mid r_3,...,r_N)\, P(r_3,...,r_N)
= P(r_1 \mid r_2,...,r_N)\, P(r_2 \mid r_3,...,r_N) \cdots P(r_{N-1} \mid r_N)\, P(r_N)
= P_1(r_1)\, P_1(r_2) \cdots P_1(r_{N-1})\, P_1(r_N)

by (VII.60). So, by (VII.63), we have

r = vol\{(r_1,...,r_N) \mid P_1(r_1)\, P_1(r_2) \cdots P_1(r_N) \ge x\}     (VII.64)

with x = P_N(r), x \in [0,1].
Note that, because of (VII.56) and (VII.59), the real ranks (in the sense of the informetric theory of Chapter II) r_i should be lowered by 1 but, in (VII.59), we can work with r_i \in [1, T+1] itself: the set S is only a translation of the rank N-tuples (r_1 - 1,..., r_N - 1) over the vector (1,...,1) (N coordinates), so that the volume is the same. Hence we can use the r_i's themselves in (VII.64). Note, however, that r itself denotes the real rank of N-grams or N-word phrases (in the sense of Chapter II). Indeed, let there be T letters (in case of N-grams) or T words (in case of N-word phrases, cf. (VII.56)); then r \in [0, T^N] and r = T^N is obtained for x = 0, the set S being S = [1, T+1]^N, which has volume T^N; and r = 0 is obtained for x = D^N, since then vol(S) = 0 for the following reason: using (VII.64) we have

P_1(r_1)\, P_1(r_2) \cdots P_1(r_N) \ge x = D^N

But, by (VII.58) and (VII.59), each P_1(r_i) \le D. So

P_1(r_i) = D     (VII.65)

for every i = 1,...,N. From (VII.58) and (VII.59) this implies

r_i = 1     (VII.66)

for every i = 1,...,N. Hence S = \{(1,1,...,1)\}, a singleton in \mathbb{R}^N, and hence vol(S) = 0.
The inequality

P_1(r_1)\, P_1(r_2) \cdots P_1(r_N) \ge x

leads to, using (VII.59),

\frac{D^N}{(r_1 r_2 \cdots r_N)^{\beta}} \ge x     (VII.67)

hence

r_1 r_2 \cdots r_N \le \left(\frac{D^N}{x}\right)^{\frac{1}{\beta}} =: a     (VII.68)

by notation of a, for reasons of simplicity. Formula (VII.68) implies

1 \le r_1 \le \frac{a}{r_2 \cdots r_N}     (VII.69)
This gives us the range of possible r_1's, but dependent on the different r_2's,...,r_N's that are possible. These will be determined now. Formula (VII.69) yields

1 \le r_2 \le \frac{a}{r_3 \cdots r_N}     (VII.70)

Formula (VII.70) implies

1 \le r_3 \le \frac{a}{r_4 \cdots r_N}     (VII.71)

and so on, until

1 \le r_{N-1} \le \frac{a}{r_N}     (VII.72)

and

1 \le r_N \le a     (VII.73)
So vol(S) of (VII.64) is found when we remark that r_1 ranges in an interval of length \frac{a}{r_2 \cdots r_N} - 1 (by (VII.69)), where each of r_2,...,r_N ranges as indicated in (VII.70)-(VII.73). Hence

r = \int_{r_N=1}^{a} \int_{r_{N-1}=1}^{\frac{a}{r_N}} \cdots \int_{r_2=1}^{\frac{a}{r_3 \cdots r_N}} \frac{a}{r_2 \cdots r_N}\, dr_2 \cdots dr_{N-1}\, dr_N \;-\; \int_{r_N=1}^{a} \int_{r_{N-1}=1}^{\frac{a}{r_N}} \cdots \int_{r_2=1}^{\frac{a}{r_3 \cdots r_N}} dr_2 \cdots dr_{N-1}\, dr_N     (VII.74)
The evaluation of (VII.74) is tedious but easy. The first term in (VII.74) (called (I)) is calculated as follows: since

\int_{r_2=1}^{\frac{a}{r_3 \cdots r_N}} \frac{dr_2}{r_2} = \ln\left(\frac{a}{r_3 \cdots r_N}\right)     (VII.75)

(\ge 0 by (VII.70)), we have that

(I) = a \int_{r_N=1}^{a} \int_{r_{N-1}=1}^{\frac{a}{r_N}} \cdots \int_{r_3=1}^{\frac{a}{r_4 \cdots r_N}} \frac{1}{r_3 \cdots r_N} \ln\left(\frac{a}{r_3 \cdots r_N}\right) dr_3 \cdots dr_N     (VII.76)

But

\int_{r_3=1}^{\frac{a}{r_4 \cdots r_N}} \frac{1}{r_3} \ln\left(\frac{a}{r_3 \cdots r_N}\right) dr_3 = \frac{1}{2} \ln^2\left(\frac{a}{r_4 \cdots r_N}\right)

as is readily seen. This value goes into (VII.76), and the next integration yields

\int_{r_4=1}^{\frac{a}{r_5 \cdots r_N}} \frac{1}{r_4} \cdot \frac{1}{2} \ln^2\left(\frac{a}{r_4 \cdots r_N}\right) dr_4 = \frac{1}{3!} \ln^3\left(\frac{a}{r_5 \cdots r_N}\right)

Continuing in this way, each integration raises the power of the logarithm (and its factorial denominator) by one, so that

(I) = a \int_{r_N=1}^{a} \frac{1}{r_N} \cdot \frac{1}{(N-2)!} \ln^{N-2}\left(\frac{a}{r_N}\right) dr_N     (VII.77)

= \frac{a \ln^{N-1}(a)}{(N-1)!}     (VII.78)

where all logarithms are \ge 0 since a = \left(\frac{D^N}{x}\right)^{\frac{1}{\beta}} \ge r_1 \cdots r_N \ge 1, using (VII.68).
Now we calculate the second term in (VII.74), called (II):

(II) = -\int_{r_N=1}^{a} \int_{r_{N-1}=1}^{\frac{a}{r_N}} \cdots \int_{r_2=1}^{\frac{a}{r_3 \cdots r_N}} dr_2 \cdots dr_{N-1}\, dr_N

The innermost integral equals

\int_{r_2=1}^{\frac{a}{r_3 \cdots r_N}} dr_2 = \frac{a}{r_3 \cdots r_N} - 1

and the next one

\int_{r_3=1}^{\frac{a}{r_4 \cdots r_N}} \left(\frac{a}{r_3 \cdots r_N} - 1\right) dr_3 = \frac{a}{r_4 \cdots r_N} \ln\left(\frac{a}{r_4 \cdots r_N}\right) - \frac{a}{r_4 \cdots r_N} + 1

Continuing in this way, each integration over the next variable raises the highest power of the logarithm by one and alternates the signs of the lower order terms. After the final integration over r_N one finds

(II) = \sum_{i=0}^{N-2} (-1)^{N-1-i}\, \frac{a \ln^i(a)}{i!} + (-1)^N     (VII.79)
Now (VII.78) and (VII.79) yield, by (VII.74),

r = \frac{a \ln^{N-1}(a)}{(N-1)!} + \sum_{i=0}^{N-2} (-1)^{N-1-i}\, \frac{a \ln^i(a)}{i!} + (-1)^N     (VII.80)
Using (VII.68) and the fact that x = P_N(r), we have by (VII.80)

r + (-1)^{N-1} = \xi_N\left(\left(\frac{D^N}{P_N(r)}\right)^{\frac{1}{\beta}}\right)     (VII.81)

where

\xi_N(y) = \sum_{i=0}^{N-1} (-1)^{N-1-i}\, \frac{y \ln^i(y)}{i!}

i.e. formula (VII.62). By (VII.68) the arguments of the logarithms appearing in \xi_N are greater than or equal to 1, hence the logarithms are positive. Note that \xi_N is an injection on [1,+\infty[. Indeed, differentiating (VII.62) term by term, all contributions cancel in pairs except the one in \ln^{N-1}(y):

\xi_N'(y) = \frac{\ln^{N-1}(y)}{(N-1)!} > 0     (VII.82)

on y \in ]1,+\infty[. So \xi_N is a strictly increasing function on [1,+\infty[ and hence an injection. But

a = \left(\frac{D^N}{x}\right)^{\frac{1}{\beta}} \ge 1

by (VII.68), hence we can take the inverse of \xi_N in (VII.81), yielding

P_N(r) = \frac{D^N}{\left(\xi_N^{-1}\left(r + (-1)^{N-1}\right)\right)^{\beta}}

where \xi_N^{-1} denotes the inverse of the function \xi_N. □
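As a numerical sanity check of this theorem, the following sketch implements \xi_N from (VII.62), inverts it by bisection (one of several possible choices), and verifies the volume formula by Monte Carlo for N = 2; the values of D, \beta, T and a are illustration choices, with a \le T + 1 as the bounds (VII.69)-(VII.73) require.

```python
# Sketch of Theorem VII.6.3.1: xi_N, its inverse, P_N(r), and a volume check.
import math, random

def xi(N, y):
    """(VII.62): xi_N(y) = sum_{i=0}^{N-1} (-1)^(N-1-i) * y * ln(y)^i / i!"""
    return sum((-1)**(N - 1 - i) * y * math.log(y)**i / math.factorial(i)
               for i in range(N))

def xi_inv(N, c, hi=1e12):
    lo = 1.0                                 # xi_N is increasing on [1, oo)
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if xi(N, mid) < c else (lo, mid)
    return lo

def P(N, r, D, beta):
    """(VII.61): rank-frequency density of N-grams / N-word phrases."""
    return D**N / xi_inv(N, r + (-1)**(N - 1))**beta

print(P(2, 5.0, 0.5, 1.2))                   # illustrative D and beta

# Check: r = vol{(r1, r2) in [1, T+1]^2 : r1*r2 <= a} = a ln a - a + 1.
random.seed(0)
T, a = 9.0, 8.0
pts = ((1 + T * random.random(), 1 + T * random.random()) for _ in range(200_000))
vol = T**2 * sum(u * v <= a for u, v in pts) / 200_000
print(vol, xi(2, a) - (-1)**(2 - 1))         # both approximately 9.64
```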
The function PN (r) is not simple. We have the following corollary, proved in Egghe (1999a) and Egghe (2000a) as an approximate result:
Corollary VII.6.3.2 (Egghe (2000a)): If r is large, we have that

P_N(r) \approx \frac{D^N}{\left(\chi_N^{-1}\left((N-1)!\, r\right)\right)^{\beta}}     (VII.83)

where \chi_N^{-1} is the inverse of the function

\chi_N(y) = y \ln^{N-1}(y)     (VII.84)

(again \ln^{N-1}(y) denotes the (N-1)th power of \ln(y)).
Proof: Taking r large enough forces all the ranks r_1,...,r_N to be large, by (VII.64). Since r_1 is large we have by (VII.69) that

\frac{a}{r_2 \cdots r_N} - 1 \approx \frac{a}{r_2 \cdots r_N}

In other words, in the proof of the above theorem we only calculate (I) for r and put (II) \approx 0. By (VII.78), this yields the result.
Note that formula (VII.83) is, up to constants, the same as (VII.54) for N = 2 although both arguments are very different: the one leading to (VII.54) uses many artificial assumptions and approximations, as indicated in Subsection VII.6.2.
This approximation was used in Egghe (1999a, 2000a) because evaluating (II) did not seem to lead to any useful result. Indeed, formulae (VII.61) and (VII.62) are much more complicated than (VII.83) and (VII.84), and if it were not for the results in the sequel we would not consider these intricate results as important. We are, however, lucky: in the next subsection we will derive the size-frequency function f_N linked to the above rank-frequency distribution P_N, and we will show that the exact result (VII.61) leads to a very simple formula for f_N, simpler than the one derived from the inexact (VII.83)!
The derivation of the size-frequency function f_N is based on the general formulae of Chapter II on the link between the rank- and the size-frequency function. Therefore we first have to determine the rank-frequency function (called g in Chapter II and called g_N here to show the N-dependence) derived from the rank-frequency density function P_N in (VII.61). g_N follows from P_N simply by multiplying by the total number of items in the case of N-grams or N-word phrases, which we will denote by A_N (in Chapter II this is denoted by A) - see Proposition II.1.2.2 and the properties of U. Consequently, we have

g_N(r) = \frac{A_N D^N}{\left(\xi_N^{-1}\left(r + (-1)^{N-1}\right)\right)^{\beta}}     (VII.85)

for r \in [0, T^N], using Theorem VII.6.3.1.
In the proof of Theorem VII.6.3.1 we showed that \xi_N strictly increases, hence the same is true for \xi_N^{-1}, so g_N strictly decreases, using (VII.85). From (VII.82) it follows that \xi_N'(y) > 0 and \xi_N''(y) > 0 on ]1,+\infty[. This can be used in (VII.85) to show that g_N is convexly decreasing, as it should be (by the very definition of P_N). We leave this as an exercise.
There are not many practical data on N-grams or N-word phrases. A convexly decreasing rank-frequency function for N-grams can be found in Cavnar and Trenkle (1994). These authors use the name "Zipfian" distribution which, visually, and probably also statistically, is a normal observation. In this Subsection VII.6.3 we only tried to show the mathematical link between 1-gram (1-word phrase) theory (i.e. Lotkaian, Zipfian informetrics) and N-gram (N-word phrase) theory. In general, the above theory (and the one to follow on the size-frequency function) can be considered as the mathematical theory of how to describe informetrically the Cartesian product of N IPPs with the same Zipfian rank-frequency distribution.
The result (VII.85) on g_N is intricate and not easy to work with. In the next subsection we will determine the size-frequency function f_N that is equivalent to the rank-frequency function g_N, using the model in Chapter II. The result on f_N will be surprisingly simple (although its derivation is, once more, tedious).
VII.6.4 The size-frequency function of N-grams and N-word phrases derived from Subsection VII.6.3
We have the following theorem.
Theorem VII.6.4.1 (Egghe (2004f)): The size-frequency function f_N that is equivalent to the rank-frequency function g_N of (VII.85) is given by

f_N(j) = \frac{C}{j^{1+\frac{1}{\beta}}} \ln^{N-1}\left(\frac{\rho_m(N)}{j}\right)     (VII.86)

for j \in [1, \rho_m(N)], where \rho_m(N) is the maximal item density in the case of N-grams or N-word phrases, given by

\rho_m(N) = A_N D^N     (VII.87)

and where C is the constant

C = \frac{\left(\rho_m(N)\right)^{\frac{1}{\beta}}}{\beta^N (N-1)!}     (VII.88)
Proof: By the very definition of the size-frequency function, we have (see formula (II.10)):

f_N(j) = -\frac{1}{g_N'\left(g_N^{-1}(j)\right)}     (VII.89)

for j \in [1, \rho_m(N)], with \rho_m(N) the maximal item density in the case of N-grams or N-word phrases. Formula (VII.85) yields

g_N(r)\left(\xi_N^{-1}\left(r + (-1)^{N-1}\right)\right)^{\beta} = A_N D^N     (VII.90)

Hence, taking derivatives,

g_N'(r)\left(\xi_N^{-1}\left(r + (-1)^{N-1}\right)\right)^{\beta} + g_N(r)\, \beta \left(\xi_N^{-1}\left(r + (-1)^{N-1}\right)\right)^{\beta-1} \left(\xi_N^{-1}\right)'\left(r + (-1)^{N-1}\right) = 0

where \left(\xi_N^{-1}\right)'\left(r + (-1)^{N-1}\right) means: the derivative of the function \xi_N^{-1} in the point r + (-1)^{N-1}. So

g_N'(r) = -\frac{\beta\, g_N(r)\, \left(\xi_N^{-1}\right)'\left(r + (-1)^{N-1}\right)}{\xi_N^{-1}\left(r + (-1)^{N-1}\right)}

But, by (VII.82),

\left(\xi_N^{-1}\right)'\left(r + (-1)^{N-1}\right) = \frac{(N-1)!}{\ln^{N-1}\left(\xi_N^{-1}\left(r + (-1)^{N-1}\right)\right)}     (VII.91)

Now we use (VII.85), yielding

\xi_N^{-1}\left(r + (-1)^{N-1}\right) = \left(\frac{A_N D^N}{g_N(r)}\right)^{\frac{1}{\beta}}     (VII.92)

So

g_N'(r) = -\frac{\beta (N-1)!\, \left(g_N(r)\right)^{1+\frac{1}{\beta}}}{\left(A_N D^N\right)^{\frac{1}{\beta}} \ln^{N-1}\left(\left(\frac{A_N D^N}{g_N(r)}\right)^{\frac{1}{\beta}}\right)}     (VII.93)

Since j = g_N(r) denotes the item density (by Definition (II.8) - see also (II.13)), we have by (VII.85) that

\xi_N^{-1}\left(r + (-1)^{N-1}\right) = \left(\frac{A_N D^N}{j}\right)^{\frac{1}{\beta}}     (VII.94)

in the point r = g_N^{-1}(j). So (VII.94) in (VII.93) yields

g_N'\left(g_N^{-1}(j)\right) = -\frac{\beta^N (N-1)!\, j^{1+\frac{1}{\beta}}}{\left(A_N D^N\right)^{\frac{1}{\beta}} \ln^{N-1}\left(\frac{A_N D^N}{j}\right)}     (VII.95)

which yields, by (VII.89), the result

f_N(j) = \frac{\left(A_N D^N\right)^{\frac{1}{\beta}}}{\beta^N (N-1)!} \cdot \frac{1}{j^{1+\frac{1}{\beta}}} \ln^{N-1}\left(\frac{A_N D^N}{j}\right)     (VII.96)
for j \in [1, \rho_m(N)]: a remarkably simple result, taking into account where we came from! By the definition of \rho_m(N), see formula (II.8), we have

\rho_m(N) = g_N(0) = \frac{A_N D^N}{\left(\xi_N^{-1}\left((-1)^{N-1}\right)\right)^{\beta}}     (VII.97)

by (VII.85). But \xi_N(1) = (-1)^{N-1}, as follows readily from (VII.62). Hence, since we showed in Theorem VII.6.3.1 that \xi_N is an injection on [1,+\infty[, we have that \xi_N^{-1}\left((-1)^{N-1}\right) = 1, and so, from (VII.97),

\rho_m(N) = A_N D^N

proving (VII.87). Now (VII.96) gives

f_N(j) = \frac{C}{j^{1+\frac{1}{\beta}}} \ln^{N-1}\left(\frac{\rho_m(N)}{j}\right)

with C as in (VII.88); hence we have proved (VII.86), for j \in [1, \rho_m(N)]. □
Note that, in terms of Lotka's \alpha, see (II.87), (VII.86) also reads as

f_N(j) = \frac{C}{j^{\alpha}} \ln^{N-1}\left(\frac{\rho_m(N)}{j}\right)     (VII.98)

hence a product of a power law and a power of a logarithm. It is easy to see that f_N' < 0 and f_N'' > 0, hence f_N is convexly decreasing on [1, \rho_m(N)] = [1, A_N D^N].
Note also that g_N and f_N, for N = 1, reduce to the given laws of Zipf and Lotka (as they should). Indeed, for f_1 this is clear (with C = \frac{\rho_m^{1/\beta}}{\beta}, as follows from (VII.88)), agreeing with the results in Subsection II.4.2 (formula (II.91)), since we supposed Zipf's law for g_1. For g_1 we have, by (VII.85),

g_1(r) = \frac{A D}{\left(\xi_1^{-1}(r+1)\right)^{\beta}} = \frac{A D}{(r+1)^{\beta}}

since \xi_1(y) = y by (VII.62), and hence

g_1(r) = \frac{E}{(1+r)^{\beta}}

the same function as (VII.56), using that we denoted D = \frac{E}{A}.
the same function as (VII.56), using that we denoted D = —. A In the next subsection we will use the size-frequency function fN to calculate the averages u. (here denoted as ji N ) of items per source and (x* (here denoted as n*N) being the Type/TokenTaken average as discussed in Chapter III. In terms of the present notations we could say that the Type/Token-Taken theory of Chapter III was based on f,; in the next subsection we will use fN (N > 2). Of course, the general defining formulae for u and \i (i.e. for general sizefrequency functions) of Chapter III also apply here.
VII.6.5 Type/Token averages \mu_N and Type/Token-Taken averages \mu^*_N for N-grams and N-word phrases
As follows from Chapter III (formulae (III.14) and (III.15)), the TT averages \mu_N and the TTT averages \mu^*_N are given by

\mu_N = \frac{A_N}{T_N}     (VII.99)

\mu^*_N = \frac{W_N}{A_N}     (VII.100)

where

T_N = \int_1^{\rho_m(N)} f_N(j)\, dj     (VII.101)

A_N = \int_1^{\rho_m(N)} j\, f_N(j)\, dj     (VII.102)

W_N = \int_1^{\rho_m(N)} j^2\, f_N(j)\, dj     (VII.103)

and where f_N is given by (VII.86). All these integrals are tedious to calculate, but we can use the following formula, found in Gradshteyn and Ryzhik (1965) (p. 203, (2.722)):
\int x^n \ln^m x\, dx = x^{n+1} \sum_{k=0}^{m} (-1)^k\, \frac{m(m-1)\cdots(m-k+1)}{(n+1)^{k+1}}\, \ln^{m-k} x     (VII.104)

valid for all n \in \mathbb{R} \setminus \{-1\} and m \in \mathbb{N}, where for k = 0 the empty product m(m-1)\cdots(m-k+1) equals 1.
For the calculation of T_N (i.e. in function of \rho_m(N), which will be our free parameter, just as \rho_m was in Chapter III), we have two equivalent alternatives: we can calculate (VII.101) directly or (as we will do here) use the following short argument. We note that j = g_N(r), and hence 1 = g_N(T_N) (r = T_N is the highest rank, as proved in Theorem VII.6.3.1). Formula (VII.85) yields

\frac{A_N D^N}{\left(\xi_N^{-1}\left(T_N + (-1)^{N-1}\right)\right)^{\beta}} = 1

so

T_N + (-1)^{N-1} = \xi_N\left(\left(A_N D^N\right)^{\frac{1}{\beta}}\right)

Using (VII.62) we have

T_N + (-1)^{N-1} = \sum_{i=0}^{N-1} (-1)^{N-1-i}\, \frac{\left(A_N D^N\right)^{\frac{1}{\beta}} \ln^i\left(\left(A_N D^N\right)^{\frac{1}{\beta}}\right)}{i!}

hence, by (VII.87),

T_N = (-1)^N + \sum_{i=0}^{N-1} (-1)^{N-1-i}\, \frac{\left(\rho_m(N)\right)^{\frac{1}{\beta}} \ln^i\left(\left(\rho_m(N)\right)^{\frac{1}{\beta}}\right)}{i!}     (VII.105)

valid for all N \in \mathbb{N} and all \beta > 0.

We are left with the calculation of (VII.102) and (VII.103), using (VII.86). We have
A_N = \int_1^{\rho_m(N)} \frac{C}{j^{\frac{1}{\beta}}} \ln^{N-1}\left(\frac{\rho_m(N)}{j}\right) dj     (VII.106)

Since

d\left(\ln^N\left(\frac{\rho_m(N)}{j}\right)\right) = -N \ln^{N-1}\left(\frac{\rho_m(N)}{j}\right) \frac{dj}{j}     (VII.107)

the case \beta = 1 can be integrated directly (see (VII.110) below). For \beta \ne 1 we substitute u = \frac{\rho_m(N)}{j}, giving

\int_1^{\rho_m(N)} \frac{1}{j^{\frac{1}{\beta}}} \ln^{N-1}\left(\frac{\rho_m(N)}{j}\right) dj = \left(\rho_m(N)\right)^{1-\frac{1}{\beta}} \int_1^{\rho_m(N)} u^{\frac{1}{\beta}-2} \ln^{N-1}(u)\, du     (VII.108)

So, for \beta > 0, \beta \ne 1, we can apply (VII.104) (with n = \frac{1}{\beta} - 2, hence n+1 = \frac{1-\beta}{\beta}, and m = N-1), yielding

A_N = C \sum_{k=0}^{N-1} (-1)^k (N-1)(N-2)\cdots(N-k) \left(\frac{\beta}{1-\beta}\right)^{k+1} \ln^{N-1-k}\left(\rho_m(N)\right) \;-\; C\, (-1)^{N-1} (N-1)! \left(\frac{\beta}{1-\beta}\right)^{N} \left(\rho_m(N)\right)^{1-\frac{1}{\beta}}     (VII.109)

valid for all N and \beta > 0, \beta \ne 1, where we note that, for k = 0, we have to take (N-1)(N-2)\cdots(N-k) = 1.

For \beta = 1 we have

A_N = \frac{\rho_m(N) \ln^N\left(\rho_m(N)\right)}{N!}     (VII.110)

as is easily calculated using (VII.107) and (VII.88) for \beta = 1.
For W_N we have

W_N = \int_1^{\rho_m(N)} C\, j^{1-\frac{1}{\beta}} \ln^{N-1}\left(\frac{\rho_m(N)}{j}\right) dj     (VII.111)

With the same substitution u = \frac{\rho_m(N)}{j} we have

\int_1^{\rho_m(N)} j^{1-\frac{1}{\beta}} \ln^{N-1}\left(\frac{\rho_m(N)}{j}\right) dj = \left(\rho_m(N)\right)^{2-\frac{1}{\beta}} \int_1^{\rho_m(N)} u^{\frac{1}{\beta}-3} \ln^{N-1}(u)\, du     (VII.112)

which can be calculated using (VII.104) for all \beta \ne \frac{1}{2} (now n = \frac{1}{\beta} - 3, hence n+1 = \frac{1-2\beta}{\beta}). This gives

W_N = C \sum_{k=0}^{N-1} (-1)^k (N-1)(N-2)\cdots(N-k) \left(\frac{\beta}{1-2\beta}\right)^{k+1} \ln^{N-1-k}\left(\rho_m(N)\right) \;-\; C\, (-1)^{N-1} (N-1)! \left(\frac{\beta}{1-2\beta}\right)^{N} \left(\rho_m(N)\right)^{2-\frac{1}{\beta}}     (VII.113)

valid for all N and \beta \ne \frac{1}{2}, where again, for k = 0, we have to take (N-1)(N-2)\cdots(N-k) = 1.
For \beta = \frac{1}{2} we have, using (VII.107),

\int_1^{\rho_m(N)} \frac{1}{j} \ln^{N-1}\left(\frac{\rho_m(N)}{j}\right) dj = \frac{\ln^N\left(\rho_m(N)\right)}{N}

so that (VII.111) and (VII.88) yield

W_N = \frac{2^N \left(\rho_m(N)\right)^2}{N!} \ln^N\left(\rho_m(N)\right)     (VII.114)

for all N and \beta = \frac{1}{2}.
With these formulae for T_N, A_N and W_N we are able to calculate μ_N and μ*_N via (VII.99) and (VII.100). Of course, formulae (III.33), (III.34), (III.38)-(III.40) can be used in the calculation of μ = μ_1 and μ* = μ*_1, i.e. the TT and TTT averages in the case of 1-grams (single letters) or of 1-word phrases (single words), for comparison.
As examples we will take β = 1 (i.e. Lotka's α = 2) and β = 1/2 (i.e. Lotka's α = 3) and we will take N = 1, 2, 3: the case of 2(3)-grams or 2(3)-word phrases in comparison with single letters or words will be informative enough for higher values of N. In addition, the cases N = 2 and N = 3 are the most important ones for all applications.
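Before specializing, here is a minimal numerical sketch (ours; the function names are our own) of (VII.99)-(VII.100), built on the closed forms (VII.105), (VII.109)/(VII.110) and (VII.113)/(VII.114) as reconstructed above; it reproduces entries of the tables below:

```python
import math

def T_N(p, N, beta):
    # (VII.105)
    s = (-1.0) ** N
    for i in range(N):
        s += ((-1.0) ** (N + 1 + i) * p ** (1.0 / beta)
              * math.log(p) ** i / (beta ** i * math.factorial(i)))
    return s

def A_N(p, N, beta):
    if beta == 1.0:  # (VII.110)
        return p * math.log(p) ** N / math.factorial(N)
    c = 1.0 - 1.0 / beta  # (VII.109)
    inner = sum((c * math.log(p)) ** k / math.factorial(k) for k in range(N))
    return p / (beta - 1.0) ** N * (1.0 - p ** (1.0 / beta - 1.0) * inner)

def W_N(p, N, beta):
    if beta == 0.5:  # (VII.114)
        return 2.0 ** N * p ** 2 * math.log(p) ** N / math.factorial(N)
    c = 2.0 - 1.0 / beta  # (VII.113)
    inner = sum((c * math.log(p)) ** k / math.factorial(k) for k in range(N))
    return p ** 2 / (2.0 * beta - 1.0) ** N * (1.0 - p ** (1.0 / beta - 2.0) * inner)

def mu(p, N, beta):       # (VII.99)
    return A_N(p, N, beta) / T_N(p, N, beta)

def mu_star(p, N, beta):  # (VII.100)
    return W_N(p, N, beta) / A_N(p, N, beta)

print(round(mu(2, 2, 1.0), 3), round(mu_star(2, 2, 1.0), 3))  # 1.244 1.277 (Table VII.2)
print(round(mu(2, 2, 0.5), 3), round(mu_star(2, 2, 0.5), 3))  # 1.214 1.244 (Table VII.5)
```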
Let us take β = 1 first. For N = 2 we have, from (VII.105), (VII.110) and (VII.113),

$$T_2 = 1 - p_m(2) + p_m(2)\ln(p_m(2)) \qquad (VII.115)$$

$$A_2 = \frac{1}{2}\,p_m(2)\ln^2(p_m(2)) \qquad (VII.116)$$

$$W_2 = (p_m(2))^2 - p_m(2)\ln(p_m(2)) - p_m(2) \qquad (VII.117)$$

Hence

$$\mu_2 = \frac{p_m(2)\ln^2(p_m(2))}{2\left(1 - p_m(2) + p_m(2)\ln(p_m(2))\right)} \qquad (VII.118)$$
$$\mu^*_2 = \frac{2\left(p_m(2) - \ln(p_m(2)) - 1\right)}{\ln^2(p_m(2))} \qquad (VII.119)$$
which yields Table VII.2.

Table VII.2. Values of μ_2 and μ*_2 for diverse values of p_m(2), for β = 1. Reprinted from Egghe (2004f), Table 1, with permission from Elsevier.

p_m(2)   1.5     2       3       5       10      100
μ_2      1.140   1.244   1.397   1.600   1.890   2.933
μ*_2     1.150   1.277   1.494   1.846   2.526   8.902
This can be compared with the values of μ_1 and μ*_1, i.e. the non-composed case. For β = 1 (hence α = 2) we use the formulae (III.34) and (III.39):

$$\mu = \mu_1 = \frac{p_m\ln p_m}{p_m - 1} \qquad (VII.120)$$

$$\mu^* = \mu^*_1 = \frac{p_m - 1}{\ln p_m} \qquad (VII.121)$$

yielding Table VII.3.
Table VII.3. Values of μ_1 and μ*_1 for diverse values of p_m, for β = 1. Reprinted from Egghe (2004f), Table 2, with permission from Elsevier.

p_m    1.5     2       3       5       10      100
μ_1    1.216   1.386   1.648   2.012   2.558   4.652
μ*_1   1.233   1.443   1.820   2.485   3.909   21.498
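A quick check of (VII.120)-(VII.121) against Table VII.3 (a sketch in the same conventions as above):

```python
import math

def mu_1(p):       # (VII.120), beta = 1
    return p * math.log(p) / (p - 1)

def mu_1_star(p):  # (VII.121)
    return (p - 1) / math.log(p)

for p in (1.5, 2, 3, 5, 10, 100):
    print(p, round(mu_1(p), 3), round(mu_1_star(p), 3))
# e.g. p_m = 2 gives 1.386 and 1.443, as in Table VII.3
```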
We see that, for the same value of the input "seed" p_m(2) or p_m, the values μ_1 and μ*_1 are larger than the values μ_2 and μ*_2 respectively. We also see that μ*_2 − μ_2 < μ*_1 − μ_1, showing that the average screen lengths (e.g. in the case of the use of 2-grams by a cataloger - see Chapter III) are shorter than the ones given in the 1-gram case. Note further that μ*_1 > μ_1 and μ*_2 > μ_2, as it should be, following Theorem III.1.3.2.1.
Now we calculate the case N = 3, still with β = 1. We have from (VII.105), (VII.110) and (VII.113)

$$T_3 = -1 + p_m(3) - p_m(3)\ln(p_m(3)) + \frac{1}{2}\,p_m(3)\ln^2(p_m(3)) \qquad (VII.122)$$

$$A_3 = \frac{1}{6}\,p_m(3)\ln^3(p_m(3)) \qquad (VII.123)$$

$$W_3 = (p_m(3))^2 - \frac{1}{2}\,p_m(3)\ln^2(p_m(3)) - p_m(3)\ln(p_m(3)) - p_m(3) \qquad (VII.124)$$

Hence

$$\mu_3 = \frac{p_m(3)\ln^3(p_m(3))}{-6 + 6p_m(3) - 6p_m(3)\ln(p_m(3)) + 3p_m(3)\ln^2(p_m(3))} \qquad (VII.125)$$
$$\mu^*_3 = \frac{6p_m(3) - 3\ln^2(p_m(3)) - 6\ln(p_m(3)) - 6}{\ln^3(p_m(3))} \qquad (VII.126)$$
which yields Table VII.4.

Table VII.4. Values of μ_3 and μ*_3 for diverse values of p_m(3), for β = 1. Reprinted from Egghe (2004f), Table 3, with permission from Elsevier.

p_m(3)   1.5     2       3       5       10      100
μ_3      1.103   1.179   1.288   1.431   1.630   2.329
μ*_3     1.110   1.200   1.348   1.577   1.989   5.148
The same comments as for μ_2, μ*_2, given above, can be given here for μ_3, μ*_3. Note again that the values of μ_3, μ*_3 are smaller than the values of μ_2, μ*_2 respectively.
We, finally, give formulae for β = 1/2 and N = 2, 3 and compare with the case N = 1. For N = 2 and β = 1/2 we have the following formulae, following from (VII.105), (VII.109) and (VII.114):
$$T_2 = 1 - (p_m(2))^2 + 2(p_m(2))^2\ln(p_m(2)) \qquad (VII.127)$$

$$A_2 = 4p_m(2) + 4(p_m(2))^2\ln(p_m(2)) - 4(p_m(2))^2 \qquad (VII.128)$$

$$W_2 = 2(p_m(2))^2\ln^2(p_m(2)) \qquad (VII.129)$$

Hence we have

$$\mu_2 = \frac{4p_m(2) + 4(p_m(2))^2\ln(p_m(2)) - 4(p_m(2))^2}{1 - (p_m(2))^2 + 2(p_m(2))^2\ln(p_m(2))} \qquad (VII.130)$$

$$\mu^*_2 = \frac{p_m(2)\ln^2(p_m(2))}{2 + 2p_m(2)\ln(p_m(2)) - 2p_m(2)} \qquad (VII.131)$$
yielding Table VII.5.
Table VII.5. Values of μ_2 and μ*_2 for diverse values of p_m(2), for β = 1/2. Reprinted from Egghe (2004f), Table 4, with permission from Elsevier.

p_m(2)   1.5     2       3       5       10      100
μ_2      1.130   1.214   1.321   1.433   1.552   1.761
μ*_2     1.140   1.244   1.397   1.600   1.890   2.933
Compare now with the case N = 1, β = 1/2 (hence α = 3), using formulae (III.33) and (III.40):

$$\mu = \mu_1 = \frac{2p_m}{p_m + 1} \qquad (VII.132)$$

$$\mu^* = \mu^*_1 = \frac{p_m\ln p_m}{p_m - 1} \qquad (VII.133)$$

yielding Table VII.6.
Table VII.6. Values of μ_1 and μ*_1 for diverse values of p_m, for β = 1/2. Reprinted from Egghe (2004f), Table 5, with permission from Elsevier.

p_m    1.5     2       3       5       10      100
μ_1    1.200   1.333   1.500   1.667   1.818   1.980
μ*_1   1.216   1.386   1.648   2.012   2.558   4.652
For N = 3, β = 1/2 we have now, using (VII.105), (VII.109) and (VII.114),

$$T_3 = -1 + (p_m(3))^2 - 2(p_m(3))^2\ln(p_m(3)) + 2(p_m(3))^2\ln^2(p_m(3)) \qquad (VII.134)$$

$$A_3 = -8p_m(3) + 4(p_m(3))^2\ln^2(p_m(3)) - 8(p_m(3))^2\ln(p_m(3)) + 8(p_m(3))^2 \qquad (VII.135)$$

$$W_3 = \frac{4}{3}(p_m(3))^2\ln^3(p_m(3)) \qquad (VII.136)$$

Hence

$$\mu_3 = \frac{-8p_m(3) + 4(p_m(3))^2\ln^2(p_m(3)) - 8(p_m(3))^2\ln(p_m(3)) + 8(p_m(3))^2}{-1 + (p_m(3))^2 - 2(p_m(3))^2\ln(p_m(3)) + 2(p_m(3))^2\ln^2(p_m(3))} \qquad (VII.137)$$

$$\mu^*_3 = \frac{p_m(3)\ln^3(p_m(3))}{-6 + 3p_m(3)\ln^2(p_m(3)) - 6p_m(3)\ln(p_m(3)) + 6p_m(3)} \qquad (VII.138)$$

yielding Table VII.7.
Table VII.7. Values of μ_3 and μ*_3 for diverse values of p_m(3), for β = 1/2. Reprinted from Egghe (2004f), Table 6, with permission from Elsevier.

p_m(3)   1.5     2       3       5       10      100
μ_3      1.097   1.160   1.241   1.330   1.429   1.635
μ*_3     1.103   1.179   1.288   1.431   1.630   2.329
We see again the same tendencies in the comparison of μ_1, μ*_1, μ_2, μ*_2, μ_3, μ*_3 as in the case β = 1.
We close with an open problem.
Open Problem: Describe the TT average and the TTT average in the case of N-grams where the number of items is limited to the number of documents in a database (e.g. an OPAC, used by a cataloger, as described in Chapter III). Since, here, the number of items (denoted A) is fixed and since there are T_N N-grams (cf. Theorem VII.6.3.1), we might end up, even for not very large N, with the relation T_N > A, hence with more sources than items, which is out of the scope of the informetric theory developed in this book.
APPENDIX
APPENDIX I

In this appendix we will give a self-contained proof of the characterization of scale-free functions as power functions (basic for Lotka's law) and of the characterization of functions transforming products into sums as logarithmic functions (basic for entropy). Both results are based on the characterization of functions transforming sums into sums as linear functions.
Theorem A.I.1:

(i) The following assertions are equivalent for a continuous function ψ : ℝ → ℝ:

(a) $$\psi(x+y) = \psi(x) + \psi(y) \qquad (A.I.1)$$ for all x, y ∈ ℝ;

(b) there is a real number c such that $$\psi(x) = cx \qquad (A.I.2)$$ for all x ∈ ℝ.

(ii) The following assertions are equivalent for a continuous function ψ : ℝ⁺ → ℝ:

(a) $$\psi(xy) = \psi(x) + \psi(y) \qquad (A.I.3)$$ for all x, y ∈ ℝ⁺;

(b) there is a real number c such that $$\psi(x) = c\ln x \qquad (A.I.4)$$ for all x ∈ ℝ⁺;

(c) there is a positive real number a such that $$\psi(x) = \log_a x \qquad (A.I.5)$$ for all x ∈ ℝ⁺.

(iii) The following assertions are equivalent for a continuous function ψ : ℝ⁺ → ℝ⁺:

(a) ψ is scale-free (Definition I.3.2.1);

(b) there exist constants a ∈ ℝ⁺, c ∈ ℝ such that $$\psi(x) = ax^{c} \qquad (A.I.6)$$ for all x ∈ ℝ⁺.
Proof: It is clear that all (b)s imply all (a)s and that (ii)(b) is equivalent with (ii)(c). Therefore we only have to prove the three implications (a) ⇒ (b).

(i)(a) ⇒ (i)(b): Complete induction on (A.I.1) gives

$$\psi(nx) = n\psi(x) \qquad (A.I.7)$$

for all x ∈ ℝ and n ∈ ℕ. Let now m, n ∈ ℕ, t ∈ ℝ and x = (m/n)t. Hence xn = mt and hence, by (A.I.7), nψ(x) = mψ(t), so

$$\psi\!\left(\frac{m}{n}\,t\right) = \frac{m}{n}\,\psi(t).$$

Putting ψ(1) = c we see that (A.I.2) is valid for all positive rational numbers. Let x ∈ ℝ⁺ be arbitrary. Since ℚ⁺ is dense in ℝ⁺, there exists a sequence (q_n)_{n∈ℕ}, q_n ∈ ℚ⁺ for all n ∈ ℕ, such that lim_{n→∞} q_n = x. Since ψ is continuous we have

$$\psi(x) = \psi\!\left(\lim_{n\to\infty} q_n\right) = \lim_{n\to\infty}\psi(q_n) = \lim_{n\to\infty} cq_n = cx,$$

hence (A.I.2) is proved for all x ∈ ℝ⁺. Let now x ∈ ℝ⁻ ∪ {0}. If x = 0:

$$\psi(0) = 0 = c\cdot 0$$

by (A.I.1) (take x = y = 0 there). If x < 0,

$$\psi(x) = \psi(0) - \psi(-x)$$

(by (A.I.1)). So

$$\psi(x) = -\psi(-x) = -c(-x) = cx,$$

completing the proof of (i).
(ii)(a) ⇒ (ii)(b): Define, for all x ∈ ℝ,

$$f(x) = \psi(e^{x}). \qquad (A.I.8)$$

Then

$$f(x+y) = \psi(e^{x+y}) = \psi(e^{x}e^{y}) = \psi(e^{x}) + \psi(e^{y}) = f(x) + f(y)$$

(by (A.I.3)) for all x, y ∈ ℝ, using (A.I.8). By (i) there is a c ∈ ℝ such that f(x) = cx for all x ∈ ℝ. Hence, for all x ∈ ℝ⁺,

$$\psi(x) = f(\ln x) = c\ln x.$$
(iii)(a) ⇒ (iii)(b): From (I.26), defining the scale-free property of ψ, we have that for every positive constant C there is a positive constant D such that

$$\psi(Cx) = D\psi(x)$$

for all x in the domain of ψ, i.e. ℝ⁺ here (and D only depends on C). Hence (since 1 ∈ ℝ⁺):

$$\psi(C) = D\psi(1),$$

hence

$$\psi(Cx) = \frac{\psi(C)\psi(x)}{\psi(1)} \qquad (A.I.9)$$

for all C, x ∈ ℝ⁺. For all x ∈ ℝ⁺, define

$$\varphi(x) = \ln\frac{\psi(x)}{\psi(1)}. \qquad (A.I.10)$$

Hence φ : ℝ⁺ → ℝ and we have

$$\varphi(xy) = \ln\frac{\psi(xy)}{\psi(1)} = \ln\left(\frac{\psi(x)}{\psi(1)}\cdot\frac{\psi(y)}{\psi(1)}\right) = \varphi(x) + \varphi(y)$$

(since (A.I.9) is valid for all C, x ∈ ℝ⁺; take C = y). Also φ is continuous since ψ is. Now (ii) implies that there is a real number c such that

$$\varphi(x) = c\ln x.$$

Since ln is injective it follows (by (A.I.10)) that

$$\psi(x) = \psi(1)\,x^{c},$$

i.e. ψ(x) = ax^c for all x ∈ ℝ⁺. Finally note that a = ψ(1) > 0. □
The above theorem is a result on the characterization of functions satisfying certain so-called functional equations (such as the scale-free property, (A.I.1) or (A.I.3)). For more results in this area we refer the reader to Roberts (1979).
APPENDIX II

Table of the experimental fractional frequency distribution of author scores in mathematics (Mathematical Reviews - 1990). Partially reprinted from Egghe and Rao (2002a), Table in Appendix, p. 801. Copyright John Wiley & Sons Limited. Reproduced with permission.
Fractional score (number of authors):

0.02 (1), 0.04 (1), 0.06 (9), 0.07 (2), 0.08 (1), 0.09 (13), 0.1 (17), 0.13 (12), 0.14 (35), 0.16 (3), 0.17 (93), 0.2 (341), 0.25 (1,166), 0.29 (4), 0.33 (4,772), 0.35 (16), 0.38 (1), 0.39 (2), 0.41 (17), 0.42 (2), 0.43 (1), 0.45 (6), 0.46 (1), 0.48 (2),
0.5 (11,673), 0.53 (36), 0.56 (1), 0.58 (136), 0.59 (3), 0.62 (2), 0.63 (1), 0.64 (5), 0.65 (2), 0.66 (17), 0.69 (474), 0.72 (23), 0.74 (1), 0.76 (170), 0.78 (3), 0.82 (2), 0.84 (919), 0.86 (7), 0.88 (1), 0.9 (6), 0.92 (2), 0.94 (48), 0.96 (6), 0.98 (1), 0.99 (1), 1 (14,507),
1.02 (9), 1.04 (2), 1.06 (2), 1.08 (66), 1.09 (3), 1.1 (2), 1.11 (2), 1.12 (2), 1.13 (2), 1.14 (22), 1.15 (2), 1.16 (15), 1.17 (189), 1.2 (37), 1.23 (2), 1.25 (177), 1.28 (3), 1.29 (1), 1.3 (1), 1.32 (2), 1.33 (868), 1.37 (2), 1.38 (1), 1.4 (2), 1.41 (2), 1.42 (19), 1.45 (6), 1.48 (4), 1.5 (2,547),
1.53 (9), 1.55 (1), 1.57 (1), 1.58 (27), 1.6 (1), 1.62 (1), 1.64 (1), 1.65 (1), 1.66 (1), 1.67 (149), 1.7 (11), 1.73 (1), 1.75 (65), 1.79 (1), 1.83 (328), 1.9 (2), 1.91 (1), 1.92 (9), 1.95 (1), 1.98 (6), 2 (3,255),
2.03 (6), 2.06 (1), 2.07 (1), 2.08 (47), 2.09 (1), 2.12 (2), 2.14 (2), 2.15 (1), 2.16 (2), 2.17 (70), 2.2 (26), 2.25 (50), 2.28 (2), 2.29 (1), 2.33 (288), 2.36 (1), 2.37 (1), 2.38 (1), 2.4 (2), 2.41 (3), 2.42 (20), 2.44 (1), 2.45 (4), 2.5 (789),
2.55 (2), 2.58 (34), 2.63 (1), 2.66 (6), 2.67 (55), 2.7 (9), 2.73 (1), 2.75 (25), 2.83 (130), 2.86 (1), 2.87 (3), 2.9 (1), 2.91 (1), 2.92 (9), 2.95 (1), 3 (1,191),
3.03 (3), 3.08 (10), 3.1 (1), 3.11 (1), 3.17 (42), 3.2 (9), 3.23 (1), 3.25 (29), 3.33 (126), 3.36 (1), 3.42 (4), 3.45 (2), 3.48 (1), 3.5 (33), 3.53 (1), 3.58 (5), 3.6 (2), 3.66 (2), 3.67 (33), 3.7 (2), 3.72 (1), 3.75 (13), 3.78 (1), 3.8 (1), 3.83 (80), 3.91 (1), 3.92 (1), 3.95 (1), 3.98 (1), 4 (386), 4.08 (9),
..., 17.2 (1), 17.58 (1), 18 (1).

Total number of authors: 46,853.
Fig. A.1. A fractional frequency curve (experimental data of the previous table). The numbers on the abscissa refer to the ranks of the fractional scores, not to the fractional scores themselves. Reprinted from Egghe and Rao (2002a), Fig. 1, p. 790. Copyright John Wiley & Sons Limited. Reproduced with permission.
APPENDIX III

Statistical determination of the parameters in the law of Lotka
A.III.1 Statement of the problem
The purpose of this book was to investigate how far we can go (i.e. which results can be proved) when we only use the size-frequency function

$$f(j) = \frac{C}{j^{\alpha}} \qquad (A.III.1)$$

j ∈ [1, p_m], C, α > 0, i.e. a decreasing power law, i.e. Lotka's function. In this sense, formula (A.III.1) is given (and, in fact, it is the only thing that is given) and hence we suppose C, p_m and α to be known parameters.
However, in practice, one has a table of data (i.e. n ∈ ℕ (of course finite) versus f(n) = the number of sources with n items) and one is interested in determining the "best fitting" function of the type (A.III.1) ("best fitting" to the data or, rather, to the graph of the data (n, f(n)), n = 1, 2, 3, ...). This is important to know since the values of C, p_m and α (especially α) determine certain crucial properties of concentration (Chapter IV), of fractal complexity (Chapter V) or even the form of citation graphs (Subsection VII.5).
There are three problems to tackle:

(i) the fact that our practical data are never complete, so that we have a sample to which, possibly, another size-frequency function applies than the one of the (unknown) complete data set;

(ii) clearing out the difference between the continuous Lotka function (A.III.1) and the discrete one

$$f(n) = \frac{K}{n^{a}} \qquad (A.III.2)$$

n = 1, 2, ..., n_max;

(iii) considering the data set as such (cf. (i)), how can we determine α, a (most importantly) and C, K, p_m and n_max in the most accurate way. This is a three-parameter problem. Often, of course, n_max is high and hence n_max and p_m can be taken as +∞, so that the problem is reduced to a two-parameter determination (C and α for (A.III.1); K and a for (A.III.2)).

The next section deals with (i) and Section A.III.3 with (ii). The rest of this Appendix III is devoted to possible solutions of problem (iii).
A.III.2 The problem of incomplete data (samples) and Lotkaian informetrics

In this section we will reveal a unique property of Lotkaian informetrics in connection with the determination of the size-frequency function f based on an incomplete set of data. Let us consider a complete IPP (unknown) with size-frequency function f and a "sample" IPP (known), which is a subset of the unknown one, with size-frequency function f*.
We will consider the following sampling types, which we define using the continuous size-frequency function f (Chapter II) of which the domain is extended to ℝ⁺ (as is also the case in Subsection I.3.2), and verify that the names reflect their intuitive meaning (see also Dierick (1992)).
Definition A.III.2.1: A sample is a systematic sample in the items (or an item systematic sample), with sample fraction θ (i.e. the sample size is θA, a fraction θ of all the items) if, for every j ∈ ℝ⁺:

$$f^*(\theta j) = f(j). \qquad (A.III.3)$$

Definition A.III.2.2: A sample is a systematic sample in the sources (or a source systematic sample), with sample fraction η (i.e. the sample size is ηT, a fraction η of all the sources) if, for every j ∈ ℝ⁺:

$$f^*(j) = \eta f(j). \qquad (A.III.4)$$
We have the following result, characterizing Lotkaian informetrics.

Theorem A.III.2.2 (Egghe (2004h)): The following assertions (i) and (ii) are equivalent:

(i) Item and source systematic samples are the same, i.e.:

(a) For every θ ∈ ]0,1] there exists an η ∈ ]0,1] (only dependent (injectively) on θ) such that every item systematic sample with fraction θ is a source systematic sample with fraction η. Reversely:

(b) For every η ∈ ]0,1] there exists a θ ∈ ]0,1] (only dependent (injectively) on η) such that every source systematic sample with fraction η is an item systematic sample with fraction θ.

(ii) The function f is scale-free and hence, equivalently (Corollary I.3.2.3), f is a decreasing power law (i.e. Lotka's law).

If the above assertions are true, the relation between η and θ is given by

$$\eta = \theta^{\alpha} \qquad (A.III.5)$$

where α is Lotka's exponent, see (A.III.1).
Proof:

(i) ⇒ (ii): Suppose (i)(a). We have given formula (A.III.3) and hence, by (i)(a), we also have formula (A.III.4) with η = η(θ), i.e. η is an injective function of θ. So, for all j ∈ ℝ⁺:

$$f^*(\theta j) = f(j)$$

and

$$f^*(\theta j) = \eta f(\theta j).$$

Hence

$$f(\theta j) = \frac{1}{\eta}\,f(j) \qquad (A.III.6)$$

for all j ∈ ℝ⁺, where η = η(θ) for all θ ∈ ]0,1].

Suppose (i)(b). We have given formula (A.III.4) and hence, by (i)(b), we also have formula (A.III.3) with θ = θ(η), i.e. θ is an injective function of η. So, for all j ∈ ℝ⁺:

$$f^*(j) = \eta f(j)$$

and f*(θj) = f(j), hence

$$f\!\left(\frac{j}{\theta}\right) = \eta f(j)\cdot\frac{1}{\eta}\,\eta = \eta f(j) \qquad (A.III.7)$$

for all j ∈ ℝ⁺.

Note that the fact that θ = θ(η) is an injection implies that η is a function of 1/θ, for all θ ∈ ]0,1]. Hence, by (A.III.6) and (A.III.7), f is scale-free (since θ ∈ ]0,1] and 1/θ ∈ [1,+∞[, so that all values in ℝ⁺ are covered) and the result follows.
(ii) ⇒ (i): Let f be given as a decreasing power law

$$f(j) = \frac{C}{j^{\alpha}},$$

j ∈ ℝ⁺. Given θ ∈ ]0,1], define

$$f^*(\theta j) = f(j)$$

(i.e. given an item systematic sample) for all j ∈ ℝ⁺. Hence, for all j ∈ ℝ⁺:

$$f^*(j) = f\!\left(\frac{j}{\theta}\right) = \frac{C\theta^{\alpha}}{j^{\alpha}} = \theta^{\alpha}f(j).$$

Putting η = θ^α we see that we have a source systematic sample (with fraction η = θ^α; note that indeed θ ∈ ]0,1] implies η ∈ ]0,1] and that η is an injective function of θ). Hence we proved (i)(a).

Let us now have a given source systematic sample (i.e. given η ∈ ]0,1]):

$$f^*(j) = \eta f(j)$$

for all j ∈ ℝ⁺. Hence, for all j ∈ ℝ⁺:

$$f^*\!\left(\eta^{\frac{1}{\alpha}}j\right) = \eta f\!\left(\eta^{\frac{1}{\alpha}}j\right) = \eta\,\frac{C}{\eta\,j^{\alpha}} = f(j).$$

Hence we have an item systematic sample with fraction θ = η^{1/α} (note again that η ∈ ]0,1] implies θ ∈ ]0,1] and that θ is an injective function of η). Hence we proved (i)(b). This completes the proof of the theorem. □
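To make the sampling equivalence concrete, here is a tiny numerical sketch (all constants are arbitrary illustrative choices, not from the book):

```python
# Numerical check of (A.III.5): for Lotka's law an item systematic sample
# with fraction theta is a source systematic sample with eta = theta**alpha.
C, alpha, theta = 100.0, 2.0, 0.5

def f(j):                 # population size-frequency function
    return C / j ** alpha

def f_star(j):            # item systematic sample: f*(theta*j) = f(j)
    return f(j / theta)

eta = theta ** alpha
for j in (1.0, 2.5, 7.0):
    print(f_star(j), eta * f(j))  # identical: f*(j) = eta * f(j)
```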
For results on concentration aspects of systematically sampled (in items or in sources) IPPs we refer the reader to Egghe (2002c).
Note: In the above results, as well as in Subsection I.3.2, we used a size-frequency function f : ℝ⁺ → ℝ⁺. In Chapter II, for f, we limited ourselves to a domain [p(0), p(A)] and we even supposed (for the sake of simplicity) that p(0) = 1. Note that, in general (by the definition in formula (II.5)), p takes values in ℝ⁺ so that the above interval can be replaced by ℝ⁺. This broader domain is needed here in order to be able to define size-frequency functions for sampled IPPs and to prove the scale-free property of f.
The above result shows that Lotkaian informetrics (and only Lotkaian informetrics) allows for a sample size-frequency function f* which is (up to constants) the same as the population size-frequency function f, a remarkable conclusion. We hence can delete (i) and give full attention to the solution of the problem itself, i.e. to (ii) and (iii). One can also do this in non-Lotkaian informetrics but, as explained above, one then only finds the appropriate size-frequency function for the sample (incomplete IPP) at hand, which is then different from the one of the complete IPP. Luckily, Lotka himself was not in this situation when he checked his power law on the volumes of the Chemical Abstracts (1907-1916), where he limited himself to author names starting with the letter A or B, a clear sample! From the above it is to be expected that, had Lotka not applied this severe restriction, he would have ended up, by (A.III.4), with (more or less) the same power law (i.e. with the same α) as he found in his article Lotka (1926).
A.III.3 The difference between the continuous Lotka function and the discrete Lotka function
So we consider the functions

$$f(j) = \frac{C}{j^{\alpha}},$$

j ∈ [1, p_m], i.e. the continuous function that was the topic of study in this book and, denoting f_d (d for discrete) to indicate the difference with f,

$$f_d(n) = \frac{K}{n^{a}},$$

n = 1, 2, ..., n_max. For easy reasoning we will take n_max = p_m = +∞. Suppose our data yield that there are T sources and A items in total. Then we repeat the following results from Chapter II (existence of f) ((II.28) and (II.29) in Proposition II.2.1.1.1):

$$\alpha = \frac{2A-T}{A-T} = \frac{2\mu-1}{\mu-1} \qquad (A.III.8)$$

$$C = \frac{AT}{A-T} = \frac{A}{\mu-1} \qquad (A.III.9)$$
From this it is clear that the functions f and f_d above are different. Indeed, (A.III.9) implies that C = f(1) > T while, obviously, K = f_d(1) < T (f_d(1) being the number of sources with 1 item). This difference can be explained by the fact that f represents a density function, so that f(1) has no actual interpretation in terms of our data. The link between f and f_d can be clarified by the "discretization" of the continuous f. The very definition of a density function (as f) gives that, for every n = 1, 2, 3, ...

$$I(f)(n) := \int_n^{n+1} f(j)\,dj = \int_n^{n+1}\frac{C}{j^{\alpha}}\,dj.$$

For α ≠ 1 this gives

$$I(f)(n) = \frac{C}{\alpha-1}\left(\frac{1}{n^{\alpha-1}} - \frac{1}{(n+1)^{\alpha-1}}\right) \qquad (A.III.10)$$
Note that I(f) is not a power law (and hence I(f) ≠ f_d), but we will study the difference between I(f) and f_d. It will turn out that both functions are very similar, which will make our problem of determining f_d easier.
Note that

$$I(f)(1) = \frac{C}{\alpha-1}\left(1 - \frac{1}{2^{\alpha-1}}\right) \qquad (A.III.11)$$

From (A.III.8) and (A.III.9) (or from Subsection II.2.1.1) it follows that

$$\frac{C}{\alpha-1} = T$$

so that (A.III.11) implies

$$I(f)(1) = T\left(1 - \frac{1}{2^{\alpha-1}}\right) < T \qquad (A.III.12)$$
which explains the continuous model and C = f(1) > T. In fact we see that I(f)(n) < T for every n ∈ ℕ. Let us now make the following approximative calculation of (A.III.10):

$$I(f)(n) = \frac{C}{\alpha-1}\left(\frac{1}{n^{\alpha-1}} - \frac{1}{(n+1)^{\alpha-1}}\right) \approx -\frac{C}{\alpha-1}\,\frac{d}{dn}\left(\frac{1}{n^{\alpha-1}}\right) \qquad (A.III.13)$$

(A.III.13) now gives

$$I(f)(n) \approx \frac{C}{n^{\alpha}},$$

showing that I(f) has the same shape as f. Hence, for f and f_d to have the same shape, we conclude that

$$a \approx \alpha \qquad (A.III.14)$$

if a and α are not too high (otherwise f(n) ≈ f_d(n) ≈ 0 even if a and α are very different).
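The quality of the approximation (A.III.13) is easily inspected numerically (a sketch with arbitrary illustrative values of C and α):

```python
C, alpha = 100.0, 2.5

def I_f(n):
    # (A.III.10): the integral of f over [n, n+1]
    return C / (alpha - 1) * (n ** (1 - alpha) - (n + 1) ** (1 - alpha))

for n in (1, 2, 5, 10, 50):
    print(n, I_f(n), C / n ** alpha)  # (A.III.13): the ratio tends to 1 as n grows
```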
Whether (A.III.14) is true can be checked as follows.
The exact relation between a and α can, numerically, be calculated as follows: given T and A as above (A > T), we have, from (A.III.8) and (A.III.9) (or from Subsection II.2.1.1, formula (II.30)), that

$$\frac{C}{\alpha-1} = T,\qquad \frac{C}{\alpha-2} = A,$$

hence

$$\mu = \frac{A}{T} = \frac{\alpha-1}{\alpha-2} \qquad (A.III.15)$$

From (A.III.2) we have

$$T = \sum_{n=1}^{\infty}\frac{K}{n^{a}} = K\zeta(a),\qquad A = \sum_{n=1}^{\infty} n\,\frac{K}{n^{a}} = K\zeta(a-1),$$

where ζ denotes the classical Riemann zeta function (see Apostol (1957)). So

$$\mu = \frac{A}{T} = \frac{\zeta(a-1)}{\zeta(a)} \qquad (A.III.16)$$
So for a given data set, A and T, hence μ, can be determined. From tables of ζ(a) one can construct tables of ζ(a−1)/ζ(a) versus a (provided in Egghe (2004g)) and hence a follows from (A.III.16). Also α follows from (A.III.15) (or see (II.30)):

$$\alpha = \frac{2\mu-1}{\mu-1} \qquad (A.III.17)$$

and the comparison can be made. This is done in Egghe (2004g), where more details can be found. The relation (A.III.14) is confirmed there.
It furthermore follows from (A.III.15) and (A.III.16) that a > 2 iff α > 2 (hence also a < 2 iff α < 2), even if a and α are different (for high, uncommon, values). This is an important conclusion because many properties derived from Lotka's function f (e.g. the shape of the cumulative first-citation distribution (Chapter VII), the existence of the Groos droop (Chapter II), concentration (Chapter IV) and fractal properties (Chapter V)) are common for all α above 2 and for all α below 2, but differ between values of α below and above 2.
So, from the above, the practical calculation of the exponent a (based on the data) (see Subsection A.III.4) yields a clear estimate of α, which determines the Lotkaian informetrics theory, based on f, as developed in this book. The relationship between I(f)(1) and f_d(1) = K can also be described as follows (in this way also indicating the closeness of I(f)(n) and f_d(n) for all n ∈ ℕ). From (A.III.12) and the fact that from (A.III.2) it follows that T = Kζ(a), we conclude that we have to compare

$$1 - \frac{1}{2^{\alpha-1}}\quad\text{with (using (A.III.14))}\quad \zeta^{-1}(a) = \frac{1}{\zeta(a)} \qquad (A.III.18)$$
Tables of 1/ζ(a) are available and we reproduce one here (Table A.1), since we will also need it further on (the table is reproduced from Egghe and Rousseau (1990a), with permission from the publisher - see also Nicholls (1987)).
Note that only the values ζ(2) and ζ(4) are exactly known, since

$$\zeta(2) = \sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6} \qquad (A.III.19)$$

(cf. formula (I.11)) and

$$\zeta(4) = \sum_{n=1}^{\infty}\frac{1}{n^4} = \frac{\pi^4}{90} \qquad (A.III.20)$$

as can e.g. be found in Gradshteyn and Ryzhik (1965). Some examples are found in Table A.2.
We also see that, the higher a, the closer the results, i.e. I(f)(1) ≈ f_d(1). This is due to the fact that

$$\lim_{a\to\infty}\frac{1}{\zeta(a)} = 1 = \lim_{\alpha\to\infty}\left(1 - \frac{1}{2^{\alpha-1}}\right) \qquad (A.III.21)$$

An analogous argument shows this to be true for every n = 1, 2, 3, ... .
Table A.1. Table of 1/ζ(a) for a ∈ [1.11, 3.49] with increments of 0.01. Reprinted from Egghe and Rousseau (1990a), Table IV.6.6, p. 357, with permission from Elsevier.

a = 1.11, 1.12, ..., 1.49:
1/ζ(a) = 0.1033, 0.1121, 0.1208, 0.1294, 0.1378, 0.1462, 0.1545, 0.1627, 0.1708, 0.1788, 0.1868, 0.1946, 0.2024, 0.2100, 0.2176, 0.2251, 0.2325, 0.2399, 0.2471, 0.2543, 0.2614, 0.2685, 0.2754, 0.2823, 0.2891, 0.2958, 0.3025, 0.3090, 0.3156, 0.3220, 0.3284, 0.3347, 0.3409, 0.3471, 0.3532, 0.3592, 0.3652, 0.3711, 0.3770

a = 1.50, 1.51, ..., 1.89:
1/ζ(a) = 0.3828, 0.3885, 0.3942, 0.3998, 0.4054, 0.4109, 0.4163, 0.4217, 0.4270, 0.4323, 0.4375, 0.4427, 0.4478, 0.4528, 0.4578, 0.4628, 0.4677, 0.4725, 0.4773, 0.4821, 0.4868, 0.4914, 0.4961, 0.5006, 0.5051, 0.5096, 0.5140, 0.5184, 0.5227, 0.5270, 0.5313, 0.5355, 0.5397, 0.5438, 0.5479, 0.5519, 0.5559, 0.5599, 0.5638, 0.5677

a = 1.90, 1.91, ..., 2.29:
1/ζ(a) = 0.5715, 0.5753, 0.5791, 0.5828, 0.5865, 0.5902, 0.5938, 0.5974, 0.6009, 0.6044, 0.6079, 0.6114, 0.6148, 0.6182, 0.6215, 0.6249, 0.6281, 0.6314, 0.6346, 0.6378, 0.6409, 0.6441, 0.6472, 0.6502, 0.6533, 0.6563, 0.6593, 0.6622, 0.6651, 0.6680, 0.6709, 0.6737, 0.6766, 0.6793, 0.6821, 0.6848, 0.6875, 0.6902, 0.6929, 0.6955

a = 2.30, 2.31, ..., 2.69:
1/ζ(a) = 0.6981, 0.7007, 0.7033, 0.7058, 0.7083, 0.7108, 0.7133, 0.7157, 0.7181, 0.7205, 0.7229, 0.7252, 0.7276, 0.7299, 0.7322, 0.7344, 0.7367, 0.7389, 0.7411, 0.7433, 0.7454, 0.7476, 0.7497, 0.7518, 0.7539, 0.7560, 0.7580, 0.7600, 0.7620, 0.7640, 0.7660, 0.7680, 0.7699, 0.7718, 0.7737, 0.7756, 0.7775, 0.7793, 0.7811, 0.7830

a = 2.70, 2.71, ..., 3.09:
1/ζ(a) = 0.7848, 0.7866, 0.7883, 0.7901, 0.7918, 0.7935, 0.7952, 0.7969, 0.7986, 0.8003, 0.8019, 0.8035, 0.8052, 0.8068, 0.8083, 0.8099, 0.8115, 0.8130, 0.8145, 0.8161, 0.8176, 0.8191, 0.8205, 0.8220, 0.8235, 0.8249, 0.8263, 0.8277, 0.8291, 0.8305, 0.8319, 0.8333, 0.8346, 0.8360, 0.8373, 0.8386, 0.8399, 0.8412, 0.8425, 0.8438

a = 3.10, 3.11, ..., 3.49:
1/ζ(a) = 0.8450, 0.8463, 0.8475, 0.8488, 0.8500, 0.8512, 0.8524, 0.8536, 0.8547, 0.8559, 0.8571, 0.8582, 0.8593, 0.8605, 0.8616, 0.8627, 0.8638, 0.8649, 0.8660, 0.8670, 0.8681, 0.8691, 0.8702, 0.8712, 0.8723, 0.8733, 0.8743, 0.8753, 0.8763, 0.8772, 0.8782, 0.8792, 0.8801, 0.8811, 0.8820, 0.8830, 0.8839, 0.8848, 0.8857, 0.8866
Table A.2. Comparison of I(f)(1) and K (both divided by T). Reprinted from Egghe (2004g), Table 2. Copyright John Wiley & Sons Limited. Reproduced with permission.

α        I(f)(1)/T = 1 − 1/2^(α−1)    K/T = 1/ζ(α)
1.5      0.2929                        0.3828
2        0.5                           6/π² = 0.6079
2.5      0.6464                        0.7454
3        0.75                          0.8319
3.5      0.8232                        0.8875
4        0.875                         90/π⁴ = 0.9239
Note also that I(f)(1) < K = f_d(1) in all cases (assuming that a = α and using (A.III.12), f_d(1) = K = T/ζ(a) and the fact that

$$\left(1 - \frac{1}{2^{a-1}}\right)\zeta(a) = \sum_{k=1}^{\infty}\frac{(-1)^{k-1}}{k^{a}} < 1,$$

as is readily seen; see also Table A.2 as an illustration), so that we have the following order in n = 1 (by the above discussions):

$$I(f)(1) < K = f_d(1) < T < C = f(1) \qquad (A.III.22)$$
In Egghe (2004g) it is also shown that f_d(n) < f(n) for all n ∈ ℕ, if we assume that a = α. This clearly shows the relation between the functions f, f_d and I(f). The graph in Fig. A.1 shows the relation between the curves of f, f_d and I(f).
Fig. A.1. Qualitative illustration of the relation between f, f_d and I(f). Reprinted from Egghe (2004g), Fig. 1. Copyright John Wiley & Sons Limited. Reproduced with permission.
A.III.4 Statistical determination of the parameters K, a and n_max in the discrete Lotka function f_d(n) = K/n^a, n = 1, ..., n_max
We note that finding the values of C, α and p_m in the continuous function

$$f(j) = \frac{C}{j^{\alpha}}, \qquad (A.III.1)$$

j ∈ [1, p_m], is explained in Subsection II.2.1.1 (for p_m = ∞) and in Subsection II.2.1.2 (for p_m < ∞), which is the mathematical equivalent of the fitting model of

$$f_d(n) = \frac{K}{n^{a}}, \qquad (A.III.2)$$

n = 1, ..., n_max, that we will explain here. See also Tague and Nicholls (1987) for similar results on the determination of the parameters of (A.III.1). For reasons of simplicity, we will denote f_d by f since, from now on, only this function (A.III.2) will be considered.
A.III.4.1 Quick and Dirty methods

Here we suppose n_max = ∞. Given A, the total number of items, and T, the total number of sources (hence A > T), we can use (A.III.8) for estimating a (by (A.III.14)):

$$a = \frac{2A-T}{A-T} \qquad (A.III.23)$$

K can be estimated as

$$K = f(1), \qquad (A.III.24)$$

the number of sources with 1 item (which can be read from the data). This method was used in Egghe and Rousseau (1990a) and in Rousseau (1990b). Tague (1988) and Tague and Nicholls (1987) suggest the use of the formula

$$a = \frac{\ln\left(\frac{f(1)}{f(2)}\right)}{\ln 2} \qquad (A.III.25)$$

where f(i) (i = 1, 2) denotes the number of sources with 1 or 2 items (which can be read from the data); this follows from (A.III.2), noting that (remember we now use the notation f for f_d) f(1) = K and f(2) = K/2^a. So (A.III.23) together with (A.III.24), or (A.III.25) together with (A.III.24), give an easy estimate of the parameters K and a.
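A small sketch of these estimators on hypothetical counts (all numbers below are invented for illustration):

```python
import math

A, T = 5000, 2000        # hypothetical totals: items and sources
f1, f2 = 1200, 320       # hypothetical numbers of sources with 1 and 2 items

a_from_totals = (2 * A - T) / (A - T)           # (A.III.23)
K = f1                                          # (A.III.24)
a_from_f1_f2 = math.log(f1 / f2) / math.log(2)  # (A.III.25)
print(a_from_totals, K, a_from_f1_f2)
```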
The above methods are "Quick and Dirty" since they are indeed quick but do not recognize that on the values of A, T, f(1) and f(2) one has measurement errors. Criticism on the use of (A.III.24) can be read in Griffith (1988). In favor of these methods is the fact that all values T, A, f(1), f(2) are rather stable, since they deal with the total number of sources and items and with the largest groups of sources, f(1) and f(2), namely the ones with 1 or 2 items.

More exact methods now follow.
A.III.4.2 Linear Least Squares method

The Linear Least Squares (LLS) method has been advocated by Pao (1982b, 1985, 1986), but was already used in the historic paper Lotka (1926), see also Tague and Nicholls (1987), and goes as follows. Formula (A.III.2) (we write f for f_d now) yields, when taking logarithms, for n = 1, 2, ..., n_max:

$$\ln(f(n)) = \ln K - a\ln n \qquad (A.III.26)$$

which is the equation of a straight line on a log-log scale. So, when we transform all our data to a log-log scale, we have solved the problem of determining a and K by applying linear regression analysis to these data. This gives (see any book on statistics or Egghe and Rousseau (1990a) for a derivation) the following formulae for a and K:

$$a = -\frac{n_{max}\sum_{n=1}^{n_{max}}(\ln n)\ln(f(n)) - \sum_{n=1}^{n_{max}}\ln n\,\sum_{n=1}^{n_{max}}\ln(f(n))}{n_{max}\sum_{n=1}^{n_{max}}(\ln n)^2 - \left(\sum_{n=1}^{n_{max}}\ln n\right)^2} \qquad (A.III.27)$$

$$\ln K = \overline{\ln(f(n))} + a\,\overline{\ln n} \qquad (A.III.28)$$

where $\overline{\ln(f(n))}$ denotes the average of the data values ln f(n) and $\overline{\ln n}$ denotes the average of the values ln n, n = 1, ..., n_max. Here we used the notation (n, f(n)) for the data points in our data set as well.
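In code, (A.III.27)-(A.III.28) amount to an ordinary least-squares fit on the log-log transformed points (a sketch on hypothetical data):

```python
import math

data = [(1, 1200), (2, 320), (3, 140), (4, 75), (5, 45), (6, 30)]  # hypothetical (n, f(n))
x = [math.log(n) for n, _ in data]
y = [math.log(f) for _, f in data]
m = len(data)

slope = ((m * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y))
         / (m * sum(xi * xi for xi in x) - sum(x) ** 2))
a = -slope                         # (A.III.27): the regression slope is -a
lnK = sum(y) / m + a * sum(x) / m  # (A.III.28)
print(a, math.exp(lnK))
```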
Note, however, that when applying the above method one is not guaranteed that

$$\sum_{n=1}^{n_{max}}\frac{K}{n^{a}} = T,$$

hence one lacks the property that f/T is a distribution, which is unacceptable since we then do not have a model anymore for the fraction of sources with n items for each n = 1, ..., n_max. See Sutter and Kocher (2001) and criticism on this work in Rousseau (2002a).
Therefore, one only uses (A.III.26) to determine a and one derives K, once a is known, via (A.III.24) or by the following (better) method. Since we must have

$$\sum_{n=1}^{n_{max}}\frac{K}{n^{a}} = T \qquad (A.III.29)$$

we have that K/T, hence K, is determined by

$$\frac{K}{T} = \frac{1}{\sum_{n=1}^{n_{max}}\frac{1}{n^{a}}} \qquad (A.III.30)$$

For n_max < +∞ (and not too high!) this can be calculated directly, and for n_max too high (or n_max = +∞) one can apply

$$\frac{K}{T} = \frac{1}{\zeta(a)} \qquad (A.III.31)$$

with 1/ζ(a) as in Table A.1. ζ(a) is not calculable directly, but one can use the following sharp approximating formula, proved by Pao and Singer, see Pao (1985) (a shorter proof being available in Berg and Wagner-Dobler (1996)):
$$\zeta(a) = \sum_{n=1}^{\infty}\frac{1}{n^{a}} \approx \sum_{n=1}^{P-1}\frac{1}{n^{a}} + \frac{1}{(a-1)\left(P-\frac{1}{2}\right)^{a-1}} - \frac{a}{24\left(P-\frac{1}{2}\right)^{a+1}} \qquad (A.III.32)$$

(P ∈ ℕ a fixed cut-off). In fact, the values in Table A.1 were calculated using (A.III.32), see Nicholls (1987), but (A.III.32) can also be used if one needs other values of a than the ones covered in Table A.1.
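A sketch of (A.III.32) as reconstructed here, checked against mpmath's ζ (the cut-off P = 20 is our illustrative choice):

```python
from mpmath import zeta

def zeta_pao(a, P=20):
    # (A.III.32) as reconstructed above
    s = sum(n ** -a for n in range(1, P))
    s += 1.0 / ((a - 1) * (P - 0.5) ** (a - 1))
    s -= a / (24.0 * (P - 0.5) ** (a + 1))
    return s

for a in (1.5, 2.0, 3.0):
    print(a, zeta_pao(a), float(zeta(a)))  # agreement to many decimals
```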
The above method was used in Alvarado (1999), Gupta and Karisiddippa (1999), Gupta, Sharma and Kumar (1998), Pao (1982b, 1985, 1986), Nicholls (1986, 1987), Tague and Nicholls (1987), Newby, Greenberg and Jones (2003) and Pulgarin and Gil-Leiva (2004), and is mentioned in Wilson (1999).
The most classical criticism on the LLS method is that, by using the logarithms of the data, one does not actually fit the data themselves and that deviations at one end of the graph are several orders of magnitude greater than at the other end, cf. Griffith (1988). A remedy could be to apply non-linear regression techniques to fit (A.III.2), but here we only have the reference Nelson (1989). However, also the non-linear least squares technique suffers from the fact that f(n)/T does not necessarily sum to 1 anymore.
The following technique, replacing LLS (but also using (A.III.31) or (A.III.32)), seems to be the best one to find the exponent a.
A.III.4.3 Maximum Likelihood Estimating method

The Maximum Likelihood Estimating (MLE) method has been advocated by Nicholls (1986, 1987, 1989), although the technique was already developed in 1952, see Seal (1952). The technique, used to determine the exponent a (in replacement of (A.III.27)), gives the following formula:

$$\frac{-\zeta'(a)}{\zeta(a)} = \frac{\sum_{n=1}^{n_{max}} f(n)\ln n}{\sum_{n=1}^{n_{max}} f(n)} \qquad (A.III.33)$$

where, again, ζ denotes the Riemann zeta function as defined above, and where we again used the notation (n, f(n)) for the data points in our data set. Formula (A.III.33) can only be solved numerically, e.g. using Table A.3 composed by Rousseau, see Rousseau (1993) (reproduced with permission). Using Table A.3 one can determine the value of the exponent a, given the value of −ζ'(a)/ζ(a) determined by (A.III.33).
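A sketch of solving (A.III.33) directly (same hypothetical data as in the earlier sketches; mpmath supplies ζ and a numerical derivative):

```python
import math
from mpmath import zeta, diff, findroot

data = [(1, 1200), (2, 320), (3, 140), (4, 75), (5, 45), (6, 30)]  # hypothetical (n, f(n))
rhs = sum(f * math.log(n) for n, f in data) / sum(f for _, f in data)

# Solve -zeta'(a)/zeta(a) = rhs (cf. Table A.3) for the exponent a.
a = findroot(lambda s: -diff(zeta, s) / zeta(s) - rhs, 2.0)
print(a)  # maximum likelihood estimate, around 2.4 for these counts
```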
The value of K is still determined by (A.III.30) and, for high n_max (or n_max = +∞), by (A.III.31) and (A.III.32).

This combined technique was programmed by Rousseau and Rousseau - see Rousseau and Rousseau (2000), the program being freely usable on the Web; since it incorporates the most advanced technique of fitting Lotka's law, we encourage the reader to make use of this very user-friendly package. We remark that this program was used to recalculate the parameters K and a of the data of Lotka himself in Lotka (1926), on which we reported in Chapter I.

This technique was used in Rousseau and Rousseau (2000), Rousseau (1997), Gupta and Karisiddippa (1999), Kinnucan and Wolfram (1990), Rousseau (2002a), Nicholls (1986, 1987, 1989) and Tague and Nicholls (1987), and is mentioned in Tague (1988) and, of course, in Wilson (1999), which is a very good review article, not only on the fitting of Lotka's law but on the complete "state of the art" of informetrics at the year 1999.
Table A.3. Values of −ζ'(a)/ζ(a) in function of a. Reprinted from Rousseau (1993), Table 1, p. 411-412. Reproduced with permission from Emerald Group Publishing Limited.

a = 1.20, 1.21, ..., 1.59:
−ζ'(a)/ζ(a) = 4.4583, 4.2219, 4.0071, 3.8112, 3.6317, 3.4666, 3.3144, 3.1736, 3.0430, 2.9214, 2.8081, 2.7022, 2.6029, 2.5098, 2.4223, 2.3398, 2.2620, 2.1885, 2.1189, 2.0529, 1.9904, 1.9309, 1.8744, 1.8205, 1.7691, 1.7201, 1.6733, 1.6285, 1.5857, 1.5446, 1.5052, 1.4675, 1.4312, 1.3963, 1.3628, 1.3306, 1.2995, 1.2696, 1.2408, 1.2129

a = 1.60, 1.61, ..., 1.99:
−ζ'(a)/ζ(a) = 1.1861, 1.1601, 1.1351, 1.1108, 1.0874, 1.0647, 1.0427, 1.0215, 1.0008, 0.9809, 0.9615, 0.9427, 0.9244, 0.9067, 0.8895, 0.8728, 0.8565, 0.8407, 0.8253, 0.8104, 0.7958, 0.7816, 0.7678, 0.7544, 0.7413, 0.7285, 0.7161, 0.7039, 0.6921, 0.6805, 0.6693, 0.6583, 0.6475, 0.6370, 0.6268, 0.6168, 0.6070, 0.5974, 0.5880, 0.5789

a = 2.00, 2.01, ..., 2.39:
−ζ'(a)/ζ(a) = 0.5700, 0.5612, 0.5527, 0.5443, 0.5361, 0.5281, 0.5202, 0.5125, 0.5050, 0.4976, 0.4904, 0.4833, 0.4763, 0.4695, 0.4629, 0.4563, 0.4499, 0.4436, 0.4374, 0.4314, 0.4254, 0.4196, 0.4139, 0.4083, 0.4028, 0.3974, 0.3920, 0.3868, 0.3817, 0.3767, 0.3717, 0.3669, 0.3621, 0.3574, 0.3528, 0.3483, 0.3438, 0.3395, 0.3352, 0.3309

a = 2.40, 2.41, ..., 2.69:
−ζ'(a)/ζ(a) = 0.3268, 0.3227, 0.3187, 0.3147, 0.3108, 0.3070, 0.3032, 0.2995, 0.2959, 0.2923, 0.2887, 0.2853, 0.2818, 0.2785, 0.2752, 0.2719, 0.2687, 0.2655, 0.2624, 0.2593, 0.2563, 0.2533, 0.2504, 0.2475, 0.2446, 0.2418, 0.2391, 0.2363, 0.2336, 0.2310

a = 2.70, 2.71, ..., 2.99:
−ζ'(a)/ζ(a) = 0.2284, 0.2258, 0.2233, 0.2208, 0.2183, 0.2159, 0.2135, 0.2111, 0.2088, 0.2065, 0.2042, 0.2020, 0.1998, 0.1976, 0.1955, 0.1934, 0.1913, 0.1892, 0.1872, 0.1852, 0.1832, 0.1813, 0.1793, 0.1774, 0.1756, 0.1737, 0.1719, 0.1701, 0.1683, 0.1666

a = 3.00, 3.01, ..., 3.29:
−ζ'(a)/ζ(a) = 0.1648, 0.1631, 0.1614, 0.1598, 0.1581, 0.1565, 0.1549, 0.1533, 0.1517, 0.1502, 0.1486, 0.1471, 0.1456, 0.1442, 0.1427, 0.1413, 0.1399, 0.1385, 0.1371, 0.1357, 0.1344, 0.1330, 0.1317, 0.1304, 0.1291, 0.1278, 0.1266, 0.1253, 0.1241, 0.1229
A.III.5 General remarks

A.III.5.1 Fitting Zipf's function

Since the function of Zipf is of the same type as the one of Lotka, one can use the same techniques as described above to fit Zipf's function. Of course, this fitting limits itself to the graph of Zipf's function and does not take into account the fact that the arguments in Zipf's function are ranks. For a study of this topic we refer to Tague and Nicholls (1987).
A.III.5.2 The estimation of p_m and n_max

So far we did not go into the issue of estimating the maximal item density p_m (in the continuous model (A.III.1)) or the maximal number of items per source, n_max (in the discrete model (A.III.2)). For p_m this is completely solved (mathematically) in Subsection II.2.1.2, given A > T. For n_max, the statistical estimate is discussed in Tague and Nicholls (1987). It turns out that, if the sample size is high enough, we can use the "sample" n_max for the population n_max (to use in the model) and that the maximum likelihood estimate for the population n_max is its sample value (Tague (1988), Tague and Nicholls (1987)).

One can also put p_m or n_max equal to +∞, hereby simplifying the 3-parameter model to a 2-parameter one. Both types of models are compared in Kinnucan and Wolfram (1990).
A.III.5.3 Fitting derived functions such as Price's law

The fitting techniques described here allow us to use the results developed in this book. In this way we can check the exact form of derived results (e.g. the type of law of Price that is valid, given parameters C, α and p_m - see Chapter IV). There is, therefore, no need to perform separate tests for these derived results, as was done e.g. in Berg and Wagner-Dobler (1996), Gupta, Sharma and Kumar (1998) and Nicholls (1988). That Price's law fails (in the latter article) because of its relation to "Lotka's law, which is considered as an inverse square law rather than as a generalized model taking variable parameter values" is, in view of Corollary IV.3.3.2, a correct statement: Price's law is equivalent with Lotka's law with α = 2 and p_m = C, so a very special case of Lotka's law (allowing no parameter freedom). In Geisler (2000), p. 173, note 22, this is rephrased in a wrong way: Geisler only refers to "an inverse relationship", which is meaningless.

If one considers Zipf's law as a derived function of Lotka's law (cf. Chapter II) one could, in view of the remarks made above on Price's law, also state that a fit of Lotka's law suffices and that, from the obtained parameters, Zipf's law can be reconstructed. This argument is also given in Urzua (2000) (but notice that what is called "Pareto's law" there is the (continuous) informetric definition of Lotka's law, the functional relation being given by formula (II.13)).
A.III.5.4 Goodness-of-fit tests

We do not go into the diverse techniques to test the quality of the fit, i.e. how close the calculated model is to the data. Techniques such as χ² or Kolmogorov-Smirnov can be read in books on general statistics. We even dare to say that, for applying the results of this book on Lotkaian informetrics, we assume that a power law applies for the size-frequency function or the rank-frequency function and we just want to know the "best" parameters C, α (and, if p_m cannot be taken as +∞, p_m). It is very well possible, due to the stability of the mathematical results in this book, that a Lotkaian model with the calculated parameters does not fit in the statistical sense but that the model, nevertheless, can be used to see what the derived properties are. Also, in many cases, it even suffices to see whether α < 2 or α > 2 in order to know the form of the derived property (e.g. concave or S-shaped first-citation distribution - see Subsection VII.5; Groos droop or not - see Fig. II.1; and so on). Even in the case of a non-decreasing size-frequency function (such as in some cases of numbers of authors per paper) we can use a Lotkaian function as a first approximation and use it for the derived properties. Also in this case a statistical fit does not help us, which is also the case when a non-power-law decreasing model can be fitted to the data but where a power law still can be used as a mathematical model due to general considerations.
BIBLIOGRAPHY

Adamic, L.A. and B.A. Huberman (2001). The Web's hidden order. Communications of the ACM, 44(9), 55-60.
Adamic, L.A. and B. Huberman (2002). Zipf's law and the Internet. Glottometrics, 3, 143-150.
Adamic, L.A., R.M. Lukose, A.R. Puniyani and B.A. Huberman (2001). Search in power-law networks. Physical Review E, 64, 46135-46143.
Aida, M., N. Takahashi and T. Abe (1998). A proposal of dual Zipfian model for describing HTTP access trends and its application to address cache design. IEICE Transactions on Communication, E81-B(7), 1475-1485.
Ajiferuke, I. (1991). A probabilistic model for the distribution of authorships. Journal of the American Society for Information Science, 42(4), 279-289.
Albert, R., H. Jeong and A.-L. Barabasi (1999). Diameter of the World-Wide Web. Nature, 401, 130-131.
Allison, P.D., D. De Solla Price, B.C. Griffith, M.J. Moravcsik and J.A. Stewart (1976). Lotka's law: a problem in its interpretation and application. Social Studies of Science, 6, 269-276.
Alvarado, R.U. (1999). La ley de Lotka y la literatura de bibliometria. Investigación Bibliotecológica, 13(27), 125-141.
Anderson, R.B. and R.D. Tweney (1997). Artifactual power curves in forgetting. Memory and Cognition, 25(5), 724-730.
Apostol, T.M. (1957). Mathematical Analysis. A modern Approach to advanced Calculus. Addison-Wesley, Reading (MA), USA.
Arapov, M.V. (1982). A variational approach to frequency-rank distributions of text elements. In: Studies on Zipf's Law (H. Guiter and M.V. Arapov, eds.). Quantitative Linguistics, Vol. 16, 29-52, Studienverlag Dr. N. Brockmeyer, Bochum, Germany.
Atkinson, A.B. (1970). On the measurement of inequality. Journal of Economic Theory, 2, 244-263.
Axtell, R.L. (2001). Zipf distribution of U.S. firm sizes. Science, 293, 1818-1820.
Baayen, R.H. (2001). Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht, the Netherlands.
Balasubrahmanyan, V.K. and S. Naranan (2002). Algorithmic information, complexity and Zipf's law. Glottometrics, 4, 1-26.
Barabasi, A.-L. and R. Albert (1999). Emergence of scaling in random networks. Science, 286, 509-512.
Barabasi, A.-L., R. Albert and H. Jeong (2000). Scale-free characteristics of random networks: the topology of the world-wide web. Physica A, 281, 69-77.
Barabasi, A.-L., H. Jeong, Z. Neda, E. Ravasz, A. Schubert and T. Vicsek (2002). Evolution of the social network of scientific collaborations. Physica A, 311, 590-614.
Bashkirov, A.G. and A.V. Vityazev (2000). Information entropy and power-law distributions for chaotic systems. Physica A, 277, 136-145.
Batty, M. (2003). The emergence of cities: complexity and urban dynamics. http://www.casa.ucl.ac.uk/working_papers/paper64.pdf
Beckman (1999). http://casper.beckman.uiuc.edu/~c-tsai4/chinese/charfreq.html
Benford, F. (1938). The law of anomalous numbers. Proceedings of the American Mathematical Society, 78, 551-572.
Bensman, S.J. (1982). Bibliometric laws and library usage as social phenomena. Library Research, 4, 279-312.
Berg, J. and R. Wagner-Dobler (1996). A multidimensional analysis of scientific dynamics. Part I. Case studies of mathematical logic in the 20th century. Scientometrics, 35(3), 321-346.
Bilke, S. and C. Peterson (2001). Topological properties of citation and metabolic networks. Physical Review E, 6403(3), 76-80.
Blackert, L. and S. Siegel (1979). Ist in der wissenschaftlich-technischen Information Platz für die Informetrie? Wissenschaftliche Zeitschrift TH Ilmenau, 25(6), 187-199.
Blair, D.C. (1990). Language and Representation in Information Retrieval. Elsevier, Amsterdam, the Netherlands.
Blom, G. (1989). Probability and Statistics. Theory and Applications. Springer-Verlag, New York, USA.
Bogaert, J., R. Rousseau and P. Van Hecke (2000). Percolation as a model for informetric distributions: fragment size distribution characterised by Bradford curves. Scientometrics, 47(2), 195-206.
Bollobas, B. (1985). Random Graphs. Academic Press, London, UK.
Bonitz, M. (1982). Scientometrie, Bibliometrie, Informetrie. Zentralblatt für Bibliothekswesen, 96, 19-24.
Bookstein, A. (1976). The bibliometric distributions. Library Quarterly, 46(4), 416-423.
Bookstein, A. (1977). Patterns of scientific productivity and social change: a discussion of Lotka's law and bibliometric symmetry. Journal of the American Society for Information Science, 28, 206-210.
Bookstein, A. (1979). Explanations of the bibliometric laws. Collection Management, 3(2/3), 151-162.
Bookstein, A. (1984). Robustness properties of the bibliometric distributions. Unpublished manuscript. This paper was partially published in Bookstein (1990a).
Bookstein, A. (1990a). Informetric distributions, Part I: Unified overview. Journal of the American Society for Information Science, 41(5), 368-375.
Bookstein, A. (1990b). Informetric distributions, Part II: Resilience to ambiguity. Journal of the American Society for Information Science, 41(5), 376-386.
Bookstein, A. (1991). Theoretical properties of the informetric distributions: some open questions. In: Proceedings of the third International Conference on Informetrics (I.K. Ravichandra Rao, ed.), 17-35, Sarada Ranganathan Endowment for Library Science, Bangalore, India.
Bookstein, A. (2001). Implications of ambiguity for scientometric measurement. Journal of the American Society for Information Science and Technology, 52(1), 74-79.
Booth, A.D. (1967). A law of occurrences for words of low frequency. Information and Control, 10, 386-393.
Bornholdt, S. and H. Ebel (2001). World Wide Web scaling exponent from Simon's 1955 model. Physical Review E, 64(3), 035104R (4 pages).
Bradford, S.C. (1934). Sources of information on specific subjects. Engineering, 137, 85-86. Reprinted in: Collection Management, 1, 95-103, 1976-1977. Also reprinted in: Journal of Information Science, 10, 148 (facsimile of the first page) and 176-180, 1985.
Broadus, R.N. (1987). Early approaches to bibliometrics. Journal of the American Society for Information Science, 38(2), 127-129.
Brookes, B.C. (1973). Numerical methods of bibliographical analysis. Library Trends, 22, 18-43.
Brookes, B.C. (1981). The foundations of information science. Part IV. Information science: the changing paradigm. Journal of Information Science, 3, 3-12.
Brookes, B.C. (1983). The empirical law of natural categorization. Journal of Information Science, 6, 147-157.
Brookes, B.C. (1984a). Towards informetrics: Haitun, Laplace, Zipf, Bradford and the Alvey programme. Journal of Documentation, 40(2), 120-143.
Brookes, B.C. (1984b). Ranking techniques and the empirical log law. Information Processing and Management, 20(1-2), 37-46.
Brookes, B.C. (1990). Biblio-, sciento-, infor-metrics??? What are we talking about? In: Informetrics 89/90. Proceedings of the second international Conference on Bibliometrics, Scientometrics and Informetrics (L. Egghe and R. Rousseau, eds.), 31-43, Elsevier, Amsterdam, the Netherlands.
Brookes, B.C. and J.M. Griffiths (1978). Frequency-rank distributions. Journal of the American Society for Information Science, 29, 5-13.
Brown, J.H. and G.B. West (eds.) (2000). Scaling in Biology. Oxford University Press, Oxford, UK.
Buckland, M.K. (1972). Are obsolescence and scattering related? Journal of Documentation, 28(3), 242-246.
Buckland, M.K. and A. Hindle (1969). Library Zipf. Journal of Documentation, 25(1), 52-57.
Burrell, Q.L. (1985). The 80/20 rule: library lore or statistical law? Journal of Documentation, 41(1), 24-39.
Burrell, Q.L. (1992a). A note on a result of Rousseau for concentration measures. Journal of the American Society for Information Science, 43(6), 452-454.
Burrell, Q.L. (1992b). The Gini index and the Leimkuhler curve for bibliometric processes. Information Processing and Management, 28(1), 19-33.
Burrell, Q.L. (1992c). A simple model for linked informetric processes. Information Processing and Management, 28, 637-645.
Burrell, Q. and R. Rousseau (1995). Fractional counts for authorship attribution: a numerical study. Journal of the American Society for Information Science, 46(2), 97-102.
Carpenter, M.P. (1979). Similarity of Pratt's measure of class concentration to the Gini index. Journal of the American Society for Information Science, 30, 108-110.
Cavnar, W.B. and J.M. Trenkle (1994). N-gram-based text categorization. In: Proceedings of the third Annual Symposium on Document Analysis and Information Retrieval, 161-175, University of Las Vegas, USA.
Chen, Y.-S. (1989). Analysis of Lotka's law: the Simon-Yule approach. Information Processing and Management, 25(5), 527-544.
Chen, Y.-S., P.P. Chong and M.Y. Tong (1994). The Simon-Yule approach to bibliometric modeling. Information Processing and Management, 30(4), 535-556.
Chen, Y.-S. and F.F. Leimkuhler (1986). A relationship between Lotka's law, Bradford's law, and Zipf's law. Journal of the American Society for Information Science, 37(5), 307-314.
Chen, Y.-S. and F.F. Leimkuhler (1989). A type-token identity in the Simon-Yule model of text. Journal of the American Society for Information Science, 40(1), 45-53.
Chen, Y.-S. and F.F. Leimkuhler (1990). Booth's law of word frequency. Journal of the American Society for Information Science, 41(5), 387-388.
Chow, Y.S. and H. Teicher (1978). Probability Theory. Independence, Interchangeability, Martingales. Springer-Verlag, New York, USA.
Chung, K.L. (1974). A Course in Probability Theory. Academic Press, New York, USA.
Cohen, J.D. (1995). Highlights: language- and domain-independent automatic indexing terms for abstracting. Journal of the American Society for Information Science, 46(3), 162-174.
Coile, R.C. (1975). Letter to the editor. Journal of the American Society for Information Science, 26, 133-134.
Coile, R.C. (1977). Lotka's frequency distribution of scientific productivity. Journal of the American Society for Information Science, 28, 366-370.
Cole, J.R. and S. Cole (1973). Social Stratification in Science. The University of Chicago Press, Chicago, USA.
Coleman, S.R. (1992). The laboratory as a productivity and citation unit in the publications of an experimental-psychology specialty. Journal of the American Society for Information Science, 43(9), 639-643.
Condon, E.U. (1928). Statistics of vocabulary. Science, 67(1733), 300.
Cook, K.L. (1989). Laws of scattering applied to popular music. Journal of the American Society for Information Science, 40(4), 277-283.
Cunningham, S.J. and S.M. Dillon (1997). Authorship patterns in information systems. Scientometrics, 39(1), 19-27.
Dalton, H. (1920). The measurement of the inequality of incomes. The Economic Journal, 30, 248-361.
Damashek, M. (1995). Gauging similarity with N-grams: language-independent categorization of text. Science, 267 (10 February 1995), 843-848.
De Solla Price, D. (1963). Little Science, big Science. Columbia University Press, New York, USA.
De Solla Price, D. (1971). Some remarks on elitism in information and the invisible college phenomenon in science. Journal of the American Society for Information Science, 22, 74-75.
De Solla Price, D. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27, 292-306.
De Solla Price, D. (1978). Cumulative advantage urn games explained: a reply to Kantor. Journal of the American Society for Information Science, 29(4), 204-206.
De Solla Price, D. (1981). Letter to the editor. Science, 212, 987.
Dierick, J.C.J. (1992). Determining the Lotka parameters by sampling. Scientometrics, 25(1), 115-148.
Egghe, L. (1984). Stopping Time Techniques for Analysts and Probabilists. London Mathematical Society Lecture Notes Series 100. Cambridge University Press, Cambridge, UK.
Egghe, L. (1985). Consequences of Lotka's law for the law of Bradford. Journal of Documentation, 41(3), 173-189.
Egghe, L. (1986). On the 80/20-rule. Scientometrics, 10(1-2), 55-68.
Egghe, L. (1987a). An exact calculation of Price's law for the law of Lotka. Scientometrics, 11(1-2), 81-97.
Egghe, L. (1987b). Pratt's measure for some bibliometric distributions and its relation with the 80/20-rule. Journal of the American Society for Information Science, 38(4), 288-297.
Egghe, L. (1988). On the classification of the classical bibliometric laws. Journal of Documentation, 44(1), 53-62.
Egghe, L. (1989). The Duality of Informetric Systems with Applications to the empirical Laws. Ph.D. Thesis, City University, London (UK).
Egghe, L. (1990a). The duality of informetric systems with applications to the empirical laws. Journal of Information Science, 16(1), 17-27.
Egghe, L. (1990b). New Bradfordian laws equivalent with old Lotka laws, evolving from a source-item duality argument. In: Informetrics 89/90. Proceedings of the second
international Conference on Bibliometrics, Scientometrics and Informetrics (L. Egghe and R. Rousseau, eds.), 79-96, Elsevier, Amsterdam, the Netherlands.
Egghe, L. (1991). The exact place of Zipf's and Pareto's law amongst the classical informetric laws. Scientometrics, 20(1), 93-106.
Egghe, L. (1992). Theory of search keys and applications to retrieval techniques used by catalogers. Mathematical and Computer Modelling, 16(4), 69-90.
Egghe, L. (1993a). Exact probabilistic and mathematical proofs of the relation between the mean μ and the generalized 80/20 rule. Journal of the American Society for Information Science, 44(7), 369-375.
Egghe, L. (1993b). Consequences of Lotka's law in the case of fractional counting of authorship and of first author counts. Mathematical and Computer Modelling, 18(9), 63-77.
Egghe, L. (1994a). A theory of continuous rates and applications to the theory of growth and obsolescence rates. Information Processing and Management, 30(2), 279-292.
Egghe, L. (1994b). Special features of the author-publication relationship and a new explanation of Lotka's law based on convolution theory. Journal of the American Society for Information Science, 45(6), 422-427. Translated into Chinese in Information Theory and Activity, 19(4), 20-22, 1996.
Egghe, L. (1995). Extension of the general "success breeds success" principle to the case that items can have multiple sources. In: Proceedings of the fifth biennial Conference of the international Society for Scientometrics and Informetrics (M. Koenig and A. Bookstein, eds.), 147-156, Learned Information, Medford (NJ), USA.
Egghe, L. (1996). Source-item production laws for the case that items have multiple sources with fractional counting of credits. Journal of the American Society for Information Science, 47(10), 730-748.
Egghe, L. (1997). Price index and its relation to the mean and median reference age. Journal of the American Society for Information Science, 48(6), 564-573.
Egghe, L. (1999a). On the law of Zipf-Mandelbrot for multi-word phrases. Journal of the American Society for Information Science, 50(3), 233-241.
Egghe, L. (1999b). An explanation of the relation between the fraction of multinational publications and the fractional score of a country. Scientometrics, 45(2), 291-310.
Egghe, L. (2000a). The distribution of N-grams. Scientometrics, 47(2), 237-252.
404
Power laws in the information production process: Lotkaian informetrics
Egghe, L. (2000b). General study of the distribution of N-tuples of letters or words based on the distributions of the single letters or words. Mathematical and Computer Modelling, 31, 35-41. Egghe, L. (2000c). A heuristic study of the first-citation distribution. Scientometrics, 48(3), 345359. Egghe, L. (2000d). New informetric aspects of the Internet: some reflections - many problems. Journal of Information Science, 26(5), 329-335. Egghe, L. (2001). A non-informetric analysis of the relationship between citation age and journal productivity. Journal of the American Societyfor Information Science and Technology, 52(5), 371-377. Egghe, L. (2002a). Construction of concentration measures for general Lorenz curves using Riemann-Stieltjes integrals. Mathematical and Computer Modelling, 35, 1149-1163. Egghe, L. (2002b). Development of hierarchy theory for digraphs using concentration theory based on anew type of Lorenz curve. Mathematical and Computer Modelling, 36,587602. Egghe, L. (2002c). Sampling and concentration values of incomplete bibliographies. Journal of the American Society for Information Science and Technology, 53(4), 271-281. Egghe, L. (2003). Type/Token-Taken informetrics. Journal of the American Society for Information Science and Technology, 54(7), 603-610. Egghe, L. (2004a). Solution of a problem of Buckland on the implications of obsolescence on scattering. Scientometrics, 59(2), 225-232. Egghe, L. (2004b). The source-item coverage of the Lotka function. Scientometrics, 61(1), 103115. Egghe, L. (2004c). Zipfian and Lotkaian continuous concentration theory. Journal of the American Society for Information Science and Technology, to appear. Egghe, L. (2004d). The power of power laws and the interpretation of Lotkaian informetric systems as self-similar fractals. Journal of the American Societyfor Information Science and Technology, to appear. Egghe, L. (2004e). Positive reinforcement and three-dimensional informetrics. Scientometrics, 60(3), 497-509. Correction. Scientometrics, 61(2), 283, 2004.
Bibliography
405
Egghe, L. (2004f). The exact rank-frequency function and size-frequency function of N-grams and N-word phrases with applications. Mathematical and Computer Modelling, to appear. Egghe, L. (2004g). Relations between the continuous Lotka function and the discrete Lotka function. Journal of the American Society for Information Science and Technology, to appear. Egghe, L. (2004h). A characterization of the law of Lotka in terms of sampling. Preprint. Egghe, L. and T. Lafouge (2004). On the relation between the Maximum Entropy Principle and the Principle of Least Effort. Mathematical and Computer Modelling, to appear. Egghe, L., L. Liang and R. Rousseau (2003). The byline: thoughts on the distribution of author ranks in multi-authored papers. Mathematical and Computer Modelling, 38(3), 323-329. Egghe, L. and I.K.R. Rao (1992a). Citation age data and the obsolescence function : fits and explanations. Information Processing and Management, 28(2), 201-217. Egghe, L. and I.K.R. Rao (1992b).Classification of growthmodels based on growth rates and its applications. Scientometrics, 25(1), 5-46. Egghe, L. and I.K.R. Rao (2001). Theory of first-citation distributions and applications. Mathematical and Computer Modelling, 34(1-2), 81-90. Egghe, L. and I.K.R. Rao (2002a). Duality revisited: Construction of fractional frequency distributions based on two dual Lotka laws. Journal of the American Society for Information Science and Technology, 53(10), 789-801. Egghe, L. and I.K.R. Rao (2002b). Theory and experimentation on the most-recent-reference distribution. Scientometrics, 53(3), 371-387. Egghe, L. and R. Rousseau (1986). A characterization of distributions which satisfy Price's law and consequences for the law of Zipf and Mandelbrot. Journal of Information Science, 12, 193-197. Egghe, L. and R. Rousseau (eds.) (1988). Informetrics 87/88. Proceedings of the First International Conference on Bibliometrics and Theoretical Aspects of Information Retrieval. Elsevier, Amsterdam, the Netherlands. Egghe, L. and R. Rousseau (1990a). Introduction to Informetrics. Quantitative Methods in Library, Documentation and Information Science. Elsevier, Amsterdam, the Netherlands.
406
Power laws in the information production process: Lotkaian informetrics
Egghe, L. and R. Rousseau (eds.) (1990b). Informetrics 89/90. Proceedings of the Second International Conference on Bibliometrics, Scientometrics and Informetrics. Elsevier, Amsterdam, the Netherlands. Egghe, L. and R. Rousseau (1990c). Elements of concentration theory. In: Informetrics 89/90. Proceedings of the second international Conference on Bibliometrics, Scientometrics and Informetrics (L. Egghe and R. Rousseau, eds.), 97-137, Elsevier, Amsterdam, the Netherlands. Egghe, L. and R. Rousseau (1991). Transfer principles and a classification of concentration measures. Journal of the American Society for Information Science, 42(7), 479-489. Egghe, L. and R. Rousseau (1995). Generalized success - breeds - success principle leading to time-dependent informetric distributions. Journal of the American Society for Information Science, 46(6), 426-445. Egghe, L. and R. Rousseau (1996a). Stochastic processes determined by a general success breeds - success principle. Mathematical and Computer Modelling, 23(4), 93-104. Egghe, L. and R. Rousseau (1996b). Average and global impact of a set of journals. Scientometrics, 36(1), 97-107. Egghe, L. and R. Rousseau (1996c). Modelling multi-relational data with special attention to the average number of collaborators as a variable in informetric distributions. Information Processing and Management, 32(5), 563-571. Egghe, L. and R. Rousseau (1996d). Averaging and globalising quotients of informetric and scientometric data. Journal of Information Science, 22(3), 165-170. Egghe, L. andR. Rousseau (2001). Elementary Statistics for effective Library and Information Service Management. Europa Publications, London, UK. Egghe, L. and R. Rousseau (2002). Co-citation, bibliographic coupling and a characterization of lattice citation networks. Scientometrics, 55(3), 349-361. Egghe, L. and R. Rousseau (2003a). BRS-compactness in networks : theoretical considerations related to cohesion in citation graphs, collaboration networks and the Internet. Mathematical and Computer Modelling, 37(7), 879-899. Egghe, L. and R. Rousseau (2003b). Size-frequency and rank-frequency relations, power laws and exponentials: a unified approach. Progress in Natural Science, 13(6), 478-480. Egghe, L. and R. Rousseau (2004). A local hierarchy theory for acyclic digraphs. Mathematical and Computer Modelling, 39, 107-117.
Bibliography
407
Egghe, L., R. Rousseau and G. Van Hooydonk (2000). Methods for accrediting publications to authors or countries: consequences for evaluation studies. Journal of the American Society for Information Science, 51(2), 145-157. Erdo's, P. and A. Renyi (1960). On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5, 17-61. Estoup, J.B. (1916). Gammes Stenographiques. 4th edition, Institut Stenographique, Paris. Fairthorne, R.A. (1969). Empirical hyperbolic distributions (Bradford-Zipf-Mandelbrot) for bibliometric description and prediction. Journal of Documentation, 25(4), 319-343. Falconer, K. (1990). Fractal Geometry. Mathematical Foundations and Applications. J. Wiley, Chichester, UK. Faloutsos, M., P. Faloutsos and C. Faloutsos (1999). On power-law relationships of the Internet topology. Computer Communication Review, 29, 251-263. Fechner, G.T. (1860). Elemente der Psychophysik. Breitkopf und Hartel, Leipzig, Germany. Feder, J. (1988). Fractals. Plenum, New York, USA. Fedorowicz, J. (1982a). A Zipfian model of an automatic bibliographic system: an application of MEDLINE. Journal of the American Society for Information Science, 33(4), 223-232. Fedorowicz, J. (1982b). The theoretical foundation of Zipfs law and its application to the bibliographic database environment. Journal of the American Society for Information Science, 33(5), 285-293. Feger, H. (2000). Co-authorship patterns in work reports. Paper presented at the second COLLNET Workshop on Scientometrics and Informetrics, September 1 -4,2000, Hohen Neuendorf. Feldman, D. and M. Fox (1991). Probability. The Mathematics of Uncertainty. Marcel Dekker, New York, USA. Feller, W. (1948). An Introduction to Probability Theory and its Applications. Vol. II. J. Wiley, New York, USA. Feller, W. (1968). An Introduction to Probability Theory and its Applications. Vol. I (third ed.). J. Wiley, New York, USA. Fellman, J. (1976). The effect of transformations on Lorenz curves. Econometrica, 44(4), 823824. Fox, M.F. (1983). Publication productivity among scientists: A critical review. Social Studies of Science, 13,285-305.
408
Power laws in the information production process: Lotkaian informetrics
Gastwirth, J.L. (1971). A general definition of the Lorenz curve. Econometrica, 39(6), 10371039. Gastwirth, J.L. (1972). The estimation of the Lorenz curve and Gini index. The Review of Economics and Statistics, 54(3), 306-316. Geisler, E. (2000). The Metrics of Science and Technology. Quorum Books, Westport (CT), USA. Gini, C. (1909). II diverso accrescimento delle classi sociali e la concentrazione della richezza. Giornale degli Economisti, serie 11, 37. Glanzel, W. (1992). On some stopping times of citation processes. From theory to indicators. Information Processing and Management, 28, 53-60. Glanzel, W. (1996). The need for standards in bibliometric research and technology. Scientometrics, 35(2), 167-176. Glanzel, W. and U. Schoepflin (1995). A bibliometric study on ageing and reception processes of scientific literature. Journal of Information Science, 21, 37-53. Glanzel, W. and A. Schubert (1985). Price distribution. An exact formulation of Price's "square root law". Scientometrics, 7(3-6), 211-219. Gleitman, H. (1981). Basic Psychology. Norton, New York. Gradshteyn, I.S. and I.M. Ryzhik (1965). Table of Integrals, Series and Products. Academic Press, New York, USA. Greenberg, MJ. (1974). Euclidean and Non-Euclidean Geometries. Development and History. Freeman, San Francisco, USA. Griffith, B.C. (1988). Exact fits in bibliometrics: some tools and results. In: Informetrics 87/88. Proceedings of the first International Conference on Bibliometrics and Theoretical Aspects of Information Retrieval (L. Egghe and R. Rousseau, eds.), 85-95, Elsevier, Amsterdam, the Netherlands. Grimmett, G.R. andD.R. Stirzaker(1985). Probability and random Processes. Clarendon Press, Oxford, UK. Groos, O.V. (1967). Bradford's law and the Keenan-Atherton data. American Documentation, 18, 46. Grossman, D.A. and O. Frieder (1998). Information Retrieval. Algorithms and Heuristics. Kluwer Academic Publishers, Dordrecht, the Netherlands.
Bibliography
409
Gupta, D.K. (1989). Scientometric study of biochemical literature of Nigeria, 1970-1984: application of Lotka's law and the 80/20-rule. Scientometrics, 15(3-4), 171-179. Gupta, B.M. and C.R. Karisiddippa (1996). Author productivity patterns in theoretical population genetics (1900-1980). Scientometrics, 36(1), 19-41. Gupta, B.M. and C.R. Karisiddippa (1999). Collaboration and author productivity: a study with a new variable in Lotka's law. Scientometrics, 44(1), 129-134. Gupta, B.M. and S. Kumar (1998). Scientific productivity in theoretical population genetics: a case study in core journals. Library Science with a Slant to Documentation and Information Studies, 35(2), 89-97. Gupta, B.M., S. Kumar and B.S. Aggarwal (1999). A comparison of productivity of male and female scientists of CSIR. Scientometrics, 45(2), 269-289. Gupta, B.M., S. Kumar and R. Rousseau (1998). Applicability of selected probability distributions to the number of authors per article in theoretical population genetics. Scientometrics, 42(3), 325-334. Gupta, B.M. and R. Rousseau (1999). Further investigations into the first-citation process: the case of population genetics. Libres, 9(2), aztec.lib.utk.edu/libres/libre9n2/fc.htm. Gupta, B.M., L. Sharma and S. Kumar (1998). Literature growth and author productivity patterns in Indian physics. Information Processing and Management, 34(1), 121-131. Haitun, S.D. (1982a). Stationary scientometric distributions. Part I. Different approximations. Scientometrics, 4(1), 5-25. Haitun, S.D. (1982b). Stationary scientometric distributions. Part II. Non-Gaussian nature of scientific activities. Scientometrics, 4(2), 89-104. Haitun, S.D. (1982c). Stationary scientometric distributions. Part III. The role of the Zipf distribution. Scientometrics, 4(3), 181-194. Hardy, G., J.E. Littlewood and G. Polya (1928). Some simple inequalities satisfied by convex functions. Messenger of Mathematics, 58, 145-152. Hardy, G., J.E. Littlewood and G. Polya (1952). Inequalities. Cambridge University Press, Cambridge (UK). Harsanyi, M.A. (1993). Multiple authors, multiple problems - Bibliometrics and the study of scholarly collaboration: a literature review. Library and Information Science Research, 15, 325-354.
410
Power laws in the information production process: Lotkaian informetrics
Heaps, H.S. (1978). Information Retrieval: computational and theoretical Aspects. Academic Press, New York, USA. Heine, M.H. (1998). Bradford ranking conventions and their application to a growing literature. Journal of Documentation, 54(3), 303-331. Herdan, G. (1960). Type-Token Mathematics. A Textbook of mathematical Linguistics. Mouton, 's Gravenhage, the Netherlands. Hertzel D.H. (1985). History of the development of ideas in bibliometrics. Statistical bibliography or bibliometrics. In: Encyclopedia of Library and Information Science (A. Kent, ed.), 42(7), 144-219. Hjerppe, R. (1980). A Bibliography of Bibliometrics and Citation Index & Analysis. Royal Institute of Technology Library, Stockholm, Sweden. Hoen, W.P., H.C. Walvoort and A.J.P.M. Overbeke (1998). What are the factors determining authorship and the order of the authors' names? Journal of the American Medical Association, 280(3), 217-218. Huber, J.C. (1998a). Invention and inventivity is a random, Poisson process: a potential guide to analysis of general creativity. Creativity Research Journal, 11(3), 231-241. Huber, J.C. (1998b). Cumulative advantage and success-breeds-success: the value of time pattern analysis. Journal of the American Society for Information Science, 49(5), 471-476. Huber, J.C. (1998c). The underlying process generating Lotka's law and the statistics of exceedances. Information Processing and Management, 34(4), 471-487. Huber, J.C. (1999). Inventive productivity and the statistics of exceedances. Scientometrics, 45(1), 33-53. Huber, J.C. (2001). A new method for analyzing scientific productivity. Journal of the American Society for Information Science and Technology, 52(13), 1089-1099. Huber, J.C. (2002). A new model that generates Lotka's law. Journal of the American Society for Information Science and Technology, 53(3), 209-219. Huber, J.C. and R. Wagner-Dobler (2001a). Scientific production: a statistical analysis of authors in mathematical logic. Scientometrics, 50(2), 323-337. Huber, J.C. and R. Wagner-Dobler (2001b). Scientific production: a statistical analysis of authors in physics, 1800-1900. Scientometrics, 50(3), 437-453. Huberman, B. A. (2001). The Laws of the Web. Patterns in the Ecology of Information. The MIT Press, Cambridge (MA), USA.
Bibliography
411
Huberman, B.A., P.L.T. Pirolli, J.E. Pitkow and R.M. Lukose (1998). Strong regularities in World Wide Web surfing. Science, 280, 95-97. Hubert, J.J. (1976). On the Naranan interpretation of Bradford's law. Journal of the American Society for Information Science, 27, 339-341. Hubert, J.J. (1978). A relationship between two forms of Bradford's law. Journal of the American Society for Information Science, 29, 159-161. Ikpaahindi, L. (1985). An overview of bibliometrics: its measurements, laws and their applications. Libri, 35(2), 163-177. Ioannides, Y.M. and H.G. Overman (2003). Zipf s law for cities: an empirical examination. Regional Science and Urban Economics, 33, 127-137. Ivancheva, L.E. (2001). The non-Gaussian nature of bibliometric and scientometric distributions: a new approach to interpretation. Journal of the American Society for Information Science and Technology, 52(13), 1100-1105. Jaynes, E.T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620630. Jeong, H., B. Tombor, R. Albert, Z.N. Ottval and A.-L. Barabasi (2000). The large-scale organization of metabolic networks. Nature, 407, 651-654. Jones, D.S. (1979). Elementary Information Theory. Clarendon Press, Oxford, UK. Kanter, I. and D.A. Kessler (1995). Markov processes: linguistics and Zipfs law. Physical Review Letters, 74(22), 4559-4562. Kantor, P.B. (1978). A note on cumulative advantage distributions. Journal of the American Society for Information Science, 29(4), 202-204. Kapur, J.N. (1992). Application of generalised maximum-entropy principle to population dynamics, innovation diffusion models and chemical reactions. Journal of Mathematical and Physical Sciences, 26(2), 183-211. Kapur, J.N. and H.K. Kesavan (1992). Entropy optimization Principles with Applications. Academic Press, Boston, USA. Katz, J.S. (1999). The self-similar science system. Research Policy, 28, 501-517. Kawamura, K. and N. Hatano (2002). Universality of Zipf s law. Journal of the Physical Society of Japan, 71(5), 1211-1213.
412
Power laws in the information production process: Lotkaian informetrics
Kilgour, F.G., P.L. Long and E.B. Leiderman (1970). Retrieval of bibliographic entries from a name-title catalog by use of truncated search keys. Proceedings of the American Society for Information Science, 7, 79-82. Kinnucan, M.T. and D. Wolfram (1990). Direct comparison of bibliometric models. Information Processing and Management, 26(6), 777-790. Koenig, M. and T. Harrell (1995). Lotka's law, Price's urn, and electronic publishing. Journal of the American Society for Information Science, 46(5), 386-388. Kot, M., E. Silverman and C. A. Berg (2003). Zipf s law and the diversity of biology newsgroups. Scientometrics, 56(2), 247-257. Kranakis, A. and E. Kranakis (1988). Comparing two weighting methods in citation analysis. Unpublished paper, Amsterdam, the Netherlands. Krapivsky, P.L., S. Redner and F. Leyvraz (2000). Connectivity of growing random networks. Physical Review Letters, 85(21), 4629-4632. Kretschmer, H. and R. Rousseau (2001). Author inflation leads to a breakdown of Lotka's law. Journal of the American Society for Information Science and Technology, 52(8), 610614. Kuch, T.D.C. (1978). Relation of title length to number of authors in journal articles. Journal of the American Society for Information Science, 29, 200-202. Kumar, S., P. Sharma and K.C. Garg (1998). Lotka's law and institutional productivity. Information Processing and Management, 34(6), 775-783. Kyvik, S. (1990). Age and scientific productivity. Differences between fields of learning. Higher Education, 19, 37-55. Lafouge, T. (1995). Stochastic information field. The International Journal of Scientometrics and Informetrics, 1(2), 57-64. Lafouge, T. (1998). Mathematiques du Document et de I'Information. Bibliometrie distributionnelle. Habilitation a diriger des Recherches, Universite de Lyon 3. Lafouge, T. and E. Guinet (1999). Relations between distributions of use and distribution of contents in the case of library circulation data. In: Proceedings of the seventh Conference of the international Society for Scientometrics and Informetrics (C.A. Macias-Chapula, ed.), 267-277, Universidad de Colima. Lafouge, T. and C. Michel (2001). Links between information construction and information gain. Entropy and bibliometric distributions. Journal of Information Science, 27(1), 39-49.
Bibliography
413
Laherrere, J. and D. Somette (1998). Stretched exponential distributions in nature and economy: "fat tails" with characteristic scales. The European Physical Journal B, 2(4), 525-539. Lawani, S.M. (1981). Bibliometrics: its theoretical foundations, methods and applications. Libri, 31(4), 294-315. Leimkuhler, F.F (1967). The Bradford distribution. Journal of Documentation, 23, 197-207. Lemoine, W. (1992). Productivity patterns of men and women scientists in Venezuela. Scientometrics, 24(2), 281-295. Li, W. (1992). Random texts exhibit Zipfs-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842-1845. Liang, L., H. Kretschmer, Y. Guo and D. Deb. Beaver (2001). Age structures of scientific collaboration in Chinese computer science. Scientometrics, 52, 471-486. Lorenz, M.O. (1905). Methods of measuring concentration of wealth. Journal of the American Statistical Association, 9, 209-219. Lotka, A.J. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences, 16(12), 317-324. Luce, R.D. (1959). On the possible psychophysical laws. The Psychological Review, 66(2), 8195. Mandelbrot, B. (1954). Structure formelle des textes et communication. Word, 10(1), 1-27. Mandelbrot, B. (1959). A note on a class of skew distribution functions: analysis and critique of a paper by H.A. Simon. Information and Control, 2, 90-99. Mandelbrot, B. (1967). How long is the coast of Britain? Statistical self-similarity and fractional dimension. Science, 156, 636-638. Mandelbrot, B. (1977a). Fractals: Form, Chance, Dimension. Freeman, San Francisco, USA. Mandelbrot, B. (1977b). The Fractal Geometry of Nature. Freeman, New York, USA. Mansuripur, M. (1987). Introduction to Information Theory. Prentice-Hall, Englewood Cliffs, USA. Marsili, M. and Y.-C. Zhang (1998). Interacting individuals leading to Zipfs law. Physical Review Letters, 80(12), 2741-2744. Matricciani, E. (1991). The probability distribution of the age of references in engineering papers. IEEE Transactions of Professional Communication, 34, 7-12. Moed, H.F. and A.F.J. Van Raan (1986). Cross-field impact and impact delay of physics departments. Czechoslovak Journal of Physics, B36, 97-100.
414
Power laws in the information production process: Lotkaian informetrics
Momcilovic, B. and V. Simeon (1981). Distribution of citation frequencies in a non-selected group of scientific papers. Informatologia Yugoslavica, 13(1-4), 123-128. Motylev, V.M. (1981). Study into the stochastic process of change in the literature citation pattern and possible approaches to literature obsolescence estimation. International Forum on Information and Documentation, 6, 3-12. Murphy, LJ. (1973). Lotka's law in the humanities? Journal of the American Society for Information Science, 24, 461-462. Nacke, O. (1979). Informetrie: eine neuer Name fur eine neue Disziplin. Nachrichten fur Documentation, 30(6), 219-226. Nalimov, V.V. and Z.M. Mul'cenko (1969). Naukometrija. Nauka, Moskva, USSR. Naranan, S. (1970). Bradford's law of bibliography of science: an interpretation. Nature, 227(5258), 631-632. Naranan, S. (1971). Power law relations in science bibliography - a self-consistent interpretation. Journal of Documentation, 27(2), 83-97. Naranan, S. (1991). Statistical laws in information science, language and system of natural numbers: some striking similarities. Unpublished manuscript. Narin, F. and J.K. Moll (1977). Bibliometrics. Annual Review of Information Science and Technology (ARIST), 12, 35-58. Nath, R. and W.M. Jackson (1991). Productivity of management information systems researchers: does Lotka's law apply? Information Processing and Management, 27(2/3), 203-209. Nederhof, AJ. and H.F. Moed (1993). Modeling multinational publication: development of an on-line fractionation approach to measure national scientific output. Scientometrics, 27, 39-52. Nelson, M. (1989). Stochastic models for the distribution of index terms. Journal of Documentation, 45(3), 227-237. Nelson, M. and J.S. Downie (2001). Informetrie analysis of a music database: distributions of intervals. In: Proceedings of the eighth International Conference on Scientometrics and Informetrics (M. Davis and C.S. Wilson, eds.), 477-484, BIRG(UNSW), Sydney, Australia. Neveu, J. (1975). Discrete-Parameter Martingales. North-Holland, Amsterdam, the Netherlands.
Bibliography
415
Newby, G.B., J. Greenberg and P. Jones (2003). Open source software development and Lotka's law: bibliometric patterns in programming. Journal of the American Society for Information Science and Technology, 54(2), 169-178. Nicholls, P.T. (1986). Empirical validation of Lotka's law. Information Processing and Management, 22(5), 417-419. Nicholls, P.T. (1987). Estimation of Zipf parameters. Journal of the American Society for Information Science, 38(6), 443-445. Nicholls, P.T. (1988). Price's square root law: empirical validity and relation to Lotka's law. Information Processing and Management, 24(4), 469-477. Nicholls, P.T. (1989). Bibliometric modeling processes and the empirical validity of Lotka's law. Journal of the American Society for Information Science, 40(6), 379-385. Nicolis, G., C. Nicolis and J.S. Nicolis (1989). Chaotic dynamics, Markov partitions and Zipf s law. Journal of Statistical Physics, 54(3/4), 915-924. Nielsen, J. (1997). Do websites have increasing returns? http://www.useit.com/alertbox/ 9704b.html Nieuwenhuysen, P. (1988). A bibliography of text information management software for IBM microcomputers and compatibles. The Electronic Library, 6, 264-320. O'Conner, D.O. and H. Voos (1981). Empirical laws, theory construction and bibliometrics. Library Trends, 30(1), 9-20. Oluic-Vukovic, V. (1997). Bradford's distribution: from the classical bibliometric "law" to the more general stochastic models. Journal of the American Society for Information Science, 48(9), 833-842. Oluic-Vukovic, V. (1998). Simon's generating mechanism: consequences and their correspondence to empirical facts. Journal of the American Society for Information Science, 49(10), 867-880. Onodera, N. (1988). A frequency distribution function derived from a stochastic model considering human behaviors and its comparison with an empirical bibliometric distribution. Scientometrics, 14(1-2), 143-159. Pao, M.L. (1978). Automatic text analysis based on transition phenomena of word occurrences. Journal of the American Society for Information Science, 29, 121-124. Pao, M.L. (1982a). American revolution: comparison of a bibliography with a quality-selected list. In: Proceedings of the 45th ASIS Annual Meeting, 19, 224-226.
416
Power laws in the information production process: Lotkaian informetrics
Pao, M.L. (1982b). Lotka's test. Collection Management, 4(1/2), 111-124. Pao, M.L. (1985). Lotka's law: a testing procedure. Information Processing and Management, 21(4), 305-320. Pao, M.L. (1986). An empirical examination of Lotka's law. Journal of the American Society for Information Science, 37(1), 26-33. Pareto, V. (1895). La legge della domanda. Giornale degli Economisti, 12, 59-68. Pastor-Satorras, R. and A. Vespignani (2004). Evolution and Structure of the Internet. A statistical Physics Approach. Cambridge University Press, Cambridge (UK). Petruszewycz, M. (1973). L'Histoire de la loi d'Estoup-Zipf: documents. Mathematiques des Sciences Humaines, 11(44), 41-56. Potter, W.G. (1980). When names collide: conflict in the catalog and AACR2. Library Resources and Technical Services, 24, 3-16. Potter, W.G. (1981). Lotka's law revisited. Library Trends, 30(1), 21-39. Pratt, A.D. (1977). A measure of class concentration in bibliometrics. Journal of the American Society for Information Science, 28(5), 285-292. Praunlich, P. and M. Kroll (1978). Bradford's distribution: a new formulation. Journal of the American Society for Information Science, 29, 51-55. Pritchard, A. (1969). Statistical bibliography or bibliometrics? Journal of Documentation, 25, 348-349. Protter, M.H. and C.B. Morrey (1977). A first Course in real Analysis. Springer-Verlag, New York, USA. Pulgarin, A. and I. Gil-Leiva (2004). Bibliometric analysis of the automatic indexing literature: 1956-2000. Information Processing and Management, 40(2), 365-377. Qin, J. (1995). Collaboration and publication productivity: An experiment with a new variable in Lotka's law. In: Proceedings of the fifth Conference of the International Society for Scientometrics and Informetrics (A. Bookstein and M. Koenig, eds.), 445-454, Learned Information, Medford (NJ), USA. Radhakrishnan, T. and R. Kernizan (1979). Lotka's law and computer science literature. Journal of the American Society for Information Science, 30, 51-54. Rapoport, A. (1982). Zipf s law re-visited. In: Studies on Zipf s law (H. Guiter and M.V. Arapov, eds.). Quantitative Linguistics, Vol. 16, 1-28, Studienverlag Dr. N. Brockmeyer, Bochum, Germany.
Bibliography
417
Rao, I.K.R. (1980). The distribution of scientific productivity and social change. Journal of the American Society for Information Science, 31(2), 111-121. Rao, I.K.R. (1988). Probability distributions and inequality measures for analyses of circulation data. In: Proceedings of the first International Conference on Bibliometrics and theoretical Aspects of Information Retrieval(L. EggheandR. Rousseau, eds.), 231-248, Elsevier, Amsterdam, the Netherlands. Rao, I.K.R. (1995). A stochastic approach to analysis of distributions of papers in mathematics: Lotka's law revisited. In: Proceedings of the fifth Conference of the International Society for Scientometrics and Informetrics (A. Bookstein and M. Koenig, eds.), 455464, Learned Information, Medford (NJ), USA. Redner, S. (1998). How popular is your paper? An empirical study of the citation distribution. The European PhysicalJournal B, 4(2), 131-134. Rescher, N. (1978). Scientific Progress. A philosophical Essay on the Economics of Research in natural Science. Blackwell, Oxford, UK. Roberts, F.S. (1979). Measurement Theory with Applications to Decisionmaking, Utility, and the social Sciences. Addison-Wesley, Reading (MA), USA. Robertson, A.M. and P. Willett (1998). Application of N-grams in textual information systems. Journal of Documentation, 54(1), 48-69. Rousseau, B. and R. Rousseau (2000). LOTKA: a program to fit a power law distribution to observed frequency data. Cybermetrics, 4(1), paper 4. http://www.cindoc.csic.es/ cybermetrics/articles/v4ilp4.html Rousseau, R. (1987). The Gozinto theorem: using citations to determine influences on a scientific publication. Scientometrics, 11(3-4), 217-229. Rousseau, R. (1988). Lotka's law and its Leimkuhler representation. Library Science with a Slant to Documentation and Information Studies, 25(3), 150-178. Rousseau, R. (1990a). Relations between continuous versions of bibliometric laws. Journal of the American Society for Information Science, 41(3), 197-203. Rousseau, R. (1990b). A bibliometric study of Nieuwenhuysen's bibliography of microcomputer software for online information and documentation work. Journal of Information Science, 16, 45-50.
418
Power laws in the information production process: Lotkaian informetrics
Rousseau, R. (1992a). Concentration and diversity of availability and use in information systems: A positive reinforcement model. Journal of the American Society for Information Science, 43(5), 391-395. Rousseau, R. (1992b). Breakdown of the robustness property of Lotka's law: the case of adjusted counts for multiauthorship attribution. Journal of the American Societyfor Information Science, 43(10), 645-647. Rousseau, R. (1993). A table for estimating the exponent in Lotka's law. Journal of Documentation, 49(4), 409-412. Rousseau, R. (1994a). Double exponential models for first-citation processes. Scientometrics, 30, 213-227. Rousseau, R. (1994b). Bradford curves. Information Processing and Management, 30(2), 267277. Rousseau, R. (1994c). The number of authors per article in library and information science can often be described by a simple probability distribution. Journal of Documentation, 50(2), 134-141. Rousseau, R. (1997). Sitations: an exploratory study. Cybermetrics, 1(1), paper 1. http://www.cindoc.csis.es/cybermetrics/articles/vlilpl.html Rousseau, R. (1998). Convolutions and their applications in information science. Canadian Journal of Information and Library Science, 23(3), 29-47. Rousseau, R. (2002a). Lack of standardisation in informetric research. Comments on "Power laws of research output. Evidence for journals of economics" by Matthias Sutter and Martin G. Kocher. Scientometrics, 55(2), 317-327. Rousseau, R. (2002b). George Kingsley Zipf: life, ideas, his law and informetrics. Glottometrics, 3,11-18. Rousseau, R. and S. Rousseau (1993). Informetric distributions: a tutorial overview. Canadian Journal of Information and Library Science, 18(2), 51-63. Rousseau, R. and P. Van Hecke (1999). Measuring biodiversity. Ada Biotheoretica, 47, 1-5. Rousseau, R. and Q. Zhang (1992). Zipfs data on the frequency of Chinese words revisited. Scientometrics, 24(2), 201-220. Salton, G. and M.J. McGill (1987). Introduction to modern Information Retrieval. McGraw-Hill, Auckland, New Zealand.
Bibliography
419
Schapiro, B. (1994). An approach to physics of complexity. Chaos, Solitons and Fractals, 4(1), 115-123. Schorr, A.E. (1974). Lotka's law and library science. Reference Quarterly, 14(1), 32-33. Schorr, A.E. (1975a). Lotka's law and map librarianship. Journal of the American Society for Information Science, 26, 189-190. Schorr, A.E. (1975b). Lotka's law and the history of legal medecine. Research in Librarianship, 30, 205-209. Schubert, A.W. and W. Glanzel (1986). Mean response time - a new indicator of journal citation speed with application to physics journals. Czechoslovak Journal of Physics, B36,121125. Seal, H.L. (1952). The maximum likelihood fitting of the discrete Pareto law. Journal of the Institute of Actuaries, 78, 115-121. Seglen, P.O. (1992). The skewness of science. Journal of the American Society for Information Science, 43(9), 628-638. Sen, A.K., C. A. Bin Taib and M.F. Bin Hassan (1996). Library and information science literature and Lotka's law. Malaysian Journal of Library and Information Science, 1(2), 89-93. Shannon, C.E. and W. Weaver (1975). The mathematical Theory of Communication. University of Illinois Press, Urbana (II), USA. Sharada, B.A. (1993). Bibliometric studies in linguistics and bibliometric laws. Library Science, 30,71-75. Shtrikman, S. (1994). Some comments on Zipfs law for the Chinese language. Journal of Information Science, 20(2), 142-143. Simkin, M.V. and V.P. Roychowdhury (2002). Read before you cite\ http://arxiv.org/abs/ condmat/0212043, Preprint. Simon, H.A. (1955). On a class of skew distribution functions. Biometrika, 42, 425-440. Soyibo, A. and W.O. Aiyepeku (1988). On the categorization, exactness and probable utility of bibliometric laws and their extensions. Journal of Information Science, 14, 243-251. Stevens, S.S (1960). The psychophysics of sensory function. American Scientist, 48, 226-253. Stewart, I. (1989). Does God play Dice. Penguin Books, London, UK. Stinson, E.R. (1981). Diachronous versus synchronous Study of Obsolescence. Ph. D. Thesis, University of Illinois, USA.
420
Power laws in the information production process: Lotkaian informetrics
Stinson, E.R. and F.W. Lancaster (1987). Synchronous versus diachronous methods in the measurement of obsolescence by citation studies. Journal of Information Science, 13, 65-74. Subramanyam, K. (1979). Lotka's law and the literature of computer science. IEEE Transactions on Professional Communication, PC-22(4), 187-189. Summers, E.G. (1983). Bradford's law and the retrieval of reading research journal literature. Reading Research Quarterly, 19, 102-109. Sutter, M. and M.G. Kocher (2001). Power laws of research output. Evidence for journals of economics. Scientometrics, 51, 405-414. Tague, J. (1981). The success-breeds-success phenomenon and bibliometric processes. Journal of the American Society for Information Science, 32(4), 280-286. Tague, J. (1988). What's the use of bibliometrics ? In: Informetrics 87/88. Proceedings of the first International Conference on Bibliometrics and Theoretical Aspects of Information Retrieval (L. Egghe and R. Rousseau, eds.), 271-278, Elsevier, Amsterdam, the Netherlands. Tague-Sutcliffe, J. (1994). Quantitative methods in documentation. In: Fifty Years of Information Progress: a Journal of Documentation Review, 147-188, Aslib, London, UK. Tague, J. and P. Nicholls (1987). The maximal value of a Zipf size variable: sampling properties and relationship to other parameters. Information Processing and Management, 23(3), 155-170. Terrada, M.-L. and V. Navarro (1977). La productividad de los autores espanoles de bibliografia medica. Revista Espanola de Documentation Cientifica, 1(1), 9-19. Theil, H. (1967). Economics and Information Theory. North-Holland, Amsterdam, the Netherlands. Thelwall, M. and D. Wilkinson (2003). Graph structure in three national academic webs: power laws and anomalies. Journal of the American Society for Information Science and Technology, 54(8), 706-712. Tsay, M.-Y., S.-J. Jou and S.-S. Ma (2000). A bibliometric study of semiconductor literature 1978-1997. Scientometrics, 49(3), 491-509. Urziia, CM. (2000). A simple and efficient test of Zipf s law. Economics Letters, 66, 257-260.
Bibliography
421
Van Hooydonk, G. (1997). Fractional counting of multi-authored publications: consequences for the impact of authors. Journal of the American Societyfor Information Science, 48,944945. Varian, H.R. (1972). Benford's law. American Statistician, June 1972, 65-66. Vickery, B.C. (1948). Bradford's law of scattering. Journal of Documentation, 4(3), 198-203. Vinkler, P. (1990). Bibliometric analysis of publication activity of a scientific research institute. In: Informetrics 89/90. Proceedings of the second International Conference on Bibliometrics, Scientometrics and Informetrics (L. Egghe and R. Rousseau, eds.), 309334, Elsevier, Amsterdam, the Netherlands. Voos, H. (1974). Lotka and information science. Journal of the American Societyfor Information Science, 25, 270-272. Wagner-Dobler, R. and J. Berg (1995). The dependence of Lotka's law on the selection of time periods in the development of scientific areas and authors. Journal of Documentation, 51(1), 28-43. Wagner-Dobler, R. and J. Berg (1999). Physics 1800-1900: a quantitative outline. Scientometrics, 46(2), 213-285. Wallace, D.P. (1986). The relationship between journal productivity and obsolescence. Journal of the American Society for Information Science, 37(3), 136-145. Warren, K.S. and V.A. Newill (1967). Schistosomiasis, a Bibliography of the World's Literature from 1852-1962, Western Reserve University, Cleveland (OH), USA. White, H.D. and K.W. McCain (1989). Bibliometrics. Annual Review of Information Science and Technology (ARIST), 24 (M.E. Williams, ed.), 119-186. Wilkinson, E.A. (1973). The Bradford-Zipf Distribution. OSTl-report #5172. University College, London (UK). Willett, P. (1979). Document retrieval experiments using indexing vocabularies of varying size II. Hashing, truncation, digram and trigram encoding of index terms. Journal of Documentation, 35(4), 296-305. Wilson, C.S. (1999). Informetrics. Annual Review of Information Science and Technology (ARIST), 34 (M.E. Williams, ed.), 107-247. Windsor, D.A. (1975). Developing drug literatures. 1. Bibliometrics of baclofen and dantrolene sodium. Journal of Chemical Information and Computer Sciences, 15(4), 237-241. Wyllys, R.E. (1981). Empirical and theoretical bases of Zipf s law. Library Trends, 30(1), 53-64.
422
Power laws in the information production process: Lotkaian informetrics
Yablonsky, A.I. (1980). On fundamental regularities of the distribution of scientific productivity. Scientometrics, 2(1), 3-34. Yablonsky, A.I. (1985). Stable non-Gaussian distributions in scientometrics. Scientometrics, 7(36), 459-470. Yannakoudakis, E.J., I. Tsomokos and P.J. Hutton (1990). N-grams and their implication to natural language understanding. Pattern Recognition, 23(5), 509-528. Yitzhaki, M. (1995). Relation between the number of references and length of journal article. In: Proceedings of the fifth Conference of the International Societyfor Scientometrics and Informetrics (A. Bookstein and M. Koenig, eds.), 647-657, Learned Information, Medford (NJ), USA. Yoshikane, F. and K. Kageura (2004). Comparative analysis of coauthorship networks of different domains: the growth and change of networks. Scientometrics, 60(3), 433-444. Zipf, G.K. (1949). Human Behavior and the Principle of least Effort. Addison-Wesley, Cambridge, USA. Reprinted: Hafher, New York, USA, 1965.
SUBJECT INDEX
80/20-rule 188, 199, 200
- generalized 188

A

adjusted counting 255
aging distribution
- cumulative 323
- exponential 304, 316
- lognormal 304
alphabetical ranking 310
arc 297
ASCII coding 67
Asian languages 328
author count
- pure geometric 255
- senior 254
- total 254
author rank 304
- distribution 308
authors
- number per paper 90
authorship determination 328
average 110
average effort 70
average information content 69
average screen length 359

B

Benford's law 20
Bernoulli trial 309
bibliometrics 7
binary digit 67
binomial distribution 272, 309
bit 67
book 8
Booth's law 47
borrowing 202
box-counting dimension 239
Bradford formalism 105
Bradford multiplicator 22, 23, 126
Bradford's function (see also Bradford's law) 122, 129, 143
Bradford's law 22, 126, 154
- graphical formulation 24, 105, 127, 154, 156
- group-free version 23
- verbal formulation 24
branch 55
Brookes' law 153
Buckland's problem 40

C

cardinality 11
career duration 93
Cartesian product of IPPs 326
Central Limit Theorem 302
Chinese symbols 334
Chinese words 87
citation 8
citation age 300
citation network 87
classification of subjects 328
CLT 302
coastline of Norway 241
coefficient of variation: see variation coefficient
collaboration network 87
complexity 231
compression of texts 328
concave 193, 197, 315, 319, 320
concentration measure 189, 190, 194, 198, 205
concentration theory 187
- continuous 196
- discrete 192
- Lotkaian 187, 199
- of linear three-dimensional informetrics 218
- of Type/Token-Taken informetrics 226
conditional expectation 56, 270
conferences 7
continuous concentration theory 196
continuous Lotka function: see Lotka's law
convolution 251, 277, 280, 281, 284
counting
- adjusted 255
- fractional 247, 258
- normal 254
- proportional 258
- standard 254
- straight 254
- total 18, 247, 257
counting procedure 253
crediting system 253
cumulative advantage 45
cumulative aging distribution 323
cumulative citation distribution 325
cumulative first-citation distribution 314
cumulative nth citation distribution 323, 325
cumulative rank-frequency function 13, 105

D

De Solla Price's law: see Price's law
decimal digit 68
decreasing exponential function 15
decreasing power law 14
demography 9
density 26
diachronous 325
- obsolescence 40
dictionary 67
dimensional complexity 239
discrete concentration theory 192
discrete Lotka function: see Lotka's law
dit 68
diversity 214
domain name 90
double logarithmic scale 98
dual function 252
dual IPP 13, 102
dual Lotka law 276
dual size-frequency function 144, 276
dual, duality 12, 102, 159, 325

E

econometrics 9, 21
entropy 65, 69, 213
Erdős-Rényi model 33
Erdős-Rényi network 88
error correction 328
error detection 328
evolution in time 28, 31, 32
existence theorem 111, 116
exponential aging distribution 304, 316
exponential function 32
exponential growth 34
exponential obsolescence 34
exponentially decreasing rank-frequency function 135

F

Failure Breeds Failure 48
FBF 48
first-citation distribution 313
- cumulative 314
forgetting 33
fractal 232
- self-similar 232, 234, 243
fractal dimension 232, 236, 243
fraction of multinational publications 274
fractional 92
fractional author count 255
fractional count(ing) 18, 247, 258
fractional counting system 146
fractional frequency curve 372
fractional frequency distribution 288, 292, 295
fractional frequency score 146
fractional score 146
fractional scoring system 251
fractional size-frequency function 276
fractionated score 275
fractionated scoring system 256
Fubini's theorem 308
functional relation 370

G

gambling theory 56
Gaussian distribution 302
generalized 80/20-rule 188
generalized bibliography 8, 101
geometric mean 269
Gini index 194, 198, 205
goodness-of-fit 395
Gozinto theorem 175
graph 87
Groos droop 139, 298
growth 55
growth process 61
growth rate 34

H

Harmonic Mean Response Times 314
Hausdorff-Besicovitch dimension 236
hyperlink 87

I

identification of languages 328
immediacy index 314
incomplete data 373
indexation of music 328
indexing 327
inequality 187
inflection point 140
information content 66
information production process 8, 101
information retrieval 326
Information Science Abstracts 86
informetrics 7, 86
in-link 54, 89
Institute for Scientific Information 314
integral test 307
internet 87
intranet 87
IPP 8, 101
- complexity 231
- dual 13, 102
IR 326, 327
ISA 86
ISI 314, 325
ISSI 7
item 8
item systematic sample 374

J

JCR 314
Journal Citation Reports 314
journal productivity 300

L

law of anomalous numbers 20
law of Bradford: see Bradford's law
law of Leimkuhler: see Leimkuhler's law
law of Lotka: see Lotka's law
law of Mandelbrot: see Mandelbrot's law
law of Pareto: see Pareto's law
law of Price: see Price's law
law of succession of Laplace 48
law of surfing 89
law of total probability: see theorem of total probability
law of Weber-Fechner: see Weber-Fechner's law
law of Zipf: see Zipf's law
Leimkuhler curve 297
- arc 297
Leimkuhler's function: see Leimkuhler's law
Leimkuhler's law 22, 122, 128, 153
lexicographical ordering 310
Library and Information Science Abstracts 86
line segment 233
Linear Least Squares method 388
linear regression 15, 241, 388
linear three-dimensional informetrics 161
linear three-dimensional Lotkaian informetrics 175
linguistics 20, 86
LISA 86
LLS 388
loan 8, 223
logarithmic effort function 75
logarithmic function 66
lognormal aging (distribution) 33, 304
Lorenz curve 190, 193, 196, 200
- construction 223
Lotka exponent 15
Lotka's function: see Lotka's law
Lotka's general power function: see Lotka's law
Lotka's law (see also Lotkaian informetrics) 14, 122, 128, 135
- continuous 378
- discrete 378
Lotka's power law: see Lotka's law
Lotkaian concentration theory: see Lotkaian informetrics
Lotkaian continuous concentration theory: see Lotkaian informetrics
Lotkaian function existence theory 114
Lotkaian informetrics
- basic theory 101
- examples 85
- introductory aspects 7
- linear three-dimensional Lotkaian informetrics 175
- Lotkaian concentration theory 187
- Lotkaian continuous concentration theory 199
- Lotkaian fractal complexity theory 231
- Lotkaian Type/Token-Taken informetrics 177
Lotkaian Type/Token-Taken informetrics 177

M

Mandelbrot's argument 329
Mandelbrot's function: see Mandelbrot's law
Mandelbrot's law 19, 42, 122, 128, 143, 245
martingale 60
Maximum Entropy Principle 76
Maximum Likelihood Estimating 390
mean citation age 301
mean reference age 304
measure of dispersion 214
measure of the concentration: see concentration measure
median citation age 300
median reference age 304
memories 33
MEP 76
MLE 390
most-recent-reference distribution 325
multi-authored paper 304
multiple authorship 247
multiple sources 247
multiplicator of Lagrange 71, 77

N

Naranan model 34
Neperian logarithm 68
network 87
- citation network 87
- collaboration network 87
N-gram 326
- non-redundant 327
- redundant 327
noblesse oblige 256
node 55
nonlinear regression 320
non-redundant N-gram 327
normal counting 254
N-word phrase 326

O

obsolescence 40
- diachronous 40
- synchronous 40
O-notation 306
out-link 89

P

parallelepiped 233
PLE 70
PME 78
positive reinforcement 163, 175, 223
- concentration 219
post-coordinative retrieval 326
power law: see Lotka's law, Zipf's law
Pratt's measure 196, 207
pre-coordinative indexing 326
Price index 304
Price's law, Price's law of concentration 94, 97, 188, 214, 394
Price's square root law: see Price's law
Principle of Least Effort 70
Principle of Most Effort 78
principle of nominal increase 190, 223
product property 29, 30, 251
proportional author count 255
proportional counting 258
publication game 46

Q

quantitative linguistics 8

R

random network 33, 88
random text 42, 245
rank-frequency function 12, 105, 333
ranking 11
ratio scale 28, 99
rectangle 233
redundant N-gram 327
reference 8
reference age
- mean 304
- median 304
regression
- linear 241, 388
- nonlinear 320
retrieval of music 328
Riemann-zeta function 17, 381, 391

S

sample 373
SBS 45
scale-free (property) 27, 85, 86, 98, 246
- characterization of power laws 29
scale-free function 365
scale-invariant property 189
scattering 40
scientometrics 7
search key 328
seed 308, 310
self-similar, self-similarity 233, 246
self-similar fractal 232, 234, 243
sensation 68, 94
sensation law of Weber-Fechner: see Weber-Fechner's law
significant-digit 20
similarity dimension 234, 236
similarity measure 327
sitation 90
size-frequency function 10, 105, 333
slope of the regression line 262
social network 88
source 8
source systematic sample 374
source-item identity 298
speech recognition 328
S-shaped (growth) 33, 315, 319, 320
standard counting 254
standard normal distribution 302
standards, standardization 10, 89
stimulus 68, 94, 98, 218
stochastic process 55
straight counting 254
stretched exponential 33
submartingale 60
success 309
Success Breeds Success 45
super source 253, 256
supermartingale 60
synchronous 325
- obsolescence 40
systematic sample in the items 374
systematic sample in the sources 374

T

Theil's measure 195, 211
theorem of total probability 281, 305, 308, 310
three-dimensional informetrics 157
- linear three-dimensional informetrics 161
- linear three-dimensional Lotkaian informetrics 175
time dependence 92
total count(ing) 18, 247, 257
total influence 172
total scoring system 249
transfer principle 190
tree 55
triadic von Koch curve 234, 242
truncation 327
- generalization 327-328
TT (informetrics): see Type/Token
TTT (informetrics): see Type/Token-Taken
two-dimensional informetrics 9, 157
Type/Token 8, 168
- average 352
Type/Token-Taken 168, 226, 329
- average 352
- concentration 226
- Lotkaian Type/Token-Taken informetrics 177
Type-Token identity 298

U

universal Bradford multiplicator: see Bradford multiplicator
urn model 46

V

variation coefficient 195, 207

W

Weber-Fechner's law 68, 75, 94, 218
WWW 87

Z

Zipf's function: see Zipf's law
Zipf's law 19, 150, 328, 393