For a given s ∈ ⟨š, ŝ⟩ we run a random binary number generator generating 1 with probability P(s, 1) or 2 with probability
P(s, 2) = 1 − P(s, 1).  (3.1.16)
We denote by l* ∈ {1, 2} the generated binary number. The fuzzy rough description of the primary state is s*(l*). The block diagram of the described transformation and typical probabilities P(s, l), l = 1, 2, are shown in Figure 3.2. Also shown are the aggregation sets (see Section 1.5.3) determined by the rule (3.1.16). Contrary to the aggregation sets corresponding to deterministic quantization shown in Figure 1.18a, the aggregation sets corresponding to randomized quantization overlap.
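As a minimal Python sketch (not from the text) of the randomized rule (3.1.16): the auxiliary probability P(s, 1) used below is an assumed, purely illustrative membership-like function, since its concrete form is not specified here.

```python
import random

def p_s1(s, s_low=0.0, s_high=1.0):
    """Assumed example of the auxiliary probability P(s, 1): it decreases
    linearly from 1 at s_low to 0 at s_high (illustrative choice only)."""
    s = min(max(s, s_low), s_high)
    return (s_high - s) / (s_high - s_low)

def randomized_quantizer(s):
    """Return the generated binary number l* in {1, 2}: 1 with probability
    P(s, 1), otherwise 2 with probability P(s, 2) = 1 - P(s, 1)."""
    return 1 if random.random() < p_s1(s) else 2

# A state near the middle of the interval may fall into either aggregation set,
# which is why the aggregation sets of randomized quantization overlap.
print([randomized_quantizer(0.45) for _ in range(10)])
```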
Figure 3.2. The generation of a fuzzy binary description s* of a primary continuous state s: (a) the block diagram, (b) typical auxiliary probabilities P(s, l), l = 1, 2, and the corresponding aggregation sets.
Instead of random numbers, in practice numbers produced by special deterministic algorithms that behave like random numbers are used. Such numbers are called pseudo-random numbers. The principles of operation of pseudo-random number generators are briefly discussed in Section 4.5. For details see Dagpunar [3.1], Yarmolik, Demidenko [3.2]; programs generating pseudo-random numbers can be found in Press et al. [3.3].

3.1.3 THE COURSE OF AN EXTERNAL STATE IN TIME

Till this point we considered the external state of a system at a given instant. Therefore, it is called the instantaneous state. The properties of a system in a time interval
Figure 3.3. An example of a state process with a multilevel hierarchical structure: speech waveforms. (a) The waveform of the sentence "every salt breeze comes from the sea"; (b), (c) fragments of the sounds S and A, respectively, magnified and expanded in time. Based on Flanagan et al., 1979 (IEEE).
We may quantize the instantaneous values of the state process, or we may quantize the time argument, or both (for the corresponding transformations of information see Section 1.5.4). Consider, for example, the time process described by (3.1.17). We take the train of sampling instants
t_n, n = 1, 2, ..., N.  (3.1.18)
The vector (set of samples)
s = {s(t_n), n = 1, 2, ..., N}  (3.1.19)
is the simplified description of the primary time-continuous state process. Besides discretization of either the value of the state process and/or the time, other rough descriptions of state processes are possible. For example, in mechanics and the theory of time-continuous systems, of fundamental importance is the rough description of an evolving state process s(·) by the set of derivatives (d^n s/dt^n) at the current instant t_c, n = 1, 2, ..., N. A counterpart of the rough description of the evolving train of states by derivatives is the description of the states of time-discrete systems by a set of difference quotients. However, their definition is quite cumbersome. Therefore, we often use the continuous approximation of the time-discrete state description (see the discussion on discrete approximation in Section 1.4.3).
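A small Python sketch of forming the sample vector (3.1.19) from a time-continuous state process; the sinusoidal process and the equally spaced sampling instants are assumed purely for illustration.

```python
import numpy as np

def sample_process(s_of_t, t_a, T, N):
    """Form the simplified description s = {s(t_n), n = 1, ..., N} of a
    time-continuous state process, sampled at t_n = t_a + n*T
    (an assumed, equally spaced version of the train (3.1.18))."""
    t = t_a + T * np.arange(1, N + 1)
    return t, s_of_t(t)

# Assumed example process: s(t) = sin(2*pi*t), sampled every 0.05 s
t, s = sample_process(lambda t: np.sin(2 * np.pi * t), t_a=0.0, T=0.05, N=20)
# Difference quotients as a rough counterpart of the derivative ds/dt
ds_dt = np.diff(s) / np.diff(t)
```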
3.1.4 THE SET OF POTENTIAL FORMS OF AN EXTERNAL STATE

Section 1.4 indicated that if exact information about the concrete state is not available, then the superior system can improve the efficiency of its purposeful actions by exploiting the properties of the set of potential forms that the state can take. This applies also to the set S of potential forms of the external state. Using state-oriented terminology, let us recapitulate the considerations of Section 1.4. The set S of potential forms of the external state of a system is described by
• the structure STR of a potential state, and
• the rules of membership MR that say which states belong to S.
The structure STR is described by the list of elementary components of a state and by rules saying how the components can be assembled into a description of a potential state. Those assembling rules (also called syntax) include universal relationships between the components. The basic types of those relationships are discussed in the forthcoming Sections 3.3 and 3.4. In the simplest case, the rule of membership is that every combination of elementary components is a potential state. Then the potential state is said to be unconstrained. Easiest is the description of MR when the set S is discrete. The membership rule is then described by the list of the possible states s_l, l = 1, 2, ..., L. If they have the structure of vectors, then the list has the structure of an array. If the states have the structure of an array, their list has the structure of a higher-ranking array. The sets of potential forms of practically all exact descriptions of external states of real objects are continuous. To make this statement meaningful we must define an indicator of distance between potential forms of states (see Section 1.4.3). As in the case of information discussed in Section 1.4.3, the sets of potential states may be unconstrained or constrained (see, e.g., Figure 1.12). The set of the potential forms which a potential state can take is a property of the system. Therefore, we call it the state of variety. To emphasize that the description of the state of variety includes the description of the structure of a concrete state and the description of the membership rules, instead of the short symbol S we denote the state of variety by S_VAR. Thus,
S_VAR = {STR, MR}.  (3.1.19)
3.2 THE INTERNAL STATE OF A SYSTEM 1: RELATIONSHIPS BETWEEN CONTINUOUS STATES

In most systems the value of an external state component at one instant is related to the value of this component at another instant by relationships that hold for all potential values of the state parameters. Such relationships are an objective property of the system. They are not directly accessible, but they manifest themselves through external states. Therefore, the relationships between external state components have been called the internal state of the system. If the components of the state are connected by relationships, then it is often possible to express some state components as functions of other components. The first components are called dependent components, the second free components. Dropping the dependent components, we simplify the description of the state without impairing its accuracy. The relationships between components of states induce relationships between corresponding components of information. Those relationships make loss-less compression of information possible, or they provide immunity to some types of distortions, similarly to the relationships built into code words of an error-correcting code (see Section 2.1.2). The relationships between components of the external state also cause some accessible state components to deliver information about other, inaccessible components. Thus, the knowledge of relationships between components of states is of great importance both for information processing and for utilizing information by a superior system.
Usually a hierarchy of relationships exists between external states of systems. On the top are the universal relationships that are formulated as axioms of logic and mathematics. On the lower level are the relationships holding for broad classes of systems considered in the natural sciences, particularly in physics. On the lowest level lie relationships holding for narrow classes of systems. The relationships may have the form of logical statements, equations, or inequalities. Explicit relationships express the dependent state components as functions of independent components, and implicit relationships are functions relating state components. Most relationships considered in physics have the form of differential equations relating time and space derivatives of state processes. One of the fundamental problems of the natural sciences is to transform a relationship in an implicit form into an explicit form (in particular, to find the "solution" of an equation). This is done by using the universal relationships of logic and mathematics.

For many systems we can divide the components of the state into two categories: the causing components, which can be considered as the primary agent generating the state, and the resulting components, which can be considered as the effect of the causing components. We call such relationships (and systems) causal. In technical terminology the caused state is called the product of the causing state, and causal systems are called production systems. In any real system the changes of the causing components precede in time the consequent changes of the resulting components. Such systems we call real-time systems. In the case of a causal relationship it is natural to take the causing components as the free components mentioned earlier. There are, however, situations when some state components cannot be considered as the result of the others. This applies to instantaneous states (such as positions of resting, noninteracting objects) and to state processes. Then the relationship is mutual.

The problem of relationships between components of states, and thus of models of systems, is very broad and we do not go into the details of analysis and synthesis of systems. This is the subject of several excellent books (see, e.g., Oppenheim, Willsky [3.5] on time-continuous linear systems, Oppenheim, Schafer [3.6] on time-discrete systems, and, on simulation of such systems, Alkin [3.7], Wolfram [3.8]). The goals of this and the following sections are to present the principles of describing relationships between external states and to describe the classes of relationships that are used in subsequent chapters for implementation or interpretation of information transformations. The considerations start with a general discussion of the types of interactions between components of systems. We first introduce the important concept of terminal interacting systems. Then we concentrate on causal relationships in systems which change their states continuously in time. As a representative example, the relationships between components of states of lumped electrical elements and networks built of them are considered. Although the examples are very simple, they allow both goals to be realized.
3.2.1 TERMINAL INTERCONNECTED SYSTEMS

A system is an assembly of objects tied together by mutual interactions. The interactions are established by space fields, such as mechanical, electromagnetic, or gravitational fields. We discuss here in more detail the types of interactions between the objects (subsystems) forming a system. The subsystems interact through space fields. The analysis of such interactions is usually very complicated, and their control and utilization difficult. Therefore, of great practical importance are systems whose components can interact only through small interfaces. Such systems are called terminal interacting systems. Most systems built by people can be considered as terminal interconnected systems. Examples range from electronic devices through machinery and large cities to worldwide communication and transportation systems. As a concrete example, a typical building can be considered as a system of rooms interconnected by doors that play the role of terminals. In Chapter 1 and in Chapter 2 we tacitly assumed that the considered information systems are terminal interconnected systems. This is the precondition for representing a system by a block diagram. The simplest interface is a point object called a point interface or point terminal. A subsystem that can interact with other subsystems only through point interfaces is called a terminal interacting subsystem. A system consisting of terminal interacting subsystems is called a terminal interacting system. The simplest terminal interacting subsystem is the black box with two point terminals shown in Figure 3.4. The box between the terminals represents the primary subsystem establishing relationships between the states of the terminals. Such relationships are discussed in the next section.
Figure 3.4. A two-terminal accessible system. U(n) is the nth terminal, n = 1, 2, and s(n) is its state.
Typical examples of objects that can be treated as terminal interacting subsystems are lumped electrical elements such as resistors, capacitors, coils, diodes, and transistors. The circuits built of these elements are examples of terminal interconnected systems. The electrical elements illustrate the limitations of point-terminal accessible objects. This model is only suitable if the physical dimensions of the elements are much smaller than the length of the shortest electromagnetic wave corresponding to the changes of the electrical state of the terminals. However, even for lower frequencies, often costly electrical and magnetic screening techniques must be used to design those systems so that they behave as point interacting. An important and broad class of terminal interacting systems are networks. They consist of two types of subsystems: nodes and connecting channels that enable the interactions between nodes that usually are located at distant places (see Figure 2.13a).
Typical networks are water, gas, electricity, and sewage networks. The elementary components of the medium processed in those networks have no identity. E.g., electricity produced by a power plant is fed into the network but not directed to a particular user. Of a different type are most transportation and communication networks, in particular the packet communication networks considered in Section 2.3.2. They process units (packets) that are characterised by their destination and often by their origin. To this point we have considered the interface between subsystems as a point object. However, we often have to take into account its finite size. A broad class of such interfaces are two-dimensional window interfaces, which may also be called gates. In biological systems, membranes often play the role of window interfaces. Harbors and airports are examples of spatial interfaces between the road network and the sea or, respectively, air transport systems. Also, many living organisms can be considered as terminal interconnected systems. The nervous system is essentially a point-terminal connected system. However, most organs interact through window-type interfaces.

3.2.2 RELATIONSHIPS BETWEEN TIME-CONTINUOUS STATES OF A TWO-TERMINAL SYSTEM

We start our discussion of internal states with a very simple but representative example of two-terminal electrical elements. We assume the following:
Al. The subsystems interact through a flow of electric charges;
A2. The interaction is effected only through terminals.
Assumption A2 is justified if the changes of the electrical state are slow. Such flows are called semistatic and the systems are called lumped. We consider here the elementary electrical systems shown in Figure 3.5 that can interact only through two terminals U(n), n = 1, 2. The next example deals with a simple network of such elementary subsystems.
Figure 3.5. A resistor and a condenser considered as two-terminal accessible systems.
Further we assume the following:
A3. At an instant t the interaction of terminal U(n) with terminals of another element is determined by the instantaneous potential v(n, t) and the instantaneous intensity of electrical current (briefly, intensity) i(n, t). The intensity is defined as the rate of change of the electrical charge q(n, t) that flowed into the terminal:
i(n, t) = dq(n, t)/dt.  (3.2.1)
From the assumptions it follows that
s(n, t) = {v(n, t), i(n, t)}  (3.2.2)
is the state vector describing the instantaneous electrical state of the terminal U(n). As the first lumped electrical element we take a resistor (see Figure 3.5). Many observations showed that its state parameters are connected by the universal relationships (called Ohm's law):
i(2, t) = −i(1, t),  (3.2.3a)
v(2, t) − v(1, t) = i(1, t)R,  (3.2.3b)
where R is the resistance of the resistor. It depends on the shape of the conductor and the electrical properties of the conducting material. If the conductor is a wire of cross-section S and length l, then
R = al/S.  (3.2.4)
As the second lumped electrical element we take the condenser. Many observations show that the relationship (3.2.3a) holds again, but there is no universal relationship between the instantaneous potential and current intensities. However, the potentials are related to the electrical charge q(n, t) on the electrode of the condenser connected with terminal U(n). The relationship is
q(1, t) = C[v(2, t) − v(1, t)],  (3.2.5)
where C is the capacity of the condenser. It depends on the shape of its electrodes and on the properties of the isolator between them. For a flat condenser we have
C = εS/D,  (3.2.6)
where S is the surface of the electrodes, D is the distance between them, and ε is the dielectric constant of the isolator between the electrodes. After differentiating both sides of (3.2.5) and using definition (3.2.1) we get
i(1, t) = C[dv(2, t)/dt − dv(1, t)/dt].  (3.2.7)
Let us assume that v(2, t_a) − v(1, t_a) = 0, where t_a is the instant when the observation of the capacitor begins. Integrating equation (3.2.7) gives
v(2, t) − v(1, t) = C^{-1} ∫_{t_a}^{t} i(1, τ) dτ.  (3.2.8)
From this we obtain
v(2, t) − v(1, t) = [v(2, t−Δ) − v(1, t−Δ)] + C^{-1} ∫_{t−Δ}^{t} i(1, τ) dτ.  (3.2.9)
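The integral relationship (3.2.8) can be checked numerically; the sketch below uses an assumed constant charging current and simple rectangle-rule integration, so it is an illustration rather than part of the derivation.

```python
import numpy as np

def capacitor_voltage(i, dt, C):
    """Potential difference v(2, t) - v(1, t) = (1/C) * integral of i(1, tau) dtau,
    starting from zero at the first sample (cf. (3.2.8)), by the rectangle rule."""
    return np.cumsum(i) * dt / C

dt, C = 1e-4, 1e-6                 # assumed values: 0.1 ms step, 1 uF
t = np.arange(0.0, 0.01, dt)
i = 1e-3 * np.ones_like(t)         # assumed constant 1 mA charging current
v = capacitor_voltage(i, dt, C)    # ramps linearly, as expected for a constant current
```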
This very simple example suggests a couple of generalizing comments.
COMMENT 1
To create a tractable model of real objects we must limit the class of considered real objects and situations (assumptions A1 and A2) and decide which components of their states are relevant for the considered class of superior systems (assumption A3). Those steps are essential and usually difficult. They must be based on the results of specific sciences outside the information sciences. Behind the seemingly simple assumptions A1 to A3 stands the huge body of sciences concerned with electrical phenomena. The described procedure is also called the choice of the universe of discourse.
COMMENT 2
The examples give more insight into our considerations about hierarchical structure and the concept of atomic objects. If we stay at the highest (least precise) level of system description, we have to determine the resistance or the capacity on the basis of many observations of the electrical states of the point terminals. If we take into account the macrostructure at a finer level and go inside the black box, we get the geometrical dimensions of the components and can use formulae (3.2.4) and (3.2.6). With the material constants a and ε we face a similar situation. Staying on the macro level, we can obtain their values by observations of states at this level. However, if we go to the atomic (in the sense of physics) level, we can derive the relationships from general relationships holding at the atomic level and, in addition, we can express the material constants ε and a in terms of characteristics of the material at the atomic level, such as the electrical charges of the elementary carriers or indices of their mobility.
COMMENT 3
The primary relationships between external states are not derived but are the result of generalizations of many observations of the external electrical states. It must be so, because the relationships are features of the real world existing independently of our reasoning.
COMMENT 4
The relationship between state components can be represented in different forms. We can pass from one form to another through mathematical operations. Typical are differential forms such as (3.2.7) and integral forms such as (3.2.8). The various forms are helpful for analyzing specific properties of the relationship.
COMMENT 5
For wide classes of systems their external state parameters may be related by the same relationship (in the examples, relationships (3.2.3) and (3.2.5)), and the properties of a concrete object enter the relationship only through some parameters (the resistance R for resistors, the capacity C for condensers). Such a relationship is called a universal relationship and the parameters are called internal state parameters.
COMMENT 6
For some systems, such as the resistor, instantaneous state parameters are related. We call such systems memory-less. If the course of a state parameter in an interval in the past is related to the instantaneous value of another state parameter, as in the case of the condenser, we say that the system is with memory. For the capacitor the dependence has a specific character. From equation (3.2.9) it follows that if the difference v(2, t−Δ) − v(1, t−Δ) is known at the instant t−Δ, where Δ > 0, then the earlier course of the intensity i(n, τ), τ < t−Δ, need not be known.
COMMENT 7
Equation (3.2.8) shows that the instantaneous difference of potentials v(2, t) − v(1, t) is related to the course of the intensity in the whole past. Equation (3.2.7) seemingly contradicts this. This would be, however, a superficial conclusion. To calculate the derivative dv(n, t)/dt or, more precisely, the value [dv(n, τ)/dτ] at τ = t, it is not enough to know v(n, t); we must know the process in an arbitrarily small but finite time interval
Figure 3.6. A resistor-condenser network.
Assumption A3 does not limit our considerations. We introduce it only to simplify the notation. From Figure 3.6 we see that the current i(1, t) causes the potential difference on the resistor and loads the condenser. From (3.2.3) and (3.2.8) we get
v(1, t) − v(2, t) = Ri(1, t),  (3.2.12)
v(2, t) = C^{-1} ∫_{t_a}^{t} i(1, τ) dτ.  (3.2.13)
Differentiating this and putting the result in (3.2.12) gives
RC dv(2, t)/dt + v(2, t) − v(1, t) = 0,  t > t_a.  (3.2.14)
This is the fundamental relationship between the potentials of terminals U(1) and U(2). Since it relates instantaneous values of the processes and their derivatives, it is called a differential relationship. Usually, the potential of terminal U(1) is considered as the factor causing the potential of terminal U(2), and we are interested in expressing v(2, t) as a function of a given process v(1, τ), τ < t. We assume that (a) the limit defining the pulse response h(t) (see (3.2.19)), with the area of the input pulse
A_Δ = ∫_0^Δ v_Δ(1, t) dt,  (3.2.20)
exists, and (b) it does not depend on the input impulse but is solely determined by the system. Typical processes v_Δ(1, t) and v_Δ(2, t) and the limit h(t) are shown in the left column of Figure 3.7.
Figure 3.7. Illustration of the definition of the pulse response (left column) and of the derivation of the explicit relationship between the processes at the input and output of a linear causal system (right column). A2: v_Δ(1, t), a narrow pulse at the input, and v_Δ(2, t), the response at the output; A3: the limiting case when the pulse width Δ → 0, with h(t) the pulse response; B2: the elementary pulse v(1, t_n) f_rec(t − t_n) approximating the input process in the interval
From (3.2.13) we get the potential difference on the terminals of the condenser at the end of the pulse:
v(2, Δ) = (RC)^{-1} A_Δ,  (3.2.24)
where A_Δ is defined by (3.2.20). After the end of the pulse v_Δ(1, t) = 0, t > Δ. Thus, in this time interval the differential equation (3.2.14) describing our system takes the form
RC dv(2, t)/dt + v(2, t) = 0,  t > Δ,  (3.2.25)
with the initial condition (3.2.24). The solution of this equation can be easily found (see, e.g., Arnold [3.9]). It is
v_Δ(2, t) = A_Δ (RC)^{-1} exp[−(t − Δ)/RC],  t > Δ.  (3.2.26)
From the definition (3.2.19) of the pulse response, we obtain finally
h(t) = (RC)^{-1} exp(−t/RC) for t > 0,  h(t) = 0 for t < 0.  (3.2.27)
We show now that using both properties (3.2.16) and (3.2.18) we can obtain an explicit solution of the differential equation (3.2.14) in the general case. We approximate the process v(1, t), t ∈ ⟨t_a, t⟩, by
v*(1, t) = Σ_{n=1}^{N} v(1, t_n) f_rec(t − t_n),  (3.2.28)
where t_n = t_a + (n−1)Δ, Δ = (t − t_a)/N, and f_rec(t) is a rectangular pulse of height 1 and duration Δ − ε, where ε ≪ Δ; the pulse is shown in Figure 3.7 B1. From property (3.2.16) with c′ = v(1, t_n) and c″ = 0 and from properties (3.2.18), (3.2.19) it follows that if the input process is v(1, t_n) f_rec(t − t_n), then for Δ → 0 the output process is close to v(1, t_n) h(t − t_n) Δ (see also Figures 3.7 B2, B3). Next, from property (3.2.16) it follows that when the right side of (3.2.28) appears at the input terminal U(1), then
v*(2, t) ≈ Σ_{n=1}^{N} v(1, t_n) h(t − t_n) Δ  (3.2.29)
is produced at the output terminal U(2). From the definition of the integral we have
lim_{Δ→0} Σ_{n=1}^{N} v(1, t_n) h(t − t_n) Δ = ∫_{t_a}^{t} h(t − τ) v(1, τ) dτ.  (3.2.30)
From (3.2.29) and (3.2.30) it follows that the process at the output terminal is
v(2, t) = ∫_{t_a}^{t} h(t − τ) v(1, τ) dτ.  (3.2.31)
Taking into account (3.2.21) and assuming that v(1, t) = 0 for t < t_a, we get
v(2, t) = ∫_{−∞}^{t} h(t − τ) v(1, τ) dτ.  (3.2.32)
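A numerical sketch of (3.2.27) and (3.2.32): the output of the RC network is approximated by the discrete convolution of the input with the pulse response h(t) = (RC)^{-1} exp(−t/RC). The component values and the input process below are assumed for illustration only.

```python
import numpy as np

R, C = 1e3, 1e-6                   # assumed values: 1 kOhm, 1 uF, so RC = 1 ms
dt = 1e-5
t = np.arange(0.0, 0.01, dt)

h = (1.0 / (R * C)) * np.exp(-t / (R * C))   # pulse response (3.2.27) for t >= 0
v1 = np.where(t < 0.002, 1.0, 0.0)           # assumed input: 2 ms rectangular pulse

# Discrete approximation of the convolution integral (3.2.32)
v2 = np.convolve(v1, h)[: len(t)] * dt
# v2 rises toward 1 with time constant RC and decays after the input pulse ends
```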
COMMENT 1
The class of linear systems defined on page 145 is very wide, since when the changes of the states are small, we can expand the characteristics (parameters, functions) of the system into a Taylor series and take as a good approximation only the linear term of the expansion. On the other hand, if we consider very wide ranges of states, practically every system is nonlinear. For example, if we increased the potential difference on a resistor sufficiently, it would heat up to the melting point, and its behavior could no longer be described by Ohm's law. However, even in such situations we could consider the system as piece-wise linear.
COMMENT 2
The general idea of calculating the pulse response (3.2.27) is that we have two phases of changing states of the system's elements. In the first phase, while the pulse lasts, energy is accumulated in the elements of the circuit that are capable of storing energy. In the considered system, the condenser is the element that can store electric energy. In systems containing coils, magnetic energy is stored. The second phase begins when the pulse has ended. During this phase (described in particular by (3.2.25)) the accumulated energy discharges. Thus, the pulse response has the meaning of the process of discharge of the energy loaded by the pulse into the energy-storing elements of the system. The discharge process depends on the stored energy but not on the way it was stored. This elucidates property (3.2.18) and suggests its formal proof.
If we look closer at the derivation (equations (3.2.29) to (3.2.32)) of equation (3.2.31), we see that in deriving it we used only the general properties (3.2.16) and (3.2.18), not the specific form (3.2.27) of the pulse response. Thus, equation (3.2.32) holds for any linear system that has the property (3.2.18). A wide class of systems having those properties are networks of the mentioned linear electric components. For such systems the states v(n, t) and v(m, t) of two terminals U(n) and U(m) are related by the differential equation
a(K) d^K v(n, t)/dt^K + a(K−1) d^{K−1} v(n, t)/dt^{K−1} + ... + a(0) v(n, t) = b(L) d^L v(m, t)/dt^L + ... + b(0) v(m, t),  (3.2.33)
where the coefficients a(k) and b(l) are functions of the parameters R, C, L (inductance) characterizing the elementary electrical components the circuit consists of. An example of this equation is the previously derived equation (3.2.14). The general solution of the differential equation (3.2.33) is again given by (3.2.31) with the pulse response defined by (3.2.19). To find it we have to solve the homogeneous differential equation that we obtain from (3.2.33) by setting v(m, t) = 0, t > 0. As initial conditions we have to take the values of (d^k v(n, t)/dt^k) at t = 0+, determined by the processes of loading the energy-storing elements by a narrow pulse v_Δ(m, t) occurring at t = 0. In our argument we assumed that the properties of the considered electrical elements, particularly the internal state parameters R, C, L, do not change in time.
Such elements and systems built of them are called time-invariant or, equivalently, stationary. The fact that the pulse response does not depend on the instant at which the short pulse appeared is a consequence of the system being stationary. A powerful method of analysis of linear stationary systems is to present the processes as a superposition of harmonic processes. This method is discussed in Section 7.4.2. Our reasoning can be generalized to a wide class of linear systems in which the internal state parameters vary in time according to a predetermined function. An example of such a system would be the system shown in Figure 3.6 with a resistor whose resistance changes in time according to a given function R(t). Such a system is called a time-varying linear system. It is described by differential equations such as (3.2.33) but with coefficients a(k, t) and b(l, t) which are functions of time. The result of this is that such a system does not have the property (3.2.19) in its previous form: the limit (3.2.19) still exists, but it depends on the time at which the short pulse occurred, and the pulse response h(t, τ) is a function of the current instant t and of the instant τ at which the short impulse appeared (for causal systems h(t, τ) = 0 for t < τ). The counterpart of (3.2.31) is
v(m, t) = ∫_{−∞}^{t} h_{m,n}(t, τ) v(n, τ) dτ,  (3.2.34)
where h_{m,n}(t, τ) is the state of the mth terminal at instant t if at instant τ a change of the state of terminal n occurred that had the form of a narrow pulse.
COMMENT
To this point we have considered terminal-accessible electrical systems. As has been mentioned, electrical phenomena have an inherently spatial character. The electrical instantaneous state of a point in space is described by the electric and magnetic intensity vectors, thus by six state parameters. There exist universal relationships between them. These relationships can be presented either in the form of differential equations relating time and space derivatives of the components of the electric and magnetic field intensities (Maxwell's equations) or in integral form. As in our simple examples, the properties of a concrete medium in which the electromagnetic field exists enter the universal relationships through the dielectric and magnetic permeability constants, which in our terminology are typical internal state parameters. We have chosen electrical states to illustrate the relationships between external state parameters. The relationships between state components of other character are different in details but similar in principle. For example, almost identical to those presented previously are the relationships between the components of mechanical states (position in space, forces, point and continuous models of objects, etc.).

3.2.4 RELATIONSHIPS BETWEEN TIME-DISCRETE STATES

In the previous section we assumed that time is, as in the real world, a continuous variable. However, many technical systems, particularly those controlled by computers, can change their states only at predetermined, usually equally spaced instants. Such systems are called time-discrete. We now consider a typical system of this kind.
The block diagram of the system is shown in Figure 3.8a. The fundamental subsystem is the shift register, which is a chain of memory cells. We assume that
A1. The system can change its states only in the short interval (t_m, t_m + ε), ε ≪ T,
Figure 3.8. The time-discrete linear system: (a) the block diagram of a semi-stationary system, (b) timing, (c) a typical state process, (d) interpretation of the operation as a sliding window, (e) the system with time-varying multipliers (the non-stationary system).
The operation performed by the shift register can be interpreted as the transformation of a segment of sequentially arriving binary pieces of dynamic information, seen through a sliding window of width I, into static information stored in the chain of memory cells (see Figure 3.8d). Immediately after the new contents of the cells have been set, the sum
v(2, t_m) = Σ_{i=0}^{I−1} h′(i) c(i, t_m)  (3.2.38)
is evaluated, where h′(i), i = 0, 1, 2, ..., I−1, are fixed multipliers. Let us assume that at the input we have the train
v_Δ(1, t_n) = 1 for n = 0,  v_Δ(1, t_n) = 0 for n = 1, 2, ....  (3.2.39)
Such a train is the discrete counterpart of the narrow pulse v_Δ(1, t) that was introduced in Example 3.2.1. From (3.2.36) and (3.2.37) we obtain
v_Δ(2, t_m) = h′(m),  m = 0, 1, 2, ..., I−1,  (3.2.40)
where v_Δ(2, t_m) is the train of states of the output terminal U(2) when the train v_Δ(1, t_n) occurred at the input terminal U(1). From (3.2.40) it follows that the weighting coefficients arranged sequentially in time have the meaning of the pulse response of the system shown in Figure 3.8a. We denote
h(t_n) = h′(n), n = 0, 1, 2, ..., I−1;  h(t_n) = 0 for n < 0 and n > I−1.  (3.2.41)
Changing the sequence of summation in (3.2.38), taking into account that t_{m−n} = t_m − t_n, and using (3.2.36) and (3.2.37), we obtain
v(2, t_m) = Σ_{n=m−I+1}^{m} h′(m−n) v(1, t_n).  (3.2.42)
In general the coefficients h′ may depend on the number m of the working cycle. We denote them by h(t_m, t_n); see Figure 3.8e. The relationship
v(2, t_m) = Σ_{n=m−I+1}^{m} h(t_m, t_n) v(1, t_n)  (3.2.43)
is the generalization of (3.2.42). Since the system can memorize only the I most recent input elements, it is
h(t_m, t_n) = 0 for m − n > I − 1,  (3.2.44)
and since h(t_m, t_n) also has the meaning of the response of the system at instant t_m to the pulse v_Δ(1, t_n), it must be h(t_m, t_n) = 0 for m < n.
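A Python sketch of the stationary time-discrete system (3.2.38)/(3.2.42): the shift register keeps the I most recent input samples and the output is their weighted sum with the fixed multipliers h'(i). The multipliers and the input train are assumed values for illustration.

```python
from collections import deque

def shift_register_filter(inputs, h):
    """Semi-stationary time-discrete linear system of Figure 3.8a:
    v(2, t_m) = sum over i of h'(i) * v(1, t_{m-i}), cf. (3.2.38) and (3.2.42)."""
    I = len(h)
    cells = deque([0.0] * I, maxlen=I)     # shift-register cells C(0), ..., C(I-1)
    out = []
    for v1 in inputs:
        cells.appendleft(v1)               # shift the register and load the new sample
        out.append(sum(h[i] * cells[i] for i in range(I)))
    return out

h = [0.5, 0.3, 0.2]                        # assumed multipliers h'(i)
print(shift_register_filter([1, 0, 0, 0, 0], h))
# [0.5, 0.3, 0.2, 0.0, 0.0]: the multipliers appear in sequence, i.e. the pulse response (3.2.40)
```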
Figure 3.9. Structure of the matrix H(m) describing the time-discrete, non-stationary linear system shown in Figure 3.8: (a) m = I−1, (b) m > I−1.
Using the matrices we write the relationship (3.2.43) in the simple matrix form:
v(2, m) = H(m) v(1, m).  (3.2.46)
The system described by (3.2.43) or equivalently by (3.2.46) is a time-discrete counterpart of the non-stationary time-continuous linear system described by formula (3.2.34). Therefore, we call a system described by (3.2.46) a time-discrete non-stationary system.
COMMENT 1
The system described by assumptions A1 to A6 on page 150 is not stationary in the sense of the definition on page 149, since it behaves differently in the changing and stability intervals. However, its properties during one work cycle are the same as during any other work cycle. Therefore, it may be called semi-stationary.
COMMENT 2
In the limiting case when I → ∞ and T = t_m − t_{m−1} → 0, the time-discrete system would look, from the point of view of the input and output terminals, like a time-continuous system. Thus, the time-discrete system described by (3.2.42) or, respectively, by (3.2.43) can be considered as a discrete approximation of the time-continuous systems described by (3.2.31) and (3.2.34), respectively. These two systems illustrate the considerations in Section 1.4.3 about the relationships between discrete and continuous models of states. The advantage of the discrete approximation is that for a concrete input train we can easily evaluate the sum (3.2.43) numerically (for efficient algorithms see Press et al. [3.3, ch. 13]). If necessary, we can implement the system in real time using standard, simple, and inexpensive digital information-processing devices. Also, the fundamental mathematical tools needed to analyze a discrete problem are simpler than those needed to analyze a time-continuous system. For example, the definition of the pulse response for the discrete system is simple, while for the continuous system we had to introduce several heuristic assumptions or use precise mathematical methods to analyze the limit operations. On the other hand, for wide classes of input processes v(1, t) we can calculate the integral (3.2.31) in a closed, easy-to-analyze form. Even for narrow classes of input trains, the closed forms of the corresponding sum (3.2.43) are quite complicated and rather difficult to analyze.
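The matrix form (3.2.46) can be sketched by assembling a banded matrix H(m) whose nonzero entries are the coefficients h(t_m, t_n); the stationary case with assumed values is shown below and checked against the direct discrete convolution.

```python
import numpy as np

def build_H(h, m):
    """(m+1) x (m+1) banded matrix with H[j, n] = h(j - n) for 0 <= j - n <= I-1
    and zero otherwise (finite memory and causality, cf. (3.2.44)-(3.2.46))."""
    I = len(h)
    H = np.zeros((m + 1, m + 1))
    for j in range(m + 1):
        for n in range(max(0, j - I + 1), j + 1):
            H[j, n] = h[j - n]
    return H

h = np.array([0.5, 0.3, 0.2])          # assumed pulse response h'(i)
v1 = np.array([1.0, 2.0, 0.0, -1.0])   # assumed input train v(1, t_0), ..., v(1, t_3)
v2 = build_H(h, m=3) @ v1              # v(2, m) = H(m) v(1, m)
assert np.allclose(v2, np.convolve(v1, h)[:4])   # agrees with the direct sum (3.2.42)
```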
COMMENT 3 We can modify the considered system by feeding back the output process into the input as shown in Figure 3.10.
Figure 3.10. Time-discrete linear system with feedback.
The system with feedback is a linear system, and it can be described by its pulse response. However, there is an essential difference between the pulse response of the system with feedback and that of the previously considered open system. The pulse response of the open system is the process v_Δ(2, t_m) given by (3.2.40). Thus, for n > I the pulse response takes the value 0. In other words, the time extension of the pulse response of the previously considered system is IT. In the system with feedback, however, a unit pulse put at its input influences the pulses occurring later at the input of the shift register. Thus, the pulse response of the system with feedback usually has infinite duration. In particular, the response may be an infinitely long-lasting periodic function. Such a system is called an unstable system. Counterparts of such a system are time-continuous R, L, C systems with feedback and at least one additional source of energy, for example an amplifier (for an analysis of time-discrete systems with feedback see Oppenheim, Schafer [3.6]; for simulation, Alkin [3.7], Wolfram [3.8]).

3.2.5 A CLASSIFICATION OF RELATIONSHIPS BETWEEN EXTERNAL STATES OF SYSTEMS

A system in which the states of some terminals can be considered as the effect of the states of other terminals is called a production system. The previous considerations suggest a classification of such systems. The classification is equivalent to a classification of the transformations performed by systems. The classification is based on three fundamental properties of the relationships between the causing and caused states. The first property is related to the course of the states in time (see Comment 6, page 142). If the instantaneous form of the caused state depends only on the form of the causing state at this same instant, we say that the relationship (transformation, system) is memory-less. If the instantaneous form of the caused state is determined by the course of the causing state during a time interval in the past, we say that the relationship (transformation, system) has memory.
A subclass of relationships having memory are relationships in which the form of the caused state at an instant t depends only on the form of the caused state at an instant t−Δ and on the course of the causing states in the interval
Figure 3.11. A classification of production systems.
3.3 THE INTERNAL STATE 2: RELATIONSHIPS BETWEEN DISCRETE EXTERNAL STATES

In the previous section it was assumed that the set of potential forms of the elementary components of the state is continuous. Here it is assumed that this set is discrete; thus, the elementary state components and, in consequence, the states are discrete. Often such states have the meaning of rough descriptions of primary continuous states. There are two basic methods for describing the relationships between discrete states:
• We interpret the state components as discrete variables, use a discrete algebra, and describe the relationships by functions or algebraic equations, particularly recurrent equations;
• We interpret the state components as logical, particularly Boolean, variables and describe the relationships by logical expressions.
We now illustrate each method with an example.

3.3.1 RELATIONSHIPS DESCRIBED BY EQUATIONS

In Section 2.6.5 we described briefly the buffering system as a system decreasing the idle pauses between a train of working information blocks. The block diagram of the system and typical input and output processes are shown in Figure 2.23. We now demonstrate how the verbal description of the system's operation, which was given in Section 2.6.5, can be formalized. The subsystems of the system are the buffer memory (we call it the waiting room) and the fundamental information-processing subsystem (the server). We denote the kth primary block by ξ_k, k = 1, 2, .... The auxiliary information about a block is its arrival time τ(k) and the time d(k) needed by the fundamental information subsystem to process the block. We define first the state s(w_in, t) of the input terminal of the waiting room. It is assumed that the time needed to put the arriving block into the buffer memory is very short compared with the time it spends in the waiting room. Therefore, as a model of the state s(w_in, t) we take a process that stays in an idle state with the exception of a train of instants; such a process is called a point time process. We set
s(w_in, t) = e(k) for t = τ(k),  s(w_in, t) = 0 for t ≠ τ(k).  (3.3.1)
Such a process can be represented as a train of bars as shown in Figure 3.12a. We assume that the set S_in of potential values of s[w_in, τ(k)] is the interval (0, ∞). Next, we define the state s(w, t) of the waiting room as the number of blocks stored in the waiting room. We assume that at most C_w blocks can be stored in the waiting room. We call C_w the capacity of the waiting room. Thus, S_w = {0, 1, ..., C_w} is the set of its potential forms.
Figure 3.12. Typical state processes in a buffer: (a) s(w_in, t) at the input (the process of task arrival), (b) s(w, t) of the buffer, (c) s(r, t) of the server; T = t_n − t_{n−1} is the cycle of the system's operation.
T is the cycle for the buffer memory-server cooperation.
A3. A decision about transferring a block to the server is taken at the instant t_n, and it is based on exact information about the states of the waiting room and the server at the instant t_n − ε′, just before t_n.
A4. An input block can be put into the buffer memory as soon as it arrives (see Figure 3.12b).
We assume also that the duration of the interval of state changes ε ≪ T.
s(w, t_n − 0) = C_w  if  s′(w, t_n) > C_w,  (3.3.5)
where
u_+(u)  (3.3.6)
and s′(w, t_n) is interpreted as an auxiliary variable defined by (3.3.4). The relationship (3.3.5) can be represented by the graph shown in Figure 3.13. Such a graph is called a state transition graph.
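Because parts of (3.3.4)-(3.3.6) are garbled in this copy, the Python sketch below only assumes a bounded-buffer recursion in their spirit: the blocks arriving in a cycle are added, one block is handed to a free server, and the count is clipped at the capacity C_w. It is an illustration, not the book's exact formula.

```python
def buffer_step(s_w, arrivals, server_free, C_w):
    """One work cycle of the waiting room (assumed recursion in the spirit of (3.3.5)):
    add the blocks that arrived during the cycle, transfer one block to the server
    if it is free and the buffer is not empty, then clip at the capacity C_w."""
    s_aux = s_w + arrivals                       # auxiliary count before clipping
    if server_free and s_aux > 0:
        s_aux -= 1                               # one block leaves for the server
    return min(s_aux, C_w)

# Assumed scenario: capacity 4, a given train of arrivals, server always free
s_w, C_w = 0, 4
for arrivals in [2, 0, 3, 1, 0]:
    s_w = buffer_step(s_w, arrivals, server_free=True, C_w=C_w)
    print(s_w)                                   # successive states s(w, t_n)
```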
Figure 3.13. Graph representing the changes of states of the waiting room given by the recursive formula (3.3.5). The circles represent the potential states of the waiting room at the instants t_{n−1} − 0 and t_n − 0; S_k is the abbreviation for s(w, t_{n−1} − 0) = k and s(w, t_n − 0) = k; the arcs indicate the potential transitions; S is an abbreviation for the number s_in(t_{n−1} + 0, t_n − 0) of primary packets that arrive in the interval
COMMENT
This system operates rhythmically, like the time-discrete linear system considered in Section 3.2.3; thus, it is not stationary but semi-stationary. The nonlinear function u_+(·) and the conditions in (3.3.5) cause the superposition principle not to hold. The system is with memory, and from (3.3.5) it follows that it is a Markovian system.

3.3.2 THE RELATIONSHIPS DESCRIBED BY LOGICAL EXPRESSIONS

Till this point we considered relationships between state parameters describing the states of elementary components of a system. These state parameters have the meaning of atomic components of the state description (see Section 3.1.1). Therefore, a relationship between state parameters may be called a fine relationship, and we may interpret it as a fine internal state of the system. Sections 3.2.2 and 3.2.3 show that no direct relationship may exist between two external state parameters, while a relationship may exist between one parameter and a rough description of another parameter. Let us take, for example, the condenser considered in Section 3.2.2. From equation (3.2.8) it follows that no universal relationship exists between the instantaneous potential and the instantaneous current, but the instantaneous potential is related to a rough description (the integral) of the course of the instantaneous current. This same relationship is described by equation (3.2.7); see Comment 3 on page 144. Similarly to (3.2.7), equation (3.3.5) relates the instantaneous state of the waiting room and a rough description of the course of the state of the input terminal, described by s_in(t_{n−1} + 0, t_n − 0).
For a wide class of systems only the rough descriptions of the primary state parameters are related. Such a relationship is called a macro relationship. Most relationships formulated in a heuristic way by people have the form of macro relationships. Section 3.1 discussed the binary rough descriptions of external states. It has been shown that they can be considered as Boolean variables and that, using the logical operations of negation, alternative, and conjunction, we can produce secondary binary rough descriptions. We now show that by using the implication operation we can formalize the macro relationships between the binary rough descriptions of external states. This, combined with the idea of treating the external and internal states in the same way, permits formulating and handling systematically a wide class of macro relationships, similarly to the previously considered fine relationships. The basic relationship between two binary rough states is
R ≡ if an object O1 possesses the property P1, then the object O2 possesses the property P2.  (3.3.7)
The existence of a relationship is a property of the system. Therefore, we can use the general definition (3.1.5) to define the binary variable characterizing the existence of the relationship R:
s*(R) = 0 if the relationship R does not hold, s*(R) = 1 if it holds.  (3.3.8)
Let us introduce the binary attributes (variables) s*(O_n, P_n), n = 1, 2, saying whether an object O_n possesses a property P_n (see Section 3.1.2). It is natural to relate s*(R) and s*(O1, P1), s*(O2, P2) by the logical implication
s*(R) = s*(O1, P1) ⇒ s*(O2, P2).  (3.3.9)
This relationship can be represented by the table

s*(O1, P1)   s*(O2, P2)   s*(R)
     1            1          1
     1            0          0
     0            1          1
     0            0          1

Table 3.3.1. The truth table for the relation (3.3.9).

Combining the logical operations of negation, alternative, and conjunction we can describe a great variety of macro relationships relating rough binary states, both external and internal. In particular, the strong relationship "if and only if" is described by the logical function
[s*(O1, P1) ⇒ s*(O2, P2)] ∧ [s*(O2, P2) ⇒ s*(O1, P1)].  (3.3.10)
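The implication (3.3.9) and the "if and only if" function (3.3.10) can be tabulated directly; a minimal Python sketch reproducing Table 3.3.1:

```python
def implies(p, q):
    """s*(O1, P1) => s*(O2, P2): false only when the premise holds and the
    conclusion does not, cf. (3.3.9)."""
    return 0 if (p == 1 and q == 0) else 1

def iff(p, q):
    """The strong 'if and only if' relationship (3.3.10): both implications hold."""
    return implies(p, q) & implies(q, p)

print("p q  p=>q  p<=>q")
for p in (1, 0):
    for q in (1, 0):
        print(p, q, "  ", implies(p, q), "    ", iff(p, q))
```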
The fact that the macro relationship R defined by (3.3.7) holds for a system or a class of systems we write in the form:
s*(R) = 1,  ∀ s*(O1, P1), ∀ s*(O2, P2).  (3.3.11)
Such a relationship is a universal relationship, as are the previously discussed fine relationships characterizing systems, for example, the relationships (3.2.3) characterizing a resistor or the relationship (3.2.5) characterizing a capacitor. As in the case of those relationships, if it causes no confusion we omit the ∀ statements; thus, we do not indicate explicitly that the rough relationship holds for all values the involved variables can take. To emphasize that a relationship between external states holding for all their potential forms is an inherent property of the system, we called it an internal state, and we indicated that the relationship can be treated in the same way as an external state. In particular, the existence of a relationship can be considered to be a property of the system, and we can apply the concept of the binary variable characterizing a property defined by (3.1.6). For example, the rough state s*(Phg) characterizing the property Phg defined by (3.1.12) can also be interpreted as the rough state characterizing the possession of the relation "located higher" by the system consisting of two objects. In the definition (3.3.7) of the relationship we can take as the property of an object the possession of a lower-ranking relationship. Equivalently, as the variables in (3.3.10) we can take the variables characterizing the existence of the lower-ranking relations. Then we obtain a hierarchy of relationships. On top of these relationships we have the universal relationships between logical functions that hold for all values of the involved variables. For example, the logical function ((s′ ⇒ s″) ∧ s′) ⇒ s″ takes the value 1 for all 4 combinations of the variables s′, s″. Thus, we may write
((s′ ⇒ s″) ∧ s′) ⇒ s″ = 1,  ∀ s′, s″.  (3.3.12)
The following two examples present universal relationships between binary indices characterizing lower-ranking relationships between the states of some classes of systems.

EXAMPLE 3.3.1 A MACRO RELATIONSHIP BETWEEN POSITIONS OF OBJECTS

We consider the system S = {O(n); n = 1, 2, 3} of three point objects O(n) located on a vertical line. Let us denote by S(n, m) = {O(n), O(m)} the subsystem consisting of two objects. For such a subsystem we define the relationship of rank 1:
Rhg(n, m) ≡ object O(n) is located higher than object O(m).  (3.3.13)
Saying that the relationship Rhg(n, m) holds is equivalent to saying that the subsystem possesses the property Phg(n, m), which was introduced in Example 3.1.2 (definition (3.1.12)). We denote by s_hg(n, m) the binary variable characterizing the property defined by (3.1.13). Equivalently, we may define the variable s_hg(n, m) as the variable characterizing the existence of the relationship Rhg(n, m).
We learn by observations that a universal relationship of rank 2 (higher ranking) exists between the relationships Rhg(1, 2), Rhg(2, 3), and Rhg(1, 3). The relationship is this:
If O(3) is located higher than O(2) (Rhg(3, 2) holds), and if O(2) is located higher than O(1) (Rhg(2, 1) holds), then O(3) is located higher than O(1) (Rhg(3, 1) holds).  (3.3.14)
Using the variables characterizing the existence of the relationships of rank 1, we write the universal relationship (3.3.14) in this form:
s_hg(3, 2) ∧ s_hg(2, 1) ⇒ s_hg(3, 1) = 1. □  (3.3.15)
The relationships Rhg(n, m) occurring in this universal relationship are illustrated in Figure 3.14.
Figure 3.14. Illustration of the lower-ranking relationships Rhg(n, m) occurring in the universal relationship (3.3.14).
COMMENT
The state parameter describing the fine state of the object O(n) is the height h(n) of the vertical position of the object (see Example 3.1.2). Thus, h = {h(n); n = 1, 2, 3} is the fine description of the system S. Then the relationship (3.3.14), or equivalently (3.3.15), can be derived from the following mathematical statement:
If a > b and b > c, then a > c.  (3.3.16)
However, to use this possibility the exact information about the values of the three state parameters h(n), n = 1, 2, 3, must be available. Therefore, it may be much more natural to check by simpler means than measuring exact positions whether the subsystems S(n, m) have the property Phg(n, m), and to rely on the empirical justification of the universal relationship (3.3.14). Doing so we need not invoke the fine description of the state by state parameters at all, or even use mathematical concepts, and yet we can improve the quality of many activities on the hierarchical level corresponding to rough descriptions and macro relationships.
In principle, such an approach is not inferior to the mathematical approach, which has its ultimate justification also in empirical facts. On the other hand, the knowledge of the external world that has been built into logical and mathematical statements is so universal that utilizing it is much more efficient than performing several observations of specialized cases.

EXAMPLE 3.3.2 A MACRO RELATIONSHIP BETWEEN PEOPLE AND OBJECTS

We suppose that the components of the system are: S′, a student; S″, another student; R, a dormitory room; N(R), the number of the telephone in the room R. We introduce the following relationships of rank 1:
in(S, R): a student S lives in a dormitory room R;
sh(S′, S″): a student S′ shares a dormitory room with another student S″;
av(S, N): the student S is available at the telephone with number N;
and we denote by s_in, s_sh, and s_av the binary indicators of the existence of the corresponding relationships. From the meaning of the components of the defined system and from many observations it can be concluded that the following universal relationships of rank 2 exist:
s_in(S′, R) ∧ s_sh(S′, S″) ⇒ s_in(S″, R) = 1,  (3.3.17)
s_in(S′, R) ⇒ s_av[S′, N(R)] = 1. □  (3.3.18)
A universal relationship between binary indicators of possession of a property is a counterpart of an algebraic or differential relationship between continuous state components, such as the universal relationships considered in Section 3.2 and the first part of this section. The counterpart of the problem of solving equations involving the continuous or discrete state components is this problem: for a given universal relationship between binary indicators of possession of some properties and for known values of some possession indicators, find the value(s) of the other possession indicators related to the known indicators by the universal relationships. Such a problem is called inference or logical reasoning. Some of the previously mentioned universal logical relationships allow some of the inference problems to be solved directly. One such fundamental universal logical relationship is the relationship (3.3.12). In classical logic the inference based on this relationship is called modus ponens (see, e.g., Frost [3.11]). A simple example illustrates the problem of inference.
EXAMPLE 3.3.3 INFERENCE ABOUT DIRECTLY NOT KNOWN MACRO RELATIONSHIPS BETWEEN PEOPLE AND OBJECTS
We take the system described in Example 3.3.2. We set S' = M(ary), S'' = G(wen). We know that (1) Mary lives in dormitory room #3, (2) Mary shares the room with Gwen, (3) the number of the telephone in room #3 is 9. We wonder by which telephone number we can reach Gwen. Thus, we set

s_in(M, #3) = 1, s_sh(M, G) = 1, R = #3, N(R) = 9.   (3.3.19)
Setting s' = s_in(M, #3) ∧ s_sh(M, G), s'' = s_in(G, #3) in the universal logical relationship (3.3.12) and taking into account the universal relationship (3.3.17), we conclude that s_in(G, #3) = 1. Thus we infer that Gwen lives in room #3. In a similar way, using (3.3.18) we conclude that s_av[G, N(#3)] = 1, thus that Gwen is available at the telephone number 9. The values of the initially unknown logical variables s_in(G, #3) and s_av[G, N(#3)] were obtained by simple substitutions. We assume now that Mary lives in dormitory room #3 and Gwen lives also in dormitory room #3. Thus, we set

s_in(M, #3) = 1, s_in(G, #3) = 1   (3.3.20)

and we wonder whether Mary and Gwen share the same room. Our task is to find formally the value of the Boolean variable s_sh(M, G). We denote by s* = [s_in(M, #3) ∧ s_sh(M, G) ⇒ s_in(G, #3)] the Boolean variable representing the universal relationship (3.3.17). Table 3.3.1 lists the values of the introduced variables.
        s_in(M, #3)   s_sh(M, G)   s_in(M, #3) ∧ s_sh(M, G)   s_in(G, #3)   s*
   a         1             1                   1                   1         1
   b         1             0                   0                   1         0
   c         0             1                   0                   1         0
   d         0             0                   0                   1         0
   e         1             1                   1                   0         0
   f         1             0                   0                   0         1
   g         0             1                   0                   0         1
   h         0             0                   0                   0         1
Table 3.3.1. Truth table for the Boolean variables considered in the example of formal concluding.
From the table we see that (3.3.17) and (3.3.20) are satisfied only for the set a of values. For this set s_sh(M, G) = 1. Thus, knowing that the universal rule (3.3.17) holds, that Mary lives in room #3, and that Gwen lives in room #3, we derived formally that Mary and Gwen share a room. This conclusion is intuitively obvious. However, the presented method can be automated and used in cases when intuitive reasoning is no longer simple. □
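The substitution-based inference used in the first part of this example is easy to mechanize. The following minimal Python sketch is ours, not part of the original text: the predicate labels and constants are illustrative, and it simply applies the universal relationships (3.3.17) and (3.3.18) to the facts (3.3.19) in the style of modus ponens.

```python
# Known facts (3.3.19): binary indicators of the rank-1 relationships.
facts = {
    ("in", "Mary", "#3"): 1,    # Mary lives in dormitory room #3
    ("sh", "Mary", "Gwen"): 1,  # Mary shares the room with Gwen
}
room_phone = {"#3": 9}          # N(#3) = 9

# Universal relationship (3.3.17):
# s_in(S', R) and s_sh(S', S'') imply s_in(S'', R) = 1.
for (rel, s1, room), v in list(facts.items()):
    if rel == "in" and v == 1:
        for (rel2, a, b), w in list(facts.items()):
            if rel2 == "sh" and w == 1 and a == s1:
                facts[("in", b, room)] = 1

# Universal relationship (3.3.18):
# s_in(S', R) implies s_av[S', N(R)] = 1.
for (rel, s1, room), v in list(facts.items()):
    if rel == "in" and v == 1:
        facts[("av", s1, room_phone[room])] = 1

print(facts.get(("in", "Gwen", "#3")))  # 1 -> Gwen lives in room #3
print(facts.get(("av", "Gwen", 9)))     # 1 -> Gwen is reachable at number 9
```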
In more complicated cases the construction of the tables of values may be prohibitively cumbersome. However, quite efficient algorithms are available (see, e.g., Frost [3.11]) that evaluate some Boolean variables when the values of other variables involved in a universal relationship are given.
3.3.3 SET OF POTENTIAL FORMS OF AN INTERNAL STATE
With the internal states we have a similar situation as with external states. Often we do not know the internal state exactly. Then the superior system can improve its purposeful activities if it knows the set of potential forms that the internal state can take. Usually the type of the system, and in consequence the type of the internal state (of the relationship between the external states), is known. Then we have to consider only the set of potential values (forms) of the internal state parameters or functions determining the relationship exactly. For example, suppose that we know that the considered system is a linear resistor. Then we know that the relationship between its external states is described by (3.2.3), and we need only to know the set of potential values of the resistance. Thus, the set of potential forms of the resistance is of type T1. If we know that the system is a linear stationary system, then the relationship between its input and output processes is determined by the pulse response of the system. Thus, the internal state is of type Ts(t). To determine the set of potential forms of the internal state we must know the factors influencing the relationship between the external states of the considered system. Often the system is a product of a hierarchically higher system. Then the relationships between the external states of the system can be considered as embedded external properties of the producing system. For example, as the lower-ranking system we take a resistor. The factory that produced it plays the role of the hierarchically higher system. Knowing the methods of production and testing used by that factory, we can determine the set of potential values of the resistance of a given resistor. Suppose that the resistor has the nominal resistance R_nom and is characterized by a tolerance of a percent. Then the set of potential values of resistance is the tolerance interval <R_nom(1 − a/100), R_nom(1 + a/100)>.
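As a small illustration (our own, with assumed numerical values), the set of potential values of the resistance can be written down directly from the nominal value and the tolerance:

```python
R_nom = 100.0  # nominal resistance in ohms (assumed illustrative value)
a = 5.0        # tolerance in percent (assumed illustrative value)

# Set of potential values of the resistance: the tolerance interval.
tolerance_interval = (R_nom * (1 - a / 100), R_nom * (1 + a / 100))
print(tolerance_interval)  # (95.0, 105.0)
```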
Strictly taken, only in the early stages of development of the theory of electricity were these relationships formulated as generalizations of many observations of the external states of resistors and capacitors, respectively. Today we derive them through mathematical operations from general relationships of the theory of electricity. This does not change our argumentation, since the general relationships are not given a priori but were also formulated as generalizations of very many observations of external electrical states.
The assumption that the pulse lasts only during a finite interval is introduced to simplify the argumentation. The spectral properties of real pulses cause such an assumption to be not satisfied exactly, and a real pulse takes small values outside the finite time interval. We can modify the presented argumentation to take such an effect into account, but this would not change the final conclusions we are going to derive.
The limiting form of such a pulse for Δ → 0 is called the Dirac impulse. Using this concept we define the pulse response as the response to a Dirac impulse. Although for some applications it is useful, the Dirac impulse is a singular function that in some situations "produces" (only formally) bizarre results (see endnote 2 in Chapter 1). The advantage of the definition given is that it directly suggests a procedure for determining the pulse response experimentally.
We introduce assumption 4 to simplify our subsequent argument. From a technical point of view, we should rather assume that the set of potential forms of stored information is discrete. We would then have to introduce a few new concepts to define arithmetic operations on discrete numbers that are generalizations of the binary addition and multiplication. To avoid such a complication we make assumption 4, but a reader familiar with discrete algebra can see that all equations we use are also valid for discrete variables.
In this book we use a prime to denote a variable associated with the variable without the prime, but we do not use the prime to denote the derivative.
Notice that now we start numbering the cycles of operation with 1 and not with 0 as we did for the time-discrete linear system.
REFERENCES
[3.1] Dagpunar, J., Principles of Random Variate Generation, Clarendon Press, Oxford, UK, 1988.
[3.2] Yarmolik, V.N., Demidenko, S.N., Generation and Application of Pseudorandom Sequences for Random Testing, J. Wiley, New York, 1988.
[3.3] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., Numerical Recipes, 2nd ed., Cambridge University Press, Cambridge, UK, 1992.
[3.4] O'Shaughnessy, D., Speech Communication, Prentice Hall, Englewood Cliffs, NJ, 1983.
[3.5] Oppenheim, A.V., Willsky, A.S., Signals and Systems, Prentice Hall, Englewood Cliffs, NJ, 1986.
[3.6] Oppenheim, A.V., Schafer, R.W., Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1986.
[3.7] Alkin, O., PC-DSP, Prentice Hall, Englewood Cliffs, NJ, 1990.
[3.8] Wolfram, S., Mathematica, 2nd ed., Addison-Wesley, Reading, MA, 1991.
[3.9] Arnold, V.I., Ordinary Differential Equations, Springer-Verlag, Berlin, 1992.
[3.10] Seidler, J.A., Principles of Computer Communication Network Design, J. Wiley, New York, 1983.
[3.11] Frost, R., Introduction to Knowledge Base Systems, Collins, London, 1987.
STATISTICAL STATE OF A SYSTEM
Practically every information system has to process not a single piece of information but a train of them. Processing the train as a whole we could exploit all properties of its components, particularly any relationships between them, and consequently minimize the information-processing resources required for each component of the train. To process a train as a whole, sufficiently large resources must be available. However, the resources are usually limited. Then it is natural to divide the primary train of pieces of information into blocks and to process each block separately but according to a rule that takes into account the properties of the whole train. This is called blockwise processing. The general principles of blockwise processing have been presented in Section 1.7.2. A concrete example is adaptive Huffman data compression discussed in Section 6.2. It is shown there that the frequencies of occurrences of potential forms of elements of a train are comprehensive features of the train that allow efficient block-by-block data compression. Dimensionality reduction, considered in Section 7.3, has a similar character. In this case empirical correlation coefficients are the features of the train essential for efficient block-by-block dimensionality reduction.
As indicated in Section 1.4.4, for many systems the frequencies of occurrences of potential forms of a component of the state fluctuate around values that are determined by the properties of the system, but do not depend on a particular observation. Such a system is said to exhibit statistical regularities, and the fixed values around which the frequencies of occurrences fluctuate are called probabilities. If the states exhibit statistical regularities, then often the knowledge of a state influences the statistical regularities of another state. Then a statistical relationship is said to exist between both states. If the states exhibit statistical regularities, then the blocks of states also exhibit statistical regularities. Knowing the probabilities of the potential forms of blocks of elements, we can optimize the blockwise processing without waiting until the whole train is available. In particular, we can optimize the mentioned data compression and dimensionality reduction.
The knowledge of probabilities and of statistical relationships is of paramount importance for types of information processing other than blockwise. Knowing them we can efficiently estimate nonaccessible components of the states on the basis of information about accessible components. In particular, knowing the states that occurred in the past, we can efficiently predict future states. Such estimates can improve dramatically the quality of purposeful actions of the superior system.
This chapter concentrates on the properties of frequencies of occurrences. The first two sections describe those properties for a given train of states, without caring whether another train would have similar properties. Such a model is suitable for processing the whole train ex post, after it is available. A typical example is the compression of information stored on magnetic media. The existence of statistical regularities of the states and, if they do exist, the probabilities are a property of the system. Therefore, we can in principle check the existence of statistical regularities and estimate the probabilities only experimentally. However, in special cases we can deduce the existence of statistical regularities of states from the mechanism by which the system generates a concrete form of the state. In Section 4.3 we discuss the methods of determining, on the basis of the properties of frequencies of occurrences of potential forms of states, whether the potential forms of states exhibit statistical regularities.
Although it seems to be natural, the formalization of the analysis of properties of probabilities based on properties of frequencies of occurrences, in particular in the case of continuous states, is difficult (see, e.g., Mises [4.1]). Therefore, the axiomatic approach to probabilities became prevalent. It is usually called mathematical probability theory (probability theory). We present it in Section 4.4. Section 4.5 discusses in terms of probability theory the previously mentioned special but important cases, when from the mechanism of producing the state we can draw concrete conclusions about its probabilities. Similarly, in Section 4.6 we consider from the point of view of probability theory the consequences of the existence of statistical regularities and derive conclusions that are of paramount importance for information processing. In particular, our considerations lead in a natural way to the important concept of the entropy of potential forms of a state (of information).
The external and internal states discussed in Chapter 3, the state of variety introduced in Section 1.4, and the statistical state described in more detail in this chapter are objective features of a system. Therefore, they may be jointly interpreted as the generalized state that provides a global characteristic of the system from the point of view of purposeful actions involving the system. The concept of the generalized state plays an important role in this book, since it gives insight into the problems of pursuing purposeful actions, particularly processing information, and allows them to be treated in a uniform way. The generalized state and a universal classification of states are the subjects of the last section of this chapter.
4.1 FREQUENCIES OF OCCURRENCES OF DISCRETE STATES
This section analyzes the properties of frequencies of occurrences of potential forms of discrete elementary components of a train. Although very simple and plausible, the concepts introduced are important for two reasons. First, they are directly useful for the design of the previously mentioned blockwise-operating systems processing discrete information.
Second, the analysis of the frequencies of occurrences of discrete states can be generalized. One generalization is for continuous states. This generalization is also a simple but representative illustration of the presentation of properties of continuous states by densities, such as probability or spectral density, that are introduced in the subsequent chapters. The other generalization has a much more fundamental character. It leads to the concept of probability and to the axioms of probability theory, which are presented in Sections 4.3 and 4.4, respectively.
4.1.1 BASIC CONCEPTS
The quality of an action performed by a superior system depends usually on the state of the environment. Therefore, with a state s (with an information) a scalar weight q(s) is usually associated that characterizes the influence of the state (of the information) on the quality of action performed by the superior system (in particular, the quality of processing a specific information). Section 8.1 discusses in detail the methods of choosing the function q(·). A typical superior system performs an action many times. We denote by Q(S_tr) an indicator characterizing the whole train S_tr = {s(i), i = 1, 2, ..., I}, s(i) ∈ S, of states from the point of view of the realization of the purposeful activity (in particular, of information processing). It is natural to base the definition of Q(S_tr) on the definition of the indicator q[s(i)] characterizing the components of the train. Such an approach is discussed in detail in Section 8.1. We show there that it is often justified to take

Q(S_tr) = Σ_{i=1}^{I} q[s(i)].   (4.1.1)
Since the indicator Q(S_tr) usually depends strongly on the length I of the train, it may be more convenient to use the normalized indicator

q̄(I) ≡ Q(S_tr)/I = (1/I) Σ_{i=1}^{I} q[s(i)] = A q[s(i)],   (4.1.2)

where

A ≡ (1/I) Σ_{i=1}^{I}   (4.1.3)

is the arithmetical averaging operation. We assume here that the set of potential forms of the state is discrete:

S = {s_l, l = 1, 2, ..., L}.   (4.1.4)

After grouping in the sum (4.1.2) all components s(i) = s_l we get

q̄(I) = Σ_{l=1}^{L} q(s_l) P*(s_l, I),   (4.1.5)

where

P*(s_l, I) ≡ M(s_l, I)/I   (4.1.6)

is the frequency of occurrences of the state s_l in the train S_tr, and

M(s_l, I) is the number of occurrences of the state s_l in the train S_tr.   (4.1.7)
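A short numerical sketch of definitions (4.1.5)-(4.1.7); the train and the weight function are arbitrary illustrative choices of ours:

```python
from collections import Counter

train = ["s1", "s2", "s1", "s3", "s1", "s2"]  # an illustrative train S_tr, I = 6
q = {"s1": 1.0, "s2": 2.0, "s3": 5.0}         # illustrative weights q(s_l)

I = len(train)
M = Counter(train)                  # M(s_l, I), cf. (4.1.7)
P = {s: M[s] / I for s in M}        # P*(s_l, I), cf. (4.1.6)

q_bar_direct = sum(q[s] for s in train) / I   # arithmetical average, cf. (4.1.2)
q_bar_freq = sum(q[s] * P[s] for s in P)      # grouped form, cf. (4.1.5)

print(P)                            # {'s1': 0.5, 's2': 0.333..., 's3': 0.166...}
print(q_bar_direct, q_bar_freq)     # both equal 2.0
```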
From this definition it follows that

Σ_{l=1}^{L} M(s_l, I) = I.   (4.1.8a)

Dividing both sides by I we get
Σ_{l=1}^{L} P*(s_l, I) = 1.   (4.1.8b)

From (4.1.5) we see that the properties of the train S_tr influence the index Q(S_tr) only through the set

P*(I) = {P*(s_l, I), l = 1, 2, ..., L}   (4.1.9)

of frequencies of occurrences of potential forms of the state. A consequence of this is that if we optimize the rule according to which the superior system performs its actions separately on the elements of the train (in particular, the rule of information processing), then the optimized rule depends only on the set P*(I). Let us introduce the family of weight functions

q_l(s_k) ≡ 1 for k = l, q_l(s_k) ≡ 0 for k ≠ l.   (4.1.10)
From (4.1.2) and (4.1.5) we have

P*(s_l, I) = A q_l[s(i)].   (4.1.11a)

Thus,

The frequencies of occurrences can be considered as arithmetical averages of a specific weight function of a state.   (4.1.11b)

To calculate the frequencies of occurrences the whole train of states must be known. Thus, we can utilize the frequencies only after the train has ended (see our discussion in Section 1.7.2, in particular, Figure 1.27).
4.1.2 THE FREQUENCIES OF JOINT OCCURRENCES OF STATES
In the general definition (4.1.6) of the frequency of occurrences, we did not specify the structure of the state. We assume now that the state is a vector and discuss the relationships between the frequencies of occurrences of its components. To simplify our argument we assume that the state vector has two components:

s = {s(1), s(2)}.   (4.1.12)

We assume that the set of potential values of the component s(n), n = 1, 2 is

S(n) = {s_l(n), l = 1, 2, ..., L(n)}, n = 1, 2.   (4.1.13)

We consider again the train S_tr but now look at it as a train of pairs {s(i, 1), s(i, 2)}. Then

P*[s_l(1), s_k(2), I] ≡ M[s_l(1), s_k(2), I]/I,   (4.1.14a)

where M[s_l(1), s_k(2), I] is the number of pairs in the train the first element of which is s_l(1) and the second s_k(2). We call P*[s_l(1), s_k(2), I] the frequency of joint occurrences of s_l(1) and s_k(2).
Obviously, M[s_l(1), s_k(2), I] and P*[s_l(1), s_k(2), I] are the previously introduced M(s_l, I) and P*(s_l, I) written in another form. In this notation equation (4.1.8b) takes the form:

Σ_{l=1}^{L(1)} Σ_{k=1}^{L(2)} P*[s_l(1), s_k(2), I] = 1.   (4.1.15)

Next, we denote by

M[s_l(1), I] the number of elements (pairs) in the train with s_l(1) as the first component and any second component.   (4.1.16)

From the definitions of M[s_l(1), s_k(2), I] and M[s_l(1), I] it follows that

M[s_l(1), I] = Σ_{k=1}^{L(2)} M[s_l(1), s_k(2), I].   (4.1.17)

Dividing both sides of this equation by I, we get

P*[s_l(1), I] = Σ_{k=1}^{L(2)} P*[s_l(1), s_k(2), I],   (4.1.18)

where

P*[s_l(1), I] ≡ M[s_l(1), I]/I   (4.1.19)

is the frequency of occurrences of the form s_l(1) of the first component of the vector state s irrespective of the form of the second component. If we evaluate the probability P*[s_l(1), I] from (4.1.18) we call it the marginal probability. Therefore, equation (4.1.18) is called the marginal probability formula. This is one of two equations that play a key role in our future considerations about utilizing the information about the statistical state. For symmetry reasons we have also

P*[s_k(2), I] = Σ_{l=1}^{L(1)} P*[s_l(1), s_k(2), I].   (4.1.20)
CONDITIONAL FREQUENCIES OF OCCURRENCES
Let us suppose that the first component s_l(1) of the pair {s_l(1), s_k(2)} is fixed. According to definition (4.1.16) we have M[s_l(1), I] such pairs. Let us next take a concrete s_k(2). In the train S_tr we have M[s_l(1), s_k(2), I] pairs with fixed first and second components. Therefore, the frequency of occurrences of the state s_k(2) in the class of such pairs whose first component is s_l(1) is

P*[s_k(2)|s_l(1), I] ≡ M[s_l(1), s_k(2), I]/M[s_l(1), I].   (4.1.21)

We call P*[s_k(2)|s_l(1), I] the conditional frequency of occurrences of the component s_k(2) on the condition that s_l(1) is known. Dividing numerator and denominator by I and using (4.1.14) and (4.1.19) we obtain

P*[s_k(2)|s_l(1), I] = P*[s_l(1), s_k(2), I]/P*[s_l(1), I],   (4.1.22)
where P*[s_l(1), s_k(2), I] is the frequency of occurrences of the pair [s_l(1), s_k(2)], and P*[s_l(1), I] is the frequency of occurrences of the state s_l(1) (the marginal probability). Chapter 5 shows that when exact information about the state s_k(2) is not available but s_l(1) and the conditional frequency P*[s_k(2)|s_l(1), I] are known, we can draw conclusions about s_k(2). Therefore, equation (4.1.22) is the second most important equation for information processing. Because the roles of both components are symmetrical, we have also

P*[s_l(1)|s_k(2), I] = P*[s_l(1), s_k(2), I]/P*[s_k(2), I].   (4.1.23)
This equation together with (4.1.18) allows conditional frequencies with one of the components fixed to be calculated when the conditional frequencies with the other component fixed are known. In the following chapters, we often use this possibility. The set of joint frequencies

{P*[s_l(1), s_k(2), I]; l = 1, 2, ..., L(1), k = 1, 2, ..., L(2)},   (4.1.24)

or equivalently the set of conditional frequencies

{P*[s_k(2)|s_l(1), I]; l = 1, 2, ..., L(1), k = 1, 2, ..., L(2)}   (4.1.25)

together with the set of marginal frequencies of occurrences

{P*[s_l(1), I]; l = 1, 2, ..., L(1)},   (4.1.26)
provide the complete description of statistical relationships between the states s(1) and s(2). In general, the unconditional (marginal) frequency of occurrences P*[s_k(2), I] and the conditional frequency P*[s_k(2)|s_l(1), I] are different. We can interpret this as an indication that a relationship exists between the component s_l(1) and the component s_k(2). Therefore, if

P*[s_k(2)|s_l(1), I] = P*[s_k(2), I], ∀ l, ∀ k,   (4.1.27)

we say that the states s(1) and s(2) are statistically independent. Then from (4.1.23) we have

P*[s_l(1), s_k(2), I] = P*[s_l(1), I] P*[s_k(2), I].   (4.1.28)

Let us illustrate the introduced concepts with a numerical example.
EXAMPLE 4.1.1 CALCULATION OF MARGINAL AND CONDITIONAL FREQUENCIES
We take L(1) = 6, L(2) = 6; thus, the number of potential forms of the pair (vector) {s(1), s(2)} is L = 36. We assume that for some large I the joint frequencies can be described approximately by the equation

P*[s_l(1), s_k(2), I] = A exp(−a|l − k|),   (4.1.29)

where a is a parameter. The parameter A we obtain from the condition (4.1.15). Taking various values of a we obtain a class of joint frequencies.
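The calculation can be sketched in a few lines of Python (our own illustration, reading (4.1.29) literally and normalizing by (4.1.15)):

```python
import math

L1, L2 = 6, 6
a = 1.0                                  # the parameter in (4.1.29)

raw = {(l, k): math.exp(-a * abs(l - k))
       for l in range(1, L1 + 1) for k in range(1, L2 + 1)}
A = 1.0 / sum(raw.values())              # normalization from condition (4.1.15)
P_joint = {lk: A * v for lk, v in raw.items()}   # joint frequencies

# Marginal frequencies, cf. (4.1.18) and (4.1.20).
P1 = {l: sum(P_joint[(l, k)] for k in range(1, L2 + 1)) for l in range(1, L1 + 1)}
P2 = {k: sum(P_joint[(l, k)] for l in range(1, L1 + 1)) for k in range(1, L2 + 1)}

# Conditional frequencies, cf. (4.1.22).
P_cond = {(l, k): P_joint[(l, k)] / P1[l] for (l, k) in P_joint}

print(round(sum(P_joint.values()), 6))                    # 1.0
print({l: round(P1[l], 3) for l in range(1, L1 + 1)})     # marginals of s(1)
print({k: round(P_cond[(1, k)], 3) for k in range(1, L2 + 1)})  # conditionals given s_1(1)
```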
Figure 4.1. Graphical representation of the joint probabilities given by (4.1.29) (a) and of the corresponding marginal probabilities calculated from formula (4.1.20) (b), for a = 0.5 and a = 1. The probability is proportional to the area of the disc.
The joint frequency P*[s_l(1), s_k(2), I] is represented in Figure 4.1a as a disc the area of which is proportional to P*[s_l(1), s_k(2), I]. In the same manner we represent the marginal frequencies P*[s_l(1), I]. Because of symmetry the frequencies P*[s_k(2), I] are the same, and therefore, we do not show them. From (4.1.23) it follows that the conditional frequencies are proportional to the joint frequencies. Thus, a row on the square in Figure 4.1 illustrates the relative values of the conditional frequencies of occurrences. From Figure 4.1 we see that only for a = 1 are the conditional and marginal frequencies the same. Thus, only for this value of a are the components independent. We say then that the set of potential forms of the trains is statistically unconstrained. In the limiting case when a → 0, the one component determines exactly the other; thus, only the pairs {s_l(1), s_l(2)}, l = 1, 2, ..., 6 occur in the train. In the other limiting case a → ∞, the relationship becomes again rigid, but only the two pairs {s_6(1), s_1(2)} and {s_1(1), s_6(2)} occur in the train. For a = 0, all possible pairs occur with the same frequencies. □
4.1.3 GENERALIZATIONS
We introduced several assumptions only to simplify the notation and terminology, but we did not really use them in our argument. Therefore, we can directly generalize the previously introduced concepts and statements. We now illustrate such possibilities with two examples chosen so that they lead to equations that are needed in subsequent chapters. First, we show how to define the concept of conditional frequencies when the components of the state are not scalars, as we assumed, but vectors. We assume that
• The first component of the state s(i) is an (N−1)-dimensional vector s(1) = {s(1, n); n = 1, 2, ..., N−1}, where the s(1, n) are scalars;
• S(1) = {s_l(1), l = 1, 2, ..., L(1)} is the set of potential forms of each s(1, n);
• The second component is a scalar s(2); S(2) = {s_k(2), k = 1, 2, ..., L(2)} is the set of its potential forms.
The generalization of definition (4.1.23) is

P*[s_k(2)|s_l(1), I] = P*[s_l(1), s_k(2), I]/P*[s_l(1), I],   (4.1.30)

where s_l(1) = {s_l(1)(1), s_l(2)(1), ..., s_l(N−1)(1)} is the concrete form of the first component of the state s(1) and l = {l(1), l(2), ..., l(N−1)}, l(n) ∈ {1, 2, ..., L(1)}, is the vector of indices of the forms of the elementary components s(1, n) of s(1).
Next consider a generalization of the fundamental definition (4.1.6) for rough states. We assume that S = {s_l, l = 1, 2, ..., L} is the set of potential forms of the exact, primary state, but we are interested in the rough description

ŝ = {s_1, s_2, ..., s_J}   (4.1.31)

consisting of the first J < L potential forms of the primary state treated as a single aggregated form. The corresponding frequency of occurrences is

P*(ŝ, I) ≡ M(ŝ, I)/I,   (4.1.32)

where M(ŝ, I) is the number of occurrences in the train of the event s(i) ∈ ŝ. From definition (4.1.6) of the frequencies of the primary states and from the definition of M(ŝ, I) it follows that

P*(ŝ, I) = Σ_{l=1}^{J} P*(s_l, I).   (4.1.33)
In a similar way, we define joint frequencies of occurrences and conditional frequencies when both states are rough states or one state is a rough state and the other is a state parameter or vector.
4.2 FREQUENCIES OF OCCURRENCES OF CONTINUOUS STATES
This section assumes that the elementary component of the train is a continuous state. To simplify the argument and terminology, we consider again a train S_tr = {s(i), i = 1, 2, ..., I} of state parameters. We assume that the set of potential forms of each s(i) is an interval S = <s_a, s_b>.
Behind the definition of the frequencies of occurrences of a potential state in a train of continuous states are the general relationships between the continuous and discrete models of information (of states) discussed in Section 1.4.3. The basic idea is to approximate the continuous state by a discrete state and to define the frequencies of occurrences of a potential form of the continuous state in terms of frequencies of occurrences of potential forms of the discrete approximating state. Since a similar approach can be used also for the definition of probability density (Section 4.4) and spectral density (Section 7.4.2), the method of discrete approximation is described in more detail.
4.2.1 THE DISCRETE APPROXIMATION OF A CONTINUOUS PROCESS
As the discrete state parameter approximating the primary continuous state s = s(i), we take the discrete state parameter obtained by the uniform scalar quantization described in Section 1.5.3. In the notation that is used here the transformation (1.5.16) takes the form:

s_d(i, L) = s_l(L) if |s(i) − s_l(L)| < |s(i) − s_k(L)| ∀ k ≠ l,   (4.2.1)

where s_d(i, L) is the discrete state approximating the primary continuous state s(i), and

s_l(L) = s_a + (l − 1/2)Δ(L), l = 1, 2, ..., L   (4.2.2)

are the potential forms of the discrete approximating state s_d(i, L) (the same for all i), and

Δ(L) = (s_b − s_a)/L   (4.2.3)

is the distance between the quantization thresholds (see (1.5.14)). The transformation (4.2.1) is illustrated in Figure 4.2.
Figure 4.2. Transformation producing the discrete approximation s_d(i, L) of the primary continuous state s(i); L = 4.
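A minimal sketch of the quantization (4.2.1)-(4.2.3), with an illustrative interval and illustrative sample values of our choosing:

```python
def quantize(s, s_a, s_b, L):
    """Uniform scalar quantization: return the representative s_l(L)
    closest to s, cf. (4.2.1), with levels given by (4.2.2)-(4.2.3)."""
    delta = (s_b - s_a) / L                                       # Delta(L), (4.2.3)
    levels = [s_a + (l - 0.5) * delta for l in range(1, L + 1)]   # s_l(L), (4.2.2)
    return min(levels, key=lambda level: abs(s - level))

# Illustrative train of continuous states on <0, 1>, quantized with L = 4.
train = [0.05, 0.31, 0.33, 0.72, 0.95, 0.51]
print([quantize(s, 0.0, 1.0, 4) for s in train])
# [0.125, 0.375, 0.375, 0.625, 0.875, 0.625]
```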
Since we are interested in the accuracy of the approximation, and this accuracy is determined by the number L of potential forms of the discrete information, in the introduced notation we indicate explicitly the dependence of the considered variables on L. The discretized state can be considered as a rough description of the primary state (see Section 3.1.3). From equations (1.5.15) and (1.5.18) we obtain the corresponding aggregation intervals (sets):

𝒜_l(L) = < [s_{l−1}(L) + s_l(L)]/2, [s_l(L) + s_{l+1}(L)]/2 >.   (4.2.4)

For a given L we introduce:
M[s_l(L), I], the number of occurrences of the discrete state s_l(L) in the train S_trd = {s_d(i, L), i = 1, 2, ..., I} of the discretized states,
P*[s_l(L), I] ≡ M[s_l(L), I]/I, the corresponding frequency of occurrences.
From the definition of the discretized state s_d(i, L) it follows that

M[s_l(L), I] is the number of occurrences in the train of the event s(i) ∈ 𝒜_l(L)   (4.2.5)

and

P*[s_l(L), I] is the frequency of occurrences in the train of the event s(i) ∈ 𝒜_l(L).   (4.2.6)

To define more precisely the relationship between the train S_tr of continuous states and the train S_trd of discretized states, we introduce the auxiliary function:

P*_c(s, I, L) = P*[s_l(L), I] for s ∈ 𝒜_l(L).   (4.2.7)

This is a function of the continuous argument s ∈ <s_a, s_b>. Therefore, it is called a continuous envelope of the set of the probabilities P*[s_l(L), I]. This envelope has the form of steps with heights P*[s_l(L), I]. A typical bar diagram of the frequencies P*[s_l(L), I] is shown in Figure 4.3A1, while that of their continuous envelope is shown in Figures 4.3A2 and 4.3A3. When L grows, the potential forms of the discretized state s_d(i, L) become more densely distributed in the interval <s_a, s_b>, as shown in Figures 4.3A2 and 4.3A3. However, with growing L the length Δ(L) of each aggregation interval 𝒜_l(L) decreases. This usually causes the chance that an s(i) falls into an aggregation set 𝒜_l(L) covering a given s' ∈ <s_a, s_b> to decrease too. In other words, when L grows, the value that the continuous envelope P*_c(s', I, L) takes for a given s' decreases with L. Figures 4.3A2 and 4.3A3 illustrate this effect. Figure 4.3A4 illustrates the limiting case when L → ∞ and all P*[s_l(L), I] are almost zero.
Figure 4.3. Illustration of the concept of the density of occurrences of a potential form of a continuous state in a train of continuous states; left column: the frequencies of occurrences of the discretized states (A1) and the continuous envelopes of the trains of frequencies (A2), (A3); right column: the corresponding normalized frequencies and their continuous envelopes; (B4) shows the limiting continuous envelope (the density of the frequency of occurrences). It is assumed that the length of the aggregation interval corresponding to L = 4 is a unit length; thus, Δ(4) = 1.
4.2.2 THE DENSITY OF OCCURRENCES OF A CONTINUOUS STATE
The discussed dependence of P*_c(s, I, L) on L causes this frequency to be an unsuitable characteristic of the frequencies of occurrences of the primary continuous states. However, our considerations indicate that it is natural to normalize the frequency P*[s_l(L), I] with respect to the size of the corresponding aggregation set 𝒜_l(L). The ratio

p*[s_l(L), I] ≡ P*[s_l(L), I]/Δ(L)   (4.2.8)

is such a normalized characteristic, called the density of occurrences of (continuous) states. Similarly to (4.2.7) we define for the train p*[s_l(L), I], l = 1, 2, ..., L its continuous envelope

p*_c(s, I, L) ≡ P*[s_l(L), I]/Δ(L) for s ∈ 𝒜_l(L),   (4.2.9)

which is a function of the continuous argument s. The bar diagrams of the train p*[s_l(L), I] and of the continuous envelope p*_c(s, I, 4) are shown in the right column of Figure 4.3. From (4.2.8) and from the definition (4.2.9) it follows that

∫_{s_a}^{s_b} p*_c(s, I, L) ds = 1.   (4.2.10)

Similarly, using (4.2.9) we write (4.1.5) in the form

q̄(I) = Σ_{l=1}^{L} q(s_l) P*(s_l, I) = ∫_{s_a}^{s_b} q*_c(s, I, L) p*_c(s, I, L) ds,   (4.2.11)

where q*_c(s, I, L) is defined similarly to P*_c(s, I, L) (equation (4.2.7)), however with q[s_l(L)] in place of P*[s_l(L), I]. Observations of many systems show that if L becomes large, then L almost does not influence the approximating function p*_c(s, I, L). Thus, we may say that it "converges" to a "limiting" function p*(s, I), s ∈ <s_a, s_b>, and write this in the form:

p*_c(s, I, L) → p*(s, I) as Δ(L) → 0.   (4.2.12)

This function, considered as a whole, is for the continuous states the counterpart of the set {P*(s_l, I), l = 1, 2, ..., L} characterizing the train of discrete states. Approximating p*_c(s, I, L) by p*(s, I), from (4.2.11) we have

q̄(I) = ∫_{s_a}^{s_b} q(s) p*(s, I) ds.   (4.2.13)
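A small numerical sketch (ours) of (4.2.8), (4.2.10), and (4.2.13): the frequencies of the quantized states are normalized by Δ(L) to obtain a density, the normalization (4.2.10) is checked, and the average of a weight function is approximated by the integral expression, here evaluated as a sum over the quantization cells.

```python
from collections import Counter

s_a, s_b, L = 0.0, 1.0, 10
delta = (s_b - s_a) / L                           # Delta(L), cf. (4.2.3)
train = [0.11, 0.12, 0.35, 0.36, 0.37, 0.52, 0.74, 0.75, 0.76, 0.91]  # illustrative

def cell(s):
    """Index of the aggregation interval containing s (cells 0, ..., L-1)."""
    return min(int((s - s_a) / delta), L - 1)

I = len(train)
counts = Counter(cell(s) for s in train)
P = {l: counts.get(l, 0) / I for l in range(L)}   # P*[s_l(L), I]
p = {l: P[l] / delta for l in range(L)}           # density, cf. (4.2.8)

print(round(sum(p[l] * delta for l in range(L)), 6))   # 1.0, cf. (4.2.10)

def q(s):                                         # an illustrative weight function
    return s * s

centers = {l: s_a + (l + 0.5) * delta for l in range(L)}
q_bar_exact = sum(q(s) for s in train) / I                            # cf. (4.1.2)
q_bar_density = sum(q(centers[l]) * p[l] * delta for l in range(L))   # cf. (4.2.13)
print(round(q_bar_exact, 4), round(q_bar_density, 4))
```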
COMMENT 1
The frequencies of occurrences appeared here as compressed information about the primary train of states S_tr, which is relevant from the point of view of the class of performance criteria Q(S_tr) given by (4.1.1). In particular, we did not introduce any assumptions about the existence of statistical regularities. Therefore, if we take Q(S_tr) as the criterion, the compression of the primary data into the much simpler set of frequencies of occurrences does not deteriorate the quality of primary data processing. To evaluate the frequencies of occurrences, the whole train must be available. This is, for example, possible in off-line compression systems based on the JPEG standard described in Section 2.5 or in the adaptive Huffman compression systems considered in Section 6.2.1. Another example are the systems with a training cycle described in Section 1.7.2 (see Figure 1.27). However, for several types of information processing, particularly for making decisions about future actions such as prediction, the existence of statistical regularities is essential. We discuss such problems and the relationships between the frequencies of occurrences and probabilities in the next two sections.
While the considerations about frequencies of occurrences of discrete states are formally strict, the concept of convergence in definition (4.2.12) has a heuristic character. Consequently, the equation (4.2.13) should be considered at this stage as an empirical approximation that may be useful for calculations. Thus, it is an example of the discrete approximations discussed in Section 1.4.3. The problems of convergence in (4.2.12) and of the accuracy of expression (4.2.13) are discussed in the next two sections.
COMMENT 2
From (4.2.6) it follows that the basic characteristic of the frequency of occurrences of continuous states is the frequency of the event that an element of the train falls into the aggregation set 𝒜_l(L). Thus, the primary description of the frequencies of occurrences of continuous states is a function of sets, assigning a number to a given set. The length Δ(L) of the aggregation interval is another function of the set. The density of frequencies, defined by (4.2.8) and (4.2.12), is the limit of the ratio of those two functions of the set (the aggregation interval) when this set shrinks to a point. Consequently, the density of occurrences is a function of a point. Therefore, it is much easier to handle than the frequency of occurrences, which for continuous states is a function of a set. Using information-processing terminology, we may say that the density of occurrences is a representation of the frequency of continuous states. However, we pay a price for describing the frequencies of occurrences of continuous states by the density. Namely, the density depends not only on the objective features of the train but also on the definition of the length of the aggregation interval, which is subjective. In particular, the numerical value of the length depends on the units of length, which in principle are set in an arbitrary way. For example, if the state parameter s is an electrical potential, the physical dimension of the length of the interval Δ(L) is volts. Since the frequency of occurrences has no physical dimension, the physical dimension of the density p*(s, I) is [V]⁻¹.
4.3 STATISTICAL REGULARITIES
The previous two sections considered the train of states as fixed. Now we look at the properties of the frequencies when the length I of the train grows. For many systems, the frequencies of occurrences then begin to stabilize, in other words, to converge to limit values. The first two parts of this section discuss the effect of the convergence of the frequencies of occurrences of states in a train, which can be interpreted as a train of samples of an evolving state process, taken in subsequent instants. The third part discusses an analogous effect when we observe at one instant the states of several similar systems.
4.3.1 STATISTICAL REGULARITIES IN TRAINS OF DISCRETE STATES
In the previous considerations we did not specify the meaning of the element s(i) of the considered train S_tr of states, and we considered the number I of its elements as fixed. Now we assume that
A1. We observe the system during a time interval
For many systems, when the length I of the train of samples grows, the frequencies of occurrences approach limits:

P*(s_l, I) → P*(s_l), ∀ l, as I → ∞.   (4.3.4)
Figure 4.4. Illustration of the assumptions and of the notation: (a) the single train, (b) a pair of trains.
We say that the states having this property exhibit statistical regularities, and the limiting value P*(s_l) is called the empirical probability. The statistical regularities are described by the set of empirical probabilities

P* = {P*(s_l), l = 1, 2, ..., L}.   (4.3.5)

From (4.3.4) it follows that

q̄(I) → q̄ as I → ∞,   (4.3.6a)

where q̄(I) is the arithmetical average over the train given by (4.1.2) and

q̄ = Σ_{l=1}^{L} q(s_l) P*(s_l)   (4.3.6b)

is the empirical statistical average. From (4.1.11) it follows that

If with growing length I of the train the arithmetical average q̄(I) approaches a limit, then the states exhibit statistical regularities.   (4.3.7)

Till this point we assumed that the instantaneous state s_c(t) is a scalar. In general, the state may be structured. If the frequencies of joint occurrences defined by (4.1.14) have the counterpart of property (4.3.4), we say that the components of the structured state exhibit joint statistical regularities. The corresponding limiting values of the frequencies of joint occurrences are called empirical joint probabilities. The fundamental properties (4.1.8), (4.1.15), and (4.1.18) do not depend on the length I of the train. Therefore, the empirical probabilities possess them too. We can also use for them the definition (4.1.21) of the conditional probability. In all these equations we have only to drop I. When the empirical conditional probability of the state s(2) does not depend on the state s(1), we say that both states are statistically independent.
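The stabilization of the frequencies described by (4.3.4) can be observed numerically. The following sketch is only an illustration: a pseudo-random generator with built-in probabilities plays the role of the observed system.

```python
import random

random.seed(0)
forms = ["s1", "s2", "s3"]
probs = [0.5, 0.3, 0.2]          # probabilities built into the simulated system

def frequency(I, form):
    """Frequency of occurrences P*(form, I) in a simulated train of length I."""
    train = random.choices(forms, weights=probs, k=I)
    return train.count(form) / I

for I in (100, 10_000, 1_000_000):
    print(I, round(frequency(I, "s1"), 4))   # approaches 0.5 as I grows
```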
STATIONARITY OF STATISTICAL REGULARITIES
The introduced empirical probabilities characterize a long train of states. In some cases the properties of the system may change during such a long train and cause local changes of the frequencies of occurrences. The superior system may use such a behaviour of its environment to improve the performance of purposeful actions, but this would complicate the optimal rules of operation. However, many systems can be considered as time invariant, and then the local frequencies of occurrences of states do not change during the operation of the superior system. Then the goals of the superior system can be realized efficiently in a simpler way. Therefore, it is important to find out whether the frequencies of occurrences are time invariant. To do this we take two nonoverlapping time intervals, observe the trains of samples taken in each of them (see Figure 4.4b), and calculate the corresponding frequencies of occurrences P*_1(s_l, I) and P*_2(s_l, I). For many systems, when the lengths of the trains grow, the differences
between the frequencies of occurrences of a potential state of the samples taken in both time intervals become very small; we write this in the form

|P*_1(s_l, I) − P*_2(s_l, I)| → 0, ∀ l.   (4.3.8a)

The frequencies P*_k(s_l, I), k = 1, 2 approach the same limit P*(s_l); we write this in the form

P*_k(s_l, I) → P*(s_l), ∀ l, k = 1, 2.   (4.3.8b)

We say that a state parameter for which (4.3.8) holds exhibits statistical regularities and that they are stationary (equivalently, time invariant). Let us comment on the introduced concepts.
COMMENT 1
We can expect that the necessary condition for the statistical properties to be time invariant is that the system is stationary (the relationships between the external states are time invariant; see Section 3.2.5). However, the internal states of most real systems change in time either in a periodical or in a systematic way. In particular, the internal states of most systems serving people exhibit daily and yearly cycles of changes. All systems age. However, often we are interested in a system only during a time interval such that the changes of the internal states of the system during this interval are small compared with the changes of the instantaneous external states. We call such a system quasistationary. For such a system we may assume that the statistical properties are stationary, but this is in principle an inconsistent assumption. In particular, in definition (4.3.4) we have to take I large, but not very large. To simplify the argument and notation we assumed that the state is a scalar and is discrete. As in the previous section, we can generalize our argument for structured and/or continuous states. We sketch such a generalization for continuous states.
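A sketch of the check suggested by (4.3.8a): the samples taken in two nonoverlapping observation intervals are compared form by form (the trains below are illustrative).

```python
from collections import Counter

def frequencies(samples):
    I = len(samples)
    return {s: m / I for s, m in Counter(samples).items()}

def max_frequency_difference(train1, train2):
    """Largest |P*_1(s_l, I) - P*_2(s_l, I)| over all potential forms, cf. (4.3.8a)."""
    P1, P2 = frequencies(train1), frequencies(train2)
    forms = set(P1) | set(P2)
    return max(abs(P1.get(s, 0.0) - P2.get(s, 0.0)) for s in forms)

# Illustrative trains of samples taken in two nonoverlapping time intervals.
train1 = ["a", "b", "a", "a", "c", "b", "a", "c"]
train2 = ["a", "a", "b", "c", "a", "b", "a", "c"]
print(max_frequency_difference(train1, train2))  # a small value suggests stationarity
```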
4.3.2 STATISTICAL REGULARITIES IN TRAINS OF CONTINUOUS STATES
Instead of A4.1 we assume:
A4.2 The set S of potential forms of an instantaneous state s(t), thus of a sample s(i), is the interval <s_a, s_b>.
In the previous section the frequencies of occurrences of continuous states have been defined by means of approximating discrete states. If for every number L of potential forms of the discrete approximation the approximation exhibits the statistical regularities, we say that the train of continuous states exhibits statistical regularities. They are described by the function p*(s), to which the density of the frequency of occurrences p*_c(s, I, L) (see (4.2.9)) "converges" when the number of potential forms of the discretized state L → ∞ or, equivalently, when the size Δ(L) of the set of potential forms of the continuous state that are aggregated into one discretized state tends to 0:

p*_c(s, I, L) → p*(s) as L → ∞, I → ∞.   (4.3.9)

Since the condition I > L must be satisfied to achieve L → ∞, we must also increase the number of samples we observe; thus, the condition I → ∞ must also be satisfied. The function p*(s), s ∈ <s_a, s_b>, is called the density of empirical probability. This function considered as a whole describes the statistical properties of a continuous state similarly as the set P* of empirical probabilities given by (4.3.5) describes the statistical properties of the discrete state. Therefore, we call both P* and p*(s), s ∈ <s_a, s_b>, the empirical probability distribution of a state exhibiting statistical regularities. Let us denote by q̄(L, I) the arithmetical average of the train of the discrete approximating states, given by (4.2.11) with s_l(L) in place of s_l. From (4.3.9) it follows that

q̄(L, I) → q̄ as L → ∞, I → ∞,   (4.3.10)

where

q̄ = ∫_{s_a}^{s_b} q(s) p*(s) ds   (4.3.11)

is the statistical average of the function q(s) of the continuous variable s. Similarly to (4.3.7) we conclude that

If with growing number L of forms of the uniformly quantized state and with growing length I of the train the empirical average q̄(L, I) approaches a limit, then the continuous state exhibits statistical regularities.   (4.3.12)

An example illustrates these general considerations about empirical probabilities.
EXAMPLE 4.3.1 STATISTICAL PROPERTIES OF THERMAL NOISE OF A RESISTOR
Consider a resistor made out of metal wire. In the macro scale it can be considered as the two-terminal accessible system considered in Section 3.2.1. In the atomic scale the metal consists of crystals, and a crystal is an assembly of atoms of the metal located at the nodes of a spatial grid. Between those atoms move free electrons that collide with the atoms. At each collision a potential impulse is induced on the terminals of the resistor. Very many of such pulses overlap and cause a potential difference in the macro scale to develop between the terminals of the resistor. It is called thermal noise. A typical diagram of such a process is shown in Figure 4.5.
Figure 4.5. Typical electrical potential process produced on a resistor by the thermal movement of electrons (thermal fluctuation noise).
As the parameter describing the external state of the resistor we take the instantaneous potential difference v_12(t). A detailed analysis of the elementary potential pulses generated by the collisions shows that the statistical average of v_12(t) is zero and that the statistical average of the squared potential difference (an indicator of its "magnitude") is

kTRB,   (4.3.14)

where k is the Boltzmann constant, T the temperature in kelvins, B the frequency of the highest sinusoidal component of the potential process, and R the resistance of the resistor. From equation (4.3.14) it follows that if the temperature is constant, the potential process is stationary. However, if the temperature changed, the frequencies of occurrences of a given quantized difference of potentials in an observation interval
4.3.3 STATISTICAL REGULARITIES IN ENSEMBLES OF SYSTEMS
States were interpreted in the previous considerations as instantaneous states of a system in a train of instants. There is another great domain in which the concepts of statistical regularities and empirical probabilities occur. Namely, a system can often be considered to be an element of an ensemble of similar systems. By "similar" we mean that the systems have the same general structure and properties but may differ in some details. As an illustration, take our standard example, the resistor. The ensemble would be a set of resistors produced in a factory using the same raw materials, the same technological process, and the same machinery. In spite of these similarities, each resistor would have slightly different properties, and obviously the state on the atomic structural level would be different for every resistor. Notice that, considering the set of potential forms of an internal state in Section 3.3.3, we already introduced the concept of the ensemble of systems. To define the statistical regularities of a nonstationary system, instead of choosing a set of sampling instants and considering the set of time samples of the state of the same system, we choose various sample systems out of the ensemble. Thus, we interpret the sample s(i) introduced at the beginning of our considerations about the statistical state in Section 4.3.1 as the state at a given instant t of the i-th system a_i chosen out of the ensemble E:

s(i) = s_{a_i}(t).   (4.3.15)

Replacing τ_i, defined by (4.3.1), by a_i, we can utilize all our previous arguments and definitions to define the frequencies of occurrences, statistical averages, statistical regularities, and empirical probabilities for the states (external, internal) of a system chosen out of the ensemble E. The counterpart of a stationary system is a homogeneous ensemble, for which the frequencies of occurrences of states of systems chosen out of two subsets of E "converge" to the same limiting values. To this point we considered the instant t at which we observe the systems as fixed. If we consider this instant as variable, the concept of statistical regularities in an ensemble of systems allows us to define in a systematic way the time-dependent statistical regularities, particularly the time-dependent empirical probabilities.
4.3.4 TESTING THE EXISTENCE OF STATISTICAL REGULARITIES AND ESTIMATION OF PROBABILITY DISTRIBUTION
The basic two problems are
• to determine whether the state parameters exhibit statistical regularities, thus whether the frequencies of occurrences or, equivalently, the arithmetical averages converge, and
• if they do, how to find the probability distribution.
The first problem is called testing whether statistical regularities occur (briefly, the testing problem), and the second is called the statistical identification problem.
The approach to these problems presented here has a heuristic character. The concepts "very small" and "approaches a limit" have been used in the heuristic sense. Thus, the definitions (4.3.4), (4.3.8), and (4.3.9) of the empirical probability, of the probability density, and of the empirical averages are not strict in the mathematical sense. They are, rather, practical instructions how to check whether a state parameter exhibits statistical regularities and, if it does, how to estimate them. Several rules used in statistics have a similar character (see, e.g., Frank, Althoen [4.2] for an introduction, Kotz, Johnson [4.3] for an encyclopedic review, and Press [4.4, ch. 13] for concrete programs for obtaining an estimate of a probability distribution). The empirical approach can be justified by the usefulness criterion. We assume tentatively that the states exhibit statistical regularities and that the estimates of probabilities are exact. Using such design information we find the rules of optimal information processing (see, e.g., Section 1.7.1). If the performance of the superior system is improved, we consider ex post the tentative assumptions about the existence of statistical regularities to be true. Although it may seem inconsistent, such a procedure is, in fact, the ultimate justification for the application of all the mathematical models we apply.
The attempts to formalize the empirical approach have a long history (see the classical work by von Mises [4.1] and, for more recent discussions and references, Shafer, Pearl [4.5, ch. 4, the tutorial]). The difficulties in formalizing the empirical approach caused the axiomatic approach to become prevalent in mathematical probability theory (see Kolmogorov [4.6], Renyi [4.7], Billingsley [4.8]). This approach is described in the next section, which also explains why, in spite of its mathematical elegance, the axiomatic approach does not replace the heuristic approach presented here.
The basic approach of the axiomatic probability theory to the testing problem is to analyze the convergence of arithmetical averages under assumptions about the probabilities of the samples that are so weak that it is possible to justify them knowing only some general properties of the mechanism generating the successive samples. An example is the assumption that the samples are statistically independent. The theorems about the convergence of averages in such a case are called laws of large numbers (for basic theorems see Renyi [4.7], Papoulis [4.9], Breiman [4.10]; for more detailed studies Reves [4.11]). More complicated is the case when a sample is produced by an indeterministic transformation from the previous sample (such as in the Markovian process considered in Section 5.3). The analysis of the convergence of arithmetical averages of statistically related samples is the subject of ergodic theory (for an introduction see Breiman [4.10]; for a more detailed study see Mane [4.12]).
The basic approach to the statistical identification problem is to gain more insight into the mechanism generating the state. This mechanism is described by (1) the universal relationships between the states of the system (the internal states of the system), (2) the external factors influencing the system, and (3) the initial states of the system. These are the factors determining the value of a sample. The approach analyzing those factors is called the analytic approach.
The typical analytic approach of the axiomatic probability theory to the statistical identification problem is (1) to assume that the sample is generated by a transformation from some primary states, called causing states, (2) to assume that the causing states exhibit statistical regularities, and (3) to look for such generating transformations that the probability distribution can be approximated by a probability distribution depending on the generating transformation and only on some simple statistical features of the causing states. The states produced by such a mechanism may be called states weakly dependent on the statistics of the causing factors, and the approximating probability distribution is called the limiting distribution. Section 4.5.1 shows that the uniform distribution can be considered as a limiting distribution, while Section 5.5.2 shows that the Poisson distribution has such a character. Of paramount importance is the limiting character of the Gaussian probability distribution. The theorems stating that this is the limiting probability distribution of a state that is produced by adding a large number of independent commensurable components (with variances of similar order of magnitude) are called central limit theorems. These theorems are essential for statistical physics (see, e.g., Huang [4.13]). In particular, they allow us to derive the probability distribution of state parameters characterizing a macroscopic property that is a manifestation of atomic-scale factors. A typical example of such a state parameter is the pressure caused by the movement of the molecules of a gas. Another example is the fluctuation of potentials on the terminals of a metallic resistor caused by the movement of free electrons, described in the previous example. The central limit theorems can also be used to identify the probability distribution of parameters determined by many factors in the macro scale. A typical example is the electrical energy consumed during one hour by the households in a large city, or the number of telephone connections established by an exchange serving many subscribers. For an introduction to central limit theorems see Renyi [4.7], Papoulis [4.8], Breiman [4.10]; for a more detailed study see Gnedenko, Kolmogorov [4.14]. The theorems are discussed in Section 4.5.2. Although the character of the limiting probability distribution depends strongly on the generating transformation, the statistical regularities of the considered state are a consequence of the statistical regularities of the causing states. Interesting are systems that realize a deterministic transformation that is unstable (such as the shift register with feedback mentioned in Section 3.2.4) and that, not being influenced by external factors, produce a train of states that is locally similar to a train of states exhibiting statistical regularities and statistically independent. Such a system is called a generator of pseudo-random numbers (for a detailed analysis see Dagpunar [4.15], Yarmolik, Demidenko [4.16], Niederreiter [4.17]). A similar character have the special classes of oscillating deterministic systems described by specific nonlinear differential equations that are considered in chaos theory (see, e.g., Devaney [4.18], Rasband [4.19]).
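The limiting character of the Gaussian distribution can be illustrated numerically; the following sketch (ours) adds many independent commensurable components and checks one standardized frequency against the corresponding Gaussian value.

```python
import math
import random

random.seed(1)
N, trials = 50, 20_000
# Each state is a sum of N independent components uniform on <-1, 1>.
sums = [sum(random.uniform(-1.0, 1.0) for _ in range(N)) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((x - mean) ** 2 for x in sums) / trials
sigma = math.sqrt(var)

# Compare the empirical frequency of the event |sum - mean| < sigma
# with the corresponding Gaussian value, approximately 0.6827.
inside = sum(1 for x in sums if abs(x - mean) < sigma) / trials
print(round(inside, 3))
```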
4.4 THE AXIOMATIC APPROACH TO STATISTICAL REGULARITIES
To utilize the statistical regularities for the enhancement of the superior system's efficiency, we usually have to perform on the primary frequencies of occurrences various operations, for example, to calculate conditional or marginal frequencies or averages. A branch of mathematics, probability theory, provides a formalism for such analyses. This section reviews the basic concepts of this theory. To illustrate the general concepts and to derive conclusions that are needed in subsequent chapters, we consider in the next two sections two more specialized areas of probability theory: the prototype probability distributions, and the "view of probability theory" on the statistical regularities discussed in the previous section. The next chapter, devoted to statistical relationships, provides further examples of applications of probability theory. There are several excellent books on probability theory (e.g., Papoulis [4.9], Breiman [4.10]). Our approach is quite specific because we emphasize the relationships between probability theory and information systems analysis. Especially close are the links between the concepts of probability theory and the analysis of frequencies of occurrences of states presented in Section 4.1. Those links are bilateral. On the one hand, the axioms of probability theory can be considered as an abstract generalization of the plausible properties of the frequencies of occurrences. On the other hand, because the frequencies of occurrences satisfy the axioms of probability theory, they can be considered as probabilities, and the theorems of this theory hold for them. In the following review of fundamental concepts of probability theory, we discuss the axioms, the concept of random variables, and the basic properties of statistical averages.
4.4.1 THE AXIOMS OF PROBABILITY THEORY
Probability theory is based on axioms, and, as is usual in mathematics, the relationships between the axioms and the real world are not the subject of the theory. However, for the analysis of states of systems and of information processing this relationship is crucial. Therefore, in this presentation of the axioms of probability theory we emphasize the relationships between the axioms and the previously introduced concepts of information systems analysis, particularly the concepts introduced in the previous chapter. The first two fundamental concepts of probability theory are the primary event, which we denote e, and the set ℰ of forms which the elementary event can take. The third concept is the event. It is a subset of the set ℰ.
The event corresponds to the situation in which a state parameter (primary, secondary, rough) takes a concrete form, say s'. Such a situation can be considered as the set E(s') of situations in which all components of the exact description of the instantaneous state take such values that the value of the considered state parameter is s'. Thus, the set E(s') has the meaning of an aggregation set.

The fourth fundamental concept of probability theory is the probability measure (briefly, probability). It is a number associated with each subset A of the set E. We denote it P(A); as usual, we use the special character P to denote an operation assigning a number to a mathematical object more complicated than a number. Probability satisfies the following axioms:

AX1. For a subset A of E,

0 <= P(A) <= 1,  P(E) = 1,  P(0) = 0,   (4.4.1)

AX2. For two subsets A1, A2 such that A1 ∩ A2 = 0 we have

P(A1 ∪ A2) = P(A1) + P(A2),   (4.4.2)

where 0 denotes the empty set.

Taking into account the previously mentioned correspondence between the event in the sense of probability theory and the situation that a state parameter takes a concrete value, we see that the axioms can be considered as an abstract generalization of the obvious properties of the frequencies of occurrences. In particular, axiom (4.4.2) is the generalization of the obvious property (4.1.33) of frequencies of occurrences. On the other hand, it is easily seen that the frequencies of occurrences defined by (4.1.6), (4.1.14), and (4.1.21) satisfy the axioms AX1 and AX2. Therefore, most results of probability theory apply to the frequencies of occurrences.

4.4.2 THE RANDOM VARIABLES

We assume that a probability measure is defined on the set E of elementary events. A function s(e) assigning to each elementary event e ∈ E a number s, considered as a whole, is called a random variable. In our notation, the random variable is the set {s(e); e ∈ E}. Instead of this lengthy expression, we use the shadowed character s.

Suppose that s(e) can take only the values s_l, l = 1, 2, ..., L. Then we call s a discrete random variable. When the event occurs that s(e) takes the value s_l, we say that the random variable s takes the value s_l, and we write it in the form s = s_l. The probability of this event is P(s = s_l). This is the counterpart of the previously introduced frequency of occurrences P*(s_l, I) and of the empirical probability P*(s_l).

If the random variable can take values from the interval S = <s_a, s_b>, then we call it a continuous random variable. To describe such a variable we take a sequence of subintervals B_m, m = 1, 2, ..., shrinking onto a point s ∈ S. Thus

lim_{m→∞} γ(B_m) = 0,   (4.4.3)

where γ(B_m) = |B_m| is the length of the subinterval B_m. For a continuous random variable P(s ∈ B_m) → 0, but the limit

p(s) = lim_{m→∞} P(s ∈ B_m) / γ(B_m)   (4.4.4)

exists.
This is a similar effect as in the case of frequencies of occurrences, which has been discussed in detail in Section 4.2 (for a strict definition of the density of probability see, e.g., Billingsley [4.8]). The limit characterizes the value s from the point of view of the statistical regularities of the continuous state parameter. Since it is an analog of the density of a thin material bar, it is called the density of probability. In technical jargon the term probability density is used. It is shorter and may be misleading, but when it causes no confusion, we will use it too.

The set B_m is the counterpart of the aggregation set (definition (4.2.4)), which was introduced when defining the density p*(s, I, L) of the frequency of occurrences (definitions (4.2.7), (4.2.8), and (4.2.9)). The d.o.p. p(s) (defined by (4.4.4)) is the counterpart of the density of empirical probability p*(s) defined by (4.3.9). In particular, γ(B_m) is the counterpart of Δ(L) in (4.2.9). Subsequent sections discuss in more detail the relationships between those concepts.

The generalization of our reasoning to continuous K-DIM vector states is similar to the case of frequencies of occurrences. We have to
1. replace the subinterval B_m occurring in definition (4.4.4) by a K-DIM subset B_m of the set S of the potential forms of the vector state, and
2. take, instead of the length of the interval, the volume γ(B_m) of the subset B_m and require in definition (4.4.4) that the subsets B_m shrink in all dimensions onto the considered point s' = {s'(k); k = 1, 2, ..., K}.

A typical choice for the subsets are K-DIM cubes:

B_m = { {s(1), s(2), ..., s(K)}; s'(k) − Δ_m/2 < s(k) <= s'(k) + Δ_m/2, k = 1, 2, ..., K },   (4.4.5)

and the typical definition of the volume is

γ(B_m) = (Δ_m)^K.   (4.4.6)

Similarly to the density of occurrences (see Comment 2, page 179), the density of probability p(s) depends not only on the statistical properties of the system but also on the definition of the volume of the subsets B_m. This must be taken into account when transformations of random variables are considered.

The density of probability {p(s); s ∈ S}, considered as a whole, describes completely the properties of the continuous random variable, similarly as the set of probabilities {P(s = s_l); l = 1, 2, ..., L} describes the discrete variable. We call both {p(s); s ∈ S} and {P(s = s_l); l = 1, 2, ..., L} the probability distribution of the random variable.

If we know the function s(e) determining the dependence of the variable on the elementary event and the probability of the elementary events, then in principle we can calculate the probabilities (the density of probability) for the corresponding random variable. However, if we consider only a single random variable s(1), it is completely described by its probability distribution, and we do not need to know either the relationship between the value of the random variable and the elementary events or the probability of the elementary events. Similarly, if we consider only a pair of random variables s(1), s(2), we need only to know their joint distribution.

The bidirectional relations between frequencies of occurrences and probabilities not only cause them to correspond but cause all previously introduced definitions and derived properties of frequencies and empirical probabilities to apply to probabilities.
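Definition (4.4.4) can be illustrated numerically. The sketch below (added here; the gaussian example, the evaluation point, and the sample size are arbitrary illustrative choices) estimates the ratio P(s ∈ B_m)/γ(B_m) from a large sample for a sequence of shrinking intervals.

```python
import numpy as np

# Sketch of definition (4.4.4): estimate p(s') as P(s in B_m) / gamma(B_m)
# for shrinking intervals B_m centred at s'. All parameters are arbitrary.
rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=2_000_000)   # observations of a gaussian variable
s_prime = 0.5                                    # point at which the density is estimated
true_density = np.exp(-s_prime**2 / 2) / np.sqrt(2 * np.pi)

for delta in (1.0, 0.5, 0.1, 0.02):              # lengths gamma(B_m) of the intervals
    in_interval = np.abs(sample - s_prime) <= delta / 2
    ratio = in_interval.mean() / delta           # empirical counterpart of P(s in B_m)/gamma(B_m)
    print(f"gamma(B_m)={delta:5.2f}  ratio={ratio:.4f}  true p(s')={true_density:.4f}")
```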
In particular, if in equations (4.1.8b), (4.1.15), and (4.4.18) we drop I and the asterisk *, we obtain definitions valid for probabilities. Similarly, the counterparts of the definitions (4.1.22) and (4.1.23) of conditional frequencies are the definitions of the conditional probabilities

P[s(2) = s_k(2) | s(1) = s_l(1)] ≝ P[s(1) = s_l(1), s(2) = s_k(2)] / P[s(1) = s_l(1)],   (4.4.7a)

and of the density of conditional probability

p[s(2)|s(1)] ≝ p[s(1), s(2)] / p[s(1)].   (4.4.7b)

The counterparts of condition (4.1.28) for statistical independence take the form

P[s(1) = s_l(1), s(2) = s_k(2)] = P[s(1) = s_l(1)] P[s(2) = s_k(2)],   (4.4.8a)

p[s(1), s(2)] = p[s(1)] p[s(2)].   (4.4.8b)

For discrete variables the counterpart of equation (4.1.20) is

P[s(2) = s_k(2)] = Σ_{l=1}^{L} P[s(1) = s_l(1), s(2) = s_k(2)],   (4.4.8c)

while for continuous variables

p[s(2)] = ∫ p[s(1), s(2)] ds(1).   (4.4.8d)
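The following minimal sketch (added here; the joint probability table is an arbitrary illustrative choice) computes the marginal and conditional probabilities according to (4.4.7a) and (4.4.8c) and tests the independence condition (4.4.8a).

```python
import numpy as np

# Sketch of (4.4.7a), (4.4.8a) and (4.4.8c) for a small discrete joint distribution.
P_joint = np.array([[0.30, 0.10],     # rows: forms of s(1), columns: forms of s(2)
                    [0.05, 0.55]])

P_s1 = P_joint.sum(axis=1)                 # marginal P[s(1)=s_l(1)]
P_s2 = P_joint.sum(axis=0)                 # marginal P[s(2)=s_k(2)], counterpart of (4.4.8c)
P_s2_given_s1 = P_joint / P_s1[:, None]    # conditional P[s(2)|s(1)], definition (4.4.7a)

print("marginal of s(2):", P_s2)
print("conditional rows sum to one:", P_s2_given_s1.sum(axis=1))
# Independence test (4.4.8a): compare the joint table with the product of marginals.
print("independent?", np.allclose(P_joint, np.outer(P_s1, P_s2)))
```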
4.4.3 THE STATISTICAL AVERAGE

The counterpart of the arithmetical average q(I) given by (4.1.5) and of the empirical statistical average q given by (4.3.6b) is the statistical average of a scalar function q(s) of a discrete random variable s:

E q(s) = Σ_{l=1}^{L} q(s_l) P(s = s_l).   (4.4.9)

We denote by E the operation of statistical averaging with respect to the random variable s, and we interpret the statistical average given by the right-hand side of (4.4.9) as the result of performing the operation of statistical averaging on the random variable q(s). If it causes no confusion, in the subsequent text we drop the random variable under the symbol E. The operation E of statistical averaging is the counterpart of the operation A of arithmetic averaging defined by (4.1.3).

If the random variable is scalar and q(s) = s, the definition (4.4.9) simplifies to the definition of the statistical average of a scalar random variable:

E s = Σ_{l=1}^{L} s_l P(s = s_l).   (4.4.10)

Using (4.4.9) and proceeding similarly as in the derivation of equation (4.2.13), for a continuous scalar variable we define the statistical average

E s = ∫ s p(s) ds.   (4.4.11)
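A minimal sketch of definitions (4.4.9) and (4.4.10) follows (added here; the values, probabilities, and sample size are arbitrary choices). It also shows that the arithmetical average of many observations approaches the statistical average.

```python
import numpy as np

# Sketch of definitions (4.4.9) and (4.4.10) for a discrete random variable.
s_values = np.array([-1.0, 0.0, 2.0])
P = np.array([0.2, 0.5, 0.3])

E_s = np.sum(s_values * P)                 # (4.4.10): E s = sum_l s_l P(s=s_l)
E_q = np.sum(s_values**2 * P)              # (4.4.9) with q(s) = s**2

# The statistical average is approached by the arithmetical average of observations.
rng = np.random.default_rng(2)
observations = rng.choice(s_values, size=200_000, p=P)
print("E s   :", E_s, " arithmetical average:", observations.mean())
print("E s^2 :", E_q, " arithmetical average:", (observations**2).mean())
```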
To this point we considered the mean value of a scalar function of a scalar argument. Similarly, we define the statistical average of a scalar function q(s) of a K-DIM continuous vector:

E q(s) = ∫∫···∫_{S_K} q(s) p(s) ds,   (4.4.12)

where S_K is the (continuous) set of the potential forms of the K-DIM state s.

Let us take as a special case s = {s(1), s(2)} and q(s) = a(1)s(1) + a(2)s(2). After some calculus, from (4.4.10) or (4.4.11) we get

E[a(1)s(1) + a(2)s(2)] = a(1) E s(1) + a(2) E s(2).   (4.4.13a)

Thus,

The operation of statistical averaging is a linear operation.   (4.4.13b)

We use this conclusion frequently in subsequent considerations.

The difference s − E s has the meaning of the deviation of the value of the random variable from its average. Since the statistical averaging operation is a linear operation, we have

E(s − E s) = E s − E s = 0.   (4.4.14)

Thus, the statistical average has the meaning of a constant around which the observations of the random variable fluctuate. However, the statistical average does not characterize the range of those fluctuations. A characteristic of this range is the average

σ²(s) = E(s − E s)².   (4.4.15)

It is called the variance. Using again the linearity of the averaging operation, we get

σ²(s) = E s² − (E s)².   (4.4.16)

The pair mean value and variance can be considered as a rough description of the probability distribution, characterizing the center and the range of fluctuations of observations of the corresponding random variable. From (4.4.16) it follows that the pair E s and E s² provides an equivalent rough description. It can be expected that we may obtain still more accurate descriptions using the averages

A_m = E s^m,   (4.4.17)

where m is an integer. The average A_m is called the moment of mth order (of the probability distribution). It can be proved that under very general assumptions all moments A_m, m = 1, 2, ... provide an exact description of the probability density. Namely, under quite general conditions it can be presented in the form

p(s) = Σ_{m=1}^{∞} μ(m) h(s, m),  −∞ < s < ∞,   (4.4.18)

where the coefficient μ(m), called the mth cumulant, is a function of the moments A_1, A_2, ..., A_m (e.g., μ(2) = σ²(s)), and h(s, m) is a family of specific orthogonal functions. Thus, the probability density can be represented by the infinite set of cumulants. This is a special case of the spectral representations that are discussed in Section 7.4.1. An important property of a spectral representation is that its finite initial part is an optimal approximate representation (see Section 7.1.3). Thus, e.g., the mean value and variance are an optimal rough description of a probability density by two parameters.
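A short numerical check of formula (4.4.16) (added here; the same kind of arbitrary discrete distribution as in the previous sketch):

```python
import numpy as np

# Sketch checking (4.4.16): sigma^2(s) = E s^2 - (E s)^2.
s_values = np.array([-1.0, 0.0, 2.0])
P = np.array([0.2, 0.5, 0.3])

E_s = np.sum(s_values * P)
var_direct = np.sum((s_values - E_s)**2 * P)      # definition (4.4.15)
var_moments = np.sum(s_values**2 * P) - E_s**2    # formula (4.4.16)
print(var_direct, var_moments)                    # the two numbers coincide
```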
CONDITIONAL AVERAGES

Conditional averages are of paramount importance for the theory of information processing. To simplify the argument we again assume that the two random variables s(1) and s(2) are discrete scalar variables. Replacing in (4.4.10) the probabilities P(s = s_l) by the conditional probabilities defined by (4.4.7a), we obtain the conditional statistical average of s(2) on the condition that s(1) = s_l(1):

E[s(2) | s_l(1)] = Σ_{k=1}^{L(2)} s_k(2) P[s(2) = s_k(2) | s(1) = s_l(1)],   (4.4.19)

where E[· | s_l(1)] denotes the operation of conditional averaging over the random variable s(2) on the condition that s(1) is fixed. The conditional average is a function of the condition s_l(1). To indicate this we introduce the function

D[s(1)] ≝ E[s(2) | s(1)],   (4.4.20a)

and the random variable

D = D[s(1)].   (4.4.20b)

We denote by

D̄ = E D   (4.4.21)

the statistical average of D. Using the equation for marginal probabilities (the counterpart of (4.4.8)), after some algebra we get

E s(2) = D̄.   (4.4.22)

Substituting (4.4.20) and (4.4.21), we write this equation in the form

E s(2) = E_{s(1)} E[s(2) | s(1)].   (4.4.23)

Thus, the average of s(2) can be obtained by averaging the conditional average of s(2) over the condition. This equation allows us to take into consideration, besides a state s(1), another statistically related state s(2). Therefore, equation (4.4.23) plays a key role in the optimization of information processing.

4.4.4 CORRELATION COEFFICIENTS AND CORRELATION MATRIX

Let us consider two random variables s(1) and s(2). The difference between the random components of those variables is [s(2) − E s(2)] − [s(1) − E s(1)]. It is natural to take

d_ss(1, 2) = E{[s(2) − E s(2)] − [s(1) − E s(1)]}²   (4.4.24a)

as an indicator of the "difference" between the random components of both variables. Using the linearity of the operation E we get

d_ss(1, 2) = E[s(1) − E s(1)]² + E[s(2) − E s(2)]² − 2 c_ss(1, 2),   (4.4.24b)

where

c_ss(1, 2) = E[s(1) − E s(1)][s(2) − E s(2)].   (4.4.24c)
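The property expressed by (4.4.23) can be checked numerically. The sketch below (added here; the joint probability table is an arbitrary illustrative choice) computes E s(2) once from the marginal distribution and once by averaging the conditional average over the condition.

```python
import numpy as np

# Sketch of (4.4.19)-(4.4.23): the average of s(2) equals the average of its
# conditional averages taken over the condition s(1).
s2_vals = np.array([-1.0, 0.0, 1.0])
P_joint = np.array([[0.10, 0.20, 0.10],     # P[s(1)=s_l(1), s(2)=s_k(2)]
                    [0.05, 0.15, 0.40]])

P_s1 = P_joint.sum(axis=1)
P_s2_given_s1 = P_joint / P_s1[:, None]

D = P_s2_given_s1 @ s2_vals            # D[s_l(1)] = E[s(2)|s(1)=s_l(1)], definition (4.4.19)
lhs = P_joint.sum(axis=0) @ s2_vals    # E s(2) from the marginal distribution of s(2)
rhs = P_s1 @ D                         # averaging the conditional average, equation (4.4.23)
print(lhs, rhs)                        # both sides coincide
```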
From equation (4.4.24b) it can be seen that the indicator d_ss(1, 2) of the statistical difference depends on the statistical relationship between both variables only through the coefficient c_ss(1, 2), and the difference is the smaller, the larger c_ss(1, 2) is. Thus, c_ss(1, 2) can be interpreted as an indicator of the statistical relationship between the random variables s(1) and s(2). Therefore, c_ss(1, 2) is called the correlation coefficient. Some authors call it the centralized correlation coefficient, while the coefficient ĉ_ss(1, 2) = E s(1)s(2) is called the noncentralized correlation coefficient. Since we use only the centralized coefficient, we drop the adjective "centralized".

The concrete form of the averaging operation for discrete variables we obtain by taking in the definition (4.4.9), instead of s, the pair {s(1), s(2)}. Definition (4.4.24c) then takes the form

c_ss(1, 2) = Σ_{l=1}^{L} Σ_{k=1}^{L} [s_l(1) − E s(1)][s_k(2) − E s(2)] P[s(1) = s_l(1), s(2) = s_k(2)].   (4.4.25a)

Similarly, for continuous variables we use equation (4.4.11) and we get

c_ss(1, 2) = ∫∫ [s(1) − E s(1)][s(2) − E s(2)] p[s(1), s(2)] ds(1) ds(2).   (4.4.25b)
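A sketch estimating the correlation coefficient (4.4.24c) from observations (added here; the linear-plus-noise generating model and the sample size are arbitrary choices):

```python
import numpy as np

# Sketch of (4.4.24c): the correlation coefficient estimated from a sample of
# observations of a pair of related variables.
rng = np.random.default_rng(3)
s1 = rng.normal(0.0, 1.0, size=100_000)
s2 = 0.8 * s1 + rng.normal(0.0, 0.5, size=s1.size)

c12 = np.mean((s1 - s1.mean()) * (s2 - s2.mean()))   # empirical counterpart of (4.4.24c)
print("c_ss(1,2) estimate:", c12, " (the generating model gives E s(1)s(2) = 0.8)")
```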
Let us now consider the multidimensional random variable s = {s(k); k = 1, 2, ..., K}. The mutual relationships between the components s(k), k = 1, 2, ..., K are characterized by the set of the correlation coefficients c_ss(m, k), ∀ m, k. It is convenient to arrange them in a matrix

C_ss = [c_ss(m, k)],   (4.4.26a)

where

c_ss(m, k) = E[s(m) − E s(m)][s(k) − E s(k)].   (4.4.26b)

The matrix C_ss is called the correlation matrix. We now derive a few properties of this matrix, which are used in subsequent chapters. To simplify the notation we assume

E s(k) = 0, ∀ k.   (4.4.27)

From the definition of c_ss(m, k) it follows that the correlation matrix C_ss is symmetrical. To derive another important property of the matrix we introduce the auxiliary variable

z = Σ_{k=1}^{K} a(k) s(k),   (4.4.28)

where a(k), k = 1, 2, ..., K are arbitrary numbers. The mean square of it is

E z² = E ( Σ_{m=1}^{K} a(m) s(m) ) ( Σ_{k=1}^{K} a(k) s(k) ) = Σ_{m=1}^{K} Σ_{k=1}^{K} E[s(m) s(k)] a(m) a(k) = Σ_{m=1}^{K} Σ_{k=1}^{K} c_ss(m, k) a(m) a(k).   (4.4.29)

Except in singular cases, E z² > 0. Therefore

Σ_{m=1}^{K} Σ_{k=1}^{K} c_ss(m, k) a(m) a(k) > 0.   (4.4.30)
A matrix satisfying such a condition is called positive definite (see Thompson [4.20], Horn [4.21]). This property is essential for efficient dimensionality reduction, which is described in Section 7.2.

An important property of correlation coefficients is that to calculate the correlation coefficients of a multidimensional secondary state obtained by a linear transformation of a primary multidimensional state, we need only the correlation matrix of the primary state. We now derive this relationship. From the rules of matrix multiplication (see, e.g., Thompson [4.20], Horn [4.21]) it follows that if a is a column matrix with elements a(k), then a aᵀ is a square matrix with elements a(m)a(k):

a aᵀ = [a(m) a(k)].   (4.4.31)

Using this formula we write the correlation matrix C_ss given by (4.4.26a) in the form

C_ss = E s sᵀ,   (4.4.32)

where s is the random column matrix with elements s(k). We denote by s the column matrix with the components s(k), k = 1, 2, ..., K of the primary state and by v the column matrix with the components v(k), k = 1, 2, ..., K of the secondary state. A wide class of linear transformations can be presented (see, e.g., Thompson [4.20], Horn [4.21], Usmani [4.22]) in the form

v = H s,   (4.4.33)

where H is a K×K square matrix. For the corresponding random variables we have

v = H s.   (4.4.34)

Using again (4.4.32) we represent the correlation matrix

C_vv = [E v(m) v(k)]   (4.4.35)

of the transformed variables in the form

C_vv = E v vᵀ.   (4.4.36)

After substituting (4.4.34) and some elementary matrix algebra, we get

C_vv = E (H s)(H s)ᵀ = H (E s sᵀ) Hᵀ = H C_ss Hᵀ,   (4.4.37)

where C_ss is the correlation matrix of the primary state. This equation is used frequently in the subsequent chapters.
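Equation (4.4.37) can be verified numerically. In the sketch below (added here; the matrices, the gaussian sampling model, and the sample size are arbitrary illustrative choices) the correlation matrix of the transformed state computed from (4.4.37) is compared with an empirical estimate.

```python
import numpy as np

# Sketch of (4.4.37): correlation matrix of a linearly transformed state, C_vv = H C_ss H^T.
rng = np.random.default_rng(4)
C_ss = np.array([[2.0, 0.6, 0.1],
                 [0.6, 1.0, 0.3],
                 [0.1, 0.3, 0.5]])     # a symmetric, positive definite correlation matrix
H = rng.normal(size=(3, 3))            # matrix of the linear transformation v = H s

C_vv_formula = H @ C_ss @ H.T          # equation (4.4.37)

# Cross-check by simulation: draw zero-mean vectors with correlation matrix C_ss,
# transform them, and estimate the correlation matrix of the secondary state.
s = rng.multivariate_normal(np.zeros(3), C_ss, size=200_000)
v = s @ H.T
C_vv_empirical = (v.T @ v) / v.shape[0]
print(np.max(np.abs(C_vv_formula - C_vv_empirical)))   # small sampling error only
```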
4.5 PROTOTYPE PROBABILITY DISTRIBUTIONS

This section has two purposes. First, it illustrates on concrete examples the previously introduced concepts of the axiomatic probability theory. Second, it provides two examples of transformations producing states weakly dependent on the statistics of causing factors, discussed in Section 4.3.4. The limiting probability distributions are the uniform and the gaussian distribution. We concentrate on these two distributions because
• In many cases we can conclude from very general premises that the mechanism producing a state is similar to one of the transformations considered here, and
• often, due to the deterministic relationships discussed in Chapter 3, some relevant states of the system can be considered as secondary states produced from the states mentioned above; then, using routine procedures of probability theory, we can calculate the probabilities describing the secondary states (examples are given in Section 5.2).

For these reasons we call the uniform and gaussian distributions the prototype probability distributions.

4.5.1 THE UNIFORM PROBABILITY DISTRIBUTION

We say that a discrete random variable s has a uniform probability distribution if

P(s = s_l) = const,   (4.5.1)

where s_l, l = 1, 2, ..., L are the potential forms of the state. We say that a continuous random variable s has a uniform probability distribution if

p(s) = const, ∀ s.   (4.5.2)

We consider a continuous scalar state. We assume that:

A1. The primary state is a scalar, and the set of its potential forms is the interval <−s_b, s_b>;
A2. The primary state can be considered as a realization of the continuous random variable s; we denote by p_s(s) its probability density;
A3. Using uniform quantization we transform the primary state s into the discrete state w; we denote by s_l, l = 1, 2, ..., L the potential forms of w and by T_q(·) the quantizing transformation; thus w = T_q(s);
A4. We achieve the quantization by the next-neighbour transformation

w = s_l if |s − s_l| <= |s − s_k|, ∀ k ≠ l,   (4.5.3)

where the reference values are

s_l = [l − (L+1)/2] Δ,   (4.5.4)

and Δ = 2 s_b / L is the length of the quantization interval; the described transformation is illustrated in Figure 4.6a;
A5. The secondary state is

b ≝ s − s_l = s − T_q(s).   (4.5.5)

The secondary state has the meaning of the quantization error. It can be interpreted as the state at the output of the system shown in Figure 4.6b. The diagram of b as a function of s is shown in Figure 4.6c. The random variable representing the secondary state is

b = s − T_q(s).   (4.5.6)

From Figure 4.6c we see that the event b ∈ <b, b+db> occurs when the primary state falls into one of the L intervals <s_l + b, s_l + b + db>, l = 1, 2, ..., L; hence

P(b ∈ <b, b+db>) = Σ_{l=1}^{L} P(s ∈ <s_l + b, s_l + b + db>).   (4.5.7)
Figure 4.6. Illustration of the definition of the secondary state: (a) the transformation of the primary state s into the quantized state w, (b) the bloc diagram of the system producing the secondary state, (c) dependence of the secondary state b on the primary state s, (d) a typical probability density of the primary state, (e) interpretation of equation (4.5.7), (f) the resulting probability density of the secondary state b; the scale on the vertical axis is roughly L times coarser than in Figure 4.6d.

From this and from the definition (4.4.4) of the probability density we get

p_b(b) = Σ_{l=1}^{L} p_s(s_l + b),  b ∈ <−Δ/2, Δ/2>,   (4.5.8)

where p_s(s) is the probability density of the primary state and Δ is the length of the quantization interval.
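A numerical sketch of the mechanism described by (4.5.5)-(4.5.8) follows (added here; the gaussian primary density, the number L of intervals, and the range are arbitrary choices, and the quantizer indexing is an equivalent zero-based form of (4.5.4)). For fine quantization the empirical density of the error b is nearly uniform, as stated below.

```python
import numpy as np

# Sketch: for fine uniform quantization the density of the quantization error
# b = s - T_q(s) is nearly uniform on <-Delta/2, Delta/2>, regardless of the
# smooth density of the primary state.
rng = np.random.default_rng(5)
s = rng.normal(0.0, 1.0, size=1_000_000)   # primary state: gaussian, clearly non-uniform
L, s_b = 64, 4.0                           # number of quantization intervals, range <-s_b, s_b>
Delta = 2 * s_b / L                        # length of the quantization interval

s = s[np.abs(s) < s_b]                     # keep values inside the quantizer range
l = np.rint((s + s_b) / Delta - 0.5)       # index of the nearest reference value
s_l = (l + 0.5) * Delta - s_b              # reference values (zero-based form of (4.5.4))
b = s - s_l                                # quantization error (4.5.5)

hist, _ = np.histogram(b, bins=20, range=(-Delta / 2, Delta / 2), density=True)
print(np.round(hist * Delta, 3))           # each entry close to 1, i.e. density close to 1/Delta
```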
From (4.5.8) it follows that the probability density of the secondary state b is the sum of shifted segments of length Δ of the probability density of the primary state, as illustrated in Figure 4.6d. It is seen that:

If the length of the quantization interval Δ is sufficiently small and the probability density is a smooth function, then, independently of the concrete form of this density, the probability density of the transformed state approaches the uniform probability density.   (4.5.9)

Thus, the system shown in Figure 4.6b is an example of a system producing states weakly depending on the causing states. Our argument also proves that the uniform probability distribution is a limiting distribution (see Section 4.3.4).

Although the system in Figure 4.6b seems to be specific, it is representative of a wide and important class of systems whose states have a uniform distribution independent of the distribution of the primary factors generating the states. The reason is that the state b defined by (4.5.5) can also be defined as the remainder of dividing the value of the primary state by a constant Δ. The state of a variety of systems is determined by such a mechanism. Examples are (1) the final position of a disc that after a push revolves several times, such as a roulette wheel or a disk carrying information (in magnetic or optic form), and (2) the phase shift of a harmonic process that is delayed (significant is only the remainder of dividing by 2π). If
• the secondary state can be interpreted as the remainder of dividing a primary state by a fixed number Δ, and
• this number is much smaller than the range of values of the primary state for which its probability density p_s(s) takes significant values,
then, arguing similarly as previously, we can conclude that the probability distribution of the secondary state is almost uniform, no matter what the probability distribution of the primary states is.

There is a relationship between the considered system and the random number generators mentioned in Section 4.3.3. Suppose that instead of a scalar a bloc of binary numbers is taken and that the concept of division is suitably generalized, so that the division can be realized by the shift register with feedback shown in Figure 3.10. It can be proved that such a system would transform a primary deterministic state (the seed) into a train whose segments exhibit statistical regularities. Most generators of pseudo-random numbers operate on this principle (for references see Section 4.3.4).

4.5.2 THE GAUSSIAN PROBABILITY DISTRIBUTION

The probability density

p(s) = (1/(√(2π) σ)) e^{−(s−ā)²/(2σ²)}   (4.5.11)

is called the gaussian probability distribution, and the corresponding continuous random variable s is called a gaussian (also normal) random variable. For this variable we have

E s = ā,  σ²(s) = σ².   (4.5.12)

The specific role of this distribution is justified by the following theorem (a simplified formulation of the basic central limit theorem; see Section 4.3.4):
If a random variable s can be represented in the form

s = Σ_{i=1}^{I} s(i),

where the random variables s(i) have E s(i) = 0, ∀ i, their variances σ²[s(i)] are of a similar order of magnitude, and the variables s(i) satisfy additional very broad constraints, then for large I the probability density of the normalized random variable s/√I converges to the gaussian probability density.   (4.5.13)

Saying that the variances are of a similar order of magnitude means that two constants A1 > 0 and A2 > A1 exist such that A1 <= σ²[s(i)] <= A2, ∀ i.

The counterpart of the gaussian probability density for a K-DIM continuous vector state is the density

p(s) = A exp{ −½ Σ_{m=1}^{K} Σ_{k=1}^{K} G(m, k)[s(m) − ā(m)][s(k) − ā(k)] },   (4.5.14)

where G = [G(m, k)] and A is a normalizing constant; it is called the K-DIM gaussian probability distribution. It is determined by the coefficients ā(k) and G(m, k). These coefficients are related to the moments of the corresponding K-dimensional random variable s = {s(k); k = 1, 2, ..., K} (for a derivation see, e.g., Papoulis [4.9], Breiman [4.10]), namely:

ā(k) = E s(k),   (4.5.15a)

G C_ss = D_1,   (4.5.15b)

where C_ss is the correlation matrix of the components of the random variable s given by (4.4.25) and (4.4.26) and

D_1 = diag(1, 1, ..., 1)   (4.5.16)

is the diagonal unit matrix. Since under very general conditions the inverse C_ss⁻¹ of the correlation matrix C_ss exists, equation (4.5.15b) can be written in the form

G = C_ss⁻¹.   (4.5.17)

From (4.5.15a) and (4.5.15b) it follows that

The K-DIM gaussian probability density is exactly determined by the averages of the components of the corresponding K-DIM random variable and their correlation matrix.   (4.5.18)
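A minimal sketch of (4.5.14)-(4.5.17) follows (added here; the correlation matrix, the averages, and the explicit normalizing constant A = ((2π)^K det C_ss)^(−1/2) are illustrative assumptions consistent with the special case (4.5.19b)).

```python
import numpy as np

# Sketch of (4.5.14)-(4.5.17): G = C_ss^{-1}, and the quadratic-form expression
# reproduces the usual K-DIM gaussian density for an assumed normalizing constant A.
C_ss = np.array([[1.0, 0.4],
                 [0.4, 2.0]])
a_bar = np.array([0.5, -1.0])            # averages a(k) = E s(k), equation (4.5.15a)
G = np.linalg.inv(C_ss)                  # equation (4.5.17)
print("G C_ss =\n", G @ C_ss)            # the diagonal unit matrix D_1, equation (4.5.15b)

def gaussian_density(s):
    K = len(a_bar)
    A = ((2 * np.pi) ** K * np.linalg.det(C_ss)) ** -0.5   # assumed normalizing constant
    d = s - a_bar
    return A * np.exp(-0.5 * d @ G @ d)  # quadratic form of equation (4.5.14)

print("p at the mean      :", gaussian_density(a_bar))
print("p at an offset point:", gaussian_density(a_bar + np.array([1.0, 1.0])))
```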
UNCORRELATED GAUSSIAN VARIABLES

To illustrate the general considerations we assume:
A1. The component variables s(k) are uncorrelated (c_ss(m, k) = 0, ∀ m ≠ k),
A2. The mean values E s(k) = 0,
A3. The variances σ²[s(k)] = σ² = const.

From A1 it follows that all elements of the correlation matrix C_ss lying off the main diagonal are zero. Taking into account assumption A3 we see that C_ss⁻¹ = (1/σ²) D_1. Then (4.5.14) takes the form

p(s) = A exp{ −(1/(2σ²)) Σ_{k=1}^{K} [s(k)]² },   (4.5.19a)

where

A = (2πσ²)^{−K/2}.   (4.5.19b)

Comparing (4.5.19) and (4.5.11) we see that the probability density of uncorrelated gaussian variables is equal to the product of the probability densities of the component variables. In other words,

If the gaussian variables are uncorrelated, then they are statistically independent.   (4.5.20)

MULTIDIMENSIONAL GAUSSIAN VARIABLES: THE GENERAL CASE

The reason for the great practical importance of multidimensional gaussian variables is that for them the generalization of the fundamental theorem (4.5.13) holds. Thus, whenever a set of random variables can be represented as a sum of a large number of independent component sets with variances of a similar order of magnitude, the normalized sum has approximately the multidimensional gaussian probability density. As an example take the thermal fluctuation potential of a resistor described in Example 4.3.1. The course of the noise is determined by many elementary pulses generated by the independent collisions of electrons. Thus, without going into the details of the properties of the elementary pulses, we conclude that a train of samples of thermal noise has the multidimensional gaussian probability distribution. Experiments confirm this with great accuracy.

The gaussian variables can almost be considered as a gift of nature. Not only are they often an accurate approximation of real frequencies of occurrences, but they have several properties that make operations on them very easy. The following are the most important:

• To determine exactly the statistical properties of gaussian variables, we need only to know their average values and correlation matrix;   (4.5.21)
• A set of linear combinations of gaussian variables is again a set of gaussian variables;   (4.5.22)
• The density of the conditional probability distribution of a set of some components of a multidimensional gaussian random variable, on the condition that a set of other components is known, is again a gaussian probability distribution.   (4.5.23)
These properties, in conjunction with equation (4.4.37), reduce the calculation of the probability distributions of linear combinations of gaussian variables, and of joint and conditional distributions of such linear combinations, to routine matrix manipulations. They also lead to a useful presentation of a set of statistically dependent gaussian variables as a transformation of a set of statistically independent gaussian variables. Such presentations are discussed in Sections 5.2.1 and 7.3.
4.6 THE FUNDAMENTAL PROPERTY OF LONG TRAINS OF RANDOM VARIABLES

The subject of this section is an important theorem that can be considered as the view of the axiomatic probability theory on the statistical regularities discussed in Section 4.4. We assume:

A1. A train of states S_tr = {s(1), s(2), ..., s(I)} can be considered as an observation of a train S_tr = {s(1), s(2), ..., s(I)} of random variables (the train of states exhibits statistical regularities);
A2. The set of potential forms of each elementary state is the same, and the potential states are s_l, l = 1, 2, ..., L;
A3. The random variables s(i), ∀ i are statistically independent.

As in Section 4.1, M(s_l, I) denotes the number of occurrences of the state s_l in the train S_tr. The following theorem can be proved (see, e.g., Breiman [4.10], Revesz [4.11]):

For given ε > 0, δ > 0 an I(ε, δ) can be found such that for I > I(ε, δ) the set S(I) of unconstrained trains S_tr can be divided into two subsets S_ty and S_nty such that for every train S_tr ∈ S_ty we have

| M(s_l, I)/I − P[s(1) = s_l] | < ε, ∀ l,   (4.6.1a)

and

P(S_tr ∈ S_nty) < δ.   (4.6.1b)

The ratio

M(s_l, I)/I   (4.6.2)

is the frequency of occurrences of the elementary state s_l in the train S_tr. Thus (4.6.1a) says that in each train belonging to the set S_ty the frequency of occurrences of any elementary state s_l is, with accuracy better than ε, close to the probability of the state s_l. Such a train is called typical; hence the notation S_ty. The set S_nty consists of trains for which the frequency of occurrences of at least one state differs from its probability by at least ε. Such a train is called nontypical. The probability P(S_tr ∈ S_nty) is the sum of the probabilities of all nontypical trains. Therefore, from (4.6.1b) it follows that the total probability of the nontypical trains can be made arbitrarily small.

Let us denote by S'_tr a train belonging to the set S_ty. We calculate the probability

P(S_tr = S'_tr) = P[s(1) = s'(1), s(2) = s'(2), ..., s(I) = s'(I)]   (4.6.3)

that the train S'_tr is the outcome of an observation of the random train S_tr.
Since the variables s(i), ∀ i are statistically independent,

P(S_tr = S'_tr) = Π_{i=1}^{I} P[s(i) = s'(i)] = Π_{l=1}^{L} P[s(1) = s_l]^{M(s_l, I)}.   (4.6.4)

Taking the logarithm we get

−log₂ P(S_tr = S'_tr) = Σ_{l=1}^{L} M(s_l, I) {−log₂ P[s(1) = s_l]}.   (4.6.5)

We write (4.6.1a) in the form

M(s_l, I)/I = P[s(1) = s_l] + ε_l, where |ε_l| < ε.   (4.6.6)

Substituting (4.6.6) in (4.6.5) we get

| −(1/I) log₂ P(S_tr = S'_tr) − H[s(1)] | <= C_1 ε,   (4.6.7)

where

H[s(1)] = Σ_{l=1}^{L} {−log₂ P[s(1) = s_l]} P[s(1) = s_l]   (4.6.8)

is the entropy of the random variable s(1) and

C_1 = Σ_{l=1}^{L} | log₂ P[s(1) = s_l] |.   (4.6.9)
From (4.6.7) it follows that

P(S_tr = S'_tr) ≅ 2^{−I H[s(1)]},   (4.6.10a)

where ≅ means asymptotically equal, in the sense that

lim_{I→∞} { −log₂ P(S_tr = S'_tr) / I } = H[s(1)].

The interpretation of formula (4.6.10a) is:

The probability of every typical train (that is, of a train in which the states s_l occur with frequencies close to their probabilities) is the same and is given by (4.6.10a).   (4.6.10b)

Taking into account (4.6.1) we can formulate our conclusions in geometrical terms:

If we would "draw" a bar diagram in the set S of all possible trains S_tr = {s(1), s(2), ..., s(I)}, the "peaks" of the bars would form two plateaus: one high plateau over the set S_ty (with altitude given by (4.6.10a)) and a second, very low plateau (almost at zero level) over the set S_nty of nontypical trains. The border area between the two plateaus is the steeper, the larger the length I of the train is.   (4.6.11)

On the assumption that the trains can be represented as points on a plane, this conclusion is illustrated in Figure 4.7.
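The concentration expressed by (4.6.7)-(4.6.10a) can be observed numerically. The sketch below (added here; the probability distribution and the train length are arbitrary choices) computes −(1/I) log₂ P(train) for a few observed trains and compares it with the entropy (4.6.8).

```python
import numpy as np

# Sketch: for long independent trains, -(1/I) log2 P(train) concentrates around the
# entropy H[s(1)], so every typical train has probability close to 2**(-I*H).
rng = np.random.default_rng(6)
P = np.array([0.5, 0.3, 0.2])           # probabilities of the potential states s_l
H = np.sum(-np.log2(P) * P)             # entropy (4.6.8)
I = 2000                                # train length

log2_probs = []
for _ in range(5):                      # a few observed trains
    train = rng.choice(len(P), size=I, p=P)
    log2_probs.append(np.sum(np.log2(P[train])))   # log2 of the train probability, cf. (4.6.4)

print("H[s(1)] =", round(H, 4))
print("-(1/I) log2 P(train):", [round(-x / I, 4) for x in log2_probs])
```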
Figure 4.7. Simplified geometrical illustration of the fundamental property of long trains exhibiting statistical regularities (see conclusion (4.6.11)).

From axiom (4.4.2) it follows that

P(S_tr ∈ S_ty) = Σ_{S'_tr ∈ S_ty} P(S_tr = S'_tr).   (4.6.12)

From conclusion (4.6.10) it follows that

P(S_tr ∈ S_ty) ≅ γ₀(S_ty) 2^{−I H[s(1)]},   (4.6.13)

where

γ₀(A) is the number of elements of the discrete set A.   (4.6.14)

Writing (4.6.1b) in the form P(S_tr ∈ S_ty) >= 1 − δ and using (4.6.12) we get

γ₀(S_ty) ≅ 2^{I H[s(1)]}.   (4.6.15)
COMMENT

To use the results of probability theory we must know whether the potential states of the considered system exhibit the statistical regularities. As has been emphasized, the regularities of occurrences of potential states are an objective property of a system. Thus, using the available concrete and meta information, we must decide whether the states of the system exhibit the statistical regularities. This is the statistical identification problem, which we discussed in Section 4.3.4.
GENERALIZATIONS
To simplify the terminology and notation we assumed that the random variables are discrete and statistically independent. However, suitable modifications of our conclusions hold under very general assumptions.

First, we keep the assumptions that the variables s(i) have the same probability distribution and are statistically independent, but we assume that they are continuous, described by the density of probability p(s), s ∈ <s_a, s_b>. We now define as typical the set S_ty of trains for which the continuous envelope of the density of occurrences of the potential forms of the approximating discrete variables (whose definition is similar to that of (4.2.9)) is close, in the sense of a suitably chosen definition of the distance between functions considered as a whole (see Section 1.4.3), to the density of probability p(s); this is the counterpart of (4.6.1). The counterpart of the first basic conclusion (4.6.10) holds if instead of the probability P(S_tr = S'_tr) we take the density of probability of a typical train and instead of (4.6.8) we define the entropy of the continuous variable by the equation

H[s(1)] = ∫ [−log₂ p(s)] p(s) ds.   (4.6.16)

To formulate the counterpart of the second basic conclusion (4.6.15) we have to use the definition of the volume γ_I(A) of I-dimensional sets, based on the following definition of the volume of an I-dimensional cube C with an edge of length Δ:

γ_I(C) = Δ^I.   (4.6.17)

It is essential that the unit for the length Δ be the same as the unit we use in (4.4.4) to define the density of probability p(s). Then the conclusion (4.6.15) holds with γ_I(S_ty) instead of γ₀(S_ty).

For continuous variables we can gain more insight into the structure of the set of typical trains. For example, when the probability density p(s) is gaussian, the set S_ty is a thin I-dimensional spherical shell.

The fundamental property holds also for trains of structured pieces of information. Since for discrete information we did not use the assumption that the component information is one-dimensional, our argument holds for any structured discrete information. In particular, it holds when the train has the hierarchical bloc structure. Let us suppose that the component is a K-DIM vector information; thus, s(i) = {s(i, k); k = 1, 2, ..., K}. Then instead of H[s(1)] in (4.6.10) and (4.6.15) we have to take the entropy per element of the random vector s(i), defined by

H_1 = H[s(1), s(2), ..., s(K)] / K,   (4.6.18)

where H[s(1), s(2), ..., s(K)] is the entropy of the discrete vector s(i). We get it from (4.6.8) by taking the joint probability of the vector components instead of the probability P[s(1) = s_l] and performing K summations over all potential forms of all components of the vector, instead of a single summation. For an exact definition and a derivation of the properties of H_1 see Cover, Thomas [4.23], Blahut [4.24], Golomb et al. [4.25].
In several cases the elementary components of a train cannot be grouped into statistically independent blocs (vectors), but a component depends statistically on the adjacent components. A typical model of such a statistical dependence is the Markov train, which is considered in Section 5.4.2. For such a train we have to define the entropy per element as the limit of the ratio on the right-hand side of (4.6.18) for K → ∞ (for details see Cover, Thomas [4.23], Blahut [4.24]). Section 5.4.2 presents a simple example of the calculation of such a limit.
4.7 THE GENERALIZED STATE AND A UNIVERSAL CLASSIFICATION OF STATES AND INFORMATION

The statistical regularities described by a probability distribution are an inherent property of the system. We call them the statistical state and denote its description by S_STAT. This description includes the description of the state of variety S_VAR of the external state and the description of the statistical weights assigned to each potential state. Thus,

S_STAT = {S_VAR, W} = {S_RT, MR, W},   (4.7.1)

where S_RT is the description of the structure of external states, MR is the rule of membership in the set of potential forms, and W is the description of statistical weights. The type of description depends on the structure of the external state and on the structure of the set S of its potential forms. Let us assume that this set is discrete. Then the statistical weights are described by the set of probabilities

W = {P(s_l); s_l ∈ S}.   (4.7.2)

When the set of potential forms is continuous, the statistical weights are described by the probability density

W = {p(s); s ∈ S},   (4.7.3)

considered as a whole.

To simplify the terminology we assumed previously that the state is an external state. However, our argument applies directly to internal states. We called the external and internal states concrete states. The set of variety of a concrete state (of potential forms of a concrete state) and its properties, in particular the statistical properties, we call the meta state of the system. If the concrete states exhibit statistical regularities, the meta state is described by the statistical state S_STAT; if not, only by the state of variety S_VAR.

Our considerations about sets of potential forms of concrete states apply also to meta states. Thus, if a meta state is not known, the set of potential forms of meta states, in particular of statistical states, should be considered. Such a set is also a property of the system and may be called a meta state of second rank (a meta, meta state). The hierarchy of higher ranking meta states has been discussed earlier. The external, internal, and meta states give a complete description of the properties of the system and together are called the generalized state of the system.
Let us summarize our considerations in this and the previous chapters:

1. We have introduced the following types of states:
a. External state (directly influencing interactions between components of the system),
b. Internal state (the universal relationships between components of external states),
c. Concrete state (external and internal state),
d. The state of variety (the set of potential forms of concrete states),
e. The statistical state (the set of potential forms of concrete states with associated statistical weights),
f. Meta state (joint name for variety and statistical states),
g. Higher ranking meta state (the set of potential forms of lower ranking meta states, eventually with associated statistical weights),
h. Generalized state (concrete and meta states together).
2. Each of these states can be either exact or rough.
3. Each of these states has a fundamental structure and often a macrostructure. The fundamental structures are: vector, array (a function of discrete arguments), and function of continuous argument(s) considered as a whole.
4. If any of these types of states is not known, the purposeful activity can be performed more efficiently if the set of potential forms of the unknown state is taken into consideration.
Figure 4.8. A classification of states (inscriptions in () valid) and of information (inscriptions in [ ] valid).
If, for simplification, we take into account only the fundamental structure of the state and the set of its potential forms, the generalized state can be represented as a point in a 3-DIM "space", as shown in Figure 4.8. On one horizontal "coordinate axis" we have the fundamental structure of the state; on the other, the structure of the set of potential forms it can take. On the vertical axis we indicate the type of the state. For example, point P1 corresponds to an external state that is a scalar and whose set of potential forms is an interval. Point P2 represents the statistical state of the state represented by point P1. In view of the bilateral relationships between the state and information discussed in Section 1.2.2, concrete forms of states and information and the sets of their potential forms can be classified in the same way. We can also classify the information according to the type of the state the information is about. Thus, after changing the interpretation of the features, we can use the previously described classification of states as a universal classification of information, as shown in Figure 4.8.
NOTES

1. We used l as an index to number the potential forms of the vector s. To each s_m corresponds a pair of components s_{l(m)}(1) and s_{k(m)}(2). The indexes l(m) and k(m) numbering the potential forms of the components may in general be different. Because in our considerations the relationship between the numbering of the forms of s and of the components s(1) and s(2) is irrelevant, we write briefly l instead of l(m) and k instead of k(m).
2. It is essential for our argumentation that I ≫ L. Thus L can be large only if the train S_tr has a suitably large length I.
3. With a strict approach, the class of considered subsets is restricted to the class of Borel sets (see, e.g., Kolmogorov [4.6], Billingsley [4.8]). This is a very wide class including practically all sets occurring in applications.
4. In the notation of the correlation coefficient and the correlation matrix we add the subscript ss to indicate that the correlation coefficients of pairs of components of a multidimensional random variable s are considered. We do so because in later sections we consider matrices of correlation coefficients for components of several multidimensional variables.
5. Always E z² >= 0. We could have E z² = 0 for a set of coefficients a(k) only if for all the random variables E s²(k) = 0. Such a case is of no technical interest.
6. This is obviously a special case of rule (1.5.16) with the reference pattern used directly as its identifier. The quantization rule considered here is the same as rule (4.2.1); however, we use another notation.
7. The proof of the theorem and the exact formulation of its premises require several additional concepts; they can be found in Breiman [4.10]. For a detailed study of sums of random variables see Gnedenko [4.14]. Here we give a simplified formulation of the theorem.
8. Using the same symbol as the symbol used to denote the volume of a K-DIM set (see, e.g., (4.4.3)) is not coincidental. The number of elements of a discrete set has the same fundamental properties as the volume of a K-DIM set. In mathematical terms, the number of elements of a set is an additive measure (see, e.g., Billingsley [4.8]).
9. We can get this equation by calculating the number of trains satisfying constraint (4.6.1) and using the Stirling formula for the logarithm of the factorial.
REFERENCES

[4.1] Mises von, R., Probability, Statistics and Truth, Dover Publications, N.Y., 1957.
[4.2] Frank, H., Althoen, S.C., Statistics: Concepts and Applications, Cambridge University Press, Cambridge, 1994.
[4.3] Kotz, S., Johnson, N.L., Encyclopedia of Statistical Sciences, J. Wiley, N.Y., 1988.
[4.4] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., Numerical Recipes, Cambridge University Press, Cambridge, 1992.
[4.5] Shafer, G., Pearl, J., Readings in Uncertain Reasoning, Morgan Kaufmann Publ., San Mateo, CA, 1990.
[4.6] Kolmogorov, A.N., Foundations of the Theory of Probability, 2nd ed., Chelsea Publishing Company, N.Y., 1956.
[4.7] Renyi, A., Probability Theory, North-Holland, Amsterdam, 1970.
[4.8] Billingsley, P., Probability and Measure, J. Wiley, N.Y., 1979.
[4.9] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, N.Y., 1991.
[4.10] Breiman, L., Probability, SIAM Publications, Philadelphia, 1995.
[4.11] Revesz, P., The Laws of Large Numbers, Academic Press, N.Y., 1960.
[4.12] Mane, R., Ergodic Theory and Differentiable Systems, Springer Verlag, Berlin, 1978.
[4.13] Huang, K., Statistical Mechanics, J. Wiley, N.Y., 1966.
[4.14] Gnedenko, B.V., Kolmogorov, A.N., Limit Distributions for Sums of Independent Random Variables, Addison-Wesley, Reading, 1968.
[4.15] Dagpunar, J., Principles of Random Variate Generation, Clarendon Press, Oxford, 1988.
[4.16] Yarmolik, V.N., Demidenko, S.N., Generation and Application of Pseudo-random Sequences for Random Testing, J. Wiley, N.Y., 1988.
[4.17] Niederreiter, H., Random Number Generation and Quasi-Monte Carlo Methods, SIAM Publications, Philadelphia, 1992.
[4.18] Devaney, R.L., An Introduction to Chaotic Dynamical Systems, Addison-Wesley, Redwood, 1989.
[4.19] Rasband, S.N., Chaotic Dynamics of Nonlinear Systems, J. Wiley, N.Y., 1990.
[4.20] Thompson, E.E., An Introduction to Algebra of Matrices with some Applications, Adam Hilger, London, 1969.
[4.21] Horn, R.A., Johnson, C.R., Matrix Analysis, Cambridge University Press, Cambridge, 1988.
[4.22] Usmani, R.A., Applied Linear Algebra, Marcel Dekker, N.Y., 1987.
[4.23] Cover, T.M., Thomas, J.A., Elements of Information Theory, J. Wiley, N.Y., 1991.
[4.24] Blahut, R.E., Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1990.
[4.25] Golomb, S.W., Peile, R.A., Scholtz, R.A., Basic Concepts in Information Theory and Coding, Plenum Press, N.Y., 1994.
STATISTICAL RELATIONSHIPS

A statistical relationship exists between states if one state influences the frequencies of occurrences of the other state. The statistical relationship is described either by joint or by conditional frequencies of occurrences of potential forms of states (see (4.1.24) to (4.1.26)). If the states exhibit statistical regularities, the statistical relationship is described by joint or conditional probability distributions (see (4.4.7) and (4.4.8)).

The statistical relationships are of paramount importance for the superior system. Using these relationships the system can take into account components of the states of its environment that are not directly accessible. In particular, the system may "predict" future states. This, in turn, can dramatically improve the performance of the superior system. The analysis of the optimal utilization of statistical relationships and the assessment of the achievable advantages are important topics of this book.

The basic statistical relationships are the relationships between the atomic components of states having one of the fine structures described in Section 1.3. A wide class of such states are functions of a discrete (respectively, continuous) identifier (argument(s)); see Section 1.3.3. The statistical model of such a function whose "values" exhibit joint statistical regularities is called a stochastic process. Usually the states have a macro structure. Then the macro components may also be related statistically. We call such a relationship a statistical macro-relationship. An example is the relationship between a primary process and a secondary process produced from the primary process by an indeterminate transformation.

Our considerations of statistical relationships start with a review of their rough descriptions by means of parameters. Typical such parameters are correlation coefficients and parameters based on entropy, in particular the amount of statistical information. Section 5.2 describes the birth and death processes and the Gauss processes, considered as results of deterministic transformations of a primeval train of statistically independent components. Section 5.3 is devoted to Markov processes, which can be considered as successive indeterministic transformations of an initial state. Section 5.4 presents typical descriptions of relationships between the input and the output of a channel introducing indeterministic distortions.

In a real system the transformation transforming the state into information is practically never reversible; thus, we have to consider the state as indeterminate. Characterizing such a state by the frequencies of occurrences of its potential forms or by probability, we introduce, in fact, a model of indeterminism. Besides the probabilistic models, other models of indeterminism have been proposed. We close this chapter with a systematic review of the various models of indeterminism.
5.1 THE ROUGH DESCRIPTION OF STATISTICAL RELATIONSHIPS BY PARAMETERS

Sections 4.1, 4.2, and 4.3 indicated that already for structured information consisting of a few components, the experimental estimation of joint probability distributions (or, equivalently, of conditional distributions) would be tedious. Therefore, simplified descriptions of statistical relationships by parameters are of great importance. This section concentrates on correlation coefficients and on parameters based on entropy, particularly the amount of statistical information.

5.1.1 THE CORRELATION COEFFICIENT

In Section 4.4.4 it has been shown that the correlation coefficient is a useful indicator of statistical relationships in two cases. The first is when the joint probability distribution of the components of the state is gaussian. In view of property (4.5.13) and its multidimensional generalization, this is often the case. Then from conclusion (4.5.18) it follows that the correlation coefficients describe the probability distribution exactly. The second case is when we consider only linear transformations and we describe the statistical relationships only by correlation coefficients. Then from equation (4.4.37) it follows that we can calculate the correlation coefficients for the components of the transformed state. Thus, in the case of linear transformations the rough description of statistical relationships is self-sufficient. This property of correlation coefficients plays an important role in this book.

We review here the properties of the correlation coefficient defined by (4.4.24), which allow us to use this coefficient as a rough description of statistical relationships between two random variables s(1) and s(2). First, assume that the variables are statistically independent; thus, for discrete variables equation (4.4.8a) holds, and for continuous variables equation (4.4.8b) holds. After substituting the former in (4.4.25a), respectively the latter in (4.4.25b), we prove that both for discrete and continuous variables

c(1, 2) = 0.   (5.1.1a)

Thus,

Statistically independent variables are uncorrelated.   (5.1.1b)
In general, the reverse does not hold. An example is the density p[s(1), s(2)] of the joint probability shown in Figure 5.1. The marginal probability distributions p1[s(1)] and p2[s(2)] are uniform. Figure 5.1 shows that for points inside the white square p1[s(1)] p2[s(2)] ≠ p[s(1), s(2)]. Thus, from definition (4.4.8) it follows that the variables s(1) and s(2) are statistically dependent. For symmetry reasons c(1, 2) = 0. Thus, in spite of being noncorrelated, the variables are statistically dependent. However, this is not typical. For several classes of random variables zero correlation implies independence. In particular, this holds in the important case when the variables are gaussian.
Figure 5.1. Density of the joint and the corresponding marginal probability distributions of random variables which are noncorrelated but statistically dependent; the probability density in the shadowed area is constant.

It can be easily proved (see, e.g., Papoulis [5.1], Breiman [5.2]) that

c²(1, 2) <= σ²[s(1)] σ²[s(2)],   (5.1.2)

where

σ²[s(m)] = E[s(m) − E s(m)]², m = 1, 2,   (5.1.3)

and the sign of equality holds if

s(2) = A s(1),   (5.1.4a)

where A is an arbitrary constant. Thus,

The correlation coefficient reaches its maximum when both variables are linearly related.   (5.1.4b)

From (5.1.2) it follows that

0 <= |c_n(1, 2)| <= 1,   (5.1.5)

where

c_n(1, 2) ≝ c(1, 2) / ( σ[s(1)] σ[s(2)] )   (5.1.6)

is the normalized correlation coefficient.

5.1.2 THE ENTROPY AND AMOUNT OF STATISTICAL INFORMATION 1: DISCRETE STATES

Entropy appeared in Section 4.6, where frequencies of occurrences of potential forms of states in long trains of observations exhibiting statistical regularities were analyzed. The definition (4.6.8) of the entropy of the discrete random variable s was

H(s) = Σ_{l=1}^{L} [−log₂ P(s = s_l)] P(s = s_l),   (5.1.7)

where s_l, l = 1, 2, ..., L are the values which the variable can take. Comparing this definition with the definition (4.4.9) of the average, we see that

H(s) = E[−log₂ P(s)],   (5.1.8)

where P(s) is the discrete random variable taking the value P(s = s_l) with probability P(s = s_l). It can be easily proved (see, e.g., Abramson [5.3], Cover, Thomas [5.4], Blahut [5.5]) that

The entropy H(s) reaches its maximum when P(s = s_l) = 1/L = const, and the maximum of H(s) over all probability distributions is log₂ L.   (5.1.9)

Figure 5.2 illustrates this property for L = 2.

Figure 5.2. The entropy of a binary random variable.

From definition (5.1.7) it follows that the entropy does not depend on the values of the random variable but only on its probabilities. Thus, equation (5.1.7) is, in fact, the definition of the entropy of a random object having any structure, provided that its potential forms can be identified by integers. In particular, if we take in place of s the 2-DIM variable s = {s(1), s(2)} and replace the probabilities P(s = s_l) by the joint probabilities, equation (5.1.7) becomes the definition of the entropy of s:

H(s) = Σ_{l=1}^{L} Σ_{k=1}^{L} {−log₂ P[s(1) = s_l(1), s(2) = s_k(2)]} P[s(1) = s_l(1), s(2) = s_k(2)],   (5.1.10)

and equation (5.1.8) takes the form

H(s) = E{−log₂ P[s(1), s(2)]}.   (5.1.11)

After expressing the joint probabilities in equation (5.1.10) by the conditional and marginal probabilities (using equation (4.4.7a)), we get

H(s) = H[s(1)] + H[s(2)|s(1)],   (5.1.12)

where

H[s(2)|s(1)] ≝ Σ_{l=1}^{L} H[s(2)|s(1) = s_l(1)] P[s(1) = s_l(1)],   (5.1.13)

H[s(2)|s(1) = s_l(1)] = Σ_{k=1}^{L} {−log₂ P[s(2) = s_k(2)|s(1) = s_l(1)]} P[s(2) = s_k(2)|s(1) = s_l(1)].   (5.1.14)

We call H[s(2)|s(1) = s_l(1)] the conditional entropy and H[s(2)|s(1)] the average conditional entropy.

The conditional entropy occurs only as an intermediate expression. However, of basic importance for analyzing the statistical relationships between the random variables s(1) and s(2) is the average conditional entropy. It can be shown (see, e.g., Abramson [5.3], Cover, Thomas [5.4], Blahut [5.5]) that H[s(2)|s(1)] <= H[s(2)]. The difference

I[s(1):s(2)] ≝ H[s(2)] − H[s(2)|s(1)]   (5.1.21)

is an indicator of the reduction of the indeterminism of the random variable s(2) when the value taken by the variable s(1) is known. Therefore, it is natural to call I[s(1):s(2)] the amount of statistical information that the knowledge of the random variable s(1) provides about the other random variable s(2). It is traditionally also called Shannon's information. Using (5.1.7) and (5.1.14), after some elementary algebra, from (5.1.21) we get

I[s(1):s(2)] = Σ_{l=1}^{L} Σ_{k=1}^{L} P[s_l(1), s_k(2)] log₂ { P[s_l(1), s_k(2)] / ( P[s_l(1)] P[s_k(2)] ) }.   (5.1.22)
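The chain of definitions (5.1.7)-(5.1.22) can be traced numerically. The following sketch (added here; the joint probability table is an arbitrary illustrative choice) computes the entropies, the average conditional entropy, and the amount of statistical information both from (5.1.21) and directly from (5.1.22).

```python
import numpy as np

# Sketch of (5.1.7)-(5.1.22): entropies and the amount of statistical information
# computed from a small joint probability table.
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])            # P[s(1)=s_l(1), s(2)=s_k(2)]
P1, P2 = P.sum(axis=1), P.sum(axis=0)

def H(p):                                # entropy of a probability vector, cf. (5.1.7)
    p = p[p > 0]
    return float(np.sum(-p * np.log2(p)))

H_joint = H(P.ravel())                   # joint entropy (5.1.10)
H1, H2 = H(P1), H(P2)
H2_given_1 = H_joint - H1                # chain rule (5.1.12)
I12 = H2 - H2_given_1                    # definition (5.1.21)
I12_direct = float(np.sum(P * np.log2(P / np.outer(P1, P2))))   # formula (5.1.22)
print(round(I12, 6), round(I12_direct, 6))                      # both coincide
```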
The previously used notation for probability, indicating the random variable, is useful for general considerations, but it is inconvenient for longer calculations. In equation (5.1.22) we apply a simplified notation, in which we indicate only the considered potential value. For example, the probability P[s(1) = s_l(1)] we write briefly as P(s_l). Because the random variable does not occur explicitly, we use here the italic character P and not the special symbol P. In the future, whenever this causes no confusion, we use this simplified notation.

EXAMPLE 5.1.1 ILLUSTRATION OF PROPERTIES OF ENTROPIES AND THE AMOUNT OF STATISTICAL INFORMATION

We take the joint probability distribution considered in Example 4.1.1 and given by equation (4.1.29). The parameter a determines the character of the probability distribution, as illustrated in Figure 4.1. The dependence of the joint, marginal, and conditional entropies and of the amount of information on a is shown in Figure 5.3.
Figure 5.3. The dependence of the joint, conditional, and marginal entropies, the amount of statistical information, and the normalized correlation coefficient c_n(1, 2) on the parameter a determining the character of the probability distribution given by (4.1.29) and illustrated in Figure 4.1.

According to property (5.1.9) the entropy H[s(1)] achieves its maximum for a = 1, because for this value the marginal probability distribution is uniform. However, for this value the joint probability distribution is also uniform, and consequently the variables s(1) and s(2) are independent. Then, according to properties (5.1.9) and (5.1.16), the joint entropy H[s(1), s(2)] and the conditional entropy H[s(2)|s(1)] take their maximum. Because the variables s(1) and s(2) are independent, the amount of statistical information I[s(1):s(2)] = 0. For a → 0 the variable s(1) determines the value of s(2) and vice versa. Then the amount of statistical information I[s(1):s(2)] reaches its maximum. For a → ∞ the probability distribution approaches the binary uniform distribution. Then I[s(1):s(2)] grows again asymptotically to 1, and the entropies decrease to 1. □
COMMENT 1
The frequently used abbreviation "statistical information" of the terms "amount of statistical information" and "Shannon's information" is misleading and confusing. Information, as defined by (1.1.1), has a different character than the amount of statistical information I[s(1):s(2)]. In the terminology used here, I[s(1):s(2)] is a number characterizing the set of probabilities of potential forms of the information. Therefore, the amount of statistical information has a sense only if the states exhibit statistical regularities. The term "information" in the sense defined here, which is close to its common-sense understanding, is not necessarily a number but may have any structure, and it is not necessarily associated with the existence of statistical regularities.

COMMENT 2
We considered here the entropy as a primary concept and defined the amount of statistical information in terms of entropy. However, entropy can also be considered as an amount of information. Suppose that an exact observation of the variable s(1) is available. The variable is then determined. Therefore, H[s(1)|s(1)] = 0. From (5.1.22) it follows that
I[s(1):s(1)] = H[s(1)].   (5.1.23a)
Thus,
The entropy also has the meaning of the amount of statistical information that is obtained by making an exact observation of the value that the variable takes.   (5.1.23b)
This conclusion does not contradict conclusion (5.1.20).

5.1.3 THE ENTROPY AND AMOUNT OF STATISTICAL INFORMATION 2: CONTINUOUS STATES
Up to this point we assumed that the variables are discrete. The definition (4.6.16) of the entropy of a continuous random variable s introduced in Section 4.6 was
H(s) = ∫_(s_a)^(s_b) [-log₂ p(s)] p(s) ds,   (5.1.24)
where p(s), s ∈ <s_a, s_b>, is the density of probability characterizing the random variable s. The formula
H(s) = E[-log₂ p(s)],   (5.1.25)
holding for the continuous random variable, is the counterpart of formula (5.1.8) for the discrete variable. For a pair s̄ = {s(1), s(2)} of continuous variables described by the density of joint probability p(s̄), the counterparts of definition (5.1.14) of the conditional entropy and of definition (5.1.22) of the amount of statistical information are
H[s(2)|s(1)] = ∫∫_D {-log₂ p[s(2)|s(1)]} p[s(1), s(2)] ds(1) ds(2),   (5.1.26)
I[s(1):s(2)] = ∫∫_D p[s(1), s(2)] log₂ { p[s(1), s(2)] / (p[s(1)] p[s(2)]) } ds(1) ds(2),   (5.1.27)
where D = {<s_a, s_b> × <s_a, s_b>} is the square outside which p(s̄) = 0.
EXAMPLE 5.1.2 ENTROPY AND AMOUNT OF STATISTICAL INFORMATION FOR GAUSSIAN VARIABLES
The gaussian probability distribution is given by equation (4.5.11). Substituting this in (5.1.24) with s_a = -∞, s_b = ∞ and with some calculus we get
H(s) = ½ log₂(2πeσ²),   (5.1.28)
where e = 2.718... is Euler's number (the base of natural logarithms). Some more calculations show (see, e.g., Abramson [5.3], Cover, Thomas [5.4], Blahut [5.5]) that for the 2-DIM gaussian probability density (4.5.14) the amount of statistical information is
I[s(1):s(2)] = -log₂ √(1 - c_n²(1, 2)),   (5.1.29)
where c_n(1, 2) is the normalized correlation coefficient defined by equation (5.1.6). As an application of the latter formula we take two independent gaussian variables s(1) and z, E s(1) = 0, E z = 0, and we set
s(2) = s(1) + z.   (5.1.30)
The variable s(1) may be interpreted as the model of the state at the input of a communication channel, s(2) of the state at the output, and z of the noise. We have
c(1, 2) = E s(1)s(2) = E s(1)[s(1) + z] = σ_s²,   (5.1.31)
where σ_s² = σ²[s(1)]. The normalized correlation coefficient defined by (5.1.6) is
c_n(1, 2) = σ_s / √(σ_s² + σ_z²).   (5.1.32)
Substituting in (5.1.29) we get
I[s(1):s(2)] = ½ log₂(1 + σ_s²/σ_z²). □   (5.1.33)
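A quick numerical cross-check of (5.1.29) and (5.1.33), assuming NumPy; the variances are arbitrary illustration values:

import numpy as np

sigma_s2 = 2.0   # variance of s(1), illustrative value
sigma_z2 = 0.5   # variance of the noise z, illustrative value

# Normalized correlation coefficient (5.1.32) for s(2) = s(1) + z.
c_n = np.sqrt(sigma_s2) / np.sqrt(sigma_s2 + sigma_z2)

I_via_cn  = -np.log2(np.sqrt(1.0 - c_n**2))           # (5.1.29)
I_via_snr = 0.5 * np.log2(1.0 + sigma_s2 / sigma_z2)  # (5.1.33)

print(I_via_cn, I_via_snr)   # the two values coincide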
A SPECIFIC FEATURE OF ENTROPY OF CONTINUOUS VARIABLES
The similarity of definition (5.1.7) of the discrete entropy and of definition (5.1.24) of the continuous entropy causes the entropies to have similar properties. In particular, the important relationship (5.1.5) holds in both cases, and the entropy of continuous variables has a property similar to (5.1.9). However, the entropy of a continuous variable has a specific property. It is caused by the previously signalled fact (see the discussion on page 190 after equation (4.4.6)) that the density of probability depends not only on the statistical properties of the considered information but also on the measure of volume used to define the density. To illustrate this effect we assume:
A1. The primeval state is the electrical potential of a point terminal; we denote it by the special symbol s̃;
A2. We take 1 V as the unit of potential and denote it as v.
We denote by s_v the dimensionless factor by which we have to multiply v to obtain s̃; we call s_v the representation of the primeval state based on the unit v (it is the description of the state that we previously denoted s). Next we consider another unit u of voltage, say 1 mV; we denote by u_v the representation of the unit u based on the unit v, and by s_u the representation of the primeval state s̃ based on the unit u (the measurement of the potential expressed in units u). From the definitions it follows that
s_u = s_v / u_v.   (5.1.34)
Thus, u_v has the meaning of the scaling factor that we have to use when we pass from unit v to unit u. We denote by s_v (respectively s_u) the random variables representing s_v (respectively s_u). From the definition (4.4.4) of the probability density it follows that
p_u(s_u) = u_v p_v(s_v),   (5.1.35)
where p_v(s_v) (respectively p_u(s_u)) is the probability density of the random variable s_v (respectively s_u). From (5.1.24) we finally have
H(s_u) = H(s_v) - log₂ u_v.   (5.1.36)
Thus,
The entropy of a continuous variable depends not only on the statistical properties of this variable but also on the units used to measure the continuous variable.   (5.1.37)
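A minimal numerical illustration of (5.1.36), assuming NumPy: the entropy (5.1.28) of a gaussian variable is evaluated with the potential expressed in volts and in millivolts; the difference of the two entropies equals -log₂ u_v with u_v = 10⁻³.

import numpy as np

def gaussian_entropy(sigma):
    # Differential entropy (5.1.28) of a gaussian variable, in bits.
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

sigma_volts = 0.2            # standard deviation expressed in volts (illustrative value)
u_v = 1e-3                   # 1 mV expressed in volts
sigma_millivolts = sigma_volts / u_v

H_v = gaussian_entropy(sigma_volts)
H_u = gaussian_entropy(sigma_millivolts)
print(H_u - H_v, -np.log2(u_v))   # both are about 9.97 bits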
Contrary to entropy, the amount of statistical information does not depend on the measure of volume used to calculate the conditional probabilities. This causes the properties of the amount of statistical information, in particular the effects of transformations of information (see, e.g., Abramson [5.3]), to be similar for discrete and continuous variables.

COMMENT 1
To simplify the terminology and notation we considered the components of the states as scalars. However, we can easily generalize our argumentation by defining in the general case the entropy by (5.1.25). Using the general definition (4.4.12) of the averaging operation, we define the entropy of a multidimensional continuous variable, the average conditional entropy, and the amount of statistical information.

COMMENT 2
The concept of entropy emerged when we analyzed the properties of long trains of random variables. Thus, we can expect that the entropy is relevant in situations when the fundamental property of long sequences described in Section 4.6 can manifest itself. We show that the performance of optimum vector quantization of long blocks of information depends in an essential way on entropy (see Section 8.6.1, theorem (8.6.24)) and that the amount of statistical information influences in a crucial way the performance of transmission of long blocks of information through channels introducing indeterministic distortions (see Sections 8.4.3 and 8.6.1, coding theorems (8.4.50) and (8.6.33)).
There have been, however, two approaches to the concepts of entropy and the amount of statistical information not related to the fundamental property of long sequences. The first, called the axiomatic approach, introduces some plausible properties that an indicator of indeterminism or, equivalently, of the amount of information should possess. A typical example of such a property is the additivity of entropy (5.1.17) when the information (in the sense of definition (1.1.1)) consists of independent components. An indicator of indeterminism considered in the axiomatic approach is a special case of the indicator of variety of potential forms of information discussed in Section 1.6.1, page 53. The discussion in Section 1.6.1, in particular Figure 1.23, shows that an indicator of variety is only one of several indicators of other types characterizing the performance of an information system. Therefore, the axiomatic approach, concentrating on indicators of indeterminism, cannot tackle real design problems. This has an even deeper reason. Our considerations in Section 1.1 (see, in particular, Figure 1.1) show that without specifying what the information is needed for, it is not possible to define reasonable indicators characterizing it.

The second approach to the concept of amount of information not based on the properties of long sequences is linked to the analysis of universal bounds for the best performance of information recovery rules. The Fisher information, related to the Rao-Cramer inequality, introduced in classical statistics (see, e.g., Larsen, Marx [5.7]), is an example of such a definition. In the performance bounds approach the amount of information is interpreted as a distance between probability distributions (see, e.g., Bahara [5.8]). Such a distance is introduced in Section 8.5.3, and it is shown that it determines universal bounds for the performance of optimal information systems without knowing the optimal rules. These bounds give insight into the effects of preliminary information processing. However, their usefulness is limited, since the bounds can be achieved only for some probability distributions (see Seidler [5.6]). One class of them are gaussian-like distributions. The other class are probability distributions similar to the probabilities of long sequences discussed in Section 4.6.

5.2 PROTOTYPE STATISTICAL RELATIONSHIPS
As in the case of prototype probability distributions, the type of several important statistical relationships is determined by very general assumptions that often are quite exactly satisfied. Then we can infer the type of probability distribution describing the relationships, and we have only to determine the concrete values of the free parameters of the given type. To simplify the terminology we assume that the relationships between the states have the character of time relationships and that the elementary states are scalars. However, most of the concepts that we introduce here can be modified for structured elementary states and for space relationships. We denote by s_c(t), t ∈ <t_a, t_b>, the considered time-continuous state process.
The structured state is a sequence s̄ = {s(1), s(2), ..., s(N)}, where
s(n) = s_c(t_n),  n = 1, 2, ..., N,   (5.2.2)
and t_n are the sampling instants.
The sequence
s̄ = {s(n), n = 1, 2, ..., N}   (5.2.3)
of random variables s(n) is the model of statistical time relationships. It is called a time-discrete stochastic (random) process (chain). From a formal point of view there is no difference between a time-discrete stochastic process and a multidimensional random variable. In particular, both are described either by a joint or by a conditional probability distribution. Specific for a stochastic process is the interpretation as a train of variables representing observations of a time process at successive instants. Such an interpretation justifies specific assumptions about statistical relationships, usually reflecting the primary deterministic relationships between the states at sampling points described in Chapter 3. The theory of stochastic processes is the subject of many publications. An introduction to this area can be found in Papoulis [5.1] and Helstrom [5.9]; a more advanced analysis is presented in Parzen [5.10], Shanmugan, Breipohl [5.11], and in the classical monograph of Lapierre, Fortet [5.12].

We describe here two basic types of trains of statistically related elements, which can be considered as transformations of trains of primeval statistically independent elements. In some cases, the mechanism generating the considered train has the character of such a transformation. However, even if the mechanism of generating the train is not known, the representation of a given train as a hypothetical transformation of a hypothetical primary train of statistically independent elements gives much insight into the properties of stochastic processes and is very useful for their analysis and simulation. In subsequent considerations we frequently exploit such an interpretation. In the first subsection the train of binary variables is taken as the primeval train of statistically independent variables, while in the second subsection the train of gaussian variables is taken.

5.2.1 POISSON PROCESS AND DERIVED PROCESSES
The states of many systems can be considered to be the result of triggering events of very short duration. A triggering event may initiate a lasting process or end an already running process. In the first case, the triggering event is called a birth event, in the second a death event. A typical application of this concept is a model of information packets arriving at an information system: the instant when the packet arrives is interpreted as the birth instant, and the instant when it ends as the death instant (see Figure 5.4). Since the duration of a triggering event is usually negligible, we may consider it as a point on the time axis. Therefore, the train of triggering events is called a point process. It is described by the instants at which the triggering events occur.
Figure 5.4. Illustration of the definition of the Poisson process and derived processes: (a) illustration of notation, (b) a train of triggering events, (c) a train of pulses generated by the primary birth process (up arrows) and the auxiliary death process (down arrows).

Often the mechanism generating the triggering events is such that:
A1. A triggering event can occur in a short elementary interval <t, t+T_Δ> with a probability P_t;
A2. The probability P_t does not depend on t;
A3. The occurrences of triggering events in disjoint elementary intervals are statistically independent.
We count the triggering events in an observation interval <T_1, T_2>, where
T_1 = n_1 T_Δ,  T_2 = n_2 T_Δ.   (5.2.5)
Denoting by M(T_1, T_2) the number of triggering events occurring in this interval and by M = n_2 - n_1 the number of elementary intervals, we get from A1 to A3
P[M(T_1, T_2) = m] = [M! / (m!(M-m)!)] P_t^m (1 - P_t)^(M-m).   (5.2.6)
We interpret a time-continuous point process describing a train of triggering events as the limiting case of the train of statistically independent binary states when T_1 = const, T_2 = const, T_Δ → 0, P_t → 0, but so that the limit
λ = lim P_t/T_Δ   (5.2.7)
exists. On these assumptions, after some elementary algebra, from (5.2.6) we get
lim P[M(T_1, T_2) = m] = {[λ(T_2 - T_1)]^m / m!} e^(-λ(T_2 - T_1)),   (5.2.8)
where M(T_1, T_2) is the number of triggering events occurring in the interval <T_1, T_2>.
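The passage from the binomial distribution (5.2.6) to the Poisson distribution (5.2.8) can be checked by simulation. A sketch assuming NumPy; the intensity λ and the interval length are illustrative values:

import numpy as np
from math import factorial, exp

rng = np.random.default_rng(0)

lam = 3.0          # intensity lambda of the limiting Poisson process (illustrative)
T = 2.0            # length of the observation interval T2 - T1 (illustrative)
T_delta = 2e-3     # elementary interval; P_t = lam * T_delta is small
n_slots = int(T / T_delta)
n_runs = 4000

# Train of independent binary states: 1 means a triggering event in the slot.
events = rng.random((n_runs, n_slots)) < lam * T_delta
M = events.sum(axis=1)             # number of triggering events per realization

for m in range(11):
    empirical = np.mean(M == m)
    poisson = (lam * T) ** m / factorial(m) * exp(-lam * T)   # (5.2.8)
    print(m, round(float(empirical), 4), round(poisson, 4))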
In view of property (4.5.20), equivalent to A2 is the assumption that
A3'. The variables u(m) are uncorrelated:
c_uu(m, m') = δ(m, m'),   (5.2.11)
where
δ(m, m') = 1 if m = m', and δ(m, m') = 0 if m ≠ m',   (5.2.12)
is the Kronecker delta function. We can write assumption (5.2.11) equivalently in the form
C_uu = D_1,   (5.2.13)
where C_uu is the correlation matrix of the train u(n), n = 1, 2, ..., and D_1 is the unit diagonal matrix (see (4.5.16)). A train of uncorrelated gaussian variables may be considered as the primeval gaussian process. It is the counterpart of the train of statistically independent binary components defined by assumptions A1 to A3 on page 220. In general, the components of a gaussian process are correlated. We show now that such a process can always be presented as the effect of a linear transformation of a hypothetical primary uncorrelated gaussian process. We denote by H a matrix describing the linear transformation, and we look for such a matrix that a given, in general correlated, gaussian process S = {s(n), n = 1, 2, ..., N} can be represented in the form
S = H U,   (5.2.14)
where U = {u(n), n = 1, ..., N} is the previously described primeval gaussian process. We demand that the transformation can be realized in real time, thus that the mth component of the transformed train depends only on the mth or earlier components of the uncorrelated train. This is equivalent to the requirement:
The matrix H has only 0's above the main diagonal.   (5.2.15)
Thus the matrix H should have a structure similar to that of the square matrix in (3.2.46), shown also in Figure 3.9a. To prove that the transformation we are looking for exists and to find its concrete form we use conclusions formulated previously:
• The density of joint probability of gaussian variables with zero mean values is determined completely by their correlation matrix (see conclusion (4.5.21)).
• A process obtained by a linear transformation of a gaussian process is again a gaussian process (see conclusion (4.5.22)).
From these conclusions it follows that finding the presentation of the gaussian process s(n) reduces to finding such a matrix H that the correlation matrix of the transformed process H U is equal to the given correlation matrix C_ss. Using (4.4.37) we write this condition in the form
H H^T = C_ss.   (5.2.16)
The matrix H we are looking for plays the role of the "variable" in this matrix equation. Since the correlation matrix C_ss and the matrix H H^T are symmetric, the matrix equation (5.2.16) is equivalent to N(N+1)/2 scalar equations. Therefore, on very general assumptions we can find a solution of the matrix equation satisfying the additional requirement (5.2.15). There exists an efficient algorithm to find numerically the elements of this matrix, called the Cholesky decomposition algorithm; programs (also on diskettes) realizing this transformation can be found in Press [5.14], in program packages [5.15], or in program packages described by Wolfram [5.16]. The solution of equation (5.2.16) has the meaning of a transformation shaping the primeval uncorrelated gaussian train into the given train, and it is determined by the correlation matrix C_ss. Therefore, the solution is denoted as H_sh(C_ss). Thus, we have
S = H_sh(C_ss) U.   (5.2.17)
This representation is illustrated in Figure 5.5a. Since the matrix H_sh has the same structure as the matrix H in equation (3.2.46), the transformation can be realized by the time-discrete linear filter with time-varying coefficients shown in Figure 3.8e.

Figure 5.5. The presentation of a correlated process s(n), n = 1, 2, ..., N as a real-time linear transformation of an uncorrelated primary process u(n): (a) the generator of the uncorrelated gaussian train followed by the linear shaping transformation H_sh(C_ss) producing s(n), (b) the decorrelating transformation producing the decorrelated process u(n) from s(n).
It can also be proved that the inverse matrix
H_dc(C_ss) = H_sh^(-1)(C_ss)   (5.2.18)
exists and satisfies condition (5.2.15). Left-multiplying formula (5.2.14) by H_dc(C_ss) gives
U = H_dc(C_ss) S.   (5.2.19)
Thus, the transformation H_dc(C_ss) produces from the primary train s(n) the uncorrelated process u(n) (see Figure 5.5b). Therefore, it is called the decorrelating transformation; hence the notation. In Section 7.3 we study the decorrelating transformations in detail. The decorrelating transformation can again be realized by a time-discrete linear filter with time-varying coefficients shown in Figure 3.8e.
Tables 5.2.1. Simulation of gaussian processes: (a) the assumed correlation matrix C_ss, (b) the shaping matrix H_sh, (c) the decorrelating matrix H_dc.
Figure 5.6. Simulated realizations of gaussian processes: (a) a typical realization of the primary uncorrelated gaussian process u(n), (b) the corresponding realization of the process s(n) with the correlation matrix given in Table 5.2.1, obtained by the shaping transformation.
From our considerations it follows that
Every gaussian process can be considered as an effect of a real-time linear transformation of a primeval uncorrelated gaussian process. The uncorrelated gaussian process can be produced from the primary gaussian process by a real-time linear transformation.   (5.2.20)
In Section 4.3.4 it was mentioned that observations of random variables can be simulated by pseudo-random numbers generated by deterministic algorithms (see Dagpunar [5.17], Niederreiter [5.18]). The programs generating trains of pseudo-random numbers can be found in Press [5.14], in program packages [5.15], or in programs described by Wolfram [5.16]. We illustrate the shaping and decorrelation of processes simulated by pseudo-random numbers with an example.

EXAMPLE 5.2.1 SIMULATION OF A GAUSSIAN PROCESS WITH GIVEN CORRELATION MATRIX
We assume that the correlation coefficients are given by the formula
c_ss(m, k) = A_1 e^(-β_1|m-k|) + A_2 e^(-β_2|m-k|).   (5.2.21)
For numerical calculations we take N = 8, A_1 = 1, A_2 = 0.5, β_1 = 1, β_2 = 0.5. The correlation matrix C_ss, the shaping matrix H_sh obtained by Cholesky decomposition, and the decorrelating matrix H_dc are given in Table 5.2.1, while typical realizations u(n), s(n), n = 1, 2, ... of the simulated primeval uncorrelated gaussian process and of the correlated gaussian process, respectively, are shown in Figures 5.6a and 5.6b. If we applied to the correlated process the decorrelating transformation described by the matrix H_dc given in Table 5.2.1c, we would obtain back the realization of the uncorrelated process shown in Figure 5.6a.

Notice the specific structure of the shaping matrix H_sh given in Table 5.2.1: in each row we have only positive elements. Thus, an element of the correlated process s(n) can be interpreted as a sum of positively weighted elements of the uncorrelated process u(m), m = 1, 2, ..., n. This causes the produced process to be smooth. In a typical row of the decorrelating matrix H_dc we have interleaved positive and negative elements. This causes the fast oscillations of the process u(n). □

The decorrelating transformation described by the matrix H_dc gives not only insight into the properties of gaussian processes. It can also be used as a preliminary transformation simplifying real-time compression of a train of pieces of information. Therefore, constructive algorithms decorrelating a train of arriving correlated pieces of information successively, piece by piece, are important. Here we present such an algorithm, which is an application of the Gram-Schmidt orthogonalization algorithm (see Press et al. [5.14]). Another algorithm, using a predictive-subtractive procedure, is discussed in Section 7.5.3.

A train s(m), m = 1, 2, ..., that is an observation of the train of correlated random variables s(m), m = 1, 2, ..., with E s(n) = 0 is considered. As the first component of the decorrelated train satisfying conditions (5.2.11) we take the random variable
u(1) = h_d(1) s(1)   (5.2.22)
and we require that
E u²(1) = 1.   (5.2.23)
From equations (5.2.22) and (5.2.23) it follows that
h_d(1) = 1/√c_ss(1, 1),   (5.2.24)
where
c_ss(m, n) = E s(m) s(n).   (5.2.25)
To obtain a random variable u(2) that is not correlated with the random variable u(1), we look for a linear combination of the random variables s(1) and s(2). In view of (5.2.22) we can equivalently take a linear combination of u(1) and s(2). As such a combination we take
v(2) = h(2, 1) u(1) + s(2).   (5.2.26)
We require that
E u(1) v(2) = 0.   (5.2.27)
Since we set only one requirement, it is enough to introduce in (5.2.26) only one free parameter h(2, 1). Substituting (5.2.26) in (5.2.27) and taking into account (5.2.23) we get
h(2, 1) = -E u(1) s(2).   (5.2.28)
Thus, the random variable
v(2) = [-E u(1) s(2)] u(1) + s(2)   (5.2.29)
is not correlated with u(1), but the variance of v(2) is in general not 1. The normalized random variable
u(2) = v(2)/{E v²(2)}^(1/2)   (5.2.30)
is the second component of the decorrelated train satisfying the conditions (5.2.11). Substituting (5.2.29) in (5.2.30) we get
u(2) = h_d(2, 1) u(1) + h_d(2, 2) s(2),   (5.2.31)
where
h_d(2, 1) = h(2, 1)/{E v²(2)}^(1/2),  h_d(2, 2) = 1/{E v²(2)}^(1/2).   (5.2.32)
Using (5.2.22) and (5.2.24) we express both coefficients in terms of the correlation coefficients c_ss(1, 1), c_ss(1, 2), and c_ss(2, 2).

This procedure is continued. Suppose that after n steps we have produced a train of linear combinations u(m) of the random variables s(m'), m' = 1, 2, ..., m, such that
E u(m') u(m'') = 0 for m' ≠ m'',  E u²(m) = 1,  m', m'', m = 1, 2, ..., n.   (5.2.33)
We consider the linear combination
v(n+1) = Σ_(m=1)^(n) h(n+1, m) u(m) + s(n+1)   (5.2.34)
and we require that
E v(n+1) u(m) = 0  for m = 1, 2, ..., n.   (5.2.35)
Substituting (5.2.34) and using (5.2.33) we get
h(n+1, m) = -E u(m) s(n+1),   (5.2.36)
and
v(n+1) = -Σ_(m=1)^(n) [E u(m) s(n+1)] u(m) + s(n+1)   (5.2.37)
is the new variable that is not correlated with the previously decorrelated variables.
The new decorrelated variable with variance 1 that we are looking for is
u(n+1) = v(n+1)/{E v²(n+1)}^(1/2).   (5.2.38)
Expressing successively the variables u(m) in terms of s(1), s(2), ..., s(m), we put (5.2.37) in the form
u(n+1) = Σ_(m=1)^(n+1) h_d(n+1, m) s(m),   (5.2.39)
with coefficients h_d(n+1, m) expressed by the correlation coefficients c_ss(k, l).
COMMENT
In definitions (5.2.24), (5.2.30), and (5.2.38) of the normalized decorrelated variables we took the positive square root. A negative square root can also be taken. Thus, there are several solutions of the considered decorrelation problem. However, they differ only in the signs in front of the vectors h_d(n) = {h_d(n, m), m = 1, 2, ..., n}.

TIME-CONTINUOUS GAUSSIAN PROCESS
The time-continuous stochastic process is defined as a function assigning to a continuous argument t ∈ <t_a, t_b> a random variable s(t). The process is called gaussian if for every finite set of instants the corresponding random variables are jointly gaussian. The counterpart of the correlation matrix of a time-discrete process is the correlation function
c_ss(t', t'') = E s(t') s(t'');  t', t'' ∈ <t_a, t_b>.   (5.2.40)
From the property (4.5.21) of the gaussian variables and from the definition of the time-continuous gaussian process it follows that:
The statistical properties of a gaussian process are determined by its mean value E s(t) and its correlation function c_ss(t', t'').   (5.2.41)
An important class of time-continuous stochastic processes are processes whose correlation function does not change when the origin of the time scale is shifted. To formulate this property it is convenient to assume that the process is observed on the whole time axis, thus, to assume that
t_a = -∞,  t_b = ∞.   (5.2.42)
In such a case, from the assumption that the statistical properties of the process do not change if the origin of the time axis is shifted, it follows that
E s(t) = const,   (5.2.43)
c_ss(t', t'') = γ_ss(t'' - t'),   (5.2.44)
where γ_ss(τ) is a function of one argument. Such a time-continuous process is called stationary, and γ_ss(τ) is called the one-argument correlation function (briefly, the correlation function). To simplify the argument it is assumed in the subsequent considerations that
E s(t) = 0.   (5.2.45)
Analyzing transformations of a process by a linear stationary system, it is convenient to represent the process as a superposition of harmonic processes. Such a representation is discussed in Section 7.4.4. It is shown there that a stationary stochastic process can be represented in the form (7.4.53)
s(t) = (1/2π) ∫_(-∞)^(∞) e^(jωt) dA(ω),  -∞ < t < ∞,   (5.2.46)
where dA(ω) has the meaning of an infinitely small random complex amplitude of the harmonic function e^(jωt). The power of the infinitely small harmonic component dA(ω)e^(jωt) is
E|dA(ω)|² = S_s(ω) dω.   (5.2.47)
The function S_s(ω) is called the power spectral density, and its basic properties are discussed in Section 7.4.4.

When the angular frequencies of all spectral components of the process lie in the frequency band <-2πB, 2πB> and S_s(ω) = S_0 = const for -2πB < ω < 2πB, then (see Section 7.4.3) the process is represented almost exactly by its samples s(k) taken with the period 1/2B, and these samples are uncorrelated, hence statistically independent, gaussian random variables. The density of their joint probability is
p(s) = A_1 exp[ -(1/2σ²) Σ_k s²(k) ],
where the variance σ² is given by equations (7.4.64) and (7.4.62) and A_1 is given by equation (4.5.19b). Since the set s of samples given by (5.2.49) represents almost exactly the segment {s(t), t ∈ <t_a, t_b>} of the process, the probability density of a realization of the process can be written in the form
p[s(·)] = A exp{ -(1/2S_0) ∫_(t_a)^(t_b) s²(t) dt },   (5.2.52)
where A = A_1 A_2 > 0 is a constant. It is not defined, since we did not define A_2. However, such a definition is not necessary, because we use formula (5.2.52) to calculate the ratios of conditional probabilities (see, e.g., Section 5.4).

We presented here only a sketchy review of the properties of time-continuous random processes that are needed in the forthcoming considerations. For a detailed discussion see, e.g., Lapierre, Fortet [5.12], Shanmungan [5.11].
5.3 MARKOV PROCESSES
Often a current state can be considered to be an indeterministic transformation of the preceding states. An important subclass of such states are states in which the knowledge of a fixed, usually small number of preceding states makes the knowledge of still earlier states obsolete. Such a process is called a Markov process. Here we illustrate the basic properties of those processes with a simple but representative special case. We assume that:
• The random variables s(n), n = 1, 2, ..., N are discrete,
• The set of potential forms of each variable is the same; we denote these potential forms as s_l, l = 1, 2, ..., L.
We say that a time-discrete process is a Markov process if
P[s(n)=s_{l(n)} | s(n-1)=s_{l(n-1)}, s(n-2)=s_{l(n-2)}, ..., s(1)=s_{l(1)}] = P[s(n)=s_{l(n)} | s(n-1)=s_{l(n-1)}]   (5.3.1a)
for all n > 1 and all l(n), where l(n) ∈ {1, 2, ..., L}. Equivalently:
For evaluation of the conditional probability distribution of the next element of a Markov process, the exact information about the current state makes the information about earlier states obsolete.   (5.3.1b)
In many cases, we can interpret the considered train of states as an effect of a transformation of a train of primary states by a system. Then, from the statistical properties of the train of primary states and of the system we may conclude that the considered train is a Markov train. We illustrate this with a typical example.
EXAMPLE 5.3.1 THE TRAIN OF STATES OF A BUFFER
We take the buffering system described in Section 3.3.1 and shown in Figure 2.23, and we look at the state s(w, t_n-0) of the buffer memory, defined as the number of waiting packets at the instant t_n-0, just before the beginning of the nth cycle of the system's operation (see Section 3.3.1). Suppose that t_(n-1) is the current instant. From the rules of operation of the system, in particular from equations (3.3.4) and (3.3.5), it follows that when we know the number s(w, t_(n-1)-0) of packets in the buffer at the instant t_(n-1)-0, then the probability distribution of the number of packets stored in the buffer at the instant t_n depends only on the probability distribution of the number of packets arriving during the current cycle, and not on the earlier states of the buffer. Thus, the train of states s(w, t_n-0), n = 1, 2, ..., is a Markov train. □
From the definition (5.3.1a) and from equation (4.4.7a) it follows that
P[s̄(n)=S_{l(n)}] = P[s(n)=s_{l(n)} | s(n-1)=s_{l(n-1)}] P[s̄(n-1)=S_{l(n-1)}],
P[s̄(n-1)=S_{l(n-1)}] = P[s(n-1)=s_{l(n-1)} | s(n-2)=s_{l(n-2)}] P[s̄(n-2)=S_{l(n-2)}],
. . .
P[s̄(2)=S_{l(2)}] = P[s(2)=s_{l(2)} | s(1)=s_{l(1)}] P[s(1)=s_{l(1)}],   (5.3.2)
where S_{l(n)} = {s_{l(1)}, s_{l(2)}, ..., s_{l(n)}} is a concrete train of potential forms. Multiplying these equations side by side we get
P[s̄(n)=S_{l(n)}] = P_{l(1)}(1) Π_(m=2)^(n) P_{l(m)|l(m-1)}(m),   (5.3.3)
where
P_{l|k}(m) = P[s(m)=s_l | s(m-1)=s_k]   (5.3.4)
are the transition probabilities. Thus, a Markov process is completely described by the transition probabilities (5.3.4) and by the probabilities P_{l(1)}(1) of the first state. In particular, from (4.4.8c), we get the probability distribution of the state s(n):
P[s(n)=s_{l(n)}] = Σ_{l(1), l(2), ..., l(n-1)} P_{l(1)}(1) Π_(m=2)^(n) P_{l(m)|l(m-1)}(m).   (5.3.5)
Equation (5.3.5) also shows that, although (5.3.1) holds, all states are statistically interrelated. We will illustrate this statement in the forthcoming example.

If the transition probabilities do not depend on the number m of the state, we say that they are stationary; we denote them P_{l(2)|l(1)}. The assumption that the transition probabilities are stationary is justified if the deterministic relationships between the states of the system are time invariant (see 3.2.5, page 155) and the statistical properties of the primeval states are stationary too. This is so, for example, in the case of the buffering system described in Section 3.3.1 with a stationary Poisson-exponential train at the input. In general, even if the transition probabilities are stationary, the probabilities P[s(n)=s_{l(n)}] characterizing the nth state depend on n. However, if they do not depend on n, we call the Markov process stationary. We illustrate the introduced concepts with a simple example.

EXAMPLE 5.3.2 A BINARY MARKOV PROCESS
We assume that
A1. The elementary state is binary; its potential forms are s_0 = 0, s_1 = 1;
A2. The probabilities P_l(1), l = 0, 1 are given; we denote briefly P_l(n) = P[s(n)=s_l];
A3. The transition probabilities are stationary and are given; we denote P_{l|k} = P[s(n)=s_l | s(n-1)=s_k].
Let us calculate, for example, P_l(2). From (5.3.5) we have
P_0(2) = P_{0|0} P_0(1) + P_{0|1} P_1(1),
P_1(2) = P_{1|0} P_0(1) + P_{1|1} P_1(1).   (5.3.6a)
In a similar way,
P_l(3) = Σ_(m=0)^(1) P_{l|m} P_m(2) = Σ_(m=0)^(1) Σ_(k=0)^(1) P_{l|m} P_{m|k} P_k(1),  l = 0, 1.   (5.3.6b)
Figures 5.7a and 5.7b illustrate the calculation of P_0(2) and P_0(3), respectively.
Figure 5.7. Illustration of the calculation of marginal probabilities of states of a Markov process when the probabilities of the potential forms of the first state and the stationary transition probabilities are known: (a) calculation of P_0(2), (b) of P_0(3).
From the diagrams we see that to find the probability P_l(n) we have to look for all paths going from a potential form of the state s(1) to the considered potential form s_l of the state s(n), to assign to each such path the probability of the potential form of the first state multiplied by all transition probabilities corresponding to the traversed pairs of potential forms of states, and to sum over all such paths. Let us denote
P(n) = [P_0(n), P_1(n)]^T,  P_{2|1} = [ P_{0|0}  P_{0|1} ; P_{1|0}  P_{1|1} ].   (5.3.7a)
Using these matrices we write (5.3.6) in the form
P(2) = P_{2|1} P(1),  P(3) = P_{2|1}² P(1).   (5.3.7b)
The obvious generalization of the second equation is
P(n) = P_{2|1}^(n-1) P(1).   (5.3.8)
Let us take the numerical values
P(1) = [0.5, 0.5]^T,  P_{2|1} = [ 0.8  0.5 ; 0.2  0.5 ].   (5.3.9)
The probabilities P_l(n) calculated from (5.3.8) are shown in Figure 5.8. □
Figure 5.8. Marginal probabilities of the nth state for fixed probabilities of the first state and fixed transition probabilities (given by (5.3.9)).

From Figure 5.8 we see that although the transition probabilities are stationary, the process is not stationary. However, we also see that when the process evolves (n grows), the probability distribution of a state stabilizes and converges to a limit. This is a general property of a very wide class of Markov processes with stationary transition probabilities (see, e.g., Parzen [5.10] or Lapierre, Fortet [5.12]). The asymptotically stable probability distribution of a state is determined only by the transition probabilities and does not depend on the probability distribution of the first state. If the probability distribution of the first state is the same as the asymptotic distribution, the probability distributions of all other states are the same; thus, the process is time invariant (stationary). We determine the stationary probability distribution from the condition that the probability distribution of the second state is the same as that of the first state. Using (5.3.7) with n = 2 we write this condition in the form
P = P_{2|1} P,   (5.3.10)
where P_{2|1} is the matrix of stationary transition probabilities and the elements of the vector P are considered as variables. We interpret the condition (5.3.10) as a set of L equations. However, only L-1 of those equations are linearly independent. Therefore, to the set of equations (5.3.10) we have to add one more equation requiring that the sum of the probabilities is 1. Let us take the transition probabilities assumed in Example 5.3.2. From equation (5.3.10) we get the stationary probability
P_0 = 0.72.   (5.3.11)
We assume now that the transition probabilities are stationary. Substituting equation (5.3.3) in the generalized definition (5.1.10) of joint entropy we get
H_s̄ = H[s(1), s(2), ..., s(N)] = H[s(1)] + (N-1) H[s(2)|s(1)],   (5.3.12)
where H[s(2)|s(1)] is the average conditional entropy. The entropy per element defined by (4.6.18) is
H_1(N) = H_s̄ / N.   (5.3.13)
After substituting (5.3.12) we obtain
lim_(N→∞) H_1(N) = H[s(2)|s(1)].   (5.3.14)
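A sketch of these calculations, assuming NumPy and illustrative transition probabilities: the marginal probabilities (5.3.8), the stationary distribution satisfying (5.3.10), and the limit (5.3.14) of the entropy per element are computed for a binary Markov process with stationary transition probabilities.

import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Illustrative stationary transition probabilities P_{l|k} (columns sum to 1)
# and probabilities of the first state.
P_trans = np.array([[0.9, 0.3],
                    [0.1, 0.7]])
P1 = np.array([0.5, 0.5])

# Marginal probabilities of the nth state, P(n) = P_trans^(n-1) P(1), cf. (5.3.8).
P_n = P1.copy()
for n in range(2, 11):
    P_n = P_trans @ P_n
    print(n, np.round(P_n, 4))

# Stationary distribution: eigenvector of P_trans for eigenvalue 1, normalized (5.3.10).
w, v = np.linalg.eig(P_trans)
P_stat = np.real(v[:, np.argmin(np.abs(w - 1.0))])
P_stat = P_stat / P_stat.sum()
print("stationary:", np.round(P_stat, 4))

# Limit (5.3.14) of the entropy per element: the average conditional entropy H[s(2)|s(1)].
H_rate = sum(P_stat[k] * binary_entropy(P_trans[0, k]) for k in range(2))
print("entropy per element:", round(float(H_rate), 4))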
Let us take again the stationary transition probabilities given by (5.3.9) and the stationary marginal probabilities given by (5.3.11). Substituting those values in (5.3.14) we get
lim_(N→∞) H_1(N) = 0.73.   (5.3.15)
To this point we considered the simplest type of Markov processes. The obvious generalization of the basic definition (5.3.1) is to assume that the conditional probability on the left side of (5.3.1a) depends not on the last state but on the M last states. Such a process is called a Markov process of rank M. Thus, the previously considered Markov process defined by (5.3.1a) is a Markov process of rank M = 1. The trains of statistically independent binary elements considered in Section 5.2.1 may be classified as Markov processes of rank 0. To simplify the argument we considered discrete scalar-valued trains, thus functions of a scalar discrete argument n taking discrete values, so the structural type of the trains is Td1(d1). The concept of Markov processes can be generalized for other types of states, in particular for the types TdK(d1) (discrete vector-valued, time-discrete processes), TcK(d1) (continuous vector-valued, time-discrete processes), and Td1(c1) (discrete-valued, time-continuous processes; the previously described Poisson process may be classified as such a process of rank 0). This has been only a synthetic review of the fundamental properties of Markov processes that will be needed in the following chapters. For more detailed studies see, e.g., Parzen [5.10], Blanc-Lapierre, Fortet [5.12]. Although the concept of the Markov process is primarily associated with the sequential structure, generalizations of this concept for functions of a structured argument, such as images (structural type Td1(d2)), have also been used.
5.4 THE RELATIONSHIPS BETWEEN A STATE AND ITS INDETERMINISTIC TRANSFORMATION
We now consider the relationships between a structured primary state and the result of its indeterministic transformation discussed in Section 1.5.5 (see also Figure 1.14). To make these considerations concrete we interpret the transformation as the transmission of an input signal by a communication channel introducing indeterministic distortions. However, our argument, after slight modification, applies to other information channels, particularly to transformations of primary data in storage media and to information sources producing information that can be considered as a modification of a prototype. A component of a Markov process can be interpreted as a result of an indeterministic transformation of the preceding state. The problem considered here is in principle similar. In particular, we concentrate on the conditional probability of the transformed process on the condition that the primary state is given. The difference from the previous analysis of Markov processes is that now we assume that the state is structured, particularly that it has a time structure. On the other hand, we consider here only a single transformation, corresponding to a single step of evolution of a Markov process.
We present the general method of calculating the conditional probability distribution of the transformed state on the condition that the primary state is fixed and illustrate it with concrete examples. In Chapter 8 we show that the conditional probability distribution is of paramount importance, because it determines the optimal information recovery rules.

5.4.1 THE BASIC MODEL
The transformation of a primary information by a communication channel, considered in Section 2.1.1, is a typical example of the indeterministic transformation of a structured state. We analyze such a transformation in more detail. To have consistent notation we substitute w → s, r → v. Since we consider here both time-continuous and time-discrete processes, to the symbols of processes used in Section 2.1.1 we add the subscript "c" as a reminder that the process is time-continuous. We assume that
A1. The primary state is a time-continuous process {s_c(t), t ∈ <t_a, t_b>};
A2. The transformed state is a time-continuous process v_c(t) that can be represented in the form
v_c(t) = v_cn(t) + z_c(t),   (5.4.1)
where z_c(t) is the noise and
v_cn(t) = V_cn[s_c(·), b, t]   (5.4.2)
is the noiseless component, produced from the primary process by a transformation V_cn(·,·,·) depending on the side parameters b (see Figure 5.9).

Figure 5.9. The model of a typical indeterministic transformation of the primary continuous state process s_c(t) performed by communication or storage channels: v_c(t) - the transformed process, v_cn(t) - the noiseless component, z_c(t) - the noise, b - the side parameters, V_cn(·,·,·) - the transformation generating the noiseless component.
Usually, v_cn(t) depends on the course of the process {s_c(t'), t' ∈ <t_a, t>} prior to the considered instant t. Typical examples of the effect of the side parameters are a change of scale and a delay of the primary state process:
v_cn(t) = a s_c(t - t_d),  b = {a, t_d}.   (5.4.3)
For images, the counterparts are scale change and displacement (shift and/or turning) of the primary state process (image). The side parameters b are usually not known, and consequently the transformation V_cn(·,·,·) is indeterministic. The primary process s_c(·) can often be recovered exactly from the noiseless process v_cn(·), irrespective of the values the parameters b take, however with an unavoidable delay. If this delay is not taken into account, the transformation V_c[s_c(·)] can be considered to be an indeterministic but reversible transformation (see Figure 1.14 and Section 1.5.5). Although the recovery is then possible, it may be tedious. Therefore, the side parameters b are also called nuisance parameters.

The advantages of numerical analysis and of digital processing cause the time-continuous processes to be usually sampled before subsequent processing. We denote
v(n) = v_c(t_n),  t_a ≤ t_n ≤ t_b,  n = 1, 2, ..., N,   (5.4.4)
where t_n are the sampling instants.
The time-discrete counterpart of (5.4.1) is
v(n) = v_n(n) + z(n),  n = 1, 2, ..., N,   (5.4.5a)
where z(n) are the corresponding samples of the noise and
v_n(n) = V_n[s_c(·), b, n].   (5.4.5b)
V_n(·,·,·) is the transformation producing the noiseless train (it includes the previously considered transformation V_cn(·,·,·) and a sampling transformation).

COMMENT
Both the noise and the side parameters are components of side states (in the sense of Section 1.5.1; see Figure 1.13), however there is a difference between them. In typical situations the noise has two features. It is "small" compared with the noiseless component; thus, the distortion of the transformed process is small too. The second feature is that the noise usually depends on many independent factors; thus, its indeterminism is large (a typical noise process is described in Example 4.3.1). With the side parameters the typical situation is just the opposite. They cause large distortions, but because their number is small, their indeterminism is also small. These differences cause the techniques of counteraction to be different. We attempt to increase the energy or to build more structure (redundancy) into the useful process to make the noise relatively smaller. This can be done in a fixed way, using error-correcting coding, or in a flexible way in systems with feedback (see Section 2.1.2). For side parameters we attempt to get information about them, estimate them, and compensate their effect. The intelligent systems considered in Section 1.7.2 operate in such a way.
This comment applies not only to concrete states but also to meta-states. Thus, we may have side parameters that have the meaning of parameters determining the statistical properties of the noise. Usually, the parameters characterizing the meta-states change much more slowly than the parameters characterizing the concrete states. Therefore, the assumption that the parameters characterizing the meta-state are constant during a cycle of the system's operation is usually well justified.

5.4.2 CALCULATION OF PROBABILITY DISTRIBUTION OF THE TRANSFORMED STATE WHEN THE NOISELESS COMPONENT IS EXACTLY KNOWN
We assume now that the noise and the side parameters exhibit statistical regularities. Thus, the noise can be considered as a realization of a random process z_c(t), t ∈ <t_a, t_b> (respectively, of random variables z(n)), and the side parameters as realizations of random variables. We assume first that the primary state s and the transformed state v are one-dimensional and that the noiseless component V_n(s) is exactly known. Then, from v = V_n(s) + z, it follows that
p(v|s) = p_z[v - V_n(s)],   (5.4.8)
where p_z(z) is the probability density of the noise. In particular, when the noise is a gaussian variable with E z = 0 and variance σ_z², we get
p(v|s) = (2πσ_z²)^(-1/2) exp{ -[v - V_n(s)]²/2σ_z² }.   (5.4.9)
Thus, for an observer who knows the primary state, the value of the transformed state fluctuates around the noiseless component of the transformed signal, and the range of fluctuations is determined by the variance σ_z² of the noise.

In our argument we did not use the assumption that the states are one-dimensional. Therefore, the generalization of the basic equation (5.4.8) for any structured primary state and a transformed state having the structure of type TcN(cK) or TcN(dK) is straightforward. For example, let us assume that
B1. The primary state s is 1-DIM;
B2. The transformed information is N-DIM, v = {v(n), n = 1, 2, ..., N};
B3. The noise z = {z(n), n = 1, 2, ..., N} is a realization of the N-DIM random variable Z;
B4. The variable Z is a gaussian variable with statistically independent components; thus its probability density p_z(z) is given by (4.5.14).
The generalization of equation (5.4.9) is:
p(v|s) = p_z[v - V_n(s)] = A exp{ -(1/2σ_z²) Σ_(n=1)^(N) [v(n) - V_n(s, n)]² },   (5.4.10)
where V_n(s, n) is the nth component of the noiseless vector into which the primary scalar state s is transformed and A is the coefficient in front of "exp" in (4.5.14).

Next, assume B1 and
C1. The transformed state is a time process observed in the interval <t_a, t_b>;
C2. The noise is a realization of a time-continuous, base-band, stationary gaussian process, with a spectral density that is uniform in the frequency band in which all spectral components of the transformed state process lie; we denote by S_z the power spectral density of this process.
The process representing the noise was considered in Section 5.2.2. Using equation (5.2.52), giving the probability density of a realization of the noise, we get
p[v(·)|s] = p_z{v(·) - V_cn[s, (·)]} = A exp{ -(1/2S_z) ∫_(t_a)^(t_b) [v(t) - V_cn(s, t)]² dt }.   (5.4.11)
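A sketch of how (5.4.10) is evaluated, assuming NumPy. The noiseless transformation V_n(s, n) = s·g(n), with g(n) a known waveform, is a hypothetical illustration; the logarithm of (5.4.10) is computed for several candidate values of s, the constant A being irrelevant when candidates are compared.

import numpy as np

rng = np.random.default_rng(2)

N = 64
n = np.arange(N)
g = np.sin(2 * np.pi * n / 16)     # hypothetical known waveform, V_n(s, n) = s * g(n)
sigma_z = 0.8                      # noise standard deviation (illustrative)

s_true = 1.5
v = s_true * g + sigma_z * rng.standard_normal(N)   # observed transformed state (5.4.5a)

def log_p_v_given_s(v, s):
    # Logarithm of (5.4.10) up to the additive constant log A.
    return -np.sum((v - s * g) ** 2) / (2 * sigma_z ** 2)

for s in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(s, round(float(log_p_v_given_s(v, s)), 2))
# The candidate closest to the primary state s_true typically gives the largest value.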
Equation (5.4.11) is the time-continuous counterpart of equation (5.4.10). A typical example of the considered indeterministic transformation is the transformation performed in the communication system considered in Section 2.1.1 (see Figures 2.1 and 2.2). Then the primary state has the meaning of the primary information, and the noiseless process is the noiseless signal. Typical examples of transformations producing such a noiseless process are:
V_cn(s, t) = s A_en(t) cos(ω_c t + φ),  t ∈ <t_a, t_b>,   (5.4.12)
V_cn(s, t) = A_en(t) cos[(ω_c + a_f s)t + φ],  t ∈ <t_a, t_b>,   (5.4.13)
where A_en(t), satisfying
A_en(t) = 0 for t outside the interval <t_a, t_b>,   (5.4.14)
is a pulse-type envelope as shown in Figure 2.1 and Figure 2.2, ω_c is the angular carrier frequency, φ the phase shift, and a_f a scaling constant. The process (5.4.12)
is the output of the communication channel produced by an amplitude-modulated signal put into the channel, while the process (5.4.13) is produced by a frequency-modulated signal put into the channel.

5.4.3 CALCULATION OF PROBABILITY DISTRIBUTION OF THE TRANSFORMED STATE WHEN NOISELESS COMPONENTS DEPEND ON UNKNOWN PARAMETERS
We now assume that the side parameters determining the noiseless process are indeterministic but exhibit statistical regularities, and that the probability distribution describing the regularities is known. To simplify the argument we assume first that
A1. The primary and transformed states are 1-DIM;
A2. Only one parameter b is unknown;
A3. On the condition that the primary state is fixed, the noise and the unknown parameter can be considered as realizations of the random variables z and b;
A4. The random variables z and b are statistically independent.
Based on these assumptions,
v = V_n(s, b) + z.   (5.4.15)
Our task is to calculate the density of conditional probability p(v|s). We take into account the effect of the indeterminate parameter using equation (4.4.8d) for marginal probability. We write it in the form
p(v|s) = ∫_B p(v, b|s) db,   (5.4.16)
where B is the set of potential values of the unknown parameter b. From equation (4.4.7b) for the density of conditional probability we get
p(v, b|s) = p(v|b|s) p(b|s).   (5.4.17)
However,
p(v|b|s) = p(v|b, s).   (5.4.18)
On the condition that the primary state s and the parameter b are fixed, and on assumption A4, we have the same situation as in the previously considered case when the noiseless component is determined, and we can use equation (5.4.8). It takes the form
p(v|b, s) = p_z[v - V_n(s, b)].   (5.4.19)
Substituting (5.4.17) to (5.4.19) in (5.4.16) we get the probability distribution we were looking for:
p(v|s) = ∫_B p_z[v - V_n(s, b)] p(b|s) db.   (5.4.20)
Since the assumption that the considered states and the indeterminate parameter are scalars played only a formal role, the generalization of equation (5.4.20) for structured states and indeterministic parameters is straightforward. To illustrate such a generalization we again adopt assumptions C1 and C2 (page 237) and in addition we assume
D4. The noiseless process is given by (5.4.12), but the phase φ is indeterminate; we denote it now as b.
Thus, we write the noiseless process in the form
V_cn(s, b, t) = s A_en(t) cos(ω_c t + b),  t ∈ <t_a, t_b>.   (5.4.21)
We assume that the phase b is a realization of a random variable distributed uniformly over the interval <0, 2π>:
p(b|s) = 1/2π,  b ∈ <0, 2π>.   (5.4.22)
The counterpart of equation (5.4.20) is
p[v(·)|s] = ∫_0^(2π) p_z{v(·) - V_cn[s, b, (·)]} p(b|s) db.   (5.4.23)
Using equations (5.4.11), (5.4.22), and (5.4.23) we finally get
p[v(·)|s] = A ∫_0^(2π) exp{ -(1/2S_z) ∫_(t_a)^(t_b) [v(t) - s A_en(t) cos(ω_c t + b)]² dt } db.   (5.4.24)
This integral can be calculated in a closed form (see, e.g., Gradshteyn, Ryzhik [5.19]) and we obtain
p[v(·)|s] = const · exp{ -s² E[A_en(·)]/2S_z } I_0{ s r[v(·)]/S_z },   (5.4.25a)
where
E[A_en(·)] = ½ ∫_(t_a)^(t_b) A_en²(t) dt   (5.4.25b)
has the meaning of the energy of the noiseless process when s = 1,
r[v(·)] = √( c_c²[v(·)] + c_s²[v(·)] ),   (5.4.25c)
c_c[v(·)] = ∫_(t_a)^(t_b) v(t) A_en(t) cos ω_c t dt,   (5.4.25d)
c_s[v(·)] = ∫_(t_a)^(t_b) v(t) A_en(t) sin ω_c t dt,   (5.4.25e)
and I_0(w) is the Bessel function of imaginary argument of order 0.

COMMENT
We present this lengthy equation for two purposes. First, it is used in Chapter 8. Second, the equation illustrates how, in the calculation of the conditional probability, the exact information about the structure of the noiseless process is utilized and the unknown side parameters are eliminated. The observer located at the input of the channel (as the observer shown in Figure 1.13) knows the envelope A_en(·) and the carrier angular frequency ω_c. The indeterministic phase shift b that influences the noiseless process is not known to the observer, but its effect on the probability distribution p[v(·)|s] is eliminated by integration (see equation (5.4.24)).
However, the performance of an optimized system based on a noiseless signal depending on an indeterministic phase shift is worse than that of the optimized system utilizing exact information about the phase. We discuss this effect in Section 8.4.2. We presented here only the general method of calculating the conditional probability and gave its simple applications. As is shown in Section 8.3, the considered conditional probability determines the structure of the optimal subsystems recovering distorted information, particularly of optimal receivers in communication systems. Therefore, several examples of application of the general method described here can be found in publications on optimization of signal reception (see, e.g., Proakis [5.20]).

5.4.4 THE ROUGH DESCRIPTIONS OF THE TRANSFORMATIONS PERFORMED BY A COMMUNICATION CHANNEL
The set of the previously considered conditional probability distributions p(v|s), s ∈ S, where S is the set of potential forms of the primary state, provides the exact description of an indeterministic state transformation. Usually such an exact description is quite complicated, and therefore simplified descriptions of an indeterministic transformation are of great interest. When the transformation has the meaning of an operation performed by a communication channel, then rough descriptions of the conditional probabilities that characterize the quality of decisions about the actions in the environment based on the transformed state are useful. When this state can be represented as the sum of a noiseless component and noise, then we can characterize the transformation performed by the channel by an indicator of the relative magnitude of those components. It is called the signal/noise ratio. We show in the following chapters that indicators of this type appear "automatically" when the quality of decisions based on the state produced by an indeterminate transformation is analyzed. However, in general we cannot represent a transformed state as a sum of a noiseless component and noise, and even if we can, there are in general no justified a priori indications of how to define the "magnitude" of the components.

Here we consider the channel capacity. It is a rough description of the set of conditional probability distributions based on the amount of statistical information defined in Sections 5.1.2 and 5.1.3. We cannot use directly the amount of statistical information I(V:S) defined by (5.1.21) or (5.1.27), because it depends not only on the conditional probabilities characterizing the channel but also on the statistical properties of the primary state, which are not a characteristic of the considered transformation. Thus, to define on the basis of I(V:S) a rough description of the statistical properties of an indeterministic transformation, we must remove the dependence of I(V:S) on the statistical properties of the random variable S. The concept of operations removing the dependence on detail was introduced previously (see Section 1.6.1) and is discussed in detail in Section 8.1.2. Here we take as the dependence-removing operation the operation of finding the maximum of I(V:S) with respect to all probability distributions P_s describing the random variable S representing the primary state (the state at the input of the system performing the considered indeterminate transformation).
In general, some constraints are imposed on the primary state. For example, if it is a vector, it is often required that a component of it must not surpass a fixed value. The constraints determine the set S_P of admissible probability distributions P_s. Thus, as a characteristic of an indeterminate transformation described by the set of conditional probability distributions p(v|s), s ∈ S, we take
C = max_(P_s ∈ S_P) I(V:S).   (5.4.26)
The calculation of capacity simplifies when the primary and transformed states are N-DIM vectors and v = s + z; thus, V = S + Z. Then from equation (5.1.21) we have
I(V:S) = H(V) - H(V|S).   (5.4.27)
On the additional assumption that the random variable Z representing the noise is statistically independent of the primary process, we have
H(V|S) = H(Z).   (5.4.28)
From equations (5.4.26) to (5.4.28) it follows that
C(N) = max_(P_s ∈ S_P) H(V) - H(Z).   (5.4.29)
Thus, on the simplifying assumptions, finding the channel capacity reduces to maximizing the entropy of the transformed state by a proper choice of the statistical properties of the primary state. We illustrate this with simple but important special cases of channels described by the following assumptions:
C1. The components of the state vector are binary (0, 1); the components of the noise are statistically independent and have the same probability distribution; a component of the primary state and the corresponding component of the transformed state are different when the noise component is 1 (see equation (2.1.20)). We call the probability P[z(n)=1] the error probability and denote it as P_e. A communication channel performing the described transformation is called a binary, memoryless, symmetric channel.
C2. The components of the vector states are continuous; the mean-square value of a component of the primary state E s²(n) = σ_s² = const; the components of the noise are represented by gaussian independent random variables with E z(n) = 0 and the mean-square value E z²(n) = σ_z² = const. A communication channel performing such a transformation is called a discrete, memoryless, gaussian channel.
C3. The primary process is a time-continuous process of duration T; B is the highest frequency of its harmonic components; its average power
σ_s² = (1/T) ∫_0^T E s_c²(t) dt   (5.4.30)
is fixed; the time-continuous noise is a base-band gaussian process considered in Section 5.2.2, with a uniform spectral density that in the frequency range <0, B> has the constant value S_z. A communication channel performing the described transformation is called a continuous, memoryless, gaussian channel.
We now give equations for the capacity of those channels. Figure 5.2 shows that the maximum value of the entropy of a binary random variable is 1, and this maximum value is achieved when the variable has a uniform probability distribution. It can be shown that for the assumed noise we can achieve this probability distribution by a proper choice of the probability distribution of the components of the primary state. Thus, max H(V) = 1. The entropy H(Z) we get from (5.1.7). Thus, the capacity of a binary, memoryless, symmetric channel is
C(N) = N[1 + P_e log₂P_e + (1-P_e) log₂(1-P_e)].   (5.4.31)
The diagram of the capacity per component
C₁ ≜ C(N)/N   (5.4.32)
as a function of the error probability P_e is shown in Figure 5.10a. For the discrete, memoryless, gaussian channel C2 the procedure is similar. It can be shown (see, e.g., Cover, Thomas [5.4]) that for fixed variance the entropy H(V) is maximized when the component s(n) of the primary state is a gaussian variable. Then, in view of property (4.5.22), the component v(n) is gaussian too, and its variance is σ_v² = σ_s² + σ_z². Using (5.1.33) we get
C₁ ≜ C(N)/N = ½ log₂(σ_v²/σ_z²) = ½ log₂(1 + σ_s²/σ_z²).   (5.4.33)
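To make the two formulas concrete, the short Python sketch below evaluates the per-component capacity (5.4.32)-(5.4.33) for the binary symmetric channel and for the discrete gaussian channel. The function names are ours, introduced only for illustration.

```python
import math

def bsc_capacity_per_component(p_e: float) -> float:
    """Capacity (5.4.31)/(5.4.32) of the binary, memoryless, symmetric
    channel per component: 1 + P_e*log2(P_e) + (1-P_e)*log2(1-P_e)."""
    if p_e in (0.0, 1.0):          # the entropy term vanishes at the endpoints
        return 1.0
    return 1.0 + p_e * math.log2(p_e) + (1.0 - p_e) * math.log2(1.0 - p_e)

def gaussian_capacity_per_component(snr: float) -> float:
    """Capacity (5.4.33) of the discrete, memoryless, gaussian channel
    per component, with snr = sigma_s**2 / sigma_z**2."""
    return 0.5 * math.log2(1.0 + snr)

print(bsc_capacity_per_component(0.1))       # about 0.531 bit per component
print(gaussian_capacity_per_component(15.0)) # 2.0 bits per component
```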
The ratio σ_s²/σ_z² has the meaning of the signal-to-noise ratio. Thus, in the considered case, the channel capacity per component is in a unique way related to the signal/noise ratio. In Section 7.4.3 the conclusion (7.4.49) is derived, saying that a base-band time-continuous process observed in a time interval of duration T can be almost accurately represented by a set of
N = 2BT   (5.4.34)
samples taken with period 1/(2B), where B is the highest frequency of a harmonic component of the process. The representation becomes exact when BT → ∞. It can easily be shown that the amount of statistical information is not changed if a reversible transformation of a state is performed. Therefore, to calculate for large BT the capacity of channel C3, we can use equation (5.4.33), however, after expressing the parameters characterizing channel C2 by the parameters characterizing channel C3. From equation (7.4.62) it follows that we have to set
σ_z² = 2BS_z.   (5.4.35)
Substituting (5.4.34) and (5.4.35) in (5.4.33), we obtain the capacity of channel C3
C(T) = BT log₂(1 + σ_u²/(2BS_z)).   (5.4.36)
Similarly to the channel capacity C₁ per elementary component, we introduce the channel capacity per time unit
C ≜ C(T)/T.   (5.4.37)
From equation (5.4.36) we see that the channel capacity depends strongly on the bandwidth of the noiseless signals. To get insight into this dependence, we substitute equation (5.4.36) in definition (5.4.37) and write the result in the form
C = B log₂(1 + B₀/B),   (5.4.38)
where
B₀ = σ_u²/(2S_z).   (5.4.39)
In view of (5.4.35), B₀ has the meaning of such a bandwidth that the noise power is equal to the noiseless-signal power. Introducing the normalized bandwidth
B' = B/B₀   (5.4.40)
and the normalized channel capacity
C' = C/B₀,   (5.4.41)
we write equation (5.4.38) in the form
C' = B' log₂(1 + 1/B').   (5.4.42)
The diagram of C' versus B' is shown in Figure 5.10b.
Figure 5.10. Dependence of channel capacity on the parameters determining the conditional probability distribution of a transformed state: (a) dependence (5.4.31) of the normalized capacity of the binary, memoryless, symmetric channel on the binary error probability P_e; (b) dependence (5.4.42) of the normalized capacity C' of a time-continuous gaussian channel with noise of constant spectral density on the normalized bandwidth B'.
The diagram shows that C', after initial growth, stabilizes at the asymptotic value
C'_∞ = lim_{B'→∞} C' = log₂e.   (5.4.43)
Using equations (5.4.30) and (5.4.36), we get the corresponding asymptotic value of the capacity C(T) given by equation (5.4.36)
C_∞ = lim_{BT→∞} C(T) = (E/(2S_z)) log₂e,   (5.4.45)
where
E = Tσ_u²   (5.4.46)
is the average energy of the process.
COMMENT 1
From equations (5.4.36) to (5.4.43) and from Figure 5.10b it follows that the capacity C(T) initially grows almost linearly with the bandwidth (thus, with the dimensionality) of the noiseless process, but for large BT the capacity C(T) approaches the asymptotic value C_∞ that depends only on the average energy of the noiseless signal. Conclusion (7.4.26) says that 1/T is approximately the minimum bandwidth of a process of duration T. Therefore, the product BT has the meaning of the surplus of the bandwidth of a process over the smallest possible bandwidth and can be interpreted as an indicator of the degree of fine structuring of a noiseless signal. In consequence, the interpretation of the discussed dependence of the capacity on the product BT is that when the degree of structuring of the noiseless signals is small, then by
increasing it, the channel capacity can be increased without increasing the power of the noiseless signals or decreasing the power spectral density of the noise. However, if the degree of structuring of the noiseless signals is already large, increasing it has no further effect and the capacity depends only on the energy of the noiseless signals.
COMMENT 2
We defined the capacity without taking into account the properties of the superior system. Therefore, at this stage of the considerations we cannot say if it is a useful indicator of the performance of information systems. However, in Sections 8.4.3 and 8.6.1 we discuss the coding theorems, which state that the performance of an optimized communication system depends in a crucial way on the channel capacity. Other concrete applications of capacity are presented in Section 8.5.4. We gave here only a concise introduction to the concept of channel capacity. For more details see Abramson [5.3], Cover, Thomas [5.4], Blahut [5.5].
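The saturation described in Comment 1 is easy to verify numerically. The short Python sketch below, with illustrative parameter values of our own choosing, evaluates C(T) from (5.4.36) for increasing bandwidth and compares it with the asymptote (E/(2S_z)) log₂e.

```python
import math

T = 1.0          # observation time [s]          (illustrative value)
sigma_u2 = 1.0   # average signal power          (illustrative value)
S_z = 0.01       # noise spectral density        (illustrative value)

E = T * sigma_u2                             # average energy (5.4.46)
C_inf = E / (2.0 * S_z) * math.log2(math.e)  # asymptotic capacity (5.4.45)

def capacity(B: float) -> float:
    """Capacity (5.4.36) of the continuous, memoryless, gaussian channel."""
    return B * T * math.log2(1.0 + sigma_u2 / (2.0 * B * S_z))

for B in (1.0, 10.0, 100.0, 1000.0):
    print(f"B = {B:7.1f} Hz   C(T) = {capacity(B):8.2f} bits   C_inf = {C_inf:.2f}")
```

For small B the capacity grows roughly linearly with the bandwidth; for large B it levels off near C_inf, as stated in Comment 1.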
5.5 THE MODELS OF INDETERMINISM OF A STATE RELATIVE TO AVAILABLE INFORMATION
In Chapters 2 and 3, and in this chapter, we presented an in principle "impartial" description of states and systems, without taking into account the purposeful activities of a superior system. Therefore, we have not specified the meaning of the transformed state. However, according to the general definition (1.1.1), information has the meaning of a transformation of the state. Thus, the information source can be interpreted as the subsystem transforming the state of the environment into information. The source may be a primary source (described in Section) or secondary, such as the output of a communication channel or an information-compression subsystem. Therefore, our previous considerations about states can be directly used for the analysis of information sources from the point of view of an observer B_s having direct access to the state s of the environment and looking for the information produced by an information-generating transformation (see Figure 5.11).
Figure 5.11. Illustration of the definition of the observers B_s and B_x.
Here we discuss the basic problem that arises for an observer B_x interested in the state of the environment but having access only to information x about the state s. The situation of such an observer is opposite to the situation of the observer B_s. The problem is that the information-generating transformation is practically always irreversible (deterministic or indeterministic). In other words, the available information is practically never accurate. This means that knowing the information we cannot determine exactly the concrete state that produced the information.
However, we can in general determine the set of the potential forms of the state which may have generated the available information. As has been explained in Section 1.4, knowing such a set (the better, the statistical weights, if they exist), we may substantially improve the performance of purposeful actions performed in the environment. This conclusion applies not only to concrete states (internal, external) but also to meta states (sets of potential states). We usually do not have direct access to a meta state but do have some information about it. If the information about the meta state is not exact, the knowledge of the set of potential forms of the meta state, thus the meta meta state, is useful. We now introduce a hierarchy of states and meta states and the corresponding hierarchy of information about them. If the transformation is reversible, thus, if the available information x is exact, we say that relative to the available information x the state s is determined (the deterministic model of the state can be applied). If the information-generating transformation is irreversible (the information is inexact) but for a fixed information x the potential states exhibit statistical regularities, a probabilistic weight can be associated with each potential state. Thus, a statistical state may be defined. To simplify our reasoning, we assume further:
A1. Both the concrete state and the information about it are vectors;
A2. The concrete states exhibit statistical regularities that are described by the probability density p(s);
A3. For a fixed state (the point of view of observer B_s) the information, generated by an indeterministic transformation, exhibits statistical regularities, which are described by the density of probability p(x|s).
Then, for a given information x (the point of view of observer B_x), the statistical state is described by the density of conditional probability p(s|x). Using the generalized formula (4.4.8) we calculate this density of probability from the probability distributions mentioned in A2 and A3:
p(s|x) = (1/c) p(x|s) p(s),   c = ∫···∫ p(x|s) p(s) ds.   (5.5.1)
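As a numerical illustration of (5.5.1), the sketch below, a toy example of our own with a discretized scalar state, computes the conditional density p(s|x) from an assumed prior p(s) and an assumed observation model p(x|s) by evaluating the normalizing constant c on a grid.

```python
import numpy as np

# Discretized scalar state s and a fixed observed information x (toy example).
s = np.linspace(-5.0, 5.0, 1001)
x = 1.2

prior = np.exp(-0.5 * s**2) / np.sqrt(2.0 * np.pi)     # p(s): standard gaussian (assumed)
sigma = 0.5                                            # noise standard deviation (assumed)
likelihood = (np.exp(-0.5 * ((x - s) / sigma) ** 2)
              / (sigma * np.sqrt(2.0 * np.pi)))        # p(x|s): x = s + gaussian noise (assumed)

c = np.trapz(likelihood * prior, s)                    # normalizing constant in (5.5.1)
posterior = likelihood * prior / c                     # p(s|x)

mean = np.trapz(s * posterior, s)
var = np.trapz((s - mean) ** 2 * posterior, s)
print(mean, var)   # about 0.96 and 0.20 for this prior and noise level
```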
The set
S_STAT^(1) ≜ {p(s|x); s ∈ S, x ∈ X}   (5.5.2)
of those densities of probability is called the conditional (relative to the available inexact information x) statistical state of order 1. If, for a given inexact information x, the potential forms of the concrete states do not exhibit statistical regularities, or, if they do, we do not take them into account, we cannot associate a probabilistic weight with a potential state. Then we say that the statistical regularities cannot be used. In such a case, we may improve the quality of purposeful actions if we know some other than statistical properties of the set of potential forms of the concrete state. The fundamental property of this type is the set of rules of belonging to the set of potential forms (membership rules; see Section 4.7). There have also been many attempts to enrich the description of the set of potential forms of a state by associating with every potential state a nonstatistical weight. Typical examples of such weights are:
1. Likelihood; this is the conditional probability p(x|s) considered as a function of the condition s; such a weight function is widely used in statistics when the probability density p(s) mentioned in A2, page 245 (called in statistics the a priori probability) is not available; then formula (5.5.1) cannot be used (see Larsen, Marx [5.7], Edwards [5.21]);
2. The Zadeh μ function, used in fuzzy sets theory (see, e.g., Klir, Folger [5.22], Zimmermann [5.23]);
3. Expert belief functions (see Shafer, Pearl [5.24, ch. 5], particularly the tutorial);
4. Shafer-Dempster evidence weights (see Shafer, Pearl [5.24, ch. 3], particularly the tutorial).
The fundamental difficulty with the nonprobabilistic weights is that, unlike the statistical weights, they do not have an objective character. One of the consequences of this is that there are no systematic methods for experimentally determining the values of those weights. Worse, there are no systematic rules for calculating the weights of secondary states obtained by some transformations from primary states with nonstatistical weights. Therefore, many incoherent, ad hoc rules must be introduced (see, e.g., the publications mentioned in 2, 3, and 4). The discussed properties of the potential forms of a concrete state when the statistical regularities cannot be used are called the nonstatistical meta state of order 1 and are denoted as S_NSTAT^(1).
Let us return to the situation when the potential states exhibit statistical regularities, thus, when the conditional statistical state S_STAT^(1) exists. With this state we have a situation analogous to that with the concrete state s. The state S_STAT^(1) is usually not directly available, and we have about it only some information
X_STAT = T(S_STAT^(1)),   (5.5.3)
where T(·) denotes the transformation generating the information⁶. We call X_STAT statistical information of order 1. Statistical information of order 1 can be obtained during a training cycle as described in Section 1.7.2 (see Figure 1.25). A typical example of the statistical information of order 1 about the probability density p(r|u) describing the statistical properties of the channel is the train given by equation (1.7.3)
R(CH) = {(y_u(j), r(j)), j = 1, 2, · · · , J},
where y_u(j) is exact information about the process u available at the channel input and r(j) is the process at the channel output, obtained during the training cycle. Several other examples are given in the forthcoming chapters. The statistical information X_STAT may be exact or not. If it is, then we can determine the statistical regularities S_STAT^(1) exactly. We say then that the statistical state is determined or, in classical terminology, that the Bayes model can be applied. If the statistical information X_STAT is not exact, we have a situation similar to the case when the information x about a concrete state s is not exact. If S_STAT^(1) exhibits statistical regularities, then the statistical state S_STAT^(2) of order 2 exists. If the statistical state does not exhibit statistical regularities or they cannot be used, we have to consider the nonstatistical meta state S_NSTAT^(2) of second order.
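As a minimal illustration of statistical information of order 1 obtained from such a training train, the sketch below, a toy binary channel of our own construction, estimates the channel's error probability from the pairs (y_u(j), r(j)) collected during a training cycle.

```python
import random

random.seed(0)
P_ERR_TRUE = 0.1          # "true" channel error probability (assumed, for simulation only)
J = 10_000                # length of the training train

# Training cycle: known channel input y_u(j) and observed channel output r(j).
train = []
for _ in range(J):
    y = random.randint(0, 1)
    r = y ^ (1 if random.random() < P_ERR_TRUE else 0)   # binary symmetric channel model
    train.append((y, r))

# Statistical information of order 1: the relative frequency of y != r
# estimates the conditional probability describing the channel.
p_err_est = sum(1 for y, r in train if y != r) / J
print(f"estimated error probability: {p_err_est:.3f}")    # close to 0.1
```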
Introducing two binary features: • the information about the state is exact or nonexact, and • the statistical information can or cannot be used, we can illustrate our considerations by the first two layers of the tree shown in Figure 5.12. The further extension of our reasoning (and of the tree) is evident. Starting with the third layer we have a specific situation. Having exact information about the statistical state of order 2 and using the equation (4.4.8d) for marginal probabilities, we can obtain exact information about the statistical state of order 1 (this is indicated by an arrow in the diagram). Thus, if the statistical state of any order is determined, we can apply the Bayes model.
[Figure 5.12 diagram: a tree whose nodes include "exact info about concrete state (deterministic model)", "exact info about statistical state (Bayes model)", "statistical state of first order", "non-statistical meta state of first order", and the non-statistical weights of potential states: likelihood, fuzzy sets μ function, expert belief functions, Shafer-Dempster weights.]
Figure 5.12. The fundamental types of relationships between the available nonexact information and the state that this information pertains to (models of indeterminism): STATISTICAL REGULARITIES USED/NOT USED: statistical regularities exist and are used, or they do not exist or are not used; EXACT/INEXACT: the available information about the state (shown as two concentric circles) is exact or inexact.
Similarly, as in the case of statistical states, it is possible that only nonexact information about the nonstatistical meta states is available. For example, often the potential forms of a state, or even their number, is not known exactly. Although some of the situations illustrated in Figure 5.12, particularly situations when exact information about nonstatistical meta states is not available, seem to be quite exotic, they occur often in practice and are assumed tacitly in many publications. We illustrate these considerations with a simple example.
EXAMPLE 5.5.1 TWO-LEVEL META STATES AND META INFORMATION
We assume that:
A1. The primary state is a scalar s, the set of its potential forms S = ⟨-∞, ∞⟩;
A2. The statistical state S_STAT^(1) of order 1 is described⁷ by the gaussian density of probability
p(s|x,a) = (1/√(2πσ²)) e^(-(s-x-a)²/(2σ²));   (5.5.4)
thus,
S_STAT^(1) = {p(s|x,a); s ∈ S, x ∈ X};   (5.5.5)
A3. The information X_STAT consists of (1) information X_TYP about the type of the probability distribution, (2) information X_a about the average value, and (3) information X_σ about the variance; thus,
X_STAT = {X_TYP, X_a, X_σ};   (5.5.6)
A4. The information components X_TYP and X_σ are exact; however, X_a is not exact; therefore, the information X_STAT is nonexact information about the statistical state S_STAT^(1);
A5. The indeterminate parameter a exhibits statistical regularities; they are described by the gaussian density of probability
p_a(a) = (1/√(2πσ_a²)) e^(-(a-ā)²/(2σ_a²));   (5.5.7)
thus, the statistical state of order 2 is
S_STAT^(2) = {p_a(a); a ∈ ⟨-∞, ∞⟩};   (5.5.8)
A6. Exact information X_STAT^(2) about S_STAT^(2) is available.
Looking at Figure 5.12, we see that the state of the system relative to the available meta information is represented by the point 3.1. The statistical information about the primary state is described by the density of probability p*(s|x, X_STAT^(1), X_STAT^(2)). From the marginal probability equation (4.4.8d) we obtain
p*(s|x, X_STAT^(1), X_STAT^(2)) = ∫_{-∞}^{∞} p(s|x,a) p_a(a) da.   (5.5.9)
Substituting (5.5.4) and (5.5.7), after some algebra we get
p*(s|x, X_STAT^(1), X_STAT^(2)) = (1/√(2π(σ*)²)) e^(-(s-x-ā)²/(2(σ*)²)),   (5.5.10)
where
(σ*)² = σ² + σ_a².   (5.5.11) □
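The variance addition (5.5.11) is easy to check numerically. The sketch below is our own verification, with arbitrarily chosen parameter values and with the densities (5.5.4) and (5.5.7) as reconstructed above; it evaluates the integral (5.5.9) on a grid and compares the variance of the resulting density with σ² + σ_a².

```python
import numpy as np

x, sigma, sigma_a, a_bar = 1.0, 0.8, 0.6, 0.3   # arbitrarily chosen values

def gauss(t, mean, std):
    return np.exp(-0.5 * ((t - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

s = np.linspace(-8.0, 8.0, 1201)
a = np.linspace(-8.0, 8.0, 1201)
S, A = np.meshgrid(s, a, indexing="ij")

# Integral (5.5.9): p*(s|x, ...) = integral over a of p(s|x,a) * p_a(a).
p_star = np.trapz(gauss(S, x + A, sigma) * gauss(A, a_bar, sigma_a), a, axis=1)

mean = np.trapz(s * p_star, s)
var = np.trapz((s - mean) ** 2 * p_star, s)
print(mean, var, sigma**2 + sigma_a**2)   # mean about x + a_bar, var about sigma^2 + sigma_a^2
```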
COMMENT
The variance is an indicator of the indeterminism of the state relative to the available information. If the information X_a were exact, this variance would be σ², while it is (σ*)² when the information X_a is inexact. Thus, (σ*)² - σ² = σ_a² is an indicator of the price that we pay for not having exact information about the statistical properties of the state. A state information subsystem, like the state information subsystems described in Section 2.2, could provide such information. In particular, if the parameter a is a stable parameter, an adaptive information system described in Section 1.7.2 and shown in Figure 1.26 could decrease the variance σ_a² and thus move the point representing the status of relative indeterminism of the primary state s from the third layer in Figure 5.12 into the second layer, representing less indeterminate situations.
NOTES
1. The entropy, similarly to the statistical average, is a number assigned to the random variable. Therefore, similarly to the calculation of the average, we interpret the calculation of entropy as an operation and we denote it with a special enlarged character.
2. We used a similar argument earlier to define the joint frequency of occurrences by (4.1.13); see also footnote 1.
3. To explain that this does not contradict the causality principle, let us assume first that the frequencies of occurrences of a state and of a state occurring later are statistically related. As has been indicated in Section 1.7.2, page 63, we can use the frequencies of occurrences described in Section 4.1.2 only ex post, after completing the observation. Thus, no conflict with the causality principle arises. A statistical relationship between a state at present and a state in the future, based on probabilities, can be used only on the assumption that the frequencies of occurrences of states in the future will fluctuate around probabilities estimated in the past. However, such an assumption cannot be strictly proved a priori. Therefore, the bilateral statistical relationship between a state in the future and a present state based on probabilities has a hypothetical character and does not violate the deterministic principle of causality holding in every concrete case.
4. If we would take only one component in (5.2.21), we would obtain a Markovian process (see the next section) for which the shaping and decorrelating matrices are rather untypical. Therefore, we take two components.
5. To simplify the notation we start the numbering of the potential forms of states not with 1 but with 0.
6. The transformation T(·) may be deterministic or not. Equation (5.5.3) describes a deterministic transformation. If it is indeterministic then, as in (1.5.4), we have to introduce the indeterminate side states.
7. It can easily be seen that the assumed conditional probability distribution is the conditional probability of the state when (1) the information x = s + z, (2) the state s has a gaussian probability density, and (3) the noise z has a gaussian probability density with the mean value proportional to a. Thus, the auxiliary parameter a has the meaning of the mean value of the noise z.
REFERENCES
[5.1] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, NY, 1991.
[5.2] Breiman, L., Probability, SIAM Publications, Philadelphia, 1995.
[5.3] Abramson, N., Information Theory and Coding, McGraw-Hill, NY, 1963.
[5.4] Cover, T. M., Thomas, J. A., Elements of Information Theory, J. Wiley, NY, 1991.
[5.5] Blahut, R. E., Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1990.
[5.6] Seidler, J. A., Bounds on the Mean-Square Error and the Quality of Domain Decisions Based on Mutual Information, IEEE Trans. on Info. Theory, vol. IT-17, 1971, pp. 655-665.
[5.7] Larsen, R. J., Marx, M. L., An Introduction to Mathematical Statistics and Its Applications, 2nd ed., Prentice Hall, Englewood Cliffs, 1986.
[5.8] Behara, M., Additive and Nonadditive Measures of Entropy, J. Wiley, NY, 1990.
[5.9] Helstrom, C. W., Probability and Stochastic Processes for Engineers, MacMillan, NY, 1984.
[5.10] Parzen, E., Stochastic Processes, Holden Day, San Francisco, 1962.
[5.11] Shanmugan, K. S., Breipohl, A. M., Random Signals, J. Wiley, NY, 1988.
[5.12] Blanc-Lapierre, A., Fortet, R., Theory of Random Functions, vols. 1, 2, Gordon and Breach, NY, 1967.
[5.13] Kleinrock, L., Queueing Systems, vols. 1, 2, J. Wiley, NY, 1975.
[5.14] Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T., Numerical Recipes, Cambridge University Press, Cambridge, 1992.
[5.15] Matlab, User's Guide, The MathWorks, Inc., Natick, MA, 1991; Matlab is a registered trademark of The MathWorks, Inc., Natick, MA.
[5.16] Wolfram, S., Mathematica, Addison-Wesley, Redwood City, CA, 1991.
[5.17] Dagpunar, J., Principles of Random Variate Generation, Clarendon Press, Oxford, 1988.
[5.18] Niederreiter, H., Random Number Generation and Quasi-Monte Carlo Methods, SIAM Publications, Philadelphia, 1992.
[5.19] Gradshteyn, I. S., Ryzhik, I. M., Tables of Integrals, Series, and Products, Academic Press, NY, 1965.
[5.20] Proakis, J. G., Digital Communications, 2nd ed., McGraw-Hill, NY, 1989.
[5.21] Edwards, A. W. F., Likelihood, Cambridge University Press, Cambridge, 1972.
[5.22] Klir, G. J., Folger, T. A., Fuzzy Sets, Uncertainty, and Information, Prentice Hall, NY, 1988.
[5.23] Zimmermann, H. J., Fuzzy Set Theory, Kluwer Academic Publishers, Boston, 1991.
[5.24] Shafer, G., Pearl, J., Readings in Uncertain Reasoning, Morgan Kaufmann Publ., San Mateo, CA, 1990.
LOSSLESS COMPRESSION OF INFORMATION
This chapter begins the third part of the book, which is devoted to the analysis and synthesis of information transformations. This and the next chapter consider transformations that compress the volume of information. Compression of volume is one of the most important information transformations. We understand it intuitively as a preliminary transformation of structured information that allows us to reduce the resources necessary for subsequent processing of the information. An introduction to the problems of information compression has been presented in Section 1.5.4; see, in particular, Figure 1.22. This and the next chapter have two objectives. The first is to describe the basic transformations compressing the volume of information and to furnish a solid basis for understanding and designing data, process, and image compression systems. The second objective of the two chapters is to illustrate the general concepts introduced in the second part of the book, particularly the summarizing considerations in the closing section of Chapter 5. This chapter considers transformations that compress the volume of information so that it is possible to recover the primary information without distortions, except for some delay. For many superior systems the processing delay is irrelevant, particularly when we compress the volume of information to store it. Such a transformation is called a reversible information-compressing transformation or, in technical terminology, lossless information compression. If exact recovery is possible and the introduced delay is not taken into account, we only have to invert the transformation, and the problems of evaluation of recovery distortions and optimization of recovery rules do not arise. This greatly simplifies the analysis of lossless information compression. The central concept of this chapter is the concept of volume of information, defined as an index of the resources needed to process the information so that it can be exactly recovered. Examples of volume of information are: (a) the capacity of a storage device needed to store the information, (b) the capacity of a transmission channel needed to transmit the information, (c) the minimal computational power needed to process the information. The volume of information is not only useful in the initial stages of information-system design to estimate the costs of constructing and running the system, but it also allows us to use the available information-processing resources efficiently during the operation of the system. An example is the application of the declarations integer, real, and dimension used in most programming languages to assign the storage capacity of the computer efficiently.
Without a formal definition of the volume of information, systematic design of optimal, let alone intelligent, information-compressing systems is not possible. Not only the practical design problems justify interest in this concept. It is also important since its analysis gives much insight into the fundamental problems of information processing. Lossless compression is possible in two basic cases:
• When not every possible combination of components of structured information is a potential information; we say then that structural constraints are imposed on the information or, briefly, that the information is structurally constrained.
• When the information exhibits statistical regularities and the components, or some of their combinations, have different frequencies of occurrences; we say then that statistical constraints are imposed on the information or, briefly, that the information is statistically constrained.
The utilization of structural constraints for lossless compression of discrete information is the subject of the first two sections of this chapter. In the first section we introduce the general concepts, and in the second we use these concepts to present a systematic review of the lossless compression procedures that are most important in practice. The next two sections, devoted to lossless compression of information utilizing statistical constraints, have a similar structure. For continuous information, irreversible transformations compressing information are of greatest importance. Then the evaluation of distortions of the recovered information and the minimization of distortions become the central design problem. Therefore, only some aspects of the considerations about lossless information compression have counterparts in continuous information compression. They are discussed in the last section of this chapter. The main purpose of those considerations is to get more insight into the continuous-discrete information dilemma. Most important for applications is compression of continuous information introducing unavoidable distortions. Such compression of continuous information is the subject of the next chapter.
6.1 THE VOLUME OF DISCRETE INFORMATION AND ITS COMPRESSION
To describe, to analyze, and to compare transformations that compress information, we must introduce a sufficiently general definition of the volume of information, which can be used for discrete information, continuous information, and information exhibiting statistical regularities. This is not a straightforward task, primarily because the information-compressing subsystem is located inside the chain of subsystems performing the information processing shown in Figure 1.4a. It is preceded by the information source, and the operation of the information-compression system influences the operation of the subsequent subsystems: the fundamental information subsystem and the superior system. Therefore, a definition of the volume of information must take into account the properties of all three of those subsystems.
The information source determines the fine structure of information, the sets of potential forms of the components, the constraints on the combinations of the elementary components, and consequently the set of potential forms of the structured information. Those factors influence to a great extent the resources needed to process the information. A typical fundamental information-processing subsystem, particularly an information channel, is designed so that it can process any information having a given fine structure independently of the macro structure and, in particular, of the restrictions imposed on the potential forms that the information can take. We call such a system constraint blind. In technical terminology the ability of a system to process information having any macro structure is emphasized, and the system is called transparent. The universality of a structure-blind system makes it possible to utilize it for many purposes and consequently to keep its price low. However, if we use such a system for a specific purpose, we may not utilize all its capabilities. This causes the size of the resources needed to perform the fundamental information processing to depend not only on the properties of the information but also on the features of the fundamental information-processing subsystem. The compression of the volume of information should be such that the ultimate information delivered to a superior system is for this system as useful as the primary information. Thus, defining the volume of information we must also know which features of the information are essential for the superior system (the user). As has been discussed in Section 1.6.1, the properties of the superior system determine the indices for the accuracy of information processing. In the case of lossless compression we can, in principle, recover the primary information exactly. However, the information system may introduce some changes into the information that are relevant for the superior system. In particular, the superior system may be sensitive to delays introduced by information compression and decompression. For these reasons the definition of volume must also reflect the properties of the superior system. Summarizing, we may say that in a formal definition of the volume of information we must account for the properties of
• the concrete primary information, particularly its structure,
• the set of potential forms of the primary information and their weights,
• the fundamental information-processing subsystem, and
• the superior system.
The following subsections use these guidelines to introduce a frame of concepts on which a systematic description and analysis of discrete information-compression systems can be based.
6.1.1 THE INDICATOR OF RESOURCES NEEDED TO PROCESS STRUCTURED DISCRETE INFORMATION
We first introduce indicators of the resources needed to process concrete information. Next, indicators of the resources needed to process any potential form of the information are considered.
THE RESOURCES NEEDED TO PROCESS A CONCRETE INFORMATION
Binary information is the prototype of discrete information. We denote by u_i, i = 1, 2, the potential forms of this information, and by v(u_i) an indicator¹ of the resources needed to perform the fundamental processing (e.g., storage, transmission) of the information u_i; we call it the volume of the concrete information. In general, the resources v(u_i) depend on the processed information u_i. However, often the potential forms u_i and u_j have the same structure and we may suppose that the fundamental processing of each potential form of the discrete information requires the same resources; thus,
v(u_i) = v = const,  ∀i.
(6.1.1)
It is natural to take the resources needed to process a binary piece of information as the unit of information-processing resources and to set
v = 1.
(6.1.2)
A block u = {u(n), n = 1, 2, · · · , N} of N pieces of binary information is the prototype of structured discrete information. To define the volume of this structured information, we take into account the properties of the subsystems performing the fundamental information processing (see Figure 1.2). The two basic types of such systems are mass storage systems and communication channels. A typical storage system is a set of binary storage cells in which the binary components of the block information are stored piece by piece. To store the prototype block information, N binary storage cells are needed. The second basic type of fundamental information-processing subsystem is a transmission channel. A typical transmission channel (see Section 2.1.2) generates a set of time slots, each of which can transmit a single binary piece of information. Such a time slot is called a binary transmission channel. To transmit a block information, we need N such binary transmission channels. Thus, the resources that both fundamental information systems require to process the prototype block information are proportional to the length N of the blocks. Hence, it is justified to take this length as the indicator of the resources needed to process the block u. We write this definition in the form
v(u) = N(u),
(6.1.3)
where N(·) denotes the function assigning to a block the number of its elements. For the considered prototype block
N(u) = N.   (6.1.4)
THE RESOURCES NEEDED TO PROCESS ANY CONCRETE STRUCTURED INFORMATION
An information system has to process every potential form of information. Therefore, we are looking for an indicator characterizing the resources needed to process any potential form of information. We call it the volume of information. To indicate that this indicator characterizes not a concrete information but the set U of all potential forms of information, we denote the volume of information by V(U). The definition of V(U) should be based on the definition of the volume of a concrete information, but it should not depend on the individual features of this information. An operation transforming the set {v(u); u ∈ U} into V(U) has been called a detail-removing operation. This concept was mentioned in Section 1.6.1, page 55; its detailed analysis follows in Section 8.1. A typical example of a detail-removing operation is finding the maximum or minimum (the extreme-case criterion). When the information exhibits statistical regularities, statistical averaging is often a natural choice. The definition of volume based on statistical averaging is discussed in Sections 6.3 and 6.4. Here, using some heuristic arguments, we first define the volume V(U) of prototype sets for which the definition is plausible. Using this definition we define the volume of discrete information having any structure for two extreme types of fundamental subsystems: a system fully utilizing the structure of information and a structure-blind system. The binary set B(1) = {u_i, i = 1, 2} is the prototype of discrete sets of information. With assumptions (6.1.1) and (6.1.2) it is natural to set
V[B(1)] = 1.   (6.1.5)
The set of all blocks of N binary pieces of information (of all combinations of the N binary pieces of information) is the prototype of sets of structured information. We call this set the N-dimensional, unconstrained discrete set (briefly, the unconstrained set) and denote it as B(N). A block information belonging to this set we call unconstrained block information². We denote by L(·) the function assigning to a set the number L of its elements; thus, L(𝒜) = L for a set 𝒜. Since every combination of binary pieces of information is an element of B(N),
L[B(N)] = 2^N.   (6.1.7)
The resources necessary to perform the fundamental processing of a single block information of length N, described on the previous page, can be used directly to process another block having the same structure. Therefore, as the characteristic of the resources needed to process any block from the unconstrained set B(N), we take
V[B(N)] ≜ N.   (6.1.8)
Usually not all combinations of elementary components are potential forms of information. Such a set of potential forms is called a constrained set, and the structured information taken from such a set is called³ constrained information.
The constraints may be caused by relationships between the components of each potential form of information. However, they may also be imposed on the set as a whole. An example is the set of reserved prefix code words (see Section 2.6.4). The resources needed to process information depend also on the class of information-processing transformations. We first define an index of the minimal resources needed to process constrained information by a reversible transformation that can fully utilize both the structure of single potential forms of information and the properties of the set of all its forms. Assume first that the number of potential forms of the information can be represented in the form
L(U) = 2^N,   (6.1.9)
where N is an integer. Then
L(U) = L[B(N)],   (6.1.10)
and we can always assign to each potential form of the primary information a different block from the unconstrained set B(N), and each block obtains one partner. Such a transformation is realized by the algorithm (1.5.8) with K = 2, and the system implementing it is shown in Figure 1.15. A more detailed analysis of several fundamental information-processing subsystems (in particular, storage and communication systems) shows that:
For a given L(U) having the form (6.1.9), the resources needed to process blocks from the unconstrained set B(N) are minimal.   (6.1.11)
Therefore,
V_mi(U) = V[B(N)] = N   (6.1.12)
is an indicator of the minimal resources that are needed to process any information from the set U. We call V_mi(U) the minimal volume of information. This term, like the terms "unconstrained, constrained information", is an abbreviation, since the considered volume characterizes the set of potential forms but not a concrete information (see Note 2). In general, we cannot represent L(U) in the form (6.1.9). Then we look for the unconstrained set with the smallest dimensionality, and we define
V_mi(U) = min_m V[B(m)].   (6.1.13)
Let us denote by N the dimensionality for which the minimum in (6.1.13) is achieved. Thus, we have
V_mi(U) = V[B(N)].   (6.1.14)
To satisfy the condition that the transformation is reversible, the set B(N) must have at least as many elements as the set U. Thus, it must be
L[B(N)] ≥ L(U).
(6.1.15)
On the other hand, N is the smallest integer satisfying this condition. Taking the logarithm of both sides of (6.1.15) and using (6.1.7), we conclude that
N = ⌈log₂L(U)⌉,   (6.1.16)
where ⌈x⌉ is the smallest integer that is not smaller than x. From (6.1.12) and (6.1.16) we get
V_mi(U) = ⌈log₂L(U)⌉.   (6.1.17)
From the definition of the function ⌈x⌉ it follows that 0 ≤ ⌈x⌉ - x < 1. Therefore, for most applications log₂L(U) is a good approximation of ⌈log₂L(U)⌉. To avoid computational complications, we calculate the minimum volume from the formula
V_mi(U) = log₂L(U),
(6.1.18)
even if log₂L(U) is not an integer. Since the set U is constrained and the set B(N) is unconstrained, we may call the transformation T_cr(·) transforming information u ∈ U into a block from the set B(N) a constraint-removing transformation; hence the "cr" in the subscript. To specify this transformation, we must have exact information about the set U of all potential forms of information. However, a typical fundamental information-processing system is blind to the constraints imposed by the relationships between the components of every potential form of information and/or by the properties of the set of potential forms of the structured information, and it is unable to perform the transformation T_cr(·). To define the resources needed by such a structure-blind system, we introduce an auxiliary concept:
The set of all such combinations of elements of which the potential forms of information are built, which the fundamental information-processing subsystem does not distinguish from the really potential forms, is called the extended set and is denoted as U^(ext).   (6.1.19)
The indicator of the resources needed to perform a structure-blind information transformation of any information from the extended set we call the volume for structure-blind information transformation (briefly, the structure-blind volume) of the primary constrained set of potential forms of information, and we denote it V_sb(U). Thus,
V_sb(U) = V_mi(U^(ext)).
(6.1.20)
Since U ⊂ U^(ext), it is
V_sb(U) ≥ V_mi(U).   (6.1.21)
The cost of hiring a fundamental information-processing subsystem is usually proportional to the structure-blind volume of information. The ratio
R(U) = V_mi(U)/V_sb(U)
(6.1.22)
has the meaning of the efficiency of utilization of the resources of the structure-blind fundamental information subsystem. Therefore, we call the ratio R(U) the resources utilization indicator. As is shown in the subsequent section, this is a convenient indicator of the matching of the properties of structured information to the properties of the subsystem performing the fundamental information processing. From (6.1.21) it follows that
R(U) ≤ 1.
(6.1.23)
When R(U) = 1, a structure-blind fundamental information processing requires as many resources as the system optimally matched to the properties of the information.
The ratio
S(U) = [V_sb(U) - V_mi(U)]/V_mi(U) = [1/R(U)] - 1
(6.1.24)
has the meaning of an index of the surplus of resources of the fundamental processing subsystem needed because the information-processing transformation is structure blind. Therefore, we call this ratio the volume surplus indicator.
COMMENT
In the introductory considerations about volume, we emphasized (page 253) that, in general, the volume of information depends not only on the properties of the information but also on those of the fundamental information-processing subsystem. In our approach, we take this into account by introducing the two parameters characterizing the volume of information: (1) the minimum volume V_mi(U), characterizing the resources needed by a fundamental subsystem to process any information after the most efficient compression, and (2) the structure-blind volume V_sb(U), characterizing the resources needed by fundamental subsystems unable to exploit the structure of information. In most cases, we take as V_sb(U) the largest resources needed to process directly a potential form of the primary information. Thus, we take
V_sb(U) = max_{u∈U} v(u).
(6.1.25)
We illustrate such a procedure with a simple example.
EXAMPLE 6.1.1 VOLUME OF CONSTRAINED BLOCK INFORMATION
We assume that the set of potential forms of the information is
U = {u₁=0, u₂=100, u₃=101, u₄=1100, u₅=1110, u₆=1111, u₇=11010, u₈=11011}.
(6.1.26)
The length of a potential form of the information can take various values, and max N(u_i) = 5. Let us assume that the eventual fundamental information-processing systems have the following properties:
• They can only find out the largest length N_max of the block information;
• They process each information as if it had the length N_max; if its length is smaller, they add 0's in front of the shorter potential forms; this causes no ambiguity if all primary blocks begin with a 1;
• The transformations are blind in the sense that they cannot realize that not all blocks of length N_max are potential forms of the information.
The volume of the set U for systems blind in the defined sense is V_sb(U) = 5. The optimal constraint-removing transformation can be realized by algorithm (1.5.8). It produces the following unconstrained set:
B(3) = {v₁=000, v₂=001, v₃=010, v₄=011, v₅=100, v₆=101, v₇=110, v₈=111}.   (6.1.27)
Thus, the minimum volume V_mi(U) = 3, the resources utilization indicator R(U) = 0.6, and the volume-surplus indicator S(U) = 0.66. □
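The indicators introduced above are easy to compute. The Python sketch below (the function names are ours) reproduces the numbers of Example 6.1.1: the structure-blind volume, the minimal volume, the resources utilization indicator R, and the volume-surplus indicator S.

```python
import math

U = ["0", "100", "101", "1100", "1110", "1111", "11010", "11011"]

def structure_blind_volume(blocks):
    """V_sb: the length of the longest potential form, as in Example 6.1.1."""
    return max(len(b) for b in blocks)

def minimal_volume(blocks):
    """V_mi (6.1.17): ceil(log2 L), the length of the shortest unconstrained
    binary blocks that can represent every potential form."""
    return math.ceil(math.log2(len(blocks)))

v_sb = structure_blind_volume(U)   # 5
v_mi = minimal_volume(U)           # 3
R = v_mi / v_sb                    # resources utilization indicator (6.1.22): 0.6
S = 1 / R - 1                      # volume-surplus indicator (6.1.24): about 0.67
beta_max = v_sb / v_mi             # maximal compression indicator (6.1.33): about 1.67
print(v_sb, v_mi, R, round(S, 2), round(beta_max, 2))
```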
6.1.2 THE EFFECT OF TRANSFORMATIONS OF STRUCTURED DISCRETE INFORMATION ON ITS VOLUME
We now use the introduced concepts to define an indicator characterizing the capability of an information transformation to change the volume of information, particularly to decrease it. We denote by T(·) the transformation transforming information u ∈ U into information v ∈ V, where V is the set of potential forms of the information produced by the transformation. We assume that
the transformation is deterministic and reversible (it is a presentation transformation).   (6.1.28)
Then L(V) = L(U), and from (6.1.18) it follows that
V_mi(V) = V_mi(U).
(6.1.29)
Thus, an indicator of changes of volume based on the minimum volumes would not characterize the important transformations changing only the presentation of information. Therefore, it is natural to compare the volumes for structure-blind transformations and to characterize an information transformation by the ratio
β[T(·)] ≜ V_sb(U)/V_sb(V).
(6.1.30)
This is called the indicator of volume transformation. This indicator depends on the properties of the set V, which, in turn, depend in an essential way on the transformation T(·). We indicate this by writing β[T(·)]. If β > 1, we call the transformation an information-compressing transformation and we call β the indicator of volume compression (briefly, the compression indicator). From (6.1.29) and from the definition (6.1.22) of the resources utilization indicator, it follows that
β[T(·)] = R(V)/R(U).   (6.1.31)
Thus, the compression indicator is the ratio of the indicators of utilization of fundamental-processing resources after and before the transformation. Taking in (6.1.21) V instead of U and utilizing (6.1.31), we see that
β[T(·)] ≤ β_max(U),
(6.1.32)
where
β_max(U) ≜ V_sb(U)/V_mi(U).
(6.1.33)
The ratio β_max(U) has the meaning of an indicator of the maximal volume compression that can be obtained by a reversible transformation. The maximal compression can be achieved by the constraint-removing transformation T_cr(·) (see page 257). From (6.1.22) and (6.1.33) it follows that
β_max(U) = 1/R(U).
(6.1.34)
Substituting in (6.1.34) the numerical values obtained in Example 6.1.1, we get β_max(U) = 1.66.
COMMENT 1
The number L of potential forms of discrete information is related to two quite different characteristics of information. First, it is related to the volume of resources needed to process the information. This is the relation exploited here. However, the number L of potential forms is also related to the variety of potential forms of information, which is an important characteristic of information from the point of view of the superior system (the user of the information) (see Section 1.4.1). Equally well we may use any monotonic function of L as an indicator of the variety of discrete information. Particularly well suited for this purpose is log₂L. This indicator is called the amount of information (which the superior system obtains when a concrete information is delivered). From equation (5.1.9) it follows that in the special case when the information exhibits statistical regularities and the probability distribution is uniform, the amount of information defined here is equal to the amount of statistical information delivered by the information source (see Comment 2, page 215). Using (6.1.18), we write the definition (6.1.22) of the resources utilization indicator in the form
R(U) = log₂L(U)/V_sb(U).   (6.1.35)
Thus, if we take the point of view of the superior system, then we can interpret:
• the minimal volume V_mi(U) = log₂L as the amount of information that the superior system obtains,
• the resources utilization indicator R(U) as the density of information (the amount of information per elementary piece of information),
• the surplus index S defined by (6.1.24) as an index of redundancy.
COMMENT 2
We compress the volume of information to save the resources needed to perform the fundamental processing, particularly storage or transmission of information. However, if the fundamental information-processing system introduces inevitable distortions, particularly indeterministic distortions, it may be desirable to expand the volume and thus to use a reversible transformation with β < 1. We call such a transformation a volume-expanding transformation or, equivalently, a redundancy-introducing transformation. In general, we compress the volume of information (we reduce the redundancy of information) by removing some macro structure. Similarly, we can expand the volume by building in a macro structure. A typical example is the error-correcting coding described in Section 2.1.2. We may also expand the volume to help a superior system with limited resources to utilize the information. The type of the built-in macro structure must match the properties of the subsystem performing the subsequent information processing. The surplus of the volume of primary information is caused by the macro structure that is determined by the source of the information. This macro structure usually does not match the mentioned properties of the forthcoming information processing. Therefore, we often compress the volume of information by first removing the redundant macro structure and next expanding the volume by building in a macro structure that is useful for the subsequent information processing. Such a procedure is the consequence of the structure
blindness of the transformations matching the structure of information to the properties of the subsequent fundamental processing. If those transformations were sensitive to the primary redundant macro structure, its preliminary removal would not be necessary. Volume compression, thus redundancy removal, plays an important role in cryptography, since the primary structure of information can be utilized by an unauthorized observer (not possessing the key) to break the cryptogram. Therefore, compressing the volume is an important measure for increasing the protection against unauthorized access to encrypted information.
6.2 EXAMPLES OF LOSSLESS COMPRESSION OF TRAINS OF STRUCTURED INFORMATION
Very important from a practical point of view are universal compression systems that operate efficiently for a wide class of types of primary working information without extensive meta information about the properties of the working information. The vast majority of lossless information-compression systems employed in practice are modifications of the three basic systems described in this section. The rules of operation of those systems are such that they compress the volume of each concrete information. To estimate the values of the corresponding indicators characterizing the whole set of potential forms of information, we have to specify the properties of those sets. We do this in the next section.
6.2.1 COMPRESSION OF THE TRAIN OF BLOCKS 1: THE POTENTIAL FORMS OF BLOCKS ARE KNOWN
We assume:
A1. The hierarchically higher block has the form
U_tr = {u(i), i = 1, 2, · · · , I},
(6.2.1a)
where
u(i) = {u(i, n), n = 1, 2, · · · , N}, i = 1, 2, · · · , I,
(6.2.1b)
is a hierarchically lower block. The elementary components u(i, n) are binary. A lower-ranking block u(i) is called here a "block", and the higher-ranking block U_tr is called a "train" (hence the notation). From (6.2.1b) it follows that all blocks have the same length;
A2. The set of potential forms of a block does not depend on the number of the block; therefore, we denote such a set briefly as U;
A3. The potential forms of the blocks are known; we denote them as u_l, l = 1, 2, · · · , L; thus, U = {u_l, l = 1, 2, · · · , L};
A4. We transform the train U_tr by transforming the blocks separately; we denote by T(·) the transformation transforming a primary block u(i) into the transformed block v(i) = T[u(i)], and by V_tr = {v(i), i = 1, 2, · · · , I} the transformed train;
262
Chapter 6 Lossless Compression of Information A5. The rule of transforming a block does not depend on a block's position in the train; A6. The transformed block v(/) is a block of binary pieces of information, the set of potential forms of transformed block is {v,, / = 1 , 2, • • • , L}, the transformed blocks v^ may have different lengths; A7. The transformed block determines exactly the primary block; A8. The resources needed to process a train are the sum of resources needed to process its components V(V,)=Ev[v(/)];
(6.2.2)
A9. As the indicator of resources needed to process a block, we take its length V[v(/)]=N[ii(0], (6.2.3) where N(a) is the length of a block a (see (6.1.4) From A8 and A9 it follows that V(V„) = E N [ V ( 0 ] .
(6.4.4)
/-I
This notation is illustrated in Figure 6.1
IIIMIIINI v(l)
^<
v(2)
v(3) v(4) -•-N[v(3)]-^ • N
Figure 6.1. Illustration of the notation: 7=4. We denote by M(ui) the number of occurrences of a the potential form u^ in the train U^. The ratio j^. . P\U,)^—J!.
(6.2.5)
is the frequency of occurrences of the block u_l (see Section 4.1.1). We denote by P'(v_l) the similarly defined frequency of occurrences of the transformed block v_l in the transformed train V_tr. In view of assumptions A4 and A7, we have
P'(v_l) = P'(u_l).
(6.2.6)
Grouping in the sum (6.2.4) the blocks having the same form and using (6.2.5), we get
V(V_tr) = I Σ_{l=1}^{L} N(v_l) P'(u_l).   (6.2.7)
For a train U_tr, the frequencies of occurrences P'(u_l) are fixed. However, by changing the potential forms v_l of the transformed blocks and the rule of assigning a transformed block to a primary block, we can change the lengths N(v_l) and influence the volume V(V_tr). We are interested in the optimal choice minimizing the volume. In the notation for optimization problems introduced in Section 1.6.2, the optimization problem is
OP {V, T_cd(·)}, V(V_tr), C_V, C_T,
(6.2.8)
where V is the set of potential forms of the transformed information, T_cd(·) is the code, and C_V (respectively, C_T) denotes the constraints imposed on the potential forms of the transformed blocks (respectively, on the codes). If all transformed blocks had the same length, we could not achieve any compression. Therefore, looking for a set of transformed blocks that would minimize the volume of the transformed train, we must assume that the transformed blocks have different lengths. Then, however, the separation problem discussed in Section 2.6 arises; thus, as constraint C_V we take the requirement that the train of secondary blocks is separable. We can satisfy this constraint either by taking as blocks the code words of a reserved prefix code or by adding explicit separation information (a comma, length information). To get some insight into the problem of finding the optimal transformed blocks, we assume first that the set V of potential forms of the transformed blocks is given. Then we face the OP T_cd(·), V(V_tr) problem, which reduces to the following problem: there are given two sets {a_l, l = 1, 2, · · · , L} and {b_m, m = 1, 2, · · · , L} of nonnegative numbers; we have to find such permutations {a_l(k), k = 1, 2, · · · , L} and {b_m(k), k = 1, 2, · · · , L} of those numbers that the sum Σ_k a_l(k) b_m(k) is minimal. The solution of this optimization problem is simple: we arrange the numbers a_l in descending order a_l(1) ≥ a_l(2) ≥ · · · ≥ a_l(L) and the numbers b_m in ascending order b_m(1) ≤ b_m(2) ≤ · · · ≤ b_m(L). Accordingly, we arrange the frequencies of occurrences in descending order P'(u_l(1)) ≥ P'(u_l(2)) ≥ · · · ≥ P'(u_l(L)) and the given forms of the secondary blocks in ascending order of their lengths N(v_m(1)) ≤ N(v_m(2)) ≤ · · · ≤ N(v_m(L)), and we assign
u_l(k) → v_m(k).
(6.2.10)
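The assignment rule (6.2.10) is straightforward to implement. The Python sketch below (names are ours) uses the frequency values of Example 6.2.1 and the code word set of Example 6.1.1: it sorts the primary blocks by decreasing frequency, sorts the available code words by increasing length, pairs them, and reports the resulting average length of a transformed block.

```python
# Frequencies of occurrence of the primary blocks, as in (6.2.11).
freq = {"u1": 5/48, "u2": 25/48, "u3": 3/48, "u4": 3/48,
        "u5": 5/48, "u6": 1/48, "u7": 3/48, "u8": 3/48}

# Reserved prefix code words assumed to be given (the set of Example 6.1.1).
codewords = ["0", "100", "101", "1100", "1110", "1111", "11010", "11011"]

# Rule (6.2.10): the most frequent primary block gets the shortest code word.
by_freq = sorted(freq, key=freq.get, reverse=True)   # descending P'(u_l)
by_len = sorted(codewords, key=len)                  # ascending N(v_m)
code = dict(zip(by_freq, by_len))

avg_len = sum(freq[u] * len(code[u]) for u in freq)
print(code)
print(f"average length of a transformed block: {avg_len:.2f}")   # about 2.3
```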
EXAMPLE 6.2.1 RUNNING OF THE OPTIMAL COMPRESSION ALGORITHM
Let us take concrete values. We assume that the length of the primary block is N = 3, the length of the train is I = 48, and the frequencies of occurrences are
P'(u₁)=5/48, P'(u₂)=25/48, P'(u₃)=3/48, P'(u₄)=3/48,
P'(u₅)=5/48, P'(u₆)=1/48, P'(u₇)=3/48, P'(u₈)=3/48.   (6.2.11)
To ensure the separability of the transformed train, as the set of given forms of the transformed blocks we take the set (2.6.6) of reserved prefix code words. For this set we have
N(v₁)=1, N(v₂)=3, N(v₃)=3, N(v₄)=4,
N(v₅)=4, N(v₆)=4, N(v₇)=5, N(v₈)=5.   (6.2.12)
The descending ordering of the frequencies of occurrences is l(1)=2, l(2)=1, l(3)=5, l(4)=3, l(5)=4, l(6)=7, l(7)=8, l(8)=6, and the ascending ordering of the code word lengths is m(1)=1, m(2)=2, m(3)=3, m(4)=4, m(5)=5, m(6)=6, m(7)=7, m(8)=8. The optimum code is
u₁   u₂   u₃   u₄   u₅   u₆   u₇   u₈
v₂   v₁   v₄   v₅   v₃   v₈   v₆   v₇
Substituting the code words listed in table (2.6.6), we finally get the set of optimally transformed blocks
V = {v₁=0, v₂=100, v₃=101, v₄=1100, v₅=1110, v₆=1111, v₇=11010, v₈=11011}.
(6.2.14)
From (6.2.2) we get V(U_tr) = 3·48. From (6.2.7), after using (6.2.11) and (6.2.12), we obtain V(V_tr) = 2.3·48. The ratio
β_ind(U_tr) = V(U_tr)/V(V_tr)   (6.2.15)
characterizes the compression of a concrete train. We call it the individual compression indicator. It is the counterpart of the compression coefficient β defined by (4.1.30), which characterizes the compression ability of a transformation from the point of view of all potential forms of the primary information. For the considered train we get
β_ind(U_tr) = 3/2.3 = 1.31.   (6.2.16)
The individual compression indicator depends on the set of reserved prefix code words. In the example, we just used a set without explaining how we got it. We would like to have an optimal set allowing the smallest possible volume of the compressed train. There exists a simple and elegant algorithm devised by Huffman, which simultaneously generates the optimal set of reserved prefix code words and implements the optimal coding described by (6.2.10). Assume that an optimal set of reserved prefix code words has been found and that the code words are arranged in order of decreasing frequencies of occurrences of the corresponding primary words. Then the last two code words must have the same length. We prove this by contradiction.
Suppose that the last two code words (the code words corresponding to the words with the smallest and the second-smallest frequencies of occurrences) have different lengths. We denote by v_sh the shorter and by v_lo the longer code word, and by v'_sh the prefix of v_sh obtained by rejecting its last binary piece of information (assume, for example, that this piece is 1). Since the code is a reserved prefix code, the prefix v'_sh is not a code word. Also, the train v'_sh 0 is not a code word. Therefore, we can take it instead of the code word v_lo. The train v'_sh 0 has the same length as v_sh and thus is shorter than the code word v_lo, so the assumed code was not optimal. This proves our thesis and suggests the following algorithm:
S1. Arrange the primary words in a sequence with non-increasing frequencies of occurrences in the block;
S2. Call the pair of the last two primary words the aggregated word;
S3. Identify each of the two primary words by a binary identifier;
S4. Call the sum of the frequencies of occurrences of the two primary words the frequency of occurrence of the aggregated word;
S5. Consider the first L-1 primary words and the aggregated word as new primary words; go to S1;    (6.2.17)
S6. End the procedure when the set of the modified primary words contains only one element; call it the totally aggregated word;
S7. The code word assigned to a primary word is the train of identifiers of the aggregated words read on the path from the totally aggregated word to the considered word.
This algorithm is called the Huffman algorithm. It is illustrated by the following concrete example.
EXAMPLE 6.2.2 RUNNING OF THE HUFFMAN ALGORITHM
We assume again that the frequencies of occurrences of the primary words in the block are given by (6.2.11). The implementation of the optimization procedure is illustrated in Figure 6.2.
Figure 6.2. Illustration of the implementation of the Huffman algorithm (6.2.17). The short bars indicate the primary words; the short bars with a cross indicate the aggregated words. The number at the left end of a bar is the frequency of occurrences of the word (primary or aggregated) multiplied by 48 (the number of occurrences). The number at the right end of a bar is the binary identifier of the word when it is aggregated. The thicker lines join the bars corresponding to an aggregated pair.
Take, for example, the primary word u4. On the path from the top node to the leaf node representing u4 we pass the sequence 1, 1, 0, 0 of branching identifiers. Thus, the optimal transformed block is v4 = 1100. The remaining code words we read from Figure 6.2 in the same way. Their set is the set V given by (6.2.14).
As a second example we take blocks with equal frequencies of occurrences P*(u_l) = 1/8, l = 1, 2, ..., 8. In the first four operations four aggregated words with weights 1/4 are produced. The next two operations produce two aggregated words with weights 1/2. The compressed code words are blocks of three binary digits. Thus, the Huffman algorithm produces the same code words as algorithm (1.5.8) with K=2. □
COMMENT 1
The Huffman algorithm produces the set of optimal reserved prefix code words and minimizes the length of the train of blocks; both these functions are coupled. Therefore, the Huffman algorithm is not suitable if we do not require that the separation of secondary blocks be achieved by reserved prefix coding, and separation techniques using explicit separation information are used instead. Rule (6.2.10), however, remains useful then. An example of such compression is arithmetic coding.
COMMENT 2
The described compression system is an example of a system with learning cycles described in Section 1.7.2, pages 59-60. Although the Huffman algorithm produces single transformed blocks, its task is to minimize the volume of the whole train. The code words produced by the Huffman algorithm are determined by the set
P* = {P*(u_l), l = 1, 2, ..., L}    (6.2.18)
of frequencies of occurrences of the primary blocks. However, those frequencies depend on the whole train of blocks. Therefore, we may interpret them as auxiliary information about the train of blocks, which is relevant for the optimization of compression block by block. The block diagram of the system realizing the described information processing is shown in Figure 6.3. The time relationships between the primary and transformed trains are shown in Figure 6.1. The system shown in Figure 6.3 is a special case of the learning system shown in Figure 1.27.
Figure 6.3. The optimal compression of a train of blocks using the Huffman optimization algorithm.
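To make the construction (6.2.17) concrete, here is a minimal Python sketch that builds a Huffman code from the frequencies (6.2.11); the symbol labels u1, ..., u8 and the function name huffman_code are ours. Ties between equal frequencies may be broken differently than in Figure 6.2, but the resulting code word lengths and the average length per block agree with (6.2.12) and (6.2.16).

```python
import heapq

def huffman_code(freqs):
    """Build a binary Huffman code from a dict {symbol: frequency}.

    Repeatedly aggregates the two least frequent (possibly already aggregated)
    words and labels the two members of each pair with the binary identifiers
    0 and 1, as in algorithm (6.2.17)."""
    # Each heap entry: (frequency, tie_breaker, {symbol: partial code word}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # least frequent word
        f2, _, c2 = heapq.heappop(heap)      # second least frequent word
        # Prepend the binary identifier of each member of the aggregated pair.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Frequencies of occurrences from (6.2.11), multiplied by 48.
counts = {"u1": 5, "u2": 25, "u3": 3, "u4": 3, "u5": 5, "u6": 1, "u7": 3, "u8": 3}
code = huffman_code(counts)
avg_len = sum(counts[s] * len(code[s]) for s in counts) / 48
print(code)      # code words; the lengths form the multiset {1,3,3,4,4,4,5,5} of (6.2.12)
print(avg_len)   # 2.3125, i.e. approximately 2.3 bits per block, as in (6.2.16)
```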
To get the needed set P* of frequencies of occurrences, we must first observe the whole train of words. This means that the optimal compression of the train introduces a substantial delay, as illustrated in Figure 6.1. This is a special case of the general principle of trading the quality of decisions for the delay needed to take them. Because the aim of the Huffman algorithm is to minimize the length of the whole train, some code words may be longer than the primary coded blocks. This is illustrated by the codebook (6.2.13) using the code words given by (6.2.14).
STATISTICALLY OPTIMAL HUFFMAN CODE
To this point we made no assumptions about statistical regularities in the train. Now we assume that the train can be considered as an observation of the random process
U_tr = {U(i), i = 1, 2, ..., I},    (6.2.19a)
where
U(i) = {U(i, n), n = 1, 2, ..., N}, i = 1, 2, ..., I    (6.2.19b)
is the random variable representing the ith primary block. We assume also that those random variables have the same probability distribution and that they are statistically independent, and we denote
P(u_l) = P[U(i) = u_l], l = 1, 2, ..., L.    (6.2.20)
The code produced by the Huffman algorithm using these probabilities is called the statistically optimal Huffman code. In the next section we consider indicators of performance of systems compressing trains that exhibit statistical regularities. We show that the statistical average volume of the compressed train
V̄(V_tr) = E v(V_tr),    (6.2.21)
where V_tr denotes the random process representing the compressed train, is often the suitable performance indicator. Using equations (6.2.2), (6.2.3), and (4.4.9) we get
V̄(V_tr) = I Σ_{l=1}^{L} N(v_l) P(u_l).    (6.2.22)
This equation is the same as equation (6.2.7) with P*(u_l) = P(u_l). Thus,
The statistically optimal Huffman coding minimizes the statistical average of the volume of the compressed train.    (6.2.23)
Let us look at the relationships between the statistically optimal Huffman compression and the compression of a given train discussed previously. From the fundamental property of long trains discussed in Section 4.6 it follows that with probability close to 1 a long train (I >> 1) is a typical train. Thus, the approximation
P*(u_l) ≈ P(u_l)    (6.2.24)
is accurate. Since the codebook changes in steps when the frequencies of occurrences change continuously, we can expect that a long typical primary train is transformed by the statistically optimal Huffman code into an almost shortest secondary train. The transformation of a non-typical train is lossless, but the transformed train may be longer than the train produced by the Huffman algorithm based on the actual frequencies of occurrences.
The statistically optimal Huffman coding of a primary block can be performed immediately after the block arrives; we need not wait till the whole train is observed. However, to profit from this feature the set of probabilities
P = {P(u_l), l = 1, 2, ..., L}    (6.2.25)
must be available. We can get an estimate of P from an observation of a previous train. Thus, the system using the statistically optimal Huffman coding should operate as an intelligent system estimating the state parameters, as discussed in Section 1.7.2, with the set of probabilities P playing the role of the unknown state parameters. In the training cycle, the Huffman algorithm based on frequencies of occurrences is applied and the compressed train is produced after the primary train has ended. Then the frequencies of occurrences are stored and used as probabilities to code currently the arriving blocks of the subsequent trains. If the statistical properties of those trains change, then the estimate of the set of probabilities P must be updated. Thus, training and working cycles must be interleaved as shown in Figure 1.25.
6.2.2 ARITHMETIC CODING
Huffman coding is block-oriented coding. It transforms the primary block as a whole into a whole compressed block, independently of other blocks. However, all components of a primary train are often interrelated, and dividing the train into blocks is an artificial operation. To exploit all the existing relationships, we would have to segment the primary train into possibly large blocks, and Huffman coding of such blocks may be tedious. We consider here arithmetic coding, which is string-oriented: it builds the compressed block successively as the elements of the primary block arrive. To simplify the argument we describe the counterpart of the previously presented statistically optimal Huffman coding. Thus, we assume that the components of the trains exhibit statistical regularities and that the probabilities of the components are known. We denote by
U(1) = {u_l, l = 1, 2, ..., L} the set of potential forms of a component of the primary block;
u = {u(n), n = 1, 2, ..., N}, u(n) ∈ U(1) ∀n, a primary block;
P(u) = P(U = u) the probability of the block u.
The arithmetic coding algorithm is
S1. The primary block u is transformed into an auxiliary interval J(u) ⊂ <0, 1>; the length of the interval is |J(u)| = P(u);
S2. A family of sets W(m), m = 1, 2, ..., W(m-1) ⊂ W(m), of equidistant reference points, with the distance between the points decreasing with growing m, is generated;    (6.2.26)
S3. A reference point w*(u) from a set W(M), with a possibly small M, lying inside the interval J(u), is found;
S4. As the compressed block v(u) an identifier of the reference point w*(u) is taken.
This algorithm is illustrated in Figure 6.4.
Figure 6.4. Illustration of the general description of arithmetic coding: (a) the block diagram of the coding algorithm, (b) illustration of coding; x denotes a reference point.
Let us describe in more detail the subrule assigning the interval J(u) to a primary block u (step S1). We assume that
A1. The random variables U(n), n = 1, 2, ..., N, representing the components of the primary block are statistically independent and have the same probability distribution;
A2. The primary blocks are separated by a comma; we denote it by u_c.
On assumption A1 the probability of a block (including the comma at the end) is
P(u) = Π_{n=1}^{N} P_{l(n)},    (6.2.27)
where u = {u_{l(1)}, u_{l(2)}, ..., u_{l(N)}} is a primary block and P_l = P[U(n) = u_l] is the probability of the potential form u_l. We partition the interval J(0) = <0, 1> into L subintervals J[1; l(1)], l(1) = 1, 2, ..., L, of length
|J[1; l(1)]| = |J(0)| P_{l(1)} = P_{l(1)}.    (6.2.28)
An example of such a partition is shown in the line n=1 of Figure 6.5.
Figure 6.5. Illustration of the definition of the subintervals J[m; l(1), l(2), ..., l(m)], L=3, P_1=0.6, P_2=0.3, P_3=0.1.
Next, in the same way, we partition each interval J[1; l(1)] into the subintervals J[2; l(1), l(2)], l(2) = 1, 2, ..., L, of length
|J[2; l(1), l(2)]| = |J[1; l(1)]| P_{l(2)}.    (6.2.29)
From (6.2.28) and (6.2.29) we get
|J[2; l(1), l(2)]| = P_{l(1)} P_{l(2)}.    (6.2.30)
Iterating this procedure we introduce the partition of rank n:
|J[n; l(1), l(2), ..., l(n)]| = |J[n-1; l(1), l(2), ..., l(n-1)]| P_{l(n)} = Π_{m=1}^{n} P_{l(m)}.    (6.2.31)
From (6.2.27) and (6.2.31) it follows that
|J(u)| = |J[N; l(1), l(2), ..., l(N)]| = P(u).    (6.2.32)
The introduced notation is illustrated in Figure 6.5.
We next describe in more detail the generation of the sets of reference points (step S2 of algorithm (6.2.26)). We denote by
V(1) = {0, 1, 2, ..., K-1} the set of integers; they play the role of elementary pieces of a compressed block but also of digits in a counting system with basis 1/K;
V(m) the set of all blocks v = {v(1), v(2), ..., v(m)}, v(n) ∈ V(1), n = 1, 2, ..., m.
We next associate with a block v ∈ V(m) the number
w(v) = VAL v,    (6.2.33a)
where
VAL v = Σ_{n=1}^{m} v(n) (1/K)^n.    (6.2.33b)
Since the representation on the right side is unique, we can retrieve v when w(v) is available. We have
v = STR w(v),    (6.2.34)
where STR is the transformation defined by (1.5.7).
When we interpret the elements v(n) as digits, the block v has the meaning of a fraction w ∈ <0, 1> in the number system with basis 1/K. Let us take, for example, K=3, m=3, v = {1, 1, 2}. Then w = 1/3 + 1/9 + 2/27 = 14/27. We call w(v) a reference point. The set
W(m) = {w(v): v ∈ V(m)}    (6.2.35)
is called the set of reference points of order m. An example of such a set is shown in Figure 6.4b. From the definition of w(v) it follows that
W(m-1) ⊂ W(m).    (6.2.36)
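The number representation (6.2.33)-(6.2.34) is easy to check numerically. Here is a minimal sketch; val and digits_of are our names, and digits_of stands in for the STR transformation (1.5.7), whose exact definition is not reproduced here.

```python
from fractions import Fraction

def val(v, K):
    """VAL transformation (6.2.33b): digit block -> fraction in <0, 1>."""
    return sum(Fraction(d, K ** (n + 1)) for n, d in enumerate(v))

def digits_of(w, K, m):
    """Recover the first m base-(1/K) digits of a fraction w (an STR-like inverse)."""
    out = []
    for _ in range(m):
        w *= K
        d = int(w)        # the integer part is the next digit
        out.append(d)
        w -= d
    return out

w = val([1, 1, 2], 3)
print(w)                     # 14/27, as in the worked example above
print(digits_of(w, 3, 3))    # [1, 1, 2]
```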
To identify the interval J(u) we look for a reference point lying inside the interval J(u) and belonging to a reference set of a possibly low order M. A sufficient condition for finding a reference point inside an interval of length |J(u)|, located somewhere in the interval <0, 1>, is that the distance between successive reference points from the set W(M) is smaller than the length |J(u)|. Since the distance between successive points from the set W(M) is (1/K)^M, the condition is
(1/K)^M < |J(u)|,    (6.2.37a)
or, equivalently,
M > (1/log K)[-log |J(u)|],    (6.2.37b)
where log denotes the logarithm with an arbitrary basis. Since we are looking for a possibly small M, we take
M = ⌈(1/log K)[-log |J(u)|]⌉,    (6.2.38)
where ⌈a⌉ denotes the smallest integer not smaller than a. From (6.2.32) we have
M = ⌈(1/log K)[-log P(u)]⌉.    (6.2.39)
Thus, the length of the train of digits identifying the interval is a non-increasing function of the probability of the primary block, and the general rule of optimization (6.2.10) is satisfied. Using the estimate (6.2.39) it can be shown that for sufficiently long blocks arithmetic coding produces a train with minimal statistical volume, which is discussed in Section 6.5.2.
From (6.2.31) it follows that the length of the interval J[n; l(1), l(2), ..., l(n)] decreases with growing n. Therefore, when we calculate for n = 1, 2, ... the end points of the intervals J[n; l(1), l(2), ..., l(n)], at some n_1 the first digit(s) representing the end points become for the first time the same and do not change for n > n_1. Those digits are also the initial digits of the final identifier. Thus,
We obtain the elements (the digits) of the identifier v(u) by adding successively new digits whenever the initial digit(s) in the representations of the end points of the interval J[n; l(1), l(2), ..., l(n)] become the same.    (6.2.40)
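The following sketch, under assumptions A1-A2 above, traces steps S1-S4 of (6.2.26) for the probabilities used in Figure 6.5 and binary reference points (K = 2); the function names encode_interval and identify are ours, and the incremental digit emission of (6.2.40) is not implemented.

```python
from fractions import Fraction
import math

def encode_interval(block, probs):
    """Step S1: map a block of symbol indices to its interval J(u).

    probs[l] is P_l; the cumulative sums define the nested partition
    (6.2.28)-(6.2.31). Returns (left end, width) with width = P(u)."""
    cum = [Fraction(0)]
    for p in probs:
        cum.append(cum[-1] + Fraction(p))
    low, width = Fraction(0), Fraction(1)
    for l in block:
        low += width * cum[l]
        width *= Fraction(probs[l])
    return low, width

def identify(low, width, K=2):
    """Steps S2-S4: smallest M with (1/K)**M < |J(u)|, then the M base-K
    digits of a reference point of W(M) lying in J(u)."""
    M = 1
    while Fraction(1, K ** M) >= width:
        M += 1
    point = Fraction(math.ceil(low * K ** M), K ** M)   # grid point inside J(u)
    digits, w = [], point
    for _ in range(M):
        w *= K
        d = int(w)
        digits.append(d)
        w -= d
    return digits

# L = 3 symbols with the probabilities of Figure 6.5 (assumed example data).
probs = [Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)]
low, width = encode_interval([0, 1, 0], probs)   # block u = {u_1, u_2, u_1}
print(low, width)                 # 9/25 and 27/250, i.e. the interval <0.36, 0.468>
print(identify(low, width, K=2))  # [0, 1, 1, 0]: four binary digits identify J(u)
```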
From (6.2.33b) and (6.2.36) (see also Figure 6.5) it follows that
An identifier v[J(u)] of the primary block u = {u_{l(1)}, u_{l(2)}, ..., u_{l(N)}} is also an identifier (although not the shortest one) of every interval J[n; l(1), l(2), ..., l(n)], n < N.    (6.2.41)
6.2.3 COMPRESSION OF A TRAIN OF BLOCKS 2: THE POTENTIAL FORMS OF BLOCKS ARE NOT KNOWN
The Huffman algorithm can also be applied if the lengths of the potential forms of the words are different. It is also feasible when we have a long train of binary pieces of information and we want to avoid the technical complications of coding it as a whole. We can then segment the train into blocks of equal length N and treat them as words. We call such a procedure external segmentation. Since the length N is a design parameter, the problem of its best choice arises. We address this problem in Section 6.5.3. However, essential for the application of the basic Huffman algorithm and of its modifications is the knowledge of the potential forms of the words. Without it we cannot determine the frequencies of occurrences. The natural way to compress structured information when we know that it has a macrostructure but do not know the potential forms of the macro components (in particular, the words) is to
Identify the potential forms of the macro components of the train currently;
use some volume-compressing coding for the identifier of the identified macro component.    (6.2.42)
A rudimentary form of such compression is run-length coding. It is suitable when blocks of a repeating specific element are embedded in a train of binary pieces of information and the number of repetitions can vary. A typical specific element is 0. The previously mentioned general rule then takes the following form:
1. Search the train of arriving elementary pieces of information for a block of 0's;
2. When such a block is identified, replace the block by information about its length;    (6.2.43)
3. Add suitable auxiliary information to separate the information about the length from the primary elementary pieces of information.
As separation information we may use a comma block described in Section 2.6. The separation information increases the volume of the transformed train. Therefore, it is not favorable to process blocks of 0's shorter than the separation information, and only blocks of 0's longer than a minimal length N_min(0) are processed. The choice of the minimal length N_min(0) depends on the statistical properties of the lengths of blocks of 0's. For algorithms and examples of coding see Held [6.1].
A class of universal and efficient algorithms based on the general principle (6.2.42) are the Ziv-Lempel algorithms and their modifications. Those algorithms are particularly well suited for the compression of text in natural or computer languages. Such texts have the following properties:
• They have a hierarchic macro structure: the first-ranking components are the characters, the second-ranking components are strings of characters frequently occurring in words (in particular, the syllables), the third-ranking components are the words (in the usual sense), and still higher-ranking components are some typical sequences of words;
• With the exception of the words, the other macro components are not segmented;
• Many macro components can be considered as modifications of a prototype string, created by adding some new components at the end of the prototype.
The Ziv-Lempel algorithms utilize the first property to generate successively, as new elements of the train arrive, the strings of characters that are considered as reference macro components (briefly, reference patterns). The third property is used to organize the set of already chosen reference patterns (called a dictionary), so that testing whether a newly arriving string is already in the dictionary is simplified. The general scheme of the compression algorithms is as follows (a sketch of one concrete variant is given after the list):
S1. If the newly arriving string of the train is not a reference pattern already contained in the dictionary, then
S1.1. it is labelled as a reference pattern and put in the dictionary,
S1.2. it is placed in the compressed train unchanged;
S2. If the arriving segment of the train is a reference pattern, it is replaced by an identifier of this pattern;
S3. A reference pattern that does not occur frequently in the time interval <t_c - T, t_c>, where t_c is the current instant and T is the activity reviewing period, is removed from the dictionary.
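As announced above, here is a minimal LZ78-style sketch of the general scheme (one member of the Ziv-Lempel family). The function names and the (dictionary index, next character) form of the identifier are our choices, and step S3 (removal of inactive reference patterns) is omitted for brevity.

```python
def lz78_encode(text):
    """Emit (dictionary index, next character) pairs, growing the dictionary
    of reference patterns as the train arrives; index 0 is the empty pattern."""
    dictionary = {"": 0}
    out, pattern = [], ""
    for ch in text:
        if pattern + ch in dictionary:      # the extended string is still a known pattern
            pattern += ch
        else:                               # new string: identifier of known prefix + last char
            out.append((dictionary[pattern], ch))
            dictionary[pattern + ch] = len(dictionary)
            pattern = ""
    if pattern:
        out.append((dictionary[pattern], ""))
    return out

def lz78_decode(pairs):
    """Rebuild the dictionary from the compressed train and recover the text."""
    patterns, text = [""], []
    for index, ch in pairs:
        entry = patterns[index] + ch
        patterns.append(entry)
        text.append(entry)
    return "".join(text)

coded = lz78_encode("abababababa")
print(coded)                 # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, '')]
print(lz78_decode(coded))    # "abababababa"
```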
Figure 6.6. The basic structure of Ziv-Lempel algorithms for compression of trains with a priori unknown macrostructure components.
The diagram of an information-compressing system implementing the algorithm is shown in Figure 6.6. Let us comment on the steps of the algorithm. The dictionary usually has a linear structure and hence may be called a string dictionary. It is built successively in such a way that at the decompression site a replica of the dictionary can be gradually created from the available compressed train. The typical identifiers of a reference pattern mentioned in S2 are (a) the information about the position of the first character of the pattern in the string dictionary and (b) the information about the length of the pattern (see Section 2.6). Step S3 is necessary because new patterns are continually added and an unlimited growth of the dictionary must be avoided. Moreover, the removal of patterns that are not real macro components but were tentatively treated as such by the algorithm is advantageous.
Let us look again at the algorithm from a broader perspective. Compared with the Huffman algorithm, the Ziv-Lempel algorithms do not require knowledge of the potential forms of the macro components; they generate by themselves a presumable list of such components. For the coding of a new string of characters the algorithm utilizes the detailed information about the previously observed strings that is built into the dictionary. The volume of this auxiliary information is very large compared with the volume of the information P* about frequencies of occurrences used by the Huffman algorithm. However, the compressed train carries all the information needed to create the dictionary needed for decompression.
Our description of the algorithms is only sketchy. In each step several details must be specified, and there is a great variety of options. A more detailed description of the algorithm can be found in Storer [5.4]; programs implementing it are given in Nelson [6.6]. Detailed topics are discussed in the conference proceedings Storer and Reif [6.7], Storer and Cohn [6.8], [6.9], and [6.10].
6.3 REAL-TIME COMPRESSION OF A TRAIN OF BLOCKS WITH IDLE PAUSES
In the two previous examples we considered the elementary pieces of information as atomic elements (in fact, as numbers). However, every information is presented as a physical state of some objects and in consequence has a multilevel hierarchical structure. This structure has a substantial influence on the choice of the definition of the volume of information and subsequently of the information-compressing transformation. This section illustrates how to utilize the time structure of information presented in a dynamic form (see Section 1.1). We also use this case to provide more insight into the fundamental concepts introduced in the previous section. The concepts introduced here are also utilized in the subsequent statistical analysis of real-time compression systems.
6.3.1 THE MODEL OF THE PRIMARY INFORMATION
We consider the system shown in Figure 6.7a.
Figure 6.7. Matching of a train of blocks interleaved with idle pauses to a rhythmically operating channel: (a) the block diagram of the system, (b) the time structure of the train at the output of the local channel.
We assume that:
(1) A channel, usually a long-range channel forwarding the working information to its destination, is given. This channel plays the role of the fundamental information-processing system and is therefore called the fundamental channel.
(2) The primary information is a train of blocks described by (6.2.1).
(3) The primary information does not come directly from the primary information source (e.g., a computer terminal) but is delivered to the input of the fundamental channel by a local channel.
(4) The process at the output of the local channel carrying the primary information is a train of blocks divided by idle pauses, as shown in Figures 1.10 and 6.7b.
We assume also that the local channel operates rhythmically. By this we mean that the time axis is divided into elementary time slots of equal duration; we denote it by Δ_u. The number of slots generated during a time unit,
C_u = 1/Δ_u,    (6.3.1)
is called the channel capacity.
The process available at the output of the local channel has a multilevel time structure. The elementary component is a binary piece of information that is carried by an elementary process (pulse) that fits into a time slot. This elementary process usually has some fine structure (see Section 2.1.1, in particular Figure 2.2). The binary pieces of information are assembled in blocks. We denote by u[i, (t)], t ∈ <τ(i), τ(i)+D(i)>, i = 1, 2, ..., I, the block of elementary processes carrying the components of the ith block of primary binary pieces of information. We assume that
A1. The instants τ(i) of block arrivals and their durations D(i) are random.
The corresponding process is called a time-structured train and is denoted U_tr(·). On the assumption that the duration of a pulse is almost equal to the slot duration Δ_u, a typical time-structured train is shown in Figure 6.7b. The primary characteristics of such a train are
D[U_tr(·)] - the duration of the train,
N_s[U_tr(·)] - the number of slots of the fundamental channel needed to transmit the whole train (with idle pauses),
N_tot[U_tr(·)] - the total number of binary pieces of information contained in all blocks of the train.
The obvious relationships between those parameters are
N_s[U_tr(·)] = D[U_tr(·)]/Δ_u,    (6.3.2)
N_tot[U_tr(·)] ≤ N_s[U_tr(·)].    (6.3.3)
The ratio
R[U_tr(·)] = N_tot[U_tr(·)]/D[U_tr(·)]    (6.3.4)
is called the rate of delivery of the "pure" information. From (6.3.1) to (6.3.3) it follows that
R[U_tr(·)] ≤ C_u.    (6.3.5)
Till this point we considered the local channel. We assume that the fundamental channel also operates rhythmically. We denote by
Δ_v the duration of a time slot of the fundamental channel and by
C_v = 1/Δ_v    (6.3.6)
the capacity of the channel. We assume that the fundamental channel has the following properties:
F1. The fundamental channel is unable to distinguish whether a slot is idle or whether a pulse carrying a binary piece of information is located in the slot;
F2. The total number of reserved time slots is the indicator of the resources needed to process an information train.
In view of property F2, if we put the train delivered by the local channel directly into the fundamental channel, we have to pay for the idle pauses. To diminish the usually high costs of hiring the fundamental channel, we look for an information-compressing transformation that would decrease, and in the best case eliminate, the pauses. As such a transformation we take buffering. It was briefly described in Section 2.6.5 (see, in particular, Figure 2.23), and its formal description was presented in Section 3.3.1 (see Figure 3.12). We now show how the previously introduced indicators of information compression can be used to describe and analyze buffering.
6.3.2 THE VOLUME OF A TRAIN OF BLOCKS INTERLEAVED BY IDLE PAUSES
We show now how the concepts introduced in Section 6.1 can be used to define the indicator of the compression capability of buffering. We first define the volume of the primary and of the transformed train of information. We assume first that the primary train delivered by the local channel is put directly, without preliminary processing, into the fundamental channel. Thus,
V_tr(·) = U_tr(·).    (6.3.7)
The primary channel can be directly connected with the fundamental channel when both channels operate in the same way. In particular, the durations of the slots in both channels must be the same:
Δ_v = Δ_u.    (6.3.8)
Based on assumption F1 we must reserve the fundamental channel for the whole duration of the time-structured train U_tr(·). In view of assumption F2 we define the structure-blind volume of a train
V_sb[U_tr(·)] = N_s[U_tr(·)].    (6.3.9)
Using (6.3.2) we get
V_sb[U_tr(·)] = D[U_tr(·)]/Δ_u.    (6.3.10)
Taking into account equation (6.3.1), we write this in the form
C_u = V_sb[U_tr(·)]/D[U_tr(·)].    (6.3.11a)
Thus,
The capacity of the fundamental channel needed to transmit the information directly has the meaning of the structure-blind volume of the primary information train per time unit.    (6.3.11b)
We next consider the minimum volume of the train on the assumption that the information-compressing transformation is reversible. To find it we must take into account the properties of the superior system. We assume that
S1. The total number N_tot[U_tr(·)] of elementary pieces of information is the indicator of the usefulness of the information for the superior system;
S2. The idle spaces between the blocks, particularly their durations, are irrelevant for the superior system;
S3. The delay between the instant at which the packet delivered by the local channel starts and the instant when the feeding of the transformed packet into the fundamental channel starts is the indicator of performance of the compressing subsystem.
From definition (6.3.2) and assumption F2, page 277, it follows that the volume of a transformed train V_tr(·) is minimal when all slots contain pulses carrying binary pieces of information. Therefore,
V_mi[U_tr(·)] = N_tot[U_tr(·)]    (6.3.12)
has the meaning of the minimum volume of the train U_tr(·). From (6.3.2) and (6.3.4) it follows that
R[U_tr(·)] = V_mi[U_tr(·)]/D[U_tr(·)].    (6.3.13a)
Thus,
The rate of pure information delivery has the meaning of the minimum volume of information per time unit.    (6.3.13b)
For the considered concrete train the ratio
r_ch[U_tr(·)] = V_mi[U_tr(·)]/V_sb[U_tr(·)]    (6.3.14)
has the meaning of an indicator of utilization of the resources of the fundamental channel needed to process the concrete train U_tr(·) directly (the counterpart of the resources utilization indicator characterizing the set of all potential forms of information defined by (6.1.22)). Putting (6.3.9) and (6.3.12) in (6.3.14), we get the indicator of resources utilization
r_ch[U_tr(·)] = N_tot[U_tr(·)]/N_s[U_tr(·)] = R[U_tr(·)]/C_u.    (6.3.15)
The maximum value 1 is achieved when R[U_tr(·)] = C_u, that is, when there are no pauses between the blocks. From the central part of formula (6.3.15) we see that
r_ch[U_tr(·)] = p,    (6.3.16)
where p is the duty ratio defined by (1.3.7). The counterpart of the volume surplus indicator defined by (6.1.24) is the ratio
s_ch[U_tr(·)] = {V_sb[U_tr(·)] - V_mi[U_tr(·)]}/V_mi[U_tr(·)].    (6.3.17)
It has the meaning of the relative surplus of the capacity of the actually used channel above the minimum capacity needed. Using (6.3.10) and (6.3.11) we get
s_ch[U_tr(·)] = {C_u - R[U_tr(·)]}/R[U_tr(·)].    (6.3.18)
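A small numeric illustration of the indicators (6.3.9)-(6.3.18) for a hypothetical train (all parameter values below are assumed):

```python
# Indicators of a concrete train with idle pauses (assumed example data).
delta_u = 1e-3                              # slot duration of the local channel [s]
C_u = 1 / delta_u                           # channel capacity (6.3.1), slots per second
blocks = [400, 250, 600]                    # binary pieces carried by each block
pauses = [300, 500, 150]                    # idle slots following each block
D = (sum(blocks) + sum(pauses)) * delta_u   # duration of the train
V_sb = D / delta_u                          # structure-blind volume (6.3.9), (6.3.2)
V_mi = sum(blocks)                          # minimum volume (6.3.12)
R = V_mi / D                                # rate of delivery of pure information (6.3.4)
r_ch = V_mi / V_sb                          # resources utilization indicator (6.3.15)
s_ch = (C_u - R) / R                        # resources surplus indicator (6.3.18)
print(r_ch, s_ch)                           # 0.568... and 0.76; r_ch equals the duty ratio
```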
From (6.3.15) it follows that R[U_tr(·)] has the meaning of the minimum channel capacity if lossless processing is to be possible.
6.3.3 THE COMPRESSION INDICATOR
Till this point we assumed that the local channel feeds the primary time-structured process directly into the fundamental channel. We assume now that before feeding the train U_tr(·) into the fundamental channel we compress its volume by buffering. A general description of buffering was given in Section 2.6.5, and in Section 3.3.1 it is formally described. To ensure stable operation of the system when several trains are processed, we assume that the durations of the primary and of the compressed time-structured train are almost equal (up to the transient states at the beginning and the end of the time-structured primary train U_tr(·)). Thus, we assume
D[V_tr(·)] = D[U_tr(·)],    (6.3.19)
where V_tr(·) = T_bu[U_tr(·)] is the transformed train and T_bu(·) denotes the transformation performed by the buffering system. A typical example of such a transformation is described in Section 3.3.1, in particular by relationship (3.3.5). The condition that buffering is a reversible transformation implies that all arriving binary pieces of information are carried by pulses of the transformed train. Thus,
N_tot[V_tr(·)] = N_tot[U_tr(·)].    (6.3.20)
Since we make the same assumptions about the fundamental channel as about the primary one, we can use all the previously derived equations for the transformed train, only substituting U_tr → V_tr. In particular, from definition (6.3.4) we get the rate of delivery of pure information by the secondary channel:
R[V_tr(·)] = N_tot[V_tr(·)]/D[V_tr(·)].    (6.3.21)
Comparing this with the primary definition (6.3.4) and taking into account assumptions (6.3.19) and (6.3.20), we see that
R[V_tr(·)] = R[U_tr(·)].    (6.3.22)
Substituting U_tr → V_tr in (6.3.15) and using (6.3.20), we get the indicator of utilization of the capacity of the fundamental channel carrying the compressed train:
r_ch[V_tr(·)] = R[U_tr(·)]/C_v.    (6.3.23)
For a train the counterpart of the compression index defined by (6.1.30) is the ratio
β[U_tr(·)] = V_sb[U_tr(·)]/V_sb[V_tr(·)].    (6.3.24)
From (6.3.15) and (6.3.13) it follows that, similarly to (6.1.31), we have
β[U_tr(·)] = r_ch[V_tr(·)]/r_ch[U_tr(·)].    (6.3.25)
From (6.3.15) and (6.3.22) we get
β[U_tr(·)] = C_u/C_v.    (6.3.26)
Thus,
Buffering allows a lossless transmission of information through a fundamental channel with a smaller capacity than the capacity of a structure-blind channel delivering the primary train with idle pauses.    (6.3.27)
However, we cannot arbitrarily decrease the capacity of the fundamental channel, because from (6.3.5) it follows that it must be
C_v ≥ R[U_tr(·)].    (6.3.28)
From (6.3.22) and (6.3.26) it follows that the maximum compression ratio is
β_max[U_tr(·)] = C_u/R[U_tr(·)].    (6.3.29)
This maximum compression ratio is achieved when C_v = R[U_tr(·)]. Then r_ch[V_tr(·)] = 1, and the transformed train V_tr(·) is fully compressed and has no pauses. From the description of the buffering system in Section 3.3.1 it follows that this is possible only when at least one block is always waiting in the buffer. We show in Section 6.4.1 that to achieve such a compression on average the capacity of the buffer memory has to be infinite, and the average delay introduced by buffering is infinite too. Therefore, lossless information-compressing systems can achieve only a compression ratio that is smaller than β_max[U_tr(·)].
6.4 COMPRESSION OF INFORMATION EXHIBITING STATISTICAL REGULARITIES
In Section 6.2.1 we concluded (see (6.2.23)) that Huffman coding using probabilities minimizes the statistical average length of the compressed train. The reason for the improvement is that the statistical regularities impose statistical constraints on the elements of structured information, and the statistically optimal Huffman algorithm utilizes those regularities optimally. Conclusion (6.2.23) can also be derived by using the relationships between the frequencies of occurrences and the probabilities and the resulting relationships between arithmetical and statistical averages.
In Comment 2, page 266, we noticed that if we consider a single block, the Huffman transformation is not optimal at all; it can even expand the lengths of some blocks. We profit from the statistically optimal Huffman coding only if we process a sufficiently long train of blocks. Our considerations in Section 4.6 permit us to look at this effect from a broader perspective. Taking a statistical average as the performance indicator and using a transformation optimal in the sense of such a criterion, we can expect the performance of the system to improve only if we perform the transformation sufficiently many times, so that the statistical regularities can manifest themselves. This effect has a general character and occurs for all information transformations optimized in the sense of an indicator that is the statistical average of a primary indicator of performance in a concrete situation. The analysis of the specific character of statistical indicators of the resources needed to process information and of the peculiarities of systems optimal in the sense of statistical criteria is the central topic of this section. The general considerations in this section are illustrated with specific examples in the next section.
6.4.1 THE BASIC CONCEPTS
From the introductory remarks it follows that statistical indicators of volume are meaningful if we consider processing of information that is a train of structured components. Therefore, we consider here a model similar to that of Section 6.2.1, and we assume that the information is a train of blocks
U_tr = {u(i), i = 1, 2, ..., I}.    (6.4.1)
We also make assumptions A2 and A3, page 261 (the set of potential forms of each block is the same, and the potential forms are known). From the description of the Huffman algorithm in Section 6.2.1 it is evident that if we take into account the statistical regularities of the blocks, it may be convenient to use blocks of various lengths. Therefore, we generalize assumption A1 and assume
A1'. The potential forms of a block may have various lengths; thus, the potential form of a block is
u_l = {u_l(n), n = 1, 2, ..., N(l)}, l = 1, 2, ..., L,    (6.4.2)
where the u_l(n), ∀l, n, are binary pieces of information. From the definition (6.1.4) of the operator N it follows that
N(u_l) = N(l).    (6.4.3)
Next we assume
A4'. The elements of blocks exhibit joint statistical regularities; thus, the ith block can be considered as a realization of the discrete random variable U(i), i = 1, 2, ..., I, and the train as a realization of the multidimensional random variable (process) U_tr = {U(i), i = 1, 2, ..., I};
A5'. All random variables U(i), i = 1, 2, ..., I, have the same statistical properties and are statistically independent.
In view of A5', the random variables are described by the probabilities
P_l = P[U(i) = u_l], l = 1, 2, ..., L.    (6.4.4)
As in Section 6.2.1 (definition (6.2.2)) we take the length of a train as the indicator of the resources needed to process the train. However, our argument could be applied to any other definition of volume that (similarly to (6.2.2)) can be represented as the sum of the volumes of the component blocks. We denote
N_tr = N(U_tr), N(i) = N[u(i)];    (6.4.5)
then
N_tr = Σ_{i=1}^{I} N(i).    (6.4.6)
In view of assumptions A4' and A5' we can consider the lengths of the train and of the blocks as realizations of the random variables N_tr = N(U_tr) (respectively, N(i) = N[U(i)]). From (6.4.6) it follows that
N_tr = Σ_{i=1}^{I} N(i),    (6.4.7)
and from A5' it follows that the variables N(i), ∀i, are statistically independent. This implies that
N̄_tr = Σ_{i=1}^{I} E N(i) = I N̄(1),    (6.4.8a)
σ²(N_tr) = Σ_{i=1}^{I} σ²[N(i)] = I σ²(1),    (6.4.8b)
where
N̄_tr = E N_tr, N̄(1) = E N(1), σ²(1) = E[N(1) - E N(1)]².    (6.4.8c)
We emphasized earlier that an indicator of the resources needed to process any potential train is of primary importance. To define such an indicator we could invoke our argumentation on pages 255 to 258. Since we assume that the blocks exhibit statistical regularities, we may take a rough description of the random variable N_tr. In particular, it seems natural to take the statistical average N̄_tr as an indicator of the resources needed to process any train. However, such an indicator has some peculiarities making it different from the indicators considered in Section 6.1. We now discuss in more detail the specific features of the statistical average used as a performance indicator.
6.4.2 THE EFFECT OF OVERFLOW
The probability that the length of a concrete train will surpass the average N̄_tr is usually large. If the probability of fluctuation around the average is symmetrical, which usually happens, this probability is 0.5. Thus, if we reserved processing resources corresponding to the average N̄_tr, then with a high probability we would not have enough resources to process the train; for example, the available channel or storage capacity would be too small. We say that in such a situation an overflow occurs.
The natural counteraction is to reserve bigger resources, capable of processing correctly trains of length N̄_tr + Δ, where Δ is a safety margin. Then the probability of overflow is
P_ov = P(N_tr > N̄_tr + Δ).    (6.4.9)
To get some insight into this probability we introduce the normalized variable
n_tr = (N_tr - N̄_tr)/σ(N_tr).    (6.4.10)
Obviously,
σ(n_tr) = 1.    (6.4.11)
From (6.4.8b) we get
n_tr = (N_tr - N̄_tr)/(√I σ(1)).    (6.4.12)
From (6.4.12) it follows that the event N_tr > N̄_tr + Δ is equivalent to the event
n_tr > Δ/(√I σ(1)).    (6.4.13a)
We write this in the form
n_tr > √I δ/σ(1),    (6.4.13b)
where
δ = Δ/I    (6.4.14)
has the meaning of the safety margin per one block. Thus,
P_ov = P(n_tr > √I δ/σ(1)).    (6.4.15)
We consider the safety margin per block δ as fixed. Then the normalized overflow threshold grows when the number I of blocks the train consists of grows. However, in view of (6.4.11), the variance of the normalized length of the train n_tr is always 1. Thus, the threshold moves further away from the mean value, and the probability of surpassing it decreases, as illustrated in Figure 6.8.
Figure 6.8. Illustration of the effect of reducing the overflow probability by increasing the number I of blocks the train consists of; √I δ/σ(1) - margin of resources over the average resources per one block; p_c(n) - the continuous envelope of the distribution defined by (4.2.9); n_tr is considered as a continuous variable.
From our observations it follows that
Increasing the number I of blocks the train consists of but keeping the safety margin δ per one block fixed, we can make the overflow probability P_ov as small as we desire.    (6.4.16)
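Conclusion (6.4.16) can be checked by a simple Monte Carlo experiment; in the sketch below the block-length distribution and the safety margin per block are assumed values.

```python
import random

# Assumed statistics: a block has 1, 3, 4 or 5 binary pieces with the given
# probabilities; the safety margin per block is fixed at delta = 0.3.
lengths, probs = [1, 3, 4, 5], [0.5, 0.2, 0.2, 0.1]
mean = sum(n * p for n, p in zip(lengths, probs))    # E N(1) = 2.4
delta = 0.3                                          # safety margin per block

def overflow_probability(I, trials=5000):
    """Estimate P_ov = P(N_tr > I*(mean + delta)) for a train of I blocks."""
    threshold = I * (mean + delta)
    hits = 0
    for _ in range(trials):
        n_tr = sum(random.choices(lengths, probs, k=I))   # realization of N_tr
        hits += n_tr > threshold
    return hits / trials

random.seed(1)
for I in (1, 10, 100, 1000):
    print(I, overflow_probability(I))   # P_ov shrinks as I grows, per (6.4.16)
```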
COMMENT 1
To draw advantage from the statistical regularities of the information train, the number I of blocks the train consists of must be large, and we must process the train as a whole. Only then can the statistical regularities manifest themselves, and a random fluctuation of a block length above the average can be compensated by another random fluctuation below the average. This happens, however, only with the probability of no overflow, which is 1 - P_ov. Although we can make this probability as close to 1 as we wish, it is never exactly 1. This means that in practice a rejection can occur. To alleviate its consequences some protective mechanism may be included. It is typical to send the rejected train back to the subsystem that generated it and to shift onto this subsystem the responsibility of trying to process the rejected information train again.
COMMENT 2
The average value E N has the meaning of the value around which the observations of a random variable N fluctuate, while the standard deviation σ(N) has the meaning of the average deviation around the mean value (see Section 4.4.3, page 192). Behind our basic conclusion (6.4.16) are this interpretation and equations (6.4.8) and (6.4.9). It follows from them that the average value N̄_tr of the needed resources grows linearly with I, while its fluctuation around the average grows only with √I. The reason is that when we add several independent random variables, their up and down fluctuations around the mean value partially compensate, and with growing I the relative deviation, related to the mean value, decreases to zero at the rate 1/√I. Consequently, the random length of the train can be approximated more and more accurately by a constant, the average.
This effect can also be explained in terms of the fundamental property of long trains discussed in Section 4.6. The resources needed to process a concrete train are given by (6.1.8). From (4.6.11) it follows that for sufficiently long typical trains the frequencies of occurrences of the potential forms of blocks P*(u_l) are close to their probabilities P_l given by (4.6.10a). But then the sum on the right side of formula (6.2.7) is close to the statistical average N̄_tr. Thus, we can expect that for a sufficiently long train, with probability close to 1, the length of the train is almost constant and equal to the statistical average. Therefore, to process a typical train we need resources of the order of magnitude of the statistical average N̄_tr.
In most cases the probability distribution of the components of information is not uniform and/or statistical relationships between the components exist. From conclusions (5.1.9) and (5.1.18) it follows that then the entropy per element H, defined by (4.6.18), is smaller than the maximal entropy. From (4.6.15) and its generalization formulated on page 204 it follows next that the number of typical trains is smaller than the number of all potential trains. The difference may be large if the nonuniformity of the distribution or the statistical relationships are "strong". Then the resources required to process a typical train are substantially smaller than the resources needed to process an arbitrary train. However, in general, the probability of occurrence of an untypical train, although small, is not zero. Therefore, if we try to profit from the statistical regularities and reserve only resources of the order of magnitude of the statistical average, inevitably situations will occur in which not enough resources are available for processing an untypical train, and an overflow occurs.
COMMENT 3
Processing a train consisting of blocks of various lengths, and thus requiring various processing resources, can be interpreted as a special case of the problem of resource sharing. The simple approach is to divide the available resources into chunks and to assign a chunk separately to each user. We call this fixed resource assignment. Since the resources needed to process a transformed train depend on its length, the needed resources may differ from block to block. To ensure correct processing of every block we would have to reserve the resources needed to process the most demanding (longest) block. Then, however, for most blocks the reserved resources would not be used, and the utilization of the whole pool of resources would become the more inefficient the larger the fluctuation of the needed resources around the average value. The only way to save resources would be not to process exactly the most demanding blocks and thus to tolerate overflow.
The other possibility is not to split the resources but to assign to a block as many resources as are really needed. Such a procedure is called flexible resource sharing. We achieve it by transmitting or storing the train as a whole. Processing the train as a whole we do not avoid the difficulties with processing blocks of various lengths: to process every block exactly we would have to reserve the resources needed to process the longest block, which would be the same amount as in the case of fixed assignment of resources sufficient to process exactly every block separately. The basic difference between fixed and flexible resource sharing lies in the different statistical properties of the length of a single block and of a train of blocks. Our argument leading to the final conclusion (6.4.16) has quite a general character: the relative fluctuation of a sum of random variables around its mean value, compared with the relative fluctuations of the components of the sum, becomes smaller as the sum has more components. The consequence is that for a given probability of overflow the required total surplus of resources over the average of the needed resources becomes smaller when the number of components grows. This effect is called in resource-sharing theory the economy-of-scale effect.
6.4.3 THE INDICATORS OF STATISTICAL COMPRESSION
From our previous considerations it follows that, with the discussed restrictions, the average
V̄(U_tr) = E v(U_tr)    (6.4.17)
is a suitable indicator of the resources needed to process any train. We call it the statistical volume of trains. Since we do not specify whether the resources are used efficiently, the statistical volume is a counterpart of the structure-blind volume defined by (6.1.20). To calculate the statistical volume of trains we must take into account the statistical regularities. The counterpart of the minimum volume introduced in Section 6.1.1 is the minimum statistical volume
V̄_mi(U_tr) = min_{T_uv(·)} V̄[T_uv(U_tr)],    (6.4.18)
where min is the operation of searching for the minimum value over the class of all reversible transformations T_uv(·).
The statistical counterparts of the resources utilization indicator defined by (6.1.22) and of the resources surplus indicator defined by (6.1.24) are the indicator of statistical resources utilization
R̄_ch(U_tr) = V̄_mi(U_tr)/V̄_sb(U_tr)    (6.4.19)
and the indicator of statistical resources surplus
S̄_ch(U_tr) = [V̄_sb(U_tr) - V̄_mi(U_tr)]/V̄_mi(U_tr).    (6.4.20)
The structure-blind volume V̄_sb(U_tr) occurring in these definitions is often deterministic and has the meaning of the total available capacity of the structure-blind fundamental information system. From the definitions it follows that
S̄_ch(U_tr) = [1/R̄_ch(U_tr)] - 1.    (6.4.21)
Generalizing (6.1.30), we define the efficiency of volume transformation of information exhibiting statistical regularities (briefly, the efficiency of statistical volume transformation):
β̄[T(·)] = V̄_sb(U_tr)/V̄(V_tr),    (6.4.22)
where
V_tr = T(U_tr).    (6.4.23)
If β̄[T(·)] > 1, we say that the transformation T(·) compresses statistically the volume of information. Similarly to (6.1.32), it is
β̄[T(·)] ≤ β̄_max(U_tr),    (6.4.24)
where the maximum compression ratio is
β̄_max(U_tr) = V̄_sb(U_tr)/V̄_mi(U_tr) = 1/R̄_ch(U_tr).    (6.4.25)
Concrete examples of those concepts are given in the next section.
COMMENT 1
If the primary information exhibits statistical regularities and deterministic relationships exist between the components of the information, then those relationships manifest themselves in the probabilities of occurrences of the components. Thus, the deterministic relationships are "automatically" taken into account when we minimize the statistical volume. Therefore, the minimal statistical volume based on the exact description of the statistical properties of information is not larger than the minimal volume taking into account only deterministic constraints. The difference V_mi(U_tr) - V̄_mi(U_tr) is an indicator of the gains that we may achieve when we fully utilize the exact description of the statistical properties of information. These conditions are essential: if we utilized only a rough description of the statistical properties of information or applied a statistically non-optimal transformation, the statistical volume might not be smaller than the minimal volume taking into account only structural constraints.
COMMENT 2
In our presentation of the statistical approach to the concept of the volume of information and of information compression we emphasize analogies with the nonstatistical approach of the previous section. We do so because, in spite of its formal elegance, the application of the statistical approach is often not justified in practice. The reason is that only in a few cases are the statistical properties of information known and usable, and the existence of statistical regularities is usually not tested. We elaborate here on the statistical approach not only because it is applicable in some cases but also because it is the only indeterministic model of information that allows us to derive a great body of closed-form relationships giving insight into general properties of information.
6.5 EXAMPLES OF COMPRESSION OF INFORMATION EXHIBITING STATISTICAL REGULARITIES
This section has two purposes. It describes and analyzes typical (and most important for applications) lossless compression systems utilizing statistical regularities of structured information, and it uses those systems to illustrate the general considerations of the previous section.
6.5.1 COMPRESSION OF TRAINS OF BLOCKS SEPARATED BY IDLE PAUSES
We introduce the statistical model of the processes in the buffering system described in Section 3.3.1, calculate the statistical performance indicators discussed in Section 6.4.3 that are based on the indicators introduced in Section 6.3, and derive the relationships between the statistical indicators.
The statistical properties of the primary train are described by the statistical properties of the starting instants τ(i) of the blocks and of the durations D(i) of the corresponding processed ith blocks u[i, (·)] (see Section 6.3.1). We assume that the train is a Poisson-exponential train (described in Section 5.2.1). Such a train is described by two parameters: the intensity λ_u of the process of the births of blocks and the intensity μ_u of the ending of a block. By substituting the probability density (5.2.8) in definition (4.4.10) we get the average duration of a primary block (delivered by the local channel):
D̄_bu = 1/μ_u.    (6.5.1)
The process V_tr(t) generated by the buffer and put into the fundamental channel (see Figure 6.7) can also be considered as a birth-death process. We denote by λ_v and μ_v the birth and death rates of the secondary blocks. We again make assumption A7 from page 262. Thus, we assume that
Buffering is a reversible transformation.    (6.5.2)
To satisfy this assumption, the buffer capacity C_buf (denoted Q in Section 3.4.1) must be infinitely large. Thus,
C_buf → ∞.    (6.5.3)
Based on this assumption, every primary block, after some waiting in the buffer, will be put into the fundamental channel. Therefore,
λ_v = λ_u.    (6.5.4)
The duration of a secondary block is proportional to the duration of the primary block. Therefore, the duration and the death rate of the transformed block are related by a relationship similar to (6.5.1). Thus, we have
D̄_bv = 1/μ_v.    (6.5.5)
From (6.3.26) it follows that
D̄_bv = (C_u/C_v) D̄_bu.    (6.5.6)
From conclusion (6.3.27) it follows that to achieve compression the capacity of the fundamental channel must be smaller than that of the local channel. Therefore, we consider the capacity C_v of the fundamental channel, and in view of (6.5.6) the duration D̄_bv, as parameters. To avoid complications with the description of the process of putting the blocks delivered by the local channel into the buffering system shown in Figure 6.7a, we assume that the average duration of the arriving blocks is negligibly small, so that the time of transferring a block into the buffering system can be neglected. Often the capacity of the local channel is large; then such an assumption is satisfied. However, we assume that the average duration of a block in the fundamental channel is large in the sense that
D̄_bv >> Δ_v.    (6.5.7)
Then we may assume that a block stored in the buffer is transferred to the server immediately after the processing of the previous block is completed.
To derive the relationships between the parameters λ_v, μ_v and the statistical parameters characterizing information compression, we consider a long time interval of duration T. During this time interval on average λ_v T primary blocks arrive. Each block contains on average D̄_bv/Δ_v = D̄_bv C_v binary pieces of information. On the assumption (6.5.2) all blocks are put into the fundamental channel. Therefore,
N̄_tot(T) = λ_v T D̄_bv C_v    (6.5.8)
is the average total number of binary pieces of information that arrive in time T at the fundamental channel. Averaging definition (6.3.4) and taking V_tr instead of U_tr, we define
R̄_v = N̄_tot(T)/T    (6.5.9)
and call it the average rate of delivery of "pure" information into the fundamental channel. From (6.5.8) we get
R̄_v = λ_v D̄_bv C_v.    (6.5.10)
Substituting in definition (6.4.19) T C_v for V̄_sb(U_tr) and R̄_v T for V̄_mi(U_tr), we obtain the indicator of statistical utilization of the fundamental channel's capacity
R̄_ch = R̄_v/C_v.    (6.5.11)
From (6.5.10) and (6.5.6) it follows that
R̄_ch = ρ_v,    (6.5.12)
where
ρ_v = λ_v/μ_v.    (6.5.13)
Let us denote by S̄_ch the statistical surplus of resources (capacity) in the fundamental channel given by (6.4.21). Substituting (6.5.11) in this equation we get
S̄_ch = (C_v/R̄_v) - 1.    (6.5.14)
For the superior system the delay caused by buffering is important. Therefore, as an indicator of the quality of the buffering system we take the statistical average
$\bar D_{buf}=\mathrm{E}\,D_{buf}\{u[i,(\cdot)]\}$, (6.5.15)
where $D_{buf}\{u[i,(\cdot)]\}$ is the delay between the instant of arrival of a primary block $u[i,(\cdot)]$ and the instant of the start of the corresponding transformed block $v[i,(\cdot)]$. It is evident that the average delay caused by buffering is related to the number of blocks waiting in the buffer. Therefore, to calculate the average delay we look at the properties of the number of blocks stored in the buffer. We denote by
$N_{buf}$ the number of blocks stored in the buffer,
$N_{sys}$ the number of blocks in the system (the blocks stored in the buffer plus the block being fed into the channel, i.e., processed by the server), so that $N_{sys}=N_{buf}+1$,
$\mathbb N_{buf}$, $\mathbb N_{sys}$ the corresponding random variables,
$C_{buf}$ the capacity of the buffer (the maximal number of blocks that can be stored in the buffer).
Using the state transition diagram shown in Figure 3.13, we can calculate the probability distribution of $\mathbb N_{sys}$ and its average, and we obtain (see, e.g., Seidler [6.11], Gallager [6.12]):
$\bar N_{sys}=\dfrac{\rho_v}{1-\rho_v}$. (6.5.16)
The theorem of Little (see Seidler [6.11], Gallager [6.12]) establishes, on very general assumptions, a relationship between the average time $\bar D_{sys}$ spent by a block in the information-compressing system, the intensity $\lambda_v$ of arrivals of blocks, and the average number $\bar N_{sys}$ of blocks in the system:
$\lambda_v\bar D_{sys}=\bar N_{sys}$. (6.5.17)
The average duration of the stay of a block in the buffer is
$\bar D_{buf}=\bar D_{sys}-D_{bv}$. (6.5.18)
Substituting (6.5.16) and (6.5.17) in this equation and using (6.5.13) we get
$\bar D_{buf}=D_{bv}\dfrac{\rho_v}{1-\rho_v}$. (6.5.19)
Using (6.5.12) we write (6.5.19) in the form
$\bar D_{buf}=D_{bv}\dfrac{R_{ch}}{1-R_{ch}}$, (6.5.20)
or, using the statistical surplus of capacity given by (6.5.14), in the form
$\bar D_{buf}=D_{bv}\dfrac{1}{S_{ch}}$. (6.5.21)
Using (6.5.16), (6.5.11), and (6.5.12) we express the indicator of statistical resources utilization $R_{ch}$ in terms of the average number of blocks staying in the system:
$R_{ch}=\dfrac{\bar N_{sys}}{1+\bar N_{sys}}$. (6.5.22)
Equations (6.5.19) to (6.5.22) provide the relationships we were looking for. The diagrams of these relationships are shown in Figure 6.9.
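The relationships (6.5.16) to (6.5.22) are easy to evaluate numerically. The sketch below is a minimal illustration, assuming the M/M/1-type model used above (Poisson arrivals, exponentially distributed block durations); the function and variable names are chosen only for this illustration.

```python
# A minimal numerical sketch of relationships (6.5.16)-(6.5.22), assuming the
# M/M/1-type model used in the text: Poisson arrivals of intensity lam and
# block durations with mean D_bv, so that the service rate is mu = 1/D_bv.

def buffering_indicators(lam, D_bv):
    rho = lam * D_bv                  # rho_v = lam_v / mu_v, eq. (6.5.13)
    if rho >= 1.0:
        raise ValueError("the queue is unstable for rho_v >= 1")
    N_sys = rho / (1.0 - rho)         # average number of blocks in the system, (6.5.16)
    D_sys = N_sys / lam               # Little's theorem, (6.5.17)
    D_buf = D_sys - D_bv              # average stay in the buffer, (6.5.18)
    R_ch = rho                        # channel capacity utilization, (6.5.12)
    S_ch = 1.0 / rho - 1.0            # statistical surplus of capacity, (6.5.14)
    return D_buf, R_ch, S_ch, N_sys

if __name__ == "__main__":
    D_bv = 1.0                        # normalize delays by the block duration
    for R_ch_target in (0.2, 0.5, 0.8, 0.95):
        lam = R_ch_target / D_bv
        D_buf, R_ch, S_ch, N_sys = buffering_indicators(lam, D_bv)
        print(f"R_ch={R_ch:4.2f}  D_buf/D_bv={D_buf/D_bv:7.2f}  "
              f"S_ch={S_ch:5.2f}  N_sys={N_sys:6.2f}")
```

Running the sketch reproduces the trend of Figure 6.9a: the normalized delay grows without bound as the utilization indicator approaches 1.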
Figure 6.9. The relationships between the statistical indicators of performance of real-time compression by buffering; a, b, c: the buffer capacity $C_{buf}=\infty$; d: finite buffer capacity. (a) Dependence (6.5.20) of the normalized time of stay $\bar D_{buf}/D_{bv}$ of a block in the buffer on the channel capacity utilization indicator $R_{ch}$; (b) dependence (6.5.21) of $\bar D_{buf}/D_{bv}$ on the statistical capacity surplus $S_{ch}$; (c) dependence (6.5.22) of the channel capacity utilization indicator $R_{ch}$ on the average number $\bar N_{sys}$ of blocks in the system; (d) dependence of the normalized time of stay $\bar D_{buf}/D_{bv}$ (continuous lines) and of the channel capacity utilization indicator $R_{ch}$ (dashed lines) on the normalized intensity $R_\lambda$ (given by (6.5.27)) of blocks delivered by the local channel and on the buffer capacity $C_{buf}$ (based on Seidler [6.11]).
On assumption (6.5.2) (equivalently, assumption $C_{buf}\to\infty$) every block delivered by the local channel goes to the buffer and after some delay is fed into the fundamental channel. For finite $C_{buf}$ it is possible that a new block arrives when the buffer is already full. Then overflow occurs and (6.5.4) does not hold. Let us denote by $P_{ov}$ the probability of overflow. It can be shown (see, e.g., Seidler [6.11, ch. 7]) that
$P_{ov}=\dfrac{(1-\rho_v)\,\rho_v^{\,C_{buf}+1}}{1-\rho_v^{\,C_{buf}+2}}$. (6.5.23)
The intensity of overflow (of blocks delivered by the local channel that cannot be admitted into the buffer) is
$\lambda_{ov}=P_{ov}\lambda_u$. (6.5.24)
The blocks admitted into the buffer are, after some delay, fed into the fundamental channel. Thus,
$\lambda_v=(1-P_{ov})\lambda_u$. (6.5.25)
Taking this intensity we obtain from (6.5.10) and (6.5.11) the indicator of utilization of the capacity of the fundamental channel
$R_{ch}=\bar R_v/C=(1-P_{ov})\lambda_u D_{bv}$. (6.5.26)
As the normalized intensity of blocks delivered by the local channel we take
$R_\lambda=\lambda_u D_{bv}$. (6.5.27)
This parameter can also be interpreted as an indicator of the hypothetical utilization of the fundamental channel if all primary blocks were fed into the channel. The dependence of the normalized time of stay $\bar D_{buf}/D_{bv}$ and of the fundamental channel utilization on the normalized intensity of blocks delivered by the local channel is shown in Figure 6.9d.
COMMENT 1
Figure 6.9a shows that the delay introduced by buffering is the price that we pay for compression of the primary train, which, in turn, improves the utilization of the capacity of the fundamental channel. Figure 6.9b illustrates the same effect, but in terms of the surplus of the channel capacity over the minimum capacity, that is, $\bar R_v$. When the channel utilization approaches 1 or, equivalently, the capacity surplus goes to 0, the delay introduced by buffering grows without bound. The reason for this effect is explained by Figure 6.9c. We can avoid idle pauses in the train fed into the fundamental channel only if a reserve of blocks is available, so that a block can be put into the fundamental channel as soon as the channel becomes idle. Therefore, the increase of statistical channel capacity utilization is inherently coupled with the increase of the average number of blocks waiting in the buffer. The ultimate reason for the improvement is that, having on average enough blocks in the buffer, we create a situation in which the statistical regularities can manifest themselves.
COMMENT 2
The diagrams in Figure 6.9d show that in a real system overflow can occur. However, in a wide range of system parameters the overflow probability is small. To make the probability of overflow still smaller we can add a partner information subsystem, described in Section 2.2, page 93, that allows a copy of a block not admitted to the buffering system to be delivered through the local channel. The diagrams in Figure 6.9d also illustrate our remarks on page 34 about effects that do not exist in the real world but are only a result of simplifying assumptions. The infinite delay can arise only in a non-existing system with infinite storage capacity. In a real system the delay is always finite. The effect of overflow is not only a nuisance. The diagrams in Figure 6.9d show that decreasing the capacity of the buffer memory slows the growth of the delay when the utilization of the channel capacity approaches 1. This is used to alleviate congestion effects (see, e.g., Seidler [6.11], Gallager [6.12]). To illustrate the effects of statistical real-time compression we used a very simple queuing system. The analysis of several other, more complicated queuing systems (see the classic book of Kleinrock [6.13]) provides more such examples.
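The finite-buffer behaviour sketched in Figure 6.9d can be illustrated numerically as follows. The sketch assumes the classical finite-buffer (M/M/1/K-type) blocking formula with $C_{buf}$ waiting places plus one block in service; the exact form of (6.5.23) used in the book may differ in such details, so the snippet should be read as an illustration of the trend rather than as a reproduction of the book's formula.

```python
# A sketch of the finite-buffer case of Section 6.5.1, assuming the classical
# M/M/1/K blocking probability for a system holding at most C_buf waiting
# blocks plus one block in service.

def finite_buffer(lam_u, D_bv, C_buf):
    rho = lam_u * D_bv                          # normalized intensity R_lambda, (6.5.27)
    K = C_buf + 1                               # positions in the whole system
    if abs(rho - 1.0) < 1e-12:
        P_ov = 1.0 / (K + 1)                    # limiting value of the formula at rho = 1
    else:
        P_ov = (1.0 - rho) * rho**K / (1.0 - rho**(K + 1))   # overflow probability
    lam_v = (1.0 - P_ov) * lam_u                # admitted intensity, (6.5.25)
    R_ch = lam_v * D_bv                         # channel utilization, (6.5.26)
    return P_ov, R_ch

if __name__ == "__main__":
    for C_buf in (3, 10):
        for R_lam in (0.6, 0.8, 1.0, 1.2):
            P_ov, R_ch = finite_buffer(R_lam, 1.0, C_buf)
            print(f"C_buf={C_buf:2d}  R_lam={R_lam:3.1f}  "
                  f"P_ov={P_ov:6.4f}  R_ch={R_ch:5.3f}")
```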
6.5.2 THE MINIMUM STATISTICAL VOLUME
It has been shown in Section 6.1.1, page 267, that the statistical Huffman algorithm is optimal in the sense that it minimizes the statistical volume of the compressed train, defined as the average length of the train. However, Huffman compression is not a universal solution of the statistical optimization problem. The Huffman algorithm is a block-oriented algorithm, while in many cases the primary information is a long train of components. Huffman coding of a long train as a whole would be very costly. Therefore, when implementation criteria are taken into account, other compression procedures may be preferred. An example of a non-optimal compression of a train is segmentation and separate Huffman compression of the blocks, described in Section 6.2.1. Another example is arithmetic coding, described in Section 6.2.2. When non-optimal compression is considered, universal estimates of the minimum statistical volume $V_{mi}(\mathbb U_{tr})$ defined by (6.4.18) are very useful for a preliminary performance assessment. Using the fundamental property of long trains discussed in Section 4.6 we derive here a universal formula for the minimal statistical volume of information, valid when the block size $N\to\infty$. As an example of the application of this estimate to the analysis of the performance of a non-optimal compression, we consider in the next section the dependence of the compression coefficient for separate Huffman compression of segments of a Markov train on the size of the segments.
From Section 4.6, particularly from conclusions (4.6.10) and (4.6.11), it follows that for large $N$ the set $U$ of all potential forms of a block can be represented as the set-sum
$U=U_{ty}\cup U_{nty}$ (6.5.28)
of the typical and the nontypical blocks. The number of typical blocks is given by the generalization of formula (4.6.15) discussed on page 204. In the notation now used, this generalization takes the form
$L(U_{ty})\approx 2^{NH_1(\mathbb U,N)}$, (6.5.29)
where
$H_1(\mathbb U,N)=H(\mathbb U)/N=H[\mathbb u(1),\mathbb u(2),\dots,\mathbb u(N)]/N$ (6.5.30)
is the entropy per one element of the multidimensional random variable $\mathbb U=\{\mathbb u(n),\ n=1,2,\dots,N\}$ representing a primary block. Conclusion (4.6.10) holds also in the general case; thus the probabilities of the typical blocks are almost the same. Therefore, using the code (1.5.8) to transform a typical block $u\in U_{ty}$ into a block $v$ of fixed length $N_{ty}$, we minimize the average length of the transformed blocks. Equation (1.5.10) is the condition that the transformation is reversible and that the secondary blocks have minimal length. In the present notation this equation takes the form
$N_{ty}=\log L(U_{ty})$. (6.5.31)
Using (6.5.29) we get
$N_{ty}\approx NH_1(\mathbb U,N)$. (6.5.32)
If a primary block is nontypical ($u\in U_{nty}$) we leave it unchanged. Thus, the length of the transformed nontypical block is
$N_{nty}=N$. (6.5.33)
As has been indicated on page 263, to exploit statistical regularities for information compression the coded blocks must have different lengths. This condition is satisfied by the described transformation, since it produces blocks of length $N_{ty}$ or $N_{nty}$. This, however, requires auxiliary separating information. Since only two lengths of blocks are possible, binary length information (see Section 2.6) is sufficient. Thus, the average volume of the transformed block, defined as its average length (see (6.1.3)), is
$V_{st}(\mathbb V)=(N_{ty}+1)P(\mathbb U\in U_{ty})+(N_{nty}+1)P(\mathbb U\in U_{nty})$. (6.5.34)
Using (6.5.32), (6.5.33), and (4.6.1b) we get
$V_{st}(\mathbb V,N)=NH_1(\mathbb U,N)+o(N)$, (6.5.35)
where $o(N)$ is a function of $N$ such that $o(N)/N\to 0$. Our argumentation shows that this is almost the minimum statistical volume of the compressed train. This result can also be obtained if we use statistically optimal Huffman coding and notice that, similarly to the second case considered in Example 6.2.2, page 266, for typical trains the Huffman code words are blocks of almost equal length, given by (6.5.32). On quite general assumptions the limit
$H_1(\mathbb U,\infty)=\lim_{N\to\infty}H_1(\mathbb U,N)$ (6.5.36)
exists. Then from (6.5.35) it follows that
$\lim_{N\to\infty}V_{mi}(\mathbb V,N)/N=H_1(\mathbb U,\infty)$. (6.5.37)
We write this as an asymptotic relation
$V_{mi}(\mathbb U,N)\approx NH_1(\mathbb U,\infty)$. (6.5.38)
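The reasoning behind (6.5.29) to (6.5.35) can be checked by brute force for small $N$. The sketch below assumes, only for the illustration, a memoryless binary source with $P(1)=p$; it counts the typical blocks, their total probability, and the average volume of the two-length code described above.

```python
# A small brute-force illustration of (6.5.29)-(6.5.35) for a memoryless binary
# source with P(1)=p (an assumption made only for this demonstration; the text
# treats the general stationary case).
from itertools import product
from math import log2, ceil

def typical_set_demo(p=0.2, N=14, eps=0.2):
    H = -p*log2(p) - (1-p)*log2(1-p)      # entropy per element H_1
    count_ty, prob_ty = 0, 0.0
    for block in product((0, 1), repeat=N):
        ones = sum(block)
        P = p**ones * (1-p)**(N-ones)
        if abs(-log2(P)/N - H) <= eps:    # block belongs to the typical set U_ty
            count_ty += 1
            prob_ty += P
    N_ty = ceil(log2(count_ty))           # fixed length of coded typical blocks, (6.5.31)
    # two-length code (6.5.34): typical blocks get N_ty bits, others are left
    # unchanged, plus one bit of binary length information in both cases
    avg_volume = prob_ty*(N_ty + 1) + (1 - prob_ty)*(N + 1)
    print(f"H_1={H:.3f}, |U_ty|=2^{log2(count_ty):.2f} (2^(N*H_1)=2^{N*H:.2f})")
    print(f"P(U in U_ty)={prob_ty:.3f}, average volume per element={avg_volume/N:.3f}")

typical_set_demo()
```

For such a small $N$ the typical set does not yet carry most of the probability, which illustrates why (6.5.35) is only an asymptotic statement.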
With few exceptions, the calculation of the entropy $H_1(\mathbb U,\infty)$ in a closed form is prohibitively complicated. Yet equation (6.5.38) is of paramount importance. It allows the basic properties of the entropy to be used to formulate guidelines for estimating the efficiency of information-compressing procedures. The basic properties derived in Section 5.1.2 are as follows:
• The entropy is maximized when the probability distribution of the potential forms of the random object (scalar, block, train, etc.) is uniform (property (5.1.9));
• The joint entropy is maximized when the components are statistically independent (property (5.1.18)).
From these properties and the asymptotic relationship (6.5.38) it follows that:
Utilizing the deterministic and statistical relationships between the components of structured information, we can compress the volume of information the more, the larger is the difference
$H_{mg}(\mathbb U)-H(\mathbb U)$ (6.5.39)
between the sum of the marginal entropies $H_{mg}(\mathbb U)$ and the joint entropy $H(\mathbb U)$ of the structured information.
Even if the difference mentioned above is small, we still can achieve compression if the probabilities of the potential forms of the components are not the same. (6.5.40)
The counterpart of conclusion (6.5.39) applies when only structural constraints are taken into account, but conclusion (6.5.40) has no counterpart in this case.
6.5.3 THE CHOICE OF SIZE OF SEGMENTS COMPRESSED BY THE HUFFMAN ALGORITHM
We will now use the basic relationship (6.5.38) to gain some insight into the problem of choosing the length of the segments into which we split a primary train of elementary pieces of information in order to apply Huffman coding. We assume the following:
A1. The elements $u(n)$ of the block are binary; 0 and 1 are their potential forms.
A2. The multidimensional random variable $\mathbb U=\{\mathbb u(n),\ n=1,2,\dots,N\}$ representing a block is a segment of a stationary Markov train. The matrix of transition probabilities $P_{2|1}$ is given by (5.3.9), and the stationary marginal probabilities are given by (5.3.11).
A3. The segments are transformed separately by the optimal Huffman algorithm (6.2.17).
To calculate the statistical volume of the transformed train we have to evaluate the average length of the code words produced by the Huffman algorithm (6.2.17). To run this algorithm we need the probabilities of the potential forms of the primary block:
$P(\mathbb U=u_i)=P[\mathbb u(1)=u_{i}(1),\ \mathbb u(2)=u_{i}(2),\dots,\mathbb u(N)=u_{i}(N)]$, (6.5.41)
where $u_i=\{u_{i}(1),u_{i}(2),\dots,u_{i}(N)\}$ is a given primary block and $\mathbb u(n)$ are the random variables representing its binary components. We calculate these probabilities from equation (5.3.3). Next, we take the obtained probabilities in place of the frequencies of occurrences, run the Huffman algorithm (6.2.17), obtain the lengths $N_r(u_i)$ of the potential forms of the transformed blocks, and from (6.2.7) calculate the average length $\bar V(\mathbb V,N)=\mathrm{E}\,N_r(\mathbb U)$ of the transformed block. The statistical volume of the whole transformed train $V_{tr}=\{v(i),\ i=1,2,\dots,I\}$ is
$V_{st}(\mathbb V_{tr})=I\,\bar V(\mathbb V,N)$, (6.5.42)
where $\mathbb V$ is the random variable representing a transformed block (see Note 11). The indicator of statistical utilization of the resources of the fundamental subsystem (communication channel, storage device) given by formula (6.4.19) is
$R(N)=V_{mi}(\mathbb V_{tr})/V_{st}(\mathbb V_{tr})$. (6.5.43)
As an estimate of the minimum resources $V_{mi}(\mathbb V_{tr})$ needed to process the whole train $\mathbb U_{tr}$ without the restriction that the train has to be processed block-wise, we use equation (6.5.38), which takes the form
$V_{mi}(\mathbb V_{tr})=I N H_1(\mathbb U,\infty)$. (6.5.44)
Substituting (6.5.42) and (6.5.44) in (6.5.43) we get
$R(N)=NH_1(\mathbb U,\infty)/\bar V(\mathbb V,N)$. (6.5.45)
The needed entropy $H_1(\mathbb U,\infty)$ was calculated in Section 5.3 and is given by equation (5.3.15). The dependence of the statistical resources utilization indicator on the size $N$ of the segment is shown in Figure 6.10.
Figure 6.10. The dependence of the indicator $R(N)$ of statistical resources utilization (equation (6.5.45)) on the size $N$ of the blocks into which a primary Markov train is segmented before applying the Huffman algorithm; the description of the considered Markov process is given on page 230.
COMMENT 1
It has been shown in Section 5.3 that all elements $\mathbb u(1),\mathbb u(2),\dots,\mathbb u(N)$ of a block of a Markov process of rank 1 are statistically dependent. Those relationships are taken into account by the Huffman algorithm, which operates on blocks as wholes. Although optimal for transforming blocks of a given length, a Huffman transformation that transforms shorter blocks separately is blind to the statistical relationships existing within the larger block. Therefore, the resources utilization indicator grows with the length $N$ of the shorter block (segment).
COMMENT 2
From Figure 6.10 it follows that for $N$ of the order of magnitude of 10 the average volume of the transformed train is close to its asymptotic minimal value given by (6.5.44). The basic reason for the saturation of the indicator of utilization of statistical regularities is that for $N$ large (in the mentioned sense) the statistical properties of the Markov train already manifest themselves. The saturation of the statistical indicator of utilization of the channel capacity shown in Figure 6.9c has the same reason. Thus, although the real-time compression system considered in Section 6.5.1 is totally different from the system now considered, their basic properties are determined by the same general rule: the statistical regularities can substantially improve the performance of lossless information compression, on the condition, however, that they can manifest themselves, by processing as a whole a sufficiently large block of components of the structured information. To illustrate the effects of compression using statistical regularities, we considered a simple case and sketched only the derivations of the basic relationships. Their exact proofs and the analysis of compression of more complicated trains of information can be found, e.g., in Blahut [6.14] and Cover, Thomas [6.15].
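A possible way to reproduce the type of curve shown in Figure 6.10 is sketched below. The binary Markov chain and its transition probabilities are assumptions made only for the illustration (they are not the parameters from page 230), and huffman_avg_length and R_of_N are hypothetical helper names; the sketch codes whole blocks of length $N$ with a Huffman code built from the block probabilities (6.5.41) and evaluates $R(N)$ according to (6.5.45).

```python
# Utilization indicator R(N) for Huffman coding of segments of a binary
# first-order Markov train; transition probabilities are illustrative only.
import heapq
from itertools import product, count
from math import log2

def huffman_avg_length(probs):
    """Average codeword length of a binary Huffman code for the given probabilities."""
    tie = count()                                   # tie-breaker so heapq never compares dicts
    heap = [(p, next(tie), {i: 0}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0.0]*len(probs)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {k: v + 1 for k, v in {**d1, **d2}.items()}  # deepen both subtrees
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    for k, v in heap[0][2].items():
        lengths[k] = v
    return sum(p*l for p, l in zip(probs, lengths))

def R_of_N(p01, p10, N):
    pi1 = p01/(p01 + p10)                           # stationary P(u=1)
    pi = (1 - pi1, pi1)
    P = ((1 - p01, p01), (p10, 1 - p10))            # transition matrix
    h = lambda q: 0.0 if q in (0.0, 1.0) else -q*log2(q) - (1-q)*log2(1-q)
    H_rate = pi[0]*h(p01) + pi[1]*h(p10)            # entropy per element H_1(U, inf)
    probs = []
    for block in product((0, 1), repeat=N):         # block probabilities, cf. (6.5.41)
        pr = pi[block[0]]
        for a, b in zip(block, block[1:]):
            pr *= P[a][b]
        probs.append(pr)
    return N*H_rate / huffman_avg_length(probs)     # R(N), cf. (6.5.45)

for N in range(1, 9):
    print(N, round(R_of_N(0.1, 0.3, N), 3))
```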
6.6 TRANSFORMATIONS UTILIZING THE STRUCTURE OF CONTINUOUS INFORMATION TO COMPRESS ITS VOLUME
Hitherto we have assumed that information is discrete. We now consider transformations compressing continuous information. The problem of processing continuous information was addressed already in Section 1.4.3. Here we discuss the counterparts of the concepts introduced previously for discrete information. In particular, we look for an indicator of the resources needed to process continuous information. The essential steps in the definition of the volume of discrete information were (1) the choice of prototype information (pages 254 and 255) for which the definition of the resources needed to process it is plausible, and (2) the definition of prototype information that requires the same processing resources as the given structured information (page 256). We now show that in a similar way we can define the indicators characterizing the volume of continuous information. However, the counterparts of both steps are more complicated than for discrete information. The basic reasons for the difficulties have been explained in Section 1.4.3. The continuous models provide only an idealized description of real information systems. The advantage of continuous models is that many properties of continuous information and its transformations can be described by relatively simple equations. This allows broad insight into information processing. The disadvantage of continuous models is that they have some peculiarities that are not related to the real system but to the model. Such peculiarities may lead to conclusions that are technical or even physical paradoxes, in particular in two areas: (1) the analysis of the volume of continuous information and (2) the spectral representations of information functions. The inherent restrictions of continuous models in the first area are mentioned in this section, and in the second area in Section 7.4. We point out the typical difficulties and indicate how to overcome them.
6.6.1 THE VOLUME OF THE PROTOTYPE CONTINUOUS INFORMATION
The simplest type of continuous information is information whose potential forms can be represented as points in the interval $\langle 0,1\rangle$. We call such information prototype continuous information. It is the counterpart of binary information, which is the prototype of discrete information. For binary information the definition of an indicator of the resources needed to store or transmit it was plausible. For prototype continuous information this is no longer so. The basic reason for the difficulty is that, as has been indicated in Section 1.4.3, page 33, we cannot process continuous information, even the simple prototype information, in an exact way. To illustrate this statement let us look more closely at the options for physically storing or transmitting the prototype continuous information. There are two techniques for processing continuous information: (1) analog processing and (2) discretization and digital processing.
To be processed by an analog technique the prototype information $u\in\langle 0,1\rangle$ must be presented in some physical form. As indicated in Section 1.1.1, this can be either a dynamic or a static form. The basic procedure for presenting continuous information in the dynamic form is modulation. We take a carrier process, typically a pulse or a sinusoidal packet located in a time slot, and make one of the parameters characterizing the carrier process dependent on the primary information (pulse amplitude, position, or width modulation; amplitude, phase, or frequency modulation; for concrete examples, see Section 2.1.1). After processing (in particular, transmission) we recover the primary information. However, the unavoidable external factors influencing the transformation of the primary information into the dynamic information, the processing of this information, and the recovery of the primary information cause the primary and recovered information to differ. Thus, the recovered information is
$u_r=u+z$, (6.6.1a)
where the difference
$z=u_r-u$ (6.6.1b)
is the final distortion. It is determined by the mentioned unknown factors. The quality of processing the primary information presented in the dynamic form is characterized by the size of the final distortions. If the external factors exhibit statistical regularities, the final distortion exhibits them also. As a simple indicator of the size of the distortions we take the mean square value
$\bar Q_z=\mathrm{E}\,\mathbb z^2$, (6.6.2)
where $\mathbb z$ is the random variable representing the final distortions. This mean square has also the meaning of an indicator of the accuracy of fundamental processing of the prototype continuous information presented in the dynamic form.
The other basic type of analog processing of the prototype continuous information is to present it in a static form. Such a presentation is typical for information storage, particularly on magnetic or optical carrying media. In both cases a counterpart of modulation is used: the parameters characterizing the static state (magnetic, optic) of a small area of the carrier (the counterpart of the time slot assigned to a modulated signal) are made dependent on the primary information. Although the factors determining the distortions have another physical character, the recovered information ("read" out of the storage) again has the form given by (6.6.1).
The second basic technique of processing continuous information is to transform it into discrete information. The transformation of scalar and vector information into discrete information has been called quantization. We discussed it already in Sections 1.5.4, 4.2.1, and 4.5.1; see Figures 1.18, 4.2, and 4.6a. If quantization is applied, it is natural to define the volume of the primary continuous information as the volume of the quantized information. This volume we denote as $V_q(\langle 0,1\rangle)$. From (6.1.18) we get
$V_q(\langle 0,1\rangle)=\log L_q$, (6.6.3)
where $L_q$ is the number of potential forms of the quantized information.
Digital processing can be considered for most purposes as error free, but the quantization introduces irreversible distortions. Therefore, if after processing the quantized information we would like to recover the primary continuous prototype information, the recovered information would again have the form (6.6.1). However, it is not indeterminate external factors but the irreversibility of the transformation producing the available information that causes the error $z$. We call it the quantization error. To get an insight into its size we give a simple example.
EXAMPLE 6.6.1 CALCULATION OF THE MEAN SQUARE OF THE QUANTIZATION ERROR
We assume that
A1. The set of potential forms of the prototype continuous information is the interval $\langle -1/2,1/2\rangle$. This shift of the previously introduced interval $\langle 0,1\rangle$ has no effect on the quantization errors but allows us to use directly the earlier considerations in Section 4.5.1. As there, we assume that
A2. The quantization is uniform, described by assumptions A1, A3, and A4 formulated on page 196, with $s_M=1/2$.
Next we assume
A3. When the quantized information $v_l$, $l=1,2,\dots,L_q$, is available, as the recovered information $u_r$ we take the center of the aggregation interval corresponding to $v_l$ (this is the $l$th reference point of the NNT producing the quantization; see assumption A4 on page 196).
A4. The continuous prototype information $u$ exhibits statistical regularities and the random variable $\mathbb u$ representing it has the uniform probability density
$p_u(u)=\begin{cases}1 & \text{for } u\in\langle -1/2,\,1/2\rangle\\ 0 & \text{for } u\notin\langle -1/2,\,1/2\rangle.\end{cases}$ (6.6.4)
A5. As the indicator of the final distortions caused by quantization we take the mean square value of the quantization distortions
$\bar Q_q=\mathrm{E}(\mathbb u-\mathbb u_r)^2$. (6.6.5)
The quantization error we denote as
$e(u)=u-u_r(u)$, (6.6.6)
where
$u_r(u)=U_r[T_q(u)]$, (6.6.7)
$T_q(\cdot)$ is the quantization rule determined by assumption A2, and $U_r(\cdot)$ is the recovery rule determined by assumption A3. On the assumptions made here the quantization error $e(u)=b$, where $b$ is defined by equation (4.5.5). Thus the saw-tooth diagram in Figure 4.6 is the diagram of $e(u)$ versus $u$. We denote by $p_e(e)$ the probability density describing the random variable representing the quantization error. We obtain this density by substituting $p_u(u)$, given by equation (6.6.4), for $p_s(s)$ in equation (4.5.8). We get
$p_e(e)=\begin{cases}L_q & \text{for } e\in\langle -\tfrac{1}{2L_q},\,\tfrac{1}{2L_q}\rangle\\ 0 & \text{for } e\notin\langle -\tfrac{1}{2L_q},\,\tfrac{1}{2L_q}\rangle.\end{cases}$ (6.6.8)
From (4.4.11) it follows that
$\bar Q_q=\int e^2(u)\,p_u(u)\,du$. (6.6.9)
From Figure 4.6c we see that
$\bar Q_q=L_q\int_{-1/2L_q}^{1/2L_q}e^2\,de=\frac{1}{12L_q^2}$. (6.6.10)
The objective indicator of the mean square quantization distortion is the normalized mean square distortion
$\bar Q_n=\bar Q_q/\bar Q_s(\mathbb u)$. (6.6.11)
From assumption A4 it follows that
$\bar Q_s(\mathbb u)=\int_{-1/2}^{1/2}u^2\,du=1/12$. (6.6.12)
From (6.6.10), (6.6.11), and (6.6.12) we get
$\bar Q_n=\frac{1}{L_q^2}$. (6.6.13)
Using (6.6.3) we write this relationship in the form
$V_q(\langle 0,1\rangle)=\tfrac12\log_2\frac{1}{\bar Q_n}$. (6.6.14)
This dependence is illustrated in Figure 6.11.
Figure 6.11. The trade-off relationship (6.6.14) between the volume $V_q(\langle 0,1\rangle)=\log L_q$ of uniformly quantized scalar information and the indicator of normalized quantization distortions $\bar Q_n$; only values corresponding to integer $L_q$ are meaningful.
COMMENT 1
The relationship (6.6.14) is a typical trade-off relationship between two conflicting indicators of an information system's performance. This relationship depends on the rules of transforming the information into a form suitable for subsequent processing (quantization) and on the rules of information recovery. We have proposed those rules without justification. In Section 8.6.1 we show that they are the optimal rules for the assumed uniform probability density.
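The result of Example 6.6.1 is easy to verify by simulation. The sketch below assumes a uniform primary density on $\langle -1/2,1/2\rangle$ and a uniform quantizer that reports the center of the aggregation interval; the estimated normalized distortion should approach $1/L_q^2$, in agreement with (6.6.13). The helper names are chosen only for this illustration.

```python
# Monte Carlo check of the uniform-quantizer distortion derived in Example 6.6.1.
import random

def quantize(u, L):
    i = min(int((u + 0.5) * L), L - 1)        # index of the aggregation interval
    return -0.5 + (i + 0.5) / L               # its center = recovered value u_r

def normalized_distortion(L, trials=200_000, seed=1):
    rng = random.Random(seed)
    mse = 0.0
    for _ in range(trials):
        u = rng.uniform(-0.5, 0.5)
        mse += (u - quantize(u, L))**2
    mse /= trials                             # estimate of Q_q, eq. (6.6.10)
    return mse / (1.0/12.0)                   # Q_n = Q_q / Q_s(u), eq. (6.6.11)

for L in (2, 4, 8, 16):
    print(f"L_q={L:2d}  Q_n (simulated)={normalized_distortion(L):.5f}  "
          f"1/L_q**2={1/L**2:.5f}")
```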
COMMENT 2
No matter whether we apply the analog or the digital technique of processing (in particular, storing or transmitting) the prototype one-dimensional continuous information, the resources needed to process the information are determined by the required accuracy of processing, and they grow with increasing required accuracy (decreasing admissible processing error). Therefore, not a single parameter but the trade-off relationship between the resources needed to process the analog information and the accuracy of the information that can be recovered is the characteristic of the volume of continuous information.
6.6.2 THE VOLUME OF STRUCTURED CONTINUOUS INFORMATION AND ITS COMPRESSION
The idea of characterizing the resources needed to process one-dimensional information by a trade-off relationship between the volume of the quantized information and the accuracy of the recovered continuous information can be directly applied to K DIM, $K\ge 2$, continuous information. However, in many situations characterizing the volume of continuous information in terms of discrete information is not natural. Here, generalizing the concepts introduced in Section 6.1.1, we present an approach to the concept of the volume of continuous information without resorting to the concept of discretization.
The number of potential forms of discrete information played a crucial role in the definition of the volume of discrete information (relationships (6.1.8), (6.1.18)). Therefore, to adapt the methodology of defining the volume of discrete information to the definition of the volume of continuous information, a counterpart of the number of elements of a discrete set must be introduced for continuous sets. If we look closer at the basic definition of the volume of discrete information (Section 6.1.1), we see that it is essential that the number of elements of the set of potential forms of the structured information and the number of elements of the set of potential forms of the reference information be the same. To make a corresponding statement in the case of continuous sets we must introduce the relationship of "equal count" of the elements of two sets. Let $U$ (respectively, $V$) denote the two sets of potential forms of information $u$ (respectively, $v$). We define as follows:
If such a reversible information transformation $T(\cdot)$ exists that every $v\in V$ can be presented in the form $v=T(u)$ and every $u\in U$ can be presented in the form $u=T^{-1}(v)$, then we say that the sets $U$ and $V$ have equal counts and write $U\sim V$. (6.6.15)
The condition occurring in definition (6.6.15) can be formulated equivalently in the following form:
It is possible to put the elements of both sets into pairs so that each element of the set $U$ has one and only one partner from the set $V$, and vice versa. (6.6.16)
The concept of "equal count of elements" is illustrated in Figure 6.12.
Figure 6.12. Illustration of the concept of an equal count of the sets $U$ and $V$; both sets are discrete.
For discrete sets the condition that the counts of the two sets are the same is equivalent to the condition
$L(V)=L(U)$. (6.6.17)
Obviously, the interval $\langle 0,1\rangle$ has infinitely many elements. As we have shown in Section 1.4.3, this "infinity" is so large that it is not possible to establish an "equal count" relationship between the points of the interval $\langle 0,1\rangle$ and a set of elements that can be identified by integers. However, we may expect that an "equal count" relationship may exist between prototype continuous sets. We first show by a simple example that a straightforward generalization of the definition (6.6.15) of equal count would not be feasible for a technically reasonable definition of the volume of continuous information.
EXAMPLE 6.6.2 THE LOSSLESS TWO-DIMENSIONAL TO ONE-DIMENSIONAL COMPRESSION OF UNCONSTRAINED EXACT CONTINUOUS INFORMATION
This example shows a peculiarity of comparing the counts of two continuous sets by putting their elements into pairs. This is equivalent to defining a deterministic reversible function by assigning to an element of one set one and only one element of the other set. We assume that the set of potential forms of the information $u=\{u(1),u(2)\}$ is the unit square
$U_{1\times 1}=\{\{u(1),u(2)\};\ u(1)\in\langle 0,1\rangle,\ u(2)\in\langle 0,1\rangle\}$. (6.6.18)
The coordinates of a concrete piece of information must be represented in a counting system, say, binary. The representation is, in general, infinitely long. Thus,
$u(1)=b(1,1),b(1,2),b(1,3),\dots$ (6.6.19a)
$u(2)=b(2,1),b(2,2),b(2,3),\dots$ (6.6.19b)
We interleave both trains and interpret the interleaved train
$v=b(1,1),b(2,1),b(1,2),b(2,2),b(1,3),b(2,3),\dots$ (6.6.20)
as a representation of a number $v$. We denote by $T(\cdot)$ the transformation defined by (6.6.20). Since we can unfold the train (6.6.20) back into the two primary trains, the transformation $T(\cdot)$ is reversible. Thus, it puts into pairs all points of the unit square and of the unit interval, as illustrated in Figure 6.13. Since the folding procedure can be generalized, a lossless compression of K-dimensional information into one-dimensional information is also possible.
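The interleaving transformation (6.6.20) can be imitated with finite precision as in the sketch below; the digit budget `bits` and the helper names fold and unfold are choices made only for this illustration. The need to truncate the binary expansions already hints at why the mapping, although reversible in the mathematical sense, is of no practical use.

```python
# Finite-precision sketch of the interleaving transformation (6.6.20) and of
# its inverse; exact only for binary expansions no longer than `bits` digits.

def fold(u1, u2, bits=26):
    v_digits = []
    for _ in range(bits):
        u1 *= 2; b1, u1 = int(u1), u1 - int(u1)     # next binary digit of u(1)
        u2 *= 2; b2, u2 = int(u2), u2 - int(u2)     # next binary digit of u(2)
        v_digits += [b1, b2]                        # interleave the two trains
    return sum(b * 2.0**-(k+1) for k, b in enumerate(v_digits))

def unfold(v, bits=26):
    u1 = u2 = 0.0
    for k in range(bits):
        v *= 2; b1, v = int(v), v - int(v)
        v *= 2; b2, v = int(v), v - int(v)
        u1 += b1 * 2.0**-(k+1)
        u2 += b2 * 2.0**-(k+1)
    return u1, u2

u = (0.6015625, 0.33203125)       # exactly representable test point
print(unfold(fold(*u)))           # recovers (0.6015625, 0.33203125)
```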
Figure 6.13. The transformation described in the example, mapping the unit square onto the unit interval.
COMMENT
The example also shows that, contrary to intuition, the count of the points of a square (in general, of a K-dimensional cube) is the same as that of the $\langle 0,1\rangle$ interval. However, this cannot be used for dimensionality reduction, for two physical reasons. First, to perform the transformation or to revert it we must know the information $u$ and $v$ exactly; as has been explained in Section 1.4.3, this is for physical reasons impossible. The second reason is that the function assigning $v$ to $u$, considered as a function of the two continuous arguments $u(1)$ and $u(2)$, is discontinuous at every point. As discussed in Section 1.4.3, every real device has some inertia and cannot implement such a function.
The example suggests that to use the definition (6.6.15) of equal count for continuous sets we have to restrict the class of transformations of information to transformations that can be implemented. We call such a transformation physically realizable. Thus, for continuous sets we define as follows:
If we can find a reversible, physically realizable transformation putting all elements of two continuous sets into pairs, then we say that both sets have an equal effective count of elements. (6.6.21)
As has been indicated previously, the counterpart of the binary prototype set is, for continuous information, the one-dimensional continuous set that can be represented as the $\langle 0,1\rangle$ interval. We call it the prototype continuous set and its elements prototype continuous information. To extend the procedure of defining the volume of discrete information presented in the previous section, we have to define an indicator of the resources needed to process (in particular, to store or to transmit) the prototype continuous information and to use it as the unit of the volume of continuous information. Having defined the volume of the prototype information, we can define the volume of structured continuous information in a similar way as we defined the volume of structured discrete information based on the volume of binary information. The counterpart of the train of $N$ binary pieces of information is the train of $N$ pieces of continuous scalar information. If no constraints are imposed on the elements of such potential trains, we call the set the N-dimensional continuous master set and denote it $C(N)$. Next, similarly to (6.1.8), we define its volume
$V[C(N)]=N$. (6.6.22)
Then the counterpart of the definition (6.1.12) of the volume of discrete information is as follows:
If a continuous set $U$ has an equal effective count of elements (in the sense of definition (6.6.21)) with the set of potential forms of the continuous master set $C(N)$, then we define the minimum (reference) volume as
$V_{mi}(U)=N$. (6.6.23)
We next extend to continuous K-dimensional information the previously given definitions of the structure-blind volume and of the compression ratio for discrete information. We illustrate this with a simple example.
EXAMPLE 6.6.3 THE LOSSLESS TWO-DIMENSIONAL TO ONE-DIMENSIONAL COMPRESSION OF CONSTRAINED CONTINUOUS INFORMATION
We assume that
A1. The information is two-dimensional, $u=\{u(1),u(2)\}$, and the set of its potential forms is the unit square
$U_{1\times 1}=\{\{u(1),u(2)\};\ 0\le u(1)\le 1,\ 0\le u(2)\le 1\}$; (6.6.24)
A2. The second coordinate is determined by the first:
$u(2)=\Phi[u(1)]$, (6.6.25)
where $\Phi(\cdot)$ is a given function such that $\Phi(0)=0$ and $\Phi(1)=1$. An example of such a function $\Phi(\cdot)$ is shown in Figure 6.14.
Figure 6.14. A physically realizable, reversible transformation of the two-dimensional information $u=\{u(1),u(2)\}$ into the one-dimensional information $v$.
The truncating transformation
$v=T(u)=u(1)$ (6.6.26)
is a reversible, physically realizable transformation, because knowing $v$ we can determine $u(2)$ from (6.6.25) and finally $u$. Thus from definition (6.6.23) we get the minimum volume $V_{mi}(U)=1$. Let us next assume that we consider transformations that are reversible but structure blind, in the sense that they cannot utilize the fact that the components of the information $u$ are related by relationship (6.6.25). For such a class of transformations $V_{bl}(U)=2$. Using definition (6.1.13) for the now considered continuous information we get a maximal lossless compression coefficient equal to 2. □
Information that is a function of a continuous argument (or arguments) belongs to the next structural class after K DIM information. Similarly as for discrete and K DIM continuous information, the "infinity" of potential forms of unconstrained continuous sets of functions considered as a whole is "infinitely larger" than the "infinity" of K DIM continuous sets. Similarly as we did in Section 6.6.1, we can approximate the information with a complicated structure by information with a simpler structure (in particular, by K DIM information) and define the volume of the information with the more complicated structure in terms of the volume of the continuous approximating information with the simpler structure. The volume defined in such a way is described by a trade-off relationship between the accuracy of approximation and the volume of the approximating continuous information with the simpler structure. Inside a class of continuous functions of continuous argument(s) we may apply the previously described procedure: introduce "equal count" relationships and a master set and define structure-blind operations. Often the set of potential forms of information with a complicated structure is constrained, and the constraints are so strong that an equal count relationship can be established with information having a simpler structure. An example is the class of time-continuous functions with a bounded harmonic spectrum, discussed in Section 7.4.3, for which a continuous relationship between the time-continuous function and the set of its samples (which is a K-dimensional information) exists. Then we define the volume of the function-information directly in terms of the volume of the K-dimensional information. The detailed analysis of spectral representations of functions and their applications to reducing the dimensionality of structured continuous information is the subject of Section 7.4.
6.6.3 THE STATISTICAL VOLUME OF CONTINUOUS INFORMATION
It is natural to define the volume of information by means of the minimal capacity of a channel that is needed to transmit the continuous information with a required accuracy. In Section 5.4.4 we defined the capacity of a channel in terms of the amount of statistical information which the output of the channel delivers about the information put into the channel. This suggests basing the definition of the volume of information, particularly of continuous information, directly on the amount of statistical information that a distorted version of the considered information must deliver about the original information. To simplify the notation and terminology we assume that the information is continuous scalar information $u\in U_{ct}$, $U_{ct}=\langle 0,1\rangle$. We introduce a hypothetical distorting transformation $V(\cdot)$ producing an information $u^*\in U_{ct}$. In general, the transformation $V(\cdot)$ may be indeterministic. Thus $u^*=V(u,z)$, where $z$ is a set of side factors. An example of a deterministic transformation $V(\cdot)$ is the chain of transformations $U_r[T_q(\cdot)]$ considered in Example 6.6.1. We assume that the primary information also exhibits statistical regularities, and as the indicator of the distortions caused by the transformation $V(\cdot)$ we take the average distortion
$\bar Q[V(\cdot)]=\mathrm{E}\,q[\mathbb u,V(\mathbb u,\mathbb z)]$, (6.6.27)
where
$\mathbb u$ is the random variable representing the primary information,
$\mathbb z$ is the multidimensional random variable representing the side factors affecting the outcome of the hypothetical transformation $V(\cdot)$,
$q(\cdot,\cdot)$ is a performance indicator of the information processing system in a concrete situation.
A typical example of the average indicator of distortions is the mean square error defined by (6.6.5). As a universal indicator of the statistical volume $V_r(\mathbb u,Q)$ of continuous information, relative to the accuracy standard $Q$, we take the minimum amount of statistical information that the random variable $\mathbb u^*$ must deliver about the random variable $\mathbb u$ representing the primary information, on the condition that the average distortion (6.6.27) has the given value $Q$. (6.6.28)
Thus, we define
$V_r(\mathbb u,Q)=\min_{V(\cdot)\in\mathcal V(Q)} I[\mathbb u;V(\mathbb u,\mathbb z)]$, (6.6.29)
where $\mathcal V(Q)$ is the set of hypothetical distorting transformations $V(\cdot)$ transforming the primary information $u$ into an information $u^*$ such that
$\mathrm{E}\,q[\mathbb u,V(\mathbb u,\mathbb z)]=Q$. (6.6.30)
Since the volume depends on the required accuracy $Q$, we call $V_r(\mathbb u,Q)$ the relative volume; hence the notation. Its definition is illustrated in Figure 6.15.
Figure 6.15. Illustration of the definition (6.6.29) of volume of continuous information exhibiting statistical regularities based on the amount of statistical information.
Using (5.1.21) we express $V_r(\mathbb u,Q)$ in terms of entropies:
$V_r(\mathbb u,Q)=\min_{V(\cdot)\in\mathcal V(Q)}[H(\mathbb u)-H(\mathbb u|\mathbb u^*)]=H(\mathbb u)-\max_{V(\cdot)\in\mathcal V(Q)}H(\mathbb u|\mathbb u^*)$. (6.6.31)
Examples of explicit calculation of the minimum volume $V_r$ can be found in Cover, Thomas [6.15]. The calculations become particularly simple when the random variable $\mathbb u$ representing the primary information is gaussian and $q(u,u^*)=(u-u^*)^2$. Then it can easily be shown that
$V_r(\mathbb u,Q)=\frac12\log_2\frac{\sigma^2(\mathbb u)}{Q}$, (6.6.32)
where $\sigma^2(\mathbb u)$ is the variance of $\mathbb u$.
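For the gaussian case (6.6.32) the relative volume is a simple closed-form function of the ratio $\sigma^2(\mathbb u)/Q$. The sketch below evaluates it for a few accuracy standards; setting the volume to zero for $Q\ge\sigma^2(\mathbb u)$ reflects the fact that such an accuracy can be achieved without transmitting any information.

```python
# Relative volume (6.6.32) of gaussian scalar information for the mean square
# error criterion, evaluated for several accuracy standards Q.
from math import log2

def relative_volume_gaussian(sigma2, Q):
    return 0.0 if Q >= sigma2 else 0.5 * log2(sigma2 / Q)

sigma2 = 1.0
for Q in (0.5, 0.1, 0.01, 0.001):
    print(f"Q={Q:5.3f}  V_r={relative_volume_gaussian(sigma2, Q):.3f} bits")
```

Note that the functional form is the same as in the quantization trade-off (6.6.14): halving the admissible mean square error costs half a binary unit of volume.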
COMMENT 1
The amount of statistical information, defined by equation (5.1.21), which the information produced by a transformation delivers about the primary information depends both on the statistical properties of the primary process (input) and on those of the transformed process. Removing by maximization the dependence on the primary process, we obtain the definition (5.4.26) of channel capacity. Removing by minimization the dependence on the conditional probability describing the transformation, we obtain the definition (6.6.29) of the relative volume of the primary information. In this sense both concepts are complementary. Both definitions also provide examples of an operation removing the dependence on details, which has been mentioned in Section 1.6.2 and is discussed in detail in Section 8.1.
NOTES
1. According to the general rules of notation, we use the special font V to indicate that the calculation of volume is a function assigning a number to structured information or to a set.
2. This term is an abbreviation for "block taken out of an unconstrained (constrained) set". Frequently used terms such as "discrete information" and "continuous information" are similar abbreviations. See also Note 3 in Chapter 1.
3. The original Huffman algorithm was conceived for optimization of compression when the probabilities of the potential forms of the blocks are given. We did not make such an assumption here, but later in this section we discuss the probabilistic version and the relationships between it and the Huffman algorithm based on frequencies of occurrences.
4. The coincidence is the result of the choice of the set of transformed blocks in Example 6.2.1. Although we could take any other set, we have chosen the specific set (6.2.11) to illustrate the relationship between the general rule (6.2.10) and the Huffman algorithm.
5. In the subsequent text the processes in the local channel are denoted by the symbol u. The subscript su should remind the reader that the symbol denotes the duration of a time slot in the channel delivering the processes denoted by u.
6. It is equal to the channel capacity defined by (5.4.31) when P_e = 0.
7. From equations (6.4.7), (6.4.12), and from theorem (4.5.13) it follows that for large I the probability distribution of n_I can be approximated by the gaussian probability distribution with variance 1. Thus, the decrease of the probability of surpassing the threshold is fast.
8. Although the intensities are the same, the train of starting instants of the transformed blocks is no longer a Poisson process, because buffering introduces statistical dependence between them.
9. We consider here only one block that is an element of the transformed train. Therefore, we omit the index i numbering the position of the block in the train; see equation (6.2.1).
10. According to the previous convention (Note 9), we do not indicate the number of the considered block in the train. Thus, u(n) is an abbreviation of the symbol u(i, n).
11. Since we assumed that the primary train is stationary, the random variables representing the transformed blocks have the same probability distribution. Therefore, we write briefly V instead of V(i).
12. The bar over Q is a reminder that the indicator is a statistical average.
13. In terms of the theory of cardinal numbers: the cardinality of the K-dimensional cube is c, where c is the cardinality of the <0, 1> interval.
REFERENCES
[6.1] Held, G., Marshall, T., Data Compression (4th ed.), Wiley, N.Y., 1996.
[6.2] Bell, T.C., Witten, T.H., Text Compression, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[6.3] Storer, J.A., Data Compression, Computer Science Press, Rockville, MD, 1988.
[6.4] Storer, J.A., Image and Text Compression, Kluwer, Boston, 1992.
[6.5] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., Numerical Recipes, Cambridge University Press, Cambridge, 1992.
[6.6] Nelson, M., The Data Compression Book, M&T Books, Redwood City, CA, 1991.
[6.7] Storer, J.A., Reif, J.H., DCC'91 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1991.
[6.8] Storer, J.A., Cohn, M., DCC'92 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1992.
[6.9] Storer, J.A., Cohn, M., DCC'93 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1993.
[6.10] Storer, J.A., Cohn, M., DCC'94 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1994.
[6.11] Seidler, J.A., Principles of Computer Communication Network Design, Wiley, N.Y., 1983.
[6.12] Gallager, R., Bertsekas, D., Networks, Wiley, N.Y., 1991.
[6.13] Kleinrock, L., Queuing Systems (2 vols.), Wiley, N.Y., 1975.
[6.14] Blahut, R.E., Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1990.
[6.15] Cover, T.M., Thomas, J.A., Elements of Information Theory, Wiley, N.Y., 1991.
[6.16] Gray, R.M., Source Coding Theory, Kluwer, Boston, 1990.
[6.17] Veldhuis, R., Breeuwer, M., Source Coding, Prentice-Hall, N.Y., 1993.
7 DIMENSIONALITY REDUCTION AND QUANTIZATION
The dilemma of information processing is that the states of the environment, and consequently the primary information, are continuous, but digital information processing is efficient and cheap. Section 1.5.4 described the basic transformations of the primary continuous information into discrete information. Such a transformation is called discretization; discretization of scalar or vector information is called quantization. The important types of discretization are listed in Figure 1.22. Section 1.5.4 indicated that if the structure of the continuous information is complicated, then we can discretize the information more efficiently if we first transform it into continuous information with a simpler structure. We called such a transformation a dimensionality-reducing transformation. The dimensionality-reducing transformations are of paramount importance not only as preliminary transformations preceding discretization. In spite of the advantages of discrete information processing, in some areas continuous information processing is also useful. Then dimensionality reduction often improves the performance of a superior system utilizing the continuous information. This chapter is devoted to dimensionality reduction and quantization.
The prototype dimensionality reduction is the transformation of vector information whose components are continuous into a vector information consisting of a smaller number of continuous components. The most frequently used types of dimensionality reduction are truncation and decimation (described briefly in Section 1.5.4; in particular, see Figure 1.20). Often the primary information can be considered as a function of a continuous argument. The basic type of dimensionality reduction of such information is point sampling and also, in the case of images, line scanning; see Section 1.5.4, particularly Figure 1.21.
In this chapter we concentrate on dimensionality reduction and quantization of vector information. For such information the formal apparatus is quite simple (matrix calculus). The obtained results are not only directly applicable but can also be generalized to dimensionality reduction of processes and images. Section 7.1, using extensively the geometric interpretation, introduces the concept of spectral representations. In Section 7.2 we concentrate on decorrelating spectral transformations. We use them in Section 7.3 to show on a simple study case that preliminary decorrelation by a spectral transformation greatly simplifies efficient reduction of the dimensionality of vector information.
Generalizing the observations made in the study case, we present an optimal algorithm for dimensionality reduction and information recovery. We close our considerations of dimensionality reduction of vector information with an example and a discussion of trading the quality of the recovered information against the volume of the compressed information. Section 7.4 shows that the concepts of spectral representation and dimensionality reduction introduced for vector information can be generalized to information that is a function of a continuous argument (function-information). We concentrate on transformations of such information that first transform it into a spectrum consisting of infinitely many components. The compression into vector information is achieved by retaining only a finite number of components of the spectrum. Sampling can also be interpreted as such a transformation. In Section 7.5 the quantization of information is discussed. As a simple but in several respects representative special case, we first consider quantization of scalar information. Then we review the basic features of vector quantization. Of great practical importance is quantization of vector information achieved by a preliminary presentation transformation and subsequent separate scalar quantization of the components of the transformed primary information. Decorrelation is a typical preliminary transformation. Further, we describe real-time quantization of a train, achieved by the predictive-subtractive procedure. In this chapter, as in the previous one, we emphasize the conceptual and technical aspects of information compression, keeping the formal side of the considerations simple. Therefore, besides analytical arguments, some heuristic arguments are also used. However, in the next chapter we give their precise justification. This applies, in particular, to the optimization of scalar and vector quantization.
7.1 SPECTRAL REPRESENTATIONS OF VECTOR INFORMATION
We explain here the fundamental concepts of spectral representations of information and their applications to decorrelation and dimensionality compression in the simple but representative case of vector information. The presented approach is based on geometrical interpretation.
7.1.1 FUNDAMENTAL CONCEPTS OF K DIM GEOMETRY
In previous chapters we used interchangeably the terms "block information" and "vector information". It has been indicated in Section 1.3.2 that, as long as we do not define the operations which can be performed on the information, the term "vector information" is a technical jargon name. In this section we define various operations on block information, utilizing in particular the interpretation of block information as a vector or as a point in the sense of geometry. To avoid confusion, the set
$u=\{u(k),\ k=1,2,\dots,K\}$ (7.1.1)
is called here the "block information", and the term "vector" is used in its mathematical sense.
We start our considerations with a review of the fundamental concepts of K DIM geometry. A detailed presentation of this geometry can be found in most books on linear algebra; see, e.g., Thompson [7.1], Horn, Johnson [7.2], and Usmani [7.3]. Since we use here various representations of the set $a=\{a(k),\ k=1,2,\dots,K\}$ of scalars $a(k)$, we indicate the type of representation by the subscripts vc, pt, mx for interpretation as a vector, as a point, or as a matrix, respectively.
For two vectors $a_{vc}$ and $b_{vc}$ representing the sets $a$ and $b$, the scalar (dot) product is
$(a_{vc},b_{vc})=|a_{vc}|\,|b_{vc}|\cos\alpha(a_{vc},b_{vc})$, (7.1.2)
where $|a_{vc}|$, $|b_{vc}|$ denote the absolute values (lengths) of the vectors and $\alpha(a_{vc},b_{vc})$ the angle between them. Taking $b_{vc}=a_{vc}$ we get from (7.1.2)
$|a_{vc}|=[(a_{vc},a_{vc})]^{1/2}$. (7.1.3)
From definition (7.1.2) it follows that if one of the vectors is fixed, the scalar product is, with respect to the other vector, a linear operation:
$([h(1)a_{vc}(1)+h(2)a_{vc}(2)],\,b_{vc})=h(1)(a_{vc}(1),b_{vc})+h(2)(a_{vc}(2),b_{vc})$. (7.1.4)
If $(a_{vc},b_{vc})=0$ we say that the vectors $a_{vc}$ and $b_{vc}$ are orthogonal. A set of vectors $f_{vc}(k)$, $k=1,2,\dots,K$, such that
$(f_{vc}(k),f_{vc}(l))=\delta(k,l)$, (7.1.5)
where
$\delta(k,l)=\begin{cases}1 & \text{for } k=l\\ 0 & \text{for } k\ne l\end{cases}$ (7.1.6)
is the Kronecker delta function, we call ortho-normal vectors. We use these vectors as unit coordinate vectors. Then we call their set $C_f$ an orthogonal coordinate system. A vector $a_{vc}$ representing a set $a=\{a(k),\ k=1,2,\dots,K\}$ can be written in the form
$a_{vc}=\sum_{k=1}^{K}a(k)f_{vc}(k)$. (7.1.7)
Writing the vector $b_{vc}$ in the same form and using (7.1.4) and (7.1.5) we get
$(a_{vc},b_{vc})=\sum_{k=1}^{K}a(k)b(k)$. (7.1.8)
In many problems of information processing the components of one of the vectors, say $b_{vc}$, have the meaning of fixed coefficients, while the components of the other vector $a_{vc}$ are samples of a continuous process $a_c(t)$ taken at the instants $t_k$, $k=1,2,\dots,K$. Thus,
$a(k)=a_c(t_k),\ k=1,2,\dots,K$. (7.1.9)
Then (7.1.8) takes the form
$(a_{vc},b_{vc})=\sum_{k=1}^{K}b(k)a_c(t_k)$. (7.1.10)
Comparing this expression with equation (3.2.43), describing the process at the output of a linear system, we see that they are similar. Let us suppose that
• the input process in equation (3.2.43) is
$v(1,t_n)=a_c(t_n)=a(n),\ n=1,2,\dots,K$; (7.1.11)
• the number of memory cells
$J=K$; (7.1.12)
• the coefficients $h(t_n)$ characterizing the time-discrete linear system shown in Figure 3.8 satisfy the conditions
$h(t_K-t_n)=b(n),\ n=1,2,\dots,K$. (7.1.13)
Then the output process $v(2,t_K)$ at the instant $t_K$ is equal to the scalar product of the vectors:
$v(2,t_K)=(a_{vc},b_{vc})$. (7.1.14)
From (7.1.13) and (3.2.43) it follows that
$h'(t_m)=b(K-m),\ m=0,1,\dots,K-1$. (7.1.15)
A linear system with such a set of coefficients is called a matched filter. In view of (3.2.40) this set has also the meaning of the pulse response of the system. Let us summarize our considerations:
If at the input of the filter matched to the components of a vector $b_{vc}$ we put in sequentially the components of the vector $a_{vc}$, then the value of the output process after the arrival of the last component of $a_{vc}$ is equal to the scalar product $(a_{vc},b_{vc})$. (7.1.16)
The conclusion is illustrated in Figure 7.1. It shows not only that the scalar product can be calculated by simple hardware but also, as is shown later, it suggests useful interpretations of several information transformations.
Figure 7.1. Evaluation of the scalar product: (1) based directly on equation (7.1.8), (2) by means of the matched filter (the output is read at $t_K+0$).
The set $a=\{a(k),\ k=1,2,\dots,K\}$ can also be represented as a column matrix. Such a presentation is denoted as $a_{mx}$. Thus,
$a_{mx}=[a(1),a(2),\dots,a(K)]^{\mathrm T}$. (7.1.17)
Using the rules of matrix multiplication we write equation (7.1.8) in the form
$(a_{vc},b_{vc})=a_{mx}^{\mathrm T}b_{mx}$, (7.1.18)
where $a_{mx}^{\mathrm T}$ is the transposed matrix, and writing two matrices side by side means matrix multiplication.
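Conclusion (7.1.16) can be checked numerically, assuming the numpy package is available: feeding the components of $a$ into the filter matched to $b$ and reading the output after the last component has arrived reproduces the scalar product (7.1.8).

```python
# Numerical check of conclusion (7.1.16): a matched filter computes (a, b).
import numpy as np

K = 8
rng = np.random.default_rng(0)
a = rng.standard_normal(K)
b = rng.standard_normal(K)

h = b[::-1]                         # pulse response h'(t_m) = b(K-m), eq. (7.1.15)
output = np.convolve(a, h)          # running output of the time-discrete filter
print(output[K - 1], np.dot(a, b))  # the K-th output sample equals (a, b)
```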
We can also interpret the block information $a=\{a(k),\ k=1,2,\dots,K\}$ as a point in a K-dimensional Euclidean space which, in the coordinate system $C_f$, has the coordinates $a(k)$, $k=1,2,\dots,K$. In view of (7.1.7) this point can be interpreted as the end point of the vector $a_{vc}$ defined by (7.1.7). For a pair of points $a_{pt}$, $b_{pt}$ we introduce the distance $d(a_{pt},b_{pt})$. It is natural to define the distance as the length of the difference vector $b_{vc}-a_{vc}$. Thus, we define
$d(a_{pt},b_{pt})=|b_{vc}-a_{vc}|$. (7.1.19)
From (7.1.3) and (7.1.8) we obtain
$d(a_{pt},b_{pt})=\Big\{\sum_{n=1}^{K}[b(n)-a(n)]^2\Big\}^{1/2}$. (7.1.20)
Using (7.1.3) and (7.1.4) we write (7.1.19) in the form
$d^2(a_{pt},b_{pt})=(b_{vc}-a_{vc},\,b_{vc}-a_{vc})=(a_{vc},a_{vc})+(b_{vc},b_{vc})-2(a_{vc},b_{vc})=|a_{vc}|^2+|b_{vc}|^2-2(a_{vc},b_{vc})$. (7.1.21)
COMMENT
The smaller the distance $d(a_{pt},b_{pt})$, the more "similar" are the two sequences $a$ and $b$. Thus, the distance may be called a "negative" (the smaller, the more similar) indicator of similarity. From (7.1.21) it follows that the only component of the distance that depends on both sequences simultaneously is the scalar product $(a_{vc},b_{vc})$. The distance is the smaller, the larger this product is. Therefore, the scalar product may be called a "positive" indicator of similarity of the two sequences.
7.1.2 DEFINITION OF THE SPECTRUM OF BLOCK INFORMATION
In the K DIM Euclidean space we take an orthogonal coordinate system $C_f$, determined by the set $f_{vc}(k)$, $k=1,2,\dots,K$, of unit coordinate vectors satisfying the ortho-normality conditions (7.1.5), and we represent the block information $u$ as a vector (see Figure 7.2a)
$u_{vc}=\sum_{k=1}^{K}u(k)f_{vc}(k)$. (7.1.22)
Figure 7.2. Illustration of the definitions: (a) of the vector and point representation of the block information $u$, (b) of its spectrum $v=\{v(1),v(2),v(3)\}$.
We scalar-multiply both sides of (7.1.22) by a coordinate unit vector $f_{vc}(l)$. Taking into account the ortho-normality (7.1.5) we get
$u(l)=(u_{vc},f_{vc}(l))$. (7.1.23)
Next we take another orthogonal coordinate system $C_g$ (see Figure 7.2b) of unit coordinate vectors $g_{vc}(k)$, $k=1,2,\dots,K$. Thus,
$(g_{vc}(k),g_{vc}(l))=\delta(k,l)$. (7.1.24)
Taking in (7.1.22) instead of $u_{vc}$ a unit vector $g_{vc}(l)$, we express it in terms of the unit vectors of the coordinate system $C_f$:
$g_{vc}(l)=\sum_{m=1}^{K}g(l,m)f_{vc}(m)$. (7.1.25)
We scalar-multiply both sides by a coordinate unit vector $f_{vc}(k)$. This gives
$g(l,k)=(g_{vc}(l),f_{vc}(k))$. (7.1.26)
Substituting (7.1.25) in (7.1.24) and using (7.1.5) we get
$\sum_{m=1}^{K}g(l,m)g(k,m)=\delta(l,k),\ \forall l,k$. (7.1.27)
Finally, we represent the vector $u_{vc}$ in the form
$u_{vc}=\sum_{k=1}^{K}v(k)g_{vc}(k)$. (7.1.28)
Similarly to (7.1.23), the occurring coefficients are given by the equation
$v(l)=(u_{vc},g_{vc}(l)),\ l=1,2,\dots,K$. (7.1.29)
Substituting (7.1.22) and (7.1.23) and taking into account (7.1.5), we express $v(l)$ directly in terms of the components of the primary information:
$v(l)=\sum_{k=1}^{K}g(l,k)u(k),\ \forall l$. (7.1.30)
The set of coefficients
$v=\{v(k),\ k=1,2,\dots,K\}$ (7.1.31)
is the secondary block information into which the primary information $u$ is transformed. This set is called the g-spectrum of the primary information $u$ (relative to the set of orthogonal unit vectors $g_{vc}(k)$, $k=1,2,\dots,K$). When it does not cause confusion, we call it briefly the spectrum. From (7.1.29) it follows that the $l$th component of the spectrum has the meaning of the projection of the vector $u_{vc}$ on the $l$th axis of the coordinate system $C_g$. Knowing the projections on all $K$ axes, we can determine exactly the vector $u_{vc}$ (the primary information). Thus the transformation of the primary information into its spectrum is a reversible transformation.
We now express the primary information explicitly in terms of the spectrum. Because the roles of the two coordinate systems $C_f$ and $C_g$ are symmetrical, we have only to interchange in our previous argumentation the roles of the unit vectors $f_{vc}(l)$ and $g_{vc}(k)$. Similarly to (7.1.25) we represent the unit coordinate vector $f_{vc}(l)$ in the form
$f_{vc}(l)=\sum_{k=1}^{K}f(l,k)g_{vc}(k)$, (7.1.32)
where, similarly to (7.1.26),
$f(l,k)=(f_{vc}(l),g_{vc}(k))$ (7.1.33)
and similarly to (7.1.27)
Σ_{m=1}^{K} f(l, m) f(k, m) = δ(l, k),  for all l, k.    (7.1.34)
From equations (7.1.26) and (7.1.33) it follows that
f(k, l) = g(l, k).    (7.1.35)
To obtain an explicit expression of the primary information u in terms of its spectrum v we substitute (7.1.28) and (7.1.33) in (7.1.23). After elementary algebra and taking into account the ortho-normality of the vectors f_vc(k) we get
u(l) = (Σ_{k=1}^{K} v(k) g_vc(k), f_vc(l)) = Σ_{k=1}^{K} f(l, k) v(k).    (7.1.36)
The pair of relationships (7.1.30) and (7.1.36) is called the spectral transformation of block information. The symmetry of these relationships is the consequence of the mentioned symmetry of the roles of the coordinate systems C_f and C_g. To gain more insight into those relationships we present them in matrix form:
u_mx = F v_mx,    (7.1.37a)
v_mx = G u_mx,    (7.1.37b)
where u_mx and v_mx are the column matrices representing the primary block information u and the spectrum v; G = [g(k, l)] and F = [f(k, l)] are square matrices with the elements g(k, l) (respectively f(k, l)). The pair of relationships (7.1.37) we call the spectral transformation, the matrix G the spectrum-generating matrix, and F the information-recovery matrix. The matrices G and F are mutually related. From (7.1.37) it follows that
G = F^(-1),    (7.1.38)
while from (7.1.35) it follows that
G = F^T.    (7.1.39)
From this it follows that
F^(-1) = F^T.    (7.1.40)
Using (7.1.35) we write (7.1.36) in the form
u(l) = Σ_{k=1}^{K} g(k, l) v(k),    (7.1.41)
and we denote
u(·) = {u(l), l = 1, 2, ..., K},    (7.1.42a)
g(k, ·) = {g(k, l), l = 1, 2, ..., K}.    (7.1.42b)
Using this notation we write (7.1.41) as
u(·) = Σ_{k=1}^{K} g(k, ·) v(k).    (7.1.43)
Thus equation (7.1.43) has the meaning of a representation of a function w(-) corresponding to the primary block information ii as a superposition of standard functions g(k, •) multiplied by coefficients v(k) determined by the primary information. Therefore, the functions g(k, •) are called basic functions. Section 7.4 shows that representation (7.1.43) has direct counterparts for functions of a continuous argument (s). COMMENT 1 From equations (7.1.29) and (7.1.33), it follows that the spectral representation (7.1.37) depends only on angles between the axes of the coordinate systems Cf and C^. If we rotate both coordinate systems keeping their mutual position fixed the spectral presentation does not change. Therefore, we can take any system as the primary coordinate system Cf. COMMENT 2 Since any mutual angular position of the coordinate systems Cf and C^ is possible, we have a continuum of coordinate systems and thus, we have infinitely many various sets of basic functions g(k, •)• In consequence, infinitely many spectral representations of a primary block information are possible. The concrete choice depends on the reason for using the spectral transformation. In general it is used to simplify the analysis of transformations of information. Of paramount importance are the representations based on harmonic functions (sine, cosine, and complex exponential), because these functions have the unique property that their shape is not changed by a linear stationary transformation. This combined with the superposition feature (3.2.15) permits to gain much insight into the transformation performed by a stationary linear system. Therefore, sets of samples of harmonic functions are very useful for numerical analysis of linear stationary systems. Such an ortho-normal set is presented in the forthcoming example. Of great importance also are spectral representations such that the random variables representing spectral components are non correlated; they are called decorrelating spectral representations. They significantly simplify the subsequent dimensionality reduction or discretization. Section 7.2 is devoted to decorrelating transformations, while in Sections 7.3 and 7.4 we discuss their applications for information dimensionality reduction and quantization. COMMENT 3 For applications only the basic relationships (7.1.30) and (7.1.36) (or equivalently (7.1.37)) are essential. They held if the orthogonality conditions (7.1.18), (7.1.27) and (7.1.34) are satisfied. The geometrical interpretation that we used allowed to obtain those relationships in a simple way, but it played only an auxiliary role. Therefore, the described concept of spectrum of block information can be generalized for information or/and spectrum having other, more complicated structure for which a direct geometrical interpretation is no longer plausible. To get such a generalization we must define the analog of the scalar product and the counterparts of the definitions coefficients/(/:, /) (equivalently, g{k, /)) determining the spectral representation.
Section 7.4 is devoted to such generalizations in the case when information is a function of continuous argument(s). Here the methodology of generalizations is illustrated with a simple example, very important in applications, when the basic functions of a discrete argument are samples of continuous complex exponential functions.
EXAMPLE 7.1.1 SPECTRAL TRANSFORMATION OF VECTOR INFORMATION: DISCRETE FOURIER TRANSFORMATION
As indicated in Comment 2, the complex exponential function exp(jωt) plays an important and unique role in the analysis and synthesis of linear systems and of linear information transformations. Therefore, of great practical importance is the spectral representation of samples of function information using as basic functions f(·, k) sets of samples of complex exponential functions. We now describe such a set in more detail. We assume that
A1. The primary information is a function information of the continuous argument t:
u_f(t), 0 ≤ t ≤ T.    (7.1.44)
A2. The components of the primary vector information are
u(l) = u_f(l T_s),    (7.1.45)
where
T_s = T/K    (7.1.46)
is the sampling period and l = 0, 1, ..., K-1.
A3. The basic function g(k, ·) of the discrete argument occurring in (7.1.43) is a train of samples of the exponential function of a continuous argument,
g(k, l) = A exp(j ω_1 k t)|_{t = l T_s},  l = 0, 1, ..., K-1,    (7.1.47a)
where
ω_1 = 2π/T,    (7.1.47b)
exp x denotes e^x, and A is a constant, which we determine later. From these assumptions it follows that
g(k, l) = A exp(j ω_1 k T_s l) = A exp(j α k l),    (7.1.47)
where
α = ω_1 T_s = 2π/K.    (7.1.48)
To use our previous considerations about spectral representations we must define the counterpart of the scalar product for pairs of sets a = {a(l), l = 1, 2, ..., K} and b = {b(l), l = 1, 2, ..., K} of complex numbers. Since the geometrical interpretation of such sets as vectors is not plausible, we have to look rather at equation (7.1.8) as a basis for the definition of the scalar product. An analysis shows that the proper counterpart, for sets of complex numbers, of the definition of the scalar product for sets of real numbers is
(a, b) = Σ_{l=1}^{K} a(l) b*(l),    (7.1.49)
where b*(l) is the complex conjugate of b(l).
Some elementary algebra shows that for the functions defined by (7.1.47)
Σ_{l=0}^{K-1} g(k, l) g*(m, l) = A² K δ(k, m).    (7.1.50)
To satisfy the ortho-normality conditions (7.1.27) we take
A = 1/√K.    (7.1.51)
Thus, the functions
g(k, l) = (1/√K) exp(j α k l),  k = 0, 1, ..., K-1,    (7.1.52)
of the discrete argument l are ortho-normal in the sense of definition (7.1.5). For the scalar product defined by (7.1.49), equation (7.1.35) takes the form
f(k, l) = g*(l, k).    (7.1.53)
From (7.1.52) we get
f(l, k) = (1/√K) exp(j α k l).    (7.1.54)
Thus, for the assumed definition of the scalar product and for the assumed coefficients f(·, k), the spectral transformation described by (7.1.36) and (7.1.30) takes the form
u(l) = (1/√K) Σ_{k=0}^{K-1} v(k) exp(j α k l),    (7.1.55a)
v(l) = (1/√K) Σ_{k=0}^{K-1} u(k) exp(-j α k l). □    (7.1.55b)
This transformation is called discrete Fourier transformation. It is widely used for digital information analysis and processing. COMMENT 1 The spectral representation (7.1.37) has the same dimensionality. Thus, the same volume as the primary information u. The modification described in the example generates as spectrum K components v(k), which are complex numbers. Thus, the structure blind volume of the described harmonic spectrum is 2K. However, the spectrum has a redundancy, because as can be checked, the spectral components with numbers located symmetrically around K/2 determine each other. If we would eliminate the redundant components of the spectrum the spectral transformation (7.1.55a) respectively, (7.1.55b) would lose its simplicity. This simplicity not only makes the analytical calculations easier. It is also extensively used in the very effective algorithm for numerical calculation of the spectrum, called fast Fourier transformation ( for details and references see Poularikas [7.4], Smith, Smith [7.5] and for programs Press and all.[7.6]). Since the spectral transformation is reversible, every transformation performed on spectral representation can be described as a transformation of the primary information, without introducing explicitly the spectral representation. Thus, the main advantage of spectral transformations is that they provide a presentation of information that gives more insight into some relationships between the primary information and information produced by transformations, especially linear.
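As a hedged illustration of the pair (7.1.55), the following sketch uses numpy's FFT with the norm="ortho" option, which corresponds to the 1/√K factor assumed in the example; it also checks the redundancy of the spectrum of a real train mentioned in Comment 1. The random test train is an assumption made only for this illustration.

```python
import numpy as np

K = 8
u = np.random.default_rng(0).normal(size=K)      # a real primary train u(k)

# Forward transform (7.1.55b) and inverse transform (7.1.55a); the
# norm="ortho" option applies the 1/sqrt(K) factor of the example.
v = np.fft.fft(u, norm="ortho")
u_back = np.fft.ifft(v, norm="ortho")

print(np.allclose(u, u_back.real))               # the transformation is reversible
# Redundancy of the harmonic spectrum of a real train: components located
# symmetrically around K/2 are complex conjugates of each other.
print(np.allclose(v[1:], np.conj(v[1:][::-1])))
```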
7.1.3 SOME IMPORTANT PROPERTIES OF SPECTRAL TRANSFORMATIONS
Two important properties of spectral transformations, which are used in forthcoming considerations, are presented: the distance invariance property and the optimal character of fragments of the spectrum.
DISTANCE PRESERVATION BY SPECTRAL TRANSFORMATIONS
Consider two potential forms u' and u'' of block information. We denote by u'_pt and u''_pt the two points representing the two potential forms of information in the space with coordinate system C_f. In view of (7.1.13) the Euclidean distance between these points defined by (7.1.12) is
d(u'_pt, u''_pt) = {Σ_{l=1}^{K} [u''(l) - u'(l)]²}^(1/2).    (7.1.56)
Let us denote by v' and v'' the g-spectra of u' and u''. We again interpret them as two points v'_pt and v''_pt, now described by their coordinates in the coordinate system C_g. In view of (7.1.19) the distance between those two points is
d(v'_pt, v''_pt) = {Σ_{l=1}^{K} [v''(l) - v'(l)]²}^(1/2).    (7.1.57)
However, according to our definition of the spectrum, v'_pt is just another notation for u'_pt; thus v'_pt = u'_pt. Similarly, v''_pt = u''_pt. Thus,
d(v'_pt, v''_pt) = d(u'_pt, u''_pt),    (7.1.58a)
or, equivalently,
Σ_{l=1}^{K} [v''(l) - v'(l)]² = Σ_{l=1}^{K} [u''(l) - u'(l)]².    (7.1.58b)
The interpretation of (7.1.58) is as follows:
The Euclidean distance between a pair of points representing two potential forms of primary information and the distance between the pair of corresponding spectra are the same.    (7.1.59)
Thus, the spectral transformation is a distance-invariant transformation. This conclusion is very useful for easily calculating the effects of modifications of spectra on the corresponding modifications of the primary block information and vice versa. Taking u'' = 0 in (7.1.58b) we get
Σ_{l=1}^{K} v²(l) = Σ_{l=1}^{K} u²(l).    (7.1.60)
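A small numerical check of the distance-invariance property (7.1.58) and of the energy equality (7.1.60) can be made with any orthonormal spectrum-generating matrix; the matrix below is obtained from a QR factorization of a random matrix, which is only an illustrative choice and not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 6

# Any orthonormal spectrum-generating matrix G will do; here one is obtained
# from the QR factorization of a random matrix (illustrative choice).
G, _ = np.linalg.qr(rng.normal(size=(K, K)))

u1 = rng.normal(size=K)
u2 = rng.normal(size=K)
v1, v2 = G @ u1, G @ u2

# Distance invariance (7.1.58) and the energy equality (7.1.60).
print(np.isclose(np.linalg.norm(v2 - v1), np.linalg.norm(u2 - u1)))
print(np.isclose(np.sum(v1**2), np.sum(u1**2)))
```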
SPECTRAL REPRESENTATIONS COMPRESSING THE VOLUME OF VECTOR INFORMATION
The dimensionality of the discussed spectral representation is the same as or even larger than the dimensionality of the primary information. We now consider the possibility of using spectral representations to compress the volume of primary information, however, at the price of introducing irreversible distortions. Obviously we attempt to keep those distortions as small as possible. This leads to the following problem:
A set C* of M < K ortho-normal vectors g_vc(m), m = 1, 2, ..., M, is given. We look for such an approximation of the vector u_vc by the linear combination
u*_vc = Σ_{m=1}^{M} v(m) g_vc(m),    (7.1.61)
where v = {v(m), m = 1, 2, ..., M} is a set of M coefficients, that the distance d(u, u*) given by equation (7.1.20) is minimized. This problem arises not only in information compression but also in other areas of information processing, such as optimization of information recovery. In general, (7.1.61) is called the problem of optimal linear approximation. Using the symbolic notation for optimization problems introduced in Section 1.6.2, we write the problem as OP v, d[u, u*_vc(v)]. This is a parametric optimization problem which can be solved analytically. However, to gain more insight into the spectral representations we derive here the solution using geometric concepts. The set of all linear combinations of the vectors g_vc(m), m = 1, 2, ..., M, is called the M-DIM space spanned on these vectors and is denoted by S*(M). Since M < K, the optimal approximation is the orthogonal projection of the vector u_vc on the space S*(M).
Thus, the optimal approximation is
u*_ovc = Σ_{m=1}^{M} v_o(m) g_vc(m),    (7.1.65)
and its g-spectrum is
v*_o = (v_o(1), v_o(2), ..., v_o(M), 0, 0, ..., 0).    (7.1.66)
Comparing (7.1.66) with (7.1.28) we get the important conclusion:
The optimal approximation u*_o of a vector u by a linear combination of a subset C* of a complete set C_g of ortho-normal vectors g_vc(m) is obtained from the g-spectrum by rejecting the components corresponding to the g-vectors not belonging to C*.    (7.1.67)
From (7.1.12), (7.1.66), and from the equidistance property (7.1.59) it follows that
d²(u, u*_o) = Σ_{m=M+1}^{K} v²(m)    (7.1.68)
is the squared error of the optimal approximation. From (7.1.66) it follows that the optimal approximation can be represented by an M-DIM information. Therefore, conclusion (7.1.67) is of paramount importance for the optimization of the recovery of information after the dimensionality reduction that is considered in Sections 7.3 and 7.4.
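The following sketch illustrates conclusion (7.1.67): the optimal M-term approximation is obtained by zeroing the rejected spectral components, and its squared error equals the energy of those components, as in (7.1.68). The orthonormal set and the test vector are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 8, 5

G, _ = np.linalg.qr(rng.normal(size=(K, K)))   # rows of G: an ortho-normal set g(m)
u = rng.normal(size=K)
v = G @ u                                      # g-spectrum of u

# Optimal approximation (7.1.65): keep the first M spectral components.
v_trunc = v.copy()
v_trunc[M:] = 0.0
u_approx = G.T @ v_trunc                       # inverse transform, since G^{-1} = G^T

# Squared approximation error equals the energy of the rejected components (7.1.68).
print(np.isclose(np.sum((u - u_approx)**2), np.sum(v[M:]**2)))
```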
7.2 DECORRELATING SPECTRAL REPRESENTATIONS OF VECTOR INFORMATION
If the components of a structured information can be represented as random variables then, as we have indicated in Section 5.1.1, the statistical correlation coefficients are indicators of statistical relationships between the components of the structured information. It can be expected that when we process a block information with correlated components, the statistical relationships between the components make the results of the transformation obscure. In particular, if we reject some components to achieve dimensionality compression, we may expect the evaluation of the consequences of such a transformation to be simpler if the components of the primary block information are not correlated. This, in turn, could simplify the optimization of the dimensionality reduction. In the next section we show that if we require the decorrelating transformation to be a spectral transformation, then the distance invariance of those transformations permits us to find an optimal dimensionality reduction in a simple way. Therefore, we concentrate in this section on decorrelating spectral transformations.
7.2.1 BASIC CONCEPTS
The principal possibility of achieving decorrelation by a spectral transformation is illustrated below with a simple example.
EXAMPLE 7.2.1 DECORRELATION OF TWO-DIMENSIONAL INFORMATION BY SPECTRAL REPRESENTATION
We assume that
A1. the primary information is two-dimensional: u = {u(1), u(2)};
A2. the information exhibits statistical regularities and can be treated as an observation of the two-dimensional random variable U = {u(1), u(2)};
A3. E u(1) = 0, E u(2) = 0, E u²(1) = E u²(2); the components are correlated, that is, E u(1)u(2) ≠ 0.
Figure 7.3. Geometrical interpretation of the decorrelating transformation in the two-dimensional case.
Our task is to look for such a spectral representation of the primary information that the components of the representation are not correlated. Figure 7.2b takes for the two-dimensional case the form shown in Figure 7.3. The mutual position of the two coordinate systems is determined by the angle α = α(1, 1) between the unit vectors f_vc(1) and g_vc(1). From (7.1.2) and from Figure 7.3 it follows that the spectrum-generating matrix (see page 315) is
G = [[cos α, sin α], [-sin α, cos α]].    (7.2.1)
Let us denote by V = {v(1), v(2)} the random variable representing the spectrum v. Since (7.1.37b) holds for every realization, we have
V = G U.    (7.2.2)
Using (7.2.1) and performing the matrix multiplication, we get
v(1) = u(1) cos α + u(2) sin α,    (7.2.3a)
v(2) = -u(1) sin α + u(2) cos α.    (7.2.3b)
Next we calculate the correlation coefficient E v(1)v(2). Substituting (7.2.3) we get
E v(1)v(2) = {E[u²(2)] - E[u²(1)]} cos α sin α + {E[u(1)u(2)]}(cos²α - sin²α).    (7.2.4)
Taking into account assumption A3 on page 321, we have
E v(1)v(2) = {E[u(1)u(2)]}(cos²α - sin²α).    (7.2.5)
We see that E v(1)v(2) = 0 for α = π/4. Then the components of the spectral representation (7.2.2) of the primary block information u are not correlated and the spectrum-generating transformation (7.1.37b) takes the form
v(1) = (1/√2)[u(1) + u(2)],    (7.2.6a)
v(2) = (1/√2)[u(1) - u(2)]. □    (7.2.6b)
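The result of Example 7.2.1 can also be checked on synthetic data: for zero-mean components with equal variances (assumption A3), the rotation by π/4 makes the sample correlation of the spectral components approximately vanish. The gaussian generator and the numerical correlation value below are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000

# Zero-mean components with equal variances and positive correlation
# (assumption A3); generated here from a joint gaussian for illustration.
C = np.array([[1.0, 0.6],
              [0.6, 1.0]])
U = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=N)

alpha = np.pi / 4
G = np.array([[np.cos(alpha),  np.sin(alpha)],
              [-np.sin(alpha), np.cos(alpha)]])    # spectrum-generating matrix (7.2.1)
V = U @ G.T                                        # v = G u for every realization

print(np.cov(U.T)[0, 1])   # clearly non-zero for the primary components
print(np.cov(V.T)[0, 1])   # close to zero after the rotation by pi/4
```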
This simple example shows that it is possible to find a spectral transformation that is simultaneously a decorrelating transformation. However, when the dimensionality of the block information is large, the straightforward procedure that we used in the example would become prohibitively complicated. We now derive a spectral decorrelating transformation which is not only feasible for larger dimensionalities but also leads directly to an optimal dimensionality-reducing algorithm.
7.2.2 THE EIGEN VECTORS OF THE CORRELATION MATRIX
One of the fundamental theorems of linear algebra is this: If a K x K matrix C_uu is a positive definite symmetric matrix, then K vectors (column matrices) e(k), k = 1, 2, ..., K, exist such that
(e(k), e(m)) = δ(k, m),  for all k, m,    (7.2.7)
C_uu e(k) = γ(k) e(k),  k = 1, 2, ..., K,    (7.2.8)
and
γ(k) > 0.    (7.2.9)
The vectors (column matrices) e(k) are called eigen (own) vectors of the matrix C_uu and the parameters γ(k) are called eigen values. The interpretation of (7.2.8) is that an eigen vector is such a vector that multiplying it by the matrix has the same result as multiplying the vector by a scalar (the associated eigen value). We write equation (7.2.8) in the matrix form
[C_uu - γ(k) D_1] e(k) = 0,    (7.2.10)
where
D_1 = [δ(m, n)]    (7.2.11)
is the diagonal unit matrix and 0 is the matrix with all elements 0. The matrix C_uu - γ(k) D_1 is the matrix C_uu in which the elements c_uu(k, k) are replaced by c_uu(k, k) - γ(k). From (7.2.10) it follows that the eigen values are solutions of the equation
(C_uu - γ D_1) e = 0.    (7.2.12)
If the determinant of the matrix C_uu - γ D_1 is not equal to zero, then the zero vector e = 0 is the only solution of equation (7.2.10). Therefore, it must be that
det(C_uu - γ D_1) = 0.    (7.2.13)
With respect to γ, this is an algebraic equation of degree K. Thus, we can find the eigen values of the matrix C_uu as the roots of (7.2.13) considered as an algebraic equation for γ. This equation is called the characteristic equation of the matrix C_uu. Knowing an eigen value, we find the components of the associated eigen vector from (7.2.12), considered now as a set of linear equations. However, because of (7.2.13) those equations are linearly dependent and their solution depends on an
undetermined parameter. We can find it by taking into account the condition (7.2.7) that the length of the eigen vector should be 1. Summarizing, the algorithm for finding the eigen values and eigen vectors is:
A1. consider (7.2.13) as an algebraic equation for γ and find its roots; they are the eigen values of the matrix C_uu;
A2. for a given eigen value consider (7.2.12) as a set of linear equations for the components of the eigen vector and find its solutions; they are not unique but depend on a parameter;
A3. determine the parameter which occurred in step A2 from the condition
Σ_{l=1}^{K} e²(k, l) = 1,    (7.2.14)
where e(k, l) are the components of the eigen vector e(k). We illustrate this algorithm with a simple example.
EXAMPLE 7.2.2 EVALUATION OF EIGEN VALUES AND EIGEN VECTORS
We again make assumptions A1 to A3 from Example 7.2.1 (page 321). On these assumptions the correlation matrix is
C_uu = [[c_uu(1,1), c_uu(1,2)], [c_uu(1,2), c_uu(1,1)]].    (7.2.15)
The characteristic equation (7.2.13) takes the form
det[[c_uu(1,1) - γ, c_uu(1,2)], [c_uu(1,2), c_uu(1,1) - γ]] = 0.    (7.2.16)
Evaluating the determinant we get
[c_uu(1,1) - γ]² - c_uu²(1,2) = 0.    (7.2.17)
The solutions of this equation are
γ(1) = c_uu(1,1) + c_uu(1,2),    (7.2.18a)
γ(2) = c_uu(1,1) - c_uu(1,2).    (7.2.18b)
For γ = γ(1), equation (7.2.12) yields two scalar equations:
c_uu(1,1) e(1,1) + c_uu(1,2) e(1,2) = [c_uu(1,1) + c_uu(1,2)] e(1,1),
c_uu(1,2) e(1,1) + c_uu(1,1) e(1,2) = [c_uu(1,1) + c_uu(1,2)] e(1,2).    (7.2.19)
From the first equation we get
e(1,2) = e(1,1),    (7.2.20)
and we see that the second equation is equivalent to the first (this is the consequence of (7.2.16)). The normalization condition (7.2.14) takes the form
e²(1,1) + e²(1,2) = 1.    (7.2.21)
From (7.2.19) and (7.2.20) we finally get the eigen vector corresponding to the eigen value γ(1):
e(1) = {1/√2, 1/√2}.    (7.2.22a)
In a similar way we get the eigen vector corresponding to γ(2):
e(2) = {1/√2, -1/√2}. □    (7.2.22b)
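For readers who prefer a numerical check, the eigen values and eigen vectors of the 2x2 correlation matrix (7.2.15) can be obtained with a standard routine such as numpy.linalg.eigh; the values of c_uu(1,1) and c_uu(1,2) below are illustrative assumptions, not taken from the text.

```python
import numpy as np

c11, c12 = 1.0, 0.6             # illustrative values of c_uu(1,1) and c_uu(1,2)
C = np.array([[c11, c12],
              [c12, c11]])

# eigh returns the eigen values in ascending order and the eigen vectors
# as the columns of the second output.
gamma, E = np.linalg.eigh(C)

print(gamma)                    # [c11 - c12, c11 + c12], as in (7.2.18a, b)
print(E)                        # columns proportional to (1, -1)/sqrt(2) and (1, 1)/sqrt(2)
```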
7.2.3 THE DECORRELATION BASED ON EIGEN VECTORS
The set of equations (7.2.8) defining the eigen vectors can be written in the matrix form
C_uu E = E D_γ,    (7.2.23)
where E is the square matrix whose columns are the eigen vectors, considered as column matrices:
E = [e(1) | e(2) | ... | e(K)],    (7.2.24)
and
D_γ = diag[γ(1), γ(2), ..., γ(K)]    (7.2.25)
is the diagonal matrix with the eigen values γ(k) on the diagonal. Since the eigen vectors satisfy the conditions (7.2.7), they can be used as unit vectors of an orthogonal coordinate system. We denote this system C_e and consider it a special case of the second coordinate system C_g, which was introduced in Section 7.1.2. Thus, the eigen vectors now play the role of the unit coordinate vectors g_vc(k). We denote by G_e the matrix generating the e-spectrum. From the definition of the matrix G occurring in (7.1.37b) it follows that
G_e = [e_mx^T(1); e_mx^T(2); ...; e_mx^T(K)],    (7.2.26)
that is, the rows of G_e are the transposed eigen vectors e_mx(k) considered as column matrices. Comparing this definition with definition (7.2.24) we see that
G_e = E^T,    (7.2.27)
and from (7.1.40) it follows that
E^T = E^(-1).    (7.2.28)
We now show that the spectral representation corresponding to this coordinate system is a decorrelating representation. From (7.2.27) and (7.2.28) we get
G_e = E^T,  G_e^T = E.    (7.2.29)
From the definition of G_e it follows that
V = G_e U    (7.2.30)
is the random variable representing the e-spectrum. From (4.4.37) we obtain the correlation matrix of V:
C_vv = G_e C_uu G_e^T.    (7.2.31)
Substituting (7.2.29) we have
C_vv = E^T C_uu E.    (7.2.32)
From (7.2.23) we have next
C_vv = E^T E D_γ,    (7.2.33)
where D_γ is the diagonal matrix given by (7.2.25). However, from (7.2.7) it follows that
E^T E = D_1.    (7.2.34)
Substituting this in (7.2.33) we finally get
C_vv = D_γ,    (7.2.35)
or equivalently
E v²(k) = γ(k),  k = 1, 2, ..., K,    (7.2.36a)
E v(k)v(l) = 0,  k ≠ l.    (7.2.36b)
Thus, we have proved that the transformation
v = G_e u,    (7.2.37)
where G_e is the matrix whose rows are the eigen vectors of the correlation matrix C_uu of the primary information, is a spectral decorrelating transformation.
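A quick numerical verification of (7.2.35)-(7.2.37): building G_e from the eigen vectors of an (illustrative, randomly generated) positive definite symmetric matrix and forming G_e C_uu G_e^T indeed yields the diagonal matrix of eigen values. The data below are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(4)

# An illustrative positive definite symmetric correlation matrix C_uu.
A = rng.normal(size=(4, 4))
C_uu = A @ A.T + 4 * np.eye(4)

gamma, E = np.linalg.eigh(C_uu)   # columns of E are the eigen vectors
G_e = E.T                         # rows of G_e are the eigen vectors, as in (7.2.27)

# Correlation matrix of the e-spectrum, equation (7.2.31) with (7.2.29):
C_vv = G_e @ C_uu @ G_e.T

print(np.allclose(C_vv, np.diag(gamma)))   # C_vv is the diagonal matrix of eigen values
```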
EXAMPLE 7.2.4 EVALUATION OF THE DECORRELATING MATRIX
We make the same assumptions as in Example 7.2.2. From (7.2.22a, b) we get the matrix
E = [[1/√2, 1/√2], [1/√2, -1/√2]].    (7.2.38)
From (7.2.27) and (7.2.38) we obtain the decorrelating transformation
[v(1); v(2)] = [[1/√2, 1/√2], [1/√2, -1/√2]] [u(1); u(2)] = [(1/√2)(u(1) + u(2)); (1/√2)(u(1) - u(2))].    (7.2.39)
As could be expected, this is the same transformation that was obtained in Example 7.2.1. □
COMMENT To simplify the terminology we assumed that the structured information is a realization of a multidimensional random variable, and we consider the statistical correlation coefficients. However, our argument can be used directly when we do not know whether the information exhibits statistical regularities, but when several observations of the information are available. Then, we can use the concepts of intelligent information processing presented in Section 1.7.2. To realize them, using the concepts presented in Sections 4.1.1 and 4.3.1, we replace the operator of statistical averaging in definition (4.4.25a) of the correlation coefficients by the arithmetical averaging operation defined by (4.1.3). Thus, we introduce the empirical correlation coefficients. Then the transformation (7.2.37) would produce decorrelation in the sense of empirical correlation coefficients and the dimensionality reduction obtained by algorithm (7.3.61) would be optimal in the sense of arithmetical average of square errors. Such dimensionality reduction could be applied ex post when the whole train is available in an adaptive system similar to the compression system discussed in Comment 2, page 266.
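A minimal sketch of the empirical variant suggested in the comment: the statistical averaging is replaced by the arithmetical average over N observed trains, and the eigen vectors of the resulting empirical correlation matrix are used for decorrelation. The data and the function name are assumptions made for this illustration only.

```python
import numpy as np

def empirical_correlation(U):
    """U: array of shape (N, K); rows are observed trains, assumed zero-mean."""
    N = U.shape[0]
    # Arithmetical average of u(k)u(l) over the N observations, replacing
    # the statistical averaging operator E.
    return U.T @ U / N

rng = np.random.default_rng(5)
U = rng.normal(size=(1000, 4))          # illustrative set of observed trains
C_emp = empirical_correlation(U)
gamma, E = np.linalg.eigh(C_emp)        # eigen vectors for the empirical decorrelation (7.2.37)
print(np.round(E.T @ C_emp @ E, 3))     # approximately diagonal
```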
7.3 REDUCTION OF DIMENSIONALITY OF VECTOR INFORMATION When information is continuous it is often useful prior to subsequent processing, particularly before discretization, to transform the information into information that is still continuous but has lower dimensionality. We call such a transformation dimensionality reduction. We consider here the case when no deterministic structural constraints such as described in Section 6.6.2 exist, or if they do, they are not taken into account. Then the dimensionality reduction is an irreversible transformation and similarly as in the case of quantization considered in Section 6.6.1, the information compressing transformation is not characterized by a single parameter, such as the compression ratio, but by a compression ratio versus distortions of reconstructed information trade off relationships. To determine the minimal distortions caused by the irreversible dimensionality reduction, we have to optimize jointly the dimensionality reduction and information recovery rules. The systematic approach to such optimization problems is presented in Section 8.1. Here we explain the basic features of irreversible dimensionality reduction with a study case using only a few heuristic assumptions. Then, we present a general algorithm for dimensionality reduction and recovery. 7.3.1 A STUDY CASE We do again the assumptions Al and A2 from Example 7.2.1 (page 321). The dimensionality reduction is an irreversible transformation, and to gain more insight into its properties we have to analyze the aggregation sets characterizing the transformation. For such an analysis the knowledge of a rough description by variances and correlation coefficients is not sufficient and exact description of the probability distribution is needed. Therefore in addition to Al and A2 we introduce the assumption A4. The random variables i!ii(l) and i!]i(2) representing the components have density of joint probability pjji) shown in Figure 7.4a.
Figure 7.4. Direct truncation of 2-DIM information: (a) density p_u(u) of probability of the primary information; the density is constant in the shaded area, (b) the dimensionality reduction (projection on the u(1) coordinate axis), (c) aggregation sets A_u(v) corresponding to the direct truncation of the primary information, (d) the transformations recovering the second component of information: optimal linear, blind for statistical relationships (solid line); optimal linear, direct truncation (solid line); optimal nonlinear, direct truncation (dashed line).
It is easily seen that the marginal probability densities of the components of the information are uniform distributions. From equations (4.4.11) and (4.4.25b) we get
E u(1) = E u(2) = 0,  E u²(1) = E u²(2) = (1/3)a²,  E u(1)u(2) = (1/4)a².    (7.3.1)
DIMENSIONALITY REDUCTION BY DIRECT TRUNCATION
We first consider the reduction of the dimensionality from two down to one by rejecting the second component u(2). Thus, the compressed information is
v = V_1(u) = u(1),    (7.3.2)
where V_1(·) is the dimensionality-reducing transformation. The index 1 is a reminder that we keep the first component of the primary information. Let us present the primary information as a point u_pt. The transformation (7.3.2) has the geometrical meaning of an orthogonal projection of the point u_pt on the u(1) coordinate axis and taking as the description of the projected point its position v on the axis (see Figure 7.4b). Thus, the considered transformation is a special case of the continuous next neighbor transformation (see Section 1.5.3, Figure 1.17) or, equivalently, of the linear approximation discussed on page 320.
The aggregation set A(v) corresponding to compressed information v is the set of points having the same projection; thus, it is an interval perpendicular to the axis u(1), going through the point u(1) = v, as shown in Figure 7.4b. To evaluate the irreversible distortions caused by the information compression, we look for the possibilities of recovering the primary, 2-DIM information u when only the compressed information v is available. We denote by
u_r = {u_r(1), u_r(2)}    (7.3.3a)
the recovered information and by
U_r(·) = {U_r1(·), U_r2(·)}    (7.3.3b)
the rule of information recovery (the transformation of the 1-DIM compressed information into the two-dimensional recovered information). Thus,
u_r = {U_r1(v), U_r2(v)}    (7.3.4)
is the recovered primary information. As the indicator of performance of the information compression system we take
Q = E Σ_{k=1}^{2} [u(k) - u_r(k)]²,    (7.3.5)
where u(n), respectively u_r(n), n = 1, 2, are the random variables representing the corresponding components of the primary, respectively the recovered, information. Thus, we face the OP U_r(·), Q (see Section 1.6.2). In view of (7.3.2) we recover the first component exactly by taking
u_r(1) = v = u(1).    (7.3.6)
Therefore,
Q = Q_2,    (7.3.7a)
where
Q_2 = E[u(2) - u_r(2)]² = E[u(2) - U_r2(v)]².    (7.3.7b)
Thus, the OP U_r(·), Q reduces to OP U_r2(·), Q_2. The systematic procedures for solving such problems are presented in Section 8.2. Here we derive optimal recovery transformations using some heuristic arguments. Their full justification is provided in Section 8.3. It is evident that the optimal transformation is determined by the properties of the performance indicator Q_2 and that those properties depend, in turn, on the statistical regularities of the primary information. The density p_u(u) of the joint probability of the components of the primary information provides the exact description of the statistical regularities. Therefore, in general, the optimal transformation producing the recovered information depends on the density p_u(u) (see Section 1.7.1, particularly Figure 1.24). Section 8.3 shows that even in simple cases this dependence is so complicated that the implementation of the optimal transformation would be costly. Usually, the problem of implementation of such a transformation does not even arise, because the density p_u(u) is not known exactly and only simplified rough descriptions of the statistical properties are available. For these reasons, we are interested in such classes of transformations producing the recovered information that 1) the performance indicator (and, in turn, the optimal transformation) depend only on rough descriptions of the statistical
properties of the primary information, which can be easily acquired, and 2) the implementation of the optimum rule is not excessively costly. We concentrate here on linear transformations, which satisfy both conditions. In particular, we show that for those transformations the performance indicator Q_2 depends only on the mean values and correlation coefficients of the primary information, but not on other features of the density p_u(u). Thus, we consider the OP U_r2o(·), Q_2 | U_r2(·) in the class of linear transformations.
RECOVERING TRANSFORMATIONS BLIND FOR STATISTICAL RELATIONSHIPS
We start with the transformations that do not use the information about the first component of the primary information at all. Since the indicator Q_2 of the performance of such a transformation does not depend on the statistical relationships between the components of the primary information, we call the transformation blind for statistical relationships. Such a deterministic transformation can assign to a compressed information v = u(1) only a constant h. Thus,
U_r2(v) = h = const.    (7.3.8)
From (7.3.7) it follows that, for transformations of the form (7.3.8), the OP U_r2o(·), Q_2 reduces to OP h, Q_b, where
Q_b = E[u(2) - h]².    (7.3.9)
The subscript b is a reminder that we consider transformations which are blind for statistical relationships. The optimal value of the constant, which we denote by h_o, is the solution of the equation
dQ_b/dh = 0.    (7.3.10)
Since the sequence of the operations E and d/dh can be interchanged, using (7.3.9) we get
dQ_b/dh = -2[E u(2) - h].    (7.3.11)
From (7.3.10) and (7.3.11) it follows that the optimal value of the constant is
h_o = E u(2).    (7.3.12)
In view of (7.3.1) we have h_o = 0. Thus, the optimal recovering transformation that is blind for the statistical relationships between the components of the primary information is
U_b2o(v) = 0.    (7.3.13)
In view of (7.3.1) it follows that the blind-for-statistical-relationships recovery transformation
U_bo(v) = {v, 0}    (7.3.14)
is the optimal rule of recovering the primary two-dimensional information from the compressed, one-dimensional information. Its diagram is shown in Figure 7.4d. Substituting this in (7.3.7) we get the optimal performance of recovery rules blind for statistical relationships between the components of the primary information:
Q_bo = E u²(2).    (7.3.15a)
Taking the numerical value from (7.3.1), we obtain
Q_bo = (1/3)a² = 0.33a².    (7.3.15b)
LINEAR RECOVERING TRANSFORMATIONS
Since the components u(1) and u(2) are correlated, we may expect that we could improve the performance of the recovery of the primary information if we use their statistical relationship. The general procedures for finding optimal information-recovering rules are presented in Section 8.2. Here we assume that the recovery rule is a linear rule, that is, the recovered information is
u_r(2) = h(0) + h(1)v,    (7.3.16)
where h(0) and h(1) are two coefficients. Thus, the OP U_r2o(·), Q_2 in the class of linear transformations reduces to OP {h(0), h(1)}, Q_2. We show that the optimal linear rule minimizing the mean square performance indicator Q_2 (given by (7.3.7b)) depends only on the mean values of the components and on the correlation coefficients. Therefore, the class of linear transformations can be considered as a class of transformations that, besides the correlation coefficients, are blind for other features of the statistical relationships. To simplify the calculations and to provide material for subsequent generalizations, we denote
v(0) = 1,    (7.3.17)
v(1) = v,    (7.3.18)
and we write (7.3.16) in the form
u_r(2) = Σ_{k=0}^{1} h(k) v(k).    (7.3.19)
In view of (7.3.2) and (7.3.18), v(1) = u(1). We denote u(0) = v(0) = 1. Using this notation we write (7.3.7b) in the form
Q_2[h(0), h(1)] = E[u(2) - Σ_{k=0}^{1} h(k) u(k)]².    (7.3.20)
We solve OP {h(0), h(1)}, Q_2 similarly to OP h, Q_b. The generalization of (7.3.10) is the set of equations
∂Q_2/∂h(k) = 0,  k = 0, 1.    (7.3.21)
Similarly as in the derivation of (7.3.11), we interchange in (7.3.21) the sequence of the operations E and ∂/∂h(k) and get
∂Q_2/∂h(k) = -2E{[u(2) - Σ_{m=0}^{1} h(m)u(m)] u(k)} = -2[E u(2)u(k) - Σ_{m=0}^{1} E u(m)u(k) h(m)] = -2[c_uu(2, k) - Σ_{m=0}^{1} c_uu(m, k) h(m)],  k = 0, 1,    (7.3.22)
where
c_uu(m, k) = E u(m)u(k),  k = 0, 1,  m = 0, 1, 2,    (7.3.23)
are the correlation coefficients. Since u(0) = 1, we have
c_uu(0, 0) = 1,  c_uu(0, n) = E u(n),  n = 1, 2.    (7.3.24)
From (7.3.24) and (7.3.1) it follows that
c_uu(0, n) = 0,  n = 1, 2.    (7.3.25)
Substituting (7.3.22) with (7.3.25) in (7.3.21), after some elementary algebra we get from the set (7.3.21) of equations the optimal coefficients
h_o(0) = 0,    (7.3.26a)
h_o(1) = c_uu(1, 2)/c_uu(1, 1).    (7.3.26b)
Thus, the linear transformation producing the optimally recovered second component of the primary information is
U_r2o(v) = h_o(1) v.    (7.3.27)
Substituting the numerical values from (7.3.1) in (7.3.26b) we get
h_o(1) = 0.75.    (7.3.28)
The diagram of the optimal linear information-recovering transformation
U_ro(v) = {v, U_r2o(v)}    (7.3.29)
as a function of v is shown in Figure 7.4d. To calculate the best performance indicator (minimal distortions) that is achieved by the rule (7.3.29), we have to return to the general equation (7.3.20). Squaring the expression in the braces on the r.h.s. of this equation and proceeding as in the derivation of (7.3.22), we get
Q_2[h(0), h(1)] = E[u²(2)] - 2 Σ_{k=0}^{1} E[u(2)u(k)] h(k) + Σ_{k=0}^{1} Σ_{m=0}^{1} E[u(k)u(m)] h(k)h(m) = c_uu(2, 2) - 2 Σ_{k=0}^{1} c_uu(2, k) h(k) + Σ_{k=0}^{1} Σ_{m=0}^{1} c_uu(k, m) h(k)h(m).    (7.3.30)
After substituting (7.3.26) and simple calculations we get the minimal (in the class of linear transformations) mean square distortion
Q_lo = Q_2[h_o(0), h_o(1)] = c_uu(2, 2) - c_uu²(1, 2)/c_uu(1, 1).    (7.3.31)
Taking the numerical values from (7.3.1) we obtain
Q_lo = 0.145 a².    (7.3.32)
The assumption made at the beginning, that we achieve the compression by rejecting the second component u(2), was arbitrary. We may reject the first. Let us look closer at such a possibility. Since we made the same assumptions about both components, if we rejected the first component, then the quality of the counterpart of the recovery rule (7.3.19) would be given by the counterpart of (7.3.31) with the roles of the indices 1 and 2 interchanged. However, in view of (7.3.1) the value of the minimum distortions would be the same. Thus, it makes no difference which component of the primary block information u is rejected. Equation (7.3.26) justifies our anticipation that the linear rule minimizing the mean square distortions of information processing depends on the correlation coefficients but not on other features of the joint probability. This is an advantage but also a disadvantage of the considered optimization problem.
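The optimal coefficient (7.3.26b) and the minimal linear-rule distortion (7.3.31) can be evaluated directly from the second-order moments. The sketch below uses the moment values as read here from (7.3.1); the scale parameter a and the use of Python are assumptions made for this illustration only.

```python
import numpy as np

a = 1.0                         # scale parameter of the density in Figure 7.4a (arbitrary)

# Second-order moments of the primary information as read from (7.3.1).
c11 = a**2 / 3                  # E u^2(1)
c22 = a**2 / 3                  # E u^2(2)
c12 = a**2 / 4                  # E u(1) u(2)

h1_opt = c12 / c11              # optimal linear coefficient (7.3.26b)
Q_lin = c22 - c12**2 / c11      # minimal mean-square distortion of linear rules (7.3.31)

print(h1_opt)                   # 0.75, as in (7.3.28)
print(Q_lin)                    # minimal linear-rule distortion, cf. (7.3.32)
```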
The solution is simple and depends only on the simple rough description of the statistical properties of the information. However, if more exact information about the statistical properties is available, it is possible that a nonlinear recovery rule, capable of utilizing the more exact information, would be better than the optimal linear rule. We now give an example of such a situation.
A NONLINEAR RECOVERY RULE
Operating with random variables, we considered all potential forms of the primary and recovered information. Let us now assume that the compressed information is fixed, say, it is v'. We know then that the point u_pt representing the primary information lies in the aggregation set A(v') shown in Figure 7.4b. Since the density of probability p_u(u) within this set is constant, we may expect that we minimize the distortion of recovery if we take as the recovered information the point u_rpt lying in the middle of the aggregation set (interval) A(v'). From Figure 7.4c we see that this point has the coordinates
u_r = U_nlo(v) = {v, a/2} if v > 0,  {v, -a/2} if v < 0,    (7.3.33)
where U_nlo(·) is the transformation assigning to the compressed information v the recovered information u_r. This is a nonlinear transformation and it is optimal; hence the index nlo. The diagram of this transformation is shown in Figure 7.4d. On the condition v = const, the density of the conditional probability in the aggregation set A(v') is constant and the conditional mean square distortion is
E[u(2) - a/2]² = (1/12)a².    (7.3.34)
Since this does not depend on v', the overall mean square distortion caused by the optimized nonlinear recovery rule (7.3.33) is
Q_nlo = (1/12)a² = 0.083a².    (7.3.35)
TRUNCATION AFTER DECORRELATION
We now analyze the dimensionality reduction by truncation applied not to u directly but to its decorrelating spectral representation, as shown in Figure 7.5b. We can use the results of Example 7.2.2 because, for the now assumed probability density p_u(u), the assumptions of that example are satisfied. The decorrelating spectral transformation given by (7.2.6) is
w(1) = (1/√2)[u(1) + u(2)],    (7.3.36a)
w(2) = (1/√2)[u(1) - u(2)].    (7.3.36b)
We denote this transformation by T_sd(·). Thus,
w = {w(1), w(2)} = T_sd(u)    (7.3.37)
is the decorrelated spectral representation of the primary information.
Figure 7.5. Block diagrams of the dimensionality reduction and recovery systems: (a) direct dimensionality reduction, (b) dimensionality reduction after decorrelation.
From (7.3.36) and (7.3.1) it follows that
E w(1) = E w(2) = 0,    (7.3.38)
and from (7.3.1) and (7.3.36) we have
E w²(1) = c_uu(1, 1) + c_uu(1, 2),    (7.3.39a)
E w²(2) = c_uu(1, 1) - c_uu(1, 2).    (7.3.39b)
Since the transformation is a decorrelating transformation,
E w(1)w(2) = 0.    (7.3.40)
Example 7.2.1 shows that the decorrelation can be interpreted as the representation of the primary information in another orthogonal coordinate system, which in this case is turned by π/4 relative to the primary system. Therefore, the diagram of the density p_w(w) of the joint probability of the components of the decorrelated information is obtained by turning the diagram of the density p_u(u) of the joint probability of the components of the primary information by π/4, as shown in Figure 7.6b. We achieve a two- into one-dimensional compression of the spectral representation of the primary block information by rejecting one of the components of the spectrum, say w(2). We denote this rule by V_w1(·). Thus, the compressed spectrum is
v = V_w1(w) = w(1).    (7.3.41)
The compression again has the meaning of a projection (see Section 1.5.3, particularly Figure 1.17b); however, we now project the point representing the primary information on the w(1) coordinate axis, as shown in Figure 7.6b. The corresponding aggregation sets A_w(v) are shown in Figure 7.6c.
Figure 7.6. The effects of decorrelation: (a) the density p_u(u) of probability of the primary information and an aggregation set corresponding to direct truncation (in the shaded area the density of probability is constant), (b) the density p_w(w) of probability of the decorrelated spectral representation and an aggregation set A_w(v) corresponding to the truncation of the second component of the spectrum, (c) the aggregation set A_wu(v) of the primary information corresponding to the truncation of the spectrum.
To recover the primary information we first recover the spectrum. We denote by w_r = {w_r(1), w_r(2)} the recovered spectrum. As an indicator of the performance of the recovery we take, similarly to (7.3.5),
Q_w1 = E Σ_{k=1}^{2} [w(k) - w_r(k)]².    (7.3.42)
The subscript w1 is a reminder that Q_w1 is an indicator of the quality of recovery of the spectrum w when only the exact information about the first component of the spectrum is available; w(k) (respectively w_r(k)) are the random variables representing the components of the primary (respectively of the recovered) spectrum. Similarly to (7.3.6), the first component of the spectrum can be recovered exactly,
w_r(1) = v = w(1),    (7.3.43)
and the indicator of spectrum recovery performance is
Q_w1 = E[w(2) - W_r2(v)]²,    (7.3.44)
where W_r2(·) is the rule of recovery of the second component of the spectrum. We are interested in the optimal rule W_r2o(·) minimizing the indicator Q_w1. To make the implementation simple we assume that the recovery rule is linear. Then we face the same linear optimization problem as the previously discussed problem (see page 331) of direct recovery of the primary information. Therefore, after changing notation we can use the previously derived equations (7.3.26).
We denote by h^J
, ^ o o o ^
t
Using (7.3.39a) we get
Q_w2o = E w²(1)    (7.3.51)
= c_uu(1, 1) + c_uu(1, 2).    (7.3.52)
Comparing this with (7.3.49) we see that if the compression is achieved by the rejection of the first component of the decorrelated spectrum, then the performance of the optimal recovery of the spectrum is worse than in the case when the compression is achieved by the rejection of the second component. The obvious reason for this is that, contrary to the mean square values of the components of the primary information (see (7.3.1)), the mean square values of the decorrelated spectral components are different (see (7.3.39)). This suggests the following conclusion:
We optimize the truncation of the decorrelated spectrum of a 2-DIM vector information by rejecting the spectral component with the smaller mean square value.    (7.3.53)
Up to this point we considered the recovery of the spectrum. The considerations on page 321 on optimal linear approximation suggest recovering the primary information by performing on the recovered spectrum the transformation T_sd^(-1)(·), which inverts the transformation T_sd(·) producing the spectrum. Thus, we take as the recovered primary information
u_r = T_sd^(-1)(w_r),    (7.3.54)
where w_r is given by (7.3.47). Figure 7.5b illustrates this assumption. The indicator of the overall performance of the compression system (truncation after decorrelation) is
Q_td = E Σ_{k=1}^{2} [u(k) - u_r(k)]².    (7.3.55)
Because we recover the primary information from the recovered spectrum by the inverted spectral transformation, we can use the distance preservation property (7.1.58). From it and from (7.3.55), (7.3.42), and (7.3.43) we get
Q_td = Q_w1o = E w²(2).    (7.3.56)
Substituting (7.3.50) we have
Q_td = 0.083a².    (7.3.57)
To compare the described compression and recovery systems it is convenient to introduce the normalized overall performance indicator
Q' = Q / Σ_{k=1}^{2} E u²(k).    (7.3.58)
The previously obtained numerical results ((7.3.15b), (7.3.32), (7.3.35), and (7.3.57)) are summarized in Table 7.3.1.

Dimensionality Reduction                  | Optimized Recovery Rule             | Normalized Performance Indicator
Direct truncation                         | Blind for statistical relationships | 0.5
Direct truncation                         | Linear                              | 0.096
Direct truncation                         | Nonlinear                           | 0.055
Optimized truncation after decorrelation  | Linear                              | 0.055
Table 7.3.1. The comparison of the performance of the information compression systems. COMMENT 1 For the volume of the considered continuous information we use definition (6.6.22). The volume compression coefficient defined by (6.1.30 ) has for all considered systems the same value i3[r(-)]=0.5. Thus, the systems can be compared on the basis of the indicator Q' of accuracy of the recovered information. Because we optimized the recovery rules the comparison is fair.
Table 7.3.1 shows the inferiority of the system that is blind for statistical relationships between the components of primary information. The reason is obvious: the other systems use more accurate rough descriptions of the statistical properties of information. Similar is the reason of the superiority of the system with direct truncation and nonlinear recovery over the system with linear recovery. Choosing the non linear recovery rule, we utilized properties of aggregation sets, which depend on more detailed properties of the density p^(u) of the joint probability than the mean values and correlation coefficients. For optimization of the linear recovery only these properties can be used. We assumed that the system with direct truncation and the system with truncation after decorrelation use linear recovery, optimal for each of the systems. Therefore, the reason of superiority of the system with truncation after decorrelation is that the reduction of dimensionality used in this system is better suited for linear recovery as in the system with direct truncation. This is confirmed by the interesting fact that the performance of the system with direct truncation and optimized nonlinear recovery is the same as the performance of the system with truncation after decorrelation. In other words, the direct truncated information contains the same statistical information about the second component, as the truncated spectrum. However, the statistical information contained in the direct truncated information cannot be utilized by a linear transformation. The advantage of the system with truncation after decorrelation is that it is a linear system, and as we show its principle can be easily generalized for dimensionality reduction of multidimensional information. The performance of this system is better when the discrepancy between the mean square values of the components is larger. From (7.3.53) it follows that if we achieve the decorrelation by the spectral transformation based on eigen values, the performance of the dimensionality compression will be better the more diversified the eigen values of the correlation matrix are. We discuss this issue in the next subsection. 7.3.2 THE ALGORITHM FOR DIMENSIONALITY REDUCTION Some fragments of our argumentation in the previous section are more complicated than it is really needed to analyze the simple case that we considered. The purpose of this was to present an approach that can be almost directly generalized for compression of AT>2 dimensional information into M-DIM information, M
< K.
Let us denote by u(k), k = 1, 2, ..., K, the random variables representing the components of the primary information u = {u(k), k = 1, 2, ..., K}. We assume that
E u(k) = 0,  k = 1, 2, ..., K.    (7.3.60)
Our previous remarks about generalizations of the procedures derived for the study case lead us to the following algorithms describing the operation of the dimensionality compression system.
THE K-TO-M DIMENSIONALITY REDUCTION ALGORITHM
Step 1. Using the procedure described in Section 4.3.2, find the eigen values γ(k) and the eigen vectors e(k), k = 1, 2, ..., K, of the correlation matrix C_uu of the primary information.
Step 2. Arrange the eigen values in decreasing order:
γ(k_1) > γ(k_2) > ... > γ(k_K).    (7.3.61a)
Step 3. As the compressed information take
v = {w(k_1), w(k_2), ..., w(k_M)},    (7.3.61b)
where w(n) are the components of the spectrum relative to the eigen vectors e(n). Thus,
w = G_e u,    (7.3.62)
and G_e is the matrix given by (7.2.26), the rows of which are the components of the eigen vectors.
Generalizing the geometric interpretation of the two-to-one compression given in the previous section, we interpret the compression of the dimensionality as a projection of the point u_pt, representing the primary information in a K-dimensional space, on the M-dimensional space spanned on the eigen vectors e(k_1), e(k_2), ..., e(k_M), and we use conclusion (7.1.67) about optimal linear approximation; in particular, as the optimally recovered information we take the optimal linear approximation given by equation (7.1.66). This leads to the
OPTIMAL LINEAR RECOVERY ALGORITHM
Step 1. As the recovered e-spectrum take
w_ro = {v, 0} = {w(k_1), w(k_2), ..., w(k_M), 0, 0, ..., 0}.    (7.3.63)
Step 2. As the ultimately recovered information take
u_ro = E w_ro,    (7.3.64)
where E is the matrix the columns of which are the components of the eigen vectors (see (7.2.29)). Using (7.1.66) and (7.2.36), as a generalization of (7.3.56) we obtain
Q_td = Σ_{m=M+1}^{K} γ(k_m).    (7.3.65)
Generalizing (7.3.58), we define the normalized performance indicator
Q' = Q_td / Σ_{k=1}^{K} E u²(k).    (7.3.66)
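A compact sketch of the K-to-M dimensionality reduction algorithm and the optimal linear recovery algorithm from page 339 is given below; the function names, the synthetic correlation matrix, and the test train are assumptions made for this illustration and are not part of the original text.

```python
import numpy as np

def compress(u, C_uu, M):
    """K-to-M dimensionality reduction based on the eigen vectors of C_uu."""
    gamma, E = np.linalg.eigh(C_uu)          # ascending eigen values, eigen vectors in columns
    order = np.argsort(gamma)[::-1]          # step 2: decreasing order of the eigen values
    G_e = E[:, order].T                      # rows: eigen vectors of the reordered system
    w = G_e @ u                              # e-spectrum of the primary information (7.3.62)
    return w[:M], G_e                        # step 3: keep the M dominant components

def recover(v, G_e):
    """Optimal linear recovery: zeros for the rejected components, then the inverse transform."""
    K = G_e.shape[0]
    w_r = np.zeros(K)
    w_r[:len(v)] = v                         # recovered e-spectrum (7.3.63)
    return G_e.T @ w_r                       # u_ro = E w_ro (7.3.64), since G_e^{-1} = G_e^T

# Illustrative run with a synthetic correlation matrix and a synthetic train.
rng = np.random.default_rng(6)
A = rng.normal(size=(8, 8))
C_uu = A @ A.T + np.eye(8)
u = rng.multivariate_normal(np.zeros(8), C_uu)

v, G_e = compress(u, C_uu, M=6)
u_rec = recover(v, G_e)
print(np.sum((u - u_rec)**2))                # error concentrated in the rejected components
```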
EXAMPLE 7.3.1 THE RUNNING OF THE OPTIMAL COMPRESSION AND RECOVERY ALGORITHM
We assume that the correlation matrix C_uu is given in Table 5.2.1a. A standard program for finding the eigen values and eigen vectors, such as the command EIG of Matlab (a registered trade name of The MathWorks, Inc. program package), produces the matrix E defined by (7.2.24) and the eigen values γ(n), already arranged in ascending order.
Table 7.3.2 The eigen values and the eigen vectors of the correlation matrix. In the justification of the algorithm we assumed only that the correlation matrix is given, but it was not necessary to specify beyond this the joint probability distribution. Therefore, to get a typical block of primary information we can use a generator of a random train operating according to a joint probability distribution, which has the given correlation matrix, but besides this requirement is arbitrary. Thus, we can use the generator of a gaussian process described in Example 5.2.1. A typical train generated by this generator is M = {2.0885 -.1800 .0036 1.7565 1.1361 0.0803 .6100 1.4597}. After performing the matrix multiplication (7.3.62), we obtain the ^-spectrum w = {2.2364 -.0577 -.5329 .1149 2.2924 -8295 -.1839 .0210}. Let us assume that the dimensionality of the compressed information M = 6 . Since the eigen values are arranged in descending order, according to step 3 of the dimensionality reduction algorithm, we reject the last two components of the espectrum. The compressed spectrum v = {2.2364 -.0577 -.5329 .1149 2.2924 -8295}. Let us next consider the optimal recovery of the primary information. According to Step 1 of the recovery algorithm, the optimally recovered e spectrum >v,,={2.2364 -.0577 -.5329 .1149 2.2924 -8295 .0000 .0000}. After performing the matrix multiplication according to Step 2, we get the optimally recovered primary information M,,={2.0464 -.1003 -.0683 1.7791 1.1790 -.0089 -.7019 1.4123}.
The error of the recovered information is
u - u_ro = {.0421 -.0797 .0719 -.0227 -.0428 .0892 -.0919 .0474},
while the error of the recovered spectrum is
w - w_ro = {0 0 0 0 0 0 -.1839 .0210}.
Figure 7.7. Dependence of the normalized performance Q' defined by (7.3.66) of the optimal compression-recovery algorithms (page 339) on the volume M of the compressed information; the correlation matrix of the primary information is given by Table 5.2.1
The described procedure we apply for M = 1, 2, ..., 7. From (7.3.65) we get Q_td, from Table 5.2.1a we obtain E u²(n) = c_uu(n, n), and from (7.3.66) we calculate the normalized performance indicator Q'. Its dependence on M is shown in Figure 7.7. This is a typical trade-off relationship between compression quality and the volume of the compressed information. □
COMMENT 1 The example shows an important feature of the presented algorithm. Comparing the errors u-u,^ and w-w,^, we see that the inverse e-spectral transformation distributes the final error of the primary information uniformly over its components. The example shows that the performance of the optimal dimensionality reduction and recovery depend essentially on this how diversified the eigen vectors of the correlation matrix are. If they are of similar order of magnitude, the optimal dimensionality reduction and recovery would not offer substantial improvement over the direct truncation. The differences between the values of eigen values depend on the structure of the correlation matrix of the primary information. The smaller the components outside the main diagonal, the "weaker" the correlation between the components of the primary information; the smaller the differences between the eigen values, the fewer (besides spreading the error) are the advantages of the optimal algorithm based on eigen vectors spectrum over direct truncation.
COMMENT 2
Taking in place of the eigen vectors any set C of ortho-normal vectors, and instead of the eigen values the variances of the components of the g-spectrum, we can also use the compression algorithm. The problem is that in general the spectral components are less diversified than the eigen values, and after rejecting a number of spectral components and using the recovery rule we may commit a larger error than when using the eigen vectors. The reason is that in general the spectral components are correlated, and taking zeros in place of the missing components is not optimal. We could improve the performance of recovery by taking, in place of zeros, linear estimates of the rejected spectral components based on the available components, as we did considering truncation without decorrelation in the study case, but such a procedure would be tedious. However, it is shown in the next section that on quite general assumptions the correlation coefficients between the components of the harmonic spectrum are small. Then the performance of the simple recovery rule replacing the missing spectral components by zeros is not significantly worse than in the case when the eigen vectors are used. The great advantage of the harmonic spectrum described in Example 7.1.1 is that it can be calculated with the Fast Fourier Transformation algorithm, which requires much less computational power than the compression based on eigen vectors. Therefore, in practice the dimensionality compression is often based on the discrete harmonic representation of the primary information. An example is the essential step of the JPEG compression standard described in Section 2.5, page 118. As is shown there, the recovery of the truncated harmonic spectrum also has the favourable feature of dispersing the errors uniformly. Our conclusions about improving the performance of direct truncation by using non-linear recovery rules could be generalized. However, the need for more exact statistical information about the primary information and the difficulties of implementing non-linear rules cause that in applications the most suitable compressions are those based on spectral representations which reduce the correlations of their components to zero or make them small.
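As a minimal illustration of harmonic-spectrum compression, the sketch below truncates the discrete Fourier spectrum of a correlated train and recovers the train by replacing the rejected components with zeros. The train generation (first-order recursive smoothing of white noise) and the numbers of components are illustrative assumptions of this sketch, not values taken from the text.

import numpy as np

rng = np.random.default_rng(1)
K, M = 64, 16                      # train dimensionality and number of retained spectral components

# An illustrative correlated train: first-order recursive smoothing of white noise.
e = rng.standard_normal(K)
u = np.zeros(K)
for n in range(1, K):
    u[n] = 0.9 * u[n - 1] + e[n]

V = np.fft.rfft(u)                 # harmonic spectrum (real FFT)
V_c = V.copy()
V_c[M:] = 0.0                      # reject the high-frequency components
u_rec = np.fft.irfft(V_c, n=K)     # recovery: missing components replaced by zeros

err = u - u_rec
print("mean square error:", np.mean(err ** 2))
print("the error is dispersed over all components:", np.round(err[:8], 3))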
7.4 SPECTRAL REPRESENTATIONS AND REDUCTION OF DIMENSIONALITY OF FUNCTION INFORMATION
We show now that the previously discussed methods of reducing the dimensionality of vector information can be generalized for functions of one or more continuous arguments. By "reduction of dimensionality" of a function we mean the transformation of the function information into vector information with a finite number of components. The two basic types of such a transformation are (1) sampling and (2) spectral representation with retaining only a given number of its components. Both methods of reducing the dimensionality of a function of a continuous argument(s) are closely connected. We concentrate on compression using spectral representations and we describe methods utilizing deterministic and statistical features of the compressed information. Sampling we discuss only briefly, emphasizing its relationships with compression using spectral representations. The presentation is based on analogies with the previously discussed reduction of dimensionality of vector information.
7.4.1 BASIC CONCEPTS
We consider first spectral representations of a scalar function on the assumption that the argument (called time) takes values confined to a finite interval. To simplify the notation, instead of the standard interval <0, T> we take the symmetric interval <-T/2, T/2>, and we assume that the considered function u(t) satisfies the condition
E = ∫_{-T/2}^{T/2} u^2(t) dt < ∞.        (7.4.1)
If u(t) has the meaning of an electrical potential or an electrical current, then the integral in equation (7.4.1) is proportional to the energy dissipated in a resistor. Therefore, the integral E given by (7.4.1) is called the energy of the function, and a function satisfying condition (7.4.1) is called an energy function. Functions of a continuous argument can be interpreted as points when a distance is introduced. We denote as a(·) = {a(t), t ∈ <-T/2, T/2>} a function considered as a whole, and similarly b(·) denotes another function. The counterpart of the distance (7.1.20) is the distance defined by (1.4.9), which in the notation used now takes the form

d[a(·), b(·)] = [ ∫_{-T/2}^{T/2} [b(t) - a(t)]^2 dt ]^{1/2}.        (7.4.2)
The functions can be interpreted as vectors when a scalar product is defined. For functions as a whole we define, similarly to (7.1.8), the scalar product

(a_vc, b_vc) = ∫_{-T/2}^{T/2} a(t) b(t) dt,        (7.4.3a)

where a_vc and b_vc denote the two considered functions interpreted as vectors. Consequently, we set

|a_vc| = [ ∫_{-T/2}^{T/2} a^2(t) dt ]^{1/2}.        (7.4.3b)
For the distance, scalar product, and vector length defined in this way the basic relationship (7.1.21) holds for functions as a whole. To get a representation of a function as a whole in a form similar to equation (7.1.43) we must take infinity as the upper summation limit. It is shown (see, e.g., Curtain, Pritchard [7.7]) that sets of functions

g(k, t), k = 1, 2, ..., t ∈ <-T/2, T/2>,        (7.4.4)

exist such that the functions are ortho-normal, that is

(g_vc(k), g_vc(l)) = 1 for l = k, 0 for l ≠ k,        (7.4.5a)

and complete, that is

lim_{K→∞} d[ u(·), Σ_{k=1}^{K} v(k) g(k, ·) ] = 0.        (7.4.5b)
The functions g(k, t) are called basis functions and their set is called a complete set of ortho-normal functions. The fundamental relationship (7.4.5b) can be written symbolically as

u(t) = Σ_{k=1}^{∞} v(k) g(k, t), t ∈ <-T/2, T/2>.        (7.4.6a)

This is the counterpart of (7.1.43) for functions of a continuous argument. Thus, the set of coefficients {v(k), k = 1, 2, ...} has the meaning of the spectrum of the function u(·) as a whole. Similarly to equation (7.1.29) we get

v(l) = ∫_{-T/2}^{T/2} u(t) g(l, t) dt.        (7.4.6b)
Thus, we came to the conclusion A function of a continuous argument satisfying the general assumptions Al and A2 (page 343) can be represented by a spectrum which is (7.4.6c) discrete, but has infinitely many elements. In terms of information processing this conclusion means that the considered transformation simplifies the primary function information into vector information, however, with infinitely many components^. We denote as C^o. the infinite complete set of orto-normal basis functions g{k, /), A:=l, 2, • • 00. This set is a counterpart of the set of coefficients g{k, I) A:=l, 2, • • describing the set C^ of unit coordinate vector gvc(^)- As indicated in Comment 2 page 316 there is a continuum of orthogonal coordinate systems. Similarly there is a continuum of complete ortho-normal sets C^o. of basis functions. Several such specific sets have been analyzed in detail (e.g., harmonic functions, Hermite, Laguerre, Legendre, Haar, Heinkel to mention only few). The spectral representation (7.4.7) has also the distance preserving property. Particularly, the counterpart of (7.1.60) is
∫_{-T/2}^{T/2} u^2(t) dt = Σ_{k=1}^{∞} v^2(k).        (7.4.7)
The optimality property (7.1.65) also has its counterpart. The approximation of the function by an incomplete ortho-normal set is

u*(t) = Σ_{m∈W} v(m) g(m, t),        (7.4.8)

where W is the set of numbers of the ortho-normal functions used for the approximation. The counterpart of the equation (7.1.66) giving the error of the optimal approximation is

∫_{-T/2}^{T/2} [u(t) - u*(t)]^2 dt = Σ_{m∈W_c} v^2(m),        (7.4.9)

where W_c is the rest of the set {1, 2, ..., ∞} after subtracting W.
Every practical application of a spectral representation requires generation of the basis functions. Therefore, of greatest practical importance are complete orthogonal sets of basis functions that can be obtained from a single prototype function by such simple transformations as a shift and/or a change of the scale of the argument. For a general description of such functions see Mallat [7.8]. Here two important examples of such prototype functions are used: harmonic functions (we call so the cosine, sine, and complex exponential functions) and the sine over argument function. Similarly to the choice of a spectral representation of a vector, the choice of the spectral representation of a function of a continuous argument depends primarily on the subsequent processing of the information. The harmonic functions play a special role, since they are invariant to stationary linear transformations. Therefore, spectral representations of functions of a continuous argument using harmonic functions as basis functions are of paramount practical importance. Such a representation is called a harmonic spectral representation. There are many excellent books discussing the harmonic representation of functions and its applications to the analysis of linear, stationary systems (see, e.g., Oppenheim, Willsky [7.9]). The harmonic representations are discussed here only briefly, to explain some concepts used in this book and to give additional insight into the problems of spectral representations of structured information.
7.4.2 HARMONIC SPECTRAL REPRESENTATIONS
The classical result of functional analysis is that for functions considered in the interval <-T/2, T/2> the set of functions

1/√T, √(2/T) cos kω₁t, k = 1, 2, ..., √(2/T) sin kω₁t, k = 1, 2, ...,        (7.4.10)

where

ω₁ = 2π/T,        (7.4.11)

is a complete set of ortho-normal functions. For this set the representation (7.4.6a) takes the form
u(t) = v(0) + Σ_{k=1}^{∞} [ v_c(k) cos kω₁t + v_s(k) sin kω₁t ], t ∈ <-T/2, T/2>,        (7.4.12a)

and from (7.4.6b) we have

v(0) = (1/T) ∫_{-T/2}^{T/2} u(t) dt,  v_c(k) = (2/T) ∫_{-T/2}^{T/2} u(t) cos kω₁t dt,  v_s(k) = (2/T) ∫_{-T/2}^{T/2} u(t) sin kω₁t dt,  k = 1, 2, ...        (7.4.12b)
Using the equations cos α = ½(e^{jα} + e^{-jα}) and sin α = -½ j(e^{jα} - e^{-jα}), where j = √-1, we write equations (7.4.12) in the simpler form

u(t) = Σ_{k=-∞}^{∞} v(k) e^{jkω₁t}, t ∈ <-T/2, T/2>,        (7.4.13a)

where

v(k) = (1/T) ∫_{-T/2}^{T/2} u(t) e^{-jkω₁t} dt.        (7.4.13b)

Since u(t) is real,

v(-k) = v̄(k),        (7.4.14)

where v̄ is the complex conjugate of v.
The consequence of (7.4.14) is that the complex spectrum {v(k), k = -∞, ..., -1, 0, 1, ..., ∞} is redundant. The reward for the redundancy is that the representation (7.4.13) is simpler and its manipulation much easier than that of the equivalent, but non-redundant, representation (7.4.12). As an example we take the rectangular pulse

u(t) = A for -τ ≤ t ≤ τ, u(t) = 0 for |t| > τ.        (7.4.15)

From (7.4.13b) we obtain its spectrum

v(k) = (2Aτ/T) · sin(kω₁τ)/(kω₁τ).        (7.4.16)
The function occurring on the right is called the sine over argument function. The diagram of the considered pulse and its spectrum are shown in Figure 7.8.
Figure 7.8. The rectangular pulse and its harmonic spectrum representation: (a) in the interval <-T/2, T/2>, (b) in the interval <-T, T>.
For practical purposes the representation (7.4.13) is sufficient, because in applications we always process information in a finite time interval. However, analytical operations on the sum occurring in (7.4.13a) are tedious. Therefore, very useful for analytical considerations is a limiting form of the representation (7.4.13) in which an integral occurs instead of the sum. We now sketch the derivation of such a representation. We consider the dependence of the spectral representation (7.4.13) on the length T of the interval <-T/2, T/2> in which the function is analyzed, when T grows. We assume that the condition (7.4.1) is always satisfied, thus

∫_{-T/2}^{T/2} u^2(t) dt < const < ∞.        (7.4.17)

This condition causes that

|u(t)| → 0 for |t| → ∞.        (7.4.18)
Thus, in a sufficiently coarse scale the function u(t) looks like a pulse. We denote as Δ_t the duration of the time interval outside of which the function u(·) takes only negligibly small values, and we call Δ_t the effective duration of the function u(·). To make this definition concrete we must say precisely what "negligibly small" means. Several such definitions are used. In theoretical considerations the effective duration is defined as the radius of inertia of the area under the diagram of the function. We now consider the harmonic spectrum of a fixed pulse-like function when the length T of the interval increases. From equation (7.4.11) it follows that changing T we change ω₁. Since in the following considerations not the number k of the harmonic but the value kω₁ is important, we denote

ω_k = kω₁.        (7.4.19)

The pulse given by (7.4.15) is a representative example of a pulse-like function. Figure 7.8b shows that when the length of the observation interval is doubled, two effects occur: the values of the spectral components decrease, but simultaneously the distances between the angular frequencies of the spectral components decrease. Thus, we have a similar effect as described in Section 4.2.1 and illustrated in Figure 4.3. Therefore, similarly to (4.2.9), for T large (compared with the effective duration of the function) we describe the spectrum by the density

V(ω_k) = 2π v(k)/Δω,        (7.4.20)

where

Δω = ω_k - ω_{k-1} = ω₁.        (7.4.21)

Using (7.4.11) and (7.4.13b) we write (7.4.20) in the form

V(ω_k) = ∫_{-T/2}^{T/2} u(t) e^{-jω_k t} dt.        (7.4.22)
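The effect just described (doubling T halves the spectral values and the spacing of the spectral lines, while the density V(ω_k) stays the same) can be checked numerically. The short sketch below computes v(k) for the rectangular pulse directly from (7.4.13b); the pulse parameters A = 1 and τ = 1 and the interval lengths are illustrative assumptions of the sketch.

import numpy as np

def v_k(T, k, tau=1.0, A=1.0, n=20001):
    """Spectral component v(k) of the rectangular pulse, computed from (7.4.13b) numerically."""
    t = np.linspace(-T / 2, T / 2, n)
    u = np.where(np.abs(t) <= tau, A, 0.0)
    return np.trapz(u * np.exp(-1j * k * 2 * np.pi / T * t), t) / T

# The same angular frequency pi/2 corresponds to k = 1 for T = 4 and to k = 2 for T = 8.
for T, k in ((4.0, 1), (8.0, 2)):
    v = v_k(T, k).real
    print(f"T = {T}: line spacing = {2*np.pi/T:.3f}, v({k}) = {v:+.4f}, V = T*v({k}) = {T*v:+.4f}")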
Similarly to (4.2.13), for T → ∞ the equation (7.4.13a) takes the form

u(t) = (1/2π) ∫_{-∞}^{∞} V(ω) e^{jωt} dω, -∞ < t < ∞,        (7.4.23a)

while equation (7.4.22) takes the form

V(ω) = ∫_{-∞}^{∞} u(t) e^{-jωt} dt, -∞ < ω < ∞.        (7.4.23b)
This spectral representation is called the continuous Fourier transformation. From (7.4.14) it follows that V(-ω) = V̄(ω). Thus, the representation (7.4.23a) is redundant and our remarks in Comment 1, page 318, apply. The continuous spectrum also has the distance preserving property; in particular, the counterpart of (7.1.60) holds and has the form

∫_{-∞}^{∞} u^2(t) dt = (1/2π) ∫_{-∞}^{∞} |V(ω)|^2 dω.        (7.4.24)

From this and from (7.4.17) it follows that the integral on the right side of (7.4.24) is finite. This causes that, similarly to (7.4.18),

|V(ω)| → 0 for |ω| → ∞.        (7.4.25)
Thus, in a sufficiently coarse scale, the function |V(·)| is pulse-like. Its width, defined similarly to the duration of the function, is called the effective bandwidth of the spectrum and is denoted as Δ_ω. One of the basic results of the theory of the Fourier transformation is:
The product of the effective widths of functions coupled by a Fourier transformation cannot be smaller than a constant of the order of magnitude of 2π.        (7.4.26)
For typical definitions of the effective widths and for most processes occurring in applications we have

Δ_t Δ_ω = 2πλ,        (7.4.27)

where λ is of the order of magnitude of 1.
COMMENT 1
Using definition (7.4.20) we write the representation (7.4.23a) in the form

u(t) = (1/2π) ∫_{-∞}^{∞} e^{jωt} dA(ω), -∞ < t < ∞,        (7.4.28)
where dA(ω) = V(ω) dω. Comparing (7.4.28) and (7.4.13a) we may say that the representation (7.4.23a) can be interpreted as a superposition of a continuum of complex exponential functions with infinitely small amplitudes dA(ω).
COMMENT 2
The discrete counterpart of the representation (7.4.23) is the discrete Fourier representation (7.1.55). Both transformations illustrate our earlier remarks in Section 1.4.3 about continuous models. For numerical calculations we use the discrete transformation, and the discrete models describe the real processes with sufficient accuracy. The continuous transformation is a limiting case of the discrete transformation. The advantage of the continuous transformation is that it can be analyzed in depth, and such an analysis gives more insight into the general properties of spectral representations. The price we pay for this are several peculiarities of the continuous harmonic spectral representation that are related to the continuous model but do not have counterparts in real systems.
COMMENT 3
The harmonic spectrum (7.4.13b) of a function analyzed in a finite interval is a function of a discrete argument, while the primary function is a function of a continuous argument. Thus, the structures of the spectrum and of the original function are different, and in this sense the representation is odd. On the contrary, the discrete Fourier representation (7.1.55a), (7.1.55b) and the continuous representation (7.4.23a), (7.4.23b) are symmetric in the sense that the primary function and its spectrum have the same structure.
The symmetry is advantageous in numerical calculations using the discrete Fourier representation and in analytical considerations using the continuous harmonic representation. The symmetry of the continuous harmonic representation causes that in formulas and theorems about the relationships between the primary function u(t) and its continuous spectrum V(ω) the two can be interchanged (with minor changes corresponding to the different coefficients in (7.4.23a) and (7.4.23b)). This is called the duality principle.
GENERALIZATIONS
To simplify the presentation, a scalar function of a scalar argument has been considered. Most of the presented concepts can be generalized for functions of more than one argument. The two-dimensional generalization of the spectral representation (7.4.23) is the spectral representation of black-white images described by equations (1.3.9) and (1.3.10):

u[r(1), r(2)] = (1/(2π)^2) ∫_{-∞}^{∞} ∫_{-∞}^{∞} V[ω(1), ω(2)] e^{j[ω(1)r(1)+ω(2)r(2)]} dω(1) dω(2),        (7.4.29a)

V[ω(1), ω(2)] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} u[r(1), r(2)] e^{-j[ω(1)r(1)+ω(2)r(2)]} dr(1) dr(2).        (7.4.29b)
In a similar way the spectral representation (7.4.13) and the discrete Fourier transformation (7.1.55) are generalized. The discrete cosine transformation (2.5.2) and (2.5.3) used in Section 2.5 is an obvious modification of the latter transformation. For an introduction to spectral transformation of images and their applications see Lim [7.10], for more details Schalkoff [7.11], Russ [7.12].
7.4.3 DETERMINISTIC REDUCTION OF DIMENSIONALITY OF FUNCTION-INFORMATION
A function of a continuous argument is characterized by such features as the continuity of the function, the continuity of its derivatives, discontinuities, etc. Many of those features are reflected in the features of the harmonic spectrum of the function. A function can often be considered as the product of a linear transformation which influences the harmonic spectrum of the function in a specific way. A typical example is a function with a spectrum which practically does not include components with frequencies higher than a limiting value. For these reasons constraints are often imposed not directly on a function but on its harmonic spectrum. Such constraints are counterparts of the constraints imposed on continuous vector information discussed in Section 6.6.2 (see in particular Example 6.6.2).
Usually the discussed features of functions, in particular the features of their harmonic spectra, characterize each considered function individually and can be used to reduce the dimensionality of each function. Such a reduction is called deterministic dimensionality reduction. The reduction of dimensionality of vector information which we considered in the previous section was based on statistical properties of the compressed information. The counterpart of such compression is the reduction of dimensionality using statistical properties of the considered functions. It compresses the dimensionality in the sense of a statistical indicator. Such a compression is called statistical reduction of dimensionality of function-information. It is discussed in the next section. The basic deterministic dimensionality compression methods are the methods using spectral representations and sampling. We present here a brief review of these two methods, emphasizing their relationships.
DIMENSIONALITY REDUCTION BASED ON SPECTRAL REPRESENTATION
The spectral methods of transforming a function into a vector having a finite number of elements are based on the following generalization of conclusion (7.1.65):
The optimal approximation of a function u(t), t ∈ <-T/2, T/2>, by a linear combination of a given set of ortho-normal basis functions is obtained when the coefficients of the combination are the corresponding spectral components of the function.        (7.4.30)
We denote as {v(k), k = -K, ..., K} the set of retained harmonic spectral components        (7.4.31)
and take as the approximation

u*(t) = Σ_{k=-K}^{K} v(k) e^{jkω₁t}.        (7.4.32)
As the indicator of the difference between the primary process and the approximation (the indicator of the approximation error) we take the square of the distance d[u(·), u*(·)]. Using (7.4.13a), (7.4.31), and (7.4.8) modified for complex spectra we obtain

d^2[u(·), u*(·)] = ∫_{-T/2}^{T/2} [u(t) - u*(t)]^2 dt = 2T Σ_{k=K+1}^{∞} |v(k)|^2.        (7.4.33)
From this equation it follows that to keep the distortions low we have to include into the set of retained spectral coefficients all the large coefficients. To do it in a systematic way we would have to calculate possibly many spectral coefficients and reject the smallest. This is done in the essential step of the JPEG compression procedure transforming the matrix (2.5.4) of primary spectral components into the matrix (2.2.9). However, the numerical calculation of many coefficients is tedious. Therefore, general guidelines are useful for the choice of the suitable number K of harmonic spectral components that should be included in the compressed description of the primary process u(t), t ∈ <-T/2, T/2>. For this purpose we look for a continuous approximation u_ap(t) of the primary process for which the continuous spectrum V_ap(ω)
can be easily calculated from equation (7.4.23b), and we estimate the discrete spectral components from equation (7.4.20), which we write in the form

v(k) = (ω₁/2π) V(kω₁),        (7.4.34)

where ω₁ = 2π/T. Still simpler, but less accurate, would be to estimate the effective duration Δ_t of the primary process, to use equation (7.4.27) to estimate the effective bandwidth Δ_ω, and to reject the harmonic components with the absolute value of the angular frequency larger than the estimated effective bandwidth.
COMMENT 1
The dimensionality of the truncated spectrum is a counterpart of the number of potential forms of quantized information that is necessary to recover with some accuracy the primary one-dimensional information, as discussed in Section 6.6.2. Thus, the dimensionality has the meaning of the relative volume of a time-continuous function, related to the accuracy of the optimal recovery. In the next section we present the sampling theorem for strictly band-limited functions, which corresponds to another point of view on the volume of time-continuous functions.
DIMENSIONALITY REDUCTION BY SAMPLING
We discuss here the conclusions of the theory of the continuous harmonic spectral representation about representing continuous information by samples. Let us assume that the primary process takes non-zero values only inside an interval <-τ, τ>:

u(t) = 0 for t < -τ and for t > τ.        (7.4.35)

Consider the spectral representation (7.4.13a) for t ∈ <-τ, τ>. From (7.4.13b) it follows that

v(k) = (1/2τ) V(ω_k),        (7.4.36)

where V(ω) is given by (7.4.23b). Thus, the spectral components v(k) of the representation of the function u(t) within the finite interval <-τ, τ> are proportional to samples of the continuous spectrum V(ω) representing the function u(t) in the infinite interval (-∞, ∞). Knowing the spectral components v(k) we can calculate u(t) from (7.4.13a), substitute it in (7.4.23b), and obtain the continuous spectrum V(ω), ω ∈ (-∞, ∞). Thus:
If the primary function u(t) takes only zero values outside an interval <-τ, τ>, then the continuous spectrum V(ω) is exactly determined by its samples taken with sampling period π/τ.        (7.4.37)
This conclusion is called the spectrum sampling theorem. Its interpretation is that the condition that u(t) = 0 in the infinite intervals (-∞, -τ) and (τ, ∞) interrelates the values of the continuous spectrum V(ω), ω ∈ (-∞, ∞), so strongly that only a discrete (although infinite) set of the samples is independent. The dual theorem is:
If the continuous spectrum V(ω) takes only zero values outside an interval <-2πB, 2πB> (thus, it is a base-band process), then the primary function u(t) is exactly determined by its samples taken with sampling period 1/2B.        (7.4.38)
This theorem is called the sampling theorem and is of paramount practical importance. Formalizing the dual counterpart of the reasoning that led us to the spectrum sampling theorem, we obtain the explicit representation of a base-band process by its samples:

u(t) = Σ_{k=-∞}^{∞} u(t_k) · sin 2πB(t - t_k) / [2πB(t - t_k)], -∞ < t < ∞,        (7.4.39)

where

t_k = kT_s and T_s = 1/2B.        (7.4.40)

We write (7.4.39) in the form

u(t) = Σ_{k=-∞}^{∞} u(t_k) g(k, t),        (7.4.41)

where

g(k, t) = sin 2πB(t - t_k) / [2πB(t - t_k)].        (7.4.42)
The functions defined by this formula are called shifted sine over argument functions. It can be easily proved that these functions are orthogonal in the interval <-∞, ∞>. Since the representation (7.4.39) is possible for any base-band function, the set of shifted sine over argument functions is a complete ortho-normal set in the class of base-band functions. Thus, in the interval <-∞, ∞> we may represent a base-band process u(t) in the spectral form (7.4.6a); in particular, we may calculate the spectral coefficients v(k) from equation (7.4.6b) with T → ∞. However, this is not necessary. The sine over argument functions have the obvious property

sin 2πB(t_m - t_k) / [2πB(t_m - t_k)] = 0 for m ≠ k, 1 for m = k.        (7.4.43)

Therefore,

v(k) = u(t_k).        (7.4.44)

The interpretation of this is:
The set of samples of a base-band function, taken with the period 1/2B, is the spectrum of this function relative to the shifted sine over argument functions.        (7.4.45)
Let us assume that T > T_s. From (7.4.40) it follows that this is equivalent to the condition

2TB > 1.        (7.4.46)

On such an assumption, for t ∈ <-T/2, T/2> but not close to the end points of this interval, dominant in the sum (7.4.39) are the components corresponding to sampling points t_k lying in the interval <-T/2, T/2>. Thus,

u(t) ≈ Σ_{k=-K/2}^{K/2} u(t_k) sin 2πB(t - t_k) / [2πB(t - t_k)], t ∈ <-T/2, T/2>, K = 2TB.        (7.4.47)
Since the shifted sine over argument functions are orthogonal, from conclusion (7.1.65) it follows that the approximation (7.4.47) has an optimal character. From an obvious generalization of (7.4.7) it also follows that

Σ_{k=-K/2}^{K/2} u^2(t_k) T_s ≈ ∫_{-T/2}^{T/2} u^2(t) dt.        (7.4.48)
The interpretation of (7.4.47) is:
For large K = 2TB the set of samples {u(t_k), k = -K/2, ..., -1, 0, 1, ..., K/2} is an approximate representation of a segment of duration T of a base-band process with bandwidth B.        (7.4.49)
A more detailed analysis shows various deficiencies of the discussed representation of a base-band function, in particular of the approximation (7.4.47). They are caused by the slow decay of the sine over argument function for large values of the argument. This, in turn, is related to the assumption that the considered function, and in consequence the shifted sine over argument functions, are exactly base-band functions. A family of spectral representations which are similar to the discussed sine over argument representation but do not have its deficiencies is based on prototype functions called wavelets. For an introduction see, e.g., Rioul, Vetterli [7.13], and for more details Mallat [7.8], Young [7.14], and Wickerhauser [7.15]. An essential generalization of the representation (7.4.47) of a base-band process by its samples is a hierarchy of representations that can produce a hierarchy of approximations of the primary process, having increasing accuracies and being optimal at each accuracy level. They are called multi-resolution representations. The representations are based on a hierarchy of sets of orthogonal functions produced by shifts and scale changes of a prototype function, called a wavelet. Suppose that the primary process is a base-band process and B is the highest frequency of its harmonic components. This process is sampled with the period T_s = 1/2B. In the first stage the train of samples is fed into two discrete-time linear systems, as described in Section 3.2.4. One system produces the level 1 coarse approximation of the primary train and the other system produces the level 1 difference train. From both trains only every second element is retained; thus, they are decimated as shown in Figure 1.20b. The difference train is stored and the level 1 coarse approximation is forwarded to stage 2. At stage 2 the level 1 coarse approximation is processed similarly as the primary train was processed in stage 1. Suppose that the process ends at level J. Then the produced representation of the primary train consists of the level J coarse approximation and of the level 1 to level J difference trains. To recover the primary train with a desired accuracy the procedure is reversed. To produce the recovered train of the lowest quality, zeros are inserted in the level J coarse approximation in place of the dropped elements, and the zero-stuffed train is processed by a linear system. In a similar way the level J difference train is processed and added to the processed level J coarse approximation. The result is the level J recovered train, which has the lowest accuracy. A recovered train of the next higher accuracy is obtained by similar processing of the already recovered train of lowest accuracy and adding the processed difference train of level J-1. This procedure can be continued till the primary train is recovered with the highest accuracy, corresponding to level 1. For a detailed description of the multi-resolution representation see the classical paper by Mallat [7.8], and for more details Wickerhauser [7.15], Vetterli, Kovacevic [7.16], and Chui [7.17]. A minimal sketch of this analysis-synthesis scheme is given below.
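The sketch below implements the described procedure with the simplest possible pair of analysis filters (the Haar pair), which is an assumption of the sketch and not the general wavelet filters of the cited references. The function and variable names are illustrative.

import numpy as np

def analyze(x):
    """One stage: coarse approximation and difference train, decimated by 2 (Haar filters)."""
    x = np.asarray(x, float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2)     # coarse approximation of this level
    d = (x[0::2] - x[1::2]) / np.sqrt(2)     # difference train of this level
    return a, d

def synthesize(a, d):
    """Inverse of one stage: up-sample, filter, and interleave the two trains."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def multiresolution(x, J):
    """Level-J coarse approximation plus the difference trains of levels 1..J."""
    diffs, a = [], np.asarray(x, float)
    for _ in range(J):
        a, d = analyze(a)
        diffs.append(d)
    return a, diffs

def recover(a, diffs):
    for d in reversed(diffs):
        a = synthesize(a, d)
    return a

x = np.random.default_rng(2).standard_normal(16)
a, diffs = multiresolution(x, J=2)
print("perfect recovery:", np.allclose(recover(a, diffs), x))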
7.4.4 STATISTICAL REDUCTION OF DIMENSIONALITY OF FUNCTION-INFORMATION
We assume now that a time-continuous function-information can be interpreted as an observation of a stochastic process, and we discuss transformations of such functions into finite-dimensional vector information that use the statistical properties of the primary information. Those transformations can be considered as generalizations of the transformations reducing the dimensionality of vector information presented in Section 7.3. We begin with a review of the harmonic analysis of stochastic processes. This analysis also gives more insight into the previously discussed spectral representations of deterministic functions.
SPECTRAL REPRESENTATION OF STOCHASTIC PROCESSES
On very general assumptions a time-continuous stochastic process can be represented as a superposition of harmonic functions multiplied by coefficients which are random variables. We consider first the representation of a time-continuous stochastic process u(t) in the finite interval <-T/2, T/2>. The representation (7.4.13) takes the form

u(t) = Σ_{k=-∞}^{∞} v(k) e^{jkω₁t}, t ∈ <-T/2, T/2>,        (7.4.50a)

where

v(k) = (1/T) ∫_{-T/2}^{T/2} u(t) e^{-jkω₁t} dt        (7.4.50b)
is the random variable representing the kth spectral component. Equation (7.4.50b) shows that the amplitude v(k) of a spectral component is a linear transformation of the primary process u(t). Therefore, the averages, variances, and correlation coefficients of the random variables v(k) can be expressed in terms of the average and the correlation function defined by (5.2.38); see, e.g., Papoulis [7.18] and, for more details, Lapierre, Fortet [7.19]. The latter are given by double integrals of the weighted correlation function of the primary process, which are counterparts of the matrix multiplication in equation (4.4.37). The harmonic spectral representation (7.4.23a) of a process on the whole time axis is suitable only for pulse-like functions, which for large |t| decay to zero so fast that the energy of the function defined by (7.4.1) is finite; thus, the function is an energy function. Often we have to do with processes that even during long periods of time vary in a similar range; thus, they are not pulse-like. A more natural model of such processes are processes which on the whole time axis have an instantaneous power of the same order of magnitude. If such a process is an electrical process, then the energy dissipated in a resistor during a period of time grows linearly with the duration of the observation. Thus, the limit

lim_{T→∞} (1/T) ∫_{-T/2}^{T/2} u^2(t) dt = const > 0.        (7.4.51)
Such functions are called power functions. The deterministic harmonic spectral representation of power functions, which is a counterpart of the previously discussed representation of energy functions, is quite tedious. However, often a power function can be considered as a segment of a stationary stochastic process. The basic concepts and results of the theory of harmonic representations of stationary processes have a relatively simple interpretation and give much insight into the properties of power functions. Therefore, we present here a sketchy description of this theory. For a stationary random process the average Eu^2(t) has the meaning of the instantaneous average power of the process. Since for a stationary process it does not depend on the time t,

Eu^2(t) = W = const,        (7.4.52)

and W also has the meaning of the average power of the process. A precise mathematical analysis (see, e.g., Fortet, Lapierre [7.19]) shows that a time-continuous stationary stochastic process can be represented in the infinite time interval in the form

u(t) = (1/2π) ∫_{-∞}^{∞} e^{jωt} dA(ω), -∞ < t < ∞,        (7.4.53)
where dA(ω) has the meaning of an infinitely small random complex amplitude of the harmonic function e^{jωt}. This is a counterpart of the representation (7.4.28). We denote as ΔW(ω) the total average power of the harmonic components with angular frequencies lying in the band <ω, ω+Δω>. It can be shown (see, e.g., Fortet, Lapierre [7.19]) that for stationary processes the limit

S(ω) = lim_{Δω→0} ΔW(ω)/Δω        (7.4.54)

exists, and

E|dA(ω)|^2 = S(ω) dω,        (7.4.55)
E dA(ω') dA(ω'') = 0 for ω' ≠ ω''.        (7.4.56)

The function S(ω) is called the power spectral density. Since the total power is finite, the power spectral density is a pulse-like function. Equation (7.4.56) tells us that the infinitely small random amplitudes of the harmonic components are non-correlated. This is a consequence of the assumption that the process is stationary. One of the basic results of the theory of harmonic spectral representations of stationary processes is that the correlation function γ(τ) given by (5.2.44) and the spectral density S(ω) are coupled by the Fourier transformation (7.4.23). Thus,

γ(τ) = (1/2π) ∫_{-∞}^{∞} S(ω) e^{jωτ} dω, -∞ < τ < ∞,        (7.4.57a)

S(ω) = ∫_{-∞}^{∞} γ(τ) e^{-jωτ} dτ, -∞ < ω < ∞.        (7.4.57b)

This relationship is called the Wiener-Khinchin theorem.
From the definitions (5.2.40) and (5.2.44) of the correlation function of a stationary process it follows that

Eu^2(t) = γ(0),        (7.4.58)

while from the Wiener-Khinchin theorem (7.4.57) we have

(1/2π) ∫_{-∞}^{∞} S(ω) dω = γ(0).        (7.4.59)

From these two equations we get

Eu^2(t) = (1/2π) ∫_{-∞}^{∞} S(ω) dω.        (7.4.60)
This equation is plausible in view of the definition of the power spectral density. When the angular frequencies of all spectral components lie in a frequency band <-2πB, 2πB>, the process is said to be a base-band process. When

S(ω) = S₀ for -2πB ≤ ω ≤ 2πB, S(ω) = 0 for ω < -2πB, ω > 2πB,        (7.4.61)

the spectrum is said to be uniform. For a process with such a spectral density we obtain from equation (7.4.60)

Eu^2(t) = 2BS₀.        (7.4.62)

Using (7.4.57) we calculate the correlation function of the process with the uniform spectrum given by (7.4.61) and we obtain

γ(τ) = σ^2 sin 2πBτ / (2πBτ),        (7.4.63)

where

σ^2 = Eu^2(t).        (7.4.64)

From equation (7.4.63) and from the mentioned properties of the sine over argument function it follows that:
Any two samples u(t') and u(t'') of a stationary process with a uniform spectral density, such that t'' - t' = kT_s, are uncorrelated,        (7.4.65)
where T_s = 1/2B. We used this conclusion when deriving the basic formula (5.2.50).
STATISTICAL REDUCTION OF DIMENSIONALITY
In Section 7.3 we presented methods of reducing the dimensionality of vector information using statistical properties of the primary information. We show now the generalization of those methods to the transformation of a function-information into vector information.
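Conclusion (7.4.65) can be checked numerically. The sketch below generates an approximately base-band process with a uniform spectrum by filtering white noise with an ideal low-pass filter in the frequency domain (this generation method, the bandwidth, and the simulation rate are assumptions of the sketch) and estimates the correlation of samples spaced by T_s = 1/2B.

import numpy as np

rng = np.random.default_rng(3)
B, fs, N = 1.0, 32.0, 2 ** 16          # bandwidth, simulation rate, number of points (illustrative)
dt = 1.0 / fs

# White noise filtered by an ideal low-pass filter: approximately uniform spectrum in <-2*pi*B, 2*pi*B>.
W = np.fft.rfft(rng.standard_normal(N))
f = np.fft.rfftfreq(N, dt)
W[f > B] = 0.0
s = np.fft.irfft(W, n=N)

Ts = 1.0 / (2 * B)                     # sampling period of conclusion (7.4.65)
lag = int(round(Ts / dt))

def corr(x, m):
    return np.mean(x[:-m] * x[m:]) / np.mean(x * x)

print("normalized correlation at lag T_s  :", round(corr(s, lag), 3))       # close to 0
print("normalized correlation at lag T_s/2:", round(corr(s, lag // 2), 3))  # clearly non-zero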
The compressed vector information is a finite subset of the infinitely many spectral components of the processed function-information. The difference with the deterministic approach is that as the criterion for the quality of recovery, and in consequence as the criterion for the choice of the retained spectral components, the statistical average power of the error is taken. This causes that the compression of some individual functions may have a lower quality than the previously discussed deterministic compression, but the compression of a train of several processes has a better quality. This is the same effect as in the case of the lossless compression of trains of discrete information discussed in Section 6.2.2 (see Comment 2, page 266). The other difference between the statistically optimal compression and the deterministic optimal compression based on conclusion (7.4.30) is that taking the deterministic approach we have to calculate anew, for every processed function-information, sufficiently many spectral components and decide which should be rejected. Taking the statistical approach we do this only once (strictly, once in a stabilization interval). The price we pay for it is the cost of acquiring the correlation function of the primary process. As an approximation of the correlation function we may take the arithmetical average of products of shifted samples. The considerations in Section 7.3 can be generalized for the optimization of a transformation of a function-information into vector information. It is easily seen that if the spectral components are not correlated, then the average power of the recovery error is minimized when the spectral components with large variances, i.e., with large average powers, are retained as components of the compressed vector. The counterpart of the decorrelating spectral representation of vector information described in Section 7.2 is the spectral representation using the eigen functions as basis functions. Those functions are solutions of an integral equation with the correlation function as the kernel, which is the counterpart of the matrix equation (7.2.8). Such a spectral representation based on eigen functions is called the Loeve-Karhunen representation. For a function-information, finding the eigen functions and eigen values would be prohibitively complicated for a typical correlation function. Therefore, of basic importance are spectral representations relative to sets of ortho-normal basis functions that can be easily handled and which produce spectral components with small correlation coefficients. The harmonic functions are a typical choice. This is even more justified because under quite general assumptions about correlation functions the harmonic functions are good approximations of the eigen functions. The optimal algorithms for the reduction of dimensionality of vector information and the recovery algorithms presented on page 339 can be easily modified for the transformation of a function-information into vector information: we have to take the variances of the spectral components instead of the eigen values, and the approximation u*(t) given by (7.4.32) in place of the recovered spectrum. Similarly as in the deterministic approach discussed in Section 7.4.3, we may use the continuous representation in the infinite time interval to get hints for choosing the number of retained spectral components needed to achieve a good accuracy of approximation of the primary time-continuous process by a truncated spectral representation.
Similarly to (7.4.34), using equations (7.4.53) and (7.4.55) we get the approximate formula for the variance of the kth spectral component

E|v(k)|^2 = S(kω₁) Δω / 2π,        (7.4.67)

where S(ω) is the power spectral density, which can be obtained from the correlation function using the Wiener-Khinchin theorem (7.4.57). Using the approximation (7.4.27) we find the effective width of the power spectral density, and using (7.4.67) we get an estimate of the number of spectral components that should be included in the compressed spectrum. The remarks in Comment 2, page 342, and the examples in Section 7.3 suggest using basis functions other than the eigen functions. When the correlation coefficients between the spectral components are small, the performance of the modified compression and recovery algorithms described on page 333 is close to the optimal performance. If this is not the case, the performance of the recovery algorithm could be improved if, instead of zeros, linear estimates of possibly many rejected spectral components are made. Therefore, it is important to estimate the correlation coefficients between the variables representing the spectral components, to decide if it is worthwhile to make such an improvement. Using equation (7.4.57) and arguing similarly as in deriving the approximation (7.4.67), we conclude that when the continuous spectral representation (7.4.53) is an accurate approximation of the spectral representation (7.4.50) in the time interval <-T/2, T/2>, then the correlation coefficients of the random variables representing the spectral components v(k) in the representation (7.4.50) are small, and the function-information given by the simple formula (7.4.32) is an almost optimally recovered primary information. From the derivation of the continuous representation (7.4.23a) it follows that the continuous representation is accurate if the distances between the harmonic components in the representation (7.4.13a) are small compared with the range in which the power spectral density takes significant values, thus if

ω₁ ≪ Δ_S,        (7.4.68)

where Δ_S is the effective width of the power spectral density S(ω). Using (7.4.11) and (7.4.27) we write this condition in the form

Δ_t ≪ T,        (7.4.69)

where Δ_t is the correlation time of the primary process. Summarizing, we conclude:
If a continuous function-information can be considered as a segment of a stationary random process, and if the duration T of the information is much larger than the correlation time of the stationary random process, then the truncated harmonic spectrum is an almost optimally compressed vector information and the sum (7.4.32) of the retained harmonic components is an almost optimally recovered primary function-information.        (7.4.70)
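The use of the approximation (7.4.67), as reconstructed above, for choosing the number of retained components can be sketched as follows. The sketch assumes an exponential correlation function (hence a known power spectral density) and illustrative values of T, the correlation time, and the power; the 95% retention threshold is the sketch's own choice.

import numpy as np

T, tau0, sigma2 = 10.0, 0.5, 1.0         # interval length, correlation time, power (illustrative)
w1 = 2 * np.pi / T

def S(w):                                # power spectral density for gamma(t) = sigma2 * exp(-|t|/tau0)
    return 2 * sigma2 * tau0 / (1 + (w * tau0) ** 2)

k = np.arange(0, 200)
var_k = S(k * w1) * w1 / (2 * np.pi)     # approximate variances of the spectral components, cf. (7.4.67)

total = 2 * np.sum(var_k) - var_k[0]     # the components +k and -k carry equal average power
cum = 2 * np.cumsum(var_k) - var_k[0]
M = int(np.searchsorted(cum, 0.95 * total)) + 1
print("components retaining 95% of the average power: k = -%d .. %d" % (M, M))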
7.5 QUANTIZATION
A general review of quantization has been presented in Section 1.5.4, and we used it in Sections 4.2.1, 4.5.1, and 6.6.1. Similarly to dimensionality reduction, quantization is an irreversible, deterministic transformation used as a preliminary simplifying transformation. Therefore, we exploit here large fragments of our considerations in the previous sections of this chapter. As there, geometrical interpretation is used extensively and the presentation is partially based on heuristic arguments. We return in Section 8.6.1 to the problems of quantization requiring more advanced formal tools, particularly those of optimization theory.
7.5.1 THE RECOVERY OF THE PRIMARY CONTINUOUS INFORMATION FROM QUANTIZED INFORMATION
Quantization is a typical simplifying transformation (see Section 1.4, particularly Figure 1.4) and the problem of recovery of the primary information arises. To formulate this problem we have to introduce an indicator of performance of the information recovery rule (see Section 1.6). To define such an indicator we must specify the meta-information about the properties of the set of potential forms of the primary information. We assume first that the information exhibits statistical regularities and that exact statistical information (see Section 5.5) is available. The K-DIM continuous information u = {u(k), k = 1, 2, ..., K}, u(k) ∈ <u_a, u_b>, is considered. The quantization rule is described by a partition of the continuous set of the potential forms of the primary information into a finite number of aggregation sets, each corresponding to the potential forms of the primary information which are transformed into the same quantized information. In this section the quantization rule is considered as fixed and we concentrate on the continuous information recovery rule. We denote
v_l, l = 1, 2, ..., L - the potential forms of the quantized information,
u_r = {u_r(k), k = 1, 2, ..., K} - the recovered information,
U_r(·) - the recovery rule.
It is assumed that the recovery rule is deterministic. Then, since the quantized information is discrete, the recovered information is discrete too. We denote as

u_rl = U_r(v_l), l = 1, 2, ..., L,        (7.5.1)

the potential forms of the recovered information. On the assumption that the primary information exhibits statistical regularities, the primary information, the quantized information, and the recovered information and their components can be considered as realizations of random variables. We denote them, respectively, as 𝕦 = {𝕦(k), k = 1, 2, ..., K}, 𝕧 = {𝕧(k), k = 1, 2, ..., K}, 𝕦_r = {𝕦_r(k), k = 1, 2, ..., K}. The mean square error of recovery

Q[U_r(·)] = E Σ_{k=1}^{K} [𝕦(k) - 𝕦_r(k)]^2        (7.5.2)

is taken as the indicator of the performance of the information recovery rule.
OPTIMAL RECOVERY WHEN EXACT STATISTICAL INFORMATION IS AVAILABLE
Similarly as in the case of recovery of the primary information from the information of reduced dimensionality, we consider the optimization problem OP U_r(·), Q. The method of solving this problem is illustrated on the 1-DIM (scalar) case. For K = 1 definition (7.5.2) takes the form

Q[U_r(·)] = E(𝕦 - 𝕦_r)^2.        (7.5.3)
The random variable 𝕦_r = U_r(𝕧), where 𝕧 is the random variable representing the quantized information. Therefore, averaging over all potential forms of the recovered information is equivalent to averaging over all potential forms of the quantized information. Taking this into account and using formula (4.4.23) for conditional averages, we write (7.5.3) in the form

Q[U_r(·)] = E{E[(𝕦 - U_r(𝕧))^2 | 𝕧]} = E Q[U_r(𝕧), 𝕧],        (7.5.4)

where

Q(u_r, v_l) = E[(𝕦 - u_r)^2 | 𝕧 = v_l].        (7.5.5)

Since the variable 𝕧 is a discrete variable, we can use equation (4.4.10) and write equation (7.5.4) in the form

Q[U_r(·)] = Σ_{l=1}^{L} Q[U_r(v_l), v_l] P(𝕧 = v_l).        (7.5.6)
The average on the right of (7.5.5) corresponds to the point of view of an observer who knows the quantized information v_l, does not know the primary information, and, looking for various recovery rules, considers the recovered information u_r to be a variable. From this interpretation it follows that for a given quantized information v_l the conditional mean square error is minimized if we consider Q(u_r, v_l) as a function of a continuous variable u_r ∈ <u_a, u_b>. Let us denote as u_rol the value of u_r for which the minimum is reached; in this notation we indicate the number l of the fixed quantized information v_l. For each v_l we can find the corresponding u_rol independently. Therefore, the recovery rule:
For the available quantized information v_l take the function Q(u_r, v_l), consider it as a function of the continuous variable u_r ∈ <u_a, u_b>, find the value u_rol for which Q(u_r, v_l) achieves the minimum, and take u_rol as the recovered information,        (7.5.7a)
minimizes the overall average Q. Thus the rule (7.5.7a) is the solution of the OP U_r(·), Q. We denote this rule U_ro(·) and call it the best conditional performance rule. Thus,

u_rol = U_ro(v_l).        (7.5.7b)
To determine the concrete form of the optimal recovery rule we have to calculate the conditional average Q(u_r, v_l). We assume that the scalar quantization rule is described by the thresholds

u_a = u_0 < u_1 < ... < u_L = u_b        (7.5.8)

shown in Figure 1.18. Then the aggregation interval is

A_l = <u_{l-1}, u_l>.        (7.5.9)
From the general definition (4.4.11) it follows that the conditional average (7.5.5) is

Q(u_r, v_l) = ∫ (u - u_r)^2 p(u | v_l) du,        (7.5.10)

where p(u | v_l) is the probability density of the random variable 𝕦 on the condition that 𝕧 = v_l. We now express the density p(u | v_l) in terms of the probability density p(u) describing the statistical properties of the primary information u. From the definition (4.4.7b) of the conditional probability we have, for u ∈ A_l, P(u - ε < 𝕦 ≤ u | 𝕧 = v_l) = P(u - ε < 𝕦 ≤ u)/P(𝕧 = v_l); dividing by ε and letting ε → 0 we obtain

p(u | v_l) = p(u)/P(v_l) for u ∈ A_l, p(u | v_l) = 0 for u outside A_l.        (7.5.13)

Using this we write (7.5.10) in the form

Q(u_r, v_l) = (1/P(v_l)) ∫_{A_l} (u - u_r)^2 p(u) du.        (7.5.14)

From the definition of the aggregation set it follows that

P(v_l) = P(𝕦 ∈ A_l) = ∫_{A_l} p(u) du.        (7.5.15)
Equations (7.5.14) and (7.5.15) express the conditional mean square error in terms of the density of probability p(u), as we were looking for. The optimally recovered information u_rol minimizing Q(u_r, v_l) is a solution of the equation

∂Q(u_r, v_l)/∂u_r = 0.        (7.5.16)

Using equation (7.5.14), interchanging the sequence of the partial differentiation and the integration, after some elementary algebra we get

u_rol = (1/P(v_l)) ∫_{A_l} u p(u) du.        (7.5.17)

Since the ratio p(u)/P(v_l) has the meaning of the weight of a point in the aggregation interval A_l, the optimally recovered information u_rol can be interpreted as the centre of gravity of the aggregation interval A_l. Therefore, the optimal recovery rule (7.5.17) is called the centre of gravity rule.
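The centre of gravity rule (7.5.17) is straightforward to evaluate numerically for a given density and given thresholds. The sketch below assumes, purely for illustration, a triangular density p(u) = 2u on <0, 1> and uniform thresholds; both are the sketch's own choices.

import numpy as np

u_a, u_b, L = 0.0, 1.0, 4
thresholds = np.linspace(u_a, u_b, L + 1)           # u_0 < u_1 < ... < u_L, uniform for simplicity

def p(u):                                            # illustrative (triangular) density on <0, 1>
    return 2.0 * u

u = np.linspace(u_a, u_b, 100001)
for l in range(L):
    inside = (u >= thresholds[l]) & (u <= thresholds[l + 1])     # aggregation interval A_l
    P_l = np.trapz(p(u[inside]), u[inside])                      # P(v_l), equation (7.5.15)
    u_rol = np.trapz(u[inside] * p(u[inside]), u[inside]) / P_l  # centre of gravity, equation (7.5.17)
    print(f"A_{l+1} = <{thresholds[l]:.2f}, {thresholds[l+1]:.2f}>: "
          f"P = {P_l:.3f}, optimally recovered value = {u_rol:.4f}")

For this non-uniform density the optimally recovered values lie above the midpoints of the intervals, as the centre of gravity interpretation suggests.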
Substituting (7.5.17) for u_r in (7.5.10), using (7.5.13), and substituting the result in (7.5.6), we obtain the overall performance indicator of the centre of gravity rule

Q[U_ro(·)] = Σ_{l=1}^{L} ∫_{A_l} [u - U_ro(v_l)]^2 p(u) du.        (7.5.18)
This is the minimum mean square error of the recovery for the considered quantization rule. The generalization of our argument to the case of K-DIM information is simple, since in its essential steps we did not use the assumption that the information is one-dimensional. In definition (7.5.5) we have to replace (𝕦 - u_r)^2 by the sum on the right of equation (7.5.2). Thus we define

Q(u_r, v_l) = E{ Σ_{k=1}^{K} [𝕦(k) - u_r(k)]^2 | 𝕧 = v_l },        (7.5.19)

where 𝕦 = {𝕦(k), k = 1, 2, ..., K} is the K-DIM random variable representing the K-DIM information u = {u(k), k = 1, 2, ..., K}. For the so-defined function Q(u_r, v_l) the obvious modification of the rule (7.5.7) is the optimal rule of recovering the primary information when only the quantized information v_l is available, that is, the rule minimizing the overall performance indicator given by (7.5.2). The generalization of (7.5.14) is

Q(u_r, v_l) = (1/P(v_l)) ∫∫...∫_{A_l} Σ_{k=1}^{K} [u(k) - u_r(k)]^2 p(u) du,        (7.5.20)
where p(u) is the density of probability describing the K-DIM random variable 𝕦 and A_l is the aggregation set corresponding to the quantized information v_l. The generalization of equation (7.5.16) is the set of equations

∂Q(u_r, v_l)/∂u_r(k) = 0, k = 1, 2, ..., K.        (7.5.21)

Substituting (7.5.20) and after some algebra we get the kth component of the optimally recovered information

u_rol(k) = (1/P(v_l)) ∫∫...∫_{A_l} u(k) p(u) du, k = 1, 2, ..., K,        (7.5.22)

where

P(v_l) = ∫∫...∫_{A_l} p(u) du.        (7.5.23)

The set of equations (7.5.22) can be written in the compact form

u_rol = (1/P(v_l)) ∫∫...∫_{A_l} u p(u) du.        (7.5.24)

Thus, in the general K-DIM case the optimally recovered information is the centre of gravity of the K-DIM aggregation set corresponding to the available quantized information v_l.
Similarly to (7.5.18), the overall performance indicator of the centre of gravity rule is

Q[U_ro(·)] = Σ_{l=1}^{L} ∫∫...∫_{A_l} Σ_{k=1}^{K} [u(k) - u_rol(k)]^2 p(u) du.        (7.5.25)
EXAMPLE 7.5.1
We assume that the probability density is uniform,

p(u) = 1/(u_b - u_a) for u_a < u < u_b, p(u) = 0 for u < u_a, u > u_b,        (7.5.26)

and that the quantization is uniform,

u_l - u_{l-1} = Δ = const.        (7.5.27)

From (7.5.13) it follows that

p(u | v_l) = 1/Δ for u ∈ <u_{l-1}, u_l>.        (7.5.28)

Substituting this in (7.5.10) and minimizing, we obtain

u_rol = (1/Δ) ∫_{u_{l-1}}^{u_l} u du = (u_{l-1} + u_l)/2.        (7.5.29)

Thus, the centre of the quantization interval corresponding to the given quantized information is the optimally recovered information. Putting (7.5.29) in (7.5.18) we get

Q[U_ro(·)] = (u_b - u_a)^2/(12 L^2).        □        (7.5.30)
OPTIMAL RECOVERY WITHOUT STATISTICAL INFORMATION
We show now the application to scalar quantization of the general procedure of intelligent information processing described in Section 1.7.2 and illustrated in Figure 1.27. We assume that:
A1. A train of pieces of the primary information u_I = {u(i), i = 1, 2, ..., I}, u(i) ∈ <u_a, u_b>, is available.
As the performance indicator we take the arithmetic mean of the squared recovery errors over the observed train. Similarly to (7.5.6) we write this in the form

Q̂[U_r(·)] = Σ_{l=1}^{L} Q̂[U_r(v_l), v_l] (I_l/I),        (7.5.34)

where I_l is the number of pieces of the train falling into the aggregation interval A_l and

Q̂(u_r, v_l) = (1/I_l) Σ_{i: u(i)∈A_l} [u(i) - u_r]^2.        (7.5.35)

Arguing similarly as in deriving equation (7.5.16), we conclude that when only the quantized information v_l is available, the optimally recovered information is a solution of an equation of the form (7.5.16), but with Q̂(u_r, v_l) in place of Q(u_r, v_l); the optimally recovered information is

u*_rl = (1/I_l) Σ_{i: u(i)∈A_l} u(i).        (7.5.36)
We can also get this result by taking into account equation (4.5.1) and the considerations in Section 4.4.2 about the relationships between probabilities and frequencies of occurrences of states. The presented argument can be easily generalized to the K-DIM case. Then equation (7.5.36) takes the form

u*_rl = (1/I_l) Σ_{i: u(i)∈A_l} u(i),        (7.5.37)
where u(i) is the ith observation of the vector information. The compression system using the described transformations is shown in Figure 7.9a. For comparison, Figure 7.9b shows the system using the statistically optimal recovery rule (7.5.17).
Figure 7.9. Quantization systems with optimal information recovery: (a) operating without statistical information, (b) using exact statistical information; U_ro = {u_ro1, u_ro2, ..., u_roL} is the set of potential forms of the optimally recovered information (the partner information).
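The recovery rule (7.5.36), as reconstructed above, amounts to replacing each quantized form by the arithmetic mean of the stored observations that fell into the corresponding aggregation interval. The sketch below illustrates this; the generation of the observed train (a beta-distributed train) and the numbers of intervals and observations are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(5)
u_a, u_b, L = 0.0, 1.0, 4
edges = np.linspace(u_a, u_b, L + 1)

train = rng.beta(2.0, 5.0, size=10_000)               # observed train (illustrative, non-uniform)
cells = np.clip(np.digitize(train, edges) - 1, 0, L - 1)

# Rule (7.5.36): for each quantized form v_l take the average of the observations in A_l.
u_ro = np.array([train[cells == l].mean() for l in range(L)])
print("recovered values (partner information):", np.round(u_ro, 4))
print("interval centres, for comparison       :", np.round((edges[:-1] + edges[1:]) / 2, 4))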
COMMENT
The system compressing the information without statistical information is an example of the systems discussed in Section 1.7.2 and shown in Figure 1.27. It is also a counterpart of the intelligent Huffman system discussed in Comment 2, page 266, and shown in Figure 6.3. The set U_ro = {u_ro1, u_ro2, ..., u_roL} plays the role of the partner
information that the quantizing subsystem delivers to the recovering subsystem. It is the counterpart of the set P* of frequencies of occurrences of blocks used in the intelligent Huffman algorithm shown in Figure 6.3. Figures 7.9a and 7.9b underscore the differences between the systems that operate without statistical information and the systems using statistical information. The systems operating without statistical information must have time to collect information about the train before they can efficiently process its components separately. The system using statistical information can process a component of the train immediately; however, some long-lasting observations which justify the assumption of stationarity (see our discussion in Section 4.3.1) are required. The features of both systems can be combined in a system with learning cycles; see Section 1.7.2.
7.5.2 QUANTIZATION OF VECTOR INFORMATION
The obvious way to quantize a K-DIM continuous vector information, K ≥ 2, is to quantize each component of the information separately. We call it decomposed quantization. On two simple examples we show the basic features of such a quantization.
EXAMPLE 7.5.1 A COMPARISON OF PERFORMANCE OF DECOMPOSED AND NON-DECOMPOSABLE QUANTIZATION
We assume that:
A1. The primary vector information is two-dimensional, u = {u(1), u(2)}.
A2. The set of potential forms of the information is the square U_e2 = {-a ≤ u(1) ≤ a, -a ≤ u(2) ≤ a}.
A3. The density of probability p(u) is uniform on this square.        (7.5.38)
A4. As the performance indicator we take the normalized mean square error Q' of the optimally recovered information,        (7.5.39)
where Q is given by (7.5.25) with K = 2 and σ_u^2 = E(𝕦 - E𝕦)^2. For the density of probability (7.5.38) we have σ_u^2 = a^2/3.
We consider first the decomposed quantization. It is evident that for the assumed joint probability density the marginal probability densities are uniform. For symmetry reasons it can be anticipated that in such a case the uniform quantization is optimal (we prove this in Section 8.3.1). Therefore, it is assumed that each component is quantized according to the uniform quantization rule. We denote as L the number of potential forms of a quantized component; the number of potential forms of the quantized vector information is then L^2. The aggregation set A_l for the vector information is a square with edge Δ = 2a/L; such sets are shown in Figure 1.19a with u_a(1) = u_a(2) = -a, u_b(1) = u_b(2) = a.
The considered quantization is equivalent to the quantization obtained by the NNT (next neighbour transformation) described in Section 1.5.3 with the reference vectors

u[h(1), h(2)] = h(1)g(1) + h(2)g(2),        (7.5.40)

where h(1), h(2) are integers, g(1) = Δu(1), g(2) = Δu(2), and u(1), u(2) are the coordinate unit vectors; see Figure 1.19. For the uniform density of probability the centroid of a square is its centre. From equation (7.5.25) we obtain

Q' = 0.25/L^2.        (7.5.41)

In general the aggregation sets corresponding to separate quantization are rectangles. If the aggregation sets are not rectangles, the quantization cannot be implemented by a separate quantization of the components. An example is the quantization realized by an NNT with the reference vectors

u[h(1), h(2)] = h(1)g(1) + h(2)g(2),        (7.5.42)

where g(1) = 2cu(1), g(2) = c[u(1) + √3 u(2)] and c is a constant determining the size of the aggregation sets. It can be shown that for such reference vectors the aggregation sets are regular hexagons, as shown in Figure 1.19b (with the exception of the regions at the border of the set U_e2 of potential forms of information). For a hexagon and the uniform probability distribution the centroid is again the centre of the hexagon. Proceeding similarly as for the first quantization system, for large L we get

Q' = 5√3/(36L^2) ≈ 0.24/L^2.        □        (7.5.43)

COMMENT 1
The difference between the performances given by equations (7.5.41) and (7.5.43) is not large, but it is instructive. The reason for this difference lies in the structure of the aggregation sets. Since (1) the aggregation sets do not overlap, (2) they cover the set of potential forms, and (3) by assumption the number of potential forms of the quantized information is in both systems the same, the areas of the aggregation sets are in both systems the same. It can be proved that if the area of a two-dimensional figure is fixed, then the integral in (7.5.25) achieves its smallest value when the set A_l is a circle. However, non-overlapping circles cannot cover a square. Therefore, the distortions associated with an aggregation set are the smaller, the better the set approximates a circle of the same area. The hexagon is in this respect better than the square (it is even optimal). This is the ultimate reason for the inferiority of the considered system with separate quantization. Our argument can be generalized to K-dimensional signals. As in the two-dimensional case, the aggregation sets corresponding to separate quantization are K-DIM cubes. The optimal recovery performance is the better, the better the aggregation set approximates a sphere. An important result of the theory of multidimensional spaces is that it is possible to partition a large cube into aggregation sets of the same shape which are not cubes but which approximate the sphere the better, the larger the dimensionality K is. Therefore, the normalized minimal recovery distortions per dimension decrease when the dimensionality of the quantized primary information increases. The ease of implementation of decomposable vector quantization makes it attractive for applications even in spite of its larger than minimal distortions. The two-dimensional comparison can be reproduced numerically as sketched below.
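The sketch below compares, by Monte Carlo, the mean square error of square-cell (decomposed) quantization with that of hexagonal-lattice quantization at approximately equal cell areas. The lattice basis, the brute-force nearest-reference search, and all parameter values are the sketch's own choices; edge effects near the border of the square are ignored.

import numpy as np

rng = np.random.default_rng(6)
a, step = 1.0, 0.2                      # half-edge of the square and square-cell size (illustrative)
u = rng.uniform(-a, a, size=(20_000, 2))

# (i) Decomposed (square-cell) quantization: round each component to the nearest cell centre.
sq = (np.floor(u / step) + 0.5) * step
mse_sq = np.mean(np.sum((u - sq) ** 2, axis=1))

# (ii) Hexagonal-lattice quantization with roughly the same cell area.
c = np.sqrt(2 * step ** 2 / np.sqrt(3))      # lattice constant giving hexagonal cells of area step**2
g1, g2 = np.array([c, 0.0]), np.array([c / 2, c * np.sqrt(3) / 2])
h = np.arange(-15, 16)
refs = np.array([i * g1 + j * g2 for i in h for j in h])
refs = refs[np.max(np.abs(refs), axis=1) <= a + c]           # keep references near the square
d2 = (u[:, None, 0] - refs[None, :, 0]) ** 2 + (u[:, None, 1] - refs[None, :, 1]) ** 2
mse_hex = np.mean(np.sum((u - refs[d2.argmin(axis=1)]) ** 2, axis=1))

print("mean square error, square cells   :", round(mse_sq, 6))
print("mean square error, hexagonal cells:", round(mse_hex, 6))   # about 4% smaller

The ratio of the two errors is close to 0.96, in agreement with the ratio of the coefficients in (7.5.43) and (7.5.41).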
The following example shows that when the components of the primary vector information are statistically dependent, the performance of separate quantization can be improved by a preliminary presentation-changing transformation, in particular by a linear decorrelation of the components of the primary vector information.
EXAMPLE 7.5.2 THE EFFECT OF PRELIMINARY PROCESSING ON THE PERFORMANCE OF OPTIMIZED QUANTIZATION
We make again assumptions A1 and A4 of Example 7.5.1, but instead of assumptions A2 and A3 we assume that:
A5. The density of the joint probability p(u) is as shown in Figure 7.10a.
We consider first the system using direct separate quantization. The densities of the marginal probabilities p1[u(1)] and p2[u(2)] corresponding to the assumed density p[u(1), u(2)] are shown in Figure 7.10a.
Figure 7.10. The effect of decorrelation on the separate quantization of the components of two-dimensional information: a) the assumed joint probability density and the densities of the marginal probabilities (the joint density is constant in the hatched area and zero outside it); b) the aggregation sets corresponding to the optimal separate quantization of the components; c), d) the counterparts of a), b) for separate quantization after the preliminary decorrelation described by (7.3.36).
On the assumption that the number of potential forms of a quantized component is L1 = 2, the uniform quantization is optimal; it can be considered as a one-dimensional NNT with references u_i(k), i = 1, 2, k = 1, 2, marked as crosses on the diagrams of the marginal probability densities shown in Figure 7.10a. The resulting reference vectors u_i, i = 1, 2, 3, 4 for the vector information u and the corresponding aggregation sets U_i, i = 1, 2, 3, 4 are shown in Figure 7.10b. Four reference vectors are defined, so four potential forms of quantized information could be produced; however, only two of them can actually occur. This is a consequence of the assumed probability density. From (7.5.25) we obtain the normalized mean square error of the optimally recovered information
Q' = 0.5 (7.5.44)
We consider next the system which uses quantization after the decorrelating spectral transformation. The block diagram of the system is the same as that of the system with truncation after decorrelation shown in Figure 7.5, with quantization in place of truncation. Since we assumed here the same joint probability density as in Section 7.3 (see Figure 7.4), we can achieve the decorrelation by the transformation (7.3.36); we can also use equations (7.3.38) and (7.3.39). From the latter it follows that the variance of the first component w(1) of the decorrelated information is substantially larger than that of the second component w(2). Our discussion in Section 7.3, in particular conclusion (7.3.53), suggests quantizing only the first component w(1) of the decorrelated information and disregarding the second component. Formally we do this by transforming w(2) by an NNT with the single reference
w1(2) = 0 (7.5.45)
Let us denote by p[w(1)] the density of the marginal probability of the first component w(1) of the decorrelated information, resulting from the joint probability density of its components. Both densities are shown in Figure 7.10c. In Section 8.4.1 we present an algorithm for finding the optimal quantization of scalar information having a non-uniform probability density. For the probability density p[w(1)] shown in Figure 7.10c the optimal quantization is not uniform. To simplify the argument we assume that the quantization of w(1) is uniform, achieved by a scalar NNT with the references w_i(1), i = 1, 2, 3, 4, shown as crosses in Figure 7.10c. The reference vectors resulting from the described quantization of the components of the decorrelated information are
w_i = {w_i(1), w_i(2) = 0}, i = 1, 2, 3, 4 (7.5.46)
They and the corresponding aggregation sets W_i, i = 1, 2, 3, 4, for the decorrelated information are shown in Figure 7.10d. Arguing similarly as in Section 7.3.1, from the recovered spectrum we produce the recovered information by inverting the decorrelating transformation. We calculate the mean square error of the recovered spectrum using again equation (7.5.25). In view of the distance-preserving property, this is also the mean square error of the recovered information. In this way we get
Q' = 0.33 □ (7.5.47)
COMMENT 1
In Section 8.4.1 it is shown that for the uniform probability distribution the uniform quantization is optimal. Although the reference vectors and the corresponding aggregation sets in the system with separate quantization of the primary information are obviously unfavourable, they are optimal under the constraint that the components are quantized separately. Thus the example shows that for some probability distributions the requirement that the quantization be separable is too restrictive, and we should look for non-separable quantization.
COMMENT 2
Similarly as in the case of dimensionality reduction discussed in Section 7.3, the reason for the superiority of separate quantization after decorrelation is that the average ranges of variation of the decorrelated components (characterized by their mean square values) are different. Therefore, we can improve the performance of separate quantization if we quantize the two components differently, so that the component with the larger range of variation is recovered more exactly than the component varying only over a smaller range. We achieve this by allowing more potential forms for the quantized first component than for the second. In the system described in the example we realized this idea in an extreme form, by allowing only one form (the zero) for the transformed second component and four forms for the first component.
OPTIMAL BIT ALLOCATION
We now generalize and formalize the procedure described in the last comment. We assume that
A1. A preliminary spectral decorrelating transformation of the vector information is applied,
A2. The decorrelated components have in general various statistical features,
A3. Each component of the decorrelated vector information is recovered separately from the corresponding quantized component,
A4. The indicators of the quantization system's performance are the total volume of the quantized decorrelated components and the mean square error of the recovered vector information obtained by optimally recovering its components separately.
Equation (8.6.17) and Figure 8.23 show that for a sufficiently large number L of potential forms the mean square error of optimally quantized and recovered (centroid rule) scalar information w is
E(w - w_o)² = A σ_w² 2^(-2 log2 L) = A σ_w²/L² (7.5.48)
where the constant A depends on the probability density of the primary information. In addition to assumption A2 we assume that
A2'. The decorrelated components have the same type of probability distribution but may have various variances.
The spectral transformation is distance invariant. Then from (7.5.19) and (7.5.48) it follows that the mean square error of the recovered primary vector information is
Q = A Σ_{k=1}^{K} σ²(k)/L²(k) (7.5.49)
where Q is given by (7.5.25), σ²(k) is the variance of the random variable w(k) representing the kth component of the decorrelated information vector, and L(k) is the number of its optimally quantized potential forms. On the assumption that log2 L(k) is an integer, this logarithm is the number of bits (binary elementary pieces of information) needed to identify the quantized component. The indicator of the volume of the quantized spectral representation is
V_q = Σ_{k=1}^{K} log2 L(k) (7.5.50)
In view of assumption A4 a typical optimization problem is
OP {L(k); k = 1, 2, .., K}, Q | V_q ≤ V (7.5.51)
where V is the given volume (total number of bits) of the quantized decorrelated information. This problem is called the optimal bit allocation problem. The difficulty in solving it is the requirement that log2 L(k) be a positive integer. If we drop this requirement, we can find the solution of the optimization problem using the method of Lagrange multipliers, which we describe in Section 8.2. The solution is
log2 L_o(k) = V/K + (1/2) log2 [σ²(k)/ρ] (7.5.52a)
where
ρ = [∏_{k=1}^{K} σ²(k)]^{1/K} (7.5.52b)
is the geometric mean of the variances of the decorrelated components.
COMMENT 1
Equation (7.5.52) has a simple interpretation. The first term on the right corresponds to proportional bit assignment. The second term is a correction determined by the ratio of the mean square value (variance) of the considered component to the geometric mean of the mean square values of all decorrelated components. The integer parts of the solutions (7.5.52) can be taken as approximations of the optimal bit assignments. If the variance of a component of the decorrelated information is substantially smaller than the geometric mean of the variances, the solution is substantially smaller than 1 or even negative. Then we reject the corresponding component of the decorrelated information. Thus the obtained solution provides a justification for rejecting the second component in Example 7.5.2. It can also be considered a justification of the procedure applied in JPEG, described in Section 2.5. We present here the approximate solution of the assignment problem because it gives insight into the problem of bit assignment. An algorithm for finding the optimal integer bit assignments directly is also known (see Makhoul et al. [7.20], Shoham, Gersho [7.21]).
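The closed-form allocation (7.5.52) is easy to evaluate. The sketch below is an illustration added here (the function name and all numerical values are our own assumptions, not taken from the text); it returns the real-valued budgets log2 L_o(k) and clips negative values to zero, which corresponds to rejecting a component as discussed above.

import numpy as np

def bit_allocation(variances, total_bits):
    """Approximate optimal bit allocation, eqs. (7.5.52a)-(7.5.52b).

    variances  : sigma^2(k) of the K decorrelated components
    total_bits : V, the admissible total volume in bits
    Returns b(k) = log2 L_o(k); negative budgets are clipped to zero,
    i.e. the component is rejected (in practice the freed bits would
    then be redistributed among the remaining components).
    """
    var = np.asarray(variances, dtype=float)
    K = len(var)
    rho = np.exp(np.mean(np.log(var)))            # geometric mean of the variances
    b = total_bits / K + 0.5 * np.log2(var / rho)
    return np.clip(b, 0.0, None)

# Two decorrelated components with strongly different variances:
print(bit_allocation([4.0, 1e-6], total_bits=4))   # the weak component is rejected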
COMMENT 2
The basic assumption that we should separately quantize the components of the decorrelating spectral transformation has a heuristic character. However, when the joint probability distribution of the components of the primary information is Gaussian, it can be proved (see Segall [7.22]) that the quantization achieved by reduction of dimensionality of the decorrelating spectral transformation, realized by the algorithm presented in Section 7.3.2, page 339, followed by separate quantization of the decorrelated components using the optimal bit assignment, is the optimal vector quantization.
7.5.3 THE CURRENT QUANTIZATION OF INFORMATION
In the previous considerations we assumed that all components of the information are simultaneously available for processing. Often the information is a time-evolving process and its components arrive successively in time. A similar situation occurs when adjacent components of an image are processed successively. In such cases it is desirable to compress the structured information successively, as its new components become available. We call this current information compression (in particular, current information quantization). A wide class of transformations realizing current compression can be considered as special cases of the following prototype transformation: using the already available components of the information and the meta information about the relationships between the components, we estimate the component of the information that will arrive next; as a new component of the compressed information we take a description, of possibly small volume, of the difference between the arrived component and its estimate. (7.5.53) It is required that, based on the compressed descriptions, the components of the primary information can be recovered with a given accuracy, in particular with a given delay.
Let us comment on this description. In general the components of structured information are interrelated by deterministic relationships, discussed in Sections 3.2, 3.3, and 6.6.2, and/or by relationships between states of variety, in particular the statistical relationships discussed in Chapter 5. Those relationships cause some features of a newly arriving component of the structured information to be related to the previously obtained components of information. However, some other features of the new component are independent. The information about these independent features is called really new information. The transformation described in (7.5.53) produces the really new information and compresses its volume. Thus, this transformation may be called a transformation extracting new information.
To illustrate these general concepts we consider a simple but representative example of a new-information-extracting transformation. We assume
A1. The information is a train u(t_n), t_n = nT_s, n = 1, 2, 3, ..., of samples of a primary time-continuous scalar process u(t);
A3. The train u(n), n = 1, 2, ..., exhibits statistical regularities and can be considered as an observation of the train of random variables u(n), n = 1, 2, ..., with E u(n) = 0.
Let us assume that u(n-1) is the last obtained component of information. Since we assumed that it is a scalar, its only feature is its value. Therefore the estimate of the forthcoming component u(n) is a scalar
u*(n) = U*[n, u(n-1)] (7.5.54)
where u(n-1) = {u(1), u(2), ..., u(n-1)} is the train of available samples and U*[n, (·)] is the rule transforming u(n-1) into u*(n). This rule is called the prediction rule. A simple transformation extracting the features of the newly arrived component that are not determined by the components that arrived earlier, thus extracting the "genuinely" new features, is
w(n) = u(n) - U*[n, u(n-1)] (7.5.55)
Such a transformation extracting new information is called a predictive-subtractive transformation. It transforms the train u(n) into the secondary information w(n). We can produce the predicted value U*[n, u(n-1)] immediately after the instant t_{n-1}, when the information component u(n-1) arrives. However, the transformed information w(n) can be produced only after u(n) has arrived, thus after the instant t_n. Therefore, between the production of the predicted value and the production of the transformed information we must introduce the delay T_1. The diagram of the system realizing the predictive-subtractive transformation is shown in Figure 7.11.
Figure 7.11. The predictive-subtractive transformation.
It can be expected that the volume of the transformed information is small when the prediction of the value of the forthcoming component of information is possibly exact. To formulate the problem of optimization of prediction we have to introduce an indicator of the performance of the prediction rule. The general methodology of defining indicators of performance is presented in Section 8.1. Similarly as in the case of recovery of primary information considered in the previous section, we take the mean square error
Q{U*[n, (·)]} = E{u(n) - U*[n, u(n-1)]}² (7.5.56)
as the indicator of performance of the prediction rule; u(n-1) denotes here the train of random variables representing the primary train u(n-1).
Both the implementation constraints and the costs of acquiring exact statistical information often cause the class of admissible prediction rules U*[n, (·)] to be restricted to linear rules. Thus, we assume that the predicted component is
u*(n) = Σ_{m=1}^{n-1} h(n, m) u(m) (7.5.57)
Such a linear prediction rule is described by the set of coefficients
h(n) = {h(n, m), m = 1, 2, .., n-1}
Substituting (7.5.57) in (7.5.56) we get
Q[h(n)] = E[u(n) - Σ_{m=1}^{n-1} h(n, m) u(m)]² (7.5.58)
Thus we face the optimization problem OP h(n), Q. Before we derive the solution of this problem, we derive an important property of the predictive-subtractive transformation using optimal linear prediction. We write definition (7.5.56) in the form
Q[h(n)] = E[u(n) - u*(n)]² (7.5.59)
where
u*(n) = Σ_{m=1}^{n-1} h(n, m) u(m) (7.5.60)
The optimal set of coefficients is the solution of the set of equations
∂Q/∂h(n, m) = 0, m = 1, 2, .., n-1 (7.5.61)
Substituting (7.5.58) and interchanging the sequence of statistical averaging and partial differentiation we get
E[u(n) - u*(n)] u(k) = 0, k = 1, 2, .., n-1 (7.5.62)
Taking into account definition (7.5.55) we write this set of equations in the form
E w(n) u(k) = 0, k = 1, 2, .., n-1 (7.5.63a)
where w(n) = u(n) - u*(n) is the random variable representing the information produced by the predictive-subtractive transformation. The interpretation of (7.5.62) is:
The information w(n) produced by the predictive-subtractive transformation using optimal linear prediction is not correlated with any component u(m), m < n, of the primary information. (7.5.63b)
Substituting (7.5.59) in (7.5.61) we obtain
Σ_{m=1}^{n-1} c_uu(m, k) h(n, m) = c_uu(n, k), k = 1, 2, .., n-1 (7.5.65)
where
c_uu(m, k) = E u(m) u(k) (7.5.66)
This is the set of linear equations from which the coefficients determining the optimal linear prediction can be calculated. We denote
C_uu(n-1) = [c_uu(m, k); m, k = 1, 2, .., n-1], the square correlation matrix of the already available components of information,
c_uu(n-1) = [c_uu(n, k); k = 1, 2, .., n-1], the column matrix of correlations between the next arriving component and the available components of information,
h(n-1) = [h(n, m); m = 1, 2, .., n-1], the column matrix of coefficients determining the linear prediction rule.
Using those matrices we write the set of equations (7.5.65) in the form of a matrix equation
C_uu(n-1) h(n-1) = c_uu(n-1) (7.5.67)
On very general assumptions about the random variables u(n) the inverse matrix C_uu^{-1}(n-1) exists, and the column matrix of coefficients determining the optimal linear prediction rule is
h_o(n-1) = C_uu^{-1}(n-1) c_uu(n-1) (7.5.68)
Efficient algorithms for calculating this solution can be found in Haykin [7.23]. Another class of efficient and easy to implement algorithms for finding the coefficients of the optimal linear prediction are the iterative algorithms presented in Section 8.2.
COMMENT 1
From conclusion (7.5.63) it follows that the linear predictive-subtractive transformation using optimal prediction performs the same function as the real-time linear decorrelating transformations described in Section 5.1. It can be shown that the predictive-subtractive transformation produces a train that, in the sense explained in the Comment on page 227, is equivalent to the train produced by the Gram-Schmidt decorrelating algorithm described on page 226. The real-time decorrelating transformations described in Section 5.2.1 are inherently linear and are unable to exploit any statistical features of the primary information besides the mean values and the correlation coefficients. The transformation extracting new information described in (7.5.53) is much more universal, since it need not be linear. Already the predictive-subtractive transformations using non-linear prediction are more universal. Such transformations are particularly suitable for compression of information having a macro structure: using fragments of a macro element they can identify it and subtract the whole macro component from the primary information. This can lead to very efficient compression of the volume of information having a macro structure.
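For a stationary train the coefficients in (7.5.67)-(7.5.68) depend only on time differences and the normal equations take the familiar Yule-Walker form. The following sketch is an illustration added here (the AR(1) model, the prediction order and all numerical values are assumptions made only for this example): it estimates the correlation coefficients by time averaging, solves (7.5.68), and checks that the prediction error train has a much smaller variance than the primary train.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic primary train: a zero-mean AR(1) process (an assumption for illustration)
N, a = 5000, 0.9
u = np.zeros(N)
for n in range(1, N):
    u[n] = a * u[n - 1] + rng.standard_normal()

p = 3                                   # use the p most recent samples for prediction
# Empirical correlation coefficients c_uu, estimated by time averaging
r = np.array([np.mean(u[: N - lag] * u[lag:]) for lag in range(p + 1)])
C = np.array([[r[abs(m - k)] for k in range(p)] for m in range(p)])   # C_uu of (7.5.67)
c = r[1 : p + 1]                                                      # c_uu of (7.5.67)
h_o = np.linalg.solve(C, c)                                           # solution (7.5.68)

# Prediction error train w(n) = u(n) - u*(n); it should be (nearly) decorrelated
past = np.column_stack([u[p - 1 - m : N - 1 - m] for m in range(p)])  # u(n-1), u(n-2), ...
w = u[p:] - past @ h_o
print("coefficients:", np.round(h_o, 3))
print("variance reduction:", np.var(w) / np.var(u))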
CURRENT QUANTIZATION OF TRAINS BASED ON PREDICTIVE-SUBTRACTIVE TRANSFORMATIONS
An optimized predictive-subtractive transformation produces a train of continuous pieces of information that are more weakly interrelated than the components of the primary train; in particular, using optimal linear prediction we produce a decorrelated train. The predictive-subtractive transformation is reversible, since knowing the first component of the train and successively adding the prediction errors we reconstruct the primary train. We achieve quantization of the primary train by quantizing the components of the prediction error separately; we do this using an NNT. Such a system is shown in Figure 7.12a. We recover the primary train by recovering, piece by piece, the train of prediction errors from the quantized train and adding them successively. However, since the quantization distortions cannot be eliminated completely, the recovered primary train is distorted.
Figure 7.12. Current quantization of a train of pieces of information based on predictive subtractive transformation; (a) the basic system, (b) the system with additional correction of quantization errors.
A natural modification improving the performance of the simple system shown in Figure 7.12a is to base the prediction not on the exact samples of the primary information but to predict the future sample using the components of the quantized train produced in the past. Thus, at the place where the quantized train is produced, the situation at the place where the primary train is recovered from the quantized train is simulated. This allows us to deliver to the user not only the "really new" information contained in the recently arrived piece of the primary information but also information about the effects of the errors of quantization of the previously delivered pieces of quantized information. Such a system is shown in Figure 7.12b. We denote by
w(n) the prediction error,
w_q(n) the quantized prediction error,
u_q*(n) the prediction of the component u(n) of the primary information made on the basis of the train w_q(n-1) of quantized prediction errors.
When the error of recovery of the primary prediction error after quantization is small, it is simplest to recover the primary prediction error piece by piece (we denote the recovered error by w*(n)) and to take
u*(n) = u_q*[n, w_q(n-1)] + w*(n) (7.5.69)
as an estimate of the primary information, and to predict on its basis the next component of the primary information train. The system operating in this way is shown in Figure 7.12b.
COMMENT
The basic problems of dimensionality reduction and quantization have been presented here from a broader perspective, embedded in the framework of the general concepts of information processing introduced in the previous chapters. During the last decade dimensionality reduction and quantization have been of paramount importance for the development of information transmission and storage systems, and the number of publications in the area, particularly on quantization, is very large. A synthetic discussion of vector quantization is presented in Gersho et al. [7.20] and Abut [7.21]. Concrete quantization programs with detailed descriptions can be found in Nelson [7.22]. The conference proceedings Storer, Reif [7.23] and Storer, Cohn [7.24], [7.25], [7.26] present collections of specialized publications on dimensionality reduction and quantization. The conference proceedings [7.26] concentrate on quantization of images.
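A minimal sketch of the scheme of Figure 7.12b is given below (added here for illustration; the fixed one-tap prediction rule, the step size and the test train are assumptions, not taken from the text). Because the coder predicts from the same reconstructed train that the receiver can form, the reconstruction error stays bounded by half the quantization step and quantization errors do not accumulate.

import numpy as np

def dpcm(u, h, q_step):
    """Current quantization with prediction based on quantized information.

    h is a fixed linear prediction coefficient vector (a simplifying
    assumption; the text allows time-varying prediction rules)."""
    p = len(h)
    u_rec = np.zeros(len(u))          # reconstruction u*(n), shared by coder and receiver
    w_q = np.zeros(len(u))            # quantized prediction errors (the transmitted train)
    for n in range(len(u)):
        past = u_rec[max(0, n - p):n][::-1]
        pred = float(np.dot(h[:len(past)], past))      # u_q*(n)
        w = u[n] - pred                                 # prediction error w(n)
        w_q[n] = q_step * np.round(w / q_step)          # uniform NNT quantization
        u_rec[n] = pred + w_q[n]                        # estimate (7.5.69)
    return w_q, u_rec

rng = np.random.default_rng(2)
u = np.cumsum(rng.standard_normal(1000))                # a slowly wandering test train
w_q, u_rec = dpcm(u, h=np.array([1.0]), q_step=0.5)
print(np.max(np.abs(u - u_rec)))   # bounded by q_step/2: the errors do not accumulate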
NOTES
¹ There is no standard terminology in this area. Often discretization and quantization are considered synonyms.
² The reader interested more deeply in the mathematical background of our considerations may consult the textbooks, e.g., Thompson [7.1], Usmani [7.2], Horn [7.3]. The reader more interested in technical problems may consider the K-dimensional space as the natural generalization of the 3-dimensional space which we perceive intuitively, and pursue our considerations assuming that K = 3.
³ For computational reasons it is more convenient to start the numbering of samples and of elements of vectors with 0 rather than with 1. We did this also in Section 3.2.4.
⁴ To simplify the notation we do not distinguish here between the interpretation of a set of numbers as a vector (strict notation cevc) and as a column matrix (strict notation cemx). The actual interpretation of the set is indicated by the type of operations performed.
⁵ The formal proof of this assertion is given in Section 8.3.3.
⁶ In Example 7.2.1 the same assumptions have been made. We rather invoke Example 7.2.4, because we will directly apply the method of deriving the decorrelating spectral representation described in that example to the multidimensional information. In contrast, the elementary method of deriving the decorrelating transformation used in Example 7.2.1 is not suited for information of dimensionality larger than 2.
⁷ From the formal point of view relation (7.2.5) is valid, but equation (7.2.6) is not strict. There is namely a continuum, thus an infinity, of functions which are represented by the same infinite sum on the right of (7.2.6). However, these functions differ only on a set of points having zero Lebesgue volume (measure). For physical reasons such functions cannot be distinguished. Thus, the representation (7.2.6) is a compressing transformation in the mathematical sense, but for technically distinguishable functions it is a reversible presentation transformation.
⁸ This applies still more to the Laplace transformation, in which a complex variable is used instead of the real angular frequency ω. The redundancy of such a representation is still larger than that of the continuous Fourier representation, but the Laplace representation gives more insight into the transformation of processes by linear stationary systems and is therefore suitable for the synthesis of those systems.
⁹ See the discussion in Section 1.5.5.
¹⁰ So large that the effect of the non-hexagonal aggregation sets at the edges of the set of potential forms of information can be neglected.
REFERENCES
[7.1] Thompson, E.E., An Introduction to Algebra of Matrices with some Applications, Adam Hilger, London, 1969.
[7.2] Horn, R.A., Johnson, C.R., Matrix Analysis, Cambridge University Press, Cambridge, 1988.
[7.3] Usmani, R.A., Applied Linear Algebra, Marcel Dekker, N.Y., 1987.
[7.4] Poularikas, A.D., The Transforms and Applications Handbook, IEEE Press, N.Y., 1995.
[7.5] Smith, W.W., Smith, J.M., The Handbook of Real-Time Fast Fourier Transforms, IEEE Press, N.Y., 1995.
[7.6] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., Numerical Recipes, Cambridge University Press, Cambridge, 1992.
[7.7] Curtain, R., Pritchard, A.J., Functional Analysis in Modern Applied Mathematics, Academic Press, N.Y., 1977.
[7.8] Mallat, S.G., "A Theory of Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 11 (1989), pp. 674-693.
[7.9] Oppenheim, A.V., Willsky, A.S., Signals and Systems, Prentice Hall, Englewood Cliffs, NJ, 1989.
[7.10] Lim, J.S., Two-Dimensional Signal and Image Processing, Prentice Hall, Englewood Cliffs, NJ, 1990.
[7.11] Schalkoff, R.J., Digital Image Processing and Computer Vision, J. Wiley, NY, 1989.
[7.12] Russ, J.C., ed., The Image Processing Handbook (2nd ed.), IEEE Press, NY, 1994.
[7.13] Rioul, O., Vetterli, M., "Wavelets and Signal Processing", IEEE SP Magazine, October 1991, pp. 14-38.
[7.14] Young, R.K., Wavelet Theory and Applications, SIAM Press, Philadelphia, 1993.
[7.15] Wickerhauser, M.W., Adapted Wavelet Analysis from Theory to Practice, IEEE Press, NY, 1996.
[7.16] Vetterli, M., Kovacevic, J., Wavelets and Subband Coding, Prentice Hall, Englewood Cliffs, NJ, 1995.
[7.17] Chui, C.K., ed., Wavelets: A Tutorial in Theory and Applications, Academic Press, NY, 1991.
[7.18] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, N.Y., 1991.
[7.19] Blanc-Lapierre, A., Fortet, R., Theory of Random Functions, vols. 1-2, Gordon and Breach, N.Y., 1967.
[7.20] Gersho, A., Gray, R.M., Vector Quantization and Signal Compression, Kluwer, Boston, 1992.
[7.21] Abut, H., ed., Vector Quantization, IEEE Press, NY, 1996.
[7.22] Nelson, M., The Data Compression Book, M&T Books, Redwood City, CA, 1991.
[7.23] Storer, J.A., Reif, J.H., DCC'91 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1991.
[7.24] Storer, J.A., Cohn, M., DCC'92 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1992.
[7.25] Storer, J.A., Cohn, M., DCC'93 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1993.
[7.26] Storer, J.A., Cohn, M., DCC'94 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1994.
[7.26] ICIP-94 Proceedings, IEEE Computer Society Press, Los Alamitos, CA, 1994.
8 STRUCTURES AND FEATURES OF OPTIMAL INFORMATION SYSTEMS
The first purpose of this chapter is to present a systematic approach to optimization problems. In the preceding chapters we considered several concrete optimization problems. They now serve as examples of the general methods presented here. On the other hand, the considerations in this chapter show those specific problems in a broader perspective and provide formal justifications for the heuristic assumptions that have been introduced previously. The second purpose of this chapter is to derive in a systematic way the structures of optimal information systems which use most efficiently the meta information about the properties of the superior system, the environment in which it operates, and the properties of the environment in which the information system operates. We also derive typical trade-off relationships between the indicators of information system performance.
The mathematical theory of optimization is very well developed, and computer technology allows even very complicated problems to be solved and implemented in real time. A strict formulation of an optimization problem is essential for using the existing mathematical apparatus and computational means. The crucial part of a problem's formulation is the definition of the performance criteria. This is the subject of the first section of this chapter. Section 8.2 gives a review of methods essential for the optimization of information systems. To describe them we use the concept of information, which allows a more compact presentation and directly suggests applications. In Sections 8.3, 8.4, and 8.5 the basic structures of optimal subsystems for the recovery of primary information, when only distorted information is available, are derived. It is shown that several recovery rules which have previously been introduced using heuristic arguments are, on quite general assumptions, optimal. This is so in particular for the next neighbour rules and the hierarchical information recovery rules. To optimize the whole system, the intermediate transformations preceding the ultimate recovery must be optimized taking into account the ultimate transformation. Such an overall optimization of an information system is discussed in Section 8.6. As the most important examples of the optimization of intermediate transformations, the optimization of quantization and of shaping the signal put into a communication channel is presented. In the previous chapters the role of auxiliary information about the state of the environment of an information system has been emphasised. The second part of Section 8.6 discusses the optimization of the state information subsystem.
8.1 INDICATORS OF INFORMATION SYSTEMS PERFORMANCE
This section presents a universal methodology for defining indicators characterizing the performance of an information processing rule as a whole. This is the basic and usually most difficult step in formulating the optimization problem. First, we consider the performance indicators of the information system in a concrete situation. They are determined by the properties of the superior system and the structure of the available information. In the second step, we define the indicators characterizing the performance of the information system as a whole. We concentrate on indicators characterizing the performance of a transformation of information considered as a whole. The definition of such an indicator must take into account (1) the properties of the superior system, (2) the structure of the concrete information, (3) the properties of the set of potential forms of information and of the states of the information system's environment. The concept of the operation removing the dependence on details is the key concept. It permits the definition of the performance of an information transformation as a whole to be based on the definition of the performance of the information system in a concrete situation. The general considerations are illustrated with examples. The obtained results are used in the forthcoming examples of solving optimization problems.
8.1.1 INDICATORS OF SYSTEMS PERFORMANCE IN A CONCRETE SITUATION
As a representative example of performance indicators, the indicators of distortions in the communication system shown in Figure 1.3 are considered. For a given primary information x, a recovered information x*, and a superior system it is, in general, possible to determine the decrease of performance of the superior system pursuing its goal on the false assumption that the state of its environment corresponds to the information x*, while in fact the information is x. In most cases it is natural to define a scalar q(x, x*) on which this loss essentially depends. We call q(x, x*) the indicator of distortions in a concrete situation (briefly, the indicator of concrete distortions). Often we take as the indicator of distortions a monotone (decreasing or increasing, e.g., square) function of a distance between the primary and the recovered information. The concept of distance occurred several times in the previous chapters. We introduced it in Section 1.4.3 as an essential element of the description of continuous sets of states or information. The distance appears also in the next-neighbour transformation (see Section 1.5.3). The choice of the distance function that determines the definition of the indicator of distortions is determined primarily by the properties of the superior system. We concentrate here on general rules for choosing the distance function as a basis for the definition of the distortion indicator.
DISCRETE INFORMATION
We assume that the set of potential forms of the primary information x and of the recovered information x* is X_d = {x_l, l = 1, 2, ..., L}. The indicator of concrete distortions is described by the square table
q = [q(l, k); l, k = 1, 2, ..., L] (8.1.1)
where q(l, k) is the indicator of distortions for the pair x = x_l, x* = x_k. Often any distortion is equally undesired. For example, if discrete data are processed it is essential that no errors occur, and if an error does occur it is usually irrelevant what form it takes. Then it is natural to take
q(l, k) = 0 for k = l, q(l, k) = 1 for k ≠ l (8.1.2)
We call such a distortion indicator symmetric.
ONE-DIMENSIONAL INFORMATION
We assume that the set of potential forms of the primary information x and of the recovered information x* is X_c = <x_min, x_max>. For broad classes of superior systems it is not x itself that is relevant for the deterioration of performance but the difference x - x*, which has the meaning of an error. Then it is natural to define the indicator of concrete distortions as a weighted error:
q(x, x*) = φ(x - x*) (8.1.3)
where φ(u) is the weighting function. Its choice depends in principle on the properties of the superior system, but often suitability for analytic calculations is also taken into account. Typical examples of the weighting function are
φ1(u) = 0 for |u| ≤ Δ, φ1(u) = 1 for |u| > Δ (8.1.4)
φ2(u) = u² (8.1.5)
For systems with a sensitivity threshold Δ (see our discussion in Section 1.4.3) the function φ1(u) is a natural choice of the distortion indicator. It can be considered the counterpart of the symmetric indicator (8.1.2) for discrete information. Because it is quite representative and suitable for analytic calculations, the square weighting function φ2(u) is often used in theoretical considerations.
K-DIMENSIONAL VECTOR INFORMATION
For structured information the indicator of distortions is usually defined in terms of the indicators of distortions of the corresponding components of the structured information. Let us first consider the vector information x = {x(n), n = 1, 2, ..., N}. Often the component x(n) has the meaning of information about the nth sample of a time-continuous state, and the inertia of the superior system causes the effects of the errors of recovery of the samples to accumulate. Then for the superior system the sum of the weighted errors of the components is relevant, and we take
q(x, x*) = Σ_{n=1}^{N} φ[u(n), n] (8.1.6)
where
u(n) = x(n) - x*(n) (8.1.7)
and φ(u, n) is the weighting function of the error of the nth component.
Taking as φ(u) the square function φ2(u) given by (8.1.5), as a special form of (8.1.6) we get
q(x, x*) = Σ_{n=1}^{N} [x(n) - x*(n)]² (8.1.8)
Comparing this with the definition (1.4.8) of the Euclidean distance, we get
q(x, x*) = d²(x, x*) (8.1.9)
A simple example of a distortion indicator taking into account the various weights of the components of the information is the indicator with linear weighting of the elementary distortions
φ(u, n) = a(n)φ(u) (8.1.10)
where a(n) > 0 are weighting coefficients. Taking φ2(u) given by (8.1.5) we get
q(x, x*) = Σ_{n=1}^{N} a(n)[x(n) - x*(n)]² (8.1.11)
When x(n) has the meaning of information about the nth sample of the state of the environment, then by taking a train a(n) growing with n we can take into account the declining effect of a sample of the state as the instant at which it was taken moves further into the past. Some superior systems utilize functions of the information rather than the primary information directly. As indicated in Section 7.3.4, page 349, the features of the harmonic spectrum of information are relevant for several superior systems. A typical example is the perception of sounds and images by people (see the discussion of the choice of the preliminary transformation in the JPEG standard in Section 2.5, page 115). Let us denote by T(·) a transformation transforming the primary information into a secondary vector information. A wide class of distortion indicators has the form
q(x, x*) = q_p[T(x), T(x*)] (8.1.12)
where q_p(·,·) is one of the previously considered distortion indicators for vector information. Notice that (8.1.10) can be considered a special case of (8.1.12) with T(x) = {√a(n) x(n), n = 1, 2, ..., N} and q_p(x, x*) given by (8.1.8). Introducing a performance criterion for the superior system itself and using adaptive matching procedures, we can choose the auxiliary transformation T(·) so that the information processed according to a rule optimized in the sense of the distortion index q(x, x*) maximizes the efficiency of the superior system.
We have considered performance indicators for vector information with a fixed number of components N. When the number of components can vary, it is usually convenient to normalize the index with respect to the number of elements:
q_a(x, x*) = (1/N) q(x, x*) (8.1.13)
where q(x, x*) is one of the previously mentioned distortion indicators. Taking, for example, the indicator (8.1.8), we get
q_a(x, x*) = (1/N) Σ_{n=1}^{N} [x(n) - x*(n)]² = A [x(n) - x*(n)]² (8.1.14)
where A is the operator of arithmetical averaging defined by (4.1.3). Our considerations about K-dimensional continuous vectors apply also to discrete vectors
x = {x(n), n = 1, 2, ..., N}, x(n) ∈ X_d = {x_l, l = 1, 2, ..., L} (8.1.15)
From equation (8.1.6) we get
q(x, x*) = Σ_{n=1}^{N} q_e[x(n), x*(n)] (8.1.16)
where q_e[x(n), x*(n)] is one of the distortion indicators for discrete information discussed on page 381. Assume that the elementary components are binary (L = 2) and that we use the symmetric indicator of distortions defined by (8.1.2). From (8.1.16) we get
q(x, x*) = d_H(x, x*) (8.1.17)
where d_H(x, x*) is the Hamming distance defined by equation (2.1.36). We used this indicator as the feedback information for the system described in Section 2.1.2.
To this point we have considered indicators of concrete distortions of vector information based on sums over components, either of the primary or of the transformed vector information. This is justified if the cumulative effect of the distortions of the elementary components of information is relevant for the performance of the superior system. However, for some superior systems the biggest distortion of a component can cause irreparable damage. For such systems the natural indicator of distortions is
q(x, x*) = max_n q_e[x(n), x*(n)] (8.1.18)
The vector information may be interpreted as a function of an integer argument considered as a whole. Therefore, the definitions of the individual distortion indicators can easily be modified for information that has the structure of a function of one or more continuous arguments. For example, the counterpart of (8.1.8) is the indicator of distortions of the image information described on page 23:
q(x, x*) = ∫∫ {x[t(1), t(2)] - x*[t(1), t(2)]}² dt(1) dt(2) (8.1.19)
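The concrete distortion indicators introduced above are straightforward to compute. The short sketch below is an added illustration with arbitrary numbers; it evaluates the square indicator (8.1.8), its weighted version (8.1.11), the worst-case indicator (8.1.18) and, for binary components, the Hamming indicator (8.1.17).

import numpy as np

def q_square(x, x_star):            # (8.1.8): sum of squared component errors
    return float(np.sum((x - x_star) ** 2))

def q_weighted(x, x_star, a):       # (8.1.11): linear weighting of elementary distortions
    return float(np.sum(a * (x - x_star) ** 2))

def q_hamming(x, x_star):           # (8.1.17): symmetric indicator, binary components
    return int(np.sum(x != x_star))

def q_worst_case(x, x_star):        # (8.1.18): largest elementary distortion
    return float(np.max((x - x_star) ** 2))

x      = np.array([1.0, 2.0, 3.0, 4.0])
x_star = np.array([1.1, 2.0, 2.7, 4.2])
print(q_square(x, x_star), q_square(x, x_star) / len(x))   # (8.1.8) and its normalized form (8.1.14)
print(q_weighted(x, x_star, np.array([1.0, 1.0, 2.0, 2.0])))
print(q_worst_case(x, x_star))
print(q_hamming(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))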
8.1.2 INDICATORS CHARACTERIZING THE PERFORMANCE OF AN INFORMATION TRANSFORMATION RULE AS A WHOLE
The processed information x* usually depends not only on the primary information x but also on some side factors; thus,
x* = T(x, z) (8.1.20)
where T(·) is the transformation describing the rule of information processing and z are the side factors acting in the system. The concrete distortion caused by the transformation is
q(x, x*) = q[x, T(x, z)] (8.1.21)
The system has to operate for any x ∈ X, z ∈ Z, where X (respectively Z) is the set of potential forms that the primary information (respectively the side factors acting in the system) can take. To define, on the basis of q[x, T(x, z)], a number Q[T(·)] characterizing the performance of the transformation T(·) considered as a whole, we have to take into account all forms of the primary information that the system has to process and all potential states of the environment in which the information system has to operate. To produce a number Q[T(·)] characterizing the transformation T(·) as a whole, we must remove the dependence of the set {q[x, T(x, z)]; x ∈ X, z ∈ Z} on the specific values q[x, T(x, z)] but keep its dependence on the transformation T(·). The operation transforming in such a way the set {q[x, T(x, z)]; x ∈ X, z ∈ Z} into the number Q[T(·)] we call the dependence-on-detail removing operation (briefly, detail removing operation, abbreviated DRO), and we denote it by D. Thus, the indicator of distortions caused by the transformation T(·) is
Q[T(·)] = D_{x∈X, z∈Z} q[x, T(x, z)] (8.1.22)
Defining a DRO we face the problem of defining an operation transforming a set of numbers {q(u); u ∈ U} into a number Q that characterizes the function q(·) as a whole. We write this in the form
Q = D_{u∈U} q(u) (8.1.23)
If the set U is discrete, thus U = {u_l, l = 1, 2, ..., L}, then typical dependence-removing operations are as follows:
• The arithmetical averaging operation
D_ar = A (8.1.24)
where A is the operation defined by (4.1.3);
• The operation of finding the maximum (minimum) value
D_ma = max (8.1.25)
• The operation of statistical averaging
D_st = E (8.1.26)
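As a small added illustration (the numbers are made up), the first two detail removing operations act on a set of concrete distortion values as follows; the statistical averaging operation is their limiting form when the values occur with given probabilities.

import numpy as np

# Concrete distortions q[x(i), T(x(i), z(i))] collected for several processed blocks
q_values = np.array([0.02, 0.11, 0.04, 0.37, 0.05, 0.08])

Q_arithmetic = np.mean(q_values)   # D_ar, eq. (8.1.24), leading to indicator (8.1.27)
Q_worst_case = np.max(q_values)    # D_ma, eq. (8.1.25), leading to indicator (8.1.28)
print(Q_arithmetic, Q_worst_case)
# D_st, eq. (8.1.26), is the expectation over the joint distribution of x and z;
# with observed relative frequencies it is estimated by the arithmetic average above.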
The choice of the operation D depends on (1) the properties of the superior system, (2) the structure of the concrete information, (3) the properties of the set of potential forms of information, particularly on the type of eventual weight assigned to each potential form. Let us first assume that
A1. The information is a train x_tr = {x(i), i = 1, 2, ..., I}, but the blocks x(i) are processed separately according to a rule T(·), as described in Sections 1.7.2, 6.2.1, and 7.5.2;
A2. For the superior system the distortions of all elements of the block are relevant.
Based on these assumptions, it is natural to take as the DRO the arithmetical averaging operation D_ar. Using (8.1.22) and (8.1.24), we get the performance indicator
Q[T(·)] = A_i q{x(i), T[x(i), z(i)]} = (1/I) Σ_{i=1}^{I} q{x(i), T[x(i), z(i)]} (8.1.27)
If an excessive error of recovery of a block could cause irreparable damage, then instead of A2 it is natural to assume
A3. For the superior system the performance of the block transformation in the worst case is relevant.
On this assumption we take as the DRO the operation D_ma of finding the maximum:
Q[T(·)] = max_i q{x(i), T[x(i), z(i)]} (8.1.28)
Till this point we have considered the indicators of performance of the transformation describing the overall operation of the information system. Often we can modify only the rules of operation of a subsystem, while the rules of operation of the other subsystems are fixed. Then we face the problem of defining indicators of performance not of the overall transformation rule but of the rule according to which a subsystem operates. The subsequent two sections show that by a suitable choice of the DRO our approach is also applicable to the definition of indicators characterizing the performance of subsystems of an information system. We illustrate this with the subsystems at the end and inside the prototype information system shown in Figure 1.2.
8.1.3 INDICATORS OF THE PERFORMANCE OF AN ULTIMATE INFORMATION TRANSFORMATION
We consider first the indicators of performance of the last subsystem in the prototype information system shown in Figure 1.2, on the assumption that all subsystems preceding the last one operate according to fixed and known rules. Then the overall information transformation rule performance index defined by (8.1.22) can be used directly to characterize the ultimate information transformation from the point of view of distortions. To simplify the terminology we assume that the prototype system shown in Figure 1.2 is a communication system and the last subsystem is the receiver. Because in most cases (with the exception of the rather unusual randomized transformations; see Section 1.5.5, page 51) the ultimate information transformation is deterministic, we need not take into account the side states occurring in (8.1.22).
Therefore, the operation of the receiver is described by the transformation T_rx*(·) transforming the available information r into the recovered information x*. Thus,
x* = T_rx*(r) (8.1.29)
Let us first assume that the system operates block-wise (see, e.g., Section 6.2.1, particularly Figure 6.1); thus T_rx*(r) has the meaning of the rule of separate recovery of the received blocks, and we consider only the training cycle (see, e.g., Section 1.7.2, particularly Figure 1.25). We denote by
U_tr = {(x(i), r(i)); i = 1, 2, ..., I} (8.1.30)
the train of pairs: primary information, the received signal carrying this primary information. We call this train the training information. On assumption A2 we take the DRO D_ar given by (8.1.24). We get
Q[T_rx*(·)] = (1/I) Σ_{i=1}^{I} q{x(i), T_rx*[r(i)]} = A_i q{x(i), T_rx*[r(i)]} (8.1.31)
EXAMPLE 8.1.1 INDICATOR OF THE PERFORMANCE OF A LINEAR INFORMATION RECOVERY RULE 1: PERFORMANCE DURING A TRAINING CYCLE
To illustrate the definition (8.1.31) we assume
A1. The primary and the recovered information are one-dimensional; we denote them x (respectively x*);
A2. The received information is N-dimensional; thus r = {r(n), n = 1, 2, ..., N};
A3. The information recovery rule is a linear transformation:
x* = Σ_{n=1}^{N} h(n) r(n) (8.1.32)
A4. The concrete distortion indicator is
q(x, x*) = (x - x*)² (8.1.33)
Since the transformation rule T_rx*(·) is determined by the set
h = {h(n), n = 1, 2, ..., N} (8.1.34)
we write Q(h) instead of Q[T_rx*(·)]. Substituting (8.1.32) and (8.1.33) in (8.1.31) we get
Q(h) = A_i [x(i) - Σ_{n=1}^{N} h(n) r(n, i)]²
     = A_i {[x(i)]² - 2x(i) Σ_{n=1}^{N} h(n) r(n, i) + Σ_{n=1}^{N} Σ_{m=1}^{N} h(n) h(m) r(n, i) r(m, i)} (8.1.35)
Interchanging the sequence of the operations A and Σ we get
Q(h) = A_i [x(i)]² - 2 Σ_{n=1}^{N} h(n) c_xr(n) + Σ_{n=1}^{N} Σ_{m=1}^{N} h(n) h(m) c_rr(n, m) (8.1.36)
where
c_xr(n) = A_i x(i) r(n, i) (8.1.37a)
c_rr(n, m) = A_i r(n, i) r(m, i) (8.1.37b)
are the empirical correlation coefficients. □
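The identity between (8.1.35) and (8.1.36) is easy to verify numerically. The sketch below is an added illustration (the training-data model and the candidate coefficient set h are arbitrary assumptions): it computes the empirical correlation coefficients (8.1.37) and shows that the quadratic form (8.1.36) reproduces the directly averaged square error.

import numpy as np

rng = np.random.default_rng(4)
I, N = 2000, 3
x = rng.standard_normal(I)                          # training: primary information x(i)
r = np.outer(x, [1.0, 0.5, 0.2]) + 0.3 * rng.standard_normal((I, N))   # received r(n, i)

c_xr = np.array([np.mean(x * r[:, n]) for n in range(N)])               # (8.1.37a)
c_rr = np.array([[np.mean(r[:, n] * r[:, m]) for m in range(N)] for n in range(N)])  # (8.1.37b)

h = np.array([0.6, 0.2, 0.1])                        # some candidate linear recovery rule
Q_direct = np.mean((x - r @ h) ** 2)                 # arithmetic average of (8.1.33), i.e. (8.1.35)
Q_from_c = np.mean(x ** 2) - 2 * h @ c_xr + h @ c_rr @ h   # eq. (8.1.36)
print(Q_direct, Q_from_c)                            # the two expressions agree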
We next give two examples of calculating the indicator of performance of the rule of recovery of a single block, on the assumption that the primary and the received information exhibit joint statistical regularities. Then it is natural to take statistical averaging as the DRO.
EXAMPLE 8.1.2 INDICATOR OF THE PERFORMANCE OF A DISCRETE INFORMATION RECOVERY RULE: THE PRIMARY AND RECEIVED INFORMATION EXHIBIT JOINT STATISTICAL REGULARITIES
We assume that
A1. The primary and the recovered information are discrete; X_d = {x_l, l = 1, 2, ..., L} is the set of their potential forms;
A2. The indicator q(l, k) of distortions in a concrete situation is the symmetric function given by (8.1.2).
As the DRO we take the statistical averaging operation:
Q[T_rx*(·)] = E q(x, x*) (8.1.38)
where x, x* are here the random variables representing the primary, respectively the recovered, information. Using the definition (4.4.10) we get
Q[T_rx*(·)] = Σ_{l=1}^{L} Σ_{k=1}^{L} q(l, k) P(x = x_l, x* = x_k) (8.1.39)
Substituting (8.1.2) we get
Q[T_rx*(·)] = Σ_{l=1}^{L} Σ_{k≠l} P(x = x_l, x* = x_k) = 1 - Σ_{l=1}^{L} P(x = x_l, x* = x_l) = P(x ≠ x*) (8.1.40a)
Thus:
The probability of error is the indicator of the performance of an information recovery rule in the sense of definition (8.1.38), on the assumption that the concrete distortion indicator is symmetric. (8.1.40b)
Now instead of A1 and A2 we assume
A3. The primary and the recovered information are one-dimensional;
A4. The indicator of distortions in a concrete situation is the square error given by (8.1.33).
Proceeding as previously we get
Q[T_rx*(·)] = E(x - x*)² (8.1.41a)
Thus:
The mean square error is the indicator of the performance of an information recovery rule in the sense of definition (8.1.38), on the assumption that the concrete distortion indicator is the square error. (8.1.41b) □
We now give a counterpart of Example 8.1.1.
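As an added numerical illustration (the toy recovery rules below are assumptions, not rules derived in the text), the two statistical indicators can be estimated by averaging the concrete distortion indicator over simulated pairs (x, x*):

import numpy as np

rng = np.random.default_rng(3)
I = 100_000

# Discrete case, symmetric indicator (8.1.2): Q = P(x != x*), cf. (8.1.40)
x = rng.integers(0, 4, size=I)                    # primary discrete information, L = 4 forms
x_star = np.where(rng.random(I) < 0.9, x, rng.integers(0, 4, size=I))   # a toy recovery rule
print("estimated probability of error:", np.mean(x != x_star))

# One-dimensional case, square indicator (8.1.33): Q = E(x - x*)^2, cf. (8.1.41)
xc = rng.standard_normal(I)
xc_star = 0.8 * xc + 0.1 * rng.standard_normal(I)   # a toy linear recovery rule
print("estimated mean square error:", np.mean((xc - xc_star) ** 2))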
EXAMPLE 8.1.3 INDICATOR OF THE PERFORMANCE OF A LINEAR INFORMATION RECOVERY RULE 2: THE PRIMARY AND RECEIVED INFORMATION EXHIBIT JOINT STATISTICAL REGULARITIES
We make assumptions A1 to A4 of Example 8.1.1, but now we take as the performance indicator the statistical average
Q(h) = E[x - Σ_{n=1}^{N} h(n) r(n)]² (8.1.42)
where x, r(n), n = 1, 2, ..., N, are the random variables representing the primary information (respectively the components of the received information). Since the E operation is, like the A operation, a linear operation, we can proceed similarly as in Example 8.1.1 and we get
Q(h) = E x² - 2 Σ_{n=1}^{N} h(n) c_xr(n) + Σ_{n=1}^{N} Σ_{m=1}^{N} h(n) h(m) c_rr(n, m) (8.1.43)
where
c_xr(n) = E x r(n) (8.1.44a)
c_rr(n, m) = E r(n) r(m) (8.1.44b)
are the statistical correlation coefficients. □
8.1.4 INDICATORS OF THE PERFORMANCE OF A PRELIMINARY INFORMATION TRANSFORMATION
Considering the subsystem performing the ultimate information transformation, we assumed that the transformations performed by all subsystems other than the considered one are fixed and known. Therefore, we could use the overall transformation performance index directly as an index characterizing the ultimate transformation. Usually, however, the properties of the ultimate information produced by the information system are influenced not only by the considered subsystem but also by non-fixed transformations performed by other subsystems. We now show how to define the index of performance of a subsystem in such a situation. As a representative example we take the preliminary information transformation T_xv(·), which transforms the primary information x into the information v that is fed into the subsystem performing the fundamental transformation (see Figure 1.2). Concrete examples of such a transformation are the volume-compressing transformations considered in Chapters 6 and 7. We assume that
A1. The preliminary information transformation T_xv(·) is deterministic but may be irreversible;
A2. The fundamental transformation T_vr(·) transforming the information v into the information r available for the ultimate transformation (see Figure 1.2) is a deterministic and reversible transformation;
A3. The ultimate transformation T_rx*(·) is a deterministic transformation.
In view of assumption A2 we assume further that r = v (see Figure 1.2); thus T_rx*(·) = T_vx*(·). Based on these assumptions we have
q(x, x*) = q{x, T_vx*[T_xv(x)]} (8.1.45)
We did not assume that the uhimate transformation r^(-) is known. Therefore, to obtain a performance indicator of the preliminary transformation T^(') we have to remove the dependency both on concrete x and concrete r„,() thus, Q[T^(')] = D D a{x, T^.[TM]}Let us assume that x exhibits statistical regularities. Then we take D= E X
(8.1.46) (8.1.47)
X
To remove the dependence on the ultimate transformation Tyj^*{') we have to introduce a model of its indeterminism. If we are free to chose the rule T^, (•), then for a given T^(') we would apply a possibly good rule r"^*(*). In such a situation we take D = min (8.1.48) If, however, we design a preliminary information processing subsystem; that is an irreversible dimensionality reducing subsystem for a system in which many users recover (decompress) the primary information using their own devices coming from many vendors; then we may assume that the transformation 7Vx*() is random. Then, to obtain the performance indicator characterizing the preliminary information transformation, as the D.R.O. we take D =
E.
(8.1.49)
COMMENT 1 We have described here the indicators of distortions that are quite representative and convenient in analytical considerations. A great variety of indicators of performance of processing of special types information was proposed. Such indicators for speech processing (called distortion measures) are listed in Makhoul [8.1, (Section IB)] and for images (called fidelity indicators) in Lim [8.2], Dougherty [8.3], or Russ [8.4]. The consequent approach to the definition of indicators of performance in a concrete situation would be to express such an indicator in terms of the indicators of performance of the superior system for which the information is destined. Our considerations in Section 1.6.1, particularly equation (1.6.3), suggests such a procedure. COMMENT 2 In previous consideration we used dependence removing operations to define characteristics of structured information bases on characteristics components. For example the definitions of distance (1.4.8), (1.4.9), and (1.4.10) for vectors, diagrams, and images can be considered as a result of applying an operation removing the dependence of the basic distance definition (1.4.7) between corresponding elements on the identifier of the component. Also the definitions of the volume of a potential information based on definition of the volume of a concrete information presented in Section 6.1 use detail removing operations. Still an other example is definition of channel capacity base on the amount of statistical information which the output signal provides about the in put signal in a concrete communication channel.
390
Chapter 8 Optimization of Information Systems
8.2 METHODS OF SOLVING OPTIMIZATION PROBLEMS Section 1.6.6 formulated the problem of optimization of an information transformation. In the previous section the general methodology of defining a performance indicator as a function (functional) of an information transformation as a whole has been given. I general such a transformation is described by a function of a continuous arguments, e.g., by a pulse response. However, in most cases we chose the type of the information transformation (typically a linear transformation or a next neighbour transformation) in which only a set of parameters h = {h(n), w = l, 2 ,• • • , A^} is free. The same situation we have when to-days dominant digital processing technic is used. Therefore, we concentrate here on the problem of finding a minimum or maximum of a function g, of A^ variables h, eventually with some constraints. Such a problem has been called in Section 1.6.2 parametric optimization problem and was denoted as O P / i Q , | C , , m=2, 3,- • • , M , QECH-
Finding the maxima or minima of functions of several variables is the subject of mathematical optimization theory. There are several excellent books on the subject. See e.g, Minoux [8.5] for principles and Cuthbert [8.6] and Press et al. [8.7] for algorithms and programs. Here we only sketch the basic methods that are of greatest importance for information systems optimization. 8.2.1 REDUCTION OF THE MINBVflZATION PROBLEM TO SEARCH IN A SET OF SOLUTIONS OF AN AUXILIARY EQUATION We consider finding the point of minimum of a scalar function f(a) of a scalar argument aE^ where ^ is the set of values which the argument can take. Such a value a^ of the argument that f(aj
8.2 Methods of Solving Optimization Problems
391
For a scalar function/(a) of a A^-DIM variable a = {a(n); AZ = 1, 2, • • , A^} the generalization of (8.2.2) holds. Under corresponding assumptions we have to search for the minimum point only in the set ^^ of solutions of the set of equations iM-0 da(n) yv .
The vector
n-l,2,..A^
(8.2.3)
^-
^ . g r a d / . E ' ^ ^^
bin), (8.2.4) da(n) where b(n), A2 = 1, 2, • • ,A^are the unit coordinate vectors of the basic orthogonal coordinate system (see Section 7.1.1), is called gradient. Using it we write the set of equations (8.2.3) in the compact form grad/=0 (8.2.5) where 0 is a vector with all components equal 0. We give now examples of applications of the set of equation (8.2.3) or equivalently of (8.2.5) for finding optimal transformations of information. EXAMPLE 8.2.1 OPTIMIZATION OF A LINEAR INFORMATION TRANSFORMATION: ALL NEEDED STATISTICAL INFORMATION IS AVAILABLE; ANALYTIC APPROACH We consider the optimization of the transformation of available A^-dimensional information r about the primary one-dimensional information x into the recovered informations*. We assume that AL The transformation is linear, given by (8.L32); thus, it is described by a set h of coefficients; A2. The primary and available information exhibit statistical regularities; A3. As distortions indicator we take the mean square error given by (8.1.42). From equation (8.1.43) it follows that on those assumptions we need only the rough description of the statistical properties by the correlation coefficients (see Section 4.4.4). Therefore, we assume that A4. Exact information about the correlation coefficients is available. After substituting a^h, f{a)-^Q{K) the set of equations (8.2.3) takes the form:
∂Q(h)/∂h(n) = 0,  n = 1, 2, …, N.    (8.2.6)
From (8.1.43) we get
∂Q(h)/∂h(n) = −2c_xr(n) + 2Σ_{m=1}^{N} h(m)c_rr(n, m).    (8.2.7)
Substituting this in (8.2.6) we obtain
c_xr(n) = Σ_{m=1}^{N} h(m)c_rr(n, m),  n = 1, 2, …, N.    (8.2.8a)
We write this set of equations briefly as the matrix equation
C_rr h = C_xr,    (8.2.8b)
where
C_rr = [c_rr(m, n)] — the correlation matrix of the components of the available information r,
C_xr = [c_xr(n)] — the column matrix of correlations between the primary information x and the components of the available information r,
h = [h(m)] — the column matrix of coefficients determining the linear transformation.
On very general assumptions (see, e.g., Thompson [8.8]), the inverse matrix C_rr⁻¹ exists. Then the set
h₀ = C_rr⁻¹ C_xr    (8.2.9)
is the only solution of the set of equations (8.2.3); thus, it is the only element of the set 𝒜₀. It can be easily proved that this is the point of minimum of Q(h). Special cases of the considered optimization problem occurred already in Section 7.3.1 (page 331) and in Section 7.5.3 (page 374).
Calculation of the optimal coefficients is a transformation of the rough but sufficient statistical information in the form of correlation matrices into the set h₀ of coefficients determining the optimal transformation of the working information. Thus, the considered optimal subsystem has a two-layer structure, as shown in Figure 8.1a. This is a special case of the layered system shown in Figure 1.24b. □
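As a minimal numerical sketch (not part of the original text), the closed-form solution (8.2.9) can be computed directly from the correlation matrices; here the correlations are estimated from simulated training pairs, which also illustrates the arithmetical-correlation variant used in Example 8.2.2 below. All names and data are illustrative assumptions.

```python
import numpy as np

def optimal_linear_coefficients(C_rr, C_xr):
    """Solve C_rr h = C_xr (equation (8.2.8b)), i.e. h0 = C_rr^{-1} C_xr (8.2.9)."""
    return np.linalg.solve(C_rr, C_xr)

# Illustrative use: estimate correlations from training pairs {x(j), r(j)}
# (arithmetical correlations) and compute the optimal coefficient set h0.
rng = np.random.default_rng(0)
J, N = 2000, 4
h_true = np.array([0.5, -0.2, 0.1, 0.3])
R = rng.normal(size=(J, N))                  # available information r(j)
x = R @ h_true + 0.1 * rng.normal(size=J)    # primary information x(j)

C_rr = R.T @ R / J          # correlation matrix of the components of r
C_xr = R.T @ x / J          # correlations between x and the components of r
h0 = optimal_linear_coefficients(C_rr, C_xr)
print(h0)                   # close to h_true; the (mean) square error is minimized
```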
Figure 8.1. The optimized linear information transformation: (a) the information exhibits statistical regularities and the statistical correlation coefficients are available, (b) the system operates with a training cycle; {x(j), r(j)}, j = 1, 2, …, J is the training information.
The assumptions made in the example are justified if the primary and available information exhibit statistical regularities and all needed statistical information is available. We now consider the situation when this is not the case.
EXAMPLE 8.2.2 OPTIMIZATION OF A LINEAR INFORMATION TRANSFORMATION: ONLY TRAINING INFORMATION IS AVAILABLE
We assume that
A1'. The system operates with a training cycle (see Section 1.7.2 and Figure 1.25); {x(j), r(j)}, j = 1, 2, …, J is the training information;
A2'. The block-wise recovery of information during a training cycle is considered;
A3'. The recovery rule is linear;
A4'. The arithmetical square error given by (8.1.35) is the indicator of distortions.
Since equations (8.1.36) and (8.1.43) are similar, we can use the results (8.2.8) and (8.2.9) by replacing the statistical by the arithmetical correlation coefficients. The block diagram of such an optimized system is shown in Figure 8.1b. □
COMMENT 1
In practice, the assumptions A2 and A4 made in Example 8.2.1 are justified when, by some earlier analysis of the properties of the environment of the information system, the existence of statistical regularities and their stationarity were established and the correlation coefficients were acquired.
Our general considerations in Section 1.7.2 about systems with a training cycle apply to Example 8.2.2. If the primary information x and the available information r are not accessible at the same place, providing training information may not always be possible. Using the coefficients optimized during a training cycle in the following working cycle is justified if the states of the system's environment are stationary. If they are only quasi-stationary, we have to interleave the working and training cycles as shown in Figure 1.25.
For information compression or prediction systems no special subsystem providing training information is necessary, because usually both the primary information and the compressed information (playing the role of the available information) are directly accessible. If introducing a delay is permissible, we calculate the optimal set h₀ for the whole train and apply it for block-wise compression as shown in Figure 1.27.
COMMENT 2
If information exhibits statistical regularities, then in view of the fundamental property of long sequences the arithmetical averages are estimates of statistical averages (see Section 4.3.1). Thus, for long training trains the system shown in Figure 8.1a may be considered as the limiting form of the system shown in Figure 8.1b.
COMMENT 3
The problems considered in both examples play the role of atomic problems into which a great variety of problems of statistical optimization of linear transformations of processes and images can be decomposed. Typical examples are filtration and
prediction of processes, filtration and enhancement of images, and identification of characteristics of linear dynamic objects (in particular, of the pulse response); see, e.g., Middleton, Goodwin [8.9]. Several methods have been developed to obtain the solution of the counterparts of the basic equation for vector-valued functions or for functions of two or three arguments (still and moving images). Universal is the method of Kalman (see, e.g., Proakis [8.13, Ch. 6]). It allows the optimal recovery of trains of scalars or vectors to be implemented successively, by using the already optimally recovered components to calculate the next optimally recovered component.
To this point minimization without constraints has been considered. In most cases various types of constraints are imposed on the variables in the optimized system. The basic types of constraints were discussed in Section 1.6.2. Often we face optimization problems with equality constraints. So are called the constraints
g(m, a) = 0,  m = 2, 3, …, M,    (8.2.10)
where g(m, a), m = 2, 3, …, M are given functions. One of the important conclusions of optimization theory is that
On general assumptions about the existence of derivatives of the criterion function f(a) and of the functions g(m, a), m = 2, 3, …, M determining the equality constraints, the solution of the optimization problem OP a f(a) | g(m, a) = 0, m = 2, 3, …, M, is an element of the set of solutions of the equation
grad_a f_Λ(a) = 0,    (8.2.11a)
where
f_Λ(a) ≡ f(a) + Σ_{m=2}^{M} λ(m) g(m, a)    (8.2.11b)
is an auxiliary function and λ(m), m = 2, …, M are auxiliary parameters (called the Lagrange function and the Lagrange parameters, respectively). For a detailed description of this method see Minoux [8.5]. We have already used it previously; see Section 7.5.2, page 370.
8.2.2 NUMERICAL FINDING OF THE ZERO POINT: THE SAMPLES OF THE FUNCTION ARE EXACTLY KNOWN
Only in special cases is it possible to find the solution of the set of auxiliary equations in closed form. Of paramount importance for the practical realization of optimal information systems are numerical methods of finding such a solution. Those methods are also important because they can be easily extended to the case when the criterion function (and possibly the constraint functions) are not given in analytical form and calculation of the derivatives in closed form is not possible. This, in turn, makes it possible to implement those algorithms in intelligent information systems.
A GENERAL METHOD OF FINDING THE ZERO POINT
We consider first the simple case when a continuous function g(a) is given and we want to find numerically its zero point a₀, thus the root of the equation
g(a) = 0.    (8.2.12)
Most numerical methods of solving this equation are based on the following idea:
1. We take an initial value a(1) for the variable a;
2. Using the available meta information about the function g(·) in the neighborhood of a(1), we approximate this function by a function g′(a, 1), a ∈ 𝒜;
3. We find the solution a(2) of the equation
g′(a, 1) = 0;    (8.2.13)
4. Using an approximating function g′(a, 2) we proceed with a(2) as we did with a(1); continuing this we generate a train a(j), j = 1, 2, …;
5. We choose the approximating functions g′(a, j) so that the train a(j), j = 1, 2, … of solutions of the equations g′(a, j) = 0 converges to a solution a₀ of (8.2.12);
6. The calculation of the solutions of the auxiliary equations is simple;
7. For some j = J we stop the procedure and take a(J) as the approximation of a₀. The typical stopping rule is to stop when for the first time
|a(j+1) − a(j)| ≤ δ,    (8.2.14)
where δ > 0 is a given small constant.
The successive values are thus generated by a recurrence of the form
a(j+1) = a(j) + Δ(j),    (8.2.15)
where Δ(j) is a correction computed from the available values of g(·). A simple and frequently used approximating function is the linear function
g′(a, j) = g[a(j)] + λ₁(j)[a − a(j)],    (8.2.16)
where λ₁(j) > 0 is a train of coefficients. The purpose of these coefficients is to achieve the convergence mentioned in point 5. Their choice depends in a crucial way on the type of the available information about the properties of the function g(·). The solution of the equation g′(a, j) = 0 for the function given by (8.2.16) is
a(j+1) = a(j) − λ(j)g[a(j)],    (8.2.17a)
where
λ(j) = 1/λ₁(j).    (8.2.17b)
We call Δ(j) = −λ(j)g[a(j)] the correction (in the j-th step). The block diagram of a system generating the train is shown in Figure 8.2.
A(/")
X^(/>1)
r
ISTEP DELAY a(j)
AO)
1 STEP DELAY
a(J)
correction
a)
b)
Figure 8.2. The system implementing a recurrence: (a) of scalars (e.g., generated by (8.2.15)), (b) of vectors (e.g., generated by (8.2.29)); thick lines denote the flow of vectors.
The recursive equation (8.2.15) permits the train a(j), j = 1, 2, … to be calculated successively. We call this equation briefly a recurrence. This recurrence together with the stopping rule (8.2.14) we call the recurrent zero-point searching algorithm. To specify the recurrence we must specify the train λ₁(j) or, equivalently, λ(j). This, however, must take into account the available information about the class (set) of potential forms of the function g(·) (in our terminology, the meta information about g(·)).
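The recurrent zero-point searching algorithm can be sketched in a few lines; the function g, the coefficient schedule, and the tolerance below are illustrative assumptions, not prescriptions from the text.

```python
def zero_point_search(g, a1, lam, delta=1e-8, max_steps=10_000):
    """Generate a(j+1) = a(j) - lam(j) * g(a(j))  (recurrence (8.2.17a))
    and stop when |a(j+1) - a(j)| <= delta  (stopping rule (8.2.14))."""
    a = a1
    for j in range(1, max_steps + 1):
        a_next = a - lam(j) * g(a)
        if abs(a_next - a) <= delta:
            return a_next
        a = a_next
    return a

# Example: g(a) = a**3 - 2 has its zero point at 2 ** (1/3); a small constant
# coefficient suffices here because the slope at the zero point is positive.
root = zero_point_search(lambda a: a**3 - 2, a1=1.0, lam=lambda j: 0.1)
print(root)
```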
EXACT INFORMATION ABOUT THE DERIVATIVE OF g(·) IS AVAILABLE
When for any a ∈ 𝒜 we can calculate the derivative dg/da, then it is natural to take as the approximation g′(a, j) in (8.2.16) the first two terms of the Taylor series around a(j). Thus, we take
λ₁(j) = dg/da |_{a = a(j)}.    (8.2.18)
The typical diagram for the corresponding g′(a, j) is shown in Figure 8.3a. From calculus it is known that, on general assumptions about g(·), the train (8.2.17) with λ₁(j) given by (8.2.18) converges to a₀, as illustrated in Figure 8.3a.
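When the derivative is available, the choice (8.2.18) makes λ₁(j) the local slope, which is the classical Newton iteration; a brief sketch under that assumption (the test function is hypothetical):

```python
def newton_zero_point(g, dg, a1, delta=1e-10, max_steps=100):
    """Recurrence (8.2.17a) with lambda(j) = 1 / lambda_1(j), where
    lambda_1(j) is the derivative dg/da taken at a(j) (equation (8.2.18))."""
    a = a1
    for _ in range(max_steps):
        a_next = a - g(a) / dg(a)
        if abs(a_next - a) <= delta:
            return a_next
        a = a_next
    return a

print(newton_zero_point(lambda a: a**3 - 2, lambda a: 3 * a**2, a1=1.0))
```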
Figure 8.3. Examples of trains generated by the recurrence (8.2.16); the algorithm: (a) using the exact value of the derivative, (b)-(g) with constant coefficient λ_c. Figures (c)-(g) illustrate the dependence on λ_c of the convergence close to the zero point a₀, on the assumption that the function g(a) is approximated by a linear function having slope γ₀.
To get the derivative dg/da for any a ∈ 𝒜, we would have to calculate it either analytically or numerically. In most information systems this is not possible, and we have only some rough information about g(·). We now show that we can assure the convergence when we have information only about the derivative at the zero point.
ONLY ROUGH INFORMATION ABOUT THE DERIVATIVE OF g(·) AT THE ZERO POINT IS AVAILABLE
When we do not have information about the derivative of g(·) for each a, then the simplest choice of the train λ(j) is to take in the recurrence (8.2.16)
λ(j) = λ_c = const.    (8.2.19)
To analyze the convergence of the sequence a(j) to a₀ we assume that
γ₀ > 0,    (8.2.20a)
where
γ₀ ≡ dg/da |_{a = a₀}    (8.2.20b)
is the derivative, or equivalently the slope, of the function g(a) at the zero point a₀. A typical train generated by (8.2.17a) with λ(j) = λ_c = const is shown in Figure 8.3b. It is evident that the convergence of the generated train depends on the properties of the function g(a) in the neighborhood of the zero point a₀. When in an environment of this point the derivative of g(a) exists, then in this environment we can approximate g(a) by a linear function and study the convergence of the train a(j) on the assumption that g(a) is a linear function. The corresponding trains for various λ_c are shown in Figures 8.3c to 8.3g. Those figures show that what is essential for the convergence is the magnitude of the coefficient λ_c compared with the slope γ₀ defined by (8.2.20b). The train converges monotonically if λ_c < γ₀; a₀ is reached in one step if λ_c = γ₀; the train oscillates if λ_c > γ₀ and diverges for λ_c > 2γ₀. Although for any λ_c < γ₀ the convergence is monotone, the smaller λ_c, the slower the convergence is. If we know the derivative γ₀ exactly, then taking λ_c = γ₀ we achieve the fastest convergence. If we have about γ₀ only the rough information that γ₀ ≥ γ_min, where γ_min is known, then taking λ_c = γ_min we achieve monotonic convergence, but for γ₀ > γ_min it is slower than the fastest possible. Thus, for having only the inexact information γ_min about the slope γ₀, we pay with a slowed-down convergence rate.
ONLY MINIMAL INFORMATION ABOUT THE DERIVATIVE OF g(·) AT THE ZERO POINT IS AVAILABLE
When we know only that γ₀ > 0, we say that only minimal information about the derivative γ₀ is available. Then a search for the zero point using a fixed coefficient λ(j) would be unpredictable, and we have to use the general recurrence (8.2.17a)
a(j+1) = a(j) − λ(j)g[a(j)]    (8.2.21)
with coefficients λ(j) varying with j so that
lim_{j→∞} λ(j) = 0.    (8.2.22)
To get the guidelines for choosing such a train λ(j), we look at the total movement from a(1) to a(j+1). From (8.2.21) we get
a(j+1) = a(1) − Σ_{i=1}^{j} λ(i)g[a(i)].    (8.2.23)
From this it follows in turn that, to achieve for any a(1) the convergence of a(j) to a₀, the series Σ λ(i) cannot converge; thus, it must be
Σ_{i=1}^{∞} λ(i) = ∞.    (8.2.24)
A useful class of sequences λ(j) are the sequences
λ(j) = A/j^α,    (8.2.25)
where A > 0 and α ≥ 0 are constants. Their basic property is
Σ_{j=1}^{∞} A/j^α = ∞ for α ≤ 1,    (8.2.26)
Σ_{j=1}^{∞} A/j^α < ∞ for α > 1.    (8.2.27)
For α = 0 we have the previously considered case when λ(j) = λ_c = const.
NUMERICAL FINDING OF A SOLUTION OF A SET OF EQUATIONS
The described algorithms can be generalized to find numerically the solution of the set of equations
g(a, n) = 0,  n = 1, 2, …, N,    (8.2.28)
where g(a, n) are functions of the N-dimensional argument a = {a(n), n = 1, 2, …, N}. The generalization of (8.2.21) is the recurrence
a(j+1) = a(j) − λ(j)g[a(j)],    (8.2.29)
where
g(a) ≡ {g(a, n), n = 1, 2, …, N}.    (8.2.30)
The counterpart of (8.2.14) is the stopping condition
|a(j+1) − a(j)| ≤ δ.    (8.2.31)
EXAMPLE 8.2.3 OPTIMIZATION OF A LINEAR INFORMATION TRANSFORMATION: RECURRENT CALCULATION OF THE OPTIMAL COEFFICIENTS
We make again the assumptions A1 to A4 of Example 8.2.1, but instead of calculating the solution (8.2.9) in closed form we look for it numerically. From (8.2.7) we take as the functions whose common zero point we are looking for
g(h, n) ≡ (1/2) ∂Q(h)/∂h(n),    (8.2.32)
thus
g(h, n) = Σ_{m=1}^{N} h(m)c_rr(n, m) − c_xr(n),  n = 1, 2, …, N.    (8.2.33)
We write the set of those relationships in the matrix form
g(h) = C_rr h − C_xr.    (8.2.34)
The sequence (8.2.29) takes the form
h(j+1) = h(j) − λ(j)[C_rr h(j) − C_xr].    (8.2.35)
To achieve the convergence to the solution given by (8.2.9), using the previously discussed guidelines we choose the auxiliary train λ(j) depending on the available meta information about the matrices. □
COMMENT
Finding the optimal set of coefficients by running the recursion (8.2.35) is an alternative to calculating it from equation (8.2.9). The described procedure is useful when the matrices C_rr, C_xr can be considered as minor changes of some primary matrices C′_rr, C′_xr for which we have already calculated the optimal set h′₀. Using the sequence (8.2.35) with h(1) = h′₀ may require much less calculation than calculating the inverse matrix C_rr⁻¹ in (8.2.9).
8.2.3 NUMERICAL FINDING OF THE ZERO POINT: ONLY DISTORTED SAMPLES OF THE FUNCTION ARE AVAILABLE
We now show that on very general assumptions it is possible to obtain a transformation that is optimal in the sense of a performance indicator characterizing the transformation as a whole, even though an equation expressing the indicator explicitly in terms of the statistical features of the transformation is not available. However, we must have access to inaccurate information about the values of the indicator of performance in concrete situations. Thus, the procedure circumvents the need for direct information about the statistical properties of potential states.
THE BASIC RECURRENT ALGORITHM
We start with the simple but representative one-dimensional case. We assume that
1. Observations of a function G(a, U), where a ∈ 𝒜 is a scalar variable and U is a random, generally multidimensional variable, are available;
2. We are looking for the zero point of the function
g(a) = E_U G(a, U).    (8.2.36)
We introduce the random variable
Ξ ≡ G(a, U) − g(a).    (8.2.37)
From the definition it follows that
E Ξ = 0.    (8.2.38)
We write definition (8.2.37) in the form
G(a, U) = g(a) + Ξ.    (8.2.39)
Thus, we can interpret G(a, U) as a random variable representing the value g(a) distorted by the additive noise Ξ. We consider a train a(j), j = 1, 2, … of values of the variable a and a train G[a(j), U(j)] of random variables, where U(j) is a train of statistically independent random variables with the same probability distribution. The counterpart of the basic recurrence (8.2.21) is the recurrence
a(j+1) = a(j) − λ(j)G(j),    (8.2.40)
where G(j) is an observation of the variable G[a(j), U(j)]. A typical train generated by (8.2.40) is shown in Figure 8.4.
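A minimal sketch of the recurrence (8.2.40) with the typical choice λ(j) = 1/j; the distorted observations G(j) are simulated here with hypothetical additive Gaussian noise, so the sketch illustrates only the mechanism, not a particular system.

```python
import numpy as np

def stochastic_zero_point(observe, a1, steps=5000):
    """Recurrence (8.2.40): a(j+1) = a(j) - lambda(j) * G(j),
    where G(j) is a distorted observation of g(a(j)) and lambda(j) = 1/j."""
    a = a1
    for j in range(1, steps + 1):
        a = a - (1.0 / j) * observe(a)
    return a

# Hypothetical example: g(a) = 2*(a - 3) has its zero point at a0 = 3 and is
# observed through additive noise, as in equation (8.2.39).
rng = np.random.default_rng(1)
observe = lambda a: 2.0 * (a - 3.0) + rng.normal(scale=1.0)
print(stochastic_zero_point(observe, a1=0.0))   # close to 3
```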
Figure 8.4. A typical train a(j) generated by the recurrence (8.2.40).
Using (8.2.40) similarly to (8.2.23), we get
a(j+1) = a(1) − Σ_{i=1}^{j} λ(i)g[a(i)] − Σ_{i=1}^{j} λ(i)z(i),    (8.2.41)
where
z(i) = G(i) − g[a(i)].    (8.2.42)
Since the second component of a(j+1) is an observation of a random variable, an element a(j+1) of the train generated by (8.2.40) is to be considered as a realization of a random variable. Therefore, we cannot speak about the convergence of the train a(j+1), j = 1, 2, … in the sense of classical analysis, but we have to look for a convergence of the corresponding random variables. For most technical applications we would consider that a train generated by (8.2.40) converges to the zero point a₀ of the function g(a) if
lim_{j→∞} E[a(j) − a₀]² = 0.    (8.2.43)
To achieve the convergence of the generated train in this sense we must choose properly the train of auxiliary coefficients λ(j). Comparing (8.2.41) and (8.2.23) we see that, to ensure unrestricted correction ability, we must again require that (8.2.24) is satisfied, that is, that
Σ_{i=1}^{∞} λ(i) = ∞.    (8.2.44)
To achieve the convergence of the generated train we must achieve not only the convergence of the first but also of the second sum in (8.2.41). Detailed analysis of this sum shows (see, e.g., Schmetterer [8.10]) that the necessary condition for the convergence of the train generated by (8.2.40) to a₀ is that
Σ_{i=1}^{∞} λ²(i) < ∞.    (8.2.45)
From (8.2.26) it follows that if we take
λ(j) = A/j^α,    (8.2.46)
where A > 0 and 0.5 < α ≤ 1, then we satisfy both conditions (8.2.44) and (8.2.45). A typical choice is
λ(j) = 1/j.    (8.2.47)
As in the case when the exact values of the function are available, our present considerations can be generalized to finding a solution of a set of equations (8.2.28). The generalization of the recurrence (8.2.40) is the recurrence
a(j+1) = a(j) − λ(j)G(j),    (8.2.48)
where G(j) = {G(j, n); n = 1, 2, …, N} and G(j, n), n = 1, 2, …, N is an observation of such a random variable G[a(j), U, n] that
g(a, n) = E G[a, U, n].    (8.2.49)
The previously discussed principles of choosing the train of coefficients λ(j) apply to the recurrence (8.2.48).
EXAMPLE 8.2.4 ADAPTIVE LINEAR RECOVERY OF INFORMATION; THE INFORMATION EXHIBITS STATISTICAL REGULARITIES BUT ONLY TRAINING INFORMATION IS AVAILABLE
We make assumptions A1, A2, and A3 from Example 8.2.1, but we do not assume that direct information about the statistical correlation coefficients is available. Instead, as in Example 8.2.2, we assume that training information is available.
Consider again the problem of optimization of the linear information transformation when the primary and available information exhibit statistical regularities and the distortion indicator Q(h) is given by (8.1.43). Differentiating this equation we get
∂Q(h)/∂h(n) = ∂/∂h(n) E[x − Σ_{m=1}^{N} h(m)r(m)]² = E ∂/∂h(n)[x − Σ_{m=1}^{N} h(m)r(m)]²
 = −2E[x − Σ_{m=1}^{N} h(m)r(m)]r(n).    (8.2.50)
We are looking for the solution of the set of equations (8.2.3). Therefore, we take
g(h, n) = ∂Q(h)/∂h(n).    (8.2.51)
Comparing (8.2.50) with (8.2.49) we see that
G(h, U, n) = −2[x − Σ_{m=1}^{N} h(m)r(m)]r(n),    (8.2.52)
with U = {x, r(n), n = 1, 2, …, N}, is the random variable representing the distorted observation of the partial derivative ∂Q(h)/∂h(n). In view of (8.2.52) the recurrence (8.2.48) takes the form
h(j+1) = h(j) + 2λ(j)[x(j) − Σ_{m=1}^{N} h(m, j)r(m, j)]r(j),    (8.2.53a)
where r(j) = {r(n, j), n = 1, 2, …, N} is the vector representing the N-dimensional available information. We write this equation in the simple form
h(j+1) = h(j) + 2λ(j)[x(j) − x*(j)]r(j),    (8.2.53b)
where
x*(j) ≡ Σ_{m=1}^{N} h(m, j)r(m, j)    (8.2.54)
has the meaning of the information produced from the available information r(j) by the linear transformation determined by the set of coefficients h(j) = {h(n, j), n = 1, 2, …, N} obtained in the j-th step of the recurrence. The optimized linear system based on this recurrence is shown in Figure 8.5.
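A minimal sketch of the recurrence (8.2.53) as an adaptive (LMS-type) linear recovery; the training pairs are simulated under hypothetical assumptions, and a small constant λ is used for simplicity instead of a decaying schedule.

```python
import numpy as np

def adaptive_linear_recovery(x_train, r_train, lam=lambda j: 0.05):
    """Run h(j+1) = h(j) + 2*lam(j)*[x(j) - x*(j)]*r(j)  (recurrence (8.2.53b))."""
    J, N = r_train.shape
    h = np.zeros(N)
    for j in range(J):
        x_star = h @ r_train[j]                # x*(j), equation (8.2.54)
        h = h + 2 * lam(j + 1) * (x_train[j] - x_star) * r_train[j]
    return h

rng = np.random.default_rng(2)
J, N = 5000, 4
h_true = np.array([0.5, -0.2, 0.1, 0.3])
r_train = rng.normal(size=(J, N))               # available information r(j)
x_train = r_train @ h_true + 0.1 * rng.normal(size=J)   # primary information x(j)
print(adaptive_linear_recovery(x_train, r_train))        # approaches h_true
```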
Figure 8.5. The optimization of the linear information transformation based on the recurrence (8.2.53).
We denote by
x**(j) ≡ Σ_{m=1}^{N} h(m, j+1)r(m, j)    (8.2.55)
the recovered information that we would get using the newly calculated set h(j+1). From (8.2.53) to (8.2.55), after some algebra, we get
[x(j) − x**(j)]² = B[x(j) − x*(j)]²,    (8.2.56a)
where
B < 1    (8.2.56b)
for sufficiently large j. Thus, in every step the algorithm (8.2.53) improves the processing of the already obtained information.
COMMENT
The system derived in this example performs a similar function as the system derived in Example 8.2.2. However, the operation of the two systems and their structures (compare Figures 8.1b and 8.5) are different. In the system derived in Example 8.2.2, the evaluation of the empirical correlation coefficients, on which the performance indicator (see (8.1.36)) depends, and the finding of the optimal coefficients of the linear transformation are separated. A similar situation is found in the system based on statistical correlation coefficients derived in Example 8.2.1.
In the system derived in Example 8.2.4 the empirical correlation coefficients do not appear at all. The ability of the system to optimize the linear transformation in the sense of a performance indicator determined by the correlation coefficients is related to the features of the recurrence (8.2.48). Equation (8.2.41) indicates that the recurrence performs two functions: it improves the deterministic part of the approximation so that it approaches the optimum h₀ (the first sum in (8.2.41)), and, due to the choice of the auxiliary coefficients λ(j), it decreases to zero the variance of the indeterministic part (the second sum in (8.2.41)). Thus, the recurrence has a similar effect as arithmetic averaging.
8.2.4 FINDING THE POINT OF MINIMUM
In the previous two subsections we considered the solutions of optimization problems that can be reduced to finding the zero point of the derivative of the criterion function. However, the derivative in the form of a formula is often not available. The typical reason is that the criterion function is given only in numerical form. We present here methods of solving optimization problems which cannot be directly reduced to finding the zero point of the derivative.
FINDING THE MINIMUM POINT OF A CRITERIAL FUNCTION OF ONE VARIABLE
Let us return to the primary minimization problem of a scalar function f(a) of a scalar argument discussed in Section 8.2.1. In terms of the function f(a) the recurrence (8.2.21) has the form
a(j+1) = a(j) − λ(j) df/da |_{a = a(j)}.    (8.2.57)
If the derivative exists, then
df/da = lim_{δ→0} (Δf)_δ/(2δ),    (8.2.58)
where
(Δf)_δ ≡ f(a + δ) − f(a − δ).    (8.2.59)
Since the a(j+1) generated by the recurrence (8.2.57) has only the meaning of an approximation of the zero point a₀ of the derivative df/da, only inexact information about the derivative is really needed. However, the accuracy of this information must increase when a(j) approaches a₀. Those remarks suggest using, instead of (8.2.57), the recurrence
a(j+1) = a(j) − λ(j)(Δf)_{δ(j)}/(2δ(j)) |_{a = a(j)},    (8.2.60)
with a train δ(j) → 0. We call it the recurrence based on increment ratios (briefly, the increment ratio recurrence). Figure 8.6 shows a train generated by this recurrence.
Figure 8.6. A train generated by the recurrence (8.2.60) based on increment ratios.
It can be proved (see, e.g., Schmetterer [8.10]) that on very general assumptions about the function f(a) the increment ratio recurrence (8.2.60) generates a train converging to the point a₀ of minimum of the function, if the following conditions are satisfied:
lim_{j→∞} λ(j) = 0,  lim_{j→∞} δ(j) = 0,  Σ_{j=1}^{∞} λ(j) = ∞,    (8.2.61a)
Σ_{j=1}^{∞} λ(j)δ(j) < ∞.    (8.2.61b)
It is convenient to take
λ(j) = 1/j^α,  δ(j) = 1/j^β.    (8.2.62)
Then the conditions (8.2.61) are satisfied if 0 < α ≤ 1 and α + β > 1. Typical values are α = 1, β = 0.5.
When we cannot calculate the derivative analytically and do not have exact information about the values of the function, but they exhibit statistical regularities, it is natural to use instead of the recurrence (8.2.60) the recurrence
a(j+1) = a(j) − λ(j)[F₊(j) − F₋(j)]/(2δ(j)),    (8.2.63)
where F₊(j), F₋(j) are observations of the random variables representing the distorted values of the function f at the points a(j) + δ(j) and a(j) − δ(j), just as G[a(j), U(j)] represents g(a) (see equation (8.2.39)). We call it the recurrence based on distorted increment ratios (briefly, the distorted increment ratio recurrence). Again it can be shown (see Schmetterer [8.10]) that on very general assumptions about the function f(a) and about the statistical distortions, the train generated by the distorted increment ratio recurrence converges in the mean square sense (equation (8.2.43)) if in addition to the conditions (8.2.61) the condition
Σ_{j=1}^{∞} [λ(j)/δ(j)]² < ∞    (8.2.64)
is satisfied.
If we assume again that the auxiliary coefficients are given by (8.2.62), then from (8.2.26) it follows that all the conditions are satisfied if
3/4 < α ≤ 1  and  1 − α < β < α − 1/2.    (8.2.65)
A pair satisfying those conditions is α = 1, β = 0.3. We have discussed the minimization of a function of a scalar argument here in more detail because this simple model shows the fundamental properties of recurrent procedures for finding the point of minimum. Of paramount practical importance are the generalizations of these considerations to functions of several arguments, which are presented next.
COMMENT
The basic advantage of the increment ratio recurrence (8.2.60) is that we do not need an explicit equation for the derivative df/da; it is sufficient to know the values of the function f(a). The price that we pay, compared with the recurrence (8.2.57), is that in each step, instead of calculating the value of the derivative once, we have to calculate the value of the function twice. The distorted increment ratio recurrence (8.2.63) has similar advantages as the increment ratio recurrence (8.2.60), plus the advantages of the distorted function recurrence discussed in the Comment on page 402. Obviously, for diminishing the volume of information about the properties of the minimized function we pay with a slowing down of the recurrence. Using the distorted increment ratio recurrence, we also have to take into account that unless we know that the processed information exhibits statistical properties, we have no grounds to expect that a generated train converges to the minimum point. In spite of this, the distorted increment ratio recurrence and its modifications are used as heuristic procedures and often produce satisfactory results.
FINDING THE MINIMUM POINT OF A CRITERIAL FUNCTION OF SEVERAL VARIABLES
When we look for the point of minimum of a function f(a) of an N-dimensional argument a = {a(n), n = 1, 2, …, N}, several previously introduced concepts have their one-dimensional counterparts; however, specific problems related to multidimensionality arise. The counterpart of the derivative is the gradient defined by (8.2.4). The basic property of the gradient is
df = (grad f, da),    (8.2.66)
where (·,·) denotes the scalar product and
da ≡ Σ_{n=1}^{N} d[a(n)] b(n)    (8.2.67)
is the infinitesimal displacement vector.
The set of points
𝒮(f₁) ≡ {a | f(a) = f₁}    (8.2.68)
we call an equivalue surface (in the two-dimensional case, an equivalue line). For an a ∈ 𝒮(f₁) we have df = 0. From (8.2.66) it follows that (grad f, da) = 0. Thus, the vectors grad f and da are perpendicular. So we conclude that
For every point on an equivalue surface (line), the gradient is perpendicular to this surface (line).    (8.2.69)
From (8.2.66) it follows that
|df| = |grad f| |da| cos(∠ grad f, da).    (8.2.70)
Thus,
For a fixed "magnitude" of the infinitesimal displacement vector, the change of the value of the function is maximal if we move along the gradient vector.    (8.2.71)
This property is exploited by most procedures for searching for the minimum point of a differentiable function of several variables. It suggests the steepest-descent procedure: We take an initial point a(1), calculate the gradient at this point, and move along it as long as the function f(a) decreases. When it stops decreasing, we calculate the gradient again and proceed as in the first step. Because this procedure requires frequent testing of the changes of the value of the criterial function, it is not best suited for on-line utilization in information systems.
Using the gradient, we write the basic recurrence (8.2.29) in the form
a(j+1) = a(j) − λ(j) grad f |_{a = a(j)}.    (8.2.72)
Thus, this recurrence also utilizes the basic property (8.2.71) by changing the point approximating the minimum point along the line of steepest descent, as illustrated in Figure 8.7. However, the correction is made only on the basis of local properties of the function.
When the values of the gradient cannot be evaluated, it is natural to use instead of the gradient the vector of partial increments
GRAD_δ f ≡ Σ_{n=1}^{N} { [f(a + δb(n)) − f(a − δb(n))]/(2δ) } b(n),    (8.2.73)
where b(n) is the n-th unit coordinate vector. The counterpart of the recurrence (8.2.60) is
a(j+1) = a(j) − λ(j)[GRAD_{δ(j)} f]_{a = a(j)}.    (8.2.74)
It can be shown (see, e.g., Minoux [8.5]) that on very general assumptions the train generated by this recurrence with coefficients satisfying the conditions (8.2.61) converges toward the point of minimum of the function. Also a similar counterpart of (8.2.63) generates a train converging in the mean square sense.
COMMENT
The described recurrences are of paramount importance for applications, particularly for the design of intelligent predictors, filters, and next neighbor transformations. For a review of several such systems, see Tsypkin [8.11], Widrow, Stearns [8.12], Proakis [8.13, Chapter 6].
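A minimal sketch of the recurrence (8.2.74), in which the gradient is replaced by the vector of partial increments computed from (possibly distorted) samples of the criterion function, with schedules of the form λ(j) = A/j^α, δ(j) = B/j^β; the quadratic criterion and the noise below are illustrative assumptions.

```python
import numpy as np

def increment_ratio_minimize(f_obs, a1, steps=3000, A=0.5, alpha=1.0, B=0.5, beta=0.3):
    """a(j+1) = a(j) - lambda(j) * GRAD_delta f   (recurrence (8.2.74)),
    where GRAD_delta f is the vector of partial increment ratios (8.2.73)."""
    a = np.asarray(a1, dtype=float)
    N = a.size
    for j in range(1, steps + 1):
        lam, delta = A / j**alpha, B / j**beta
        grad = np.empty(N)
        for n in range(N):
            e = np.zeros(N)
            e[n] = delta                      # displacement along b(n)
            grad[n] = (f_obs(a + e) - f_obs(a - e)) / (2 * delta)
        a = a - lam * grad
    return a

# Hypothetical noisy quadratic criterion with minimum at (1, -2).
rng = np.random.default_rng(3)
f_obs = lambda a: (a[0] - 1) ** 2 + (a[1] + 2) ** 2 + 0.1 * rng.normal()
print(increment_ratio_minimize(f_obs, a1=[0.0, 0.0]))
```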
Figure 8.8. System with an adjustable model.
A wide class of intelligent information systems using the recurrences are systems with an adjustable model, shown in Figure 8.8. The great advantage of those systems is that they adjust the parameters of the model so that it mimics as exactly as possible the performance of the real system, even if the structure of the model does not match exactly the structure of the real object. Systems with an adjustable model are particularly useful for information processing systems with a training cycle or for systems with feedback information. In particular, they can be used to produce a model of a communication channel that can be used to adjust the rules of operation of the receiver and/or the transmitter. Since in information compression systems the primary information and the compressed information are available at the same place, the considered recurrences can be used to optimize those systems. In particular, when quantization is realized
by a next neighbor transformation, then not only the reference patterns but also the distance function can be optimized. The latter optimization can also be achieved by introducing a preliminary transformation (see (8.1.12)) and optimizing it.
The presented recurrences producing trains also suggest heuristic procedures for searching for favourable solutions in situations when statistical regularities are not taken into account and convergence cannot be checked. Examples of such heuristic procedures are several procedures for the adjustment of neural networks (see, e.g., Zurada [8.14], Haykin [8.15]) and genetic algorithms (see, e.g., Goldberg [8.16], Soucek [8.17]).
8.3 OPTIMAL RECOVERY OF DISCRETE INFORMATION
In this and in the next section we consider the optimal recovery of information when the working information and all indeterminate factors influencing the information processing exhibit statistical properties and exact statistical information about them is available. We also assume that the statistical average of an indicator of performance in a concrete situation is used as the criterion. We show first that on those assumptions it is possible to derive the general form of the optimal information processing system. We call it a statistically optimal system. There is a great variety of such specific systems.
There are three reasons why we concentrate on statistically optimal systems. The first is that, as explained in Chapters 4 and 5, the assumption of the existence of statistical regularities is often well justified. If they exist, disregarding them obviously deteriorates the performance of an information system, and it is usually possible to build a subsystem of the information system acquiring information about the statistical regularities. The second reason for interest in statistically optimal systems is that the results obtained for probabilities can be directly used for the very wide class of systems for which only the frequencies of occurrences are available. The methods for this have been discussed in Sections 1.7.2, 6.2.1, and 7.5.1. The last but not least reason is that statistically optimal systems suggest useful solutions in many cases when the existence of statistical regularities cannot be proved.
We present here the most important and typical special forms of such systems for the recovery of discrete and continuous information. We show that most of the information recovery systems which we introduced in the previous chapters using heuristic arguments are, on quite general assumptions, statistically optimal systems. In particular, the next neighbor transformations considered in the previous chapters are on general assumptions statistically optimal systems.
Since the information recovery is the last transformation performed by the information system, the recovered information may also be called the decision of the information system about the primary information (briefly, the decision), and the transformation of the available information into the recovered information considered as a whole may be called the decision rule. Therefore, we use here alternatively the terminology of decision theory.
In Section 8.3.1 the general statistically optimal rule of ultimate information recovery (performed by the last subsystem of the prototype information system shown in Figure 1.2) is derived.
As an application and illustration of the general result obtained in Section 8.3.1, the structures of optimal systems recovering discrete information are derived in Section 8.3.2. First the optimization of the information recovery in the basic system having the chain structure shown in Figure 1.2 is considered. In the second part the systems using feedback information are discussed. Section 8.3.3 presents the methods of calculating the performance of the optimal recovery rules and discusses the distortion versus cost trade-off relationships for the optimal information recovery rules derived in Section 8.3.2.
8.3.1 GENERAL SOLUTION OF THE OPTIMIZATION PROBLEM
We assume that all indeterminate factors influencing the information processing exhibit statistical regularities and exact statistical information x_STAT is available. We take the statistical averaging as the dependency removing operation (see our discussion in Section 8.1), and the performance criterion is defined as
Q[X*(·)] = E q[X, X*(R)],    (8.3.1)
where X*(·) is the recovering transformation, q(·,·) is a performance indicator in a concrete situation, and X and R are random variables (processes) representing the primary, respectively the available, information. The optimization problem
OP X*(·) Q | C_m, m = 2, 3, …, M, C_TECH, x_STAT    (8.3.2)
is called the Bayes optimization problem; C_m respectively C_TECH denote the parametric respectively technical constraints (see Section 1.6.2). We now show that the method used in Section 7.5.1 to derive the optimal rule of recovering the primary information from its quantized presentation can be generalized. From the formula (4.4.23) for conditional averages we have
Q[X*(·)] = E q(X, X*) = E q[X, X*(R)] = E { E_{X|r} q[X, X*(R)] }.    (8.3.3)
We write this in the form
Q[X*(·)] = E Q[X*(R), R],    (8.3.4)
where
Q(x*, r) = E_{X|r} q(X, x*),    (8.3.5)
x* ∈ 𝒳; 𝒳 is the set of potential forms both of the primary and of the recovered information, and E_{X|r} is the operation of conditional statistical averaging on the condition that the received information r is given. Let us assume that the available information r is fixed. Since we consider various decision rules, the decision X*(r) can be considered as a variable which can take any potential form of the primary information. Therefore, for a given r we can consider Q(x*, r) as a function of the variable x*, and we may look for the x* which minimizes Q(x*, r). This x*₀ usually depends on r. Therefore, we write it in the form
x*₀ = X*₀(r).    (8.3.6)
Since for each r we minimize Q(x*, r) independently of the other r's, we minimize the overall average Q[X*(·)], and the assignment r → x*₀ is the transformation which is the solution of the considered optimization problem (8.3.2). Thus we come to the fundamental conclusion that the optimal rule is:
For the available information r and for every potential form of the primary information x, using (8.3.5) calculate the conditional performance indicator Q(x*, r). Consider it as a function of x*, find the potential form x*₀ for which Q(x*, r) achieves the minimum value, and take x*₀ as the recovered information.    (8.3.7)
We call this the rule (transformation) of best conditional performance.
The point of maximum of the function f(x) is the point of maximum of the function φ₁[f(x)], where φ₁(w) is a strictly increasing function, or it is the point of minimum of the function φ₂[f(x)], where φ₂(w) is a strictly decreasing function. Therefore, instead of searching for the point of minimum of Q(x, r) we may look for the point of minimum of the function
w(x, r) = φ₁[Q(x, r)],  x ∈ 𝒳,    (8.3.8a)
or a point of maximum of the function
u(x, r) = φ₂[Q(x, r)],  x ∈ 𝒳,    (8.3.8b)
where φ₁(·) is a strictly increasing function and φ₂(·) is a strictly decreasing function. The function u(x, r) has the meaning of a weight associated with each potential form of information, on which the optimal decisions are based; the function φ(·) is called the weight producing function. The weight w(x, r) produced by an increasing function φ₁(·) has the meaning of a negative decision weight (the smaller it is, the better), and the weight produced by a decreasing function φ₂(·) has the meaning of a positive decision weight (the larger it is, the better); we call both types briefly decision weights. The decision weight u(x, r) is a function of the available information r which is relevant for the decisions about a concrete potential form x of the working information. To emphasize this, we also call the decision weight the concrete decision information. The set
u(r) ≡ {u(x, r); x ∈ 𝒳},    (8.3.9)
considered as a function of x, has the meaning of a function of the available information which is relevant for making an optimal decision about the working information. Therefore, we call u(r) the primary decision information. The best conditional performance rule (8.3.7) is equivalent to the rule:
R1. For a given available information r and for every potential form of the primary information x calculate, as in rule (8.3.7), the conditional average performance Q(x, r);
R2. Taking into account the character of the dependence of Q on x, introduce a weight producing function φ(·) such that it is easier to evaluate the decision weight u(x, r) than Q(x, r);    (8.3.10)
R3. In the set 𝒳 of all potential forms of the primary information find the potential form x₀ with the smallest negative (largest positive) decision weight and take it as the recovered information.
The block diagram of the system implementing this rule is shown in Figure 8.9a.
Figure 8.9. The rule of best conditional performance: (a) based on the decision information u(r) = {u(x, r); x ∈ 𝒳}, (b) calculation of a concrete decision information u(x, r) using the current u_c(x, r) and a priori u_a(x) decision information; x_STAT(R, X) — exact statistical information about the joint statistical properties of the working and available information, q(·,·) — the indicator of performance in a concrete situation.
We show in forthcoming examples that in many cases the concrete decision information has the form
u(x, r) = Γ[u_c(x, r), u_a(x)],    (8.3.11)
where Γ(·,·) is a function of two arguments, u_c(x, r) is a function depending both on the information r and on the potential information x, while u_a(x) depends only on the potential information x. Thus, u_c(x, r) can be evaluated only after the current information r has arrived, while u_a(x) can be calculated before the operation of the system starts. Therefore, we call:
u_c(x, r) — the current decision information about the potential working information x,
u_a(x) — the a priori decision information about the potential working information x,
u_c(r) ≡ {u_c(x, r); x ∈ 𝒳} (respectively u_a ≡ {u_a(x); x ∈ 𝒳}) — the current (respectively a priori) decision information.
The set {u_c(r), u_a} determines in a unique way the primary decision information u(r) given by (8.3.9). We call this set the secondary decision information; it has the meaning of a presentation of the primary decision information. Those definitions are illustrated in Figure 8.9b. Several concrete examples of the decision information are given in subsequent sections.
COMMENT 1
The primary decision information u(r) is a typical example of an application of the general definition (1.1.1) of information.
The primary decision information u(r) has as many elements as the set of potential forms of the working information. Thus, in the sense of definition (6.1.12) or (6.6.26), the decision information has the same minimal volume as the working information. Usually the available information r has a larger volume than the working information. Then the transformation r → u(r) is a volume-compressing transformation. When the working information is discrete and the available information is continuous, the compression is dramatic. The compression is possible because in general the transformation r → u(r) is non-reversible. In particular, knowing the decision information u(r) we cannot always calculate the conditional probability distribution p(x|r). Therefore, in general
I[X; u(R)] ≤ I[X; R].    (8.3.12)
8.3.2 STRUCTURES OF OPTIMAL SYSTEMS RECOVERING DISCRETE INFORMATION
We assume now that the primary information is discrete, taking one of the potential forms x_l, l = 1, 2, …, L. Then the conditional performance indicator (8.3.5) takes the form
Q(x_m, r) = E_{X|r} q(X, x_m) = Σ_{l=1}^{L} q(x_l, x_m) P(X = x_l | R = r).    (8.3.13)
The rule (8.3.7) of the best conditional performance takes the form:
To an available information r assign the primary information x_l ∈ 𝒳 which minimizes the conditional performance indicator Q(x_l, r), l = 1, 2, …, L.    (8.3.14)
We denote:
P(x_l | r) = P(X = x_l | R = r),    (8.3.15a)
P(x | r) = [P(x_l | r), l = 1, 2, …, L]    (8.3.15b)
— the column matrix of conditional probabilities,
q = [q(x_l, x_m), l, m = 1, 2, …, L]    (8.3.16a)
— the square matrix of performance indices in a concrete situation,
Q(x, r) = [Q(x_l, r), l = 1, 2, …, L]    (8.3.16b)
— the column matrix of conditional average performance indices, and we write (8.3.13) in the form
Q(x, r) = q P(x | r).    (8.3.17)
From equations (4.4.7) we have
P(x_l | r_k) = P(X = x_l | R = r_k) = C P(R = r_k | X = x_l) P(X = x_l),    (8.3.18a)
P(x_l | r) = P(X = x_l | R = r) = C p(r | X = x_l) P(X = x_l),    (8.3.18b)
where the probability distributions P(R = r_k | X = x_l) respectively p(r | X = x_l) describe the transformation generating the available information r. Thus, to calculate P(x_l | r) we need: (1) the exact information x_STAT(X) about the statistical properties of the information source, and (2) the exact information x_STAT(R|X) about the transformation x → r; the probabilities describing those transformations have been discussed in Section 5.4. Having the statistical information x_STAT(X) and x_STAT(R|X), we find the rule P_X|R(·|·) of calculating, for given r ∈ ℛ and x ∈ 𝒳, the conditional probability P(x_l | r). Similarly, to obtain the values of the matrix q of performance indices we need the relevant information x_SUP about the features of the superior system. Our argumentation is illustrated in Figure 8.10a.
For discrete information the decision information defined by (8.3.8) is an L-dimensional vector information
u(r) = {u(x_l, r), l = 1, 2, …, L},    (8.3.19)
and the general diagram of the best conditional performance rule shown in Figure 8.10a takes the simple form shown in Figure 8.10b. In the binary case (L = 2) searching for the maximum reduces to checking the sign of the difference between the two weights, e.g., of
u_b(r) = u(x_1, r) − u(x_2, r).    (8.3.20)
We call this difference the binary decision information. Let us suppose that the decision weights are positive. Then the best conditional performance rule (8.3.10) based on decision weights takes the form
x* = x_1 if u_b(r) > 0,  x* = x_2 if u_b(r) < 0,    (8.3.21a)
or equivalently
x* = x_th[u_b(r)],    (8.3.21b)
where
x_th(u) = x_1 if u > 0,  x_2 if u < 0    (8.3.21c)
is a threshold function. This rule can be implemented in the system shown in Figure 8.10c.
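A minimal sketch of the rule (8.3.14) in the matrix form (8.3.17); the loss matrix and the conditional probabilities below are purely illustrative numbers.

```python
import numpy as np

def best_conditional_performance(q, P_x_given_r):
    """Q(x, r) = q P(x|r) (equation (8.3.17)); return the index of the minimizing x_l.
    Rows of q index the decision and columns the true form; for the symmetric
    loss used here this coincides with the matrix [q(x_l, x_m)] of (8.3.16a)."""
    Q = q @ P_x_given_r
    return int(np.argmin(Q)), Q

# Illustrative performance matrix q(x_l, x_m) and conditional probabilities P(X = x_l | R = r).
q = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
P = np.array([0.2, 0.5, 0.3])
l_opt, Q = best_conditional_performance(q, P)
print(l_opt, Q)   # a symmetric 0-1 loss would reduce this to the MAP rule (8.3.26)
```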
Figure 8.10. The implementation of the best conditional performance rule of discrete information recovery: (a) based directly on the conditional performance, (b) based on decision weights (a special case of the transformation shown in (a)), (c) for binary information; x_STAT(X), x_STAT(R|X) — exact statistical information about the a priori and conditional probability distributions, x_SUP — information about the properties of the superior system relevant for the choice of the performance indicator, (d) the maximum conditional probability rule.
8.3 Optimal Recovery of Information
415
In particular, the binary decision information can be presented in the form Wb(r)=wjr)+Wba
(8.3.23a)
where ujj-)=^uj,x^, r)'U,(x2, r) , Uy,^=^u^{x;)-uJ,X2) Then the optimal binary decision rule (8.3.21) takes the form ^i(r)=^
(8.3.23b)
(8.3.24a) "^X2lfWbe(^)<Wth
with «th=«ba (8.3.24b) As indicated in Section 8.1.1, for many superior systems the indicator of performance in a concrete situation q(x, x*) is symmetric, given by (8.1.2). Then the indicator of performance of the information transformation as a whole is the probability of error V(X7^T) ( see (8.1.40)). Substituting (8.1.2) in (8.3.13) we get G(x,, r) = l-/>(5S=JC,|I^=r)
(8.3.25)
From this it follows that the conditional probability P(X=Jc^|IR=r) is the positive weight function and the rule (8.3.14) of the best conditional performance based on weight functions takes the form Assign to a given available information r the information x^ G Xfor which the conditional probability P(X=JcJI^=r) considered as (8.3.26) function of I achieves its maximum. This rule is called maximum conditional probability rule (also maximum a posteriori probability rule). The system implementing this rule is shown in Figure 8. lOd. For binary information the rule simplifies to threshold rule like (8.3.24) and can be realized by a system as shown in Figure 8.10c, COMMENT 1 The derived optimal system is a concrete example of the system shown in Figure 1.24. In Figures 8.10 and 8.11 we emphasize the role of meta information. If it is not available we can add a hierarchically higher subsystem acquiring such an information so, that the whole system operates as an intelligent system shown in Figure 1.25. In particular, as an inexact information about the probability distributions/7(IR=r 13S=jC/) and P(X=X/) we can take the corresponding frequencies of occurrences obtained during a training cycle. Another possibility is to: (1) take a standard probability distribution which can be considered as a reasonable approximation of the real distributions (in particular for probability distribution of primary information we may use one of the distributions described in Section 4.5, and for the conditional distribution describing the fundamental information processing the prototype statistical relationships presented in Section 5.2)), (2) leave some parameters of the standard distributions free, (3) find the rule of optimal information processing for the standard distributions, (4) use the recurrent procedures described in Section 8.2 to find the values of the free parameters maximizing the performance in the class of rules which are optimal for the hypothetical probability distributions.
EXAMPLE 8.3.1 OPTIMAL RECOVERY OF DISCRETE INFORMATION USING VECTOR INFORMATION; DETERMINISTIC NOISELESS SIGNALS
We consider the communication system shown in Figure 1.3 and described in more detail in Section 2.1.1. We assume:
A1. The working information x_l is discrete, l = 1, 2, …, L;
A2. The available information r and its noiseless component are N-dimensional vector information: r = {r(n), n = 1, 2, …, N}, w(x) = {w(x, n), n = 1, 2, …, N};
A3. The transformation performed by the channel is described in Section 5.4.2 and characterized by the conditional probability (5.4.10).
The needed probability p(r | x_l) occurring in (8.3.18) we obtain from equation (5.4.10). After substituting v → r, s → x_l, w(s, n) → w(x_l, n), we get
P(X = x_l | R = r) = C P(x_l) exp{ −[1/(2σ²)] Σ_{n=1}^{N} [r(n) − w(x_l, n)]² }.    (8.3.27)
This equation shows that the conditional probability depends on the available information r only through the sum in the exponent. To present this dependence in a simpler form we take the logarithm of both sides of (8.3.27) and multiply the result by 2σ². We get
2σ² ln P(X = x_l | R = r) = 2σ² ln C + 2σ² ln P(x_l) − Σ_{n=1}^{N} [r(n) − w(x_l, n)]²
 = 2σ² ln C + 2σ² ln P(x_l) − Σ_{n=1}^{N} r²(n) + 2Σ_{n=1}^{N} r(n)w(x_l, n) − Σ_{n=1}^{N} w²(x_l, n).    (8.3.28)
Since the terms not depending on x_l are not relevant for decisions about the primary information, as the concrete decision information (decision weight) we take
u′(x_l, r) = 2σ² ln P(x_l) + 2Σ_{n=1}^{N} r(n)w(x_l, n) − Σ_{n=1}^{N} w²(x_l, n).    (8.3.29)
where
u',{xi, r)=w,(jc,, r)+M3(X;)
(8.3.30)
^ Wc(JC/, r) = 2^ r(n)w{Xi, n)
(8.3.31a)
and A^
u,(xi) =2(r\nP(Xi)-2 J^ ^\x^,n) (8.3.31b) From equations (8.3.31) it follows that the decision information depends only on the rough description of the statistical information through statistical parameters entering in (8.3.31b). However we have to remember that the form of this information results from the assumption that the noise has a gaussian probability distribution. Comparing (8.3.30b) and the definition (7.1.8) we see that the current decision information can be produced by systems shown in Figure 7.1 in particular, by a matched filter. This greatly simplifies the implementation of the optimal system shown in Figure 8.10b.
Often for symmetry reasons it can be assumed that
P(x_l) = 1/L = const.    (8.3.32)
Then, also for symmetry reasons, the transmitted signals are chosen so that
E(x_l) = E = const,    (8.3.33a)
where
E(x_l) ≡ Σ_{n=1}^{N} w²(x_l, n)    (8.3.33b)
is the energy of the noiseless signal (see (7.4.1)). The assumptions (8.3.32) and (8.3.33) are called the symmetry assumptions. When those assumptions are satisfied, then from (8.3.31b) it follows that the a priori decision information does not depend on the primary information. Therefore, in the symmetrical case we take
u_a(x_l) = 0, ∀l.    (8.3.34)
The choice of the decision weight is not unique. Another choice than (8.3.29) would be to take the negative decision weight
u″(x_l, r) = −2σ²{ln P(X = x_l | R = r) − ln C} = d²[r, w(x_l)] − 2σ² ln P(x_l),    (8.3.35)
where d[r, w(x_l)] is the Euclidean distance defined by (1.4.8). The transformation realizing the maximum conditional probability rule (see Figure 8.10c) would become a generalized NNT. When the potential forms of information are equiprobable, thus (8.3.32) holds, the natural choice of the negative decision weight would be
u‴(x_l, r) ≡ d[r, w(x_l)].    (8.3.36)
Thus,
On assumptions A1-A3 and (8.3.32) the maximum conditional probability rule is a next neighbour transformation using the Euclidean distance and the noiseless signals as reference patterns.    (8.3.37)
In the binary case the optimal decision rule is given by (8.3.24) and can be realized by the system shown in Figure 8.10c. From equations (8.3.23) and (8.3.30) we obtain the current and a priori binary decision information
u_bc(r) = 2Σ_{n=1}^{N} r(n)[w(x_1, n) − w(x_2, n)],    (8.3.38a)
u_ba = u_a(x_1) − u_a(x_2),    (8.3.38b)
where u_a(x_l), l = 1, 2 is given by (8.3.31b). Thus, in the binary case the maximum conditional probability rule can be implemented by a single matched filter matched to the difference of the noiseless signals and a threshold device with the threshold given by equation (8.3.24b). From (8.3.31b) it follows that in the symmetric case we have to take in the optimal decision rule (8.3.24a) the threshold
u_th = 0.    (8.3.39)
□
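A minimal numerical sketch of Example 8.3.1: the decision weights (8.3.29) for vector observations in Gaussian noise, and the equivalent nearest-neighbour rule (8.3.37) for the equiprobable case; the signal set and noise level are illustrative assumptions.

```python
import numpy as np

def map_receiver(r, W, priors, sigma):
    """Decision weights (8.3.29):
    2*sigma^2*ln P(x_l) + 2*sum_n r(n)w(x_l,n) - sum_n w(x_l,n)^2."""
    weights = 2 * sigma**2 * np.log(priors) + 2 * (W @ r) - np.sum(W**2, axis=1)
    return int(np.argmax(weights))

def nearest_neighbour(r, W):
    """Rule (8.3.37): minimal Euclidean distance to the noiseless signals."""
    return int(np.argmin(np.linalg.norm(W - r, axis=1)))

# Illustrative equiprobable binary signals in Gaussian noise.
rng = np.random.default_rng(5)
W = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0, -1.0]])         # noiseless signals w(x_l, n)
priors = np.array([0.5, 0.5])
sigma = 0.7
l_true = 1
r = W[l_true] + sigma * rng.normal(size=4)     # available information
print(map_receiver(r, W, priors, sigma), nearest_neighbour(r, W))  # both usually recover l_true
```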
We present now the time-continuous modification of the previous example.
EXAMPLE 8.3.2 OPTIMAL RECOVERY OF DISCRETE INFORMATION USING FUNCTION INFORMATION; DETERMINISTIC NOISELESS SIGNALS
We assume:
A1. The working information x_l is discrete, l = 1, 2, …, L;
A2. The available information and its noiseless component are time-continuous processes r(t), w(x_l, t), t ∈ ⟨t_a, t_b⟩.
Proceeding as in Example 8.3.1, the counterpart of the current decision information (8.3.31a) is
u_c[x_l, r(·)] = ∫_{t_a}^{t_b} r(t)w(x_l, t) dt.    (8.3.41)
Similarly, as a counterpart of (8.3.38a), the current binary decision information is
u_bc[r(·)] = ∫_{t_a}^{t_b} r(t)[w(x_1, t) − w(x_2, t)] dt.    (8.3.42)
In the next example we illustrate the problems arising when the noiseless signals are indeterministic. We consider only the time-continuous model, since the calculations for the corresponding time-discrete model would be very tedious.
EXAMPLE 8.3.3 OPTIMAL RECOVERY OF DISCRETE INFORMATION USING FUNCTION INFORMATION; INDETERMINISTIC NOISELESS SIGNALS
We make again assumptions A1 to A4 as in the previous example, but we assume that the noiseless signal is
w[x_l, (·), ψ] = A(x_l, t)cos(ω_c t + ψ), t ∈ ⟨t_a, t_b⟩,    (8.3.43)
and instead of A5 we assume that
A5′. The envelope A(x_l, t), t ∈ ⟨t_a, t_b⟩ is exactly known, while the phase ψ is indeterminate.
Then
P[x_l | r(·)] = C P(X = x_l) exp{−E(x_l)/S_z} I₀{χ[x_l, r(·)]/S_z},    (8.3.44a)
where
χ[x_l, r(·)] ≡ √( c_c²[x_l, r(·)] + c_s²[x_l, r(·)] ),    (8.3.44b)
c_c[x_l, r(·)] ≡ ∫_{t_a}^{t_b} r(t)A(x_l, t)cos(ω_c t) dt,  c_s[x_l, r(·)] ≡ ∫_{t_a}^{t_b} r(t)A(x_l, t)sin(ω_c t) dt,    (8.3.44c)
and E(x_l) is the energy of the noiseless signal. From (8.3.44) it follows that
u_c[x_l, r(·)] = χ[x_l, r(·)]    (8.3.45)
can be used as the current decision information. In general we have to calculate the decision information from equation (8.3.11), with the function Γ(·,·) determined by equation (8.3.44), and to implement the rule of maximum conditional probability as shown in Figure 8.10b. However, since the function I₀(w) is an increasing function, in the binary symmetrical case (when (8.3.32) and (8.3.33) hold)
u_bc[r(·)] = χ[x_1, r(·)] − χ[x_2, r(·)]    (8.3.46)
is the binary current decision weight and the implementation of the rule (8.3.23) simplifies. □
EXAMPLE 8.3.4 OPTIMAL RECOVERY OF A BLOCK TRANSMITTED THROUGH A BINARY CHANNEL
We consider the transmission of blocks of binary information as a whole, described in Section 2.1.2 and illustrated with Figure 2.7. We assume:
A1. The working information is a block x_l = {x(l, n)}, l = 1, 2, …, L, where x(l, n) are the elementary binary components;
A2. The transmitted information is the block (code word, see Section 2.1.2) w(x_l) = {w(l, n), n = 1, 2, …, N}, l = 1, 2, …, L of N pieces of binary information w(l, n); the coding x_l → w_l is a reversible transformation;
A3. The available information is also a block of N pieces of binary information: r = {r(n), n = 1, 2, …, N};
A4. The channel is a binary channel (see Section 2.1.1), the elementary piece of the available information ^ r(n) = w(/, n)®z{n) (8.3.47) A5. The probability distribution of the primary information is uniform: F(jC/) = l/L=const A6. The random variables 2(n), « = 1, 2, .., A^ representing the binary components of noise are statistically independent, have the same probability distribution characterized by Pb=P[s(n) = l ] . From assumption A6 and repeated using of the equation (4.4.8a) it follows that P ( E = r I X=x;) = />^^//[>^(^/)''-](l -P^f-dnly^^x;) ,r] (8 3 48) where d^{w{x^, r] is the Hamming distance defined by (2.1.36).
Substituting (8.3.48) in (8.3.18a) and using assumption A5 we get

P(X = x_l | R = r) = c P_b^{d_H[w(x_l), r]} (1 − P_b)^{N − d_H[w(x_l), r]}     (8.3.49)
Taking into account that 0 < P_b < 1/2, we see that the conditional probability (8.3.49) is a decreasing function of the Hamming distance d_H[w(x_l), r].
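A minimal sketch of decoding under the assumptions of Example 8.3.4: with equiprobable code words and independent binary errors with P_b < 1/2, maximizing the conditional probability (8.3.49) amounts to choosing the code word with the smallest Hamming distance to r. The code book below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical code book: L = 4 code words w(x_l) of length N = 7
codebook = np.array([[0, 0, 0, 0, 0, 0, 0],
                     [1, 1, 1, 0, 0, 0, 0],
                     [0, 0, 1, 1, 1, 0, 1],
                     [1, 1, 0, 1, 1, 0, 1]])

def decode(r, codebook):
    """Maximum conditional probability decoding implemented as a
    nearest-neighbor transformation with the Hamming distance."""
    d = np.sum(codebook != r, axis=1)     # Hamming distances d_H[w(x_l), r]
    return int(np.argmin(d))              # index l of the recovered information

p_b = 0.1                                 # binary error probability of the channel
sent = codebook[1]
errors = (rng.random(sent.size) < p_b).astype(int)   # z(n) in (8.3.47)
r = sent ^ errors                                    # r(n) = w(l, n) XOR z(n)
print("decoded index:", decode(r, codebook))
```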
8.3.3 STRUCTURES OF SUBSYSTEMS RECOVERING DISCRETE INFORMATION IN A SYSTEM WITH FEEDBACK
Simple examples of communication systems using auxiliary information about the information delivered by the communication channel have been described in Sections 2.1.1 and 2.1.2 and shown in Figures 2.4, 2.5, and 2.9. We now describe such systems in more detail and discuss their optimization. Although we use the terminology of communication systems, our considerations apply directly to other information systems, such as systems storing or simplifying information.

The feedback system is characterized by the type of cooperation between the receiver and the transmitter. Typically we transmit a train of elementary pieces of primary information, and in the simplest case the processing of the next piece of information starts when the processing of the previous one is completed. Such a system is characterized by the type of:
1) auxiliary information used by the transmitter to make the decision about forwarding to the receiver additional information about the currently processed elementary piece of working information; it is called feedback information;
2) such additional information about the working information furnished by the transmitter; it is called retransmitted information;
3) the rule by which the receiver makes the ultimate decision about the current piece of the working information on the basis of all information already available.

The choice of concrete rules of feedback system operation depends on the features of the forward and feedback channels. When the capacity of the forward channel is small, the natural choice is to deliver from the receiver to the transmitter feedback information of possibly small volume. In the extreme case this information is binary and has the meaning of the ultimate decision commanding the transmitter to forward the next retransmission. Such a feedback system is traditionally called decision feedback. The systems described in Sections 2.1.1 and 2.1.2 are of this type.

The first transmission of a primary piece of information and the following retransmissions we call the total transmission. The operation of the feedback system can be considered as controlling the total transmission by feedback information, so that the effects of distortions caused by the channel are diminished. In a typical information system the quality of the fundamental information processing subsystem, in particular of the communication channel, is so high that in most situations no retransmissions besides the first transmission are required. Then it is natural to send the additional information about the working information only when it is needed. This makes the length of the total transmission variable. The varying length of a total transmission generates additional problems. First, segmenting information (see Section 2.6) must be built in, so that it is possible to separate, in the train of signals at the output of the forward channel, the total transmission of an elementary piece of working information. Second, the variable length introduces arrhythmia and requires some buffering.
In spite of those deficiencies the feedback systems with variable length of total transmissions are most frequently used. Therefore, we concentrate here on such systems. However in the last part of this section we describe a feedback system with total transmissions having fixed length but using the feedback information to control the shape of the total transmission.
FEEDBACK SYSTEMS CONTROLLING THE LENGTH OF THE TOTAL TRANSMISSION
When it causes no confusion, in describing the feedback systems that use feedback information to control the length of the total transmission we use the shorter term "transmission" instead of "retransmission". We denote:
x_tr = {x(n), n = 1, 2, ..., N} the train of pieces of the working information,
w[x(n), i], i = 1, 2, ... the ith transmission of the nth piece of working information (w[x(n), 1] is the first transmission, the transmissions i = 2, 3, ... are retransmissions),
W[x(n), j] = {w[x(n), i], i = 1, 2, ..., j} the train of all already produced transmissions of the information x(n) (see Figure 8.11),
r(n, i) the signal at the channel output produced by the transmission w[x(n), i]; we call it the information available about the ith transmission,
R(n, j) = {r(n, i), i = 1, 2, ..., j} the train of all pieces of information available after j transmissions.
This notation is illustrated in Figure 8.11.
Figure 8.11. Illustration of the notation for transmissions carrying the working information x(n) in a feedback system; the length of the total transmission j = 3.
The processing of the information obtained at the channel output is done in two steps. First a decision about distortions of the available information is taken; we denote it y_f. This decision is forwarded to the transmitter of the working information; therefore we call y_f alternatively feedback information. In the simplest case this decision is binary and its potential forms are:
y+ - the distortions are so small that on the basis of the available transmissions an ultimate decision about the working information can be made;
y- - the distortions are so large that an additional retransmission is requested.
When the decision y+ has been taken, a potential form x_l of the working information is produced as the decision about the actually transmitted working information. When the decision y- has been taken, the decision about the working information is postponed; therefore the decision y- is called the disqualifying decision. The feedback information y_f is also delivered to the transmitter. When it obtains the y+ decision, it starts to transmit the next elementary piece of working information. When y- is obtained, the next retransmission of the actually processed information is sent.
Let us denote by y(n, j) the distortion decision produced after j transmissions of the nth elementary piece of working information. In general, the decision y(n, j) can be based on all available information R(n, j) about x(n), and the rule of making such a decision may depend on the number j of available transmissions. However, to simplify the implementation, in most systems used in practice the feedback decision after j transmissions is based only on the last transmission, i.e.

y(n, j) = Y[r(n, j), j]     (8.3.52)

We call such a rule of making the distortion decision memoryless. If it did not depend on the number j of transmissions, infinitely many retransmissions would be possible. Therefore, in most practical systems it is assumed that for all j not smaller than some maximal number of transmissions the decision y+ is taken.
Figure 8.12. Aggregation sets of the composite rule X_c(·) = {Y(·), X*(·)} of operation of a feedback system; the aggregation sets U_l, l = 1, 2, ..., L correspond to the ultimate decisions x_l about the working information, l = 1, 2, ..., L, while the set U_- corresponds to the disqualifying decision of the composite rule X_c(·).
To formulate the problem of optimization of the composite rule we must introduce indicators of performance for both component rules. The indicator characterizing the performance of the rule of making ultimate decisions is the same as in the case of the previously considered system without feedback. We assume that the performance indicator in a concrete situation q(x_l, x*) is symmetric, given by (8.1.2). Then the criterion is the error probability

Q[X*(·)] = P[X*(R) ≠ X]     (8.3.56)

Every retransmission requires additional resources for processing a piece of working information; in particular, it needs more feedback channel capacity. It can be expected that on average those resources are an increasing function of the probability of making the disqualifying decision. Therefore, as an indicator of costs characterizing the system with feedback we take the probability of making the disqualifying decision

Q_-[Y(·)] = P(Y = y_-)     (8.3.57)
The typical problem of optimization of the composite information processing rule is

OP {Y(·), X*(·)}, Q: minimize Q[X*(·)] subject to Q_-[Y(·)] = const     (8.3.58)

The counterpart of the criterion (8.2.11b) (Lagrange function), used to solve optimization problems with equality constraints, is the auxiliary criterion

Q_a[X*(·), Y(·)] = Q[X*(·)] + λ Q_-[Y(·)]     (8.3.59)
where the parameter λ is a counterpart of the Lagrange multiplier in (8.2.11b). To find the solution of OP {Y(·), X*(·)}, Q_a we introduce the auxiliary indicator of performance in a concrete situation

q_y(x_l, y_f) = λ for y_f = y_-, l = 1, 2, ..., L;  q_y(x_l, y_f) = 0 for y_f = y_+, l = 1, 2, ..., L     (8.3.60)

Using this we write (8.3.59) in the form

Q_a[X*(·), Y(·)] = E q[X, X*(R)] + E q_y[X, Y(R)]     (8.3.61)
Arguing as in the derivation of rule (8.3.7) we conclude that the best conditional performance rule is:

To a given available information r assign the potential decision x* ∈ {x_l, l = 1, 2, ..., L; y_-} which minimizes the conditional performance indicators Q(x_l, r), l = 1, 2, ..., L, and Q(y_-, r)     (8.3.62)

where

Q(x_l, r) = E_{X|r} q(X, x_l) = Σ_{m=1}^{L} q(x_m, x_l) P(x_m|r) = 1 − P(x_l|r)     (8.3.63a)

Q(y_-, r) = E_{X|r} q_y(X, y_-) = Σ_{m=1}^{L} q_y(x_m, y_-) P(x_m|r) = λ     (8.3.63b)

and

P(x_l|r) = P(X = x_l | R = r).     (8.3.63c)
From the general rule (8.3.62) it follows that the feedback decision is y_- if

Q(x_l, r) > Q(y_-, r), l = 1, 2, ..., L     (8.3.64)

From (8.3.63) we see that this is equivalent to the condition

P_max(x|r) < P_th, where P_max(x|r) = max_l P(x_l|r)     (8.3.65)

P_th = 1 − λ     (8.3.66)
Thus, the optimal rules for making the decision about distortions (equivalently, asking for a retransmission) and for recovering the working information are:

If the largest conditional probability P_max(x|r) is smaller than the threshold P_th, we take the decision y_- (request for a retransmission); if the largest conditional probability P_max(x|r) is larger than the threshold P_th, we take the decision y+ (no more retransmissions needed), and as the recovered working information we take the information x_l with the maximal conditional probability.     (8.3.67)

The system realizing this composite rule is shown in Figure 8.13.
(Block diagram of Figure 8.13; panels: calculation of the conditional probabilities P(x_l|r), l = 1, 2, ..., L; finding the point of maximum; finding the maximum value P_max(x|r); threshold device with threshold P_th; the feedback (distortion) information is sent to the transmitter.)
Figure 8.13. The system implementing the composite optimal feedback information generation and working information recovery, minimizing the probability of error for a fixed probability of a retransmission, based on conditional probabilities.
Since the decisions of the optimal rule (8.3.19) are based on comparisons of conditional probabilities, the rule can be implemented using, instead of the conditional probabilities, the decision weights

u(x_l, r) = φ[P(x_l|r)]     (8.3.68)

where φ(·) is an increasing or decreasing function.
COMMENT 1
The performance indicators are interrelated. The probability of errors depends on the shape of the aggregation sets U_l, l = 1, 2, ..., L corresponding to the ultimate decisions. The set U_- corresponding to the decision y_- acts as a "buffer zone".
The larger this zone is, the smaller is the probability that the point representing the available information carrying a given working information will be pushed by channel distortions into an aggregation set corresponding to another potential form of information. Every decision y_- causes a new retransmission. This increases the redundancy which the feedback mechanism builds into the signal carrying a piece of the working information. The redundancy can be used to decrease the probability of error of the optimized ultimate decisions. Thus, increasing the "size" of the set U_-, we can decrease the probability of errors of the recovered information.
COMMENT 2
When the negative decision weight has the meaning of the distance between the available information and the noiseless signal corresponding to a given potential form of information, the transformation realized by this system may be called an intelligent next neighbor transformation, asking for more information when no reference pattern is sufficiently close to the point representing the available information. For binary working information such a system takes the form shown in Figure 2.4. Thus, we have proved that the feedback systems described in Section 2.1.1 have an optimal character. Our argumentation allows us to choose the reference signals and the thresholds in a systematic way and, if necessary, to augment the system with subsystems acquiring the needed state information.
FEEDBACK SYSTEMS CONTROLLING THE SHAPE OF THE TOTAL TRANSMISSION
We assumed previously that the features of an elementary piece of distorted information are characterized only by binary auxiliary information. It can be expected that the performance of a feedback system can be enhanced if the auxiliary information can take more potential forms. Then the feedback information can provide the transmitter with more detailed information about the signal available at the receiver. We now describe a system in which the volume of feedback information is substantially larger than that of the primary information. To simplify the description we assume:
A1. The primary information is binary: x_l, l = 1, 2.
A2. The primary channel and the feedback channel can carry trains of 1 DIM continuous signals.
A3. A single transmission is a 1 DIM signal; the total number J of transmissions is fixed.
A4. The jth component of the available signal, j = 1, 2, ..., J, is r(j) = w(x_l, j) + z(j), where w(x_l, j) is the jth retransmission of the binary information x_l.
A5. After receiving the train r(j) = {r(i), i = 1, 2, ..., j}, the receiver calculates the average

w*(j) = Σ_{l=1}^{2} w(x_l, 1) P[X = x_l | R(j) = r(j)]
and transmits it as feedback information to the transmitter. A6. After receiving the total train r(j) the ultimate maximum conditional probability decision about the transmitted information is made.
A7. The first transmission is w(x_1, 1) = w_1, w(x_2, 1) = −w_1, where w_1 is a constant.
A8. A retransmission is the scaled difference between the first transmission w(x_l, 1) and the most recently obtained feedback information:

w(x_l, j+1) = A(j+1)[w(x_l, 1) − w*(j)]

where A(j), j = 1, 2, ..., J is a train of scaling coefficients.
The average w*(j) can be interpreted as an estimate of the first transmitted signal w(x_l, 1), and it is called the centre of gravity (of the points representing the potential transmitted signals). Therefore, the described system is called the centre of gravity feedback system.
COMMENT
The total feedback information is J DIM continuous vector information. Thus, its volume is substantially larger than the volume of the binary working information. This means that the system would require a feedback channel with a much larger capacity than the capacity of the forward channel. Since the number J of retransmissions must be large, the system would also introduce a very large delay. Therefore, from the practical point of view the system could be useful only in specific situations. We describe the centre of gravity feedback system here because it is an interesting example of delivering to the transmitter very detailed information about those features of the state of the environment of the communication system that are relevant for the transmission of the working information. This allows all features of the total transmission to be controlled optimally, while its length is fixed. The consequence is a very high efficiency of trading the energy of the total transmission for the probability of errors. We show in the last section (in Figure 8.18) that the system with centre of gravity feedback can be considered as a system using the capacity of the forward channel to the maximum. Only a sketchy description of the system has been presented here. Much research has been done on the overall optimization of the feedback system, in particular on the optimization of the train of coefficients A(j) determining the operation of the transmitter. For details see Schalkwijk, Kailath [8.15] and Omura [8.16].
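A simulation sketch of the centre of gravity system under assumptions A1-A8. The noise level and the scaling coefficients A(j) are illustrative assumptions (their optimal choice is discussed in the cited literature), and an ideal noiseless feedback channel is assumed.

```python
import numpy as np
from math import exp

rng = np.random.default_rng(3)

w1, sigma, J = 1.0, 1.0, 8
A = [2.0] * J                    # illustrative scaling coefficients A(j), j >= 2
first = {1: w1, 2: -w1}          # first transmissions w(x_l, 1) per A7

def run(x_true):
    post = {1: 0.5, 2: 0.5}      # P(X = x_l | received so far)
    w_star = 0.0                 # centre of gravity fed back to the transmitter
    for j in range(1, J + 1):
        # transmitter: first transmission, then scaled difference (A8)
        tx = {l: first[l] if j == 1 else A[j - 1] * (first[l] - w_star) for l in (1, 2)}
        r = tx[x_true] + sigma * rng.standard_normal()
        # receiver: Bayes update of the posterior (Gaussian noise per A4)
        lik = {l: exp(-(r - tx[l]) ** 2 / (2 * sigma ** 2)) for l in (1, 2)}
        z = sum(post[l] * lik[l] for l in (1, 2))
        post = {l: post[l] * lik[l] / z for l in (1, 2)}
        # centre of gravity (A5), sent back over the feedback channel
        w_star = sum(first[l] * post[l] for l in (1, 2))
    return max(post, key=post.get)   # ultimate maximum conditional probability decision (A6)

print("decided:", run(x_true=1))
```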
8.4 PERFORMANCE OF OPTIMAL INFORMATION RECOVERY
Till this point we discussed methods of finding the optimal rule of information recovery. This is a special case of the problem of finding the point x_0 of the minimum/maximum of a function f(x), discussed in Section 8.2. From the point of view of the superior system, what is essential is the value of the indicator of performance of the optimized information processing rule. It corresponds to the minimum/maximum value f(x_0) of the considered function. The performance of the optimized ultimate information processing rule is also important for another reason. As indicated in Section 8.1, this performance is often an objective indicator of performance of the subsystems inside the basic information system, in particular of a subsystem implementing the preliminary information processing, such as compression or shaping of signals put into a communication channel.
We describe here the general method of calculating the performance indicators and discuss the trade-off relationships between distortion and cost indicators for the previously presented optimal rules of information recovery.
8.4.1 THE GENERAL METHOD OF CALCULATING THE STATISTICAL PERFORMANCE INDICES
We assume that the statistical average is used as the dependence removing operation to obtain the indicator of performance of the considered information processing rule. To calculate such an indicator we use again the equation (4.4.23) for conditional averages. However, contrary to the derivation of the optimal information recovery rule (8.3.3), we now consider the primary information as fixed. Thus, we use the equation

Q[X*(·)] = E { E{ q[X, X*(R)] | X } }     (8.4.1)

We write this in the form

Q[X*(·)] = E Q̄(X)     (8.4.2a)

where

Q̄(x_l) = E{ q[x_l, X*(R)] | X = x_l }     (8.4.2b)

and x_l, l = 1, 2, ..., L are the potential forms of the primary information. Since the random variable X is discrete,

Q[X*(·)] = Σ_{l=1}^{L} Q̄(x_l) P(x_l)     (8.4.3)
Similarly to the average Q(x|r) defined by (8.3.3), the average Q̄(x_l) has the meaning of a conditional performance indicator; however, the condition is now that not the available information r but the primary information is fixed. To distinguish between the two averages we call Q̄(x_l) the x-conditional performance indicator. From the definition of the average it follows that

Q̄(x_l) = Σ_{k=1}^{L} q(x_l, x_k) P(x_k|x_l)     (8.4.4a)

where

P(x_k|x_l) = P(X* = x_k | X = x_l), k, l = 1, 2, ..., L     (8.4.4b)

and X* = X*[R(x_l)] is the random variable representing the decision of the considered recovery rule, while the random variable (process) R(x_l) represents the available information on the condition that the primary information is x_l. We call the probabilities P(x_k|x_l) conditional decision probabilities. The calculation of the x-conditional performance indicator greatly simplifies when the distortion function is symmetric (given by (8.1.2)). Then the x-conditional performance indicator is the conditional probability of error and, similarly to (8.1.40), we have

Q̄(x_l) = P_e(x_l)     (8.4.5a)

where

P_e(x_l) = 1 − P(x_l|x_l)     (8.4.5b)
From these considerations it follows that the calculation of the statistical performance indicators reduces to the calculation of the conditional decision probabilities. We sketch now the method of such calculations. The initial information for the calculation of the conditional decision probabilities is the probability distribution of the primary available information. When the random variable R(x_l) representing this information is discrete and r_m, m = 1, 2, ..., M are its potential forms, the distribution is described by the set of conditional probabilities

P(r_m|x_l) ≡ P_m(x_l) = P[R(x_l) = r_m], m = 1, 2, ..., M, l = 1, 2, ..., L.     (8.4.6)

When the random variable R is continuous, its distribution is described by the set of conditional probability densities

p(r|x_l), r ∈ R, l = 1, 2, ..., L.     (8.4.7)

The aggregation set U_k corresponding to the information recovery rule X*(·) is the subset of the set R of potential forms of the available information r such that X*(r) = x_k. From the definition of aggregation sets it follows that for discrete available information we have

P(x_k|x_l) = Σ_{r ∈ U_k} P(r|x_l)     (8.4.8a)

while for continuous available information

P(x_k|x_l) = ∫_{U_k} p(r|x_l) dr     (8.4.8b)

The equations (8.4.8) are the relationships between the conditional decision probabilities and the primary statistical information we were looking for. Although they are in principle elementary, the direct utilization of these equations may be tedious. The concept of the decision information (weight) introduced in Section 8.3.1 can simplify the calculation of the conditional decision probabilities for optimal recovery rules. Since a decision of such a rule depends directly on the current decision information, we can calculate the needed conditional probabilities P(x_k|x_l) from a counterpart of equation (8.4.8) by taking the decision information u in place of the primary available information r. For example, such a counterpart of equation (8.4.8b) is

P(x_k|x_l) = ∫_{U_uk} p(u|x_l) du     (8.4.9)

where U_uk is the aggregation set of potential forms of the decision information. To use equation (8.4.9) we have to calculate the density of conditional probability p(u|x_l). We obtain it using the relationship (8.3.8) between r and u, the conditional probability p(r|x_l), and the general rules of probability theory for calculating the probability density of a function of a given random variable (see e.g., Papoulis [8.17]). Often, the primary available information is a gaussian variable (process) and the decision information is obtained by a linear transformation of the primary information, as in Examples 8.3.1 to 8.3.3. Then from conclusion (4.5.22) it follows that the decision information is gaussian. This greatly simplifies the calculation of the probabilities P(x_k|x_l) from equation (8.4.9). We illustrate it on a simple example.
EXAMPLE 8.4.1 CALCULATION OF DECISION ERRORS OF OPTIMAL BINARY RECOVERY RULE
We make assumptions A2 to A5 from Example 8.3.2. Specifying assumption A1, we assume that the primary information x_l, l = 1, 2 is binary (L = 2). We also make the symmetry assumptions (8.3.32) and (8.3.33); however, in place of (8.3.33b) we take

E(x_l) = ∫_{t_a}^{t_b} w²(x_l, t) dt     (8.4.10)
On these assumptions the threshold rule (8.3.24) using the binary decision information u_b[r(·)] given by equation (8.3.42) is optimal. From (8.3.40) it follows that on the condition X = x_l the process representing the primary available information is

r(t) = w(x_l, t) + z(t), t ∈ <t_a, t_b>     (8.4.11)
Substituting this in (8.3.42) we obtain the random variable representing the decision information on the condition X = x_l:

u_b(x_l) = ∫_{t_a}^{t_b} r(x_l, t) [w(x_1, t) − w(x_2, t)] dt     (8.4.12)

After some elementary algebra we obtain

u_b(x_l) = A(x_l) + z_b     (8.4.13a)

where

A(x_l) = ∫_{t_a}^{t_b} w(x_l, t) [w(x_1, t) − w(x_2, t)] dt     (8.4.13b)

z_b = ∫_{t_a}^{t_b} z(t) [w(x_1, t) − w(x_2, t)] dt     (8.4.13c)
The random variable z_b is obtained from the noise by a linear transformation. We have assumed that the noise is a gaussian process. Then from a generalization of the conclusion (4.5.22) it follows that z_b is a gaussian variable. Averaging (8.4.13c) and interchanging the order of averaging and integration we find that E z_b = 0. Similarly we find the variance of z_b (see for example Papoulis [8.17]). From (8.4.13a) it follows that the random variables u_b(x_l) are gaussian variables with the same variance and with mean values located symmetrically around zero. Thus, the conditional probability densities p(u|x_l) describing those variables are gaussian densities located symmetrically around zero, as shown in Figure 8.14a. From the symmetry assumptions it follows that the threshold in the optimal information recovery rule (8.3.24) is

u_th = 0.     (8.4.14)

From this it follows that the aggregation sets are

U_1 = <0, ∞), U_2 = (−∞, 0)     (8.4.15)

Using (8.4.9) (with a single integration) we get

P(x_2|x_1) = ∫_{−∞}^{0} p(u|x_1) du,   P(x_1|x_2) = ∫_{0}^{∞} p(u|x_2) du     (8.4.16)

These considerations are illustrated in Figure 8.14.
Figure 8.14. The densities of conditional probabilities p(u|x_l), l = 1, 2 of the variables representing the binary decision information u_b, and the geometrical interpretation of the conditional probabilities of decisions P(x_2|x_1) and P(x_1|x_2) of optimal binary information recovery; (a) the open system, (b) the system with feedback.
Performing the described procedure for calculating the variance and the means of the variables u_b(x_l) and using equation (8.3.68), we finally obtain the probability of errors of the maximum conditional probability rule

P_b = G_t[ √( E_n [1 − c'(1, 2)] ) ]     (8.4.17)

where

E_n ≡ E/S_z is the normalized noiseless signal energy     (8.4.18)

c'(1, 2) = (1/E) ∫_{t_a}^{t_b} w(t, x_1) w(t, x_2) dt     (8.4.19)

is the normalized correlation coefficient and

G_t(u) = ∫_u^∞ (1/√(2π)) e^{−w²/2} dw     (8.4.20)
is the "tail" of the distribution of the standard Gaussian variable. D 8.4.2 PERFORMANCE OF BINARY INFORMATION RECOVERY The general method of calculating the statistical performance indicators illustrated in the example has been used for many types of channels producing the available information. The calculation of the sum, respectively the integral occurring in equations (8.4.8), is straightforward but tedious. Therefore, we will now only discuss the general conclusions about information processing which can be drawn from results of such calculations. Let us consider first equation (8.4.17) derived in the example. From this equation it follows For the symmetrical model the performance of the optimal binary information recovery depends only on the normalize energy and the (OAJ^\ correlation coefficient between the potential forms of the noiseless signals but does not depend on their other features.
The smaller the correlation coefficient between the signals, the better is the performance of optimal information recovery.     (8.4.21b)

It can be proved that for binary signals

min c'(1, 2) = −1     (8.4.22)

In view of (7.1.21) the geometrical interpretation of (8.4.22) is:

The performance of optimal information recovery is best when the distance between the noiseless signals is maximal.     (8.4.23)

Since we assumed that the energies of the noiseless signals are equal, from (8.4.23) it follows that optimal is such a pair of noiseless signals that

w(x_2, t) = −w(x_1, t), t ∈ <t_a, t_b>

i.e., the signals are antipodal.
Figure 8.15. The dependence of the probability of error P_b of optimal binary information recovery on the normalized energy E_n = E/S_z of the noiseless signals when the noise is Gaussian noise with flat harmonic spectrum. Lines: (D) the noiseless signals are determinate and antipodal, (IP) bandpass, orthogonal signals with indeterminate phase described in Example 8.3.3, (IPA) bandpass, orthogonal signals with indeterminate phase and amplitude (Rayleigh distribution).
In accordance with conclusion (8.4.21a) the bandwidth B of frequencies occupied by the optimized signals has no influence on the performance of the optimized system. Obviously, the smaller this bandwidth is, the smaller is the channel capacity. However, we cannot make B arbitrarily small. This follows from the basic relationship (7.4.26) between the bandwidth and the duration of a process. In the notation used now this relationship is

B ≥ A/T     (8.4.26)

where T = t_b − t_a and A is a constant of the order of magnitude of 1.
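A small numerical sketch of the trade-off shown in Figure 8.15 (line D), using the formula (8.4.17) as reconstructed above; the Gaussian tail G_t is evaluated with the complementary error function. The energy values are illustrative assumptions.

```python
import math

def gauss_tail(u):
    """G_t(u) of (8.4.20), the tail of the standard Gaussian distribution."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def p_error_binary(e_n, c12):
    """Reconstructed (8.4.17): P_b = G_t( sqrt( E_n * (1 - c'(1,2)) ) )."""
    return gauss_tail(math.sqrt(e_n * (1.0 - c12)))

for e_n in (1.0, 2.0, 4.0, 8.0):
    print(f"E_n = {e_n:4.1f}:",
          f"antipodal (c' = -1) P_b = {p_error_binary(e_n, -1.0):.2e},",
          f"orthogonal (c' = 0) P_b = {p_error_binary(e_n, 0.0):.2e}")
```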
In Example 8.3.3 the recovery of binary information carried by a narrow-band signal with indeterminate phase was considered (see also Section 2.1.1). We derive the probability density p(u|x_l) for the decision information (8.3.45) using equation (5.4.20), with the gaussian probability densities discussed in Example 8.4.1 in place of p_z[v − w(s, b)] and with the phase ψ in place of the indeterminate parameter b. From equations similar to (8.4.16) we calculate the probability of errors of the optimal recovery rule discussed in Example 8.3.3. The conclusion from the obtained results is similar to conclusion (8.4.21). However, for signals with indeterminate phase the error probability depends not on the correlation coefficient of the noiseless signals but on the correlation coefficient of their envelopes A(x_l, t). It can be proved that for fixed energies of the noiseless signals the errors are minimal when the envelopes are orthogonal (their correlation coefficient is zero). The dependence of the error probability on the normalized energy is shown in Figure 8.15 as diagram IP (indeterminate phase).
COMMENT
The comparison of diagrams D and IP in Figure 8.15 shows that the trade-off of minimal error probability for the energy of the noiseless signal is significantly worse for signals depending on an indeterminate parameter than for determinate noiseless signals. The formal reason for the performance deterioration caused by the indeterminism of the noiseless signals is that, as a result of the integration according to equation (5.4.20), the probability densities of the decision information are for indeterminate signals more stretched than the corresponding probability densities for determinate signals with the same energy. Therefore, the tails of the probability densities behind the threshold are longer. The ultimate reason for the deterioration is related to the differences in the available meta information. In the case of determinate noiseless signals the exact information about the reference signals is available. When a noiseless signal depends on indeterminate parameter(s) we have only information about the set of its potential forms and their statistical weights. The price which we pay for this indeterminism is the increase of the error probability. We can expect that the "larger" the indeterminism, the larger is the error probability of an optimized system. This is illustrated by diagram IPA in Figure 8.15, which presents the performance-for-energy trade-off when both the amplitude and the phase of the noiseless signals are indeterminate (for the derivation of such a diagram see e.g. Proakis [7.13]).
PERFORMANCE OF OPTIMIZED SYSTEMS WITH FEEDBACK
The previously described general procedure for calculating the error probabilities can also be applied to the systems using feedback information that were considered in Section 8.3.3. We assume first that the length of the total transmission (the number of retransmissions) is fixed. Then we can proceed as in the case of the open system. In particular, we can use equation (8.4.9) with the aggregation sets U_l for the system with feedback, as illustrated in Figure 8.14b. Next we average the obtained conditional probabilities over the potential lengths of the total transmission. The needed probabilities we obtain from the probability of the disqualifying decision. The latter we get from equation (8.4.9), integrating over the aggregation set U_-, as illustrated in Figure 8.14b (for details see Seidler [8.19]).
To simplify the argumentation about the primary and available information we make the assumptions as in Examples 8.3.2 and 8.4.1. We assume also that the energy of the noiseless signal corresponding to a single retransmission does not depend on the number of the retransmission and is the same as that of the first transmission. In view of conclusion (8.4.23) we assume that the noiseless signals of all transmissions are antipodal signals. We denote by E_1 the energy of a noiseless signal corresponding to a single retransmission, by Ē the average total energy of all noiseless signals corresponding to a total transmission, averaged over all potential lengths of total transmissions, and by

E_n1 = E_1/S_z,  Ē_n = Ē/S_z     (8.4.27)
the corresponding normalized energies. From the rules of the system's operation it follows that

Ē = N̄ E_1     (8.4.28)

where N̄ is the average number of retransmissions. As the performance indicator we take the probability of error P_b of the ultimate decision. The trade-off of the error probability for the normalized average total energy of the total transmission is shown in Figure 8.16.
Figure 8.16. The dependence of the probability P_b of error of the optimal ultimate decision on the normalized average total energy Ē_n of the noiseless signals, with the normalized energy E_n1 of a single retransmission as a parameter; ML memoryless rule, MLA memoryless adaptive rule, MEM rule with memory (using all retransmissions), OPEN the optimal open system (redrawn line D from Figure 8.15).
COMMENT 1
In the system with feedback, the threshold P_th determines the "size" of the rejecting zone. Increasing it we increase the rejection probability P_- and, in consequence, the average number of retransmissions. From (8.4.28) it follows that for a fixed energy E_n1 of a single transmission, a change of the average total energy is caused by a change of the average number of retransmissions. Therefore, the diagrams of P_b versus Ē_n with fixed E_n1 describe the improvement of the quality of the working information achieved by increasing the average energy of the total transmission.
COMMENT 2
If no feedback information is used, the system operates as an open system and the P_b versus E_n trade-off is described by the line D in Figure 8.15. This line is redrawn as the OPEN (slashed) line in Figure 8.16. Thus, for a given error probability P_b, the difference between the value read off one of the lines characterizing the feedback system and off the OPEN line is an indicator of the gain due to feedback information, measured in terms of saved noiseless signal energy. The rules with memory (MEM lines) are always better than the rule using no feedback, and when a lower probability of errors is required the advantage of the feedback rules becomes larger. The simple memoryless rules also give an improvement, but only in some ranges of Ē_n and E_n1. We discuss this in the next comment.
COMMENT 3
In the range where the difference between the energy of a single transmission and of the average total transmission is small, a memoryless rule (ML lines) gives a substantial improvement compared with the open system. From (8.4.27) it follows that in this range the average number of retransmissions is small. However, when the buffer zone is too large, the performance of the system with feedback is worse than that of the open system. The reason is that in the class of all possible information recovery rules the memoryless rule is not optimal, because it uses only the most recent retransmission and wastes the energy of all earlier ones. From the diagrams we see that for a required probability of errors there is a value of the average total energy (thus, of the average number of retransmissions) for which the improvement of the memoryless rule is biggest. Thus, an auxiliary adaptive system could change the rejection zone so that for every P_b the memoryless rule operates best. The performance of such an adaptive rule is presented as line MLA. This feature of the memoryless rules is quite typical: often optimal rules are uniformly optimal, while non-optimal rules operate well only in some ranges of the environment parameters.
8.4.3 THE PERFORMANCE OF OPTIMAL RECOVERY OF DISCRETE INFORMATION WHEN L>2
We discuss now the generalization of the considerations in Example 8.4.1 when the number of potential forms of the discrete information is L>2. We again introduce the assumptions A1 to A5 as in Example 8.3.2. We introduce also the symmetry assumptions (8.3.32) and (8.3.33a), using the definition (8.4.10). In geometrical terms the symmetry assumption (8.3.33a) means that the points representing the noiseless signals lie on the surface of a sphere with radius √E.
A more detailed analysis (see e.g., Golomb [8.22]) shows that the generalization of conclusion (8.4.23) is: the performance of optimal recovery is best when the mutual distances between the noiseless signals are equal and as large as possible. From (7.1.21) it follows that the first condition is equivalent to the condition that

c'(k, l) = c = const     (8.4.29)

where the correlation coefficient is defined by equation (8.4.19) with k, l in place of 1, 2. It can also be proved that the minimum value of the constant c is

c_min = −1/(L − 1)     (8.4.30)
and sets of signals achieving this minimum can be found. Such a set has the geometrical meaning of the vertices of a symmetric pyramid (called also a simplex). Conclusion (8.4.22) is the special case of (8.4.30). From (8.4.30) it follows that for large L the set of noiseless signals with correlation coefficients

c'(k, l) = 0 for all k ≠ l     (8.4.31)

is close to optimal. Thus any of the sets of L orthogonal functions discussed in Section 7.4.1 is a set that allows us to achieve an almost optimal performance of the optimized information recovery considered in Example 8.3.3. Therefore, a system in which the noiseless signals are orthogonal and the optimal recovery rule is used is called an optimal orthogonal system. The generalization for L>2 of the procedure presented in Example 8.4.1 is straightforward, since the consequence of the symmetry assumptions is that all conditional decision probabilities given by equation (8.4.9) are equal, and the orthogonality causes that the random variables representing the current decision information given by (8.3.41), on the condition that the primary information is fixed, are statistically independent gaussian variables. This leads to the final formula for the probability of errors in the optimal orthogonal system

P_e = F(L, E_n)     (8.4.32)

where E_n is the normalized energy defined by equation (8.4.18) and

F(L, E_n) = 1 − ∫_{−∞}^{∞} (1/√(2π)) e^{−(u − √(2E_n))²/2} [1 − G_t(u)]^{L−1} du     (8.4.33)
and G_t(u) is the tail of the standard Gaussian distribution given by (8.4.20). The function F(L, E_n) is called the Fano function. From equation (8.4.32) it follows that the probability of errors of an optimal orthogonal system depends on the noiseless signals only through the parameter E_n, which is a characteristic of a single noiseless signal, and the parameter L, which characterizes the set of potential forms of a noiseless signal. We now discuss in more detail the relationship between the probability of error of the optimally recovered information and the number of its potential forms L. This number is a characteristic of the state of variety (see Section 1.4.1). As indicated in Section 6.1.1 the parameter

v = log_2 L     (8.4.34)

is an indicator of the volume of resources needed to process the information, but it is also an indicator of the usefulness of the information for the superior system.
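A numerical sketch of the Fano function as reconstructed in (8.4.33); the integral is evaluated on a finite grid, which is an approximation, and the chosen value of the noise-volume normalized energy is an illustrative assumption.

```python
import math
import numpy as np

def fano(L, e_n, lo=-10.0, hi=15.0, m=4000):
    """F(L, E_n) of (8.4.33): error probability of the optimal orthogonal system
    (plain Riemann sum; accuracy limited by the grid)."""
    u = np.linspace(lo, hi, m)
    du = u[1] - u[0]
    phi = np.exp(-(u - math.sqrt(2.0 * e_n)) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    tail = np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in u])   # G_t(u)
    return 1.0 - float(np.sum(phi * (1.0 - tail) ** (L - 1)) * du)

for L in (2, 4, 16, 256):
    e_n = 4.0 * math.log2(L)      # keep the n-v normalized energy E_nv = 4 fixed
    print(f"L = {L:4d}: P_e ~ {fano(L, e_n):.3e}, "
          f"P_ev ~ {fano(L, e_n) / math.log2(L):.3e}")
```

With E_nv held fixed above the Shannon bound, the computed error probabilities fall as L grows, which is the behaviour of the diagrams in Figure 8.17.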
When signals with various L are compared, it is therefore natural to introduce the normalized energy of the noiseless signal related to a unit of volume, defined as

E_nv ≡ E_n/v = E/(S_z log_2 L)     (8.4.35)
We call it the noise-volume normalized energy (abbreviated n-v normalized energy); hence the notation. From the point of view of a superior system, the consequences of an error usually depend on the number of potential forms of the information. To define a normalized indicator of the probability of errors we consider a reference discrete information. Similarly as we defined the volume of discrete information in Section 6.1.1, as a reference we take information which is an unconstrained block of N_b binary pieces of information, and we assume that the errors of the elementary pieces of information are independent. The probability of error of such a block is

P_e = 1 − (1 − P_b)^{N_b}     (8.4.36)

where P_b is the probability of error of a binary piece of information. For P_b ≪ 1 we have

P_e ≈ N_b P_b     (8.4.37)

The volume of the considered blocks is

v = N_b     (8.4.38)

For P_b ≪ 1, from (8.4.36) and (8.4.38) it follows that

P_b ≈ P_e/v = P_e/log_2 L.     (8.4.39)

Therefore, since we consider small probabilities of errors, as a volume-normalized error probability of discrete information we take

P_ev = P_e/v     (8.4.40)
and we call it the volume-normalized probability of errors. Using the definitions (8.4.35) and (8.4.40) we express the primary variables P_e and E_n occurring in the basic relationship (8.4.33) and obtain the relationship between the volume-normalized characteristics of the optimal system. The diagrams visualizing this relationship, with L as a parameter, are shown in Figure 8.17. Those diagrams show that, compared with separate processing of single binary pieces of information, optimal block processing is significantly better. The diagrams also show that the performance improves when the volume of the blocks processed as a whole grows. We analyze now this important and interesting effect. First let us look for the price we have to pay for the improvement. We derived in Section 5.2.1 the basic formula (5.2.52) for the probability density by considering a K DIM vector as an almost exact representation of a base-band process. As shown in Section 7.4.3 (conclusion (7.4.49)), such an assumption is justified when BT ≫ 1. From Section 7.1 it follows that in the space of K DIM vectors we can find sets of K mutually orthogonal vectors but not more. In view of the mentioned possibility of an almost exact representation of base-band processes by the vectors and the preservation of distance and scalar product, we can expect that

In the space of base-band processes of duration T and occupying bandwidth B such that TB ≫ 1, the largest number of orthogonal functions which can be found is close to 2TB.     (8.4.41)
Figure 8.17. The diagrams of the volume-normalized probability P_ev of errors of the optimal recovery of information carried by the almost optimal (orthogonal) noiseless signals as a function of the noise-volume normalized energy E_nv, with the number of potential forms L as a parameter; E_nv,min is the Shannon bound.
For a fixed duration of the function-information the cost of its transmission depends on the bandwidth B. Therefore, although the bandwidth B does not occur in the basic equation (8.4.33) giving the error probability, we attempt to keep the bandwidth possibly small. Therefore, the consequence of the assumption that L mutually orthogonal signals are available is that the minimum product

min 2TB = L.     (8.4.42)

An unconstrained block consisting of N_b binary pieces of information has L = 2^{N_b} potential forms. Summarizing, we conclude:

By assembling N_b pieces of binary information into a block and processing the blocks optimally as a whole, so that the probability of recovery of a binary piece of information has a requested value, it is possible, by increasing N_b, to diminish the distortions of the recovered information without increasing the energy of the noiseless signals. The price that must be paid for it is an increase of the bandwidth of the noiseless signals, so that condition (8.4.42) is satisfied.     (8.4.43)

COMMENT
The product 2TB has the meaning of the dimensionality of a segment of a base-band process, and the dimensionality can be interpreted as an indicator of the level of structuring of signals. Therefore, the interpretation of conclusion (8.4.43) is that by increasing the level of structuring of the signals carrying the information it is possible to improve the performance of information recovery without increasing the energy of the noiseless signals carrying the information. Another such possibility is to utilize additional information about the state of the environment of the information system, as discussed in the Comment on page 433 and in Comment 2 on page 435.
The conclusion (8.4.43) and these comments raise the question of what the limits of quality improvement by increasing the bandwidth of the signals are. To get the answer we look closer at the properties of the Fano function. It can be proved (see e.g. Golomb [8.18]) that

lim_{L→∞} F(L, E_n) = 1 if E_nv < E_nv,min,   lim_{L→∞} F(L, E_n) = 0 if E_nv > E_nv,min     (8.4.44)

where

E_nv,min = 2/log_2 e     (8.4.45)

is a universal constant. Substituting (8.4.45) in the condition

E_nv > E_nv,min     (8.4.46)

we see that it is equivalent to the condition

log_2 L < (E/(2S_z)) log_2 e     (8.4.47)
Comparing the right side of this inequality with equation (5.4.44), after substituting (5.4.33), we see that the right side is the capacity C_∞ of the gaussian channel in the limiting case when the bandwidth B → ∞. Thus, we can write condition (8.4.46) as

log_2 L < C_∞     (8.4.48)

It can be shown that the error probability of the optimal orthogonal system decreases exponentially with the volume v of the processed information, with an exponent α(R'/C'_∞), where R' is the rate of the working information transmission; α(R'/C'_∞) is a function decreasing from the value 0.5 for R' = 0 to zero for R' = C'_∞. Thus,

When R' < C'_∞ the probability of error decreases exponentially with the volume of the processed information, the faster the bigger is the surplus of the channel capacity over the rate of working information transmission.     (8.4.52)
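A small numerical sketch of the Shannon bound (8.4.45) and of the equivalent condition (8.4.47); the values of E/S_z are illustrative assumptions.

```python
import math

E_nv_min = 2.0 / math.log2(math.e)          # the universal constant (8.4.45) = 2 ln 2
print(f"E_nv,min = {E_nv_min:.4f}")

def max_log2_L(E_over_Sz):
    """Largest volume log2(L) compatible with (8.4.47)."""
    return (E_over_Sz / 2.0) * math.log2(math.e)

for e in (10.0, 50.0, 200.0):
    print(f"E/S_z = {e:6.1f} -> log2 L must stay below {max_log2_L(e):7.2f} bits")
```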
8.5 OPTIMAL RECOVERY OF CONTINUOUS INFORMATION
The special cases of the general solution (8.3.10) of the optimization problem when the information is continuous are considered. It is shown that the linear information recovery rules and the two-level rules with separate estimation of the state of the environment of the main information system, which have previously been introduced in a heuristic way, are on quite general assumptions statistically optimal rules. Most of the presented features of continuous information processing have their discrete counterparts. However, their analysis is usually more complicated. The main reason for presenting here the classical problems of continuous information processing is to give more insight into the corresponding problems of discrete information processing. In the last section we present an estimate of the quality of optimal recovery of continuous information in terms of the capacity of the channel delivering the available information. This is one of the applications of the concept of channel capacity, which was introduced in a formal way in Section 5.4.4. Other important applications of channel capacity in the analysis of discrete information processing are discussed in the next section.
8.5.1 THE SOLUTION OF THE OPTIMIZATION PROBLEM
We assume that the primary information is 1 DIM and that the set of its potential forms is the interval X = <x_a, x_b>. Then the conditional performance indicator is

Q(x*, r) = ∫_{x_a}^{x_b} q(x, x*) p(x|r) dx, x* ∈ <x_a, x_b>     (8.5.1)

where p(x|r) is the density of conditional probability, and the optimal decision is a solution of the equation

dQ(x*, r)/dx* = 0.     (8.5.2)

We take

q(x, x*) = (x − x*)²     (8.5.3)

Then

Q(x*, r) = ∫_{x_a}^{x_b} (x − x*)² p(x|r) dx     (8.5.4)

Substituting in (8.5.2) and after some algebra we find that the optimal decision about the primary information is

x_0 = ∫_{x_a}^{x_b} x p(x|r) dx     (8.5.5)
A special case of this equation is equation (7.5.17), which we used in Section 7.5. In the continuous case we may also use the decision information (weight)

u_d(x*, r) = φ{Q(x*, r)}, x* ∈ <x_a, x_b>     (8.5.6)

where φ(u) is an increasing or decreasing function. Then the optimally recovered information is a solution of the equation

du_d(x*, r)/dx* = 0     (8.5.7)
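A numerical sketch of (8.5.5): with the square error weight (8.5.3), the optimal decision is the mean of the conditional density p(x|r), here approximated on a grid. The prior, the channel model and the observed value are illustrative assumptions.

```python
import numpy as np

def posterior_mean(r, x_grid, prior_pdf, likelihood):
    """x_0 of (8.5.5): mean of the conditional density p(x|r) on a uniform grid."""
    post = prior_pdf(x_grid) * likelihood(r, x_grid)   # Bayes rule up to a constant
    post /= post.sum()                                 # normalize on the grid
    return float(np.sum(x_grid * post))

# illustrative assumptions: zero-mean Gaussian prior on x, additive Gaussian noise r = x + z
sig_x, sig_z = 2.0, 1.0
x_grid = np.linspace(-10.0, 10.0, 2001)
prior = lambda x: np.exp(-x**2 / (2 * sig_x**2))
lik = lambda r, x: np.exp(-(r - x)**2 / (2 * sig_z**2))

r_obs = 3.0
print("x_0 on the grid :", posterior_mean(r_obs, x_grid, prior, lik))
print("closed form     :", sig_x**2 / (sig_x**2 + sig_z**2) * r_obs)  # cf. Example 8.5.1
```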
As discussed in Section 8.1.1, in many cases

q(x, x*) = β(x − x*)     (8.5.8)

where β(·) is the error weight function. Substituting (8.5.8) in (8.5.1) we get

Q(x*, r) = ∫_{x_a}^{x_b} β(x − x*) p(x|r) dx     (8.5.9)

Diagrams of typical functions occurring in this equation are shown in Figure 8.18.
Figure 8.18. Typical error weight function β(·) and typical density of conditional probability occurring in equation (8.5.9).
Changing x* we shift the function β(x − x*). When the error weight β(w) is symmetric around its minimum at w = 0 and the probability density p(x|r), considered as a function of x, has a single maximum at x_m(r) and is symmetric around it, then we minimize the integral in (8.5.9) by locating the minimum of β(x − x*) at the point x_m(r) of the maximum of p(x|r), i.e., by taking x* = x_m(r).
Figure 8.19. Illustration of the concept of small/large indeterminism: the a priori indeterminism of primary information x is (a) large ((b) small) compared with the conditional indeterminism of the available information r; p{x) denotes the density of the a priori probability, p{r\x) the density of conditional probability of the available information.
Thus,

When the a priori indeterminism of the primary information is much larger than the conditional indeterminism of the available information, as shown in Figure 8.19a, then the a priori statistical information is not needed to make an almost optimal decision about the primary information.

From this conclusion it follows that the recovery rule

To the available information r assign the primary information x_0 ∈ X which maximizes the density of conditional probability p(r|x) considered as a function of x     (8.5.12)

is an almost optimal rule when the a priori indeterminism is much larger than the conditional indeterminism and the general conditions used to justify the optimal character of the rule (8.5.10) are satisfied. This rule is called the maximum likelihood rule.

EXAMPLE 8.5.1 OPTIMUM RECOVERY OF 1 DIM INFO
We assume:
A1. The primary information is a realization of the gaussian variable x with E x = x̄ and σ²(x) = σ_x².
A2. The random variable representing the available information on the condition that the primary information x is fixed is

r(x) = x + z     (8.5.13)

where z is a gaussian variable with E z = z̄ and σ²(z) = σ_z².
A3. The random variables x and z are statistically independent.
From A1 and A2 it follows that the probability density of the primary information is

p(x) = G(x − x̄, σ_x)     (8.5.14)

and the density of conditional probability is

p(r|x) = G(r − x − z̄, σ_z)     (8.5.15)

where

G(x, σ) = (1/(√(2π) σ)) e^{−x²/(2σ²)}     (8.5.16)
Substituting (8.5.14) and (8.5.15) in (8.5.11) we get

p(x|r) = G(x − x̂(r), σ_c)     (8.5.17)

where

x̂(r) = [σ_x²/(σ_x² + σ_z²)] (r − z̄) + [σ_z²/(σ_x² + σ_z²)] x̄     (8.5.18)

and

σ_c² = σ_x² σ_z²/(σ_x² + σ_z²)     (8.5.19)
Since the gaussian probability density reaches its maximum at its average value, from (8.5.17) it follows that the decision of the maximum conditional probability rule is

x_0 = x̂(r)     (8.5.20)

and the mean square error of the optimal decision is σ_c². The indeterminism of the primary information is large when

σ_x²/σ_z² ≫ 1     (8.5.21)

From (8.5.18) and (8.5.19) it follows that in the limiting case σ_x/σ_z → ∞ the decision and its mean square error do not depend on the a priori parameters. This is a concrete example of the general conclusion formulated on page 442. □

8.5.2 OPTIMAL CHARACTER OF LINEAR RULES
We assume that the joint probability density of the primary information x and the components r(n) of the available information is gaussian. To simplify the argument we assume that E r(n) = 0, n = 1, 2, ..., N, E x = 0 and we denote

u(m) = r(m), m = 1, 2, ..., N, u(N+1) = x     (8.5.22)

Then the general equation (4.5.14) gives the density of the joint probability

p(x, r) = ([det A]^{1/2}/(2π)^{(N+1)/2}) exp{ −(1/2) Σ_{m=1}^{N+1} Σ_{n=1}^{N+1} A(m, n) u(m) u(n) }     (8.5.23)

Equation (4.5.15b) allows us to express the coefficients A(m, n) in terms of the correlation coefficients of the random variables u(m) = r(m), m = 1, 2, ..., N and u(N+1) = x. Their correlation matrix is

C = [ C_rr  C_rx ; (C_rx)^T  c(x, x) ]     (8.5.24)
where C_rr is the correlation matrix of the components of the available information, C_rx is the column matrix whose elements are the correlation coefficients between the primary information and the components of the available information, and c(x, x) = σ²(x) is the variance of the primary information. The basic relationship (4.5.15b) takes the form

A = C^{−1}     (8.5.25)

where A is the matrix of coefficients A(m, n).
The conditional density of probability we are looking for is

p[u(N+1)|u] = p[u(N+1), u]/p(u)     (8.5.26)

where u = {u(n), n = 1, 2, ..., N}. From the general conclusion (4.5.23) it follows that this is a gaussian density; thus

p[u(N+1)|u] = G[u(N+1) − û(N+1), σ(N+1)]     (8.5.27)

To express the parameters of this density in terms of the parameters entering equation (8.5.26), only the terms including the variable u(N+1) must be taken into account. These are A(N+1, N+1) u²(N+1) and the terms

A(n, N+1) u(n) u(N+1), n = 1, 2, ..., N,     (8.5.28)

occurring in the sum in equation (8.5.23). Therefore, we write the equation (8.5.26) in the form

p[u(N+1)|u] = C_1 exp{ −(1/2)[ A(N+1, N+1) u²(N+1) + 2 Σ_{n=1}^{N} A(n, N+1) u(n) u(N+1) ] }     (8.5.29)

where C_1 is a function of u but does not depend on u(N+1). Comparing equation (8.5.29) with (8.5.27) we see that

û(N+1) = −[1/A(N+1, N+1)] Σ_{n=1}^{N} A(n, N+1) u(n)     (8.5.30a)

and

σ²(N+1) = 1/A(N+1, N+1)     (8.5.30b)

Using equations (8.5.27) and (8.5.30a), arguing as in Example 8.5.1, and returning to the primary notation (8.5.22), we conclude that

x* = û(N+1) = −[1/A(N+1, N+1)] Σ_{n=1}^{N} A(n, N+1) r(n)     (8.5.31)

is the decision of the rule of maximal conditional probability and σ²(N+1) is the mean square error of the decisions. Thus we come to the basic conclusion:

On the assumption that the joint probability distribution of the primary information and the components of the available information is gaussian, the rule of maximal conditional probability is a linear rule.     (8.5.32)

Usually the coefficients A(m, n) are not directly available, but the correlation coefficients c_rr(n, m) and c_rx(n) are. Then we can use equation (8.5.25). However, we can avoid the tedious algebra and use an already available result. The gaussian probability distribution and the square error weight (8.5.8) satisfy the conditions formulated on page 441 on which the maximum conditional probability rule (8.5.10) is optimal. Therefore,

The linear rule with optimal coefficients, which satisfy the set of equations (8.2.8), is equivalent to the maximal conditional probability rule.     (8.5.33)

Thus, the set of coefficients a(n, N+1), n = 1, 2, ..., N of the equivalent linear rule is the solution of the matrix equation (8.2.8b), which in the present notation has the form

C_rr a = C_rx     (8.5.34)
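A numerical sketch of conclusions (8.5.32)-(8.5.34): for jointly gaussian data the maximum conditional probability decision is the linear combination whose coefficients solve C_rr a = C_rx. The model parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# illustrative model: x is a zero-mean Gaussian scalar, r(n) = x + z(n), n = 1..N
N, sig_x, sig_z = 5, 1.5, 1.0
C_rr = sig_x**2 * np.ones((N, N)) + sig_z**2 * np.eye(N)   # correlation matrix of r
C_rx = sig_x**2 * np.ones(N)                               # correlations between r and x

a = np.linalg.solve(C_rr, C_rx)          # optimal linear coefficients, (8.5.34)

# Monte Carlo check that x* = a^T r has the predicted mean square error
M = 20000
x = sig_x * rng.standard_normal(M)
r = x[:, None] + sig_z * rng.standard_normal((M, N))
x_star = r @ a
print("coefficients a :", np.round(a, 4))
print("empirical MSE  :", np.mean((x_star - x) ** 2))
print("theoretical MSE:", sig_x**2 - C_rx @ a)
```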
We have considered the recovery of 1 DIM information. The generalization of our argumentation to the recovery of multi-dimensional primary information is quite straightforward. In the case of a gaussian probability distribution the basic conclusions, that the optimal recovery rule is linear and that it is equivalent to a linear rule optimized for the mean square criterion, remain valid. In view of the latter conclusion, all methods for finding the solution of the linear optimization problem, particularly the very efficient Kalman recursive algorithms (see e.g., Proakis [8.13, ch. 4]), are applicable. We present now an example of application of the optimal recovery rule. This example also gives insight into the properties of intelligent systems with independent estimation of the state of the environment.
8.5.3 OPTIMAL CHARACTER OF INTELLIGENT RULES WITH INDEPENDENT STATE PARAMETER ESTIMATION
We consider now the processing of a train of pieces of information on the assumption that, while the train lasts, some unknown components of the state of the information system's environment are constant. This is a simple model of the system operating in an environment with slowly varying components that was discussed in Section 1.7.2. We assume:
A1. An elementary piece of the primary information and of the available information is a scalar.
A2. The evolving train X(N) = {x(n), n = 1, 2, ..., N} of the primary information and the train R(N) = {r(n), n = 1, 2, ..., N} of the available information are processed.
A3. A component of the available information is

r(n) = x(n) + z(n) + b     (8.5.35)

where z(n) is the component of noise which changes in each step and b is a component of noise which remains constant throughout a train.
A4. All pieces of information x(n), r(n), z(n), b exhibit statistical regularities; they can be considered as observations of random variables x(n), r(n), z(n), b; the value of b is determined by random factors anew every time a new train starts, but then it remains constant throughout the train.
A5. All variables x(n), z(n), b are gaussian,

E x(n) = 0, E z(n) = 0, E b = 0     (8.5.36)
and uncorrelated, thus statistically independent. The dependence of all components r(n) on the same constant component b causes the variables r(n), r(m) to be statistically related (correlated). Therefore, for the optimal recovery of the information x(n), not only the information r(n) statistically related directly to x(n) is relevant, but also the other pieces r(m) of the available information. This in turn provides insight into basic properties of learning systems. We consider two modes of system operation: (1) without a training cycle, (2) with a training cycle (see Section 1.7.2, particularly Figure 1.25, and Section 2.2.1).
We begin with the system without a training cycle and we interpret the information r(N) as the recently arrived information. Thus the available information is the train r(N) = {r(n); n = 1, 2, .., N}. In view of assumptions A4 and A5 we can use the results derived in Section 8.5.2, taking x(N) as x and r(N) as r. Using the assumptions A4 and A5 we easily calculate the correlation coefficients. In view of the independence assumption A5, the matrix C given by (8.5.24) has a simple structure, and after some matrix algebra we can find in closed form the elements of the matrix A. Then from equations (8.5.30a) and (8.5.30b) we find that x_o(N) given by equation (8.5.37) is the optimal decision about the information x(N) and that Q_o(N) given by equation (8.5.38) is the mean square error of these decisions.

The assumptions made in Example 8.5.1 correspond to the problem considered now with N = 1. However, in the example we assumed that the averages of the random variables z and x can take any value. Comparing equation (8.5.37) with equation (8.5.18) with x̄ = 0 we see that the quantity b*(N) appearing in (8.5.37) and defined by equation (8.5.39)
is the counterpart of the known mean value z̄ of the noise considered in Example 8.5.1. Thus b*(N) has the meaning of an estimate of the initially unknown but fixed (inside the train) noise component. We denote by b*(N) also the random variable representing this estimate within the train, thus on the condition b = b. From assumptions A4 and A5 it follows that

E b*(N) = b    (8.5.40a)

and

E[b*(N) − b]² = E{(1/N) Σ_{n=1}^{N} [x(n) + z(n)]}² = (σ_x² + σ_z²)/N    (8.5.40b)
Thus, the secondary information b*(N) is an estimate of the unknown constant component of the noise. It is an efficient estimate in the sense that for large N the mean square error of estimation goes to zero. Moreover, it can be proved that b*(N) is almost equal to the maximum conditional probability decision about the constant noise component b made independently of the recovery of the primary information. Thus, we come to the following conclusion:

The two-level system with separately optimized subsystems shown in Figure 1.26 is almost equivalent to the maximum conditional probability rule derived here. (8.5.41)

The rule of transforming the current available information and the rule of estimating the unknown but (within the train) fixed noise component b appeared "automatically" as an interpretation of the rule of the maximum conditional probability.
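The following simulation sketch illustrates the two-level interpretation (assumptions of this example: the illustrative variances, the train length, and the use of the simple running average of the already received pieces as the secondary estimate of b; the exact expressions (8.5.37)-(8.5.39) are not reproduced here). The constant noise component is estimated separately, and the current piece is then recovered as if this component were known:

```python
import numpy as np

# Train model of assumptions A1-A5: r(n) = x(n) + z(n) + b, with b constant within a train.
rng = np.random.default_rng(1)
sigma_x, sigma_z, sigma_b, N, trains = 1.0, 0.5, 2.0, 50, 2000

errors = []
for _ in range(trains):
    b = rng.normal(0.0, sigma_b)
    x = rng.normal(0.0, sigma_x, N)
    r = x + rng.normal(0.0, sigma_z, N) + b

    # Secondary information: estimate of b from the previously received pieces only.
    b_est = np.zeros(N)
    b_est[1:] = np.cumsum(r[:-1]) / np.arange(1, N)

    # Recovery of the current piece as if b were known and equal to b_est(n).
    gain = sigma_x**2 / (sigma_x**2 + sigma_z**2)
    errors.append(np.mean((x - gain * (r - b_est)) ** 2))

# Reference: mean square error achievable when b is known exactly.
print("two-level rule :", round(float(np.mean(errors)), 4))
print("b known exactly:", round(sigma_x**2 * sigma_z**2 / (sigma_x**2 + sigma_z**2), 4))
```

For long trains the two values approach each other, in agreement with conclusion (8.5.41).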
We consider next the system with a training cycle. We assume that it includes the first M pieces of elementary information and that the working cycle starts with the recovery of the information x(M+1). We denote by

x_t(M) = {(x(n), r(n)); n = 1, 2, .., M}    (8.5.42a)

the training information and by

r(K) = {r(M+n); n = 1, 2, .., K}    (8.5.42b)

the already obtained K pieces of available information in the working cycle. The optimally recovered information x_o(M+K) is the information maximizing the density of conditional probability

p[x(M+K) | x_t(M), r(K)]    (8.5.43)
To compare the system with the training information with the previously considered system without a training cycle we assume

M = N − 1,  K = 1    (8.5.44)

Proceeding similarly as for the system without the training cycle, we find that the decision of the maximum conditional probability rule is

x_t(N) = [σ_x²/(σ_x² + σ_z²)] [r(N) − b*_t(N)]    (8.5.45)

and that its mean square error is given by equation (8.5.46). Similarly to (8.5.39), the second component in the brackets in equation (8.5.45),

b*_t(N) = [σ_b²/(σ_z² + (N−1)σ_b²)] Σ_{n=1}^{N−1} [r(n) − x(n)]    (8.5.47)
has the meaning of an estimate of the primarily unknown constant component of the disturbance, now based on the training information. As in the system without training, the variance of this estimate decreases to zero when N → ∞, and b*_t(N) is almost an optimal decision about b made separately. Thus, the conclusion (8.5.41) also holds for the system with training.

To get insight into the performance of the considered optimal rules we introduce the normalized performance indicators

Q'(N) = Q(N)/σ_x²,  P_x = σ_x²/σ_b²,  P_z = σ_z²/σ_b²    (8.5.48)

of the normalized mean square recovery error, of the relative magnitude of the working information, and of the variable noise component, respectively.
To get insight into the process of learning we assume that the constant component of the noise is large compared with the variable component, in the sense that

N σ_b² ≫ σ_z²    (8.5.49)
The dependence of the normalized mean square error Q'(N) on the length N of the train, with the relative magnitude P_x of the working information as a parameter, is shown in Figure 8.20.
Figure 8.20. The dependence of the normalized mean square error Q'(N) of the maximum conditional probability decisions on the number N of already available pieces of information and on the normalized range P_x of the working information; the normalized range P_b of the constant component of the noise is small. Continuous lines: the system with a training cycle; dashed lines: without a training cycle.
COMMENT 1
The analysis of this simple model of a system operating in an environment whose state is initially not known suggests the following generalizations:
• Learning in an environment with quasi-stable state components is possible both with and without a training cycle.
• A training cycle substantially accelerates the learning process.
• When a sufficiently large number of pieces of information is available and the recovery rule is optimized, the asymptotic performance of the systems with and without training is similar.
• The advantages of the learning cycle are larger when the quality of the available information (in the considered case P_x is such a quality indicator) is better. Examples can be given showing that in non-optimized systems learning based on insufficiently accurate working information may deteriorate the system's performance ("learning from wrong examples").

GENERALIZATIONS
We now show that on quite general assumptions it is possible to decompose the optimal system into a two-layered hierarchical system consisting of
• a subsystem operating only on the current piece of available information, and
• a subsystem utilizing all the already obtained concrete information to estimate the primarily unknown, quasi-static components of the states of the information system's environment.
We denote:
x(N) - the considered component of the working information,
r(N) - the piece of available information directly related to the working information x(N),
b - the set of primarily unknown, quasi-stable components of the state of the information system's environment,
Y(N) - all the available concrete information about the set b, including (1) the information delivered by the subsystem acquiring the information about the state of the environment of the information system, (2) the information obtained during the training cycle, and (3) the train R(N−1) of available pieces of information about the working information x(1), x(2), .., x(N−1).
We assume that all unknown factors exhibit statistical regularities and we consider the rule of the maximum conditional probability. We can realize this rule by looking for the information X_o(N) maximizing the density of the joint probability distribution p[X(N), r(N), Y(N)]. From a generalization of equation (4.4.8d) for the marginal probability we have

p[X(N), r(N), Y(N)] = ∫ p[X(N), r(N), Y(N), b] db    (8.5.50)

From the conditional probability equation (4.4.7b) we get

p[X(N), r(N), Y(N), b] = p[Y(N) | X(N), r(N), b] p[X(N), r(N), b]    (8.5.51)

The probability density p[Y(N) | X(N), r(N), b] describes the transformation generating the state information Y(N). Typical examples of such state information are the components of the available information entering the definitions (8.5.39) and (8.5.47) of the estimates b*(N) and b*_t(N) of the constant component b considered previously. Those definitions suggest that after sufficiently long gathering of information about the state of the environment (large N) the newly arrived pieces of information x(N), r(N) do not substantially influence the probability density p[Y(N) | X(N), r(N), b]. Thus, for large N we may assume

p[Y(N) | X(N), r(N), b] = p[Y(N) | b]    (8.5.52)

Then from (8.5.50) and (8.5.51) we have

p[X(N), r(N), Y(N)] = ∫ p[X(N), r(N), b] p[Y(N) | b] db    (8.5.53)
Usually the subsystem for state information provides progressively more information about the quasi-stable state parameter b. Therefore, for growing N the probability density

p[b | Y(N)] = C p[Y(N) | b] p(b)    (8.5.54)

considered as a function of b looks like a narrow pulse compared with the probability density p[X(N), r(N), b]. Since p(b) does not change with N, this applies also to p[Y(N) | b]. Thus, we have a situation similar to the one shown in Figure 8.19a, with p[Y(N) | b] playing the role of p(x) and p[X(N), r(N), b] the role of p(r|x).
Then from equation (8.5.53) it follows that

p[X(N), r(N), Y(N)] = C p{X(N), r(N), b*[Y(N)]}    (8.5.55)

where C does not depend on X(N) and b*[Y(N)] is the centre point of the "impulse like" probability density p[Y(N) | b] considered as a function of b. From (8.5.12) it follows that b*[Y(N)] has the meaning of the decision of the maximum likelihood rule, which, under the conditions discussed in Section 8.5.1, has an optimal character. Therefore b*[Y(N)] can be considered to be an optimal estimate of the quasi-stable components of the state of the information system's environment. Using the definition (4.4.7b) of conditional probabilities we can write the conditional probability of the working information in the form

p[X(N) | r(N), Y(N)] = C p[X(N), r(N), Y(N)]
(8.5.56a)
where C does not depend on X(N). From (8.5.55) and (8.5.56a) we finally get

p[X(N) | r(N), Y(N)] = C p{X(N), r(N), b*[Y(N)]}
(8.5.56b)
From this equation it follows that the maximum conditional probability decision about the working information X(N) is calculated from the newly arrived piece r(N) as if the state component b of the environment were known exactly and equal to b*[Y(N)]. Thus, the whole system operates as a hierarchical system with separate processing of the actual working information and of the information about the state of the environment of the working information system. Our argument proves that on quite general assumptions the hierarchical information system shown in Figure 1.26, with subsystems operating according to the rules specified here, is an almost optimal system.

8.5.4 UNIVERSAL PERFORMANCE ESTIMATIONS OF OPTIMAL RECOVERY RULES

In more complicated cases the analysis, not to say the implementation, of the optimal information recovery rule may not be feasible, and sub-optimal or heuristically found rules are used. Even in such situations the knowledge of the performance of the optimized rule is very useful, because it provides a reference for the performance of non-optimal rules. There have been many attempts to derive universal estimates of the performance of the optimal information processing rule without deriving the rule explicitly. We describe here a class of such estimates based on a general property of entropy. To simplify the argumentation we assume that the information is continuous and 1-dimensional. We denote

D[p_1(·), p_2(·)] = ∫_{−∞}^{∞} [−log₂ p_2(x)] p_1(x) dx − ∫_{−∞}^{∞} [−log₂ p_1(x)] p_1(x) dx    (8.5.57)

The mentioned property of entropy is:

For any pair p_1(·) and p_2(·) of densities of probability

D[p_1(·), p_2(·)] ≥ 0    (8.5.58)

and the equality holds when and only when p_1(x) = p_2(x).

In view of this property we may interpret D[p_1(·), p_2(·)] as an indicator of the distance of p_2(·) from p_1(·). Therefore, it is called the entropy distance. We consider the information system shown in Figure 8.21.
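Before turning to Figure 8.21, a small numerical sketch of the entropy distance (8.5.57)-(8.5.58); the grid, the gaussian test densities, and the function name are assumptions of this example:

```python
import numpy as np

def entropy_distance(p1, p2, dx):
    """D[p1, p2] of (8.5.57), here written as the integral of p1*log2(p1/p2),
    for densities tabulated on a grid with step dx."""
    m = p1 > 0
    return float(np.sum(p1[m] * np.log2(p1[m] / p2[m])) * dx)

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
gauss = lambda t, mean, s: np.exp(-(t - mean) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
p1, p2 = gauss(x, 0.0, 1.0), gauss(x, 1.0, 2.0)

print(entropy_distance(p1, p2, dx))   # strictly positive, property (8.5.58)
print(entropy_distance(p1, p1, dx))   # zero: the equality case p1 = p2
```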
Figure 8.21. The information system with preprocessing of the available information before making the ultimate decision about the working information.

To simplify the argumentation we assume that the indicator of performance in a concrete situation is q(x, x*) = (x − x*)². Using the fundamental property (8.5.58) of the entropy distance it can be shown (see Seidler [8.24]) that for the statistical indicator Q[X*(·)] of performance of the ultimate information processing rule X*(·), given by (8.3.1) with the square error weight (8.1.5), we have

Q[X*(·)] ≥ [1/(2πe)] exp₂[2H(x|r) + D̄_dec]    (8.5.59)

where H(x|r) is the average conditional entropy of the primary information on the condition that the preprocessed information r delivered by the fundamental information processing subsystem (we call it here the channel) is known, and D̄_dec is the average entropy distance of an auxiliary probability density p_2(x) = p_dec(x) from the conditional probability density p_1(x) = p(x|r). As the auxiliary probability distribution p_dec(x) we take a probability density that (1) can be calculated when only the ultimate decision of the information recovery rule X*(·) and the indicator of performance of this rule are known, and (2) has the minimum entropy distance from the conditional probability density p(x|r). The auxiliary probability density can be interpreted as the best choice of an observer who knows only the decision about the primary information produced by the rule X*(·) and the average performance of this rule, and is asked to give a possibly exact estimate of the conditional probability density p(x|r).

The difference

D[W(·)] = H(x|r) − H(x|R) = I(x;R) − I(x;r)    (8.5.60)

has the meaning of the loss of statistical information caused by the preliminary information preprocessing, before the ultimately recovered information is produced. This definition is illustrated in Figure 8.21. Using the loss D[W(·)] we write (8.5.59) in the form

Q[X*(·)] ≥ [1/(2πe)] exp₂[2H(x|R) + D[W(·)] + D̄_dec]
(8.5.61)
Equation (8.5.61) shows that the average D̄_dec has the same effect on the conditional entropy as the loss of statistical information D[W(·)] caused by the preliminary processing of the available information. Therefore, we call D̄_dec the loss of statistical information caused by taking an ultimate decision.
Figure 8.22. Illustration of the relationship between the performance of the optimized rule of information recovery and the capacity of the channel delivering the information about the primary information.

Let us consider the worst case, when the available preprocessed information delivers no information. Then we can still use the statistical information about the a priori probability distribution of the primary information. In this case the bound corresponding to (8.5.61) is

Q_wr = [1/(2πe)] exp₂[2H(x) − D[W(·)] − D̄_dec]    (8.5.62)

The subscript "wr" should remind us that this is the worst-case estimate of the performance of the rule producing the ultimate information. The ratio
Q_n[X*(·)] = Q[X*(·)] / Q_wr
(8.5.63)
has the meaning of the normalized index of performance of the ultimate information processing rule. From (8.5.61) and (8.5.62) we get

Q_n[X*(·)] ≥ exp₂{2[H(x|R) − H(x) + D[W(·)] + D̄_dec]}
(8.5.64)
The statistical amount of information is

I(x;R) = H(x) − H(x|R)
(8.5.65)
From the definition of the channel capacity C_ch we have

I(x;R) = C_ch − D
(8.5.66)
where D ≥ 0. Using (8.5.65) and (8.5.66) we write (8.5.64) in the form

Q_n[X*(·)] ≥ exp₂{−2[C_ch − D[W(·)] − D̄_dec − D]}
(8.5.67)
To calculate the right-hand side of this inequality we would have to calculate the average loss of information D̄_dec caused by taking ultimate decisions. To do this we would have to specify the ultimate information recovery rule X*(·), and this is what we try to avoid.
However, from the fundamental property (8.5.58) of the entropy distance it follows that D̄_dec ≥ 0, and from (8.5.66) it follows that D is also non-negative. Thus, from (8.5.67) we get

Q_n[X*(·)] ≥ exp₂{−2[C_ch − D[W(·)]]}    (8.5.68)

COMMENT
Equation (8.5.68) shows that the channel capacity and the loss of information caused by the preliminary processing of the available information, before the transformation producing the ultimate information delivered to the superior system, provide the ultimate limit for the normalized performance of the system. However, this is only a lower bound that can be reached only for specific performance indicators and specific statistical properties of the available information. Such a pair is the mean square error and the gaussian probability distribution. In general, the discussed lower bound should be considered as a pessimistic estimate of the performance of an optimal information recovery rule.
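A one-line numerical check of the gaussian case mentioned in the comment above (a sketch under the assumption D̄_dec = 0 and using the form of (8.5.59) as written above; the variances are illustrative): for jointly gaussian x and r with the square error weight, the entropy bound coincides with the mean square error of the optimal rule.

```python
import numpy as np

# x and r = x + noise, jointly gaussian; illustrative variances.
var_x, var_noise = 1.0, 0.25
var_x_given_r = var_x - var_x**2 / (var_x + var_noise)          # mse of the optimal rule
H_x_given_r = 0.5 * np.log2(2 * np.pi * np.e * var_x_given_r)   # conditional entropy, bits

bound = (1.0 / (2 * np.pi * np.e)) * 2 ** (2 * H_x_given_r)     # right side of (8.5.59), D_dec = 0
print(var_x_given_r, bound)   # both 0.2: the bound is attained in the gaussian case
```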
8.6 OVERALL OPTIMIZATION OF INFORMATION SYSTEMS

An information system is structured. The prototype system has a chain structure, and an intelligent system has in addition a vertical (hierarchical) structure. Such systems have been discussed in Sections 1.1 and 1.6, and their concrete examples have been presented in Chapter 2, in Section 6.6.1, and in the previous section. With the exception of very simple systems, the optimization problem of a structured system must be decomposed into a set of problems of optimization of the component subsystems. The cooperation between the subsystems is taken into consideration by introducing indicators of performance of subsystems, or constraints, that take into account the properties of the other cooperating subsystems. Such a procedure is called decomposed system optimization.

The first section considers the optimization of the preliminary information processing in the prototype chain system shown in Figure 1.2 such that the quality of the optimal information recovery at the end of the chain is as good as possible. In the second section we consider subsystems that provide their superior information system with information about the state of its environment, so that the superior system can operate in an intelligent way. We discuss two study cases and we formulate general guidelines for the optimization of such state information subsystems.

8.6.1 THE OVERALL OPTIMIZATION OF THE PROTOTYPE INFORMATION SYSTEM

Usually the subsystem performing the fundamental transformation in the prototype system shown in Figure 1.2 is fixed (see the discussion in Section 1.1.4). Then the overall optimization is the joint optimization of the preliminary transformation performed before the fundamental transformation and of the ultimate transformation of the information produced by the fundamental transformation. In the previous two sections the optimization of the ultimate transformation producing the recovered information has been discussed on the assumption that the properties of the available information are fixed. In fact, we can change them by changing the rule of
preliminary transformation of the primary information delivered by the information source before it is put into the fundamental processing subsystem. Here we concentrate on the optimization of the preliminary transformation, taking into account the subsequent information recovery.

The basis for a systematic approach to the optimization of a preliminary transformation which takes into account the subsequent recovery has been presented in Section 8.1.4. Concrete examples of such optimization, using partially heuristic arguments, have also been given. In particular, in Section 7.3 the optimization of dimensionality reduction such that the quality of the optimal recovery is as good as possible has been discussed. We also applied a rudimentary version of joint optimization in Example 8.4.1 by showing that it is optimal to shape the signals put into the channel so that the potential forms of noiseless signals at the channel's output are antipodal signals. Similarly, the optimal character of orthogonal noiseless signals has been discussed in Section 8.4.3. The generalization of these examples is the optimization of the preliminary transformation using as criterion a performance indicator produced by a dependence removing transformation, discussed in Section 8.1.4.

In this section we concentrate on the optimization of quantization. However, we also give a short review of problems of optimization of transformations that shape the information put into a channel causing distortions. A typical example of such a transformation is error correcting coding.

OPTIMIZATION OF SCALAR QUANTIZATION WHEN EXACT STATISTICAL INFORMATION IS AVAILABLE

We consider here the optimization of the scalar quantization rule described in Section 1.5.4 and illustrated in Figure 1.18. We assume first that the exact statistical information is available, and as the performance criterion we take the overall mean square error

Q[X*(·), V(·)] = E{x − X*[V(x)]}²
(8.6.1)
where V(·) is the quantization rule, X*(·) the information recovery rule, and x the random variable representing the primary information. The quantization and recovery rules are characterized by the set

X̄ = {x̄_l; l = 0, 1, 2, .., L}    (8.6.2)

of thresholds and the set

X* = {x*_l; l = 1, 2, .., L}    (8.6.3)

of values of the recovered information (see Figure 1.18). Therefore, instead of Q[X*(·), V(·)] we write briefly Q(X*, X̄). In Section 8.1.4 we indicated that to define, on the basis of Q(X*, X̄), a performance indicator for the set X̄ of thresholds, we have to remove the dependence of Q(X*, X̄) on X* = {x*_l; l = 1, 2, .., L}. We assume here that the information recovery rule is matched to the quantization rule. Then the minimization of the distortions with respect to X* is the dependence removing operation. In Section 7.5.1 we derived equations (7.5.17) and (7.5.24) giving the optimal potential forms of the recovered information. Substituting those values in Q(X*, X̄) we get the indicator of performance of a threshold set X̄. However, such a typical
decomposed optimization would be quite tedious. It is more convenient to go back to the primary criterion Q(X*, X̄) and to analyze the conditions for simultaneous optimization with respect to X̄ and X*. From (8.2.5) it follows that the pair of optimal sets is a solution of the equation

grad_{X*,X̄} Q(X*, X̄) = 0    (8.6.4)

We write this equation as a pair of two sets of equations

grad_{X*} Q(X*, X̄) = 0    (8.6.5)

grad_{X̄} Q(X*, X̄) = 0    (8.6.6)

or equivalently

∂Q(X*, X̄)/∂x*_l = 0,  l = 1, 2, .., L    (8.6.7)

∂Q(X*, X̄)/∂x̄_l = 0,  l = 1, 2, .., L    (8.6.8)
To calculate the partial derivatives we need an explicit formula for Q(X*, X̄). It is given by equation (7.5.18), which in the notation now used takes the form

Q(X*, X̄) = E{x − X*[V(x)]}² = Σ_{l=1}^{L} ∫_{x̄_{l−1}}^{x̄_l} (x − x*_l)² p(x) dx    (8.6.9)

Substituting this in (8.6.7), after some algebra we get

x*_l = [∫_{x̄_{l−1}}^{x̄_l} x p(x) dx] / [∫_{x̄_{l−1}}^{x̄_l} p(x) dx],  l = 1, 2, .., L    (8.6.10)

As could be expected, this is equivalent to the previously derived center of gravity rule (7.5.17). Substituting (8.6.9) in (8.6.8) and differentiating we get

x̄_l − ½(x*_{l+1} + x*_l) = 0    (8.6.11)

Section 1.5.4 indicated that quantization can be achieved by the next neighbor transformation described by (1.5.13). The thresholds are then determined by the reference points x_l, l = 1, 2, .., L of the NNT through equation (1.1.18), which in the present notation takes the form

x̄_l = ½(x*_{l+1} + x*_l)    (8.6.12)

Comparing this with equation (8.6.11) we see that

The optimal quantization rule is a NNT transformation with reference points related to the thresholds by equation (8.6.12). The NNT produces an optimal quantization when a reference point is simultaneously the optimally recovered primary information.
(8.6.13)
Thus, the condition of optimality is

x_l = x*_l,  l = 1, 2, .., L    (8.6.14)

where x_l are the reference points of the NNT and x*_l the optimally recovered values given by (8.6.10). It is evident that

For a uniform probability density the uniform quantization, with the centers of the aggregation intervals as the potential forms of the recovered information, is optimal. (8.6.15)

An explicit general solution of the sets of equations (8.6.10) and (8.6.12), which are equations with variables as limits of integration, is not possible. However, it is possible to obtain (for a simple derivation see Judell, Scharf [8.25]) an approximate solution when the number L of forms of quantized information is so large that the quantization intervals are so small that the probability density p(x) can be approximated inside a quantization interval by a linear function. The lengths of the optimal quantization intervals are

Δx_l = A₄ / ∛p(x_cl)    (8.6.16)

where x_cl is the center of the lth optimal quantization interval and A₄ is a normalizing constant. The corresponding minimal mean square error is

Q(X̄_min) = [1/(12L²)] [∫ ∛p(x) dx]³    (8.6.17)
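As a quick check of the approximation (8.6.17) as written above (a worked example added here, not taken from the book): for the uniform density p(x) = 1/d on an interval of length d we have ∫ ∛p(x) dx = d · d^{−1/3} = d^{2/3}, so (8.6.17) gives Q(X̄_min) = d²/(12L²), which is exactly the mean square error Δ²/12 of uniform quantization with step Δ = d/L, in agreement with conclusion (8.6.15).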
As the quality indicator we take the normalized mean square error

Q' = Q(X̄_min)/σ_x²
(8.6.18)
and as the cost indicator we take the volume v of the quantized information

v = log₂ L
(8.6.19)
The diagrams of the normalized mean square error versus the volume of quantized information for the uniform and gaussian distributions are shown in Figure 8.23.
Figure 8.23. The trade-off of the normalized mean square error Q' versus the volume v = log₂L of quantized information for the optimal scalar quantization; dashed line: approximation (8.6.17), continuous line: Lloyd algorithm (8.6.27).
COMMENT 1
Figure 8.23 shows that for a given number of potential forms of quantized information the minimal recovery distortions are significantly smaller for the uniform probability distribution than for the gaussian one. For other probability distributions the advantage of the uniform distribution is still larger. This may suggest that it would be convenient to transform the primary information into information with a uniform probability density and to quantize the transformed information uniformly. It can be easily proved that the random variable F(x), where

F(x) = ∫_{−∞}^{x} p(u) du    (8.6.20)
and p(x) is the density of the random variable x, has a uniform probability density. Therefore, F(·) is called a "uniform making" transformation. However, it can also be proved that uniform quantization of the variable produced by the uniform making transformation does not produce quantization intervals equivalent to the optimal intervals given by (8.6.16). Thus, contrary to decorrelation, making the probability distribution uniform does not bring advantages for the subsequent processing.

The basic conclusions (8.6.13) can be generalized to the case when the primary information is K-DIM vector information. The multidimensional generalization of equation (8.6.10) is equation (7.5.24) derived in Section 7.5.1. In the case of vector quantization the aggregation sets are separated by surfaces. However, the optimality condition (8.6.14) can be generalized by the following argument. Suppose that a point x on the surface separating two aggregation sets A_l and A_m is located closer to the point x*_l representing the lth potential form of the recovered information than to the point x*_m. It can be easily proved that the contribution of x to the mean square error given by (7.5.2) is decreased if x is included in the aggregation set A_l. Thus the condition of optimality is that the surface separating the aggregation sets A_l and A_m is a segment of the plane that is perpendicular to the interval <x*_l, x*_m> and goes through its center. This proves the K-DIM generalization of conclusions (8.6.13).

The calculation in closed form of the integral (7.5.2) giving the mean square error of recovery is in general not possible. However, when the dimensionality K of the quantized information is large, the K-DIM generalization of the relative volume V_r(m, Q) of continuous information, defined by equation (6.6.29) in Section 6.6.3, gives insight into the performance of the optimal vector quantization without specifying the quantization rule. We denote by X = {x(k); k = 1, 2, .., K} the K-DIM random variable representing the continuous vector information. We assume that the random variables x(k) representing the components are statistically independent and have the same probability density. We denote briefly V_r(Q) = V_r[x(1), Q], Q > 0. From the definition (6.6.29) of the relative volume it follows that V_r(Q) is a decreasing function of Q. Thus, an inverse function Q*(v) exists. For example, the inverse function of the relative entropy given by equation (6.3.35) is

Q*(v) = σ_x² 2^{−2v}
(8.6.21)
We denote by

R″ = (log₂L)/K    (8.6.22)

Q″ = E (1/K) Σ_{k=1}^{K} [x(k) − x*(k)]²    (8.6.23)

the volume of the quantized K-DIM information, respectively the mean square recovery error, normalized with respect to the dimensionality. It can be shown (see e.g., Proakis [8.13]) that

For Q > Q*(R″) and for a sufficiently large K such a set of reference points of a NNT transformation can be found that the normalized mean square error of the optimal recovery is

Q″ = Q + A 2^{−K α(Q, R″)}    (8.6.24)

where A is a constant and the coefficient α(Q, R″) is the larger, the larger the difference R″ − H*(Q). When this difference goes to zero, α(Q, R″) also goes to zero. If Q < Q*(R″), then for large K it is not possible to make the difference Q″ − Q small.

The condition Q > Q*(R″) is equivalent to the condition R″ > H*(Q). Thus the interpretation of (8.6.24) is:

For a large dimensionality K of the primary vectors,

Q_min ≈ Q*((log₂L)/K)    (8.6.25)

is an approximation of the normalized mean square error for the optimal quantization producing quantized information which can take L forms.

For example, for gaussian information we use equation (8.6.21) and we get

Q_min = σ_x² 2^{−2(log₂L)/K} = σ_x² L^{−2/K}    (8.6.26)
The assumptions that the components are independent and have the same probability distribution have been introduced to simplify the argument. In general we have to use in (6.6.28) the amount of information for K-DIM variables and the relative volume per dimension, defined similarly to the entropy by equation (4.6.16).

ALGORITHMS FOR FINDING THE OPTIMAL QUANTIZATION RULE

In general a solution of the set of equations (8.6.10) and (8.6.12), and even more so of their K-DIM generalizations, cannot be obtained in closed form. However, the form in which equations (8.6.10) and (8.6.12) are written suggests interpreting them as conditions that the functions on their left-hand sides take the value zero. This in turn suggests the application of the iterative algorithms for finding a zero point, which have been described in Section 8.2.2. A straightforward modification of the procedure described on page 395 leads to the Lloyd algorithm:

STEP 1. Take an initial set X̄(1) = {x̄_l(1); l = 1, 2, .., L} of thresholds and calculate from (8.6.10) the corresponding set X*(1) = {x*_l(1); l = 1, 2, .., L} of the optimally recovered forms of information.

STEP j. Consider the set X*(j−1) = {x*_l(j−1); l = 1, 2, .., L} as the set of reference points of a NNT, calculate from (8.6.12) the new set X̄(j) of thresholds, and from (8.6.10) calculate the new set X*(j) = {x*_l(j); l = 1, 2, .., L} of recovered forms of information.

STOPPING RULE. Similar to rule 7 on page 395.
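A minimal sketch of the Lloyd algorithm described above, for a density tabulated on a grid (the function name, the grid resolution, and the gaussian example are assumptions of this sketch, not part of the book):

```python
import numpy as np

def lloyd_quantizer(pdf, x_min, x_max, L, iters=200, grid=20001):
    """Alternate the center-of-gravity rule (8.6.10) and the midpoint rule (8.6.12)."""
    x = np.linspace(x_min, x_max, grid)
    dx = x[1] - x[0]
    p = pdf(x)
    p = p / (p.sum() * dx)                         # normalized discretized density
    thr = np.linspace(x_min, x_max, L + 1)         # STEP 1: initial thresholds
    levels = np.empty(L)
    for _ in range(iters):                         # STEP j
        for l in range(L):                         # (8.6.10): centers of gravity
            m = (x >= thr[l]) & (x <= thr[l + 1])
            levels[l] = (x[m] * p[m]).sum() / p[m].sum()
        thr[1:-1] = 0.5 * (levels[:-1] + levels[1:])   # (8.6.12): midpoints
    # mean square error of the resulting quantization and recovery rules
    idx = np.clip(np.searchsorted(thr, x, side="right") - 1, 0, L - 1)
    mse = (((x - levels[idx]) ** 2) * p).sum() * dx
    return thr, levels, mse

# Example: L = 8 levels for a gaussian density (range truncated to +-5 standard deviations).
gauss = lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)
thresholds, levels, mse = lloyd_quantizer(gauss, -5.0, 5.0, 8)
print(np.round(levels, 3))
print(round(float(mse), 4))
```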
OPTIMIZATION OF SCALAR QUANTIZATION WHEN ONLY A TRAIN OF PIECES OF INFORMATION IS AVAILABLE

We assume that only a train X = {x(n); n = 1, 2, .., N}, x(n) ∈ <x̄_0, x̄_L>, is available. As the performance indicator we take

Q'[X*(·), V(·)] = A{x(n) − X*[V(x(n))]}²    (8.6.27)

where A is the operation of arithmetical averaging. In the notation now used, the optimal recovery rule (7.5.36) derived in Section 7.5.1 is

x*_l = [1/L(A_l)] Σ_{x(n)∈A_l} x(n)    (8.6.28)

where L(A_l) is the number of elements of the aggregation set A_l. We optimize the quantization rule, thus the aggregation sets, similarly as in the case when exact statistical information was available. We use the 1-DIM version of the argument presented on page 457. If there were an aggregation set A_l and an element x(n*) ∈ A_l such that

|x(n*) − x*_{l−1}| < |x(n*) − x*_l|    (8.6.29)

then by shifting this point to the neighbouring aggregation interval we could decrease the average square error Q'[X*(·), V(·)]. Therefore, in the considered case the condition (8.6.12) must also be satisfied, thus it must be

x̄_l − ½(x*_{l+1} + x*_l) = 0    (8.6.30)
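A corresponding sketch for the case considered here, where only a train of pieces of information is available (the initialization by quantiles and the gaussian test train are assumptions of this example); the recovered values follow (8.6.28) and the thresholds follow (8.6.30):

```python
import numpy as np

def empirical_lloyd(train, L, iters=100):
    """Lloyd iteration driven only by a train: arithmetic means over the aggregation
    sets as recovered values (8.6.28), midpoints as thresholds (8.6.30)."""
    train = np.asarray(train, dtype=float)
    thr = np.quantile(train, np.linspace(0.0, 1.0, L + 1))   # initial thresholds
    for _ in range(iters):
        idx = np.clip(np.searchsorted(thr, train, side="right") - 1, 0, L - 1)
        levels = np.array([train[idx == l].mean() for l in range(L)])
        thr[1:-1] = 0.5 * (levels[:-1] + levels[1:])
    idx = np.clip(np.searchsorted(thr, train, side="right") - 1, 0, L - 1)
    return thr, levels, float(np.mean((train - levels[idx]) ** 2))

rng = np.random.default_rng(2)
thr, levels, q = empirical_lloyd(rng.normal(0.0, 1.0, 100_000), 8)
print(np.round(levels, 3), round(q, 4))
```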
In view of the presented analogies with the case when exact statistical information is available, the modification of the Lloyd algorithm is straightforward. Similarly, we optimize the vector quantization when exact statistical information is not available.

OPTIMIZATION OF SHAPING THE CHANNEL INPUT SIGNAL

Shaping the primary information so that it can be best transmitted by a given communication channel is, besides quantization, another important preliminary transformation of information. For decades the research in this area has been very intensive, and several methods of design and optimization of the rules of transmitter operation have been developed. Many of these methods illustrate the advantages of structuring the signals put into the channel and of using various types of auxiliary information about the state of the environment in which a communication system operates. Concrete examples illustrating this have already been presented in Sections 8.4.2 and 8.4.3. Here the general effects of increasing the degree of structuring, particularly the dimensionality of the transmitted signals, are discussed.

We consider the transmission of discrete information under the symmetry assumptions: the potential forms of the working information exhibit statistical regularities, have the same probability, and the distortion indicator is symmetric (given by (8.1.2)). Then the typical indicator of performance of the rule of shaping the channel input signals is the probability of error of the optimal information recovery

P_eor = P(x* ≠ x)
(8.6.31)
First we assume that the channel is binary, symmetric, and memoryless, as described in Section 5.4.4, page 241. Thus, the transmitted signals are binary blocks; we called them code words. Then, as shown in Example 8.3.4, the NNT recovery rule (8.3.51) is optimal. We denote by W the set of code words. In view of the symmetry assumptions it is irrelevant which reversible rule of assigning a code word to a potential form of the working information is used. Therefore, the optimization of shaping the channel input signals reduces to the optimization of the set W of code words, thus to the statistical optimization problem with the set W as the optimized variable, the error probability P_eor as the performance indicator, and a set of constraints imposed on the code words. Typical are implementation constraints, particularly that the code is a parity check code, described in Section 2.1.2. After very intensive research during the past two decades the solutions of such optimization problems are well known (for general information and citations see pages 87 and 88) and widely used in practice. Here we concentrate on universal properties of optimal codes.

In view of the symmetry assumptions and of implementation considerations, code words of the same length N are considered. A comprehensive characteristic of the set W of such code words is the volume of working information (see Section 6.1.1) per binary element of a code word

R = v/N = (log₂L)/N    (8.6.32)

where L is the number of potential forms of the working information. The universal relationship between the parameter R and the rough description of the assumed channel by its capacity C₁ per binary signal, defined by equation (5.4.31), is:

If C₁ > R, then for a sufficiently large N it is possible to find such a set W_o of code words that

P_eor(W_o) ≤ A₁ 2^{−A₂ N α(R, C₁)}    (8.6.33)

where A₁, A₂ are two constants and α(R, C₁) > 0 is a coefficient that, for R growing from 0 to C₁, decreases from its initial value α(0, C₁) to 0 for R → C₁. If C₁ < R, then it is not possible to make the error probability arbitrarily small, however large N is taken.
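A small helper sketch for the condition C₁ > R of the theorem (an assumption of this example: the capacity per binary symbol of the binary symmetric memoryless channel is taken in its standard form C₁ = 1 − H(p), which is assumed here to coincide with the definition (5.4.31)):

```python
import numpy as np

def bsc_capacity(p):
    """Capacity per binary symbol of a binary symmetric memoryless channel, 1 - H(p)."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * np.log2(p) + (1 - p) * np.log2(1 - p)

# Coding-theorem condition: R = log2(L)/N must stay below C1.
k, N = 200, 300            # 2**k potential forms of working information, block length N
R = k / N
C1 = bsc_capacity(0.05)
print(f"R = {R:.3f}, C1 = {C1:.3f}, reliable transmission possible: {R < C1}")
```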
As indicated in Section 6.1.1, page 254, the cost of hiring a binary channel is usually proportional to the length N of the transmitted block. Therefore, we try to keep 1/R small, thus R large. From the definition (8.6.32) it follows that N can then be large only when the number L of potential forms of the working information is large too. If this is not the case, we must assemble a sufficient number of pieces of working information into a block to satisfy the condition of the coding theorem (8.6.33).

A modified version of the theorem holds also for the system with feedback using blocks of length N for a single transmission. As discussed in Section 8.3.3, the performance of a feedback system is described by the ultimate error probability and by the probability of making a disqualifying decision. It can be shown that if C₁ > R then both probabilities decrease exponentially with the level N of structuring of a single transmission, and for C₁ close to R the coefficient α_fb(R, C₁) is substantially larger than for the previously discussed system without feedback. Thus,

Compared with the open system, the system using feedback information allows us to decrease substantially the error probability of working information recovery. However, as in the case of the optimal open system, the performance of the optimal system with feedback is good only if the capacity of the forward channel C₁ > R; in other words, the feedback does not change the capacity of the system.

The modifications of the coding theorem (8.6.33) hold not only for wide classes of discrete channels other than the considered memoryless binary channel. They hold also for continuous channels, particularly for the gaussian continuous channel using signals of duration T and bandwidth B described on page 241. The considerations in Section 7.4.3 suggest that in this case we should take N = 2TB as the indicator of the structuring level and take R = (log₂L)/2TB. Then the direct counterpart of (8.6.33) holds for the gaussian channel.

For the orthogonal signals considered in Section 8.4.3 we have a specific situation. Using (8.4.41) we obtain R = (log₂2TB)/2TB. From a well known result of calculus it follows that (log₂2TB)/2TB → 0 when 2TB → ∞, as we assumed in Section 8.4.3. Therefore, for orthogonal signals we have to take R = (log₂L)/T, as we did in Section 8.4.3. Then the conclusion (8.4.52) and equation (8.4.53) are counterparts of the coding theorem (8.6.33), and the conclusions formulated on this and the previous page apply to orthogonal signals.

8.6.2 THE OPTIMIZATION OF THE SUBSYSTEM PROVIDING INFORMATION ABOUT THE STATE OF THE MAIN SYSTEM'S ENVIRONMENT

In the introductory Section 1.1 and in Section 1.7 it was indicated that the efficiency of information processing can be increased by using auxiliary information about the state of the environment of an information system. In particular, the auxiliary information makes intelligent operation of an information system possible. In Chapter 6, Chapter 7, and in the previous sections of this chapter we discussed the utilization of various types of state information in specific systems. We now summarize and generalize those considerations.

We use two complex systems to illustrate the basic features of systems with state information subsystems. The first is the intelligent data transmission system described in Section 2.2 and shown in Figure 2.9. The second is the system with a
common channel described in Section 2.3.1 and shown in Figure 2.11. Although the systems are different, the character of the dependence of their performance on the basic features of the used state information is similar. This suggests generalizations which allow us to formulate guidelines for the design of a state information subsystem.

UTILIZATION OF STATE INFORMATION IN A DATA TRANSMISSION SYSTEM

Analyzing the effects of state information in the system shown in Figure 2.9 we make the same assumptions as in Examples 8.3.1-8.3.3. In particular, we assume that

A1. The signal at the output of the working channel is r(t) = w(x_i, t) + z(t), t ∈ <0, T>.
Figure 8.24 shows that using even the simple binary feedback we can dramatically improve the performance of working information transmission. To achieve a similar performance with the best possible open system applying orthogonal signals and block operation, we have to use a very high structuring level. Thus, the concrete state information which controls the actions of the transmitter is more effective than the structure built into the signals of an open system. However, using feedback information requires a feedback channel and introduces indeterminate delays. Also, as indicated in the conclusion on page 461, the minimum capacity required to achieve the improvement is the same as for the open system.

UTILIZATION OF STATE INFORMATION IN MULTIPLE ACCESS SYSTEMS

As the second complex information system we consider the system with several information sources using a common channel, shown in Figure 2.11. We assume:

A1. The primary information delivered by a local source is a train of blocks interleaved with pauses, as described in Section 6.3.1.
A2. Statistical information about the instants of arrival of the packets and about their lengths is available; the Poisson-exponential model discussed in Section 5.2.1 is used.
A3. The arriving packets are stored in a buffer and, according to a transmission rule, taken out and put into the common channel.
A4. The decisions of the transmission rule are based on the following state information: (1) y_T(m)[CL(m)] about collisions of an own packet, (2) y_T(m)(CL) about all collisions, (3) y_T(m)(SYN) about the position of the time slots in which the packet should be put, (4) y_T(m)(CH_in) about the state at the input of the common channel (CSS systems), (5) y_CCS(T) about the state of the queues in the local transmitters (reservation systems, with the scheduling decisions taken at the central control unit).
A5. The indicator of distortions is the normalized delay

τ_n = τ̄/T̄    (8.6.37)

where τ̄ is the average delay caused by the system and T̄ is the average duration of a packet.
A6. The cost indicator is the normalized capacity of the common channel

C_n = C/S̄    (8.6.38)

where C is the common channel capacity and S̄ is the average intensity of the total transmitted information (in bit/s).

A brief description of systems using the various types of state information is given in Section 2.3. Such systems can be considered as systems performing real-time compression of trains arriving from remote sources. In general the analysis of multiple access systems is quite complicated (see e.g., Kleinrock [8.24], Seidler [8.25], Bertsekas, Gallager [8.26]). Diagrams of τ_n as a function of C_n, based on Seidler [8.25], are shown in Figure 8.24b. In the limiting case, when exact information y_CCS(T) about the states of the buffers of all users is available at a central control subsystem (see Section 2.3), it is possible to organize all arriving packets into a single virtual queue that behaves as a queue in a single buffer considered in Section 6.3. Therefore, the characteristic of the system using efficiently the information y_CCS(T) is the same as diagram (a) in Figure 6.9.
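A small numerical sketch of the limiting centralized case mentioned above (an assumption of this sketch: the single virtual queue is modelled as an M/M/1 queue of the Poisson-exponential type, for which the normalized delay is τ_n = C_n/(C_n − 1)); the values e and 2e quoted in the following paragraph are printed for comparison:

```python
import numpy as np

# Normalized delay of a single M/M/1 queue as a function of the normalized capacity C_n.
C_n = np.array([1.2, 1.5, 2.0, np.e, 4.0, 2 * np.e, 10.0])
tau_n = C_n / (C_n - 1.0)
for c, t in zip(C_n, tau_n):
    print(f"C_n = {c:5.2f}   tau_n = {t:6.2f}")

# Asymptotic minimal normalized capacities of the decentralized random-access systems
# quoted in the text: C_n -> e (slotted) and C_n -> 2e (unslotted).
print("e =", round(float(np.e), 3), "  2e =", round(float(2 * np.e), 3))
```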
If in a system using only the information y_T(m)[CL(m)] about own collisions the synchronization information y_T(m)(SYN) is additionally available, the performance of the system improves significantly. In particular, the asymptotic minimal required normalized capacity reduces from C_n = 2e to e. However, if besides the information y_T(m)[CL(m)] also the information y_T(m)(CH_in) is available, the improvement achieved by using the synchronization information y_T(m)(SYN) is small.

A similar effect of saturation occurs in data transmission systems. A typical example is the feedback information. Compared with the open system it improves the performance, but not the channel capacity (see the conclusion on page 461). There is also an effect of saturation when the level of structuring increases. The properties of the system with orthogonal signals discussed in Section 8.4.3 are an example. Those observations suggest the following generalizations:
• Using the information about the concrete and meta state of the environment of the working information system it is possible to achieve a substantial improvement of the performance of the working information system without increasing its fundamental processing resources.
• When the fundamental information processing introduces indeterministic distortions, the performance can also be improved by increasing the level of structuring of the signals carrying the information.
• An effect of saturation with state information occurs: when enough state information is already available, further increasing the volume of state information only slightly improves the performance of information processing.

Let us return to the general considerations in Section 1.6.1 about the performance of an information system. In view of the above properties of state information, the gross gain G⁺[T_wo(·)] brought by an intelligent information system which processes the working information according to the optimized rules T_wo(·) grows with the volume V_Y of the used concrete and meta information about the state of the environment, but exhibits a saturation, as shown in Figure 8.25.
Figure 8.25. Typical dependence of the gross gain G⁺[T_wo(·)] brought by an intelligent information system, of the cost G⁻[T_sis(·)] of building and running the state information subsystem, and of the net gain G_w,s brought by embedding the state information subsystem and running the working information system in an intelligent way, on the volume V_Y of the state information.
The cost G⁻[T_sis(·)] of implementing and running the state information subsystem grows with the volume V_Y of the state information. Thus the net gain

G_w,s = G⁺[T_wo(·)] − G⁻[T_sis(·)]    (8.6.40)

considered as a function of the volume of state information exhibits a maximum. In other words, there is an optimal volume V_Yo of the state information. If the volume of the used state information is smaller than V_Yo, the possibilities of improving the working information processing are not fully exploited; if it is larger, the additional cost of acquiring and processing the state information outweighs the improvement of the performance of the working information processing.
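A toy sketch of the trade-off (8.6.40) (the saturating-gain and linear-cost shapes below are assumptions of this illustration, chosen only to mimic Figure 8.25):

```python
import numpy as np

V = np.linspace(0.0, 10.0, 1001)      # volume V_Y of the state information
gross_gain = 1.0 - np.exp(-V)         # saturating gross gain (assumed shape)
cost = 0.12 * V                       # cost of the state information subsystem (assumed shape)
net_gain = gross_gain - cost          # counterpart of (8.6.40)

V_opt = V[np.argmax(net_gain)]
print("optimal volume V_Yo:", round(float(V_opt), 2))
print("maximal net gain   :", round(float(net_gain.max()), 3))
```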