For a given s ∈ ⟨š, ŝ⟩ we run a random binary number generator generating 1 with probability P(s, 1) or 2 with probability
P(s, 2) = 1 − P(s, 1).  (3.1.16)
We denote by l* ∈ {1, 2} the generated binary number. The fuzzy rough description of the primary state is s*(l*). The block diagram of the described transformation and typical probabilities P(s, l), l = 1, 2, are shown in Figure 3.2. Also shown are the aggregation sets (see Section 1.5.3) determined by the rule (3.1.16). Contrary to the aggregation sets corresponding to deterministic quantization shown in Figure 1.18a, the aggregation sets corresponding to randomized quantization overlap.
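As a minimal Python sketch (not from the text) of the randomized rule (3.1.16): the auxiliary probability P(s, 1) used below is an assumed, purely illustrative membership-like function, since its concrete form is not specified here.

```python
import random

def p_s1(s, s_low=0.0, s_high=1.0):
    """Assumed example of the auxiliary probability P(s, 1): it decreases
    linearly from 1 at s_low to 0 at s_high (illustrative choice only)."""
    s = min(max(s, s_low), s_high)
    return (s_high - s) / (s_high - s_low)

def randomized_quantizer(s):
    """Return the generated binary number l* in {1, 2}: 1 with probability
    P(s, 1), otherwise 2 with probability P(s, 2) = 1 - P(s, 1)."""
    return 1 if random.random() < p_s1(s) else 2

# A state near the middle of the interval may fall into either aggregation set,
# which is why the aggregation sets of randomized quantization overlap.
print([randomized_quantizer(0.45) for _ in range(10)])
```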
Figure 3.2. The generation of a fuzzy binary description s* of a primary continuous state s: (a) the block diagram, (b) typical auxiliary probabilities P(s, l), l = 1, 2, and the corresponding aggregation sets.
Instead of random numbers, in practice numbers produced by special deterministic algorithms that behave like random numbers are used. Such numbers are called pseudo-random numbers. The principles of operation of pseudo-random number generators are briefly discussed in Section 4.5. For details see Dagpunar [3.1], Yarmolik, Demidenko [3.2]; programs generating pseudo-random numbers can be found in Press et al. [3.3].

3.1.3 THE COURSE OF AN EXTERNAL STATE IN TIME

Till this point we considered the external state of a system at a given instant. Therefore, it is called the instantaneous state. The properties of a system in a time interval
Figure 3.3. An example of a state process with a multilevel hierarchical structure: speech waveforms. (a) The waveform of the sentence "every salt breeze comes from the sea"; (b), (c) fragments of the sounds S and A, respectively, magnified and expanded in time. Based on Flanagan et al., 1979 (IEEE).
We may quantize the instantaneous values of the state process, or we may quantize the time argument, or both (for the corresponding transformations of information see Section 1.5.4). Consider, for example, the time process described by (3.1.17). We take the train of sampling instants
t_n, n = 1, 2, ..., N.  (3.1.18)
The vector (set of samples)
s = {s(t_n), n = 1, 2, ..., N}  (3.1.19)
is the simplified description of the primary time-continuous state process. Besides discretization of either the value of the state process and/or the time, other rough descriptions of state processes are possible. For example, in mechanics and the theory of time-continuous systems, of fundamental importance is the rough description of an evolving state process s(·) by the set of derivatives (d^n s/dt^n) at the current instant t_c, n = 1, 2, ..., N. A counterpart of the rough description of the evolving train of states by derivatives is the description of the states of time-discrete systems by a set of difference quotients. However, their definition is quite cumbersome. Therefore, we often use the continuous approximation of the time-discrete state description (see the discussion on discrete approximation in Section 1.4.3).
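A small Python sketch of forming the sample vector (3.1.19) from a time-continuous state process; the sinusoidal process and the equally spaced sampling instants are assumed purely for illustration.

```python
import numpy as np

def sample_process(s_of_t, t_a, T, N):
    """Form the simplified description s = {s(t_n), n = 1, ..., N} of a
    time-continuous state process, sampled at t_n = t_a + n*T
    (an assumed, equally spaced version of the train (3.1.18))."""
    t = t_a + T * np.arange(1, N + 1)
    return t, s_of_t(t)

# Assumed example process: s(t) = sin(2*pi*t), sampled every 0.05 s
t, s = sample_process(lambda t: np.sin(2 * np.pi * t), t_a=0.0, T=0.05, N=20)
# Difference quotients as a rough counterpart of the derivative ds/dt
ds_dt = np.diff(s) / np.diff(t)
```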
3.1.4 THE SET OF POTENTIAL FORMS OF AN EXTERNAL STATE

Section 1.4 indicated that if exact information about the concrete state is not available, then the superior system can improve the efficiency of its purposeful actions by exploiting the properties of the set of potential forms that the state can take. This applies also to the set S of potential forms of the external state. Using state-oriented terminology, let us recapitulate the considerations of Section 1.4. The set S of potential forms of the external state of a system is described by
• the structure STR of a potential state, and
• the rules of membership MR that say which states belong to S.
The structure STR is described by the list of elementary components of a state and by rules saying how the components can be assembled into a description of a potential state. Those assembling rules (also called syntax) include universal relationships between the components. The basic types of those relationships are discussed in the forthcoming Sections 3.3 and 3.4. In the simplest case, the rule of membership is that every combination of elementary components is a potential state. Then the potential state is said to be unconstrained. Easiest is the description of MR when the set S is discrete. The membership rule is then described by the list of the possible states s_l, l = 1, 2, ..., L. If they have the structure of vectors, then the list has the structure of an array. If the states have the structure of an array, their list has the structure of a higher-ranking array. The sets of potential forms of practically all exact descriptions of external states of real objects are continuous. To make this statement meaningful we must define an indicator of distance between potential forms of states (see Section 1.4.3). As in the case of information discussed in Section 1.4.3, the sets of potential states may be unconstrained or constrained (see, e.g., Figure 1.12). The set of the potential forms which a potential state can take is a property of the system. Therefore, we call it the state of variety. To emphasize that the description of the state of variety includes the description of the structure of a concrete state and the description of the membership rules, instead of the short symbol S we denote the state of variety by S_VAR. Thus,
S_VAR = {STR, MR}.  (3.1.19)
3.2 THE INTERNAL STATE OF A SYSTEM 1: RELATIONSHIPS BETWEEN CONTINUOUS STATES

In most systems the value of an external state component at one instant is related to the value of this component at another instant by relationships that hold for all potential values of the state parameters. Such relationships are an objective property of the system. They are not directly accessible, but they manifest themselves through external states. Therefore, the relationships between external state components have been called the internal state of the system. If the components of the state are connected by relationships, then it is often possible to express some state components as functions of other components. The first components are called dependent components, the second free components. Dropping the dependent components, we simplify the description of the state without impairing its accuracy. The relationships between components of states induce relationships between corresponding components of information. Those relationships make loss-less compression of information possible, or they provide immunity to some types of distortions, similarly to the relationships built into code words of an error-correcting code (see Section 2.1.2). The relationships between components of the external state also cause some accessible state components to deliver information about other, inaccessible components. Thus, the knowledge of relationships between components of states is of great importance both for information processing and for utilizing information by a superior system.
Usually a hierarchy of relationships exists between external states of systems. On the top are the universal relationships that are formulated as axioms of logic and mathematics. On the lower level are the relationships holding for broad classes of systems considered in the natural sciences, particularly in physics. On the lowest level lie relationships holding for narrow classes of systems. The relationships may have the form of logical statements, equations, or inequalities. Explicit relationships express the dependent state components as functions of independent components, and implicit relationships are functions relating state components. Most relationships considered in physics have the form of differential equations relating time and space derivatives of state processes. One of the fundamental problems of the natural sciences is to transform a relationship in an implicit form into an explicit form (in particular, to find the "solution" of an equation). This is done by using the universal relationships of logic and mathematics.

For many systems we can divide the components of the state into two categories: the causing components, which can be considered as the primary agent generating the state, and the resulting components, which can be considered as the effect of the causing components. We call such relationships (and systems) causal. In technical terminology the caused state is called the product of the causing state, and causal systems are called production systems. In any real system the changes of the causing components precede in time the consequent changes of the resulting components. Such systems we call real-time systems. In the case of a causal relationship it is natural to take the causing components as the free components mentioned earlier. There are, however, situations when some state components cannot be considered as the result of the others. This applies to instantaneous states (such as positions of resting, noninteracting objects) and to state processes. Then the relationship is mutual.

The problem of relationships between components of states, and thus of models of systems, is very broad and we do not go into the details of analysis and synthesis of systems. This is the subject of several excellent books (see, e.g., Oppenheim, Willsky [3.5] on time-continuous linear systems, Oppenheim, Schafer [3.6] on time-discrete systems, and, on simulation of such systems, Alkin [3.7], Wolfram [3.8]). The goals of this and the following sections are to present the principles of describing relationships between external states and to describe the classes of relationships that are used in subsequent chapters for implementation or interpretation of information transformations. The considerations start with a general discussion of the types of interactions between components of systems. We first introduce the important concept of terminal interacting systems. Then we concentrate on causal relationships in systems which change their states continuously in time. As a representative example, the relationships between components of states of lumped electrical elements and networks built of them are considered. Although the examples are very simple, they allow both goals to be realized.
3.2.1 TERMINAL INTERCONNECTED SYSTEMS

A system is an assembly of objects tied together by mutual interactions. The interactions are established by space fields, such as mechanical, electromagnetic, or gravitational fields. We discuss here in more detail the types of interactions between the objects (subsystems) forming a system. The subsystems interact through space fields. The analysis of such interactions is usually very complicated, and their control and utilization difficult. Therefore, of great practical importance are systems whose components can interact only through small interfaces. Such systems are called terminal interacting systems. Most systems built by people can be considered as terminal interconnected systems. Examples range from electronic devices through machinery and large cities to worldwide communication and transportation systems. As a concrete example, a typical building can be considered as a system of rooms interconnected by doors that play the role of terminals. In Chapter 1 and in Chapter 2 we tacitly assumed that the considered information systems are terminal interconnected systems. This is the precondition for representing a system by a block diagram. The simplest interface is a point object called a point interface or point terminal. A subsystem that can interact with other subsystems only through point interfaces is called a terminal interacting subsystem. A system consisting of terminal interacting subsystems is called a terminal interacting system. The simplest terminal interacting subsystem is the black box with two point terminals shown in Figure 3.4. The box between the terminals represents the primary subsystem establishing relationships between the states of the terminals. Such relationships are discussed in the next section.
Figure 3.4. A two-terminal accessible system. U(n) is the nth terminal, n = 1, 2, and s(n) is its state.
Typical examples of objects that can be treated as terminal interacting subsystems are lumped electrical elements such as resistors, capacitors, coils, diodes, and transistors. The circuits built of these elements are examples of terminal interconnected systems. The electrical elements illustrate the limitations of point-terminal accessible objects. This model is only suitable if the physical dimensions of the elements are much smaller than the length of the shortest electromagnetic wave corresponding to the changes of the electrical state of the terminals. However, even for lower frequencies, often costly electrical and magnetic screening techniques must be used to design those systems so that they behave as point interacting. An important and broad class of terminal interacting systems are networks. They consist of two types of subsystems: nodes and connecting channels that enable the interactions between nodes that usually are located at distant places (see Figure 2.13a).
Typical networks are water, gas, electricity, and sewage networks. The elementary components of the medium processed in those networks have no identity. E.g., electricity produced by a power plant is fed into the network but not directed to a particular user. Of a different type are most transportation and communication networks, in particular the packet communication networks considered in Section 2.3.2. They process units (packets) that are characterised by their destination and often by their origin. To this point we have considered the interface between subsystems as a point object. However, we often have to take into account its finite size. A broad class of such interfaces are two-dimensional window interfaces, which may also be called gates. In biological systems, membranes often play the role of window interfaces. Harbors and airports are examples of spatial interfaces between the road network and the sea or, respectively, air transport systems. Also, many living organisms can be considered as terminal interconnected systems. The nervous system is essentially a point-terminal connected system. However, most organs interact through window-type interfaces.

3.2.2 RELATIONSHIPS BETWEEN TIME-CONTINUOUS STATES OF A TWO-TERMINAL SYSTEM

We start our discussion of internal states with a very simple but representative example of two-terminal electrical elements. We assume the following:
Al. The subsystems interact through a flow of electric charges;
A2. The interaction is effected only through terminals.
Assumption A2 is justified if the changes of the electrical state are slow. Such flows are called semistatic and the systems are called lumped. We consider here the elementary electrical systems shown in Figure 3.5 that can interact only through two terminals U(n), n = 1, 2. The next example deals with a simple network of such elementary subsystems.
Figure 3.5. A resistor and a condenser considered as two-terminal accessible systems.
Further we assume the following:
A3. At an instant t the interaction of terminal U(n) with terminals of another element is determined by the instantaneous potential v(n, t) and the instantaneous intensity of electrical current (briefly, intensity) i(n, t). The intensity is defined as the rate of change of the electrical charge q(n, t) that flowed into the terminal:
i(n, t) = dq(n, t)/dt.  (3.2.1)
From the assumptions it follows that
s(n, t) = {v(n, t), i(n, t)}  (3.2.2)
is the state vector describing the instantaneous electrical state of the terminal U(n). As the first lumped electrical element we take a resistor (see Figure 3.5). Many observations showed that its state parameters are connected by the universal relationships (called Ohm's law):
i(2, t) = −i(1, t),  (3.2.3a)
v(2, t) − v(1, t) = i(1, t)R,  (3.2.3b)
where R is the resistance of the resistor. It depends on the shape of the conductor and the electrical properties of the conducting material. If the conductor is a wire of cross-section S and length l, then
R = al/S.  (3.2.4)
As the second lumped electrical element we take the condenser. Many observations show that the relationship (3.2.3a) holds again, but there is no universal relationship between the instantaneous potential and current intensities. However, the potentials are related to the electrical charge q(n, t) on the electrode of the condenser connected with terminal U(n). The relationship is
q(1, t) = C[v(2, t) − v(1, t)],  (3.2.5)
where C is the capacity of the condenser. It depends on the shape of its electrodes and on the properties of the isolator between them. For a flat condenser we have
C = εS/D,  (3.2.6)
where S is the surface of the electrodes, D is the distance between them, and ε is the dielectric constant of the isolator between the electrodes. After differentiating both sides of (3.2.5) and using definition (3.2.1) we get
i(1, t) = C[dv(2, t)/dt − dv(1, t)/dt].  (3.2.7)
Let us assume that v(2, t_a) − v(1, t_a) = 0, where t_a is the instant when the observation of the capacitor begins. Integrating equation (3.2.7) gives
v(2, t) − v(1, t) = C^{-1} ∫_{t_a}^{t} i(1, τ) dτ.  (3.2.8)
From this we obtain
v(2, t) − v(1, t) = [v(2, t−Δ) − v(1, t−Δ)] + C^{-1} ∫_{t−Δ}^{t} i(1, τ) dτ.  (3.2.9)
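The integral relationship (3.2.8) can be checked numerically; the sketch below uses an assumed constant charging current and simple rectangle-rule integration, so it is an illustration rather than part of the derivation.

```python
import numpy as np

def capacitor_voltage(i, dt, C):
    """Potential difference v(2, t) - v(1, t) = (1/C) * integral of i(1, tau) dtau,
    starting from zero at the first sample (cf. (3.2.8)), by the rectangle rule."""
    return np.cumsum(i) * dt / C

dt, C = 1e-4, 1e-6                 # assumed values: 0.1 ms step, 1 uF
t = np.arange(0.0, 0.01, dt)
i = 1e-3 * np.ones_like(t)         # assumed constant 1 mA charging current
v = capacitor_voltage(i, dt, C)    # ramps linearly, as expected for a constant current
```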
This very simple example suggests a couple of generalizing comments.
COMMENT 1
To create a tractable model of real objects we must limit the class of considered real objects and situations (assumptions A1 and A2) and decide which components of their states are relevant for the considered class of superior systems (assumption A3). Those steps are essential and usually difficult. They must be based on the results of specific sciences outside the information sciences. Behind the seemingly simple assumptions A1 to A3 stands the huge body of sciences concerned with electrical phenomena. The described procedure is also called the choice of the universe of discourse.
COMMENT 2
The examples give more insight into our considerations about hierarchical structure and the concept of atomic objects. If we stay at the highest (least precise) level of system description, we have to determine the resistance or the capacity on the basis of many observations of the electrical states of the point terminals. If we take into account the macrostructure at a finer level and go inside the black box, we get the geometrical dimensions of the components and can use formulae (3.2.4) and (3.2.6). With the material constants a and ε we face a similar situation. Staying on the macro level, we can obtain their values by observations of states at this level. However, if we go to the atomic (in the sense of physics) level, we can derive the relationships from general relationships holding at the atomic level and, in addition, we can express the material constants ε and a in terms of characteristics of the material at the atomic level, such as the electrical charges of the elementary carriers or indices of their mobility.
COMMENT 3
The primary relationships between external states are not derived but are the result of generalizations of many observations of the external electrical states. It must be so, because the relationships are features of the real world existing independently of our reasoning.
COMMENT 4
The relationship between state components can be represented in different forms. We can pass from one form to another through mathematical operations. Typical are differential forms such as (3.2.7) and integral forms such as (3.2.8). The various forms are helpful for analyzing specific properties of the relationship.
COMMENT 5
For wide classes of systems their external state parameters may be related by the same relationship (in the examples, relationships (3.2.3) and (3.2.5)), and the properties of a concrete object enter the relationship only through some parameters (the resistance R for resistors, the capacity C for condensers). Such a relationship is called a universal relationship and the parameters are called internal state parameters.
COMMENT 6
For some systems, such as the resistor, instantaneous state parameters are related. We call such systems memory-less. If the course of a state parameter in an interval in the past is related to the instantaneous value of another state parameter, as in the case of the condenser, we say that the system is with memory. For the capacitor the dependence has a specific character. From equation (3.2.9) it follows that if the difference v(2, t−Δ) − v(1, t−Δ) is known at the instant t−Δ, where Δ > 0, then the earlier course of the intensity i(n, τ), τ < t−Δ, need not be known.
COMMENT 7
Equation (3.2.8) shows that the instantaneous difference of potentials v(2, t) − v(1, t) is related to the course of the intensity in the whole past. Equation (3.2.7) seemingly contradicts this. This would be, however, a superficial conclusion. To calculate the derivative dv(n, t)/dt or, more precisely, the value [dv(n, τ)/dτ] at τ = t, it is not enough to know v(n, t); we must know the process in an arbitrarily small but finite time interval
Figure 3.6. A resistor-condenser network.
Assumption A3 does not limit our considerations. We introduce it only to simplify the notation. From Figure 3.6 we see that the current i(1, t) causes the potential difference on the resistor and loads the condenser. From (3.2.3) and (3.2.8) we get
v(1, t) − v(2, t) = Ri(1, t),  (3.2.12)
v(2, t) = C^{-1} ∫_{t_a}^{t} i(1, τ) dτ.  (3.2.13)
Differentiating this and putting the result in (3.2.12) gives
RC dv(2, t)/dt + v(2, t) − v(1, t) = 0,  t > t_a.  (3.2.14)
This is the fundamental relationship between the potentials of terminals U(1) and U(2). Since it relates instantaneous values of the processes and their derivatives, it is called a differential relationship. Usually, the potential of terminal U(1) is considered as the factor causing the potential of terminal U(2), and we are interested in expressing v(2, t) as a function of a given process v(1, τ), τ < t. We assume that (a) the limit defining the pulse response h(t) (see (3.2.19)), with the area of the input pulse
A_Δ = ∫_0^Δ v_Δ(1, t) dt,  (3.2.20)
exists, and (b) it does not depend on the input impulse but is solely determined by the system. Typical processes v_Δ(1, t) and v_Δ(2, t) and the limit h(t) are shown in the left column of Figure 3.7.
Figure 3.7. Illustration of the definition of the pulse response (left column) and of the derivation of the explicit relationship between the processes at the input and output of a linear causal system (right column). A2: v_Δ(1, t), a narrow pulse at the input, and v_Δ(2, t), the response at the output; A3: the limiting case when the pulse width Δ → 0, with h(t) the pulse response; B2: the elementary pulse v(1, t_n) f_rec(t − t_n) approximating the input process in the interval
From (3.2.13) we get the potential difference on the terminals of the condenser at the end of the pulse:
v(2, Δ) = (RC)^{-1} A_Δ,  (3.2.24)
where A_Δ is defined by (3.2.20). After the end of the pulse v_Δ(1, t) = 0, t > Δ. Thus, in this time interval the differential equation (3.2.14) describing our system takes the form
RC dv(2, t)/dt + v(2, t) = 0,  t > Δ,  (3.2.25)
with the initial condition (3.2.24). The solution of this equation can be easily found (see, e.g., Arnold [3.9]). It is
v_Δ(2, t) = A_Δ (RC)^{-1} exp[−(t − Δ)/RC],  t > Δ.  (3.2.26)
From the definition (3.2.19) of the pulse response, we obtain finally
h(t) = (RC)^{-1} exp(−t/RC) for t > 0,  h(t) = 0 for t < 0.  (3.2.27)
We show now that using both properties (3.2.16) and (3.2.18) we can obtain an explicit solution of the differential equation (3.2.14) in the general case. We approximate the process v(1, t), t ∈ ⟨t_a, t⟩, by
v*(1, t) = Σ_{n=1}^{N} v(1, t_n) f_rec(t − t_n),  (3.2.28)
where t_n = t_a + (n−1)Δ, Δ = (t − t_a)/N, and f_rec(t) is a rectangular pulse of height 1 and duration Δ − ε, where ε ≪ Δ; the pulse is shown in Figure 3.7 B1. From property (3.2.16) with c′ = v(1, t_n) and c″ = 0 and from properties (3.2.18), (3.2.19) it follows that if the input process is v(1, t_n) f_rec(t − t_n), then for Δ → 0 the output process is close to v(1, t_n) h(t − t_n) Δ (see also Figures 3.7 B2, B3). Next, from property (3.2.16) it follows that when the right side of (3.2.28) appears at the input terminal U(1), then
v*(2, t) ≈ Σ_{n=1}^{N} v(1, t_n) h(t − t_n) Δ  (3.2.29)
is produced at the output terminal U(2). From the definition of the integral we have
lim_{Δ→0} Σ_{n=1}^{N} v(1, t_n) h(t − t_n) Δ = ∫_{t_a}^{t} h(t − τ) v(1, τ) dτ.  (3.2.30)
From (3.2.29) and (3.2.30) it follows that the process at the output terminal is
v(2, t) = ∫_{t_a}^{t} h(t − τ) v(1, τ) dτ.  (3.2.31)
Taking into account (3.2.21) and assuming that v(1, t) = 0 for t < t_a, we get
v(2, t) = ∫_{−∞}^{t} h(t − τ) v(1, τ) dτ.  (3.2.32)
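A numerical sketch of (3.2.27) and (3.2.32): the output of the RC network is approximated by the discrete convolution of the input with the pulse response h(t) = (RC)^{-1} exp(−t/RC). The component values and the input process below are assumed for illustration only.

```python
import numpy as np

R, C = 1e3, 1e-6                   # assumed values: 1 kOhm, 1 uF, so RC = 1 ms
dt = 1e-5
t = np.arange(0.0, 0.01, dt)

h = (1.0 / (R * C)) * np.exp(-t / (R * C))   # pulse response (3.2.27) for t >= 0
v1 = np.where(t < 0.002, 1.0, 0.0)           # assumed input: 2 ms rectangular pulse

# Discrete approximation of the convolution integral (3.2.32)
v2 = np.convolve(v1, h)[: len(t)] * dt
# v2 rises toward 1 with time constant RC and decays after the input pulse ends
```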
COMMENT 1
The class of linear systems defined on page 145 is very wide, since when the changes of the states are small, we can expand the characteristics (parameters, functions) of the system into a Taylor series and take as a good approximation only the linear term of the expansion. On the other hand, if we consider very wide ranges of states, practically every system is nonlinear. For example, if we increased the potential difference on a resistor sufficiently, it would heat up to the melting point, and its behavior could no longer be described by Ohm's law. However, even in such situations we could consider the system as piece-wise linear.
COMMENT 2
The general idea of calculating the pulse response (3.2.27) is that we have two phases of changing states of the system's elements. In the first phase, while the pulse lasts, energy is accumulated in the elements of the circuit that are capable of storing energy. In the considered system, the condenser is the element that can store electric energy. In systems containing coils, magnetic energy is stored. The second phase begins when the pulse has ended. During this phase (described in particular by (3.2.25)) the accumulated energy discharges. Thus, the pulse response has the meaning of the process of discharge of the energy loaded by the pulse into the energy-storing elements of the system. The discharge process depends on the stored energy but not on the way it was stored. This elucidates property (3.2.18) and suggests its formal proof.
If we look closer at the derivation (equations (3.2.29) to (3.2.32)) of equation (3.2.31), we see that in deriving it we used only the general properties (3.2.16) and (3.2.18), not the specific form (3.2.27) of the pulse response. Thus, equation (3.2.32) holds for any linear system that has the property (3.2.18). A wide class of systems having those properties are networks of the mentioned linear electric components. For such systems the states v(n, t) and v(m, t) of two terminals U(n) and U(m) are related by the differential equation
a(K) d^K v(n, t)/dt^K + a(K−1) d^{K−1} v(n, t)/dt^{K−1} + ... + a(0) v(n, t) = b(L) d^L v(m, t)/dt^L + ... + b(0) v(m, t),  (3.2.33)
where the coefficients a(k) and b(l) are functions of the parameters R, C, L (inductance) characterizing the elementary electrical components the circuit consists of. An example of this equation is the previously derived equation (3.2.14). The general solution of the differential equation (3.2.33) is again given by (3.2.31) with the pulse response defined by (3.2.19). To find it we have to solve the homogeneous differential equation that we obtain from (3.2.33) by setting v(m, t) = 0, t > 0. As initial conditions we have to take the values of (d^k v(n, t)/dt^k) at t = 0+, determined by the processes of loading the energy-storing elements by a narrow pulse v_Δ(m, t) occurring at t = 0. In our argument we assumed that the properties of the considered electrical elements, particularly the internal state parameters R, C, L, do not change in time.
Such elements and systems built of them are called time-invariant or, equivalently, stationary. The fact that the pulse response does not depend on the instant at which the short pulse appeared is a consequence of the system being stationary. A powerful method of analysis of linear stationary systems is to present the processes as a superposition of harmonic processes. This method is discussed in Section 7.4.2. Our reasoning can be generalized to a wide class of linear systems in which the internal state parameters vary in time according to a predetermined function. An example of such a system would be the system shown in Figure 3.6 with a resistor whose resistance changes in time according to a given function R(t). Such a system is called a time-varying linear system. It is described by differential equations such as (3.2.33) but with coefficients a(k, t) and b(l, t) which are functions of time. The result of this is that such a system does not have the property (3.2.19) in its previous form: the limit (3.2.19) still exists, but it depends on the time at which the short pulse occurred, and the pulse response h(t, τ) is a function of the current instant t and of the instant τ at which the short impulse appeared (for causal systems h(t, τ) = 0 for t < τ). The counterpart of (3.2.31) is
v(m, t) = ∫_{−∞}^{t} h_{m,n}(t, τ) v(n, τ) dτ,  (3.2.34)
where h_{m,n}(t, τ) is the state of the mth terminal at instant t if at instant τ a change of the state of terminal n occurred that had the form of a narrow pulse.
COMMENT
To this point we have considered terminal-accessible electrical systems. As has been mentioned, electrical phenomena have an inherently spatial character. The electrical instantaneous state of a point in space is described by the electric and magnetic intensity vectors, thus by six state parameters. There exist universal relationships between them. These relationships can be presented either in the form of differential equations relating time and space derivatives of the components of the electric and magnetic field intensities (Maxwell's equations) or in integral form. As in our simple examples, the properties of a concrete medium in which the electromagnetic field exists enter the universal relationships through the dielectric and magnetic permeability constants, which in our terminology are typical internal state parameters. We have chosen electrical states to illustrate the relationships between external state parameters. The relationships between state components of other character are different in details but similar in principle. For example, almost identical to those presented previously are the relationships between the components of mechanical states (position in space, forces, point and continuous models of objects, etc.).

3.2.4 RELATIONSHIPS BETWEEN TIME-DISCRETE STATES

In the previous section we assumed that time is, as in the real world, a continuous variable. However, many technical systems, particularly those controlled by computers, can change their states only at predetermined, usually equally spaced instants. Such systems are called time-discrete. We now consider a typical system of this kind.
The block diagram of the system is shown in Figure 3.8a. The fundamental subsystem is the shift register, which is a chain of memory cells. We assume that
A1. The system can change its states only in the short interval (t_m, t_m + ε), ε ≪ T,
Figure 3.8. The time-discrete linear system: (a) the block diagram of a semi-stationary system, (b) timing, (c) a typical state process, (d) interpretation of the operation as a sliding window, (e) the system with time-varying multipliers (the non-stationary system).
The operation performed by the shift register can be interpreted as the transformation of a segment of sequentially arriving binary pieces of dynamic information, seen through a sliding window of width I, into static information stored in the chain of memory cells (see Figure 3.8d). Immediately after the new contents of the cells have been set, the sum
v(2, t_m) = Σ_{i=0}^{I−1} h′(i) c(i, t_m)  (3.2.38)
is evaluated, where h′(i), i = 0, 1, 2, ..., I−1, are fixed multipliers. Let us assume that at the input we have the train
v_Δ(1, t_n) = 1 for n = 0,  v_Δ(1, t_n) = 0 for n = 1, 2, ....  (3.2.39)
Such a train is the discrete counterpart of the narrow pulse v_Δ(1, t) that was introduced in Example 3.2.1. From (3.2.36) and (3.2.37) we obtain
v_Δ(2, t_m) = h′(m),  m = 0, 1, 2, ..., I−1,  (3.2.40)
where v_Δ(2, t_m) is the train of states of the output terminal U(2) when the train v_Δ(1, t_n) occurred at the input terminal U(1). From (3.2.40) it follows that the weighting coefficients arranged sequentially in time have the meaning of the pulse response of the system shown in Figure 3.8a. We denote
h(t_n) = h′(n), n = 0, 1, 2, ..., I−1;  h(t_n) = 0 for n < 0 and n > I−1.  (3.2.41)
Changing the sequence of summation in (3.2.38), taking into account that t_{m−n} = t_m − t_n, and using (3.2.36) and (3.2.37), we obtain
v(2, t_m) = Σ_{n=m−I+1}^{m} h′(m−n) v(1, t_n).  (3.2.42)
In general the coefficients h′ may depend on the number m of the working cycle. We denote them by h(t_m, t_n); see Figure 3.8e. The relationship
v(2, t_m) = Σ_{n=m−I+1}^{m} h(t_m, t_n) v(1, t_n)  (3.2.43)
is the generalization of (3.2.42). Since the system can memorize only the I most recent input elements, it is
h(t_m, t_n) = 0 for m − n > I − 1,  (3.2.44)
and since h(t_m, t_n) also has the meaning of the response of the system at instant t_m to the pulse v_Δ(1, t_n), it must be h(t_m, t_n) = 0 for m < n.
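A Python sketch of the stationary time-discrete system (3.2.38)/(3.2.42): the shift register keeps the I most recent input samples and the output is their weighted sum with the fixed multipliers h'(i). The multipliers and the input train are assumed values for illustration.

```python
from collections import deque

def shift_register_filter(inputs, h):
    """Semi-stationary time-discrete linear system of Figure 3.8a:
    v(2, t_m) = sum over i of h'(i) * v(1, t_{m-i}), cf. (3.2.38) and (3.2.42)."""
    I = len(h)
    cells = deque([0.0] * I, maxlen=I)     # shift-register cells C(0), ..., C(I-1)
    out = []
    for v1 in inputs:
        cells.appendleft(v1)               # shift the register and load the new sample
        out.append(sum(h[i] * cells[i] for i in range(I)))
    return out

h = [0.5, 0.3, 0.2]                        # assumed multipliers h'(i)
print(shift_register_filter([1, 0, 0, 0, 0], h))
# [0.5, 0.3, 0.2, 0.0, 0.0]: the multipliers appear in sequence, i.e. the pulse response (3.2.40)
```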
Figure 3.9. Structure of the matrix H(m) describing the time-discrete, non-stationary linear system shown in Figure 3.8: (a) m = I−1, (b) m > I−1.
Using the matrices we write the relationship (3.2.43) in the simple matrix form:
v(2, m) = H(m) v(1, m).  (3.2.46)
The system described by (3.2.43) or equivalently by (3.2.46) is a time-discrete counterpart of the non-stationary time-continuous linear system described by formula (3.2.34). Therefore, we call a system described by (3.2.46) a time-discrete non-stationary system.
COMMENT 1
The system described by assumptions A1 to A6 on page 150 is not stationary in the sense of the definition on page 149, since it behaves differently in the changing and stability intervals. However, its properties during one work cycle are the same as during any other work cycle. Therefore, it may be called semi-stationary.
COMMENT 2
In the limiting case when I → ∞ and T = t_m − t_{m−1} → 0, the time-discrete system would look, from the point of view of the input and output terminals, like a time-continuous system. Thus, the time-discrete system described by (3.2.42) or, respectively, by (3.2.43) can be considered as a discrete approximation of the time-continuous systems described by (3.2.31) and (3.2.34), respectively. These two systems illustrate the considerations in Section 1.4.3 about the relationships between discrete and continuous models of states. The advantage of the discrete approximation is that for a concrete input train we can easily evaluate the sum (3.2.43) numerically (for efficient algorithms see Press et al. [3.3, ch. 13]). If necessary, we can implement the system in real time using standard, simple, and inexpensive digital information-processing devices. Also, the fundamental mathematical tools needed to analyze a discrete problem are simpler than those needed to analyze a time-continuous system. For example, the definition of the pulse response for the discrete system is simple, while for the continuous system we had to introduce several heuristic assumptions or use precise mathematical methods to analyze the limit operations. On the other hand, for wide classes of input processes v(1, t) we can calculate the integral (3.2.31) in a closed, easy-to-analyze form. Even for narrow classes of input trains, the closed forms of the corresponding sum (3.2.43) are quite complicated and rather difficult to analyze.
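The matrix form (3.2.46) can be sketched by assembling a banded matrix H(m) whose nonzero entries are the coefficients h(t_m, t_n); the stationary case with assumed values is shown below and checked against the direct discrete convolution.

```python
import numpy as np

def build_H(h, m):
    """(m+1) x (m+1) banded matrix with H[j, n] = h(j - n) for 0 <= j - n <= I-1
    and zero otherwise (finite memory and causality, cf. (3.2.44)-(3.2.46))."""
    I = len(h)
    H = np.zeros((m + 1, m + 1))
    for j in range(m + 1):
        for n in range(max(0, j - I + 1), j + 1):
            H[j, n] = h[j - n]
    return H

h = np.array([0.5, 0.3, 0.2])          # assumed pulse response h'(i)
v1 = np.array([1.0, 2.0, 0.0, -1.0])   # assumed input train v(1, t_0), ..., v(1, t_3)
v2 = build_H(h, m=3) @ v1              # v(2, m) = H(m) v(1, m)
assert np.allclose(v2, np.convolve(v1, h)[:4])   # agrees with the direct sum (3.2.42)
```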
COMMENT 3 We can modify the considered system by feeding back the output process into the input as shown in Figure 3.10.
Figure 3.10. Time-discrete linear system with feedback.
The system with feedback is a linear system, and it can be described by its pulse response. However, there is an essential difference between the pulse response of the system with feedback and that of the previously considered open system. The pulse response of the open system is the process v_Δ(2, t_m) given by (3.2.40). Thus, for n > I the pulse response takes the value 0. In other words, the time extension of the pulse response of the previously considered system is IT. In the system with feedback, however, a unit pulse put at its input influences the pulses occurring later at the input of the shift register. Thus, the pulse response of the system with feedback usually has infinite duration. In particular, the response may be an infinitely long-lasting periodic function. Such a system is called an unstable system. Counterparts of such a system are time-continuous R, L, C systems with feedback and at least one additional source of energy, for example an amplifier (for an analysis of time-discrete systems with feedback see Oppenheim, Schafer [3.6]; for simulation, Alkin [3.7], Wolfram [3.8]).

3.2.5 A CLASSIFICATION OF RELATIONSHIPS BETWEEN EXTERNAL STATES OF SYSTEMS

A system in which the states of some terminals can be considered as the effect of the states of other terminals is called a production system. The previous considerations suggest a classification of such systems. The classification is equivalent to a classification of the transformations performed by systems. The classification is based on three fundamental properties of the relationships between the causing and caused states. The first property is related to the course of the states in time (see Comment 6, page 142). If the instantaneous form of the caused state depends only on the form of the causing state at this same instant, we say that the relationship (transformation, system) is memory-less. If the instantaneous form of the caused state is determined by the course of the causing state during a time interval in the past, we say that the relationship (transformation, system) has memory.
A subclass of relationships having memory are relationships in which the form of the caused state at an instant t depends only on the form of the caused state at an instant t−Δ and on the course of the causing states in the interval
Figure 3.11. A classification of production systems.
3.3 THE INTERNAL STATE 2: RELATIONSHIPS BETWEEN DISCRETE EXTERNAL STATES

In the previous section it was assumed that the set of potential forms of the elementary components of the state is continuous. Here it is assumed that this set is discrete; thus, the elementary state components and, in consequence, the states are discrete. Often such states have the meaning of rough descriptions of primary continuous states. There are two basic methods for describing the relationships between discrete states:
• We interpret the state components as discrete variables, use a discrete algebra, and describe the relationships by functions or algebraic equations, particularly recurrent equations;
• We interpret the state components as logical, particularly Boolean, variables and describe the relationships by logical expressions.
We now illustrate each method with an example.

3.3.1 RELATIONSHIPS DESCRIBED BY EQUATIONS

In Section 2.6.5 we described briefly the buffering system as a system decreasing the idle pauses between a train of working information blocks. The block diagram of the system and typical input and output processes are shown in Figure 2.23. We now demonstrate how the verbal description of the system's operation, which was given in Section 2.6.5, can be formalized. The subsystems of the system are the buffer memory (we call it the waiting room) and the fundamental information-processing subsystem (the server). We denote the kth primary block by ξ_k, k = 1, 2, .... The auxiliary information about a block is its arrival time τ(k) and the time d(k) needed by the fundamental information subsystem to process the block. We define first the state s(w_in, t) of the input terminal of the waiting room. It is assumed that the time needed to put the arriving block into the buffer memory is very short compared with the time it spends in the waiting room. Therefore, as a model of the state s(w_in, t) we take a process that stays in an idle state with the exception of a train of instants; such a process is called a point time process. We set
s(w_in, t) = e(k) for t = τ(k),  s(w_in, t) = 0 for t ≠ τ(k).  (3.3.1)
Such a process can be represented as a train of bars as shown in Figure 3.12a. We assume that the set S_in of potential values of s[w_in, τ(k)] is the interval (0, ∞). Next, we define the state s(w, t) of the waiting room as the number of blocks stored in the waiting room. We assume that at most C_w blocks can be stored in the waiting room. We call C_w the capacity of the waiting room. Thus, S_w = {0, 1, ..., C_w} is the set of its potential forms.
Figure 3.12. Typical state processes in a buffer: (a) s(w_in, t) at the input (the process of task arrival), (b) s(w, t) of the buffer, (c) s(r, t) of the server; T = t_n − t_{n−1} is the cycle of the system's operation.
T is the cycle for the buffer memory-server cooperation.
A3. A decision about transferring a block to the server is taken at the instant t_n, and it is based on exact information about the states of the waiting room and the server at the instant t_n − ε′, just before t_n.
A4. An input block can be put into the buffer memory as soon as it arrives (see Figure 3.12b).
We assume also that the duration of the interval of state changes ε ≪ T.
s(w, t_n − 0) = C_w  if  s′(w, t_n) > C_w,  (3.3.5)
where
u_+(u)  (3.3.6)
and s′(w, t_n) is interpreted as an auxiliary variable defined by (3.3.4). The relationship (3.3.5) can be represented by the graph shown in Figure 3.13. Such a graph is called a state transition graph.
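Because parts of (3.3.4)-(3.3.6) are garbled in this copy, the Python sketch below only assumes a bounded-buffer recursion in their spirit: the blocks arriving in a cycle are added, one block is handed to a free server, and the count is clipped at the capacity C_w. It is an illustration, not the book's exact formula.

```python
def buffer_step(s_w, arrivals, server_free, C_w):
    """One work cycle of the waiting room (assumed recursion in the spirit of (3.3.5)):
    add the blocks that arrived during the cycle, transfer one block to the server
    if it is free and the buffer is not empty, then clip at the capacity C_w."""
    s_aux = s_w + arrivals                       # auxiliary count before clipping
    if server_free and s_aux > 0:
        s_aux -= 1                               # one block leaves for the server
    return min(s_aux, C_w)

# Assumed scenario: capacity 4, a given train of arrivals, server always free
s_w, C_w = 0, 4
for arrivals in [2, 0, 3, 1, 0]:
    s_w = buffer_step(s_w, arrivals, server_free=True, C_w=C_w)
    print(s_w)                                   # successive states s(w, t_n)
```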
Figure 3.13. Graph representing the changes of states of the waiting room given by the recursive formula (3.3.5). The circles represent the potential states of the waiting room at the instants t_{n−1} − 0 and t_n − 0; S_k is the abbreviation for s(w, t_{n−1} − 0) = k and s(w, t_n − 0) = k; the arcs indicate the potential transitions; S is an abbreviation for the number s_in(t_{n−1} + 0, t_n − 0) of primary packets that arrive in the interval
COMMENT
This system operates rhythmically, like the time-discrete linear system considered in Section 3.2.3; thus, it is not stationary but semi-stationary. The nonlinear function u_+(·) and the conditions in (3.3.5) cause the superposition principle not to hold. The system is with memory, and from (3.3.5) it follows that it is a Markovian system.

3.3.2 THE RELATIONSHIPS DESCRIBED BY LOGICAL EXPRESSIONS

Till this point we considered relationships between state parameters describing the states of elementary components of a system. These state parameters have the meaning of atomic components of the state description (see Section 3.1.1). Therefore, a relationship between state parameters may be called a fine relationship, and we may interpret it as a fine internal state of the system. Sections 3.2.2 and 3.2.3 show that no direct relationship may exist between two external state parameters, while a relationship may exist between one parameter and a rough description of another parameter. Let us take, for example, the condenser considered in Section 3.2.2. From equation (3.2.8) it follows that no universal relationship exists between the instantaneous potential and the instantaneous current, but the instantaneous potential is related to a rough description (the integral) of the course of the instantaneous current. This same relationship is described by equation (3.2.7); see Comment 3 on page 144. Similarly to (3.2.7), equation (3.3.5) relates the instantaneous state of the waiting room and a rough description of the course of the state of the input terminal, described by s_in(t_{n−1} + 0, t_n − 0).
For a wide class of systems only the rough descriptions of the primary state parameters are related. Such a relationship is called a macro relationship. Most relationships formulated in a heuristic way by people have the form of macro relationships. Section 3.1 discussed the binary rough descriptions of external states. It has been shown that they can be considered as Boolean variables and that, using the logical operations of negation, alternative, and conjunction, we can produce secondary binary rough descriptions. We now show that by using the implication operation we can formalize the macro relationships between the binary rough descriptions of external states. This, combined with the idea of treating the external and internal states in the same way, permits formulating and handling systematically a wide class of macro relationships, similarly to the previously considered fine relationships. The basic relationship between two binary rough states is
R ≡ if an object O1 possesses the property P1, then the object O2 possesses the property P2.  (3.3.7)
The existence of a relationship is a property of the system. Therefore, we can use the general definition (3.1.5) to define the binary variable characterizing the existence of the relationship R:
s*(R) = 0 if the relationship R does not hold, s*(R) = 1 if it holds.  (3.3.8)
Let us introduce the binary attributes (variables) s*(O_n, P_n), n = 1, 2, saying whether an object O_n possesses a property P_n (see Section 3.1.2). It is natural to relate s*(R) and s*(O1, P1), s*(O2, P2) by the logical implication
s*(R) = s*(O1, P1) ⇒ s*(O2, P2).  (3.3.9)
This relationship can be represented by the table

s*(O1, P1)   s*(O2, P2)   s*(R)
     1            1          1
     1            0          0
     0            1          1
     0            0          1

Table 3.3.1. The truth table for the relation (3.3.9).

Combining the logical operations of negation, alternative, and conjunction we can describe a great variety of macro relationships relating rough binary states, both external and internal. In particular, the strong relationship "if and only if" is described by the logical function
[s*(O1, P1) ⇒ s*(O2, P2)] ∧ [s*(O2, P2) ⇒ s*(O1, P1)].  (3.3.10)
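The implication (3.3.9) and the "if and only if" function (3.3.10) can be tabulated directly; a minimal Python sketch reproducing Table 3.3.1:

```python
def implies(p, q):
    """s*(O1, P1) => s*(O2, P2): false only when the premise holds and the
    conclusion does not, cf. (3.3.9)."""
    return 0 if (p == 1 and q == 0) else 1

def iff(p, q):
    """The strong 'if and only if' relationship (3.3.10): both implications hold."""
    return implies(p, q) & implies(q, p)

print("p q  p=>q  p<=>q")
for p in (1, 0):
    for q in (1, 0):
        print(p, q, "  ", implies(p, q), "    ", iff(p, q))
```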
The fact that the macro relationship R defined by (3.3.7) holds for a system or a class of systems we write in the form:
s*(R) = 1,  ∀ s*(O1, P1), ∀ s*(O2, P2).  (3.3.11)
Such a relationship is a universal relationship, as are the previously discussed fine relationships characterizing systems, for example, the relationships (3.2.3) characterizing a resistor or the relationship (3.2.5) characterizing a capacitor. As in the case of those relationships, if it causes no confusion we omit the ∀ statements; thus, we do not indicate explicitly that the rough relationship holds for all values the involved variables can take. To emphasize that a relationship between external states holding for all their potential forms is an inherent property of the system, we called it an internal state, and we indicated that the relationship can be treated in the same way as an external state. In particular, the existence of a relationship can be considered to be a property of the system, and we can apply the concept of the binary variable characterizing a property defined by (3.1.6). For example, the rough state s*(Phg) characterizing the property Phg defined by (3.1.12) can also be interpreted as the rough state characterizing the possession of the relation "located higher" by the system consisting of two objects. In the definition (3.3.7) of the relationship we can take as the property of an object the possession of a lower-ranking relationship. Equivalently, as the variables in (3.3.10) we can take the variables characterizing the existence of the lower-ranking relations. Then we obtain a hierarchy of relationships. On top of these relationships we have the universal relationships between logical functions that hold for all values of the involved variables. For example, the logical function ((s′ ⇒ s″) ∧ s′) ⇒ s″ takes the value 1 for all 4 combinations of the variables s′, s″. Thus, we may write
((s′ ⇒ s″) ∧ s′) ⇒ s″ = 1,  ∀ s′, s″.  (3.3.12)
The following two examples present universal relationships between binary indices characterizing lower-ranking relationships between the states of some classes of systems.

EXAMPLE 3.3.1 A MACRO RELATIONSHIP BETWEEN POSITIONS OF OBJECTS

We consider the system S = {O(n); n = 1, 2, 3} of three point objects O(n) located on a vertical line. Let us denote by S(n, m) = {O(n), O(m)} the subsystem consisting of two objects. For such a subsystem we define the relationship of rank 1:
Rhg(n, m) ≡ object O(n) is located higher than object O(m).  (3.3.13)
Saying that the relationship Rhg(n, m) holds is equivalent to saying that the subsystem possesses the property Phg(n, m), which was introduced in Example 3.1.2 (definition (3.1.12)). We denote by s_hg(n, m) the binary variable characterizing the property defined by (3.1.13). Equivalently, we may define the variable s_hg(n, m) as the variable characterizing the existence of the relationship Rhg(n, m).
We learn by observations that a universal relationship of rank 2 (higher ranking) exists between the relationships Rhg(1, 2), Rhg(2, 3), and Rhg(1, 3). The relationship is this:
If O(3) is located higher than O(2) (Rhg(3, 2) holds), and if O(2) is located higher than O(1) (Rhg(2, 1) holds), then O(3) is located higher than O(1) (Rhg(3, 1) holds).  (3.3.14)
Using the variables characterizing the existence of the relationships of rank 1, we write the universal relationship (3.3.14) in this form:
s_hg(3, 2) ∧ s_hg(2, 1) ⇒ s_hg(3, 1) = 1. □  (3.3.15)
The relationships Rhg(n, m) occurring in this universal relationship are illustrated in Figure 3.14.
Figure 3.14. Illustration of the lower-ranking relationships Rhg(n, m) occurring in the universal relationship (3.3.14).
COMMENT
The state parameter describing the fine state of the object O(n) is the height h(n) of the vertical position of the object (see Example 3.1.2). Thus, h = {h(n); n = 1, 2, 3} is the fine description of the system S. Then the relationship (3.3.14), or equivalently (3.3.15), can be derived from the following mathematical statement:
If a > b and b > c, then a > c.  (3.3.16)
However, to use this possibility the exact information about the values of the three state parameters h(n), n = 1, 2, 3, must be available. Therefore, it may be much more natural to check by simpler means than measuring exact positions whether the subsystems S(n, m) have the property Phg(n, m), and to rely on the empirical justification of the universal relationship (3.3.14). Doing so we need not invoke the fine description of the state by state parameters at all, or even use mathematical concepts, and yet we can improve the quality of many activities on the hierarchical level corresponding to rough descriptions and macro relationships.
In principle, such an approach is not inferior to the mathematical approach, which has its ultimate justification also in empirical facts. On the other hand, the knowledge of the external world that has been built into logical and mathematical statements is so universal that utilizing it is much more efficient than performing several observations of specialized cases.

EXAMPLE 3.3.2 A MACRO RELATIONSHIP BETWEEN PEOPLE AND OBJECTS

We suppose that the components of the system are: S′, a student; S″, another student; R, a dormitory room; N(R), the number of the telephone in the room R. We introduce the following relationships of rank 1:
in(S, R): a student S lives in a dormitory room R;
sh(S′, S″): a student S′ shares a dormitory room with another student S″;
av(S, N): the student S is available at the telephone with number N;
and we denote by s_in, s_sh, and s_av the binary indicators of the existence of the corresponding relationships. From the meaning of the components of the defined system and from many observations it can be concluded that the following universal relationships of rank 2 exist:
s_in(S′, R) ∧ s_sh(S′, S″) ⇒ s_in(S″, R) = 1,  (3.3.17)
s_in(S′, R) ⇒ s_av[S′, N(R)] = 1. □  (3.3.18)
A universal relationship between binary indicators of possession of a property is a counterpart of an algebraic or differential relationship between continuous state components, such as the universal relationships considered in Section 3.2 and the first part of this section. The counterpart of the problem of solving equations involving the continuous or discrete state components is this problem: for a given universal relationship between binary indicators of possession of some properties and for known values of some possession indicators, find the value(s) of the other possession indicators related to the known indicators by the universal relationships. Such a problem is called inference or logical reasoning. Some of the previously mentioned universal logical relationships allow some of the inference problems to be solved directly. One such fundamental universal logical relationship is the relationship (3.3.12). In classical logic the inference based on this relationship is called modus ponens (see, e.g., Frost [3.11]). A simple example illustrates the problem of inference.
EXAMPLE 3.3.3 INFERENCE ABOUT DIRECTLY NOT KNOWN MACRO RELATIONSHIPS BETWEEN PEOPLE AND OBJECTS
We take the system described in Example 3.3.2. We set S' = M(ary), S'' = G(wen). We know that (1) Mary lives in dormitory room #3, (2) Mary shares the room with Gwen, (3) the number of the telephone in room #3 is 9. We wonder by which telephone number we can reach Gwen. Thus, we set

s_in(M, #3) = 1, s_sh(M, G) = 1, R = #3, N(R) = 9.   (3.3.19)
Setting s' = s_in(M, #3) ∧ s_sh(M, G), s'' = s_in(G, #3) in the universal logical relationship (3.3.12) and taking into account the universal relationship (3.3.17), we conclude that s_in(G, #3) = 1. Thus we infer that Gwen lives in room #3. In a similar way, using (3.3.18) we conclude that s_av[G, N(#3)] = 1, thus that Gwen is available at the telephone number 9. The values of the initially unknown logical variables s_in(G, #3) and s_av[G, N(#3)] were obtained by simple substitutions. We assume now that Mary lives in dormitory room #3 and Gwen lives also in dormitory room #3. Thus, we set

s_in(M, #3) = 1, s_in(G, #3) = 1   (3.3.20)

and we wonder whether Mary and Gwen share the same room. Our task is to find formally the value of the Boolean variable s_sh(M, G). We denote by s* = [s_in(M, #3) ∧ s_sh(M, G) ⇒ s_in(G, #3)] the Boolean variable representing the universal relationship (3.3.17). Table 3.3.1 lists the values of the introduced variables.
        s_in(M, #3)   s_sh(M, G)   s_in(M, #3) ∧ s_sh(M, G)   s_in(G, #3)   s*
   a         1             1                   1                   1         1
   b         1             0                   0                   1         0
   c         0             1                   0                   1         0
   d         0             0                   0                   1         0
   e         1             1                   1                   0         0
   f         1             0                   0                   0         1
   g         0             1                   0                   0         1
   h         0             0                   0                   0         1
Table 3.3.1. Truth table for the Boolean variables considered in the example of formal concluding.
From the table we see that (3.3.17) and (3.3.20) are satisfied only for the set a of values. For this set s_sh(M, G) = 1. Thus, knowing that the universal rule (3.3.17) holds, that Mary lives in room #3, and that Gwen lives in room #3, we derived formally that Mary and Gwen share a room. This conclusion is intuitively obvious. However, the presented method can be automated and used in cases when intuitive reasoning is no longer simple. □
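The substitution-based inference used in the first part of this example is easy to mechanize. The following minimal Python sketch is ours, not part of the original text: the predicate labels and constants are illustrative, and it simply applies the universal relationships (3.3.17) and (3.3.18) to the facts (3.3.19) in the style of modus ponens.

```python
# Known facts (3.3.19): binary indicators of the rank-1 relationships.
facts = {
    ("in", "Mary", "#3"): 1,    # Mary lives in dormitory room #3
    ("sh", "Mary", "Gwen"): 1,  # Mary shares the room with Gwen
}
room_phone = {"#3": 9}          # N(#3) = 9

# Universal relationship (3.3.17):
# s_in(S', R) and s_sh(S', S'') imply s_in(S'', R) = 1.
for (rel, s1, room), v in list(facts.items()):
    if rel == "in" and v == 1:
        for (rel2, a, b), w in list(facts.items()):
            if rel2 == "sh" and w == 1 and a == s1:
                facts[("in", b, room)] = 1

# Universal relationship (3.3.18):
# s_in(S', R) implies s_av[S', N(R)] = 1.
for (rel, s1, room), v in list(facts.items()):
    if rel == "in" and v == 1:
        facts[("av", s1, room_phone[room])] = 1

print(facts.get(("in", "Gwen", "#3")))  # 1 -> Gwen lives in room #3
print(facts.get(("av", "Gwen", 9)))     # 1 -> Gwen is reachable at number 9
```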
In more complicated cases the construction of the tables of values may be prohibitively cumbersome. However, quite efficient algorithms are available (see, e.g., Frost [3.11]) that evaluate some Boolean variables when the values of other variables involved in a universal relationship are given.
3.3.3 SET OF POTENTIAL FORMS OF AN INTERNAL STATE
With the internal states we have a similar situation as with external states. Often we do not know the internal state exactly. Then the superior system can improve its purposeful activities if it knows the set of potential forms that the internal state can take. Usually the type of the system, and in consequence the type of the internal state (of the relationship between the external states), is known. Then we have to consider only the set of potential values (forms) of the internal state parameters or functions determining the relationship exactly. For example, suppose that we know that the considered system is a linear resistor. Then we know that the relationship between its external states is described by (3.2.3), and we need only to know the set of potential values of the resistance. Thus, the set of potential forms of the resistance is of type T1. If we know that the system is a linear stationary system, then the relationship between its input and output processes is determined by the pulse response of the system. Thus, the internal state is of type Ts(t). To determine the set of potential forms of the internal state we must know the factors influencing the relationship between the external states of the considered system. Often the system is a product of a hierarchically higher system. Then the relationships between the external states of the system can be considered as embedded external properties of the producing system. For example, as the lower-ranking system we take a resistor. The factory that produced it plays the role of the hierarchically higher system. Knowing the methods of production and testing used by that factory, we can determine the set of potential values of the resistance of a given resistor. Suppose that the resistor has the nominal resistance R_nom and is characterized by a tolerance of a percent. Then the set of potential values of resistance is the tolerance interval <R_nom(1 − a/100), R_nom(1 + a/100)>.
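As a small illustration (our own, with assumed numerical values), the set of potential values of the resistance can be written down directly from the nominal value and the tolerance:

```python
R_nom = 100.0  # nominal resistance in ohms (assumed illustrative value)
a = 5.0        # tolerance in percent (assumed illustrative value)

# Set of potential values of the resistance: the tolerance interval.
tolerance_interval = (R_nom * (1 - a / 100), R_nom * (1 + a / 100))
print(tolerance_interval)  # (95.0, 105.0)
```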
Strictly taken, only in the early stages of development of the theory of electricity were these relationships formulated as generalizations of many observations of the external states of resistors and capacitors, respectively. Today we derive them through mathematical operations from general relationships of the theory of electricity. This does not change our argumentation, since the general relationships are not given a priori but were also formulated as generalizations of very many observations of external electrical states.
The assumption that the pulse lasts only during a finite interval is introduced to simplify the argumentation. The spectral properties of real pulses cause such an assumption to be not satisfied exactly, and a real pulse takes small values outside the finite time interval. We can modify the presented argumentation to take such an effect into account, but this would not change the final conclusions we are going to derive.
The limiting form of such a pulse for Δ → 0 is called the Dirac impulse. Using this concept we define the pulse response as the response to a Dirac impulse. Although for some applications it is useful, the Dirac impulse is a singular function that in some situations "produces" (only formally) bizarre results (see endnote 2 in Chapter 1). The advantage of the definition given is that it directly suggests a procedure for determining the pulse response experimentally.
We introduce assumption 4 to simplify our subsequent argument. From a technical point of view, we should rather assume that the set of potential forms of stored information is discrete. We would then have to introduce a few new concepts to define arithmetic operations on discrete numbers that are generalizations of the binary addition and multiplication. To avoid such a complication we make assumption 4, but a reader familiar with discrete algebra can see that all equations we use are also valid for discrete variables.
In this book we use a prime to denote a variable associated with the variable without the prime, but we do not use the prime to denote the derivative.
Notice that now we start numbering the cycles of operation with 1 and not with 0 as we did for the time-discrete linear system.
REFERENCES
[3.1] Dagpunar, J., Principles of Random Variate Generation, Clarendon Press, Oxford, UK, 1988.
[3.2] Yarmolik, V.N., Demidenko, S.N., Generation and Application of Pseudorandom Sequences for Random Testing, J. Wiley, New York, 1988.
[3.3] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., Numerical Recipes, 2nd ed., Cambridge University Press, Cambridge, UK, 1992.
[3.4] O'Shaughnessy, D., Speech Communication, Prentice Hall, Englewood Cliffs, NJ, 1983.
[3.5] Oppenheim, A.V., Willsky, A.S., Signals and Systems, Prentice Hall, Englewood Cliffs, NJ, 1986.
[3.6] Oppenheim, A.V., Schafer, R.W., Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1986.
[3.7] Alkin, O., PC-DSP, Prentice Hall, Englewood Cliffs, NJ, 1990.
[3.8] Wolfram, S., Mathematica, 2nd ed., Addison-Wesley, Reading, MA, 1991.
[3.9] Arnold, V.I., Ordinary Differential Equations, Springer-Verlag, Berlin, 1992.
[3.10] Seidler, J.A., Principles of Computer Communication Network Design, J. Wiley, New York, 1983.
[3.11] Frost, R., Introduction to Knowledge Base Systems, Collins, London, 1987.
STATISTICAL STATE OF A SYSTEM
Practically every information system has to process not a single piece of information but a train of them. Processing the train as a whole we could exploit all properties of its components, particularly any relationships between them, and consequently minimize the information-processing resources required for each component of the train. To process a train as a whole, sufficiently large resources must be available. However, the resources are usually limited. Then it is natural to divide the primary train of pieces of information into blocks and to process each block separately but according to a rule that takes into account the properties of the whole train. This is called blockwise processing. The general principles of blockwise processing have been presented in Section 1.7.2. A concrete example is adaptive Huffman data compression discussed in Section 6.2. It is shown there that the frequencies of occurrences of potential forms of elements of a train are comprehensive features of the train that allow efficient block-by-block data compression. Dimensionality reduction, considered in Section 7.3, has a similar character. In this case empirical correlation coefficients are the features of the train essential for efficient block-by-block dimensionality reduction.
As indicated in Section 1.4.4, for many systems the frequencies of occurrences of potential forms of a component of the state fluctuate around values that are determined by the properties of the system, but do not depend on a particular observation. Such a system is said to exhibit statistical regularities, and the fixed values around which the frequencies of occurrences fluctuate are called probabilities. If the states exhibit statistical regularities, then often the knowledge of a state influences the statistical regularities of another state. Then a statistical relationship is said to exist between both states. If the states exhibit statistical regularities, then the blocks of states also exhibit statistical regularities. Knowing the probabilities of the potential forms of blocks of elements, we can optimize the blockwise processing without waiting until the whole train is available. In particular, we can optimize the mentioned data compression and dimensionality reduction.
The knowledge of probabilities and of statistical relationships is of paramount importance for types of information processing other than blockwise. Knowing them we can efficiently estimate nonaccessible components of the states on the basis of information about accessible components. In particular, knowing the states that occurred in the past, we can efficiently predict future states. Such estimates can improve dramatically the quality of purposeful actions of the superior system.
This chapter concentrates on the properties of frequencies of occurrences. The first two sections describe those properties for a given train of states, without caring whether another train would have similar properties. Such a model is suitable for processing the whole train ex post, after it is available. A typical example is the compression of information stored on magnetic media. The existence of statistical regularities of the states and, if they do exist, the probabilities are a property of the system. Therefore, we can in principle check the existence of statistical regularities and estimate the probabilities only experimentally. However, in special cases we can deduce the existence of statistical regularities of states from the mechanism by which the system generates a concrete form of the state. In Section 4.3 we discuss the methods of determining, on the basis of the properties of frequencies of occurrences of potential forms of states, whether the potential forms of states exhibit statistical regularities.
Although it seems to be natural, the formalization of the analysis of properties of probabilities based on properties of frequencies of occurrences, in particular in the case of continuous states, is difficult (see, e.g., Mises [4.1]). Therefore, the axiomatic approach to probabilities became prevalent. It is usually called mathematical probability theory (probability theory). We present it in Section 4.4. Section 4.5 discusses in terms of probability theory the previously mentioned special but important cases, when from the mechanism of producing the state we can draw concrete conclusions about its probabilities. Similarly, in Section 4.6 we consider from the point of view of probability theory the consequences of the existence of statistical regularities and derive conclusions that are of paramount importance for information processing. In particular, our considerations lead in a natural way to the important concept of the entropy of potential forms of a state (of information).
The external and internal states discussed in Chapter 3, the state of variety introduced in Section 1.4, and the statistical state described in more detail in this chapter are objective features of a system. Therefore, they may be jointly interpreted as the generalized state that provides a global characteristic of the system from the point of view of purposeful actions involving the system. The concept of the generalized state plays an important role in this book, since it gives insight into the problems of pursuing purposeful actions, particularly processing information, and allows them to be treated in a uniform way. The generalized state and a universal classification of states are the subjects of the last section of this chapter.
4.1 FREQUENCIES OF OCCURRENCES OF DISCRETE STATES
This section analyzes the properties of frequencies of occurrences of potential forms of discrete elementary components of a train. Although very simple and plausible, the concepts introduced are important for two reasons. First, they are directly useful for the design of the previously mentioned blockwise-operating systems processing discrete information.
Second, the analysis of the frequencies of occurrences of discrete states can be generalized. One generalization is for continuous states. This generalization is also a simple but representative illustration of the presentation of properties of continuous states by densities, such as probability or spectral density, that are introduced in the subsequent chapters. The other generalization has a much more fundamental character. It leads to the concept of probability and to the axioms of probability theory, which are presented in Sections 4.3 and 4.4, respectively.
4.1.1 BASIC CONCEPTS
The quality of an action performed by a superior system depends usually on the state of the environment. Therefore, with a state s (with an information) a scalar weight q(s) is usually associated that characterizes the influence of the state (of the information) on the quality of action performed by the superior system (in particular, the quality of processing a specific information). Section 8.1 discusses in detail the methods of choosing the function q(·). A typical superior system performs an action many times. We denote by Q(S_tr) an indicator characterizing the whole train S_tr = {s(i), i = 1, 2, ..., I}, s(i) ∈ S, of states from the point of view of the realization of the purposeful activity (in particular, of information processing). It is natural to base the definition of Q(S_tr) on the definition of the indicator q[s(i)] characterizing the components of the train. Such an approach is discussed in detail in Section 8.1. We show there that it is often justified to take

Q(S_tr) = Σ_{i=1}^{I} q[s(i)].   (4.1.1)
Since the indicator Q(S_tr) usually depends strongly on the length I of the train, it may be more convenient to use the normalized indicator

q̄(I) ≡ Q(S_tr)/I = (1/I) Σ_{i=1}^{I} q[s(i)] = A q[s(i)],   (4.1.2)

where

A ≡ (1/I) Σ_{i=1}^{I}   (4.1.3)

is the arithmetical averaging operation. We assume here that the set of potential forms of the state is discrete:

S = {s_l, l = 1, 2, ..., L}.   (4.1.4)

After grouping in the sum (4.1.2) all components s(i) = s_l we get

q̄(I) = Σ_{l=1}^{L} q(s_l) P*(s_l, I),   (4.1.5)

where

P*(s_l, I) ≡ M(s_l, I)/I   (4.1.6)

is the frequency of occurrences of the state s_l in the train S_tr, and

M(s_l, I) is the number of occurrences of the state s_l in the train S_tr.   (4.1.7)
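A short numerical sketch of definitions (4.1.5)-(4.1.7); the train and the weight function are arbitrary illustrative choices of ours:

```python
from collections import Counter

train = ["s1", "s2", "s1", "s3", "s1", "s2"]  # an illustrative train S_tr, I = 6
q = {"s1": 1.0, "s2": 2.0, "s3": 5.0}         # illustrative weights q(s_l)

I = len(train)
M = Counter(train)                  # M(s_l, I), cf. (4.1.7)
P = {s: M[s] / I for s in M}        # P*(s_l, I), cf. (4.1.6)

q_bar_direct = sum(q[s] for s in train) / I   # arithmetical average, cf. (4.1.2)
q_bar_freq = sum(q[s] * P[s] for s in P)      # grouped form, cf. (4.1.5)

print(P)                            # {'s1': 0.5, 's2': 0.333..., 's3': 0.166...}
print(q_bar_direct, q_bar_freq)     # both equal 2.0
```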
From this definition it follows that

Σ_{l=1}^{L} M(s_l, I) = I.   (4.1.8a)

Dividing both sides by I we get
Σ_{l=1}^{L} P*(s_l, I) = 1.   (4.1.8b)

From (4.1.5) we see that the properties of the train S_tr influence the index Q(S_tr) only through the set

P*(I) = {P*(s_l, I), l = 1, 2, ..., L}   (4.1.9)

of frequencies of occurrences of potential forms of the state. A consequence of this is that if we optimize the rule according to which the superior system performs its actions separately on the elements of the train (in particular, the rule of information processing), then the optimized rule depends only on the set P*(I). Let us introduce the family of weight functions

q_l(s_k) ≡ 1 for k = l, q_l(s_k) ≡ 0 for k ≠ l.   (4.1.10)
From (4.1.2) and (4.1.5) we have

P*(s_l, I) = A q_l[s(i)].   (4.1.11a)

Thus,

The frequencies of occurrences can be considered as arithmetical averages of a specific weight function of a state.   (4.1.11b)

To calculate the frequencies of occurrences the whole train of states must be known. Thus, we can utilize the frequencies only after the train has ended (see our discussion in Section 1.7.2, in particular, Figure 1.27).
4.1.2 THE FREQUENCIES OF JOINT OCCURRENCES OF STATES
In the general definition (4.1.6) of the frequency of occurrences, we did not specify the structure of the state. We assume now that the state is a vector and discuss the relationships between the frequencies of occurrences of its components. To simplify our argument we assume that the state vector has two components:

s = {s(1), s(2)}.   (4.1.12)

We assume that the set of potential values of the component s(n), n = 1, 2 is

S(n) = {s_l(n), l = 1, 2, ..., L(n)}, n = 1, 2.   (4.1.13)

We consider again the train S_tr but now look at it as a train of pairs {s(i, 1), s(i, 2)}. Then

P*[s_l(1), s_k(2), I] ≡ M[s_l(1), s_k(2), I]/I,   (4.1.14a)

where M[s_l(1), s_k(2), I] is the number of pairs in the train the first element of which is s_l(1) and the second s_k(2). We call P*[s_l(1), s_k(2), I] the frequency of joint occurrences of s_l(1) and s_k(2).
Obviously, M[s_l(1), s_k(2), I] and P*[s_l(1), s_k(2), I] are the previously introduced M(s_l, I) and P*(s_l, I) written in another form. In this notation equation (4.1.8b) takes the form:

Σ_{l=1}^{L(1)} Σ_{k=1}^{L(2)} P*[s_l(1), s_k(2), I] = 1.   (4.1.15)

Next, we denote by

M[s_l(1), I] the number of elements (pairs) in the train with s_l(1) as the first component and any second component.   (4.1.16)

From the definitions of M[s_l(1), s_k(2), I] and M[s_l(1), I] it follows that

M[s_l(1), I] = Σ_{k=1}^{L(2)} M[s_l(1), s_k(2), I].   (4.1.17)

Dividing both sides of this equation by I, we get

P*[s_l(1), I] = Σ_{k=1}^{L(2)} P*[s_l(1), s_k(2), I],   (4.1.18)

where

P*[s_l(1), I] ≡ M[s_l(1), I]/I   (4.1.19)

is the frequency of occurrences of the form s_l(1) of the first component of the vector state s irrespective of the form of the second component. If we evaluate the probability P*[s_l(1), I] from (4.1.18) we call it the marginal probability. Therefore, equation (4.1.18) is called the marginal probability formula. This is one of two equations that play a key role in our future considerations about utilizing the information about the statistical state. For symmetry reasons we have also

P*[s_k(2), I] = Σ_{l=1}^{L(1)} P*[s_l(1), s_k(2), I].   (4.1.20)
CONDITIONAL FREQUENCIES OF OCCURRENCES
Let us suppose that the first component s_l(1) of the pair {s_l(1), s_k(2)} is fixed. According to definition (4.1.16) we have M[s_l(1), I] such pairs. Let us next take a concrete s_k(2). In the train S_tr we have M[s_l(1), s_k(2), I] pairs with fixed first and second components. Therefore, the frequency of occurrences of the state s_k(2) in the class of such pairs whose first component is s_l(1) is

P*[s_k(2)|s_l(1), I] ≡ M[s_l(1), s_k(2), I]/M[s_l(1), I].   (4.1.21)

We call P*[s_k(2)|s_l(1), I] the conditional frequency of occurrences of the component s_k(2) on the condition that s_l(1) is known. Dividing numerator and denominator by I and using (4.1.14) and (4.1.19) we obtain

P*[s_k(2)|s_l(1), I] = P*[s_l(1), s_k(2), I]/P*[s_l(1), I],   (4.1.22)
where P*[s_l(1), s_k(2), I] is the frequency of occurrences of the pair [s_l(1), s_k(2)], and P*[s_l(1), I] is the frequency of occurrences of the state s_l(1) (the marginal probability). Chapter 5 shows that when exact information about the state s_k(2) is not available but s_l(1) and the conditional frequency P*[s_k(2)|s_l(1), I] are known, we can draw conclusions about s_k(2). Therefore, equation (4.1.22) is the second most important equation for information processing. Because the roles of both components are symmetrical, we have also

P*[s_l(1)|s_k(2), I] = P*[s_l(1), s_k(2), I]/P*[s_k(2), I].   (4.1.23)
This equation together with (4.1.18) allows conditional frequencies with one of the components fixed to be calculated when the conditional frequencies with the other component fixed are known. In the following chapters, we often use this possibility. The set of joint frequencies

{P*[s_l(1), s_k(2), I]; l = 1, 2, ..., L(1), k = 1, 2, ..., L(2)},   (4.1.24)

or equivalently the set of conditional frequencies

{P*[s_k(2)|s_l(1), I]; l = 1, 2, ..., L(1), k = 1, 2, ..., L(2)}   (4.1.25)

together with the set of marginal frequencies of occurrences

{P*[s_l(1), I]; l = 1, 2, ..., L(1)},   (4.1.26)
provide the complete description of statistical relationships between the states s(1) and s(2). In general, the unconditional (marginal) frequency of occurrences P*[s_k(2), I] and the conditional frequency P*[s_k(2)|s_l(1), I] are different. We can interpret this as an indication that a relationship exists between the component s_l(1) and the component s_k(2). Therefore, if

P*[s_k(2)|s_l(1), I] = P*[s_k(2), I], ∀ l, ∀ k,   (4.1.27)

we say that the states s(1) and s(2) are statistically independent. Then from (4.1.23) we have

P*[s_l(1), s_k(2), I] = P*[s_l(1), I] P*[s_k(2), I].   (4.1.28)

Let us illustrate the introduced concepts with a numerical example.
EXAMPLE 4.1.1 CALCULATION OF MARGINAL AND CONDITIONAL FREQUENCIES
We take L(1) = 6, L(2) = 6; thus, the number of potential forms of the pair (vector) {s(1), s(2)} is L = 36. We assume that for some large I the joint frequencies can be described approximately by the equation

P*[s_l(1), s_k(2), I] = A exp(−a|l − k|),   (4.1.29)

where a is a parameter. The parameter A we obtain from the condition (4.1.15). Taking various values of a we obtain a class of joint frequencies.
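The calculation can be sketched in a few lines of Python (our own illustration, reading (4.1.29) literally and normalizing by (4.1.15)):

```python
import math

L1, L2 = 6, 6
a = 1.0                                  # the parameter in (4.1.29)

raw = {(l, k): math.exp(-a * abs(l - k))
       for l in range(1, L1 + 1) for k in range(1, L2 + 1)}
A = 1.0 / sum(raw.values())              # normalization from condition (4.1.15)
P_joint = {lk: A * v for lk, v in raw.items()}   # joint frequencies

# Marginal frequencies, cf. (4.1.18) and (4.1.20).
P1 = {l: sum(P_joint[(l, k)] for k in range(1, L2 + 1)) for l in range(1, L1 + 1)}
P2 = {k: sum(P_joint[(l, k)] for l in range(1, L1 + 1)) for k in range(1, L2 + 1)}

# Conditional frequencies, cf. (4.1.22).
P_cond = {(l, k): P_joint[(l, k)] / P1[l] for (l, k) in P_joint}

print(round(sum(P_joint.values()), 6))                    # 1.0
print({l: round(P1[l], 3) for l in range(1, L1 + 1)})     # marginals of s(1)
print({k: round(P_cond[(1, k)], 3) for k in range(1, L2 + 1)})  # conditionals given s_1(1)
```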
Figure 4.1. Graphical representation of the joint probabilities given by (4.1.29) (a) and of the corresponding marginal probabilities calculated from formula (4.1.20) (b), for a = 0.5 and a = 1. The probability is proportional to the area of the disc.
The joint frequency P*[s_l(1), s_k(2), I] is represented in Figure 4.1a as a disc the area of which is proportional to P*[s_l(1), s_k(2), I]. In the same manner we represent the marginal frequencies P*[s_l(1), I]. Because of symmetry the frequencies P*[s_k(2), I] are the same, and therefore, we do not show them. From (4.1.23) it follows that the conditional frequencies are proportional to the joint frequencies. Thus, a row on the square in Figure 4.1 illustrates the relative values of the conditional frequencies of occurrences. From Figure 4.1 we see that only for a = 1 are the conditional and marginal frequencies the same. Thus, only for this value of a are the components independent. We say then that the set of potential forms of the trains is statistically unconstrained. In the limiting case when a → 0, the one component determines exactly the other; thus, only the pairs {s_l(1), s_l(2)}, l = 1, 2, ..., 6 occur in the train. In the other limiting case a → ∞, the relationship becomes again rigid, but only the two pairs {s_6(1), s_1(2)} and {s_1(1), s_6(2)} occur in the train. For a = 0, all possible pairs occur with the same frequencies. □
4.1.3 GENERALIZATIONS
We introduced several assumptions only to simplify the notation and terminology, but we did not really use them in our argument. Therefore, we can directly generalize the previously introduced concepts and statements. We now illustrate such possibilities with two examples chosen so that they lead to equations that are needed in subsequent chapters. First, we show how to define the concept of conditional frequencies when the components of the state are not scalars, as we assumed, but vectors. We assume that
• The first component of the state s(i) is an (N−1)-dimensional vector s(1) = {s(1, n); n = 1, 2, ..., N−1}, where the s(1, n) are scalars;
• S(1) = {s_l(1), l = 1, 2, ..., L(1)} is the set of potential forms of each s(1, n);
• The second component is a scalar s(2); S(2) = {s_k(2), k = 1, 2, ..., L(2)} is the set of its potential forms.
The generalization of definition (4.1.23) is

P*[s_k(2)|s_l(1), I] = P*[s_l(1), s_k(2), I]/P*[s_l(1), I],   (4.1.30)

where s_l(1) = {s_l(1)(1), s_l(2)(1), ..., s_l(N−1)(1)} is the concrete form of the first component of the state s(1) and l = {l(1), l(2), ..., l(N−1)}, l(n) ∈ {1, 2, ..., L(1)}, is the vector of indices of the forms of the elementary components s(1, n) of s(1).
Next consider a generalization of the fundamental definition (4.1.6) for rough states. We assume that S = {s_l, l = 1, 2, ..., L} is the set of potential forms of the exact, primary state, but we are interested in the rough description

ŝ = {s_1, s_2, ..., s_J}   (4.1.31)

consisting of the first J < L potential forms of the primary state treated as a single aggregated form. The corresponding frequency of occurrences is

P*(ŝ, I) ≡ M(ŝ, I)/I,   (4.1.32)

where M(ŝ, I) is the number of occurrences in the train of the event s(i) ∈ ŝ. From definition (4.1.6) of the frequencies of the primary states and from the definition of M(ŝ, I) it follows that

P*(ŝ, I) = Σ_{l=1}^{J} P*(s_l, I).   (4.1.33)
In a similar way, we define joint frequencies of occurrences and conditional frequencies when both states are rough states or one state is a rough state and the other is a state parameter or vector.
4.2 FREQUENCIES OF OCCURRENCES OF CONTINUOUS STATES
This section assumes that the elementary component of the train is a continuous state. To simplify the argument and terminology, we consider again a train S_tr = {s(i), i = 1, 2, ..., I} of state parameters. We assume that the set of potential forms of each s(i) is an interval S = <s_a, s_b>.
Behind the definition of the frequencies of occurrences of a potential state in a train of continuous states are the general relationships between the continuous and discrete models of information (of states) discussed in Section 1.4.3. The basic idea is to approximate the continuous state by a discrete state and to define the frequencies of occurrences of a potential form of the continuous state in terms of frequencies of occurrences of potential forms of the discrete approximating state. Since a similar approach can be used also for the definition of probability density (Section 4.4) and spectral density (Section 7.4.2), the method of discrete approximation is described in more detail.
4.2.1 THE DISCRETE APPROXIMATION OF A CONTINUOUS PROCESS
As the discrete state parameter approximating the primary continuous state s = s(i), we take the discrete state parameter obtained by the uniform scalar quantization described in Section 1.5.3. In the notation that is used here the transformation (1.5.16) takes the form:

s_d(i, L) = s_l(L) if |s(i) − s_l(L)| < |s(i) − s_k(L)| ∀ k ≠ l,   (4.2.1)

where s_d(i, L) is the discrete state approximating the primary continuous state s(i), and

s_l(L) = s_a + (l − 1/2)Δ(L), l = 1, 2, ..., L   (4.2.2)

are the potential forms of the discrete approximating state s_d(i, L) (the same for all i), and

Δ(L) = (s_b − s_a)/L   (4.2.3)

is the distance between the quantization thresholds (see (1.5.14)). The transformation (4.2.1) is illustrated in Figure 4.2.
Figure 4.2. Transformation producing the discrete approximation s_d(i, L) of the primary continuous state s(i); L = 4.
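A minimal sketch of the quantization (4.2.1)-(4.2.3), with an illustrative interval and illustrative sample values of our choosing:

```python
def quantize(s, s_a, s_b, L):
    """Uniform scalar quantization: return the representative s_l(L)
    closest to s, cf. (4.2.1), with levels given by (4.2.2)-(4.2.3)."""
    delta = (s_b - s_a) / L                                       # Delta(L), (4.2.3)
    levels = [s_a + (l - 0.5) * delta for l in range(1, L + 1)]   # s_l(L), (4.2.2)
    return min(levels, key=lambda level: abs(s - level))

# Illustrative train of continuous states on <0, 1>, quantized with L = 4.
train = [0.05, 0.31, 0.33, 0.72, 0.95, 0.51]
print([quantize(s, 0.0, 1.0, 4) for s in train])
# [0.125, 0.375, 0.375, 0.625, 0.875, 0.625]
```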
Since we are interested in the accuracy of the approximation, and this accuracy is determined by the number L of potential forms of the discrete information, in the introduced notation we indicate explicitly the dependence of the considered variables on L. The discretized state can be considered as a rough description of the primary state (see Section 3.1.3). From equations (1.5.15) and (1.5.18) we obtain the corresponding aggregation intervals (sets):

𝒜_l(L) = < [s_{l−1}(L) + s_l(L)]/2, [s_l(L) + s_{l+1}(L)]/2 >.   (4.2.4)

For a given L we introduce:
M[s_l(L), I], the number of occurrences of the discrete state s_l(L) in the train S_trd = {s_d(i, L), i = 1, 2, ..., I} of the discretized states,
P*[s_l(L), I] ≡ M[s_l(L), I]/I, the corresponding frequency of occurrences.
From the definition of the discretized state s_d(i, L) it follows that

M[s_l(L), I] is the number of occurrences in the train of the event s(i) ∈ 𝒜_l(L)   (4.2.5)

and

P*[s_l(L), I] is the frequency of occurrences in the train of the event s(i) ∈ 𝒜_l(L).   (4.2.6)

To define more precisely the relationship between the train S_tr of continuous states and the train S_trd of discretized states, we introduce the auxiliary function:

P*_c(s, I, L) = P*[s_l(L), I] for s ∈ 𝒜_l(L).   (4.2.7)

This is a function of the continuous argument s ∈ <s_a, s_b>. Therefore, it is called a continuous envelope of the set of the probabilities P*[s_l(L), I]. This envelope has the form of steps with heights P*[s_l(L), I]. A typical bar diagram of the frequencies P*[s_l(L), I] is shown in Figure 4.3A1, while that of their continuous envelope is shown in Figures 4.3A2 and 4.3A3. When L grows, the potential forms of the discretized state s_d(i, L) become more densely distributed in the interval <s_a, s_b>, as shown in Figures 4.3A2 and 4.3A3. However, with growing L the length Δ(L) of each aggregation interval 𝒜_l(L) decreases. This usually causes the chance that an s(i) falls into an aggregation set 𝒜_l(L) covering a given s' ∈ <s_a, s_b> to decrease too. In other words, when L grows, the value that the continuous envelope P*_c(s', I, L) takes for a given s' decreases with L. Figures 4.3A2 and 4.3A3 illustrate this effect. Figure 4.3A4 illustrates the limiting case when L → ∞ and all P*[s_l(L), I] are almost zero.
Figure 4.3. Illustration of the concept of the density of occurrences of a potential form of a continuous state in a train of continuous states; left column: the frequencies of occurrences of the discretized states (A1) and the continuous envelopes of the trains of frequencies (A2), (A3); right column: the corresponding normalized frequencies and their continuous envelopes; (B4) shows the limiting continuous envelope (the density of the frequency of occurrences). It is assumed that the length of the aggregation interval corresponding to L = 4 is a unit length; thus, Δ(4) = 1.
4.2.2 THE DENSITY OF OCCURRENCES OF A CONTINUOUS STATE
The discussed dependence of P*_c(s, I, L) on L causes this frequency to be an unsuitable characteristic of the frequencies of occurrences of the primary continuous states. However, our considerations indicate that it is natural to normalize the frequency P*[s_l(L), I] with respect to the size of the corresponding aggregation set 𝒜_l(L). The ratio

p*[s_l(L), I] ≡ P*[s_l(L), I]/Δ(L)   (4.2.8)

is such a normalized characteristic, called the density of occurrences of (continuous) states. Similarly to (4.2.7) we define for the train p*[s_l(L), I], l = 1, 2, ..., L its continuous envelope

p*_c(s, I, L) ≡ P*[s_l(L), I]/Δ(L) for s ∈ 𝒜_l(L),   (4.2.9)

which is a function of the continuous argument s. The bar diagrams of the train p*[s_l(L), I] and of the continuous envelope p*_c(s, I, 4) are shown in the right column of Figure 4.3. From (4.2.8) and from the definition (4.2.9) it follows that

∫_{s_a}^{s_b} p*_c(s, I, L) ds = 1.   (4.2.10)

Similarly, using (4.2.9) we write (4.1.5) in the form

q̄(I) = Σ_{l=1}^{L} q(s_l) P*(s_l, I) = ∫_{s_a}^{s_b} q*_c(s, I, L) p*_c(s, I, L) ds,   (4.2.11)

where q*_c(s, I, L) is defined similarly to P*_c(s, I, L) (equation (4.2.7)), however with q[s_l(L)] in place of P*[s_l(L), I]. Observations of many systems show that if L becomes large, then L almost does not influence the approximating function p*_c(s, I, L). Thus, we may say that it "converges" to a "limiting" function p*(s, I), s ∈ <s_a, s_b>, and write this in the form:

p*_c(s, I, L) → p*(s, I) as Δ(L) → 0.   (4.2.12)

This function, considered as a whole, is for the continuous states the counterpart of the set {P*(s_l, I), l = 1, 2, ..., L} characterizing the train of discrete states. Approximating p*_c(s, I, L) by p*(s, I), from (4.2.11) we have

q̄(I) = ∫_{s_a}^{s_b} q(s) p*(s, I) ds.   (4.2.13)
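A small numerical sketch (ours) of (4.2.8), (4.2.10), and (4.2.13): the frequencies of the quantized states are normalized by Δ(L) to obtain a density, the normalization (4.2.10) is checked, and the average of a weight function is approximated by the integral expression, here evaluated as a sum over the quantization cells.

```python
from collections import Counter

s_a, s_b, L = 0.0, 1.0, 10
delta = (s_b - s_a) / L                           # Delta(L), cf. (4.2.3)
train = [0.11, 0.12, 0.35, 0.36, 0.37, 0.52, 0.74, 0.75, 0.76, 0.91]  # illustrative

def cell(s):
    """Index of the aggregation interval containing s (cells 0, ..., L-1)."""
    return min(int((s - s_a) / delta), L - 1)

I = len(train)
counts = Counter(cell(s) for s in train)
P = {l: counts.get(l, 0) / I for l in range(L)}   # P*[s_l(L), I]
p = {l: P[l] / delta for l in range(L)}           # density, cf. (4.2.8)

print(round(sum(p[l] * delta for l in range(L)), 6))   # 1.0, cf. (4.2.10)

def q(s):                                         # an illustrative weight function
    return s * s

centers = {l: s_a + (l + 0.5) * delta for l in range(L)}
q_bar_exact = sum(q(s) for s in train) / I                            # cf. (4.1.2)
q_bar_density = sum(q(centers[l]) * p[l] * delta for l in range(L))   # cf. (4.2.13)
print(round(q_bar_exact, 4), round(q_bar_density, 4))
```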
COMMENT 1
The frequencies of occurrences appeared here as compressed information about the primary train of states S_tr, which is relevant from the point of view of the class of performance criteria Q(S_tr) given by (4.1.1). In particular, we did not introduce any assumptions about the existence of statistical regularities. Therefore, if we take Q(S_tr) as the criterion, the compression of the primary data into the much simpler set of frequencies of occurrences does not deteriorate the quality of primary data processing. To evaluate the frequencies of occurrences, the whole train must be available. This is, for example, possible in off-line compression systems based on the JPEG standard described in Section 2.5 or in the adaptive Huffman compression systems considered in Section 6.2.1. Another example are the systems with a training cycle described in Section 1.7.2 (see Figure 1.27). However, for several types of information processing, particularly for making decisions about future actions such as prediction, the existence of statistical regularities is essential. We discuss such problems and the relationships between the frequencies of occurrences and probabilities in the next two sections.
While the considerations about frequencies of occurrences of discrete states are formally strict, the concept of convergence in definition (4.2.12) has a heuristic character. Consequently, the equation (4.2.13) should be considered at this stage as an empirical approximation that may be useful for calculations. Thus, it is an example of the discrete approximations discussed in Section 1.4.3. The problems of convergence in (4.2.12) and of the accuracy of expression (4.2.13) are discussed in the next two sections.
COMMENT 2
From (4.2.6) it follows that the basic characteristic of the frequency of occurrences of continuous states is the frequency of the event that an element of the train falls into the aggregation set 𝒜_l(L). Thus, the primary description of the frequencies of occurrences of continuous states is a function of sets, assigning a number to a given set. The length Δ(L) of the aggregation interval is another function of the set. The density of frequencies, defined by (4.2.8) and (4.2.12), is the limit of the ratio of those two functions of the set (the aggregation interval) when this set shrinks to a point. Consequently, the density of occurrences is a function of a point. Therefore, it is much easier to handle than the frequency of occurrences, which for continuous states is a function of a set. Using information-processing terminology, we may say that the density of occurrences is a representation of the frequency of continuous states. However, we pay a price for describing the frequencies of occurrences of continuous states by the density. Namely, the density depends not only on the objective features of the train but also on the definition of the length of the aggregation interval, which is subjective. In particular, the numerical value of the length depends on the units of length, which in principle are set in an arbitrary way. For example, if the state parameter s is an electrical potential, the physical dimension of the length of the interval Δ(L) is volts. Since the frequency of occurrences has no physical dimension, the physical dimension of the density p*(s, I) is [V]⁻¹.
4.3 STATISTICAL REGULARITIES
The previous two sections considered the train of states as fixed. Now we look at the properties of the frequencies when the length I of the train grows. For many systems, the frequencies of occurrences then begin to stabilize, in other words, to converge to limit values. The first two parts of this section discuss the effect of the convergence of the frequencies of occurrences of states in a train, which can be interpreted as a train of samples of an evolving state process, taken in subsequent instants. The third part discusses an analogous effect when we observe at one instant the states of several similar systems.
4.3.1 STATISTICAL REGULARITIES IN TRAINS OF DISCRETE STATES
In the previous considerations we did not specify the meaning of the element s(i) of the considered train S_tr of states, and we considered the number I of its elements as fixed. Now we assume that
A1. We observe the system during a time interval
For many systems, when the length I of the train of samples grows, the frequencies of occurrences approach limits:

P*(s_l, I) → P*(s_l), ∀ l, as I → ∞.   (4.3.4)
Figure 4.4. Illustration of the assumptions and of the notation: (a) the single train, (b) a pair of trains.
We say that the states having this property exhibit statistical regularities, and the limiting value P*(s_l) is called the empirical probability. The statistical regularities are described by the set of empirical probabilities

P* = {P*(s_l), l = 1, 2, ..., L}.   (4.3.5)

From (4.3.4) it follows that

q̄(I) → q̄ as I → ∞,   (4.3.6a)

where q̄(I) is the arithmetical average over the train given by (4.1.2) and

q̄ = Σ_{l=1}^{L} q(s_l) P*(s_l)   (4.3.6b)

is the empirical statistical average. From (4.1.11) it follows that

If with growing length I of the train the arithmetical average q̄(I) approaches a limit, then the states exhibit statistical regularities.   (4.3.7)

Till this point we assumed that the instantaneous state s_c(t) is a scalar. In general, the state may be structured. If the frequencies of joint occurrences defined by (4.1.14) have the counterpart of property (4.3.4), we say that the components of the structured state exhibit joint statistical regularities. The corresponding limiting values of the frequencies of joint occurrences are called empirical joint probabilities. The fundamental properties (4.1.8), (4.1.15), and (4.1.18) do not depend on the length I of the train. Therefore, the empirical probabilities possess them too. We can also use for them the definition (4.1.21) of the conditional probability. In all these equations we have only to drop I. When the empirical conditional probability of the state s(2) does not depend on the state s(1), we say that both states are statistically independent.
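The stabilization of the frequencies described by (4.3.4) can be observed numerically. The following sketch is only an illustration: a pseudo-random generator with built-in probabilities plays the role of the observed system.

```python
import random

random.seed(0)
forms = ["s1", "s2", "s3"]
probs = [0.5, 0.3, 0.2]          # probabilities built into the simulated system

def frequency(I, form):
    """Frequency of occurrences P*(form, I) in a simulated train of length I."""
    train = random.choices(forms, weights=probs, k=I)
    return train.count(form) / I

for I in (100, 10_000, 1_000_000):
    print(I, round(frequency(I, "s1"), 4))   # approaches 0.5 as I grows
```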
STATIONARITY OF STATISTICAL REGULARITIES
The introduced empirical probabilities characterize a long train of states. In some cases the properties of the system may change during such a long train and cause local changes of the frequencies of occurrences. The superior system may use such a behaviour of its environment to improve the performance of purposeful actions, but this would complicate the optimal rules of operation. However, many systems can be considered as time invariant, and then the local frequencies of occurrences of states do not change during the operation of the superior system. Then the goals of the superior system can be realized efficiently in a simpler way. Therefore, it is important to find out whether the frequencies of occurrences are time invariant. To do this we take two nonoverlapping time intervals, observe the trains of samples taken in each of them (see Figure 4.4b), and calculate the corresponding frequencies of occurrences P*_1(s_l, I) and P*_2(s_l, I). For many systems, when the lengths of the trains grow, the differences
between the frequencies of occurrences of a potential state of the samples taken in both time intervals become very small; we write this in the form

|P*_1(s_l, I) − P*_2(s_l, I)| → 0, ∀ l.   (4.3.8a)

The frequencies P*_k(s_l, I), k = 1, 2 approach the same limit P*(s_l); we write this in the form

P*_k(s_l, I) → P*(s_l), ∀ l, k = 1, 2.   (4.3.8b)

We say that a state parameter for which (4.3.8) holds exhibits statistical regularities and that they are stationary (equivalently, time invariant). Let us comment on the introduced concepts.
COMMENT 1
We can expect that the necessary condition for the statistical properties to be time invariant is that the system is stationary (the relationships between the external states are time invariant; see Section 3.2.5). However, the internal states of most real systems change in time either in a periodical or in a systematic way. In particular, the internal states of most systems serving people exhibit daily and yearly cycles of changes. All systems age. However, often we are interested in a system only during a time interval such that the changes of the internal states of the system during this interval are small compared with the changes of the instantaneous external states. We call such a system quasistationary. For such a system we may assume that the statistical properties are stationary, but this is in principle an inconsistent assumption. In particular, in definition (4.3.4) we have to take I large, but not very large. To simplify the argument and notation we assumed that the state is a scalar and is discrete. As in the previous section, we can generalize our argument for structured and/or continuous states. We sketch such a generalization for continuous states.
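A sketch of the check suggested by (4.3.8a): the samples taken in two nonoverlapping observation intervals are compared form by form (the trains below are illustrative).

```python
from collections import Counter

def frequencies(samples):
    I = len(samples)
    return {s: m / I for s, m in Counter(samples).items()}

def max_frequency_difference(train1, train2):
    """Largest |P*_1(s_l, I) - P*_2(s_l, I)| over all potential forms, cf. (4.3.8a)."""
    P1, P2 = frequencies(train1), frequencies(train2)
    forms = set(P1) | set(P2)
    return max(abs(P1.get(s, 0.0) - P2.get(s, 0.0)) for s in forms)

# Illustrative trains of samples taken in two nonoverlapping time intervals.
train1 = ["a", "b", "a", "a", "c", "b", "a", "c"]
train2 = ["a", "a", "b", "c", "a", "b", "a", "c"]
print(max_frequency_difference(train1, train2))  # a small value suggests stationarity
```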
4.3.2 STATISTICAL REGULARITIES IN TRAINS OF CONTINUOUS STATES
Instead of A4.1 we assume:
A4.2 The set S of potential forms of an instantaneous state s(t), thus of a sample s(i), is the interval <s_a, s_b>.
In the previous section the frequencies of occurrences of continuous states have been defined by means of approximating discrete states. If for every number L of potential forms of the discrete approximation the approximation exhibits the statistical regularities, we say that the train of continuous states exhibits statistical regularities. They are described by the function p*(s), to which the density of the frequency of occurrences p*_c(s, I, L) (see (4.2.9)) "converges" when the number of potential forms of the discretized state L → ∞ or, equivalently, when the size Δ(L) of the set of potential forms of the continuous state that are aggregated into one discretized state tends to 0:

p*_c(s, I, L) → p*(s) as L → ∞, I → ∞.   (4.3.9)

Since the condition I > L must be satisfied to achieve L → ∞, we must also increase the number of samples we observe; thus, the condition I → ∞ must also be satisfied. The function p*(s), s ∈ <s_a, s_b>, is called the density of empirical probability. This function considered as a whole describes the statistical properties of a continuous state similarly as the set P* of empirical probabilities given by (4.3.5) describes the statistical properties of the discrete state. Therefore, we call both P* and p*(s), s ∈ <s_a, s_b>, the empirical probability distribution of a state exhibiting statistical regularities. Let us denote by q̄(L, I) the arithmetical average of the train of the discrete approximating states, given by (4.2.11) with s_l(L) in place of s_l. From (4.3.9) it follows that

q̄(L, I) → q̄ as L → ∞, I → ∞,   (4.3.10)

where

q̄ = ∫_{s_a}^{s_b} q(s) p*(s) ds   (4.3.11)

is the statistical average of the function q(s) of the continuous variable s. Similarly to (4.3.7) we conclude that

If with growing number L of forms of the uniformly quantized state and with growing length I of the train the empirical average q̄(L, I) approaches a limit, then the continuous state exhibits statistical regularities.   (4.3.12)

An example illustrates these general considerations about empirical probabilities.
EXAMPLE 4.3.1 STATISTICAL PROPERTIES OF THERMAL NOISE OF A RESISTOR
Consider a resistor made out of metal wire. In the macro scale it can be considered as the two-terminal accessible system considered in Section 3.2.1. In the atomic scale the metal consists of crystals, and a crystal is an assembly of atoms of the metal located at the nodes of a spatial grid. Between those atoms move free electrons that collide with the atoms. At each collision a potential impulse is induced on the terminals of the resistor. Very many of such pulses overlap and cause a potential difference in the macro scale to develop between the terminals of the resistor. It is called thermal noise. A typical diagram of such a process is shown in Figure 4.5.
Figure 4.5. Typical electrical potential process produced on a resistor by the thermal movement of electrons (thermal fluctuation noise).
As the parameter describing the external state of the resistor we take the instantaneous potential difference v_12(t). A detailed analysis of the elementary potential pulses generated by the collisions shows that the statistical average of v_12(t) is zero and that the statistical average of the squared potential difference (an indicator of its "magnitude") is

kTRB,   (4.3.14)

where k is the Boltzmann constant, T the temperature in kelvins, B the frequency of the highest sinusoidal component of the potential process, and R the resistance of the resistor. From equation (4.3.14) it follows that if the temperature is constant, the potential process is stationary. However, if the temperature changed, the frequencies of occurrences of a given quantized difference of potentials in an observation interval
4.3.3 STATISTICAL REGULARITIES IN ENSEMBLES OF SYSTEMS
States were interpreted in the previous considerations as instantaneous states of a system in a train of instants. There is another great domain in which the concepts of statistical regularities and empirical probabilities occur. Namely, a system can often be considered to be an element of an ensemble of similar systems. By "similar" we mean that the systems have the same general structure and properties but may differ in some details. As an illustration, take our standard example, the resistor. The ensemble would be a set of resistors produced in a factory using the same raw materials, the same technological process, and the same machinery. In spite of these similarities, each resistor would have slightly different properties, and obviously the state on the atomic structural level would be different for every resistor. Notice that, considering the set of potential forms of an internal state in Section 3.3.3, we already introduced the concept of the ensemble of systems. To define the statistical regularities of a nonstationary system, instead of choosing a set of sampling instants and considering the set of time samples of the state of the same system, we choose various sample systems out of the ensemble. Thus, we interpret the sample s(i) introduced at the beginning of our considerations about the statistical state in Section 4.3.1 as the state at a given instant t of the i-th system a_i chosen out of the ensemble E:

s(i) = s_{a_i}(t).   (4.3.15)

Replacing τ_i, defined by (4.3.1), by a_i, we can utilize all our previous arguments and definitions to define the frequencies of occurrences, statistical averages, statistical regularities, and empirical probabilities for the states (external, internal) of a system chosen out of the ensemble E. The counterpart of a stationary system is a homogeneous ensemble, for which the frequencies of occurrences of states of systems chosen out of two subsets of E "converge" to the same limiting values. To this point we considered the instant t at which we observe the systems as fixed. If we consider this instant as variable, the concept of statistical regularities in an ensemble of systems allows us to define in a systematic way the time-dependent statistical regularities, particularly the time-dependent empirical probabilities.
4.3.4 TESTING THE EXISTENCE OF STATISTICAL REGULARITIES AND ESTIMATION OF PROBABILITY DISTRIBUTION
The basic two problems are
• to determine whether the state parameters exhibit statistical regularities, thus whether the frequencies of occurrences or, equivalently, the arithmetical averages converge, and
• if they do, how to find the probability distribution.
The first problem is called testing whether statistical regularities occur (briefly, the testing problem), and the second is called the statistical identification problem.
The approach to these problems presented here has a heuristic character. The concepts "very small" and "approaches a limit" have been used in the heuristic sense. Thus, the definitions (4.3.4), (4.3.8), and (4.3.9) of the empirical probability, of the probability density, and of the empirical averages are not strict in the mathematical sense. They are, rather, practical instructions how to check whether a state parameter exhibits statistical regularities and, if it does, how to estimate them. Several rules used in statistics have a similar character (see, e.g., Frank, Althoen [4.2] for an introduction, Kotz, Johnson [4.3] for an encyclopedic review, and Press [4.4, ch. 13] for concrete programs for obtaining an estimate of a probability distribution). The empirical approach can be justified by the usefulness criterion. We assume tentatively that the states exhibit statistical regularities and that the estimates of probabilities are exact. Using such design information we find the rules of optimal information processing (see, e.g., Section 1.7.1). If the performance of the superior system is improved, we consider ex post the tentative assumptions about the existence of statistical regularities to be true. Although it may seem inconsistent, such a procedure is, in fact, the ultimate justification for the application of all the mathematical models we apply.
The attempts to formalize the empirical approach have a long history (see the classical work by von Mises [4.1] and, for more recent discussions and references, Shafer, Pearl [4.5, ch. 4, the tutorial]). The difficulties in formalizing the empirical approach caused the axiomatic approach to become prevalent in mathematical probability theory (see Kolmogorov [4.6], Renyi [4.7], Billingsley [4.8]). This approach is described in the next section, which also explains why, in spite of its mathematical elegance, the axiomatic approach does not replace the heuristic approach presented here.
The basic approach of the axiomatic probability theory to the testing problem is to analyze the convergence of arithmetical averages under assumptions about the probabilities of the samples that are so weak that it is possible to justify them knowing only some general properties of the mechanism generating the successive samples. An example is the assumption that the samples are statistically independent. The theorems about the convergence of averages in such a case are called laws of large numbers (for basic theorems see Renyi [4.7], Papoulis [4.9], Breiman [4.10]; for more detailed studies Reves [4.11]). More complicated is the case when a sample is produced by an indeterministic transformation from the previous sample (such as in the Markovian process considered in Section 5.3). The analysis of the convergence of arithmetical averages of statistically related samples is the subject of ergodic theory (for an introduction see Breiman [4.10]; for a more detailed study see Mane [4.12]).
The basic approach to the statistical identification problem is to gain more insight into the mechanism generating the state. This mechanism is described by (1) the universal relationships between the states of the system (the internal states of the system), (2) the external factors influencing the system, and (3) the initial states of the system. These are the factors determining the value of a sample. The approach analyzing those factors is called the analytic approach.
The typical analytic approach of the axiomatic probability theory to the statistical identification problem is (1) to assume that the sample is generated by a transformation from some primary states, called causing states, (2) to assume that the causing states exhibit statistical regularities, and (3) to look for such generating transformations that the probability distribution can be approximated by a probability distribution depending on the generating transformation and only on some simple statistical features of the causing states. The states produced by such a mechanism may be called states weakly dependent on the statistics of the causing factors, and the approximating probability distribution is called the limiting distribution. Section 4.5.1 shows that the uniform distribution can be considered as a limiting distribution, while Section 5.5.2 shows that the Poisson distribution has such a character. Of paramount importance is the limiting character of the Gaussian probability distribution. The theorems stating that this is the limiting probability distribution of a state that is produced by adding a large number of independent commensurable components (with variances of similar order of magnitude) are called central limit theorems. These theorems are essential for statistical physics (see, e.g., Huang [4.13]). In particular, they allow us to derive the probability distribution of state parameters characterizing a macroscopic property that is a manifestation of atomic-scale factors. A typical example of such a state parameter is the pressure caused by the movement of the molecules of a gas. Another example is the fluctuation of potentials on the terminals of a metallic resistor caused by the movement of free electrons, described in the previous example. The central limit theorems can also be used to identify the probability distribution of parameters determined by many factors in the macro scale. A typical example is the electrical energy consumed during one hour by the households in a large city, or the number of telephone connections established by an exchange serving many subscribers. For an introduction to central limit theorems see Renyi [4.7], Papoulis [4.8], Breiman [4.10]; for a more detailed study see Gnedenko, Kolmogorov [4.14]. The theorems are discussed in Section 4.5.2. Although the character of the limiting probability distribution depends strongly on the generating transformation, the statistical regularities of the considered state are a consequence of the statistical regularities of the causing states. Interesting are systems that realize a deterministic transformation that is unstable (such as the shift register with feedback mentioned in Section 3.2.4) and that, not being influenced by external factors, produce a train of states that is locally similar to a train of states exhibiting statistical regularities and statistically independent. Such a system is called a generator of pseudo-random numbers (for a detailed analysis see Dagpunar [4.15], Yarmolik, Demidenko [4.16], Niederreiter [4.17]). A similar character have the special classes of oscillating deterministic systems described by specific nonlinear differential equations that are considered in chaos theory (see, e.g., Devaney [4.18], Rasband [4.19]).
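The limiting character of the Gaussian distribution can be illustrated numerically; the following sketch (ours) adds many independent commensurable components and checks one standardized frequency against the corresponding Gaussian value.

```python
import math
import random

random.seed(1)
N, trials = 50, 20_000
# Each state is a sum of N independent components uniform on <-1, 1>.
sums = [sum(random.uniform(-1.0, 1.0) for _ in range(N)) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((x - mean) ** 2 for x in sums) / trials
sigma = math.sqrt(var)

# Compare the empirical frequency of the event |sum - mean| < sigma
# with the corresponding Gaussian value, approximately 0.6827.
inside = sum(1 for x in sums if abs(x - mean) < sigma) / trials
print(round(inside, 3))
```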
4.4 THE AXIOMATIC APPROACH TO STATISTICAL REGULARITIES
To utilize the statistical regularities for the enhancement of the superior system's efficiency, we usually have to perform on the primary frequencies of occurrences various operations, for example, to calculate conditional or marginal frequencies or averages. A branch of mathematics, probability theory, provides a formalism for such analyses. This section reviews the basic concepts of this theory. To illustrate the general concepts and to derive conclusions that are needed in subsequent chapters, we consider in the next two sections two more specialized areas of probability theory: the prototype probability distributions, and the "view of probability theory" on the statistical regularities discussed in the previous section. The next chapter, devoted to statistical relationships, provides further examples of applications of probability theory. There are several excellent books on probability theory (e.g., Papoulis [4.9], Breiman [4.10]). Our approach is quite specific because we emphasize the relationships between probability theory and information systems analysis. Especially close are the links between the concepts of probability theory and the analysis of frequencies of occurrences of states presented in Section 4.1. Those links are bilateral. On the one hand, the axioms of probability theory can be considered as an abstract generalization of the plausible properties of the frequencies of occurrences. On the other hand, because the frequencies of occurrences satisfy the axioms of probability theory, they can be considered as probabilities, and the theorems of this theory hold for them. In the following review of fundamental concepts of probability theory, we discuss the axioms, the concept of random variables, and the basic properties of statistical averages.
4.4.1 THE AXIOMS OF PROBABILITY THEORY
Probability theory is based on axioms, and, as is usual in mathematics, the relationships between the axioms and the real world are not the subject of the theory. However, for the analysis of states of systems and of information processing this relationship is crucial. Therefore, in this presentation of the axioms of probability theory we emphasize the relationships between the axioms and the previously introduced concepts of information systems analysis, particularly the concepts introduced in the previous chapter. The first two fundamental concepts of probability theory are the primary event, which we denote e, and the set ℰ of forms which the elementary event can take. The third concept is the event. It is a subset of the set ℰ.
The event corresponds to the situation in which a state parameter (primary, secondary, rough) takes a concrete form, say s'. Such a situation can be considered as the set E(s') of situations in which all components of the exact description of the instantaneous state take such values that the value of the considered state parameter is s'. Thus, the set E(s') has the meaning of an aggregation set.

The fourth fundamental concept of probability theory is the probability measure (briefly, probability). It is a number associated with each subset A of the set E. We denote it P(A); as usual, we use the special character P to denote an operation assigning a number to a mathematical object more complicated than a number. Probability satisfies the following axioms:

AX1. For a subset A of E,

0 <= P(A) <= 1,  P(E) = 1,  P(0) = 0,   (4.4.1)

AX2. For two subsets A1, A2 such that A1 ∩ A2 = 0 we have

P(A1 ∪ A2) = P(A1) + P(A2),   (4.4.2)

where 0 denotes the empty set.

Taking into account the previously mentioned correspondence between the event in the sense of probability theory and the situation that a state parameter takes a concrete value, we see that the axioms can be considered as an abstract generalization of the obvious properties of the frequencies of occurrences. In particular, axiom (4.4.2) is the generalization of the obvious property (4.1.33) of frequencies of occurrences. On the other hand, it is easily seen that the frequencies of occurrences defined by (4.1.6), (4.1.14), and (4.1.21) satisfy the axioms AX1 and AX2. Therefore, most results of probability theory apply to the frequencies of occurrences.

4.4.2 THE RANDOM VARIABLES

We assume that a probability measure is defined on the set E of elementary events. A function s(e) assigning to each elementary event e ∈ E a number s, considered as a whole, is called a random variable. In our notation, the random variable is the set {s(e); e ∈ E}. Instead of this lengthy expression, we use the shadowed character s.

Suppose that s(e) can take only the values s_l, l = 1, 2, ..., L. Then we call s a discrete random variable. When the event occurs that s(e) takes the value s_l, we say that the random variable s takes the value s_l, and we write it in the form s = s_l. The probability of this event is P(s = s_l). This is the counterpart of the previously introduced frequency of occurrences P*(s_l, I) and of the empirical probability P*(s_l).

If the random variable can take values from the interval S = <s_a, s_b>, then we call it a continuous random variable. To describe such a variable we take a sequence of subintervals B_m, m = 1, 2, ..., shrinking onto a point s ∈ S. Thus

lim_{m→∞} γ(B_m) = 0,   (4.4.3)

where γ(B_m) = |B_m| is the length of the subinterval B_m. For a continuous random variable P(s ∈ B_m) → 0, but the limit

p(s) = lim_{m→∞} P(s ∈ B_m) / γ(B_m)   (4.4.4)

exists.
This is a similar effect as in the case of frequencies of occurrences, which has been discussed in detail in Section 4.2 (for a strict definition of the density of probability see, e.g., Billingsley [4.8]). The limit characterizes the value s from the point of view of the statistical regularities of the continuous state parameter. Since it is an analog of the density of a thin material bar, it is called the density of probability. In technical jargon the term probability density is used. It is shorter and may be misleading, but when it causes no confusion, we will use it too.

The set B_m is the counterpart of the aggregation set (definition (4.2.4)), which was introduced when defining the density p*(s, I, L) of the frequency of occurrences (definitions (4.2.7), (4.2.8), and (4.2.9)). The d.o.p. p(s) (defined by (4.4.4)) is the counterpart of the density of empirical probability p*(s) defined by (4.3.9). In particular, γ(B_m) is the counterpart of Δ(L) in (4.2.9). Subsequent sections discuss in more detail the relationships between those concepts.

The generalization of our reasoning to continuous K-DIM vector states is similar to the case of frequencies of occurrences. We have to
1. replace the subinterval B_m occurring in definition (4.4.4) by a K-DIM subset B_m of the set S of the potential forms of the vector state, and
2. take, instead of the length of the interval, the volume γ(B_m) of the subset B_m and require in definition (4.4.4) that the subsets B_m shrink in all dimensions onto the considered point s' = {s'(k); k = 1, 2, ..., K}.

A typical choice for the subsets are K-DIM cubes:

B_m = { {s(1), s(2), ..., s(K)}; s'(k) − Δ_m/2 < s(k) <= s'(k) + Δ_m/2, k = 1, 2, ..., K },   (4.4.5)

and the typical definition of the volume is

γ(B_m) = (Δ_m)^K.   (4.4.6)

Similarly to the density of occurrences (see Comment 2, page 179), the density of probability p(s) depends not only on the statistical properties of the system but also on the definition of the volume of the subsets B_m. This must be taken into account when transformations of random variables are considered.

The density of probability {p(s); s ∈ S}, considered as a whole, describes completely the properties of the continuous random variable, similarly as the set of probabilities {P(s = s_l); l = 1, 2, ..., L} describes the discrete variable. We call both {p(s); s ∈ S} and {P(s = s_l); l = 1, 2, ..., L} the probability distribution of the random variable.

If we know the function s(e) determining the dependence of the variable on the elementary event and the probability of the elementary events, then in principle we can calculate the probabilities (the density of probability) for the corresponding random variable. However, if we consider only a single random variable s(1), it is completely described by its probability distribution, and we do not need to know either the relationship between the value of the random variable and the elementary events or the probability of the elementary events. Similarly, if we consider only a pair of random variables s(1), s(2), we need only to know their joint distribution.

The bidirectional relations between frequencies of occurrences and probabilities not only cause them to correspond but cause all previously introduced definitions and derived properties of frequencies and empirical probabilities to apply to probabilities.
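Definition (4.4.4) can be illustrated numerically. The sketch below (added here; the gaussian example, the evaluation point, and the sample size are arbitrary illustrative choices) estimates the ratio P(s ∈ B_m)/γ(B_m) from a large sample for a sequence of shrinking intervals.

```python
import numpy as np

# Sketch of definition (4.4.4): estimate p(s') as P(s in B_m) / gamma(B_m)
# for shrinking intervals B_m centred at s'. All parameters are arbitrary.
rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=2_000_000)   # observations of a gaussian variable
s_prime = 0.5                                    # point at which the density is estimated
true_density = np.exp(-s_prime**2 / 2) / np.sqrt(2 * np.pi)

for delta in (1.0, 0.5, 0.1, 0.02):              # lengths gamma(B_m) of the intervals
    in_interval = np.abs(sample - s_prime) <= delta / 2
    ratio = in_interval.mean() / delta           # empirical counterpart of P(s in B_m)/gamma(B_m)
    print(f"gamma(B_m)={delta:5.2f}  ratio={ratio:.4f}  true p(s')={true_density:.4f}")
```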
In particular, if in equations (4.1.8b), (4.1.15), and (4.4.18) we drop I and the asterisk *, we obtain definitions valid for probabilities. Similarly, the counterparts of the definitions (4.1.22) and (4.1.23) of conditional frequencies are the definitions of the conditional probabilities

P[s(2) = s_k(2) | s(1) = s_l(1)] ≝ P[s(1) = s_l(1), s(2) = s_k(2)] / P[s(1) = s_l(1)],   (4.4.7a)

and of the density of conditional probability

p[s(2)|s(1)] ≝ p[s(1), s(2)] / p[s(1)].   (4.4.7b)

The counterparts of condition (4.1.28) for statistical independence take the form

P[s(1) = s_l(1), s(2) = s_k(2)] = P[s(1) = s_l(1)] P[s(2) = s_k(2)],   (4.4.8a)

p[s(1), s(2)] = p[s(1)] p[s(2)].   (4.4.8b)

For discrete variables the counterpart of equation (4.1.20) is

P[s(2) = s_k(2)] = Σ_{l=1}^{L} P[s(1) = s_l(1), s(2) = s_k(2)],   (4.4.8c)

while for continuous variables

p[s(2)] = ∫ p[s(1), s(2)] ds(1).   (4.4.8d)
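The following minimal sketch (added here; the joint probability table is an arbitrary illustrative choice) computes the marginal and conditional probabilities according to (4.4.7a) and (4.4.8c) and tests the independence condition (4.4.8a).

```python
import numpy as np

# Sketch of (4.4.7a), (4.4.8a) and (4.4.8c) for a small discrete joint distribution.
P_joint = np.array([[0.30, 0.10],     # rows: forms of s(1), columns: forms of s(2)
                    [0.05, 0.55]])

P_s1 = P_joint.sum(axis=1)                 # marginal P[s(1)=s_l(1)]
P_s2 = P_joint.sum(axis=0)                 # marginal P[s(2)=s_k(2)], counterpart of (4.4.8c)
P_s2_given_s1 = P_joint / P_s1[:, None]    # conditional P[s(2)|s(1)], definition (4.4.7a)

print("marginal of s(2):", P_s2)
print("conditional rows sum to one:", P_s2_given_s1.sum(axis=1))
# Independence test (4.4.8a): compare the joint table with the product of marginals.
print("independent?", np.allclose(P_joint, np.outer(P_s1, P_s2)))
```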
4.4.3 THE STATISTICAL AVERAGE

The counterpart of the arithmetical average q(I) given by (4.1.5) and of the empirical statistical average q given by (4.3.6b) is the statistical average of a scalar function q(s) of a discrete random variable s:

E q(s) = Σ_{l=1}^{L} q(s_l) P(s = s_l).   (4.4.9)

We denote by E the operation of statistical averaging with respect to the random variable s, and we interpret the statistical average given by the right-hand side of (4.4.9) as the result of performing the operation of statistical averaging on the random variable q(s). If it causes no confusion, in the subsequent text we drop the random variable under the symbol E. The operation E of statistical averaging is the counterpart of the operation A of arithmetic averaging defined by (4.1.3).

If the random variable is scalar and q(s) = s, the definition (4.4.9) simplifies to the definition of the statistical average of a scalar random variable:

E s = Σ_{l=1}^{L} s_l P(s = s_l).   (4.4.10)

Using (4.4.9) and proceeding similarly as in the derivation of equation (4.2.13), for a continuous scalar variable we define the statistical average

E s = ∫ s p(s) ds.   (4.4.11)
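A minimal sketch of definitions (4.4.9) and (4.4.10) follows (added here; the values, probabilities, and sample size are arbitrary choices). It also shows that the arithmetical average of many observations approaches the statistical average.

```python
import numpy as np

# Sketch of definitions (4.4.9) and (4.4.10) for a discrete random variable.
s_values = np.array([-1.0, 0.0, 2.0])
P = np.array([0.2, 0.5, 0.3])

E_s = np.sum(s_values * P)                 # (4.4.10): E s = sum_l s_l P(s=s_l)
E_q = np.sum(s_values**2 * P)              # (4.4.9) with q(s) = s**2

# The statistical average is approached by the arithmetical average of observations.
rng = np.random.default_rng(2)
observations = rng.choice(s_values, size=200_000, p=P)
print("E s   :", E_s, " arithmetical average:", observations.mean())
print("E s^2 :", E_q, " arithmetical average:", (observations**2).mean())
```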
To this point we considered the mean value of a scalar function of a scalar argument. Similarly, we define the statistical average of a scalar function q(s) of a K-DIM continuous vector:

E q(s) = ∫∫···∫_{S_K} q(s) p(s) ds,   (4.4.12)

where S_K is the (continuous) set of the potential forms of the K-DIM state s.

Let us take as a special case s = {s(1), s(2)} and q(s) = a(1)s(1) + a(2)s(2). After some calculus, from (4.4.10) or (4.4.11) we get

E[a(1)s(1) + a(2)s(2)] = a(1) E s(1) + a(2) E s(2).   (4.4.13a)

Thus,

The operation of statistical averaging is a linear operation.   (4.4.13b)

We use this conclusion frequently in subsequent considerations.

The difference s − E s has the meaning of the deviation of the value of the random variable from its average. Since the statistical averaging operation is a linear operation, we have

E(s − E s) = E s − E s = 0.   (4.4.14)

Thus, the statistical average has the meaning of a constant around which the observations of the random variable fluctuate. However, the statistical average does not characterize the range of those fluctuations. A characteristic of this range is the average

σ²(s) = E(s − E s)².   (4.4.15)

It is called the variance. Using again the linearity of the averaging operation, we get

σ²(s) = E s² − (E s)².   (4.4.16)

The pair mean value and variance can be considered as a rough description of the probability distribution, characterizing the center and the range of fluctuations of observations of the corresponding random variable. From (4.4.16) it follows that the pair E s and E s² provides an equivalent rough description. It can be expected that we may obtain still more accurate descriptions using the averages

A_m = E s^m,   (4.4.17)

where m is an integer. The average A_m is called the moment of mth order (of the probability distribution). It can be proved that under very general assumptions all moments A_m, m = 1, 2, ... provide an exact description of the probability density. Namely, under quite general conditions it can be presented in the form

p(s) = Σ_{m=1}^{∞} μ(m) h(s, m),  −∞ < s < ∞,   (4.4.18)

where the coefficient μ(m), called the mth cumulant, is a function of the moments A_1, A_2, ..., A_m (e.g., μ(2) = σ²(s)), and h(s, m) is a family of specific orthogonal functions. Thus, the probability density can be represented by the infinite set of cumulants. This is a special case of the spectral representations that are discussed in Section 7.4.1. An important property of a spectral representation is that its finite initial part is an optimal approximate representation (see Section 7.1.3). Thus, e.g., the mean value and variance are an optimal rough description of a probability density by two parameters.
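A short numerical check of formula (4.4.16) (added here; the same kind of arbitrary discrete distribution as in the previous sketch):

```python
import numpy as np

# Sketch checking (4.4.16): sigma^2(s) = E s^2 - (E s)^2.
s_values = np.array([-1.0, 0.0, 2.0])
P = np.array([0.2, 0.5, 0.3])

E_s = np.sum(s_values * P)
var_direct = np.sum((s_values - E_s)**2 * P)      # definition (4.4.15)
var_moments = np.sum(s_values**2 * P) - E_s**2    # formula (4.4.16)
print(var_direct, var_moments)                    # the two numbers coincide
```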
CONDITIONAL AVERAGES

Conditional averages are of paramount importance for the theory of information processing. To simplify the argument we again assume that the two random variables s(1) and s(2) are discrete scalar variables. Replacing in (4.4.10) the probabilities P(s = s_l) by the conditional probabilities defined by (4.4.7a), we obtain the conditional statistical average of s(2) on the condition that s(1) = s_l(1):

E[s(2) | s_l(1)] = Σ_{k=1}^{L(2)} s_k(2) P[s(2) = s_k(2) | s(1) = s_l(1)],   (4.4.19)

where E[· | s_l(1)] denotes the operation of conditional averaging over the random variable s(2) on the condition that s(1) is fixed. The conditional average is a function of the condition s_l(1). To indicate this we introduce the function

D[s(1)] ≝ E[s(2) | s(1)],   (4.4.20a)

and the random variable

D = D[s(1)].   (4.4.20b)

We denote by

D̄ = E D   (4.4.21)

the statistical average of D. Using the equation for marginal probabilities (the counterpart of (4.4.8)), after some algebra we get

E s(2) = D̄.   (4.4.22)

Substituting (4.4.20) and (4.4.21), we write this equation in the form

E s(2) = E_{s(1)} E[s(2) | s(1)].   (4.4.23)

Thus, the average of s(2) can be obtained by averaging the conditional average of s(2) over the condition. This equation allows us to take into consideration, besides a state s(1), another statistically related state s(2). Therefore, equation (4.4.23) plays a key role in the optimization of information processing.

4.4.4 CORRELATION COEFFICIENTS AND CORRELATION MATRIX

Let us consider two random variables s(1) and s(2). The difference between the random components of those variables is [s(2) − E s(2)] − [s(1) − E s(1)]. It is natural to take

d_ss(1, 2) = E{[s(2) − E s(2)] − [s(1) − E s(1)]}²   (4.4.24a)

as an indicator of the "difference" between the random components of both variables. Using the linearity of the operation E we get

d_ss(1, 2) = E[s(1) − E s(1)]² + E[s(2) − E s(2)]² − 2 c_ss(1, 2),   (4.4.24b)

where

c_ss(1, 2) = E[s(1) − E s(1)][s(2) − E s(2)].   (4.4.24c)
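The property expressed by (4.4.23) can be checked numerically. The sketch below (added here; the joint probability table is an arbitrary illustrative choice) computes E s(2) once from the marginal distribution and once by averaging the conditional average over the condition.

```python
import numpy as np

# Sketch of (4.4.19)-(4.4.23): the average of s(2) equals the average of its
# conditional averages taken over the condition s(1).
s2_vals = np.array([-1.0, 0.0, 1.0])
P_joint = np.array([[0.10, 0.20, 0.10],     # P[s(1)=s_l(1), s(2)=s_k(2)]
                    [0.05, 0.15, 0.40]])

P_s1 = P_joint.sum(axis=1)
P_s2_given_s1 = P_joint / P_s1[:, None]

D = P_s2_given_s1 @ s2_vals            # D[s_l(1)] = E[s(2)|s(1)=s_l(1)], definition (4.4.19)
lhs = P_joint.sum(axis=0) @ s2_vals    # E s(2) from the marginal distribution of s(2)
rhs = P_s1 @ D                         # averaging the conditional average, equation (4.4.23)
print(lhs, rhs)                        # both sides coincide
```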
From equation (4.4.24b) it can be seen that the indicator d_ss(1, 2) of the statistical difference depends on the statistical relationship between both variables only through the coefficient c_ss(1, 2), and the difference is the smaller, the larger c_ss(1, 2) is. Thus, c_ss(1, 2) can be interpreted as an indicator of the statistical relationship between the random variables s(1) and s(2). Therefore, c_ss(1, 2) is called the correlation coefficient. Some authors call it the centralized correlation coefficient, while the coefficient ĉ_ss(1, 2) = E s(1)s(2) is called the noncentralized correlation coefficient. Since we use only the centralized coefficient, we drop the adjective "centralized".

The concrete form of the averaging operation for discrete variables we obtain by taking in the definition (4.4.9), instead of s, the pair {s(1), s(2)}. Definition (4.4.24c) then takes the form

c_ss(1, 2) = Σ_{l=1}^{L} Σ_{k=1}^{L} [s_l(1) − E s(1)][s_k(2) − E s(2)] P[s(1) = s_l(1), s(2) = s_k(2)].   (4.4.25a)

Similarly, for continuous variables we use equation (4.4.11) and we get

c_ss(1, 2) = ∫∫ [s(1) − E s(1)][s(2) − E s(2)] p[s(1), s(2)] ds(1) ds(2).   (4.4.25b)
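A sketch estimating the correlation coefficient (4.4.24c) from observations (added here; the linear-plus-noise generating model and the sample size are arbitrary choices):

```python
import numpy as np

# Sketch of (4.4.24c): the correlation coefficient estimated from a sample of
# observations of a pair of related variables.
rng = np.random.default_rng(3)
s1 = rng.normal(0.0, 1.0, size=100_000)
s2 = 0.8 * s1 + rng.normal(0.0, 0.5, size=s1.size)

c12 = np.mean((s1 - s1.mean()) * (s2 - s2.mean()))   # empirical counterpart of (4.4.24c)
print("c_ss(1,2) estimate:", c12, " (the generating model gives E s(1)s(2) = 0.8)")
```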
Let us now consider the multidimensional random variable s = {s(k); k = 1, 2, ..., K}. The mutual relationships between the components s(k), k = 1, 2, ..., K are characterized by the set of the correlation coefficients c_ss(m, k), ∀ m, k. It is convenient to arrange them in a matrix

C_ss = [c_ss(m, k)],   (4.4.26a)

where

c_ss(m, k) = E[s(m) − E s(m)][s(k) − E s(k)].   (4.4.26b)

The matrix C_ss is called the correlation matrix. We now derive a few properties of this matrix, which are used in subsequent chapters. To simplify the notation we assume

E s(k) = 0, ∀ k.   (4.4.27)

From the definition of c_ss(m, k) it follows that the correlation matrix C_ss is symmetrical. To derive another important property of the matrix we introduce the auxiliary variable

z = Σ_{k=1}^{K} a(k) s(k),   (4.4.28)

where a(k), k = 1, 2, ..., K are arbitrary numbers. The mean square of it is

E z² = E ( Σ_{m=1}^{K} a(m) s(m) ) ( Σ_{k=1}^{K} a(k) s(k) ) = Σ_{m=1}^{K} Σ_{k=1}^{K} E[s(m) s(k)] a(m) a(k) = Σ_{m=1}^{K} Σ_{k=1}^{K} c_ss(m, k) a(m) a(k).   (4.4.29)

Except in singular cases, E z² > 0. Therefore

Σ_{m=1}^{K} Σ_{k=1}^{K} c_ss(m, k) a(m) a(k) > 0.   (4.4.30)
A matrix satisfying such a condition is called positive definite (see Thompson [4.20], Horn [4.21]). This property is essential for efficient dimensionality reduction, which is described in Section 7.2.

An important property of correlation coefficients is that to calculate the correlation coefficients of a multidimensional secondary state obtained by a linear transformation of a primary multidimensional state, we need only the correlation matrix of the primary state. We now derive this relationship. From the rules of matrix multiplication (see, e.g., Thompson [4.20], Horn [4.21]) it follows that if a is a column matrix with elements a(k), then a aᵀ is a square matrix with elements a(m)a(k):

a aᵀ = [a(m) a(k)].   (4.4.31)

Using this formula we write the correlation matrix C_ss given by (4.4.26a) in the form

C_ss = E s sᵀ,   (4.4.32)

where s is the random column matrix with elements s(k). We denote by s the column matrix with the components s(k), k = 1, 2, ..., K of the primary state and by v the column matrix with the components v(k), k = 1, 2, ..., K of the secondary state. A wide class of linear transformations can be presented (see, e.g., Thompson [4.20], Horn [4.21], Usmani [4.22]) in the form

v = H s,   (4.4.33)

where H is a K×K square matrix. For the corresponding random variables we have

v = H s.   (4.4.34)

Using again (4.4.32) we represent the correlation matrix

C_vv = [E v(m) v(k)]   (4.4.35)

of the transformed variables in the form

C_vv = E v vᵀ.   (4.4.36)

After substituting (4.4.34) and some elementary matrix algebra, we get

C_vv = E (H s)(H s)ᵀ = H (E s sᵀ) Hᵀ = H C_ss Hᵀ,   (4.4.37)

where C_ss is the correlation matrix of the primary state. This equation is used frequently in the subsequent chapters.
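Equation (4.4.37) can be verified numerically. In the sketch below (added here; the matrices, the gaussian sampling model, and the sample size are arbitrary illustrative choices) the correlation matrix of the transformed state computed from (4.4.37) is compared with an empirical estimate.

```python
import numpy as np

# Sketch of (4.4.37): correlation matrix of a linearly transformed state, C_vv = H C_ss H^T.
rng = np.random.default_rng(4)
C_ss = np.array([[2.0, 0.6, 0.1],
                 [0.6, 1.0, 0.3],
                 [0.1, 0.3, 0.5]])     # a symmetric, positive definite correlation matrix
H = rng.normal(size=(3, 3))            # matrix of the linear transformation v = H s

C_vv_formula = H @ C_ss @ H.T          # equation (4.4.37)

# Cross-check by simulation: draw zero-mean vectors with correlation matrix C_ss,
# transform them, and estimate the correlation matrix of the secondary state.
s = rng.multivariate_normal(np.zeros(3), C_ss, size=200_000)
v = s @ H.T
C_vv_empirical = (v.T @ v) / v.shape[0]
print(np.max(np.abs(C_vv_formula - C_vv_empirical)))   # small sampling error only
```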
4.5 PROTOTYPE PROBABILITY DISTRIBUTIONS

This section has two purposes. First, it illustrates on concrete examples the previously introduced concepts of the axiomatic probability theory. Second, it provides two examples of transformations producing states weakly dependent on the statistics of causing factors, discussed in Section 4.3.4. The limiting probability distributions are the uniform and the gaussian distribution. We concentrate on these two distributions because
• In many cases we can conclude from very general premises that the mechanism producing a state is similar to one of the transformations considered here, and
• often, due to the deterministic relationships discussed in Chapter 3, some relevant states of the system can be considered as secondary states produced from the states mentioned above; then, using routine procedures of probability theory, we can calculate the probabilities describing the secondary states (examples are given in Section 5.2).

For these reasons we call the uniform and gaussian distributions the prototype probability distributions.

4.5.1 THE UNIFORM PROBABILITY DISTRIBUTION

We say that a discrete random variable s has a uniform probability distribution if

P(s = s_l) = const,   (4.5.1)

where s_l, l = 1, 2, ..., L are the potential forms of the state. We say that a continuous random variable s has a uniform probability distribution if

p(s) = const, ∀ s.   (4.5.2)

We consider a continuous scalar state. We assume that:

A1. The primary state is a scalar, and the set of its potential forms is the interval <−s_b, s_b>;
A2. The primary state can be considered as a realization of the continuous random variable s; we denote by p_s(s) its probability density;
A3. Using uniform quantization we transform the primary state s into the discrete state w; we denote by s_l, l = 1, 2, ..., L the potential forms of w and by T_q(·) the quantizing transformation; thus w = T_q(s);
A4. We achieve the quantization by the next-neighbour transformation

w = s_l if |s − s_l| <= |s − s_k|, ∀ k ≠ l,   (4.5.3)

where the reference values are

s_l = [l − (L+1)/2] Δ,   (4.5.4)

and Δ = 2 s_b / L is the length of the quantization interval; the described transformation is illustrated in Figure 4.6a;
A5. The secondary state is

b ≝ s − s_l = s − T_q(s).   (4.5.5)

The secondary state has the meaning of the quantization error. It can be interpreted as the state at the output of the system shown in Figure 4.6b. The diagram of b as a function of s is shown in Figure 4.6c. The random variable representing the secondary state is

b = s − T_q(s).   (4.5.6)

From Figure 4.6c we see that the event b ∈ <b, b+db> occurs when the primary state falls into one of the L intervals <s_l + b, s_l + b + db>, l = 1, 2, ..., L; hence

P(b ∈ <b, b+db>) = Σ_{l=1}^{L} P(s ∈ <s_l + b, s_l + b + db>).   (4.5.7)
Figure 4.6. Illustration of the definition of the secondary state: (a) the transformation of the primary state s into the quantized state w, (b) the bloc diagram of the system producing the secondary state, (c) dependence of the secondary state b on the primary state s, (d) a typical probability density of the primary state, (e) interpretation of equation (4.5.7), (f) the resulting probability density of the secondary state b; the scale on the vertical axis is roughly L times coarser than in Figure 4.6d.

From this and from the definition (4.4.4) of the probability density we get

p_b(b) = Σ_{l=1}^{L} p_s(s_l + b),  b ∈ <−Δ/2, Δ/2>,   (4.5.8)

where p_s(s) is the probability density of the primary state and Δ is the length of the quantization interval.
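A numerical sketch of the mechanism described by (4.5.5)-(4.5.8) follows (added here; the gaussian primary density, the number L of intervals, and the range are arbitrary choices, and the quantizer indexing is an equivalent zero-based form of (4.5.4)). For fine quantization the empirical density of the error b is nearly uniform, as stated below.

```python
import numpy as np

# Sketch: for fine uniform quantization the density of the quantization error
# b = s - T_q(s) is nearly uniform on <-Delta/2, Delta/2>, regardless of the
# smooth density of the primary state.
rng = np.random.default_rng(5)
s = rng.normal(0.0, 1.0, size=1_000_000)   # primary state: gaussian, clearly non-uniform
L, s_b = 64, 4.0                           # number of quantization intervals, range <-s_b, s_b>
Delta = 2 * s_b / L                        # length of the quantization interval

s = s[np.abs(s) < s_b]                     # keep values inside the quantizer range
l = np.rint((s + s_b) / Delta - 0.5)       # index of the nearest reference value
s_l = (l + 0.5) * Delta - s_b              # reference values (zero-based form of (4.5.4))
b = s - s_l                                # quantization error (4.5.5)

hist, _ = np.histogram(b, bins=20, range=(-Delta / 2, Delta / 2), density=True)
print(np.round(hist * Delta, 3))           # each entry close to 1, i.e. density close to 1/Delta
```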
From (4.5.8) it follows that the probability density of the secondary state b is the sum of shifted segments of length Δ of the probability density of the primary state, as illustrated in Figure 4.6d. It is seen that:

If the length of the quantization interval Δ is sufficiently small and the probability density is a smooth function, then, independently of the concrete form of this density, the probability density of the transformed state approaches the uniform probability density.   (4.5.9)

Thus, the system shown in Figure 4.6b is an example of a system producing states weakly depending on the causing states. Our argument also proves that the uniform probability distribution is a limiting distribution (see Section 4.3.4).

Although the system in Figure 4.6b seems to be specific, it is representative of a wide and important class of systems whose states have a uniform distribution independent of the distribution of the primary factors generating the states. The reason is that the state b defined by (4.5.5) can also be defined as the remainder of dividing the value of the primary state by a constant Δ. The state of a variety of systems is determined by such a mechanism. Examples are (1) the final position of a disc that after a push revolves several times, such as a roulette wheel or a disk carrying information (in magnetic or optic form), and (2) the phase shift of a harmonic process that is delayed (significant is only the remainder of dividing by 2π). If
• the secondary state can be interpreted as the remainder of dividing a primary state by a fixed number Δ, and
• this number is much smaller than the range of values of the primary state for which its probability density p_s(s) takes significant values,
then, arguing similarly as previously, we can conclude that the probability distribution of the secondary state is almost uniform, no matter what the probability distribution of the primary states is.

There is a relationship between the considered system and the random number generators mentioned in Section 4.3.3. Suppose that instead of a scalar a bloc of binary numbers is taken and that the concept of division is suitably generalized, so that the division can be realized by the shift register with feedback shown in Figure 3.10. It can be proved that such a system would transform a primary deterministic state (the seed) into a train whose segments exhibit statistical regularities. Most generators of pseudo-random numbers operate on this principle (for references see Section 4.3.4).

4.5.2 THE GAUSSIAN PROBABILITY DISTRIBUTION

The probability density

p(s) = (1/(√(2π) σ)) e^{−(s−ā)²/(2σ²)}   (4.5.11)

is called the gaussian probability distribution, and the corresponding continuous random variable s is called a gaussian (also normal) random variable. For this variable we have

E s = ā,  σ²(s) = σ².   (4.5.12)

The specific role of this distribution is justified by the following theorem (a simplified formulation of the basic central limit theorem; see Section 4.3.4):
If a random variable s can be represented in the form

s = Σ_{i=1}^{I} s(i),

where the random variables s(i) have E s(i) = 0, ∀ i, their variances σ²[s(i)] are of a similar order of magnitude, and the variables s(i) satisfy additional very broad constraints, then for large I the probability density of the normalized random variable s/√I converges to the gaussian probability density.   (4.5.13)

Saying that the variances are of a similar order of magnitude means that two constants A1 > 0 and A2 > A1 exist such that A1 <= σ²[s(i)] <= A2, ∀ i.

The counterpart of the gaussian probability density for a K-DIM continuous vector state is the density

p(s) = A exp{ −½ Σ_{m=1}^{K} Σ_{k=1}^{K} G(m, k)[s(m) − ā(m)][s(k) − ā(k)] },   (4.5.14)

where G = [G(m, k)] and A is a normalizing constant; it is called the K-DIM gaussian probability distribution. It is determined by the coefficients ā(k) and G(m, k). These coefficients are related to the moments of the corresponding K-dimensional random variable s = {s(k); k = 1, 2, ..., K} (for a derivation see, e.g., Papoulis [4.9], Breiman [4.10]), namely:

ā(k) = E s(k),   (4.5.15a)

G C_ss = D_1,   (4.5.15b)

where C_ss is the correlation matrix of the components of the random variable s given by (4.4.25) and (4.4.26) and

D_1 = diag(1, 1, ..., 1)   (4.5.16)

is the diagonal unit matrix. Since under very general conditions the inverse C_ss⁻¹ of the correlation matrix C_ss exists, equation (4.5.15b) can be written in the form

G = C_ss⁻¹.   (4.5.17)

From (4.5.15a) and (4.5.15b) it follows that

The K-DIM gaussian probability density is exactly determined by the averages of the components of the corresponding K-DIM random variable and their correlation matrix.   (4.5.18)
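A minimal sketch of (4.5.14)-(4.5.17) follows (added here; the correlation matrix, the averages, and the explicit normalizing constant A = ((2π)^K det C_ss)^(−1/2) are illustrative assumptions consistent with the special case (4.5.19b)).

```python
import numpy as np

# Sketch of (4.5.14)-(4.5.17): G = C_ss^{-1}, and the quadratic-form expression
# reproduces the usual K-DIM gaussian density for an assumed normalizing constant A.
C_ss = np.array([[1.0, 0.4],
                 [0.4, 2.0]])
a_bar = np.array([0.5, -1.0])            # averages a(k) = E s(k), equation (4.5.15a)
G = np.linalg.inv(C_ss)                  # equation (4.5.17)
print("G C_ss =\n", G @ C_ss)            # the diagonal unit matrix D_1, equation (4.5.15b)

def gaussian_density(s):
    K = len(a_bar)
    A = ((2 * np.pi) ** K * np.linalg.det(C_ss)) ** -0.5   # assumed normalizing constant
    d = s - a_bar
    return A * np.exp(-0.5 * d @ G @ d)  # quadratic form of equation (4.5.14)

print("p at the mean      :", gaussian_density(a_bar))
print("p at an offset point:", gaussian_density(a_bar + np.array([1.0, 1.0])))
```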
UNCORRELATED GAUSSIAN VARIABLES

To illustrate the general considerations we assume:
A1. The component variables s(k) are uncorrelated (c_ss(m, k) = 0, ∀ m ≠ k),
A2. The mean values E s(k) = 0,
A3. The variances σ²[s(k)] = σ² = const.

From A1 it follows that all elements of the correlation matrix C_ss lying off the main diagonal are zero. Taking into account assumption A3 we see that C_ss⁻¹ = (1/σ²) D_1. Then (4.5.14) takes the form

p(s) = A exp{ −(1/(2σ²)) Σ_{k=1}^{K} [s(k)]² },   (4.5.19a)

where

A = (2πσ²)^{−K/2}.   (4.5.19b)

Comparing (4.5.19) and (4.5.11) we see that the probability density of uncorrelated gaussian variables is equal to the product of the probability densities of the component variables. In other words,

If the gaussian variables are uncorrelated, then they are statistically independent.   (4.5.20)

MULTIDIMENSIONAL GAUSSIAN VARIABLES: THE GENERAL CASE

The reason for the great practical importance of multidimensional gaussian variables is that for them the generalization of the fundamental theorem (4.5.13) holds. Thus, whenever a set of random variables can be represented as a sum of a large number of independent component sets with variances of a similar order of magnitude, the normalized sum has approximately the multidimensional gaussian probability density. As an example take the thermal fluctuation potential of a resistor described in Example 4.3.1. The course of the noise is determined by many elementary pulses generated by the independent collisions of electrons. Thus, without going into the details of the properties of the elementary pulses, we conclude that a train of samples of thermal noise has the multidimensional gaussian probability distribution. Experiments confirm this with great accuracy.

The gaussian variables can almost be considered as a gift of nature. Not only are they often an accurate approximation of real frequencies of occurrences, but they have several properties that make operations on them very easy. The following are the most important:

• To determine exactly the statistical properties of gaussian variables, we need only to know their average values and correlation matrix;   (4.5.21)
• A set of linear combinations of gaussian variables is again a set of gaussian variables;   (4.5.22)
• The density of the conditional probability distribution of a set of some components of a multidimensional gaussian random variable, on the condition that a set of other components is known, is again a gaussian probability distribution.   (4.5.23)
These properties, in conjunction with equation (4.4.37), reduce the calculation of the probability distributions of linear combinations of gaussian variables, and of joint and conditional distributions of such linear combinations, to routine matrix manipulations. They also lead to a useful presentation of a set of statistically dependent gaussian variables as a transformation of a set of statistically independent gaussian variables. Such presentations are discussed in Sections 5.2.1 and 7.3.
4.6 THE FUNDAMENTAL PROPERTY OF LONG TRAINS OF RANDOM VARIABLES

The subject of this section is an important theorem that can be considered as the view of the axiomatic probability theory on the statistical regularities discussed in Section 4.4. We assume:

A1. A train of states S_tr = {s(1), s(2), ..., s(I)} can be considered as an observation of a train S_tr = {s(1), s(2), ..., s(I)} of random variables (the train of states exhibits statistical regularities);
A2. The set of potential forms of each elementary state is the same, and the potential states are s_l, l = 1, 2, ..., L;
A3. The random variables s(i), ∀ i are statistically independent.

As in Section 4.1, M(s_l, I) denotes the number of occurrences of the state s_l in the train S_tr. The following theorem can be proved (see, e.g., Breiman [4.10], Revesz [4.11]):

For given ε > 0, δ > 0 an I(ε, δ) can be found such that for I > I(ε, δ) the set S(I) of unconstrained trains S_tr can be divided into two subsets S_ty and S_nty such that for every train S_tr ∈ S_ty we have

| M(s_l, I)/I − P[s(1) = s_l] | < ε, ∀ l,   (4.6.1a)

and

P(S_tr ∈ S_nty) < δ.   (4.6.1b)

The ratio

M(s_l, I)/I   (4.6.2)

is the frequency of occurrences of the elementary state s_l in the train S_tr. Thus (4.6.1a) says that in each train belonging to the set S_ty the frequency of occurrences of any elementary state s_l is, with accuracy better than ε, close to the probability of the state s_l. Such a train is called typical; hence the notation S_ty. The set S_nty consists of trains for which the frequency of occurrences of at least one state differs from its probability by at least ε. Such a train is called nontypical. The probability P(S_tr ∈ S_nty) is the sum of the probabilities of all nontypical trains. Therefore, from (4.6.1b) it follows that the total probability of the nontypical trains can be made arbitrarily small.

Let us denote by S'_tr a train belonging to the set S_ty. We calculate the probability

P(S_tr = S'_tr) = P[s(1) = s'(1), s(2) = s'(2), ..., s(I) = s'(I)]   (4.6.3)

that the train S'_tr is the outcome of an observation of the random train S_tr.
Since the variables s(i), ∀ i are statistically independent,

P(S_tr = S'_tr) = Π_{i=1}^{I} P[s(i) = s'(i)] = Π_{l=1}^{L} P[s(1) = s_l]^{M(s_l, I)}.   (4.6.4)

Taking the logarithm we get

−log₂ P(S_tr = S'_tr) = Σ_{l=1}^{L} M(s_l, I) {−log₂ P[s(1) = s_l]}.   (4.6.5)

We write (4.6.1a) in the form

M(s_l, I)/I = P[s(1) = s_l] + ε_l, where |ε_l| < ε.   (4.6.6)

Substituting (4.6.6) in (4.6.5) we get

| −(1/I) log₂ P(S_tr = S'_tr) − H[s(1)] | <= C_1 ε,   (4.6.7)

where

H[s(1)] = Σ_{l=1}^{L} {−log₂ P[s(1) = s_l]} P[s(1) = s_l]   (4.6.8)

is the entropy of the random variable s(1) and

C_1 = Σ_{l=1}^{L} | log₂ P[s(1) = s_l] |.   (4.6.9)
From (4.6.7) it follows that

P(S_tr = S'_tr) ≅ 2^{−I H[s(1)]},   (4.6.10a)

where ≅ means asymptotically equal, in the sense that

lim_{I→∞} { −log₂ P(S_tr = S'_tr) / I } = H[s(1)].

The interpretation of formula (4.6.10a) is:

The probability of every typical train (that is, of a train in which the states s_l occur with frequencies close to their probabilities) is the same and is given by (4.6.10a).   (4.6.10b)

Taking into account (4.6.1) we can formulate our conclusions in geometrical terms:

If we would "draw" a bar diagram in the set S of all possible trains S_tr = {s(1), s(2), ..., s(I)}, the "peaks" of the bars would form two plateaus: one high plateau over the set S_ty (with altitude given by (4.6.10a)) and a second, very low plateau (almost at zero level) over the set S_nty of nontypical trains. The border area between the two plateaus is the steeper, the larger the length I of the train is.   (4.6.11)

On the assumption that the trains can be represented as points on a plane, this conclusion is illustrated in Figure 4.7.
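The concentration expressed by (4.6.7)-(4.6.10a) can be observed numerically. The sketch below (added here; the probability distribution and the train length are arbitrary choices) computes −(1/I) log₂ P(train) for a few observed trains and compares it with the entropy (4.6.8).

```python
import numpy as np

# Sketch: for long independent trains, -(1/I) log2 P(train) concentrates around the
# entropy H[s(1)], so every typical train has probability close to 2**(-I*H).
rng = np.random.default_rng(6)
P = np.array([0.5, 0.3, 0.2])           # probabilities of the potential states s_l
H = np.sum(-np.log2(P) * P)             # entropy (4.6.8)
I = 2000                                # train length

log2_probs = []
for _ in range(5):                      # a few observed trains
    train = rng.choice(len(P), size=I, p=P)
    log2_probs.append(np.sum(np.log2(P[train])))   # log2 of the train probability, cf. (4.6.4)

print("H[s(1)] =", round(H, 4))
print("-(1/I) log2 P(train):", [round(-x / I, 4) for x in log2_probs])
```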
Figure 4.7. Simplified geometrical illustration of the fundamental property of long trains exhibiting statistical regularities (see conclusion (4.6.11)).

From axiom (4.4.2) it follows that

P(S_tr ∈ S_ty) = Σ_{S'_tr ∈ S_ty} P(S_tr = S'_tr).   (4.6.12)

From conclusion (4.6.10) it follows that

P(S_tr ∈ S_ty) ≅ γ₀(S_ty) 2^{−I H[s(1)]},   (4.6.13)

where

γ₀(A) is the number of elements of the discrete set A.   (4.6.14)

Writing (4.6.1b) in the form P(S_tr ∈ S_ty) >= 1 − δ and using (4.6.12) we get

γ₀(S_ty) ≅ 2^{I H[s(1)]}.   (4.6.15)
COMMENT

To use the results of probability theory we must know whether the potential states of the considered system exhibit the statistical regularities. As has been emphasized, the regularities of occurrences of potential states are an objective property of a system. Thus, using the available concrete and meta information, we must decide whether the states of the system exhibit the statistical regularities. This is the statistical identification problem, which we discussed in Section 4.3.4.
GENERALIZATIONS
To simplify the terminology and notation we assumed that the random variables are discrete and statistically independent. However, suitable modifications of our conclusions hold under very general assumptions.

First, we keep the assumptions that the variables s(i) have the same probability distribution and are statistically independent, but we assume that they are continuous, described by the density of probability p(s), s ∈ <s_a, s_b>. We now define as typical the set S_ty of trains for which the continuous envelope of the density of occurrences of the potential forms of the approximating discrete variables (whose definition is similar to that of (4.2.9)) is close, in the sense of a suitably chosen definition of the distance between functions considered as a whole (see Section 1.4.3), to the density of probability p(s); this is the counterpart of (4.6.1). The counterpart of the first basic conclusion (4.6.10) holds if instead of the probability P(S_tr = S'_tr) we take the density of probability of a typical train and instead of (4.6.8) we define the entropy of the continuous variable by the equation

H[s(1)] = ∫ [−log₂ p(s)] p(s) ds.   (4.6.16)

To formulate the counterpart of the second basic conclusion (4.6.15) we have to use the definition of the volume γ_I(A) of I-dimensional sets, based on the following definition of the volume of an I-dimensional cube C with an edge of length Δ:

γ_I(C) = Δ^I.   (4.6.17)

It is essential that the unit for the length Δ be the same as the unit we use in (4.4.4) to define the density of probability p(s). Then the conclusion (4.6.15) holds with γ_I(S_ty) instead of γ₀(S_ty).

For continuous variables we can gain more insight into the structure of the set of typical trains. For example, when the probability density p(s) is gaussian, the set S_ty is a thin I-dimensional spherical shell.

The fundamental property holds also for trains of structured pieces of information. Since for discrete information we did not use the assumption that the component information is one-dimensional, our argument holds for any structured discrete information. In particular, it holds when the train has the hierarchical bloc structure. Let us suppose that the component is a K-DIM vector information; thus, s(i) = {s(i, k); k = 1, 2, ..., K}. Then instead of H[s(1)] in (4.6.10) and (4.6.15) we have to take the entropy per element of the random vector s(i), defined by

H_1 = H[s(1), s(2), ..., s(K)] / K,   (4.6.18)

where H[s(1), s(2), ..., s(K)] is the entropy of the discrete vector s(i). We get it from (4.6.8) by taking the joint probability of the vector components instead of the probability P[s(1) = s_l] and performing K summations over all potential forms of all components of the vector, instead of a single summation. For an exact definition and a derivation of the properties of H_1 see Cover, Thomas [4.23], Blahut [4.24], Golomb et al. [4.25].
In several cases the elementary components of a train cannot be grouped into statistically independent blocs (vectors), but a component depends statistically on the adjacent components. A typical model of such a statistical dependence is the Markov train, which is considered in Section 5.4.2. For such a train we have to define the entropy per element as the limit of the ratio on the right-hand side of (4.6.18) for K → ∞ (for details see Cover, Thomas [4.23], Blahut [4.24]). Section 5.4.2 presents a simple example of the calculation of such a limit.
4.7 THE GENERALIZED STATE AND A UNIVERSAL CLASSIFICATION OF STATES AND INFORMATION

The statistical regularities described by a probability distribution are an inherent property of the system. We call them the statistical state and denote its description by S_STAT. This description includes the description of the state of variety S_VAR of the external state and the description of the statistical weights assigned to each potential state. Thus,

S_STAT = {S_VAR, W} = {S_RT, MR, W},   (4.7.1)

where S_RT is the description of the structure of external states, MR is the rule of membership in the set of potential forms, and W is the description of statistical weights. The type of description depends on the structure of the external state and on the structure of the set S of its potential forms. Let us assume that this set is discrete. Then the statistical weights are described by the set of probabilities

W = {P(s_l); s_l ∈ S}.   (4.7.2)

When the set of potential forms is continuous, the statistical weights are described by the probability density

W = {p(s); s ∈ S},   (4.7.3)

considered as a whole.

To simplify the terminology we assumed previously that the state is an external state. However, our argument applies directly to internal states. We called the external and internal states concrete states. The set of variety of a concrete state (of potential forms of a concrete state) and its properties, in particular the statistical properties, we call the meta state of the system. If the concrete states exhibit statistical regularities, the meta state is described by the statistical state S_STAT; if not, only by the state of variety S_VAR.

Our considerations about sets of potential forms of concrete states apply also to meta states. Thus, if a meta state is not known, the set of potential forms of meta states, in particular of statistical states, should be considered. Such a set is also a property of the system and may be called a meta state of second rank (a meta, meta state). The hierarchy of higher ranking meta states has been discussed earlier. The external, internal, and meta states give a complete description of the properties of the system and together are called the generalized state of the system.
Let us summarize our considerations in this and the previous chapters:

1. We have introduced the following types of states:
a. External state (directly influencing interactions between components of the system),
b. Internal state (the universal relationships between components of external states),
c. Concrete state (external and internal state),
d. The state of variety (the set of potential forms of concrete states),
e. The statistical state (the set of potential forms of concrete states with associated statistical weights),
f. Meta state (joint name for variety and statistical states),
g. Higher ranking meta state (the set of potential forms of lower ranking meta states, eventually with associated statistical weights),
h. Generalized state (concrete and meta states together).
2. Each of these states can be either exact or rough.
3. Each of these states has a fundamental structure and often a macrostructure. The fundamental structures are: vector, array (a function of discrete arguments), and function of continuous argument(s) considered as a whole.
4. If any of these types of states is not known, the purposeful activity can be performed more efficiently if the set of potential forms of the unknown state is taken into consideration.
Figure 4.8. A classification of states (inscriptions in () valid) and of information (inscriptions in [ ] valid).
If, for simplification, we take into account only the fundamental structure of the state and the set of its potential forms, the generalized state can be represented as a point in a 3-DIM "space", as shown in Figure 4.8. On one horizontal "coordinate axis" we have the fundamental structure of the state; on the other, the structure of the set of potential forms it can take. On the vertical axis we indicate the type of the state. For example, point P1 corresponds to an external state that is a scalar and whose set of potential forms is an interval. Point P2 represents the statistical state of the state represented by point P1. In view of the bilateral relationships between the state and information discussed in Section 1.2.2, concrete forms of states and information and the sets of their potential forms can be classified in the same way. We can also classify the information according to the type of the state the information is about. Thus, after changing the interpretation of the features, we can use the previously described classification of states as a universal classification of information, as shown in Figure 4.8.
NOTES

1. We used l as an index to number the potential forms of the vector s. To each s_m corresponds a pair of components s_{l(m)}(1) and s_{k(m)}(2). The indexes l(m) and k(m) numbering the potential forms of the components may in general be different. Because in our considerations the relationship between the numbering of the forms of s and of the components s(1) and s(2) is irrelevant, we write briefly l instead of l(m) and k instead of k(m).
2. It is essential for our argumentation that I ≫ L. Thus L can be large only if the train S_tr has a suitably large length I.
3. With a strict approach, the class of considered subsets is restricted to the class of Borel sets (see, e.g., Kolmogorov [4.6], Billingsley [4.8]). This is a very wide class including practically all sets occurring in applications.
4. In the notation of the correlation coefficient and the correlation matrix we add the subscript ss to indicate that the correlation coefficients of pairs of components of a multidimensional random variable s are considered. We do so because in later sections we consider matrices of correlation coefficients for components of several multidimensional variables.
5. Always E z² >= 0. We could have E z² = 0 for a set of coefficients a(k) only if for all the random variables E s²(k) = 0. Such a case is of no technical interest.
6. This is obviously a special case of rule (1.5.16) with the reference pattern used directly as its identifier. The quantization rule considered here is the same as rule (4.2.1); however, we use another notation.
7. The proof of the theorem and the exact formulation of its premises require several additional concepts; they can be found in Breiman [4.10]. For a detailed study of sums of random variables see Gnedenko [4.14]. Here we give a simplified formulation of the theorem.
8. Using the same symbol as the symbol used to denote the volume of a K-DIM set (see, e.g., (4.4.3)) is not coincidental. The number of elements of a discrete set has the same fundamental properties as the volume of a K-DIM set. In mathematical terms, the number of elements of a set is an additive measure (see, e.g., Billingsley [4.8]).
9. We can get this equation by calculating the number of trains satisfying constraint (4.6.1) and using the Stirling formula for the logarithm of the factorial.
REFERENCES

[4.1] Mises von, R., Probability, Statistics and Truth, Dover Publications, N.Y., 1957.
[4.2] Frank, H., Althoen, S.C., Statistics: Concepts and Applications, Cambridge University Press, Cambridge, 1994.
[4.3] Kotz, S., Johnson, N.L., Encyclopedia of Statistical Sciences, J. Wiley, N.Y., 1988.
[4.4] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., Numerical Recipes, Cambridge University Press, Cambridge, 1992.
[4.5] Shafer, G., Pearl, J., Readings in Uncertain Reasoning, Morgan Kaufmann Publ., San Mateo, CA, 1990.
[4.6] Kolmogorov, A.N., Foundations of the Theory of Probability, 2nd ed., Chelsea Publishing Company, N.Y., 1956.
[4.7] Renyi, A., Probability Theory, North-Holland, Amsterdam, 1970.
[4.8] Billingsley, P., Probability and Measure, J. Wiley, N.Y., 1979.
[4.9] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, N.Y., 1991.
[4.10] Breiman, L., Probability, SIAM Publications, Philadelphia, 1995.
[4.11] Revesz, P., The Laws of Large Numbers, Academic Press, N.Y., 1960.
[4.12] Mane, R., Ergodic Theory and Differentiable Systems, Springer Verlag, Berlin, 1978.
[4.13] Huang, K., Statistical Mechanics, J. Wiley, N.Y., 1966.
[4.14] Gnedenko, B.V., Kolmogorov, A.N., Limit Distributions for Sums of Independent Random Variables, Addison-Wesley, Reading, 1968.
[4.15] Dagpunar, J., Principles of Random Variate Generation, Clarendon Press, Oxford, 1988.
[4.16] Yarmolik, V.N., Demidenko, S.N., Generation and Application of Pseudo-random Sequences for Random Testing, J. Wiley, N.Y., 1988.
[4.17] Niederreiter, H., Random Number Generation and Quasi-Monte Carlo Methods, SIAM Publications, Philadelphia, 1992.
[4.18] Devaney, R.L., An Introduction to Chaotic Dynamical Systems, Addison-Wesley, Redwood, 1989.
[4.19] Rasband, S.N., Chaotic Dynamics of Nonlinear Systems, J. Wiley, N.Y., 1990.
[4.20] Thompson, E.E., An Introduction to Algebra of Matrices with some Applications, Adam Hilger, London, 1969.
[4.21] Horn, R.A., Johnson, C.R., Matrix Analysis, Cambridge University Press, Cambridge, 1988.
[4.22] Usmani, R.A., Applied Linear Algebra, Marcel Dekker, N.Y., 1987.
[4.23] Cover, T.M., Thomas, J.A., Elements of Information Theory, J. Wiley, N.Y., 1991.
[4.24] Blahut, R.E., Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1990.
[4.25] Golomb, S.W., Peile, R.A., Scholtz, R.A., Basic Concepts in Information Theory and Coding, Plenum Press, N.Y., 1994.
STATISTICAL RELATIONSHIPS

A statistical relationship exists between states if one state influences the frequencies of occurrences of the other state. The statistical relationship is described either by joint or by conditional frequencies of occurrences of potential forms of states (see (4.1.24) to (4.1.26)). If the states exhibit statistical regularities, the statistical relationship is described by joint or conditional probability distributions (see (4.4.7) and (4.4.8)).

The statistical relationships are of paramount importance for the superior system. Using these relationships the system can take into account components of the states of its environment that are not directly accessible. In particular, the system may "predict" future states. This, in turn, can dramatically improve the performance of the superior system. The analysis of the optimal utilization of statistical relationships and the assessment of the achievable advantages are important topics of this book.

The basic statistical relationships are the relationships between the atomic components of states having one of the fine structures described in Section 1.3. A wide class of such states are functions of a discrete (respectively, continuous) identifier (argument(s)); see Section 1.3.3. The statistical model of such a function whose "values" exhibit joint statistical regularities is called a stochastic process. Usually the states have a macro structure. Then the macro components may also be related statistically. We call such a relationship a statistical macro-relationship. An example is the relationship between a primary process and a secondary process produced from the primary process by an indeterminate transformation.

Our considerations of statistical relationships start with a review of their rough descriptions by means of parameters. Typical such parameters are correlation coefficients and parameters based on entropy, in particular the amount of statistical information. Section 5.2 describes the birth and death processes and the Gauss processes, considered as results of deterministic transformations of a primeval train of statistically independent components. Section 5.3 is devoted to Markov processes, which can be considered as successive indeterministic transformations of an initial state. Section 5.4 presents typical descriptions of relationships between the input and the output of a channel introducing indeterministic distortions.

In a real system the transformation transforming the state into information is practically never reversible; thus, we have to consider the state as indeterminate. Characterizing such a state by the frequencies of occurrences of its potential forms or by probability, we introduce, in fact, a model of indeterminism. Besides the probabilistic models, other models of indeterminism have been proposed. We close this chapter with a systematic review of the various models of indeterminism.
5.1 THE ROUGH DESCRIPTION OF STATISTICAL RELATIONSHIPS BY PARAMETERS

Sections 4.1, 4.2, and 4.3 indicated that already for structured information consisting of a few components, the experimental estimation of joint probability distributions (or, equivalently, of conditional distributions) would be tedious. Therefore, simplified descriptions of statistical relationships by parameters are of great importance. This section concentrates on correlation coefficients and on parameters based on entropy, particularly the amount of statistical information.

5.1.1 THE CORRELATION COEFFICIENT

In Section 4.4.4 it has been shown that the correlation coefficient is a useful indicator of statistical relationships in two cases. The first is when the joint probability distribution of the components of the state is gaussian. In view of property (4.5.13) and its multidimensional generalization, this is often the case. Then from conclusion (4.5.18) it follows that the correlation coefficients describe the probability distribution exactly. The second case is when we consider only linear transformations and we describe the statistical relationships only by correlation coefficients. Then from equation (4.4.37) it follows that we can calculate the correlation coefficients for the components of the transformed state. Thus, in the case of linear transformations the rough description of statistical relationships is self-sufficient. This property of correlation coefficients plays an important role in this book.

We review here the properties of the correlation coefficient defined by (4.4.24), which allow us to use this coefficient as a rough description of statistical relationships between two random variables s(1) and s(2). First, assume that the variables are statistically independent; thus, for discrete variables equation (4.4.8a) holds, and for continuous variables equation (4.4.8b) holds. After substituting the former in (4.4.25a), respectively the latter in (4.4.25b), we prove that both for discrete and continuous variables

c(1, 2) = 0.   (5.1.1a)

Thus,

Statistically independent variables are uncorrelated.   (5.1.1b)
In general, the reverse does not hold. An example is the density p[s(1), s(2)] of the joint probability shown in Figure 5.1. The marginal probability distributions p1[s(1)] and p2[s(2)] are uniform. Figure 5.1 shows that for points inside the white square p1[s(1)] p2[s(2)] ≠ p[s(1), s(2)]. Thus, from definition (4.4.8) it follows that the variables s(1) and s(2) are statistically dependent. For symmetry reasons c(1, 2) = 0. Thus, in spite of being noncorrelated, the variables are statistically dependent. However, this is not typical. For several classes of random variables zero correlation implies independence. In particular, this holds in the important case when the variables are gaussian.
Figure 5.1. Density of the joint and the corresponding marginal probability distributions of random variables which are noncorrelated but statistically dependent; the probability density in the shadowed area is constant.

It can be easily proved (see, e.g., Papoulis [5.1], Breiman [5.2]) that

c²(1, 2) <= σ²[s(1)] σ²[s(2)],   (5.1.2)

where

σ²[s(m)] = E[s(m) − E s(m)]², m = 1, 2,   (5.1.3)

and the sign of equality holds if

s(2) = A s(1),   (5.1.4a)

where A is an arbitrary constant. Thus,

The correlation coefficient reaches its maximum when both variables are linearly related.   (5.1.4b)

From (5.1.2) it follows that

0 <= |c_n(1, 2)| <= 1,   (5.1.5)

where

c_n(1, 2) ≝ c(1, 2) / ( σ[s(1)] σ[s(2)] )   (5.1.6)

is the normalized correlation coefficient.

5.1.2 THE ENTROPY AND AMOUNT OF STATISTICAL INFORMATION 1: DISCRETE STATES

Entropy appeared in Section 4.6, where frequencies of occurrences of potential forms of states in long trains of observations exhibiting statistical regularities were analyzed. The definition (4.6.8) of the entropy of the discrete random variable s was

H(s) = Σ_{l=1}^{L} [−log₂ P(s = s_l)] P(s = s_l),   (5.1.7)

where s_l, l = 1, 2, ..., L are the values which the variable can take. Comparing this definition with the definition (4.4.9) of the average, we see that

H(s) = E[−log₂ P(s)],   (5.1.8)

where P(s) is the discrete random variable taking the value P(s = s_l) with probability P(s = s_l). It can be easily proved (see, e.g., Abramson [5.3], Cover, Thomas [5.4], Blahut [5.5]) that

The entropy H(s) reaches its maximum when P(s = s_l) = 1/L = const, and the maximum of H(s) over all probability distributions is log₂ L.   (5.1.9)

Figure 5.2 illustrates this property for L = 2.

Figure 5.2. The entropy of a binary random variable.

From definition (5.1.7) it follows that the entropy does not depend on the values of the random variable but only on its probabilities. Thus, equation (5.1.7) is, in fact, the definition of the entropy of a random object having any structure, provided that its potential forms can be identified by integers. In particular, if we take in place of s the 2-DIM variable s = {s(1), s(2)} and replace the probabilities P(s = s_l) by the joint probabilities, equation (5.1.7) becomes the definition of the entropy of s:

H(s) = Σ_{l=1}^{L} Σ_{k=1}^{L} {−log₂ P[s(1) = s_l(1), s(2) = s_k(2)]} P[s(1) = s_l(1), s(2) = s_k(2)],   (5.1.10)

and equation (5.1.8) takes the form

H(s) = E{−log₂ P[s(1), s(2)]}.   (5.1.11)

After expressing the joint probabilities in equation (5.1.10) by the conditional and marginal probabilities (using equation (4.4.7a)), we get

H(s) = H[s(1)] + H[s(2)|s(1)],   (5.1.12)

where

H[s(2)|s(1)] ≝ Σ_{l=1}^{L} H[s(2)|s(1) = s_l(1)] P[s(1) = s_l(1)],   (5.1.13)

H[s(2)|s(1) = s_l(1)] = Σ_{k=1}^{L} {−log₂ P[s(2) = s_k(2)|s(1) = s_l(1)]} P[s(2) = s_k(2)|s(1) = s_l(1)].   (5.1.14)

We call H[s(2)|s(1) = s_l(1)] the conditional entropy and H[s(2)|s(1)] the average conditional entropy.

The conditional entropy occurs only as an intermediate expression. However, of basic importance for analyzing the statistical relationships between the random variables s(1) and s(2) is the average conditional entropy. It can be shown (see, e.g., Abramson [5.3], Cover, Thomas [5.4], Blahut [5.5]) that H[s(2)|s(1)] <= H[s(2)]. The difference

I[s(1):s(2)] ≝ H[s(2)] − H[s(2)|s(1)]   (5.1.21)

is an indicator of the reduction of the indeterminism of the random variable s(2) when the value taken by the variable s(1) is known. Therefore, it is natural to call I[s(1):s(2)] the amount of statistical information that the knowledge of the random variable s(1) provides about the other random variable s(2). It is traditionally also called Shannon's information. Using (5.1.7) and (5.1.14), after some elementary algebra, from (5.1.21) we get

I[s(1):s(2)] = Σ_{l=1}^{L} Σ_{k=1}^{L} P[s_l(1), s_k(2)] log₂ { P[s_l(1), s_k(2)] / ( P[s_l(1)] P[s_k(2)] ) }.   (5.1.22)
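The chain of definitions (5.1.7)-(5.1.22) can be traced numerically. The following sketch (added here; the joint probability table is an arbitrary illustrative choice) computes the entropies, the average conditional entropy, and the amount of statistical information both from (5.1.21) and directly from (5.1.22).

```python
import numpy as np

# Sketch of (5.1.7)-(5.1.22): entropies and the amount of statistical information
# computed from a small joint probability table.
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])            # P[s(1)=s_l(1), s(2)=s_k(2)]
P1, P2 = P.sum(axis=1), P.sum(axis=0)

def H(p):                                # entropy of a probability vector, cf. (5.1.7)
    p = p[p > 0]
    return float(np.sum(-p * np.log2(p)))

H_joint = H(P.ravel())                   # joint entropy (5.1.10)
H1, H2 = H(P1), H(P2)
H2_given_1 = H_joint - H1                # chain rule (5.1.12)
I12 = H2 - H2_given_1                    # definition (5.1.21)
I12_direct = float(np.sum(P * np.log2(P / np.outer(P1, P2))))   # formula (5.1.22)
print(round(I12, 6), round(I12_direct, 6))                      # both coincide
```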
The previously used notation for probability, indicating the random variable, is useful for general considerations, but it is inconvenient for longer calculations. In equation (5.1.22) we apply a simplified notation, in which we indicate only the considered potential value. For example, the probability P[s(1) = s_l(1)] we write briefly as P(s_l). Because the random variable does not occur explicitly, we use here the italic character P and not the special symbol P. In the future, whenever this causes no confusion, we use this simplified notation.

EXAMPLE 5.1.1 ILLUSTRATION OF PROPERTIES OF ENTROPIES AND THE AMOUNT OF STATISTICAL INFORMATION

We take the joint probability distribution considered in Example 4.1.1 and given by equation (4.1.29). The parameter a determines the character of the probability distribution, as illustrated in Figure 4.1. The dependence of the joint, marginal, and conditional entropies and of the amount of information on a is shown in Figure 5.3.
Figure 5.3. The dependence of the joint, conditional, and marginal entropies, the amount of statistical information, and the normalized correlation coefficient c_n(1, 2) on the parameter a determining the character of the probability distribution given by (4.1.29) and illustrated in Figure 4.1.

According to property (5.1.9) the entropy H[s(1)] achieves its maximum for a = 1, because for this value the marginal probability distribution is uniform. However, for this value the joint probability distribution is also uniform, and consequently the variables s(1) and s(2) are independent. Then, according to properties (5.1.9) and (5.1.16), the joint entropy H[s(1), s(2)] and the conditional entropy H[s(2)|s(1)] take their maximum. Because the variables s(1) and s(2) are independent, the amount of statistical information I[s(1):s(2)] = 0. For a → 0 the variable s(1) determines the value of s(2) and vice versa. Then the amount of statistical information I[s(1):s(2)] reaches its maximum. For a → ∞ the probability distribution approaches the binary uniform distribution. Then I[s(1):s(2)] grows again asymptotically to 1, and the entropies decrease to 1. □
COMMENT 1
The frequently used abbreviation "statistical information" of the terms "amount of statistical information" and "Shannon's information" is misleading and confusing. Information, as defined by (1.1.1), has a different character than the amount of statistical information I[s(1):s(2)]. In the terminology used here, I[s(1):s(2)] is a number characterizing the set of probabilities of potential forms of the information. Therefore, the amount of statistical information has a sense only if the states exhibit statistical regularities. The term "information" in the sense defined here, which is close to its common-sense understanding, is not necessarily a number but may have any structure, and it is not necessarily associated with the existence of statistical regularities.

COMMENT 2
We considered here the entropy as a primary concept and defined the amount of statistical information in terms of entropy. However, entropy can also be considered as an amount of information. Suppose that an exact observation of the variable s(1) is available. The variable is then determined. Therefore, H[s(1)|s(1)] = 0. From (5.1.22) it follows that
I[s(1):s(1)] = H[s(1)].   (5.1.23a)
Thus,
The entropy also has the meaning of the amount of statistical information that is obtained by making an exact observation of the value that the variable takes.   (5.1.23b)
This conclusion does not contradict conclusion (5.1.20).

5.1.3 THE ENTROPY AND AMOUNT OF STATISTICAL INFORMATION 2: CONTINUOUS STATES
Up to this point we assumed that the variables are discrete. The definition (4.6.16) of the entropy of a continuous random variable s introduced in Section 4.6 was
H(s) = ∫_(s_a)^(s_b) [-log₂ p(s)] p(s) ds,   (5.1.24)
where p(s), s ∈ <s_a, s_b>, is the density of probability characterizing the random variable s. The formula
H(s) = E[-log₂ p(s)],   (5.1.25)
holding for the continuous random variable, is the counterpart of formula (5.1.8) for the discrete variable. For a pair s̄ = {s(1), s(2)} of continuous variables described by the density of joint probability p(s̄), the counterparts of definition (5.1.14) of the conditional entropy and of definition (5.1.22) of the amount of statistical information are
H[s(2)|s(1)] = ∫∫_D {-log₂ p[s(2)|s(1)]} p[s(1), s(2)] ds(1) ds(2),   (5.1.26)
I[s(1):s(2)] = ∫∫_D p[s(1), s(2)] log₂ { p[s(1), s(2)] / (p[s(1)] p[s(2)]) } ds(1) ds(2),   (5.1.27)
where D = {<s_a, s_b> × <s_a, s_b>} is the square outside which p(s̄) = 0.
EXAMPLE 5.1.2 ENTROPY AND AMOUNT OF STATISTICAL INFORMATION FOR GAUSSIAN VARIABLES
The gaussian probability distribution is given by equation (4.5.11). Substituting this in (5.1.24) with s_a = -∞, s_b = ∞ and with some calculus we get
H(s) = ½ log₂(2πeσ²),   (5.1.28)
where e = 2.718... is Euler's number (the base of natural logarithms). Some more calculations show (see, e.g., Abramson [5.3], Cover, Thomas [5.4], Blahut [5.5]) that for the 2-DIM gaussian probability density (4.5.14) the amount of statistical information is
I[s(1):s(2)] = -log₂ √(1 - c_n²(1, 2)),   (5.1.29)
where c_n(1, 2) is the normalized correlation coefficient defined by equation (5.1.6). As an application of the latter formula we take two independent gaussian variables s(1) and z, E s(1) = 0, E z = 0, and we set
s(2) = s(1) + z.   (5.1.30)
The variable s(1) may be interpreted as the model of the state at the input of a communication channel, s(2) of the state at the output, and z of the noise. We have
c(1, 2) = E s(1)s(2) = E s(1)[s(1) + z] = σ_s²,   (5.1.31)
where σ_s² = σ²[s(1)]. The normalized correlation coefficient defined by (5.1.6) is
c_n(1, 2) = σ_s / √(σ_s² + σ_z²).   (5.1.32)
Substituting in (5.1.29) we get
I[s(1):s(2)] = ½ log₂(1 + σ_s²/σ_z²). □   (5.1.33)
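A quick numerical cross-check of (5.1.29) and (5.1.33), assuming NumPy; the variances are arbitrary illustration values:

import numpy as np

sigma_s2 = 2.0   # variance of s(1), illustrative value
sigma_z2 = 0.5   # variance of the noise z, illustrative value

# Normalized correlation coefficient (5.1.32) for s(2) = s(1) + z.
c_n = np.sqrt(sigma_s2) / np.sqrt(sigma_s2 + sigma_z2)

I_via_cn  = -np.log2(np.sqrt(1.0 - c_n**2))           # (5.1.29)
I_via_snr = 0.5 * np.log2(1.0 + sigma_s2 / sigma_z2)  # (5.1.33)

print(I_via_cn, I_via_snr)   # the two values coincide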
A SPECIFIC FEATURE OF ENTROPY OF CONTINUOUS VARIABLES
The similarity of definition (5.1.7) of the discrete entropy and of definition (5.1.24) of the continuous entropy causes the entropies to have similar properties. In particular, the important relationship (5.1.5) holds in both cases, and the entropy of continuous variables has a property similar to (5.1.9). However, the entropy of a continuous variable has a specific property. It is caused by the previously signalled fact (see the discussion on page 190 after equation (4.4.6)) that the density of probability depends not only on the statistical properties of the considered information but also on the measure of volume used to define the density. To illustrate this effect we assume:
A1. The primeval state is the electrical potential of a point terminal; we denote it by the special symbol s̃;
A2. We take 1 V as the unit of potential and denote it as v.
We denote by s_v the dimensionless factor by which we have to multiply v to obtain s̃; we call s_v the representation of the primeval state based on the unit v (it is the description of the state that we previously denoted s). Next we consider another unit u of voltage, say 1 mV; we denote by u_v the representation of the unit u based on the unit v, and by s_u the representation of the primeval state s̃ based on the unit u (the measurement of the potential expressed in units u). From the definitions it follows that
s_u = s_v / u_v.   (5.1.34)
Thus, u_v has the meaning of the scaling factor that we have to use when we pass from unit v to unit u. We denote by s_v (respectively s_u) the random variables representing s_v (respectively s_u). From the definition (4.4.4) of the probability density it follows that
p_u(s_u) = u_v p_v(s_v),   (5.1.35)
where p_v(s_v) (respectively p_u(s_u)) is the probability density of the random variable s_v (respectively s_u). From (5.1.24) we finally have
H(s_u) = H(s_v) - log₂ u_v.   (5.1.36)
Thus,
The entropy of a continuous variable depends not only on the statistical properties of this variable but also on the units used to measure the continuous variable.   (5.1.37)
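A minimal numerical illustration of (5.1.36), assuming NumPy: the entropy (5.1.28) of a gaussian variable is evaluated with the potential expressed in volts and in millivolts; the difference of the two entropies equals -log₂ u_v with u_v = 10⁻³.

import numpy as np

def gaussian_entropy(sigma):
    # Differential entropy (5.1.28) of a gaussian variable, in bits.
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

sigma_volts = 0.2            # standard deviation expressed in volts (illustrative value)
u_v = 1e-3                   # 1 mV expressed in volts
sigma_millivolts = sigma_volts / u_v

H_v = gaussian_entropy(sigma_volts)
H_u = gaussian_entropy(sigma_millivolts)
print(H_u - H_v, -np.log2(u_v))   # both are about 9.97 bits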
Contrary to entropy, the amount of statistical information does not depend on the measure of volume used to calculate the conditional probabilities. This causes the properties of the amount of statistical information, in particular the effects of transformations of information (see, e.g., Abramson [5.3]), to be similar for discrete and continuous variables.

COMMENT 1
To simplify the terminology and notation we considered the components of the states as scalars. However, we can easily generalize our argumentation by defining in the general case the entropy by (5.1.25). Using the general definition (4.4.12) of the averaging operation, we define the entropy of a multidimensional continuous variable, the average conditional entropy, and the amount of statistical information.

COMMENT 2
The concept of entropy emerged when we analyzed the properties of long trains of random variables. Thus, we can expect that the entropy is relevant in situations when the fundamental property of long sequences described in Section 4.6 can manifest itself. We show that the performance of optimum vector quantization of long blocks of information depends in an essential way on entropy (see Section 8.6.1, theorem (8.6.24)) and that the amount of statistical information influences in a crucial way the performance of transmission of long blocks of information through channels introducing indeterministic distortions (see Sections 8.4.3 and 8.6.1, coding theorems (8.4.50) and (8.6.33)).
There have been, however, two approaches to the concepts of entropy and the amount of statistical information not related to the fundamental property of long sequences. The first, called the axiomatic approach, introduces some plausible properties that an indicator of indeterminism or, equivalently, of the amount of information should possess. A typical example of such a property is the additivity of entropy (5.1.17) when the information (in the sense of definition (1.1.1)) consists of independent components. An indicator of indeterminism considered in the axiomatic approach is a special case of the indicator of variety of potential forms of information discussed in Section 1.6.1, page 53. The discussion in Section 1.6.1, in particular Figure 1.23, shows that an indicator of variety is only one of several indicators of other types characterizing the performance of an information system. Therefore, the axiomatic approach, concentrating on indicators of indeterminism, cannot tackle real design problems. This has an even deeper reason. Our considerations in Section 1.1 (see, in particular, Figure 1.1) show that without specifying what the information is needed for, it is not possible to define reasonable indicators characterizing it.

The second approach to the concept of amount of information not based on the properties of long sequences is linked to the analysis of universal bounds for the best performance of information recovery rules. The Fisher information, related to the Rao-Cramer inequality, introduced in classical statistics (see, e.g., Larsen, Marx [5.7]), is an example of such a definition. In the performance bounds approach the amount of information is interpreted as a distance between probability distributions (see, e.g., Bahara [5.8]). Such a distance is introduced in Section 8.5.3, and it is shown that it determines universal bounds for the performance of optimal information systems without knowing the optimal rules. These bounds give insight into the effects of preliminary information processing. However, their usefulness is limited, since the bounds can be achieved only for some probability distributions (see Seidler [5.6]). One class of them are gaussian-like distributions. The other class are probability distributions similar to the probabilities of long sequences discussed in Section 4.6.

5.2 PROTOTYPE STATISTICAL RELATIONSHIPS
As in the case of prototype probability distributions, the type of several important statistical relationships is determined by very general assumptions that often are quite exactly satisfied. Then we can infer the type of probability distribution describing the relationships, and we have only to determine the concrete values of the free parameters of the given type. To simplify the terminology we assume that the relationships between the states have the character of time relationships and that the elementary states are scalars. However, most of the concepts that we introduce here can be modified for structured elementary states and for space relationships. We denote by s_c(t), t ∈ <t_a, t_b>, the considered time-continuous state process.
The structured state is a sequence s̄ = {s(1), s(2), ..., s(N)}, where
s(n) = s_c(t_n),  n = 1, 2, ..., N,   (5.2.2)
and t_n are the sampling instants.
The sequence
s̄ = {s(n), n = 1, 2, ..., N}   (5.2.3)
of random variables s(n) is the model of statistical time relationships. It is called a time-discrete stochastic (random) process (chain). From a formal point of view there is no difference between a time-discrete stochastic process and a multidimensional random variable. In particular, both are described either by a joint or by a conditional probability distribution. Specific for a stochastic process is the interpretation as a train of variables representing observations of a time process at successive instants. Such an interpretation justifies specific assumptions about statistical relationships, usually reflecting the primary deterministic relationships between the states at sampling points described in Chapter 3. The theory of stochastic processes is the subject of many publications. An introduction to this area can be found in Papoulis [5.1] and Helstrom [5.9]; a more advanced analysis is presented in Parzen [5.10], Shanmugan, Breipohl [5.11], and in the classical monograph of Lapierre, Fortet [5.12].

We describe here two basic types of trains of statistically related elements, which can be considered as transformations of trains of primeval statistically independent elements. In some cases, the mechanism generating the considered train has the character of such a transformation. However, even if the mechanism of generating the train is not known, the representation of a given train as a hypothetical transformation of a hypothetical primary train of statistically independent elements gives much insight into the properties of stochastic processes and is very useful for their analysis and simulation. In subsequent considerations we frequently exploit such an interpretation. In the first subsection the train of binary variables is taken as the primeval train of statistically independent variables, while in the second subsection the train of gaussian variables is taken.

5.2.1 POISSON PROCESS AND DERIVED PROCESSES
The states of many systems can be considered to be the result of triggering events of very short duration. A triggering event may initiate a lasting process or end an already running process. In the first case, the triggering event is called a birth event, in the second a death event. A typical application of this concept is a model of information packets arriving at an information system: the instant when the packet arrives is interpreted as the birth instant, and the instant when it ends as the death instant (see Figure 5.4). Since the duration of a triggering event is usually negligible, we may consider it as a point on the time axis. Therefore, the train of triggering events is called a point process. It is described by the instants at which the triggering events occur.
Figure 5.4. Illustration of the definition of the Poisson process and derived processes: (a) illustration of notation, (b) a train of triggering events, (c) a train of pulses generated by the primary birth process (up arrows) and the auxiliary death process (down arrows).

Often the mechanism generating the triggering events is such that:
A1. A triggering event can occur in a short elementary interval <t, t+T_Δ> with a probability P_t;
A2. The probability P_t does not depend on t;
A3. The occurrences of triggering events in disjoint elementary intervals are statistically independent.
We count the triggering events in an observation interval <T_1, T_2>, where
T_1 = n_1 T_Δ,  T_2 = n_2 T_Δ.   (5.2.5)
Denoting by M(T_1, T_2) the number of triggering events occurring in this interval and by M = n_2 - n_1 the number of elementary intervals, we get from A1 to A3
P[M(T_1, T_2) = m] = [M! / (m!(M-m)!)] P_t^m (1 - P_t)^(M-m).   (5.2.6)
We interpret a time-continuous point process describing a train of triggering events as the limiting case of the train of statistically independent binary states when T_1 = const, T_2 = const, T_Δ → 0, P_t → 0, but so that the limit
λ = lim P_t/T_Δ   (5.2.7)
exists. On these assumptions, after some elementary algebra, from (5.2.6) we get
lim P[M(T_1, T_2) = m] = {[λ(T_2 - T_1)]^m / m!} e^(-λ(T_2 - T_1)),   (5.2.8)
where M(T_1, T_2) is the number of triggering events occurring in the interval <T_1, T_2>.
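The passage from the binomial distribution (5.2.6) to the Poisson distribution (5.2.8) can be checked by simulation. A sketch assuming NumPy; the intensity λ and the interval length are illustrative values:

import numpy as np
from math import factorial, exp

rng = np.random.default_rng(0)

lam = 3.0          # intensity lambda of the limiting Poisson process (illustrative)
T = 2.0            # length of the observation interval T2 - T1 (illustrative)
T_delta = 2e-3     # elementary interval; P_t = lam * T_delta is small
n_slots = int(T / T_delta)
n_runs = 4000

# Train of independent binary states: 1 means a triggering event in the slot.
events = rng.random((n_runs, n_slots)) < lam * T_delta
M = events.sum(axis=1)             # number of triggering events per realization

for m in range(11):
    empirical = np.mean(M == m)
    poisson = (lam * T) ** m / factorial(m) * exp(-lam * T)   # (5.2.8)
    print(m, round(float(empirical), 4), round(poisson, 4))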
In view of property (4.5.20), equivalent to A2 is the assumption that
A3'. The variables u(m) are uncorrelated:
c_uu(m, m') = δ(m, m'),   (5.2.11)
where
δ(m, m') = 1 if m = m', and δ(m, m') = 0 if m ≠ m',   (5.2.12)
is the Kronecker delta function. We can write assumption (5.2.11) equivalently in the form
C_uu = D_1,   (5.2.13)
where C_uu is the correlation matrix of the train u(n), n = 1, 2, ..., and D_1 is the unit diagonal matrix (see (4.5.16)). A train of uncorrelated gaussian variables may be considered as the primeval gaussian process. It is the counterpart of the train of statistically independent binary components defined by assumptions A1 to A3 on page 220. In general, the components of a gaussian process are correlated. We show now that such a process can always be presented as the effect of a linear transformation of a hypothetical primary uncorrelated gaussian process. We denote by H a matrix describing the linear transformation, and we look for such a matrix that a given, in general correlated, gaussian process S = {s(n), n = 1, 2, ..., N} can be represented in the form
S = H U,   (5.2.14)
where U = {u(n), n = 1, ..., N} is the previously described primeval gaussian process. We demand that the transformation can be realized in real time, thus that the mth component of the transformed train depends only on the mth or earlier components of the uncorrelated train. This is equivalent to the requirement:
The matrix H has only 0's above the main diagonal.   (5.2.15)
Thus the matrix H should have a structure similar to that of the square matrix in (3.2.46), shown also in Figure 3.9a. To prove that the transformation we are looking for exists and to find its concrete form we use conclusions formulated previously:
• The density of joint probability of gaussian variables with zero mean values is determined completely by their correlation matrix (see conclusion (4.5.21)).
• A process obtained by a linear transformation of a gaussian process is again a gaussian process (see conclusion (4.5.22)).
From these conclusions it follows that finding the presentation of the gaussian process s(n) reduces to finding such a matrix H that the correlation matrix of the transformed process H U is equal to the given correlation matrix C_ss. Using (4.4.37) we write this condition in the form
H H^T = C_ss.   (5.2.16)
The matrix H we are looking for plays the role of the "variable" in this matrix equation. Since the correlation matrix C_ss and the matrix H H^T are symmetric, the matrix equation (5.2.16) is equivalent to N(N+1)/2 scalar equations. Therefore, on very general assumptions we can find a solution of the matrix equation satisfying the additional requirement (5.2.15). There exists an efficient algorithm to find numerically the elements of this matrix, called the Cholesky decomposition algorithm; programs (also on diskettes) realizing this transformation can be found in Press [5.14], in program packages [5.15], or in program packages described by Wolfram [5.16]. The solution of equation (5.2.16) has the meaning of a transformation shaping the primeval uncorrelated gaussian train into the given train, and it is determined by the correlation matrix C_ss. Therefore, the solution is denoted as H_sh(C_ss). Thus, we have
S = H_sh(C_ss) U.   (5.2.17)
This representation is illustrated in Figure 5.5a. Since the matrix H_sh has the same structure as the matrix H in equation (3.2.46), the transformation can be realized by the time-discrete linear filter with time-varying coefficients shown in Figure 3.8e.

Figure 5.5. The presentation of a correlated process s(n), n = 1, 2, ..., N as a real-time linear transformation of an uncorrelated primary process u(n): (a) the generator of the uncorrelated gaussian train followed by the linear shaping transformation H_sh(C_ss) producing s(n), (b) the decorrelating transformation producing the decorrelated process u(n) from s(n).
It can also be proved that the inverse matrix
H_dc(C_ss) = H_sh^(-1)(C_ss)   (5.2.18)
exists and satisfies condition (5.2.15). Left-multiplying formula (5.2.14) by H_dc(C_ss) gives
U = H_dc(C_ss) S.   (5.2.19)
Thus, the transformation H_dc(C_ss) produces from the primary train s(n) the uncorrelated process u(n) (see Figure 5.5b). Therefore, it is called the decorrelating transformation; hence the notation. In Section 7.3 we study the decorrelating transformations in detail. The decorrelating transformation can again be realized by a time-discrete linear filter with time-varying coefficients shown in Figure 3.8e.
Tables 5.2.1. Simulation of gaussian processes: (a) the assumed correlation matrix C_ss, (b) the shaping matrix H_sh, (c) the decorrelating matrix H_dc.
Figure 5.6. Simulated realizations of gaussian processes: (a) a typical realization of the primary uncorrelated gaussian process u(n), (b) the corresponding realization of the process s(n) with the correlation matrix given in Table 5.2.1, obtained by the shaping transformation.
From our considerations it follows that
Every gaussian process can be considered as an effect of a real-time linear transformation of a primeval uncorrelated gaussian process. The uncorrelated gaussian process can be produced from the primary gaussian process by a real-time linear transformation.   (5.2.20)
In Section 4.3.4 it was mentioned that observations of random variables can be simulated by pseudo-random numbers generated by deterministic algorithms (see Dagpunar [5.17], Niederreiter [5.18]). The programs generating trains of pseudo-random numbers can be found in Press [5.14], in program packages [5.15], or in programs described by Wolfram [5.16]. We illustrate the shaping and decorrelation of processes simulated by pseudo-random numbers with an example.

EXAMPLE 5.2.1 SIMULATION OF A GAUSSIAN PROCESS WITH GIVEN CORRELATION MATRIX
We assume that the correlation coefficients are given by the formula
c_ss(m, k) = A_1 e^(-β_1|m-k|) + A_2 e^(-β_2|m-k|).   (5.2.21)
For numerical calculations we take N = 8, A_1 = 1, A_2 = 0.5, β_1 = 1, β_2 = 0.5. The correlation matrix C_ss, the shaping matrix H_sh obtained by Cholesky decomposition, and the decorrelating matrix H_dc are given in Table 5.2.1, while typical realizations u(n), s(n), n = 1, 2, ... of the simulated primeval uncorrelated gaussian process and of the correlated gaussian process, respectively, are shown in Figures 5.6a and 5.6b. If we applied to the correlated process the decorrelating transformation described by the matrix H_dc given in Table 5.2.1c, we would obtain back the realization of the uncorrelated process shown in Figure 5.6a.

Notice the specific structure of the shaping matrix H_sh given in Table 5.2.1: in each row we have only positive elements. Thus, an element of the correlated process s(n) can be interpreted as a sum of positively weighted elements of the uncorrelated process u(m), m = 1, 2, ..., n. This causes the produced process to be smooth. In a typical row of the decorrelating matrix H_dc we have interleaved positive and negative elements. This causes the fast oscillations of the process u(n). □

The decorrelating transformation described by the matrix H_dc gives not only insight into the properties of gaussian processes. It can also be used as a preliminary transformation simplifying real-time compression of a train of pieces of information. Therefore, constructive algorithms decorrelating a train of arriving correlated pieces of information successively, piece by piece, are important. Here we present such an algorithm, which is an application of the Gram-Schmidt orthogonalization algorithm (see Press et al. [5.14]). Another algorithm, using a predictive-subtractive procedure, is discussed in Section 7.5.3.

A train s(m), m = 1, 2, ..., that is an observation of the train of correlated random variables s(m), m = 1, 2, ..., with E s(n) = 0 is considered. As the first component of the decorrelated train satisfying conditions (5.2.11) we take the random variable
u(1) = h_d(1) s(1)   (5.2.22)
and we require that
E u²(1) = 1.   (5.2.23)
From equations (5.2.22) and (5.2.23) it follows that
h_d(1) = 1/√c_ss(1, 1),   (5.2.24)
where
c_ss(m, n) = E s(m) s(n).   (5.2.25)
To obtain a random variable u(2) that is not correlated with the random variable u(1), we look for a linear combination of the random variables s(1) and s(2). In view of (5.2.22) we can equivalently take a linear combination of u(1) and s(2). As such a combination we take
v(2) = h(2, 1) u(1) + s(2).   (5.2.26)
We require that
E u(1) v(2) = 0.   (5.2.27)
Since we set only one requirement, it is enough to introduce in (5.2.26) only one free parameter h(2, 1). Substituting (5.2.26) in (5.2.27) and taking into account (5.2.23) we get
h(2, 1) = -E u(1) s(2).   (5.2.28)
Thus, the random variable
v(2) = [-E u(1) s(2)] u(1) + s(2)   (5.2.29)
is not correlated with u(1), but the variance of v(2) is in general not 1. The normalized random variable
u(2) = v(2)/{E v²(2)}^(1/2)   (5.2.30)
is the second component of the decorrelated train satisfying the conditions (5.2.11). Substituting (5.2.29) in (5.2.30) we get
u(2) = h_d(2, 1) u(1) + h_d(2, 2) s(2),   (5.2.31)
where
h_d(2, 1) = h(2, 1)/{E v²(2)}^(1/2),  h_d(2, 2) = 1/{E v²(2)}^(1/2).   (5.2.32)
Using (5.2.22) and (5.2.24) we express both coefficients in terms of the correlation coefficients c_ss(1, 1), c_ss(1, 2), and c_ss(2, 2).

This procedure is continued. Suppose that after n steps we have produced a train of linear combinations u(m) of the random variables s(m'), m' = 1, 2, ..., m, such that
E u(m') u(m'') = 0 for m' ≠ m'',  E u²(m) = 1,  m', m'', m = 1, 2, ..., n.   (5.2.33)
We consider the linear combination
v(n+1) = Σ_(m=1)^(n) h(n+1, m) u(m) + s(n+1)   (5.2.34)
and we require that
E v(n+1) u(m) = 0  for m = 1, 2, ..., n.   (5.2.35)
Substituting (5.2.34) and using (5.2.33) we get
h(n+1, m) = -E u(m) s(n+1),   (5.2.36)
and
v(n+1) = -Σ_(m=1)^(n) [E u(m) s(n+1)] u(m) + s(n+1)   (5.2.37)
is the new variable that is not correlated with the previously decorrelated variables.
The new decorrelated variable with variance 1 that we are looking for is
u(n+1) = v(n+1)/{E v²(n+1)}^(1/2).   (5.2.38)
Expressing successively the variables u(m) in terms of s(1), s(2), ..., s(m), we put (5.2.37) in the form
u(n+1) = Σ_(m=1)^(n+1) h_d(n+1, m) s(m),   (5.2.39)
with coefficients h_d(n+1, m) expressed by the correlation coefficients c_ss(k, l).
COMMENT
In definitions (5.2.24), (5.2.30), and (5.2.38) of the normalized decorrelated variables we took the positive square root. A negative square root can also be taken. Thus, there are several solutions of the considered decorrelation problem. However, they differ only in the signs in front of the vectors h_d(n) = {h_d(n, m), m = 1, 2, ..., n}.

TIME-CONTINUOUS GAUSSIAN PROCESS
The time-continuous stochastic process is defined as a function assigning to a continuous argument t ∈ <t_a, t_b> a random variable s(t). The process is called gaussian if for every finite set of instants the corresponding random variables are jointly gaussian. The counterpart of the correlation matrix of a time-discrete process is the correlation function
c_ss(t', t'') = E s(t') s(t'');  t', t'' ∈ <t_a, t_b>.   (5.2.40)
From the property (4.5.21) of the gaussian variables and from the definition of the time-continuous gaussian process it follows that:
The statistical properties of a gaussian process are determined by its mean value E s(t) and its correlation function c_ss(t', t'').   (5.2.41)
An important class of time-continuous stochastic processes are processes whose correlation function does not change when the origin of the time scale is shifted. To formulate this property it is convenient to assume that the process is observed on the whole time axis, thus, to assume that
t_a = -∞,  t_b = ∞.   (5.2.42)
In such a case, from the assumption that the statistical properties of the process do not change if the origin of the time axis is shifted, it follows that
E s(t) = const,   (5.2.43)
c_ss(t', t'') = γ_ss(t'' - t'),   (5.2.44)
where γ_ss(τ) is a function of one argument. Such a time-continuous process is called stationary, and γ_ss(τ) is called the one-argument correlation function (briefly, the correlation function). To simplify the argument it is assumed in the subsequent considerations that
E s(t) = 0.   (5.2.45)
Analyzing transformations of a process by a linear stationary system, it is convenient to represent the process as a superposition of harmonic processes. Such a representation is discussed in Section 7.4.4. It is shown there that a stationary stochastic process can be represented in the form (7.4.53)
s(t) = (1/2π) ∫_(-∞)^(∞) e^(jωt) dA(ω),  -∞ < t < ∞,   (5.2.46)
where dA(ω) has the meaning of an infinitely small random complex amplitude of the harmonic function e^(jωt). The power of the infinitely small harmonic component dA(ω)e^(jωt) is
E|dA(ω)|² = S_s(ω) dω.   (5.2.47)
The function S_s(ω) is called the power spectral density, and its basic properties are discussed in Section 7.4.4.

When the angular frequencies of all spectral components of the process lie in the frequency band <-2πB, 2πB> and S_s(ω) = S_0 = const for -2πB < ω < 2πB, then (see Section 7.4.3) the process is represented almost exactly by its samples s(k) taken with the period 1/2B, and these samples are uncorrelated, hence statistically independent, gaussian random variables. The density of their joint probability is
p(s) = A_1 exp[ -(1/2σ²) Σ_k s²(k) ],
where the variance σ² is given by equations (7.4.64) and (7.4.62) and A_1 is given by equation (4.5.19b). Since the set s of samples given by (5.2.49) represents almost exactly the segment {s(t), t ∈ <t_a, t_b>} of the process, the probability density of a realization of the process can be written in the form
p[s(·)] = A exp{ -(1/2S_0) ∫_(t_a)^(t_b) s²(t) dt },   (5.2.52)
where A = A_1 A_2 > 0 is a constant. It is not defined, since we did not define A_2. However, such a definition is not necessary, because we use formula (5.2.52) to calculate the ratios of conditional probabilities (see, e.g., Section 5.4).

We presented here only a sketchy review of the properties of time-continuous random processes that are needed in the forthcoming considerations. For a detailed discussion see, e.g., Lapierre, Fortet [5.12], Shanmungan [5.11].
5.3 MARKOV PROCESSES
Often a current state can be considered to be an indeterministic transformation of the preceding states. An important subclass of such states are states in which the knowledge of a fixed, usually small number of preceding states makes the knowledge of still earlier states obsolete. Such a process is called a Markov process. Here we illustrate the basic properties of those processes with a simple but representative special case. We assume that:
• The random variables s(n), n = 1, 2, ..., N are discrete,
• The set of potential forms of each variable is the same; we denote these potential forms as s_l, l = 1, 2, ..., L.
We say that a time-discrete process is a Markov process if
P[s(n)=s_{l(n)} | s(n-1)=s_{l(n-1)}, s(n-2)=s_{l(n-2)}, ..., s(1)=s_{l(1)}] = P[s(n)=s_{l(n)} | s(n-1)=s_{l(n-1)}]   (5.3.1a)
for all n > 1 and all l(n), where l(n) ∈ {1, 2, ..., L}. Equivalently:
For evaluation of the conditional probability distribution of the next element of a Markov process, the exact information about the current state makes the information about earlier states obsolete.   (5.3.1b)
In many cases, we can interpret the considered train of states as an effect of a transformation of a train of primary states by a system. Then, from the statistical properties of the train of primary states and of the system we may conclude that the considered train is a Markov train. We illustrate this with a typical example.
EXAMPLE 5.3.1 THE TRAIN OF STATES OF A BUFFER
We take the buffering system described in Section 3.3.1 and shown in Figure 2.23, and we look at the state s(w, t_n-0) of the buffer memory, defined as the number of waiting packets at the instant t_n-0, just before the beginning of the nth cycle of the system's operation (see Section 3.3.1). Suppose that t_(n-1) is the current instant. From the rules of operation of the system, in particular from equations (3.3.4) and (3.3.5), it follows that when we know the number s(w, t_(n-1)-0) of packets in the buffer at the instant t_(n-1)-0, then the probability distribution of the number of packets stored in the buffer at the instant t_n depends only on the probability distribution of the number of packets arriving during the current cycle, and not on the earlier states of the buffer. Thus, the train of states s(w, t_n-0), n = 1, 2, ..., is a Markov train. □
From the definition (5.3.1a) and from equation (4.4.7a) it follows that
P[s̄(n)=S_{l(n)}] = P[s(n)=s_{l(n)} | s(n-1)=s_{l(n-1)}] P[s̄(n-1)=S_{l(n-1)}],
P[s̄(n-1)=S_{l(n-1)}] = P[s(n-1)=s_{l(n-1)} | s(n-2)=s_{l(n-2)}] P[s̄(n-2)=S_{l(n-2)}],
. . .
P[s̄(2)=S_{l(2)}] = P[s(2)=s_{l(2)} | s(1)=s_{l(1)}] P[s(1)=s_{l(1)}],   (5.3.2)
where S_{l(n)} = {s_{l(1)}, s_{l(2)}, ..., s_{l(n)}} is a concrete train of potential forms. Multiplying these equations side by side we get
P[s̄(n)=S_{l(n)}] = P_{l(1)}(1) Π_(m=2)^(n) P_{l(m)|l(m-1)}(m),   (5.3.3)
where
P_{l|k}(m) = P[s(m)=s_l | s(m-1)=s_k]   (5.3.4)
are the transition probabilities. Thus, a Markov process is completely described by the transition probabilities (5.3.4) and by the probabilities P_{l(1)}(1) of the first state. In particular, from (4.4.8c), we get the probability distribution of the state s(n):
P[s(n)=s_{l(n)}] = Σ_{l(1), l(2), ..., l(n-1)} P_{l(1)}(1) Π_(m=2)^(n) P_{l(m)|l(m-1)}(m).   (5.3.5)
Equation (5.3.5) also shows that, although (5.3.1) holds, all states are statistically interrelated. We will illustrate this statement in the forthcoming example.

If the transition probabilities do not depend on the number m of the state, we say that they are stationary; we denote them P_{l(2)|l(1)}. The assumption that the transition probabilities are stationary is justified if the deterministic relationships between the states of the system are time invariant (see 3.2.5, page 155) and the statistical properties of the primeval states are stationary too. This is so, for example, in the case of the buffering system described in Section 3.3.1 with a stationary Poisson-exponential train at the input. In general, even if the transition probabilities are stationary, the probabilities P[s(n)=s_{l(n)}] characterizing the nth state depend on n. However, if they do not depend on n, we call the Markov process stationary. We illustrate the introduced concepts with a simple example.

EXAMPLE 5.3.2 A BINARY MARKOV PROCESS
We assume that
A1. The elementary state is binary; its potential forms are s_0 = 0, s_1 = 1;
A2. The probabilities P_l(1), l = 0, 1 are given; we denote briefly P_l(n) = P[s(n)=s_l];
A3. The transition probabilities are stationary and are given; we denote P_{l|k} = P[s(n)=s_l | s(n-1)=s_k].
Let us calculate, for example, P_l(2). From (5.3.5) we have
P_0(2) = P_{0|0} P_0(1) + P_{0|1} P_1(1),
P_1(2) = P_{1|0} P_0(1) + P_{1|1} P_1(1).   (5.3.6a)
In a similar way,
P_l(3) = Σ_(m=0)^(1) P_{l|m} P_m(2) = Σ_(m=0)^(1) Σ_(k=0)^(1) P_{l|m} P_{m|k} P_k(1),  l = 0, 1.   (5.3.6b)
Figures 5.7a and 5.7b illustrate the calculation of P_0(2) and P_0(3), respectively.
Figure 5.7. Illustration of the calculation of marginal probabilities of states of a Markov process when the probabilities of the potential forms of the first state and the stationary transition probabilities are known: (a) calculation of P_0(2), (b) of P_0(3).
From the diagrams we see that to find the probability P_l(n) we have to look for all paths going from a potential form of the state s(1) to the considered potential form s_l of the state s(n), to assign to each such path the probability of the potential form of the first state multiplied by all transition probabilities corresponding to the traversed pairs of potential forms of states, and to sum over all such paths. Let us denote
P(n) = [P_0(n), P_1(n)]^T,  P_{2|1} = [ P_{0|0}  P_{0|1} ; P_{1|0}  P_{1|1} ].   (5.3.7a)
Using these matrices we write (5.3.6) in the form
P(2) = P_{2|1} P(1),  P(3) = P_{2|1}² P(1).   (5.3.7b)
The obvious generalization of the second equation is
P(n) = P_{2|1}^(n-1) P(1).   (5.3.8)
Let us take the numerical values
P(1) = [0.5, 0.5]^T,  P_{2|1} = [ 0.8  0.5 ; 0.2  0.5 ].   (5.3.9)
The probabilities P_l(n) calculated from (5.3.8) are shown in Figure 5.8. □
Figure 5.8. Marginal probabilities of the nth state for fixed probabilities of the first state and fixed transition probabilities (given by (5.3.9)).

From Figure 5.8 we see that although the transition probabilities are stationary, the process is not stationary. However, we also see that when the process evolves (n grows), the probability distribution of a state stabilizes and converges to a limit. This is a general property of a very wide class of Markov processes with stationary transition probabilities (see, e.g., Parzen [5.10] or Lapierre, Fortet [5.12]). The asymptotically stable probability distribution of a state is determined only by the transition probabilities and does not depend on the probability distribution of the first state. If the probability distribution of the first state is the same as the asymptotic distribution, the probability distributions of all other states are the same; thus, the process is time invariant (stationary). We determine the stationary probability distribution from the condition that the probability distribution of the second state is the same as that of the first state. Using (5.3.7) with n = 2 we write this condition in the form
P = P_{2|1} P,   (5.3.10)
where P_{2|1} is the matrix of stationary transition probabilities and the elements of the vector P are considered as variables. We interpret the condition (5.3.10) as a set of L equations. However, only L-1 of those equations are linearly independent. Therefore, to the set of equations (5.3.10) we have to add one more equation requiring that the sum of the probabilities is 1. Let us take the transition probabilities assumed in Example 5.3.2. From equation (5.3.10) we get the stationary probability
P_0 = 0.72.   (5.3.11)
We assume now that the transition probabilities are stationary. Substituting equation (5.3.3) in the generalized definition (5.1.10) of joint entropy we get
H_s̄ = H[s(1), s(2), ..., s(N)] = H[s(1)] + (N-1) H[s(2)|s(1)],   (5.3.12)
where H[s(2)|s(1)] is the average conditional entropy. The entropy per element defined by (4.6.18) is
H_1(N) = H_s̄ / N.   (5.3.13)
After substituting (5.3.12) we obtain
lim_(N→∞) H_1(N) = H[s(2)|s(1)].   (5.3.14)
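A sketch of these calculations, assuming NumPy and illustrative transition probabilities: the marginal probabilities (5.3.8), the stationary distribution satisfying (5.3.10), and the limit (5.3.14) of the entropy per element are computed for a binary Markov process with stationary transition probabilities.

import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Illustrative stationary transition probabilities P_{l|k} (columns sum to 1)
# and probabilities of the first state.
P_trans = np.array([[0.9, 0.3],
                    [0.1, 0.7]])
P1 = np.array([0.5, 0.5])

# Marginal probabilities of the nth state, P(n) = P_trans^(n-1) P(1), cf. (5.3.8).
P_n = P1.copy()
for n in range(2, 11):
    P_n = P_trans @ P_n
    print(n, np.round(P_n, 4))

# Stationary distribution: eigenvector of P_trans for eigenvalue 1, normalized (5.3.10).
w, v = np.linalg.eig(P_trans)
P_stat = np.real(v[:, np.argmin(np.abs(w - 1.0))])
P_stat = P_stat / P_stat.sum()
print("stationary:", np.round(P_stat, 4))

# Limit (5.3.14) of the entropy per element: the average conditional entropy H[s(2)|s(1)].
H_rate = sum(P_stat[k] * binary_entropy(P_trans[0, k]) for k in range(2))
print("entropy per element:", round(float(H_rate), 4))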
Let us take again the stationary transition probabilities given by (5.3.9) and the stationary marginal probabilities given by (5.3.11). Substituting those values in (5.3.14) we get
lim_(N→∞) H_1(N) = 0.73.   (5.3.15)
To this point we considered the simplest type of Markov processes. The obvious generalization of the basic definition (5.3.1) is to assume that the conditional probability on the left side of (5.3.1a) depends not on the last state but on the M last states. Such a process is called a Markov process of rank M. Thus, the previously considered Markov process defined by (5.3.1a) is a Markov process of rank M = 1. The trains of statistically independent binary elements considered in Section 5.2.1 may be classified as Markov processes of rank 0. To simplify the argument we considered discrete scalar-valued trains, thus functions of a scalar discrete argument n taking discrete values, so the structural type of the trains is Td1(d1). The concept of Markov processes can be generalized for other types of states, in particular for the types TdK(d1) (discrete vector-valued, time-discrete processes), TcK(d1) (continuous vector-valued, time-discrete processes), and Td1(c1) (discrete-valued, time-continuous processes; the previously described Poisson process may be classified as such a process of rank 0). This has been only a synthetic review of the fundamental properties of Markov processes that will be needed in the following chapters. For more detailed studies see, e.g., Parzen [5.10], Blanc-Lapierre, Fortet [5.12]. Although the concept of the Markov process is primarily associated with the sequential structure, generalizations of this concept for functions of a structured argument, such as images (structural type Td1(d2)), have also been used.
5.4 THE RELATIONSHIPS BETWEEN A STATE AND ITS INDETERMINISTIC TRANSFORMATION
We now consider the relationships between a structured primary state and the result of its indeterministic transformation discussed in Section 1.5.5 (see also Figure 1.14). To make these considerations concrete we interpret the transformation as the transmission of an input signal by a communication channel introducing indeterministic distortions. However, our argument, after slight modification, applies to other information channels, particularly to transformations of primary data in storage media and to information sources producing information that can be considered as a modification of a prototype. A component of a Markov process can be interpreted as a result of an indeterministic transformation of the preceding state. The problem considered here is in principle similar. In particular, we concentrate on the conditional probability of the transformed process on the condition that the primary state is given. The difference from the previous analysis of Markov processes is that now we assume that the state is structured, particularly that it has a time structure. On the other hand, we consider here only a single transformation, corresponding to a single step of evolution of a Markov process.
We present the general method of calculating the conditional probability distribution of the transformed state on the condition that the primary state is fixed and illustrate it with concrete examples. In Chapter 8 we show that the conditional probability distribution is of paramount importance, because it determines the optimal information recovery rules.

5.4.1 THE BASIC MODEL
The transformation of a primary information by a communication channel, considered in Section 2.1.1, is a typical example of the indeterministic transformation of a structured state. We analyze such a transformation in more detail. To have consistent notation we substitute w → s, r → v. Since we consider here both time-continuous and time-discrete processes, to the symbols of processes used in Section 2.1.1 we add the subscript "c" as a reminder that the process is time-continuous. We assume that
A1. The primary state is a time-continuous process {s_c(t), t ∈ <t_a, t_b>};
A2. The transformed state is a time-continuous process v_c(t) that can be represented in the form
v_c(t) = v_cn(t) + z_c(t),   (5.4.1)
where z_c(t) is the noise and
v_cn(t) = V_cn[s_c(·), b, t]   (5.4.2)
is the noiseless component, produced from the primary process by a transformation V_cn(·,·,·) depending on the side parameters b (see Figure 5.9).

Figure 5.9. The model of a typical indeterministic transformation of the primary continuous state process s_c(t) performed by communication or storage channels: v_c(t) - the transformed process, v_cn(t) - the noiseless component, z_c(t) - the noise, b - the side parameters, V_cn(·,·,·) - the transformation generating the noiseless component.
Usually, v_cn(t) depends on the course of the process {s_c(t'), t' ∈ <t_a, t>} prior to the considered instant t. Typical examples of the effect of the side parameters are a change of scale and a delay of the primary state process:
v_cn(t) = a s_c(t - t_d),  b = {a, t_d}.   (5.4.3)
For images, the counterparts are scale change and displacement (shift and/or turning) of the primary state process (image). The side parameters b are usually not known, and consequently the transformation V_cn(·,·,·) is indeterministic. The primary process s_c(·) can often be recovered exactly from the noiseless process v_cn(·), irrespective of the values the parameters b take, however with an unavoidable delay. If this delay is not taken into account, the transformation V_c[s_c(·)] can be considered to be an indeterministic but reversible transformation (see Figure 1.14 and Section 1.5.5). Although the recovery is then possible, it may be tedious. Therefore, the side parameters b are also called nuisance parameters.

The advantages of numerical analysis and of digital processing cause the time-continuous processes to be usually sampled before subsequent processing. We denote
v(n) = v_c(t_n),  t_a ≤ t_n ≤ t_b,  n = 1, 2, ..., N,   (5.4.4)
where t_n are the sampling instants.
The time-discrete counterpart of (5.4.1) is
v(n) = v_n(n) + z(n),  n = 1, 2, ..., N,   (5.4.5a)
where z(n) are the corresponding samples of the noise and
v_n(n) = V_n[s_c(·), b, n].   (5.4.5b)
V_n(·,·,·) is the transformation producing the noiseless train (it includes the previously considered transformation V_cn(·,·,·) and a sampling transformation).

COMMENT
Both the noise and the side parameters are components of side states (in the sense of Section 1.5.1; see Figure 1.13), however there is a difference between them. In typical situations the noise has two features. It is "small" compared with the noiseless component; thus, the distortion of the transformed process is small too. The second feature is that the noise usually depends on many independent factors; thus, its indeterminism is large (a typical noise process is described in Example 4.3.1). With the side parameters the typical situation is just the opposite. They cause large distortions, but because their number is small, their indeterminism is also small. These differences cause the techniques of counteraction to be different. We attempt to increase the energy or to build more structure (redundancy) into the useful process to make the noise relatively smaller. This can be done in a fixed way, using error-correcting coding, or in a flexible way in systems with feedback (see Section 2.1.2). For side parameters we attempt to get information about them, estimate them, and compensate their effect. The intelligent systems considered in Section 1.7.2 operate in such a way.
This comment applies not only to concrete states but also to meta-states. Thus, we may have side parameters that have the meaning of parameters determining the statistical properties of the noise. Usually, the parameters characterizing the meta-states change much more slowly than the parameters characterizing the concrete states. Therefore, the assumption that the parameters characterizing the meta-state are constant during a cycle of the system's operation is usually well justified.

5.4.2 CALCULATION OF PROBABILITY DISTRIBUTION OF THE TRANSFORMED STATE WHEN THE NOISELESS COMPONENT IS EXACTLY KNOWN
We assume now that the noise and the side parameters exhibit statistical regularities. Thus, the noise can be considered as a realization of a random process z_c(t), t ∈ <t_a, t_b> (respectively, of random variables z(n)), and the side parameters as realizations of random variables. We assume first that the primary state s and the transformed state v are one-dimensional and that the noiseless component V_n(s) is exactly known. Then, from v = V_n(s) + z, it follows that
p(v|s) = p_z[v - V_n(s)],   (5.4.8)
where p_z(z) is the probability density of the noise. In particular, when the noise is a gaussian variable with E z = 0 and variance σ_z², we get
p(v|s) = (2πσ_z²)^(-1/2) exp{ -[v - V_n(s)]²/2σ_z² }.   (5.4.9)
Thus, for an observer who knows the primary state, the value of the transformed state fluctuates around the noiseless component of the transformed signal, and the range of fluctuations is determined by the variance σ_z² of the noise.

In our argument we did not use the assumption that the states are one-dimensional. Therefore, the generalization of the basic equation (5.4.8) for any structured primary state and a transformed state having the structure of type TcN(cK) or TcN(dK) is straightforward. For example, let us assume that
B1. The primary state s is 1-DIM;
B2. The transformed information is N-DIM, v = {v(n), n = 1, 2, ..., N};
B3. The noise z = {z(n), n = 1, 2, ..., N} is a realization of the N-DIM random variable Z;
B4. The variable Z is a gaussian variable with statistically independent components; thus its probability density p_z(z) is given by (4.5.14).
The generalization of equation (5.4.9) is:
p(v|s) = p_z[v - V_n(s)] = A exp{ -(1/2σ_z²) Σ_(n=1)^(N) [v(n) - V_n(s, n)]² },   (5.4.10)
where V_n(s, n) is the nth component of the noiseless vector into which the primary scalar state s is transformed and A is the coefficient in front of "exp" in (4.5.14).

Next, assume B1 and
C1. The transformed state is a time process observed in the interval <t_a, t_b>;
C2. The noise is a realization of a time-continuous, base-band, stationary gaussian process, with a spectral density that is uniform in the frequency band in which all spectral components of the transformed state process lie; we denote by S_z the power spectral density of this process.
The process representing the noise was considered in Section 5.2.2. Using equation (5.2.52), giving the probability density of a realization of the noise, we get
p[v(·)|s] = p_z{v(·) - V_cn[s, (·)]} = A exp{ -(1/2S_z) ∫_(t_a)^(t_b) [v(t) - V_cn(s, t)]² dt }.   (5.4.11)
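A sketch of how (5.4.10) is evaluated, assuming NumPy. The noiseless transformation V_n(s, n) = s·g(n), with g(n) a known waveform, is a hypothetical illustration; the logarithm of (5.4.10) is computed for several candidate values of s, the constant A being irrelevant when candidates are compared.

import numpy as np

rng = np.random.default_rng(2)

N = 64
n = np.arange(N)
g = np.sin(2 * np.pi * n / 16)     # hypothetical known waveform, V_n(s, n) = s * g(n)
sigma_z = 0.8                      # noise standard deviation (illustrative)

s_true = 1.5
v = s_true * g + sigma_z * rng.standard_normal(N)   # observed transformed state (5.4.5a)

def log_p_v_given_s(v, s):
    # Logarithm of (5.4.10) up to the additive constant log A.
    return -np.sum((v - s * g) ** 2) / (2 * sigma_z ** 2)

for s in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(s, round(float(log_p_v_given_s(v, s)), 2))
# The candidate closest to the primary state s_true typically gives the largest value.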
Equation (5.4.11) is the time-continuous counterpart of equation (5.4.10). A typical example of the considered indeterministic transformation is the transformation performed in the communication system considered in Section 2.1.1 (see Figures 2.1 and 2.2). Then the primary state has the meaning of the primary information, and the noiseless process is the noiseless signal. Typical examples of transformations producing such a noiseless process are:
V_cn(s, t) = s A_en(t) cos(ω_c t + φ),  t ∈ <t_a, t_b>,   (5.4.12)
V_cn(s, t) = A_en(t) cos[(ω_c + a_f s)t + φ],  t ∈ <t_a, t_b>,   (5.4.13)
where A_en(t), satisfying
A_en(t) = 0 for t outside the interval <t_a, t_b>,   (5.4.14)
is a pulse-type envelope as shown in Figure 2.1 and Figure 2.2, ω_c is the angular carrier frequency, φ the phase shift, and a_f a scaling constant. The process (5.4.12)
is the output of the communication channel produced by an amplitude-modulated signal put into the channel, while the process (5.4.13) is produced by a frequency-modulated signal put into the channel.

5.4.3 CALCULATION OF PROBABILITY DISTRIBUTION OF THE TRANSFORMED STATE WHEN NOISELESS COMPONENTS DEPEND ON UNKNOWN PARAMETERS
We now assume that the side parameters determining the noiseless process are indeterministic but exhibit statistical regularities, and that the probability distribution describing the regularities is known. To simplify the argument we assume first that
A1. The primary and transformed states are 1-DIM;
A2. Only one parameter b is unknown;
A3. On the condition that the primary state is fixed, the noise and the unknown parameter can be considered as realizations of the random variables z and b;
A4. The random variables z and b are statistically independent.
Based on these assumptions,
v = V_n(s, b) + z.   (5.4.15)
Our task is to calculate the density of conditional probability p(v|s). We take into account the effect of the indeterminate parameter using equation (4.4.8d) for marginal probability. We write it in the form
p(v|s) = ∫_B p(v, b|s) db,   (5.4.16)
where B is the set of potential values of the unknown parameter b. From equation (4.4.7b) for the density of conditional probability we get
p(v, b|s) = p(v|b|s) p(b|s).   (5.4.17)
However,
p(v|b|s) = p(v|b, s).   (5.4.18)
On the condition that the primary state s and the parameter b are fixed, and on assumption A4, we have the same situation as in the previously considered case when the noiseless component is determined, and we can use equation (5.4.8). It takes the form
p(v|b, s) = p_z[v - V_n(s, b)].   (5.4.19)
Substituting (5.4.17) to (5.4.19) in (5.4.16) we get the probability distribution we were looking for:
p(v|s) = ∫_B p_z[v - V_n(s, b)] p(b|s) db.   (5.4.20)
Since the assumption that the considered states and the indeterminate parameter are scalars played only a formal role, the generalization of equation (5.4.20) for structured states and indeterministic parameters is straightforward. To illustrate such a generalization we again adopt assumptions C1 and C2 (page 237) and in addition we assume
D4. The noiseless process is given by (5.4.12), but the phase φ is indeterminate; we denote it now as b.
Thus, we write the noiseless process in the form
V_cn(s, b, t) = s A_en(t) cos(ω_c t + b),  t ∈ <t_a, t_b>.   (5.4.21)
We assume that the phase b is a realization of a random variable distributed uniformly over the interval <0, 2π>:
p(b|s) = 1/2π,  b ∈ <0, 2π>.   (5.4.22)
The counterpart of equation (5.4.20) is
p[v(·)|s] = ∫_0^(2π) p_z{v(·) - V_cn[s, b, (·)]} p(b|s) db.   (5.4.23)
Using equations (5.4.11), (5.4.22), and (5.4.23) we finally get
p[v(·)|s] = A ∫_0^(2π) exp{ -(1/2S_z) ∫_(t_a)^(t_b) [v(t) - s A_en(t) cos(ω_c t + b)]² dt } db.   (5.4.24)
This integral can be calculated in a closed form (see, e.g., Gradshteyn, Ryzhik [5.19]) and we obtain
p[v(·)|s] = const · exp{ -s² E[A_en(·)]/2S_z } I_0{ s r[v(·)]/S_z },   (5.4.25a)
where
E[A_en(·)] = ½ ∫_(t_a)^(t_b) A_en²(t) dt   (5.4.25b)
has the meaning of the energy of the noiseless process when s = 1,
r[v(·)] = √( c_c²[v(·)] + c_s²[v(·)] ),   (5.4.25c)
c_c[v(·)] = ∫_(t_a)^(t_b) v(t) A_en(t) cos ω_c t dt,   (5.4.25d)
c_s[v(·)] = ∫_(t_a)^(t_b) v(t) A_en(t) sin ω_c t dt,   (5.4.25e)
and I_0(w) is the Bessel function of imaginary argument of order 0.

COMMENT
We present this lengthy equation for two purposes. First, it is used in Chapter 8. Second, the equation illustrates how, in the calculation of the conditional probability, the exact information about the structure of the noiseless process is utilized and the unknown side parameters are eliminated. The observer located at the input of the channel (as the observer shown in Figure 1.13) knows the envelope A_en(·) and the carrier angular frequency ω_c. The indeterministic phase shift b that influences the noiseless process is not known to the observer, but its effect on the probability distribution p[v(·)|s] is eliminated by integration (see equation (5.4.24)).
However, the performance of an optimized system based on a noiseless signal depending on an indeterministic phase shift is worse than that of the optimized system utilizing exact information about the phase. We discuss this effect in Section 8.4.2. We presented here only the general method of calculating the conditional probability and gave its simple applications. As is shown in Section 8.3, the considered conditional probability determines the structure of the optimal subsystems recovering distorted information, particularly of optimal receivers in communication systems. Therefore, several examples of application of the general method described here can be found in publications on optimization of signal reception (see, e.g., Proakis [5.20]).

5.4.4 THE ROUGH DESCRIPTIONS OF THE TRANSFORMATIONS PERFORMED BY A COMMUNICATION CHANNEL
The set of the previously considered conditional probability distributions p(v|s), s ∈ S, where S is the set of potential forms of the primary state, provides the exact description of an indeterministic state transformation. Usually such an exact description is quite complicated, and therefore simplified descriptions of an indeterministic transformation are of great interest. When the transformation has the meaning of an operation performed by a communication channel, then rough descriptions of the conditional probabilities that characterize the quality of decisions about the actions in the environment based on the transformed state are useful. When this state can be represented as the sum of a noiseless component and noise, then we can characterize the transformation performed by the channel by an indicator of the relative magnitude of those components. It is called the signal/noise ratio. We show in the following chapters that indicators of this type appear "automatically" when the quality of decisions based on the state produced by an indeterminate transformation is analyzed. However, in general we cannot represent a transformed state as a sum of a noiseless component and noise, and even if we can, there are in general no justified a priori indications of how to define the "magnitude" of the components.

Here we consider the channel capacity. It is a rough description of the set of conditional probability distributions based on the amount of statistical information defined in Sections 5.1.2 and 5.1.3. We cannot use directly the amount of statistical information I(V:S) defined by (5.1.21) or (5.1.27), because it depends not only on the conditional probabilities characterizing the channel but also on the statistical properties of the primary state, which are not a characteristic of the considered transformation. Thus, to define on the basis of I(V:S) a rough description of the statistical properties of an indeterministic transformation, we must remove the dependence of I(V:S) on the statistical properties of the random variable S. The concept of operations removing the dependence on detail was introduced previously (see Section 1.6.1) and is discussed in detail in Section 8.1.2. Here we take as the dependence-removing operation the operation of finding the maximum of I(V:S) with respect to all probability distributions P_s describing the random variable S representing the primary state (the state at the input of the system performing the considered indeterminate transformation).
In general, some constraints are imposed on the primary state. For example, if it is a vector, it is often required that a component of it must not surpass a fixed value. The constraints determine the set S_P of admissible probability distributions P_s. Thus, as a characteristic of an indeterminate transformation described by the set of conditional probability distributions p(v|s), s ∈ S, we take
C = max_(P_s ∈ S_P) I(V:S).   (5.4.26)
The calculation of capacity simplifies when the primary and transformed states are N-DIM vectors and v = s + z; thus, V = S + Z. Then from equation (5.1.21) we have
I(V:S) = H(V) - H(V|S).   (5.4.27)
On the additional assumption that the random variable Z representing the noise is statistically independent of the primary process, we have
H(V|S) = H(Z).   (5.4.28)
From equations (5.4.26) to (5.4.28) it follows that
C(N) = max_(P_s ∈ S_P) H(V) - H(Z).   (5.4.29)
Thus, on the simplifying assumptions, finding the channel capacity reduces to maximizing the entropy of the transformed state by a proper choice of the statistical properties of the primary state. We illustrate this with simple but important special cases of channels described by the following assumptions:
C1. The components of the state vector are binary (0, 1); the components of the noise are statistically independent and have the same probability distribution; a component of the primary state and the corresponding component of the transformed state are different when the noise component is 1 (see equation (2.1.20)). We call the probability P[z(n)=1] the error probability and denote it as P_e. A communication channel performing the described transformation is called a binary, memoryless, symmetric channel.
C2. The components of the vector states are continuous; the mean-square value of a component of the primary state E s²(n) = σ_s² = const; the components of the noise are represented by gaussian independent random variables with E z(n) = 0 and the mean-square value E z²(n) = σ_z² = const. A communication channel performing such a transformation is called a discrete, memoryless, gaussian channel.
C3. The primary process is a time-continuous process of duration T; B is the highest frequency of its harmonic components; its average power
σ_s² = (1/T) ∫_0^T E s_c²(t) dt   (5.4.30)
is fixed; the time-continuous noise is a base-band gaussian process considered in Section 5.2.2, with a uniform spectral density that in the frequency range <0, B> has the constant value S_z. A communication channel performing the described transformation is called a continuous, memoryless, gaussian channel.
We now give equations for the capacity of those channels. Figure 5.2 shows that the maximum value of the entropy of a binary random variable is 1, and this maximum value is achieved when the variable has a uniform probability distribution. It can be shown that for the assumed noise we can achieve this probability distribution by a proper choice of the probability distribution of the components of the primary state. Thus, max H(V) = 1. The entropy H(Z) we get from (5.1.7). Thus, the capacity of a binary, memoryless, symmetric channel is
C(N) = N[1 + P_e log₂P_e + (1-P_e) log₂(1-P_e)].   (5.4.31)
The diagram of the capacity per component
C₁ ≜ C(N)/N   (5.4.32)
as a function of the error probability P_e is shown in Figure 5.10a. For the discrete, memoryless, gaussian channel C2 the procedure is similar. It can be shown (see, e.g., Cover, Thomas [5.4]) that for fixed variance the entropy H(V) is maximized when the component s(n) of the primary state is a gaussian variable. Then, in view of property (4.5.22), the component v(n) is gaussian too, and its variance is σ_v² = σ_s² + σ_z². Using (5.1.33) we get
C₁ ≜ C(N)/N = ½ log₂(σ_v²/σ_z²) = ½ log₂(1 + σ_s²/σ_z²).   (5.4.33)
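To make the two formulas concrete, the short Python sketch below evaluates the per-component capacity (5.4.32)-(5.4.33) for the binary symmetric channel and for the discrete gaussian channel. The function names are ours, introduced only for illustration.

```python
import math

def bsc_capacity_per_component(p_e: float) -> float:
    """Capacity (5.4.31)/(5.4.32) of the binary, memoryless, symmetric
    channel per component: 1 + P_e*log2(P_e) + (1-P_e)*log2(1-P_e)."""
    if p_e in (0.0, 1.0):          # the entropy term vanishes at the endpoints
        return 1.0
    return 1.0 + p_e * math.log2(p_e) + (1.0 - p_e) * math.log2(1.0 - p_e)

def gaussian_capacity_per_component(snr: float) -> float:
    """Capacity (5.4.33) of the discrete, memoryless, gaussian channel
    per component, with snr = sigma_s**2 / sigma_z**2."""
    return 0.5 * math.log2(1.0 + snr)

print(bsc_capacity_per_component(0.1))       # about 0.531 bit per component
print(gaussian_capacity_per_component(15.0)) # 2.0 bits per component
```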
The ratio σ_s²/σ_z² has the meaning of the signal-to-noise ratio. Thus, in the considered case, the channel capacity per component is in a unique way related to the signal/noise ratio. In Section 7.4.3 the conclusion (7.4.49) is derived, saying that a base-band time-continuous process observed in a time interval of duration T can be almost accurately represented by a set of
N = 2BT   (5.4.34)
samples taken with period 1/(2B), where B is the highest frequency of a harmonic component of the process. The representation becomes exact when BT → ∞. It can easily be shown that the amount of statistical information is not changed if a reversible transformation of a state is performed. Therefore, to calculate for large BT the capacity of channel C3, we can use equation (5.4.33), however, after expressing the parameters characterizing channel C2 by the parameters characterizing channel C3. From equation (7.4.62) it follows that we have to set
σ_z² = 2BS_z.   (5.4.35)
Substituting (5.4.34) and (5.4.35) in (5.4.33), we obtain the capacity of channel C3
C(T) = BT log₂(1 + σ_u²/(2BS_z)).   (5.4.36)
Similarly to the channel capacity C₁ per elementary component, we introduce the channel capacity per time unit
C ≜ C(T)/T.   (5.4.37)
From equation (5.4.36) we see that the channel capacity depends strongly on the bandwidth of the noiseless signals. To get insight into this dependence, we substitute equation (5.4.36) in definition (5.4.37) and write the result in the form
C = B log₂(1 + B₀/B),   (5.4.38)
where
B₀ = σ_u²/(2S_z).   (5.4.39)
In view of (5.4.35), B₀ has the meaning of such a bandwidth that the noise power is equal to the noiseless-signal power. Introducing the normalized bandwidth
B' = B/B₀   (5.4.40)
and the normalized channel capacity
C' = C/B₀,   (5.4.41)
we write equation (5.4.38) in the form
C' = B' log₂(1 + 1/B').   (5.4.42)
The diagram of C' versus B' is shown in Figure 5.10b.
Figure 5.10. Dependence of channel capacity on the parameters determining the conditional probability distribution of a transformed state: (a) dependence (5.4.31) of the normalized capacity of the binary, memoryless, symmetric channel on the binary error probability P_e; (b) dependence (5.4.42) of the normalized capacity C' of a time-continuous gaussian channel with noise of constant spectral density on the normalized bandwidth B'.
The diagram shows that C', after initial growth, stabilizes at the asymptotic value
C'_∞ = lim_{B'→∞} C' = log₂e.   (5.4.43)
Using equations (5.4.30) and (5.4.36), we get the corresponding asymptotic value of the capacity C(T) given by equation (5.4.36)
C_∞ = lim_{BT→∞} C(T) = (E/(2S_z)) log₂e,   (5.4.45)
where
E = Tσ_u²   (5.4.46)
is the average energy of the process.
COMMENT 1
From equations (5.4.36) to (5.4.43) and from Figure 5.10b it follows that the capacity C(T) initially grows almost linearly with the bandwidth (thus, with the dimensionality) of the noiseless process, but for large BT the capacity C(T) approaches the asymptotic value C_∞ that depends only on the average energy of the noiseless signal. Conclusion (7.4.26) says that 1/T is approximately the minimum bandwidth of a process of duration T. Therefore, the product BT has the meaning of the surplus of the bandwidth of a process over the smallest possible bandwidth and can be interpreted as an indicator of the degree of fine structuring of a noiseless signal. In consequence, the interpretation of the discussed dependence of the capacity on the product BT is that when the degree of structuring of the noiseless signals is small, then by
increasing it, the channel capacity can be increased without increasing the power of the noiseless signals or decreasing the power spectral density of the noise. However, if the degree of structuring of the noiseless signals is already large, increasing it has no further effect and the capacity depends only on the energy of the noiseless signals.
COMMENT 2
We defined the capacity without taking into account the properties of the superior system. Therefore, at this stage of the considerations we cannot say if it is a useful indicator of the performance of information systems. However, in Sections 8.4.3 and 8.6.1 we discuss the coding theorems, which state that the performance of an optimized communication system depends in a crucial way on the channel capacity. Other concrete applications of capacity are presented in Section 8.5.4. We gave here only a concise introduction to the concept of channel capacity. For more details see Abramson [5.3], Cover, Thomas [5.4], Blahut [5.5].
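The saturation described in Comment 1 is easy to verify numerically. The short Python sketch below, with illustrative parameter values of our own choosing, evaluates C(T) from (5.4.36) for increasing bandwidth and compares it with the asymptote (E/(2S_z)) log₂e.

```python
import math

T = 1.0          # observation time [s]          (illustrative value)
sigma_u2 = 1.0   # average signal power          (illustrative value)
S_z = 0.01       # noise spectral density        (illustrative value)

E = T * sigma_u2                             # average energy (5.4.46)
C_inf = E / (2.0 * S_z) * math.log2(math.e)  # asymptotic capacity (5.4.45)

def capacity(B: float) -> float:
    """Capacity (5.4.36) of the continuous, memoryless, gaussian channel."""
    return B * T * math.log2(1.0 + sigma_u2 / (2.0 * B * S_z))

for B in (1.0, 10.0, 100.0, 1000.0):
    print(f"B = {B:7.1f} Hz   C(T) = {capacity(B):8.2f} bits   C_inf = {C_inf:.2f}")
```

For small B the capacity grows roughly linearly with the bandwidth; for large B it levels off near C_inf, as stated in Comment 1.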
5.5 THE MODELS OF INDETERMINISM OF A STATE RELATIVE TO AVAILABLE INFORMATION
In Chapters 2 and 3, and in this chapter, we presented an in principle "impartial" description of states and systems, without taking into account the purposeful activities of a superior system. Therefore, we have not specified the meaning of the transformed state. However, according to the general definition (1.1.1), information has the meaning of a transformation of the state. Thus, the information source can be interpreted as the subsystem transforming the state of the environment into information. The source may be a primary source (described in Section) or secondary, such as the output of a communication channel or an information-compression subsystem. Therefore, our previous considerations about states can be directly used for the analysis of information sources from the point of view of an observer B_s having direct access to the state s of the environment and looking for the information produced by an information-generating transformation (see Figure 5.11).
Figure 5.11. Illustration of the definition of the observers B_s and B_x.
Here we discuss the basic problem that arises for an observer B_x interested in the state of the environment but having access only to information x about the state s. The situation of such an observer is opposite to the situation of the observer B_s. The problem is that the information-generating transformation is practically always irreversible (deterministic or indeterministic). In other words, the available information is practically never accurate. This means that knowing the information we cannot determine exactly the concrete state that produced the information.
However, we can in general determine the set of the potential forms of the state which may have generated the available information. As has been explained in Section 1.4, knowing such a set (the better, the statistical weights, if they exist), we may substantially improve the performance of purposeful actions performed in the environment. This conclusion applies not only to concrete states (internal, external) but also to meta states (sets of potential states). We usually do not have direct access to a meta state but do have some information about it. If the information about the meta state is not exact, the knowledge of the set of potential forms of the meta state, thus the meta meta state, is useful. We now introduce a hierarchy of states and meta states and the corresponding hierarchy of information about them. If the transformation is reversible, thus, if the available information x is exact, we say that relative to the available information x the state s is determined (the deterministic model of the state can be applied). If the information-generating transformation is irreversible (the information is inexact) but for a fixed information x the potential states exhibit statistical regularities, a probabilistic weight can be associated with each potential state. Thus, a statistical state may be defined. To simplify our reasoning, we assume further:
A1. Both the concrete state and the information about it are vectors;
A2. The concrete states exhibit statistical regularities that are described by the probability density p(s);
A3. For a fixed state (the point of view of observer B_s) the information, generated by an indeterministic transformation, exhibits statistical regularities, which are described by the density of probability p(x|s).
Then, for a given information x (the point of view of observer B_x), the statistical state is described by the density of conditional probability p(s|x). Using the generalized formula (4.4.8) we calculate this density of probability from the probability distributions mentioned in A2 and A3:
p(s|x) = (1/c) p(x|s) p(s),   c = ∫···∫ p(x|s) p(s) ds.   (5.5.1)
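As a numerical illustration of (5.5.1), the sketch below, a toy example of our own with a discretized scalar state, computes the conditional density p(s|x) from an assumed prior p(s) and an assumed observation model p(x|s) by evaluating the normalizing constant c on a grid.

```python
import numpy as np

# Discretized scalar state s and a fixed observed information x (toy example).
s = np.linspace(-5.0, 5.0, 1001)
x = 1.2

prior = np.exp(-0.5 * s**2) / np.sqrt(2.0 * np.pi)     # p(s): standard gaussian (assumed)
sigma = 0.5                                            # noise standard deviation (assumed)
likelihood = (np.exp(-0.5 * ((x - s) / sigma) ** 2)
              / (sigma * np.sqrt(2.0 * np.pi)))        # p(x|s): x = s + gaussian noise (assumed)

c = np.trapz(likelihood * prior, s)                    # normalizing constant in (5.5.1)
posterior = likelihood * prior / c                     # p(s|x)

mean = np.trapz(s * posterior, s)
var = np.trapz((s - mean) ** 2 * posterior, s)
print(mean, var)   # about 0.96 and 0.20 for this prior and noise level
```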
The set
S_STAT^(1) ≜ {p(s|x); s ∈ S, x ∈ X}   (5.5.2)
of those densities of probability is called the conditional (relative to the available inexact information x) statistical state of order 1. If, for a given inexact information x, the potential forms of the concrete states do not exhibit statistical regularities, or, if they do, we do not take them into account, we cannot associate a probabilistic weight with a potential state. Then we say that the statistical regularities cannot be used. In such a case, we may improve the quality of purposeful actions if we know some other than statistical properties of the set of potential forms of the concrete state. The fundamental property of this type is the set of rules of belonging to the set of potential forms (membership rules; see Section 4.7). There have also been many attempts to enrich the description of the set of potential forms of a state by associating with every potential state a nonstatistical weight. Typical examples of such weights are:
1. Likelihood; this is the conditional probability p(x|s) considered as a function of the condition s; such a weight function is widely used in statistics when the probability density p(s) mentioned in A2, page 245 (called in statistics the a priori probability) is not available; then formula (5.5.1) cannot be used (see Larsen, Marx [5.7], Edwards [5.21]);
2. The Zadeh μ function, used in fuzzy sets theory (see, e.g., Klir, Folger [5.22], Zimmermann [5.23]);
3. Expert belief functions (see Shafer, Pearl [5.24, ch. 5], particularly the tutorial);
4. Shafer-Dempster evidence weights (see Shafer, Pearl [5.24, ch. 3], particularly the tutorial).
The fundamental difficulty with the nonprobabilistic weights is that, unlike the statistical weights, they do not have an objective character. One of the consequences of this is that there are no systematic methods for experimentally determining the values of those weights. Worse, there are no systematic rules for calculating the weights of secondary states obtained by some transformations from primary states with nonstatistical weights. Therefore, many incoherent, ad hoc rules must be introduced (see, e.g., the publications mentioned in 2, 3, and 4). The discussed properties of the potential forms of a concrete state when the statistical regularities cannot be used are called the nonstatistical meta state of order 1 and are denoted as S_NSTAT^(1).
Let us return to the situation when the potential states exhibit statistical regularities, thus, when the conditional statistical state S_STAT^(1) exists. With this state we have a situation analogous to that with the concrete state s. The state S_STAT^(1) is usually not directly available, and we have about it only some information
X_STAT = T(S_STAT^(1)),   (5.5.3)
where T(·) denotes the transformation generating the information⁶. We call X_STAT statistical information of order 1. Statistical information of order 1 can be obtained during a training cycle as described in Section 1.7.2 (see Figure 1.25). A typical example of the statistical information of order 1 about the probability density p(r|u) describing the statistical properties of the channel is the train given by equation (1.7.3)
R(CH) = {(y_u(j), r(j)), j = 1, 2, · · · , J},
where y_u(j) is exact information about the process u available at the channel input and r(j) is the process at the channel output, obtained during the training cycle. Several other examples are given in the forthcoming chapters. The statistical information X_STAT may be exact or not. If it is, then we can determine the statistical regularities S_STAT^(1) exactly. We say then that the statistical state is determined or, in classical terminology, that the Bayes model can be applied. If the statistical information X_STAT is not exact, we have a situation similar to the case when the information x about a concrete state s is not exact. If S_STAT^(1) exhibits statistical regularities, then the statistical state S_STAT^(2) of order 2 exists. If the statistical state does not exhibit statistical regularities or they cannot be used, we have to consider the nonstatistical meta state S_NSTAT^(2) of second order.
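As a minimal illustration of statistical information of order 1 obtained from such a training train, the sketch below, a toy binary channel of our own construction, estimates the channel's error probability from the pairs (y_u(j), r(j)) collected during a training cycle.

```python
import random

random.seed(0)
P_ERR_TRUE = 0.1          # "true" channel error probability (assumed, for simulation only)
J = 10_000                # length of the training train

# Training cycle: known channel input y_u(j) and observed channel output r(j).
train = []
for _ in range(J):
    y = random.randint(0, 1)
    r = y ^ (1 if random.random() < P_ERR_TRUE else 0)   # binary symmetric channel model
    train.append((y, r))

# Statistical information of order 1: the relative frequency of y != r
# estimates the conditional probability describing the channel.
p_err_est = sum(1 for y, r in train if y != r) / J
print(f"estimated error probability: {p_err_est:.3f}")    # close to 0.1
```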
Introducing two binary features: • the information about the state is exact or nonexact, and • the statistical information can or cannot be used, we can illustrate our considerations by the first two layers of the tree shown in Figure 5.12. The further extension of our reasoning (and of the tree) is evident. Starting with the third layer we have a specific situation. Having exact information about the statistical state of order 2 and using the equation (4.4.8d) for marginal probabilities, we can obtain exact information about the statistical state of order 1 (this is indicated by an arrow in the diagram). Thus, if the statistical state of any order is determined, we can apply the Bayes model.
[Figure 5.12 diagram: a tree whose nodes include "exact info about concrete state (deterministic model)", "exact info about statistical state (Bayes model)", "statistical state of first order", "non-statistical meta state of first order", and the non-statistical weights of potential states: likelihood, fuzzy sets μ function, expert belief functions, Shafer-Dempster weights.]
Figure 5.12. The fundamental types of relationships between the available nonexact information and the state that this information pertains to (models of indeterminism): STATISTICAL REGULARITIES USED/NOT USED: statistical regularities exist and are used, or they do not exist or are not used; EXACT/INEXACT: the available information about the state (shown as two concentric circles) is exact or inexact.
Similarly, as in the case of statistical states, it is possible that only nonexact information about the nonstatistical meta states is available. For example, often the potential forms of a state, or even their number, is not known exactly. Although some of the situations illustrated in Figure 5.12, particularly situations when exact information about nonstatistical meta states is not available, seem to be quite exotic, they occur often in practice and are assumed tacitly in many publications. We illustrate these considerations with a simple example.
EXAMPLE 5.5.1 TWO-LEVEL META STATES AND META INFORMATION
We assume that:
A1. The primary state is a scalar s, the set of its potential forms S = ⟨-∞, ∞⟩;
A2. The statistical state S_STAT^(1) of order 1 is described⁷ by the gaussian density of probability
p(s|x,a) = (1/√(2πσ²)) e^(-(s-x-a)²/(2σ²));   (5.5.4)
thus,
S_STAT^(1) = {p(s|x,a); s ∈ S, x ∈ X};   (5.5.5)
A3. The information X_STAT consists of (1) information X_TYP about the type of the probability distribution, (2) information X_a about the average value, and (3) information X_σ about the variance; thus,
X_STAT = {X_TYP, X_a, X_σ};   (5.5.6)
A4. The information components X_TYP and X_σ are exact; however, X_a is not exact; therefore, the information X_STAT is nonexact information about the statistical state S_STAT^(1);
A5. The indeterminate parameter a exhibits statistical regularities; they are described by the gaussian density of probability
p_a(a) = (1/√(2πσ_a²)) e^(-(a-ā)²/(2σ_a²));   (5.5.7)
thus, the statistical state of order 2 is
S_STAT^(2) = {p_a(a); a ∈ ⟨-∞, ∞⟩};   (5.5.8)
A6. Exact information X_STAT^(2) about S_STAT^(2) is available.
Looking at Figure 5.12, we see that the state of the system relative to the available meta information is represented by the point 3.1. The statistical information about the primary state is described by the density of probability p*(s|x, X_STAT^(1), X_STAT^(2)). From the marginal probability equation (4.4.8d) we obtain
p*(s|x, X_STAT^(1), X_STAT^(2)) = ∫_{-∞}^{∞} p(s|x,a) p_a(a) da.   (5.5.9)
Substituting (5.5.4) and (5.5.7), after some algebra we get
p*(s|x, X_STAT^(1), X_STAT^(2)) = (1/√(2π(σ*)²)) e^(-(s-x-ā)²/(2(σ*)²)),   (5.5.10)
where
(σ*)² = σ² + σ_a².   (5.5.11) □
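The variance addition (5.5.11) is easy to check numerically. The sketch below is our own verification, with arbitrarily chosen parameter values and with the densities (5.5.4) and (5.5.7) as reconstructed above; it evaluates the integral (5.5.9) on a grid and compares the variance of the resulting density with σ² + σ_a².

```python
import numpy as np

x, sigma, sigma_a, a_bar = 1.0, 0.8, 0.6, 0.3   # arbitrarily chosen values

def gauss(t, mean, std):
    return np.exp(-0.5 * ((t - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

s = np.linspace(-8.0, 8.0, 1201)
a = np.linspace(-8.0, 8.0, 1201)
S, A = np.meshgrid(s, a, indexing="ij")

# Integral (5.5.9): p*(s|x, ...) = integral over a of p(s|x,a) * p_a(a).
p_star = np.trapz(gauss(S, x + A, sigma) * gauss(A, a_bar, sigma_a), a, axis=1)

mean = np.trapz(s * p_star, s)
var = np.trapz((s - mean) ** 2 * p_star, s)
print(mean, var, sigma**2 + sigma_a**2)   # mean about x + a_bar, var about sigma^2 + sigma_a^2
```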
COMMENT
The variance is an indicator of the indeterminism of the state relative to the available information. If the information X_a were exact, this variance would be σ², while it is (σ*)² when the information X_a is inexact. Thus, (σ*)² - σ² = σ_a² is an indicator of the price that we pay for not having exact information about the statistical properties of the state. A state information subsystem, like the state information subsystems described in Section 2.2, could provide such information. In particular, if the parameter a is a stable parameter, an adaptive information system described in Section 1.7.2 and shown in Figure 1.26 could decrease the variance σ_a² and thus move the point representing the status of relative indeterminism of the primary state s from the third layer in Figure 5.12 into the second layer, representing less indeterminate situations.
NOTES
1. The entropy, similarly to the statistical average, is a number assigned to the random variable. Therefore, similarly to the calculation of the average, we interpret the calculation of entropy as an operation and we denote it with a special enlarged character.
2. We used a similar argument earlier to define the joint frequency of occurrences by (4.1.13); see also footnote 1.
3. To explain that this does not contradict the causality principle, let us assume first that the frequencies of occurrences of a state and of a state occurring later are statistically related. As has been indicated in Section 1.7.2, page 63, we can use the frequencies of occurrences described in Section 4.1.2 only ex post, after completing the observation. Thus, no conflict with the causality principle arises. A statistical relationship between a state at present and a state in the future, based on probabilities, can be used only on the assumption that the frequencies of occurrences of states in the future will fluctuate around probabilities estimated in the past. However, such an assumption cannot be strictly proved a priori. Therefore, the bilateral statistical relationship between a state in the future and a present state based on probabilities has a hypothetical character and does not violate the deterministic principle of causality holding in every concrete case.
4. If we would take only one component in (5.2.21), we would obtain a Markovian process (see the next section) for which the shaping and decorrelating matrices are rather untypical. Therefore, we take two components.
5. To simplify the notation we start the numbering of the potential forms of states not with 1 but with 0.
6. The transformation T(·) may be deterministic or not. Equation (5.5.3) describes a deterministic transformation. If it is indeterministic then, as in (1.5.4), we have to introduce the indeterminate side states.
7. It can easily be seen that the assumed conditional probability distribution is the conditional probability of the state when (1) the information x = s + z, (2) the state s has a gaussian probability density, and (3) the noise z has a gaussian probability density with the mean value proportional to a. Thus, the auxiliary parameter a has the meaning of the mean value of the noise z.
REFERENCES
[5.1] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, NY, 1991.
[5.2] Breiman, L., Probability, SIAM Publications, Philadelphia, 1995.
[5.3] Abramson, N., Information Theory and Coding, McGraw-Hill, NY, 1963.
[5.4] Cover, T. M., Thomas, J. A., Elements of Information Theory, J. Wiley, NY, 1991.
[5.5] Blahut, R. E., Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1990.
[5.6] Seidler, J. A., Bounds on the Mean-Square Error and the Quality of Domain Decisions Based on Mutual Information, IEEE Trans. on Info. Theory, vol. IT-17, 1971, pp. 655-665.
[5.7] Larsen, R. J., Marx, M. L., An Introduction to Mathematical Statistics and Its Applications, 2nd ed., Prentice Hall, Englewood Cliffs, 1986.
[5.8] Behara, M., Additive and Nonadditive Measures of Entropy, J. Wiley, NY, 1990.
[5.9] Helstrom, C. W., Probability and Stochastic Processes for Engineers, MacMillan, NY, 1984.
[5.10] Parzen, E., Stochastic Processes, Holden Day, San Francisco, 1962.
[5.11] Shanmugan, K. S., Breipohl, A. M., Random Signals, J. Wiley, NY, 1988.
[5.12] Blanc-Lapierre, A., Fortet, R., Theory of Random Functions, vols. 1, 2, Gordon and Breach, NY, 1967.
[5.13] Kleinrock, L., Queueing Systems, vols. 1, 2, J. Wiley, NY, 1975.
[5.14] Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T., Numerical Recipes, Cambridge University Press, Cambridge, 1992.
[5.15] Matlab, User's Guide, The MathWorks, Inc., Natick, MA, 1991; Matlab is a registered trademark of The MathWorks, Inc., Natick, MA.
[5.16] Wolfram, S., Mathematica, Addison-Wesley, Redwood City, CA, 1991.
[5.17] Dagpunar, J., Principles of Random Variate Generation, Clarendon Press, Oxford, 1988.
[5.18] Niederreiter, H., Random Number Generation and Quasi-Monte Carlo Methods, SIAM Publications, Philadelphia, 1992.
[5.19] Gradshteyn, I. S., Ryzhik, I. M., Tables of Integrals, Series, and Products, Academic Press, NY, 1965.
[5.20] Proakis, J. G., Digital Communications, 2nd ed., McGraw-Hill, NY, 1989.
[5.21] Edwards, A. W. F., Likelihood, Cambridge University Press, Cambridge, 1972.
[5.22] Klir, G. J., Folger, T. A., Fuzzy Sets, Uncertainty, and Information, Prentice Hall, NY, 1988.
[5.23] Zimmermann, H. J., Fuzzy Set Theory, Kluwer Academic Publishers, Boston, 1991.
[5.24] Shafer, G., Pearl, J., Readings in Uncertain Reasoning, Morgan Kaufmann Publ., San Mateo, CA, 1990.
LOSSLESS COMPRESSION OF INFORMATION
This chapter begins the third part of the book, which is devoted to the analysis and synthesis of information transformations. This and the next chapter consider transformations that compress the volume of information. Compression of volume is one of the most important information transformations. We understand it intuitively as a preliminary transformation of structured information that allows us to reduce the resources necessary for subsequent processing of the information. An introduction to the problems of information compression has been presented in Section 1.5.4; see, in particular, Figure 1.22. This and the next chapter have two objectives. The first is to describe the basic transformations compressing the volume of information and to furnish a solid basis for understanding and designing data, process, and image compression systems. The second objective of the two chapters is to illustrate the general concepts introduced in the second part of the book, particularly the summarizing considerations in the closing section of Chapter 5. This chapter considers transformations that compress the volume of information so that it is possible to recover the primary information without distortions, except for some delay. For many superior systems the processing delay is irrelevant, particularly when we compress the volume of information to store it. Such a transformation is called a reversible information-compressing transformation or, in technical terminology, lossless information compression. If exact recovery is possible and the introduced delay is not taken into account, we only have to invert the transformation, and the problems of evaluation of recovery distortions and optimization of recovery rules do not arise. This greatly simplifies the analysis of lossless information compression. The central concept of this chapter is the concept of volume of information, defined as an index of the resources needed to process the information so that it can be exactly recovered. Examples of volume of information are: (a) the capacity of a storage device needed to store the information, (b) the capacity of a transmission channel needed to transmit the information, (c) the minimal computational power needed to process the information. The volume of information is not only useful in the initial stages of information-system design to estimate the costs of constructing and running the system, but it also allows us to use the available information-processing resources efficiently during the operation of the system. An example is the application of the declarations integer, real, and dimension used in most programming languages to assign the storage capacity of the computer efficiently.
Without a formal definition of the volume of information, systematic design of optimal, let alone intelligent, information-compressing systems is not possible. Not only the practical design problems justify interest in this concept. It is also important since its analysis gives much insight into the fundamental problems of information processing. Lossless compression is possible in two basic cases:
• When not every possible combination of components of structured information is a potential information; we say then that structural constraints are imposed on the information or, briefly, that the information is structurally constrained.
• When the information exhibits statistical regularities and the components, or some of their combinations, have different frequencies of occurrences; we say then that statistical constraints are imposed on the information or, briefly, that the information is statistically constrained.
The utilization of structural constraints for lossless compression of discrete information is the subject of the first two sections of this chapter. In the first section we introduce the general concepts, and in the second we use these concepts to present a systematic review of the lossless compression procedures that are most important in practice. The next two sections, devoted to lossless compression of information utilizing statistical constraints, have a similar structure. For continuous information, irreversible transformations compressing information are of greatest importance. Then the evaluation of distortions of the recovered information and the minimization of distortions become the central design problem. Therefore, only some aspects of the considerations about lossless information compression have counterparts in continuous information compression. They are discussed in the last section of this chapter. The main purpose of those considerations is to get more insight into the continuous-discrete information dilemma. Most important for applications is compression of continuous information introducing unavoidable distortions. Such compression of continuous information is the subject of the next chapter.
6.1 THE VOLUME OF DISCRETE INFORMATION AND ITS COMPRESSION
To describe, to analyze, and to compare transformations that compress information, we must introduce a sufficiently general definition of the volume of information, which can be used for discrete information, continuous information, and information exhibiting statistical regularities. This is not a straightforward task, primarily because the information-compressing subsystem is located inside the chain of subsystems performing the information processing shown in Figure 1.4a. It is preceded by the information source, and the operation of the information-compression system influences the operation of the subsequent subsystems: the fundamental information subsystem and the superior system. Therefore, a definition of the volume of information must take into account the properties of all three of those subsystems.
The information source determines the fine structure of information, the sets of potential forms of the components, the constraints on the combinations of the elementary components, and consequently the set of potential forms of the structured information. Those factors influence to a great extent the resources needed to process the information. A typical fundamental information-processing subsystem, particularly an information channel, is designed so that it can process any information having a given fine structure independently of the macro structure and, in particular, of the restrictions imposed on the potential forms that the information can take. We call such a system constraint blind. In technical terminology the ability of a system to process information having any macro structure is emphasized, and the system is called transparent. The universality of a structure-blind system makes it possible to utilize it for many purposes and consequently to keep its price low. However, if we use such a system for a specific purpose, we may not utilize all its capabilities. This causes the size of the resources needed to perform the fundamental information processing to depend not only on the properties of the information but also on the features of the fundamental information-processing subsystem. The compression of the volume of information should be such that the ultimate information delivered to a superior system is for this system as useful as the primary information. Thus, defining the volume of information we must also know which features of the information are essential for the superior system (the user). As has been discussed in Section 1.6.1, the properties of the superior system determine the indices for the accuracy of information processing. In the case of lossless compression we can, in principle, recover the primary information exactly. However, the information system may introduce some changes into the information that are relevant for the superior system. In particular, the superior system may be sensitive to delays introduced by information compression and decompression. For these reasons the definition of volume must also reflect the properties of the superior system. Summarizing, we may say that in a formal definition of the volume of information we must account for the properties of
• the concrete primary information, particularly its structure,
• the set of potential forms of the primary information and their weights,
• the fundamental information-processing subsystem, and
• the superior system.
The following subsections use these guidelines to introduce a frame of concepts on which a systematic description and analysis of discrete information-compression systems can be based.
6.1.1 THE INDICATOR OF RESOURCES NEEDED TO PROCESS STRUCTURED DISCRETE INFORMATION
We first introduce indicators of the resources needed to process concrete information. Next, indicators of the resources needed to process any potential form of the information are considered.
THE RESOURCES NEEDED TO PROCESS A CONCRETE INFORMATION
Binary information is the prototype of discrete information. We denote by u_i, i = 1, 2, the potential forms of this information, and by v(u_i) an indicator¹ of the resources needed to perform the fundamental processing (e.g., storage, transmission) of the information u_i; we call it the volume of the concrete information. In general, the resources v(u_i) depend on the processed information u_i. However, often the potential forms u_i and u_j have the same structure and we may suppose that the fundamental processing of each potential form of the discrete information requires the same resources; thus,
v(u_i) = v = const,  ∀i.
(6.1.1)
It is natural to take the resources needed to process a binary piece of information as the unit of information-processing resources and to set
v = 1.
(6.1.2)
A block u = {u(n), n = 1, 2, · · · , N} of N pieces of binary information is the prototype of structured discrete information. To define the volume of this structured information, we take into account the properties of the subsystems performing the fundamental information processing (see Figure 1.2). The two basic types of such systems are mass storage systems and communication channels. A typical storage system is a set of binary storage cells in which the binary components of the block information are stored piece by piece. To store the prototype block information, N binary storage cells are needed. The second basic type of fundamental information-processing subsystem is a transmission channel. A typical transmission channel (see Section 2.1.2) generates a set of time slots, each of which can transmit a single binary piece of information. Such a time slot is called a binary transmission channel. To transmit a block information, we need N such binary transmission channels. Thus, the resources that both fundamental information systems require to process the prototype block information are proportional to the length N of the blocks. Hence, it is justified to take this length as the indicator of the resources needed to process the block u. We write this definition in the form
v(u) = N(u),
(6.1.3)
where N(·) denotes the function assigning to a block the number of its elements. For the considered prototype block
N(u) = N.   (6.1.4)
THE RESOURCES NEEDED TO PROCESS ANY CONCRETE STRUCTURED INFORMATION
An information system has to process every potential form of information. Therefore, we are looking for an indicator characterizing the resources needed to process any potential form of information. We call it the volume of information. To indicate that this indicator characterizes not a concrete information but the set U of all potential forms of information, we denote the volume of information by V(U). The definition of V(U) should be based on the definition of the volume of a concrete information, but it should not depend on the individual features of this information. An operation transforming the set {v(u); u ∈ U} into V(U) has been called a detail-removing operation. This concept was mentioned in Section 1.6.1, page 55; its detailed analysis follows in Section 8.1. A typical example of a detail-removing operation is finding the maximum or minimum (the extreme-case criterion). When the information exhibits statistical regularities, statistical averaging is often a natural choice. The definition of volume based on statistical averaging is discussed in Sections 6.3 and 6.4. Here, using some heuristic arguments, we first define the volume V(U) of prototype sets for which the definition is plausible. Using this definition we define the volume of discrete information having any structure for two extreme types of fundamental subsystems: a system fully utilizing the structure of information and a structure-blind system. The binary set B(1) = {u_i, i = 1, 2} is the prototype of discrete sets of information. With assumptions (6.1.1) and (6.1.2) it is natural to set
V[B(1)] = 1.   (6.1.5)
The set of all blocks of N binary pieces of information (of all combinations of the N binary pieces of information) is the prototype of sets of structured information. We call this set the N-dimensional, unconstrained discrete set (briefly, the unconstrained set) and denote it as B(N). A block information belonging to this set we call unconstrained block information². We denote by L(·) the function assigning to a set the number L of its elements; thus, L(𝒜) = L for a set 𝒜. Since every combination of binary pieces of information is an element of B(N),
L[B(N)] = 2^N.   (6.1.7)
The resources necessary to perform the fundamental processing of a single block information of length N, described on the previous page, can be used directly to process another block having the same structure. Therefore, as the characteristic of the resources needed to process any block from the unconstrained set B(N), we take
V[B(N)] ≜ N.   (6.1.8)
Usually not all combinations of elementary components are potential forms of information. Such a set of potential forms is called a constrained set, and the structured information taken from such a set is called³ constrained information.
The constraints may be caused by relationships between the components of each potential form of information. However, they may also be imposed on the set as a whole. An example is the set of reserved prefix code words (see Section 2.6.4). The resources needed to process information depend also on the class of information-processing transformations. We first define an index of the minimal resources needed to process constrained information by a reversible transformation that can fully utilize both the structure of single potential forms of information and the properties of the set of all its forms. Assume first that the number of potential forms of the information can be represented in the form
L(U) = 2^N,   (6.1.9)
where N is an integer. Then
L(U) = L[B(N)],   (6.1.10)
and we can always assign to each potential form of the primary information a different block from the unconstrained set B(N), and each block obtains one partner. Such a transformation is realized by the algorithm (1.5.8) with K = 2, and the system implementing it is shown in Figure 1.15. A more detailed analysis of several fundamental information-processing subsystems (in particular, storage and communication systems) shows that:
For a given L(U) having the form (6.1.9), the resources needed to process blocks from the unconstrained set B(N) are minimal.   (6.1.11)
Therefore,
V_mi(U) = V[B(N)] = N   (6.1.12)
is an indicator of the minimal resources that are needed to process any information from the set U. We call V_mi(U) the minimal volume of information. This term, like the terms "unconstrained, constrained information", is an abbreviation, since the considered volume characterizes the set of potential forms but not a concrete information (see Note 2). In general, we cannot represent L(U) in the form (6.1.9). Then we look for the unconstrained set with the smallest dimensionality, and we define
V_mi(U) = min_m V[B(m)].   (6.1.13)
Let us denote by N the dimensionality for which the minimum in (6.1.13) is achieved. Thus, we have
V_mi(U) = V[B(N)].   (6.1.14)
To satisfy the condition that the transformation is reversible, the set B(N) must have at least as many elements as the set U. Thus, it must be
L[B(N)] ≥ L(U).
(6.1.15)
On the other hand, N is the smallest integer satisfying this condition. Taking the logarithm of both sides of (6.1.15) and using (6.1.7), we conclude that
N = ⌈log₂L(U)⌉,   (6.1.16)
where ⌈x⌉ is the smallest integer that is not smaller than x. From (6.1.12) and (6.1.16) we get
V_mi(U) = ⌈log₂L(U)⌉.   (6.1.17)
From the definition of the function ⌈x⌉ it follows that 0 ≤ ⌈x⌉ - x < 1. Therefore, for most applications log₂L(U) is a good approximation of ⌈log₂L(U)⌉. To avoid computational complications, we calculate the minimum volume from the formula
V_mi(U) = log₂L(U),
(6.1.18)
even if log₂L(U) is not an integer. Since the set U is constrained and the set B(N) is unconstrained, we may call the transformation T_cr(·) transforming information u ∈ U into a block from the set B(N) a constraint-removing transformation; hence the "cr" in the subscript. To specify this transformation, we must have exact information about the set U of all potential forms of information. However, a typical fundamental information-processing system is blind to the constraints imposed by the relationships between the components of every potential form of information and/or by the properties of the set of potential forms of the structured information, and it is unable to perform the transformation T_cr(·). To define the resources needed by such a structure-blind system, we introduce an auxiliary concept:
The set of all such combinations of elements of which the potential forms of information are built, which the fundamental information-processing subsystem does not distinguish from the really potential forms, is called the extended set and is denoted as U^(ext).   (6.1.19)
The indicator of the resources needed to perform a structure-blind information transformation of any information from the extended set we call the volume for structure-blind information transformation (briefly, the structure-blind volume) of the primary constrained set of potential forms of information, and we denote it V_sb(U). Thus,
V_sb(U) = V_mi(U^(ext)).
(6.1.20)
Since U ⊂ U^(ext), it is
V_sb(U) ≥ V_mi(U).   (6.1.21)
The cost of hiring a fundamental information-processing subsystem is usually proportional to the structure-blind volume of information. The ratio
R(U) = V_mi(U)/V_sb(U)
(6.1.22)
has the meaning of the efficiency of utilization of the resources of the structure-blind fundamental information subsystem. Therefore, we call the ratio R(U) the resources utilization indicator. As is shown in the subsequent section, this is a convenient indicator of the matching of the properties of structured information to the properties of the subsystem performing the fundamental information processing. From (6.1.21) it follows that
R(U) ≤ 1.
(6.1.23)
When R(U) = 1, a structure-blind fundamental information processing requires as many resources as the system optimally matched to the properties of the information.
The ratio
S(U) = [V_sb(U) - V_mi(U)]/V_mi(U) = [1/R(U)] - 1
(6.1.24)
has the meaning of an index of the surplus of resources of the fundamental processing subsystem needed because the information-processing transformation is structure blind. Therefore, we call this ratio the volume surplus indicator.
COMMENT
In the introductory considerations about volume, we emphasized (page 253) that, in general, the volume of information depends not only on the properties of the information but also on those of the fundamental information-processing subsystem. In our approach, we take this into account by introducing the two parameters characterizing the volume of information: (1) the minimum volume V_mi(U), characterizing the resources needed by a fundamental subsystem to process any information after the most efficient compression, and (2) the structure-blind volume V_sb(U), characterizing the resources needed by fundamental subsystems unable to exploit the structure of information. In most cases, we take as V_sb(U) the largest resources needed to process directly a potential form of the primary information. Thus, we take
V_sb(U) = max_{u∈U} v(u).
(6.1.25)
We illustrate such a procedure with a simple example.
EXAMPLE 6.1.1 VOLUME OF CONSTRAINED BLOCK INFORMATION
We assume that the set of potential forms of the information is
U = {u₁=0, u₂=100, u₃=101, u₄=1100, u₅=1110, u₆=1111, u₇=11010, u₈=11011}.
(6.1.26)
The length of a potential form of the information can take various values, and max N(u_i) = 5. Let us assume that the eventual fundamental information-processing systems have the following properties:
• They can only find out the largest length N_max of the block information;
• They process each information as if it had the length N_max; if its length is smaller, they add 0's in front of the shorter potential forms; this causes no ambiguity if all primary blocks begin with a 1;
• The transformations are blind in the sense that they cannot realize that not all blocks of length N_max are potential forms of the information.
The volume of the set U for systems blind in the defined sense is V_sb(U) = 5. The optimal constraint-removing transformation can be realized by algorithm (1.5.8). It produces the following unconstrained set:
B(3) = {v₁=000, v₂=001, v₃=010, v₄=011, v₅=100, v₆=101, v₇=110, v₈=111}.   (6.1.27)
Thus, the minimum volume V_mi(U) = 3, the resources utilization indicator R(U) = 0.6, and the volume-surplus indicator S(U) = 0.66. □
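The indicators introduced above are easy to compute. The Python sketch below (the function names are ours) reproduces the numbers of Example 6.1.1: the structure-blind volume, the minimal volume, the resources utilization indicator R, and the volume-surplus indicator S.

```python
import math

U = ["0", "100", "101", "1100", "1110", "1111", "11010", "11011"]

def structure_blind_volume(blocks):
    """V_sb: the length of the longest potential form, as in Example 6.1.1."""
    return max(len(b) for b in blocks)

def minimal_volume(blocks):
    """V_mi (6.1.17): ceil(log2 L), the length of the shortest unconstrained
    binary blocks that can represent every potential form."""
    return math.ceil(math.log2(len(blocks)))

v_sb = structure_blind_volume(U)   # 5
v_mi = minimal_volume(U)           # 3
R = v_mi / v_sb                    # resources utilization indicator (6.1.22): 0.6
S = 1 / R - 1                      # volume-surplus indicator (6.1.24): about 0.67
beta_max = v_sb / v_mi             # maximal compression indicator (6.1.33): about 1.67
print(v_sb, v_mi, R, round(S, 2), round(beta_max, 2))
```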
6.1.2 THE EFFECT OF TRANSFORMATIONS OF STRUCTURED DISCRETE INFORMATION ON ITS VOLUME
We now use the introduced concepts to define an indicator characterizing the capability of an information transformation to change the volume of information, particularly to decrease it. We denote by T(·) the transformation transforming information u ∈ U into information v ∈ V, where V is the set of potential forms of the information produced by the transformation. We assume that
the transformation is deterministic and reversible (it is a presentation transformation).   (6.1.28)
Then L(V) = L(U), and from (6.1.18) it follows that
V_mi(V) = V_mi(U).
(6.1.29)
Thus, an indicator of changes of volume based on the minimum volumes would not characterize the important transformations changing only the presentation of information. Therefore, it is natural to compare the volumes for structure-blind transformations and to characterize an information transformation by the ratio
β[T(·)] ≜ V_sb(U)/V_sb(V).
(6.1.30)
This is called the indicator of volume transformation. This indicator depends on the properties of the set V, which, in turn, depend in an essential way on the transformation T(·). We indicate this by writing β[T(·)]. If β > 1, we call the transformation an information-compressing transformation and we call β the indicator of volume compression (briefly, the compression indicator). From (6.1.29) and from the definition (6.1.22) of the resources utilization indicator, it follows that
β[T(·)] = R(V)/R(U).   (6.1.31)
Thus, the compression indicator is the ratio of the indicators of utilization of fundamental-processing resources after and before the transformation. Taking in (6.1.21) V instead of U and utilizing (6.1.31), we see that
β[T(·)] ≤ β_max(U),
(6.1.32)
where
β_max(U) ≜ V_sb(U)/V_mi(U).
(6.1.33)
The ratio β_max(U) has the meaning of an indicator of the maximal volume compression that can be obtained by a reversible transformation. The maximal compression can be achieved by the constraint-removing transformation T_cr(·) (see page 257). From (6.1.22) and (6.1.33) it follows that
β_max(U) = 1/R(U).
(6.1.34)
Substituting in (6.1.34) the numerical values obtained in Example 6.1.1, we get β_max(U) = 1.66.
COMMENT 1
The number L of potential forms of discrete information is related to two quite different characteristics of information. First, it is related to the volume of resources needed to process the information. This is the relation exploited here. However, the number L of potential forms is also related to the variety of potential forms of information, which is an important characteristic of information from the point of view of the superior system (the user of the information) (see Section 1.4.1). Equally well we may use any monotonic function of L as an indicator of the variety of discrete information. Particularly well suited for this purpose is log₂L. This indicator is called the amount of information (which the superior system obtains when a concrete information is delivered). From equation (5.1.9) it follows that in the special case when the information exhibits statistical regularities and the probability distribution is uniform, the amount of information defined here is equal to the amount of statistical information delivered by the information source (see Comment 2, page 215). Using (6.1.18), we write the definition (6.1.22) of the resources utilization indicator in the form
R(U) = log₂L(U)/V_sb(U).   (6.1.35)
Thus, if we take the point of view of the superior system, then we can interpret:
• the minimal volume V_mi(U) = log₂L as the amount of information that the superior system obtains,
• the resources utilization indicator R(U) as the density of information (the amount of information per elementary piece of information),
• the surplus index S defined by (6.1.24) as an index of redundancy.
COMMENT 2
We compress the volume of information to save the resources needed to perform the fundamental processing, particularly storage or transmission of information. However, if the fundamental information-processing system introduces inevitable distortions, particularly indeterministic distortions, it may be desirable to expand the volume and thus to use a reversible transformation with β < 1. We call such a transformation a volume-expanding transformation or, equivalently, a redundancy-introducing transformation. In general, we compress the volume of information (we reduce the redundancy of information) by removing some macro structure. Similarly, we can expand the volume by building in a macro structure. A typical example is the error-correcting coding described in Section 2.1.2. We may also expand the volume to help a superior system with limited resources to utilize the information. The type of the built-in macro structure must match the properties of the subsystem performing the subsequent information processing. The surplus of the volume of primary information is caused by the macro structure that is determined by the source of the information. This macro structure usually does not match the mentioned properties of the forthcoming information processing. Therefore, we often compress the volume of information by first removing the redundant macro structure and next expanding the volume by building in a macro structure that is useful for the subsequent information processing. Such a procedure is the consequence of the structure
blindness of the transformations matching the structure of information to the properties of the subsequent fundamental processing. If those transformations were sensitive to the primary redundant macro structure, its preliminary removal would not be necessary. Volume compression, thus redundancy removal, plays an important role in cryptography, since the primary structure of information can be utilized by an unauthorized observer (not possessing the key) to break the cryptogram. Therefore, compressing the volume is an important measure for increasing the protection against unauthorized access to encrypted information.
6.2 EXAMPLES OF LOSSLESS COMPRESSION OF TRAINS OF STRUCTURED INFORMATION
Very important from a practical point of view are universal compression systems that operate efficiently for a wide class of types of primary working information without extensive meta information about the properties of the working information. The vast majority of lossless information-compression systems employed in practice are modifications of the three basic systems described in this section. The rules of operation of those systems are such that they compress the volume of each concrete information. To estimate the values of the corresponding indicators characterizing the whole set of potential forms of information, we have to specify the properties of those sets. We do this in the next section.
6.2.1 COMPRESSION OF THE TRAIN OF BLOCKS 1: THE POTENTIAL FORMS OF BLOCKS ARE KNOWN
We assume:
A1. The hierarchically higher block has the form
U_tr = {u(i), i = 1, 2, · · · , I},
(6.2.1a)
where
u(i) = {u(i, n), n = 1, 2, · · · , N}, i = 1, 2, · · · , I,
(6.2.1b)
is a hierarchically lower block. The elementary components u(i, n) are binary. A lower-ranking block u(i) is called here a "block", and the higher-ranking block U_tr is called a "train" (hence the notation). From (6.2.1b) it follows that all blocks have the same length;
A2. The set of potential forms of a block does not depend on the number of the block; therefore, we denote such a set briefly as U;
A3. The potential forms of the blocks are known; we denote them as u_l, l = 1, 2, · · · , L; thus, U = {u_l, l = 1, 2, · · · , L};
A4. We transform the train U_tr by transforming the blocks separately; we denote by T(·) the transformation transforming a primary block u(i) into the transformed block v(i) = T[u(i)], and by V_tr = {v(i), i = 1, 2, · · · , I} the transformed train;
262
Chapter 6 Lossless Compression of Information A5. The rule of transforming a block does not depend on a block's position in the train; A6. The transformed block v(/) is a block of binary pieces of information, the set of potential forms of transformed block is {v,, / = 1 , 2, • • • , L}, the transformed blocks v^ may have different lengths; A7. The transformed block determines exactly the primary block; A8. The resources needed to process a train are the sum of resources needed to process its components V(V,)=Ev[v(/)];
(6.2.2)
A9. As the indicator of resources needed to process a block, we take its length V[v(/)]=N[ii(0], (6.2.3) where N(a) is the length of a block a (see (6.1.4) From A8 and A9 it follows that V(V„) = E N [ V ( 0 ] .
(6.4.4)
/-I
This notation is illustrated in Figure 6.1
IIIMIIINI v(l)
^<
v(2)
v(3) v(4) -•-N[v(3)]-^ • N
Figure 6.1. Illustration of the notation: 7=4. We denote by M(ui) the number of occurrences of a the potential form u^ in the train U^. The ratio j^. . P\U,)^—J!.
(6.2.5)
is the frequency of occurrences of the block u_l (see Section 4.1.1). We denote by P'(v_l) the similarly defined frequency of occurrences of the transformed block v_l in the transformed train V_tr. In view of assumptions A4 and A7, we have
P'(v_l) = P'(u_l).
(6.2.6)
Grouping in the sum (6.2.4) the blocks having the same form and using (6.2.5), we get
V(V_tr) = I Σ_{l=1}^{L} N(v_l) P'(u_l).   (6.2.7)
For a train U_tr, the frequencies of occurrences P'(u_l) are fixed. However, by changing the potential forms v_l of the transformed blocks and the rule of assigning a transformed block to a primary block, we can change the lengths N(v_l) and influence the volume V(V_tr). We are interested in the optimal choice minimizing the volume. In the notation for optimization problems introduced in Section 1.6.2, the optimization problem is
OP {V, T_cd(·)}, V(V_tr), C_V, C_T,
(6.2.8)
where V is the set of potential forms of the transformed information, T_cd(·) is the code, and C_V (respectively, C_T) denotes the constraints imposed on the potential forms of the transformed blocks (respectively, on the codes). If all transformed blocks had the same length, we could not achieve any compression. Therefore, looking for a set of transformed blocks that would minimize the volume of the transformed train, we must assume that the transformed blocks have different lengths. Then, however, the separation problem discussed in Section 2.6 arises; thus, as constraint C_V we take the requirement that the train of secondary blocks is separable. We can satisfy this constraint either by taking as blocks the code words of a reserved prefix code or by adding explicit separation information (a comma, length information). To get some insight into the problem of finding the optimal transformed blocks, we assume first that the set V of potential forms of the transformed blocks is given. Then we face the OP T_cd(·), V(V_tr) problem, which reduces to the following problem: there are given two sets {a_l, l = 1, 2, · · · , L} and {b_m, m = 1, 2, · · · , L} of nonnegative numbers; we have to find such permutations {a_l(k), k = 1, 2, · · · , L} and {b_m(k), k = 1, 2, · · · , L} of those numbers that the sum Σ_k a_l(k) b_m(k) is minimal. The solution of this optimization problem is simple: we arrange the numbers a_l in descending order a_l(1) ≥ a_l(2) ≥ · · · ≥ a_l(L) and the numbers b_m in ascending order b_m(1) ≤ b_m(2) ≤ · · · ≤ b_m(L). Accordingly, we arrange the frequencies of occurrences in descending order P'(u_l(1)) ≥ P'(u_l(2)) ≥ · · · ≥ P'(u_l(L)) and the given forms of the secondary blocks in ascending order of their lengths N(v_m(1)) ≤ N(v_m(2)) ≤ · · · ≤ N(v_m(L)), and we assign
u_l(k) → v_m(k).
(6.2.10)
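The assignment rule (6.2.10) is straightforward to implement. The Python sketch below (names are ours) uses the frequency values of Example 6.2.1 and the code word set of Example 6.1.1: it sorts the primary blocks by decreasing frequency, sorts the available code words by increasing length, pairs them, and reports the resulting average length of a transformed block.

```python
# Frequencies of occurrence of the primary blocks, as in (6.2.11).
freq = {"u1": 5/48, "u2": 25/48, "u3": 3/48, "u4": 3/48,
        "u5": 5/48, "u6": 1/48, "u7": 3/48, "u8": 3/48}

# Reserved prefix code words assumed to be given (the set of Example 6.1.1).
codewords = ["0", "100", "101", "1100", "1110", "1111", "11010", "11011"]

# Rule (6.2.10): the most frequent primary block gets the shortest code word.
by_freq = sorted(freq, key=freq.get, reverse=True)   # descending P'(u_l)
by_len = sorted(codewords, key=len)                  # ascending N(v_m)
code = dict(zip(by_freq, by_len))

avg_len = sum(freq[u] * len(code[u]) for u in freq)
print(code)
print(f"average length of a transformed block: {avg_len:.2f}")   # about 2.3
```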
EXAMPLE 6.2.1 RUNNING OF THE OPTIMAL COMPRESSION ALGORITHM
Let us take concrete values. We assume that the length of the primary block is N = 3, the length of the train is I = 48, and the frequencies of occurrences are
P'(u₁)=5/48, P'(u₂)=25/48, P'(u₃)=3/48, P'(u₄)=3/48,
P'(u₅)=5/48, P'(u₆)=1/48, P'(u₇)=3/48, P'(u₈)=3/48.   (6.2.11)
To ensure the separability of the transformed train, as the set of given forms of the transformed blocks we take the set (2.6.6) of reserved prefix code words. For this set we have
N(v₁)=1, N(v₂)=3, N(v₃)=3, N(v₄)=4,
N(v₅)=4, N(v₆)=4, N(v₇)=5, N(v₈)=5.   (6.2.12)
The descending ordering of the frequencies of occurrences is l(1)=2, l(2)=1, l(3)=5, l(4)=3, l(5)=4, l(6)=7, l(7)=8, l(8)=6, and the ascending ordering of the code word lengths is m(1)=1, m(2)=2, m(3)=3, m(4)=4, m(5)=5, m(6)=6, m(7)=7, m(8)=8. The optimum code is
u₁   u₂   u₃   u₄   u₅   u₆   u₇   u₈
v₂   v₁   v₄   v₅   v₃   v₈   v₆   v₇
Substituting the code words listed in table (2.6.6), we finally get the set of optimally transformed blocks
V = {v₁=0, v₂=100, v₃=101, v₄=1100, v₅=1110, v₆=1111, v₇=11010, v₈=11011}.
(6.2.14)
From (6.2.2) we get V(U_tr) = 3·48. From (6.2.7), after using (6.2.11) and (6.2.12), we obtain V(V_tr) = 2.3·48. The ratio
β_ind(U_tr) = V(U_tr)/V(V_tr)   (6.2.15)
characterizes the compression of a concrete train. We call it the individual compression indicator. It is the counterpart of the compression coefficient β defined by (4.1.30), which characterizes the compression ability of a transformation from the point of view of all potential forms of the primary information. For the considered train we get
β_ind(U_tr) = 3/2.3 = 1.31.   (6.2.16)
The individual compression indicator depends on the set of reserved prefix code words. In the example, we just used a set without explaining how we got it. We would like to have an optimal set allowing the smallest possible volume of the compressed train. There exists a simple and elegant algorithm devised by Huffman, which simultaneously generates the optimal set of reserved prefix code words and implements the optimal coding described by (6.2.10). Assume that an optimal set of reserved prefix code words has been found and that the code words are arranged in order of decreasing frequencies of occurrences of the corresponding primary words. Then the last two code words must have the same length. We prove this by contradiction.
Suppose that the last two code words (the code words corresponding to the words with the smallest and the second-smallest frequencies of occurrences) have different lengths. We denote by v_sh the shorter and by v_lo the longer code word, and by v'_sh the prefix of v_sh obtained by rejecting its last binary piece of information (assume, for example, that this piece is 1). Since the code is a reserved prefix code, the prefix v'_sh is not a code word. Also, the train v'_sh 0 is not a code word. Therefore, we can take it instead of the code word v_lo. The train v'_sh 0 has the same length as v_sh and thus is shorter than the code word v_lo, so the assumed code was not optimal. This proves our thesis and suggests the following algorithm:
S1. Arrange the primary words in a sequence with non-increasing frequencies of occurrences in the block;
S2. Call the pair of the last two primary words the aggregated word;
S3. Identify each of the two primary words by a binary identifier;
S4. Call the sum of the frequencies of occurrences of the two primary words the frequency of occurrence of the aggregated word;
S5. Consider the first L-1 primary words and the aggregated word as new primary words; go to S1;    (6.2.17)
S6. End the procedure when the set of the modified primary words contains only one element; call it the totally aggregated word;
S7. The code word assigned to a primary word is the train of identifiers of the aggregated words read on the path from the totally aggregated word to the considered word.
This algorithm is called the Huffman algorithm. It is illustrated by the following concrete example.
EXAMPLE 6.2.2 RUNNING OF THE HUFFMAN ALGORITHM
We assume again that the frequencies of occurrences of the primary words in the block are given by (6.2.11). The implementation of the optimization procedure is illustrated in Figure 6.2.
Figure 6.2. Illustration of the implementation of the Huffman algorithm (6.2.17). The short bars indicate the primary words; the short bars with a cross indicate the aggregated words. The number at the left end of a bar is the frequency of occurrences of the word (primary or aggregated) multiplied by 48 (the number of occurrences). The number at the right end of a bar is the binary identifier of the word when it is aggregated. The thicker lines join the bars corresponding to an aggregated pair.
Take, for example, the primary word u4. On the path from the top node to the leaf node representing u4 we pass the sequence 1, 1, 0, 0 of branching identifiers. Thus, the optimal transformed block is v4 = 1100. The remaining code words we read from Figure 6.2 in the same way. Their set is the set V given by (6.2.14).
As a second example we take blocks with equal frequencies of occurrences P*(u_l) = 1/8, l = 1, 2, ..., 8. In the first four operations four aggregated words with weights 1/4 are produced. The next two operations produce two aggregated words with weights 1/2. The compressed code words are blocks of three binary digits. Thus, the Huffman algorithm produces the same code words as algorithm (1.5.8) with K=2. □
COMMENT 1
The Huffman algorithm produces the set of optimal reserved prefix code words and minimizes the length of the train of blocks; both these functions are coupled. Therefore, the Huffman algorithm is not suitable if we do not require that the separation of secondary blocks be achieved by reserved prefix coding, and separation techniques using explicit separation information are used instead. Rule (6.2.10), however, remains useful then. An example of such compression is arithmetic coding.
COMMENT 2
The described compression system is an example of a system with learning cycles described in Section 1.7.2, pages 59-60. Although the Huffman algorithm produces single transformed blocks, its task is to minimize the volume of the whole train. The code words produced by the Huffman algorithm are determined by the set
P* = {P*(u_l), l = 1, 2, ..., L}    (6.2.18)
of frequencies of occurrences of the primary blocks. However, those frequencies depend on the whole train of blocks. Therefore, we may interpret them as auxiliary information about the train of blocks, which is relevant for the optimization of compression block by block. The block diagram of the system realizing the described information processing is shown in Figure 6.3. The time relationships between the primary and transformed trains are shown in Figure 6.1. The system shown in Figure 6.3 is a special case of the learning system shown in Figure 1.27.
Figure 6.3. The optimal compression of a train of blocks using the Huffman optimization algorithm.
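To make the construction (6.2.17) concrete, here is a minimal Python sketch that builds a Huffman code from the frequencies (6.2.11); the symbol labels u1, ..., u8 and the function name huffman_code are ours. Ties between equal frequencies may be broken differently than in Figure 6.2, but the resulting code word lengths and the average length per block agree with (6.2.12) and (6.2.16).

```python
import heapq

def huffman_code(freqs):
    """Build a binary Huffman code from a dict {symbol: frequency}.

    Repeatedly aggregates the two least frequent (possibly already aggregated)
    words and labels the two members of each pair with the binary identifiers
    0 and 1, as in algorithm (6.2.17)."""
    # Each heap entry: (frequency, tie_breaker, {symbol: partial code word}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # least frequent word
        f2, _, c2 = heapq.heappop(heap)      # second least frequent word
        # Prepend the binary identifier of each member of the aggregated pair.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Frequencies of occurrences from (6.2.11), multiplied by 48.
counts = {"u1": 5, "u2": 25, "u3": 3, "u4": 3, "u5": 5, "u6": 1, "u7": 3, "u8": 3}
code = huffman_code(counts)
avg_len = sum(counts[s] * len(code[s]) for s in counts) / 48
print(code)      # code words; the lengths form the multiset {1,3,3,4,4,4,5,5} of (6.2.12)
print(avg_len)   # 2.3125, i.e. approximately 2.3 bits per block, as in (6.2.16)
```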
To get the needed set P* of frequencies of occurrences, we must first observe the whole train of words. This means that the optimal compression of the train introduces a substantial delay, as illustrated in Figure 6.1. This is a special case of the general principle of trading the quality of decisions for the delay needed to take them. Because the aim of the Huffman algorithm is to minimize the length of the whole train, some code words may be longer than the primary coded blocks. This is illustrated by the codebook (6.2.13) using the code words given by (6.2.14).
STATISTICALLY OPTIMAL HUFFMAN CODE
To this point we made no assumptions about statistical regularities in the train. Now we assume that the train can be considered as an observation of the random process
U_tr = {U(i), i = 1, 2, ..., I},    (6.2.19a)
where
U(i) = {U(i, n), n = 1, 2, ..., N}, i = 1, 2, ..., I    (6.2.19b)
is the random variable representing the ith primary block. We assume also that those random variables have the same probability distribution and that they are statistically independent, and we denote
P(u_l) = P[U(i) = u_l], l = 1, 2, ..., L.    (6.2.20)
The code produced by the Huffman algorithm using these probabilities is called the statistically optimal Huffman code. In the next section we consider indicators of performance of systems compressing trains that exhibit statistical regularities. We show that the statistical average volume of the compressed train
V̄(V_tr) = E v(V_tr),    (6.2.21)
where V_tr denotes the random process representing the compressed train, is often the suitable performance indicator. Using equations (6.2.2), (6.2.3), and (4.4.9) we get
V̄(V_tr) = I Σ_{l=1}^{L} N(v_l) P(u_l).    (6.2.22)
This equation is the same as equation (6.2.7) with P*(u_l) = P(u_l). Thus,
The statistically optimal Huffman coding minimizes the statistical average of the volume of the compressed train.    (6.2.23)
Let us look at the relationships between the statistically optimal Huffman compression and the compression of a given train discussed previously. From the fundamental property of long trains discussed in Section 4.6 it follows that with probability close to 1 a long train (I >> 1) is a typical train. Thus, the approximation
P*(u_l) ≈ P(u_l)    (6.2.24)
is accurate. Since the codebook changes in steps when the frequencies of occurrences change continuously, we can expect that a long typical primary train is transformed by the statistically optimal Huffman code into an almost shortest secondary train. The transformation of a non-typical train is lossless, but the transformed train may be longer than the train produced by the Huffman algorithm based on the actual frequencies of occurrences.
The statistically optimal Huffman coding of a primary block can be performed immediately after the block arrives; we need not wait till the whole train is observed. However, to profit from this feature the set of probabilities
P = {P(u_l), l = 1, 2, ..., L}    (6.2.25)
must be available. We can get an estimate of P from an observation of a previous train. Thus, the system using the statistically optimal Huffman coding should operate as an intelligent system estimating the state parameters, as discussed in Section 1.7.2, with the set of probabilities P playing the role of the unknown state parameters. In the training cycle, the Huffman algorithm based on frequencies of occurrences is applied and the compressed train is produced after the primary train has ended. Then the frequencies of occurrences are stored and used as probabilities to code currently the arriving blocks of the subsequent trains. If the statistical properties of those trains change, then the estimate of the set of probabilities P must be updated. Thus, training and working cycles must be interleaved as shown in Figure 1.25.
6.2.2 ARITHMETIC CODING
Huffman coding is block-oriented coding. It transforms the primary block as a whole into a whole compressed block, independently of other blocks. However, all components of a primary train are often interrelated, and dividing the train into blocks is an artificial operation. To exploit all the existing relationships, we would have to segment the primary train into possibly large blocks, and Huffman coding of such blocks may be tedious. We consider here arithmetic coding, which is string-oriented: it builds the compressed block successively as the elements of the primary block arrive. To simplify the argument we describe the counterpart of the previously presented statistically optimal Huffman coding. Thus, we assume that the components of the trains exhibit statistical regularities and that the probabilities of the components are known. We denote by
U(1) = {u_l, l = 1, 2, ..., L} the set of potential forms of a component of the primary block;
u = {u(n), n = 1, 2, ..., N}, u(n) ∈ U(1) ∀n, a primary block;
P(u) = P(U = u) the probability of the block u.
The arithmetic coding algorithm is
S1. The primary block u is transformed into an auxiliary interval J(u) ⊂ <0, 1>; the length of the interval is |J(u)| = P(u);
S2. A family of sets W(m), m = 1, 2, ..., W(m-1) ⊂ W(m), of equidistant reference points, with the distance between the points decreasing with growing m, is generated;    (6.2.26)
S3. A reference point w*(u) from a set W(M), with a possibly small M, lying inside the interval J(u), is found;
S4. As the compressed block v(u) an identifier of the reference point w*(u) is taken.
This algorithm is illustrated in Figure 6.4.
Figure 6.4. Illustration of the general description of arithmetic coding: (a) the block diagram of the coding algorithm, (b) illustration of coding; x denotes a reference point.
Let us describe in more detail the subrule assigning the interval J(u) to a primary block u (step S1). We assume that
A1. The random variables U(n), n = 1, 2, ..., N, representing the components of the primary block are statistically independent and have the same probability distribution;
A2. The primary blocks are separated by a comma; we denote it by u_c.
On assumption A1 the probability of a block (including the comma at the end) is
P(u) = Π_{n=1}^{N} P_{l(n)},    (6.2.27)
where u = {u_{l(1)}, u_{l(2)}, ..., u_{l(N)}} is a primary block and P_l = P[U(n) = u_l] is the probability of the potential form u_l. We partition the interval J(0) = <0, 1> into L subintervals J[1; l(1)], l(1) = 1, 2, ..., L, of length
|J[1; l(1)]| = |J(0)| P_{l(1)} = P_{l(1)}.    (6.2.28)
An example of such a partition is shown in the line n=1 of Figure 6.5.
Figure 6.5. Illustration of the definition of the subintervals J[m; l(1), l(2), ..., l(m)], L=3, P_1=0.6, P_2=0.3, P_3=0.1.
Next, in the same way, we partition each interval J[1; l(1)] into the subintervals J[2; l(1), l(2)], l(2) = 1, 2, ..., L, of length
|J[2; l(1), l(2)]| = |J[1; l(1)]| P_{l(2)}.    (6.2.29)
From (6.2.28) and (6.2.29) we get
|J[2; l(1), l(2)]| = P_{l(1)} P_{l(2)}.    (6.2.30)
Iterating this procedure we introduce the partition of rank n:
|J[n; l(1), l(2), ..., l(n)]| = |J[n-1; l(1), l(2), ..., l(n-1)]| P_{l(n)} = Π_{m=1}^{n} P_{l(m)}.    (6.2.31)
From (6.2.27) and (6.2.31) it follows that
|J(u)| = |J[N; l(1), l(2), ..., l(N)]| = P(u).    (6.2.32)
The introduced notation is illustrated in Figure 6.5.
We next describe in more detail the generation of the sets of reference points (step S2 of algorithm (6.2.26)). We denote by
V(1) = {0, 1, 2, ..., K-1} the set of integers; they play the role of elementary pieces of a compressed block but also of digits in a counting system with basis 1/K;
V(m) the set of all blocks v = {v(1), v(2), ..., v(m)}, v(n) ∈ V(1), n = 1, 2, ..., m.
We next associate with a block v ∈ V(m) the number
w(v) = VAL v,    (6.2.33a)
where
VAL v = Σ_{n=1}^{m} v(n) (1/K)^n.    (6.2.33b)
Since the representation on the right side is unique, we can retrieve v when w(v) is available. We have
v = STR w(v),    (6.2.34)
where STR is the transformation defined by (1.5.7).
When we interpret the elements v(n) as digits, the block v has the meaning of a fraction w ∈ <0, 1> in the number system with basis 1/K. Let us take, for example, K=3, m=3, v = {1, 1, 2}. Then w = 1/3 + 1/9 + 2/27 = 14/27. We call w(v) a reference point. The set
W(m) = {w(v): v ∈ V(m)}    (6.2.35)
is called the set of reference points of order m. An example of such a set is shown in Figure 6.4b. From the definition of w(v) it follows that
W(m-1) ⊂ W(m).    (6.2.36)
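The number representation (6.2.33)-(6.2.34) is easy to check numerically. Here is a minimal sketch; val and digits_of are our names, and digits_of stands in for the STR transformation (1.5.7), whose exact definition is not reproduced here.

```python
from fractions import Fraction

def val(v, K):
    """VAL transformation (6.2.33b): digit block -> fraction in <0, 1>."""
    return sum(Fraction(d, K ** (n + 1)) for n, d in enumerate(v))

def digits_of(w, K, m):
    """Recover the first m base-(1/K) digits of a fraction w (an STR-like inverse)."""
    out = []
    for _ in range(m):
        w *= K
        d = int(w)        # the integer part is the next digit
        out.append(d)
        w -= d
    return out

w = val([1, 1, 2], 3)
print(w)                     # 14/27, as in the worked example above
print(digits_of(w, 3, 3))    # [1, 1, 2]
```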
To identify the interval J(u) we look for a reference point lying inside the interval J(u) and belonging to a reference set of a possibly low order M. A sufficient condition for finding a reference point inside an interval of length |J(u)|, located somewhere in the interval <0, 1>, is that the distance between successive reference points from the set W(M) is smaller than the length |J(u)|. Since the distance between successive points from the set W(M) is (1/K)^M, the condition is
(1/K)^M < |J(u)|,    (6.2.37a)
or, equivalently,
M > (1/log K)[-log |J(u)|],    (6.2.37b)
where log denotes the logarithm with an arbitrary basis. Since we are looking for a possibly small M, we take
M = ⌈(1/log K)[-log |J(u)|]⌉,    (6.2.38)
where ⌈a⌉ denotes the smallest integer not smaller than a. From (6.2.32) we have
M = ⌈(1/log K)[-log P(u)]⌉.    (6.2.39)
Thus, the length of the train of digits identifying the interval is a non-increasing function of the probability of the primary block, and the general rule of optimization (6.2.10) is satisfied. Using the estimate (6.2.39) it can be shown that for sufficiently long blocks arithmetic coding produces a train with minimal statistical volume, which is discussed in Section 6.5.2.
From (6.2.31) it follows that the length of the interval J[n; l(1), l(2), ..., l(n)] decreases with growing n. Therefore, when we calculate for n = 1, 2, ... the end points of the intervals J[n; l(1), l(2), ..., l(n)], at some n_1 the first digit(s) representing the end points become for the first time the same and do not change for n > n_1. Those digits are also the initial digits of the final identifier. Thus,
We obtain the elements (the digits) of the identifier v(u) by adding successively new digits whenever the initial digit(s) in the representations of the end points of the interval J[n; l(1), l(2), ..., l(n)] become the same.    (6.2.40)
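The following sketch, under assumptions A1-A2 above, traces steps S1-S4 of (6.2.26) for the probabilities used in Figure 6.5 and binary reference points (K = 2); the function names encode_interval and identify are ours, and the incremental digit emission of (6.2.40) is not implemented.

```python
from fractions import Fraction
import math

def encode_interval(block, probs):
    """Step S1: map a block of symbol indices to its interval J(u).

    probs[l] is P_l; the cumulative sums define the nested partition
    (6.2.28)-(6.2.31). Returns (left end, width) with width = P(u)."""
    cum = [Fraction(0)]
    for p in probs:
        cum.append(cum[-1] + Fraction(p))
    low, width = Fraction(0), Fraction(1)
    for l in block:
        low += width * cum[l]
        width *= Fraction(probs[l])
    return low, width

def identify(low, width, K=2):
    """Steps S2-S4: smallest M with (1/K)**M < |J(u)|, then the M base-K
    digits of a reference point of W(M) lying in J(u)."""
    M = 1
    while Fraction(1, K ** M) >= width:
        M += 1
    point = Fraction(math.ceil(low * K ** M), K ** M)   # grid point inside J(u)
    digits, w = [], point
    for _ in range(M):
        w *= K
        d = int(w)
        digits.append(d)
        w -= d
    return digits

# L = 3 symbols with the probabilities of Figure 6.5 (assumed example data).
probs = [Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)]
low, width = encode_interval([0, 1, 0], probs)   # block u = {u_1, u_2, u_1}
print(low, width)                 # 9/25 and 27/250, i.e. the interval <0.36, 0.468>
print(identify(low, width, K=2))  # [0, 1, 1, 0]: four binary digits identify J(u)
```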
From (6.2.33b) and (6.2.36) (see also Figure 6.5) it follows that
An identifier v[J(u)] of the primary block u = {u_{l(1)}, u_{l(2)}, ..., u_{l(N)}} is also an identifier (although not the shortest one) of every interval J[n; l(1), l(2), ..., l(n)], n < N.    (6.2.41)
6.2.3 COMPRESSION OF A TRAIN OF BLOCKS 2: THE POTENTIAL FORMS OF BLOCKS ARE NOT KNOWN
The Huffman algorithm can also be applied if the lengths of the potential forms of the words are different. It is also feasible when we have a long train of binary pieces of information and we want to avoid the technical complications of coding it as a whole. We can then segment the train into blocks of equal length N and treat them as words. We call such a procedure external segmentation. Since the length N is a design parameter, the problem of its best choice arises. We address this problem in Section 6.5.3. However, essential for the application of the basic Huffman algorithm and of its modifications is the knowledge of the potential forms of the words. Without it we cannot determine the frequencies of occurrences. The natural way to compress structured information when we know that it has a macrostructure but do not know the potential forms of the macro components (in particular, the words) is to
Identify the potential forms of the macro components of the train currently;
use some volume-compressing coding for the identifier of the identified macro component.    (6.2.42)
A rudimentary form of such compression is run-length coding. It is suitable when blocks of a repeating specific element are embedded in a train of binary pieces of information and the number of repetitions can vary. A typical specific element is 0. The previously mentioned general rule then takes the following form:
1. Search the train of arriving elementary pieces of information for a block of 0's;
2. When such a block is identified, replace the block by information about its length;    (6.2.43)
3. Add suitable auxiliary information to separate the information about the length from the primary elementary pieces of information.
As separation information we may use a comma block described in Section 2.6. The separation information increases the volume of the transformed train. Therefore, it is not favorable to process blocks of 0's shorter than the separation information, and only blocks of 0's longer than a minimal length N_min(0) are processed. The choice of the minimal length N_min(0) depends on the statistical properties of the lengths of blocks of 0's. For algorithms and examples of coding see Held [6.1].
A class of universal and efficient algorithms based on the general principle (6.2.42) are the Ziv-Lempel algorithms and their modifications. Those algorithms are particularly well suited for the compression of text in natural or computer languages. Such texts have the following properties:
• They have a hierarchic macro structure: the first-ranking components are the characters, the second-ranking components are strings of characters frequently occurring in words (in particular, the syllables), the third-ranking components are the words (in the usual sense), and still higher-ranking components are some typical sequences of words;
• With the exception of the words, the other macro components are not segmented;
• Many macro components can be considered as modifications of a prototype string, created by adding some new components at the end of the prototype.
The Ziv-Lempel algorithms utilize the first property to generate successively, as new elements of the train arrive, the strings of characters that are considered as reference macro components (briefly, reference patterns). The third property is used to organize the set of already chosen reference patterns (called a dictionary), so that testing whether a newly arriving string is already in the dictionary is simplified. The general scheme of the compression algorithms is as follows (a sketch of one concrete variant is given after the list):
S1. If the newly arriving string of the train is not a reference pattern already contained in the dictionary, then
S1.1. it is labelled as a reference pattern and put in the dictionary,
S1.2. it is placed in the compressed train unchanged;
S2. If the arriving segment of the train is a reference pattern, it is replaced by an identifier of this pattern;
S3. A reference pattern that does not occur frequently in the time interval <t_c - T, t_c>, where t_c is the current instant and T is the activity reviewing period, is removed from the dictionary.
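As announced above, here is a minimal LZ78-style sketch of the general scheme (one member of the Ziv-Lempel family). The function names and the (dictionary index, next character) form of the identifier are our choices, and step S3 (removal of inactive reference patterns) is omitted for brevity.

```python
def lz78_encode(text):
    """Emit (dictionary index, next character) pairs, growing the dictionary
    of reference patterns as the train arrives; index 0 is the empty pattern."""
    dictionary = {"": 0}
    out, pattern = [], ""
    for ch in text:
        if pattern + ch in dictionary:      # the extended string is still a known pattern
            pattern += ch
        else:                               # new string: identifier of known prefix + last char
            out.append((dictionary[pattern], ch))
            dictionary[pattern + ch] = len(dictionary)
            pattern = ""
    if pattern:
        out.append((dictionary[pattern], ""))
    return out

def lz78_decode(pairs):
    """Rebuild the dictionary from the compressed train and recover the text."""
    patterns, text = [""], []
    for index, ch in pairs:
        entry = patterns[index] + ch
        patterns.append(entry)
        text.append(entry)
    return "".join(text)

coded = lz78_encode("abababababa")
print(coded)                 # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, '')]
print(lz78_decode(coded))    # "abababababa"
```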
Figure 6.6. The basic structure of Ziv-Lempel algorithms for compression of trains with a priori unknown macrostructure components.
The diagram of an information-compressing system implementing the algorithm is shown in Figure 6.6. Let us comment on the steps of the algorithm. The dictionary usually has a linear structure and hence may be called a string dictionary. It is built successively in such a way that at the decompression site a replica of the dictionary can be gradually created from the available compressed train. The typical identifiers of a reference pattern mentioned in S2 are (a) the information about the position of the first character of the pattern in the string dictionary and (b) the information about the length of the pattern (see Section 2.6). Step S3 is necessary because new patterns are continually added and an unlimited growth of the dictionary must be avoided. Moreover, the removal of patterns that are not real macro components but were tentatively treated as such by the algorithm is advantageous.
Let us look again at the algorithm from a broader perspective. Compared with the Huffman algorithm, the Ziv-Lempel algorithms do not require knowledge of the potential forms of the macro components; they generate by themselves a presumable list of such components. For the coding of a new string of characters the algorithm utilizes the detailed information about the previously observed strings that is built into the dictionary. The volume of this auxiliary information is very large compared with the volume of the information P* about frequencies of occurrences used by the Huffman algorithm. However, the compressed train carries all the information needed to create the dictionary needed for decompression.
Our description of the algorithms is only sketchy. In each step several details must be specified, and there is a great variety of options. A more detailed description of the algorithm can be found in Storer [5.4]; programs implementing it are given in Nelson [6.6]. Detailed topics are discussed in the conference proceedings Storer and Reif [6.7], Storer and Cohn [6.8], [6.9], and [6.10].
6.3 REAL-TIME COMPRESSION OF A TRAIN OF BLOCKS WITH IDLE PAUSES
In the two previous examples we considered the elementary pieces of information as atomic elements (in fact, as numbers). However, every information is presented as a physical state of some objects and in consequence has a multilevel hierarchical structure. This structure has a substantial influence on the choice of the definition of the volume of information and subsequently of the information-compressing transformation. This section illustrates how to utilize the time structure of information presented in a dynamic form (see Section 1.1). We also use this case to provide more insight into the fundamental concepts introduced in the previous section. The concepts introduced here are also utilized in the subsequent statistical analysis of real-time compression systems.
6.3.1 THE MODEL OF THE PRIMARY INFORMATION
We consider the system shown in Figure 6.7a.
Figure 6.7. Matching of a train of blocks interleaved with idle pauses to a rhythmically operating channel: (a) the block diagram of the system, (b) the time structure of the train at the output of the local channel.
We assume that:
(1) A channel, usually a long-range channel forwarding the working information to its destination, is given. This channel plays the role of the fundamental information-processing system and is therefore called the fundamental channel.
(2) The primary information is a train of blocks described by (6.2.1).
(3) The primary information does not come directly from the primary information source (e.g., a computer terminal) but is delivered to the input of the fundamental channel by a local channel.
(4) The process at the output of the local channel carrying the primary information is a train of blocks divided by idle pauses, as shown in Figures 1.10 and 6.7b.
We assume also that the local channel operates rhythmically. By this we mean that the time axis is divided into elementary time slots of equal duration; we denote it by Δ_u. The number of slots generated during a time unit,
C_u = 1/Δ_u,    (6.3.1)
is called the channel capacity.
The process available at the output of the local channel has a multilevel time structure. The elementary component is a binary piece of information that is carried by an elementary process (pulse) that fits into a time slot. This elementary process usually has some fine structure (see Section 2.1.1, in particular Figure 2.2). The binary pieces of information are assembled in blocks. We denote by u[i, (t)], t ∈ <τ(i), τ(i)+D(i)>, i = 1, 2, ..., I, the block of elementary processes carrying the components of the ith block of primary binary pieces of information. We assume that
A1. The instants τ(i) of block arrivals and their durations D(i) are random.
The corresponding process is called a time-structured train and is denoted U_tr(·). On the assumption that the duration of a pulse is almost equal to the slot duration Δ_u, a typical time-structured train is shown in Figure 6.7b. The primary characteristics of such a train are
D[U_tr(·)] - the duration of the train,
N_s[U_tr(·)] - the number of slots of the fundamental channel needed to transmit the whole train (with idle pauses),
N_tot[U_tr(·)] - the total number of binary pieces of information contained in all blocks of the train.
The obvious relationships between those parameters are
N_s[U_tr(·)] = D[U_tr(·)]/Δ_u,    (6.3.2)
N_tot[U_tr(·)] ≤ N_s[U_tr(·)].    (6.3.3)
The ratio
R[U_tr(·)] = N_tot[U_tr(·)]/D[U_tr(·)]    (6.3.4)
is called the rate of delivery of the "pure" information. From (6.3.1) to (6.3.3) it follows that
R[U_tr(·)] ≤ C_u.    (6.3.5)
Till this point we considered the local channel. We assume that the fundamental channel also operates rhythmically. We denote by
Δ_v the duration of a time slot of the fundamental channel and by
C_v = 1/Δ_v    (6.3.6)
the capacity of the channel. We assume that the fundamental channel has the following properties:
F1. The fundamental channel is unable to distinguish whether a slot is idle or whether a pulse carrying a binary piece of information is located in the slot;
F2. The total number of reserved time slots is the indicator of the resources needed to process an information train.
In view of property F2, if we put the train delivered by the local channel directly into the fundamental channel, we have to pay for the idle pauses. To diminish the usually high costs of hiring the fundamental channel, we look for an information-compressing transformation that would decrease, and in the best case eliminate, the pauses. As such a transformation we take buffering. It was briefly described in Section 2.6.5 (see, in particular, Figure 2.23), and its formal description was presented in Section 3.3.1 (see Figure 3.12). We now show how the previously introduced indicators of information compression can be used to describe and analyze buffering.
6.3.2 THE VOLUME OF A TRAIN OF BLOCKS INTERLEAVED BY IDLE PAUSES
We show now how the concepts introduced in Section 6.1 can be used to define the indicator of the compression capability of buffering. We first define the volume of the primary and of the transformed train of information. We assume first that the primary train delivered by the local channel is put directly, without preliminary processing, into the fundamental channel. Thus,
V_tr(·) = U_tr(·).    (6.3.7)
The primary channel can be directly connected with the fundamental channel when both channels operate in the same way. In particular, the durations of the slots in both channels must be the same:
Δ_v = Δ_u.    (6.3.8)
Based on assumption F1 we must reserve the fundamental channel for the whole duration of the time-structured train U_tr(·). In view of assumption F2 we define the structure-blind volume of a train
V_sb[U_tr(·)] = N_s[U_tr(·)].    (6.3.9)
Using (6.3.2) we get
V_sb[U_tr(·)] = D[U_tr(·)]/Δ_u.    (6.3.10)
Taking into account equation (6.3.1), we write this in the form
C_u = V_sb[U_tr(·)]/D[U_tr(·)].    (6.3.11a)
Thus,
The capacity of the fundamental channel needed to transmit the information directly has the meaning of the structure-blind volume of the primary information train per time unit.    (6.3.11b)
We next consider the minimum volume of the train on the assumption that the information-compressing transformation is reversible. To find it we must take into account the properties of the superior system. We assume that
S1. The total number N_tot[U_tr(·)] of elementary pieces of information is the indicator of the usefulness of the information for the superior system;
S2. The idle spaces between the blocks, particularly their durations, are irrelevant for the superior system;
S3. The delay between the instant at which the packet delivered by the local channel starts and the instant when the feeding of the transformed packet into the fundamental channel starts is the indicator of performance of the compressing subsystem.
From definition (6.3.2) and assumption F2, page 277, it follows that the volume of a transformed train V_tr(·) is minimal when all slots contain pulses carrying binary pieces of information. Therefore,
V_mi[U_tr(·)] = N_tot[U_tr(·)]    (6.3.12)
has the meaning of the minimum volume of the train U_tr(·). From (6.3.2) and (6.3.4) it follows that
R[U_tr(·)] = V_mi[U_tr(·)]/D[U_tr(·)].    (6.3.13a)
Thus,
The rate of pure information delivery has the meaning of the minimum volume of information per time unit.    (6.3.13b)
For the considered concrete train the ratio
r_ch[U_tr(·)] = V_mi[U_tr(·)]/V_sb[U_tr(·)]    (6.3.14)
has the meaning of an indicator of utilization of the resources of the fundamental channel needed to process the concrete train U_tr(·) directly (the counterpart of the resources utilization indicator characterizing the set of all potential forms of information defined by (6.1.22)). Putting (6.3.9) and (6.3.12) in (6.3.14), we get the indicator of resources utilization
r_ch[U_tr(·)] = N_tot[U_tr(·)]/N_s[U_tr(·)] = R[U_tr(·)]/C_u.    (6.3.15)
The maximum value 1 is achieved when R[U_tr(·)] = C_u, that is, when there are no pauses between the blocks. From the central part of formula (6.3.15) we see that
r_ch[U_tr(·)] = p,    (6.3.16)
where p is the duty ratio defined by (1.3.7). The counterpart of the volume surplus indicator defined by (6.1.24) is the ratio
s_ch[U_tr(·)] = {V_sb[U_tr(·)] - V_mi[U_tr(·)]}/V_mi[U_tr(·)].    (6.3.17)
It has the meaning of the relative surplus of the capacity of the actually used channel above the minimum capacity needed. Using (6.3.10) and (6.3.11) we get
s_ch[U_tr(·)] = {C_u - R[U_tr(·)]}/R[U_tr(·)].    (6.3.18)
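A small numeric illustration of the indicators (6.3.9)-(6.3.18) for a hypothetical train (all parameter values below are assumed):

```python
# Indicators of a concrete train with idle pauses (assumed example data).
delta_u = 1e-3                              # slot duration of the local channel [s]
C_u = 1 / delta_u                           # channel capacity (6.3.1), slots per second
blocks = [400, 250, 600]                    # binary pieces carried by each block
pauses = [300, 500, 150]                    # idle slots following each block
D = (sum(blocks) + sum(pauses)) * delta_u   # duration of the train
V_sb = D / delta_u                          # structure-blind volume (6.3.9), (6.3.2)
V_mi = sum(blocks)                          # minimum volume (6.3.12)
R = V_mi / D                                # rate of delivery of pure information (6.3.4)
r_ch = V_mi / V_sb                          # resources utilization indicator (6.3.15)
s_ch = (C_u - R) / R                        # resources surplus indicator (6.3.18)
print(r_ch, s_ch)                           # 0.568... and 0.76; r_ch equals the duty ratio
```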
From (6.3.15) it follows that R[U_tr(·)] has the meaning of the minimum channel capacity if lossless processing is to be possible.
6.3.3 THE COMPRESSION INDICATOR
Till this point we assumed that the local channel feeds the primary time-structured process directly into the fundamental channel. We assume now that before feeding the train U_tr(·) into the fundamental channel we compress its volume by buffering. A general description of buffering was given in Section 2.6.5, and in Section 3.3.1 it is formally described. To ensure stable operation of the system when several trains are processed, we assume that the durations of the primary and of the compressed time-structured train are almost equal (up to the transient states at the beginning and the end of the time-structured primary train U_tr(·)). Thus, we assume
D[V_tr(·)] = D[U_tr(·)],    (6.3.19)
where V_tr(·) = T_bu[U_tr(·)] is the transformed train and T_bu(·) denotes the transformation performed by the buffering system. A typical example of such a transformation is described in Section 3.3.1, in particular by relationship (3.3.5). The condition that buffering is a reversible transformation implies that all arriving binary pieces of information are carried by pulses of the transformed train. Thus,
N_tot[V_tr(·)] = N_tot[U_tr(·)].    (6.3.20)
Since we make the same assumptions about the fundamental channel as about the primary one, we can use all the previously derived equations for the transformed train, only substituting U_tr → V_tr. In particular, from definition (6.3.4) we get the rate of delivery of pure information by the secondary channel:
R[V_tr(·)] = N_tot[V_tr(·)]/D[V_tr(·)].    (6.3.21)
Comparing this with the primary definition (6.3.4) and taking into account assumptions (6.3.19) and (6.3.20), we see that
R[V_tr(·)] = R[U_tr(·)].    (6.3.22)
Substituting U_tr → V_tr in (6.3.15) and using (6.3.20), we get the indicator of utilization of the capacity of the fundamental channel carrying the compressed train:
r_ch[V_tr(·)] = R[U_tr(·)]/C_v.    (6.3.23)
For a train the counterpart of the compression index defined by (6.1.30) is the ratio
β[U_tr(·)] = V_sb[U_tr(·)]/V_sb[V_tr(·)].    (6.3.24)
From (6.3.15) and (6.3.13) it follows that, similarly to (6.1.31), we have
β[U_tr(·)] = r_ch[V_tr(·)]/r_ch[U_tr(·)].    (6.3.25)
From (6.3.15) and (6.3.22) we get
β[U_tr(·)] = C_u/C_v.    (6.3.26)
Thus,
Buffering allows a lossless transmission of information through a fundamental channel with a smaller capacity than the capacity of a structure-blind channel delivering the primary train with idle pauses.    (6.3.27)
However, we cannot arbitrarily decrease the capacity of the fundamental channel, because from (6.3.5) it follows that it must be
C_v ≥ R[U_tr(·)].    (6.3.28)
From (6.3.22) and (6.3.26) it follows that the maximum compression ratio is
β_max[U_tr(·)] = C_u/R[U_tr(·)].    (6.3.29)
This maximum compression ratio is achieved when C_v = R[U_tr(·)]. Then r_ch[V_tr(·)] = 1, and the transformed train V_tr(·) is fully compressed and has no pauses. From the description of the buffering system in Section 3.3.1 it follows that this is possible only when at least one block is always waiting in the buffer. We show in Section 6.4.1 that to achieve such a compression on average the capacity of the buffer memory has to be infinite, and the average delay introduced by buffering is infinite too. Therefore, lossless information-compressing systems can achieve only a compression ratio that is smaller than β_max[U_tr(·)].
6.4 COMPRESSION OF INFORMATION EXHIBITING STATISTICAL REGULARITIES
In Section 6.2.1 we concluded (see (6.2.23)) that Huffman coding using probabilities minimizes the statistical average length of the compressed train. The reason for the improvement is that the statistical regularities impose statistical constraints on the elements of structured information, and the statistically optimal Huffman algorithm utilizes those regularities optimally. Conclusion (6.2.23) can also be derived by using the relationships between the frequencies of occurrences and the probabilities and the resulting relationships between arithmetical and statistical averages.
In Comment 2, page 266, we noticed that if we consider a single block, the Huffman transformation is not optimal at all; it can even expand the lengths of some blocks. We profit from the statistically optimal Huffman coding only if we process a sufficiently long train of blocks. Our considerations in Section 4.6 permit us to look at this effect from a broader perspective. Taking a statistical average as the performance indicator and using a transformation optimal in the sense of such a criterion, we can expect the performance of the system to improve only if we perform the transformation sufficiently many times, so that the statistical regularities can manifest themselves. This effect has a general character and occurs for all information transformations optimized in the sense of an indicator that is the statistical average of a primary indicator of performance in a concrete situation. The analysis of the specific character of statistical indicators of the resources needed to process information and of the peculiarities of systems optimal in the sense of statistical criteria is the central topic of this section. The general considerations in this section are illustrated with specific examples in the next section.
6.4.1 THE BASIC CONCEPTS
From the introductory remarks it follows that statistical indicators of volume are meaningful if we consider processing of information that is a train of structured components. Therefore, we consider here a model similar to that of Section 6.2.1, and we assume that the information is a train of blocks
U_tr = {u(i), i = 1, 2, ..., I}.    (6.4.1)
We also make assumptions A2 and A3, page 261 (the set of potential forms of each block is the same, and the potential forms are known). From the description of the Huffman algorithm in Section 6.2.1 it is evident that if we take into account the statistical regularities of the blocks, it may be convenient to use blocks of various lengths. Therefore, we generalize assumption A1 and assume
A1'. The potential forms of a block may have various lengths; thus, the potential form of a block is
u_l = {u_l(n), n = 1, 2, ..., N(l)}, l = 1, 2, ..., L,    (6.4.2)
where the u_l(n), ∀l, n, are binary pieces of information. From the definition (6.1.4) of the operator N it follows that
N(u_l) = N(l).    (6.4.3)
Next we assume
A4'. The elements of blocks exhibit joint statistical regularities; thus, the ith block can be considered as a realization of the discrete random variable U(i), i = 1, 2, ..., I, and the train as a realization of the multidimensional random variable (process) U_tr = {U(i), i = 1, 2, ..., I};
A5'. All random variables U(i), i = 1, 2, ..., I, have the same statistical properties and are statistically independent.
In view of A5', the random variables are described by the probabilities
P_l = P[U(i) = u_l], l = 1, 2, ..., L.    (6.4.4)
As in Section 6.2.1 (definition (6.2.2)) we take the length of a train as the indicator of the resources needed to process the train. However, our argument could be applied to any other definition of volume that (similarly to (6.2.2)) can be represented as the sum of the volumes of the component blocks. We denote
N_tr = N(U_tr), N(i) = N[u(i)];    (6.4.5)
then
N_tr = Σ_{i=1}^{I} N(i).    (6.4.6)
In view of assumptions A4' and A5' we can consider the lengths of the train and of the blocks as realizations of the random variables N_tr = N(U_tr) (respectively, N(i) = N[U(i)]). From (6.4.6) it follows that
N_tr = Σ_{i=1}^{I} N(i),    (6.4.7)
and from A5' it follows that the variables N(i), ∀i, are statistically independent. This implies that
N̄_tr = Σ_{i=1}^{I} E N(i) = I N̄(1),    (6.4.8a)
σ²(N_tr) = Σ_{i=1}^{I} σ²[N(i)] = I σ²(1),    (6.4.8b)
where
N̄_tr = E N_tr, N̄(1) = E N(1), σ²(1) = E[N(1) - E N(1)]².    (6.4.8c)
We emphasized earlier that an indicator of the resources needed to process any potential train is of primary importance. To define such an indicator we could invoke our argumentation on pages 255 to 258. Since we assume that the blocks exhibit statistical regularities, we may take a rough description of the random variable N_tr. In particular, it seems natural to take the statistical average N̄_tr as an indicator of the resources needed to process any train. However, such an indicator has some peculiarities making it different from the indicators considered in Section 6.1. We now discuss in more detail the specific features of the statistical average used as a performance indicator.
6.4.2 THE EFFECT OF OVERFLOW
The probability that the length of a concrete train will surpass the average N̄_tr is usually large. If the probability of fluctuation around the average is symmetrical, which usually happens, this probability is 0.5. Thus, if we reserved processing resources corresponding to the average N̄_tr, then with a high probability we would not have enough resources to process the train; for example, the available channel or storage capacity would be too small. We say that in such a situation an overflow occurs.
The natural counteraction is to reserve bigger resources, capable of processing correctly trains of length N̄_tr + Δ, where Δ is a safety margin. Then the probability of overflow is
P_ov = P(N_tr > N̄_tr + Δ).    (6.4.9)
To get some insight into this probability we introduce the normalized variable
n_tr = (N_tr - N̄_tr)/σ(N_tr).    (6.4.10)
Obviously,
σ(n_tr) = 1.    (6.4.11)
From (6.4.8b) we get
n_tr = (N_tr - N̄_tr)/(√I σ(1)).    (6.4.12)
From (6.4.12) it follows that the event N_tr > N̄_tr + Δ is equivalent to the event
n_tr > Δ/(√I σ(1)).    (6.4.13a)
We write this in the form
n_tr > √I δ/σ(1),    (6.4.13b)
where
δ = Δ/I    (6.4.14)
has the meaning of the safety margin per one block. Thus,
P_ov = P(n_tr > √I δ/σ(1)).    (6.4.15)
We consider the safety margin per block δ as fixed. Then the normalized overflow threshold grows when the number I of blocks the train consists of grows. However, in view of (6.4.11), the variance of the normalized length of the train n_tr is always 1. Thus, the threshold moves further away from the mean value, and the probability of surpassing it decreases, as illustrated in Figure 6.8.
Figure 6.8. Illustration of the effect of reducing the overflow probability by increasing the number I of blocks the train consists of; √I δ/σ(1) - margin of resources over the average resources per one block; p_c(n) - the continuous envelope of the distribution defined by (4.2.9); n_tr is considered as a continuous variable.
From our observations it follows that
Increasing the number I of blocks the train consists of but keeping the safety margin δ per one block fixed, we can make the overflow probability P_ov as small as we desire.    (6.4.16)
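Conclusion (6.4.16) can be checked by a simple Monte Carlo experiment; in the sketch below the block-length distribution and the safety margin per block are assumed values.

```python
import random

# Assumed statistics: a block has 1, 3, 4 or 5 binary pieces with the given
# probabilities; the safety margin per block is fixed at delta = 0.3.
lengths, probs = [1, 3, 4, 5], [0.5, 0.2, 0.2, 0.1]
mean = sum(n * p for n, p in zip(lengths, probs))    # E N(1) = 2.4
delta = 0.3                                          # safety margin per block

def overflow_probability(I, trials=5000):
    """Estimate P_ov = P(N_tr > I*(mean + delta)) for a train of I blocks."""
    threshold = I * (mean + delta)
    hits = 0
    for _ in range(trials):
        n_tr = sum(random.choices(lengths, probs, k=I))   # realization of N_tr
        hits += n_tr > threshold
    return hits / trials

random.seed(1)
for I in (1, 10, 100, 1000):
    print(I, overflow_probability(I))   # P_ov shrinks as I grows, per (6.4.16)
```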
COMMENT 1
To draw advantage from the statistical regularities of the information train, the number I of blocks the train consists of must be large, and we must process the train as a whole. Only then can the statistical regularities manifest themselves, and a random fluctuation of a block length above the average can be compensated by another random fluctuation below the average. This happens, however, only with the probability of no overflow, which is 1 - P_ov. Although we can make this probability as close to 1 as we wish, it is never exactly 1. This means that in practice a rejection can occur. To alleviate its consequences some protective mechanism may be included. It is typical to send the rejected train back to the subsystem that generated it and to shift onto this subsystem the responsibility of trying to process the rejected information train again.
COMMENT 2
The average value E N has the meaning of the value around which the observations of a random variable N fluctuate, while the standard deviation σ(N) has the meaning of the average deviation around the mean value (see Section 4.4.3, page 192). Behind our basic conclusion (6.4.16) are this interpretation and equations (6.4.8) and (6.4.9). It follows from them that the average value N̄_tr of the needed resources grows linearly with I, while its fluctuation around the average grows only with √I. The reason is that when we add several independent random variables, their up and down fluctuations around the mean value partially compensate, and with growing I the relative deviation, related to the mean value, decreases to zero at the rate 1/√I. Consequently, the random length of the train can be approximated more and more accurately by a constant, the average.
This effect can also be explained in terms of the fundamental property of long trains discussed in Section 4.6. The resources needed to process a concrete train are given by (6.1.8). From (4.6.11) it follows that for sufficiently long typical trains the frequencies of occurrences of the potential forms of blocks P*(u_l) are close to their probabilities P_l given by (4.6.10a). But then the sum on the right side of formula (6.2.7) is close to the statistical average N̄_tr. Thus, we can expect that for a sufficiently long train, with probability close to 1, the length of the train is almost constant and equal to the statistical average. Therefore, to process a typical train we need resources of the order of magnitude of the statistical average N̄_tr.
In most cases the probability distribution of the components of information is not uniform and/or statistical relationships between the components exist. From conclusions (5.1.9) and (5.1.18) it follows that then the entropy per element H, defined by (4.6.18), is smaller than the maximal entropy. From (4.6.15) and its generalization formulated on page 204 it follows next that the number of typical trains is smaller than the number of all potential trains. The difference may be large if the nonuniformity of the distribution or the statistical relationships are "strong". Then the resources required to process a typical train are substantially smaller than the resources needed to process an arbitrary train. However, in general, the probability of occurrence of an untypical train, although small, is not zero. Therefore, if we try to profit from the statistical regularities and reserve only resources of the order of magnitude of the statistical average, inevitably situations will occur in which not enough resources are available for processing an untypical train, and an overflow occurs.
COMMENT 3
Processing a train consisting of blocks of various lengths, and thus requiring various processing resources, can be interpreted as a special case of the problem of resource sharing. The simple approach is to divide the available resources into chunks and to assign a chunk separately to each user. We call this fixed resource assignment. Since the resources needed to process a transformed train depend on its length, the needed resources may differ from block to block. To ensure correct processing of every block we would have to reserve the resources needed to process the most demanding (longest) block. Then, however, for most blocks the reserved resources would not be used, and the utilization of the whole pool of resources would become the more inefficient the larger the fluctuation of the needed resources around the average value. The only way to save resources would be not to process exactly the most demanding blocks and thus to tolerate overflow.
The other possibility is not to split the resources but to assign to a block as many resources as are really needed. Such a procedure is called flexible resource sharing. We achieve it by transmitting or storing the train as a whole. Processing the train as a whole we do not avoid the difficulties with processing blocks of various lengths: to process every block exactly we would have to reserve the resources needed to process the longest block, which would be the same amount as in the case of fixed assignment of resources sufficient to process exactly every block separately. The basic difference between fixed and flexible resource sharing lies in the different statistical properties of the length of a single block and of a train of blocks. Our argument leading to the final conclusion (6.4.16) has quite a general character: the relative fluctuation of a sum of random variables around its mean value, compared with the relative fluctuations of the components of the sum, becomes smaller as the sum has more components. The consequence is that for a given probability of overflow the required total surplus of resources over the average of the needed resources becomes smaller when the number of components grows. This effect is called in resource-sharing theory the economy-of-scale effect.
6.4.3 THE INDICATORS OF STATISTICAL COMPRESSION
From our previous considerations it follows that, with the discussed restrictions, the average
V̄(U_tr) = E v(U_tr)    (6.4.17)
is a suitable indicator of the resources needed to process any train. We call it the statistical volume of trains. Since we do not specify whether the resources are used efficiently, the statistical volume is a counterpart of the structure-blind volume defined by (6.1.20). To calculate the statistical volume of trains we must take into account the statistical regularities. The counterpart of the minimum volume introduced in Section 6.1.1 is the minimum statistical volume
V̄_mi(U_tr) = min_{T_uv(·)} V̄[T_uv(U_tr)],    (6.4.18)
where min is the operation of searching for the minimum value over the class of all reversible transformations T_uv(·).
The statistical counterparts of the resources utilization indicator defined by (6.1.22) and of the resources surplus indicator defined by (6.1.24) are the indicator of statistical resources utilization
R̄_ch(U_tr) = V̄_mi(U_tr)/V̄_sb(U_tr)    (6.4.19)
and the indicator of statistical resources surplus
S̄_ch(U_tr) = [V̄_sb(U_tr) - V̄_mi(U_tr)]/V̄_mi(U_tr).    (6.4.20)
The structure-blind volume V̄_sb(U_tr) occurring in these definitions is often deterministic and has the meaning of the total available capacity of the structure-blind fundamental information system. From the definitions it follows that
S̄_ch(U_tr) = [1/R̄_ch(U_tr)] - 1.    (6.4.21)
Generalizing (6.1.30), we define the efficiency of volume transformation of information exhibiting statistical regularities (briefly, the efficiency of statistical volume transformation):
β̄[T(·)] = V̄_sb(U_tr)/V̄(V_tr),    (6.4.22)
where
V_tr = T(U_tr).    (6.4.23)
If β̄[T(·)] > 1, we say that the transformation T(·) compresses statistically the volume of information. Similarly to (6.1.32), it is
β̄[T(·)] ≤ β̄_max(U_tr),    (6.4.24)
where the maximum compression ratio is
β̄_max(U_tr) = V̄_sb(U_tr)/V̄_mi(U_tr) = 1/R̄_ch(U_tr).    (6.4.25)
Concrete examples of those concepts are given in the next section.
COMMENT 1
If the primary information exhibits statistical regularities and deterministic relationships exist between the components of the information, then those relationships manifest themselves in the probabilities of occurrences of the components. Thus, the deterministic relationships are "automatically" taken into account when we minimize the statistical volume. Therefore, the minimal statistical volume based on the exact description of the statistical properties of information is not larger than the minimal volume taking into account only deterministic constraints. The difference V_mi(U_tr) - V̄_mi(U_tr) is an indicator of the gains that we may achieve when we fully utilize the exact description of the statistical properties of information. These conditions are essential: if we utilized only a rough description of the statistical properties of information or applied a statistically non-optimal transformation, the statistical volume might not be smaller than the minimal volume taking into account only structural constraints.
COMMENT 2
In our presentation of the statistical approach to the concept of the volume of information and of information compression we emphasize analogies with the nonstatistical approach of the previous section. We do so because, in spite of its formal elegance, the application of the statistical approach is often not justified in practice. The reason is that only in a few cases are the statistical properties of information known and usable, and the existence of statistical regularities is usually not tested. We elaborate here on the statistical approach not only because it is applicable in some cases but also because it is the only indeterministic model of information that allows us to derive a great body of closed-form relationships giving insight into general properties of information.
6.5 EXAMPLES OF COMPRESSION OF INFORMATION EXHIBITING STATISTICAL REGULARITIES
This section has two purposes. It describes and analyzes typical (and most important for applications) lossless compression systems utilizing statistical regularities of structured information, and it uses those systems to illustrate the general considerations of the previous section.
6.5.1 COMPRESSION OF TRAINS OF BLOCKS SEPARATED BY IDLE PAUSES
We introduce the statistical model of the processes in the buffering system described in Section 3.3.1, calculate the statistical performance indicators discussed in Section 6.4.3 that are based on the indicators introduced in Section 6.3, and derive the relationships between the statistical indicators.
The statistical properties of the primary train are described by the statistical properties of the starting instants τ(i) of the blocks and of the durations D(i) of the corresponding processed ith blocks u[i, (·)] (see Section 6.3.1). We assume that the train is a Poisson-exponential train (described in Section 5.2.1). Such a train is described by two parameters: the intensity λ_u of the process of the births of blocks and the intensity μ_u of the ending of a block. By substituting the probability density (5.2.8) in definition (4.4.10) we get the average duration of a primary block (delivered by the local channel):
D̄_bu = 1/μ_u.    (6.5.1)
The process V_tr(t) generated by the buffer and put into the fundamental channel (see Figure 6.7) can also be considered as a birth-death process. We denote by λ_v and μ_v the birth and death rates of the secondary blocks. We again make assumption A7 from page 262. Thus, we assume that
Buffering is a reversible transformation.    (6.5.2)
To satisfy this assumption, the buffer capacity C_buf (denoted Q in Section 3.4.1) must be infinitely large. Thus,
C_buf → ∞.    (6.5.3)
Based on this assumption, every primary block, after some waiting in the buffer, will be put into the fundamental channel. Therefore,
λ_v = λ_u.    (6.5.4)
The duration of a secondary block is proportional to the duration of the primary block. Therefore, the duration and the death rate of the transformed block are related by a relationship similar to (6.5.1). Thus, we have
D̄_bv = 1/μ_v.    (6.5.5)
From (6.3.26) it follows that
D̄_bv = (C_u/C_v) D̄_bu.    (6.5.6)
From conclusion (6.3.27) it follows that to achieve compression the capacity of the fundamental channel must be smaller than that of the local channel. Therefore, we consider the capacity C_v of the fundamental channel, and in view of (6.5.6) the duration D̄_bv, as parameters. To avoid complications with the description of the process of putting the blocks delivered by the local channel into the buffering system shown in Figure 6.7a, we assume that the average duration of the arriving blocks is negligibly small, so that the time of transferring a block into the buffering system can be neglected. Often the capacity of the local channel is large; then such an assumption is satisfied. However, we assume that the average duration of a block in the fundamental channel is large in the sense that
D̄_bv >> Δ_v.    (6.5.7)
Then we may assume that a block stored in the buffer is transferred to the server immediately after the processing of the previous block is completed.
To derive the relationships between the parameters λ_v, μ_v and the statistical parameters characterizing information compression, we consider a long time interval of duration T. During this time interval on average λ_v T primary blocks arrive. Each block contains on average D̄_bv/Δ_v = D̄_bv C_v binary pieces of information. On the assumption (6.5.2) all blocks are put into the fundamental channel. Therefore,
N̄_tot(T) = λ_v T D̄_bv C_v    (6.5.8)
is the average total number of binary pieces of information that arrive in time T at the fundamental channel. Averaging definition (6.3.4) and taking V_tr instead of U_tr, we define
R̄_v = N̄_tot(T)/T    (6.5.9)
and call it the average rate of delivery of "pure" information into the fundamental channel. From (6.5.8) we get
R̄_v = λ_v D̄_bv C_v.    (6.5.10)
Substituting in definition (6.4.19) T C_v for V̄_sb(U_tr) and R̄_v T for V̄_mi(U_tr), we obtain the indicator of statistical utilization of the fundamental channel's capacity
R̄_ch = R̄_v/C_v.    (6.5.11)
From (6.5.10) and (6.5.6) it follows that
R̄_ch = ρ_v,    (6.5.12)
where
ρ_v = λ_v/μ_v.    (6.5.13)
Let us denote by S̄_ch the statistical surplus of resources (capacity) in the fundamental channel given by (6.4.21). Substituting (6.5.11) in this equation we get
S̄_ch = (C_v/R̄_v) - 1.    (6.5.14)
For the superior system the delay caused by buffering is important. Therefore, as an indicator of the quality of the buffering system we take the statistical average
$\bar D_{buf}=\mathrm{E}\,D_{buf}\{u[i,(\cdot)]\}$, (6.5.15)
where $D_{buf}\{u[i,(\cdot)]\}$ is the delay between the instant of arrival of a primary block $u[i,(\cdot)]$ and the instant of the start of the corresponding transformed block $v[i,(\cdot)]$. It is evident that the average delay caused by buffering is related to the number of blocks waiting in the buffer. Therefore, to calculate the average delay we look at the properties of the number of blocks stored in the buffer. We denote by
$N_{buf}$ the number of blocks stored in the buffer,
$N_{sys}$ the number of blocks in the system (the blocks stored in the buffer plus the block being fed into the channel, i.e., processed by the server), so that $N_{sys}=N_{buf}+1$,
$\mathbb N_{buf}$, $\mathbb N_{sys}$ the corresponding random variables,
$C_{buf}$ the capacity of the buffer (the maximal number of blocks that can be stored in the buffer).
Using the state transition diagram shown in Figure 3.13, we can calculate the probability distribution of $\mathbb N_{sys}$ and its average, and we obtain (see, e.g., Seidler [6.11], Gallager [6.12]):
$\bar N_{sys}=\dfrac{\rho_v}{1-\rho_v}$. (6.5.16)
The theorem of Little (see Seidler [6.11], Gallager [6.12]) establishes, on very general assumptions, a relationship between the average time $\bar D_{sys}$ spent by a block in the information-compressing system, the intensity $\lambda_v$ of arrivals of blocks, and the average number $\bar N_{sys}$ of blocks in the system:
$\lambda_v\bar D_{sys}=\bar N_{sys}$. (6.5.17)
The average duration of the stay of a block in the buffer is
$\bar D_{buf}=\bar D_{sys}-D_{bv}$. (6.5.18)
Substituting (6.5.16) and (6.5.17) in this equation and using (6.5.13) we get
$\bar D_{buf}=D_{bv}\dfrac{\rho_v}{1-\rho_v}$. (6.5.19)
Using (6.5.12) we write (6.5.19) in the form
$\bar D_{buf}=D_{bv}\dfrac{R_{ch}}{1-R_{ch}}$, (6.5.20)
or, using the statistical surplus of capacity given by (6.5.14), in the form
$\bar D_{buf}=D_{bv}\dfrac{1}{S_{ch}}$. (6.5.21)
Using (6.5.16), (6.5.11), and (6.5.12) we express the indicator of statistical resources utilization $R_{ch}$ in terms of the average number of blocks staying in the system:
$R_{ch}=\dfrac{\bar N_{sys}}{1+\bar N_{sys}}$. (6.5.22)
Equations (6.5.19) to (6.5.22) provide the relationships we were looking for. The diagrams of these relationships are shown in Figure 6.9.
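The relationships (6.5.16) to (6.5.22) are easy to evaluate numerically. The sketch below is a minimal illustration, assuming the M/M/1-type model used above (Poisson arrivals, exponentially distributed block durations); the function and variable names are chosen only for this illustration.

```python
# A minimal numerical sketch of relationships (6.5.16)-(6.5.22), assuming the
# M/M/1-type model used in the text: Poisson arrivals of intensity lam and
# block durations with mean D_bv, so that the service rate is mu = 1/D_bv.

def buffering_indicators(lam, D_bv):
    rho = lam * D_bv                  # rho_v = lam_v / mu_v, eq. (6.5.13)
    if rho >= 1.0:
        raise ValueError("the queue is unstable for rho_v >= 1")
    N_sys = rho / (1.0 - rho)         # average number of blocks in the system, (6.5.16)
    D_sys = N_sys / lam               # Little's theorem, (6.5.17)
    D_buf = D_sys - D_bv              # average stay in the buffer, (6.5.18)
    R_ch = rho                        # channel capacity utilization, (6.5.12)
    S_ch = 1.0 / rho - 1.0            # statistical surplus of capacity, (6.5.14)
    return D_buf, R_ch, S_ch, N_sys

if __name__ == "__main__":
    D_bv = 1.0                        # normalize delays by the block duration
    for R_ch_target in (0.2, 0.5, 0.8, 0.95):
        lam = R_ch_target / D_bv
        D_buf, R_ch, S_ch, N_sys = buffering_indicators(lam, D_bv)
        print(f"R_ch={R_ch:4.2f}  D_buf/D_bv={D_buf/D_bv:7.2f}  "
              f"S_ch={S_ch:5.2f}  N_sys={N_sys:6.2f}")
```

Running the sketch reproduces the trend of Figure 6.9a: the normalized delay grows without bound as the utilization indicator approaches 1.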
Figure 6.9. The relationships between the statistical indicators of performance of real-time compression by buffering; a, b, c: the buffer capacity $C_{buf}=\infty$; d: finite buffer capacity. (a) Dependence (6.5.20) of the normalized time of stay $\bar D_{buf}/D_{bv}$ of a block in the buffer on the channel capacity utilization indicator $R_{ch}$; (b) dependence (6.5.21) of $\bar D_{buf}/D_{bv}$ on the statistical capacity surplus $S_{ch}$; (c) dependence (6.5.22) of the channel capacity utilization indicator $R_{ch}$ on the average number $\bar N_{sys}$ of blocks in the system; (d) dependence of the normalized time of stay $\bar D_{buf}/D_{bv}$ (continuous lines) and of the channel capacity utilization indicator $R_{ch}$ (dashed lines) on the normalized intensity $R_\lambda$ (given by (6.5.27)) of blocks delivered by the local channel and on the buffer capacity $C_{buf}$ (based on Seidler [6.11]).
On assumption (6.5.2) (equivalently, assumption $C_{buf}\to\infty$) every block delivered by the local channel goes to the buffer and after some delay is fed into the fundamental channel. For finite $C_{buf}$ it is possible that a new block arrives when the buffer is already full. Then overflow occurs and (6.5.4) does not hold. Let us denote by $P_{ov}$ the probability of overflow. It can be shown (see, e.g., Seidler [6.11, ch. 7]) that
$P_{ov}=\dfrac{(1-\rho_v)\,\rho_v^{\,C_{buf}+1}}{1-\rho_v^{\,C_{buf}+2}}$. (6.5.23)
The intensity of overflow (of blocks delivered by the local channel that cannot be admitted into the buffer) is
$\lambda_{ov}=P_{ov}\lambda_u$. (6.5.24)
The blocks admitted into the buffer are, after some delay, fed into the fundamental channel. Thus,
$\lambda_v=(1-P_{ov})\lambda_u$. (6.5.25)
Taking this intensity we obtain from (6.5.10) and (6.5.11) the indicator of utilization of the capacity of the fundamental channel
$R_{ch}=\bar R_v/C=(1-P_{ov})\lambda_u D_{bv}$. (6.5.26)
As the normalized intensity of blocks delivered by the local channel we take
$R_\lambda=\lambda_u D_{bv}$. (6.5.27)
This parameter can also be interpreted as an indicator of the hypothetical utilization of the fundamental channel if all primary blocks were fed into the channel. The dependence of the normalized time of stay $\bar D_{buf}/D_{bv}$ and of the fundamental channel utilization on the normalized intensity of blocks delivered by the local channel is shown in Figure 6.9d.
COMMENT 1
Figure 6.9a shows that the delay introduced by buffering is the price that we pay for compression of the primary train, which, in turn, improves the utilization of the capacity of the fundamental channel. Figure 6.9b illustrates the same effect, but in terms of the surplus of the channel capacity over the minimum capacity, that is, $\bar R_v$. When the channel utilization approaches 1 or, equivalently, the capacity surplus goes to 0, the delay introduced by buffering grows without bound. The reason for this effect is explained by Figure 6.9c. We can avoid idle pauses in the train fed into the fundamental channel only if a reserve of blocks is available, so that a block can be put into the fundamental channel as soon as the channel becomes idle. Therefore, the increase of statistical channel capacity utilization is inherently coupled with the increase of the average number of blocks waiting in the buffer. The ultimate reason for the improvement is that, having on average enough blocks in the buffer, we create a situation in which the statistical regularities can manifest themselves.
COMMENT 2
The diagrams in Figure 6.9d show that in a real system overflow can occur. However, in a wide range of system parameters the overflow probability is small. To make the probability of overflow still smaller we can add a partner information subsystem, described in Section 2.2, page 93, that allows a copy of a block not admitted to the buffering system to be delivered through the local channel. The diagrams in Figure 6.9d also illustrate our remarks on page 34 about effects that do not exist in the real world but are only a result of simplifying assumptions. The infinite delay can arise only in a non-existing system with infinite storage capacity. In a real system the delay is always finite. The effect of overflow is not only a nuisance. The diagrams in Figure 6.9d show that decreasing the capacity of the buffer memory slows the growth of the delay when the utilization of the channel capacity approaches 1. This is used to alleviate congestion effects (see, e.g., Seidler [6.11], Gallager [6.12]). To illustrate the effects of statistical real-time compression we used a very simple queuing system. The analysis of several other, more complicated queuing systems (see the classic book of Kleinrock [6.13]) provides more such examples.
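The finite-buffer behaviour sketched in Figure 6.9d can be illustrated numerically as follows. The sketch assumes the classical finite-buffer (M/M/1/K-type) blocking formula with $C_{buf}$ waiting places plus one block in service; the exact form of (6.5.23) used in the book may differ in such details, so the snippet should be read as an illustration of the trend rather than as a reproduction of the book's formula.

```python
# A sketch of the finite-buffer case of Section 6.5.1, assuming the classical
# M/M/1/K blocking probability for a system holding at most C_buf waiting
# blocks plus one block in service.

def finite_buffer(lam_u, D_bv, C_buf):
    rho = lam_u * D_bv                          # normalized intensity R_lambda, (6.5.27)
    K = C_buf + 1                               # positions in the whole system
    if abs(rho - 1.0) < 1e-12:
        P_ov = 1.0 / (K + 1)                    # limiting value of the formula at rho = 1
    else:
        P_ov = (1.0 - rho) * rho**K / (1.0 - rho**(K + 1))   # overflow probability
    lam_v = (1.0 - P_ov) * lam_u                # admitted intensity, (6.5.25)
    R_ch = lam_v * D_bv                         # channel utilization, (6.5.26)
    return P_ov, R_ch

if __name__ == "__main__":
    for C_buf in (3, 10):
        for R_lam in (0.6, 0.8, 1.0, 1.2):
            P_ov, R_ch = finite_buffer(R_lam, 1.0, C_buf)
            print(f"C_buf={C_buf:2d}  R_lam={R_lam:3.1f}  "
                  f"P_ov={P_ov:6.4f}  R_ch={R_ch:5.3f}")
```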
6.5.2 THE MINIMUM STATISTICAL VOLUME
It has been shown in Section 6.1.1, page 267, that the statistical Huffman algorithm is optimal in the sense that it minimizes the statistical volume of the compressed train, defined as the average length of the train. However, Huffman compression is not a universal solution of the statistical optimization problem. The Huffman algorithm is a block-oriented algorithm, while in many cases the primary information is a long train of components. Huffman coding of a long train as a whole would be very costly. Therefore, when implementation criteria are taken into account, other compression procedures may be preferred. An example of a non-optimal compression of a train is segmentation and separate Huffman compression of the blocks, described in Section 6.2.1. Another example is arithmetic coding, described in Section 6.2.2. When non-optimal compression is considered, universal estimates of the minimum statistical volume $V_{mi}(\mathbb U_{tr})$ defined by (6.4.18) are very useful for a preliminary performance assessment. Using the fundamental property of long trains discussed in Section 4.6 we derive here a universal formula for the minimal statistical volume of information, valid when the block size $N\to\infty$. As an example of the application of this estimate to the analysis of the performance of a non-optimal compression, we consider in the next section the dependence of the compression coefficient for separate Huffman compression of segments of a Markov train on the size of the segments.
From Section 4.6, particularly from conclusions (4.6.10) and (4.6.11), it follows that for large $N$ the set $U$ of all potential forms of a block can be represented as the set-sum
$U=U_{ty}\cup U_{nty}$ (6.5.28)
of the typical and the nontypical blocks. The number of typical blocks is given by the generalization of formula (4.6.15) discussed on page 204. In the notation now used, this generalization takes the form
$L(U_{ty})\approx 2^{NH_1(\mathbb U,N)}$, (6.5.29)
where
$H_1(\mathbb U,N)=H(\mathbb U)/N=H[\mathbb u(1),\mathbb u(2),\dots,\mathbb u(N)]/N$ (6.5.30)
is the entropy per one element of the multidimensional random variable $\mathbb U=\{\mathbb u(n),\ n=1,2,\dots,N\}$ representing a primary block. Conclusion (4.6.10) holds also in the general case; thus the probabilities of the typical blocks are almost the same. Therefore, using the code (1.5.8) to transform a typical block $u\in U_{ty}$ into a block $v$ of fixed length $N_{ty}$, we minimize the average length of the transformed blocks. Equation (1.5.10) is the condition that the transformation is reversible and that the secondary blocks have minimal length. In the present notation this equation takes the form
$N_{ty}=\log L(U_{ty})$. (6.5.31)
Using (6.5.29) we get
$N_{ty}\approx NH_1(\mathbb U,N)$. (6.5.32)
If a primary block is nontypical ($u\in U_{nty}$) we leave it unchanged. Thus, the length of the transformed nontypical block is
$N_{nty}=N$. (6.5.33)
As has been indicated on page 263, to exploit statistical regularities for information compression the coded blocks must have different lengths. This condition is satisfied by the described transformation, since it produces blocks of length $N_{ty}$ or $N_{nty}$. This, however, requires auxiliary separating information. Since only two lengths of blocks are possible, binary length information (see Section 2.6) is sufficient. Thus, the average volume of the transformed block, defined as its average length (see (6.1.3)), is
$V_{st}(\mathbb V)=(N_{ty}+1)P(\mathbb U\in U_{ty})+(N_{nty}+1)P(\mathbb U\in U_{nty})$. (6.5.34)
Using (6.5.32), (6.5.33), and (4.6.1b) we get
$V_{st}(\mathbb V,N)=NH_1(\mathbb U,N)+o(N)$, (6.5.35)
where $o(N)$ is a function of $N$ such that $o(N)/N\to 0$. Our argumentation shows that this is almost the minimum statistical volume of the compressed train. This result can also be obtained if we use statistically optimal Huffman coding and notice that, similarly to the second case considered in Example 6.2.2, page 266, for typical trains the Huffman code words are blocks of almost equal length, given by (6.5.32). On quite general assumptions the limit
$H_1(\mathbb U,\infty)=\lim_{N\to\infty}H_1(\mathbb U,N)$ (6.5.36)
exists. Then from (6.5.35) it follows that
$\lim_{N\to\infty}V_{mi}(\mathbb V,N)/N=H_1(\mathbb U,\infty)$. (6.5.37)
We write this as an asymptotic relation
$V_{mi}(\mathbb U,N)\approx NH_1(\mathbb U,\infty)$. (6.5.38)
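The reasoning behind (6.5.29) to (6.5.35) can be checked by brute force for small $N$. The sketch below assumes, only for the illustration, a memoryless binary source with $P(1)=p$; it counts the typical blocks, their total probability, and the average volume of the two-length code described above.

```python
# A small brute-force illustration of (6.5.29)-(6.5.35) for a memoryless binary
# source with P(1)=p (an assumption made only for this demonstration; the text
# treats the general stationary case).
from itertools import product
from math import log2, ceil

def typical_set_demo(p=0.2, N=14, eps=0.2):
    H = -p*log2(p) - (1-p)*log2(1-p)      # entropy per element H_1
    count_ty, prob_ty = 0, 0.0
    for block in product((0, 1), repeat=N):
        ones = sum(block)
        P = p**ones * (1-p)**(N-ones)
        if abs(-log2(P)/N - H) <= eps:    # block belongs to the typical set U_ty
            count_ty += 1
            prob_ty += P
    N_ty = ceil(log2(count_ty))           # fixed length of coded typical blocks, (6.5.31)
    # two-length code (6.5.34): typical blocks get N_ty bits, others are left
    # unchanged, plus one bit of binary length information in both cases
    avg_volume = prob_ty*(N_ty + 1) + (1 - prob_ty)*(N + 1)
    print(f"H_1={H:.3f}, |U_ty|=2^{log2(count_ty):.2f} (2^(N*H_1)=2^{N*H:.2f})")
    print(f"P(U in U_ty)={prob_ty:.3f}, average volume per element={avg_volume/N:.3f}")

typical_set_demo()
```

For such a small $N$ the typical set does not yet carry most of the probability, which illustrates why (6.5.35) is only an asymptotic statement.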
With few exceptions, the calculation of the entropy $H_1(\mathbb U,\infty)$ in a closed form is prohibitively complicated. Yet equation (6.5.38) is of paramount importance. It allows the basic properties of the entropy to be used to formulate guidelines for estimating the efficiency of information-compressing procedures. The basic properties derived in Section 5.1.2 are as follows:
• The entropy is maximized when the probability distribution of the potential forms of the random object (scalar, block, train, etc.) is uniform (property (5.1.9));
• The joint entropy is maximized when the components are statistically independent (property (5.1.18)).
From these properties and the asymptotic relationship (6.5.38) it follows that:
Utilizing the deterministic and statistical relationships between the components of structured information, we can compress the volume of information the more, the larger is the difference
$H_{mg}(\mathbb U)-H(\mathbb U)$ (6.5.39)
between the sum of the marginal entropies $H_{mg}(\mathbb U)$ and the joint entropy $H(\mathbb U)$ of the structured information.
Even if the difference mentioned above is small, we still can achieve compression if the probabilities of the potential forms of the components are not the same. (6.5.40)
The counterpart of conclusion (6.5.39) applies when only structural constraints are taken into account, but conclusion (6.5.40) has no counterpart in this case.
6.5.3 THE CHOICE OF SIZE OF SEGMENTS COMPRESSED BY THE HUFFMAN ALGORITHM
We will now use the basic relationship (6.5.38) to gain some insight into the problem of choosing the length of the segments into which we split a primary train of elementary pieces of information in order to apply Huffman coding. We assume the following:
A1. The elements $u(n)$ of the block are binary; 0 and 1 are their potential forms.
A2. The multidimensional random variable $\mathbb U=\{\mathbb u(n),\ n=1,2,\dots,N\}$ representing a block is a segment of a stationary Markov train. The matrix of transition probabilities $P_{2|1}$ is given by (5.3.9), and the stationary marginal probabilities are given by (5.3.11).
A3. The segments are transformed separately by the optimal Huffman algorithm (6.2.17).
To calculate the statistical volume of the transformed train we have to evaluate the average length of the code words produced by the Huffman algorithm (6.2.17). To run this algorithm we need the probabilities of the potential forms of the primary block:
$P(\mathbb U=u_i)=P[\mathbb u(1)=u_{i}(1),\ \mathbb u(2)=u_{i}(2),\dots,\mathbb u(N)=u_{i}(N)]$, (6.5.41)
where $u_i=\{u_{i}(1),u_{i}(2),\dots,u_{i}(N)\}$ is a given primary block and $\mathbb u(n)$ are the random variables representing its binary components. We calculate these probabilities from equation (5.3.3). Next, we take the obtained probabilities in place of the frequencies of occurrences, run the Huffman algorithm (6.2.17), obtain the lengths $N_r(u_i)$ of the potential forms of the transformed blocks, and from (6.2.7) calculate the average length $\bar V(\mathbb V,N)=\mathrm{E}\,N_r(\mathbb U)$ of the transformed block. The statistical volume of the whole transformed train $V_{tr}=\{v(i),\ i=1,2,\dots,I\}$ is
$V_{st}(\mathbb V_{tr})=I\,\bar V(\mathbb V,N)$, (6.5.42)
where $\mathbb V$ is the random variable representing a transformed block (see Note 11). The indicator of statistical utilization of the resources of the fundamental subsystem (communication channel, storage device) given by formula (6.4.19) is
$R(N)=V_{mi}(\mathbb V_{tr})/V_{st}(\mathbb V_{tr})$. (6.5.43)
As an estimate of the minimum resources $V_{mi}(\mathbb V_{tr})$ needed to process the whole train $\mathbb U_{tr}$ without the restriction that the train has to be processed block-wise, we use equation (6.5.38), which takes the form
$V_{mi}(\mathbb V_{tr})=I N H_1(\mathbb U,\infty)$. (6.5.44)
Substituting (6.5.42) and (6.5.44) in (6.5.43) we get
$R(N)=NH_1(\mathbb U,\infty)/\bar V(\mathbb V,N)$. (6.5.45)
The needed entropy $H_1(\mathbb U,\infty)$ was calculated in Section 5.3 and is given by equation (5.3.15). The dependence of the statistical resources utilization indicator on the size $N$ of the segment is shown in Figure 6.10.
Figure 6.10. The dependence of the indicator $R(N)$ of statistical resources utilization (equation (6.5.45)) on the size $N$ of the blocks into which a primary Markov train is segmented before applying the Huffman algorithm; the description of the considered Markov process is given on page 230.
COMMENT 1
It has been shown in Section 5.3 that all elements $\mathbb u(1),\mathbb u(2),\dots,\mathbb u(N)$ of a block of a Markov process of rank 1 are statistically dependent. Those relationships are taken into account by the Huffman algorithm, which operates on blocks as wholes. Although optimal for transforming blocks of a given length, a Huffman transformation that transforms shorter blocks separately is blind to the statistical relationships existing within the larger block. Therefore, the resources utilization indicator grows with the length $N$ of the shorter block (segment).
COMMENT 2
From Figure 6.10 it follows that for $N$ of the order of magnitude of 10 the average volume of the transformed train is close to its asymptotic minimal value given by (6.5.44). The basic reason for the saturation of the indicator of utilization of statistical regularities is that for $N$ large (in the mentioned sense) the statistical properties of the Markov train already manifest themselves. The saturation of the statistical indicator of utilization of the channel capacity shown in Figure 6.9c has the same reason. Thus, although the real-time compression system considered in Section 6.5.1 is totally different from the system now considered, their basic properties are determined by the same general rule: the statistical regularities can substantially improve the performance of lossless information compression, on the condition, however, that they can manifest themselves, by processing as a whole a sufficiently large block of components of the structured information. To illustrate the effects of compression using statistical regularities, we considered a simple case and sketched only the derivations of the basic relationships. Their exact proofs and the analysis of compression of more complicated trains of information can be found, e.g., in Blahut [6.14] and Cover, Thomas [6.15].
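A possible way to reproduce the type of curve shown in Figure 6.10 is sketched below. The binary Markov chain and its transition probabilities are assumptions made only for the illustration (they are not the parameters from page 230), and huffman_avg_length and R_of_N are hypothetical helper names; the sketch codes whole blocks of length $N$ with a Huffman code built from the block probabilities (6.5.41) and evaluates $R(N)$ according to (6.5.45).

```python
# Utilization indicator R(N) for Huffman coding of segments of a binary
# first-order Markov train; transition probabilities are illustrative only.
import heapq
from itertools import product, count
from math import log2

def huffman_avg_length(probs):
    """Average codeword length of a binary Huffman code for the given probabilities."""
    tie = count()                                   # tie-breaker so heapq never compares dicts
    heap = [(p, next(tie), {i: 0}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0.0]*len(probs)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {k: v + 1 for k, v in {**d1, **d2}.items()}  # deepen both subtrees
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    for k, v in heap[0][2].items():
        lengths[k] = v
    return sum(p*l for p, l in zip(probs, lengths))

def R_of_N(p01, p10, N):
    pi1 = p01/(p01 + p10)                           # stationary P(u=1)
    pi = (1 - pi1, pi1)
    P = ((1 - p01, p01), (p10, 1 - p10))            # transition matrix
    h = lambda q: 0.0 if q in (0.0, 1.0) else -q*log2(q) - (1-q)*log2(1-q)
    H_rate = pi[0]*h(p01) + pi[1]*h(p10)            # entropy per element H_1(U, inf)
    probs = []
    for block in product((0, 1), repeat=N):         # block probabilities, cf. (6.5.41)
        pr = pi[block[0]]
        for a, b in zip(block, block[1:]):
            pr *= P[a][b]
        probs.append(pr)
    return N*H_rate / huffman_avg_length(probs)     # R(N), cf. (6.5.45)

for N in range(1, 9):
    print(N, round(R_of_N(0.1, 0.3, N), 3))
```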
6.6 TRANSFORMATIONS UTILIZING THE STRUCTURE OF CONTINUOUS INFORMATION TO COMPRESS ITS VOLUME
Hitherto we have assumed that information is discrete. We now consider transformations compressing continuous information. The problem of processing continuous information was addressed already in Section 1.4.3. Here we discuss the counterparts of the concepts introduced previously for discrete information. In particular, we look for an indicator of the resources needed to process continuous information. The essential steps in the definition of the volume of discrete information were (1) the choice of prototype information (pages 254 and 255) for which the definition of the resources needed to process it is plausible, and (2) the definition of prototype information that requires the same processing resources as the given structured information (page 256). We now show that in a similar way we can define the indicators characterizing the volume of continuous information. However, the counterparts of both steps are more complicated than for discrete information. The basic reasons for the difficulties have been explained in Section 1.4.3. The continuous models provide only an idealized description of real information systems. The advantage of continuous models is that many properties of continuous information and its transformations can be described by relatively simple equations. This allows broad insight into information processing. The disadvantage of continuous models is that they have some peculiarities that are not related to the real system but to the model. Such peculiarities may lead to conclusions that are technical or even physical paradoxes, in particular in two areas: (1) the analysis of the volume of continuous information and (2) the spectral representations of information functions. The inherent restrictions of continuous models in the first area are mentioned in this section, and in the second area in Section 7.4. We point out the typical difficulties and indicate how to overcome them.
6.6.1 THE VOLUME OF THE PROTOTYPE CONTINUOUS INFORMATION
The simplest type of continuous information is information whose potential forms can be represented as points in the interval $\langle 0,1\rangle$. We call such information prototype continuous information. It is the counterpart of binary information, which is the prototype of discrete information. For binary information the definition of an indicator of the resources needed to store or transmit it was plausible. For prototype continuous information this is no longer so. The basic reason for the difficulty is that, as has been indicated in Section 1.4.3, page 33, we cannot process continuous information, even the simple prototype information, in an exact way. To illustrate this statement let us look more closely at the options for physically storing or transmitting the prototype continuous information. There are two techniques for processing continuous information: (1) analog processing and (2) discretization and digital processing.
To be processed by an analog technique the prototype information $u\in\langle 0,1\rangle$ must be presented in some physical form. As indicated in Section 1.1.1, this can be either a dynamic or a static form. The basic procedure for presenting continuous information in the dynamic form is modulation. We take a carrier process, typically a pulse or a sinusoidal packet located in a time slot, and make one of the parameters characterizing the carrier process dependent on the primary information (pulse amplitude, position, or width modulation; amplitude, phase, or frequency modulation; for concrete examples, see Section 2.1.1). After processing (in particular, transmission) we recover the primary information. However, the unavoidable external factors influencing the transformation of the primary information into the dynamic information, the processing of this information, and the recovery of the primary information cause the primary and recovered information to differ. Thus, the recovered information is
$u_r=u+z$, (6.6.1a)
where the difference
$z=u_r-u$ (6.6.1b)
is the final distortion. It is determined by the mentioned unknown factors. The quality of processing the primary information presented in the dynamic form is characterized by the size of the final distortions. If the external factors exhibit statistical regularities, the final distortion exhibits them also. As a simple indicator of the size of the distortions we take the mean square value
$\bar Q_z=\mathrm{E}\,\mathbb z^2$, (6.6.2)
where $\mathbb z$ is the random variable representing the final distortions. This mean square has also the meaning of an indicator of the accuracy of fundamental processing of the prototype continuous information presented in the dynamic form.
The other basic type of analog processing of the prototype continuous information is to present it in a static form. Such a presentation is typical for information storage, particularly on magnetic or optical carrying media. In both cases a counterpart of modulation is used: the parameters characterizing the static state (magnetic, optic) of a small area of the carrier (the counterpart of the time slot assigned to a modulated signal) are made dependent on the primary information. Although the factors determining the distortions have another physical character, the recovered information ("read" out of the storage) again has the form given by (6.6.1).
The second basic technique of processing continuous information is to transform it into discrete information. The transformation of scalar and vector information into discrete information has been called quantization. We discussed it already in Sections 1.5.4, 4.2.1, and 4.5.1; see Figures 1.18, 4.2, and 4.6a. If quantization is applied, it is natural to define the volume of the primary continuous information as the volume of the quantized information. This volume we denote as $V_q(\langle 0,1\rangle)$. From (6.1.18) we get
$V_q(\langle 0,1\rangle)=\log L_q$, (6.6.3)
where $L_q$ is the number of potential forms of the quantized information.
Digital processing can be considered for most purposes as error free, but the quantization introduces irreversible distortions. Therefore, if after processing the quantized information we would like to recover the primary continuous prototype information, the recovered information would again have the form (6.6.1). However, it is not indeterminate external factors but the irreversibility of the transformation producing the available information that causes the error $z$. We call it the quantization error. To get an insight into its size we give a simple example.
EXAMPLE 6.6.1 CALCULATION OF THE MEAN SQUARE OF THE QUANTIZATION ERROR
We assume that
A1. The set of potential forms of the prototype continuous information is the interval $\langle -1/2,1/2\rangle$. This shift of the previously introduced interval $\langle 0,1\rangle$ has no effect on the quantization errors but allows us to use directly the earlier considerations in Section 4.5.1. As there, we assume that
A2. The quantization is uniform, described by assumptions A1, A3, and A4 formulated on page 196, with $s_M=1/2$.
Next we assume
A3. When the quantized information $v_l$, $l=1,2,\dots,L_q$, is available, as the recovered information $u_r$ we take the center of the aggregation interval corresponding to $v_l$ (this is the $l$th reference point of the NNT producing the quantization; see assumption A4 on page 196).
A4. The continuous prototype information $u$ exhibits statistical regularities and the random variable $\mathbb u$ representing it has the uniform probability density
$p_u(u)=\begin{cases}1 & \text{for } u\in\langle -1/2,\,1/2\rangle\\ 0 & \text{for } u\notin\langle -1/2,\,1/2\rangle.\end{cases}$ (6.6.4)
A5. As the indicator of the final distortions caused by quantization we take the mean square value of the quantization distortions
$\bar Q_q=\mathrm{E}(\mathbb u-\mathbb u_r)^2$. (6.6.5)
The quantization error we denote as
$e(u)=u-u_r(u)$, (6.6.6)
where
$u_r(u)=U_r[T_q(u)]$, (6.6.7)
$T_q(\cdot)$ is the quantization rule determined by assumption A2, and $U_r(\cdot)$ is the recovery rule determined by assumption A3. On the assumptions made here the quantization error $e(u)=b$, where $b$ is defined by equation (4.5.5). Thus the saw-tooth diagram in Figure 4.6 is the diagram of $e(u)$ versus $u$. We denote by $p_e(e)$ the probability density describing the random variable representing the quantization error. We obtain this density by substituting $p_u(u)$, given by equation (6.6.4), for $p_s(s)$ in equation (4.5.8). We get
$p_e(e)=\begin{cases}L_q & \text{for } e\in\langle -\tfrac{1}{2L_q},\,\tfrac{1}{2L_q}\rangle\\ 0 & \text{for } e\notin\langle -\tfrac{1}{2L_q},\,\tfrac{1}{2L_q}\rangle.\end{cases}$ (6.6.8)
From (4.4.11) it follows that
$\bar Q_q=\int e^2(u)\,p_u(u)\,du$. (6.6.9)
From Figure 4.6c we see that
$\bar Q_q=L_q\int_{-1/2L_q}^{1/2L_q}e^2\,de=\frac{1}{12L_q^2}$. (6.6.10)
The objective indicator of the mean square quantization distortion is the normalized mean square distortion
$\bar Q_n=\bar Q_q/\bar Q_s(\mathbb u)$. (6.6.11)
From assumption A4 it follows that
$\bar Q_s(\mathbb u)=\int_{-1/2}^{1/2}u^2\,du=1/12$. (6.6.12)
From (6.6.10), (6.6.11), and (6.6.12) we get
$\bar Q_n=\frac{1}{L_q^2}$. (6.6.13)
Using (6.6.3) we write this relationship in the form
$V_q(\langle 0,1\rangle)=\tfrac12\log_2\frac{1}{\bar Q_n}$. (6.6.14)
This dependence is illustrated in Figure 6.11.
Figure 6.11. The trade-off relationship (6.6.14) between the volume $V_q(\langle 0,1\rangle)=\log L_q$ of uniformly quantized scalar information and the indicator of normalized quantization distortions $\bar Q_n$; only values corresponding to integer $L_q$ are meaningful.
COMMENT 1
The relationship (6.6.14) is a typical trade-off relationship between two conflicting indicators of an information system's performance. This relationship depends on the rules of transforming the information into a form suitable for subsequent processing (quantization) and on the rules of information recovery. We have proposed those rules without justification. In Section 8.6.1 we show that they are the optimal rules for the assumed uniform probability density.
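The result of Example 6.6.1 is easy to verify by simulation. The sketch below assumes a uniform primary density on $\langle -1/2,1/2\rangle$ and a uniform quantizer that reports the center of the aggregation interval; the estimated normalized distortion should approach $1/L_q^2$, in agreement with (6.6.13). The helper names are chosen only for this illustration.

```python
# Monte Carlo check of the uniform-quantizer distortion derived in Example 6.6.1.
import random

def quantize(u, L):
    i = min(int((u + 0.5) * L), L - 1)        # index of the aggregation interval
    return -0.5 + (i + 0.5) / L               # its center = recovered value u_r

def normalized_distortion(L, trials=200_000, seed=1):
    rng = random.Random(seed)
    mse = 0.0
    for _ in range(trials):
        u = rng.uniform(-0.5, 0.5)
        mse += (u - quantize(u, L))**2
    mse /= trials                             # estimate of Q_q, eq. (6.6.10)
    return mse / (1.0/12.0)                   # Q_n = Q_q / Q_s(u), eq. (6.6.11)

for L in (2, 4, 8, 16):
    print(f"L_q={L:2d}  Q_n (simulated)={normalized_distortion(L):.5f}  "
          f"1/L_q**2={1/L**2:.5f}")
```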
COMMENT 2
No matter whether we apply the analog or the digital technique of processing (in particular, storing or transmitting) the prototype one-dimensional continuous information, the resources needed to process the information are determined by the required accuracy of processing, and they grow with increasing required accuracy (decreasing admissible processing error). Therefore, not a single parameter but the trade-off relationship between the resources needed to process the analog information and the accuracy of the information that can be recovered is the characteristic of the volume of continuous information.
6.6.2 THE VOLUME OF STRUCTURED CONTINUOUS INFORMATION AND ITS COMPRESSION
The idea of characterizing the resources needed to process one-dimensional information by a trade-off relationship between the volume of the quantized information and the accuracy of the recovered continuous information can be directly applied to K DIM, $K\ge 2$, continuous information. However, in many situations characterizing the volume of continuous information in terms of discrete information is not natural. Here, generalizing the concepts introduced in Section 6.1.1, we present an approach to the concept of the volume of continuous information without resorting to the concept of discretization.
The number of potential forms of discrete information played a crucial role in the definition of the volume of discrete information (relationships (6.1.8), (6.1.18)). Therefore, to adapt the methodology of defining the volume of discrete information to the definition of the volume of continuous information, a counterpart of the number of elements of a discrete set must be introduced for continuous sets. If we look closer at the basic definition of the volume of discrete information (Section 6.1.1), we see that it is essential that the number of elements of the set of potential forms of the structured information and the number of elements of the set of potential forms of the reference information be the same. To make a corresponding statement in the case of continuous sets we must introduce the relationship of "equal count" of the elements of two sets. Let $U$ (respectively, $V$) denote the two sets of potential forms of information $u$ (respectively, $v$). We define as follows:
If such a reversible information transformation $T(\cdot)$ exists that every $v\in V$ can be presented in the form $v=T(u)$ and every $u\in U$ can be presented in the form $u=T^{-1}(v)$, then we say that the sets $U$ and $V$ have equal counts and write $U\sim V$. (6.6.15)
The condition occurring in definition (6.6.15) can be formulated equivalently in the following form:
It is possible to put the elements of both sets into pairs so that each element of the set $U$ has one and only one partner from the set $V$, and vice versa. (6.6.16)
The concept of "equal count of elements" is illustrated in Figure 6.12.
Figure 6.12. Illustration of the concept of an equal count of the sets $U$ and $V$; both sets are discrete.
For discrete sets the condition that the counts of the two sets are the same is equivalent to the condition
$L(V)=L(U)$. (6.6.17)
Obviously, the interval $\langle 0,1\rangle$ has infinitely many elements. As we have shown in Section 1.4.3, this "infinity" is so large that it is not possible to establish an "equal count" relationship between the points of the interval $\langle 0,1\rangle$ and a set of elements that can be identified by integers. However, we may expect that an "equal count" relationship may exist between prototype continuous sets. We first show by a simple example that a straightforward generalization of the definition (6.6.15) of equal count would not be feasible for a technically reasonable definition of the volume of continuous information.
EXAMPLE 6.6.2 THE LOSSLESS TWO-DIMENSIONAL TO ONE-DIMENSIONAL COMPRESSION OF UNCONSTRAINED EXACT CONTINUOUS INFORMATION
This example shows a peculiarity of comparing the counts of two continuous sets by putting their elements into pairs. This is equivalent to defining a deterministic reversible function by assigning to an element of one set one and only one element of the other set. We assume that the set of potential forms of the information $u=\{u(1),u(2)\}$ is the unit square
$U_{1\times 1}=\{\{u(1),u(2)\};\ u(1)\in\langle 0,1\rangle,\ u(2)\in\langle 0,1\rangle\}$. (6.6.18)
The coordinates of a concrete piece of information must be represented in a counting system, say, binary. The representation is, in general, infinitely long. Thus,
$u(1)=b(1,1),b(1,2),b(1,3),\dots$ (6.6.19a)
$u(2)=b(2,1),b(2,2),b(2,3),\dots$ (6.6.19b)
We interleave both trains and interpret the interleaved train
$v=b(1,1),b(2,1),b(1,2),b(2,2),b(1,3),b(2,3),\dots$ (6.6.20)
as a representation of a number $v$. We denote by $T(\cdot)$ the transformation defined by (6.6.20). Since we can unfold the train (6.6.20) back into the two primary trains, the transformation $T(\cdot)$ is reversible. Thus, it puts into pairs all points of the unit square and of the unit interval, as illustrated in Figure 6.13. Since the folding procedure can be generalized, a lossless compression of K-dimensional information into one-dimensional information is also possible.
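The interleaving transformation (6.6.20) can be imitated with finite precision as in the sketch below; the digit budget `bits` and the helper names fold and unfold are choices made only for this illustration. The need to truncate the binary expansions already hints at why the mapping, although reversible in the mathematical sense, is of no practical use.

```python
# Finite-precision sketch of the interleaving transformation (6.6.20) and of
# its inverse; exact only for binary expansions no longer than `bits` digits.

def fold(u1, u2, bits=26):
    v_digits = []
    for _ in range(bits):
        u1 *= 2; b1, u1 = int(u1), u1 - int(u1)     # next binary digit of u(1)
        u2 *= 2; b2, u2 = int(u2), u2 - int(u2)     # next binary digit of u(2)
        v_digits += [b1, b2]                        # interleave the two trains
    return sum(b * 2.0**-(k+1) for k, b in enumerate(v_digits))

def unfold(v, bits=26):
    u1 = u2 = 0.0
    for k in range(bits):
        v *= 2; b1, v = int(v), v - int(v)
        v *= 2; b2, v = int(v), v - int(v)
        u1 += b1 * 2.0**-(k+1)
        u2 += b2 * 2.0**-(k+1)
    return u1, u2

u = (0.6015625, 0.33203125)       # exactly representable test point
print(unfold(fold(*u)))           # recovers (0.6015625, 0.33203125)
```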
Figure 6.13. The transformation described in the example, mapping the unit square onto the unit interval.
COMMENT
The example also shows that, contrary to intuition, the count of the points of a square (in general, of a K-dimensional cube) is the same as that of the $\langle 0,1\rangle$ interval. However, this cannot be used for dimensionality reduction, for two physical reasons. First, to perform the transformation or to revert it we must know the information $u$ and $v$ exactly; as has been explained in Section 1.4.3, this is for physical reasons impossible. The second reason is that the function assigning $v$ to $u$, considered as a function of the two continuous arguments $u(1)$ and $u(2)$, is discontinuous at every point. As discussed in Section 1.4.3, every real device has some inertia and cannot implement such a function.
The example suggests that to use the definition (6.6.15) of equal count for continuous sets we have to restrict the class of transformations of information to transformations that can be implemented. We call such a transformation physically realizable. Thus, for continuous sets we define as follows:
If we can find a reversible, physically realizable transformation putting all elements of two continuous sets into pairs, then we say that both sets have an equal effective count of elements. (6.6.21)
As has been indicated previously, the counterpart of the binary prototype set is, for continuous information, the one-dimensional continuous set that can be represented as the $\langle 0,1\rangle$ interval. We call it the prototype continuous set and its elements prototype continuous information. To extend the procedure of defining the volume of discrete information presented in the previous section, we have to define an indicator of the resources needed to process (in particular, to store or to transmit) the prototype continuous information and to use it as the unit of the volume of continuous information. Having defined the volume of the prototype information, we can define the volume of structured continuous information in a similar way as we defined the volume of structured discrete information based on the volume of binary information. The counterpart of the train of $N$ binary pieces of information is the train of $N$ pieces of continuous scalar information. If no constraints are imposed on the elements of such potential trains, we call the set the N-dimensional continuous master set and denote it $C(N)$. Next, similarly to (6.1.8), we define its volume
$V[C(N)]=N$. (6.6.22)
Then the counterpart of the definition (6.1.12) of the volume of discrete information is as follows:
If a continuous set $U$ has an equal effective count of elements (in the sense of definition (6.6.21)) with the set of potential forms of the continuous master set $C(N)$, then we define the minimum (reference) volume as
$V_{mi}(U)=N$. (6.6.23)
We next extend to continuous K-dimensional information the previously given definitions of the structure-blind volume and of the compression ratio for discrete information. We illustrate this with a simple example.
EXAMPLE 6.6.3 THE LOSSLESS TWO-DIMENSIONAL TO ONE-DIMENSIONAL COMPRESSION OF CONSTRAINED CONTINUOUS INFORMATION
We assume that
A1. The information is two-dimensional, $u=\{u(1),u(2)\}$, and the set of its potential forms is the unit square
$U_{1\times 1}=\{\{u(1),u(2)\};\ 0\le u(1)\le 1,\ 0\le u(2)\le 1\}$; (6.6.24)
A2. The second coordinate is determined by the first:
$u(2)=\Phi[u(1)]$, (6.6.25)
where $\Phi(\cdot)$ is a given function such that $\Phi(0)=0$ and $\Phi(1)=1$. An example of such a function $\Phi(\cdot)$ is shown in Figure 6.14.
Figure 6.14. A physically realizable, reversible transformation of the two-dimensional information $u=\{u(1),u(2)\}$ into the one-dimensional information $v$.
The truncating transformation
$v=T(u)=u(1)$ (6.6.26)
is a reversible, physically realizable transformation, because knowing $v$ we can determine $u(2)$ from (6.6.25) and finally $u$. Thus from definition (6.6.23) we get the minimum volume $V_{mi}(U)=1$. Let us next assume that we consider transformations that are reversible but structure blind, in the sense that they cannot utilize the fact that the components of the information $u$ are related by relationship (6.6.25). For such a class of transformations $V_{bl}(U)=2$. Using definition (6.1.13) for the now considered continuous information we get a maximal lossless compression coefficient equal to 2. □
Information that is a function of a continuous argument (or arguments) belongs to the next structural class after K DIM information. Similarly as for discrete and K DIM continuous information, the "infinity" of potential forms of unconstrained continuous sets of functions considered as a whole is "infinitely larger" than the "infinity" of K DIM continuous sets. Similarly as we did in Section 6.6.1, we can approximate the information with a complicated structure by information with a simpler structure (in particular, by K DIM information) and define the volume of the information with the more complicated structure in terms of the volume of the continuous approximating information with the simpler structure. The volume defined in such a way is described by a trade-off relationship between the accuracy of approximation and the volume of the approximating continuous information with the simpler structure. Inside a class of continuous functions of continuous argument(s) we may apply the previously described procedure: introduce "equal count" relationships and a master set and define structure-blind operations. Often the set of potential forms of information with a complicated structure is constrained, and the constraints are so strong that an equal count relationship can be established with information having a simpler structure. An example is the class of time-continuous functions with a bounded harmonic spectrum, discussed in Section 7.4.3, for which a continuous relationship between the time-continuous function and the set of its samples (which is a K-dimensional information) exists. Then we define the volume of the function-information directly in terms of the volume of the K-dimensional information. The detailed analysis of spectral representations of functions and their applications to reducing the dimensionality of structured continuous information is the subject of Section 7.4.
6.6.3 THE STATISTICAL VOLUME OF CONTINUOUS INFORMATION
It is natural to define the volume of information by means of the minimal capacity of a channel that is needed to transmit the continuous information with a required accuracy. In Section 5.4.4 we defined the capacity of a channel in terms of the amount of statistical information which the output of the channel delivers about the information put into the channel. This suggests basing the definition of the volume of information, particularly of continuous information, directly on the amount of statistical information that a distorted version of the considered information must deliver about the original information. To simplify the notation and terminology we assume that the information is continuous scalar information $u\in U_{ct}$, $U_{ct}=\langle 0,1\rangle$. We introduce a hypothetical distorting transformation $V(\cdot)$ producing an information $u^*\in U_{ct}$. In general, the transformation $V(\cdot)$ may be indeterministic. Thus $u^*=V(u,z)$, where $z$ is a set of side factors. An example of a deterministic transformation $V(\cdot)$ is the chain of transformations $U_r[T_q(\cdot)]$ considered in Example 6.6.1. We assume that the primary information also exhibits statistical regularities, and as the indicator of the distortions caused by the transformation $V(\cdot)$ we take the average distortion
$\bar Q[V(\cdot)]=\mathrm{E}\,q[\mathbb u,V(\mathbb u,\mathbb z)]$, (6.6.27)
where
$\mathbb u$ is the random variable representing the primary information,
$\mathbb z$ is the multidimensional random variable representing the side factors affecting the outcome of the hypothetical transformation $V(\cdot)$,
$q(\cdot,\cdot)$ is a performance indicator of the information processing system in a concrete situation.
A typical example of the average indicator of distortions is the mean square error defined by (6.6.5). As a universal indicator of the statistical volume $V_r(\mathbb u,Q)$ of continuous information, relative to the accuracy standard $Q$, we take the minimum amount of statistical information that the random variable $\mathbb u^*$ must deliver about the random variable $\mathbb u$ representing the primary information, on the condition that the average distortion (6.6.27) has the given value $Q$. (6.6.28)
Thus, we define
$V_r(\mathbb u,Q)=\min_{V(\cdot)\in\mathcal V(Q)} I[\mathbb u;V(\mathbb u,\mathbb z)]$, (6.6.29)
where $\mathcal V(Q)$ is the set of hypothetical distorting transformations $V(\cdot)$ transforming the primary information $u$ into an information $u^*$ such that
$\mathrm{E}\,q[\mathbb u,V(\mathbb u,\mathbb z)]=Q$. (6.6.30)
Since the volume depends on the required accuracy $Q$, we call $V_r(\mathbb u,Q)$ the relative volume; hence the notation. Its definition is illustrated in Figure 6.15.
Figure 6.15. Illustration of the definition (6.6.29) of volume of continuous information exhibiting statistical regularities based on the amount of statistical information.
Using (5.1.21) we express $V_r(\mathbb u,Q)$ in terms of entropies:
$V_r(\mathbb u,Q)=\min_{V(\cdot)\in\mathcal V(Q)}[H(\mathbb u)-H(\mathbb u|\mathbb u^*)]=H(\mathbb u)-\max_{V(\cdot)\in\mathcal V(Q)}H(\mathbb u|\mathbb u^*)$. (6.6.31)
Examples of explicit calculation of the minimum volume $V_r$ can be found in Cover, Thomas [6.15]. The calculations become particularly simple when the random variable $\mathbb u$ representing the primary information is gaussian and $q(u,u^*)=(u-u^*)^2$. Then it can easily be shown that
$V_r(\mathbb u,Q)=\frac12\log_2\frac{\sigma^2(\mathbb u)}{Q}$, (6.6.32)
where $\sigma^2(\mathbb u)$ is the variance of $\mathbb u$.
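For the gaussian case (6.6.32) the relative volume is a simple closed-form function of the ratio $\sigma^2(\mathbb u)/Q$. The sketch below evaluates it for a few accuracy standards; setting the volume to zero for $Q\ge\sigma^2(\mathbb u)$ reflects the fact that such an accuracy can be achieved without transmitting any information.

```python
# Relative volume (6.6.32) of gaussian scalar information for the mean square
# error criterion, evaluated for several accuracy standards Q.
from math import log2

def relative_volume_gaussian(sigma2, Q):
    return 0.0 if Q >= sigma2 else 0.5 * log2(sigma2 / Q)

sigma2 = 1.0
for Q in (0.5, 0.1, 0.01, 0.001):
    print(f"Q={Q:5.3f}  V_r={relative_volume_gaussian(sigma2, Q):.3f} bits")
```

Note that the functional form is the same as in the quantization trade-off (6.6.14): halving the admissible mean square error costs half a binary unit of volume.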
COMMENT 1
The amount of statistical information, defined by equation (5.1.21), which the information produced by a transformation delivers about the primary information depends both on the statistical properties of the primary process (input) and on those of the transformed process. Removing by maximization the dependence on the primary process, we obtain the definition (5.4.26) of channel capacity. Removing by minimization the dependence on the conditional probability describing the transformation, we obtain the definition (6.6.29) of the relative volume of the primary information. In this sense both concepts are complementary. Both definitions also provide examples of an operation removing the dependence on details, which has been mentioned in Section 1.6.2 and is discussed in detail in Section 8.1.
NOTES
1. According to the general rules of notation, we use the special font V to indicate that the calculation of volume is a function assigning a number to structured information or to a set.
2. This term is an abbreviation for "block taken out of an unconstrained (constrained) set". Frequently used terms such as "discrete information" and "continuous information" are similar abbreviations. See also Note 3 in Chapter 1.
3. The original Huffman algorithm was conceived for optimization of compression when the probabilities of the potential forms of the blocks are given. We did not make such an assumption here, but later in this section we discuss the probabilistic version and the relationships between it and the Huffman algorithm based on frequencies of occurrences.
4. The coincidence is the result of the choice of the set of transformed blocks in Example 6.2.1. Although we could take any other set, we have chosen the specific set (6.2.11) to illustrate the relationship between the general rule (6.2.10) and the Huffman algorithm.
5. In the subsequent text the processes in the local channel are denoted by the symbol u. The subscript su should remind the reader that the symbol denotes the duration of a time slot in the channel delivering the processes denoted by u.
6. It is equal to the channel capacity defined by (5.4.31) when P_e = 0.
7. From equations (6.4.7), (6.4.12), and from theorem (4.5.13) it follows that for large I the probability distribution of n_I can be approximated by the gaussian probability distribution with variance 1. Thus, the decrease of the probability of surpassing the threshold is fast.
8. Although the intensities are the same, the train of starting instants of the transformed blocks is no longer a Poisson process, because buffering introduces statistical dependence between them.
9. We consider here only one block that is an element of the transformed train. Therefore, we omit the index i numbering the position of the block in the train; see equation (6.2.1).
10. According to the previous convention (Note 9), we do not indicate the number of the considered block in the train. Thus, u(n) is an abbreviation of the symbol u(i, n).
11. Since we assumed that the primary train is stationary, the random variables representing the transformed blocks have the same probability distribution. Therefore, we write briefly V instead of V(i).
12. The bar over Q is a reminder that the indicator is a statistical average.
13. In terms of the theory of cardinal numbers: the cardinality of the K-dimensional cube is c, where c is the cardinality of the <0, 1> interval.
REFERENCES
[6.1] Held, G., Marshall, T., Data Compression (4th ed.), Wiley, N.Y., 1996.
[6.2] Bell, T.C., Witten, T.H., Text Compression, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[6.3] Storer, J.A., Data Compression, Computer Science Press, Rockville, MD, 1988.
[6.4] Storer, J.A., Image and Text Compression, Kluwer, Boston, 1992.
[6.5] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., Numerical Recipes, Cambridge University Press, Cambridge, 1992.
[6.6] Nelson, M., The Data Compression Book, M&T Books, Redwood City, CA, 1991.
[6.7] Storer, J.A., Reif, J.H., DCC'91 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1991.
[6.8] Storer, J.A., Cohn, M., DCC'92 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1992.
[6.9] Storer, J.A., Cohn, M., DCC'93 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1993.
[6.10] Storer, J.A., Cohn, M., DCC'94 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1994.
[6.11] Seidler, J.A., Principles of Computer Communication Network Design, Wiley, N.Y., 1983.
[6.12] Gallager, R., Bertsekas, D., Networks, Wiley, N.Y., 1991.
[6.13] Kleinrock, L., Queuing Systems (2 vols.), Wiley, N.Y., 1975.
[6.14] Blahut, R.E., Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1990.
[6.15] Cover, T.M., Thomas, J.A., Elements of Information Theory, Wiley, N.Y., 1991.
[6.16] Gray, R.M., Source Coding Theory, Kluwer, Boston, 1990.
[6.17] Veldhuis, R., Breeuwer, M., Source Coding, Prentice-Hall, N.Y., 1993.
7 DIMENSIONALITY REDUCTION AND QUANTIZATION
The dilemma of information processing is that the states of the environment, and consequently the primary information, are continuous, but digital information processing is efficient and cheap. Section 1.5.4 described the basic transformations of the primary continuous information into discrete information. Such a transformation is called discretization; discretization of scalar or vector information is called quantization. The important types of discretization are listed in Figure 1.22. Section 1.5.4 indicated that if the structure of the continuous information is complicated, then we can discretize the information more efficiently if we first transform it into continuous information with a simpler structure. We called such a transformation a dimensionality-reducing transformation. The dimensionality-reducing transformations are of paramount importance not only as preliminary transformations preceding discretization. In spite of the advantages of discrete information processing, in some areas continuous information processing is also useful. Then dimensionality reduction often improves the performance of a superior system utilizing the continuous information. This chapter is devoted to dimensionality reduction and quantization.
The prototype dimensionality reduction is the transformation of vector information whose components are continuous into a vector information consisting of a smaller number of continuous components. The most frequently used types of dimensionality reduction are truncation and decimation (described briefly in Section 1.5.4; in particular, see Figure 1.20). Often the primary information can be considered as a function of a continuous argument. The basic type of dimensionality reduction of such information is point sampling and also, in the case of images, line scanning; see Section 1.5.4, particularly Figure 1.21.
In this chapter we concentrate on dimensionality reduction and quantization of vector information. For such information the formal apparatus is quite simple (matrix calculus). The obtained results are not only directly applicable but can also be generalized to dimensionality reduction of processes and images. Section 7.1, using extensively the geometric interpretation, introduces the concept of spectral representations. In Section 7.2 we concentrate on decorrelating spectral transformations. We use them in Section 7.3 to show on a simple study case that preliminary decorrelation by a spectral transformation greatly simplifies efficient reduction of the dimensionality of vector information.
Generalizing the observations made in the study case, we present an optimal algorithm for dimensionality reduction and information recovery. We close our considerations of dimensionality reduction of vector information with an example and a discussion of trading the quality of the recovered information against the volume of the compressed information. Section 7.4 shows that the concepts of spectral representation and dimensionality reduction introduced for vector information can be generalized to information that is a function of a continuous argument (function-information). We concentrate on transformations of such information that first transform it into a spectrum consisting of infinitely many components. The compression into vector information is achieved by retaining only a finite number of components of the spectrum. Sampling can also be interpreted as such a transformation. In Section 7.5 the quantization of information is discussed. As a simple but in several respects representative special case, we first consider quantization of scalar information. Then we review the basic features of vector quantization. Of great practical importance is quantization of vector information achieved by a preliminary presentation transformation and subsequent separate scalar quantization of the components of the transformed primary information. Decorrelation is a typical preliminary transformation. Further, we describe real-time quantization of a train, achieved by the predictive-subtractive procedure. In this chapter, as in the previous one, we emphasize the conceptual and technical aspects of information compression, keeping the formal side of the considerations simple. Therefore, besides analytical arguments, some heuristic arguments are also used. However, in the next chapter we give their precise justification. This applies, in particular, to the optimization of scalar and vector quantization.
7.1 SPECTRAL REPRESENTATIONS OF VECTOR INFORMATION
We explain here the fundamental concepts of spectral representations of information and their applications to decorrelation and dimensionality compression in the simple but representative case of vector information. The presented approach is based on geometrical interpretation.
7.1.1 FUNDAMENTAL CONCEPTS OF K DIM GEOMETRY
In previous chapters we used interchangeably the terms "block information" and "vector information". It has been indicated in Section 1.3.2 that, as long as we do not define the operations which can be performed on the information, the term "vector information" is a technical jargon name. In this section we define various operations on block information, utilizing in particular the interpretation of block information as a vector or as a point in the sense of geometry. To avoid confusion, the set
$u=\{u(k),\ k=1,2,\dots,K\}$ (7.1.1)
is called here the "block information", and the term "vector" is used in its mathematical sense.
We start our considerations with a review of the fundamental concepts of K DIM geometry. A detailed presentation of this geometry can be found in most books on linear algebra; see, e.g., Thompson [7.1], Horn, Johnson [7.2], and Usmani [7.3]. Since we use here various representations of the set $a=\{a(k),\ k=1,2,\dots,K\}$ of scalars $a(k)$, we indicate the type of representation by the subscripts vc, pt, mx for interpretation as a vector, as a point, or as a matrix, respectively.
For two vectors $a_{vc}$ and $b_{vc}$ representing the sets $a$ and $b$, the scalar (dot) product is
$(a_{vc},b_{vc})=|a_{vc}|\,|b_{vc}|\cos\alpha(a_{vc},b_{vc})$, (7.1.2)
where $|a_{vc}|$, $|b_{vc}|$ denote the absolute values (lengths) of the vectors and $\alpha(a_{vc},b_{vc})$ the angle between them. Taking $b_{vc}=a_{vc}$ we get from (7.1.2)
$|a_{vc}|=[(a_{vc},a_{vc})]^{1/2}$. (7.1.3)
From definition (7.1.2) it follows that if one of the vectors is fixed, the scalar product is, with respect to the other vector, a linear operation:
$([h(1)a_{vc}(1)+h(2)a_{vc}(2)],\,b_{vc})=h(1)(a_{vc}(1),b_{vc})+h(2)(a_{vc}(2),b_{vc})$. (7.1.4)
If $(a_{vc},b_{vc})=0$ we say that the vectors $a_{vc}$ and $b_{vc}$ are orthogonal. A set of vectors $f_{vc}(k)$, $k=1,2,\dots,K$, such that
$(f_{vc}(k),f_{vc}(l))=\delta(k,l)$, (7.1.5)
where
$\delta(k,l)=\begin{cases}1 & \text{for } k=l\\ 0 & \text{for } k\ne l\end{cases}$ (7.1.6)
is the Kronecker delta function, we call ortho-normal vectors. We use these vectors as unit coordinate vectors. Then we call their set $C_f$ an orthogonal coordinate system. A vector $a_{vc}$ representing a set $a=\{a(k),\ k=1,2,\dots,K\}$ can be written in the form
$a_{vc}=\sum_{k=1}^{K}a(k)f_{vc}(k)$. (7.1.7)
Writing the vector $b_{vc}$ in the same form and using (7.1.4) and (7.1.5) we get
$(a_{vc},b_{vc})=\sum_{k=1}^{K}a(k)b(k)$. (7.1.8)
In many problems of information processing the components of one of the vectors, say $b_{vc}$, have the meaning of fixed coefficients, while the components of the other vector $a_{vc}$ are samples of a continuous process $a_c(t)$ taken at the instants $t_k$, $k=1,2,\dots,K$. Thus,
$a(k)=a_c(t_k),\ k=1,2,\dots,K$. (7.1.9)
Then (7.1.8) takes the form
$(a_{vc},b_{vc})=\sum_{k=1}^{K}b(k)a_c(t_k)$. (7.1.10)
Comparing this expression with equation (3.2.43), describing the process at the output of a linear system, we see that they are similar. Let us suppose that
• the input process in equation (3.2.43) is
$v(1,t_n)=a_c(t_n)=a(n),\ n=1,2,\dots,K$; (7.1.11)
• the number of memory cells
$J=K$; (7.1.12)
• the coefficients $h(t_n)$ characterizing the time-discrete linear system shown in Figure 3.8 satisfy the conditions
$h(t_K-t_n)=b(n),\ n=1,2,\dots,K$. (7.1.13)
Then the output process $v(2,t_K)$ at the instant $t_K$ is equal to the scalar product of the vectors:
$v(2,t_K)=(a_{vc},b_{vc})$. (7.1.14)
From (7.1.13) and (3.2.43) it follows that
$h'(t_m)=b(K-m),\ m=0,1,\dots,K-1$. (7.1.15)
A linear system with such a set of coefficients is called a matched filter. In view of (3.2.40) this set has also the meaning of the pulse response of the system. Let us summarize our considerations:
If at the input of the filter matched to the components of a vector $b_{vc}$ we put in sequentially the components of the vector $a_{vc}$, then the value of the output process after the arrival of the last component of $a_{vc}$ is equal to the scalar product $(a_{vc},b_{vc})$. (7.1.16)
The conclusion is illustrated in Figure 7.1. It shows not only that the scalar product can be calculated by simple hardware but also, as is shown later, it suggests useful interpretations of several information transformations.
Figure 7.1. Evaluation of the scalar product: (1) based directly on equation (7.1.8), (2) by means of the matched filter (the output is read at $t_K+0$).
The set $a=\{a(k),\ k=1,2,\dots,K\}$ can also be represented as a column matrix. Such a presentation is denoted as $a_{mx}$. Thus,
$a_{mx}=[a(1),a(2),\dots,a(K)]^{\mathrm T}$. (7.1.17)
Using the rules of matrix multiplication we write equation (7.1.8) in the form
$(a_{vc},b_{vc})=a_{mx}^{\mathrm T}b_{mx}$, (7.1.18)
where $a_{mx}^{\mathrm T}$ is the transposed matrix, and writing two matrices side by side means matrix multiplication.
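Conclusion (7.1.16) can be checked numerically, assuming the numpy package is available: feeding the components of $a$ into the filter matched to $b$ and reading the output after the last component has arrived reproduces the scalar product (7.1.8).

```python
# Numerical check of conclusion (7.1.16): a matched filter computes (a, b).
import numpy as np

K = 8
rng = np.random.default_rng(0)
a = rng.standard_normal(K)
b = rng.standard_normal(K)

h = b[::-1]                         # pulse response h'(t_m) = b(K-m), eq. (7.1.15)
output = np.convolve(a, h)          # running output of the time-discrete filter
print(output[K - 1], np.dot(a, b))  # the K-th output sample equals (a, b)
```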
We can also interpret the block information $a=\{a(k),\ k=1,2,\dots,K\}$ as a point in a K-dimensional Euclidean space which, in the coordinate system $C_f$, has the coordinates $a(k)$, $k=1,2,\dots,K$. In view of (7.1.7) this point can be interpreted as the end point of the vector $a_{vc}$ defined by (7.1.7). For a pair of points $a_{pt}$, $b_{pt}$ we introduce the distance $d(a_{pt},b_{pt})$. It is natural to define the distance as the length of the difference vector $b_{vc}-a_{vc}$. Thus, we define
$d(a_{pt},b_{pt})=|b_{vc}-a_{vc}|$. (7.1.19)
From (7.1.3) and (7.1.8) we obtain
$d(a_{pt},b_{pt})=\Big\{\sum_{n=1}^{K}[b(n)-a(n)]^2\Big\}^{1/2}$. (7.1.20)
Using (7.1.3) and (7.1.4) we write (7.1.19) in the form
$d^2(a_{pt},b_{pt})=(b_{vc}-a_{vc},\,b_{vc}-a_{vc})=(a_{vc},a_{vc})+(b_{vc},b_{vc})-2(a_{vc},b_{vc})=|a_{vc}|^2+|b_{vc}|^2-2(a_{vc},b_{vc})$. (7.1.21)
COMMENT
The smaller the distance $d(a_{pt},b_{pt})$, the more "similar" are the two sequences $a$ and $b$. Thus, the distance may be called a "negative" (the smaller, the more similar) indicator of similarity. From (7.1.21) it follows that the only component of the distance that depends on both sequences simultaneously is the scalar product $(a_{vc},b_{vc})$. The distance is the smaller, the larger this product is. Therefore, the scalar product may be called a "positive" indicator of similarity of the two sequences.
7.1.2 DEFINITION OF THE SPECTRUM OF BLOCK INFORMATION
In the K DIM Euclidean space we take an orthogonal coordinate system $C_f$, determined by the set $f_{vc}(k)$, $k=1,2,\dots,K$, of unit coordinate vectors satisfying the ortho-normality conditions (7.1.5), and we represent the block information $u$ as a vector (see Figure 7.2a)
$u_{vc}=\sum_{k=1}^{K}u(k)f_{vc}(k)$. (7.1.22)
Figure 7.2. Illustration of the definitions: (a) of the vector and point representation of the block information $u$, (b) of its spectrum $v=\{v(1),v(2),v(3)\}$.
We scalar-multiply both sides of (7.1.22) by a coordinate unit vector $f_{vc}(l)$. Taking into account the ortho-normality (7.1.5) we get
$u(l)=(u_{vc},f_{vc}(l))$. (7.1.23)
Next we take another orthogonal coordinate system $C_g$ (see Figure 7.2b) of unit coordinate vectors $g_{vc}(k)$, $k=1,2,\dots,K$. Thus,
$(g_{vc}(k),g_{vc}(l))=\delta(k,l)$. (7.1.24)
Taking in (7.1.22) instead of $u_{vc}$ a unit vector $g_{vc}(l)$, we express it in terms of the unit vectors of the coordinate system $C_f$:
$g_{vc}(l)=\sum_{m=1}^{K}g(l,m)f_{vc}(m)$. (7.1.25)
We scalar-multiply both sides by a coordinate unit vector $f_{vc}(k)$. This gives
$g(l,k)=(g_{vc}(l),f_{vc}(k))$. (7.1.26)
Substituting (7.1.25) in (7.1.24) and using (7.1.5) we get
$\sum_{m=1}^{K}g(l,m)g(k,m)=\delta(l,k),\ \forall l,k$. (7.1.27)
Finally, we represent the vector $u_{vc}$ in the form
$u_{vc}=\sum_{k=1}^{K}v(k)g_{vc}(k)$. (7.1.28)
Similarly to (7.1.23), the occurring coefficients are given by the equation
$v(l)=(u_{vc},g_{vc}(l)),\ l=1,2,\dots,K$. (7.1.29)
Substituting (7.1.22) and (7.1.23) and taking into account (7.1.5), we express $v(l)$ directly in terms of the components of the primary information:
$v(l)=\sum_{k=1}^{K}g(l,k)u(k),\ \forall l$. (7.1.30)
The set of coefficients
$v=\{v(k),\ k=1,2,\dots,K\}$ (7.1.31)
is the secondary block information into which the primary information $u$ is transformed. This set is called the g-spectrum of the primary information $u$ (relative to the set of orthogonal unit vectors $g_{vc}(k)$, $k=1,2,\dots,K$). When it does not cause confusion, we call it briefly the spectrum. From (7.1.29) it follows that the $l$th component of the spectrum has the meaning of the projection of the vector $u_{vc}$ on the $l$th axis of the coordinate system $C_g$. Knowing the projections on all $K$ axes, we can determine exactly the vector $u_{vc}$ (the primary information). Thus the transformation of the primary information into its spectrum is a reversible transformation.
We now express the primary information explicitly in terms of the spectrum. Because the roles of the two coordinate systems $C_f$ and $C_g$ are symmetrical, we have only to interchange in our previous argumentation the roles of the unit vectors $f_{vc}(l)$ and $g_{vc}(k)$. Similarly to (7.1.25) we represent the unit coordinate vector $f_{vc}(l)$ in the form
$f_{vc}(l)=\sum_{k=1}^{K}f(l,k)g_{vc}(k)$, (7.1.32)
where, similarly to (7.1.26),
$f(l,k)=(f_{vc}(l),g_{vc}(k))$ (7.1.33)
and similarly to (7.1.27)
Σ_{m=1}^{K} f(l, m) f(k, m) = δ(l, k),  for all l, k.    (7.1.34)
From equations (7.1.26) and (7.1.33) it follows that
f(k, l) = g(l, k).    (7.1.35)
To obtain an explicit expression of the primary information u in terms of its spectrum v we substitute (7.1.28) and (7.1.33) in (7.1.23). After elementary algebra and taking into account the ortho-normality of the vectors f_vc(k) we get
u(l) = (Σ_{k=1}^{K} v(k) g_vc(k), f_vc(l)) = Σ_{k=1}^{K} f(l, k) v(k).    (7.1.36)
The pair of relationships (7.1.30) and (7.1.36) is called the spectral transformation of block information. The symmetry of these relationships is the consequence of the mentioned symmetry of the roles of the coordinate systems C_f and C_g. To gain more insight into those relationships we present them in matrix form:
u_mx = F v_mx,    (7.1.37a)
v_mx = G u_mx,    (7.1.37b)
where u_mx and v_mx are the column matrices representing the primary block information u and the spectrum v; G = [g(k, l)] and F = [f(k, l)] are square matrices with the elements g(k, l) (respectively f(k, l)). The pair of relationships (7.1.37) we call the spectral transformation, the matrix G the spectrum-generating matrix, and F the information-recovery matrix. The matrices G and F are mutually related. From (7.1.37) it follows that
G = F^(-1),    (7.1.38)
while from (7.1.35) it follows that
G = F^T.    (7.1.39)
From this it follows that
F^(-1) = F^T.    (7.1.40)
Using (7.1.35) we write (7.1.36) in the form
u(l) = Σ_{k=1}^{K} g(k, l) v(k),    (7.1.41)
and we denote
u(·) = {u(l), l = 1, 2, ..., K},    (7.1.42a)
g(k, ·) = {g(k, l), l = 1, 2, ..., K}.    (7.1.42b)
Using this notation we write (7.1.41) as
u(·) = Σ_{k=1}^{K} g(k, ·) v(k).    (7.1.43)
Thus equation (7.1.43) has the meaning of a representation of a function w(-) corresponding to the primary block information ii as a superposition of standard functions g(k, •) multiplied by coefficients v(k) determined by the primary information. Therefore, the functions g(k, •) are called basic functions. Section 7.4 shows that representation (7.1.43) has direct counterparts for functions of a continuous argument (s). COMMENT 1 From equations (7.1.29) and (7.1.33), it follows that the spectral representation (7.1.37) depends only on angles between the axes of the coordinate systems Cf and C^. If we rotate both coordinate systems keeping their mutual position fixed the spectral presentation does not change. Therefore, we can take any system as the primary coordinate system Cf. COMMENT 2 Since any mutual angular position of the coordinate systems Cf and C^ is possible, we have a continuum of coordinate systems and thus, we have infinitely many various sets of basic functions g(k, •)• In consequence, infinitely many spectral representations of a primary block information are possible. The concrete choice depends on the reason for using the spectral transformation. In general it is used to simplify the analysis of transformations of information. Of paramount importance are the representations based on harmonic functions (sine, cosine, and complex exponential), because these functions have the unique property that their shape is not changed by a linear stationary transformation. This combined with the superposition feature (3.2.15) permits to gain much insight into the transformation performed by a stationary linear system. Therefore, sets of samples of harmonic functions are very useful for numerical analysis of linear stationary systems. Such an ortho-normal set is presented in the forthcoming example. Of great importance also are spectral representations such that the random variables representing spectral components are non correlated; they are called decorrelating spectral representations. They significantly simplify the subsequent dimensionality reduction or discretization. Section 7.2 is devoted to decorrelating transformations, while in Sections 7.3 and 7.4 we discuss their applications for information dimensionality reduction and quantization. COMMENT 3 For applications only the basic relationships (7.1.30) and (7.1.36) (or equivalently (7.1.37)) are essential. They held if the orthogonality conditions (7.1.18), (7.1.27) and (7.1.34) are satisfied. The geometrical interpretation that we used allowed to obtain those relationships in a simple way, but it played only an auxiliary role. Therefore, the described concept of spectrum of block information can be generalized for information or/and spectrum having other, more complicated structure for which a direct geometrical interpretation is no longer plausible. To get such a generalization we must define the analog of the scalar product and the counterparts of the definitions coefficients/(/:, /) (equivalently, g{k, /)) determining the spectral representation.
Section 7.4 is devoted to such generalizations in the case when information is a function of continuous argument(s). Here the methodology of generalizations is illustrated with a simple example, very important in applications, when the basic functions of a discrete argument are samples of continuous complex exponential functions.
EXAMPLE 7.1.1 SPECTRAL TRANSFORMATION OF VECTOR INFORMATION: DISCRETE FOURIER TRANSFORMATION
As indicated in Comment 2, the complex exponential function exp(jωt) plays an important and unique role in the analysis and synthesis of linear systems and of linear information transformations. Therefore, of great practical importance is the spectral representation of samples of function information using as basic functions f(·, k) sets of samples of complex exponential functions. We now describe such a set in more detail. We assume that
A1. The primary information is a function information of the continuous argument t:
u_f(t), 0 ≤ t ≤ T.    (7.1.44)
A2. The components of the primary vector information are
u(l) = u_f(l T_s),    (7.1.45)
where
T_s = T/K    (7.1.46)
is the sampling period and l = 0, 1, ..., K-1.
A3. The basic function g(k, ·) of the discrete argument occurring in (7.1.43) is a train of samples of the exponential function of a continuous argument,
g(k, l) = A exp(j ω_1 k t)|_{t = l T_s},  l = 0, 1, ..., K-1,    (7.1.47a)
where
ω_1 = 2π/T,    (7.1.47b)
exp x denotes e^x, and A is a constant, which we determine later. From these assumptions it follows that
g(k, l) = A exp(j ω_1 k T_s l) = A exp(j α k l),    (7.1.47)
where
α = ω_1 T_s = 2π/K.    (7.1.48)
To use our previous considerations about spectral representations we must define the counterpart of the scalar product for pairs of sets a = {a(l), l = 1, 2, ..., K} and b = {b(l), l = 1, 2, ..., K} of complex numbers. Since the geometrical interpretation of such sets as vectors is not plausible, we have to look rather at equation (7.1.8) as a basis for the definition of the scalar product. An analysis shows that the proper counterpart, for sets of complex numbers, of the definition of the scalar product for sets of real numbers is
(a, b) = Σ_{l=1}^{K} a(l) b*(l),    (7.1.49)
where b*(l) is the complex conjugate of b(l).
Some elementary algebra shows that for the functions defined by (7.1.47)
Σ_{l=0}^{K-1} g(k, l) g*(m, l) = A² K δ(k, m).    (7.1.50)
To satisfy the ortho-normality conditions (7.1.27) we take
A = 1/√K.    (7.1.51)
Thus, the functions
g(k, l) = (1/√K) exp(j α k l),  k = 0, 1, ..., K-1,    (7.1.52)
of the discrete argument l are ortho-normal in the sense of definition (7.1.5). For the scalar product defined by (7.1.49), equation (7.1.35) takes the form
f(k, l) = g*(l, k).    (7.1.53)
From (7.1.52) we get
f(l, k) = (1/√K) exp(j α k l).    (7.1.54)
Thus, for the assumed definition of the scalar product and for the assumed coefficients f(·, k), the spectral transformation described by (7.1.36) and (7.1.30) takes the form
u(l) = (1/√K) Σ_{k=0}^{K-1} v(k) exp(j α k l),    (7.1.55a)
v(l) = (1/√K) Σ_{k=0}^{K-1} u(k) exp(-j α k l). □    (7.1.55b)
This transformation is called discrete Fourier transformation. It is widely used for digital information analysis and processing. COMMENT 1 The spectral representation (7.1.37) has the same dimensionality. Thus, the same volume as the primary information u. The modification described in the example generates as spectrum K components v(k), which are complex numbers. Thus, the structure blind volume of the described harmonic spectrum is 2K. However, the spectrum has a redundancy, because as can be checked, the spectral components with numbers located symmetrically around K/2 determine each other. If we would eliminate the redundant components of the spectrum the spectral transformation (7.1.55a) respectively, (7.1.55b) would lose its simplicity. This simplicity not only makes the analytical calculations easier. It is also extensively used in the very effective algorithm for numerical calculation of the spectrum, called fast Fourier transformation ( for details and references see Poularikas [7.4], Smith, Smith [7.5] and for programs Press and all.[7.6]). Since the spectral transformation is reversible, every transformation performed on spectral representation can be described as a transformation of the primary information, without introducing explicitly the spectral representation. Thus, the main advantage of spectral transformations is that they provide a presentation of information that gives more insight into some relationships between the primary information and information produced by transformations, especially linear.
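As a hedged illustration of the pair (7.1.55), the following sketch uses numpy's FFT with the norm="ortho" option, which corresponds to the 1/√K factor assumed in the example; it also checks the redundancy of the spectrum of a real train mentioned in Comment 1. The random test train is an assumption made only for this illustration.

```python
import numpy as np

K = 8
u = np.random.default_rng(0).normal(size=K)      # a real primary train u(k)

# Forward transform (7.1.55b) and inverse transform (7.1.55a); the
# norm="ortho" option applies the 1/sqrt(K) factor of the example.
v = np.fft.fft(u, norm="ortho")
u_back = np.fft.ifft(v, norm="ortho")

print(np.allclose(u, u_back.real))               # the transformation is reversible
# Redundancy of the harmonic spectrum of a real train: components located
# symmetrically around K/2 are complex conjugates of each other.
print(np.allclose(v[1:], np.conj(v[1:][::-1])))
```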
7.1.3 SOME IMPORTANT PROPERTIES OF SPECTRAL TRANSFORMATIONS
Two important properties of spectral transformations, which are used in forthcoming considerations, are presented: the distance invariance property and the optimal character of fragments of the spectrum.
DISTANCE PRESERVATION BY SPECTRAL TRANSFORMATIONS
Consider two potential forms u' and u'' of block information. We denote by u'_pt and u''_pt the two points representing the two potential forms of information in the space with coordinate system C_f. In view of (7.1.13) the Euclidean distance between these points defined by (7.1.12) is
d(u'_pt, u''_pt) = {Σ_{l=1}^{K} [u''(l) - u'(l)]²}^(1/2).    (7.1.56)
Let us denote by v' and v'' the g-spectra of u' and u''. We again interpret them as two points v'_pt and v''_pt, now described by their coordinates in the coordinate system C_g. In view of (7.1.19) the distance between those two points is
d(v'_pt, v''_pt) = {Σ_{l=1}^{K} [v''(l) - v'(l)]²}^(1/2).    (7.1.57)
However, according to our definition of the spectrum, v'_pt is just another notation for u'_pt; thus v'_pt = u'_pt. Similarly, v''_pt = u''_pt. Thus,
d(v'_pt, v''_pt) = d(u'_pt, u''_pt),    (7.1.58a)
or, equivalently,
Σ_{l=1}^{K} [v''(l) - v'(l)]² = Σ_{l=1}^{K} [u''(l) - u'(l)]².    (7.1.58b)
The interpretation of (7.1.58) is as follows:
The Euclidean distance between a pair of points representing two potential forms of primary information and the distance between the pair of corresponding spectra are the same.    (7.1.59)
Thus, the spectral transformation is a distance-invariant transformation. This conclusion is very useful for easily calculating the effects of modifications of spectra on the corresponding modifications of the primary block information and vice versa. Taking u'' = 0 in (7.1.58b) we get
Σ_{l=1}^{K} v²(l) = Σ_{l=1}^{K} u²(l).    (7.1.60)
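A small numerical check of the distance-invariance property (7.1.58) and of the energy equality (7.1.60) can be made with any orthonormal spectrum-generating matrix; the matrix below is obtained from a QR factorization of a random matrix, which is only an illustrative choice and not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 6

# Any orthonormal spectrum-generating matrix G will do; here one is obtained
# from the QR factorization of a random matrix (illustrative choice).
G, _ = np.linalg.qr(rng.normal(size=(K, K)))

u1 = rng.normal(size=K)
u2 = rng.normal(size=K)
v1, v2 = G @ u1, G @ u2

# Distance invariance (7.1.58) and the energy equality (7.1.60).
print(np.isclose(np.linalg.norm(v2 - v1), np.linalg.norm(u2 - u1)))
print(np.isclose(np.sum(v1**2), np.sum(u1**2)))
```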
SPECTRAL REPRESENTATIONS COMPRESSING THE VOLUME OF VECTOR INFORMATION
The dimensionality of the discussed spectral representation is the same as or even larger than the dimensionality of the primary information. We now consider the possibility of using spectral representations to compress the volume of primary information, however, at the price of introducing irreversible distortions. Obviously we attempt to keep those distortions as small as possible. This leads to the following problem:
A set C* of M < K ortho-normal vectors g_vc(m), m = 1, 2, ..., M, is given. We look for such an approximation of the vector u_vc by the linear combination
u*_vc = Σ_{m=1}^{M} v(m) g_vc(m),    (7.1.61)
where v = {v(m), m = 1, 2, ..., M} is a set of M coefficients, that the distance d(u, u*) given by equation (7.1.20) is minimized. This problem arises not only in information compression but also in other areas of information processing, such as optimization of information recovery. In general, (7.1.61) is called the problem of optimal linear approximation. Using the symbolic notation for optimization problems introduced in Section 1.6.2, we write the problem as OP v, d[u, u*_vc(v)]. This is a parametric optimization problem which can be solved analytically. However, to gain more insight into the spectral representations we derive here the solution using geometric concepts. The set of all linear combinations of the vectors g_vc(m), m = 1, 2, ..., M, is called the M-DIM space spanned on these vectors and is denoted by S*(M). Since M < K, the optimal approximation is the orthogonal projection of the vector u_vc on the space S*(M).
Thus, the optimal approximation is
u*_ovc = Σ_{m=1}^{M} v_o(m) g_vc(m),    (7.1.65)
and its g-spectrum is
v*_o = (v_o(1), v_o(2), ..., v_o(M), 0, 0, ..., 0).    (7.1.66)
Comparing (7.1.66) with (7.1.28) we get the important conclusion:
The optimal approximation u*_o of a vector u by a linear combination of a subset C* of a complete set C_g of ortho-normal vectors g_vc(m) is obtained from the g-spectrum by rejecting the components corresponding to the g-vectors not belonging to C*.    (7.1.67)
From (7.1.12), (7.1.66), and from the equidistance property (7.1.59) it follows that
d²(u, u*_o) = Σ_{m=M+1}^{K} v²(m)    (7.1.68)
is the squared error of the optimal approximation. From (7.1.66) it follows that the optimal approximation can be represented by an M-DIM information. Therefore, conclusion (7.1.67) is of paramount importance for the optimization of the recovery of information after the dimensionality reduction that is considered in Sections 7.3 and 7.4.
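The following sketch illustrates conclusion (7.1.67): the optimal M-term approximation is obtained by zeroing the rejected spectral components, and its squared error equals the energy of those components, as in (7.1.68). The orthonormal set and the test vector are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 8, 5

G, _ = np.linalg.qr(rng.normal(size=(K, K)))   # rows of G: an ortho-normal set g(m)
u = rng.normal(size=K)
v = G @ u                                      # g-spectrum of u

# Optimal approximation (7.1.65): keep the first M spectral components.
v_trunc = v.copy()
v_trunc[M:] = 0.0
u_approx = G.T @ v_trunc                       # inverse transform, since G^{-1} = G^T

# Squared approximation error equals the energy of the rejected components (7.1.68).
print(np.isclose(np.sum((u - u_approx)**2), np.sum(v[M:]**2)))
```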
7.2 DECORRELATING SPECTRAL REPRESENTATIONS OF VECTOR INFORMATION
If the components of a structured information can be represented as random variables then, as we have indicated in Section 5.1.1, the statistical correlation coefficients are indicators of statistical relationships between the components of the structured information. It can be expected that when we process a block information with correlated components, the statistical relationships between the components make the results of the transformation obscure. In particular, if we reject some components to achieve dimensionality compression, we may expect the evaluation of the consequences of such a transformation to be simpler if the components of the primary block information are not correlated. This, in turn, could simplify the optimization of the dimensionality reduction. In the next section we show that if we require the decorrelating transformation to be a spectral transformation, then the distance invariance of those transformations permits us to find an optimal dimensionality reduction in a simple way. Therefore, we concentrate in this section on decorrelating spectral transformations.
7.2.1 BASIC CONCEPTS
The principal possibility of achieving decorrelation by a spectral transformation is illustrated below with a simple example.
EXAMPLE 7.2.1 DECORRELATION OF TWO-DIMENSIONAL INFORMATION BY SPECTRAL REPRESENTATION
We assume that
A1. the primary information is two-dimensional: u = {u(1), u(2)};
A2. the information exhibits statistical regularities and can be treated as an observation of the two-dimensional random variable U = {u(1), u(2)};
A3. E u(1) = 0, E u(2) = 0, E u²(1) = E u²(2); the components are correlated, that is, E u(1)u(2) ≠ 0.
Figure 7.3. Geometrical interpretation of the decorrelating transformation in the two-dimensional case.
Our task is to look for such a spectral representation of the primary information that the components of the representation are not correlated. Figure 7.2b takes for the two-dimensional case the form shown in Figure 7.3. The mutual position of the two coordinate systems is determined by the angle α = α(1, 1) between the unit vectors f_vc(1) and g_vc(1). From (7.1.2) and from Figure 7.3 it follows that the spectrum-generating matrix (see page 315) is
G = [[cos α, sin α], [-sin α, cos α]].    (7.2.1)
Let us denote by V = {v(1), v(2)} the random variable representing the spectrum v. Since (7.1.37b) holds for every realization, we have
V = G U.    (7.2.2)
Using (7.2.1) and performing the matrix multiplication, we get
v(1) = u(1) cos α + u(2) sin α,    (7.2.3a)
v(2) = -u(1) sin α + u(2) cos α.    (7.2.3b)
Next we calculate the correlation coefficient E v(1)v(2). Substituting (7.2.3) we get
E v(1)v(2) = {E[u²(2)] - E[u²(1)]} cos α sin α + {E[u(1)u(2)]}(cos²α - sin²α).    (7.2.4)
Taking into account assumption A3 on page 321, we have
E v(1)v(2) = {E[u(1)u(2)]}(cos²α - sin²α).    (7.2.5)
We see that E v(1)v(2) = 0 for α = π/4. Then the components of the spectral representation (7.2.2) of the primary block information u are not correlated and the spectrum-generating transformation (7.1.37b) takes the form
v(1) = (1/√2)[u(1) + u(2)],    (7.2.6a)
v(2) = (1/√2)[u(1) - u(2)]. □    (7.2.6b)
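The result of Example 7.2.1 can also be checked on synthetic data: for zero-mean components with equal variances (assumption A3), the rotation by π/4 makes the sample correlation of the spectral components approximately vanish. The gaussian generator and the numerical correlation value below are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000

# Zero-mean components with equal variances and positive correlation
# (assumption A3); generated here from a joint gaussian for illustration.
C = np.array([[1.0, 0.6],
              [0.6, 1.0]])
U = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=N)

alpha = np.pi / 4
G = np.array([[np.cos(alpha),  np.sin(alpha)],
              [-np.sin(alpha), np.cos(alpha)]])    # spectrum-generating matrix (7.2.1)
V = U @ G.T                                        # v = G u for every realization

print(np.cov(U.T)[0, 1])   # clearly non-zero for the primary components
print(np.cov(V.T)[0, 1])   # close to zero after the rotation by pi/4
```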
This simple example shows that it is possible to find a spectral transformation that is simultaneously a decorrelating transformation. However, when the dimensionality of the block information is large, the straightforward procedure that we used in the example would become prohibitively complicated. We now derive a spectral decorrelating transformation which is not only feasible for larger dimensionalities but also leads directly to an optimal dimensionality-reducing algorithm.
7.2.2 THE EIGEN VECTORS OF THE CORRELATION MATRIX
One of the fundamental theorems of linear algebra is this: If a K x K matrix C_uu is a positive definite symmetric matrix, then K vectors (column matrices) e(k), k = 1, 2, ..., K, exist such that
(e(k), e(m)) = δ(k, m),  for all k, m,    (7.2.7)
C_uu e(k) = γ(k) e(k),  k = 1, 2, ..., K,    (7.2.8)
and
γ(k) > 0.    (7.2.9)
The vectors (column matrices) e(k) are called eigen (own) vectors of the matrix C_uu and the parameters γ(k) are called eigen values. The interpretation of (7.2.8) is that an eigen vector is such a vector that multiplying it by the matrix has the same result as multiplying the vector by a scalar (the associated eigen value). We write equation (7.2.8) in the matrix form
[C_uu - γ(k) D_1] e(k) = 0,    (7.2.10)
where
D_1 = [δ(m, n)]    (7.2.11)
is the diagonal unit matrix and 0 is the matrix with all elements 0. The matrix C_uu - γ(k) D_1 is the matrix C_uu in which the elements c_uu(k, k) are replaced by c_uu(k, k) - γ(k). From (7.2.10) it follows that the eigen values are solutions of the equation
(C_uu - γ D_1) e = 0.    (7.2.12)
If the determinant of the matrix C_uu - γ D_1 is not equal to zero, then the zero vector e = 0 is the only solution of equation (7.2.10). Therefore, it must be that
det(C_uu - γ D_1) = 0.    (7.2.13)
With respect to γ, this is an algebraic equation of degree K. Thus, we can find the eigen values of the matrix C_uu as the roots of (7.2.13) considered as an algebraic equation for γ. This equation is called the characteristic equation of the matrix C_uu. Knowing an eigen value, we find the components of the associated eigen vector from (7.2.12), considered now as a set of linear equations. However, because of (7.2.13) those equations are linearly dependent and their solution depends on an
undetermined parameter. We can find it by taking into account the condition (7.2.7) that the length of the eigen vector should be 1. Summarizing, the algorithm for finding the eigen values and eigen vectors is:
A1. consider (7.2.13) as an algebraic equation for γ and find its roots; they are the eigen values of the matrix C_uu;
A2. for a given eigen value consider (7.2.12) as a set of linear equations for the components of the eigen vector and find its solutions; they are not unique but depend on a parameter;
A3. determine the parameter which occurred in step A2 from the condition
Σ_{l=1}^{K} e²(k, l) = 1,    (7.2.14)
where e(k, l) are the components of the eigen vector e(k). We illustrate this algorithm with a simple example.
EXAMPLE 7.2.2 EVALUATION OF EIGEN VALUES AND EIGEN VECTORS
We again make assumptions A1 to A3 from Example 7.2.1 (page 321). On these assumptions the correlation matrix is
C_uu = [[c_uu(1,1), c_uu(1,2)], [c_uu(1,2), c_uu(1,1)]].    (7.2.15)
The characteristic equation (7.2.13) takes the form
det[[c_uu(1,1) - γ, c_uu(1,2)], [c_uu(1,2), c_uu(1,1) - γ]] = 0.    (7.2.16)
Evaluating the determinant we get
[c_uu(1,1) - γ]² - c_uu²(1,2) = 0.    (7.2.17)
The solutions of this equation are
γ(1) = c_uu(1,1) + c_uu(1,2),    (7.2.18a)
γ(2) = c_uu(1,1) - c_uu(1,2).    (7.2.18b)
For γ = γ(1), equation (7.2.12) yields two scalar equations:
c_uu(1,1) e(1,1) + c_uu(1,2) e(1,2) = [c_uu(1,1) + c_uu(1,2)] e(1,1),
c_uu(1,2) e(1,1) + c_uu(1,1) e(1,2) = [c_uu(1,1) + c_uu(1,2)] e(1,2).    (7.2.19)
From the first equation we get
e(1,2) = e(1,1),    (7.2.20)
and we see that the second equation is equivalent to the first (this is the consequence of (7.2.16)). The normalization condition (7.2.14) takes the form
e²(1,1) + e²(1,2) = 1.    (7.2.21)
From (7.2.19) and (7.2.20) we finally get the eigen vector corresponding to the eigen value γ(1):
e(1) = {1/√2, 1/√2}.    (7.2.22a)
In a similar way we get the eigen vector corresponding to γ(2):
e(2) = {1/√2, -1/√2}. □    (7.2.22b)
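For readers who prefer a numerical check, the eigen values and eigen vectors of the 2x2 correlation matrix (7.2.15) can be obtained with a standard routine such as numpy.linalg.eigh; the values of c_uu(1,1) and c_uu(1,2) below are illustrative assumptions, not taken from the text.

```python
import numpy as np

c11, c12 = 1.0, 0.6             # illustrative values of c_uu(1,1) and c_uu(1,2)
C = np.array([[c11, c12],
              [c12, c11]])

# eigh returns the eigen values in ascending order and the eigen vectors
# as the columns of the second output.
gamma, E = np.linalg.eigh(C)

print(gamma)                    # [c11 - c12, c11 + c12], as in (7.2.18a, b)
print(E)                        # columns proportional to (1, -1)/sqrt(2) and (1, 1)/sqrt(2)
```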
7.2.3 THE DECORRELATION BASED ON EIGEN VECTORS
The set of equations (7.2.8) defining the eigen vectors can be written in the matrix form
C_uu E = E D_γ,    (7.2.23)
where E is the square matrix whose columns are the eigen vectors, considered as column matrices:
E = [e(1) | e(2) | ... | e(K)],    (7.2.24)
and
D_γ = diag[γ(1), γ(2), ..., γ(K)]    (7.2.25)
is the diagonal matrix with the eigen values γ(k) on the diagonal. Since the eigen vectors satisfy the conditions (7.2.7), they can be used as unit vectors of an orthogonal coordinate system. We denote this system C_e and consider it a special case of the second coordinate system C_g, which was introduced in Section 7.1.2. Thus, the eigen vectors now play the role of the unit coordinate vectors g_vc(k). We denote by G_e the matrix generating the e-spectrum. From the definition of the matrix G occurring in (7.1.37b) it follows that
G_e = [e_mx^T(1); e_mx^T(2); ...; e_mx^T(K)],    (7.2.26)
that is, the rows of G_e are the transposed eigen vectors e_mx(k) considered as column matrices. Comparing this definition with definition (7.2.24) we see that
G_e = E^T,    (7.2.27)
and from (7.1.40) it follows that
E^T = E^(-1).    (7.2.28)
We now show that the spectral representation corresponding to this coordinate system is a decorrelating representation. From (7.2.27) and (7.2.28) we get
G_e = E^T,  G_e^T = E.    (7.2.29)
From the definition of G_e it follows that
V = G_e U    (7.2.30)
is the random variable representing the e-spectrum. From (4.4.37) we obtain the correlation matrix of V:
C_vv = G_e C_uu G_e^T.    (7.2.31)
Substituting (7.2.29) we have
C_vv = E^T C_uu E.    (7.2.32)
From (7.2.23) we have next
C_vv = E^T E D_γ,    (7.2.33)
where D_γ is the diagonal matrix given by (7.2.25). However, from (7.2.7) it follows that
E^T E = D_1.    (7.2.34)
Substituting this in (7.2.33) we finally get
C_vv = D_γ,    (7.2.35)
or equivalently
E v²(k) = γ(k),  k = 1, 2, ..., K,    (7.2.36a)
E v(k)v(l) = 0,  k ≠ l.    (7.2.36b)
Thus, we have proved that the transformation
v = G_e u,    (7.2.37)
where G_e is the matrix whose rows are the eigen vectors of the correlation matrix C_uu of the primary information, is a spectral decorrelating transformation.
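A quick numerical verification of (7.2.35)-(7.2.37): building G_e from the eigen vectors of an (illustrative, randomly generated) positive definite symmetric matrix and forming G_e C_uu G_e^T indeed yields the diagonal matrix of eigen values. The data below are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(4)

# An illustrative positive definite symmetric correlation matrix C_uu.
A = rng.normal(size=(4, 4))
C_uu = A @ A.T + 4 * np.eye(4)

gamma, E = np.linalg.eigh(C_uu)   # columns of E are the eigen vectors
G_e = E.T                         # rows of G_e are the eigen vectors, as in (7.2.27)

# Correlation matrix of the e-spectrum, equation (7.2.31) with (7.2.29):
C_vv = G_e @ C_uu @ G_e.T

print(np.allclose(C_vv, np.diag(gamma)))   # C_vv is the diagonal matrix of eigen values
```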
EXAMPLE 7.2.4 EVALUATION OF THE DECORRELATING MATRIX
We make the same assumptions as in Example 7.2.2. From (7.2.22a, b) we get the matrix
E = [[1/√2, 1/√2], [1/√2, -1/√2]].    (7.2.38)
From (7.2.27) and (7.2.38) we obtain the decorrelating transformation
[v(1); v(2)] = [[1/√2, 1/√2], [1/√2, -1/√2]] [u(1); u(2)] = [(1/√2)(u(1) + u(2)); (1/√2)(u(1) - u(2))].    (7.2.39)
As could be expected, this is the same transformation that was obtained in Example 7.2.1. □
COMMENT To simplify the terminology we assumed that the structured information is a realization of a multidimensional random variable, and we consider the statistical correlation coefficients. However, our argument can be used directly when we do not know whether the information exhibits statistical regularities, but when several observations of the information are available. Then, we can use the concepts of intelligent information processing presented in Section 1.7.2. To realize them, using the concepts presented in Sections 4.1.1 and 4.3.1, we replace the operator of statistical averaging in definition (4.4.25a) of the correlation coefficients by the arithmetical averaging operation defined by (4.1.3). Thus, we introduce the empirical correlation coefficients. Then the transformation (7.2.37) would produce decorrelation in the sense of empirical correlation coefficients and the dimensionality reduction obtained by algorithm (7.3.61) would be optimal in the sense of arithmetical average of square errors. Such dimensionality reduction could be applied ex post when the whole train is available in an adaptive system similar to the compression system discussed in Comment 2, page 266.
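A minimal sketch of the empirical variant suggested in the comment: the statistical averaging is replaced by the arithmetical average over N observed trains, and the eigen vectors of the resulting empirical correlation matrix are used for decorrelation. The data and the function name are assumptions made for this illustration only.

```python
import numpy as np

def empirical_correlation(U):
    """U: array of shape (N, K); rows are observed trains, assumed zero-mean."""
    N = U.shape[0]
    # Arithmetical average of u(k)u(l) over the N observations, replacing
    # the statistical averaging operator E.
    return U.T @ U / N

rng = np.random.default_rng(5)
U = rng.normal(size=(1000, 4))          # illustrative set of observed trains
C_emp = empirical_correlation(U)
gamma, E = np.linalg.eigh(C_emp)        # eigen vectors for the empirical decorrelation (7.2.37)
print(np.round(E.T @ C_emp @ E, 3))     # approximately diagonal
```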
7.3 REDUCTION OF DIMENSIONALITY OF VECTOR INFORMATION When information is continuous it is often useful prior to subsequent processing, particularly before discretization, to transform the information into information that is still continuous but has lower dimensionality. We call such a transformation dimensionality reduction. We consider here the case when no deterministic structural constraints such as described in Section 6.6.2 exist, or if they do, they are not taken into account. Then the dimensionality reduction is an irreversible transformation and similarly as in the case of quantization considered in Section 6.6.1, the information compressing transformation is not characterized by a single parameter, such as the compression ratio, but by a compression ratio versus distortions of reconstructed information trade off relationships. To determine the minimal distortions caused by the irreversible dimensionality reduction, we have to optimize jointly the dimensionality reduction and information recovery rules. The systematic approach to such optimization problems is presented in Section 8.1. Here we explain the basic features of irreversible dimensionality reduction with a study case using only a few heuristic assumptions. Then, we present a general algorithm for dimensionality reduction and recovery. 7.3.1 A STUDY CASE We do again the assumptions Al and A2 from Example 7.2.1 (page 321). The dimensionality reduction is an irreversible transformation, and to gain more insight into its properties we have to analyze the aggregation sets characterizing the transformation. For such an analysis the knowledge of a rough description by variances and correlation coefficients is not sufficient and exact description of the probability distribution is needed. Therefore in addition to Al and A2 we introduce the assumption A4. The random variables i!ii(l) and i!]i(2) representing the components have density of joint probability pjji) shown in Figure 7.4a.
Figure 7.4. Direct truncation of 2-DIM information: (a) density p_u(u) of probability of the primary information; the density is constant in the shaded area, (b) the dimensionality reduction (projection on the u(1) coordinate axis), (c) aggregation sets A_u(v) corresponding to the direct truncation of the primary information, (d) the transformations recovering the second component of information: optimal linear, blind for statistical relationships (solid line); optimal linear, direct truncation (solid line); optimal nonlinear, direct truncation (dashed line).
It is easily seen that the marginal probability densities of the components of the information are uniform distributions. From equations (4.4.11) and (4.4.25b) we get
E u(1) = E u(2) = 0,  E u²(1) = E u²(2) = (1/3)a²,  E u(1)u(2) = (1/4)a².    (7.3.1)
DIMENSIONALITY REDUCTION BY DIRECT TRUNCATION
We first consider the reduction of the dimensionality from two down to one by rejecting the second component u(2). Thus, the compressed information is
v = V_1(u) = u(1),    (7.3.2)
where V_1(·) is the dimensionality-reducing transformation. The index 1 is a reminder that we keep the first component of the primary information. Let us present the primary information as a point u_pt. The transformation (7.3.2) has the geometrical meaning of an orthogonal projection of the point u_pt on the u(1) coordinate axis and taking as the description of the projected point its position v on the axis (see Figure 7.4b). Thus, the considered transformation is a special case of the continuous next neighbor transformation (see Section 1.5.3, Figure 1.17) or, equivalently, of the linear approximation discussed on page 320.
The aggregation set A(v) corresponding to compressed information v is the set of points having the same projection; thus, it is an interval perpendicular to the axis u(1), going through the point u(1) = v, as shown in Figure 7.4b. To evaluate the irreversible distortions caused by the information compression, we look for the possibilities of recovering the primary, 2-DIM information u when only the compressed information v is available. We denote by
u_r = {u_r(1), u_r(2)}    (7.3.3a)
the recovered information and by
U_r(·) = {U_r1(·), U_r2(·)}    (7.3.3b)
the rule of information recovery (the transformation of the 1-DIM compressed information into the two-dimensional recovered information). Thus,
u_r = {U_r1(v), U_r2(v)}    (7.3.4)
is the recovered primary information. As the indicator of performance of the information compression system we take
Q = E Σ_{k=1}^{2} [u(k) - u_r(k)]²,    (7.3.5)
where u(n), respectively u_r(n), n = 1, 2, are the random variables representing the corresponding components of the primary, respectively the recovered, information. Thus, we face the OP U_r(·), Q (see Section 1.6.2). In view of (7.3.2) we recover the first component exactly by taking
u_r(1) = v = u(1).    (7.3.6)
Therefore,
Q = Q_2,    (7.3.7a)
where
Q_2 = E[u(2) - u_r(2)]² = E[u(2) - U_r2(v)]².    (7.3.7b)
Thus, the OP U_r(·), Q reduces to OP U_r2(·), Q_2. The systematic procedures for solving such problems are presented in Section 8.2. Here we derive optimal recovery transformations using some heuristic arguments. Their full justification is provided in Section 8.3. It is evident that the optimal transformation is determined by the properties of the performance indicator Q_2 and that those properties depend, in turn, on the statistical regularities of the primary information. The density p_u(u) of the joint probability of the components of the primary information provides the exact description of the statistical regularities. Therefore, in general, the optimal transformation producing the recovered information depends on the density p_u(u) (see Section 1.7.1, particularly Figure 1.24). Section 8.3 shows that even in simple cases this dependence is so complicated that the implementation of the optimal transformation would be costly. Usually, the problem of implementation of such a transformation does not even arise, because the density p_u(u) is not known exactly and only simplified rough descriptions of the statistical properties are available. For these reasons, we are interested in such classes of transformations producing the recovered information that 1) the performance indicator (and, in turn, the optimal transformation) depend only on rough descriptions of the statistical
properties of the primary information, which can be easily acquired, and 2) the implementation of the optimum rule is not excessively costly. We concentrate here on linear transformations, which satisfy both conditions. In particular, we show that for those transformations the performance indicator Q_2 depends only on the mean values and correlation coefficients of the primary information, but not on other features of the density p_u(u). Thus, we consider the OP U_r2o(·), Q_2 | U_r2(·) in the class of linear transformations.
RECOVERING TRANSFORMATIONS BLIND FOR STATISTICAL RELATIONSHIPS
We start with the transformations that do not use the information about the first component of the primary information at all. Since the indicator Q_2 of the performance of such a transformation does not depend on the statistical relationships between the components of the primary information, we call the transformation blind for statistical relationships. Such a deterministic transformation can assign to a compressed information v = u(1) only a constant h. Thus,
U_r2(v) = h = const.    (7.3.8)
From (7.3.7) it follows that, for transformations of the form (7.3.8), the OP U_r2o(·), Q_2 reduces to OP h, Q_b, where
Q_b = E[u(2) - h]².    (7.3.9)
The subscript b is a reminder that we consider transformations which are blind for statistical relationships. The optimal value of the constant, which we denote by h_o, is the solution of the equation
dQ_b/dh = 0.    (7.3.10)
Since the sequence of the operations E and d/dh can be interchanged, using (7.3.9) we get
dQ_b/dh = -2[E u(2) - h].    (7.3.11)
From (7.3.10) and (7.3.11) it follows that the optimal value of the constant is
h_o = E u(2).    (7.3.12)
In view of (7.3.1) we have h_o = 0. Thus, the optimal recovering transformation that is blind for the statistical relationships between the components of the primary information is
U_b2o(v) = 0.    (7.3.13)
In view of (7.3.1) it follows that the blind-for-statistical-relationships recovery transformation
U_bo(v) = {v, 0}    (7.3.14)
is the optimal rule of recovering the primary two-dimensional information from the compressed, one-dimensional information. Its diagram is shown in Figure 7.4d. Substituting this in (7.3.7) we get the optimal performance of recovery rules blind for statistical relationships between the components of the primary information:
Q_bo = E u²(2).    (7.3.15a)
Taking the numerical value from (7.3.1), we obtain
Q_bo = (1/3)a² = 0.33a².    (7.3.15b)
LINEAR RECOVERING TRANSFORMATIONS
Since the components u(1) and u(2) are correlated, we may expect that we could improve the performance of the recovery of the primary information if we use their statistical relationship. The general procedures for finding optimal information-recovering rules are presented in Section 8.2. Here we assume that the recovery rule is a linear rule, that is, the recovered information is
u_r(2) = h(0) + h(1)v,    (7.3.16)
where h(0) and h(1) are two coefficients. Thus, the OP U_r2o(·), Q_2 in the class of linear transformations reduces to OP {h(0), h(1)}, Q_2. We show that the optimal linear rule minimizing the mean square performance indicator Q_2 (given by (7.3.7b)) depends only on the mean values of the components and on the correlation coefficients. Therefore, the class of linear transformations can be considered as a class of transformations that, besides the correlation coefficients, are blind for other features of the statistical relationships. To simplify the calculations and to provide material for subsequent generalizations, we denote
v(0) = 1,    (7.3.17)
v(1) = v,    (7.3.18)
and we write (7.3.16) in the form
u_r(2) = Σ_{k=0}^{1} h(k) v(k).    (7.3.19)
In view of (7.3.2) and (7.3.18), v(1) = u(1). We denote u(0) = v(0) = 1. Using this notation we write (7.3.7b) in the form
Q_2[h(0), h(1)] = E[u(2) - Σ_{k=0}^{1} h(k) u(k)]².    (7.3.20)
We solve OP {h(0), h(1)}, Q_2 similarly to OP h, Q_b. The generalization of (7.3.10) is the set of equations
∂Q_2/∂h(k) = 0,  k = 0, 1.    (7.3.21)
Similarly as in the derivation of (7.3.11), we interchange in (7.3.21) the sequence of the operations E and ∂/∂h(k) and get
∂Q_2/∂h(k) = -2E{[u(2) - Σ_{m=0}^{1} h(m)u(m)] u(k)} = -2[E u(2)u(k) - Σ_{m=0}^{1} E u(m)u(k) h(m)] = -2[c_uu(2, k) - Σ_{m=0}^{1} c_uu(m, k) h(m)],  k = 0, 1,    (7.3.22)
where
c_uu(m, k) = E u(m)u(k),  k = 0, 1,  m = 0, 1, 2,    (7.3.23)
are the correlation coefficients. Since u(0) = 1, we have
c_uu(0, 0) = 1,  c_uu(0, n) = E u(n),  n = 1, 2.    (7.3.24)
From (7.3.24) and (7.3.1) it follows that
c_uu(0, n) = 0,  n = 1, 2.    (7.3.25)
Substituting (7.3.22) with (7.3.25) in (7.3.21), after some elementary algebra we get from the set (7.3.21) of equations the optimal coefficients
h_o(0) = 0,    (7.3.26a)
h_o(1) = c_uu(1, 2)/c_uu(1, 1).    (7.3.26b)
Thus, the linear transformation producing the optimally recovered second component of the primary information is
U_r2o(v) = h_o(1) v.    (7.3.27)
Substituting the numerical values from (7.3.1) in (7.3.26b) we get
h_o(1) = 0.75.    (7.3.28)
The diagram of the optimal linear information-recovering transformation
U_ro(v) = {v, U_r2o(v)}    (7.3.29)
as a function of v is shown in Figure 7.4d. To calculate the best performance indicator (minimal distortions) that is achieved by the rule (7.3.29), we have to return to the general equation (7.3.20). Squaring the expression in the braces on the r.h.s. of this equation and proceeding as in the derivation of (7.3.22), we get
Q_2[h(0), h(1)] = E[u²(2)] - 2 Σ_{k=0}^{1} E[u(2)u(k)] h(k) + Σ_{k=0}^{1} Σ_{m=0}^{1} E[u(k)u(m)] h(k)h(m) = c_uu(2, 2) - 2 Σ_{k=0}^{1} c_uu(2, k) h(k) + Σ_{k=0}^{1} Σ_{m=0}^{1} c_uu(k, m) h(k)h(m).    (7.3.30)
After substituting (7.3.26) and simple calculations we get the minimal (in the class of linear transformations) mean square distortion
Q_lo = Q_2[h_o(0), h_o(1)] = c_uu(2, 2) - c_uu²(1, 2)/c_uu(1, 1).    (7.3.31)
Taking the numerical values from (7.3.1) we obtain
Q_lo = 0.145 a².    (7.3.32)
The assumption made at the beginning, that we achieve the compression by rejecting the second component u(2), was arbitrary. We may reject the first. Let us look closer at such a possibility. Since we made the same assumptions about both components, if we rejected the first component, then the quality of the counterpart of the recovery rule (7.3.19) would be given by the counterpart of (7.3.31) with the roles of the indices 1 and 2 interchanged. However, in view of (7.3.1) the value of the minimum distortions would be the same. Thus, it makes no difference which component of the primary block information u is rejected. Equation (7.3.26) justifies our anticipation that the linear rule minimizing the mean square distortions of information processing depends on the correlation coefficients but not on other features of the joint probability. This is an advantage but also a disadvantage of the considered optimization problem.
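The optimal coefficient (7.3.26b) and the minimal linear-rule distortion (7.3.31) can be evaluated directly from the second-order moments. The sketch below uses the moment values as read here from (7.3.1); the scale parameter a and the use of Python are assumptions made for this illustration only.

```python
import numpy as np

a = 1.0                         # scale parameter of the density in Figure 7.4a (arbitrary)

# Second-order moments of the primary information as read from (7.3.1).
c11 = a**2 / 3                  # E u^2(1)
c22 = a**2 / 3                  # E u^2(2)
c12 = a**2 / 4                  # E u(1) u(2)

h1_opt = c12 / c11              # optimal linear coefficient (7.3.26b)
Q_lin = c22 - c12**2 / c11      # minimal mean-square distortion of linear rules (7.3.31)

print(h1_opt)                   # 0.75, as in (7.3.28)
print(Q_lin)                    # minimal linear-rule distortion, cf. (7.3.32)
```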
The solution is simple and depends only on the simple rough description of the statistical properties of the information. However, if more exact information about the statistical properties is available, it is possible that a nonlinear recovery rule, capable of utilizing the more exact information, would be better than the optimal linear rule. We now give an example of such a situation.
A NONLINEAR RECOVERY RULE
Operating with random variables, we considered all potential forms of the primary and recovered information. Let us now assume that the compressed information is fixed, say, it is v'. We know then that the point u_pt representing the primary information lies in the aggregation set A(v') shown in Figure 7.4b. Since the density of probability p_u(u) within this set is constant, we may expect that we minimize the distortion of recovery if we take as the recovered information the point u_rpt lying in the middle of the aggregation set (interval) A(v'). From Figure 7.4c we see that this point has the coordinates
u_r = U_nlo(v) = {v, a/2} if v > 0,  {v, -a/2} if v < 0,    (7.3.33)
where U_nlo(·) is the transformation assigning to the compressed information v the recovered information u_r. This is a nonlinear transformation and it is optimal; hence the index nlo. The diagram of this transformation is shown in Figure 7.4d. On the condition v = const, the density of the conditional probability in the aggregation set A(v') is constant and the conditional mean square distortion is
E[u(2) - a/2]² = (1/12)a².    (7.3.34)
Since this does not depend on v', the overall mean square distortion caused by the optimized nonlinear recovery rule (7.3.33) is
Q_nlo = (1/12)a² = 0.083a².    (7.3.35)
TRUNCATION AFTER DECORRELATION
We now analyze the dimensionality reduction by truncation applied not to u directly but to its decorrelating spectral representation, as shown in Figure 7.5b. We can use the results of Example 7.2.2 because, for the now assumed probability density p_u(u), the assumptions of that example are satisfied. The decorrelating spectral transformation given by (7.2.6) is
w(1) = (1/√2)[u(1) + u(2)],    (7.3.36a)
w(2) = (1/√2)[u(1) - u(2)].    (7.3.36b)
We denote this transformation by T_sd(·). Thus,
w = {w(1), w(2)} = T_sd(u)    (7.3.37)
is the decorrelated spectral representation of the primary information.
Figure 7.5. Block diagrams of the dimensionality reduction and recovery systems: (a) direct dimensionality reduction, (b) dimensionality reduction after decorrelation.
From (7.3.36) and (7.3.1) it follows that
E w(1) = E w(2) = 0,    (7.3.38)
and from (7.3.1) and (7.3.36) we have
E w²(1) = c_uu(1, 1) + c_uu(1, 2),    (7.3.39a)
E w²(2) = c_uu(1, 1) - c_uu(1, 2).    (7.3.39b)
Since the transformation is a decorrelating transformation,
E w(1)w(2) = 0.    (7.3.40)
Example 7.2.1 shows that the decorrelation can be interpreted as the representation of the primary information in another orthogonal coordinate system, which in this case is turned by π/4 relative to the primary system. Therefore, the diagram of the density p_w(w) of the joint probability of the components of the decorrelated information is obtained by turning the diagram of the density p_u(u) of the joint probability of the components of the primary information by π/4, as shown in Figure 7.6b. We achieve a two- into one-dimensional compression of the spectral representation of the primary block information by rejecting one of the components of the spectrum, say w(2). We denote this rule by V_w1(·). Thus, the compressed spectrum is
v = V_w1(w) = w(1).    (7.3.41)
The compression again has the meaning of a projection (see Section 1.5.3, particularly Figure 1.17b); however, we now project the point representing the primary information on the w(1) coordinate axis, as shown in Figure 7.6b. The corresponding aggregation sets A_w(v) are shown in Figure 7.6c.
Figure 7.6. The effects of decorrelation: (a) the density p_u(u) of probability of the primary information and an aggregation set corresponding to direct truncation (in the shaded area the density of probability is constant), (b) the density p_w(w) of probability of the decorrelated spectral representation and an aggregation set A_w(v) corresponding to the truncation of the second component of the spectrum, (c) the aggregation set A_wu(v) of the primary information corresponding to the truncation of the spectrum.
To recover the primary information we first recover the spectrum. We denote by w_r = {w_r(1), w_r(2)} the recovered spectrum. As an indicator of the performance of the recovery we take, similarly to (7.3.5),
Q_w1 = E Σ_{k=1}^{2} [w(k) - w_r(k)]².    (7.3.42)
The subscript w1 is a reminder that Q_w1 is an indicator of the quality of recovery of the spectrum w when only the exact information about the first component of the spectrum is available; w(k) (respectively w_r(k)) are the random variables representing the components of the primary (respectively of the recovered) spectrum. Similarly to (7.3.6), the first component of the spectrum can be recovered exactly,
w_r(1) = v = w(1),    (7.3.43)
and the indicator of spectrum recovery performance is
Q_w1 = E[w(2) - W_r2(v)]²,    (7.3.44)
where W_r2(·) is the rule of recovery of the second component of the spectrum. We are interested in the optimal rule W_r2o(·) minimizing the indicator Q_w1. To make the implementation simple we assume that the recovery rule is linear. Then we face the same linear optimization problem as the previously discussed problem (see page 331) of direct recovery of the primary information. Therefore, after changing notation we can use the previously derived equations (7.3.26).
We denote by h^J
, ^ o o o ^
t
Using (7.3.39a) we get
Q_w2o = E w²(1)    (7.3.51)
= c_uu(1, 1) + c_uu(1, 2).    (7.3.52)
Comparing this with (7.3.49) we see that if the compression is achieved by the rejection of the first component of the decorrelated spectrum, then the performance of the optimal recovery of the spectrum is worse than in the case when the compression is achieved by the rejection of the second component. The obvious reason for this is that, contrary to the mean square values of the components of the primary information (see (7.3.1)), the mean square values of the decorrelated spectral components are different (see (7.3.39)). This suggests the following conclusion:
We optimize the truncation of the decorrelated spectrum of a 2-DIM vector information by rejecting the spectral component with the smaller mean square value.    (7.3.53)
Up to this point we considered the recovery of the spectrum. The considerations on page 321 on optimal linear approximation suggest recovering the primary information by performing on the recovered spectrum the transformation T_sd^(-1)(·), which inverts the transformation T_sd(·) producing the spectrum. Thus, we take as the recovered primary information
u_r = T_sd^(-1)(w_r),    (7.3.54)
where w_r is given by (7.3.47). Figure 7.5b illustrates this assumption. The indicator of the overall performance of the compression system (truncation after decorrelation) is
Q_td = E Σ_{k=1}^{2} [u(k) - u_r(k)]².    (7.3.55)
Because we recover the primary information from the recovered spectrum by the inverted spectral transformation, we can use the distance preservation property (7.1.58). From it and from (7.3.55), (7.3.42), and (7.3.43) we get
Q_td = Q_w1o = E w²(2).    (7.3.56)
Substituting (7.3.50) we have
Q_td = 0.083a².    (7.3.57)
To compare the described compression and recovery systems it is convenient to introduce the normalized overall performance indicator
Q' = Q / Σ_{k=1}^{2} E u²(k).    (7.3.58)
The previously obtained numerical results ((7.3.15b), (7.3.32), (7.3.35), and (7.3.57)) are summarized in Table 7.3.1.

Dimensionality Reduction                  | Optimized Recovery Rule             | Normalized Performance Indicator
Direct truncation                         | Blind for statistical relationships | 0.5
Direct truncation                         | Linear                              | 0.096
Direct truncation                         | Nonlinear                           | 0.055
Optimized truncation after decorrelation  | Linear                              | 0.055
Table 7.3.1. The comparison of the performance of the information compression systems. COMMENT 1 For the volume of the considered continuous information we use definition (6.6.22). The volume compression coefficient defined by (6.1.30 ) has for all considered systems the same value i3[r(-)]=0.5. Thus, the systems can be compared on the basis of the indicator Q' of accuracy of the recovered information. Because we optimized the recovery rules the comparison is fair.
Table 7.3.1 shows the inferiority of the system that is blind for statistical relationships between the components of primary information. The reason is obvious: the other systems use more accurate rough descriptions of the statistical properties of information. Similar is the reason of the superiority of the system with direct truncation and nonlinear recovery over the system with linear recovery. Choosing the non linear recovery rule, we utilized properties of aggregation sets, which depend on more detailed properties of the density p^(u) of the joint probability than the mean values and correlation coefficients. For optimization of the linear recovery only these properties can be used. We assumed that the system with direct truncation and the system with truncation after decorrelation use linear recovery, optimal for each of the systems. Therefore, the reason of superiority of the system with truncation after decorrelation is that the reduction of dimensionality used in this system is better suited for linear recovery as in the system with direct truncation. This is confirmed by the interesting fact that the performance of the system with direct truncation and optimized nonlinear recovery is the same as the performance of the system with truncation after decorrelation. In other words, the direct truncated information contains the same statistical information about the second component, as the truncated spectrum. However, the statistical information contained in the direct truncated information cannot be utilized by a linear transformation. The advantage of the system with truncation after decorrelation is that it is a linear system, and as we show its principle can be easily generalized for dimensionality reduction of multidimensional information. The performance of this system is better when the discrepancy between the mean square values of the components is larger. From (7.3.53) it follows that if we achieve the decorrelation by the spectral transformation based on eigen values, the performance of the dimensionality compression will be better the more diversified the eigen values of the correlation matrix are. We discuss this issue in the next subsection. 7.3.2 THE ALGORITHM FOR DIMENSIONALITY REDUCTION Some fragments of our argumentation in the previous section are more complicated than it is really needed to analyze the simple case that we considered. The purpose of this was to present an approach that can be almost directly generalized for compression of AT>2 dimensional information into M-DIM information, M
< K.
Let us denote by u(k), k = 1, 2, ..., K, the random variables representing the components of the primary information u = {u(k), k = 1, 2, ..., K}. We assume that
E u(k) = 0,  k = 1, 2, ..., K.    (7.3.60)
Our previous remarks about generalizations of the procedures derived for the study case lead us to the following algorithms describing the operation of the dimensionality compression system.
THE K-TO-M DIMENSIONALITY REDUCTION ALGORITHM
Step 1. Using the procedure described in Section 4.3.2, find the eigen values γ(k) and the eigen vectors e(k), k = 1, 2, ..., K, of the correlation matrix C_uu of the primary information.
Step 2. Arrange the eigen values in decreasing order:
γ(k_1) > γ(k_2) > ... > γ(k_K).    (7.3.61a)
Step 3. As the compressed information take
v = {w(k_1), w(k_2), ..., w(k_M)},    (7.3.61b)
where w(n) are the components of the spectrum relative to the eigen vectors e(n). Thus,
w = G_e u,    (7.3.62)
and G_e is the matrix given by (7.2.26), the rows of which are the components of the eigen vectors.
Generalizing the geometric interpretation of the two-to-one compression given in the previous section, we interpret the compression of the dimensionality as a projection of the point u_pt, representing the primary information in a K-dimensional space, on the M-dimensional space spanned on the eigen vectors e(k_1), e(k_2), ..., e(k_M), and we use conclusion (7.1.67) about optimal linear approximation; in particular, as the optimally recovered information we take the optimal linear approximation given by equation (7.1.66). This leads to the
OPTIMAL LINEAR RECOVERY ALGORITHM
Step 1. As the recovered e-spectrum take
w_ro = {v, 0} = {w(k_1), w(k_2), ..., w(k_M), 0, 0, ..., 0}.    (7.3.63)
Step 2. As the ultimately recovered information take
u_ro = E w_ro,    (7.3.64)
where E is the matrix the columns of which are the components of the eigen vectors (see (7.2.29)). Using (7.1.66) and (7.2.36), as a generalization of (7.3.56) we obtain
Q_td = Σ_{m=M+1}^{K} γ(k_m).    (7.3.65)
Generalizing (7.3.58), we define the normalized performance indicator
Q' = Q_td / Σ_{k=1}^{K} E u²(k).    (7.3.66)
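A compact sketch of the K-to-M dimensionality reduction algorithm and the optimal linear recovery algorithm from page 339 is given below; the function names, the synthetic correlation matrix, and the test train are assumptions made for this illustration and are not part of the original text.

```python
import numpy as np

def compress(u, C_uu, M):
    """K-to-M dimensionality reduction based on the eigen vectors of C_uu."""
    gamma, E = np.linalg.eigh(C_uu)          # ascending eigen values, eigen vectors in columns
    order = np.argsort(gamma)[::-1]          # step 2: decreasing order of the eigen values
    G_e = E[:, order].T                      # rows: eigen vectors of the reordered system
    w = G_e @ u                              # e-spectrum of the primary information (7.3.62)
    return w[:M], G_e                        # step 3: keep the M dominant components

def recover(v, G_e):
    """Optimal linear recovery: zeros for the rejected components, then the inverse transform."""
    K = G_e.shape[0]
    w_r = np.zeros(K)
    w_r[:len(v)] = v                         # recovered e-spectrum (7.3.63)
    return G_e.T @ w_r                       # u_ro = E w_ro (7.3.64), since G_e^{-1} = G_e^T

# Illustrative run with a synthetic correlation matrix and a synthetic train.
rng = np.random.default_rng(6)
A = rng.normal(size=(8, 8))
C_uu = A @ A.T + np.eye(8)
u = rng.multivariate_normal(np.zeros(8), C_uu)

v, G_e = compress(u, C_uu, M=6)
u_rec = recover(v, G_e)
print(np.sum((u - u_rec)**2))                # error concentrated in the rejected components
```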
EXAMPLE 7.3.1 THE RUNNING OF THE OPTIMAL COMPRESSION AND RECOVERY ALGORITHM
We assume that the correlation matrix C_uu is given in Table 5.2.1a. A standard program for finding the eigen values and eigen vectors, such as the command EIG of Matlab (a registered trade name of The MathWorks, Inc. program package), produces the matrix E defined by (7.2.24) and the eigen values γ(n), already arranged in ascending order.
Table 7.3.2 The eigen values and the eigen vectors of the correlation matrix. In the justification of the algorithm we assumed only that the correlation matrix is given, but it was not necessary to specify beyond this the joint probability distribution. Therefore, to get a typical block of primary information we can use a generator of a random train operating according to a joint probability distribution, which has the given correlation matrix, but besides this requirement is arbitrary. Thus, we can use the generator of a gaussian process described in Example 5.2.1. A typical train generated by this generator is M = {2.0885 -.1800 .0036 1.7565 1.1361 0.0803 .6100 1.4597}. After performing the matrix multiplication (7.3.62), we obtain the ^-spectrum w = {2.2364 -.0577 -.5329 .1149 2.2924 -8295 -.1839 .0210}. Let us assume that the dimensionality of the compressed information M = 6 . Since the eigen values are arranged in descending order, according to step 3 of the dimensionality reduction algorithm, we reject the last two components of the espectrum. The compressed spectrum v = {2.2364 -.0577 -.5329 .1149 2.2924 -8295}. Let us next consider the optimal recovery of the primary information. According to Step 1 of the recovery algorithm, the optimally recovered e spectrum >v,,={2.2364 -.0577 -.5329 .1149 2.2924 -8295 .0000 .0000}. After performing the matrix multiplication according to Step 2, we get the optimally recovered primary information M,,={2.0464 -.1003 -.0683 1.7791 1.1790 -.0089 -.7019 1.4123}.
The error of the recovered information is
u - u_ro = {.0421 -.0797 .0719 -.0227 -.0428 .0892 -.0919 .0474},
while the error of the recovered spectrum is
w - w_ro = {0 0 0 0 0 0 -.1839 .0210}.
Figure 7.7. Dependence of the normalized performance Q' defined by (7.3.66) of the optimal compression-recovery algorithms (page 339) on the volume M of the compressed information; the correlation matrix of the primary information is given by Table 5.2.1
The described procedure we apply for M = 1, 2, ..., 7. From (7.3.65) we get Q_td, from Table 5.2.1a we obtain E u²(n) = c_uu(n, n), and from (7.3.66) we calculate the normalized performance indicator Q'. Its dependence on M is shown in Figure 7.7. This is a typical trade-off relationship between compression quality and the volume of the compressed information. □
COMMENT 1 The example shows an important feature of the presented algorithm. Comparing the errors u-u,^ and w-w,^, we see that the inverse e-spectral transformation distributes the final error of the primary information uniformly over its components. The example shows that the performance of the optimal dimensionality reduction and recovery depend essentially on this how diversified the eigen vectors of the correlation matrix are. If they are of similar order of magnitude, the optimal dimensionality reduction and recovery would not offer substantial improvement over the direct truncation. The differences between the values of eigen values depend on the structure of the correlation matrix of the primary information. The smaller the components outside the main diagonal, the "weaker" the correlation between the components of the primary information; the smaller the differences between the eigen values, the fewer (besides spreading the error) are the advantages of the optimal algorithm based on eigen vectors spectrum over direct truncation.
COMMENT 2
Taking in place of the eigen vectors any set C of ortho-normal vectors, and instead of the eigen values the variances of the components of the g-spectrum, we can also use the compression algorithm. The problem is that in general the spectral components are less diversified than the eigen values, and after rejecting a number of spectral components and using the recovery rule we may commit a larger error than when using the eigen vectors. The reason is that in general the spectral components are correlated, and taking zeros in place of the missing components is not optimal. We could improve the performance of recovery by taking, in place of zeros, linear estimates of the rejected spectral components based on the available components, as we did considering truncation without decorrelation in the study case, but such a procedure would be tedious. However, it is shown in the next section that on quite general assumptions the correlation coefficients between the components of the harmonic spectrum are small. Then the performance of the simple recovery rule replacing the missing spectral components by zeros is not significantly worse than in the case when the eigen vectors are used. The great advantage of the harmonic spectrum described in Example 7.1.1 is that it can be calculated with the Fast Fourier Transformation algorithm, which requires much less computational power than the compression based on eigen vectors. Therefore, in practice the dimensionality compression is often based on the discrete harmonic representation of the primary information. An example is the essential step of the JPEG compression standard described in Section 2.5, page 118. As is shown there, the recovery of the truncated harmonic spectrum also has the favourable feature of dispersing the errors uniformly. Our conclusions about improving the performance of direct truncation by using non-linear recovery rules could be generalized. However, the need for more exact statistical information about the primary information and the difficulties of implementing non-linear rules cause that in applications the most suitable compressions are those based on spectral representations which reduce the correlations of their components to zero or make them small.
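As a minimal illustration of harmonic-spectrum compression, the sketch below truncates the discrete Fourier spectrum of a correlated train and recovers the train by replacing the rejected components with zeros. The train generation (first-order recursive smoothing of white noise) and the numbers of components are illustrative assumptions of this sketch, not values taken from the text.

import numpy as np

rng = np.random.default_rng(1)
K, M = 64, 16                      # train dimensionality and number of retained spectral components

# An illustrative correlated train: first-order recursive smoothing of white noise.
e = rng.standard_normal(K)
u = np.zeros(K)
for n in range(1, K):
    u[n] = 0.9 * u[n - 1] + e[n]

V = np.fft.rfft(u)                 # harmonic spectrum (real FFT)
V_c = V.copy()
V_c[M:] = 0.0                      # reject the high-frequency components
u_rec = np.fft.irfft(V_c, n=K)     # recovery: missing components replaced by zeros

err = u - u_rec
print("mean square error:", np.mean(err ** 2))
print("the error is dispersed over all components:", np.round(err[:8], 3))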
7.4 SPECTRAL REPRESENTATIONS AND REDUCTION OF DIMENSIONALITY OF FUNCTION INFORMATION
We show now that the previously discussed methods of reducing the dimensionality of vector information can be generalized for functions of one or more continuous arguments. By "reduction of dimensionality" of a function we mean the transformation of the function information into vector information with a finite number of components. The two basic types of such a transformation are (1) sampling and (2) spectral representation with retaining only a given number of its components. Both methods of reducing the dimensionality of a function of a continuous argument(s) are closely connected. We concentrate on compression using spectral representations and we describe methods utilizing deterministic and statistical features of the compressed information. Sampling we discuss only briefly, emphasizing its relationships with compression using spectral representations. The presentation is based on analogies with the previously discussed reduction of dimensionality of vector information.
7.4.1 BASIC CONCEPTS
We consider first spectral representations of a scalar function on the assumption that the argument (called time) takes values confined to a finite interval. To simplify the notation, instead of the standard interval <0, T> we take the symmetric interval <-T/2, T/2>, and we assume that the considered function u(t) satisfies the condition
E = ∫_{-T/2}^{T/2} u^2(t) dt < ∞.        (7.4.1)
If u(t) has the meaning of an electrical potential or an electrical current, then the integral in equation (7.4.1) is proportional to the energy dissipated in a resistor. Therefore, the integral E given by (7.4.1) is called the energy of the function, and a function satisfying condition (7.4.1) is called an energy function. Functions of a continuous argument can be interpreted as points when a distance is introduced. We denote as a(·) = {a(t), t ∈ <-T/2, T/2>} a function considered as a whole, and similarly b(·) denotes another function. The counterpart of the distance (7.1.20) is the distance defined by (1.4.9), which in the notation used now takes the form

d[a(·), b(·)] = [ ∫_{-T/2}^{T/2} [b(t) - a(t)]^2 dt ]^{1/2}.        (7.4.2)
The functions can be interpreted as vectors when a scalar product is defined. For functions as a whole we define, similarly to (7.1.8), the scalar product

(a_vc, b_vc) = ∫_{-T/2}^{T/2} a(t) b(t) dt,        (7.4.3a)

where a_vc and b_vc denote the two considered functions interpreted as vectors. Consequently, we set

|a_vc| = [ ∫_{-T/2}^{T/2} a^2(t) dt ]^{1/2}.        (7.4.3b)
For the distance, scalar product, and vector length defined in this way the basic relationship (7.1.21) holds for functions as a whole. To get a representation of a function as a whole in a form similar to equation (7.1.43) we must take infinity as the upper summation limit. It is shown (see, e.g., Curtain, Pritchard [7.7]) that sets of functions

g(k, t), k = 1, 2, ..., t ∈ <-T/2, T/2>,        (7.4.4)

exist such that the functions are ortho-normal, that is

(g_vc(k), g_vc(l)) = 1 for l = k, 0 for l ≠ k,        (7.4.5a)

and complete, that is

lim_{K→∞} d[ u(·), Σ_{k=1}^{K} v(k) g(k, ·) ] = 0.        (7.4.5b)
The functions g(k, t) are called basis functions and their set is called a complete set of ortho-normal functions. The fundamental relationship (7.4.5b) can be written symbolically as

u(t) = Σ_{k=1}^{∞} v(k) g(k, t), t ∈ <-T/2, T/2>.        (7.4.6a)

This is the counterpart of (7.1.43) for functions of a continuous argument. Thus, the set of coefficients {v(k), k = 1, 2, ...} has the meaning of the spectrum of the function u(·) as a whole. Similarly to equation (7.1.29) we get

v(l) = ∫_{-T/2}^{T/2} u(t) g(l, t) dt.        (7.4.6b)
Thus, we came to the conclusion A function of a continuous argument satisfying the general assumptions Al and A2 (page 343) can be represented by a spectrum which is (7.4.6c) discrete, but has infinitely many elements. In terms of information processing this conclusion means that the considered transformation simplifies the primary function information into vector information, however, with infinitely many components^. We denote as C^o. the infinite complete set of orto-normal basis functions g{k, /), A:=l, 2, • • 00. This set is a counterpart of the set of coefficients g{k, I) A:=l, 2, • • describing the set C^ of unit coordinate vector gvc(^)- As indicated in Comment 2 page 316 there is a continuum of orthogonal coordinate systems. Similarly there is a continuum of complete ortho-normal sets C^o. of basis functions. Several such specific sets have been analyzed in detail (e.g., harmonic functions, Hermite, Laguerre, Legendre, Haar, Heinkel to mention only few). The spectral representation (7.4.7) has also the distance preserving property. Particularly, the counterpart of (7.1.60) is
∫_{-T/2}^{T/2} u^2(t) dt = Σ_{k=1}^{∞} v^2(k).        (7.4.7)
The optimality property (7.1.65) also has its counterpart. The approximation of the function by an incomplete ortho-normal set is

u*(t) = Σ_{m∈W} v(m) g(m, t),        (7.4.8)

where W is the set of numbers of the ortho-normal functions used for the approximation. The counterpart of the equation (7.1.66) giving the error of the optimal approximation is

∫_{-T/2}^{T/2} [u(t) - u*(t)]^2 dt = Σ_{m∈W_c} v^2(m),        (7.4.9)

where W_c is the rest of the set {1, 2, ..., ∞} after subtracting W.
Every practical application of a spectral representation requires generation of the basis functions. Therefore, of greatest practical importance are complete orthogonal sets of basis functions that can be obtained from a single prototype function by such simple transformations as a shift and/or a change of the scale of the argument. For a general description of such functions see Mallat [7.8]. Here two important examples of such prototype functions are used: harmonic functions (we call so the cosine, sine, and complex exponential functions) and the sine over argument function. Similarly to the choice of a spectral representation of a vector, the choice of the spectral representation of a function of a continuous argument depends primarily on the subsequent processing of the information. The harmonic functions play a special role, since they are invariant to stationary linear transformations. Therefore, spectral representations of functions of a continuous argument using harmonic functions as basis functions are of paramount practical importance. Such a representation is called a harmonic spectral representation. There are many excellent books discussing the harmonic representation of functions and its applications to the analysis of linear, stationary systems (see, e.g., Oppenheim, Willsky [7.9]). The harmonic representations are discussed here only briefly, to explain some concepts used in this book and to give additional insight into the problems of spectral representations of structured information.
7.4.2 HARMONIC SPECTRAL REPRESENTATIONS
The classical result of functional analysis is that for functions considered in the interval <-T/2, T/2> the set of functions

1/√T, √(2/T) cos kω₁t, k = 1, 2, ..., √(2/T) sin kω₁t, k = 1, 2, ...,        (7.4.10)

where

ω₁ = 2π/T,        (7.4.11)

is a complete set of ortho-normal functions. For this set the representation (7.4.6a) takes the form
u(t) = v(0) + Σ_{k=1}^{∞} [ v_c(k) cos kω₁t + v_s(k) sin kω₁t ], t ∈ <-T/2, T/2>,        (7.4.12a)

and from (7.4.6b) we have

v(0) = (1/T) ∫_{-T/2}^{T/2} u(t) dt,  v_c(k) = (2/T) ∫_{-T/2}^{T/2} u(t) cos kω₁t dt,  v_s(k) = (2/T) ∫_{-T/2}^{T/2} u(t) sin kω₁t dt,  k = 1, 2, ...        (7.4.12b)
Using the equations cos α = ½(e^{jα} + e^{-jα}) and sin α = -½ j(e^{jα} - e^{-jα}), where j = √-1, we write equations (7.4.12) in the simpler form

u(t) = Σ_{k=-∞}^{∞} v(k) e^{jkω₁t}, t ∈ <-T/2, T/2>,        (7.4.13a)

where

v(k) = (1/T) ∫_{-T/2}^{T/2} u(t) e^{-jkω₁t} dt.        (7.4.13b)

Since u(t) is real,

v(-k) = v̄(k),        (7.4.14)

where v̄ is the complex conjugate of v.
The consequence of (7.4.14) is that the complex spectrum {v(k), k = -∞, ..., -1, 0, 1, ..., ∞} is redundant. The reward for the redundancy is that the representation (7.4.13) is simpler and its manipulation much easier than that of the equivalent, but non-redundant, representation (7.4.12). As an example we take the rectangular pulse

u(t) = A for -τ ≤ t ≤ τ, u(t) = 0 for |t| > τ.        (7.4.15)

From (7.4.13b) we obtain its spectrum

v(k) = (2Aτ/T) · sin(kω₁τ)/(kω₁τ).        (7.4.16)
The function occurring on the right is called the sine over argument function. The diagram of the considered pulse and its spectrum are shown in Figure 7.8.
Figure 7.8. The rectangular pulse and its harmonic spectrum representation: (a) in the interval <-T/2, T/2>, (b) in the interval <-T, T>.
For practical purposes the representation (7.4.13) is sufficient, because in applications we always process information in a finite time interval. However, analytical operations on the sum occurring in (7.4.13a) are tedious. Therefore, very useful for analytical considerations is a limiting form of the representation (7.4.13) in which an integral occurs instead of the sum. We now sketch the derivation of such a representation. We consider the dependence of the spectral representation (7.4.13) on the length T of the interval <-T/2, T/2> in which the function is analyzed, when T grows. We assume that the condition (7.4.1) is always satisfied, thus

∫_{-T/2}^{T/2} u^2(t) dt < const < ∞.        (7.4.17)

This condition causes that

|u(t)| → 0 for |t| → ∞.        (7.4.18)
Thus, in a sufficiently coarse scale the function u(t) looks like a pulse. We denote as Δ_t the duration of the time interval outside of which the function u(·) takes only negligibly small values, and we call Δ_t the effective duration of the function u(·). To make this definition concrete we must say precisely what "negligibly small" means. Several such definitions are used. In theoretical considerations the effective duration is defined as the radius of inertia of the area under the diagram of the function. We now consider the harmonic spectrum of a fixed pulse-like function when the length T of the interval increases. From equation (7.4.11) it follows that changing T we change ω₁. Since in the following considerations not the number k of the harmonic but the value kω₁ is important, we denote

ω_k = kω₁.        (7.4.19)

The pulse given by (7.4.15) is a representative example of a pulse-like function. Figure 7.8b shows that when the length of the observation interval is doubled, two effects occur: the values of the spectral components decrease, but simultaneously the distances between the angular frequencies of the spectral components decrease. Thus, we have a similar effect as described in Section 4.2.1 and illustrated in Figure 4.3. Therefore, similarly to (4.2.9), for T large (compared with the effective duration of the function) we describe the spectrum by the density

V(ω_k) = 2π v(k)/Δω,        (7.4.20)

where

Δω = ω_k - ω_{k-1} = ω₁.        (7.4.21)

Using (7.4.11) and (7.4.13b) we write (7.4.20) in the form

V(ω_k) = ∫_{-T/2}^{T/2} u(t) e^{-jω_k t} dt.        (7.4.22)
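The effect just described (doubling T halves the spectral values and the spacing of the spectral lines, while the density V(ω_k) stays the same) can be checked numerically. The short sketch below computes v(k) for the rectangular pulse directly from (7.4.13b); the pulse parameters A = 1 and τ = 1 and the interval lengths are illustrative assumptions of the sketch.

import numpy as np

def v_k(T, k, tau=1.0, A=1.0, n=20001):
    """Spectral component v(k) of the rectangular pulse, computed from (7.4.13b) numerically."""
    t = np.linspace(-T / 2, T / 2, n)
    u = np.where(np.abs(t) <= tau, A, 0.0)
    return np.trapz(u * np.exp(-1j * k * 2 * np.pi / T * t), t) / T

# The same angular frequency pi/2 corresponds to k = 1 for T = 4 and to k = 2 for T = 8.
for T, k in ((4.0, 1), (8.0, 2)):
    v = v_k(T, k).real
    print(f"T = {T}: line spacing = {2*np.pi/T:.3f}, v({k}) = {v:+.4f}, V = T*v({k}) = {T*v:+.4f}")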
Similarly to (4.2.13), for T → ∞ the equation (7.4.13a) takes the form

u(t) = (1/2π) ∫_{-∞}^{∞} V(ω) e^{jωt} dω, -∞ < t < ∞,        (7.4.23a)

while equation (7.4.22) takes the form

V(ω) = ∫_{-∞}^{∞} u(t) e^{-jωt} dt, -∞ < ω < ∞.        (7.4.23b)
This spectral representation is called the continuous Fourier transformation. From (7.4.14) it follows that V(-ω) = V̄(ω). Thus, the representation (7.4.23a) is redundant and our remarks in Comment 1, page 318, apply. The continuous spectrum also has the distance preserving property; in particular, the counterpart of (7.1.60) holds and has the form

∫_{-∞}^{∞} u^2(t) dt = (1/2π) ∫_{-∞}^{∞} |V(ω)|^2 dω.        (7.4.24)

From this and from (7.4.17) it follows that the integral on the right side of (7.4.24) is finite. This causes that, similarly to (7.4.18),

|V(ω)| → 0 for |ω| → ∞.        (7.4.25)
Thus, in a sufficiently coarse scale, the function |V(·)| is pulse-like. Its width, defined similarly to the duration of the function, is called the effective bandwidth of the spectrum and is denoted as Δ_ω. One of the basic results of the theory of the Fourier transformation is:
The product of the effective widths of functions coupled by a Fourier transformation cannot be smaller than a constant of the order of magnitude of 2π.        (7.4.26)
For typical definitions of the effective widths and for most processes occurring in applications we have

Δ_t Δ_ω = 2πλ,        (7.4.27)

where λ is of the order of magnitude of 1.
COMMENT 1
Using definition (7.4.20) we write the representation (7.4.23a) in the form

u(t) = (1/2π) ∫_{-∞}^{∞} e^{jωt} dA(ω), -∞ < t < ∞,        (7.4.28)
where dA(ω) = V(ω) dω. Comparing (7.4.28) and (7.4.13a) we may say that the representation (7.4.23a) can be interpreted as a superposition of a continuum of complex exponential functions with infinitely small amplitudes dA(ω).
COMMENT 2
The discrete counterpart of the representation (7.4.23) is the discrete Fourier representation (7.1.55). Both transformations illustrate our earlier remarks in Section 1.4.3 about continuous models. For numerical calculations we use the discrete transformation, and the discrete models describe the real processes with sufficient accuracy. The continuous transformation is a limiting case of the discrete transformation. The advantage of the continuous transformation is that it can be analyzed in depth, and such an analysis gives more insight into the general properties of spectral representations. The price we pay for this are several peculiarities of the continuous harmonic spectral representation that are related to the continuous model but do not have counterparts in real systems.
COMMENT 3
The harmonic spectrum (7.4.13b) of a function analyzed in a finite interval is a function of a discrete argument, while the primary function is a function of a continuous argument. Thus, the structures of the spectrum and of the original function are different, and in this sense the representation is odd. On the contrary, the discrete Fourier representation (7.1.55a), (7.1.55b) and the continuous representation (7.4.23a), (7.4.23b) are symmetric in the sense that the primary function and its spectrum have the same structure.
The symmetry is advantageous in numerical calculations using the discrete Fourier representation and in analytical considerations using the continuous harmonic representation. The symmetry of the continuous harmonic representation causes that in formulas and theorems about the relationships between the primary function u(t) and its continuous spectrum V(ω) the two can be interchanged (with minor changes corresponding to the different coefficients in (7.4.23a) and (7.4.23b)). This is called the duality principle.
GENERALIZATIONS
To simplify the presentation, a scalar function of a scalar argument has been considered. Most of the presented concepts can be generalized for functions of more than one argument. The two-dimensional generalization of the spectral representation (7.4.23) is the spectral representation of black-white images described by equations (1.3.9) and (1.3.10):

u[r(1), r(2)] = (1/(2π)^2) ∫_{-∞}^{∞} ∫_{-∞}^{∞} V[ω(1), ω(2)] e^{j[ω(1)r(1)+ω(2)r(2)]} dω(1) dω(2),        (7.4.29a)

V[ω(1), ω(2)] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} u[r(1), r(2)] e^{-j[ω(1)r(1)+ω(2)r(2)]} dr(1) dr(2).        (7.4.29b)
In a similar way the spectral representation (7.4.13) and the discrete Fourier transformation (7.1.55) are generalized. The discrete cosine transformation (2.5.2) and (2.5.3) used in Section 2.5 is an obvious modification of the latter transformation. For an introduction to spectral transformation of images and their applications see Lim [7.10], for more details Schalkoff [7.11], Russ [7.12].
7.4.3 DETERMINISTIC REDUCTION OF DIMENSIONALITY OF FUNCTION-INFORMATION
A function of a continuous argument is characterized by such features as the continuity of the function, the continuity of its derivatives, discontinuities, etc. Many of those features are reflected in the features of the harmonic spectrum of the function. A function can often be considered as the product of a linear transformation which influences the harmonic spectrum of the function in a specific way. A typical example is a function with a spectrum which practically does not include components with frequencies higher than a limiting value. For these reasons constraints are often imposed not directly on a function but on its harmonic spectrum. Such constraints are counterparts of the constraints imposed on continuous vector information discussed in Section 6.6.2 (see in particular Example 6.6.2).
Usually the discussed features of functions, in particular the features of their harmonic spectra, characterize each considered function individually and can be used to reduce the dimensionality of each function. Such a reduction is called deterministic dimensionality reduction. The reduction of dimensionality of vector information which we considered in the previous section was based on statistical properties of the compressed information. The counterpart of such compression is the reduction of dimensionality using statistical properties of the considered functions. It compresses the dimensionality in the sense of a statistical indicator. Such a compression is called statistical reduction of dimensionality of function-information. It is discussed in the next section. The basic deterministic dimensionality compression methods are the methods using spectral representations and sampling. We present here a brief review of these two methods, emphasizing their relationships.
DIMENSIONALITY REDUCTION BASED ON SPECTRAL REPRESENTATION
The spectral methods of transforming a function into a vector having a finite number of elements are based on the following generalization of conclusion (7.1.65):
The optimal approximation of a function u(t), t ∈ <-T/2, T/2>, by a linear combination of a given set of ortho-normal basis functions is obtained when the coefficients of the combination are the corresponding spectral components of the function.        (7.4.30)
We denote as {v(k), k = -K, ..., K} the set of retained harmonic spectral components        (7.4.31)
and take as the approximation

u*(t) = Σ_{k=-K}^{K} v(k) e^{jkω₁t}.        (7.4.32)
As the indicator of the difference between the primary process and the approximation (the indicator of the approximation error) we take the square of the distance d[u(·), u*(·)]. Using (7.4.13a), (7.4.31), and (7.4.8) modified for complex spectra we obtain

d^2[u(·), u*(·)] = ∫_{-T/2}^{T/2} [u(t) - u*(t)]^2 dt = 2T Σ_{k=K+1}^{∞} |v(k)|^2.        (7.4.33)
From this equation it follows that to keep the distortions low we have to include into the set of retained spectral coefficients all the large coefficients. To do it in a systematic way we would have to calculate possibly many spectral coefficients and reject the smallest. This is done in the essential step of the JPEG compression procedure transforming the matrix (2.5.4) of primary spectral components into the matrix (2.2.9). However, the numerical calculation of many coefficients is tedious. Therefore, general guidelines are useful for the choice of the suitable number K of harmonic spectral components that should be included in the compressed description of the primary process u(t), t ∈ <-T/2, T/2>. For this purpose we look for a continuous approximation u_ap(t) of the primary process for which the continuous spectrum V_ap(ω)
can be easily calculated from equation (7.4.23b), and we estimate the discrete spectral components from equation (7.4.20), which we write in the form

v(k) = (ω₁/2π) V(kω₁),        (7.4.34)

where ω₁ = 2π/T. Still simpler, but less accurate, would be to estimate the effective duration Δ_t of the primary process, to use equation (7.4.27) to estimate the effective bandwidth Δ_ω, and to reject the harmonic components with the absolute value of the angular frequency larger than the estimated effective bandwidth.
COMMENT 1
The dimensionality of the truncated spectrum is a counterpart of the number of potential forms of quantized information that is necessary to recover with some accuracy the primary one-dimensional information, as discussed in Section 6.6.2. Thus, the dimensionality has the meaning of the relative volume of a time-continuous function, related to the accuracy of the optimal recovery. In the next section we present the sampling theorem for strictly band-limited functions, which corresponds to another point of view on the volume of time-continuous functions.
DIMENSIONALITY REDUCTION BY SAMPLING
We discuss here the conclusions of the theory of the continuous harmonic spectral representation about representing continuous information by samples. Let us assume that the primary process takes non-zero values only inside an interval <-τ, τ>:

u(t) = 0 for t < -τ and for t > τ.        (7.4.35)

Consider the spectral representation (7.4.13a) for t ∈ <-τ, τ>. From (7.4.13b) it follows that

v(k) = (1/2τ) V(ω_k),        (7.4.36)

where V(ω) is given by (7.4.23b). Thus, the spectral components v(k) of the representation of the function u(t) within the finite interval <-τ, τ> are proportional to samples of the continuous spectrum V(ω) representing the function u(t) in the infinite interval (-∞, ∞). Knowing the spectral components v(k) we can calculate u(t) from (7.4.13a), substitute it in (7.4.23b), and obtain the continuous spectrum V(ω), ω ∈ (-∞, ∞). Thus:
If the primary function u(t) takes only zero values outside an interval <-τ, τ>, then the continuous spectrum V(ω) is exactly determined by its samples taken with sampling period π/τ.        (7.4.37)
This conclusion is called the spectrum sampling theorem. Its interpretation is that the condition that u(t) = 0 in the infinite intervals (-∞, -τ) and (τ, ∞) interrelates the values of the continuous spectrum V(ω), ω ∈ (-∞, ∞), so strongly that only a discrete (although infinite) set of the samples is independent. The dual theorem is:
If the continuous spectrum V(ω) takes only zero values outside an interval <-2πB, 2πB> (thus, it is a base-band process), then the primary function u(t) is exactly determined by its samples taken with sampling period 1/2B.        (7.4.38)
This theorem is called the sampling theorem and is of paramount practical importance. Formalizing the dual counterpart of the reasoning that led us to the spectrum sampling theorem, we obtain the explicit representation of a base-band process by its samples:

u(t) = Σ_{k=-∞}^{∞} u(t_k) · sin 2πB(t - t_k) / [2πB(t - t_k)], -∞ < t < ∞,        (7.4.39)

where

t_k = kT_s and T_s = 1/2B.        (7.4.40)

We write (7.4.39) in the form

u(t) = Σ_{k=-∞}^{∞} u(t_k) g(k, t),        (7.4.41)

where

g(k, t) = sin 2πB(t - t_k) / [2πB(t - t_k)].        (7.4.42)
The functions defined by this formula are called shifted sine over argument functions. It can be easily proved that these functions are orthogonal in the interval <-∞, ∞>. Since the representation (7.4.39) is possible for any base-band function, the set of shifted sine over argument functions is a complete ortho-normal set in the class of base-band functions. Thus, in the interval <-∞, ∞> we may represent a base-band process u(t) in the spectral form (7.4.6a); in particular, we may calculate the spectral coefficients v(k) from equation (7.4.6b) with T → ∞. However, this is not necessary. The sine over argument functions have the obvious property

sin 2πB(t_m - t_k) / [2πB(t_m - t_k)] = 0 for m ≠ k, 1 for m = k.        (7.4.43)

Therefore,

v(k) = u(t_k).        (7.4.44)

The interpretation of this is:
The set of samples of a base-band function, taken with the period 1/2B, is the spectrum of this function relative to the shifted sine over argument functions.        (7.4.45)
Let us assume that T > T_s. From (7.4.40) it follows that this is equivalent to the condition

2TB > 1.        (7.4.46)

On such an assumption, for t ∈ <-T/2, T/2> but not close to the end points of this interval, dominant in the sum (7.4.39) are the components corresponding to sampling points t_k lying in the interval <-T/2, T/2>. Thus,

u(t) ≈ Σ_{k=-K/2}^{K/2} u(t_k) sin 2πB(t - t_k) / [2πB(t - t_k)], t ∈ <-T/2, T/2>, K = 2TB.        (7.4.47)
Since the shifted sine over argument functions are orthogonal, from conclusion (7.1.65) it follows that the approximation (7.4.47) has an optimal character. From an obvious generalization of (7.4.7) it also follows that

Σ_{k=-K/2}^{K/2} u^2(t_k) T_s ≈ ∫_{-T/2}^{T/2} u^2(t) dt.        (7.4.48)
The interpretation of (7.4.47) is:
For large K = 2TB the set of samples {u(t_k), k = -K/2, ..., -1, 0, 1, ..., K/2} is an approximate representation of a segment of duration T of a base-band process with bandwidth B.        (7.4.49)
A more detailed analysis shows various deficiencies of the discussed representation of a base-band function, in particular of the approximation (7.4.47). They are caused by the slow decay of the sine over argument function for large values of the argument. This, in turn, is related to the assumption that the considered function, and in consequence the shifted sine over argument functions, are exactly base-band functions. A family of spectral representations which are similar to the discussed sine over argument representation but do not have its deficiencies is based on prototype functions called wavelets. For an introduction see, e.g., Rioul, Vetterli [7.13], and for more details Mallat [7.8], Young [7.14], and Wickerhauser [7.15]. An essential generalization of the representation (7.4.47) of a base-band process by its samples is a hierarchy of representations that can produce a hierarchy of approximations of the primary process, having increasing accuracies and being optimal at each accuracy level. They are called multi-resolution representations. The representations are based on a hierarchy of sets of orthogonal functions produced by shifts and scale changes of a prototype function, called a wavelet. Suppose that the primary process is a base-band process and B is the highest frequency of its harmonic components. This process is sampled with the period T_s = 1/2B. In the first stage the train of samples is fed into two discrete-time linear systems, as described in Section 3.2.4. One system produces the level 1 coarse approximation of the primary train and the other system produces the level 1 difference train. From both trains only every second element is retained; thus, they are decimated as shown in Figure 1.20b. The difference train is stored and the level 1 coarse approximation is forwarded to stage 2. At stage 2 the level 1 coarse approximation is processed similarly as the primary train was processed in stage 1. Suppose that the process ends at level J. Then the produced representation of the primary train consists of the level J coarse approximation and of the level 1 to level J difference trains. To recover the primary train with a desired accuracy the procedure is reversed. To produce the recovered train of the lowest quality, zeros are inserted in the level J coarse approximation in place of the dropped elements, and the zero-stuffed train is processed by a linear system. In a similar way the level J difference train is processed and added to the processed level J coarse approximation. The result is the level J recovered train, which has the lowest accuracy. A recovered train of the next higher accuracy is obtained by similar processing of the already recovered train of lowest accuracy and adding the processed difference train of level J-1. This procedure can be continued till the primary train is recovered with the highest accuracy, corresponding to level 1. For a detailed description of the multi-resolution representation see the classical paper by Mallat [7.8], and for more details Wickerhauser [7.15], Vetterli, Kovacevic [7.16], and Chui [7.17]. A minimal sketch of this analysis-synthesis scheme is given below.
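The sketch below implements the described procedure with the simplest possible pair of analysis filters (the Haar pair), which is an assumption of the sketch and not the general wavelet filters of the cited references. The function and variable names are illustrative.

import numpy as np

def analyze(x):
    """One stage: coarse approximation and difference train, decimated by 2 (Haar filters)."""
    x = np.asarray(x, float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2)     # coarse approximation of this level
    d = (x[0::2] - x[1::2]) / np.sqrt(2)     # difference train of this level
    return a, d

def synthesize(a, d):
    """Inverse of one stage: up-sample, filter, and interleave the two trains."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def multiresolution(x, J):
    """Level-J coarse approximation plus the difference trains of levels 1..J."""
    diffs, a = [], np.asarray(x, float)
    for _ in range(J):
        a, d = analyze(a)
        diffs.append(d)
    return a, diffs

def recover(a, diffs):
    for d in reversed(diffs):
        a = synthesize(a, d)
    return a

x = np.random.default_rng(2).standard_normal(16)
a, diffs = multiresolution(x, J=2)
print("perfect recovery:", np.allclose(recover(a, diffs), x))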
7.4.4 STATISTICAL REDUCTION OF DIMENSIONALITY OF FUNCTION-INFORMATION
We assume now that a time-continuous function-information can be interpreted as an observation of a stochastic process, and we discuss transformations of such functions into finite-dimensional vector information that use the statistical properties of the primary information. Those transformations can be considered as generalizations of the transformations reducing the dimensionality of vector information presented in Section 7.3. We begin with a review of the harmonic analysis of stochastic processes. This analysis also gives more insight into the previously discussed spectral representations of deterministic functions.
SPECTRAL REPRESENTATION OF STOCHASTIC PROCESSES
On very general assumptions a time-continuous stochastic process can be represented as a superposition of harmonic functions multiplied by coefficients which are random variables. We consider first the representation of a time-continuous stochastic process u(t) in the finite interval <-T/2, T/2>. The representation (7.4.13) takes the form

u(t) = Σ_{k=-∞}^{∞} v(k) e^{jkω₁t}, t ∈ <-T/2, T/2>,        (7.4.50a)

where

v(k) = (1/T) ∫_{-T/2}^{T/2} u(t) e^{-jkω₁t} dt        (7.4.50b)
is the random variable representing the kth spectral component. Equation (7.4.50b) shows that the amplitude v(k) of a spectral component is a linear transformation of the primary process u(t). Therefore, the averages, variances, and correlation coefficients of the random variables v(k) can be expressed in terms of the average and the correlation function defined by (5.2.38); see, e.g., Papoulis [7.18] and, for more details, Lapierre, Fortet [7.19]. The latter are given by double integrals of the weighted correlation function of the primary process, which are counterparts of the matrix multiplication in equation (4.4.37). The harmonic spectral representation (7.4.23a) of a process on the whole time axis is suitable only for pulse-like functions, which for large |t| decay to zero so fast that the energy of the function defined by (7.4.1) is finite; thus, the function is an energy function. Often we have to do with processes that even during long periods of time vary in a similar range; thus, they are not pulse-like. A more natural model of such processes are processes which on the whole time axis have an instantaneous power of the same order of magnitude. If such a process is an electrical process, then the energy dissipated in a resistor during a period of time grows linearly with the duration of the observation. Thus, the limit

lim_{T→∞} (1/T) ∫_{-T/2}^{T/2} u^2(t) dt = const > 0.        (7.4.51)
Such functions are called power functions. The deterministic harmonic spectral representation of power functions, which is a counterpart of the previously discussed representation of energy functions, is quite tedious. However, often a power function can be considered as a segment of a stationary stochastic process. The basic concepts and results of the theory of harmonic representations of stationary processes have a relatively simple interpretation and give much insight into the properties of power functions. Therefore, we present here a sketchy description of this theory. For a stationary random process the average Eu^2(t) has the meaning of the instantaneous average power of the process. Since for a stationary process it does not depend on the time t,

Eu^2(t) = W = const,        (7.4.52)

and W also has the meaning of the average power of the process. A precise mathematical analysis (see, e.g., Fortet, Lapierre [7.19]) shows that a time-continuous stationary stochastic process can be represented in the infinite time interval in the form

u(t) = (1/2π) ∫_{-∞}^{∞} e^{jωt} dA(ω), -∞ < t < ∞,        (7.4.53)
where dA(ω) has the meaning of an infinitely small random complex amplitude of the harmonic function e^{jωt}. This is a counterpart of the representation (7.4.28). We denote as ΔW(ω) the total average power of the harmonic components with angular frequencies lying in the band <ω, ω+Δω>. It can be shown (see, e.g., Fortet, Lapierre [7.19]) that for stationary processes the limit

S(ω) = lim_{Δω→0} ΔW(ω)/Δω        (7.4.54)

exists, and

E|dA(ω)|^2 = S(ω) dω,        (7.4.55)
E dA(ω') dA(ω'') = 0 for ω' ≠ ω''.        (7.4.56)

The function S(ω) is called the power spectral density. Since the total power is finite, the power spectral density is a pulse-like function. Equation (7.4.56) tells us that the infinitely small random amplitudes of the harmonic components are non-correlated. This is a consequence of the assumption that the process is stationary. One of the basic results of the theory of harmonic spectral representations of stationary processes is that the correlation function γ(τ) given by (5.2.44) and the spectral density S(ω) are coupled by the Fourier transformation (7.4.23). Thus,

γ(τ) = (1/2π) ∫_{-∞}^{∞} S(ω) e^{jωτ} dω, -∞ < τ < ∞,        (7.4.57a)

S(ω) = ∫_{-∞}^{∞} γ(τ) e^{-jωτ} dτ, -∞ < ω < ∞.        (7.4.57b)

This relationship is called the Wiener-Khinchin theorem.
From the definitions (5.2.40) and (5.2.44) of the correlation function of a stationary process it follows that

Eu^2(t) = γ(0),        (7.4.58)

while from the Wiener-Khinchin theorem (7.4.57) we have

(1/2π) ∫_{-∞}^{∞} S(ω) dω = γ(0).        (7.4.59)

From these two equations we get

Eu^2(t) = (1/2π) ∫_{-∞}^{∞} S(ω) dω.        (7.4.60)
This equation is plausible in view of the definition of the power spectral density. When the angular frequencies of all spectral components lie in a frequency band <-2πB, 2πB>, the process is said to be a base-band process. When

S(ω) = S₀ for -2πB ≤ ω ≤ 2πB, S(ω) = 0 for ω < -2πB, ω > 2πB,        (7.4.61)

the spectrum is said to be uniform. For a process with such a spectral density we obtain from equation (7.4.60)

Eu^2(t) = 2BS₀.        (7.4.62)

Using (7.4.57) we calculate the correlation function of the process with the uniform spectrum given by (7.4.61) and we obtain

γ(τ) = σ^2 sin 2πBτ / (2πBτ),        (7.4.63)

where

σ^2 = Eu^2(t).        (7.4.64)

From equation (7.4.63) and from the mentioned properties of the sine over argument function it follows that:
Any two samples u(t') and u(t'') of a stationary process with a uniform spectral density, such that t'' - t' = kT_s, are uncorrelated,        (7.4.65)
where T_s = 1/2B. We used this conclusion when deriving the basic formula (5.2.50).
STATISTICAL REDUCTION OF DIMENSIONALITY
In Section 7.3 we presented methods of reducing the dimensionality of vector information using statistical properties of the primary information. We show now the generalization of those methods to the transformation of a function-information into vector information.
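Conclusion (7.4.65) can be checked numerically. The sketch below generates an approximately base-band process with a uniform spectrum by filtering white noise with an ideal low-pass filter in the frequency domain (this generation method, the bandwidth, and the simulation rate are assumptions of the sketch) and estimates the correlation of samples spaced by T_s = 1/2B.

import numpy as np

rng = np.random.default_rng(3)
B, fs, N = 1.0, 32.0, 2 ** 16          # bandwidth, simulation rate, number of points (illustrative)
dt = 1.0 / fs

# White noise filtered by an ideal low-pass filter: approximately uniform spectrum in <-2*pi*B, 2*pi*B>.
W = np.fft.rfft(rng.standard_normal(N))
f = np.fft.rfftfreq(N, dt)
W[f > B] = 0.0
s = np.fft.irfft(W, n=N)

Ts = 1.0 / (2 * B)                     # sampling period of conclusion (7.4.65)
lag = int(round(Ts / dt))

def corr(x, m):
    return np.mean(x[:-m] * x[m:]) / np.mean(x * x)

print("normalized correlation at lag T_s  :", round(corr(s, lag), 3))       # close to 0
print("normalized correlation at lag T_s/2:", round(corr(s, lag // 2), 3))  # clearly non-zero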
The compressed vector information is a finite subset of the infinitely many spectral components of the processed function-information. The difference with the deterministic approach is that as the criterion for the quality of recovery, and in consequence as the criterion for the choice of the retained spectral components, the statistical average power of the error is taken. This causes that the compression of some individual functions may have a lower quality than the previously discussed deterministic compression, but the compression of a train of several processes has a better quality. This is the same effect as in the case of the lossless compression of trains of discrete information discussed in Section 6.2.2 (see Comment 2, page 266). The other difference between the statistically optimal compression and the deterministic optimal compression based on conclusion (7.4.30) is that taking the deterministic approach we have to calculate anew, for every processed function-information, sufficiently many spectral components and decide which should be rejected. Taking the statistical approach we do this only once (strictly, once in a stabilization interval). The price we pay for it is the cost of acquiring the correlation function of the primary process. As an approximation of the correlation function we may take the arithmetical average of products of shifted samples. The considerations in Section 7.3 can be generalized for the optimization of a transformation of a function-information into vector information. It is easily seen that if the spectral components are not correlated, then the average power of the recovery error is minimized when the spectral components with large variances, i.e., with large average powers, are retained as components of the compressed vector. The counterpart of the decorrelating spectral representation of vector information described in Section 7.2 is the spectral representation using the eigen functions as basis functions. Those functions are solutions of an integral equation with the correlation function as the kernel, which is the counterpart of the matrix equation (7.2.8). Such a spectral representation based on eigen functions is called the Loeve-Karhunen representation. For a function-information, finding the eigen functions and eigen values would be prohibitively complicated for a typical correlation function. Therefore, of basic importance are spectral representations relative to sets of ortho-normal basis functions that can be easily handled and which produce spectral components with small correlation coefficients. The harmonic functions are a typical choice. This is even more justified because under quite general assumptions about correlation functions the harmonic functions are good approximations of the eigen functions. The optimal algorithms for the reduction of dimensionality of vector information and the recovery algorithms presented on page 339 can be easily modified for the transformation of a function-information into vector information: we have to take the variances of the spectral components instead of the eigen values, and the approximation u*(t) given by (7.4.32) in place of the recovered spectrum. Similarly as in the deterministic approach discussed in Section 7.4.3, we may use the continuous representation in the infinite time interval to get hints for choosing the number of retained spectral components needed to achieve a good accuracy of approximation of the primary time-continuous process by a truncated spectral representation.
Similarly to (7.4.34), using equations (7.4.53) and (7.4.55) we get the approximate formula for the variance of the kth spectral component

E|v(k)|^2 = S(kω₁) Δω / 2π,        (7.4.67)

where S(ω) is the power spectral density, which can be obtained from the correlation function using the Wiener-Khinchin theorem (7.4.57). Using the approximation (7.4.27) we find the effective width of the power spectral density, and using (7.4.67) we get an estimate of the number of spectral components that should be included in the compressed spectrum. The remarks in Comment 2, page 342, and the examples in Section 7.3 suggest using basis functions other than the eigen functions. When the correlation coefficients between the spectral components are small, the performance of the modified compression and recovery algorithms described on page 333 is close to the optimal performance. If this is not the case, the performance of the recovery algorithm could be improved if, instead of zeros, linear estimates of possibly many rejected spectral components are made. Therefore, it is important to estimate the correlation coefficients between the variables representing the spectral components, to decide if it is worthwhile to make such an improvement. Using equation (7.4.57) and arguing similarly as in deriving the approximation (7.4.67), we conclude that when the continuous spectral representation (7.4.53) is an accurate approximation of the spectral representation (7.4.50) in the time interval <-T/2, T/2>, then the correlation coefficients of the random variables representing the spectral components v(k) in the representation (7.4.50) are small, and the function-information given by the simple formula (7.4.32) is an almost optimally recovered primary information. From the derivation of the continuous representation (7.4.23a) it follows that the continuous representation is accurate if the distances between the harmonic components in the representation (7.4.13a) are small compared with the range in which the power spectral density takes significant values, thus if

ω₁ ≪ Δ_S,        (7.4.68)

where Δ_S is the effective width of the power spectral density S(ω). Using (7.4.11) and (7.4.27) we write this condition in the form

Δ_t ≪ T,        (7.4.69)

where Δ_t is the correlation time of the primary process. Summarizing, we conclude:
If a continuous function-information can be considered as a segment of a stationary random process, and if the duration T of the information is much larger than the correlation time of the stationary random process, then the truncated harmonic spectrum is an almost optimally compressed vector information and the sum (7.4.32) of the retained harmonic components is an almost optimally recovered primary function-information.        (7.4.70)
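The use of the approximation (7.4.67), as reconstructed above, for choosing the number of retained components can be sketched as follows. The sketch assumes an exponential correlation function (hence a known power spectral density) and illustrative values of T, the correlation time, and the power; the 95% retention threshold is the sketch's own choice.

import numpy as np

T, tau0, sigma2 = 10.0, 0.5, 1.0         # interval length, correlation time, power (illustrative)
w1 = 2 * np.pi / T

def S(w):                                # power spectral density for gamma(t) = sigma2 * exp(-|t|/tau0)
    return 2 * sigma2 * tau0 / (1 + (w * tau0) ** 2)

k = np.arange(0, 200)
var_k = S(k * w1) * w1 / (2 * np.pi)     # approximate variances of the spectral components, cf. (7.4.67)

total = 2 * np.sum(var_k) - var_k[0]     # the components +k and -k carry equal average power
cum = 2 * np.cumsum(var_k) - var_k[0]
M = int(np.searchsorted(cum, 0.95 * total)) + 1
print("components retaining 95% of the average power: k = -%d .. %d" % (M, M))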
7.5 QUANTIZATION
A general review of quantization has been presented in Section 1.5.4, and we used it in Sections 4.2.1, 4.5.1, and 6.6.1. Similarly to dimensionality reduction, quantization is an irreversible, deterministic transformation used as a preliminary simplifying transformation. Therefore, we exploit here large fragments of our considerations in the previous sections of this chapter. As there, geometrical interpretation is used extensively and the presentation is partially based on heuristic arguments. We return in Section 8.6.1 to the problems of quantization requiring more advanced formal tools, particularly those of optimization theory.
7.5.1 THE RECOVERY OF THE PRIMARY CONTINUOUS INFORMATION FROM QUANTIZED INFORMATION
Quantization is a typical simplifying transformation (see Section 1.4, particularly Figure 1.4) and the problem of recovery of the primary information arises. To formulate this problem we have to introduce an indicator of performance of the information recovery rule (see Section 1.6). To define such an indicator we must specify the meta-information about the properties of the set of potential forms of the primary information. We assume first that the information exhibits statistical regularities and that exact statistical information (see Section 5.5) is available. The K-DIM continuous information u = {u(k), k = 1, 2, ..., K}, u(k) ∈ <u_a, u_b>, is considered. The quantization rule is described by a partition of the continuous set of the potential forms of the primary information into a finite number of aggregation sets, each corresponding to the potential forms of the primary information which are transformed into the same quantized information. In this section the quantization rule is considered as fixed and we concentrate on the continuous information recovery rule. We denote
v_l, l = 1, 2, ..., L - the potential forms of the quantized information,
u_r = {u_r(k), k = 1, 2, ..., K} - the recovered information,
U_r(·) - the recovery rule.
It is assumed that the recovery rule is deterministic. Then, since the quantized information is discrete, the recovered information is discrete too. We denote as

u_rl = U_r(v_l), l = 1, 2, ..., L,        (7.5.1)

the potential forms of the recovered information. On the assumption that the primary information exhibits statistical regularities, the primary information, the quantized information, and the recovered information and their components can be considered as realizations of random variables. We denote them, respectively, as 𝕦 = {𝕦(k), k = 1, 2, ..., K}, 𝕧 = {𝕧(k), k = 1, 2, ..., K}, 𝕦_r = {𝕦_r(k), k = 1, 2, ..., K}. The mean square error of recovery

Q[U_r(·)] = E Σ_{k=1}^{K} [𝕦(k) - 𝕦_r(k)]^2        (7.5.2)

is taken as the indicator of the performance of the information recovery rule.
OPTIMAL RECOVERY WHEN EXACT STATISTICAL INFORMATION IS AVAILABLE
Similarly as in the case of recovery of the primary information from the information of reduced dimensionality, we consider the optimization problem OP U_r(·), Q. The method of solving this problem is illustrated on the 1-DIM (scalar) case. For K = 1 definition (7.5.2) takes the form

Q[U_r(·)] = E(𝕦 - 𝕦_r)^2.        (7.5.3)
The random variable 𝕦_r = U_r(𝕧), where 𝕧 is the random variable representing the quantized information. Therefore, averaging over all potential forms of the recovered information is equivalent to averaging over all potential forms of the quantized information. Taking this into account and using formula (4.4.23) for conditional averages, we write (7.5.3) in the form

Q[U_r(·)] = E{E[(𝕦 - U_r(𝕧))^2 | 𝕧]} = E Q[U_r(𝕧), 𝕧],        (7.5.4)

where

Q(u_r, v_l) = E[(𝕦 - u_r)^2 | 𝕧 = v_l].        (7.5.5)

Since the variable 𝕧 is a discrete variable, we can use equation (4.4.10) and write equation (7.5.4) in the form

Q[U_r(·)] = Σ_{l=1}^{L} Q[U_r(v_l), v_l] P(𝕧 = v_l).        (7.5.6)
The average on the right of (7.5.5) corresponds to the point of view of an observer who knows the quantized information v_l, does not know the primary information, and, looking for various recovery rules, considers the recovered information u_r to be a variable. From this interpretation it follows that for a given quantized information v_l the conditional mean square error is minimized if we consider Q(u_r, v_l) as a function of a continuous variable u_r ∈ <u_a, u_b>. Let us denote as u_rol the value of u_r for which the minimum is reached; in this notation we indicate the number l of the fixed quantized information v_l. For each v_l we can find the corresponding u_rol independently. Therefore, the recovery rule:
For the available quantized information v_l take the function Q(u_r, v_l), consider it as a function of the continuous variable u_r ∈ <u_a, u_b>, find the value u_rol for which Q(u_r, v_l) achieves the minimum, and take u_rol as the recovered information,        (7.5.7a)
minimizes the overall average Q. Thus the rule (7.5.7a) is the solution of the OP U_r(·), Q. We denote this rule U_ro(·) and call it the best conditional performance rule. Thus,

u_rol = U_ro(v_l).        (7.5.7b)
To determine the concrete form of the optimal recovery rule we have to calculate the conditional average Q(u_r, v_l). We assume that the scalar quantization rule is described by the thresholds

u_a = u_0 < u_1 < ... < u_L = u_b        (7.5.8)

shown in Figure 1.18. Then the aggregation interval is

A_l = <u_{l-1}, u_l>.        (7.5.9)
From the general definition (4.4.11) it follows that the conditional average (7.5.5) is

Q(u_r, v_l) = ∫ (u - u_r)^2 p(u | v_l) du,        (7.5.10)

where p(u | v_l) is the probability density of the random variable 𝕦 on the condition that 𝕧 = v_l. We now express the density p(u | v_l) in terms of the probability density p(u) describing the statistical properties of the primary information u. From the definition (4.4.7b) of the conditional probability we have, for u ∈ A_l, P(u - ε < 𝕦 ≤ u | 𝕧 = v_l) = P(u - ε < 𝕦 ≤ u)/P(𝕧 = v_l); dividing by ε and letting ε → 0 we obtain

p(u | v_l) = p(u)/P(v_l) for u ∈ A_l, p(u | v_l) = 0 for u outside A_l.        (7.5.13)

Using this we write (7.5.10) in the form

Q(u_r, v_l) = (1/P(v_l)) ∫_{A_l} (u - u_r)^2 p(u) du.        (7.5.14)

From the definition of the aggregation set it follows that

P(v_l) = P(𝕦 ∈ A_l) = ∫_{A_l} p(u) du.        (7.5.15)
Equations (7.5.14) and (7.5.15) express the conditional mean square error in terms of the density of probability p(u), as we were looking for. The optimally recovered information u_rol minimizing Q(u_r, v_l) is a solution of the equation

∂Q(u_r, v_l)/∂u_r = 0.        (7.5.16)

Using equation (7.5.14), interchanging the sequence of the partial differentiation and the integration, after some elementary algebra we get

u_rol = (1/P(v_l)) ∫_{A_l} u p(u) du.        (7.5.17)

Since the ratio p(u)/P(v_l) has the meaning of the weight of a point in the aggregation interval A_l, the optimally recovered information u_rol can be interpreted as the centre of gravity of the aggregation interval A_l. Therefore, the optimal recovery rule (7.5.17) is called the centre of gravity rule.
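The centre of gravity rule (7.5.17) is straightforward to evaluate numerically for a given density and given thresholds. The sketch below assumes, purely for illustration, a triangular density p(u) = 2u on <0, 1> and uniform thresholds; both are the sketch's own choices.

import numpy as np

u_a, u_b, L = 0.0, 1.0, 4
thresholds = np.linspace(u_a, u_b, L + 1)           # u_0 < u_1 < ... < u_L, uniform for simplicity

def p(u):                                            # illustrative (triangular) density on <0, 1>
    return 2.0 * u

u = np.linspace(u_a, u_b, 100001)
for l in range(L):
    inside = (u >= thresholds[l]) & (u <= thresholds[l + 1])     # aggregation interval A_l
    P_l = np.trapz(p(u[inside]), u[inside])                      # P(v_l), equation (7.5.15)
    u_rol = np.trapz(u[inside] * p(u[inside]), u[inside]) / P_l  # centre of gravity, equation (7.5.17)
    print(f"A_{l+1} = <{thresholds[l]:.2f}, {thresholds[l+1]:.2f}>: "
          f"P = {P_l:.3f}, optimally recovered value = {u_rol:.4f}")

For this non-uniform density the optimally recovered values lie above the midpoints of the intervals, as the centre of gravity interpretation suggests.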
Substituting (7.5.17) for u_r in (7.5.10), using (7.5.13), and substituting the result in (7.5.6), we obtain the overall performance indicator of the centre of gravity rule

Q[U_ro(·)] = Σ_{l=1}^{L} ∫_{A_l} [u - U_ro(v_l)]^2 p(u) du.        (7.5.18)
This is the minimum mean square error of the recovery for the considered quantization rule. The generalization of our argument to the case of K-DIM information is simple, since in its essential steps we did not use the assumption that the information is one-dimensional. In definition (7.5.5) we have to replace (𝕦 - u_r)^2 by the sum on the right of equation (7.5.2). Thus we define

Q(u_r, v_l) = E{ Σ_{k=1}^{K} [𝕦(k) - u_r(k)]^2 | 𝕧 = v_l },        (7.5.19)

where 𝕦 = {𝕦(k), k = 1, 2, ..., K} is the K-DIM random variable representing the K-DIM information u = {u(k), k = 1, 2, ..., K}. For the so-defined function Q(u_r, v_l) the obvious modification of the rule (7.5.7) is the optimal rule of recovering the primary information when only the quantized information v_l is available, that is, the rule minimizing the overall performance indicator given by (7.5.2). The generalization of (7.5.14) is

Q(u_r, v_l) = (1/P(v_l)) ∫∫...∫_{A_l} Σ_{k=1}^{K} [u(k) - u_r(k)]^2 p(u) du,        (7.5.20)
where p(u) is the density of probability describing the K-DIM random variable 𝕦 and A_l is the aggregation set corresponding to the quantized information v_l. The generalization of equation (7.5.16) is the set of equations

∂Q(u_r, v_l)/∂u_r(k) = 0, k = 1, 2, ..., K.        (7.5.21)

Substituting (7.5.20) and after some algebra we get the kth component of the optimally recovered information

u_rol(k) = (1/P(v_l)) ∫∫...∫_{A_l} u(k) p(u) du, k = 1, 2, ..., K,        (7.5.22)

where

P(v_l) = ∫∫...∫_{A_l} p(u) du.        (7.5.23)

The set of equations (7.5.22) can be written in the compact form

u_rol = (1/P(v_l)) ∫∫...∫_{A_l} u p(u) du.        (7.5.24)

Thus, in the general K-DIM case the optimally recovered information is the centre of gravity of the K-DIM aggregation set corresponding to the available quantized information v_l.
Similarly to (7.5.18), the overall performance indicator of the centre of gravity rule is

Q[U_ro(·)] = Σ_{l=1}^{L} ∫∫...∫_{A_l} Σ_{k=1}^{K} [u(k) - u_rol(k)]^2 p(u) du.        (7.5.25)
EXAMPLE 7.5.1
We assume that the probability density is uniform,

p(u) = 1/(u_b - u_a) for u_a < u < u_b, p(u) = 0 for u < u_a, u > u_b,        (7.5.26)

and that the quantization is uniform,

u_l - u_{l-1} = Δ = const.        (7.5.27)

From (7.5.13) it follows that

p(u | v_l) = 1/Δ for u ∈ <u_{l-1}, u_l>.        (7.5.28)

Substituting this in (7.5.10) and minimizing, we obtain

u_rol = (1/Δ) ∫_{u_{l-1}}^{u_l} u du = (u_{l-1} + u_l)/2.        (7.5.29)

Thus, the centre of the quantization interval corresponding to the given quantized information is the optimally recovered information. Putting (7.5.29) in (7.5.18) we get

Q[U_ro(·)] = (u_b - u_a)^2/(12 L^2).        □        (7.5.30)
OPTIMAL RECOVERY WITHOUT STATISTICAL INFORMATION
We show now the application to scalar quantization of the general procedure of intelligent information processing described in Section 1.7.2 and illustrated in Figure 1.27. We assume that:
A1. A train of pieces of the primary information u_I = {u(i), i = 1, 2, ..., I}, u(i) ∈ <u_a, u_b>, is available.
As the performance indicator we take the arithmetic mean of the squared recovery errors over the observed train. Similarly to (7.5.6) we write this in the form

Q̂[U_r(·)] = Σ_{l=1}^{L} Q̂[U_r(v_l), v_l] (I_l/I),        (7.5.34)

where I_l is the number of pieces of the train falling into the aggregation interval A_l and

Q̂(u_r, v_l) = (1/I_l) Σ_{i: u(i)∈A_l} [u(i) - u_r]^2.        (7.5.35)

Arguing similarly as in deriving equation (7.5.16), we conclude that when only the quantized information v_l is available, the optimally recovered information is a solution of an equation of the form (7.5.16), but with Q̂(u_r, v_l) in place of Q(u_r, v_l); the optimally recovered information is

u*_rl = (1/I_l) Σ_{i: u(i)∈A_l} u(i).        (7.5.36)
We can also get this result by taking into account equation (4.5.1) and the considerations in Section 4.4.2 about the relationships between probabilities and frequencies of occurrences of states. The presented argument can be easily generalized to the K-DIM case. Then equation (7.5.36) takes the form

u*_rl = (1/I_l) Σ_{i: u(i)∈A_l} u(i),        (7.5.37)
where u(i) is the ith observation of the vector information. The compression system using the described transformations is shown in Figure 7.9a. For comparison, Figure 7.9b shows the system using the statistically optimal recovery rule (7.5.17).
Figure 7.9. Quantization systems with optimal information recovery: (a) operating without statistical information, (b) using exact statistical information; U_ro = {u_ro1, u_ro2, ..., u_roL} is the set of potential forms of the optimally recovered information (the partner information).
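The recovery rule (7.5.36), as reconstructed above, amounts to replacing each quantized form by the arithmetic mean of the stored observations that fell into the corresponding aggregation interval. The sketch below illustrates this; the generation of the observed train (a beta-distributed train) and the numbers of intervals and observations are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(5)
u_a, u_b, L = 0.0, 1.0, 4
edges = np.linspace(u_a, u_b, L + 1)

train = rng.beta(2.0, 5.0, size=10_000)               # observed train (illustrative, non-uniform)
cells = np.clip(np.digitize(train, edges) - 1, 0, L - 1)

# Rule (7.5.36): for each quantized form v_l take the average of the observations in A_l.
u_ro = np.array([train[cells == l].mean() for l in range(L)])
print("recovered values (partner information):", np.round(u_ro, 4))
print("interval centres, for comparison       :", np.round((edges[:-1] + edges[1:]) / 2, 4))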
COMMENT
The system compressing the information without statistical information is an example of the systems discussed in Section 1.7.2 and shown in Figure 1.27. It is also a counterpart of the intelligent Huffman system discussed in Comment 2, page 266, and shown in Figure 6.3. The set U_ro = {u_ro1, u_ro2, ..., u_roL} plays the role of the partner
information that the quantizing subsystem delivers to the recovering subsystem. It is the counterpart of the set P* of frequencies of occurrences of blocks used in the intelligent Huffman algorithm shown in Figure 6.3. Figures 7.9a and 7.9b underscore the differences between the systems that operate without statistical information and the systems using statistical information. The systems operating without statistical information must have time to collect information about the train before they can efficiently process its components separately. The system using statistical information can process a component of the train immediately; however, some long-lasting observations which justify the assumption of stationarity (see our discussion in Section 4.3.1) are required. The features of both systems can be combined in a system with learning cycles; see Section 1.7.2.
7.5.2 QUANTIZATION OF VECTOR INFORMATION
The obvious way to quantize a K-DIM continuous vector information, K ≥ 2, is to quantize each component of the information separately. We call it decomposed quantization. On two simple examples we show the basic features of such a quantization.
EXAMPLE 7.5.1 A COMPARISON OF PERFORMANCE OF DECOMPOSED AND NON-DECOMPOSABLE QUANTIZATION
We assume that:
A1. The primary vector information is two-dimensional, u = {u(1), u(2)}.
A2. The set of potential forms of the information is the square U_e2 = {-a ≤ u(1) ≤ a, -a ≤ u(2) ≤ a}.
A3. The density of probability p(u) is uniform on this square.        (7.5.38)
A4. As the performance indicator we take the normalized mean square error Q' of the optimally recovered information,        (7.5.39)
where Q is given by (7.5.25) with K = 2 and σ_u^2 = E(𝕦 - E𝕦)^2. For the density of probability (7.5.38) we have σ_u^2 = a^2/3.
We consider first the decomposed quantization. It is evident that for the assumed joint probability density the marginal probability densities are uniform. For symmetry reasons it can be anticipated that in such a case the uniform quantization is optimal (we prove this in Section 8.3.1). Therefore, it is assumed that each component is quantized according to the uniform quantization rule. We denote as L the number of potential forms of a quantized component; the number of potential forms of the quantized vector information is then L^2. The aggregation set A_l for the vector information is a square with edge Δ = 2a/L; such sets are shown in Figure 1.19a with u_a(1) = u_a(2) = -a, u_b(1) = u_b(2) = a.
The considered quantization is equivalent to the quantization obtained by the NNT (next neighbour transformation) described in Section 1.5.3 with the reference vectors

u[h(1), h(2)] = h(1)g(1) + h(2)g(2),        (7.5.40)

where h(1), h(2) are integers, g(1) = Δu(1), g(2) = Δu(2), and u(1), u(2) are the coordinate unit vectors; see Figure 1.19. For the uniform density of probability the centroid of a square is its centre. From equation (7.5.25) we obtain

Q' = 0.25/L^2.        (7.5.41)

In general the aggregation sets corresponding to separate quantization are rectangles. If the aggregation sets are not rectangles, the quantization cannot be implemented by a separate quantization of the components. An example is the quantization realized by an NNT with the reference vectors

u[h(1), h(2)] = h(1)g(1) + h(2)g(2),        (7.5.42)

where g(1) = 2cu(1), g(2) = c[u(1) + √3 u(2)] and c is a constant determining the size of the aggregation sets. It can be shown that for such reference vectors the aggregation sets are regular hexagons, as shown in Figure 1.19b (with the exception of the regions at the border of the set U_e2 of potential forms of information). For a hexagon and the uniform probability distribution the centroid is again the centre of the hexagon. Proceeding similarly as for the first quantization system, for large L we get

Q' = 5√3/(36L^2) ≈ 0.24/L^2.        □        (7.5.43)

COMMENT 1
The difference between the performances given by equations (7.5.41) and (7.5.43) is not large, but it is instructive. The reason for this difference lies in the structure of the aggregation sets. Since (1) the aggregation sets do not overlap, (2) they cover the set of potential forms, and (3) by assumption the number of potential forms of the quantized information is in both systems the same, the areas of the aggregation sets are in both systems the same. It can be proved that if the area of a two-dimensional figure is fixed, then the integral in (7.5.25) achieves its smallest value when the set A_l is a circle. However, non-overlapping circles cannot cover a square. Therefore, the distortions associated with an aggregation set are the smaller, the better the set approximates a circle of the same area. The hexagon is in this respect better than the square (it is even optimal). This is the ultimate reason for the inferiority of the considered system with separate quantization. Our argument can be generalized to K-dimensional signals. As in the two-dimensional case, the aggregation sets corresponding to separate quantization are K-DIM cubes. The optimal recovery performance is the better, the better the aggregation set approximates a sphere. An important result of the theory of multidimensional spaces is that it is possible to partition a large cube into aggregation sets of the same shape which are not cubes but which approximate the sphere the better, the larger the dimensionality K is. Therefore, the normalized minimal recovery distortions per dimension decrease when the dimensionality of the quantized primary information increases. The ease of implementation of decomposable vector quantization makes it attractive for applications even in spite of its larger than minimal distortions. The two-dimensional comparison can be reproduced numerically as sketched below.
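The sketch below compares, by Monte Carlo, the mean square error of square-cell (decomposed) quantization with that of hexagonal-lattice quantization at approximately equal cell areas. The lattice basis, the brute-force nearest-reference search, and all parameter values are the sketch's own choices; edge effects near the border of the square are ignored.

import numpy as np

rng = np.random.default_rng(6)
a, step = 1.0, 0.2                      # half-edge of the square and square-cell size (illustrative)
u = rng.uniform(-a, a, size=(20_000, 2))

# (i) Decomposed (square-cell) quantization: round each component to the nearest cell centre.
sq = (np.floor(u / step) + 0.5) * step
mse_sq = np.mean(np.sum((u - sq) ** 2, axis=1))

# (ii) Hexagonal-lattice quantization with roughly the same cell area.
c = np.sqrt(2 * step ** 2 / np.sqrt(3))      # lattice constant giving hexagonal cells of area step**2
g1, g2 = np.array([c, 0.0]), np.array([c / 2, c * np.sqrt(3) / 2])
h = np.arange(-15, 16)
refs = np.array([i * g1 + j * g2 for i in h for j in h])
refs = refs[np.max(np.abs(refs), axis=1) <= a + c]           # keep references near the square
d2 = (u[:, None, 0] - refs[None, :, 0]) ** 2 + (u[:, None, 1] - refs[None, :, 1]) ** 2
mse_hex = np.mean(np.sum((u - refs[d2.argmin(axis=1)]) ** 2, axis=1))

print("mean square error, square cells   :", round(mse_sq, 6))
print("mean square error, hexagonal cells:", round(mse_hex, 6))   # about 4% smaller

The ratio of the two errors is close to 0.96, in agreement with the ratio of the coefficients in (7.5.43) and (7.5.41).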
The following example shows that when the components of the primary vector information are statistically dependent, the performance of separate quantization can be improved by a preliminary presentation-changing transformation, in particular by a linear decorrelation of the components of the primary vector information.
EXAMPLE 7.5.2 THE EFFECT OF PRELIMINARY PROCESSING ON THE PERFORMANCE OF OPTIMIZED QUANTIZATION
We make again assumptions A1 and A4 of Example 7.5.1, but instead of assumptions A2 and A3 we assume that:
A5. The density of the joint probability p(u) is as shown in Figure 7.10a.
We consider first the system using direct separate quantization. The densities of the marginal probabilities p1[u(1)] and p2[u(2)] corresponding to the assumed density p[u(1), u(2)] are shown in Figure 7.10a.
Figure 7.10. The effect of decorrelation on the separate quantization of the components of two-dimensional information: a) the assumed joint probability density and the densities of the marginal probabilities (the joint density is constant in the hatched area and zero outside it); b) the aggregation sets corresponding to the optimal separate quantization of the components; c), d) the counterparts of a), b) for separate quantization after the preliminary decorrelation described by (7.3.36).
On the assumption that the number of potential forms of a quantized component is L1 = 2, the uniform quantization is optimal; it can be considered as a one-dimensional NNT with references u_i(k), i = 1, 2, k = 1, 2, marked as crosses on the diagrams of the marginal probability densities shown in Figure 7.10a. The resulting reference vectors u_i, i = 1, 2, 3, 4 for the vector information u and the corresponding aggregation sets U_i, i = 1, 2, 3, 4 are shown in Figure 7.10b. Four reference vectors are defined, so four potential forms of quantized information could be produced; however, only two of them can actually occur. This is a consequence of the assumed probability density. From (7.5.25) we obtain the normalized mean square error of the optimally recovered information
Q' = 0.5 (7.5.44)
We consider next the system which uses quantization after the decorrelating spectral transformation. The block diagram of the system is the same as that of the system with truncation after decorrelation shown in Figure 7.5, with quantization in place of truncation. Since we assumed here the same joint probability density as in Section 7.3 (see Figure 7.4), we can achieve the decorrelation by the transformation (7.3.36); we can also use equations (7.3.38) and (7.3.39). From the latter it follows that the variance of the first component w(1) of the decorrelated information is substantially larger than that of the second component w(2). Our discussion in Section 7.3, in particular conclusion (7.3.53), suggests quantizing only the first component w(1) of the decorrelated information and disregarding the second component. Formally we do this by transforming w(2) by an NNT with the single reference
w1(2) = 0 (7.5.45)
Let us denote by p[w(1)] the density of the marginal probability of the first component w(1) of the decorrelated information, resulting from the joint probability density of its components. Both densities are shown in Figure 7.10c. In Section 8.4.1 we present an algorithm for finding the optimal quantization of scalar information having a non-uniform probability density. For the probability density p[w(1)] shown in Figure 7.10c the optimal quantization is not uniform. To simplify the argument we assume that the quantization of w(1) is uniform, achieved by a scalar NNT with the references w_i(1), i = 1, 2, 3, 4, shown as crosses in Figure 7.10c. The reference vectors resulting from the described quantization of the components of the decorrelated information are
w_i = {w_i(1), w_i(2) = 0}, i = 1, 2, 3, 4 (7.5.46)
They and the corresponding aggregation sets W_i, i = 1, 2, 3, 4, for the decorrelated information are shown in Figure 7.10d. Arguing similarly as in Section 7.3.1, from the recovered spectrum we produce the recovered information by inverting the decorrelating transformation. We calculate the mean square error of the recovered spectrum using again equation (7.5.25). In view of the distance-preserving property, this is also the mean square error of the recovered information. In this way we get
Q' = 0.33 □ (7.5.47)
COMMENT 1
In Section 8.4.1 it is shown that for the uniform probability distribution the uniform quantization is optimal. Although the reference vectors and the corresponding aggregation sets in the system with separate quantization of the primary information are obviously unfavourable, they are optimal under the constraint that the components are quantized separately. Thus the example shows that for some probability distributions the requirement that the quantization be separable is too restrictive, and we should look for non-separable quantization.
COMMENT 2
Similarly as in the case of dimensionality reduction discussed in Section 7.3, the reason for the superiority of separate quantization after decorrelation is that the average ranges of variation of the decorrelated components (characterized by their mean square values) are different. Therefore, we can improve the performance of separate quantization if we quantize the two components differently, so that the component with the larger range of variation is recovered more exactly than the component varying only over a smaller range. We achieve this by allowing more potential forms for the quantized first component than for the second. In the system described in the example we realized this idea in an extreme form, by allowing only one form (the zero) for the transformed second component and four forms for the first component.
OPTIMAL BIT ALLOCATION
We now generalize and formalize the procedure described in the last comment. We assume that
A1. A preliminary spectral decorrelating transformation of the vector information is applied,
A2. The decorrelated components have in general various statistical features,
A3. Each component of the decorrelated vector information is recovered separately from the corresponding quantized component,
A4. The indicators of the quantization system's performance are the total volume of the quantized decorrelated components and the mean square error of the recovered vector information obtained by optimally recovering its components separately.
Equation (8.6.17) and Figure 8.23 show that for a sufficiently large number L of potential forms the mean square error of optimally quantized and recovered (centroid rule) scalar information w is
E(w - w_o)² = A σ_w² 2^(-2 log2 L) = A σ_w²/L² (7.5.48)
where the constant A depends on the probability density of the primary information. In addition to assumption A2 we assume that
A2'. The decorrelated components have the same type of probability distribution but may have various variances.
The spectral transformation is distance invariant. Then from (7.5.19) and (7.5.48) it follows that the mean square error of the recovered primary vector information is
Q = A Σ_{k=1}^{K} σ²(k)/L²(k) (7.5.49)
where Q is given by (7.5.25), σ²(k) is the variance of the random variable w(k) representing the kth component of the decorrelated information vector, and L(k) is the number of its optimally quantized potential forms. On the assumption that log2 L(k) is an integer, this logarithm is the number of bits (binary elementary pieces of information) needed to identify the quantized component. The indicator of the volume of the quantized spectral representation is
V_q = Σ_{k=1}^{K} log2 L(k) (7.5.50)
In view of assumption A4 a typical optimization problem is
OP {L(k); k = 1, 2, .., K}, Q | V_q ≤ V (7.5.51)
where V is the given volume (total number of bits) of the quantized decorrelated information. This problem is called the optimal bit allocation problem. The difficulty in solving it is the requirement that log2 L(k) be a positive integer. If we drop this requirement, we can find the solution of the optimization problem using the method of Lagrange multipliers, which we describe in Section 8.2. The solution is
log2 L_o(k) = V/K + (1/2) log2 [σ²(k)/ρ] (7.5.52a)
where
ρ = [∏_{k=1}^{K} σ²(k)]^{1/K} (7.5.52b)
is the geometric mean of the variances of the decorrelated components.
COMMENT 1
Equation (7.5.52) has a simple interpretation. The first term on the right corresponds to proportional bit assignment. The second term is a correction determined by the ratio of the mean square value (variance) of the considered component to the geometric mean of the mean square values of all decorrelated components. The integer parts of the solutions (7.5.52) can be taken as approximations of the optimal bit assignments. If the variance of a component of the decorrelated information is substantially smaller than the geometric mean of the variances, the solution is substantially smaller than 1 or even negative. Then we reject the corresponding component of the decorrelated information. Thus the obtained solution provides a justification for rejecting the second component in Example 7.5.2. It can also be considered a justification of the procedure applied in JPEG, described in Section 2.5. We present here the approximate solution of the assignment problem because it gives insight into the problem of bit assignment. An algorithm for finding the optimal integer bit assignments directly is also known (see Makhoul et al. [7.20], Shoham, Gersho [7.21]).
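The closed-form allocation (7.5.52) is easy to evaluate. The sketch below is an illustration added here (the function name and all numerical values are our own assumptions, not taken from the text); it returns the real-valued budgets log2 L_o(k) and clips negative values to zero, which corresponds to rejecting a component as discussed above.

import numpy as np

def bit_allocation(variances, total_bits):
    """Approximate optimal bit allocation, eqs. (7.5.52a)-(7.5.52b).

    variances  : sigma^2(k) of the K decorrelated components
    total_bits : V, the admissible total volume in bits
    Returns b(k) = log2 L_o(k); negative budgets are clipped to zero,
    i.e. the component is rejected (in practice the freed bits would
    then be redistributed among the remaining components).
    """
    var = np.asarray(variances, dtype=float)
    K = len(var)
    rho = np.exp(np.mean(np.log(var)))            # geometric mean of the variances
    b = total_bits / K + 0.5 * np.log2(var / rho)
    return np.clip(b, 0.0, None)

# Two decorrelated components with strongly different variances:
print(bit_allocation([4.0, 1e-6], total_bits=4))   # the weak component is rejected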
COMMENT 2
The basic assumption that we should separately quantize the components of the decorrelating spectral transformation has a heuristic character. However, when the joint probability distribution of the components of the primary information is Gaussian, it can be proved (see Segall [7.22]) that the quantization achieved by reduction of dimensionality of the decorrelating spectral transformation, realized by the algorithm presented in Section 7.3.2, page 339, followed by separate quantization of the decorrelated components using the optimal bit assignment, is the optimal vector quantization.
7.5.3 THE CURRENT QUANTIZATION OF INFORMATION
In the previous considerations we assumed that all components of the information are simultaneously available for processing. Often the information is a time-evolving process and its components arrive successively in time. A similar situation occurs when adjacent components of an image are processed successively. In such cases it is desirable to compress the structured information successively, as its new components become available. We call this current information compression (in particular, current information quantization). A wide class of transformations realizing current compression can be considered as special cases of the following prototype transformation: using the already available components of the information and the meta information about the relationships between the components, we estimate the component of the information that will arrive next; as a new component of the compressed information we take a description, of possibly small volume, of the difference between the arrived component and its estimate. (7.5.53) It is required that, based on the compressed descriptions, the components of the primary information can be recovered with a given accuracy, in particular with a given delay.
Let us comment on this description. In general the components of structured information are interrelated by deterministic relationships, discussed in Sections 3.2, 3.3, and 6.6.2, and/or by relationships between states of variety, in particular the statistical relationships discussed in Chapter 5. Those relationships cause some features of a newly arriving component of the structured information to be related to the previously obtained components of information. However, some other features of the new component are independent. The information about these independent features is called really new information. The transformation described in (7.5.53) produces the really new information and compresses its volume. Thus, this transformation may be called a transformation extracting new information.
To illustrate these general concepts we consider a simple but representative example of a new-information-extracting transformation. We assume
A1. The information is a train u(t_n), t_n = nT_s, n = 1, 2, 3, ..., of samples of a primary time-continuous scalar process u(t);
A3. The train u(n), n = 1, 2, ..., exhibits statistical regularities and can be considered as an observation of the train of random variables u(n), n = 1, 2, ..., with E u(n) = 0.
Let us assume that u(n-1) is the last obtained component of information. Since we assumed that it is a scalar, its only feature is its value. Therefore the estimate of the forthcoming component u(n) is a scalar
u*(n) = U*[n, u(n-1)] (7.5.54)
where u(n-1) = {u(1), u(2), ..., u(n-1)} is the train of available samples and U*[n, (·)] is the rule transforming u(n-1) into u*(n). This rule is called the prediction rule. A simple transformation extracting the features of the newly arrived component that are not determined by the components that arrived earlier, thus extracting the "genuinely" new features, is
w(n) = u(n) - U*[n, u(n-1)] (7.5.55)
Such a transformation extracting new information is called a predictive-subtractive transformation. It transforms the train u(n) into the secondary information w(n). We can produce the predicted value U*[n, u(n-1)] immediately after the instant t_{n-1}, when the information component u(n-1) arrives. However, the transformed information w(n) can be produced only after u(n) has arrived, thus after the instant t_n. Therefore, between the production of the predicted value and the production of the transformed information we must introduce the delay T_1. The diagram of the system realizing the predictive-subtractive transformation is shown in Figure 7.11.
Figure 7.11. The predictive-subtractive transformation.
It can be expected that the volume of the transformed information is small when the prediction of the value of the forthcoming component of information is possibly exact. To formulate the problem of optimization of prediction we have to introduce an indicator of the performance of the prediction rule. The general methodology of defining indicators of performance is presented in Section 8.1. Similarly as in the case of recovery of primary information considered in the previous section, we take the mean square error
Q{U*[n, (·)]} = E{u(n) - U*[n, u(n-1)]}² (7.5.56)
as the indicator of performance of the prediction rule; u(n-1) denotes here the train of random variables representing the primary train u(n-1).
Both the implementation constraints and the costs of acquiring exact statistical information often cause the class of admissible prediction rules U*[n, (·)] to be restricted to linear rules. Thus, we assume that the predicted component is
u*(n) = Σ_{m=1}^{n-1} h(n, m) u(m) (7.5.57)
Such a linear prediction rule is described by the set of coefficients
h(n) = {h(n, m), m = 1, 2, .., n-1}
Substituting (7.5.57) in (7.5.56) we get
Q[h(n)] = E[u(n) - Σ_{m=1}^{n-1} h(n, m) u(m)]² (7.5.58)
Thus we face the optimization problem OP h(n), Q. Before we derive the solution of this problem, we derive an important property of the predictive-subtractive transformation using optimal linear prediction. We write definition (7.5.56) in the form
Q[h(n)] = E[u(n) - u*(n)]² (7.5.59)
where
u*(n) = Σ_{m=1}^{n-1} h(n, m) u(m) (7.5.60)
The optimal set of coefficients is the solution of the set of equations
∂Q/∂h(n, m) = 0, m = 1, 2, .., n-1 (7.5.61)
Substituting (7.5.58) and interchanging the sequence of statistical averaging and partial differentiation we get
E[u(n) - u*(n)] u(k) = 0, k = 1, 2, .., n-1 (7.5.62)
Taking into account definition (7.5.55) we write this set of equations in the form
E w(n) u(k) = 0, k = 1, 2, .., n-1 (7.5.63a)
where w(n) = u(n) - u*(n) is the random variable representing the information produced by the predictive-subtractive transformation. The interpretation of (7.5.62) is:
The information w(n) produced by the predictive-subtractive transformation using optimal linear prediction is not correlated with any component u(m), m < n, of the primary information. (7.5.63b)
Substituting (7.5.59) in (7.5.61) we obtain
Σ_{m=1}^{n-1} c_uu(m, k) h(n, m) = c_uu(n, k), k = 1, 2, .., n-1 (7.5.65)
where
c_uu(m, k) = E u(m) u(k) (7.5.66)
This is the set of linear equations from which the coefficients determining the optimal linear prediction can be calculated. We denote
C_uu(n-1) = [c_uu(m, k); m, k = 1, 2, .., n-1], the square correlation matrix of the already available components of information,
c_uu(n-1) = [c_uu(n, k); k = 1, 2, .., n-1], the column matrix of correlations between the next arriving component and the available components of information,
h(n-1) = [h(n, m); m = 1, 2, .., n-1], the column matrix of coefficients determining the linear prediction rule.
Using those matrices we write the set of equations (7.5.65) in the form of a matrix equation
C_uu(n-1) h(n-1) = c_uu(n-1) (7.5.67)
On very general assumptions about the random variables u(n) the inverse matrix C_uu^{-1}(n-1) exists, and the column matrix of coefficients determining the optimal linear prediction rule is
h_o(n-1) = C_uu^{-1}(n-1) c_uu(n-1) (7.5.68)
Efficient algorithms for calculating this solution can be found in Haykin [7.23]. Another class of efficient and easy to implement algorithms for finding the coefficients of the optimal linear prediction are the iterative algorithms presented in Section 8.2.
COMMENT 1
From conclusion (7.5.63) it follows that the linear predictive-subtractive transformation using optimal prediction performs the same function as the real-time linear decorrelating transformations described in Section 5.1. It can be shown that the predictive-subtractive transformation produces a train that, in the sense explained in the Comment on page 227, is equivalent to the train produced by the Gram-Schmidt decorrelating algorithm described on page 226. The real-time decorrelating transformations described in Section 5.2.1 are inherently linear and are unable to exploit any statistical features of the primary information besides the mean values and the correlation coefficients. The transformation extracting new information described in (7.5.53) is much more universal, since it need not be linear. Already the predictive-subtractive transformations using non-linear prediction are more universal. Such transformations are particularly suitable for compression of information having a macro structure: using fragments of a macro element they can identify it and subtract the whole macro component from the primary information. This can lead to very efficient compression of the volume of information having a macro structure.
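For a stationary train the coefficients in (7.5.67)-(7.5.68) depend only on time differences and the normal equations take the familiar Yule-Walker form. The following sketch is an illustration added here (the AR(1) model, the prediction order and all numerical values are assumptions made only for this example): it estimates the correlation coefficients by time averaging, solves (7.5.68), and checks that the prediction error train has a much smaller variance than the primary train.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic primary train: a zero-mean AR(1) process (an assumption for illustration)
N, a = 5000, 0.9
u = np.zeros(N)
for n in range(1, N):
    u[n] = a * u[n - 1] + rng.standard_normal()

p = 3                                   # use the p most recent samples for prediction
# Empirical correlation coefficients c_uu, estimated by time averaging
r = np.array([np.mean(u[: N - lag] * u[lag:]) for lag in range(p + 1)])
C = np.array([[r[abs(m - k)] for k in range(p)] for m in range(p)])   # C_uu of (7.5.67)
c = r[1 : p + 1]                                                      # c_uu of (7.5.67)
h_o = np.linalg.solve(C, c)                                           # solution (7.5.68)

# Prediction error train w(n) = u(n) - u*(n); it should be (nearly) decorrelated
past = np.column_stack([u[p - 1 - m : N - 1 - m] for m in range(p)])  # u(n-1), u(n-2), ...
w = u[p:] - past @ h_o
print("coefficients:", np.round(h_o, 3))
print("variance reduction:", np.var(w) / np.var(u))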
CURRENT QUANTIZATION OF TRAINS BASED ON PREDICTIVE-SUBTRACTIVE TRANSFORMATIONS
An optimized predictive-subtractive transformation produces a train of continuous pieces of information that are more weakly interrelated than the components of the primary train; in particular, using optimal linear prediction we produce a decorrelated train. The predictive-subtractive transformation is reversible, since knowing the first component of the train and successively adding the prediction errors we reconstruct the primary train. We achieve quantization of the primary train by quantizing the components of the prediction error separately; we do this using an NNT. Such a system is shown in Figure 7.12a. We recover the primary train by recovering, piece by piece, the train of prediction errors from the quantized train and adding them successively. However, since the quantization distortions cannot be eliminated completely, the recovered primary train is distorted.
Figure 7.12. Current quantization of a train of pieces of information based on predictive subtractive transformation; (a) the basic system, (b) the system with additional correction of quantization errors.
A natural modification improving the performance of the simple system shown in Figure 7.12a is to base the prediction not on the exact samples of the primary information but to predict the future sample using the components of the quantized train produced in the past. Thus, at the place where the quantized train is produced, the situation at the place where the primary train is recovered from the quantized train is simulated. This allows us to deliver to the user not only the "really new" information contained in the recently arrived piece of the primary information but also information about the effects of the errors of quantization of the previously delivered pieces of quantized information. Such a system is shown in Figure 7.12b. We denote by
w(n) the prediction error,
w_q(n) the quantized prediction error,
u_q*(n) the prediction of the component u(n) of the primary information made on the basis of the train w_q(n-1) of quantized prediction errors.
When the error of recovery of the primary prediction error after quantization is small, it is simplest to recover the primary prediction error piece by piece (we denote the recovered error by w*(n)) and to take
u*(n) = u_q*[n, w_q(n-1)] + w*(n) (7.5.69)
as an estimate of the primary information, and to predict on its basis the next component of the primary information train. The system operating in this way is shown in Figure 7.12b.
COMMENT
The basic problems of dimensionality reduction and quantization have been presented here from a broader perspective, embedded in the framework of the general concepts of information processing introduced in the previous chapters. During the last decade dimensionality reduction and quantization have been of paramount importance for the development of information transmission and storage systems, and the number of publications in the area, particularly on quantization, is very large. A synthetic discussion of vector quantization is presented in Gersho et al. [7.20] and Abut [7.21]. Concrete quantization programs with detailed descriptions can be found in Nelson [7.22]. The conference proceedings Storer, Reif [7.23] and Storer, Cohn [7.24], [7.25], [7.26] present collections of specialized publications on dimensionality reduction and quantization. The conference proceedings [7.26] concentrate on quantization of images.
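A minimal sketch of the scheme of Figure 7.12b is given below (added here for illustration; the fixed one-tap prediction rule, the step size and the test train are assumptions, not taken from the text). Because the coder predicts from the same reconstructed train that the receiver can form, the reconstruction error stays bounded by half the quantization step and quantization errors do not accumulate.

import numpy as np

def dpcm(u, h, q_step):
    """Current quantization with prediction based on quantized information.

    h is a fixed linear prediction coefficient vector (a simplifying
    assumption; the text allows time-varying prediction rules)."""
    p = len(h)
    u_rec = np.zeros(len(u))          # reconstruction u*(n), shared by coder and receiver
    w_q = np.zeros(len(u))            # quantized prediction errors (the transmitted train)
    for n in range(len(u)):
        past = u_rec[max(0, n - p):n][::-1]
        pred = float(np.dot(h[:len(past)], past))      # u_q*(n)
        w = u[n] - pred                                 # prediction error w(n)
        w_q[n] = q_step * np.round(w / q_step)          # uniform NNT quantization
        u_rec[n] = pred + w_q[n]                        # estimate (7.5.69)
    return w_q, u_rec

rng = np.random.default_rng(2)
u = np.cumsum(rng.standard_normal(1000))                # a slowly wandering test train
w_q, u_rec = dpcm(u, h=np.array([1.0]), q_step=0.5)
print(np.max(np.abs(u - u_rec)))   # bounded by q_step/2: the errors do not accumulate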
NOTES
¹ There is no standard terminology in this area. Often discretization and quantization are considered synonyms.
² The reader interested more deeply in the mathematical background of our considerations may consult the textbooks, e.g., Thompson [7.1], Usmani [7.2], Horn [7.3]. The reader more interested in technical problems may consider the K-dimensional space as the natural generalization of the 3-dimensional space which we perceive intuitively, and pursue our considerations assuming that K = 3.
³ For computational reasons it is more convenient to start the numbering of samples and of elements of vectors with 0 rather than with 1. We did this also in Section 3.2.4.
⁴ To simplify the notation we do not distinguish here between the interpretation of a set of numbers as a vector (strict notation cevc) and as a column matrix (strict notation cemx). The actual interpretation of the set is indicated by the type of operations performed.
⁵ The formal proof of this assertion is given in Section 8.3.3.
⁶ In Example 7.2.1 the same assumptions have been made. We rather invoke Example 7.2.4, because we will directly apply the method of deriving the decorrelating spectral representation described in that example to the multidimensional information. In contrast, the elementary method of deriving the decorrelating transformation used in Example 7.2.1 is not suited for information of dimensionality larger than 2.
⁷ From the formal point of view relation (7.2.5) is valid, but equation (7.2.6) is not strict. There is namely a continuum, thus an infinity, of functions which are represented by the same infinite sum on the right of (7.2.6). However, these functions differ only on a set of points having zero Lebesgue volume (measure). For physical reasons such functions cannot be distinguished. Thus, the representation (7.2.6) is a compressing transformation in the mathematical sense, but for technically distinguishable functions it is a reversible presentation transformation.
⁸ This applies still more to the Laplace transformation, in which a complex variable is used instead of the real angular frequency ω. The redundancy of such a representation is still larger than that of the continuous Fourier representation, but the Laplace representation gives more insight into the transformation of processes by linear stationary systems and is therefore suitable for the synthesis of those systems.
⁹ See the discussion in Section 1.5.5.
¹⁰ So large that the effect of the non-hexagonal aggregation sets at the edges of the set of potential forms of information can be neglected.
REFERENCES
[7.1] Thompson, E.E., An Introduction to Algebra of Matrices with some Applications, Adam Hilger, London, 1969.
[7.2] Horn, R.A., Johnson, C.R., Matrix Analysis, Cambridge University Press, Cambridge, 1988.
[7.3] Usmani, R.A., Applied Linear Algebra, Marcel Dekker, N.Y., 1987.
[7.4] Poularikas, A.D., The Transforms and Applications Handbook, IEEE Press, N.Y., 1995.
[7.5] Smith, W.W., Smith, J.M., The Handbook of Real-Time Fast Fourier Transforms, IEEE Press, N.Y., 1995.
[7.6] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., Numerical Recipes, Cambridge University Press, Cambridge, 1992.
[7.7] Curtain, R., Pritchard, A.J., Functional Analysis in Modern Applied Mathematics, Academic Press, N.Y., 1977.
[7.8] Mallat, S.G., "A Theory of Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 11 (1989), pp. 674-693.
[7.9] Oppenheim, A.V., Willsky, A.S., Signals and Systems, Prentice Hall, Englewood Cliffs, NJ, 1989.
[7.10] Lim, J.S., Two-Dimensional Signal and Image Processing, Prentice Hall, Englewood Cliffs, NJ, 1990.
[7.11] Schalkoff, R.J., Digital Image Processing and Computer Vision, J. Wiley, NY, 1989.
[7.12] Russ, J.C., ed., The Image Processing Handbook (2nd ed.), IEEE Press, NY, 1994.
[7.13] Rioul, O., Vetterli, M., "Wavelets and Signal Processing", IEEE SP Magazine, October 1991, pp. 14-38.
[7.14] Young, R.K., Wavelet Theory and Applications, SIAM Press, Philadelphia, 1993.
[7.15] Wickerhauser, M.W., Adapted Wavelet Analysis from Theory to Practice, IEEE Press, NY, 1996.
[7.16] Vetterli, M., Kovacevic, J., Wavelets and Subband Coding, Prentice Hall, Englewood Cliffs, NJ, 1995.
[7.17] Chui, C.K., ed., Wavelets: A Tutorial in Theory and Applications, Academic Press, NY, 1991.
[7.18] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, N.Y., 1991.
[7.19] Blanc-Lapierre, A., Fortet, R., Theory of Random Functions, vols. 1-2, Gordon and Breach, N.Y., 1967.
[7.20] Gersho, A., Gray, R.M., Vector Quantization and Signal Compression, Kluwer, Boston, 1992.
[7.21] Abut, H., ed., Vector Quantization, IEEE Press, NY, 1996.
[7.22] Nelson, M., The Data Compression Book, M&T Books, Redwood City, CA, 1991.
[7.23] Storer, J.A., Reif, J.H., DCC'91 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1991.
[7.24] Storer, J.A., Cohn, M., DCC'92 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1992.
[7.25] Storer, J.A., Cohn, M., DCC'93 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1993.
[7.26] Storer, J.A., Cohn, M., DCC'94 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1994.
[7.26] ICIP-94 Proceedings, IEEE Computer Society Press, Los Alamitos, CA, 1994.
8 STRUCTURES AND FEATURES OF OPTIMAL INFORMATION SYSTEMS
The first purpose of this chapter is to present a systematic approach to optimization problems. In the preceding chapters we considered several concrete optimization problems. They now serve as examples of the general methods presented here. On the other hand, the considerations in this chapter show those specific problems in a broader perspective and provide formal justifications for the heuristic assumptions that have been introduced previously. The second purpose of this chapter is to derive in a systematic way the structures of optimal information systems which use most efficiently the meta information about the properties of the superior system, the environment in which it operates, and the properties of the environment in which the information system operates. We also derive typical trade-off relationships between the indicators of information system performance.
The mathematical theory of optimization is very well developed, and computer technology allows even very complicated problems to be solved and implemented in real time. A strict formulation of an optimization problem is essential for using the existing mathematical apparatus and computational means. The crucial part of a problem's formulation is the definition of the performance criteria. This is the subject of the first section of this chapter. Section 8.2 gives a review of methods essential for the optimization of information systems. To describe them we use the concept of information, which allows a more compact presentation and directly suggests applications. In Sections 8.3, 8.4, and 8.5 the basic structures of optimal subsystems for the recovery of primary information, when only distorted information is available, are derived. It is shown that several recovery rules which have previously been introduced using heuristic arguments are, on quite general assumptions, optimal. This is so in particular for the next neighbour rules and the hierarchical information recovery rules. To optimize the whole system, the intermediate transformations preceding the ultimate recovery must be optimized taking into account the ultimate transformation. Such an overall optimization of an information system is discussed in Section 8.6. As the most important examples of the optimization of intermediate transformations, the optimization of quantization and of shaping the signal put into a communication channel is presented. In the previous chapters the role of auxiliary information about the state of the environment of an information system has been emphasised. The second part of Section 8.6 discusses the optimization of the state information subsystem.
8.1 INDICATORS OF INFORMATION SYSTEMS PERFORMANCE
This section presents a universal methodology for defining indicators characterizing the performance of an information processing rule as a whole. This is the basic and usually most difficult step in formulating the optimization problem. First, we consider the performance indicators of the information system in a concrete situation. They are determined by the properties of the superior system and the structure of the available information. In the second step, we define the indicators characterizing the performance of the information system as a whole. We concentrate on indicators characterizing the performance of a transformation of information considered as a whole. The definition of such an indicator must take into account (1) the properties of the superior system, (2) the structure of the concrete information, (3) the properties of the set of potential forms of information and of the states of the information system's environment. The concept of the operation removing the dependence on details is the key concept. It permits the definition of the performance of an information transformation as a whole to be based on the definition of the performance of the information system in a concrete situation. The general considerations are illustrated with examples. The obtained results are used in the forthcoming examples of solving optimization problems.
8.1.1 INDICATORS OF SYSTEMS PERFORMANCE IN A CONCRETE SITUATION
As a representative example of performance indicators, the indicators of distortions in the communication system shown in Figure 1.3 are considered. For a given primary information x, a recovered information x*, and a superior system it is, in general, possible to determine the decrease of performance of the superior system pursuing its goal on the false assumption that the state of its environment corresponds to the information x*, while in fact the information is x. In most cases it is natural to define a scalar q(x, x*) on which this loss essentially depends. We call q(x, x*) the indicator of distortions in a concrete situation (briefly, the indicator of concrete distortions). Often we take as the indicator of distortions a monotone (decreasing or increasing, e.g., square) function of a distance between the primary and the recovered information. The concept of distance occurred several times in the previous chapters. We introduced it in Section 1.4.3 as an essential element of the description of continuous sets of states or information. The distance appears also in the next-neighbour transformation (see Section 1.5.3). The choice of the distance function that determines the definition of the indicator of distortions is determined primarily by the properties of the superior system. We concentrate here on general rules for choosing the distance function as a basis for the definition of the distortion indicator.
DISCRETE INFORMATION
We assume that the set of potential forms of the primary information x and of the recovered information x* is X_d = {x_l, l = 1, 2, ..., L}. The indicator of concrete distortions is described by the square table
q = [q(l, k); l, k = 1, 2, ..., L] (8.1.1)
where q(l, k) is the indicator of distortions for the pair x = x_l, x* = x_k. Often any distortion is equally undesired. For example, if discrete data are processed it is essential that no errors occur, and if an error does occur it is usually irrelevant what form it takes. Then it is natural to take
q(l, k) = 0 for k = l, q(l, k) = 1 for k ≠ l (8.1.2)
We call such a distortion indicator symmetric.
ONE-DIMENSIONAL INFORMATION
We assume that the set of potential forms of the primary information x and of the recovered information x* is X_c = <x_min, x_max>. For broad classes of superior systems it is not x itself that is relevant for the deterioration of performance but the difference x - x*, which has the meaning of an error. Then it is natural to define the indicator of concrete distortions as a weighted error:
q(x, x*) = φ(x - x*) (8.1.3)
where φ(u) is the weighting function. Its choice depends in principle on the properties of the superior system, but often suitability for analytic calculations is also taken into account. Typical examples of the weighting function are
φ1(u) = 0 for |u| ≤ Δ, φ1(u) = 1 for |u| > Δ (8.1.4)
φ2(u) = u² (8.1.5)
For systems with a sensitivity threshold Δ (see our discussion in Section 1.4.3) the function φ1(u) is a natural choice of the distortion indicator. It can be considered the counterpart of the symmetric indicator (8.1.2) for discrete information. Because it is quite representative and suitable for analytic calculations, the square weighting function φ2(u) is often used in theoretical considerations.
K-DIMENSIONAL VECTOR INFORMATION
For structured information the indicator of distortions is usually defined in terms of the indicators of distortions of the corresponding components of the structured information. Let us first consider the vector information x = {x(n), n = 1, 2, ..., N}. Often the component x(n) has the meaning of information about the nth sample of a time-continuous state, and the inertia of the superior system causes the effects of the errors of recovery of the samples to accumulate. Then for the superior system the sum of the weighted errors of the components is relevant, and we take
q(x, x*) = Σ_{n=1}^{N} φ[u(n), n] (8.1.6)
where
u(n) = x(n) - x*(n) (8.1.7)
and φ(u, n) is the weighting function of the error of the nth component.
Taking as φ(u) the square function φ2(u) given by (8.1.5), as a special form of (8.1.6) we get
q(x, x*) = Σ_{n=1}^{N} [x(n) - x*(n)]² (8.1.8)
Comparing this with the definition (1.4.8) of the Euclidean distance, we get
q(x, x*) = d²(x, x*) (8.1.9)
A simple example of a distortion indicator taking into account the various weights of the components of the information is the indicator with linear weighting of the elementary distortions
φ(u, n) = a(n)φ(u) (8.1.10)
where a(n) > 0 are weighting coefficients. Taking φ2(u) given by (8.1.5) we get
q(x, x*) = Σ_{n=1}^{N} a(n)[x(n) - x*(n)]² (8.1.11)
When x(n) has the meaning of information about the nth sample of the state of the environment, then by taking a train a(n) growing with n we can take into account the declining effect of a sample of the state as the instant at which it was taken moves further into the past. Some superior systems utilize functions of the information rather than the primary information directly. As indicated in Section 7.3.4, page 349, the features of the harmonic spectrum of information are relevant for several superior systems. A typical example is the perception of sounds and images by people (see the discussion of the choice of the preliminary transformation in the JPEG standard in Section 2.5, page 115). Let us denote by T(·) a transformation transforming the primary information into a secondary vector information. A wide class of distortion indicators has the form
q(x, x*) = q_p[T(x), T(x*)] (8.1.12)
where q_p(·,·) is one of the previously considered distortion indicators for vector information. Notice that (8.1.10) can be considered a special case of (8.1.12) with T(x) = {√a(n) x(n), n = 1, 2, ..., N} and q_p(x, x*) given by (8.1.8). Introducing a performance criterion for the superior system itself and using adaptive matching procedures, we can choose the auxiliary transformation T(·) so that the information processed according to a rule optimized in the sense of the distortion index q(x, x*) maximizes the efficiency of the superior system.
We have considered performance indicators for vector information with a fixed number of components N. When the number of components can vary, it is usually convenient to normalize the index with respect to the number of elements:
q_a(x, x*) = (1/N) q(x, x*) (8.1.13)
where q(x, x*) is one of the previously mentioned distortion indicators. Taking, for example, the indicator (8.1.8), we get
q_a(x, x*) = (1/N) Σ_{n=1}^{N} [x(n) - x*(n)]² = A [x(n) - x*(n)]² (8.1.14)
where A is the operator of arithmetical averaging defined by (4.1.3). Our considerations about K-dimensional continuous vectors apply also to discrete vectors
x = {x(n), n = 1, 2, ..., N}, x(n) ∈ X_d = {x_l, l = 1, 2, ..., L} (8.1.15)
From equation (8.1.6) we get
q(x, x*) = Σ_{n=1}^{N} q_e[x(n), x*(n)] (8.1.16)
where q_e[x(n), x*(n)] is one of the distortion indicators for discrete information discussed on page 381. Assume that the elementary components are binary (L = 2) and that we use the symmetric indicator of distortions defined by (8.1.2). From (8.1.16) we get
q(x, x*) = d_H(x, x*) (8.1.17)
where d_H(x, x*) is the Hamming distance defined by equation (2.1.36). We used this indicator as the feedback information for the system described in Section 2.1.2.
To this point we have considered indicators of concrete distortions of vector information based on sums over components, either of the primary or of the transformed vector information. This is justified if the cumulative effect of the distortions of the elementary components of information is relevant for the performance of the superior system. However, for some superior systems the biggest distortion of a component can cause irreparable damage. For such systems the natural indicator of distortions is
q(x, x*) = max_n q_e[x(n), x*(n)] (8.1.18)
The vector information may be interpreted as a function of an integer argument considered as a whole. Therefore, the definitions of the individual distortion indicators can easily be modified for information that has the structure of a function of one or more continuous arguments. For example, the counterpart of (8.1.8) is the indicator of distortions of the image information described on page 23:
q(x, x*) = ∫∫ {x[t(1), t(2)] - x*[t(1), t(2)]}² dt(1) dt(2) (8.1.19)
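The concrete distortion indicators introduced above are straightforward to compute. The short sketch below is an added illustration with arbitrary numbers; it evaluates the square indicator (8.1.8), its weighted version (8.1.11), the worst-case indicator (8.1.18) and, for binary components, the Hamming indicator (8.1.17).

import numpy as np

def q_square(x, x_star):            # (8.1.8): sum of squared component errors
    return float(np.sum((x - x_star) ** 2))

def q_weighted(x, x_star, a):       # (8.1.11): linear weighting of elementary distortions
    return float(np.sum(a * (x - x_star) ** 2))

def q_hamming(x, x_star):           # (8.1.17): symmetric indicator, binary components
    return int(np.sum(x != x_star))

def q_worst_case(x, x_star):        # (8.1.18): largest elementary distortion
    return float(np.max((x - x_star) ** 2))

x      = np.array([1.0, 2.0, 3.0, 4.0])
x_star = np.array([1.1, 2.0, 2.7, 4.2])
print(q_square(x, x_star), q_square(x, x_star) / len(x))   # (8.1.8) and its normalized form (8.1.14)
print(q_weighted(x, x_star, np.array([1.0, 1.0, 2.0, 2.0])))
print(q_worst_case(x, x_star))
print(q_hamming(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))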
8.1.2 INDICATORS CHARACTERIZING THE PERFORMANCE OF AN INFORMATION TRANSFORMATION RULE AS A WHOLE
The processed information x* usually depends not only on the primary information x but also on some side factors; thus,
x* = T(x, z) (8.1.20)
where T(·) is the transformation describing the rule of information processing and z are the side factors acting in the system. The concrete distortion caused by the transformation is
q(x, x*) = q[x, T(x, z)] (8.1.21)
The system has to operate for any x ∈ X, z ∈ Z, where X (respectively Z) is the set of potential forms that the primary information (respectively the side factors acting in the system) can take. To define, on the basis of q[x, T(x, z)], a number Q[T(·)] characterizing the performance of the transformation T(·) considered as a whole, we have to take into account all forms of the primary information that the system has to process and all potential states of the environment in which the information system has to operate. To produce a number Q[T(·)] characterizing the transformation T(·) as a whole, we must remove the dependence of the set {q[x, T(x, z)]; x ∈ X, z ∈ Z} on the specific values q[x, T(x, z)] but keep its dependence on the transformation T(·). The operation transforming in such a way the set {q[x, T(x, z)]; x ∈ X, z ∈ Z} into the number Q[T(·)] we call the dependence-on-detail removing operation (briefly, detail removing operation, abbreviated DRO), and we denote it by D. Thus, the indicator of distortions caused by the transformation T(·) is
Q[T(·)] = D_{x∈X, z∈Z} q[x, T(x, z)] (8.1.22)
Defining a DRO we face the problem of defining an operation transforming a set of numbers {q(u); u ∈ U} into a number Q that characterizes the function q(·) as a whole. We write this in the form
Q = D_{u∈U} q(u) (8.1.23)
If the set U is discrete, thus U = {u_l, l = 1, 2, ..., L}, then typical dependence-removing operations are as follows:
• The arithmetical averaging operation
D_ar = A (8.1.24)
where A is the operation defined by (4.1.3);
• The operation of finding the maximum (minimum) value
D_ma = max (8.1.25)
• The operation of statistical averaging
D_st = E (8.1.26)
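As a small added illustration (the numbers are made up), the first two detail removing operations act on a set of concrete distortion values as follows; the statistical averaging operation is their limiting form when the values occur with given probabilities.

import numpy as np

# Concrete distortions q[x(i), T(x(i), z(i))] collected for several processed blocks
q_values = np.array([0.02, 0.11, 0.04, 0.37, 0.05, 0.08])

Q_arithmetic = np.mean(q_values)   # D_ar, eq. (8.1.24), leading to indicator (8.1.27)
Q_worst_case = np.max(q_values)    # D_ma, eq. (8.1.25), leading to indicator (8.1.28)
print(Q_arithmetic, Q_worst_case)
# D_st, eq. (8.1.26), is the expectation over the joint distribution of x and z;
# with observed relative frequencies it is estimated by the arithmetic average above.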
The choice of the operation D depends on (1) the properties of the superior system, (2) the structure of the concrete information, (3) the properties of the set of potential forms of information, particularly on the type of eventual weight assigned to each potential form. Let us first assume that
A1. The information is a train x_tr = {x(i), i = 1, 2, ..., I}, but the blocks x(i) are processed separately according to a rule T(·), as described in Sections 1.7.2, 6.2.1, and 7.5.2;
A2. For the superior system the distortions of all elements of the block are relevant.
Based on these assumptions, it is natural to take as the DRO the arithmetical averaging operation D_ar. Using (8.1.22) and (8.1.24), we get the performance indicator
Q[T(·)] = A_i q{x(i), T[x(i), z(i)]} = (1/I) Σ_{i=1}^{I} q{x(i), T[x(i), z(i)]} (8.1.27)
If an excessive error of recovery of a block could cause irreparable damage, then instead of A2 it is natural to assume
A3. For the superior system the performance of the block transformation in the worst case is relevant.
On this assumption we take as the DRO the operation D_ma of finding the maximum:
Q[T(·)] = max_i q{x(i), T[x(i), z(i)]} (8.1.28)
Till this point we have considered the indicators of performance of the transformation describing the overall operation of the information system. Often we can modify only the rules of operation of a subsystem, while the rules of operation of the other subsystems are fixed. Then we face the problem of defining indicators of performance not of the overall transformation rule but of the rule according to which a subsystem operates. The subsequent two sections show that by a suitable choice of the DRO our approach is also applicable to the definition of indicators characterizing the performance of subsystems of an information system. We illustrate this with the subsystems at the end and inside the prototype information system shown in Figure 1.2.
8.1.3 INDICATORS OF THE PERFORMANCE OF AN ULTIMATE INFORMATION TRANSFORMATION
We consider first the indicators of performance of the last subsystem in the prototype information system shown in Figure 1.2, on the assumption that all subsystems preceding the last one operate according to fixed and known rules. Then the overall information transformation rule performance index defined by (8.1.22) can be used directly to characterize the ultimate information transformation from the point of view of distortions. To simplify the terminology we assume that the prototype system shown in Figure 1.2 is a communication system and the last subsystem is the receiver. Because in most cases (with the exception of the rather unusual randomized transformations; see Section 1.5.5, page 51) the ultimate information transformation is deterministic, we need not take into account the side states occurring in (8.1.22).
Therefore, the operation of the receiver is described by the transformation T_rx*(·) transforming the available information r into the recovered information x*. Thus,
x* = T_rx*(r) (8.1.29)
Let us first assume that the system operates block-wise (see, e.g., Section 6.2.1, particularly Figure 6.1); thus T_rx*(r) has the meaning of the rule of separate recovery of the received blocks, and we consider only the training cycle (see, e.g., Section 1.7.2, particularly Figure 1.25). We denote by
U_tr = {(x(i), r(i)); i = 1, 2, ..., I} (8.1.30)
the train of pairs: primary information, the received signal carrying this primary information. We call this train the training information. On assumption A2 we take the DRO D_ar given by (8.1.24). We get
Q[T_rx*(·)] = (1/I) Σ_{i=1}^{I} q{x(i), T_rx*[r(i)]} = A_i q{x(i), T_rx*[r(i)]} (8.1.31)
EXAMPLE 8.1.1 INDICATOR OF THE PERFORMANCE OF A LINEAR INFORMATION RECOVERY RULE 1: PERFORMANCE DURING A TRAINING CYCLE
To illustrate the definition (8.1.31) we assume
A1. The primary and the recovered information are one-dimensional; we denote them x (respectively x*);
A2. The received information is N-dimensional; thus r = {r(n), n = 1, 2, ..., N};
A3. The information recovery rule is a linear transformation:
x* = Σ_{n=1}^{N} h(n) r(n) (8.1.32)
A4. The concrete distortion indicator is
q(x, x*) = (x - x*)² (8.1.33)
Since the transformation rule T_rx*(·) is determined by the set
h = {h(n), n = 1, 2, ..., N} (8.1.34)
we write Q(h) instead of Q[T_rx*(·)]. Substituting (8.1.32) and (8.1.33) in (8.1.31) we get
Q(h) = A_i [x(i) - Σ_{n=1}^{N} h(n) r(n, i)]²
     = A_i {[x(i)]² - 2x(i) Σ_{n=1}^{N} h(n) r(n, i) + Σ_{n=1}^{N} Σ_{m=1}^{N} h(n) h(m) r(n, i) r(m, i)} (8.1.35)
Interchanging the sequence of the operations A and Σ we get
Q(h) = A_i [x(i)]² - 2 Σ_{n=1}^{N} h(n) c_xr(n) + Σ_{n=1}^{N} Σ_{m=1}^{N} h(n) h(m) c_rr(n, m) (8.1.36)
where
c_xr(n) = A_i x(i) r(n, i) (8.1.37a)
c_rr(n, m) = A_i r(n, i) r(m, i) (8.1.37b)
are the empirical correlation coefficients. □
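The identity between (8.1.35) and (8.1.36) is easy to verify numerically. The sketch below is an added illustration (the training-data model and the candidate coefficient set h are arbitrary assumptions): it computes the empirical correlation coefficients (8.1.37) and shows that the quadratic form (8.1.36) reproduces the directly averaged square error.

import numpy as np

rng = np.random.default_rng(4)
I, N = 2000, 3
x = rng.standard_normal(I)                          # training: primary information x(i)
r = np.outer(x, [1.0, 0.5, 0.2]) + 0.3 * rng.standard_normal((I, N))   # received r(n, i)

c_xr = np.array([np.mean(x * r[:, n]) for n in range(N)])               # (8.1.37a)
c_rr = np.array([[np.mean(r[:, n] * r[:, m]) for m in range(N)] for n in range(N)])  # (8.1.37b)

h = np.array([0.6, 0.2, 0.1])                        # some candidate linear recovery rule
Q_direct = np.mean((x - r @ h) ** 2)                 # arithmetic average of (8.1.33), i.e. (8.1.35)
Q_from_c = np.mean(x ** 2) - 2 * h @ c_xr + h @ c_rr @ h   # eq. (8.1.36)
print(Q_direct, Q_from_c)                            # the two expressions agree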
We next give two examples of calculating the indicator of performance of the rule of recovery of a single block, on the assumption that the primary and the received information exhibit joint statistical regularities. Then it is natural to take statistical averaging as the DRO.
EXAMPLE 8.1.2 INDICATOR OF THE PERFORMANCE OF A DISCRETE INFORMATION RECOVERY RULE: THE PRIMARY AND RECEIVED INFORMATION EXHIBIT JOINT STATISTICAL REGULARITIES
We assume that
A1. The primary and the recovered information are discrete; X_d = {x_l, l = 1, 2, ..., L} is the set of their potential forms;
A2. The indicator q(l, k) of distortions in a concrete situation is the symmetric function given by (8.1.2).
As the DRO we take the statistical averaging operation:
Q[T_rx*(·)] = E q(x, x*) (8.1.38)
where x, x* are here the random variables representing the primary, respectively the recovered, information. Using the definition (4.4.10) we get
Q[T_rx*(·)] = Σ_{l=1}^{L} Σ_{k=1}^{L} q(l, k) P(x = x_l, x* = x_k) (8.1.39)
Substituting (8.1.2) we get
Q[T_rx*(·)] = Σ_{l=1}^{L} Σ_{k≠l} P(x = x_l, x* = x_k) = 1 - Σ_{l=1}^{L} P(x = x_l, x* = x_l) = P(x ≠ x*) (8.1.40a)
Thus:
The probability of error is the indicator of the performance of an information recovery rule in the sense of definition (8.1.38), on the assumption that the concrete distortion indicator is symmetric. (8.1.40b)
Now instead of A1 and A2 we assume
A3. The primary and the recovered information are one-dimensional;
A4. The indicator of distortions in a concrete situation is the square error given by (8.1.33).
Proceeding as previously we get
Q[T_rx*(·)] = E(x - x*)² (8.1.41a)
Thus:
The mean square error is the indicator of the performance of an information recovery rule in the sense of definition (8.1.38), on the assumption that the concrete distortion indicator is the square error. (8.1.41b) □
We now give a counterpart of Example 8.1.1.
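As an added numerical illustration (the toy recovery rules below are assumptions, not rules derived in the text), the two statistical indicators can be estimated by averaging the concrete distortion indicator over simulated pairs (x, x*):

import numpy as np

rng = np.random.default_rng(3)
I = 100_000

# Discrete case, symmetric indicator (8.1.2): Q = P(x != x*), cf. (8.1.40)
x = rng.integers(0, 4, size=I)                    # primary discrete information, L = 4 forms
x_star = np.where(rng.random(I) < 0.9, x, rng.integers(0, 4, size=I))   # a toy recovery rule
print("estimated probability of error:", np.mean(x != x_star))

# One-dimensional case, square indicator (8.1.33): Q = E(x - x*)^2, cf. (8.1.41)
xc = rng.standard_normal(I)
xc_star = 0.8 * xc + 0.1 * rng.standard_normal(I)   # a toy linear recovery rule
print("estimated mean square error:", np.mean((xc - xc_star) ** 2))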
EXAMPLE 8.1.3 INDICATOR OF THE PERFORMANCE OF A LINEAR INFORMATION RECOVERY RULE 2: THE PRIMARY AND RECEIVED INFORMATION EXHIBIT JOINT STATISTICAL REGULARITIES
We make assumptions A1 to A4 of Example 8.1.1, but now we take as the performance indicator the statistical average
Q(h) = E[x - Σ_{n=1}^{N} h(n) r(n)]² (8.1.42)
where x, r(n), n = 1, 2, ..., N, are the random variables representing the primary information (respectively the components of the received information). Since the E operation is, like the A operation, a linear operation, we can proceed similarly as in Example 8.1.1 and we get
Q(h) = E x² - 2 Σ_{n=1}^{N} h(n) c_xr(n) + Σ_{n=1}^{N} Σ_{m=1}^{N} h(n) h(m) c_rr(n, m) (8.1.43)
where
c_xr(n) = E x r(n) (8.1.44a)
c_rr(n, m) = E r(n) r(m) (8.1.44b)
are the statistical correlation coefficients. □
8.1.4 INDICATORS OF THE PERFORMANCE OF A PRELIMINARY INFORMATION TRANSFORMATION
Considering the subsystem performing the ultimate information transformation, we assumed that the transformations performed by all subsystems other than the considered one are fixed and known. Therefore, we could use the overall transformation performance index directly as an index characterizing the ultimate transformation. Usually, however, the properties of the ultimate information produced by the information system are influenced not only by the considered subsystem but also by non-fixed transformations performed by other subsystems. We now show how to define the index of performance of a subsystem in such a situation. As a representative example we take the preliminary information transformation T_xv(·), which transforms the primary information x into the information v that is fed into the subsystem performing the fundamental transformation (see Figure 1.2). Concrete examples of such a transformation are the volume-compressing transformations considered in Chapters 6 and 7. We assume that
A1. The preliminary information transformation T_xv(·) is deterministic but may be irreversible;
A2. The fundamental transformation T_vr(·) transforming the information v into the information r available for the ultimate transformation (see Figure 1.2) is a deterministic and reversible transformation;
A3. The ultimate transformation T_rx*(·) is a deterministic transformation.
In view of assumption A2 we assume further that r = v (see Figure 1.2); thus T_rx*(·) = T_vx*(·). Based on these assumptions we have
q(x, x*) = q{x, T_vx*[T_xv(x)]} (8.1.45)
We did not assume that the uhimate transformation r^(-) is known. Therefore, to obtain a performance indicator of the preliminary transformation T^(') we have to remove the dependency both on concrete x and concrete r„,() thus, Q[T^(')] = D D a{x, T^.[TM]}Let us assume that x exhibits statistical regularities. Then we take D= E X
(8.1.46) (8.1.47)
X
To remove the dependence on the ultimate transformation Tyj^*{') we have to introduce a model of its indeterminism. If we are free to chose the rule T^, (•), then for a given T^(') we would apply a possibly good rule r"^*(*). In such a situation we take D = min (8.1.48) If, however, we design a preliminary information processing subsystem; that is an irreversible dimensionality reducing subsystem for a system in which many users recover (decompress) the primary information using their own devices coming from many vendors; then we may assume that the transformation 7Vx*() is random. Then, to obtain the performance indicator characterizing the preliminary information transformation, as the D.R.O. we take D =
E.
(8.1.49)
COMMENT 1 We have described here the indicators of distortions that are quite representative and convenient in analytical considerations. A great variety of indicators of performance of processing of special types information was proposed. Such indicators for speech processing (called distortion measures) are listed in Makhoul [8.1, (Section IB)] and for images (called fidelity indicators) in Lim [8.2], Dougherty [8.3], or Russ [8.4]. The consequent approach to the definition of indicators of performance in a concrete situation would be to express such an indicator in terms of the indicators of performance of the superior system for which the information is destined. Our considerations in Section 1.6.1, particularly equation (1.6.3), suggests such a procedure. COMMENT 2 In previous consideration we used dependence removing operations to define characteristics of structured information bases on characteristics components. For example the definitions of distance (1.4.8), (1.4.9), and (1.4.10) for vectors, diagrams, and images can be considered as a result of applying an operation removing the dependence of the basic distance definition (1.4.7) between corresponding elements on the identifier of the component. Also the definitions of the volume of a potential information based on definition of the volume of a concrete information presented in Section 6.1 use detail removing operations. Still an other example is definition of channel capacity base on the amount of statistical information which the output signal provides about the in put signal in a concrete communication channel.
390
Chapter 8 Optimization of Information Systems
8.2 METHODS OF SOLVING OPTIMIZATION PROBLEMS Section 1.6.6 formulated the problem of optimization of an information transformation. In the previous section the general methodology of defining a performance indicator as a function (functional) of an information transformation as a whole has been given. I general such a transformation is described by a function of a continuous arguments, e.g., by a pulse response. However, in most cases we chose the type of the information transformation (typically a linear transformation or a next neighbour transformation) in which only a set of parameters h = {h(n), w = l, 2 ,• • • , A^} is free. The same situation we have when to-days dominant digital processing technic is used. Therefore, we concentrate here on the problem of finding a minimum or maximum of a function g, of A^ variables h, eventually with some constraints. Such a problem has been called in Section 1.6.2 parametric optimization problem and was denoted as O P / i Q , | C , , m=2, 3,- • • , M , QECH-
Finding the maxima or minima of functions of several variables is the subject of mathematical optimization theory. There are several excellent books on the subject. See e.g, Minoux [8.5] for principles and Cuthbert [8.6] and Press et al. [8.7] for algorithms and programs. Here we only sketch the basic methods that are of greatest importance for information systems optimization. 8.2.1 REDUCTION OF THE MINBVflZATION PROBLEM TO SEARCH IN A SET OF SOLUTIONS OF AN AUXILIARY EQUATION We consider finding the point of minimum of a scalar function f(a) of a scalar argument aE^ where ^ is the set of values which the argument can take. Such a value a^ of the argument that f(aj
8.2 Methods of Solving Optimization Problems
391
For a scalar function/(a) of a A^-DIM variable a = {a(n); AZ = 1, 2, • • , A^} the generalization of (8.2.2) holds. Under corresponding assumptions we have to search for the minimum point only in the set ^^ of solutions of the set of equations iM-0 da(n) yv .
The vector
n-l,2,..A^
(8.2.3)
^-
^ . g r a d / . E ' ^ ^^
bin), (8.2.4) da(n) where b(n), A2 = 1, 2, • • ,A^are the unit coordinate vectors of the basic orthogonal coordinate system (see Section 7.1.1), is called gradient. Using it we write the set of equations (8.2.3) in the compact form grad/=0 (8.2.5) where 0 is a vector with all components equal 0. We give now examples of applications of the set of equation (8.2.3) or equivalently of (8.2.5) for finding optimal transformations of information. EXAMPLE 8.2.1 OPTIMIZATION OF A LINEAR INFORMATION TRANSFORMATION: ALL NEEDED STATISTICAL INFORMATION IS AVAILABLE; ANALYTIC APPROACH We consider the optimization of the transformation of available A^-dimensional information r about the primary one-dimensional information x into the recovered informations*. We assume that AL The transformation is linear, given by (8.L32); thus, it is described by a set h of coefficients; A2. The primary and available information exhibit statistical regularities; A3. As distortions indicator we take the mean square error given by (8.1.42). From equation (8.1.43) it follows that on those assumptions we need only the rough description of the statistical properties by the correlation coefficients (see Section 4.4.4). Therefore, we assume that A4. Exact information about the correlation coefficients is available. After substituting a^h, f{a)-^Q{K) the set of equations (8.2.3) takes the form:
∂Q(h)/∂h(n) = 0,  n = 1, 2, …, N.    (8.2.6)
From (8.1.43) we get
∂Q(h)/∂h(n) = −2c_xr(n) + 2Σ_{m=1}^{N} h(m)c_rr(n, m).    (8.2.7)
Substituting this in (8.2.6) we obtain
c_xr(n) = Σ_{m=1}^{N} h(m)c_rr(n, m),  n = 1, 2, …, N.    (8.2.8a)
We write this set of equations briefly as the matrix equation
C_rr h = C_xr,    (8.2.8b)
where
C_rr = [c_rr(m, n)] — the correlation matrix of the components of the available information r,
C_xr = [c_xr(n)] — the column matrix of correlations between the primary information x and the components of the available information r,
h = [h(m)] — the column matrix of coefficients determining the linear transformation.
On very general assumptions (see, e.g., Thompson [8.8]), the inverse matrix C_rr⁻¹ exists. Then the set
h₀ = C_rr⁻¹ C_xr    (8.2.9)
is the only solution of the set of equations (8.2.3); thus, it is the only element of the set 𝒜₀. It can be easily proved that this is the point of minimum of Q(h). Special cases of the considered optimization problem occurred already in Section 7.3.1 (page 331) and in Section 7.5.3 (page 374).
Calculation of the optimal coefficients is a transformation of the rough but sufficient statistical information in the form of correlation matrices into the set h₀ of coefficients determining the optimal transformation of the working information. Thus, the considered optimal subsystem has a two-layer structure, as shown in Figure 8.1a. This is a special case of the layered system shown in Figure 1.24b. □
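As a minimal numerical sketch (not part of the original text), the closed-form solution (8.2.9) can be computed directly from the correlation matrices; here the correlations are estimated from simulated training pairs, which also illustrates the arithmetical-correlation variant used in Example 8.2.2 below. All names and data are illustrative assumptions.

```python
import numpy as np

def optimal_linear_coefficients(C_rr, C_xr):
    """Solve C_rr h = C_xr (equation (8.2.8b)), i.e. h0 = C_rr^{-1} C_xr (8.2.9)."""
    return np.linalg.solve(C_rr, C_xr)

# Illustrative use: estimate correlations from training pairs {x(j), r(j)}
# (arithmetical correlations) and compute the optimal coefficient set h0.
rng = np.random.default_rng(0)
J, N = 2000, 4
h_true = np.array([0.5, -0.2, 0.1, 0.3])
R = rng.normal(size=(J, N))                  # available information r(j)
x = R @ h_true + 0.1 * rng.normal(size=J)    # primary information x(j)

C_rr = R.T @ R / J          # correlation matrix of the components of r
C_xr = R.T @ x / J          # correlations between x and the components of r
h0 = optimal_linear_coefficients(C_rr, C_xr)
print(h0)                   # close to h_true; the (mean) square error is minimized
```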
Figure 8.1. The optimized linear information transformation: (a) the information exhibits statistical regularities and the statistical correlation coefficients are available, (b) the system operates with a training cycle; {x(j), r(j)}, j = 1, 2, …, J is the training information.
The assumptions made in the example are justified if the primary and available information exhibit statistical regularities and all needed statistical information is available. We now consider the situation when this is not the case.
EXAMPLE 8.2.2 OPTIMIZATION OF A LINEAR INFORMATION TRANSFORMATION: ONLY TRAINING INFORMATION IS AVAILABLE
We assume that
A1'. The system operates with a training cycle (see Section 1.7.2 and Figure 1.25); {x(j), r(j)}, j = 1, 2, …, J is the training information;
A2'. The block-wise recovery of information during a training cycle is considered;
A3'. The recovery rule is linear;
A4'. The arithmetical square error given by (8.1.35) is the indicator of distortions.
Since equations (8.1.36) and (8.1.43) are similar, we can use the results (8.2.8) and (8.2.9) by replacing the statistical by the arithmetical correlation coefficients. The block diagram of such an optimized system is shown in Figure 8.1b. □
COMMENT 1
In practice, the assumptions A2 and A4 made in Example 8.2.1 are justified when, by some earlier analysis of the properties of the environment of the information system, the existence of statistical regularities and their stationarity were established and the correlation coefficients were acquired.
Our general considerations in Section 1.7.2 about systems with a training cycle apply to Example 8.2.2. If the primary information x and the available information r are not accessible at the same place, providing training information may not always be possible. Using the coefficients optimized during a training cycle in the following working cycle is justified if the states of the system's environment are stationary. If they are only quasi-stationary, we have to interleave the working and training cycles as shown in Figure 1.25.
For information compression or prediction systems no special subsystem providing training information is necessary, because usually both the primary information and the compressed information (playing the role of the available information) are directly accessible. If introducing a delay is permissible, we calculate the optimal set h₀ for the whole train and apply it for block-wise compression as shown in Figure 1.27.
COMMENT 2
If information exhibits statistical regularities, then in view of the fundamental property of long sequences the arithmetical averages are estimates of statistical averages (see Section 4.3.1). Thus, for long training trains the system shown in Figure 8.1a may be considered as the limiting form of the system shown in Figure 8.1b.
COMMENT 3
The problems considered in both examples play the role of atomic problems into which a great variety of problems of statistical optimization of linear transformations of processes and images can be decomposed. Typical examples are filtration and
prediction of processes, filtration and enhancement of images, and identification of characteristics of linear dynamic objects (in particular, of the pulse response); see, e.g., Middleton, Goodwin [8.9]. Several methods have been developed to obtain the solution of the counterparts of the basic equation for vector-valued functions or for functions of two or three arguments (still and moving images). Universal is the method of Kalman (see, e.g., Proakis [8.13, Ch. 6]). It allows the optimal recovery of trains of scalars or vectors to be implemented successively, by using the already optimally recovered components to calculate the next optimally recovered component.
To this point minimization without constraints has been considered. In most cases various types of constraints are imposed on the variables in the optimized system. The basic types of constraints were discussed in Section 1.6.2. Often we face optimization problems with equality constraints. So are called the constraints
g(m, a) = 0,  m = 2, 3, …, M,    (8.2.10)
where g(m, a), m = 2, 3, …, M are given functions. One of the important conclusions of optimization theory is that
On general assumptions about the existence of derivatives of the criterion function f(a) and of the functions g(m, a), m = 2, 3, …, M determining the equality constraints, the solution of the optimization problem OP a f(a) | g(m, a) = 0, m = 2, 3, …, M, is an element of the set of solutions of the equation
grad_a f_Λ(a) = 0,    (8.2.11a)
where
f_Λ(a) ≡ f(a) + Σ_{m=2}^{M} λ(m) g(m, a)    (8.2.11b)
is an auxiliary function and λ(m), m = 2, …, M are auxiliary parameters (called the Lagrange function and the Lagrange parameters, respectively). For a detailed description of this method see Minoux [8.5]. We have already used it previously; see Section 7.5.2, page 370.
8.2.2 NUMERICAL FINDING OF THE ZERO POINT: THE SAMPLES OF THE FUNCTION ARE EXACTLY KNOWN
Only in special cases is it possible to find the solution of the set of auxiliary equations in closed form. Of paramount importance for the practical realization of optimal information systems are numerical methods of finding such a solution. Those methods are also important because they can be easily extended to the case when the criterion function (and possibly the constraint functions) are not given in analytical form and calculation of the derivatives in closed form is not possible. This, in turn, makes it possible to implement those algorithms in intelligent information systems.
A GENERAL METHOD OF FINDING THE ZERO POINT
We consider first the simple case when a continuous function g(a) is given and we want to find numerically its zero point a₀, thus the root of the equation
g(a) = 0.    (8.2.12)
Most numerical methods of solving this equation are based on the following idea:
1. We take an initial value a(1) for the variable a;
2. Using the available meta information about the function g(·) in the neighborhood of a(1), we approximate this function by a function g′(a, 1), a ∈ 𝒜;
3. We find the solution a(2) of the equation
g′(a, 1) = 0;    (8.2.13)
4. Using an approximating function g′(a, 2) we proceed with a(2) as we did with a(1); continuing this we generate a train a(j), j = 1, 2, …;
5. We choose the approximating functions g′(a, j) so that the train a(j), j = 1, 2, … of solutions of the equations g′(a, j) = 0 converges to a solution a₀ of (8.2.12);
6. The calculation of the solutions of the auxiliary equations is simple;
7. For some j = J we stop the procedure and take a(J) as the approximation of a₀. The typical stopping rule is to stop when for the first time
|a(j+1) − a(j)| ≤ δ,    (8.2.14)
where δ > 0 is a given small constant.
The successive values are thus generated by a recurrence of the form
a(j+1) = a(j) + Δ(j),    (8.2.15)
where Δ(j) is a correction computed from the available values of g(·). A simple and frequently used approximating function is the linear function
g′(a, j) = g[a(j)] + λ₁(j)[a − a(j)],    (8.2.16)
where λ₁(j) > 0 is a train of coefficients. The purpose of these coefficients is to achieve the convergence mentioned in point 5. Their choice depends in a crucial way on the type of the available information about the properties of the function g(·). The solution of the equation g′(a, j) = 0 for the function given by (8.2.16) is
a(j+1) = a(j) − λ(j)g[a(j)],    (8.2.17a)
where
λ(j) = 1/λ₁(j).    (8.2.17b)
We call Δ(j) = −λ(j)g[a(j)] the correction (in the j-th step). The block diagram of a system generating the train is shown in Figure 8.2.
A(/")
X^(/>1)
r
ISTEP DELAY a(j)
AO)
1 STEP DELAY
a(J)
correction
a)
b)
Figure 8.2. The system implementing a recurrence: (a) of scalars (e.g., generated by (8.2.15)), (b) of vectors (e.g., generated by (8.2.29)); thick lines denote the flow of vectors.
The recursive equation (8.2.15) permits the train a(j), j = 1, 2, … to be calculated successively. We call this equation briefly a recurrence. This recurrence together with the stopping rule (8.2.14) we call the recurrent zero-point searching algorithm. To specify the recurrence we must specify the train λ₁(j) or, equivalently, λ(j). This, however, must take into account the available information about the class (set) of potential forms of the function g(·) (in our terminology, the meta information about g(·)).
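The recurrent zero-point searching algorithm can be sketched in a few lines; the function g, the coefficient schedule, and the tolerance below are illustrative assumptions, not prescriptions from the text.

```python
def zero_point_search(g, a1, lam, delta=1e-8, max_steps=10_000):
    """Generate a(j+1) = a(j) - lam(j) * g(a(j))  (recurrence (8.2.17a))
    and stop when |a(j+1) - a(j)| <= delta  (stopping rule (8.2.14))."""
    a = a1
    for j in range(1, max_steps + 1):
        a_next = a - lam(j) * g(a)
        if abs(a_next - a) <= delta:
            return a_next
        a = a_next
    return a

# Example: g(a) = a**3 - 2 has its zero point at 2 ** (1/3); a small constant
# coefficient suffices here because the slope at the zero point is positive.
root = zero_point_search(lambda a: a**3 - 2, a1=1.0, lam=lambda j: 0.1)
print(root)
```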
EXACT INFORMATION ABOUT THE DERIVATIVE OF g(·) IS AVAILABLE
When for any a ∈ 𝒜 we can calculate the derivative dg/da, then it is natural to take as the approximation g′(a, j) in (8.2.16) the first two terms of the Taylor series around a(j). Thus, we take
λ₁(j) = dg/da |_{a = a(j)}.    (8.2.18)
The typical diagram for the corresponding g′(a, j) is shown in Figure 8.3a. From calculus it is known that, on general assumptions about g(·), the train (8.2.17) with λ₁(j) given by (8.2.18) converges to a₀, as illustrated in Figure 8.3a.
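When the derivative is available, the choice (8.2.18) makes λ₁(j) the local slope, which is the classical Newton iteration; a brief sketch under that assumption (the test function is hypothetical):

```python
def newton_zero_point(g, dg, a1, delta=1e-10, max_steps=100):
    """Recurrence (8.2.17a) with lambda(j) = 1 / lambda_1(j), where
    lambda_1(j) is the derivative dg/da taken at a(j) (equation (8.2.18))."""
    a = a1
    for _ in range(max_steps):
        a_next = a - g(a) / dg(a)
        if abs(a_next - a) <= delta:
            return a_next
        a = a_next
    return a

print(newton_zero_point(lambda a: a**3 - 2, lambda a: 3 * a**2, a1=1.0))
```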
Figure 8.3. Examples of trains generated by the recurrence (8.2.16); the algorithm: (a) using the exact value of the derivative, (b)-(g) with constant coefficient λ_c. Figures (c)-(g) illustrate the dependence on λ_c of the convergence close to the zero point a₀, on the assumption that the function g(a) is approximated by a linear function having slope γ₀.
To get the derivative dg/da for any a ∈ 𝒜, we would have to calculate it either analytically or numerically. In most information systems this is not possible, and we have only some rough information about g(·). We now show that we can assure the convergence when we have information only about the derivative at the zero point.
ONLY ROUGH INFORMATION ABOUT THE DERIVATIVE OF g(·) AT THE ZERO POINT IS AVAILABLE
When we do not have information about the derivative of g(·) for each a, then the simplest choice of the train λ(j) is to take in the recurrence (8.2.16)
λ(j) = λ_c = const.    (8.2.19)
To analyze the convergence of the sequence a(j) to a₀ we assume that
γ₀ > 0,    (8.2.20a)
where
γ₀ ≡ dg/da |_{a = a₀}    (8.2.20b)
is the derivative, or equivalently the slope, of the function g(a) at the zero point a₀. A typical train generated by (8.2.17a) with λ(j) = λ_c = const is shown in Figure 8.3b. It is evident that the convergence of the generated train depends on the properties of the function g(a) in the neighborhood of the zero point a₀. When in an environment of this point the derivative of g(a) exists, then in this environment we can approximate g(a) by a linear function and study the convergence of the train a(j) on the assumption that g(a) is a linear function. The corresponding trains for various λ_c are shown in Figures 8.3c to 8.3g. Those figures show that what is essential for the convergence is the magnitude of the coefficient λ_c compared with the slope γ₀ defined by (8.2.20b). The train converges monotonically if λ_c < γ₀; a₀ is reached in one step if λ_c = γ₀; the train oscillates if λ_c > γ₀ and diverges for λ_c > 2γ₀. Although for any λ_c < γ₀ the convergence is monotone, the smaller λ_c, the slower the convergence is. If we know the derivative γ₀ exactly, then taking λ_c = γ₀ we achieve the fastest convergence. If we have about γ₀ only the rough information that γ₀ ≥ γ_min, where γ_min is known, then taking λ_c = γ_min we achieve monotonic convergence, but for γ₀ > γ_min it is slower than the fastest possible. Thus, for having only the inexact information γ_min about the slope γ₀, we pay with a slowed-down convergence rate.
ONLY MINIMAL INFORMATION ABOUT THE DERIVATIVE OF g(·) AT THE ZERO POINT IS AVAILABLE
When we know only that γ₀ > 0, we say that only minimal information about the derivative γ₀ is available. Then a search for the zero point using a fixed coefficient λ(j) would be unpredictable, and we have to use the general recurrence (8.2.17a)
a(j+1) = a(j) − λ(j)g[a(j)]    (8.2.21)
with coefficients λ(j) varying with j so that
lim_{j→∞} λ(j) = 0.    (8.2.22)
To get the guidelines for choosing such a train λ(j), we look at the total movement from a(1) to a(j+1). From (8.2.21) we get
a(j+1) = a(1) − Σ_{i=1}^{j} λ(i)g[a(i)].    (8.2.23)
From this it follows in turn that, to achieve for any a(1) the convergence of a(j) to a₀, the series Σ λ(i) cannot converge; thus, it must be
Σ_{i=1}^{∞} λ(i) = ∞.    (8.2.24)
A useful class of sequences λ(j) are the sequences
λ(j) = A/j^α,    (8.2.25)
where A > 0 and α ≥ 0 are constants. Their basic property is
Σ_{j=1}^{∞} A/j^α = ∞ for α ≤ 1,    (8.2.26)
Σ_{j=1}^{∞} A/j^α < ∞ for α > 1.    (8.2.27)
For α = 0 we have the previously considered case when λ(j) = λ_c = const.
NUMERICAL FINDING OF A SOLUTION OF A SET OF EQUATIONS
The described algorithms can be generalized to find numerically the solution of the set of equations
g(a, n) = 0,  n = 1, 2, …, N,    (8.2.28)
where g(a, n) are functions of the N-dimensional argument a = {a(n), n = 1, 2, …, N}. The generalization of (8.2.21) is the recurrence
a(j+1) = a(j) − λ(j)g[a(j)],    (8.2.29)
where
g(a) ≡ {g(a, n), n = 1, 2, …, N}.    (8.2.30)
The counterpart of (8.2.14) is the stopping condition
|a(j+1) − a(j)| ≤ δ.    (8.2.31)
EXAMPLE 8.2.3 OPTIMIZATION OF A LINEAR INFORMATION TRANSFORMATION: RECURRENT CALCULATION OF THE OPTIMAL COEFFICIENTS
We make again the assumptions A1 to A4 of Example 8.2.1, but instead of calculating the solution (8.2.9) in closed form we look for it numerically. From (8.2.7) we take as the functions whose common zero point we are looking for
g(h, n) ≡ (1/2) ∂Q(h)/∂h(n),    (8.2.32)
thus
g(h, n) = Σ_{m=1}^{N} h(m)c_rr(n, m) − c_xr(n),  n = 1, 2, …, N.    (8.2.33)
We write the set of those relationships in the matrix form
g(h) = C_rr h − C_xr.    (8.2.34)
The sequence (8.2.29) takes the form
h(j+1) = h(j) − λ(j)[C_rr h(j) − C_xr].    (8.2.35)
To achieve the convergence to the solution given by (8.2.9), using the previously discussed guidelines we choose the auxiliary train λ(j) depending on the available meta information about the matrices. □
COMMENT
Finding the optimal set of coefficients by running the recursion (8.2.35) is an alternative to calculating it from equation (8.2.9). The described procedure is useful when the matrices C_rr, C_xr can be considered as minor changes of some primary matrices C′_rr, C′_xr for which we have already calculated the optimal set h′₀. Using the sequence (8.2.35) with h(1) = h′₀ may require much less calculation than calculating the inverse matrix C_rr⁻¹ in (8.2.9).
8.2.3 NUMERICAL FINDING OF THE ZERO POINT: ONLY DISTORTED SAMPLES OF THE FUNCTION ARE AVAILABLE
We now show that on very general assumptions it is possible to obtain a transformation that is optimal in the sense of a performance indicator characterizing the transformation as a whole, even though an equation expressing the indicator explicitly in terms of the statistical features of the transformation is not available. However, we must have access to inaccurate information about the values of the indicator of performance in concrete situations. Thus, the procedure circumvents the need for direct information about the statistical properties of potential states.
THE BASIC RECURRENT ALGORITHM
We start with the simple but representative one-dimensional case. We assume that
1. Observations of a function G(a, U), where a ∈ 𝒜 is a scalar variable and U is a random, generally multidimensional variable, are available;
2. We are looking for the zero point of the function
g(a) = E_U G(a, U).    (8.2.36)
We introduce the random variable
Ξ ≡ G(a, U) − g(a).    (8.2.37)
From the definition it follows that
E Ξ = 0.    (8.2.38)
We write definition (8.2.37) in the form
G(a, U) = g(a) + Ξ.    (8.2.39)
Thus, we can interpret G(a, U) as a random variable representing the value g(a) distorted by the additive noise Ξ. We consider a train a(j), j = 1, 2, … of values of the variable a and a train G[a(j), U(j)] of random variables, where U(j) is a train of statistically independent random variables with the same probability distribution. The counterpart of the basic recurrence (8.2.21) is the recurrence
a(j+1) = a(j) − λ(j)G(j),    (8.2.40)
where G(j) is an observation of the variable G[a(j), U(j)]. A typical train generated by (8.2.40) is shown in Figure 8.4.
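A minimal sketch of the recurrence (8.2.40) with the typical choice λ(j) = 1/j; the distorted observations G(j) are simulated here with hypothetical additive Gaussian noise, so the sketch illustrates only the mechanism, not a particular system.

```python
import numpy as np

def stochastic_zero_point(observe, a1, steps=5000):
    """Recurrence (8.2.40): a(j+1) = a(j) - lambda(j) * G(j),
    where G(j) is a distorted observation of g(a(j)) and lambda(j) = 1/j."""
    a = a1
    for j in range(1, steps + 1):
        a = a - (1.0 / j) * observe(a)
    return a

# Hypothetical example: g(a) = 2*(a - 3) has its zero point at a0 = 3 and is
# observed through additive noise, as in equation (8.2.39).
rng = np.random.default_rng(1)
observe = lambda a: 2.0 * (a - 3.0) + rng.normal(scale=1.0)
print(stochastic_zero_point(observe, a1=0.0))   # close to 3
```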
Figure 8.4. A typical train a(j) generated by the recurrence (8.2.40).
Using (8.2.40) similarly to (8.2.23), we get
a(j+1) = a(1) − Σ_{i=1}^{j} λ(i)g[a(i)] − Σ_{i=1}^{j} λ(i)z(i),    (8.2.41)
where
z(i) = G(i) − g[a(i)].    (8.2.42)
Since the second component of a(j+1) is an observation of a random variable, an element a(j+1) of the train generated by (8.2.40) is to be considered as a realization of a random variable. Therefore, we cannot speak about the convergence of the train a(j+1), j = 1, 2, … in the sense of classical analysis, but we have to look for a convergence of the corresponding random variables. For most technical applications we would consider that a train generated by (8.2.40) converges to the zero point a₀ of the function g(a) if
lim_{j→∞} E[a(j) − a₀]² = 0.    (8.2.43)
To achieve the convergence of the generated train in this sense we must choose properly the train of auxiliary coefficients λ(j). Comparing (8.2.41) and (8.2.23) we see that, to ensure unrestricted correction ability, we must again require that (8.2.24) is satisfied, that is, that
Σ_{i=1}^{∞} λ(i) = ∞.    (8.2.44)
To achieve the convergence of the generated train we must achieve not only the convergence of the first but also of the second sum in (8.2.41). Detailed analysis of this sum shows (see, e.g., Schmetterer [8.10]) that the necessary condition for the convergence of the train generated by (8.2.40) to a₀ is that
Σ_{i=1}^{∞} λ²(i) < ∞.    (8.2.45)
From (8.2.26) it follows that if we take
λ(j) = A/j^α,    (8.2.46)
where A > 0 and 0.5 < α ≤ 1, then we satisfy both conditions (8.2.44) and (8.2.45). A typical choice is
λ(j) = 1/j.    (8.2.47)
As in the case when the exact values of the function are available, our present considerations can be generalized to finding a solution of a set of equations (8.2.28). The generalization of the recurrence (8.2.40) is the recurrence
a(j+1) = a(j) − λ(j)G(j),    (8.2.48)
where G(j) = {G(j, n); n = 1, 2, …, N} and G(j, n), n = 1, 2, …, N is an observation of such a random variable G[a(j), U, n] that
g(a, n) = E G[a, U, n].    (8.2.49)
The previously discussed principles of choosing the train of coefficients λ(j) apply to the recurrence (8.2.48).
EXAMPLE 8.2.4 ADAPTIVE LINEAR RECOVERY OF INFORMATION; THE INFORMATION EXHIBITS STATISTICAL REGULARITIES BUT ONLY TRAINING INFORMATION IS AVAILABLE
We make assumptions A1, A2, and A3 from Example 8.2.1, but we do not assume that direct information about the statistical correlation coefficients is available. Instead, as in Example 8.2.2, we assume that training information is available.
Consider again the problem of optimization of the linear information transformation when the primary and available information exhibit statistical regularities and the distortion indicator Q(h) is given by (8.1.43). Differentiating this equation we get
∂Q(h)/∂h(n) = ∂/∂h(n) E[x − Σ_{m=1}^{N} h(m)r(m)]² = E ∂/∂h(n)[x − Σ_{m=1}^{N} h(m)r(m)]²
 = −2E[x − Σ_{m=1}^{N} h(m)r(m)]r(n).    (8.2.50)
We are looking for the solution of the set of equations (8.2.3). Therefore, we take
g(h, n) = ∂Q(h)/∂h(n).    (8.2.51)
Comparing (8.2.50) with (8.2.49) we see that
G(h, U, n) = −2[x − Σ_{m=1}^{N} h(m)r(m)]r(n),    (8.2.52)
with U = {x, r(n), n = 1, 2, …, N}, is the random variable representing the distorted observation of the partial derivative ∂Q(h)/∂h(n). In view of (8.2.52) the recurrence (8.2.48) takes the form
h(j+1) = h(j) + 2λ(j)[x(j) − Σ_{m=1}^{N} h(m, j)r(m, j)]r(j),    (8.2.53a)
where r(j) = {r(n, j), n = 1, 2, …, N} is the vector representing the N-dimensional available information. We write this equation in the simple form
h(j+1) = h(j) + 2λ(j)[x(j) − x*(j)]r(j),    (8.2.53b)
where
x*(j) ≡ Σ_{m=1}^{N} h(m, j)r(m, j)    (8.2.54)
has the meaning of the information produced from the available information r(j) by the linear transformation determined by the set of coefficients h(j) = {h(n, j), n = 1, 2, …, N} obtained in the j-th step of the recurrence. The optimized linear system based on this recurrence is shown in Figure 8.5.
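A minimal sketch of the recurrence (8.2.53) as an adaptive (LMS-type) linear recovery; the training pairs are simulated under hypothetical assumptions, and a small constant λ is used for simplicity instead of a decaying schedule.

```python
import numpy as np

def adaptive_linear_recovery(x_train, r_train, lam=lambda j: 0.05):
    """Run h(j+1) = h(j) + 2*lam(j)*[x(j) - x*(j)]*r(j)  (recurrence (8.2.53b))."""
    J, N = r_train.shape
    h = np.zeros(N)
    for j in range(J):
        x_star = h @ r_train[j]                # x*(j), equation (8.2.54)
        h = h + 2 * lam(j + 1) * (x_train[j] - x_star) * r_train[j]
    return h

rng = np.random.default_rng(2)
J, N = 5000, 4
h_true = np.array([0.5, -0.2, 0.1, 0.3])
r_train = rng.normal(size=(J, N))               # available information r(j)
x_train = r_train @ h_true + 0.1 * rng.normal(size=J)   # primary information x(j)
print(adaptive_linear_recovery(x_train, r_train))        # approaches h_true
```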
Figure 8.5. The optimization of the linear information transformation based on the recurrence (8.2.53).
We denote by
x**(j) ≡ Σ_{m=1}^{N} h(m, j+1)r(m, j)    (8.2.55)
the recovered information that we would get using the newly calculated set h(j+1). From (8.2.53) to (8.2.55), after some algebra, we get
[x(j) − x**(j)]² = B[x(j) − x*(j)]²,    (8.2.56a)
where
B < 1    (8.2.56b)
for sufficiently large j. Thus, in every step the algorithm (8.2.53) improves the processing of the already obtained information.
COMMENT
The system derived in this example performs a similar function as the system derived in Example 8.2.2. However, the operation of the two systems and their structures (compare Figures 8.1b and 8.5) are different. In the system derived in Example 8.2.2, the evaluation of the empirical correlation coefficients, on which the performance indicator (see (8.1.36)) depends, and the finding of the optimal coefficients of the linear transformation are separated. A similar situation is found in the system based on statistical correlation coefficients derived in Example 8.2.1.
In the system derived in Example 8.2.4 the empirical correlation coefficients do not appear at all. The ability of the system to optimize the linear transformation in the sense of a performance indicator determined by the correlation coefficients is related to the features of the recurrence (8.2.48). Equation (8.2.41) indicates that the recurrence performs two functions: it improves the deterministic part of the approximation so that it approaches the optimum h₀ (the first sum in (8.2.41)), and, due to the choice of the auxiliary coefficients λ(j), it decreases to zero the variance of the indeterministic part (the second sum in (8.2.41)). Thus, the recurrence has a similar effect as arithmetic averaging.
8.2.4 FINDING THE POINT OF MINIMUM
In the previous two subsections we considered the solutions of optimization problems that can be reduced to finding the zero point of the derivative of the criterion function. However, the derivative in the form of a formula is often not available. The typical reason is that the criterion function is given only in numerical form. We present here methods of solving optimization problems which cannot be directly reduced to finding the zero point of the derivative.
FINDING THE MINIMUM POINT OF A CRITERIAL FUNCTION OF ONE VARIABLE
Let us return to the primary minimization problem of a scalar function f(a) of a scalar argument discussed in Section 8.2.1. In terms of the function f(a) the recurrence (8.2.21) has the form
a(j+1) = a(j) − λ(j) df/da |_{a = a(j)}.    (8.2.57)
If the derivative exists, then
df/da = lim_{δ→0} (Δf)_δ/(2δ),    (8.2.58)
where
(Δf)_δ ≡ f(a + δ) − f(a − δ).    (8.2.59)
Since the a(j+1) generated by the recurrence (8.2.57) has only the meaning of an approximation of the zero point a₀ of the derivative df/da, only inexact information about the derivative is really needed. However, the accuracy of this information must increase when a(j) approaches a₀. Those remarks suggest using, instead of (8.2.57), the recurrence
a(j+1) = a(j) − λ(j)(Δf)_{δ(j)}/(2δ(j)) |_{a = a(j)},    (8.2.60)
with a train δ(j) → 0. We call it the recurrence based on increment ratios (briefly, the increment ratio recurrence). Figure 8.6 shows a train generated by this recurrence.
Figure 8.6. A train generated by the recurrence (8.2.60) based on increment ratios.
It can be proved (see, e.g., Schmetterer [8.10]) that on very general assumptions about the function f(a) the increment ratio recurrence (8.2.60) generates a train converging to the point a₀ of minimum of the function, if the following conditions are satisfied:
lim_{j→∞} λ(j) = 0,  lim_{j→∞} δ(j) = 0,  Σ_{j=1}^{∞} λ(j) = ∞,    (8.2.61a)
Σ_{j=1}^{∞} λ(j)δ(j) < ∞.    (8.2.61b)
It is convenient to take
λ(j) = 1/j^α,  δ(j) = 1/j^β.    (8.2.62)
Then the conditions (8.2.61) are satisfied if 0 < α ≤ 1 and α + β > 1. Typical values are α = 1, β = 0.5.
When we cannot calculate the derivative analytically and do not have exact information about the values of the function, but they exhibit statistical regularities, it is natural to use instead of the recurrence (8.2.60) the recurrence
a(j+1) = a(j) − λ(j)[F₊(j) − F₋(j)]/(2δ(j)),    (8.2.63)
where F₊(j), F₋(j) are observations of the random variables representing the distorted values of the function f at the points a(j) + δ(j) and a(j) − δ(j), just as G[a(j), U(j)] represents g(a) (see equation (8.2.39)). We call it the recurrence based on distorted increment ratios (briefly, the distorted increment ratio recurrence). Again it can be shown (see Schmetterer [8.10]) that on very general assumptions about the function f(a) and about the statistical distortions, the train generated by the distorted increment ratio recurrence converges in the mean square sense (equation (8.2.43)) if in addition to the conditions (8.2.61) the condition
Σ_{j=1}^{∞} [λ(j)/δ(j)]² < ∞    (8.2.64)
is satisfied.
If we assume again that the auxiliary coefficients are given by (8.2.62), then from (8.2.26) it follows that all the conditions are satisfied if
3/4 < α ≤ 1  and  1 − α < β < α − 1/2.    (8.2.65)
A pair satisfying those conditions is α = 1, β = 0.3. We have discussed the minimization of a function of a scalar argument here in more detail because this simple model shows the fundamental properties of recurrent procedures for finding the point of minimum. Of paramount practical importance are the generalizations of these considerations to functions of several arguments, which are presented next.
COMMENT
The basic advantage of the increment ratio recurrence (8.2.60) is that we do not need an explicit equation for the derivative df/da; it is sufficient to know the values of the function f(a). The price that we pay, compared with the recurrence (8.2.57), is that in each step, instead of calculating the value of the derivative once, we have to calculate the value of the function twice. The distorted increment ratio recurrence (8.2.63) has similar advantages as the increment ratio recurrence (8.2.60), plus the advantages of the distorted function recurrence discussed in the Comment on page 402. Obviously, for diminishing the volume of information about the properties of the minimized function we pay with a slowing down of the recurrence. Using the distorted increment ratio recurrence, we also have to take into account that unless we know that the processed information exhibits statistical properties, we have no grounds to expect that a generated train converges to the minimum point. In spite of this, the distorted increment ratio recurrence and its modifications are used as heuristic procedures and often produce satisfactory results.
FINDING THE MINIMUM POINT OF A CRITERIAL FUNCTION OF SEVERAL VARIABLES
When we look for the point of minimum of a function f(a) of an N-dimensional argument a = {a(n), n = 1, 2, …, N}, several previously introduced concepts have their one-dimensional counterparts; however, specific problems related to multidimensionality arise. The counterpart of the derivative is the gradient defined by (8.2.4). The basic property of the gradient is
df = (grad f, da),    (8.2.66)
where (·,·) denotes the scalar product and
da ≡ Σ_{n=1}^{N} d[a(n)] b(n)    (8.2.67)
is the infinitesimal displacement vector.
The set of points
𝒮(f₁) ≡ {a | f(a) = f₁}    (8.2.68)
we call an equivalue surface (in the two-dimensional case, an equivalue line). For an a ∈ 𝒮(f₁) we have df = 0. From (8.2.66) it follows that (grad f, da) = 0. Thus, the vectors grad f and da are perpendicular. So we conclude that
For every point on an equivalue surface (line), the gradient is perpendicular to this surface (line).    (8.2.69)
From (8.2.66) it follows that
|df| = |grad f| |da| cos(∠ grad f, da).    (8.2.70)
Thus,
For a fixed "magnitude" of the infinitesimal displacement vector, the change of the value of the function is maximal if we move along the gradient vector.    (8.2.71)
This property is exploited by most procedures for searching for the minimum point of a differentiable function of several variables. It suggests the steepest-descent procedure: We take an initial point a(1), calculate the gradient at this point, and move along it as long as the function f(a) decreases. When it stops decreasing, we calculate the gradient again and proceed as in the first step. Because this procedure requires frequent testing of the changes of the value of the criterial function, it is not best suited for on-line utilization in information systems.
Using the gradient, we write the basic recurrence (8.2.29) in the form
a(j+1) = a(j) − λ(j) grad f |_{a = a(j)}.    (8.2.72)
Thus, this recurrence also utilizes the basic property (8.2.71) by changing the point approximating the minimum point along the line of steepest descent, as illustrated in Figure 8.7. However, the correction is made only on the basis of local properties of the function.
When the values of the gradient cannot be evaluated, it is natural to use instead of the gradient the vector of partial increments
GRAD_δ f ≡ Σ_{n=1}^{N} { [f(a + δb(n)) − f(a − δb(n))]/(2δ) } b(n),    (8.2.73)
where b(n) is the n-th unit coordinate vector. The counterpart of the recurrence (8.2.60) is
a(j+1) = a(j) − λ(j)[GRAD_{δ(j)} f]_{a = a(j)}.    (8.2.74)
It can be shown (see, e.g., Minoux [8.5]) that on very general assumptions the train generated by this recurrence with coefficients satisfying the conditions (8.2.61) converges toward the point of minimum of the function. Also a similar counterpart of (8.2.63) generates a train converging in the mean square sense.
COMMENT
The described recurrences are of paramount importance for applications, particularly for the design of intelligent predictors, filters, and next neighbor transformations. For a review of several such systems, see Tsypkin [8.11], Widrow, Stearns [8.12], Proakis [8.13, Chapter 6].
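A minimal sketch of the recurrence (8.2.74), in which the gradient is replaced by the vector of partial increments computed from (possibly distorted) samples of the criterion function, with schedules of the form λ(j) = A/j^α, δ(j) = B/j^β; the quadratic criterion and the noise below are illustrative assumptions.

```python
import numpy as np

def increment_ratio_minimize(f_obs, a1, steps=3000, A=0.5, alpha=1.0, B=0.5, beta=0.3):
    """a(j+1) = a(j) - lambda(j) * GRAD_delta f   (recurrence (8.2.74)),
    where GRAD_delta f is the vector of partial increment ratios (8.2.73)."""
    a = np.asarray(a1, dtype=float)
    N = a.size
    for j in range(1, steps + 1):
        lam, delta = A / j**alpha, B / j**beta
        grad = np.empty(N)
        for n in range(N):
            e = np.zeros(N)
            e[n] = delta                      # displacement along b(n)
            grad[n] = (f_obs(a + e) - f_obs(a - e)) / (2 * delta)
        a = a - lam * grad
    return a

# Hypothetical noisy quadratic criterion with minimum at (1, -2).
rng = np.random.default_rng(3)
f_obs = lambda a: (a[0] - 1) ** 2 + (a[1] + 2) ** 2 + 0.1 * rng.normal()
print(increment_ratio_minimize(f_obs, a1=[0.0, 0.0]))
```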
Figure 8.8. System with an adjustable model.
A wide class of intelligent information systems using the recurrences are systems with an adjustable model, shown in Figure 8.8. The great advantage of those systems is that they adjust the parameters of the model so that it mimics as exactly as possible the performance of the real system, even if the structure of the model does not match exactly the structure of the real object. Systems with an adjustable model are particularly useful for information processing systems with a training cycle or for systems with feedback information. In particular, they can be used to produce a model of a communication channel that can be used to adjust the rules of operation of the receiver and/or the transmitter. Since in information compression systems the primary information and the compressed information are available at the same place, the considered recurrences can be used to optimize those systems. In particular, when quantization is realized
by a next neighbor transformation, then not only the reference patterns but also the distance function can be optimized. The latter optimization can also be achieved by introducing a preliminary transformation (see (8.1.12)) and optimizing it.
The presented recurrences producing trains also suggest heuristic procedures for searching for favourable solutions in situations when statistical regularities are not taken into account and convergence cannot be checked. Examples of such heuristic procedures are several procedures for the adjustment of neural networks (see, e.g., Zurada [8.14], Haykin [8.15]) and genetic algorithms (see, e.g., Goldberg [8.16], Soucek [8.17]).
8.3 OPTIMAL RECOVERY OF DISCRETE INFORMATION
In this and in the next section we consider the optimal recovery of information when the working information and all indeterminate factors influencing the information processing exhibit statistical properties and exact statistical information about them is available. We also assume that the statistical average of an indicator of performance in a concrete situation is used as the criterion. We show first that on those assumptions it is possible to derive the general form of the optimal information processing system. We call it a statistically optimal system. There is a great variety of such specific systems.
There are three reasons why we concentrate on statistically optimal systems. The first is that, as explained in Chapters 4 and 5, the assumption of the existence of statistical regularities is often well justified. If they exist, disregarding them obviously deteriorates the performance of an information system, and it is usually possible to build a subsystem of the information system acquiring information about the statistical regularities. The second reason for interest in statistically optimal systems is that the results obtained for probabilities can be directly used for the very wide class of systems for which only the frequencies of occurrences are available. The methods for this have been discussed in Sections 1.7.2, 6.2.1, and 7.5.1. The last but not least reason is that statistically optimal systems suggest useful solutions in many cases when the existence of statistical regularities cannot be proved.
We present here the most important and typical special forms of such systems for the recovery of discrete and continuous information. We show that most of the information recovery systems which we introduced in the previous chapters using heuristic arguments are, on quite general assumptions, statistically optimal systems. In particular, the next neighbor transformations considered in the previous chapters are on general assumptions statistically optimal systems.
Since the information recovery is the last transformation performed by the information system, the recovered information may also be called the decision of the information system about the primary information (briefly, the decision), and the transformation of the available information into the recovered information considered as a whole may be called the decision rule. Therefore, we use here alternatively the terminology of decision theory.
In Section 8.3.1 the general statistically optimal rule of ultimate information recovery (performed by the last subsystem of the prototype information system shown in Figure 1.2) is derived.
As an application and illustration of the general result obtained in Section 8.3.1, the structures of optimal systems recovering discrete information are derived in Section 8.3.2. First the optimization of the information recovery in the basic system having the chain structure shown in Figure 1.2 is considered. In the second part the systems using feedback information are discussed. Section 8.3.3 presents the methods of calculating the performance of the optimal recovery rules and discusses the distortion versus cost trade-off relationships for the optimal information recovery rules derived in Section 8.3.2.
8.3.1 GENERAL SOLUTION OF THE OPTIMIZATION PROBLEM
We assume that all indeterminate factors influencing the information processing exhibit statistical regularities and exact statistical information x_STAT is available. We take the statistical averaging as the dependency removing operation (see our discussion in Section 8.1), and the performance criterion is defined as
Q[X*(·)] = E q[X, X*(R)],    (8.3.1)
where X*(·) is the recovering transformation, q(·,·) is a performance indicator in a concrete situation, and X and R are random variables (processes) representing the primary, respectively the available, information. The optimization problem
OP X*(·) Q | C_m, m = 2, 3, …, M, C_TECH, x_STAT    (8.3.2)
is called the Bayes optimization problem; C_m respectively C_TECH denote the parametric respectively technical constraints (see Section 1.6.2). We now show that the method used in Section 7.5.1 to derive the optimal rule of recovering the primary information from its quantized presentation can be generalized. From the formula (4.4.23) for conditional averages we have
Q[X*(·)] = E q(X, X*) = E q[X, X*(R)] = E { E_{X|r} q[X, X*(R)] }.    (8.3.3)
We write this in the form
Q[X*(·)] = E Q[X*(R), R],    (8.3.4)
where
Q(x*, r) = E_{X|r} q(X, x*),    (8.3.5)
x* ∈ 𝒳; 𝒳 is the set of potential forms both of the primary and of the recovered information, and E_{X|r} is the operation of conditional statistical averaging on the condition that the received information r is given. Let us assume that the available information r is fixed. Since we consider various decision rules, the decision X*(r) can be considered as a variable which can take any potential form of the primary information. Therefore, for a given r we can consider Q(x*, r) as a function of the variable x*, and we may look for the x* which minimizes Q(x*, r). This x*₀ usually depends on r. Therefore, we write it in the form
x*₀ = X*₀(r).    (8.3.6)
Since for each r we minimize Q(x*, r) independently of the other r's, we minimize the overall average Q[X*(·)], and the assignment r → x*₀ is the transformation which is the solution of the considered optimization problem (8.3.2). Thus we come to the fundamental conclusion that the optimal rule is:
For the available information r and for every potential form of the primary information x, using (8.3.5) calculate the conditional performance indicator Q(x*, r). Consider it as a function of x*, find the potential form x*₀ for which Q(x*, r) achieves the minimum value, and take x*₀ as the recovered information.    (8.3.7)
We call this the rule (transformation) of best conditional performance.
The point of maximum of the function f(x) is the point of maximum of the function φ₁[f(x)], where φ₁(w) is a strictly increasing function, or it is the point of minimum of the function φ₂[f(x)], where φ₂(w) is a strictly decreasing function. Therefore, instead of searching for the point of minimum of Q(x, r) we may look for the point of minimum of the function
w(x, r) = φ₁[Q(x, r)],  x ∈ 𝒳,    (8.3.8a)
or a point of maximum of the function
u(x, r) = φ₂[Q(x, r)],  x ∈ 𝒳,    (8.3.8b)
where φ₁(·) is a strictly increasing function and φ₂(·) is a strictly decreasing function. The function u(x, r) has the meaning of a weight associated with each potential form of information, on which the optimal decisions are based; the function φ(·) is called the weight producing function. The weight w(x, r) produced by an increasing function φ₁(·) has the meaning of a negative decision weight (the smaller it is, the better), and the weight produced by a decreasing function φ₂(·) has the meaning of a positive decision weight (the larger it is, the better); we call both types briefly decision weights. The decision weight u(x, r) is a function of the available information r which is relevant for the decisions about a concrete potential form x of the working information. To emphasize this, we also call the decision weight the concrete decision information. The set
u(r) ≡ {u(x, r); x ∈ 𝒳},    (8.3.9)
considered as a function of x, has the meaning of a function of the available information which is relevant for making an optimal decision about the working information. Therefore, we call u(r) the primary decision information. The best conditional performance rule (8.3.7) is equivalent to the rule:
R1. For a given available information r and for every potential form of the primary information x calculate, as in rule (8.3.7), the conditional average performance Q(x, r);
R2. Taking into account the character of the dependence of Q on x, introduce a weight producing function φ(·) such that it is easier to evaluate the decision weight u(x, r) than Q(x, r);    (8.3.10)
R3. In the set 𝒳 of all potential forms of the primary information find the potential form x₀ with the smallest negative (largest positive) decision weight and take it as the recovered information.
The block diagram of the system implementing this rule is shown in Figure 8.9a.
Figure 8.9. The rule of best conditional performance: (a) based on the decision information u(r) = {u(x, r); x ∈ 𝒳}, (b) calculation of a concrete decision information u(x, r) using the current u_c(x, r) and a priori u_a(x) decision information; x_STAT(R, X) — exact statistical information about the joint statistical properties of the working and available information, q(·,·) — the indicator of performance in a concrete situation.
We show in forthcoming examples that in many cases the concrete decision information has the form
u(x, r) = Γ[u_c(x, r), u_a(x)],    (8.3.11)
where Γ(·,·) is a function of two arguments, u_c(x, r) is a function depending both on the information r and on the potential information x, while u_a(x) depends only on the potential information x. Thus, u_c(x, r) can be evaluated only after the current information r has arrived, while u_a(x) can be calculated before the operation of the system starts. Therefore, we call:
u_c(x, r) — the current decision information about the potential working information x,
u_a(x) — the a priori decision information about the potential working information x,
u_c(r) ≡ {u_c(x, r); x ∈ 𝒳} (respectively u_a ≡ {u_a(x); x ∈ 𝒳}) — the current (respectively a priori) decision information.
The set {u_c(r), u_a} determines in a unique way the primary decision information u(r) given by (8.3.9). We call this set the secondary decision information; it has the meaning of a presentation of the primary decision information. Those definitions are illustrated in Figure 8.9b. Several concrete examples of the decision information are given in subsequent sections.
COMMENT 1
The primary decision information u(r) is a typical example of an application of the general definition (1.1.1) of information.
The primary decision information u(r) has as many elements as the set of potential forms of the working information. Thus, in the sense of definition (6.1.12) or (6.6.26), the decision information has the same minimal volume as the working information. Usually the available information r has a larger volume than the working information. Then the transformation r → u(r) is a volume-compressing transformation. When the working information is discrete and the available information is continuous, the compression is dramatic. The compression is possible because in general the transformation r → u(r) is non-reversible. In particular, knowing the decision information u(r) we cannot always calculate the conditional probability distribution p(x|r). Therefore, in general
I[X; u(R)] ≤ I[X; R].    (8.3.12)
8.3.2 STRUCTURES OF OPTIMAL SYSTEMS RECOVERING DISCRETE INFORMATION
We assume now that the primary information is discrete, taking one of the potential forms x_l, l = 1, 2, …, L. Then the conditional performance indicator (8.3.5) takes the form
Q(x_m, r) = E_{X|r} q(X, x_m) = Σ_{l=1}^{L} q(x_l, x_m) P(X = x_l | R = r).    (8.3.13)
The rule (8.3.7) of the best conditional performance takes the form:
To an available information r assign the primary information x_l ∈ 𝒳 which minimizes the conditional performance indicator Q(x_l, r), l = 1, 2, …, L.    (8.3.14)
We denote:
P(x_l | r) = P(X = x_l | R = r),    (8.3.15a)
P(x | r) = [P(x_l | r), l = 1, 2, …, L]    (8.3.15b)
— the column matrix of conditional probabilities,
q = [q(x_l, x_m), l, m = 1, 2, …, L]    (8.3.16a)
— the square matrix of performance indices in a concrete situation,
Q(x, r) = [Q(x_l, r), l = 1, 2, …, L]    (8.3.16b)
— the column matrix of conditional average performance indices, and we write (8.3.13) in the form
Q(x, r) = q P(x | r).    (8.3.17)
From equations (4.4.7) we have
P(x_l | r_k) = P(X = x_l | R = r_k) = C P(R = r_k | X = x_l) P(X = x_l),    (8.3.18a)
P(x_l | r) = P(X = x_l | R = r) = C p(r | X = x_l) P(X = x_l),    (8.3.18b)
where the probability distributions P(R = r_k | X = x_l) respectively p(r | X = x_l) describe the transformation generating the available information r. Thus, to calculate P(x_l | r) we need: (1) the exact information x_STAT(X) about the statistical properties of the information source, and (2) the exact information x_STAT(R|X) about the transformation x → r; the probabilities describing those transformations have been discussed in Section 5.4. Having the statistical information x_STAT(X) and x_STAT(R|X), we find the rule P_X|R(·|·) of calculating, for given r ∈ ℛ and x ∈ 𝒳, the conditional probability P(x_l | r). Similarly, to obtain the values of the matrix q of performance indices we need the relevant information x_SUP about the features of the superior system. Our argumentation is illustrated in Figure 8.10a.
For discrete information the decision information defined by (8.3.8) is an L-dimensional vector information
u(r) = {u(x_l, r), l = 1, 2, …, L},    (8.3.19)
and the general diagram of the best conditional performance rule shown in Figure 8.10a takes the simple form shown in Figure 8.10b. In the binary case (L = 2) searching for the maximum reduces to checking the sign of the difference between the two weights, e.g., of
u_b(r) = u(x_1, r) − u(x_2, r).    (8.3.20)
We call this difference the binary decision information. Let us suppose that the decision weights are positive. Then the best conditional performance rule (8.3.10) based on decision weights takes the form
x* = x_1 if u_b(r) > 0,  x* = x_2 if u_b(r) < 0,    (8.3.21a)
or equivalently
x* = x_th[u_b(r)],    (8.3.21b)
where
x_th(u) = x_1 if u > 0,  x_2 if u < 0    (8.3.21c)
is a threshold function. This rule can be implemented in the system shown in Figure 8.10c.
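A minimal sketch of the rule (8.3.14) in the matrix form (8.3.17); the loss matrix and the conditional probabilities below are purely illustrative numbers.

```python
import numpy as np

def best_conditional_performance(q, P_x_given_r):
    """Q(x, r) = q P(x|r) (equation (8.3.17)); return the index of the minimizing x_l.
    Rows of q index the decision and columns the true form; for the symmetric
    loss used here this coincides with the matrix [q(x_l, x_m)] of (8.3.16a)."""
    Q = q @ P_x_given_r
    return int(np.argmin(Q)), Q

# Illustrative performance matrix q(x_l, x_m) and conditional probabilities P(X = x_l | R = r).
q = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
P = np.array([0.2, 0.5, 0.3])
l_opt, Q = best_conditional_performance(q, P)
print(l_opt, Q)   # a symmetric 0-1 loss would reduce this to the MAP rule (8.3.26)
```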
Figure 8.10. The implementation of the best conditional performance rule of discrete information recovery: (a) based directly on the conditional performance, (b) based on decision weights (a special case of the transformation shown in (a)), (c) for binary information; x_STAT(X), x_STAT(R|X) — exact statistical information about the a priori and conditional probability distributions, x_SUP — information about the properties of the superior system relevant for the choice of the performance indicator, (d) the maximum conditional probability rule.
8.3 Optimal Recovery of Information
415
In particular, the binary decision information can be presented in the form Wb(r)=wjr)+Wba
(8.3.23a)
where ujj-)=^uj,x^, r)'U,(x2, r) , Uy,^=^u^{x;)-uJ,X2) Then the optimal binary decision rule (8.3.21) takes the form ^i(r)=^
(8.3.23b)
(8.3.24a) "^X2lfWbe(^)<Wth
with «th=«ba (8.3.24b) As indicated in Section 8.1.1, for many superior systems the indicator of performance in a concrete situation q(x, x*) is symmetric, given by (8.1.2). Then the indicator of performance of the information transformation as a whole is the probability of error V(X7^T) ( see (8.1.40)). Substituting (8.1.2) in (8.3.13) we get G(x,, r) = l-/>(5S=JC,|I^=r)
(8.3.25)
From this it follows that the conditional probability P(X=Jc^|IR=r) is the positive weight function and the rule (8.3.14) of the best conditional performance based on weight functions takes the form Assign to a given available information r the information x^ G Xfor which the conditional probability P(X=JcJI^=r) considered as (8.3.26) function of I achieves its maximum. This rule is called maximum conditional probability rule (also maximum a posteriori probability rule). The system implementing this rule is shown in Figure 8. lOd. For binary information the rule simplifies to threshold rule like (8.3.24) and can be realized by a system as shown in Figure 8.10c, COMMENT 1 The derived optimal system is a concrete example of the system shown in Figure 1.24. In Figures 8.10 and 8.11 we emphasize the role of meta information. If it is not available we can add a hierarchically higher subsystem acquiring such an information so, that the whole system operates as an intelligent system shown in Figure 1.25. In particular, as an inexact information about the probability distributions/7(IR=r 13S=jC/) and P(X=X/) we can take the corresponding frequencies of occurrences obtained during a training cycle. Another possibility is to: (1) take a standard probability distribution which can be considered as a reasonable approximation of the real distributions (in particular for probability distribution of primary information we may use one of the distributions described in Section 4.5, and for the conditional distribution describing the fundamental information processing the prototype statistical relationships presented in Section 5.2)), (2) leave some parameters of the standard distributions free, (3) find the rule of optimal information processing for the standard distributions, (4) use the recurrent procedures described in Section 8.2 to find the values of the free parameters maximizing the performance in the class of rules which are optimal for the hypothetical probability distributions.
EXAMPLE 8.3.1 OPTIMAL RECOVERY OF DISCRETE INFORMATION USING VECTOR INFORMATION; DETERMINISTIC NOISELESS SIGNALS
We consider the communication system shown in Figure 1.3 and described in more detail in Section 2.1.1. We assume:
A1. The working information x_l is discrete, l = 1, 2, …, L;
A2. The available information r and its noiseless component are N-dimensional vector information: r = {r(n), n = 1, 2, …, N}, w(x) = {w(x, n), n = 1, 2, …, N};
A3. The transformation performed by the channel is described in Section 5.4.2 and characterized by the conditional probability (5.4.10).
The needed probability p(r | x_l) occurring in (8.3.18) we obtain from equation (5.4.10). After substituting v → r, s → x_l, w(s, n) → w(x_l, n), we get
P(X = x_l | R = r) = C P(x_l) exp{ −[1/(2σ²)] Σ_{n=1}^{N} [r(n) − w(x_l, n)]² }.    (8.3.27)
This equation shows that the conditional probability depends on the available information r only through the sum in the exponent. To present this dependence in a simpler form we take the logarithm of both sides of (8.3.27) and multiply the result by 2σ². We get
2σ² ln P(X = x_l | R = r) = 2σ² ln C + 2σ² ln P(x_l) − Σ_{n=1}^{N} [r(n) − w(x_l, n)]²
 = 2σ² ln C + 2σ² ln P(x_l) − Σ_{n=1}^{N} r²(n) + 2Σ_{n=1}^{N} r(n)w(x_l, n) − Σ_{n=1}^{N} w²(x_l, n).    (8.3.28)
Since the terms not depending on x_l are not relevant for decisions about the primary information, as the concrete decision information (decision weight) we take
u′(x_l, r) = 2σ² ln P(x_l) + 2Σ_{n=1}^{N} r(n)w(x_l, n) − Σ_{n=1}^{N} w²(x_l, n).    (8.3.29)
where
u',{xi, r)=w,(jc,, r)+M3(X;)
(8.3.30)
^ Wc(JC/, r) = 2^ r(n)w{Xi, n)
(8.3.31a)
and A^
u,(xi) =2(r\nP(Xi)-2 J^ ^\x^,n) (8.3.31b) From equations (8.3.31) it follows that the decision information depends only on the rough description of the statistical information through statistical parameters entering in (8.3.31b). However we have to remember that the form of this information results from the assumption that the noise has a gaussian probability distribution. Comparing (8.3.30b) and the definition (7.1.8) we see that the current decision information can be produced by systems shown in Figure 7.1 in particular, by a matched filter. This greatly simplifies the implementation of the optimal system shown in Figure 8.10b.
Often for symmetry reasons it can be assumed that
P(x_l) = 1/L = const.    (8.3.32)
Then, also for symmetry reasons, the transmitted signals are chosen so that
E(x_l) = E = const,    (8.3.33a)
where
E(x_l) ≡ Σ_{n=1}^{N} w²(x_l, n)    (8.3.33b)
is the energy of the noiseless signal (see (7.4.1)). The assumptions (8.3.32) and (8.3.33) are called the symmetry assumptions. When those assumptions are satisfied, then from (8.3.31b) it follows that the a priori decision information does not depend on the primary information. Therefore, in the symmetrical case we take
u_a(x_l) = 0, ∀l.    (8.3.34)
The choice of the decision weight is not unique. Another choice than (8.3.29) would be to take the negative decision weight
u″(x_l, r) = −2σ²{ln P(X = x_l | R = r) − ln C} = d²[r, w(x_l)] − 2σ² ln P(x_l),    (8.3.35)
where d[r, w(x_l)] is the Euclidean distance defined by (1.4.8). The transformation realizing the maximum conditional probability rule (see Figure 8.10c) would become a generalized NNT. When the potential forms of information are equiprobable, thus (8.3.32) holds, the natural choice of the negative decision weight would be
u‴(x_l, r) ≡ d[r, w(x_l)].    (8.3.36)
Thus,
On assumptions A1-A3 and (8.3.32) the maximum conditional probability rule is a next neighbour transformation using the Euclidean distance and the noiseless signals as reference patterns.    (8.3.37)
In the binary case the optimal decision rule is given by (8.3.24) and can be realized by the system shown in Figure 8.10c. From equations (8.3.23) and (8.3.30) we obtain the current and a priori binary decision information
u_bc(r) = 2Σ_{n=1}^{N} r(n)[w(x_1, n) − w(x_2, n)],    (8.3.38a)
u_ba = u_a(x_1) − u_a(x_2),    (8.3.38b)
where u_a(x_l), l = 1, 2 is given by (8.3.31b). Thus, in the binary case the maximum conditional probability rule can be implemented by a single matched filter matched to the difference of the noiseless signals and a threshold device with the threshold given by equation (8.3.24b). From (8.3.31b) it follows that in the symmetric case we have to take in the optimal decision rule (8.3.24a) the threshold
u_th = 0.    (8.3.39)
□
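A minimal numerical sketch of Example 8.3.1: the decision weights (8.3.29) for vector observations in Gaussian noise, and the equivalent nearest-neighbour rule (8.3.37) for the equiprobable case; the signal set and noise level are illustrative assumptions.

```python
import numpy as np

def map_receiver(r, W, priors, sigma):
    """Decision weights (8.3.29):
    2*sigma^2*ln P(x_l) + 2*sum_n r(n)w(x_l,n) - sum_n w(x_l,n)^2."""
    weights = 2 * sigma**2 * np.log(priors) + 2 * (W @ r) - np.sum(W**2, axis=1)
    return int(np.argmax(weights))

def nearest_neighbour(r, W):
    """Rule (8.3.37): minimal Euclidean distance to the noiseless signals."""
    return int(np.argmin(np.linalg.norm(W - r, axis=1)))

# Illustrative equiprobable binary signals in Gaussian noise.
rng = np.random.default_rng(5)
W = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0, -1.0]])         # noiseless signals w(x_l, n)
priors = np.array([0.5, 0.5])
sigma = 0.7
l_true = 1
r = W[l_true] + sigma * rng.normal(size=4)     # available information
print(map_receiver(r, W, priors, sigma), nearest_neighbour(r, W))  # both usually recover l_true
```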
We present now the time-continuous modification of the previous example.
EXAMPLE 8.3.2 OPTIMAL RECOVERY OF DISCRETE INFORMATION USING FUNCTION INFORMATION; DETERMINISTIC NOISELESS SIGNALS
We assume:
A1. The working information x_l is discrete, l = 1, 2, …, L;
A2. The available information and its noiseless component are time-continuous processes r(t), w(x_l, t), t ∈ ⟨t_a, t_b⟩.
Proceeding as in Example 8.3.1, the counterpart of the current decision information (8.3.31a) is
u_c[x_l, r(·)] = ∫_{t_a}^{t_b} r(t)w(x_l, t) dt.    (8.3.41)
Similarly, as a counterpart of (8.3.38a), the current binary decision information is
u_bc[r(·)] = ∫_{t_a}^{t_b} r(t)[w(x_1, t) − w(x_2, t)] dt.    (8.3.42)
In the next example we illustrate the problems arising when the noiseless signals are indeterministic. We consider only the time-continuous model, since the calculations for the corresponding time-discrete model would be very tedious.
EXAMPLE 8.3.3 OPTIMAL RECOVERY OF DISCRETE INFORMATION USING FUNCTION INFORMATION; INDETERMINISTIC NOISELESS SIGNALS
We make again assumptions A1 to A4 as in the previous example, but we assume that the noiseless signal is
w[x_l, (·), ψ] = A(x_l, t)cos(ω_c t + ψ), t ∈ ⟨t_a, t_b⟩,    (8.3.43)
and instead of A5 we assume that
A5′. The envelope A(x_l, t), t ∈ ⟨t_a, t_b⟩ is exactly known, while the phase ψ is indeterminate.
Then
P[x_l | r(·)] = C P(X = x_l) exp{−E(x_l)/S_z} I₀{χ[x_l, r(·)]/S_z},    (8.3.44a)
where
χ[x_l, r(·)] ≡ √( c_c²[x_l, r(·)] + c_s²[x_l, r(·)] ),    (8.3.44b)
c_c[x_l, r(·)] ≡ ∫_{t_a}^{t_b} r(t)A(x_l, t)cos(ω_c t) dt,  c_s[x_l, r(·)] ≡ ∫_{t_a}^{t_b} r(t)A(x_l, t)sin(ω_c t) dt,    (8.3.44c)
and E(x_l) is the energy of the noiseless signal. From (8.3.44) it follows that
u_c[x_l, r(·)] = χ[x_l, r(·)]    (8.3.45)
can be used as the current decision information. In general we have to calculate the decision information from equation (8.3.11), with the function Γ(·,·) determined by equation (8.3.44), and to implement the rule of maximum conditional probability as shown in Figure 8.10b. However, since the function I₀(w) is an increasing function, in the binary symmetrical case (when (8.3.32) and (8.3.33) hold)
u_bc[r(·)] = χ[x_1, r(·)] − χ[x_2, r(·)]    (8.3.46)
is the binary current decision weight and the implementation of the rule (8.3.23) simplifies. □
EXAMPLE 8.3.4 OPTIMAL RECOVERY OF A BLOCK TRANSMITTED THROUGH A BINARY CHANNEL
We consider the transmission of blocks of binary information as a whole, described in Section 2.1.2 and illustrated with Figure 2.7. We assume:
A1. The working information is a block x_l = {x(l, n)}, l = 1, 2, …, L, where x(l, n) are the elementary binary components;
A2. The transmitted information is the block (code word, see Section 2.1.2) w(x_l) = {w(l, n), n = 1, 2, …, N}, l = 1, 2, …, L of N pieces of binary information w(l, n); the coding x_l → w_l is a reversible transformation;
A3. The available information is also a block of N pieces of binary information: r = {r(n), n = 1, 2, …, N};
A4. The channel is a binary channel (see Section 2.1.1), the elementary piece of the available information ^ r(n) = w(/, n)®z{n) (8.3.47) A5. The probability distribution of the primary information is uniform: F(jC/) = l/L=const A6. The random variables 2(n), « = 1, 2, .., A^ representing the binary components of noise are statistically independent, have the same probability distribution characterized by Pb=P[s(n) = l ] . From assumption A6 and repeated using of the equation (4.4.8a) it follows that P ( E = r I X=x;) = />^^//[>^(^/)''-](l -P^f-dnly^^x;) ,r] (8 3 48) where d^{w{x^, r] is the Hamming distance defined by (2.1.36).
Substituting (8.3.48) in (8.3.18a) and using assumption A5 we get

P(X = x_l | R = r) = c P_b^{d_H[w(x_l), r]} (1 − P_b)^{N − d_H[w(x_l), r]}     (8.3.49)
Taking into account that 0 < P_b < 1/2, we see that the conditional probability (8.3.49) is a decreasing function of the Hamming distance d_H[w(x_l), r].
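A minimal sketch of decoding under the assumptions of Example 8.3.4: with equiprobable code words and independent binary errors with P_b < 1/2, maximizing the conditional probability (8.3.49) amounts to choosing the code word with the smallest Hamming distance to r. The code book below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical code book: L = 4 code words w(x_l) of length N = 7
codebook = np.array([[0, 0, 0, 0, 0, 0, 0],
                     [1, 1, 1, 0, 0, 0, 0],
                     [0, 0, 1, 1, 1, 0, 1],
                     [1, 1, 0, 1, 1, 0, 1]])

def decode(r, codebook):
    """Maximum conditional probability decoding implemented as a
    nearest-neighbor transformation with the Hamming distance."""
    d = np.sum(codebook != r, axis=1)     # Hamming distances d_H[w(x_l), r]
    return int(np.argmin(d))              # index l of the recovered information

p_b = 0.1                                 # binary error probability of the channel
sent = codebook[1]
errors = (rng.random(sent.size) < p_b).astype(int)   # z(n) in (8.3.47)
r = sent ^ errors                                    # r(n) = w(l, n) XOR z(n)
print("decoded index:", decode(r, codebook))
```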
8.3.3 STRUCTURES OF SUBSYSTEMS RECOVERING DISCRETE INFORMATION IN A SYSTEM WITH FEEDBACK
Simple examples of communication systems using auxiliary information about the information delivered by the communication channel have been described in Sections 2.1.1 and 2.1.2 and shown in Figures 2.4, 2.5, and 2.9. We now describe such systems in more detail and discuss their optimization. Although we use the terminology of communication systems, our considerations apply directly to other information systems, such as systems storing or simplifying information.

The feedback system is characterized by the type of cooperation between the receiver and the transmitter. Typically we transmit a train of elementary pieces of primary information, and in the simplest case the processing of the next piece of information starts when the processing of the previous one is completed. Such a system is characterized by the type of:
1) auxiliary information used by the transmitter to make the decision about forwarding to the receiver additional information about the currently processed elementary piece of working information; it is called feedback information;
2) such additional information about the working information furnished by the transmitter; it is called retransmitted information;
3) the rule by which the receiver makes the ultimate decision about the current piece of the working information on the basis of all information already available.

The choice of concrete rules of feedback system operation depends on the features of the forward and feedback channels. When the capacity of the forward channel is small, the natural choice is to deliver from the receiver to the transmitter feedback information of possibly small volume. In the extreme case this information is binary and has the meaning of the ultimate decision commanding the transmitter to forward the next retransmission. Such a feedback system is traditionally called decision feedback. The systems described in Sections 2.1.1 and 2.1.2 are of this type.

The first transmission of a primary piece of information and the following retransmissions we call the total transmission. The operation of the feedback system can be considered as controlling the total transmission by feedback information, so that the effects of distortions caused by the channel are diminished. In a typical information system the quality of the fundamental information processing subsystem, in particular of the communication channel, is so high that in most situations no retransmissions besides the first transmission are required. Then it is natural to send the additional information about the working information only when it is needed. This makes the length of the total transmission variable. The varying length of a total transmission generates additional problems. First, segmenting information (see Section 2.6) must be built in, so that it is possible to separate, in the train of signals at the output of the forward channel, the total transmission of an elementary piece of working information. Second, the variable length introduces arrhythmia and requires some buffering.
In spite of those deficiencies the feedback systems with variable length of total transmissions are most frequently used. Therefore, we concentrate here on such systems. However in the last part of this section we describe a feedback system with total transmissions having fixed length but using the feedback information to control the shape of the total transmission.
FEEDBACK SYSTEMS CONTROLLING THE LENGTH OF THE TOTAL TRANSMISSION
When it causes no confusion, in describing the feedback systems that use feedback information to control the length of the total transmission we use the shorter term "transmission" instead of "retransmission". We denote:
x_tr = {x(n), n = 1, 2, ..., N} the train of pieces of the working information,
w[x(n), i], i = 1, 2, ... the ith transmission of the nth piece of working information (w[x(n), 1] is the first transmission, the transmissions i = 2, 3, ... are retransmissions),
W[x(n), j] = {w[x(n), i], i = 1, 2, ..., j} the train of all already produced transmissions of the information x(n) (see Figure 8.11),
r(n, i) the signal at the channel output produced by the transmission w[x(n), i]; we call it the information available about the ith transmission,
R(n, j) = {r(n, i), i = 1, 2, ..., j} the train of all pieces of information available after j transmissions.
This notation is illustrated in Figure 8.11.
Figure 8.11. Illustration of the notation for transmissions carrying the working information x(n) in a feedback system; the length of the total transmission j = 3.
The processing of the information obtained at the channel output is done in two steps. First a decision about distortions of the available information is taken; we denote it y_f. This decision is forwarded to the transmitter of the working information; therefore we call y_f alternatively feedback information. In the simplest case this decision is binary and its potential forms are:
y+ - the distortions are so small that on the basis of the available transmissions an ultimate decision about the working information can be made;
y- - the distortions are so large that an additional retransmission is requested.
When the decision y+ has been taken, a potential form x_l of the working information is produced as the decision about the actually transmitted working information. When the decision y- has been taken, the decision about the working information is postponed; therefore the decision y- is called the disqualifying decision. The feedback information y_f is also delivered to the transmitter. When it obtains the y+ decision, it starts to transmit the next elementary piece of working information. When y- is obtained, the next retransmission of the actually processed information is sent.
Let us denote by y(n, j) the distortion decision produced after j transmissions of the nth elementary piece of working information. In general, the decision y(n, j) can be based on all available information R(n, j) about x(n), and the rule of making such a decision may depend on the number j of available transmissions. However, to simplify the implementation, in most systems used in practice the feedback decision after j transmissions is based only on the last transmission, i.e.

y(n, j) = Y[r(n, j), j]     (8.3.52)

We call such a rule of making the distortion decision memoryless. If it did not depend on the number j of transmissions, infinitely many retransmissions would be possible. Therefore, in most practical systems it is assumed that for all j not smaller than some maximal number of transmissions the decision y+ is taken.
Figure 8.12. Aggregation sets of the composite rule X_c(·) = {Y(·), X*(·)} of operation of a feedback system; the aggregation sets U_l, l = 1, 2, ..., L correspond to the ultimate decisions x_l about the working information, l = 1, 2, ..., L, while the set U_- corresponds to the disqualifying decision of the composite rule X_c(·).
To formulate the problem of optimization of the composite rule we must introduce indicators of performance for both component rules. The indicator characterizing the performance of the rule of making ultimate decisions is the same as in the case of the previously considered system without feedback. We assume that the performance indicator in a concrete situation q(x_l, x*) is symmetric, given by (8.1.2). Then the criterion is the error probability

Q[X*(·)] = P[X*(R) ≠ X]     (8.3.56)

Every retransmission requires additional resources for processing a piece of working information; in particular, it needs more feedback channel capacity. It can be expected that on average those resources are an increasing function of the probability of making the disqualifying decision. Therefore, as an indicator of costs characterizing the system with feedback we take the probability of making the disqualifying decision

Q_-[Y(·)] = P(Y = y_-)     (8.3.57)
The typical problem of optimization of the composite information processing rule is

OP {Y(·), X*(·)}, Q: minimize Q[X*(·)] subject to Q_-[Y(·)] = const     (8.3.58)

The counterpart of the criterion (8.2.11b) (Lagrange function), used to solve optimization problems with equality constraints, is the auxiliary criterion

Q_a[X*(·), Y(·)] = Q[X*(·)] + λ Q_-[Y(·)]     (8.3.59)
where the parameter λ is a counterpart of the Lagrange multiplier in (8.2.11b). To find the solution of OP {Y(·), X*(·)}, Q_a we introduce the auxiliary indicator of performance in a concrete situation

q_y(x_l, y_f) = λ for y_f = y_-, l = 1, 2, ..., L;  q_y(x_l, y_f) = 0 for y_f = y_+, l = 1, 2, ..., L     (8.3.60)

Using this we write (8.3.59) in the form

Q_a[X*(·), Y(·)] = E q[X, X*(R)] + E q_y[X, Y(R)]     (8.3.61)
Arguing as in the derivation of rule (8.3.7) we conclude that the best conditional performance rule is:

To a given available information r assign the potential decision x* ∈ {x_l, l = 1, 2, ..., L; y_-} which minimizes the conditional performance indicators Q(x_l, r), l = 1, 2, ..., L, and Q(y_-, r)     (8.3.62)

where

Q(x_l, r) = E_{X|r} q(X, x_l) = Σ_{m=1}^{L} q(x_m, x_l) P(x_m|r) = 1 − P(x_l|r)     (8.3.63a)

Q(y_-, r) = E_{X|r} q_y(X, y_-) = Σ_{m=1}^{L} q_y(x_m, y_-) P(x_m|r) = λ     (8.3.63b)

and

P(x_l|r) = P(X = x_l | R = r).     (8.3.63c)
From the general rule (8.3.62) it follows that the feedback decision is y_- if

Q(x_l, r) > Q(y_-, r), l = 1, 2, ..., L     (8.3.64)

From (8.3.63) we see that this is equivalent to the condition

P_max(x|r) < P_th, where P_max(x|r) = max_l P(x_l|r)     (8.3.65)

P_th = 1 − λ     (8.3.66)
Thus, the optimal rules for making the decision about distortions (equivalently, asking for a retransmission) and for recovering the working information are:

If the largest conditional probability P_max(x|r) is smaller than the threshold P_th, we take the decision y_- (request for a retransmission); if the largest conditional probability P_max(x|r) is larger than the threshold P_th, we take the decision y+ (no more retransmissions needed), and as the recovered working information we take the information x_l with the maximal conditional probability.     (8.3.67)

The system realizing this composite rule is shown in Figure 8.13.
(Block diagram of Figure 8.13; panels: calculation of the conditional probabilities P(x_l|r), l = 1, 2, ..., L; finding the point of maximum; finding the maximum value P_max(x|r); threshold device with threshold P_th; the feedback (distortion) information is sent to the transmitter.)
Figure 8.13. The system implementing the composite optimal feedback information generation and working information recovery, minimizing the probability of error for a fixed probability of a retransmission, based on conditional probabilities.
Since the decisions of the optimal rule (8.3.19) are based on comparisons of conditional probabilities, the rule can be implemented using, instead of the conditional probabilities, the decision weights

u(x_l, r) = φ[P(x_l|r)]     (8.3.68)

where φ(·) is an increasing or decreasing function.
COMMENT 1
The performance indicators are interrelated. The probability of errors depends on the shape of the aggregation sets U_l, l = 1, 2, ..., L corresponding to the ultimate decisions. The set U_- corresponding to the decision y_- acts as a "buffer zone".
The larger this zone is, the smaller is the probability that the point representing the available information carrying a given working information will be pushed by channel distortions into an aggregation set corresponding to another potential form of information. Every decision y_- causes a new retransmission. This increases the redundancy which the feedback mechanism builds into the signal carrying a piece of the working information. The redundancy can be used to decrease the probability of error of the optimized ultimate decisions. Thus, increasing the "size" of the set U_-, we can decrease the probability of errors of the recovered information.
COMMENT 2
When the negative decision weight has the meaning of the distance between the available information and the noiseless signal corresponding to a given potential form of information, the transformation realized by this system may be called an intelligent next neighbor transformation, asking for more information when no reference pattern is sufficiently close to the point representing the available information. For binary working information such a system takes the form shown in Figure 2.4. Thus, we have proved that the feedback systems described in Section 2.1.1 have an optimal character. Our argumentation allows us to choose the reference signals and the thresholds in a systematic way and, if necessary, to augment the system with subsystems acquiring the needed state information.
FEEDBACK SYSTEMS CONTROLLING THE SHAPE OF THE TOTAL TRANSMISSION
We assumed previously that the features of an elementary piece of distorted information are characterized only by binary auxiliary information. It can be expected that the performance of a feedback system can be enhanced if the auxiliary information can take more potential forms. Then the feedback information can provide the transmitter with more detailed information about the signal available at the receiver. We now describe a system in which the volume of feedback information is substantially larger than that of the primary information. To simplify the description we assume:
A1. The primary information is binary: x_l, l = 1, 2.
A2. The primary channel and the feedback channel can carry trains of 1 DIM continuous signals.
A3. A single transmission is a 1 DIM signal; the total number J of transmissions is fixed.
A4. The jth component of the available signal, j = 1, 2, ..., J, is r(j) = w(x_l, j) + z(j), where w(x_l, j) is the jth retransmission of the binary information x_l.
A5. After receiving the train r(j) = {r(i), i = 1, 2, ..., j}, the receiver calculates the average

w*(j) = Σ_{l=1}^{2} w(x_l, 1) P[X = x_l | R(j) = r(j)]
and transmits it as feedback information to the transmitter. A6. After receiving the total train r(j) the ultimate maximum conditional probability decision about the transmitted information is made.
A7. The first transmission is w(x_1, 1) = w_1, w(x_2, 1) = −w_1, where w_1 is a constant.
A8. A retransmission is the scaled difference between the first transmission w(x_l, 1) and the most recently obtained feedback information:

w(x_l, j+1) = A(j+1)[w(x_l, 1) − w*(j)]

where A(j), j = 1, 2, ..., J is a train of scaling coefficients.
The average w*(j) can be interpreted as an estimate of the first transmitted signal w(x_l, 1), and it is called the centre of gravity (of the points representing the potential transmitted signals). Therefore, the described system is called the centre of gravity feedback system.
COMMENT
The total feedback information is J DIM continuous vector information. Thus, its volume is substantially larger than the volume of the binary working information. This means that the system would require a feedback channel with a much larger capacity than the capacity of the forward channel. Since the number J of retransmissions must be large, the system would also introduce a very large delay. Therefore, from the practical point of view the system could be useful only in specific situations. We describe the centre of gravity feedback system here because it is an interesting example of delivering to the transmitter very detailed information about those features of the state of the environment of the communication system that are relevant for the transmission of the working information. This allows all features of the total transmission to be controlled optimally, while its length is fixed. The consequence is a very high efficiency of trading the energy of the total transmission for the probability of errors. We show in the last section (in Figure 8.18) that the system with centre of gravity feedback can be considered as a system using the capacity of the forward channel to the maximum. Only a sketchy description of the system has been presented here. Much research has been done on the overall optimization of the feedback system, in particular on the optimization of the train of coefficients A(j) determining the operation of the transmitter. For details see Schalkwijk, Kailath [8.15] and Omura [8.16].
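A simulation sketch of the centre of gravity system under assumptions A1-A8. The noise level and the scaling coefficients A(j) are illustrative assumptions (their optimal choice is discussed in the cited literature), and an ideal noiseless feedback channel is assumed.

```python
import numpy as np
from math import exp

rng = np.random.default_rng(3)

w1, sigma, J = 1.0, 1.0, 8
A = [2.0] * J                    # illustrative scaling coefficients A(j), j >= 2
first = {1: w1, 2: -w1}          # first transmissions w(x_l, 1) per A7

def run(x_true):
    post = {1: 0.5, 2: 0.5}      # P(X = x_l | received so far)
    w_star = 0.0                 # centre of gravity fed back to the transmitter
    for j in range(1, J + 1):
        # transmitter: first transmission, then scaled difference (A8)
        tx = {l: first[l] if j == 1 else A[j - 1] * (first[l] - w_star) for l in (1, 2)}
        r = tx[x_true] + sigma * rng.standard_normal()
        # receiver: Bayes update of the posterior (Gaussian noise per A4)
        lik = {l: exp(-(r - tx[l]) ** 2 / (2 * sigma ** 2)) for l in (1, 2)}
        z = sum(post[l] * lik[l] for l in (1, 2))
        post = {l: post[l] * lik[l] / z for l in (1, 2)}
        # centre of gravity (A5), sent back over the feedback channel
        w_star = sum(first[l] * post[l] for l in (1, 2))
    return max(post, key=post.get)   # ultimate maximum conditional probability decision (A6)

print("decided:", run(x_true=1))
```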
8.4 PERFORMANCE OF OPTIMAL INFORMATION RECOVERY
Till this point we discussed methods of finding the optimal rule of information recovery. This is a special case of the problem of finding the point x_0 of the minimum/maximum of a function f(x), discussed in Section 8.2. From the point of view of the superior system, what is essential is the value of the indicator of performance of the optimized information processing rule. It corresponds to the minimum/maximum value f(x_0) of the considered function. The performance of the optimized ultimate information processing rule is also important for another reason. As indicated in Section 8.1, this performance is often an objective indicator of performance of the subsystems inside the basic information system, in particular of a subsystem implementing the preliminary information processing, such as compression or shaping of signals put into a communication channel.
We describe here the general method of calculating the performance indicators and discuss the trade-off relationships between distortion and cost indicators for the previously presented optimal rules of information recovery.
8.4.1 THE GENERAL METHOD OF CALCULATING THE STATISTICAL PERFORMANCE INDICES
We assume that the statistical average is used as the dependence removing operation to obtain the indicator of performance of the considered information processing rule. To calculate such an indicator we use again the equation (4.4.23) for conditional averages. However, contrary to the derivation of the optimal information recovery rule (8.3.3), we now consider the primary information as fixed. Thus, we use the equation

Q[X*(·)] = E { E{ q[X, X*(R)] | X } }     (8.4.1)

We write this in the form

Q[X*(·)] = E Q̄(X)     (8.4.2a)

where

Q̄(x_l) = E{ q[x_l, X*(R)] | X = x_l }     (8.4.2b)

and x_l, l = 1, 2, ..., L are the potential forms of the primary information. Since the random variable X is discrete,

Q[X*(·)] = Σ_{l=1}^{L} Q̄(x_l) P(x_l)     (8.4.3)
Similarly to the average Q(x|r) defined by (8.3.3), the average Q̄(x_l) has the meaning of a conditional performance indicator; however, the condition is now that not the available information r but the primary information is fixed. To distinguish between the two averages we call Q̄(x_l) the x-conditional performance indicator. From the definition of the average it follows that

Q̄(x_l) = Σ_{k=1}^{L} q(x_l, x_k) P(x_k|x_l)     (8.4.4a)

where

P(x_k|x_l) = P(X* = x_k | X = x_l), k, l = 1, 2, ..., L     (8.4.4b)

and X* = X*[R(x_l)] is the random variable representing the decision of the considered recovery rule, while the random variable (process) R(x_l) represents the available information on the condition that the primary information is x_l. We call the probabilities P(x_k|x_l) conditional decision probabilities. The calculation of the x-conditional performance indicator greatly simplifies when the distortion function is symmetric (given by (8.1.2)). Then the x-conditional performance indicator is the conditional probability of error and, similarly to (8.1.40), we have

Q̄(x_l) = P_e(x_l)     (8.4.5a)

where

P_e(x_l) = 1 − P(x_l|x_l)     (8.4.5b)
From these considerations it follows that the calculation of the statistical performance indicators reduces to the calculation of the conditional decision probabilities. We sketch now the method of such calculations. The initial information for the calculation of the conditional decision probabilities is the probability distribution of the primary available information. When the random variable R(x_l) representing this information is discrete and r_m, m = 1, 2, ..., M are its potential forms, the distribution is described by the set of conditional probabilities

P(r_m|x_l) ≡ P_m(x_l) = P[R(x_l) = r_m], m = 1, 2, ..., M, l = 1, 2, ..., L.     (8.4.6)

When the random variable R is continuous, its distribution is described by the set of conditional probability densities

p(r|x_l), r ∈ R, l = 1, 2, ..., L.     (8.4.7)

The aggregation set U_k corresponding to the information recovery rule X*(·) is the subset of the set R of potential forms of the available information r such that X*(r) = x_k. From the definition of aggregation sets it follows that for discrete available information we have

P(x_k|x_l) = Σ_{r ∈ U_k} P(r|x_l)     (8.4.8a)

while for continuous available information

P(x_k|x_l) = ∫_{U_k} p(r|x_l) dr     (8.4.8b)

The equations (8.4.8) are the relationships between the conditional decision probabilities and the primary statistical information we were looking for. Although they are in principle elementary, the direct utilization of these equations may be tedious. The concept of the decision information (weight) introduced in Section 8.3.1 can simplify the calculation of the conditional decision probabilities for optimal recovery rules. Since a decision of such a rule depends directly on the current decision information, we can calculate the needed conditional probabilities P(x_k|x_l) from a counterpart of equation (8.4.8) by taking the decision information u in place of the primary available information r. For example, such a counterpart of equation (8.4.8b) is

P(x_k|x_l) = ∫_{U_uk} p(u|x_l) du     (8.4.9)

where U_uk is the aggregation set of potential forms of the decision information. To use equation (8.4.9) we have to calculate the density of conditional probability p(u|x_l). We obtain it using the relationship (8.3.8) between r and u, the conditional probability p(r|x_l), and the general rules of probability theory for calculating the probability density of a function of a given random variable (see e.g., Papoulis [8.17]). Often, the primary available information is a gaussian variable (process) and the decision information is obtained by a linear transformation of the primary information, as in Examples 8.3.1 to 8.3.3. Then from conclusion (4.5.22) it follows that the decision information is gaussian. This greatly simplifies the calculation of the probabilities P(x_k|x_l) from equation (8.4.9). We illustrate it on a simple example.
EXAMPLE 8.4.1 CALCULATION OF DECISION ERRORS OF OPTIMAL BINARY RECOVERY RULE
We make assumptions A2 to A5 from Example 8.3.2. Specifying assumption A1, we assume that the primary information x_l, l = 1, 2 is binary (L = 2). We also make the symmetry assumptions (8.3.32) and (8.3.33); however, in place of (8.3.33b) we take

E(x_l) = ∫_{t_a}^{t_b} w²(x_l, t) dt     (8.4.10)
On these assumptions the threshold rule (8.3.24) using the binary decision information u_b[r(·)] given by equation (8.3.42) is optimal. From (8.3.40) it follows that on the condition X = x_l the process representing the primary available information is

r(t) = w(x_l, t) + z(t), t ∈ <t_a, t_b>     (8.4.11)
Substituting this in (8.3.42) we obtain the random variable representing the decision information on the condition X = x_l:

u_b(x_l) = ∫_{t_a}^{t_b} r(x_l, t) [w(x_1, t) − w(x_2, t)] dt     (8.4.12)

After some elementary algebra we obtain

u_b(x_l) = A(x_l) + z_b     (8.4.13a)

where

A(x_l) = ∫_{t_a}^{t_b} w(x_l, t) [w(x_1, t) − w(x_2, t)] dt     (8.4.13b)

z_b = ∫_{t_a}^{t_b} z(t) [w(x_1, t) − w(x_2, t)] dt     (8.4.13c)
The random variable z_b is obtained from the noise by a linear transformation. We have assumed that the noise is a gaussian process. Then from a generalization of the conclusion (4.5.22) it follows that z_b is a gaussian variable. Averaging (8.4.13c) and interchanging the order of averaging and integration we find that E z_b = 0. Similarly we find the variance of z_b (see for example Papoulis [8.17]). From (8.4.13a) it follows that the random variables u_b(x_l) are gaussian variables with the same variance and with mean values located symmetrically around zero. Thus, the conditional probability densities p(u|x_l) describing those variables are gaussian densities located symmetrically around zero, as shown in Figure 8.14a. From the symmetry assumptions it follows that the threshold in the optimal information recovery rule (8.3.24) is

u_th = 0.     (8.4.14)

From this it follows that the aggregation sets are

U_1 = <0, ∞), U_2 = (−∞, 0)     (8.4.15)

Using (8.4.9) (with a single integration) we get

P(x_2|x_1) = ∫_{−∞}^{0} p(u|x_1) du,   P(x_1|x_2) = ∫_{0}^{∞} p(u|x_2) du     (8.4.16)

These considerations are illustrated in Figure 8.14.
Figure 8.14. The densities of conditional probabilities p(u|x_l), l = 1, 2 of the variables representing the binary decision information u_b, and the geometrical interpretation of the conditional probabilities of decisions P(x_2|x_1) and P(x_1|x_2) of optimal binary information recovery; (a) the open system, (b) the system with feedback.
Performing the described procedure for calculating the variance and the means of the variables u_b(x_l) and using equation (8.3.68), we finally obtain the probability of errors of the maximum conditional probability rule

P_b = G_t[ √( E_n [1 − c'(1, 2)] ) ]     (8.4.17)

where

E_n ≡ E/S_z is the normalized noiseless signal energy     (8.4.18)

c'(1, 2) = (1/E) ∫_{t_a}^{t_b} w(t, x_1) w(t, x_2) dt     (8.4.19)

is the normalized correlation coefficient and

G_t(u) = ∫_u^∞ (1/√(2π)) e^{−w²/2} dw     (8.4.20)
is the "tail" of the distribution of the standard Gaussian variable. D 8.4.2 PERFORMANCE OF BINARY INFORMATION RECOVERY The general method of calculating the statistical performance indicators illustrated in the example has been used for many types of channels producing the available information. The calculation of the sum, respectively the integral occurring in equations (8.4.8), is straightforward but tedious. Therefore, we will now only discuss the general conclusions about information processing which can be drawn from results of such calculations. Let us consider first equation (8.4.17) derived in the example. From this equation it follows For the symmetrical model the performance of the optimal binary information recovery depends only on the normalize energy and the (OAJ^\ correlation coefficient between the potential forms of the noiseless signals but does not depend on their other features.
The smaller the correlation coefficient between the signals, the better is the performance of optimal information recovery.     (8.4.21b)

It can be proved that for binary signals

min c'(1, 2) = −1     (8.4.22)

In view of (7.1.21) the geometrical interpretation of (8.4.22) is:

The performance of optimal information recovery is best when the distance between the noiseless signals is maximal.     (8.4.23)

Since we assumed that the energies of the noiseless signals are equal, from (8.4.23) it follows that optimal is such a pair of noiseless signals that

w(x_2, t) = −w(x_1, t), t ∈ <t_a, t_b>

i.e., the signals are antipodal.
Figure 8.15. The dependence of the probability of error P_b of optimal binary information recovery on the normalized energy E_n = E/S_z of the noiseless signals when the noise is Gaussian noise with flat harmonic spectrum. Lines: (D) the noiseless signals are determinate and antipodal, (IP) bandpass, orthogonal signals with indeterminate phase described in Example 8.3.3, (IPA) bandpass, orthogonal signals with indeterminate phase and amplitude (Rayleigh distribution).
In accordance with conclusion (8.4.21a) the bandwidth B of frequencies occupied by the optimized signals has no influence on the performance of the optimized system. Obviously, the smaller this bandwidth is, the smaller is the channel capacity. However, we cannot make B arbitrarily small. This follows from the basic relationship (7.4.26) between the bandwidth and the duration of a process. In the notation used now this relationship is

B ≥ A/T     (8.4.26)

where T = t_b − t_a and A is a constant of the order of magnitude of 1.
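A small numerical sketch of the trade-off shown in Figure 8.15 (line D), using the formula (8.4.17) as reconstructed above; the Gaussian tail G_t is evaluated with the complementary error function. The energy values are illustrative assumptions.

```python
import math

def gauss_tail(u):
    """G_t(u) of (8.4.20), the tail of the standard Gaussian distribution."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def p_error_binary(e_n, c12):
    """Reconstructed (8.4.17): P_b = G_t( sqrt( E_n * (1 - c'(1,2)) ) )."""
    return gauss_tail(math.sqrt(e_n * (1.0 - c12)))

for e_n in (1.0, 2.0, 4.0, 8.0):
    print(f"E_n = {e_n:4.1f}:",
          f"antipodal (c' = -1) P_b = {p_error_binary(e_n, -1.0):.2e},",
          f"orthogonal (c' = 0) P_b = {p_error_binary(e_n, 0.0):.2e}")
```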
In Example 8.3.3 the recovery of binary information carried by a narrow-band signal with indeterminate phase was considered (see also Section 2.1.1). We derive the probability density p(u|x_l) for the decision information (8.3.45) using equation (5.4.20), with the gaussian probability densities discussed in Example 8.4.1 in place of p_z[v − w(s, b)] and with the phase ψ in place of the indeterminate parameter b. From equations similar to (8.4.16) we calculate the probability of errors of the optimal recovery rule discussed in Example 8.3.3. The conclusion from the obtained results is similar to conclusion (8.4.21). However, for signals with indeterminate phase the error probability depends not on the correlation coefficient of the noiseless signals but on the correlation coefficient of their envelopes A(x_l, t). It can be proved that for fixed energies of the noiseless signals the errors are minimal when the envelopes are orthogonal (their correlation coefficient is zero). The dependence of the error probability on the normalized energy is shown in Figure 8.15 as diagram IP (indeterminate phase).
COMMENT
The comparison of diagrams D and IP in Figure 8.15 shows that the trade-off of minimal error probability for the energy of the noiseless signal is significantly worse for signals depending on an indeterminate parameter than for determinate noiseless signals. The formal reason for the performance deterioration caused by the indeterminism of the noiseless signals is that, as a result of the integration according to equation (5.4.20), the probability densities of the decision information are for indeterminate signals more stretched than the corresponding probability densities for determinate signals with the same energy. Therefore, the tails of the probability densities behind the threshold are longer. The ultimate reason for the deterioration is related to the differences in the available meta information. In the case of determinate noiseless signals the exact information about the reference signals is available. When a noiseless signal depends on indeterminate parameter(s) we have only information about the set of its potential forms and their statistical weights. The price which we pay for this indeterminism is the increase of the error probability. We can expect that the "larger" the indeterminism, the larger is the error probability of an optimized system. This is illustrated by diagram IPA in Figure 8.15, which presents the performance-for-energy trade-off when both the amplitude and the phase of the noiseless signals are indeterminate (for the derivation of such a diagram see e.g. Proakis [7.13]).
PERFORMANCE OF OPTIMIZED SYSTEMS WITH FEEDBACK
The previously described general procedure for calculating the error probabilities can also be applied to the systems using feedback information that were considered in Section 8.3.3. We assume first that the length of the total transmission (the number of retransmissions) is fixed. Then we can proceed as in the case of the open system. In particular, we can use equation (8.4.9) with the aggregation sets U_l for the system with feedback, as illustrated in Figure 8.14b. Next we average the obtained conditional probabilities over the potential lengths of the total transmission. The needed probabilities we obtain from the probability of the disqualifying decision. The latter we get from equation (8.4.9), integrating over the aggregation set U_-, as illustrated in Figure 8.14b (for details see Seidler [8.19]).
To simplify the argumentation about the primary and available information we make the assumptions as in Examples 8.3.2 and 8.4.1. We assume also that the energy of the noiseless signal corresponding to a single retransmission does not depend on the number of the retransmission and is the same as that of the first transmission. In view of conclusion (8.4.23) we assume that the noiseless signals of all transmissions are antipodal signals. We denote by E_1 the energy of a noiseless signal corresponding to a single retransmission, by Ē the average total energy of all noiseless signals corresponding to a total transmission, averaged over all potential lengths of total transmissions, and by

E_n1 = E_1/S_z,  Ē_n = Ē/S_z     (8.4.27)
the corresponding normalized energies. From the rules of the system's operation it follows that

Ē = N̄ E_1     (8.4.28)

where N̄ is the average number of retransmissions. As the performance indicator we take the probability of error P_b of the ultimate decision. The trade-off of the error probability for the normalized average total energy of the total transmission is shown in Figure 8.16.
Figure 8.16. The dependence of the probability P_b of error of the optimal ultimate decision on the normalized average total energy Ē_n of the noiseless signals, with the normalized energy E_n1 of a single retransmission as a parameter; ML memoryless rule, MLA memoryless adaptive rule, MEM rule with memory (using all retransmissions), OPEN the optimal open system (redrawn line D from Figure 8.15).
COMMENT 1
In the system with feedback, the threshold P_th determines the "size" of the rejecting zone. Increasing it we increase the rejection probability P_- and, in consequence, the average number of retransmissions. From (8.4.28) it follows that for a fixed energy E_n1 of a single transmission, a change of the average total energy is caused by a change of the average number of retransmissions. Therefore, the diagrams of P_b versus Ē_n with fixed E_n1 describe the improvement of the quality of the working information achieved by increasing the average energy of the total transmission.
COMMENT 2
If no feedback information is used, the system operates as an open system and the P_b versus E_n trade-off is described by the line D in Figure 8.15. This line is redrawn as the OPEN (slashed) line in Figure 8.16. Thus, for a given error probability P_b, the difference between the value read off one of the lines characterizing the feedback system and off the OPEN line is an indicator of the gain due to feedback information, measured in terms of saved noiseless signal energy. The rules with memory (MEM lines) are always better than the rule using no feedback, and when a lower probability of errors is required the advantage of the feedback rules becomes larger. The simple memoryless rules also give an improvement, but only in some ranges of Ē_n and E_n1. We discuss this in the next comment.
COMMENT 3
In the range where the difference between the energy of a single transmission and of the average total transmission is small, a memoryless rule (ML lines) gives a substantial improvement compared with the open system. From (8.4.27) it follows that in this range the average number of retransmissions is small. However, when the buffer zone is too large, the performance of the system with feedback is worse than that of the open system. The reason is that in the class of all possible information recovery rules the memoryless rule is not optimal, because it uses only the most recent retransmission and wastes the energy of all earlier ones. From the diagrams we see that for a required probability of errors there is a value of the average total energy (thus, of the average number of retransmissions) for which the improvement of the memoryless rule is biggest. Thus, an auxiliary adaptive system could change the rejection zone so that for every P_b the memoryless rule operates best. The performance of such an adaptive rule is presented as line MLA. This feature of the memoryless rules is quite typical: often optimal rules are uniformly optimal, while non-optimal rules operate well only in some ranges of the environment parameters.
8.4.3 THE PERFORMANCE OF OPTIMAL RECOVERY OF DISCRETE INFORMATION WHEN L>2
We discuss now the generalization of the considerations in Example 8.4.1 when the number of potential forms of the discrete information is L>2. We again introduce the assumptions A1 to A5 as in Example 8.3.2. We introduce also the symmetry assumptions (8.3.32) and (8.3.33a), using the definition (8.4.10). In geometrical terms the symmetry assumption (8.3.33a) means that the points representing the noiseless signals lie on the surface of a sphere with radius √E.
A more detailed analysis (see e.g., Golomb [8.22]) shows that the generalization of conclusion (8.4.23) is: the performance of optimal recovery is best when the mutual distances between the noiseless signals are equal and as large as possible. From (7.1.21) it follows that the first condition is equivalent to the condition that

c'(k, l) = c = const     (8.4.29)

where the correlation coefficient is defined by equation (8.4.19) with k, l in place of 1, 2. It can also be proved that the minimum value of the constant c is

c_min = −1/(L − 1)     (8.4.30)
and sets of signals achieving this minimum can be found. Such a set has the geometrical meaning of the vertices of a symmetric pyramid (called also a simplex). Conclusion (8.4.22) is the special case of (8.4.30). From (8.4.30) it follows that for large L the set of noiseless signals with correlation coefficients

c'(k, l) = 0 for all k ≠ l     (8.4.31)

is close to optimal. Thus any of the sets of L orthogonal functions discussed in Section 7.4.1 is a set that allows us to achieve an almost optimal performance of the optimized information recovery considered in Example 8.3.3. Therefore, a system in which the noiseless signals are orthogonal and the optimal recovery rule is used is called an optimal orthogonal system. The generalization for L>2 of the procedure presented in Example 8.4.1 is straightforward, since the consequence of the symmetry assumptions is that all conditional decision probabilities given by equation (8.4.9) are equal, and the orthogonality causes that the random variables representing the current decision information given by (8.3.41), on the condition that the primary information is fixed, are statistically independent gaussian variables. This leads to the final formula for the probability of errors in the optimal orthogonal system

P_e = F(L, E_n)     (8.4.32)

where E_n is the normalized energy defined by equation (8.4.18) and

F(L, E_n) = 1 − ∫_{−∞}^{∞} (1/√(2π)) e^{−(u − √(2E_n))²/2} [1 − G_t(u)]^{L−1} du     (8.4.33)
and G_t(u) is the tail of the standard Gaussian distribution given by (8.4.20). The function F(L, E_n) is called the Fano function. From equation (8.4.32) it follows that the probability of errors of an optimal orthogonal system depends on the noiseless signals only through the parameter E_n, which is a characteristic of a single noiseless signal, and the parameter L, which characterizes the set of potential forms of a noiseless signal. We now discuss in more detail the relationship between the probability of error of the optimally recovered information and the number of its potential forms L. This number is a characteristic of the state of variety (see Section 1.4.1). As indicated in Section 6.1.1 the parameter

v = log_2 L     (8.4.34)

is an indicator of the volume of resources needed to process the information, but it is also an indicator of the usefulness of the information for the superior system.
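A numerical sketch of the Fano function as reconstructed in (8.4.33); the integral is evaluated on a finite grid, which is an approximation, and the chosen value of the noise-volume normalized energy is an illustrative assumption.

```python
import math
import numpy as np

def fano(L, e_n, lo=-10.0, hi=15.0, m=4000):
    """F(L, E_n) of (8.4.33): error probability of the optimal orthogonal system
    (plain Riemann sum; accuracy limited by the grid)."""
    u = np.linspace(lo, hi, m)
    du = u[1] - u[0]
    phi = np.exp(-(u - math.sqrt(2.0 * e_n)) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    tail = np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in u])   # G_t(u)
    return 1.0 - float(np.sum(phi * (1.0 - tail) ** (L - 1)) * du)

for L in (2, 4, 16, 256):
    e_n = 4.0 * math.log2(L)      # keep the n-v normalized energy E_nv = 4 fixed
    print(f"L = {L:4d}: P_e ~ {fano(L, e_n):.3e}, "
          f"P_ev ~ {fano(L, e_n) / math.log2(L):.3e}")
```

With E_nv held fixed above the Shannon bound, the computed error probabilities fall as L grows, which is the behaviour of the diagrams in Figure 8.17.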
When signals with various L are compared, it is therefore natural to introduce the normalized energy of the noiseless signal related to a unit of volume, defined as

E_nv ≡ E_n/v = E/(S_z log_2 L)     (8.4.35)
We call it the noise-volume normalized energy (abbreviated n-v normalized energy); hence the notation. From the point of view of a superior system, the consequences of an error usually depend on the number of potential forms of the information. To define a normalized indicator of the probability of errors we consider a reference discrete information. Similarly as we defined the volume of discrete information in Section 6.1.1, as a reference we take information which is an unconstrained block of N_b binary pieces of information, and we assume that the errors of the elementary pieces of information are independent. The probability of error of such a block is

P_e = 1 − (1 − P_b)^{N_b}     (8.4.36)

where P_b is the probability of error of a binary piece of information. For P_b ≪ 1 we have

P_e ≈ N_b P_b     (8.4.37)

The volume of the considered blocks is

v = N_b     (8.4.38)

For P_b ≪ 1, from (8.4.36) and (8.4.38) it follows that

P_b ≈ P_e/v = P_e/log_2 L.     (8.4.39)

Therefore, since we consider small probabilities of errors, as a volume-normalized error probability of discrete information we take

P_ev = P_e/v     (8.4.40)
and we call it the volume-normalized probability of errors. Using the definitions (8.4.35) and (8.4.40) we express the primary variables P_e and E_n occurring in the basic relationship (8.4.33) and obtain the relationship between the volume-normalized characteristics of the optimal system. The diagrams visualizing this relationship, with L as a parameter, are shown in Figure 8.17. Those diagrams show that, compared with separate processing of single binary pieces of information, optimal block processing is significantly better. The diagrams also show that the performance improves when the volume of the blocks processed as a whole grows. We analyze now this important and interesting effect. First let us look for the price we have to pay for the improvement. We derived in Section 5.2.1 the basic formula (5.2.52) for the probability density by considering a K DIM vector as an almost exact representation of a base-band process. As shown in Section 7.4.3 (conclusion (7.4.49)), such an assumption is justified when BT ≫ 1. From Section 7.1 it follows that in the space of K DIM vectors we can find sets of K mutually orthogonal vectors but not more. In view of the mentioned possibility of an almost exact representation of base-band processes by the vectors and the preservation of distance and scalar product, we can expect that

In the space of base-band processes of duration T and occupying bandwidth B such that TB ≫ 1, the largest number of orthogonal functions which can be found is close to 2TB.     (8.4.41)
Figure 8.17. The diagrams of the volume-normalized probability P_ev of errors of the optimal recovery of information carried by the almost optimal (orthogonal) noiseless signals as a function of the noise-volume normalized energy E_nv, with the number of potential forms L as a parameter; E_nv,min is the Shannon bound.
For a fixed duration of the function-information the cost of its transmission depends on the bandwidth B. Therefore, although the bandwidth B does not occur in the basic equation (8.4.33) giving the error probability, we attempt to keep the bandwidth possibly small. Therefore, the consequence of the assumption that L mutually orthogonal signals are available is that the minimum product

min 2TB = L.     (8.4.42)

An unconstrained block consisting of N_b binary pieces of information has L = 2^{N_b} potential forms. Summarizing, we conclude:

By assembling N_b pieces of binary information into a block and processing the blocks optimally as a whole, so that the probability of recovery of a binary piece of information has a requested value, it is possible, by increasing N_b, to diminish the distortions of the recovered information without increasing the energy of the noiseless signals. The price that must be paid for it is an increase of the bandwidth of the noiseless signals, so that condition (8.4.42) is satisfied.     (8.4.43)

COMMENT
The product 2TB has the meaning of the dimensionality of a segment of a base-band process, and the dimensionality can be interpreted as an indicator of the level of structuring of signals. Therefore, the interpretation of conclusion (8.4.43) is that by increasing the level of structuring of the signals carrying the information it is possible to improve the performance of information recovery without increasing the energy of the noiseless signals carrying the information. Another such possibility is to utilize additional information about the state of the environment of the information system, as discussed in the Comment on page 433 and in Comment 2 on page 435.
The conclusion (8.4.43) and these comments raise the question of what the limits of quality improvement by increasing the bandwidth of the signals are. To get the answer we look closer at the properties of the Fano function. It can be proved (see e.g. Golomb [8.18]) that

lim_{L→∞} F(L, E_n) = 1 if E_nv < E_nv,min,   lim_{L→∞} F(L, E_n) = 0 if E_nv > E_nv,min     (8.4.44)

where

E_nv,min = 2/log_2 e     (8.4.45)

is a universal constant. Substituting (8.4.45) in the condition

E_nv > E_nv,min     (8.4.46)

we see that it is equivalent to the condition

log_2 L < (E/(2S_z)) log_2 e     (8.4.47)
Comparing the right side of this inequality with equation (5.4.44), after substituting (5.4.33), we see that the right side is the capacity C_∞ of the gaussian channel in the limiting case when the bandwidth B → ∞. Thus, we can write condition (8.4.46) as

log_2 L < C_∞     (8.4.48)

It can be shown that the error probability of the optimal orthogonal system decreases exponentially with the volume v of the processed information, with an exponent α(R'/C'_∞), where R' is the rate of the working information transmission; α(R'/C'_∞) is a function decreasing from the value 0.5 for R' = 0 to zero for R' = C'_∞. Thus,

When R' < C'_∞ the probability of error decreases exponentially with the volume of the processed information, the faster the bigger is the surplus of the channel capacity over the rate of working information transmission.     (8.4.52)
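A small numerical sketch of the Shannon bound (8.4.45) and of the equivalent condition (8.4.47); the values of E/S_z are illustrative assumptions.

```python
import math

E_nv_min = 2.0 / math.log2(math.e)          # the universal constant (8.4.45) = 2 ln 2
print(f"E_nv,min = {E_nv_min:.4f}")

def max_log2_L(E_over_Sz):
    """Largest volume log2(L) compatible with (8.4.47)."""
    return (E_over_Sz / 2.0) * math.log2(math.e)

for e in (10.0, 50.0, 200.0):
    print(f"E/S_z = {e:6.1f} -> log2 L must stay below {max_log2_L(e):7.2f} bits")
```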
8.5 OPTIMAL RECOVERY OF CONTINUOUS INFORMATION
The special cases of the general solution (8.3.10) of the optimization problem when the information is continuous are considered. It is shown that the linear information recovery rules and the two-level rules with separate estimation of the state of the environment of the main information system, which have previously been introduced in a heuristic way, are on quite general assumptions statistically optimal rules. Most of the presented features of continuous information processing have their discrete counterparts. However, their analysis is usually more complicated. The main reason for presenting here the classical problems of continuous information processing is to give more insight into the corresponding problems of discrete information processing. In the last section we present an estimate of the quality of optimal recovery of continuous information in terms of the capacity of the channel delivering the available information. This is one of the applications of the concept of channel capacity, which was introduced in a formal way in Section 5.4.4. Other important applications of channel capacity in the analysis of discrete information processing are discussed in the next section.
8.5.1 THE SOLUTION OF THE OPTIMIZATION PROBLEM
We assume that the primary information is 1 DIM and that the set of its potential forms is the interval X = <x_a, x_b>. Then the conditional performance indicator is

Q(x*, r) = ∫_{x_a}^{x_b} q(x, x*) p(x|r) dx, x* ∈ <x_a, x_b>     (8.5.1)

where p(x|r) is the density of conditional probability, and the optimal decision is a solution of the equation

dQ(x*, r)/dx* = 0.     (8.5.2)

We take

q(x, x*) = (x − x*)²     (8.5.3)

Then

Q(x*, r) = ∫_{x_a}^{x_b} (x − x*)² p(x|r) dx     (8.5.4)

Substituting in (8.5.2) and after some algebra we find that the optimal decision about the primary information is

x_0 = ∫_{x_a}^{x_b} x p(x|r) dx     (8.5.5)
A special case of this equation is equation (7.5.17), which we used in Section 7.5. In the continuous case we may also use the decision information (weight)

u_d(x*, r) = φ{Q(x*, r)}, x* ∈ <x_a, x_b>     (8.5.6)

where φ(u) is an increasing or decreasing function. Then the optimally recovered information is a solution of the equation

du_d(x*, r)/dx* = 0     (8.5.7)
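A numerical sketch of (8.5.5): with the square error weight (8.5.3), the optimal decision is the mean of the conditional density p(x|r), here approximated on a grid. The prior, the channel model and the observed value are illustrative assumptions.

```python
import numpy as np

def posterior_mean(r, x_grid, prior_pdf, likelihood):
    """x_0 of (8.5.5): mean of the conditional density p(x|r) on a uniform grid."""
    post = prior_pdf(x_grid) * likelihood(r, x_grid)   # Bayes rule up to a constant
    post /= post.sum()                                 # normalize on the grid
    return float(np.sum(x_grid * post))

# illustrative assumptions: zero-mean Gaussian prior on x, additive Gaussian noise r = x + z
sig_x, sig_z = 2.0, 1.0
x_grid = np.linspace(-10.0, 10.0, 2001)
prior = lambda x: np.exp(-x**2 / (2 * sig_x**2))
lik = lambda r, x: np.exp(-(r - x)**2 / (2 * sig_z**2))

r_obs = 3.0
print("x_0 on the grid :", posterior_mean(r_obs, x_grid, prior, lik))
print("closed form     :", sig_x**2 / (sig_x**2 + sig_z**2) * r_obs)  # cf. Example 8.5.1
```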
As discussed in Section 8.1.1, in many cases

q(x, x*) = β(x − x*)     (8.5.8)

where β(·) is the error weight function. Substituting (8.5.8) in (8.5.1) we get

Q(x*, r) = ∫_{x_a}^{x_b} β(x − x*) p(x|r) dx     (8.5.9)

Diagrams of typical functions occurring in this equation are shown in Figure 8.18.
Figure 8.18. Typical error weight function β(·) and typical density of conditional probability occurring in equation (8.5.9).
Changing x* we shift the function β(x − x*). When the error weight β(w) is symmetric around its minimum at w = 0 and the probability density p(x|r), considered as a function of x, has a single maximum at x_m(r) and is symmetric around it, then we minimize the integral in (8.5.9) by locating the minimum of β(x − x*) at the point x_m(r) of the maximum of p(x|r), i.e., by taking x* = x_m(r).
Figure 8.19. Illustration of the concept of small/large indeterminism: the a priori indeterminism of primary information x is (a) large ((b) small) compared with the conditional indeterminism of the available information r; p{x) denotes the density of the a priori probability, p{r\x) the density of conditional probability of the available information.
Thus,

When the a priori indeterminism of the primary information is much larger than the conditional indeterminism of the available information, as shown in Figure 8.19a, then the a priori statistical information is not needed to make an almost optimal decision about the primary information.

From this conclusion it follows that the recovery rule

To the available information r assign the primary information x_0 ∈ X which maximizes the density of conditional probability p(r|x) considered as a function of x     (8.5.12)

is an almost optimal rule when the a priori indeterminism is much larger than the conditional indeterminism and the general conditions used to justify the optimal character of the rule (8.5.10) are satisfied. This rule is called the maximum likelihood rule.

EXAMPLE 8.5.1 OPTIMUM RECOVERY OF 1 DIM INFO
We assume:
A1. The primary information is a realization of the gaussian variable x with E x = x̄ and σ²(x) = σ_x².
A2. The random variable representing the available information on the condition that the primary information x is fixed is

r(x) = x + z     (8.5.13)

where z is a gaussian variable with E z = z̄ and σ²(z) = σ_z².
A3. The random variables x and z are statistically independent.
From A1 and A2 it follows that the probability density of the primary information is

p(x) = G(x − x̄, σ_x)     (8.5.14)

and the density of conditional probability is

p(r|x) = G(r − x − z̄, σ_z)     (8.5.15)

where

G(x, σ) = (1/(√(2π) σ)) e^{−x²/(2σ²)}     (8.5.16)
Substituting (8.5.14) and (8.5.15) in (8.5.11) we get

p(x|r) = G(x − x̂(r), σ_c)     (8.5.17)

where

x̂(r) = [σ_x²/(σ_x² + σ_z²)] (r − z̄) + [σ_z²/(σ_x² + σ_z²)] x̄     (8.5.18)

and

σ_c² = σ_x² σ_z²/(σ_x² + σ_z²)     (8.5.19)
Since the gaussian probability density reaches its maximum at its average value, from (8.5.17) it follows that the decision of the maximum conditional probability rule is

x_0 = x̂(r)     (8.5.20)

and the mean square error of the optimal decision is σ_c². The indeterminism of the primary information is large when

σ_x²/σ_z² ≫ 1     (8.5.21)

From (8.5.18) and (8.5.19) it follows that in the limiting case σ_x/σ_z → ∞ the decision and its mean square error do not depend on the a priori parameters. This is a concrete example of the general conclusion formulated on page 442. □

8.5.2 OPTIMAL CHARACTER OF LINEAR RULES
We assume that the joint probability density of the primary information x and the components r(n) of the available information is gaussian. To simplify the argument we assume that E r(n) = 0, n = 1, 2, ..., N, E x = 0 and we denote

u(m) = r(m), m = 1, 2, ..., N, u(N+1) = x     (8.5.22)

Then the general equation (4.5.14) gives the density of the joint probability

p(x, r) = ([det A]^{1/2}/(2π)^{(N+1)/2}) exp{ −(1/2) Σ_{m=1}^{N+1} Σ_{n=1}^{N+1} A(m, n) u(m) u(n) }     (8.5.23)

Equation (4.5.15b) allows us to express the coefficients A(m, n) in terms of the correlation coefficients of the random variables u(m) = r(m), m = 1, 2, ..., N and u(N+1) = x. Their correlation matrix is

C = [ C_rr  C_rx ; (C_rx)^T  c(x, x) ]     (8.5.24)
where C_rr is the correlation matrix of the components of the available information, C_rx is the column matrix whose elements are the correlation coefficients between the primary information and the components of the available information, and c(x, x) = σ²(x) is the variance of the primary information. The basic relationship (4.5.15b) takes the form

A = C^{−1}     (8.5.25)

where A is the matrix of coefficients A(m, n).
The conditional density of probability we are looking for is

p[u(N+1)|u] = p[u(N+1), u]/p(u)     (8.5.26)

where u = {u(n), n = 1, 2, ..., N}. From the general conclusion (4.5.23) it follows that this is a gaussian density; thus

p[u(N+1)|u] = G[u(N+1) − û(N+1), σ(N+1)]     (8.5.27)

To express the parameters of this density in terms of the parameters entering equation (8.5.26), only the terms including the variable u(N+1) must be taken into account. These are A(N+1, N+1) u²(N+1) and the terms

A(n, N+1) u(n) u(N+1), n = 1, 2, ..., N,     (8.5.28)

occurring in the sum in equation (8.5.23). Therefore, we write the equation (8.5.26) in the form

p[u(N+1)|u] = C_1 exp{ −(1/2)[ A(N+1, N+1) u²(N+1) + 2 Σ_{n=1}^{N} A(n, N+1) u(n) u(N+1) ] }     (8.5.29)

where C_1 is a function of u but does not depend on u(N+1). Comparing equation (8.5.29) with (8.5.27) we see that

û(N+1) = −[1/A(N+1, N+1)] Σ_{n=1}^{N} A(n, N+1) u(n)     (8.5.30a)

and

σ²(N+1) = 1/A(N+1, N+1)     (8.5.30b)

Using equations (8.5.27) and (8.5.30a), arguing as in Example 8.5.1, and returning to the primary notation (8.5.22), we conclude that

x* = û(N+1) = −[1/A(N+1, N+1)] Σ_{n=1}^{N} A(n, N+1) r(n)     (8.5.31)

is the decision of the rule of maximal conditional probability and σ²(N+1) is the mean square error of the decisions. Thus we come to the basic conclusion:

On the assumption that the joint probability distribution of the primary information and the components of the available information is gaussian, the rule of maximal conditional probability is a linear rule.     (8.5.32)

Usually the coefficients A(m, n) are not directly available, but the correlation coefficients c_rr(n, m) and c_rx(n) are. Then we can use equation (8.5.25). However, we can avoid the tedious algebra and use an already available result. The gaussian probability distribution and the square error weight (8.5.8) satisfy the conditions formulated on page 441 on which the maximum conditional probability rule (8.5.10) is optimal. Therefore,

The linear rule with optimal coefficients, which satisfy the set of equations (8.2.8), is equivalent to the maximal conditional probability rule.     (8.5.33)

Thus, the set of coefficients a(n, N+1), n = 1, 2, ..., N of the equivalent linear rule is the solution of the matrix equation (8.2.8b), which in the present notation has the form

C_rr a = C_rx     (8.5.34)
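A numerical sketch of conclusions (8.5.32)-(8.5.34): for jointly gaussian data the maximum conditional probability decision is the linear combination whose coefficients solve C_rr a = C_rx. The model parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# illustrative model: x is a zero-mean Gaussian scalar, r(n) = x + z(n), n = 1..N
N, sig_x, sig_z = 5, 1.5, 1.0
C_rr = sig_x**2 * np.ones((N, N)) + sig_z**2 * np.eye(N)   # correlation matrix of r
C_rx = sig_x**2 * np.ones(N)                               # correlations between r and x

a = np.linalg.solve(C_rr, C_rx)          # optimal linear coefficients, (8.5.34)

# Monte Carlo check that x* = a^T r has the predicted mean square error
M = 20000
x = sig_x * rng.standard_normal(M)
r = x[:, None] + sig_z * rng.standard_normal((M, N))
x_star = r @ a
print("coefficients a :", np.round(a, 4))
print("empirical MSE  :", np.mean((x_star - x) ** 2))
print("theoretical MSE:", sig_x**2 - C_rx @ a)
```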
We have considered the recovery of 1 DIM information. The generalization of our argumentation to the recovery of multi-dimensional primary information is quite straightforward. In the case of a gaussian probability distribution the basic conclusions, that the optimal recovery rule is linear and that it is equivalent to a linear rule optimized for the mean square criterion, remain valid. In view of the latter conclusion, all methods for finding the solution of the linear optimization problem, particularly the very efficient Kalman recursive algorithms (see e.g., Proakis [8.13, ch. 4]), are applicable. We present now an example of application of the optimal recovery rule. This example also gives insight into the properties of intelligent systems with independent estimation of the state of the environment.
8.5.3 OPTIMAL CHARACTER OF INTELLIGENT RULES WITH INDEPENDENT STATE PARAMETER ESTIMATION
We consider now the processing of a train of pieces of information on the assumption that, while the train lasts, some unknown components of the state of the information system's environment are constant. This is a simple model of the system operating in an environment with slowly varying components that was discussed in Section 1.7.2. We assume:
A1. An elementary piece of the primary information and of the available information is a scalar.
A2. The evolving train X(N) = {x(n), n = 1, 2, ..., N} of the primary information and the train R(N) = {r(n), n = 1, 2, ..., N} of the available information are processed.
A3. A component of the available information is

r(n) = x(n) + z(n) + b     (8.5.35)

where z(n) is the component of noise which changes in each step and b is a component of noise which remains constant throughout a train.
A4. All pieces of information x(n), r(n), z(n), b exhibit statistical regularities; they can be considered as observations of random variables x(n), r(n), z(n), b; the value of b is determined by random factors anew every time a new train starts, but then it remains constant throughout the train.
A5. All variables x(n), z(n), b are gaussian,

E x(n) = 0, E z(n) = 0, E b = 0     (8.5.36)
and uncorrelated, thus statistically independent. The dependence of all components r(n) on the same constant component b causes the variables r(n), r(m) to be statistically related (correlated). Therefore, for the optimal recovery of the information x(n), not only the information r(n) statistically related directly to x(n) is relevant, but also the other pieces r(m) of the available information. This in turn provides insight into basic properties of learning systems. We consider two modes of system operation: (1) without a training cycle, (2) with a training cycle (see Section 1.7.2, particularly Figure 1.25, and Section 2.2.1).
We begin with the system without a training cycle and we interpret the information r(N) as the recently arrived information. Thus the available information is the train r(N) = {r(n); n = 1, 2, .., N}. In view of assumptions A4 and A5 we can use the results derived in Section 8.5.2, taking x(N) as x and r(N) as r. Using the assumptions A4 and A5 we easily calculate the correlation coefficients. In view of the independence assumption A5, the matrix C given by (8.5.24) has a simple structure, and after some matrix algebra we can find in closed form the elements of the matrix A. Then from equations (8.5.30a) and (8.5.30b) we find that x_o(N) given by equation (8.5.37) is the optimal decision about the information x(N) and that Q_o(N) given by equation (8.5.38) is the mean square error of these decisions.

The assumptions made in Example 8.5.1 correspond to the problem considered now with N = 1. However, in the example we assumed that the averages of the random variables z and x can take any value. Comparing equation (8.5.37) with equation (8.5.18) with x̄ = 0 we see that the quantity b*(N) appearing in (8.5.37) and defined by equation (8.5.39)
is the counterpart of the known mean value z̄ of the noise considered in Example 8.5.1. Thus b*(N) has the meaning of an estimate of the initially unknown but fixed (inside the train) noise component. We denote by b*(N) also the random variable representing this estimate within the train, thus on the condition b = b. From assumptions A4 and A5 it follows that

E b*(N) = b    (8.5.40a)

and

E[b*(N) − b]² = E{(1/N) Σ_{n=1}^{N} [x(n) + z(n)]}² = (σ_x² + σ_z²)/N    (8.5.40b)
Thus, the secondary information b*(N) is an estimate of the unknown constant component of the noise. It is an efficient estimate in the sense that for large N the mean square error of estimation goes to zero. Moreover, it can be proved that b*(N) is almost equal to the maximum conditional probability decision about the constant noise component b made independently of the recovery of the primary information. Thus, we come to the following conclusion:

The two-level system with separately optimized subsystems shown in Figure 1.26 is almost equivalent to the maximum conditional probability rule derived here. (8.5.41)

The rule of transforming the current available information and the rule of estimating the unknown but (within the train) fixed noise component b appeared "automatically" as an interpretation of the rule of the maximum conditional probability.
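The following simulation sketch illustrates the two-level interpretation (assumptions of this example: the illustrative variances, the train length, and the use of the simple running average of the already received pieces as the secondary estimate of b; the exact expressions (8.5.37)-(8.5.39) are not reproduced here). The constant noise component is estimated separately, and the current piece is then recovered as if this component were known:

```python
import numpy as np

# Train model of assumptions A1-A5: r(n) = x(n) + z(n) + b, with b constant within a train.
rng = np.random.default_rng(1)
sigma_x, sigma_z, sigma_b, N, trains = 1.0, 0.5, 2.0, 50, 2000

errors = []
for _ in range(trains):
    b = rng.normal(0.0, sigma_b)
    x = rng.normal(0.0, sigma_x, N)
    r = x + rng.normal(0.0, sigma_z, N) + b

    # Secondary information: estimate of b from the previously received pieces only.
    b_est = np.zeros(N)
    b_est[1:] = np.cumsum(r[:-1]) / np.arange(1, N)

    # Recovery of the current piece as if b were known and equal to b_est(n).
    gain = sigma_x**2 / (sigma_x**2 + sigma_z**2)
    errors.append(np.mean((x - gain * (r - b_est)) ** 2))

# Reference: mean square error achievable when b is known exactly.
print("two-level rule :", round(float(np.mean(errors)), 4))
print("b known exactly:", round(sigma_x**2 * sigma_z**2 / (sigma_x**2 + sigma_z**2), 4))
```

For long trains the two values approach each other, in agreement with conclusion (8.5.41).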
We consider next the system with a training cycle. We assume that it includes the first M pieces of elementary information and that the working cycle starts with the recovery of the information x(M+1). We denote by

x_t(M) = {(x(n), r(n)); n = 1, 2, .., M}    (8.5.42a)

the training information and by

r(K) = {r(M+n); n = 1, 2, .., K}    (8.5.42b)

the already obtained K pieces of available information in the working cycle. The optimally recovered information x_o(M+K) is the information maximizing the density of conditional probability

p[x(M+K) | x_t(M), r(K)]    (8.5.43)
To compare the system with the training information with the previously considered system without a training cycle we assume

M = N − 1,  K = 1    (8.5.44)

Proceeding similarly as for the system without the training cycle, we find that the decision of the maximum conditional probability rule is

x_t(N) = [σ_x²/(σ_x² + σ_z²)] [r(N) − b*_t(N)]    (8.5.45)

and that its mean square error is given by equation (8.5.46). Similarly to (8.5.39), the second component in the brackets in equation (8.5.45),

b*_t(N) = [σ_b²/(σ_z² + (N−1)σ_b²)] Σ_{n=1}^{N−1} [r(n) − x(n)]    (8.5.47)
has the meaning of an estimate of the primarily unknown constant component of the disturbance, now based on the training information. As in the system without training, the variance of this estimate decreases to zero when N → ∞, and b*_t(N) is almost an optimal decision about b made separately. Thus, the conclusion (8.5.41) also holds for the system with training.

To get insight into the performance of the considered optimal rules we introduce the normalized performance indicators

Q'(N) = Q(N)/σ_x²,  P_x = σ_x²/σ_b²,  P_z = σ_z²/σ_b²    (8.5.48)

of the normalized mean square recovery error, of the relative magnitude of the working information, and of the variable noise component, respectively.
To get insight into the process of learning we assume that the constant component of the noise is large compared with the variable component, in the sense that

N σ_b² ≫ σ_z²    (8.5.49)
The dependence of the normalized mean square error Q'(N) on the length N of the train, with the relative magnitude P_x of the working information as a parameter, is shown in Figure 8.20.
Figure 8.20. The dependence of the normalized mean square error Q'(N) of the maximum conditional probability decisions on the number N of already available pieces of information and on the normalized range P_x of the working information; the normalized range P_b of the constant component of the noise is small. Continuous lines: the system with a training cycle; dashed lines: without a training cycle.
COMMENT 1
The analysis of this simple model of a system operating in an environment whose state is initially not known suggests the following generalizations:
• Learning in an environment with quasi-stable state components is possible both with and without a training cycle.
• A training cycle substantially accelerates the learning process.
• When a sufficiently large number of pieces of information is available and the recovery rule is optimized, the asymptotic performance of the systems with and without training is similar.
• The advantages of the learning cycle are larger when the quality of the available information (in the considered case P_x is such a quality indicator) is better. Examples can be given showing that in non-optimized systems learning based on insufficiently accurate working information may deteriorate the system's performance ("learning from wrong examples").

GENERALIZATIONS
We now show that on quite general assumptions it is possible to decompose the optimal system into a two-layered hierarchical system consisting of
• a subsystem operating only on the current piece of available information, and
• a subsystem utilizing all the already obtained concrete information to estimate the primarily unknown, quasi-static components of the states of the information system's environment.
We denote:
x(N) - the considered component of the working information,
r(N) - the piece of available information directly related to the working information x(N),
b - the set of primarily unknown, quasi-stable components of the state of the information system's environment,
Y(N) - all the available concrete information about the set b, including (1) the information delivered by the subsystem acquiring the information about the state of the environment of the information system, (2) the information obtained during the training cycle, and (3) the train R(N−1) of available pieces of information about the working information x(1), x(2), .., x(N−1).
We assume that all unknown factors exhibit statistical regularities and we consider the rule of the maximum conditional probability. We can realize this rule by looking for the information X_o(N) maximizing the density of the joint probability distribution p[X(N), r(N), Y(N)]. From a generalization of equation (4.4.8d) for the marginal probability we have

p[X(N), r(N), Y(N)] = ∫ p[X(N), r(N), Y(N), b] db    (8.5.50)

From the conditional probability equation (4.4.7b) we get

p[X(N), r(N), Y(N), b] = p[Y(N) | X(N), r(N), b] p[X(N), r(N), b]    (8.5.51)

The probability density p[Y(N) | X(N), r(N), b] describes the transformation generating the state information Y(N). Typical examples of such state information are the components of the available information entering the definitions (8.5.39) and (8.5.47) of the estimates b*(N) and b*_t(N) of the constant component b considered previously. Those definitions suggest that after sufficiently long gathering of information about the state of the environment (large N) the newly arrived pieces of information x(N), r(N) do not substantially influence the probability density p[Y(N) | X(N), r(N), b]. Thus, for large N we may assume

p[Y(N) | X(N), r(N), b] = p[Y(N) | b]    (8.5.52)

Then from (8.5.50) and (8.5.51) we have

p[X(N), r(N), Y(N)] = ∫ p[X(N), r(N), b] p[Y(N) | b] db    (8.5.53)
Usually the subsystem for state information provides progressively more information about the quasi-stable state parameter b. Therefore, for growing N the probability density

p[b | Y(N)] = C p[Y(N) | b] p(b)    (8.5.54)

considered as a function of b looks like a narrow pulse compared with the probability density p[X(N), r(N), b]. Since p(b) does not change with N, this applies also to p[Y(N) | b]. Thus, we have a situation similar to the one shown in Figure 8.19a, with p[Y(N) | b] playing the role of p(x) and p[X(N), r(N), b] the role of p(r|x).
Then from equation (8.5.53) it follows that

p[X(N), r(N), Y(N)] = C p{X(N), r(N), b*[Y(N)]}    (8.5.55)

where C does not depend on X(N) and b*[Y(N)] is the centre point of the "impulse like" probability density p[Y(N) | b] considered as a function of b. From (8.5.12) it follows that b*[Y(N)] has the meaning of the decision of the maximum likelihood rule, which, under the conditions discussed in Section 8.5.1, has an optimal character. Therefore b*[Y(N)] can be considered to be an optimal estimate of the quasi-stable components of the state of the information system's environment. Using the definition (4.4.7b) of conditional probabilities we can write the conditional probability of the working information in the form

p[X(N) | r(N), Y(N)] = C p[X(N), r(N), Y(N)]
(8.5.56a)
where C does not depend on X(N). From (8.5.55) and (8.5.56a) we finally get

p[X(N) | r(N), Y(N)] = C p{X(N), r(N), b*[Y(N)]}
(8.5.56b)
From this equation it follows that the maximum conditional probability decision about the working information X(N) is calculated from the newly arrived piece r(N) as if the state component b of the environment were known exactly and equal to b*[Y(N)]. Thus, the whole system operates as a hierarchical system with separate processing of the actual working information and of the information about the state of the environment of the working information system. Our argument proves that on quite general assumptions the hierarchical information system shown in Figure 1.26, with subsystems operating according to the rules specified here, is an almost optimal system.

8.5.4 UNIVERSAL PERFORMANCE ESTIMATIONS OF OPTIMAL RECOVERY RULES

In more complicated cases the analysis, not to say the implementation, of the optimal information recovery rule may not be feasible, and sub-optimal or heuristically found rules are used. Even in such situations the knowledge of the performance of the optimized rule is very useful, because it provides a reference for the performance of non-optimal rules. There have been many attempts to derive universal estimates of the performance of the optimal information processing rule without deriving the rule explicitly. We describe here a class of such estimates based on a general property of entropy. To simplify the argumentation we assume that the information is continuous and 1-dimensional. We denote

D[p_1(·), p_2(·)] = ∫_{−∞}^{∞} [−log₂ p_2(x)] p_1(x) dx − ∫_{−∞}^{∞} [−log₂ p_1(x)] p_1(x) dx    (8.5.57)

The mentioned property of entropy is:

For any pair p_1(·) and p_2(·) of densities of probability

D[p_1(·), p_2(·)] ≥ 0    (8.5.58)

and the equality holds when and only when p_1(x) = p_2(x).

In view of this property we may interpret D[p_1(·), p_2(·)] as an indicator of the distance of p_2(·) from p_1(·). Therefore, it is called the entropy distance. We consider the information system shown in Figure 8.21.
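Before turning to Figure 8.21, a small numerical sketch of the entropy distance (8.5.57)-(8.5.58); the grid, the gaussian test densities, and the function name are assumptions of this example:

```python
import numpy as np

def entropy_distance(p1, p2, dx):
    """D[p1, p2] of (8.5.57), here written as the integral of p1*log2(p1/p2),
    for densities tabulated on a grid with step dx."""
    m = p1 > 0
    return float(np.sum(p1[m] * np.log2(p1[m] / p2[m])) * dx)

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
gauss = lambda t, mean, s: np.exp(-(t - mean) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
p1, p2 = gauss(x, 0.0, 1.0), gauss(x, 1.0, 2.0)

print(entropy_distance(p1, p2, dx))   # strictly positive, property (8.5.58)
print(entropy_distance(p1, p1, dx))   # zero: the equality case p1 = p2
```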
Figure 8.21. The information system with preprocessing of the available information before making the ultimate decision about the working information.

To simplify the argumentation we assume that the indicator of performance in a concrete situation is q(x, x*) = (x − x*)². Using the fundamental property (8.5.58) of the entropy distance it can be shown (see Seidler [8.24]) that for the statistical indicator Q[X*(·)] of performance of the ultimate information processing rule X*(·), given by (8.3.1) with the square error weight (8.1.5), we have

Q[X*(·)] ≥ [1/(2πe)] exp₂[2H(x|r) + D̄_dec]    (8.5.59)

where H(x|r) is the average conditional entropy of the primary information on the condition that the preprocessed information r delivered by the fundamental information processing subsystem (we call it here the channel) is known, and D̄_dec is the average entropy distance of an auxiliary probability density p_2(x) = p_dec(x) from the conditional probability density p_1(x) = p(x|r). As the auxiliary probability distribution p_dec(x) we take a probability density that (1) can be calculated when only the ultimate decision of the information recovery rule X*(·) and the indicator of performance of this rule are known, and (2) has the minimum entropy distance from the conditional probability density p(x|r). The auxiliary probability density can be interpreted as the best choice of an observer who knows only the decision about the primary information produced by the rule X*(·) and the average performance of this rule, and is asked to give a possibly exact estimate of the conditional probability density p(x|r).

The difference

D[W(·)] = H(x|r) − H(x|R) = I(x;R) − I(x;r)    (8.5.60)

has the meaning of the loss of statistical information caused by the preliminary information preprocessing, before the ultimately recovered information is produced. This definition is illustrated in Figure 8.21. Using the loss D[W(·)] we write (8.5.59) in the form

Q[X*(·)] ≥ [1/(2πe)] exp₂[2H(x|R) + D[W(·)] + D̄_dec]
(8.5.61)
Equation (8.5.61) shows that the average D̄_dec has the same effect on the conditional entropy as the loss of statistical information D[W(·)] caused by the preliminary processing of the available information. Therefore, we call D̄_dec the loss of statistical information caused by taking an ultimate decision.
Figure 8.22. Illustration of the relationship between the performance of the optimized rule of information recovery and the capacity of the channel delivering the information about the primary information.

Let us consider the worst case, when the available preprocessed information delivers no information. Then we can still use the statistical information about the a priori probability distribution of the primary information. In this case the bound corresponding to (8.5.61) is

Q_wr = [1/(2πe)] exp₂[2H(x) − D[W(·)] − D̄_dec]    (8.5.62)

The subscript "wr" should remind us that this is the worst-case estimate of the performance of the rule producing the ultimate information. The ratio
Q_n[X*(·)] = Q[X*(·)] / Q_wr
(8.5.63)
has the meaning of the normalized index of performance of the ultimate information processing rule. From (8.5.61) and (8.5.62) we get

Q_n[X*(·)] ≥ exp₂{2[H(x|R) − H(x) + D[W(·)] + D̄_dec]}
(8.5.64)
The statistical amount of information is

I(x;R) = H(x) − H(x|R)
(8.5.65)
From the definition of the channel capacity C_ch we have

I(x;R) = C_ch − D
(8.5.66)
where D ≥ 0. Using (8.5.65) and (8.5.66) we write (8.5.64) in the form

Q_n[X*(·)] ≥ exp₂{−2[C_ch − D[W(·)] − D̄_dec − D]}
(8.5.67)
To calculate the right-hand side of this inequality we would have to calculate the average loss of information D̄_dec caused by taking ultimate decisions. To do this we would have to specify the ultimate information recovery rule X*(·), and this is what we try to avoid.
However, from the fundamental property (8.5.58) of the entropy distance it follows that D̄_dec ≥ 0, and from (8.5.66) it follows that D is also non-negative. Thus, from (8.5.67) we get

Q_n[X*(·)] ≥ exp₂{−2[C_ch − D[W(·)]]}    (8.5.68)

COMMENT
Equation (8.5.68) shows that the channel capacity and the loss of information caused by the preliminary processing of the available information, before the transformation producing the ultimate information delivered to the superior system, provide the ultimate limit for the normalized performance of the system. However, this is only a lower bound that can be reached only for specific performance indicators and specific statistical properties of the available information. Such a pair is the mean square error and the gaussian probability distribution. In general, the discussed lower bound should be considered as a pessimistic estimate of the performance of an optimal information recovery rule.
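A one-line numerical check of the gaussian case mentioned in the comment above (a sketch under the assumption D̄_dec = 0 and using the form of (8.5.59) as written above; the variances are illustrative): for jointly gaussian x and r with the square error weight, the entropy bound coincides with the mean square error of the optimal rule.

```python
import numpy as np

# x and r = x + noise, jointly gaussian; illustrative variances.
var_x, var_noise = 1.0, 0.25
var_x_given_r = var_x - var_x**2 / (var_x + var_noise)          # mse of the optimal rule
H_x_given_r = 0.5 * np.log2(2 * np.pi * np.e * var_x_given_r)   # conditional entropy, bits

bound = (1.0 / (2 * np.pi * np.e)) * 2 ** (2 * H_x_given_r)     # right side of (8.5.59), D_dec = 0
print(var_x_given_r, bound)   # both 0.2: the bound is attained in the gaussian case
```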
8.6 OVERALL OPTIMIZATION OF INFORMATION SYSTEMS

An information system is structured. The prototype system has a chain structure, and an intelligent system has in addition a vertical (hierarchical) structure. Such systems have been discussed in Sections 1.1 and 1.6, and their concrete examples have been presented in Chapter 2, in Section 6.6.1, and in the previous section. With the exception of very simple systems, the optimization problem of a structured system must be decomposed into a set of problems of optimization of the component subsystems. The cooperation between the subsystems is taken into consideration by introducing indicators of performance of subsystems, or constraints, that take into account the properties of the other cooperating subsystems. Such a procedure is called decomposed system optimization.

The first section considers the optimization of the preliminary information processing in the prototype chain system shown in Figure 1.2 such that the quality of the optimal information recovery at the end of the chain is as good as possible. In the second section we consider subsystems that provide their superior information system with information about the state of its environment, so that the superior system can operate in an intelligent way. We discuss two study cases and we formulate general guidelines for the optimization of such state information subsystems.

8.6.1 THE OVERALL OPTIMIZATION OF THE PROTOTYPE INFORMATION SYSTEM

Usually the subsystem performing the fundamental transformation in the prototype system shown in Figure 1.2 is fixed (see the discussion in Section 1.1.4). Then the overall optimization is the joint optimization of the preliminary transformation performed before the fundamental transformation and of the ultimate transformation of the information produced by the fundamental transformation. In the previous two sections the optimization of the ultimate transformation producing the recovered information has been discussed on the assumption that the properties of the available information are fixed. In fact, we can change them by changing the rule of
preliminary transformation of the primary information delivered by the information source before it is put into the fundamental processing subsystem. Here we concentrate on the optimization of the preliminary transformation, taking into account the subsequent information recovery.

The basis for a systematic approach to the optimization of a preliminary transformation which takes into account the subsequent recovery has been presented in Section 8.1.4. Concrete examples of such optimization, using partially heuristic arguments, have also been given. In particular, in Section 7.3 the optimization of dimensionality reduction such that the quality of the optimal recovery is as good as possible has been discussed. We also applied a rudimentary version of joint optimization in Example 8.4.1 by showing that it is optimal to shape the signals put into the channel so that the potential forms of noiseless signals at the channel's output are antipodal signals. Similarly, the optimal character of orthogonal noiseless signals has been discussed in Section 8.4.3. The generalization of these examples is the optimization of the preliminary transformation using as criterion a performance indicator produced by a dependence removing transformation, discussed in Section 8.1.4.

In this section we concentrate on the optimization of quantization. However, we also give a short review of problems of optimization of transformations that shape the information put into a channel causing distortions. A typical example of such a transformation is error correcting coding.

OPTIMIZATION OF SCALAR QUANTIZATION WHEN EXACT STATISTICAL INFORMATION IS AVAILABLE

We consider here the optimization of the scalar quantization rule described in Section 1.5.4 and illustrated in Figure 1.18. We assume first that the exact statistical information is available, and as the performance criterion we take the overall mean square error

Q[X*(·), V(·)] = E{x − X*[V(x)]}²
(8.6.1)
where V(·) is the quantization rule, X*(·) the information recovery rule, and x the random variable representing the primary information. The quantization and recovery rules are characterized by the set

X̄ = {x̄_l; l = 0, 1, 2, .., L}    (8.6.2)

of thresholds and the set

X* = {x*_l; l = 1, 2, .., L}    (8.6.3)

of values of the recovered information (see Figure 1.18). Therefore, instead of Q[X*(·), V(·)] we write briefly Q(X*, X̄). In Section 8.1.4 we indicated that to define, on the basis of Q(X*, X̄), a performance indicator for the set X̄ of thresholds, we have to remove the dependence of Q(X*, X̄) on X* = {x*_l; l = 1, 2, .., L}. We assume here that the information recovery rule is matched to the quantization rule. Then the minimization of the distortions with respect to X* is the dependence removing operation. In Section 7.5.1 we derived equations (7.5.17) and (7.5.24) giving the optimal potential forms of the recovered information. Substituting those values in Q(X*, X̄) we get the indicator of performance of a threshold set X̄. However, such a typical
decomposed optimization would be quite tedious. It is more convenient to go back to the primary criterion Q(X*, X̄) and to analyze the conditions for simultaneous optimization with respect to X̄ and X*. From (8.2.5) it follows that the pair of optimal sets is a solution of the equation

grad_{X*,X̄} Q(X*, X̄) = 0    (8.6.4)

We write this equation as a pair of two sets of equations

grad_{X*} Q(X*, X̄) = 0    (8.6.5)

grad_{X̄} Q(X*, X̄) = 0    (8.6.6)

or equivalently

∂Q(X*, X̄)/∂x*_l = 0,  l = 1, 2, .., L    (8.6.7)

∂Q(X*, X̄)/∂x̄_l = 0,  l = 1, 2, .., L    (8.6.8)
To calculate the partial derivatives we need an explicit formula for Q(X*, X̄). It is given by equation (7.5.18), which in the notation now used takes the form

Q(X*, X̄) = E{x − X*[V(x)]}² = Σ_{l=1}^{L} ∫_{x̄_{l−1}}^{x̄_l} (x − x*_l)² p(x) dx    (8.6.9)

Substituting this in (8.6.7), after some algebra we get

x*_l = [∫_{x̄_{l−1}}^{x̄_l} x p(x) dx] / [∫_{x̄_{l−1}}^{x̄_l} p(x) dx],  l = 1, 2, .., L    (8.6.10)

As could be expected, this is equivalent to the previously derived center of gravity rule (7.5.17). Substituting (8.6.9) in (8.6.8) and differentiating we get

x̄_l − ½(x*_{l+1} + x*_l) = 0    (8.6.11)

Section 1.5.4 indicated that quantization can be achieved by the next neighbor transformation described by (1.5.13). The thresholds are then determined by the reference points x_l, l = 1, 2, .., L of the NNT through equation (1.1.18), which in the present notation takes the form

x̄_l = ½(x*_{l+1} + x*_l)    (8.6.12)

Comparing this with equation (8.6.11) we see that

The optimal quantization rule is a NNT transformation with reference points related to the thresholds by equation (8.6.12). The NNT produces an optimal quantization when a reference point is simultaneously the optimally recovered primary information.
(8.6.13)
Thus, the condition of optimality is

x_l = x*_l,  l = 1, 2, .., L    (8.6.14)

where x_l are the reference points of the NNT and x*_l the optimally recovered values given by (8.6.10). It is evident that

For a uniform probability density the uniform quantization, with the centers of the aggregation intervals as the potential forms of the recovered information, is optimal. (8.6.15)

An explicit general solution of the sets of equations (8.6.10) and (8.6.12), which are equations with variables as limits of integration, is not possible. However, it is possible to obtain (for a simple derivation see Judell, Scharf [8.25]) an approximate solution when the number L of forms of quantized information is so large that the quantization intervals are so small that the probability density p(x) can be approximated inside a quantization interval by a linear function. The lengths of the optimal quantization intervals are

Δx_l = A₄ / ∛p(x_cl)    (8.6.16)

where x_cl is the center of the lth optimal quantization interval and A₄ is a normalizing constant. The corresponding minimal mean square error is

Q(X̄_min) = [1/(12L²)] [∫ ∛p(x) dx]³    (8.6.17)
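As a quick check of the approximation (8.6.17) as written above (a worked example added here, not taken from the book): for the uniform density p(x) = 1/d on an interval of length d we have ∫ ∛p(x) dx = d · d^{−1/3} = d^{2/3}, so (8.6.17) gives Q(X̄_min) = d²/(12L²), which is exactly the mean square error Δ²/12 of uniform quantization with step Δ = d/L, in agreement with conclusion (8.6.15).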
As the quality indicator we take the normalized mean square error

Q' = Q(X̄_min)/σ_x²
(8.6.18)
and as the cost indicator we take the volume v of the quantized information

v = log₂ L
(8.6.19)
The diagrams of the normalized mean square error versus the volume of quantized information for the uniform and gaussian distributions are shown in Figure 8.23.
Figure 8.23. The trade-off of the normalized mean square error Q' versus the volume v = log₂L of quantized information for the optimal scalar quantization; dashed line: approximation (8.6.17), continuous line: Lloyd algorithm (8.6.27).
COMMENT 1
Figure 8.23 shows that for a given number of potential forms of quantized information the minimal recovery distortions are significantly smaller for the uniform probability distribution than for the gaussian one. For other probability distributions the advantage of the uniform distribution is still larger. This may suggest that it would be convenient to transform the primary information into information with a uniform probability density and to quantize the transformed information uniformly. It can be easily proved that the random variable F(x), where

F(x) = ∫_{−∞}^{x} p(u) du    (8.6.20)
and p(x) is the density of the random variable x, has a uniform probability density. Therefore, F(·) is called a "uniform making" transformation. However, it can also be proved that uniform quantization of the variable produced by the uniform making transformation does not produce quantization intervals equivalent to the optimal intervals given by (8.6.16). Thus, contrary to decorrelation, making the probability distribution uniform does not bring advantages for the subsequent processing.

The basic conclusions (8.6.13) can be generalized to the case when the primary information is K-DIM vector information. The multidimensional generalization of equation (8.6.10) is equation (7.5.24) derived in Section 7.5.1. In the case of vector quantization the aggregation sets are separated by surfaces. However, the optimality condition (8.6.14) can be generalized by the following argument. Suppose that a point x on the surface separating two aggregation sets A_l and A_m is located closer to the point x*_l representing the lth potential form of the recovered information than to the point x*_m. It can be easily proved that the contribution of x to the mean square error given by (7.5.2) is decreased if x is included in the aggregation set A_l. Thus the condition of optimality is that the surface separating the aggregation sets A_l and A_m is a segment of the plane that is perpendicular to the interval <x*_l, x*_m> and goes through its center. This proves the K-DIM generalization of conclusions (8.6.13).

The calculation in closed form of the integral (7.5.2) giving the mean square error of recovery is in general not possible. However, when the dimensionality K of the quantized information is large, the K-DIM generalization of the relative volume V_r(m, Q) of continuous information, defined by equation (6.6.29) in Section 6.6.3, gives insight into the performance of the optimal vector quantization without specifying the quantization rule. We denote by X = {x(k); k = 1, 2, .., K} the K-DIM random variable representing the continuous vector information. We assume that the random variables x(k) representing the components are statistically independent and have the same probability density. We denote briefly V_r(Q) = V_r[x(1), Q], Q > 0. From the definition (6.6.29) of the relative volume it follows that V_r(Q) is a decreasing function of Q. Thus, an inverse function Q*(v) exists. For example, the inverse function of the relative entropy given by equation (6.3.35) is

Q*(v) = σ_x² 2^{−2v}
(8.6.21)
We denote by

R″ = (log₂L)/K    (8.6.22)

Q″ = E (1/K) Σ_{k=1}^{K} [x(k) − x*(k)]²    (8.6.23)

the volume of the quantized K-DIM information, respectively the mean square recovery error, normalized with respect to the dimensionality. It can be shown (see e.g., Proakis [8.13]) that

For Q > Q*(R″) and for a sufficiently large K such a set of reference points of a NNT transformation can be found that the normalized mean square error of the optimal recovery is

Q″ = Q + A 2^{−K α(Q, R″)}    (8.6.24)

where A is a constant and the coefficient α(Q, R″) is the larger, the larger the difference R″ − H*(Q). When this difference goes to zero, α(Q, R″) also goes to zero. If Q < Q*(R″), then for large K it is not possible to make the difference Q″ − Q small.

The condition Q > Q*(R″) is equivalent to the condition R″ > H*(Q). Thus the interpretation of (8.6.24) is:

For a large dimensionality K of the primary vectors,

Q_min ≈ Q*((log₂L)/K)    (8.6.25)

is an approximation of the normalized mean square error for the optimal quantization producing quantized information which can take L forms.

For example, for gaussian information we use equation (8.6.21) and we get

Q_min = σ_x² 2^{−2(log₂L)/K} = σ_x² L^{−2/K}    (8.6.26)
The assumptions that the components are independent and have the same probability distribution have been introduced to simplify the argument. In general we have to use in (6.6.28) the amount of information for K-DIM variables and the relative volume per dimension, defined similarly to the entropy by equation (4.6.16).

ALGORITHMS FOR FINDING THE OPTIMAL QUANTIZATION RULE

In general a solution of the set of equations (8.6.10) and (8.6.12), and even more so of their K-DIM generalizations, cannot be obtained in closed form. However, the form in which equations (8.6.10) and (8.6.12) are written suggests interpreting them as conditions that the functions on their left-hand sides take the value zero. This in turn suggests the application of the iterative algorithms for finding a zero point, which have been described in Section 8.2.2. A straightforward modification of the procedure described on page 395 leads to the Lloyd algorithm:

STEP 1. Take an initial set X̄(1) = {x̄_l(1); l = 1, 2, .., L} of thresholds and calculate from (8.6.10) the corresponding set X*(1) = {x*_l(1); l = 1, 2, .., L} of the optimally recovered forms of information.

STEP j. Consider the set X*(j−1) = {x*_l(j−1); l = 1, 2, .., L} as the set of reference points of a NNT, calculate from (8.6.12) the new set X̄(j) of thresholds, and from (8.6.10) calculate the new set X*(j) = {x*_l(j); l = 1, 2, .., L} of recovered forms of information.

STOPPING RULE. Similar to rule 7 on page 395.
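A minimal sketch of the Lloyd algorithm described above, for a density tabulated on a grid (the function name, the grid resolution, and the gaussian example are assumptions of this sketch, not part of the book):

```python
import numpy as np

def lloyd_quantizer(pdf, x_min, x_max, L, iters=200, grid=20001):
    """Alternate the center-of-gravity rule (8.6.10) and the midpoint rule (8.6.12)."""
    x = np.linspace(x_min, x_max, grid)
    dx = x[1] - x[0]
    p = pdf(x)
    p = p / (p.sum() * dx)                         # normalized discretized density
    thr = np.linspace(x_min, x_max, L + 1)         # STEP 1: initial thresholds
    levels = np.empty(L)
    for _ in range(iters):                         # STEP j
        for l in range(L):                         # (8.6.10): centers of gravity
            m = (x >= thr[l]) & (x <= thr[l + 1])
            levels[l] = (x[m] * p[m]).sum() / p[m].sum()
        thr[1:-1] = 0.5 * (levels[:-1] + levels[1:])   # (8.6.12): midpoints
    # mean square error of the resulting quantization and recovery rules
    idx = np.clip(np.searchsorted(thr, x, side="right") - 1, 0, L - 1)
    mse = (((x - levels[idx]) ** 2) * p).sum() * dx
    return thr, levels, mse

# Example: L = 8 levels for a gaussian density (range truncated to +-5 standard deviations).
gauss = lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)
thresholds, levels, mse = lloyd_quantizer(gauss, -5.0, 5.0, 8)
print(np.round(levels, 3))
print(round(float(mse), 4))
```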
OPTIMIZATION OF SCALAR QUANTIZATION WHEN ONLY A TRAIN OF PIECES OF INFORMATION IS AVAILABLE

We assume that only a train X = {x(n); n = 1, 2, .., N}, x(n) ∈ <x̄_0, x̄_L>, is available. As the performance indicator we take

Q'[X*(·), V(·)] = A{x(n) − X*[V(x(n))]}²    (8.6.27)

where A is the operation of arithmetical averaging. In the notation now used, the optimal recovery rule (7.5.36) derived in Section 7.5.1 is

x*_l = [1/L(A_l)] Σ_{x(n)∈A_l} x(n)    (8.6.28)

where L(A_l) is the number of elements of the aggregation set A_l. We optimize the quantization rule, thus the aggregation sets, similarly as in the case when exact statistical information was available. We use the 1-DIM version of the argument presented on page 457. If there were an aggregation set A_l and an element x(n*) ∈ A_l such that

|x(n*) − x*_{l−1}| < |x(n*) − x*_l|    (8.6.29)

then by shifting this point to the neighbouring aggregation interval we could decrease the average square error Q'[X*(·), V(·)]. Therefore, in the considered case the condition (8.6.12) must also be satisfied, thus it must be

x̄_l − ½(x*_{l+1} + x*_l) = 0    (8.6.30)
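A corresponding sketch for the case considered here, where only a train of pieces of information is available (the initialization by quantiles and the gaussian test train are assumptions of this example); the recovered values follow (8.6.28) and the thresholds follow (8.6.30):

```python
import numpy as np

def empirical_lloyd(train, L, iters=100):
    """Lloyd iteration driven only by a train: arithmetic means over the aggregation
    sets as recovered values (8.6.28), midpoints as thresholds (8.6.30)."""
    train = np.asarray(train, dtype=float)
    thr = np.quantile(train, np.linspace(0.0, 1.0, L + 1))   # initial thresholds
    for _ in range(iters):
        idx = np.clip(np.searchsorted(thr, train, side="right") - 1, 0, L - 1)
        levels = np.array([train[idx == l].mean() for l in range(L)])
        thr[1:-1] = 0.5 * (levels[:-1] + levels[1:])
    idx = np.clip(np.searchsorted(thr, train, side="right") - 1, 0, L - 1)
    return thr, levels, float(np.mean((train - levels[idx]) ** 2))

rng = np.random.default_rng(2)
thr, levels, q = empirical_lloyd(rng.normal(0.0, 1.0, 100_000), 8)
print(np.round(levels, 3), round(q, 4))
```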
In view of the presented analogies with the case when exact statistical information is available, the modification of the Lloyd algorithm is straightforward. Similarly, we optimize the vector quantization when exact statistical information is not available.

OPTIMIZATION OF SHAPING THE CHANNEL INPUT SIGNAL

Shaping the primary information so that it can be best transmitted by a given communication channel is, besides quantization, another important preliminary transformation of information. For decades the research in this area has been very intensive, and several methods of design and optimization of the rules of transmitter operation have been developed. Many of these methods illustrate the advantages of structuring the signals put into the channel and of using various types of auxiliary information about the state of the environment in which a communication system operates. Concrete examples illustrating this have already been presented in Sections 8.4.2 and 8.4.3. Here the general effects of increasing the degree of structuring, particularly the dimensionality of the transmitted signals, are discussed.

We consider the transmission of discrete information under the symmetry assumptions: the potential forms of the working information exhibit statistical regularities, have the same probability, and the distortion indicator is symmetric (given by (8.1.2)). Then the typical indicator of performance of the rule of shaping the channel input signals is the probability of error of the optimal information recovery

P_eor = P(x* ≠ x)
(8.6.31)
First we assume that the channel is binary, symmetric, and memoryless, as described in Section 5.4.4, page 241. Thus, the transmitted signals are binary blocks; we called them code words. Then, as shown in Example 8.3.4, the NNT recovery rule (8.3.51) is optimal. We denote by W the set of code words. In view of the symmetry assumptions it is irrelevant which reversible rule of assigning a code word to a potential form of the working information is used. Therefore, the optimization of shaping the channel input signals reduces to the optimization of the set W of code words, thus to the statistical optimization problem with the set W as the optimized variable, the error probability P_eor as the performance indicator, and a set of constraints imposed on the code words. Typical are implementation constraints, particularly that the code is a parity check code, described in Section 2.1.2. After very intensive research during the past two decades the solutions of such optimization problems are well known (for general information and citations see pages 87 and 88) and widely used in practice. Here we concentrate on universal properties of optimal codes.

In view of the symmetry assumptions and of implementation considerations, code words of the same length N are considered. A comprehensive characteristic of the set W of such code words is the volume of working information (see Section 6.1.1) per binary element of a code word

R = v/N = (log₂L)/N    (8.6.32)

where L is the number of potential forms of the working information. The universal relationship between the parameter R and the rough description of the assumed channel by its capacity C₁ per binary signal, defined by equation (5.4.31), is:

If C₁ > R, then for a sufficiently large N it is possible to find such a set W_o of code words that

P_eor(W_o) ≤ A₁ 2^{−A₂ N α(R, C₁)}    (8.6.33)

where A₁, A₂ are two constants and α(R, C₁) > 0 is a coefficient that, for R growing from 0 to C₁, decreases from its initial value α(0, C₁) to 0 for R → C₁. If C₁ < R, then it is not possible to make the error probability arbitrarily small, however large N is taken.
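A small helper sketch for the condition C₁ > R of the theorem (an assumption of this example: the capacity per binary symbol of the binary symmetric memoryless channel is taken in its standard form C₁ = 1 − H(p), which is assumed here to coincide with the definition (5.4.31)):

```python
import numpy as np

def bsc_capacity(p):
    """Capacity per binary symbol of a binary symmetric memoryless channel, 1 - H(p)."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * np.log2(p) + (1 - p) * np.log2(1 - p)

# Coding-theorem condition: R = log2(L)/N must stay below C1.
k, N = 200, 300            # 2**k potential forms of working information, block length N
R = k / N
C1 = bsc_capacity(0.05)
print(f"R = {R:.3f}, C1 = {C1:.3f}, reliable transmission possible: {R < C1}")
```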
As indicated in Section 6.1.1, page 254, the cost of hiring a binary channel is usually proportional to the length N of the transmitted block. Therefore, we try to keep 1/R small, thus R large. From the definition (8.6.32) it follows that N can then be large only when the number L of potential forms of the working information is large too. If this is not the case, we must assemble a sufficient number of pieces of working information into a block to satisfy the condition of the coding theorem (8.6.33).

A modified version of the theorem holds also for the system with feedback using blocks of length N for a single transmission. As discussed in Section 8.3.3, the performance of a feedback system is described by the ultimate error probability and by the probability of making a disqualifying decision. It can be shown that if C₁ > R then both probabilities decrease exponentially with the level N of structuring of a single transmission, and for C₁ close to R the coefficient α_fb(R, C₁) is substantially larger than for the previously discussed system without feedback. Thus,

Compared with the open system, the system using feedback information allows us to decrease substantially the error probability of working information recovery. However, as in the case of the optimal open system, the performance of the optimal system with feedback is good only if the capacity of the forward channel C₁ > R; in other words, the feedback does not change the capacity of the system.

The modifications of the coding theorem (8.6.33) hold not only for wide classes of discrete channels other than the considered memoryless binary channel. They hold also for continuous channels, particularly for the gaussian continuous channel using signals of duration T and bandwidth B described on page 241. The considerations in Section 7.4.3 suggest that in this case we should take N = 2TB as the indicator of the structuring level and take R = (log₂L)/2TB. Then the direct counterpart of (8.6.33) holds for the gaussian channel.

For the orthogonal signals considered in Section 8.4.3 we have a specific situation. Using (8.4.41) we obtain R = (log₂2TB)/2TB. From a well known result of calculus it follows that (log₂2TB)/2TB → 0 when 2TB → ∞, as we assumed in Section 8.4.3. Therefore, for orthogonal signals we have to take R = (log₂L)/T, as we did in Section 8.4.3. Then the conclusion (8.4.52) and equation (8.4.53) are counterparts of the coding theorem (8.6.33), and the conclusions formulated on this and the previous page apply to orthogonal signals.

8.6.2 THE OPTIMIZATION OF THE SUBSYSTEM PROVIDING INFORMATION ABOUT THE STATE OF THE MAIN SYSTEM'S ENVIRONMENT

In the introductory Section 1.1 and in Section 1.7 it was indicated that the efficiency of information processing can be increased by using auxiliary information about the state of the environment of an information system. In particular, the auxiliary information makes intelligent operation of an information system possible. In Chapter 6, Chapter 7, and in the previous sections of this chapter we discussed the utilization of various types of state information in specific systems. We now summarize and generalize those considerations.

We use two complex systems to illustrate the basic features of systems with state information subsystems. The first is the intelligent data transmission system described in Section 2.2 and shown in Figure 2.9. The second is the system with a
common channel described in Section 2.3.1 and shown in Figure 2.11. Although the systems are different, the character of the dependence of their performance on the basic features of the used state information is similar. This suggests generalizations which allow us to formulate guidelines for the design of a state information subsystem.

UTILIZATION OF STATE INFORMATION IN A DATA TRANSMISSION SYSTEM

Analyzing the effects of state information in the system shown in Figure 2.9 we make the same assumptions as in Examples 8.3.1-8.3.3. In particular, we assume that

A1. The signal at the output of the working channel is r(t) = w(x_i, t) + z(t), t ∈ <0, T>.
Figure 8.24 shows that using even the simple binary feedback we can dramatically improve the performance of working information transmission. To achieve a similar performance with the best possible open system applying orthogonal signals and block operation, we have to use a very high structuring level. Thus, the concrete state information which controls the actions of the transmitter is more effective than the structure built into the signals of an open system. However, using feedback information requires a feedback channel and introduces indeterminate delays. Also, as indicated in the conclusion on page 461, the minimum capacity required to achieve the improvement is the same as for the open system.

UTILIZATION OF STATE INFORMATION IN MULTIPLE ACCESS SYSTEMS

As the second complex information system we consider the system with several information sources using a common channel, shown in Figure 2.11. We assume:

A1. The primary information delivered by a local source is a train of blocks interleaved with pauses, as described in Section 6.3.1.
A2. Statistical information about the instants of arrival of the packets and about their lengths is available; the Poisson-exponential model discussed in Section 5.2.1 is used.
A3. The arriving packets are stored in a buffer and, according to a transmission rule, taken out and put into the common channel.
A4. The decisions of the transmission rule are based on the following state information: (1) y_T(m)[CL(m)] about collisions of an own packet, (2) y_T(m)(CL) about all collisions, (3) y_T(m)(SYN) about the position of the time slots in which the packet should be put, (4) y_T(m)(CH_in) about the state at the input of the common channel (CSS systems), (5) y_CCS(T) about the state of the queues in the local transmitters (reservation systems, with the scheduling decisions taken at the central control unit).
A5. The indicator of distortions is the normalized delay

τ_n = τ̄/T̄    (8.6.37)

where τ̄ is the average delay caused by the system and T̄ is the average duration of a packet.
A6. The cost indicator is the normalized capacity of the common channel

C_n = C/S̄    (8.6.38)

where C is the common channel capacity and S̄ is the average intensity of the total transmitted information (in bit/s).

A brief description of systems using the various types of state information is given in Section 2.3. Such systems can be considered as systems performing real-time compression of trains arriving from remote sources. In general the analysis of multiple access systems is quite complicated (see e.g., Kleinrock [8.24], Seidler [8.25], Bertsekas, Gallager [8.26]). Diagrams of τ_n as a function of C_n, based on Seidler [8.25], are shown in Figure 8.24b. In the limiting case, when exact information y_CCS(T) about the states of the buffers of all users is available at a central control subsystem (see Section 2.3), it is possible to organize all arriving packets into a single virtual queue that behaves as a queue in a single buffer considered in Section 6.3. Therefore, the characteristic of the system using efficiently the information y_CCS(T) is the same as diagram (a) in Figure 6.9.
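A small numerical sketch of the limiting centralized case mentioned above (an assumption of this sketch: the single virtual queue is modelled as an M/M/1 queue of the Poisson-exponential type, for which the normalized delay is τ_n = C_n/(C_n − 1)); the values e and 2e quoted in the following paragraph are printed for comparison:

```python
import numpy as np

# Normalized delay of a single M/M/1 queue as a function of the normalized capacity C_n.
C_n = np.array([1.2, 1.5, 2.0, np.e, 4.0, 2 * np.e, 10.0])
tau_n = C_n / (C_n - 1.0)
for c, t in zip(C_n, tau_n):
    print(f"C_n = {c:5.2f}   tau_n = {t:6.2f}")

# Asymptotic minimal normalized capacities of the decentralized random-access systems
# quoted in the text: C_n -> e (slotted) and C_n -> 2e (unslotted).
print("e =", round(float(np.e), 3), "  2e =", round(float(2 * np.e), 3))
```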
If in a system using only the information y_T(m)[CL(m)] about own collisions the synchronization information y_T(m)(SYN) is additionally available, the performance of the system improves significantly. In particular, the asymptotic minimal required normalized capacity reduces from C_n = 2e to e. However, if besides the information y_T(m)[CL(m)] also the information y_T(m)(CH_in) is available, the improvement achieved by using the synchronization information y_T(m)(SYN) is small.

A similar effect of saturation occurs in data transmission systems. A typical example is the feedback information. Compared with the open system it improves the performance, but not the channel capacity (see the conclusion on page 461). There is also an effect of saturation when the level of structuring increases. The properties of the system with orthogonal signals discussed in Section 8.4.3 are an example. Those observations suggest the following generalizations:
• Using the information about the concrete and meta state of the environment of the working information system it is possible to achieve a substantial improvement of the performance of the working information system without increasing its fundamental processing resources.
• When the fundamental information processing introduces indeterministic distortions, the performance can also be improved by increasing the level of structuring of the signals carrying the information.
• An effect of saturation with state information occurs: when enough state information is already available, further increasing the volume of state information only slightly improves the performance of information processing.

Let us return to the general considerations in Section 1.6.1 about the performance of an information system. In view of the above properties of state information, the gross gain G⁺[T_wo(·)] brought by an intelligent information system which processes the working information according to the optimized rules T_wo(·) grows with the volume V_Y of the used concrete and meta information about the state of the environment, but exhibits a saturation, as shown in Figure 8.25.
Figure 8.25. Typical dependence of the gross gain G⁺[T_wo(·)] brought by an intelligent information system, of the cost G⁻[T_sis(·)] of building and running the state information subsystem, and of the net gain G_w,s brought by embedding the state information subsystem and running the working information system in an intelligent way, on the volume V_Y of the state information.
The cost G⁻[T_sis(·)] of implementing and running the state information subsystem grows with the volume V_Y of the state information. Thus the net gain

G_w,s = G⁺[T_wo(·)] − G⁻[T_sis(·)]    (8.6.40)

considered as a function of the volume of state information exhibits a maximum. In other words, there is an optimal volume V_Yo of the state information. If the volume of the used state information is smaller than V_Yo, the possibilities of improving the working information processing are not fully exploited; if it is larger, the additional cost of acquiring and processing the state information outweighs the improvement of the performance of the working information processing.
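A toy sketch of the trade-off (8.6.40) (the saturating-gain and linear-cost shapes below are assumptions of this illustration, chosen only to mimic Figure 8.25):

```python
import numpy as np

V = np.linspace(0.0, 10.0, 1001)      # volume V_Y of the state information
gross_gain = 1.0 - np.exp(-V)         # saturating gross gain (assumed shape)
cost = 0.12 * V                       # cost of the state information subsystem (assumed shape)
net_gain = gross_gain - cost          # counterpart of (8.6.40)

V_opt = V[np.argmax(net_gain)]
print("optimal volume V_Yo:", round(float(V_opt), 2))
print("maximal net gain   :", round(float(net_gain.max()), 3))
```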