APPLIED NONLINEAR TIME SERIES ANALYSIS
APPLICATIONS IN PHYSICS, PHYSIOLOGY AND FINANCE
WORLD SCIENTIFIC SERIES ON NONLINEAR SCIENCE Editor: Leon O. Chua University of California, Berkeley Series A.
MONOGRAPHS AND TREATISES
Volume 33:
Lectures in Synergetics V. I. Sugakov
Volume 34:
Introduction to Nonlinear Dynamics* L. Kocarev & M. P. Kennedy
Volume 35:
Introduction to Control of Oscillations and Chaos A. L. Fradkov & A. Yu. Pogromsky
Volume 36:
Chaotic Mechanics in Systems with Impacts & Friction B. Blazejczyk-Okolewska, K. Czolczynski, T. Kapitaniak & J. Wojewoda
Volume 37:
Invariant Sets for Windows — Resonance Structures, Attractors, Fractals and Patterns A. D. Morozov, T. N. Dragunov, S. A. Boykova & O. V. Malysheva
Volume 38:
Nonlinear Noninteger Order Circuits & Systems — An Introduction P. Arena, R. Caponetto, L. Fortuna & D. Porto
Volume 39:
The Chaos Avant-Garde: Memories of the Early Days of Chaos Theory Edited by Ralph Abraham & Yoshisuke Ueda
Volume 40:
Advanced Topics in Nonlinear Control Systems Edited by T. P. Leung & H. S. Qin
Volume 41:
Synchronization in Coupled Chaotic Circuits and Systems C. W. Wu
Volume 42:
Chaotic Synchronization: Applications to Living Systems E. Mosekilde, Y. Maistrenko & D. Postnov
Volume 43:
Universality and Emergent Computation in Cellular Neural Networks R. Dogaru
Volume 44:
Bifurcations and Chaos in Piecewise-Smooth Dynamical Systems Zh. T. Zhusubaliyev & E. Mosekilde
Volume 45:
Bifurcation and Chaos in Nonsmooth Mechanical Systems J. Awrejcewicz & C.-H. Lamarque
Volume 46:
Synchronization of Mechanical Systems H. Nijmeijer & A. Rodriguez-Angeles
Volume 47:
Chaos, Bifurcations and Fractals Around Us W. Szemplińska-Stupnicka
Volume 48:
Bio-Inspired Emergent Control of Locomotion Systems M. Frasca, P. Arena & L. Fortuna
Volume 49:
Nonlinear and Parametric Phenomena V. Damgov
Volume 50:
Cellular Neural Networks, Multi-Scroll Chaos and Synchronization M. E. Yalcin, J. A. K. Suykens & J. P. L. Vandewalle
Volume 51:
Symmetry and Complexity K. Mainzer
*Forthcoming
WORLD SCIENTIFIC SERIES ON NONLINEAR SCIENCE
Series A, Vol. 52
Series Editor: Leon O. Chua

APPLIED NONLINEAR TIME SERIES ANALYSIS
APPLICATIONS IN PHYSICS, PHYSIOLOGY AND FINANCE

Michael Small
Hong Kong Polytechnic University

World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • BANGALORE
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
APPLIED NONLINEAR TIME SERIES ANALYSIS Applications in Physics, Physiology and Finance Copyright © 2005 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-117-X
Printed by Fulsland Offset Printing (S) Pte Ltd, Singapore
For Sylvia and Henry
Preface
Nonlinear time series methods have developed rapidly over a quarter of a century and have reached an advanced state of maturity during the last decade. Implementations of these methods for experimental data are now widely accepted and fairly routine; however, genuinely useful applications remain rare. The aim of this book is to focus on the practice of applying these methods to solve real problems. It is my hope that the methods presented here are sufficiently accessible, and the examples sufficiently detailed, that practitioners in other areas may use this work to begin considering further applications of nonlinear time series analysis in their own disciplines. This volume is therefore intended to be accessible to a fairly broad audience: both specialists in nonlinear time series analysis (for whom many of these techniques may be new) and scientists in other fields (who may be looking to apply these methods within their speciality). For the experimental scientist looking to use these methods, MATLAB implementations of the underlying algorithms accompany this book. Although the mathematical motivation for nonlinear time series analysis is fairly advanced, I have chosen to keep technical content in this book to a minimum. Postgraduate and advanced undergraduate students in the physical sciences should find the material reasonably easy to understand. This book may be read sequentially; skimmed in a pseudo-random order; used primarily as a reference; or treated as a manual for the companion computer programs.

The applications

To illustrate the usefulness of nonlinear time series analysis, a wide variety of physical, financial and physiological systems have been considered.
In particular, several detailed applications serve as case studies of fruitful (and occasionally less fruitful) applications, and illustrate the mathematical techniques described in the text. These applications include:

• diagnosis and control of cardiac arrhythmia in humans (prediction of ventricular fibrillation);
• characterisation and identification of aberrant respiratory control in infants (sleep apnea and Sudden Infant Death syndrome);
• simulation and recognition of human vocalisation patterns;
• interpretation and prediction of financial time series data; and,
• quantitative assessment of the effectiveness of government control measures of the recent SARS crisis in Hong Kong.

The applications described in this book are drawn largely from my own work, and for this I offer no apology. These applications are the applications which interest me most, and with which I am most familiar. Some of the applications are particularly long and detailed: this is so that the reader can truly get a feel for the complexity of each example and the effort required to fit the methods to the problems. However, if one is primarily interested in the methods themselves, it is entirely possible to read this monograph as a textbook: one should simply skip (or skim) the long-winded application based sections.

The tools

The technical tools utilised in this book fall into three distinct, but interconnected areas: quantitative measures of nonlinear dynamics, Monte-Carlo statistical hypothesis testing (aka surrogate data analysis), and nonlinear modelling. In Chapter 1 we discuss nonlinear time series analysis from the perspective of time delay embedding reconstruction. We describe the current state-of-the-art in reconstruction techniques, together with my own view of how to "best" embed scalar time series data. Chapters 2 and 3 are concerned with the estimation of statistical quantities (mostly dynamic invariants) from time series data. Quantitative measures, such as estimation of correlation dimension and Lyapunov exponents, have been described previously in standard texts. However, this material is included here for completeness: an understanding of these techniques is necessary for the surrogate data methods that follow.
More importantly, it is necessary to provide a current description of the most appropriate estimation techniques for short and noisy data sets. In particular, it should be noted that many of the more venerable techniques perform poorly when faced with the limitations of most experimentally obtained data i.e. short and noisy time series. Hopefully, some of the new methods we discuss will help alleviate these problems. Three standard methods of Monte-Carlo statistical hypothesis testing (the method of surrogate data) have come into widespread use recently. In Chapters 4 and 5, these methods are described along with their many limitations and restrictions. Several new, state-of-the-art methods to circumvent these problems are also described and demonstrated to have useful application. Finally, a natural extension of surrogate data methods is the testing of model fit, and this is the third focus of this book. Standard nonlinear modelling methods (neural networks, radial basis functions and so on) are the subject of numerous excellent texts. In Chapter 6 we focus on finding the best model and how to determine when a given model is "good enough". These are two problems that are not well addressed in the current literature. For this work, we utilise the surrogate data methods described previously, as well as information theoretic measures of both the model and the data. These information theoretic measures of the data are in turn related to, and form the foundation of, the dynamic invariants described in the introduction of this book.
Acknowledgments

This volume is actually the result of many years' work. The new methods presented here build on a broad and strong foundation of nonlinear time series analysis and nonlinear dynamical systems theory. In collating this material in a single volume I should thank many people. I would never have met this exciting research field if it were not for my PhD adviser, Dr. Kevin Judd (University of Western Australia, Perth), and also Dr. Steven Stick (Princess Margaret Hospital for Children, Perth). My interest in the analysis of human ECG rhythms was sown by Prof. Robert Harrison (Heriot-Watt University, Edinburgh), and my enthusiasm for financial time series analysis is due to the significant work of Prof. Marius Gerber (Proteus VTS). I should also wish to acknowledge Prof. Michael Tse (Hong Kong Polytechnic University) for his help and also for suggesting this endeavour in the first place: without his incessant encouragement I would not have
written this. As is usual for a post-doctoral researcher, my work has been conducted in a variety of settings. Primarily, the work that serves as the foundation for this volume was conducted in the Centre for Applied Dynamics and Optimisation (CADO) in the Mathematics Department of the University of Western Australia; the Nonlinear Dynamics Group in the Physics Department of Heriot-Watt University (Edinburgh); and the Applied Nonlinear Circuits and Systems Research Group in the Department of Electronic and Information Engineering at the Hong Kong Polytechnic University. I extend my warm thanks to my numerous friends and colleagues in all three centres. Finally, I would like to acknowledge those who have contributed directly to this work: Dr. Tomomichi Nakamura persevered through an early draft version of much of this manuscript and gave me valuable feedback at each stage, and my students Mr. Xiaodong Luo and Mr. Yi Zhao assisted and provided me with valuable ideas and new ways of looking at most of the content of this work. Most important of all, my wife Sylvia endured the entire text of this volume and vastly improved both the grammar and style. Finally, I am obliged to acknowledge various sources of financial support. This work was supported in part by Hong Kong University Grants Council (UGC) Competitive Earmarked Research Grants (CERG) numbers PolyU 5235/03E and PolyU 5216/04E, as well as direct allocations from the Department of Electronic and Information Engineering (A-PE46 and A-PF95).

Feedback

Although the best way to really understand how these techniques work is to create your own implementation of the necessary computer code, this is not always possible. Therefore, MATLAB implementations of many of the key algorithms and the various case studies are available from my website (http://small.eie.polyu.edu.hk/). You are most welcome to send me your comments and feedback ([email protected]).

Michael Small
Hong Kong
January 25, 2005
Contents
Preface  vii

1. Time series embedding and reconstruction  1
   1.1 Stochasticity and determinism: Why should we bother?  2
   1.2 Embedding dimension  5
       1.2.1 False Nearest Neighbours  6
       1.2.2 False strands and so on  7
       1.2.3 Embed, embed and then embed  8
       1.2.4 Embed and model, and then embed again  9
   1.3 Embedding lag  10
       1.3.1 Autocorrelation  10
       1.3.2 Mutual information  11
       1.3.3 Approximate period  11
       1.3.4 Generalised embedding lags  12
   1.4 Which comes first?  14
   1.5 An embedding zoo  15
   1.6 Irregular embeddings  19
       1.6.1 Finding irregular embeddings  21
   1.7 Embedding window  28
       1.7.1 A modelling paradigm  30
       1.7.2 Examples  34
   1.8 Application: Sunspots and chaotic laser dynamics: Improved modelling and superior dynamics  41
   1.9 Summary  44

2. Dynamic measures and topological invariants  47
   2.1 Correlation dimension  48
   2.2 Entropy, complexity and information  54
       2.2.1 Entropy  54
       2.2.2 Complexity  58
       2.2.3 Alternative encoding schemes  60
   2.3 Application: Detecting ventricular arrhythmia  69
   2.4 Lyapunov exponents and nonlinear prediction error  74
   2.5 Application: Potential predictability in financial time series  80
   2.6 Summary  82

3. Estimation of correlation dimension  85
   3.1 Preamble  86
   3.2 Box-counting and the Grassberger-Procaccia algorithm  87
   3.3 Judd's algorithm  90
   3.4 Application: Distinguishing sleep states by monitoring respiration  95
   3.5 The Gaussian Kernel algorithm  102
   3.6 Application: Categorising cardiac dynamics from measured ECG  105
   3.7 Even more algorithms  111

4. The method of surrogate data  115
   4.1 The rationale and language of surrogate data  116
   4.2 Linear surrogates  120
       4.2.1 Algorithm 0 and its analogues  121
       4.2.2 Algorithm 1 and its applications  122
       4.2.3 Algorithm 2 and its problems  123
   4.3 Cycle shuffled surrogates  125
   4.4 Test statistics  129
       4.4.1 The Kolmogorov-Smirnov test  131
       4.4.2 The χ² test  131
       4.4.3 Noise dimension  132
       4.4.4 Moments of the data  132
   4.5 Correlation dimension: A pivotal test statistic — linear hypotheses  133
       4.5.1 The linear hypotheses  135
       4.5.2 Calculations  136
       4.5.3 Results  142
   4.6 Application: Are financial time series deterministic?  143
   4.7 Summary  147

5. Non-standard and non-linear surrogates  149
   5.1 Generalised nonlinear null hypotheses: The hypothesis is the model  150
       5.1.1 The "pivotalness" of dynamic measures  152
       5.1.2 Correlation dimension: A pivotal test statistic — nonlinear hypothesis  153
   5.2 Application: Infant sleep apnea  155
   5.3 Pseudo-periodic surrogates  157
       5.3.1 Shadowing surrogates  158
       5.3.2 The parameters of the algorithm  161
       5.3.3 Linear noise and chaos  163
   5.4 Application: Mimicking human vocalisation patterns  166
   5.5 Application: Are financial time series really deterministic?  168
   5.6 Simulated annealing and other computational methods  174
   5.7 Summary  176

6. Identifying the dynamics  179
   6.1 Phenomenological and ontological models  180
   6.2 Application: Severe Acute Respiratory Syndrome: Assessing governmental control strategies during the SARS outbreak in Hong Kong  181
   6.3 Local models  195
   6.4 The importance of embedding for modelling  198
   6.5 Semi-local models  200
       6.5.1 Radial basis functions  200
       6.5.2 Minimum description length principle  201
       6.5.3 Pseudo linear models  205
       6.5.4 Cylindrical basis models  207
   6.6 Application: Predicting onset of Ventricular Fibrillation, and evaluating time since onset  208

7. Applications  223

Bibliography  229

Index  241
Chapter 1
Time series embedding and reconstruction
Nonlinear time series analysis is the study of time series data with computational techniques sensitive to nonlinearity in the data. While there is a long history of linear time series analysis, nonlinear methods have only just begun to reach maturity. When analysing time series data with linear methods, there are certain standard procedures one can follow, moreover the behaviour may be completely described by a relatively small set of parameters. For nonlinear time series analysis, this is not the case. While black box algorithms exist for the analysis of time series data with nonlinear methods, the application of these algorithms requires considerable knowledge and skill on the part of the operator: you cannot simply close your eyes and press the button. Moreover, the growing artillery of nonlinear time series methods still has fairly few tangible, practical, applications: "applications" that satisfy the mathematicians or physicists definitely exist, but fairly few of these applications merit the attention of a practising engineer or physician. This book is intended to be a guide book for the study of experimental time series data with nonlinear methods. We aim to present several applications, that, by the authors' biased assessment, sit somewhere between a mathematician's "application" and an engineer's. To achieve such a lofty aim with such a slim volume, we need to restrict our attention somewhat: the subject of this book is the study of a single scalar time series measured from a deterministic dynamical system in the presence of observational and dynamic noise. A substantial portion of this volume concerns methods to obtain statistical evidence of nonlinear determinism rather than actual "applications". The reason for this is simple: one should first establish that linear time series methods are insufficient before stepping outside the bounds of this venerable field. 1
In terms of understanding the underlying system, the focus of this book is two-fold. We are interested in statistically characterising the dynamics observed in a time series, and we are interested in developing methods to produce new time series exhibiting the same dynamics. The main tools which we will employ to achieve this are the method of surrogate data, the estimation of dynamic invariants, and nonlinear modelling. In the remainder of this opening chapter, we provide the basic definitions and framework with which we will analyse data. We describe the various embedding theorems and appeal to these as the motivation of delay reconstruction and the subsequent methods. Throughout this opening salvo, we make one very significant assumption. We assume that the data are the deterministic output, measured with arbitrary (but finite) precision, of a deterministic dynamical system.
1.1 Stochasticity and determinism: Why should we bother?
So, let us begin with some mathematical necessities. The main purpose of delving into this detail so early in a book on applied methods is to define precisely what it is we are interested in studying, and what we expect to find. A deterministic dynamical system is one for which there is a rule, and, given sufficient knowledge of the current state of that system, one can use that rule to predict future states. For notational convenience we will consider only discrete dynamical systems (this, after all, is the only sort that modern digital computers can cope with). The extension to continuous systems is obvious and is dealt with at considerable depth elsewhere. Often, although our notation relates to a discrete signal, it is in fact a continuous system sampled at a (hopefully) sufficiently high sampling rate. Let $z_n$ be the current state of the system. We suppose that $z_n \in \mathcal{M} \subseteq \mathbb{R}^k$ is a $k$-dimensional state vector, $\mathcal{M}$ is the attractor on which the dynamics evolve, and one has some evolution operator $\Phi : \mathcal{M} \times \mathbb{Z} \mapsto \mathcal{M}$ such that $\Phi(z_n, t) = z_{n+t}$. This situation is depicted in Fig. 1.1. Note that, in general, one does not have to restrict $\mathcal{M} \subseteq \mathbb{R}^k$, but to do so does not really restrict our discussion either. The dynamical system described by $(\mathcal{M}, \Phi)$ is said to be deterministic if the evolution operator $\Phi$ is deterministic. In other words, if one can write down some mathematical rule by which the future state $z_{n+t}$ can be determined precisely from the current state $z_n$ at any instance $n$ for
Fig. 1.1 Evolution on a manifold. The system state $z_n$ exists within some manifold $\mathcal{M}$ and evolves according to $\Phi$. This dynamic is observed in $\mathbb{R}$ by some $C^2$ observation function $g$, and hence delay reconstruction is possible. The delay reconstruction in $\mathbb{R}^d$ is, in some sense, equivalent to the original trajectory.
some value of $t > 0$, then that rule, $\Phi$, is said to be deterministic, and the dynamical system defined by that rule is also deterministic. We are currently only interested in discrete dynamical systems, and can therefore limit our discussion to $n, t \in \mathbb{Z}$. Moreover, our definition insists that the evolution operator does not change with time (if $x_n = x_m$ then $\Phi(x_n, \cdot) = \Phi(x_m, \cdot)$ even if $m \neq n$); in this case, the dynamical system is stationary. Dynamical systems that are not stationary are exceedingly difficult to model from time series,¹ therefore we generally restrict our attention to the stationary case. Notice that this definition of stationarity is not the same as for linear systems: a linear system is said to be stationary if all its moments² remain unchanged with time. However, restricting our attention to stationary systems can be easily justified by observing that the extent of the system is not

¹ Unless one has a priori knowledge of the structure of the underlying system, the number of parameters will greatly exceed the number of available data.
² That is, the mean, standard deviation, kurtosis, skewness and all similar statistics concerning the distribution of the data.
necessarily constrained. One usually chooses the system $\mathcal{M} \subseteq \mathbb{R}^k$ to be the smallest system (lowest $k$) such that the corresponding $\Phi : \mathbb{Z} \times \mathcal{M} \mapsto \mathcal{M}$ is stationary. A non-stationary system is one which is subject to temporal dependence based on some outside influence. If we extend our definition of the system to include all outside influences, the system is stationary. In other words, if an evolution operator $\Phi$ is not stationary, a new evolution operator $\tilde{\Phi}$ can be constructed which is stationary simply by increasing $k$ appropriately. For example, an observer standing on the beach³ observing the rise and fall of the tide could justifiably say that the state of the system representing the level of the tide is nonstationary: it is subject to external, and for a suitably ignorant observer, not entirely predictable effects. However, if the same observer included the relative positions and orientation of the earth, moon and sun in their system, then they would conclude that taken together they represent an (approximately) stationary system. Now, let us suppose that we have a dynamical system which is both stationary and deterministic. We must now examine this system. Suppose that we can measure some single scalar quantity at any time. That is, we have an observation function $g : \mathcal{M} \mapsto \mathbb{R}$. This observation function provides us with a way to measure the current state of the system, $g(z_n)$. Since $g(\cdot)$ gives us only a scalar value, it cannot offer a complete description of the system. But, observing $x_n = g(z_n)$ at many successive times will. According to the celebrated Takens' embedding theorem [144], if $d_e$ is sufficiently large, the evolution of $(x_n, x_{n-1}, x_{n-2}, \ldots, x_{n-d_e})$ will be the same as that of $z_n$. The argument for this theorem is topological, but it is easy to understand on an intuitive level. Suppose I can only measure one variable of a system. Then, except for one-dimensional systems, this is insufficient to describe the underlying dynamics. However, suppose that I can also measure the derivative of that variable, and further higher order derivatives up to some finite level $d$. Then, if the dimension of the system is less than $d$, I have enough information to completely describe the system: a system of $d$ differential (or difference) equations. But, for sufficiently high sampling rate, measuring $d$ derivatives is equivalent to measuring the system at $d$ different time intervals. Moreover, the embedding

$$x_n \longrightarrow (x_n, x_{n-1}, x_{n-2}, \ldots, x_{n-d_e}) \qquad (1.1)$$

³ As Sir Isaac Newton did (when he wasn't standing on giants).
contains the same information as the original system: provided $d_e$ is large enough, the measurement function $g$ is twice differentiable,⁴ there is a sufficiently long data record available (for the comparison to the original system to be meaningful) and the data are sampled sufficiently often [144]. In practice, these conditions are very difficult to achieve. At the very least, digitisation (both sampling and quantisation) of the data represents a breach of the differentiability of $g$. Moreover, our informal argument concerning the differentiation of the system state by taking successive observations is substantially weakened in the presence of noise. Higher order derivatives are notoriously difficult to obtain numerically [95]. However, practitioners will presume that the conditions hold approximately; and the data embedded by Eq. (1.1) approximates the topology of the underlying attractor $\mathcal{M}$; and the sequence of embedded points behaves under a deterministic rule approximately equivalent to the evolution operator $\Phi$. All of this approximation may not sit well with the theorist, but it is best to bear in mind that the motivation of nonlinear time series analysis is based firmly on these slightly faulty assumptions. One cannot necessarily achieve a perfect embedding (quantisation of data and finite sampling time violate the sufficient conditions of Takens' theorem), but we can still hope for a good one. But even our concept of a good embedding is limited by the purpose we have in mind [14]. To achieve a "good" embedding, the first parameter we need to estimate is the embedding dimension $d_e$.

⁴ That is, noise, or even ordinary digital quantisation are technically not allowed.
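Although Takens' theorem is stated abstractly, the reconstruction of Eq. (1.1) is simple to compute. The following sketch is illustrative only — the companion code for this book is written in MATLAB — and the test signal, the dimension $d_e = 3$ and the lag of 8 samples are assumed example values rather than prescriptions.

```python
import numpy as np

def delay_embed(x, de, tau):
    """Build delay vectors (x_n, x_{n-tau}, ..., x_{n-(de-1)tau}) as in Eq. (1.1)."""
    x = np.asarray(x, dtype=float)
    n_min = (de - 1) * tau            # first index with a complete history
    # column j holds x_{n - j*tau}; each row is one reconstructed state vector
    return np.column_stack([x[n_min - j * tau: len(x) - j * tau] for j in range(de)])

# Example: embed a noisy sine wave in three dimensions with lag 8 (assumed values).
t = np.arange(2000)
x = np.sin(2 * np.pi * t / 50) + 0.05 * np.random.randn(len(t))
V = delay_embed(x, de=3, tau=8)
print(V.shape)   # (1984, 3)
```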
1.2 Embedding dimension
Takens' embedding theorem [90; 144] and, more recently, work of Grebogi [26]⁵ and others [18; 6; 12; 83] give sufficient conditions on, or suggest criteria for, $d_e$. Unfortunately, the conditions require a prior knowledge of the fractal dimension of the object under study. In practice, one could guess a suitable value for $d_e$ by successively embedding in higher dimensions and looking for consistency of results; this is the method that is generally employed. In general, the aim of selecting an embedding dimension is to make sufficiently many observations of the system state so that the deterministic state of the system can be resolved unambiguously.⁶ Most methods to estimate the embedding dimension aim to achieve unambiguity of the system state. The archetype of many of these methods is the so-called false nearest neighbour technique [32; 147].

⁵ Grebogi gives a sufficient condition on the value of $d_e$ necessary to estimate the correlation dimension of an attractor, not to avoid all possible self intersections.
⁶ It is best to remember that in the presence of observational noise and finite quantisation this is not possible. Moreover, it has been shown that even with perfect observations over an arbitrary finite time interval, a "correct" embedding will still yield a set of states indistinguishable from the true state [58].
1.2.1 False Nearest Neighbours
Suitable bounds on $d_e$ can be deduced by using false nearest neighbour analysis [64]. The rationale of false nearest neighbour techniques is the following. One embeds a scalar time series $y_t$ in increasingly higher dimensions, at each stage comparing the number of pairs of vectors $v_t$ and $v_t^{NN}$ (the nearest neighbour of $v_t$) which are close when embedded in $\mathbb{R}^n$ but not close in $\mathbb{R}^{n+1}$. Each point

$$v_t = (y_{t-\tau}, y_{t-2\tau}, \ldots, y_{t-n\tau})$$

has a nearest neighbour

$$v_t^{NN} = (y_{t'-\tau}, y_{t'-2\tau}, \ldots, y_{t'-n\tau}).$$
When one has a large amount of data, the distance (Euclidean norm will do) between $v_t$ and $v_t^{NN}$ should be small. If these two points are genuine neighbours, they became close due to the system dynamics and should separate (relatively) slowly. However, these two points may have become close because the embedding in $\mathbb{R}^n$ has produced trajectories that cross (or become close) due to the embedding rather than the system dynamics.⁷ For each pair of neighbours $v_t$ and $v_t^{NN}$ in $\mathbb{R}^n$, one can increase the embedding dimension by one so that

$$\tilde{v}_t = (y_{t-\tau}, y_{t-2\tau}, \ldots, y_{t-n\tau}, y_{t-(n+1)\tau})$$

and

$$\tilde{v}_t^{NN} = (y_{t'-\tau}, y_{t'-2\tau}, \ldots, y_{t'-n\tau}, y_{t'-(n+1)\tau})$$

⁷ The standard example is the embedding of motion around a figure eight in two dimensions. At the crossing point in the centre of the figure, trajectories cross. However, one can imagine that if this were embedded in three dimensions, then these trajectories may not intersect.
may or may not still be close. The increase in the distance between these two points is given only by the difference between the last components:

$$\|\tilde{v}_t - \tilde{v}_t^{NN}\|^2 - \|v_t - v_t^{NN}\|^2 = (y_{t-(n+1)\tau} - y_{t'-(n+1)\tau})^2.$$

One will typically calculate the normalised increase to the distance between these two points and determine that two points are false nearest neighbours if

$$\frac{|y_{t-(n+1)\tau} - y_{t'-(n+1)\tau}|}{\|v_t - v_t^{NN}\|} > R_T. \qquad (1.2)$$
A suitable value of $R_T$ depends on the spatial distribution of the embedded data $v_t$. If $R_T$ is too small, true near neighbours will be counted as false; if $R_T$ is too large, some false near neighbours will not be included. Typically $10 \leq R_T \leq 30$; for convenience we find that $R_T = 15$ is a good starting point. One must ensure that the chosen value of $R_T$ is suitable for the spatial distribution of the data under consideration — this may be done by trialling a variety of values of $R_T$. By determining if the closest neighbour to each point is false, one can then calculate the proportion of false nearest neighbours for a given embedding dimension $n$. We can then choose as the embedding dimension $d_e$ the minimum value of $n$ for which the proportion of points which satisfy the above condition is below some small threshold. Typically one could expect the proportion of points satisfying this condition to decrease gradually, as the embedded data are "unfolded" in an increasing embedding dimension, and eventually that proportion plateaus at a relatively low level.
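As an illustration of the test in Eq. (1.2), the following is a minimal brute-force sketch (in Python rather than the book's companion MATLAB code). It assumes a plain Euclidean nearest-neighbour search over the whole series and the default threshold $R_T = 15$ suggested above.

```python
import numpy as np

def fnn_fraction(y, n, tau, r_tol=15.0):
    """Fraction of false nearest neighbours for an n-dimensional embedding (Eq. (1.2))."""
    y = np.asarray(y, dtype=float)
    start = (n + 1) * tau                      # leave room for the (n+1)-th coordinate
    V = np.column_stack([y[start - k * tau: len(y) - k * tau] for k in range(1, n + 1)])
    extra = y[start - (n + 1) * tau: len(y) - (n + 1) * tau]   # y_{t-(n+1)tau}
    false_count = 0
    for i in range(len(V)):
        d = np.linalg.norm(V - V[i], axis=1)   # brute-force distances in R^n
        d[i] = np.inf                          # exclude the point itself
        j = np.argmin(d)                       # nearest neighbour
        if d[j] > 0 and abs(extra[i] - extra[j]) / d[j] > r_tol:
            false_count += 1
    return false_count / len(V)

# Choose d_e as the smallest n for which the proportion of false neighbours is small:
# for n in range(1, 8):
#     print(n, fnn_fraction(x, n, tau=8))
```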
1.2.2 False strands and so on
The idea in the previous section can easily be extended to consider trajectories, rather than simply successive points. Whereas, the method of false nearest neighbour relies on measuring the divergence of nearby points after a short time, the method of false strands [64] measures whether trajectories which cross continue to stay relatively close over their entire length. The difference is a subtle but important one. When computing false nearest neighbours, one runs into a problem as the embedding dimension increases. The threshold, by which we determine whether a neighbour is false becomes insufficient, and one can readily observe that even white noise exhibits (according to the false nearest neighbour technique) a relatively low embedding dimension [63]. To overcome
this, the method of false nearest neighbours replaces Eq. (1.2) with the condition

$$\frac{|y_{t-(n+1)\tau} - y_{t'-(n+1)\tau}|}{\hat{\sigma}} > R_T, \qquad (1.3)$$

where

$$\hat{\sigma} = \left( \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2 \right)^{1/2}$$

and $\bar{y}$ denotes the mean value of the data [63].
1.2.3 Embed, embed and then embed
One further approach that is worth mentioning, at least because it is commonly used in practice, we refer to as "the doctrine of ever increasing embedding dimension". The idea goes something like this: embed the data in increasing dimensions until one observes consistency. Usually, one will choose increasing embedding dimensions, and in each case measure the correlation integral.⁸ When we no longer observe changes in the behaviour of the correlation integral with increasing dimension, we have found a sufficiently large embedding dimension. Although consistency will not necessarily imply correctness, the rationale of this approach is that the correlation integral (or any other dynamic invariant) should be independent of the embedding dimension provided that it is sufficiently large. Therefore, when one observes no dependence on the embedding dimension, then this must be a sufficiently large dimension. While this approach may work in practice, the fallacy of this logic is that if the sufficient conditions of Takens' theorem are not satisfied (either the data set is too short, too noisy, has been digitised, or is not even finite dimensional), no embedding dimension may necessarily be acceptable. However, for a finite length data set, with some noise, one will almost always observe consistent behaviour provided that the dimension is sufficiently large. This consistency is simply a result of embedding in an ever increasing dimension. Eventually the embedding dimension will become sufficiently large so that increasing it by one does not appreciably change the properties of the data (i.e. as $n \to \infty$, $n + 1 \approx n$).

⁸ Correlation dimension and other dynamic invariants are discussed in more detail in Chapter 2.
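A rough computational sketch of this "embed, embed and then embed" procedure is given below. It computes the fraction of pairs of embedded points closer than a fixed radius (a crude correlation sum; correlation integrals proper are treated in Chapter 2) for increasing embedding dimension, and looks for the value at which this quantity stops changing appreciably. The lag, the radius and the brute-force pair counting are assumptions made purely for illustration.

```python
import numpy as np

def correlation_sum(x, n, tau, eps):
    """Fraction of pairs of n-dimensional delay vectors closer than eps (brute force)."""
    x = np.asarray(x, dtype=float)
    start = (n - 1) * tau
    V = np.column_stack([x[start - k * tau: len(x) - k * tau] for k in range(n)])
    count, pairs = 0, 0
    for i in range(len(V) - 1):
        d = np.linalg.norm(V[i + 1:] - V[i], axis=1)   # distances to all later points
        count += int(np.sum(d < eps))
        pairs += len(d)
    return count / pairs

# Increase n until C(eps) stops changing appreciably (intended for short series only):
# for n in range(1, 10):
#     print(n, correlation_sum(x, n, tau=8, eps=0.2))
```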
A further trap one may often observe from this approach is that the dynamic invariants being estimated may not behave exactly as expected. As the embedding dimension increases, the distribution of points in space changes. For large embedding dimensions, one observes the counterintuitive fact that the majority of a volume is located in a thin shell on the exterior (it has a virtually empty interior). Many correlation dimension estimation algorithms do not correct this and one can observe increasing estimates of correlation dimension even after the minimal suitable choice of embedding dimension has been exceeded.
1.2.4 Embed and model, and then embed again
The approaches for selecting the embedding dimension described in the previous section aim to achieve an unambiguous reconstruction of the state of the dynamical system. An alternative, less heavily exploited approach is to select the embedding dimension which achieves the best model of the underlying dynamics. In Sec. 1.6.1, we will return to this idea and describe an efficient implementation that utilises information theory to evaluate the quality of the model. In [7] a similar, although somewhat different approach is described. If one supposes that the object of embedding (and selecting the embedding dimension) is to reconstruct the underlying dynamics, one should choose the embedding dimension that corresponds to the best model of the underlying system. Unfortunately, this requires one to choose a model class, select a model from within that class and then to determine which model is best. Ataei [7] and co-workers achieve this by restricting themselves to the class of polynomial models and assessing model quality based on model prediction errors. However, it is known that the class of polynomial models is often a poor choice for nonlinear systems [17; 127]. Of course, selecting wider model classes involves proportionately more work. We will describe a solution to this problem in the coming sections, and also a more robust way of assessing the quality of a model, using an information theoretic criterion.
1.3 Embedding lag
In the previous sections, we described several distinct methods for choosing the embedding dimension, and we assumed that the selection of embedding lag was straightforward (or at the very least, was not a problem). In theory, any value of $\tau$ is acceptable, but the shape of the embedded time series will depend critically on the choice of $\tau$, and it is wise to select a value of $\tau$ which separates the data as much as possible. Typically, one is concerned with the evolution of the dynamics in phase space. By ensuring that the data are maximally spread in phase space, the vector field will be maximally smooth. Spreading the data out minimises possibly sharp changes in direction amongst the data. From a topological view-point, spreading the data maximally makes fine features of phase space (and the underlying attractor) more easily discernible. In his thorough book, Abarbanel [1] suggests the first minimum of the mutual information criterion [96; 103], the first zero of the autocorrelation function [97], or one of several other criteria to choose $\tau$. Our experience and numerical experiments suggest that selecting a lag approximately equal to one quarter of the approximate period of the time series produces comparable results to the autocorrelation function but is more expedient. Note that the first zero of the autocorrelation function will be approximately the same as one quarter of the approximate period if the data are very strongly periodic.⁹ In the following subsections, we review each of these popular methods.

⁹ Moreover, in some specific cases, such as the Lorenz system, the autocorrelation is strictly positive, but the approximate period is still well defined, and easy to determine.
1.3.1 Autocorrelation
Define the sample autocorrelation of a scalar time series $y_t$ of $N$ measurements to be

$$\rho(T) = \frac{\sum_{n=1}^{N-T} (y_{n+T} - \bar{y})(y_n - \bar{y})}{\sum_{n=1}^{N} (y_n - \bar{y})^2},$$

where $\bar{y} = \frac{1}{N}\sum_{n=1}^{N} y_n$ is the sample mean. The smallest positive value of $T$ for which $\rho(T) < 0$ is often used as the embedding lag. Data which exhibit a strong periodic component suggest a value for which the successive coordinates of the embedded data will be virtually uncorrelated whilst still being close (temporally). We stress that a choice of $\tau = T$ such that the
sample autocorrelation is zero is purely prescriptive. Some simple systems produce poor results for this choice, and, as an alternative, some authors recommend choosing the lag such that the autocorrelation has first dropped below $1/e$ (the so-called decorrelation time). Sample autocorrelation is only an estimate of the autocorrelation of the underlying process, but is sufficient for estimating the time lag.
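As a sketch of this prescription, the following (illustrative Python, not the book's companion MATLAB code) returns the smallest lag at which the sample autocorrelation first becomes negative; the maximum search lag is an assumed safeguard.

```python
import numpy as np

def first_zero_autocorrelation(y, max_lag=500):
    """Smallest positive lag T with sample autocorrelation rho(T) < 0."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    var = np.sum(y * y)
    for T in range(1, min(max_lag, len(y) - 1)):
        rho = np.sum(y[T:] * y[:-T]) / var
        if rho < 0:
            return T
    return None   # no zero crossing found within max_lag

# A common alternative, as noted above, is simply one quarter of the pseudo-period,
# e.g. tau = round(period / 4) once the approximate period has been identified.
```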
1.3.2 Mutual information
A competing criterion relies on the information theoretic concept of mutual information, the mutual information criterion (MIC). In the context of a scalar time series, the information $I(T)$ can be defined by

$$I(T) = \sum_{n=1}^{N} P(y_n, y_{n+T}) \log_2 \frac{P(y_n, y_{n+T})}{P(y_n) P(y_{n+T})},$$
where $P(y_n, y_{n+T})$ is the probability of observing $y_n$ and $y_{n+T}$, and $P(y_n)$ is the probability of observing $y_n$. $I(T)$ is the amount of information we have about $y_n$ by observing $y_{n+T}$, and so one sets $\tau$ to be the first local minimum of $I(T)$. The primary difficulty with estimating mutual information is that one must first estimate a probability distribution on the system's states (and, moreover, do this on vector states). Inappropriate selection of the histogram binning can easily lead to poor results: the issue of density estimation is a well-established field in its own right [118]. However, provided one can obtain a good estimate of the underlying density, the mutual information will usually provide a good guide to appropriate choice of embedding lag [60].
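A histogram-based estimate of $I(T)$ can be coded in a few lines. The sketch below is illustrative only: the number of bins is an assumed choice and, as noted above, a poor choice of binning can easily spoil the estimate.

```python
import numpy as np

def mutual_information(y, T, bins=16):
    """Histogram estimate of I(T), in bits, between y_n and y_{n+T}."""
    a, b = y[:-T], y[T:]
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)        # marginal of y_n
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y_{n+T}
    nz = pxy > 0                               # avoid log of zero
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# The embedding lag is then the first local minimum of I(T):
# I = [mutual_information(x, T) for T in range(1, 100)]       # I[k-1] is the MI at lag k
# tau = next(k for k in range(1, len(I)) if I[k] > I[k - 1])  # lag of the first minimum
```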
Approximate period
The rationale of these previous two methods is to choose the lag so that the coordinate components of vt are reasonably uncorrelated while still being "close" to one another. When the data exhibits strong periodicity, a value of T that is one quarter of the length of the average breath generally gives a good embedding. This lag is approximately the same as the time of the first zero of the autocorrelation function. Coordinates produced by this method are within a few cycles of each other (even in relatively high dimensional embeddings) whilst being spread out as much as possible over a single period. Moreover, for embedding in three or four dimensions
12
Applied Nonlinear Time Series Analysis
(as is commonly used in certain widely studied systems10 the data are spread out over one half to three quarters of a period. This means that the coordinates of a single point in the three or four dimensional vector time series vt represents most of the information for an entire cycle. This choice of lag is extremely easy to calculate and for the data sets that we consider it also seems to give much more reliable results than the mutual information criterion.
1.3.4
Generalised embedding lags
A generalisation of the ideas presented in the previous sections has been described by Luo [79]. In [79] the authors describe the redundance and irrelevance tradeoff exponent (RITE) which is a general technique for choosing embedding lag. In general, one considers two competing criteria for the selection of embedding lag: (i) the lag must be large enough so that the various co-ordinates contain as much new information as possible, and (ii) the lag must be small enough so that the various co-ordinates are not entirely independent. We can see this principle in each of the previous ideas. With AMI, we choose the first minimum of mutual information, as the first minimum gives the least redundant (i.e. mutual) information as possible without being too large. Similarly, with the first zero of autocorrelation, the co-ordinates are linearly uncorrelated, and yet not too far apart. Let us generalise this idea as follows. Let R(T) measure the redundance between a time series {xn} and the time series {x n + T }, and let I(T) measure the irrelevance. Then, given some weight k, we define the RITE as kR(r) + (1 - k)I(r).
For example, we may consider (as Luo does [79]), the case where redundance and irrelevance are measured by second order autocorrelation p(r) and the weights assigned to the two measures are < x\ > and < xn > 2 . In other 10 From examining the literature, it may seem that most chaotic systems which are actually studied, can be embedded in no more than three or four dimensions. We speculate that, while three or four dimensions is necessary to observe chaotic flows [75], higher dimensional systems are often simply too difficult to study in practice.
Time series embedding and reconstruction
13
Fig. 1.2 Embedding in two dimensions. The vector point z n can be decomposed into its polar form consisting of the radius rn and the angle 0n. The distance of the point zn from the diagonal line is proportional to \xn+T — xn\ and the projection of zn onto that line is proportional to |xn4-T + xn\.
words,
= P(T) I(T) = 1 - p(r)
R(T)
k
,
^
1 - fc =
<xl>
< x2n > + < xn >2 <xn> ? 5
^
< xl > + < xn >2 kR(r) + (1 - fc)J(r) = P(-)<*l>+(l-p(T))<xn>i_ < Xl > + < Xn >2
In fact, in this special case, the RITE reduces exactly to the standard second order autocorrelation. A more interesting case is that, by considering a two dimensional embedding, we may examine the distance from the diagonal
14
Applied Nonlinear Time Series Analysis
dn oc \xn+T— xn|, the projection onto that identity line pn oc |a;n+1-+a;n|, and the angle subtended from that line 6n oc t a n - 1 ^n+T+*n . Fig. 1.2 depicts the general situation. Now, each of these measures can be employed in Eq. (1.4) in the place of xn, and one now obtains a nontrivial, unique, nonlinear measure of redundance and irrelevance. In computational studies [79], it can be seen that these measures perform as well as the competing criteria, and often provide more robust estimates of appropriate embedding lags. 1.4
Which comes first?
Clearly estimating embedding dimension requires one to first estimate the embedding lag. But the value of embedding lag selected is, at least implicitly, dependent on the embedding dimension. Although most of the techniques for selecting the embedding lag appear to be independent of embedding dimension, this is not the case. By selecting one-quarter of the pseudo-period as the embedding lag (or, almost equivalently, the first minimum of mutual information or the first zero of autocorrelation), one implicitly supposes that the embedding dimension is approximately 4. The heuristic motivation for selecting embedding lags according to these criteria (for data with a strong periodic component) is to have each single embedded point representative of the dynamics over an entire period. Although it is common practice to estimate the embedding lag first, then choose embedding dimension, we prefer an alternative approach. Provided the data is "clean enough", starting with an embedding lag of 1 and estimating the embedding dimension will give a good guide to the number of degrees of freedom. From this, one can choose an embedding lag. For most low-dimensional (i.e. 3 — 5 degrees of freedom) systems, an embedding lag estimated with one of the methods described in the previous section will suffice. If the embedding dimension does not fall neatly into the range 35, an alternative value of embedding lag may be better. Having selected embedding lag, one can go back and check that the embedding dimension, using this chosen value of embedding lag, is appropriate. Alternatively, one may start by estimating the embedding window (Sec. 1.7), then determine the most appropriate irregular embedding (Sec. 1.6). Before describing this alternative, we believe, superior approach, we present some illustrations of selection of embedding lag and embedding dimension for various sets of data.
Time series embedding and reconstruction
15
Fig. 1.3 Typical time series for the Rossler dynamical system, x, y and z components (from top to bottom) of the three dimensional Rossler system (1.4) integrated (and sampled) with a time step of 0.2.
1.5
An embedding zoo
Let us pause now, and present several archetypal dynamical systems and some of our favourite data sets. Two examples of continuous dynamical systems that are stationary, and
16
Applied Nonlinear Time Series Analysis
Fig. 1.4 Rossler attractor and reconstruction of the attractor. The left panel is a single 5000 point trajectory plotted in x, y and z co-ordinates, the right hand panel is a delay embedding in three dimension (embedding lag of 8) of the x co-ordinate. One can see from both original and embedded co-ordinates that chaos in this system is generated by a gradual stretching apart of trajectories over most of the attractors, combined with rapid folding and compressing at one point.
in all other respects satisfy the above conditions. The Rossler equations [107] are given by x =
-y-z,
y = x + ay,
(1.4)
z = b + z(x — c) where, for a = 0.398, 6 = 2 and c = 4 the system exhibits "single-band" chaos [152]. Figure 1.3 depicts a discrete sampling with a sample step size11 of 0.2 of a typical time series of the x, y and z co-ordinates. In Fig. 1.4, we have produced the attractor of this system (left panel) by plotting the x, y, and z components against one another to illustrate a single trajectory. The right hand panel shows a reconstruction of this system from the x component alone. The Lorenz system [77] is defined by x = s(y
-x),
y = rx-y-xz,
(1.5)
z = xy — bz where (s — 10, r = 28, b = 8/3) yields chaotic dynamics [152]. In Figs. 1.5 and 1.6 we depict the dynamics of a discrete sampling of the Lorenz system 11
0.2.
That is, the equations are numerically integrated with an integration time step of
Time series embedding and reconstruction
17
Fig. 1.5 Typical time series for the Lorenz dynamical system, x, y and z components (from top to bottom) of the three dimensional Lorenz system (1.5) integrated (and sampled) with a time step of 0.05.
and typical reconstructions. Throughout the remainder of this book we will frequently refer to these two systems. Furthermore, let us introduce two discrete dynamical systems (maps) that exhibit chaos and are also widely studied. Perhaps the simplest of all
18
Applied Nonlinear Time Series Analysis
Fig. 1.6 Lorenz attractor and reconstruction of the attractor. The left panel is a single 5000 point trajectory plotted in x, y and z co-ordinates, the right hand panel is a delay embedding in three dimensions (embedding lag of 3) of the x co-ordinate. Notice that the original attractor (and to a lesser extent, when viewed from this angle, the reconstructed one) exhibits two flat (two dimensional wings) and a more complex central region. The dynamics on the wings is relatively simple, only at the central separatrix do the complex crossings and splittings that generate chaos in this system occur.
is the Logistic map [6l] given by xn+i = (ixn(l - xn)
(1.6)
where, for \i > 4 the system is chaotic. In Fig. 1.7, we see a short trajectory of the Logistic map together with its attractor in x n -vs.-x n+1 co-ordinates (note that because this system is one dimensional, delay reconstruction is both unnecessary and trivial). However, the representation of successive values of the logistic map is widely used to provide a complete description of the underlying dynamics (see for example [152]). A less trivial map (and one with a more attractive attractor) is the Ikeda map defined by xn+i = 1 + 0.7(:rn cos 9n yn+i=O.7(xnsin0n+yncos6n), fln = 0.4 -
-ynsm0n), (1.7)
1 + x2n + y l
where, for the parameter values given, the dynamics are chaotic. In Fig. 1.8, we show a short segment of the time series together with its attractor and delay reconstruction (from the xn component). Notice that the delay reconstruction here is non-trivial, and, even for a "good"12 choice of em12
And, obvious.
Time series embedding and reconstruction
19
Fig. 1.7 Typical time series and attractor for the Logistic map system. The top two panels show the time evolution of the logistic map system (1.6) on different time scales. The bottom panel is the first return map (aka the time delay embedding with lag 1 in two dimensions).
bedding parameters, there is significant convolution in the reconstructed attractor. Several experimental data sets which we constantly find both useful and interesting are illustrated in Fig. 1.9. In Fig. 1.10, we show the effect of embedding the first data set with various embedding lags in two dimensions. In Fig. 1.11, we can see that no low dimensional embedding provides a satisfactory reconstruction of the ECG data. Conversely, in Fig. 1.12, we show the laser data successfully reconstructed in three dimensions. 1.6
Irregular embeddings
The methods to estimate de and r described in the previous sections assume that a single embedding lag is sufficient and having chosen de and r, the embedding defined by Xn
> {Xn,Xn-T,Xn-2T,
• • • ,Xn-(dt-l)T)
(1-8)
20
Applied Nonlinear Time Series Analysis
Fig. 1.8 Typical time series and attractor for the Ikeda map system. The top two panels show the time evolution of the Ikeda map system (1.7), the lower two panels are the plot of xn against yn (left panel) and a delay one embedding of the first co-ordinate (i.e. xn against xn—\, right panel).
is adequate. For estimating dynamic invariants, this may well be the case,13 however, if one is concerned with recreating the underlying dynamics and estimating the evolution operator (i.e. nonlinear modelling), there is no reason to suppose that this would be the case. In the first part of this book, we are concerned primarily with the estimation of dynamic invariants, and therefore irregular embeddings are not necessary. However, later we will be highly interested in estimating the underlying evolution operator (i.e. modelling); in this context an irregular embedding can be invaluable. In [55], Judd and Mees describe the embedding (1.8) as a uniform embedding and propose a nonuniform embedding according to xn —> (xn-£1,xn-e2,xn-e3,-
• • ,xn-edj
(1-9)
where the parameters 0 < l\ < £i < £l+\ < £dc are the embedding lags. Collectively, the problem is now finding the correct set of lags (£1, £2,..., £de) • 13
I.e. to the best of my knowledge there is no evidence to the contrary.
Time series embedding and reconstruction
21
Fig. 1.9 Experimental time series data.The two experimental time series examined in this application are depicted (from top to bottom): annual sunspot numbers for the period 1700 to 2000, a recording of ECG (electrocardiogram) activity during Ventricular Fibrillation, and the chaotic "Santa-Fe" laser times series. For the bottom panel only the first 2000 points are utilised for time series modelling.
However, as Judd and Mees point out [55], even this may not be sufficient. Often, one encounters dynamical systems where the important variables are different in different parts of phase space. Another way of describing this is to say that the embedding is not constant. For the Lorenz attractor (see Fig. 1.6), the "wings" are two dimensional, and a two dimensional embedding is sufficient, however, at the central separatrix three dimensions are required. Such non-constant and potentially non-uniform embeddings are called variable embeddings. 1.6.1
Finding irregular embeddings
The full problem of finding the correct (i.e. "best", or even just "good") embedding for a particular time series will depend entirely on the model and the model selection scheme one chooses to employ. We will consider these problems in more detail much later in this book. For now we would just like to demonstrate how one may practically obtain variable embeddings (1.8) that behave quantitatively better than uniform embeddings. The problem of estimating the optimal embedding can be stated as: find the parameters {£i\i = 1,... k} and the embedding window dw (to be
22
Applied Nonlinear Time Series Analysis
Fig. 1.10 Embedding of the sunspot data. Six embeddings in two dimensions, with embedding lag from 1 to 6 axe shown for the sunspot time series depicted in Fig. 1.9. The upper panels show that a small embedding lag (i.e. 1) allows a delay reconstruction with a central hole (for pseudo-periodic dynamics such as this, it is reasonable to presume that the central region will exhibit an unstable focus). However, larger embedding lags (i.e. 2 and 3) show a better reconstruction of the amplitude of oscillations. These two main features (oscillations about a central point, and amplitude modulation) are evident, but in different embeddings. We will revisit this problem later when we see that for a third data set a non-uniform embedding actually performs much better.
Time series embedding and reconstruction
23
Fig. 1.11 Embedding of the ECG VF data. Six embeddings in two dimensions, with embedding lag from 1 to 6 are shown for the ECG data depicted in Fig. 1.9. The embeddings we achieve of this data are clearly very poor. The main feature of these embeddings is a complex (high dimensional or non-stationary) structure which is not adequately unravelled in three dimensions. Worse, for small embedding lags the data are tightly distributed along the identity line (too highly correlated).
24
Applied Nonlinear Time Series Analysis
Fig. 1.12 Embedding of the laser data. An embedding in 3 dimensions with an embedding lag of 2 (the first zero of autocorrelation) for the laser data depicted in Fig. 1.9. The smooth geometric structure of this system is evident from this simple lowdimensional embedding. Moreover, this indicates that the dynamics of this system can be described and modelled within only three dimensions.
discussed in more detail in Sec. 1.7), where 1 < i\ < £i < £i+i < l^ < dw and the time delay embedding xt -> (xt-e^xt-e^xt-es,
• • -xt-e,,)
(1.10)
is somehow the "best". Unfortunately, application of Eq. (1.10) makes the problem of selecting embedding parameters considerably more complicated. Here, we describe one suitable criterion for quantitatively comparing embedding strategies and k. and an efficient scheme for the computation of {li\i = l,...,k} The, so-called optimal embedding strategy achieves results superior to the standard techniques (1.1) and (1.8). Employing this optimal embedding strategy may allow one to reconstruct the complex nonlinear dynamics of the underlying system more accurately than would otherwise be possible. Often, the purpose of time delay embedding is to estimate correlation dimension [167] or other dynamic invariants [2] (to be described in Chapter
Time series embedding and reconstruction
25
2). In such situations, embeddings such as (1.8) are usually adequate. But, what if one is interested in the more complex problem of estimating the underlying evolution operator of the dynamical system. Hence, we are interested in obtaining the most accurate prediction of the observed data values. By doing so, we hope to capture the long term dynamics of the underlying system. To achieve this we adopt the information theoretic measure description length [103] and seek to choose the embedding which provides the minimum description length. We will revisit the topic of description length later in this book. Roughly speaking, the description length of a time series is the compression of the finite precision data afforded by the model of that data [53]. If a model is poor, it will be more economical to simply describe the model prediction errors. Conversely, if a model fits the data well, the description of that model and the (presumably small) model prediction errors will be more compact. However, if a model over-fits the data [130], the description of the model itself will be too large. In [132] (see Sec. 1.7) we showed that the description length DL(-) of a time series {xt} is approximated by
$$DL(\{x_t\}) \approx \frac{N}{2}(1 + \ln 2\pi) + \frac{d}{2}\ln\!\left(\frac{1}{d}\sum_{i=1}^{d}(x_i - \bar{x})^2\right) + \frac{N-d}{2}\ln\!\left(\frac{1}{N-d}\sum_{i=d+1}^{N}e_i^2\right) + d + DL(d) + DL(\bar{x}) + DL(\mathcal{P}) \qquad (1.11)$$
where $d = \max_i\{\ell_i\} = \ell_k$, $\bar{x}$ is the mean of the data, $e_i$ is the model prediction error, and $DL(\mathcal{P})$ is the description length of the model parameters. The description length of an integer $d$ can be shown to be $DL(d) = \lceil \log d \rceil + \lceil \log \lceil \log d \rceil \rceil + \cdots$, where each term on the right is an integer and the last term in the series is 0 [103]. Furthermore, $\frac{N}{2}(1 + \ln 2\pi) + DL(\bar{x})$ is independent of the embedding strategy. Hence, the optimal embedding strategy is that which minimises
$$\frac{d}{2}\ln\!\left(\frac{1}{d}\sum_{i=1}^{d}(x_i - \bar{x})^2\right) + d + DL(d) + \frac{N-d}{2}\ln\!\left(\frac{1}{N-d}\sum_{i=d+1}^{N}e_i^2\right) + DL(\mathcal{P}). \qquad (1.12)$$
The first three terms in (1.12) may be computed directly. However, the last
Table 1.1 Embedding parameters for the various data sets described so far: data length (N), embedding dimension computed with the method of false nearest neighbours (de), embedding lag estimated by the first zero of autocorrelation (τ), optimal embedding window computed with the method described in Sec. 1.7 (dw), and the optimal set of embedding lags (such that (x_{t-ℓ1}, ..., x_{t-ℓk}) is used to predict x_t). With the exception of the Lorenz system, τ is approximately one-quarter of the pseudo-period of the time series.

data            N     de   τ    dw   d    ℓ1, ..., ℓk
Sunspots        301   6    3    7    10   1, 2, 5
Fibrillation    6000  6    7    2    10   1, 5, 6
Laser           4000  8    2    32   30   1, 2, 6, 7, 11, 14, 15, 21, 24, 25, 30
Rossler         1000  3    3    6    10   1, 5, 7
Rossler+noise   1000  5    3    9    10   1, 2, 4, 5, 6, 7, 9
Lorenz          1000  5    43   4    10   1, 3
Lorenz+noise    1000  5    42   8    10   1, 2, 3, 5, 6, 7, 9
two terms require one to estimate the optimal model. As in [132], for the purposes of computational expediency, we restrict ourselves to the class of local constant models. In the current context this is not unreasonable as we hope to obtain an embedding which spreads the data in phase space based on the deterministic dynamic evolution. Under this assumption, $DL(\mathcal{P}) = 0$ and the model prediction error
$$\frac{1}{N-d}\sum_{i=d+1}^{N} e_i^2$$
may be computed via "drop-one-out" interpolation. That is, $e_{i+1} = y_{i+1} - y_{j+1}$ where $j \in \{1, 2, \ldots, N\}\setminus\{i\}$ is such that $\|y_i - y_j\|$ is minimal. Note that, in the limit as $N \to \infty$ (i.e. $N \gg d$), optimising (1.12) is equivalent to finding the embedding which provides the best prediction (the last two terms of (1.12) dominate). To minimise Eq. (1.12) we assume that the maximum number of inputs, $d$, has already been calculated. We choose $d = d_w$, the embedding window computed using the method described in [132]. Alternatively, one may either assign an arbitrary value for $d$ or use $d = d_e\tau$, where both $d_e$ and $\tau$ are estimated by one of the many standard techniques. An exhaustive search over the $2^d$ possible embedding strategies is only feasible for small $d$. For large $d$ (i.e. $d > 10$) we utilise a genetic algorithm [31] to determine the optimum embedding strategies. Furthermore, to reduce the computational effort in estimating the model prediction error for large $N$ ($N > 1000$) we minimise the prediction error only on a randomly selected subset of the data. Our calculations show that neither of these
approximations adversely affects our results. The results of the genetic algorithm are robust and accurate, and the final solution is independent of the data subset selected (provided that the subset is selected with replacement and that it is moderately large). In Table 1.1, we demonstrate the application of this algorithm with data from three experimental systems (the famous annual sunspot time series, a chaotic laser [158], and a recording of human electrocardiogram (ECG) during ventricular fibrillation [132; 134]; these data are depicted in Fig. 1.9) and two computational simulations (the Rossler and Lorenz equations, see Figs. 1.3 and 1.5), both with and without the addition of Gaussian observational noise with a standard deviation of 5% that of the data. For each data set, we estimated the embedding window $d_w$ [132], the embedding dimension $d_e$ (via false nearest neighbours) and the embedding lag $\tau$ (using the first zero of autocorrelation). For each of these systems, we estimated the optimal embedding strategy using a genetic algorithm and (where necessary) the sub-sample selection scheme 30 times. All the data sets except the two longest (the ECG recording and the laser system) produced identical results on repeated execution. For the two longest data sets, the most often observed embedding strategy was also the best (indicating that the sub-sample selection scheme is expedient but perhaps not always accurate). Table 1.1 also illustrates that, in most cases, the optimal embedding covered a smaller range of embedding lags than the standard method (i.e. $\ell_k < d_e\tau$) and is often of lower dimension ($k < d_e$). Perhaps intuitively, noisier time series required larger $k$. Furthermore, we note that in none of the cases was the optimal embedding strategy uniform. Although time delay embedding is a fundamental technique for the reconstruction of nonlinear dynamical systems from time series, we can see here that it is not optimal, and that in general one should apply a non-uniform embedding such as (1.10). However, the problem with adopting this approach is that selection of the optimal embedding strategy becomes computationally intractable for even moderate $d$ and $N$. The solution we propose here is a simple estimate of the nonlinear prediction error, and a combination of genetic algorithm and sub-sample selection. Undoubtedly more sophisticated and efficient methods for solving this NP-hard problem exist. However, for a wide variety of experimental and simulated time series, the method we employ provides alternative embedding strategies which are
Table 1.2 Comparison of correlation dimension estimates for the data and local constant model simulations using either the standard or optimal variable embedding strategy. We computed 30 simulations with either embedding strategy for each data set and report here the median, mean and standard deviation of the correlation dimension estimates (computed with the values de and τ listed earlier). The value of correlation dimension estimated from the time series data is also provided for reference.

                        correlation dimension
                standard                   optimal
data            median   mean              median   mean              "true"
Sunspots        2.095    3.043 ± 4.957     2.245    2.191 ± 0.4137    1.889
Fibrillation    1.467    1.464 ± 0.04860   1.560    1.559 ± 0.0387    1.619
Laser           2.212    2.205 ± 0.116     2.478    2.448 ± 0.190     2.129
Rossler         1.306    1.323 ± 0.1281    1.337    1.338 ± 0.0969    1.588
Rossler+noise   1.961    1.936 ± 0.135     1.886    1.861 ± 0.141     1.819
Lorenz          1.981    1.983 ± 0.0789    1.861    1.860 ± 0.0968    1.966
Lorenz+noise    1.642    1.644 ± 0.0743    1.636    1.628 ± 0.0576    1.612
often smaller ($k < d_e$ and $\ell_k < d_e\tau$) than standard methods, and perform at least as well. We have applied correlation dimension as a quantitative measure of the accuracy of dynamic reconstruction and find that the optimal embedding strategy described here produces models which behave more like the true data. Hence, the application of this embedding methodology will allow more accurate modelling of the underlying dynamics.
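As a concrete illustration of the criterion above, the following sketch evaluates Eq. (1.12) for a candidate embedding, given the series and the in-sample prediction errors of whatever local constant model has been fitted (for example, the "drop-one-out" scheme described above). It assumes base-2 logarithms for the integer code length $DL(d)$ and takes $DL(\mathcal{P}) = 0$, as for local constant models; it is an illustrative sketch rather than the book's companion implementation.

```python
import numpy as np

def integer_code_length(d):
    """DL(d) = ceil(log d) + ceil(log ceil(log d)) + ..., base-2 logs assumed."""
    total, term = 0.0, float(d)
    while term > 1.0:
        term = np.ceil(np.log2(term))
        total += term
    return total

def embedding_description_length(x, errors, d):
    """Evaluate the five terms of Eq. (1.12) for an embedding with maximum lag d.

    x      : the scalar time series, length N
    errors : the N - d in-sample one-step prediction errors of the candidate model
    d      : the embedding window (maximum lag) of the candidate embedding
    """
    x = np.asarray(x, dtype=float)
    e = np.asarray(errors, dtype=float)
    N = len(x)
    xbar = x.mean()
    term_x = 0.5 * d * np.log(np.mean((x[:d] - xbar) ** 2) + 1e-12)   # small constant guards log(0)
    term_e = 0.5 * (N - d) * np.log(np.mean(e ** 2) + 1e-12)
    return term_x + d + integer_code_length(d) + term_e               # + DL(P), taken as 0
```

Candidate lag sets can then be compared simply by feeding their respective prediction errors to this function and keeping the smallest value.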
1.7 Embedding window
As we have seen, the celebrated theorem of Takens [144] guarantees that, for a sufficiently long time series of scalar observations of an n-dimensional dynamical system with a twice differentiable (technically, $C^2$) measurement function, one may recreate the underlying dynamics (up to homeomorphism) with a time delay embedding. Unfortunately the theorem is silent on exactly how to proceed when the data are limited or contaminated by noise. In practice, time delay embedding is routinely employed as a first step in the analysis of experimentally observed nonlinear dynamical systems (see [2; 60]). Typically, one identifies some characteristic embedding lag $\tau$ (usually related to the sampling rate and time scale of the time series under consideration) and utilises $d_e$ lagged versions of the scalar observable for sufficiently large $d_e$. In general, $\tau$ is determined by identifying linear or nonlinear temporal correlations in the data and one will progressively increase $d_e$ until
the results obtained are self consistent. In the preceding section we described one possible (albeit computationally expensive) solution to the problem of finding the best irregular embedding for a particular time series. We now describe an alternative, simpler, approach to the same problem: the problem of reconstructing the underlying dynamics from a finite scalar time series in the presence of noise. We recognise that in general the quality of the reconstruction will depend on the length of the time series and the amount of noise present in the system. Employing the minimum description length model selection criterion, we show that the optimal model of the dynamics does not depend on the choice of the embedding lag, only on the maximum lag ($d_e\tau$ in Eq. (1.8)). We call that maximum embedding lag $d_w := d_e\tau$ the embedding window, and show that for long noise-free time series, the optimal $d_w$ minimises the one-step model prediction error. For short or noisy data, the optimal value of $d_w$ is data dependent. To estimate the one-step model prediction error and $d_w$, we apply a generic local constant modelling scheme to several computational examples. We will return to this extremely useful modelling scheme in several different guises throughout this book. This method of estimating $d_w$ proves to be consistent and, even better, robust. Moreover, the results that we obtain capture the salient features of the underlying dynamics. Finally, we also find that in general there is no single characteristic time lag $\tau$. Generically, the optimal reconstruction may be obtained by considering the lag vector
$$(\tau_1, \tau_2, \ldots, \tau_k) \qquad (1.13)$$
where $0 < \tau_i < \tau_{i+1} \leq d_w$ (this is the so-called "variable embedding" described in [55] and elsewhere). The textbooks [2; 60], and even the preceding sections of this volume, contain copious detail on the estimation of $d_e$ and $\tau$. We briefly revisit some developments relevant to the estimation of the embedding window here. Often, the primary aim of time delay embedding is to estimate dynamic invariants. In these instances, one may estimate $\tau$ with a variety of heuristic techniques: usually autocorrelation, pseudo-period or mutual information. One then computes the dynamic invariant for increasing values of $d_e$ until some sort of plateau onset occurs (see [65] and the references therein). For estimation of correlation dimension, $d_c$, it has been shown that $d_e > d_c$ is sufficient [26]. However, for reconstruction of the underlying dynamics this is not the case. Alternatively, the method of false nearest neighbours [64]
and its various extensions apply a topological reasoning: one increases $d_e$ until the geometry of the time series does not change. We note that several authors have speculated on whether the individual parameters $d_e$ and $\tau$, or only their product $d_e\tau$, is significant. For example, Lai and Lerner [72] provide an overview of selection of embedding parameters to estimate dynamic invariants (in their case, correlation dimension). They impose some fairly generous constraints on the correlation integral and use these to estimate the optimal values of $d_e$ and $\tau$. Their numerical results from long clean data imply that correct selection of $\tau$ is crucial, while selection of $d_w$ (and therefore $d_e$) is not. Conversely, utilising the BDS statistic [11], Kim and co-workers [65] concluded that the crucial parameter for estimating correlation dimension is $d_w$. Unlike these previous methods, the question we consider now is: "What is the optimal choice of embedding parameters to reconstruct the underlying dynamic evolution from a time series?" In answering this question we conclude that only the embedding window $d_w$ is significant; selection of the optimal embedding lags is, essentially, a modelling problem [55]. Clearly, successful reconstruction of the underlying dynamics will depend on one's ability to identify any underlying periodicity (and therefore $\tau$). These results show that it is possible to estimate the optimal value of $d_w$, and subsequently use this optimal value to derive a suitable embedding lag $\tau$. However, as previous authors have observed in many examples, estimation of $\tau$ for nonlinear systems is model dependent (and may even be state dependent) [55]. In the following sub-section we introduce the rationale for the calculations that follow. Section 1.7.2 demonstrates the application of this method to several test systems, and Sec. 1.8 studies the problem of modelling several experimental time series.
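Since the first zero of the autocorrelation function is used repeatedly below as the standard heuristic for $\tau$, a minimal sketch of that estimate may be useful; the quarter-period behaviour mentioned in Table 1.1 can be checked on a toy sinusoid (the example signal is an assumption, not data from the book).

```python
import numpy as np

def first_zero_autocorrelation(x, max_lag=None):
    """Smallest lag at which the sample autocorrelation first crosses zero."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    if max_lag is None:
        max_lag = len(x) // 2
    denom = np.dot(x, x)
    for lag in range(1, max_lag + 1):
        if np.dot(x[:-lag], x[lag:]) / denom <= 0.0:
            return lag
    return max_lag   # no zero crossing found within max_lag

t = np.arange(2000)
print(first_zero_autocorrelation(np.sin(2 * np.pi * t / 32.0)))   # close to 8 (a quarter period)
```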
1.7.1 A modelling paradigm
As before, let $\phi: M \to M$ be the evolution operator of a dynamical system, and $h: M \to \mathbb{R}$ a $C^2$ differentiable observation function. Through some experiment we obtain the time series $\{h(X_1), h(X_2), \ldots, h(X_N)\}$. Denote $x_i = h(X_i)$. Takens' theorem [144] states that for some $m > 0$ the mapping $g$
$$x_i \stackrel{g}{\mapsto} (x_i, x_{i-1}, x_{i-2}, \ldots, x_{i-m+1}) \qquad (1.14)$$
is such that the evolution of $g(x_i) = (x_i, x_{i-1}, \ldots, x_{i-m+1})$ is homeomorphic to $\phi$. (In writing $g(x_i) = (x_i, x_{i-1}, \ldots, x_{i-m+1})$, i.e. in using equality for assignment, we take a slight liberty with the notation, but the meaning remains clear.) We will generalise the embedding map (1.14) and consider $g$ as
$$x_i \stackrel{g}{\mapsto} (a_1 x_i, a_2 x_{i-1}, a_3 x_{i-2}, \ldots, a_d x_{i-d+1}). \qquad (1.15)$$
The objective of a successful embedding is to find $a = [a_1, a_2, \ldots, a_d]$, with each $a_i$ either 0 or 1, and $d$ such that the reconstructed states $z_i = g(x_i)$ admit a good model of the underlying dynamics. We consider models of the form
$$f(z) = \sum_{j=1}^{m} \lambda_j \theta(z - w_j) \qquad (1.16)$$
where $\theta$ is some basis, and $\lambda_j \in \mathbb{R}$ and $w_j \in \mathbb{R}^k$ are linear and nonlinear model parameters. The selection of this particular model architecture is arbitrary, but does not alter the results. We assume that there exists some algorithm to select $\mathcal{P} = (m, \lambda_1, \lambda_2, \ldots, \lambda_m, w_1, w_2, \ldots, w_m)$ such that $e_i = f(z_{i-1}) - z_i \sim N(0, \sigma)$ (or, at the very least, such that $\sum_i (f(z_{i-1}) - z_i)^2 = \sum_i e_i^2$ is minimised). We do not consider the model selection problem here; rather, we seek to find out what is the best choice of $(a, d)$. Our own model selection work is summarised in [130]. The most obvious approach to this problem is to look for the maximum likelihood solution:
$$\max_{(a,d)} \max_{\mathcal{P}} P(x \mid x_0, a, d, \mathcal{P})$$
where $x$ is the vector of all the time series observations and $x_0 \in \mathbb{R}^d$ is a vector of model initial conditions. Unfortunately this leads to the
redundant solution $d = N$. To solve this problem one could either resort to Bayesian regularisation [80] or the minimum description length model selection criterion [103]. We choose the latter approach. The description length of a time series is the length of the shortest (most compact) description of that time series. The description length of a time series with respect to a given model is the length of the description of that model, the initial conditions of that model and the model prediction error. We intend to optimise the description length of the observed time series $\{x_i\}_{i=1}^{N} = x$ with respect to $(a, d)$. At this point we make the fairly cavalier assumption that for a given $(a, d)$ one can obtain the optimal model $\mathcal{P}$. We will address this assumption in more detail later in this section. The description length of the data $DL(x)$ is given by
$$DL(x) = DL(x \mid x_0, a, d, \mathcal{P}) + DL(x_0) + DL(a, d) + DL(\mathcal{P}) \qquad (1.17)$$
where $x_0 = (x_1, x_2, \ldots, x_d)$ are the model initial conditions. Notice that the description length of the model prediction errors, $DL(x \mid x_0, a, d, \mathcal{P})$, is equal to the negative log likelihood of the errors under the assumed distribution. Similarly, $x_0$ is a sequence of $d$ real numbers which, for small $d$, we approximate by $d$ realisations of a random variable. Therefore $DL(x_0)$ can also be computed as a negative log-likelihood of some probability distribution. If we assume that $x$ and $x_0$ are approximated by Gaussian random variables with variance $\sigma^2$ and $\sigma_x^2$ respectively, (1.17) becomes
$$DL(x) \approx -\ln P(x \mid N(0, \sigma^2)) - \ln P(x_0 \mid N(0, \sigma_x^2)) + d + DL(d) + DL(\mathcal{P}). \qquad (1.18)$$
Since $a$ is a sequence of $d$ independent zeros or ones, $DL(a) = d$; furthermore, the description length of an integer $d$ is given by $DL(d) = \lceil \log(d) \rceil + \lceil \log\lceil \log(d) \rceil \rceil + \cdots$ as before. Expanding the negative log-likelihoods, one
finally obtains
$$DL(x) \approx \frac{d}{2}(1 + \ln 2\pi\sigma_x^2) + \frac{N-d}{2}(1 + \ln 2\pi\sigma^2) + DL(\bar{x}) + d + DL(d) + DL(\mathcal{P}) \qquad (1.19)$$
$$= \frac{d}{2}\ln\!\left(\frac{1}{d}\sum_{i=1}^{d}(x_i - \bar{x})^2\right) + \frac{N-d}{2}\ln\!\left(\frac{1}{N-d}\sum_{i=d+1}^{N}e_i^2\right) + \frac{N}{2}(1 + \ln 2\pi) + d + DL(d) + DL(\bar{x}) + DL(\mathcal{P}). \qquad (1.20)$$
In this form, Eq. (1.20) provides the first suggestion of what the optimal embedding strategy should be. We see that $a$ does not feature in this calculation. Hence, if we adopt the modelling paradigm suggested here, the embedding lag (or more generally the embedding strategy) is not crucial: one should only be concerned with the maximum embedding dimension $d$. Of course, this does not mean that, to reconstruct the dynamics, the embedding lag is unimportant. When one applies numerical modelling to reconstruct the dynamics, embedding strategies are of very great significance; however, selecting the optimal embedding co-ordinates (or rather those that are most significant in predicting the dynamics) is inherently part of the modelling process [55]. Furthermore, the modelling algorithm should be allowed to choose from all possible embedding lags within the embedding window. Indeed, one often finds that the "optimal" embedding strategy is not fixed within a single model [55]. This result shows that it is preferable to identify the embedding window $d_w$ and let the model building process determine which of the $d_w$ co-ordinates are most useful. The description length of the mean of the data $DL(\bar{x})$ is a fixed constant and we drop it from the calculation. Optimising (1.20) over all $(a, d)$ requires selection of the optimal model for a given $(a, d)$ and computation of the model prediction error of that model. For a given model, $DL(\mathcal{P})$ can be calculated precisely [53]. However, selection of the optimal model is a more difficult problem. Instead, we restrict our attention to a particular class of model, and choose the optimal model from that class. To simplify the computation of (1.20) we restrict our attention to the class of local constant models on the attractor. We have two good reasons for choosing this particular class. Firstly, because the models are simple, estimates of the error as a function of $(a, d)$ are relatively well behaved. Secondly, these models rely
on no additional parameters and therefore $DL(\mathcal{P}) = 0$, simplifying our calculation considerably (alternatively, one could argue that the data are the parameters; in either case the description length of the model is constant). In trials [132], we tested many alternative model classes. We found radial basis functions [53] and neural networks [130] to be excessively nonlinear and difficult to optimise for the purpose of determining embedding windows. Complex local modelling regimes, such as triangulation and tessellation [84] or parameter dependent local linear schemes [46], we found to be overly sensitive to small changes in the data. In comparison, the local constant scheme we employ here appears remarkably robust. As local constant models have no explicit parameters (other than the embedding strategy $(a, d)$), $DL(\mathcal{P}) = 0$. Therefore, for a given $(a, d)$, computation of (1.20) only requires estimation of $\sum_i e_i^2$. We employ an in-sample local constant prediction strategy. Let $z_s$ be the nearest neighbour to $z_t$ (excluding $z_t$), then
$$\hat{x}_{t+1} = x_{s+1} \qquad (1.21)$$
and therefore $e_{t+1} = x_{t+1} - x_{s+1}$. In other words, for each point in the time series we determine the prediction error based on the difference between the successor to that point and the successor to its nearest neighbour (a technique sometimes referred to as "drop-one-out" interpolation). Since this is a form of interpolation rather than extrapolation, this strategy does not provide a predictive model; likewise (as with all local techniques) it does not describe the underlying dynamics. However, the strength of this particular approach is that it is simple and it provides a realistic estimate of the size of the optimal model's prediction error as a function of $(a, d)$. The proposed algorithm may be summarised as follows. We seek to minimise (1.20) over $d$. To achieve this we need to estimate the model prediction error as a function of $d$. Hence, for increasing values of $d$ we employ the local constant "modelling" scheme suggested by (1.21) to compute the model prediction error and substitute this into (1.20). The optimal embedding window $d_w$ is the value of $d$ that minimises (1.20).
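A minimal sketch of the summarised algorithm is given below: for each candidate window $d$ it forms the full delay vectors, computes the "drop-one-out" local constant prediction errors of Eq. (1.21), and evaluates the description length, returning the window with the smallest value. For brevity it uses every lag up to $d$ rather than retaining only those lags that improve the prediction error (as the text does), and it is an illustrative sketch rather than the companion code; the toy series at the end is an assumed stand-in for the noisy Rossler data.

```python
import numpy as np

def integer_code_length(d):
    """DL(d), base-2 logs assumed (as in the earlier sketch)."""
    total, term = 0.0, float(d)
    while term > 1.0:
        term = np.ceil(np.log2(term))
        total += term
    return total

def drop_one_out_errors(x, d):
    """Local constant errors of Eq. (1.21) for z_t = (x_t, x_{t-1}, ..., x_{t-d+1}):
    predict x_{t+1} by the successor of the nearest neighbour of z_t."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(d - 1, len(x) - 1)                  # times with full history and a successor
    Z = np.column_stack([x[idx - j] for j in range(d)])
    errors = np.empty(len(idx))
    for i, t in enumerate(idx):
        dist = np.linalg.norm(Z - Z[i], axis=1)
        dist[i] = np.inf                                # exclude z_t itself
        s = idx[np.argmin(dist)]
        errors[i] = x[t + 1] - x[s + 1]
    return errors

def optimal_embedding_window(x, d_max=20):
    """Return the window d minimising the description length criterion (DL(P) = 0)."""
    x = np.asarray(x, dtype=float)
    N, xbar = len(x), np.mean(x)
    best_d, best_dl = None, np.inf
    for d in range(1, d_max + 1):
        e = drop_one_out_errors(x, d)
        dl = (0.5 * d * np.log(np.mean((x[:d] - xbar) ** 2) + 1e-12)
              + 0.5 * (N - d) * np.log(np.mean(e ** 2) + 1e-12)
              + d + integer_code_length(d))
        if dl < best_dl:
            best_d, best_dl = d, dl
    return best_d

rng = np.random.default_rng(0)
x = np.sin(0.2 * np.arange(1500)) + 0.05 * rng.standard_normal(1500)
print(optimal_embedding_window(x, d_max=12))
```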
1.7.2 Examples
We now demonstrate the application of the above method to several numerical time series. First, we examine the performance of the algorithm and the importance of the choice of modelling algorithm (1.21).
Fig. 1.13 Typical noisy time series for the Rossler dynamical system and the reconstructed attractor. Noisy x component (top panel) and a three dimensional reconstruction of the Rossler system (1.4). The system is integrated (and sampled) with a time step of 0.2, and subjected to additive Gaussian noise with a standard deviation of 10% of the standard deviation of the data.
The example we consider is 2000 points of the x component of a numerically integrated (sampling rate of 0.2) trajectory of the Rossler system, contaminated by additive Gaussian noise with a standard deviation of 5% of the standard deviation of the data. The effect of noise on this time series and its attractor is shown in Fig. 1.13. Figure 1.14 demonstrates the computation of (1.20) as a function of embedding window. To estimate the model prediction error we employ the rather simple interpolative scheme described in the previous section. For comparison, the performance of alternative (more complex) modelling schemes is also shown in Fig. 1.14. We find that alternative, more parametric, mod-
Fig. 1.14 Computation of description length as a function of embedding window for Rossler time series data. The solid line and dot-dashed line are proportional to the logarithm of the sum of the squares of the model prediction error using a local constant and local linear method respectively. The local constant model utilised is described in Sec. 1.7.1, the local linear scheme is described in the text. The second modelling scheme exhibits a clear minimum which occurs at 4. The local constant modelling scheme employs only lags that provide an improvement to model prediction error. Its error as a function of embedding window is therefore monotonic (plateau onset occurs at 34). For small values of embedding window the linear scheme performs best, but for large values, behaviour is poor and extremely erratic. Computation of description length utilising the local constant scheme (solid line with circles) yields an optimal embedding window of 15. For clarity, the values dw = 4,15,34 are marked as vertical dashed lines.
elling methods produce results which are sensitively dependent on the "correct" choice of modelling algorithm parameters (by modelling algorithm parameters we mean parameters associated with the model selection scheme itself rather than only the parameters optimised by that scheme). The first zero of the autocorrelation function occurs at a lag of 8 and the data exhibit a pseudo-period of about 31 samples. With the embedding lag set at 8, false nearest neighbours indicates a minimum embedding dimension of 4. Standard methods, therefore, suggest an embedding window of roughly 32. By coincidence (in other examples, and for other amounts of noise or with other lengths of data, this proved not to be the case), the minimum of the model prediction error for a constant model occurs at this value. Conversely, the minimum of the error of the local linear model occurs at a value of 4. This comparatively low value of embedding window is due to the relative complexity of the local linear modelling scheme [137]. Although this scheme performs best for small embedding windows, the additional information introduced with larger embedding windows is not recognised by this scheme. The main reason for this is that the parameters of the scheme (neighbourhood size, neighbourhood
weights and so on) are also dependent on the embedding dimension and embedding lag. For example, values of neighbourhood size which work well for a small dimension embedding may not work well for a larger embedding dimension. Moreover, as the embedding dimension becomes larger, it becomes difficult to find good values for these parameters. This general behaviour is observed in every example we consider. Therefore, although the local linear scheme often provides a good estimate of the optimal embedding dimension (as would false nearest neighbours), the description length estimated from a local constant model provides a much better estimate of the optimal embedding window. We have already mentioned that the local constant modelling scheme selects only lags that provide some improvement in model prediction error. Clearly, as $d_w$ increases there is a combinatorial explosion. To address this combinatorial explosion is both difficult and beyond the requirements of this algorithm. We consider only whether the addition of successive lags offers an improvement. Suppose for a $d_e$ dimensional embedding the chosen model includes the lags $\{\ell_1, \ell_2, \ldots, \ell_k\}$ (where $0 < \ell_1 < \ell_i < \ell_{i+1} < \ell_k \leq d_e$). To determine the set of model lags for the $(d_e + 1)$-dimensional embedding we consider the performance of the local constant model with lags $\{\ell_1, \ell_2, \ldots, \ell_k, d_e\}$. If this model performs better than the model with lags $\{\ell_1, \ell_2, \ldots, \ell_k\}$ then it is accepted; otherwise we retain only the lags $\{\ell_1, \ell_2, \ldots, \ell_k\}$. Therefore, the selected lags may be used as an estimate of the optimal lags for a generalised variable embedding (1.13). In the case of the Rossler system data analysed in Fig. 1.14, the optimal lags were 1 to 15 and 19, 20, 24, 26, 29, 32 and 34: altogether, 22 different lags. Clearly, a 22 dimensional embedding is excessive, and some subset of these lags would probably prove sufficient. Moreover, the minimum description length optimal embedding window is 15, limiting the selection to the first 15 lags. It is reasonable to suppose that each of this large number of lags may contribute some significant novel information to the modelling scheme. However, the expression we hope to optimise, (1.20), is independent of which lags are included (indeed, in this example, they are all included, although this is not the case in general), and therefore we do not consider this issue more closely here. We defer the selection of optimal lags from this set to the modelling phase of dynamic reconstruction. In Fig. 1.15 we examine the effect of various noise levels and different
Fig. 1.15 Optimal embedding dimension as a function of data length and noise level. The solid bars depict the optimal model size utilising the methods described here for a single realisation of Rossler time series data. The panel on the left is for a fixed noise level of 5% and time series lengths between 118 and 5000 data. The panel on the right is for a fixed data length of 2108 data and various noise levels (expressed as a percentage of the standard deviation of the data). For the cases where noise was added to the time series, the results depicted here are for a single realisation of that noise (not an average); this is the likely cause of the moderate variation in the results observed for larger noise levels. For comparison, the embedding window that yielded minimum error for the local constant (asterisks) and local linear (circles) models is also shown.
length time series on the selection of the embedding window. We observe that for longer time series, the optimal embedding window is larger. This is consistent with what one might expect. For short time series the optimal model can only capture the short term dynamics and therefore only the recent past history (a small embedding window) is required. For larger quantities of data one is able to characterise the more sensitive long term dynamics and a larger embedding window provides a significant advantage. Initially, an embedding window of about 10 is sufficient, while for the longest time series an embedding window of 35 is optimal. Significantly, these two values correspond to approximately the first zero of the autocorrelation function (or one-quarter of the pseudo-period) and the pseudo-period of the observed time series. We note in passing that the optimal embedding window for the local constant model is an upper bound on the minimum description length best window. This is as we would expect. The description length is the sum of a term proportional to the model prediction error and a function which increases monotonically with embedding dimension (the description length of the local constant model). Therefore the embedding window minimising the model prediction error can be no smaller than the window minimising the description length.
Fig. 1.16 Optimal embedding windows for Lorenz and Ikeda time series. The calculations depicted in Fig. 1.15 are repeated for time series of two standard systems. The top two panels are for a single realisation of the chaotic Lorenz system, the bottom two panels are for a single realisation of the chaotic Ikeda map. The solid bars depict the optimal model size utilising the methods described here. The leftmost panels are for a fixed noise level of 5% and time series lengths between 118 and 5000 data. The panels on the right are for a fixed data length of 2108 data and various noise levels (expressed as a percentage of the standard deviation of the data); as in Fig. 1.15, the results are for a single realisation of the noise, which is the likely cause of the moderate variation in the results observed for larger noise levels. For comparison, the embedding window that yielded minimum error for the local constant (asterisks) and local linear (circles) models is also shown.
Conversely, we find that the optimal embedding window for the local linear method remains about 4 or 5 (roughly corresponding to the optimal embedding dimension). Variation in the noise level for a fixed length time series demonstrates similar behaviour. For noisier time series, a larger embedding window is required, as increasing the noise on each observation decreases the useful information provided. As the information provided to the optimal model by each observation decreases, more observations (a larger embedding window) are required to provide all the available information. For noise levels of up to 30% this method provides consistent, repeatable results. Noisier
time series tend to yield a larger variation in the optimal estimates of the embedding window. Note that, in contrast, the local linear scheme performs progressively worse, utilising a diminishing window as the noise level is increased. We believe that this is due to the additional parametric complexity of this modelling method. As more noise is added to the data, the (relatively) complex rules used to determine near neighbours, and to derive a weighted linear prediction from these, become more prone to the system noise, and actually perform worse. In Fig. 1.16 we repeat the above calculations for time series generated from the standard chaotic Lorenz system and the Ikeda map [61]. Variation of the optimal embedding window as a function of noise and data length for the Lorenz data is very similar to the results depicted in Fig. 1.15 for the Rossler system. Increasing noise level or time series length yields a larger optimal model. Furthermore, optimal embedding window values tend to coincide with the pseudo-period of the time series, or one-quarter, or one-half of this value. Results for the Ikeda map are substantially different. In this case the optimal embedding window estimated coincides with the value that minimises the error of the local constant and linear models. In general, an embedding dimension of 3 or 4 is suggested, and this is what one would expect for this system (although the fractal dimension of the Ikeda map is less than two, a delay reconstruction of this map is highly "twisted" and requires an embedding dimension of 3 or 4 to successfully remove all intersecting trajectories; this is evident in Fig. 1.8). We now return to the main purpose of estimating the embedding window, namely the reconstruction of the dynamics. For the Rossler system analysed in Figs. 1.14 and 1.15 we build nonlinear models following the methods described in [55] with the embedding suggested by either autocorrelation or false nearest neighbours (namely $d_e = 4$ and $\tau = 8$), hereafter referred to as a Standard Embedding, or with the embedding window (of 34), hereafter a Windowed Embedding. Table 1.3 compares the average model size (number of nonlinear basis functions in the optimal model) and model prediction error for 60 models of this time series (2000 observations and 5% noise) with each of these two embedding strategies. These models are built to minimise the description length of the data given the model, and therefore a comparison of the optimal model description length is also given. These quantitative measures show a consistent improvement in the model performance for the model built from the windowed embedding.
Table 1.3 Comparison of model performance with standard constant lag embedding (a Standard Embedding) and embedding over the embedding window suggested in Fig. 1.15 (a Windowed Embedding). Figures quoted are the mean of 60 nonlinear models, fitted with a stochastic optimisation routine to the same data set, and standard deviations. Figures quoted here are for 2000 data points with 5% noise; other values of these parameters gave similar, consistent results. The three indicators are minimum description length (MDL) of the optimal model, root-mean-square model prediction error (RMS) and the model size (number of nonlinear terms in the optimal model). For each indicator, the new embedding strategy shows clear improvement. MDL and RMS have decreased, indicating a more compact description of the data and a smaller prediction error, respectively. Conversely, the mean model size has increased, indicating that more structure is extracted from the data. Several other measures were also considered: mean amplitude of oscillation, correlation dimension, entropy and estimated noise level (see Chapter 2). However, for each of these measures the variance between simulations of models built using the same embedding strategy was as large as that between the different embedding strategies. The results of these calculations are therefore omitted.

model                      MDL         RMS             size
Standard (de = 4, τ = 8)   -655 ± 23   0.158 ± 0.003   15.6 ± 2.9
Windowed (dw = 15)         -716 ± 17   0.151 ± 0.004   21.1 ± 5.5
1.8 Application: Sunspots and chaotic laser dynamics: Improved modelling and superior dynamics
We now consider the application of this method to two experimental time series: the annual sunspot time series [153] and experimental laser intensity data [158; 108]. A third, unsuccessful example is described in [132]. The raw time series data are depicted in Fig. 1.9. Since the main motivation for selection of the embedding window with the method described here is to improve modelling results, we concentrate exclusively on the comparison of the performance of nonlinear models of these data with standard embedding techniques and the windowed embedding suggested by the algorithm proposed here. By construction, the local constant modelling scheme performs best with the windowed embedding. Therefore, we consider a more complicated nonlinear radial basis modelling algorithm, first proposed in [53] and most recently described in [130]. Like the windowed embedding strategy, this modelling scheme is designed to optimise the description length of the time series [130]. We are interested in two types of measures of performance: short term behaviour (for example, mean square prediction error) and dynamic be-
Table 1.4 Comparison of model performance with standard constant lag embedding and embedding over the embedding window suggested in Fig. 1.15. Figures quoted are the mean of 60 nonlinear models, fitted with a stochastic optimisation routine to the same data set, and standard deviations. Figures quoted here are for 2000 data points, where more data is available; longer time series samples gave similar, consistent results. The four indicators are minimum description length (MDL) of the optimal model, root-mean-square model prediction error (RMS), the model size (number of nonlinear terms in the optimal model), and the correlation dimension (CD) of the free run dynamics. For the laser time series, none of the models built using the standard embedding produced stable dynamics and it was therefore not possible to estimate correlation dimension. The correlation dimension estimated directly from these three data sets was 0.396, 1.090, 1.182 (note that the low value for the first data set is an artefact of the short time series).
data                       MDL              RMS              size           CD
sunspots (de = 6, τ = 3)   1267.9 ± 12.1    13.16 ± 1.116    7.32 ± 1.818   0.938 ± 0.456
sunspots (dw = 6)          1230.1 ± 11.6    12.31 ± 0.6886   6.96 ± 1.50    0.7836 ± 0.4145
laser (de = 5, τ = 2)      5753.6 ± 153.9   2.405 ± 0.2954   100.8 ± 12.3   n/a
laser (dw = 10)            5239.8 ± 159.0   1.767 ± 0.1992   109.5 ± 12.3   0.8637 ± 0.7999
haviour (invariant measures of the dynamical systems). Results equivalent to those depicted in Table 1.3 have also been computed and are summarised in Table 1.4. Table 1.4 shows that for the sunspot time series and the experimental laser intensity recording, the windowed embedding improved model performance. That is, the description length was lower, the one-step model prediction error was less and the models were larger. However, with the exception of the one-step model prediction error, the difference in these measures was not statistically significant. For the recording of human VF, the new method did not improve model performance and, in fact, the optimal embedding window was $d_w = 2$: substantially smaller than one would reasonably expect from such a complex biological system. It seems plausible that, in this case, the time series under consideration is too short, noisy or non-stationary (this conclusion is supported by Fig. 1.9). Finally, we note that the result for the sunspot time series is particularly encouraging because this improvement in short term predictability is achieved with a much smaller embedding ($d_w = 6$ compared to $d_e\tau = 18$). However, as has been observed elsewhere [130], short term predictability is not the best criterion with which to compare models of nonlinear dynamical systems. Therefore, for each model, we estimated correlation dimension, noise level and entropy, using a method described in [167]. Furthermore,
Fig. 1.17 Typical model behaviour using the standard embedding strategy (de, τ): Two simulated time series from models of the experimental data examined here are depicted. The panels correspond to those of Fig. 1.9 and the horizontal and vertical axes in these figures are fixed to the same values as the corresponding panels of Fig. 1.9.
under the premise that these models should exhibit pseudo-periodic dynamics we also computed the mean limit cycle diameter (i.e. the amplitude of the limit cycle oscillations). In every case we found that the dynamics exhibited by models built from the traditional (i.e. uniform) embedding strategy were more likely to either be a stable fixed point or divergent. Finally, Figs. 1.17 and 1.18 show typical noise free dynamics in models of each of these systems. No effort was made to ensure that the models performed well, and the models and simulations presented in these figures were selected at random. For both data sets, the new method clearly performs better. Typically, the original method produced laser dynamics and sunspot simulations that were divergent and a stable fixed point (respectively). In contrast, the windowed method yields models which exhibit bounded, (almost) aperiodic dynamics. Closer examination of the laser dynamics indicates that it eventually settles to a stable periodic orbit (this phenomenon can be observed towards the end of the time series depicted in Fig. 1.18).
Fig. 1.18 Typical model behaviour using the windowed embedding strategy (dw): Two simulated time series from models of the experimental data examined here are depicted. The panels correspond to those of Fig. 1.9 and the horizontal and vertical axes in these figures are fixed to the same values as the corresponding panels of Fig. 1.9.
1.9 Summary
We have approached the problem of optimal embedding from both the modelling perspective (Secs. 1.6 and 1.7) and from the perspective of estimating dynamic invariants (in the earlier sections). In contrast to previous reports (which focused on estimating dynamic invariants), our primary concern was the selection of embedding parameters that provide the optimal reconstruction of the underlying dynamics for an observed time series. To achieve this, we assumed that the optimal model is that which minimises the description length of the data. From this foundation, we showed that the best embedding has a constant lag ($\tau = 1$) and a relatively large embedding window $d_w$. In general, the optimal $d_w$ will be determined by the amount of noise and the length of the time series. From an information theoretic perspective this is what one would expect: $\tau > 1$ implies some information is missing from the embedding. The optimal value of $d_w$ reflects a balance between a small embedding with too little information to reconstruct the dynamics and a large embedding where the model ceases to describe the dynamics. To compute the quantity $d_w$ we introduced an extremely simple non-
predictive local constant model of the data and selected the value of $d_w$ for which this model performs best. One can see that this offers a new and intuitive method for selection of embedding parameters. In essence, one could neglect description length and simply choose the embedding such that this model performs best. However, the addition of description length makes the optimal $d_w$ dependent not only on the noise but also on the length of the time series. We see that for short time series, one should not be confident of a large embedding window. The similarity between this new method of embedding window selection and the well established false nearest neighbour technique [12] is more than superficial; indeed, the comparison to the method described in [12] is particularly apt, since Cao introduces a modified false nearest neighbour approach which, like our method, avoids many of the subjective parameters of alternative techniques. In Secs. 1.7.2 and 1.8 we provided an explicit comparison between our technique and the "standard" false nearest neighbour method. However, there are various improvements to this algorithm (such as [12]) which are worthy of further consideration. Nonetheless, there are several important distinctions between our method and these false nearest neighbour techniques. As we have already emphasised, the aim of this method (to achieve the best model of the dynamics) differs from that of false nearest neighbours (topological unfolding). Furthermore, the incorporation of minimum description length means that our method explicitly penalises short or noisy time series. At a functional level, the two algorithms are similar because both methods seek to avoid data points which are close, but which quickly diverge. Such points are (respectively) either false nearest neighbours or bad nonlinear predictors of one another. However, whereas false nearest neighbour methods seek only to avoid this situation (i.e. spreading out the data is sufficient), the windowed embedding method insists that the neighbours which are the best predictors be found. Consider the situation where a system's dynamics are either stochastic or extremely high dimensional. Using false nearest neighbour methods, one may simply embed the data in a high enough dimension so that the data are sufficiently sparse. However, doing so does not improve the nonlinear prediction error; consequently, the windowed embedding method would prefer a small embedding window. Conversely, consider the situation at a separatrix. Points which are close do rapidly diverge from one another and so they will appear as false near neighbours for large embedding dimensions, until (at a time scale similar
to that of the underlying system) the points are eventually sufficiently spread. But from a nonlinear prediction viewpoint, these points are equally difficult to predict for all embedding dimensions, and again the windowed embedding method will indicate a much smaller embedding dimension than that suggested by a strict application of false nearest neighbours. (We acknowledge that this problem is actually related to the "plateau" observed in plots of the fraction of false nearest neighbours against embedding dimension; in many cases, prudent selection of "plateau-onset" can minimise the problem, although this remains somewhat subjective.) Finally, we note that the examples of Sec. 1.7.2 showed that this method performed consistently, and the applications in Sec. 1.8 showed that selecting embedding parameters in this way improved the model one-step prediction error. In effect, this is a demonstration that the method is working as expected. More significantly, we found that the dynamics produced by models built from the windowed embedding also behaved more like the experimental dynamics than for models built from a standard embedding. This is a very positive result; however, we are now faced with a more substantial problem: the problem of building the best nonlinear model for the data once the embedding window has been determined [130]. Information theory has shown us that the optimal embedding should fix $\tau = 1$; we now need to consider the practice of nonlinear modelling to determine which lags $\ell = 1, 2, 3, \ldots, d_w$ are significant for practical reconstruction from specific experimental systems.
Chapter 2
Dynamic measures and topological invariants

Estimation of correlation dimension is a cottage industry in its own right, and the details of several of the more robust methods will be covered in the next chapter. In this chapter, we focus on general classes of dynamic invariants. After a brief discussion of what we mean by dynamic invariants, and why these are important, we describe these quantities in three broad groups. The first of these (Sec. 2.1) is correlation dimension. The second group of invariants is based on information theory: entropy and complexity (Sec. 2.2); as we will see, the Lempel-Ziv complexity, which we describe, is actually not an invariant at all. Finally, the third group of invariants, based on nonlinear prediction error and the divergence of nearby trajectories, is covered in Sec. 2.4. A dynamic invariant (sometimes, simply an invariant; note that an invariant measure is something more special, as it also has the properties of a measure) is nothing more than a quantity describing the dynamical behaviour of a system with the special property that the value of that quantity does not depend on the co-ordinate system. In other words, the value of a dynamic invariant obtained directly from the original system (if we were able to observe the original system directly) will be the same as the value of this quantity when measured from a delay embedding reconstruction (1.15). Hence, knowing that a quantity is a dynamic invariant means that we can, in turn, measure that quantity from a delay embedding and infer that the same quantity also applies to the original system. Moreover, the choice of delay embedding, or even of the observation function $g$ of Sec. 1.1, is not important: the same result can be obtained in each case. However, as usual, there are several provisos. Firstly, the value of the dynamic invariant is only the same provided that the system is the same
(albeit in different co-ordinates). This means that if the delay-reconstruction is incorrect, there is no assurance that the dynamic invariant will have the same value. Secondly, although the value of the dynamic invariant will be the same, the estimate of that invariant will not necessarily even follow the same distribution. So, what do dynamic invariants measure, and why are they important? Deterministic nonlinear dynamics may be either bounded or unbounded. If the dynamics are unbounded, that is, they grow beyond bound (for example, consider the exponential spread of a computer virus), they are typically not particularly interesting. Moreover, estimating dynamic invariants for unbounded systems from a time series is futile because the sampling of the system provided by a single trajectory is exceedingly sparse. Hence, we are only really interested in bounded dynamics. Bounded deterministic nonlinear dynamics can be either periodic or aperiodic. Periodic dynamics can be characterised by the number of degrees of freedom of the system: a dynamic invariant. Non-periodic dynamics can also be characterised in the same way. To differentiate non-periodic from periodic we can also characterise both by a different dynamic invariant: the average rate of divergence of nearby trajectories, or Lyapunov exponent. Finally, we note that chaotic dynamics, by definition, have the following three properties: they are deterministic, bounded, and aperiodic. Hence chaotic dynamics can also be defined in terms of the equivalent invariant measures. Determinism can be described by the ability to predict the future from the past: short-term prediction error. Boundedness is simply a binary invariant measure. And, finally, aperiodicity can be measured by the dominant Lyapunov exponent. Hence a bounded system with deterministic dynamics and a positive Lyapunov exponent is chaotic.
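Before turning to correlation dimension, a toy numerical check of the last statement may help: for a map whose derivative is known analytically, the dominant Lyapunov exponent is just the average logarithmic stretching rate along the orbit. The sketch below uses the logistic map (which appeared in Fig. 1.7); it relies on the known derivative rather than the time-series methods of Sec. 2.4, and is an illustration only.

```python
import numpy as np

def logistic_lyapunov(r, x0=0.4, n_transient=1000, n_iter=100000):
    """Lyapunov exponent of x_{n+1} = r x_n (1 - x_n), using the known derivative r(1 - 2x)."""
    x = x0
    for _ in range(n_transient):            # discard the transient
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n_iter):
        x = r * x * (1 - x)
        total += np.log(abs(r * (1 - 2 * x)))
    return total / n_iter                   # average logarithmic stretching rate

print(logistic_lyapunov(4.0))   # approximately ln 2 > 0: bounded, deterministic, aperiodic -> chaotic
print(logistic_lyapunov(3.2))   # negative: a stable period-2 orbit, not chaotic
```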
2.1 Correlation dimension
Correlation dimension is nothing more than an extension of the usual notion of dimension to objects with a fractional dimension. A point has dimension zero and a line has dimension one. So an object with dimension between zero and one is somewhere between a point and a line. Similarly, a filled square is two dimensional and a solid cube is three dimensional. Hence, a dimension of between two and three would represent an object which occupies more space than a plane, but less than a sphere. In Figs. 2.1 and 2.2 we depict the Cantor set (with dimension $\frac{\log 2}{\log 3}$) and two and three
Fig. 2.1 Generating the Cantor set. Recursively remove the middle third from each of the line segments (from top to bottom) in a set, starting with a single line segment (top). The resulting object has a fractional dimension. Moreover, any two points on that object are separated by a contiguous segment of finite length not in the set.
dimensional generalisations of that: the Sierpinski carpet (dimension $\frac{\log 8}{\log 3}$) and the Menger sponge (with dimension $\frac{\log 20}{\log 3}$). The three dimensional sponge shown in Fig. 2.2 is a slight variation on the usual formulation. We have taken a liberty with the original Menger sponge as this version is easier to visualise in three dimensions as a point-set (actually, the Sierpinski carpet in Fig. 2.2 is an oblique projection of the standard Menger sponge). Notice that although these objects have non-integer dimensions, they are still physically represented as a subset of the next highest integer dimension. For example, the Cantor set has a fractional dimension, but is represented strictly as a subset of a line (in a sense, this is true of the comparison between integer dimensions as well, but for fractional dimension it also provides an aid to understanding). From Figs. 2.1 and 2.2 we see examples of objects with fractional dimension. These objects contain no hint of deterministic time dynamics and therefore are not directly related to time series, or to chaos. To extract comparable objects from temporal dynamics, we consider the limit set of a
Fig. 2.2 The Sierpinski carpet and the Menger Sponge. Both objects are generalisations of the Cantor set with fractional dimension. The Sierpinski carpet is obtained by deleting, recursively, the centre of each square. The Menger Sponge is usually defined as a three dimensional generalisation of this. However, the three dimensional object illustrated here is generated with a slightly different rule (and is therefore easier to visualise). A cartoon of the true Menger sponge is shown in Fig. 2.3.
deterministic dynamical system. That is, we plot the set of states to which an (almost) arbitrary trajectory will converge. Mathematically, the limit set is the set of points $a \in \mathcal{A}$ such that $\forall x_0 \in M\setminus\mathcal{E}$ and $\forall \varepsilon > 0$, $\exists d > 0$ such that $|x_d - \mathcal{A}| := \min_{a \in \mathcal{A}} |x_d - a| < \varepsilon$. In this definition, $\mathcal{E}$ is a set of measure zero (i.e. a finite set of points) which we exclude. These points include unstable periodic orbits and other unstable equilibria (the reason for excluding these points is that for an ergodic system it is far easier to estimate the limit set if we are not concerned with the unstable equilibria; estimating unstable periodic orbits is a separate, challenging and interesting problem). If this limit set $\mathcal{A}$ is bounded (i.e. corresponds to bounded dynamics) then it is called the attractor. A strange attractor is one with fractional dimension. For a chaotic dynamical system, the corresponding attractor may be a fractal (i.e. a strange attractor). However, there are obvious counter-examples: the Logistic map does not have a fractal attractor (see Fig. 1.7). Strange non-chaotic attractors can also occur for systems with periodic forcing (it is an open question whether, for continuous systems, strange non-chaotic attractors can be generated in other ways; however, this is outside the scope of this volume). Alternatively, consider the following nonlinear, but stochastic, system: it is not chaotic, yet exhibits a strange attractor. Let $a_1, a_2, a_3 \in \mathbb{R}^2$
Fig. 2.3 The Menger Sponge. A cartoon of the true Menger sponge (to iteration depth 3) is shown. To visualise this as the Menger sponge point set, each small cube represents a minute copy of the entire structure.
and define
$$g_1(x) = \tfrac{1}{2}x + \tfrac{1}{2}a_1, \qquad g_2(x) = \tfrac{1}{2}x + \tfrac{1}{2}a_2, \qquad g_3(x) = \tfrac{1}{2}x + \tfrac{1}{2}a_3;$$
then the stochastic map defined by $\mathrm{Prob}(x_{n+1} = g_i(x_n)) = \tfrac{1}{3}$ for $i = 1, 2, 3$ exhibits a strange attractor (in precisely the sense defined above).
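A sketch of this construction (the so-called chaos game) is given below; the particular fixed points $a_1, a_2, a_3$, chosen here as the vertices of a triangle, are an illustrative assumption rather than values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])   # fixed points a_1, a_2, a_3

x = np.array([0.3, 0.3])
points = np.empty((20000, 2))
for n in range(len(points)):
    i = rng.integers(3)              # Prob(x_{n+1} = g_i(x_n)) = 1/3
    x = 0.5 * x + 0.5 * a[i]         # apply the randomly chosen contraction g_i
    points[n] = x

# 'points' samples the fractal limit set of this non-chaotic, stochastic system;
# plotting it (e.g. with matplotlib) gives a figure qualitatively like Fig. 2.4.
```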
Fig. 2.4 A Stochastic Strange Attractor. The random set in this figure is fractal and yet generated by a random process (described in the text).
That attractor is plotted in Fig. 2.4. Clearly this system is not chaotic; it is, however, rather pathological. Hence a fractal (i.e. strange) attractor is neither a sufficient condition for chaos, nor a necessary one. Despite this, embedding chaotic time series will often result in attractors with a fractional correlation dimension, and, conversely, systems with fractal attractors are very often chaotic. Hence fractional correlation dimension is a useful, but by no means binding, indicator of chaos (we have not yet even considered the problems associated with reliably estimating correlation dimension, nor of determining which values of the estimate are integers or not). In some sense, correlation dimension provides an indicator of what a system is not. Noise has an infinite correlation dimension and therefore will expand to fill any embedding space (i.e. noise has a correlation dimension equal to the embedding dimension). On the other hand, periodic systems have integer dimension. Therefore a finite (i.e. significantly lower than the embedding dimension) and non-integer correlation dimension are indicators
53
Fig. 2.5 Not chaos. An illustration of a Random walk and a periodic orbit (composed of two incommensurate periods). In both cases the estimation of correlation dimension may be employed to distinguish these systems from typical chaotic flows. However, as we discussed in the text there are more pathological cases.
that the underlying system is neither dominated by noise nor a periodic orbit (or a superposition of periodic orbits). In Fig. 2.5 we illustrate the "attractors" for a coloured noise process (a random walk) and a bi-periodic orbit. The correlation dimension of these two processes, for an embedding dimension of 3, are, 3 and 2, respectively. Neither of these attractors are strange. Contrast this with the Lorenz attractor of Fig. 1.5 with a correlation dimension of 2.08. In the next chapter we will discuss techniques for estimating correlation dimension at some length. Later in this volume we will consider the problem of distinguishing the integer correlation dimension 2 from the non-integer value 2.08... in the case of finite data. Now, let us consider exactly how to ascribe a fractional correlation dimension to an object. Let us start with integer dimension d €E Z + . Clearly, V oc Ld
(2.1)
where V is a measure of the "volume" (i.e. length, area, volume or hypervolume) or an object and L is an appropriate length scale. For d = 0 this is trivial, for d = 1 it is obvious (we define volume in one dimension to be the length of the object). In two dimensions, the "volume" if an object is its area and the relationship (2.1) holds. The extension of this relation to volumes and hyper-volumes also follows. So, for d > 0 we wish to define the correlation dimension of an object such that the relationship (2.1) holds. In other words: d oc logF/ log L.
(2.2)
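As a quick illustrative aside (the specific numbers here are ours, not the text's), relation (2.2) can be checked against exactly self-similar sets: if a set is the union of k copies of itself, each scaled down by a factor s, then covering it at length scale L/s requires k times the "volume" needed at scale L, so d = log k / log s. For the Menger sponge of Fig. 2.3 (k = 20 sub-cubes, each scaled by s = 3) this gives d = log 20 / log 3 ≈ 2.73, while the stochastic attractor of Fig. 2.4 (three maps, each contracting by a factor of 2, provided the points a_i are not collinear) gives d = log 3 / log 2 ≈ 1.58; neither is an integer.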
For correlation dimension, define the correlation integral C(ε) of an attractor A by

C(ε) = Prob(‖x − y‖ < ε | x, y ∈ A).    (2.3)
Hence, we have a length scale ε and a volume measure C(ε). To see that Eq. (2.3) does provide a volume measure that allows us to extend Eq. (2.2), consider the effect of C(ε) for points on a single line in space. Clearly, then, C(ε) ∝ ε. Similarly, for points in a plane C(ε) ∝ ε^2, and in a (three-dimensional) volume C(ε) ∝ ε^3. One can therefore define the correlation dimension d_c by

d_c = lim_{ε→0} log C(ε) / log ε.    (2.4)
In Chapter 3 we discuss the practicalities of estimating this quantity in various situations. In the next section we turn our attention to the estimation of other quantities of interest: entropy, an information theoretic invariant; and complexity, a measure of the effectiveness of a chosen encoding scheme.
2.2 Entropy, complexity and information
We now discuss the estimation of two distinct quantities, both related to the information content in either a signal or the system. The entropy of a system is the rate of generation of new information, or (equivalently) destruction of old information. Entropy is a measure of how quickly information related to the current state of the system becomes irrelevant. Entropy is a dynamic invariant. By contrast, complexity is a measure of the novelty within a signal. A regular signal has low complexity; a highly irregular one has high complexity. Because the definition and subsequent estimation of complexity requires an explicit choice of encoding scheme (e.g. the digitisation or a coarse-grained binning), complexity is not a dynamic invariant. In Sec. 2.2.1 we discuss the entropy of a system, and in Sec. 2.2.2 we consider the complexity of a signal.

2.2.1 Entropy
If two trajectories {xn} and {yn} of a deterministic flow start close and remain close, the information about the state of {xn+m} one can obtain
from {y_{n+m}} is (roughly) constant. Conversely, if the trajectories rapidly diverge, either because they are random or are simply exhibiting sensitive dependence on initial conditions, the information shared between the states rapidly diminishes. That is, new information is generated. But, what do we mean by "information"? One convenient definition of information, especially between trajectories, is encapsulated in the mutual information introduced in Sec. 1.3.2. However, let us take a slightly different route. The Shannon entropy of a system is defined in terms of the probability P(x) that the system is in state x. Hence for a set of states A (in fact, we are primarily concerned with attractors, but this is unimportant at this stage)
H(A) = − ∫_{x∈A} P(x) log_2(P(x)) dx    (2.5)
measured in bits [116]. In this definition, and those that follow, P(x) > 0 for all x. States with zero probability of occurring are ignored: for the integral form of Eq. (2.5) this is fairly obvious, however for the discrete form Σ −P(x) log_2 P(x) it is critically important. Similar to Eq. (2.5), the joint entropy of multiple variables may be defined as the integral over their joint probabilities. The Shannon entropy is of fundamental importance in communication engineering and coding theory: it measures the uncertainty we have over the expected state of a system. Moreover, −P(x) log_2 P(x) is a measure of the surprise associated with a particular outcome x. Now, consider the trajectory {x_n} on attractor A. The Kolmogorov-Sinai entropy9 is a measure of the uncertainty associated with a randomly selected trajectory. For the trajectory {x_n} the surprise associated with that trajectory can be measured by −∫ P(x_n) log_2(P(x_n)) dn. Because the trajectory is only sampled at some finite time intervals (for simplicity we will suppose these are integers10) we may re-write this as −Σ_{n=0}^{T} P(x_n) log_2(P(x_n)). Now, suppose that rather than concern ourselves with the surprise of observing a particular trajectory, we are interested in the distribution of all trajectories observed over a finite time window n = 0, . . . , T. Let us perform some partitioning A_1, A_2, . . . , A_Q ⊂ A of the attractor A. Moreover A_i ≠ ∅ and A_i ∩ A_j = ∅ for all i ≠ j. And let P(A_{i_0}, A_{i_1}, A_{i_2}, . . . , A_{i_T}) denote the probability that the trajectory is in set A_{i_j} at time n = j. Then, Kolmogorov entropy can be defined as follows.
9 Also known as the metric entropy or simply the Kolmogorov entropy.
10 The generalisation to other sampling rates is not difficult, just messy.
First, let

K_T = − Σ_{i_0, i_1, …, i_T} P(A_{i_0}, A_{i_1}, . . . , A_{i_T}) log_2 (P(A_{i_0}, A_{i_1}, . . . , A_{i_T}))    (2.6)
in bits. As before, this summation is only over non-zero probabilities P(·). One can intuit this definition as follows: K_{T+M} − K_T is the information (in bits) required to correctly identify the hypercubes A_{i_{T+1}}, A_{i_{T+2}}, . . . , A_{i_{T+M}} visited (in order) by a trajectory between times T and T + M (sampling at a sampling rate of 1). Since K_T ≥ 0 (by definition) we can see that for a chaotic system K_T > 0 (for T > 0), whereas, for a periodic system K_T = 0. The Kolmogorov entropy is then defined as

K = lim_{max_i size(A_i) → 0} [ lim_{T→∞} (K_{T+1} − K_T) ]    (2.7)
where it is presumed that the sampling rate is also sufficiently high. One can see that K is a measure of the rate at which additional information is required to make accurate predictions. For periodic systems, additional information is never required: provided the initial condition is known to some fixed accuracy, future predictions can also be made, accurate to the same level. However, for chaotic systems, the sensitive dependence on initial conditions means that new information is constantly required: an initial condition accurate to some finite precision will eventually become useless for prediction. We will return to this concept when describing the Lyapunov spectrum and the largest Lyapunov exponent. The relationship between entropy and the Lyapunov spectrum will be revisited in Sec. 2.4. For the time being we will only consider estimating (2.7) directly from time series data. This is a difficult task. First one must observe the system for a sufficiently long time (T → ∞) and to a sufficient accuracy (max_i size(A_i) → 0). Moreover, for finite accuracy measurements (particularly of a single trajectory) the choice of partition {A_i} will be critical. In Fig. 2.6 we illustrate the computation of entropy for four time series of 5000 points in the absence of noise (other than that due to numerical precision). We see that, as expected, both the Shannon and KS entropy of the Ikeda system are greatest: small variation is magnified most rapidly in this system. In contrast the entropies of the two "almost periodic" systems (Rossler and sinusoidal) are quite close: certainly, it is difficult to claim that K_{T+1} − K_T → 0 for the periodic case but K_{T+1} − K_T > 0 for the chaotic system. Certainly the results are consistent with this expectation, but they are hardly unequivocal evidence for it.
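To make the estimation procedure concrete, the following fragment sketches one way to approximate K_T from a scalar time series using a binary partition (values above or below the median) and block probabilities of length T, in the spirit of the calculation shown in Fig. 2.6. It is only an illustrative sketch under those assumptions, not the companion implementation used for the figures, and the function names are our own.

```python
import numpy as np

def block_entropy(x, T):
    """Estimate K_T: the Shannon entropy (in bits) of symbol blocks of length T,
    using a binary partition of the data about its median."""
    s = (np.asarray(x) > np.median(x)).astype(int)      # binary symbolisation
    blocks = [tuple(s[i:i + T]) for i in range(len(s) - T + 1)]
    _, counts = np.unique(blocks, axis=0, return_counts=True)
    p = counts / counts.sum()                           # block probabilities
    return -np.sum(p * np.log2(p))                      # zero-probability blocks never occur

def ks_entropy_increments(x, T_max=10):
    """Return the increments K_{T+1} - K_T; for a fine enough partition and long
    enough data their large-T limit approximates the Kolmogorov-Sinai entropy."""
    K = np.array([block_entropy(x, T) for T in range(1, T_max + 1)])
    return np.diff(K)

# Example: a (nearly) periodic signal should give small increments.
t = np.arange(5000)
x = np.sin(0.3 * t) + 0.01 * np.random.randn(5000)
print(ks_entropy_increments(x, T_max=8))
```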
Fig. 2.6 Computation of entropy from time series. Results are illustrated for 5000 data points generated from the Ikeda system (solid), the chaotic Rossler system (dot-dashed), a periodic sinusoid (dashed) and Gaussian noise (dotted). The top panel is the computation of K_T using a binary partition of the data space and examining trajectories of length T. For larger values of T, this calculation becomes infeasible. The lower plot is the incremental values K_{T+1} − K_T, which are expected to converge to the Kolmogorov-Sinai entropy.
Finally, the entropy of the random system is lowest, as expected. Moreover the Shannon entropy of this system starts at values very close to 0 and increases rapidly to a constant value: in the lower plot we see that (again, as one would expect) K_{T+1} − K_T is constant in this case. In Fig. 2.7 we see the effect of variation of the binning of the data (the partition) on the resultant estimates. Both figures indicate that this calculation is difficult to realise in practice.11 Nevertheless, one can obtain estimates of the entropy (either the Shannon or Kolmogorov-Sinai versions) of a time series in this manner. However, in Sec. 2.4 we will show that it is easier (well, no harder anyway) to estimate K via the Lyapunov spectrum. In Sec. 3.5 we will introduce an alternative algorithm which estimates a version of entropy and dimension simultaneously.
11 Or that my implementation of this algorithm, available from the website, is faulty.
Fig. 2.7 Variation of entropy computation with binning. Results are illustrated for 5000 data points generated from the Ikeda system (solid), the chaotic Rossler system (dot-dashed), a periodic sinusoid (dashed) and Gaussian noise (dotted). The top panel is the computation of K_T using a partition of 2 to 16 bins and examining trajectories of length T = 5. For larger values of T, this calculation becomes infeasible. From this figure one can observe the general situation that for such short trajectories the random system exhibits the largest K_T, and the chaotic systems have moderate values, which appear to be significantly larger than for the periodic system.
2.2.2 Complexity
Algorithmic complexity is a measure of the regularity of a symbolic sequence. When estimating complexity from a time series it is first necessary to employ some encoding scheme to convert the (presumably) high precision data into a sequence of elements from a finite and fairly small set. Various, fairly obvious, schemes exist and are widely used. Algorithmic complexity is defined as the number of sequences one observes in a symbolic sequence as a fraction of the maximum possible number of sequences. The maximum number of sequences one would observe is that which one observes for a random sequence of symbols. Loosely, algorithmic complexity is equivalent to the compression ratio afforded by the Lempel-Ziv algorithm.12 For time series data, complexity is often used as a measure of the structure in the underlying time series. However, computation of complexity for nonlinear time series analysis is dependent on the selection of a good encoding scheme.
12 I.e. the ratio of a file size before and after the application of WinZip, gzip, or any other similar syntactic compression scheme based on the Lempel-Ziv-Welch family of algorithms.
Moreover, complexity estimates generally work well for discrete maps, but perform less well for continuous systems.
Let's start with some definitions. For the time series {x_n} suppose that we have some encoding scheme g such that g(x_n) ∈ A where A = {a_1, a_2, a_3, . . . , a_d} is some alphabet. In the binary case A = {0, 1}, and one will typically have g(x) = H(x − x̄) where H is the Heaviside step function.13 For now, let us just consider an alphabet of d symbols. The algorithmic complexity is computed according to the following scheme. First, let P and Q denote two symbolic sequences, PQ is their concatenation and P⁻ is P without the last symbol. Let P ≺ Q indicate that P is a substring of Q. For example, if P = [a_1, a_2, a_3], Q = [a_4, a_5, a_6],
where a_1, a_2, a_3, a_4, a_5, a_6 are six distinct symbols from the alphabet A, then PQ = [a_1, a_2, a_3, a_4, a_5, a_6], QP = [a_4, a_5, a_6, a_1, a_2, a_3], and PQ⁻ = [a_1, a_2, a_3, a_4, a_5].

Algorithm 2.1 Sequential complexity algorithm.
(1) Initialise c = 1, i = 1, j = 1, P = s_1 and Q = s_2 (c is the count of how many novel sequences we have seen, i and j are pointers into the data set, P is what we've already examined, and Q is the new bit).
(2) If Q ≺ PQ⁻ (i.e. Q is a string we have already seen) then increment j = j + 1, update Q = Q s_{i+j}, and leave P unchanged. Repeat this until Q ⊀ PQ⁻ (i.e. keep scanning forward until Q contains something we haven't seen before).
13 Usually, x̄ is the mean of x, but in many situations it makes more sense to take the median.
(3) We now have that P = s_1 . . . s_i, Q = s_{i+1} . . . s_{i+j}, and Q ⊀ PQ⁻; that is, Q ends in a sequence we have not seen before.
(4) Increment the count of novel sequences: c = c + 1.
(5) Update P to include everything seen so far: set i = i + j (so P = s_1 . . . s_i).
(6) Reset j = 1 and Q = s_{i+1}.
(7) If i + j ≤ N, return to step (2); otherwise stop, and return c.
For an alphabet consisting of d symbols and a sequence of length N it can be shown [74] that

c < N / [(1 − ε_N) log_d N],  where ε_N = 2(1 + log_d(log_d(dN))) / log_d N.

Since ε_N → 0 as N → ∞, one generally observes that c is bounded by N / log_d N. In fact, N / log_d N is the complexity of a random sequence of length N (with an alphabet of d symbols). Therefore, it is usually more useful to consider the normalised complexity

C = (c/N) log_d N.    (2.8)
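As an illustration, the following sketch computes the sequence count c and the normalised complexity (2.8) for a binary encoding about the median; it follows the spirit of Algorithm 2.1 but is not the book's companion implementation, and the function names are our own.

```python
import numpy as np

def lempel_ziv_count(s):
    """Count the number of novel substrings c in the symbol sequence s
    (sequential Lempel-Ziv parsing, cf. Algorithm 2.1)."""
    s = list(s)
    n = len(s)
    c, i, j = 1, 1, 1
    while i + j <= n:
        P = s[:i]                       # what we have already examined
        Q = s[i:i + j]                  # the candidate new sequence
        PQm = P + Q[:-1]                # PQ without the last symbol
        seen = {tuple(PQm[k:k + j]) for k in range(i)}
        if tuple(Q) in seen:
            j += 1                      # Q seen before: extend it
        else:
            c += 1                      # Q is novel: start a new phrase
            i += j
            j = 1
    return c

def normalised_complexity(x, d=2):
    """Binary encoding about the median, then C = (c/N) log_d N as in Eq. (2.8)."""
    sym = (np.asarray(x) > np.median(x)).astype(int)
    N = len(sym)
    return lempel_ziv_count(sym) * np.log(N) / (np.log(d) * N)

# A random sequence gives C close to 1; a slowly varying periodic one much less.
print(normalised_complexity(np.random.randn(1000)))
print(normalised_complexity(np.sin(0.05 * np.arange(1000))))
```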
The problem, which we have not yet discussed, of choosing a suitable alphabet A and encoding scheme g remains. The standard method is to partition the observed data into d bins, usually of either equal size or equal probability, and to set d = 2. In Sec. 2.2.3 we will consider several alternatives in more detail.
2.2.3 Alternative encoding schemes
Since digitally recorded time series data is necessarily quantised, we already have a natural alphabet and encoding scheme. Unfortunately, this scheme is not really practical. Even for a moderate digitisation scheme of around 8-10 bits, the number of potential sequences of symbols (even for an approximately periodic signal) is extremely large. Consider a periodic signal with a period incommensurate with the sampling rate. In such cases the number of unique sequences of symbols is directly related to the sampling. Under a fine sampling of such a system, c ∝ N. Therefore, in these cases, a very coarse sampling is to be preferred. In practice, the very simplest of
all such schemes is often used:

g(x) = a_1,  x > x_0
     = a_2,  x ≤ x_0        (2.9)

where usually one takes x_0 = x̄ (it actually makes more sense to choose x_0 to be the median of {x_n}, but the distinction is usually not substantial). Of course, for the extension to ternary and higher order encoding schemes, it is usually appropriate to choose bins with equal probability:

g(x) = a_1,      x < X_0
     = a_2,      X_0 ≤ x < X_1
       ...
     = a_{d−1},  X_{d−3} ≤ x < X_{d−2}
     = a_d,      X_{d−2} ≤ x        (2.10)

where Prob(X_{i−1} ≤ x < X_i) = Prob(x < X_0) = Prob(X_{d−2} ≤ x) = 1/d, for i = 1, 2, . . . , d − 2.
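A sketch of this family of encodings (a binary threshold as in Eq. (2.9), or d equal-probability bins as in Eq. (2.10)) might look as follows; the helper names are ours, and using empirical quantiles as the thresholds X_0, . . . , X_{d−2} is one reasonable reading of the equal-probability prescription.

```python
import numpy as np

def encode_binary(x, threshold=None):
    """Eq. (2.9): symbol 1 if x > x0, else 0. By default x0 is the median."""
    x = np.asarray(x)
    x0 = np.median(x) if threshold is None else threshold
    return (x > x0).astype(int)

def encode_equiprobable(x, d=4):
    """Eq. (2.10): symbols 0..d-1 chosen so each bin has (roughly) equal probability,
    using empirical quantiles of the data as the thresholds X_0, ..., X_{d-2}."""
    x = np.asarray(x)
    thresholds = np.quantile(x, np.arange(1, d) / d)    # X_0 < X_1 < ... < X_{d-2}
    return np.searchsorted(thresholds, x, side="right")

# Example usage
x = np.random.randn(10)
print(encode_binary(x))
print(encode_equiprobable(x, d=3))
```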
In certain applications, various modifications to this hierarchy of schemes can be extremely useful. In particular, [169] showed that application of a variant of this scheme could be used to differentiate between various distinct cardiac arrhythmia. In [134] we applied a similar encoding to use complexity measures (along with the more vanilla power spectral density) to identify the onset of cardiac arrhythmia and to utilise this indicator for automated recording of abnormal rhythms in coronary care unit patients. We discuss this application of the algorithm in Sec. 2.3. Note that the dependence on encoding scheme which we are discussing here is distinct from that for entropy. When calculating entropy, one must first decide on a partition of phase space (or the attractor) and then compute the transition probabilities over that partition. However, the entropy of the system is defined as the supremum over all such partitions (this is equivalent to taking the limit as the partition becomes increasingly fine for an increasingly large amount of data). For algorithmic complexity, the complexity is only defined for the symbolic sequence and not for the underlying system. A different encoding scheme can produce fundamentally distinct estimates of complexity (consider the example of a periodic signal with frequency incommensurate with the sampling rate). Moreover, another striking feature of this computation of complexity with the standard encoding scheme described above is that it is likely to work extremely well for maps, but less well for smooth flows.
The reason for this is that with a smooth flow, one is likely to see long (or longer, anyway) sequences of identical symbols. This will be true irrespective of whether the underlying system is periodic or "pseudo-periodic" chaos.14 Hence, for smooth flows it becomes much more difficult to differentiate chaos from periodic dynamics. Here, we suggest an alternative encoding scheme which often performs better for such smooth systems. First let us appeal to terminology standards in audio coding, and refer to the standard schemes as Pulse Code Modulation (PCM). PCM encodes a regularly sampled (audio) signal as a sequence of quantised pulses of various amplitudes. This is exactly the process outlined in Eq. (2.10). The first, obvious, extension of this scheme is Differential PCM (DPCM).15 DPCM simply applies (2.10) to the time series {x_n − x_{n−1}} of first differences of {x_n}. Clearly DPCM will introduce both greater sensitivity to signal noise and to slow changes in amplitude. For audio coding, DPCM (and also ADPCM) offers a more efficient encoding scheme. The price we pay here is that the resulting estimate of complexity is also more sensitive to the small effects of noise in the signal. An alternative, and novel, approach for the estimation of complexity from time series is the application of a delta modulation (DM) type encoding scheme. For a signal {x_n}, some (high) sampling rate τ (for notational convenience, we stipulate that 1/τ ∈ Z^+), and some step size δ, the delta modulation scheme encodes the signal {x_n} as a bit string {b_i}_{i=1}^{N/τ} (the bit string will be longer than the original signal). The DM scheme can be described in the following algorithm.
Algorithm 2.2 Delta modulation encoding.
(1) Let y_0 = x_1 ({y_m} will be the DM reconstructed signal; the aim is that {y_m} should closely follow {x_n}, however the sampling rates of {x_n} and {y_m} will be different).
(2) Let m = 0.
(3) Ensure that n = ⌊mτ⌋ and r = mτ − n (the sampling rates of {x_n} and {y_m} are different: n is the integer part of mτ, and r is the remainder).
14 We will describe "pseudo-periodic" signals at some length later. In general a "pseudo-periodic" time series is one that appears "roughly" periodic: it may, in fact, be a noisy periodic orbit or some form of oscillatory chaos. For now, the archetypal pseudo-periodic system is the Rossler dynamical system. For distinct parameter values this can manifest as either a noisy pseudo-periodic periodic orbit or as pseudo-periodic deterministic chaos.
15 One may also extend this reasoning further and consider Adaptive DPCM. We do not do that here.
Fig. 2.8 The delta modulation process. For the original smooth signal illustrated in the top panel (dot-dashed line), the bit-string illustrated in the lower panel yields the (jagged) reconstructed signal in the top panel (solid line). Sampling interval τ = 0.2 and delta step δ = 0.38941 (chosen so that the signal-to-noise ratio is greatest) were used. Note that the sampling rate of the bit string (and therefore the reconstructed signal) is ten times that of the original data. Also, notice that the encoding requires that the reconstructed signal will increase over the next sample duration if it is below the true value and it will fall if it is above it. In Fig. 2.9 the same process is illustrated for a longer time window.
(4) If y_m ≤ (1 − r)x_n + r x_{n+1} then b_{m+1} = 1 and y_{m+1} = y_m + δ, otherwise b_{m+1} = 0 and y_{m+1} = y_m − δ.
(5) Increment m and continue from step (3).
Here ⌊z⌋ = max{k ∈ Z | k ≤ z}, and I(C) = 1 if condition C holds and 0 otherwise is the indicator function. Following this scheme, the bit string corresponds to a sequence of discrete jumps, either up or down, of size δ every τ time steps. The effect of this is that {y_m} will closely follow a linear interpolation of {x_n}.
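A rough sketch of this delta modulation encoding, under the assumptions noted above (in particular that r is the fractional part of mτ and that the comparison is against the linearly interpolated signal), is given below; it is illustrative only, not the companion implementation.

```python
import numpy as np

def delta_modulate(x, tau=0.1, delta=0.4):
    """Encode the signal x as a bit string following Algorithm 2.2.
    tau is the (fast) sampling interval of the bit string, delta the step size."""
    x = np.asarray(x, dtype=float)
    M = int((len(x) - 1) / tau)          # number of DM samples that stay in range
    bits = np.zeros(M, dtype=int)
    y = np.zeros(M + 1)
    y[0] = x[0]
    for m in range(M):
        t = m * tau
        n = int(np.floor(t))             # integer part of m*tau
        r = t - n                        # remainder
        target = (1 - r) * x[n] + r * x[min(n + 1, len(x) - 1)]   # interpolated signal
        if y[m] <= target:
            bits[m] = 1
            y[m + 1] = y[m] + delta      # reconstruction below the signal: step up
        else:
            bits[m] = 0
            y[m + 1] = y[m] - delta      # reconstruction above the signal: step down
    return bits, y

# Example: encode one period of a sine wave sampled at unit intervals.
x = np.sin(2 * np.pi * np.arange(0, 30) / 30.0)
bits, y = delta_modulate(x, tau=0.2, delta=0.2)
print(bits[:20])
```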
Fig. 2.9 The delta modulation process. For the original smooth signal illustrated in the top panel (dot-dashed line), the bit-string illustrated in the lower panel yields the (jagged) reconstructed signal in the top panel (solid line). Sampling interval τ = 0.2 and delta step δ = 0.38941 (chosen so that the signal-to-noise ratio is greatest) were used. The calculation and the data are the same as in Fig. 2.8, except for a longer time window.
The linear interpolation, evident in step (4) (i.e. (1 − r)x_n + r x_{n+1}), is necessary because the reconstruction {y_m} only has one bit of information at every time step. In Figs. 2.8 and 2.9 we illustrate this scheme. Notice that the delta modulation process requires the selection of two parameters, δ and τ, whereas the standard PCM technique (all else being equal) only really requires the selection of the number of bits in the quantisation scheme. However, a fundamental result in digital communication theory [45] states that these two parameters can actually be related. Delta modulation can lead to two types of noise: slope overload and quantisation error. Slope overload occurs when the quantisation function cannot rise and fall fast enough to keep up with the data, and quantisation error occurs due to the fundamental non-stationarity of the DM scheme (it must go up or down, even if the signal doesn't). Quantisation error can be improved only by increasing the sampling rate (and decreasing the step size appropriately).
Fig. 2.10 Calculating complexity: Binary PCM. Computation of complexity for the Ikeda map (solid line), the Rossler system (dot-dashed), and a sinusoidal signal (dashed) with various noise levels. Also shown are the results for pure Gaussian noise. Each time series consisted of 1000 data points, the pseudo-period of the Rossler system, and the period of the sinusoid are approximately 6. All signals are normalised to have mean zero and standard deviation of one. The x-axis is the magnitude of the standard deviation of the additive Gaussian noise. The results shown here are for a binary pulse code modulation (PCM) scheme.
However, for a chosen value of τ, one can avoid slope overload by choosing

δ > τ |dx/dt|_max.    (2.11)
Hence we can simplify the delta modulation scheme by choosing δ = τ |dx/dt|_max and only concerning ourselves with τ. However, changing τ in DM is equivalent to changing the number of bits in PCM. In Fig. 2.10 we illustrate the result of complexity calculations using binary PCM for our archetypal map and flow systems. In Fig. 2.11 we illustrate the same calculations using an 8-level quantisation PCM scheme. Figure 2.12 is a similar calculation for DPCM. In Fig. 2.13 we illustrate the results obtained using the DM approach (with τ = 0.1), and in Fig. 2.14 we repeat this calculation with τ = 0.5.
Fig. 2.11 Calculating complexity: Octal PCM. Computation of complexity for the Ikeda map (solid line), the Rossler system (dot-dashed), and a sinusoidal signal (dashed) with various noise levels. Also shown are the results for pure Gaussian noise. Each time series consisted of 1000 data points, the pseudo-period of the Rossler system, and the period of the sinusoid are approximately 6. All signals are normalised to have mean zero and standard deviation of one. The x-axis is the magnitude of the standard deviation of the additive Gaussian noise. The results shown here are for an octal pulse code modulation (PCM) scheme.
While PCM is analogous to standard encoding schemes for complexity, DM is not. We find that the standard techniques work well for maps and in the presence of substantial noise. DM performs best for smooth flows, with lower total noise content. We utilise a delta modulation (DM) based encoding scheme and show that this can provide estimates of complexity that are more sensitive to nonlinear deterministic dynamics in flows: for low noise levels the DM technique can more appropriately distinguish between a deterministic chaotic flow and periodic motion. However, we note that an inherent feature16 of this scheme is its sensitivity to additive noise.
16 Just like Microsoft, we view this problem as a feature rather than a bug.
Fig. 2.12 Calculating complexity: Binary DPCM. Computation of complexity for the Ikeda map (solid line), the Rossler system (dot-dashed), and a sinusoidal signal (dashed) with various noise levels. Also shown are the results for pure Gaussian noise. Each time series consisted of 1000 data points, the pseudo-period of the Rossler system, and the period of the sinusoid are approximately 6. All signals are normalised to have mean zero and standard deviation of one. The x-axis is the magnitude of the standard deviation of the additive Gaussian noise. The results shown here are for a binary differential pulse code modulation (DPCM) scheme.
In extremely noisy systems we argue that the standard encoding schemes will also provide more robust measures of complexity, but with less sensitivity to small deterministic fluctuations. The essential message is not the utility, or otherwise, of any of these methods, but that the results obtained in each case are entirely dependent on the appropriate choice of quantisation scheme: complexity is not an invariant. Despite this, it is an extremely useful measure of information content in nonlinear time series. Extrapolating from here, one may suspect that we could consider utilising audio coding schemes for estimation of complexity. Unfortunately, this makes no sense when it comes to the more advanced (and therefore efficient) audio codecs.
Fig. 2.13 Calculating complexity: DM. Computation of complexity for the Ikeda map (solid line), the Rossler system (dot-dashed), and a sinusoidal signal (dashed) with various noise levels. Also shown are the results for pure Gaussian noise. Each time series consisted of 1000 data points; the pseudo-period of the Rossler system, and the period of the sinusoid, are approximately 6. All signals are normalised to have mean zero and standard deviation of one. The x-axis is the magnitude of the standard deviation of the additive Gaussian noise. The results shown here are for a delta modulation (DM) scheme with τ = 0.1.
Such schemes (for example mp3, ram, ogg vorbis and their ilk) aim for maximum compression. Maximum compression is achieved by reducing the amount of redundant information in the signal. Therefore, the symbolic sequence will, under an optimal encoding, be indistinguishable from a random sequence.17 In the following section we present an application of PCM-type encoding schemes to compute complexity and differentiate between various cardiac arrhythmia. The extension of this approach to DM type schemes, to differentiate between different waveforms for both ECG and pulse signals, is currently under development.
17 The same argument could be used to point out a fundamental flaw in the SETI project.
Fig. 2.14 Calculating complexity: DM. The calculation is the same as Fig. 2.13, except that τ = 0.5.
2.3 Application: Detecting ventricular arrhythmia
Electrocardiogram (ECG) waveform data showing the spontaneous evolution of ventricular fibrillation (VF) together with its precursors in humans is rare. When such data has been obtained, the resolution is often poor, or the length of pre-onset recording is limited. In [134] we presented a solution to this problem. In this section we show how algorithmic complexity is used for automated identification of arrhythmia onset and demonstrate that this device can be practically implemented to record onset of various ventricular arrhythmia. We will also describe the automatic data collection facility that is capable of recording such data. We designed computer software that, in conjunction with proprietary hardware, allows continuous monitoring of physiological waveforms from up to 24 separate hospital beds. Episodes of cardiac arrhythmia (including ventricular tachycardia and ventricular fibrillation) are identified online by a nonlinear complexity algorithm. In the
Fig. 2.15 Spontaneous onset of Ventricular Fibrillation. An example of a recording of spontaneous VF (following initial VT). The horizontal axes are time (in seconds) and the vertical axes are surface ECG voltage (in mV). Data was recorded at 500 Hz and 10 bits. Recording was triggered (using the FFT technique) at 26-27 seconds (on the above time axes).
arrangement described here, each episode is automatically recorded for 20 minutes with 10 minutes both before and after onset of the arrhythmia. When monitoring a 6-bed coronary care unit, this facility will typically collect around 10-50 recordings per week. The majority of these will be artefact or miscellaneous minor arrhythmia. However, 2-8 genuine VF episodes will also typically be recorded per month. The mechanism underlying spontaneous evolution of ventricular fibrillation (VF) is poorly understood. Algorithmic methods to predict imminent arrhythmia require large data banks of representative time series showing spontaneous evolution of arrhythmia. However, such time series are particularly rare. Time series showing the evolution of VF are often recorded by implantable defibrillators (for example [81]). This data is often noisy, may only represent the sequence of RR intervals and is typically morphologically different from surface electrocardiogram (ECG) recordings. In 1991 Clayton and colleagues [22] described a solution for in-hospital monitoring of ECG and automatic recording of VF. Clayton's data acquisition system required a dedicated computer, hardware arrhythmia trigger
Fig. 2.16 Third degree AV block and ventricular bigeminy. An example of a recording of spontaneous VT using the methods described in this paper. The horizontal axes are time (in seconds) and the vertical axes are surface ECG voltage (in mV). Data was recorded at 500 Hz and 10 bits. Recording was triggered (using the FFT technique) at 3-4 seconds (on the above time axes).
and associated software for each bedside monitor in a 10-bed coronary care unit (CCU). Computational limitations of the time meant that 1-minute of pre-VF (along with 4 minutes of post-VF) could be recorded at 250 Hz for each trigger event. Here, we describe a new data acquisition system that essentially builds on the features described by Clayton and co-workers [22]. This system requires a single, dedicated personal computer (PC) that is connected to a hospital CCU network, and is capable of recording up to 45 minutes of pre-arrhythmia (along with post-arrhythmia) data at a resolution of 500 Hz at 10 bits (or 125 Hz at 12 bits).18 Because the system is connected via a local network to the CCU computers and bedside monitors, the hardware
18 These parameters are largely due to hardware constraints and are actually independent of the actual implementation of the complexity-based arrhythmia detection process. For the record, our data acquisition software runs on a low-end Hewlett-Packard (HP) Vectra PC (any PC with an ISA/EISA bus, capable of running Windows NT would be sufficient). ECG waveform data is collected by an assortment of bedside monitors that are connected to a HP Serial Distribution Network (SDN). The SDN is a local medical communications network connecting bedside monitors, patient information centres and computer systems. For example, the SDN in a single ward is responsible for collating and displaying patient information and waveforms on a single central console.
Fig. 2.17 Spontaneous onset of Alternans. The horizontal axes are time (in seconds) and the vertical axes are surface ECG voltage (in mV). Data was recorded at 500 Hz and 10 bits. Recording was triggered at ~ 58 seconds (on the above time axes).
may be located at a physically separate site. Hence there is virtually no interruption to the normal running of the CCU. Our custom software requests ECG waveform data from each bed connected to the system. Bedside monitors able to provide ECG waveform data then transmit this data to the host PC. Using either a spectral power analysis algorithm [22; 21] or a nonlinear complexity estimation algorithm [134; 169] the host PC calculates a statistical quantity from the data. If this quantity exceeds a threshold, the recording is automatically triggered. Once recording is triggered the memory buffer containing the historical data (up to approximately 45 minutes pre-trigger) is written to disk along with a predefined length of post-trigger data. Available patient information (i.e. name, patient specific hospital identification number, and bed number/location) is also recorded in the data file. Recorded data files can then be copied to a remote computer for subsequent off-line analysis. The data acquisition algorithms must achieve four main tasks: (1) periodic interrogation of the various bedside monitors to determine which beds are occupied, by whom, and if ECG waveform data is being monitored, (2) continuous monitoring of ECG data from all beds of interest, (3) testing a trigger criterion to determine when to record data, and (4) record data in
response to a trigger. The first two tasks, and the fourth, are a straightforward programming exercise. The third task is the most important and it requires relatively swift computation of some index relating to the degree of irregularity in the ECG waveform. Motivated by [21] and [169] we chose to implement a spectral power computation and a nonlinear complexity measure. These are discussed in more detail below. The fast Fourier transform (FFT) [95] allows for extremely rapid computation of a time series' spectral power content. Following the advice of [22] we found that computing the proportion of power between 3 and 6 Hz gave a good indication of a signal's frequency content. When the proportion of root-mean-square power content between 3 and 6 Hz exceeded 0.75 we triggered data recording. Note that in [22] recording was triggered if the proportion of signal power in the 3-6 Hz range exceeded 0.75 for more than 2 seconds. For the experimental set-up we established, we chose to trigger recording if any sample had more than 75% power in the 3-6 Hz range. For our application the low specificity is more than compensated by the accurate collection of most data sets. We found that this method was fast and could identify potential arrhythmia episodes reliably without excessive false positives (for the purposes of data collection a large number of false positives is only an inconvenience, and is to be preferred to misidentified arrhythmia). However, this scheme is somewhat arbitrary. Patients with an abnormal sinus rhythm rate (i.e. abnormally fast or slow "normal" rhythm) or unusual movement artefact19 could trigger recording unnecessarily. For this reason we chose to also employ a measure of algorithmic complexity, estimated using the techniques laid down in this chapter. It has been noted that nonlinear complexity provides a good specificity and significance for separating sinus rhythm from VF and ventricular tachycardia (VT) [169]. Furthermore, we have found some evidence (for example [141]) to support the hypothesis that VF may best be described as a nonlinear dynamical system. Because of this we chose to implement nonlinear complexity as an alternative index to trigger recording. Nonlinear complexity is a measure of the structural complexity of a time series. It is an information theoretic measure popular in nonlinear dynamical systems theory.
19 For ambulatory patients, it has been observed that teeth brushing could trigger the arrhythmia alarm.
The complexity of a time series is (approximately) how much compression may be achieved when applying some computational data compression algorithm. Regular rhythms are predictable and therefore may be substantially compressed (and therefore have a low complexity); irregular rhythms are less predictable and may be compressed less (a high complexity). For example, sinus rhythm is regular and predictable (one would therefore expect a sinus rhythm time series to have a low complexity measure); VF and VT, however, are irregular and unpredictable, and time series recordings of VF and VT would be expected to have a higher complexity. This has been shown experimentally [169]. Although nonlinear complexity has been shown to offer good significance and specificity for identification of VF and VT, we found the algorithm (as we implemented it) to be substantially slower than the FFT technique. Conversely, the specificity of the FFT technique is significantly lower than the complexity method. Since both algorithms can effectively be implemented in real time, we prefer the complexity method for this application. Finally, we present representative recordings of arrhythmia produced by our data collection facility. Figure 2.15 shows initiation of spontaneous VF following VT, for which the patient was successfully defibrillated after two shocks. Figures 2.16 and 2.17 show segments of spontaneous VF, and restoration of sinus rhythm. The method we have described here may be used to automatically record data during spontaneous cardiac arrhythmia and the data leading up to these episodes. We have used this facility to build up a library of VF and VT episodes. We can then examine ECG before the onset of VF for potential signatures using traditional linear as well as nonlinear analysis techniques [141]. Such data will be invaluable for future research into the spontaneous evolution of cardiac arrhythmia.
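For concreteness, a minimal sketch of the spectral trigger criterion described above (the fraction of signal power in the 3-6 Hz band, with recording triggered when it exceeds 0.75) might look as follows; the sampling rate, window length and function names are our own illustrative choices rather than the clinical implementation.

```python
import numpy as np

def band_power_fraction(x, fs=500.0, f_lo=3.0, f_hi=6.0):
    """Fraction of the (mean-removed) signal power lying between f_lo and f_hi Hz."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    total = spectrum[1:].sum()                      # ignore the DC bin
    band = spectrum[(freqs >= f_lo) & (freqs <= f_hi)].sum()
    return band / total if total > 0 else 0.0

def vf_trigger(x, fs=500.0, threshold=0.75):
    """True if the 3-6 Hz power fraction exceeds the trigger threshold."""
    return band_power_fraction(x, fs) > threshold

# Example: a 5 Hz dominated segment (a VF-like rate) trips the trigger.
t = np.arange(0, 4, 1.0 / 500)
print(vf_trigger(np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.randn(len(t))))
```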
2.4 Lyapunov exponents and nonlinear prediction error
We have been discussing the estimation of dynamic invariants20 and have introduced two distinct classes: dimension based measures (Sec. 2.1) and information measures (Sec. 2.2). In this section we introduce a final type of dynamic invariant: invariants related to the dynamic evolution of trajectories. These invariants are typically either measures of nonlinear prediction error or Lyapunov exponents.
20 Well, mostly dynamic invariants.
Although these two things are distinct measures, when estimated from time series data, they are indistinguishable.21 Let us start with nonlinear prediction error. The nonlinear prediction error at time horizon t for a model f(x, t) from class M is defined as

e_f(t) = Σ_n ‖f(x_n, t) − x_{n+t}‖    (2.12)

where f(x_n, t) is the model prediction of the evolution of x_n, time t into the future. Clearly, this depends on the choice of model f (and indeed model class M). The nonlinear prediction error E(t) is the minimum of (2.12) over all models:

E(t) = min_f e_f(t).    (2.13)
But this is not a good definition. Some model will always perform well, but it is not necessarily going to be a good model of the dynamics (i.e. simply by chance, something is going to work, but only for the data we have seen). Moreover, minimisation over a non-convex, non-smooth and even discontinuous set of models is impractical. In chapter 6 we will discuss how to choose only good models. For the time being we will use a slightly expedient trick and restrict ourselves only to models in the class M of local constant models (again we will describe such models in more detail later). For now, let us just say that the best model prediction for x_{n+1} is x_{m+1} where ‖x_n − x_m‖ is minimal (i.e. our prediction is just where the nearest neighbour goes). This leads us to the following working definition for nonlinear prediction error (and this is the quantity to which we will refer in future). Let z_n = (x_n, x_{n−τ}, . . . , x_{n−(d_e−1)τ}) denote the embedded state; then

E(t) = ( (1/(N−t)) Σ_{n=1}^{N−t} (x_{n+t} − x_{m_n+t})^2 )^{1/2},    (2.14)

where m_n = argmin_{1 ≤ m_n ≤ N−t, m_n ≠ n} ‖z_n − z_{m_n}‖.
In other words, z_{m_n} is the closest neighbour to z_n and (x_{n+t} − x_{m_n+t}) measures their separation over time t.
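The working definition (2.14) translates almost directly into code. The sketch below uses a simple time delay embedding and a brute-force nearest-neighbour search; the parameter names and the normalisation by the data's standard deviation are our own illustrative choices, not the book's companion implementation.

```python
import numpy as np

def delay_embed(x, dim=3, lag=1):
    """Rows are z_i = (x_i, x_{i+lag}, ..., x_{i+(dim-1)lag}); the last entry is the
    most recent observation of the embedded state."""
    n = len(x) - (dim - 1) * lag
    return np.column_stack([x[k * lag:k * lag + n] for k in range(dim)])

def nonlinear_prediction_error(x, t=1, dim=3, lag=1):
    """RMS error of the 'nearest neighbour goes' prediction of Eq. (2.14),
    normalised by the standard deviation of the data."""
    x = np.asarray(x, dtype=float)
    z = delay_embed(x, dim, lag)
    offset = (dim - 1) * lag             # index in x of the current observation
    M = len(z) - t                       # points whose image t steps ahead is known
    sq_errors = np.empty(M)
    for n in range(M):
        d = np.linalg.norm(z[:M] - z[n], axis=1)
        d[n] = np.inf                    # a point is not its own neighbour
        m = int(np.argmin(d))            # nearest neighbour in the embedded space
        sq_errors[n] = (x[n + offset + t] - x[m + offset + t]) ** 2
    return np.sqrt(sq_errors.mean()) / x.std()

# Example: a noisy sine wave is quite predictable one step ahead.
xs = np.sin(0.2 * np.arange(2000)) + 0.05 * np.random.randn(2000)
print(nonlinear_prediction_error(xs, t=1, dim=3, lag=5))
```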
21 Or at most, they are merely variations on the same theme.
Fig. 2.18 Estimating nonlinear prediction error. The four panels show average separation (mean from 500 random points, and standard deviation) of trajectories of the Rossler system estimated from the data in Fig. 2.20. In each case the calculation was performed on 5000 samples of the x component of the Rossler dynamical system. The four panels are for the four different integration time steps of: (a) 0.05; (b) 0.2; (c) 0.5; and (d) 1.
Supposing that the system is stationary and that there are sufficient measurements, then this is exactly the behaviour we would expect from the best model.22 In Fig. 2.18 we illustrate the computation of nonlinear prediction error for one slightly different choice of local model (10 local neighbours are used at each step, and this forms the basis of our Lyapunov exponent calculations that follow). From this figure we see increasing NLPE, and a substantial variation in the estimates of this quantity. Moreover, from Fig. 2.18, one can elicit only a weak exponential divergence: this problem is largely attributable to the variance in the estimates of divergence, as one can see in Fig. 2.19 below. Now, we rapidly turn our attention to Lyapunov exponents. Lyapunov exponents measure the divergence (or convergence) of nearby trajectories.
22 The critical point is that one must have sufficient, and sufficiently clean, data and that that data must be embedded properly. But all this is required to build a decent model anyway.
A d dimensional system will have d Lyapunov exponents. The much touted sensitivity to initial conditions exhibited by chaotic systems implies that chaotic trajectories will diverge exponentially. Chaotic systems therefore have at least one positive Lyapunov exponent. Systems with more than one positive exponent are referred to as "hyper-chaotic". The sum of all d Lyapunov exponents measures the expansion, or contraction, of a finite volume under the effects of the system dynamics. Dissipative systems therefore have a negative sum. Finally, continuous flows must have one zero exponent,23 corresponding to the direction of motion (which exhibits neither expansion nor contraction). A useful feature of the Lyapunov spectrum λ_1, λ_2, . . . , λ_d is that, from it, one can directly compute the Kolmogorov-Sinai entropy K [91]:

K = Σ_{λ_i > 0} λ_i    (2.15)
if the invariant measure of the system is continuous along the unstable directions.24 Although somewhat unheralded, this relationship should not really be too surprising: both positive Lyapunov exponents and metric entropy measure the local increase in new information. Properly defined, calculation of Lyapunov exponents requires computation of the local Jacobian along a trajectory (one trajectory is enough if the system is ergodic), multiplying these Jacobians together and then computing the eigenvalues. Then (up to some constant related to the sampling time) the Lyapunov exponents can be computed as the logarithms of these eigenvalues. The entire procedure is described in [2]. We do not dwell on it here as we are unlikely to even know the form of the dynamics with sufficient accuracy. Each stage of the procedure is prone to numerical instability and errors in the initial model of the dynamics are likely to propagate. Instead we only consider estimating Lyapunov exponents directly from the data. The general algorithm was first proposed by Wolf [161] and this method is rather sensitive to numerical issues too. Because of these problems, this algorithm has fallen somewhat into disrepute. Although there have been more recent algorithms that attempt to estimate the underlying dynamics (via some form of modelling), the results obtained by these techniques are usually best when the model class and the underlying dynamics match well.25
Fig. 2.19 Estimating Lyapunov exponents. The four panels show the logarithm of average separation (from 500 random points) of trajectories of the Rossler system estimated from the data in Fig. 2.20. In each case the calculation was performed on 5000 samples of the x component of the Rossler dynamical system. The four panels are for the four different integration time steps of: (a) 0.05; (b) 0.2; (c) 0.5; and (d) 1. The largest (positive) Lyapunov exponent should manifest as a linear scaling region in these plots.
This is not something that can usually be known a priori. We prefer to stick to the simpler algorithm: readers are welcome to consider extensions to this scheme when the data under consideration warrant either the attention of a specific model class or some form of noise reduction (filtering). Following Wolf, we regard a single trajectory as representative of the entire dynamics (if it is not, then building a model is likely to be rather futile anyway). Then, at some point in the future (or even in the past) the trajectory is likely to come close to its current location. One can then approximate the divergence of similar trajectories under perturbation
Fig. 2.20 Estimating Lyapunov exponents. The data used for our calculations of Lyapunov exponents in Figs. 2.18 and 2.19. In each case the calculation was performed on 5000 samples of the x component of the Rossler dynamical system. These plots show only the first 500 data points. The four panels are for the four different integration time steps of: (a) 0.05; (b) 0.2; (c) 0.5; and (d) 1.
by the divergence of two spatially adjacent sections of the same trajectory. Consider just one point, z_n, with N_N near neighbours z_{m_1}, z_{m_2}, . . . , z_{m_{N_N}}. The divergence of nearby trajectories, at the one location z_n, can be approximated by

L(z_n, t) = ln( (1/N_N) Σ_{i=1}^{N_N} |x_{n+t} − x_{m_i+t}|^2 ).    (2.16)

Note that in (2.16) we measure separation of the scalar points x_{n+t} and x_{m_i+t} and not the vector equivalents. For a time delay embedding, the distinction is a moot point. Furthermore, by including the logarithm, we expect exponential divergence to be exhibited by a linear scaling of L(z_n, t) with t.
To complete the process we need to average over all such points:

L(t) = (1/N) Σ_{n=1}^{N} L(z_n, t).    (2.17)
A linear scaling region in L(t) is a good indicator of exponential divergence; the slope of that linear region corresponds (after taking into account the units prescribed by the system sampling rate) to the largest positive Lyapunov exponent. There are two points to note: first, this assumes that there is a positive Lyapunov exponent; and, second, this does not estimate the entire spectrum. Unfortunately, this procedure provides a numerical estimate of only the largest Lyapunov exponent, and only provided that it is positive. In Fig. 2.19 we illustrate the extent to which this scaling region can be achieved for artificial (i.e. "easy") data sets. The data for which these calculations were performed are illustrated in Fig. 2.20. Results of estimating a Lyapunov exponent significantly different from zero are usually taken as a good sign that such a thing does exist. However, this still leaves us the problem of estimating the entire Lyapunov spectrum. This is usually only reliably done by first knowing the underlying dynamics (or building a very good model of it). We will not pursue this issue here. Finally, consider the similarity between Eqs. (2.17) and (2.14). Taking N_N = 1 the only distinction is between an arithmetic and a geometric mean. In general Lyapunov exponents and nonlinear prediction error measure the same thing: divergence of nearby trajectories, under the effect of small errors, for a close approximation to the true dynamics.
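The divergence curve L(t) of Eqs. (2.16) and (2.17) can be sketched in a few lines; the neighbourhood size, embedding parameters and function names below are our own illustrative choices. A linear region in the returned curve indicates exponential divergence, and its slope (up to the factors discussed in the text) reflects the largest Lyapunov exponent.

```python
import numpy as np

def divergence_curve(x, t_max=30, dim=3, lag=5, n_neighbours=10):
    """Average log divergence L(t) of Eqs. (2.16)-(2.17) for a scalar time series."""
    x = np.asarray(x, dtype=float)
    m = len(x) - (dim - 1) * lag
    z = np.column_stack([x[k * lag:k * lag + m] for k in range(dim)])  # embedding
    offset = (dim - 1) * lag
    M = m - t_max                        # reference points with t_max future values
    L = np.zeros(t_max)
    for n in range(M):
        d = np.linalg.norm(z[:M] - z[n], axis=1)
        d[n] = np.inf
        nbrs = np.argsort(d)[:n_neighbours]          # the N_N nearest neighbours
        for t in range(1, t_max + 1):
            sep = x[n + offset + t] - x[nbrs + offset + t]
            L[t - 1] += np.log(np.mean(sep ** 2))    # Eq. (2.16)
    return L / M                                     # Eq. (2.17)

# Example: the curve for a (noisy) periodic signal should quickly flatten.
xs = np.sin(0.2 * np.arange(3000)) + 0.01 * np.random.randn(3000)
print(divergence_curve(xs, t_max=10)[:5])
```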
2.5 Application: Potential predictability in financial time series
We now turn to real data and apply the methods described in the previous sections to three time series: daily financial indicators in three different markets. The question we aim to address is: "Are these financial time series data deterministic?" The statistics we have introduced in this chapter can be used to measure nonlinear prediction error and the extent of a deterministic attractor, and, can therefore determine whether observed data are indeed deterministic. Clearly, for financial data there is a great deal of financial motivation to know this. Unfortunately, almost as clearly, the answer is obvious. The data must not be deterministic. If the data were deterministic, then, presumably someone would have sought the source of
Fig. 2.21 Financial time series. One thousand inter-day log returns of each of the three time series considered in this paper are plotted (from top to bottom): DJIA, USDJPY, and GOLD.
this determinism, modelled it, and subsequently made a very neat profit. However, if this were to occur, then, almost as surely, the method would eventually become widely known (or at least widely copied) and the profit to be made from it would drop to zero. The is the efficient market hypothesis: all available information is known to all players in the financial market and accurately represented in the price. Therefore, the price reflects the knowledge embedded in our market predictions. However, the story cannot simply end there. Many financial institutions do spend substantial sums trying to map determinism in this data. Therefore, perhaps it is not such a stupid question to ask. The historical data which we intend to analyse is represented in Fig. 2.21. We have selected daily values of three financial indicators for analysis in this paper: the Dow-Jones Industrial Average (DJIA), the Yen-US$ ex-
change rate (USDJPY), and the London gold fixings (GOLD).26 In each case the time series represent daily values recorded on trading days up to January 14, 2002. USDJPY and GOLD consist of 7800 and 8187 daily returns respectively. The DJIA time series is longer, consisting of 27643 daily values. Each time series was pre-processed by taking the log-returns of the price data:

x_t = log p_t − log p_{t−1}.    (2.18)
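Computing the log-returns of Eq. (2.18) is a one-line operation; a small sketch (with an artificial price series, since the actual data sources are listed in the footnote) is:

```python
import numpy as np

def log_returns(prices):
    """Eq. (2.18): x_t = log p_t - log p_{t-1}."""
    p = np.asarray(prices, dtype=float)
    return np.diff(np.log(p))

# Example with a synthetic price series (a geometric random walk).
prices = 100 * np.exp(np.cumsum(0.01 * np.random.randn(1000)))
print(log_returns(prices)[:5])
```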
We then repeated the following analysis with time series of length 100 to 1000 in increments of 100, and length 1000 to 10000 in increments of 1000. The shortest time series cover approximately 3-4 months of data; the longest covers about 40 years. All three data sets are only sampled on week days that are not bank holidays. However, local bank holidays and non-trading days cause some variation between the data sets. Each time series consisted of the most recent values. Representative samples of all three time series are shown in Fig. 2.21. At this point, we must admit that this experiment is actually ill-conceived. In every case and at all length scales, the normalised nonlinear prediction error is different from 1 (what we would expect for a random process). Moreover, correlation dimension estimates were lower than the embedding dimension and, without exception, not integers. However, we do not conclude that this data is non-random. Rather, we defer a detailed consideration of this problem until Sec. 4.6 (and even later to Sec. 5.5) where we apply more rigorous surrogate data tests to the data to determine whether the observed values are distinct from those obtained for noise processes.
2.6 Summary
We have described several different statistics that may usefully be estimated from a time series. Two of them are related to the distribution of data in time: the divergence of nearby trajectories in Eqs. (2.17) and (2.14). The correlation dimension, part of a larger family of dimensions, can be estimated from the correlation integral, and so too (as we will see in the next chapter) can an alternative form of entropy, similar to the Kolmogorov-
26 The data were obtained from: http://finance.yahoo.com/ (DJIA), http://pacific.commerce.ubc.ca/xr/ (USDJPY), and http://www.kitco.com/ (GOLD).
Sinai entropy. Moreover, the Pesin identity can be used to determine Kolmogorov-Sinai entropy from the Lyapunov spectrum. Loosely speaking, all these important invariants can be estimated either from the probability distribution (2.3) or the monotonically increasing functions (2.17) and (2.14). In other words, these (two) curves are sufficient to characterise the invariant of interest in a time series. Note that we are not claiming that the distributions offer a complete characterisation of the system: only that the most commonly studied invariants can be extracted from them. The one odd-ball statistic we described was Lempel-Ziv complexity. It is not a dynamic invariant and it is not related to either the correlation integral or the divergence of trajectories. It is, however, a rough measure of information content in the time series and therefore related to the entropy of the system. We chose this statistic as it seems to have genuine usefulness for experimental nonlinear data, its estimation is exceedingly simple,27 and by suitable choice of encoding and corresponding alphabet, it can be used to quantify changing dynamics in a vast range of different systems. We saw an example of this for the real-time identification of onset of cardiac arrhythmia [169; 134]. Other effective applications of this measure include analysis of multichannel EEG [159; 99]. In each of these cases complexity has shown itself to be a useful measure, not because it represents a "true" dynamic invariant, but because it is easy to calculate and comparatively robust to noise.
27 That is, low computational complexity.
Chapter 3
Estimation of correlation dimension
Although we introduced correlation dimension in the previous chapter, we deferred any deep discussion of problems related to estimation of the correlation dimension from data. We now return to the problem of reliable estimation of correlation dimension, and, more generally, of quantities based on the correlation integral (which, as we will see, include the system noise content and one type of entropy). We are accustomed to thinking of real world objects as one, two or three dimensional. However, as we have seen, fractals have non-integer dimension, a so-called fractal dimension. Many real world phenomena, in particular chaotic dynamical systems, can be observed to have properties of a fractal, including a non-integer dimension. As we saw in the previous chapter, a meaningful definition of fractal dimension comes from a generalisation of well known properties of integer dimension objects. Most applications of correlation dimension to data derived from the experimental sciences have utilised the Grassberger and Procaccia algorithm. In recent years this algorithm has somewhat fallen into disrepute as it was increasingly used and abused by researchers in pursuit of "proof" of the existence of chaos in a particular system. In many cases, this abuse of the Grassberger and Procaccia correlation dimension estimation algorithm is nothing more than a slightly premature declaration of the existence of chaos. An unfortunate feature of the Grassberger-Procaccia technique is that it assumes that the data is generated by a finite dimensional attractor and then seeks to determine its dimension. Hence, one almost always1 expects to get a finite fractional correlation dimension estimate from this algorithm. Regardless, there is now an increasing awareness of the pitfalls of this algorithm and a desire to find and apply more robust methods.
1 That is, with probability 1.
In this chapter, in addition to a more detailed discussion of the Grassberger-Procaccia algorithm, we will introduce two newer algorithms. Although technically more complex, these methods are, in practice, more reliable and less prone to misinterpretation. Despite this, as with all the other techniques discussed in this volume, results obtained should be interpreted with considerable caution.
3.1 Preamble
In the previous chapter, we defined the correlation dimension by generalising the concept of integer dimension to fractal objects with non-integer dimension. In dimensions of one, two, three or more it is easily established, and intuitively obvious, that a measure of volume V(ε) (e.g. length, area, volume and hyper-volume) varies as

V(ε) ∝ ε^d,    (3.1)

where ε is a length scale (e.g. the length of a cube's side or the radius of a sphere) and d is the dimension of the object. For a general fractal, it is natural to assume a relation like Eq. (3.1) holds true, in which case its dimension is given by

d = log V(ε) / log ε,

and therefore

d ≈ lim_{ε→0} log V(ε) / log ε.    (3.2)
Let \{z_t\}_{t=1}^{N} be an embedding of a time series in \mathbb{R}^{d_e}. We therefore define the correlation function, C_N(\epsilon), by

C_N(\epsilon) = \binom{N}{2}^{-1} \sum_{i<j} I(\|z_i - z_j\| < \epsilon). \qquad (3.3)
Here I(X) is the indicator function, which, as before, has a value of 1 if condition X is satisfied and 0 otherwise, and \|\cdot\| is the usual distance function in \mathbb{R}^{d_e}. The sum \sum_i I(\|z_i - z_j\| < \epsilon) is the number of points within a distance \epsilon of z_j. If the points z_i are distributed uniformly within an object, this sum is proportional to the volume of the intersection of a sphere of radius \epsilon with the object, and C_N(\epsilon) is proportional to the average of such volumes. Comparing with Eq. (3.1) one expects that C_N(\epsilon) \propto \epsilon^{d_c}, where d_c is the dimension of the object. The correlation integral is defined as \lim_{N\to\infty} C_N(\epsilon). Define the correlation dimension d_c by

d_c = \lim_{\epsilon \to 0} \lim_{N\to\infty} \frac{\log C_N(\epsilon)}{\log \epsilon}. \qquad (3.4)
The curious normalisation of C_N(\epsilon) is chosen so that rather than C_N(\epsilon) being an estimate of the expected number of points of an object within a radius \epsilon of a point, it is instead an estimate of the probability that two points chosen at random on the object are within a distance \epsilon of each other. The difference between the expectation and the probability is only a constant of proportionality if the points are distributed uniformly, and this constant vanishes in the limit of Eq. (3.4). The reason for choosing the probability rather than the expectation is that the concept of dimension still makes sense, and indeed generalises, in situations where the sample points z_i are not distributed uniformly within the object. In the following sections, we discuss the estimation of correlation dimension using the Grassberger-Procaccia algorithm (Sec. 3.2), Judd's algorithm (Sec. 3.3) and the Gaussian Kernel Algorithm introduced by Diks (Sec. 3.5). Finally, in Sec. 3.7 we provide a rapid review of other contemporary techniques: techniques which, despite featuring in the literature, we have little direct experience with.
3.2 Box-counting and the Grassberger-Procaccia algorithm
The method most often employed to estimate the correlation dimension is the Grassberger-Procaccia algorithm [40].2 In this method one calculates the correlation function and plots \log C_N(\epsilon) against \log \epsilon. The gradient of this graph in the limit as \epsilon \to 0 should approach the correlation dimension. Unfortunately, when using a finite amount of data, the graph will jump about irregularly for small values of \epsilon. To avoid this, one instead looks at
2 For example, studies of heart rate [119; 120; 155; 165], electroencephalogram [3; 10; 82; 86; 92; 105; 106], parathyroid hormone secretion [94] and optico-kinetic nystagmus [117] have all utilised the Grassberger-Procaccia algorithm, or some variant of it.
Fig. 3.1 A time lag embedding. One of the data sets used in our calculation, together with the time lag embedding in 2 and 3 dimensions. The time lag used was 29 data points (580 ms).
the behaviour of this graph for moderately small \epsilon. A typical correlation integral plot will contain a "scaling region" over which the slope of \log C_N(\epsilon) remains relatively constant. A common way to examine the slope in the scaling region is to numerically differentiate (or fit a line to) the plot of \log \epsilon against \log C_N(\epsilon). This ought to produce a function which is constant over the scaling region, and its value on this region should be the correlation dimension (see Fig. 3.2). Typically, the only guarantee against false identification of an erroneous correlation dimension with this technique is to ensure that the correlation integral exhibits a linear scaling region over at least two decades.3 Moreover, we should note that there are several heuristic rules for the amount of data required to "reliably" estimate the correlation dimension using the Grassberger-Procaccia technique. One report claims that at least 10^{d_c/2} data are required to be confident in a correlation dimension estimate of d_c [2]. An alternative is even more conservative, putting the bound on the amount of data needed to make sense of a d_c dimensional attractor at 42^{d_c}.4 Either of these bounds is quite sobering.
3 The choice of two decades is purely heuristic.
4 A sceptical reader would probably not be surprised to see a citation to Douglas Adams [4] at this point.
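For readers who wish to experiment, the following MATLAB fragment sketches the computation behind this procedure: a time delay embedding, the correlation sum of Eq. (3.3) over a range of length scales, and the log-log curve in which one looks for a scaling region. The time series x, the lag and the embedding dimension are assumed to be supplied by the user (the values shown echo Fig. 3.1), and the simple double loop is chosen for clarity rather than speed.

% Sketch of the correlation sum of Eq. (3.3) for a scalar time series x.
tau = 29; de = 3;                          % embedding lag and dimension (illustrative)
N = length(x) - (de-1)*tau;                % number of embedded points
Z = zeros(N, de);
for k = 1:de
    Z(:,k) = x((1:N) + (k-1)*tau);         % time delay embedding
end
epsilons = logspace(-3, 0, 30) * (max(Z(:)) - min(Z(:)));   % candidate length scales
C = zeros(size(epsilons));
for i = 1:N-1
    d = sqrt(sum((Z(i+1:N,:) - repmat(Z(i,:), N-i, 1)).^2, 2));  % distances to later points
    for m = 1:length(epsilons)
        C(m) = C(m) + sum(d < epsilons(m));
    end
end
C = C / nchoosek(N, 2);                    % normalise as in Eq. (3.3)
loglog(epsilons, C, '.-');                 % look for a linear scaling region
xlabel('\epsilon'); ylabel('C_N(\epsilon)');

The slope of the linear portion of this curve is the Grassberger-Procaccia estimate of d_c; as the discussion above makes clear, identifying that portion is the subjective step.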
Fig. 3.2 Correlation dimension from the distribution of inter-point distances. The logarithm of the distribution of inter-point distances (upper panel), and an approximation to the derivative (lower panel) for one of our sets of data embedded in three dimensions. The approximate derivative is a smoothed numerical difference. This calculation used the same data set as Fig. 3.1, embedded in 3 dimensions with a lag of 29 data points (580 ms). Even with well behaved data and a smooth approximately monotonic distribution of inter-point distances, the choice of scaling region is still subjective.

In Fig. 3.1 we illustrate an embedded time series (this data is actually experimental measurements of infant respiration, measured with inductance plethysmography at the abdomen, during quiet sleep). Figure 3.2 illustrates the distribution of interpoint distances for that data set. Clearly a scaling region is present, but only over a fairly limited range. Moreover, the limitations of this scaling region, due to system noise, are clearly in evidence. The correlation dimension, estimated with the Grassberger-Procaccia technique, or one of the alternatives to be described below, is only one of a family of dimensions. That family arises by rewriting the correlation sum (i.e. the discrete version of the correlation integral) as
C_N^{(q)}(\epsilon) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{1}{N-1} \sum_{\substack{0 < j \le N \\ j \ne i}} I(\|z_i - z_j\| < \epsilon) \right]^{q-1}. \qquad (3.5)
Note that for q = 2 we obtain Eq. (3.3), and the correlation dimension d_c arises from C_N^{(2)}(\epsilon) \propto \epsilon^{d_c}. Indeed for all integers q > 0 the dimension d_q is defined from C_N^{(q)}(\epsilon) \propto \epsilon^{(q-1)d_q}. In fact this is the reason that the correlation dimension is often denoted as d_2. We will ignore this larger family of dimensions, except to note that d_1 is also well defined, and often referred to as the information dimension. At the other end of the scale, d_0 is called the box counting dimension and it is particularly easy to calculate for well known fractals (such as those illustrated in the previous chapter). For q = 0 one may alternatively define the box-counting dimension d_b = d_0 as

d_b = \lim_{\epsilon \to 0} \frac{\log N_r(\epsilon)}{-\log \epsilon}, \qquad (3.6)
where N_r(\epsilon) is the number of spheres (equivalently, bounding boxes) of radius (side length) \epsilon that are required to completely cover the data. It is a standard (and straightforward) exercise in fractal geometry to compute this quantity for each of the fractals illustrated in Figs. 2.1, 2.2 and 2.4 (you may do this to confirm the values provided by the author). One can clearly see that d_b, d_c and d_1 need not coincide, and in specific situations each of these measures may excite more interest; in general d_b \ge d_1 \ge d_c. One should note that d_b is easy to calculate for fractals generated from simple iterative rules (such as those described in the previous chapter), whereas d_c is useful when dealing with experimental data, and in particular, time series. The primary shortcoming of the Grassberger-Procaccia algorithm for estimating d_c is that it assumes that the observed data does represent a finite dimensional set. In the next two sections, we describe alternative algorithms which explicitly allow and compensate for the presence of noise in the observed data. Both these methods are still restricted to additive Gaussian noise, but, nonetheless, offer a superior approach to the above algorithm.
3.3 Judd's algorithm
As Judd [50] points out, there are several problems with the procedure described in the previous section. The most obvious of these is that the choice of the scaling region is entirely subjective (see Fig. 3.2). For many data sets, a slight change in the region used can lead to substantially different results.

Fig. 3.3 A "Cantor-like" set. As described by Judd: some portion of the set is simply a Euclidean space of some dimension, corresponding to system noise and finite periodic dynamics (not shown). Some other portion of the set exhibits the same constructive method as the Cantor set in Fig. 2.1.

Judd assumes that locally the attractor can be modelled as the Cartesian cross product of a bounded connected subset of a smooth manifold and a "Cantor-like" set: see Fig. 3.3. Judd demonstrates that for such objects (which include smooth manifolds and fractals), a better description of C_N(\epsilon) is that, for \epsilon less than
some \epsilon_0,

C_N(\epsilon) \approx \epsilon^{d_c}\, q(\epsilon), \qquad (3.7)

where q(\epsilon) is a polynomial of order t, the topological dimension of the set. The topological dimension is the lowest dimension t such that any open cover of the set has a refinement of order no more than t + 1. Roughly, this means that the topological dimension t is the smallest dimension such that any small part of the set can be completely unfolded without spurious self intersections. Consequently we consider correlation dimension d_c as a function of \epsilon_0 and write d_c(\epsilon_0), and call this the dimension at the scale \epsilon_0.

Fig. 3.4 Scale dependent correlation dimension for Ikeda time series data. The nine panels are plots of the correlation dimension d_c(\epsilon) against log(\epsilon) for embedding dimension from 2 to 10. The data consists of 1500 observations of the Ikeda system with 10% observational noise. For low embedding dimensions d_e = 2, 3, 4 we see a good agreement with our understanding of correlation dimension: for moderate length scales \epsilon we see a correlation dimension d_c(\epsilon) of about 1.8. For smaller length scales the noise dominates and d_c(\epsilon) increases. For d_e > 3 we still see the desired effect (i.e. d_c(\epsilon) of about 1.8 for moderate length scales) and noise effects dominating for short length scales. At the largest length scales we see irregular behaviour, due to the sparsity of the embedded points in large embedding dimensions.
Fig. 3.5 Scale dependent correlation dimension for Rossler time series data. The nine panels are plots of the correlation dimension d_c(\epsilon) against log(\epsilon) for embedding dimension from 2 to 10. The data consists of 1500 observations of the Rossler system with 10% observational noise. As we saw in Fig. 3.4, the results presented here are also precisely what we would expect: at large length scales the correlation dimension dwells at around 2, while for smaller length scales noise effects dominate. The amount of information one can obtain from plots such as these is clearly dependent on the amount of noise in the signal.
Fig. 3.6 Scale dependent correlation dimension for infant respiratory data. The nine panels are plots of the correlation dimension d_c(\epsilon) against log(\epsilon) for embedding dimension from 2 to 10. The data are illustrated in Fig. 3.1. For large length scales we observe a low dimensional attractor: at the largest length scales, this appears as a point set (dimension 1). For moderate length scales the dimension of the attractor is around 2, for smaller length scales the noise dominates, yielding a peak correlation dimension of about 6. It is interesting that the noise in this system manifests as (even for large embedding dimensions) a fairly low dimensional system. The deterministic dynamics, with a dimension of 1-2, indicate the deterministic pseudo-periodic behaviour observed in the data.

The Grassberger-Procaccia method assumes that C_N(\epsilon) \propto \epsilon^{d_c}, but this new method allows for the presence of a further polynomial term that takes into account variations of the slope within and outside of a scaling region. This new method dispenses with the need for a scaling region and substitutes a single scale parameter \epsilon_0. This has an interesting benefit. For many natural objects, the dimension is not the same at all length scales. If one observes a large river stone, its surface at its largest length scale is very nearly two-dimensional, but at smaller length scales one can discern the details of grains which add to the complexity and increase the dimension at smaller scales. A bold generalisation of this analogy, attributed to Grassberger [60], is: What is the dimension of spaghetti? "Zero when seen from a long distance, two on the scale of the plate, one on the scale of the individual noodles and three inside a noodle" [60]. Consequently, it is natural to consider dimension d_c as a function of \epsilon_0 and write d_c(\epsilon_0). By allowing our dimension to be a function of scale we produce estimates that are both more accurate and more informative. Moreover, we can avoid some of the approximation necessary to define correlation dimension as a single number and we can extract more detailed information about the changes in dimension with scale. For an alternative treatment of this algorithm see, for example, [49]. Unlike previous estimation methods, this new algorithm recognises that
the dimension of an object (its structural complexity) may vary depending on how closely you examine it. Hence the value of the estimate of correlation dimension may change with scale. It therefore offers a more informative and appropriate estimate of dimension. Computing correlation dimension d_c as a function of scale, d_c(\epsilon_0), can tell us much more about the structure of an object; for example, it can indicate the presence of large scale "periodic" motion and simultaneously detect smaller scale, higher dimensional, "chaotic" motion and noise. Quoting a single number as the correlation dimension of a data set ignores much of this information; in many respects it produces an "average dimension". Plots of dimension as a function of scale are particularly important when studying complex physiological behaviour because they yield far more information than a single estimate at a fixed scale. In Figs. 3.4 and 3.5 we illustrate the estimation of correlation dimension using this method for data from the Ikeda and Rossler systems. From these figures, we notice that, at moderate length scales, the same values of correlation dimension may be extracted as are anticipated by the Grassberger-Procaccia technique. The strength of Judd's algorithm is that for large length scales we can detect sparsity of data problems due to the embedding dimension, and at low length scales the noise effects are evident, and genuinely accounted for. In Fig. 3.6 we apply the same technique to infant respiratory traces. In the following section, we consider such data in more detail.
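Judd's estimator itself fits the form (3.7) to the distribution of inter-point distances below each \epsilon_0, and we refer the reader to [50] for that procedure. As a rough, purely illustrative stand-in for the idea of a scale dependent dimension, one can simply smooth the empirical log-log correlation sum and read off its local slope at each scale; the MATLAB sketch below does this using the epsilons and C computed in the earlier fragment. This is not Judd's algorithm, and the smoothing window is an arbitrary choice.

% Crude scale dependent slope of the correlation sum (not Judd's algorithm).
% Assumes epsilons and C from the correlation sum sketch above.
keep = C > 0;                                      % discard empty bins
le = log(epsilons(keep));  lc = log(C(keep));
slope = diff(lc) ./ diff(le);                      % raw local slope
w = 3;                                             % half width of smoothing window
sm = conv(slope, ones(1, 2*w+1)/(2*w+1), 'same');  % simple moving average
mid = 0.5*(le(1:end-1) + le(2:end));               % scale at which each slope applies
plot(mid, sm, '.-');                               % compare with Figs. 3.4-3.6
xlabel('log \epsilon_0'); ylabel('local slope d_c(\epsilon_0)');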
3.4 Application: Distinguishing sleep states by monitoring respiration
In this application we consider the deterministic dynamics5 evident in infant breathing using correlation dimension estimation via Judd's algorithm. Before considering the data analysis problem we provide some useful background and motivation. Any model that describes the control of respiration must account for two important behaviours: rhythm and pattern generation. There are two distinct approaches to the problem of rhythm: the first requires the existence of discrete "pacemaker" cells with intrinsic activity that drives other respiratory neurons.
5 The fact that most infants continue to breathe provides a trivial proof that on some level this dynamical system is deterministic.
The output of various respiratory centres or pools of motor neurons is then organised by a pattern generator. A second approach implies that networks of cells with oscillatory behaviour interact in a complex way to produce respiratory rhythms which are either further organised by a pattern generator or might be self-organising [34]. The first model is in effect a linearly coupled model with pattern generation dependent upon feedback mechanisms from the various components of the respiratory system, whereas the second is more likely to be complex. Advances in neurobiology have allowed recordings to be made from individual neurons and groups of neurons in the brain. Using these techniques, various studies have demonstrated that the concept of discrete respiratory centres made up of neurons with specific functions defined by the nature of a particular "centre" is obsolete [34]. Whilst there is organisation of neurons into functional networks or pools, these are not necessarily anatomically discrete. Also, there are conflicting data in regard to the presence of a specific pattern generator. Given the complexity of the connections between the various groups of oscillating, respiratory-related neurons, and the capacity for interactions between simple oscillating systems to produce complex behaviour, we have argued that information about the organisation of respiratory control can be determined using dynamical systems theory. In essence, the argument that there is a simple "pattern generator" that co-ordinates the output from various "respiratory centres" is unnecessary if the output from interacting networks is dynamical and self-organising. In order to examine such a hypothesis, respiration must first be adequately shown to be "chaotic", then examined further to describe its dynamical structure. A full discussion of this study and a review of previous works is provided by [126]. From the dimension estimates we present here we are able to conclude that the dynamics of breathing during quiet sleep are consistent with a large scale, low dimensional system with a substantial small scale, high dimensional component, i.e. a periodic orbit with a few (perhaps two or three) degrees of freedom supplemented by smaller, more complex deterministic fluctuations. A description of our data collection protocol follows. Using standard non-invasive inductive plethysmography techniques we obtained a measurement proportional to the cross sectional area of the chest or abdomen, which is a gauge of the lung volume. The present study collected measurements of the cross-sectional area of the abdomen of infants during natural sleep.6
6 The study was approved by the Princess Margaret Hospital ethics committee.
Ten healthy infants were studied at 2 months of age, in the sleep laboratory at Princess Margaret Hospital. Data from 15 infants (including the 10 studied at 2 months) between 1 and 6 months of age was recorded and is used in our calculations. The unfiltered analogue signal from a NIMS Respitrace+ (Non-Invasive Monitoring Systems (NIMS) Inc; trading through SENSORMEDICS, Yorba Linda, CA, USA) inductance plethysmograph was passed through a DC amplifier and 12 bit analogue to digital converter (sampling at 50 Hz). The digital data was recorded in ASCII format using LABDAT and ANADAT software packages (RHT-INFODAT, Montreal, Quebec, Canada) installed on an IBM 286 compatible microcomputer. The only practical limitation on the length of time for which data could be collected is the period that the infant remains asleep and still. The cross sectional area of the lung varies with the position of the infant. However, in this study we are interested only in the variation due to the breathing and so we have been careful to avoid artefact due to changes in position or band slippage. We have made observations of up to two hours that are free from significant movement artefacts, although typically observations are in the range five to thirty minutes. All 27 observations used to calculate dimension are between 240 and 360 seconds; those used to identify CAM are between 400 and 1400 seconds. The abdominal signal is not necessarily proportional to lung volume. Takens' embedding theorem [144] only requires a diffeomorphism (a smooth function) of a measurement of the system. Moreover, present methods are not capable of dealing well with multichannel data and therefore use of both rib and abdominal signals to approximate actual lung volume is difficult. Of the available measurements we found that the abdominal cross section was the easiest to measure experimentally. The 27 observations used to calculate dimension were selected based on sleep state (quiet, stage 3-4 sleep) and then on the basis of sufficient stationarity and a minimum of four minutes in length. From each of these, 240 seconds of stationary data (the 240 seconds which had the most stationary moving average) was used to calculate dimension. Data for RARM calculations were selected based on being at least 400 seconds in length and in a state of uninterrupted quiet sleep. The results of the calculations of d_c(\epsilon_0), as shown in Fig. 3.7, can be summarized as follows. All calculations fall into two broad categories.
Fig. 3.7 Correlation dimension estimates. Correlation dimension estimates for one representative data set from each of the ten subjects. Any data sets that produced dimension estimates dissimilar to those illustrated here are discussed in the text (see Sec. 3.4). The plots are of scale (-log \epsilon_0) against correlation dimension with confidence intervals shown as dotted lines (often indistinguishable from the estimate). Correlation dimension estimates were produced for embedding dimensions of 2, 3, 4, 5, 7 and 9 for all data sets except subjects 2, 4, and 7. Subjects 4 and 7 failed to produce an estimate for the 9 dimensional embedding. Subject 2 did not produce an estimate when embedded in 3 or 9 dimensions. All other dimension estimates are illustrated; higher embedding dimension produces larger correlation dimension.
Fig. 3.8 Dimension estimate for subject 8. One of the data sets used in our analysis. The periodic breathing caused the dimension estimates (the dimension estimates used embedding dimensions of 2, 3, 4, 5, 7, and 9) at large scale to increase.
Most of the estimates of d_c(\epsilon_0) produced curves that increase, more or less linearly, with decreasing scale log \epsilon_0, but some showed an initial decrease in dimension before increasing with decreasing scale (Fig. 3.7, subjects 1 and 4). For any particular data set, it was generally found that the graph of d_c(\epsilon_0) was shifted to higher dimensions as the embedding dimension was increased, although the shape of the graph varied little with changes in the embedding dimension. In nearly all cases the dimension estimates at the largest scale lay between two and three. The more or less linear increase in dimension with decreasing scale \epsilon_0, and the shift to higher dimensions as the embedding dimension is increased, are both indications that the system, or measurements, have a substantial component of high-dimensional dynamics, or noise, at small to moderate scales. The increase in dimension with decreasing scale is an obvious effect of high-dimensional dynamics or noise. The shifting to higher dimensions with increasing embedding dimension occurs because in a higher-dimensional embedding, the points "move away" from their neighbours and tend to become equidistant from each other, which in effect amplifies, or propagates, the small scale, high-dimensional properties to large scales. (This effect is related to the counter-intuitive fact that spheres in higher dimensions have most of their volume close to their surfaces rather than near their centres, as is the case in two and three dimensions.) Some of the dimension estimates, particularly in two and three dimensions, produced curves which increased linearly for large length scales, but appeared to level off as the length scale decreased. For most of the estimates we have computed, this is the case when the data is embedded in two dimensions. Furthermore, for these embeddings in two dimensional space, the correlation dimension estimate seemed to approach two.
Fig. 3.9 Dimension estimate for subject 2. One of our data sets along with the dimension estimates (shown are the estimates with an embedding dimension of 2, 3, and 4). Note the large sighs during the recording and the corresponding increase in the dimension estimate at moderate scale. Another data set from the same infant exhibited similar behaviour and produced a similar dimension estimate.
This indicates that as we look "closer" at the data (that is, at a smaller length scale), it appears to fill up all of our embedding space. For many of the dimension estimates (Fig. 3.7, subjects 7 and 9) the embedding in three dimensions also levelled at values slightly less than three. This behaviour can be attributed to an attractor with correlation dimension of approximately 2.8 to 2.9. However, it is probably more likely that this too is simply due to the data "filling up" the three dimensional space. This is consistent with the results of our false nearest neighbour calculations which suggested that three or four dimensional space would be required to successfully embed the data. There is one particular estimate which appeared to behave quite differently to all the others. Some of the curves of the estimates for subject 8 appeared to increase, decrease, and then increase again. This could indicate that as we look closer at the structure there is some length scale for which the embedding structure seems to be relatively high in dimension, whilst by looking at an even smaller length scale the behaviour has significantly lower dimension. These observations are supported by what we can observe directly from the data. This time series includes an episode of periodic breathing, increasing the complexity of the large scale behaviour (see Fig. 3.8). Similarly, some of the data sets for subject 2 include large sighs causing the dimension estimate to increase at large scales (see Fig. 3.9). Finally, the remainder of the estimates (for example Fig. 3.7, subjects 1, 2, 4, 6, 7, 8 and 10) behaved in yet another manner.
These estimates are approximately constant for a small range of large length scales and gradually increase over smaller length scales. The estimates at large length scales were generally about two to three, indicating that the large scale behaviour is slightly above two dimensional. The increase in dimension estimate for smaller length scales can again be attributed to either noise or high dimensional dynamics. However, the scale of "small scale structure" in the dimension estimates is at a larger scale than the instrumentation noise level. Typically the smallest scale is log_e(\epsilon_0) of about -2.5, a scale of approximately 5% of the attractor (e^{-3} = 0.049787, or roughly 0.05). The digitised signal will typically use at least 10 bits of the AD converter (2^{-10} = 1/1024 < 0.001); other sources of instrumental error are certainly at levels less than 5%. The approximately two dimensional behaviour is probably due to the inspiration/expiration cycle along with breath to breath variation within that cycle. This is easily visualised as the orbit of a point around the surface of a torus. A dimension estimate of two could indicate that the attractor was any two dimensional surface; the embedded data, however, has an approximately toroidal shape. In this motion there are two characteristic cycles: firstly the motion around the centre of the torus, and secondly a twisting motion around the surface. Our estimates of slightly over two indicate that this behaviour is complicated further by some other roughness over the surface of the torus. The shape of such an attractor would very closely resemble the textured surface of a doughnut. To describe a complex dynamical system as "chaotic" may not be helpful when the components that produce the structure cannot be described. Understanding those components is more important than simply classifying the system as chaotic. However, before the mechanism can be understood it is important to determine basic properties of the system: is it dynamic; is it nonlinear; is it chaotic? Subsequent to this study we have therefore investigated the structure of the dynamics we described in greater detail [126]. We use a new form of linear modelling to expose a possible source of our multi-dimensional periodic orbit. By applying a reduced autoregressive modelling (RARM) algorithm to some data from sleeping infants we observe cyclic amplitude modulation (CAM) indicative of substantial structure in breath to breath variability. These calculations lead us to conclude that respiration in quiet sleep is most probably controlled by a multi-dimensional oscillator supplemented by small scale complex (chaotic) behaviour. Moreover, our results are inconsistent with a noise driven one dimensional model of eupnea (or any other model that does not include substantial structure in breath to breath variability).
Fig. 3.10 Gaussian Kernel Correlation Integral for Ikeda time series data. The three panels (from top to bottom) show: the correlation dimension; the correlation entropy; and, the noise level estimated from 1500 points of the x-component of the Ikeda system contaminated with 10% Gaussian observational noise. For each plot, the x-axis is the embedding dimension d_e. The embedding lag \tau = 1.
3.5 The Gaussian Kernel algorithm
The Gaussian Kernel Algorithm takes an entirely different approach to modelling the noise in a signal. We return to the original definition of correlation dimension, i.e. Eqs. (3.3) and (3.4). Since the observations x_n are contaminated by noise, one cannot know z_n precisely. Therefore, computation of I(\|z_i - z_j\| < \epsilon) in Eq. (3.3) is actually somewhat fuzzy. Rather than adopting contemporary density estimation to improve the estimate of the distribution [118], one can model this uncertainty by replacing the hard indicator function I(\cdot) with a continuous one. The choice (implied by its title) of the Gaussian Kernel algorithm is the Gaussian basis function \exp\left(-\|z_i - z_j\|^2 / (4h^2)\right).
Fig. 3.11 Gaussian Kernel Correlation Integral for Rossler time series data. The three panels (from top to bottom) show: the correlation dimension; the correlation entropy; and, the noise level estimated from 1500 points of the x-component of the Rossler system contaminated with 10% Gaussian observational noise. For each plot, the x-axis is the embedding dimension d_e. The embedding lag \tau = 8 was chosen to correspond to one-quarter of the pseudo-period of the data.
Details of this algorithm are described by [24] and an efficient implementation of this technique is presented by Yu [167]. Essentially, one can begin by generalising the correlation integral in (3.5) as

T_N^{(q)}(h) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{1}{N-1} \sum_{\substack{0 < j \le N \\ j \ne i}} \phi\!\left(\frac{\|z_i - z_j\|}{h}\right) \right]^{q-1}, \qquad (3.8)

where the bandwidth h takes a role analogous to \epsilon in Eq. (3.5). Clearly one could use other functions \phi(\cdot) as the kernel function, provided they have finite (bounded) support. For any such kernel function one can show that the correlation dimension scaling law described above holds [37; 24]:

T_N^{(q)}(h) \propto h^{(q-1)d_q}. \qquad (3.9)
Fig. 3.12 Gaussian Kernel Correlation Integral for experimental data. The three panels (from top to bottom) show: the correlation dimension; the correlation entropy; and, the noise level estimated from 1500 points of the infant respiratory data illustrated in Fig. 3.1. For each plot, the x-axis is the embedding dimension d_e. The embedding lag \tau = 8 was somewhat less than the one-quarter period criterion would suggest (29). Correlation dimension estimation is in reasonable agreement with Judd's algorithm, and the entropy calculation is consistent with the results for the Rossler system. We note that the noise level \sigma is negative: this is actually an oddity of our optimisation algorithm (the objective function is a function of \sigma^2 only; in this case the optimisation routine was kicked into the negative domain of \sigma).
However, when using \phi(x) = e^{-x^2/4}, one finds that [24]

(3.10)

when \sqrt{h^2 + \sigma^2} \to 0 and d_e \to \infty. In (3.10), \phi is a normalisation constant, K is the correlation entropy and \tau and d_e are the usual uniform embedding parameters. The noise level \sigma = \sigma_n / \sigma_s, where \sigma_n is the standard deviation of the additive Gaussian noise in the signal and \sigma_s is the standard deviation of the observed signal (including the noise component). Obviously, if \sigma > 0, \sqrt{h^2 + \sigma^2} \neq 0, but one still hopes (and indeed it is often the case) that
(3.10) holds. By estimating T_N(h) for a range of embedding dimensions d_e, one can estimate each of the parameters d_c, K, and \sigma. The benefit of this algorithm is therefore that we are able to simultaneously measure d_c, K and \sigma. Moreover, this relatively robust technique correctly accounts for noise, if that noise is Gaussian and additive. In Figs. 3.10 and 3.11 we illustrate this process for noisy time series data from the Ikeda and the Rossler systems. In each case, despite limited data and substantial noise contamination7 we obtain robust and reasonable results. For the Ikeda and Rossler data, the estimates of noise level are very good, and both algorithms provide consistent and reasonable estimates of correlation dimension for a moderate range of embedding dimensions. However, the Rossler example fails to provide a reasonable estimate of the correlation entropy (for a chaotic system, this should be positive). The likely cause of this problem is that the chaotic dynamics of the Rossler system occur on a slower time scale (on short time scales, the system is "almost periodic"). The application to experimental data yields a reassuringly small estimate of noise (about 2.5-3%), and entropy results consistent with those for the Rossler system. Consistent with the results using Judd's algorithm, the correlation dimension of this data is low: between 1 and 2 for small embedding dimensions, and bounded by about 6. Finally, we note that, as always, the usual caveat holds: one should carefully check that the fit of Eq. (3.10) to the distribution (3.8) obtained from the time series is actually good.
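The quantity at the heart of the method is simply the correlation sum with the hard indicator replaced by a Gaussian weight. The MATLAB sketch below evaluates this kernel sum for an embedded data matrix Z over a vector of bandwidths h; extracting d_c, K and \sigma then requires fitting the asymptotic form (3.10) over a range of embedding dimensions, for which we refer to [24; 167]. The function name and the kernel normalisation follow our reading of the definition above and should be checked against those references.

function T = gaussian_kernel_sum(Z, h)
% GAUSSIAN_KERNEL_SUM  Correlation sum with a Gaussian kernel (cf. Eq. (3.8), q = 2).
% Z : N-by-de matrix of embedded points;  h : vector of bandwidths.
N = size(Z, 1);
T = zeros(size(h));
for i = 1:N-1
    d2 = sum((Z(i+1:N,:) - repmat(Z(i,:), N-i, 1)).^2, 2);  % squared distances to later points
    for m = 1:length(h)
        T(m) = T(m) + sum(exp(-d2 / (4*h(m)^2)));           % Gaussian weights replace the indicator
    end
end
T = T / nchoosek(N, 2);        % same pairwise normalisation as Eq. (3.3)
end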
3.6 Application: Categorising cardiac dynamics from measured ECG
In this application, we employ the Gaussian kernel algorithm to probe for nonlinearity in human electrocardiogram recordings during normal (sinus) rhythm, ventricular tachycardia (VT) and ventricular fibrillation (VF). We observe that sinus rhythm and VT exhibit a correlation dimension of approximately 2.3 and 2.4 respectively. The correlation dimension of VF exceeds 3.2. The entropy of sinus rhythm, VT and VF is approximately 0.69 nats/sec, 0.55 nats/sec and 0.67 nats/sec respectively.
7 Applying these algorithms to copious clean data is not really of interest to us. We are attempting to show that reasonable estimates can be obtained from limited noisy data. The results we present here are, therefore, imperfect but perhaps more typical of experimental time series data.
Fig. 3.13 Spontaneous evolution of VT and VF. This recording shows the spontaneous evolution from sinus rhythm to VT and subsequently to VF in a cardiac patient. On each panel the horizontal axis is time in seconds and the vertical axis is ECG surface voltage (in millivolts). The recordings are contiguous. The rhythm during the top panel is predominantly sinus, during the second panel this changes to VT, and in the third panel to VF.
In broad terms, these results indicate that techniques from nonlinear dynamical systems theory should help us understand the mechanism underlying ventricular arrhythmia, and that these rhythms are likely to be a combination of low dimensional chaos and noise. Data showing the ECG waveform during spontaneous evolution of ventricular arrhythmia in humans is rare. It is impossible to predict when arrhythmia will occur, and it is both impractical and unethical to induce life-threatening arrhythmia (especially as we are interested in observing spontaneous evolution of these rhythms). For this reason we have established a unique data collection facility [134] in the Coronary Care Unit (CCU) of the Royal Infirmary of Edinburgh. This apparatus is based on early efforts by Clayton and colleagues [23].
Fig. 3.14 Recordings of sinus rhythm, VT and VF. This figure shows three representative recordings used in this study: (a) sinus rhythm, (b) VT, and (c) VF. Each of these recordings is from a different subject; all three recordings are included in the analysis described in this paper. On each panel the horizontal axis is the datum number (at 500 Hz) and the vertical axis is ECG surface voltage (in millivolts).
A single computer linked to the hospital data network continuously monitors ECG waveform data from all patients in the CCU and adjacent Cardiology Ward. When the proportion of ECG waveform power in the 2.5 Hz to 6 Hz range exceeds 75% of the signal power, data is recorded together with 10 minutes of data preceding the trigger event. The ECG data is recorded at 500 Hz and 10 bits resolution; Fig. 3.13 shows a section of one such recording. From our data bank of recordings of ECG during various rhythms we selected 20 time series recordings of VF and 30 of VT from 11 subjects. We also selected a further 31 traces showing sinus rhythm prior to onset of arrhythmia. Each recording was 10000 data points (20 seconds) in length. Episodes of VT and sinus rhythm were selected on the basis of stationarity [166] and were non-overlapping and non-contiguous. Recordings of VF were selected from the limited available data and a single episode of VF was split, if possible, into multiple non-overlapping 10000 point episodes.
Table 3.1 Results of GKA and surrogate calculations. This table contains a summary of the results of applying the calculations described in this paper to 81 ECG recordings. Each row summarises the results for a particular rhythm: sinus rhythm, VT and VF. For each rhythm the mean values of D, K and \sigma, along with the standard deviation from the mean, are given. Entropy is given in units of nats/sec, and noise is the estimated standard deviation of the noise component expressed as a fraction of the standard deviation of the signal (i.e. \sigma in Eq. 5.1). Also shown are the total number of recordings that were found to be inconsistent with each of the specified null hypotheses. After recalculation with reduced embedding lag all surrogate hypotheses were rejected for all data, with the exception of 16 episodes of sinus rhythm and the cycle shuffled hypothesis. The reason for this is discussed in the text.

        | n  | dimension D | entropy K | noise σ      | Alg. 0 | Alg. 1 | Alg. 2 | cycle shuff.
  sinus | 31 | 2.3±0.36    | 0.69±0.29 | 0.040±0.013  | 31     | 31     | 30     | 13
  VT    | 30 | 2.4±0.35    | 0.55±0.19 | 0.031±0.009  | 30     | 30     | 27     | 24
  VF    | 20 | 3.2±0.52    | 0.67±0.33 | 0.054±0.024  | 20     | 20     | 15     | 14
Hence recordings of VF may represent arrhythmia an unspecified length of time after onset (i.e. physicians often distinguish between early and late VF; we made no distinction). Figure 3.14 shows representative recordings of each rhythm from different patients. For each of the 81 data sets in this study we computed the GKA estimate of correlation dimension D, entropy K and noise level \sigma for embedding lag k = 4 and embedding dimension m = 2, 3, ..., 20. If the algorithm did not converge (as happened in 5 of the 81 recordings), then the calculation was repeated with k = 2. For each data set 30 surrogates of each type (algorithm 0, algorithm 1, algorithm 2 and cycle shuffled surrogates) were generated and the GKA algorithm applied to the surrogates (with the same values of m and k as used for the data). Thus, for each data set (of the 81 examined here) 120 surrogates were generated and the GKA algorithm applied 121 times (with m and k as described above). For each group of 30 surrogates, the mean and standard deviation of the ensemble of statistic values were computed, as was the distance (in units of standard deviations) between that mean and the statistic value for the data. If this exceeded three standard deviations, the null hypothesis was rejected (at the 98% confidence level). Otherwise the null was not rejected. If the null was rejected for at least one of the three statistics, D, K and \sigma, the data set was determined to be significantly different from the null hypothesis. Under this scheme the probability of a false rejection is p < 0.05. Figure 3.15 shows a representative calculation. Table 3.1 summarises the results of this calculation.
Fig. 3.15 GKA applied to data and surrogates. This figure depicts typical results of the GKA and surrogate techniques as described in this paper. Each column of panels shows estimates of a different quantity: correlation dimension D, entropy K, and noise level \sigma. Each row shows the estimates for a different data set: sinus rhythm, VT and VF. The three data sets used are those depicted in Fig. 3.14. On each panel the solid blue line is the estimated value of D, K or \sigma for embedding dimensions m = 5, 6, ..., 20. Errorbars for the estimates of D, K and \sigma are also plotted. The dotted red lines are the estimates for the surrogates. Note that for each data set the data and surrogate values are distinct for at least one of D or K. Surrogates for \sigma are not shown, as \sigma is a linear statistic and not independent of the surrogate generation method. Clear rejection is evident for sinus and VT rhythms. The result shown here for VF is more marginal. Recalculation with k = 2 provided clear evidence for the rejection of the null hypothesis in this case.
These results show that the correlation dimension increases between sinus rhythm, VT and VF (2.3 to 2.4 and to 3.2). The entropies of time series recordings during sinus rhythm and VF are comparable (about 0.68) and substantially higher than during VT. The estimated noise level from all recordings is reassuringly low (overall mean noise was 0.04). Estimated noise for individual data sets ranged from 0.02 to 0.08, with the most noise observed during VF. Positive values of entropy and consistent non-integer values of correlation dimension are consistent with the hypothesis that the cardiac system exhibits deterministic chaos during sinus rhythm, VT and VF.
This is in direct support of our earlier results for induced swine VF [141; 168]. Recently we have also observed a period doubling bifurcation leading to chaos prior to onset of VT and VF in humans, and Rossler type chaos during sinus rhythm prior to VF and during VF [139]. However, the values of correlation dimension reported here are significantly lower than those observed in [134; 141] using a different algorithm [50; 51] and in [168] during induced VF in swine. The difference between observation of induced VF in pigs and spontaneous VF in humans may be due to two factors. Firstly, although the records under examination in [141; 168] were also 10000 data points in length, the sampling rate was lower (300 Hz) and the resting heart rate of pigs is higher than that of humans. Hence the record under examination covers a substantially longer trajectory in phase space. Furthermore, the noise level observed in human data (mean 4%, range 2%-8%) is also substantially lower than that measured (with the GKA) in swine data (range 10%-20%) [168]. Secondly, in the swine experiments VF was artificially induced in otherwise healthy animals. The human subjects used in this work are all seriously ill, and VF occurred naturally. The lower correlation dimension estimates may therefore be a result of the different physiological setting. It must also be reiterated that the results of the GKA (the current work and [168]) are not directly comparable to the results of Judd's algorithm described in [134; 141]. Judd's algorithm provides a correlation dimension curve d_c(\epsilon_0) as a function of "viewing scale" \epsilon_0. Typically this curve increases as the viewing scale decreases and one looks at the value \lim_{\epsilon_0 \to 0} d_c(\epsilon_0). This value will naturally be higher than the mean \langle d_c(\epsilon_0) \rangle_{\epsilon_0}, which is close to the value of D estimated by the GKA. As the discussion in [141] suggests, \lim_{\epsilon_0 \to 0} d_c(\epsilon_0) is affected by system noise, but the GKA treats system noise separately. Correlation dimension estimates from the GKA suggest that the complexity of the system attractor (the number of active degrees of freedom) is largest during VF, intermediate during VT and lowest during sinus rhythm. However, the entropy during VF and sinus rhythm exhibits the same distribution of values, suggesting that the "chaoticness" of the underlying system during sinus rhythm and VF is similar. During VT, the rate of information creation (the "chaoticness" according to entropy) is substantially lower, suggesting that VT is a more orderly and predictable rhythm, and sinus rhythm and VF are equally unpredictable.
This is further confirmation of our earlier observation that both sinus rhythm and VF behave as low dimensional chaos [139]. Some results of the application of nonlinear time series analysis and nonlinear modelling regimes to data collected as described in this paper are presented in [139; 133; 135; 136]. It is reassuring to note that the GKA estimates the system noise from time series recorded in this way to be only about 4%. This is slightly larger than the discretisation due to digitisation but typically comparable. For such a complex and noisy system measured in an active clinical environment this is a very encouraging result. Such an accurate quantitative measurement of the system noise should aid in determining accurate fitting (without over-fitting) of models. The most important implication from this paper for future work is that the human ECG during various physiological rhythms is measurably nonlinear, low-dimensional, and relatively free of noise contamination.
3.7 Even more algorithms
Although we have described three distinct algorithms for estimating a correlation dimension for noisy time series, there are still many refinements we have not considered in detail. Of these, one of the most important was introduced by Theiler [152; 147]. In cases when the data is nonstationary, or there is time dependent (e.g. coloured) noise, the estimation of the correlation integral by (3.3) is flawed. Instead, one should ensure that |i - j| > T_th, where the temporal threshold T_th is usually related to the pseudo-period of the data, or the time scale of the noise process. The reason for this is that measuring correlation between points on the same orbit (or perturbed in the same manner by additive noise) will introduce an artificial bias: so-called trajectory bias. Similarly it has been suggested, since computation of Eq. (3.3) is expensive, that, for large data sets, one should only randomly choose reference points (i.e. select some i and/or j) and not sum over all points in the data sets. In a recent study, Galka and Pfister [36] show that such random sampling (presumably with replacement) avoids a second sort of systematic bias in the correlation integral attributable to regular sampling of a pseudo-periodic system. Of course, the effect of both of these modifications will critically depend on the available data. Clearly, these caveats can be equally applied to each of the three schemes described above.
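In code, the Theiler correction amounts to one extra condition in the pairwise sum. The MATLAB sketch below recomputes the correlation sum of Eq. (3.3) while discarding pairs of points closer than w samples in time; the window length w is a user choice (typically of the order of the pseudo-period), and the function and variable names are ours.

function C = corr_sum_theiler(Z, epsilons, w)
% CORR_SUM_THEILER  Correlation sum excluding temporally close pairs (|i - j| <= w).
% Z : N-by-de matrix of embedded points;  epsilons : vector of length scales.
N = size(Z, 1);
C = zeros(size(epsilons));
npairs = 0;
for i = 1:N-1
    j = (i+w+1):N;                       % only pairs separated by more than w time steps
    if isempty(j), break; end
    d = sqrt(sum((Z(j,:) - repmat(Z(i,:), length(j), 1)).^2, 2));
    for m = 1:length(epsilons)
        C(m) = C(m) + sum(d < epsilons(m));
    end
    npairs = npairs + length(j);
end
C = C / npairs;                          % normalise by the number of admissible pairs
end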
Moreover, in light of these caveats, and the problems associated with estimating the slope of the correlation integral in an irregular scaling region, we would generally recommend looking first at the correlation integral itself, rather than over-interpreting from a functional form fitted to that curve. However, in some situations, as we show above, more complex functional fitting can yield reasonable, and robust, results. In a fairly recent paper, Thiel and co-workers [146] address some of the problems with Grassberger-Procaccia correlation dimension estimation by showing that, in fact, the correlation integral is equivalent to the so-called recurrence plot. A recurrence plot is produced in two dimensions by taking an embedded time series {z_i} and placing a mark at the point (i, j) if \|z_i - z_j\| < \epsilon_{th}, where \epsilon_{th} is some threshold value. Clearly, the set

R(\epsilon_{th}) = \{(i, j) : \|z_i - z_j\| < \epsilon_{th}\} \qquad (3.11)
is related to the correlation integral. In fact, C_N(\epsilon) \approx \frac{2}{N(N-1)} |R(\epsilon_{th})|, where N is the number of embedded data, if \epsilon = \epsilon_{th}. Thiel and co-workers then argue that, if {z_i} is the product of a time delay embedding, the equivalent quantity can be obtained from the original scalar time series, independent of the embedding. Clearly, this is numerically a simpler procedure. The comparative robustness of this algorithm is still under examination.
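A recurrence plot is straightforward to produce directly from the definition (3.11). The MATLAB fragment below marks every pair of embedded points closer than a threshold and displays the resulting binary matrix; the threshold (here a fixed fraction of the data range) is an arbitrary illustrative convention, not one prescribed by [146].

% Sketch of a recurrence plot (Eq. (3.11)) for an embedded data matrix Z.
N = size(Z, 1);
D = zeros(N);                               % pairwise distance matrix
for i = 1:N
    D(:,i) = sqrt(sum((Z - repmat(Z(i,:), N, 1)).^2, 2));
end
eth = 0.1 * (max(Z(:)) - min(Z(:)));        % threshold: 10% of the data range (arbitrary)
R = D < eth;                                % recurrence matrix
spy(R); axis xy;                            % marks indicate recurrences
% Counting the off-diagonal marks and normalising by the number of pairs
% recovers the correlation sum at epsilon = eth.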
In an earlier paper, Raab and Kurths [98] discuss an alternative problem: the estimation of correlation dimension for "large-scale" (i.e. high dimensional) systems. For such systems, they describe the effective estimation of correlation dimension. But, to achieve reasonable results in such scenarios would require substantially more data than we typically obtain from physical systems (N much greater than 10^4). It is worth considering that for such large quantities of data and a sufficiently low dimensional system, one can also obtain excellent results using each of the standard approaches described here. Indeed, Galka and others [35] examined both Judd's algorithm and the standard Grassberger-Procaccia scheme, and found that both techniques can correctly determine correlation dimensions of up to 6 from about 10^5 data. However, they qualify this result by observing that results are "noticeably dependent on the geometrical structure of the system". Moreover, the comparison concluded that the performance of Judd's algorithm equalled or exceeded that of the Grassberger-Procaccia approach, depending on the system under study [35]. In contrast to this, Guerrero and Smith [42] have examined the coherent (that is, self-consistent) estimation of correlation dimension under assumptions which are a generalisation of Judd's algorithm. In general, one can expect the correlation integral to behave like C_N(\epsilon) \propto \phi(\epsilon)\epsilon^{d_c}, where \phi(\epsilon) is some correction factor related to the "lacunarity" of the underlying data set. For Judd's algorithm, one assumes that the data may be described
as the cross product of a multi-dimensional "Cantor-like" set and some Euclidean space. With this assumption, the lacunarity term \phi(\epsilon) = p(\epsilon), where p(\cdot) is a polynomial of order equal to the topological dimension of the data.8 But under an even more general formulation of \phi it is natural to ask which values of correlation dimension are consistent with the distribution we estimate from the data, i.e. what are the "error-bars" on an estimate of correlation dimension. Guerrero and Smith show that, in general, the confidence in an estimate of correlation dimension tends to provide rather wide bounds on acceptable values [42]. When one considers this result, it is worth remembering that all we can really consider here is not estimates of the true correlation dimension, but nonlinear statistics, related to correlation dimension and defined by the algorithms we have described in this chapter. Finally, one further version of dimension which is relevant to the study of time series data is the Richardson dimension [101]. This quantity was originally suggested some time ago to address the problem of determining the length of a coastline: the problem of "How long is the coastline of Britain?".9 Perhaps counterintuitively, the answer will depend on the granularity of the measure. This result is true for fractals, and indeed any curved surface: suppose we choose to measure the length of a curve with some straight edge of length \ell. Then the measured length of the curve, d_\ell \times \ell, will converge to the true length L as \ell diminishes: \lim_{\ell \to 0} \ell d_\ell = L.10 For fractal curves, this limit does not converge as the asymptotic length is infinite. Nonetheless, for some trajectory \{z_i\}_i we can define the length of that trajectory as a function of the granularity of our measure. Suppose we sample \{z_j\} every \kappa time steps; then the length of the trajectory can be defined as L(\kappa) = \sum_k \|z_{k\kappa} - z_{(k+1)\kappa}\|. One can then obtain some characteristic of the trajectory by searching for a linear scaling region in \log L(\kappa) versus \log \kappa. This algorithm has received little recent attention, but by considering the effect of additive noise in a system this technique of quantifying irregularity may well provide a useful alternative to the battery of correlation dimension based statistics.
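The trajectory length statistic is equally simple to compute. The MATLAB sketch below measures the length L(\kappa) of an embedded trajectory sampled every \kappa steps and plots \log L(\kappa) against \log \kappa, in which one would look for a linear scaling region; the range of \kappa shown is an illustrative choice.

% Trajectory length as a function of sampling granularity.
% Z : N-by-de matrix of embedded points.
kappas = 1:50;
L = zeros(size(kappas));
for m = 1:length(kappas)
    k = kappas(m);
    Zk = Z(1:k:end, :);                    % keep every k-th point of the trajectory
    L(m) = sum(sqrt(sum(diff(Zk).^2, 2))); % total length of the coarsely sampled trajectory
end
plot(log(kappas), log(L), '.-');
xlabel('log \kappa'); ylabel('log L(\kappa)');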
8 In practice, 2 seems to be sufficient [51].
9 See [154] for an excellent example of the application of this question to the coastline of Hong Kong.
10 The same procedure was originally used by Archimedes to attempt to calculate the value of \pi by evaluating a line integral around the perimeter of a circle.
We have considered estimation of various nonlinear quantities from data at some length. We now have a broad "battery" of potential nonlinear measures to apply to experimental data (in addition to the standard arsenal of linear measures). In the next chapter we discuss statistical tests, using these invariant statistics, to determine if a data set is inconsistent with specific classes of hypotheses: linear noise, coloured noise, scaled linear noise, or one of many more exotic alternatives.
Chapter 4
The method of surrogate data
Correlation dimension is neat. However, the problem with estimates of correlation dimension (or indeed any dynamic invariant) is that they are not particularly reliable. Take, for example, the Lorenz system. In the chaotic regime its attractor is purported to have a correlation dimension of around 2.06-2.08. Then, for example, based on an exceptionally reliable estimate of 2.07, how certain can we be that an observed time series did not originate from a noisy linear system: perhaps something like the periodic motion on a torus depicted in Fig. 2.5? As it stands, correlation dimension, and the other statistics described in chapter 2, are fine, provided we treat them only as numbers and nothing more. If we wish to interpret them as estimates of invariant quantities of the underlying attractor, we need to be more cautious. Nonlinear measures such as correlation dimension, Lyapunov exponents, and nonlinear prediction error are often applied to time series with the intention of identifying the presence of nonlinear, possibly chaotic behaviour (see for example [13; 110; 156] or [126; 168; 131] and the references therein). Estimating these quantities and making an unequivocal classification can prove difficult, and the method of surrogate data [149] is often employed to provide some rigour and certainty. Surrogate methods proceed by comparing the value of (nonlinear) statistics for the data and the approximate distribution for various classes of linear systems, and by so doing one can test if the data have some characteristics which are distinct from stochastic linear systems. Surrogate analysis provides a regime to test specific hypotheses about the nature of the system responsible for data; nonlinear measures provide an estimate of some quantitative attribute of the system. It is only because nonlinear measures are of particular interest that they are often used as the discriminating statistic in surrogate data hypothesis testing.
In this chapter, we introduce some terminology and review the most standard (i.e. oldest) methods for generating surrogates. In the next chapter we will discuss newer, and more advanced, techniques.
4.1 The rationale and language of surrogate data
The general procedure of surrogate data methods has been described by Theiler and colleagues [148; 149; 150; 151] and, contemporaneously, by Takens [145]. The principle of surrogate data is the following. One first assumes that the data come from some specific class of dynamical process, possibly fitting a parametric model to the data. One then generates surrogate data from this hypothetical process and calculates various statistics of the surrogates and original data. The surrogate data will give the expected distribution of the statistic values and one can check that the original data have a typical value. If the original data have atypical statistics, we reject the hypothesis that the process that generated the original data is of the assumed class. Figure 4.1 gives an example of this procedure. One always progresses from simple and specific assumptions to broader and more sophisticated models if the data are inconsistent with the surrogate data (see Fig. 4.2). The general framework of surrogate data may be described as follows. From an observed time series, we would like to determine whether it is consistent with certain classes of system, or not. Unfortunately, what the best surrogate data techniques can do is show that the data is statistically likely to be inconsistent with certain systems. First, one assumes some simple hypothesis: that the data is generated by a linear noise process, for example. Based on this hypothesis, and from the observed data, one generates an ensemble of data sets which are both "like" the observed data and consistent with the hypothesis under consideration. In the case of linearly filtered noise, it would be desirable1 to generate surrogates which have the same autocorrelation characteristics as the observed data, since autocorrelation (or equivalently, the Fourier power spectrum) completely defines the linear noise process. However, we also require that, apart from sharing the autocorrelation curve of the observed data, the surrogate data sets must be random. One way to generate such data is by fitting a parametric model to the data. Later in this chapter we will introduce more robust techniques.
¹ We will show how to achieve this aim later in this chapter.
Fig. 4.1  The surrogate data procedure: We illustrate the general principle of the method of surrogate data. One begins with a time series of observations of a dynamical system. One wishes to know whether there is any evidence that the data is atypical of a particular distribution (in hypothesis testing, this is the best that can be done — one cannot prove consistency). To test this, surrogate data are generated that are consistent with that hypothesis and also largely "like" the original data. One then compares some statistic measured from the data, d, to the distribution of statistic values for the surrogates {s(n) : n = 1, 2, ..., N}. If the data is atypical of the surrogates, the underlying hypothesis may be rejected as the likely origin of the data. If the statistic measured from the data is not atypical of the surrogates, the hypothesis may not be rejected (this is not equivalent to proving the hypothesis to be true).
inconsistent with the hypothesis, we compare the data to the surrogates. By choosing an appropriate test statistic, we may compare the value of that test statistic for the data and the ensemble of values obtained for the surrogates. If they differ, we may conclude that the surrogates, and
Fig. 4.2  A hierarchy of hypotheses: For surrogate data testing one typically proceeds from simple hypotheses to more complex ones, until one reaches a hypothesis which it is not possible to reject. The three standard linear surrogate tests (Algorithms 0, 1 and 2) correspond to the hypotheses of: i.i.d. noise; linearly filtered noise; and, a monotonic nonlinear transformation of linearly filtered noise.
therefore the underlying hypothesis, are not representative of the data. If there is no significant difference, we may make no conclusion. To see why it is not possible to conclude that the data and surrogates are genuinely the same, consider the rather ill-informed test statistic T(·) = 0. We now need to introduce some formal notation and provide concrete definitions of several terms. Let φ be a specific hypothesis and F_φ the set of all processes (or systems) consistent with that hypothesis. Let Z ∈ R^N be a time series (consisting of N scalar measurements) under consideration,² and let T : R^N → U be a statistic which we will use to test the hypothesis φ that Z was generated by some process F ∈ F_φ. Surrogate data sets S_i, i = 1, 2, ... are generated from Z (and are the same length as Z and are consistent with the hypothesis φ being tested). Generally U ⊂ R and one can discriminate between the data Z and surrogates S_i consistent with the hypothesis, given the approximate probability density p_{T,F}(t) = Prob(T(S_i) < t), i.e. the probability density of T given F.
² For convenience, this is a slight change in notation from earlier in this book. We now denote the time series {z_t}_{t=1}^N simply as Z.
In a sequel to his original work, Theiler [150] suggests that there are two fundamentally different types of test statistics: pivotal and non-pivotal. We follow this definition here.

Definition 4.1  A test statistic T is pivotal if the probability density p_{T,F} is the same for all processes F consistent with the hypothesis; otherwise it is non-pivotal.

As we will see in the following, a prudent choice of test statistic (i.e. a pivotal test statistic) can relieve some of the strain of ensuring that the surrogate generation algorithm generated the "right" surrogates. Clearly, the probability density p_{T,F} will depend on our choice of both T and F. However, we want p_{T,F}(t) = Prob(T(Z) < t | Z is consistent with φ). To consider this situation more precisely, Theiler also differentiates two different types of hypotheses [150]: simple hypotheses and composite hypotheses.

Definition 4.2  A hypothesis is simple if the set of all processes consistent with the hypothesis, F_φ, is singleton.

Definition 4.3
A hypothesis that is not simple is composite.
For a simple hypothesis the method of surrogate data is trivial: we only need to compare several realisations of that particular system to the observed data and decide whether it is typical or not. When one has a composite hypothesis the problem is not only to generate surrogates consistent with F (a particular process) but also to estimate F ∈ F_φ. Theiler argues [150] that it is highly desirable to use a pivotal test statistic if the hypothesis is composite. In the case when the hypothesis is composite, one must specify F — unless the test statistic T is pivotal, in which case p_{T,F} is the same for all F ∈ F_φ. In cases when non-pivotal statistics are to be applied to composite hypotheses (as most interesting hypotheses are), Theiler suggests that a constrained realisation scheme be employed.

Definition 4.4  Let F̂ ∈ F_φ be the process estimated from the data Z, and let S_i be a surrogate data set generated from F̂ ∈ F_φ. Let F̂_i ∈ F_φ be the process estimated from S_i; then a surrogate S_i is a constrained realisation if F̂_i = F̂.

Definition 4.5  If F̂_i ≠ F̂, the surrogate S_i is non-constrained.
Definition 4.6 An algorithm that generates constrained surrogates is constrained. One that does not is non-constrained.
That is, as well as generating surrogates that are typical realisations of a model of the data, one should ensure that the surrogates are realisations of a process that gives identical estimates of the parameters (of that process) to the estimates of those parameters from the data. For example, let φ be the hypothesis that Z is generated by linearly filtered i.i.d. (independently and identically distributed) noise. Surrogates for Z could be generated by estimating (or even guessing) the best linear model (from Z) and generating realisations from this assumed model. These surrogates would be non-constrained. Constrained realisation surrogates can be generated by shuffling the phases of the Fourier transform of the data (this produces a random data set with the same power spectrum, and hence autocorrelation, as the data). Autocorrelation, nonlinear prediction error, or rank distribution statistics (standard deviation or higher moments) would be non-pivotal test statistics: the probability distribution of statistic values would depend on the form of the noise source and the type of linear filter. However, as we describe in Sec. 4.5, correlation dimension or Lyapunov exponents would be pivotal test statistics; the problem is to be able to produce a pivotal estimate of these quantities. The probability distribution of these quantities will be the same for all processes, so exactly what estimate one makes of the linear model and i.i.d. noise source is not important. With this standard terminology at our disposal, we now consider the three most standard surrogate generation schemes.
4.2  Linear surrogates
Different types of surrogate data are generated to test membership of specific dynamical system classes, referred to as hypotheses. The three types of surrogates described by Theiler [149], known as Algorithms 0, 1 and 2, address the three hypotheses: (0) i.i.d. noise, (1) linearly filtered noise, and (2) a monotonic nonlinear transformation of linearly filtered noise. Constrained realisations consistent with each of these hypotheses can be generated by (0) shuffling the data, (1) randomising (or shuffling) the phases of the Fourier transform of the data (this was briefly described in the preceding paragraph), and (2) applying a phase randomising (shuffling) procedure to amplitude-adjusted Gaussian noise. See Fig. 4.2. Surrogates generated by these three algorithms have become known as Algorithm 0, 1 and 2 surrogates. Each of these hypotheses should be
rejected for data generated by a nonlinear system. However, rejecting these hypotheses does not necessarily indicate the presence of a nonlinear system, only that it is unlikely that the data are generated by a monotonic nonlinear transformation of linearly filtered noise. The system could, for example, involve a non-monotonic transformation or non-Gaussian or state-dependent noise. However, before considering the problems posed by nonlinear systems and more exotic types of noise, we will describe each of these three algorithms.
4.2.1  Algorithm 0 and its analogues
Algorithm 0 surrogates test the hypothesis that the data is i.i.d. (independent and identically distributed) noise; therefore, these surrogates must be uncorrelated and yet still be like the original data. In this case, by "like" the original, we mean that the data and the surrogates have the same probability distribution.

Algorithm 4.1  Algorithm 0 surrogates: version 1. The surrogate S_i is created by shuffling the order of the data Z. Generate an i.i.d. data set³ Y and reorder Z so that it has the same rank distribution as Y. The surrogate S_i is the reordering of Z.

This is the algorithm originally proposed by Theiler [149] and techniques equivalent to this have been common in many fields for some time. In finance for example [131], it is common practice to compare data (usually either transformed or modelled in some way) to a shuffled version of itself. Typically, one may build a model of the data and look at the model prediction errors (the residuals). If these are indistinguishable from a shuffled version of themselves, one may conclude that the model captures all the determinism in the system.⁴ The problem with this algorithm is that it may perform too well. The surrogate generation procedure described above can be described as sampling without replacement. This means that the data and the surrogates have exactly the same probability distribution. But this is more than one would expect for random observations of the same system.

³ The i.i.d. data set is usually Gaussian, but it is necessary only to reorder the data Z. Algorithm 0 surrogates are not necessarily Gaussian.
⁴ In fact, this conclusion is somewhat simplistic. Such a model may still be a bad model as it could be overfitting the data. We will discuss this in chapter 6.
A statistically more well-founded test would be sampling with replacement.

Algorithm 4.2  Algorithm 0 surrogates: version 2. The surrogate S_i is created by sampling, with replacement, from the observation Z.

In this version of the algorithm one simply uses the probability distribution of Z as an estimate of the true probability distribution of measurements of the underlying system. S_i is then only an approximation to Z. Despite this more correct version of the algorithm, it is still beguiling to see surrogates and data reproduce exactly the same probability distribution, and the original version is still more common in practice. Moreover, for short data sets (small N) the sample Z may not provide a good approximation to the underlying probability distribution (and sampling with replacement will produce repetitive and highly non-random surrogates). In such situations the original version of the algorithm may perform better. But, for large data sets (and therefore a good approximation to the true probability distribution), sampling with replacement should be preferred.
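Both versions are straightforward to implement. The following is a minimal sketch in Python with NumPy (the function names and defaults are illustrative, not taken from the text).

```python
import numpy as np

def algorithm0_shuffle(z, rng=None):
    """Algorithm 0, version 1: a random permutation of the data
    (sampling without replacement), so the surrogate has exactly the
    same probability distribution as z."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.permutation(z)

def algorithm0_bootstrap(z, rng=None):
    """Algorithm 0, version 2: sampling with replacement from z,
    treating the empirical distribution of z as an estimate of the
    true distribution of the underlying i.i.d. process."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(z, size=len(z), replace=True)
```

For long data sets the two versions behave almost identically; for short ones version 2 will typically contain repeated values, as noted above.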
4.2.2  Algorithm 1 and its applications
Beyond i.i.d. noise, the next step on the hierarchy of hypotheses (Fig. 4.2) is linearly filtered noise. Linearly filtered noise is completely characterised by its autocorrelation function (or equivalently, the power spectrum) and is otherwise random. Algorithm 1 surrogates are generated by randomising the phases of the Fourier transform of the data.

Algorithm 4.3  Algorithm 1 surrogates: version 1. An Algorithm 1 surrogate S_i is produced by applying Algorithm 0 to the phases of the Fourier transform of Z. Calculate the Fourier transform of Z and shuffle the phases by applying Algorithm 0 (be careful to do this pairwise so that the complex conjugate pairings are preserved, otherwise the surrogate will be complex). Take the inverse Fourier transform to produce the surrogate S_i.

Clearly, this algorithm will inherit the same problem as Algorithm 0: by shuffling the phases, one does not introduce sufficient randomisation. A simple palliative measure was anticipated in our previous discussion.

Algorithm 4.4  Algorithm 1 surrogates: version 1.1. Calculate the Fourier transform of Z as before. But rather than shuffling the phases, multiply each complex conjugate pair by a random phase rotation e^{iφ} where
φ ~ U[0, 2π) is a random number uniformly distributed on the interval [0, 2π). Take the inverse Fourier transform to produce the surrogate S_i.

But even performing the test in this way, one still achieves a surrogate with exactly the same power spectrum (and hence autocorrelation) as the original data. However, at this point it is still not clear how best to add a small amount of randomisation (so that the power spectra are only equivalent samples of the same underlying process) to the surrogates, without "overly" randomising them. According to our previous definition of constrained surrogate generation algorithms, we require that there is exact agreement between F̂_i and F̂. In some situations it may be more appropriate to only require that the processes estimated from the data and the surrogates are approximately the same [111] or, more precisely, are drawn from the same probability distribution. Finally, by again referring to the field of finance [131], we note that this is equivalent to the process of estimating the Hurst exponent for shuffled surrogates of a data set.
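A sketch of version 1.1 in Python/NumPy follows (an illustration, not a definitive implementation). Working with the one-sided transform rfft keeps only one member of each complex conjugate pair, so the pairing is preserved automatically; the zero-frequency component (and, for even N, the Nyquist component) is left untouched so that the surrogate remains real with the same mean.

```python
import numpy as np

def algorithm1_surrogate(z, rng=None):
    """Algorithm 1, version 1.1: rotate each Fourier component by a random
    phase e^{i*phi}, phi ~ U[0, 2*pi), keeping the amplitudes -- and hence
    the power spectrum and autocorrelation -- of the data."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z, dtype=float)
    n = len(z)
    zf = np.fft.rfft(z)              # one-sided transform: conjugate pairs kept implicitly
    phi = rng.uniform(0.0, 2.0 * np.pi, size=len(zf))
    phi[0] = 0.0                     # do not rotate the DC component
    if n % 2 == 0:
        phi[-1] = 0.0                # Nyquist component must remain real
    return np.fft.irfft(zf * np.exp(1j * phi), n=n)
```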
4.2.3  Algorithm 2 and its problems
Algorithm 2 surrogates extend the idea of Algorithm 1 surrogates to systems which have a non-Gaussian rank distribution. The hypothesis under consideration is that the data is consistent with a monotonic nonlinear transformation of linearly filtered noise. Linearly filtered noise is addressed by Algorithm 1. But, why a monotonic nonlinear transformation? A linear transformation is trivial.⁵ We choose to test only monotonic nonlinear transformations because they are relatively simple. The Algorithm 2 hypothesis can be tested in two ways. One re-scales the data so that it is Gaussian, and then applies Algorithm 1.⁶ Alternatively, one could use the following algorithm (which, in any case, is equivalent).

Algorithm 4.5  Algorithm 2 surrogates. The procedure for generating surrogates consistent with Algorithm 2 is the following [149]: start with the data set Z, generate a Gaussian data set Y and reorder Y so that it has the same rank distribution as Z. Then create an Algorithm 1 surrogate of Y (either by shuffling or, preferably, randomising the phases of the Fourier transform of Y). Finally, reorder the original data Z to create a surrogate S_i which has the same rank ordering as this phase randomised copy of Y.

⁵ What's the difference between a Gaussian distribution and a standard Normal distribution?
⁶ Equivalently, one could define a test statistic which first rescales the data to be Gaussian before measuring it, and then apply Algorithm 1.
Surrogates generated following this scheme are also referred to as amplitude adjusted Fourier transformed (AAFT) surrogates. In this case, we do not have the problem of "over-constrained" surrogates. The processes of rescaling and of randomising the phases of the Fourier transform nicely complement each other so that, in the end, the surrogate does not appear too close to the original data. Despite this, there are several problems with this algorithm and many authors have recommended solutions. Schreiber and Schmitz [112] raised concerns about aspects of Algorithm 2 surrogates. Although Z and S_i have (by construction) identical probability distributions they will not, in general, have identical Fourier spectra (and therefore autocorrelation). To overcome this they propose an iterative version of the AAFT algorithm. Essentially, the procedure is the following. After rescaling the data to preserve the probability distribution, the randomising of the phases will somewhat alter the rank distribution. One therefore re-rescales the data. But, this may reduce the randomisation in the phases of the Fourier transform, so, in turn, it is necessary to re-randomise these phases. This procedure is repeated until one obtains reasonable convergence of both the rank distribution and the power spectrum. Unfortunately, convergence to the same Fourier spectrum is not actually guaranteed under this method either, but their results seem to indicate a closer agreement between power spectra. But this solution raises the same problem we had with the previous algorithms: the surrogates can become "over-constrained". In fact, the degree to which this happens depends on how far one iterates the iterative AAFT (IAAFT) algorithm. Kugiumtzis [69] argues that the AAFT algorithm is adequate if the static nonlinear transformation in question is not too nonlinear. For situations where the transformation is more extreme, caution must be taken. Although the IAAFT has been promoted in these situations, the evidence in [69] does not completely exonerate it [67]. As an alternative, Kugiumtzis suggests the Corrected AAFT (CAAFT) algorithm [68]. This approach, however, is somewhat parametric and dependent on the accurate estimation of a specific autoregressive model (including the correct model order) directly from the data. Moreover, using standard AAFT surrogate generation techniques we
have found that although estimates of the power spectra (through whichever numerical scheme one chooses) may not agree very closely, the autocorrelation ρ(τ) does — at least for small to moderately large values of τ. Even for systems where agreement between autocorrelation curves is not great, this iterated procedure may be unnecessary. For many cases we have considered, the original AAFT scheme seems to provide a good level of randomisation of the surrogates, while keeping them sufficiently constrained.⁷ Recently, further concerns have also been raised over the application of Algorithm 2 surrogates for almost periodic data [142] (data with a strong periodic component). However, our own numerical experiments [122; 124] indicate that the difference between the probability distributions estimated with the Algorithm 2 technique and more technical methods [112; 142] is minimal. Moreover, for data that exhibit strong periodic trends, the hypothesis being tested by Algorithm 2 is trivially false. Finally, problems related to edge effects have been observed in the Fourier transform process for both Algorithm 1 and 2 surrogates. If there is strong disagreement between the ends of the data (particularly for almost periodic time series), the Fourier transform will treat this as a jump discontinuity in the underlying system.⁸ This is a problem related to the Fourier transform method rather than the surrogate techniques. The best solution seems to be to ensure good agreement at the ends of the data by judiciously shortening the data before generating the surrogates.
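The AAFT scheme of Algorithm 4.5 can be written compactly by combining rank reordering with the phase randomisation of the previous section. The following Python/NumPy sketch is illustrative only; it reuses the one-sided FFT approach above and breaks ties in the data arbitrarily through the sort.

```python
import numpy as np

def aaft_surrogate(z, rng=None):
    """Amplitude adjusted Fourier transform (Algorithm 2) surrogate."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z, dtype=float)
    n = len(z)
    rank_z = np.argsort(np.argsort(z))           # rank order of the data
    # 1. Gaussian data set Y, reordered to have the same rank order as Z.
    y = np.sort(rng.standard_normal(n))[rank_z]
    # 2. Algorithm 1 (phase randomised) surrogate of Y.
    yf = np.fft.rfft(y)
    phi = rng.uniform(0.0, 2.0 * np.pi, len(yf))
    phi[0] = 0.0
    if n % 2 == 0:
        phi[-1] = 0.0
    y_rand = np.fft.irfft(yf * np.exp(1j * phi), n=n)
    # 3. Reorder the original data Z to follow the rank order of y_rand.
    return np.sort(z)[np.argsort(np.argsort(y_rand))]
```

The iterated (IAAFT) refinement alternates between re-imposing the rank distribution of Z and re-imposing the Fourier amplitudes until both converge; it is not reproduced here.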
4.3  Cycle shuffled surrogates
The surrogate algorithms derived above test for evidence that the data is inconsistent with linear noise (or a static monotonic transformation of it). However, many real-world experimental time series are clearly distinct from such processes: the data may exhibit asymmetry,⁹ short-term determinism or a strong pseudo-periodic trend.

Definition 4.7  By pseudo-periodic we mean a time series which exhibits a strong periodic trend manifesting as a clear spike in the signal power spectrum.¹⁰
⁷ Although couched in a different terminology, this is related to the tradeoff between power and significance in a statistical hypothesis test.
⁸ The Fourier transform assumes that the signal under consideration is periodic with period N, or some factor of N.
⁹ Typically, asymmetry in the time series will manifest as the data increasing and decreasing at different rates. Such asymmetry is an indication of nonlinearity in the system.
¹⁰ To be precise we should quantify what we mean by both "clear spike" and "strong periodic trend". However, we choose not to do this as to do so would impose an arbitrary boundary. This definition is deliberately vague.
Fig. 4.3  Generation of cycle shuffled surrogates: An illustration of the method by which cycle shuffled surrogates are generated. Plot (a) shows a section of data, split at the peaks into its individual cycles. Plot (b) shows these cycles shuffled; note the discontinuity. Plot (c) applies a vertical shift to the individual cycles to remove the discontinuity in (b) — the discontinuity has been replaced by non-stationarity.
The underlying system may be periodic with additive or dynamic noise, or it may be one of many oscillatory chaotic systems (e.g. the Lorenz and Rössler systems). In the case of a pseudo-periodic signal it would be useful to be able to determine the presence of temporal correlation between cycles. Theiler [148] addresses this problem and proposes that a logical choice of surrogate for strongly periodic data, such as epileptic electroencephalogram signals, should also be periodic. To achieve this Theiler decomposes the signal into
cycles, and shuffles the individual cycles. In a statistical framework the "block bootstrap" has been proposed by Künsch [71]. Künsch's algorithm decomposes and shuffles "blocks" of a data set. Theiler's hypothesis for strongly periodic signals is rather simple, but in many ways powerful. Theiler proposes that surrogates generated by shuffling the cycles address the hypothesis that there is no dynamical correlation between cycles.

Algorithm 4.6  Cycle shuffled surrogates. Split the signal Z into its individual cycles (identify the location of the peak, or some other convenient point within each cycle¹¹). Randomly reorder the cycles and form a new time series S_i by concatenating the individual cycles. If the original time series Z is even slightly non-stationary, the individual cycles will almost certainly have to be shifted vertically to preserve the "continuity" of the original time series Z in the surrogate.

Figure 4.3 illustrates this procedure. In some respects this algorithm is analogous to Algorithm 0, except that it tests temporal correlation between cycles, not data points. We could examine the correlation between cycles directly, by reducing each cycle to a single measurement [128; 125]. In general, for a pseudo-periodic time series one may define a Poincaré section [41] and examine the first return map. It is then possible not only to test Algorithm 0 type hypotheses between cycles, but also Algorithm 1 and 2. However, reducing each cycle to a single measurement can result in a substantial loss of information. Furthermore, the cycle shuffled technique addresses a slightly different hypothesis: that deterministic dynamics exist within cycles, but not between them. Note that the cycle shuffled algorithm relies on the data having a convenient place at which to break the cycles. For the data Theiler considered, epileptic encephalograms, this was not a problem — the data was mean stationary and exhibited sudden strong peaks [148]. However, if the data is smooth and one is forced to identify suitable break points then the surrogates necessarily either have spurious discontinuities (at the points where the cycles have been reassembled) or suffer from non-stationarity (due to intra-cycle dependence, but not necessarily of period longer than the pseudo-period). Figures 4.4 and 4.5 illustrate this. Even with convenient break points between cycles Theiler [151] observed spurious long term correlation in the autocorrelation plot for cycle shuffled surrogates.

¹¹ For noisy time series data, the problem of identifying the peak of a cycle is rather non-trivial. We do not consider this problem here.
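A sketch of Algorithm 4.6 in Python/NumPy is given below. The choice of break points (here, naive local maxima) is ours and is for illustration only; as footnote 11 notes, identifying cycle peaks reliably in noisy data is a non-trivial problem in itself.

```python
import numpy as np

def cycle_shuffled_surrogate(z, align=True, rng=None):
    """Cycle shuffled surrogate: split the series at cycle break points,
    randomly reorder the cycles and concatenate them.  If align is True,
    each cycle is shifted vertically so that consecutive cycles join
    continuously (trading the discontinuities for non-stationarity)."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z, dtype=float)
    # naive break points: indices of simple local maxima
    peaks = np.where((z[1:-1] > z[:-2]) & (z[1:-1] >= z[2:]))[0] + 1
    cycles = [c for c in np.split(z, peaks) if len(c) > 0]
    pieces = []
    for k in rng.permutation(len(cycles)):
        c = cycles[k]
        if align and pieces:
            c = c + (pieces[-1][-1] - c[0])   # shift so the join is continuous
        pieces.append(c)
    return np.concatenate(pieces)
```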
Fig. 4.4 Cycle shuffled surrogates. Generation of cycle shuffled surrogates is depicted here. The individual cycles are identified (panel (a)), separated (panel (b)) and reassembled in a random order (panel (c)). Long term behaviour of this data is depicted in Fig. 4.5.
Furthermore, the degree to which this method actually randomises the surrogates is dependent on the number of cycles present in the test time series. If, for example, one was to employ a dynamic measure (such as correlation dimension) as a test statistic and the embedding window was d_w, then the embedded points would only differ from the true points if the embedding window crossed a cycle break point. If a time series of N points has p cycles, then each cycle is approximately of length N/p, and N/p − d_w + 1 embedded points will be identical for each cycle of data and surrogate (if N/p > d_w). One must therefore ensure that the proportion of each embedded surrogate that is identical to the true data, 1 − (p/N)(d_w − 1), is small.
Fig. 4.5  Cycle shuffled surrogates. Panel (a) is a representative time series and panels (b) and (c) are cycle shuffled surrogates generated by two alternative methods. Note that, if the peak and trough values do not all occur at exactly the same position, then the surrogate data is unable to preserve both stationarity and continuity. By re-aligning individual cycles vertically one is able to preserve continuity but not stationarity (panel (c)). Conversely, preserving stationarity introduces spurious discontinuities or points of non-differentiability in the surrogates (panel (b)). For the epileptic encephalogram data described and for infant respiratory data this method performed well. However, the technique lacks generality. In each panel the horizontal axis is datum number, the vertical axis is surface ECG voltage (in mV).
4.4  Test statistics
Surrogate analysis enables us to test whether the dynamics are consistent with linearly filtered noise or a nonlinear dynamical system. We may wish to apply the techniques of surrogate analysis to infant respiratory data, for example, using correlation dimension as a discriminatory test statistic [121]. Surrogate data analysis is not, however, entirely straightforward. Theiler's original work on surrogate methods [149] (see Sec. 4.2) suggested a "hierarchy" of hypotheses that should be tested with a "battery" of test statistics. More recent work [148; 151] has demonstrated that not all test statistics are equally good. Furthermore, not all hypotheses are
as straightforward, or interesting, as they may appear. Finally, the choice of test statistic and surrogate generation algorithm should be made very carefully [150]. The remainder of this chapter is largely concerned with the suitability or "pivotalness" of test statistics in general, and of nonlinear measures based on the correlation integral in particular. The "pivotalness" of nonlinear measures is a vital issue in the application of nonlinear surrogate data for hypothesis testing [124]. To compare the data to surrogates, a suitable test statistic must be selected. To be effective for hypothesis testing a test statistic must be able to be estimated reliably (it must be estimated consistently) and provide good discriminatory power. If a test statistic provides, in its own right, useful information about the data, this is a further benefit of a wise choice of test statistic. In any case, a suitable test statistic must measure a nontrivial invariant of a dynamical system that is independent of the way surrogates are generated. It is necessary that a test statistic not be invariant with respect to a given hypothesis. That is, we do not want the situation that for every data set z and every realisation z_i of any F_i ∈ F_φ, T(z) = T(z_i). The test statistic must measure something which is independent of the surrogate generation method. Unfortunately not all interesting test statistics are pivotal and constrained realisation schemes can be extremely nontrivial. We have already seen in the previous section that even the pivotalness of Theiler's original algorithms is open to some debate. In this section we briefly discuss possible candidate test statistics. For much of this discussion we focus on measures based on the correlation integral (Eq. (3.3)). In particular, we have chosen to use correlation dimension because it is a measure of great significance and has been the subject of much attention. Neither of these qualities will ensure that correlation dimension is a good test statistic for hypothesis testing; however, we will proceed to show that correlation dimension can be estimated consistently and offers good discriminatory power as a test statistic for hypothesis testing. Correlation dimension, as we have defined it in Sec. 2.1, is a function of ε_0 (see Figs. 3.5 and 3.4 for examples of correlation dimension curves). Moreover, the more standard correlation dimension estimation scheme produces a curve of correlation dimension as a function of embedding
dimension, d_c(d_e). Or, one may even wish to compare correlation integral curves, Eq. (3.3). There are several obvious ways to compare these curves. On many occasions, however, it is sufficient to compare them for some fixed values of the independent variable. Other possibilities include the mean value of the dimension estimate, or the slope of the line of best fit. More sophisticated methods are statistical tests such as the χ² test or the Kolmogorov–Smirnov statistic applied to the distribution of inter-point distances to determine if the distributions are the same.
4.4.1  The Kolmogorov–Smirnov test
The distribution of inter-point distances C_N(ε) (3.3) is the probability that two points v_i, v_j on the attractor are less than a distance ε apart. For two distributions of inter-point distances C_N(ε) and C'_N(ε), the Kolmogorov–Smirnov test measures the maximum absolute difference between the distributions,

    max_ε |C_N(ε) − C'_N(ε)|.
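As an illustration, the following Python/NumPy sketch computes an empirical C_N(ε) from a set of embedded points and the resulting Kolmogorov–Smirnov distance between two such distributions (a brute-force O(N²) construction, suitable only for modest N; the function names are ours).

```python
import numpy as np

def interpoint_distance_cdf(x, eps):
    """Empirical C_N(eps): the fraction of pairwise distances smaller than
    each scale in eps, for embedded points x of shape (N, d)."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    dist = dist[np.triu_indices(len(x), k=1)]        # each pair counted once
    return np.array([(dist < e).mean() for e in eps])

def ks_distance(x1, x2, eps):
    """Maximum absolute difference between two inter-point distance
    distributions evaluated on a common grid of scales."""
    return np.max(np.abs(interpoint_distance_cdf(x1, eps) -
                         interpoint_distance_cdf(x2, eps)))
```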
4.4.2  The χ² test
The χ² test is a measure of the difference between an observed distribution C_N(ε) and an expected distribution. The χ² test assumes some discrete distribution and compares the expected distribution to a set of experimental observations. The correlation dimension algorithm we employ imposes a binning on the distribution C_N(ε). Let p_i denote the expected probability of a random inter-point distance falling in the i-th bin (calculated from the expected distribution), let N_i denote the number of inter-point distances in the i-th bin (from C_N(ε)), and let n denote the number of inter-point distances. Then the χ² statistic is given by

    χ² = Σ_i (N_i − n p_i)² / (n p_i).

Details of these tests can be found in most introductory statistics texts, see for example [9].
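A direct transcription of this statistic follows (a sketch: the binning and the expected bin probabilities p_i, one per bin, are supplied by the caller, and bins with p_i = 0 are simply skipped).

```python
import numpy as np

def chi2_statistic(distances, expected_probs, bin_edges):
    """Chi-squared comparison of observed inter-point distances against
    expected bin probabilities p_i defined on the same binning
    (len(expected_probs) must equal len(bin_edges) - 1)."""
    n = len(distances)
    n_i, _ = np.histogram(distances, bins=bin_edges)   # observed bin counts N_i
    p_i = np.asarray(expected_probs, dtype=float)
    mask = p_i > 0
    return np.sum((n_i[mask] - n * p_i[mask]) ** 2 / (n * p_i[mask]))
```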
4.4.3  Noise dimension
An alternative to comparing correlation dimension curves in terms of the distribution of inter-point distances is to extract some important (scalar) statistic from Judd's estimate d_c(ε_0), the correlation integral, or from the Gaussian kernel scheme. One such statistic (derivable from Judd's estimate d_c(ε_0)) is the noise dimension. The expected value of d_c at scale ε_0 is given in [51] as a function of the noise dimension d_n; taking a Taylor series approximation one gets that

    d_c ≈ (d_n − d_n/(d_n + 2)) − (2/(d_n + 2)) log ε_0.

Using this expression one can fit a line d_c ≈ m log ε_0 + b to the correlation dimension curve to estimate d_n ≈ −2(1 + m)/m.
4.4.4  Moments of the data
Before closing this section, we will also briefly consider employing several more standard linear statistics, particularly in the case of the standard linear hypotheses. This will serve to illustrate the usefulness, or futility, of such an approach. Let μ_n denote the n-th moment of the data Z = {z_t}:

    μ_n = ⟨(z − ⟨z⟩)^n⟩,   (4.1)

where ⟨·⟩ denotes the mean. In particular, μ_1 = 0 and μ_2 is the variance. The skewness γ_1, kurtosis β_2 and kurtosis excess γ_2 are defined by

    γ_1 = μ_3 / μ_2^{3/2},   (4.2)
    β_2 = μ_4 / μ_2²,        (4.3)
    γ_2 = β_2 − 3.           (4.4)
Each of these statistics measures quantities related to the distribution of the data: the skewness is the degree of asymmetry in the distribution (with a longer tail below the median indicated by negative skewness); the kurtosis
measures how "fat" the distribution tails are, or conversely, how peaked the distribution is (with a Gaussian having a kurtosis of 3); and, the kurtosis excess compares kurtosis with the equivalent for a Gaussian distribution (a positive value indicating indicating a leptokurtic, or "peaky" distribution). All of these statistics are functions of the rank distribution of the data and would be useless as test statistics for Algorithm 0 or 2. Conversely, the lag one autocorrelation p(l) would be a useless test statistic for Algorithm 1 or (asymptotically) 2. However, distribution-based measures may be used to test Algorithm 1 and correlation measures will be effective for Algorithm 0. 4.5
4.5  Correlation dimension: A pivotal test statistic — linear hypotheses
The linear processes consistent with the hypotheses addressed by Algorithms 0, 1 and 2 are all forms of filtered noise, and hence infinite dimensional. That is, the correlation dimension will be infinite. Therefore, a dimension estimation algorithm which relies on a time delay embedding will (or should) produce the same probability density of estimates of correlation dimension for any data set consistent with one of these hypotheses. To show this in general we could invoke Takens' embedding theorem [144]. Takens' theorem ensures that a time delay embedding scheme will produce a faithful reconstruction of an attractor (provided d_e > 2d_c + 1) if the measurement function is C². When d_c is finite, one simply needs a sufficiently large value of d_e. In the case when d_c is infinite, Takens' theorem no longer applies. However, if d_c is infinite (or indeed if d_c > d_e¹²) the embedded time series will "fill" the embedding space. If the time series is of infinite length, the dimension d_c of the embedded time series will then be equal to d_e. If the time series is finite, the dimension d_c of the embedded time series will be less than d_e.¹³ For a moderately small embedding dimension this difference is typically not great and is dependent on the estimation algorithm and the length of the time series, and independent of the particular realisation. Hence, if the correlation dimension d_c of all surrogates consistent with the hypothesis under consideration exceeds d_e, the correlation dimension is a pivotal test statistic for that value of d_e.
¹² For estimating correlation dimension reliably one only requires d_e > d_c + 1 [26; 25]. Conversely, if d_c > d_e the estimate of d_c should be d_e.
¹³ This is particularly likely for a short time series and large embedding dimension.
In addition to the treatment in [124], an examination of the "pivotalness" of the correlation integral (and therefore correlation dimension) can be found in a paper of Takens [145]. Takens' approach is to observe that, if ρ and ρ' are two distance functions in the embedded space X (we consider X = R^n; Takens considers a general compact d-dimensional manifold), k is some constant, and for all x, y ∈ X

    k⁻¹ ρ(x, y) ≤ ρ'(x, y) ≤ k ρ(x, y),   (4.5)

then the correlation integral lim_{N→∞} C_N(ε) with respect to either distance function is similarly bounded and hence the correlation dimension with respect to each metric will be the same. This result is independent of the conditions of Takens' embedding theorem (i.e. that n > 2d_c + 1 for X = R^n). Hence if we (for example) embed a stochastic signal in R^n, the correlation dimension will have the same value with respect to the two different distance functions ρ and ρ'. To show that d_c is pivotal for the various linear hypotheses addressed by Algorithms 0, 1 and 2 it is only necessary to show that various transformations can be applied to a realisation of such processes which have the effect of producing i.i.d. noise and are equivalent to a bounded change of metric as in (4.5). Therefore, our approach is to show that surrogates consistent with each of the three standard linear hypotheses are at most a C² function from Gaussian noise N(0,1). A C² function on a bounded set (a bounded attractor or a finite time series) distorts distance only by a bounded factor (as in Eq. (4.5)) and so the correlation dimension is invariant. We therefore have the following result.

Theorem 4.1  The correlation dimension d_c is a pivotal test statistic for a hypothesis φ if, for all F_1, F_2 ∈ F_φ and embeddings ξ_1, ξ_2 : R → X_1, X_2, there exists an invertible C² function f : X_1 → X_2 such that for all t, f(ξ_1(F_1(t))) = ξ_2(F_2(t)).

Proof.  The proof of this proposition has been outlined in the preceding arguments. Let F_1, F_2 ∈ F_φ be particular processes consistent with a given hypothesis and F_1(t) and F_2(t) realisations of those processes. We have that for all t, f(ξ_1(F_1(t))) = ξ_2(F_2(t)), and so if ξ_1(x_1), ξ_1(y_1) ∈ X_1 and ξ_2(x_2), ξ_2(y_2) ∈ X_2 are points on the embeddings ξ_1 and ξ_2 of F_1(t) and F_2(t) respectively, f(ξ_1(x_1)) = ξ_2(x_2) and f(ξ_1(y_1)) = ξ_2(y_2). Let ρ_2 be a distance function on X_2, then define ρ_1(ξ_1(x_1), ξ_1(y_1)) := ρ_2(f(ξ_1(x_1)), f(ξ_1(y_1))) = ρ_2(ξ_2(x_2), ξ_2(y_2)). Clearly (4.5) is satisfied if ρ_1 is a well defined distance function. The triangle inequality, symmetry, and non-negativity of ρ_1 are trivial. However, ρ_1(ξ_1(x_1), ξ_1(y_1)) =
0 ⇔ ξ_1(x_1) = ξ_1(y_1) requires that f is invertible. Hence, if f is invertible, (4.5) is satisfied, lim_{N→∞} C_N(ε) on X_1 and X_2 are similarly bounded, and therefore the correlation dimensions of X_1 and X_2 are identical. □

Hence, if any particular realisation of a surrogate consistent with a given hypothesis is a C² function from i.i.d. noise (which in turn is a C² function from Gaussian noise), correlation dimension is a pivotal statistic for that hypothesis. In the following section we demonstrate that d_c is a pivotal statistic for each of the linear hypotheses φ_0, φ_1, and φ_2.
4.5.1  The linear hypotheses
Let us consider the problem of correlation dimension being pivotal for the linear hypotheses more carefully. First consider the hypothesis φ* that z ~ N(0,1); clearly F_{φ*} is singleton and so d_c is a pivotal statistic (in fact any statistic is pivotal). Now let φ_0 be the hypothesis that z ~ N(μ, σ²) for some μ and some σ. If F ∈ F_{φ_0}, then (F − μ)/σ ∈ F_{φ*}, but this is an affine transformation and does not affect a statistic invariant under diffeomorphisms of the embedded data; correlation dimension is such a statistic. In general, if z ~ D where D is any probability distribution, the affine transformation (z − μ)/σ should be replaced by a monotonic transformation. This leads to the first corollary:

Corollary 4.1  Statistics based on an unbiased estimate of the correlation integral are pivotal for Algorithm 0.

Let φ_1 be the hypothesis that z is linearly filtered noise. In particular let F ∈ F_{φ_1} be ARMA(n,m). That is, F is defined by

    z_t = a·{z_i}_{t−n}^{t−1} + b·{e_i}_{t−m}^{t−1},

where a ∈ R^n, b ∈ R^m, {z_i}_{t−n}^{t−1} = (z_{t−1}, z_{t−2}, ..., z_{t−n}) (and {e_i}_{t−m}^{t−1} similarly) and e ~ N(0,1). Again, a suitable linear transformation,

    z_t ↦ (z_t − a·{z_i}_{t−n}^{t−1} − (b_2, b_3, ..., b_m)·{e_i}_{t−m}^{t−2}) / b_1 = e_{t−1},

transforms such a time series to Gaussian noise (in general, i.i.d. noise). Therefore we have the same result for Algorithm 1.

Corollary 4.2  Statistics based on an unbiased estimate of the correlation integral are pivotal for Algorithm 1.
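The linear transformation used for Corollary 4.2 simply inverts the ARMA filter to recover the driving noise. A Python sketch of this inversion is given below; it assumes the coefficient vectors a and b are known (with b_1 ≠ 0), follows the convention of the equation above (no e_t term, b_1 multiplying e_{t−1}), and sets the unobserved early innovations to zero, so it is illustrative rather than exact near the start of the series.

```python
import numpy as np

def recover_innovations(z, a, b):
    """Invert z_t = a.(z_{t-1},...,z_{t-n}) + b.(e_{t-1},...,e_{t-m})
    to recover the i.i.d. innovations e, up to edge effects."""
    n, m = len(a), len(b)
    N = len(z)
    e = np.zeros(N)                  # e[t] holds e_t; early values assumed zero
    for t in range(1, N):
        ar = sum(a[i] * z[t - 1 - i] for i in range(n) if t - 1 - i >= 0)
        ma = sum(b[j] * e[t - 1 - j] for j in range(1, m) if t - 1 - j >= 0)
        e[t - 1] = (z[t] - ar - ma) / b[0]
    return e
```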
Similarly, if φ_2 is the hypothesis that z is a monotonic nonlinear transformation of linearly filtered noise, one only needs to show that the monotonic nonlinear transformation g : R → R does not affect the correlation dimension. If g is C², this is a direct consequence of the above arguments. If g is not C², it can be approximated arbitrarily closely by a C² function.¹⁴

Corollary 4.3  Statistics based on an unbiased estimate of the correlation integral are pivotal for Algorithm 2.

The above arguments do not guarantee that the correlation dimension d_c(ε_0) estimated by Judd's algorithm, or any other particular technique, will be a pivotal statistic; they only imply that the actual correlation dimension, and any statistic based on an unbiased estimate of the correlation integral, will be. One still needs to be sure that the correlation integral can be estimated properly. In particular, the technical details of Judd's algorithm have been considered elsewhere [50; 51], and an independent evaluation of this algorithm is given by Galka and colleagues [35]. Provided one chooses a suitably small scale ε_0, the statistic d_c(ε_0) will be (asymptotically) pivotal. The above argument, in conjunction with technical results concerning Judd's algorithm [35; 50; 51], implies that correlation dimension estimated by this algorithm is pivotal and the estimates are consistent. In the case of the Grassberger–Procaccia and Gaussian kernel algorithms, one is assured of the same robust result only if the correlation integral can be estimated in an unbiased manner.
4.5.2  Calculations
We now turn to a computational study of the robustness of the results achieved in the previous section. We study the distribution of correlation dimension curves d_c(ε) produced with Judd's estimate. We find, in agreement with the theoretical results, that the correlation dimension estimated in this manner is indeed pivotal for the linear hypotheses of Algorithms 0, 1, and 2. Estimates of the probability density of correlation dimension for various linear surrogates are shown in Figs. 4.6, 4.7 and 4.9. Figures 4.6 and 4.7 compare the estimates of p_{T,F}(t) for various classes of simple and composite hypotheses concerned with Algorithms 1 (Fig. 4.6) and 2 (Fig. 4.7).
¹⁴ If this argument does not appear particularly convincing, keep in mind that very few AD convertors (or indeed digital computers) are C², and so, time lag embeddings may never be used with digital observations (either experimental or computational).
Fig. 4.6  Probability distribution for correlation dimension estimates of AR(2) processes: Shown are contour plots which represent the probability density of the correlation dimension estimate for various values of ε_0. Panel (i) is the probability distribution function (p.d.f.) for various realisations of the AR(2) process x_n − 0.4 x_{n−1} + 0.7 x_{n−2} = e_n, e_n ~ N(0,1); panel (ii) shows the p.d.f. for AAFT surrogates of one of these processes. Panels (iii) and (iv) are for random (stable) AR(2) processes. In each of these two calculations μ_1 and μ_2 were selected uniformly (subject to |μ_1|, |μ_2| < 1) and the autoregressive process is x_n + (μ_1 + μ_2) x_{n−1} + μ_1 μ_2 x_{n−2} = e_n, e_n ~ N(0,1). In the third plot μ_1, μ_2 ∈ R, in the fourth μ_1, μ_2 ∈ C. For each calculation 50 realisations of 4000 points were calculated, and their correlation dimension calculated for embedding dimension d_e = 3, 4, 5, 10, 15 (shown are the results for d_e = 5) using a 10000 bin histogram to estimate the density of inter-point distances; the other calculations produced similar results. Note, for some values of ε_0 (particularly in (iii)) our dimension estimation algorithm did not provide a value for d_c(ε_0). This does not indicate that the estimates of the probability density of correlation dimension are distinct, only that we were unable to estimate correlation dimension. In each case our calculations show a very good agreement between the p.d.f. of d_c(ε_0) for all values of ε_0 for which a reliable estimate could be obtained.
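The random stable AR(2) processes used in panels (iii) and (iv) are easy to reproduce. The sketch below follows our reading of the caption (the parametrisation by roots μ_1, μ_2 with |μ_i| < 1 is taken from the reconstructed formula; everything else is illustrative).

```python
import numpy as np

def random_stable_ar2(n_points, complex_roots=False, rng=None):
    """Realisation of x_n + (mu1 + mu2) x_{n-1} + mu1*mu2 x_{n-2} = e_n,
    e_n ~ N(0,1), with |mu1|, |mu2| < 1 so the process is stable.
    complex_roots=True draws a complex conjugate pair (pseudo-periodic case)."""
    rng = np.random.default_rng() if rng is None else rng
    if complex_roots:
        r, theta = rng.uniform(0.0, 1.0), rng.uniform(0.0, np.pi)
        mu1 = r * np.exp(1j * theta)
        mu2 = np.conj(mu1)
    else:
        mu1, mu2 = rng.uniform(-1.0, 1.0, size=2)
    c1 = -np.real(mu1 + mu2)       # x_n = c1 x_{n-1} + c2 x_{n-2} + e_n
    c2 = -np.real(mu1 * mu2)
    x = np.zeros(n_points)
    e = rng.standard_normal(n_points)
    for i in range(2, n_points):
        x[i] = c1 * x[i - 1] + c2 * x[i - 2] + e[i]
    return x
```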
Figure 4.9 compares different constrained and non-constrained realisation techniques for the experimental data of Fig. 4.8. In each case the probability density of correlation dimension estimated with Judd's algorithm
Fig. 4.7  Probability density for correlation dimension estimates of a monotonic nonlinear transformation of AR(2) processes: Shown are contour plots which represent the probability density of the correlation dimension estimate for various values of ε_0. Similar to Fig. 4.6, the four plots are of the p.d.f. of d_c(ε_0) for: (i) various realisations of the AR(2) process x_n − 0.4 x_{n−1} + 0.7 x_{n−2} = e_n, e_n ~ N(0,1), observed by g(x) = x³; (ii) AAFT surrogates of one of these processes; (iii) random (stable) AR(2) processes observed by g(x) = x³; (iv) random (stable, pseudo-periodic) AR(2) processes observed by g(x) = x³. For these last two calculations μ_1 and μ_2 were selected uniformly (subject to |μ_1|, |μ_2| < 1) and the autoregressive process is x_n + (μ_1 + μ_2) x_{n−1} + μ_1 μ_2 x_{n−2} = e_n, e_n ~ N(0,1). In (iii) μ_1, μ_2 ∈ R, in (iv) μ_1, μ_2 ∈ C. In each calculation 50 realisations of 4000 points were calculated, and their correlation dimension calculated for d_e = 3, 4, 5, 10, 15 (shown are the results for d_e = 5; the other calculations produced similar results) using a 10000 bin histogram to estimate the distribution of inter-point distances. In each case our calculations show a very good agreement between the p.d.f. of d_c(ε_0) for all values of ε_0 for which a reliable estimate could be obtained. Similar results were also obtained using g(x) = sign(x)|x|^{1/4} as an observation function.
p_{d_c(ε_0),F}(t) was estimated for fixed values of ε_0 by linearly interpolating the individual correlation dimension estimates to get an ensemble of values of d_c(ε_0), from which p_{d_c(ε_0),F}(t) is estimated following methods described in [118]. The ensemble of probability density estimates was then used to
Fig. 4.8 Experimental data: The abdominal rib movement and electrocardiogram signal for an 8 month old male child in rapid eye movement (REM) sleep. The 4000 data points were sampled at 50 Hz, and digitised using a 12 bit analogue to digital convertor during a sleep study at Princess Margaret Hospital for Children, Subiaco, Western Australia.
calculate the contour plots of p_{d_c(ε_0),F}(t) for all values of ε_0 for which our correlation dimension estimation algorithms converged. Figures 4.6 and 4.7 show that the probability density of correlation dimension is independent of which particular form of linear filtering one applies. In both Fig. 4.6 and Fig. 4.7, the first panel shows an estimate of the probability density function (p.d.f.) of correlation dimension for realisations of a particular (in Fig. 4.7, monotonically nonlinearly filtered) autoregressive process; the second panel shows an estimate of the p.d.f. from surrogates of one of the realisations in the first panel. The third and fourth panels show estimates of the p.d.f. of correlation dimension for realisations of different (stable) autoregressive processes. The probability density plot for AAFT (Algorithm 2) surrogates is virtually identical to that for different realisations of a single process, and for random processes. This agreement is particularly strong between the first two panels of each figure (distinct realisations of one process and surrogates of a single realisation). The slightly greater variation in the third and fourth panels is most probably a result of the scaling properties of our
Fig. 4.9  Probability density for correlation dimension estimates for surrogates of experimental data: The first three panels are p.d.f. estimates for surrogates of the abdominal movement data in Fig. 4.8 generated by: a.(i) a non-constrained realisation technique (we rescaled the data to be normally distributed, estimated the minimum description length best autoregressive model of order less than 100, generated random realisations of that process driven by Gaussian noise, and rescaled these to have the same rank distribution as the data); a.(ii) AAFT surrogates; and a.(iii) surrogates generated using the iterated AAFT method. Plots b.(i), b.(ii) and b.(iii) are similar calculations for the electrocardiogram data from Fig. 4.8. In each calculation 50 realisations of 4000 points were calculated, and their correlation dimension calculated for d_e = 3, 4, 5 (shown are the results for d_e = 5; the other calculations produced similar results) using a 10000 bin histogram to estimate the distribution of inter-point distances. In each case our calculations show a very good agreement between the p.d.f. of d_c(ε_0) for all values of ε_0 for which a reliable estimate could be obtained.
estimates of correlation dimension. However, this only produces convergence of the correlation dimension estimates at different scales ε_0, not distinct
probability distributions. The plots only fail to agree for values of ε_0 for which an estimate of d_c(ε_0) was not obtained. The panels in Fig. 4.6 show precise agreement for the range −2 < log(ε_0) < −1.8; in Fig. 4.7 the range is −5 < log(ε_0) < −3.7. Outside these ranges one or more of the panels correspond to surrogates that failed to produce convergence of the correlation dimension algorithm at that particular scale. There is a substantial difference between the probability densities shown in Fig. 4.6 and those for Fig. 4.7. The difference results from the different observation function g(x) = x³ in Fig. 4.7.¹⁵ This indicates a difference in the results of the dimension estimation algorithm: the nonlinear transformation g has changed the scale of structure present in the original process, and so yields different values of d_c(ε_0). This indicates that correlation dimension is not pivotal over F_{φ_2}; however, provided one can make a reasonable estimate of the process F ∈ F_{φ_2} which generated z, then T is pivotal for the restricted class F_{φ'} where F ∈ F_{φ'} ⊂ F_{φ_2}.¹⁶ Note that the ranges of values of −log ε_0 shown in Figs. 4.6 and 4.7 are quite distinct; the correlation dimension algorithm does not produce different probability density functions, it has only failed to produce an estimate at some scales. Figure 4.9 gives a comparison of the probability distribution for two different data sets with various surrogate generation methods. In each column the first panel shows results for a non-constrained surrogate generation method (we estimated the parameters of the best autoregressive model and generated simulations from it; see the caption of Fig. 4.9), and constrained surrogate methods suggested by Theiler (panel ii) and Schreiber and Schmitz (panel iii). The surrogates generated by either simple parameter estimation methods, the AAFT method or the method suggested by Schreiber and Schmitz¹⁷ produced almost identical results. Hence in this example any surrogate generation method will serve equally well, provided the surrogates are not completely different from the data. This confirms our earlier arguments and calculations with stochastic processes.
¹⁵ We also repeated the calculations of Fig. 4.7 with g(x) = sign(x)|x|^{1/4} (note that this function is not C²) and obtained another set of similar results. All the individual probability density plots were the same, but they were different from those in Figs. 4.6 and 4.7.
¹⁶ One would expect that the nonlinear transformation g would be fairly similar for all F ∈ F_{φ_2}. From our calculations it appears sufficient to ensure that the data and surrogates have identical rank distributions.
¹⁷ We iterated the algorithm described in [112] 1000 times to generate each surrogate.
4.5.3  Results
The close agreement between the probability density estimates in the first two panels of each of Figs. 4.6 and 4.7, and panels a.(i)-(iii) and b.(i)-(iii) in Fig. 4.9, indicates that the surrogate generation methods suggested by Theiler [149] and those of Schreiber and Schmitz [112] generate surrogates for which d_c(ε_0) is pivotal. This should be the case as these are all constrained realisation techniques (with the possible exception of Algorithm 2 surrogates [112]). The agreement between all four panels in Fig. 4.6 (and similarly between all four panels in Fig. 4.7) indicates that d_c(ε_0) is virtually pivotal when φ is the hypothesis that the data are linearly filtered noise or a particular monotonic nonlinear transformation of linearly filtered noise. There are minor differences between the various panels in each figure, but these are only a result of the estimate of d_c(ε_0) not converging. The difference between the results of Fig. 4.6 and those of Fig. 4.7 indicates that our estimate of correlation dimension is not pivotal for the hypothesis that the data are any monotonic nonlinear transformation of linearly filtered noise. The scale dependent properties of d_c(ε_0) have altered the value of this statistic for the various observation functions g. The linear models built to estimate p_{d_c(ε_0),F} produced estimates of correlation dimension which closely agreed with those from the constrained surrogate generation methods. This indicates that a non-constrained realisation technique can do as well as a constrained one. Correlation dimension estimates d_c(ε_0) are not pivotal for the set of all processes consistent with the hypothesis that the data are a monotonic nonlinear transformation of linearly filtered noise (otherwise all the probability density estimates in Figs. 4.6, 4.7, and 4.9 would be identical). However, the p.d.f.s of d_c(ε_0) for various realisations are similar enough to allow for the use of some more general non-constrained surrogate generation methods (such as the parametric model estimation we employ in Fig. 4.9, panels a.(i) and b.(i), and possibly the method suggested in [145]). Furthermore, the p.d.f.s of d_c values for the surrogate generation methods of Schreiber and Schmitz [112] and Theiler [149] are identical. The difference in the results between Figs. 4.6, 4.7, and 4.9 is most likely a result of the different choices of observation function g affecting the scaling properties of the correlation dimension estimate. By ensuring the rank distributions of the data and surrogate are the same (as in Fig. 4.9, panels a.(i) and b.(i)) one can generate surrogates for which d_c is pivotal. Alternatively one could choose a statistic without such sensitive scale
Fig. 4.10  Typical surrogate time series. For the DJIA (N = 1000) data set (the top panel of Fig. 2.21) we have computed representative surrogates consistent with each of H_0, H_1, and H_2.
dependence. However, for nonlinear hypothesis testing, sensitivity to scaling properties is an important feature of this particular test statistic.
4.6  Application: Are financial time series deterministic?
Now that we have a better idea of the significance of correlation dimension, or rather, how best to assess the significance of our estimates of correlation dimension, we reconsider the results presented in an earlier application (Sec. 2.5). In Sec. 2.5 we arrived at the startling conclusion that the financial markets are deterministic dynamical systems. We arrived at this probable fallacy without even presenting significant results. In this section we probe the problem further. For each time series (three data sets, considered at 19 different lengths) we computed 50 surrogates from each of the three different surrogate
generation algorithms. For each surrogate and original data set we computed correlation dimension (with d_e = 6) and nonlinear prediction error (also with d_e = 6). Hence for each of 19 data sets, we generated 150 surrogates and computed 151 estimates of correlation dimension and nonlinear prediction error. Representative surrogates for one of the data sets are shown in Fig. 4.10. We compared the distribution of statistic values to the value for the data in two ways. First, we rejected the associated null hypothesis if the data and surrogates differed by at least three standard deviations. We found that in all cases the distribution of statistic values for the surrogates was approximately normal and this therefore provided a reasonable estimate of rejection at the 99% confidence level. For correlation dimension d_c(ε_0) we compared estimates for fixed values of ε_0 only. Whilst it may seem a reasonable assertion, this technique does assume that the underlying distribution of test statistic values is Gaussian. To overcome this we also compared data and surrogates in a non-parametric way. We rejected the associated hypothesis if the correlation dimension of the data was lower than that of all the surrogates. Nonlinear prediction error (NLPE) was deemed to distinguish between data and surrogates if NLPE for the data exceeded that of the surrogates. The NLPE statistic is actually a measure of predictive performance compared to guessing. If nonlinear prediction error is less than 1, the prediction does decidedly better than guessing; if it is greater than 1, it is decidedly worse. In none of our analyses did the nonlinear prediction of the data perform significantly better than guessing (it often performed worse). Therefore we chose to reject the hypothesis if the nonlinear prediction error of the data exceeded that of all 50 surrogates. The probability of this happening by chance (if the data and surrogates came from the same distribution) is 1/51. Therefore, in these cases we could reject the underlying hypothesis at the 98% confidence level. In all 171 trials (3 data sets at 19 length scales and 3 hypotheses) we found that non-parametric rejection (at 98% confidence) occurred only if the hypothesis was rejected under the assumption of Gaussianity (at 99% confidence). Tables 4.1 and 4.2 summarize the results and Fig. 4.11 provides an illustrative computation. For N ≥ 800 we found that the hypotheses associated with each of H_0, H_1, and H_2 could be rejected. Therefore at the time scale of approximately 3 years, each of DJIA, GOLD and USDJPY is distinct from a monotonic nonlinear transformation of linearly filtered noise. This result is not surprising; on this time scale one expects these time series to be non-stationary.
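The two rejection criteria described above are simple to apply once the statistic has been computed for the data and for each surrogate. The following Python sketch is illustrative; the statistic values themselves would come from the correlation dimension or NLPE computations, which are not reproduced here.

```python
import numpy as np

def surrogate_rejection(data_stat, surrogate_stats, side="lower"):
    """Return (parametric_reject, rank_reject) for one data statistic and
    an ensemble of surrogate statistics.  side="lower" rejects when the
    data value is below every surrogate (e.g. correlation dimension);
    side="upper" rejects when it exceeds every surrogate (e.g. NLPE)."""
    s = np.asarray(surrogate_stats, dtype=float)
    parametric_reject = abs(data_stat - s.mean()) >= 3.0 * s.std(ddof=1)
    if side == "lower":
        rank_reject = data_stat < s.min()
    else:
        rank_reject = data_stat > s.max()
    # with 50 surrogates the rank criterion rejects at the 1/51 (about 2%) level
    return parametric_reject, rank_reject
```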
Table 4.1 Surrogate data hypothesis tests: GOLD and DJIA. Rejection of CD or NLPE with 99% confidence (assuming Gaussian distribution of the surrogate values) is indicated with "CD" and "NP" respectively. Rejection of CD or NLPE with 98% confidence (as the test statistic for the data is an outlier of the distribution) is denoted by "cd" or "np" respectively. This table indicates general rejection of each of the three hypotheses Hi (i = 0, 1, 2). Exceptions are discussed in the text. Results for GOLD and DJIA are shown in this table; results for USDJPY are shown in Table 4.2.
GOLD
N: 100 200 300 400 500 600 700 800 900 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
H0: cd cd CD+NP CD+NP CD+NP CD+NP CD+NP CD+NP CD+NP CD+NP CD CD CD CD CD
H1: np NP cd np NP+cd NP NP NP NP NP NP NP NP NP cd np cd cd
H2: - CD NP+cd CD+NP CD+NP CD+NP CD+NP CD+NP CD+NP CD+NP CD+NP CD CD CD CD

DJIA
H0: cd cd - cd CD CD CD CD+NP CD+NP CD+NP CD+NP CD CD CD+NP
H1: cd cd+np CD CD CD CD CD+np CD CD CD CD NP NP NP NP+cd NP NP NP
H2: cd cd cd cd CD CD CD CD+NP CD+NP CD+NP CD+NP CD+NP CD+NP CD+NP
This result is only equivalent to the conclusion that volatility has not been constant over the last three years. Certainly, the time series in Fig. 2.21 confirm this. For N < 800 we found evidence to reject H1 but often no evidence to reject H0 or H2. With no further analysis this would lead us to conclude that these data are generated by a monotonic nonlinear transformation of linearly filtered noise, with non-stationarity evident for N > 800. However, this is not the case. Rejection of H1 but failure to reject H0 is indicative of either a false positive or a false negative. The sensitive nature of both the CD and NLPE algorithms, and their critical dependence on the quantity of data, imply that for short noisy time series they are simply unable to distinguish data from surrogates. Further calculations of CD with different values of de confirm this. In all cases we found evidence to reject H0 and H2. The high rate of false negatives (lack of power of the statistical test)
Table 4.2 Surrogate data hypothesis tests: USDJPY. Same format as Table 4.1; in this table the results are shown for USDJPY.
N: 100 200 300 400 500 600 700 800 900 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
H0: cd np NP NP CD CD CD CD CD CD CD CD CD
H1: cd CD CD CD CD CD+NP CD+NP CD CD CD CD CD cd NP NP NP
H2: cd np NP NP CD CD CD CD CD CD CD CD+np CD
is due to the insensitivity of our chosen test statistics for extremely short and noisy time series. That is, for short time series the variance in one's estimates of NLPE and CD becomes greater than the deviation between data and surrogates. Another possible source of this failure is highlighted by Dolan [29]. For particularly short time series, the randomisation algorithms involved in H0 and H2 constrain surrogates to consist of special re-orderings of the original data. With short time series, the number of reasonable re-orderings is few and the power of the corresponding test is reduced [29]. We conclude that the evidence provided by linear surrogate tests (Algorithm 1) is sufficient to reject the random walk model of financial prices and ARMA models of returns. Consistent rejection across all time scales implies that non-stationarity is not the sole source of nonlinearity.18
18 Here, we assume that the system is stationary for N < 100. From a dynamical systems perspective this is reasonable; from a financial one it is arguable whether it is reasonable to describe markets as stationary over periods of approximately four months. However, this is more of an issue of definitions of stationarity. If causes of non-stationarity are considered part of the system (rather than exogenous inputs), the system is stationary.
Fig. 4.11 Surrogate data calculation. An example of the surrogate data calculations for the DJIA (N = 1000). The top row shows a comparison of correlation dimension estimates for the data (solid line, with stars) and 50 surrogates (contour plot, the outermost isocline denotes the extremum). The bottom row shows the comparison of nonlinear prediction error for data (thick vertical line) and surrogates (histogram). From this diagram one has sufficient evidence to reject H1 (correlation dimension). Neither H0 nor H2 may be rejected. Further tests with other statistics indicate that in fact these other hypotheses may be rejected also. This particular time series is chosen as illustrative more than representative. As Tables 4.1 and 4.2 confirm, most other time series led to clearer rejection of the hypotheses Hi (i = 0, 1, 2, 3).
4.7 Summary
Estimating dynamic invariants alone is not meaningful. To provide meaningful interpretation of dynamic invariants estimated from data, one needs to know what typical behaviour is expected under certain (trivial) hypotheses: how different is the correlation dimension for this data from that of typical noise processes? In this chapter we have largely been concerned with three standard surrogate tests to address this problem. These surrogate methods test the hypotheses of: i.i.d. noise; linearly filtered noise; and a monotonic nonlinear transformation of linearly filtered noise. In each case we highlighted various
weaknesses of the algorithms and suggested appropriate remedies. One remedy in particular, the IAAFT algorithm, has gained wide acceptance. We describe an alternative approach with which one can avoid the necessary complications of the IAAFT. We argue that, with a good choice of test statistic, it is probably easier to ensure that the AAFT produces truly constrained data sets (than to resort to the IAAFT). We show that an unbiased estimate of correlation dimension fulfils this requirement and may therefore be used as a pivotal statistic for surrogate data generated by algorithms that may not be entirely constrained. Our calculations demonstrated that the use of correlation dimension as a test statistic together with the ordinary AAFT algorithm performs entirely satisfactorily. Finally, a fourth, and also largely standard, algorithm to test for inter-cycle determinism in periodic time series was introduced. Some problems remain with this algorithm: most notably it is prone to generating non-stationary surrogates or introducing spurious discontinuities (high frequency dynamics). However, a judicious application of this technique19 can minimise these problems. At the very least, this method provides an additional sanity check for the expected dynamic behaviour under the hypothesis of no inter-cycle dynamics. In the next chapter we will introduce a further extension of this method. With the four algorithms introduced in this chapter, we showed that from the many potential test statistics for surrogate data testing, measures based on the correlation integral represent one good choice. In particular, we showed that the surrogate data methods presented in this chapter can be trivially applied to cast doubt on any premature proclamations of chaos in the financial markets. In the next chapter we turn our attention to various more exotic surrogate data methods that may be applied to test hypotheses that are, perhaps, more practical.
19 For example, by separating cycles at some arbitrary baseline level.
Chapter 5
Non-standard and non-linear surrogates
Linear surrogate techniques test, via algorithms 0, 1 and 2, three very specific hypotheses. These algorithms are not particularly useful for the significant fraction of experimental data which is clearly either deterministic or nonlinear. It would certainly be most disappointing, for example, to conclude that human ECG traces are a manifestation of linearly filtered noise. An extension of these methods, the cycle shuffled algorithm, tests a more specific hypothesis: that there is no determinism between cycles. The surrogate methods we propose in this chapter are meant to probe the underlying dynamics of the system more deeply. These methods test whether there is any evidence that the data are inconsistent with certain classes of nonlinear systems. In Sec. 5.1 we review the simplest such approach: one builds a nonlinear model of the data and generates simulations from that model — those simulations are the surrogates. Depending on the power and "pivotalness" of the chosen test statistic, one is then able to test a class of nonlinear systems, including that particular model. In Sec. 5.3, we streamline this idea. Instead of complex parametric models we restrict our interest to a much simpler, parameter-free, model. The advantage of this particular model is that we better understand the expected behaviour of the surrogates and are better able to formulate a meaningful alternative hypothesis. We find that this method is able to routinely differentiate between pseudo-periodic chaotic dynamics and noisy periodic orbits. Moreover, we will see in specific examples that this method also does an excellent job of mimicking the observed dynamics. Finally, in Sec. 5.6 we present a medley of alternative computational techniques to generate advanced nonlinear surrogates. We describe Dolan's (two-dimensional) attractor surrogates, and Schreiber's simulated
annealing approach. We also introduce new methods, which we are currently investigating, to test for periodic orbits and nonlinearity.
5.1 Generalised nonlinear null hypotheses: The hypothesis is the model
Existing surrogate methods are largely non-parametric and concerned with rejecting the hypothesis that a given data set is generated by some form of linear system. In this section we describe an alternative surrogate generation method which is both parametric and nonlinear. In general this method is unable to identify a given time series as either chaotic or simply nonlinear. Instead it addresses the simpler set of hypotheses that the data are consistent with a noise driven nonlinear system of a particular form. We model the data using methods described in Chapter 1.21 (see also [53; 123]) and generate noise driven simulations from that model. Of course, there is no reason to use these modelling techniques in particular. I choose to employ these methods as I have found them to be highly effective in a wide range of situations. Some of these applications appear in this volume. Using correlation dimension estimated with Judd's algorithm (or, equivalently, one may employ alternative nonlinear statistics), we are then able to determine which properties are likely to be common to both data and model. Furthermore, the nonlinear surrogate generation method we propose here is a parametric modelling method that utilises a stochastic search algorithm — it is definitely not a constrained realisation method, and no related constrained method seems evident.1 Hypothesis testing with surrogate data is essentially a modelling process. To test if the data are consistent with a particular hypothesis, one first builds a model that is consistent with that hypothesis and has the same properties as the original data, then one generates surrogate data from the model and checks that the original data are typical under the hypothesis by comparing them to the surrogate data. For surrogates generated by algorithm 0, 1 or 2 the models used are linear. Each of these surrogate tests addresses a hypothesis that the data are either linear, or some (linear, or monotonic
1 It is often exceedingly difficult to produce consistent estimates of parameters of a model of a single data set (this is the subject of chapter 1.21). Given that we are not guaranteed that the estimates of these parameters are the same with each iteration of our modelling algorithm, it is unlikely that one can construct a constrained realisation algorithm based on these modelling methods.
nonlinear) transformation of a linear process. Although nonlinear, the hypothesis addressed by shuffling cycles is that there is no long-term temporal structure. To address the hypothesis that the data come from a noise driven nonlinear system, we build a nonlinear model and generate surrogate data (noise driven simulations). The nonlinear model that we build from the data is a cylindrical basis model using the methods of Judd and Mees [53; 123] (see chapters 1 and 6). Cylindrical basis models are a generalisation of radial basis models that allow for a variable embedding [55]. However, the choice of cylindrical basis models is arbitrary. The hypothesis we wish to test here is that the data are consistent with a nonlinear system that can be described by a cylindrical basis model and that the data of such a system can be modelled adequately using the algorithms we use. Rejection of the hypothesis could imply that the data cannot be described by a cylindrical basis model, or that the modelling algorithm failed to build an accurate model. We return to discuss this hypothesis in Sec. 4.4. Building a nonlinear model of data is a decidedly nontrivial process. In Sec. 6.5.3, we will introduce the general form (6.21) of these models and we discuss some details of the modelling algorithm. Chapter 6 also suggests some refinements to this modelling algorithm to produce improved results. For now, we are mainly concerned with the application of any model architecture as a surrogate test. The conclusions that can be drawn from testing with these nonlinear models are several. Surrogate data hypothesis testing can indicate that our data are not consistent with a nonlinear system of the type generated by our modelling procedure. Furthermore, this is a test of the modelling procedure itself. If the hypothesis cannot be rejected on the basis of our analysis, this will indicate that the model we have built is an accurate model of the data, with respect to correlation dimension. Failure to reject the null hypothesis can indicate successful and accurate modelling of the data. Even if correlation dimension cannot distinguish between data and surrogates, other measures, for example the largest Lyapunov exponent, may. There is one important caveat. Our methods do not test for the presence of a general nonlinear periodic orbit (for example). They only test for the presence of a nonlinear periodic orbit that can be accurately modelled as the sum of cylindrical basis functions of the form described in Sec. 6.5.3. This is not particularly restrictive since experience has shown that such functions can model a wide range of phenomena [53; 54; 55].
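To make the procedure concrete, the following MATLAB sketch illustrates the general idea of model-based surrogates: embed the data, fit a nonlinear one-step map, and generate noise driven simulations from it. A Gaussian radial basis model is used here purely as a simplified stand-in for the cylindrical basis models and minimum description length selection described in the text (and in chapter 6); the number of centres, the basis radius and the noise level are illustrative choices rather than part of the method proper.

function s = modelSurrogate(x, de, nCentres, noiseLevel)
% Fit a radial basis one-step predictor to the embedded data and
% generate a noise driven simulation of the same length as the data.
x = x(:); N = length(x); M = N - de;
X = zeros(M, de);
for k = 1:de
    X(:, k) = x(k:M+k-1);               % embedded states
end
y = x(de+1:N);                          % one-step-ahead targets
C = X(randperm(M, nCentres), :);        % randomly chosen centres
r = 0.5*(max(x) - min(x));              % basis function radius (heuristic)
w = designMatrix(X, C, r) \ y;          % least squares weights
s = zeros(N, 1);
s(1:de) = x(1:de);                      % initialise with the data
for t = de+1:N                          % noise driven simulation
    z = s(t-de:t-1)';
    s(t) = designMatrix(z, C, r)*w + noiseLevel*randn;
end
end

function Phi = designMatrix(X, C, r)
% Gaussian radial basis functions plus a constant term.
n = size(X, 1); m = size(C, 1);
D2 = zeros(n, m);
for j = 1:m
    D2(:, j) = sum((X - repmat(C(j, :), n, 1)).^2, 2);
end
Phi = [ones(n, 1), exp(-D2/(2*r^2))];
end

Comparing the correlation dimension of the data with the distribution obtained from an ensemble of such simulations then tests whether the data are consistent with a noise driven system of this (restricted) form.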
5.1.1 The "pivotalness" of dynamic measures
Theiler and Prichard [150] argue that by using algorithms that generate constrained realisations to generate surrogates, one is free to use almost any statistic one wishes. On the other hand, if one does not use such methods to generate surrogates, it is necessary to select a statistic which has exactly the same distribution of statistic values for all realisations consistent with the hypothesis being tested. When generating nonlinear surrogates, we suggest that it may be easier to use a pivotal test statistic, and choose realisations of any process consistent with that hypothesis as representative. With such a statistic it would be possible to build a nonlinear model (usually with reference to the data) and generate (noise driven) simulations from that model as surrogates. However, with this approach it is necessary to check that the probability distribution of the test statistic is independent of the particular model we have built, or determine for which models the distribution is the same. We can only test a hypothesis as broad as the set of all processes which have the same probability distribution of test statistic values. For example, if the distribution of the test statistic is different for every model, the only hypothesis we can test is that the data are consistent with a specific model. However, if all models within some class (for example, two dimensional periodic orbits) have the same distribution of statistic values, the hypothesis which we can test with realisations from any one of these models is much broader (for example, the hypothesis that the system has a two dimensional periodic orbit). In Sec. 4.5 we did this for the linear surrogate algorithms: 0, 1 and 2. We found that, theoretically, an unbiased estimate of the correlation dimension, or even the correlation sum, would provide a pivotal statistic. Moreover, for several experimental data sets, we tested this with Judd's algorithm and found that it did provide a pivotal test statistic. A further consequence of this result is that, when a pivotal test statistic is available, one need not be concerned with the computational complexities of Schreiber and Schmitz's IAAFT scheme [112], or many of the other recent objections to the AAFT method. Unlike Theiler's algorithm 0, 1 and 2 surrogates, when testing with nonlinear surrogates (simulations of a model), the hypothesis being tested is not known a priori, but will be determined by the "pivotalness" of the test statistic. To illustrate our approach we choose to use correlation dimension. Other statistics, particularly measures derived from dynamical
system theory that are invariant under diffeomorphisms and can be consistently estimated may serve equally well. For hypotheses such as those addressed by nonlinear models, one must determine the hypothesis for which the test is pivotal. If FΦ is the set of all noise driven processes, dc(ε0) will not be pivotal. However, if we restrict ourselves to FΦ′ ⊂ FΦ, where T is pivotal on FΦ′, the problem is resolved. To do this we simply rephrase the hypothesis to be that the data are generated by a noise-driven nonlinear function (modelled by a cylindrical basis model) of dimension d. For example, this allows us to test if the data are generated by a periodic orbit with 2 degrees of freedom driven by Gaussian noise.
5.1.2 Correlation dimension: A pivotal test statistic for the nonlinear hypothesis
Beyond applying the linear hypothesis tests presented in the previous chapter, one may wish to ask more specific questions: are the data consistent with (for example) a noise driven periodic orbit? In particular, a hypothesis similar to this is treated by Theiler's cycle shuffled surrogates (Sec. 4.3). In this section we focus on more general hypotheses. In chapter 3 we test the hypothesis that infant respiration during quiet sleep is distinct from a noise-driven (or chaotic) quasi-periodic, toroidal, or ribbon attractor (with more than two identifiable periods). Such an apparently abstract hypothesis can have real value: these results have been confirmed with observations of cyclic amplitude modulation in the breathing of sleeping infants [128; 126] (chapter 3 and Sec. 5.3.3) during quiet sleep and in the resting respiration of adults at high altitude [157]. To apply such complex hypotheses, we build cylindrical basis models using a minimum description length criterion (see chapter 6) and generate noise driven simulations (surrogate data sets) from these models. This modelling scheme has been successful in modelling a wide variety of nonlinear phenomena. However, it involves a stochastic search algorithm. This method of surrogate generation does not produce surrogates that can be used with a constrained realisation scheme,2 so a pivotal statistic is needed. It is important to determine if the data are generated by a process consistent with a specific model or a general class of models. To do this we need to determine exactly how representative a particular model is for
2 If we are unable to estimate the model parameters consistently (from a single data set), we are certainly not going to be able to produce a surrogate which yields the same estimates of parameters as the data.
a given test statistic — how big is the set FΦ for which T is pivotal? By comparing a data set and surrogates generated by a specific model, are we just testing the hypothesis that a process consistent with this specific model generated the data, or can we infer a broader class of models? In either case (unlike constrained realisation linear surrogates), it is likely that the hypothesis being tested will be determined by the results of the modelling procedure and therefore depend on the particular data set one has. Many of the arguments of Sec. 4.5 apply here as well; the hypothesis one can test will be as broad as the class of all systems with distance function bounded by Eq. (4.5) (in the case of correlation integral based test statistics). In particular, proposition 4.1 holds — an invertible C2 function will yield only a bounded change in the correlation integral. Consider the other side of the problem. We want T to be a pivotal test statistic for the hypothesis Φ, where Φ describes a broad class of nonlinear dynamical processes. For example, if FΦ is the set of all noise-driven processes, dc(ε0) will not be pivotal. However, if we are able to restrict ourselves to FΦ′ ⊂ FΦ, where T is pivotal on FΦ′, the problem is resolved. To do this we simply rephrase the hypothesis to be that the data are generated by a noise-driven nonlinear function (modelled by a cylindrical basis model) of dimension d. For example, this would allow one to test if the data are consistent with a periodic orbit with 2 degrees of freedom driven by Gaussian noise. Furthermore, the scale-dependent properties of our estimate of dc(ε0) provide some sensitivity to the size (relative to the size of the data) of structures of a particular dimension. This is a much more useful hypothesis than that the process is noisy and nonlinear — if this was our hypothesis, what would be the alternative? Because of the complexity of our dimension estimation algorithm and the class of nonlinear models, it is necessary to compare calculations of the probability density of the test statistic for various models. Having done so one cannot make any general claims about the "pivotalness" of a given statistic. However, for a given data set, it is possible to compare the probability distributions of a test statistic for various classes of nonlinear models and, depending on the "pivotalness" of the statistics, determine the hypothesis being tested. We present a specific application of this technique in Sec. 5.2. The data we choose to analyse is a recording of infant respiration (see Fig. 5.1). One would certainly hope that this data does not represent a noisy linear system. However, a more relevant question is whether the system describes a periodic oscillator or some type of deterministic aperiodic dynamics.
Fig. 5.1 Experimental data: The abdominal rib movement for a 2 month old female child in quiet (stage 3-4) sleep. The 1600 data points were sampled at 12.5 Hz (to ease the computational load involved in building the cylindrical basis model this has been reduced from 50 Hz), and digitised using a 12 bit analogue to digital converter during a sleep study at Princess Margaret Hospital for Children, Subiaco, Western Australia. These are the same data as illustrated in Fig. 3.9.
5.2 Application: Infant sleep apnea
Figure 5.2 presents some experimental results from the data of Fig. 5.1. We have estimated the probability density for an ensemble of models and for particular models from an experimental data set.3 Note that this data is far more non-stationary than that in Fig. 4.8, and proves to be a greater modelling challenge. These calculations confirm that the distribution of correlation dimension estimates for different realisations of one model is the same as for different realisations of many models. The models used in this calculation were selected to have simulations with asymptotically stable periodic orbits. Models of this data set produce simulations with either asymptotically stable periodic orbits or fixed points (the second behaviour is clearly an inappropriate model of respiration). The probability distribution function of dc for all models therefore exhibits two modes. We are only concerned with a unimodal distribution at any one time. Figure 5.2 (ii), (iii) and (iv) show the probability density for particular models selected from the ensemble of models used in (i). Panel (iii) is the result of the calculations for the model which gave the smallest estimate of dc(ε0) for log ε0 = −1.8 in (i), that is, the model that generated the simulation with the lowest dimension. Panel (iv) is the result of the calculations
3 These calculations have also been repeated with the data in Fig. 4.8 and equivalent conclusions were reached.
Fig. 5.2 Probability density for correlation dimension estimates for nonlinear surrogates of experimental data: Shown are contour plots which represent the probability density of correlation dimension estimates for various values of ε0. The data used in this calculation is illustrated in Fig. 5.1. The figures are probability distribution function (p.d.f.) estimates for surrogates generated from: (i) realisations of distinct models; (ii) realisations for one of the models used in (i) with approximately the median value of correlation dimension (dc(ε0) for log ε0 = −1.8); (iii) realisations for the model used in (i) with the minimum value of correlation dimension; (iv) realisations for the model used in (i) with the maximum value of correlation dimension. In each calculation, 50 realisations of 4000 points were calculated, and their correlation dimension calculated for de = 3, 4, 5 (shown are the results for de = 5; the other calculations produced equivalent results) using a 10000 bin histogram to estimate the distribution of inter-point distances. In each case our calculations show a very good agreement between the p.d.f. of dc(ε0) for all values of ε0 for which a reliable estimate could be obtained.
for the model which gave the highest dimension estimate in (i). Panel (ii) corresponds to the median dimension estimate in (i). Despite this, all these probability densities are very nearly the same; there is no low bias in (iii) and no high bias in (iv). This indicates that dc(ε0) is (asymptotically) pivotal: simulations from any (periodic) model of the data will produce the same estimate of the probability distribution of dc(ε0). Hence one may build a single model of the data, estimate the distribution of dc(ε0), and
use that distribution to test the hypothesis that the data was generated by a process of the same general form as the model (this is the procedure followed in chapter 3). These calculations indicate that parametric nonlinear models of the data can be used to produce a pivotal class of functions when using correlation dimension as the statistic. That is, estimating the distribution of correlation dimension estimates for different models of a single set of (infant respiratory) data is equivalent to estimating the distribution for distinct realisations of a single model. Models which produced low (or high) correlation dimension estimates in Fig. 5.2 (i) did not produce estimates with lower or higher values of correlation dimension any more often than a more typical model. Indeed, they generated estimates with the same distribution of values. In general one may build nonlinear models of a data set, generate many noise-driven simulations from each of these models, and compare the distributions of a test statistic for each model and for broader groups of models (based on qualitative features, such as fixed points or periodic orbits, of these models). By comparing the value of the test statistic for the data to each of these distributions (for groups of models), one may either accept or reject the hypothesis that the data was generated by a process with the same qualitative features as the models used to generate a given probability distribution function.
5.3 Pseudo-periodic surrogates
The basic premise of surrogate data techniques is to mimic certain properties of the data, while destroying others. For algorithms 0, 1 and 2 the properties of interest are linear, and these linear properties are the only ones to be preserved. In the previous section, we saw an alternative approach: by building a model we generate a set of properties which we wish to preserve; the generalisability of those properties4 depends directly on the pivotalness of the test statistic. The pseudo-periodic surrogate technique we describe in this section combines both these ideas. We build a simple, non-parametric model of the data and generate simulations from that model. These simulations contain the same gross scale dynamics as the data but are otherwise corrupted by dynamic noise. This simple idea allows us
4 That is, whether we can test just the specific properties possessed by the data and the model, or a broader class of similar properties.
to test for the presence of fine scale deterministic dynamics, such as chaos. The simplicity of the model means that we are not overly concerned by the usual complexities of reliable parameter estimation of nonlinear models from data. In fact, this complex issue will be the topic of the next chapter. As a particular example we may formulate this non-parametric model-based surrogate method as a test for noisy periodic dynamics. When testing for noisy periodic orbits we will call this surrogate generation technique the pseudo-periodic surrogate (PPS) algorithm [137]. Surrogate data suitable to test for pseudo-periodic determinism in a time series must address the null hypothesis of no determinism other than the periodic behaviour. That is, even if the behaviour appears chaotic, it may be nothing more than a periodic orbit perturbed by dynamic noise. In what follows we employ a time delay embedding [144] of the data to extract the topological features of the underlying dynamics. We wish to generate surrogates that preserve the large scale behaviour of the data (the periodic structure), but destroy any additional small scale (chaotic, linear or nonlinear deterministic) structure.
5.3.1 Shadowing surrogates
The PPS algorithm we employ here is inspired by the application of local modelling techniques to generate simulated time series. The various implementations of local modelling techniques include local linear models [33], local constant simplices [143], and triangulations and tessellations [84]. We have studied the application of these various methods for surrogate generation [129]. However, we find that the simplest approach actually performs best. Therefore, the implementation we employ here is a local constant model over spatial neighbours. This method avoids the added complications of these alternative schemes, and of technically detailed improvements described more recently (e.g. [70]). Surrogate time series are generated by inferring the underlying dynamics from such a local model, and contaminating a trajectory on the attractor with dynamic noise. With an appropriate choice of noise level, intra-cycle dynamics are preserved but inter-cycle dynamics are not. The null hypothesis these surrogates address is a periodic orbit with uncorrelated noise. In general, our algorithm neither demands nor necessarily generates periodic behaviour. We will show in the following that it actually has far wider application. Hence, we will call the general method Attractor Trajectory Surrogates. The algorithm may be stated as follows.
Fig. 5.3 Examples of typical surrogate data sets. Panel (a) is a human electrocardiogram (ECG) recording during ventricular tachycardia. The purpose of the PPS algorithm is to provide a meaningful surrogate test for experimental data such as these. Panels (b), (c) and (d) are realisations of typical surrogates of that data set generated by algorithms 0, 1 and 2 respectively. These are all linear surrogates and clearly distinct from the data. Panel (e) is a cycle shuffled surrogate. Although this surrogate appears qualitatively more like the data than the standard linear surrogates, there is a notable non-stationarity not present in the data. The non-stationarity is a result of the shuffling of individual cycles when peak values do not precisely coincide. Finally, panel (f) is a typical surrogate generated using the PPS algorithm. This surrogate is qualitatively very similar to the original data. In each panel the horizontal axis is datum number and is identical. The vertical axis units are arbitrary (proportional to surface ECG voltage in (a)).
Algorithm 5.1 Attractor Trajectory Surrogates.
(1) Define the time delay embedding {z_t}, t = 1, ..., N − d_w, of the scalar time series {x_t}, t = 1, ..., N, as z_t = (x_t, x_{t+τ}, x_{t+2τ}, ..., x_{t+(d_e−1)τ}). For simplicity of notation we define the embedding window d_w = (d_e − 1)τ, where d_e and τ are the embedding dimension and embedding lag respectively.
(2) Choose an initial condition s_1 ∈ {z_t | t = 1, ..., N − d_w}.
(3) Let i = 1.
(4) Select one of the neighbours of s_i from {z_t | t = 1, ..., N − d_w} at random, say z_j.
(5) Set s_{i+1} = z_{j+1} and increment i.
(6) Repeat this procedure from step 4 until i = N.
(7) Take as the surrogate time series {(s_t)_1 : t = 1, 2, ..., N} (where (·)_1 denotes the scalar first coordinate of the vector).
A short computational sketch of this procedure is given at the end of this subsection. Applying this algorithm, the vector time series {s_t} is a stochastic trajectory on the attractor approximated by {z_t}. That is, it is a random walk on the attractor. It therefore has the same underlying dynamics, albeit contaminated by noise, as the original system. Also note that {s_t | t = 1, 2, ..., N} and {z_t | t = 1, 2, ..., N − d_w} approximate the same attractor, but the reconstruction achieved by a delay embedding of {(s_t)_1}, t = 1, ..., N, does not.
Theorem 5.1 Surrogates constructed according to the ATS algorithm follow the same vector field as the original data, but are contaminated with dynamic noise.
Proof. Let A be the underlying attractor and φ the underlying dynamical operator. Let Â_N = {x_t : t = 1, ..., N} and Â = lim_{N→∞} Â_N. Takens' theorem says that a homeomorphism h exists between A and Â. Let f denote the composition of that homeomorphism with φ (i.e. f = h^{−1} ∘ φ ∘ h). Now we appeal to the shadowing lemma [41]. By construction, {s_i}, i = 1, ..., N, is an α-pseudo orbit5 of f, where α ≥ ᾱ and ᾱ := max_i ||f(s_i) − s_{i+1}||. (Note that ᾱ is just the maximum perturbation applied in step 4 of the PPS algorithm.) The shadowing lemma implies that for all β > 0 there exists α0 > 0 such that if α < α0, then {s_i}, i = 1, ..., N, is β-shadowed6 by a point y ∈ Â.
5 That is, d(s_{i+1}, f(s_i)) < α for i = 1, ..., N.
6 That is, d(f^{(i)}(y), s_i) < β for i = 1, ..., N.
Hence, if the perturbations applied in step 4 are sufficiently small, the surrogates will look like a real trajectory of Â. Conversely, at each point s_i the distance to the true state f^{(i)}(y) is bounded: d(s_i, f^{(i)}(y)) < β. Hence one requires that the randomisation is sufficient to perturb {s_i} from β-shadowing a point in Â, but no more. •
As an immediate corollary to this result, provided the perturbations are of an appropriate magnitude, we see that the addition of dynamic noise obliterates fine scale dynamics, that is, any deterministic inter-cycle dynamic behaviour. Hence we have the same gross periodic pattern exhibited by the data, but not fine scale determinism. It is interesting to note the similarity between surrogates generated in this manner and the so-called attractor surrogates described by Dolan and co-workers [28]. In [28] attractor surrogates are constructed (in 2 dimensions) as a coarse-grained second-order Markov model. The surrogates therefore resemble the original 2 dimensional attractor with the addition of noise. The technique we describe here does not aim to exactly reproduce the attractor in each surrogate, only the periodic structure of the data. Furthermore, the above shadowing arguments could not necessarily be applied to Dolan's attractor surrogates.
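As a rough guide to implementation, the following MATLAB sketch follows the steps of Algorithm 5.1 directly. The neighbour in step 4 is drawn with probability proportional to exp(−||z_t − s_i||/ρ), which is one concrete choice of randomisation and is the form adopted as Eq. (5.1) in the next subsection; the function and variable names are illustrative, not those of the companion software.

function s = atsSurrogate(x, de, tau, rho)
% Attractor Trajectory (pseudo-periodic) surrogate for the scalar time
% series x, using embedding dimension de, lag tau and noise radius rho.
x  = x(:);
N  = length(x);
dw = (de - 1)*tau;                          % embedding window
M  = N - dw;                                % number of embedded points
Z  = zeros(M, de);
for k = 1:de                                % step 1: time delay embedding
    Z(:, k) = x((1:M) + (k - 1)*tau);
end
s = zeros(N, 1);
i = randi(M);                               % step 2: random initial condition
s(1) = Z(i, 1);
for t = 2:N                                 % steps 3-6
    d = sqrt(sum((Z - repmat(Z(i, :), M, 1)).^2, 2));
    w = exp(-d/rho);                        % neighbour selection weights
    w(M) = 0;                               % last point has no image z_{j+1}
    j = find(cumsum(w)/sum(w) >= rand, 1, 'first');
    i = j + 1;                              % step 5: follow the chosen neighbour
    s(t) = Z(i, 1);                         % step 7: record the first coordinate
end
end

Each call produces one surrogate; an ensemble is obtained simply by repeated calls.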
5.3.2 The parameters of the algorithm
Previously, we have criticised two alternative surrogate generation algorithms for their user specifiable parameters and adjustable noise level. However, one immediately notices that the above algorithm has precisely three parameters: the embedding parameters d_e and τ, and the noise level (or rather the method for selecting one of the near neighbours of s_i at random). Since the aim of embedding the original time series is to reproduce dynamics topologically equivalent to the true system dynamics, we are attempting to satisfy the necessary conditions of Takens' embedding theorem [144]. Selection of the parameters d_e and τ for successful embedding has been considered earlier and we therefore do not consider that problem here. However, we observe that suitable selection of embedding parameters is likely to be dependent on the particular data set under consideration.7
7 It is important to observe that the embedding parameters used to construct the surrogates may be selected with reference to the data. However, any embedding parameters required for computation of the test statistic (say correlation dimension) must be selected independent of the test data. Otherwise one is effectively tuning the test statistic to the data, thereby increasing the likelihood of a false rejection of the null hypothesis; see [113].
Selection of an appropriate noise level is an entirely different problem. Excessive noise will produce surrogates entirely unlike the data — in the limiting case the surrogates will be equivalent to algorithm 0 surrogates.8 Insufficient noise will produce surrogates excessively like the data, leading to an increased likelihood of false positive results. First let us define the form of the noise. We select z_j from among the neighbours of s_i with probability

Prob(z_j = z_t) ∝ exp(−||z_t − s_i|| / ρ).    (5.1)
Admittedly, the restriction to this particular form of randomisation is arbitrary, but we feel that it is not entirely unreasonable. From experimentation with many alternative techniques [129], we can conclude that (for our purposes) the alternative modelling approaches performed no better, and in most cases worse, than the current scheme. These methods generally included additional user specifiable parameters. In the absence of any evidence to the contrary, we prefer to employ the simplest model that will produce the required dynamics (see [57]). With the specification of Eq. (5.1) we have now reduced the problem to selection of the noise radius ρ. Roughly, the noise radius ρ corresponds to the magnitude of dynamic noise in the system, and the operation of choosing z_j from among the neighbours of s_i is the influence of that noise on the system state. Hence, if there is no significant difference between the original data and the PPS data set, then the data is consistent with a periodic orbit (provided the data is pseudo-periodic) with dynamic noise. To apply this algorithm, we must choose an appropriate value of ρ. If ρ is too small, the surrogate and original data will be identical (or largely identical). If ρ is too large, the surrogate will be effectively i.i.d. noise. For a finite data set this means that at either extreme the number of short (length ≥ 2) segments of the surrogate that are identical to the data will be small. At some intermediate value the number of short segments that are identical will reach a maximum, and it is this maximum that we choose as the value of the noise radius to generate surrogates. In all cases (to be discussed in the following sections) we observe this characteristic behaviour, and selecting ρ in this way produces surrogates that appear visually identical to the data
8 This is not exactly true: algorithm 0 surrogates are typically selected randomly without replacement; under the influence of excessive noise this method would randomly select points on the attractor with replacement. However, as discussed earlier, sampling with replacement is certainly more appropriate.
Fig. 5.4 Surrogate test results for the Rossler system. Each panel depicts a probability distribution of correlation dimension estimates (as a contour plot) from 50 PPS data sets and the corresponding correlation dimension estimate for the data. Correlation dimension is estimated using the method described by Judd (thick solid line). For the chaotic Rossler data with dynamic noise (panel (a)) these results indicate rejection of the null hypothesis. For the noisy periodic Rossler (panel (b)), these results indicate the null hypothesis cannot be rejected. The horizontal and vertical axes are the dimensionless quantities log ε0 and dc(ε0) respectively.
but lack any long term determinism. Further discussion of the selection of ρ may be found in [137].
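The selection heuristic just described can be automated along the following lines. This MATLAB sketch scans a set of candidate noise radii, generates one surrogate for each (using the atsSurrogate sketch given in the previous subsection), and counts how many consecutive pairs of surrogate values also occur consecutively in the data; the candidate radius giving the largest count is returned. The candidate grid, and the use of a single surrogate per candidate, are simplifications made for illustration.

function rhoBest = chooseNoiseRadius(x, de, tau, rhoCandidates)
% Pick the noise radius that maximises the number of short segments of
% the surrogate that are copied verbatim from the data.
x = x(:);
counts = zeros(size(rhoCandidates));
for k = 1:numel(rhoCandidates)
    s = atsSurrogate(x, de, tau, rhoCandidates(k));
    c = 0;
    for t = 1:length(s) - 1
        % does the pair (s_t, s_{t+1}) occur consecutively in the data?
        idx = find(x(1:end-1) == s(t));
        c = c + any(x(idx + 1) == s(t+1));
    end
    counts(k) = c;
end
[~, kBest] = max(counts);
rhoBest = rhoCandidates(kBest);
end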
5.3.3 Linear noise and chaos
In [137] it was shown that this algorithm can differentiate between chaotic and periodic systems in the presence of noise. These results are summarised in Fig. 5.4. The results presented in Fig. 5.4 and subsequent plots compare correlation dimension estimates [50; 51] for the original time series and 50 PPS data sets. The correlation dimension estimates for the surrogates are presented as a contour plot. This contour plot is a pictorial representation of the probability distribution function (PDF) of dc(ε0) for fixed distinct values of ε0. That is, as before, each vertical slice is a different PDF computed using the kernel estimator described by [118]. The isoclines are selected automatically and uniformly. Any value outside the lowest isocline corresponds to a value outside the distribution of surrogate values. From the Rossler differential equations [152] in a chaotic and periodic regime (both period 2 and period 6) we generate simulations of 5000 points (sampling rate 0.2 time units) with Gaussian variates added to each component. Typical time series exhibit an apparently periodic structure with large oscillatory motion repeating irregularly. We found that the time se-
Fig. 5.5 Surrogate test results for periodic orbits with noise. Each panel depicts a probability distribution of correlation dimension estimates (as a contour plot) from 50 PPS data sets and the corresponding correlation dimension estimate for the data. Correlation dimension is estimated using the method described by Judd (see text). For the periodic orbit with white noise (panel (a)) these results indicate the null hypothesis cannot be rejected. For the periodic orbit with coloured noise (panel (b)) and for the periodic orbit with deterministic chaos (the Ikeda map, panel (c)), these results indicate rejection of the null hypothesis. The horizontal and vertical axes are the dimensionless quantities log ε0 and dc(ε0) respectively.
ries could not easily be distinguished by inspection, but application of the PPS algorithm with correlation dimension as a test statistic easily differentiated between the non-periodic and periodic dynamics [137]. Higher order moments and Shannon entropy as test statistics were unable to differentiate between data and surrogates. Mutual information provided significant separation between data and surrogates only for a small range of time lags. Figure 5.4 panel (b) shows the results of this algorithm applied to a periodic system with noise. In this section we tested the algorithm with periodic systems under the influence of various noise sources. In each case we examined a sine wave signal with additive noise. Figure 5.5 summarises the results for white and coloured noise. In the case of white noise the hypothesis test is unable to reject the null hypothesis, whereas for coloured
Fig. 5.6 Data and representative surrogates for human sinus rhythm ECG. Panel (a) depicts a representative recording of human sinus rhythm ECG. Panels (b) and (c) show representative PPS data set. Qualitatively the data and surrogates appear indistinguishable. On each panel the horizontal axis is the datum number and the vertical axis is surface ECG voltage (in mV).
noise we conclude rejection is possible. This is precisely as expected — a periodic orbit with white (i.e. uncorrelated) nose is consistent with the null hypothesis, in the presence of coloured (correlated) noise, the null hypothesis is rejected. We also tested the algorithm in the presence of deterministic chaotic dynamics (the Ikeda map [152]) superimposed on a periodic orbit. Although the non-periodic component in this case is temporally correlated, that correlation is only short term. Higher order iterates appear less correlated. Hence, the first return map of the periodic orbit appears to behave much like independent and identically distributed noise (i.e. high dimensional dynamics). The PPS algorithm generated surrogates that appeared qualitatively similar to the data, however, application of correlation dimension as a test statistic led to clear rejection of the associated null hypothesis. Results of analysis of human electrocardiogram (ECG) data have been presented in [137]. The results of that work showed that human electrocardiogram during both sinus rhythm and ventricular tachycardia (VT) were not consistent with a periodic orbit with uncorrelated noise. This result is significant, because ECG during both rhythms is regular and appears largely periodic. Representative data during VT is depicted in Fig.
Fig. 5.7 A recording of Japanese vowel intonation. The three panels depict sections of the same recording of a Japanese vowel sound ("a") on different time scales. The data is sampled at 48 kHz.
5.3. Figure 5.6 depicts typical PPS data generated for a recording of sinus rhythm. Qualitatively the ECG time series and PPS data are indistinguishable. However, surrogate analysis demonstrates that these time series are inconsistent with the null hypothesis of a periodic orbit with uncorrelated noise. The presence of long term correlations, and possibly determinism, indicates that analysis using the techniques of nonlinear dynamics, and examining inter-beat dynamics (the so-called RR intervals), should prove fruitful. See [27] for a compelling example of the control of cardiac arrhythmia in human subjects using the methods of unstable periodic orbit detection and targeting.
5.4 Application: Mimicking human vocalisation patterns
Determining whether observed data is consistent with specific classes of dynamical systems is often (with the probable exception of the application
Fig. 5.8 An ATS data set for the data in Fig. 5.7. The three panels depict sections of the same surrogate data set on different time scales. Clearly this surrogate data and the data in the preceding figure are qualitatively very similar.
to finance) not necessarily interesting in its own right. In this application we focus on mimicking vocalisation patterns. As a consequence of this technique, we can, of course, test whether these vocalisation patterns are consistent with a noisy periodic orbit or not [49]. But this is not our focus (in fact, this problem is considered for Chinese vowel sounds in [76]). The data depicted in Fig. 5.7 show a time trace of a Japanese vowel sound ("a") on different time scales. The data is clearly not linear and clearly very complicated. However, what is not clear is whether this data is a noisy periodic orbit or a deterministic chaotic system. We embed the data depicted in Fig. 5.7 in 6 dimensions with a constant embedding time lag equal to one-quarter of the period (113 points) and estimate a suitable value of the noise radius ρ.9 A typical realisation of the surrogate with this value of ρ is depicted in Fig. 5.8. It is clear from Figs. 5.7 and 5.8 that the data and surrogate are qualitatively extremely similar. This is re-assuring as it demonstrates that simple
9 We find ρ = 4800 is suitable.
Fig. 5.9 Power spectra for the vowel sounding data and surrogate. The two lines denote the power spectra estimated from the data and a representative ATS surrogate for data recorded during human vowel intonation.
embedding strategies can capture complex dynamics and, moreover, that the ATS algorithm is capable of mimicking such complex behaviour. In Fig. 5.9 we compare the power spectra of both signals and find that, in the frequency domain as well as the time domain, these signals are very similar. Finally, we mention that comparing the resultant audio files betrays no evidence of any distinction.10
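A comparison such as that in Fig. 5.9 can be produced with a few lines of MATLAB; the following sketch assumes the vowel recording x and an ATS surrogate s are already available in the workspace (for example from the atsSurrogate sketch above) and simply overlays their periodograms.

% Overlay the periodograms of the data and one ATS surrogate.
N  = length(x);
f  = (0:floor(N/2) - 1) * 48000 / N;        % frequency axis, 48 kHz sampling
Px = abs(fft(x - mean(x))).^2 / N;          % periodogram of the data
Ps = abs(fft(s - mean(s))).^2 / N;          % periodogram of the surrogate
semilogy(f, Px(1:floor(N/2)), f, Ps(1:floor(N/2)));
xlabel('frequency (Hz)'); ylabel('power');
legend('data', 'ATS surrogate');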
5.5 Application: Are financial time series really deterministic?
We now reconsider the analysis of the financial time series depicted in Sec. 4.6. In particular, we apply the PPS algorithm here to mimic the nonstationarity (i.e. possible conditional heteroskedasticity) evident in the 10 These audio files can be obtained directly from the author, or from the book's website.
Table 5.1 Surrogate data hypothesis tests. Rejection of CD or NLPE with 99% confidence (assuming Gaussian distribution of the surrogate values) is indicated with "CD" (correlation dimension, using the GKA) and "NP" (nonlinear prediction error) respectively. Rejection of CD or NLPE with 98% confidence (as the test statistic for the data is an outlier of the distribution) is denoted by "cd" or "np" respectively. This table indicates general rejection of each of the hypotheses H3 associated with the PPS method. Exceptions are discussed in the text.
N | GOLD | DJIA | USDJPY
100 cd cd
200 cd cd cd
300 cd cd
400 cd cd cd
500 cd cd cd
600 cd cd cd
700 cd cd cd
800 cd cd
900 cd cd cd
1000 cd cd+np cd
2000 np cd+np
3000 np cd+np
4000 cd+np np
5000 np cd cd+np
6000 np cd np
7000 np cd+np cd+np
8000 CD cd+np CD
9000 cd cd+np CD
10000 cd CD CD
data. As this financial data is not pseudo-periodic, we are not applying the PPS algorithm to mimic such structures. For this application, the PPS is applied to mimic the "burstiness" observed in non-stationary financial time series. We apply the PPS algorithm to the same data as analysed in Sec. 4.6. Table 5.1 summarises the results and Fig. 4.11 provides an illustrative computation. The embedding dimension for the PPS algorithm was chosen to be de = 8.11
11 As discussed by [113], one should select parameters of the surrogate algorithm independently of parameters of the test statistic. For the test statistics we selected de = 6 based on evidence in the literature concerning the dimension of this data [8]. For PPS generation we found de = 8 performed best.
The results of the PPS algorithm are clearer, and provide some clarification of the results of the linear tests. In all but six cases the PPS algorithm found sufficient evidence to reject H3.12 Since Hi ⊂ H3 for i = 0, 1, 2, we have (as a consequence of rejecting H3) that H2 should be rejected. Therefore, we conclude (five of the six exceptions notwithstanding) that these data are not generated by a static transformation of a coloured noise process. Rejection of H3 strengthens this statement further. As PPS data can capture non-stationarity and drift [129], it is capable of modelling locally linear nonlinearities and non-stationarity in the variance (GARCH processes). Therefore, we conclude that these financial data are not a static transformation of non-stationary coloured noise. To show that the PPS algorithm can indeed test for ARCH-, GARCH- and EGARCH-type dynamics in the data, we apply the algorithm to three test systems.13 An autoregressive conditional heteroskedastic (ARCH) model makes the volatility a function of the state:

x_t = σ_t u_t,    σ_t^2 = a + φ x_{t−1}^2.    (5.2)

For computation we set a = 1 and φ = 0.5 (as [48] did). A generalised ARCH (GARCH) process extends Eq. (5.2) so that the noise level is also a function of the past noise level:

σ_t^2 = a + φ x_{t−1}^2 + ξ σ_{t−1}^2.    (5.3)

We simulate (5.3) with a = 1, φ = 0.1 and ξ = 0.8 [48]. Finally, an exponential GARCH (EGARCH) process introduces a static nonlinear transformation of the noise level:

log σ_t^2 = a + φ |x_{t−1}|/σ_{t−1} + ξ log σ_{t−1}^2 + γ x_{t−1}/σ_{t−1}    (5.4)

(a = 1, φ = 0.1, ξ = 0.8 and γ = 0.1). For stochastic processes of the forms (5.2), (5.3) and (5.4), Fig. 5.10 shows typical results of the application of the PPS algorithm. For data lengths from N = 100 to N = 10000 we found that linear surrogate algorithms (algorithms 0, 1 and 2) showed clear evidence that this data was not consistent with the null hypothesis (of a
12 There are six exceptions. Of these six exceptions (see Table 5.1), five occur when H2 is rejected and therefore linear noise may still be rejected as the origin of this data. This leaves only one exception: for YEN (N = 300) we are unable to make any conclusion. One false negative from 57 trials with two independent test statistics is reasonable.
13 These three systems are given by Eqs. (25), (26) and (27) of [48].
Fig. 5.10 Testing for conditional heteroskedasticity. Typical behaviour of the surrogate algorithms for ARCH (Eq. 5.2, top row), GARCH (Eq. 5.3, middle row), and EGARCH (Eq. 5.4, bottom row) processes. The calculations illustrated here are for N = 1000. Each panel depicts the ensemble of correlation dimension estimates for the surrogates (contour plot) and the value for the test data (solid line, with stars). From this diagram (as with our test calculations for N = 100, ..., 10000) one can reject H0, H1, and H2, but may not reject H3. This confirms our assertion that the PPS algorithm provides a test for conditional heteroskedasticity.
nonlinear monotonic transformation of linearly filtered noise). However, the PPS algorithm was unable to reject these systems. Therefore we conclude that the financial data we examine here is distinct from ARCH, GARCH and EGARCH processes. Conditional heteroskedasticity is insufficient to explain the structure observed in these financial time series.
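For readers wishing to repeat this experiment, the three test systems can be simulated with a few lines of MATLAB. The sketch below follows Eqs. (5.2)-(5.4) with the parameter values quoted above; the EGARCH recursion follows Eq. (5.4) as reconstructed here and should be checked against [48] before serious use.

function x = simulateHeteroskedastic(N, model)
% Simulate an ARCH, GARCH or EGARCH test series of length N.
x = zeros(N, 1);
sigma2 = 1;                                % initial variance (arbitrary)
for t = 2:N
    switch model
        case 'ARCH'                        % Eq. (5.2): a = 1, phi = 0.5
            sigma2 = 1 + 0.5*x(t-1)^2;
        case 'GARCH'                       % Eq. (5.3): a = 1, phi = 0.1, xi = 0.8
            sigma2 = 1 + 0.1*x(t-1)^2 + 0.8*sigma2;
        case 'EGARCH'                      % Eq. (5.4): a = 1, phi = 0.1, xi = 0.8, gamma = 0.1
            sig = sqrt(sigma2);
            sigma2 = exp(1 + 0.1*abs(x(t-1))/sig + 0.8*log(sigma2) + 0.1*x(t-1)/sig);
    end
    x(t) = sqrt(sigma2)*randn;             % x_t = sigma_t u_t
end
end

Applying the PPS algorithm and correlation dimension to series generated in this way should reproduce the behaviour of Fig. 5.10: rejection of H0, H1 and H2, but not of H3.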
Table 5.2 Rate of false positives observed with the PPS algorithm. Thirty realisations of length N = 100, 200, 300, ..., 1000 were generated and the PPS algorithm applied. This table depicts the frequency of false positives (incorrect rejection of the null).
N | ARCH | GARCH | EGARCH
100 0 1 2
200 0 0 1
300 400 500 600 700 800 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0
900
1000 0 0 1
We repeated the computations depicted in Fig. 5.10 for multiple realisations of (5.2), (5.3) and (5.4) for different length time series N = 100, 200, 300, ..., 1000. The results of this computation are summarised in Table 5.2. We found in each case (with 30 realisations of each of these processes) a clear failure to reject the null. The highest rate of false positives (p = 0.1) occurred for short realisations of GARCH processes. Hence, this test is robust against false positives. The local linearity of the PPS data means that certain mildly nonlinear systems may also be rejected as the source of this data. However, we can make no definite statement about how "mild" is mild enough. Instead, we conclude that these data are not a linear stochastic process, but should be modelled as stochastically-driven nonlinear dynamical systems. There are two further significant observations to be made from this data. Using correlation dimension as the test statistic, rejection of the hypotheses occurred when the correlation dimension of the data was lower than that of the surrogates. Therefore, these time series are more deterministic than the corresponding surrogates. This implies that some of the apparent stochasticity in these time series is deterministic. This does not imply that the financial markets exhibit deterministic chaos. It does imply that these data exhibit deterministic components that are assumed (by the random walk, ARIMA and GARCH models) to be stochastic. For NLPE, one finds that the surrogates and data only differ significantly when the nonlinear predictive power of the data is worse than that of the surrogates. Hence, these time series are less predictable than the corresponding stochastic processes. This result may be somewhat surprising, but it is perhaps less shocking than the converse would be. If this data was found to be more predictable than noise, that would indicate that local linear models could be employed for profitable trading (at least trading with a
Fig. 5.11 PPS surrogate calculation for YEN/USD, DJIA, and gold data. Panel (a) depicts the results of PPS analysis for the YEN/USD exchange rate data; panel (b) is the same calculation for the DJIA data. Analysis with PPS surrogates indicates that, for these two data sets, the null hypothesis cannot be rejected. Therefore, these systems are largely stochastic and exhibit no long term deterministic dynamics. For the gold data in panel (c) the PPS surrogates indicate the null hypothesis can be rejected. This may indicate additional long term determinism not found in the other two financial time series considered here. The horizontal and vertical axes are the dimensionless quantities log ε0 and dc(ε0) respectively.
positive expectation). Hence, this result implies that the model used to estimate nonlinear prediction error (a local linear model) is unable to capture the determinism of this data. This result supports the conclusion of the PPS hypothesis test: the dynamics of this system may not be adequately modelled by locally linear nonlinear models. Using only algorithm 2 surrogates, one may conclude that these systems are not consistent with a monotonic nonlinear transformation of linearly filtered noise. However, this does not mean that the appealing alternative hypothesis of long term determinism is true. To test this data against the null hypothesis of short term dynamics with uncorrelated noise, we apply the PPS algorithm. For each data set we employed the ensemble of test
statistics described previously. For the YEN/USD and DJIA data we found no evidence to reject the null hypothesis. The likely conclusion is that these time series exhibit only short term stationary deterministic dynamics with uncorrelated noise, as shown in Fig. 5.11. Coupled with the rejection of the hypothesis of a monotonic nonlinear transformation of linearly filtered noise (algorithm 2 surrogates), we are led to conclude that these data contain dynamic, nonlinearly filtered noise. Conversely, for the daily gold price data we found that the time series and PPS data were significantly different. Therefore, for this system we were able to reject the null hypothesis of short term stationary deterministic dynamics with uncorrelated noise. This may indicate that this system exhibits greater determinism and greater predictability. The simplest explanation for this result is that the time series exhibits correlated noise with a long correlation time. In either case this represents deterministic and (because of the rejection of the hypothesis associated with algorithm 2) nonlinear information that may be exploited for prediction.
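The testing procedure used repeatedly in this section can be summarised in a few lines: compute a statistic for the data, compute the same statistic for an ensemble of surrogates, and ask whether the data value is typical of the surrogate distribution. The following Python sketch is a generic rank-based version of this procedure. Here `statistic` and `generate_surrogate` are hypothetical placeholders for whichever test statistic (correlation dimension, nonlinear prediction error) and surrogate scheme (algorithm 0, 1 or 2, or PPS) one wishes to employ; the shuffled surrogates in the usage example are for illustration only and are not the code used to produce the results reported above.

```python
import numpy as np

def surrogate_test(data, statistic, generate_surrogate, n_surrogates=99, seed=None):
    """Non-parametric (rank-based) surrogate data hypothesis test.

    `statistic` maps a time series to a scalar; `generate_surrogate` returns one
    surrogate realisation consistent with the null hypothesis.  The data is deemed
    atypical of the null if its statistic lies in either tail of the surrogate
    distribution.
    """
    rng = np.random.default_rng(seed)
    t_data = statistic(data)
    t_surr = np.array([statistic(generate_surrogate(data, rng))
                       for _ in range(n_surrogates)])
    rank = int(np.sum(t_surr < t_data))        # position of the data among the surrogates
    p_value = 2.0 * min(rank + 1, n_surrogates + 1 - rank) / (n_surrogates + 1)
    return t_data, t_surr, p_value

# Illustrative usage with a crude statistic and algorithm 0 (shuffled) surrogates.
rng = np.random.default_rng(1)
z = np.cumsum(rng.normal(size=1000))               # a random walk, purely for illustration
stat = lambda x: np.mean(np.abs(np.diff(x, 2)))    # any nonlinearity-sensitive statistic
shuffle = lambda x, r: r.permutation(x)
print("p-value:", surrogate_test(z, stat, shuffle, n_surrogates=39, seed=0)[2])
```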
5.6  Simulated annealing and other computational methods
It would be desirable to arbitrarily extend the linear surrogate techniques of the previous chapter to test non-linear null hypotheses. The model-based surrogates in Sec. 5.1 and 5.3 are a step towards this objective. A notable alternative approach was proposed by Schreiber [111], and depends on simulated annealing. The idea is that one sets up an objective function which has a minimum when the conditions one wishes upon the surrogates are satisfied. For example, for algorithm 2 type surrogates, that objective function could measure deviation in power spectrum and rank distribution between data and surrogates.14 With this objective function established, one then unleashes some optimisation tools to obtain a minimum. In the original paper, Schreiber [111] employs simulated annealing [20]. This algorithm is attractive because it is useful for large scale problems (in this case the dimension of the search space is the length of the data) and can find a reasonable solution relatively quickly.15 However, the disadvantage of this idea is clear: one needs to perform an optimisation over ℝ^N where N is the length of the data, normally

14 This is precisely the approach adopted in [112], albeit with a more ad hoc optimisation technique.
15 But, as a large scale optimisation technique it is still slow and computationally expensive.
N is of order 10³ and in Fig. 5.8, for example, N = 25000. Finally, the issue of suitable randomisation of individual surrogates in this approach has not been fully addressed. The randomisation occurs entirely in the simulated annealing algorithm.16 Although this may be adequate for many situations, it is unclear that surrogates generated in this way will necessarily be sufficiently different from one another. Despite these difficulties, and the obvious computational cost of this approach, when one is faced with a short data set and a desire to test particularly complex null hypotheses, this approach may well be fruitful. A sketch of such an annealing scheme is given at the end of this section.

A specific model based alternative has been proposed by Dolan and co-workers [28] for testing for the presence of unstable periodic orbits. Unstable periodic orbits are embedded in all chaotic systems and have been found to be potentially useful in the control of cardiac arrhythmia, among other things. The attractor surrogates described by Dolan et al. [28] test whether the data really justify the supposed unstable periodic orbits. Surrogates are constructed by partitioning the embedding phase space into a set of disjoint contiguous patches. One then models the deterministic dynamics as a finite state Markov process17 with transition probabilities deduced from the data. In this way, one may obtain a symbol sequence corresponding to the transitions between distinct patches in the partition. One then chooses individual data within these patches to generate the surrogate. For low dimensional, or extremely noisy systems, this approach is likely to work well. The extension to high-dimensional systems may, however, be problematic. Moreover, in such situations, this method is analogous to a coarse-grained version of the algorithm described in Sec. 5.3.

In the search for periodic dynamics, and to distinguish linear from nonlinear determinism, an interesting alternative has recently been suggested by Luo and Nakamura [78]. The basic principle of the method is the realisation that, by definition, linear processes are closed under addition. That is, suppose {x_t} and {y_t} are two realisations of the same linear process. Then {x_t + y_t} is a distinct realisation of the same linear process. However, if {x_t} and {y_t} are two realisations of the same nonlinear process, {x_t + y_t} represents a different nonlinear process. Usually, the complexity of {x_t + y_t} will exceed that of {x_t} (and {y_t}). To exploit this behaviour in
16 Simulated annealing optimises by randomly trying different solutions and gradually decreasing the amount of randomisation as solutions get better.
17 In [28] the authors use a two-dimensional embedding and a second order Markov process; one would suppose that the extension to higher dimension would require substantial data.
practice we must extract two independent realisations, {x_t} and {y_t}, from the single observed time series {z_t}. We can achieve this by setting x_t = z_t and y_t = z_{t+s} for some suitable shift s. For a chaotic system we only need to choose s larger than the characteristic decorrelation time.18 For a linear system we only need to check that {x_t} and {y_t} are uncorrelated. Hence, by following the same procedure, we can expect very different behaviour for linear and nonlinear processes. The utility and generalisability of this approach is currently under investigation [89].
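Returning to the annealing approach described at the start of this section, the following Python sketch generates a single surrogate by minimising the discrepancy between the periodogram of a permuted copy of the data and that of the data itself; because candidate moves are pairwise swaps, the rank distribution of the data is preserved exactly. The cost function, cooling schedule and number of steps are simple illustrative choices and are not the specific choices made by Schreiber [111].

```python
import numpy as np

def annealed_surrogate(x, n_steps=20000, t0=1.0, cooling=0.9995, seed=None):
    """One constrained surrogate via simulated annealing.

    The surrogate is a permutation of the data (so its amplitude and rank
    distribution are preserved exactly); the cost penalises deviation of its
    periodogram from that of the data, so linear correlations are
    approximately preserved as well.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    target = np.abs(np.fft.rfft(x - x.mean())) ** 2      # periodogram of the data

    def cost(y):
        return np.mean((np.abs(np.fft.rfft(y - y.mean())) ** 2 - target) ** 2)

    s = rng.permutation(x)                                # start from a random shuffle
    c, temp = cost(s), t0
    for _ in range(n_steps):
        i, j = rng.integers(0, len(s), size=2)
        s[i], s[j] = s[j], s[i]                           # propose a pairwise swap
        c_new = cost(s)
        if c_new < c or rng.random() < np.exp((c - c_new) / temp):
            c = c_new                                     # accept the move
        else:
            s[i], s[j] = s[j], s[i]                       # reject: undo the swap
        temp *= cooling                                   # gradually reduce the randomisation
    return s
```

Each trial move recomputes a length-N Fourier transform, so the cost of the scheme grows quickly with the length of the data; this is precisely the computational burden noted above.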
5.7  Summary
In Sec. 5.1 we argued that any dynamic measure is a pivotal statistic for a very wide range of standard (linear) and nonlinear hypotheses addressed by surrogate data analysis. However, we found that one must be able to estimate this quantity consistently from data. We have at our disposal a very powerful and useful method of estimating correlation dimension d_c(ε_0) as a function of scale ε_0. The details of this method have been considered elsewhere [50; 51] and an examination of the accuracy of this method may be found in, for example, [35]. Some scaling properties of this estimate prevent it from being pivotal over as wide a range of different processes as the true correlation dimension would be, if it could be calculated.19 However, this statistic is still pivotal for a large enough class of processes to be an effectively pivotal test statistic for surrogate analysis. Rescaling the surrogates to have the same rank distribution as the data produced sufficiently good results for the linear surrogates in Sec. 4.5. Estimates of d_c(ε_0) are pivotal over the sets of surrogates produced by algorithms 0, 1 and 2, and over the class of nonlinear surrogates generated by simulations of cylindrical basis models. This gives us a quick, effective and informative method for testing the hypotheses suggested by algorithm 0, 1, and 2 surrogates. Furthermore, it completely relieves the concerns raised by Schreiber and Schmitz [112]. If the test statistic is (asymptotically) pivotal, it doesn't matter if the power
18 That is, 1/λ_1 where λ_1 is the largest Lyapunov exponent.
19 This may be a useful feature of this version of correlation dimension. The scale dependent properties of this algorithm mean that the algorithm may be able to differentiate between systems with identical correlation dimension. For example, rescaling the data with an instantaneous nonlinear transformation will produce a different estimate of d_c(ε_0) (at least for large ε_0) but will not change the actual (asymptotic, ε_0 → 0) value of d_c. This would allow one to differentiate between (for example) two differently shaped two-dimensional periodic orbits.
spectrum of surrogate and data are not identical (this is only a requirement for a constrained realisation scheme). The correlation dimension estimates of a monotonic nonlinear transformation of linearly filtered noise will have the same probability distribution regardless of exactly what the power spectrum is.

With the help of minimum description length pseudo-linear modelling techniques, which we will describe in the next chapter, correlation dimension also provides a useful statistic to test membership of particular classes of nonlinear dynamical processes. The hypothesis being tested is influenced by the results of the modelling procedure and cannot be determined a priori. After checking that all models have the same distribution of test statistic values and are representative of the data (in the sense that the models produce simulations that have the qualitative features of the data), one is able to build a single nonlinear model of the data and test the hypothesis that the data was generated from a process in the class of dynamical processes that share the characteristics (such as periodic structure) of that model. The greatest problem with this approach is its dependence on the nonlinear modelling algorithm: a notoriously difficult process. Although, in the next chapter, we will present robust and reliable methods of building nonlinear models, a non-parametric alternative is still desirable. The PPS algorithm described in Sec. 5.3 fulfils this requirement. Surrogates generated by this method can be used to test for a noisy periodic orbit or a pseudo-periodic chaotic dynamical system. When applied to financial time series data we were even able to use this algorithm to test for nonlinear determinism in nonstationary data.

In all the techniques we introduced in this section, the emphasis was on the appropriate surrogate generation algorithm. However, a study of pivotal statistics shows that there is a related and equally important problem: the power of the chosen test statistic needs to be addressed. In general one is looking for (evidence of) nonlinearity, and so we choose statistics which are sensitive to nonlinearity in the data. But different choices of test statistic could lead to quite different hypothesis tests.

Finally, in Sec. 5.6 we described several alternative techniques. Among them is a new method to distinguish between linear and nonlinear systems. When used in conjunction with the PPS algorithm, this test is likely to provide valuable information for experimental time series data.
Chapter 6
Identifying the dynamics
Up until this point, the main theme of this book has been the estimation of dynamic invariants from time series, and the application of surrogate data methods for hypothesis testing. Indeed, this is the main focus of this book. However, in this chapter we present a distinct topic which is usually dealt with separately, and in considerably more detail, in entire volumes of its own. The problem of building models of the underlying dynamics from time series is typically considered as separate and distinct from embedding, estimation of dynamic invariants and surrogate testing. However, this is not the approach we take here. We have already seen in Sec. 1.6 the application of embedding for modelling purposes, and that the two processes are related. In general it does not make sense to talk about finding the "best" embedding, unless one can first specify the purpose of the embedding. If one's aim is to build a good model, that model will depend critically on the choice of embedding. The results in Sec. 1.6 showed that the "best" embedding from which to build a predictive model of the dynamics is almost always not the standard uniform technique. Moreover, in Sec. 5.1 we saw that models may be used to generate surrogates, and in Sec. 5.4 we demonstrated that surrogates may be used as models. Surrogates can be generated by simulating the dynamics implied by a model, and surrogates are based on a model which may be specified either implicitly or explicitly. By necessity, we have already been discussing the problem of building models in some detail. In this penultimate chapter, we focus exclusively on the problem of building deterministic models of reality, and on what can be learned from such models. In Sec. 6.1, we describe an important demarcation between two broad types of models. Section 6.3 describes local modelling techniques and Sec. 6.5 covers what we describe as "semi-local" methods. In this chapter, we
also present two interesting and diverse applications: modelling of transmission of Severe Acute Respiratory Syndrome (SARS) in Hong Kong, in Sec. 6.2; and, a deterministic study of the onset of ventricular fibrillation in humans in Sec. 6.6.
6.1  Phenomenological and ontological models
Before presenting a detailed discussion of the primary approaches to modelling physical systems, it is best that we start by describing exactly what we mean by models. For the purposes of this discussion we follow the approach outlined by Judd [52]. The generic view of modelling is that a model is a representation of reality. Whereas reality may be arbitrarily complex, the model is, by necessity, rather simpler. Moreover, the model is built not directly from reality but only from our limited observations of reality. This is precisely the framework provided by Takens' embedding theorem: one may observe reality and build a model of that reality. But the model is only as good as the observations. According to Takens' theorem, the model will only be perfect if the observations are perfect and there are sufficiently many of them. This can never be the case. Therefore, whatever model we build, it is only a simple caricature of the underlying (potentially complex) reality. We now must consider how such models are to be built.

In general, we will describe two types of models [52]: ontological and phenomenological models. Until the twentieth century, the vast majority of models (although not all) had been ontological: that is, derived by logic from known facts (or assumptions) about reality. Newton's laws of motion are examples of simple ontological models of reality based directly on assumptions about the way physical matter behaves. Einstein's famous equation E = mc² is also an ontological model, this time based on a more complex deductive process, but still a logical conclusion from certain axiomatic assumptions. Euclidean geometry as a model for reality behaves in the same way: from certain axioms, one derives various theorems which serve as a model of reality. Such ontological models are based on rules presumed to hold true. These rules are deduced from physical observations of reality and the models can be confirmed or disproved with further observations of reality. However, the models do not directly "react" to observations of reality. A second type of model, originally in the domain of statistics, but now finding wider
application (with the advent of modern computers), is what we will call phenomenological models. Phenomenological models are specifically built to mimic the observations. They are characterised by a lack of profound understanding of reality1 and by the pragmatic (often brute force) approach of forcing the model to fit the observed data. From a general class of models, replete with a large set of parameters, one chooses parameter values which specify the member of that model class which best reflects our observations of reality. Because the model is built from observations of reality, one cannot use those same observations to validate the model. However, consistency of the model with new observations of reality can, again, be used to either validate or invalidate the chosen model. When one lacks an adequate understanding of reality, a phenomenological model is the best that can be done. Usually, in the analysis of time series, one deals with phenomenological models. In the remainder of this chapter we will present several useful phenomenological model classes, and techniques which may be used to fit them to data. In summary, phenomenological models describe the observed data whereas ontological models attempt (usually imperfectly) to explain reality.

First, in Sec. 6.2 we present an example of a model which is both ontological and phenomenological. It is ontological in the sense that it is based on widely known (or assumed) equations for disease transmission dynamics and a social network structure which is presumed to exist. The model is phenomenological because the parameters of the disease transmission dynamics are fitted directly from the data. We will consider this example in considerable detail as it also incorporates many of the surrogate ideas of the previous section and illustrates how model simulations may be used as surrogates to validate the model.
6.2  Application: Severe Acute Respiratory Syndrome: Assessing governmental control strategies during the SARS outbreak in Hong Kong
Severe Acute Respiratory Syndrome (SARS) emerged in 2003 and infected over 8000 people globally [163; 164]. While the humanitarian and economic impact of the epidemic was significant, the disease propagation was characterised by a relatively small number of clusters of SARS cases and several
1 This is certainly not the same thing as a profound lack of understanding.
Fig. 6.1 The SARS data for HK. The top panel depicts the original reported number of suspected SARS cases per day (dark), and subsequently revised estimates (light grey). The lower panel depicts the inter-day variability for the revised data shown in the top plot. From the lower plot it is clear that the data is both "bursty" and apparently stochastic.
super-spreader events (SSE). Elsewhere [102], an SSE is defined as a rare event resulting in more than the average number of secondary infections from a single infectious individual. Clustering and SSEs result in disease propagation dynamics that appear characteristically "bursty". Figure 6.1 depicts the daily number of reported suspected SARS cases in Hong Kong. The initial data has subsequently been revised by the Hong Kong Department of Health [115] to provide the actual number of identified SARS cases on a daily basis. It is this revised data which we study here. Significant2 numbers of SARS cases were also reported during 2003
2 More than 50, say.
for Taiwan, Singapore, Vietnam, Canada and various parts of the Chinese mainland (most notably Guangdong province and Beijing municipality). The virus originated in Guangdong province (which borders Hong Kong) in November 2002 and the outbreak was spread to Hong Kong in February of 2003 by a medical doctor from the mainland. The general epidemiology of the outbreak has been extensively summarised by [15].

Analysis of early SARS infection data by a large group of researchers [102] found that during the first 10 weeks of the epidemic, excluding SSEs, an average of 2.7 secondary infections were generated for each case.3 In the same paper, [102] describe a stochastic compartmental model of transmission dynamics. However, the model described in that paper is somewhat over-parametrised. In [30], Donnelly and colleagues describe the initial explosive exponential growth of reported SARS cases quickly being tempered. Moreover, their study also found that the mean incubation period of the disease is 6.4 days (with a 95% confidence interval of 5.2-7.7 days). We use this value of 6.4 days to model the duration of the incubation period in the models we introduce below. Moreover, based on case studies, they also found that the average time between onset of clinical symptoms and hospitalisation was between 3 and 5 days. We used these values as an initial model for the time prior to isolation.

Recent studies [73; 162] have analysed individual SARS cases in Hong Kong to provide us with rough estimates of the transmissibility of the disease. [73] studied transmission of SARS between household members in households with at least one SARS case. They found that among 2139 household members of 881 index patients, there was a total of 188 likely secondary infections.4 Therefore, the rate of infection is approximately 0.088. If we suppose that the originally infected household member remains in the household for 3 to 5 days, we may assume that the probability of obtaining SARS from cohabitation (in one day) with an infected individual is approximately 0.02. Finally, [162] studied a single SSE in the Prince of Wales hospital in Hong Kong. In this case, a single infectious individual infected 10 of 27 medical students who came into direct contact with him. This (admittedly highly infectious) individual therefore provides us with an estimate of a

3 Actually, the precise meaning of this finding is unclear as the authors define a SSE as more than the average number of secondary infections!
4 The authors acknowledge that there is a possibility that SARS was contracted from a separate, or possibly common, source rather than the infected family member, but this is probably unlikely.
Fig. 6.2 Estimate of the total daily number of infectious individuals for HK. The solid line is the cumulative sum of the revised daily SARS case data minus the cumulative sum of the daily recoveries and fatalities. The dashed line is three exponential fits to this data as described in the text.
transmission rate of 0.37. This transmission rate is significantly higher than one would expect for general person-to-person transmission. One would suppose that the relationship between patient and doctor is somewhat more intimate than normal "social" contact. We do not consider this value in our analysis.

Referring to the data in Fig. 6.1, we initially attempt to model this with standard SEIR-type epidemic dynamics [87]. Denote the number of individuals that are susceptible to infection by S, the number that have been exposed (that is, those that are infected but not yet infectious) by E, the number that are infectious by I and the number that have been removed (either recovered, isolated or deceased) by R. It is then easy to see that the spread of the disease can be given by the SEIR model equations

   N = S + E + I + R
   S' = -σSI
   E' = σSI - ηE          (6.1)
   I' = ηE - γI
   R' = γI

where N is the total population size and σ, η and γ are the various transition probabilities between the four categories. In the case of transmission of
SARS-CoV, N ≫ E + I + R and so N ≈ S. The equations then reduce to

   E' = σNI - ηE
   I' = ηE - γI          (6.2)
   R' = γI

or, written as difference equations,

   ΔE_t = σN I_t - ηE_t
   ΔI_t = ηE_t - γI_t          (6.3)
   ΔR_t = γI_t

which can be easily solved. Solutions are either exponential growth or exponential decay. From the time series data for transmission of SARS-CoV in Fig. 6.1, we can derive a value for I as follows. We approximate the number of infectious individuals as the cumulative sum of the (revised) number of SARS cases minus the cumulative sums of the number of deaths and recoveries. This time series is depicted in Fig. 6.2. In making this approximation we assume that hospitalised individuals are still infectious (this assumption is unavoidable; we will see later that it is also reasonable), and that recovered individuals are no longer infectious.

From Fig. 6.2, we can clearly see the exponential decay phase of the data. However, there is no clear exponential growth. To provide a good fit to this data one must require that the model parameters of Eq. (6.2) be significantly non-stationary. This is clearly undesirable. Instead, we must resort to stochastic models of disease propagation. For a relatively small number of infections in a large population, it is only natural that this should be the case. The simplest stochastic model is obtained by modifying (6.3):

   ΔE_t = σN I_t - ηE_t - ε_e
   ΔI_t = ηE_t - γI_t + ε_e - ε_r          (6.4)
   ΔR_t = γI_t + ε_r

where ε_e and ε_r are independent random variables. Epidemiological motivation may provide alternative formulations for the noise terms ε_r and ε_e in Eq. (6.4). However, for the current discussion, this is not important. Let us consider the difference equation in Eq. (6.4) and suppose only that ε_e and ε_r are independent and identically distributed (i.i.d.) random variables. This is precisely equivalent to supposing that the
Fig. 6.3 Surrogate calculation for models of the form (6.4). Surrogate calculation testing the hypothesis that a model of the form (6.4), with two sets of parameter values, is sufficient to explain the observed data. Based on the evidence presented in this figure, such a model is not sufficient.
errors in the fit of an exponential growth and decay curve to the two phases of the data, shown in Fig. 6.2, are i.i.d. We therefore test the residuals of such a fit against the null hypothesis of i.i.d. noise using the method of surrogate data [145]. Figure 6.3 depicts the result of this calculation. Using lag one autocorrelation (i.e. ⟨x_t x_{t-1}⟩) as a test statistic, we fit an exponential (growth curve) to the first 36 days, a second exponential (growth curve) to the next 24 days and a third exponential (decay curve) to the remainder of the data. We can see that the data is not consistent with the hypothesis of Eq. (6.4). Therefore we can reject Eq. (6.4) as a suitable model of the transmission dynamics, regardless of the distribution of the noise processes ε_e and ε_r.

We will now propose a more sophisticated alternative model based on a small world network structure. The SEIR model discussed above supposes that all individuals have an equally small probability of being infected. We now propose an alternative model structure where infection can only occur along certain predefined paths. Suppose that the population is arranged in a regular grid. Each node can infect its four immediate neighbours (horizontally and vertically) and a random number of remote nodes. From the literature reviewed earlier, we see that the probability of infecting near neighbours (supposed to be members of the same household) is likely to be distinct from the probability of infecting remote neighbours (supposed to be daily acquaintances).
Fig. 6.4 The four compartmental small world network model of disease propagation. The top panel depicts the transmission state diagram: S to E based on the small world structure and the infection probabilities p_1 and p_2; E to I with probability r_0; and I to R with probability r_1. The lower panel depicts the distinction between short range and long range network links, showing the arrangement of nodes in a small network. The lightly shaded infected node may infect its four immediate dark grey neighbours with probability p_1 and three other remote nodes with probability p_2.
Moreover, it is clear that by adjusting the relative magnitude of these two probabilities, one can generate transmission dominated by "clusters" (i.e. localised transmission) or by many SSEs. Let p_1 denote the probability of infecting each of the near neighbours of a node and let p_2 denote the probability of infecting the remote neighbours. Suppose that there are n_1 near neighbours and n_2 remote neighbours. Moreover, we consider the model
Fig. 6.5 Small-world and scale-free distributions. The left panel depicts the computational estimate of the number of nodes connected to a given node by n or fewer links (the taller bars). Also shown is the number of nodes connected to a given node by n or fewer links if we allow only local transmission along any of the neighbouring paths (proportional to (2n - 1)²). The right panel depicts the distribution (estimated from a simulation of 10⁶ nodes) of the number of links that a given node has. Parameter values are n_1 = 4, μ = 2.4 and N = 2700.
with four distinct groups of individuals, S, E, I and R, corresponding, as before, to those that are susceptible, exposed, infected and removed. This situation is depicted in Fig. 6.4. We allow n_2 to be random but fixed for each node, hence node i has n_2^(i) links. Moreover, to explicitly model the small world structure with a scale free network:
   P(n_2^(i) = k) = (1/μ) e^(-k/μ).          (6.5)
Hence, although the near neighbour links are bi-directional, the remote neighbour links are only one-way.5 In Fig. 6.5, we depict the induced small world network structure and the explicitly modelled scale-free structure of the network. Namely, the number of links between any two nodes in the model is small; and the probability that a node has a given (large) number of links is exponential. From Fig. 6.5, we see that 85% of nodes are connected by less than 8 links (and "almost all" are connected by no more than 8 links). In fact we choose the number of neighbourhood links so that the connectivity of the social network in Hong Kong would be roughly the

5 Furthermore, for computational convenience they are only assigned for nodes that become infected.
same as the "six degrees of separation" observed in North America [85].6 Finally, the probability of transitioning from state E to state I on a given day is r_0, and the probability of transitioning from I to R is r_1. In the simulations that follow we always take n_1 = 4 to indicate the immediate horizontal and vertical neighbours, and N = 2700 so that N² ≈ 7.3 × 10⁶ is approximately the population of Hong Kong. Variation of the parameter r_0 is largely immaterial and we therefore fix r_0 = 1/6.4. With a daily probability of transition from E to I of 1/6.4, the average incubation period is 6.4 days (as reported by [30]).7

We now study the remaining parameters of the model, p_1, μ, p_2 and r_1, in more detail. We fix μ = 2.4 as this seems to be roughly appropriate for the degree of social connectivity observed in human networks. As a first approximation, we also set p_1 = 0.02 as suggested by [73]. We do not adopt p_2 = 0.37 as cited by [162] as this case is probably somewhat atypical. Moreover, this value leads to rapidly explosive growth of the epidemic which is highly unlike the observed behaviour. According to published reports [30], the time before hospitalisation is between 3 and 5 days. If we assume that hospitalisation is equivalent to isolation (and therefore transition from I to R), we can take r_1 = 1/4 = 0.25. Denote by n̄_2 the expected value of n_2. From the distribution in Eq. (6.5), for μ = 2.4, this is approximately 8. The parameters n_1, p_1, n̄_2, p_2 and r_1 can be used to approximate the expected number of new infections E(-ΔS) by

   E(-ΔS) = (n_1 k p_1 + n̄_2 p_2) I          (6.6)
where k is the number of near neighbour links that support possible infections and, because the near neighbour infections are arranged in "clumps", 1/2 < k < 1. Moreover, for each infected node, the number of new secondary infections per day is approximately (n_1 k p_1 + n̄_2 p_2) and the total will be

   n_k r_1 + 2 n_k (1 - r_1) r_1 + 3 n_k (1 - r_1)² r_1 + ... = n_k / r_1
where n_k = (n_1 k p_1 + n̄_2 p_2). From the available data [102], the average number of secondary infections is 2.7; therefore, we take n_1 k p_1 + n̄_2 p_2 = 2.7 r_1,

6 By coincidence, the Erdős numbers observed for contemporary mathematicians.
7 Admittedly, the 95% confidence interval is 1-18 days and is therefore significantly greater than that reported by [30]. However, we found that the model simulations were remarkably robust to changes in r_0.
Fig. 6.6 Rate of growth of infection. The left panel is an estimate of the rate of growth in the number of infections from the data in Fig. 6.1. The instantaneous estimate is shown as a dashed line, the five day moving average is shown as a solid line. The right panel shows eigenvalues of Eq. (6.7). Growth rates comparable to those observed in the data (i.e. 1.1 to 1.2) are observed for 0.025 < r_1 < 0.2.
and hence p_2 ≈ 0.386 r_1 - 0.08. For 3 < 1/r_1 < 5 this gives 0 < p_2 < 0.05. By studying the stability of the difference equation version of (6.4), we obtain the eigenvalues

   λ_1 = 1,   λ_{2,3} = 1 - (r_0 + r_1)/2 ± √( ¼(r_0 - r_1)² + n_k r_0 )          (6.7)

where, as before, n_k = n_1 k p_1 + n̄_2 p_2 is the average number of infections. We note that, as expected, the disease will be contained if n_k < r_1. Suppose that the (average of) 2.7 infections occurred prior to hospitalisation after an average of 4 days; then n_k = 2.7/4 = 0.675. Hence the rate of spread of infection is given by

   1 - (r_0 + r_1)/2 + √( ¼(r_0 - r_1)² + 0.675 r_0 ).
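As a quick check of the arithmetic, the short Python sketch below evaluates the dominant eigenvalue of Eq. (6.7) for a few values of r_1, using r_0 = 1/6.4 and n_k = 0.675 as assumed above. The exact figures depend on these assumed values, but the qualitative behaviour (daily growth rates a little above unity that decrease as r_1 increases) is reproduced.

```python
import numpy as np

def growth_rate(r1, r0=1.0 / 6.4, nk=0.675):
    """Dominant eigenvalue of the linearised difference equations, Eq. (6.7)."""
    return 1.0 - (r0 + r1) / 2.0 + np.sqrt(0.25 * (r0 - r1) ** 2 + nk * r0)

for r1 in (0.025, 0.05, 0.1, 0.2, 0.25, 0.333):
    print(f"r1 = {r1:5.3f}  ->  daily growth rate = {growth_rate(r1):.3f}")
```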
Figure 6.6 compares the average growth rate observed in the data (a 5-day moving average of the ratio of the total number of infections on two successive days) to that computed from these eigenvalues. From the first 40 days of data for Hong Kong, the mean rate of infection is 1.19 and the range is approximately 1.1 to 1.42. Rates of infection of 1.19 and 1.1 correspond to r_1 values of approximately 0.05 and 0.2 respectively. Hence, we conclude that this model indicates that infection did not cease with hospital admission
Fig. 6.7 Simulated epidemics. Each row depicts the evolution (at 20 day intervals) of infected individuals for different parameter values. In all cases, N = 100 (to ease visualisation), n\ = 4, fj, = 2.4, ro = g^, and r\ = 0.05. Infected nodes are shown as dark grey, exposed nodes are light grey and removed nodes are black. The top row is with px = 0.2 and p2 = 0, the second row is with pi — 0.2 and p2 = 0.002, the third row is with pi = 0.02 and p2 = 0.02, and the fourth row is with pi = 0 and P2 = 0.05.
sion after between 3 and 5 days. In the early stage of the epidemic, the rate of transmission of the virus indicates that patients remained infectious for much longer periods of time (possibly up to 20 days). We will now examine the role of clustering in the spread of infection. Clearly the number of "clumps" that occur in the model will be proportional
Fig. 6.8 Predicted distribution of casualties. The top panel is a probability density plot of the temporal evolution of 1000 simulations (lighter colours reflect higher probability) compared to the data. The middle plot compares the total number of casualties for each of those simulations to the true data value (1735). Approximately 13.5% of all simulations exhibited a greater total casualty count than the true data. The bottom plot shows seven "representative" simulations (these were randomly chosen from among the 91 simulations with a total casualty count of between 1000 and 2500). In each case, we see that the data is typical of the observed simulations.
to the number of long range infections. As such, we expect the ratio of clusters to infections to be given approximately by n̄_2 p_2 / (n_1 k p_1 + n̄_2 p_2). If n̄_2 p_2 ≈ 0
the infection is largely localised and the growth of infection is polynomial (see Fig. 6.5). If n̄_2 p_2 ≫ n_1 k p_1 the rate of growth is exponential and the spread of infection is equivalent to a stochastic version of the standard SEIR model. It is the intermediate dynamics for n̄_2 p_2 ≈ n_1 k p_1 that are of most interest to us. In Fig. 6.7, we illustrate the typical spread of infection for each of these four scenarios. From Fig. 6.7 one observes that increasing n̄_2 p_2 increases the number of clumps and also the rate of transmission of infection. With p_1 = 0 we see only isolated infections with no spatial correlation, and, conversely, for small n̄_2 p_2 we see a small number of large clumps. In all simulations we see that the rate of infection is approximately polynomial until several nonlocal infections occur. At this point, one sees an explosive (exponential) growth in infection.

We now present simulations of our model with various parameter values and attempt to reproduce the observed dynamics. Following this, we will provide some Monte-Carlo simulations with the same parameter values and show that the range of observed behaviour is extremely wide. Clearly, the original data represents a non-stationary system. One can observe at least two distinct phases. In our linear modelling (Fig. 6.2), we assumed three phases, hence for our small world network model we also assume three phases. The reason for restricting our interest to such a small number of (discrete) regimes is to avoid problems associated with overdetermined systems.8 Clearly, we could seek more realism by increasing the range of parameter values during the epidemic, but this would only be appropriate for a completely deterministic model. Instead we take N = 2700, n_1 = 4 and μ = 2.4 as before. The parameters p_{1,2} and r_{0,1} are set as follows:

   p_1 = 0.02,

   p_2 = 0.1    for 1 ≤ t ≤ 30
         0.02   for 30 < t ≤ 60
         0.01   for t > 60,

   r_0 = 1/6.4, and

   r_1 = 0.025  for 1 ≤ t ≤ 20
         0.1    for 20 < t ≤ 40
         0.333  for t > 40.
8 Even for deterministic models, one needs to avoid excessive parameterisation.
The three changes of p_2 correspond to the initial phase when SARS spread freely,9 an intermediate phase when p_2 ≈ p_1, and a final phase when p_2 ≈ 0. In this third and final stage, a combination of limited movement of the population and rudimentary hygiene measures combine to effectively limit transmission. Similarly, we include two changes in value for r_1. During the initial phase, r_1 is close to zero, which corresponds to the rate of growth when the disease spreads without any control measures. During the second phase rudimentary control is in place and r_1 has been increased somewhat (as observed in Fig. 6.6). However, it is only during the final phase that r_1 ≈ 1/3, corresponding to removal of infected individuals approximately 3 days after becoming infectious (i.e. in this final phase, hospitalisation is equivalent to isolation). Figure 6.8 presents the result of 1000 simulations.

Finally, we provide some quantitative evidence that the simulations and the observed behaviour are similar. In Fig. 6.3 we showed that the ordinary SIR- and SEIR-type models exhibited dynamics unlike the observed data. In Fig. 6.8 we see that, according to the very gross measure of total infections, the small world model simulations and the data are similar. In Fig. 6.9 we compute the lag one autocorrelation ⟨x_t x_{t-1}⟩, and the logarithm of the lag one autocorrelation log⟨x_t x_{t-1}⟩, for the 1000 model simulations and the data. In direct contrast to Fig. 6.3, the simulations and the data are indistinguishable.

Our small world model of disease propagation, when applied to the SARS-CoV propagation dynamics for Hong Kong in 2003, shows that the most serious effect of SARS is nosocomial transmission. By isolating infectious individuals as soon as infection is identified (typically in 3 to 5 days), the severity of the SARS outbreak in Hong Kong in 2003 would have been significantly lessened. Moreover, from a theoretical viewpoint, we see that the crucial feature in capturing the dynamic behaviour of the transmission time series is the combination of local and non-local transmission. We find that the SSEs, typically associated with highly infectious individuals, can be modelled equally well with highly connected individuals. It has been widely reported that highly connected individuals do occur in human society and this may be a more appropriate way to model bursty propagation of diseases within communities.
9 One could also argue that this phase also models the (initially) highly contagious individuals observed in the early stages of the outbreak. We make no such distinction.
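As an illustration of the model just described, the following Python sketch simulates the lattice version of the small world SEIR-type dynamics with the three-phase parameter schedule given above. It is a simplified re-implementation for illustration only: the grid is 100 × 100 (as in Fig. 6.7) rather than 2700 × 2700, the number of remote links per node is drawn from a geometric distribution with mean μ as a stand-in for Eq. (6.5), boundary nodes simply have fewer near neighbours, and all function and parameter names are our own.

```python
import numpy as np

S, E, I, R = 0, 1, 2, 3                              # compartment labels

def simulate_sars(n=100, days=120, p1=0.02, r0=1 / 6.4, mu=2.4,
                  p2_sched=((30, 0.1), (60, 0.02), (10**9, 0.01)),
                  r1_sched=((20, 0.025), (40, 0.1), (10**9, 0.333)),
                  seed=None):
    """Minimal sketch of the small world SEIR-type lattice model of Sec. 6.2."""
    rng = np.random.default_rng(seed)
    state = np.full((n, n), S, dtype=np.int8)
    remote = {}                                       # long-range links, created lazily
    state[n // 2, n // 2] = I                         # a single index case
    daily_cases = []

    def sched(t, schedule):                           # piecewise-constant parameter schedule
        for t_max, value in schedule:
            if t <= t_max:
                return value

    for t in range(1, days + 1):
        p2, r1 = sched(t, p2_sched), sched(t, r1_sched)
        new_E, new_I, new_R = [], [], []
        for (i, j) in zip(*np.nonzero(state == I)):
            if (i, j) not in remote:                  # assign this node's remote contacts
                k = rng.geometric(1.0 / mu)           # roughly exponential, mean ~mu
                remote[(i, j)] = [tuple(rng.integers(0, n, size=2)) for _ in range(k)]
            # near-neighbour ("household") transmission
            for (a, b) in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= a < n and 0 <= b < n and state[a, b] == S and rng.random() < p1:
                    new_E.append((a, b))
            # long-range ("casual contact") transmission
            for (a, b) in remote[(i, j)]:
                if state[a, b] == S and rng.random() < p2:
                    new_E.append((a, b))
            if rng.random() < r1:                     # removal (isolation or recovery)
                new_R.append((i, j))
        for (i, j) in zip(*np.nonzero(state == E)):
            if rng.random() < r0:                     # end of incubation
                new_I.append((i, j))
        for cell in new_E:
            state[cell] = E
        for cell in new_I:
            state[cell] = I
        for cell in new_R:
            state[cell] = R
        daily_cases.append(len(new_I))                # newly infectious individuals per day
    return np.array(daily_cases)

print(simulate_sars(seed=0).sum(), "cases in one realisation")
```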
Fig. 6.9 Surrogate test for SARS model dynamics. The top panel depicts the lag 1 autocorrelation for 1000 model simulations compared to the data. The lower plot is the logarithm of the lag 1 autocorrelation for 1000 model simulations and the original data. One can see that the observed data is quite typical of these simulations (11.8% of the simulations exhibit larger correlation than the data).
6.3  Local models
The model presented in the previous section is a good ontological model as it is consistent with the observations and it provides a plausible explanation of the underlying system. However, this model is also completely ad hoc. No attempt was made to seriously fit the model parameters to the data and we saw that, while the model was indeed consistent with the observations, it was also capable of producing a vast range of different behaviours. An implication of this is that many other, different, models may also (possibly) be consistent with this data. For the remainder of this chapter we take a different approach to
modelling. We wish to build phenomenological models and, in the broadest sense, the objective of these models is to produce artificial sequences of data with the same qualitative and quantitative behaviour as our observations: we wish to develop an algorithm which mimics the dynamics of the true system. And so, for the problem of modelling time series data, what is more interesting are the so-called (and much maligned) "black-box" approaches. While there is a rich literature (see for example [16; 17]) addressing the problem of approximation, and an equally strong literature addressing the statistical problem of modelling random fluctuations [97; 153], we are interested in the intersection of these two fields: building deterministic models from data in the presence of noise. The simplest such black-box models are local models, the simplest of all local models is the local constant model, and the simplest of all implementations is the so-called naive prediction:

   ŷ_{t+1} = y_t          (6.8)
where ŷ_{t+1} denotes our prediction of y_{t+1}. In the absence of any good information, Eq. (6.8) is a reasonable thing to do. However, it is easy to improve on the naive predictor. In what follows we take y_t to be a scalar point and v_t to be the corresponding time delay embedded point. We do not consider the problem of accurate time delay embedding at this point. The simplest way to improve this prediction scheme is to make a "meteorological" forecast: y_{t+1} is predicted by looking for instances y_s in the past (s < t) such that ||v_s - v_t|| is small:

   ŷ_{t+1} = y_{s+1}          (6.9)
where s < t is such that ||v_s - v_t|| is minimal. This is the simple nearest neighbour predictor first described by Sugihara and May [143]. The scheme can easily be extended in several obvious ways:

• ensure that t - s > T for some threshold T, to avoid spurious temporal correlations in the predictions;
• take an ensemble average of the k nearest neighbours,

   ŷ_{t+1} = (1/k) Σ_{i=1}^{k} y_{s_i+1},
where {v_{s_1}, v_{s_2}, ..., v_{s_k}} are the k closest points to v_t such that s_i < t for all i;
• take a weighted ensemble average where the weighting is somehow proportional to the nearness of the neighbour; and
• choose a neighbour v_s of v_t at random (the probability of choosing v_s may well be a function of ||v_s - v_t||) from which to obtain the prediction ŷ_{t+1} = y_{s+1}.
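A minimal Python sketch of the scheme of Eq. (6.9), together with the ensemble-average extension from the list above, is given below. The uniform delay embedding, the neighbourhood size k and the temporal exclusion window are illustrative choices only; they are not the specific settings used elsewhere in this book.

```python
import numpy as np

def knn_predict(y, de=3, tau=1, k=4, exclude=10):
    """Local constant (nearest neighbour) one-step prediction, in the spirit of Eq. (6.9).

    For each time t the k nearest delay vectors v_s (with s < t and t - s > exclude)
    are located, and the prediction of y_{t+1} is the average of the images y_{s+1}.
    """
    y = np.asarray(y, dtype=float)
    m = (de - 1) * tau
    idx = np.arange(m, len(y) - 1)
    # delay vectors v_t = (y_t, y_{t-tau}, ..., y_{t-(de-1)tau}) for t = m, ..., N-2
    V = np.column_stack([y[idx - j * tau] for j in range(de)])
    preds = np.full(len(y), np.nan)
    for a, t in enumerate(idx):
        past = idx[idx < t - exclude]                   # candidate neighbours strictly in the past
        if len(past) < k:
            continue
        d = np.linalg.norm(V[past - m] - V[a], axis=1)  # distances ||v_s - v_t||
        nearest = past[np.argsort(d)[:k]]
        preds[t + 1] = y[nearest + 1].mean()            # ensemble average of the images y_{s+1}
    return preds

# toy usage: one-step prediction of a noisy sine wave
t = np.arange(2000)
y = np.sin(0.07 * t) + 0.05 * np.random.default_rng(0).normal(size=t.size)
p = knn_predict(y, de=3, tau=5, k=4, exclude=20)
print("mean squared one-step prediction error:", np.nanmean((p - y) ** 2))
```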
The final modification listed above is exactly what we do to obtain the PPS and ATS algorithms to generate nonlinear surrogates in Chapter 5. Of course, having discussed local constant models it is natural to extend the discussion to local linear, and naturally, local polynomial models. For the sake of the present discussion, we only describe local linear models; higher order polynomial models are left as an exercise. An alternative formulation, provided by Mees, consists of building triangulations and tessellations on the data to capture a more detailed picture of the underlying dynamics [84]. Unfortunately, the geometrically demanding nature of this approach means that it is limited to relatively low dimensional systems. Instead we will simply describe the standard local linear approach:

   ŷ_{t+1} = Λ v_t          (6.10)
where Λ is chosen such that Σ_{i=1}^{k} e_i² is minimal and e_i = Λ v_{s_i} - y_{s_i+1} are the prediction errors of the k nearest neighbours of v_t (i.e. ||v_t - v_{s_i}|| is small, for s_i < t, i = 1, 2, ..., k). Computationally this technique is straightforward, and, in certain circumstances such local linearisation may have some advantages over Eq. (6.9). Unfortunately the previous description conceals a significant number of parameters, each of which must be well chosen to generate a good approximation to the true dynamics.

The important thing to note about these local methods, particularly Eq. (6.9), is that the model is described only by the previous data: there are no model parameters. Alternatively, one may argue that the data are the parameters. In either case this approach is interesting because the dynamics that one observes from these models, particularly in the ATS and PPS approach, is only what one may observe in the data. Unfortunately, these models are inadequate if one wishes to obtain an analytic approximation to the evolution operator Φ: this could be useful, for example, in determining fixed points, stability and unstable periodic orbits of a system. In these cases it is necessary to resort to more parametric models. We will describe
these in Sec. 6.5, and in the next section, we return, again, to the problem of embedding.
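Before turning to embedding, we note that the local linear step of Eq. (6.10) is equally compact in code. In the Python sketch below the map Λ is fitted by least squares to the k nearest neighbours of the current delay vector; a constant term is included, which is a common variant of the purely linear form given above, and the neighbourhood size is again an arbitrary illustrative choice.

```python
import numpy as np

def local_linear_predict(V, y_next, v_query, k=12):
    """One local linear prediction in the spirit of Eq. (6.10).

    V        : array of delay vectors v_s (one per row, all with s < t),
    y_next   : array of the corresponding images y_{s+1},
    v_query  : the current delay vector v_t,
    returns  : prediction of y_{t+1} from an affine map fitted by least squares
               to the k nearest neighbours of v_t.
    """
    d = np.linalg.norm(V - v_query, axis=1)
    nn = np.argsort(d)[:k]                              # the k nearest neighbours of v_t
    X = np.column_stack([np.ones(k), V[nn]])            # affine design matrix [1, v_s]
    coef, *_ = np.linalg.lstsq(X, y_next[nn], rcond=None)
    return float(np.concatenate(([1.0], v_query)) @ coef)
```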
6.4  The importance of embedding for modelling
In Chapter 1 we described the various possible embedding strategies one could adopt prior to nonlinear model building. We now use the relative performance of these methods with two distinct modelling schemes to illustrate an important point: embedding is part of the modelling process, not just a necessary pre-condition. To test how good the variable embedding strategies of Sec. 1.6.1 are at modelling the underlying dynamics, we apply two distinct modelling schemes. For each scheme we compare the results obtained with both the standard and the variable embedding strategy.

The first modelling scheme is essentially iterated "drop-one-out" constant interpolation as described in [129]; this is based on none other than Eq. (6.9). By construction, the variable embedding strategy will have the optimal short term prediction. However, in Table 1.2 we tested how well this strategy captures the long term dynamics. For each time series and each embedding strategy, we compute 30 random trajectories and compare the correlation dimension estimates [167]. Table 1.2 showed that both embedding strategies perform fairly well, with the variable embedding strategy performing significantly better for the short or noisy data sets. In most other cases, the difference is not significant.

The second modelling scheme is more sophisticated and is an attempt to genuinely estimate the underlying deterministic dynamics of the system (this is not possible from a local constant method, despite the admirable results of Table 1.2). For each data set, we compare the standard embedding strategy (1.8) to the optimal embedding strategy (1.10) by constructing nonlinear models using the method described in [53]. These models are radial basis models with the number of radial basis functions determined according to the minimum description length principle (the underlying techniques will be described in the next section). For each data set we computed 30 nonlinear models with either embedding strategy and report in Table 6.1 the average model size10 and out-of-sample iterated mean model prediction error for the minimum
10 In this case, the number of basis functions in the best radial basis model.
Table 6.1 Comparison of modelling results for the standard (d_e and τ) and optimal (ℓ_1, ..., ℓ_{d_e}) embedding parameters listed in Table 1.1. For each embedding strategy, we constructed 30 nonlinear models, with minimum description length as a selection criterion, and computed the average number of model parameters and the average out-of-sample iterated model prediction error. For each model, we also computed the mean correlation dimension estimate (computed with the values d_e and τ listed earlier) for 30 simulations (different initial conditions) and report the median value over all models. For reference, the value of correlation dimension estimated from the time series data is also provided.

                              model size              prediction error            correlation dimension
  data                        standard    optimal     optimal       standard      standard  optimal  data
  Sunspots                    2.1±0.8     2.8±0.8     8.28±4.26     12.70±11.12   1.206     1.152    1.890
  Ventricular Fibrillation    6±1.3       8.6±1.8     0.035±0.012   0.079±0.012   1.007     1.018    1.622
  Laser                       19.7±3.6    20.3±4.2    4.58±1.94     6.34±2.34     1.059     1.438    2.129
  Rossler                     15.8±2.7    13.9±2.3    0.13±0.06     0.17±0.12     0.998     1.292    1.588
  Rossler+noise               7.5±1.5     6.4±0.7     0.34±0.13     0.29±0.11     1.163     1.106    1.819
  Lorenz                      6.7±1.0     12.7±3.2    1.24±0.40     2.92±0.42     0.175     0.989    1.966
  Lorenz+noise                6.3±1.0     6.7±1.1     3.67±0.80     1.48±0.62     0.122     1.025    1.612
description length best model.11 Furthermore, for each of the 30 models, we chose 30 random initial conditions from the true data and iterated the model predictions for N time steps. For each of these iterated predictions, we computed the correlation dimension using the technique described in [167]. Table 6.1 also compares the correlation dimension of the data to that of the simulations with either modelling scheme. In general we observe that the optimal scheme affords larger models with smaller prediction errors and correlation dimensions much closer to that of the true data. That is, both the qualitative and quantitative dynamics are reproduced much better with these non-uniform embedding strategies.

Having illustrated the importance of an appropriate embedding, we now describe a particular parametric modelling method. The modelling method we choose to introduce is one which has found effective application for a wide range of data. In particular we describe radial basis models and the minimum description length principle for model selection. Both these topics are also considered in more detail elsewhere [52].
6.5  Semi-local models
This section provides a brief overview of radial basis modelling and the minimum description length principle. In particular, the section briefly reviews the methods described by Judd and Mees [53] to build a radial basis model of variable size from data.
6.5.1  Radial basis functions
From a scalar time series {y_t}_{t=1}^{N} we embed the data in ℝ^{d_e τ} as in the preceding sections,

   v_t = (y_{t-1}, y_{t-2}, ..., y_{t-d_e τ})    ∀ t > d_e τ.

The values of d_e and τ will be selected according to one of the standard techniques described previously; for reasons which will become apparent we choose to embed in ℝ^{d_e τ} rather than ℝ^{d_e}. From this we wish to fit a model f : ℝ^{d_e τ} → ℝ,

   y_t = f(v_t) + e_t

11 Repeated modelling attempts are required because this highly nonlinear fitting procedure is stochastic.
where e_t ~ N(0, σ²). We assume that the model f captures the dynamics of the underlying system and that the model prediction errors e_t can be modelled as additive Gaussian noise. Note that assuming additive Gaussian noise is a substantial simplification of the most general possible situation; however, this is all we attempt here. Choice of error model is an extremely important issue, and in some situations extremely difficult. For our purposes the simplification to additive Gaussian noise is sufficient. Observe that by using a time-delay embedding, the only new component of v_{t+1} that the model needs to predict is y_t. The general form of the function f is

   f(v) = Σ_j λ_j φ(||v - c_j||)

where φ : ℝ⁺ → ℝ is a fixed function. In this situation f is known as a radial basis function. For a discussion of radial basis functions and possible choices of φ see Powell [93]. There are several common choices of φ, most of which are monotonic decreasing functions.12 We offer a slight generalisation of the functional form described above by including an additional scaling factor, and call a function of the form
   f(v) = Σ_j λ_j φ( ||v - c_j|| / r_j ),          (6.11)
(where r_j > 0) a radial basis function. In general the selection of the parameters λ_j, c_j (the centre of the j-th basis function) and r_j (the radius of that basis function) is a complex nonlinear optimisation problem. They can, however, be selected to minimise the mean sum of squares prediction error of the model f. The parameter n cannot. To optimise over the model size n, we introduce the information theoretic concept of description length.
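For fixed centres c_j and radii r_j the model (6.11) is linear in the weights λ_j, so those weights can be obtained by ordinary least squares. The following Python sketch does exactly this for Gaussian basis functions, with the centres chosen as randomly selected data points and a single crude common radius; choosing the centres, the radii and, above all, the model size n well is the genuinely difficult part, and is the subject of the remainder of this chapter.

```python
import numpy as np

def rbf_design(V, centres, radii):
    """Design matrix Phi with Phi[t, j] = phi(||v_t - c_j|| / r_j), Gaussian phi(s) = exp(-s^2)."""
    D = np.linalg.norm(V[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-(D / radii) ** 2)

def fit_rbf(V, y, n_basis=10, seed=None):
    """Least-squares fit of the weights lambda_j in Eq. (6.11), centres chosen at random."""
    rng = np.random.default_rng(seed)
    centres = V[rng.choice(len(V), size=n_basis, replace=False)]
    radii = np.full(n_basis, np.mean(np.std(V, axis=0)))    # one crude common radius
    Phi = rbf_design(V, centres, radii)
    lam, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    predict = lambda Vnew: rbf_design(Vnew, centres, radii) @ lam
    return lam, predict

# usage: V is an array of delay vectors (one per row), y the corresponding targets
#   lam, predict = fit_rbf(V, y, n_basis=10, seed=0); y_hat = predict(V)
```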
6.5.2  Minimum description length principle
Roughly speaking, the description length of a particular model of a time series is proportional to the number of bytes of information required to reconstruct the original time series.13 That is, if one was to transmit a 12
12 With the exception of cubic functions s³. Other commonly used basis functions include Gaussians e^(-s²) (these are the foundation of the basis functions employed here) and thin plate splines s² log s, among others.
13 To within some arbitrary (possibly the machine) precision.
Fig. 6.10 Description length as a function of model size: A plot of the expected behaviour of description length as a function of model size k (see Eq. (6.18)). For k = 0 there is no model and the description length of the model prediction errors is the description length of the data. As k increases, the description length of the modelling error decreases as the model starts to fit the data. The description length of the model parameters increases as more model parameters are added. Eventually the additional model parameters are unimportant and do not greatly increase the description length of the model parameters; at this stage the description length of the modelling errors approaches zero (in the limit when the system is over-determined). The optimal model should be the model for which the model description length (the sum of the description lengths of the model parameters and the modelling errors) is minimal.
description of the data, the description length measures the compression gained by describing the model parameters and the model prediction errors rather than the raw data. Obviously, if the time series does not suit the class of models being considered, the most economical way to do this would be to simply transmit the data. If, however, there is a model that fits the data well, it is better to describe the model in addition to the (minor) deviations of the time series from that predicted by the model (see Fig. 6.10). Thus description length offers a way to tell which model is most effective. Our encoding of description length is identical to that outlined by Judd and Mees [53] and follows the ideas described by Rissanen [103]. Roughly speaking, the description length is given by an expression of the form

   (description length) ≈ (number of data) × log(sum of squares of prediction errors)
                          + (penalty for the number and accuracy of parameters).
The approach of Judd and Mees is to calculate the description length penalty of the model as the penalty for the parameters λ_j. The parameters c_j and r_j are given at no cost. For the present discussion we assume, as Judd and Mees have, that the only parameters required to describe the model are λ_j for j = 1, 2, ..., k. Rissanen [103] suggests an optimal encoding for a floating point binary number λ_j = 0.a_1 a_2 a_3 ... a_{n_j} × 2^{m_j}. If λ_j is the j-th model parameter and λ̃_j that parameter truncated to n_j binary bits, the difference between λ_j and λ̃_j will be at most δ_j = 2^{-n_j}. We call δ_j the precision of the parameter λ_j. Hence to encode λ_j we need to encode the binary mantissa a_1 a_2 a_3 ... a_{n_j} and the exponent m_j; that is, two integers. The method employed by Judd and Mees to encode integers is that suggested by Rissanen. The integer p may be encoded in log₂ p bits, but to do so, one must first encode the length of this code. That is, if one was to send the code for p, the receiver of that code needs to be told how long the code is. But the code length is itself a binary integer and the length of that code must also be specified. Hence the integer p can be encoded as a code word of length

   L*(p) = ⌈log₂ c⌉ + ⌈log₂ p⌉ + ⌈log₂⌈log₂ p⌉⌉ + ⌈log₂⌈log₂⌈log₂ p⌉⌉⌉ + ...

bits. This sequence continues until the last term is either 0 or 1, ⌈x⌉ is the smallest integer not less than x, and ⌈log₂ c⌉ is an additional cost associated with small integers. Hence the cost of encoding λ_j is given by
L(Aj)-L*(l)+L*(riog2(2max{Ai,l/Aj})l) °3
bits. Making a substitution of nats for bits, one arrives at the cost of encoding all the parameters as
\[ L(\lambda) = \sum_{j=1}^{k} L^*\!\left(\lceil\delta_j^{-1}\rceil\right) + \sum_{j=1}^{k} L^*\!\left(\lceil\log_2(2\max\{\lambda_j, 1/\lambda_j\})\rceil\right) \tag{6.12} \]
nats. The factor of 2 is the additional cost of the sign of $\lambda_j$. To perform the minimisation it is necessary to simplify Eq. (6.12). Judd and Mees [53] argue that the repeated $\log\log\ldots$ terms are slowly varying and so the $-\ln\delta_j$ terms dominate. The exponent can be simplified by assuming that the parameters only take values within some fixed range and so the exponent cost is fixed.
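As an aside, the code length $L^*(p)$ is easy to evaluate directly. A minimal sketch follows; taking $\lceil\log_2 c\rceil = 2$ for the small-integer constant is an assumption of the sketch, since the value of $c$ is not specified above.

```python
import math

def l_star(p):
    """Code length (in bits) of a positive integer p via the repeated-log construction:
    ceil(log2 c) + ceil(log2 p) + ceil(log2 ceil(log2 p)) + ..., stopping at 0 or 1."""
    assert p >= 1
    length = 2                      # assumed value of ceil(log2 c), the small-integer cost
    term = p
    while True:
        term = math.ceil(math.log2(term)) if term > 1 else 0
        length += term
        if term <= 1:
            return length

print([l_star(p) for p in (1, 2, 10, 1000)])   # [2, 3, 9, 19]
```

The dominant contribution for large $p$ is the first $\lceil\log_2 p\rceil$ term, which is why the $-\ln\delta_j$ terms dominate the parameter cost.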
One then has
\[ L(\lambda) = \sum_{j=1}^{k} \ln\frac{\gamma}{\delta_j} \tag{6.13} \]
as an approximation to (6.12). The factor $\gamma$ is a constant, related to the assumed range of the exponent. While variation of results for different values of $\gamma$ may be an issue [88], we usually fix $\gamma$ at a realistic moderate value (say 32). With Eq. (6.13) we are now ready to derive the minimum description length of a radial basis model (6.11). The description length of a data set $z$ given a model described by the parameters $\lambda$ (and some others which we ignore for the present) is
\[ L(z,\lambda) = L(z|\lambda) + L(\lambda) \tag{6.14} \]
where $L(z|\lambda) = -\ln P(z|\lambda)$ is the data code length. This code length is simply the negative log likelihood of the data under the assumed distribution and (under the assumption of Gaussianity, $e_t \sim N(0,\sigma^2)$) is given by $-\ln\!\left((2\pi\sigma^2)^{-N/2}\,e^{-\sum_t e_t^2/(2\sigma^2)}\right)$. We assume, as Judd and Mees do, that the optimal values of $\delta_j$ are small and $\lambda$ will not be too far from the maximum likelihood value $\hat\lambda$ which optimises $L(z|\lambda)$ over $\lambda$. Furthermore
\[ L(z|\lambda) \le L(z|\hat\lambda) + \tfrac{1}{2}\delta^T Q\,\delta \tag{6.15} \]
where $Q = D^2_\lambda L(z|\hat\lambda)$. From (6.14) and (6.15) one gets
\[ L(z,\lambda) \le L(z|\hat\lambda) + \tfrac{1}{2}\delta^T Q\,\delta + k\ln\gamma - \sum_{j=1}^{k}\ln\delta_j \tag{6.16} \]
as the approximation to be minimised. This minimisation yields
\[ (Q\delta)_j = \frac{1}{\delta_j} \tag{6.17} \]
for every $j$. Let $\hat\delta_j$ denote the values of $\delta_j$ corresponding to the solution of (6.17); then as an approximation to the description length of a given model we have
\[ L(z|\hat\lambda) + \left(\tfrac{1}{2} + \ln\gamma\right)k - \sum_{j=1}^{k}\ln\hat\delta_j. \tag{6.18} \]
Calculation of description length for this modelling algorithm requires knowledge of $Q = D^2_\lambda L(z|\lambda)$, which we will discuss in Sec. 6.5.3. Note that $L(z|\hat\lambda)$ is the description length of the model prediction errors and will decrease with increasing model size. The last two terms of (6.18) are the description length of the model parameters and will increase with model size. Two other criteria for model selection are the Akaike criterion [5]
\[ -2\log(\text{maximum likelihood}) + 2k, \tag{6.19} \]
and the Schwarz criterion [114]
\[ -\log(\text{maximum likelihood}) + \tfrac{1}{2}\,k\log N. \tag{6.20} \]
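With Gaussian errors the maximised likelihood reduces to a function of the residual sum of squares, so (6.19) and (6.20) can be evaluated directly from the residuals of a fitted model. The sketch below (Python, for illustration only; the Gaussian reduction and the toy autoregressive example are assumptions, not the book's companion code) compares the two criteria across model sizes.

```python
import numpy as np

def neg2_log_likelihood(residuals):
    """-2 log(maximum likelihood) for i.i.d. Gaussian errors with fitted variance."""
    n = len(residuals)
    return n * (np.log(2 * np.pi * np.mean(residuals ** 2)) + 1.0)

def akaike(residuals, k):
    return neg2_log_likelihood(residuals) + 2 * k                                  # Eq. (6.19)

def schwarz(residuals, k):
    return 0.5 * neg2_log_likelihood(residuals) + 0.5 * k * np.log(len(residuals)) # Eq. (6.20)

# toy example: selecting the order of a linear autoregression fitted by least squares
rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 1.6 * y[t - 1] - 0.8 * y[t - 2] + 0.1 * rng.standard_normal()   # an AR(2) process

for k in range(1, 6):
    X = np.column_stack([y[k - 1 - i:len(y) - 1 - i] for i in range(k)])   # lagged regressors
    target = y[k:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    r = target - X @ coef
    print(k, round(akaike(r, k), 1), round(schwarz(r, k), 1))   # both typically dip at k = 2
```

Like (6.18), both criteria trade goodness of fit against the number of parameters; (6.18) differs in that the cost of each parameter also depends on the precision to which it must be specified.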
One can see that (6.18) is a generalisation of both (6.19) and (6.20).¹⁴ Having now introduced our modelling criterion, we discuss the model selection algorithm in the following section.
6.5.3  Pseudo linear models
The function $f$ which we wish to fit will in general be of the form (6.11). However this function may also need to include affine terms, so let us rewrite $f$ as
\[ f(z) = \sum_{j=1}^{n} \lambda_j \phi_j(z), \tag{6.21} \]
where $\phi_j$ are arbitrary functions of the vector variable. These are the basis functions of the model $f$ and the problem is to select the set $\{\phi_1, \phi_2, \ldots, \phi_n\}$ which minimises the description length (6.18). In practice we will restrict the $\phi_j$ to a finite set of candidate functions; even so, selecting the optimal basis (even when the $\phi_j$ are a particularly restricted class) is a difficult nonlinear optimisation. We choose to simplify matters somewhat by fixing $n$ and finding a function (6.21) which minimises the mean sum of squares prediction error, and then minimising (6.18) over $n$.
¹⁴ The maximum likelihood in (6.19) and (6.20) is given by $-\ln(\text{maximum likelihood}) = L(z|\hat\lambda)$.
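Because (6.21) is linear in the $\lambda_j$ once the basis is fixed, the inner fit for a given basis is an ordinary least squares problem. A minimal sketch (the two-dimensional toy data, the Gaussian dictionary and its centres and radius are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.uniform(-1, 1, size=(300, 2))                                  # embedded vectors
y = np.sin(3 * z[:, 0]) * z[:, 1] + 0.05 * rng.standard_normal(300)   # values to be predicted

# a fixed dictionary: affine terms plus Gaussian bumps phi_j(z) = exp(-||z - c_j||^2 / r^2)
centres = rng.uniform(-1, 1, size=(20, 2))
radius = 0.5
Phi = np.exp(-np.sum((z[:, None, :] - centres[None, :, :]) ** 2, axis=2) / radius ** 2)
Phi = np.column_stack([np.ones(len(z)), z, Phi])

# the pseudo linear fit: least squares over the weights lambda for this fixed basis
lam, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("mean squared prediction error:", np.mean((y - Phi @ lam) ** 2))
```

The difficult part, as noted above, is not this fit but choosing which columns of the dictionary to include; that is the job of the selection algorithm described next.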
Define
\[ V_i = \left(\phi_i(v_{d_e\tau+1}), \ldots, \phi_i(v_N)\right)^T, \quad i = 1, \ldots, m, \]
\[ y = \left(y_{d_e\tau+1}, \ldots, y_N\right)^T, \qquad e = \left(e_{d_e\tau+1}, \ldots, e_N\right)^T, \]
and let
\[ V = [V_1\ V_2\ \cdots\ V_m], \qquad V_B = [V_{b_1}\ V_{b_2}\ \cdots\ V_{b_k}], \]
where $b_1, b_2, \ldots, b_k \in B = \{b_1, b_2, \ldots, b_k\}$ are distinct. The set $\{V_i\}_{i=1}^{m}$ is the evaluation of the $m$ candidate basis functions, $B$ is the current basis and $\{V_{b_j}\}_{j=1}^{k}$ is the evaluation of the $k$ functions in that basis. If the $e_t$ are assumed to be Gaussian and $\lambda$ has been chosen to minimise the sum of squares of the prediction errors $e = y - V_B\lambda$, Judd and Mees show that the description length is bounded by
\[ \left(\frac{\bar N}{2} - 1\right)\ln\frac{e^T e}{\bar N} + (k+1)\left(\frac{1}{2} + \ln\gamma\right) - \sum_{j=1}^{k}\ln\delta_j + C, \]
where $\gamma$ is related to the scale of the data, $\bar N = N - d_e\tau$ is the number of embedded vectors, and $C$ is a constant independent of the model parameters. The model selection algorithm employed here and suggested by [53] is the following.

Algorithm 6.1  Model selection algorithm.
(1) Normalise the columns of $V$ to have unit length.
(2) Let $S_0 = \left(\frac{\bar N}{2} - 1\right)\ln(y^T y/\bar N) + \frac{1}{2} + \ln\gamma$. Let $e_B = y$ and $B = \emptyset$.
(3) Let $\mu = V^T e_B$ and let $j$ be the index of the component of $\mu$ with maximum absolute value. Let $B' = B \cup \{j\}$.
(4) Calculate $\lambda_{B'}$ so that $\|y - V_{B'}\lambda_{B'}\|$ is minimised and let $e_{B'} = y - V_{B'}\lambda_{B'}$. Let $\mu' = V^T e_{B'}$. Let $o$ be the index in $B'$ corresponding to the component of $\mu'$ with smallest absolute value.
(5) If $o \ne j$, then put $B = B' \setminus \{o\}$, calculate $\lambda_B$ so that $\|y - V_B\lambda_B\|$ is minimised, let $e_B = y - V_B\lambda_B$, and go to Step 3. Otherwise put $B = B'$ and $e_B = e_{B'}$.
(6) Define $B_k = B$, where $k = |B|$. Find $\delta$ such that $(V_B^T V_B\,\delta)_j = 1/\delta_j$ for each $j = 1, \ldots, k$ and calculate $S_k = \left(\frac{\bar N}{2} - 1\right)\ln\frac{e_B^T e_B}{\bar N} + (k+1)\left(\frac{1}{2} + \ln\gamma\right) - \sum_{j=1}^{k}\ln\delta_j$.
(7) If some stopping condition has not been met, go to Step 3.
(8) Take the basis $B_k$ such that $S_k$ is minimal as the optimal model.

Note that the $\delta_j$ that satisfy (6.17) are calculated at Step 6. Typically one will continue increasing $k$ until it is clear that the minimum of $S_k$ has been reached. Depending on the modelling situation, the stopping condition may be $k = m$ (in the case of reduced autoregressive models, discussed in Sec. 5.3.3), or $S_{k+\ell} > S_k$ for $1 \le \ell \le L$ (for the general nonlinear modelling problem of this chapter). In the next section we will briefly describe an extremely useful extension to this general modelling scheme.
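A minimal sketch of the forward part of this selection loop follows (Python, illustration only): the swap test of Steps 4 and 5 and the $\sum\ln\delta_j$ precision terms are omitted, so the score used here is only a simplified stand-in for $S_k$; those simplifications, and the default $\gamma = 32$, are assumptions of the sketch.

```python
import numpy as np

def greedy_basis_selection(V, y, gamma=32.0, max_k=None):
    """Forward selection of columns of the candidate matrix V, in the spirit of Algorithm 6.1
    (Steps 1-3 and 6), with a simplified penalised-error score in place of the full S_k."""
    V = V / np.linalg.norm(V, axis=0)            # Step 1: unit-length columns
    n_bar = len(y)
    B, e = [], y.copy()
    best_score = (n_bar / 2 - 1) * np.log(y @ y / n_bar) + 0.5 + np.log(gamma)   # S_0
    best_B = []
    for _ in range(max_k or V.shape[1]):
        mu = V.T @ e                             # Step 3: correlate candidates with the residual
        j = int(np.argmax(np.abs(mu)))
        if j in B:
            break
        B.append(j)
        lam, *_ = np.linalg.lstsq(V[:, B], y, rcond=None)
        e = y - V[:, B] @ lam
        k = len(B)
        score = (n_bar / 2 - 1) * np.log(e @ e / n_bar) + (k + 1) * (0.5 + np.log(gamma))
        if score < best_score:
            best_score, best_B = score, list(B)
    return best_B, best_score

# usage with the dictionary Phi and targets y of the previous sketch:
#   basis, score = greedy_basis_selection(Phi, y, max_k=15)
```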
6.5.4  Cylindrical basis models
Cylindrical basis models are a generalisation of the well known radial basis models [93]. As before, let $\{y_t\}_{t=1}^{N}$ be a scalar time series of $N$ observations. For an embedding window $d_w$, a cylindrical basis model is a function $F: \mathbb{R}^{d_w} \to \mathbb{R}$ such that
\[ F(z_t) = a_0 + \sum_{i} a_i\, y_{t-\ell_i} + \sum_{j} \lambda_j\, \phi\!\left(\frac{\|P_j(z_t) - c_j\|}{r_j}\right), \tag{6.22} \]
where $z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-d_w}) \in \mathbb{R}^{d_w}$, $a_i, \lambda_j \in \mathbb{R}$ are the weights, the $c_j$ are the centres (with the dimension of the corresponding projected subspace), and $r_j \in \mathbb{R}^+$ are the radii. The lags $\ell_i$ satisfy $0 < \ell_i < \ell_{i+1} \le d_w$ and the functions $P_j(\cdot)$ are orthogonal projections from $\mathbb{R}^{d_w}$ to some subset of the coordinate axes. The functions $\phi$ are typically some class of $C^2$ functions satisfying $\int_0^\infty \phi(x)\,dx < \infty$. The essential difference between cylindrical basis models and standard radial basis models is the inclusion of the projection functions $P_j(\cdot)$. These functions allow for the projection onto different, significant, subspaces of $\mathbb{R}^{d_w}$ in different parts of phase space. This is both useful and intuitive since the complexity of most nonlinear dynamical systems varies with location in phase space. For example, the Lorenz attractor is basically two dimensional on the "wings", but the central separatrix contains three dimensional structure [55]. Finally, in the next section we bring all these modelling techniques to
bear on recordings of onset of human ventricular fibrillation. An analysis of this data shows that a period doubling bifurcation is a reasonable model of the onset of this lethal cardiac arrhythmia.
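Before turning to that application, a minimal sketch of evaluating a model of the form (6.22) at a single embedded vector may be helpful; the Gaussian choice of $\phi$, the particular projections and every parameter value below are invented purely for illustration.

```python
import numpy as np

def cylindrical_model(z, a0, a, lags, lam, centres, radii, projections, phi=None):
    """F(z) = a0 + sum_i a_i*z[lag_i - 1] + sum_j lam_j*phi(||P_j z - c_j|| / r_j),
    where z = (y_{t-1}, ..., y_{t-dw}) and each projection is a list of coordinate indices."""
    if phi is None:
        phi = lambda s: np.exp(-s ** 2)          # an assumed Gaussian basis function
    value = a0 + sum(ai * z[l - 1] for ai, l in zip(a, lags))
    for lj, cj, rj, Pj in zip(lam, centres, radii, projections):
        s = np.linalg.norm(z[Pj] - cj) / rj      # distance measured in the projected subspace only
        value += lj * phi(s)
    return value

z = np.array([0.2, -0.1, 0.4, 0.0])              # (y_{t-1}, ..., y_{t-4}) for d_w = 4
print(cylindrical_model(
    z, a0=0.1, a=[0.5, -0.2], lags=[1, 3],
    lam=[1.0, -0.5],
    centres=[np.array([0.0, 0.0]), np.array([0.3])],
    radii=[0.5, 0.2],
    projections=[[0, 2], [3]],                   # P_1 keeps coordinates 1 and 3; P_2 keeps coordinate 4
))
```

Each basis function responds to distances measured only along its own projected coordinates, which is what lets different terms capture structure in different subspaces of the embedding.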
6.6  Application: Predicting onset of Ventricular Fibrillation, and evaluating time since onset
Ventricular fibrillation (VF) is a rapidly lethal cardiac arrhythmia and a common cause of death in the industrialised world. However, the mechanism underlying onset of VF and successful treatment via electrical defibrillation is poorly understood. Experimental observations in humans and animals have shown that the cardiac system is inherently nonlinear and that significant nonlinear deterministic structure can be extracted from ECG time series recordings [141; 135; 100]. Experimental preparations of cardiac cells have been shown to exhibit phase locking, period doubling and onset of chaos [43; 19; 109; 62]. However, these experiments consist of externally driven isolated cellular preparations. Theoretical models of the atrio-ventricular (AV) junction and periodically forced cardiac cells predict phase locking and period doubling bifurcation [160]. Period doubling, from period 1 to period 2 (known in the medical literature as alternans), can be occasionally observed in the diseased heart (see, for example [39]) and has recently been shown to precede ventricular fibrillation in theoretical models [44]. Cohen and colleagues have demonstrated period doubling (up to five times the base period) in the canine heart [104] and have more recently shown that T-wave alternans is a significant predictor of arrhythmia [66]. Although alternans is indicative of a period doubling bifurcation, a period doubling bifurcation route to chaos may not necessarily manifest itself as alternans. Alternans is defined as a doubling of the underlying period of a time series (see Fig. 6.12); a period doubling bifurcation occurs when the order of the periodic behaviour doubles — typically to two periods very close to one another. That is, alternans is sufficient but not necessary. Many authors have speculated that a period doubling bifurcation may provide a route to chaos during ventricular arrhythmia in humans [38; 47; 170]. However, no conclusive evidence has been provided. In this section, we infer the existence of a period doubling bifurcation route to chaos prior to the onset of ventricular tachyarrhythmia in humans through the analysis of nonlinear models of clinical ECG data. Whereas alternans is an easily
Fig. 6.11 Spontaneous evolution of VF in a human. The data shown here has been down-sampled to 125 Hz. On each plot the horizontal axis is time in seconds and the vertical axis is surface (ECG) electrical potential in milli-volts (mV). The top plot shows a short section of sinus rhythm, including ectopic beats. The second plot shows initiation and evolution of VT and the third shows VF. These plots are contiguous and the horizontal axes are equal. The entire data set is shown in Fig. 6.13(a).
observable and measurable phenomenon (Fig. 6.12), we apply nonlinear modelling techniques to search for more subtle features, not visually apparent in the original data. A period doubling bifurcation is not apparent from a cursory examination of the data in Fig. 6.11. We deduce the existence of such a bifurcation in a model of this data. To search for time dependent nonlinear dynamic structures we employ powerful nonlinear radial basis modelling techniques described in this chapter and also generalised further in [123]. These modelling methods have the necessary nonlinearity to accurately describe cardiac dynamics (unlike linear or polynomial models) and are amenable to analysis with the techniques of nonlinear dynamical systems theory (unlike neural networks or local linear methods). Furthermore, these models are generic and the modelling algorithm is (in general) assumption free. By employing an information theoretic model selection criterion, this algorithm chooses the simplest model that is consistent with the observed dynamics. To accurately model bifurcation in cardiac dynamics, these existing methods must be modified. By embedding time as a state variable [136] we observe a period doubling bifurcation and chaotic dynamics during sinus rhythm and ventricular tachycardia (VT)
Fig. 6.12 Period doubling in human ECG. The data shown here has been downsampled to 125 Hz. On each plot the horizontal axis is time in seconds and the vertical axis is surface (ECG) electrical potential in milli-volts (mV). This data shows a spontaneous period doubling in the respiratory rate of a human subject (at around 76 seconds). Prior to this, several ectopic (premature) beats are observed (for example, at 57 and 71 seconds). These plots are contiguous and the horizontal axes are equal.
immediately preceding VF. However, we find that successive runs of the modelling algorithm do not produce quantitatively identical results. Therefore, we explicitly calculate an index which we use as a surrogate for the true (hidden) bifurcation parameter and embed this as a state variable. This technique produces characteristic and reproducible results. We observe a transition from pseudo-periodic chaos (sinus rhythm prior to VT and VF) to a stable limit cycle (during VT) and to (distinct) pseudo-periodic chaos (VF). By pseudo-periodic chaos we mean, as before, the commonly observed type of chaos in systems such as the Rossler equations. A periodic orbit becomes bi-periodic and, in the Poincaré section, cascades through successive period doubling bifurcations to chaos. In the chaotic regime the system exhibits chaotic, approximately cyclic, behaviour. The modelling described here characterises the transition from the physiological states of sinus rhythm to VT and VF as a period doubling bifurcation. We must emphasise that applying this modelling algorithm to experimental time series data cannot prove the existence of a bifurcation to chaos in the dynamical system underlying the observed data. However, this
Fig. 6.13 Period doubling chaos prior to onset of VF. The top panel shows the original time series. The second panel shows the estimated asymptotic peak (darker) and trough (lighter) values for fixed values of the bifurcation parameter. The bottom plot shows the estimated asymptotic amplitudes. The horizontal axis is time series datum number; the model bifurcation parameter is an affine transformation of this. Sinus rhythm is characterised as a periodic orbit. Prior to initiation of VT this periodic orbit undergoes a period doubling bifurcation and becomes chaotic. At onset of VT the chaotic behaviour bifurcates to pseudo-periodic chaos and period-8 dynamics. At the transition to VF this behaviour changes to a stable limit cycle and eventually a stable focus.
information theoretic modelling algorithm estimates the simplest dynamical system (model) that is consistent with the observed time series. We may then apply computational dynamical systems theory to estimate the bifurcation behaviour of that model. We are able to show that this observed behaviour is both consistent with the time series and also the simplest consistent behaviour (within the class of models employed). We examine four recordings of spontaneous initiation and evolution of VF in human subjects. Such time series data are limited and difficult to obtain. Using a new data collection facility [134] we have been able to record time series showing spontaneous evolution from sinus rhythm to VF, and subsequent treatment. In two of these recordings, VF is preceded by VT. One of the other two recordings shows a direct transition from
Fig. 6.14 Trajectory estimates from a model. For fixed values of the bifurcation parameter, the system is run for 1000 iterations. The dynamic behaviour for various values of the bifurcation parameter is shown (from top): periodic, period 2, period 3, chaos, period 4 and a fixed point. The chosen fixed values of bifurcation parameter correspond to datum 1000, 2000, 4000, 5000, 6000 and 8000 of Fig. 6.13 (respectively). Trajectories such as these are used to estimate the amplitudes shown in Figs. 6.13 and 6.15. Note that while these trajectories do not capture the exact quantitative features of the system (for example, the periodic behaviour observed at datum 1000 is unlike sinus rhythm ECG), these features are qualitatively appropriate.
sinus rhythm to VF. The fourth recording shows bradycardic (slow rhythm) behaviour prior to onset of VF. For each of these four recordings, we selected an approximately 90 second (45000 data point) sub-sequence covering the transition from sinus rhythm, through any intermediate rhythms, to VF. We down-sampled the time series by taking a four-point moving average and
Fig. 6.15 Period doubling chaos prior to onset of VF. The calculation of Fig. 6.13 is shown here as a probability density plot. The figure depicts the estimated distribution of asymptotic oscillation amplitudes as a function of time (in seconds). High likelihood is depicted in bright red and low likelihood in dark blue.
sampling every fourth point, yielding 90 second episodes of approximately 11250 data points. Down sampling was done to reduce the computational load to a manageable level. By taking a four point moving average before down sampling, we reduced the amount of temporal information in the time series but increased the spatial resolution. After down-sampling, the first 11000 data points were used to construct a cylindrical basis model. Figure 6.11 shows one of the recordings used in this study. For comparison we have also applied this analysis to recordings of other spontaneously evolving physiological rhythms. Figure 6.17 shows a recording of bigeminy (a phenomenon where normal and abnormal beats alternate) and Fig. 6.12 shows spontaneous alternans. To build models of this data we apply the cylindrical basis modelling method with minimum description length as the model selection criterion. For this particular computation, we use a combination of Gaussian basis functions and Morlet wavelets. The modelling routine obtains an estimate of the function $F$ which is used to approximate the evolution operator of a dynamical system by
\[ F(z_t) = F(y_{t-1}, y_{t-2}, \ldots, y_{t-d_w}) = y_t. \tag{6.23} \]
With the model size $(k,m)$ fixed, our modelling algorithm selects $a_i$, $\ell_i$, $\lambda_j$, $c_j$, $r_j$ and $P_j(\cdot)$ such that the model prediction error
\[ \mathrm{Error} = \sum_{t=d_w+1}^{N} e_t^2 = \sum_{t=d_w+1}^{N} \left(F(z_t) - y_t\right)^2, \tag{6.24} \]
is minimised.
Fig. 6.16 Period doubling chaos prior to onset of VF. (a) Fourier spectrogram (256 point FFT with an overlap of 128 points) and (b) wavelet analysis of the data used in Figs. 6.13 and 6.15. In contrast to the analysis in Fig. 6.15 these calculations provide little additional information compared to the original time series. Most of the change in frequency content that is observable from these plots is also observable directly from the original time series. High density is shown in red and low density in blue.
The optimal model size is then selected by finding the values of $k$ and $m$ such that the model description length is minimised. The model described by Eqs. (6.22) and (6.23) is time independent. The process we are modelling exhibits time-dependent features and these equations are unlikely to provide an adequate model. One possible solution is to model each of the parameters $a_i$, $\ell_i$, $\lambda_j$, $c_j$, $r_j$ and $P_j(\cdot)$ as a function of time. However, this approach is somewhat impractical. A more feasible approach is to extend the embedding
\[ z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-d_w}) \tag{6.25} \]
to
\[ z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-d_w}, \xi(t)) \tag{6.26} \]
so that time $t$ is explicitly included as a dependent variable of $F(z_t)$ [136].
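A minimal sketch of the extended embedding (6.26): each delay vector receives one extra coordinate, the time index affinely rescaled so that its range matches that of $y_t$, as described in the text that follows; the toy series is an illustrative assumption.

```python
import numpy as np

def time_augmented_embedding(y, dw):
    """Build vectors z_t = (y_{t-1}, ..., y_{t-dw}, xi(t)), with xi affine-scaled to the range of y."""
    n = len(y)
    t = np.arange(dw, n)                                     # indices with a full delay history
    Z = np.column_stack([y[t - i] for i in range(1, dw + 1)])
    xi = y.min() + (t - t.min()) / (t.max() - t.min()) * (y.max() - y.min())
    return np.column_stack([Z, xi]), y[t]                    # embedded vectors and their targets

y = np.sin(np.linspace(0, 20, 2000)) * np.linspace(0.5, 1.5, 2000)   # a toy nonstationary series
Z, targets = time_augmented_embedding(y, dw=5)
print(Z.shape)    # (1995, 6): five delay coordinates plus the scaled time coordinate
```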
Fig. 6.17 Spontaneous evolution of VT and bigeminy in a human. The data shown here has been down-sampled to 125 Hz. On each plot the horizontal axis is time in seconds and the vertical axis is surface (ECG) electrical potential in milli-volts (mV). The top plot shows a short section of sinus rhythm and spontaneous evolution of VT. The second plot shows self termination of VT, followed by alternating regular QRS complexes (normal heart beats) and ectopic (abnormal) beats. In the third panel, sinus rhythm is restored. These plots are contiguous and the horizontal axes are equal.
The affine transformation $\xi$ is included so that the time index "stretches" the attractor. If $\xi$ is too large, the embedded points $z_t$ will be too sparse in phase space; if $\xi$ is too small, there will effectively be insufficient separation. For this data we have found that it is sufficient to choose $\xi$ so that the maximum and minimum values of $\xi(t)$ are the same as the maximum and minimum of $y_t$. In general, a more robust choice may be to define $\xi$ in terms of the standard deviation of $y_t$. The modelling algorithm we apply here must minimise a nonlinear function (description length) of several variables ($a_i$, $\ell_i$, $\lambda_j$, $c_j$, $r_j$ and $P_j(\cdot)$). To solve this problem completely is beyond our limited computational resources. For this reason the algorithm we have implemented involves a stochastic search for a local minimum of description length. For many data sets this is sufficient to obtain reproducible and consistent results [55; 54; 56; 126; 140]. We have tested this modelling algorithm on four systems with known bifurcation structures and obtained good agreement between actual and predicted results [136]. This modelling methodology is able to
accurately uncover bifurcation structure from time series simulations of (i) the Rossler equations, (ii) FitzHugh-Nagumo type simulations of ventricular arrhythmia, (iii) nonstationary noise processes, as well as from (iv) experimental data of infant respiration [136]. The computational simulations examined in [136] have well known and easily verifiable dynamics, and therefore provided a test of the accuracy of this approach. We were able to correctly extract a nonstationary trend from an i.i.d. noise simulation [136] with a slight nonstationarity. When applied to a noisy simulation of the Rossler system undergoing a period doubling bifurcation, the modelling algorithm was able to correctly identify the qualitative features of the bifurcation from the noisy data (5000 observations) [136]. The model we utilise here is an ordinary iterative map with a tunable parameter. This bifurcation parameter has been allowed to vary in the construction of the model. By keeping this parameter fixed and observing the iterated behaviour of the model, we have an estimate of the asymptotic dynamics of the system for that (fixed) value of the bifurcation parameter. For each such trajectory, we estimate the (asymptotic) peak and trough values observed. For a periodic orbit these will be fixed; for period 2 dynamics there will be two distinct amplitudes; and for chaos the sequence of amplitudes will be irregular. It is these estimates of the asymptotic amplitudes that are plotted in Figs. 6.13 and 6.15 as a function of the bifurcation parameter. Representative trajectories for various values of the bifurcation parameter are depicted in Fig. 6.14. For example, from the representative trajectories shown in Fig. 6.14 we would deduce (from top to bottom): a periodic orbit with only one observable amplitude, period 2, period 3, chaos, period 4, and a fixed point (zero amplitude). In Figs. 6.13 and 6.15 we provide an estimate, derived from a single application of our modelling scheme to the data set of Fig. 6.11. This estimate is representative of our other results. The dynamics presented in this figure contain all the various complex behaviours observed in separate runs of the modelling routine. Some iterations of the modelling routine did not produce the complete period doubling bifurcation illustrated here. However, they all exhibited a complex bifurcation from a pseudo-periodic regime (corresponding to sinus rhythm) to chaos (VT) and a stable periodic orbit or focus (VF). Models from each of the four recordings of onset of VF are qualitatively similar. We compared this technique and the bifurcation behaviour estimated from models with traditional frequency domain methods applied directly to the data. Figure 6.15 is an estimate of the probability distribution of the amplitude of system orbits for various values of the bifurcation parameter (time index).
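The amplitude estimates can be produced exactly as just described: freeze the bifurcation coordinate, iterate the fitted map past a transient, and record the turning points of the resulting trajectory. A minimal sketch follows; the logistic-style map stands in for a fitted model purely for illustration.

```python
import numpy as np

def asymptotic_extrema(step, x0, n_transient=500, n_keep=500):
    """Iterate a one-dimensional map with its parameter frozen and return the peak and
    trough values (local maxima and minima) of the asymptotic trajectory."""
    x = x0
    for _ in range(n_transient):                  # discard the transient
        x = step(x)
    traj = np.empty(n_keep)
    for i in range(n_keep):
        x = step(x)
        traj[i] = x
    mid = traj[1:-1]
    peaks = mid[(mid > traj[:-2]) & (mid > traj[2:])]
    troughs = mid[(mid < traj[:-2]) & (mid < traj[2:])]
    return peaks, troughs

# sweep a frozen "bifurcation parameter" r of the logistic map through a period doubling cascade
for r in (3.2, 3.5, 3.9):
    peaks, _ = asymptotic_extrema(lambda x: r * x * (1 - x), x0=0.4)
    print(r, np.unique(np.round(peaks, 3)))       # one value, then two values, then many (chaos)
```

Plotting these peak and trough values against the frozen parameter gives a bifurcation diagram of the same kind as the second panel of Fig. 6.13.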
The data is the same as the lower panel of Fig. 6.13, expressed as a probability distribution. For comparison we computed a windowed spectrogram and wavelet transform from the original data (Fig. 6.16). While the coarse time-dependent features are evident in these plots, we find very little information not evident in the original time series. The asymptotic time behaviour displayed in Fig. 6.15 is achieved by extrapolating the model dynamics for a model fitted to the data. Qualitative features of period doubling and onset of chaos are present in each of the models despite this complex structure not being obvious in the original time series. The modelling algorithm is fitting subtle features of the asymptotic dynamics that may not be observed directly during the comparatively rapid change in the system dynamics prior to (and during) onset of ventricular arrhythmia. Although essential features are consistent between most models, this variation between models is somewhat unsatisfactory. The problem is that in addition to estimating unknown dynamics we are relying on the modelling algorithm to estimate an unknown bifurcation parameter at each point in time. To overcome this problem we estimated the bifurcation parameter a priori and built a model based on this bifurcation parameter. Generically, the time series recordings of evolution of VF exhibit two or three of four distinct dynamical regimes: sinus rhythm, VT, bradycardia and VF. A bifurcation parameter must reflect this. For the data presented here, we found that the proportion of the power spectral content between 4 and 8 Hz performed well. Alternative indices including entropy, complexity and point-wise correlation dimension were also considered. However, we found that spectral power content provided the best separation between different physiological rhythms. In our experience, spectral power content also performed well for real time identification of arrhythmia [134]. Spectral power captures many of the essential features of the time series (Fig. 6.16). However, this spectral variation alone is insufficient to deduce the period doubling bifurcation and onset of chaos observed in models of this data. Figure 6.18 shows estimates of this quantity for the same time series as in Fig. 6.13. We estimated the power content between 4 and 8 Hz for a 2048 point sliding window (before down sampling the original recordings) and used this quantity as a surrogate for the true, unknown, bifurcation parameter.
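A minimal sketch of that surrogate bifurcation parameter: the fraction of spectral power between 4 and 8 Hz in a sliding 2048 point window, computed here with a plain periodogram. The synthetic test signal, the hop size and the use of a simple periodogram (rather than any particular estimator used for the ECG data) are assumptions made for illustration.

```python
import numpy as np

def band_power_fraction(x, fs, f_lo=4.0, f_hi=8.0):
    """Fraction of periodogram power falling between f_lo and f_hi (in Hz)."""
    x = x - np.mean(x)
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return spectrum[band].sum() / spectrum.sum()

def sliding_band_power(x, fs, window=2048, hop=256):
    """Evaluate the band-power fraction over a sliding window, one value per window position."""
    starts = range(0, len(x) - window + 1, hop)
    return np.array([band_power_fraction(x[s:s + window], fs) for s in starts])

fs = 500.0                                           # an assumed sampling rate before down-sampling
t = np.arange(int(60 * fs)) / fs
x = np.sin(2 * np.pi * 1.2 * t) + 0.5 * np.sin(2 * np.pi * 6.0 * t) * (t > 30)   # 6 Hz appears at 30 s
p = sliding_band_power(x, fs)
print(p[:3].round(3), p[-3:].round(3))               # near zero early, clearly larger once 6 Hz is present
```

The resulting sequence, affinely rescaled, plays the role of the surrogate parameter in the extended embedding below.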
Fig. 6.18 Bifurcation diagram estimated with surrogate bifurcation parameter. The top plot shows values used for the bifurcation parameter as a function of the time index (sampled at 125 Hz) for the same data set as Fig. 6.11. The second plot shows the asymptotic peak (darker) and trough (lighter) values as a function of the time index and the third plot shows the estimated amplitude behaviour. The complex period doubling bifurcation evident in Fig. 6.13(b) is absent, but a clear transition from large scale Rossler type chaos (for sinus rhythm preceding VF) through period-6 behaviour (VT) to another pseudo-periodic chaotic regime (VF) is apparent.
That is, the embedding (6.25) or (6.26) is replaced by
\[ z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-d_w}, \xi(p(y_t))) \tag{6.27} \]
where $p(y_t)$ is the above-mentioned statistic, and $\xi(\cdot)$ is an affine scale transformation. The remainder of the modelling process is identical. Figure 6.18 depicts a representative result of this calculation for the data illustrated in Fig. 6.13. Repeated runs of the modelling algorithm on the same data produced equivalent results. The results between data sets were similar. Whereas the calculation estimating both model and bifurcation parameters would typically yield smooth bifurcation from a periodic orbit, to chaos (sinus rhythm prior to VT and during VT) and back to a noisy periodic orbit (VF), this is not the case with an a priori estimate of the
bifurcation parameter. For a smooth and gradual change in the values of this bifurcation parameter, the change in the underlying dynamics is similarly subtle. However, as this parameter is estimated directly from the time series the value often changes fairly rapidly (Fig. 6.18), and this in turn produces sudden changes in the underlying dynamics. Closer examination of small changes in the bifurcation parameter does show a period doubling bifurcation during the transition between sinus rhythm and tachycardia. The weakness of this approach is that we are constraining the time dependent dynamics of the model to conform to our expectation of the system, i.e. $\xi(\cdot)$. We have therefore described two alternative techniques to examine the changing dynamics in experimental time series. Although we consider only human ECG recordings the method itself is generic and equally applicable to any experimental time series. Estimating the model and bifurcation parameter simultaneously provides a method of examining the changing dynamics with no a priori information. For short, noisy, or experimental data this method has its limitations. Primarily, the computational burden to estimate the optimal model, model parameters, and bifurcation parameter from a single scalar time series is immense. This problem meant that it was difficult to get fully repeatable results (for a single time series). Rather than obtain the global minimum of (6.18) the modelling algorithm would halt at a locally optimal value. To reduce the computational effort we provided an alternative approach: estimating the model bifurcation parameter a priori. This approach meant that the models produced were more easily repeatable (for a single time series) and also consistent (between different time series). The main problems with this approach are finding a "reasonable" surrogate for the bifurcation parameter, and the restriction of the dynamics that this implies. Estimating the bifurcation parameter and model simultaneously meant that the problem of reproducing the dynamics was reduced to finding the right nonlinear function from (an affine transformation of) the time index to the underlying bifurcation parameter. By assuming an appropriate value of bifurcation parameter first, the existence of an injective mapping from the surrogate bifurcation parameter $\xi(p(y_t))$ to the underlying bifurcation parameter is not guaranteed (i.e. $p(y_t)$ is not necessarily a 1-to-1 function of $t$). However, we hope that a "reasonable" choice of the surrogate parameter will provide a trivial transformation from $\xi(p(y_t))$ to the underlying bifurcation parameter. When applied to recordings of initiation of VF, these modelling
techniques provide a description of the simplest dynamical system (within the model class) consistent with the observed data. Observations derived from these models are possible dynamical phenomena consistent with the data. This does not imply that the data must have been produced by such phenomena, only that such phenomena are a simple and compact explanation for the observed data. Prior to onset of ventricular arrhythmia, sinus rhythm may behave as a noisy dynamical system exhibiting Rossler-type chaos (i.e. chaotic pseudo-periodic orbits). We also find evidence that a period doubling bifurcation may precede imminent ventricular tachyarrhythmia. In the arrhythmic state this dynamical system behaves as either Rossler type chaos or a stable focus. We believe that this possible contradiction is due to limitations of the modelling process. When modelling both dynamics and estimating the bifurcation parameter the modelling algorithm is unable to extract sufficient information from the VF and VT states to model them as more than noisy periodic orbits. When the bifurcation parameter is assumed a priori, and $p(y_t)$ is a non-monotonic function of $t$ (i.e. the value remains the same at different times, leading to a more densely populated phase space), arrhythmic (VT and VF) behaviour can be better modelled as pseudo-periodic chaos (Rossler type chaos). These results imply that both sinus rhythm and VF may be characterised as chaotic dynamical systems. However, the underlying dynamics during sinus rhythm, VT and VF are all fundamentally different. These results are therefore consistent with earlier work estimating dynamical invariants from time series [141; 135; 100; 140; 168]. The similarity between these results and various observations of alternans is significant but not conclusive. We observe an increase in the order of periodic behaviour in the amplitude of respiration whereas alternans typically manifests as a doubling in the underlying respiratory rate (and sometimes observed directly as two distinct amplitudes). We found no evidence for chaotic bifurcations in recordings of alternans or bigeminy. We also found that prior to onset of tachyarrhythmia the pseudo-periodic chaos observed in models of these time series becomes strictly periodic and a period doubling bifurcation may be observed as a mechanism consistent with transition between these states. These results are preliminary, and certainly not clinically significant. A much larger sample needs to be considered before definitive statements may be made. Nevertheless, observation of period doubling bifurcations prior to onset of VF suggests that the human cardiac system undergoes a fundamental change in its dynamical state prior to onset. Methods such as Lyapunov exponent and
correlation dimension [168] estimation may therefore be applied to predict imminent arrhythmia. However, such algorithms need to be substantially modified to be both sensitive and accurate for extremely short time series if they are to be applied as clinical indicators of imminent arrhythmia.
Chapter 7
Applications
Many of the techniques presented in this volume have already found widespread use, if not genuine "application". However, our main emphasis has been on more recent modifications to the standard algorithms, and a more pragmatic approach. In the past, the standard approach to nonlinear time series analysis for arbitrary experimental data $\{y_t\}_t$ may be summarised by the following four step plan:
Algorithm 7.1  The black box
(1) Choose embedding dimension $d_e$ and embedding lag $\tau$ using the false nearest neighbours algorithm and mutual information criterion (Chapter 1).
(2) Embed the scalar time series according to (1.8) and estimate the chosen invariant, usually correlation dimension $d_c(d_e,\tau)$ (Chapter 2).
(3) Compare the correlation dimension estimated from the data to an ensemble of AAFT surrogates (Chapter 4).
(4) Build a nonlinear model of the deterministic dynamics in the data (Chapter 6).

However, it is clear that this approach cannot work in general and probably will not provide useful results or a reasonable application, except in extremely isolated cases. The primary problem with this "algorithm" is that it ignores the most important part of nonlinear time series analysis: the data. One should always try to understand the data as far as possible and use that information for an informed analysis. Although ontological models may provide a better understanding of a given system, we are often limited to phenomenological models. But, this does not mean that one should not consult the data, or seek more information about the underlying system. At each stage of Algorithm 7.1 the user should be actively
intervening: checking that the results are both sensible and meaningful. The second major problem with Algorithm 7.1 is that it is not immediately clear what the purpose of this approach is. Step 1 is a standard, but by no means unique, pre-cursor to time delay embedding reconstruction; step 2 is often undertaken to find corroboration of the existence of chaos (which is usually pre-supposed); step 3 is used to validate step 2; and, finally, step 4 is undertaken because we eventually want to make predictions. But, step 1 is not guaranteed to be the best approach, and it may not even be adequate. Step 2 is flawed because estimation of dynamic invariants implicitly assumes that the said dynamic invariant exists. Step 3 is sensible, but, if the hypothesis being considered is not even plausible for the data, it is redundant. Finally, in step 4 we need only observe that any pundit can make predictions, only an oracle (and the occasional, very lucky, pundit) can make good predictions. As we have seen, for experimental data, limited in both length and resolution, it is not possible to apply the time delay reconstruction technique and achieve a genuine embedding in the mathematically rigorous sense. Therefore, it does not make sense to seek the global optimal embedding strategy. Which embedding strategy is best will sensitively depend on the purpose of the reconstruction. A good embedding for estimating correlation dimension is not necessarily a good embedding for building a deterministic model of the dynamics (this is even true in the case of infinite clean data: one can estimate correlation dimension dc provided that de > dc [26], but for reconstruction one needs de > 2dc + 1 [144]). We also saw in Chapter 1 that good embeddings for modelling are often non-uniform. In Chapter 6 we actually generalised this further with variable embedding schemes: it is neither certain nor even intuitive that a good embedding is necessarily constant over phase space. When we seek the simplest model of the underlying dynamics we found that the complexity of the underlying system varies with location in phase space. Moreover, when making experimental observations, the amount of data that can be collected while the system's parameters remain stationary is very often very limited. In some situations, the delay reconstruction that can be implemented from such limited data is not likely to be useful. Sometimes, the only correct conclusion that can be made from a short and noisy data set is that it is not possible to make a conclusion. The same general cautionary remarks must also be given when one attempts to estimate dynamic invariants. But here, there are two complementary problems. The first problem is that of estimation: with limited data
the estimates are not going to be good, so we need to have some manner of giving error bars (or better yet, a probability distribution) on our estimates. Often, the most important thing is not accurate estimates, but only unbiased ones. The second problem is that even with reliable estimation, one must carefully consider exactly what it is that has been estimated. Correlation dimension is an excellent example of this. When estimating correlation dimension (according to the original algorithm of Grassberger-Procaccia), one assumes a deterministic attractor exists, and then attempts to estimate its dimension. Unfortunately, if the assumption is flawed, the resultant quantity has no meaning. The newer schemes (Judd's and Diks' algorithms) also suffer from similar general problems, but in these cases noise is part of the a priori assumption. Finally, when applying these methods to real data there is a third problem of a somewhat deeper nature: the question still remains, what exactly does this quantity mean? Is it relevant, useful, or even correct to talk about finite dimensional attractors for such experimental data? Physicians, for example, are not likely to be impressed by evidence of finite dimensional attractors in the human respiration system [126]. However, physicians are extremely interested to learn whether these models can offer a reasonable explanation for the underlying mechanics. Surrogate data was initially received as a panacea for all the ills associated with (in particular) correlation dimension estimation. But, surrogate data does not solve all the problems. Technical issues aside, it is useful to preclude a linearly filtered noise source as the possible origin of experimental data. But, often this is a trivial thing to do. If the data exhibits some regular oscillations, or even asymmetry, this is expected. By formally rejecting the null hypothesis we are only showing that the correlation dimension is not doing something extremely trivial. Moreover, if the surrogates generate dynamics that are physically unrealistic, they are not useful. To provide new insight into the underlying system, surrogates must be appropriate for the time series under consideration. One should not be able to trivially reject the surrogates. It is therefore not particularly enlightening to apply linear surrogate methods to data which exhibits periodic fluctuations and which we know is not noise (for example, respiration). In such cases, linear surrogate techniques only serve as a minor significance test for the chosen test statistic. An important issue which is often ignored when one conducts surrogate tests is the choice of test statistic. We found that correlation dimension is a pivotal statistic (for the most common hypotheses) and may therefore
be used to avoid addressing some of the more pedantic technical problems, particularly with Algorithm 2 (AAFT) surrogates. But, a prudent choice of test statistic may also be used to modify the underlying hypothesis. Used appropriately, model simulations may even provide a form of surrogate hypothesis testing. However, the more usual purpose of modelling is to either obtain a snapshot of the underlying deterministic dynamical system, or, at least, make predictions. But, predictions from a model are not equivalent to predictions of the true future: they only predict what the model is likely to do next. Even if the model is a good model (it is not possible to have a correct model [59]), a single prediction of a chaotic system is not useful. One can only really make an ensemble of predictions, and the choice of the correct ensemble is still not obvious. However, if we believe our models, there is much we can try to infer about the underlying dynamics from them. Notwithstanding these cautionary remarks, the main emphasis of this volume has been on how these techniques may be profitably applied to real data. Like all models of reality (either ontological or phenomenological), surrogate data and estimated invariants of the attractor describe the underlying reality. These methods are only useful when their description of reality provides new information. We saw, for example, that the application of these methods to financial data was largely fruitless, until we chose appropriate surrogate data and test statistics. When we did this, we found strong evidence that the standard financial heteroskedastic models (ARCH, GARCH and EGARCH, as well as ARMA) are wrong. These models do not offer an adequate description of reality. Moreover, if one massages this data somewhat further, it is easy to see that the markets are genuinely not efficient [131]. Most of the applications presented in this volume concerned physiological rather than financial systems. For these processes we were able to make several significant conclusions. We found evidence of deterministic periodic oscillation in the respiratory pattern that suggests an oscillatory underlying control mechanism. It turns out that the period of this oscillation coincides with the period of periodic breathing, and that a deficit in this mechanism may be responsible for phenomena such as sudden infant death [126; 128]. From studies of electrocardiogram recordings before and during ventricular fibrillation, we were able to suggest new indicators of onset of cardiac arrhythmia. Complexity based measures were shown to be able to distinguish between sinus rhythm, ventricular tachycardia and ventricular
fibrillation. Moreover, we found tantalising evidence in phenomenological models that the onset of fibrillation is characterised by a bifurcation in the underlying system dynamics prior to onset of arrhythmia. With more advanced computational methods and a more thorough study, techniques such as this could eventually be used as an early warning of imminent arrhythmia [139]. Other modelling methods have also been shown to be able to estimate the time since onset of an evolving cardiac arrhythmia [138]. Finally, we presented a largely ontological model of SARS transmission in Hong Kong. Despite the fact that this model is basically ad hoc, the structure was motivated by the environment and the parameters determined by studies of the data. The model showed striking agreement with the observed data and also indicated a potentially vast range of behaviours. But, this simple model suggested that the critical factor for control of SARS was preventing transmission within hospitals. If known patients could be effectively isolated, the threat of SARS is greatly reduced. Clearly, the applications presented here are only a small subset of the current work with these methods, and the methods we have described reflect the author's personal bias. Despite the fact that nonlinear time series analysis is firmly grounded in the mathematics of nonlinear dynamical systems theory, the choice of techniques for a given application is not clear-cut. Nonlinear time series analysis still resists all attempts to reduce the underlying methods to a black box which may be mindlessly applied to data. What I provide here is only a toolkit. If you use the tools incorrectly, the results you get will be nonsense.
Bibliography
1. H. D. I. Abarbanel, Reggie Brown, John J. Sidorowich, and Lev Sh. Tsimring. The analysis of observed chaotic data in physical systems. Rev Mod Phys, 65:1331-1392, 1993.
2. Henry D.I. Abarbanel. Analysis of observed chaotic data. Institute for nonlinear science. Springer-Verlag, New York, 1996.
3. Peter Achermann, Rolf Hartmann, Anton Gunzinger, Walter Guggenbühl, and Alexander A. Borbély. All-night sleep EEG and artificial stochastic control signals have similar correlation dimensions. Electroencephalogr Clin Neurophysiol, 90:384-387, 1994.
4. Douglas Adams. The hitch-hikers guide to the galaxy: A trilogy in four parts. Heinemann, London, 1984.
5. Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716-723, 1974.
6. A. M. Albano, A. Passamante, and Mary Eileen Farrell. Using higher-order correlations to define an embedding window. Physica D, 54:85-97, 1991.
7. M. Ataei, B. Lohmann, A. Khaki-Sedigh, and C. Lucas. Model based method for determining the minimum embedding dimension from chaotic time series — univariate and multivariate cases. Nonlinear Phenomena in Complex Systems, 6:842-851, 2003.
8. William A. Barnett and Apostolos Serletis. Martingales, nonlinearity, and chaos. Journal of Economic Dynamics and Control, 24:703-724, 2000.
9. Donald A. Berry and Bernard W. Lindgren. Statistics: Theory and Methods. Brooks/Cole publishing company, 1990.
10. H. Bettermann and P. Van Leeuwen. Dimensional analysis of RR dynamic in 24 hour electrocardiograms. Acta Biotheor, 40:297-312, 1992.
11. William A. Brock, David A. Hsieh, and Blake LeBaron. Nonlinear dynamics, chaos and instability. The MIT Press, Cambridge, Massachusetts, 1991.
12. Liangyue Cao. Practical method for determining the minimum embedding dimension of a scalar time series. Physica D, 120:43-50, 1997.
13. M. C. Casdagli, L. D. Iasemidis, J. C. Sackellares, S. N. Roper, R. L. Glimore, and R. S. Savit. Characterizing nonlinearity in invasive EEG recordings from temporal lobe epilepsy. Physica D, 99:381-399, 1996.
14. C.J. Cellucci, A.M. Albano, and P.E. Rapp. Comparative study of embedding methods. Physical Review E, 67:066210-1-13, 2003.
15. Moira Chan-Yeung and Rui-Heng Xu. SARS: epidemiology. Respirology, 8:S9-S14, 2003.
16. E.W. Cheney. Introduction to Approximation Theory. American Mathematical Society, Providence, Rhode Island, 2nd edition, 1982.
17. Ward Cheney and Will Light. A Course in Approximation Theory. Brooks/Cole, Pacific Grove, 1999.
18. B. Cheng and H. Tong. Orthogonal projection, embedding dimension and sample size in chaotic time series from a statistical perspective. Phil Trans R Soc Lond A, 348:325-341, 1994.
19. Dante R. Chialvo, Robert F. Gilmour Jr., and Jose Jalife. Low dimensional chaos in cardiac tissue. Nature, 343:653-657, 1990.
20. Edwin K.P. Chong and Stanislaw H. Zak. An introduction to optimization. Wiley-Interscience Series in Discrete mathematics and optimization. John Wiley & Sons, 1996.
21. R.H. Clayton, A. Murray, and R.W.F. Campbell. Comparison of four techniques for recognition of ventricular fibrillation from the surface ECG. Medical & Biological Engineering & Computing, 31:111-117, 1993.
22. R.H. Clayton, A. Murray, A.M. Whittam, and R.W.F. Campbell. Automatic recording of ventricular fibrillation. In Computers in Cardiology, pages 685-688. IEEE Computer Society, 1981.
23. R.H. Clayton, A. Murray, A.M. Whittam, and R.W.F. Campbell. Automatic recording of ventricular fibrillation. In Computers in Cardiology, pages 685-688. IEEE, 1991.
24. C. Diks. Estimating invariants of noisy attractors. Physical Review E, 53:R4263-R4266, 1996.
25. Mingzhou Ding, Celso Grebogi, Edward Ott, Tim Sauer, and James A. Yorke. Estimating correlation dimension from a chaotic time series: when does plateau onset occur? Physica D, 69:404-424, 1993.
26. Mingzhou Ding, Celso Grebogi, Edward Ott, Tim Sauer, and James A. Yorke. Plateau onset for correlation dimension: when does it occur? Physical Review Letters, 70:3872-3875, 1993.
27. W.L. Ditto, M.L. Spano, V. In, J. Neff, B. Meadows, J.J. Langberg, A. Bolmann, and K. McTeague. Control of human atrial fibrillation. International Journal of Bifurcation and Chaos, 10:593-601, 2000.
28. Kevin Dolan, Annette Witt, Mark L. Spano, Alexander Neiman, and Frank Moss. Surrogates for finding unstable periodic orbits in noisy data sets. Physical Review E, 59:5235-5241, 1999.
29. Kevin T. Dolan and Mark L. Spano. Surrogate for nonlinear time series analysis. Physical Review E, 64:046128, 2000.
30. Christl A. Donnelly et al. Epidemiological determinants of spread of causal agent of severe acute respiratory syndrome in Hong Kong. Lancet, 361:1761-1766, May 7 2003. http://image.thelancet.com/extras/03art4453web.pdf.
31. A.E. Eiben, J.E. Smith, A.E. Eiben, and J.D. Smith. Introduction to Evolutionary Computing. Natural Computing Series. Springer-Verlag, 2003.
32. J. Doyne Farmer, Edward Ott, and James A. Yorke. The dimension of chaotic attractors. Physica D, 7:153-180, 1983.
33. J. Doyne Farmer and John J. Sidorowich. Predicting chaotic time series. Physical Review Letters, 59:845-848, 1987.
34. J.L. Feldman and J.C. Smith. Neural control of respiration in mammals: an overview. In J.A. Dempsey and A.I. Pack, editors, Regulation of Breathing, pages 39-69. Marcel Dekker Inc, New York, 1995.
35. A. Galka, T. Maaß, and G. Pfister. Estimating the dimension of high-dimensional attractors: A comparison between two algorithms. Physica D, 121:237-251, 1998.
36. Andreas Galka and Gerd Pfister. Dynamical correlations on reconstructed invariant densities and their effect on correlation dimension estimation. International Journal of Bifurcation and Chaos, 13:723-732, 2003.
37. John F. Gibson, J. Doyne Farmer, Martin Casdagli, and Stephen Eubank. An analytic approach to practical state space reconstruction. Physica D, 57:1-30, 1992.
38. A.L. Goldberger, D.R. Rigney, J. Mietus, E.M. Antman, and S. Greenwald. Nonlinear dynamics in sudden cardiac death syndrome: Heartrate oscillations and bifurcations. Experientia, 44:983-987, 1988.
39. Ary L. Goldberger, Valmik Bhargava, Bruce J. West, and Arnold J. Mandell. Nonlinear dynamics of the heartbeat. II. Subharmonic bifurcations of the cardiac interbeat interval in sinus node disease. Physica D, 17:207-214, 1985.
40. Peter Grassberger and Itamar Procaccia. Measuring the strangeness of strange attractors. Physica D, 9:189-208, 1983.
41. John Guckenheimer and Philip Holmes. Nonlinear oscillations, dynamical systems, and bifurcations of vector fields, volume 42 of Applied mathematical sciences. Springer-Verlag, New York, 1983.
42. A. Guerrero and L.A. Smith. Towards coherent estimation of correlation dimension. Physics Letters A, 318:373-379, 2003.
43. Michael R. Guevara, Leon Glass, and Alvin Shrier. Phase locking, period-doubling bifurcations and irregular dynamics in periodically stimulated cardiac cells. Science, 214:1350-1353, 1981.
44. Harold M. Hastings, Flavio H. Fenton, Steven J. Evans, Omer Hotomaroglu, Jagannathan Geetha, Ken Gittelson, John Nilson, and Alan Garfinkel. Alternans and the onset of ventricular fibrillation. Physical Review E, 62:4043-4048, 2000.
45. Simon Haykin. Communication Systems. John Wiley & Sons, New York, 2001.
46. Rainer Hegger, Holger Kantz, and Thomas Schreiber. Practical implementation of nonlinear time series methods: The TISEAN package. Chaos, 9:413-435, 1999.
47. Bart P.T. Hoekstra, Cees G.H. Diks, Maurits A. Allessie, and Jacob DeGoode. Nonlinear analysis of the pharmacological conversion of sustained atrial fibrillation in conscious goats by the class Ic drug cibenzoline. Chaos, 7:430-446, 1997.
48. David A. Hsieh. Chaos and nonlinear dynamics: Applications to financial markets. J Finance, 46:1839-1877, 1991.
markets. J Finance, 46:1839-1877, 1991. 49. Tohru Ikeguchi and Kazuyuki Aihara. Estimating correlation dimensions of biological time series with a reliable method. Journal of Intelligent and Fuzzy Systems, 5:33-52, 1997. 50. Kevin Judd. An improved estimator of dimension and some comments on providing confidence intervals. Physica D, 56:216-228, 1992. 51. Kevin Judd. Estimating dimension from small samples. Physica D, 71:421429, 1994. 52. Kevin Judd. Building optimal models of time series. In G. Gousebet, S. Meunier-Guttin-Cluzel, and O. Menard, editors, Chaos and its reconstruction, chapter 2, pages 179-214. Nova Science Publishers, Inc, New York, 2003. 53. Kevin Judd and Alistair Mees. On selecting models for nonlinear time series. Physica D, 82:426-444, 1995. 54. Kevin Judd and Alistair Mees. Modeling chaotic motions of a string from experimental data. Physica D, 92:221-236, 1996. 55. Kevin Judd and Alistair Mees. Embedding as a modelling problem. Physica D, 120:273-286, 1998. 56. Kevin Judd and Michael Small. Towards long-term prediction. Physica D, 136:31-44, 2000. 57. Kevin Judd, Michael Small, and Alistair I. Mees. Achieving good nonlinear models: Keep it simple, vary the embedding, and get the dynamics right. In Alistair I. Mees, editor, Nonlinear Dynamics and Statistics, chapter 3, pages 65-80. Birkhauser, Boston, 2001. 58. Kevin Judd and Leonard Smith. Indistinguishable states i. perfect model scenario. Physica D, 151:125-141, 2001. 59. Kevin Judd and Leonard Smith. Indistinguishable states ii. imperfect model scenarios. Physica D, 196:224-242, 2004. 60. Holger Kantz and Thomas Schreiber. Nonlinear time series analysis. Number 7 in Cambridge Nonlinear Science Series. Cambridge University Press, Cambridge, 1997. 61. Daniel Kaplan and Leon Glass. Understanding nonlinear dynamics. Number 19 in Texts in Applied Mathematics. Springer-Verlag, New York, 1996. 62. H.S. Karagueuzian, S.S. Khan, K. Hong, Y. Kobayashi, T. Denton, W.J. Mandel, and G.A. Diamond. Action-potential alternans and irregular dynamics in quinidine-intoxicated ventricular muscle-cells — implications for ventricular proarrhythmia. Circulation, 87:1661-1672, 1993. 63. Matthew B. Kennel and Henry D. I. Abarbanel. False neighbors and false strands: A reliable minimum embedding dimension algorithm. Physical Review E, 66:o026209-l-18, 2002. 64. Matthew B. Kennel, Reggie Brown, and Henry D. I. Abarbanel. Determining embedding dimension for phase-space reconstruction using a geometric construction. Physical Review A, 45:3403-3411, 1992. 65. H.S. Kim, R. Eykholt, and J.D. Salas. Delay time window and plateau onset of the correlation dimension for small data sets. Physical Review E, 58:5676-5682, 1998.
66. T. Klingenheben, M. Zabel, R.B. D'Agostino, R.J. Cohen, and S.H. Hohnloser. Predictive value of T-wave alternans for arrhythmic events in patients with congestive heart failure. Lancet, 356:651-652, 2000.
67. D. Kugiumtzis. Test your surrogate data before you test for nonlinearity. Physical Review E, 60:2808-2816, 1999.
68. D. Kugiumtzis. Surrogate data test for nonlinearity including nonmonotonic transforms. Physical Review E, 62:R25-R28, 2000.
69. D. Kugiumtzis. On the reliability of the surrogate data test for nonlinearity in the analysis of noisy time series. International Journal of Bifurcation and Chaos, 11:1881-1896, 2001.
70. D. Kugiumtzis, O.C. Lingjærde, and N. Christopherson. Regularized local linear prediction of chaotic time series. Physica D, 112:344-360, 1998.
71. H.R. Künsch. The jackknife and the bootstrap for general stationary observations. Annals of Statistics, 17:1217-1241, 1989.
72. Ying-Cheng Lai and David Lerner. Effective scaling regime for computing the correlation dimension from chaotic time series. Physica D, 115:1-18, 1998.
73. Joseph T.F. Lau et al. Probable secondary infections in households of SARS patients in Hong Kong. Emerging Infectious Diseases, 10, 2004.
74. Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans Information Theory, IT-22:75-81, 1976.
75. T.Y. Li and J.A. Yorke. Period 3 implies chaos. American Mathematical Monthly, 82:985-992, 1975.
76. Z. Li, T. Ikeguchi, and Michael Small. Nonlinear analysis of Chinese vowels. In Proceedings of the 2002 IEICE General Conference, page 57, Waseda University, Japan, 2002.
77. E.N. Lorenz. Deterministic nonperiodic flow. J. Atmos. Sci., 20:130-141, 1963.
78. Xiaodong Luo, Tomomichi Nakamura, and Michael Small. Surrogate test to distinguish chaotic and pseudo-periodic time series data. Physical Review E, 2005. In press.
79. Xiaodong Luo and Michael Small. Using geometric measures of redundance and irrelevance tradeoff coefficient to choose suitable delay times for continuous systems. 2004. To appear.
80. David J.C. MacKay. Bayesian interpolation. Neural Comp, 4:415-447, 1992.
81. Venkatesh Mani, Xeujun Wu, Mark A. Wood, and Peng-Wie Hsia. Association of short term heart rate spectral power with onset of spontaneous ventricular tachycardia or ventricular fibrillation. Computers in Cardiology, 25:101-104, 1998.
82. G. Mayer-Kress, F. E. Yates, L. Benton, M. Keidel, W. Tirsch, S. J. Pöppl, and K. Geist. Dimensional analysis of nonlinear oscillations in brain, heart and muscle. Math Biosci, 90:155-182, 1988.
83. A. I. Mees, P. E. Rapp, and L. S. Jennings. Singular-value decomposition and embedding dimension. Physical Review A, 36:340-346, 1987.
84. Alistair I. Mees. Dynamical systems and tesselations: detecting determinism in data. International Journal of Bifurcation and Chaos, 1:777-794, 1991.
85. S. Milgram. The small world problem. Psychology Today, 2, 1967.
86. M. Molnar and J. E. Skinner. Correlation dimension changes of the EEG during the wakefulness-sleep cycle. Acta Biochim Biophys Hung, 26:121-125, 1991.
87. J.D. Murray. Mathematical Biology, volume 19 of Biomathematics Texts. Springer, 2nd edition, 1993.
88. Tomomichi Nakamura. Modelling nonlinear time series using selection methods and information criteria. PhD thesis, University of Western Australia, Department of Mathematics and Statistics, 2003.
89. Tomomichi Nakamura, Xiaodong Luo, and Michael Small. Topological test for chaotic time series. 2005. In press.
90. Lyle Noakes. The Takens embedding theorem. International Journal of Bifurcation and Chaos, 1:867-872, 1991.
91. Ya. B. Pesin. Characteristic Lyapunov exponents and smooth ergodic theory. Russ. Math. Surveys, 32:55-114, 1977.
92. Jan Pieter Pijn, Jan Van Neerven, Andre Noest, and Fernando H. Lopes da Silva. Chaos or noise in EEG signals; dependence on state and brain site. Electroencephalogr Clin Neurophysiol, 79:371-381, 1991.
93. M. J. D. Powell. The theory of radial basis function approximation in 1990. In Will Light, editor, Advances in Numerical Analysis. Volume II: wavelets, subdivision algorithms and radial basis functions, chapter 3, pages 105-210. Oxford Science Publications, 1992.
94. Klaus Prank, Heio Harms, Matthias Dammig, Georg Brabant, Fedor Mitschke, and Rolf-Dieter Hesch. Is there low-dimensional chaos in pulsatile secretion of parathyroid hormone in normal human subjects? American Journal of Physiology, 266E:653-658, 1994.
95. William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical recipes in C. Cambridge University Press, 1988.
96. Dean Prichard and James Theiler. Generalized redundancies for time series analysis. Physica D, 84:476-493, 1995.
97. M. B. Priestley. Non-linear and non-stationary time series analysis. Academic Press, London, 1989.
98. Corinna Raab and Jürgen Kurths. Estimation of large-scale dimension densities. Physical Review E, 64:016216-1-5, 2001.
99. P.E. Rapp, C.J. Cellucci, K.E. Korslund, T.A.A. Watanabe, and M.A. Jiménez-Montaño. Effective normalization of complexity measurements for epoch length and sampling frequency. Physical Review E, 64:016209-1-9, 2001.
100. Flavia Ravelli and Renzo Antolini. Complex dynamics underlying the human electrocardiogram. Biol Cybern, 67:57-65, 1992.
101. L.F. Richardson. The problem of contiguity. Gen. Syst. Yearbook, 6:139-187, 1961.
102. Steven Riley et al. Transmission dynamics of the etiological agent of SARS in Hong Kong: impact of public health interventions. Science, 300:1961-1966, 2003.
103. Jorma Rissanen. Stochastic complexity in statistical inquiry. World Scientific, Singapore, 1989.
104. Amy L. Ritzenberg, Dan R. Adam, and Richard Jonathan Cohen. Period multupling-evidence for nonlinear behaviour of the canine heart. Nature, 307:159-161, 1984.
105. J. Röschke and J. Aldenhoff. The dimensionality of human's electroencephalogram during sleep. Biol Cybern, 64:307-313, 1991.
106. J. Röschke and J. B. Aldenhoff. A nonlinear approach to brain function: deterministic chaos and sleep EEG. Sleep, 15:95-101, 1992.
107. Otto E. Rössler. Continuous chaos — four prototype equations. Annals of the New York Academy of Sciences, 316:376-392, 1979.
108. Tim Sauer. Time series prediction by using delay coordinate embedding. In A.S. Weigend and N.A. Gershenfeld, editors, Time series prediction: Forecasting the future and understanding the past, volume XV of Studies in the sciences of complexity, pages 175-193, Reading, MA, 1993. Santa Fe Institute, Addison-Wesley.
109. Guillermo V. Savino, Lilia Romanelli, Diego L. Gonzalez, Oreste Piro, and Max E. Valentinuzzi. Evidence for chaotic behaviour in driven ventricles. Biophysical Journal, 56:273-280, 1989.
110. Gary Bruno Schmid and Rudolf M. Dünki. Indications of nonlinearity, intraindividual specificity and stability of human EEG: the unfolding dimension. Physica D, 93:165-190, 1996.
111. Thomas Schreiber. Constrained randomization of time series. Physical Review Letters, 80:2105-2108, 1998.
112. Thomas Schreiber and Andreas Schmitz. Improved surrogate data for nonlinearity tests. Physical Review Letters, 77:635-638, 1996.
113. Thomas Schreiber and Andreas Schmitz. Surrogate time series. Physica D, 142:346-382, 2000.
114. Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.
115. Severe Acute Respiratory Syndrome Expert Committee. Report of the severe acute respiratory syndrome expert committee. Technical report, Hong Kong Department of Health, 2 October 2003. http://www.sarsexpertcom.gov.hk/english/reports/report.html.
116. C.E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379-423, 623-656, 1948.
117. Mark Shelhamer. Correlation dimension of optokinetic nystagmus as evidence of chaos in the oculomotor system. IEEE Transactions on Biomedical Engineering, 39:1319-1321, 1992.
118. B. W. Silverman. Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. Chapman and Hall, London; New York, 1986.
119. James E. Skinner, Clara Carpeggiani, Carole E. Landisman, and Keith W. Fulton. Correlation dimension of heartbeat intervals is reduced in conscious pigs by myocardial ischemia. Circ Res, 68:966-976, 1991.
120. James E. Skinner, Craig M. Pratt, and Tomas Vybiral. A reduction in the correlation dimension of heartbeat intervals precedes imminent ventricular fibrillation in human subjects. Am Heart J, 125:731-743, 1992.
121. Michael Small. Nonlinear dynamics in infant respiration. PhD thesis, University of Western Australia, Department of Mathematics, 1998. URL: http://www.eie.polyu.edu.hk/~ensmall/.
122. Michael Small and Kevin Judd. Using surrogate data to test for nonlinearity in experimental data. In International Symposium on Nonlinear Theory and its Applications, volume 2, pages 1133-1136. Research Society of Nonlinear Theory and its Applications, IEICE, 1997.
123. Michael Small and Kevin Judd. Comparison of new nonlinear modelling techniques with applications to infant respiration. Physica D, 117:283-298, 1998.
124. Michael Small and Kevin Judd. Correlation dimension: A pivotal statistic for non-constrained realizations of composite hypotheses in surrogate data analysis. Physica D, 120:386-400, 1998.
125. Michael Small, Kevin Judd, Madeleine Lowe, and Stephen Stick. Detection of periodic breathing during quiet sleep using linear modelling techniques. In preparation.
126. Michael Small, Kevin Judd, Madeleine Lowe, and Stephen Stick. Is breathing in infants chaotic? Dimension estimates for respiratory patterns during quiet sleep. Journal of Applied Physiology, 86:359-376, 1999.
127. Michael Small, Kevin Judd, and Alistair Mees. Modeling continuous processes from data. Physical Review E, 65:046704, 2002.
128. Michael Small, Kevin Judd, and Stephen Stick. Linear modelling techniques detect periodic respiratory behaviour in infants during regular breathing in quiet sleep. American Journal of Respiratory and Critical Care Medicine, 153:A79, 1996. (Abstract).
129. Michael Small and C.K. Tse. Applying the method of surrogate data to cyclic time series. Physica D, 164:187-201, 2002.
130. Michael Small and C.K. Tse. Minimum description length neural networks for time series prediction. Physical Review E, 66:066701, 2002. Reprinted in Virtual Journal of Biological Physics Research 4 (2002).
131. Michael Small and C.K. Tse. Determinism in financial time series. Studies in Nonlinear Dynamics and Econometrics, 7(3), 2003. To appear.
132. Michael Small and C.K. Tse. Optimal embedding parameters: A modelling paradigm. Physica D, 2003. To appear.
133. Michael Small, Dejin Yu, Richard Clayton, and Robert G. Harrison. Evolution of ventricular fibrillation revealed by first return plots. Computers in Cardiology, 27:525-528, 2000.
134. Michael Small, Dejin Yu, Neil Grubb, Jennifer Simonotto, Keith Fox, and Robert G. Harrison. Automatic identification and recording of cardiac arrhythmia. Computers in Cardiology, 27:355-358, 2000.
135. Michael Small, Dejin Yu, and Robert G. Harrison. Nonlinear analysis of human ECG rhythm and arrhythmia. Computers in Cardiology, 27:147-150, 2000.
136. Michael Small, Dejin Yu, and Robert G. Harrison. Nonstationarity as an embedding problem. In S. Boccaletti, J. Burguete, W. González-Viñas, H.L. Mancini, and D.L. Valladares, editors, Space Time Chaos: Characterization, Control and Synchronization, pages 3-18. World Scientific, Singapore, 2001.
137. Michael Small, Dejin Yu, and Robert G. Harrison. A surrogate test for pseudo-periodic time series data. Physical Review Letters, 87:188101, 2001.
138. Michael Small, Dejin Yu, and Robert G. Harrison. Variation in the dominant period during ventricular fibrillation. IEEE Transactions on Biomedical Engineering, 48:1056-1061, 2001.
139. Michael Small, Dejin Yu, and Robert G. Harrison. Period doubling bifurcation route in human ventricular fibrillation. International Journal of Bifurcation and Chaos, 13:743-754, 2003.
140. Michael Small, Dejin Yu, Robert G. Harrison, Colin Robertson, Gareth Clegg, Michael Holzer, and Fritz Sterz. Characterizing nonlinearity in ventricular fibrillation. Computers in Cardiology, 26:17-20, 1999.
141. Michael Small, Dejin Yu, Robert G. Harrison, Colin Robertson, Gareth Clegg, Michael Holzer, and Fritz Sterz. Deterministic nonlinearity in ventricular fibrillation. Chaos, 10:268-277, 2000.
142. C.J. Stam, J.P.M. Pijn, and W.S. Pritchard. Reliable detection of nonlinearity in experimental time series with strong periodic components. Physica D, 112:361-380, 1998.
143. G. Sugihara and R.M. May. Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series. Nature, 344:734-741, 1990.
144. Floris Takens. Detecting strange attractors in turbulence. Lecture Notes in Mathematics, 898:366-381, 1981.
145. Floris Takens. Detecting nonlinearities in stationary time series. International Journal of Bifurcation and Chaos, 3:241-256, 1993.
146. M. Thiel, M.C. Romano, P.L. Read, and J. Kurths. Estimation of dynamical invariants without embedding by recurrence plots. Chaos, 14:234-243, 2004.
147. James Theiler. Estimating fractal dimension. J Opt Soc Am A, 7:1055-1073, 1990.
148. James Theiler. On the evidence for low-dimensional chaos in an epileptic electroencephalogram. Physics Letters A, 196:335-341, 1995.
149. James Theiler, Stephen Eubank, Andre Longtin, Bryan Galdrikian, and J. Doyne Farmer. Testing for nonlinearity in time series: The method of surrogate data. Physica D, 58:77-94, 1992.
150. James Theiler and Dean Prichard. Constrained-realization Monte-Carlo method for hypothesis testing. Physica D, 94:221-235, 1996.
151. James Theiler and Paul Rapp. Re-examination of the evidence for low-dimensional, nonlinear structure in the human electroencephalogram. Electroencephalogr Clin Neurophysiol, 98:213-222, 1996.
152. J.M.T. Thompson and H.B. Stewart. Nonlinear dynamics and chaos. John Wiley & Sons Ltd., Chichester, 1986.
153. Howell Tong. Non-linear time series: a dynamical systems approach. Oxford University Press, New York, 1990.
154. C.K. Tse. Applied nonlinear circuits and systems research group. WWW page, October 2004. http://chaos.eie.polyu.edu.hk/.
155. Robert G. Turcott and Malvin C. Teich. Fractal character of the electrocardiogram: distinguishing heart-failure and normal patients. Annals of Biomedical Engineering, 24:269-293, 1996.
156. Karin Vibe and Jean-Marc Vesin. On chaos detection methods. International Journal of Bifurcation and Chaos, 6:529-543, 1996.
157. Thomas B. Waggener, Paul J. Brusil, Richard E. Kronauer, Ronald A. Gabel, and Gideon F. Inbar. Strength and cycle time of high-altitude ventilatory patterns in unacclimatized humans. Journal of Applied Physiology, 56:576-581, 1984.
158. Eric A. Wan. Time series prediction by using a connectionist network with internal delay lines. In A.S. Weigend and N.A. Gershenfeld, editors, Time series prediction: Forecasting the future and understanding the past, volume XV of Studies in the sciences of complexity, pages 195-217, Reading, MA, 1993. Santa Fe Institute, Addison-Wesley.
159. T.A.A. Watanabe, C.J. Cellucci, E. Kohegyi, T.R. Bashore, R.C. Josiassen, N.N. Greenbaum, and P.E. Rapp. The algorithmic complexity of multichannel EEGs is sensitive to changes in behaviour. Psychophysiology, 40:77-97, 2003.
160. Bruce J. West, Ary L. Goldberger, Galina Rovner, and Valmik Bhargava. Nonlinear dynamics of the heartbeat. I. The AV junction: Passive conduit or active oscillator? Physica D, 17:198-206, 1985.
161. Alan Wolf, Jack B. Swift, Harry L. Swinney, and John A. Vastano. Determining Lyapunov exponents from a time series. Physica D, 16:285-317, 1985.
162. Tse-Wai Wong et al. Cluster of SARS among medical students exposed to single patient, Hong Kong. Emerging Infectious Diseases, 10, 2004.
163. World Health Organisation. Consensus document on the epidemiology of SARS. Technical report, World Health Organisation, 17 October 2003. http://www.who.int/csr/sars/en/WHOconsensus.pdf.
164. World Health Organisation. Summary table of SARS cases by country. Technical report, World Health Organisation, 15 August 2003. http://www.who.int/csr/sars/country/en/country2003_08-15.pdf, viewed 14 January 2003.
165. Yoshiharu Yamamoto, Richard L. Hughson, John R. Sutton, Charles S. Houston, Allen Cymerman, Ernest L. Fallen, and Markad V. Kamath. Operation Everest II: An indication of deterministic chaos in human heart rate variability at simulated extreme altitude. Biol Cybern, 69:205-212, 1993.
166. Dejin Yu, Weiping Lu, and Robert G. Harrison. Space time-index plots for probing dynamical nonstationarity. Physics Letters A, 250:323-327, 1998.
167. Dejin Yu, Michael Small, Robert G. Harrison, and C. Diks. Efficient implementation of the Gaussian kernel algorithm in estimating invariants and noise level from noisy time series data. Physical Review E, 61:3750-3756, 2000.
168. Dejin Yu, Michael Small, Robert G. Harrison, Colin Robertson, Gareth Clegg, Michael Holzer, and Fritz Sterz. Measuring temporal complexity of ventricular fibrillation. Physics Letters A, 265:68-75, 2000.
169. Xu-Sheng Zhang, Yi-Sheng Zhu, Nitish V. Thakor, and Zhi-Zhong Wang. Detecting ventricular tachycardia and fibrillation by complexity measure. IEEE Transactions on Biomedical Engineering, 46:548-555, 1999.
170. Xu-Sheng Zhang, Yi-Sheng Zhu, Nitish V. Thakor, Zi-Ming Wang, and Zhi-Zhong Wang. Modelling the relationship between concurrent epicardial action potentials and bipolar electrograms. IEEE Transactions on Biomedical Engineering, 46:365-376, 1999.
Index
Adams, Douglas, 88 Akaike information criterion, 205 alternans, 72, 208 ARCH, 170 atrio-ventricular junction, 208 attractor, 61, 88 chaos and fractional correlation dimension, 52 fractal, see attractor, strange strange, 49 stochastic, 50 audio codecs, see complexity audio coding, 62, 67 autocorrelation, see embedding lag, autocorrelation, 186
algorithmic complexity, 58 binary DPCM, 67 binary PCM, 65 coding schemes, 60-68 definition, 59 delta modulation, 62, 68, 69 algorithm, 62 differential pulse code modulation (DPCM), 62 Lempel-Ziv, 58, 83 octal PCM, 66 pulse code modulation (PCM), 62 sequential complexity algorithm, 59 compression syntactic, 58 computational complexity NP hard, 27 conditional heteroskedasticity, 171 coronary care unit, 61, 71, 106 correlation dimension, 48-54, 85-114 and chaos, 85 as a function of scale, 95 average, 95 box counting, 87-90 chaos and fractional correlation dimension, 52 correlation function, 86 correlation integral, 87, 103, 130, 134, 154 scaling, 103-105 correlation sum, 89 definition, 53
Bayesian regularisation, 31 bigeminy, 213, 215 black box algorithm, 1, 223 block bootstrap, 127 C2, 28, 134 Cantor set, 49 "Cantor-like" sets, 91 cardiac arrhythmia, 61 cardiac dynamics, 105-111 chaos, 62, 76, 224 chaotic systems, 226 hyper-chaos, 76 necessary conditions, 76 χ2 test, 131 complexity, 58-68
Diks' algorithm, see correlation dimension, Gaussian kernel algorithm estimation, 115 Gaussian kernel algorithm, 225 Gaussian kernel algorithm, 102-105 implementation, 103 interpreting, 108-111 Grassberger-Procaccia algorithm, 85, 87-90, 112 information dimension, 90 Judd's algorithm, 90-95, 225 interpreting, 99-101 self-consistency, 112 noise dimension, 131 pivotalness, 133-142 related to recurrence plots, 112 related to topological dimension, 92 Richardson dimension, 113 scaling, 92 temporal correlation bias, 111 test statistic, 130 cylindrical basis models, 207-208 data understand, 224 decorrelation time, 10 determinism, 2 in financial data, 80 digitisation and embeddings, 5 dimension correlation, see correlation dimension embedding dimension, see embedding, embedding dimension, 5 fractal, see fractal dimension integer, 53 dynamic invariants, 24 and chaotic dynamics, 48 caution, 224 correlation dimension, see correlation dimension definition, 47 entropy, see entropy
estimating problems, 115 importance, 48 Lyapunov exponent, see Lyapunov exponent dynamic noise, 161 dynamical system deterministic, 2 dynamical systems theory, 227 EGARCH, 170 Einstein, Albert, 180 electrocardiogram (ECG), 68 electroencephalogram (EEG), 69, 126 embedding, 224 and modelling, 179, 198-200 embedding dimension, 5-9 optimal dynamics, 9 plateau onset, 8-9, 29 embedding lag, 9-13 approximate period, 11-12 autocorrelation, 10, 37 generalised lags, 12-13 mutual information, 11 redundance and irrelevance tradeoff exponent, 12 embedding time, 214 embedding window, 14, 21, 26, 28-30, 34, 43 false nearest neighbours, 6-7, 45 false nearest strands, 7-8 for local modelling, 200 for modelling, 30-34 irregular, 14, 19-28 lags, 28 multiple lags, 20 non-uniform, 20 optimal reconstruction, 43 parameters, 13 standard, 40 strategy, 21 the "best" embedding, 179 time delay, 4-5 topological, 9 uniform, 20 variable, 20, 30
windowed, 40 entropy, 54-58 Kolmogorov-Sinai, 55, 77 Shannon, 55 epidemic modelling SEIR dynamics, 184 stochastic model, 185 epileptic, 126 Erdős number, 188 Euclidean geometry, 180 evolution operator, 2 false nearest neighbours, see embedding, false nearest neighbours Fast Fourier transform, 73, 74 fractal dimension, 5, 48, 85 fractal, 86
GARCH, 170 Gaussian noise, 201 higher moments, 132 Hong Kong and SARS, 182 household size, 186 Prince of Wales hospital, 183 Hurst exponent, 123 hyper-chaos, see chaos, hyper-chaos i.i.d. noise, 122 Ikeda map, 17, 39, 56, 92, 165 independent and identically distributed noise, see i.i.d. noise infant respiration, 95-102 control of respiration, 95 chaotic?, 101 cyclic amplitude modulation, 101 data collection, 96 inductance plethysmography, 88, 96 sleep state, 97 Kolmogorov-Smirnov test, 131
linear correlation, 132 linearly filtered noise, 123 local models, 158, 195-198 local linear, 197 naive predictor, 196 nearest neighbour, 196 Logistic map, 16 Lorenz dynamical system, 16, 39 Lyapunov exponent, 56, 76-80 Lyapunov spectra, 56 Lyapunov spectrum, 77, 82 Wolf's algorithm, 77-80 maximum likelihood, 31 Menger Sponge, 49-51 Microsoft, 66 minimum description length, 24-25, 31, 33, 177, 201-205 definition, 203-205
motivation, 203 modelling, 179-221 and embedding, 198 local models, see local models models of reality, 226 nonlinear, 151 nonlinear modelling and surrogates, 177 ontological and phenomenological models, 180-181 radial basis models, see radial basis models semi-local, 200-208 the model is not reality, 226 monotonic nonlinear transformed noise, 123 Morlet wavelets, 213 mp3, 67
Newton, Isaac, 4, 180 Newton's laws of motion, 180 nonlinear prediction error, 75-76, 144, 172 observation function, 4 ogg vorbis, 67 ontological models
definition, 180 period doubling bifurcation, 208 Pesin identity, 77, 82 phase space, 9 phenomenological models definition, 180 power spectrum, 123, 124, 168 pseudo-periodic, 125 definition, 61 Rössler dynamical system, 14, 34, 56, 93, 103, 163, 210 radial basis model cylindrical basis model, see cylindrical basis models radial basis models, 200-201 Gaussian basis functions, 201 model selection algorithm, 206 pseudo linear models, 205-207 stopping criterion, 207 reality according to Judd, 180 ignorance of, 181 recurrence plots, 112 reduced autoregressive models, 101 relativity, 180 sampling with and without replacement, 121 SARS, see Severe Acute Respiratory Syndrome (SARS) scale-free network, 188 Schwarz information criterion, 205 Severe Acute Respiratory Syndrome (SARS), 181-194, 227 and surrogates, 194 clusters, 186 model parameters, 193 rate of growth, 189 scale-free network, 188-189 simulations, 191
small-world network model, 186-189 spread of, 183 studies of, 183 super-spreader events, 183, 186, 194 Sierpinski carpet, 49, 50 six-degrees of separation, 188 small-world network, 186 spaghetti, 94 stationarity, 3 linear, 3 surrogate data, 108, 115-177 AAFT, see surrogate data, algorithm 2 algorithm 0, 121-122 pivotalness, 135 algorithm 1, 122-123, 145 pivotalness, 135 algorithm 2, 123-125 failing, 173 pivotalness, 135 amplitude adjusted Fourier transform, see surrogate data, AAFT and simulated annealing, 174-175 attractor surrogates, 161, 175 attractor trajectory surrogates, 158 algorithm, 160 noise radius, 161-163 shadowing, 160 composite hypotheses, 119 constrained realisation, 120 definition, 119
corrected AAFT, 124 correlation dimension pivotal, 152-154 cycle shuffled, 125-128 algorithm, 127 failure of, 153 inconvenient, 127 lack of randomisation, 127 definition, 118 Fourier shuffled, see surrogate data, algorithm 1 Fourier transform
phase mismatch, 125 hypothesis testing and modelling, 150 iterated AAFT, 124 redundant, 141, 146 linear hypotheses, 116 linear surrogates trivial, 149 model surrogates, 150-154 are nonconstrained, 153 null hypothesis, 153 nonlinear models pivotalness, 152-153 nonstationarity effect of, 148 financial, over-constrained, 123-125 parametric surrogates, 150 periodic data, 125 pivotal statistics definition, 118 pivotalness, 129 pseudo-periodic surrogates, 157, 168-170 test for heteroskedasticity, 170-172 rationale, 116 shuffled surrogate, see surrogate data, algorithm 0 simple hypotheses, 119 test for nonlinearity, 175 test for periodicity, 175 test statistic, 128-133 choice of, 225 consistency, 176 correlation dimension, 130 Gaussian kernel algorithm, 136 Grassberger-Procaccia correlation dimension, 136 invariance, 130 Judd's algorithm, 136-142 pivotal, 133-142 pivotalness, 149 trivial, 130 trivial, 125, 224, 225 surrogates
attractor trajectory surrogates, 163 symbolic sequence, 61 Takens' embedding theorem, 4-5, 8, 28, 30, 133, 180 test statistic, see surrogate data, test statistic thin plate spline, 201 time series Chinese and Japanese vowels, 166 electrocardiogram (ECG), 105-111, 165, 208, 227 electroencephalogram (EEG), 83, 159 financial, 80-82, 142-146, 168-174, 226 infant respiration, 94-102, 104, 139, 153-157, 226 laser, 18, 26, 41-43 Severe Acute Respiratory Syndrome (SARS), 182-185 sunspots, 18, 26, 41-43 ventricular fibrillation, 18, 26, 41-43, 106, 107, 208-221 ventricular tachycardia, 106, 107 vowel vocalisations, 166-168 topological dimension, see correlation dimension, related to topological dimension trajectories, 7 ventricular arrhythmia, 105 detecting, 69-74 monitoring hospital, 69 ventricular bigeminy, 71 ventricular fibrillation, 69, 70, 74, 208-221 onset, 208-209 ventricular tachycardia, 74 wavelets, 213 WinZip, 58