Lecture Notes in Control and Information Sciences Editors: M. Thoma · M. Morari
310
Lecture Notes in Control and Information Sciences Edited by M. Thoma and M. Morari Further volumes of this series are listed at the end of the book or found on our homepage: springeronline.com
Vol. 309: Kumar, V.; Leonard, N.; Morse, A.S. (Eds.) Cooperative Control 301 p. 2005 [3-540-22861-6] Vol. 308: Tarbouriech, S.; Abdallah, C.T.; Chiasson, J. (Eds.) Advances in Communication Control Networks 358 p. 2005 [3-540-22819-5]
Vol. 307: Kwon, S.J.; Chung, W.K. Perturbation Compensator based Robust Tracking Control and State Estimation of Mechanical Systems 158 p. 2004 [3-540-22077-1] Vol. 306: Bien, Z.Z.; Stefanov, D. (Eds.) Advances in Rehabilitation 472 p. 2004 [3-540-21986-2] Vol. 305: Nebylov, A. Ensuring Control Accuracy 256 p. 2004 [3-540-21876-9] Vol. 304: Margaris, N.I. Theory of the Non-linear Analog Phase Locked Loop 303 p. 2004 [3-540-21339-2] Vol. 303: Mahmoud, M.S. Resilient Control of Uncertain Dynamical Systems 278 p. 2004 [3-540-21351-1] Vol. 302: Filatov, N.M.; Unbehauen, H. Adaptive Dual Control: Theory and Applications 237 p. 2004 [3-540-21373-2] Vol. 301: de Queiroz, M.; Malisoff, M.; Wolenski, P. (Eds.) Optimal Control, Stabilization and Nonsmooth Analysis 373 p. 2004 [3-540-21330-9] Vol. 300: Nakamura, M.; Goto, S.; Kyura, N.; Zhang, T. Mechatronic Servo System Control Problems in Industries and their Theoretical Solutions 212 p. 2004 [3-540-21096-2] Vol. 299: Tarn, T.-J.; Chen, S.-B.; Zhou, C. (Eds.) Robotic Welding, Intelligence and Automation 214 p. 2004 [3-540-20804-6] Vol. 298: Choi, Y.; Chung, W.K. PID Trajectory Tracking Control for Mechanical Systems 127 p. 2004 [3-540-20567-5] Vol. 297: Damm, T. Rational Matrix Equations in Stochastic Control 219 p. 2004 [3-540-20516-0] Vol. 296: Matsuo, T.; Hasegawa, Y. Realization Theory of Discrete-Time Dynamical Systems 235 p. 2003 [3-540-40675-1]
Vol. 295: Kang, W.; Xiao, M.; Borges, C. (Eds) New Trends in Nonlinear Dynamics and Control, and their Applications 365 p. 2003 [3-540-10474-0] Vol. 294: Benvenuti, L.; De Santis, A.; Farina, L. (Eds) Positive Systems: Theory and Applications (POSTA 2003) 414 p. 2003 [3-540-40342-6] Vol. 293: Chen, G. and Hill, D.J. Bifurcation Control 320 p. 2003 [3-540-40341-8] Vol. 292: Chen, G. and Yu, X. Chaos Control 380 p. 2003 [3-540-40405-8] Vol. 291: Xu, J.-X. and Tan, Y. Linear and Nonlinear Iterative Learning Control 189 p. 2003 [3-540-40173-3] Vol. 290: Borrelli, F. Constrained Optimal Control of Linear and Hybrid Systems 237 p. 2003 [3-540-00257-X] Vol. 289: Giarre, L. and Bamieh, B. Multidisciplinary Research in Control 237 p. 2003 [3-540-00917-5] Vol. 288: Taware, A. and Tao, G. Control of Sandwich Nonlinear Systems 393 p. 2003 [3-540-44115-8] Vol. 287: Mahmoud, M.M.; Jiang, J. and Zhang, Y. Active Fault Tolerant Control Systems 239 p. 2003 [3-540-00318-5] Vol. 286: Rantzer, A. and Byrnes C.I. (Eds) Directions in Mathematical Systems Theory and Optimization 399 p. 2003 [3-540-00065-8] Vol. 285: Wang, Q.-G. Decoupling Control 373 p. 2003 [3-540-44128-X] Vol. 284: Johansson, M. Piecewise Linear Control Systems 216 p. 2003 [3-540-44124-7] Vol. 283: Fielding, Ch. et al. (Eds) Advanced Techniques for Clearance of Flight Control Laws 480 p. 2003 [3-540-44054-2] Vol. 282: Schroder, J. Modelling, State Observation and Diagnosis of Quantised Systems 368 p. 2003 [3-540-44075-5]
A. Janczak
Identification of Nonlinear Systems Using Neural Networks and Polynomial Models A Block-Oriented Approach With 79 Figures and 22 Tables
Series Advisory Board
A. Bensoussan · P. Fleming · M.J. Grimble · P. Kokotovic · A.B. Kurzhanski · H. Kwakernaak · J.N. Tsitsiklis
Author Prof. Andrzej Janczak University of Zielona Góra Institute of Control and Computation Engineering ul. Podgorna 50 65-246 Zielona Góra Poland
ISSN 0170-8643 ISBN 3-540-23185-4 Springer Berlin Heidelberg New York Library of Congress Control Number: 2004097177 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in other ways, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under German Copyright Law. Springer is a part of Springer Science+Business Media springeronline.com © Springer-Verlag Berlin Heidelberg 2005 Printed in Germany The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Data conversion by the authors. Final processing by PTP-Berlin Protago-TEX-Production GmbH, Germany Cover-Design: design & production GmbH, Heidelberg Printed on acid-free paper 62/3020Yu - 5 4 3 2 1 0
Preface
The identification of nonlinear systems using the block-oriented approach has been developed since the mid-1960s. A large amount of knowledge on this subject has accumulated in the literature. However, publications are scattered over many papers, and there is no book which presents the subject in a unified framework. The resulting need to systematize the existing identification methods, together with the wish to present some original results, has been the main incentive to write this book. In writing the book, an attempt has been made to present some new ideas concerning the adjustment of model parameters with gradient-based techniques. The two types of models considered in this book use neural networks and polynomials as representations of Wiener and Hammerstein systems. The focus is placed on Wiener and Hammerstein models in which the nonlinear element is represented by a polynomial or a two-layer perceptron neural network with hyperbolic tangent hidden layer nodes and linear output nodes. Pulse transfer function models are common representations of system dynamics in both neural network and polynomial Wiener and Hammerstein models. Neural network and polynomial models differ in properties such as approximation accuracy, computational complexity, and the available parameter and structure optimization methods. These differences make them complementary in solving many practical problems. For example, it is well known that the approximation of some nonlinear functions requires polynomials of a high order and this, in turn, results in a high parameter variance error. The approximation with neural network models is an interesting alternative in such cases. The book results mainly from my research in the area of nonlinear system identification, which has been carried out since 1995.
Two exceptions to this rule are Chapter 1, containing the introductory notes, and Chapter 5, which reviews the well-known Hammerstein system identification methods based on polynomial models of the nonlinearity. In writing the book, emphasis has been put on presenting, in a unified framework, various identification methods which are applicable to both neural network and polynomial models of Wiener and Hammerstein systems.
The book starts with a survey of discrete-time models of time-invariant dynamic systems. Then the multilayer perceptron neural network is introduced and a brief review of the existing methods for the identification of Wiener and Hammerstein systems is presented. The two subsequent chapters (2 and 3) introduce neural network models of Wiener and Hammerstein systems and present different algorithms for the calculation of the gradient or the approximate gradient of the model output w.r.t. model parameters. For both Wiener and Hammerstein models, the accuracy of gradient evaluation with the truncated backpropagation through time algorithm is analyzed. The discussion also includes advantages and disadvantages of the algorithms in terms of their approximation accuracy, computational requirements, and weight updating methods. Next, in Chapter 4, we present identification methods which use polynomial models of Wiener systems. The parameters of the linear dynamic system and the inverse nonlinearity are estimated with the least squares method and with a combined least squares and instrumental variables approach. To estimate parameters of the noninverted nonlinearity, the recursive prediction error and the pseudolinear regression methods are proposed. Then the existing identification methods based on polynomial Hammerstein models are reviewed in Chapter 5. Wiener and Hammerstein models are two examples of block-oriented models which have found numerous industrial applications. The most important of them, including nonlinear system modelling, control, and fault detection and isolation, are reviewed in Chapter 6. This chapter also presents two applications of Wiener and Hammerstein models: estimation of system parameter changes, and modelling of vapor pressure dynamics in a five-stage sugar evaporation station.
The book contains the results of research conducted by the author with the kind support of the State Committee for Scientific Research in Poland under the grant No 4T11A01425 and with the additional support of the European Union within the 5th Framework Programme under DAMADICS project No HPRN-CT-2000-00110. A few people have contributed directly or indirectly to this book. First of all, I would like to express my sincere gratitude to Professor Józef Korbicz for his constant help, advice, encouragement, and support. I am very grateful to Ms Agnieszka Rożewska for proofreading and linguistic advice on the text. I highly appreciate the help of my colleagues from the Institute of Control and Computational Engineering, University of Zielona Góra.
Zielona Góra, July 2004
Andrzej Janczak
Contents
Symbols and notation

1 Introduction
  1.1 Models of dynamic systems
    1.1.1 Linear models
    1.1.2 Nonlinear models
    1.1.3 Series-parallel and parallel models
    1.1.4 State space models
    1.1.5 Nonlinear models composed of sub-models
    1.1.6 State-space Wiener models
    1.1.7 State-space Hammerstein models
  1.2 Multilayer perceptron
    1.2.1 MLP architecture
    1.2.2 Learning algorithms
    1.2.3 Optimizing the model architecture
  1.3 Identification of Wiener systems
  1.4 Identification of Hammerstein systems
  1.5 Summary

2 Neural network Wiener models
  2.1 Introduction
  2.2 Problem formulation
  2.3 Series-parallel and parallel neural network Wiener models
    2.3.1 SISO Wiener models
    2.3.2 MIMO Wiener models
  2.4 Gradient calculation
    2.4.1 Series-parallel SISO model. Backpropagation method
    2.4.2 Parallel SISO model. Backpropagation method
    2.4.3 Parallel SISO model. Sensitivity method
    2.4.4 Parallel SISO model. Backpropagation through time method
    2.4.5 Series-parallel MIMO model. Backpropagation method
    2.4.6 Parallel MIMO model. Backpropagation method
    2.4.7 Parallel MIMO model. Sensitivity method
    2.4.8 Parallel MIMO model. Backpropagation through time method
    2.4.9 Accuracy of gradient calculation with truncated BPTT
    2.4.10 Gradient calculation in the sequential mode
    2.4.11 Computational complexity
  2.5 Simulation example
  2.6 Two-tank system example
  2.7 Prediction error method
    2.7.1 Recursive prediction error learning algorithm
    2.7.2 Pneumatic valve simulation example
  2.8 Summary
  2.9 Appendix 2.1. Gradient derivation of the truncated BPTT. SISO Wiener models
  2.10 Appendix 2.2. Gradient derivation of truncated BPTT. MIMO Wiener models
  2.11 Appendix 2.3. Proof of Theorem 2.1
  2.12 Appendix 2.4. Proof of Theorem 2.2

3 Neural network Hammerstein models
  3.1 Introduction
  3.2 Problem formulation
  3.3 Series-parallel and parallel neural network Hammerstein models
    3.3.1 SISO Hammerstein models
    3.3.2 MIMO Hammerstein models
  3.4 Gradient calculation
    3.4.1 Series-parallel SISO model. Backpropagation method
    3.4.2 Parallel SISO model. Backpropagation method
    3.4.3 Parallel SISO model. Sensitivity method
    3.4.4 Parallel SISO model. Backpropagation through time method
    3.4.5 Series-parallel MIMO model. Backpropagation method
    3.4.6 Parallel MIMO model. Backpropagation method
    3.4.7 Parallel MIMO model. Sensitivity method
    3.4.8 Parallel MIMO model. Backpropagation through time method
    3.4.9 Accuracy of gradient calculation with truncated BPTT
    3.4.10 Gradient calculation in the sequential mode
    3.4.11 Computational complexity
  3.5 Simulation example
  3.6 Combined steepest descent and least squares learning algorithms
  3.7 Summary
  3.8 Appendix 3.1. Gradient derivation of truncated BPTT. SISO Hammerstein models
  3.9 Appendix 3.2. Gradient derivation of truncated BPTT. MIMO Hammerstein models
  3.10 Appendix 3.3. Proof of Theorem 3.1
  3.11 Appendix 3.4. Proof of Theorem 3.2
  3.12 Appendix 3.5. Proof of Theorem 3.3
  3.13 Appendix 3.6. Proof of Theorem 3.4

4 Polynomial Wiener models
  4.1 Least squares approach to the identification of Wiener systems
    4.1.1 Identification error
    4.1.2 Nonlinear characteristic with the linear term
    4.1.3 Nonlinear characteristic without the linear term
    4.1.4 Asymptotic bias error of the LS estimator
    4.1.5 Instrumental variables method
    4.1.6 Simulation example. Nonlinear characteristic with the linear term
    4.1.7 Simulation example. Nonlinear characteristic without the linear term
  4.2 Identification of Wiener systems with the prediction error method
    4.2.1 Polynomial Wiener model
    4.2.2 Recursive prediction error method
    4.2.3 Gradient calculation
    4.2.4 Pneumatic valve simulation example
  4.3 Pseudolinear regression method
    4.3.1 Pseudolinear-in-parameters polynomial Wiener model
    4.3.2 Pseudolinear regression identification method
    4.3.3 Simulation example
  4.4 Summary

5 Polynomial Hammerstein models
  5.1 Noniterative least squares identification of Hammerstein systems
  5.2 Iterative least squares identification of Hammerstein systems
  5.3 Identification of Hammerstein systems in the presence of correlated noise
  5.4 Identification of Hammerstein systems with the Laguerre function expansion
  5.5 Prediction error method
  5.6 Identification of MISO systems with the pseudolinear regression method
  5.7 Identification of systems with two-segment nonlinearities
  5.8 Summary

6 Applications
  6.1 General review of applications
  6.2 Fault detection and isolation with Wiener and Hammerstein models
    6.2.1 Definitions of residuals
    6.2.2 Hammerstein system. Parameter estimation of the residual equation
    6.2.3 Wiener system. Parameter estimation of the residual equation
  6.3 Sugar evaporator. Identification of the nominal model of steam pressure dynamics
    6.3.1 Theoretical model
    6.3.2 Experimental models of steam pressure dynamics
    6.3.3 Estimation results
  6.4 Summary

References
Index
Symbols and notation

$a_k$: kth parameter of the polynomial $A(q^{-1})$
$\hat{a}_k$: kth parameter of the polynomial $\hat{A}(q^{-1})$
$A(q^{-1})$: denominator polynomial of the system pulse transfer function
$\hat{A}(q^{-1})$: denominator polynomial of the model pulse transfer function
$\mathbf{A}(q^{-1})$: denominator polynomial of the faulty system pulse transfer function
$\mathbf{A}$: state space matrix, $(nx \times nx)$
$\hat{\mathbf{A}}^{(m)}$: mth parameter matrix of the MIMO dynamic model. The MIMO Wiener model, $(ns \times ns)$; the MIMO Hammerstein model, $(ny \times ny)$
$b_k$: kth parameter of the polynomial $B(q^{-1})$
$\hat{b}_k$: kth parameter of the polynomial $\hat{B}(q^{-1})$
$B(q^{-1})$: numerator polynomial of the system pulse transfer function
$\hat{B}(q^{-1})$: numerator polynomial of the model pulse transfer function
$\mathbf{B}(q^{-1})$: numerator polynomial of the faulty system pulse transfer function
$\mathbf{B}$: control matrix, $(nx \times nu)$
$\hat{\mathbf{B}}^{(m)}$: mth parameter matrix of the MIMO dynamic model. The MIMO Wiener model, $(ns \times nu)$; the MIMO Hammerstein model, $(ny \times nf)$
$\mathbf{C}$: observation matrix, $(ny \times nx)$
$\mathbf{D}$: matrix describing the effect of inputs on outputs, $(ny \times nu)$
$e(n)$: identification error (one-step-ahead prediction error)
$E$: mathematical expectation
$f(\cdot)$: nonlinear function of the system
$\hat{f}(\cdot)$: nonlinear function of the model
$\hat{f}_k(\cdot)$: kth nonlinear function of the MIMO nonlinear element model
$\hat{\mathbf{f}}(\cdot)$: nonlinear function of the MIMO nonlinear element model
$f^{-1}(\cdot)$: inverse nonlinear function of the system
$\hat{f}^{-1}(\cdot)$: inverse nonlinear function of the model
$\hat{g}_k(\cdot)$: kth nonlinear function of the MIMO inverse nonlinear element model
$\hat{\mathbf{g}}(\cdot)$: nonlinear function of the MIMO inverse nonlinear element model
$h(n)$: system impulse response
$h_1(n)$: impulse response of the sensitivity model
$h_2(n)$: impulse response of the linear dynamic model
$H_1(q^{-1})$: pulse transfer function of sensitivity models
$H_2(q^{-1})$: pulse transfer function of the linear dynamic model
$J$: global error function
$J(n)$: local error function
$K$: number of unfolded time steps
$m_f$: expected value of $\hat{f}(u(n), \mathbf{w})$
$m_{w_c}$: expected value of $\partial\hat{f}(u(n), \mathbf{w})/\partial w_c$
$M$: number of nonlinear nodes
$n$: discrete time
$na$: order of the polynomials $A(q^{-1})$ and $\hat{A}(q^{-1})$
$nb$: order of the polynomials $B(q^{-1})$ and $\hat{B}(q^{-1})$
$nf$: number of outputs of MIMO Hammerstein nonlinear element model
$ns$: number of outputs of MIMO Wiener linear dynamic model
$nu$: number of system (model) inputs
$ny$: number of system (model) outputs
$N$: number of input-output measurements
$q^{-1}$: backward shift operator
$\mathbb{R}^d$: Euclidean d-dimensional space
$s(n)$: output of the linear dynamic part of Wiener system
$\hat{s}_k(n)$: kth output of the linear dynamic part of MIMO Wiener model
$\hat{s}(n)$: output of the linear dynamic part of Wiener model
$\hat{\mathbf{s}}(n)$: output of the linear dynamic part of MIMO Wiener model
$u(n)$: system input
$\mathbf{u}(n)$: MIMO system input
var: variance
$v_{ji}^{(1)}$: ith weight of the jth hidden layer node of the inverse nonlinear element model
$v_{kj}^{(2)}$: jth weight of the kth output node of the inverse nonlinear element model
$\hat{v}(n)$: output of the nonlinear element part of Hammerstein model
$\mathbf{v}$: weight vector of the inverse nonlinear element model
$\mathbf{v}_k$: kth path weight vector of the MIMO inverse nonlinear element model
$\mathbf{V}$: weight vector of the MIMO inverse nonlinear element model
$w_{ji}^{(1)}$: ith weight of the jth hidden layer node of the nonlinear element model
$w_{kj}^{(2)}$: jth weight of the kth output node of the nonlinear element model
$\mathbf{w}$: weight vector of the nonlinear element model
$\mathbf{w}_k$: kth path weight vector of the MIMO nonlinear element model
$\mathbf{W}$: weight vector of the MIMO nonlinear element model
$x_j(n)$: activation of the jth nonlinear node
$\mathbf{x}(n)$: regression vector; system state
$\hat{\mathbf{x}}(n)$: model state
$y(n)$: system output
$\hat{y}(n)$: model output
$y_k(n)$: kth output of the MIMO Wiener (Hammerstein) system
$\hat{y}_k(n)$: kth output of the MIMO Wiener (Hammerstein) model
$\hat{y}(n|n-1)$: one-step-ahead predictor of $y(n)$
$\mathbf{y}(n)$: MIMO system output
$\hat{\mathbf{y}}(n)$: MIMO model output
$z_j(n)$: activation of the jth nonlinear node of the inverse nonlinear element model
$\mathbf{z}(n)$: instrumental variables vector
$\hat{\gamma}_k$: kth parameter of the polynomial of the inverse nonlinear element model
$\gamma_k$: kth parameter of the polynomial of the inverse nonlinear element
$\Delta A(q^{-1})$: change in the pulse transfer function denominator of the linear dynamic system
$\Delta f(\cdot)$: change in the nonlinear function of the nonlinear element
$\Delta f^{-1}(\cdot)$: change in the nonlinear function of the inverse nonlinear element
$\Delta\hat{s}_{\hat{a}_k}(n)$: computation error of $\partial\hat{s}(n)/\partial\hat{a}_k$
$\Delta\hat{s}_{\hat{b}_k}(n)$: computation error of $\partial\hat{s}(n)/\partial\hat{b}_k$
$\Delta\hat{y}_{\hat{a}_k}(n)$: computation error of $\partial\hat{y}(n)/\partial\hat{a}_k$
$\Delta\hat{y}_{\hat{b}_k}(n)$: computation error of $\partial\hat{y}(n)/\partial\hat{b}_k$
$\Delta\hat{y}_{w_c}(n)$: computation error of $\partial\hat{y}(n)/\partial w_c$
(n): discrete white noise disturbance
$\varepsilon(n)$: additive system output disturbance
$\eta$: learning rate
$\theta$, $\hat{\theta}$: parameter vector
$\lambda$: exponential forgetting factor
$\mu_k$: kth parameter of the polynomial of the nonlinear element
$\hat{\mu}_k$: kth parameter of the polynomial of the nonlinear element model
$\xi_{a_k}(n)$: calculation accuracy degree of $\partial\hat{s}(n)/\partial\hat{a}_k$ (the Wiener model); $\partial\hat{y}(n)/\partial\hat{a}_k$ (the Hammerstein model)
$\xi_{b_k}(n)$: calculation accuracy degree of $\partial\hat{s}(n)/\partial\hat{b}_k$ (the Wiener model); $\partial\hat{y}(n)/\partial\hat{b}_k$ (the Hammerstein model)
$\xi_{w_c}(n)$: calculation accuracy degree of $\partial\hat{y}(n)/\partial w_c$
$\sigma^2$: variance of $u(n)$
$\sigma_f^2$: variance of $\hat{f}(u(n), \mathbf{w})$
$\sigma_{w_c}^2$: variance of $\partial\hat{f}(u(n), \mathbf{w})/\partial w_c$
$\varphi(\cdot)$: nonlinear activation function
$\psi(n)$: gradient of the Wiener model output
List of abbreviations

i.i.d.: independent and identically distributed
r.h.s.: right hand side
w.r.t.: with respect to
AR: autoregressive
ARMA: autoregressive moving average
ARMAX: autoregressive moving average with exogenous input
ARX: autoregressive with exogenous input
BJ: Box-Jenkins
BP: back propagation
BPP: back propagation for parallel models
BPTT: back propagation through time
BPS: back propagation for series-parallel models
CSTR: continuous stirred tank reactor
DLOP: discrete Legendre orthogonal polynomial
FDI: fault detection and isolation
ELS: extended least squares
FIR: finite impulse response model
IMC: internal model control
IV: instrumental variables
MA: moving average model
MIMO: multiple-input multiple-output
MISO: multiple-input single-output
MLP: multilayer perceptron
MPC: model-based predictive control
MSE: mean square error
NAR: nonlinear autoregressive
NARMA: nonlinear autoregressive moving average
NARMAX: nonlinear autoregressive moving average with exogenous input
NARX: nonlinear autoregressive with exogenous input
NBJ: nonlinear Box-Jenkins
NFIR: nonlinear finite impulse response
NOBF: nonlinear orthonormal basis function
NOE: nonlinear output error
NMA: nonlinear moving average
OBFP: orthogonal basis with fixed poles
OE: output error
PE: prediction error
PI: proportional plus integral
PID: proportional plus integral plus derivative
PRBS: pseudorandom binary sequence
RIV: recursive instrumental variables
RLS: recursive least squares
RELS: recursive extended least squares
RMS: root mean square
RPE: recursive prediction error
RPLR: recursive pseudolinear regression
SIMO: single-input multiple-output
SISO: single-input single-output
SM: sensitivity method
WMPC: Wiener model-based predictive control
1 Introduction
A. Janczak: Identification of Nonlinear Systems, LNCIS 310, pp. 1–30, 2005. © Springer-Verlag Berlin Heidelberg 2005

The class of block-oriented nonlinear models includes complex models which are composed of linear dynamic systems and nonlinear static elements. Wiener and Hammerstein models are the best known and most widely implemented members of this class. A model is called a Wiener model if the linear dynamic block (element) precedes the nonlinear static one. In the Hammerstein model, the connection order is reversed. Models of nonlinear static elements can be realized in different forms such as polynomials, splines, basis functions, wavelets, neural networks, look-up tables, and fuzzy models. Impulse response models, pulse transfer function models, and state space models are common representations of linear dynamic systems. Depending on the realization forms of both of these elements, various structures of Wiener and Hammerstein models can be obtained. To evaluate and compare them, the following properties are commonly taken into account: approximation accuracy, extrapolation behavior, interpolation behavior, smoothness, sensitivity to noise, available parameter optimization methods, and available structure optimization methods.

From the approximation theorem of Weierstrass, it follows that any continuous function defined on the interval [a, b] can be approximated arbitrarily closely by a polynomial. Polynomial models are widely used as models of nonlinear elements. A great advantage of polynomial models is effective parameter optimization, which can be performed off-line with the least squares method or on-line with its recursive version. Moreover, structure selection can also be performed effectively with the orthogonal least squares algorithm, in which the set of regressors is transformed into a set of orthogonal basis vectors. With this algorithm, it is possible to calculate the individual contribution of each basis vector to the output variance.

Polynomials have some fundamental disadvantages as well. First of all, although any continuous function can be approximated arbitrarily closely by a polynomial, some nonlinear functions require a very high polynomial order. In multi-input multi-output models, the number of parameters grows strongly when the number of inputs increases. This increases the model uncertainty and may make the optimization problem numerically ill-conditioned. The other disadvantages of polynomials are their oscillatory interpolation and extrapolation properties. Therefore, in practice, the application of polynomial models is recommended only in some specific cases where the system structure can be assumed to be approximately of the polynomial type.

Neural network models of the multilayer perceptron architecture are an alternative to polynomial models. Multilayer perceptrons are feedforward neural networks containing one or more hidden layers of nonlinear elements, but one hidden layer is the most common choice in practice. The application of multilayer perceptrons in approximation problems is justified by their universal approximation property, which states that a single hidden layer is sufficient to uniformly approximate any continuous function with support in a unit hypercube [31]. Owing to their advantages, such as high approximation accuracy, lower numbers of nodes and weights in comparison with other model architectures, and the capability to generate a wide variety of functions, multilayer perceptrons are the most frequently applied neural networks. Unlike polynomials, multilayer perceptron models do not suffer from oscillatory interpolation and extrapolation behavior. They reveal a tendency to monotonic interpolation. The extrapolation behavior is also smooth, but over a long range the network response tends to a constant value owing to the saturation of the commonly applied sigmoidal functions. Multilayer perceptron models are very useful for high dimensional problems as well. This comes from the fact that the number of weights in a multilayer perceptron model is proportional to the number of inputs. In contrast to polynomial models, multilayer perceptron models can be successfully used to represent not only systems described by polynomials but also systems described by infinite power series expansions.
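The contrast between polynomial and saturating models outside the range of the data can be illustrated with a small numerical sketch. The code below fits a polynomial to samples of a static nonlinearity by ordinary least squares and then evaluates it outside the fitting interval. The target function tanh(2u), the polynomial order 9, the interval [-1, 1], and all function names are illustrative assumptions, not taken from the book.

```python
import numpy as np

def fit_polynomial_ls(u, y, order):
    """Least squares estimate of polynomial coefficients (lowest power first)."""
    # Regressor matrix with columns 1, u, u^2, ..., u^order.
    phi = np.vander(u, order + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return coeffs

def eval_polynomial(coeffs, u):
    """Evaluate the fitted polynomial at the points in u."""
    return np.vander(np.atleast_1d(u), len(coeffs), increasing=True) @ coeffs

def f(u):
    # Hypothetical static nonlinearity (an assumption): a saturating characteristic.
    return np.tanh(2.0 * u)

# Fit on samples from the interval [-1, 1].
u_train = np.linspace(-1.0, 1.0, 201)
coeffs = fit_polynomial_ls(u_train, f(u_train), order=9)

# Largest approximation error inside the fitting interval.
err_inside = np.max(np.abs(eval_polynomial(coeffs, u_train) - f(u_train)))

# Error at a point well outside the interval, where tanh has saturated
# but the high-order polynomial terms dominate.
u_out = np.array([3.0])
err_outside = np.abs(eval_polynomial(coeffs, u_out) - f(u_out))[0]

print(f"max error on [-1, 1]: {err_inside:.2e}")
print(f"error at u = 3:       {err_outside:.2e}")
```

In this setting the error inside the interval is small, while the error at u = 3 is large because the polynomial grows without bound where the saturating function stays close to 1.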
Summarizing, as they offer the user some different interesting features, multilayer perceptron models are complementary to polynomial ones. Obviously, multilayer perceptron models are not free of drawbacks. The most significant of them are the use of local optimization methods to update the weights, and a risk of getting trapped in a shallow local minimum. This often makes it necessary to repeat the training with different weight initializations. Moreover, trial and error techniques have to be used for some parameters such as initial weights, learning rates, etc. Also, the available model structure optimization methods are rather computationally intensive. Now, with the development of the identification of Wiener and Hammerstein systems, it is possible to systematize different techniques and to present them in a unified framework. We attempt such a presentation in this monograph by reviewing the existing approaches along with a presentation of original research papers and some results that have not been published yet. Neural network models of Wiener and Hammerstein systems considered in Chapters 2 and 3 are composed of a multilayer perceptron model of the nonlinear element and one or more linear nodes with tapped delay lines constituting a model of the linear dynamic system. Series-parallel Wiener models also contain another multilayer perceptron model of the inverse nonlinear element.
Two basic configurations of models, i.e., series-parallel and parallel models, are discussed. In series-parallel models, the gradient can be calculated with the well-known backpropagation method. In parallel models, only a crude approximation of the gradient can be obtained with the backpropagation method. Therefore, two other methods, referred to as the sensitivity method and the backpropagation through time method, which provide the exact value of the gradient or its more accurate approximation, should be taken into account. All these gradient calculation methods are derived in a unified manner, both for the SISO and MIMO cases. Computational complexity of the methods is analyzed and expressed in terms of polynomial orders and the number of unfolded time steps in the case of the truncated backpropagation through time method. The accuracy of gradient calculation with the truncated backpropagation through time method is analyzed as well. It is shown that the accuracy of gradient calculation depends on the numbers of discrete time steps necessary for the impulse responses of sensitivity models to decrease to negligibly small values. Based on this result, adaptive procedures for adjusting the number of unfolded time steps are proposed to meet the specified degrees of accuracy. The original contribution of the book comprises new approaches to the identification of Wiener and Hammerstein systems and some related theoretical results. For both SISO and MIMO neural network models, the following gradient calculation methods are derived and analyzed:
– Backpropagation for series-parallel Wiener models.
– Backpropagation for parallel Wiener models.
– Sensitivity method for parallel Wiener models.
– Truncated backpropagation through time for parallel Wiener models.
– Backpropagation for series-parallel Hammerstein models.
– Backpropagation for parallel Hammerstein models.
– Sensitivity method for parallel Hammerstein models.
– Truncated backpropagation through time for parallel Hammerstein models.
Having the rules for gradient calculation derived, various gradient-based learning algorithms can be implemented easily. Sequential versions of the steepest descent, prediction error, and combined steepest descent and recursive least squares algorithms are discussed in detail in Chapters 2 and 3. Another group of identification methods, derived and discussed in the book, uses polynomial representation of Wiener systems:
– Least squares method for polynomial Wiener systems with the linear term.
– Least squares method for polynomial Wiener systems without the linear term.
– Combined least squares and instrumental variables method for polynomial Wiener systems with the linear term.
– Combined least squares and instrumental variables method for polynomial Wiener systems without the linear term.
– Recursive prediction error method.
– Recursive pseudolinear regression method.
All these polynomial-based identification methods employ a pulse transfer function representation of the system dynamics and polynomial models of the nonlinear element or its inverse. Although polynomial models of both Wiener and Hammerstein systems can be expressed in linear-in-parameters forms, such transformations lead to parameter redundancy, as transformed models have a higher number of parameters than the original ones. The number of parameters of transformed models grows strongly with increasing model order. As a result, the variance error increases and some numerical problems may occur as well. Moreover, as shown in Chapter 4, to transform a Wiener system into the linear-in-parameters form, the nonlinearity has to be invertible, i.e., the nonlinear mapping must be strictly monotonic. It is also shown that the least squares parameter estimates are inconsistent. To obtain consistent parameter estimates, a combined least-squares and instrumental-variables method is proposed. The restrictive assumption of invertibility of the nonlinear element is no longer necessary in both the prediction error method and the pseudolinear regression method. In this case, however, the Wiener model is nonlinear in the parameters and the parameter estimation becomes a nonlinear optimization task. Wiener and Hammerstein models have found numerous industrial applications for system modelling, control, fault detection and isolation. Chapter 6 gives a brief review of applications which includes the following systems and processes:
– pH neutralization process,
– heat exchangers,
– distillation columns,
– chromatographic separation process,
– polymerization reactor,
– quartz microbalance polymer-coated sensors,
– hydraulic plants,
– electro-hydraulic servo-systems,
– pneumatic valves,
– pump-valve systems,
– electrooptical dynamic systems,
– high power amplifiers in satellite communication channels,
– loudspeakers,
– active noise cancellation systems,
– charging process in diesel engines,
– iron oxide pellet cooling process,
– sugar evaporator.
Wiener and Hammerstein models reveal the capability of describing a wide class of different systems and apart from industrial examples, there are many other applications in biology and medicine.
1.1 Models of dynamic systems

This section gives a brief review of discrete-time models of time-invariant dynamic systems. By way of introduction, the section starts with linear model structures and shows nonlinear models as generalizations of linear ones.

1.1.1 Linear models

According to Ljung [112], a time-invariant model can be specified by the impulse response $h(n)$, the spectrum $\Phi(\omega)$ of the additive disturbance $H(q^{-1})\varepsilon(n)$, and the probability density function $f_\varepsilon(\cdot)$ of the disturbance $\varepsilon(n)$. The output $y(n)$ of a discrete-time linear model excited by an input $u(n)$ and disturbed additively by $\varepsilon(n)$ is

$$y(n) = G(q^{-1})u(n) + H(q^{-1})\varepsilon(n), \tag{1.1}$$

where $G(q^{-1})$ is called the input pulse transfer function, $H(q^{-1})$ is the noise pulse transfer function, and $q^{-1}$ denotes the backward shift operator. Unless stated more precisely, we assume $\varepsilon(n)$ to be a zero-mean stationary independent stochastic process. Representing $G(q^{-1})$ and $H(q^{-1})$ as rational functions leads to the general model structure of the following form [112, 149]:

$$A(q^{-1})y(n) = \frac{B(q^{-1})}{F(q^{-1})}u(n) + \frac{C(q^{-1})}{D(q^{-1})}\varepsilon(n), \tag{1.2}$$

where

$$A(q^{-1}) = 1 + a_1 q^{-1} + \cdots + a_{na} q^{-na}, \tag{1.3}$$

$$B(q^{-1}) = b_1 q^{-1} + \cdots + b_{nb} q^{-nb}, \tag{1.4}$$

$$C(q^{-1}) = 1 + c_1 q^{-1} + \cdots + c_{nc} q^{-nc}, \tag{1.5}$$

$$D(q^{-1}) = 1 + d_1 q^{-1} + \cdots + d_{nd} q^{-nd}, \tag{1.6}$$

$$F(q^{-1}) = 1 + f_1 q^{-1} + \cdots + f_{nf} q^{-nf}. \tag{1.7}$$
In practice, the structure (1.2) is usually too general. Depending on which of the polynomials (1.3) – (1.7) are used, 32 different model sets can be distinguished. A few commonly used structures, which belong to the general family of structures, are listed below.

Finite impulse response (FIR) structure. The choice of $A(q^{-1}) = C(q^{-1}) = D(q^{-1}) = F(q^{-1}) = 1$ results in the simplest model structure known as the finite impulse response model

$$y(n) = B(q^{-1})u(n) + \varepsilon(n). \tag{1.8}$$

The output of the model (1.8) is a weighted sum of $nb$ past inputs $u(n-1), \ldots, u(n-nb)$:

$$y(n) = b_1 u(n-1) + \cdots + b_{nb} u(n-nb) + \varepsilon(n). \tag{1.9}$$

The optimal one-step-ahead predictor, i.e., the predictor that minimizes the prediction error variance, is

$$\hat{y}(n|n-1) = B(q^{-1})u(n) = b_1 u(n-1) + \cdots + b_{nb} u(n-nb). \tag{1.10}$$

Introducing the parameter vector $\theta$,

$$\theta = \big[\, b_1 \; \ldots \; b_{nb} \,\big]^T, \tag{1.11}$$

and the regression vector $x(n)$,

$$x(n) = \big[\, u(n-1) \; \ldots \; u(n-nb) \,\big]^T, \tag{1.12}$$

(1.10) can equivalently be expressed in a regression form:

$$\hat{y}(n|n-1) = x^T(n)\theta. \tag{1.13}$$
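The regression form (1.13) makes the FIR model linear in its parameters, so ordinary least squares can be used to estimate $\theta$. The following is only a minimal sketch under assumptions that are not part of the text: the coefficient values in `b_true`, the white Gaussian input, the noise level, and the helper name `fir_regressors` are all hypothetical and chosen for illustration.

```python
import numpy as np

def fir_regressors(u, nb):
    """Stack x(n) = [u(n-1) ... u(n-nb)]^T as rows for n = nb, ..., N-1."""
    N = len(u)
    return np.array([[u[n - k] for k in range(1, nb + 1)] for n in range(nb, N)])

rng = np.random.default_rng(0)
b_true = np.array([0.5, 0.3, -0.2])      # hypothetical b1, b2, b3
nb = len(b_true)
u = rng.standard_normal(500)             # assumed white input signal

X = fir_regressors(u, nb)                # one regression vector per row, (1.12)
y = X @ b_true + 0.01 * rng.standard_normal(len(X))   # y(n) as in (1.9)

# Least squares estimate of theta in the regression form (1.13)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ theta_hat                   # one-step-ahead predictions, (1.13)
```

Because the regressors contain only past inputs, the same matrix `X` serves both for estimation and for prediction.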
The FIR model is able to approximate asymptotically stable dynamic systems quite well if their impulse responses decay reasonably fast.

Autoregressive (AR) structure. The AR model structure is defined with the choice of $B(q^{-1}) = 0$, and $C(q^{-1}) = D(q^{-1}) = F(q^{-1}) = 1$:

$$y(n) = \frac{1}{A(q^{-1})}\varepsilon(n). \tag{1.14}$$

In this case, the parameter vector $\theta$ and the regression vector $x(n)$ become

$$\theta = \big[\, a_1 \; \ldots \; a_{na} \,\big]^T, \tag{1.15}$$

$$x(n) = \big[\, -y(n-1) \; \ldots \; -y(n-na) \,\big]^T. \tag{1.16}$$
Moving average (MA) structure. The MA model structure corresponds to the choice of $B(q^{-1}) = 0$, and $A(q^{-1}) = D(q^{-1}) = F(q^{-1}) = 1$:

$$y(n) = C(q^{-1})\varepsilon(n). \tag{1.17}$$

The parameter vector $\theta$ and the regression vector $x(n)$ are

$$\theta = \big[\, c_1 \; \ldots \; c_{nc} \,\big]^T, \tag{1.18}$$

$$x(n) = \big[\, \varepsilon(n-1) \; \ldots \; \varepsilon(n-nc) \,\big]^T. \tag{1.19}$$
Autoregressive with exogenous input (ARX) structure. The ARX model structure can be obtained with the choice of $C(q^{-1}) = D(q^{-1}) = F(q^{-1}) = 1$:

$$y(n) = \frac{B(q^{-1})}{A(q^{-1})}u(n) + \frac{1}{A(q^{-1})}\varepsilon(n). \tag{1.20}$$

For the ARX model, the parameter vector $\theta$ and the regression vector $x(n)$ have the forms

$$\theta = \big[\, a_1 \; \ldots \; a_{na} \; b_1 \; \ldots \; b_{nb} \,\big]^T, \tag{1.21}$$

$$x(n) = \big[\, -y(n-1) \; \ldots \; -y(n-na) \;\; u(n-1) \; \ldots \; u(n-nb) \,\big]^T. \tag{1.22}$$
Autoregressive moving average (ARMA) structure. A combination of the autoregressive model and the moving average model results in the ARMA model. It can be obtained with the choice of $B(q^{-1}) = 0$ and $D(q^{-1}) = F(q^{-1}) = 1$:

$$y(n) = \frac{C(q^{-1})}{A(q^{-1})}\varepsilon(n). \tag{1.23}$$

In this case, the parameter vector $\theta$ and the regression vector $x(n)$ become

$$\theta = \big[\, a_1 \; \ldots \; a_{na} \; c_1 \; \ldots \; c_{nc} \,\big]^T, \tag{1.24}$$

$$x(n) = \big[\, -y(n-1) \; \ldots \; -y(n-na) \;\; \varepsilon(n-1) \; \ldots \; \varepsilon(n-nc) \,\big]^T. \tag{1.25}$$
Autoregressive moving average with exogenous input (ARMAX) structure. The ARMAX model is the most general structure of all those considered up to now as it contains all of them as special cases. To obtain the ARMAX model, we choose $D(q^{-1}) = F(q^{-1}) = 1$:

$$y(n) = \frac{B(q^{-1})}{A(q^{-1})}u(n) + \frac{C(q^{-1})}{A(q^{-1})}\varepsilon(n). \tag{1.26}$$

For the ARMAX model, the parameter vector $\theta$ and the regression vector $x(n)$ are defined as follows:

$$\theta = \big[\, a_1 \; \ldots \; a_{na} \; b_1 \; \ldots \; b_{nb} \; c_1 \; \ldots \; c_{nc} \,\big]^T, \tag{1.27}$$

$$x(n) = \big[\, -y(n-1) \; \ldots \; -y(n-na) \;\; u(n-1) \; \ldots \; u(n-nb) \;\; \varepsilon(n-1) \; \ldots \; \varepsilon(n-nc) \,\big]^T. \tag{1.28}$$
Output error (OE) structure. The OE model can be obtained if we choose $A(q^{-1}) = C(q^{-1}) = D(q^{-1}) = 1$:

$$y(n) = \frac{B(q^{-1})}{F(q^{-1})}u(n) + \varepsilon(n). \tag{1.29}$$

In this case, the parameter vector $\theta$ and the regression vector $x(n)$ are defined as

$$\theta = \big[\, f_1 \; \ldots \; f_{nf} \; b_1 \; \ldots \; b_{nb} \,\big]^T, \tag{1.30}$$

$$x(n) = \big[\, -\hat{y}(n-1) \; \ldots \; -\hat{y}(n-nf) \;\; u(n-1) \; \ldots \; u(n-nb) \,\big]^T, \tag{1.31}$$

where

$$\hat{y}(n) = \frac{B(q^{-1})}{F(q^{-1})}u(n). \tag{1.32}$$
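In contrast to the ARX case, the regression vector (1.31) contains past model outputs $\hat{y}(n-k)$ rather than measured outputs, so (1.32) has to be evaluated recursively. A minimal sketch of this recursion is given below; the function name `simulate_oe`, the zero initial conditions, and the coefficient values are assumptions made for illustration only.

```python
import numpy as np

def simulate_oe(u, f, b):
    """Recursive evaluation of (1.32):
    yhat(n) = -f1*yhat(n-1) - ... - f_nf*yhat(n-nf) + b1*u(n-1) + ... + b_nb*u(n-nb),
    with zero initial conditions (an assumption of this sketch)."""
    nf, nb = len(f), len(b)
    y_hat = np.zeros(len(u))
    for n in range(len(u)):
        acc = 0.0
        for k in range(1, nf + 1):
            if n - k >= 0:
                acc -= f[k - 1] * y_hat[n - k]   # feedback of past MODEL outputs
        for k in range(1, nb + 1):
            if n - k >= 0:
                acc += b[k - 1] * u[n - k]
        y_hat[n] = acc
    return y_hat

# Hypothetical first-order example: F = 1 - 0.5 q^-1, B = q^-1, unit step input
u = np.ones(50)
y_hat = simulate_oe(u, f=[-0.5], b=[1.0])
```

For this example the step response settles at $B(1)/F(1) = 1/0.5 = 2$, which gives a quick sanity check of the recursion.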
Box-Jenkins (BJ) structure. A structure which is a more general development of the OE model is called the Box-Jenkins model. To obtain the BJ model, the choice of $A(q^{-1}) = 1$ should be made:

$$y(n) = \frac{B(q^{-1})}{F(q^{-1})}u(n) + \frac{C(q^{-1})}{D(q^{-1})}\varepsilon(n). \tag{1.33}$$

The one-step-ahead predictor for the BJ model has the form [112]:

$$\hat{y}(n|n-1) = \frac{D(q^{-1})}{C(q^{-1})}\frac{B(q^{-1})}{F(q^{-1})}u(n) + \frac{C(q^{-1}) - D(q^{-1})}{C(q^{-1})}y(n). \tag{1.34}$$
1.1.2 Nonlinear models

Nonlinear counterparts of linear model structures can be defined assuming that there is a nonlinear relationship between the actual system output and past system inputs, the past system or model outputs, and the actual and past additive disturbances. For nonlinear input-output models we have

$$y(n) = g\big(x(n), \theta\big) + \varepsilon(n) \tag{1.35}$$

or, in the predictor form,

$$\hat{y}(n|n-1) = g\big(x(n), \theta\big). \tag{1.36}$$

Depending on the form of $x(n)$, nonlinear model structures known as NFIR, NAR, NMA, NARMA, NARX, NARMAX, NOE, NBJ can be defined. The function $g(\cdot)$ is a nonlinear mapping, which for any given $\theta$ maps $\mathbb{R}^d$ to $\mathbb{R}^p$, where $d$ is the number of regressors (elements of the vector $x(n)$) and $p$ is the number of model outputs. In a parametric approach, $g(\cdot)$ is expressed by a function expansion:

$$g\big(x(n), \theta\big) = \sum_{m=1}^{ng} \beta_m g_m\big(x(n)\big), \tag{1.37}$$

where $g_m(\cdot)$ is called the basis function, and $\theta = [\beta_1 \ldots \beta_{ng}]^T$. There are different forms of function expansions, used for nonlinear systems representation, which are based on polynomials, Volterra kernels, Fourier series, piecewise constant functions, radial basis functions, wavelets, kernel estimators, neural networks, and fuzzy models [112].

Discrete-time Volterra models. If the function $g(\cdot)$ is analytic, then the system response can be represented by the Volterra series [109]:
$$y(n) = \sum_{i_1=1}^{n} h_1(i_1)u(n-i_1) + \sum_{i_1=1}^{n}\sum_{i_2=1}^{n} h_2(i_1, i_2)u(n-i_1)u(n-i_2) + \cdots + \varepsilon(n), \tag{1.38}$$

where the kernel functions $h_j(i_1, \ldots, i_j)$, $j = 1, 2, \ldots$, describe system dynamics. Two basic problems associated with practical application of Volterra series are the difficulty of measuring Volterra kernel functions and the convergence of Volterra series [145]. The Volterra series representation is really a Taylor series with memory. Therefore, the problem of convergence is the same as that of Taylor series representation of a function, i.e., the Volterra series representation of a system may converge for only a limited range of the system input amplitude. To circumvent this problem, Wiener formed a new set of functionals from the Volterra functionals, the G-functionals, which have an orthogonal property when their input is a Gaussian process. Extensive studies of Volterra models show their successful application to low order systems. The reasons for this system order limitation are practical difficulties in extending kernel estimation to orders higher than the third [117].

Kolmogorov-Gabor models. The application of generalized polynomial models to the representation of nonlinear dynamic systems

$$y(n) = f\big(u(n-1), \ldots, u(n-nb), y(n-1), \ldots, y(n-na)\big) + \varepsilon(n) \tag{1.39}$$
results in Kolmogorov-Gabor models:

$$y(n) = a_0 + \sum_{i_1=1}^{nab} a_{i_1} x_{i_1}(n) + \sum_{i_1=1}^{nab}\sum_{i_2=1}^{i_1} a_{i_1 i_2} x_{i_1}(n)x_{i_2}(n) + \cdots + \sum_{i_1=1}^{nab}\sum_{i_2=1}^{i_1}\cdots\sum_{i_l=1}^{i_{l-1}} a_{i_1 i_2 \ldots i_l} x_{i_1}(n)x_{i_2}(n)\ldots x_{i_l}(n) + \varepsilon(n), \tag{1.40}$$

where $nab = na + nb$,

$$x_j(n) = \begin{cases} u(n-j) & \text{if } 1 \le j \le nb, \\ y(n-j+nb) & \text{if } nb < j \le nab. \end{cases} \tag{1.41}$$

The number of parameters $M$ in (1.40) increases strongly as $nab$ or $l$ grow [123]:

$$M = \frac{(l+nab)!}{l!\,nab!}. \tag{1.42}$$

The large model complexity of the Kolmogorov-Gabor model restricts its practical applicability and leads to reduced polynomial models containing only selected terms of (1.40). Note that (1.40) has the NARX structure. The other nonlinear structures such as NFIR, NAR, NMA, NARMA, NARMAX, NBJ can be obtained via a relevant redefinition of $x_j(n)$.
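To get a feel for how quickly (1.42) grows, the count can be evaluated for a few orders. This is only an illustrative calculation; the function name `kolmogorov_gabor_params` and the chosen values of $nab$ and $l$ are hypothetical.

```python
from math import comb, factorial

def kolmogorov_gabor_params(nab, l):
    """Number of parameters M = (l + nab)! / (l! * nab!) from (1.42)."""
    return comb(l + nab, l)

# Hypothetical model orders, chosen only to illustrate the growth of M
for nab, l in [(4, 2), (4, 3), (10, 3)]:
    print(f"nab={nab}, l={l}: M={kolmogorov_gabor_params(nab, l)}")
```

With $nab = 4$ the count rises from 15 to 35 when the polynomial degree goes from 2 to 3, and with $nab = 10$ and $l = 3$ it is already 286.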
Nonlinear orthonormal basis function models (NOBF). The main disadvantage of both the FIR and NFIR models is that many parameters may be needed to describe a system adequately if its impulse response decays slowly. This disadvantage can be reduced by introducing linear filters which incorporate prior knowledge about process dynamics. Orthonormal Laguerre and Kautz filters, which have orthonormal impulse responses, are commonly applied. The Laguerre filter can be described with only one parameter $\alpha$, a real pole:

$$L_k(q) = \frac{1}{q-\alpha}\left(\frac{1-\alpha q}{q-\alpha}\right)^{k-1}. \tag{1.43}$$

Therefore, this kind of filter is suitable for modelling systems with well-damped behavior. For systems with resonant behavior, Kautz filters are suitable as they have a complex pole pair. In practice, estimates of a system dominant pole or dominant conjugate complex poles are used. The regression vector for the NOBF model has the form

$$x(n) = \big[\, L_1(q) \;\; L_2(q) \; \ldots \; L_r(q) \,\big]^T. \tag{1.44}$$
1.1.3 Series-parallel and parallel models

Models of dynamic systems can be used in two basic configurations: a prediction configuration or a simulation configuration. The prediction configuration permits the prediction of future system outputs based on past system inputs and outputs. Examples of the prediction configuration are the FIR, AR, ARX models in the linear case, and the NFIR, NOBF, NAR, NARX models in the nonlinear case. In the simulation configuration, future system outputs are also predicted but only on the basis of past system inputs without employing past system outputs. The OE, BJ, and NOE, NBJ models are examples of the simulation configuration. In the system identification literature, the one-step prediction configuration is called a series-parallel model and the simulation configuration is called a parallel model [121, 123]. Parallel models, being dynamic systems themselves, are also called recursive models as their mathematical description has the form of difference equations. In contrast to parallel models, series-parallel models are described by algebraic equations. In the context of neural networks, these models are called feedforward ones. In the identification process, model parameters are calculated in such a way as to minimize a chosen cost function dependent on the identification error $e(n)$. The two model configurations above entail two different definitions of the identification error. For the series-parallel model, the identification error is called the equation error; for the parallel model, it is called the output error.

1.1.4 State space models

The extension of a linear state space model to the nonlinear case results in the following nonlinear state space model:
$$x(n+1) = h\big(x(n), u(n)\big) + \upsilon(n), \tag{1.45}$$

$$y(n) = g\big(x(n)\big) + \varepsilon(n), \tag{1.46}$$

where $x(n) \in \mathbb{R}^{nx}$, $u(n) \in \mathbb{R}^{nu}$, $\upsilon(n) \in \mathbb{R}^{nx}$, $y(n) \in \mathbb{R}^{ny}$, $\varepsilon(n) \in \mathbb{R}^{ny}$. If the system state $x(n)$ is available for measurement, the identification is equivalent to the determination of the vector functions $h(\cdot)$ and $g(\cdot)$:

$$\hat{x}(n+1) = h\big(x(n), u(n)\big), \tag{1.47}$$

$$\hat{y}(n) = g\big(x(n)\big). \tag{1.48}$$
The model (1.47), (1.48) is of the series-parallel type. In practice, at least some of the state variables are unknown and they have to be estimated. If no state variables are measured, simultaneous estimation of system states and the determination of the functions $h(\cdot)$ and $g(\cdot)$ is required. The high complexity of such a task is the main reason for the dominance of the much simpler input-output approaches [123].

1.1.5 Nonlinear models composed of sub-models

Except for artificial “phenomena” created by mathematical equations, models are always different from modelled phenomena [131]. Mathematical models are simply equations which are derived on the basis of first principles and/or experimental data, and they are different from the underlying phenomena of a nonmathematical nature. In general, models can be characterized by their time resolution and granularity, where granularity specifies the level of detail included in the model. Up to now, we have considered models which are classified as black box models [148]. These models are based on the measurement data only, i.e., their parameters and structure are determined through experiments. Typically, the parameters of black box models have no interpretation in terms of physical, chemical, biological, economical, and other laws. Another class of models are white box models, as they are completely derived from the underlying principal laws. Even if some of the parameters are estimated from data, a model is included in this class as well. In contrast to black box models, the parameters of white box models have a clear interpretation [94]. Gray box models combine features of the white and black box models. They are derived both on the basis of the underlying laws and the measurement data. For example, the system structure can be determined utilizing a priori knowledge about the system nature, and system parameters can be estimated from data.
Nonlinear models of a given internal structure composed of sub-models, also referred to as block-oriented models, are members of the class of gray box models. Wiener and Hammerstein models, shown in Figs 1.1 – 1.2, are two well-known examples of such models, composed of sub-models [61, 108, 109]. Both of them contain a linear dynamic system and a nonlinear static element in a cascade. While in the Wiener system the nonlinear element follows the linear dynamic system, in the Hammerstein model, both of these sub-models are connected in reverse order.

Fig. 1.1. SISO Wiener system (block diagram: u(n), linear dynamic system, s(n), nonlinear static element, y(n), with additive disturbance ε(n))

Fig. 1.2. SISO Hammerstein system (block diagram: u(n), nonlinear static element, v(n), linear dynamic system, y(n), with additive disturbance ε(n))

The Wiener model is given by

$$y(n) = f\left(\frac{B(q^{-1})}{A(q^{-1})}u(n)\right) + \varepsilon(n), \tag{1.49}$$

where $f(\cdot)$ denotes the nonlinear function describing the nonlinear element, and $B(q^{-1})/A(q^{-1})$ is the pulse transfer function of the linear dynamic system. With the same notation, the Hammerstein model can be expressed as

$$y(n) = \frac{B(q^{-1})}{A(q^{-1})}f\big(u(n)\big) + \varepsilon(n). \tag{1.50}$$
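The difference between (1.49) and (1.50) is only the order in which the two blocks are applied, which a short noise-free simulation makes concrete. The sketch below assumes a hypothetical first-order linear block, a `tanh` nonlinearity, a constant input, and zero initial conditions; none of these choices comes from the text.

```python
import numpy as np

def linear_dynamics(u, a, b):
    """Noise-free output of B(q^-1)/A(q^-1):
    s(n) = -sum_k a_k s(n-k) + sum_k b_k u(n-k), zero initial conditions."""
    s = np.zeros(len(u))
    for n in range(len(u)):
        ar = sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        ma = sum(b[k - 1] * u[n - k] for k in range(1, len(b) + 1) if n - k >= 0)
        s[n] = ma - ar
    return s

def wiener_output(u, a, b, f):
    """(1.49) without noise: nonlinearity applied AFTER the linear dynamics."""
    return f(linear_dynamics(u, a, b))

def hammerstein_output(u, a, b, f):
    """(1.50) without noise: nonlinearity applied BEFORE the linear dynamics."""
    return linear_dynamics(f(u), a, b)

# Hypothetical example: A = 1 - 0.5 q^-1, B = q^-1, f = tanh, constant input 2
u = 2.0 * np.ones(60)
a, b = [-0.5], [1.0]
y_w = wiener_output(u, a, b, np.tanh)       # settles near tanh(4)
y_h = hammerstein_output(u, a, b, np.tanh)  # settles near 2 * tanh(2)
```

Although both models share the same blocks, their steady-state outputs differ: the Wiener model saturates the filtered signal, whereas the Hammerstein model filters the saturated input.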
The multi-input multi-output (MIMO) Wiener model can be described by the following equation:

$$y(n) = f\big(s(n)\big) + \varepsilon(n), \tag{1.51}$$

where $f(\cdot): \mathbb{R}^{ns} \to \mathbb{R}^{ny}$ is a nonzero vector function, $y(n) \in \mathbb{R}^{ny}$, $\varepsilon(n) \in \mathbb{R}^{ny}$. The output $s(n)$, $s(n) \in \mathbb{R}^{ns}$, of the MIMO linear dynamic system is

$$s(n) = -\sum_{m=1}^{na} A^{(m)} s(n-m) + \sum_{m=1}^{nb} B^{(m)} u(n-m), \tag{1.52}$$
where $u(n) \in \mathbb{R}^{nu}$, $A^{(m)} \in \mathbb{R}^{ns \times ns}$, $B^{(m)} \in \mathbb{R}^{ns \times nu}$. In a similar way, the MIMO Hammerstein model can be obtained by connecting a MIMO linear dynamic system with a MIMO static nonlinear element (Fig. 1.4).

Fig. 1.3. MIMO Wiener system (block diagram: inputs u1(n), ..., unu(n), linear dynamic system, s1(n), ..., sns(n), nonlinear static element, outputs y1(n), ..., yny(n), with additive disturbances ε1(n), ..., εny(n))

Fig. 1.4. MIMO Hammerstein system (block diagram: inputs u1(n), ..., unu(n), nonlinear static element, v1(n), ..., vnf(n), linear dynamic system, outputs y1(n), ..., yny(n), with additive disturbances ε1(n), ..., εny(n))

The output of the MIMO Hammerstein model is

$$y(n) = -\sum_{m=1}^{na} A^{(m)} y(n-m) + \sum_{m=1}^{nb} B^{(m)} f\big(u(n-m)\big) + \varepsilon(n), \tag{1.53}$$
where $A^{(m)} \in \mathbb{R}^{ny \times ny}$, $B^{(m)} \in \mathbb{R}^{ny \times nf}$, $f(\cdot): \mathbb{R}^{nu} \to \mathbb{R}^{nf}$ is a nonzero vector function, $u(n) \in \mathbb{R}^{nu}$. Note that the definitions (1.51), (1.52), and (1.53) describe models with a coupled static part and coupled dynamics. We can also consider models with an uncoupled static part and coupled dynamics or a coupled static part and uncoupled dynamics as special cases of these general forms. The general Wiener model considered by Sieben [147] is a SISO (single-input single-output) model in which a nonlinear static MISO (multi-input single-output) element follows a linear SIMO (single-input multiple-output) dynamic system. Another structure, known as the Uryson model [40], consists of several Hammerstein models in parallel, with all paths sharing the same input and their outputs summed. More complicated structures arise through the interconnection of three sub-models in a cascade. In this way, structures called Wiener-Hammerstein (Fig. 1.5) and Hammerstein-Wiener models (Fig. 1.6) can be obtained.

Fig. 1.5. SISO Wiener-Hammerstein system (block diagram: u(n), linear dynamic system 1, nonlinear static element, linear dynamic system 2, y(n), with additive disturbance ε(n))

Fig. 1.6. SISO Hammerstein-Wiener system (block diagram: u(n), nonlinear static element 1, linear dynamic system, nonlinear static element 2, y(n), with additive disturbance ε(n))

The Wiener-Hammerstein structure is given by

$$y(n) = \frac{B_2(q^{-1})}{A_2(q^{-1})} f\left(\frac{B_1(q^{-1})}{A_1(q^{-1})}u(n)\right) + \varepsilon(n), \tag{1.54}$$
where $B_1(q^{-1})/A_1(q^{-1})$ and $B_2(q^{-1})/A_2(q^{-1})$ are pulse transfer functions of the first and the second linear dynamic system, respectively. The identification of Wiener-Hammerstein systems with correlation methods was studied by Billings and Fakhouri [17] and Hunter and Korenberg [71]. A recursive identification method for the MISO Wiener-Hammerstein model was proposed by Boutayeb and Darouach [21]. Bloemen et al. [19] considered the application of Hammerstein-Wiener models to the predictive control problem. Bai [7] developed a two-stage identification algorithm for Hammerstein-Wiener systems in which optimal parameter estimates of both the nonlinear elements and the linear dynamic system are obtained using the RLS algorithm followed by singular value decomposition of two matrices. The algorithm is convergent in the absence of noise and convergent with probability 1 in the presence of white noise. Recently, the identification of Hammerstein-Wiener systems was also studied by Bai [9]. In the Hammerstein-Wiener model, two nonlinear blocks, described by the functions $f_1(\cdot)$ and $f_2(\cdot)$, are separated by a linear dynamic system:

$$y(n) = f_2\left(\frac{B(q^{-1})}{A(q^{-1})} f_1\big(u(n)\big)\right) + \varepsilon(n). \tag{1.55}$$
1.1.6 State-space Wiener models

A state-space description of MIMO Wiener systems with $nu$ inputs, $ny$ outputs, $nx$ state variables, and $ns$ internal variables has the form [113, 158, 164]

$$x(n+1) = Ax(n) + Bu(n), \tag{1.56}$$

$$s(n) = Cx(n) + Du(n), \tag{1.57}$$

$$y(n) = f\big(s(n)\big) + \varepsilon(n), \tag{1.58}$$

where $x(n) \in \mathbb{R}^{nx}$, $u(n) \in \mathbb{R}^{nu}$, $s(n) \in \mathbb{R}^{ns}$, $y(n) \in \mathbb{R}^{ny}$, $f(\cdot): \mathbb{R}^{ns} \to \mathbb{R}^{ny}$ is a nonzero vector function, and $\varepsilon(n) \in \mathbb{R}^{ny}$ is a zero-mean stochastic process. The parallel representation of the state-space Wiener model has the form

$$\hat{x}(n+1) = A\hat{x}(n) + Bu(n), \tag{1.59}$$

$$\hat{s}(n) = C\hat{x}(n) + Du(n), \tag{1.60}$$

$$\hat{y}(n) = f\big(\hat{s}(n)\big). \tag{1.61}$$

Replacing the model state $\hat{x}(n)$ in (1.59) and (1.60) with the system state $x(n)$ results in the series-parallel representation of the state-space Wiener model:

$$\hat{x}(n+1) = Ax(n) + Bu(n), \tag{1.62}$$

$$\hat{s}(n) = Cx(n) + Du(n), \tag{1.63}$$

$$\hat{y}(n) = f\big(\hat{s}(n)\big). \tag{1.64}$$
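The practical difference between the parallel form (1.59) – (1.61) and the series-parallel form (1.62) – (1.64) is whether the model propagates its own state or reads the measured system state at every step. The sketch below illustrates this for a noise-free system; the matrices, the `tanh` nonlinearity, the input signal, and all function names are hypothetical assumptions made for illustration.

```python
import numpy as np

def simulate_states(A, B, u, x0):
    """Propagate x(n+1) = A x(n) + B u(n); returns x(0), ..., x(N)."""
    x = np.zeros((len(u) + 1, len(x0)))
    x[0] = x0
    for n in range(len(u)):
        x[n + 1] = A @ x[n] + B @ u[n]
    return x

def parallel_output(A, B, C, D, f, u, x0_hat):
    """Parallel model: the model runs on its own state."""
    x_hat = simulate_states(A, B, u, x0_hat)      # (1.59)
    s_hat = x_hat[:-1] @ C.T + u @ D.T            # (1.60)
    return f(s_hat)                               # (1.61)

def series_parallel_output(A, B, C, D, f, u, x_meas):
    """Series-parallel model: measured system states replace model states."""
    x_pred = x_meas[:-1] @ A.T + u @ B.T          # (1.62), one-step state prediction
    s_hat = x_meas[:-1] @ C.T + u @ D.T           # (1.63)
    return f(s_hat), x_pred                       # (1.64)

# Hypothetical stable system with nx = 2, nu = 1, ns = ny = 1
A = np.array([[0.5, 0.1], [0.0, 0.3]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
rng = np.random.default_rng(2)
u = rng.standard_normal((50, 1))

x_sys = simulate_states(A, B, u, x0=np.array([1.0, -1.0]))   # "measured" states
y_sp, x_pred = series_parallel_output(A, B, C, D, np.tanh, u, x_sys)
y_par_same = parallel_output(A, B, C, D, np.tanh, u, np.array([1.0, -1.0]))
y_par_zero = parallel_output(A, B, C, D, np.tanh, u, np.zeros(2))
```

With exact parameters and no noise, both forms agree when the initial model state equals the system state. A wrong initial model state affects only the parallel model, and for a stable `A` its effect decays with time.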
1.1.7 State-space Hammerstein models

A state space MIMO Hammerstein model can be described by the following equations [157]:

$$x(n+1) = Ax(n) + Bv(n), \tag{1.65}$$

$$y(n) = Cx(n) + Dv(n) + \varepsilon(n), \tag{1.66}$$

$$v(n) = f\big(u(n)\big), \tag{1.67}$$

where $x(n) \in \mathbb{R}^{nx}$, $u(n) \in \mathbb{R}^{nu}$, $y(n) \in \mathbb{R}^{ny}$, $\varepsilon(n) \in \mathbb{R}^{ny}$ is a zero-mean stochastic process, $v(n) \in \mathbb{R}^{nf}$, and $f(\cdot): \mathbb{R}^{nu} \to \mathbb{R}^{nf}$ is a nonzero vector function. The MIMO Hammerstein model in a parallel state-space representation has the form

$$\hat{x}(n+1) = A\hat{x}(n) + B\hat{v}(n), \tag{1.68}$$

$$\hat{y}(n) = C\hat{x}(n) + D\hat{v}(n), \tag{1.69}$$

$$\hat{v}(n) = f\big(u(n)\big). \tag{1.70}$$

Replacing the model state $\hat{x}(n)$ in (1.68) and (1.69) with the system state $x(n)$, we obtain the series-parallel representation of the state-space Hammerstein model:

$$\hat{x}(n+1) = Ax(n) + B\hat{v}(n), \tag{1.71}$$

$$\hat{y}(n) = Cx(n) + D\hat{v}(n), \tag{1.72}$$

$$\hat{v}(n) = f\big(u(n)\big). \tag{1.73}$$
1.2 Multilayer perceptron

Multilayer feedforward neural networks, also referred to as multilayer perceptrons, are the most widely known and used neural networks [60, 68, 69, 127, 144, 169]. In the multilayer perceptron (MLP), the neurons are ordered into one or more hidden layers and connected to an output layer. This type of neural network is used in Chapters 2 and 3 intensively for modelling both the nonlinear element and its inverse.

1.2.1 MLP architecture

The $i$th output $x_i(n)$ of the first hidden layer (Fig. 1.7) is

$$x_i(n) = \varphi\left(\sum_{j=1}^{nu} w_{ij}^{(1)} u_j(n) + w_{i0}^{(1)}\right), \tag{1.74}$$

where $w_{ij}^{(1)}$ is the $j$th weight of the $i$th neuron, $w_{i0}^{(1)}$ is the bias of the $i$th neuron, $\varphi(\cdot)$ is the activation function of hidden layer neurons, $nu$ is the number of inputs, and $u_j(n)$ is the $j$th input. Common choices of the activation function are sigmoidal functions such as the logistic function

$$\varphi(x) = \frac{1}{1 + \exp(-x)} \tag{1.75}$$

and the hyperbolic tangent function

$$\varphi(x) = \tanh(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}. \tag{1.76}$$
The outputs of the first hidden layer can be connected to the successive hidden layers and finally to the output layer. In the commonly used MLP neural network with one hidden layer, the outputs $x_i(n)$, $i = 1, \ldots, M$, where $M$ is the number of hidden layer neurons, are transformed by the output layer into the outputs $y_k(n)$:

$$y_k(n) = \phi\left(\sum_{i=1}^{M} w_{ki}^{(2)} x_i(n) + w_{k0}^{(2)}\right), \tag{1.77}$$

where $w_{ki}^{(2)}$ is the $i$th weight of the $k$th output neuron, $w_{k0}^{(2)}$ is the bias of the $k$th output neuron, and $\phi(\cdot)$ is the activation function of output layer neurons. Although $\phi(\cdot)$ can be a nonlinear function, the linear activation function $\phi(x) = x$ is a typical choice.

Fig. 1.7. The ith hidden layer neuron (diagram: inputs u1(n), ..., unu(n) weighted by wi1(1), ..., winu(1), bias wi0(1) on a constant input 1, activation ϕ(·), output xi(n))

1.2.2 Learning algorithms

Nonlinear optimization of neural network weights is the most common technique used for training the MLP. Using gradient-based learning methods, the cost function $J$, typically the sum of squared errors between the system output and the neural network model output, is minimized. As a result, neural network weights are adjusted along the negative gradient of the cost function. The backpropagation (BP) learning algorithm is an implementation of the gradient descent optimization method for weight updating. The backpropagation learning algorithm uses the backpropagation algorithm as a technique for the computation of the gradient of the MLP w.r.t. its weights [68, 123, 132]. In spite of its computational simplicity, training the MLP with the BP learning algorithm may cause several problems such as very slow convergence, oscillations, divergence, and the “zigzagging” effect. A large number of improvements, extensions, and modifications of the basic BP learning algorithm have been developed to circumvent these problems [60, 68, 123]. The reason for the slow convergence of the BP learning algorithm is that it operates on the basis of a linear approximation of the cost function. To achieve a significantly higher convergence rate, higher order approximations of the cost function should be used. Examples of such learning techniques are the Levenberg-Marquardt method, quasi-Newton methods, conjugate gradient methods [68, 123], and the RLS learning algorithms [144]. To extract information from the training data and increase the effectiveness of learning, some data preprocessing such as filtering, removing redundancy, and removing outliers is usually necessary.
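The one-hidden-layer MLP defined by (1.74) and (1.77) amounts to two matrix-vector products with an activation in between. The sketch below is a minimal illustration that assumes the hyperbolic tangent (1.76) in the hidden layer and the linear output activation $\phi(x) = x$; the function name `mlp_forward` and all weight values are hypothetical.

```python
import numpy as np

def mlp_forward(u, W1, b1, W2, b2):
    """One-hidden-layer MLP.
    Hidden layer, (1.74) with phi = tanh: x = tanh(W1 u + b1)
    Output layer, (1.77) with linear activation: y = W2 x + b2
    W1[i, j] plays the role of w_ij^(1), b1[i] of w_i0^(1),
    W2[k, i] of w_ki^(2), and b2[k] of w_k0^(2)."""
    x = np.tanh(W1 @ u + b1)
    return W2 @ x + b2

# Hypothetical network: nu = 2 inputs, M = 3 hidden neurons, 1 output
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
b1 = np.zeros(3)
W2 = np.array([[1.0, 1.0, 1.0]])
b2 = np.array([0.5])

y_zero = mlp_forward(np.array([0.0, 0.0]), W1, b1, W2, b2)   # only the output bias remains
y_one = mlp_forward(np.array([1.0, 0.0]), W1, b1, W2, b2)
```

For the zero input all hidden outputs vanish, so the network returns the output bias alone, which is a convenient check of the indexing conventions.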
Also, scaling the data is essential to make learning algorithms robust and decrease the learning time. The recommended scaling technique is based on removing the mean and scaling signals to the same variance [127]. Alternatively, the mean can be removed from signals and zero-mean signals scaled with respect to their maximum absolute values, to obtain values in a specified interval, e.g. $[-1, 1]$. In general, to minimize the learning time, applying nonzero-mean input signals should be avoided. This comes from the fact that the learning time for the steepest descent algorithm is sensitive to variations in the condition number $\lambda_{\max}/\lambda_{\min}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian of the cost function and $\lambda_{\min}$ is its smallest nonzero eigenvalue. Experimental results show that for nonzero-mean input signals the condition number $\lambda_{\max}/\lambda_{\min}$ is larger than for zero-mean input signals [68]. Note that the choice of an asymmetric activation function, e.g. the logistic function, introduces systematic bias for hidden layer neurons. This has a similar effect on the condition number $\lambda_{\max}/\lambda_{\min}$ as nonzero-mean inputs. Therefore, to increase the convergence speed of gradient-based learning algorithms, the choice of antisymmetric activation functions, such as the hyperbolic tangent, is recommended.

1.2.3 Optimizing the model architecture

In practice, MLP models of real processes are of a rather large size. It is well known that models should not be too complex because they would learn noise and thus generalize badly to new data. On the other hand, they should not be too simple because they would not be capable of capturing the process behavior. In the MLP with one hidden layer, the problem of architecture optimization boils down to choosing the number of hidden layer nodes and eliminating insignificant weights.
The overall model error is composed of two components: a bias error, which expresses the systematic error caused by the restricted model flexibility, and a variance error, being the stochastic error due to the restricted accuracy of parameter estimates. These two components of the model error are in conflict, known as the bias/variance dilemma, because the bias error decreases and the variance error increases with growing model complexity. Therefore, it is necessary to find a compromise, the so-called bias/variance tradeoff. To accomplish this, the optimization of the model architecture is necessary. There are two groups of methods used to optimize the neural network architecture, known as network growing and network pruning. In network growing methods, new nodes or layers are added starting from a small size network until the enlarged structure meets the assumed requirements. A well-known example of a growing method is the cascade-correlation algorithm [36]. In pruning methods, the initial structure is large, and it is then pruned by weakening or eliminating some selected weights. The idea of pruning is based on the assumption that there is a large amount of redundant information stored in a fully connected MLP. Network pruning
is commonly accomplished by two approaches: one based on complexity regularization and the other based on removing some weights using information on second-order derivatives of the cost function. In complexity regularization methods, a complexity penalty term is added to the cost function. The standard cost function in backpropagation learning is the mean-square error. Depending on the form of the complexity penalty term, different regularization techniques can be defined, such as weight decay, weight elimination, the approximate smoother, and Chauvin's penalty approach [60, 123]. In fact, regularization methods do not change the model structure but reduce the model flexibility by keeping some weights at their initial values or constraining their values. In this way, a reduction of the variance error can be achieved at the price of the bias error.
The optimal brain damage (OBD) [107] and the optimal brain surgeon (OBS) [67] are the most widely known and used methods based on the use of information on second-order derivatives of the cost function. Both of them reduce the size of the network by selectively deleting weights, and both employ the second-order Taylor expansion of the cost function about the operating point, i.e., the $n_w$-dimensional weight vector $w^* = [w_1^* \dots w_{n_w}^*]^T$ for which the cost function has a local minimum. Their objective is to find a set of weights whose removal causes the least change of the cost function. To achieve reasonably low computational complexity, only diagonal terms of the second-order Taylor expansion are included in the definition of the saliency of parameters in the OBD method. This corresponds to the assumption that the Hessian matrix is diagonal. The OBD method is an iterative procedure of the following form:
1. Train the MLP until the minimum of the mean-square error cost function is reached.
2. Compute the diagonal second-order derivatives $h_{ii}$, $i = 1, \dots, n_w$, of the cost function.
3. Compute the saliencies of the weights: $S_i = h_{ii} (w_i^*)^2 / 2$.
4. Delete some weights that have small saliencies.
5. Return to Step 1.
No such assumption about the Hessian matrix is made in the OBS method, and the OBD can be considered as a special case of the OBS.
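The OBD steps listed above can be illustrated on a toy problem. The following sketch is not an MLP implementation: it assumes a linear-in-weights model as a stand-in, so that the diagonal of the Hessian of the mean-square error is available in closed form. The data, the function name obd_prune_once, and the choice of deleting a single weight per pass are hypothetical.

```python
import numpy as np

def obd_prune_once(X, y, active):
    """One OBD pass for the linear model y ~ X[:, active] @ w (stand-in for an MLP)."""
    Xa = X[:, active]
    # Step 1: train to the minimum of the mean-square error J = mean((y - Xa w)^2).
    w, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    # Step 2: diagonal second-order derivatives h_ii = d^2 J / d w_i^2 = 2 * mean(x_i^2).
    h_diag = 2.0 * np.mean(Xa**2, axis=0)
    # Step 3: saliencies S_i = h_ii * w_i^2 / 2.
    saliency = h_diag * w**2 / 2.0
    # Step 4: delete the weight with the smallest saliency.
    removed = active[int(np.argmin(saliency))]
    remaining = [i for i in active if i != removed]
    return remaining, removed, saliency

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
# Hypothetical data: the second weight is nearly irrelevant.
y = 2.0 * X[:, 0] + 0.01 * X[:, 1] - 1.5 * X[:, 2] + 0.01 * rng.standard_normal(200)

active, removed, saliency = obd_prune_once(X, y, [0, 1, 2])
# Step 5: return to Step 1, i.e. retrain the pruned model.
w_pruned, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
print("removed weight:", removed, "remaining weights:", active)
```

For a real MLP the diagonal terms $h_{ii}$ would have to be computed by a backpropagation-like recursion rather than from a closed-form expression.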
1.3 Identification of Wiener systems
Many different approaches to Wiener system identification have been proposed based on correlation analysis, linear optimization, nonparametric regression, and nonlinear optimization with different nonlinear models such as polynomials, neural networks, wavelets, orthogonal functions, and fuzzy set models.
Correlation methods. Billings and Fakhouri used a correlation analysis approach to the identification of block-oriented systems based on the theory
of separable processes [14]. When the input is a white Gaussian signal, it is possible to separate the identification of the linear dynamic system from the identification of the nonlinear element [15, 17]. For Wiener systems, the first order correlation function $R_{uy}(k)$ of the system input u(n) and the system output y(n) is directly proportional to the impulse response h(k) of the linear system, and the second order correlation function $R_{u^2y}(k)$ is directly proportional to the square of h(k). Therefore, if $R_{uy}^2(k)$ and $R_{u^2y}(k)$ are equal except for a constant of proportionality, the system has a Wiener-type structure. A correlation approach to the identification of direction-dependent dynamic systems using Wiener models was used by Barker et al. [12]. They considered Wiener systems containing the linear dynamic part with different transfer functions for increasing and decreasing system output. The Wiener system was excited with maximum-length pseudo-random or inverse maximum-length pseudo-random binary signals. The determination of Wiener model parameters was performed by matching the system and model correlation functions, outputs, and discrete Fourier transforms of the outputs.
Linear optimization methods. In linear optimization methods, it is assumed that a model can be parameterized by a finite set of parameters. The nonlinear element is commonly modelled by a polynomial along with a pulse transfer function model of the linear dynamic system. A Wiener model parameterized in this way is nonlinear in the parameters and parameter estimation becomes a nonlinear optimization problem. Note that the inverse of a Wiener model is a Hammerstein model whose linear part is described by the inverse transfer function and whose nonlinear element is described by the inverse nonlinear function. A necessary condition for such a transformation is the invertibility of the function f(·) describing the nonlinear part of the Wiener model.
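The inversion argument can be checked numerically. The sketch below is an illustration under assumptions that are not taken from the text: a first-order pulse transfer function with its zero inside the unit circle, the invertible nonlinearity f(s) = sinh(s), noise-free data, and zero initial conditions. The helper filt is a hypothetical minimal difference-equation filter written for this example.

```python
import numpy as np

def filt(b, a, x):
    """y(n) = (sum_i b[i] x(n-i) - sum_{i>=1} a[i] y(n-i)) / a[0], zero initial conditions."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, len(a)) if n - i >= 0)
        y[n] = acc / a[0]
    return y

# Hypothetical Wiener system: B(q^-1)/A(q^-1) followed by f(s) = sinh(s).
b = [1.0, 0.5]    # zero at -0.5, inside the unit circle
a = [1.0, -0.8]   # pole at 0.8, stable
f, f_inv = np.sinh, np.arcsinh

rng = np.random.default_rng(1)
u = rng.uniform(-1.0, 1.0, 200)

# Wiener model: linear dynamics first, then the static nonlinearity.
s = filt(b, a, u)
y = f(s)

# Inverse model in Hammerstein form: inverse nonlinearity first,
# then the inverse transfer function A(q^-1)/B(q^-1).
s_rec = f_inv(y)
u_rec = filt(a, b, s_rec)

print("max input reconstruction error:", np.max(np.abs(u_rec - u)))
```

If the zero of B(q^-1) were moved outside the unit circle, the second call to filt would implement an unstable filter and the reconstruction would diverge.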
To obtain an asymptotically stable inverse model, the Wiener model should be minimum phase. The inverse Wiener model is still nonlinear in the parameters but it is much more convenient for parameter estimation as it can be transformed into the linear-in-parameters MISO form with the method proposed by Chang and Luus [27] for Hammerstein systems. The parameters of the transformed model can be calculated with the least squares method. Assuming the knowledge of f(·), the identification of the inverse Wiener system was considered by Pearson and Pottman [138]. To overcome the practical difficulty related to the fact that the inverse system identification approach penalizes prediction errors of the input signal u(n) instead of the output signal y(n), they used the weighted least squares method with weighting parameters that express the relative importance of the errors. A polynomial inverse model of the nonlinear element and a frequency sampling filter model of the linear dynamic system
$$\hat{s}(n) = \sum_{m=0}^{n_f-1} G(j\omega_m) F_m(n), \tag{1.78}$$
where $\hat{s}(n)$ is the output of the linear dynamic model, $G(j\omega_m)$, $m = 0, \dots, n_f - 1$, is the discrete frequency response of the linear system at $\omega_m = 2\pi m/n_f$, and $F_m(n)$ is the output of the mth frequency sampling filter defined as
$$F_m(n) = \frac{1}{n_f}\,\frac{1 - q^{-n_f}}{1 - e^{j(2\pi m/n_f)} q^{-1}}\, u(n), \tag{1.79}$$
where $u(n)$ is the system input and $q^{-1}$ is the backward shift operator, was used by Kalafatis et al. [95]. The model output $\hat{y}(n)$ can then be expressed in the linear-in-parameters form:
$$\hat{y}(n) = \sum_{m=0}^{n_f-1} G(j\omega_m) F_m(n) - \sum_{k=2}^{r} \gamma_k y^k(n), \tag{1.80}$$
where y(n) is the system output and γk, k = 2, . . . , r, are the parameters of the polynomial inverse model of the nonlinear element. The parameters G(jωm) and γk can be calculated with the least squares method. Unfortunately, such an approach leads to inconsistent parameter estimates for Wiener systems with additive output disturbances as the regression vector is correlated with the disturbance. To overcome this problem, an iterative algorithm was proposed [96], which consists of the following three steps. First, parameter estimates are calculated using the least squares method. Then, using the obtained estimates, predicted system outputs are calculated. Finally, to calculate corrected parameter estimates, the predicted system outputs are employed in another estimation step using the least squares method. The above estimation procedure is repeated until the parameter estimates converge to constant values. The recursive least squares scheme is used in the orthonormal basis function-based identification method proposed by Marciak et al. [116]. They use a noninverted model of the linear system composed of Laguerre filters and an inverse polynomial model of the nonlinear element. In the frequency approach to the identification of Wiener systems of Bai [10], the phase estimate of the linear dynamic system output is determined based on the discrete Fourier transform (DFT) of the filtered system output. Having the phase estimate, the structure of the nonlinearity can be determined from the graph of the system output versus the estimated linear dynamic system output and approximated by a polynomial. The identification of Wiener systems based on a modified series-parallel Wiener model, defined by a non-inverted pulse transfer model of the linear element and an inverse polynomial model of the nonlinear element, can be performed with the method proposed by Janczak [80, 83].
The modified series-parallel model is linear in the parameters and its parameters are calculated with the least squares method. The method requires the nonlinear function f (·) to be invertible and the linear term of the polynomial model to be nonzero. However, in the case of additive output noise, direct application of this approach results in inconsistent parameter estimates. As a remedy against such a situation, a combined least squares and instrumental variables estimation procedure is
proposed. A detailed description of this method, along with its extension to the identification of Wiener systems without the linear term of the nonlinear characteristic, is given in Chapter 4. Parameter estimation of Wiener systems composed of a finite impulse response (FIR) model of the linear system and an inverse polynomial model of the nonlinear element was considered by Mzyk [119]. To obtain consistent parameter estimates, the instrumental variables method is employed with instrumental variables defined as a sum of powered input signals and some tuning constants selected by the user. The identification of MIMO Wiener systems with the use of basis functions for the representation of both the linear dynamic system and the nonlinear element was proposed by Gómez and Baeyens [44]. It is assumed that the nonlinear element is invertible and can be described by nonlinear basis functions. The MIMO linear dynamic system is represented by rational orthonormal bases with fixed poles (OBFP). Special cases of OBFP are the FIR, Laguerre, and Kautz bases. Under the above assumptions, the MIMO Wiener model can be transformed into the linear-in-parameters form, and parameter estimates of the transformed model that minimize the quadratic cost function on prediction errors are calculated using the least squares method. To calculate matrix parameters of the nonlinear element from the parameter estimates of the transformed model, the singular value decomposition technique is applied.
Nonparametric regression methods. Parametric regression methods are based on a restrictive assumption concerning the class of nonlinear functions. The kernel nonparametric regression approach, which considerably enlarges the class of nonlinearities identified in Wiener systems, was introduced by Greblicki [46] and studied further in [48]. Next, the idea of nonparametric regression was advanced by employing orthogonal series for recovering the inverse of the nonlinear characteristic [47].
For nonparametric regression algorithms that use trigonometric, Legendre, and Hermite orthogonal functions, pointwise consistency was shown and the rates of convergence were given. Greblicki also proposed and analyzed recursive identification algorithms based on the kernel regression for both discrete-time [51] and continuous-time Wiener systems [50, 52].
Nonlinear optimization methods. The identification of Wiener systems using prediction error methods was discussed in [85, 88, 128, 165, 166]. Wigren [165] analyzed recursive Gauss-Newton and stochastic gradient identification algorithms assuming that the system nonlinearity is known a priori. He established conditions for local and global convergence of parameter estimates to the true system parameters in the presence of correlated measurement disturbances. Another recursive prediction error method, proposed by Wigren [166], estimates the parameters of a pulse transfer function model of the linear system and a piecewise linear model of the nonlinear element. With the technique of a linearized differential equation, the local convergence of estimates to the system parameters is proved. It is also shown that the input signal should be such that there is energy in the whole range of piecewise linear approximation. Vörös [162] used a parametric model to describe Wiener systems with a special kind of discontinuous nonlinear element: a piecewise-linear function with a preload and a dead zone. Pure preload, dead zone, and two-segment piecewise-linear asymmetric nonlinearities are special cases of the general discontinuous nonlinearity. He proposed an identification method that uses a description of the nonlinear element based on the key separation principle. Such a formulation of the problem makes it possible to transform the nonlinear element model into the pseudolinear-in-parameters form. Parameter estimation is conducted iteratively as the model contains three internal variables that are unmeasurable. The iterative procedure is based on the use of parameter estimates from the preceding step to estimate these unmeasurable internal variables. Another approach to the identification of Wiener systems with the assumed forms of hard-type nonlinearity parameterized by a single unknown parameter a was proposed by Bai [8]. Examples of nonlinearities of this type are the saturation, preload, relay, dead zone, hysteresis-relay, and the hysteresis. To find the unknown parameter a, the separable least squares method is used, in which the identification problem is transformed into a one-dimensional minimization problem. As the cost function is one-dimensional, global search methods can be applied to find its minimum. Alternatively, it is also possible to find the minimum directly from the plot of the cost function versus a. Having found the optimal estimate of a, the parameters of the linear dynamic system can be estimated using the least squares method. For this approach, the conditions under which the strong consistency of parameter estimates can be achieved are given.
Although the separable least squares method can be extended to the two-dimensional case easily, its extension to the case of nonlinearities parameterized by a larger number of parameters is more complicated. An iterative scheme for the identification of Wiener systems with the prediction error method was proposed by Norquay et al. [128]. In this approach, the nonlinear element is modelled by a polynomial. To calculate the approximated Hessian, the Levenberg-Marquardt method is used along with the calculation of the gradient via the simulation of sensitivity models. A recursive version of this method was used in [85, 88]. The pseudolinear regression algorithm, described by Janczak [86, 88], can be obtained from the prediction error scheme if the dependence of both the delayed and powered output signals of the linear model on model parameters is ignored. In this way, the model is treated as a linear-in-parameters one. Such a simplification reduces the computational complexity of the algorithm at the price of gradient approximation accuracy. A neural network-based method for the identification of Wiener systems was developed by Al-Duwaish [3]. In this method, the linear dynamic system and the nonlinear element are identified separately. First, the parameters of the linear system, described by a linear difference equation, are estimated with
the recursive least squares (RLS) algorithm based on a response to a small input signal which ensures a linear perturbation of the nonlinear element. Then, having the linear system identified, the backpropagation learning algorithm is applied to train a multilayer neural network model of the nonlinear element. This step is performed with another response to an increased input signal which perturbs the nonlinear element nonlinearly. The identification of a MISO Wiener system was studied by Ikonen and Najim [72]. Based on the input-output data from a pump-valve pilot system, a neural network Wiener model was trained with the Levenberg-Marquardt method. Visala et al. [159] used a MIMO Wiener model for the modelling of the chromatographic separation process. The linear dynamic part of their model is composed of Laguerre filters while the MIMO feedforward neural network is employed as a static nonlinear mapping. It is assumed that the MIMO linear dynamic model is decoupled. Parameter estimation is reduced to training the nonlinear part of the model with the Levenberg-Marquardt method as the linear dynamics is assumed to be known. This means that suitable values of Laguerre parameters are selected on the basis of a priori information. The identification of a nonlinear dynamic system using a model composed of a SIMO dynamic system followed by a MISO nonlinear element was studied by Alataris et al. [1]. Their methodology employs a multilayer perceptron with a single hidden layer and polynomial activation functions as a model of the static nonlinearity. The inputs to this model are the outputs of a Laguerre filter bank used as a model of the linear dynamic system. The parameters of the neural network model are adjusted by means of the gradient descent method. The choice of a real pole, being the only degree of freedom of Laguerre filters, is based on a trial and error rule.
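A Laguerre filter bank of the kind used as the linear dynamic part in the models above can be sketched as follows. This is a generic discrete-time construction, not the implementation of any of the cited works: the first filter is sqrt(1 − a²)/(1 − a q⁻¹) and each further filter appends the all-pass section (q⁻¹ − a)/(1 − a q⁻¹), where the real pole a is the single design parameter. The pole value 0.6, the number of filters, and the helper names filt and laguerre_bank are assumptions made for this example.

```python
import numpy as np

def filt(b, a, x):
    """y(n) = (sum_i b[i] x(n-i) - sum_{i>=1} a[i] y(n-i)) / a[0], zero initial conditions."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, len(a)) if n - i >= 0)
        y[n] = acc / a[0]
    return y

def laguerre_bank(x, pole, n_filters):
    """Outputs of the first n_filters discrete Laguerre filters driven by x (one column each)."""
    gain = np.sqrt(1.0 - pole**2)
    # First filter: sqrt(1 - a^2) / (1 - a q^-1).
    out = filt([gain], [1.0, -pole], x)
    outputs = [out]
    for _ in range(n_filters - 1):
        # Each further filter adds the all-pass section (q^-1 - a) / (1 - a q^-1).
        out = filt([-pole, 1.0], [1.0, -pole], out)
        outputs.append(out)
    return np.column_stack(outputs)

# Impulse responses of the bank; their Gram matrix should be close to identity,
# which reflects the orthonormality of the Laguerre basis.
impulse = np.zeros(400)
impulse[0] = 1.0
L = laguerre_bank(impulse, pole=0.6, n_filters=4)
gram = L.T @ L
print(np.round(gram, 6))
```

In a Wiener model of this type, the columns returned by laguerre_bank for a measured input would serve as the inputs of the static nonlinear mapping.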
Applying the polynomial activation function permits an easy transition between the neural network and Volterra series models. A neural network model of both the linear dynamic system and the nonlinear element is used to describe SISO Wiener systems in the single step sequential procedure proposed by Janczak [89]. For more details about this approach and its extension to the MIMO case refer to Chapter 2. The identification of Wiener systems using a genetic method, revealing a global optimization property, was studied by Al-Duwaish [5]. In this approach, the parameters of a Wiener model composed of a pulse transfer function model of the linear part and a polynomial model of the nonlinear element are calculated. The pulse transfer function model is defined in the pole-zero form. With the fitness function defined as a sum of squared errors between the system and model outputs and by performing genetic operations of cross-over, mutation, and selection, new generations of solutions are generated and evaluated until predetermined accuracy of approximation is achieved. A parallel neural network Wiener model was also trained using recursive evolutionary programming with a time-dependent learning rate by Janczak and Mrugalski [93].
1.4 Identification of Hammerstein systems
As in the case of Wiener systems, discussed in Section 1.3, we will review briefly various identification methods for Hammerstein systems.
Correlation methods. For a white Gaussian input, the computation of the cross-correlation function makes it possible to decouple the identification of Hammerstein systems and identify the linear dynamic system and the nonlinear element separately [14, 15, 16, 17]. First, the impulse response h(k) of the linear dynamic system is estimated using the correlation technique. The first order correlation function $R_{uy}(k)$ is directly proportional to the impulse response, $R_{uy}(k) = \alpha h(k)$. Then, if necessary, the parameters of the pulse transfer function $Z[\alpha h(k)] = B(q^{-1})/A(q^{-1})$ can be calculated from the impulse response easily. Having an estimate of $\alpha h(k)$ available, the parameters of a polynomial model of the nonlinear element can be calculated with the least squares algorithm [16]. For Hammerstein systems, the second order correlation function $R_{u^2y}(k)$ is directly proportional to the impulse response of the linear element and provides a convenient test of the system structure. If the first and second order correlation functions are equal except for a constant of proportionality, the system must have the structure of a Hammerstein model.
Linear optimization methods. In contrast to correlation methods, which use a nonparametric model of the linear dynamic system and a parametric model of the nonlinear element, linear optimization methods use parametric representations for both parts of the Hammerstein system. The parameters of a pulse transfer function representation of the linear element and a polynomial model of the nonlinear element can be estimated using the iterative least squares method proposed by Narendra and Gallman [120]. The method is based on an alternate adjustment of the parameters of the linear and nonlinear parts of the model.
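The idea of alternately adjusting the two parameter sets can be sketched on a simplified example. The code below is not the algorithm of [120] as published: it assumes an FIR linear part instead of a pulse transfer function, a third-order polynomial without a constant term, noise-free data, and a normalization of the first polynomial coefficient to one after every pass. All numerical values and the helper lagged are hypothetical.

```python
import numpy as np

def lagged(x, nb):
    """Matrix with columns x(n), x(n-1), ..., x(n-nb+1), zero initial conditions."""
    N = len(x)
    X = np.zeros((N, nb))
    for j in range(nb):
        X[j:, j] = x[:N - j]
    return X

rng = np.random.default_rng(2)
u = rng.uniform(-1.0, 1.0, 500)
b_true = np.array([1.0, 0.5, -0.3])     # hypothetical FIR linear part
c_true = np.array([1.0, 0.4, -0.2])     # v = c1 u + c2 u^2 + c3 u^3, with c1 = 1
P = np.column_stack([u, u**2, u**3])    # powers of the input
y = lagged(P @ c_true, 3) @ b_true      # noise-free Hammerstein output

c = np.array([1.0, 0.0, 0.0])           # start from a linear characteristic
for _ in range(50):
    # Linear part: with c fixed, v(n) is known and y is linear in b.
    b, *_ = np.linalg.lstsq(lagged(P @ c, 3), y, rcond=None)
    # Nonlinear part: with b fixed, y is linear in c.
    Z = np.column_stack([lagged(P[:, k], 3) @ b for k in range(3)])
    c, *_ = np.linalg.lstsq(Z, y, rcond=None)
    # Remove the gain ambiguity between the two blocks by fixing c1 = 1.
    b, c = b * c[0], c / c[0]

print("b estimate:", b)
print("c estimate:", c)
```

Each half-step is an ordinary least squares problem, which is what makes the scheme attractive. With output noise and a recursive linear part the regressors contain past outputs, and the estimates would in general be biased.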
Another iterative approach was proposed by Haist et al. [63] for Hammerstein systems with correlated additive output disturbances. This method, being an extension of the method of Narendra and Gallman, overcomes its obvious drawback of biased estimates by estimating the parameters of both the Hammerstein model and a linear noise model. The transformation of the SISO Hammerstein model into the MISO form makes it possible to estimate the parameters of the transformed model utilizing the least squares method noniteratively [27]. In spite of its simplicity and the elegant form of a one step solution, this method has an inconvenience in the form of redundancy in the calculation of the parameters of the nonlinear element. More precisely, nb different sets of these parameters can be calculated from the parameters of the transformed MISO model, where nb is the numerator order of the pulse transfer function. With instrumental variables defined as linear filter outputs and powers of
system inputs, instrumental variables methods were studied by Stoica and Söderström [152]. They proved that the instrumental variables approach gives consistent parameter estimates under mild conditions. An algorithm that uses an OBFP model of the MIMO linear dynamic system and a nonlinear basis function model of the MIMO nonlinear element was proposed by Gómez and Baeyens [44]. In this approach, the MIMO Hammerstein model is transformed into the linear-in-parameters form and its parameters are calculated using the least squares method. To calculate matrix parameters of the nonlinear element from parameter estimates of the transformed model, the singular value decomposition technique is applied. The algorithm provides consistent parameter estimates under weak assumptions about the persistency of the excitation of system inputs. Bai [11] proposed a two-step algorithm that decouples the identification of the linear dynamic system from the identification of the nonlinear element. In the first step, the algorithm uses a pseudo-random binary sequence (PRBS) input to identify the linear dynamic system. With the PRBS signal, the effect of nonlinearity can be eliminated as any static nonlinear function can be completely characterized by a linear function under the PRBS input. Therefore, any identification method for linear systems can be applied to obtain a linear dynamic model. The identification of the nonlinear element is made in the second step. As the PRBS signal assumes only two values, a new input signal that is rich enough, e.g., a pseudo-random sequence of a uniform distribution, is used to identify the nonlinear element. An important advantage of this decoupling technique is that the asymptotic variance of the estimate of the Hammerstein system transfer function is equal to the asymptotic variance of the estimate of the system transfer function in the linear case [126].
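The reason why a binary input removes the effect of the nonlinearity is elementary: a function evaluated at only two points always lies on a straight line through these points. The sketch below illustrates this and the resulting linear identification step. It is not the algorithm of [11]: a random ±1 sequence is used as a simple stand-in for a PRBS, the linear part is assumed to be FIR, the data are noise-free, and the nonlinearity f and all numerical values are hypothetical.

```python
import numpy as np

def f(u):
    """Hypothetical static nonlinearity (asymmetric, not a polynomial)."""
    return np.exp(0.5 * u) + np.abs(u) * u

amp = 1.0
rng = np.random.default_rng(3)
# Random binary sequence used as a simple stand-in for a PRBS (assumption).
u = amp * rng.choice([-1.0, 1.0], size=400)

# On the two points {-amp, +amp} the nonlinearity coincides with the line c0 + c1 u.
c1 = (f(amp) - f(-amp)) / (2.0 * amp)
c0 = (f(amp) + f(-amp)) / 2.0
line_error = np.max(np.abs(f(u) - (c0 + c1 * u)))

# Hammerstein system with a hypothetical FIR linear part driven by v = f(u).
h_true = np.array([1.0, 0.6, 0.2])
nb = len(h_true)
v = f(u)
n = np.arange(nb - 1, len(u))            # skip samples that need u before time 0
y = sum(h_true[j] * v[n - j] for j in range(nb))

# Ordinary linear least squares with a constant term and lagged inputs.
Phi = np.column_stack([np.ones(len(n))] + [u[n - j] for j in range(nb)])
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
h_est = theta[1:]                        # impulse response up to the gain c1

print("deviation of f(u) from a straight line:", line_error)
print("h_est / c1:", h_est / c1)
```

The impulse response is recovered only up to the gain c1, which is consistent with the gain ambiguity between the two blocks of a Hammerstein model.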
A piecewise linear model of the nonlinear element and a pulse transfer function model of the linear dynamic system are used in the identification scheme proposed by Giri et al. [43]. The Hammerstein model, defined in this way, is transformed into the linear-in-parameters form and parameter estimation is performed using a recursive gradient algorithm augmented by a parameter projection. To ensure the convergence of the model to the true system, a persistently exciting input sequence is generated.
Nonparametric regression methods. In parametric regression methods, it is assumed that the nonlinear characteristic belongs to a class that can be parameterized by a finite set of parameters. Clearly, such an assumption is a very restrictive one. For example, a commonly used polynomial representation of the nonlinear element excludes typical discontinuous characteristics such as dead-zone limiters, hard-limiters, and quantizers. A very large class of nonlinear functions, including all measurable L2 functions, can be identified using nonparametric regression. A nonparametric regression approach to the identification of Hammerstein systems was originally proposed by Greblicki and Pawlak [55] and studied further in [56, 57, 103]. Kernel regression identification procedures for different block-oriented systems, including Hammerstein ones, were also studied by Krzyżak and Partyka [105]. Nonparametric regression methods comprise two separate steps. First, the impulse response function of the dynamic system is estimated with a standard correlation method. Then the characteristic of the nonlinear element f(·) is estimated as a kernel regression estimate
$$\hat{f}\big(u(n)\big) = \frac{\displaystyle\sum_{k=0}^{N-1} y(k+1)\, K\!\left(\frac{u(n) - u(k)}{l(N)}\right)}{\displaystyle\sum_{k=0}^{N-1} K\!\left(\frac{u(n) - u(k)}{l(N)}\right)}, \tag{1.81}$$
where N is the number of measurements, K(·) is the kernel function, and l(·) is a sequence of positive numbers. In this definition, 0/0 is understood as zero. It can be shown that nonparametric regression estimates converge to the true characteristic of the nonlinear element as the number of measurements N tends to infinity. Greblicki and Pawlak also showed that for sufficiently smooth characteristics, the rate of convergence is $O(N^{-2/5})$. All of the above identification algorithms recover the nonlinear characteristic in a nonrecursive manner. Two other algorithms based on kernel regression estimates, proposed by Greblicki and Pawlak [58], allow one to identify the nonlinear characteristic recursively. Another class of nonparametric methods uses orthogonal expansions of the nonlinear function. Greblicki [45] considered a Hammerstein system driven by a random white input, with the system output disturbed by random white noise, and proposed two nonparametric procedures based on the trigonometric and Hermite orthogonal expansions. Both of these algorithms converge to the nonlinear characteristic of the system in a pointwise manner, and the integrated error converges to zero. The pointwise convergence rate is $O(N^{-(2q-1)/(4q)})$ in probability, where q is the number of derivatives of the nonlinear characteristic. The identification of Hammerstein systems by algorithms based on the Hermite series expansion with the number of terms depending nonlinearly on input-output measurements was considered by Krzyżak et al. [106]. In this approach, the system is driven by stationary white noise, and the linear and nonlinear components are estimated simultaneously. The application of the Fourier series estimate to identify nonlinearities in block-oriented systems, including Hammerstein ones, was studied by Krzyżak [102, 104].
For nonlinear functions and input signal densities having finite Fourier series expansions, such an approach has two advantages over kernel regression ones: higher computational efficiency and higher rates of convergence. Recovering the nonlinear function using a Legendre polynomial-based method with an adaptively selected number of terms was studied by Pawlak [136]. It was shown that the estimate of f(·) is globally consistent and the rates of convergence were established. Greblicki and Pawlak [59] also considered the identification of Hammerstein systems with Laguerre polynomials. An alternative to nonparametric methods based on orthogonal expansions
are algorithms that employ multiresolution approximation. In the context of Hammerstein systems identification, the Haar multiresolution analysis was first used by Pawlak and Hasiewicz [137]. This idea was studied further by Hasiewicz [64, 65, 66]. Haar multiresolution approximation algorithms converge pointwise. Their forms and convergence conditions are the same for white and correlated additive output noise. Another advantage is their faster convergence rate in comparison with other nonparametric identification algorithms that use orthogonal series expansions. The idea of the identification of nonlinearities in block-oriented systems with Daubechies wavelets was studied by Śliwiński and Hasiewicz [155]. As the lack of a closed analytical form makes Daubechies wavelets practically inapplicable, they used an estimation procedure that employs approximations that are easy to compute.
Nonlinear optimization methods. A prediction error approach to the identification of Hammerstein systems was discussed by Eskinat et al. [34]. Contrary to the least squares approach used by Chang and Luus [27], prediction error methods make it possible to estimate the parameters of a pulse transfer function of the linear system and a polynomial nonlinear characteristic directly, without any transformation of the parameters, and there is no problem of parameter redundancy. The prediction error method uses the Levenberg-Marquardt method to approximate the Hessian of the sum-squared cost function. Neglecting the fact that a model is nonlinear in the parameters and treating it as a linear one leads to pseudolinear regression methods. An example of such an approach is the method proposed by Boutayeb and Darouach [21], in which the parameters of a MISO Hammerstein system with the output disturbed additively by correlated noise are estimated recursively.
An identification method for Hammerstein systems which uses a two-segment model of the nonlinear element, composed of separate polynomial maps for positive and negative inputs, was proposed by Vörös [161]. This method also employs models in the pseudolinear form to calculate parameters iteratively. The idea of the method is based on splitting the nonlinear characteristic, which can be described accurately only by a polynomial of a high order, into two segments that can be approximated with polynomials of a much lower order. A similar technique was also applied by Vörös [160] for the identification of Hammerstein systems with the nonlinear element described by a discontinuous function. A neural network approach can be applied to the identification of Hammerstein systems with nonlinear elements described by a continuous function. The identification of Hammerstein systems with neural network models was considered by Su and McAvoy [153]. They used the steady-state and the transient data to train a neural network Hammerstein model. In this approach, a neural network model of the nonlinear element and a model of the linear dynamic system are trained separately. First, the neural network is trained with the backpropagation (BP) learning algorithm on the steady-state data. After the
training, the neural network serves as a nonlinear operator. The linear dynamic model is then trained on the set of transient data employing system input variables transformed by the nonlinear operator as inputs. Considering two basic configurations of the linear dynamic model, i.e., the series-parallel and the parallel one, it is shown that the gradients of the cost function can be obtained via the backpropagation method for the series-parallel model, and the backpropagation through time method for the parallel one. Having the gradient calculated, a steepest descent or a conjugate gradient algorithm is suggested to adjust the model parameters.

The identification of SISO Hammerstein systems by multilayered feedforward neural networks was also studied by Al-Duwaish et al. [3]. They considered a parallel neural network Hammerstein model trained recursively using the BP learning algorithm for training a model of the nonlinear element and the recursive least squares algorithm (RLS) for training the linear dynamic model. In this approach, partial derivatives of the squared cost function are calculated in an approximate way without taking into account the dependence of past model outputs on model parameters. This corresponds to Equations (3.36) – (3.38). Moreover, the fact that partial derivatives of the model output w.r.t. the weights of the nonlinear element model depend not only on the actual but also on past outputs of the nonlinear element model is not taken into account. Both of these simplifications of gradient calculation reduce computational complexity of training at the price of reduced accuracy of gradient calculation and may result in a decreased convergence rate. The extension of the RLS/BP algorithm to a MIMO case is discussed in [4]. A genetic approach to the identification of Hammerstein systems is considered in [5].
In this approach, the pole-zero form of the linear dynamic system and the nonlinearity of a known structure but unknown parameters are identified.

In general, SISO neural network Hammerstein models can be represented by a multilayer perceptron model of the nonlinear element and a linear node with two tapped delay lines used as a model of the linear system [73]. Both the series-parallel and parallel models can be considered. They have a similar architecture; the only difference is the feedback connection in the parallel model [100]. For the series-parallel model, the gradient of its output with respect to model parameters can be obtained using the computationally effective backpropagation algorithm. The calculation of the gradient in parallel models can be made with the sensitivity method or the backpropagation through time method [75, 77, 90]. As the Hammerstein model contains a linear part, it is also possible to use non-homogeneous algorithms that combine the recursive least squares or recursive pseudolinear regression algorithms with all of the above-mentioned methods [75]. The neural network Hammerstein models have simple architectures, commonly with a few tens of processing nodes and adjustable weights, and the trained models are easy to apply in practice. Due to the simple architecture of neural network Hammerstein models, their training algorithms have low computational complexity. More details on the
neural network approach to the identification of Hammerstein systems can be found in Chapter 3.
1.5 Summary

The material presented in this chapter starts with a concise introductory note explaining the underlying incentive to write the book. Next, Section 1.1 contains a review of discrete time models of dynamic systems. It starts with the well-known linear model structures and shows nonlinear models as generalizations of linear ones. Section 1.2 introduces the multilayer perceptron, a neural network architecture that is used in both neural network Wiener and Hammerstein models. Next, in Sections 1.3 and 1.4, various available identification methods for Wiener and Hammerstein systems are briefly reviewed and classified into the following four groups: correlation methods, linear optimization methods, nonparametric regression methods, and nonlinear optimization methods.
2 Neural network Wiener models
2.1 Introduction

This chapter introduces different structures of neural network Wiener models and shows how their weights can be adjusted, based on a set of system input-output data, with gradient learning algorithms. The term 'neural network Wiener models' refers to models composed of a linear dynamic model followed by a nonlinear multilayer perceptron model. Both the SISO and MISO Wiener models in their two basic configurations known as a series-parallel and a parallel model are considered. In series-parallel Wiener models, another multilayer perceptron is used to model the inverse nonlinear element. For neural network Wiener models, four different rules for the calculation of the gradient or the approximate gradient are derived and presented in a unified framework. In series-parallel models, represented by feedforward neural networks, the calculation of the gradient can be carried out with the backpropagation method (BPS). Three other methods, i.e., backpropagation for parallel models (BPP), the sensitivity method (SM), and truncated backpropagation through time (BPTT) are used to calculate the gradient or the approximate gradient in parallel models. For the BPTT method, it is shown that the accuracy of gradient approximation depends on both the number of unfolded time steps and impulse response functions of the linear dynamic model and its sensitivity models. Computational complexity of the algorithms is also analyzed and expressed in terms of the orders of polynomials describing the linear dynamic model, the number of nonlinear nodes, and the number of unfolded time steps. Having the gradient calculated, different gradient-based algorithms such as the steepest descent, quasi-Newton (or variable metric), and conjugate gradient can be applied easily. All the learning algorithms discussed in this chapter are one-step identification procedures, i.e., they allow one to identify both the nonlinear element and the linear dynamic system simultaneously.
The sequential (on-line) mode of the algorithms also makes them suitable for the identification of systems with slowly varying parameters. An important advantage of the parallel Wiener model is that it does not require the assumption of invertibility of the nonlinear element. Due to the output error formulation of the prediction error, parallel Wiener models are also useful for the identification of systems disturbed additively by white output noise.

The chapter is organized as follows: The identification problem is formulated in Section 2.2. Then the SISO and MISO series-parallel and parallel neural network Wiener models are introduced in Section 2.3. Section 2.4 contains details of different gradient calculation algorithms and an analysis of gradient computation accuracy with the truncated BPTT method. Simulation results and a real data example of a laboratory two-tank system are shown in Sections 2.5 and 2.6 to compare the convergence rates of the algorithms. The recursive prediction error learning algorithm is derived in Section 2.7. Finally, Section 2.8 summarizes the essential results.

A. Janczak: Identification of Nonlinear Systems, LNCIS 310, pp. 31–75, 2005. © Springer-Verlag Berlin Heidelberg 2005
2.2 Problem formulation

Consider a two-block structure composed of a linear dynamic system and a static nonlinear element in a cascade, referred to as the SISO Wiener system (Fig. 2.1). The output $y(n)$ to the input $u(n)$ at the time $n$ is

\[ y(n) = f\left( \frac{B(q^{-1})}{A(q^{-1})}\, u(n) \right) + \varepsilon(n), \tag{2.1} \]

where

\[ A(q^{-1}) = 1 + a_1 q^{-1} + \cdots + a_{na} q^{-na}, \tag{2.2} \]

\[ B(q^{-1}) = b_1 q^{-1} + \cdots + b_{nb} q^{-nb}, \tag{2.3} \]

and $q^{-1}$ is the backward shift operator, with the properties that $q^{-m} y(n) = y(n-m)$ and $q^{-m} f\big(s(n)\big) = f\big(s(n-m)\big)$, $f(\cdot)$ is the characteristic of the nonlinear element, $a_1, \ldots, a_{na}$, $b_1, \ldots, b_{nb}$ are the unknown parameters of the linear dynamic system, and $\varepsilon(n)$ is the system output disturbance. Let us assume that:

Assumption 2.1. The function $f(\cdot)$ is continuous.
Assumption 2.2. The linear dynamic system is causal and asymptotically stable.
Assumption 2.3. The polynomial orders $na$ and $nb$ are known.

The identification problem can be formulated as follows: Given the sequence of the system input and output measurements $\{u(n), y(n)\}$, $n = 1, \ldots, N$, estimate the parameters of the linear dynamic system and the characteristic of the nonlinear element minimizing the following global cost function:

\[ J = \frac{1}{2} \sum_{n=1}^{N} \big( y(n) - \hat{y}(n) \big)^2, \tag{2.4} \]

where $\hat{y}(n)$ is the output of the neural network Wiener model.
Fig. 2.1. Wiener system (signal path: $u(n)$ → $B(q^{-1})/A(q^{-1})$ → $s(n)$ → $f(s(n))$ → $y(n)$, with the disturbance $\varepsilon(n)$ added at the output)
There are two basic approaches to finding the minimum of the global cost function (2.4). The first method is called the sequential mode or pattern learning as it uses pattern-by-pattern updating of model parameters, changing their values by an amount proportional to the negative gradient of the local cost function

\[ J(n) = \frac{1}{2} \big( y(n) - \hat{y}(n) \big)^2. \tag{2.5} \]

This results in the following rule for the adaptation of model parameters:

\[ \mathbf{w}(n) = \mathbf{w}(n-1) - \eta \frac{\partial J(n)}{\partial \mathbf{w}(n-1)}, \tag{2.6} \]

\[ \frac{\partial J(n)}{\partial \mathbf{w}(n-1)} = -\big( y(n) - \hat{y}(n) \big) \frac{\partial \hat{y}(n)}{\partial \mathbf{w}(n-1)}, \tag{2.7} \]
where w(n) is the weight vector containing all model parameters at the time n, and η > 0 is the learning rate. It has been shown that such a learning procedure minimizes the global cost function J provided that the learning rate η is sufficiently small [167].

The other approach is called batch learning as it uses the whole set of input-output data {u(n), y(n)}, n = 1, . . . , N, to update model parameters. In this technique, the global cost function J is minimized iteratively in such a way that, at each iteration, the parameter changes over all training patterns are accumulated before the parameters are actually changed.

The gradient learning algorithm in its basic version has some serious drawbacks. The fixed learning rate η may be chosen too large, leading to unnecessary oscillations or even the divergence of the learning process. On the other hand, if η is too small, the learning process may become extremely slow. Many different modifications of the gradient algorithm exist which improve and speed up the learning process considerably. Adding a momentum term is one simple way to improve the learning. Including the momentum term smoothes weight changes, making it possible to increase the learning rate η. This may not only speed up the learning process but may also prevent it from getting trapped in a shallow local minimum. Also, applying diverse learning rates for different weights, neurons, or layers may be a useful tool to improve the learning process. Other effective modifications of the gradient algorithm are based on the adaptation of the learning rate η according to some heuristic rules [68].
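As an illustration of the sequential mode, the update rule (2.6) with the gradient (2.7) and an optional momentum term can be sketched as below. This is a minimal sketch, not the book's implementation: the function names `pattern_update` and `train_toy`, the momentum form (previous weight change scaled by a constant), and the toy static model ŷ = w0 + w1·u are assumptions made only for illustration.

```python
def pattern_update(w, y, y_hat, dyhat_dw, eta, prev_delta=None, momentum=0.0):
    """One sequential-mode step following (2.6)-(2.7), with optional momentum.

    w         -- current weights w(n-1)
    dyhat_dw  -- partial derivatives of y_hat(n) w.r.t. each weight
    """
    error = y - y_hat
    # (2.7): dJ(n)/dw = -(y(n) - y_hat(n)) * d y_hat(n)/dw
    grad = [-error * g for g in dyhat_dw]
    if prev_delta is None:
        prev_delta = [0.0] * len(w)
    # (2.6) plus a momentum term (assumed form: momentum * previous change)
    delta = [-eta * g + momentum * d for g, d in zip(grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta


def train_toy(epochs=200, eta=0.05, momentum=0.5):
    """Fit the hypothetical static model y_hat = w[0] + w[1]*u pattern by pattern."""
    data = [(k / 10.0, 1.0 + 2.0 * (k / 10.0)) for k in range(-10, 11)]
    w, delta = [0.0, 0.0], None
    for _ in range(epochs):
        for u, y in data:
            y_hat = w[0] + w[1] * u
            w, delta = pattern_update(w, y, y_hat, [1.0, u], eta, delta, momentum)
    return w


if __name__ == "__main__":
    print(train_toy())
```

For a Wiener model, `dyhat_dw` would be filled with the partial derivatives derived in Section 2.4; the update step itself stays the same.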
It is well known that parallel models can lose their stability during the learning process even if the identified system is stable [121]. A common way to avoid this is to keep the learning rate η small. To preserve the stability of Wiener models, some other heuristic rules can be applied easily. For example, asymptotic stability of the model can be tested after each parameter update and, if the model becomes unstable, the unstable poles can be moved back into the unit circle by scaling. This requires recalculating and scaling the parameters of the linear dynamic model to keep the steady-state gain of the model unchanged. Another simple approach, which can be applied in the sequential mode, is based on the idea of a trial update of parameters, testing the stability of the model, and performing an actual update only if the obtained model is asymptotically stable; otherwise the parameters are not updated and the next training pattern is processed.
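The trial-update heuristic can be sketched as follows. The text does not say how stability is tested, so the test used here is an assumption: a Schur-Cohn (step-down) recursion on the denominator coefficients, which accepts the polynomial only if every reflection coefficient has magnitude below one. The function names are hypothetical.

```python
def is_asymptotically_stable(a):
    """Schur-Cohn (step-down) test for A(q^-1) = 1 + a[0] q^-1 + ... + a[na-1] q^-na.

    Returns True when all poles of 1/A lie strictly inside the unit circle.
    """
    coeffs = list(a)
    while coeffs:
        k = coeffs[-1]                      # reflection coefficient of this order
        if abs(k) >= 1.0:
            return False
        n = len(coeffs)
        denom = 1.0 - k * k
        # reduce the polynomial order by one
        coeffs = [(coeffs[i] - k * coeffs[n - 2 - i]) / denom for i in range(n - 1)]
    return True


def trial_update(a, b, delta_a, delta_b):
    """Apply the parameter changes only if the updated model stays stable."""
    a_try = [ai + d for ai, d in zip(a, delta_a)]
    b_try = [bi + d for bi, d in zip(b, delta_b)]
    if is_asymptotically_stable(a_try):
        return a_try, b_try, True
    # reject the update; the next training pattern is processed with old values
    return list(a), list(b), False


if __name__ == "__main__":
    print(is_asymptotically_stable([-1.5, 0.7]))
    print(trial_update([-1.5, 0.7], [1.0], [-0.5, 0.29], [0.1]))
```

A root-finding routine would serve equally well for the stability test; the recursion is used here only because it needs no external library.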
2.3 Series-parallel and parallel neural network Wiener models

In analogy with linear dynamic models, neural network Wiener models can be used in two basic configurations known as series-parallel and parallel models. Neural network architectures of both of these models are introduced and discussed below.

2.3.1 SISO Wiener models

To introduce series-parallel Wiener models, it is necessary to assume that the nonlinear function $f(\cdot)$ is invertible. A series-parallel neural network Wiener model, shown in Fig. 2.2, is composed of a multilayer perceptron model of the inverse nonlinear element, a linear node with two tapped delay lines, used as a model of the linear dynamic system, and another multilayer perceptron used as a model of the nonlinear element. The inputs to the model are the system input $u(n)$ and the system output $y(n)$, and the model is of the feedforward type. Multilayer perceptrons are universal approximators. This means that the multilayer perceptron with at least one hidden layer can approximate any smooth function to an arbitrary degree of accuracy as the number of hidden layer neurons increases [37]. Multilayer perceptrons with one hidden layer are most common. Therefore, we assume that both the model of the nonlinear element and the inverse model of the nonlinear element have the same architecture, and contain one hidden layer of $M$ nonlinear processing elements; see Fig. 2.3 for the nonlinear element model and Fig. 2.4 for the inverse nonlinear element model. The output $\hat{y}(n)$ of the series-parallel neural network Wiener model is

\[ \hat{y}(n) = \hat{f}\big( \hat{s}(n), \mathbf{w} \big), \tag{2.8} \]
Fig. 2.2. Series-parallel SISO neural network Wiener model
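with

\[ \hat{s}(n) = -\sum_{m=1}^{na} \hat{a}_m\, \hat{g}\big( y(n-m), \mathbf{v} \big) + \sum_{m=1}^{nb} \hat{b}_m u(n-m), \tag{2.9} \]

\[ \hat{f}\big( \hat{s}(n), \mathbf{w} \big) = \sum_{j=1}^{M} w_{1j}^{(2)} \varphi\big( x_j(n) \big) + w_{10}^{(2)}, \tag{2.10} \]

\[ x_j(n) = w_{j1}^{(1)} \hat{s}(n) + w_{j0}^{(1)}, \tag{2.11} \]

\[ \hat{g}\big( y(n), \mathbf{v} \big) = \sum_{j=1}^{M} v_{1j}^{(2)} \varphi\big( z_j(n) \big) + v_{10}^{(2)}, \tag{2.12} \]

\[ z_j(n) = v_{j1}^{(1)} y(n) + v_{j0}^{(1)}, \tag{2.13} \]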
where the function $\hat{f}(\cdot)$ describes the nonlinear element model, the function $\hat{g}(\cdot)$ describes the inverse nonlinear element model, $\varphi(\cdot)$ is the activation function, $\hat{a}_1, \ldots, \hat{a}_{na}$, $\hat{b}_1, \ldots, \hat{b}_{nb}$ are the parameters of the linear dynamic model, $\mathbf{w} = [w_{10}^{(1)} \ldots w_{M1}^{(1)}\; w_{10}^{(2)} \ldots w_{1M}^{(2)}]^T$ is the parameter (weight) vector of the nonlinear element model, and $\mathbf{v} = [v_{10}^{(1)} \ldots v_{M1}^{(1)}\; v_{10}^{(2)} \ldots v_{1M}^{(2)}]^T$ is the parameter vector of the inverse nonlinear element model. Note that $\hat{g}(\cdot)$ does not denote the inverse of $\hat{f}(\cdot)$ but describes the inverse nonlinear element.

The architecture of the parallel model (Fig. 2.5), which does not contain any inverse model, is even simpler than that of the series-parallel one. The parallel model is of the recurrent type as it contains a single feedback connection, and $u(n)$ is the only input to the model.

Fig. 2.3. Neural network model of the SISO nonlinear element

The output $\hat{y}(n)$ of the parallel Wiener model is given by

\[ \hat{y}(n) = \hat{f}\big( \hat{s}(n), \mathbf{w} \big), \tag{2.14} \]

with $\hat{f}(\cdot)$ defined by (2.10) and (2.11), and where the output of the linear dynamic model is given by the following difference equation:

\[ \hat{s}(n) = -\sum_{m=1}^{na} \hat{a}_m \hat{s}(n-m) + \sum_{m=1}^{nb} \hat{b}_m u(n-m). \tag{2.15} \]

Using the backward shift operator notation, (2.15) can be written as

\[ \hat{s}(n) = \big[ 1 - \hat{A}(q^{-1}) \big] \hat{s}(n) + \hat{B}(q^{-1}) u(n), \tag{2.16} \]
where $\hat{A}(q^{-1}) = 1 + \hat{a}_1 q^{-1} + \cdots + \hat{a}_{na} q^{-na}$ and $\hat{B}(q^{-1}) = \hat{b}_1 q^{-1} + \cdots + \hat{b}_{nb} q^{-nb}$.

Note that the characterization of the Wiener system is not unique as the linear dynamic system and the nonlinear element are connected in series. In other words, Wiener systems described by $\big( B(q^{-1})/A(q^{-1}) \big)/\alpha$ and $f\big( \alpha s(n) \big)$ reveal the same input-output behavior for any $\alpha \neq 0$. Therefore, to obtain a unique characterization of the neural network model, either the linear dynamic model or the nonlinear element model should be normalized. For example, to normalize the gain of the linear dynamic model to 1, the model weights are scaled as follows: $\hat{b}_k = \tilde{b}_k/\alpha$, $k = 1, \ldots, nb$, $w_{j1}^{(1)} = \alpha \tilde{w}_{j1}^{(1)}$, $j = 1, \ldots, M$, where $\tilde{b}_k$, $\tilde{w}_{j1}^{(1)}$ denote the parameters of the unnormalized model, and $\alpha = \tilde{B}(1)/\tilde{A}(1)$ is the linear dynamic model gain.
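The parallel SISO model (2.14)–(2.15) and the gain normalization described above can be sketched as follows. This is a sketch under stated assumptions: the activation function is taken to be tanh, initial conditions are zero, and all function names and parameter values are hypothetical.

```python
import math


def simulate_linear(a, b, u):
    """Linear dynamic model (2.15) with zero initial conditions.

    a[m-1] = a_m, b[m-1] = b_m; returns the sequence s_hat(0..N-1).
    """
    s = []
    for n in range(len(u)):
        acc = 0.0
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc -= am * s[n - m]
        for m, bm in enumerate(b, start=1):
            if n - m >= 0:
                acc += bm * u[n - m]
        s.append(acc)
    return s


def mlp(s, w1, w10, w2, w20):
    """Nonlinear element model (2.10)-(2.11) with phi = tanh (an assumption).

    w1[j] = w_j1^(1), w10[j] = w_j0^(1), w2[j] = w_1j^(2), w20 = w_10^(2).
    """
    return sum(c * math.tanh(p * s + q) for p, q, c in zip(w1, w10, w2)) + w20


def parallel_wiener(a, b, w1, w10, w2, w20, u):
    """Parallel Wiener model output (2.14) for the whole input sequence."""
    return [mlp(s, w1, w10, w2, w20) for s in simulate_linear(a, b, u)]


def normalize_gain(a, b, w1):
    """Scale b_k by 1/alpha and w_j1^(1) by alpha, alpha = B(1)/A(1)."""
    alpha = sum(b) / (1.0 + sum(a))
    return [bk / alpha for bk in b], [alpha * w for w in w1], alpha


if __name__ == "__main__":
    u = [math.sin(0.3 * n) for n in range(50)]
    a, b = [-0.5], [0.2, 0.1]
    w1, w10, w2, w20 = [0.8, -1.2], [0.1, 0.3], [1.5, -0.7], 0.2
    b_n, w1_n, alpha = normalize_gain(a, b, w1)
    y_raw = parallel_wiener(a, b, w1, w10, w2, w20, u)
    y_norm = parallel_wiener(a, b_n, w1_n, w10, w2, w20, u)
    print(alpha, max(abs(p - q) for p, q in zip(y_raw, y_norm)))
```

The two output sequences agree up to rounding, since the factor α removed from the linear part is absorbed by the first-layer weights.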
Fig. 2.4. Neural network model of the SISO inverse nonlinear element

Fig. 2.5. Parallel SISO neural network Wiener model
2.3.2 MIMO Wiener models

Consider a parallel MIMO neural network Wiener model with $nu$ inputs, $ny$ outputs, $ns$ outputs of the MIMO linear dynamic model, and $M$ nonlinear nodes in the hidden layer. Assume that a single hidden layer multilayer perceptron with $ns$ inputs and $ny$ outputs, containing $M$ nonlinear nodes in its hidden layer, is used as a model of the nonlinear element (Fig. 2.6). The output $\hat{\mathbf{y}}(n)$ of the model at the time $n$ is

\[ \hat{\mathbf{y}}(n) = \hat{\mathbf{f}}\big( \hat{\mathbf{s}}(n), \mathbf{W} \big), \tag{2.17} \]
Fig. 2.6. Neural network model of the MIMO nonlinear element
where

\[ \hat{\mathbf{y}}(n) = \big[ \hat{y}_1(n) \ldots \hat{y}_{ny}(n) \big]^T, \tag{2.18} \]

\[ \hat{\mathbf{f}}\big( \hat{\mathbf{s}}(n), \mathbf{W} \big) = \big[ \hat{f}_1\big( \hat{\mathbf{s}}(n), \mathbf{w}_1 \big) \ldots \hat{f}_{ny}\big( \hat{\mathbf{s}}(n), \mathbf{w}_{ny} \big) \big]^T, \tag{2.19} \]

\[ \hat{\mathbf{s}}(n) = \big[ \hat{s}_1(n) \ldots \hat{s}_{ns}(n) \big]^T, \tag{2.20} \]

\[ \mathbf{W} = \big[ w_{10}^{(1)} \ldots w_{Mns}^{(1)}\; w_{10}^{(2)} \ldots w_{nyM}^{(2)} \big]^T, \tag{2.21} \]

\[ \mathbf{w}_t = \big[ w_{10}^{(1)} \ldots w_{Mns}^{(1)}\; w_{t0}^{(2)} \ldots w_{tM}^{(2)} \big]^T, \tag{2.22} \]

where $t = 1, \ldots, ny$. Then the $t$th output of the nonlinear element model is

\[ \hat{y}_t(n) = \hat{f}_t\big( \hat{\mathbf{s}}(n), \mathbf{w}_t \big) = \sum_{j=1}^{M} w_{tj}^{(2)} \varphi\big( x_j(n) \big) + w_{t0}^{(2)}, \tag{2.23} \]

\[ x_j(n) = \sum_{i=1}^{ns} w_{ji}^{(1)} \hat{s}_i(n) + w_{j0}^{(1)}. \tag{2.24} \]

The output $\hat{\mathbf{s}}(n)$ of the MIMO dynamic model can be expressed as

\[ \hat{\mathbf{s}}(n) = -\sum_{m=1}^{na} \hat{\mathbf{A}}^{(m)} \hat{\mathbf{s}}(n-m) + \sum_{m=1}^{nb} \hat{\mathbf{B}}^{(m)} \mathbf{u}(n-m), \tag{2.25} \]

where

\[ \mathbf{u}(n) = \big[ u_1(n) \ldots u_{nu}(n) \big]^T, \tag{2.26} \]

\[ \hat{\mathbf{A}}^{(m)} \in \mathbb{R}^{ns \times ns}, \tag{2.27} \]

\[ \hat{\mathbf{B}}^{(m)} \in \mathbb{R}^{ns \times nu}. \tag{2.28} \]

Fig. 2.7. Inverse neural network model of the MIMO nonlinear element
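The series-parallel MIMO Wiener model is more complex than the parallel one as it uses an inverse model of the nonlinear element (Fig. 2.7). To derive the series-parallel MIMO Wiener model, assume that the nonlinear function $\mathbf{f}(\cdot)$ describing the nonlinear element is invertible. Then the output of the MIMO linear dynamic model can be expressed as

\[ \hat{\mathbf{s}}(n) = -\sum_{m=1}^{na} \hat{\mathbf{A}}^{(m)} \hat{\mathbf{g}}\big( \mathbf{y}(n-m), \mathbf{V} \big) + \sum_{m=1}^{nb} \hat{\mathbf{B}}^{(m)} \mathbf{u}(n-m), \tag{2.29} \]

where

\[ \hat{\mathbf{g}}\big( \mathbf{y}(n), \mathbf{V} \big) = \big[ \hat{g}_1\big( \mathbf{y}(n), \mathbf{v}_1 \big) \ldots \hat{g}_{ns}\big( \mathbf{y}(n), \mathbf{v}_{ns} \big) \big]^T, \tag{2.30} \]

\[ \mathbf{V} = \big[ v_{10}^{(1)} \ldots v_{Mny}^{(1)}\; v_{10}^{(2)} \ldots v_{nsM}^{(2)} \big]^T, \tag{2.31} \]

\[ \mathbf{v}_t = \big[ v_{10}^{(1)} \ldots v_{Mny}^{(1)}\; v_{t0}^{(2)} \ldots v_{tM}^{(2)} \big]^T, \tag{2.32} \]

where $t = 1, \ldots, ns$. Note that $\hat{\mathbf{g}}(\cdot)$ in (2.29) describes the inverse nonlinear element model and does not denote the inverse of $\hat{\mathbf{f}}(\cdot)$. Assuming that the inverse nonlinear element is modelled by a single hidden layer perceptron with $ny$ inputs, $ns$ outputs, and $M$ nonlinear nodes (Fig. 2.7), its $t$th output is

\[ \hat{g}_t\big( \mathbf{y}(n), \mathbf{v}_t \big) = \sum_{j=1}^{M} v_{tj}^{(2)} \varphi\big( z_j(n) \big) + v_{t0}^{(2)}, \tag{2.33} \]

\[ z_j(n) = \sum_{i=1}^{ny} v_{ji}^{(1)} y_i(n) + v_{j0}^{(1)}. \tag{2.34} \]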
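One time step of the parallel MIMO model, i.e., (2.25) followed by (2.23)–(2.24), can be sketched as below. The nested-list matrix layout, the tanh activation, and the function names are assumptions made for illustration only. The series-parallel variant (2.29) would pass the outputs of the inverse model in place of the delayed values of ŝ.

```python
import math


def matvec(matrix, x):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in matrix]


def mimo_linear_step(A, B, s_hist, u_hist):
    """MIMO linear dynamic model (2.25).

    A[m-1] is the ns x ns matrix A^(m), B[m-1] is the ns x nu matrix B^(m),
    s_hist[m-1] = s_hat(n-m), u_hist[m-1] = u(n-m).
    """
    ns = len(A[0])
    s = [0.0] * ns
    for Am, s_past in zip(A, s_hist):
        s = [si - vi for si, vi in zip(s, matvec(Am, s_past))]
    for Bm, u_past in zip(B, u_hist):
        s = [si + vi for si, vi in zip(s, matvec(Bm, u_past))]
    return s


def mimo_nonlinear(s, W1, w10, W2, w20):
    """MIMO nonlinear element model (2.23)-(2.24) with phi = tanh (assumed).

    W1 is M x ns (w_ji^(1)), w10 has M entries (w_j0^(1)),
    W2 is ny x M (w_tj^(2)), w20 has ny entries (w_t0^(2)).
    """
    x = [xj + bj for xj, bj in zip(matvec(W1, s), w10)]
    phi = [math.tanh(xj) for xj in x]
    return [yt + bt for yt, bt in zip(matvec(W2, phi), w20)]


if __name__ == "__main__":
    A = [[[0.5, 0.0], [0.0, -0.2]]]          # na = 1, ns = 2
    B = [[[1.0], [2.0]]]                     # nb = 1, nu = 1
    s = mimo_linear_step(A, B, [[1.0, 1.0]], [[3.0]])
    y = mimo_nonlinear(s, [[1.0, 0.0]], [0.0], [[2.0]], [0.5])
    print(s, y)
```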
2.4 Gradient calculation

The calculation of the gradient in series-parallel Wiener models can be carried out with the well-known backpropagation algorithm. Series-parallel Wiener models are specialized multilayer perceptron networks with a multilayer perceptron model of the inverse nonlinear element followed by a single linear node model of the linear dynamic system followed by another multilayer perceptron structure of the nonlinear element model. In contrast, parallel Wiener models are dynamic systems and the corresponding neural networks are recurrent ones with one linear recurrent node that models the linear dynamic system followed by the multilayer perceptron model of the nonlinear element. This complicates the calculation of the gradient considerably. A crude approximation of the gradient can be obtained with the backpropagation method, which does not take into account the dynamic nature of the model. To evaluate the gradient more precisely, two other methods, known as the sensitivity method and the backpropagation through time method, are used [88].

All pattern algorithms considered here have their batch counterparts. Given the rules of gradient computation, the batch versions of the algorithms can be derived easily, taking into account the fact that parameter updating should be performed iteratively, in a cumulative manner, after the presentation of all training patterns.

2.4.1 Series-parallel SISO model. Backpropagation method

Taking into account (2.9)–(2.13), the differentiation of (2.8) w.r.t. the model parameters $\hat{a}_k$, $k = 1, \ldots, na$, $\hat{b}_k$, $k = 1, \ldots, nb$, and $v_c$, $w_c$, $c = 1, \ldots, 3M + 1$, where $v_c$ and $w_c$ denote the $c$th elements of $\mathbf{v}$ and $\mathbf{w}$, yields

\[ \frac{\partial \hat{y}(n)}{\partial \hat{a}_k} = \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)} \frac{\partial \hat{s}(n)}{\partial \hat{a}_k} = -\hat{g}\big( y(n-k), \mathbf{v} \big) \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)}, \tag{2.35} \]

\[ \frac{\partial \hat{y}(n)}{\partial \hat{b}_k} = \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)} \frac{\partial \hat{s}(n)}{\partial \hat{b}_k} = u(n-k) \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)}, \tag{2.36} \]

\[ \frac{\partial \hat{y}(n)}{\partial v_c} = \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)} \frac{\partial \hat{s}(n)}{\partial v_c} = -\frac{\partial \hat{y}(n)}{\partial \hat{s}(n)} \sum_{m=1}^{na} \hat{a}_m \frac{\partial \hat{g}\big( y(n-m), \mathbf{v} \big)}{\partial v_c}, \tag{2.37} \]

\[ \frac{\partial \hat{y}(n)}{\partial w_c} = \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial w_c}. \tag{2.38} \]
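From (2.10) and (2.11), it follows that the partial derivative of the Wiener model output w.r.t. the output of the linear dynamic model is

\[ \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)} = \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial \hat{s}(n)} = \sum_{j=1}^{M} w_{1j}^{(2)} \frac{\partial \varphi\big( x_j(n) \big)}{\partial x_j(n)} \frac{\partial x_j(n)}{\partial \hat{s}(n)} = \sum_{j=1}^{M} w_{j1}^{(1)} w_{1j}^{(2)} \varphi'\big( x_j(n) \big). \tag{2.39} \]

Differentiating (2.12), partial derivatives of the inverse nonlinear element model output w.r.t. its parameters $v_{j1}^{(1)}$, $v_{j0}^{(1)}$, $v_{1j}^{(2)}$, $j = 1, \ldots, M$, and $v_{10}^{(2)}$ can be calculated as

\[ \frac{\partial \hat{g}\big( y(n), \mathbf{v} \big)}{\partial v_{j1}^{(1)}} = \frac{\partial \hat{g}\big( y(n), \mathbf{v} \big)}{\partial \varphi\big( z_j(n) \big)} \frac{\partial \varphi\big( z_j(n) \big)}{\partial z_j(n)} \frac{\partial z_j(n)}{\partial v_{j1}^{(1)}} = v_{1j}^{(2)} \varphi'\big( z_j(n) \big) y(n), \tag{2.40} \]

\[ \frac{\partial \hat{g}\big( y(n), \mathbf{v} \big)}{\partial v_{j0}^{(1)}} = \frac{\partial \hat{g}\big( y(n), \mathbf{v} \big)}{\partial \varphi\big( z_j(n) \big)} \frac{\partial \varphi\big( z_j(n) \big)}{\partial z_j(n)} \frac{\partial z_j(n)}{\partial v_{j0}^{(1)}} = v_{1j}^{(2)} \varphi'\big( z_j(n) \big), \tag{2.41} \]

\[ \frac{\partial \hat{g}\big( y(n), \mathbf{v} \big)}{\partial v_{1j}^{(2)}} = \varphi\big( z_j(n) \big), \tag{2.42} \]

\[ \frac{\partial \hat{g}\big( y(n), \mathbf{v} \big)}{\partial v_{10}^{(2)}} = 1. \tag{2.43} \]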
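From (2.10) and (2.11), it follows that partial derivatives of the output of the nonlinear element model w.r.t. the weights $w_{j1}^{(1)}$, $w_{j0}^{(1)}$, $w_{1j}^{(2)}$, $j = 1, \ldots, M$, and $w_{10}^{(2)}$ are

\[ \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial w_{j1}^{(1)}} = \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial \varphi\big( x_j(n) \big)} \frac{\partial \varphi\big( x_j(n) \big)}{\partial x_j(n)} \frac{\partial x_j(n)}{\partial w_{j1}^{(1)}} = w_{1j}^{(2)} \varphi'\big( x_j(n) \big) \hat{s}(n), \tag{2.44} \]

\[ \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial w_{j0}^{(1)}} = \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial \varphi\big( x_j(n) \big)} \frac{\partial \varphi\big( x_j(n) \big)}{\partial x_j(n)} \frac{\partial x_j(n)}{\partial w_{j0}^{(1)}} = w_{1j}^{(2)} \varphi'\big( x_j(n) \big), \tag{2.45} \]

\[ \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial w_{1j}^{(2)}} = \varphi\big( x_j(n) \big), \tag{2.46} \]

\[ \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial w_{10}^{(2)}} = 1. \tag{2.47} \]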
In spite of the fact that the series-parallel model is of the feedforward type, its training with the backpropagation (BPS) method is quite complex. This comes from the fact that the total number of hidden layers in the series-parallel model equals 4. Moreover, although both the nonlinear element and its inverse are identified, only an approximate inverse relationship between them is obtained in practice.

2.4.2 Parallel SISO model. Backpropagation method

The parallel Wiener model does not contain any inverse model of the nonlinear element (Fig. 2.5). This simplifies the training of the model considerably. On the other hand, as only an approximate gradient is calculated with the backpropagation (BPP) method, a very slow convergence rate can be observed. In the BPP method, the dependence of the past linear dynamic model outputs $\hat{s}(n-m)$, $m = 1, \ldots, na$, on the parameters $\hat{a}_k$ and $\hat{b}_k$ is neglected. Hence, from (2.14) and (2.15), it follows that

\[ \frac{\partial \hat{y}(n)}{\partial \hat{a}_k} = \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)} \frac{\partial \hat{s}(n)}{\partial \hat{a}_k} = -\hat{s}(n-k) \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)}, \quad k = 1, \ldots, na, \tag{2.48} \]

\[ \frac{\partial \hat{y}(n)}{\partial \hat{b}_k} = \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)} \frac{\partial \hat{s}(n)}{\partial \hat{b}_k} = u(n-k) \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)}, \quad k = 1, \ldots, nb, \tag{2.49} \]

\[ \frac{\partial \hat{y}(n)}{\partial w_c} = \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial w_c}. \tag{2.50} \]

The partial derivative of the parallel Wiener model output w.r.t. the output of the linear dynamic model is

\[ \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)} = \frac{\partial \hat{f}\big( \hat{s}(n), \mathbf{w} \big)}{\partial \hat{s}(n)} = \sum_{j=1}^{M} w_{1j}^{(2)} \frac{\partial \varphi\big( x_j(n) \big)}{\partial x_j(n)} \frac{\partial x_j(n)}{\partial \hat{s}(n)} = \sum_{j=1}^{M} w_{j1}^{(1)} w_{1j}^{(2)} \varphi'\big( x_j(n) \big). \tag{2.51} \]

Partial derivatives of the parallel Wiener model output w.r.t. the weights $w_{j1}^{(1)}$, $w_{j0}^{(1)}$, $w_{1j}^{(2)}$, $j = 1, \ldots, M$, and $w_{10}^{(2)}$ are calculated in the same way as for the series-parallel model using (2.44) – (2.47).

2.4.3 Parallel SISO model. Sensitivity method

The sensitivity method (SM) differs from the BPP method markedly in the calculation of partial derivatives of the output of the linear dynamic model
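w.r.t. its parameters. In general, as the SM uses a more accurate evaluation of the gradient than the BPP method, a higher convergence rate can be expected. Assuming that the parameters $\hat{a}_k$ and $\hat{b}_k$ do not change and differentiating (2.15), we have

\[ \frac{\partial \hat{s}(n)}{\partial \hat{a}_k} = -\hat{s}(n-k) - \sum_{m=1}^{na} \hat{a}_m \frac{\partial \hat{s}(n-m)}{\partial \hat{a}_k}, \quad k = 1, \ldots, na, \tag{2.52} \]

\[ \frac{\partial \hat{s}(n)}{\partial \hat{b}_k} = u(n-k) - \sum_{m=1}^{na} \hat{a}_m \frac{\partial \hat{s}(n-m)}{\partial \hat{b}_k}, \quad k = 1, \ldots, nb. \tag{2.53} \]

The partial derivatives (2.52) and (2.53) can be computed on-line by simulation, usually with zero initial conditions. The substitution of (2.52) and (2.53) into (2.48) and (2.49) gives

\[ \frac{\partial \hat{y}(n)}{\partial \hat{a}_k} = \left( -\hat{s}(n-k) - \sum_{m=1}^{na} \hat{a}_m \frac{\partial \hat{s}(n-m)}{\partial \hat{a}_k} \right) \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)}, \tag{2.54} \]

\[ \frac{\partial \hat{y}(n)}{\partial \hat{b}_k} = \left( u(n-k) - \sum_{m=1}^{na} \hat{a}_m \frac{\partial \hat{s}(n-m)}{\partial \hat{b}_k} \right) \frac{\partial \hat{y}(n)}{\partial \hat{s}(n)}. \tag{2.55} \]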
The calculation of partial derivatives for the parallel Wiener model is illustrated in Fig. 2.8. Other partial derivatives are calculated in the SM in the same way as in the BPP method. In comparison with the BPP method, the SM is only a little more computationally intensive. The increase in computational burden comes from the simulation of na + nb sensitivity models, and dynamic models of a low order are commonly used. Note that to obtain the exact value of the gradient, the parameters $\hat{a}_k$ and $\hat{b}_k$ should be kept constant. That is the case only in the batch mode, in which the parameters are updated after the presentation of all learning patterns. In the sequential mode (pattern learning), the parameters $\hat{a}_k$ and $\hat{b}_k$ are updated after each learning pattern and, as a result, an approximate value of the gradient is obtained. Therefore, to achieve a good approximation accuracy, the learning rate η should be sufficiently small to keep changes in the parameters negligible.

2.4.4 Parallel SISO model. Backpropagation through time method

In the backpropagation through time (BPTT) method, partial derivatives of the model output w.r.t. the weights of the nonlinear element model are calculated in the same way as in the BPS and BPP methods, or the SM, from (2.44) – (2.47). Also, partial derivatives of the model output w.r.t. the output of the linear dynamic model are calculated in the same way from (2.51).
Fig. 2.8. Parallel Wiener model and its sensitivity models
The only difference is in the computation of partial derivatives of the linear dynamic model output w.r.t. its parameters $\hat{a}_k$ and $\hat{b}_k$. To perform this, the linear dynamic model is unfolded back in time. Unfolding (2.15) back in time for one step gives

\[ \begin{aligned} \hat{s}(n) &= -\hat{a}_1 \hat{s}(n-1) - \sum_{m=2}^{na} \hat{a}_m \hat{s}(n-m) + \sum_{m=1}^{nb} \hat{b}_m u(n-m) \\ &= -\hat{a}_1 \left( -\sum_{m=1}^{na} \hat{a}_m \hat{s}(n-m-1) + \sum_{m=1}^{nb} \hat{b}_m u(n-m-1) \right) - \sum_{m=2}^{na} \hat{a}_m \hat{s}(n-m) + \sum_{m=1}^{nb} \hat{b}_m u(n-m). \end{aligned} \tag{2.56} \]
Such an unfolding procedure can be continued until the initial time step is reached. The unfolded Wiener model is no longer of the recurrent type and can be represented by a feedforward neural network with the model of the nonlinear element on its top and copies of the linear dynamic model below; see Fig. 2.9 for an example of the third-order model. As discrete time elapses, the number of copies of the linear dynamic model increases. At the time n, a fully unfolded model contains n copies of the linear dynamic model. To keep computational complexity constant, unfolding in time is commonly restricted to only a given number of time steps K in a method called truncated BPTT. In this way, only an approximate gradient is computed.

Fig. 2.9. Unfolded-in-time Wiener model of the third order (copies of the linear dynamic model indexed $i_1 = 0, 1, 2, \ldots, n-1$)

The unfolded network is of the feedforward type and differs from a common multilayer perceptron distinctly. First, apart from adjustable weights, marked in Fig. 2.9 with thick lines, it contains nonadjustable short-circuit connections between the neighboring layers, marked with thin lines. Then for models of the order na > 2, some of these short-circuit weights are connected in series and form lines connecting more distant layers. Finally, the unfolded model contains as many copies of the linear dynamic model as unfolded time steps. All these differences should be taken into account in the training algorithm. A detailed description of the truncated BPTT algorithm is given in Appendix 2.1.

2.4.5 Series-parallel MIMO model. Backpropagation method

In a sequential version of the gradient-based learning algorithm extended to the MIMO case, the following cost function is minimized w.r.t. the model parameters:

\[ J(n) = \frac{1}{2} \sum_{t=1}^{ny} \big( y_t(n) - \hat{y}_t(n) \big)^2, \tag{2.57} \]

where $\hat{y}_t(n) = \hat{f}_t\big( \hat{\mathbf{s}}(n), \mathbf{w}_t \big)$ is the $t$th output of the Wiener model, and $\hat{\mathbf{s}}(n)$ denotes the vector of the outputs of the linear dynamic model. From (2.29), it follows that the $l$th output of the linear dynamic model can be written as
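\[ \hat{s}_l(n) = -\sum_{m=1}^{na} \sum_{m_1=1}^{ns} \hat{a}_{lm_1}^{(m)} \hat{g}_{m_1}\big( \mathbf{y}(n-m), \mathbf{v}_{m_1} \big) + \sum_{m=1}^{nb} \sum_{m_1=1}^{nu} \hat{b}_{lm_1}^{(m)} u_{m_1}(n-m), \tag{2.58} \]

where $l = 1, \ldots, ns$, and $\hat{a}_{lm_1}^{(m)}$ and $\hat{b}_{lm_1}^{(m)}$ denote the elements of $\hat{\mathbf{A}}^{(m)}$ and $\hat{\mathbf{B}}^{(m)}$. Partial derivatives of the series-parallel model (2.17) and (2.29) can be calculated with the backpropagation method (BPS). Differentiating (2.23), partial derivatives of the model output $\hat{y}_t(n)$, $t = 1, \ldots, ny$, w.r.t. the parameters of the nonlinear element model can be obtained as

\[ \frac{\partial \hat{y}_t(n)}{\partial w_{ji}^{(1)}} = \frac{\partial \hat{f}_t\big( \hat{\mathbf{s}}(n), \mathbf{w}_t \big)}{\partial \varphi\big( x_j(n) \big)} \frac{\partial \varphi\big( x_j(n) \big)}{\partial x_j(n)} \frac{\partial x_j(n)}{\partial w_{ji}^{(1)}} = w_{tj}^{(2)} \varphi'\big( x_j(n) \big) \hat{s}_i(n), \tag{2.59} \]

\[ \frac{\partial \hat{y}_t(n)}{\partial w_{j0}^{(1)}} = \frac{\partial \hat{f}_t\big( \hat{\mathbf{s}}(n), \mathbf{w}_t \big)}{\partial \varphi\big( x_j(n) \big)} \frac{\partial \varphi\big( x_j(n) \big)}{\partial x_j(n)} \frac{\partial x_j(n)}{\partial w_{j0}^{(1)}} = w_{tj}^{(2)} \varphi'\big( x_j(n) \big), \tag{2.60} \]

\[ \frac{\partial \hat{y}_t(n)}{\partial w_{tj}^{(2)}} = \varphi\big( x_j(n) \big), \tag{2.61} \]

\[ \frac{\partial \hat{y}_t(n)}{\partial w_{t0}^{(2)}} = 1, \tag{2.62} \]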
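where $j = 1, \ldots, M$, and $i = 1, \ldots, ns$. From (2.23) and (2.58), it follows that

$$\frac{\partial \hat{y}_t(n)}{\partial \hat{a}^{(k)}_{l m_1}} = \frac{\partial \hat{y}_t(n)}{\partial \hat{s}_l(n)}\,\frac{\partial \hat{s}_l(n)}{\partial \hat{a}^{(k)}_{l m_1}} = -\hat{g}_{m_1}\bigl(\mathbf{y}(n-k), \mathbf{v}_{m_1}\bigr)\,\frac{\partial \hat{y}_t(n)}{\partial \hat{s}_l(n)}, \tag{2.63}$$

$m_1 = 1, \ldots, ns$, $k = 1, \ldots, na$,

$$\frac{\partial \hat{y}_t(n)}{\partial \hat{b}^{(k)}_{l m_1}} = \frac{\partial \hat{y}_t(n)}{\partial \hat{s}_l(n)}\,\frac{\partial \hat{s}_l(n)}{\partial \hat{b}^{(k)}_{l m_1}} = u_{m_1}(n-k)\,\frac{\partial \hat{y}_t(n)}{\partial \hat{s}_l(n)}, \tag{2.64}$$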
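$m_1 = 1, \ldots, nu$, $k = 1, \ldots, nb$. From (2.23) and (2.24), it follows that

$$\frac{\partial \hat{y}_t(n)}{\partial \hat{s}_l(n)} = \frac{\partial \hat{f}_t\bigl(\hat{\mathbf{s}}(n), \mathbf{w}_t\bigr)}{\partial \hat{s}_l(n)} = \sum_{j=1}^{M} w^{(2)}_{tj}\,\frac{\partial \varphi\bigl(x_j(n)\bigr)}{\partial x_j(n)}\,\frac{\partial x_j(n)}{\partial \hat{s}_l(n)} = \sum_{j=1}^{M} w^{(1)}_{jl}\,w^{(2)}_{tj}\,\varphi'\bigl(x_j(n)\bigr). \tag{2.65}$$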
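The rule (2.65) can be checked numerically. The following is a minimal sketch, assuming a hyperbolic tangent activation and the two-layer structure implied by (2.59) – (2.62); the function names, the network sizes and the weight values are hypothetical and chosen only for illustration.

```python
import math

def forward(s, W1, b1, W2, b2):
    """Tanh network: x_j = sum_i W1[j][i]*s[i] + b1[j]; y_t = sum_j W2[t][j]*tanh(x_j) + b2[t]."""
    x = [sum(w * si for w, si in zip(row, s)) + b for row, b in zip(W1, b1)]
    y = [sum(w * math.tanh(xj) for w, xj in zip(row, x)) + b for row, b in zip(W2, b2)]
    return x, y

def output_jacobian(s, W1, b1, W2, b2):
    """d y_t / d s_l as in (2.65): sum_j W1[j][l] * W2[t][j] * phi'(x_j), with phi = tanh."""
    x, _ = forward(s, W1, b1, W2, b2)
    dphi = [1.0 - math.tanh(xj) ** 2 for xj in x]
    return [[sum(W1[j][l] * W2[t][j] * dphi[j] for j in range(len(x)))
             for l in range(len(s))]
            for t in range(len(W2))]

def numeric_jacobian(s, W1, b1, W2, b2, eps=1e-6):
    """Central finite differences, used only to cross-check the analytic result."""
    jac = [[0.0] * len(s) for _ in W2]
    for l in range(len(s)):
        hi = list(s); hi[l] += eps
        lo = list(s); lo[l] -= eps
        y_hi = forward(hi, W1, b1, W2, b2)[1]
        y_lo = forward(lo, W1, b1, W2, b2)[1]
        for t in range(len(W2)):
            jac[t][l] = (y_hi[t] - y_lo[t]) / (2 * eps)
    return jac

# Hypothetical sizes: ns = 2, M = 3, ny = 2; the weights are arbitrary.
W1 = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
b1 = [0.1, -0.2, 0.05]
W2 = [[1.0, -0.5, 0.3], [0.2, 0.4, -0.6]]
b2 = [0.0, 0.1]
s = [0.3, -0.4]
analytic = output_jacobian(s, W1, b1, W2, b2)
numeric = numeric_jacobian(s, W1, b1, W2, b2)
```

The analytic and finite-difference Jacobians should agree up to the truncation error of the central difference.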
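Taking into account (2.58), partial derivatives of the model output w.r.t. the parameters of the inverse nonlinear element model can be calculated according to the following rule:

$$\frac{\partial \hat{y}_t(n)}{\partial v} = \frac{\partial \hat{f}_t\bigl(\hat{\mathbf{s}}(n), \mathbf{w}_t\bigr)}{\partial v} = \sum_{l=1}^{ns}\frac{\partial \hat{y}_t(n)}{\partial \hat{s}_l(n)}\,\frac{\partial \hat{s}_l(n)}{\partial v} = -\sum_{l=1}^{ns}\frac{\partial \hat{y}_t(n)}{\partial \hat{s}_l(n)}\sum_{m=1}^{na}\sum_{m_1=1}^{ns}\hat{a}^{(m)}_{l m_1}\,\frac{\partial \hat{g}_{m_1}\bigl(\mathbf{y}(n-m), \mathbf{v}_{m_1}\bigr)}{\partial v}, \tag{2.66}$$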
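where $v$ is any parameter of the inverse nonlinear element model. From (2.33) and (2.34), it follows that partial derivatives of the inverse nonlinear element output w.r.t. its parameters are

$$\frac{\partial \hat{g}_{m_1}\bigl(\mathbf{y}(n), \mathbf{v}_{m_1}\bigr)}{\partial v^{(1)}_{ji}} = \frac{\partial \hat{g}_{m_1}\bigl(\mathbf{y}(n), \mathbf{v}_{m_1}\bigr)}{\partial \varphi\bigl(z_j(n)\bigr)}\,\frac{\partial \varphi\bigl(z_j(n)\bigr)}{\partial z_j(n)}\,\frac{\partial z_j(n)}{\partial v^{(1)}_{ji}} = v^{(2)}_{m_1 j}\,\varphi'\bigl(z_j(n)\bigr)\,y_i(n), \tag{2.67}$$

$$\frac{\partial \hat{g}_{m_1}\bigl(\mathbf{y}(n), \mathbf{v}_{m_1}\bigr)}{\partial v^{(1)}_{j0}} = \frac{\partial \hat{g}_{m_1}\bigl(\mathbf{y}(n), \mathbf{v}_{m_1}\bigr)}{\partial \varphi\bigl(z_j(n)\bigr)}\,\frac{\partial \varphi\bigl(z_j(n)\bigr)}{\partial z_j(n)}\,\frac{\partial z_j(n)}{\partial v^{(1)}_{j0}} = v^{(2)}_{m_1 j}\,\varphi'\bigl(z_j(n)\bigr), \tag{2.68}$$

$$\frac{\partial \hat{g}_{m_1}\bigl(\mathbf{y}(n), \mathbf{v}_{m_1}\bigr)}{\partial v^{(2)}_{m_1 j}} = \varphi\bigl(z_j(n)\bigr), \tag{2.69}$$

$$\frac{\partial \hat{g}_{m_1}\bigl(\mathbf{y}(n), \mathbf{v}_{m_1}\bigr)}{\partial v^{(2)}_{m_1 0}} = 1, \tag{2.70}$$

where $m_1 = 1, \ldots, ns$, $j = 1, \ldots, M$, $i = 1, \ldots, ny$.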
2.4.6 Parallel MIMO model. Backpropagation method

From (2.25), it follows that the $l$th output of the linear dynamic model can be written as

$$\hat{s}_l(n) = -\sum_{m=1}^{na}\sum_{m_1=1}^{ns}\hat{a}^{(m)}_{l m_1}\,\hat{s}_{m_1}(n-m) + \sum_{m=1}^{nb}\sum_{m_1=1}^{nu}\hat{b}^{(m)}_{l m_1}\,u_{m_1}(n-m), \tag{2.71}$$
where $l = 1, \ldots, ns$. Applying backpropagation rules to the parallel MIMO Wiener model, the dependence of the delayed outputs $\hat{s}_{m_1}(n-m)$, $m = 1, \ldots, na$, on the parameters $\hat{a}^{(m)}_{l m_1}$ and $\hat{b}^{(m)}_{l m_1}$ is neglected. This reduces computational complexity of the backpropagation method for the parallel model (BPP) in comparison with the BPS method as no inverse nonlinear model is utilized. Using the BPP method, the calculation of partial derivatives of the model output w.r.t. the parameters of the nonlinear element model is performed in the same way as in the BPS method according to the formulae (2.59) – (2.62). Also, the calculation of $\partial \hat{y}_t(n)/\partial \hat{b}^{(k)}_{l m_1}$ is performed in the same way as in the BPS method according to (2.64). As the parallel Wiener model uses delayed outputs of the linear dynamic model instead of the outputs of the inverse nonlinear model, BPP differs from BPS in the calculation of $\partial \hat{y}_t(n)/\partial \hat{a}^{(k)}_{l m_1}$:

$$\frac{\partial \hat{y}_t(n)}{\partial \hat{a}^{(k)}_{l m_1}} = \frac{\partial \hat{y}_t(n)}{\partial \hat{s}_l(n)}\,\frac{\partial \hat{s}_l(n)}{\partial \hat{a}^{(k)}_{l m_1}} = -\hat{s}_{m_1}(n-k)\,\frac{\partial \hat{y}_t(n)}{\partial \hat{s}_l(n)}, \quad k = 1, \ldots, na. \tag{2.72}$$
2.4.7 Parallel MIMO model. Sensitivity method

Assume that the linear dynamic model is time invariant. As in the case of the BPS and BPP methods for MIMO Wiener models, the calculation of partial derivatives of model output w.r.t. the parameters of the nonlinear element model in the SM is performed in the same way. From (2.71), it follows that to calculate $\partial \hat{y}_t(n)/\partial \hat{a}^{(k)}_{rp}$, $t = 1, \ldots, ny$, $k = 1, \ldots, na$, $r = 1, \ldots, ns$, $p = 1, \ldots, ns$, and $\partial \hat{y}_t(n)/\partial \hat{b}^{(k)}_{rp}$, $t = 1, \ldots, ny$, $k = 1, \ldots, nb$, $r = 1, \ldots, ns$, $p = 1, \ldots, nu$, the following set of linear difference equations is solved on-line by simulation:

$$\frac{\partial \hat{s}_l(n)}{\partial \hat{a}^{(k)}_{rp}} = -\delta_{lr}\,\hat{s}_p(n-k) - \sum_{m=1}^{na}\sum_{m_1=1}^{ns}\hat{a}^{(m)}_{l m_1}\,\frac{\partial \hat{s}_{m_1}(n-m)}{\partial \hat{a}^{(k)}_{rp}}, \tag{2.73}$$

$$\frac{\partial \hat{s}_l(n)}{\partial \hat{b}^{(k)}_{rp}} = \delta_{lr}\,u_p(n-k) - \sum_{m=1}^{na}\sum_{m_1=1}^{ns}\hat{a}^{(m)}_{l m_1}\,\frac{\partial \hat{s}_{m_1}(n-m)}{\partial \hat{b}^{(k)}_{rp}}, \tag{2.74}$$

where

$$\delta_{lr} = \begin{cases} 1 & \text{for } l = r \\ 0 & \text{for } l \neq r \end{cases}. \tag{2.75}$$
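For a single-input single-output model (ns = nu = 1), the recursions (2.73) and (2.74) reduce to two scalar difference equations per parameter. The sketch below simulates the linear model together with its sensitivity models, starting from zero initial conditions; the function names, the coefficient values and the test input are assumptions made for illustration.

```python
import math

def past(seq, n):
    """Value of a sequence at time n, with zero initial conditions for n < 0."""
    return seq[n] if n >= 0 else 0.0

def simulate_with_sensitivities(a, b, u):
    """SISO special case (ns = nu = 1) of (2.71), (2.73) and (2.74).

    Model: s(n) = -sum_m a[m-1]*s(n-m) + sum_m b[m-1]*u(n-m).
    Returns s, ds/da_k and ds/db_k for every time step.
    """
    na, nb = len(a), len(b)
    s, ds_da, ds_db = [], [[] for _ in a], [[] for _ in b]
    for n in range(len(u)):
        s.append(-sum(a[m] * past(s, n - 1 - m) for m in range(na))
                 + sum(b[m] * past(u, n - 1 - m) for m in range(nb)))
        for k in range(na):
            # ds(n)/da_k = -s(n-k) - sum_m a_m * ds(n-m)/da_k
            ds_da[k].append(-past(s, n - 1 - k)
                            - sum(a[m] * past(ds_da[k], n - 1 - m) for m in range(na)))
        for k in range(nb):
            # ds(n)/db_k = u(n-k) - sum_m a_m * ds(n-m)/db_k
            ds_db[k].append(past(u, n - 1 - k)
                            - sum(a[m] * past(ds_db[k], n - 1 - m) for m in range(na)))
    return s, ds_da, ds_db

# Illustrative stable second-order model and a deterministic test input.
a = [-1.3231, 0.4346]
b = [0.0635, 0.0481]
u = [math.sin(0.3 * n) for n in range(60)]
s, ds_da, ds_db = simulate_with_sensitivities(a, b, u)
```

Because the parameters are held fixed during the simulation, the sensitivities can be cross-checked against finite differences of the simulated output.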
To solve (2.73) and (2.74), zero initial conditions are assumed.

2.4.8 Parallel MIMO model. Backpropagation through time method

Detailed derivation of the truncated backpropagation through time (BPTT) method for the parallel MIMO Wiener model is given in Appendix 2.2.

2.4.9 Accuracy of gradient calculation with truncated BPTT

An important issue that arises in the practical application of truncated BPTT is a proper choice of the number of unfolded time steps K. The number K should be large enough to ensure a good approximation of the gradient. On the other hand, too large a K does not improve the convergence rate significantly but increases computational complexity of the algorithm. Theorems 2.1 and 2.2 below give some insight into gradient calculation accuracy on the basis of the linear dynamic model and its sensitivity models and allow one to draw a conclusion on how the choice of K is to be made [89].

Theorem 2.1. Define the computation error

$$\Delta\hat{s}_{\hat{a}_k}(n) = \frac{\partial \hat{s}(n)}{\partial \hat{a}_k} - \frac{\partial^{+} \hat{s}(n)}{\partial \hat{a}_k},$$

where $\partial \hat{s}(n)/\partial \hat{a}_k$, $k = 1, \ldots, na$, denote partial derivatives calculated with the BPTT method, i.e. unfolding the model (2.15) $n - 1$ times back in time, and $\partial^{+} \hat{s}(n)/\partial \hat{a}_k$ denote partial derivatives calculated with the truncated BPTT method unfolding the model (2.15) K times back in time. Assume that

(A1) The linear dynamic model $\hat{B}(q^{-1})/\hat{A}(q^{-1})$ is asymptotically stable;
(A2) $\hat{s}(n) = 0$, $n = 0, \ldots, -na + 1$; $u(n) = 0$, $n = -1, \ldots, -nb$;
(A3) $\partial \hat{s}(n)/\partial \hat{a}_k = 0$, $n = 0, \ldots, -na + 1$;
(A4) The input $u(n)$, $n = 0, 1, \ldots$, is a sequence of zero-mean i.i.d. random variables of finite moments,

$$E[u(n)] = 0, \qquad E[u(n)u(m)] = \begin{cases} \sigma^2 & \text{for } n = m \\ 0 & \text{for } n \neq m \end{cases}.$$

Then

$$\operatorname{var}\,\Delta\hat{s}_{\hat{a}_k}(n) = \sigma^2 \sum_{i_1=K+1}^{n-k}\left(\sum_{i_2=K+1}^{i_1} h_1(i_2)\,h_2(i_1 - i_2)\right)^{2}, \quad k = 1, \ldots, na, \tag{2.76}$$

for $n > K + k$ and 0 otherwise, where $h_1(n)$ is the impulse response function of the system

$$H_1(q^{-1}) = \frac{1}{\hat{A}(q^{-1})}, \tag{2.77}$$

and $h_2(n)$ is the impulse response function of the system

$$H_2(q^{-1}) = \frac{\hat{B}(q^{-1})}{\hat{A}(q^{-1})}. \tag{2.78}$$
Proof: The proof is shown in Appendix 2.3.

Theorem 2.2. Define the computation error

$$\Delta\hat{s}_{\hat{b}_k}(n) = \frac{\partial \hat{s}(n)}{\partial \hat{b}_k} - \frac{\partial^{+} \hat{s}(n)}{\partial \hat{b}_k},$$

where $\partial \hat{s}(n)/\partial \hat{b}_k$, $k = 1, \ldots, nb$, denote partial derivatives calculated with the BPTT method, i.e. unfolding the model (2.15) $n - 1$ times back in time, and $\partial^{+} \hat{s}(n)/\partial \hat{b}_k$ denote partial derivatives calculated with the truncated BPTT method unfolding the model (2.15) K times back in time. Assume that

(A1) The linear dynamic model $\hat{B}(q^{-1})/\hat{A}(q^{-1})$ is asymptotically stable;
(A2) $\hat{s}(n) = 0$, $n = 0, \ldots, -na + 1$; $u(n) = 0$, $n = -1, \ldots, -nb$;
(A3) $\partial \hat{s}(n)/\partial \hat{b}_k = 0$, $n = 0, \ldots, -na + 1$;
(A4) The input $u(n)$, $n = 0, 1, \ldots$, is a sequence of zero-mean i.i.d. random variables of finite moments,

$$E[u(n)] = 0, \qquad E[u(n)u(m)] = \begin{cases} \sigma^2 & \text{for } n = m \\ 0 & \text{for } n \neq m \end{cases}.$$

Then

$$\operatorname{var}\,\Delta\hat{s}_{\hat{b}_k}(n) = \sigma^2 \sum_{i_1=K+1}^{n-k} h_1^2(i_1), \quad k = 1, \ldots, nb, \tag{2.79}$$

for $n > K + k$ and 0 otherwise, where $h_1(n)$ is the impulse response function of the system

$$H_1(q^{-1}) = \frac{1}{\hat{A}(q^{-1})}. \tag{2.80}$$
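Remark 2.1. From (2.76) and (2.79), it follows that the variances of computation errors do not depend on k for $n \to \infty$,

$$\lim_{n\to\infty}\operatorname{var}\,\Delta\hat{s}_{\hat{a}_k}(n) = \sigma^2 \sum_{i_1=K+1}^{\infty}\left(\sum_{i_2=K+1}^{i_1} h_1(i_2)\,h_2(i_1 - i_2)\right)^{2}, \tag{2.81}$$

$$\lim_{n\to\infty}\operatorname{var}\,\Delta\hat{s}_{\hat{b}_k}(n) = \sigma^2 \sum_{i_1=K+1}^{\infty} h_1^2(i_1). \tag{2.82}$$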
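As a worked illustration of (2.82), consider a hypothetical first-order model with $\hat{A}(q^{-1}) = 1 + \hat{a}_1 q^{-1}$ and $|\hat{a}_1| < 1$; this model and the numerical values below are assumptions chosen for illustration only. Its impulse response is a geometric sequence, so the limit can be written in closed form:

$$h_1(n) = (-\hat{a}_1)^n, \qquad \lim_{n\to\infty}\operatorname{var}\,\Delta\hat{s}_{\hat{b}_k}(n) = \sigma^2 \sum_{i_1=K+1}^{\infty} \hat{a}_1^{2 i_1} = \sigma^2\,\frac{\hat{a}_1^{2(K+1)}}{1 - \hat{a}_1^{2}}.$$

The same sum taken from $i_1 = 0$ equals $\sigma^2/(1 - \hat{a}_1^2)$, so the truncation leaves the fraction $\hat{a}_1^{2(K+1)}$ of it. Requiring this fraction not to exceed a tolerance $\varepsilon$ gives

$$K \geq \frac{\ln \varepsilon}{2 \ln |\hat{a}_1|} - 1,$$

e.g. for $|\hat{a}_1| = 0.9$ and $\varepsilon = 0.01$ the bound is approximately 20.85, i.e. $K = 21$. A pole closer to the unit circle therefore calls for a larger number of unfolded time steps.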
Theorems 2.1 and 2.2 provide a useful tool for the determination of the number of unfolded time steps K. From Theorems 2.1 and 2.2, it follows that, for a fixed value of K, the accuracy of gradient calculation with the truncated BPTT method depends on the number of discrete time steps necessary for the impulse response $h_1(n)$ to decrease to negligibly small values. Note that this impulse response can differ significantly from an impulse response of the system $1/A(q^{-1})$, particularly at the beginning of the learning process. Therefore, both $h_1(n)$ and $h_2(n)$ can be traced during the training of the neural network model and K can be changed adaptively to meet the assumed degrees of the gradient calculation accuracy $\xi_{a_k}(n)$ and $\xi_{b_k}(n)$ defined as the ratio of the variances (2.76) or (2.79) to the variances of the corresponding partial derivatives:

$$\xi_{a_k}(n) = \frac{\operatorname{var}\,\Delta\hat{s}_{\hat{a}_k}(n)}{\operatorname{var}\,\dfrac{\partial \hat{s}(n)}{\partial \hat{a}_k}} = \frac{\displaystyle\sum_{i_1=K+1}^{n-k}\left(\sum_{i_2=K+1}^{i_1} h_1(i_2)\,h_2(i_1 - i_2)\right)^{2}}{\displaystyle\sum_{i_1=0}^{n-k}\left(\sum_{i_2=0}^{i_1} h_1(i_2)\,h_2(i_1 - i_2)\right)^{2}}, \tag{2.83}$$

$$\xi_{b_k}(n) = \frac{\operatorname{var}\,\Delta\hat{s}_{\hat{b}_k}(n)}{\operatorname{var}\,\dfrac{\partial \hat{s}(n)}{\partial \hat{b}_k}} = \frac{\displaystyle\sum_{i_1=K+1}^{n-k} h_1^2(i_1)}{\displaystyle\sum_{i_1=0}^{n-k} h_1^2(i_1)}. \tag{2.84}$$
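The accuracy degrees (2.83) and (2.84) are straightforward to evaluate once the impulse responses are available. The sketch below is one possible way to do this and to pick the smallest K that meets a given tolerance; the function names, the second-order coefficients and the tolerance are illustrative assumptions.

```python
def impulse_response(num, den, length):
    """Impulse response of num(q^-1)/den(q^-1), assuming den[0] == 1."""
    h = []
    for n in range(length):
        acc = num[n] if n < len(num) else 0.0
        acc -= sum(den[m] * h[n - m] for m in range(1, len(den)) if n - m >= 0)
        h.append(acc)
    return h

def xi_b(h1, K, n, k):
    """Accuracy degree (2.84): truncated tail of sum h1^2 over the full sum."""
    num = sum(h1[i] ** 2 for i in range(K + 1, n - k + 1))
    den = sum(h1[i] ** 2 for i in range(0, n - k + 1))
    return num / den

def xi_a(h1, h2, K, n, k):
    """Accuracy degree (2.83), built from partial convolution sums of h1 and h2."""
    def inner(lo, i1):
        return sum(h1[i2] * h2[i1 - i2] for i2 in range(lo, i1 + 1))
    num = sum(inner(K + 1, i1) ** 2 for i1 in range(K + 1, n - k + 1))
    den = sum(inner(0, i1) ** 2 for i1 in range(0, n - k + 1))
    return num / den

def smallest_K(h1, n, k, tol):
    """Smallest K for which xi_b does not exceed the tolerance."""
    for K in range(n - k + 1):
        if xi_b(h1, K, n, k) <= tol:
            return K
    return n - k

# Illustrative model estimates: A = 1 - 1.3231 q^-1 + 0.4346 q^-2, B = 0.0635 q^-1 + 0.0481 q^-2.
A = [1.0, -1.3231, 0.4346]
B = [0.0, 0.0635, 0.0481]
h1 = impulse_response([1.0], A, 61)
h2 = impulse_response(B, A, 61)
K_sel = smallest_K(h1, n=60, k=1, tol=0.01)
```

During training, the same computation could be repeated with the current parameter estimates so that K follows the changing dynamics of the model.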
The initial value of K can be determined based on the assumed initial values of $\hat{a}_k$ and $\hat{b}_k$ or a priori knowledge of system dynamics. The required value of K depends strongly on the system sampling rate. If the sampling rate increases for a given system, the number of discrete time steps necessary for the impulse response $h_1(n)$ to decrease to negligibly small values increases, and a higher value of K is necessary to achieve the same accuracy of gradient calculation.

2.4.10 Gradient calculation in the sequential mode

The derivation of gradient calculation algorithms has been made under the assumption that a model is time invariant. Obviously, pattern by pattern updating of model parameters in the sequential mode makes this assumption no longer true. As a result, approximated values of the gradient are obtained for parallel models. To obtain a good approximation of the gradient, the learning rate should be small enough to achieve negligibly small changes of model parameters. In comparison with the SM, an important advantage of the BPTT method is that the actual values of model parameters can be used in the unfolding procedure. Although the gradient calculated in this way is still approximate, such an approach may increase the convergence rate of the learning process, as illustrated in the simulation example below, even if model unfolding is restricted to only a few time steps. Another interesting feature of the BPTT method is that it can provide exact values of the gradient when not only the actual values of parameters are employed but also linear dynamic model outputs are unfolded back in time according to the following rule:

$$\hat{s}(n - na - i_1) = \frac{1}{\hat{a}_{na}}\left[-\hat{s}(n - i_1) - \sum_{m=1}^{na-1}\hat{a}_m\,\hat{s}(n - m - i_1) + \sum_{m=1}^{nb}\hat{b}_m\,u(n - m - i_1)\right], \tag{2.85}$$
where $i_1 = 1, \ldots, n-1$. The formula (2.85) can be used provided that $\hat{a}_{na} \neq 0$. In practice, if $\hat{a}_{na} = 0$ or $\hat{a}_{na} \approx 0$, then $\hat{s}(n - na - i_1)$ does not influence or does not influence significantly $\hat{s}(n - i_1)$, and the unfolding step can be omitted. Note that, provided that $\hat{b}_{nb} \neq 0$, unfolding linear dynamic model outputs can be replaced with unfolding model inputs:

$$u(n - nb - i_1) = \frac{1}{\hat{b}_{nb}}\left[\hat{s}(n - i_1) + \sum_{m=1}^{na}\hat{a}_m\,\hat{s}(n - m - i_1) - \sum_{m=1}^{nb-1}\hat{b}_m\,u(n - m - i_1)\right]. \tag{2.86}$$
In this case, if $\hat{b}_{nb} = 0$ or $\hat{b}_{nb} \approx 0$, then $u(n - nb - i_1)$ does not influence or does not influence significantly $\hat{s}(n - i_1)$, and the unfolding step can be omitted.

2.4.11 Computational complexity

Computational complexity of on-line learning algorithms for recurrent neural network models is commonly much higher than computational complexity of the on-line backpropagation algorithm and depends on the number of the recurrent nodes L. For example, for the fully connected recurrent network, computational complexity per one step of BPTT, truncated up to K time steps backwards, is $O(L^2 K)$. Computational complexity of the real time recurrent learning algorithm of Williams and Zipser [167] is even higher, $O(L^4)$ [115]. Fortunately, the parallel neural network Wiener model contains only one recurrent node, i.e., L = 1. This considerably reduces computational requirements of the examined learning algorithms. The evaluation of computational complexity per one step given below is expressed in terms of the polynomial orders na and nb, and the number of the nonlinear nodes M. For the truncated BPTT algorithm, computational complexity depends also on the number of unfolded time steps K. This evaluation of computational complexity is made under the following assumptions:

1. The number of bits of precision for all quantities is fixed.
2. The costs of storing and reading from computer memory can be neglected.
3. No distinction is made between the costs of different operations such as addition, multiplication, calculation of input-output transfer functions such as ϕ(·) or its first derivative.

The computation of the model output requires the function ϕ(·) to be calculated M times, and to compute the gradient of the model output, its first derivative has to be calculated M times. Both of these operations are included in the computational costs of the weight update given in Table 2.1. Computational complexity of BPS is approximately two times higher than computational complexity of BPP. The application of the SM requires only 2na(na + nb) operations more than BPP. Computational complexity of the BPTT algorithm depends also on the number of unfolded time steps K and is higher than the complexity of the BPP algorithm by 2K(2na + nb) operations.

Table 2.1. Computational complexity of on-line learning algorithms

Algorithm   Computational complexity
BPS         (31 + 2na)M + 5(na + nb) + 5
BPP         16M + 6(na + nb) + 3
SM          16M + 2na(na + nb) + 6(na + nb) + 3
BPTT        16M + 2K(2na + nb) + 6(na + nb) + 3
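The expressions of Table 2.1 can be evaluated directly for a given model size. The short sketch below does this for one illustrative configuration; the function name and the chosen values of na, nb, M and K are assumptions, not results reported in the text.

```python
def ops_per_step(na, nb, M, K=0):
    """Operation counts per learning step, following the expressions of Table 2.1."""
    bpp = 16 * M + 6 * (na + nb) + 3
    return {
        "BPS": (31 + 2 * na) * M + 5 * (na + nb) + 5,
        "BPP": bpp,
        "SM": bpp + 2 * na * (na + nb),
        "BPTT": bpp + 2 * K * (2 * na + nb),
    }

# Illustrative configuration: second-order model, 25 nonlinear nodes, K = 4.
counts = ops_per_step(na=2, nb=2, M=25, K=4)
print(counts)  # {'BPS': 900, 'BPP': 427, 'SM': 443, 'BPTT': 475}
```

For this configuration the SM and truncated BPTT add only a few percent to the cost of BPP, while BPS costs roughly twice as much.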
2.5 Simulation example

In the simulation example, a second order Wiener system was used. Its linear dynamic part, a continuous system given by the transfer function

$$G(s) = \frac{1}{6s^2 + 5s + 1}, \tag{2.87}$$

was converted to discrete time, assuming a zero order hold on the input and the sampling interval of 1 s, leading to the following difference equation:

$$s(n) = 1.3231\,s(n-1) - 0.4346\,s(n-2) + 0.0635\,u(n-1) + 0.0481\,u(n-2). \tag{2.88}$$
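The coefficients of (2.88) can be reproduced by hand. Since $6s^2 + 5s + 1 = (2s + 1)(3s + 1)$, the system consists of two first-order lags with time constants 2 and 3. The sketch below, which is only one way to carry out the conversion, maps the poles to discrete time and matches the step response at the first two sampling instants; the function name is hypothetical.

```python
import math

def zoh_two_lags(T1, T2, Ts):
    """ZOH discretization of G(s) = 1 / ((T1*s + 1)(T2*s + 1)), T1 != T2.

    Returns (c1, c2, b1, b2) of
    s(n) = c1*s(n-1) + c2*s(n-2) + b1*u(n-1) + b2*u(n-2).
    """
    p1, p2 = math.exp(-Ts / T1), math.exp(-Ts / T2)  # discrete-time poles

    def step(t):
        # Unit step response of the continuous system.
        return 1.0 - (T1 * math.exp(-t / T1) - T2 * math.exp(-t / T2)) / (T1 - T2)

    c1, c2 = p1 + p2, -p1 * p2
    b1 = step(Ts)                            # first sample of the step response
    b2 = step(2 * Ts) - c1 * step(Ts) - b1   # second sample fixes b2
    return c1, c2, b1, b2

c1, c2, b1, b2 = zoh_two_lags(T1=2.0, T2=3.0, Ts=1.0)
print(round(c1, 4), round(c2, 4), round(b1, 4), round(b2, 4))
```

With a zero order hold the step response of the discrete model coincides with the sampled continuous step response, which is why matching two samples is sufficient for a second order system.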
The system contained also a nonlinear element (Fig. 2.10) given by

$$f\bigl(s(n)\bigr) = \begin{cases} -1 & \text{for } s(n) < -1.5 \\ s(n) + 0.5 & \text{for } -1.5 \leq s(n) \leq -0.5 \\ 0 & \text{for } |s(n)| < 0.5 \\ s(n) - 0.5 & \text{for } 0.5 \leq s(n) \leq 1.5 \\ 1 & \text{for } s(n) > 1.5 \end{cases}. \tag{2.89}$$
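A minimal sketch of the simulated Wiener system, combining the difference equation (2.88) with the nonlinear element (2.89), is given below. Zero initial conditions, the random seed and the shortened sequence length are assumptions made for illustration.

```python
import random

def f(s):
    """Static nonlinearity (2.89): dead zone for |s| < 0.5, saturation beyond |s| = 1.5."""
    if s < -1.5:
        return -1.0
    if s <= -0.5:
        return s + 0.5
    if s < 0.5:
        return 0.0
    if s <= 1.5:
        return s - 0.5
    return 1.0

def simulate_wiener(u):
    """Linear dynamics (2.88) followed by f; zero initial conditions are assumed."""
    s_hist, y = [], []
    for n in range(len(u)):
        s1 = s_hist[n - 1] if n >= 1 else 0.0
        s2 = s_hist[n - 2] if n >= 2 else 0.0
        u1 = u[n - 1] if n >= 1 else 0.0
        u2 = u[n - 2] if n >= 2 else 0.0
        s_n = 1.3231 * s1 - 0.4346 * s2 + 0.0635 * u1 + 0.0481 * u2
        s_hist.append(s_n)
        y.append(f(s_n))
    return s_hist, y

random.seed(0)  # arbitrary seed, for reproducibility only
u = [random.uniform(-2.0, 2.0) for _ in range(1000)]
s_seq, y_seq = simulate_wiener(u)
```

The function f is continuous at all four breakpoints, so the order in which the branches are tested does not affect the result there.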
The system was driven by a sequence of 60000 random numbers uniformly distributed in (−2, 2). The parallel neural network Wiener model was trained recursively using the steepest descent method, with the learning rate of 0.005, and calculating the gradient with the BPP, truncated BPTT, and SM algorithms. The nonlinear element model contained 25 nodes of the hyperbolic tangent activation. To compare convergence rates of the algorithms, a mean square error of moving averages defined as

$$J_I(n) = \begin{cases} \dfrac{1}{n}\displaystyle\sum_{j=1}^{n}\bigl(\hat{y}(j) - y(j)\bigr)^2 & \text{for } n \leq I \\[2ex] \dfrac{1}{I}\displaystyle\sum_{j=n-I+1}^{n}\bigl(\hat{y}(j) - y(j)\bigr)^2 & \text{for } n > I \end{cases} \tag{2.90}$$

is used. The indices P(n) and F(n) are used to compare the identification accuracy of the linear dynamic system and the nonlinear function f(·), respectively,

$$P(n) = \frac{1}{4}\sum_{j=1}^{2}\Bigl[(\hat{a}_j - a_j)^2 + (\hat{b}_j - b_j)^2\Bigr], \tag{2.91}$$

$$F(n) = \frac{1}{100}\sum_{j=1}^{100}\Bigl(f\bigl(s(j)\bigr) - \hat{f}\bigl(s(j), \mathbf{w}\bigr)\Bigr)^2, \tag{2.92}$$
where {s(j)} is a testing sequence consisting of 100 linearly equally spaced values between −2 and 2. The simulation results for a noise-free Wiener system and a Wiener system disturbed by the additive output Gaussian noise N(0, σe) are summarized in Tables 2.2 – 2.4.

Noise-free case. The identification results are given in Table 2.2 and illustrated in Figs. 2.10 – 2.16. As shown in Table 2.2, the highest accuracy of nonlinear element identification, measured by the F(60000) index, was achieved for the BPTT algorithm at K = 6. The best fitting of the linear dynamic model, expressed by the P(60000) index, can be observed for the BPTT algorithm at K = 4. The moving averages index J2000(60000) that describes the accuracy of overall model identification also has its minimum for the BPTT algorithm at K = 4. The lowest accuracy of identification, measured by all four indices, was obtained for the BPP algorithm. The results obtained for the SM algorithm are more accurate than the results of the BPTT algorithm at K = 1 and less accurate than those at K = 2, . . . , 8.

Noise case I. The results obtained at a high signal to noise ratio SNR = 17.8, defined as SNR = var(y(n) − ε(n))/var(ε(n)), are a little less accurate than those for the noise-free case (Table 2.3). The minimum values of all four indices were obtained for the BPTT algorithm, i.e., F(60000) at K = 6, P(60000) at K = 5, J2000(60000) at K = 4, and J60000(60000) at K = 3 and 4.

Noise case II. The results are given in Table 2.4. In this case, training the neural network Wiener model at a low signal to noise ratio SNR = 3.56 was made using the following time-dependent learning rate:

$$\eta(n) = \frac{0.01}{\sqrt[4]{n}}. \tag{2.93}$$
In general, the BPP algorithm has the lowest convergence rate. This result can be explained by employing in BPP the approximate gradient instead of the true one. In the batch mode, the truncated BPTT algorithm is an approximation of the SM algorithm and the gradient approximation error decreases with an increase in K and approaches 0 for the fully unfolded model. In the sequential mode, the actual values of linear dynamic model parameters are used to unfold the model back in time. Therefore, as the truncated BPTT algorithm is not a simple approximation of the SM algorithm, the BPTT estimation results do not converge to the results obtained for the SM algorithm. Note that, unlike the SM algorithm, BPTT makes it possible to calculate the true value of the gradient if the linear dynamic model output s(n) is unfolded back in time. Pattern learning is usually more convenient and effective when the number of training patterns N is large, which is the case in the simulation example. Another advantage of pattern learning is that it explores the parameter space in a stochastic way by its nature, thus preventing the learning process from getting trapped in a shallow local minimum [68]. In practice, the number of the available training patterns may be too small to achieve the minimum of the global cost function J after a single presentation of all N training patterns. In such a situation, the learning process can be continued with the same set of N training patterns presented cyclically. Each cycle of repeated pattern learning corresponds to one iteration (epoch) in the batch mode.
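The accuracy indices (2.90) – (2.92) reported in Tables 2.2 – 2.4 can be computed as in the following sketch. The function names are hypothetical, sequences are stored with 0-based indexing, and the parameter index is written for an arbitrary number of parameters, which reduces to (2.91) for na = nb = 2.

```python
def moving_mse(y_hat, y, n, I):
    """J_I(n) of (2.90); sample j of the text is stored at index j - 1."""
    start = 0 if n <= I else n - I
    window = range(start, n)
    return sum((y_hat[j] - y[j]) ** 2 for j in window) / len(window)

def param_index(a_hat, a, b_hat, b):
    """P(n) of (2.91): mean squared error of the linear model parameters."""
    errors = [(est - true) ** 2 for est, true in zip(a_hat + b_hat, a + b)]
    return sum(errors) / len(errors)

def nonlin_index(f_true, f_model, lo=-2.0, hi=2.0, points=100):
    """F(n) of (2.92): mean squared error on an equally spaced testing grid."""
    grid = [lo + (hi - lo) * j / (points - 1) for j in range(points)]
    return sum((f_true(s) - f_model(s)) ** 2 for s in grid) / points

# Small illustrative check with made-up numbers.
j_short = moving_mse([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], n=2, I=3)
j_long = moving_mse([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], n=4, I=2)
p_val = param_index([1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0])
```

In this notation, J60000(60000) averages over the whole training run, whereas J2000(60000) reflects only the last 2000 patterns and therefore the accuracy reached at the end of training.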
Table 2.2. Comparison of estimation accuracy, (σe = 0)

Algorithm       J60000(60000)   J2000(60000)    F(60000)        P(60000)
BPP             5.51 × 10^-3    1.29 × 10^-3    8.01 × 10^-3    2.07 × 10^-3
SM              1.40 × 10^-3    7.94 × 10^-5    3.53 × 10^-4    2.66 × 10^-6
BPTT, K = 1     2.17 × 10^-3    3.80 × 10^-4    8.05 × 10^-4    4.16 × 10^-5
BPTT, K = 2     1.34 × 10^-3    7.52 × 10^-5    2.37 × 10^-4    4.12 × 10^-7
BPTT, K = 3     1.19 × 10^-3    3.81 × 10^-5    1.15 × 10^-4    6.81 × 10^-7
BPTT, K = 4     1.19 × 10^-3    2.71 × 10^-5    4.68 × 10^-5    3.82 × 10^-7
BPTT, K = 5     1.22 × 10^-3    2.78 × 10^-5    4.66 × 10^-5    9.66 × 10^-7
BPTT, K = 6     1.25 × 10^-3    2.86 × 10^-5    4.37 × 10^-5    1.74 × 10^-6
BPTT, K = 7     1.27 × 10^-3    2.97 × 10^-5    4.54 × 10^-5    1.35 × 10^-6
BPTT, K = 8     1.31 × 10^-3    3.99 × 10^-5    9.23 × 10^-5    2.00 × 10^-6
BPTT, K = 9     1.35 × 10^-3    5.25 × 10^-5    9.12 × 10^-5    4.10 × 10^-6
BPTT, K = 10    1.38 × 10^-3    9.62 × 10^-5    3.50 × 10^-4    3.05 × 10^-6
BPTT, K = 11    1.38 × 10^-3    9.03 × 10^-5    4.25 × 10^-4    2.98 × 10^-6
BPTT, K = 12    1.39 × 10^-3    8.46 × 10^-5    4.15 × 10^-4    3.02 × 10^-6
BPTT, K = 13    1.39 × 10^-3    8.20 × 10^-5    3.92 × 10^-4    2.89 × 10^-6
BPTT, K = 14    1.39 × 10^-3    8.08 × 10^-5    3.80 × 10^-4    2.86 × 10^-6
BPTT, K = 15    1.39 × 10^-3    8.03 × 10^-5    3.70 × 10^-4    2.85 × 10^-6
Table 2.3. Comparison of estimation accuracy, (σe = 0.02, SNR = 17.8)

Algorithm       J60000(60000)   J2000(60000)    F(60000)        P(60000)
BPP             5.99 × 10^-3    1.72 × 10^-3    6.01 × 10^-3    2.15 × 10^-3
SM              1.93 × 10^-3    8.75 × 10^-5    2.30 × 10^-4    1.32 × 10^-5
BPTT, K = 1     2.66 × 10^-3    8.97 × 10^-4    1.04 × 10^-3    7.77 × 10^-5
BPTT, K = 2     1.82 × 10^-3    5.71 × 10^-4    2.22 × 10^-4    5.84 × 10^-6
BPTT, K = 3     1.67 × 10^-3    5.32 × 10^-4    1.87 × 10^-4    2.80 × 10^-6
BPTT, K = 4     1.67 × 10^-3    5.31 × 10^-4    9.49 × 10^-5    1.97 × 10^-6
BPTT, K = 5     1.72 × 10^-3    5.45 × 10^-4    6.32 × 10^-5    1.45 × 10^-6
BPTT, K = 6     1.75 × 10^-3    5.59 × 10^-4    5.68 × 10^-5    5.93 × 10^-6
BPTT, K = 7     1.77 × 10^-3    5.82 × 10^-4    6.23 × 10^-5    4.49 × 10^-6
BPTT, K = 8     1.81 × 10^-3    6.04 × 10^-4    9.88 × 10^-5    5.23 × 10^-6
BPTT, K = 9     1.86 × 10^-3    6.15 × 10^-4    1.04 × 10^-4    1.04 × 10^-5
BPTT, K = 10    1.90 × 10^-3    7.68 × 10^-4    2.82 × 10^-4    1.17 × 10^-5
BPTT, K = 11    1.91 × 10^-3    8.61 × 10^-4    3.23 × 10^-4    1.26 × 10^-5
BPTT, K = 12    1.91 × 10^-3    9.24 × 10^-4    2.92 × 10^-4    1.36 × 10^-5
BPTT, K = 13    1.92 × 10^-3    9.94 × 10^-4    2.62 × 10^-4    1.52 × 10^-5
BPTT, K = 14    1.93 × 10^-3    1.07 × 10^-3    2.37 × 10^-4    1.70 × 10^-5
BPTT, K = 15    1.93 × 10^-3    1.15 × 10^-3    2.19 × 10^-4    1.85 × 10^-5
Table 2.4. Comparison of estimation accuracy, (σe = 0.1, SNR = 3.56)

Algorithm       J60000(60000)   J2000(60000)    F(60000)        P(60000)
BPP             1.94 × 10^-2    1.55 × 10^-2    2.10 × 10^-2    2.16 × 10^-2
SM              1.30 × 10^-2    1.09 × 10^-2    1.36 × 10^-3    1.07 × 10^-3
BPTT, K = 1     1.41 × 10^-2    1.12 × 10^-2    5.26 × 10^-3    2.79 × 10^-3
BPTT, K = 2     1.28 × 10^-2    1.07 × 10^-2    2.94 × 10^-3    1.18 × 10^-3
BPTT, K = 3     1.25 × 10^-2    1.06 × 10^-2    1.39 × 10^-3    7.23 × 10^-4
BPTT, K = 4     1.25 × 10^-2    1.06 × 10^-2    7.65 × 10^-4    7.93 × 10^-4
BPTT, K = 5     1.26 × 10^-2    1.07 × 10^-2    5.73 × 10^-4    7.55 × 10^-4
BPTT, K = 6     1.27 × 10^-2    1.08 × 10^-2    6.56 × 10^-4    8.89 × 10^-4
BPTT, K = 7     1.27 × 10^-2    1.08 × 10^-2    6.45 × 10^-4    8.69 × 10^-4
BPTT, K = 8     1.27 × 10^-2    1.08 × 10^-2    7.34 × 10^-4    8.83 × 10^-4
BPTT, K = 9     1.28 × 10^-2    1.08 × 10^-2    1.12 × 10^-3    9.23 × 10^-4
BPTT, K = 10    1.29 × 10^-2    1.08 × 10^-2    1.55 × 10^-3    9.69 × 10^-4
BPTT, K = 11    1.29 × 10^-2    1.08 × 10^-2    1.46 × 10^-3    1.01 × 10^-3
BPTT, K = 12    1.29 × 10^-2    1.08 × 10^-2    1.42 × 10^-3    1.05 × 10^-3
BPTT, K = 13    1.29 × 10^-2    1.08 × 10^-2    1.42 × 10^-3    1.04 × 10^-3
BPTT, K = 14    1.29 × 10^-2    1.09 × 10^-2    1.41 × 10^-3    1.05 × 10^-3
BPTT, K = 15    1.29 × 10^-2    1.09 × 10^-2    1.41 × 10^-3    1.06 × 10^-3
Fig. 2.10. True (solid line) and estimated (dotted line) nonlinear functions, (BPTT, K = 4) [plot of f(s(n)) and f̂(s(n)) versus s(n)]

Fig. 2.11. Convergence rate of linear model parameters, (BPTT, K = 4) [plot of P(n) versus n, logarithmic scale]

Fig. 2.12. Evolution of linear model parameters, (BPTT, K = 4) [plot of the estimates â1, â2, b̂1, b̂2 versus n]

Fig. 2.13. Convergence rate of the nonlinear element model, (BPTT, K = 4) [plot of F(n) versus n, logarithmic scale]

Fig. 2.14. Convergence rate of the BPP and BPTT algorithms [plot of J2000(n) versus n for BPP and BPTT at K = 1, 2, 3, 4, logarithmic scale]

Fig. 2.15. Evolution of gradient calculation accuracy degrees, (BPTT, K = 4) [plots of ξak(n) and ξbk(n) versus n]

Fig. 2.16. Impulse responses of sensitivity models and the linear dynamic model, (BPTT, K = 4) [plots of h1(n) and h2(n) versus n]
2.6 Two-tank system example

A laboratory two-tank system with transport delay, shown in Fig. 2.17, was used as a source of data for a real process example [89]. A system actuator, consisting of the dc motor M, the diaphragm water pump P, and the bypass valve V, was identified. The bypass valve V was left in an open position, introducing a dead zone into the system steady-state characteristic; see Fig. 2.21 for the characteristic obtained by recording the system response to a staircase input. The parameters of both a parallel neural network Wiener model and a linear autoregressive with exogenous input (ARX) model were determined based on a sequence of 6000 input and output data recorded in another open loop experiment controlled by a computer at the sampling interval of 1 s. The motor M was driven by a control signal obtained from a pseudorandom generator of uniform distribution and transformed in such a way that each pseudorandom value was kept constant for ten sampling intervals, constituting ten successive control values. The flow rate of water, measured with the Venturi flow meter F, was chosen as the system output and the squared control signal was used as the model input. Before identification, the input and output data were scaled to zero mean and unit variance. First, the parameters of the ARX model were estimated with the least squares method. Based on the results of ARX model estimation and applying the Akaike final prediction error criterion [112], the second order model with delay 1 was chosen, i.e., na = 2, nb = 2. Then the parallel neural network Wiener model containing 30 nonlinear nodes of the hyperbolic tangent activation function was trained recursively with BPP, the SM, and truncated BPTT at K = 1, . . . , 15 with the input-output data set processed cyclically five times. Finally, the trained neural network Wiener models were tested with another set of 3000 input-output data.

The obtained results, i.e., the values of J3000(30000) for the training set and J3000(3000) for the testing set, are given in Table 2.5 and illustrated in Figs. 2.18 – 2.21. The inspection of the impulse responses h1(n) and h2(n), presented in Figs. 2.18 and 2.19, shows that both of them decrease to practically negligible values at n = 12. Therefore, according to Theorems 2.1 and 2.2, no significant improvement can be expected from unfolding the model for more than 12 time steps; see the results in Table 2.5 for a confirmation of this conclusion. As can be expected, the results obtained for the Wiener model with the SM and BPTT at K = 2, . . . , 15 are better than the result for the ARX model. Moreover, contrary to the simulation example, a gradual improvement in model fitting can be seen (Fig. 2.20) with an increase in K, which can be explained by applying a sufficiently low learning rate of 0.001. The operation of the diaphragm water pump at constant excitation causes some periodic changes of the water flow rate. These changes can be considered as additive output disturbances and, as a result, the cost function achieves much higher values than in the case of the noise-free simulation example, see Tables 2.2 and 2.5.
Fig. 2.17. Two-tank system with transport delay [schematic with the motor M, pump P, bypass valve V, flow meter F, tank levels h1 and h2, input u and output y]

Fig. 2.18. Two-tank system. Impulse response of the system Ĥ1(q^-1), K = 15 [plot of h1(n) versus n]

Fig. 2.19. Two-tank system. Impulse response of the system Ĥ2(q^-1), K = 15 [plot of h2(n) versus n]

Fig. 2.20. Two-tank system. Training the Wiener model with BPP, BPTT, and the SM [plot of J3000(n) versus n for BPP, BPTT at K = 1, 2, 3, 8, 15, and the SM, logarithmic scale]

Fig. 2.21. Two-tank system. The true (dotted line) and estimated (solid line) nonlinear characteristics, BPTT, K = 15 [output of the Wiener system/model versus output of the linear dynamic system/model]
Table 2.5. Two-tank system. Comparison of estimation accuracy

Algorithm       J3000(30000)    J3000(3000)
ARX             -               0.0406
BPP             0.0690          0.0860
SM              0.0176          0.0196
BPTT, K = 1     0.0374          0.0461
BPTT, K = 2     0.0253          0.0306
BPTT, K = 3     0.0210          0.0246
BPTT, K = 4     0.0200          0.0229
BPTT, K = 5     0.0194          0.0221
BPTT, K = 6     0.0188          0.0213
BPTT, K = 7     0.0183          0.0206
BPTT, K = 8     0.0181          0.0202
BPTT, K = 9     0.0179          0.0200
BPTT, K = 10    0.0178          0.0199
BPTT, K = 11    0.0178          0.0198
BPTT, K = 12    0.0178          0.0198
BPTT, K = 13    0.0178          0.0197
BPTT, K = 14    0.0177          0.0197
BPTT, K = 15    0.0177          0.0197
It is well known that parallel models are more suitable for the identification of noise-corrupted systems as they do not use past system outputs, which are noise-corrupted, to calculate the model output. Clearly, the parallel neural network Wiener model is also well suited for the identification of Wiener systems with additive white output noise. On the other hand, since series-parallel models use past system outputs, these models are not able to describe systems with additive white output noise appropriately, and the series-parallel neural network Wiener model is not suitable for the identification of Wiener systems with additive disturbances either. The noise interfering with the system makes the identification a stochastic optimization problem. Therefore, a large number of input-output data should be used to achieve a required level of model accuracy for systems with a high level of output disturbances. Moreover, small or slowly decreasing values of the learning rate η are necessary to achieve the convergence of steepest descent algorithms.
2.7 Prediction error method

It is well known that steepest descent algorithms have a linear convergence rate and can be quite slow in many practical cases. A faster convergence rate can be achieved with methods which are based on a second-order approximation of the optimization criterion to determine the search direction. An alternative to steepest descent methods is the prediction error (PE) method or its pattern version, the recursive prediction error (RPE) method. The PE and RPE algorithms have superior convergence properties in comparison with steepest descent algorithms as they use the approximate Hessian to compute the search direction. The properties of the PE and RPE algorithms for different neural network models are discussed in [29, 127]. RPE algorithms for Wiener models with known static nonlinearity or the nonlinear model approximated with a piecewise linear function were analyzed by Wigren [165, 166]. A batch PE algorithm for Wiener models with a polynomial model of the nonlinear element was used by Norquay et al. [128]. A sequential version of the PE algorithm for the training of a SISO neural network Wiener model was proposed in [85]. In both these algorithms, the gradient of the model output w.r.t. the parameters of its linear part is computed with the SM.

2.7.1 Recursive prediction error learning algorithm

The identification problem can be formulated as follows: Given a set of input and output data $Z^N = \{u(n), y(n)\}$, $n = 1, \ldots, N$, and a candidate neural network Wiener model, estimate the parameters θ of the model so that the predictions $\hat{y}(n|n-1)$ of the system output are close to the system output y(n) in the sense of the following mean square error criterion:

$$J(\theta, Z^N) = \frac{1}{2N}\sum_{n=1}^{N}\bigl(y(n) - \hat{y}(n|n-1)\bigr)^2. \tag{2.94}$$
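A solution to this problem, for nonlinearly parameterized models, is a gradient technique based on the approximation of the Hessian known as the prediction error method. The PE method is a batch optimization method of the Gauss-Newton type. The RPE algorithm is a recursive counterpart of the PE method. Both the PE and RPE algorithms are guaranteed to converge to a local minimum of $J(\theta, Z^N)$ with the probability 1 as $N \to \infty$ [29]. Given the gradient

$$\psi(n) = \left[\frac{\partial \hat{y}(n|n-1)}{\partial \hat{a}_1} \;\ldots\; \frac{\partial \hat{y}(n|n-1)}{\partial \hat{a}_{na}} \;\; \frac{\partial \hat{y}(n|n-1)}{\partial \hat{b}_1} \;\ldots\; \frac{\partial \hat{y}(n|n-1)}{\partial \hat{b}_{nb}} \;\; \frac{\partial \hat{y}(n|n-1)}{\partial w^{(1)}_{10}} \;\ldots\; \frac{\partial \hat{y}(n|n-1)}{\partial w^{(1)}_{M1}} \;\; \frac{\partial \hat{y}(n|n-1)}{\partial w^{(2)}_{10}} \;\ldots\; \frac{\partial \hat{y}(n|n-1)}{\partial w^{(2)}_{1M}}\right]^T \tag{2.95}$$

of the model output w.r.t. the parameter vector

$$\theta = \left[\hat{a}_1 \ldots \hat{a}_{na} \;\; \hat{b}_1 \ldots \hat{b}_{nb} \;\; w^{(1)}_{10} \ldots w^{(1)}_{M1} \;\; w^{(2)}_{10} \ldots w^{(2)}_{1M}\right]^T, \tag{2.96}$$

the RPE algorithm can be expressed as [29, 127]:

$$K(n) = \frac{P(n-1)\,\psi(n)}{1 + \psi^T(n)\,P(n-1)\,\psi(n)}, \tag{2.97}$$

$$\theta(n) = \theta(n-1) + K(n)\bigl(y(n) - \hat{y}(n|n-1)\bigr), \tag{2.98}$$

$$P(n) = P(n-1) - K(n)\,\psi^T(n)\,P(n-1). \tag{2.99}$$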
The RPE algorithm is an alternative to the well-known steepest descent methods and has much better convergence properties and lower memory requirements. Computational complexity of the RPE algorithm is, however, much higher and it is not recommended for problems with a large number of estimated parameters. Note that replacing K(n) with ηψ(n) in (2.98), where η denotes the learning rate, results in the steepest descent algorithm. In some cases, it may be useful to apply algorithms that combine quasi-Newton methods and the steepest descent approach. In the RPE and SM algorithm, used in the example below for comparison, a search direction is calculated as a sum of the search directions of the RPE method and the SM.

2.7.2 Pneumatic valve simulation example

As a source of data for testing the RPE algorithm and comparing it with the SM, and the RPE and SM algorithm, a model of a pneumatic valve was used [165]. The dynamic balance between the pneumatic control signal u(n) applied
to the valve stem, a counteractive spring force and friction is described by the linear dynamics s(n) = 1.4138s(n−1) − 0.6065s(n − 2) + 0.1044u(n−1)
(2.100)
+ 0.0833u(n − 2).
The flow through the valve is a nonlinear function of the valve stem position s(n): 0.3163s(n) . (2.101) f s(n) = 0.1 + 0.9s2 (n) It is assumed that the model output is additively disturbed by a zero-mean discrete white Gaussian noise ε(n) with the standard deviation σε = 0.01: y(n) = f s(n) + ε(n).
(2.102)
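The data-generating system (2.100) – (2.102) is easy to reproduce. The sketch below is an illustration only: it assumes zero initial conditions, assumes that the denominator of (2.101) is the square root of $0.1 + 0.9s^2(n)$, and uses hypothetical names (`simulate_valve`, `seed`) that do not come from the text.

```python
import numpy as np

def simulate_valve(u, sigma=0.01, seed=0):
    """Simulate the pneumatic valve example (2.100)-(2.102).

    u     : input sequence u(n)
    sigma : standard deviation of the additive Gaussian output noise
    Returns the stem position s(n) and the noisy output y(n).
    Assumes zero initial conditions (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    N = len(u)
    s = np.zeros(N)
    for n in range(N):
        s_1 = s[n - 1] if n >= 1 else 0.0
        s_2 = s[n - 2] if n >= 2 else 0.0
        u_1 = u[n - 1] if n >= 1 else 0.0
        u_2 = u[n - 2] if n >= 2 else 0.0
        # linear dynamics (2.100)
        s[n] = 1.4138 * s_1 - 0.6065 * s_2 + 0.1044 * u_1 + 0.0833 * u_2
    f = 0.3163 * s / np.sqrt(0.1 + 0.9 * s**2)   # static nonlinearity (2.101)
    y = f + sigma * rng.standard_normal(N)       # noisy output (2.102)
    return s, y
```

A data set comparable to the one described in the text could be produced with `u = np.random.default_rng(1).uniform(-1, 1, 10000)` followed by `s, y = simulate_valve(u)`.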
A sequence of 10000 pseudorandom numbers, uniformly distributed in (−1, 1), was used as the model input. Based on the simulated input-output data, the parallel neural network Wiener model containing 10 nonlinear nodes of the hyperbolic tangent activation function was trained with the RPE, SM, and RPE and SM algorithms. The identification results are illustrated in Figs 2.22 – 2.25 [87]. To compare estimation accuracy of linear system parameters and the nonlinear function f(·), the indices (2.91) and (2.92) are used.

Fig. 2.22. RPE algorithm. Evolution of parameter estimates of the linear element (parameter estimates $\hat{a}_1$, $\hat{a}_2$, $\hat{b}_1$, $\hat{b}_2$ versus n)
Fig. 2.23. Convergence of the linear element model (F(n) versus n, logarithmic scale; curves: RPE, SM, RPE and SM)

Fig. 2.24. Convergence of the nonlinear element model (F(n) versus n, logarithmic scale; curves: RPE, SM, RPE and SM)
Fig. 2.25. RPE algorithm. True and estimated nonlinear characteristics (true and estimated nonlinear functions versus s(n))
2.8 Summary
In this chapter, various steepest descent learning algorithms for neural network Wiener models have been derived, analyzed and compared in a unified framework. For the truncated BPTT method, it has been shown that the accuracy of gradient calculation depends on impulse responses of the linear dynamic model and its sensitivity models. The number of unfolded time steps can be determined on the basis of the number of time steps necessary for the impulse response of sensitivity models to decrease to negligibly small values. The application of these algorithms to both simulated and real process data has been demonstrated as well.

The parallel neural network Wiener model is simpler than its series-parallel counterpart as it does not contain the inverse model of the static nonlinear element. Another important advantage of the parallel neural network model is that its training with the BPP algorithm requires almost half the computational burden of training the series-parallel model with the BPS algorithm. Applying the SM or truncated BPTT algorithms, one can expect a higher convergence rate, in comparison with the BPP algorithm, as a more accurate gradient approximation is used. In comparison with gradient computation in feedforward neural networks, gradient computation in recurrent networks is commonly much more computationally intensive. That is not the case for the parallel neural network Wiener model, as both the SM and truncated BPTT algorithms require only a little more computational effort than the BPP algorithm. This comes from the fact that, in all these algorithms, 3M + 1 parameters of the nonlinear element model are adjusted in the same way. The algorithms update the parameters of the linear dynamic model in different manners, but the number of these parameters, na + nb, is commonly not greater than a few.

Both the simulation example and the real process example show the lowest convergence rate of the BPP algorithm. In the simulation examples, the highest accuracy of both the linear dynamic model and the nonlinear element has been obtained using the truncated BPTT algorithm and unfolding the linear dynamic model only a few steps back in time.

The RPE identification algorithm has also been extended to training recurrent neural network Wiener models. The calculation of the gradient is performed with the SM and it does not require much more computation than the BPP algorithm. In comparison with steepest descent methods, the RPE algorithm has much better convergence properties at the expense of higher computational complexity. The RPE algorithm can be very useful when the system output is disturbed additively by white noise. Note that in such cases, the steepest descent algorithms require a small step size to achieve convergence to a local minimum of the performance function.

Although only sequential learning versions of the basic steepest descent algorithm for neural network Wiener models are described and discussed in this chapter, their batch counterparts can be derived easily given the rules for gradient computation. Also, the implementation of other gradient methods such as variable metric methods, conjugate gradient methods, or the RLS learning algorithms (a review and discussion on their hardware implementation using the systolic technology is given by Rutkowski [144]) is straightforward.
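2.9 Appendix 2.1. Gradient derivation of truncated BPTT. SISO Wiener models
Assume that the parallel SISO Wiener model is K times unfolded back in time. Let $\partial^+\hat{s}(n)/\partial\hat{a}_k$ and $\partial^+\hat{s}(n)/\partial\hat{b}_k$ denote partial derivatives calculated for the model unfolded back in time for K time steps, and let $\partial\hat{s}(n)/\partial\hat{a}_k$ and $\partial\hat{s}(n)/\partial\hat{b}_k$ denote partial derivatives calculated without taking into account the dependence of past model outputs on $\hat{a}_k$ and $\hat{b}_k$, respectively. The differentiation of (2.15) gives
\[
\frac{\partial^+\hat{s}(n)}{\partial\hat{a}_k} = \frac{\partial\hat{s}(n)}{\partial\hat{a}_k} + \sum_{i_1=1}^{K} \frac{\partial\hat{s}(n)}{\partial\hat{s}(n-i_1)} \frac{\partial\hat{s}(n-i_1)}{\partial\hat{a}_k}
= -\hat{s}(n-k) - \sum_{i_1=1}^{K} \hat{s}(n-k-i_1) \frac{\partial\hat{s}(n)}{\partial\hat{s}(n-i_1)}, \tag{2.103}
\]
where k = 1, . . . , na,
\[
\frac{\partial^+\hat{s}(n)}{\partial\hat{b}_k} = \frac{\partial\hat{s}(n)}{\partial\hat{b}_k} + \sum_{i_1=1}^{K} \frac{\partial\hat{s}(n)}{\partial\hat{s}(n-i_1)} \frac{\partial\hat{s}(n-i_1)}{\partial\hat{b}_k}
= u(n-k) + \sum_{i_1=1}^{K} u(n-k-i_1) \frac{\partial\hat{s}(n)}{\partial\hat{s}(n-i_1)}, \tag{2.104}
\]
where k = 1, . . . , nb. From (2.15), it follows that partial derivatives of the actual output of the linear dynamic model $\hat{s}(n)$ w.r.t. the delayed outputs $\hat{s}(n-1), \dots, \hat{s}(n-K)$ are
\[
\frac{\partial\hat{s}(n)}{\partial\hat{s}(n-i_1)} = -\sum_{m=1}^{na} \hat{a}_m \frac{\partial\hat{s}(n-m)}{\partial\hat{s}(n-i_1)} = -\sum_{m=1}^{na} \hat{a}_m \frac{\partial\hat{s}(n)}{\partial\hat{s}(n-i_1+m)}, \tag{2.105}
\]
where $i_1 = 1, \dots, K$, and $\partial\hat{s}(n-m)/\partial\hat{s}(n-i_1) = 0$ for $m > i_1$. Finally, to obtain the partial derivatives $\partial\hat{y}(n)/\partial\hat{a}_k$ and $\partial\hat{y}(n)/\partial\hat{b}_k$, the partial derivatives $\partial\hat{s}(n)/\partial\hat{a}_k$ and $\partial\hat{s}(n)/\partial\hat{b}_k$ in (2.48) and (2.49) are replaced with $\partial^+\hat{s}(n)/\partial\hat{a}_k$ and $\partial^+\hat{s}(n)/\partial\hat{b}_k$:
\[
\frac{\partial\hat{y}(n)}{\partial\hat{a}_k} = \left[ -\hat{s}(n-k) - \sum_{i_1=1}^{K} \hat{s}(n-k-i_1) \frac{\partial\hat{s}(n)}{\partial\hat{s}(n-i_1)} \right] \frac{\partial\hat{y}(n)}{\partial\hat{s}(n)}, \tag{2.106}
\]
\[
\frac{\partial\hat{y}(n)}{\partial\hat{b}_k} = \left[ u(n-k) + \sum_{i_1=1}^{K} u(n-k-i_1) \frac{\partial\hat{s}(n)}{\partial\hat{s}(n-i_1)} \right] \frac{\partial\hat{y}(n)}{\partial\hat{s}(n)}. \tag{2.107}
\]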
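The recursion of this appendix can be checked numerically. The sketch below is a hedged illustration: it assumes that (2.15), which is not reproduced here, has the form $\hat{s}(n) = -\sum_m \hat{a}_m \hat{s}(n-m) + \sum_m \hat{b}_m u(n-m)$ with zero initial conditions, and all function names are hypothetical.

```python
import numpy as np

def simulate_linear(a, b, u):
    """s_hat(n) = -sum_m a[m-1] s_hat(n-m) + sum_m b[m-1] u(n-m), zero initial state."""
    s = np.zeros(len(u))
    for n in range(len(u)):
        for m in range(1, len(a) + 1):
            if n - m >= 0:
                s[n] -= a[m - 1] * s[n - m]
        for m in range(1, len(b) + 1):
            if n - m >= 0:
                s[n] += b[m - 1] * u[n - m]
    return s

def unfolded_output_derivs(a, K):
    """d[i1] = d s_hat(n) / d s_hat(n - i1) for i1 = 0..K, from the recursion (2.105)."""
    d = np.zeros(K + 1)
    d[0] = 1.0
    for i1 in range(1, K + 1):
        d[i1] = -sum(a[m - 1] * d[i1 - m] for m in range(1, len(a) + 1) if m <= i1)
    return d

def truncated_bptt_derivs(a, s_hat, u, n, k, K):
    """Return (d+ s_hat(n)/d a_k, d+ s_hat(n)/d b_k) for K unfolding steps, (2.103)-(2.104)."""
    d = unfolded_output_derivs(a, K)
    past = lambda x, j: x[j] if j >= 0 else 0.0   # signals are zero before n = 0
    # the i1 = 0 term (d[0] = 1) reproduces the leading -s_hat(n-k) and u(n-k) terms
    ds_da = -sum(d[i1] * past(s_hat, n - k - i1) for i1 in range(K + 1))
    ds_db = sum(d[i1] * past(u, n - k - i1) for i1 in range(K + 1))
    return ds_da, ds_db
```

With K chosen large enough to reach the initial conditions, the result should agree with a finite-difference derivative of the simulated output; a smaller K gives the truncated approximation analyzed in Appendices 2.3 and 2.4.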
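2.10 Appendix 2.2. Gradient derivation of truncated BPTT. MIMO Wiener models
Assume that the parallel MIMO Wiener model is K times unfolded back in time. Let $\partial^+\hat{s}_l(n)/\partial\hat{a}_{rp}^{(k)}$ and $\partial^+\hat{s}_l(n)/\partial\hat{b}_{rp}^{(k)}$ denote partial derivatives calculated for the model unfolded back in time for K time steps, and let $\partial\hat{s}_l(n)/\partial\hat{a}_{rp}^{(k)}$ and $\partial\hat{s}_l(n)/\partial\hat{b}_{rp}^{(k)}$ denote partial derivatives calculated without taking into account the dependence of past model outputs on $\hat{a}_{rp}^{(k)}$ and $\hat{b}_{rp}^{(k)}$, respectively. The differentiation of (2.71) gives
\[
\frac{\partial^+\hat{s}_l(n)}{\partial\hat{a}_{rp}^{(k)}} = \frac{\partial\hat{s}_l(n)}{\partial\hat{a}_{rp}^{(k)}} + \sum_{i_1=1}^{K} \sum_{i_2=1}^{ns} \frac{\partial\hat{s}_l(n)}{\partial\hat{s}_{i_2}(n-i_1)} \frac{\partial\hat{s}_{i_2}(n-i_1)}{\partial\hat{a}_{rp}^{(k)}}
= -\delta_{lr}\hat{s}_p(n-k) - \sum_{i_1=1}^{K} \hat{s}_p(n-k-i_1) \frac{\partial\hat{s}_l(n)}{\partial\hat{s}_r(n-i_1)}, \tag{2.108}
\]
where
\[
\delta_{lr} = \begin{cases} 1 & \text{for } l = r, \\ 0 & \text{for } l \neq r, \end{cases} \tag{2.109}
\]
where l = 1, . . . , ns, r = 1, . . . , ns, p = 1, . . . , ns, k = 1, . . . , na,
\[
\frac{\partial^+\hat{s}_l(n)}{\partial\hat{b}_{rp}^{(k)}} = \frac{\partial\hat{s}_l(n)}{\partial\hat{b}_{rp}^{(k)}} + \sum_{i_1=1}^{K} \sum_{i_2=1}^{ns} \frac{\partial\hat{s}_l(n)}{\partial\hat{s}_{i_2}(n-i_1)} \frac{\partial\hat{s}_{i_2}(n-i_1)}{\partial\hat{b}_{rp}^{(k)}}
= \delta_{lr}u_p(n-k) + \sum_{i_1=1}^{K} u_p(n-k-i_1) \frac{\partial\hat{s}_l(n)}{\partial\hat{s}_r(n-i_1)}, \tag{2.110}
\]
where l = 1, . . . , ns, r = 1, . . . , ns, p = 1, . . . , nu, k = 1, . . . , nb. From (2.71), it follows that partial derivatives of the actual lth output of the linear dynamic model w.r.t. its delayed rth output are
\[
\frac{\partial\hat{s}_l(n)}{\partial\hat{s}_r(n-i_1)} = -\sum_{m=1}^{na} \sum_{m_1=1}^{ns} \hat{a}_{lm_1}^{(m)} \frac{\partial\hat{s}_{m_1}(n-m)}{\partial\hat{s}_r(n-i_1)}
= -\sum_{m=1}^{na} \sum_{m_1=1}^{ns} \hat{a}_{lm_1}^{(m)} \frac{\partial\hat{s}_{m_1}(n)}{\partial\hat{s}_r(n-i_1+m)}, \tag{2.111}
\]
where
\[
\frac{\partial\hat{s}_{m_1}(n-m)}{\partial\hat{s}_r(n-i_1)} = 0 \quad \text{for } m > i_1 \text{ or } m = i_1 \text{ and } m_1 \neq r. \tag{2.112}
\]
Finally, partial derivatives of the tth model output w.r.t. the parameters $\hat{a}_{rp}^{(k)}$ and $\hat{b}_{rp}^{(k)}$ are
\[
\frac{\partial\hat{y}_t(n)}{\partial\hat{a}_{rp}^{(k)}} = \left[ -\delta_{lr}\hat{s}_p(n-k) - \sum_{i_1=1}^{K} \hat{s}_p(n-k-i_1) \frac{\partial\hat{s}_l(n)}{\partial\hat{s}_r(n-i_1)} \right] \frac{\partial\hat{y}_t(n)}{\partial\hat{s}_l(n)}, \tag{2.113}
\]
\[
\frac{\partial\hat{y}_t(n)}{\partial\hat{b}_{rp}^{(k)}} = \left[ \delta_{lr}u_p(n-k) + \sum_{i_1=1}^{K} u_p(n-k-i_1) \frac{\partial\hat{s}_l(n)}{\partial\hat{s}_r(n-i_1)} \right] \frac{\partial\hat{y}_t(n)}{\partial\hat{s}_l(n)}, \tag{2.114}
\]
where $\partial\hat{y}_t(n)/\partial\hat{s}_l(n)$ is given by (2.65).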
2.11 Appendix 2.3. Proof of Theorem 2.1
In the BPTT method, the Wiener model is unfolded back in time and then the unfolded model is differentiated w.r.t. model parameters. This is equivalent to the differentiation of the model and unfolding the differentiated model, i.e., sensitivity models, back in time. Taking into account (2.77) and (2.78), from (2.52) it follows that the outputs of sensitivity models for the parameters $\hat{a}_k$, k = 1, . . . , na, can be expressed as functions of the input u(n):
\[
\frac{\partial\hat{s}(n)}{\partial\hat{a}_k} = -H_1(q^{-1})\hat{s}(n-k) = -H_1(q^{-1})H_2(q^{-1})u(n-k) = -H(q^{-1})u(n-k), \tag{2.115}
\]
where
\[ H(q^{-1}) = H_1(q^{-1})H_2(q^{-1}). \tag{2.116} \]
The impulse response of the system (2.116) is given by the convolution relationship
\[ h(n) = \sum_{i_2=0}^{n} h_1(i_2)h_2(n-i_2). \tag{2.117} \]
Therefore, the partial derivative $\partial\hat{s}(n)/\partial\hat{a}_k$ is related to the input u(n) with the impulse response functions $h_1(n)$ and $h_2(n)$:
\[
\frac{\partial\hat{s}(n)}{\partial\hat{a}_k} = -\sum_{i_1=0}^{n-k} h(i_1)u(n-k-i_1) = -\sum_{i_1=0}^{n-k} \sum_{i_2=0}^{i_1} h_1(i_2)h_2(i_1-i_2)u(n-k-i_1). \tag{2.118}
\]
The calculation of (2.118) requires unfolding sensitivity models n − 1 times back in time. Thus, the partial derivatives $\partial^+\hat{s}(n)/\partial\hat{a}_k$ calculated for sensitivity models unfolded K times back in time are
\[
\frac{\partial^+\hat{s}(n)}{\partial\hat{a}_k} =
\begin{cases}
-\displaystyle\sum_{i_1=0}^{n-k} \sum_{i_2=0}^{i_1} h_1(i_2)h_2(i_1-i_2)u(n-k-i_1), & \text{for } n \le K+k, \\[2ex]
-\displaystyle\sum_{i_1=0}^{K} \sum_{i_2=0}^{i_1} h_1(i_2)h_2(i_1-i_2)u(n-k-i_1) - \sum_{i_1=K+1}^{n-k} \sum_{i_2=0}^{K} h_1(i_2)h_2(i_1-i_2)u(n-k-i_1), & \text{for } n > K+k.
\end{cases} \tag{2.119}
\]
From (2.119) it follows that the errors of computing partial derivatives with the truncated BPTT method are
\[
\Delta\hat{s}_{\hat{a}_k}(n) =
\begin{cases}
0, & \text{for } n \le K+k, \\[1ex]
-\displaystyle\sum_{i_1=K+1}^{n-k} \sum_{i_2=K+1}^{i_1} h_1(i_2)h_2(i_1-i_2)u(n-k-i_1), & \text{for } n > K+k.
\end{cases} \tag{2.120}
\]
Since {u(n)} is a sequence of zero-mean i.i.d. random variables, $\Delta\hat{s}_{\hat{a}_k}(n)$ is a weighted sum of n − k − K zero-mean i.i.d. random variables. Therefore, the variances of the errors (2.120) are given by (2.76) for n > K + k, and are equal to zero otherwise.
2.12 Appendix 2.4. Proof of Theorem 2.2
In the BPTT method, the Wiener model is unfolded back in time and then the unfolded model is differentiated w.r.t. model parameters. This is equivalent to the differentiation of the model and unfolding the differentiated model, i.e., sensitivity models, back in time. Taking into account (2.80), from (2.53) it follows that the outputs of sensitivity models for the parameters $\hat{b}_k$, k = 1, . . . , nb, can be expressed as functions of the input u(n):
\[ \frac{\partial\hat{s}(n)}{\partial\hat{b}_k} = H_1(q^{-1})u(n-k). \tag{2.121} \]
The partial derivatives $\partial\hat{s}(n)/\partial\hat{b}_k$ are related to the input u(n) with the impulse response function $h_1(n)$:
\[ \frac{\partial\hat{s}(n)}{\partial\hat{b}_k} = \sum_{i_1=0}^{n-k} h_1(i_1)u(n-k-i_1). \tag{2.122} \]
The calculation of (2.122) requires unfolding sensitivity models n − 1 times back in time. Thus, the partial derivatives $\partial^+\hat{s}(n)/\partial\hat{b}_k$ calculated for sensitivity models unfolded K times back in time are
\[
\frac{\partial^+\hat{s}(n)}{\partial\hat{b}_k} =
\begin{cases}
\displaystyle\sum_{i_1=0}^{n-k} h_1(i_1)u(n-k-i_1), & \text{for } n \le K+k, \\[2ex]
\displaystyle\sum_{i_1=0}^{K} h_1(i_1)u(n-k-i_1), & \text{for } n > K+k.
\end{cases} \tag{2.123}
\]
Therefore, the errors of computing partial derivatives with the truncated BPTT method are
\[
\Delta\hat{s}_{\hat{b}_k}(n) =
\begin{cases}
0, & \text{for } n \le K+k, \\[1ex]
\displaystyle\sum_{i_1=K+1}^{n-k} h_1(i_1)u(n-k-i_1), & \text{for } n > K+k.
\end{cases} \tag{2.124}
\]
Since {u(n)} is a sequence of zero-mean i.i.d. random variables, $\Delta\hat{s}_{\hat{b}_k}(n)$ is a weighted sum of n − k − K zero-mean i.i.d. random variables. Therefore, the variances of the errors (2.124) are given by (2.79) for n > K + k, and are equal to zero otherwise.
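The practical use of this result, choosing K so that the truncation error is negligible, can be illustrated numerically. The sketch below rests on two assumptions that are not stated in this appendix: that $H_1(q^{-1}) = 1/\hat{A}(q^{-1})$, and that the error variance is the input variance times the sum of squared weights in (2.124), which is the standard result for a weighted sum of i.i.d. variables. Function names are hypothetical.

```python
import numpy as np

def impulse_response_inv_A(a, length):
    """First `length` samples of the impulse response of 1 / A_hat(q^-1),
    where A_hat(q^-1) = 1 + a[0] q^-1 + ... + a[na-1] q^-na."""
    h = np.zeros(length)
    h[0] = 1.0
    for i in range(1, length):
        h[i] = -sum(a[m - 1] * h[i - m] for m in range(1, len(a) + 1) if m <= i)
    return h

def truncation_error_variance(a, K, n, k, sigma_u2=1.0):
    """Variance of the error (2.124) for a zero-mean i.i.d. input of variance sigma_u2."""
    if n <= K + k:
        return 0.0                      # no truncation error in this case
    h1 = impulse_response_inv_A(a, n - k + 1)
    # squared weights h1(i1), i1 = K+1, ..., n-k
    return sigma_u2 * float(np.sum(h1[K + 1 : n - k + 1] ** 2))
```

Evaluating `truncation_error_variance` for increasing K shows how quickly the error decays with the impulse response, which is the basis for selecting the number of unfolded time steps.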
3 Neural network Hammerstein models
3.1 Introduction
This chapter deals with gradient-based learning algorithms for training neural network Hammerstein models. As in the case of neural network Wiener models discussed in Chapter 2, four different gradient calculation algorithms, i.e., backpropagation for series-parallel models (BPS), backpropagation (BPP), the sensitivity method (SM), and backpropagation through time (BPTT) for parallel models are derived. Having the rules for gradient calculation derived, steepest descent or other gradient-based learning algorithms can be implemented easily. Besides steepest descent algorithms, four other learning algorithms, which combine steepest descent algorithms with the recursive least squares (RLS) algorithm or the recursive pseudolinear regression (RPLR) algorithm, are proposed.

For the truncated BPTT algorithm, gradient calculation accuracy is analyzed. It is shown that gradient calculation accuracy depends on impulse responses of sensitivity models and the linear dynamic model. Knowing these impulse responses, the errors of the calculation of partial derivatives of the model output w.r.t. model parameters can be evaluated. Computational complexity of the algorithms is analyzed and expressed in terms of the polynomial orders na and nb, the number of nonlinear nodes, and the number of unfolded time steps for the BPTT algorithm.

This chapter is organized as follows: In Section 3.2, the identification problem is formulated. Section 3.3 introduces both the series-parallel and parallel neural network Hammerstein models. Gradient calculation in the SISO and MIMO Hammerstein models is considered in Section 3.4. Based on sensitivity models, an analysis of gradient calculation accuracy with the truncated BPTT algorithm is performed. The results of this analysis are shown to be useful in the determination of the number of unfolded time steps, which is necessary to calculate the gradient at an assumed level of accuracy. Section 3.4 also gives a comparison of computational complexity of the algorithms. Section 3.5 contains a comparative simulation study based on the identification of a second order Hammerstein system. The combined steepest descent and the RLS or RPLR algorithms, which use different approximations of the gradient obtained with BPS, BPP, the SM, and truncated BPTT, are described in Section 3.6. Finally, a few concluding remarks are presented in Section 3.7.

A. Janczak: Identification of Nonlinear Systems, LNCIS 310, pp. 77–116, 2005. © Springer-Verlag Berlin Heidelberg 2005
3.2 Problem formulation
The Hammerstein system, shown in Fig. 3.1, is an example of a block-oriented nonlinear system composed of a zero-memory nonlinear element followed by a linear dynamic system. Let f(·) denote the characteristic of the nonlinear element; $a_1, \dots, a_{na}$, $b_1, \dots, b_{nb}$ are the parameters of the linear dynamic system, $q^{-1}$ is the backward shift operator, and ε(n) is the additive output disturbance. The output y(n) of the Hammerstein system to the input u(n) at the time n is
\[ y(n) = \frac{B(q^{-1})}{A(q^{-1})} f\big(u(n)\big) + \varepsilon(n), \tag{3.1} \]
where
\[ A(q^{-1}) = 1 + a_1 q^{-1} + \dots + a_{na} q^{-na}, \tag{3.2} \]
\[ B(q^{-1}) = b_1 q^{-1} + \dots + b_{nb} q^{-nb}. \tag{3.3} \]

Fig. 3.1. SISO Hammerstein system

The following assumptions are made about the system:
Assumption 3.1. The function f(·) is continuous.
Assumption 3.2. The linear dynamic system is causal and asymptotically stable.
Assumption 3.3. The polynomial orders na and nb are known.

The identification problem can be formulated as follows: Given a set of the system input and output data {u(n), y(n)}, n = 1, 2, ..., N, the goal is to estimate the parameters of the linear dynamic system and the characteristic of the nonlinear element minimizing the following global cost function:
\[ J = \frac{1}{2} \sum_{n=1}^{N} \big( y(n) - \hat{y}(n) \big)^2, \tag{3.4} \]
where $\hat{y}(n)$ is the output of the neural network Hammerstein model. This can be done by an off-line iterative procedure that uses the whole training set and is called batch learning or the batch mode. For control or filtering applications, it may be more convenient to use another iterative learning procedure that employs the local cost function
\[ J(n) = \frac{1}{2} \big( y(n) - \hat{y}(n) \big)^2. \tag{3.5} \]
This results in an on-line learning method that is also called pattern learning or the sequential mode as it processes the training data sequentially. In both learning procedures, the weights are updated along the negative gradient of the cost function J or J(n), respectively. Applying a pattern learning version of the steepest descent method, the actual values of the linear dynamic model parameters $\hat{a}_k(n)$, $\hat{b}_k(n)$ and any parameter $w_c(n)$ of the nonlinear element model are updated as follows:
\[ \hat{a}_k(n) = \hat{a}_k(n-1) - \eta \frac{\partial J(n)}{\partial\hat{a}_k(n-1)}, \quad k = 1, \dots, na, \tag{3.6} \]
\[ \hat{b}_k(n) = \hat{b}_k(n-1) - \eta \frac{\partial J(n)}{\partial\hat{b}_k(n-1)}, \quad k = 1, \dots, nb, \tag{3.7} \]
\[ w_c(n) = w_c(n-1) - \eta \frac{\partial J(n)}{\partial w_c(n-1)}, \quad c = 1, \dots, 3M+1, \tag{3.8} \]
where η is the learning rate, and 3M + 1 is the number of adjustable weights of the nonlinear element model.
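As a worked step, the chain rule applied to the local cost function (3.5) shows why the updates (3.6) – (3.8) only require partial derivatives of the model output. Here θ is a shorthand, introduced only for this illustration, for any single parameter $\hat{a}_k$, $\hat{b}_k$ or $w_c$:

\[
\frac{\partial J(n)}{\partial\theta} = -\big( y(n) - \hat{y}(n) \big) \frac{\partial\hat{y}(n)}{\partial\theta}
\quad\Longrightarrow\quad
\theta(n) = \theta(n-1) + \eta \big( y(n) - \hat{y}(n) \big) \frac{\partial\hat{y}(n)}{\partial\theta(n-1)}.
\]

The gradient calculation methods of Section 3.4 differ only in how the factor $\partial\hat{y}(n)/\partial\theta$ is obtained or approximated.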
3.3 Series-parallel and parallel neural network Hammerstein models
Neural network Hammerstein models, considered in this section, consist of a multilayer perceptron model of the nonlinear element followed by a model of the linear dynamic system. Both the SISO and MIMO structures of neural network Hammerstein models are introduced. In the SISO case, a single linear node with two tapped delay lines forms the model of the linear dynamic system. The number of linear nodes in the MIMO case is equal to the number of model outputs, while the number of tapped delay lines equals the sum of the number of model inputs and outputs.

3.3.1 SISO Hammerstein models
Consider a neural network model of the nonlinear element of a multilayer perceptron architecture, composed of one hidden layer with M nonlinear nodes and one linear output node (Fig. 3.2). The output $\hat{f}(u(n),w)$ of this model to the input u(n) is
\[ \hat{f}\big(u(n),w\big) = \sum_{j=1}^{M} w_{1j}^{(2)} \varphi\big(x_j(n)\big) + w_{10}^{(2)}, \tag{3.9} \]
\[ x_j(n) = w_{j1}^{(1)} u(n) + w_{j0}^{(1)}, \tag{3.10} \]
where $w = [w_{10}^{(1)} \dots w_{M1}^{(1)} \; w_{10}^{(2)} \dots w_{1M}^{(2)}]^T$ is the parameter vector, $x_j(n)$ is the activation of the jth node of the hidden layer at the time n, φ(·) is the activation function of hidden layer nodes, and $w_{10}^{(1)}, \dots, w_{M1}^{(1)}$ and $w_{10}^{(2)}, \dots, w_{1M}^{(2)}$ are the weights and the thresholds of hidden layer nodes and the output node, respectively.

Fig. 3.2. Neural network model of the SISO nonlinear element

Both the feedforward and recurrent neural networks can be employed as models of the Hammerstein system. In the feedforward model (Fig. 3.3), past values of the system input u(n) and the system output y(n) are used as model inputs:
\[ \hat{y}(n) = \big[ 1 - \hat{A}(q^{-1}) \big] y(n) + \hat{B}(q^{-1}) \hat{f}\big(u(n),w\big), \tag{3.11} \]
where
\[ \hat{A}(q^{-1}) = 1 + \hat{a}_1 q^{-1} + \dots + \hat{a}_{na} q^{-na}, \tag{3.12} \]
\[ \hat{B}(q^{-1}) = \hat{b}_1 q^{-1} + \dots + \hat{b}_{nb} q^{-nb}. \tag{3.13} \]
Unlike the feedforward model, the recurrent one (Fig. 3.4) does not use past values of the system output y(n) but past values of its own output $\hat{y}(n)$ instead:
\[ \hat{y}(n) = \big[ 1 - \hat{A}(q^{-1}) \big] \hat{y}(n) + \hat{B}(q^{-1}) \hat{f}\big(u(n),w\big). \tag{3.14} \]
In the system identification literature, the models of these two types are traditionally known as the series-parallel model and the parallel model [112, 121].
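The difference between the two model types is only in which past outputs are fed back. The following sketch illustrates this under stated assumptions: hyperbolic tangent hidden nodes (the text leaves φ(·) general), zero initial conditions, and hypothetical function and argument names.

```python
import numpy as np

def mlp(u, w1_in, w1_bias, w2_out, w2_bias):
    """Nonlinear element model f_hat(u, w) of (3.9)-(3.10) for a 1-D input array u.
    tanh is an assumed choice of the activation function."""
    x = np.outer(u, w1_in) + w1_bias          # activations x_j(n), shape (N, M)
    return np.tanh(x) @ w2_out + w2_bias      # shape (N,)

def hammerstein_output(a, b, v, y_meas=None):
    """Model output for v(n) = f_hat(u(n), w).

    y_meas given -> series-parallel model (3.11): past measured outputs are used.
    y_meas None  -> parallel model (3.14): past model outputs are fed back.
    """
    N = len(v)
    y_hat = np.zeros(N)
    past = y_hat if y_meas is None else np.asarray(y_meas, dtype=float)
    for n in range(N):
        acc = 0.0
        for m in range(1, len(a) + 1):
            if n - m >= 0:
                acc -= a[m - 1] * past[n - m]
        for m in range(1, len(b) + 1):
            if n - m >= 0:
                acc += b[m - 1] * v[n - m]
        y_hat[n] = acc
    return y_hat
```

If the measured output is noise free and generated by the same parameters, both calls return the same sequence; with noisy measurements the series-parallel prediction follows the data while the parallel model is a pure simulation.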
Fig. 3.3. Series-parallel SISO neural network Hammerstein model

Fig. 3.4. Parallel SISO neural network Hammerstein model
While the series-parallel model is based on the equation error definition, the parallel model corresponds to the output error definition. The identification error for the series-parallel model (3.11) is
\[
e(n) = y(n) - \hat{y}(n) = \hat{A}(q^{-1}) \frac{B(q^{-1})}{A(q^{-1})} f\big(u(n)\big) - \hat{B}(q^{-1}) \hat{f}\big(u(n),w\big) + \hat{A}(q^{-1})\varepsilon(n). \tag{3.15}
\]
Therefore, it is clear that if the system is disturbance free (ε(n) = 0) and model parameters converge to true parameters of the system, then the identification error e(n) converges to zero. If we assume that the system output is corrupted by the additive white noise disturbance (ε(n) ≠ 0) and model parameters are equal to their true values, the identification error is correlated, i.e., $e(n) = \hat{A}(q^{-1})\varepsilon(n)$. For the parallel model of the Hammerstein system (3.14), the identification error is
\[
e(n) = y(n) - \hat{y}(n) = \frac{B(q^{-1})}{A(q^{-1})} f\big(u(n)\big) - \frac{\hat{B}(q^{-1})}{\hat{A}(q^{-1})} \hat{f}\big(u(n),w\big) + \varepsilon(n). \tag{3.16}
\]
As in the previous case, the identification error also converges to zero if model parameters converge to their true values and the system is disturbance free. However, contrary to the series-parallel model, if model parameters are equal to true parameters of the system, for a system corrupted by the additive white noise, the identification error is not correlated, i.e., e(n) = ε(n).

The characterization of the Hammerstein system is not unique as the nonlinear element and the linear dynamic system are connected in series. In other words, Hammerstein systems described by $\big(B(q^{-1})/A(q^{-1})\big)/\alpha$ and $\alpha f\big(u(n)\big)$ reveal the same input-output behavior for any α ≠ 0. Therefore, to obtain a unique characterization of the neural network model, either the nonlinear element model or the linear dynamic model should be normalized. To normalize the gain of the linear dynamic model to 1, the model weights are scaled as follows: $\hat{b}_k = \tilde{b}_k/\alpha$, k = 1, . . . , nb, $w_{1j}^{(2)} = \alpha\tilde{w}_{1j}^{(2)}$, j = 1, . . . , M, where $\tilde{b}_k$, $\tilde{w}_{1j}^{(2)}$ denote the parameters of the unnormalized model, and $\alpha = \tilde{B}(1)/\tilde{A}(1)$ is the linear dynamic model gain.

3.3.2 MIMO Hammerstein models
Consider a parallel MIMO neural network Hammerstein model with nu inputs, ny outputs, nf outputs of the nonlinear element model, and M nonlinear nodes in the hidden layer. The lth output of the nonlinear element model (Fig. 3.5) at the time n is
\[ \hat{f}_l\big(u(n),w_l\big) = \sum_{j=1}^{M} w_{lj}^{(2)} \varphi\big(x_j(n)\big) + w_{l0}^{(2)}, \quad l = 1, \dots, nf, \tag{3.17} \]
\[ x_j(n) = \sum_{i=1}^{nu} w_{ji}^{(1)} u_i(n) + w_{j0}^{(1)}, \tag{3.18} \]
and the lth vector of weights is defined as
\[ w_l = \big[ w_{10}^{(1)}, \dots, w_{Mnu}^{(1)}, w_{l0}^{(2)}, \dots, w_{lM}^{(2)} \big]^T. \tag{3.19} \]
Fig. 3.5. Neural network model of the MIMO nonlinear element
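The output $\hat{\mathbf{y}}(n)$ of the parallel MIMO Hammerstein model can be expressed as
\[
\hat{\mathbf{y}}(n) = -\sum_{m=1}^{na} \hat{\mathbf{A}}^{(m)} \hat{\mathbf{y}}(n-m) + \sum_{m=1}^{nb} \hat{\mathbf{B}}^{(m)} \hat{\mathbf{f}}\big(\mathbf{u}(n-m),\mathbf{W}\big), \tag{3.20}
\]
where
\[ \hat{\mathbf{y}}(n) = \big[ \hat{y}_1(n) \dots \hat{y}_{ny}(n) \big]^T, \tag{3.21} \]
\[ \mathbf{u}(n) = \big[ u_1(n) \dots u_{nu}(n) \big]^T, \tag{3.22} \]
\[ \hat{\mathbf{f}}\big(\mathbf{u}(n),\mathbf{W}\big) = \big[ \hat{f}_1\big(\mathbf{u}(n),\mathbf{w}_1\big) \dots \hat{f}_{nf}\big(\mathbf{u}(n),\mathbf{w}_{nf}\big) \big]^T, \tag{3.23} \]
\[ \hat{\mathbf{A}}^{(m)} \in \mathbb{R}^{ny \times ny}, \tag{3.24} \]
\[ \hat{\mathbf{B}}^{(m)} \in \mathbb{R}^{ny \times nf}, \tag{3.25} \]
\[ \mathbf{W} = \big[ w_{10}^{(1)}, \dots, w_{Mnu}^{(1)}, w_{10}^{(2)}, \dots, w_{nfM}^{(2)} \big]^T. \tag{3.26} \]
From (3.20), it follows that the tth output of the parallel model is
\[
\hat{y}_t(n) = -\sum_{m=1}^{na} \sum_{m_1=1}^{ny} \hat{a}_{tm_1}^{(m)} \hat{y}_{m_1}(n-m) + \sum_{m=1}^{nb} \sum_{m_1=1}^{nf} \hat{b}_{tm_1}^{(m)} \hat{f}_{m_1}\big(\mathbf{u}(n-m),\mathbf{w}_{m_1}\big), \tag{3.27}
\]
where the superscript m associated with the elements $\hat{a}$ and $\hat{b}$ denotes the matrix number, and the subscripts t and $m_1$ denote the row and the column number, respectively. Replacing $\hat{y}_{m_1}(n-m)$ with $y_{m_1}(n-m)$ on the r.h.s. of (3.27), a series-parallel model can be obtained:
\[
\hat{y}_t(n) = -\sum_{m=1}^{na} \sum_{m_1=1}^{ny} \hat{a}_{tm_1}^{(m)} y_{m_1}(n-m) + \sum_{m=1}^{nb} \sum_{m_1=1}^{nf} \hat{b}_{tm_1}^{(m)} \hat{f}_{m_1}\big(\mathbf{u}(n-m),\mathbf{w}_{m_1}\big). \tag{3.28}
\]
Note that $\hat{\mathbf{A}}^{(m)}$, m = 1, . . . , na, are diagonal matrices if the outputs $\hat{y}_t(n)$, t = 1, . . . , ny, are mutually independent, i.e., they do not depend on $\hat{y}_{m_1}(n-m)$, where $m_1 \neq t$. Moreover, if, additionally, $\hat{\mathbf{B}}^{(m)}$, m = 1, . . . , nb, are diagonal matrices, a MIMO Hammerstein model with uncoupled dynamics is obtained.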
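The matrix recursion (3.20) can be written compactly with arrays. The following is a minimal sketch assuming zero initial conditions; the array layout (coefficient matrices stacked along the first axis, signals stored row-wise) and the function name are choices made for this illustration only.

```python
import numpy as np

def mimo_parallel_output(A, B, F):
    """Parallel MIMO Hammerstein model output, (3.20).

    A : (na, ny, ny) stacked matrices A_hat^(1..na)
    B : (nb, ny, nf) stacked matrices B_hat^(1..nb)
    F : (N, nf) outputs of the nonlinear element model, row n = f_hat(u(n), W)
    Returns Y : (N, ny), row n = y_hat(n).
    """
    N = F.shape[0]
    ny = A.shape[1]
    Y = np.zeros((N, ny))
    for n in range(N):
        for m in range(1, A.shape[0] + 1):
            if n - m >= 0:
                Y[n] -= A[m - 1] @ Y[n - m]   # feedback of past model outputs
        for m in range(1, B.shape[0] + 1):
            if n - m >= 0:
                Y[n] += B[m - 1] @ F[n - m]   # delayed nonlinear element outputs
    return Y
```

Replacing `Y[n - m]` in the feedback term with measured outputs would give the series-parallel form (3.28).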
3.4 Gradient calculation
Due to its static nature, the gradient in the series-parallel SISO Hammerstein model, shown in Fig. 3.3, can be calculated with the well-known backpropagation algorithm. Although the backpropagation algorithm can also be employed to calculate the gradient in recurrent models, it fails to evaluate the gradient properly. This often results in an extremely slow convergence rate. One method of gradient calculation in recurrent models is the sensitivity method, known also as dynamic backpropagation or on-line recurrent backpropagation [23, 121, 139, 150]. In this method, to compute partial derivatives of the cost function, a set of linear difference equations is solved on-line by simulation. An alternative to the SM is the backpropagation through time method [163], which uses a model unfolded back in time. In this case, the gradient can be evaluated with a method similar to backpropagation. With both of these methods, a higher convergence rate may be expected. Nevertheless, their computational complexity is higher than the computational complexity of the backpropagation method.

3.4.1 Series-parallel SISO model. Backpropagation method
The output of the series-parallel Hammerstein model depends on past inputs and outputs of the system. There are no feedback connections and the computationally effective backpropagation learning algorithm (BPS) can be used to calculate the gradient. Differentiating (3.11), partial derivatives of $\hat{y}(n)$ w.r.t. model parameters can be obtained as follows:
\[ \frac{\partial\hat{y}(n)}{\partial\hat{a}_k} = -y(n-k), \quad k = 1, \dots, na, \tag{3.29} \]
\[ \frac{\partial\hat{y}(n)}{\partial\hat{b}_k} = \hat{f}\big(u(n-k),w\big), \quad k = 1, \dots, nb, \tag{3.30} \]
\[ \frac{\partial\hat{y}(n)}{\partial w_c} = \sum_{m=1}^{nb} \hat{b}_m \frac{\partial\hat{f}\big(u(n-m),w\big)}{\partial w_c}, \quad c = 1, \dots, 3M+1, \tag{3.31} \]
where $w_c$ denotes any parameter of the nonlinear element model. Partial derivatives of $\hat{f}\big(u(n),w\big)$ w.r.t. the elements of the vector w can be calculated differentiating (3.9):
\[
\frac{\partial\hat{f}\big(u(n),w\big)}{\partial w_{j1}^{(1)}} = \frac{\partial\hat{f}\big(u(n),w\big)}{\partial\varphi\big(x_j(n)\big)} \frac{\partial\varphi\big(x_j(n)\big)}{\partial x_j(n)} \frac{\partial x_j(n)}{\partial w_{j1}^{(1)}} = w_{1j}^{(2)} \varphi'\big(x_j(n)\big) u(n), \tag{3.32}
\]
\[
\frac{\partial\hat{f}\big(u(n),w\big)}{\partial w_{j0}^{(1)}} = \frac{\partial\hat{f}\big(u(n),w\big)}{\partial\varphi\big(x_j(n)\big)} \frac{\partial\varphi\big(x_j(n)\big)}{\partial x_j(n)} \frac{\partial x_j(n)}{\partial w_{j0}^{(1)}} = w_{1j}^{(2)} \varphi'\big(x_j(n)\big), \tag{3.33}
\]
\[ \frac{\partial\hat{f}\big(u(n),w\big)}{\partial w_{1j}^{(2)}} = \varphi\big(x_j(n)\big), \tag{3.34} \]
\[ \frac{\partial\hat{f}\big(u(n),w\big)}{\partial w_{10}^{(2)}} = 1, \tag{3.35} \]
where j = 1, . . . , M.

3.4.2 Parallel SISO model. Backpropagation method
Owing to its low computational complexity, the backpropagation method is also used for training recurrent models in some cases. Neglecting the dependence of past outputs of the parallel model (3.14) on its parameters, the following expressions can be obtained:
\[ \frac{\partial\hat{y}(n)}{\partial\hat{a}_k} = -\hat{y}(n-k), \quad k = 1, \dots, na, \tag{3.36} \]
\[ \frac{\partial\hat{y}(n)}{\partial\hat{b}_k} = \hat{f}\big(u(n-k),w\big), \quad k = 1, \dots, nb, \tag{3.37} \]
\[ \frac{\partial\hat{y}(n)}{\partial w_c} = \sum_{m=1}^{nb} \hat{b}_m \frac{\partial\hat{f}\big(u(n-m),w\big)}{\partial w_c}, \quad c = 1, \dots, 3M+1. \tag{3.38} \]
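The backpropagation rules (3.29) – (3.35) translate directly into code. The sketch below is illustrative only: it assumes hyperbolic tangent hidden nodes, stores the weights as a tuple `(w1_in, w1_bias, w2_out, w2_bias)` chosen for this example, and requires n to be at least max(na, nb) so that all delayed samples exist.

```python
import numpy as np

def mlp(u, w):
    """f_hat(u, w) of (3.9)-(3.10) for a scalar input u (tanh nodes assumed)."""
    w1_in, w1_bias, w2_out, w2_bias = w
    return w2_out @ np.tanh(w1_in * u + w1_bias) + w2_bias

def mlp_grad(u, w):
    """Partial derivatives (3.32)-(3.35), stacked in the order of the tuple w."""
    w1_in, w1_bias, w2_out, w2_bias = w
    phi = np.tanh(w1_in * u + w1_bias)
    dphi = 1.0 - phi**2                          # derivative of tanh
    return np.concatenate([w2_out * dphi * u,    # (3.32)
                           w2_out * dphi,        # (3.33)
                           phi,                  # (3.34)
                           [1.0]])               # (3.35)

def bps_gradient(n, a, b, w, u, y):
    """Partial derivatives of y_hat(n) in the series-parallel model, (3.29)-(3.31).
    `a` is used only for its length na."""
    d_a = np.array([-y[n - k] for k in range(1, len(a) + 1)])            # (3.29)
    d_b = np.array([mlp(u[n - k], w) for k in range(1, len(b) + 1)])     # (3.30)
    d_w = sum(b[m - 1] * mlp_grad(u[n - m], w)
              for m in range(1, len(b) + 1))                             # (3.31)
    return d_a, d_b, d_w
```

Passing the model output sequence in place of `y` gives the BPP approximation (3.36) – (3.38) for the parallel model.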
3.4.3 Parallel SISO model. Sensitivity method
Parallel models are dynamic systems themselves as their outputs depend not only on current and past system inputs but also on past model outputs. This makes the calculation of the gradient more complex. The differentiation of (3.14) w.r.t. model parameters yields
\[ \frac{\partial\hat{y}(n)}{\partial\hat{a}_k} = -\hat{y}(n-k) - \sum_{m=1}^{na} \hat{a}_m \frac{\partial\hat{y}(n-m)}{\partial\hat{a}_k}, \quad k = 1, \dots, na, \tag{3.39} \]
\[ \frac{\partial\hat{y}(n)}{\partial\hat{b}_k} = \hat{f}\big(u(n-k),w\big) - \sum_{m=1}^{na} \hat{a}_m \frac{\partial\hat{y}(n-m)}{\partial\hat{b}_k}, \quad k = 1, \dots, nb, \tag{3.40} \]
\[ \frac{\partial\hat{y}(n)}{\partial w_c} = \sum_{m=1}^{nb} \hat{b}_m \frac{\partial\hat{f}\big(u(n-m),w\big)}{\partial w_c} - \sum_{m=1}^{na} \hat{a}_m \frac{\partial\hat{y}(n-m)}{\partial w_c}, \quad c = 1, \dots, 3M+1. \tag{3.41} \]

Fig. 3.6. Parallel model of the Hammerstein system and its sensitivity models
Equations (3.39) – (3.41) define the sensitivity models shown in Fig. 3.6. To obtain partial derivatives, these linear difference equations of the order na are solved on-line. Note that this requires the simulation of na + nb + 3M + 1 sensitivity models. In the SM, we assume that the parameters of the linear dynamic model are time invariant. This is true in the batch mode but not in the sequential mode, in which these parameters are updated after the presentation of each learning pattern. Thus, only an approximate gradient can be obtained if we apply a recursive version of the SM. The calculated gradient approaches the true one if parameter changes are small enough. This can be achieved at small learning rates but it may result in a slow convergence rate.
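The sensitivity models can be simulated alongside the parallel model in one loop. The sketch below assumes fixed parameters (the batch-mode situation described above), zero initial conditions, and takes the nonlinear element outputs `v` and their weight derivatives `dv` as precomputed inputs; these names and the array layout are assumptions of this illustration.

```python
import numpy as np

def sm_gradients(a, b, v, dv):
    """Simulate the parallel model (3.14) with its sensitivity models (3.39)-(3.41).

    a, b : linear dynamic model parameters (lengths na, nb)
    v    : (N,) sequence v(n) = f_hat(u(n), w)
    dv   : (N, P) rows d f_hat(u(n), w) / d w, P = number of weights
    Returns y_hat and the derivatives of y_hat(n) w.r.t. a, b and w.
    """
    na, nb, N = len(a), len(b), len(v)
    y = np.zeros(N)
    d_a = np.zeros((N, na))
    d_b = np.zeros((N, nb))
    d_w = np.zeros((N, dv.shape[1]))
    for n in range(N):
        for m in range(1, na + 1):
            if n - m >= 0:
                y[n] -= a[m - 1] * y[n - m]
                d_a[n] -= a[m - 1] * d_a[n - m]   # recursive terms of (3.39)
                d_b[n] -= a[m - 1] * d_b[n - m]   # recursive terms of (3.40)
                d_w[n] -= a[m - 1] * d_w[n - m]   # recursive terms of (3.41)
        for m in range(1, nb + 1):
            if n - m >= 0:
                y[n] += b[m - 1] * v[n - m]
                d_w[n] += b[m - 1] * dv[n - m]    # input term of (3.41)
        for k in range(1, na + 1):
            if n - k >= 0:
                d_a[n, k - 1] -= y[n - k]         # input term of (3.39)
        for k in range(1, nb + 1):
            if n - k >= 0:
                d_b[n, k - 1] += v[n - k]         # input term of (3.40)
    return y, d_a, d_b, d_w
```

With fixed parameters these derivatives agree with finite differences of the simulated output; in the sequential mode, where parameters change at every step, they are only approximations, as noted above.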
3.4.4 Parallel SISO model. Backpropagation through time method
In the BPTT method, it is also assumed that model parameters do not change in successive time steps. The Hammerstein model (3.14), unfolded for one step back in time, can be written as
\[
\begin{aligned}
\hat{y}(n) &= -\hat{a}_1\hat{y}(n-1) - \sum_{m=2}^{na} \hat{a}_m\hat{y}(n-m) + \sum_{m=1}^{nb} \hat{b}_m \hat{f}\big(u(n-m),w\big) \\
&= -\hat{a}_1 \left[ -\sum_{m=1}^{na} \hat{a}_m\hat{y}(n-m-1) + \sum_{m=1}^{nb} \hat{b}_m \hat{f}\big(u(n-m-1),w\big) \right] \\
&\quad - \sum_{m=2}^{na} \hat{a}_m\hat{y}(n-m) + \sum_{m=1}^{nb} \hat{b}_m \hat{f}\big(u(n-m),w\big).
\end{aligned} \tag{3.42}
\]
The unfolding procedure can be continued until initial conditions of the model are reached. As we proceed with the unfolding procedure, an unfolded model equation is obtained, which becomes more and more complex and makes gradient calculation more and more difficult. On the other hand, the unfolded model can be represented by a multilayer feedforward neural network, see Fig. 3.7 for an example of an unfolded-in-time Hammerstein model. The completely unfolded neural network corresponding to (3.14) is no longer of the recurrent type and can be trained with an algorithm similar to backpropagation, called backpropagation through time (BPTT) [84]. Nevertheless, the complexity of the unfolded network depends on the number of time steps, and both the computational and memory requirements of the BPTT method are time dependent. In practice, to overcome this inconvenience, unfolding in time can be restricted to a fixed number of time steps in a method called truncated backpropagation through time. In this way, it is often possible to obtain quite a good approximation of the gradient, even for models unfolded back in time only for a few time steps. A detailed description of gradient calculation with the truncated BPTT method is given in Appendix 3.1.

3.4.5 Series-parallel MIMO model. Backpropagation method
For the MIMO Hammerstein model, the following cost function is minimized in the sequential mode:
\[ J(n) = \frac{1}{2} \sum_{t=1}^{ny} \big( y_t(n) - \hat{y}_t(n) \big)^2. \tag{3.43} \]
The gradient in series-parallel MIMO Hammerstein models can be calculated with a version of the BPS algorithm extended to the multidimensional case. The differentiation of (3.28) w.r.t. the weights of the MIMO nonlinear element model gives
\[ \frac{\partial J(n)}{\partial w_{ji}^{(1)}} = -\sum_{t=1}^{ny} \big( y_t(n) - \hat{y}_t(n) \big) \frac{\partial\hat{y}_t(n)}{\partial w_{ji}^{(1)}}, \tag{3.44} \]
\[ \frac{\partial J(n)}{\partial w_{j0}^{(1)}} = -\sum_{t=1}^{ny} \big( y_t(n) - \hat{y}_t(n) \big) \frac{\partial\hat{y}_t(n)}{\partial w_{j0}^{(1)}}, \tag{3.45} \]
\[ \frac{\partial J(n)}{\partial w_{lj}^{(2)}} = -\sum_{t=1}^{ny} \big( y_t(n) - \hat{y}_t(n) \big) \frac{\partial\hat{y}_t(n)}{\partial w_{lj}^{(2)}}, \tag{3.46} \]
\[ \frac{\partial J(n)}{\partial w_{l0}^{(2)}} = -\sum_{t=1}^{ny} \big( y_t(n) - \hat{y}_t(n) \big) \frac{\partial\hat{y}_t(n)}{\partial w_{l0}^{(2)}}, \tag{3.47} \]
where i = 1, . . . , nu, j = 1, . . . , M, l = 1, . . . , nf.

Fig. 3.7. Parallel Hammerstein model of the third order unfolded back in time
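The calculation of partial derivatives of the model outputs w.r.t. the weights of the MIMO nonlinear element requires the calculation of partial derivatives of the outputs of the MIMO nonlinear element $\hat f_l\big(\mathbf{u}(n),\mathbf{w}_l\big)$ w.r.t. its weights. The differentiation of (3.17) gives

$$\frac{\partial \hat f_l\big(\mathbf{u}(n),\mathbf{w}_l\big)}{\partial w_{ji}^{(1)}} = \frac{\partial \hat f_l\big(\mathbf{u}(n),\mathbf{w}_l\big)}{\partial \varphi\big(x_j(n)\big)}\,\frac{\partial \varphi\big(x_j(n)\big)}{\partial x_j(n)}\,\frac{\partial x_j(n)}{\partial w_{ji}^{(1)}} = w_{lj}^{(2)}\varphi'\big(x_j(n)\big)u_i(n), \tag{3.48}$$

$$\frac{\partial \hat f_l\big(\mathbf{u}(n),\mathbf{w}_l\big)}{\partial w_{j0}^{(1)}} = \frac{\partial \hat f_l\big(\mathbf{u}(n),\mathbf{w}_l\big)}{\partial \varphi\big(x_j(n)\big)}\,\frac{\partial \varphi\big(x_j(n)\big)}{\partial x_j(n)}\,\frac{\partial x_j(n)}{\partial w_{j0}^{(1)}} = w_{lj}^{(2)}\varphi'\big(x_j(n)\big), \tag{3.49}$$

$$\frac{\partial \hat f_l\big(\mathbf{u}(n),\mathbf{w}_l\big)}{\partial w_{lj}^{(2)}} = \varphi\big(x_j(n)\big), \tag{3.50}$$

$$\frac{\partial \hat f_l\big(\mathbf{u}(n),\mathbf{w}_l\big)}{\partial w_{l0}^{(2)}} = 1. \tag{3.51}$$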
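The derivatives (3.48) – (3.51) can be illustrated with a minimal sketch. It assumes a hyperbolic tangent activation (the activation used later in the simulation example) and plain Python lists for the weights; the function and argument names (`hidden_activations`, `f_hat`, `f_hat_gradients`, `w1`, `w1_bias`, `w2_l`, `w2_bias_l`) are hypothetical and chosen only for this illustration.

```python
import math


def hidden_activations(u, w1, w1_bias):
    """x_j(n) = sum_i w1[j][i] * u[i] + w1_bias[j] for every hidden node j."""
    return [sum(w * ui for w, ui in zip(row, u)) + b for row, b in zip(w1, w1_bias)]


def f_hat(u, w1, w1_bias, w2_l, w2_bias_l):
    """Output l of the nonlinear element: sum_j w2_l[j] * tanh(x_j) + w2_bias_l."""
    x = hidden_activations(u, w1, w1_bias)
    return sum(w * math.tanh(xj) for w, xj in zip(w2_l, x)) + w2_bias_l


def f_hat_gradients(u, w1, w1_bias, w2_l, w2_bias_l):
    """Partial derivatives of output l following (3.48)-(3.51), assuming phi = tanh."""
    x = hidden_activations(u, w1, w1_bias)
    dphi = [1.0 - math.tanh(xj) ** 2 for xj in x]                          # phi'(x_j)
    d_w1 = [[w2_l[j] * dphi[j] * ui for ui in u] for j in range(len(x))]   # (3.48)
    d_w1_bias = [w2_l[j] * dphi[j] for j in range(len(x))]                 # (3.49)
    d_w2 = [math.tanh(xj) for xj in x]                                     # (3.50)
    d_w2_bias = 1.0                                                        # (3.51)
    return d_w1, d_w1_bias, d_w2, d_w2_bias
```

A quick way to gain confidence in such a sketch is to compare each analytic derivative with a central finite difference of `f_hat` for one perturbed weight.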
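Taking into account (3.28) and (3.48) – (3.51), we have

$$\frac{\partial \hat y_t(n)}{\partial w_{ji}^{(1)}} = \sum_{m=1}^{nb}\sum_{m_1=1}^{nf}\hat b_{tm_1}^{(m)}\frac{\partial \hat f_{m_1}\big(\mathbf{u}(n-m),\mathbf{w}_{m_1}\big)}{\partial w_{ji}^{(1)}} = \sum_{m=1}^{nb}\sum_{m_1=1}^{nf}\hat b_{tm_1}^{(m)} w_{m_1 j}^{(2)}\varphi'\big(x_j(n-m)\big)u_i(n-m), \tag{3.52}$$

$$\frac{\partial \hat y_t(n)}{\partial w_{j0}^{(1)}} = \sum_{m=1}^{nb}\sum_{m_1=1}^{nf}\hat b_{tm_1}^{(m)}\frac{\partial \hat f_{m_1}\big(\mathbf{u}(n-m),\mathbf{w}_{m_1}\big)}{\partial w_{j0}^{(1)}} = \sum_{m=1}^{nb}\sum_{m_1=1}^{nf}\hat b_{tm_1}^{(m)} w_{m_1 j}^{(2)}\varphi'\big(x_j(n-m)\big), \tag{3.53}$$

$$\frac{\partial \hat y_t(n)}{\partial w_{lj}^{(2)}} = \sum_{m=1}^{nb}\hat b_{tl}^{(m)}\frac{\partial \hat f_l\big(\mathbf{u}(n-m),\mathbf{w}_l\big)}{\partial w_{lj}^{(2)}} = \sum_{m=1}^{nb}\hat b_{tl}^{(m)}\varphi\big(x_j(n-m)\big), \tag{3.54}$$

$$\frac{\partial \hat y_t(n)}{\partial w_{l0}^{(2)}} = \sum_{m=1}^{nb}\hat b_{tl}^{(m)}\frac{\partial \hat f_l\big(\mathbf{u}(n-m),\mathbf{w}_l\big)}{\partial w_{l0}^{(2)}} = \sum_{m=1}^{nb}\hat b_{tl}^{(m)}. \tag{3.55}$$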
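As only the output $\hat y_t(n)$ depends upon the parameters $\hat a_{tm_1}^{(k)}$ and $\hat b_{tm_1}^{(k)}$, the differentiation of (3.43) yields

$$\frac{\partial J(n)}{\partial \hat a_{tm_1}^{(k)}} = -\big(y_t(n)-\hat y_t(n)\big)\frac{\partial \hat y_t(n)}{\partial \hat a_{tm_1}^{(k)}}, \quad k = 1,\dots,na,\; m_1 = 1,\dots,ny, \tag{3.56}$$

$$\frac{\partial J(n)}{\partial \hat b_{tm_1}^{(k)}} = -\big(y_t(n)-\hat y_t(n)\big)\frac{\partial \hat y_t(n)}{\partial \hat b_{tm_1}^{(k)}}, \quad k = 1,\dots,nb,\; m_1 = 1,\dots,nf. \tag{3.57}$$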
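Differentiating (3.28), partial derivatives of the model output w.r.t. the parameters of the MIMO linear dynamic model can be calculated as

$$\frac{\partial \hat y_t(n)}{\partial \hat a_{tm_1}^{(k)}} = -y_{m_1}(n-k), \tag{3.58}$$

$$\frac{\partial \hat y_t(n)}{\partial \hat b_{tm_1}^{(k)}} = \hat f_{m_1}\big(\mathbf{u}(n-k),\mathbf{w}_{m_1}\big). \tag{3.59}$$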
3.4.6 Parallel MIMO model. Backpropagation method

Neglecting the dependence of past values of output signals upon the parameters of the parallel model and differentiating (3.27) results in the multidimensional version of the BPP method. In the BPP method, partial derivatives of the model output w.r.t. the parameters of the nonlinear element model are calculated in the same way as in the BPS method. The only difference is the calculation of partial derivatives of the model output w.r.t. the parameters $\hat a_{tm_1}^{(k)}$ of the linear dynamic model, which requires replacing $y_{m_1}(n-k)$ in (3.58) with $\hat y_{m_1}(n-k)$. The BPP algorithm has the same computational complexity as the BPS algorithm but it suffers from a low convergence rate as it uses a crude approximation of the gradient.

3.4.7 Parallel MIMO model. Sensitivity method

A much better evaluation of the gradient, at the price of computational complexity, can be obtained with the multidimensional version of the SM. The SM differs from the BPS method in the calculation of partial derivatives of the model output w.r.t. model parameters. To calculate partial derivatives of the model output w.r.t. the parameters of the nonlinear element model, a set of linear difference equations is solved by simulation. From (3.27) and (3.48) – (3.51), it follows that
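$$\frac{\partial \hat y_t(n)}{\partial w_{ji}^{(1)}} = \sum_{m=1}^{nb}\sum_{m_1=1}^{nf}\hat b_{tm_1}^{(m)} w_{m_1 j}^{(2)}\varphi'\big(x_j(n-m)\big)u_i(n-m) - \sum_{m=1}^{na}\sum_{m_1=1}^{ny}\hat a_{tm_1}^{(m)}\frac{\partial \hat y_{m_1}(n-m)}{\partial w_{ji}^{(1)}}, \tag{3.60}$$

$$\frac{\partial \hat y_t(n)}{\partial w_{j0}^{(1)}} = \sum_{m=1}^{nb}\sum_{m_1=1}^{nf}\hat b_{tm_1}^{(m)} w_{m_1 j}^{(2)}\varphi'\big(x_j(n-m)\big) - \sum_{m=1}^{na}\sum_{m_1=1}^{ny}\hat a_{tm_1}^{(m)}\frac{\partial \hat y_{m_1}(n-m)}{\partial w_{j0}^{(1)}}, \tag{3.61}$$

$$\frac{\partial \hat y_t(n)}{\partial w_{lj}^{(2)}} = \sum_{m=1}^{nb}\hat b_{tl}^{(m)}\varphi\big(x_j(n-m)\big) - \sum_{m=1}^{na}\sum_{m_1=1}^{ny}\hat a_{tm_1}^{(m)}\frac{\partial \hat y_{m_1}(n-m)}{\partial w_{lj}^{(2)}}, \tag{3.62}$$

$$\frac{\partial \hat y_t(n)}{\partial w_{l0}^{(2)}} = \sum_{m=1}^{nb}\hat b_{tl}^{(m)} - \sum_{m=1}^{na}\sum_{m_1=1}^{ny}\hat a_{tm_1}^{(m)}\frac{\partial \hat y_{m_1}(n-m)}{\partial w_{l0}^{(2)}}, \tag{3.63}$$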
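where $i = 1,\dots,nu$, $j = 1,\dots,M$, $l = 1,\dots,nf$. Similarly, partial derivatives of the model output w.r.t. the parameters of the linear dynamic model are obtained from another set of linear difference equations:

$$\frac{\partial \hat y_t(n)}{\partial \hat a_{rp}^{(k)}} = -\delta_{tr}\,\hat y_p(n-k) - \sum_{m=1}^{na}\sum_{m_1=1}^{ny}\hat a_{tm_1}^{(m)}\frac{\partial \hat y_{m_1}(n-m)}{\partial \hat a_{rp}^{(k)}}, \tag{3.64}$$

$$\delta_{tr} = \begin{cases} 1 & \text{for } t = r \\ 0 & \text{for } t \neq r \end{cases}, \tag{3.65}$$

where $k = 1,\dots,na$, $r = 1,\dots,ny$, $p = 1,\dots,ny$,

$$\frac{\partial \hat y_t(n)}{\partial \hat b_{rp}^{(k)}} = \delta_{tr}\,\hat f_p\big(\mathbf{u}(n-k),\mathbf{w}_p\big) - \sum_{m=1}^{na}\sum_{m_1=1}^{ny}\hat a_{tm_1}^{(m)}\frac{\partial \hat y_{m_1}(n-m)}{\partial \hat b_{rp}^{(k)}}, \tag{3.66}$$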
where $k = 1,\dots,nb$, $r = 1,\dots,ny$, $p = 1,\dots,nf$.

3.4.8 Parallel MIMO model. Backpropagation through time method

The details of gradient calculation with the truncated backpropagation through time method in MIMO Hammerstein models are given in Appendix 3.2.
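Returning to the sensitivity method of Section 3.4.7, its structure is easiest to see in the SISO special case ($ny = nf = 1$), where (3.64) and (3.66) reduce to one linear difference equation per parameter, driven by the past model output or by the nonlinear element output. The following is a minimal sketch under stated assumptions: the sequence `v` stands for the already computed nonlinear element outputs, initial conditions are zero, time indices start at 0, and all function names are hypothetical.

```python
def past(seq, idx):
    """Value of a signal at time idx, with zero initial conditions for idx < 0."""
    return seq[idx] if idx >= 0 else 0.0


def sensitivity_method(a, b, v):
    """Simulate the SISO parallel model together with its sensitivity models.

    y(n)    = -sum_m a[m-1] y(n-m) + sum_m b[m-1] v(n-m)
    sa_k(n) = -y(n-k) - sum_m a[m-1] sa_k(n-m)     (derivative w.r.t. a_k)
    sb_k(n) =  v(n-k) - sum_m a[m-1] sb_k(n-m)     (derivative w.r.t. b_k)
    """
    N, na, nb = len(v), len(a), len(b)
    y = [0.0] * N
    sa = [[0.0] * N for _ in range(na)]
    sb = [[0.0] * N for _ in range(nb)]
    for n in range(N):
        y[n] = (-sum(a[m - 1] * past(y, n - m) for m in range(1, na + 1))
                + sum(b[m - 1] * past(v, n - m) for m in range(1, nb + 1)))
        for k in range(1, na + 1):
            sa[k - 1][n] = (-past(y, n - k)
                            - sum(a[m - 1] * past(sa[k - 1], n - m) for m in range(1, na + 1)))
        for k in range(1, nb + 1):
            sb[k - 1][n] = (past(v, n - k)
                            - sum(a[m - 1] * past(sb[k - 1], n - m) for m in range(1, na + 1)))
    return y, sa, sb
```

With the parameters held fixed over the whole record, as in this sketch, the sensitivities agree with finite-difference derivatives of the simulated output; this is the batch-mode situation, not the sequential one.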
3.4.9 Accuracy of gradient calculation with truncated BPTT

As in the case of neural network Wiener models, we will assume that the input $u(n)$ is a sequence of zero-mean i.i.d. random variables.

Theorem 3.1. Define the computation error

$$\Delta\hat y_{\hat a_k}(n) = \frac{\partial \hat y(n)}{\partial \hat a_k} - \frac{\partial^+ \hat y(n)}{\partial \hat a_k},$$

where $\partial \hat y(n)/\partial \hat a_k$, $k = 1,\dots,na$, denote partial derivatives calculated with the BPTT method, i.e. unfolding the model (3.14) $n-1$ times back in time, and $\partial^+ \hat y(n)/\partial \hat a_k$ denote partial derivatives calculated with the truncated BPTT method unfolding the model (3.14) $K$ times back in time. Assume that

(A1) The linear dynamic model $\hat B(q^{-1})/\hat A(q^{-1})$ is asymptotically stable;
(A2) $\hat y(n) = 0$, $n = 0,\dots,-na+1$; $\hat f\big(u(n),\mathbf{w}\big) = 0$, $n = -1,\dots,-nb$;
(A3) $\partial \hat y(n)/\partial \hat a_k = 0$, $n = 0,\dots,-na+1$;
(A4) The input $u(n)$, $n = 0, 1, \dots$, is a sequence of zero-mean i.i.d. random variables of finite moments,

$$E[u(n)] = 0, \qquad E[u(n)u(m)] = \begin{cases} \sigma^2 & \text{for } n = m \\ 0 & \text{for } n \neq m \end{cases}.$$

Then

$$\operatorname{var}\big(\Delta\hat y_{\hat a_k}(n)\big) = \sigma_f^2 \sum_{i_1=K+1}^{n-k}\left(\sum_{i_2=K+1}^{i_1} h_1(i_2)h_2(i_1-i_2)\right)^2, \quad k = 1,\dots,na, \tag{3.67}$$

for $n > K + k$ and 0 otherwise, where $\sigma_f^2 = \operatorname{var}\big(\hat f(u(n),\mathbf{w})\big)$, and $h_1(n)$ is the impulse response function of the system

$$H_1(q^{-1}) = \frac{1}{\hat A(q^{-1})}, \tag{3.68}$$

and $h_2(n)$ is the impulse response function of the system

$$H_2(q^{-1}) = \frac{\hat B(q^{-1})}{\hat A(q^{-1})}. \tag{3.69}$$

Proof: The proof is shown in Appendix 3.3.

Theorem 3.2. Define the computation error

$$\Delta\hat y_{\hat b_k}(n) = \frac{\partial \hat y(n)}{\partial \hat b_k} - \frac{\partial^+ \hat y(n)}{\partial \hat b_k},$$
where $\partial \hat y(n)/\partial \hat b_k$, $k = 1,\dots,nb$, denote partial derivatives calculated with the BPTT method, i.e. unfolding the model (3.14) $n-1$ times back in time, and $\partial^+ \hat y(n)/\partial \hat b_k$ denote partial derivatives calculated with the truncated BPTT method unfolding the model (3.14) $K$ times back in time. Assume that

(A1) The linear dynamic model $\hat B(q^{-1})/\hat A(q^{-1})$ is asymptotically stable;
(A2) $\hat y(n) = 0$, $n = 0,\dots,-na+1$; $\hat f\big(u(n),\mathbf{w}\big) = 0$, $n = -1,\dots,-nb$;
(A3) $\partial \hat y(n)/\partial \hat b_k = 0$, $n = 0,\dots,-na+1$;
(A4) The input $u(n)$, $n = 0, 1, \dots$, is a sequence of zero-mean i.i.d. random variables of finite moments,

$$E[u(n)] = 0, \qquad E[u(n)u(m)] = \begin{cases} \sigma^2 & \text{for } n = m \\ 0 & \text{for } n \neq m \end{cases}.$$

Then

$$\operatorname{var}\big(\Delta\hat y_{\hat b_k}(n)\big) = \sigma_f^2 \sum_{i_1=K+1}^{n-k} h_1^2(i_1), \quad k = 1,\dots,nb, \tag{3.70}$$

for $n > K + k$ and 0 otherwise, where $\sigma_f^2 = \operatorname{var}\big(\hat f(u(n),\mathbf{w})\big)$, and $h_1(n)$ is the impulse response function of the system

$$H_1(q^{-1}) = \frac{1}{\hat A(q^{-1})}. \tag{3.71}$$
Proof: The proof is shown in Appendix 3.4.

Theorem 3.3. Define the computation error

$$\Delta\hat y_{w_c}(n) = \frac{\partial \hat y(n)}{\partial w_c} - \frac{\partial^+ \hat y(n)}{\partial w_c},$$

where $\partial \hat y(n)/\partial w_c$ denote partial derivatives calculated with the BPTT method, i.e. unfolding the model (3.14) $n-1$ times back in time, and $\partial^+ \hat y(n)/\partial w_c$ denote partial derivatives calculated with the truncated BPTT method unfolding the model (3.14) $K$ times back in time, and $w_c$ is the $c$th element of $\big[w_{10}^{(1)} \dots w_{M1}^{(1)}\; w_{11}^{(2)} \dots w_{1M}^{(2)}\big]$, $c = 1,\dots,3M$. Assume that

(A1) The linear dynamic model $\hat B(q^{-1})/\hat A(q^{-1})$ is asymptotically stable;
(A2) $\hat y(n) = 0$, $n = 0,\dots,-na+1$; $\hat f\big(u(n),\mathbf{w}\big) = 0$, $n = -1,\dots,-nb$;
(A3) $\partial \hat y(n)/\partial w_c = 0$, $n = 0,\dots,-na+1$; $\partial \hat f\big(u(n),\mathbf{w}\big)/\partial w_c = 0$, $n = -1,\dots,-nb$;
(A4) The input $u(n)$, $n = 0, 1, \dots$, is a sequence of zero-mean i.i.d. random variables of finite moments,

$$E[u(n)] = 0, \qquad E[u(n)u(m)] = \begin{cases} \sigma^2 & \text{for } n = m \\ 0 & \text{for } n \neq m \end{cases}.$$

Then

$$\operatorname{var}\big(\Delta\hat y_{w_c}(n)\big) = \sigma_{w_c}^2 \sum_{i_1=K+2}^{n} h_2^2(i_1), \quad c = 1,\dots,3M, \tag{3.72}$$

for $n > K + k$ and 0 otherwise, where $\sigma_{w_c}^2 = \operatorname{var}\big(\partial \hat f(u(n),\mathbf{w})/\partial w_c\big)$ and $h_2(n)$ is the impulse response function of the system

$$H_2(q^{-1}) = \frac{\hat B(q^{-1})}{\hat A(q^{-1})}. \tag{3.73}$$
Proof: The proof is shown in Appendix 3.5.

Theorem 3.4. Define the computation error

$$\Delta\hat y_{w_{10}^{(2)}}(n) = \frac{\partial \hat y(n)}{\partial w_{10}^{(2)}} - \frac{\partial^+ \hat y(n)}{\partial w_{10}^{(2)}},$$

where $\partial \hat y(n)/\partial w_{10}^{(2)}$ is a partial derivative calculated with the BPTT method, i.e. unfolding the model (3.14) $n-1$ times back in time, and $\partial^+ \hat y(n)/\partial w_{10}^{(2)}$ is a partial derivative calculated with the truncated BPTT method unfolding the model (3.14) $K$ times back in time. Assume that

(A1) The linear dynamic model $\hat B(q^{-1})/\hat A(q^{-1})$ is asymptotically stable;
(A2) $\hat y(n) = 0$, $n = 0,\dots,-na+1$; $\hat f\big(u(n),\mathbf{w}\big) = 0$, $n = -1,\dots,-nb$;
(A3) $\partial \hat y(n)/\partial w_{10}^{(2)} = 0$, $n = 0,\dots,-na+1$; $\partial \hat f\big(u(n),\mathbf{w}\big)/\partial w_{10}^{(2)} = 0$, $n = -1,\dots,-nb$.

Then

$$\Delta\hat y_{w_{10}^{(2)}}(n) = \sum_{i_1=K+2}^{n} h_2(i_1), \tag{3.74}$$

for $n > K + k$ and 0 otherwise, where $h_2(n)$ is the impulse response function of the system

$$H_2(q^{-1}) = \frac{\hat B(q^{-1})}{\hat A(q^{-1})}. \tag{3.75}$$

Proof: The proof is shown in Appendix 3.6.
Remark 3.1. The lowest gradient calculation accuracy and the highest values of the variances (3.67), (3.70), (3.72) and the error (3.74) are achieved with the BPP method for which $K = 0$.

Remark 3.2. From (3.67) and (3.70), it follows that the variances of computation errors do not depend on $k$ for $n \to \infty$, i.e.,

$$\lim_{n\to\infty}\operatorname{var}\big(\Delta\hat y_{\hat a_k}(n)\big) = \sigma_f^2 \sum_{i_1=K+1}^{\infty}\left(\sum_{i_2=K+1}^{i_1} h_1(i_2)h_2(i_1-i_2)\right)^2, \tag{3.76}$$

$$\lim_{n\to\infty}\operatorname{var}\big(\Delta\hat y_{\hat b_k}(n)\big) = \sigma_f^2 \sum_{i_1=K+1}^{\infty} h_1^2(i_1). \tag{3.77}$$
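To make Remark 3.2 concrete, the limit (3.77) can be evaluated numerically for a given denominator polynomial. The sketch below is only an illustration under stated assumptions: the infinite sum is cut off at a finite `horizon` (adequate for an asymptotically stable model), $\sigma_f^2$ is passed in as a number, and the function names are hypothetical.

```python
def impulse_response_inv_a(a, length):
    """h1(0..length-1) of H1(q^-1) = 1/A(q^-1), A(q^-1) = 1 + a[0] q^-1 + a[1] q^-2 + ..."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0          # unit impulse at n = 0
        for m in range(1, len(a) + 1):
            if n - m >= 0:
                acc -= a[m - 1] * h[n - m]
        h.append(acc)
    return h


def limit_error_variance_b(a, K, sigma_f2=1.0, horizon=2000):
    """Approximation of (3.77): sigma_f^2 * sum_{i=K+1}^{horizon-1} h1(i)^2."""
    h1 = impulse_response_inv_a(a, horizon)
    return sigma_f2 * sum(h * h for h in h1[K + 1:])
```

For a first-order model with $\hat a_1 = -0.5$, $h_1(i) = 0.5^i$ and the tail sum for $K = 0$ equals $0.25/0.75 = 1/3$; each additional unfolded step removes one more term of the sum, so the value decreases with $K$ at a rate set by how quickly $h_1(n)$ decays.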
Theorems 3.1 – 3.4 provide a useful tool for the determination of the number of unfolded steps $K$. From Theorems 3.1 and 3.2 it follows that, for a fixed value of $K$, the calculation accuracy of partial derivatives of the parallel neural network Hammerstein model output w.r.t. the parameters $\hat a_k$ and $\hat b_k$ with the truncated BPTT method depends on the number of discrete time steps necessary for the impulse response $h_1(n)$ to decrease to negligibly small values. Analogously, the calculation accuracy of partial derivatives of the parallel neural network Hammerstein model output w.r.t. the parameters $w_c$, $c = 1,\dots,3M$, and $w_{10}^{(2)}$ depends on the number of discrete time steps necessary for the impulse response $h_2(n)$ to decrease to negligibly small values. Note that these impulse responses can differ significantly from the impulse responses of the system $1/A(q^{-1})$ or $B(q^{-1})/A(q^{-1})$, particularly at the beginning of the learning process. Therefore, both $h_1(n)$ and $h_2(n)$ can be traced during the training of the neural network model and $K$ can be changed adaptively to meet the assumed degrees of gradient calculation accuracy $\xi_{a_k}(n)$, $\xi_{b_k}(n)$ and $\xi_{w_c}(n)$ defined as the ratio of the variances (3.67), (3.70) and (3.72) to the variances of the corresponding partial derivatives:

$$\xi_{a_k}(n) = \frac{\operatorname{var}\big(\Delta\hat y_{\hat a_k}(n)\big)}{\operatorname{var}\left(\dfrac{\partial \hat y(n)}{\partial \hat a_k}\right)} = \frac{\displaystyle\sum_{i_1=K+1}^{n-k}\left(\sum_{i_2=K+1}^{i_1} h_1(i_2)h_2(i_1-i_2)\right)^2}{\displaystyle\sum_{i_1=0}^{n-k}\left(\sum_{i_2=0}^{i_1} h_1(i_2)h_2(i_1-i_2)\right)^2}, \tag{3.78}$$

$$\xi_{b_k}(n) = \frac{\operatorname{var}\big(\Delta\hat y_{\hat b_k}(n)\big)}{\operatorname{var}\left(\dfrac{\partial \hat y(n)}{\partial \hat b_k}\right)} = \frac{\displaystyle\sum_{i_1=K+1}^{n-k} h_1^2(i_1)}{\displaystyle\sum_{i_1=0}^{n-k} h_1^2(i_1)}, \tag{3.79}$$

$$\xi_{w_c}(n) = \frac{\operatorname{var}\big(\Delta\hat y_{w_c}(n)\big)}{\operatorname{var}\left(\dfrac{\partial \hat y(n)}{\partial w_c}\right)} = \frac{\displaystyle\sum_{i_1=K+2}^{n} h_2^2(i_1)}{\displaystyle\sum_{i_1=0}^{n} h_2^2(i_1)}, \tag{3.80}$$
where $c = 1,\dots,3M$. An initial value of $K$ can be determined based on initial conditions or a priori knowledge of system dynamics. As in the case of Wiener systems, the required value of $K$ for Hammerstein systems depends strongly on the system sampling rate. If the sampling rate increases for a given system, the numbers of discrete time steps necessary for the impulse responses $h_1(n)$ and $h_2(n)$ to decrease to negligibly small values increase, and a higher value of $K$ is necessary to achieve the same accuracy of gradient calculation.

3.4.10 Gradient calculation in the sequential mode

The derivation of gradient calculation algorithms has been made under the assumption that the Hammerstein model is time invariant. In the sequential mode, this assumption is not valid and only approximated values of the gradient are obtained for parallel models. Hence, to obtain a good approximation of the gradient, the learning rate should be small enough to achieve negligibly small changes of model parameters. Applying the BPTT method, the actual values of model parameters can be used in the unfolding procedure. Although the gradient calculated in this way is still approximate, such an approach may increase the convergence rate of the learning process even if model unfolding is restricted to only a few time steps. As in the case of neural network Wiener models, the BPTT method is able to provide exact values of the gradient if linear dynamic model outputs are unfolded back in time according to the following rule:

$$\hat y(n-na-i_1) = \frac{1}{\hat a_{na}}\left[-\hat y(n-i_1) - \sum_{m=1}^{na-1}\hat a_m\,\hat y(n-m-i_1) + \sum_{m=1}^{nb}\hat b_m\,\hat f\big(u(n-m-i_1),\mathbf{w}\big)\right], \tag{3.81}$$
where $i_1 = 1,\dots,n-1$. The formula (3.81) can be employed provided that $\hat a_{na} \neq 0$. In practice, if $\hat a_{na} = 0$ or $\hat a_{na} \approx 0$, then $\hat y(n-na-i_1)$ does not influence or does not influence significantly $\hat y(n-i_1)$, and this unfolding step can be omitted. Alternatively, it is possible to unfold the outputs of the nonlinear element model instead of the outputs of the linear dynamic model:

$$\hat f\big(u(n-nb-i_1),\mathbf{w}\big) = \frac{1}{\hat b_{nb}}\left[\hat y(n-i_1) + \sum_{m=1}^{na}\hat a_m\,\hat y(n-m-i_1) - \sum_{m=1}^{nb-1}\hat b_m\,\hat f\big(u(n-m-i_1),\mathbf{w}\big)\right]. \tag{3.82}$$
Nevertheless, applying (3.82) instead of (3.81) is more complex as it requires not only that $\hat b_{nb} \neq 0$ but also calculating the model inputs $u(n-nb-i_1)$ that correspond to the nonlinear element outputs $\hat f\big(u(n-m-i_1),\mathbf{w}\big)$.

3.4.11 Computational complexity

Computational complexity of algorithms, given in Table 3.1, is evaluated based on the same assumptions as those given in Section 2.4.11.

Table 3.1. Computational complexity of learning algorithms

Algorithm     Computational complexity
BPS or BPP    3na + 5nb + 15nbM + 3M + 1
SM            5na + 5nb + 15nbM + 6naM + 3M + 1
BPTT          (13nbM + 5nb + 3na)K + 3na + 5nb + 15nbM + 3M + 1
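As a worked illustration of Table 3.1, the expressions can be evaluated for concrete model sizes. The sketch below simply transcribes the three formulas; the function name `complexity` and the dictionary keys are hypothetical, and the example sizes ($na = nb = 2$, $M = 25$, $K = 4$) are chosen here only to match the orders used in the simulation example.

```python
def complexity(na, nb, M, K=0):
    """Evaluate the complexity expressions of Table 3.1 for given model sizes."""
    bps = 3 * na + 5 * nb + 15 * nb * M + 3 * M + 1              # BPS or BPP
    sm = 5 * na + 5 * nb + 15 * nb * M + 6 * na * M + 3 * M + 1   # SM
    bptt = (13 * nb * M + 5 * nb + 3 * na) * K + bps              # truncated BPTT
    return {"BPS/BPP": bps, "SM": sm, "BPTT": bptt}
```

For $na = nb = 2$ and $M = 25$ this gives 842 for BPS or BPP, 1146 for the SM, and $666K + 842$ for BPTT, i.e. 3506 at $K = 4$, so each unfolded step adds somewhat less than the cost of one BPS evaluation.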
3.5 Simulation example

A Hammerstein system composed of a linear dynamic system

$$G(s) = \frac{1}{6s^2 + 5s + 1} \tag{3.83}$$

and the nonlinear element (Fig. 3.8):

$$f\big(u(n)\big) = \sin\big(0.6\pi u(n)\big) + 0.1\cos\big(1.5\pi u(n)\big) \tag{3.84}$$

was used in the simulation study. The system (3.83) was converted to discrete time, assuming the zero order hold on the input and the sampling interval 1 s, resulting in the following difference equation:

$$s(n) = 1.3231\,s(n-1) - 0.4346\,s(n-2) + 0.0635\,u(n-1) + 0.0481\,u(n-2). \tag{3.85}$$

The system (3.83) was driven by a pseudo-random sequence uniformly distributed in (−1, 1). The nonlinear element model contained 25 nonlinear nodes of the hyperbolic tangent activation function. Both the series-parallel and parallel Hammerstein models were trained recursively using the steepest descent
method to update the weights, and BPS, BPP, SM, and truncated BPTT algorithms to calculate the gradient. To compare the different algorithms, training was carried out based on the same sequence of 20000 input-output patterns with the learning rate of 0.2 for nonlinear nodes and 0.02 for linear ones. The overall number of training steps was 60000 as the training sequence was used three times. All calculations were performed at the same initial values of neural network parameters. The mean square error of moving averages defined as

$$J_I(n) = \begin{cases} \dfrac{1}{n}\displaystyle\sum_{j=1}^{n}\big(\hat y(j) - y(j)\big)^2 & \text{for } n \le I \\[2ex] \dfrac{1}{I}\displaystyle\sum_{j=n-I+1}^{n}\big(\hat y(j) - y(j)\big)^2 & \text{for } n > I \end{cases} \tag{3.86}$$

is used to compare the convergence rate of the algorithms. The indices $P(n)$ and $F(n)$ are used to compare the identification accuracy of the linear dynamic system and the nonlinear function $f(\cdot)$, respectively:

$$P(n) = \frac{1}{4}\sum_{j=1}^{2}\Big[(\hat a_j - a_j)^2 + (\hat b_j - b_j)^2\Big], \tag{3.87}$$

$$F(n) = \frac{1}{100}\sum_{j=1}^{100}\Big[\hat f\big(u(j),\mathbf{w}\big) - f\big(u(j)\big)\Big]^2, \tag{3.88}$$
where u(j) is a testing sequence consisting of 100 linearly equally spaced values between −1 and 1. The results of model training for both a noise-free Hammerstein system and a system disturbed by the additive output Gaussian noise N (0, σe ) are given in Tables 3.2–3.4. Noise-free case. The identification results are given in Table 3.2 and illustrated in Figs 3.8–3.14. The BPS example uses the true gradient. This results in a high convergence rate and smooth decreasing of J2000 (n) (Fig. 3.9). As can be expected, the lowest convergence rate can be observed in the BPP example. This can be explained by the low accuracy of gradient approximation with the backpropagation method. A better gradient approximation results in a higher convergence rate that can be seen in both the SM and BPTT examples. Maximum accuracy of both the linear dynamic model and the nonlinear element model can be observed in the BPTT example at K = 4. The results of nonlinear element identification obtained in the BPTT examples at K = 3, . . . , 11 are more accurate in comparison with those obtained using the SM algorithm. Further unfolding of the model does not improve identification accuracy and even some deterioration can be observed. Comparing the results of linear dynamic system identification, it can be noticed that BPTT outperforms the SM when K = 2, . . . , 7.
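The difference equation (3.85) can be checked independently of any toolbox. The sketch below is a minimal illustration that assumes the partial fraction form $G(s) = 1/(s + 1/3) - 1/(s + 1/2)$ of (3.83) and the standard zero order hold equivalent of a first-order lag, $c/(s+p) \to c\,(1 - e^{-pT})/p \cdot q^{-1}/(1 - e^{-pT}q^{-1})$; the function name `zoh_second_order` is hypothetical.

```python
import math


def zoh_second_order(T=1.0):
    """ZOH discretization of G(s) = 1/((2s+1)(3s+1)) = 1/(s+1/3) - 1/(s+1/2).

    Returns (a1, a2, b1, b2) of
    s(n) = -a1 s(n-1) - a2 s(n-2) + b1 u(n-1) + b2 u(n-2).
    """
    p1, p2 = 1.0 / 3.0, 1.0 / 2.0                 # continuous-time poles at -p1, -p2
    z1, z2 = math.exp(-p1 * T), math.exp(-p2 * T)  # discrete-time poles
    g1 = (1.0 - z1) / p1                           # ZOH gain of  1/(s+p1)
    g2 = -(1.0 - z2) / p2                          # ZOH gain of -1/(s+p2)
    # G(q^-1) = g1 q^-1/(1 - z1 q^-1) + g2 q^-1/(1 - z2 q^-1), put over a common denominator
    a1, a2 = -(z1 + z2), z1 * z2
    b1, b2 = g1 + g2, -(g1 * z2 + g2 * z1)
    return a1, a2, b1, b2
```

For $T = 1$ s the returned values round to $a_1 = -1.3231$, $a_2 = 0.4346$, $b_1 = 0.0635$ and $b_2 = 0.0481$, in agreement with the coefficients of (3.85).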
Noise case I. The identification results for a high signal to noise ratio SNR = 18.46, SNR = var(y(n) − ε(n))/var(ε(n)), are given in Table 3.3. The highest accuracy of the nonlinear element model was achieved in the BPTT example at K = 15, and the highest accuracy of the linear dynamic model was achieved in the BPTT example at K = 2. The overall identification accuracy measured by the indices J60000(60000) increases with an increase in the number of unfolded time steps K. The index J2000(60000) has its minimum at K = 4.

Noise case II. The identification results for a low signal to noise ratio SNR = 3.83 are given in Table 3.4. The model was trained using the time-dependent learning rate

$$\eta(n) = \frac{0.1}{\sqrt[4]{n}}. \tag{3.89}$$

The highest accuracy of both the nonlinear element model and the linear dynamic model was achieved using the BPTT method, at K = 4 and K = 3, respectively.

In all three cases, the BPTT algorithm provides the most accurate results. This can be explained by the time-varying nature of the model used in the SM algorithm. In other words, to obtain the exact value of the gradient, model parameters should be time invariant. That is the case in the batch mode but

Table 3.2. Comparison of estimation accuracy (σe = 0)

Algorithm       J60000(60000)   J2000(60000)    F(60000)        P(60000)
BPS             3.76 × 10^-4    2.86 × 10^-4    8.22 × 10^-4    2.00 × 10^-3
BPP             6.81 × 10^-3    4.89 × 10^-2    1.02 × 10^-1    4.18 × 10^-3
SM              1.15 × 10^-3    5.85 × 10^-4    3.29 × 10^-5    3.30 × 10^-6
BPTT, K = 1     1.51 × 10^-3    1.09 × 10^-3    1.19 × 10^-4    1.85 × 10^-5
BPTT, K = 2     1.09 × 10^-3    5.41 × 10^-4    4.18 × 10^-5    2.24 × 10^-6
BPTT, K = 3     9.61 × 10^-4    4.24 × 10^-4    2.48 × 10^-5    9.55 × 10^-7
BPTT, K = 4     9.31 × 10^-4    4.04 × 10^-4    2.14 × 10^-5    8.27 × 10^-7
BPTT, K = 5     9.31 × 10^-4    4.12 × 10^-4    2.20 × 10^-5    9.96 × 10^-7
BPTT, K = 6     9.44 × 10^-4    4.37 × 10^-4    2.18 × 10^-5    1.71 × 10^-6
BPTT, K = 7     9.63 × 10^-4    4.64 × 10^-4    2.31 × 10^-5    2.64 × 10^-6
BPTT, K = 8     9.82 × 10^-4    4.91 × 10^-4    2.52 × 10^-5    3.59 × 10^-6
BPTT, K = 9     9.99 × 10^-4    5.14 × 10^-4    2.73 × 10^-5    4.11 × 10^-6
BPTT, K = 10    1.01 × 10^-3    5.32 × 10^-4    2.99 × 10^-5    4.29 × 10^-6
BPTT, K = 11    1.03 × 10^-3    5.42 × 10^-4    3.25 × 10^-5    4.17 × 10^-6
BPTT, K = 12    1.03 × 10^-3    5.52 × 10^-4    3.44 × 10^-5    4.10 × 10^-6
BPTT, K = 13    1.04 × 10^-3    5.61 × 10^-4    3.56 × 10^-5    4.05 × 10^-6
BPTT, K = 14    1.05 × 10^-3    5.65 × 10^-4    3.63 × 10^-5    3.99 × 10^-6
BPTT, K = 15    1.05 × 10^-3    5.69 × 10^-4    3.68 × 10^-5    3.99 × 10^-6
Table 3.3. Comparison of estimation accuracy (σe = 0.012, SNR = 18.46)

Algorithm       J60000(60000)   J2000(60000)    F(60000)        P(60000)
BPS             7.28 × 10^-4    5.99 × 10^-4    9.17 × 10^-4    6.70 × 10^-3
BPP             7.44 × 10^-3    4.90 × 10^-3    1.09 × 10^-1    3.82 × 10^-3
SM              1.40 × 10^-3    7.83 × 10^-4    1.20 × 10^-4    6.43 × 10^-5
BPTT, K = 1     1.83 × 10^-3    1.20 × 10^-3    1.68 × 10^-4    5.39 × 10^-5
BPTT, K = 2     1.27 × 10^-3    6.95 × 10^-4    1.95 × 10^-4    1.33 × 10^-5
BPTT, K = 3     1.14 × 10^-3    5.87 × 10^-4    1.98 × 10^-4    1.58 × 10^-5
BPTT, K = 4     1.12 × 10^-3    5.68 × 10^-4    1.35 × 10^-4    1.66 × 10^-5
BPTT, K = 5     1.14 × 10^-3    5.84 × 10^-4    1.33 × 10^-4    2.75 × 10^-5
BPTT, K = 6     1.16 × 10^-3    6.14 × 10^-4    2.04 × 10^-4    4.24 × 10^-5
BPTT, K = 7     1.18 × 10^-3    6.46 × 10^-4    1.59 × 10^-4    6.75 × 10^-5
BPTT, K = 8     1.22 × 10^-3    6.76 × 10^-4    1.65 × 10^-4    8.56 × 10^-5
BPTT, K = 9     1.24 × 10^-3    7.01 × 10^-4    1.59 × 10^-4    8.81 × 10^-5
BPTT, K = 10    1.26 × 10^-3    7.23 × 10^-4    1.27 × 10^-4    9.82 × 10^-5
BPTT, K = 11    1.27 × 10^-3    7.33 × 10^-4    1.19 × 10^-4    9.68 × 10^-5
BPTT, K = 12    1.29 × 10^-3    7.45 × 10^-4    1.08 × 10^-4    9.85 × 10^-5
BPTT, K = 13    1.32 × 10^-3    7.55 × 10^-4    1.06 × 10^-4    9.76 × 10^-5
BPTT, K = 14    1.35 × 10^-3    7.64 × 10^-4    1.05 × 10^-4    9.82 × 10^-5
BPTT, K = 15    1.37 × 10^-3    7.71 × 10^-4    1.04 × 10^-4    9.90 × 10^-5
Table 3.4. Comparison of the estimation accuracy (σe = 0.06, SNR = 3.83)

Algorithm       J60000(60000)   J2000(60000)    F(60000)        P(60000)
BPS             6.96 × 10^-3    6.72 × 10^-3    2.55 × 10^-3    1.76 × 10^-1
BPP             1.60 × 10^-2    1.32 × 10^-2    2.16 × 10^-2    5.54 × 10^-2
SM              7.97 × 10^-3    5.72 × 10^-3    4.46 × 10^-4    1.75 × 10^-2
BPTT, K = 1     6.42 × 10^-3    6.18 × 10^-3    9.35 × 10^-3    1.15 × 10^-2
BPTT, K = 2     5.22 × 10^-3    5.21 × 10^-3    1.32 × 10^-3    6.07 × 10^-3
BPTT, K = 3     5.04 × 10^-3    5.04 × 10^-3    4.97 × 10^-4    5.76 × 10^-3
BPTT, K = 4     5.02 × 10^-3    4.99 × 10^-3    4.40 × 10^-4    5.80 × 10^-3
BPTT, K = 5     5.06 × 10^-3    5.02 × 10^-3    4.84 × 10^-4    7.06 × 10^-3
BPTT, K = 6     5.09 × 10^-3    5.03 × 10^-3    5.62 × 10^-4    7.47 × 10^-3
BPTT, K = 7     5.12 × 10^-3    5.05 × 10^-3    5.25 × 10^-4    7.80 × 10^-3
BPTT, K = 8     5.15 × 10^-3    5.08 × 10^-3    4.48 × 10^-4    8.35 × 10^-3
BPTT, K = 9     5.17 × 10^-3    5.11 × 10^-3    4.43 × 10^-4    8.56 × 10^-3
BPTT, K = 10    5.19 × 10^-3    5.13 × 10^-3    3.89 × 10^-4    8.91 × 10^-3
BPTT, K = 11    5.20 × 10^-3    5.15 × 10^-3    3.80 × 10^-4    9.01 × 10^-3
BPTT, K = 12    5.22 × 10^-3    5.17 × 10^-3    3.66 × 10^-4    9.18 × 10^-3
BPTT, K = 13    5.23 × 10^-3    5.18 × 10^-3    3.62 × 10^-4    9.24 × 10^-3
BPTT, K = 14    5.23 × 10^-3    5.18 × 10^-3    3.55 × 10^-4    9.19 × 10^-3
BPTT, K = 15    5.24 × 10^-3    5.19 × 10^-3    3.52 × 10^-4    9.26 × 10^-3
Fig. 3.8. BPTT results, K = 1. True (solid line) and estimated (dotted line) nonlinear functions [plot of f(u(n)) and f̂(u(n)) versus u(n)]

Fig. 3.9. Comparison of convergence rates of different learning algorithms [plot of J2000(n) versus n on a logarithmic scale; curves: BPP, BPS, SM, BPTT with K = 1, 2, 3, 4]

Fig. 3.10. SM results. Evolution of the parameters of the linear dynamic model [parameter estimates â1, â2, b̂1, b̂2 versus n]

Fig. 3.11. BPTT results, K = 4. Convergence of the nonlinear element model [plot of F(n) versus n on a logarithmic scale]

Fig. 3.12. BPTT results, K = 4. Convergence of the linear dynamic model [plot of P(n) versus n on a logarithmic scale]

Fig. 3.13. Evolution of gradient calculation accuracy degrees, (BPTT, K = 4) [three panels versus n]

Fig. 3.14. Impulse responses of sensitivity models and the linear dynamic model (BPTT, K = 4) [plots of h1(n) and h2(n) versus n]
not in the sequential mode. Clearly, the calculated gradient approaches its true value if the learning rate η is small enough and changes of model parameters in subsequent time steps are negligible. Note that, in the sequential mode, the exact value of the gradient can be obtained only with the BPTT method. This, however, requires not only using the actual values of linear dynamic model parameters but also the calculation of corrected values of the past model outputs yˆ(n − i1 ), i1 = 1, . . . , n − 1, for a time-invariant Hammerstein model according to the formula (3.81).
3.6 Combined steepest descent and least squares learning algorithms

The parameters of the linear part of the series-parallel Hammerstein model can be computed with the RLS algorithm. In this case, the overall learning algorithm combines the BPS learning algorithm for the parameters of the nonlinear part $w_{j0}^{(1)}$, $w_{j1}^{(1)}$, $w_{1j}^{(2)}$ and $w_{10}^{(2)}$, $j = 1,\dots,M$, with the RLS algorithm for $\hat a_1,\dots,\hat a_{na}$, $\hat b_1,\dots,\hat b_{nb}$. Let $\hat{\boldsymbol\theta}(n) = [\hat a_1 \dots \hat a_{na}\; \hat b_1 \dots \hat b_{nb}]^T$ denote the parameter vector of the linear dynamic model at the time $n$. The vector $\hat{\boldsymbol\theta}(n)$ can be computed on-line using the RLS algorithm as follows:

$$\hat{\boldsymbol\theta}(n) = \hat{\boldsymbol\theta}(n-1) + \mathbf{K}(n)e(n), \tag{3.90}$$

$$\mathbf{K}(n) = \frac{\mathbf{P}(n-1)\mathbf{x}(n)}{1 + \mathbf{x}^T(n)\mathbf{P}(n-1)\mathbf{x}(n)}, \tag{3.91}$$

$$\mathbf{P}(n) = \mathbf{P}(n-1) - \frac{\mathbf{P}(n-1)\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{P}(n-1)}{1 + \mathbf{x}^T(n)\mathbf{P}(n-1)\mathbf{x}(n)}, \tag{3.92}$$

$$e(n) = y(n) - \mathbf{x}^T(n)\hat{\boldsymbol\theta}(n-1), \tag{3.93}$$

where $\mathbf{P}(n) \in \mathbb{R}^{(na+nb)\times(na+nb)}$ is a symmetrical matrix, $e(n)$ is the one-step ahead prediction error of the system output, and $\mathbf{x}(n)$ is the regression vector:

$$\mathbf{x}(n) = \Big[-y(n-1) \;\dots\; -y(n-na) \;\; \hat f\big(u(n-1),\mathbf{w}\big) \;\dots\; \hat f\big(u(n-nb),\mathbf{w}\big)\Big]^T. \tag{3.94}$$
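The recursion (3.90) – (3.93) can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the implementation used for the reported results: matrices are plain lists of rows, the initial matrix P(0) is taken as a large multiple of the identity, and the helper `identify_first_order` with its first-order test system, its parameter values and the sequence standing in for the nonlinear element output are all hypothetical.

```python
import random


def rls_step(theta, P, x, y):
    """One RLS update following (3.90)-(3.93); P is a symmetric matrix as a list of rows."""
    d = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(d)) for i in range(d)]   # P(n-1) x(n)
    denom = 1.0 + sum(x[i] * Px[i] for i in range(d))                # 1 + x^T P x
    e = y - sum(x[i] * theta[i] for i in range(d))                   # (3.93)
    K = [Px[i] / denom for i in range(d)]                            # (3.91)
    theta_new = [theta[i] + K[i] * e for i in range(d)]              # (3.90)
    # (3.92); because P is symmetric, x^T P equals (P x)^T
    P_new = [[P[i][j] - Px[i] * Px[j] / denom for j in range(d)] for i in range(d)]
    return theta_new, P_new, e


def identify_first_order(steps=200, seed=0):
    """Estimate [a1, b1] of y(n) = -a1 y(n-1) + b1 v(n-1) from noise-free data."""
    rng = random.Random(seed)
    a1, b1 = -0.7, 0.5                      # hypothetical true parameters
    theta = [0.0, 0.0]
    P = [[1e6, 0.0], [0.0, 1e6]]            # assumed initialization P(0) = 1e6 * I
    y_prev, v_prev = 0.0, 0.0
    for _ in range(steps):
        y = -a1 * y_prev + b1 * v_prev
        x = [-y_prev, v_prev]               # regression vector as in (3.94)
        theta, P, _ = rls_step(theta, P, x, y)
        y_prev, v_prev = y, rng.uniform(-1.0, 1.0)
    return theta
```

In the combined algorithm, `v_prev` would be the current nonlinear element model output, so the regressors change as the network weights are updated by the steepest descent part.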
The output $\hat y(n)$ of the parallel Hammerstein model is a nonlinear function of the parameters $\hat a_1,\dots,\hat a_{na}$, $\hat b_1,\dots,\hat b_{nb}$ as not only the actual output $\hat y(n)$ but also the past outputs $\hat y(n-1),\dots,\hat y(n-m)$ depend on these parameters. In spite of this, the same recursive scheme (3.90) – (3.93) can be also applied to the parallel model with the vector $\mathbf{x}(n)$ defined as

$$\mathbf{x}(n) = \Big[-\hat y(n-1) \;\dots\; -\hat y(n-na) \;\; \hat f\big(u(n-1),\mathbf{w}\big) \;\dots\; \hat f\big(u(n-nb),\mathbf{w}\big)\Big]^T. \tag{3.95}$$
Such an approach is known as recursive pseudolinear regression (RPLR) for the output error model [112]. The combined RPLR and BPP algorithm was used by Al-Duwaish [3]. It is also possible to combine RPLR and the SM or BPTT algorithms. This may result in a higher convergence rate as both the SM and BPTT algorithms evaluate the gradient more accurately. Note that both (3.94) and (3.95) define the gradient of the model output w.r.t. $\hat{\boldsymbol\theta}(n)$ computed with the BPS and BPP methods, respectively. For the parallel model, the gradient calculation can be performed more accurately with the SM or the BPTT algorithm. Taking into account (3.39) and (3.40), for the SM, we have

$$\mathbf{x}(n) = \Bigg[-\hat y(n-1) - \sum_{m=1}^{na}\hat a_m\frac{\partial \hat y(n-m)}{\partial \hat a_1} \;\dots\; -\hat y(n-na) - \sum_{m=1}^{na}\hat a_m\frac{\partial \hat y(n-m)}{\partial \hat a_{na}}$$
$$\qquad\quad \hat f\big(u(n-1),\mathbf{w}\big) - \sum_{m=1}^{na}\hat a_m\frac{\partial \hat y(n-m)}{\partial \hat b_1} \;\dots\; \hat f\big(u(n-nb),\mathbf{w}\big) - \sum_{m=1}^{na}\hat a_m\frac{\partial \hat y(n-m)}{\partial \hat b_{nb}}\Bigg]^T. \tag{3.96}$$

For BPTT, see (3.100) and (3.101) in Appendix 3.1, it follows that

$$\mathbf{x}(n) = \Bigg[-\hat y(n-1) - \sum_{i_1=1}^{K}\hat y(n-i_1-1)\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)} \;\dots\; -\hat y(n-na) - \sum_{i_1=1}^{K}\hat y(n-i_1-na)\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)}$$
$$\qquad\quad \hat f\big(u(n-1),\mathbf{w}\big) + \sum_{i_1=1}^{K}\hat f\big(u(n-i_1-1),\mathbf{w}\big)\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)} \;\dots\; \hat f\big(u(n-nb),\mathbf{w}\big) + \sum_{i_1=1}^{K}\hat f\big(u(n-i_1-nb),\mathbf{w}\big)\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)}\Bigg]^T. \tag{3.97}$$
Therefore, the vector x(n) defined by (3.95) can be replaced in (3.91) and (3.92) by (3.96) or (3.97).
3.7 Summary

This chapter deals with different gradient calculation methods for neural network Hammerstein models. An important advantage of neural network models is that they do not require static nonlinearity to be expressed by a polynomial of a given order. The gradient in series-parallel models, which are suitable for the identification of disturbance-free systems, can be effectively calculated using the BPS method. On the other hand, parallel models due to their recurrent nature are more useful in the case of additive output disturbances but their training procedures are more complex. In general, the calculation of the gradient in parallel models requires much more computational effort than in series-parallel ones. Fortunately, the parallel Hammerstein model contains only one recurrent node, which makes the SM algorithm only a little more computationally intensive. Computational complexity of the truncated BPTT algorithm is moderate as well. It is approximately equal to the complexity of BPS multiplied by the number of unfolded time steps K, which is usually not greater than a few. The combined steepest descent and RLS or RPLR algorithms require only slightly more computation than homogeneous ones. This stems from the fact that the number of operations necessary to calculate $\hat{\boldsymbol\theta}(n)$ depends on the square of the number of linear dynamic model parameters, which is commonly assumed to be low. A higher accuracy of gradient approximation in the SM and BPTT learning algorithms may result in their higher convergence rate. The application of the combined steepest descent and RLS or RPLR algorithms may increase the convergence rate further. In spite of the fact that only sequential versions of gradient-based learning algorithms are derived and discussed in this chapter, their batch mode counterparts are easy to derive once the rules of gradient calculation are
determined. Sequential algorithms have low computational and memory requirements. Moreover, they are less likely to be trapped in a shallow local minimum due to the use of pattern by pattern updating of weights, which makes the search more stochastic in nature. In the batch mode, in contrast to the sequential mode, the parameters of the linear dynamic model are kept constant during each sweeping of the input-output data (iteration). As a result, the exact value of the gradient is obtained applying the SM or BPTT algorithms. In the sequential mode, the parameters of the linear dynamic model are updated after processing every input-output pattern and thus only an approximate gradient can be obtained using the SM. An advantage of the sequential version of the BPTT algorithm is that model unfolding can be made on the basis of actual values of linear dynamic model parameters. Therefore, the BPTT algorithm can provide a more accurate approximation of the gradient than the SM. Theoretically, exact values of the gradient can be obtained if not only actual values of the linear dynamic model parameters are used but also the linear dynamic model is completely unfolded back in time, according to (3.81).
3.8 Appendix 3.1. Gradient derivation of truncated BPTT. SISO Hammerstein models

Assume that the parallel SISO Hammerstein model is K times unfolded back in time. Our primary goal is to find expressions for partial derivatives of the actual model output $\hat y(n)$ w.r.t. the past model outputs $\hat y(n-i_1)$, $i_1 = 1,\dots,K$. Differentiating (3.14), we have

$$\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)} = -\sum_{m=1}^{na}\hat a_m\frac{\partial \hat y(n-m)}{\partial \hat y(n-i_1)}. \tag{3.98}$$

The terms $\partial \hat y(n-m)/\partial \hat y(n-i_1)$ are equal to zero for $m > i_1$. The partial derivatives (3.98) can be expressed by the parameters $\hat a_1,\dots,\hat a_{na}$. Therefore, (3.98) can be written in a more convenient form as

$$\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)} = -\sum_{m=1}^{na}\hat a_m\frac{\partial \hat y(n)}{\partial \hat y(n-i_1+m)}. \tag{3.99}$$
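Using (3.99), we can start the calculation with $i_1 = 1$ and then proceed back in time employing the partial derivatives calculated previously to calculate the successive ones. Having calculated (3.99), the next step is the calculation of partial derivatives of $\hat y(n)$ w.r.t. the model parameters $\hat a_1,\dots,\hat a_{na}$, $\hat b_1,\dots,\hat b_{nb}$, and $w_{10}^{(1)},\dots,w_{M1}^{(1)}$, $w_{10}^{(2)},\dots,w_{1M}^{(2)}$. Let $\partial^+ \hat y(n)/\partial \hat a_k$, $k = 1,\dots,na$, and $\partial^+ \hat y(n)/\partial \hat b_k$, $k = 1,\dots,nb$, denote partial derivatives of the model output unfolded back in time, and $\partial \hat y(n)/\partial \hat a_k$ and $\partial \hat y(n)/\partial \hat b_k$ denote partial derivatives calculated from (3.14) without taking into account the dependence of past model outputs on $\hat a_k$ and $\hat b_k$, respectively. The differentiation of (3.14) gives

$$\frac{\partial^+ \hat y(n)}{\partial \hat a_k} = \frac{\partial \hat y(n)}{\partial \hat a_k} + \sum_{i_1=1}^{K}\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)}\frac{\partial \hat y(n-i_1)}{\partial \hat a_k} = -\hat y(n-k) - \sum_{i_1=1}^{K}\hat y(n-k-i_1)\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)}, \tag{3.100}$$

$$\frac{\partial^+ \hat y(n)}{\partial \hat b_k} = \frac{\partial \hat y(n)}{\partial \hat b_k} + \sum_{i_1=1}^{K}\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)}\frac{\partial \hat y(n-i_1)}{\partial \hat b_k} = \hat f\big(u(n-k),\mathbf{w}\big) + \sum_{i_1=1}^{K}\hat f\big(u(n-k-i_1),\mathbf{w}\big)\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)}. \tag{3.101}$$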
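The recursion (3.99) and the truncated derivatives (3.100) and (3.101) can be sketched as follows. This is a minimal illustration under stated assumptions: the sequence `v` stands for the nonlinear element outputs $\hat f(u(n),\mathbf{w})$, initial conditions are zero, time indices start at 0, the parameters are kept constant over the record, and all function names are hypothetical.

```python
def simulate(a, b, v):
    """Parallel model with zero initial conditions:
    y(n) = -sum_m a[m-1] y(n-m) + sum_m b[m-1] v(n-m)."""
    y = []
    for n in range(len(v)):
        acc = 0.0
        for m in range(1, len(a) + 1):
            if n - m >= 0:
                acc -= a[m - 1] * y[n - m]
        for m in range(1, len(b) + 1):
            if n - m >= 0:
                acc += b[m - 1] * v[n - m]
        y.append(acc)
    return y


def output_sensitivities(a, K):
    """d[i] = dy(n)/dy(n-i) for i = 0..K from the recursion (3.99), with d[0] = 1."""
    d = [1.0]
    for i in range(1, K + 1):
        d.append(-sum(a[m - 1] * d[i - m] for m in range(1, len(a) + 1) if i - m >= 0))
    return d


def truncated_bptt(a, b, v, y, n, K):
    """Truncated BPTT derivatives of y(n) w.r.t. a_k (3.100) and b_k (3.101)."""
    d = output_sensitivities(a, K)

    def past(seq, idx):
        return seq[idx] if idx >= 0 else 0.0   # zero initial conditions

    grad_a = [-past(y, n - k) - sum(past(y, n - k - i) * d[i] for i in range(1, K + 1))
              for k in range(1, len(a) + 1)]
    grad_b = [past(v, n - k) + sum(past(v, n - k - i) * d[i] for i in range(1, K + 1))
              for k in range(1, len(b) + 1)]
    return grad_a, grad_b
```

With `K = 0` the result reduces to the BPP derivatives, and with full unfolding (`K = n`) it agrees with finite-difference derivatives of the simulated output; intermediate values of `K` trade accuracy for computation.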
Partial derivatives of the output of the model unfolded back in time w.r.t. $w_c$, $c = 1,\dots,3M+1$, can be calculated as follows:

$$\frac{\partial^+ \hat y(n)}{\partial w_c} = \sum_{i_1=0}^{K}\sum_{m=1}^{nb}\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)}\,\frac{\partial \hat y(n-i_1)}{\partial \hat f\big(u(n-m-i_1),\mathbf{w}\big)}\,\frac{\partial \hat f\big(u(n-m-i_1),\mathbf{w}\big)}{\partial w_c} = \sum_{i_1=0}^{K}\sum_{m=1}^{nb}\hat b_m\frac{\partial \hat y(n)}{\partial \hat y(n-i_1)}\,\frac{\partial \hat f\big(u(n-m-i_1),\mathbf{w}\big)}{\partial w_c}. \tag{3.102}$$

To calculate (3.102), we also need $\partial \hat f\big(u(n-m-i_1),\mathbf{w}\big)/\partial w_c$, which can be calculated in the same way as in the backpropagation method using (3.32) – (3.35).
3.9 Appendix 3.2. Gradient derivation of truncated BPTT. MIMO Hammerstein models

Assume that the parallel MIMO Hammerstein model is K times unfolded back in time. Differentiating (3.27), we have partial derivatives of the actual model outputs $\hat y_t(n)$, $t = 1,\dots,ny$, w.r.t. the past model outputs $\hat y_r(n-i_1)$, $r = 1,\dots,ny$, $i_1 = 1,\dots,K$,

$$\frac{\partial \hat y_t(n)}{\partial \hat y_r(n-i_1)} = -\sum_{m=1}^{na}\sum_{m_1=1}^{ny}\hat a_{tm_1}^{(m)}\frac{\partial \hat y_{m_1}(n-m)}{\partial \hat y_r(n-i_1)} = -\sum_{m=1}^{na}\sum_{m_1=1}^{ny}\hat a_{tm_1}^{(m)}\frac{\partial \hat y_{m_1}(n)}{\partial \hat y_r(n-i_1+m)}, \tag{3.103}$$

where

$$\frac{\partial \hat y_{m_1}(n-m)}{\partial \hat y_r(n-i_1)} = 0, \quad \text{if } m > i_1 \text{ or } m = i_1 \text{ and } m_1 \neq r. \tag{3.104}$$
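Denote with $\partial^+\hat{y}_t(n)/\partial\hat{a}_{rp}^{(k)}$, $p = 1,\ldots,ny$, and $\partial^+\hat{y}_t(n)/\partial\hat{b}_{rp}^{(k)}$, $p = 1,\ldots,nu$, partial derivatives of the model output unfolded back in time, and with $\partial\hat{y}_t(n)/\partial\hat{a}_{rp}^{(k)}$ and $\partial\hat{y}_t(n)/\partial\hat{b}_{rp}^{(k)}$ partial derivatives calculated from (3.27) without taking into account the dependence of past model outputs on $\hat{a}_{rp}^{(k)}$ and $\hat{b}_{rp}^{(k)}$, respectively. The differentiation of (3.27) gives
\[
\begin{aligned}
\frac{\partial^+\hat{y}_t(n)}{\partial\hat{a}_{rp}^{(k)}}
&= \frac{\partial\hat{y}_t(n)}{\partial\hat{a}_{rp}^{(k)}} + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny} \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)} \frac{\partial\hat{y}_{i_2}(n-i_1)}{\partial\hat{a}_{rp}^{(k)}} \\
&= -\delta_{tr}\hat{y}_p(n-k) - \sum_{i_1=1}^{K} \hat{y}_p(n-k-i_1) \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_r(n-i_1)},
\end{aligned} \tag{3.105}
\]
\[
\delta_{tr} = \begin{cases} 1 & \text{for } t = r, \\ 0 & \text{for } t \neq r, \end{cases} \tag{3.106}
\]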
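where $k = 1,\ldots,na$,
\[
\begin{aligned}
\frac{\partial^+\hat{y}_t(n)}{\partial\hat{b}_{rp}^{(k)}}
&= \frac{\partial\hat{y}_t(n)}{\partial\hat{b}_{rp}^{(k)}} + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny} \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)} \frac{\partial\hat{y}_{i_2}(n-i_1)}{\partial\hat{b}_{rp}^{(k)}} \\
&= \delta_{tr}\hat{f}_p\big(u(n-k),w_p\big) + \sum_{i_1=1}^{K} \hat{f}_p\big(u(n-k-i_1),w_p\big) \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_r(n-i_1)},
\end{aligned} \tag{3.107}
\]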
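where $k = 1,\ldots,nb$. Taking into account (3.52) – (3.55), partial derivatives of the output of the model unfolded back in time w.r.t. the parameters of the nonlinear element model are calculated as
\[
\begin{aligned}
\frac{\partial^+\hat{y}_t(n)}{\partial w_{ji}^{(1)}}
&= \frac{\partial\hat{y}_t(n)}{\partial w_{ji}^{(1)}} + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny} \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)} \frac{\partial\hat{y}_{i_2}(n-i_1)}{\partial w_{ji}^{(1)}} \\
&= \sum_{m=1}^{nb}\sum_{m_1=1}^{nf} \hat{b}_{tm_1}^{(m)} w_{m_1 j}^{(2)} \varphi'\big(x_j(n-m)\big) u_i(n-m) \\
&\quad + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny}\sum_{m=1}^{nb}\sum_{m_1=1}^{nf} \hat{b}_{i_2 m_1}^{(m)} w_{m_1 j}^{(2)} \varphi'\big(x_j(n-m-i_1)\big) u_i(n-m-i_1) \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)},
\end{aligned} \tag{3.108}
\]
\[
\begin{aligned}
\frac{\partial^+\hat{y}_t(n)}{\partial w_{j0}^{(1)}}
&= \frac{\partial\hat{y}_t(n)}{\partial w_{j0}^{(1)}} + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny} \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)} \frac{\partial\hat{y}_{i_2}(n-i_1)}{\partial w_{j0}^{(1)}} \\
&= \sum_{m=1}^{nb}\sum_{m_1=1}^{nf} \hat{b}_{tm_1}^{(m)} w_{m_1 j}^{(2)} \varphi'\big(x_j(n-m)\big) \\
&\quad + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny}\sum_{m=1}^{nb}\sum_{m_1=1}^{nf} \hat{b}_{i_2 m_1}^{(m)} w_{m_1 j}^{(2)} \varphi'\big(x_j(n-m-i_1)\big) \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)},
\end{aligned} \tag{3.109}
\]
\[
\begin{aligned}
\frac{\partial^+\hat{y}_t(n)}{\partial w_{lj}^{(2)}}
&= \frac{\partial\hat{y}_t(n)}{\partial w_{lj}^{(2)}} + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny} \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)} \frac{\partial\hat{y}_{i_2}(n-i_1)}{\partial w_{lj}^{(2)}} \\
&= \sum_{m=1}^{nb} \hat{b}_{tl}^{(m)} \varphi\big(x_j(n-m)\big) + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny}\sum_{m=1}^{nb} \hat{b}_{i_2 l}^{(m)} \varphi\big(x_j(n-m-i_1)\big) \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)},
\end{aligned} \tag{3.110}
\]
\[
\begin{aligned}
\frac{\partial^+\hat{y}_t(n)}{\partial w_{l0}^{(2)}}
&= \frac{\partial\hat{y}_t(n)}{\partial w_{l0}^{(2)}} + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny} \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)} \frac{\partial\hat{y}_{i_2}(n-i_1)}{\partial w_{l0}^{(2)}} \\
&= \sum_{m=1}^{nb} \hat{b}_{tl}^{(m)} + \sum_{i_1=1}^{K}\sum_{i_2=1}^{ny}\sum_{m=1}^{nb} \hat{b}_{i_2 l}^{(m)} \frac{\partial\hat{y}_t(n)}{\partial\hat{y}_{i_2}(n-i_1)},
\end{aligned} \tag{3.111}
\]
where $i = 1,\ldots,nu$, $j = 1,\ldots,M$, $l = 1,\ldots,nf$.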
3.10 Appendix 3.3. Proof of Theorem 3.1

In the BPTT method, the Hammerstein model is unfolded back in time, then the unfolded model is differentiated w.r.t. model parameters. This is equivalent to the differentiation of the model and unfolding the differentiated model, i.e., sensitivity models back in time. Taking into account (3.68) and (3.69), it follows from (3.39) that the outputs of sensitivity models for the parameters $\hat{a}_k$, $k = 1,\ldots,na$, can be expressed as functions of the input $u(n)$:
\[
\begin{aligned}
\frac{\partial\hat{y}(n)}{\partial\hat{a}_k}
&= -H_1(q^{-1})\hat{y}(n-k) = -H_1(q^{-1})H_2(q^{-1})\hat{f}\big(u(n-k),w\big) \\
&= -H(q^{-1})\hat{f}\big(u(n-k),w\big),
\end{aligned} \tag{3.112}
\]
where
\[
H(q^{-1}) = H_1(q^{-1})H_2(q^{-1}). \tag{3.113}
\]
The impulse response of the system (3.113) is given by the convolution relationship
\[
h(n) = \sum_{i_2=0}^{n} h_1(i_2)h_2(n-i_2). \tag{3.114}
\]
The partial derivatives $\partial\hat{y}(n)/\partial\hat{a}_k$ are related to the input $u(n)$ with the impulse response functions $h_1(n)$ and $h_2(n)$ and the function $\hat{f}(\cdot)$:
\[
\begin{aligned}
\frac{\partial\hat{y}(n)}{\partial\hat{a}_k}
&= -\sum_{i_1=0}^{n-k} h(i_1)\hat{f}\big(u(n-k-i_1),w\big) \\
&= -\sum_{i_1=0}^{n-k}\sum_{i_2=0}^{i_1} h_1(i_2)h_2(i_1-i_2)\hat{f}\big(u(n-k-i_1),w\big).
\end{aligned} \tag{3.115}
\]
The calculation of (3.115) requires unfolding sensitivity models $n-1$ times back in time. Thus, the partial derivatives $\partial^+\hat{y}(n)/\partial\hat{a}_k$ calculated for the sensitivity models unfolded $K$ times back in time are
\[
\frac{\partial^+\hat{y}(n)}{\partial\hat{a}_k} =
\begin{cases}
-\displaystyle\sum_{i_1=0}^{n-k}\sum_{i_2=0}^{i_1} h_1(i_2)h_2(i_1-i_2)\hat{f}\big(u(n-k-i_1),w\big), & \text{for } n \le K+k, \\[2ex]
-\displaystyle\sum_{i_1=0}^{K}\sum_{i_2=0}^{i_1} h_1(i_2)h_2(i_1-i_2)\hat{f}\big(u(n-k-i_1),w\big) \\
\quad -\displaystyle\sum_{i_1=K+1}^{n-k}\sum_{i_2=0}^{K} h_1(i_2)h_2(i_1-i_2)\hat{f}\big(u(n-k-i_1),w\big), & \text{for } n > K+k.
\end{cases} \tag{3.116}
\]
The errors of computing partial derivatives with the truncated BPTT method are
\[
\Delta\hat{y}_{\hat{a}_k}(n) =
\begin{cases}
0, & \text{for } n \le K+k, \\[1ex]
-\displaystyle\sum_{i_1=K+1}^{n-k}\sum_{i_2=K+1}^{i_1} h_1(i_2)h_2(i_1-i_2)\hat{f}\big(u(n-k-i_1),w\big), & \text{for } n > K+k.
\end{cases} \tag{3.117}
\]
As the input $\{u(n)\}$ is a sequence of zero-mean i.i.d. random variables, the output of the nonlinear element model $\{\hat{f}\big(u(n),w\big)\}$ is a sequence of i.i.d. random variables as well. Their expected values $m_f$ and variances $\sigma_f^2$ depend on both the neural network model architecture and the parameter vector $w$:
\[
E\big[\hat{f}\big(u(n),w\big)\big] = m_f, \tag{3.118}
\]
\[
E\Big[\big(\hat{f}\big(u(n),w\big) - m_f\big)^2\Big] = \sigma_f^2. \tag{3.119}
\]
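Therefore, the expected values of (3.117) are
\[
E\big[\Delta\hat{y}_{\hat{a}_k}(n)\big] = -m_f \sum_{i_1=K+1}^{n-k}\sum_{i_2=K+1}^{i_1} h_1(i_2)h_2(i_1-i_2), \tag{3.120}
\]
and the variances of the errors (3.117) are
\[
\begin{aligned}
\mathrm{var}\big[\Delta\hat{y}_{\hat{a}_k}(n)\big]
&= E\Bigg[\bigg(-\sum_{i_1=K+1}^{n-k}\sum_{i_2=K+1}^{i_1} h_1(i_2)h_2(i_1-i_2)\hat{f}\big(u(n-k-i_1),w\big) \\
&\qquad\quad + m_f \sum_{i_1=K+1}^{n-k}\sum_{i_2=K+1}^{i_1} h_1(i_2)h_2(i_1-i_2)\bigg)^2\Bigg] \\
&= E\Bigg[\bigg(\sum_{i_1=K+1}^{n-k}\sum_{i_2=K+1}^{i_1} h_1(i_2)h_2(i_1-i_2)\Big(\hat{f}\big(u(n-k-i_1),w\big) - m_f\Big)\bigg)^2\Bigg].
\end{aligned} \tag{3.121}
\]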
Since fˆ u(n − k − i1),w − mf , i1 = K + 1, . . . , n − k, are zero-mean i.i.d. random variables, the variances of the errors (3.117) are given by (3.67) for n > K + k, and are equal to zero otherwise.
3.11 Appendix 3.4. Proof of Theorem 3.2

In the BPTT method, the Hammerstein model is unfolded back in time, then the unfolded model is differentiated w.r.t. model parameters. This is equivalent to the differentiation of the model and unfolding the differentiated model, i.e., sensitivity models back in time. Taking into account (3.68), it follows from (3.40) that the outputs of sensitivity models for the parameters $\hat{b}_k$, $k = 1,\ldots,nb$, can be expressed as functions of the input $u(n)$:
\[
\frac{\partial\hat{y}(n)}{\partial\hat{b}_k} = H_1(q^{-1})\hat{f}\big(u(n-k),w\big). \tag{3.122}
\]
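The partial derivatives $\partial\hat{y}(n)/\partial\hat{b}_k$ are related to the input $u(n)$ with the impulse response function $h_1(n)$ and the function $\hat{f}(\cdot)$:
\[
\frac{\partial\hat{y}(n)}{\partial\hat{b}_k} = \sum_{i_1=0}^{n-k} h_1(i_1)\hat{f}\big(u(n-k-i_1),w\big). \tag{3.123}
\]
The calculation of (3.123) requires unfolding sensitivity models $n-1$ times back in time. Thus, the partial derivatives $\partial^+\hat{y}(n)/\partial\hat{b}_k$ calculated for sensitivity models unfolded $K$ times back in time are
\[
\frac{\partial^+\hat{y}(n)}{\partial\hat{b}_k} =
\begin{cases}
\displaystyle\sum_{i_1=0}^{n-k} h_1(i_1)\hat{f}\big(u(n-k-i_1),w\big), & \text{for } n \le K+k, \\[2ex]
\displaystyle\sum_{i_1=0}^{K} h_1(i_1)\hat{f}\big(u(n-k-i_1),w\big), & \text{for } n > K+k.
\end{cases} \tag{3.124}
\]
The errors of computing partial derivatives with the truncated BPTT method are
\[
\Delta\hat{y}_{\hat{b}_k}(n) =
\begin{cases}
0, & \text{for } n \le K+k, \\[1ex]
\displaystyle\sum_{i_1=K+1}^{n-k} h_1(i_1)\hat{f}\big(u(n-k-i_1),w\big), & \text{for } n > K+k.
\end{cases} \tag{3.125}
\]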
As the input $\{u(n)\}$ is a sequence of zero-mean i.i.d. random variables, the output of the nonlinear element model $\{\hat{f}\big(u(n),w\big)\}$ is a sequence of i.i.d. random variables as well. Their expected values $m_f$ and variances $\sigma_f^2$ depend on both the neural network model architecture and the parameter vector $w$:
\[
E\big[\hat{f}\big(u(n),w\big)\big] = m_f, \tag{3.126}
\]
\[
E\Big[\big(\hat{f}\big(u(n),w\big) - m_f\big)^2\Big] = \sigma_f^2. \tag{3.127}
\]
Therefore, the expected values of (3.125) are
\[
E\big[\Delta\hat{y}_{\hat{b}_k}(n)\big] = m_f \sum_{i_1=K+1}^{n-k} h_1(i_1), \tag{3.128}
\]
and the variances of the errors (3.125) are
\[
\begin{aligned}
\mathrm{var}\big[\Delta\hat{y}_{\hat{b}_k}(n)\big]
&= E\Bigg[\bigg(\sum_{i_1=K+1}^{n-k} h_1(i_1)\hat{f}\big(u(n-k-i_1),w\big) - m_f \sum_{i_1=K+1}^{n-k} h_1(i_1)\bigg)^2\Bigg] \\
&= E\Bigg[\bigg(\sum_{i_1=K+1}^{n-k} h_1(i_1)\Big(\hat{f}\big(u(n-k-i_1),w\big) - m_f\Big)\bigg)^2\Bigg].
\end{aligned} \tag{3.129}
\]
Since fˆ u(n − k − i1),w − mf , i1 = K + 1, . . . , n − k, are zero-mean i.i.d. random variables, the variances of the errors (3.125) are given by (3.70) for n > K + k, and are equal to zero otherwise.
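The truncation error (3.125) can be checked numerically with a short Python sketch. It rests on assumptions that go beyond the text: $h_1$ is taken as the impulse response of $1/\hat{A}(q^{-1})$ (the definition (3.68) is not reproduced here), time indices start at 0, and the sequence `f_seq` stands in for the nonlinear element outputs $\hat{f}\big(u(n),w\big)$. The function names are hypothetical.

```python
def impulse_response(a_hat, length):
    """Impulse response h1(0..length-1) of 1/A(q^-1), A = 1 + a1 q^-1 + ... (assumed form of H1)."""
    h = []
    for i in range(length):
        acc = 1.0 if i == 0 else 0.0
        for m, a_m in enumerate(a_hat, start=1):
            if i - m >= 0:
                acc -= a_m * h[i - m]
        h.append(acc)
    return h


def sensitivity_b(h1, f_seq, n, k, K=None):
    """Sum of h1(i1) * f(n-k-i1): full unfolding (3.123) if K is None, else truncated (3.124)."""
    upper = n - k
    if K is not None and n > K + k:
        upper = K  # only K steps back in time are kept
    return sum(h1[i1] * f_seq[n - k - i1] for i1 in range(upper + 1))


h1 = impulse_response([-0.5], 16)   # h1(i) = 0.5**i
f_seq = [1.0] * 16                  # constant stand-in for f_hat(u(n), w), so m_f = 1
n, k, K = 10, 1, 3

full = sensitivity_b(h1, f_seq, n, k)
truncated = sensitivity_b(h1, f_seq, n, k, K)
error = full - truncated
# Tail sum from (3.125); with a constant f_seq it also equals the mean error (3.128).
tail = sum(h1[i1] * f_seq[n - k - i1] for i1 in range(K + 1, n - k + 1))
print(error, tail)
```

With a constant `f_seq` the error equals $m_f \sum_{i_1=K+1}^{n-k} h_1(i_1)$, so the example also shows how quickly the mean error shrinks with $K$ when $h_1$ decays fast.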
3.12 Appendix 3.5. Proof of Theorem 3.3

To prove Theorem 3.3, we will consider gradient calculation with the sensitivity model (3.41) completely unfolded back in time, and the sensitivity model unfolded back in time up to $K$ steps. Using the backward shift operator notation, (3.41) can be written as
\[
\frac{\partial\hat{y}(n)}{\partial w_c} = qH_2(q^{-1}) \frac{\partial\hat{f}\big(u(n-1),w\big)}{\partial w_c}, \tag{3.130}
\]
where $c = 1,\ldots,3M$. The above partial derivatives, corresponding to the model completely unfolded back in time, can be expressed by the following convolution summations:
\[
\frac{\partial\hat{y}(n)}{\partial w_c}
= \sum_{i_1=0}^{n-1} h_2(i_1+1) \frac{\partial\hat{f}\big(u(n-i_1-1),w\big)}{\partial w_c}
= \sum_{i_1=1}^{n} h_2(i_1) \frac{\partial\hat{f}\big(u(n-i_1),w\big)}{\partial w_c}. \tag{3.131}
\]
In the truncated BPTT method, these summations are performed only up to $K$ times back in time and partial derivatives of the model output w.r.t. $w_c$ are
\[
\frac{\partial^+\hat{y}(n)}{\partial w_c} =
\begin{cases}
\displaystyle\sum_{i_1=1}^{n} h_2(i_1) \frac{\partial\hat{f}\big(u(n-i_1),w\big)}{\partial w_c}, & \text{for } n \le K+1, \\[2ex]
\displaystyle\sum_{i_1=1}^{K+1} h_2(i_1) \frac{\partial\hat{f}\big(u(n-i_1),w\big)}{\partial w_c}, & \text{for } n > K+1.
\end{cases} \tag{3.132}
\]
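From (3.131) and (3.132), it follows that the errors of computing partial derivatives with the truncated BPTT method are
\[
\Delta\hat{y}_{w_c}(n) =
\begin{cases}
0, & \text{for } n \le K+1, \\[1ex]
\displaystyle\sum_{i_1=K+2}^{n} h_2(i_1) \frac{\partial\hat{f}\big(u(n-i_1),w\big)}{\partial w_c}, & \text{for } n > K+1.
\end{cases} \tag{3.133}
\]
As the input $\{u(n)\}$ is a sequence of zero-mean i.i.d. random variables, the partial derivatives $\partial\hat{f}\big(u(n),w\big)/\partial w_c$ are sequences of i.i.d. random variables as well. Their expected values $m_{w_c}$ and variances $\sigma_{w_c}^2$ depend on both the neural network model architecture and the parameter vector $w$:
\[
E\left[\frac{\partial\hat{f}\big(u(n),w\big)}{\partial w_c}\right] = m_{w_c}, \tag{3.134}
\]
\[
E\left[\left(\frac{\partial\hat{f}\big(u(n),w\big)}{\partial w_c} - m_{w_c}\right)^2\right] = \sigma_{w_c}^2. \tag{3.135}
\]
Therefore, the expected values of (3.133) are
\[
E\big[\Delta\hat{y}_{w_c}(n)\big] = m_{w_c} \sum_{i_1=K+2}^{n} h_2(i_1), \tag{3.136}
\]
and the variances of (3.133) are
\[
\begin{aligned}
\mathrm{var}\big[\Delta\hat{y}_{w_c}(n)\big]
&= E\Bigg[\bigg(\sum_{i_1=K+2}^{n} h_2(i_1) \frac{\partial\hat{f}\big(u(n-i_1),w\big)}{\partial w_c} - m_{w_c} \sum_{i_1=K+2}^{n} h_2(i_1)\bigg)^2\Bigg] \\
&= E\Bigg[\bigg(\sum_{i_1=K+2}^{n} h_2(i_1)\bigg(\frac{\partial\hat{f}\big(u(n-i_1),w\big)}{\partial w_c} - m_{w_c}\bigg)\bigg)^2\Bigg].
\end{aligned} \tag{3.137}
\]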
Since ∂ fˆ u(n−i1 ),w /∂wc − mwc , i1 = K + 2, . . . , n, are zero-mean i.i.d. random variables, the variances of the errors (3.133) are given by (3.72) for n > K + 1, and are equal to zero otherwise.
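3.13 Appendix 3.6. Proof of Theorem 3.4

From (3.41), it follows that
\[
\frac{\partial\hat{y}(n)}{\partial w_{10}^{(2)}} = qH_2(q^{-1}) \frac{\partial\hat{f}\big(u(n-1),w\big)}{\partial w_{10}^{(2)}}. \tag{3.138}
\]
Taking into account (3.35), we have
\[
\frac{\partial\hat{y}(n)}{\partial w_{10}^{(2)}} = \sum_{i_1=0}^{n-1} h_2(i_1+1) = \sum_{i_1=1}^{n} h_2(i_1). \tag{3.139}
\]
Unfolding (3.138) back in time for $K$ steps gives
\[
\frac{\partial^+\hat{y}(n)}{\partial w_{10}^{(2)}} =
\begin{cases}
\displaystyle\sum_{i_1=1}^{n} h_2(i_1), & \text{for } n \le K+1, \\[2ex]
\displaystyle\sum_{i_1=1}^{K+1} h_2(i_1), & \text{for } n > K+1.
\end{cases} \tag{3.140}
\]
Therefore,
\[
\Delta\hat{y}_{w_{10}^{(2)}}(n) =
\begin{cases}
0, & \text{for } n \le K+1, \\[1ex]
\displaystyle\sum_{i_1=K+2}^{n} h_2(i_1), & \text{for } n > K+1.
\end{cases} \tag{3.141}
\]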
4 Polynomial Wiener models
This chapter deals with polynomial Wiener models, i.e., models composed of a pulse transfer model of the linear dynamic system and a polynomial model of the nonlinear element or the inverse nonlinear element. A modified definition of the equation error and a modified series-parallel Wiener model are introduced. Assuming that the nonlinear element is invertible and the inverse nonlinear element can be described by a polynomial, the modified series-parallel Wiener model can be transformed into the linear-in-parameters form and its parameters can be calculated with the least squares method. Such an approach, however, results in inconsistent parameter estimates. As a remedy against this problem, an instrumental variables method is proposed with instrumental variables chosen as delayed system inputs and delayed and powered delayed outputs of the model obtained using the least squares method. An alternative to this combined least squares-instrumental variables approach is the prediction error method, in which the parameters of the noninverted nonlinear element are estimated, see [128] for the batch version and [85] for the sequential one. The pseudolinear regression method [86], a simplified version of the prediction error method with lower computational requirements, is another effective technique for parameter estimation in Wiener systems disturbed additively by a discrete-time white noise.

This chapter is organized as follows: First, the least squares approach to the identification of Wiener systems based on the modified series-parallel model is introduced in Section 4.1. Two cases are considered: a Wiener system with the linear term and a Wiener system without it. Section 4.2 contains details of the recursive prediction error approach to the identification of polynomial Wiener systems. The pseudolinear regression method is discussed in Section 4.3. Finally, a brief summary is given in Section 4.4.
A. Janczak: Identification of Nonlinear Systems, LNCIS 310, pp. 117–141, 2005. © Springer-Verlag Berlin Heidelberg 2005
4.1 Least squares approach to the identification of Wiener systems

This section presents a least squares approach to the identification of polynomial Wiener systems. To transform the Wiener model into the linear-in-parameters form, the noninverted model of the linear dynamic system and the inverse model of the nonlinear element are used. The following assumptions are made about the identified Wiener system:

Assumption 4.1. The SISO Wiener system is
\[
y(n) = f\left(\frac{B(q^{-1})}{A(q^{-1})}u(n) + \varepsilon(n)\right), \tag{4.1}
\]
where
\[
A(q^{-1}) = 1 + a_1 q^{-1} + \cdots + a_{na} q^{-na}, \tag{4.2}
\]
\[
B(q^{-1}) = b_1 q^{-1} + \cdots + b_{nb} q^{-nb}, \tag{4.3}
\]
and $\varepsilon(n)$ is the additive disturbance.

Assumption 4.2. The polynomials $A(q^{-1})$ and $B(q^{-1})$ are coprime.

Assumption 4.3. The orders $na$ and $nb$ of the polynomials $A(q^{-1})$ and $B(q^{-1})$ are known.

Assumption 4.4. The linear dynamic system is causal and asymptotically stable.

Assumption 4.5. The input $u(n)$ has finite moments and is independent of $\varepsilon(k)$ for all $n$ and $k$.

Assumption 4.6. The nonlinear function $f(\cdot)$ is defined on the interval $[a, b]$.

Assumption 4.7. The nonlinear function $f(\cdot)$ is invertible.

Assumption 4.8. The inverse nonlinear function $f^{-1}\big(y(n)\big)$ can be expressed by the polynomial
\[
f^{-1}\big(y(n)\big) = \gamma_0 + \gamma_1 y(n) + \gamma_2 y^2(n) + \cdots + \gamma_r y^r(n) \tag{4.4}
\]
of a known order $r$.

The identification problem can be formulated as follows: Given the sequence of the Wiener system input and output measurements $\{u(n), y(n)\}$, $n = 1,\ldots,N$, estimate the parameters of the linear dynamic system and the inverse nonlinear element minimizing the following criterion:
\[
J = \frac{1}{2}\sum_{n=1}^{N} \big(y(n) - \hat{y}(n)\big)^2, \tag{4.5}
\]
where $\hat{y}(n)$ is the output of the Wiener model.
4.1.1 Identification error

For a polynomial Wiener model, both its parallel and series-parallel forms are nonlinear functions of model parameters. Moreover, the series-parallel model contains not only a model of the nonlinear element but also its inverse [73], see Fig. 4.2. Consider the parallel Wiener model given by
\[
\hat{y}(n) = \hat{f}\left(\frac{\hat{B}(q^{-1})}{\hat{A}(q^{-1})}u(n)\right), \tag{4.6}
\]
with
\[
\hat{A}(q^{-1}) = 1 + \hat{a}_1 q^{-1} + \cdots + \hat{a}_{na} q^{-na}, \tag{4.7}
\]
\[
\hat{B}(q^{-1}) = \hat{b}_1 q^{-1} + \cdots + \hat{b}_{nb} q^{-nb}. \tag{4.8}
\]
If $\hat{f}(\cdot)$ is invertible, (4.6) can be written as
\[
\hat{f}^{-1}\big(\hat{y}(n)\big) = \frac{\hat{B}(q^{-1})}{\hat{A}(q^{-1})}u(n). \tag{4.9}
\]
Assumption 4.9. The inverse nonlinear function $\hat{f}^{-1}(\cdot)$ has the form of a polynomial of the order $r$:
\[
\hat{f}^{-1}\big(\hat{y}(n)\big) = \hat{\gamma}_0 + \hat{\gamma}_1\hat{y}(n) + \hat{\gamma}_2\hat{y}^2(n) + \cdots + \hat{\gamma}_r\hat{y}^r(n). \tag{4.10}
\]
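Assume also that the polynomial (4.10) contains the linear term, i.e., $\hat{\gamma}_1 \neq 0$. Then combining (4.10) and (4.9), the output of the model can be expressed as [80, 83]:
\[
\hat{y}(n) = \frac{1}{\hat{\gamma}_1}\left[\frac{\hat{B}(q^{-1})}{\hat{A}(q^{-1})}u(n) - \Delta\hat{f}^{-1}\big(\hat{y}(n)\big)\right], \tag{4.11}
\]
where
\[
\Delta\hat{f}^{-1}\big(\hat{y}(n)\big) = \hat{\gamma}_0 + \hat{\gamma}_2\hat{y}^2(n) + \hat{\gamma}_3\hat{y}^3(n) + \cdots + \hat{\gamma}_r\hat{y}^r(n). \tag{4.12}
\]
The model (4.11) can be written as
\[
\hat{y}(n) = \big[1 - \hat{A}(q^{-1})\big]\hat{y}(n) + \frac{1}{\hat{\gamma}_1}\Big[\hat{B}(q^{-1})u(n) - \hat{A}(q^{-1})\Delta\hat{f}^{-1}\big(\hat{y}(n)\big)\Big]. \tag{4.13}
\]
Replacing $\hat{y}(n)$ by $y(n)$ on the r.h.s. of (4.13), the following modified series-parallel model can be obtained:
\[
\hat{y}(n) = \big[1 - \hat{A}(q^{-1})\big]y(n) + \frac{1}{\hat{\gamma}_1}\Big[\hat{B}(q^{-1})u(n) - \hat{A}(q^{-1})\Delta\hat{f}^{-1}\big(y(n)\big)\Big]. \tag{4.14}
\]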
Fig. 4.1. Modified series-parallel Wiener model. The identification error definition for systems with the linear term

Fig. 4.2. Series-parallel Wiener model. The definition of the identification error
The modified series-parallel model, shown in Fig. 4.1, is different from the series-parallel model, which contains both the model of the nonlinear element and its inverse,
\[
\hat{y}(n) = \hat{f}\Big(\big[1 - \hat{A}(q^{-1})\big]\hat{f}^{-1}\big(y(n)\big) + \hat{B}(q^{-1})u(n)\Big), \tag{4.15}
\]
and the inverse series-parallel model
\[
\hat{u}(n-1) = \frac{1}{\hat{b}_1}\Big[\big(\hat{b}_1 - \hat{B}(q^{-1})\big)u(n) + \hat{A}(q^{-1})\hat{f}^{-1}\big(\hat{y}(n)\big)\Big], \tag{4.16}
\]
see Figs 4.2 – 4.3 for comparison. Applying (4.14), the following modified definition of the identification error can be introduced:
Fig. 4.3. Inverse series-parallel Wiener model. The definition of the identification error
\[
e(n) = y(n) - \hat{y}(n) \tag{4.17}
\]
\[
e(n) = \hat{A}(q^{-1})y(n) - \frac{1}{\hat{\gamma}_1}\Big[\hat{B}(q^{-1})u(n) - \hat{A}(q^{-1})\Delta\hat{f}^{-1}\big(y(n)\big)\Big]. \tag{4.18}
\]
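4.1.2 Nonlinear characteristic with the linear term

Assuming that the identified Wiener system has an invertible nonlinear characteristic with $\gamma_1 \neq 0$, we will express the modified series-parallel Wiener model in the linear-in-parameters form. Introduce the parameter vector $\hat{\theta}$,
\[
\hat{\theta} = \big[\hat{a}_1 \ldots \hat{a}_{na} \;\; \hat{\beta}_1 \ldots \hat{\beta}_{nb} \;\; \hat{\alpha}_{00} \;\; \hat{\alpha}_{20} \ldots \hat{\alpha}_{r\,na}\big]^T, \tag{4.19}
\]
and the regression vector $x(n)$,
\[
x(n) = \big[-y(n-1) \ldots -y(n-na) \;\; u(n-1) \ldots u(n-nb) \;\; -1 \;\; -y^2(n) \ldots -y^r(n-na)\big]^T, \tag{4.20}
\]
where
\[
\hat{\beta}_k = \frac{\hat{b}_k}{\hat{\gamma}_1}, \quad k = 1,\ldots,nb, \tag{4.21}
\]
\[
\hat{\alpha}_{jk} =
\begin{cases}
\dfrac{\hat{\gamma}_j}{\hat{\gamma}_1}\Big(1 + \displaystyle\sum_{m=1}^{na}\hat{a}_m\Big), & k = 0,\; j = 0, \\[2ex]
\dfrac{\hat{\gamma}_j}{\hat{\gamma}_1}, & k = 0,\; j = 2, 3, \ldots, r, \\[2ex]
\dfrac{\hat{\gamma}_j\hat{a}_k}{\hat{\gamma}_1}, & k = 1,\ldots,na,\; j = 2, 3, \ldots, r.
\end{cases} \tag{4.22}
\]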
Then the model (4.14) can be written as
\[
\hat{y}(n) = x^T(n)\hat{\theta}. \tag{4.23}
\]
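The linear-in-parameters form (4.23) can be exercised on a small noise-free example. The sketch below is illustrative only: the first-order system ($na = nb = 1$, $r = 2$), its coefficients, and the helper names are all hypothetical, and the constant regressor is taken as $-1$ so that $x^T(n)\hat{\theta}$ reproduces (4.14) with $\hat{\alpha}_{00}$ as defined in (4.22). The normal equations are solved with a plain Gaussian elimination to keep the example free of external libraries.

```python
import math
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def least_squares(X, y):
    """Normal-equation LS estimate: theta = (X^T X)^{-1} X^T y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear(XtX, Xty)


# Hypothetical noise-free Wiener system: s(n) = -a1*s(n-1) + b1*u(n-1), y = f(s).
a1, b1 = -0.5, 1.0
g0, g1, g2 = 0.2, 1.0, 0.1          # f^{-1}(y) = g0 + g1*y + g2*y^2


def f(s):
    # Inverse of f^{-1} on the branch where f^{-1} is increasing.
    return (-g1 + math.sqrt(g1 * g1 - 4.0 * g2 * (g0 - s))) / (2.0 * g2)


random.seed(0)
N = 200
u = [random.uniform(-1.0, 1.0) for _ in range(N)]
s = [0.0] * N
for n in range(1, N):
    s[n] = -a1 * s[n - 1] + b1 * u[n - 1]
y = [f(v) for v in s]

# Regression vector for na = nb = 1, r = 2: [-y(n-1), u(n-1), -1, -y^2(n), -y^2(n-1)].
X = [[-y[n - 1], u[n - 1], -1.0, -y[n] ** 2, -y[n - 1] ** 2] for n in range(1, N)]
theta = least_squares(X, y[1:])
print(theta)  # estimates of [a1, beta1, alpha00, alpha20, alpha21]
```

Without disturbances the estimates agree with $a_1$, $b_1/\gamma_1$, $\gamma_0(1 + a_1)/\gamma_1$, $\gamma_2/\gamma_1$ and $\gamma_2 a_1/\gamma_1$ up to rounding; this does not carry over to the disturbed case.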
Minimizing (4.5), the parameter vector $\hat{\theta}$ can be calculated with the least squares (LS) or recursive least squares (RLS) method. Note that the number of parameters in (4.14) is $na + nb + r(na + 1)$, while the number of parameters of $\hat{A}(q^{-1})$, $\hat{B}(q^{-1})$, and $\hat{f}(\cdot)$ is $na + nb + r + 1$. Therefore, to obtain a unique solution, methods similar to those proposed in [34] for the identification of Hammerstein models can be employed.

4.1.3 Nonlinear characteristic without the linear term

Consider a Wiener system that fulfills the following assumptions:
– The linear term $\gamma_1 = 0$.
– The second order term $\gamma_2 \neq 0$.

In this case, the following modified series-parallel model can be defined (Fig. 4.4):
\[
\hat{y}^2(n) = \big[1 - \hat{A}(q^{-1})\big]y^2(n) + \frac{1}{\hat{\gamma}_2}\Big[\hat{B}(q^{-1})u(n) - \hat{A}(q^{-1})\Delta\hat{f}^{-1}\big(y(n)\big)\Big], \tag{4.24}
\]
where $\Delta\hat{f}^{-1}(\cdot)$ now collects the constant term and the terms of order 3 and higher. Now, the identification error can be defined as
\[
\begin{aligned}
e(n) &= y^2(n) - \hat{y}^2(n) \\
&= \hat{A}(q^{-1})y^2(n) - \frac{1}{\hat{\gamma}_2}\Big[\hat{B}(q^{-1})u(n) - \hat{A}(q^{-1})\Delta\hat{f}^{-1}\big(y(n)\big)\Big].
\end{aligned} \tag{4.25}
\]
Hence, (4.24) can be written in the following linear-in-parameters form:
\[
\hat{y}^2(n) = x^T(n)\hat{\theta}, \tag{4.26}
\]
with the parameter vector $\hat{\theta}$,
\[
\hat{\theta} = \big[\hat{a}_1 \ldots \hat{a}_{na} \;\; \hat{\beta}_1 \ldots \hat{\beta}_{nb} \;\; \hat{\alpha}_{00} \;\; \hat{\alpha}_{30} \ldots \hat{\alpha}_{r\,na}\big]^T, \tag{4.27}
\]
and the regression vector $x(n)$,
\[
x(n) = \big[-y^2(n-1) \ldots -y^2(n-na) \;\; u(n-1) \ldots u(n-nb) \;\; -1 \;\; -y^3(n) \ldots -y^r(n-na)\big]^T, \tag{4.28}
\]
where
\[
\hat{\beta}_k = \frac{\hat{b}_k}{\hat{\gamma}_2}, \quad k = 1,\ldots,nb, \tag{4.29}
\]
Fig. 4.4. Modified series-parallel model. The identification error definition for systems without the linear term
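\[
\hat{\alpha}_{jk} =
\begin{cases}
\dfrac{\hat{\gamma}_j}{\hat{\gamma}_2}\Big(1 + \displaystyle\sum_{m=1}^{na}\hat{a}_m\Big), & k = 0,\; j = 0, \\[2ex]
\dfrac{\hat{\gamma}_j}{\hat{\gamma}_2}, & k = 0,\; j = 3, 4, \ldots, r, \\[2ex]
\dfrac{\hat{\gamma}_j\hat{a}_k}{\hat{\gamma}_2}, & k = 1,\ldots,na,\; j = 3, 4, \ldots, r.
\end{cases} \tag{4.30}
\]
As in the previous case, the parameter vector $\hat{\theta}$ can be calculated with the least squares (LS) or recursive least squares (RLS) method minimizing the following criterion:
\[
J = \frac{1}{2}\sum_{n=1}^{N} \big(y^2(n) - \hat{y}^2(n)\big)^2. \tag{4.31}
\]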
4.1.4 Asymptotic bias error of the LS estimator

Consider a polynomial Wiener system (4.1) – (4.4) that contains the linear term, i.e., $\gamma_1 \neq 0$, and the modified series-parallel Wiener model (4.23). We will show now that parameter estimates of the Wiener system obtained with the LS method are inconsistent, i.e., asymptotically biased, even if the additive disturbance $\varepsilon(n)$ is
\[
\varepsilon(n) = \frac{\epsilon(n)}{A(q^{-1})}, \tag{4.32}
\]
where $\epsilon(n)$ is the discrete-time white noise.

Theorem 4.1. Let $\hat{\theta}$ denote the vector of parameter estimates, defined by (4.19), and $\theta$ the corresponding true parameter vector of the Wiener system, defined by (4.1) – (4.4). Then the LS estimate of $\theta$ is asymptotically biased, i.e., $\hat{\theta}$ does not converge (with the probability 1) to the true parameter vector $\theta$.

Proof: The output $y(n)$ of the Wiener system, defined by (4.1) and (4.32), is
\[
y(n) = \big[1 - A(q^{-1})\big]y(n) + \frac{1}{\gamma_1}\Big[B(q^{-1})u(n) - A(q^{-1})\Delta f^{-1}\big(y(n)\big) + \epsilon(n)\Big]. \tag{4.33}
\]
Introducing the true parameter vector $\theta$,
\[
\theta = \big[a_1 \ldots a_{na} \;\; \beta_1 \ldots \beta_{nb} \;\; \alpha_{00} \;\; \alpha_{20} \ldots \alpha_{r\,na}\big]^T, \tag{4.34}
\]
where
\[
\beta_k = \frac{b_k}{\gamma_1}, \quad k = 1,\ldots,nb, \tag{4.35}
\]
\[
\alpha_{jk} =
\begin{cases}
\dfrac{\gamma_j}{\gamma_1}\Big(1 + \displaystyle\sum_{m=1}^{na} a_m\Big), & k = 0,\; j = 0, \\[2ex]
\dfrac{\gamma_j}{\gamma_1}, & k = 0,\; j = 2, 3, \ldots, r, \\[2ex]
\dfrac{\gamma_j a_k}{\gamma_1}, & k = 1,\ldots,na,\; j = 2, 3, \ldots, r,
\end{cases} \tag{4.36}
\]
the system output can be expressed as
\[
y(n) = x^T(n)\theta + \frac{1}{\gamma_1}\epsilon(n). \tag{4.37}
\]
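The solution to the LS estimation problem is given by
\[
\hat{\theta} = \left[\frac{1}{N}\sum_{n=1}^{N} x(n)x^T(n)\right]^{-1}\left[\frac{1}{N}\sum_{n=1}^{N} x(n)y(n)\right]. \tag{4.38}
\]
From (4.37) and (4.38), it follows that the parameter estimation error $\hat{\theta} - \theta$ is
\[
\begin{aligned}
\hat{\theta} - \theta
&= \left[\frac{1}{N}\sum_{n=1}^{N} x(n)x^T(n)\right]^{-1}\left[\frac{1}{N}\sum_{n=1}^{N} x(n)y(n) - \frac{1}{N}\sum_{n=1}^{N} x(n)x^T(n)\,\theta\right] \\
&= \frac{1}{\gamma_1}\left[\frac{1}{N}\sum_{n=1}^{N} x(n)x^T(n)\right]^{-1}\left[\frac{1}{N}\sum_{n=1}^{N} x(n)\epsilon(n)\right].
\end{aligned} \tag{4.39}
\]
Therefore, if $N \to \infty$,
\[
\hat{\theta} - \theta \to \frac{1}{\gamma_1}\Big\{E\big[x(n)x^T(n)\big]\Big\}^{-1} E\big[x(n)\epsilon(n)\big] \neq 0, \tag{4.40}
\]
as $E\big[y^2(n)\epsilon(n)\big] \neq 0, \ldots, E\big[y^r(n)\epsilon(n)\big] \neq 0$, and thus $E\big[x(n)\epsilon(n)\big] \neq 0$.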
Remark 4.1. In a similar way, it can also be shown that the parameter vector $\hat{\theta}$ of the modified series-parallel Wiener model (4.24), calculated with the LS method, is asymptotically biased.

Remark 4.2. It can also be proved that asymptotically biased LS parameter estimates are obtained using other linear-in-parameters models which contain the inverse polynomial model of the nonlinear element. Examples of such models are the frequency sampling model [95, 96], the inverse Wiener model, and the model based on Laguerre filters [116].

4.1.5 Instrumental variables method

To obtain consistent parameter estimates, the regression vector $x(n)$ should be uncorrelated with system disturbances. That is not the case if we use the modified series-parallel model, as the powered system outputs $y^2(n), \ldots, y^r(n)$ depend on $\epsilon(n)$. The instrumental variables method is a well-known remedy against such a situation. Applying the instrumental variables method, parameter estimation can be performed according to the following scheme:
1. Estimate the parameters of the system using the LS or the RLS method.
2. Simulate the model to determine the instrumental variables $z(n)$.
3. Estimate the parameters of the system using the IV or the RIV method with the instrumental variables $z(n)$.

The choice of instrumental variables is a vital design problem in any instrumental variables approach, see [149] for more details. Clearly, the best choice would be undisturbed powered system outputs, but these are not available for measurement. Instead, we can employ powered outputs of the model, or powered outputs of the linear dynamic model, calculated with the LS method, and define the instrumental variables as
\[
z(n) = \big[-\hat{y}(n-1) \ldots -\hat{y}(n-na) \;\; u(n-1) \ldots u(n-nb) \;\; -1 \;\; -\hat{y}^2(n) \ldots -\hat{y}^r(n-na)\big]^T \tag{4.41}
\]
in the case of Wiener systems with the linear term, or
\[
z(n) = \big[-\hat{y}^2(n-1) \ldots -\hat{y}^2(n-na) \;\; u(n-1) \ldots u(n-nb) \;\; -1 \;\; -\hat{y}^3(n) \ldots -\hat{y}^r(n-na)\big]^T \tag{4.42}
\]
in the case of Wiener systems without the linear term. The instrumental variables $z(n)$ are uncorrelated with system disturbances:
\[
E\big[z(n)\varepsilon(n)\big] = 0. \tag{4.43}
\]
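The effect of replacing a disturbed regressor by an instrument that satisfies a condition like (4.43) can be seen on a deliberately simple scalar toy problem. The sketch below is not the Wiener identification scheme itself: the data-generating model, the value `theta_true`, and the use of the undisturbed signal `x0` as the instrument are all assumptions made for illustration (in the scheme above, the instruments are simulated model outputs). The standard scalar formulas $\hat{\theta}_{LS} = \sum x y / \sum x^2$ and $\hat{\theta}_{IV} = \sum z y / \sum z x$ are used.

```python
import random

random.seed(1)
theta_true = 2.0      # hypothetical true parameter
N = 5000

x0 = [random.uniform(-1.0, 1.0) for _ in range(N)]   # undisturbed signal
e = [random.gauss(0.0, 0.5) for _ in range(N)]       # disturbance
x = [x0[n] + e[n] for n in range(N)]                 # measured regressor, correlated with e
y = [theta_true * x0[n] + e[n] for n in range(N)]    # measured output

# LS estimate: biased, because the regressor x depends on the disturbance e.
theta_ls = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# IV estimate with the instrument z = x0, which is uncorrelated with e.
z = x0
theta_iv = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * xi for zi, xi in zip(z, x))

print(theta_ls, theta_iv)
```

In this toy setting the LS estimate settles well below `theta_true`, while the IV estimate stays close to it, mirroring the behaviour reported for the powered outputs in the regression vector.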
4.1.6 Simulation example. Nonlinear characteristic with the linear term

A linear dynamic system given by the continuous-time transfer function
\[
G(s) = \frac{1}{6s^2 + 5s + 1} \tag{4.44}
\]
was converted to discrete time, assuming a zero order hold on the input and the sampling interval 1 s, leading to the following difference equation:
\[
s(n) = 1.3231s(n-1) - 0.4346s(n-2) + 0.0635u(n-1) + 0.0481u(n-2). \tag{4.45}
\]
The linear dynamic system was followed by a nonlinear element described by the function
\[
f\big(s(n)\big) = 4\Big(\sqrt[3]{0.75s(n)} - 1\Big). \tag{4.46}
\]
Therefore, the inverse nonlinear function $f^{-1}\big(y(n)\big)$ is a polynomial:
\[
f^{-1}\big(y(n)\big) = \frac{4}{3} + y(n) + 0.25y^2(n) + \frac{1}{48}y^3(n). \tag{4.47}
\]
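The coefficients in (4.45) and the inverse polynomial (4.47) can be reproduced with a short calculation. The sketch below assumes the partial fraction form $G(s) = -1/(s + 1/2) + 1/(s + 1/3)$ and the standard zero order hold equivalent of a first-order term, $k/(s+\alpha) \mapsto (k/\alpha)(1 - e^{-\alpha T})q^{-1}/(1 - e^{-\alpha T}q^{-1})$; the nonlinear element is taken as $f(s) = 4\big(\sqrt[3]{0.75s} - 1\big)$, the form consistent with (4.47). Variable names are illustrative.

```python
import math

T = 1.0  # sampling interval in seconds

# Discrete-time poles of the two first-order terms.
p1, p2 = math.exp(-T / 2.0), math.exp(-T / 3.0)
# Zero order hold gains (k/alpha)*(1 - exp(-alpha*T)) for k = -1, alpha = 1/2 and k = 1, alpha = 1/3.
c1, c2 = -2.0 * (1.0 - p1), 3.0 * (1.0 - p2)

# Combine c1*q^-1/(1 - p1*q^-1) + c2*q^-1/(1 - p2*q^-1) over a common denominator.
a1, a2 = -(p1 + p2), p1 * p2
b1, b2 = c1 + c2, -(c1 * p2 + c2 * p1)
print(round(-a1, 4), round(-a2, 4), round(b1, 4), round(b2, 4))


def f(s):
    # Real cube root that also works for negative arguments.
    return 4.0 * (math.copysign(abs(0.75 * s) ** (1.0 / 3.0), s) - 1.0)


def f_inv(y):
    return 4.0 / 3.0 + y + 0.25 * y ** 2 + y ** 3 / 48.0


print(f_inv(f(0.8)))  # should return the argument 0.8 up to rounding
```

The polynomial in (4.47) is the expansion of $(y + 4)^3/48$, which is why the round trip through `f` and `f_inv` returns its argument.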
The input sequence $\{u(n)\}$ consisted of 40000 pseudo-random numbers uniformly distributed in $(-5, 5)$. Parameter estimation was performed with both the LS method and the IV method, assuming that $r = 3$ and $\hat{\gamma}_1 = 1$. Additive system disturbances were given by $\varepsilon(n) = [1/A(q^{-1})]\epsilon(n)$ with $\{\epsilon(n)\}$ a normally distributed pseudo-random sequence $N(0, 0.1)$. This corresponds to the signal to noise ratio $SNR = \mathrm{var}(y(n) - \varepsilon(n))/\mathrm{var}(\varepsilon(n)) = 3.14$. The identification results, given in Tables 4.1 and 4.2 and illustrated in Figs 4.5 and 4.6, show a considerable improvement in IV parameter estimates in comparison with LS ones.

Table 4.1. Parameter estimates, SNR = 3.14

Parameter    True         LS            IV
             σε = 0       σε = 0.1      σε = 0.1
a1           −1.3231      −1.2803       −1.3292
a2            0.4346       0.4158        0.4370
b1            0.0635       0.0558        0.0636
b2            0.0481       0.0423        0.0482
γ0            1.3333       1.0764        1.3813
γ1            1.0000       1.0000        1.0000
γ2            0.2500       0.2473        0.2503
γ3            0.0208       0.0199        0.0209
Fig. 4.5. Wiener system with the linear term. True $f^{-1}\big(y(n)\big)$ and estimated $\hat{f}^{-1}\big(y(n)\big)$ inverse nonlinear functions (curves: True, LS, IV; horizontal axis: $y(n)$)

Fig. 4.6. Wiener system with the linear term. Estimation error $\hat{f}^{-1}\big(y(n)\big) - f^{-1}\big(y(n)\big)$ (curves: LS, IV; horizontal axis: $y(n)$)
Table 4.2. Comparison of estimation accuracy

Performance index                                                                                    LS (σε = 0)     LS             IV
$\frac{1}{4}\sum_{j=1}^{2}\big[(\hat{a}_j - a_j)^2 + (\hat{b}_j - b_j)^2\big]$                        4.62 × 10^−23   5.70 × 10^−4   1.08 × 10^−5
$\frac{1}{3}\big[(\hat{\gamma}_0 - \gamma_0)^2 + \sum_{j=2}^{3}(\hat{\gamma}_j - \gamma_j)^2\big]$    1.13 × 10^−21   2.20 × 10^−2   7.66 × 10^−4
$\frac{1}{50}\sum_{i=1}^{50}\big[\hat{f}^{-1}\big(y(n)\big) - f^{-1}\big(y(n)\big)\big]^2$            3.78 × 10^−21   4.74 × 10^−2   1.91 × 10^−3
4.1.7 Simulation example. Nonlinear characteristic without the linear term

The linear dynamic system (4.45) and a nonlinear element defined by the function
\[
f\big(s(n)\big) = \sqrt{\sqrt{s(n)} + 0.5}, \quad s(n) \ge 0 \tag{4.48}
\]
were used in the example of a Wiener system without the linear term and with a nonzero second order term. The inverse nonlinear function is a polynomial:
\[
f^{-1}\big(y(n)\big) = 0.25 - y^2(n) + y^4(n), \quad y(n) \ge \sqrt{0.5}. \tag{4.49}
\]
The input sequence $\{u(n)\}$ contained 50000 pseudo-random numbers uniformly distributed in $(1.5, 6)$. Additive system disturbances were $\varepsilon(n) = [1/A(q^{-1})]\epsilon(n)$ with $\{\epsilon(n)\}$ a normally distributed pseudo-random sequence $N(0, 0.025)$. As in the previous example, parameter estimation was performed using the LS method and the IV method and assuming $r = 4$, $\hat{\gamma}_1 = \hat{\gamma}_3 = 0$, $\hat{\gamma}_2 = 1$. The identification results, given in Tables 4.3 and 4.4 and illustrated in Figs 4.7 and 4.8, confirm the practical feasibility of the proposed approach.

Table 4.3. Parameter estimates, SNR = 19.37

Parameter    True         LS             IV
             σε = 0       σε = 0.025     σε = 0.025
a1           −1.3231      −1.4448        −1.2898
a2            0.4346       0.6233         0.4107
b1            0.0635       0.0014         0.0635
b2            0.0481       0.0011         0.0481
γ0            0.2500       1.2119         0.2187
γ2            1.0000       1.0000         1.0000
γ4            1.0000       0.2212         1.0005
Fig. 4.7. Wiener system without the linear term. True $f^{-1}\big(y(n)\big)$ and estimated $\hat{f}^{-1}\big(y(n)\big)$ inverse nonlinear functions (curves: True, LS, IV; horizontal axis: $y(n)$)

Fig. 4.8. Wiener system without the linear term. Estimation error $\hat{f}^{-1}\big(y(n)\big) - f^{-1}\big(y(n)\big)$ (curves: LS, IV; horizontal axis: $y(n)$)
Table 4.4. Comparison of estimation accuracy

Performance index                                                                                    LS (σε = 0)     LS             IV
$\frac{1}{4}\sum_{j=1}^{2}\big[(\hat{a}_j - a_j)^2 + (\hat{b}_j - b_j)^2\big]$                        3.85 × 10^−16   1.41 × 10^−2   4.20 × 10^−4
$\frac{1}{2}\big[(\hat{\gamma}_0 - \gamma_0)^2 + (\hat{\gamma}_4 - \gamma_4)^2\big]$                  1.24 × 10^−13   7.66 × 10^−1   4.92 × 10^−4
$\frac{1}{50}\sum_{i=1}^{50}\big[\hat{f}^{-1}\big(y(n)\big) - f^{-1}\big(y(n)\big)\big]^2$            5.80 × 10^−12   6.16 × 10^1    7.58 × 10^−4
Although only one technique for instrumental variables generation is discussed and illustrated here, other known techniques can be considered as well. Contrary to the identification of the inverse Wiener model, an attractive feature of this approach is that the linear sub-system is not required to be minimum phase.
4.2 Identification of Wiener systems with the prediction error method

In the linear regression approach, described in Section 4.1, the class of identified systems is restricted by the assumption of invertibility of the nonlinear characteristic. This assumption is not necessary in the recursive prediction error method of Wigren [165], in which the nonlinear characteristic is approximated with a piecewise linear function. This section presents an identification algorithm for Wiener systems which uses the recursive prediction error (RPE) approach with a polynomial model of the nonlinear element and a pulse transfer function model of the linear dynamic system, see Fig. 4.9.

4.2.1 Polynomial Wiener model

Consider a discrete-time Wiener system (Fig. 4.9) composed of a SISO linear dynamic system in a cascade with a SISO nonlinear element. The output $y(n)$ of the Wiener system at the time $n$ is
\[
y(n) = f\big(s(n)\big) + \varepsilon(n), \tag{4.50}
\]
where $f(\cdot)$ is the steady state characteristic, $\varepsilon(n)$ is the additive output disturbance, and $s(n)$ is the output of the linear dynamic system:
\[
s(n) = \frac{B(q^{-1})}{A(q^{-1})}u(n) \tag{4.51}
\]
Fig. 4.9. Wiener system and its polynomial model
with
\[
A(q^{-1}) = 1 + a_1 q^{-1} + \cdots + a_{na} q^{-na}, \tag{4.52}
\]
\[
B(q^{-1}) = b_1 q^{-1} + \cdots + b_{nb} q^{-nb}, \tag{4.53}
\]
where $a_1,\ldots,a_{na}$, $b_1,\ldots,b_{nb}$ are the parameters of the linear dynamic system. Assume that the linear dynamic system is causal and asymptotically stable, and $f(\cdot)$ is a continuous function. Assume also that the polynomials $A(q^{-1})$ and $B(q^{-1})$ are coprime and $u(n)$ has finite moments and is independent of $\varepsilon(k)$ for all $n$ and $k$. The steady state characteristic of the system can be approximated by a polynomial $\hat{f}(\cdot)$ of the order $r$:
\[
\hat{f}\big(\hat{s}(n)\big) = \hat{\mu}_0 + \hat{\mu}_1\hat{s}(n) + \hat{\mu}_2\hat{s}^2(n) + \cdots + \hat{\mu}_r\hat{s}^r(n), \tag{4.54}
\]
where $\hat{s}(n)$ is the output of the linear dynamic system model
\[
\hat{s}(n) = \frac{\hat{B}(q^{-1})}{\hat{A}(q^{-1})}u(n) \tag{4.55}
\]
with
\[
\hat{A}(q^{-1}) = 1 + \hat{a}_1 q^{-1} + \cdots + \hat{a}_{na} q^{-na}, \tag{4.56}
\]
\[
\hat{B}(q^{-1}) = \hat{b}_1 q^{-1} + \cdots + \hat{b}_{nb} q^{-nb}, \tag{4.57}
\]
where $\hat{a}_1,\ldots,\hat{a}_{na}$, $\hat{b}_1,\ldots,\hat{b}_{nb}$ are the parameters of the linear dynamic model. Therefore, the parameter vector $\hat{\theta}$ of the model defined by (4.54) and (4.55) is
\[
\hat{\theta} = \big[\hat{a}_1 \ldots \hat{a}_{na} \;\; \hat{b}_1 \ldots \hat{b}_{nb} \;\; \hat{\mu}_0 \;\; \hat{\mu}_1 \ldots \hat{\mu}_r\big]^T. \tag{4.58}
\]
4.2.2 Recursive prediction error method

The identification problem considered here can be formulated as follows: Given a set of input and output data $Z^N = \{(u(n), y(n)),\; n = 1,\ldots,N\}$, estimate the parameters of the Wiener system so that the predictions $\hat{y}(n|n-1)$ of the system output are close to the system output $y(n)$ in the sense of the following mean square error criterion:
\[
J_N(\hat{\theta}, Z^N) = \frac{1}{2N}\sum_{n=1}^{N} \big(y(n) - \hat{y}(n|n-1)\big)^2. \tag{4.59}
\]
Given the gradient of the model output w.r.t. the parameter vector
\[
\psi(n) = \left[\frac{d\hat{y}(n|n-1)}{d\hat{\theta}}\right]^T, \tag{4.60}
\]
the RPE algorithm can be expressed by (2.97) – (2.99). In practice, it is useful to modify the criterion (4.59) with an exponential forgetting factor $\lambda$. The forgetting factor satisfies $\lambda \in [0, 1]$, and values close to 1 are commonly selected. To protect the algorithm from the so-called covariance blow-up phenomenon, other modifications of the algorithm may be useful that impose an upper bound on the eigenvalues of the matrix $P(n)$ and are known as the constant trace and exponential forgetting and resetting algorithms [127].

4.2.3 Gradient calculation

The gradient $\psi(n)$ of the model output w.r.t. the model parameters is defined as
\[
\psi(n) = \left[\frac{\partial\hat{y}(n)}{\partial\hat{a}_1} \ldots \frac{\partial\hat{y}(n)}{\partial\hat{a}_{na}} \;\; \frac{\partial\hat{y}(n)}{\partial\hat{b}_1} \ldots \frac{\partial\hat{y}(n)}{\partial\hat{b}_{nb}} \;\; \frac{\partial\hat{y}(n)}{\partial\hat{\mu}_0} \;\; \frac{\partial\hat{y}(n)}{\partial\hat{\mu}_1} \ldots \frac{\partial\hat{y}(n)}{\partial\hat{\mu}_r}\right]^T. \tag{4.61}
\]
Although the model given by (4.54) and (4.55) is a recurrent one due to the difference equation (4.55), the calculation of the gradient does not require much more computation than in the case of the pure static model. The only difference is in the calculation of partial derivatives of the model output w.r.t. the parameters of the linear dynamic model. This can be done with the sensitivity method solving by simulation the following set of linear difference equations [73, 74]:
$$\frac{\partial\hat s(n)}{\partial\hat a_k} = -\hat s(n-k) - \sum_{m=1}^{na}\hat a_m\frac{\partial\hat s(n-m)}{\partial\hat a_k}, \quad k = 1, \dots, na, \tag{4.62}$$
$$\frac{\partial\hat s(n)}{\partial\hat b_k} = u(n-k) - \sum_{m=1}^{na}\hat a_m\frac{\partial\hat s(n-m)}{\partial\hat b_k}, \quad k = 1, \dots, nb. \tag{4.63}$$
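Hence, partial derivatives of the model output w.r.t. the parameters $\hat a_k$ and $\hat b_k$ can be calculated as
$$\frac{\partial\hat y(n)}{\partial\hat a_k} = \frac{\partial\hat y(n)}{\partial\hat s(n)}\frac{\partial\hat s(n)}{\partial\hat a_k}, \tag{4.64}$$
$$\frac{\partial\hat y(n)}{\partial\hat b_k} = \frac{\partial\hat y(n)}{\partial\hat s(n)}\frac{\partial\hat s(n)}{\partial\hat b_k}, \tag{4.65}$$
where
$$\frac{\partial\hat y(n)}{\partial\hat s(n)} = \hat\mu_1 + 2\hat\mu_2\hat s(n) + \dots + r\hat\mu_r\hat s^{r-1}(n). \tag{4.66}$$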
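The sensitivity recursions (4.62) and (4.63), combined with the chain rule (4.64) to (4.66), can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `wiener_output_and_gradient` and its list-based interface are hypothetical, zero initial conditions are assumed, and the parameters are held fixed over the whole record (the time-invariance assumption under which the recursions are derived). The last block of each gradient row uses the fact that the derivative of a polynomial w.r.t. its coefficient $\hat\mu_j$ is $\hat s^j(n)$.

```python
def wiener_output_and_gradient(u, a, b, mu):
    """Simulate a polynomial Wiener model and the gradient psi(n) of its output.

    a[k-1] = a_k, b[k-1] = b_k (linear part), mu[j] = mu_j (polynomial).
    Zero initial conditions are assumed (an assumption of this sketch).
    Each row of psi is ordered as [d/da_1..d/da_na, d/db_1..d/db_nb, d/dmu_0..d/dmu_r].
    """
    na, nb, N = len(a), len(b), len(u)
    s = [0.0] * N
    ds_da = [[0.0] * na for _ in range(N)]   # ds(n)/da_k
    ds_db = [[0.0] * nb for _ in range(N)]   # ds(n)/db_k
    zero_a, zero_b = [0.0] * na, [0.0] * nb
    y_hat, psi = [], []

    def past(seq, n, lag, default):
        # Value of seq at time n - lag, or the default before the record starts.
        return seq[n - lag] if n - lag >= 0 else default

    for n in range(N):
        # Linear dynamic part: s(n) = -sum a_m s(n-m) + sum b_m u(n-m)
        s[n] = (-sum(a[m] * past(s, n, m + 1, 0.0) for m in range(na))
                + sum(b[m] * past(u, n, m + 1, 0.0) for m in range(nb)))
        for k in range(na):      # recursion (4.62)
            ds_da[n][k] = (-past(s, n, k + 1, 0.0)
                           - sum(a[m] * past(ds_da, n, m + 1, zero_a)[k]
                                 for m in range(na)))
        for k in range(nb):      # recursion (4.63)
            ds_db[n][k] = (past(u, n, k + 1, 0.0)
                           - sum(a[m] * past(ds_db, n, m + 1, zero_b)[k]
                                 for m in range(na)))
        # Slope of the polynomial at s(n), as in (4.66)
        dy_ds = sum(j * mu[j] * s[n] ** (j - 1) for j in range(1, len(mu)))
        y_hat.append(sum(mu[j] * s[n] ** j for j in range(len(mu))))
        psi.append([dy_ds * d for d in ds_da[n]]      # (4.64)
                   + [dy_ds * d for d in ds_db[n]]    # (4.65)
                   + [s[n] ** j for j in range(len(mu))])
    return y_hat, psi
```

In a recursive (sample-by-sample) implementation the parameters change between samples, so the same recursions would only give the approximate gradient discussed in the text.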
Partial derivatives of the model output w.r.t. the parameters of the nonlinear element model are
$$\frac{\partial\hat y(n)}{\partial\hat\mu_j} = \hat s^j(n), \quad j = 0, 1, \dots, r. \tag{4.67}$$
Note that the derivation of the partial derivatives (4.62) and (4.63) is made under the assumption that the parameters of the linear dynamic model are time invariant. As this assumption is not true because of the sequential nature of the RPE algorithm, an approximate gradient is obtained. A more accurate evaluation of the gradient can be calculated using the truncated BPTT algorithm.

4.2.4 Pneumatic valve simulation example

The model of a pneumatic valve (2.100) – (2.101) was used in the simulation example. It was assumed that the system output $y(n)$ was additively disturbed by zero-mean discrete white Gaussian noise $\varepsilon(n)$ with the standard deviation $\sigma_\varepsilon = 0.005$ and $\sigma_\varepsilon = 0.05$:
$$y(n) = f\big(s(n)\big) + \varepsilon(n). \tag{4.68}$$
A sequence of 20000 pseudorandom numbers, uniformly distributed in (0, 1), was used as the system input. Based on the simulated input-output data, the Wiener system was identified using the RPE algorithm. To compare estimation accuracy of the linear system parameters and the nonlinear function f(·), the indices (2.91) and (2.92) were used with {s(n)} defined as a sequence of 100 equally spaced values between min s(n) and max s(n). The results shown in Tables 4.5 and 4.6 are illustrated in Figs. 4.10 – 4.13.
In the example, the nonlinear characteristic is of an infinite order while a finite order model is estimated. In spite of the fact that the estimated parameters are different from the parameters of the polynomial approximating (2.101), the nonlinear characteristic is approximated quite well showing practical applicability of this approach.
Table 4.5. Parameter estimates

Parameter   Approximating polynomial   Estimated (σε = 0.005)   Estimated (σε = 0.05)
a1          −1.4138                    −1.4167                  −1.4159
a2          0.6065                     0.6088                   0.6092
b1          0.1044                     0.1059                   0.0984
b2          0.0833                     0.0812                   0.0899
µ0          0.0010                     0.0054                   0.0595
µ1          0.9530                     1.0581                   0.6871
µ2          0.8149                     −1.1304                  −0.5716
µ3          −11.651                    −0.2255                  0.1347
µ4          34.749                     0.8751                   −0.0036
µ5          −57.593                    0.2007                   −0.0121
µ6          59.242                     −0.3143                  −0.0045
µ7          −37.639                    −0.3218                  −0.0011
µ8          13.5602                    −0.0583                  −0.0002
µ9          −2.1210                    0.2263                   −0.0000
Table 4.6. Comparison of estimation accuracy, where s(j) is a pseudorandom sequence, uniformly distributed in (min s(n), max s(n))

Performance index                                                      σε = 0.005     σε = 0.05
$\frac14\sum_{j=1}^{2}\big[(\hat a_j - a_j)^2 + (\hat b_j - b_j)^2\big]$      2.95 × 10^−6   2.44 × 10^−4
$\frac{1}{100}\sum_{j=1}^{100}\big[\hat f(s(j)) - f(s(j))\big]^2$             5.38 × 10^−5   1.87 × 10^−3
Fig. 4.10. True and estimated nonlinear functions (σε = 0.005). [Plot of f(s(n)) and f̂(s(n)) versus s(n).]
Fig. 4.11. Autocorrelation function of residuals and the 95% confidence interval (σε = 0.005). [Plot of the normalized autocorrelation function of residuals versus lag m.]
Fig. 4.12. Evolution of linear dynamic system parameter estimates (σε = 0.005). [Plot of â1, â2, b̂1 and b̂2 versus n.]
Fig. 4.13. Evolution of nonlinear element parameter estimates (σε = 0.005). [Plot of µ̂0, ..., µ̂9 versus n.]
4.3 Pseudolinear regression method

The pseudolinear regression approach to parameter estimation is based on the assumption that nonlinear components of a model can be neglected and the model can be treated as a linear-in-parameters one. For polynomial Wiener systems, we estimate the parameters of a pulse transfer function and the parameters of the nonlinear characteristic. It is assumed that the nonlinear characteristic contains the linear term. An obvious advantage of the pseudolinear regression approach, in comparison with other regression methods, is its low computational complexity and its applicability to the identification of Wiener systems with noninvertible nonlinear characteristics. The pseudolinear regression method can be considered as a simplified prediction error approach, in which the exact gradient is replaced with an approximate one. Such a simplification reduces the computational complexity of the method but may deteriorate its convergence rate.

4.3.1 Pseudolinear-in-parameters polynomial Wiener model

Consider the Wiener system (4.50) – (4.53) and its polynomial model (4.54) – (4.57). To derive the pseudolinear regression method, it is necessary to assume that the polynomial model $\hat f(\hat s(n))$ has a nonzero linear term, i.e., $\hat\mu_1 \neq 0$. For convenience, we can assume that $\hat\mu_1 = 1$. There is no loss of generality in this assumption, as the steady state gain of the linear dynamic model can be multiplied by $1/\hat\mu_1$. The polynomial model output can be written as
$$\hat y(n) = -\hat a_1\hat s(n-1) - \dots - \hat a_{na}\hat s(n-na) + \hat b_1u(n-1) + \dots + \hat b_{nb}u(n-nb) + \hat\mu_0 + \hat\mu_2\hat s^2(n) + \dots + \hat\mu_r\hat s^r(n) = \hat\theta^T\psi(n), \tag{4.69}$$
where
$$\hat\theta = \big[\hat\theta_l^T \;\; \hat\mu_0 \;\; \hat\mu_2 \,\dots\, \hat\mu_r\big]^T, \tag{4.70}$$
$$\hat\theta_l = \big[\hat a_1 \,\dots\, \hat a_{na} \;\; \hat b_1 \,\dots\, \hat b_{nb}\big]^T, \tag{4.71}$$
$$\psi(n) = \big[\varphi^T(n) \;\; 1 \;\; \hat s^2(n) \,\dots\, \hat s^r(n)\big]^T, \tag{4.72}$$
$$\varphi(n) = \big[-\hat s(n-1) \,\dots\, -\hat s(n-na) \;\; u(n-1) \,\dots\, u(n-nb)\big]^T. \tag{4.73}$$
The model (4.69) is a linear function of the parameters $\hat\mu_0, \hat\mu_2, \dots, \hat\mu_r$, but it is a nonlinear function of $\hat a_1, \dots, \hat a_{na}, \hat b_1, \dots, \hat b_{nb}$. This comes from the fact that both $\hat s^2(n), \dots, \hat s^r(n)$ and $\hat s(n-1), \dots, \hat s(n-na)$ depend on $\hat\theta_l$.
4.3.2 Pseudolinear regression identification method

Minimization with respect to model parameters of the weighted cost function
$$J = \frac12\sum_{j=1}^{N}\lambda^{N-j}\big(y(j) - \hat y(j)\big)^2 \tag{4.74}$$
results in the following identification algorithm:
$$\hat\theta(n) = \hat\theta(n-1) + K(n)e(n), \tag{4.75}$$
$$e(n) = y(n) - \psi^T(n)\hat\theta(n-1), \tag{4.76}$$
$$K(n) = P(n)\psi(n) = \frac{P(n-1)\psi(n)}{\lambda + \psi^T(n)P(n-1)\psi(n)}, \tag{4.77}$$
$$P(n) = \frac{1}{\lambda}\big[P(n-1) - K(n)\psi^T(n)P(n-1)\big], \tag{4.78}$$
where $\lambda$ denotes the exponential forgetting factor.

4.3.3 Simulation example

The Wiener system described by the second order pulse transfer function
$$\frac{B(q^{-1})}{A(q^{-1})} = \frac{0.125q^{-1} - 0.025q^{-2}}{1 - 1.75q^{-1} + 0.85q^{-2}} \tag{4.79}$$
and the nonlinear characteristic (Fig. 4.16)
$$f\big(s(n)\big) = -0.25 + s(n) + s^2(n) - 0.5s^3(n) - 0.2s^4(n) + 0.1s^5(n) \tag{4.80}$$
was used in the numerical example. The system was excited with a sequence of 10000 pseudo-random numbers of uniform distribution in $(-1, 1)$. The system output was disturbed additively with another pseudo-random sequence of uniform distribution in $(-\sqrt{3\alpha}, \sqrt{3\alpha})$, i.e., with noise of variance $\alpha$. The identification results, obtained at $\alpha = 3.34\times10^{-5}$, $3.34\times10^{-3}$, $3.34\times10^{-1}$ and the forgetting factor $\lambda = 0.9995$, are summarized in Tables 4.7 and 4.8 and illustrated in Figs. 4.14 – 4.17.
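The recursion (4.75) to (4.78) can be sketched as a single update function. The sketch below is illustrative only: the name `rls_step`, the list-of-lists matrix layout and the toy regressors are assumptions, not part of the original text. In the pseudolinear regression method the vector $\psi(n)$ would be assembled from $\hat s(n)$ and $u(n)$ according to (4.72) and (4.73); here a plain linear regression is used so that the update itself can be checked in isolation.

```python
import math


def rls_step(theta, P, psi, y, lam=0.9995):
    """One update of (4.75)-(4.78) with exponential forgetting factor lam.

    theta and psi are lists of equal length; P is a symmetric matrix stored
    as a list of rows. Returns the new theta, the new P and the error e(n).
    """
    d = len(theta)
    P_psi = [sum(P[i][j] * psi[j] for j in range(d)) for i in range(d)]
    denom = lam + sum(psi[i] * P_psi[i] for i in range(d))
    K = [v / denom for v in P_psi]                        # gain, (4.77)
    e = y - sum(psi[i] * theta[i] for i in range(d))      # error, (4.76)
    theta_new = [theta[i] + K[i] * e for i in range(d)]   # update, (4.75)
    # (4.78); psi^T P(n-1) equals the transpose of P_psi because P is symmetric
    P_new = [[(P[i][j] - K[i] * P_psi[j]) / lam for j in range(d)]
             for i in range(d)]
    return theta_new, P_new, e


# Toy usage (hypothetical data): recover y = 2*x1 - 0.5*x2 from noise-free samples.
theta = [0.0, 0.0]
P = [[1e4, 0.0], [0.0, 1e4]]          # large initial P expresses low confidence
for n in range(200):
    psi = [math.sin(0.3 * n), math.cos(0.7 * n)]
    y = 2.0 * psi[0] - 0.5 * psi[1]
    theta, P, e = rls_step(theta, P, psi, y)
```

The large initial matrix P is a common heuristic choice, not a value taken from the example above.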
Fig. 4.14. Evolution of nonlinear element parameter estimates (α = 3.34 × 10^−3). [Panels labelled γ̂0, γ̂2, γ̂3, γ̂4 and γ̂5, each plotted versus n.]
Fig. 4.15. Evolution of linear dynamic system parameter estimates (α = 3.34 × 10^−3). [Panels for â1, â2, b̂1 and b̂2, each plotted versus n.]
Fig. 4.16. Nonlinear element characteristic. [Plot of f versus s(n).]
Fig. 4.17. Autocorrelation function of residuals and the 95% confidence interval (α = 3.3403 × 10^−3). [Plot of Re(m) versus lag m.]
Table 4.7. Parameter estimates

Parameter   True value   Estimated (α = 3.34 × 10^−5)   Estimated (α = 3.34 × 10^−3)   Estimated (α = 3.34 × 10^−1)
a1          −1.7500      −1.7504                        −1.7500                        −1.7327
a2          0.8500       0.8504                         0.8501                         0.8374
b1          0.1250       0.1249                         0.1242                         0.1199
b2          −0.0250      −0.0251                        −0.0245                        −0.0124
µ0          −0.2500      −0.2499                        −0.2500                        −0.2454
µ2          1.0000       1.0003                         1.0053                         1.0035
µ3          −0.5000      −0.4992                        −0.4913                        −0.5493
µ4          −0.2000      −0.2007                        −0.2040                        −0.2834
µ5          0.1000       0.0991                         0.0912                         0.1493
Table 4.8. Comparison of estimation accuracy

Index                                          α = 3.34 × 10^−5   α = 3.34 × 10^−3   α = 3.34 × 10^−1
$\frac1N\sum_{n=1}^{N}e^2(n)$                  3.3656 × 10^−6     3.3408 × 10^−3     3.3441 × 10^−1
$\sum_{j=0}^{5}(\hat\mu_j - \mu_j)^2$          2.3762 × 10^−6     1.9759 × 10^−4     1.2501 × 10^−2
4.4 Summary

In this chapter, it has been shown that the linear-in-parameters definition of the modified equation error makes it possible to use the linear regression approach to estimate the parameters of Wiener systems with an invertible nonlinear element. As such an approach results in inconsistent parameter estimates, a combined least squares-instrumental variables method has been proposed to overcome this problem. Contrary to linear regression approaches, prediction error methods make it possible to identify Wiener systems with both invertible and noninvertible nonlinear characteristics. Moreover, there is no problem of parameter redundancy, as the number of the estimated parameters is equal to the total number of the model parameters na + nb + r + 1. In comparison with the prediction error method, the pseudolinear regression approach has lower computational requirements as it uses an approximate gradient instead of the true one.
5 Polynomial Hammerstein models
In this chapter, we will review seven methods of the identification of Hammerstein systems which use Hammerstein models with a polynomial model of the nonlinear element. The reviewed methods use models of the linear dynamic system in the form of pulse transfer models [21, 27, 34, 63, 120, 161], or a Laguerre expansion of the impulse response [156]. The parameters of Hammerstein models are estimated with different procedures such as the ordinary least squares [27], iterative least squares [63, 120, 161], iterative correlation and steepest descent [156], prediction error [34], and pseudolinear regression [21] methods.
5.1 Noniterative least squares identification of Hammerstein systems

One of the best known identification methods for Hammerstein systems is based on the transformation of a nonlinear-in-parameters SISO identification problem into a linear-in-parameters MISO one and the application of the least squares parameter estimation algorithm. This method was originally proposed by Chang and Luus [27]. A short description of the method is given below. First, we introduce a parallel polynomial Hammerstein model, i.e., a model with the linear part defined by the transfer function $\hat B(q^{-1})/\hat A(q^{-1})$:
$$\hat y(n) = \frac{\hat B(q^{-1})}{\hat A(q^{-1})}\hat v(n), \tag{5.1}$$
$$\hat B(q^{-1}) = \hat b_1q^{-1} + \dots + \hat b_{nb}q^{-nb}, \tag{5.2}$$
$$\hat A(q^{-1}) = 1 + \hat a_1q^{-1} + \dots + \hat a_{na}q^{-na}, \tag{5.3}$$
and the nonlinear part defined by a polynomial model:
$$\hat v(n) = \hat f\big(u(n)\big) = \hat\mu_1u(n) + \hat\mu_2u^2(n) + \dots + \hat\mu_ru^r(n). \tag{5.4}$$

A. Janczak: Identification of Nonlinear Systems, LNCIS 310, pp. 143–157, 2005. © Springer-Verlag Berlin Heidelberg 2005
Without loss of generality, $\hat\mu_1$ can be assumed to be unity, and $\hat\mu_2, \dots, \hat\mu_r$ can be normalized accordingly. Then (5.1) can be expressed as
$$\hat y(n) = \frac{\hat B(q^{-1})}{\hat A(q^{-1})}u(n) + \sum_{k=2}^{r}\frac{\hat B(q^{-1})}{\hat A(q^{-1})}\hat\mu_ku^k(n). \tag{5.5}$$
Introducing $\hat B_1(q^{-1}) = \hat B(q^{-1}), \dots, \hat B_k(q^{-1}) = \hat\mu_k\hat B(q^{-1})$ and $u_1(n) = u(n), \dots, u_k(n) = u^k(n)$, (5.5) can be rewritten as
$$\hat y(n) = \sum_{k=1}^{r}\frac{\hat B_k(q^{-1})}{\hat A(q^{-1})}u_k(n). \tag{5.6}$$
The MISO Hammerstein model (Fig. 5.1) consists of r linear dynamic sub-models with a common denominator in each branch. Equivalently, (5.6) can be written as
$$\hat y(n) = \big[1 - \hat A(q^{-1})\big]\hat y(n) + \sum_{k=1}^{r}\hat B_k(q^{-1})u_k(n). \tag{5.7}$$
By using the definitions of $\hat A(q^{-1})$ and $\hat B_k(q^{-1})$, we have
$$\hat y(n) = -\sum_{m=1}^{na}\hat a_m\hat y(n-m) + \sum_{k=1}^{r}\sum_{m=1}^{nb}\hat b_{km}u_k(n-m), \tag{5.8}$$
where $\hat b_{km}$ is the mth parameter of the polynomial $\hat B_k(q^{-1})$. The parallel model (5.8) can be transformed into a linear-in-parameters series-parallel model using the available delayed system outputs $y(n-m)$ instead of the model outputs $\hat y(n-m)$:
$$\hat y(n) = -\sum_{m=1}^{na}\hat a_my(n-m) + \sum_{k=1}^{r}\sum_{m=1}^{nb}\hat b_{km}u_k(n-m). \tag{5.9}$$
Introducing the parameter vector $\hat\theta$ and the regression vector $x(n)$:
$$\hat\theta = \big[\hat a_1 \,\dots\, \hat a_{na} \;\; \hat b_{11} \,\dots\, \hat b_{1nb} \,\dots\, \hat b_{r1} \,\dots\, \hat b_{rnb}\big]^T, \tag{5.10}$$
$$x(n) = \big[-y(n-1) \,\dots\, -y(n-na) \;\; u_1(n-1) \,\dots\, u_1(n-nb) \,\dots\, u_r(n-1) \,\dots\, u_r(n-nb)\big]^T, \tag{5.11}$$
we have
$$\hat y(n) = x^T(n)\hat\theta. \tag{5.12}$$
The minimization of the sum of N squared errors between the system output and the model output w.r.t. the parameter vector $\hat\theta$ gives
$$\hat\theta = (X^TX)^{-1}X^Ty, \tag{5.13}$$
where
$$X = \big[x(1) \,\dots\, x(N)\big]^T, \tag{5.14}$$
$$y = \big[y(1) \,\dots\, y(N)\big]^T. \tag{5.15}$$

Fig. 5.1. Hammerstein model transformed into the MISO form. [Block diagram: the inputs u1(n), u2(n), ..., ur(n) drive the branches B̂1(q^−1)/Â(q^−1), B̂2(q^−1)/Â(q^−1), ..., B̂r(q^−1)/Â(q^−1), which together produce the output ŷ(n).]
In this way, the parameters of the transformed MISO model (5.9) can be calculated, but our primary goal is to determine the parameters of the model defined by (5.1) and (5.4). The parameters $\hat a_m$ are elements of the vector $\hat\theta$. Also, the parameters $\hat b_m$ are available directly as $\hat b_m = \hat b_{1m}$. The parameters $\hat\mu_k$ can be calculated from the remaining elements of $\hat\theta$, but there is a certain amount of redundancy as
$$\hat\mu_k = \frac{\hat b_{km}}{\hat b_{1m}}, \quad k = 2, \dots, r, \quad m = 1, \dots, nb, \tag{5.16}$$
and nb different sets of parameters $\hat\mu$ can be calculated. Chang and Luus [27] suggested computing the root mean-square (RMS) error of the model output for all nb sets of $\hat\mu$ and accepting the set that yields the least RMS value, as a more reliable approach than computing the mean of the nb values.
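The noniterative procedure (5.9) to (5.16) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Chang and Luus: the helper names `solve` and `chang_luus` are hypothetical, the normal equations are solved by plain Gaussian elimination (adequate only for small, well-conditioned problems), and the final RMS-based choice among the nb candidate sets is left out.

```python
import random


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def chang_luus(u, y, na, nb, r):
    """Least squares estimate (5.13) of the series-parallel MISO model (5.9)."""
    X, Y = [], []
    for n in range(max(na, nb), len(y)):
        row = [-y[n - m] for m in range(1, na + 1)]           # -y(n-m)
        for k in range(1, r + 1):
            row += [u[n - m] ** k for m in range(1, nb + 1)]  # u_k(n-m) = u^k(n-m)
        X.append(row)
        Y.append(y[n])
    d = na + r * nb
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
    Xty = [sum(x[i] * t for x, t in zip(X, Y)) for i in range(d)]
    theta = solve(XtX, Xty)
    a_hat = theta[:na]
    b_km = [theta[na + k * nb: na + (k + 1) * nb] for k in range(r)]
    b_hat = b_km[0]                                           # b_m = b_1m
    # (5.16): one candidate set of mu_k for every lag m (the redundancy)
    mu_sets = [[b_km[k][m] / b_km[0][m] for k in range(r)] for m in range(nb)]
    return a_hat, b_hat, mu_sets


# Hypothetical noise-free test system: a_1 = -0.6, b = (0.5, 0.2), mu = (1, 0.5)
rng = random.Random(0)
u = [rng.uniform(-1.0, 1.0) for _ in range(300)]
v = [x + 0.5 * x * x for x in u]
y = [0.0, 0.0]
for n in range(2, len(u)):
    y.append(0.6 * y[n - 1] + 0.5 * v[n - 1] + 0.2 * v[n - 2])
a_hat, b_hat, mu_sets = chang_luus(u, y, na=1, nb=2, r=2)
```

With noise-free data all nb candidate sets in `mu_sets` coincide; with noisy data they differ, which is the redundancy discussed above.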
5.2 Iterative least squares identification of Hammerstein systems

In the iterative least squares method of Narendra and Gallman [120], the parameters of the linear dynamic system and the nonlinear static element are updated separately and sequentially. This method utilizes an alternate adjustment of the parameters of the linear dynamic system and the nonlinear static element to minimize the sum of squared errors. The error $e(n)$ at the time n is
$$e(n) = y(n) - \Big[-\sum_{m=1}^{na}\hat a_my(n-m) + \sum_{m=1}^{nb}\hat b_m\hat v(n-m)\Big], \tag{5.17}$$
where
$$\hat v(n) = \sum_{k=1}^{r}\hat\mu_ku^k(n). \tag{5.18}$$
Introducing $\hat\theta = \big[\hat\theta_a^T \;\; \hat\theta_b^T\big]^T$ and $x(n) = \big[x_a^T(n) \;\; x_b^T(n)\big]^T$, where
$$\hat\theta_a = \big[\hat a_1 \,\dots\, \hat a_{na}\big]^T, \tag{5.19}$$
$$\hat\theta_b = \big[\hat b_1 \,\dots\, \hat b_{nb}\big]^T, \tag{5.20}$$
$$x_a(n) = \big[-y(n-1) \,\dots\, -y(n-na)\big]^T, \tag{5.21}$$
$$x_b(n) = \big[\hat v(n-1) \,\dots\, \hat v(n-nb)\big]^T, \tag{5.22}$$
(5.17) can be written as
$$e(n) = y(n) - x^T(n)\hat\theta = y(n) - x_a^T(n)\hat\theta_a - x_b^T(n)\hat\theta_b. \tag{5.23}$$
Now, introducing the vectors $\hat\mu = \big[\hat\mu_1 \,\dots\, \hat\mu_r\big]^T$ and $\mathbf{u}(n-1) = \big[u(n-1) \;\; u^2(n-1) \,\dots\, u^r(n-1)\big]^T$, the matrix $U(n)$ can be defined:
$$U(n) = \begin{bmatrix}\mathbf{u}^T(n-1)\\ \vdots\\ \mathbf{u}^T(n-nb)\end{bmatrix}. \tag{5.24}$$
From (5.18), it follows that the vector $x_b(n)$ can be written as
$$x_b(n) = U(n)\hat\mu, \tag{5.25}$$
and (5.23) becomes
$$e(n) = y(n) - x_a^T(n)\hat\theta_a - \big[U(n)\hat\mu\big]^T\hat\theta_b. \tag{5.26}$$
In this iterative approach, parameter estimation is performed according to the following iterative procedure:

1. Based on N measurements of the system input and output signals and the assumed initial vector $\hat\mu^{(1)}$, the parameter vector $\hat\theta^{(1)} = \big[\hat\theta_a^{(1)T} \;\; \hat\theta_b^{(1)T}\big]^T$ of the model (5.23) is estimated:
$$\hat\theta^{(1)} = (X^TX)^{-1}X^Ty, \tag{5.27}$$
where
$$y = \big[y(1) \,\dots\, y(N)\big]^T, \tag{5.28}$$
$$X = \big[x(1) \,\dots\, x(N)\big]^T. \tag{5.29}$$

2. With $\big[\hat\theta_a^{(1)T} \;\; \hat\theta_b^{(1)T}\big]^T$, $\hat\mu^{(2)}$ is calculated by minimizing the sum of squared errors for the model (5.26):
$$\hat\mu^{(2)} = (\mathbf{U}^T\mathbf{U})^{-1}\mathbf{U}^T\big(y - X_a\hat\theta_a^{(1)}\big), \tag{5.30}$$
where
$$X_a = \big[x_a(1) \,\dots\, x_a(N)\big]^T, \tag{5.31}$$
$$\mathbf{U} = \big[U^T(1)\hat\theta_b^{(1)} \,\dots\, U^T(N)\hat\theta_b^{(1)}\big]^T. \tag{5.32}$$

3. Using $\hat\mu^{(2)}$, $\hat\theta^{(2)} = \big[\hat\theta_a^{(2)T} \;\; \hat\theta_b^{(2)T}\big]^T$ is calculated according to the scheme used in Step 1, and the process is continued.
An obvious advantage of the iterative method is nonredundant model parameterization. The simulation study performed by Gallman [41] shows that such an iterative procedure gives significantly lower parameter variance and a slightly lower RMS error than the noniterative method of Chang and Luus. The iterative method applied to a variety of problems such as polynomial nonlinearity, a half-wave linear detector or saturating nonlinearity has proven to be very successful [120]. In spite of the reported successful experience, the method may not converge in some cases.
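To see why the alternation between Step 1 and Step 2 reduces to two ordinary least squares problems, it may help to write (5.24) to (5.26) out for a small case. The following worked instance assumes nb = 2 and r = 2; these orders are chosen only for illustration.

```latex
% Worked instance of (5.24)-(5.26), assuming nb = 2 and r = 2
U(n) = \begin{bmatrix} u(n-1) & u^2(n-1) \\ u(n-2) & u^2(n-2) \end{bmatrix},
\qquad
x_b(n) = U(n)\hat\mu
       = \begin{bmatrix} \hat\mu_1 u(n-1) + \hat\mu_2 u^2(n-1) \\
                         \hat\mu_1 u(n-2) + \hat\mu_2 u^2(n-2) \end{bmatrix},

\big[U(n)\hat\mu\big]^T \hat\theta_b
  = \hat b_1\big(\hat\mu_1 u(n-1) + \hat\mu_2 u^2(n-1)\big)
  + \hat b_2\big(\hat\mu_1 u(n-2) + \hat\mu_2 u^2(n-2)\big)
  = \hat\mu^T U^T(n)\,\hat\theta_b .
```

The last expression is bilinear: for fixed $\hat\mu$ it is linear in $\hat\theta_b$ (Step 1), and for fixed $\hat\theta_b$ it is linear in $\hat\mu$ with the regressor $U^T(n)\hat\theta_b$ (Step 2), which is exactly the row structure of the matrix in (5.32).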
5.3 Identification of Hammerstein systems in the presence of correlated noise

A direct application of the method of Chang and Luus [27] to Hammerstein systems with correlated additive output disturbance results in asymptotically biased parameter estimates. To overcome this problem, an iterative least squares procedure, which estimates the parameters of the linear dynamic system, the parameters of the nonlinear static element and the parameters of a noise model, was proposed by Haist et al. [63]. Their approach is based on the assumption that the Hammerstein system with an output disturbance has the form
$$y(n) = \frac{B(q^{-1})}{A(q^{-1})}f\big(u(n)\big) + \frac{\varepsilon(n)}{D(q^{-1})}, \tag{5.33}$$
where
$$D(q^{-1}) = 1 + d_1q^{-1} + \dots + d_{nd}q^{-nd}, \tag{5.34}$$
and $\varepsilon(n)$ is a zero-mean discrete-time white noise. Assume that the nonlinear element is described by the polynomial
$$f\big(u(n)\big) = \mu_1u(n) + \mu_2u^2(n) + \dots + \mu_ru^r(n). \tag{5.35}$$
Denoting the undisturbed system output by $\bar y(n)$ and the additive disturbance at the system output by $\eta(n)$, we have
$$y(n) = \bar y(n) + \eta(n) \tag{5.36}$$
with
$$\bar y(n) = \sum_{k=1}^{r}\frac{B_k(q^{-1})}{A(q^{-1})}u_k(n), \tag{5.37}$$
$$\eta(n) = \frac{\varepsilon(n)}{D(q^{-1})}, \tag{5.38}$$
where $B_k(q^{-1}) = \mu_kB(q^{-1})$ and $u_k(n) = u^k(n)$ or, equivalently,
$$y(n) = -\sum_{m=1}^{na}a_m\bar y(n-m) + \sum_{k=1}^{r}\sum_{m=1}^{nb}b_{km}u_k(n-m) - \sum_{m=1}^{nd}d_m\eta(n-m) + \varepsilon(n). \tag{5.39}$$
Introducing the parameter vector $\theta$ and the regression vector $x(n)$:
$$\theta = \big[a_1 \,\dots\, a_{na} \;\; b_{11} \,\dots\, b_{1nb} \,\dots\, b_{r1} \,\dots\, b_{rnb} \;\; d_1 \,\dots\, d_{nd}\big]^T, \tag{5.40}$$
$$x(n) = \big[-\bar y(n-1) \,\dots\, -\bar y(n-na) \;\; u_1(n-1) \,\dots\, u_1(n-nb) \,\dots\, u_r(n-1) \,\dots\, u_r(n-nb) \;\; -\eta(n-1) \,\dots\, -\eta(n-nd)\big]^T, \tag{5.41}$$
we have
$$y(n) = x^T(n)\theta + \varepsilon(n). \tag{5.42}$$
Now, parameter updating can be made according to the following iterative procedure:

1. Since the sequences $\{\bar y(n)\}$ and $\{\eta(n)\}$ are not known initially, calculate the parameters of the deterministic part of the model using the noniterative procedure of Chang and Luus, described in Section 5.1. Generate an estimate of the sequence $\{\bar y(n)\}$ from (5.37) and calculate an estimate of the sequence $\{\eta(n)\}$ from (5.36) using the estimate of the sequence $\{\bar y(n)\}$.

2. Calculate improved parameter estimates from
$$\hat\theta = \Big[\sum_{n=1}^{N}x(n)x^T(n)\Big]^{-1}\Big[\sum_{n=1}^{N}x(n)y(n)\Big], \tag{5.43}$$
where the estimated values of $\bar y(n)$ and $\eta(n)$ are used in $x(n)$ instead of their unknown true values.

3. The improved parameter estimates yield another estimate of the sequence $\{\bar y(n)\}$. Also, the estimated sequence $\{\eta(n)\}$ is calculated again, from (5.36).

4. Continue the procedure from Step 2 until the change in the normalized RMS is less than some specified minimum.
To ensure numerical stability of the procedure, employing a stepping factor is suggested. With the stepping factor, denoted here by $\delta$, the adopted value of the model parameter vector $\hat\theta^{*(j+1)}$ at the iteration j + 1 is calculated according to the formula
$$\hat\theta^{*(j+1)} = \delta\,\hat\theta^{(j+1)} + (1 - \delta)\,\hat\theta^{*(j)}, \tag{5.44}$$
where $\hat\theta^{(j+1)}$ is the estimate of the parameter vector obtained from (5.43). As in the case of the method of Chang and Luus, it follows from (5.41) that there is some redundancy in parameter determination. In spite of its practically confirmed successful applications, no proof of the convergence of this method is available.
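The relaxed update (5.44) amounts to a convex combination of the previously adopted vector and the newly estimated one. A minimal sketch follows; the function name `relaxed_update` and the argument name `step` (the stepping factor) are hypothetical.

```python
def relaxed_update(theta_adopted, theta_new, step):
    """Blend the previously adopted vector with the new estimate, as in (5.44).

    step = 1 accepts the new least squares estimate fully;
    smaller values damp the change between iterations.
    """
    return [step * new + (1.0 - step) * old
            for old, new in zip(theta_adopted, theta_new)]


# Example: a quarter step from [0, 2] towards [1, 4]
blended = relaxed_update([0.0, 2.0], [1.0, 4.0], 0.25)
```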
5.4 Identification of Hammerstein systems with the Laguerre function expansion

An example of identification methods which use orthonormal basis functions is the method proposed by Thathachar and Ramaswamy [156]. In this method, the nonlinear part of the Hammerstein model is represented by the polynomial (5.4), and the linear part by a Laguerre expansion of its impulse response $h(n)$:
$$h(n) = \sum_{i=1}^{nl}A_il_i(n), \tag{5.45}$$
where $l_i(n)$ is the ith Laguerre function, and $A_i$, $i = 1, \dots, nl$, are the parameters. Discrete Laguerre functions are defined as
$$L_i(q^{-1}) = \sqrt{1 - \exp(-2T)}\,\frac{\big[q^{-1} - \exp(-T)\big]^{i-1}}{\big[1 - \exp(-T)q^{-1}\big]^{i}}, \quad \text{for } i = 1, 2, \dots, \tag{5.46}$$
where T is the sampling period, and $L_i(q^{-1})$ is the positive half Z transform of $l_i(n)$:
$$L_i(q^{-1}) = \sum_{n=0}^{\infty}l_i(n)q^{-n}. \tag{5.47}$$
The functions $l_i(n)$ fulfill the condition of orthonormality:
$$\sum_{n=0}^{\infty}l_i(n)l_k(n) = \begin{cases}0 & \text{for } i \neq k\\ 1 & \text{for } i = k\end{cases}, \tag{5.48}$$
and
$$l_i(n) = 0 \quad \text{for } n < 0 \text{ and all } i. \tag{5.49}$$
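The definitions (5.46) to (5.49) can be checked numerically by generating the impulse responses $l_i(n)$ with difference equations. The sketch below is illustrative: the function name `laguerre_impulse_responses` and the truncation length are assumptions. It uses the fact that (5.46) is a first-order low-pass section followed by i − 1 identical all-pass sections.

```python
import math


def laguerre_impulse_responses(nl, T, length):
    """Impulse responses l_1..l_nl of the filters (5.46), truncated to `length` samples."""
    a = math.exp(-T)
    gain = math.sqrt(1.0 - a * a)            # sqrt(1 - exp(-2T))
    l = [[0.0] * length for _ in range(nl)]
    # First section: gain / (1 - a q^-1) driven by a unit impulse
    for n in range(length):
        l[0][n] = a * (l[0][n - 1] if n > 0 else 0.0) + (gain if n == 0 else 0.0)
    # Each further section is the all-pass (q^-1 - a) / (1 - a q^-1):
    # out(n) = a*out(n-1) + in(n-1) - a*in(n)
    for i in range(1, nl):
        for n in range(length):
            prev_out = l[i][n - 1] if n > 0 else 0.0
            prev_in = l[i - 1][n - 1] if n > 0 else 0.0
            l[i][n] = a * prev_out + prev_in - a * l[i - 1][n]
    return l


# Gram matrix of the first four functions; it should be close to the identity (5.48)
l = laguerre_impulse_responses(nl=4, T=0.5, length=400)
gram = [[sum(x * z for x, z in zip(l[i], l[k])) for k in range(4)] for i in range(4)]
```

The truncation to a finite number of samples is harmless here because the responses decay geometrically with ratio exp(−T).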
The identification problem is to determine the parameters $\hat\mu_k$, $k = 1, \dots, r$, of the nonlinear part, and $A_i$, $i = 1, \dots, nl$, of the linear part. The problem is solved in an iterative manner, where each iteration comprises two steps. In the first step, the parameters $\hat\mu_k$ are kept constant at their values from the previous iteration and $A_i$ are calculated by solving a set of linear equations. To start the identification procedure, some arbitrary initial values of $\hat\mu_k$ are assumed, permitting the calculation of the output of the nonlinear part of the model. The expressions for the parameters $A_i$ can be derived from the input-output cross-correlation function of the linear part:
$$\begin{aligned}A_1 &= \frac{C_{nl}}{B_{nl}},\\ A_s &= \frac{C_{nl-s+1} - \sum_{i=1}^{s-1}A_iB_{nl-s+i}}{B_{nl}},\\ A_{nl} &= \frac{C_1 - \sum_{i=1}^{nl-1}A_iB_i}{B_{nl}}.\end{aligned} \tag{5.50}$$
The coefficients $B_j$, $j = 1, \dots, nl$, are calculated as the following averages:
$$B_j = \lim_{N\to\infty}\frac{1}{N+1}\sum_{n=0}^{N}u(n)f_j(n), \tag{5.51}$$
where $f_j(n)$ is the output of $L_j(q^{-1})$ excited by $\hat f\big(u(n)\big)$. In a similar way, the coefficients $C_j$, $j = 1, \dots, nl$, are calculated as
$$C_j = \lim_{N\to\infty}\frac{1}{N+1}\sum_{n=0}^{N}u(n)y_{j-1}(n), \tag{5.52}$$
where $y_{j-1}(n)$ is the output of $L_{j-1}(q^{-1})$ excited by $y(n)$. In the second step, the parameters $A_i$ are kept constant while $\hat\mu_k$ are determined using an algorithm that adjusts $\hat\mu_k$ towards their optimal values, minimizing the mean square error (MSE) between the model output and the system output. The minimization is carried out with gradient methods, i.e., steepest descent in the disturbance-free case, and stochastic approximation in the presence of an additive output disturbance. The MSE is defined as
$$J = \frac1N\sum_{n=1}^{N}\Big[\hat\mu^T\sum_{m_1=0}^{\infty}h(m_1)\mathbf{u}(n-m_1) - y(n)\Big]^2, \tag{5.53}$$
where $\mathbf{u}(n) = \big[u(n) \;\; u^2(n) \,\dots\, u^r(n)\big]^T$ and $\hat\mu = \big[\hat\mu_1 \;\; \hat\mu_2 \,\dots\, \hat\mu_r\big]^T$. In the disturbance-free case, the parameters of the nonlinear element model are adjusted according to the following rule:
$$\hat\mu^{(new)} = \hat\mu^{(old)} - \eta\frac{\partial J}{\partial\hat\mu}, \tag{5.54}$$
where $\eta$ is the step size. In the presence of additive output disturbances, an iteration-dependent step size $\eta(n)$ is used instead of a constant one. It fulfills the following conditions:
$$\eta(n) \geq 0, \qquad \sum_{n=1}^{\infty}\eta^2(n) < \infty, \qquad \sum_{n=1}^{\infty}\eta(n) = \infty. \tag{5.55}$$
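As a standard illustration of the conditions (5.55), not taken from the original method description, consider the harmonic step size $\eta(n) = 1/n$:

```latex
% Illustrative check of (5.55) for the assumed choice eta(n) = 1/n
\eta(n) = \frac{1}{n} \geq 0, \qquad
\sum_{n=1}^{\infty}\eta^2(n) = \sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6} < \infty, \qquad
\sum_{n=1}^{\infty}\eta(n) = \sum_{n=1}^{\infty}\frac{1}{n} = \infty .
```

A constant step size violates the second condition, so the effect of the noise is never averaged out, whereas $\eta(n) = 1/n^2$ violates the third, so the total adjustment is bounded and the iterations may stop short of the optimum.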
5.5 Prediction error method

In the prediction error method, the Hammerstein system is assumed to be given by
$$y(n) = \frac{B(q^{-1})}{A(q^{-1})}\sum_{k=1}^{r}\mu_ku^k(n) + \varepsilon(n), \tag{5.56}$$
where $\varepsilon(n)$ is a zero-mean white noise used to model disturbances. For the nonwhite case, the parameters of a noise filter should also be identified, but for the sake of simplicity this problem is not considered here. The one-step-ahead predictor for the system output has the form [34]:
$$\hat y(n|n-1) = \frac{\hat B(q^{-1})}{\hat A(q^{-1})}\sum_{k=1}^{r}\hat\mu_ku^k(n). \tag{5.57}$$
The parameter vector $\hat\theta = \big[\hat a_1 \,\dots\, \hat a_{na} \;\; \hat b_1 \,\dots\, \hat b_{nb} \;\; \hat\mu_1 \,\dots\, \hat\mu_r\big]^T$ should be estimated based on a set of N input-output measurements to minimize the sum of squared prediction errors
$$J = \frac1N\sum_{n=1}^{N}e^2(n, \hat\theta), \tag{5.58}$$
where
$$e(n, \hat\theta) = y(n) - \hat y(n|n-1). \tag{5.59}$$
The prediction error method requires the calculation of the gradient of (5.58) w.r.t. its parameters. For the Hammerstein model, partial derivatives of (5.58) are nonlinear in the parameters. Therefore, since a direct solution to the optimization problem cannot be found, iterative methods have to be used. The prediction error method is a second order gradient-based technique, in which the parameter vector $\hat\theta$ is adjusted along the negative gradient of (5.58):
$$\hat\theta^{(j)} = \hat\theta^{(j-1)} - \eta_jH^{-1}\big(\hat\theta^{(j-1)}\big)G\big(\hat\theta^{(j-1)}\big), \tag{5.60}$$
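where $\eta_j$ is the step size, $H(\cdot)$ is the Hessian of (5.58) or its approximation, and $G(\cdot)$ is the gradient of (5.58). The gradient of (5.58) can be computed as
$$G\big(\hat\theta^{(j)}\big) = \frac{dJ}{d\hat\theta} = \frac2N\sum_{n=1}^{N}e(n, \hat\theta)\frac{\partial e(n, \hat\theta)}{\partial\hat\theta} = -\frac2N\sum_{n=1}^{N}e(n, \hat\theta)\psi(n), \tag{5.61}$$
where
$$\psi(n) = \left[\frac{\partial\hat y(n|n-1)}{\partial\hat a_1} \dots \frac{\partial\hat y(n|n-1)}{\partial\hat a_{na}} \;\; \frac{\partial\hat y(n|n-1)}{\partial\hat b_1} \dots \frac{\partial\hat y(n|n-1)}{\partial\hat b_{nb}} \;\; \frac{\partial\hat y(n|n-1)}{\partial\hat\mu_1} \dots \frac{\partial\hat y(n|n-1)}{\partial\hat\mu_r}\right]^T \tag{5.62}$$
with
$$\frac{\partial\hat y(n|n-1)}{\partial\hat a_k} = -\frac{\hat B(q^{-1})}{\hat A(q^{-1})}\frac{1}{\hat A(q^{-1})}\sum_{m=1}^{r}\hat\mu_mu^m(n-k), \tag{5.63}$$
$$\frac{\partial\hat y(n|n-1)}{\partial\hat b_k} = \frac{1}{\hat A(q^{-1})}\sum_{m=1}^{r}\hat\mu_mu^m(n-k), \tag{5.64}$$
$$\frac{\partial\hat y(n|n-1)}{\partial\hat\mu_m} = \frac{\hat B(q^{-1})}{\hat A(q^{-1})}u^m(n). \tag{5.65}$$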
The main drawback of using the Hessian in (5.60) is that it requires second order derivatives. To avoid the calculation of second order derivatives, an approximate Hessian can be used. In the Levenberg-Marquardt method, the approximate Hessian is calculated as
$$H\big(\hat\theta^{(j)}\big) = \frac1N\sum_{n=1}^{N}\frac{\partial e^T(n, \hat\theta)}{\partial\hat\theta}\frac{\partial e(n, \hat\theta)}{\partial\hat\theta} + \mu I = \frac1N\sum_{n=1}^{N}\psi(n)\psi^T(n) + \mu I, \tag{5.66}$$
where $\mu$ is a small nonnegative scalar and $I$ is the identity matrix of an appropriate dimension. The iterative prediction error algorithm comprises the following steps:

1. Start iterations with an initial estimate of the parameters $\hat\theta^{(0)}$ and set $\hat\mu_1 = 1$.
2. Pick a small value for $\mu$ (a typical choice is 0.0001).
3. Compute $\hat y(n|n-1)$ and $J\big(\hat\theta^{(j-1)}\big)$.
4. Compute the gradient and the Hessian through (5.61) – (5.66).
5. Update parameter estimates through (5.60), and calculate $J\big(\hat\theta^{(j)}\big)$.
6. If $J\big(\hat\theta^{(j)}\big) > J\big(\hat\theta^{(j-1)}\big)$, increase $\mu$ by a factor (say 10) and go to Step 4. If $J\big(\hat\theta^{(j)}\big) < J\big(\hat\theta^{(j-1)}\big)$, accept the update, decrease $\mu$ by a factor (say 10) and go to Step 3.

With the rules for the calculation of the gradient derived, a recursive version of the algorithm can be implemented easily; see Equations (2.97) – (2.99).
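The steps above can be sketched for a deliberately small case. The code below is an illustrative sketch, not the algorithm of [34]: it assumes a first-order Hammerstein model $\hat y(n) = -\hat a\hat y(n-1) + \hat b\hat v(n-1)$ with $\hat v(n) = u(n) + \hat\mu_2u^2(n)$ (so $\hat\mu_1 = 1$ is fixed), zero initial conditions, and a damping parameter that is increased after a rejected step and decreased after an accepted one. The names `solve`, `simulate` and `levenberg_marquardt` are hypothetical. The gradients are computed with time-domain recursions equivalent to the filters (5.63) to (5.65). Note that, with the scalings of (5.61) and (5.66), a step size of 0.5 reproduces the Gauss-Newton step when the damping parameter is zero, which is why `eta=0.5` is used as the default.

```python
import random


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def simulate(theta, u):
    """Model output and gradient rows psi(n) for theta = [a, b, mu2]."""
    a, b, mu2 = theta
    y_prev, g_prev, u_prev = 0.0, [0.0, 0.0, 0.0], 0.0
    y_hat, psi = [], []
    for u_n in u:
        v = u_prev + mu2 * u_prev ** 2                 # v(n-1) with mu1 = 1
        y_n = -a * y_prev + b * v
        g_n = [-y_prev - a * g_prev[0],                # d y / d a
               v - a * g_prev[1],                      # d y / d b
               b * u_prev ** 2 - a * g_prev[2]]        # d y / d mu2
        y_hat.append(y_n)
        psi.append(g_n)
        y_prev, g_prev, u_prev = y_n, g_n, u_n
    return y_hat, psi


def levenberg_marquardt(u, y, theta, mu=1e-4, eta=0.5, iterations=50):
    """Damped iterations of (5.60) with gradient (5.61) and Hessian (5.66)."""
    N, d = len(y), len(theta)

    def evaluate(th):
        y_hat, psi = simulate(th, u)
        e = [yi - yh for yi, yh in zip(y, y_hat)]
        return sum(ei * ei for ei in e) / N, e, psi

    J, e, psi = evaluate(theta)
    for _ in range(iterations):
        G = [-2.0 / N * sum(e[n] * psi[n][i] for n in range(N)) for i in range(d)]
        H = [[sum(psi[n][i] * psi[n][j] for n in range(N)) / N
              + (mu if i == j else 0.0) for j in range(d)] for i in range(d)]
        step = solve(H, G)
        trial = [theta[i] - eta * step[i] for i in range(d)]
        J_trial, e_trial, psi_trial = evaluate(trial)
        if J_trial < J:          # accept the step and relax the damping
            theta, J, e, psi = trial, J_trial, e_trial, psi_trial
            mu /= 10.0
        else:                    # reject the step and increase the damping
            mu *= 10.0
    return theta, J


# Hypothetical noise-free data generated with theta = [-0.7, 0.5, 0.3]
rng = random.Random(1)
u = [rng.uniform(-1.0, 1.0) for _ in range(300)]
y, _ = simulate([-0.7, 0.5, 0.3], u)
theta_hat, J_final = levenberg_marquardt(u, y, [-0.5, 0.3, 0.0])
```

Because a trial step is accepted only when it lowers the cost, the cost sequence is non-increasing by construction; convergence to the true parameters still depends on the initial estimate.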
5.6 Identification of MISO systems with the pseudolinear regression method

Consider a MISO Hammerstein model with nu inputs and the output disturbed additively by correlated measurement noise [22]:
$$y(n) = \sum_{j=1}^{nu}\frac{B_j(q^{-1})}{A_j(q^{-1})}f_j\big(u_j(n)\big) + \frac{C(q^{-1})}{D(q^{-1})}\varepsilon(n), \tag{5.67}$$
and the polynomial models of static nonlinearities
$$f_j\big(u_j(n)\big) = \mu_{j1}u_j(n) + \dots + \mu_{jr_j}u_j^{r_j}(n), \tag{5.68}$$
where
$$A_j(q^{-1}) = 1 + a_{j1}q^{-1} + \dots + a_{jna_j}q^{-na_j}, \tag{5.69}$$
$$B_j(q^{-1}) = 1 + b_{j1}q^{-1} + \dots + b_{jnb_j}q^{-nb_j}, \tag{5.70}$$
$$C(q^{-1}) = 1 + c_1q^{-1} + \dots + c_{nc}q^{-nc}, \tag{5.71}$$
$$D(q^{-1}) = 1 + d_1q^{-1} + \dots + d_{nd}q^{-nd}. \tag{5.72}$$
Assume that $\varepsilon(n)$ is a zero-mean white noise, the model (5.67) is asymptotically stable, and the polynomial orders $na_j$, $nb_j$, $r_j$, $nc$, and $nd$ are known. The model (5.67) can be transformed into the following equivalent form:
$$\bar A(q^{-1})y(n) = \sum_{j=1}^{nu}\bar B_j(q^{-1})f_j\big(u_j(n)\big) + \frac{\bar C(q^{-1})}{D(q^{-1})}\varepsilon(n), \tag{5.73}$$
where
$$\bar A(q^{-1}) = \prod_{k=1}^{nu}A_k(q^{-1}) = 1 + \bar a_1q^{-1} + \dots + \bar a_{na}q^{-na}, \tag{5.74}$$
$$\bar B_j(q^{-1}) = B_j(q^{-1})\prod_{k=1,\,k\neq j}^{nu}A_k(q^{-1}) = \bar b_{j1}q^{-1} + \dots + \bar b_{jnab_j}q^{-nab_j}, \tag{5.75}$$
$$\bar C(q^{-1}) = C(q^{-1})\bar A(q^{-1}) = 1 + \bar c_1q^{-1} + \dots + \bar c_{nac}q^{-nac}, \tag{5.76}$$
with $na = na_1 + \dots + na_{nu}$, $nab_j = na_1 + \dots + na_{j-1} + nb_j + na_{j+1} + \dots + na_{nu}$, $nac = na + nc$. Performing another transformation results in a model in the pseudolinear-in-parameters form:
$$\mathcal{A}(q^{-1})y(n) = \sum_{j=1}^{nu}\mathcal{B}_j(q^{-1})f_j\big(u_j(n)\big) + \bar C(q^{-1})\varepsilon(n), \tag{5.77}$$
where
$$\mathcal{A}(q^{-1}) = \bar A(q^{-1})D(q^{-1}) = 1 + \alpha_1q^{-1} + \dots + \alpha_{n\alpha}q^{-n\alpha}, \tag{5.78}$$
$$\mathcal{B}_j(q^{-1}) = \bar B_j(q^{-1})D(q^{-1}) = \beta_{j1}q^{-1} + \dots + \beta_{jn\beta_j}q^{-n\beta_j}, \tag{5.79}$$
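with $n\alpha = na + nd$ and $n\beta_j = nab_j + nd$. Introducing the parameter vector $\theta$,
$$\theta = \big[\alpha_1 \dots \alpha_{n\alpha} \;\; \beta_{11}\mu_{11} \dots \beta_{11}\mu_{1r_1} \dots \beta_{1n\beta_1}\mu_{11} \dots \beta_{1n\beta_1}\mu_{1r_1} \dots \beta_{nu,1}\mu_{nu,1} \dots \beta_{nu,1}\mu_{nu,r_{nu}} \dots \beta_{nu,n\beta_{nu}}\mu_{nu,1} \dots \beta_{nu,n\beta_{nu}}\mu_{nu,r_{nu}} \;\; \bar c_1 \dots \bar c_{nac}\big]^T, \tag{5.80}$$
and the vector $x_0(n)$,
$$x_0(n) = \big[y(n-1) \dots y(n-n\alpha) \;\; u_1(n) \dots u_1^{r_1}(n) \dots u_1(n-n\beta_1) \dots u_1^{r_1}(n-n\beta_1) \dots u_{nu}(n) \dots u_{nu}^{r_{nu}}(n) \dots u_{nu}(n-n\beta_{nu}) \dots u_{nu}^{r_{nu}}(n-n\beta_{nu}) \;\; \varepsilon(n-1) \dots \varepsilon(n-nac)\big]^T, \tag{5.81}$$
we have
$$y(n) = x_0^T(n)\theta + \varepsilon(n). \tag{5.82}$$
Introduce the prediction error $e(n)$
$$e(n) = y(n) - x^T(n)\hat\theta(n-1), \tag{5.83}$$
where $\hat\theta$ is the vector of adjustable parameters
$$\hat\theta = \big[\hat\alpha_1 \dots \hat\alpha_{n\alpha} \;\; \hat\beta_{11}\hat\mu_{11} \dots \hat\beta_{11}\hat\mu_{1r_1} \dots \hat\beta_{1n\beta_1}\hat\mu_{11} \dots \hat\beta_{1n\beta_1}\hat\mu_{1r_1} \dots \hat\beta_{nu,1}\hat\mu_{nu,1} \dots \hat\beta_{nu,1}\hat\mu_{nu,r_{nu}} \dots \hat\beta_{nu,n\beta_{nu}}\hat\mu_{nu,1} \dots \hat\beta_{nu,n\beta_{nu}}\hat\mu_{nu,r_{nu}} \;\; \hat{\bar c}_1 \dots \hat{\bar c}_{nac}\big]^T, \tag{5.84}$$
and $x(n)$ is defined as
$$x(n) = \big[y(n-1) \dots y(n-n\alpha) \;\; u_1(n) \dots u_1^{r_1}(n) \dots u_1(n-n\beta_1) \dots u_1^{r_1}(n-n\beta_1) \dots u_{nu}(n) \dots u_{nu}^{r_{nu}}(n) \dots u_{nu}(n-n\beta_{nu}) \dots u_{nu}^{r_{nu}}(n-n\beta_{nu}) \;\; e(n-1) \dots e(n-nac)\big]^T. \tag{5.85}$$
The estimate $\hat\theta(n)$ of $\theta$, which minimizes the sum of squared errors between the system and model outputs, can be obtained with the recursive pseudolinear regression algorithm as follows:
$$\hat\theta(n) = \hat\theta(n-1) + K(n)e(n), \tag{5.86}$$
$$K(n) = P(n)x(n) = \frac{P(n-1)x(n)}{1 + x^T(n)P(n-1)x(n)}, \tag{5.87}$$
$$P(n) = P(n-1) - K(n)x^T(n)P(n-1). \tag{5.88}$$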
5.7 Identification of systems with two-segment nonlinearities

In some cases, only polynomials of a higher order can approximate nonlinear characteristics adequately. An increase in the polynomial order r causes a multiple increase in the overall number of parameters in the linear-in-parameters model (5.12). Moreover, a single polynomial model can be inaccurate over the whole range of the system input signal. An alternative to the single polynomial model solutions discussed earlier was proposed by Vörös [161]. In this method, a two-segment description of the nonlinear characteristic, composed of separate polynomial maps for positive and negative inputs, is used. The main motivation for such an approach is a better fit without increasing polynomial orders for some types of nonlinearities. Assume that the nonlinear characteristic $\hat f(\cdot)$ is
$$\hat v(n) = \hat f\big(u(n)\big) = \begin{cases}\hat f_1\big(u(n)\big), & \text{if } u(n) > 0\\ \hat f_2\big(u(n)\big), & \text{if } u(n) < 0\end{cases}, \tag{5.89}$$
where
$$\hat f_1\big(u(n)\big) = \sum_{k=1}^{r}\hat\mu_{1k}u^k(n), \tag{5.90}$$
$$\hat f_2\big(u(n)\big) = \sum_{k=1}^{r}\hat\mu_{2k}u^k(n). \tag{5.91}$$
Introducing a switching sequence $\{g(n)\}$,
$$g(n) = g\big(u(n)\big) = \begin{cases}0, & \text{if } u(n) > 0\\ 1, & \text{if } u(n) < 0\end{cases}, \tag{5.92}$$
(5.89) can be written as
$$\hat v(n) = \hat f_1\big(u(n)\big) + \Big[\hat f_2\big(u(n)\big) - \hat f_1\big(u(n)\big)\Big]g(n). \tag{5.93}$$
Then taking into account (5.90) and (5.91),
$$\hat v(n) = \sum_{k=1}^{r}\hat\mu_{1k}u^k(n) + \sum_{k=1}^{r}p_ku^k(n)g(n), \tag{5.94}$$
where
$$p_k = \hat\mu_{2k} - \hat\mu_{1k}. \tag{5.95}$$
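The equivalence of the piecewise definition (5.89) and the switched form (5.94) can be checked with a short sketch. The function names below are hypothetical. At u(n) = 0, which (5.89) leaves unspecified, both polynomials vanish because they have no constant term, so the choice of branch does not matter there.

```python
def poly(coeffs, u):
    """sum_k coeffs[k-1] * u**k: a polynomial without a constant term."""
    return sum(c * u ** (k + 1) for k, c in enumerate(coeffs))


def two_segment_direct(u, mu1, mu2):
    """(5.89): coefficients mu1 for positive inputs, mu2 for negative inputs."""
    return poly(mu1, u) if u > 0 else poly(mu2, u)


def two_segment_switched(u, mu1, mu2):
    """(5.94): one expression using g(n) from (5.92) and p_k from (5.95)."""
    g = 1.0 if u < 0 else 0.0
    p = [m2 - m1 for m1, m2 in zip(mu1, mu2)]
    return poly(mu1, u) + poly(p, u) * g


# Hypothetical coefficients for the two segments
mu1, mu2 = [1.0, 0.5, -0.2], [0.8, -0.3, 0.1]
pairs = [(two_segment_direct(u, mu1, mu2), two_segment_switched(u, mu1, mu2))
         for u in (-1.0, -0.3, 0.0, 0.4, 2.0)]
```

The switched form matters because it is linear in the coefficients $\hat\mu_{1k}$ and $p_k$ once g(n) is known from the measured input.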
The substitution of (5.94) into (5.1) results in a model which is nonlinear in the parameters. However, assuming $\hat b_1 = 1$, the model (5.1) can be expressed as
$$\hat y(n) = \hat v(n-1) + \big[\hat B(q^{-1}) - q^{-1}\big]\hat v(n) + \big[1 - \hat A(q^{-1})\big]\hat y(n). \tag{5.96}$$
Then substituting (5.94) only for $\hat v(n-1)$, the following pseudolinear-in-parameters model of the parallel type can be obtained:
$$\hat y(n) = \sum_{k=1}^{r}\hat\mu_{1k}u^k(n-1) + \sum_{k=1}^{r}p_ku^k(n-1)g(n-1) + \big[\hat B(q^{-1}) - q^{-1}\big]\hat v(n) + \big[1 - \hat A(q^{-1})\big]\hat y(n). \tag{5.97}$$
Replacing $\hat y(n)$ on the right-hand side with $y(n)$ transforms the model (5.97) into the series-parallel form:
$$\hat y(n) = \sum_{k=1}^{r}\hat\mu_{1k}u^k(n-1) + \sum_{k=1}^{r}p_ku^k(n-1)g(n-1) + \big[\hat B(q^{-1}) - q^{-1}\big]\hat v(n) + \big[1 - \hat A(q^{-1})\big]y(n) \tag{5.98}$$
or
$$\hat y(n) = x^T(n)\hat\theta, \tag{5.99}$$
where the parameter vector is defined as
$$\hat\theta = \big[\hat\mu_{11} \,\dots\, \hat\mu_{1r} \;\; p_1 \,\dots\, p_r \;\; \hat b_2 \,\dots\, \hat b_{nb} \;\; \hat a_1 \,\dots\, \hat a_{na}\big]^T, \tag{5.100}$$
and the regression vector is
$$x(n) = \big[u(n-1) \,\dots\, u^r(n-1) \;\; u(n-1)g(n-1) \,\dots\, u^r(n-1)g(n-1) \;\; \hat v(n-2) \,\dots\, \hat v(n-nb) \;\; -y(n-1) \,\dots\, -y(n-na)\big]^T. \tag{5.101}$$
The model (5.98) is also of the pseudolinear form, as $\hat v(n)$ is an unmeasurable variable which depends on the parameters of the nonlinear function $\hat f(\cdot)$. Therefore, no noniterative algorithm can be applied to estimate $\hat\theta$. Vörös proposed an iterative algorithm which uses the preceding estimates of $\hat\mu_{1k}$ and $p_k$ to estimate $\hat v(n)$. Denote with $\hat v^{(j)}(n)$, $\hat\mu_{1k}^{(j)}$, and $p_k^{(j)}$ the estimates of $\hat v(n)$, $\hat\mu_{1k}$ and $p_k$ obtained at the step j:

$$\hat v^{(j)}(n) = \sum_{k=1}^{r} \hat\mu_{1k}^{(j)} u^k(n) + \sum_{k=1}^{r} p_k^{(j)} u^k(n) g(n). \tag{5.102}$$

Then the error to be minimized can be expressed as

$$e(n) = y(n) - x^{(j)T}(n)\,\hat\theta^{(j+1)}, \tag{5.103}$$
where $x^{(j)}(n)$ is the regression vector with the estimates of $\hat v(n)$ calculated according to (5.102), and $\hat\theta^{(j+1)}$ is the $(j+1)$th estimate of $\hat\theta$. The iterative identification procedure can be divided into the following steps:

1. Using the regression vector $x^{(j)}(n)$, minimize a proper criterion based on (5.103) to estimate $\hat\theta^{(j+1)}$.
2. Using (5.102), calculate $\hat v^{(j+1)}(n)$.
3. Repeat Steps 1 and 2 until the parameter estimates converge to constant values.

Testing the above identification procedure with differently shaped two-segment polynomial and exponential nonlinearities has revealed its good convergence properties, although a formal proof of convergence is not available. Although the iterative procedure, which uses the whole input-output data set, was originally proposed by Vörös, the derivation of a sequential version of the method is straightforward. A similar approach can also be used for the identification of discontinuous Hammerstein systems, i.e., systems with the nonlinear element described by a discontinuous function, or Wiener systems with this type of nonlinearity [160].
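The three steps above can be sketched in batch form as follows. This is an illustrative sketch under stated assumptions, not the implementation of Vörös: the criterion in Step 1 is assumed to be the ordinary sum of squared errors, the loop runs for a fixed number of iterations instead of testing convergence, and all function and argument names are hypothetical. The helper `simulate_hammerstein` only generates noise-free test data with $b_1 = 1$.

```python
import numpy as np

def segment_features(u, r):
    """Row n holds u^k(n) and u^k(n) g(n), k = 1..r, with g(n) = 1 for u(n) < 0."""
    powers = u[:, None] ** np.arange(1, r + 1)
    g = (u < 0).astype(float)[:, None]
    return np.hstack([powers, powers * g])

def simulate_hammerstein(u, mu1, p, b, a):
    """Noise-free test data: A(q^-1) y(n) = B(q^-1) v(n), b = [b1, ...], a = [a1, ...]."""
    v = segment_features(u, len(mu1)) @ np.concatenate([mu1, p])
    y = np.zeros(len(u))
    for n in range(len(u)):
        y[n] = (sum(bi * v[n - i] for i, bi in enumerate(b, 1) if n >= i)
                - sum(ai * y[n - i] for i, ai in enumerate(a, 1) if n >= i))
    return y

def voros_iterate(u, y, r, na, nb, mu1_init, p_init, n_iter=20):
    """Alternate least squares on (5.103) with the update (5.102)."""
    feats = segment_features(u, r)
    coef = np.concatenate([mu1_init, p_init])        # current mu_1k and p_k
    start, N = max(na, nb), len(u)
    theta = None
    for _ in range(n_iter):
        v_hat = feats @ coef                          # Step 2 / (5.102)
        cols = [feats[start - 1:N - 1]]               # u^k(n-1), u^k(n-1) g(n-1)
        cols += [v_hat[start - i:N - i] for i in range(2, nb + 1)]
        cols += [-y[start - i:N - i] for i in range(1, na + 1)]
        X = np.column_stack(cols)                     # rows follow (5.101)
        theta, *_ = np.linalg.lstsq(X, y[start:], rcond=None)   # Step 1
        coef = theta[:2 * r]
    return theta                                      # ordered as in (5.100)
```

With noise-free data, the true parameter vector is a fixed point of the iteration, which gives a simple sanity check. Convergence from an arbitrary starting point is not guaranteed by this sketch, in line with the remark above that no formal proof is available.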
5.8 Summary

We have presented seven different methods for the identification of Hammerstein systems, which use Hammerstein models with a polynomial model of the nonlinear element. The iterative least squares method of Narendra and Gallman and the noniterative least squares method of Chang and Luus are the oldest and among the best known. Both of them have some drawbacks. While the convergence of the first one is not guaranteed, the drawbacks of the other one are parameter redundancy and a huge number of parameters in the case of high order models. The iterative method of Haist et al. [63] also uses a non-unique parameterization but, in contrast to the one-step method of Chang and Luus, its advantage is consistent parameter estimates in the case of correlated output noise. In the method of Thathachar and Ramaswamy, a Laguerre expansion is used to represent the linear part of the model. The parameters of the model are adjusted iteratively with a gradient based method. The parameters of a pulse transfer function model of the linear dynamic system and a polynomial model of the nonlinear element are estimated iteratively with the Levenberg-Marquardt method in the prediction error approach discussed by Eskinat et al. [34]. An alternative to the prediction error method is the pseudolinear regression method proposed by Boutayeb et al. Finally, the iterative method of Vörös allows one to identify Hammerstein systems using a two-segment polynomial model of the nonlinear element. The method uses a linear regression approach but a formal proof of convergence is not available.
6 Applications
This chapter starts with a brief survey of the reported applications of Wiener and Hammerstein models in both system modelling and control. Next, the estimation of parameter changes in the context of fault detection and isolation is considered in Section 6.2. Modelling vapor pressure dynamics in a five stage sugar evaporation station is studied in Section 6.3. Two nominal models of the process, i.e., a linear model and a neural network Wiener model are developed based on the real process data recorded at the Lublin Sugar Factory in Poland. Finally, Section 6.4 summarizes the results.
6.1 General review of applications

A few authors [39, 42, 95, 96, 128, 130, 133, 134, 154] have studied the control of the pH neutralization process with control methods that use Wiener and Hammerstein models. Based on the Hammerstein model, Fruzzetti et al. [39] proposed a model predictive control (MPC) strategy for a pH neutralization process and a binary distillation column. MPC is a control strategy in which a model of the process is used to predict future process outputs [62, 127]. Using the Hammerstein model and an inverse model of the nonlinear element, it is possible to apply a linear MPC controller in the analyzed control scheme. The controlled system is identified using the Hammerstein model, defined as a special form of the NARMAX model, and applying a forward regression orthogonal estimator [18].

A model of the pH neutralization process was employed by Gerkšič et al. [42] for testing the MPC scheme. The actual and future control values are calculated by minimizing a performance criterion over a certain future horizon. The method of Gerkšič et al. uses a two-step procedure for Wiener system identification. In the first step, with a testing input signal slowly varied from the lower to the upper edge of the working range, a polynomial model of the
A. Janczak: Identification of Nonlinear Systems, LNCIS 310, pp. 159–185, 2005. © Springer-Verlag Berlin Heidelberg 2005
nonlinear element is determined. Another testing signal, composed of a sequence of random steps and a small noise sequence, is used in the second step to identify the linear system. The performance of three MPC schemes with different disturbance rejection techniques was compared. Comparisons with other control schemes were also made, showing that MPC is superior to the proportional plus integral (PI) scheme.

The modelling of the pH neutralization process with a Wiener model composed of a frequency response model of the linear element and a polynomial model of the inverse nonlinear element was studied by Kalafatis et al. [95, 96]. In a pilot scale pH plant [95], a continuous stirred tank reactor (CSTR), a strong base (NaOH) reacts with feed streams of a strong acid (HCl) and a buffering solution (NaHCO3). The flow rate of the strong base is the process input and the pH of the exiting stream is the process output.

An experimental pH neutralization process, in which a 0.82 mM HCl solution is added to a 2.0 mM NaOH solution, was examined by Norquay et al. [128]. The Wiener model of the process, consisting of a step-response model followed by a cubic spline nonlinearity, is determined based on a response to an input signal with a random amplitude and frequency and applied to the MPC scheme. The obtained Wiener model gives an excellent fit to data in comparison with a linear step response model, which captures system dynamics but not the gain. In [130], Norquay et al. discuss two different Wiener system identification procedures and apply Wiener models in MPC of a pH neutralization process. In a two stage procedure, a nonlinear element model is fitted to the steady-state data followed by the estimation of linear model parameters. A single stage procedure calculates the parameters of both models simultaneously with prediction error techniques. MPC used by Norquay et al., see Fig. 6.1 for the block diagram, is based on the internal model control scheme (IMC) [127]. In Fig.
6.1, Gp represents the controlled process, Gc and Gm represent the controller and the process model, respectively. The inverse model of the nonlinear element that follows the Wiener model cancels its nonlinear gain. Applying another inverse model of the nonlinear element, nonlinear properties of the neutralization process are also cancelled. In this way, the nonlinear predictive control problem reduces to a linear one. It has also been shown by experimental evaluation and testing that MPC developed by Norquay et al. outperforms the proportional plus integral plus derivative (PID) and linear MPC control strategies. The model reference adaptive control scheme for the control of the pH neutralization process in a continuous flow reactor was studied by Pajunen [133]. In this approach, the nonlinear element is modelled by a piecewise linear function and a pulse transfer model is used as a model of the linear system. The parameters of the controller are calculated recursively with the least squares type adaptation algorithm. The method is useful for systems with unknown and time-varying parameters that can be described by the Wiener model. Global
stability of the control algorithm is established assuming that the nonlinear element can be exactly represented by a linear spline function for a given set of breakpoints.

Fig. 6.1. Wiener-model-based prediction control based on the inversion of nonlinearity

Patwardhan et al. [134] used a Wiener model-based MPC method to control the pH level in an acid-base neutralization process. The method proposed by Patwardhan et al. uses a MIMO Wiener model with two inputs, the acid flow rate and the base flow rate, and two outputs, i.e., the level and pH. The Wiener system was identified with the partial least squares method. The performance of three model predictive controllers, i.e., linear, Hammerstein model-based, and Wiener model-based ones, was compared. Both the Hammerstein and Wiener model-based MPC schemes outperform the linear one. The MPC scheme is also able to meet several set-point changes into nonlinear regions that the other controllers cannot handle.

A control strategy that linearizes nonlinearity using the inverse nonlinear element model was applied by Sung [154] to control a simulated pH neutralization process in the CSTR. The identification of the pH neutralization process is performed by a simple relay-based feedback method that separates the identification of the nonlinear element from the identification of the linear dynamic system. As a result, a nonparametric nonlinear element model, in the form of a look-up table, is obtained and the ultimate gain, i.e., the gain at which the system controlled by the proportional controller oscillates, and the oscillation frequency are determined.

A continuous stirred-tank fermentation process that exhibits both nonlinear and nonstationary features was studied by Roux et al. [143]. In this application, the parameters of the Hammerstein model are estimated using the RLS algorithm. Based on three different models, i.e., linear, nonlinear, and Hammerstein ones, adaptive prediction control algorithms are derived and tested. Both one-step-ahead and multi-step-ahead cost functions are applied.
The application of Wiener models in the control of distillation columns was studied in [19, 20, 110, 114, 129, 138, 146, 159]. Bloemen et al. [19, 20] used the Wiener model in MPC algorithms based on polytopic descriptions and an inverse of nonlinearity and compared the obtained results with two other algorithms based on a linear state space model and an FIR model. The identified system is a computer simulator of a moderate-high purity distillation column. The Wiener system is identified via an indirect closed loop algorithm based on the subspace identification methods [113, 157, 164].

A pilot binary distillation column with 26 trays was identified by Ling and Rivera [110]. They used a pulse transfer model of the linear system, a polynomial model of the nonlinear element, and a white noise input signal uniformly distributed in [−0.5, 0.5]. Four different models, i.e., second order Volterra series, Wiener, Hammerstein, and NARX models were identified. Their performance was evaluated based on their steady-state responses and step responses at different amplitudes. The obtained results can be summarized as follows:

– The NARX model provides the most accurate approximation of column dynamics.
– The second order Volterra series model can correctly capture column dynamics only in a quite limited operating range.
– The Wiener and Hammerstein models can approximate column dynamics well if the input design is selected properly.
Luyben and Eskinat [114] used a continuous-time Hammerstein model for modelling a 20-tray binary distillation column. The identification of the column is performed via consecutive use of two or more relay-feedback tests with different relay heights and different known dynamic elements inserted into the loop. The Hammerstein model was also applied by Luyben and Eskinat to the identification of a steam-water heat exchanger. Exciting the system input with the PRBS signal, the parameters of the Hammerstein model are estimated using the Narendra-Gallman algorithm [120], see Section 5.2. Considering the identification of nonlinear systems with block-oriented models, Pearson and Pottman [138] illustrated it with the application to a simulated distillation column example. Under the assumption that the nonlinear steady-state characteristics are known a priori, they determined both the Wiener and Hammerstein models and compared their performance with linear and feedback block-oriented models. A model predictive control of a C2-splitter based on a MIMO Wiener model was considered by Norquay et al. [129]. The C2-splitter is a high purity 114 tray distillation column, which separates a feed stream consisting primarily of ethylene and ethane with a small amount of methane into an ethylene product and ethane waste stream. The Wiener model employed in this application is constructed using cubic splines for nonlinear elements and the first order plus dead time models for linear elements. While the reboiler duty (in terms of the flow of the heating stream) and the reflux flow rate are model inputs, the top composition and the bottom temperature of the column are model outputs.
To identify the static nonlinear element, steady-state responses to changes of the process input about a nominal operating point were recorded. Based on the steady state data, a piecewise polynomial model was constructed. To gain some understanding of system dynamics, step tests were performed, and the first order plus dead time models fitted. Before industrial implementation, the control strategy was tested with a simulated C2-splitter. The applied predictive control algorithm of the IMC type appears successful in the rejection of major disturbances. Comparisons with linear IMC have shown the Wiener model-based approach to be superior. Sentoni et al. [146] used a Wiener model consisting of a set of discrete-time Laguerre systems and a multilayer perceptron model of the nonlinear element to model a simulated binary distillation column. The application of the model was illustrated with a nonlinear model predictive control example. The chromatographic separation process was modelled with a MIMO neural network Wiener model by Visala et al. [159]. The separation unit consists of two interconnected columns and is used for the separation of different compounds from the incoming solution. Separate input-output models are identified for both columns. The model is used for process monitoring. In this way, changes in dynamics and drifts into undesired disturbed states can be observed by on-line simulation. To include changes in process dynamics, the model can be retrained with recent production data. A two-step identification procedure was used by Cervantes et al. to identify the CSTR and the polymerization reactor. First, the linear dynamic system is identified using the correlation method, then the nonlinear element is approximated with a piecewise linear function. On the basis of the obtained Wiener models, an MPC control strategy is developed and tested. 
Another two-step identification procedure was used by Rollins and Bhandari [142] for the identification of MIMO Wiener and Hammerstein systems. In the first step, the nonlinear element is identified from the ultimate response data of sequential step tests. The parameters of the linear dynamic system are estimated in the second step under the constraint of the fitted nonlinear element model. A simulation study, based on a simulated CSTR, showed higher prediction accuracy of the proposed method in comparison with that of the unconstrained identification procedure.

Nonlinear structure identification of chemical processes based on a nonlinear input-output operator was studied by Menold et al. [118]. They introduced a deterministic suitability measure that qualifies the capability of a model class to capture the input-output behavior of a nonlinear system. The suitability measure can be used to select the model structure prior to the actual parameter identification. A strategy for computing the suitability measures with respect to the Hammerstein, the linear, the Wiener, and the diagonal Volterra model was evaluated with the example of a simulated CSTR.

Davide et al. [32] applied the Wiener model to the modelling of quartz microbalance polymer-coated sensors used for the detection of mixtures of n-octane and toluene. The Wiener model is composed of two linear dynamic models
followed by a single nonlinear element. Exciting system inputs with two independent white Gaussian signals, linear dynamic systems are identified using the correlation method. Having the linear dynamic systems identified, a polynomial model of the nonlinear element can be fitted.

A radial basis function neural network was applied as a model of a hydraulic pilot plant by Knohl et al. [99]. Due to the linear-in-parameters form of the model, the RLS algorithm with a constant trace can be used for parameter estimation. Based on the developed neural network model, an indirect adaptive control strategy was derived and tested. An inverse nonlinear element model was applied by Knohl and Unbehauen [98] for the compensation of input nonlinearity in an electro-hydraulic servo system. The system is controlled by the standard indirect adaptive control method. In this application, both the nonlinear element and its inverse are modelled with the RBF neural network. As the Hammerstein model of the system which contains a dead zone nonlinearity can be written in the linear-in-parameters form, the RLS algorithm is used for parameter estimation.

A Wiener model describing a pneumatic valve for the control of fluid flow was used by Wigren [165]. In this example, a linear dynamic model is used to describe the dynamic balance between a control signal, a counteractive spring force and friction. A flow through the valve, being the model output, is a nonlinear function of the valve stem position. Ikonen and Najim [72] modelled the valve pressure difference in a pump-valve pilot system with a MISO Wiener model. The inputs to the model are the pump command signal, the true valve position, and the consistency of the pulp, whereas the pressure difference over the valve is the model output. Celka et al. [25] considered modelling an electrooptical dynamic system running in a chaotic mode. They showed that, under certain assumptions, such a system can be represented by a Wiener model.
An adaptive precompensation of nonlinear distortion in Wiener systems via a precompensator of a Hammerstein model structure was studied by Kang et al. [97]. The examined system consists of an adaptive Wiener model-type estimator and an adaptive precompensator that linearizes the overall system. It is assumed that the nonlinear element is modelled by a polynomial. The parameters of the compensator, being ideally the inverse of the Wiener system, are estimated minimizing the mean square error defined as a difference between the delayed input signal and the system output. An adaptive algorithm for updating the parameters of the precompensator uses the stochastic gradient method. The proposed technique can be applied to the compensation of nonlinear distortions in electronic devices or electromechanical components, e.g., high power amplifiers in satellite communication channels, distortion reduction of loudspeakers, active noise cancellation. The charging process in diesel engines, considered by Aoubi [6], also reveals nonlinear input-output behavior and a strong dependence of its dynamics on the operating point. The SISO generalized Hammerstein model is defined as
$$\hat A(q^{-1})\hat y(n) = b_0 + \sum_{k=1}^{r} \hat B_k(q^{-1}) u^k(n). \tag{6.1}$$
Based on real experimental data, two-input one-output generalized Hammerstein model composed of a third order polynomial model of the nonlinear element and a second order linear system is derived. The inputs to the model are the engine speed and the fuel mass, the output is the loading pressure. The model contains 48 adjustable parameters, which are calculated with the least squares technique. Modelling continuous-time and discrete-time chaotic systems, including the typical Duffing, Henon, and Lozi systems, is another interesting application of Wiener models proposed by Chen et al. [28]. In this application, a single hidden layer neural network model of the nonlinear element, trained with the steepest descent algorithm, acts as a neural network controller of the linear part of the Wiener model. The identification of discrete-time chaotic systems using both Wiener and Hammerstein neural network models was also studied by Xu et al. [168]. Different control strategies for Hammerstein systems have been considered by several authors, see [13] for adaptive predictive control based on the polynomial Hammerstein model, and [124, 125] for minimum-time dead-beat controllers. The application of the MIMO Hammerstein model to an iron oxide pellet cooling process used in pyrometallurgical installations was studied by Pomerleau et al. [140]. For the cooling zone of an induration furnace, they used a two-input two-output Hammerstein model with decoupled models of nonlinear elements and a coupled model of the linear dynamic system. Based on the empirical Hammerstein model, a linear MPC strategy was developed and compared with a phenomenological nonlinear MPC. The estimation of parameter changes in Wiener and Hammerstein systems in the context of fault detection and isolation is considered in [78, 79, 91, 92], see Section 6.2 for more details. 
The identification of steam pressure dynamics in a five stage sugar evaporator is studied in [82, 91], more details are given in Section 6.3. A theoretical model of the muscle relaxation process that expresses the interaction between the drug dose and the relaxation effect can be described as a continuous-time Wiener model [33]. This Wiener model consists of a third order linear dynamic system plus dead time followed by static nonlinearity. The dead time is introduced to model the transport delay of the drug. Quaglini et al. [141] used a Wiener model, composed of the ARX model and a polynomial model of the nonlinear element, for the relaxation function of soft biological tissues. The identification is made by iterative minimization of two different cost functions that are linear in the parameters of the linear dynamic model and the nonlinear element model, respectively. Wiener and Hammerstein models have also been applied to the modelling of some other biological systems. A brief survey of such applications is given by Hunter and Korenberg [71].
6.2 Fault detection and isolation with Wiener and Hammerstein models

Model-based fault detection and isolation (FDI) methods require both a nominal model of a system, i.e., a model of the system in its normal operating conditions, and models of the faulty system to be available [30, 38, 135]. A nominal model of the system is used in a fault detection step to generate residuals, defined as the difference between the output signals of the system and its model. An analysis of these residuals answers the question of whether any fault has occurred. If a fault occurs, a fault isolation step is performed in a similar way by analyzing residual sequences generated with the models of the faulty system (see Fig. 6.2).

In the case of complex industrial systems, e.g., a five-stage sugar evaporation station, the above procedure can be even more useful if it is applied not only to the overall system but also to its chosen sub-modules. Designing an FDI system with such an approach may result in both high fault detection sensitivity and high fault isolation reliability.

The problem considered here can be stated as follows: Given system input and output sequences and knowing the nominal models of Wiener or Hammerstein systems, generate a sequence of residuals and process this sequence to detect and isolate all changes of system parameters caused by any system fault. Both abrupt (step-like) and incipient (slowly developing) faults are to be considered.

Assume that the nominal models of a Wiener or Hammerstein system defined by the nonlinear function $f(\cdot)$ and the polynomials $A(q^{-1})$ and $B(q^{-1})$ are known. These models describe the systems in their normal operating conditions with no malfunctions (faults). Moreover, assume that at the time k a step-like fault occurred and caused a change in the mathematical model of the system. This change can be expressed in terms of additive components of pulse transfer function polynomials.
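The polynomials $\mathbf{A}(q^{-1})$ and $\mathbf{B}(q^{-1})$ of the pulse transfer function $\mathbf{B}(q^{-1})/\mathbf{A}(q^{-1})$ of the faulty system have the form:

$$\mathbf{A}(q^{-1}) = A(q^{-1}) + \Delta A(q^{-1}), \tag{6.2}$$

$$\mathbf{B}(q^{-1}) = B(q^{-1}) + \Delta B(q^{-1}), \tag{6.3}$$

where

$$A(q^{-1}) = 1 + a_1 q^{-1} + \cdots + a_{na} q^{-na}, \tag{6.4}$$

$$B(q^{-1}) = b_1 q^{-1} + \cdots + b_{nb} q^{-nb}, \tag{6.5}$$

$$\Delta A(q^{-1}) = \alpha_1 q^{-1} + \cdots + \alpha_{na} q^{-na}, \tag{6.6}$$

$$\Delta B(q^{-1}) = \beta_1 q^{-1} + \cdots + \beta_{nb} q^{-nb}. \tag{6.7}$$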
The characteristic of the static nonlinear element $g(\cdot)$ in the faulty state can be expressed as a sum of $f(\cdot)$ and its change $\Delta f(\cdot)$:

$$g\big(u(n)\big) = f\big(u(n)\big) + \Delta f\big(u(n)\big), \tag{6.8}$$

where

$$\Delta f\big(u(n)\big) = \mu_0 + \mu_2 u^2(n) + \mu_3 u^3(n) + \cdots. \tag{6.9}$$

Fig. 6.2. FDI with the nominal model and a bank of models of the faulty system
For Wiener systems, it is assumed that both $f(\cdot)$ and $g(\cdot)$ are invertible. The inverse nonlinear function of the faulty system can be written as a sum of the inverse function $f^{-1}(\cdot)$ and its change $\Delta f^{-1}(\cdot)$:

$$g^{-1}\big(y(n)\big) = f^{-1}\big(y(n)\big) + \Delta f^{-1}\big(y(n)\big), \tag{6.10}$$

where

$$\Delta f^{-1}\big(y(n)\big) = \gamma_0 + \gamma_2 y^2(n) + \gamma_3 y^3(n) + \cdots. \tag{6.11}$$
Note that $\Delta f^{-1}(\cdot)$ here does not denote the inverse of $\Delta f(\cdot)$ but only a change in the inverse nonlinear characteristic.

6.2.1 Definitions of residuals

The residual $e(n)$ is defined as a difference between the output of the system $y(n)$ and the output of its nominal model $\hat y(n)$:

$$e(n) = y(n) - \hat y(n). \tag{6.12}$$
Although both series-parallel and parallel Wiener and Hammerstein models can be employed as nominal models for residual generation, we will show that the major advantage of series-parallel ones is that they can be transformed into the linear-in-parameters form. With the nominal series-parallel Hammerstein model (Fig. 6.3) defined by the following expression:

$$\hat y(n) = \big[1 - A(q^{-1})\big] y(n) + B(q^{-1}) f\big(u(n)\big), \tag{6.13}$$

we have

$$e(n) = A(q^{-1}) y(n) - B(q^{-1}) f\big(u(n)\big). \tag{6.14}$$

Fig. 6.3. Generation of residuals using the series-parallel Hammerstein model
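The output of the faulty Hammerstein system, with the faulty-system polynomials $\mathbf{A}(q^{-1})$ and $\mathbf{B}(q^{-1})$ of (6.2) and (6.3), is

$$y(n) = \frac{\mathbf{B}(q^{-1})}{\mathbf{A}(q^{-1})}\, g\big(u(n)\big) + \varepsilon(n). \tag{6.15}$$

Thus, taking into account (6.2), (6.3) and (6.8), (6.15) can be written in the following form:

$$y(n) = \big[1 - A(q^{-1}) - \Delta A(q^{-1})\big] y(n) + \big[B(q^{-1}) + \Delta B(q^{-1})\big]\big[f\big(u(n)\big) + \Delta f\big(u(n)\big)\big] + \mathbf{A}(q^{-1})\varepsilon(n). \tag{6.16}$$

Next, assume that the disturbance $\varepsilon(n)$ is

$$\varepsilon(n) = \frac{\epsilon(n)}{\mathbf{A}(q^{-1})}, \tag{6.17}$$

where $\epsilon(n)$ is a zero-mean white noise. Now, the substitution of (6.17) into (6.16), and of (6.13) and (6.16) into (6.12), results in a residual equation expressed in terms of changes of the polynomials $A(q^{-1})$, $B(q^{-1})$ and the function $f\big(u(n)\big)$:

$$e(n) = -\Delta A(q^{-1}) y(n) + \Delta B(q^{-1}) f\big(u(n)\big) + \big[B(q^{-1}) + \Delta B(q^{-1})\big] \Delta f\big(u(n)\big) + \epsilon(n). \tag{6.18}$$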
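The residual (6.14) and the statement of (6.18) for the fault-free case can be illustrated with a short simulation. This is a sketch under stated assumptions: the helper names are hypothetical, zero initial conditions are used, and the polynomial coefficients are those of the nominal and faulty systems of Example 6.1, with the nonlinearity left unchanged so that only the linear part is faulty.

```python
import numpy as np

def apply_poly(poly, x):
    """Apply a polynomial in q^-1, poly = [c0, c1, ...], to x (zero initial conditions)."""
    return np.convolve(poly, x)[:len(x)]

def simulate_hammerstein(u, a, b, g, eps):
    """Solve A(q^-1) y(n) = B(q^-1) g(u(n)) + eps(n) for y, with a[0] = 1.

    This corresponds to (6.15) with the disturbance (6.17).
    """
    rhs = apply_poly(b, g(u)) + eps
    y = np.zeros(len(u))
    for n in range(len(u)):
        y[n] = rhs[n] - sum(a[i] * y[n - i] for i in range(1, len(a)) if n >= i)
    return y

def series_parallel_residual(u, y, a, b, f):
    """Residual (6.14): e(n) = A(q^-1) y(n) - B(q^-1) f(u(n))."""
    return apply_poly(a, y) - apply_poly(b, f(u))

rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, 500)
eps = 0.01 * rng.standard_normal(500)           # white noise of (6.17)
a_nom, b_nom = [1.0, -1.3231, 0.4346], [0.0, 0.63467, 0.48069]
a_flt, b_flt = [1.0, -1.6375, 0.6703], [0.0, 0.1752, 0.1534]

e_ok = series_parallel_residual(
    u, simulate_hammerstein(u, a_nom, b_nom, np.tanh, eps), a_nom, b_nom, np.tanh)
e_fault = series_parallel_residual(
    u, simulate_hammerstein(u, a_flt, b_flt, np.tanh, eps), a_nom, b_nom, np.tanh)
```

Without a fault, all the change terms in (6.18) vanish and the residual reduces to the white noise sequence. With the faulty linear part, the residual carries the terms in $\Delta A$ and $\Delta B$ and its variance grows accordingly.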
Similar deliberations for the parallel Hammerstein model (Fig. 6.4) give

$$e(n) = \frac{A(q^{-1})\Delta B(q^{-1}) - B(q^{-1})\Delta A(q^{-1})}{A(q^{-1})\big[A(q^{-1}) + \Delta A(q^{-1})\big]}\, f\big(u(n)\big) + \frac{B(q^{-1}) + \Delta B(q^{-1})}{A(q^{-1}) + \Delta A(q^{-1})}\, \Delta f\big(u(n)\big) + \varepsilon(n). \tag{6.19}$$

Fig. 6.4. Generation of residuals using the parallel Hammerstein model
In the case of Wiener systems, residual generation with the series-parallel Wiener model is more complicated as both the model of the nonlinear element and the model of the inverse nonlinear element are used (Fig. 6.5). A simpler form of the series-parallel model can be obtained if we assume that the function $f(\cdot)$ is invertible and introduce the following modified definition of the residual [78]:

$$e(n) = f^{-1}\big(y(n)\big) - f^{-1}\big(\hat y(n)\big). \tag{6.20}$$

The nominal parallel Wiener model can be expressed as

$$\hat y(n) = f\left(\frac{B(q^{-1})}{A(q^{-1})}\, u(n)\right). \tag{6.21}$$

The output of the faulty Wiener system, with the faulty-system polynomials $\mathbf{A}(q^{-1})$ and $\mathbf{B}(q^{-1})$ of (6.2) and (6.3), is

$$y(n) = g\left(\frac{\mathbf{B}(q^{-1})}{\mathbf{A}(q^{-1})}\, u(n) + \varepsilon(n)\right). \tag{6.22}$$

From (6.21), it follows that the output of the linear dynamic model is

$$f^{-1}\big(\hat y(n)\big) = \frac{B(q^{-1})}{A(q^{-1})}\, u(n). \tag{6.23}$$

Equation (6.23) can be written in the form

$$f^{-1}\big(\hat y(n)\big) = \big[1 - A(q^{-1})\big] f^{-1}\big(\hat y(n)\big) + B(q^{-1}) u(n). \tag{6.24}$$
Fig. 6.5. Generation of residuals using the series-parallel Wiener model

Fig. 6.6. Generation of residuals using the modified series-parallel Wiener model
Therefore, the modified series-parallel model can be defined as follows:

$$f^{-1}\big(\hat y(n)\big) = \big[1 - A(q^{-1})\big] f^{-1}\big(y(n)\big) + B(q^{-1}) u(n). \tag{6.25}$$

The generation of residuals with the modified series-parallel Wiener model is illustrated in Fig. 6.6. Taking into account (6.2), (6.3) and (6.10), it follows from (6.22) that the signal $f^{-1}\big(y(n)\big)$ for the faulty Wiener system can be expressed as

$$f^{-1}\big(y(n)\big) = \big[1 - A(q^{-1}) - \Delta A(q^{-1})\big]\big[f^{-1}\big(y(n)\big) + \Delta f^{-1}\big(y(n)\big)\big] + \big[B(q^{-1}) + \Delta B(q^{-1})\big] u(n) - \Delta f^{-1}\big(y(n)\big) + \big[A(q^{-1}) + \Delta A(q^{-1})\big]\varepsilon(n). \tag{6.26}$$

Hence, assuming that the disturbance $\varepsilon(n)$ is given by (6.17), the residual (6.20) is

$$e(n) = -\Delta A(q^{-1}) f^{-1}\big(y(n)\big) + \Delta B(q^{-1}) u(n) - \big[A(q^{-1}) + \Delta A(q^{-1})\big] \Delta f^{-1}\big(y(n)\big) + \epsilon(n). \tag{6.27}$$
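Combining (6.20) with (6.25) gives $e(n) = A(q^{-1}) f^{-1}\big(y(n)\big) - B(q^{-1}) u(n)$, which is easy to evaluate from measured data. The sketch below illustrates this for a fault-free Wiener system. It rests on several assumptions that are not from the text: the helper names are hypothetical, $f = \tanh$ is chosen as the invertible nonlinearity with the nominal polynomials of Example 6.1, and the input amplitude is kept small so that the output stays well inside the range where the inverse is numerically well behaved.

```python
import numpy as np

def apply_poly(poly, x):
    """Apply a polynomial in q^-1, poly = [c0, c1, ...], to x (zero initial conditions)."""
    return np.convolve(poly, x)[:len(x)]

def simulate_wiener(u, a, b, g, eps):
    """y = g(s), where A(q^-1) s(n) = B(q^-1) u(n) + eps(n), cf. (6.22) and (6.17)."""
    rhs = apply_poly(b, u) + eps
    s = np.zeros(len(u))
    for n in range(len(u)):
        s[n] = rhs[n] - sum(a[i] * s[n - i] for i in range(1, len(a)) if n >= i)
    return g(s)

def modified_residual(u, y, a, b, f_inv):
    """Residual (6.20) with the model (6.25): e(n) = A f^-1(y(n)) - B u(n)."""
    return apply_poly(a, f_inv(y)) - apply_poly(b, u)

rng = np.random.default_rng(0)
u = rng.uniform(-0.1, 0.1, 500)                 # small input: y stays away from +-1
eps = 1e-3 * rng.standard_normal(500)           # white noise of (6.17)
a_nom, b_nom = [1.0, -1.3231, 0.4346], [0.0, 0.63467, 0.48069]

y = simulate_wiener(u, a_nom, b_nom, np.tanh, eps)
e = modified_residual(u, y, a_nom, b_nom, np.arctanh)
```

In the fault-free case, every change term in (6.27) is zero, so the computed residual coincides with the white noise sequence up to rounding.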
A major advantage of both the series-parallel Hammerstein model and the modified series-parallel Wiener model is that their residual equations (6.18) and (6.27), after a proper change of parameterization, can be written in the linear-in-parameters form. Then these redefined parameters can be estimated with linear regression methods.

6.2.2 Hammerstein system. Parameter estimation of the residual equation

Assume that the function $\Delta f(\cdot)$ describing a change of the steady-state characteristic $f(\cdot)$ caused by a fault has the form of a polynomial of the order r:

$$\Delta f\big(u(n)\big) = \mu_0 + \mu_2 u^2(n) + \cdots + \mu_r u^r(n). \tag{6.28}$$

Thus, the residual equation (6.18) can be written in the following linear-in-parameters form:

$$e(n) = -\sum_{j=1}^{na} \alpha_j y(n-j) + \sum_{j=1}^{nb} \beta_j f\big(u(n-j)\big) + d_0 + \sum_{k=2}^{r} \sum_{j=1}^{nb} d_{kj} u^k(n-j) + \epsilon(n), \tag{6.29}$$

where

$$d_0 = \mu_0 \sum_{j=1}^{nb} (b_j + \beta_j), \tag{6.30}$$

$$d_{kj} = \mu_k (b_j + \beta_j). \tag{6.31}$$

Note that (6.29) has $M = na + nb + nb(r-1) + 1$ unknown parameters $\alpha_j$, $\beta_j$, $d_{kj}$ and $d_0$. In a disturbance-free case ($\epsilon(n) = 0$), these parameters can be calculated by performing $N = M$ measurements of the input and output signals and solving the following set of linear equations:

$$e(n) = -\sum_{j=1}^{na} \alpha_j y(n-j) + \sum_{j=1}^{nb} \beta_j f\big(u(n-j)\big) + d_0 + \sum_{k=2}^{r} \sum_{j=1}^{nb} d_{kj} u^k(n-j), \quad n = 1, 2, \ldots, M. \tag{6.32}$$
In practice, it is more realistic to assume that the system output is disturbed by the disturbance (6.17). Now, performing $N \gg M$ measurements of the input and output signals, the parameters of the residual equation (6.29) can be estimated using the least squares method with the parameter vector $\hat\theta(n)$ and the regression vector $x(n)$ defined as follows:
$$\begin{aligned}
\hat\theta(n) &= \big[\hat\alpha_1 \ldots \hat\alpha_{na}\ \ \hat\beta_1 \ldots \hat\beta_{nb}\ \ \hat d_0\ \ \hat d_{21} \ldots \hat d_{2nb}\ \ldots\ \hat d_{r1} \ldots \hat d_{rnb}\big]^T,\\
x(n) &= \big[-y(n-1) \ldots -y(n-na)\ \ f\big(u(n-1)\big) \ldots f\big(u(n-nb)\big)\ \ 1\ \ u^2(n-1) \ldots u^2(n-nb)\ \ldots\ u^r(n-1) \ldots u^r(n-nb)\big]^T.
\end{aligned} \tag{6.33}$$
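As a minimal sketch, the regression vector $x(n)$ of (6.33) can be assembled as follows. The function name, the 0-based sample indexing, and the default choice of `math.tanh` for the nominal nonlinearity (taken from Example 6.1) are assumptions made for illustration.

```python
import math


def hammerstein_regressor(y, u, n, na, nb, r, f=math.tanh):
    """Return x(n) of (6.33) as a list, for a 0-based index n >= max(na, nb).

    y, u : sequences of output and input samples
    f    : nominal static nonlinearity (assumed known from the nominal model)
    """
    x = [-y[n - j] for j in range(1, na + 1)]       # -y(n-1) ... -y(n-na)
    x += [f(u[n - j]) for j in range(1, nb + 1)]    # f(u(n-1)) ... f(u(n-nb))
    x.append(1.0)                                   # regressor of d_0
    for k in range(2, r + 1):                       # u^k(n-1) ... u^k(n-nb)
        x += [u[n - j] ** k for j in range(1, nb + 1)]
    return x
```

The returned list has $na + nb + nb(r-1) + 1$ entries, which matches the parameter count $M$ given for (6.29).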
The vector $\hat\theta(n)$ can be calculated on-line using the recursive least squares algorithm:
$$\hat\theta(n) = \hat\theta(n-1) + P(n)x(n)\big[e(n) - x^T(n)\hat\theta(n-1)\big], \tag{6.34}$$
$$P(n) = P(n-1) - \frac{P(n-1)x(n)x^T(n)P(n-1)}{1 + x^T(n)P(n-1)x(n)}, \tag{6.35}$$
with $\hat\theta(0) = 0$ and $P(0) = \alpha I$, where $\alpha \gg 1$ and $I$ is an identity matrix. Parameter estimation of the residual equation (6.29) often results in asymptotically biased estimates. That is the case if system additive output disturbances are not given by (6.17) but have the property of zero-mean white noise, i.e., $\varepsilon(n) = \epsilon(n)$. In this case, the residual equation (6.18) has the form
$$e(n) = -\Delta A(q^{-1})y(n) + \Delta B(q^{-1})f\big(u(n)\big) + \big[B(q^{-1}) + \Delta B(q^{-1})\big]\Delta f\big(u(n)\big) + \big[A(q^{-1}) + \Delta A(q^{-1})\big]\epsilon(n). \tag{6.36}$$
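The recursion (6.34)–(6.35) with the initialization $\hat\theta(0) = 0$, $P(0) = \alpha I$ can be sketched in plain Python as below. The function names and the default value of `alpha` are assumptions made for illustration; the text only requires $\alpha \gg 1$.

```python
def rls_step(theta, P, x, e):
    """One update of (6.34)-(6.35). theta, x are lists; P is a list of rows."""
    m = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(m)) for i in range(m)]  # P(n-1) x(n)
    denom = 1.0 + sum(x[i] * Px[i] for i in range(m))               # 1 + x^T P(n-1) x
    # P(n-1) is symmetric here, so x^T(n) P(n-1) equals (P(n-1) x(n))^T.
    P_new = [[P[i][j] - Px[i] * Px[j] / denom for j in range(m)] for i in range(m)]
    err = e - sum(x[i] * theta[i] for i in range(m))                # e(n) - x^T theta(n-1)
    gain = [sum(P_new[i][j] * x[j] for j in range(m)) for i in range(m)]  # P(n) x(n)
    theta_new = [theta[i] + gain[i] * err for i in range(m)]
    return theta_new, P_new


def rls(samples, m, alpha=1e6):
    """Run RLS over an iterable of (x, e) pairs with theta(0) = 0, P(0) = alpha * I."""
    theta = [0.0] * m
    P = [[alpha if i == j else 0.0 for j in range(m)] for i in range(m)]
    for x, e in samples:
        theta, P = rls_step(theta, P, x, e)
    return theta
```

The symmetry shortcut is valid because $P(0) = \alpha I$ is symmetric and the update (6.35) preserves symmetry.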
The term $[A(q^{-1}) + \Delta A(q^{-1})]\epsilon(n)$ in (6.36) makes the parameter estimates asymptotically biased and the residuals $e(n) - x^T(n)\hat\theta(n-1)$ correlated. To obtain unbiased parameter estimates, other known parameter estimation methods, such as the instrumental variables method, the generalized least squares method, or the extended least squares method, can be employed [35, 149]. Using the extended least squares method, the vectors $\hat\theta(n)$ and $x(n)$ are defined as follows:
$$\hat\theta(n) = \big[\hat\alpha_1 \ldots \hat\alpha_{na}\ \ \hat\beta_1 \ldots \hat\beta_{nb}\ \ \hat d_0\ \ \hat d_{21} \ldots \hat d_{2nb}\ \ldots\ \hat d_{r1} \ldots \hat d_{rnb}\ \ \hat c_1 \ldots \hat c_{na}\big]^T, \tag{6.37}$$
$$x(n) = \big[-y(n-1) \ldots -y(n-na)\ \ f\big(u(n-1)\big) \ldots f\big(u(n-nb)\big)\ \ 1\ \ u^2(n-1) \ldots u^2(n-nb)\ \ldots\ u^r(n-1) \ldots u^r(n-nb)\ \ \hat e(n-1) \ldots \hat e(n-na)\big]^T, \tag{6.38}$$
where $\hat e(n)$ denotes the one step ahead prediction error of $e(n)$:
$$\hat e(n) = e(n) - x^T(n)\hat\theta(n-1). \tag{6.39}$$
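A minimal sketch of the recursive extended least squares idea behind (6.37)–(6.39) is given below: the base regressor is extended with past prediction errors, and the same update as in (6.34)–(6.35) is applied to the extended vector. The function name, the argument `n_c` (equal to $na$ in (6.38)), the zero initial values of the past prediction errors, and the default `alpha` are assumptions made for illustration.

```python
from collections import deque


def rels(base_regressors, residuals, n_c, alpha=1e6):
    """Recursive extended least squares sketch for (6.37)-(6.39).

    base_regressors[n] : the regressor without the e_hat terms, as in (6.33)
    residuals[n]       : the residual e(n)
    n_c                : number of past prediction errors appended (na in (6.38))
    """
    m = len(base_regressors[0]) + n_c
    theta = [0.0] * m
    P = [[alpha if i == j else 0.0 for j in range(m)] for i in range(m)]
    past = deque([0.0] * n_c, maxlen=n_c)   # e_hat(n-1), ..., e_hat(n-n_c)
    for xb, e in zip(base_regressors, residuals):
        x = list(xb) + list(past)                                   # (6.38)
        e_hat = e - sum(xi * ti for xi, ti in zip(x, theta))        # (6.39)
        Px = [sum(P[i][j] * x[j] for j in range(m)) for i in range(m)]
        denom = 1.0 + sum(xi * pi for xi, pi in zip(x, Px))
        P = [[P[i][j] - Px[i] * Px[j] / denom for j in range(m)] for i in range(m)]
        gain = [sum(P[i][j] * x[j] for j in range(m)) for i in range(m)]
        theta = [t + g * e_hat for t, g in zip(theta, gain)]
        past.appendleft(e_hat)              # the oldest error drops out automatically
    return theta
```

The last `n_c` entries of the returned vector correspond to $\hat c_1, \ldots, \hat c_{na}$ in (6.37).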
Example 6.1. The nominal Hammerstein model composed of the linear dynamic system
$$\frac{B(q^{-1})}{A(q^{-1})} = \frac{0.63467q^{-1} + 0.48069q^{-2}}{1 - 1.3231q^{-1} + 0.4346q^{-2}} \tag{6.40}$$
and the nonlinear element
$$f\big(u(n)\big) = \tanh\big(u(n)\big) \tag{6.41}$$
was used in the simulation example. The faulty Hammerstein system is described as
$$\frac{B(q^{-1})}{A(q^{-1})} = \frac{0.1752q^{-1} + 0.1534q^{-2}}{1 - 1.6375q^{-1} + 0.6703q^{-2}}, \tag{6.42}$$
$$g\big(u(n)\big) = \tanh\big(u(n)\big) - 0.25u^2(n) - 0.2u^3(n) + 0.15u^4(n) - 0.1u^5(n) + 0.05u^6(n) - 0.025u^7(n) + 0.0125u^8(n). \tag{6.43}$$
The steady-state characteristics of both the nominal model and the faulty Hammerstein system are shown in Fig. 6.7. The system input was excited with a sequence of N = 25000 pseudorandom numbers of uniform distribution in the interval (−1.5, 1.5). The system output was disturbed by the disturbance $\varepsilon(n)$ defined as
$$\varepsilon(n) = \frac{\epsilon(n)}{A(q^{-1})} \tag{6.44}$$
or
$$\varepsilon(n) = \epsilon(n), \tag{6.45}$$
where $\epsilon(n)$ are the values of a Gaussian pseudorandom sequence N(0, 0.05). For the system disturbed by (6.44), the parameters of (6.29) were estimated with the RLS method (the RLSI example). In the case of the disturbance (6.45), the parameter estimation was performed using both the RLS and RELS methods (the RLSII and RELSII examples). The obtained parameter estimates are given in Table 6.1. The identification error of $\Delta f(u(n))$ is shown in Fig. 6.8. A comparison of estimation accuracy measured by four different indices, defining the estimation accuracy of the residuals $e(n)$, the function $\Delta f(u(n))$, the parameters $\mu_j$, $j = 2, \ldots, 8$, and the parameters $\alpha_j$ and $\beta_j$, $j = 1, 2$, is given in Table 6.2. The highest accuracy is obtained in the RLSI example for the system disturbed by (6.44). For the system disturbed with (6.45), a comparison of parameter estimates and indices confirms the asymptotic bias of the parameter estimates obtained in the RLSII example.
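As a quick consistency check, the true parameter changes of Example 6.1 follow from the coefficient differences between the faulty system (6.42) and the nominal model (6.40). The sketch below assumes that $\alpha_j$ and $\beta_j$ are these differences, which is how $\Delta A$ and $\Delta B$ enter (6.29); the variable names are chosen for illustration only.

```python
# Denominator and numerator coefficients (a_1, a_2) and (b_1, b_2).
a_nominal, b_nominal = [-1.3231, 0.4346], [0.63467, 0.48069]   # (6.40)
a_faulty, b_faulty = [-1.6375, 0.6703], [0.1752, 0.1534]       # (6.42)

# Parameter changes: faulty minus nominal coefficients.
alpha = [af - an for af, an in zip(a_faulty, a_nominal)]
beta = [bf - bn for bf, bn in zip(b_faulty, b_nominal)]

# Polynomial coefficients of Delta f = g - f, read off from (6.43) and (6.41).
mu = {2: -0.25, 3: -0.2, 4: 0.15, 5: -0.1, 6: 0.05, 7: -0.025, 8: 0.0125}

print([f"{v:.4f}" for v in alpha], [f"{v:.4f}" for v in beta])
```

The differences agree with the true values of $\alpha_j$ and $\beta_j$ listed in Table 6.1 to within one unit in the fourth decimal place, and the true values of $c_1$, $c_2$ in that table coincide with the denominator coefficients of (6.42).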
Fig. 6.7. Hammerstein system. Nonlinear functions $f(u(n))$ and $g(u(n))$ (plotted versus $u(n)$)
Fig. 6.8. Identification error of the function $\Delta f(u(n))$ (the error $\Delta\hat f(u(n)) - \Delta f(u(n))$ plotted versus $u(n)$ for the RLSI, RLSII and RELSII examples)
Table 6.1. Parameter estimates

Disturbance: RLSI: $\varepsilon(n) = \epsilon(n)/A(q^{-1})$; RLSII: $\varepsilon(n) = \epsilon(n)$; RELSII: $\varepsilon(n) = \epsilon(n)$

Parameter   True value   RLSI      RLSII     RELSII
α1          −0.3144      −0.3151   −0.1478   −0.3062
α2           0.2357       0.2369    0.0708    0.2269
β1          −0.4594      −0.4616   −0.4664   −0.4625
β2          −0.3273      −0.3305   −0.2972   −0.3278
µ2          −0.2500      −0.2431   −0.2871   −0.3124
µ3          −0.2000      −0.1770   −0.1362   −0.1671
µ4           0.1500       0.1440    0.2431    0.2908
µ5          −0.1000      −0.1202   −0.1460   −0.1275
µ6           0.0500       0.0498   −0.0372   −0.0604
µ7          −0.0250      −0.0210   −0.0183   −0.0202
µ8           0.0125       0.0138    0.0387    0.0405
c1          −1.6375                           −1.1253
c2           0.6703                            0.1341
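Table 6.2. Comparison of estimation accuracy

Disturbance: RLSI: $\varepsilon(n) = \epsilon(n)/A(q^{-1})$; RLSII: $\varepsilon(n) = \epsilon(n)$; RELSII: $\varepsilon(n) = \epsilon(n)$

Index                                                                          RLSI                    RLSII                   RELSII
$\frac{1}{N}\sum_{n=1}^{N}\big(e(n) - \hat e(n)\big)^2$                        $2.82 \times 10^{-3}$   $9.62 \times 10^{-3}$   $4.49 \times 10^{-3}$
$\frac{1}{400}\sum_{n=1}^{400}\big(f(u(n)) - \hat f(u(n))\big)^2$              $3.69 \times 10^{-5}$   $8.77 \times 10^{-4}$   $1.37 \times 10^{-4}$
$\frac{1}{7}\sum_{j=2}^{8}(\mu_j - \hat\mu_j)^2$                               $3.08 \times 10^{-5}$   $3.45 \times 10^{-4}$   $9.87 \times 10^{-5}$
$\frac{1}{4}\sum_{j=1}^{2}\big[(a_j - \hat a_j)^2 + (b_j - \hat b_j)^2\big]$   $4.07 \times 10^{-6}$   $1.40 \times 10^{-2}$   $3.88 \times 10^{-5}$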
6.2.3 Wiener system. Parameter estimation of the residual equation

Assume that the function $\Delta f^{-1}(\cdot)$ describing a change in the inverse steady-state characteristic caused by a fault has the form of a polynomial of the order $r$:
$$\Delta f^{-1}\big(y(n)\big) = \gamma_0 + \gamma_2 y^2(n) + \cdots + \gamma_r y^r(n). \tag{6.46}$$
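As in the case of the Hammerstein system, the residual equation (6.27) can be expressed in the linear-in-parameters form:
$$e(n) = -\sum_{j=1}^{na}\alpha_j f^{-1}\big(y(n-j)\big) + \sum_{j=1}^{nb}\beta_j u(n-j) + d_0 + \sum_{k=2}^{r}\sum_{j=0}^{na} d_{kj}y^k(n-j) + \epsilon(n), \tag{6.47}$$
where
$$d_0 = -\gamma_0\Big[1 + \sum_{j=1}^{na}(a_j + \alpha_j)\Big], \tag{6.48}$$
$$d_{kj} = \begin{cases} -\gamma_k, & j = 0,\\ -\gamma_k(a_j + \alpha_j), & j = 1, \ldots, na. \end{cases} \tag{6.49}$$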
For disturbance-free Wiener systems, the parameters $\alpha_j$, $\beta_j$, $d_{kj}$ and $d_0$ can be calculated solving a set of $M = na + nb + (na+1)(r-1) + 1$ linear equations. In the stochastic case, parameter estimation with the parameter vector $\hat\theta(n)$ and the regression vector $x(n)$ defined as
$$\hat\theta(n) = \big[\hat\alpha_1 \ldots \hat\alpha_{na}\ \ \hat\beta_1 \ldots \hat\beta_{nb}\ \ \hat d_0\ \ \hat d_{20} \ldots \hat d_{2na}\ \ldots\ \hat d_{r0} \ldots \hat d_{rna}\big]^T, \tag{6.50}$$
$$x(n) = \big[-f^{-1}\big(y(n-1)\big) \ldots -f^{-1}\big(y(n-na)\big)\ \ u(n-1) \ldots u(n-nb)\ \ 1\ \ y^2(n) \ldots y^2(n-na)\ \ldots\ y^r(n) \ldots y^r(n-na)\big]^T \tag{6.51}$$
can be performed with the LS or the RLS method. An alternative to the above procedure can be parameter estimation with the extended least squares (ELS) algorithm. In this case, the parameter vector $\hat\theta(n)$ and the regression vector are defined as
$$\hat\theta(n) = \big[\hat\alpha_1 \ldots \hat\alpha_{na}\ \ \hat\beta_1 \ldots \hat\beta_{nb}\ \ \hat d_0\ \ \hat d_{20} \ldots \hat d_{2na}\ \ldots\ \hat d_{r0} \ldots \hat d_{rna}\ \ \hat c_1 \ldots \hat c_{na}\big]^T, \tag{6.52}$$
$$x(n) = \big[-f^{-1}\big(y(n-1)\big) \ldots -f^{-1}\big(y(n-na)\big)\ \ u(n-1) \ldots u(n-nb)\ \ 1\ \ y^2(n) \ldots y^2(n-na)\ \ldots\ y^r(n) \ldots y^r(n-na)\ \ \hat e(n-1) \ldots \hat e(n-na)\big]^T, \tag{6.53}$$
where $\hat e(n)$ denotes the one step ahead prediction error of $e(n)$:
$$\hat e(n) = e(n) - x^T(n)\hat\theta(n-1). \tag{6.54}$$
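The parameter counts quoted for the two residual equations grow quickly with the polynomial order $r$. A small sketch of the two formulas is given below; the function names are hypothetical, and the orders used in the usage note ($na = nb = 2$ with $r = 8$ and $r = 11$) are read off the transfer functions and polynomial orders of Examples 6.1 and 6.2.

```python
def hammerstein_param_count(na, nb, r):
    """M = na + nb + nb(r-1) + 1 unknowns of the Hammerstein residual equation (6.29)."""
    return na + nb + nb * (r - 1) + 1


def wiener_param_count(na, nb, r):
    """M = na + nb + (na+1)(r-1) + 1 unknowns of the Wiener residual equation (6.47)."""
    return na + nb + (na + 1) * (r - 1) + 1
```

For second-order linear parts, `hammerstein_param_count(2, 2, 8)` gives 19 and `wiener_param_count(2, 2, 11)` gives 35, so the regression in the Wiener case is considerably larger.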
The parameters obtained with the LS method or the ELS method are asymptotically biased. This comes from the well-known property of the least squares method stating that to obtain consistent parameter estimates, the regression vector $x(n)$ should be uncorrelated with the system disturbance $\epsilon(n)$, i.e., $E[x(n)\epsilon(n)] = 0$. Obviously, this condition is not fulfilled for the model (6.47) as the powered system outputs $y^2(n), \ldots, y^r(n)$ depend on $\epsilon(n)$. A common way to obtain consistent parameter estimates in such cases is the application of the instrumental variable method (IV) or its recursive version (RIV). Instrumental variables should be chosen to be correlated with regressors and uncorrelated with the system disturbance $\epsilon(n)$. Although different choices of instrumental variables can be made, replacing $y^2(n), \ldots, y^r(n-na)$ with their approximated values obtained by filtering $u(n)$ through the nominal model is a good choice. This leads to the following estimation procedure:

1. Simulate the nominal model (6.21) to obtain $\hat y(n)$.
2. Estimate the parameters of (6.47) using the IV or the RIV method with the instrumental variables
$$z(n) = \big[-f^{-1}\big(y(n-1)\big) \ldots -f^{-1}\big(y(n-na)\big)\ \ u(n-1) \ldots u(n-nb)\ \ 1\ \ \hat y^2(n) \ldots \hat y^2(n-na)\ \ldots\ \hat y^r(n) \ldots \hat y^r(n-na)\big]^T.$$
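The second step of the procedure can be sketched with a textbook recursive instrumental variable update. The text does not spell out the RIV recursion, so the form used here, the RLS recursion with $z(n)$ in place of $x(n)$ in the gain, is an assumption, as are the function names and the default `alpha`.

```python
def riv_step(theta, P, x, z, e):
    """One recursive IV update (assumed standard form).

    K(n)     = P(n-1) z(n) / (1 + x^T(n) P(n-1) z(n))
    theta(n) = theta(n-1) + K(n) [e(n) - x^T(n) theta(n-1)]
    P(n)     = P(n-1) - K(n) x^T(n) P(n-1)
    """
    m = len(x)
    Pz = [sum(P[i][j] * z[j] for j in range(m)) for i in range(m)]   # P(n-1) z(n)
    xP = [sum(x[i] * P[i][j] for i in range(m)) for j in range(m)]   # x^T(n) P(n-1)
    denom = 1.0 + sum(x[i] * Pz[i] for i in range(m))
    K = [pz / denom for pz in Pz]
    err = e - sum(xi * ti for xi, ti in zip(x, theta))
    theta_new = [t + k * err for t, k in zip(theta, K)]
    P_new = [[P[i][j] - K[i] * xP[j] for j in range(m)] for i in range(m)]
    return theta_new, P_new


def riv(regressors, instruments, residuals, alpha=1e6):
    """Run the recursion with theta(0) = 0 and P(0) = alpha * I."""
    m = len(regressors[0])
    theta = [0.0] * m
    P = [[alpha if i == j else 0.0 for j in range(m)] for i in range(m)]
    for x, z, e in zip(regressors, instruments, residuals):
        theta, P = riv_step(theta, P, x, z, e)
    return theta
```

With $z(n) = x(n)$ the recursion reduces to ordinary RLS. Note that $P(n)$ is in general not symmetric when $z(n) \neq x(n)$, which is why $x^T(n)P(n-1)$ is computed separately from $P(n-1)z(n)$.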
Example 6.2. The nominal Wiener model used in the simulation example consists of the linear dynamic system
$$\frac{B(q^{-1})}{A(q^{-1})} = \frac{0.5q^{-1} - 0.3q^{-2}}{1 - 1.5q^{-1} + 0.7q^{-2}} \tag{6.55}$$
and the nonlinear element of the following inverse nonlinear characteristic:
$$f^{-1}\big(y(n)\big) = y(n) - \frac{1}{6}y^3(n). \tag{6.56}$$
The faulty Wiener system is defined by the pulse transfer function
$$\frac{B(q^{-1})}{A(q^{-1})} = \frac{0.3q^{-1} - 0.2q^{-2}}{1 - 1.75q^{-1} + 0.85q^{-2}} \tag{6.57}$$
and the inverse nonlinear characteristic (Fig. 6.9) of the form
$$g^{-1}\big(y(n)\big) = \tan\big(y(n)\big). \tag{6.58}$$
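The true values of $\gamma_k$ in this example are the Maclaurin coefficients of $\Delta f^{-1}(y) = \tan(y) - (y - y^3/6)$. The sketch below computes them exactly with rational arithmetic; the function name and the series-division approach (solving $\tan \cdot \cos = \sin$ coefficient by coefficient) are choices made for illustration.

```python
from fractions import Fraction
from math import factorial


def tan_series(order):
    """Maclaurin coefficients t[0..order] of tan(y), from t * cos = sin."""
    sin = [Fraction((-1) ** (k // 2), factorial(k)) if k % 2 == 1 else Fraction(0)
           for k in range(order + 1)]
    cos = [Fraction((-1) ** (k // 2), factorial(k)) if k % 2 == 0 else Fraction(0)
           for k in range(order + 1)]
    t = []
    for k in range(order + 1):
        # cos[0] = 1, hence t[k] = sin[k] - sum_{i<k} t[i] * cos[k-i]
        t.append(sin[k] - sum(t[i] * cos[k - i] for i in range(k)))
    return t


f_inv = {1: Fraction(1), 3: Fraction(-1, 6)}          # nominal f^{-1}, (6.56)
gamma = [tk - f_inv.get(k, Fraction(0)) for k, tk in enumerate(tan_series(11))]
```

The odd-order coefficients come out as $\gamma_3 = 1/2$, $\gamma_5 = 2/15$, $\gamma_7 = 17/315$, $\gamma_9 = 62/2835$ and $\gamma_{11} = 1382/155925$, while $\gamma_1$ and all even-order coefficients vanish, which agrees with the true values listed in Table 6.4 after rounding to four decimals.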
The system was excited with a sequence of N = 50000 pseudorandom values of uniform distribution in (−1, 1). The system output was disturbed additively with the disturbances (6.45), with $\{\epsilon(n)\}$ a pseudorandom sequence, uniformly distributed in (−0.01, 0.01). Parameter estimation was performed with the RLS and RELS methods for both a disturbance-free case (the RLSI example) and the system disturbed by (6.45) (the RLSII and RELSII examples). The parameter estimates are given in Table 6.4, and Table 6.3 shows a comparison of estimation accuracy. The identification error of the function $\Delta f^{-1}(\cdot)$ is shown in Fig. 6.10. The assumed finite order of the polynomial (6.46), i.e., r = 11, is another source of the identification error, as the function $\Delta f^{-1}(\cdot)$ is represented exactly only by an infinite power series. A comparison of the results obtained in the RLSII example and the RELSII example shows the inconsistency of the RLS estimator.
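Table 6.3. Comparison of estimation accuracy

Disturbance: RLSI: $\varepsilon(n) = 0$; RLSII: $\varepsilon(n) = \epsilon(n)$; RELSII: $\varepsilon(n) = \epsilon(n)$

Index                                                                                          RLSI                    RLSII                   RELSII
$\frac{1}{N}\sum_{n=1}^{N}\big(e(n) - \hat e(n)\big)^2$                                        $8.07 \times 10^{-6}$   $1.64 \times 10^{-4}$   $8.60 \times 10^{-5}$
$\frac{1}{400}\sum_{n=1}^{400}\big(\Delta f^{-1}(\hat y(n)) - \Delta\hat f^{-1}(\hat y(n))\big)^2$   $1.39 \times 10^{-6}$   $1.29 \times 10^{-4}$   $3.68 \times 10^{-5}$
$\frac{1}{10}\sum_{j=2}^{11}(\gamma_j - \hat\gamma_j)^2$                                       $1.73 \times 10^{-4}$   $1.44 \times 10^{-2}$   $4.31 \times 10^{-3}$
$\frac{1}{4}\sum_{j=1}^{2}\big[(a_j - \hat a_j)^2 + (b_j - \hat b_j)^2\big]$                   $1.72 \times 10^{-6}$   $4.09 \times 10^{-4}$   $5.75 \times 10^{-6}$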
Fig. 6.9. Wiener system. Inverse nonlinear functions $f^{-1}(\hat y(n))$ and $g^{-1}(\hat y(n))$ (plotted versus $y(n)$)
Table 6.4. Parameter estimates

Disturbance: RLSI: $\varepsilon(n) = 0$; RLSII: $\varepsilon(n) = \epsilon(n)$; RELSII: $\varepsilon(n) = \epsilon(n)$

Parameter   True value   RLSI      RLSII     RELSII
α1          −0.2500      −0.2482   −0.2164   −0.2487
α2           0.1500       0.1491    0.1293    0.1526
β1          −0.2000      −0.2006   −0.2060   −0.2032
β2           0.1000       0.1006    0.1065    0.1021
γ2           0           −0.0000   −0.0019   −0.0002
γ3           0.5000       0.4833    0.3445    0.4177
γ4           0            0.0002    0.0054   −0.0002
γ5           0.1333       0.1646    0.4240    0.2895
γ6           0           −0.0004   −0.0057   −0.0011
γ7           0.0540       0.0372   −0.0835   −0.0277
γ8           0           −0.0004   −0.0053   −0.0029
γ9           0.0219       0.0136   −0.0846   −0.0346
γ10          0            0.0010    0.0084    0.0068
γ11          0.0089       0.0199    0.0753    0.0539
c1          −1.7500                           −0.9514
c2           0.8500                           −0.0061
Fig. 6.10. Identification error of the inverse nonlinear function $\Delta f^{-1}(y(n))$ (the error $\Delta\hat f^{-1}(y(n)) - \Delta f^{-1}(y(n))$ plotted versus $y(n)$ for the RLSI, RLSII and RELSII examples)
6.3 Sugar evaporator. Identification of the nominal model of steam pressure dynamics

The main task of a multiple-effect sugar evaporator is thickening the thin sugar juice from a sugar density of approximately 14 Brix units (Bx) to 65–70 Bx. Other important tasks are steam generation, waste steam condensation, and supplying waste-heat boilers with water condensate. In a multiple-effect evaporation process, the sugar juice and the saturated steam are fed to the successive stages of the evaporator at a gradually decreasing pressure. Many complex physical phenomena and chemical reactions occur during the thickening of the sugar juice, including sucrose decomposition, the precipitation of calcium compounds, the decomposition of acid amides, etc. The flow of the steam and sugar juice through the successive stages of the evaporator results in a close relationship between the temperatures and pressures of the juice steam in these stages. Moreover, juice steam temperature and juice steam pressure also depend on other physical quantities such as the rate of the juice flow or juice temperature.

6.3.1 Theoretical model

A theoretical approach to process modelling relies upon deriving a mathematical model of the process by applying the laws of physics. Theoretical models obtained in this way are often too complicated to be used, for example, in control applications. Despite this, theoretical models are a source of valuable information on the nature of the process. This information can be very useful for the identification of experimental mathematical models based on the process input-output data. Moreover, theoretical models can serve as benchmarks for the evaluation and verification of experimental models. Modelling the multiple-effect sugar evaporator is complicated by the fact that it consists of a number of evaporators connected both in series and in parallel. All this makes the formulation of a theoretical model a difficult task.
Theoretical models are derived based on mass and energy balances for all stages. To perform this, the following assumptions are made [111]:

1. The juice and the steam are in a saturated equilibrium.
2. Both the mass of the steam in the juice chamber and the mass of the steam in the steam chamber are constant.
3. The operation of juice level controllers makes it possible to neglect variations of juice levels.
4. Heat losses to the surrounding environment are negligible.
5. Mixing is perfect.

The model of the dependence of steam pressure in the steam chamber of the stage $k$ on steam pressure in the steam chamber of the stage $k-1$, given by Carlos and Corripio [24], has the form
$$P_k = P_{k-1} - \frac{\gamma_k (O_{k-1})^2}{\rho_k^2(T_{k-1}, P_{k-1})}, \tag{6.59}$$
$$P_1 = P_0 - \frac{\gamma_1 S^2}{\rho_1^2(T_0, P_0)}, \tag{6.60}$$
where $P_k$ denotes juice steam pressure in the steam chamber of the stage $k$, $\gamma_k$ is the conversion factor, $O_k$ is the steam flow rate from the stage $k$, $\rho_k$ is the steam density at the stage $k$, and $T_k$ is steam temperature at the stage $k$. The models (6.59) and (6.60) describe steady-state behavior of the process. To include dynamic properties of the process, a modified model described with differential equations of the first order was proposed:
$$P_k = \tau_k\frac{dP_{k-1}}{dt} + P_{k-1} - \frac{\gamma_k (O_{k-1})^2}{\rho_k^2(T_{k-1}, P_{k-1})}, \tag{6.61}$$
$$P_1 = \tau_1\frac{dP_0}{dt} + P_0 - \frac{\gamma_1 S^2}{\rho_1^2(T_0, P_0)}, \tag{6.62}$$
where $\tau_k$ is the time constant. Equations (6.59)–(6.62) are a part of the overall model of the sucrose juice concentration process. This model also comprises mass and energy balances for all stages [111].

6.3.2 Experimental models of steam pressure dynamics

In a neural network model (Fig. 6.11), it is assumed that the model output $\hat P_2(n)$ is a sum of two components defined by the nonlinear function of the pressure $P_3(n)$ and the linear function of the pressure $P_1(n)$. The model output at the time $n$ is
$$\hat P_2(n) = \hat f\Big(\frac{\hat B(q^{-1})}{\hat A(q^{-1})}P_3(n)\Big) + \hat C(q^{-1})P_1(n), \tag{6.63}$$
with
$$\hat A(q^{-1}) = 1 + \hat a_1 q^{-1} + \hat a_2 q^{-2}, \tag{6.64}$$
$$\hat B(q^{-1}) = \hat b_1 q^{-1} + \hat b_2 q^{-2}, \tag{6.65}$$
$$\hat C(q^{-1}) = \hat c_1 q^{-1} + \hat c_2 q^{-2}, \tag{6.66}$$
where $\hat a_1$, $\hat a_2$, $\hat b_1$, $\hat b_2$, $\hat c_1$ and $\hat c_2$ denote model parameters. The function $\hat f(\cdot)$ is modelled with a multilayer perceptron containing one hidden layer consisting of $M$ nodes of the hyperbolic tangent activation function:
$$\hat f\big(\hat s(n)\big) = \sum_{j=1}^{M} w_{1j}^{(2)}\tanh\big(x_j(n)\big) + w_{10}^{(2)}, \tag{6.67}$$
$$x_j(n) = w_{j1}^{(1)}\hat s(n) + w_{j0}^{(1)}, \tag{6.68}$$
where $\hat s(n) = \big[\hat B(q^{-1})/\hat A(q^{-1})\big]P_3(n)$ is the output of a linear dynamic model, and $w_{10}^{(1)}, \ldots, w_{M1}^{(1)}$, $w_{10}^{(2)}, \ldots, w_{1M}^{(2)}$ are the weights of the neural network. The neural network structure contains a Wiener model [76, 75] in the path of the pressure $P_3(n)$ and a linear finite impulse response filter of the second order in the path of the pressure $P_1(n)$.

Fig. 6.11. Neural network model of steam pressure dynamics (block diagram: $P_3(n)$ passes through $\hat B(q^{-1})/\hat A(q^{-1})$ and $\hat f(\hat s(n))$, $P_1(n)$ passes through $\hat C(q^{-1})$, and the two paths are summed to give $\hat P_2(n)$)

In a linear model of steam pressure dynamics, the Wiener model is replaced with a linear dynamic model:
$$\hat P_2(n) = \frac{\hat B(q^{-1})}{\hat A(q^{-1})}P_3(n) + \hat C(q^{-1})P_1(n). \tag{6.69}$$
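The forward computation of the neural network model (6.63)–(6.68) can be sketched as follows. The function name, the layout of the weight arguments, and the convention that all signals before the first sample are zero are assumptions made for illustration; training of the weights is not covered here.

```python
import math


def simulate_model(P3, P1, a, b, c, w1, w2):
    """Return the outputs P2_hat(n) of (6.63) for 0-based n.

    a = [a1, a2], b = [b1, b2], c = [c1, c2]    polynomial coefficients (6.64)-(6.66)
    w1 = [(w_j1, w_j0), ...]                    hidden-layer weights and biases (6.68)
    w2 = ([w_11, ..., w_1M], w_10)              output-layer weights and bias (6.67)
    """
    def past(signal, n, j):
        # Assumption: samples before the start of the record are zero.
        return signal[n - j] if n - j >= 0 else 0.0

    s, out = [], []
    for n in range(len(P3)):
        # Linear dynamic part: A(q^-1) s(n) = B(q^-1) P3(n)
        s_n = (-a[0] * past(s, n, 1) - a[1] * past(s, n, 2)
               + b[0] * past(P3, n, 1) + b[1] * past(P3, n, 2))
        s.append(s_n)
        hidden = [math.tanh(wj1 * s_n + wj0) for wj1, wj0 in w1]     # tanh(x_j(n))
        f_hat = sum(w * h for w, h in zip(w2[0], hidden)) + w2[1]    # (6.67)
        fir = c[0] * past(P1, n, 1) + c[1] * past(P1, n, 2)          # C(q^-1) P1(n)
        out.append(f_hat + fir)
    return out
```

Replacing the hidden layer with the identity map turns the same loop into a simulation of the linear model (6.69).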
6.3.3 Estimation results

Parameter estimation of the models (6.63) and (6.69) was performed based on a set of 10000 input-output measurements recorded at the sampling interval of 10 s. The RLS algorithm was employed for the estimation of the ARX model. The neural network Wiener model of the structure N(1–24–1) was trained recursively with the backpropagation learning algorithm [81]. For both models, 50000 sequential steps were made, processing the overall data set five times. Then the models were tested with another data set of 8000 input-output measurements. The results of the estimation and the testing are shown in Figs 6.12–6.15, and a comparison of the test results for both models is given in Table 6.5. Lower values of the mean square prediction error of the pressure $P_2$ obtained for the neural network model for both the training and testing sets confirm the nonlinear nature of the process. As the analyzed model of steam pressure dynamics is characterized by a short time constant of approximately 20 s, fast fault detection is possible. Moreover, as steam pressure is not controlled automatically, there is no problem of closed-loop system identification.
Fig. 6.12. Steam pressure in the steam chamber 2 and the output of the neural network model – the testing set ($P_2(n)$ and $\hat P_2(n)$ plotted versus $n$)

Fig. 6.13. Steam pressure in the steam chamber 2 and the output of the neural network model – the training set ($P_2(n)$ and $\hat P_2(n)$ plotted versus $n$)

Fig. 6.14. Estimated nonlinear function ($\hat f(\hat s(n))$ plotted versus $\hat s(n)$)

Fig. 6.15. Step response of the linear dynamic model ($h(n)$ plotted versus $n$)
Table 6.5. Mean square prediction error of P2 [kPa]

               ARX model   Neural network model
Training set   2.103       1.804
Testing set    2.322       2.084
6.4 Summary

Identification methods of system parameter changes, discussed in this chapter, require a nominal model of the system, the sequences of the system input and output signals and the sequence of residuals to be available. If the system is disturbance free, it is possible to calculate system parameter changes in both Wiener and Hammerstein systems by solving a set of linear equations. In practice, however, it is more realistic to assume that the system output is disturbed by additive disturbances and to estimate system parameter changes using parameter estimation techniques. In Hammerstein systems with polynomial nonlinearities, system parameter changes can be estimated with the LS method. In this case, consistent parameter estimates are obtained provided that the system output is disturbed additively with (6.17). For other types of output disturbances, the use of the LS method results in inconsistent parameter estimates. Therefore, to obtain consistent parameter estimates, other parameter estimation methods, e.g., the ELS method, can be used. The estimation of system parameter changes in Wiener systems with inverse nonlinear characteristics described by polynomials with the LS method also results in inconsistent parameter estimates. To obtain consistent parameter estimates, an IV estimation procedure has been proposed. For systems with characteristics described by the power series, the estimation of the parameters of linear dynamic systems and changes of the nonlinear characteristic for Hammerstein systems or the inverse nonlinear characteristic for Wiener systems can be made employing neural network models of the residual equation. Note that neural network models can also be useful in the case when the order of the system polynomial nonlinearity r is high. This comes from the well-known fact that a high order r introduces errors due to noise and model uncertainty, and slows down the convergence rate [9].

Nominal models of systems are identified based on input-output measurements recorded under normal operating conditions. In spite of the fact that such measurements are often available for many real processes, the determination of nominal models at a high level of accuracy is not an easy task, as has been shown in the sugar evaporator example. The complex nonlinear nature of the sugar evaporation process, disturbances of a high intensity and correlated inputs that do not fulfill the persistent excitation condition are among the reasons for the difficulties in achieving high accuracy of system modelling.
References
1. Alataris K., Berger T. W., Marmarelis V. Z. (2000) A novel network for nonlinear modeling of neural systems with arbitrary point-process inputs. Neural Networks, 13:255–266.
2. Al-Duwaish H., Karim M. N., Chandrasekar V. (1996) Use of multilayer feedforward neural networks in identification and control of Wiener model. Proc. IEE – Contr. Theory Appl., 143:225–258.
3. Al-Duwaish H., Nazmul Karim M., Chandrasekar V. (1997) Hammerstein model identification by multilayer feedforward neural networks. Int. J. Syst. Sci., 18:49–54.
4. Al-Duwaish H., Karim M. N. (1997) A new method on the identification of Hammerstein model. Automatica, 33:1871–1875.
5. Al-Duwaish H. (2000) A genetic approach to the identification of linear dynamical systems with static nonlinearities. Int. J. Contr., 31:307–313.
6. Aoubi M. (1998) Comparison between the dynamic multi-layered perceptron and generalised Hammerstein model for experimental identification of the loading process in diesel engines. Contr. Engng. Practice, 6:271–279.
7. Bai E.-W. (1998) An optimal two-stage identification algorithm for Hammerstein-Wiener nonlinear systems. Automatica, 34:333–338.
8. Bai E.-W. (2002) Identification of linear systems with hard input nonlinearities of known structure. Automatica, 38:853–860.
9. Bai E.-W. (2002) A blind approach to the Hammerstein-Wiener model identification. Automatica, 38:967–979.
10. Bai E.-W. (2003) Frequency domain identification of Wiener models. Automatica, 39:1521–1530.
11. Bai E.-W. (2004) Decoupling the linear and nonlinear parts in Hammerstein model identification. Automatica, 40:671–676.
12. Barker H. A., Tan A. H., Godfrey K. R. (2003) Automatica, 39:127–133.
13. Bars R., Haber R., Lengyel O. (1997) Extended horizon nonlinear adaptive predictive control applied for the parametric Hammerstein model. In: Domek S., Emirsajow Z., Kaszyński R. (eds) Proc. 4th Int. Symp. Methods and Models in Automation and Robotics, MMAR’97, Technical University of Szczecin Press, Szczecin, 447–451.
14. Billings S. A., Fakhouri S. Y. (1978a) Theory of separable processes with applications to the identification of non-linear systems. Proc. IEE, 125:1051–1058.
15. Billings S. A., Fakhouri S. Y. (1978b) Identification of a class of nonlinear systems using correlation analysis. Proc. IEE, 125:691–697.
16. Billings S. A., Fakhouri S. Y. (1979) Non-linear system identification using the Hammerstein model. Int. J. Syst. Sci., 10:567–578.
17. Billings S. A., Fakhouri S. Y. (1982) Identification of systems containing linear dynamic and static non-linear elements. Automatica, 18:15–26.
18. Billings S. A., Chen S., Korenberg M. J. (1989) Identification of MIMO nonlinear systems using a forward-regression orthogonal estimator. Int. J. Contr., 49:2157–2189.
19. Bloemen H. H. J., Boom T. J. J., Verbruggen H. B. (2001) Model-based predictive control for Hammerstein-Wiener systems. Int. J. Contr., 74:482–495.
20. Bloemen H. H. J., Chou C. T., Boom T. J. J., Verdult V., Verhaegen M., Backx T. C. (2001) Wiener model identification and predictive control for dual composition control of a distillation column. J. Process Contr., 11:601–620.
21. Boutayeb M., Darouach M. (1995) Recursive identification method for MISO Wiener-Hammerstein model. IEEE Trans. Automat. Contr., 40:287–291.
22. Boutayeb M., Aubry D., Darouach M. (1996) A robust and recursive identification method for MIMO Hammerstein model. In: Proc. Int. Conf. Contr, UKACC’96, Exeter, UK, 482–495.
23. Campolucci P., Uncini A., Piazza F., Rao B. D. (1999) On-line learning algorithms for locally recurrent neural networks. IEEE Trans. Neural Networks, 10:253–271.
24. Carlos A., Corripio A. B. (1985) Principles and practice of automatic control. John Wiley and Sons, New York.
25. Celka P., Bershad N. J., Vesin J. (2001) Stochastic gradient identification of polynomial Wiener systems: analysis and application. IEEE Trans. Signal Processing, 49:301–313.
26. Cervantes A. L., Agamennoni O. E., Figueroa J. L. (2003) A nonlinear model predictive control based on Wiener piecewise linear models. J. Process Contr., 13:655–666.
27. Chang F. H. I., Luus R. (1971) A noniterative method for identification using Hammerstein model. IEEE Trans. Automat. Contr., AC-16:464–468.
28. Chen G., Chen Y., Ogmen H. (1997) Identifying chaotic systems via a Wiener-type cascade model. IEEE Contr. Syst. Mag., 17:29–36.
29. Chen S., Billings S. A., Grant P. M. (1990) Non-linear system identification using neural networks. Int. J. Contr., 51:1191–1214.
30. Chen J., Patton R. J. (1999) Robust model-based fault diagnosis for dynamic systems. Kluwer Academic Publishers, London.
31. Cybenko G. (1989) Approximation by superposition of sigmoidal functions. Math. Contr. Signals and Syst., 2:359–366.
32. Davide F. A. M., Di Natale C., D’Amico A., Hierlemann A., Mitrovics J., Schweizer M., Weimar U., Göpel W. (1995) Structure identification of nonlinear models for QMB polymer-coated sensors. Sensors and Actuators, B 24–25:830–842.
33. Drewelow W., Simanski O., Hofmockel R., Pohl B. (1997) Identification of neuromuscular blockade in anaesthesia. In: Domek S., Emirsajow Z., Kaszyński R. (eds) Proc. 4th Int. Symp. Methods and Models in Automation and Robotics, MMAR’97, Technical University of Szczecin Press, Szczecin, 781–784.
34. Eskinat E., Johnson S. H., Luyben W. L. (1991) Use of Hammerstein models in identification of nonlinear systems. AIChE J., 37:255–268.
35. Eykhoff P. (1980) System identification. Parameter and state estimation. John Wiley and Sons, London.
36. Fahlman S., Lebiere C. (1990) The cascade-correlation learning architecture. In: Touretzky D. S. (ed.) Advances in Neural Information Processing Syst. 2, Morgan Kaufmann, San Mateo.
37. Fine T. L. (1999) Feedforward neural network methodology. Springer, New York, Berlin, Heidelberg.
38. Frank P. M. (1990) Fault diagnosis in dynamical systems using analytical and knowledge-based redundancy – A survey of some new results. Automatica, 26:459–474.
39. Fruzzetti K. P., Palazoğlu A., McDonald K. A. (1997) Nonlinear model predictive control using Hammerstein models. J. Process Contr., 7:31–41.
40. Gallman P. (1975) An iterative method for the identification of nonlinear systems using a Uryson model. IEEE Trans. Automat. Contr., AC-20:771–775.
41. Gallman P. (1976) A comparison of two Hammerstein model identification algorithms. IEEE Trans. Automat. Contr., AC-21:124–77.
42. Gerkšič S., Juričič D., Strmčnik S., Matko D. (2000) Wiener model based nonlinear predictive control. Int. J. Syst. Sci., 31:189–202.
43. Giri F., Chaoui F. Z., Rochdi Y. (2001) Parameter identification of a class of Hammerstein plants. Automatica, 37:749–756.
44. Gómez J. C., Baeyens E. (2004) Identification of block-oriented nonlinear systems using orthonormal bases. J. Process Contr., 14:685–697.
45. Greblicki W. (1989) Non-parametric orthogonal series identification of Hammerstein systems. Int. J. Syst. Sci., 20:2355–2367.
46. Greblicki W. (1992) Nonparametric identification of Wiener systems. IEEE Trans. Inf. Theory, 38:1487–1492.
47. Greblicki W. (1994) Nonparametric identification of Wiener systems by orthogonal series. IEEE Trans. Automat. Contr., 39:2077–2086.
48. Greblicki W. (1997) Nonparametric approach to Wiener system identification. IEEE Trans. Circ. Syst. I, 44:538–545.
49. Greblicki W. (1998) Continuous-time Wiener system identification. IEEE Trans. Automat. Contr., 43:1488–1493.
50. Greblicki W. (1999) Recursive identification of continuous-time Wiener systems. Int. J. Contr., 72:981–989.
51. Greblicki W. (2001) Recursive identification of Wiener systems. Int. J. Appl. Math. and Comp. Sci., 11:977–991.
52. Greblicki W. (2002) Recursive identification of continuous-time Hammerstein systems. Int. J. Syst. Sci., 33:969–977.
53. Greblicki W., Krzyżak A. (1979) Non-parametric identification of a memoryless system with a cascade structure. Int. J. Syst. Sci., 10:1311–1321.
54. Greblicki W., Pawlak M. (1985) Fourier and Hermite series estimates of regression functions. Ann. Inst. Statist. Math., 37:443–454.
55. Greblicki W., Pawlak M. (1986) Identification of discrete Hammerstein systems using kernel regression estimates. IEEE Trans. Automat. Contr., AC-31:74–77.
56. Greblicki W., Pawlak M. (1987) Hammerstein system identification by nonparametric regression estimation. Int. J. Contr., 45:343–354.
57. Greblicki W., Pawlak M. (1989) Nonparametric identification of Hammerstein systems. IEEE Trans. Inf. Theory, 35:409–417.
58. Greblicki W., Pawlak M. (1989) Recursive nonparametric identification of Hammerstein systems. J. Franklin Inst., 326:461–481.
59. Greblicki W., Pawlak M. (1994) Nonparametric recovering nonlinearities in block oriented systems with the help of Laguerre polynomials. Contr. Theory Advanced Techn., 10:771–791.
60. Gupta M. M., Jin L., Homma N. (2003) Neural networks. From fundamentals to advanced theory. John Wiley and Sons, Hoboken.
61. Haber R., Unbehauen H. (1990) Structure identification of non-linear dynamic systems – a survey on input/output approaches. Automatica, 26:651–677.
62. Haber R. (1995) Predictive control of nonlinear dynamic processes. Appl. Math. and Comp., 70:169–184.
63. Haist N. D., Chang F. H. I., Luus R. (1973) Nonlinear identification in the presence of correlated noise using Hammerstein model. IEEE Trans. Automat. Contr., AC-18:552–555.
64. Hasiewicz Z. (1999) Hammerstein system identification by the Haar multiresolution approximation. Int. J. Adaptive Contr. Signal Processing, 13:697–717.
65. Hasiewicz Z. (2000) Modular neural networks for nonlinearity recovering by the Haar approximation. Neural Networks, 13:1107–1133.
66. Hasiewicz Z. (2001) Non-parametric estimation of nonlinearity in a cascade time series system by multiscale approximation. Signal Processing, 81:791–807.
67. Hassibi B., Stork D. G. (1993) Second derivatives for network pruning: optimal brain surgeon. In: Hanson S. J., Cowan J. D., Giles C. L. (eds) Advances in Neural Information Processing Syst. 5, Morgan Kaufmann, San Mateo.
68. Haykin S. (1999) Neural networks. A comprehensive foundation. Prentice Hall, Upper Saddle River. 69. Hertz J., Krogh A., Palmer R. G. (1991) Introduction to the theory of neural computation. Addison-Wesley, Redwood City. ˙ 70. Hunt K. J., Sbarbaro D., Zbikowski R., Gawthrop P. J. (1992) Neural networks for control systems – a survey. Automatica, 28:1083–1112. 71. Hunter I. W., Korenberg M. J. (1986) The identification of nonlinear biological systems: Wiener and Hammersein cascade models. Biol. Cybern., 55:135–144. 72. Ikonen E., Najim K. (2001) Identification of Wiener systems with steady-state non-linearities. In: Proc. Europ. Contr. Conf., ECC’01, Porto, Portugal, CDROM. 73. Janczak A. (1995) Identification of a class of non-linear systems using neural networks. In: Ba´ nka S., Domek S., Emirsajow Z. (eds) Proc. 2nd Int. Symp. Methods and Models in Automation and Robotics, MMAR’95 , Technical University of Szczecin Press, Szczecin, 697–702. 74. Janczak A. (1997) Identification of Wiener models using recurrent neural networks. In: Domek S., Emirsajow Z., Kaszy´ nski R. (eds) Proc. 4th Int. Symp. Methods and Models in Automation and Robotics, MMAR’97, Technical University of Szczecin Press, Szczecin, 727–732. 75. Janczak A. (1997) Recursive identification of Hammerstein systems using recurrent neural models. In: Tadeusiewicz R., Rutkowski L., Chocjan J. (eds) Proc. the 3rd Conf. Neural Networks and Their Applications, Polish Neural Networks Society Press, Cz¸estochowa, 517–522. 76. Janczak A. (1998) Recurrent neural network models for identification of Wiener systems. In: Borne P., Ksouri M., El Kamel A. (eds) Proc. 2nd IMACS Multiconference, CESA’98, Nabeul-Hammamet, 965–970. 77. Janczak A., (1998) Gradient descent and recursive least squares learning algorithms for on line identification of Hammerstein systems using recurrent neural network models. In: Heiss M. (ed.) Proc. Int. ICSC/IFAC Symp. Neural Computation, NC’98, ICSC Academic Press, Vienna, 565–571. 
78. Janczak A. (1999) Fault detection and isolation in Wiener systems with inverse model of static nonlinear element. In: Proc. Europ. Contr. Conf., ECC’99, Karlsruhe, F1046-5, CD-ROM.
79. Janczak A. (1999) Parameter estimation based fault detection and isolation in Wiener and Hammerstein systems. Int. J. Appl. Math. and Comp. Sci., 9:711–735.
80. Janczak A. (2000) Least squares identification of Wiener systems. In: Domek S., Kaszyński R. (eds) Proc. 6th Int. Conf. Methods and Models in Automation and Robotics, MMAR’2000, Technical University of Szczecin Press, Szczecin, 933–938.
81. Janczak A. (2000) Neural networks in identification of Wiener and Hammerstein systems. In: Duch W., Korbicz J., Rutkowski L., Tadeusiewicz R. (eds) Biocybernetics and biomedical engineering 2000. Neural networks, Akad. Ofic. Wyd. EXIT, Warsaw, 419–458, (in Polish).
82. Janczak A. (2000) Parametric and neural network models for fault detection and isolation of industrial process sub-modules. In: Edelmayer M., Bányász C. (eds) Prepr. 4th Symp. Fault Detection Supervision and Safety for Technical Processes, SAFEPROCESS’2000, Budapest, Hungary, 348–351.
83. Janczak A. (2001) On identification of Wiener systems based on a modified serial-parallel model. In: Proc. Europ. Contr. Conf., ECC’2001, Porto, Portugal, 1852–1857.
84. Janczak A. (2001) Training neural network Hammerstein models with truncated back-propagation through time algorithm. In: Kaszyński R. (ed.) Proc. 7th IEEE Int. Conf. Methods and Models in Automation and Robotics, MMAR’2001, Technical University of Szczecin Press, Szczecin, 499–504.
85. Janczak A. (2002) Prediction error approach to identification of polynomial Wiener systems. In: Domek S., Kaszyński R. (eds) Proc. 8th IEEE Int. Conf. Methods and Models in Automation and Robotics, MMAR’2002, Technical University of Szczecin Press, Szczecin, 457–461.
86. Janczak A. (2002) Identification of Wiener systems with pseudolinear regression approach method. In: Bubnicki Z., Korbicz J. (eds) Proc. 14th Polish Conf. Automation, XIV KKA, Zielona Góra, 413–416, (in Polish).
87. Janczak A. (2002) Training of neural network Wiener models with recursive prediction error algorithm. In: Rutkowski L., Kacprzyk J. (eds) Advances in Soft Computing. Neural Networks and Soft Computing, Proc. 6th Int. Conf. Neural Networks and Soft Computing, Physica-Verlag, Heidelberg, New York, 692–697.
88. Janczak A. (2003) Identification of Wiener and Hammerstein systems with neural networks and polynomial models. Methods and applications. University of Zielona Góra Press, Zielona Góra.
89. Janczak A. (2003) A comparison of four gradient learning algorithms for neural network Wiener models. Int. J. Syst. Sci., 34:21–35.
90. Janczak A. (2003) Neural network approach for identification of Hammerstein systems. Int. J. Contr., 76:1749–1766.
91. Janczak A. (2004) Parametric and neural network models in fault detection and isolation. In: Korbicz J., Kościelny J. M., Kowalczuk Z., Cholewa W. (eds) Fault diagnosis, models, artificial intelligence, applications, Springer, Berlin, Heidelberg, New York, 381–410.
92. Janczak A., Korbicz J. (1999) Neural network models of Hammerstein systems and their application to fault detection and isolation. In: Proc. 14th World Congress of IFAC, Beijing, P.R.C., P:91–96.
93. Janczak A., Mrugalski M. (2000) Neural network approach to identification of Wiener systems in a noisy environment. In: Bothe H., Rojas R. (eds) Proc. Int. ICSC Symp. Neural Computation, NC’2000, Berlin.
94. Juditsky A., Hjalmarsson H., Benveniste A., Delyon B., Ljung L., Sjöberg J., Zhang Q. (1995) Nonlinear black-box modeling in system identification: mathematical foundations. Automatica, 31:1725–1750.
95. Kalafatis A. D., Arifin N., Wang L., Cluett W. R. (1995) A new approach to the identification of pH processes based on the Wiener model. Chem. Engng Sci., 50:3693–3701.
96. Kalafatis A. D., Wang L., Cluett W. R. (1997) Identification of Wiener-type non-linear systems in a noisy environment. Int. J. Contr., 66:923–941.
97. Kang H. W., Cho Y. S., Youn D. H. (1998) Adaptive precompensation of Wiener systems. IEEE Trans. Signal Processing, 46:2825–2829.
98. Knohl T., Unbehauen H. (2000) Adaptive position control of electrohydraulic servo systems using ANN. Mechatronics, 10:127–143.
99. Knohl T., Xu W. M., Unbehauen H. (2003) Indirect adaptive dual control for Hammerstein systems using ANN. Contr. Engng Practice, 11:377–385.
100. Korbicz J., Janczak A. (1996) A neural network approach to identification of structural systems. In: Proc. IEEE Int. Symp. Industrial Electronics, ISIE’96, Warsaw, pp. 97–103.
101. Korbicz J., Janczak A. (2002) Artificial neural network models for fault detection and isolation of industrial processes. Comp. Assist. Mech. and Engng Sci., 9:55–69.
102. Krzyżak A. (1989) Identification of discrete Hammerstein systems by the Fourier series regression estimate. Int. J. Syst. Sci., 20:1729–1744.
103. Krzyżak A. (1990) On nonparametric estimation of nonlinear dynamic systems by the Fourier series estimate. Signal Processing, 52:299–321.
104. Krzyżak A. (1996) On estimation of a class of nonlinear systems by the kernel regression estimate. IEEE Trans. Inf. Theory, 36:141–152.
105. Krzyżak A., Partyka M. A. (1993) On identification of block-oriented systems by non-parametric techniques. Int. J. Syst. Sci., 24:1049–1066.
106. Krzyżak A., Sąsiadek J. Z., Kégl B. (2001) Non-parametric identification of dynamic nonlinear systems by a Hermite series approach. Int. J. Syst. Sci., 32:1261–1285.
107. Le Cun Y., Kanter I., Solla S. A. (1990) Optimal brain damage. In: Touretzky D. S. (ed.) Advances in Neural Information Processing Syst. 2, Morgan Kaufmann, San Mateo.
108. Leontaritis I. J., Billings S. A. (1985) Input–output parametric models for non-linear systems. Part I: deterministic non-linear systems. Int. J. Contr., 41:303–328.
109. Leontaritis I. J., Billings S. A. (1985) Input–output parametric models for nonlinear systems. Part II: stochastic non-linear systems. Int. J. Contr., 41:329–344.
110. Ling W.-M., Rivera D. (1998) Nonlinear black-box identification of distillation column models – design variable selection for model performance enhancement. Appl. Math. and Comp. Sci., 8:794–813.
111. Lissane Elhaq S., Giri F., Unbehauen H. (1999) Modelling, identification and control of sugar evaporation – theoretical design and experimental evaluation. Contr. Engng Practice, 7:931–942.
112. Ljung L. (1999) System identification. Theory for the user. Prentice Hall, Upper Saddle River.
113. Lovera M., Gustafsson T., Verhaegen M. (2000) Recursive subspace identification of linear and non-linear Wiener state-space models. Automatica, 36:1639–1650.
114. Luyben W. L., Eskinat E. (1994) Nonlinear auto-tune identification. Int. J. Contr., 59:595–626.
115. Mak M. W., Ku K. W., Lu Y. L. (1999) On the improvement of the real time recurrent learning algorithm for recurrent neural networks. Neurocomputing, 24:13–36.
116. Marciak C., Latawiec K., Rojek R., Oliveira G. H. C. (2001) In: Kaszyński R. (ed.) Proc. 7th IEEE Int. Conf. Methods and Models in Automation and Robotics, MMAR’2001, Technical University of Szczecin Press, Szczecin, 965–969.
117. Marmarelis V. Z., Zhao X. (1997) Volterra models and three-layer perceptrons. IEEE Trans. Neural Networks, 8:1421–1432.
118. Menold P. H., Allgöwer F., Pearson R. K. (1997) Nonlinear structure identification of chemical processes. Computers and Chem. Engng, 21:S137–S142.
119. Mzyk G. (2002) Instrumental variables in Wiener system identification. In: Domek S., Kaszyński R. (eds) Proc. 8th IEEE Int. Conf. Methods and Models in Automation and Robotics, MMAR’2002, Technical University of Szczecin Press, Szczecin, 463–468.
120. Narendra K. S., Gallman P. G. (1966) An iterative method for the identification of nonlinear systems using Hammerstein model. IEEE Trans. Automat. Contr., AC-11:546–550.
121. Narendra K. S., Parthasarathy K. (1990) Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Networks, 1:4–26.
122. Narendra K. S., Parthasarathy K. (1991) Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Trans. Neural Networks, 2:252–262.
123. Nelles O. (2001) Nonlinear system identification. From classical approaches to neural networks and fuzzy models. Springer, New York, Berlin, Heidelberg.
124. Nešić D. (1997) A note on dead-beat controllability of generalized Hammerstein systems. Syst. and Contr. Letters, 29:223–231.
125. Nešić D., Mareels I. M. Y. (1998) Dead-beat control of simple Hammerstein models. IEEE Trans. Automat. Contr., 43:1184–1188.
126. Ninness B., Gibson S. (2002) Quantifying the accuracy of Hammerstein model estimation. Automatica, 38:2037–2051.
127. Nørgaard M., Ravn O., Poulsen N. K., Hansen L. K. (2000) Neural networks for modelling and control. Springer, New York, Berlin, Heidelberg.
128. Norquay S. J., Palazoglu A., Romagnoli J. A. (1998) Model predictive control based on Wiener models. Chem. Engng Sci., 53:75–84.
129. Norquay S. J., Palazoglu A., Romagnoli J. A. (1999) Application of Wiener model predictive control (WMPC) to an industrial C2-splitter. J. Process Contr., 9:461–473.
130. Norquay S. J., Palazoglu A., Romagnoli J. A. (1999) Application of Wiener model predictive control (WMPC) to a pH neutralization experiment. IEEE Trans. Contr. Syst. Technology, 7:437–445.
131. Pacut A. (2000) Stochastic modelling at diverse scales. From Poisson to network neurons. Warsaw University of Technology Press, Warsaw.
132. Pacut A. (2002) Symmetry of backpropagation and chain rule. In: Proc. 2002 Int. Joint Conf. Neural Networks, Honolulu, HA, IEEE Press, Piscataway.
133. Pajunen G. (1992) Adaptive control of Wiener type nonlinear systems. Automatica, 28:781–785.
134. Patwardhan R. S., Lakshminarayanan S., Shah S. L. (1998) Constrained nonlinear MPC using Hammerstein and Wiener models. AIChE J., 44:1611–1622.
135. Patton R. J., Frank M., Clark R. N. (1990) Fault diagnosis in dynamic systems. Theory and applications. Prentice-Hall, New York.
136. Pawlak W. (1991) On the series expansion approach to the identification of Hammerstein systems. IEEE Trans. Automat. Contr., 36:763–767.
137. Pawlak W., Hasiewicz Z. (1998) Nonlinear system identification by the Haar multiresolution analysis. IEEE Trans. Circ. and Syst. – I: Fund. Theory and Appl., 45:945–961.
138. Pearson R. K., Pottmann M. (2000) Gray-box identification of block-oriented non-linear models. J. Process Contr., 10:301–315.
139. Piché S. W. (1994) Steepest descent algorithms for neural network controllers. IEEE Trans. Neural Networks, 5:198–212.
140. Pomerleau D., Hodouin D., Poulin É. (2003) Performance analysis of a dynamic phenomenological controller for a pellet cooling process. J. Process Contr., 13:139–153.
141. Quaglini V., Previdi F., Contro R., Bittanti S. (2002) A discrete-time nonlinear Wiener model for the relaxation of soft biological tissues. Medical Engng and Physics, 24:9–19.
142. Rollins D. K., Bhandari N. (2004) Constrained MIMO dynamic discrete-time modeling exploiting optimal experimental design. J. Process Contr., 14:671–683.
143. Roux G., Dahhou B., Queinnec I. (1996) Modelling and estimation aspects of adaptive predictive control in a fermentation process. Contr. Engng Practice, 4:55–66.
144. Rutkowski L. (2004) New soft computing techniques for system modelling, pattern classification and image processing. Springer, Berlin, Heidelberg.
145. Schetzen M. (1980) The Volterra and Wiener theories of nonlinear systems. John Wiley and Sons, New York.
146. Sentoni G., Agamennoni O., Desages A., Romagnoli J. (1996) Approximate models for nonlinear process control. AIChE J., 42:2240–2250.
147. Sieben S. (1996) The Wiener model – an approach by deterministic inputs. In: Proc. 1st IMACS Multiconference, CESA’96, Lille, France, 465–469.
148. Sjöberg J., Zhang Q., Ljung L., Benveniste A., Delyon B., Glorennec P., Hjalmarsson H., Juditsky A. (1995) Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31:1691–1724.
149. Söderström T., Stoica P. (1994) System identification. Prentice Hall Int., London.
150. Srinivasan B., Prasad U. R., Rao N. J. (1994) Back propagation through adjoints for the identification of nonlinear dynamic systems using recurrent neural models. IEEE Trans. Neural Networks, 5:213–228.
151. Stapleton J. C., Baas S. C. (1985) Adaptive noise cancellation for a class of nonlinear systems. IEEE Trans. Circ. and Syst., 32:143–150.
152. Stoica P., Söderström T. (1982) Instrumental-variable methods for identification of Hammerstein systems. Int. J. Contr., 35:459–476.
153. Su H.-T., McAvoy T. J. (1993) Integration of multilayer perceptron networks and linear dynamic models: A Hammerstein modeling approach. Ind. Engng Chem. Res., 32:1927–1936.
154. Sung S. W., Lee J. (2004) Modeling and control of Wiener-type processes. Chem. Engng Sci., 59:1515–1521.
155. Śliwiński P., Hasiewicz Z. (2002) Computational algorithms for wavelet-based system identification. In: Domek S., Kaszyński R. (eds) Proc. 8th IEEE Int. Conf. Methods and Models in Automation and Robotics, MMAR’2002, Technical University of Szczecin Press, Szczecin, 495–500.
156. Thathachar M. A. L., Ramaswamy S. (1973) Identification of a class of nonlinear systems. Int. J. Contr., 18:741–752.
157. Verhaegen M., Westwick D. (1996) Identifying MIMO Hammerstein systems in the context of subspace model identification methods. Int. J. Contr., 63:331–349.
158. Verhaegen M., Westwick D. (1996) Identifying MIMO Wiener systems using subspace model identification methods. Signal Processing, 52:235–258.
159. Visala A., Pitkänen H., Aarne H. (2001) Modeling of chromatographic separation process with Wiener-MLP representation. J. Process Contr., 78:443–458.
160. Vörös J. (1997) Parameter identification of discontinuous Hammerstein systems. Automatica, 33:1141–1146.
161. Vörös J. (1999) Iterative algorithm for identification of Hammerstein systems with two-segment nonlinearities. IEEE Trans. Automat. Contr., 44:2145–2149.
162. Vörös J. (2001) Parameter identification of Wiener systems with discontinuous nonlinearities. Syst. and Contr. Letters, 44:363–372.
163. Werbos P. J. (1990) Backpropagation through time: What it does and how to do it. Proc. IEEE, 78:1550–1560.
164. Westwick D., Verhaegen M. (1996) Identifying MIMO Wiener systems using subspace identification methods. Signal Processing, 52:235–258.
165. Wigren T. (1993) Recursive prediction error identification using the non-linear Wiener model. Automatica, 39:1011–1025.
166. Wigren T. (1994) Convergence analysis of recursive identification algorithms based on the non-linear Wiener model. IEEE Trans. Automat. Contr., 39:2191–2206.
167. Williams R. J., Zipser D. (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:271–280.
168. Xu M., Chen G., Tian Y.-T. (2001) Identifying chaotic systems using Wiener and Hammerstein cascade models. Math. and Computer Modelling, 33:483–493.
169. Żurada J. M. (1992) Introduction to artificial neural systems. West Publishing Company, St. Paul.
Index
AR model, 6 ARMA model, 7 ARMAX model, 7 ARX model, 6, 61, 182
backpropagation algorithm, 17, 29, 52, 84 learning algorithm, 17, 24 method, 31, 40–42, 46–47, 84–85, 87–90, 98, 105, 108 for parallel models, 31, 42, 47–48, 85–86, 90 through time, 31, 40, 43–46, 48, 77, 84, 86–87, 91 through time, truncated, 45, 49–51, 87, 91–96, 107, 108, 110, 112, 113 batch mode, 43, 78, 99, 106 bias/variance dilemma, 18 tradeoff, 18 black box models, 11 Box-Jenkins model, 8 BPP method, see backpropagation for parallel models BPS method, see backpropagation method BPTT method, see backpropagation through time
C2-splitter, 162 charging process in diesel engines, 164 chromatographic separation process, 163 combined steepest descent and least squares learning algorithms, 99–105 complexity penalty term, 19 regularization methods, 19 computational complexity, 52, 96 conjugate gradient algorithm, 31 methods, 17, 70 continuous stirred tank reactor, 160
Daubechies wavelets, 28 discrete Fourier transform, 21 discrete Laguerre functions, 149 discrete-time chaotic systems, 165 distillation column, 159, 162–163
electro-hydraulic servo system, 164 equation error, 10, 81, 117 FIR models, 5 Fourier series, 8, 27 frequency sampling filter, 20 fuzzy models, 8
Gauss-Newton method, 66 Gaussian noise, 54, 67, 98, 133 signal, 20, 25, 164 gradient calculation accuracy, 49, 77, 91, 94, 96 degrees, 51, 55, 95, 99 algorithms, 51, 77, 96 gray box models, 11
Haar multiresolution approximation, 28 Hammerstein model, 1, 12, 30 MIMO, 12 SISO, 12, 25 state space, 15 Hammerstein-Wiener model, 14 heat exchanger, 162 Hermite orthogonal functions, 22 series, 27 Hessian, 18, 19
identification of Hammerstein systems, 24 correlation methods, 25 in presence of correlated noise, 147 instrumental variables methods, 25 iterative least squares method, 145–147 Laguerre function expansion method, 149–151 linear optimization methods, 25 non-iterative least squares method, 143–145 nonlinear optimization methods, 28 nonparametric regression methods, 26 prediction error method, 151–152 pseudolinear regression method, 153–154 steepest descent method, 79 with two-segment nonlinearities, 155–157 identification of Wiener systems, 19 combined least squares-instrumental variables method, 141 correlation methods, 19, 163, 164 instrumental variables method, 125 least squares method, 122, 123 linear optimization methods, 20 nonlinear optimization methods, 22 nonparametric regression methods, 22 pseudolinear regression method, 134–138 recursive prediction error method, 132 steepest descent method, 54 indirect adaptive control, 164 internal model control, 160 iron oxide pellet cooling process, 165 Kautz filters, 10 kernel regression estimate, 27 Kolmogorov-Gabor models, 9 Laguerre filter bank, 24 filters, 21, 24, 125 function expansion method, see identification of Hammerstein systems Laguerre function expansion method polynomials, 27 Legendre orthogonal functions, 22 Levenberg-Marquardt method, 17, 23, 24, 152, 157 linear models, 5 general model, 5 MA model, 6
MLP, see multilayer perceptron model predictive control, 160–162, 165 model reference adaptive control, 160 model-based fault detection and isolation methods, 166 modified equation error, 141 modified series-parallel model, see polynomial Wiener model; modified series-parallel model MPC, see model predictive control multilayer perceptron, 2, 16–19, 24, 29, 34, 37, 40, 79, 163, 181 universal approximation property, 2 multiple-effect sugar evaporator, 180 experimental models steam pressure dynamics, 181 theoretical model, 180–181 muscle relaxation process, 165 NAR model, 8 NARMA model, 8 NARMAX model, 8, 159 NARX model, 8, 162 NBJ model, 8 network growing, 18, 19 pruning, 19 neural network Hammerstein model, 79–84, 165 model of MIMO nonlinear element, 82, 89 model of SISO nonlinear element, 79, 85 parallel MIMO model, 83, 90–91 parallel SISO model, 81, 85–87 series-parallel MIMO model, 83, 87 series-parallel SISO model, 81, 84 neural network Wiener model, 24, 31, 32, 34–39, 165 model of MIMO inverse nonlinear element, 39, 47 model of MIMO nonlinear element, 38, 46 model of SISO inverse nonlinear element, 35, 41 model of SISO nonlinear element, 34, 41 parallel MIMO model, 37–38, 48–49 parallel SISO model, 24, 34, 36, 42–46 series-parallel MIMO model, 39, 46–47 series-parallel SISO model, 36, 40–42 NFIR model, 8 NMA model, 8 NOBF model, 9 NOE model, 8 nonlinear models, 8–16 models composed of sub-models, 11
state space models, 10 nonlinear orthonormal basis function models, see NOBF model OE model, 7 optimal brain damage, 19 optimal brain surgeon, 19 orthonormal bases with fixed poles, 22 output error, 10, 32, 81 model, 105 parallel models, 10 pH neutralization process, 159–161 pneumatic valve simulation example, 66, 133 polymerization reactor, 163 polynomial Hammerstein model, 143 parallel, 143 model, 137, 143, 153, 155, 157, 163 of inverse nonlinear element, 117, 160 of nonlinear element, 117, 130, 143, 157, 160, 162, 164, 165 models, 1, 9 Wiener model, 119–125, 131 inverse series-parallel, 120 modified series-parallel, 119–125 nonlinear characteristic with linear term, 120, 126 nonlinear characteristic without linear term, 122, 128 parallel, 119 series-parallel, 119, 120 PRBS, see pseudo-random binary sequence precompensation of nonlinear distortion, 164 prediction error method, 65–66, 151–152 pseudo-random binary sequence, 26, 162 quartz microbalance polymer-coated sensors, 163 quasi-Newton algorithm, 31 radial basis functions, 8, 164 neural network, 164 real time recurrent learning algorithm, 52 recursive least squares algorithm, 24, 77, 164, 172 method, 122, 123 recursive prediction error method, 22, 65–66
recursive pseudolinear regression, 29, 77, 105, 154 RLS algorithm, see recursive least squares algorithm learning algorithms, 17, 70 method, see recursive least squares method RPE, see recursive prediction error method RPLR, see recursive pseudolinear regression saliency, 19 scaling, 18 sensitivity method, 29, 31, 40, 42–43, 48, 77, 84–86, 90–91, 132 models, 23, 31, 43, 49, 73, 74, 77, 86, 110, 112 sequential mode, 33, 43, 51–52, 78, 87, 96, 99, 106 series-parallel models, 10 SM method, see sensitivity method splines, 1 state space models, 10–11 Hammerstein models, see Hammerstein model, state space Wiener models, see Wiener model, state space steepest descent algorithm, 31, 77, 143, 165 stochastic approximation, 150 Taylor expansion, 19 two-tank system, 61 variable metric methods, 70 Volterra series models, 8, 24, 162 wavelets, 1, 8, 19 Weierstrass theorem, 1 white box models, 11 Wiener model, 1, 12 general, 13 inverse, 118 MIMO, 12, 22 MISO, 24 pseudolinear-in-parameters, 137 SISO, 12 state space, 14 Wiener model-based predictive control, 161 Wiener-Hammerstein model, 14