Quantitative Sciences on Biology and Medicine - Volume 1
Basic Principles and Practical Applications in Epidemiological Research
Jung-Der Wang
World Scientific
Basic Principles and Practical Applications in Epidemiological Research
QUANTITATIVE SCIENCES ON BIOLOGY AND MEDICINE Series Editors: Timothy T. Chen (University of Maryland, USA) Kung-Yee Liang (Johns Hopkins University, USA) & Lee-Jen Wei (Harvard School of Public Health, USA)
Vol. 1: Basic Principles and Practical Applications in Epidemiological Research by Jung-Der Wang
Basic Principles and Practical Applications in Epidemiological Research
Jung-Der Wang, M.D., Sc.D. Institute of Occupational Medicine and Industrial Hygiene, National Taiwan University College of Public Health, and Department of Internal Medicine, National Taiwan University Hospital
World Scientific
New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd. P O Box 128, Farrer Road, Singapore 912805 USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
BASIC PRINCIPLES AND PRACTICAL APPLICATIONS IN EPIDEMIOLOGICAL RESEARCH Copyright © 2002 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4801-6 ISBN 981-02-4925-X (pbk)
Printed in Singapore.
Preface

Epidemiology has developed rapidly over the last half-century. It has evolved into a discipline that can serve as a basic science for both public health and clinical medicine, as both draw on information collected from observational studies of human populations. As we enter the information age of the 21st century, all health care workers need practical training in judging and selecting valid information from studies of human populations. This capability is particularly important as more and more health-related information overflows the Internet and its webpages. In this book, my aim is to equip readers with basic concepts and principles for conducting and critiquing scientific studies involving human populations. Thus, the book is written not only for students majoring in epidemiology, as a first course and a conceptual review, but also for scholars, experts, and students who did not major in epidemiology but need such concepts to abstract valuable information from epidemiological studies or to cooperate with others in conducting them. It has been my dream to uncover the principles of how scientific knowledge is produced or advanced ever since I was a teenager in the 1960s. I was first puzzled and attracted by Hume's suspicion that a hypothesis or theory arrived at by induction can never be definitively proven universally true. Later, I was convinced by Popper that by empirically falsifying hypotheses or conjectures, the one left unrefuted is closest to the truth, and that our belief in existing physical and chemical theories rests on this foundation. Moreover, one must select samples from the population efficiently and validly, and conduct valid and sensitive measurements, to test these hypotheses empirically. However, as I matured in clinical medicine and public health, I found that health policy decisions cannot wait for such an unended quest, because we must make causal decisions to save lives and prevent morbidity. Thus, I began to accept and apply the Bayesian concepts of subjective probability and preference (utility) in my daily practice. The study of
epidemiology has given me an opportunity to summarize all these basic concepts. Hence, this book is written in an order intended to guide readers step by step, from the general concepts of scientific research in the natural sciences to inference in observational studies of human populations. Beginning with a classification of two types of inferences (descriptive and causal) in chapter 1, I spend chapters 2 and 3 elucidating the Popperian philosophy of science and how it can be practically applied in the search for etiologic agents. Then, in chapter 4, I approach causal criteria and causal decisions from a refutationist's attitude while incorporating the Bayesian decision viewpoint. Although most epidemiology textbooks do not discuss fundamental measurement concepts, I choose to tackle them from the viewpoint that "all true measurement is essentially comparative" (Helmholtz). When introducing epidemiological measurements in chapter 6, I have deliberately added some new developments in the quantification of the utility of health, such as the quality-adjusted life year (QALY). Principles of study design are presented in chapter 7, which delineates that searching for alternative explanations (or possible confounding) is the main concern of causal studies, while that of a descriptive study is how to infer from the sample at hand to the target population. People commonly encounter crude rates, which are in fact sets of specific rates weighted by sets of specific weights, and which should be reweighted before comparison (chapter 8). Although most epidemiology textbooks do not deal with the concept of sampling, I try to build it from scratch to give our readers a ball-park view in chapter 9. Causal epidemiological studies are introduced according to whether the denominator or accumulated person-time is ascertained individually for the whole population (as in cohort or follow-up studies, chapter 10) or for just a sample (as in case-control studies, chapter 11). Then, the principles of the whole book are summarized into a critique form for evaluating an epidemiological study in chapter 12. In the last chapter, I introduce a third type of question, the decisional question, for the application of epidemiology to health policy research, with a special emphasis on outcome assessment. Because my major goal is to introduce general concepts through practical examples, this book does not contain any details of data
analysis. Readers who are interested in such details are encouraged to read a more advanced textbook, such as Modern Epidemiology (edited by Rothman and Greenland, 1998). I am indebted to Professors Olli Miettinen and David Wegman for their teaching, advice and kind support during my study at Harvard University. The book was first drafted and taught in 1997 at the Mahidol University School of Public Health, Bangkok, Thailand. I am greatly indebted to all the faculty members and students who attended the class and provided input, especially Drs. Armonrath, Charlumchai and Pathom, who read through the text and gave me many invaluable comments. Later on, the content was also taught in the epidemiology class of the National Taiwan University College of Public Health, and received more input, especially from Professors Jing-Hsiang Hwang (chapter 9) and Kaiping Yao (chapter 5). It was then posted on a WHO website (http://www.who.int/peh-super) for readers' comments in 1999. I am especially grateful to the WHO officials and scholars, especially Dr. Hiko Tamashiro, for her kind, persistent encouragement and comments. Dr. Joseph Chiu, a Christian brother and a specialist in infectious diseases, kindly spent his time reading through the whole text and raised questions wherever the text was not clear enough. His encouragement and enthusiasm greatly strengthened the content of this book. To make the whole text more readable, Ms. Sunny Wang very kindly and carefully read and edited the whole text. I am also grateful to the National Science Council of Taiwan for funding most of the empirical research used in the book, and to Y. T. Lee's Foundation for Outstanding Scholars for their kind financial support during 1996-2000, when the whole book was drafted and revised. I am indebted to Professor Sander Greenland for his input before publication. Finally, I want to thank my Lord for providing me a happy and supportive family who went through all the evenings and days with me, especially my wife, Wang-Huei, who typed the first draft in Bangkok.
Contents

Preface   v
Chapter 1. Introduction to epidemiological research   1
Chapter 2. Principles of scientific research: Deductive methods and process of conjecture and refutation   17
Chapter 3. Scientific hypothesis and degree of corroboration   39
Chapter 4. Causal inference and decision   57
Chapter 5. Basic principles of measurement   81
Chapter 6. Basic measurements in epidemiological research   121
Chapter 7. Study design   161
Chapter 8. Adjustment and standardization of rates   197
Chapter 9. Introduction to sampling method and practical applications   211
Chapter 10. Follow-up study   233
Chapter 11. Case-control study   259
Chapter 12. How to critically review an empirical study   291
Chapter 13. Application of epidemiological methods in health service research and policy   307
References   341
Index   363
Chapter 1
Introduction to Epidemiological Research

1.1 Definition of epidemiology
1.2 Evolving trends of epidemiological research
1.3 Types of inferences in epidemiological research
1.4 Outline of the basic principles of epidemiological research
1.5 Summary
Introduction

In the last several decades, epidemiology has evolved from the simple observation of mortality and/or morbidity figures to a scientific discipline with versatile applications. With this expansion, we have witnessed a constant influx of innovative concepts and statistical tools of such magnitude that it may confuse a beginner in the field. Consequently, the aim of this textbook is to use examples from actual research to provide the reader with a clear and simple understanding of how to conduct epidemiological research. Since epidemiological research is based on the basic principles of scientific research, readers from any field can benefit from this fundamental understanding. Essentially, the principles of epidemiological research will be introduced as a liberal art (Fraser, 1987). In this chapter, we shall start with the definition of epidemiology. Then, we will discuss how its scope of study has expanded and summarize its different studies into two types of questions or inferences. Finally, I shall give an outline of what we will learn in this book.

1.1 Definition of epidemiology
The word epidemiology comes from the late Latin or Greek epidemia. Epi indicates on or among, while demos means people, and logos or logy denotes science or theory. Thus, the word conveys the idea that
epidemiology is the study of some event or characteristic occurring or prevalent among people. Since the 19th century, epidemiology has been commonly used to study people's illnesses, and as a result, it has become customary that the main subject matter of epidemiology involves disease or health-related events or states. MacMahon and Pugh (1970) defined it as the study of the distribution and determinants of disease frequency in man. Later, Miettinen (1985a) defined it as the principles of studying the occurrence of illness and related states and events, including those of health care. However, these two definitions emphasize only the study of illness and health-related issues. Since the scope of epidemiological study has expanded to cover health care and policy, I will tentatively define epidemiology as the study of the occurrence of health-related events, states and policies in a human population. Referring back to its broader Greek definition, I believe that the methodology of epidemiological research is useful in any discipline involved in the study of a human population, such as sociology and psychology. Thus, the clarification of the basic principles of epidemiological research will help facilitate their broader applications.

1.2 Evolving trends of epidemiological research
During the past half-century, there have been three developing trends in epidemiological research: from acute to chronic diseases, from disease-oriented to determinant-oriented approaches, and from health-related events to health policy research and decisions.

From acute to chronic diseases

In the late 19th and early 20th centuries, the subject matter of epidemiology focused mainly on fatal acute or infectious diseases (Snow, 1936; Durkheim, 1951). After World Wars I and II, the rapid development of microbiology and antibiotics quickly conquered most infectious diseases caused by bacteria. Today, except for AIDS (acquired immune deficiency syndrome), the mortality rates of most infectious diseases have progressively dropped, and they are no longer among the top 10 leading causes of death in most
developed countries, such as the United States (Table 1.1).

Table 1.1 Comparison of ten leading causes of death between 1940 and 1995 in the United States (U.S. Department of Commerce, 1945; Rosenberg HM et al., 1996). Crude mortality rates are per 100,000.

Rank  1940                                  Rate    1995                                                    Rate
1     Heart diseases                        292.5   Heart diseases                                          281.2
2     Malignant neoplasm                    121.3   Malignant neoplasm                                      204.7
3     Cerebrovascular diseases              90.9    Cerebrovascular diseases                                60.2
4     Nephritis                             81.5    Chronic obstructive pulmonary diseases                  39.9
5     Pneumonia                             70.3    Accidents and adverse effects (including motor vehicle) 34.1
6     Accidents (excluding motor vehicle)   47.4    Pneumonia and influenza                                 31.8
7     Tuberculosis                          45.9    Diabetes mellitus                                       22.5
8     Diabetes mellitus                     26.6    Human immunodeficiency virus infection                  16.2
9     Motor vehicle accidents               26.2    Suicide                                                 11.8
10    Premature birth                       24.6    Chronic liver disease                                   9.5
This shift of leading causes of death from infectious to non-infectious or chronic diseases, such as cancer and cardiovascular diseases, is even more obvious in some recently industrialized countries, such as Taiwan (Table 1.2).
Table 1.2 Comparison of ten leading causes of death between 1952 and 1997 in Taiwan. Crude mortality rates are per 100,000.

Rank  1952                        Rate    1997                              Rate
1     Gastroenteritis             135.0   Malignant neoplasm                134.10
2     Pneumonia                   131.5   Cerebrovascular diseases          59.56
3     Tuberculosis                91.6    Accidents and injuries            52.22
4     Cardiovascular diseases     49.0    Cardiovascular diseases           49.71
5     Cerebrovascular diseases    48.8    Diabetes mellitus                 34.67
6     Perinatal mortality         44.1    Chronic hepatitis and cirrhosis   22.03
7     Nephritis and nephropathy   36.3    Pneumonia                         16.73
8     Malignant neoplasm          30.7    Nephritis and nephropathy         16.20
9     Bronchitis                  28.1    Hypertension                      12.07
10    Malaria                     27.5    Suicide                           10.04
Since epidemiologists are primarily interested in studying how to prevent morbidity and mortality from all kinds of diseases, epidemiological research now also covers chronic and/or non-infectious diseases. Faced with this new challenge, epidemiologists quickly sensed the importance of the concept of time in chronic health problems. Thus, total person-time at risk has been developed to replace the number of persons or population at risk as the denominator of the incidence rate. Although, in general, the responsible agents for acute infectious diseases can be clearly identified because of their acute onset and simple causal relationship(s), determining the etiology of chronic diseases is more difficult. Chronic diseases pose a more complex problem because of their longer induction time and more complicated pathophysiology. As a result, there is a greater opportunity for other (extraneous) agents or determinants to affect the outcome under study as well. In this book, the existence of such extraneous determinants, which may partially or totally explain the causal effect, will be regarded as confounding.
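To make the person-time denominator mentioned above concrete, here is a minimal sketch in Python; the follow-up data are invented for illustration only. It contrasts an incidence rate, whose denominator is accumulated person-time at risk, with a naive per-person risk that ignores unequal observation times.

```python
# Hypothetical follow-up data: (years observed, developed disease?)
# All numbers are invented for illustration.
subjects = [
    (5.0, False),
    (2.5, True),   # case occurred after 2.5 years of follow-up
    (4.0, False),
    (1.0, True),
    (3.5, False),
]

cases = sum(1 for years, diseased in subjects if diseased)
person_years = sum(years for years, _ in subjects)

# Incidence rate: cases per unit of accumulated person-time at risk
incidence_rate = cases / person_years       # cases per person-year
# Naive "per person" risk ignores how long each subject was observed
risk_per_person = cases / len(subjects)

print(f"{cases} cases over {person_years} person-years")
print(f"Incidence rate: {incidence_rate:.3f} per person-year")
print(f"Crude risk per person: {risk_per_person:.3f}")
```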
For example, when cigarette smoking was first proposed to cause lung cancer by Hill (1953), many eminent scholars continued to dispute these findings (Stolley, 1991). Therefore, the concepts of cause (Hill, 1965; Rothman, 1976, 1981, 1986, 1988; Susser, 1977, 1986, 1991) and confounding (Miettinen, 1974b; Miettinen and Cook, 1981; Greenland and Robins, 1986; Greenland and Rothman, 1998) have gone through a long period of development to accommodate the growing coverage of chronic diseases. Prolonged observation time has also facilitated the development of a more efficient sampling and observation method for studying health-related events using a case-control study design (Cornfield, 1951; Miettinen, 1976; Breslow, 1980). A case-control study is one type of epidemiological study, which will be discussed later in Chapter 11. From disease-oriented to determinant-oriented approaches By understanding the agent(s) or determinant(s) of a disease, one can take appropriate and specific protective measures against the disease. For example, the drive to understand microbiological agents led to the development of vaccinations and the field of microbiology. In similar fashion, today's epidemiologists are examining determinants of chronic diseases for possible methods of treatment and prevention. Yet, the more complex interaction of chronic disease agents and factors of daily life, such as life-style, diet, occupation, living environment, etc., further heighten the demand for a more detailed understanding. This need for a more refined conceptualization and measurement of each individual determinant has extended traditional disease-oriented epidemiology to determinant-oriented epidemiology. Since such an approach frequently involves other scientific disciplines, such as occupational health and nutritional science, new methods of interdisciplinary approach and subdomains of epidemiology have been developed. For example, the incorporation of knowledge from occupational health into epidemiology has led to the formation of occupational epidemiology. Likewise, the use of nutritional science and sociology has resulted in the development of nutritional epidemiology and
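As a numerical illustration of such mixing of effects, the following sketch (Python; all counts are wholly hypothetical) shows a crude risk ratio of about 4.6 that collapses to 1.0 within each stratum of an extraneous determinant that is both a strong risk factor for the outcome and associated with the exposure, the hallmark of confounding.

```python
# Hypothetical counts relating an exposure to a disease, stratified by an
# extraneous determinant (the putative confounder). Numbers are invented.
# Each stratum: (exposed cases, exposed total, unexposed cases, unexposed total)
strata = {
    "confounder present": (90, 1000, 9, 100),
    "confounder absent":  (10, 1000, 100, 10000),
}

def risk_ratio(a, n1, b, n0):
    # Ratio of the risk among the exposed to the risk among the unexposed
    return (a / n1) / (b / n0)

# Crude analysis: collapse the strata and compare risks directly
a = sum(s[0] for s in strata.values()); n1 = sum(s[1] for s in strata.values())
b = sum(s[2] for s in strata.values()); n0 = sum(s[3] for s in strata.values())
print(f"Crude risk ratio: {risk_ratio(a, n1, b, n0):.2f}")      # ~4.63

# Stratified analysis: within each level of the confounder
for name, (a, n1, b, n0) in strata.items():
    print(f"Risk ratio, {name}: {risk_ratio(a, n1, b, n0):.2f}")  # 1.00 each
```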
social epidemiology, respectively. In these new subdomains, one can now utilize specific measurement methods from other disciplines and attempt to examine the effect of multiple determinants on the pathophysiology in question. Thus, epidemiology has also extended from a disease- or illness-centered approach, e.g., breast cancer epidemiology, stroke epidemiology, tuberculosis epidemiology, etc., to include the aforementioned determinant-oriented subdomains, such as occupational epidemiology, nutritional epidemiology, etc. Of course, such an extension does not necessarily mean that the disease-oriented approach will be abolished or lose its importance. Rather, the disease-oriented approach will continue to thrive, while the determinant-oriented approach will add to the understanding of the pathophysiology and natural history of specific diseases. Due to the above two approaches, epidemiology has gradually become the basic science of public health (Morris, 1975) and even of clinical medicine (Feinstein, 1983, 1985).

From health-related events to health policy research and decisions

While, initially, epidemiologists may have mainly focused on fatality due to disease, i.e., mortality, they have gradually extended their work to morbidity and even quality of life (QOL) because of today's generally longer life expectancy. Given the limited resources of health services, there is a growing demand for people to make choices among different preventive, diagnostic, therapeutic and rehabilitative measures that affect survival and QOL. For example, a 63-year-old patient must choose whether or not to be operated on if she suffers from aggravating hip joint pain and has a 10-year history of stable angina pectoris (Sackett et al., 1991). Because the operation may result in long-term disability or even mortality, she may end up worse off than in her current health state. The decision is not obvious and is very difficult to make. Take an example of public health decision-making: Should we spend more resources on the prevention of AIDS (acquired immune deficiency syndrome), cancer or stroke? Since all three diseases may result in mortality or a chronically poor quality of life, we cannot rely on mortality rates alone to decide policy. As a result, epidemiologists (or biostatisticians) have now sought to quantify both survival
and QOL, or even the total utility of health gained from the reduction of an exposure, because no other discipline has shown genuine interest in, nor is trained to carry out, such a task. Therefore, epidemiological research now also attempts to assess the effectiveness of health services or policies (Tsauo et al., 1999). At the least, epidemiologists should learn how to cooperate with people in other disciplines, such as biostatisticians, economists, and oncologists, in tackling health policy issues.

1.3 Types of inferences in epidemiological research

The development of new concepts and methods in the broadening field of epidemiology often creates confusion among aspiring young epidemiologists and health professionals who have not majored in epidemiology. For example, case-control study design, which has potential applications in disciplines like sociology, already has many different names: retrospective study, case-referent study, case-base study, case-cohort study, case-control study nested in a cohort, control-initiated case-control study, multiple control series, multiple case series, etc. Similarly, people often worry about sampling procedures and response rates. One of the most common questions is: "Suppose that I have a 40% (or X%) response rate; is the sample representative enough?" I believe that such questions commonly arise from the lack of a systematic understanding of the basic principles of epidemiological research, although there are many standard textbooks of epidemiology on the market. Consequently, I take this opportunity to provide readers with the most intuitive approach: Begin by asking the research question in either a causal or descriptive form, conduct the study in a scientific manner, and then draw inferences from either the causal or descriptive viewpoint, as summarized in Figure 1.1. By "scientific manner," I mean drawing representative samples for descriptive studies, and proposing all possible hypotheses and falsifying them one by one for causal studies. We shall discuss the detailed process of scientific research in the next two chapters and in Chapter 7. For now, let us first examine some real examples of descriptive and causal research for
illustration:

Descriptive:
1. What are the disease pattern and demand for the emergency medical system in Taipei (Chiang et al., 1986)? In other words, how many visits for each different disease are there in one year?
2. What is the proportion, i.e., prevalence rate, of alcoholism among Taiwan aborigines (Hwu et al., 1990)?

Causal:
1. What are the different causal agents for outbreaks of different diseases inside printing factories, such as polyneuropathy, respiratory paralysis or hepatitis (Wang et al., 1986; Tsai et al., 1990; Deng et al., 1987)?
2. Does wearing a helmet protect motorcycle riders from head injuries (Tsai et al., 1995)?

In descriptive studies, one generally desires to find facts pertaining to an entire population, while in causal studies, one's questions focus on the causal agents that produce the outcome of interest. The key question of the former is: How do we perform representative sampling? Or, how do we draw appropriate inferences for a particular population group from our collected sample? The most important question of the latter is: Is there any alternative determinant which can explain the causal relationship of interest, indicating a mixed effect (i.e., confounding) in causality? Although the two types of studies ask very different questions, one must always recruit subjects and perform measurement(s) for both. From asking the right question to understanding the basic principles, one should be able to design a valid study and appropriately interpret the results. Therefore, the objective of this book is to equip our readers with such ability. How can we achieve such a goal? What are the basic principles of epidemiological research?
Raise a question to be answered.

If the question is causal:
1. Propose all possible hypotheses.
2. Deduce the facts to be found from each hypothesis (H1 -> F11, F12, ..., F1k; H2 -> F21, F22, ..., F2k; ...; Hn -> Fn1, Fn2, ..., Fnk).
3. Perform measurements to observe what has actually happened and begin to falsify the hypotheses.
4. Apply statistical tools to summarize the data and control confounding. (Perform more falsification.)
5. Rule out all alternative hypotheses. Regard as valid the only hypothesis left unrefuted.

If the question is descriptive:
1. Define what and whom to measure.
2. Perform representative sampling and measurements.
3. Examine your sampling data of respondents, and draw inferences for the appropriate population group.

Figure 1.1 How to approach a problem.
1.4 Outline of the basic principles of epidemiological research
In the last half-century, different applications of epidemiological research have grown and expanded to cover a wide range of fields. Epidemiology has evolved into a liberal art and a basic science of both clinical medicine and public health. Throughout its progress, different authors have tried to summarize the common principles of these studies (MacMahon and Pugh, 1970; Kupper et al., 1982; Miettinen, 1985a; Rothman, 1986; Rothman and Greenland, 1998; Kelsey et al., 1996). This book has the same objective. Its main themes are abstracted as follows:

1. To understand how scientific research is performed. According to Sir Karl Popper, a philosopher of science, objective knowledge is obtained through the repeated cycle of proposing hypotheses and falsifying them one by one. He labeled this concept "conjectures and refutations" (Popper, 1965). This concept serves as a major part of the foundation of this book and will be delineated in Chapter 2.

2. To learn how to propose hypotheses that are scientific, falsifiable or testable. Empirical tests can only be performed on hypotheses that are falsifiable. Without such tests, we are unable to differentiate which hypothesis is closest to the truth. Thus, one must know how to propose a falsifiable and relevant hypothesis. This will be discussed in Chapter 3.

3. To make causal decisions from a refutationist's point of view. While the scientific quest for natural laws or causal hypotheses may never end, public health and medicine often demand that we make causal decisions under different scenarios and degrees of uncertainty. Thus, an epidemiologist needs to learn the causal criteria for decision-making. Still, it is better to look at these criteria from a refutationist's viewpoint, in order to avoid causing any harm during the intervention process. This will
be discussed in Chapter 4.

4. To accurately conceptualize and measure the objects under study. To perform an empirical test, one must be able to conceptualize and measure the subject matter of interest. As a result, one first needs to grasp the fundamental principles of measurement and be able to apply them in empirical studies. This topic will be presented in Chapter 5.

5. To carry out valid and precise epidemiological measurements. Epidemiologists have tried to distinguish health events from health states and to develop specific indicators to measure them, such as the incidence rate, prevalence rate and quality-adjusted life-year, as well as indicators of effect such as the rate ratio, rate difference and odds ratio. These indicators will be discussed in Chapter 6.

6. To design causal and descriptive studies and draw appropriate inferences. Concepts of confounding and response rate will be discussed in Chapter 7 as the basis for proper study design and inferences. One should approach confounding by looking for any alternative hypotheses that have not yet been falsified; with a low response rate, one should consider whether there is a large difference between respondents and non-respondents. Chapter 7 will delineate the basic concepts of a valid and efficient study design and proper inference of results.

7. To apply adjustment of rates as a means of controlling confounding, and to look at a crude rate as a summation of weighted specific rates. These topics will be covered in Chapter 8.
8. To select the most efficient sampling procedures and to know how to properly generalize the study result obtained from a sample to the population. These topics are covered in Chapter 9.

9. To conduct follow-up or cohort studies in a manner such that no extraneous factor can also explain the effect under study. Chapter 10 discusses all related topics of follow-up studies, including clinical trials.

10. To understand why case-control study design is a general solution when no denominator or person-year data are available, and to conduct proper sampling in such a design. Chapter 11 will cover the concept and development of the case-control study, including density sampling, the mortality odds ratio and other related topics.

11. To critically evaluate research based on the above conceptual development. All principles covered in previous chapters are summarized into two practical formats, i.e., descriptive and causal, in order to assess the scientific merit of a study. Moreover, guidelines on how to approach a problem and how to write a scientific paper will also be provided in Chapter 12.

12. To apply the concept and measurements of the utility of health in health policy research. To carry out such a task, one needs to understand the concept of decision analysis and estimate the utility of health by combining survival with the quality-of-life function into a common unit, e.g., the quality-adjusted life-year (QALY). The effectiveness of health services can thus be evaluated, in addition to the risk of a specific condition. Chapter 13 will introduce the
usual format of decisional questions and the measurement of the utility of health, making the quantification of health policy research more feasible.
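As a minimal sketch of the QALY idea in item 12 (in Python; the health states and utility weights below are invented for illustration), each period of survival is weighted by a utility between 0 (death) and 1 (perfect health), and the weighted durations are summed:

```python
# Quality-adjusted life-years: weight each period of survival by a
# utility score between 0 (death) and 1 (perfect health).
# The course of illness and utility weights are hypothetical.
course = [
    (2.0, 1.00),   # 2 years in full health
    (3.0, 0.70),   # 3 years with moderate disability
    (1.5, 0.40),   # 1.5 years with severe disability
]

qaly = sum(years * utility for years, utility in course)
life_years = sum(years for years, _ in course)

print(f"Unadjusted survival: {life_years:.1f} life-years")   # 6.5
print(f"Utility of health:   {qaly:.2f} QALYs")              # 4.70
```

Two interventions yielding the same survival can thus be distinguished by the quality of the years they provide.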
1.5 Summary
The principles of studying the occurrence of health events, states and related issues in a population have developed into a scientific discipline called epidemiology. During the last half-century, it has expanded rapidly to cover new areas: from acute or infectious diseases to chronic diseases, from disease-oriented to determinant-oriented approaches, and from health-related events to health policy research. To complement this growth, new concepts and measurements, as well as powerful statistical tools, have been developed to tackle the different questions. Yet, this rapid development and these varied approaches may often confuse beginners and/or investigators coming from other scientific disciplines. The goal of this book is to summarize these new developments into basic principles consistent with the scientific inquiry of proposing hypotheses and falsifying them one by one, i.e., conjectures and refutations. This task may be accomplished by classifying all epidemiological research questions into two types, descriptive or causal, although health policy research may also involve a third, decisional type of question.
Quiz of Chapter 1
Please write down the probability (%) that each assertion is true. You will be scored according to the credibility of your subjective judgment.

1. The primary concern of study in epidemiology is the human population, which is also the main subject field of the social sciences.
2. In the 21st century, infectious disease epidemiology will become less important, as almost all such diseases will be under control.
3. Since the study of etiologic agents provides a scientific basis for the proactive prevention of diseases, epidemiology will become more and more important in the 21st century.
4. One must understand the detailed mechanism of an etiologic agent and its pathophysiology in order to implement effective prevention. Thus, it is very natural that epidemiological studies will also focus on the details of different major determinants of a disease.
5. If a disease takes a longer period of time to develop symptoms, there will be more opportunity for other determinants to confound the causal effect.
6. The ultimate goal of an epidemiological study usually includes some implication for making preventive health policy.
7. If the response rate of a population sample is below 50%, then one cannot draw any conclusion or inference at the end.
8. Most epidemiological research can be classified into 2 different types of inference: causal and descriptive.
9. Basic concepts and principles of epidemiological research are quite different from those of other natural sciences.
10. Concepts and principles learned from epidemiology cannot be applied to clinical science, because the former primarily deals with populations while the latter studies individuals.

Answers: (1) T (2) F (3) T (4) T (5) T (6) T (7) F (8) T (9) F (10) F
Chapter 2
Principles of Scientific Research: Deductive Methods and Process of Conjecture and Refutation

2.1 The process of scientific research
2.2 Deductive methods
2.3 Conjectures and refutations
2.4 Why take a refutational attitude?
2.5 The limitations of conjectures and refutations
2.6 Summary
Introduction

Epidemiological research, with its specific focus on the health of human populations, is still a discipline based on the general principles of scientific research in searching for objective knowledge. Consequently, we shall start by understanding these principles: a repetitive process of proposing and disproving (falsifying) hypotheses one by one, or in Popper's terms, conjectures and refutations (Popper, 1965, 1968). Because our hypothesis may not be directly testable or falsifiable empirically, we need to deduce statements from it, which may then be more testable or falsifiable. Thus, hypothesis formation, deduction and empirical tests form the basis of scientific research. In contrast to a verificationist's view, however, I take a refutationist's stand, because empirical tests can only disprove hypotheses and can never prove any one to be true. Since human populations can only be observed, or at most randomized with informed consent, taking a falsificationist's attitude in such studies may help one to avoid and detect erroneous conclusions. Let us first examine some examples of scientific research in this chapter, and clarify its basic principles and limitations.
2.1 The process of scientific research

The example of nosocomial infection is a good illustration of the process of scientific research. Although Hempel (1966) used this research to advocate a verificationist's viewpoint, I will, on the other hand, discuss this example from the point of view of a refutationist.

Example: Etiologic agent of puerperal fever

During 1844-8, a Hungarian doctor, Semmelweis, served as an obstetrician in the Vienna General Hospital. He found that the mortality rate due to puerperal fever in obstetric ward I was higher than that of ward II, as listed in Table 2.1. At the time, microbiology was still a developing science and knowledge of infectious diseases was very limited. I will attempt to show how one uses empirical tests to refute rather than verify the proposed hypotheses, by specifically looking at the difference in mortality rates between the two obstetric wards of the same hospital.
Table 2.1 Mortality rates of puerperal fever in the obstetric wards of Vienna General Hospital, during 1844-6.

Calendar year   1844    1845    1846
Ward I          8.2%    6.8%    11.4%
Ward II         2.3%    2.0%    2.7%
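As a quick arithmetic check (a Python sketch using the rates from Table 2.1), Ward I's mortality was roughly three to four times that of Ward II in every year, a persistent difference that demanded an explanation:

```python
# Puerperal fever mortality rates (%) from Table 2.1,
# Vienna General Hospital, 1844-6
ward_I  = {1844: 8.2, 1845: 6.8, 1846: 11.4}
ward_II = {1844: 2.3, 1845: 2.0, 1846: 2.7}

for year in sorted(ward_I):
    ratio = ward_I[year] / ward_II[year]
    print(f"{year}: Ward I vs. Ward II mortality ratio = {ratio:.1f}")
# 1844: 3.6, 1845: 3.4, 1846: 4.2
```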
Hypothesis 1: The agent of puerperal fever was infectious via atmospheric-cosmic-telluric change. Refutation: If hypothesis 1 were true, then the agent would easily spread to other areas, such as obstetric ward II or other nearby obstetric clinics in Vienna. However, the fact that ward II and nearby clinics were relatively spared from the disease contradicts this hypothesis.
Hypothesis 2: Crowding in obstetric ward I produced a higher mortality in ward I.
Refutation: Initially, ward I was more crowded. However, because many women moved to ward II for fear of the high mortality in ward I, the two wards later became almost equally crowded. Yet, the difference in mortality rates persisted. Hypothesis 2 was thus refuted.

Hypothesis 3: Patients in obstetric ward I were examined by medical students, whose poor examination skills resulted in high mortality.
Refutation: Injuries resulting from the delivery of a baby are in general much greater traumas than rough examinations. Midwives who received their training in ward II might examine the patient in a similar manner but did not show the same ill effect. Moreover, the mortality rate of puerperal fever persisted after examinations performed by medical students were minimized or even terminated temporarily.

Hypothesis 4: The psychological stress of death produced high mortality. During this time, a dying person was usually given an anointing ceremony for consecration, in which a minor priest would ring a bell while walking in front of the senior pastor. The huge psychological stress, created by the ringing noise of the "dying" bell, facilitated the deaths of the women in ward I.
Refutation: Dr. Semmelweis asked the minor priest not to ring the bell, but the high mortality rate persisted.

Hypothesis 5: The posture of delivery caused the difference in mortality rate. Women in ward I usually delivered their babies in the supine position, while women in ward II delivered in the lateral position.
Refutation: Dr. Semmelweis ordered all women in ward I to deliver their babies in the lateral position, but the mortality rate remained high.
Hypothesis 6: Puerperal fever was transmitted through substances originating from a dead body.
Failed refutation: Dr. Kolletschka, a colleague of Dr. Semmelweis, developed symptoms similar to puerperal fever after he was cut by a student's autopsy knife, which had touched a patient who had died of puerperal fever. Dr. Kolletschka later died. If doctors or medical students examined patients in ward I after they finished an autopsy, such actions may have spread the puerperal fever agent from the dead to the living. Since knowledge of infectious agents was limited at this time, it was not possible to directly test the primary hypothesis, i.e., to observe and measure what substances came from the dead bodies. Thus, in order to test this hypothesis, Semmelweis invoked an auxiliary hypothesis, which he could then test. He conjectured that water containing calcium chlorite destroys the agent, i.e., it sterilizes the doctors' hands. Semmelweis required every doctor and student to wash their hands after autopsy. As a result, the mortality rate in ward I dropped to 1.27% in 1848, while that of ward II showed a 1.33% rate. Moreover, women who delivered their babies during transport also had a lower mortality rate, because they were not examined and thus were not transmitted any contaminating agent before delivery. If a mother suffered from puerperal fever, her newborn baby was also more likely to die of a similar fever, because the agent might be transmitted to the baby during delivery. These facts seemed to support the hypothesis that the agent came from a dead body.
Popper (1965) called the above process of scientific research "conjectures and refutations." It is a repetitive cycle of proposing hypotheses (or, conjectures) and attempting to test each one, instead of
confirming them. As with the example of puerperal fever, one can never prove a hypothesis, but one can fail to refute it. One is thus left with the more valid conclusion by deductively eliminating all other possible causes. Figure 2.1 summarizes the principles of scientific research. In human observational studies, one often makes the mistake of only looking for confirmatory facts, and consequently, one may more easily become self-deluded. By taking a refutational attitude, one can better avoid this fallacy. For example, if Dr. Semmelweis had stopped his refutation effort, he would have failed to consider the possibility of transmission of the agent from living bodies. Later, he noticed that in one instance he had not washed his hands after examining a patient with cervical cancer and proceeded to examine another 12 women awaiting delivery. Later, 11 of them died of puerperal fever. Consequently, he immediately modified his hypothesis and proposed that the agent could also be transmitted through a living body. In fact, even if the refutation of hypothesis 6 had been successful, it would not necessarily follow that the primary hypothesis was refuted, because an auxiliary hypothesis was invoked, i.e., that the disinfection of hands with calcium chlorite was effective in killing the agent. Thus, if one has refuted a hypothesis indirectly, one should consider whether or not one has only falsified an auxiliary hypothesis and not the primary hypothesis. To do so, one must examine all auxiliary hypotheses made during the empirical tests and measurements. This process can be expressed in more abstract terms: Let H denote the primary hypothesis; A1, A2, ..., An denote the auxiliary hypotheses invoked; and B denote the predicted outcome or effect. Then, if H, A1, A2, ..., An are all true, then B is true. If we have found B to be false, then at least one of H, A1, A2, ..., An should be false, but not necessarily H or any particular Ai.
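Rendered in logical notation (a LaTeX sketch of the argument just stated), this is modus tollens applied to a conjunction of the primary and auxiliary hypotheses:

```latex
\[
\bigl( H \wedge A_1 \wedge A_2 \wedge \cdots \wedge A_n \bigr) \rightarrow B,
\qquad \neg B
\;\;\therefore\;\;
\neg \bigl( H \wedge A_1 \wedge A_2 \wedge \cdots \wedge A_n \bigr)
\]
```

A false prediction B therefore refutes only the conjunction: at least one of H, A1, ..., An must be false, but the logic alone does not say which one.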
In all empirical scientific studies, one always assumes or invokes the auxiliary hypothesis that the measurements made are accurate enough to detect the effect predicted by the hypothesis. For epidemiological studies, which are empirical observations of human populations, one usually invokes additional auxiliary hypotheses. Thus, one should be careful to avoid possible confounding, or mixing of effects, by any extraneous factors, which enter the argument just as auxiliary hypotheses do.
Conjectures: Propose all possible explanations or hypotheses: H1, H2, ..., Hn.
Deductions: Assume that each hypothesis is true, then predict what should or should not happen: H1 => F11, F12, ..., F1k; H2 => F21, F22, ..., F2k; ...; Hn => Fn1, Fn2, ..., Fnk.
Observations and measurements: Design a study and observe the target facts predicted by the hypotheses. Because of ethical concerns against experimentation, one can only conduct observational studies on human beings.
Data analysis: Use statistics as a summary tool and control confounding so that the causal relationship is clear.
Refutations: Falsify and rule out each hypothesis by examining the observed and predicted facts. The more valid hypothesis will remain unrefuted.
Repeat the cycle to explore the details of the mechanism.

Figure 2.1 Conjectures and refutations are the basis for causal epidemiological research. This figure also shows the proper places for observations and measurements, as well as statistical analysis (Wang, 1991).
2.2 Deductive methods: Common logical reasoning
The example of the preceding section reminds us that we often use deductive logic to derive more specific statements from a hypothesis, which can then be tested empirically. Such statements are very useful because, according to deductive logic, they are guaranteed to be true if the premise is true. Therefore, we should use these deductive rules in our everyday research. In this book, I list only the most common ones, as in Table 2.2. Further details may be found in standard textbooks of logic, like those written by Tarski (1957) or Copi (1972).
Table 2.2 Most frequent forms of logical reasoning.

Notation: "⊃" implies; "∨" or; "≡" is equivalent to; "~" not; "·" and; "∴" therefore.

(1) Modus Ponens: p ⊃ q; p; ∴ q.
e.g., If p, "Smokers are more likely to develop lung cancer," is true, then q, "Chinese women who smoke are more likely to develop lung cancer," is also true because Chinese women smokers are included among all smokers.

(2) Modus Tollens: p ⊃ q; ~q; ∴ ~p.
e.g., If "Smokers are more likely to develop lung cancer" is true, then "Chinese women who smoke are more likely to develop lung cancer" is also true. If we have evidence to demonstrate that Chinese women smokers are not more likely to develop lung cancer than nonsmokers, then we can conclude that the premise is not true either.

(3) Hypothetical Syllogism: p ⊃ q; q ⊃ r; ∴ p ⊃ r.
e.g., If "Smokers are more likely to develop lung cancer" is true, then "Women who smoke are more likely to develop lung cancer" is true. If "Women smokers are more likely to develop lung cancer" is true, then "Chinese women smokers are more likely to develop lung cancer" is also true. If the first premise is true, then "Chinese women smokers are more likely to develop lung cancer" should be true.

(4) Disjunctive Syllogism: p ∨ q; ~p; ∴ q.
e.g., Assume that "Today Mr. John Doe is either attending school or going home" is true. If we know that Mr. Doe is not attending school now, then he must be going home.

(5) Constructive Dilemma: (p ⊃ q) · (r ⊃ s); p ∨ r; ∴ q ∨ s.
e.g., "If you want to write a good thesis, then you should first carry out a good study." And "If you want to be a good epidemiologist, then you should first study epidemiological methods." Since you want to either write a good thesis or be a good epidemiologist, you ought to carry out a good study or you should study epidemiological methods.

(6) Absorption: p ⊃ q; ∴ p ⊃ (p · q).

(7) Simplification: p · q; ∴ p.

(8) Conjunction: p; q; ∴ p · q.

(9) Addition: p; ∴ p ∨ q.
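Because these argument forms are purely truth-functional, their validity can be checked mechanically. The following sketch (Python) verifies modus ponens, modus tollens and the disjunctive syllogism by exhaustive truth tables; an argument form is valid when the conclusion holds in every row where all premises hold:

```python
from itertools import product

def implies(p, q):
    # Material implication: p ⊃ q is false only when p is true and q is false
    return (not p) or q

def valid(premises, conclusion, n_vars):
    # Valid iff the conclusion is true in every truth-table row
    # in which all premises are true.
    return all(
        conclusion(*vals)
        for vals in product([True, False], repeat=n_vars)
        if all(prem(*vals) for prem in premises)
    )

# (1) Modus ponens:          p ⊃ q, p   ∴ q
print(valid([lambda p, q: implies(p, q), lambda p, q: p],
            lambda p, q: q, 2))          # True

# (2) Modus tollens:         p ⊃ q, ~q  ∴ ~p
print(valid([lambda p, q: implies(p, q), lambda p, q: not q],
            lambda p, q: not p, 2))      # True

# (4) Disjunctive syllogism: p ∨ q, ~p  ∴ q
print(valid([lambda p, q: p or q, lambda p, q: not p],
            lambda p, q: q, 2))          # True
```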
2.3 Conjectures and Refutations
Traditionally, one applies the induction method to propose a more general hypothesis after observing several instances. For example, after observing 100 black ravens, one may propose a hypothesis stating that all ravens are black. Then the hypothesis, or "theory" as some may contend, implies that the 101st raven, and even ravens yet unborn, are also black. This hypothesis is only a guess or conjecture. One can never prove that it is true, even after observing thousands or more black ravens, because one has no logical basis on which to guarantee that the next one will be black. Just as Hume claimed in the 18th century: "We are never able in a single instance, to discover any power or necessary connection, any quality which binds the effect to the cause, and renders the one an infallible consequence of the other. We only find that one does actually, in fact, follow the other." Hume argued that one can never prove a causality such as "A causes B." Instead, one can only say that whenever A occurs, B is frequently observed. The doubt that Hume raised remained unanswered until, in the late 1930s, Popper (1965, 1972) proposed an alternative viewpoint to tackle the issue.
Popper considered that although one cannot prove that a hypothesis is always true, one can still refute a hypothesis, i.e., prove it to be false. The one that remains unrefuted is more likely to be true in comparison with those already falsified. Since the search for truth and causal mechanisms never ends, the process of conjecture and refutation should continue indefinitely. Epidemiological research is one of the many types of scientific research, focusing on the human population. Due to ethical concerns, scientists are forbidden to perform experiments on human beings, and can only conduct randomized clinical trials after obtaining the patients' informed consent. Thus, research conducted among a human population must rely mostly on observations of the population under study. Yet, scientists must be very careful in how they obtain such observations. Since a large multitude of events happen each day, it is very natural for one to simply pick what one wants to observe, i.e., confirmatory facts, and ignore evidence that may contradict the proposed hypothesis. The Popperian approach, as shown in Figure 2.1, can help one to avoid such a fallacy and fits quite well in etiologic research. The following is a typical process of an etiologic diagnosis by an occupational physician (Wang, 1991). Astute occupational physicians or health professionals usually start to suspect an occurrence of occupational disease whenever there is a clustering of cases during a certain time period and within a specific workplace. They then propose all possible etiologies or causes according to present-day medical knowledge and toxicological databases. For each individual hypothesis, they deduce several consequences, or try to predict what could and/or should happen. Then, they design a study and go to the field to obtain (i.e., observe and measure) the target facts predicted by each hypothesis. After collecting such information, they summarize their findings and try to control for any confounding with statistical tools. Also, they attempt to challenge each hypothesis by examining the observed and predicted facts. After ruling out all false hypotheses by deduction, they may have only one hypothesis left unrefuted. Consequently, they may consider this hypothesis to be the most valid. By repeating this cycle, they can further explore the mechanism in finer detail.
It is very natural that one usually looks for facts confirming one's own intuitive conjecture or hypothesis, and consequently may be misled in that direction. However, a skeptical, refutational attitude leads one to always look for alternative hypotheses and for facts contrary to the proposed hypothesis. From my experience, such an attitude helps one to avoid misdiagnoses and guides one to the most valid etiology.

2.4 Why take a refutational attitude?
People familiar with the induction method may argue that inductivists also resort to empirical tests to decide which hypothesis is true and deserves to be called a theory. Why, then, should one take the side of the Popperian philosophy of science? From my personal experience, a verificational or confirmatory approach usually leads to a very complacent attitude, preventing one from searching for alternative explanations or hypotheses. Thus, one is more likely to ignore evidence that contradicts the favored hypothesis, while a refutational attitude encourages one to look for contradictions and alternatives. Take the example of the documentation of an outbreak of botulism, which was initially thought to be the result of an occupational hazard.

Example 2.1 Outbreak of acute bilateral weakness of extremities and respiratory paralysis in a color printing factory (Wang and Chang, 1987; Tsai et al., 1990)

In September 1986, an apprentice in a color printing factory in Chang-Hwa suddenly developed acute bilateral weakness and respiratory paralysis. The victim's father alleged via phone that there were several other workers with similar symptoms. Since occupational diseases resulting from organic solvent exposures, i.e., n-hexane induced polyneuropathy and carbon tetrachloride induced hepatitis, had previously been documented in color printing shops (Wang et al., 1986; Deng et al., 1987; Huang et al., 1987, respectively), the investigators went to the work site, proposing the hypothesis of possible solvent intoxication (Figure 2.2).
In practice, however, the investigators also considered all alternative conjectures, including other medical problems due to impaired function of the upper motor neurons, the lower motor neurons, and the neuromuscular junction. They then deduced outcome statements from the different hypotheses: If solvent poisoning were the cause (H1), it would impair consciousness at some point in the clinical course. If any solvent reported to produce polyneuropathy, e.g., n-hexane, acrylamide, methylbutyl ketone, etc., were the cause (H2, H3), it would also impair the nerve conduction velocity (NCV). If the cause were other medical problems involving upper motor neurons (Hn-2), there would be signs of impaired consciousness and/or involuntary movement. Moreover, if it were a case of Guillain-Barre syndrome, a specific lower motor neuron disease (Hn-1), it would usually not show case clustering in space and time, and would lead to a demyelination effect, e.g., impaired NCVs. If it were a case of myasthenia gravis or myasthenic syndrome, a disease of the neuromuscular junction (Hn(1)), the neostigmine test would be positive. If it were a neuromuscular blockade caused by drugs, pesticides (such as organophosphates), spider bites, or snake bites (Hn(2)), there would be a history of medication or bites before the onset of symptoms. If it were botulism, another neuromuscular junction disease (Hn(3)), we would be able to culture the clostridium and find botulinum toxin in the food the workers had consumed. Field observations disclosed that all affected workers had a clear consciousness throughout the clinical course. An NCV study of three affected workers showed intact lower motor neurons. There was no involuntary movement, no history of medication or bites prior to the appearance of symptoms, and the neostigmine test was found to be negative. A significant association between illness and eating breakfast in the factory cafeteria on September 26 or 27 was found; seven of seven affected workers vs. seven of 32 unaffected workers ate breakfast in the factory on these two days (p = 0.0002 by Fisher's exact test) (Chou et al., 1988). Further testing showed that type A botulinum toxin was detected in canned peanuts manufactured by an unlicensed company, and its specimen also showed a full growth of Clostridium botulinum.
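The quoted p-value can be reproduced from the counts given in the text. A sketch using scipy (assuming the 32 unaffected workers comprise 7 who ate the breakfast and 25 who did not) yields essentially the reported result:

```python
from scipy.stats import fisher_exact

# 2x2 table: rows = affected / unaffected workers,
# columns = ate breakfast in the factory on Sept. 26-27 / did not.
# Counts reconstructed from the text: 7 of 7 affected vs. 7 of 32 unaffected.
table = [[7, 0],
         [7, 25]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"One-sided Fisher's exact p = {p_value:.4f}")   # ~0.0002
```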
Problem: Clustering of cases with sudden onset of bilateral weakness of extremities with respiratory paralysis in a printing shop.

Conjectures: H1: acute solvent poisoning; H2: n-hexane induced polyneuropathy; H3: acrylamide or methylbutyl ketone induced polyneuropathy; Hn-2: other upper motor neuron disease; Hn-1: other lower motor neuron diseases (Guillain-Barre syndrome); Hn: neuromuscular junction disease, namely 1) myasthenia gravis, 2) drug or toxin induced, 3) botulism.

Deductions: H1 -> consciousness disturbance; H2 -> impaired NCVs; H3 -> impaired NCVs; Hn-2 -> consciousness disturbance and/or involuntary movement; Hn-1 (Guillain-Barre syndrome) -> impaired NCVs and rare clustering in space and time; Hn(1) (myasthenia gravis or syndrome) -> rare clustering in space and time and neostigmine test (+); Hn(2) (drug or toxin induced) -> history of medication, or spider or snake bites; Hn(3) (botulism) -> Clostridium botulinum (+) and toxin (+).

Observations and measurements: All affected workers had clear consciousness throughout. NCVs were intact. No involuntary movement. Neostigmine test (-). No history of specific drug usage or animal bites. Culture showed Clostridium botulinum and toxin in canned peanuts.

Data analysis: Eating breakfast on Sept. 26-27 was shown to be associated with the appearance of symptoms.

Refutations: H1, H2, ..., Hn(1), Hn(2) were all excluded as the diagnosis.

Future refutation attempts: Eliminated the canned peanuts, and no more new cases occurred.

Figure 2.2 Outbreak of botulism in a printing factory: An example of taking a refutational attitude (Wang, 1991).
A final refutation trial involved the removal of such products from the commercial market, after which no more new cases occurred. This investigation documented the first cases of botulism due to a commercial food product in Taiwan.

Confounding or mixing of effects in the search for causal agents

Let us examine some more examples. During a study of the association between the air concentration of SO2 and the occurrence of asthma, investigators found that the prevalence rate of asthma was higher among communities with higher ambient SO2 concentrations (Lin et al., 1981). There was an almost linear association between the two. Can one then conclude that the high prevalence of asthma was caused by the high SO2 concentration in the air? If one takes a refutational attitude, one must also consider other determinants of asthma, such as the occupational exposures of these asthma patients, house dust, familial tendency, etc. Moreover, one must distinguish the asthma cases that developed after moving in from the cases that developed asthma before moving into the community. In other words, one should always consider and rule out alternative explanations or hypotheses before reaching any conclusion. In the search for the etiological agent of polyneuropathy among press-proofing workers (Wang et al., 1986), if the investigators had simply confirmed that a pigment of the printing paste contained lead and concluded that the workers' polyneuropathy resulted from this lead exposure, we would have missed the true cause, n-hexane. Fortunately, we considered other alternative hypotheses, including lead, carbon disulfide, methylbutyl ketone, acrylamide, etc., and successfully ruled out each one to demonstrate that n-hexane was the responsible agent. Finally, when all such factories removed n-hexane, no new cases occurred, corroborating our hypothesis. Consider another study, in which researchers sought to find the etiologic agent for an outbreak of hepatitis among printing workers (Deng et al., 1987). The employer and most employees originally blamed viral
hepatitis B, because the hepatitis B surface antigen carrier rate was approximately 15-20% in Taiwan (Chen and Sung, 1978). If the investigators had not considered alternative hypotheses, they would have missed that the cleaning agent, carbon tetrachloride, produced the outbreak. Similarly, researchers have demonstrated that an outbreak of hepatitis among synthetic leather workers was caused by dimethylformamide, a chemical used in the manufacturing process (Wang et al., 1991).

Consider a hypothetical study attempting to determine whether there is an association between noise and hypertension. If one finds a high prevalence rate of elevated blood pressure among people living in a noisy region and concludes that such evidence supports the association, one may reach an erroneous conclusion. The people in these noisy communities may also be on a high-salt diet, have higher body mass indices (BMI), experience greater job stress, etc. Before all these alternative explanations are addressed, one should not jump to any early conclusion. In fact, this mixing of the effect by alternative hypotheses or explanations is known as confounding (Miettinen, 1974b; Miettinen and Cook, 1981). The basic principle for determining the existence of confounding is to look for any unrefuted alternative explanation that is a major determinant of the outcome under study and is associated with the exposure of interest. We shall return to a more detailed discussion of confounding in Chapters 4 and 7.

All the above examples indicate that one should take a refutational attitude in scientific research to avoid the pitfall of jumping to false conclusions. Dr. John C. Eccles, a Nobel laureate, once wrote an essay entitled "In praise of falsification" (1981), attributing his success in neurobiology to the falsificationist attitude of constantly testing and renewing his original hypothesis. Table 2.3 displays a summary of the comparison between verificational and falsificational attitudes in scientific research.
Table 2.3 Comparison of the verificationist's and falsificationist's views in science (modified from Maclure, 1985).

Fundamental view
  Verificationist: Science is based on verifying hypotheses.
  Falsificationist: Science is based on disproving or falsifying hypotheses.

Origins of hypotheses
  Verificationist: Observation comes first, and reveals a hypothesis.
  Falsificationist: Explanation comes during observation. The observer thinks of a hypothesis first, and then imposes it on what he observes or expects to observe.

Theory
  Verificationist: A good theory is one that has been verified multiple times.
  Falsificationist: A good theory is a hypothesis that stands firm after many critical attempts at falsification. Thus, one can only regard a theory as more corroborated than other alternative hypotheses.

Axiom
  Verificationist: Induction is logical.
  Falsificationist: Only deduction is logical. A hypothesis found by induction is still a guess or conjecture. We can never prove that a hypothesis is always true, but we can disprove or falsify hypotheses.
2.5 The limitations of conjectures and refutations
A scientific theory, regardless of the number of failed attempts at critical refutation, can only be regarded as more corroborated than alternative hypotheses or theories; it is not guaranteed to remain unrefuted in the future. For example, scientists upheld Newton's laws of motion for more than two centuries and once regarded them as universally true. However, in the early 20th century, Einstein's theory of relativity replaced Newtonian physics. Similarly, we cannot say that the theory of relativity will remain unchallenged, either. Instead, we have faith to accept it as more corroborated than all other falsified theories attempting to explain our
physical world. Thus, our faith is not blind, because the theory of relativity is the only one not yet disproven.

In order to empirically test a hypothesis, we must propose hypotheses that can be falsified. Otherwise, we may be left with many unrefuted hypotheses that cannot be tested. We will examine this issue in Chapter 3. Another possible limitation of this approach is that one can only consider alternative explanations or hypotheses that one can imagine. If the real etiologic agent is not included in one's list of hypotheses because of one's limited knowledge, one may be left with no answer after all proposed hypotheses have been falsified. Therefore, inviting an expert in the field and looking into a comprehensive database of published literature covering the subject area are crucial to the success of this strategy. For example, the OSH-ROM database (Silver Platter, 1998), which contains about 200,000 abstracts, may be one of the most comprehensive for making an etiologic diagnosis of occupational and/or environmental diseases. Furthermore, if one is confronted with a new disease, one may search such a database and rule out all known etiologic agents. In such a case, one may try to identify or define the new agent as specifically as possible. Otherwise, one must use a surrogate variable (Wang and Miettinen, 1982) that is amenable to change, so that the problem can first be mitigated while the causal mechanism is relegated to later study.

The case of premalignant skin lesions among paraquat manufacturers (Wang et al., 1987) is a good example of this kind. In June 1983, two workers from a paraquat manufacturing factory visited a dermatology clinic complaining of numerous bilateral hyperpigmented macules with hyperkeratotic changes on the sun-exposed parts of their hands, neck, and face. Specimens showed increased melanin in the basal layer, hyperkeratosis, epidermal hyperplasia, and dysplasia. Some specimens also showed Bowenoid changes. Since malignant and premalignant skin lesions had been reported among bipyridyl manufacturing workers (Bowra et al., 1982), investigators strongly suspected an
occupational cause. However, they also considered other alternative causes (hypotheses) of skin cancer, such as exposure to ionizing radiation, coal tar, soot, pitch or any other polyaromatic hydrocarbons (PAH), etc. (Scotto and Fraumeni, 1982), as shown in Figure 2.3. If ionizing radiation had been the cause, all affected workers should have had a positive exposure history of occupational or medical origin. Similarly, if PAH or tars had been the cause, all affected workers should have been exposed to them. To falsify all of these conjectures and deductive statements, the investigators conducted a study in 1985, visiting all 28 factories engaged in paraquat manufacturing and packaging and examining the manufacturing processes as well as the workers. They examined 228 workers, and none of them had ever been exposed to the aforementioned skin carcinogens except sunlight and 4,4'-bipyridine and its isomers. In an attempt to falsify paraquat itself as an alternative, the researchers stratified workers according to their work assignments: administrative jobs, paraquat packaging, bipyridine crystallization and centrifugation, and multiple job assignments. After excluding workers with multiple exposures, they found that 1 out of 7 administrators and 2 out of 82 paraquat packaging workers developed hyperpigmented skin lesions, as compared with 3 out of 3 workers involved only in bipyridine crystallization and centrifugation. Moreover, all 17 workers with hyperkeratotic or Bowen's lesions had a history of direct exposure to bipyridyl and its isomers. The longer the exposure to bipyridyls, the more likely the development of skin lesions. This trend could not be explained by sunlight or age, as demonstrated by stratification and logistic regression analysis (Table 7.10). The skin lesions were tentatively attributed to a combination of bipyridyl exposure and sunlight. In a follow-up study, the investigators made additional attempts to refute their hypothesis by enclosing all processes involving bipyridyl exposure. Since no more new cases have occurred at the enclosed factories, their conclusion presently remains valid.
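The stratified comparison at the heart of this analysis can be sketched in a few lines. This is a minimal illustration using only the attack rates quoted above, not the original analysis code:

```python
# Attack rates of skin lesions by work assignment, as reported in the
# narrative above (workers with multiple job assignments excluded).
groups = {
    "administrative only": (1, 7),
    "paraquat packaging only": (2, 82),
    "bipyridine crystallization/centrifugation only": (3, 3),
}

for job, (cases, total) in groups.items():
    print(f"{job}: {cases}/{total} = {cases / total:.0%}")
# The sharp gradient across strata points to bipyridine exposure,
# not paraquat packaging or administrative work, as the culprit.
```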
Example 2.2 Outbreak of premalignant and malignant skin lesions among paraquat manufacturers (Wang et al., 1987)

Observation: Cases of Bowen's disease and hyperkeratosis in a paraquat manufacturing factory.

Conjectures:
H1: Ionizing radiation
H2: Tars, soots, pitch, etc.
Hn-1: Paraquat
Hn: Bipyridyl and/or its isomers

Deductions:
H1: Affected workers should have a history of exposure to ionizing radiation, e.g., X-rays.
H2: Affected workers should have a history of contact with tars, soots, pitch, etc.
Hn-1: Affected workers were exposed to paraquat.
Hn: Affected workers were exposed to bipyridyls.

Observations and measurements: No worker was exposed to ionizing radiation or to tars, soots, pitch, etc. Only 2 out of 82 workers exposed solely to paraquat developed skin lesions, while all 17 workers with hyperkeratotic skin lesions had direct exposure to bipyridines.

Data analysis: Stratified analysis showed that the longer workers were involved in bipyridine crystallization and centrifugation, the more likely they were to develop skin lesions. This association could not be explained by age or the amount of sunlight exposure.

Refutations: H1, H2, ..., Hn-1 were all refuted. Only Hn remained unrefuted, and sunlight was found to be a co-factor.

Future refutation attempts: Enclosure of all processes involving bipyridine resulted in no more new cases.

Figure 2.3 Premalignant and malignant skin lesions caused by bipyridyls as an example of conjectures and refutations.
2.6 Summary

Epidemiological research is based on the basic principles of scientific research, and thus on the deductive methods involved in conjectures and refutations. Since deductive methods follow common logical reasoning, one can be sure that if the premise is true, then the conclusion is true. The method of conjecture and refutation is a process of proposing hypotheses and trying to falsify each one of them. Although one can never be sure that a conjecture is true, one can tentatively conclude that a hypothesis which stands firm after many empirical tests is nearer to the truth than those refuted. A refutational attitude leads one always to consider alternative hypotheses and to look for evidence contradicting the hypothesis. Thus, it can help one avoid the pitfalls of complacency and self-delusion. However, this strategy is still limited to those hypotheses one can imagine. As a result, one may falsify all proposed hypotheses or be left with too many unrefuted ones. For the former, one needs first to mitigate the damage and expand one's list of hypotheses. For the latter, one should propose hypotheses which can be empirically tested.
Quiz of Chapter 2

Please write down the credibility of each assertion in terms of percentage (%). Your score will be calculated according to the percentage of credibility that you actually obtain after comparison with the instructor's "gold standard".
1. Scientific knowledge about human populations progresses through repeatedly falsifying hypotheses rather than confirming them.
2. A limitation of conjecture and refutation is that one can only consider hypotheses that one can imagine. If the true etiological agent is not included in one's list of hypotheses, then one may end up with no answer after refuting all proposed hypotheses.
3. One can never be sure that a theory will be forever true, even after one thousand refutation tests.
4. One can be sure that an unrefuted hypothesis is closer to the truth than those already refuted.
5. A refutational attitude tends to guide the investigator to avoid self-delusion or accepting a hypothesis which is contradictory to the facts.
6. Inductive reasoning guarantees that if the premise is true, then the statement that follows is also true.
7. A scientist should try to verify his/her hypothesis and disprove other people's hypotheses.
8. The observation itself will automatically show one the hypothesis or even the theory.
9. A good scientist should try to falsify his/her favorite hypothesis, because it may be easily overlooked without
scrutiny.
10. In observational studies of human populations, one easily invokes auxiliary hypotheses that may not be highly corroborated, because one cannot conduct strict experimentation on humans.

Answer: (1) T (2) T (3) T (4) T (5) T (6) F (7) F (8) F (9) T (10) T
Chapter 3 Scientific Hypothesis and Degree of Corroboration

3.1 Hypothesis formation
3.2 What makes a hypothesis scientific?
3.3 Successful refutation and auxiliary hypotheses
3.4 Failure to falsify and degree of corroboration
3.5 Credibility of a hypothesis and decision-making
3.6 Summary
Introduction

In applying the principles of conjecture and refutation, one attempts to propose the most comprehensive list of hypotheses in order to find the most valid etiologic agent. Yet how does one propose a scientific hypothesis? What characteristics should a hypothesis possess in order to be called "scientific"? How many of these repeated cycles of conjecture and refutation must one undergo before an unrefuted hypothesis deserves to be called a theory? In other words, how can we evaluate the credibility of a hypothesis after many empirical tests? These are the issues to be discussed in this chapter.
3.1 Hypothesis Formation — How to form a conjecture?
Conjectures usually precede observations

Traditional inductivists argue that a hypothesis is generally formed after one observes facts or phenomena. However, most hypotheses are not formed in this manner, because natural phenomena do not directly spell out the laws of nature. For example, everyone observes sunsets and sunrises, yet it was Copernicus who proposed that the earth revolves around the sun rather than the sun revolving around the earth. Similarly, everyone observes that apples fall to the ground, but only Newton proposed the theory
of gravitation. Only those who dare to propose alternative hypotheses or explanations of nature are more likely to provide a hypothesis or theory closest to the true law. In fact, purposeful observations always involve explanations or interpretations already formed in the observer's mind. The observer then tries to select and fit the observed phenomena to his/her different explanations or hypotheses. As Popper (1965) pointed out, man typically first conjectures and then observes his expectation; that is, conjecture precedes observation. For example, Popper once asked his students to observe the blackboard. After half an hour, his students inquired, "What aspect or characteristics of the board did you want us to observe?" Thus, our conjectures are often based on our personally biased views, heavily influenced by our past experiences. A person with a verificationist's attitude tends to see only those facts which he/she expects, while remaining blind to any fact contradictory to the favored hypothesis. Therefore, a verificationist is more likely to believe a hypothesis which, in fact, may already be refuted. In other words, since a hypothesis usually comes to mind before purposeful observation, taking a refutational attitude and proposing all kinds of alternative hypotheses will help one avoid false conclusions.

How to propose a hypothesis

How, then, does one form a hypothesis and propose alternative or new hypotheses? This question falls in the larger domain of psychology (Tweney et al., 1981). Here I shall share some of my own personal views. In general, the ability to propose various hypotheses depends on one's previous experience and educational background. The better able one is to approach an issue from different angles, the more hypotheses one can propose to explain the phenomena. Although many hypotheses may be quickly falsified even before one can write them down, the habit of thinking of all possible alternative explanations will increase the likelihood that one will include the true etiologic agent. Mill's five methods of induction can certainly help in this process (please see the next paragraph). Consultation of any database, review article, literature, or expert specialized in the field can
also provide one with more alternative hypotheses. In fact, one of Popper's students, Feyerabend (1975), even proposed that any method will do; no formative method is needed. For example, if one wants to consider the possible etiologic factors of suicide among Taiwanese aborigines, one may look at the problem from a sociological viewpoint, which may suggest possibilities such as culture shock, economic pressure, disintegration of the social system, etc. One may also consider it from a clinical psychiatric viewpoint and propose alcoholism, affective disorder, etc. Or one may think of family problems such as marital instability, divorce, or destruction of the traditional family system. With a pluralistic approach, one is less likely to neglect any important etiologic factor. Moreover, a consultation with a specialist in the field can occasionally shed light on a discovery. The structure of the DNA double helix proposed by Crick and Watson (1968) was developed from consulting with Pauling on the concept of hydrogen bonding; Pauling had used this same idea to construct the alpha-helix structure of proteins.

Traditionally, inductivists believe that certain rules must be followed to form a hypothesis; these are generally summarized as Mill's five rules of induction (Copi, 1972). To broaden our approach to the proposal of hypotheses, I have also included these rules for the readers' reference, as described below and in Table 3.1. In summary, Mill's methods of induction are only rules for proposing a hypothesis or explanation. They are simply based on the principle of consistency (see Chapter 4), and all the proposed hypotheses are only explanatory conjectures. They are not universal truths and may not even identify the true etiologic agent. Therefore, one must empirically test or falsify each one of them.

Mill's five methods of induction

(1) Method of agreement: If the same disease (health effect) always appears following a specific common agent under various environmental or occupational settings, then the common agent may be the cause of the
disease. For example, if workers exposed to asbestos fibers under various manufacturing processes, such as asbestos textile, brake lining, and asbestos cementing, have an increased occurrence of mesothelioma, one can propose asbestos as a cause of mesothelioma. Similarly, since the occurrence of lung cancer frequently increases among smokers, whatever their gender, ethnicity, or place of living, one may propose the hypothesis that smoking causes lung cancer.

(2) Method of difference: If two or more populations have different frequencies of a specific disease, yet share similar distributions of all determinants except the exposure of interest, then that exposure may be considered a cause of the disease. For example, different frequencies of polyneuropathy were observed at different press-proofing factories. Among the 15 factories we observed, only workers in the three factories using n-hexane as the cleaning solvent developed polyneuropathy, while those at the 12 factories using toluene did not. Since all workers shared the same demographic characteristics except the cleaning agent, we proposed n-hexane as the causal agent (Wang et al., 1986).

(3) Joint method of agreement and difference: This rule is a combination of (1) and (2). If more than two populations are observed and a specific disease shows an increased frequency only where a particular agent exists, one may suggest that this agent is a cause of the disease. For example, after observing increased occurrences of mesothelioma among asbestos textile and brake lining workers, but not among cotton and wool workers, one could propose that asbestos is a causal agent for mesothelioma.

(4) Method of residue: According to this rule, after observing increased frequencies of several diseases following several possible agents, one can rule out those diseases already linked to particular agents and conclude that the remaining disease is caused by the residual agent. For example, suppose one finds increased frequencies of hearing impairment, lung cancer, and low back pain among asbestos textile workers who have worked for more than 10 years. Since one knows that the hearing impairment is caused by noise and the back pain by lifting heavy objects, one may propose that asbestos fibers, the only obvious agent remaining, cause lung cancer.
(5) Method of concomitant variation: This rule refers to a condition similar to a linear dose-response relationship. In other words, if disease frequencies in different populations vary with the distribution of a particular agent, then the agent may be the cause of the disease. For example, if the occurrence of lung cancer in different population groups increases with the amount of smoking, one may propose that smoking causes lung cancer.
Table 3.1 Mill's rules of induction — how to observe relevant facts and propose hypotheses. Please refer to the text for a more detailed explanation. (A, B, C, D, E, F, G = events/agents observed; a, b, c, d, e, f, g = outcomes/effects observed.)

(1) Method of agreement
    A B C D — a b c d
    A E F G — a e f g
    Propose the hypothesis: A is the cause of a.

(2) Method of difference
    A B — a b
    B — b
    Propose the hypothesis: A is the cause of a.

(3) Joint method of agreement and difference
    A B — a b
    A C — a c
    B — b
    Propose the hypothesis: A is the cause of a.

(4) Method of residue
    A B C — a b c
    B is a known cause of b.
    C is a known cause of c.
    Propose the hypothesis: A is the cause of a.

(5) Method of concomitant variation
    A+ B C — a+ b c
    A B C — a b c
    A- B C — a- b c
    Propose the hypothesis: A is the cause of a.
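As a toy illustration, the method of agreement can even be mechanized: collect the agents common to every setting in which the effect occurred. Below is a minimal sketch with hypothetical data echoing the asbestos example in the text; the intersection yields a conjecture to be tested, never a proof:

```python
def method_of_agreement(observations):
    """observations: list of (set of agents present, effect occurred?) pairs.
    Returns the agents common to every setting where the effect occurred."""
    with_effect = [agents for agents, effect in observations if effect]
    return set.intersection(*with_effect) if with_effect else set()

# Hypothetical settings:
data = [
    ({"asbestos", "cotton dust"}, True),    # asbestos textile plant
    ({"asbestos", "friction dust"}, True),  # brake lining plant
    ({"cotton dust", "wool dust"}, False),  # cotton/wool plant
]
print(method_of_agreement(data))  # {'asbestos'} -- a conjecture, not a proof
```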
3.2 What makes a hypothesis scientific?
To propose hypotheses that can be tested or falsified, one must consider what characteristics a scientific hypothesis should possess. Is there any principle one can use to differentiate a scientific hypothesis from pseudo-scientific ones? The answer is quite straightforward when one considers that scientific knowledge only advances through the process of conjecture and refutation. Specifically, only hypotheses that can be empirically tested or refuted belong to the "scientific" category, although one can propose many explanations or hypotheses for a phenomenon. Conducting empirical tests allows one to distinguish among the many different hypotheses and find the one that best explains the phenomena. The advancement of scientific knowledge is based on this search for hypotheses that withstand refutation. No matter how broad its explanatory power, a hypothesis cannot be called scientific if it cannot be tested or falsified. Popper first proposed this rule in the 1930s, demarcating scientific from pseudo-scientific hypotheses or theories (1965). At the time, he found that the theories proposed by Freud, Adler, Marx and Einstein all possessed very broad explanatory powers, i.e., these hypotheses could explain practically everything happening within their own fields. However, only Einstein's relativity theory clearly predicted the phenomenon of red shift when light passes through a gravitational field. Since one can deduce statements from the theory of relativity which forbid or predict events, one can empirically test it. The Freudian and Adlerian hypotheses, however, cannot be contradicted empirically. Popper gave two contrasting examples: a man pushed a child into the water with the intention of drowning it, and another man sacrificed his life in an attempt to save it. Both of these cases can be equally explained in Freudian and Adlerian terms. According to Freud, the first man suffered from repression, while the second man achieved sublimation. According to Adler, the first man suffered so greatly from a feeling of inferiority that he dared to commit a crime to prove himself, as did the second man, who tried to prove himself by daring to rescue the
child. In fact, one can look at all kinds of human behavior and find "confirmation" of these hypotheses. However, has one really confirmed anything? No; since these hypotheses cannot predict any future behavior, one cannot empirically verify anything at all and cannot distinguish which one is the more corroborated hypothesis.

The conditions for Marx's hypothesis are different. Popper (1966) provided a great deal of argument in his book "The open society and its enemies" and claimed that Marx's hypothesis had already been refuted by empirical evidence from capitalist societies or nations. However, Marx's followers tried to save his theory by creating an "ad hoc hypothesis," making it immune to falsification and thus pseudo-scientific.

Einstein's relativity theory has stood firm against many refutational attempts. Besides correctly predicting the red shift phenomenon, Einstein's theory also proposes that the speed of light is absolute and that the velocity of any particle will not exceed it. In reality, the largest speed ever achieved by an elementary particle produced by an accelerator was 99.999999985% of the speed of light. Moreover, the theory of relativity predicts that the lifetime of a body moving at nearly the speed of light will be prolonged. Experimental physicists have already found that, when moving at 99.5% of the speed of light, a particle's life span is prolonged to about ten times that of a particle at rest. Since all critical falsification attempts have failed, Einstein's theory of relativity is the only theory of motion that remains unfalsified. Popper claimed that it is precisely this falsifiability which distinguishes scientific from pseudo-scientific hypotheses.

Although one may consider Popper's view too narrow and only applicable to the natural sciences, it still provides a conceptual understanding of how to propose a scientific hypothesis. I have found Popper's concept quite useful in my daily practice of causal epidemiological research. In addition, I believe that in other less developed health-related disciplines, such as traditional medicine or folk therapy, one must also try to propose hypotheses that can be tested. Even though Chinese herbal medicine has lasted for more than 3000 years, the advancement of this discipline has been relatively limited. This lack of progress may be a result
of Chinese herbal doctors' reliance on the traditional Yin-Yang-Wu-Xing hypothesis, which, similar to Freudian or Adlerian theory, is immune to refutation. This lack of falsifiability has restrained the progress of Chinese herbal medicine for hundreds of years. Only in the past few decades have people tried to propose falsifiable hypotheses for acupuncture and thereby enhanced objective knowledge of this aspect of Chinese medicine. In one example of folk medicine, many Southeast Asian migrant workers believe that SUDS (Sudden Unexpected Death Syndrome) is caused by a widow ghost who sucks away a young male's soul. As there is no method to measure this widow ghost, one has no practical way to test the validity of this hypothesis or distinguish it among the different proposed hypotheses. Pseudo-scientific hypotheses simply cannot be tested through the process of conjectures and refutations.

However, this irrefutability is not related to the usefulness of a theory. For example, ethical theories cannot be disproven or proven through conjectures and refutations, but they are highly useful and influential in our everyday life because they provide guidelines for one's behavioral conduct. While science is useful in developing and demonstrating how an animal can be cloned, one must consider the moral values needed to set the appropriate range of applications for the new technology. Although principles of scientific research are advocated in this book in the search for natural laws, one should also understand their limitations so that these principles do not become a juggernaut.
3.3 Successful refutation and auxiliary hypotheses — Has one disproved the primary hypothesis?
In the process of conjectures and refutations, one may observe facts contradictory to the deductions made from the primary hypothesis, leading one to conclude that the primary hypothesis is false. However, one should not jump to such an early conclusion in any human observational study, including that of epidemiology. Owing to the reliance on observation
rather than experimentation, one cannot control all determinants of a particular outcome and must therefore invoke auxiliary hypotheses. For example, one must always assume that one's measurements are accurate and sensitive enough to detect the effect predicted by the primary hypothesis. If the instrument for red shift measurement is insufficiently sensitive, it may mislead one into believing that one's refutational attempt has succeeded; the contradictory data obtained falsify only the auxiliary hypothesis rather than the theory of relativity. Similarly, if Dr. Semmelweis' attempts to decrease the mortality rate of puerperal fever by having every examining doctor or student wash his hands had been unsuccessful, he might have thought he had refuted the conjecture. Yet the empirical data might have refuted only the auxiliary hypothesis of the effectiveness of chlorinated lime (calcium hypochlorite) disinfection, rather than the primary hypothesis.

Take another example. If one wants to test the hypothesis that asbestos causes lung cancer, one may collect information on asbestos textile workers and analyze the data to determine whether exposed workers have a higher morbidity or mortality rate of lung cancer. During the research process, one must invoke at least the following two auxiliary hypotheses: first, that the induction time for lung cancer has been adequate; and second, that the diagnosis of lung cancer in the study is sufficiently accurate and sensitive. If one does not find an increased occurrence of lung cancer among exposed workers, one must first check the validity of these two auxiliary hypotheses before concluding that the primary hypothesis is refuted. About 5-6 years after the core meltdown at the Three Mile Island nuclear power plant in 1979, a study showed no radiation effect on nearby residents (Hatch et al., 1990); it in fact invoked similar auxiliary hypotheses of adequate induction time and accurate and sensitive measurements. It was not until 7 years later that Wing et al. (1997) found increased incidences of several types of cancer (including lung cancer and leukemia) and falsified the earlier conclusion. In this case as well, one must avoid premature conclusions regarding the health effects of such an event by first attempting to falsify all auxiliary hypotheses.

To express this argument in more logical terms, let us denote H as the
primary hypothesis and A1, A2, A3, ..., An as the auxiliary hypotheses invoked during our refutation attempt, while F is the fact or effect observed. A refutation attempt is shown below:

If H, A1, A2, ..., An are true, then F is true.
Suppose we have found F to be false; then at least one statement among H, A1, A2, ..., An is false.
However, if we know that A1, A2, A3, ..., An are all true, then H is false.

Critical refutation requires one to invoke only highly corroborated auxiliary hypotheses, so that one is not confused as to whether the auxiliary or the primary hypothesis has been refuted. Since every study invokes the auxiliary hypothesis that measurement errors are smaller than the real difference, one should attempt to improve the accuracy of measurement in all studies, as will be discussed in Chapter 5.

3.4 Failure to falsify and degree of corroboration — Do the results of the study corroborate the primary hypothesis?

During one's refutation attempt, one may fail to falsify the hypothesis. In this situation, the primary hypothesis seems to be corroborated and corresponds well with the facts. However, one must still examine the relevancy of the refutation attempt and ask whether the range of consistency of the hypothesis can be expanded in time and place. Scientists believe a priori that natural laws exist and are universally true. Consequently, in scientific research, refutation attempts should also adhere to these beliefs in determining the degree of corroboration. Examine the following refutation attempts of the hypothesis that all ravens are black.

Refutation 1: By deductive reasoning, i.e., modus tollens (p ⊃ q; ~q; ∴ ~p), an equivalent statement of this hypothesis is: anything (including birds) that is not black is not a raven. Therefore, after observing a red vase, one has corroborated the hypothesis. Similarly, observation of a white man, a green tree, yellow clothes, etc. all seem to corroborate, or fail to refute, the hypothesis. But in fact, such observations have little relevance to the color of ravens.

Refutation 2: One conducts a replicative study by expanding the
number of ravens observed from 10 to 100. If one's sample of ravens is randomly drawn and all of them are found to be black, then one has failed to falsify the hypothesis. This refutation attempt clearly has direct relevance to the hypothesis, but it only shows that the hypothesis is corroborated for local ravens.

Refutation 3: One observes whether ravens on other continents (e.g., Australia, Africa, Asia, etc.) and at different times (e.g., historical records of bird museums in 1796, 1896, 1946, etc.) are all black. If the hypothesis resists refutation under different temporal and spatial orientations, then its range of consistency is expanded to a larger time and spatial dimension.

Refutation 4: One invites a critic of the hypothesis to perform observations under the alternative hypothesis that some ravens are yellow or of any other color. If he fails to find a raven of a different color, then this failure to falsify certainly corroborates the primary hypothesis.

The above four types of refutation attempts increase in the degree of corroboration they confer on the hypothesis. The first attempt has no direct relevance to the hypothesis, and thus the hypothesis remains uncorroborated. The second attempt, a mere replication study, has relevance but challenges the hypothesis only in the local area. The third attempt challenges the consistency of the hypothesis in various times and places, and the failure to refute corroborates the hypothesis to a high degree. The fourth attempt, based on refuting challenging alternative hypotheses, corroborates the primary hypothesis most strongly, since the refutation of such alternatives eliminates competing hypotheses. Therefore, in attempting to refute a hypothesis, one should aim for direct relevance, expand the range of consistency across different times and places, and attempt to refute challenging alternative hypotheses. A failed refutation attempt based on these criteria corroborates the primary hypothesis more strongly and can help save resources as well. This discussion leads to a broader view of applying subjective Bayesian analysis (Greenland, 1998b) in evaluating a hypothesis after many refutation attempts.
3.5 Credibility of a hypothesis and decision-making
Public health involves decision-making

Even if a hypothesis has stood firm after many critical refutation attempts, can it be regarded as truth? Let us first define a true statement as one that corresponds to the facts (Tarski, 1969; Popper, 1965). Most cause-effect relationships in common daily life, such as who turned on the light in the office or what caused an outbreak of food poisoning, can be clarified through the process of conjecture and refutation, as discussed in Chapter 2. In scientific research, however, the pursuit of truth (or a law of nature) is an unending quest, and all unrefuted hypotheses are still considered conjectures subject to future challenge. When all competing alternative hypotheses are refuted, one may claim that the only hypothesis remaining corresponds more closely to the facts than all the others. Such a highly corroborated hypothesis is one's closest approximation to the true natural law, but it may still be replaced in the future if falsified.

In the public health field, however, one must take action at a certain point in time in order to prevent morbidity and mortality. For example, the hypothesis that smoking causes lung cancer was proposed by Doll and Hill in the 1950s after their epidemiological studies, and since then many have attempted to refute it critically in various times and spatial orientations without success (U.S. Department of Health, Education, and Welfare, 1964, 1979). Moreover, there presently exists no alternative hypothesis that can explain the high proportion of lung cancer patients who are smokers. Although further refutation attempts may still be needed, public health authorities must act now to prevent increasing numbers of people from developing and dying of lung cancer. Thus, one is dealing with a different issue from that of pure natural science. Instead of only searching for more definite answers, one must make decisions under some uncertainty, which is called decision-making (Bell et al., 1988; Raiffa, 1976). In general, one needs to weigh costs and benefits or costs and effectiveness, as well as medical ethics, in making such policy decisions in public health. This will be discussed in more depth in Chapters 4 and 13.
Credibility change for a hypothesis

To help measure one's degree of belief in a hypothesis and so facilitate rational action, I recommend the Bayesian approach of quantifying subjective probability (Savage, 1972; Howson and Urbach, 1993; Greenland, 1998a). In practice, one may be neutral toward a hypothesis in the beginning, i.e., hold a credibility of 50%. After examining the results of all refutation attempts, in terms of relevance, consistency across different times and places, and challenging alternative hypotheses, one may start to move incrementally toward either extreme: a credibility of 0% or of 100%. If deciding whether a hypothesis is true further involves one's subjective preference, i.e., expected utility, then this should also be taken into consideration. If one assumes that, as a scientist, one has no subjective preference for any specific hypothesis, then one simply draws one's subjective opinion from the posterior credibility, obtained by combining the prior subjective probability of the hypothesis (0.5 if one is neutral) with the merit of the study under review. However, since public health decision-making frequently calls for the opinions or perceptions of a group of scientists, and since such decisions frequently affect people's autonomy (e.g., the choice to smoke or chew betel nuts), I would in general recommend that an expert take a relatively conservative position in assessing the credibility of a hypothesis. Furthermore, given the medical ethic of non-maleficence, i.e., to inflict no harm or evil on others (Beauchamp and Childress, 1994), one should not place too much credibility on a hypothesis simply on the basis of a particular study's statistical significance, e.g., p-value < 0.05 (or even < 0.01). This is just a summary statistic calculated under the assumption that the null hypothesis is true. Neither should one completely abandon a new hypothesis with a p-value exceeding 0.05. Instead, one should examine the study from every aspect, looking for any conflict with alternative explanations or with the auxiliary hypotheses invoked in the refutation attempt.

For example, before 1950, one might have put a credibility of 0.5 on the hypothesis that smoking causes lung cancer. After reading many studies and Hill's discussion of probable causation (Hill, 1965), one's credibility
may increase to 70%; if the hypothesis turns out to be wrong, one has still reserved a credibility of 30% for that possibility. Furthermore, after reading the Surgeon General's report (U.S. Department of Health, Education, and Welfare, 1979), one's credibility might increase to 85% or higher. However, if one immediately takes a relatively extreme position of about 95%, then one has reserved a subjective credibility of only 5% for the possibility that the hypothesis is false. Thus, it is advisable to start from a relatively conservative position rather than an extreme level of credibility, especially when the utility or harm of taking an extreme position is high.
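The incremental credibility changes described above follow directly from Bayes' theorem in odds form: posterior odds equal prior odds times the likelihood ratio of the evidence. Below is a minimal sketch; the likelihood ratios are hypothetical and illustrative, not drawn from any actual meta-analysis:

```python
def update_credibility(prior, likelihood_ratio):
    """Posterior credibility after one study, via Bayes' theorem in odds form.
    prior: subjective credibility (0-1).
    likelihood_ratio: P(evidence | hypothesis true) / P(evidence | hypothesis false)."""
    posterior_odds = (prior / (1 - prior)) * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

credibility = 0.5                  # start from a neutral stance
for lr in (2.0, 2.0, 1.5):         # three hypothetical corroborating studies
    credibility = update_credibility(credibility, lr)
    print(f"{credibility:.2f}")    # 0.67, 0.80, 0.86
```

Note how credibility rises only gradually under moderate likelihood ratios, which is consistent with the conservative stance recommended above.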
3.6 Summary
Scientists usually first form hypotheses and then impose them on their observations of nature. As a result, they should take a critical or refutational attitude toward the proposed hypothesis. Mill's five rules of induction are an excellent way to propose hypotheses. However, since they are all based on the principle of consistency, one should not limit hypothesis formation to these rules alone. In order to understand the laws of nature, one relies mainly on the process of conjectures and refutations. Thus, any hypothesis that cannot be empirically falsified or tested escapes refutation and is unscientific. If one's refutation attempt is successful, one should examine the primary and auxiliary hypotheses and determine which hypothesis has actually been refuted. In designing a study, one should avoid invoking any unfounded auxiliary hypotheses. Moreover, one should ensure the validity of the auxiliary hypothesis that the measurements made are accurate and sensitive enough to detect the predicted effect, as this assumption is invoked in every empirical study. If one fails to falsify the hypothesis of interest, then one determines the hypothesis's degree of corroboration by examining its relevancy and consistency in different temporal and spatial settings, as well as considering the possibility of other unrefuted alternative hypotheses. Although in scientific research one may continue the search for natural laws even after many failed refutation attempts, in public health one often must make decisions under some
uncertainty in the effort to decrease morbidity and mortality. I recommend the Bayesian approach to assessing one's subjective credibility for a hypothesis: start from a neutral stance, and then incrementally increase credibility after careful evaluation of all studies that can corroborate the hypothesis. Based on the medical ethic not to inflict unnecessary harm on others, I also recommend taking a more conservative attitude toward one's subjective credibility of a hypothesis.
Quiz of Chapter 3

Please write down the credibility of each assertion in terms of percentage (%). Your score will be calculated according to the percentage of credibility that you actually obtain after comparison with the instructor's "gold standard".
1. With a verificationist's attitude, one tends to see only facts which one expects and often ignores any fact contradictory to the favored hypothesis. As a result, one is more liable to believe a hypothesis that may in fact be false.
2. Mill's rules of induction can be summarized as being based on a rule of consistency.
3. Methods of induction are rules for proposing hypotheses or explanations only; there is no guarantee that the hypothesis is true.
4. Popper proposed that the falsifiability of a hypothesis is the demarcation between scientific and pseudo-scientific hypotheses. Therefore, anything unscientific is of no use.
5. If A1, A2, ..., An are true, then B is true. Suppose we have found B to be false; then all the statements A1, A2, ..., An are false as well.
6. One's attempts to refute a hypothesis should aim to obtain more direct relevance to the hypothesis, expand the hypothesis's range of consistency in time and space, and challenge alternative hypotheses equally. Then, a failure to refute will more likely corroborate the primary hypothesis.
7. In public health, the pursuit of truth is an unending quest, and all unrefuted hypotheses are still considered conjectures subject to challenge in the future. Thus, Popper's principles can be applied in all circumstances. 8. In medical ethics, one should first consider the principle of doing no harm before recommending any preventive measure. 9. In public health, decision-making frequently involves subjective judgments. 10. The assumption that measurements made are accurate and sensitive enough to detect the effect under study is always an auxiliary hypothesis invoked during one's empirical refutation attempt.
Answer: (1) T (2) T (3) T (4) F (5) F (6) T (7) F (8) T (9) T (10) T
Chapter 4 Causal Inference and Decision

4.1 Causal concepts in medicine and public health
4.2 Proposed criteria for causal decisions
    4.2.1 Necessary criteria
    4.2.2 Quasi-necessary criteria
        4.2.2.1 Consistency
        4.2.2.2 Chance is not a causal factor
        4.2.2.3 No alternative explanation
        4.2.2.4 Coherence
    4.2.3 Other supportive criteria
        4.2.3.1 Strength of association
        4.2.3.2 Specificity of association
        4.2.3.3 Biological gradient or dose-response relationship
        4.2.3.4 Biological plausibility
4.3 Objective knowledge and consensus method
4.4 Summary
Introduction

The process of conjectures and refutations lies at the heart of science. Scientists propose conjectures and perform refutations in their search for the laws of nature. Scientific "theories" are the hypotheses which have resisted all kinds of refutation attempts. Yet, from a Popperian scientist's point of view, such theories remain conjectures. Although scientists can usually wait for more critical tests of causality, public health investigators may need to take more immediate action to prevent the morbidity or mortality predicted by the hypothesis. Public health policy-making is not too different from daily-life decision-making, where one decides under some degree of uncertainty. For example, will it rain today? Should one carry an umbrella? Should one drive the more direct path to the office and face a potential traffic jam, or take an
alternative highway? In public health and medicine, one often asks questions such as: Should one add a beta-blocker for a patient with hypertension, in addition to the angiotensin-converting enzyme inhibitor? Should one propose to a breast cancer patient with no palpable local lymph node a modified radical mastectomy or just a simple mastectomy? Based on the available evidence for the potentially detrimental health effects of secondhand or environmental tobacco smoke, should one adopt a regulation prohibiting smoking in all public places and offices? Should one still advocate mandatory BCG vaccination for every newborn in Taiwan? In such circumstances, one must consider the utility or harm produced by the event, in addition to the subjective risk. Before further discussing the utility involved, I recommend that one carefully consider the criteria for causal decisions proposed by Hill (1965) and Susser (1986) from a refutational point of view.

As pointed out in Chapter 3, public health decisions can conflict with a person's autonomy. For example, the US Surgeon General's warning on a cigarette pack may influence a smoker's decision by inducing some psychological stress. As a result, it is important to take a refutational attitude when reviewing causal criteria, in order to avoid wrong decisions or at least minimize any adverse effects. This notion of avoiding the infliction of harm on others is a major ethical priority among medical professionals, known as non-maleficence (Beauchamp and Childress, 1994). It is also in accord with the precautionary principle of today's scientific and public community (Appell, 2001; The European Commission, 2000). This chapter will first examine some characteristics of causal concepts in medicine and public health, followed by a review of causal criteria and comments on how to minimize erroneous conclusions in public health work.
4.1 Causal concepts in medicine and public health
Defining a cause amenable to modification

A cause is an event, state or agent of nature which initiates, alone or in conjunction with other causes, a sequence of events resulting in an effect. In medicine and public health, a cause is defined and measured relative to an alternative condition. For example, "smoking" is defined in contrast to "nonsmoking." To make more accurate and quantitative measurements, one may further classify smoking into "smoking more than 1.5 packs/day," "smoking 0.5-1.5 packs/day," "smoking less than 0.5 packs/day," "never smoking," and "ex-smoker" for comparison. As recommended by MacMahon and Pugh (1967), a cause must be amenable to manipulation or modification. The definition of a cause must be as specific as possible, in order to effectively avoid adverse effects on the human population; such a definition will also most likely gain acceptance by other scientists and the public. For example, most people accept the polio vaccine since it is directed against a specific infectious agent. In the prevention of skin cancer among paraquat manufacturers, if one had defined the cause as the entire manufacturing process, then one could not have taken any preventive action other than wiping out the whole process. However, since our study pointed out that the skin cancer was closely related only to the crystallization and centrifugation processes (Wang et al., 1987), the industry found it acceptable simply to enclose these two processes. Still, it would have been even more helpful to pinpoint the main responsible agent and elucidate the pathophysiology, e.g., to link the exposure of bipyridyl or its isomers to skin cancer (Wang et al., 1987; Jee et al., 1995). Then we could have implemented preventive actions against exposure to bipyridyl or its isomers in other, non-paraquat manufacturing processes.

In the nineteenth century, doctors noticed that a considerable number of chimney sweepers developed scrotal cancer. What form of prevention could be implemented? Completely outlawing chimney sweeping was certainly not feasible. However, if one could demonstrate that a specific
agent inside the chimney produced scrotal cancer, then one could prevent scrotal cancer by removing the carcinogen while still preserving the job. Furthermore, this knowledge could then be applied elsewhere. Take another example: if smoking-related cancer is caused by the specific content of tar inside the smoke of a burning cigarette, then the tobacco industry might more easily preserve its market by simply selling cigarettes with a lower content of such carcinogens.

Sufficient vs. necessary cause

Causal concepts can be further examined in two ways: sufficient cause and necessary cause.

1. Sufficient cause. In medicine or public health, several factors or component causes generally act together to produce an effect. In only a few cases is there a single component cause, such as death by beheading. The minimum combination of different component causes which act together or in sequence to produce the effect is called a sufficient cause. For example, a sufficient cause for contracting AIDS (acquired immune deficiency syndrome) involves infection by HIV (human immunodeficiency virus), lack of immunity to the HIV virus, adequate induction time, and many other genetic and pathologic factors. Moreover, different combinations of component causes may produce the same effect. Rothman (1976) proposed a model of component causes to express that an effect can be produced by different mechanisms, as shown in Figure 4.1.

2. Necessary cause. A necessary cause is a component cause or agent which is always required to produce the effect. For example, Mycobacterium tuberculosis is a necessary cause in producing pulmonary tuberculosis, exposure to a lead source is a necessary cause of lead poisoning, etc.
Figure 4.1 Three different sufficient causes (I, II, III) can produce the effect (a hypothetical disease). Factor A may be a necessary cause for the effect (modified from Rothman, 1976).
However, a necessary cause by itself may not be sufficient to produce the effect. For example, exposure to the HIV virus is a necessary cause of AIDS, yet exposure to HIV alone is not sufficient to produce AIDS. Other component causes are required, such as lack of immunity to the HIV virus, a wound at the exposed site, etc.

Among the components of a sufficient cause, which factor is the most important for treatment and prevention? If the necessary component is difficult to remove, the answer usually depends on which component cause can more feasibly be removed or replaced. In fact, the least frequently present component cause may be the most easily removed. For example, to prevent asbestos-related mesothelioma and lung cancer among community residents, Chang et al. (1999) conducted an environmental survey of 41 asbestos factories throughout Taiwan. They found that five factories were responsible for three-fourths of the projected cases of lung cancer and mesothelioma. Thus, the EPA (Environmental Protection Agency) of Taiwan tried first to relocate and improve control in these five factories. Similarly, for the prevention of AIDS, identification and counseling of HIV-infected cases are generally more feasible in a community like Taiwan, where the HIV incidence rate is still low. However, once the incidence rate grows higher, as in Thailand or
Uganda in the early 1990s, a general promotion of the use of condoms may be more feasible and effective.
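Rothman's sufficient-component cause model in Figure 4.1 lends itself to a compact sketch. The factor labels below are hypothetical; factor "A" is placed in every sufficient cause to show what it means to be a necessary cause:

```python
# Three hypothetical sufficient causes (mechanisms I, II, III of Figure 4.1).
SUFFICIENT_CAUSES = [{"A", "B", "C"}, {"A", "D", "E"}, {"A", "F", "G"}]

def disease_occurs(factors):
    """Disease occurs when the factors complete at least one sufficient cause."""
    return any(cause <= factors for cause in SUFFICIENT_CAUSES)

print(disease_occurs({"A", "B", "C"}))       # True: mechanism I is complete
print(disease_occurs({"B", "C", "D", "E"}))  # False: removing the necessary
                                             # cause "A" blocks every mechanism
```

The sketch also makes the prevention logic explicit: removing any one component from a sufficient cause blocks that mechanism, while removing a necessary cause blocks them all.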
4.2 Proposed criteria for causal decisions
One also needs to examine the various rules proposed for consideration as causal criteria (Yerushalmy and Palmer, 1959; Hill, 1953, 1965; Susser, 1977, 1986; Evans, 1978). Although agreeing on a few common principles, such as consistency and correct temporality, different authors have proposed and emphasized different criteria. From a refutational point of view, proposing necessary criteria can reduce the difficulty of determining causality, as such criteria allow one to rule out other possible causal relationships. Thus, I have tried to classify causal criteria into categories of necessity: necessary, quasi-necessary and other.

4.2.1 Necessary criteria

In theory, there are two necessary criteria for causality: temporality and consistency. By definition, a cause should always precede its effect, and a universal law of nature should remain valid in any place, time, condition or setting. However, in any refutation attempt, one always invokes auxiliary hypotheses that may not always be correct, such as measurement accuracy and sensitivity. In epidemiological research, in which one can only observe a human population, one invokes an even greater number of auxiliary hypotheses. With so many additional assumptions involved, obtaining consistency is often difficult. As a result, consistency is only a quasi-necessary criterion, whose characteristics will be discussed in Section 4.2.2.

Temporality is a necessary criterion for determining causal relationships. If one demonstrates that the effect preceded the cause, then the latter can at most be considered an aggravating factor rather than the main cause. To apply this criterion in practice, one must also know the minimal induction time and maximal latency period for the cause to produce
the effect. For example, in cases of HIV infection, Person B could not have transmitted HIV to Person A if A and B's first sexual contact occurred just one day before A's positive HIV blood test: one day is simply shorter than the minimal induction time for HIV infection. Similarly, one would not attribute Person A's recent HIV seroconversion to an exposure occurring 5 years earlier, because this length of time exceeds the maximum latency period. Temporality is also a key factor in the diagnosis of occupational diseases. The European Commission (1994) has published an information guideline that clearly specifies the minimal induction time and maximal latency period for almost every occupational disease on its list, in order to avoid confusion and to decrease legal disputes. In fact, in my own work, I often use temporality as a necessary criterion and, subsequently, rule out about half of the cases who come to seek a diagnosis of occupational or environmental disease. The following examples illustrate this point.

Case 1. Mediastinal tumor of an engineer in a nuclear power plant

A 36-year-old male patient, an engineer in a nuclear power plant, was found to have a tumor mass in the upper right mediastinum. Surgical removal of the tumor was performed immediately, and histopathology showed a granuloma. Since his work environment had involved occasional exposure to ionizing radiation and beryllium dusts during the 6 years prior to admission, he demanded a determination of the work-relatedness of his illness. A retrospective review of all the serial chest X-ray films from the Veterans General Hospital (VGH), where his annual physical examinations had been performed, revealed that the tumor mass was already recognizable (about 0.3 cm) in the first film, taken during the pre-employment physical. Therefore, the claim of work-relatedness was denied, because the lesion had existed well before any exposure. Furthermore, aggravation of mediastinal granuloma by either ionizing radiation or beryllium has never been documented in previous research.
Case 2. Asthma and facial palsy among residents living near a petrochemical refinery

Two patients with bronchial asthma and two patients with facial palsy were found in a community near a petrochemical refinery in Southern Taiwan. People were concerned that such illnesses might be related to VOCs (volatile organic compounds) evaporating from the wastewater discharged by the petrochemical plant. The company's plan to expand and build a fifth oil cracking plant further complicated the issue. Were these illnesses environmentally related? A detailed history revealed that the first patient had developed asthma about 12 years earlier but moved into the community only about 10 years ago. Similarly, the second patient began suffering from asthmatic attacks approximately 8 years ago and had been regularly taking bronchodilators before moving into the community about 5 years earlier. According to the criterion of temporality, the VOCs contained in the wastewater of the plant could not have caused their asthma. Moreover, one would need further evidence to document whether the frequency and severity of their asthmatic attacks were aggravated by the VOCs evaporating from the wastewater. Both patients with facial palsy were in their mid-40s and suffered from the disease on only one side of their faces. A detailed search of the NIOSHTIC database (of the National Institute for Occupational Safety and Health) showed no previous report of any association between VOCs and facial palsy. The unilateral lesion also suggested a local etiology. Thus, an environmental association could not be established.

Case 3. Brain tumor of a worker in a nuclear power plant

A 47-year-old man came to my occupational clinic and asked if I could certify that exposure to ionizing radiation from working in a nuclear power plant had caused his recently diagnosed brain tumor. The patient had worked at the power plant for only 18 months. Doctors diagnosed his tumor as an astrocytoma and estimated it to be 1 kg at the time of
craniotomy. To produce a tumor of 1 kg, a single malignant cell would need to divide approximately 40 times, yielding approximately 10^12 (= 2^40) cells. If the first malignant cell had, in fact, been produced by exposure to the ionizing radiation in the plant, the calculated doubling time of this tumor would have been less than 18/40 = 0.45 month. When I checked the most up-to-date Medline database at the time he came to my clinic, the shortest doubling time of a brain tumor ever reported was about 1.5 months. Moreover, no previous research has ever reported a radiation-induced brain tumor developing within such a short induction time (Committee on the Biological Effects of Ionizing Radiations, BEIR V, 1990). Therefore, I could not certify that his tumor was work-related.
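The doubling-time reasoning in Case 3 reduces to a few lines of arithmetic, sketched below in Python under the stated assumptions (one initial malignant cell; roughly 10^12 cells in a 1 kg tumor):

import math

cells_at_craniotomy = 1e12                     # approx. number of cells in a 1 kg tumor
doublings = math.log2(cells_at_craniotomy)     # about 40 doublings from a single cell
employment_months = 18

doubling_time = employment_months / doublings
print(f"required doublings: {doublings:.1f}")              # ~39.9
print(f"implied doubling time: {doubling_time:.2f} months")  # ~0.45, vs ~1.5 reported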
4.2.2 Quasi-necessary criteria

A causal criterion is considered quasi-necessary either when it is a necessary criterion on theoretical grounds, or when it concerns a confounder (an alternative explanation of the causal effect) that must be ruled out to better clarify causality. The validity of the auxiliary hypotheses invoked during one's refutation attempt is not regarded as a necessary criterion, but it is still important when evaluating causality. The quasi-necessary criteria are:

1. Consistency
2. Ruling out chance
3. Ruling out confounders (alternative explanations)
4. Coherence or consistency with other highly corroborated theories

4.2.2.1 Consistency

The principle of consistency stipulates that a natural law should be universally true and that a causal relationship can be found under different times, places and settings. Strictly speaking, consistency should be a necessary criterion, because if one has not invoked any false auxiliary hypotheses, the causal hypothesis should resist refutation under different settings. However, since one frequently must invoke less corroborated auxiliary hypotheses in human observational studies, one often ends up with a doubtful conclusion as to whether the primary causal hypothesis has, in fact, resisted refutation. For example, investigators performed an empirical study to evaluate the hypothesis that bipyridine and its isomers had caused the development of hyperpigmented and hyperkeratotic skin lesions among paraquat manufacturers. Because hyperpigmented spots or freckles are not uncommon among Caucasian or light-skinned people, researchers used the workers' close friends, of a similar age and the same gender, as the non-exposed group. The results showed no higher prevalence of freckles among these workers. However, in this refutation attempt, the investigators had invoked the auxiliary hypotheses of sensitive measurement, adequate induction time and at least minimal intensity of bipyridine exposure. Yet, given that the whole manufacturing process was largely enclosed, the last assumption might not have been true, i.e., workers may have been exposed to less than the minimal level of bipyridine (Cooper et al., 1994). Thus, although the investigators found that the hypothesized effect of bipyridine exposure lacked consistency, the negative findings may not refute the primary hypothesis but may in fact refute an auxiliary hypothesis. Similarly, up to 1986, only some studies of environmental tobacco smoke had shown a consistent positive association with lung cancer (Weiss, 1986). Since some studies tried to detect a small effect with low intensity of exposure and relatively large measurement error, firm conclusions on the effect could not be drawn from such studies. With potentially false auxiliary hypotheses, one should take care to determine whether the data refute the primary hypothesis or, instead, refute an auxiliary hypothesis. Moreover, if one is not always looking for a hypothesis that spells out a detailed mechanism akin to a natural law, there can also be one-time causes, such as turning off an electric switch putting out the light. Thus, consistency may be a necessary criterion in theory, but it is not easily maintained in the daily practice of epidemiological research.
4.2.2.2 Ruling out chance

For all observable phenomena, chance is always a potential alternative explanation. Consequently, one must perform some kind of statistical analysis in order to rule out chance to a certain degree. Based on the assumptions that the null hypothesis is true and that the sample was drawn in a random manner, frequentist statisticians often calculate a p-value to rule out chance. However, the traditional p < 0.05 is an arbitrary decision rule set up to reject the null hypothesis. It does not mean that one has already eliminated the possibility of chance. Nor does it mean that the probability of obtaining the result is less than 0.05. Rather, it only indicates that if the null hypothesis is true and the sample is a random one, then the likelihood of obtaining a result like this one, or a more extreme one, is less than 0.05. Moreover, in observational studies we simply assume that the physical, social and behavioral processes by which people become exposed to different risk factors come close to randomization; it is not real randomization (Greenland, 1990). Thus, it is usually more informative to report the confidence interval and the exact p-value, as Rothman (1978) recommends. To better determine the extent to which chance plays a role, one should always consider the sample size of the study, because the p-value is also influenced by sample size. Even if the association is not strong, the p-value will usually be small if the sample size is large (e.g., more than 500 or 1000). Conversely, the p-value is usually large if the sample size is very small (e.g., < 30). Therefore, if the sample size is large and yet the p-value is also large (> 0.05), chance is a likely explanation, and one can regard the study as a strong refutation of any association. If the sample size is moderate and one obtains a very small p-value, e.g., < 0.01 or even smaller, then one may tentatively conclude that chance is not a probable cause and consider the study a failed refutation attempt of the proposed association. Nonetheless, with only a single study, one should refrain from overemphasizing the results. Finally, if the sample size is small, one should infer that no matter how small the p-value, chance cannot be ruled out (Miettinen, 1985a). Because the p-value is so easily misinterpreted, experts in the field (Nurminen, 1997; Greenland, 1998) have proposed taking the Bayesian viewpoint of likelihood ratios and posterior probability as an alternative.
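The influence of sample size on the p-value can be demonstrated directly. The following sketch (Python, standard library only; the proportions and sample sizes are hypothetical) applies a two-proportion z-test to the same weak association at two different sample sizes:

import math

def two_proportion_p(x1, n1, x0, n0):
    """Two-sided p-value comparing two proportions (normal approximation)."""
    p1, p0 = x1 / n1, x0 / n0
    p = (x1 + x0) / (n1 + n0)                       # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n0))
    z = (p1 - p0) / se
    return math.erfc(abs(z) / math.sqrt(2))         # two-sided tail probability

# The same weak association (risk 12% vs. 10%) observed at two sample sizes:
print(two_proportion_p(12, 100, 10, 100))           # ~0.65: chance not ruled out
print(two_proportion_p(1200, 10000, 1000, 10000))   # ~6e-6: "significant," same effect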
4.2.2.3 Ruling out confounders (alternative explanations)

Even if one rules out chance or demonstrates a statistically significant association, one still cannot conclude that the association is causal. One must now consider whether there remain any confounders or alternative hypotheses that can explain the phenomena, and subsequently attempt to refute such confounding. This criterion should not be confused with "specificity" (detailed in section 4.2.3.2), which was earlier proposed as a causal criterion (Hill, 1965) but is currently not considered necessary (Susser, 1986; Weed, 1986). As pointed out in Chapter 3, if all other alternative hypotheses are refuted, then one may claim that the hypothesis of interest is the one that best corresponds to fact. Otherwise, there are at least two equally valid hypotheses that can explain the effect, i.e., confounding. One should avoid confounding right from the beginning, in the stage of study design. Since a confounder is a potential causal determinant of the outcome that is also associated with the exposure of interest, one can conduct a literature search to find all other known causal factors of the effect. One should then control these factors by design or data analysis. For example, in the study of the cause of skin cancer among paraquat manufacturers, researchers asked each worker whether he had ever been exposed to any known skin carcinogens, such as radiotherapy, coal tars, pitch or cutting oils (Wang et al., 1987). Since none of the 228 workers examined reported such exposures, the investigators ruled out these potential confounders. We shall return to the issue of confounding in Chapter 7. To evaluate any specific hypothesis, one should design a study (a refutation attempt) to critically test the hypothesis of interest against any equally corroborated alternative hypotheses. For example, if one observes an increased occurrence of lung cancer among people with yellow fingertips (especially of the 2nd and 3rd fingers), then one might propose the hypothesis that yellow fingertips cause lung cancer. However, there is the alternative hypothesis that smoking causes lung cancer. Thus, one should design a
study to determine which one is more highly corroborated. In fact, studies have shown that yellow fingertips may themselves result from smoking. Furthermore, suppose that another study demonstrates that non-smokers with yellow fingertips show no increase in lung cancer frequency. Such studies present a strong refutation of the competing hypothesis that yellow fingertips cause lung cancer.

Ruling out confounders does not conflict with the possibility of a multi-factorial hypothesis for a disease, in which a health effect may result from a variety of factors. Rather, this criterion stipulates that no alternative cause or hypothesis should equally explain any single finding of an increased frequency of a disease. For example, while arsenic (Chen et al., 1988; Enterline et al., 1987; Hays, 1997), asbestos (Selikoff et al., 1968; Stayner et al., 1996) and smoking (U.S. Department of Health, Education and Welfare, 1979) can all produce lung cancer, this criterion does not preclude a new agent such as BCME (bischloromethyl ether) from causing lung cancer (International Agency for Research on Cancer, 1987). However, if one finds evidence that BCME is the cause of a specific occurrence of lung cancer, then arsenic, asbestos, smoking or any other known lung carcinogen should not simultaneously explain this event. In order to corroborate the proposed hypothesis, one should rule out confounding by any alternative cause. In addition, if at least two equally corroborated hypotheses exist, the two causal factors may, in fact, act independently or synergistically to produce the effect. If current evidence cannot differentiate these conditions, then one cannot draw any firm conclusion about the causal relationship. As a result, the fulfillment of this criterion is quasi-necessary.

4.2.2.4 Coherence

The criterion of coherence stipulates that the hypothesis should be consistent with currently existing well-founded theories. If the hypothesis conflicts with common physical or chemical laws or theories, it is likely to
be false. Of course, one cannot completely rule out the possibility that a new conjecture might be correct and that the status of an existing theory should be reconsidered. If this is the case, then a scientific revolution, so named by Kuhn (1970), may occur as the old paradigm is contested. Since such paradigm shifts rarely occur, coherence or consistency with current scientific knowledge is still classified as quasi-necessary.

4.2.3 Other supportive criteria

Some authors (Hill, 1965; Susser, 1977, 1986) have proposed other criteria, which are not necessary but may help in determining causality from a Bayesian point of view. Namely, fulfillment of any of these criteria may improve the subjective credibility of the hypothesis in a causal decision. They are:

1. Strength of association
2. Specificity of association
3. Biological gradient or dose-response relationship
4. Biological plausibility

4.2.3.1 Strength of association

Strength of association means that the magnitude of the effect, such as the rate difference or rate ratio between the exposed and non-exposed, should be large. This criterion must not be confused with statistical association, such as the p-value. If the exposure factor produces a very large effect, such as a rate ratio > 5, then the likelihood of a causal relationship may be high, since other known causal factor(s) usually cannot completely explain such a large magnitude. For example, suppose that asbestos workers were found to have a rate ratio of 5 for lung cancer mortality, compared with the general population, who are presumed to have very low (or negligible) asbestos exposure. Further, assume that the prevalence rates of smoking among asbestos workers and the general population were 90% and
50%, respectively. This observation raises the concern that the increase of lung cancer among asbestos workers might be completely due to their higher prevalence of smoking. If the rate ratio for lung cancer among smokers vs. nonsmokers is equal to 10, could smoking alone have caused the workers to develop lung cancer? Let $R_0$ denote the incidence rate of lung cancer for the nonsmoking population unexposed to asbestos; thus, $10R_0$ denotes that of smokers unexposed to asbestos. Assuming that asbestos does not cause lung cancer, the rate ratio of lung cancer for asbestos workers vs. the general population would be:

$$\frac{(10R_0)(90\%) + R_0(1-90\%)}{(10R_0)(50\%) + R_0(1-50\%)} = \frac{9.1R_0}{5.5R_0} \approx 1.65$$
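The same computation generalizes to any assumed rate ratio for smoking and any prevalence figures. A minimal sketch in Python, using the hypothetical numbers of the example above:

def rr_explained_by_smoking(rr_smoking, prev_exposed, prev_reference):
    """Rate ratio expected if smoking were the only cause of the excess.
    Rates are expressed in multiples of R0, the rate among nonsmokers
    unexposed to asbestos."""
    rate_exposed = rr_smoking * prev_exposed + (1 - prev_exposed)
    rate_reference = rr_smoking * prev_reference + (1 - prev_reference)
    return rate_exposed / rate_reference

print(rr_explained_by_smoking(10, 0.90, 0.50))   # ~1.65, far below the observed 5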
The rate ratio of 1.65 expected if smoking were the only causal factor cannot explain the rate ratio of 5 found in the original study. Thus, the likelihood of a causal association between asbestos and lung cancer may be high. However, if the strength of association is not strong, say, a rate ratio of less than 2 or 1.5, then one may be concerned that an extraneous factor, such as smoking, may confound the effect, and the causal hypothesis may not be true. Nevertheless, strength of association is not a necessary criterion, because even a weak association might still be causal. When this criterion conflicts with other criteria, how does one make a decision? There is no easy answer to this question. Instead, one must consider all the evidence and all the necessary and quasi-necessary criteria before drawing any conclusion. Early in the last century, Karl Pearson and Almroth Wright held conflicting opinions about whether a typhoid fever vaccine should be adopted for routine inoculation in the army. Pearson compared the incidence and case fatality rates among the inoculated and non-inoculated. He found that the strength of association was not as high as that of the smallpox vaccine, and accordingly, he opposed inoculation.
Wright counted only autopsy cases, whose diagnoses were more definite, and found that different trials consistently showed about 5 times more protection from typhoid mortality, which led him to support extensive vaccination. After further studies, vaccination against typhoid was found to be effective (Susser, 1977), as Wright had believed.

4.2.3.2 Specificity of association

This criterion suggests that the more specific a causal relationship is, the more likely that such an association exists. However, it is not a necessary criterion, because there are already several examples demonstrating multiple causes of the same disease, even when a specific association exists between one exposure and the disease. Furthermore, one exposure factor can simultaneously produce many different diseases. For example, while exposure to asbestos, smoking or arsenic can each produce lung cancer, asbestos can also cause mesothelioma and pleural plaque; smoking can also result in bladder cancer, chronic obstructive pulmonary disease, ischemic heart disease, etc.; and arsenic can produce bladder and liver cancer as well. In fact, when the first committee report on smoking and health was published (U.S. Public Health Service, 1964), Berkson, an eminent scientist, proposed that the causal hypothesis was not tenable because of the lack of specificity between smoking and lung cancer. However, it is now widely accepted that this causal hypothesis is highly corroborated and that Berkson's opinion, based solely on this criterion, was not a wise one. If specificity of association is unnecessary, why does one still consider it evidence supporting a causal hypothesis? By establishing a specific association between the agent and the effect, one has successfully refuted other alternative hypotheses. Thus, specificity of association fulfills the quasi-necessary criterion of ruling out confounders and thereby supports causality.

4.2.3.3 Biological gradient or dose-response relationship

This criterion is derived from the rule of concomitant variation in
Mill's rules of induction (Chapter 3). However, such a biological gradient may simply result from confounding. For example, the more yellow-stained a person's fingertips are, the more likely he or she is to develop lung cancer, yet this is because smoking produces both yellow fingers and lung cancer in a dose-response manner. Moreover, observation of such a gradient generally relies on many auxiliary hypotheses. Thus, the relationship may not be easily demonstrated in empirical studies, and it is therefore not a necessary criterion. To quantify dosage, one must clarify the effect in detail and the mechanism by which it is produced. An understanding of the detailed PB-PK (physiologically-based pharmacokinetic) model and the pathophysiologic mechanism is therefore also helpful. In addition, the following factors must be considered in such a relationship (a simple numerical sketch follows the list):

1. Intensity or concentration of exposure: For example, how many cigarettes are smoked per day, and what is the content of tar or any specific carcinogen such as benzo(a)pyrene? Similar questions also apply to alcohol drinking, occupational and environmental exposures, etc.
2. Duration of exposure: For example, how long has the person been smoking?
3. Onset of the first exposure: Is the requirement of temporality, i.e., minimal induction time and maximum latency period, fulfilled?
4. Dose rate: What was the frequency and dosage of each exposure, and what was the portal of entry? Generally speaking, a high dose rate (e.g., of ionizing radiation) is usually more harmful to the human body, as the body's time for repair is decreased. To clarify the actual dose, one also needs to know the method of application and the mechanisms of absorption, distribution and even excretion. For example, does the person always inhale the smoke? Or does he quickly blow it away?
5. Host factors: Genetic susceptibility may differ from person to person. Gender, age, ethnicity and other lifestyle factors may interact with the exposure in affecting health.
6. Type of dose-response relationship: Is the relationship linear or quadratic? Is there any threshold?
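As a simple numerical sketch of these factors (Python; the functions, slopes and thresholds are hypothetical illustrations, not empirical estimates), intensity and duration can be combined into a cumulative dose, and a linear model versus a threshold model can then predict different responses for the same dose:

def cumulative_dose(intensity, duration):
    """Simplest dose metric: intensity x duration (e.g., pack-years)."""
    return intensity * duration

def linear_response(dose, slope=0.01):
    """Linear dose-response with no threshold."""
    return slope * dose

def threshold_response(dose, threshold=15, slope=0.02):
    """No effect below the threshold; linear above it."""
    return max(0.0, slope * (dose - threshold))

dose = cumulative_dose(intensity=1.0, duration=20)   # e.g., 1 pack/day for 20 years
print(dose)                       # 20.0 pack-years
print(linear_response(dose))      # 0.2
print(threshold_response(dose))   # 0.1: same dose, different predicted response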
Because of all the complicated mechanisms and auxiliary hypotheses involved, one is frequently unable to observe a "typical" dose-response relationship. If such a relationship does exist and no alternative hypothesis can explain the phenomenon, then the criterion of biological gradient may support a causal decision. It is not, however, a necessary criterion.

4.2.3.4 Biological plausibility

This criterion means that current biological knowledge, i.e., evidence from animal experiments or cellular and/or molecular biological research, supports the causal hypothesis. However, biological plausibility is not a necessary criterion, because of inter-species differences between humans and animals. For example, at the end of the nineteenth century, when many dyestuff workers in Germany developed bladder cancer, many suspected that beta-naphthylamine might be carcinogenic. In the 1910s, a similar tragedy occurred among dyestuff workers in the U.K. In 1921, the ILO (International Labor Office) announced beta-naphthylamine to be a bladder carcinogen without any evidence from animal models. Bladder cancer was not successfully induced by beta-naphthylamine in animal experiments until 1937, when dogs were used as the experimental model. Similar histories apply to some other human carcinogens, such as benzene and arsenic. Thus, fulfillment of this criterion can be considered supportive evidence for a causal decision but is not absolutely necessary.

Having discussed the criteria for causality, one still needs some kind of formal procedure or mechanism that can be applied to decision-making in public health or medicine. Such a procedure may prevent important decisions from being made solely on the basis of one expert's opinion. In general, an expert committee, arriving at a consensus with a refutational or critical attitude, may help to avoid personal bias.
4.3 Objective knowledge and consensus method
In scientific research, one makes conjectures and refutations in the effort to find the objective laws of nature. Popper termed such a collection of natural laws the "third world." As a Christian, I believe that God created the laws of our natural world. Still, no matter how these laws came into being, their discovery rests on subjective understanding and on the method of conjectures and refutations. If a scientist can take a refutational attitude, then he or she may rise above his or her own personal biases in the discovery process and find that the hypotheses left unrefuted are the closest approximations to the objective natural laws. If everyone in a committee maintains such an attitude, then after full communication of the possible hypotheses and evidence, the group will be less likely to make false conclusions or recommendations. Such a group of experts should have no conflict of interest in making the decision and should be free from external pressure and influence. Beyond medicine and public health, a designated consensus method of this kind has long been practiced in the jury systems of Western countries such as the U.S. and the U.K. (Fink, 1984). Specifically, every juror selected must be unrelated to both sides, the plaintiff and the defendant. After comprehensive and critical communication of all hypotheses and evidence, the jury as a group is asked to reach a consensus conclusion. Even though every juror still makes a subjective judgment, the whole group is expected to reach a relatively objective decision. Although a critical and inter-subjective decision may still be fallacious (for example, ancient people used to believe that the sun revolves around the earth), it is still the least fallible, provided everyone keeps a refutational attitude, avoids outside influences and engages in comprehensive communication. In public health and medicine, a great deal of knowledge is specialized. Thus, the relevant experts should examine the issue at hand with a critical and refutational attitude. The likelihood of reaching a fallacious decision will generally be lower if the decision is made by a group rather than by a
single expert. Therefore, causal decisions should be carried out by a committee of experts who understand the issue at hand; this also saves time in both the communication and the consensus procedure. For example, in the U.S., an advisory committee to the Surgeon General was set up to make the key decisions on the issue of smoking and health (U.S. Public Health Service, 1964). In the early stages of a science, many different, often mutually contradictory hypotheses may coexist. The process of conjectures and refutations, however, can eliminate some of them. The one hypothesis that resists falsification and is recognized by a group of experts engaging in comprehensive communication and refutational critique will be the closest to the truth for the time being. As new evidence appears, the expert committee may revise its consensus opinion periodically to reach a more informed conclusion and decision.
4.4 Summary
While the scientific research process of conjecture and refutation continues, we in public health and medicine must take action by making causal decisions to prevent morbidity and/or mortality among people. Such decisions should be made with a critical or refutational attitude. To facilitate them, I have classified causal criteria into three levels: necessary, quasi-necessary and other. First, correct temporality is the only necessary criterion for any cause: the response or effect must occur after the minimal induction time and within the maximum latency period of the exposure. Second, consistency should exist under different times, places and settings of observation, as long as no false auxiliary hypotheses are invoked in the study. One should also rule out chance and other alternative hypotheses to a certain extent, and the hypothesis should cohere with current, well-founded chemical or physical laws. The category of "other" contains criteria that are not strictly necessary but whose presence can support causal decision-making, i.e., strength and specificity of association, biological gradient and biological plausibility. To reduce the bias involved
in causal decision-making in medicine and public health, I recommend inviting a committee of experts to critically judge all available hypotheses and evidence. After a comprehensive discussion, they can attempt to reach a consensus and make a decision.
Quiz of Chapter 4

Please write down the probability (%) that each assertion is true. You will be scored according to the credibility of your subjective judgment.
1. The only necessary criterion of causality is a correct temporal sequence.
2. Cause should be defined as something amendable.
3. HIV (human immuno-deficiency virus) is a sufficient cause for AIDS (acquired immuno-deficiency syndrome).
4. We cannot attribute the cause of a current case of AIDS to the patient's sexual contact with an infectious source occurring 3 days ago.
5. When the sample size is large (e.g., > 1000), then a p-value > 0.10 is a strong refutation.
6. If the sample size is small (e.g., < 30), then chance cannot be completely ruled out.
7. Strictly speaking, the principle of consistency must be a necessary causal criterion. However, since one frequently has to use less corroborated auxiliary hypotheses in human observational studies, one often ends up with a doubtful conclusion that one has not yet refuted the primary causal hypothesis.
8. If the strength of association is low, e.g., a rate ratio of 1.5, then we can conclude that the association is not causal.
9. If one finds a dose-response relationship between an exposure and a disease, then one has established a causal association.
10. To obtain an objective decision on causality, an expert committee should take a refutational attitude and engage in full and comprehensive communication.
Answers: (1) T (2) T (3) F (4) T (5) T (6) T (7) T (8) F (9) F (10) T
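The text does not specify the scoring rule for these subjective probabilities; one standard way to reward well-calibrated judgments is the quadratic (Brier) score, sketched below purely as an illustration (the stated probabilities are a hypothetical reader's answers):

def brier_score(probabilities, truths):
    """Mean squared distance between stated probabilities and the truth.
    Lower is better; 0 means perfectly confident and always right."""
    return sum((p - t) ** 2 for p, t in zip(probabilities, truths)) / len(truths)

answers = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]   # T/F key above, coded as 1/0
stated = [0.9, 0.8, 0.2, 0.9, 0.6, 0.9, 0.7, 0.3, 0.2, 0.9]   # hypothetical judgments
print(brier_score(stated, answers))        # 0.05: quite well calibrated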
Chapter 5 Basic Principles of Measurement

5.1 What is measurement?
5.2 Why does one perform measurement?
5.3 How does one measure?
    5.3.1 Measurement in socio-behavioral sciences
5.4 Accuracy of measurement: validity and reliability
5.5 Scales of measurement
    5.5.1 Nominal scale: A scale of qualitative measurement
    5.5.2 Ordinal scale: A scale of semi-quantitative measurement
    5.5.3 Interval scale: A quantitative measurement with or without an absolute zero starting point
    5.5.4 Ratio scale: A quantitative measurement with an absolute zero starting point
5.6 Common evaluation method in medical diagnostic tests
5.7 Validity and reliability of physico-chemical, biological and socio-behavioral measurements from a refutationist's point of view
    5.7.1 Measurement of chemicals in the environment or inside the human body
    5.7.2 Conceptualization of exposure dose and its measurement in occupational and environmental medicine
    5.7.3 Validity and reliability of socio-behavioral measurement
5.8 How to perform accurate measurement by questionnaire
    5.8.1 Construction of a questionnaire
    5.8.2 Interview procedures
5.9 Summary

Introduction

Observations and measurements of phenomena are a necessary part of the scientific research process. This is not surprising, since all scientific research involves empirically testing hypotheses. Yet, it is usually not
possible to make direct observations and measurements bearing on one's primary hypothesis, especially in human observational studies. Instead, one usually deduces statements from the primary hypothesis that can be observed and measured. One then enters the field to collect the data and, finally, summarizes the data with statistical tools in an attempt to refute the various hypotheses. Since in any empirical study one always invokes the assumption that the measurements made are sufficiently accurate and sensitive, one should carefully select a valid measurement method for one's study. Even in public health decision-making, which involves more than simply testing hypotheses, people demand data from previous experiences or observations in order to make rational judgments. Therefore, measurement, a term I use to denote both observation and measurement, is one of the central issues in epidemiological research. However, most books on epidemiological methods regard measurement as an issue not unique to epidemiology and thus decide not to discuss it (Rothman and Greenland, 1998; Miettinen, 1985a; Kupper et al., 1982). Others simply discuss measurement error and some of its practical aspects (Kelsey et al., 1996). In contrast, I have chosen to explore its theoretical principles and practical applications. This chapter will attempt to clarify the concept of measurement, illustrate how to make accurate measurements, explain how to set up scales of measurement, and then apply these principles to the common evaluation methods of medical diagnostic procedures and environmental chemical analyses. Moreover, the common validity and reliability measures in socio-behavioral science will also be discussed and explained, followed by some practical advice on the construction of questionnaires and on interview procedures.
5.1 What is measurement?

As Lord Kelvin pointed out: "I often say that when you can measure what you are speaking about and express it in numbers you know something about it; but when you cannot measure it, when you cannot express it in
numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be." (Michell, 1990)

The classical definition of measurement is to express a concept or characteristic of a group of objects in terms of classes (qualitative, categorical or nominal) or numbers (quantitative). This definition is intuitive and straightforward for physical, chemical or biological objects, for which one simply finds a gold standard for comparison to obtain the class or number. It may not be sufficient, however, for the measurement of subjective preference (utility), such as health-related quality of life (HRQL) in public health, or of any concept involving multiple dimensions, such as attitude, personality or cognitive ability in psychology. Thanks to the development of operationalism and representationism, Stevens (1946) expanded the definition of measurement to "the assignment of numerals to objects or events according to rules." Thus, a measurement can be any corresponding set of operations, as long as they are precisely defined and consistently performed to assign numbers to objects.
5.2 Why does one perform measurement?
There are two major reasons for measurement. First, one attempts to form a clearer definition of the concept or object, to gain a more profound understanding and to be able to perform further operations or mathematical manipulation. For example, one may want to know the current blood lead levels of newborns in Taipei (Hwang and Wang, 1990). This question is descriptive, asking how one measures blood lead content and how it is distributed among the newborns of Taipei. First, one must decide whether to use atomic absorption spectrometry, anodic stripping voltammetry or some other method to determine blood lead level. Moreover, one must also decide from whom and how one should take blood samples. The other major reason for measurement is to empirically test a hypothesis. For example, one may want to test the hypothesis that smoking causes lung cancer. In this instance, one must consider the definitions of
smoking and lung cancer, as well as conduct actual measurements in a population. Specifically, one must decide whether to include only cases with a histopathologic diagnosis or to include even those cases with merely clinical evidence. Moreover, one should also measure other determinants of lung cancer, such as exposure to asbestos or arsenic. Otherwise, one will be unable to differentiate the effect of smoking from other alternative causes, resulting in confounding. Thus, a causal study always involves measurements of the major determinant and outcome of interest, as well as other determinants of the outcome. Let us look at another example: an investigator wants to evaluate the preventive efficacy of taking AZT (zidovudine) after a needlestick injury involving HIV (human immuno-deficiency virus)-contaminated blood. He must at least define and measure a case of such an injury, the dose schedule of AZT, seroconversion of HIV and other determinants of HIV infection, such as personal sexual practice, blood transfusion, etc. In fact, at the beginning of the process of conjectures and refutations (see Figure 2.1), when one attempts to propose hypotheses, it is implied that one looks for something to be measured and falsified empirically. If one cannot directly conduct measurement under the primary hypothesis, one then deduces statements that can be measured and, ultimately, tested. In epidemiology, one may try to measure specific rates (e.g., the lung cancer mortality rate) and ratios (e.g., the sex ratio) in the population under study. Moreover, when one summarizes the results in the data analysis stage with statistical methods, one also obtains measurements of a causal effect, such as the rate ratio or rate difference. Thus, measurement is the basic tool of empirical scientific research, whether causal or simply descriptive.
5.3 How does one measure?
Need of a gold standard

As mentioned in our definition, measurement involves operational rules to express certain concepts or characteristics by different numbers or scales. For concepts that are less abstract, such as length, weight, volume, angle, etc., one simply defines the concept and establishes a gold standard for
comparison. Then, one can measure the object by comparing it with the gold standard. Essentially, most measurements are comparative. For example, if one wants to measure the body height of a person, one must define height as the length from the top of the head to the sole of the foot, with the person standing in an erect posture. Then, one uses a large straight ruler (for comparison) to determine the height expressed in a certain unit, e.g., centimeters (cm). Because the height of an adult decreases slightly in the afternoon after working in an erect position all day (especially after heavy lifting), one may stipulate that measurements be made in the morning for consistency and accuracy. Although such a minor change (e.g., a decrease of 0.5-1 cm) in the measurement of height may not usually influence one's conclusion, one must still be very careful on certain occasions.

Controlling all determinants of a measurement

For example, one unpublished study in Taiwan tried to determine the change of body weight of pregnant women during the 1st, 2nd and 3rd trimesters. The results showed an average loss of 0.6 kg of body weight during the 1st trimester, as compared with the early 2nd trimester. Why would a pregnant woman's weight decrease during the 1st trimester? It turned out that the investigator had not standardized the weighing operation and, consequently, introduced a systematic bias towards a heavier weight in the 1st trimester. Specifically, during the 1st prenatal visit, pregnant women were usually weighed after breakfast with a full urinary bladder. During the second prenatal visit, in the early 2nd trimester, women came in for blood tests and were often weighed with an empty stomach and bladder. As a result, the 0.6 kg decrease of body weight during the 1st trimester was more likely due to weighing after a meal and with a full bladder at the first examination. This example illustrates the importance of understanding and controlling the determinants of a measurement, and shows why they should be explicitly documented for comparison and future reference. Such a document is similar to the Standard Operating Procedure (SOP) of a manufacturing process, which should be carefully followed when making actual measurements.
Assisted instruments for interview

When one conducts an interview, one can help subjects provide more accurate answers about their health behavior by providing devices for comparison. Consider the common example of using a questionnaire survey to measure the amount of alcohol intake per day. Clinicians are usually skeptical about the accuracy of measurements of alcohol intake obtained from clinical interviews. However, if one can provide a comprehensive collection of colored photographs of all brands of beers, wines and liquors, as well as of the different types of cups, for the person to identify, then one may obtain more accurate information on alcohol consumption. The photographs or assisting instruments provide the needed gold standard for comparison. Although such instruments may help to improve data collection from interviews, one is still limited by the subjects' ability to measure the objects in question. For example, in a study measuring occupational exposures among workers in Taiwan, the workers were asked if they were exposed to any specific chemicals or hazards. Since most people in Taiwan were not aware of the chemicals in their work environments in the 1980s (Wang, 1991), they could only identify occupational hazards through sensation. Accordingly, researchers found that the most extensively reported hazard at workplaces was noise, and less than 5% of workers could provide any specific chemical name. Thus, a questionnaire interview is always limited by the subjects' cooperation and their ability to measure the items in question.

5.3.1 Measurements in socio-behavioral science

In socio-behavioral science, measurement is further complicated by the need to measure unobservable or latent concepts composed of several dimensions and domains. Under this condition, investigators usually formulate a construct or theory to express the concept and then measure it across different dimensions and domains. After proper summarization through statistical analysis, one tests the validity of the construct or theory with the results of one's measurements. Please refer to Figure 5.1. For example, the measurement of HRQL (health-related quality of life)
generally involves at least physical, psychological and social domains (WHO, 1948). Each domain can be measured in two dimensions: 1) an objective assessment of functioning or health status, and 2) a subjective perception of health (Patrick and Erickson, 1993; Testa and Simonson, 1996). Since the construct of HRQL cannot be observed directly, one is actually measuring different components of HRQL, in terms of these three domains and two dimensions. The results are then summarized to yield a scaled score. This final score should be an accurate quantitative measurement of the concept of HRQL, if one has chosen the proper items and scales of measurement. The WHOQOL (WHO quality of life) generic questionnaires (WHO, 1995, 1998a, 1998b) have required all participating countries first to conduct studies for descriptors and then to add culture-specific questions. Both exploratory and confirmatory factor analyses show a consistent result of the following four domains: physical, psychological, social relations, and environment. Thus, I recommend its use for general assessment of health profiles and for international comparison. Take another example: measuring aggressiveness in personality. One may begin by using personality theory to give this subjective characteristic an operational definition along several dimensions, followed by selection of proper measurement items and scales, as shown in Figure 5.1. After collection of the data, one still needs to summarize the different items and scales statistically to obtain results that can validate or test the original construct.

A test of the theoretical construct

By itself, a socio-behavioral measurement is a test of the original theoretical construct. It follows, then, that one examines the measurements to determine the existence of such a construct or theory. Moreover, one must remain wary of the fact that these measurements are subject to potential errors themselves. Thus, although one's success in refuting a hypothesis may be attributable to a false construct or theory, in some cases it is a result of measurement errors (Blalock, 1982). Although the same statement is equally applicable to the natural sciences, one must consider measurement error especially seriously in the social sciences, because social measurements are based on relatively more abstract matters and subjective judgments, and rarely
remain consistent across different times, places and social settings.
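The summarization step can be illustrated with a minimal sketch (Python). The domain scores and weights below are entirely hypothetical; real instruments such as the WHOQOL define their own items, scales and scoring algorithms.

# Hypothetical domain scores (0-100) from a quality-of-life questionnaire
domain_scores = {"physical": 75, "psychological": 60, "social": 80, "environment": 70}
weights = {"physical": 0.3, "psychological": 0.3, "social": 0.2, "environment": 0.2}

# Weighted summarization into a single scaled score
overall = sum(weights[d] * score for d, score in domain_scores.items())
print(f"weighted HRQL score: {overall:.1f}")   # 70.5: one summary of the construct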
Figure 5.1 Process of conceptualization and measurement in socio-behavioral science. The flow runs as follows: conceptualize and form a theoretical construct (e.g., subjective quality of life) into several dimensions or domains (e.g., physical, mental); select suitable items to be measured in each domain; decide on the measurement scale for each item and the weights for summarization; collect data and summarize the score according to the original construct; and attempt to falsify the theoretical construct by empirically testing whether it can be consistently measured.
Selection of a sensitive measurement

In the daily practice of public health and medicine, most measurements are relatively straightforward. One simply needs to carefully define the gold standard used for comparison and control all the determinants of one's measurement. In addition, if there are several ways to make equally valid measurements, one should select the most sensitive method. For example, in our study to detect the health effects of lead exposure among kindergarten children, we found IQ (intelligence quotient) to be more sensitive than any neurophysiological or hematological measurement when blood lead was below 25 μg/dl. Thus, we selected IQ as the primary indicator of health effects
in kindergarten children when studying possible exposure from a neighboring lead recycling factory (Wang et al., 1992; Soong et al., 1999). Similarly, since nerve conduction velocity (NCV) is usually more sensitive than clinically overt symptoms of bilateral weakness of the upper and lower extremities, we used NCV in our study to detect sub-clinical polyneuropathy under different levels of n-hexane exposure (Wang et al., 1986). In all kinds of measurement, one must always compare the objects one sets out to measure with a set of standards. The gold standard chosen should correspond best to the concept or fact and should be widely accepted by the scientific community. Otherwise, the accuracy of the measurement will be doubted, and the results will be difficult to compare with those of other investigators. For example, the measurement of length is based on rulers reproduced from the ultimate standard metric ruler made of platinum-iridium alloy. In 1960, the 11th General Conference on Weights and Measures redefined the standard meter as equal to 1,650,763.73 wavelengths, in a vacuum, of the orange-red radiation of krypton-86.

Development of a gold standard

In developing a new method of measurement, how does one establish a gold standard? Reaching a consensus by expert committee, as recommended in Chapter 4 for causal decision-making, is still probably the wisest choice. The expert committee must attempt to develop a gold standard that can be applied easily and shows consistent results under various times, places and settings. To fulfill these requirements, the measurement method must be clearly specified, and all its determinants must be easily characterized and controlled in practice. With a refutational attitude and after full communication, experts may not find it too hard to reach a consensus. In general, most standard classifications or measurement systems recommended by the ISO (International Organization for Standardization), WHO or ILO (International Labor Office) are produced in this manner and are widely accepted by the scientific community. For example, the widely used International Classification of Diseases (ICD) is a product of the consensus of WHO experts, who revise it about every ten years.
5.4 Accuracy of measurement: Validity and reliability
Target shooting as an illustration

The goal of measurement is to achieve high accuracy, i.e., high validity (low systematic error) and high reliability (low random error). If a method or instrument actually measures what it claims to measure, then it is valid. If a measurement yields values that are consistent, or whose distributions remain close to each other, then it is reliable, or "precise" as chemists describe it. Of course, one always wants measurements to be both valid and reliable, namely, to have the least possible systematic and random errors (Carmines and Zeller, 1979). This concept can be illustrated by the example of target shooting. When a person shoots at a clearly marked target with a gun, each bullet hits the target paper and produces a hole. After many shots, examination of the target paper may show four types of distribution, as illustrated in Figure 5.2. Type A shows a cluster of holes near the target center, which indicates that most bullets hit very close to the target: both a small systematic error and a small random error. The shots are all very accurate, i.e., valid and reliable. Type B shows a cluster of holes near the side of the target paper, which indicates a small random error but a systematic error that directs the bullets toward the side of the target instead of the center. In other words, the sight of the gun needs to be readjusted, or the standard used for comparison during the measurement is biased. Type C shows an even distribution of bullet holes without any clustering. This indicates a large random error (low reliability) but, seemingly, no systematic error. Type D shows no cluster, with holes distributed widely over only the upper part of the target paper. This indicates a systematic error shifting the central target to the upper portion, plus a large random error. In some extreme cases, if a large proportion of bullets do not produce any holes on the target paper at all, one must consider that such a large error probably results from a combination of large systematic and random errors. Under this circumstance, one must reconsider the definition of the gold standard used in the measurement or suspect the existence and measurability of the object in question.
Figure 5.2 Four types of distribution of gun shots on a target paper: Type A, excellent accuracy; Type B, high reliability with low validity; Type C, low reliability with possibly high validity; Type D, both poor validity and poor reliability. An accurate measurement requires high validity (low systematic error) and high reliability (low random error). Although the systematic error appears smaller than the random error in this figure, the two are not necessarily related in this manner.
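The four patterns can also be reproduced numerically: the systematic error is the distance between the mean point of impact and the target, and the random error is the scatter around that mean. A sketch in Python with hypothetical parameters:

import random
import statistics

def shoot(n, offset, spread, seed=0):
    """Simulate n shots at a target located at 0: offset is the systematic
    error of the gun; spread is the random error of each shot."""
    rng = random.Random(seed)
    return [rng.gauss(offset, spread) for _ in range(n)]

for label, offset, spread in [("A", 0, 1), ("B", 5, 1), ("C", 0, 5), ("D", 5, 5)]:
    shots = shoot(100, offset, spread)
    bias = statistics.mean(shots)    # estimates the systematic error
    sd = statistics.stdev(shots)     # estimates the random error
    print(f"Type {label}: bias ~ {bias:.2f}, random error ~ {sd:.2f}")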
In statistical terms, one can see how total error is a combination of random and systematic errors, or a function of variance and bias, respectively. For an estimator obtained from a random sample, the total error is usually defined as the mean squared error (MSE) of the estimator. It can be
derived that the MSE of an estimator equals its variance (the square of the random error) plus the square of its systematic error (Anderson et al., 1980). Let $\theta$ be the target value, $\hat{\theta}$ the estimator from a random sample, $\mathrm{bias}(\hat{\theta})$ the systematic error of the estimator, and $\mathrm{Var}(\hat{\theta})$ the variance of $\hat{\theta}$:

$$\mathrm{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$

$$\mathrm{Var}(\hat{\theta}) = E[\hat{\theta} - E(\hat{\theta})]^2 = E(\hat{\theta}^2) - [E(\hat{\theta})]^2$$

$$\mathrm{MSE}(\hat{\theta}) = E(\hat{\theta} - \theta)^2 = E[\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta]^2 = E[\hat{\theta} - E(\hat{\theta})]^2 + [E(\hat{\theta}) - \theta]^2 = \mathrm{Var}(\hat{\theta}) + [\mathrm{bias}(\hat{\theta})]^2$$
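This decomposition is easily verified by simulation. The sketch below (Python; the numbers are hypothetical) repeatedly estimates a target value with a measurement process carrying a known systematic error, and checks that the empirical MSE matches the variance plus the squared bias:

import random
import statistics

random.seed(1)
theta = 10.0        # true target value
bias_true = 0.5     # systematic error of the measurement process
sigma = 2.0         # random error of a single measurement

# Each estimate is the mean of 25 measurements from the biased process.
estimates = [
    statistics.mean(random.gauss(theta + bias_true, sigma) for _ in range(25))
    for _ in range(20000)
]

mse = statistics.mean((e - theta) ** 2 for e in estimates)
var = statistics.pvariance(estimates)          # ~ sigma^2 / 25 = 0.16
bias = statistics.mean(estimates) - theta      # ~ 0.5, so bias^2 ~ 0.25
print(f"MSE {mse:.4f} vs Var + bias^2 {var + bias ** 2:.4f}")  # nearly identical, ~0.41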
Making the most direct and relevant measurement is important in reducing these errors. For example, it is invalid to use a ruler for measuring weight. It is also invalid to measure weight in a space shuttle orbiting the earth, because objects inside are effectively weightless. These are two extreme cases, and one may use the term "poor validity" to describe such inaccurate measurements. For a less extreme example, to make a definite diagnosis of lung cancer and its cell type, one needs to take a biopsy (or at least a cytologic specimen) and perform a histopathologic examination. Evidence from simple imaging, like a chest X-ray film, cannot be considered a valid measurement of the cancer, although a doctor may still use it to predict the likelihood of a certain cell type based on previous clinical experience.

Validity of a study

The term "validity" is also frequently used in evaluating the quality of the measurements and the causal and statistical inferences of a study. If there is a mixing of causal effects, in which an extraneous causal factor can partially or totally explain the effect, then confounding exists and the study is
considered weak or poorly valid. For example, in a study comparing recurrence rates of peptic ulcer after cimetidine and anti-Helicobacter treatments, if the cimetidine group happens to take fewer antacids than the anti-Helicobacter group, then the comparison of the two rates may involve confounding. Specifically, the lower recurrence rate in the latter group might also be explained by its higher antacid intake. As another example, an occupational physician familiar with the work histories of coal workers may be more likely to diagnose pneumoconiosis when he finds small round opacities on the chest films. As a result, a differential classification bias may arise, which links the exposed group (i.e., coal miners) more easily to the outcome (i.e., pneumoconiosis), and the validity of such a classification may be in doubt. Therefore, it is generally recommended in clinical and other epidemiological studies that outcomes be assessed without any information on exposure, and vice versa. Such attention must especially be paid in cross-sectional and density-sampling case-control studies, where one often measures both outcome and exposure at the same time. In short, poor validity, or large systematic bias, usually arises from using inappropriate instruments, from a lack of standardization during measurement processes, or from uncontrolled confounding in a study. Most of these problems can be foreseen and effectively controlled or prevented during study design. There will be a more detailed discussion of this issue in Chapter 7.

Reliability of measurement

Good reliability means small random errors. Large random errors usually result from a poor detection limit (or low sensitivity) of an instrument, from non-differential misclassification of exposure or outcomes, or simply from a small sample size. For example, if one wants to measure newborns' blood lead content, flame atomic absorption spectrometry may not be sensitive enough to detect the difference between those with and without parental exposure, because the magnitude of such a difference may be less than 5 μg/dl (Wang et al., 1989), which can be detected only by graphite furnace atomic absorption spectrometry. Similarly, using a less sensitive method such as RPHA (reversed passive hemagglutination) to screen for HBsAg (hepatitis B surface
antigen) among pregnant women may miss some carriers of low titer, who can be detected only by the radioimmunoassay method. If a measurement is so unreliable or crude that it seems to always randomly and wildly miss the target, e.g., none of all 20 shots hit the target paper in Figure 5.2, one must consider that there may also be a validity problem. Namely, one must change the instrument or modify the measurement method. Thus, some investigators even suggest that good reliability is a prerequisite to validity. Since one always invokes or assumes the auxiliary hypothesis that measurements are sufficiently accurate and sensitive, one should carefully select a method of measurement, choose a widely accepted gold standard, and control the determinants of measurement in the study.

5.5 Scales of measurement
The accuracy and mathematical implications of further operations on measurement are both related to the scales of measurement. There are four scales of measurement: nominal, ordinal, interval and ratio.

5.5.1 Nominal scale: A scale of qualitative measurement

This scale is produced when one tries to classify or categorize objects during observation. The categories of a nominal scale must be mutually exclusive and totally inclusive, so that every object can be classified into one and only one category. It is similar to the concept of qualitative measurement in physics and chemistry. For example, one classifies every person according to 2 genders (male and female), 4 blood types (A, B, AB, and O), 2 carrier states of hepatitis B (yes and no), or disease conditions, such as bipolar disorder (yes and no), diabetes mellitus (yes and no), etc. As the understanding of a specific disease advances, one may add subcategories, such as IDDM (insulin-dependent diabetes mellitus) and NIDDM (non-insulin-dependent diabetes mellitus), or other subtypes of mania, etc.
5.5.2 Ordinal scale: A scale of semi-quantitative measurement

Some characteristics or concepts may not have two mutually exclusive categories. For many types of psychological measurement, such as whether a person is quick or slow in temper, pessimistic or optimistic, introversive or extroversive, etc., one may need to develop categories in between the two extremes. Under such circumstances, one may try to use some "anchor points" (or explicit standards for comparison), more easily identified across different places and times, to assign several categories according to an ordinal degree of the characteristic. Although such a measurement scale may not be as accurate as an interval scale, it makes the measurement more sensitive and precise than just having two extreme categories. For example, one commonly uses the following ordinal scales:

proteinuria: (-), (±), (+), (++), (+++), (++++)
degree of deep tendon reflexes: 0, +, ++, +++, ++++
degree of muscle power: 0, 1, 2, 3, 4, 5 (for 1-5, a "+" or "-" sign can be added)

Many brands of strip tests for proteinuria have a clear quantitative equivalent for each specified ordinal scale. In one brand, (+) refers to 100 mg% protein in urine, (++) refers to 100-300 mg%, (+++) indicates above 300 mg%, etc. The degrees of deep tendon reflexes "0", "++" and "++++" indicate no reflex at all, normal reflex and clonus reaction, respectively, while the "+" and "+++" degrees are the in-betweens. Each degree of muscle power has a specific definition: "0" indicates no visible contraction; "1" indicates visible contraction of the tested muscles but inability to flex or extend a joint; "2" indicates ability to horizontally move a joint but inability to lift against gravitational force; "3" indicates the ability to move a body part against gravity; "4" indicates enough strength to resist only a mild outside force; and "5" means normal strength. Throughout these ordinal scales, one tries to find or set up relatively objective standards for comparison, which can be easily identified. By "objective," I mean that different persons measuring at different times and places will obtain a relatively consistent result. For example, to ask the frequency of a particular
symptom, say, epigastralgia, one tries to explicitly specify the standard of each ordinal scale, such as "more than 4 times per day," "1-3 times per day," "4-6 times per week," "1-3 times per week," "1-3 times per month," "less than 1 time per month," etc., instead of using "very frequently," "frequently," "occasionally," "rarely," "never," etc. Certainly, one can more easily standardize and compare such forms of measurement between different studies.

Let us look at another example. The ILO (International Labor Office, 1980) has provided copies of X-ray films from the originals to experts (or "B" readers, as they are called) to be used in comparison with standards to scale the degree of profusion of round and irregular opacities. Four standard films represent degrees of profusion 0/0, 1/1, 2/2 and 3/3, respectively. If a reader looks at an unknown film and feels that it is very similar to the 1/1 film, he can first put down the major category of 1. After comparing with the 2/2 film again, if he feels that the degree of profusion is more than 1/1, he can write down 1/2; however, if he still feels that it is almost the same as 1/1, he records 1/1. By repetitively comparing several times with the standard films, the ordinal scale can be extended to 12 categories: 0/-, 0/0, 0/1, 1/0, 1/1, 1/2, 2/1, 2/2, 2/3, 3/2, 3/3, 3/+. Here 0/- means that the lung field is clearer than the standard film 0/0, and 3/+ means that the degree of profusion is greater than 3/3. The result of such a standard reading technique is cross-checked for agreement among different readers during the certification examination of a "B" reader, who is then qualified to read X-ray films of pneumoconiosis for inter-hospital and international comparison.

In the socio-behavioral sciences, one often measures subjective preferences or feelings of people. Because everybody has his/her own scale of subjective feelings, there is usually no objective baseline and standard for ordinal comparison, as shown in Figure 5.3. One may try to solve this problem by providing several standard examples describing in detail what each specific scale means. However, in general, such a practice unfavorably increases the duration of the interview or test. An alternative is to conduct a descriptor study first, which surveys a group of the most commonly used terms and tries to select the terms that are closest to the targeted degree of satisfaction.
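Such explicit anchor points translate directly into ordered codes for analysis. A minimal sketch (Python; the numeric codes are arbitrary, and only their order carries meaning):

# Frequency anchors for a symptom such as epigastralgia, from least
# to most frequent; each label is an explicit standard for comparison.
FREQUENCY_ANCHORS = [
    "less than 1 time per month",
    "1-3 times per month",
    "1-3 times per week",
    "4-6 times per week",
    "1-3 times per day",
    "more than 4 times per day",
]
CODE = {label: i for i, label in enumerate(FREQUENCY_ANCHORS)}

# Ordinal comparisons are then meaningful, but differences are not:
assert CODE["1-3 times per day"] > CODE["1-3 times per week"]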
[Figure 5.3 appears here: three horizontal scales, A, B and C, each running from 0% to 100% in degree of satisfaction.]
Figure 5.3 Different scales for the measurement of subjective feelings or preferences. Every subject may have different baselines (0% and 100% satisfaction), as well as different degrees of specific feelings among different ordinal scales. As shown here, A, B and C could well represent 3 different persons' subjective scales and baselines.
The WHOQOL clearly delineates such a requirement for every participating center (WHOQOL, 1995, 1998a). Accordingly, the Taiwan version also conducted such a survey throughout the country and picked the most appropriate terms for measuring capacity, frequency, intensity, and evaluation (Lin et al., 1999).

5.5.3 Interval scale: Quantitative measurement with or without an absolute zero starting point

The most important characteristic of this scale is that the quantity of each interval is equivalent to every other. For example, a difference of 2 mmHg of blood pressure between 110 and 108 mmHg is exactly the same as that between 70 and 68 mmHg, etc. Similarly, an interval of 1°C is the same at all different temperatures on the Celsius scale. However, an interval scale does not necessarily have an absolute zero starting point.
5.5.4 Ratio scale: Quantitative measurement with an absolute zero starting point

This scale is the most precise and most powerful scale for mathematical manipulation, as it possesses the characteristic of extensiveness of measurement. Namely, 20 kg is two times 10 kg, 20 minutes is 10 times 2 minutes, etc. Although such a scale can always be calculated in a ratio manner, one should also pay attention to the biological or socio-behavioral implications involved before one makes any generalizations. For example, although a blood cholesterol level of 180 mg% is two times 90 mg% in quantity, the likelihood of the former developing atherosclerotic heart disease is not necessarily twice that of the latter. Similarly, a person with a blood pressure of 180/120 mmHg may not have twice the likelihood of developing stroke as another person with a blood pressure of 90/60 mmHg. Thus, a reduction of a ratio-scale variable to an ordinal or a nominal variable may be necessary, depending on different research objectives. Moreover, in epidemiological research, one sometimes needs to reduce a ratio or interval scale measurement to an ordinal scale, e.g., lumping different ages together in order to obtain enough numbers for each stratified cell and to maintain statistical power in the data analysis. There is no absolute rule for how a data set of ratio scale should be treated, but one certainly cannot expand an original data set from a nominal or ordinal scale to a ratio scale. Therefore, I recommend measurement on a ratio scale whenever possible; the data set can then be analyzed according to different objectives later.

Extensiveness of measurement

In fact, most physico-chemical measurements seem to have a fundamental characteristic, extensiveness of measurement (Blalock, 1982), i.e., facility of mathematical manipulation. This is represented mathematically as: a o a = 2a, where "o" represents an operation or manipulation. Any measurement with a ratio scale, such as length, weight, volume, time, etc., possesses the above characteristic of concatenation or physical addition
because these scales have a clear absolute zero point. As aforementioned, 20 cm is equal to 10 times 2 cm, and 10 kg is equal to 5 times 2 kg, etc. Some measurements with interval scales do not share such a characteristic. For example, 20°C is not equal to 2 times 10°C, because there is no absolute zero on the Celsius temperature scale. Instead, 0°C is defined as the temperature at which water freezes. In public health and the socio-behavioral sciences, one also frequently deals with nominal and ordinal scales. However, counting numbers under a specific nominal or ordinal scale can also fulfill the extensiveness of measurement. For example, 2 women plus 2 women equal 4 women, etc. The recent development of the theory of conjoint measurement (Luce and Tukey, 1964; Michell, 1990) provides an alternative way to identify quantitative structure via ordinal relations rather than via physical addition. Because behavioral scientists quite frequently use ordinal scale measurement, conjoint measurement theory will probably find more applications in this field in the future.

5.6 Common evaluation method in medical diagnostic tests
In clinical medicine and epidemiological research, one often needs to evaluate the validity or accuracy (as some authors call it) of a new diagnostic test or measurement. Although the choice of any particular diagnostic test involves decision-making and, hence, the utility of different tests, my coverage of this subject will focus only on the validity of diagnostic tests and the evaluation of such validity.

Sampling procedures involved in the common evaluation of diagnostic tests

A diagnostic test is generally developed to assist a physician or investigator in estimating a patient's likelihood of having a disease. To judge the validity of this diagnostic test, the physician needs information or data from previous evaluations, i.e., the results from the manufacturers' evaluations, which supposedly demonstrate the diagnostic test's high sensitivity and specificity. The comparison of data between present and
previous studies is usually presented as in Table 5.1. Strictly speaking, the determinants of sensitivity and specificity include not only the validity and reliability of the test itself, but also the potential systematic and random errors resulting from the sampling procedure used to recruit subjects during evaluation. Consequently, to evaluate the performance of a test with data from previous experience, one must also pay attention to the sampling procedure of subject recruitment. On many occasions, non-random samples of patients are simply drawn from a medical center or a university hospital. The inference of constant specificity and sensitivity may then not be applicable to one's own hospital or clinic, even with the exact same testing procedure. Let us first come back to the original definitions of the commonly used indicators for such tests.

Sensitivity, specificity and predictive values

Sensitivity is usually defined as the proportion of positive tests among patients with the disease. Specificity is defined as the proportion of negative tests among normal persons or those without the disease. Positive predictive value (+PV) indicates the likelihood that a patient with a positive test actually has the disease; negative predictive value (-PV) indicates the likelihood that a patient with a negative test really does not have the disease. If the patients who are selected for the diagnostic test are randomly drawn from all patients with the disease, then there will be no systematic bias in sensitivity. Similarly, if the normal persons are also randomly drawn from a normal population, then the specificity will not be biased, either. However, if the results of a diagnostic test are related to the severity of the disease, the sensitivity may not remain constant when the test is applied in a different hospital with a different distribution of severity in patients. For example, if a specific tumor marker is more easily detected in cancer patients at a more advanced stage, then the sensitivity figure obtained from sampling at a medical center, with more patients in an advanced stage, may not be directly applicable to a local hospital or clinic, where patients may have a less severe form of the disease. Thus, the sensitivity obtained from a medical center may be an overestimate under the above condition. On the other hand, carrying out a random sample of a normal population is relatively easier, and the resulting
index of specificity is less affected by the sampling procedure. Sometimes, however, a developer of a diagnostic test draws patients without this disease from other wards of the hospital to serve as a sample of quasi-normal persons. Under this condition, the generalizability of the specificity may to some extent be in doubt. The positive and negative predictive values always change depending on the total prevalence of the disease in the population, as shown in the formulas in Table 5.1.

Example of developing a new tumor marker for diagnostic testing

For example, suppose the NTUH (National Taiwan University Hospital) reported a new diagnostic test (a tumor marker) for nasopharyngeal cancer (NPC), with specificity and sensitivity of 90% and 80%, respectively (see Table 5.2). If the marker happens to be related to the severity of NPC, i.e., the more advanced the stage of a patient, the more likely he is to obtain a positive result, then one should examine the selection of patients and normal persons in the evaluation study. If the patients were selected from a group all hospitalized for radiation treatment (i.e., at a more advanced stage), the sensitivity would be an overestimate. Thus, if the same diagnostic test were performed on a group of outpatients or patients from a local hospital, sensitivity would be lower than 80%. Random selection helps to reduce such sensitivity bias, so that the result can be generalized to other hospitals and clinics. Moreover, test developers might simply test 100 patients and 100 normal persons for convenience, producing an artificially high NPC prevalence of 50%. Nowhere in the world is there such a high prevalence, and thus the positive predictive value from this study is an overestimate. A physician or epidemiologist who wants to apply this test should adjust the total prevalence of NPC to a value which he considers more plausible for the particular clinic or community hospital. Thus, both the sensitivity and the positive predictive value must be adjusted to lower values for a community hospital or local clinic, resulting in a less effective performance of this new test, despite the exact same test procedures and technicians.
Table 5.1
Definitions of sensitivity (Se), specificity (Sp), and positive and negative predictive values (+PV and -PV).

If previous samples of a+c and b+d are randomly drawn, or are at least representative, then the sensitivity and specificity represent the accuracy of the diagnostic test and usually will not vary across different places and times. However, the positive and negative predictive values always change according to the total prevalence (TP) of the screened population.
                      Gold standard of disease
                          (+)         (-)
Result of      (+)         a           b
new test       (-)         c           d
                          a+c         b+d

a: true positive cases
b: false positive cases
c: false negative cases
d: true negative cases

Accuracy (if the sample were randomly drawn) = (a+d)/(a+b+c+d)
Sensitivity (Se) = a/(a+c)
Specificity (Sp) = d/(b+d)
+PV = a/(a+b)
-PV = d/(c+d)

Assume that the total prevalence (TP) of the disease in the population for a new test is: TP = (a+c)/(a+b+c+d). Let D+ denote having the disease, D- not having the disease; T+ denote a positive test, T- a negative test; and Prob(X) denote the probability of occurrence of condition X. Then, by Bayes' theorem:

+PV = Prob(D+ | T+)
    = Prob(T+ | D+) Prob(D+) / [Prob(T+ | D+) Prob(D+) + Prob(T+ | D-) Prob(D-)]
    = (Se)(TP) / [(Se)(TP) + (1-Sp)(1-TP)]

Similarly,

-PV = (Sp)(1-TP) / [(Sp)(1-TP) + (1-Se)(TP)]
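These formulas are mechanical to apply; a minimal sketch (Python; the function name is mine) that returns both predictive values for a test with known Se and Sp at any assumed total prevalence:

def predictive_values(se, sp, tp):
    """Return (+PV, -PV) for sensitivity se, specificity sp, and
    total prevalence tp, following the Bayes formulas above."""
    ppv = se * tp / (se * tp + (1 - sp) * (1 - tp))
    npv = sp * (1 - tp) / (sp * (1 - tp) + (1 - se) * tp)
    return ppv, npv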
In addition, one should be very careful about how the true presence or absence of a particular disease is determined. One can only select the best available and feasible definition of the disease. If the newly developed method is more accurate than the currently available "gold standard" definition of the disease, then the sensitivity and specificity one obtains will not be accurate. For example, oral cholecystography (OC) was first developed to detect gall bladder stones and was used as a standard to evaluate the later-developed ultrasonography. It turned out that sonography was demonstrated to be more accurate than OC (Cooperberg and Burhenne, 1980), and consequently OC was replaced by sonography. Therefore, when one evaluates a new diagnostic method, one should choose a gold standard that is widely accepted by experts. If one finds that the current gold standard is not suitable, one may develop a new one to replace it.

Detection of false negative tests

In order to develop a new method for early detection of cancer, i.e., cancer screening in a population, one should not include any known prevalent cancer cases. Evaluation of the performance of this new method should be based on diagnosing unknown cases, which is the basis of the concept of early screening (Cole and Morrison, 1980). Thus, one should evaluate the test by performing screening in a population at risk of developing the cancer. Then, one should closely follow up those testing negative to see how many actually have cancer and should have tested positive. Yet, because they look apparently normal and even screen negative, these false negative cases are difficult to detect. Current practice is to follow up all negative tests at the end of one year and to inquire about any cancer development. This one-year period is designated only for convenience and ease of memory. Ideally, one should select a follow-up period that conforms with the earliest detectable pre-clinical period (DPCP). The ideal follow-up period may also vary according to different cell types, doubling times, organs and DPCPs. It usually takes several periodic screenings to answer the above question (Morrison, 1992).
Table 5.2
Evaluation of a hypothetical diagnostic test (e.g., DNase tumor marker) for nasopharyngeal cancer (NPC) in a medical center.
Assume that patients were drawn randomly from those hospitalized for treatment, and non-patients or normal persons were drawn randomly from a local community. In this table, the 50% total prevalence of NPC is an artifact during sampling, which should be different from the actual prevalence in any local community. Thus, one cannot apply the positive and negative predictive values (+PV and -PV) to any other situation.
                        Nasopharyngeal cancer
                           (+)        (-)
New test:       (+)         80         10
DNase tumor     (-)         20         90
marker                     100        100

Se  = a/(a+c) = 80/100
Sp  = d/(b+d) = 90/100
+PV = a/(a+b) = 80/90
-PV = d/(c+d) = 90/110
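To see how strongly the predictive values depend on the artificially high 50% prevalence, one may recompute them with the predictive_values() sketch above; the 0.1% community prevalence used below is purely hypothetical:

se, sp = 0.80, 0.90
for tp in (0.50, 0.001):   # 50% sampling artifact vs. assumed 0.1% community prevalence
    ppv, npv = predictive_values(se, sp, tp)
    print(f"TP = {tp:7.3%}: +PV = {ppv:.3f}, -PV = {npv:.3f}")
# TP = 50.000%: +PV = 0.889, -PV = 0.818
# TP =  0.100%: +PV = 0.008, -PV = 1.000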
5.7 Validity and reliability of physico-chemical, biological and socio-behavioral measurements from a refutationist's point of view
In the scientific process of conjectures and refutations, one always invokes the auxiliary hypothesis of accurate and sensitive measurement. As a result, one should try one's best to assure that this auxiliary hypothesis is highly corroborated. Then one can be assured that there is no confounding in one's refutation attempt of the primary hypothesis. As public health research often involves multidisciplinary measurements, one may not be able to master all such methods. Consequently, an interdisciplinary approach and cooperation are usually necessary. An understanding of how principles of measurement can be applied in different fields can help facilitate cooperation. Here I provide such considerations in several important fields as illustrations.
5.7.1 Measurement of chemicals in the environment or inside the human body

In chemical measurements, people usually talk about establishing a QA/QC (quality assurance/quality control) program to guarantee accuracy and sensitivity in the laboratory. The basic concept of such a program is to be aware of and minimize all possible measurement errors. In general, errors inside a laboratory arise from the determinants of measurement, including: basic technique in operating balances and pipettes, washing of glassware, purity of chemicals, solutions (including water) and solvents, any contaminants in the laboratory air, the analytic method, the type and brand of instrument, and the qualification of the person who performs the measurement. If one is dealing with environmental sampling or obtaining human blood or fluids, then one should also consider the sampling strategy, on-site pre-treatment, suitable containers, and methods of storage and transportation, etc. If one is trying to measure trace chemicals, such as cadmium or lead in blood, or volatile organic chemicals in ambient air, the demand for purity (or lack of contamination) of every involved substance is even stricter. For example, since lead may be a trace contaminant in nitric acid or water, one must use ultra-pure nitric acid and twice-deionized water, with an electrical resistance above 16 mega-ohms, for blood lead analysis. The containers should be cleansed with nitric acid vapor or perchloric acid and should not adsorb any metal. They should be made of pure polyethylene, teflon or quartz. The laboratory should have at least a clean bench, which contains clean air with less than 10,000 particles per cubic meter. Implementing these requirements helps rule out extraneous alternative explanations of one's measurement data.

Target value and QA/QC

How does one know that one's measurement is accurate, or close to the target value? Chemists usually repeat the same procedure, often called a double check, to see if the result is reproducible. If the results of such a check are very similar or almost the same, then the measurement is considered
precise or reliable. To falsify whether the measurement hits the known target value, chemists usually measure the target with a completely different method and compare for consistency with the first measurement. If the two results are very close to the same target value, then one can regard the refutation attempt as a failure on the grounds of consistency in different settings. For example, blood lead that is measured with atomic absorption or anodic stripping is frequently checked against radioisotope dilution and mass spectrometry (Rabinowitz et al., 1991a). To simulate the matrix effect of a biomedical sample, e.g., blood, urine, hair, nail, etc., one usually purchases some simulated matrix from the NBS (National Bureau of Standards) and applies the standard addition method to test its stability. To assure the stability of the instrument, one generally inserts a QC sample of known concentration for every 10 unknown samples. After the QA/QC system is implemented steadily both inside and outside the laboratory, one may participate in a national or international QC program, which regularly sends out QC samples every 1-3 months to check the accuracy of measurement of participating laboratories. My laboratory has participated in the CDC (Centers for Disease Control and Prevention) lead proficiency-testing program since 1985. Every month, we test three samples from the CDC, and one month later, receive the target value and summary statistics of more than 150 participating laboratories throughout the world (Department of Health and Human Services, 1986-2000). The target value is the average of the top 10 most credible laboratories selected by the CDC. If there is any large discrepancy between the target value and our measurement, then my colleagues and I critically evaluate every laboratory step to determine the possible errors. This procedure also involves conjectures and refutations. For example, to make sure that the polyethylene (PE) tubes, purchased from a German company, were devoid of any lead contamination, we used the same bulk of ultra-pure nitric acid to wash 10-20 such tubes in order to determine the lead content inside. The final nitric acid contained no detectable lead. Thus, we concluded that the lead content inside the PE tubes was at least 10 times below our detection limit. The above example shows how the method of conjecture and refutation can also be applied to improve the accuracy of measurement.
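The monthly comparison with the external target value can be viewed as a simple control rule; a minimal sketch (Python; the ±10% tolerance is an assumed figure for illustration, not the CDC program's actual criterion):

def qc_check(measured, target, rel_tol=0.10):
    """Return (within_tolerance, relative_deviation) for a QC sample
    compared against the external target value."""
    deviation = abs(measured - target) / target
    return deviation <= rel_tol, deviation

# e.g., a blood lead QC sample: target 25.0 ug/dl, measured 27.1 ug/dl
ok, dev = qc_check(27.1, 25.0)   # ok is True, dev is about 0.084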
5.7.2 Conceptualization of exposure dose and its measurement in occupational and environmental medicine

The exposure dose of an environmental toxic substance was once thought to be simply a cumulative amount of the particular substance, expressed as follows:

    Dose = ∫ C(t) dt,

where C(t) = air concentration of the substance as a function of time t. This formula only sums up the air concentration across the exposure duration and thus represents the cumulative external dose. In toxicology, however, one is more concerned about the dose inside the body, i.e., in the target organ where the health effect develops. To interpret the above formula as the target organ dose, one must invoke a series of auxiliary hypotheses, including the assumption that the amount in the occupational or ambient environment closely represents the internal dose, and that the internal dose closely represents the real target organ dose, etc. When invoking auxiliary hypotheses A1, A2, ..., An to make indirect measurements of target dose, one should assess the validity of such assumptions. If any one of these auxiliary hypotheses turns out to be false, the conclusion drawn from this formula may be completely erroneous. Therefore, when one considers the exposure dose of an agent, one must also pay attention to the serial events involved in absorption, distribution and excretion of the agent, as well as the pathological physiology or mechanism of the effect (or response). By using a PB-PK (physiologically-based pharmacokinetic) model, which carefully considers how a toxic substance is absorbed, distributed and excreted in the human body, one can address the problem of estimating target organ dose. Therefore, in an environmental exposure dose assessment, I recommend utilization of the PB-PK model and the pathophysiology of the particular pollutant. Based on these considerations, one can select the most appropriate sampling
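In practice, C(t) is known only at discrete sampling times, so the integral is approximated numerically; a minimal sketch (Python, trapezoidal rule, with invented figures). Note that this yields only the cumulative external dose; estimating the target organ dose still requires the auxiliary hypotheses and PB-PK considerations discussed above:

def cumulative_external_dose(times, concs):
    """Approximate Dose = integral of C(t) dt by the trapezoidal rule.
    times: sampling times (h); concs: air concentrations (mg/m3)."""
    dose = 0.0
    for i in range(1, len(times)):
        dose += 0.5 * (concs[i] + concs[i - 1]) * (times[i] - times[i - 1])
    return dose  # unit here: mg.h/m3

# Four air samples across an 8-hour shift (invented figures):
print(cumulative_external_dose([0, 2, 5, 8], [0.3, 0.5, 0.4, 0.2]))  # ~3.05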
time and/or the most feasible way to estimate the target organ dose and health effect(s). For example, in examining the exposure dose of asbestos, one must examine the lungs, the target organ for asbestos. Assuming that the cumulative dose of residual asbestos inside the lung is the target value, one may perform a transbronchial or open lung biopsy to obtain a representative piece of lung tissue and measure the asbestos directly. Under these conditions, one invokes the auxiliary hypothesis that the small piece obtained represents the whole lung. Since biopsy involves an invasive procedure and is less acceptable among workers, one may try bronchoalveolar lavage and count the asbestos fibers inside. However, since these fibers are easily removed and may only represent recent exposure, one may, alternatively, count fibers from the sputum. Yet this requires the assumption that the sputum coughed out by an asbestos worker represents the cumulative dose. In all of these situations, one must always be careful about the different auxiliary hypotheses invoked in the measurement process. In the example of measuring lead exposure dose, one must decide what target organ to measure, as lead toxicity involves the hematological, neurological, nephrological, gastrointestinal and immunological systems. Should one measure the lead concentration in each different organ system? Or should one perform an overall surrogate measurement? It may not be feasible to test each and every organ. Using the PB-PK model, people generally agree that blood lead, representing the exposure dose of the above different organs in the past month, is the most acceptable form of measurement (Lauwerys, 1983). Although it does not represent the total body burden (TBB), one can still estimate the cumulative amount by multiplying the concentration by the duration of exposure. The TBB of lead in adults may be obtained by measuring the long bones with X-ray fluorescence, but due to concerns about measurement sensitivity and exposure to ionizing radiation, this method has not yet been widely accepted. The dentin of deciduous teeth, from children of elementary school age, is an excellent indicator of the TBB of lead. Investigators frequently use deciduous teeth to study the association of lead exposure and IQ (intelligence quotient) impairment among children
(Rabinowitz et al., 1991a, 1991b, 1992). However, this method can only be applied to children around ages 4-9, when the deciduous teeth are being shed. Pubic hair, which usually does not shed after puberty, may also be a potential candidate for measuring cumulative exposures to different heavy metals. But again, PB-PK models and the pathophysiology of the pollutant should be considered first when evaluating a specific metal.

5.7.3 Validity and reliability of socio-behavioral measurement

Measurement has been one of the central issues in the development of the social and behavioral sciences because of the inherent difficulty in assessing what is inside the human mind. Thanks to the efforts of many distinguished scientists, such as Campbell or Stevens (1946), a theoretical foundation was established, based on representationism and operationalism. Alternatively, Luce and his associates (1964) have developed the conjoint measurement theory, which has the potential to initiate a new paradigm for socio-behavioral measurement (Michell, 1990). Since public health deals with the modification of human health behavior, some discussion and understanding of the commonly used terms in socio-behavioral science is beneficial for an interdisciplinary approach and cooperation. The measurement of characteristics of human behavior is relatively difficult in at least two respects. First, the concept or theoretical construct is quite abstract, making verbalization and representation by a set of standard operating procedures difficult. On many occasions, the existence of such a construct may still be in doubt or under dispute. Thus, the validity of a construct is always subject to empirical falsification through measurement. Second, the subjective human mind generally changes over time, place and setting, as it continuously interacts with the outside world. Therefore, the problem of reliability arises when the test is administered at a later time, by a different interviewer or in a different environment. To tackle these problems, social and behavioral scientists have developed several different types of validity and reliability indices for falsifying theoretical constructs and measurement procedures, as shown in Table 5.3. Each index is
usually applied on a different occasion during the performance of measurement. Moreover, these scientists have tried to quantify validity and reliability by statistical summarization of measurement errors and of the proportions attributable to different systematic or random errors. To conduct a measurement, it is better to first conceptualize the problem through intuitive operational definitions and then to consult experts in the respective fields or disciplines to select suitable validity and reliability indices for evaluation of the measurement. For example, the WHOQOL questionnaire was developed with the contribution of many experts from 15 different countries. It has also stood firm after repeated tests in more than 20 different centers, where factor analysis was conducted to see if the construct validity is preserved (WHOQOL group, 1996; WHOQOL-Taiwan group, 2000). Of course, the evaluation of measurement accuracy is always limited by resource availability and technical feasibility. Thus, researchers have to make appropriate choices for each study and conduct at least one kind of validity and one kind of reliability evaluation for their measurements.

5.8 How to perform accurate measurement by questionnaire

Limitations of questionnaire information
In the public health field, one frequently performs measurement or obtains information by questionnaire. Such information is collected through either self-report or interview. The former is usually administered through direct distribution or mail for the subjects to fill out and return. The latter is conducted by telephone or in person. In either mode, one is actually asking a subject to provide information for us. Therefore, such a measurement is always limited by the subject's cooperation and ability to provide the information. If the subject cannot read or write, then he/she is obviously not suitable for answering any self-report questionnaire. A subject's ability to provide accurate information may be enhanced to some extent if one provides some standard tools for comparison. For example, to determine the amount of alcohol intake, one may provide photographs to help subjects identify their alcohol consumption more
accurately. To measure nutritional intake, one may try to help subjects recall food consumption in the past 24 hours by providing pictures or plastic models of all possible items for comparison. However, subjects generally cannot provide any information that they do not know or are unable to measure. For example, subjects are usually unable to provide information on their spouses' occupational exposure because, in general, they have no access to such information (Chang and Wang, 1988). Even if some had the opportunity to visit their spouses' workplaces, few went with the intention of determining occupational exposures. Thus, one should not expect to obtain any information beyond the subject's ability to measure, or anything too sensitive for the subject to release (such as personal sexual history or income). In addition, one should try to explain to the subject that providing a true answer will be most beneficial to himself/herself and to society as well. Hence, a short but sincere introduction at the beginning of the questionnaire is often helpful in winning a subject's cooperation. Besides these main problems, one should also pay attention to the other determinants of questionnaire measurement, which are listed in Figure 5.4 and discussed in the following paragraphs.
Table 5.3
Simplified explanations of common indices of validity and reliability in the behavioral sciences and comments from a refutational perspective.
A. Validity

I. Criterion-related validity
   Simplified explanation: Determine how the measurement is related to the criterion defined by a tool.
   Refutational perspective: Because the measurement tool is an indirect measurement of the criterion, it is a test of both the measurement tool and the theoretical concept.

   a. Predictive validity
      Simplified explanation: Determine how accurately the measurement tool can predict the socio-behavioral characteristic.
      Refutational perspective: The accuracy of the prediction is a refutation attempt of the measurement.

   b. Concurrent validity
      Simplified explanation: Simultaneously measure a concept with two or more different tools, and compare the results to see if they converge.
      Refutational perspective: If the concept and tools are all highly valid, then the results from different tools should be highly consistent or correlated with each other.

II. Content validity
   Simplified explanation: Judge from the contents and logic of the questionnaire whether it actually represents the characteristic. Usually a group of experts is recruited to do this job.
   Refutational perspective: After consensus falsification by experts, the tool should be modified to a higher degree of validity.

III. Construct validity
   Simplified explanation: A construct is usually an abstract concept based on some socio-behavioral theoretical model, which should be empirically tested by measurement to see if it exists.
   Refutational perspective: The construct is tested empirically by measurement and statistical modeling to see if it exists. Usually factor analysis, path analysis, correlation analysis, etc., are employed.

B. Reliability

I. Test-retest reliability
   Simplified explanation: Repeatedly administer the same test to see how consistent the results are.
   Refutational perspective: It is an empirical test of how consistent the measurement is.

II. Alternate form reliability
   Simplified explanation: The same characteristic is represented by two or more equivalent or alternative forms of a test. Results from these forms are compared and summarized into this index to demonstrate their equivalence.
   Refutational perspective: It is a falsification attempt to determine whether two or more different forms of a test are measuring the same characteristic.

III. Internal consistency

   a. Split-half reliability
      Simplified explanation: Randomly split the items in a questionnaire into 2 halves, and evaluate the relatedness of the two parts.
      Refutational perspective: It is a refutation attempt examining the homogeneity or consistency of the items in a questionnaire in measuring the same characteristic.

   b. Kuder-Richardson reliability
      Simplified explanation: Parallel tests are administered to test the internal consistency of the measurement tools (especially for binary variables).
      Refutational perspective: Same as above.

   c. Cronbach coefficient alpha
      Simplified explanation: An indicator calculated to represent the internal consistency of a questionnaire.
      Refutational perspective: Same as above.

IV. Intra-rater or intra-observer reliability
   Simplified explanation: An indicator calculated to represent the consistency of a rater's observations.
   Refutational perspective: It is a refutation attempt on the consistency of personal judgment. In the natural sciences, the investigator simply repeats the study once more to see if he/she gets the same results.

V. Inter-rater or inter-observer reliability
   Simplified explanation: An indicator calculated to represent the consistency of judgment among different raters.
   Refutational perspective: It is a refutation attempt to determine whether measurements performed by different observers will be consistent. It is similarly used by natural scientists who attempt to reproduce a measurement.
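As a concrete instance of an internal consistency index, Cronbach's coefficient alpha can be computed directly from item scores; a minimal sketch (Python) of the usual formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores):

def cronbach_alpha(items):
    """items: one list of scores per item, all over the same subjects."""
    k, n = len(items), len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # total score per subject, summed across the k items
    totals = [sum(item[j] for item in items) for j in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))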
5.8.1 Construction of a questionnaire

In a self-report questionnaire, one should avoid using any difficult or
unusual words, which subjects, especially those with less education, may easily misconstrue or fail to understand. Rather, always try to use clear, commonly used words and simple, straightforward sentences. Otherwise, different subjects will interpret the questions in different ways, and one may end up with unreliable information and not even know what one has actually measured. In practice, one may perform several pretests to detect any ambiguous words and correct them before actual application in the field. For example, in my study of skin cancer among paraquat manufacturing workers (Wang et al., 1987), we pretested and modified the questionnaires four times. One should also include all questions regarding the outcome of interest and all determinants of the outcome. These questions should be closed-ended, so that responses are mutually exclusive and totally inclusive. In other words, every possible response has one pre-designated code, and no response overlaps with any other. For a question with too many possible responses (say, more than 10), one can reclassify them into a smaller number of items and lump miscellaneous minor ones together as "others." When asking Asians a preference question, I recommend listing an even number of options for the subjects to select. Because most Asians are reluctant to show extreme subjective feelings, they frequently select the middle option when answering a preference question with an odd number of options. Moreover, dummy questions, inquiring about symptoms unrelated to the exposure, can be included to provide a baseline rate of positive response and a test of the validity of the questionnaire. Then, all questions may be randomized and rearranged to avoid any possible differential guidance or ordering effect on the respondents. Finally, one should reexamine all questions to be included from the perspective of data analysis. If there is any doubt that the information from a particular question or item will be analyzed after collection, then delete it. People are usually more reluctant to answer a questionnaire that appears to be too long or is expected to take more than 10-15 minutes to answer.

5.8.2 Interview procedures

The interview procedure should always be standardized, including the
greeting of subjects, the words used in the questionnaire, the words used in response to questions raised by subjects, the facial expressions and posture of the interviewer, the selection of an environment for the interview, and the time limit for the interview, etc. Before any field trial, every interviewer should review the instructions for administering the questionnaire and practice asking the questions repeatedly, 30-50 times. All interviewers should be trained, and standardized procedures should be developed after several trials. Moreover, detailed information on the interview procedure should be documented in a handbook for future reference. Standardization is particularly important if the interviewer is asked to measure the subjects' personal preference responses. During the standardization process, videotape recording can be used to preserve the actual interview scene for future recall. Should any question be raised during later comparison of results, the tape can be replayed to help clarify and resolve the problem. Otherwise, there may be an unnecessary "interviewer" effect, and potential confounding may follow. For example, in one study, we randomly sampled 200 out of 5000 pregnant women who had been interviewed by questionnaire, to determine the consistency of the history obtained from spouses (Chang and Wang, 1988). The agreement between spouses on educational level was up to 91%, while the numbers for pregnancy and artificial and spontaneous abortions were also up to 90%. However, there was only 45% agreement between spouses on total household income. Using the husband's answer as a gold standard, 87% of wives gave an accurate answer on what cigarette brand their husbands smoked. However, less than 40% of wives could provide information on their spouses' potential occupational hazards. Thus, to obtain accurate results from questionnaires, one must design a simple and clear questionnaire, win full cooperation from the subjects, and standardize the interview procedures. Even so, the questionnaire can only obtain information limited to what a subject can measure. The major determinants of accurate measurement by questionnaire are summarized in Figure 5.4.
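Raw percent agreement of the kind reported above is computed directly from the paired answers; a minimal sketch (Python; the five couples' answers are invented):

def percent_agreement(pairs):
    """pairs: list of (wife_answer, husband_answer) tuples."""
    return sum(a == b for a, b in pairs) / len(pairs)

# e.g., cigarette-brand answers from 5 hypothetical couples:
pairs = [("A", "A"), ("B", "B"), ("A", "C"), ("B", "B"), ("C", "C")]
print(percent_agreement(pairs))  # 0.8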
[Figure 5.4 appears here: a flow diagram in which demographic data (facts), lifestyle and occupational exposure data (facts: subjects' memory, judgment or measurement), perceived health or quality of life (facts and/or subjects' subjective perception), and objective health data (facts: doctor's, nurse's or instrumental measurement, medical records, subjects' complaints) pass through the interview technique and/or questionnaire design, and are then abstracted, coded and entered into computer data files.]
Figure 5.4 Determinants of measurement accuracy for health-related data obtained by questionnaire administered through self-report or interview. The flow directions of the different kinds of data show the determinants or potential confounders that may lead to errors in the final coded data files.
5.9 Summary
All empirical science involves measurement, and all refutation attempts invoke the auxiliary hypothesis that their measurements are accurate and sensitive enough to detect an effect. Thus, measurement is one of the central issues in scientific research. In most biomedical measurement, one can obtain accurate data simply by defining a widely accepted gold standard for comparison and controlling all determinants during the actual measurement. However, in the socio-behavioral sciences one must first conceptualize the
theoretical construct and find an operational procedure to measure it. The data are then summarized statistically and fed back to test the original construct. Accuracy of measurement can be achieved by minimizing systematic error and random error (high validity and reliability). In clinical medicine, a diagnostic test is usually evaluated by the indices of sensitivity and specificity, which are based on the assumption that cases and normal persons are randomly drawn. To apply such a test on clinical grounds for positive predictive value, one also needs to know the total prevalence of the particular disease. Dose in environmental and occupational exposure should be sampled and measured according to the pathophysiology and PB-PK model of the pollutant for a meaningful interpretation. In the actual measurement of dose at the target organ, one often needs to invoke assumptions that may not be highly corroborated. To obtain accurate information by questionnaire, one must be clear and straightforward and insert some dummy questions, as well as develop standardized interview procedures. Nonetheless, the information collected by questionnaire is always limited by the subjects' cooperation and ability to measure the items in question.
Quiz of Chapter 5

Please write down the probability (%) that each assertion is true. You will be scored according to the credibility of your subjective judgment.

1. An accurate and sensitive measurement is always one of the auxiliary hypotheses that one invokes during a refutation attempt.
2. Since questionnaires and interviews ask subjects to measure something by themselves, investigators are always limited by the subjects' cooperation and ability to measure these items.
3. Throughout all kinds of measurement, one always needs to compare a set of gold standards with the objects one sets out to measure.
4. For ordinal scales, one should try to find or set up a relatively objective standard for comparison, which makes each ordinal scale more easily identifiable.
5. An interval scale has an absolute 0 starting point.
6. A reduction of a ratio-scale variable to an ordinal or nominal variable is prohibited because there will be some loss of information.
7. If a diagnostic test happens to be related to the severity of a patient's disease, then the sensitivity result obtained from a medical center is usually an overestimate if one wants to apply it in a community hospital.
8. In an interview questionnaire, standardization is particularly important to avoid any systematic error if the interviewer is also asked to measure some of the subjects' responses.
9. Measurement is one of the central issues in scientific research.
10. A socio-behavioral measurement is by itself a test of the original theoretical construct. Thus, a successful falsification in an empirical study can be attributed to either error in measurement or the substantive theory.
Answers: (1) T (2) T (3) T (4) T (5) F (6) F (7) T (8) T (9) T (10) T
Chapter 6 Basic Measurements in Epidemiological Research

6.1 Evolving trends in epidemiological measurement
6.2 Basic measurements of outcome in epidemiology
    6.2.1 Outcome measurement: Counting of events and states, rate, proportion and ratio
    6.2.2 Determinants of measurement indices or parameters
6.3 Incidence rate, cumulative incidence rate, risk and their determinants
    6.3.1 Incidence rate and density
    6.3.2 Cumulative incidence rate (CIR) and risk
    6.3.3 Determinants of incidence rate (IR)
6.4 Prevalence or prevalence rate
6.5 Measurement of effect: rate difference, rate ratio, etiologic fraction and expected number of prevented cases
    6.5.1 Rate difference and rate ratio
    6.5.2 Excess fraction (EF) and etiologic fraction
    6.5.3 Odds ratio
    6.5.4 Expected number of prevented cases
6.6 Measurement of utility of health: Quality-adjusted survival
    6.6.1 Concept of QALY and its potential application
    6.6.2 Life table method for estimating the QAS
    6.6.3 Estimation of HRQL (health-related quality of life)
6.7 Summary
Introduction

A scientific discipline is usually considered a function of its substantive matters. In epidemiology, one is concerned with the occurrence and determinants of diseases and health-related events and states in human populations. Therefore, the science of epidemiology can be expressed as
P = f {D}, where P is the parameter of interest one wants to measure, and D are the determinants of that parameter. In this chapter, different parameters of measurement will be considered, starting with the traditional counting of events and states in populations to obtain rate, proportion and ratio. Measurement of effect, in terms of rate ratio, rate difference and etiologic fraction, will also be discussed. Principles and applications of epidemiological research have recently been extended to assist health policy decision-making. As a result, the measurement of utility of health has also been extended from counting the number of lives saved to measuring health-related quality of life (HRQL) and survival, and combining them with a common unit of quality-adjusted life year (QALY). Let us review some of the developing trends in epidemiological measurement.

6.1 Evolving trends in epidemiological measurement
Since the development of this discipline, early epidemiologists sought to calculate mortality rates, birth rates, etc., from demographic data obtained from death and birth certificates, regularly collected by governmental agencies. Since these rates were known to change with different places, times, and population characteristics, such as age and sex, epidemiologists attempted to propose hypotheses which could interpret and predict such trends. Later, they found that such counting could also be performed for different health events (or diseases) to obtain morbidity rates, such as incidence rates for acute diseases: food poisoning, measles, polio, etc.; and chronic diseases: cancer, diabetes mellitus, stroke, etc. Moreover, if one clearly specifies the observation period of patients (for comparison), then cure rate, recurrence rate, case fatality rate, etc. can also be calculated. MacMahon and Pugh (1970) summarized epidemiological measurements into three types: rate, proportional rate, and ratio; and pointed out the importance of considering determinants, such as time, place, person, and other differences. In the beginning, counting the occurrence of a health event (e.g., cholera) was considered similar to counting the frequency of positive responses in Bernoulli trials. There was no distinction between relative risk and rate ratio,
as they were both regarded as a ratio of two proportions. More will be said on their differences later in this chapter. Neither was there differentiation between density sampling and cumulative incidence sampling for case-control studies (Cornfield, 1951). This will be further explained in Chapter 11. Finally, the time factor was dealt with by only specifying the presumed duration. As epidemiologists increasingly began to deal with chronic diseases, the concept of time as a denominator for incidence rate became more and more obvious (Elandt-Johnson, 1975). Because events (diseases) occur with the passage of time, some members of a cohort would be lost to follow-up or censored if observation time was prolonged. Statisticians, thus, could analyze incidence rate by regarding it as the hazard rate of a survival function (Breslow and Day, 1980). The concept of density sampling in case-control studies developed by Miettinen (1976) further clarified that such a design requires neither the assumption of rare disease nor a constant exposure proportion (Greenland and Thomas, 1982). Again, case-control studies will be discussed in further detail in Chapter 11. With its primary goal of prevention, epidemiological research, specifically etiologic studies, has been applied to health policy decision-making ever since the birth of epidemiology. Simple calculation of mortality and/or morbidity rates and the expected number of prevented cases may have been sufficient in the past. However, with the discovery of more and more etiologic agents, many of which are multi-factorial and cause both morbidity and mortality, it has become ever more difficult to decide which action or policy will provide the greatest amount of utility of health and which should be taken first. Thus, combining the consideration of quality of life and length of survival, which is accomplished by the common unit of the QALY (quality-adjusted life year) (Weinstein and Stason, 1977; Beauchamp and Childress, 1994; Gold et al., 1996), can help in this decision-making process. However, there are many different health states for each disease to be conceptualized in assessing quality of life, and survival status can change with the passage of time. Therefore, developing a calculation of QALY by quality-adjusted survival (QAS) is crucial for a more accurate
quantification of utility of health (Hwang et al., 1996, 1999).

6.2 Basic measurements of outcome in epidemiology
6.2.1 Outcome measurement: Counting of events and states, rate, proportion, and ratio

Distinction between events and states

The epidemiological measurements of outcome, i.e., diseases or health-related objects, can generally be classified into 2 types: events and states. Occurrence of an event always involves a change of state with the passage of time. A state or status denotes a characteristic or feature that is present at a certain time point or period. For example, a person may develop the common cold today. This means that he (or she) did not have the symptoms of a common cold yesterday. This change in condition is regarded as the event of a common cold, while displaying or not displaying symptoms of a cold on a particular day is called one's state of health. Similarly, if a patient's tuberculosis (TB) is cured, then the event of cure has occurred; namely, one's health state has changed from affliction with TB to recovery.

Ratio scales: Proportion and rate

Time, health-related state, and event are the elementary measurements in epidemiological studies. As pointed out in Chapter 5, the simple counting of such elements possesses the characteristic of extensiveness of measurement, i.e., a o a = 2a. For example, 20 students who wear eyeglasses are equal to 10 times 2 students with glasses. One can create a proportion by dividing one count by another count of the same category. If the duration of time is also considered in the denominator, then such a proportion becomes a rate. Both proportion and rate are ratio scales. For example, 20 out of 40 college students wear eyeglasses, and 10 out of 40 high school students are of the same state. Then, the proportion of eyeglass-wearing in college is 2 times that in high school. Take another example: If within the last 3 months, 20 out of 100 students in public health develop a common cold, then the incidence
rate, or rate of catching a cold, is 0.2 during this period. If 20 out of 50 medical students develop a common cold in the same period, then the incidence rate is 0.4, which is 2 times that of the students majoring in public health.

Time in the denominator

Initially, there was no need to put time into the denominator of a rate; specification of the duration of observation was sufficient. However, as epidemiologists came across more and more events of a chronic nature, it became more likely that two rates might be obtained from different durations of observation. If two rates have different time units in their denominators, they cannot be directly compared. Moreover, if a rate is unspecified, one generally takes the average rate during the time period involved. When calculating an incidence rate, one must also take care to differentiate an event from a state in the numerator, because only the development of an event involves the passage of time. However, there may be some instances where the division of two counts of the same state (i.e., a proportion) is also called a "rate" because the collection of the data involves a period of time in the denominator (Miettinen, 1985a). When one divides the count of a particular event or state by another count of a different event or state, the result is called a ratio. Usually the events involved are related in some manner, so that the ratio forms a meaningful index. For example, the sex ratio of a graduate class is 10/10 = 1, which may be compared with that of another class with a different sex ratio, say 10/5 = 2, etc.

6.2.2 Determinants of measurement indices or parameters

Demand for an index with fewer determinants

A determinant of an index (or parameter) is a factor that is influential or predictive of change in the index. To examine a rate, proportion, or ratio, one needs to consider its determinants, which include the accuracy of measurement of both numerator and denominator, and the determinants of the related events or states. One prefers to explore an index with a smaller number of determinants because of the ease of testing its determinants and etiologic agents individually. The common indices are summarized in Table 6.1.
Table 6.1
Common indices or measurements involving health-related event, state, and time.
Proportion = (No. of a specific state or event) / (No. of all possible states or events)
    e.g., the proportion of graduate students wearing eyeglasses among the entire class: 10/20 = 0.5;
    the proportion of gastrointestinal disease among all diseases contracted in the past year: 4/16 = 0.25

Ratio = (No. of a specific state or event) / (No. of another specific state or event)
    e.g., the sex ratio in a graduate class: 10 (female) / 10 (male) = 1

Rate = (No. of a specific event) / (Total amount of observed person-time), unit = time⁻¹
    e.g., the incidence rate of gastrointestinal (GI) disease during the past year: 4 (GI disease) / 20 (person-years observed) = 0.2 year⁻¹

Duration = (Total amount of person-time) / (No. of a specific event)
    e.g., the inverse of rate, or the average waiting time to develop GI disease: 20/4 = 5 years
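To make these definitions concrete, here is a minimal sketch in Python (the function names are mine, and the numbers simply mirror the examples in Table 6.1):

    # Sketch of the four indices in Table 6.1, using the table's numbers.
    def proportion(n_specific, n_all):
        # count of a specific state/event over all possible ones
        return n_specific / n_all

    def ratio(n_a, n_b):
        # count of one state/event over the count of another
        return n_a / n_b

    def rate(n_events, person_time):
        # new events per unit of person-time; unit is time**-1
        return n_events / person_time

    print(proportion(10, 20))  # 0.5, eyeglass wearers in the class
    print(proportion(4, 16))   # 0.25, GI disease among all diseases
    print(ratio(10, 10))       # 1.0, sex ratio
    print(rate(4, 20))         # 0.2 per year, GI incidence rate
    print(1 / rate(4, 20))     # 5.0 years, average waiting time (duration)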
For example, the etiologic agents of an age-, sex-, and cause-specific mortality rate are more easily determined than those of an overall crude death rate, because for the latter one must also falsify each possible etiologic role of age, sex, and the different causes of death.

Determinants of rate vs. proportion

Proportion is generally more complicated than incidence rate because the denominator of the former is usually composed of different events or states, each with its own determinants. Let us examine a hypothetical example of the proportional mortality rate of motorcycle injuries, a proportion, as shown in Table 6.2.
Table 6.2
Frequencies of different traffic injuries in May 1996 in city X.

Types of traffic injuries    No. of deaths    Proportion (%)
Motorcycle                   15               56
Car                          8                30
Bicycle                      2                7
Pedestrian                   2                7
Total                        27               100
Can one conclude that motorcycles are the most dangerous? Since one has no information on the population-time at-risk for the different types of traffic injuries, one can only conclude that motorcycle deaths account for the highest proportion among them. Some may suggest that the number of each type of registered vehicle in city X is a good approximation of the number of people using each one, and might help judge the danger of using each vehicle. However, many people who own bicycles and/or motorcycles may not use them frequently for commuting, and the number of people riding on each vehicle may also differ. Thus, using the number of registered vehicles as an approximation of the total observed person-time in the denominator is likely to lead to an erroneous conclusion. Let us take another example. If the outpatient clinic of hospital Y has the summary statistics shown in Table 6.3, can one conclude that the most common disease is gastrointestinal (GI) disease? If the main purpose is administrative, e.g., deciding whether the hospital needs to increase the number of specialists in the GI clinic to match market demand, then the answer is straightforward. However, if one wants to interpret it as a higher occurrence rate among local people, then the evidence does not seem adequate. There are still other determinants that may explain the higher proportion of GI disease found in hospital Y. For example, the GI specialists in hospital Y may be more outstanding and famous, and thus attract more people with GI disease from city X to this hospital. Another alternative explanation may be that patients with non-GI diseases visit other hospitals more often than hospital Y, which produces the effect that the proportion of visits with GI disease at hospital Y
increases. In other words, proportions are less suitable for refutation because using the total number of diseases as the denominator invokes the auxiliary assumption that the occurrence of every kind of non-GI disease remains constant. Therefore, one needs to find the correct denominator, the population-time at-risk, in order to figure out which explanation is more plausible. Thus, the determinants of an incidence rate are generally simpler than those of a proportion.

Table 6.3
Frequencies of patients who visit hospital Y stratified by main categories of disease on a particular day.
Main types of disease        No. of patients    Proportion (%)
Chest disease                150                15
Gastrointestinal disease     250                25
Hypertension                 150                15
Heart disease                100                10
Kidney disease               100                10
Endocrinologic problems      50                 5
Others                       200                20
Total                        1000               100
Determinants of crude vs. specific rates

Let us examine some rates and proportions used in medicine and public health by first looking at the mortality rates listed in Table 6.4:
Table 6.4
Mortality rates with total population no. in one year as the denominator.
(a) Crude death rate = (Total no. of deaths in one year) / (Total mid-year population × 1 year)

(b) Crude birth rate = (Total no. of live newborns in one year) / (Total mid-year population × 1 year)

(c) Age-specific mortality rate = (Total no. of deaths in a specific age stratum in one year) / (Total mid-year population × 1 year)

(d) Cause-specific mortality rate = (Total no. of deaths due to a specific cause in one year) / (Total mid-year population × 1 year)

(e) Sex-, age-, cause-specific mortality rate = (Total no. of deaths of a specific sex and age stratum due to a specific cause in one year) / (Total mid-year population × 1 year)
Consider four different rates, (a), (c), (d), and (e) in Table 6.4, and judge which one has fewer determinants, making it more suitable for studying etiologic agents. Of course, (e) sex-, age-, cause-specific mortality rate is the most suitable because it has the smallest number of determinants, where death rate is clearly specified for each and every cause, age strata, and gender. On the other hand, (a) crude death rate lacks this specificity, and thus, a change in this rate can arise from any number of determinants singly or in combination. Falsification of the numerous possibilities becomes an overwhelming task. Similarly, an age-specific mortality rate has taken care of the age factor but not of different genders and causes; a cause-specific mortality rate has focused on a specified cause of death, but has left gender and age strata ambiguous. Thus, the more specific a rate, the fewer determinants one must analyze for its falsification. Take another example of the measurement of indices involving newborns, in Table 6.5:
Table 6.5
Rates or ratios with no. of live births in one year as the denominator.
(a) Infant mortality rate = (Total no. of infants who died within one year after birth, in one year) / (No. of live births in one year × 1 year)

(b) Neonatal mortality rate = (Total no. of infants who died within 4 weeks after birth, in one year) / (No. of live births in one year × 1 year)

(c) Fetal mortality ratio = (Total no. of fetal deaths (pregnancy > 20 weeks) in one year) / (No. of live births in one year)
Strictly speaking, the number of live births is the correct denominator for (a) and (b) of Table 6.5; however, it is not the correct denominator for (c). A fetal mortality rate would require counting all pregnancies longer than 20 weeks during the year, which may not be feasible. Thus, people may simply take the number of live births as a surrogate and interpret this fetal mortality ratio as the odds between fetal death and live birth.

Rate with a specified time unit

With such examples of rates as shown in Tables 6.4 and 6.5, one may understand why many earlier books on epidemiology left the time unit of one year out of the denominator and simply assumed an observation period of one year. However, if an event such as the enforcement of a helmet law induces significant changes within several weeks or months, one may want to look at the change of such rates over a shorter period for comparison. Since two rates with different observation periods
cannot be compared, the use of a specified time unit in the denominator is a fundamental solution. In clinical medicine, people sometimes still use the case fatality rate, remission rate, recurrence rate, etc., which also indicate occurrences of specific events but lack a time unit. In fact, such rates usually imply certain observation periods for the occurrence of each event, and people simply assume that those who use them know the duration of observation for each specific index. However, for long-term observation, such as the follow-up of cancer, one generally specifies the duration of follow-up, e.g., a 3-year or 5-year survival rate. Consequently, I recommend that clinical medicine also use a time unit to allow fair comparisons. Similarly, the same concept can be applied to health policy studies, as well as to other health-related events and rates in a population. For example, the rates shown in Table 6.6 are often used to evaluate the effectiveness of an emergency medical service system.

Table 6.6
Indices of effectiveness of an emergency medical service system.
(a) Mortality rate before arrival at the hospital = (Total no. of deaths occurring before arrival at the hospital during a certain time period) / (Total no. of emergency runs by ambulances during this period)

(b) Mortality rate during transportation = (Total no. of deaths occurring on ambulances during a certain time period) / (Total no. of emergency runs by ambulances during this period)

(c) Mortality rate on the scene = (Total no. of deaths occurring before arrival of the ambulance during a certain time period) / (Total no. of emergency runs by ambulances during this period)
Again, one should try to incorporate a time unit, such as month⁻¹, in the denominator for comparison of such rates in the health services. Since the determinants of (a) in Table 6.6 include all the determinants of (b) and (c), one ought to obtain information on (b) and (c) for a more detailed evaluation of an emergency medical service system.

6.3 Incidence rate, cumulative incidence rate, risk and their determinants
6.3.1 Incidence rate or density

Incidence rate is a fundamental measurement in epidemiology. It is defined as the number of new occurrences of a specified event in a population during a period of time. The population should only include candidates who may develop the event, namely the population-at-risk. For example, the population-at-risk and incidence rate of cervical cancer should not include any males. Thus, the denominator of an incidence rate is the total amount of observed population-time at-risk. It is written as follows:

Incidence rate = (Number of new cases of a specific event) / (Total amount of population-time at-risk)

In a dynamic population, where some turnover among people may occur, one simply assumes that the population is in a steady state. In other words, the rate of new people joining the population per unit of time is the same as the rate of those leaving. Thus, one may obtain the incidence rate very easily. For example, if 50 new cases of an event occur in a stable candidate population of 100,000 during a year, then the average incidence rate is 50/100,000 = 5 × 10⁻⁴ year⁻¹. A cohort population is usually defined as a group of people followed throughout a certain period of time without replenishment. The total person-time is accumulated for each member until he or she develops the event, dies, or is censored due to loss to follow-up or cessation of observation.
Let us look at the example of following 5 persons for 7 years, as shown in Figure 6.1.

[Figure 6.1 Calculation of incidence rate from following 5 persons for 7 years. X-axis: follow-up time (years). Legend: ● indicates that follow-up stopped; X indicates occurrence of the event; O indicates loss to follow-up or death.]
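To make the person-time bookkeeping of Figure 6.1 explicit, the following sketch (a minimal Python encoding of the five follow-up histories; the data layout is mine) accumulates the denominator and counts the events:

    # (years of observed person-time, whether the event occurred)
    follow_up = [
        (7.0, False),  # no. 1: followed to the end of the 7 years
        (2.0, False),  # no. 2: lost to follow-up at year 2
        (4.5, True),   # no. 3: event at year 4.5; contributes no time after
        (3.0, True),   # no. 4: event at year 3
        (6.0, False),  # no. 5: loss to follow-up or death at year 6
    ]
    person_years = sum(t for t, _ in follow_up)               # 22.5
    events = sum(1 for _, occurred in follow_up if occurred)  # 2
    print(events / person_years)  # 0.0889..., i.e., 8.9 x 10^-2 per year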
In Figure 6.1, there are 2 persons (cases 3 and 4) who develop the health event of interest during the follow-up person-time. The denominator is accumulated as follows: 7 (no. 1) + 2 (no. 2) + 4.5 (no. 3) + 3 (no. 4) + 6 (no. 5) = 22.5 person-years. The contribution of person no. 3 is 4.5 person-years because he/she no longer contributes any person-time at-risk after the occurrence of the health event. Thus, the average incidence rate is 2/22.5 year = 8.9 × 10⁻² year⁻¹. Such a calculation is similar to the life table method. When the population is large, which is usually the case, the incidence rate can be viewed as the proportion of people developing the event among the population followed during a short period of time, as shown in Figure 6.2 (Rothman, 1986). Incidence rate is also called "hazard rate" or "failure rate" by statisticians (e.g., Cox, 1972; Kalbfleisch and Prentice, 1980; Lee, 1992), because it is similar to the occurrence of a specific hazard in a certain proportion of a population (human or non-human objects). It has also been known as the "force of morbidity" (MacMahon and Pugh, 1970), because it indicates a tendency within a specified time period for a certain proportion of the population to succumb to the disease. Miettinen (1976) gave it another name, "incidence density," because it is similar to measuring the density of an event occurring in a population followed across time.

To illustrate why incidence rate should be given a time unit, let us examine the following example. Suppose that one has followed 200 cases with myocardial infarction for 1.5 years and obtained the following data: 10 cases died within one week; the next 15 died before one month; an additional 20 cases died within 6 months; and another 25 cases died by the end of the follow-up period. Assuming that there are no losses to follow-up or censorship (withdrawal of living cases), one might calculate the mortality rates of the different periods:

1st week:        10/200 = .05 (per week or week⁻¹)
8-30 days:       15/(23 × (200 − 10)) = .0034 (per day or day⁻¹)
2nd-6th months:  20/(5 × (190 − 15)) = .0229 (per month or month⁻¹)
0.5-1.5 years:   25/(175 − 20) = .1613 (per year or year⁻¹)

Since these numbers were observed under different durations of observation time or units, they cannot be compared. Furthermore, one cannot determine when the highest mortality rate occurred. However, if one uses the same unit, e.g., per week or week⁻¹, then one can obtain a new set of figures:

1st week:        10/[(200)(1 wk)] = .05 week⁻¹ or wk⁻¹
8-30 days:       15/[(190)(3.3 wk)] = .0239 wk⁻¹
2nd-6th months:  20/[(175)(21 wk)] = .0054 wk⁻¹
0.5-1.5 years:   25/[(155)(52 wk)] = .0031 wk⁻¹

After transforming all mortality rates to the same unit, one can then compare and state that the 1st week was the most dangerous, with the highest mortality rate.
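A short sketch of this unit conversion follows (the interval lengths in weeks and the survivor counts are taken from the example above; the person-weeks approximation uses the number of survivors entering each interval):

    # (period, survivors entering the interval, deaths, interval length in weeks)
    periods = [
        ("1st week",      200, 10,  1.0),
        ("8-30 days",     190, 15,  3.3),
        ("2nd-6th month", 175, 20, 21.0),
        ("0.5-1.5 years", 155, 25, 52.0),
    ]
    for name, at_risk, deaths, weeks in periods:
        # person-weeks ~ (number entering) x (interval length in weeks)
        print(name, round(deaths / (at_risk * weeks), 4), "per week")
    # The 1st week shows the highest rate, 0.05 per week.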
[Figure 6.2 A cohort population N(t) followed through time. Within a time interval Δt, a proportion ΔN of the population develops the event. The incidence rate IR = ΔN/[N(t)(Δt)]; when Δt → 0, one obtains the instantaneous incidence rate (modified from Rothman, 1986). Y-axis: population at risk N(t); X-axis: time.]
6.3.2 Cumulative incidence rate (CIR) and risk

While the incidence rate represents the tendency or proportion of a population who will develop the health event, it does not delineate the risk of a single individual. If risk is defined as the probability that a particular event occurs during a stated period of time or results from a particular challenge, then risk can be estimated by the CIR, i.e., the proportion affected after the specified time period. Let the incidence rate (IR) at time t be the rate of new cases developing from the population at risk, N(t), during a small period Δt of observation time, as shown in Figure 6.2.
By definition, over a small interval Δt, (IR)(Δt) = −ΔN/N(t), where ΔN is the change in the number at risk.

Taking the integration: −∫₀ᵗ (IR) du = ln N(t) − ln N(0) = ln [N(t)/N(0)]

Taking the antilog: exp(−∫₀ᵗ (IR) du) = N(t)/N(0)

CIRₜ = [N(0) − N(t)]/N(0) = 1 − N(t)/N(0) = 1 − exp(−∫₀ᵗ (IR) du) = 1 − exp(−Σᵢ (IRᵢ)(Δtᵢ))
Alternatively, the risk can be estimated by the life-table method. Let us construct a life table after j years of follow-up. Let R(j) be the risk of contracting the disease by the end of the jth year, and let r(i) be the conditional risk of developing the event during the ith year, which can be estimated by the average incidence rate IR(i) for that year. Then the risk of developing the event by the end of the 1st year is R(1) = 1 − (1 − r(1)) = 1 − (1 − IR(1)). Similarly, for the jth year:

R(j) = 1 − Πᵢ₌₁ʲ (1 − r(i)) = 1 − (1 − r(1))(1 − r(2)) ⋯ (1 − r(j)) = 1 − (1 − IR(1))(1 − IR(2)) ⋯ (1 − IR(j))

So, one must obtain the particular incidence rates during the specified period of time in order to assess the risk. For an individual, one must assume that, if the person has not died of other competing causes, his or her probability of getting the disease is R(t). When R(t) < 0.1, e⁻ˣ ≈ 1 − x, and thus R(t) ≈ Σᵢ (IRᵢ)(Δtᵢ).
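A sketch of this cumulative-incidence calculation from age-specific rates follows (the rates and stratum widths are invented for illustration; only the first rate echoes the stomach cancer example below):

    import math

    # Hypothetical age-specific incidence rates (year**-1) and stratum widths (years)
    strata = [(12.4e-5, 35.0), (30.0e-5, 20.0), (60.0e-5, 14.4)]

    cum_rate = sum(ir * width for ir, width in strata)   # sum of IR_i * dt_i
    cir = 1 - math.exp(-cum_rate)                        # exact exponential form
    print(cir)       # cumulative incidence (risk) over the whole span
    print(cum_rate)  # small-rate approximation; nearly identical here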
For example, the incidence rate of stomach cancer for an American
white male in 1975 was 12.4 × 10⁻⁵ year⁻¹, and his life expectancy at birth was 69.4 years. Thus, his lifetime risk of developing stomach cancer (conditional on his not dying of other diseases) is R₀₋₆₉.₄ = 1 − exp(−(12.4 × 10⁻⁵) × 69.4) ≈ 0.009, which is approximately equal to the simple product (12.4 × 10⁻⁵) × 69.4 = 0.009. In other words, when the incidence rate is very small, the CIR is almost equivalent to the sum of the age-specific rates (IRᵢ) multiplied by the widths of the age strata (tᵢ). This method is also recommended by the IARC (International Agency for Research on Cancer) for calculating cumulative cancer incidence for international comparison (Davis, 1978), as it not only indicates personal risk, but can also be directly compared as long as the cumulative time period remains the same, e.g., 0-74 years of age. When the cumulative time period is greatly prolonged, the CIR approaches 1, because almost everyone at risk eventually develops the event. When the specified time interval is short, t → 0, the CIR approaches (IR)(t). If one wants to apply the risk interpretation to an individual, however, one needs to assume that the person does not die of some other disease during the observation period. This assumption may not hold because of competing causes of death, and the CIR often over-estimates the lifetime risk, as demonstrated by Schouten et al. (1994).

6.3.3 Determinants of incidence rate (IR)

After identifying the calculation method for IR, one should also be familiar with all of its determinants, so that whenever one finds a change in IR, one can immediately think of all the determinants that may contribute to this change. The determinants of IR can largely be classified as follows:

1. Because the numerator of IR is the count of a particular (health) event, it follows that all the determinants of the event are also determinants of IR. For example, if the event is bladder cancer, then all of its determinants,
such as smoking, exposure to aromatic amines, diagnostic criteria, etc., should also be determinants of the IR of bladder cancer. Similarly, if the event is head injury, then all of its determinants, such as the wearing of a helmet or seat belt, diagnostic criteria, etc., should also be determinants of the IR of head injury. In fact, the definition and measurement accuracy of the health event should always be considered as one of the possible determinants.

2. Accurate counting of both the numerator and the denominator, including how the information is obtained and the accuracy of the data, is always a major determinant. For example, if one uses death certificate data as an approximation to estimate the IR of lung cancer, the error may not be too large, because most cases of lung cancer die within 1-3 years. If the same set of data is used to estimate the IRs of skin cancer and glaucoma, then one is bound to make an underestimation, because patients with either disease rarely die of them, and such diagnoses, especially glaucoma, often do not appear on the death certificate. In a cross-sectional survey of occupational diseases, workers with diseases are often hospitalized and absent from their jobs, leading to frequent underestimation of the numerator. For example, in a study of the prevalence of hepatitis among workers exposed to dimethylformamide (Wang et al., 1991), we found that two workers did not come in for their physical examinations because they were hospitalized with hepatitis. In another investigation, we studied lead poisoning among lead battery recycling workers and found a 48% prevalence of lead poisoning. However, we still underestimated the rate, because those who did not come in for physical examination actually worked significantly longer hours and experienced higher exposure (Jang et al., 1994). Similarly, if the reason for loss to follow-up in a cohort study is related to the outcome or exposure of interest, then one should be careful about over- or under-estimation. For example, in a follow-up study of workers exposed to vinyl chloride monomer (Du and Wang, 1998), if workers suffering from ill health were more likely to take early retirement and become lost to follow-up, then there might be an
underestimation of the effect.

3. Induction time is always a determinant of the occurrence of the health event. Induction time is defined as the minimal time period required for an etiologic agent to produce the event. It should be differentiated from the latency period, which denotes the time from exposure to the agent to the time when the disease is detected (Rothman, 1981). The latency period will generally vary with the development of clinical technology for early detection, while the induction time is supposed to stay constant. For example, one will not attribute a patient's brain tumor to radiation exposure that occurred only two months earlier, an inadequate induction period. Similarly, one will never attribute a painless, afebrile diarrhea to food consumed one month before. Therefore, in calculating the IR of any health event with a long induction time, one should not include cases with an inadequate induction time (or exceeding the maximum latency period) in the numerator, nor the corresponding population-time in the denominator. (See also cases 1-3 in Chapter 4.)

4. Constitutional factors or any genetic predisposing factor related to the event. For example, anyone with a family history of breast cancer or colon cancer is more likely to develop such an event than those without such a history (Neuhausen, 1999). Ethnically, Caucasian people are more likely to develop skin cancer. As our understanding of human genetics and genomic medicine increases, such factors should always be kept in mind, and an extensive search of the literature is crucial to the inclusion of these factors.

5. Environmental or occupational factors. For example, people with occupational and/or environmental asbestos exposure are more likely to develop lung cancer and mesothelioma (Chang et
al., 1999). Children living or employees working next to a heavily contaminated factory may be at risk of increased lead absorption (Wang et al., 1998). Again, a comprehensive literature search is needed to avoid overlooking important determinants.

6. Lifestyle and socio-behavioral factors. One must consider any lifestyle practices that have been reported as determinants of the health event. For example, cigarette smoking is a major determinant of lung cancer and chronic obstructive lung disease. Homosexual practice and intravenous drug abuse are major determinants of HIV (human immunodeficiency virus) infection, etc.

The above classification system is relatively arbitrary, but it can help epidemiologists easily remember and avoid overlooking any major determinant. Overall, a comprehensive literature search for the health outcome and its determinants should be performed before conducting the study.

6.4 Prevalence or prevalence rate (PR)
Prevalence is defined as the proportion of the population with a specified state at a particular time point or period. Strictly speaking, it is a proportion rather than a rate, because there is no inherent time in the denominator. However, since the collection of such data often involves an observation time, it is often called a prevalence rate by many epidemiologists.

PR = (No. of people with the specified state) / (Total no. of people who may have such a state)

It can be shown that prevalence is connected to incidence rate through the duration of the health state, under the assumptions of a stationary population and no migration. To simplify the interpretation, I shall use a disease to represent the
health event. Let us use the following notation:

Nₜ: total population at time t
IR: incidence rate of the disease
TR: termination rate of the disease (i.e., incidence rate of patients recovering back to normal)
D̄: mean duration of the disease

If the population is in a steady state, then the number of new cases developed is equal to the number of patients who die or recover back to normal:

Nₜ(1 − PR)(IR) = Nₜ(PR)(TR)

Because the inverse of TR is equal to D̄, i.e., TR = 1/D̄:

(1 − PR)(IR) = (PR)(1/D̄)

Thus, PR = (IR)D̄ / [1 + (IR)D̄]
If IR and D̄ are both very small, then PR ≈ (IR)D̄. PR is a proportion generally applied to describe a population rather than an individual. It can be used for a time point or for a period of observation. Because PR depends on both the incidence rate and the mean duration of the diseased state (D̄), its determinants include all the determinants of IR plus D̄. If D̄ is unrelated to the exposure of interest or any other etiologic agent, then PR actually changes with all the determinants of IR. Thus, epidemiologists sometimes conduct cross-sectional studies, which usually obtain PR only, for causal inference. Under this condition, one should always be aware that one invokes an
assumption that the mean duration is unrelated to any etiologic agent of interest. If the disease is not a highly fatal one, then one may collect information about duration-to-date on each prevalent case and estimate the IR (Freeman and Hutchinson, 1980).
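A minimal sketch of the steady-state relation between prevalence, incidence, and duration follows (the function names and input values are mine, chosen only for illustration):

    def prevalence(ir, mean_duration):
        # steady-state prevalence from incidence rate and mean duration
        x = ir * mean_duration
        return x / (1 + x)

    def incidence_from_prevalence(pr, mean_duration):
        # invert the relation to estimate IR from an observed prevalence
        return pr / ((1 - pr) * mean_duration)

    print(prevalence(0.01, 5))                  # IR = 0.01/yr, D = 5 yr -> PR ~ 0.048
    print(incidence_from_prevalence(0.048, 5))  # recovers IR ~ 0.01/yr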
6.5 Measurement of effects: rate difference, rate ratio, etiologic fraction, odds ratio, and expected number of prevented cases
In general, an effect is the end result produced by a change or modification of the cause. In epidemiology, such an effect is usually expressed as a change in the occurrence of health events or states, and the cause is usually construed as exposure to an etiologic agent. Because a disease or health event generally has many different determinants, one should pay attention to alternative causes (or other determinants) and try to control potential confounding. By analyzing the alternatives, one can conclude with more validity that the effect, i.e., the change in occurrence frequency of a disease, is attributable to the change in the specific exposure state.

6.5.1 Rate difference and rate ratio

When one compares one or more rates with the baseline rate (the non-exposed rate), the comparison is generally expressed in two ways: the rate difference (R₁ − R₀) and the rate ratio (R₁/R₀), where R₀ denotes the baseline rate and R₁ the rate among the exposed. One can produce such comparisons from incidence rates, cumulative incidence rates, or prevalence rates to provide measurements of effect. For example, if the age-specific (40-50 years) lung cancer incidence rates for smokers and non-smokers in city X were 10 × 10⁻⁵ year⁻¹ and 1 × 10⁻⁵ year⁻¹, respectively, then the rate difference and rate ratio resulting from smoking would be (10 − 1) × 10⁻⁵ year⁻¹ and (10 × 10⁻⁵ year⁻¹)/(1 × 10⁻⁵ year⁻¹) = 10, respectively, assuming that there is no exposure to other alternative etiologic agents. In the study of acute infectious disease, people previously called the incidence rate ratio (IRR) the "relative risk" (RR), because the IRR is almost equal to the cumulative incidence rate ratio (CIRR), which is an estimate of RR, if the observed time duration is short and clearly
specified. If one is studying chronic diseases, however, then the CIRR or RR can approximate the IRR only under two assumptions: first, the disease is rare; second, the exposure proportion is constant during the observation period (Greenland and Thomas, 1982). Because both assumptions may be violated in studies of chronic disease, and since the IRR can cover both acute and chronic diseases, I recommend that future epidemiologists use the IRR or rate ratio more often. Even for an acute disease, if one divides time into smaller intervals, one may obtain different estimates for the IRR and RR. For example, suppose that there was an outbreak of food poisoning after a banquet attended by 100 people. Guests began to develop painless diarrhea 3 hours after the banquet. The numbers of new patients in each hour thereafter were 3, 6, 14, 10, 8, and 4, respectively, as shown in Table 6.7.

Table 6.7
Calculation of incidence rates and cumulative incidence rates (estimate of risk) from a hypothetical example of acute disease (e.g., food poisoning), showing how they are interrelated: Incidence rate is a first order derivative of cumulative incidence rate when the time interval approaches 0.
Time (hr)   No. of persons at risk   No. of new cases during the hour   Incidence rate (IR) (hr⁻¹)   Cumulative incidence rate (risk)
0-2         100                      0                                  0                            0
2-3         100                      3                                  3/100                        3/100 (Risk 0-3)
3-4         97                       6                                  6/97                         9/100 (Risk 0-4)
4-5         91                       14                                 14/91                        23/100 (Risk 0-5)
5-6         77                       10                                 10/77                        33/100 (Risk 0-6)
6-7         67                       8                                  8/67                         41/100 (Risk 0-7)
7-8         59                       4                                  4/59                         45/100 (Risk 0-8)
8-24        55                       0                                  0                            45/100 (Risk 0-24)
Because food poisoning is a common (high-incidence) disease observed over small time intervals, the IRs and CIRs (risks) are quite different. Thus, the ratio of two IRs may also differ from the ratio of two risks.
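The IR and CIR columns of Table 6.7 can be reproduced with a few lines (the banquet data are taken from the table itself):

    # (hour interval, persons at risk at its start, new diarrhea cases during it)
    data = [("2-3", 100, 3), ("3-4", 97, 6), ("4-5", 91, 14),
            ("5-6", 77, 10), ("6-7", 67, 8), ("7-8", 59, 4)]

    cumulative = 0
    for interval, at_risk, cases in data:
        cumulative += cases
        ir = cases / at_risk      # per hour, since each interval is 1 hour
        cir = cumulative / 100    # risk among the 100 original guests
        print(interval, round(ir, 3), round(cir, 2))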
6.5.2 Excess fraction (EF) and etiologic fraction

If the incidence rate ratio (IRR) is larger than 1, then one may obtain an excess fraction (Greenland and Robins, 1998), which was called the attributable risk percent or attributable proportion in the past. Let IR₁ and IR₀ denote the incidence rates among the exposed and nonexposed, respectively. Then EF = (IR₁ − IR₀)/IR₁ = 1 − 1/IRR. For example, if the age-specific lung cancer incidence rates for smokers and nonsmokers were 10 × 10⁻⁵ year⁻¹ and 1 × 10⁻⁵ year⁻¹, respectively, then EF = [(10 − 1) × 10⁻⁵ year⁻¹] / (10 × 10⁻⁵ year⁻¹) = 90%. This excess fraction was once interpreted as the etiologic fraction (Miettinen, 1974a), if there is no alternative cause in the estimate of the incidence rate of lung cancer among smokers. However, if the etiologic fraction is defined as the fraction of cases for whom exposure was a contributory cause, then there may also be cases accelerated by exposure (Greenland, 1999). If the number of such accelerated cases is substantial, then the excess fraction may underestimate the etiologic fraction. The assumption of no competing risk further complicates the interpretation. Thus, an epidemiologist must recognize that rate ratios and rate fractions only reflect the overall impact of exposure on a population, rather than the total number of persons affected by exposure (Greenland and Robins, 1986), and one should always consider the biologic model and mechanism of how exposure produces its effect when interpreting an epidemiologic measure.

6.5.3 Odds ratio

One will often come across data on cases of the disease (numerator data) without available data on the population-time at-risk (denominator data). Accordingly, epidemiologists have designed a method, termed the "case-control study," to sample a population-at-risk over a defined time period. Essentially, one attempts to sample a control series to estimate the exposure odds (exposed vs. non-exposed) of the population, and one utilizes the case data (numerator data) to estimate the exposure odds of the case series. Thus, an odds ratio, or estimate of the incidence rate ratio for the
exposed and non-exposed population, can be obtained by dividing the exposure odds of the case series by that of the control series. One can then further model the odds ratio with multiple logistic regression to control possible confounding by other variables. Such modeling can be performed easily with currently available computer packages such as BMDP (Biomedical Data Processing), SAS (Statistical Analysis System), and SPSS (Statistical Package for the Social Sciences). Thus, "odds ratio" has become a fashionable name in epidemiology, and people sometimes forget that it is best interpreted as a measurement of effect under the case-control study design (Greenland, 1987a). Readers who want to model odds ratios with statistical packages should first become familiar with the case-control study design (Chapter 11) and then decide whether the odds ratio is the correct measurement of effect.

6.5.4 Expected number of prevented cases (Tsauo et al., 1999)

To estimate the effect of a prevention program, one can also try to calculate the expected number of prevented cases. To perform such a calculation, one must first assume that the excess fraction is equal to the etiologic fraction. One should also first consider the simplest case of a prevention program, involving the elimination of one particular exposure, and further assume that such an exposure can only result in one particular disease dᵢ.
Suppose that the activity of the prevention program begins at time t₀, and the proportion exposed also begins to decrease at the same time because of this activity. Moreover, assume that the prevention program is completed at time t₁, and that the reduction in the proportion exposed stabilizes after t₁. Then, after a period of induction I_di, the incidence rate of the disease will begin to fall; the reduction in the exposure proportion during t₀-t₁ is Pe(t₀) − Pe(t₁). Assume that the induction time (I_di) for producing a preventive effect (reversing the process of developing disease) is relatively short, so that the competing risk of dying from other diseases in the population is small and will not change IR_1di or IR_0di significantly during the study period. Thus, after time t₀ + I_di, the number of prevented cases is the number of incident cases that would have occurred, had the prevention program not proceeded, minus the number that occurred under the prevention program, as seen in the following equations (see Figure 6.3):
[Figure 6.3 Calculation of the expected number of prevented cases resulting from reducing the proportion of exposure from Pe(t₀) to Pe(t₁); the incidence rate IR_di(t) begins to decline after t₀ + I_di.]

Notation:
Pe(t): proportion of exposure at time t
IR_di(t): total incidence rate of disease dᵢ at time t, where IR_di(t) = IR_1di · Pe(t − I_di) + IR_0di · (1 − Pe(t − I_di))
IR_1di: incidence rate among the exposed group
IR_0di: incidence rate among the nonexposed group
Expected number of prevented cases
= ∫ from t₀+I_di to T of N(t) [(IR_1di) Pe(t₀) + (IR_0di)(1 − Pe(t₀))] dt
− ∫ from t₀+I_di to T of N(t) [(IR_1di) Pe(t − I_di) + (IR_0di)(1 − Pe(t − I_di))] dt

where T denotes the end of the observation period and N(t) the size of the population at time t.
The above formula can be further reduced to:

∫ from t₀+I_di to T of N(t) [(IR_1di − IR_0di)(Pe(t₀) − Pe(t − I_di))] dt

The case of one exposure resulting in several diseases, such as smoking causing lung cancer, myocardial infarction, etc., is more complicated, but one may proceed to compute the expected number of prevented cases for each disease individually. Overlapping cases (i.e., persons afflicted with two or more diseases) should then be considered in the utility calculation to avoid double counting.
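A hedged numerical sketch of this integral follows. Every input value is invented, and a constant population size and a linear decline in the exposure proportion between t₀ and t₁ are assumed:

    # Numerical integration of N(t)(IR_1di - IR_0di)(Pe(t0) - Pe(t - I_di)) dt
    N = 1_000_000               # population size, assumed constant over time
    ir1, ir0 = 50e-5, 10e-5     # incidence rates (year**-1), exposed vs. nonexposed
    pe0, pe1 = 0.40, 0.10       # exposure proportion before and after the program
    t0, t1, induction = 0.0, 2.0, 5.0   # program start/end and induction time (years)
    horizon = 20.0              # end of the evaluation period

    def pe(t):
        # exposure proportion: linear decline from pe0 to pe1 during t0..t1
        if t <= t0:
            return pe0
        if t >= t1:
            return pe1
        return pe0 + (pe1 - pe0) * (t - t0) / (t1 - t0)

    dt, t, prevented = 0.01, t0 + induction, 0.0
    while t < horizon:
        prevented += N * (ir1 - ir0) * (pe0 - pe(t - induction)) * dt
        t += dt
    print(round(prevented))     # expected number of prevented cases by the horizon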
6.6 Measurement of utility of health: quality-adjusted survival
6.6.1 Concept of QALY and its potential application

Epidemiology as a discipline has always been expected to provide useful information to assist public health policy decision-making. Most epidemiologists also generally hope to apply the results of their etiologic studies to the actual prevention of disease. At first, this demand was relatively simple. For each study of acute infectious disease, epidemiologists could point to a single infectious agent as the target for prevention. The evaluation of a particular public health action, in terms of the total utility of health, was simply the number of lives saved, and there was very little ethical dispute. Finally, the results of intervention were quickly obtained, often with positive feedback. In the last several decades, epidemiological studies have been extended to cover chronic diseases, which usually have multiple etiologic agents and produce chronic functional disability in addition to mortality. As epidemiological studies grow in number, and more etiologic agents for each disease are uncovered, policy makers are confronted with hard choices in the allocation of limited health service resources. The ever-growing medical expenditures resulting from high-technology medicine, such as MRI (magnetic resonance imaging), hemodialysis, gene therapy, etc., have further aggravated this problem. While many epidemiologists may still doubt whether they should
participate in the field of cost-effectiveness or even cost-benefit assessment during the health policy decision process, most will agree that they should continue to supply relevant information to assist such processes. Providing only information about the rate ratio of a particular etiologic agent, moreover, may not be sufficient. Rather, epidemiologists should also try to quantify the amount of the utility of health that a person loses on contracting a particular disease, or even the utility that a preventive policy and/or program may achieve. In fact, combining calculations of morbidity and mortality into a single unit of utility of health is also a major issue in clinical decision-making. To tackle the above problem, the concept of the QALY (quality-adjusted life year) was developed over the last 20 years (Gold et al., 1996; Patrick and Erickson, 1993; Weinstein and Stason, 1977) to serve as a common unit for quantifying the utility of health for policy decision-making. However, the lack of consistent and accurate methods for estimating HRQL (health-related quality of life) and calculating QALYs still precludes its wide acceptance and usage. Recently, Hwang et al. (1996) developed a method that combines the survival function with the HRQL function to estimate quality-adjusted survival (QAS) in QALYs. A kernel-type smoothing estimator is also provided, so that a random, cross-sectional sample of 50 surviving patients may give a relatively accurate estimate of HRQL. Multiplied by the survival function, this provides the QALYs left over, or quality-adjusted life expectancy (QALE), for a particular disease. The area under the QAS curve is the total utility left over for an average patient with a particular disease xᵢ. The whole approach is demonstrated in Example 13.5 (Figures 13.3-13.5). The above concept of QAS can be expressed by a simple equation. Let Qol(t|xᵢ) denote the HRQL function and S(t|xᵢ) the survival function; both vary with time t and determinant xᵢ. Then the QAS left over, or QALE, is the integration of the area below the QAS curve. Let E[Qol(t|xᵢ)] denote the expected HRQL of disease xᵢ at time t, usually a number between 0 and 1. After adjusting every point of S(t|xᵢ) at
time t, we take an integration throughout life:

QALE = ∫₀^∞ E[Qol(t|xᵢ)] S(t|xᵢ) dt

Similarly, one can obtain the QAS of a normal person x₀. The difference between the QAS of a normal person and that of a patient with a particular disease xᵢ is the utility lost in contracting the disease:

∫₀^∞ E[Qol(t|x₀)] S(t|x₀) dt − ∫₀^∞ E[Qol(t|xᵢ)] S(t|xᵢ) dt
Such quantification will be an efficient tool to supplement the primary indicator (i.e., number of lives saved or lost) for health policy decision-making. If one can count the number of dollars spent in health services per QALY for different diseases, then these data can be compared in a cost-effectiveness analysis.

Extrapolation of QAS to lifetime

In many commonly encountered chronic diseases, the life expectancy may be very long. For example, patients with diabetes mellitus, hypertension, or papillary thyroid cancer frequently survive for more than 15-20 years if recognized early and treated carefully. If the follow-up period is relatively short, say 5 years, then one may be faced with survival data with a high censoring rate, say more than 80-90 percent. Under this condition, one may borrow information from the general population, namely, create a reference population from the life table of the general population by the Monte Carlo method. Then, one can fit a simple linear regression line to the logit of the ratio of the QAS functions for the index and reference populations and accomplish the extrapolation to lifetime. Readers who are interested in the details may consult the original article by Hwang and Wang (1999). In the same way, psychometric score functions for every facet of QOL can also be integrated with survival and extrapolated to obtain lifetime scores.
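A minimal sketch of the QALE integral follows; the survival and HRQL functions below are invented stand-ins for illustration, not estimates from any study:

    import math

    def survival(t):
        # hypothetical survival function S(t|x_i): exponential with mean 10 years
        return math.exp(-t / 10.0)

    def qol(t):
        # hypothetical HRQL function Qol(t|x_i), kept between 0 and 1
        return max(0.0, 0.8 - 0.01 * t)

    dt = 0.01
    qale = sum(qol(i * dt) * survival(i * dt) * dt for i in range(int(100 / dt)))
    print(round(qale, 2))  # ~7.0 QALYs: the area under the QAS curve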
Ethical considerations for a QALY approach

Although the calculation of QAS and QALYs may more accurately quantify the effectiveness of health services and possibly maximize the total utility of society, one must be cautious about its ethical implications. Since the aged, the disabled, and specific underprivileged groups (e.g., aboriginal people) may have smaller QALYs, their needs may be ignored if utility is based purely on QAS calculations. For example, saving the life of a young child may produce more QALYs than saving those of four senior citizens or those of two severely disabled adults. Since the QALY approach does not take into account the ethics of distributive justice, one must always examine how and to whom QALYs are distributed. Thus, I suggest that our society should first count the number of lives saved in each age stratum of 10-15 years as the primary indicator for global resource allocation. Then, one can try to maximize the number of QALYs within each stratum, after careful consideration of the specific needs of disabled or underprivileged people within each stratum. Moreover, since there is some uncertainty involved in the measurement of HRQL, it is better to place all figures from patients, medical professionals, and the community sample into the calculation for a sensitivity analysis. Finally, one must let the public make the final decision (Carr-Hill, 1989; American Geriatric Society, 1989; LaPuma and Lawlor, 1990; Spiegelhalter et al., 1992; Robinson, 1993; Huang et al., 1996; Beauchamp and Childress, 2000).

Possible cost-benefit analysis

If one tries to apply the human capital method to calculate the potential work ability or wages left over until retirement at age 65, then one can simply replace the Qol(t|xᵢ) in the equation for QALE with the function of work ability or wage of a person with determinant xᵢ, denoted WA(t|xᵢ). This latter function represents the foregone earnings (potential salary lost) of a person who died unexpectedly:

Expected wages left over = ∫₀⁶⁵ E[WA(t|xᵢ)] S(t|xᵢ) dt

If the function of direct medical cost for disease xᵢ is Cost(t|xᵢ), then the
total medical cost for disease xᵢ can be expressed as:

Expected lifelong medical cost = ∫₀^∞ E[Cost(t|xᵢ)] S(t|xᵢ) dt

In fact, all the above concepts may be included in a more general expected utility function left over for a person at age t, as proposed by Freeman III (1991) and Garber (2000). Let U(t|xᵢ) denote the utility function of such a person with determinant xᵢ. Then,

Expected utility left over = ∫₀^∞ E[U(t|xᵢ)] S(t|xᵢ) dt

The total cost of disease xᵢ can then be derived by summing the direct (medical) and indirect (human capital) costs throughout life:

Cost of disease xᵢ = [∫₀^∞ E[Cost(t|xᵢ)] S(t|xᵢ) dt − ∫₀^∞ E[Cost(t|x₀)] S(t|x₀) dt] + [∫₀⁶⁵ E[WA(t|x₀)] S(t|x₀) dt − ∫₀⁶⁵ E[WA(t|xᵢ)] S(t|xᵢ) dt]

Although the above cost-of-illness approach is often criticized as having no foundation in welfare economic theory, it can still be regarded as a lower bound on the willingness-to-pay approach (Kenkel, 1994).

6.6.2 Life table method for estimating the QAS

In addition to the more general kernel-type smoothing estimator, the QAS can also be calculated with a more intuitive life table method, by simply adding two columns to a typical life table. As before, one requires a set of survival data and a set of HRQL data that correspond to the partitioned survival intervals. To illustrate the calculation step by step, one first adds a column of HRQLs (Qol(tᵢ)) to the life table (Table 6.8) and then derives the quality-adjusted survival time (QAST) as follows. Let us denote the number of subjects lost to follow-up as lᵢ, the number
withdrawn alive as wᵢ, the number dying as dᵢ, the number entering the interval as n′ᵢ, and the HRQL at the beginning of interval i as Qol(tᵢ). Then, one can calculate the number exposed to risk as nᵢ = n′ᵢ − (lᵢ + wᵢ)/2, the conditional proportion dying as qᵢ = dᵢ/nᵢ, the conditional proportion surviving as pᵢ = 1 − qᵢ, and the cumulative proportion surviving as S(tᵢ) = pᵢ₋₁ × S(tᵢ₋₁). The QAST can thus be estimated from the S(tᵢ) and the set of HRQL data, the Qol(tᵢ):

QAST = Σᵢ₌₁ᵏ QSᵢ × (tᵢ₊₁ − tᵢ), where

QSᵢ = [S(tᵢ) − S(tᵢ₊₁)]/2 × [Qol(tᵢ) + Qol(tᵢ₊₁)]/2 + S(tᵢ₊₁) × [Qol(tᵢ) + Qol(tᵢ₊₁)]/2

The QSᵢ consists of two parts: the first is the QAST contributed by patients who die in (tᵢ, tᵢ₊₁), with the assumption that the prospective times of death for the dying subjects are randomly distributed within the interval; the second is contributed by the patients who still survive at time tᵢ₊₁. The above formula can be rewritten as:

QSᵢ = [S(tᵢ) + S(tᵢ₊₁)]/2 × [Qol(tᵢ) + Qol(tᵢ₊₁)]/2

That is, QSᵢ can be estimated by multiplying the mean of S(tᵢ) and S(tᵢ₊₁) by the mean of Qol(tᵢ) and Qol(tᵢ₊₁). The total QAST is the summation of each QSᵢ multiplied by the interval width. If one wants to take an annual discount rate r into consideration, then the formula can be rewritten as:

QAST = Σᵢ₌₁ᵏ QSᵢ × (tᵢ₊₁ − tᵢ) × (1/(1 + r))ⁱ

where k represents the kth year after onset of the illness.
Table 6.8 Notation of a typical life table with added columns for QOL (quality of life) and QAST (quality-adjusted survival time) (modified from Lee, 1992).

Columns: interval (tᵢ-tᵢ₊₁); number lost to follow-up (lᵢ); number withdrawn alive (wᵢ); number dying (dᵢ); number entering interval (n′ᵢ); number exposed to risk (nᵢ); conditional proportion dying (qᵢ); conditional proportion surviving (pᵢ); cumulative proportion surviving S(tᵢ); QOL(tᵢ) = qol(tᵢ); QAST contribution QSᵢ.

Interval     S(tᵢ)           qol(tᵢ)      QSᵢ
t₁-t₂        S(t₁) = 1.00    qol(t₁)      QS₁
t₂-t₃        S(t₂)           qol(t₂)      QS₂
tᵢ-tᵢ₊₁      S(tᵢ)           qol(tᵢ)      QSᵢ
tₛ₋₁-tₛ      S(tₛ₋₁)         qol(tₛ₋₁)    QSₛ₋₁
tₛ-          S(tₛ)           qol(tₛ)      QSₛ

where
nᵢ = n′ᵢ − (lᵢ + wᵢ)/2
qᵢ = dᵢ/nᵢ
pᵢ = 1 − qᵢ
S(tᵢ) = pᵢ₋₁ × S(tᵢ₋₁)
qol(tᵢ): from survey
QSᵢ = [S(tᵢ) − S(tᵢ₊₁)]/2 × [qol(tᵢ) + qol(tᵢ₊₁)]/2 + S(tᵢ₊₁) × [qol(tᵢ) + qol(tᵢ₊₁)]/2
    = [S(tᵢ) + S(tᵢ₊₁)]/2 × [qol(tᵢ) + qol(tᵢ₊₁)]/2
QAST: quality-adjusted survival time, QAST = Σᵢ QSᵢ × (tᵢ₊₁ − tᵢ)
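A sketch of the life-table QAST calculation follows (hypothetical one-year intervals and counts; it implements the footnote formulas of Table 6.8):

    # Each row: (lost to follow-up l_i, withdrawn alive w_i, deaths d_i,
    #            number entering n'_i, HRQL qol(t_i) at the interval's start)
    rows = [(2, 0, 10, 200, 0.80),
            (3, 5,  8, 188, 0.75),
            (2, 10, 6, 172, 0.70)]
    qol_end = 0.65   # HRQL at the end of the last interval

    S = [1.0]        # cumulative proportion surviving, S(t_1) = 1.00
    for lost, withdrawn, died, entering, _ in rows:
        n = entering - (lost + withdrawn) / 2   # number exposed to risk
        S.append(S[-1] * (1 - died / n))        # S(t_{i+1}) = p_i * S(t_i)

    qols = [row[4] for row in rows] + [qol_end]
    qast = sum((S[i] + S[i + 1]) / 2 * (qols[i] + qols[i + 1]) / 2 * 1.0
               for i in range(len(rows)))       # interval width = 1 year
    print(round(qast, 3))                       # QAST in quality-adjusted years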
6.6.3 Estimation of HRQL (health-related quality of life)

Although studies of HRQL have become quite popular during the last decade (The WHOQOL Group, 1995, 1998a, 1998b; Testa and Simonson, 1996), there is still no consensus on a standard method, and comparability remains an unresolved issue (Gold et al., 1996). However, if HRQL data are to be used for public health resource allocation, there seems to be a growing consensus that decisions should include society's perspective, obtained through inquiry of well-informed representatives of the community. Moreover, patients' HRQL for a particular disease can also be studied for policy decisions regarding the clinical diagnosis and treatment of that disease. Thus, both a subjective summary of health from utility measurement(s) and a more diversified health profile from multi-dimensional assessment are required; the latter can also serve as a validity check on quantitative changes in the former. Having already discussed the multi-dimensional approach to HRQL in Chapter 5, I will now briefly introduce the concept of utility measurement.

Standard gamble

There are three major approaches to utility measurement. The standard gamble and time trade-off techniques are based on expected utility theory, while the rating scale method is based on psychometrics. The standard gamble approach begins by asking a subject to consider a hypothetical choice between the certainty of continued life in the particular health state (which is less than optimal) and a gamble. The gamble has two outcomes: either full health (assigned a utility of 1) or death (assigned a utility of 0). The probabilities in the gamble are systematically or incrementally changed until the subject is indifferent between the certainty of continued life in the particular health state and the gamble. The expected value of the gamble at this point is, by substitution, the utility of the health state of interest relative to full health and death. Often a probability wheel with two different colors representing full health and death is recommended to help subjects understand and make proper decisions. Because most diseases do not cause a
great deal of damage to quality of life, the actual trade-off could begin with a 99% chance of cure and then go down decrementally (Bala et al., 1999).

Time trade-off

The time trade-off method usually asks a subject to decide what amount of time he or she would be willing to give up for a better versus a poorer state of health (Torrance et al., 1972). Often a visual aid is used to help the subject choose between two alternatives: one is living in a less desirable health state (A) for a longer period of time followed by death; the other is living in full health (B) for a shorter period of time followed by death. The time in state B is decreased incrementally to the point where the subject becomes indifferent between the two alternatives. The preference for state A is calculated as the life expectancy at the point of indifference in state B divided by the life expectancy in state A. Both the standard gamble and the time trade-off are somewhat difficult for people to understand, and these procedures often require an interviewer and visual aids. Nonetheless, since a preference measure is essential in calculating QALE, I recommend that at least one of the two methods be used in cost-effectiveness or utility assessment. In my experience with Chinese subjects in Taiwan, patients' families usually feel uncomfortable stating (or speculating about) a specific period of life expectancy. Sometimes they refuse to be interviewed, making the time trade-off method less popular.

Rating scale

The direct rating scale method can be performed with a visual analogue scale with a few markings at discrete points. It is highly familiar to most people and usually has high test-retest reliability (Lin et al., 1997), depending on the chronicity and stability of the disease. However, it is unrelated to expected utility theory. The choice of method will probably remain unsettled in the near future because each has its pros and cons. My recommendation is to perform both standard gamble and rating scale measurements for all subjects whenever feasible. If the selected sample cannot accept the standard gamble method, then conduct the rating scale first, and
also randomly choose a subset to test with either the standard gamble or the time trade-off, in order to obtain data for possible transformation between methods in the future.
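A tiny sketch of the two preference calculations just described (the indifference values are invented):

    # Time trade-off: indifference between 10 years in state A and 7 years in
    # full health implies a utility of 7/10 for state A.
    def tto_utility(time_full_health, time_state_a):
        return time_full_health / time_state_a

    # Standard gamble: indifference at probability p of full health (utility 1)
    # vs. (1 - p) of death (utility 0) implies a utility of p for the state.
    def sg_utility(p_full_health):
        return p_full_health * 1.0 + (1 - p_full_health) * 0.0

    print(tto_utility(7, 10))  # 0.7
    print(sg_utility(0.85))    # 0.85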
6.7 Summary
Basic measurements in epidemiology involve the counting of health events and states, which provide ratio-scale measurements. A rate counts the frequency of events (or changes of state) occurring as time passes; this implies a time unit in the denominator for comparison. Since proportions and ratios involve more than one state or event, their determinants are usually more complicated than those of a rate. Incidence rate is defined as the number of persons who develop a particular event divided by the total amount of population-time at-risk. Therefore, the determinants of an incidence rate include the measurement accuracy of the numerator and denominator, the induction time involved, and all etiologic factors (constitutional, environmental, lifestyle, etc.) for the health event to develop. Risk is defined as the probability of developing an event within a specified period of time, and it can be estimated by the cumulative incidence rate. Prevalence generally depends on the incidence rate and the mean duration of the disease state. If the mean duration is unrelated to the exposure, one can use prevalent cases for etiologic studies. If the disease is not usually fatal, prevalent cases can provide data on duration-to-date, which can then be used to estimate the incidence rate. Comparisons of rates from different populations yield rate ratios and rate differences, which can serve as measures of effect. The etiologic fraction is often approximated by the excess fraction if there is no alternative cause that can explain the effect, but this approximation requires additional assumptions and should be used with care. The expected number of prevented cases can also be calculated by integrating the difference between the baseline incidence rate and the expected rate after the implementation of a preventive program and a period of induction time. Because simply counting the number of lives saved or lost is not sufficient or efficient for current policy decision-making involving both mortality and morbidity, the QALY (quality-adjusted life year) has been developed as a common unit for quantifying the utility of health. The QAS (quality-adjusted survival), which combines the survival function and the
function of health-related quality of life, provides an accurate way of estimating the number of QALYs remaining for patients with a particular disease. However, one should always consider the ethics of distributive justice in the QALY approach, to prevent discrimination against the aged, disabled or underprivileged.
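To make the QAS computation concrete, here is a minimal sketch: multiply the survival function S(t) by the mean utility (quality-of-life) score for each interval and sum over time. All numbers below are hypothetical and serve only to illustrate the arithmetic, not any particular disease.

```python
# Quality-adjusted survival: integrate survival S(t) times mean utility Q(t).
# Hypothetical yearly values over a 5-year horizon.

survival = [1.00, 0.90, 0.78, 0.65, 0.50]  # proportion surviving each year
utility = [0.85, 0.80, 0.75, 0.70, 0.65]   # mean utility (0 = death, 1 = full health)

# Year-by-year approximation of the integral of S(t) * Q(t) dt
qale = sum(s * q for s, q in zip(survival, utility))
print(f"Expected quality-adjusted survival: {qale:.2f} QALYs")
```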
Quiz of Chapter 6

Please write down the probability that each assertion is true. You will be scored according to the credibility of your subjective judgment.

1. Frequencies of different traffic injuries in May of 2000 in city X were summarized as follows:

   Types of traffic injuries    No. of deaths    Proportion (%)
   Motorcycle                        20                40
   Passenger car                     15                30
   Bicycle                            5                10
   Pedestrian                         5                10
   Truck and others                   5                10
   Total                             50               100

   We can conclude that motorcycles are the most dangerous because they produce the highest mortality.
2. Prevalence rate is defined as the number of all occurrences of a specified event in a population during a period of time.
3. Induction time is always a determinant of the incidence rate of an event.
4. Use of the QALY (quality-adjusted life year) as the common unit implies that aged citizens will be allocated a smaller portion of resources.
5. An odds ratio can only be considered equivalent to a rate ratio under a case-control design.
6. Rate must be considered with a specified time unit in epidemiology.
7. To calculate QALYs, one must have information from a standard life table plus two columns for quality of life and quality-adjusted survival time.
8. The QALY calculation has the potential to become a common metric for the utility of health.
9. We collected and followed up 300 cases of stroke for 3 years, and the following data were obtained: 10 cases died within the first week; the next 10 cases died before the end of the first month; an additional 20 cases died by 6 months; another 20 cases died by the end of the follow-up period. Please calculate the incidence rate for the following periods: during the 1st week: ___; 8-30 days: ___; 2nd-6th month: ___; 0.5 year-3 years: ___.

Answers: (1) F (2) F (3) T (4) T (5) T (6) T (7) T (8) T (9) 1.74 year⁻¹; 1.09 year⁻¹; 0.18 year⁻¹; 0.032 year⁻¹
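As a worked check of question 9, the sketch below computes each rate as deaths divided by accumulated person-time. It uses the common actuarial convention that each death contributes half of the interval at risk; the printed answers follow a different person-time convention for the first two intervals, so those two values differ.

```python
# Incidence rates for the stroke cohort in question 9 (300 cases, 3 years).
# Assumes deaths contribute, on average, half of each interval at risk.

def incidence_rate(deaths, at_risk_at_start, interval_years):
    person_years = (at_risk_at_start - deaths / 2) * interval_years
    return deaths / person_years

intervals = [
    ("1st week", 10, 300, 7 / 365.25),
    ("8-30 days", 10, 290, 23 / 365.25),
    ("2nd-6th month", 20, 280, 5 / 12),
    ("0.5-3 years", 20, 260, 2.5),
]
for name, deaths, n, years in intervals:
    print(f"{name}: {incidence_rate(deaths, n, years):.3f} per person-year")
```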
Chapter 7 Study Design

7.1 Types of inferences in empirical studies: descriptive and causal
    7.1.1 Research leads to descriptive inferences or descriptive studies
    7.1.2 Research leads to causal inferences or causal studies
    7.1.3 More about distinctions between descriptive and causal inferences
7.2 Principles of study design for descriptive studies
    7.2.1 How to draw inferences without a high response rate
7.3 Principles of study design for causal studies
    7.3.1 Formation of confounding
    7.3.2 Characteristics of confounders and their control
    7.3.3 How to detect confounding or systematic bias in a study
7.4 Summary

Introduction

An architect spends a great deal of time and effort carefully designing and supervising the construction of a building. Usually, he/she will draw detailed floor plans and even build a miniature model before actual construction, in order to avoid any irreparable mistakes. Similarly, an epidemiologist should clearly delineate his/her objectives and explain how he/she will accomplish these goals during the study design stage. Moreover, one should prevent uncontrollable confounding by taking into consideration all the important determinants of the outcome. Even if one is able to return to the field, collecting information on a neglected determinant may be too costly or even impossible. Therefore, a good study design is fundamental to the success of a study. In this chapter, we shall first discuss common types of inferences in empirical studies, and then provide principles and guidelines for designing causal and/or descriptive studies.
7.1 Types of inferences in empirical studies: descriptive and causal
In public health and medicine, one obtains empirical data in an attempt to learn more about the characteristics of, or theories regarding, a population's health. In general, such empirical studies can be divided according to the two types of inferences made: descriptive or causal. A descriptive inference involves making statistical inferences from sampled data to describe the characteristics of the source population. In contrast, a causal inference utilizes evidence to examine whether or not an original causal hypothesis has been falsified. Although some epidemiologists have broadly described causal studies as analytical, I choose to classify them as "causal" to better distinguish them from descriptive studies. Causal studies not only analyze a particular health condition among a population, but also specifically attempt to uncover cause-and-effect relationships through the process of conjectures and refutations. In many studies, one may draw both types of inferences, but let us begin by using the terms "descriptive" and "causal" to form a general categorization of studies according to their main purposes.

7.1.1 Research leads to descriptive inferences or descriptive studies

A descriptive study is an empirical study that intends to infer the characteristics of a population by studying a selected sample from that population. The general goal is to select or use the smallest sample size possible to infer the facts or characteristics of the population. This type of study usually involves only one population, for which one attempts to calculate either incidence or prevalence rates. Or, one may perform many kinds of measurement in a population, without making any causal comparisons between rates. The following examples display various descriptive inferences made in actual epidemiological research. Table 7.1 compares and contrasts the major questions asked in descriptive and causal studies.
Table 7.1 Examples of major questions asked in descriptive and causal studies.

Descriptive studies:
  Example 7.1 What are the disease patterns and demand of the emergency service system in Taipei?
  Example 7.2 What are the prevalence rates of alcohol abuse and dependence among aborigines in Taiwan?
  Example 7.3 What is the prevalence rate of n-hexane induced polyneuropathy among press-proofing workers in Taipei?

Causal studies:
  Example 7.4 What is the etiologic agent of premalignant skin lesions among paraquat manufacturers?
  Example 7.5 What is the etiologic agent of hepatitis among synthetic leather workers?
  Example 7.6 What is the etiologic agent of a hepatitis outbreak in a printing factory?
  Example 7.7 Is wearing a helmet protective against head injury? If yes, is there any difference in magnitudes of protection among different types of helmet?
Example 7.1 Disease pattern and demand of Taipei's emergency medical system

If one wants to study the disease pattern and demand of an emergency medical services system, one needs to draw samples of emergency visits from major hospitals in this system. In a study of the Taipei emergency medical services system, Chiang et al. (1986) drew a 30-day random sample and reviewed 7,314 medical records from 11 major hospitals. They estimated that during 1982-3, there were about 24,500 visits due to specific infectious diseases, 66,500 visits due to respiratory diseases, 49,600 traumas and injuries, 4,900 psychiatric diseases, etc. These numbers were calculated to be 8.8%,
24%, 17.9% and 1.8% of all emergency visits, respectively, as summarized in Table 7.2. Although many prevalence rates could have been calculated from the data, the investigators chose to study only one population. Furthermore, instead of making causal comparisons to determine etiology, the rates were listed for administrative comparison to facilitate resource allocation.

Table 7.2 Estimated frequency of emergency visits by diagnosis in Taipei, 1982-3. Two sets of figures are presented, depending on the number of major diagnoses assigned to each visit.

Disease category                   One diagnosis       Multiple diagnoses
Specific infectious diseases       24,524 (8.8%)       30,258 (9.1%)
Neoplastic diseases                 4,924 (1.8%)        6,325 (1.9%)
Psychiatric diseases                4,911 (1.8%)        6,157 (1.9%)
Ophthalmologic diseases             2,678 (1.0%)        3,258 (1.0%)
Otolaryngologic diseases            1,171 (0.4%)        1,833 (0.6%)
Cardiovascular diseases            13,741 (5.0%)       20,675 (6.2%)
Respiratory diseases               66,462 (24.0%)      73,885 (22.3%)
Gastrointestinal diseases          25,596 (9.2%)       31,595 (9.5%)
Genitourinary diseases              9,742 (3.5%)       12,141 (3.7%)
Obstetric-Gynecologic diseases     10,344 (3.7%)       11,687 (3.5%)
Neonatal diseases                   3,452 (1.2%)        4,063 (1.2%)
Trauma                             49,621 (17.9%)      56,104 (16.9%)
Burn                                2,779 (1.0%)        2,779 (0.8%)
Poisoning                           2,563 (0.9%)        2,842 (0.9%)
Others                             19,242 (6.9%)       27,995 (8.5%)
Unknown or unspecified             35,433 (12.8%)      39,481 (11.9%)
Total                             277,183             331,078
Example 7.2 Alcoholism among Taiwanese aborigines

Similarly, in a study on alcoholism prevalence rates among different aboriginal tribes in Taiwan, Hwu et al. (1990) drew samples proportional to different village sizes, using household registration data, and proceeded to conduct personal interviews. They successfully interviewed 1,555 out of
2,113, and calculated prevalence rates of alcohol abuse (AA) and alcohol dependence (AD) among the Atayal, Paiwan and Yami ethnic groups (11.6%, 11.4%, and 14.2%; and 9.0%, 8.1% and 6.4%, respectively), according to DSM-III diagnostic criteria. In this study, these rates were estimated and compared mainly for descriptive and administrative purposes; causal risk factors were analyzed in a separate article (Hwu et al., 1991) to avoid confusion.

Example 7.3 n-Hexane induced polyneuropathy in press-proofing workers
To determine the prevalence of n-hexane induced polyneuropathy in press-proofing factories, one needs to draw a sample of such factories and examine their workers. In one study (Wang et al., 1986), we surveyed 16 press-proofing factories in Taipei and found that 15 out of 59 workers had developed the disease. Although our study also explored the causal relationship between exposure and outcome, the first part of our inference only involved determining the prevalence rate of n-hexane polyneuropathy among these workers. One can only draw valid inferences about a source population from a representative sample. By virtue of the central limit theorem in statistics, a representative sample must be randomly drawn and sufficiently large (as in Examples 7.1 and 7.2), in order to calculate valid incidence and prevalence rates of a population. We will discuss sampling methods in further detail in Chapter 9 (Introduction to sampling method and practical applications). In Example 7.3, without information on the number of existing press-proofing factories in Taipei, we simply went to all such factories we could possibly find in the city to ensure a representative sample of the source population. We also lacked prior knowledge of the solvents used by the workers, and accordingly concluded that our sample was random with respect to solvent exposure and likely representative. In addition to representative sampling, one should use the most efficient sampling method. If two sampling methods A and B use the same sample size, and method A obtains smaller random error or variance than B, then method A
is considered to be more efficient. We shall discuss principles for determining the minimum required size (most efficient method) for sampling in Chapter 9. Thus, descriptive studies focus mainly on statistical inferences, and one should carefully evaluate the representativeness of a sample in order to make valid conclusions.

7.1.2 Research leads to causal inferences or causal studies

The main objective of a causal study is to falsify hypotheses through empirical evidence. Because conducting such a study usually requires a comparison of at least 2 populations (exposed vs. non-exposed), some investigators call it a comparative study (Anderson et al., 1980). For example (Example 7.4), one may wish to determine the etiologic agent of skin cancer or premalignant skin lesions (Wang et al., 1987; Jee et al., 1995), the etiologic agent of hepatitis among synthetic leather workers (Example 7.5, Wang et al., 1991), the etiologic agent of hepatitis among printing workers (Example 7.6, Deng et al., 1987), or the magnitudes of protection afforded by wearing different types of helmets among motorcycle riders (Example 7.7, Tsai et al., 1995), etc. Under all these circumstances, one must measure rates (especially incidence rates) for at least two comparison groups. The rate difference and/or rate ratio is the measurement of the effect of exposure. For an even more detailed study, one can further divide the comparison groups into different intensities of exposure. With this finer categorization, one can apply an interval or ratio scale for the exposure, in which rate ratios are calculated for each level of exposure. The most important validity issue in causal studies is whether the effect attributed to the exposure is complicated or mixed by any extraneous factor, i.e., confounding. In other words, one should examine whether there is any alternative hypothesis that can partially or totally explain the causal effect under study. If there is any evidence indicating the presence of confounding, then the association or estimate is inaccurate. Such a study contributes minimally to scientific knowledge and sometimes produces a spurious effect. Moreover, any public health action based on such an erroneous conclusion
may be a waste of resources and is counter-productive. Thus, one should attempt to control or prevent potential confounding beginning in the study design stage and on throughout the empirical study. In Example 7.4, the study of premalignant skin lesions among paraquat manufacturers, one should prevent confounding of the effect of bipyridyl exposures by other previously reported carcinogenic substances, such as tars, ionizing radiation, arsenic, etc. Similarly, in Examples 7.5 and 7.6, one should prevent the various forms of viral hepatitis, hepatotoxic drugs, alcohol, etc. from mixing with the effect produced by dimethylformamide (DMF) and carbon tetrachloride in synthetic leather manufacturing and printing workers. Again, as in Example 7.7, one should control confounding from local traffic conditions, motorcycle speed, the skidding tendency of the road surface, etc. However, if in Example 7.5 one wants to study any possible synergistic or modifying effect on liver function between viral hepatitis B and DMF, one should try to stratify or model the joint effect, as shown in Table 7.3. In Table 7.3, frequencies of workers with abnormal liver function (i.e., ALT > 35 IU) were compared for different intensities of exposure to DMF. Since chronic hepatitis B infection, as indicated by a positive carrier status for its surface antigen (HBsAg), may increase the frequency of abnormal liver function, one should compare workers with the same carrier state in the same stratum. Such stratification can eliminate the mixing of the effects of DMF and HBsAg, as shown in the upper panel of Table 7.3. Essentially, it provides us an opportunity to look at the effect of DMF while holding that of hepatitis B constant. The concept of modeling is similar to that of stratification. In modeling, one holds all other determinants in the model constant, while singularly examining the change of outcome under changing exposures of interest. Modeling has the added advantage that one can explore the quantitative interaction among different determinants, while stratification only provides an intuitive appeal through displaying actual frequencies in individual strata. In the lower panel of Table 7.3, the frequencies of workers with abnormal liver function were compared by modeling analysis for an independent effect of each determinant: HBsAg and DMF exposure indices 1 and 2. A 95% confidence interval that includes an odds ratio of 1 is usually not considered to indicate
a statistically significant effect. As both stratification and modeling hold the stratified or modeled determinants constant, they are useful in controlling confounding. To perform modeling and stratification, one needs to measure or collect information on all other determinants of the outcome. Moreover, a causal inference usually involves a hypothesis about a natural law, which should be universal across different time and spatial settings. Thus, one should try to draw conclusions beyond the statistical association obtained from one study. Furthermore, one should consider evidence outside of the study to see which hypotheses are falsified and which remain unrefuted. The characteristics of descriptive and causal inferences in epidemiological studies are summarized in Table 7.4.

7.1.3 More about distinctions between descriptive and causal inferences

The general and primary goal of statistical analysis in epidemiological research is to summarize the data, which is often expressed as the estimation of parameters, tests of significance, or both. In a descriptive study, parameters describing characteristics of a population, such as rate, are estimated from a random or quasi-random sample. Statistical inferences are made mainly based on the representativeness of the sample, meaning that one must determine to which specific population one can infer the characteristics found in the sample under study. In a causal study, the statistical summarization of data is used to rule out chance as an alternative explanation or to estimate the magnitude of the causal effect while holding other determinants constant. After estimating parameters of effect (rate ratio or rate difference) and performing tests of significance on the sample, investigators often consider evidence and alternative hypotheses outside of the study, in order to falsify hypotheses. Because causal hypotheses are proposed natural laws that should not be limited to particular times, places or settings, one must take outside hypotheses and evidence into consideration before reaching a conclusion. In fact, this process involves conjecture and refutation. In other texts, causal inferences have been broadly labeled as scientific inferences in order to distinguish them from the commonly drawn statistical inferences of descriptive studies, but I choose to use the term "causal" for an even clearer distinction.
Table 7.3 Stratification of workers with abnormal liver function,* by index of exposure to dimethylformamide (DMF) and hepatitis B surface antigen (HBsAg) (Wang et al., 1991).

                                    DMF Exposure Index
Status      ALT level (IU/L)     0       1-2       3        4
HBsAg (-)   ≥35                  2        2        3        5
            <35                 19       17        4       21
HBsAg (+)   ≥35                  2        3        3        6
            <35                 52       60       14       74
Total       ≥35                  4        5        6       11
            <35                 71       77       18       95
SRR                             1.0      1.       6.2      2.5

X² for odds ratio = 1: 9.71 (p = 0.008)
X²(1) (Mantel extension for trend): 5.15 (p = 0.02)

Modeling by logistic regression:
                          Odds ratio     95% CI          p value
HBsAg                       2.81         0.92-8.59       0.07
DMF exposure index 1        1.23         0.31-4.81       0.77
DMF exposure index 2        6.16         1.53-24.79      0.01

* Abnormal liver function defined by ALT ≥ 35 IU/L.
SRR = standardized rate ratio (see Chapter 8 for description); ALT = alanine aminotransferase; IU/L = international units/liter.
(The top half of the table demonstrates stratification; the bottom half is an example of multivariate analysis by modeling.)
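To make the stratified analysis concrete, the following sketch pools the contrast between DMF exposure index 3 and index 0 across the two HBsAg strata using the Mantel-Haenszel estimator. The counts come from Table 7.3, but the code is only an illustration of the technique, not the authors' original computation.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio over a list of 2x2 strata.
    Each stratum is (a, b, c, d): exposed cases, exposed non-cases,
    unexposed cases, unexposed non-cases."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# ALT >= 35 as the "case" state; DMF exposure index 3 vs. 0
strata = [
    (3, 4, 2, 19),   # HBsAg(-) stratum
    (3, 14, 2, 52),  # HBsAg(+) stratum
]
print(round(mantel_haenszel_or(strata), 1))  # about 6.2, close to the SRR above
```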
Table 7.4 Characteristics of descriptive and causal studies in common epidemiological studies.

1. Research goal
   Descriptive study: Find facts or describe facts.
   Causal study: Falsify a causal relationship or hypothesis.
2. Measurement
   Descriptive study: Measure one or more characteristics and calculate one or more rates in only one population.
   Causal study: Calculate 2 or more rates to obtain a rate ratio (or difference) for at least 2 populations. One should also measure all other determinants of outcome.
3. Validity
   Descriptive study: Assess the representativeness of the sample (the validity with which inferences drawn from the sample describe the source population).
   Causal study: Examine whether an alternative determinant can explain the effect of exposure (usually represented as a rate ratio or difference), i.e., detect any confounding.
4. Reliability
   Descriptive study: Select a sampling method which minimizes cost or maximizes efficiency (usually the smallest sample size).
   Causal study: Consider validity first and then the most efficient sampling method.
5. Temporality
   Descriptive study: Conduct a cross-sectional study. (Conduct sampling and measurements simultaneously; there is no need to wait a period of time for the development of the effect.)
   Causal study: Conduct a longitudinal study (conceptually involving the passage of time for the development of the effect, even in a cross-sectional study). Observed populations can be either dynamic (possible turn-over of members) or cohort (no exclusion based on subsequent exposure experience).
6. Inference
   Descriptive study: Draw mainly statistical inferences, from sample to population only.
   Causal study: Draw mainly causal inferences, devoid of any spatial and time limitation, and consider evidence from other refutation attempts or empirical studies of the same hypothesis.
For example, suppose one wants to know the overall prevalence rate of smokers among Americans. One may perform separate random sampling of black and white Americans. Assuming that the ratio of black to white people is 1:9, one may then weight the sampling results accordingly to obtain the overall prevalence. The main concern of such a study design is whether the samples from both groups represent their source populations; this is a descriptive study. However, if one's goal is to determine whether smoking causes lung cancer, then one is attempting to falsify a causal hypothesis. Instead of choosing samples from both groups of people, one usually performs the study on either white or black people, to avoid any possible confounding produced by mixing the two racial groups. Likewise, one usually performs a smoking experiment on a pure strain of rats or other rodents, because mixing different strains requires invoking the unfounded auxiliary hypothesis that different strains share the same susceptibility to smoking. Thus, since one's main goal in a causal study is to falsify a hypothesis, one should prioritize the control of confounding over the representativeness of the sample. Whether the result obtained from the sample of white Americans, e.g., that smoking causes lung cancer, can be generalized to the black population is a separate issue. We may draw some conjecture after looking at outside studies and evidence, such as differences between races in the lungs, in defense mechanisms against chemical carcinogens, in cell components, etc. If there are no biochemical or physiological differences, then one may conclude that the same effect may occur, despite differences in skin color. However, in the case of studying whether or not sunlight causes skin cancer, one cannot apply the inference drawn from a sample of white people to a black population, because different racial groups contain different amounts of melanin pigment in their skin cells, which results in different levels of sunlight absorption. Let us consider a similar example. If one finds that smoking increases lung cancer among males, then one may infer that smoking has the same effect on females, because there are no known biochemical or toxicological differences between the lungs of the two genders. However, smoking's effect on coronary artery disease in males may not be the same as in females, because it has been found
that young females are less likely to develop coronary artery disease before menopause. Thus, one needs to examine the age strata before drawing any inferences. In conclusion, the validity of descriptive inferences depends on the representativeness of the sample; therefore, one should be very careful in the sampling process. The validity of causal inference, however, depends on eliminating any confounding. To achieve this goal, the sampling process should focus on the comparability of the exposed and non-exposed groups in terms of other determinants of the outcome. Whether such a causal inference can be generalized to other populations is a separate issue and requires additional outside evidence and knowledge.

7.2 Principles of study design for descriptive studies
To apply the concepts and principles of descriptive studies to the process of study design, I have listed the following steps as a reference guide:

Step 1. Define the objectives of the study.
The first issue in study design is to set up goals or objectives. One should specify the characteristics or facts that the study intends to measure. Table 7.1 presents examples of the specific goals or main questions of three descriptive studies.

Step 2. Select a measurement method.
After defining one's objectives, the next issue is to choose accurate and sensitive measurement method(s). In general, one can perform a comprehensive literature search and consult experts in the field for state-of-the-art measurement methods. Then, one can select the most appropriate method given one's limited resources. If there is no method available to measure the new concept, then one may develop a new method, based on the principles discussed in Chapter 5. Consultation with experts is crucial for accurate and sensitive measurements because of an
expert's greater familiarity with the determinants involved in the method of measurement. Furthermore, these methods are not always available in the medical literature or original research papers. In Example 7.2, Hwu et al. (1990) consulted experienced psychiatrists before choosing an appropriate diagnostic tool for alcohol abuse and alcohol dependence. After consultation, Hwu et al. understood that inviting psychiatrists to perform house-to-house interview visits was not feasible, and decided to use questionnaire interviews followed by a random sample of psychiatrist interviews to check validity.

Step 3. Choose a study population.
When one defines the goal of a study, one usually has a particular study population in mind. In Example 7.1, the study population was confined to Taipei residents; in Example 7.2, Taiwanese aborigines; and in Example 7.3, press-proofing workers in Taipei factories. However, sometimes one may have several populations from which to select. Under these circumstances, I recommend accuracy of measurement as the first priority, followed by resource limitations, such as accessibility and budget constraints. In Example 7.2, we chose only 3 out of 9 ethnic groups because of limitations in manpower and time.

Step 4. Decide on the sampling method for selection of subjects.
The main goal is to have a representative sample of the source population. (Chapter 9 also focuses on this issue.) Theoretically, one aims for a random sample, which can be used to infer the parameters or characteristics of a population, by virtue of repeated sampling and the central limit theorem. In reality, however, this is often costly and not feasible, because random selection usually requires sampling over a wide area. Thus, a quasi-random sample, i.e., a sample selected in a manner unrelated to the outcome of interest, is often substituted. For example, are patients with diabetes mellitus who come to a medical center for treatment a random sample of all diabetics? Of course not. Can a sample of patients from a medical center represent all diabetics? The answer to the second question
may not always be negative. If the determinants for entering a medical center are unrelated to the measurement or outcome of the study, then the sample may be regarded as quasi-random. If the selection mechanisms for diabetic patients entering a medical center are unrelated to the patients' response to insulin treatment, then the results of such treatment in this medical center should also be applicable to diabetic patients elsewhere. Another issue that falls under this step is the number of subjects to be selected. In general, one is limited by budget and resource constraints. For a small outbreak investigation, one may try to interview as many involved people as possible. For other studies, one should try to sample most efficiently, which is discussed in Chapter 9. A consultation with a statistician may be necessary for a more complicated sampling scheme.

Step 5. Decide on the method of data analysis.
One should consider methods of data analysis during the study design stage, to form a comprehensive view of the study right from the beginning. This is very similar to an architect consulting a structural engineer during the design stage to determine whether the proposed structure will stand. Consultation with a statistician at an early stage can prevent researchers from collecting useless data and can possibly sharpen their methods of measurement, especially when using questionnaires. Many statisticians nowadays are also trained or have some experience in the field of epidemiology. As a result, they can serve as another expert in the field, to help oversee the study and insure that any waste of resources is avoided.

7.2.1 How to draw inferences if the response rate is not high?

Despite careful study design, one may still come across a low response rate, e.g., below 30%, in the intended study sample. Some of the most frequently asked questions are: "How high should the response rate be to draw a valid inference?" "Must I obtain up to 70% or 90%?" "Will a response rate of 30% be representative enough for drawing any conclusions?" To answer these questions, let us first examine the determinants of measurement errors.
Suppose that in an epidemiological study, W1 denotes the proportion of respondents and W0 the proportion of non-respondents; Y is the overall mean of what we want to measure in the source population; Y1 denotes the mean of the respondent population and Y0 the mean of the non-respondent population; y1 denotes the sample mean of the respondents; n1 denotes the size of the respondent sample; and S1 denotes the standard deviation of the respondent sample. Then the MSE (mean square error) of y1 as an estimate of Y is as follows (Snedecor and Cochran, 1980):

    MSE(y1) = E(y1 − Y)²
            = E(y1 − W1·Y1 − W0·Y0)²
            = E[y1 − (1 − W0)·Y1 − W0·Y0]²
            = E[(y1 − Y1) + W0·(Y1 − Y0)]²
            = S1²/n1 + W0²·(Y1 − Y0)²

(The cross term vanishes because E(y1) = Y1.) Since S1²/n1 is the variance of y1 as an estimate of Y1, it can be obtained directly from the collected data. The additional term, W0²·(Y1 − Y0)², indicates the two other major components of the magnitude of MSE(y1): the proportion of non-respondents (W0) and the difference between the means of the respondent and non-respondent populations (Y1 − Y0). If either of these two factors is close to zero, then MSE(y1) ≈ S1²/n1, and one can directly infer the population mean from the respondent sample. In other words, if the proportion of respondents is high (W0 ≈ 0), or if there is little difference between the respondent and non-respondent populations (Y1 − Y0 ≈ 0), then the respondent sample at hand is representative. When the proportion of respondents is low, one may attempt to examine the extent to which the respondents and non-respondents differ. If their difference is not large, i.e., Y1 − Y0 is small, then one can still draw inferences from the respondent sample. However, if Y1 − Y0 seems large, then one may want to perform
direct interviews or household visits with a small random sample of non-respondents, e.g., 10-30 in size, to see how much difference actually exists between the two populations. Alternatively, one may compare the distributions of the determinants of the measurement content, such as age, sex, educational level, etc., between respondents and non-respondents. If the patterns or distributions of these determinants are similar, then one has some evidence to infer that Y1 − Y0 may be small. Moreover, the statement that a response rate is "low" raises the question: how low is too low? Is 70% or 50% too low? The answer depends on the actual prevalence rate of the characteristic in the total population. If the true prevalence rate is high, e.g., more than 20-30%, then a response rate of 70-80% will be high enough to avoid excessive bias. However, if the true prevalence rate is less than 5-10%, then even a response rate as high as 80% may still leave a large bias, because the people with the characteristic under study may all belong to the non-respondent proportion (> 20%). In fact, one may obtain data from other studies on the determinants involved in the two populations, in order to draw an indirect conclusion. Let us see how the above principles can be applied to practical research.
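Before turning to the examples, here is a small numerical sketch of the MSE formula above; all inputs are hypothetical and chosen only to show how non-response enters the error.

```python
# MSE(y1) = S1^2/n1 + W0^2 * (Y1 - Y0)^2, with hypothetical inputs.

def mse_respondent_mean(s1, n1, w0, y1_mean, y0_mean):
    """Mean square error of the respondent sample mean as an estimate
    of the overall population mean."""
    sampling_variance = s1 ** 2 / n1                       # S1^2 / n1
    nonresponse_term = w0 ** 2 * (y1_mean - y0_mean) ** 2  # W0^2 (Y1 - Y0)^2
    return sampling_variance + nonresponse_term

# 70% response rate; respondents and non-respondents differ by 2 units
print(mse_respondent_mean(s1=10.0, n1=500, w0=0.3, y1_mean=50.0, y0_mean=48.0))
# Same response rate, but no difference between the two populations:
# only the sampling variance S1^2/n1 remains
print(mse_respondent_mean(s1=10.0, n1=500, w0=0.3, y1_mean=50.0, y0_mean=50.0))
```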
Example 7.1 Taipei's emergency medical service
In the study of the disease pattern of and demand for Taipei's emergency medical service, we successfully reviewed 7,314 medical records out of a stratified random sample of 8,231, giving us a response rate of 89%. In only one hospital, we obtained a low response rate (22.1%) due to loss of its regular logs for emergency registration during June 1982 - January 1983. Because our study was retrospective, it would have been impossible for the hospital to have prior knowledge of our study. Thus, we assumed that the loss of registration logs was most likely unintentional and that the disease patterns during the period of loss and no loss were similar. A separate concern was whether the non-respondent proportion (11%) happened to include more cases of one particular disease, which only occupied a small fraction of the source
population, e.g., < 3%, as in Table 7.2. Because none of the major hospitals with high response rates, i.e., more than 93%, showed such a trend in their disease patterns, we considered the respondent sample to be representative. Hence, we regarded our statistical inference as acceptable.
Example 7.2 Alcoholism among Taiwanese aborigine groups
To study the prevalence rates of alcohol abuse and alcohol dependence among the Atayal, Paiwan and Yami ethnic groups, we performed stratified random sampling. 1,555 of the 2,113 interviews originally planned were valid, indicating a response rate of 74%. Moreover, by comparing the distribution patterns of respondents with non-respondents by age, sex and educational background, we found no statistically significant differences. Therefore, we tentatively concluded that the sample was probably representative. Let us see some more examples.
Example 7.8 Chinese herbal medicine
In our study of the potential adverse effects of Chinese herbal drugs on fetuses, we asked pregnant women who entered Taipei Municipal Maternal and Children Hospital for prenatal visits about their history of taking Chinese herbal drugs. In the process, we encountered the possibility that women were using different names (Antaiyin vs. Thirteen mixtures) to describe the same medication. Since Chinese herbal drugs are not fully standardized under one system, we attempted to falsify this conjecture by conducting a questionnaire survey of all 575 Chinese herbal drug stores and 267 Chinese herbal doctors' clinics in Taipei (Shu et al., 1987). After 3 rounds of mailings and telephone requests, the response rate was 19.8% among the Chinese herbal drug stores and 29.6% among the Chinese herbal doctors, with an overall response rate of 22.9%. All respondents agreed that the two names indicated the same medication. To determine the reasons for non-response, we randomly sampled 84 non-respondents from the two groups (drug stores and doctors) and contacted them by telephone or direct interview. The results are summarized in Table 7.5. Only
about one fifth of the non-respondents refused to respond at all. One third of the non-respondents were too busy to answer, suggesting that their non-response was unrelated to the measurement. The rest (about 45%) were irrelevant to the measurement, as they did not prescribe the drugs or were out of town for a long period of time. Although our response rate was as low as 20%, we were able to draw some conclusions by obtaining evidence that showed similarity between the respondent and non-respondent populations with respect to the facts or characteristics measured. Even if indirect evidence shows that respondents have traits or tendencies pointing toward a different result, one may still report the results or estimates, but with added comments about the non-respondents. Otherwise, one may present a comparison of the determinants of outcome for respondents and non-respondents, to draw a more reasonable conclusion about the possible direction and magnitude of the bias of the estimate. Let us examine Example 7.9 (Jang et al., 1994).
Table 7.5 Reasons for non-response to the inquiry that Antaiyin and Thirteen Mixtures, the most commonly used herbal medications, were the same (reported with permission, Shu et al., 1987).

Reasons for no response                      Chinese herbal clinic   Chinese herbal drug store   Total (percentage)
Too busy to respond                                8 (28%)                 21 (38%)                  34.5%
Refusal to answer                                  6 (21%)                 12 (22%)                  21.4%
No active prescription for pregnant women          3 (10%)                 12 (22%)                  17.8%
Doctors or pharmacists out of town                 7 (24%)                  3 (5%)                   11.9%
Doctors on sick leave                              3 (10%)                  0                         3.6%
Others                                             2 (7%)                   7 (13%)                  10.7%
Total                                             29 (100%)                55 (100%)                84 (100%)
Table 7.6 Comparisons of blood lead levels, age, job, and duration of employment in battery recycling workers, examined and unexamined (Wang et al., 1998; expressed as mean ± 1 S.D.).

                              Examined                                        Unexamined
Job              Blood lead   No. of    Age        Duration of        No. of    Age        Duration of
categories       (μg/dL)      workers   (years)    employment (days)  workers   (years)    employment (days)
Furnace            87.1         14      37 ± 10      349 ± 328*         19      35 ± 5       553 ± 261
Maintenance        82.8          3      30 ± 4       259 ± 243*          2      37 ± 1       410 ± 218
Dissecting         69.2         10      35 ± 8       450 ± 225          14      36 ± 9       324 ± 306
Refining           64.2          6      31 ± 4       559 ± 166           3      29 ± 2       679 ± 311
Crane operator     64.1          6      44 ± 4*      544 ± 376           3      36 ± 4      1007 ± 838
Field cleaner      95.4          4      43 ± 1*      698 ± 229*          2      49 ± 1       133 ± 14
Office cleaner     48.5          6      50 ± 3       548 ± 248           0        —             —
Office guard       38.4          5      52 ± 13      960 ± 861           8      43 ± 20      465 ± 336
Salesman etc.       8.6          5      27 ± 6       149 ± 136           0        —             —

* p < 0.05
Example 7.9 Lead poisoning in a lead recycling smelter
In a cross-sectional survey of lead poisoning in a lead recycling smelter, 64 out of 110 workers received a complete physical examination. Forty-eight percent of the examined workers fulfilled the diagnostic criteria for lead poisoning proposed by Cullen et al. (1983). As shown in Table 7.6, workers who came in for examination had a shorter duration of employment. Some examined workers alleged that many workers heavily exposed to lead failed to come because they were afraid of losing their jobs if shown to have high blood lead levels. Moreover, a high proportion of workers with low exposure risk, such as outside salesmen, office cleaners and office workers, volunteered for
examination. Thus, our evidence showed that the respondents (examined) were somewhat different from the non-respondents (non-examined). Accordingly, inferences can only be drawn for the respondent group. Because field workers, who were employed longer, were more likely to have higher blood lead levels and develop lead poisoning, the respondent group might display a lower than actual prevalence of lead poisoning. Thus, the prevalence rate of lead poisoning of 48% might be an underestimation.

7.3 Principles of study design for causal studies
Study design for causal studies includes setting up objectives, choosing appropriate measurement methods for the outcome and the exposure of interest, and selecting the exposed and non-exposed populations. Because the validity of causal studies depends mainly on the control of confounding, we shall first examine the formation of confounding and then discuss how to prevent or control confounders during study design.

7.3.1 Formation of confounding

Confounding occurs if there is any extraneous or external determinant that can partially or totally explain the effect found in a causal study (Miettinen and Cook, 1981; Miettinen, 1985a; Rothman, 1986; Greenland and Robins, 1986; Greenland, 1996). Accordingly, the measurements of effect, either rate ratio or rate difference, will be partially or totally attributable to these alternative causal hypotheses. The original concept of confounding can be traced back to the process of conjectures and refutations in scientific research, which was discussed in detail in Chapters 2-4. In brief, one needs to falsify or refute all possible hypotheses one by one through empirical studies. If in any such attempt or empirical study one fails to falsify one of two or more causal hypotheses, leaving a mixed effect, one has failed to achieve the goal of one's study. This concept of confounding is essentially broader than that found in the collapsibility criteria (Yanagawa, 1979; Boivin and Wacholder, 1985), in which confounding is only considered once there is empirical evidence of the
mixing effect of alternative determinants. In this book, one also considers the potential existence of confounding when one has not taken care to control or disprove causality by extraneous factors. By taking this broader view, one will actively search for any alternative determinants during one's study, and then attempt to prevent and control any possible confounding effect. Let us examine some examples.

Example 7.10 Air concentration of sulfur dioxide and asthma near a refinery

In a study to determine the association of the air concentration of sulfur dioxide with the frequency of asthma in a neighborhood near a refinery, Lin et al. (1981) obtained the data shown in Table 7.7. Apparently, there seemed to be a linear association between the prevalence rate of asthma and elevated concentrations of SO₂. However, there was a possibility that an alternative explanation could account for this association. For example, since the authors of this study collected all cases of asthma, they may have erroneously included cases that developed before the afflicted moved into the study community. Moreover, they may have included some asthma cases that were work-related, because many of the adults in the refining region also worked in the nearby petrochemical factories. Finally, the study did not take family tendency for asthma into account. Until all these alternative hypotheses are clarified or falsified, one cannot draw any conclusions.

Example 7.11 Jet noise and high mortality rate

In a study to determine the association between high mortality rate and exposure to jet noise, Meecham and Shaw (1979) found that residents near an airport showed a higher mortality rate in comparison with the entire Los Angeles area. However, because they did not adjust for different age, sex and ethnic groups in calculating such rates, they could not rule out potential confounding. When other investigators (Frenches et al., 1980) standardized the data for the above determinants, the difference between the mortality rates of the residents near the airport and all of Los Angeles disappeared. Therefore, the high mortality was simply a result of uncontrolled confounding.
Table 7.7 Association between frequencies of asthma and air concentration of sulfur dioxide (SO₂) in a community near a large petrochemical factory (summarized from Lin et al., 1981).

Area               Average air conc. of SO₂     No. of cases    No. of population       Prevalence
                   (mg/day/100 cm² PbO₂)        with asthma     residing in the area    rate of asthma
Refining region             1.58                     35                3,800                .0092
Dormitory A-1               0.90                      3                1,152                .0026
Dormitory A-2               0.88                      1                  395                .0025
Dormitory B-3               0.78                      6                2,564                .0023
Dormitory B-4               0.75                      5                2,243                .0022
Dormitory B-5               0.73                      2                1,550                .0013
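The prevalence rates in Table 7.7 are simple proportions, and the short sketch below reproduces them from the case counts and population sizes (included only to show the calculation).

```python
# Prevalence rate = cases / population, for each area in Table 7.7.
areas = {
    "Refining region": (35, 3800),
    "Dormitory A-1": (3, 1152),
    "Dormitory A-2": (1, 395),
    "Dormitory B-3": (6, 2564),
    "Dormitory B-4": (5, 2243),
    "Dormitory B-5": (2, 1550),
}
for area, (cases, population) in areas.items():
    print(f"{area}: {cases / population:.4f}")  # e.g., 35/3800 = 0.0092
```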
In fact, if one considers research conducted in experimental biology, the concept of confounding becomes clearer. Assume that the experimental group (the exposed group in epidemiological research) is additionally treated with an "x" factor, in comparison with the control (non-exposed) group. The difference in outcomes between the two groups can be attributed to the "x" factor (treatment) if all other determinants of the outcome are the same between the two groups, as shown below:
    Experimental group:   O + x  →  ⊗
    Control group:        O      →  O
    (comparison of outcomes between the two groups)
All other determinants of outcome should a priori be known and under strict experimental control so that the two groups have the same distribution of such determinants. This condition is more readily achievable in animal experimentation, in which one utilizes inbred strain animals to control
constitutional factors and set up a totally controlled environment. However, due to ethical concerns, one cannot experimentally control a human population; one can only make observations or, at most, set up double-blind randomized trials. Thus, if any major determinant of outcome is differentially distributed between the two groups, confounding may result.

7.3.2 Characteristics of confounders and their control

A confounder is, a priori, a causal determinant of the outcome under study. Furthermore, it is associated with the exposure of interest, meaning that it is differentially distributed among the exposed and non-exposed populations. In Example 7.10, cases of occupational or hereditary asthma might produce confounding if they also resided in the area with high SO₂ concentration, the exposure under study. Likewise, people who developed asthma elsewhere and then moved into the study community produced confounding, because they chose to live in the exposed area. In Example 7.11, since the distributions of age and sex of residents near the airport were significantly different from those of the entire Los Angeles area, these factors were mixed with the exposure to jet noise, resulting in confounding. Thus, to detect potential confounding in a study, one should list all causal determinants of the outcome of interest and examine them one by one, to see if any one of them is associated with the exposure. On the other hand, non-causal determinants do not produce any confounding. In Example 7.10, the investigators worried neither about the color of clothing people wore nor about body height, as these are not causal determinants of asthma. Similarly, the researchers in Example 7.11, studying the association of high mortality rate with exposure to jet noise, did not worry about the above characteristics either, since they were not determinants of mortality. Even in physical or chemical experiments, one is only concerned with known determinants of the outcome and seeks to control them during the experiment. Moreover, a determinant of exposure that happens to be associated with the outcome of interest still cannot be viewed as an alternative explanation,
because the association may be indirectly linked through the exposure (Rothman and Greenland, 1998). For example, socio-economic status is a determinant of smoking and happens to be associated with the incidence of lung cancer. However, because socio-economic status first affects smoking rates, which in turn affect lung cancer rates, socio-economic status cannot be considered a potential confounder if smoking is controlled. So, to save time and resources, one should examine and control only the causal determinants of the outcome, instead of also considering all determinants of exposure. One should prevent alternative causal determinants from linking with the exposure by restriction in the design stage, or control them by stratification or modeling in the analysis stage. The latter involves attempts to distribute the causal determinants comparably between the exposed and non-exposed populations. Thus, strategies for controlling confounding can be summarized as follows:

Strategy 1. Restriction of enrollment of subjects (see also Section 11.4)

Restrict the selection of the exposed and non-exposed groups during the design stage so that they are comparable in alternative causal determinants. Suppose one wants to study whether smoking causes lung cancer; one may restrict the study to only the white male population, to avoid potential confounding from mixing female and/or nonwhite people. Moreover, one needs to select a comparable non-exposed group (the placebo/sham group in clinical trials or animal experiments), to rule out other causal determinants. Conceptually, such a non-exposed group should possess the same distribution or pattern of all causal determinants of the outcome as the exposed group, all except the exposure of interest, as summarized in Table 7.8. Therefore, an investigator should first perform a comprehensive literature search to find all causal determinants of the outcome.
Table 7.8 How to select suitable non-exposed groups to prevent or control confounding, as demonstrated through various types of causal research. This is generally accomplished by comparably distributing all alternative causal determinants of outcome.

Animal experiment:
  Example: painting rabbit ears with tar to produce skin cancer.
  Comparability of effects: tar group vs. non-tar group (painted with the solvent used to dissolve the tar).
  Comparability of contrasted populations (constitutional, life style, socio-behavioral, occupational and environmental): selecting inbred strains.
  Comparability of measurements: double blind measurement.

Clinical trial:
  Example: treating peptic ulcer patients with cimetidine.
  Comparability of effects: cimetidine group vs. placebo group (the placebo contains everything used to formulate the cimetidine tablet but does not contain the cimetidine molecule itself).
  Comparability of contrasted populations: randomizing, then comparing causal determinants between the two groups.
  Comparability of measurements: double blind measurement.

Follow-up or cohort study:
  Example: determining the association between asbestos and lung cancer.
  Comparability of effects: asbestos workers vs. non-asbestos workers (e.g., cotton workers exposed to fibers that do not cause lung cancer).
  Comparability of contrasted populations: factors affecting job entry and exit should be unrelated to lung cancer, or at least comparable between the two groups. The non-asbestos workers should not possess any characteristics which are causal determinants of lung cancer, e.g., a greater number of smokers.
  Comparability of measurements: measurement of outcome and collection of data should be unrelated to the exposure.
I have reclassified the determinants of outcome (see also Chapter 6) into 3 categories (Wang and Miettinen, 1982):
(1) Effect(s) caused by the exposure of interest.
(2) Characteristics of the population (constitutional, life style, socio-behavioral, occupational and environmental factors) that may influence the outcome. Obviously, the selection procedures for study subjects have a direct impact on these characteristics.
(3) Measurement or data collection methods.
In order to avoid missing any important determinant, one should thoroughly search for causal determinants among these 3 categories. Moreover, such considerations help one to recognize that the ideal "reference" group is not simply characterized by non-exposure, but is also comparable with the exposed group in terms of all other alternative causal determinants.

Strategy 2. Standardization, stratification or modeling

Apply standardization (adjustment), stratification or modeling in the data analysis. Conceptually, in all of these processes, one attempts to divide or stratify the population into several smaller groups or strata, according to different causal determinants. Within each specific group or stratum, the exposed and non-exposed groups should then be comparable. The standardization procedure involves the summarization of the different strata into an overall figure, which can also be achieved by other procedures such as Mantel-Haenszel estimation (Mantel and Haenszel, 1959; Mantel, 1963) or maximum likelihood estimation. Modeling involves a similar concept of holding all other factors equal to see how much the outcome will change according to a change of exposure status. In a study to determine whether kerosene was the major cause of dermatitis among ball-bearing workers (Example 7.12, Jee et al., 1985), other causal determinants of dermatitis (gender, age, use of protective cream and hand gloves, habits of hand-washing, dish-washing and laundry, etc.) and occupational exposures were evaluated. Because all 79 ball-bearing workers were young females, we restricted our non-exposed group to females of similar age, educational background and salary, by selecting 263
zipper-manufacturing workers unexposed to kerosene. Thus, we chose a non-exposed group comparable in effects, contrasted populations and measurements, as summarized in Table 7.9. In Example 7.4, we found that paraquat manufacturers in Taiwan developed hyper-pigmentation or freckles on sunlight-exposed areas of their bilateral forearms. To control the possible confounding effects of sunlight and age, we stratified the exposed and non-exposed populations according to these two determinants, in addition to the duration of exposure to the centrifugation and crystallization processes, as shown in Table 7.10. The data indicated that the longer the exposure duration during the centrifugation and crystallization processes, the more likely the development of hyper-pigmentation. A logistic regression analysis of the same data showed a similar result. Because a stratified analysis allows the reader to visualize the analysis and is more intuitively appealing, I recommend such a tabulation. However, a multivariate analysis, which simultaneously considers many determinants of outcome to measure the individual magnitude of each determinant's effect, can help explore how these factors interact. The bottom half of Table 7.3 is one example of a multivariate analysis by modeling. (See also Tables 11.5 and 11.6.) In fact, both strategies 1 and 2 can be combined and applied in the same study, just as one can draw both descriptive and causal inferences in a single study. In a study to determine whether higher levels of lead absorption occurred among workers in an iron-forging factory near a lead recycling factory, we (Chao and Wang, 1994; Example 7.13) deliberately chose workers from another iron-forging factory, 20 km away, as the reference or non-exposed group. Since the two groups performed the same job, the requirement of comparability of effects was satisfied. We excluded truck drivers and workers on the job for less than 2 months, because of their smaller exposure dose and the likelihood that their blood lead levels had not yet reached a steady state in such a short period of time. Furthermore, we stratified the two samples into outdoor and indoor workers, because working outdoors raised the likelihood of exposure to lead dust from the nearby factory and local traffic. In our analysis, blood lead was measured without knowledge of the exposure status. Thus, comparability of effects, contrasted populations and
measurements were achieved.
Table 7.9 Characteristics of the exposed (ball-bearing factory) and reference (zipper manufacturing company) groups (Jee et al., 1985).

                        Exposed group                            Reference group
Number of workers       79                                       263
Sex                     Female                                   Female
Age (years)
  Range                 16-26                                    15-29
  Mean                  18.9                                     20.2
  SD                    1.9                                      3.7
Monthly income          NT$ 7,000ᵃ                               NT$ 7,000ᵃ
Educational level       Elementary school only, in most cases    Elementary school only, in most cases
Major contactantsᵇ      Kerosene, steel                          Plastics, textiles
Minor contactantsᶜ      Antirust oil                             Scissors, lubricants

a. NT$ stands for new Taiwan dollars. NT$ 7,000 is equivalent to USD 200.
b. Any substances to which the workers were heavily exposed daily.
c. Substances to which the workers were lightly and only occasionally exposed.
We also needed to address one more potential confounder: the possibility of lead contamination from the iron-forging process itself. During our study, we found a higher proportion of smokers among the reference workers than among the exposed. Thus, if there had been any risk of lead contamination in the iron-forging process, the non-exposed group might have shown a higher lead content, due to picking up and smoking cigarette butts that had rested on possibly contaminated surfaces. However, the statistical summary of the data, in Table 7.11, showed conversely that blood lead levels were higher among the exposed workers. By taking care of potential confounders and sampling two groups of workers comparable in effects, population characteristics and measurements, we were able to conclude that exposure to a nearby lead recycling factory increased the risk of lead contamination. Similarly, in a
study exploring the effect of mixtures of solvents on liver function, we applied two modeling techniques to control potential confounding by viral hepatitis infection, alcoholism, and body mass index. Because both the general linear model and the logistic model showed a consistent trend, we concluded that exposure to mixtures of solvents would mildly impair liver function, as evidenced by elevated total bile acid in serum (Chen et al., 1997).

7.3.3 How to detect confounding or systematic bias in a study

In the literature on systematic bias or confounding (e.g., Michael et al., 1984; Sackett, 1979), one may frequently encounter many terms, such as selection bias, recall bias, diagnostic accuracy bias, ecological fallacy, Hawthorne effect, etc. Although each bias has its own definition and may evolve from different conditions, all of them can be viewed as one type of mixing of effect, i.e., confounding. Since only causal determinants of outcome have the potential to become confounders, one can try to detect bias by first listing all causal determinants. As discussed in the previous section, one can look for alternative determinants by analyzing the effect of exposure, population characteristics (including constitutional, lifestyle, social, occupational and environmental differences), and measurement methods, as in Table 7.8. Namely, one can examine each causal determinant to see if it is associated with the exposure. During the study design stage, one should conduct a comprehensive literature search and consult experts in the field to list all major causal determinants ever reported. Then consider strategies to control confounding, either by restrictive selection of subjects or by stratification and modeling in the data analysis. The basic principles for the control of confounding in both follow-up and case-control studies are essentially the same, with some added features for the latter, which will be discussed in Chapter 11.
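A simple first screening step, as the text suggests, is to cross-tabulate each candidate causal determinant against exposure status. The following is a minimal sketch with hypothetical records; the variable names are illustrative only.

```python
# Sketch: checking whether a candidate confounder (here, smoking) is
# associated with exposure. An unequal distribution across exposure
# groups flags it for restriction, stratification, or modeling.
from collections import Counter

records = [
    # (exposed, smoker) -- hypothetical worker records
    (1, 0), (1, 1), (1, 0), (0, 1), (0, 1), (0, 0), (1, 0), (0, 1),
]
table = Counter(records)
for exposed in (1, 0):
    smokers = table[(exposed, 1)]
    total = smokers + table[(exposed, 0)]
    print(f"exposed={exposed}: {smokers}/{total} smokers")
```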
Table 7.10 Number of workers with hyper-pigmented macules stratified by duration of exposure to centrifugation and crystallization processes, sunlight and age (Wang et al., 1987).

Exposure to sunlight   Age     Hyperpigmented   Duration of exposure to crystallization and
(hours per week)               macules          centrifugation processes (months)
                                                0      1-6     >6     Total†
<4                     <31     Yes              0      0       1      1
                               No               18     0       0      18
                       31-45   Yes              1      0       0      1
                               No               21     2       0      23
                       46-60   Yes              4      2       17     23
                               No               14     4       1      19
                       >60     Yes              0      1       0      1
                               No               1      0       0      1
4-12                   <31     Yes              2      0       1      3
                               No               15     2       0      17
                       31-45   Yes              3      3       4      10
                               No               20     0       0      20
                       46-60   Yes              4      0       3      7
                               No               24     4       0      28
                       >60     Yes              2      0       0      2
                               No               1      0       0      1
>12                    <31     Yes              2      2       0      4
                               No               5      2       0      7
                       31-45   Yes              1      3       5      9
                               No               11     0       1      12
                       46-60   Yes              3      1       2      6
                               No               9      1       1      11
                       >60     Yes              0      0       2      2
                               No               1      0       0      1
Total                          Yes              22     12      35     69
                               No               140    15      3      158

χ²(1) (Mantel-Haenszel) = 74.32*
χ²(1) (Mantel extension for the trend) = 61.9*
Odds ratio: point estimate (Mantel-Haenszel) = 12.5; 95% confidence interval (test-based) = 6.7-23.4
* p < 0.001
† Does not include one worker with unclassified lesion.
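The Mantel-Haenszel summary odds ratio reported under Table 7.10 pools stratum-specific 2 × 2 tables. The following is a minimal sketch of that computation; the strata below are hypothetical stand-ins, not the study's actual sunlight-by-age strata.

```python
# Sketch of the Mantel-Haenszel summary odds ratio across strata.
# Each stratum is (a, b, c, d): exposed cases, unexposed cases,
# exposed non-cases, unexposed non-cases. Values are hypothetical.
strata = [(12, 5, 4, 20), (8, 3, 2, 15), (15, 4, 3, 10)]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
print("Mantel-Haenszel OR:", num / den)
```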
Table 7.11 Comparison of blood lead and general characteristics of workers in two iron-forging factories by Wilcoxon rank sum test (Chao et al., 1994).

                              Exposed factory                Reference factory
                          No. of                         No. of
                          workers  Median   Range        workers  Median   Range
Working indoors (total)     11     13(a,b)   6-21          11      5(c)     4-5
  Male                       6     15        6-21           4      5        4-5
  Female                     5     11        9-14           7      4        4-5
Working outdoors (total)    25     24(a)    10-49          36      5.5      4-20
  Male                      24     23       10-49          31      6        4-20
  Female                     1     40        -              5      4        4-5
Total No.                   36                             47
Age (years)                 33 ± 9.6                       35.1 ± 11.5
Duration of work (years)    3.0 ± 1.7                      6.7 ± 5.4
% Male                      83.3                           74.5
% Male smoker               43                             57

a. p < 0.001, if compared with the reference factory.
b. p < 0.001, if compared with working outdoors.
c. p > 0.1, if compared with working outdoors and stratified by sex.
d. p = 0.003, if compared with workers employed for more than 2 months by the same factory.
7.4 Summary
Based on the inferences made, empirical research can be classified into two types: descriptive and causal. In descriptive studies, one defines the objectives, chooses accurate and sensitive measurement methods, selects a representative population through proper sampling methods, and then goes to the field to measure and collect the data. If the response rate is very low, then one may try to perform random sampling of the non-respondents to determine if non-response is related to the effect or items of measurement. If all determinants involved in the measurement process and all determinants of outcome are similar for respondents and non-respondents, then one can draw conclusions about the source population from the small group of respondents. Otherwise, one should only draw inferences about the respondent population, which must be specified or characterized. For causal studies, one should prevent or control confounding during the study design stage. Confounding occurs if there are any alternative causal determinants which can partially or totally explain the effect of exposure. Since a confounder is inherently a causal determinant of outcome, one should search for all such determinants through a comprehensive literature search and consultation with experts. Then, one should examine the distribution of each causal determinant among the exposed and non-exposed populations, in order to determine if any determinants are associated with the exposure of interest. After uncovering the potential confounders in one's study, one can prevent confounding by restrictive selection of subjects, or control it by stratification and/or modeling in the data analysis. The main theme of this chapter is also summarized in Figure 7.1.
Define the objectives, then perform a literature search and consult experts.

For descriptive studies:
1. List all determinants of the measurement and its contents.
2. Choose accurate, sensitive and feasible measurement methods.
3. Select the population to be measured.
4. Decide on a sampling method.
5. Decide on methods for data analysis.
6. Go to the field to carry out the study.

For causal studies:
1. List all causal determinants of outcome and all determinants of the measurement.
2. Choose accurate, sensitive and feasible measurement methods for exposure, outcome and other causal determinants of outcome.
3. Select the exposed and non-exposed groups, according to strategies for controlling potential confounders.
4. Decide on a sampling method for the exposed and non-exposed populations, based on the strategies: restriction of the sampling population, or stratification and/or modeling in the data analysis.
5. Decide on methods for data analysis.
6. Go to the field to carry out the study.

Figure 7.1 Diagram on how to design descriptive and causal empirical studies.
Quiz of Chapter 7

1-5. 300 cases were collected and followed up for stroke every 3 months for 1.5 years, and the following life table data were obtained. Please fill in all the blanks of the table.

Interval   No. lost to   No. withdrawn   No.       No. entering   No. at   Mortality rate
(month)    follow-up     alive           deaths    interval       risk     per month
1-3        10            10              20        300            290      0.023
4-6        10            5               15        ___            ___      ___
7-9        10            10              10        ___            ___      ___
10-12      15            10              5         ___            ___      ___
13-15      25            10              5         ___            ___      ___
Please write down the probability (%) that the assertion is true. You will be scored according to the credibility of your subjective judgment.

Score %
6. When the response rate is not high, e.g., 42%, then the result cannot be generalized to the entire population.
7. In a study of lung cancer among asbestos workers, if one has not collected data on smoking, one cannot rule out the possibility of smoking as an alternative cause. Thus, one may have difficulty in drawing a conclusion.
8. Even if the response rate exceeds 85% or even 90%, one still cannot draw a definite inference on a divorce rate of 5%, because many people who are divorced may not have responded.
9. If all other determinants of outcome for the exposed and non-exposed are comparable, there will be no confounding in the conclusion.
10. In causal studies, we are always looking for alternative hypotheses that have not yet been refuted.

Answers: (6) F (7) T (8) T (9) T (10) T
(1)-(5) Please see Table 6.8 for details:

No. at risk = No. entering interval − 1/2 (No. censored), where No. censored = No. withdrawn alive + No. lost to follow-up

Mortality rate per month = No. of deaths / Total no. of person-months
                         = No. of deaths / [3 months × (No. at risk − 1/2 (No. of deaths))]

Interval   No. lost to   No. withdrawn   No.       No. entering   No. at   Rate per
(months)   follow-up     alive           deaths    interval       risk     month
1-3        10            10              20        300            290      0.0238
4-6        10            5               15        260            252.5    0.0204
7-9        10            10              10        230            220      0.0155
10-12      15            10              5         200            187.5    0.0090
13-15      25            10              5         170            152.5    0.0111
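The life-table arithmetic above is mechanical enough to script. The following is a minimal sketch reproducing the answer table from the quiz inputs.

```python
# Sketch reproducing the life-table calculation of the quiz answer.
intervals = ["1-3", "4-6", "7-9", "10-12", "13-15"]
lost      = [10, 10, 10, 15, 25]
withdrawn = [10,  5, 10, 10, 10]
deaths    = [20, 15, 10,  5,  5]

entering = 300
for i, name in enumerate(intervals):
    censored = lost[i] + withdrawn[i]
    at_risk = entering - censored / 2
    # 3-month interval; those dying contribute half the interval on average
    rate = deaths[i] / (3 * (at_risk - deaths[i] / 2))
    print(f"{name}: entering={entering}, at risk={at_risk}, rate={rate:.4f}/month")
    entering -= censored + deaths[i]
```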
Chapter 8
Adjustment and Standardization of Rates

8.1 Components of crude rates
8.2 Adjustment or standardization of rates
8.3 Standardized mortality or morbidity ratio (SMR) and indirect standardization
8.4 Precision maximizing weight
8.5 SMR computed from a case-control study
8.6 Summary
Introduction

As discussed in Chapter 6, the basic measurements of epidemiology include rate, incidence rate, and prevalence rate. Their purpose is to measure the characteristics (states or events) of a population. Since a population is generally heterogeneous, with people possessing a variety of characteristics (different ages, genders, and occupations), there are many determinants which can explain an overall crude rate. In contrast, age-, sex-, and ethnic-specific rates are simpler because certain determinants have been ruled out in the etiologic study. However, often for administrative purposes, a crude rate, such as the crude mortality rate of a country, may provide a simple summary of a heterogeneous group. Each crude rate is a summary of different component rates weighted by some population distribution. When looking at two or more different rates, epidemiologists must pay careful attention to how crude rates are weighted in order to make valid comparisons. In this chapter, we will first discuss the components of crude rates and their adjustment or standardization. Then, we will consider the special case of the standardized mortality ratio (SMR) and indirect standardization, which actually utilizes a set of weights closely approximating the precision maximizing weight. Finally, we will look at how to obtain SMR through a
case-control study design. Throughout the discussion, incidence rate will be used in our examples, but these examples can also be generalized to other types of rates, such as prevalence rates.

8.1 Components of crude rates (Miettinen, 1972a)
According to the definition of a crude incidence rate (see Section 6.3), it is the total number of new cases divided by the total number of population-time at-risk. Assume that aᵢ is the number of new cases and Nᵢ is the number of population-time at-risk in the i-th stratum; then the crude incidence rate can be expressed as:

Crude incidence rate = Σᵢ aᵢ / Σᵢ Nᵢ

In fact, the crude incidence rate can also be viewed as a summarization or summation of a group of individual specific rates of each i-th stratum (aᵢ/Nᵢ), weighted by the corresponding population-time at-risk (Nᵢ), i.e.,

Σᵢ aᵢ / Σᵢ Nᵢ = Σᵢ Nᵢ(aᵢ/Nᵢ) / Σᵢ Nᵢ
For example, in Table 8.1, the crude mortality rate of employees in financial institution X is the total number of deaths divided by the total population-time at-risk: 50/(29,335 × 1) year⁻¹ = 1.70 × 10⁻³ year⁻¹. This crude rate can be rewritten as an expression of component age-specific mortality rates, as follows:
[(9,399 × 1) × (5/(9,399 × 1)) + (7,402 × 1) × (4/(7,402 × 1)) + ... + (53 × 1) × (1/(53 × 1))] ÷ [(9,399 + 7,402 + ... + 53) × 1 year] = 1.70 × 10⁻³ year⁻¹

Similarly, one can use the strata-specific component rates to calculate the crude rates of a teacher's association, a hypothetical occupational group, and the general population of Taiwan as 1.57 × 10⁻³ year⁻¹, 3.25 × 10⁻³ year⁻¹, and 5.86 × 10⁻³ year⁻¹, respectively. Here, the population distributions for each age-specific group serve as the corresponding weights (namely, Nᵢ = Wᵢ) needed to obtain the crude rates. If one applies a population distribution outside of the group in question, such as the general population of Taiwan, as weights for the set of component rates, then these rates have undergone an adjustment.
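This weighting can be sketched in a few lines of code. The example below uses the figures for financial institution X and the general-population weights from Table 8.1; the function-free layout is a minimal illustration, not a general standardization library.

```python
# Sketch: crude and directly age-standardized rates from Table 8.1.
deaths = [5, 4, 6, 4, 7, 12, 11, 1]                      # financial institution X
pop    = [9399, 7402, 4200, 2801, 1965, 1926, 1589, 53]  # person-years by age
gp_pop = [1768454, 1430984, 901802, 903857,
          802202, 786726, 655426, 467117]                # standard weights (GP of Taiwan)

crude = sum(deaths) / sum(pop)
rates = [d / n for d, n in zip(deaths, pop)]
standardized = sum(w * r for w, r in zip(gp_pop, rates)) / sum(gp_pop)
print(f"crude: {crude*1000:.2f} per 1,000 person-years")       # ~1.70
print(f"standardized: {standardized*1000:.2f} per 1,000 p-y")  # cf. Table 8.1
```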
8.2 Adjustment or standardization of rates
Because each crude rate has its own individual set of component rates and weights, the comparison between two or more crude rates may be confounded by the different weighting schemes. For example, in Table 8.1, all age-specific mortality rates (see columns labeled ASMR) of the teacher's association are the same as those of another hypothetical occupational population. However, their crude rates are very different: 1.57 vs. 3.25 × 10⁻³ year⁻¹. This difference results from the different population counts underlying each individual rate. For instance, there are more people in the 30-34 age bracket in the teacher's association than in the same age group of the hypothetical occupational population. Thus, even though the two groups have the same age-specific rate for each age stratum, these rates are weighted differently. The crude mortality rate of the teacher's association is as follows:

[(26,627 × 0.30 × 10⁻³ year⁻¹ + 31,897 × 0.44 × 10⁻³ year⁻¹ + ... + 748 × 4.01 × 10⁻³ year⁻¹)] ÷ [26,627 + 31,897 + ... + 748] = 1.57 × 10⁻³ year⁻¹
The crude rate of the hypothetical occupational population is as follows:

[(6,667 × 0.30 × 10⁻³ year⁻¹ + 9,091 × 0.44 × 10⁻³ year⁻¹ + ... + 27,181 × 4.01 × 10⁻³ year⁻¹)] ÷ [6,667 + 9,091 + ... + 27,181] = 3.25 × 10⁻³ year⁻¹

Thus, if one wants to compare two or more crude rates, one needs to apply the same set of weights to their component rates. Such a procedure is called adjustment or standardization of rates, and the set of weights is usually chosen from a "standard" population distribution (Miettinen, 1972b). The formula of adjustment can then be written as:

Σᵢ Wᵢ(aᵢ/Nᵢ) / Σᵢ Wᵢ , where Wᵢ is the weight chosen for stratum i.
In Table 8.1, if we apply the national population distribution as the standard set of weights, then the age-standardized mortality rate of the financial institution X becomes as follows: [[(1,768,454 x 1 x 0.53) + (1,430,984 x 1 x 0.54) + ... + (467,117x1 x 18.88)] xl0-3]-^(l,768,454 + 1,430,984 + ... + 467,117)x 1 year = 3.20x 10'3 year"' Similarly, one can calculate the age-standardized mortality rates for the teacher's association, hypothetical occupational population, and general population as 1.87 x 10"3 year "', 1.87 x 10 3 year"1, and 5.86 x 10"3 year "', respectively. It turns out that the standardized mortality rate of the teacher's association is the same as that of the hypothetical occupational population, because the same set of weights, the general population distribution of
Table 8.1 Comparison of age-specific mortality rates (ASMR), cumulative mortality rates (CMR30-69), age-standardized mortality rates, and standardized mortality ratios (SMR) for employees of a public financial institution X, members of a teacher association, a hypothetical occupational population, and the general population (GP) of Taiwan, 1987. ASMR in 10⁻³ year⁻¹.

         Financial institution X    Teacher association        Hypothetical occupation     General population of Taiwan
Age      Deaths  Pop.     ASMR      Deaths  Pop.      ASMR     Deaths  Pop.      ASMR      Deaths   Pop.         ASMR
30-34       5     9,399    0.53        8    26,627    0.30        2     6,667    0.30       2,444   1,768,454     1.38
35-39       4     7,402    0.54       14    31,897    0.44        4     9,091    0.44       2,635   1,430,984     1.84
40-44       6     4,200    1.43       25    23,548    1.06       11    10,377    1.06       2,558     901,802     2.84
45-49       4     2,801    1.43       31    20,735    1.50       18    11,989    1.50       3,772     903,857     4.17
50-54       7     1,965    3.56       35    13,808    2.53       56    22,135    2.53       5,056     802,202     6.30
55-59      12     1,926    6.23       57    11,838    4.82      117    24,272    4.82       7,494     786,726     9.53
60-64      11     1,589    6.92       44     8,754    5.03      132    26,243    5.03       9,860     655,426    15.04
65-69       1        53   18.88        3       748    4.01      109    27,181    4.01      11,362     467,117    24.32
30-69      50    29,335    1.70      217   137,955    1.57      449   137,955    3.25      45,181   7,716,568     5.86
                          (crude)                    (crude)                    (crude)                          (crude)

CMR30-69:                 179.14 × 10⁻³               93.76 × 10⁻³               93.76 × 10⁻³                    278.99 × 10⁻³
Age-standardized mortality rate
(GP of Taiwan as standard weights):
                          3.2 × 10⁻³ year⁻¹           1.87 × 10⁻³ year⁻¹         1.87 × 10⁻³ year⁻¹              5.86 × 10⁻³ year⁻¹
SMR (GP of Taiwan as reference population):
                          0.47                        0.42                       0.16                            1
Thus, adjusting the age-specific mortality rates to obtain age-standardized mortality rates eliminates the confounding effect of different population distributions, and one can then make more valid comparisons between the two groups. Adjustment or standardization of rates can be regarded as one way to control confounding and has been widely used in the comparison of two or more populations or sets of component rates.

8.3 Standardized mortality or morbidity ratio (SMR) and indirect standardization

When one wants to compare the mortality rates of two populations, adjustment with the same set of weights or population distributions is usually necessary to control confounding. The population of interest is often a cohort with a particular exposure. It is called the index or exposed population, while the comparison group is a population without the exposure and is called the reference or non-exposed population. Here, the reference population will not be referred to as the "control" population, in order to prevent any confusion with the control group of a case-control study. In fact, one can rarely perform a "control" experiment in observational human studies. One of the most commonly used rate ratios for comparing mortality between the index and reference populations is the SMR, originally defined as an observed-to-expected ratio. The observed is the sum of the numbers of deaths observed in each stratum i of the index population (aᵢ), while the expected is obtained by applying the stratum-specific mortality rates of the reference population to the population-time distribution of each stratum i of the index population. Namely,
SMR = observed / expected = Σᵢ aᵢ / Σᵢ N₁ᵢ(bᵢ/N₀ᵢ) = Σᵢ N₁ᵢ(aᵢ/N₁ᵢ) / Σᵢ N₁ᵢ(bᵢ/N₀ᵢ)

where aᵢ = no. of deaths for stratum i of the index population;
bᵢ = no. of deaths for stratum i of the reference population;
N₁ᵢ = no. of person-years observed for stratum i of the index population;
N₀ᵢ = no. of person-years observed for stratum i of the reference population.

In fact, SMR can be regarded as the overall weighted rate ratio of the two sets of component rates, in which the stratum-specific population-time distribution of the index population is used to weight both rates. However, to calculate SMR, one does not need to adjust the stratum-specific rates of the index population. Accordingly, calculating SMR has been labeled "indirect standardization" to distinguish it from direct standardization. As discussed earlier in Section 8.2, direct standardization involves directly multiplying the stratum-specific rates of two or more index populations with the weights of a reference population distribution. SMR, however, is a special case of the more general comparison of overall rates for two populations, in which the same set of stratum-specific weights is applied to the component rates of the index and reference populations. Assuming the same notation as before, the general comparison of two sets of rates can be expressed as:
[Σᵢ Wᵢ(aᵢ/N₁ᵢ) / Σᵢ Wᵢ] ÷ [Σᵢ Wᵢ(bᵢ/N₀ᵢ) / Σᵢ Wᵢ] = Σᵢ Wᵢ(aᵢ/N₁ᵢ) / Σᵢ Wᵢ(bᵢ/N₀ᵢ)

When Wᵢ = N₁ᵢ, this rate ratio becomes the SMR, or observed-to-expected ratio. Because the weights taken for each SMR study are the specific index population distributions of that study, the SMRs obtained for different index populations cannot be directly compared. For example, if one employs the general population as the common reference population for the index populations of the teacher's association and the hypothetical occupation, then the SMRs calculated for each of them cannot be directly compared because the sets of weights assigned are different. The SMRs of the two occupations are 0.42 and 0.16, respectively. They are obtained by the following calculations:

217 ÷ [(26,627 × 1 × 1.38 + 31,897 × 1 × 1.84 + ... + 748 × 1 × 24.32) × 10⁻³] = 0.42

449 ÷ [(6,667 × 1 × 1.38 + 9,091 × 1 × 1.84 + ... + 27,181 × 1 × 24.32) × 10⁻³] = 0.16

As shown in previous sections and in Table 8.1, these two index populations share exactly the same age-specific mortality rates, but possess different SMRs and crude rates because of the different weights employed in the calculation. Similarly, one cannot directly compare the SMR of the financial institution, which is 0.47, with the other two occupational populations for the same reason. Therefore, two or more SMRs can only be compared if they use the same set of weights and reference component rates. If not, then the results are confounded by either one of these two factors.
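The observed-to-expected arithmetic can be sketched directly from the figures in Table 8.1. The example below uses financial institution X against the general-population rates.

```python
# Sketch: SMR (indirect standardization) for financial institution X,
# using the index population's person-years and the general-population
# age-specific rates from Table 8.1.
observed = 50
index_py = [9399, 7402, 4200, 2801, 1965, 1926, 1589, 53]
ref_rate = [1.38, 1.84, 2.84, 4.17, 6.30, 9.53, 15.04, 24.32]  # per 1,000

expected = sum(n * r / 1000 for n, r in zip(index_py, ref_rate))
print("expected deaths:", round(expected, 1))   # ~106.1
print("SMR:", round(observed / expected, 2))    # ~0.47
```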
Because epidemiologists from different
countries still debate the validity of using a universal set of weights for standardization (Doll, 1976), some have recommended using the cumulative incidence rate (CIR; Davis, 1978; Miettinen, 1985a). CIRs can be directly compared if the cumulative age period is the same. The CIR30-69 of each population in Table 8.1 is calculated as follows:

CIR_t1-t2 = 1 − e^(−Σᵢ (age-specific rate)ᵢ × tᵢ)

Since tᵢ = 5 years for each age stratum:

CIR30-69 of employees in the financial institution = 1 − e^(−(0.53 + 0.54 + ... + 18.88) × 10⁻³ × 5) = 179.1 × 10⁻³

CIR30-69 of the teacher's association = 1 − e^(−(0.30 + 0.44 + ... + 4.01) × 10⁻³ × 5) = 93.8 × 10⁻³

CIR30-69 of the hypothetical occupational population = 1 − e^(−(0.30 + 0.44 + ... + 4.01) × 10⁻³ × 5) = 93.8 × 10⁻³

CIR30-69 of the general population = 1 − e^(−(1.38 + 1.84 + 2.84 + ... + 24.32) × 10⁻³ × 5) = 279.0 × 10⁻³

The above CIRs can be directly compared and interpreted because every year is uniformly weighted. When CIR << 0.1, then

CIR_t1-t2 ≈ Σᵢ (stratum-specific rate)ᵢ × tᵢ = Σᵢ (aᵢ/Nᵢ) tᵢ.
For cancer incidence rates, the above assumption is generally acceptable. Thus, Davis (1976) has recommended this simple formula for the international
comparison of cancer incidence rates.
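The CIR formula is a one-liner in code. The sketch below reproduces the calculation for financial institution X from Table 8.1.

```python
# Sketch: cumulative mortality rate CIR30-69 from age-specific rates,
# here for financial institution X (five-year age strata).
import math

asmr = [0.53, 0.54, 1.43, 1.43, 3.56, 6.23, 6.92, 18.88]  # per 1,000/yr
cir = 1 - math.exp(-sum(r / 1000 * 5 for r in asmr))
print(round(cir * 1000, 1))   # ~179 per 1,000, as in Table 8.1
```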
8.4 Precision maximizing weight
From a statistical point of view, to maximize precision, one may choose a set of standard weights which considers the variances of both populations. Under the null hypothesis, the precision maximizing weight depends on the population distributions of both the index and reference populations for stratum i, N₁ᵢ and N₀ᵢ, respectively:

Wᵢ = 1 / (1/N₁ᵢ + 1/N₀ᵢ) = (N₁ᵢ)(N₀ᵢ) / (N₁ᵢ + N₀ᵢ)
If N₀ᵢ >> N₁ᵢ, then Wᵢ ≈ N₁ᵢ. Conceptually, this means that when the index population is much smaller than the reference population, the precision maximizing weight is approximately equal to the index population. Thus, the SMR calculation, which is commonly used to explore a particular occupational population, comes very close to using the precision maximizing weight, because a specific occupational group is generally much smaller than the general population. SMR can also be calculated from a case-control study design, but this will be discussed in the next section. The CIR calculation, however, assumes uniform age-specific rates, which sometimes overweights the older age strata, because these strata usually possess small numbers of population-time at-risk. As a result, epidemiologists can still use both SMRs and CIRs, but they need to be careful about their limitations.
8.5 SMR computed from a case-control study
Although case-control study design is discussed more fully in chapter 11, it deserves some discussion here to give readers a broader understanding of SMR. Because SMR is an observed-to-expected ratio, it can be obtained by a
case-control study without collecting the actual information on population-time at-risk. Consider the example of determining the causal relationship between blackfoot disease and drinking artesian well water (Chen et al., 1988). Blackfoot disease is a peripheral vascular disease involving gangrene of the feet and toes, which was prevalent in four towns of southern Taiwan. For age stratum i, we collected (aᵢ + bᵢ) cases of blackfoot disease and inquired about their histories of drinking water from artesian wells. It turned out that aᵢ cases were exposed and bᵢ cases were non-exposed to artesian well water. We also randomly collected (cᵢ + dᵢ) local residents without the disease but of similar ages and sexes, and asked about their histories of exposure. The numbers of exposed and non-exposed were cᵢ and dᵢ, respectively. Then, without obtaining the population-time at-risk for the exposed (N₁ᵢ) and non-exposed (N₀ᵢ), we could calculate the SMR, as shown in Table 8.2.
Table 8.2 SMR (standardized morbidity ratio) can be obtained by a case-control study design, as illustrated by the study on drinking water from artesian wells and blackfoot disease in four endemic towns of Taiwan (Chen et al., 1988).

                                 Drinking artesian well water
                                 Yes        No         Total
Blackfoot disease      Yes       aᵢ         bᵢ         aᵢ + bᵢ
                       No        cᵢ         dᵢ         cᵢ + dᵢ
Population-time at-risk          N₁ᵢ        N₀ᵢ        N₁ᵢ + N₀ᵢ
Because the sampling procedures for (aᵢ + bᵢ) and (cᵢ + dᵢ) are both unrelated to exposure to artesian well water, the ratio aᵢ/bᵢ estimates the exposure odds for drinking well water of people with blackfoot disease, Aᵢ/Bᵢ; while cᵢ/dᵢ estimates
the exposure odds of the population-time at-risk, N₁ᵢ/N₀ᵢ. Then, SMR can be calculated by summarizing across the i-th strata in a case-control study:

SMR = Σᵢ aᵢ / Σᵢ N₁ᵢ(bᵢ/N₀ᵢ) = Σᵢ aᵢ / Σᵢ bᵢ(N₁ᵢ/N₀ᵢ) ≈ Σᵢ aᵢ / Σᵢ bᵢ(cᵢ/dᵢ)
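The following is a minimal sketch of this computation. The stratum values are hypothetical, not those of Chen et al. (1988); the key point is that only case and control counts are needed, not denominators.

```python
# Sketch: SMR from case-control data, SMR = sum(a_i) / sum(b_i * c_i / d_i).
strata = [
    # (a, b, c, d): exposed cases, unexposed cases,
    #               exposed controls, unexposed controls (hypothetical)
    (30, 5, 40, 60),
    (25, 8, 35, 70),
]
smr = sum(a for a, b, c, d in strata) / sum(b * c / d for a, b, c, d in strata)
print(round(smr, 2))
```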
8.6 Summary
A crude rate can be viewed as a summary of a weighted set of component rates. Therefore, two crude rates cannot be compared unless the sets of weights are the same or similar, which requires an adjustment or standardization procedure. SMR is an observed-to-expected ratio, which is actually standardized with weights taken from the index population for both the numerator and the denominator. Two or more SMRs cannot be directly compared unless both the sets of reference rates and the weights are the same. CIRs, however, can be directly compared if the cumulative period is the same. For statistical efficiency, one may choose to use precision maximizing weights as the standard weights, which the SMR calculation approximates. Moreover, SMR can also be obtained through a case-control study design when one does not have information on the population-at-risk or any other denominator data.
Quiz of Chapter 8

Please write down the probability (%) that the assertion is true. You will be scored according to the credibility of your subjective judgment. Items 6-10 will be scored according to the exact correct answer instead of judgmental scores.

Score %
1. A crude rate is a summarization of individual specific rates weighted by population-time at-risk.
2. There were 3 SMR studies of lung cancer among asbestos workers. The first one, reported from Sweden, had an SMR of 7; the second and third, from the U.S., reported SMRs of 4 and 5, respectively. One can say that the Swedish workers were exposed to a higher concentration of asbestos than the Americans because the SMR of the Swedish asbestos workers was higher.
3. A confounder must be a causal risk determinant.
4. SMR usually uses a set of precision maximization weights, because an industrial cohort is usually much smaller compared with the general population.
5. Data from a case-control study can be used to calculate SMR.
6-10. There are 4 reports with different follow-up periods and recurrences of breast cancer: (A) 10 cases recur out of 100 cases within 6 months; (B) 20 recur out of 100 cases within 1.5 years; (C) 30 recur out of 100 cases within 2 years; (D) 50 recur out of 100 cases within 3 years. Please calculate the average recurrence rates for each study. A recurrence of cancer is an event, so please use the same unit for comparison.
6. (A)
7. (B)
8. (C)
9. (D)
10. Please arrange the above rates in ascending order: ___ < ___ < ___ < ___

Answers: (1) T (2) F (3) T (4) T (5) T (6) 0.20 year⁻¹ (7) 0.13 year⁻¹ (8) 0.15 year⁻¹ (9) 0.17 year⁻¹ (10) (B) < (C) < (D) < (A)
Chapter 9
Introduction to Sampling Method and Practical Applications
9.1 Sampling vs. census method for descriptive studies
9.2 The concept of probability sampling
    9.2.1 Why probability sampling?
9.3 Simple random sampling
9.4 Stratified random sampling
    9.4.1 Principles of stratification
    9.4.2 Estimation of the mean and standard deviation of stratified random samplings
    9.4.3 Determining the sample size in each stratum
9.5 Systematic random sampling
9.6 Other sampling methods: cluster sampling and sampling with probability proportional to size
    9.6.1 Cluster sampling
    9.6.2 Selection with probability proportional to size
9.7 Non-sampling errors
9.8 Summary
Introduction

As discussed in Chapter 7 (Study design), all empirical studies can be classified according to the types of inferences made, either descriptive or causal. In descriptive studies, an investigator aims to gather facts about a population from the sample at hand. The major goal of such studies is to obtain a representative sample so that one can make valid generalizations about the source population. Another goal in conducting a descriptive study is to minimize random errors and/or maximize the sampling efficiency, so that one can draw reliable or precise inferences with minimal cost and/or the smallest possible sample size.
In causal studies, although one is concerned about the sampling method, one is more focused on preventing or controlling confounding. For follow-up studies, selection or loss to follow-up of subjects should not be related to the outcome. In case-control studies, one must carefully select or include subjects unrelated to the exposure to serve as cases and controls. These issues will be discussed further in Chapters 10 and 11. In this chapter, our main focus is on the sampling procedures for descriptive studies. We will begin with the concept of probability sampling, then discuss the issues of simple and stratified random sampling, systematic sampling, cluster sampling, sampling proportional to size and non-sampling errors.

9.1 Sampling vs. census method for descriptive studies
The major goal of a descriptive study is to determine what actually happens in a population. One can achieve this goal either by a census study or by a sample of the population. Although a census study will collect information from every member in the population, it is generally less efficient, more difficult to obtain data of high quality, and sometimes infeasible. For example, it is not practical to measure all the body heights of Taipei citizens. It may even cause a riot if one investigator asks every motorcycle rider in Taipei to keep a diary about how long each one rides per day and whether one wears a helmet each time. Moreover, since a census study may involve a large number of people, it is also more difficult to carry out quality control for measurement and data collection. One must train more personnel to conduct interviews, measurements, etc., which will certainly cost a lot more resources for facilities, equipment, and personnel. Besides, summary data obtained from a census study only have the advantage of a smaller variance than sampling, which is usually not cost-effective. Thus, in daily life, one usually performs sampling to explore the facts about a population to save time and all kinds of resources. Then, how can we conduct sampling in a study?

9.2 The concept of probability sampling

Probability sampling is a sampling method in which the selection of a
subject or unit depends on a predetermined probability, or each unit of the sample space has a predetermined probability to be selected into the sample. Essentially, probability sampling has two characteristics: 1) a collection of sample space {S₁, S₂, ... Sₙ} exists in the source population, in which every sample Sᵢ has a corresponding non-zero probability πᵢ of being selected; 2) the selection of Sᵢ is random. In simple random sampling, each subject or unit of the population has the same probability of being selected. To clarify this concept, let us examine several examples that did not conform to probability sampling.

Example 9.1 Aflatoxin contamination of soybeans
In a study surveying the average level of aflatoxin contamination of soybeans in Taiwan, investigators took samples from stores in 5 major Taiwanese cities. They drove along the highway, entered each city, "randomly" selected 3 stores selling soybeans, and randomly selected soybean samples from each store. Although they attempted to obtain a random sample, their sampling procedure, in fact, was not dependent on any predetermined probability. By choosing stores more accessible from the highway, they introduced a bias into their sample. If this bias could be determined to be unrelated to the exposure, then the sample could still be considered quasi-random. However, if the stores with poorer accessibility to the highway had fewer customers, then they might have stored their soybeans for a longer period of time than the others. This longer storage period might then lead to a higher proportion of aflatoxin contamination. As a result, by only sampling stores closer to the highway, investigators could possibly have underestimated the level of contamination.

Example 9.2 Pollution control regulations
In a study to determine the suitability and effectiveness of pollution control regulations in Taiwanese factories, investigators attempted to conduct a probability sampling. They assigned each of the 358 townships a natural number and used random numbers to draw 30 of them, planning to
interview 10 factories for each township. It was a random sample of the townships. However, when they conducted the site visits, the interviewers decided to visit the 5 factories nearest to the east and west of the train or bus stations along their travel route. Thus, their selection of factories in each town was not based on probability, but depended on a factory's proximity to a transportation station. In the above two examples, investigators failed to assign each member of the population a predetermined probability of selection. Instead, they simply chose their samples according to their own subjective wills and failed to follow a random process. Although the samples might still be quasi-random ones if the investigators' arbitrary selections happened to be unrelated to determinants of outcome, one had better follow a true probability sampling.

Example 9.3 Tuberculosis screening
In a study on tuberculosis screening, the Taiwanese Tuberculosis Control Bureau randomly selected 30 out of 358 townships for chest X-ray screening. However, when the truck carrying the X-ray machine arrived at each selected township, the workers allowed villagers to volunteer for examinations. Since people volunteering for chest X-ray screening tended to have some symptoms of cough, sputum or chest discomfort, the abnormality rate measured was probably an overestimation. Volunteerism is certainly not an example of probability sampling.

Example 9.4 Average IQ of high school students
In this study, investigators attempted to determine the average IQ (Intelligence quotient) of high school students. They asked teachers to pick two of the most "representative" classes to be tested. The teachers in the school had a meeting and chose two classes. Although these two classes might have been representative, they were based on the teachers' subjective opinions instead of a probability sampling.
9.2.1 Why probability sampling?

All of the above four examples failed to conduct probability sampling to some extent. It is important to obtain a random sample because the usual statistical inferences are based on the central limit theorem and the assumption of repeated sampling. Strictly speaking, one must be able to perform repeated random samplings in order to make valid inferences. In any study lacking probability sampling, investigators must invoke the auxiliary hypothesis that their sample is approximate or close to a probability sample, so that they can still infer the mean or variance of the population. Or, at least the sampling procedure should be unrelated to the content or outcome of measurement. In Example 9.1, since the investigators failed to conduct a probability sampling, they must assume that the accessibility of the highway was unrelated to the contamination level of aflatoxin in soybeans. If, however, easy transportation happened to be associated with rapid sales and turnover of soybeans, and consequently less susceptibility to the growth of fungi and their toxins, then the results obtained from the sample in Example 9.1 would be an underestimation of aflatoxin contamination in Taiwan. Similarly, in Example 9.2, a factory near a transportation center might have greater publicity, encountering more protests from people. As a result, their attitudes towards pollution control regulations might not be representative of other factories. Thus, the lack of representativeness raises the doubt that inferences made can be applied to all other factories. In Example 9.3, people with chest symptoms might have been more likely to volunteer for chest X-ray examinations. Thus, the study might have overestimated the prevalence of pulmonary tuberculosis. In Example 9.4, allowing the teachers to decide which two classes to test might have resulted in a representative sample. However, if the IQ results of their students might influence the reputation of some teachers or the school, then these teachers might not necessarily select the most representative ones. Therefore, one can only make valid inferences about the population with probability sampling, without having to make auxiliary assumptions. Moreover, when one examines many different aspects of a population, including political, economic and health-related characteristics, then one will
find it even harder to judge the direction and magnitude of a systematic bias. Thus, probability sampling, if feasible, is recommended. In common clinical research, every hospital or medical center performs research on its own patients and publishes its own summary statistics and results. In fact, they are all based on the assumption that the patients selected into the study are not much different from those of other hospitals with regard to the outcome or contents of measurement. For example, if a study aims to evaluate the therapeutic effect of a newly synthesized insulin drug produced from genetic engineering technology, investigators need to assume that the diabetic patients' reaction to the new drug in the hospital under study is not different from those of other hospitals. Similarly, a study exploring a new regimen of antineoplastic treatment should assume that patients selected for the trial of this new regimen are not different from other patients with the same disease. Because it is not possible to guarantee such an assumption, randomization for clinical trials has been devised to assure that assignment to the treatment group is through a random process. If randomization cannot be done, then one must ensure that the selection of patients into the study is unrelated to the outcome or contents of measurement. One must then consider this sample to be quasi-random, and subsequently make more conservative generalizations from the results.

9.3 Simple random sampling

Simple random sampling is a sampling method which provides an equal and greater-than-zero probability for every unit or subject in the population to be selected. It can be further divided into two types: sampling with and without replacement. The former method allows a selected unit or subject to be immediately replaced back into the population for sampling, which implies that the same subject can be sampled more than once. The latter does not permit such a replacement. In general, if the selected sample is less than 2% of the population, then the two methods yield almost the same results. In practice, one can assign a natural number to each unit or subject of the population, and use any available table of random numbers, or generate them from a computer program, to select a unit or subject into the sample. In
Examples 9.2 and 9.3, there were 358 townships. Each of them was assigned a natural number, i.e., from 1 to 358. Suppose that the first 3-digit number from the table of random numbers selected was 793, which was larger than 358 and discarded. The second was 422 and discarded as well; the third one was 138, included; followed by 162, 100, ..., etc. The first thirty 3-digit random numbers between 1 and 358 were used to select the townships for study. A random sample can be repeated an infinite number of times, which can generate an infinite number of random samples, with a mean and a variance for each sample. By virtue of the central limit theorem, the sampling distribution of the means of repeated random samples is approximately a normal distribution; the mean of such means is equal to the population mean; and the variance of the sample means is equal to the population variance divided by the sample size. Thus, one can infer from a random sample characteristics of the source population. The following expresses these relationships in formulaic terms. Let S.D. (= S) denote the standard deviation of the sample, Yᵢ the i-th member of the sample, in which i = 1, 2, ..., n, and n a limited natural number indicating the size of the sample. The mean is then equal to:

Ȳ = Σᵢ₌₁ⁿ Yᵢ / n, where the degree of freedom is n − 1.

According to the central limit theorem, if one performs repeated random samplings, the mean of the means Ȳ is equal to the population mean (μ), while the standard error of the means, σȲ, is related to the standard deviation S.D. of the sample:
σȲ = (S/√n) · √(1 − φ)

where φ = n/N (N is the population size). If the population has a limited number of units or subjects, then the sampling fraction, i.e., φ, must be corrected for. However, most texts assume that N, the population number, is close to infinity; then φ → 0, and the correction factor 1 − φ can be ignored. For example, if the size of the population is 200,000, 20,000 or 2,000, a sample of 100 or less makes little difference in the correction factor 1 − φ. In fact, when φ < 0.10, √(1 − φ) is usually close to 1. If one samples a population to inquire about a positive health event or state, then one may rewrite the formula as follows. Let p denote the proportion of members with the event or state in the population, and q that without the event or state, so that q = 1 − p; then the standard deviation (S.D.) of p is:

S.D.(p) = √(pq/n) · √(1 − φ)
For example, in a survey conducted among a community of 432 families to determine how many of them regularly watch CNN news, one randomly samples 50 families and finds that 10 out of 50, i.e., p = 0.2, watch CNN news regularly. The standard deviation of the sampled proportion is then:
√((0.2 × 0.8)/50) × √(1 − 50/432) = 0.053

If one omits the correction factor, then it becomes 0.057.
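The following is a minimal sketch reproducing this example in code, including the finite population correction.

```python
# Sketch reproducing the CNN-news example: SE of a sampled proportion
# with and without the finite population correction.
import math

n, N = 50, 432
p = 10 / 50
q = 1 - p
se = math.sqrt(p * q / n) * math.sqrt(1 - n / N)
print(round(se, 3))                      # 0.053
print(round(math.sqrt(p * q / n), 3))    # 0.057 without the correction
```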
9.4 Stratified random sampling
To achieve higher efficiency, one may divide the population into several sub-populations or strata according to the determinants of outcome, so that there is small variation within each sub-population but large variation between or among sub-populations. Then, one may reduce the sample size within each sub-population or stratum.

9.4.1 Principles of stratification

In order to perform stratified random sampling, one needs to obtain prior information about the determinants of outcome to form the basis for stratification. Let us examine the following examples.

Example 9.5 HIV infection in Bangkok
If one wants to study the prevalence rate of HIV (human immunodeficiency virus) infection among the adult population of Bangkok, one attempts to select a random sample. Since the prevalence rate may be around < 5%, one probably needs a big sample size, say 1,000 or more. However, if one possesses information of a higher carrier rate among a younger age group, then one can probably stratify the whole population into 3-4 age strata (e.g., >60, 45-59, 30-44, 15-29) and perform random sampling within each stratum with a reduced sample size for the younger age strata. For example, if the prevalence rates for the old and young age groups are around 1-2% and 3-4%, respectively, then one may allocate more samples to the old age group because their prevalence rates are smaller and variance is larger than those of the younger age groups. In other words, one may improve sampling efficiency by stratifying the population into several relatively homogeneous sub-populations.

Example 9.6 Disease patterns and frequency of use of Taipei's emergency medical services
To determine the disease patterns and frequency of use of the emergency
medical services in Taipei, one might perform simple random sampling of the 11 hospitals providing such services. However, each hospital may have different characteristics and/or emphases in providing services. For example, some hospitals have neurosurgical or burn units; some have a team for extracorporeal circulation; others have MRI (magnetic resonance imaging) instruments; some focus on psychiatric patients, etc. Because the demand for emergency visits for certain diseases, e.g., ophthalmic or psychiatric emergencies, only occupied a small proportion (as previously shown in Table 7.2), investigators decided to treat each hospital as a separate stratum to avoid missing important diseases which make up only a small proportion (Chiang et al., 1986). Thus, stratified random sampling may improve sampling efficiency when compared with simple random sampling. The basis for stratification is generally prior information or prior studies, so that small variations exist within each stratum but large differences exist between or among individual strata, which may be rechecked after empirical samples are obtained.

9.4.2 Estimation of the mean and standard deviation of stratified random samplings

Once sampling is finished, one needs to calculate the overall mean and/or standard deviation by appropriately weighting the original proportion of each stratum. Let Nᵢ denote the total number in sub-population or stratum i, ȳᵢ indicate the mean of stratum i, and N represent the total sum of all sub-populations. Then, the overall mean of the population (ȳst) is calculated as follows:

ȳst = Σᵢ Nᵢȳᵢ / N = Σᵢ Wᵢȳᵢ , (where Wᵢ = Nᵢ / N)

Namely, the sample in each sub-population should be weighted by Wᵢ, which is determined by the proportion of each sub-population, Nᵢ/N.
The variance or standard deviation is as follows. Assume that the sample size and S.D. in the i-th stratum are nᵢ and Sᵢ, respectively. Let φᵢ denote the sampling fraction of stratum i, i.e., φᵢ = nᵢ/Nᵢ; then:

Var(ȳᵢ) = Sᵢ²(1 − φᵢ)/nᵢ

Var(ȳst) = Σᵢ Wᵢ²Sᵢ²(1 − φᵢ)/nᵢ

S.D. of (ȳst) = √( Σᵢ Wᵢ²Sᵢ²(1 − φᵢ)/nᵢ )
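These formulas translate directly into code. The sketch below uses hypothetical stratum sizes, sample sizes, means and SDs to compute the stratified mean and its standard error.

```python
# Sketch: stratified estimate of the mean and its standard error.
# All stratum figures are hypothetical.
import math

strata = [
    # (N_i, n_i, mean_i, sd_i)
    (5000, 50, 12.0, 3.0),
    (3000, 30, 20.0, 5.0),
    (2000, 20, 35.0, 8.0),
]
N = sum(Ni for Ni, _, _, _ in strata)
mean = sum(Ni / N * m for Ni, _, m, _ in strata)
var = sum((Ni / N) ** 2 * s ** 2 * (1 - ni / Ni) / ni
          for Ni, ni, _, s in strata)
print("stratified mean:", mean, "SE:", round(math.sqrt(var), 3))
```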
9.4.3 Determining the sample size in each stratum

To actually perform stratified random sampling, one needs to define the sample size in each stratum. According to Snedecor and Cochran (1980), the optimum allocation of size nᵢ is proportional to the size of each stratum and its standard deviation, while inversely proportional to the square root of the measurement cost:

nᵢ ∝ NᵢSᵢ/√Cᵢ

where Cᵢ is the cost of sample collection and measurement in stratum i, Nᵢ is the number of units or subjects in stratum i, and Sᵢ is the standard deviation of the sample in stratum i. In other words, to maximize sampling efficiency, one needs a larger size for larger strata (or sub-populations) and for larger variation within a stratum, but a smaller size for any stratum with a higher cost of measurement. If the costs of collection and measurement of samples are the same across different strata, then the above formula reduces to:

nᵢ ∝ NᵢSᵢ

In reality, one may not be able to find out the sampling variation or cost of each stratum before performing sampling. However, one can still collect as much prior information as possible by reviewing previous studies, performing a pilot study or inquiring of experts in the field. In Example 9.6, the investigators needed to determine how many medical records should be reviewed for each emergency medical department in the 11 hospitals. According to the above rule, the size of the sample should depend on the number of emergency visits for each hospital. Moreover, more medical records should be reviewed for any emergency department with more variation. Because the medical centers in Taipei generally accepted whatever diseases came in for emergency visits and usually had a larger capacity (see Table 9.1), the investigators sampled the highest proportion of records from these 4 hospitals, as compared to the 4 metropolitan teaching hospitals and the 3 community hospitals. Since they expected approximately 1% psychiatric, otolaryngologic or ophthalmologic emergency visits after a pilot review of emergency records, they finally decided on sample sizes of 1,000, 600-700 and 300-400 for medical centers, metropolitan teaching hospitals and community hospitals, respectively. The total number of samples was approximately 7,000-9,000 medical records. The next step was to decide which dates of the year would be chosen (Gibson et al., 1977). Because there were considerably wide variations between the 4 seasons, each weekday and national holidays, they decided to randomly select 7 days for each season, with 1 Sunday and 6 weekdays. In addition to these 28 days, they also randomly chose 2 out of 11 national holidays. For example, they listed all the Sundays of the spring season and assigned a natural number to each one of them, then used random numbers to select one Sunday. In the same fashion, they chose all the other Sundays and the other 6 weekdays for each season, as shown in Table 9.2.
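The allocation rule itself is simple to implement. The sketch below scales the allocation weights to a chosen total sample size; the stratum figures are hypothetical.

```python
# Sketch: optimum allocation n_i proportional to N_i * S_i / sqrt(C_i),
# scaled to a total sample of n. Inputs are hypothetical.
import math

total_n = 300
strata = [
    # (N_i, S_i, C_i): stratum size, SD, unit cost of measurement
    (5000, 3.0, 1.0),
    (3000, 5.0, 1.0),
    (2000, 8.0, 4.0),
]
weights = [Ni * Si / math.sqrt(Ci) for Ni, Si, Ci in strata]
alloc = [round(total_n * w / sum(weights)) for w in weights]
print(alloc)
```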
Table 9.1 Frequencies of emergency medical visits for hospitals responsible for the Taipei metropolitan area, 1982.

Hospital name                          Emergency visits       %
Medical centers
  NTU* Hospital                            30,526             7.5
  Veteran Hospital                         49,390            12.1
  Chang-Gung Hospital                      37,070             9.1
  Mckay Memorial Hospital                  69,251            17.0
Metropolitan teaching hospitals
  Cathy Hospital                           15,724             3.9
  Chung-Seng Hospital                      21,712             5.3
  Jen-Ai Hospital                          27,070             6.6
  Ho-Ping Hospital                         12,930             3.2
Community hospitals
  Taipei Medical College Hospital           9,997             2.5
  Jen-Gih Hospital                          3,570             0.9
  Tai-Liau Hospital                         4,669             1.1
All others                                125,021            30.7
Total                                     406,930           100

* National Taiwan University
Table 9.2 The 30 days chosen to review the medical records of emergency medical visits based on stratified random sampling, Feb. 1982 - Jan. 1983.

Season   Sunday     Monday     Tuesday    Wednesday   Thursday   Friday     Saturday
Spring   April 25   April 12   March 2    March 10    April 1    Feb. 5     Feb. 13
Summer   May 23     June 14    June 8     June 2      July 22    July 9     May 1
Autumn   Sept. 5    Aug. 16    Sept. 14   Oct. 13     Aug. 5     Aug. 13    Oct. 30
Winter   Nov. 7     Jan. 24    Jan. 4     Nov. 17     Dec. 16    Dec. 24    Nov. 27
The total numbers of emergency medical visits for each hospital within the designated 30 days are listed in Table 9.3. At the NTU (National Taiwan University) Hospital, there were a total of 2,640 emergency visits. Because the investigators expected to sample about 1,000 medical records for this medical center, about 2/5 of the total visits, they performed random sampling to pick 2 out of every 5 consecutive patients. They finished reviewing 1,021 (96.7%) out of the 1,056 medical records originally intended. In the same manner for the other hospitals, they completed about 89% of the medical records intended. Unfortunately, Tai-Liau Hospital lost their log books on emergency visits during June 1982 - Jan. 1983. As a result, the researchers finished only 22.1% of the intended medical records. Because the loss was unintentional, the investigators assumed that such a loss was most likely unrelated to the disease pattern, or was simply a random occurrence. In order to count the total emergency visits for different diseases, they needed to weight the above findings. For example, the sampling fraction of the NTU Hospital was 1,021/30,526 = 3.35%. With 34 head injury patients from the sampled records and assuming only one major diagnosis for each medical record, the estimate of emergency visits with head injury for one year at the NTU Hospital would be 34 ÷ 0.0335 = 1,015. If they allowed for multiple diagnoses (the first 3 diagnoses recorded by the doctor), then it would be 42 ÷ 0.0335 = 1,254. Through such calculations, the investigators set up Tables 9.3 and 7.2. Variance can also be obtained by one of the above formulae.
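The weighting step is a one-line division; the following is a minimal sketch using the NTU figures quoted above.

```python
# Sketch of the weighting step: annual visits estimated as the sampled
# count divided by the hospital's sampling fraction.
sampled_records = 1021
total_visits = 30526
head_injuries_in_sample = 34

fraction = sampled_records / total_visits          # ~0.0335 for NTU Hospital
print(round(head_injuries_in_sample / fraction))   # about 1,015 head-injury visits/year
```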
9.5 Systematic random sampling
Systematic random sampling is another type of probability sampling. Instead of applying random numbers to select every unit or subject into the sample, random numbers are only used once or a limited number of times. For example, if one wants to select 10% out of 1,000 people to be studied, one can pick a random number from 0-9, and simply keep adding 10 to that number to generate the subsequent numbers until all 100 members are selected. Because every unit in the population has an equal probability of 10% of being selected, and the first number is obtained through a random process, it is still a probability sampling.
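The following is a minimal sketch of this procedure for the 10%-of-1,000 example above.

```python
# Sketch: systematic random sampling of 100 out of 1,000 subjects --
# one random start in 0-9, then every 10th subject.
import random

start = random.randrange(10)           # the only random draw needed
sample = list(range(start, 1000, 10))  # 100 subjects, each with P = 1/10
print(start, len(sample), sample[:5])
```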
Table 9.3 Frequency of emergency visits during the 30 days chosen by stratified random sampling, proportion of medical records sampled for review, expected no. of records for review, no. of records completed, and proportion of completion (Chiang et al., 1986, reproduced with permission).

                          No. of emergency   Proportion of     Expected no.   No. of
                          visits during      records sampled   of records     records      %
Hospital name             the 30 days        for review        for review     completed    completion
NTU                           2,640              2/5               1,056         1,021        96.7
Veteran General               4,081              1/4               1,020           929        91.1
Chang Gung                    3,282              1/3               1,094         1,019        93.1
McKay Memorial                6,150              1/6               1,025           956        93.3
Cathay                        1,416              1/2                 708           675        95.3
Chung-Seng                    1,848              2/5                 733           636        86.8
Jen-Ai                        1,909              2/5                 734           660        89.9
Ho-Ping                       1,073              2/3                 715           689        96.4
Taipei Medical College          824              1/2                 412           378        91.8
Jen-Gih                         272              1/1                 272           249        91.6
Tai-Liau                        462              1/1                 462           102        22.1
Total                        23,957               -                8,231         7,314        88.9
The method is simple and does not require a random number table or process to be performed many times. Thus, it can easily be carried out by any personnel in charge of product quality control in a factory, and has, in fact, been widely used. The only concern, however, is the small possibility that such a systematic procedure may be associated with any particular trend of outcome measurement. For example, in a study to determine the average height of high school students, investigators decided to measure the body
height of one out of every 8 students. It turned out that the first number was 2. Traditionally, students of this school were given their registration numbers according to the seats they took, and there were 8 seats in a column. To give everybody a clear visual field, students were arranged in order of increasing height from front to back. So, when the first number was 2 and thereafter 10, 18, 26, etc., the students who were chosen happened to have a lower average height. Although this condition may not occur frequently, investigators applying systematic sampling should be aware of potential problems and take the necessary precautions to check whether the procedure of selection is associated with the outcome of measurement.
9.6 Other sampling methods: Cluster sampling and selection with probability proportional to size
Besides the commonly used method of systematic sampling, there are other ways to perform probability sampling, which often involve complicated methods of calculating variance or standard deviation. Readers may consult a statistician or the standard textbooks written by Cochran (1977) or Kish (1965) for details. For reference, the concepts of two additional sampling methods are provided.

9.6.1 Cluster sampling

This sampling method combines several basic sampling units into a cluster, usually located in a geographic area or administrative vicinity, so that all units of a chosen cluster can be studied with minimal costs. By designating a number to each cluster, one can then perform random sampling on clusters. Cluster sampling is often used for nation-wide interview studies of households or factories, where people or factories are widely distributed. In Examples 9.2 and 9.3, each township can be regarded as a cluster of many households or factories. Investigators can reduce the costs of visits by interviewing several households or factories in the same cluster. However, to provide every unit with the same probability of selection, clusters with different numbers of units should be sampled proportional to their sizes. Thus, if
different townships have a similar number of factories or households, then one can randomly sample 30 out of 358 townships, as in Examples 9.2 and 9.3. However, if there are large variations in size, then one should perform selection with probability proportional to size, as shown below.

9.6.2 Selection with probability proportional to size

In cluster sampling, if the cluster units are of different or irregular sizes, then the probability of selecting each cluster should also differ, so that every unit or subject in every cluster still has the same probability of being chosen. A cluster with a larger size, or more units, should have a higher probability of being selected, namely, probability proportional to size. In a study determining the prevalence rates of alcoholism among aborigines, Hwu et al. (1990) classified the aborigines into 6 sub-ethnic groups (see Table 9.4). They then used cluster sampling to save resources, as the people were widely distributed in a mountainous area. Each "tseung," an administrative unit consisting of an isolated tribe, was defined as a cluster. There were a total of 144 administrative units of irregular sizes. By setting the selection probability proportional to size, they decided that 15, 9 and 2 tseungs would be visited for the Atayal, Paiwan and Yami, respectively. The final results are shown in Table 9.4.
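Selection with probability proportional to size can be sketched in a few lines of Python (the cluster names and sizes are hypothetical; sampling here is with replacement for simplicity, whereas real surveys often use systematic PPS selection without replacement):

    import random

    def pps_sample(clusters, sizes, n):
        # Each cluster's chance of selection is proportional to its size.
        return random.choices(clusters, weights=sizes, k=n)

    tseungs = ["A", "B", "C", "D"]          # hypothetical clusters
    sizes = [1200, 300, 2500, 600]          # population of each cluster
    print(pps_sample(tseungs, sizes, n=2))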
9.7 Non-sampling errors
If one's sampling method is incorrect, then one's estimate may be biased or may carry a large random error. Three other types of errors, unrelated to sampling procedures, also need to be addressed and guarded against: errors resulting from non-response, measurement errors, and errors arising from the poor handling of data, e.g., during abstracting, coding, key-in or editing, as shown in Figure 5.4. The potential problem of non-response was already discussed in detail in Chapter 7, Section 7.2.1. Briefly, we must focus on the proportion of non-response and the difference between respondents and non-respondents. A small random sample of non-respondents with a size of 10-30 may be
Table 9.4  The total population (ages 15-60) and sample subjects of the Atayal, Paiwan and Yami aborigines (Hwu et al., 1990).

    Ethnic group    Sub-ethnic group    Number of tseungs    Sampled tseungs    Total population    Expected sample size*    Valid sample size*
    Atayal                              78                   15                 36,584              793                      1,140
                    1                   28                   5                  12,977              360                      404
                    2                   27                   5                  10,585              222                      330
                    3                   23                   5                  13,022              211                      406
    Paiwan                              62                   9                  27,706              656                      853
                    1                   55                   7                  25,077              550                      736
                    2                   7                    2                  2,629               106                      117
    Yami                                4                    2                  1,735               106                      120
    Total                               144                  26                 66,025              1,555                    2,113

* Expected sample size: the sample size estimated by statistical calculation based on the assumption that alcoholism prevalence ranged from 25% to 67%, after a preliminary study.
* Valid sample size: after quality control of the completed interview forms, 186 invalid forms were excluded. The figures represent the number of valid interview forms used for data analysis.
drawn and more resources per subject may be added to understand the reasons for non-response. If reasons for non-response are unrelated to the outcome or contents of measurement, then the respondents are considered to be representative. Or, one may compare the distribution of determinants between the respondents and non-respondents in order to draw some inferences. If there is a big difference between the respondents and non-respondents, then one should draw conservative inferences about the source population from the sampled respondents and avoid generalizations about the non-respondent population. The issue of measurement error was covered in Chapter 5. In fact, providing a set of gold standards and controlling for all determinants of measurements can help improve the accuracy of measurements. For questionnaires with or without interview, one should try one's best to obtain
the subjects' full cooperation and provide them with some visible gold standards, such as a set of clear pictures or figures, for them to compare and measure. The conceptualization of abstract socio-behavioral constructs should also be tested with empirical measurements. Errors arising from the handling of data can also be detected and minimized by checking the data. For example, one can use a software package or computer program to flag out-of-range or impossible values for each item of measurement. Then, most frame-shift errors occurring during key-in operations can be detected. For minor random errors, one can try additional random check-ups. During the data analysis and refutation stages, one may recheck any outliers or anything that contradicts one's prediction, to insure a true refutation. If one has the resources and time, one may also perform a double data entry check, but this is usually too expensive to be feasible.
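Such range checks are straightforward to script. A minimal sketch in Python (the variable names and plausible ranges are hypothetical):

    # Flag out-of-range or impossible values in each record.
    valid_ranges = {"age": (0, 110), "sbp": (60, 260), "sex": (0, 1)}

    def check_record(record):
        errors = []
        for item, (low, high) in valid_ranges.items():
            value = record.get(item)
            if value is None or not (low <= value <= high):
                errors.append(item)   # flag the item for manual review
        return errors

    print(check_record({"age": 250, "sbp": 120, "sex": 1}))   # ['age'], a likely key-in error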
9.8 Summary
Probability sampling and the central limit theorem are the foundations for statistical inference in descriptive studies. Simple random sampling provides every unit or subject in a population an equal probability of being selected. Stratified random sampling divides the population into several relatively homogeneous sub-populations or strata, within which one performs simple random sampling. The optimum allocation to strata is proportional to the sizes and standard deviations of individual strata, and inversely proportional to the square root of the cost for each stratum. Systematic sampling is a special case of simple random sampling in which the rule of sampling is systematically designed so that every unit still has the same probability of selection. Cluster sampling is often used to save costs and resources when units of the population are widely distributed. If several units are geographically and/or temporally closely related, they can be grouped as a cluster and measured simultaneously. The probability of a cluster being selected, however, should be proportional to its size. As non-sampling errors may develop from a low response rate, inaccurate measurement and/or poor handling of data sets, one must take care to guard against such errors.
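In symbols, the optimum (Neyman-type) allocation just described assigns to stratum h the sample size

    n_h = n × (N_h S_h / √c_h) / Σ_k (N_k S_k / √c_k),

where N_h is the stratum size, S_h the within-stratum standard deviation, c_h the cost per sampled unit in stratum h, and n the total sample size. This is the standard textbook formula (e.g., Cochran, 1977), stated here for reference.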
Quiz of Chapter 9

Please write down the probability (%) that the assertion is true. You will be scored according to the credibility of your subjective judgment.

1. Our current calculations of p-value, mean, variance and confidence interval are all based on the assumption that our sample is a probability sample.
2. If an empirical study is not a probability sample, then it is not representative and one cannot draw inferences about the population under study.
3. To determine the etiologic factors of drug abuse, one collects all cases that attend a substance abuse camp. If one finds that about one-half of these cases grew up in a single-parent household, one can then draw the conclusion that children living with a single parent are more likely to develop substance abuse.
4. In a study surveying the causes for becoming a child prostitute, one collects cases from a nightclub in Bangkok. More than 70% alleged that they voluntarily took on this job. One may conclude that more than 70% of child prostitutes are employed voluntarily.
5. If 1/3 of patients with diabetes mellitus (D.M.) admitted to Ramathibodi Hospital require insulin injection to control their blood sugar, then one can conclude that about 1/3 of all D.M. patients are insulin-dependent.
6. The basic principle for stratification is based on factors that can influence the outcome. For example, since body height is influenced by gender, one needs to stratify the sample by gender.
7. About 85% of malaria patients hospitalized in Mahidol Hospital for Tropical Diseases showed resistance to chloroquine. We, then, may infer that malaria in Thailand shows 85% resistance to chloroquine.
8. A causal inference should be based on both the results of one's study and knowledge outside of one's study.
9. Strictly speaking, only results from a probability sampling can be inferred by the central limit theorem and the concept of repeated sampling.
10. The optimum allocation for stratified sampling depends linearly on the size of each stratum and its standard deviation.

Answers: (1) T (2) F (3) F (4) F (5) F (6) T (7) F (8) T (9) T (10) T
Chapter 10
Follow-up Study
10.1 Classification of follow-up studies
    10.1.1 Experimental studies
    10.1.2 Observational studies
10.2 Principles and practical applications of follow-up studies
    10.2.1 Review of the principles for follow-up studies
    10.2.2 How to conduct a follow-up study in practice
    10.2.3 Example of a cohort study: An occupational mortality study of PVC (polyvinyl chloride) workers
    10.2.4 Loss of follow-up
10.3 Introduction to the principles of clinical trials
10.4 Evaluation for the effectiveness of herbal drugs or traditional medicine
    10.4.1 Clinical trial - The ideal method
    10.4.2 Follow-up study without random allocation - The practical method
10.5 Summary

Introduction

Having discussed in detail the sampling procedures of descriptive studies, we will now examine how to conduct causal studies. As discussed in Chapter 2, a causal study should be guided by the process of conjecture and refutation, in which one attempts to falsify all possible hypotheses. The hypothesis left unrefuted is, thus, determined to be the most valid. Causal studies are also inherently longitudinal in concept because an event or change of state always involves a passage of time. Based on different ways to measure population-time at risk, causal studies can be categorized as follow-up or case-control studies. The major difference between the two types is that follow-up studies try to assess the states of exposure and other determinants of outcome for every individual in the population at risk, while case-control studies only sample a small
proportion of the population at risk as controls to explore the odds of exposure. In this chapter, we shall focus on the principles of follow-up studies and discuss case-control studies later in Chapter 11. We will begin by looking at different types of follow-up studies, then illustrate the principles for performing such studies, and finally apply the same principles to clinical trials, such as the evaluation of herbal drugs or alternative medicine.

10.1 Classification of follow-up studies
In causal epidemiological studies, time is a crucial factor because one aims to observe the occurrence of a health event(s) among exposed and non-exposed individuals who can potentially develop the event. (Review Section 6.2 for the definition of event.) The collection of these individuals is called the population at risk, or simply the candidate population. They are followed and observed over time to determine the frequency of occurrence of the health event. The total amount of person-time observed is the denominator of the incidence rate, while the number of persons who have developed the specific event is the numerator. In a typical follow-up study, the population at risk is divided into exposed vs. non-exposed populations according to the state of exposure to a specific agent, such as smoking, asbestos or viral hepatitis B. Then, the incidence rate of the exposed population is compared with that of the non-exposed to obtain the rate ratio or rate difference, the expressions of exposure effect. Follow-up studies can be further classified into two types: experimental and observational, as indicated in Figure 10.1.

10.1.1 Experimental studies

If the allocation of exposed vs. non-exposed (treatment vs. placebo/sham) groups is dependent on probability or is pre-assigned before observation, then such a follow-up study is an experimental study. Such studies can further be divided into clinical, field or community trials, depending on recruitment from patients with a specific disease, healthy individuals or the community, respectively. Although it is unethical to perform strict experimentation on
Causal study
    - Follow-up study (assess every individual in the population-at-risk)
        - Experimental study (allocate subjects according to probability before observation)
            - Clinical trial (patients)
            - Field trial (healthy individuals)
            - Community intervention trial (community)
        - Observational study (observe what happens to the study subjects)
            - Cohort study (no exclusion due to any change in exposure status): retrospective, prospective or ambi-spective
            - Ecological study (average level of exposure of the group assumed for the individual)
    - Case-control study (assess only a sample of the population-at-risk)

Figure 10.1  Classification of causal studies by various methods of recruiting subjects and performing observation and measurement. The dotted line indicates "quasi-" or an approximation only.
human beings, as one can with animals, one can still perform random allocation of human subjects to specific groups, with their informed consent. In such a randomized trial, one can probably obtain an even distribution of potential confounders among the different treatment groups. Of course, double-blind (or at least single-blind) assessment of outcome is necessary to avoid any bias in favor of either the treatment or placebo group. Such an experimental study is commonly used for the clinical evaluation of a new drug or treatment procedure on patients and is called a clinical trial. If the subjects recruited are healthy individuals, such as in the hepatitis B vaccine (Chen et al., 1987) or β-carotene (Albanes, 1999) trials, then it is usually called a field trial. Although in field trials there is usually no random allocation, the subjects are pre-assigned to groups based on their eligibility. In a community intervention trial, all the people in a community are included in the trial, as in the fluoridation of water or the addition of iodide to salt. Since in both field and community trials recruitment of subjects is not totally dependent on probability, these two types of trials are better classified as observational studies rather than experimental studies.

10.1.2 Observational studies

In most epidemiological studies, investigators simply conduct observational studies. They follow and observe whatever happens to the exposed and non-exposed populations to determine the incidence rates of a specific event(s). Typically, a group of subjects is enrolled in an observational study based on their exposure states, such as smoking, occupational exposures, etc. Such a group of people is often called a cohort. Everyone in the cohort is followed through time until the individual's development of the event of interest, death, or censoring/cessation of the study. Occupational epidemiologists often conduct a retrospective cohort study, in which they study a group of people from a distinct point in time, e.g., the establishment of a company or occupation, up to the present time. A retrospective cohort study is relatively popular because one can gather quite a large amount of person-time data for the denominator. However, one can also conduct a
prospective cohort study, in which one begins enrollment and observation of a cohort at the present time and follows them prospectively through time. The combination of a cohort formed retrospectively with prospective enrollment and follow-up is called an ambispective study. (See Figure 10.1.) Some epidemiologists (Rothman, 1986) consider a follow-up study to be synonymous with a cohort study because a follow-up study is usually observational. However, experimental studies have been included here under the category of follow-up study because they also involve following individuals through time. (See Figure 10.1.) When it is too difficult or expensive to obtain information on every individual in the population at risk, one performs a case-control study. In such studies, one obtains the odds of exposure and the rate ratio by studying a sample of the population at risk. The ecological study is another type of causal observational study. Instead of measuring the exposure state of every individual, it simply assumes that the average amount of exposure, such as the average consumption of fat in a community, represents the intensity of individual exposure. It then attempts to determine the association between the average exposure and the occurrence rate. For example, Doll and Peto (1981) showed that the average consumption of fat in different countries was associated with the age-adjusted mortality rates of breast cancer. Because one does not assess individual exposure states in this type of study, the distribution of exposure among the people in a community may be variable, and the majority of exposure may be clustered among only a small proportion of the population. Thus, it may produce an "ecological fallacy," which means that the effect between the exposure and outcome is overestimated, underestimated or spurious. When the study lacks any information on other causal determinants and cannot control for potential confounding, the ecologic aggregation will even distort the dose-response relationship (Greenland, 1992; Greenland and Robins, 1994). In a recent study on the health effects of air pollution, Hwang et al. (2000) proposed a subject-domain approach, which can take additional information on potential confounders into consideration and can improve the validity of an ecological study. Most epidemiologists categorize an ecological study as a special type of study (Rothman, 1986; Kelsey et al., 1996).
I have attempted to relate ecological studies to cohort studies because in an ecological study, one attempts to make a causal inference based on an indirect measurement of each individual's intensity of exposure. In both types of study, one seeks the level of exposure (or determinants of outcome) of every individual to evaluate the occurrence of outcome. However, since the average level of exposure in a demographically localized population may not truly represent every individual's intensity of exposure, one must always be cautious and conservative when making causal inferences. Another recommendation is to acknowledge that one is not studying the actual population at risk and proceed to conduct a case-control study, in which one obtains a sample of the population at risk. (See Chapter 11 - Case-control studies.)

10.2 Principles and practical applications of follow-up studies
10.2.1 Review of the principles for follow-up studies

Follow-up studies are based on principles similar to those found in the paradigm of animal experimentation. Basically, in these studies one attempts to rule out alternative explanations by selecting a reference (or placebo) group with characteristics comparable to the group under study, except for the determinant of interest. (Refer back to Table 7.7 for a general comparison between animal experiments and follow-up studies, such as clinical trials and cohort studies.) Let us look at an example of animal experimentation to illustrate these principles. If one wants to study whether or not tar painted on the ears of rabbits causes skin cancer, one needs to select a reference or control group, consisting of rabbits taken from the same inbred strain, eliminating genetic factors as a potential confounder. Moreover, one must also paint their ears with the solvents used to dissolve the tar, but without the tar. The rabbits also should not be exposed to other skin carcinogens, such as UV (ultraviolet) light, to obtain comparability of effect. When performing outcome measurements, one should be blind to which group each rabbit belongs, thus eliminating reader bias. By taking these necessary steps, one can eliminate alternative
hypotheses that could explain the outcome, and thus control all potential confounders. These same principles of animal experimentation can be applied to clinical trials. One may achieve a high degree of comparability in clinical trials by selecting a reference group similar to the exposed group in terms of constitutional, lifestyle, socio-economic, and other occupational and/or environmental factors that can influence the outcome. Moreover, by randomly allocating subjects to either treatment or placebo groups, one can eliminate other biases. However, the same principles must be modified to be directly applicable to a cohort study or other human observational studies. Specifically, one cannot randomize subjects during enrollment in a cohort study, but one can perform stratification and/or modeling in the data analysis stage, so that the causal determinants involved are the same, producing comparability of effect. Now, let us consider the above principles in some practical applications.

10.2.2 How to conduct a follow-up study in practice

The following steps should be considered in conducting a follow-up study:

Step 1: Clarify one's objective by asking, "What are the outcome and exposure of interest? Or, what is the causal question that the study attempts to answer?"

Step 2: Clearly define the outcome and exposure, and make valid measurements. For example, to study the association between salt intake and hypertension, one needs to define hypertension and salt intake. Should one use the old WHO definition for hypertension, > 160/95 mm Hg? Or, should one also include borderline hypertension, with systolic pressure > 140 mm Hg and/or diastolic pressure > 90 mm Hg? And how does one measure blood pressure accurately? One also must define salt intake and select a valid
method to measure it. In the study to determine the association between vinyl chloride exposure and liver cancer (Example 10.1), the investigators had to define the diagnosis of liver cancer (Du and Wang, 1998). They needed to ask questions such as, "Is it necessary to have histopathologic proof? Or, will cytologic evidence also count?" Moreover, in analyzing vinyl chloride exposure, the investigators had to consider the induction period. Otherwise, they might have included a large number of unqualified person-years in the denominator, e.g., people with inadequate induction times. For example, if the minimal induction period for liver cancer related to vinyl chloride is 5 years, then they should not have included anyone working less than 5 years in the factory in the population at risk. If no other study has been conducted thus far, then investigators may have to make some assumptions about induction time and perform a sensitivity analysis to show how the magnitude of effect changes under different assumptions.

Step 3: List alternative causal determinants of outcome and make valid measurements. Since all alternative causal determinants are potential confounders, one should quantify them one by one to prepare for stratified analysis and modeling in the data analysis. Even if one does not perform stratified analysis, one still must obtain the data to show that the exposed and non-exposed are similar or comparable in terms of the different causal determinants, or provide evidence to rule out confounding (Robins, 2001). In the study of skin cancer among paraquat manufacturers (Example 7.4), the investigators tried to quantify the proportion of workers ever exposed to other skin carcinogens, such as ionizing radiation, tar, mineral cutting oil, etc. Since none were exposed to these agents, there was no need to stratify for such exposures. However, since many of them were exposed to sunlight, they had to perform a stratified analysis for sunlight exposure in order to control any potential confounding, as in Table 7.10.
Step 4: Select the exposed and non-exposed populations to be studied. One may select the cohort of the exposed to be a group of workers with the same exposure during the period of study and simply take the general population as the non-exposed. However, the general population includes people who are institutionalized and hospitalized. Thus, one may encounter the "healthy worker effect," in which exposed workers seem healthy when compared with a general population that includes the sick. Alternatively, one can employ another group of unexposed workers as the reference group, to avoid any "healthy worker effect" (Wang and Miettinen, 1982). In Example 10.1, the investigators recruited workers from all 5 factories involved in PVC (polyvinyl chloride) polymerization between Jan. 1, 1985 and Mar. 31, 1994 to set up the exposed cohort. For comparison, they selected workers in optical equipment and motorcycle manufacturing as the non-exposed group, because these two occupational groups had wages and smoking patterns similar to the PVC workers (Sterling and Weinkam, 1976; Covey and Wynder, 1981). Of course, the researchers assumed that the reference occupations did not involve other known liver carcinogens. They did not use PVC workers unexposed to vinyl chloride as the non-exposed group because of their small number.

Step 5: Determine the incidence rates for the exposed and non-exposed, and obtain the rate ratio and/or rate difference. In practice, one needs to calculate the denominators (total person-time at risk) for both the exposed and non-exposed groups. This can be obtained by summing up the follow-up person-time for every individual in the cohort. To define a cohort, one simply takes everybody fulfilling the criteria of exposure and adequate induction time, and follows each one until the individual's development of the event, death, or censoring of the study. The incidence rate of the exposed can be compared with that of a general population with a similar distribution of sex, age and ethnic origin, and the SMR (standardized mortality or morbidity ratio) can be calculated. Or, one can apply
modeling for cohort studies (Breslow and Day, 1987). Readers who are more interested in the detailed modeling procedures for data analysis in a cohort study should refer to Breslow and Day's textbook (1987), or Rothman and Greenland's (1998).

10.2.3 Example of a cohort study: An occupational mortality study of PVC (polyvinyl chloride) workers

Let us examine an example of a cohort study to show how the denominator for the incidence rate is calculated. In 1974, the U.S. NIOSH (National Institute for Occupational Safety and Health) was informed by a rubber company that 4 out of about 500 workers had developed angiosarcoma of the liver (ASL) during 1967-1974. As a result, a study was conducted to determine the association between liver cancer and exposure to VCM (vinyl chloride monomer) among these workers. Since ASL is a rare disease in the U.S., with approximately 27 cases per year, there seemed to be an increased rate ratio by crude calculation, using the general population of the U.S. as the non-exposed group:

    rate ratio = [4 / (500 persons × 8 years)] / [27 / (200,000,000 persons × 1 year)] ≈ 7,400

Since then, there have been more than 200 cases of ASL reported from PVC polymerization plants. What is the effect of VCM on Taiwanese workers, a population with a hepatitis B carrier rate of 15-20% (Sung et al., 1984)? Will the hepatitis B virus act synergistically with VCM to produce ASL or other types of liver cancer? To conduct an SMR study in Taiwan, the investigators needed to define a cohort and follow up each individual to obtain the data for person-years at risk. Labor insurance records and death certificates provided two computerized sources from which to check whether each person had ever been hospitalized or had died from liver cancer. Assuming that the cohort was composed of all workers who had ever worked in a PVC polymerization plant from 1969-1993 and knowing that the follow-up ceased
in 1993, let us examine 10 hypothetical PVC workers with different work histories:

No. 1: VCM unloading operator for PVC polymerization. Employed since 1969 at the age of 25, up to the present time.
No. 2: Tank cleaner for PVC polymerization since 1969 (age 20), up to the present time.
No. 3: Pipe fitter (maintenance) worker since 1969 (age 30); left the factory in July 1972.
No. 4: VCM and PVC unloading operator since 1969 (age 25); died 25 years later from liver cancer.
No. 5: Office clerk in the PVC company since 1969 (age 25); now deputy manager of the sales department.
No. 6: PVC reliever since 1974 (age 46); died in an accident in 1976.
No. 7: PVC tank cleaner since 1979 (age 23); currently a foreman of the polymerization process.
No. 8: Control room operator for PVC polymerization since 1969 (age 20); left the company in 1970, but came back to the same job in 1984 and has remained at this job ever since.
No. 9: Bagger and trucker for the PVC factory since 1969 at age 40; retired in 1994.
No. 10: PVC polymerization worker since 1986 (age 20), up to the present time.

The person-years at risk for the above ten hypothetical workers are abstracted and summarized in Table 10.1.
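The bookkeeping behind Table 10.1 splits each worker's follow-up, one year at a time, into calendar-period and age-band cells (a simplified Lexis-table expansion). A minimal Python sketch, using worker No. 1 above (born in 1944, so aged 25 in 1969):

    from collections import defaultdict

    def person_years(entry_year, exit_year, birth_year, periods, age_width=10):
        # Tally follow-up years into (calendar period, age band) cells.
        table = defaultdict(int)
        for year in range(entry_year, exit_year):
            age = year - birth_year
            age_band = (age // age_width) * age_width      # e.g. 27 -> 20
            period = next(p for p in periods if p[0] <= year <= p[1])
            table[(period, age_band)] += 1
        return table

    periods = [(1969, 1978), (1979, 1988), (1989, 1993)]
    # Worker No. 1: followed from 1969 through the end of 1993
    print(dict(person_years(1969, 1994, 1944, periods)))

The output assigns worker No. 1 five person-years to each of five cells, matching the #1-5 entries in Table 10.1.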
Table 10.1  Follow-up person-years for workers of a PVC (polyvinyl chloride) company. #1 = No. 1 worker, #2 = No. 2 worker, etc. The number n following each numbered worker (#N-n) indicates the number of person-years observed in each category. (For example, #1-5 = 5 person-years observed for worker No. 1 in the years 1969-78.)

    Age (yr)    1969-78                        1979-88                              1989-93                        Total person-years
    20-29       #1-5 #2-10 #4-5 #5-5 #8-10     #7-7 #10-3                           #10-5                          50
    30-39       #1-5 #3-10 #4-5 #5-5           #1-5 #2-10 #4-5 #5-5 #7-3 #8-10      #7-5                           68
    40-49       #6-2 #9-10                     #1-5 #3-10 #4-5 #5-5                 #1-5 #2-5 #4-5 #5-5 #8-5       62
    50-59                                      #9-10                                #3-5                           15
    60-69                                                                           #9-5                           5
    Total       72                             83                                   45                             200
From Table 10.2, one can see that workers Nos. 1, 2, 4, 6, 7, 8 and 10 were exposed to high concentrations of VCM (e.g., median exposure > 2 mg/m3), while Nos. 3, 5 and 9 were exposed to low concentrations of VCM because they were only indirectly exposed. Moreover, if one allows for 5 years of minimal duration of exposure and 8 years of latency period, then Table 10.1 can be reduced to Table 10.3. By taking into account induction and latency periods, the total number of person-years at risk in the different categories of year and age is reduced by more than half, from 200 to 95. If one does not count workers with indirect exposure or low concentrations, then the follow-up population-time is reduced even further. When the denominator or person-years are reduced, the incidence rate becomes elevated. In other words, any inappropriate inflation of the follow-up person-years, by including people with low or inadequate intensity of exposure or insufficient induction time, will lead to an underestimation of the incidence rate. For this reason, some cohort studies of big companies, such as nuclear power plants, have been inconclusive. Moreover, some companies have the habit of giving the most hazardous jobs to outside contract workers, which may also lower the intensity of exposure of regular workers. Similarly, the number of new cases (numerator) should only include those with sufficient induction time and within the maximum latency period, to prevent overestimation of the numerator and incidence rate. For example, a patient experiencing diarrhea 1 week after an outbreak of staphylococcus food poisoning cannot be counted as a case of such poisoning, because the period of development of effect has already exceeded the maximum latency period for bacterial food poisoning. Likewise, a brain tumor 1 kg in size cannot be attributed to a radiation exposure that occurred within the previous 1.5 years (Example 4.3), etc. Epidemiologists should take heed to detect such potential biases in their evaluation of the results of published cohort studies.
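The eligibility rule behind Table 10.3 can be sketched as follows, a simplified Python illustration under the stated assumptions (at least 5 years of employment, person-years counted only after an 8-year latency, and continuous employment assumed):

    def eligible_person_years(hire_year, leave_year, end=1993,
                              min_exposure=5, latency=8):
        # Workers employed fewer than min_exposure years contribute nothing.
        if leave_year - hire_year < min_exposure:
            return 0                       # e.g., workers No. 3 and No. 6
        start_at_risk = hire_year + latency
        return max(0, end - start_at_risk + 1)

    print(eligible_person_years(1969, 1994))   # worker No. 1: 17 person-years
    print(eligible_person_years(1969, 1972))   # worker No. 3: 0 (excluded)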
Table 10.2  Personal TWA (time-weighted average) concentrations of VCM for various jobs in five PVC manufacturing factories (Du et al., 1996).

    Job name                      Sample size    Mean (mg/m3)    Median (mg/m3)    Range of TWA (mg/m3)
    Tank supplier                 9              659.67          23.70             5.70-3677.80
    PVC reliever                  10             153.07          47.92             1.04-825.69
    Tank cleaner*                 14             95.57           69.15             0.36-341.88
    VCM unloading operator        2              12.56           12.56             10.23-14.97
    Safety & health specialist    4              12.04           1.74              1.19-22.87
    Foreman                       4              9.04            6.89              1.84-20.59
    Stripper operator             3              4.51            3.37              2.33-7.82
    VCM recovery operator         5              4.38            4.48              0.88-5.93
    Control room operator         8              4.01            3.47              1.04-10.02
    General office personnel      6              3.42            3.47              1.19-7.95
    Field supervisor/manager      4              3.34            2.56
    Maintenance                   3              2.69            1.76              0.85-5.49
    Dryer operator                6              1.84            1.48
    Bagger and trucker            5              0.93            1.09
    Gatekeeper                    2              0.93            0.93

LOD = limit of detection
* The data for tank cleaner were measured for a short-term task lasting only 15-40 minutes.
Table 10.3  Follow-up person-years for PVC workers, assuming 5 years of minimal exposure duration and 8 years of latency period. Other legends are the same as in Table 10.1.

    Age (yr)    1969-78            1979-88                          1989-93                       Total
    20-29       #2-2                                                #10-1                         3
    30-39       #1-2 #4-2 #5-2     #1-5 #2-10 #4-5 #5-5 #7-2        #7-5                          38
    40-49       #9-2               #1-5 #4-5 #5-5                   #1-5 #2-5 #4-5 #5-5 #8-2      39
    50-59                          #9-10                                                          10
    60-69                                                           #9-5                          5
    Total       10                 52                               33                            95
10.2.4 Loss of follow-up

In clinical trials, with a relatively small number of enrolled patients, it is easy to complete a follow-up study. However, in cohort studies or community intervention trials, the number of people involved is generally large, and such studies may consume a large amount of resources. The task may be eased by computerized sources of death certificate data and/or national health insurance records. However, for developing countries or any country without access
to national registrations, this job may be daunting. One should then consider using an alternative method, such as a case-control design. Nonetheless, one should do one's best to avoid any loss to follow-up. If there is a high proportion of loss to follow-up, say > 20%, then one should try to determine the reasons for the loss and ascertain whether they are associated with the exposure or non-exposure (Greenland, 1977). To determine whether the loss is associated with the outcome of interest, one can take a random sample of the people lost and find out whether there is a differential loss of people with the disease of interest between the exposed and non-exposed. Another solution is to perform a sensitivity analysis: by assuming different proportions of the people lost to follow-up to have in fact developed the disease of interest, one can determine the range or variation of the estimate.

10.3 Introduction to the principles of clinical trials

The principles for clinical trials and cohort (observational) studies are the same, except that clinical trials demand random allocation of treatment and double-blind outcome assessment. Double blindness means that both the researcher and the subject are unaware of the treatment assignment. Such additional requirements guarantee that other determinants of outcome are equally distributed between the treatment group and the placebo (sham) group, as long as the number of study subjects is sufficiently large. Thus, it can be considered a proxy for the animal experiment. The most common clinical trials conducted are drug trials, which usually evaluate both survival and quality of life (Staquet et al., 1998). In fact, drug evaluations on humans involve four stages. Each stage has its own unique goal to guard a drug's safety and efficacy before and after its clinical application.

Stage 1: Clinical pharmacology and toxicity tests

Adequate animal experimentation is required before any human experiment to determine the potential toxicity (safety) of a drug. Sometimes, such studies can even help determine the possible effectiveness of a drug. Once a drug is deemed safe in animal models, it can be approved for
clinical pharmacology and toxicity tests. The major goal of stage 1 is to evaluate the toxicity and the PB-PK (physiologically based pharmacokinetic) model inside the human body. Beginning with a single dose, administration of the drug is slowly increased to a multiple-dose schedule, while one observes for any adverse effects. Moreover, one attempts to determine the optimum dose schedule with minimal toxic effect. Throughout the first stage of evaluation, one should follow the guidelines of good laboratory practice (GLP) to assure the quality of data from animal experiments (U.S. Food and Drug Administration, 2001a).

Stage 2: Small group trials

At this stage, one administers the drug to a small group of patients (usually fewer than 100-200) to evaluate its effectiveness and toxicity. The objective of stage 2 is to obtain preliminary efficacy data and an additional safety profile for a comprehensive trial at stage 3. One can refer to Meinert's textbook on clinical trials (Meinert and Tonascia, 1984) and the regulations published by the U.S. FDA (Food and Drug Administration, 2001b), as well as those published by the European Union, as the main references for conducting clinical trials.

Stage 3: Comprehensive drug trials

After passing the first two stages, a drug may be approved for typical comprehensive drug trials, in which it is compared with the best available method of treatment, or placebo treatment if no effective treatment exists. At this stage, patients are randomly assigned to two groups, and researchers blindly assess the drug's effectiveness. Nowadays, both reduction in mortality (increased survival) and quality of life are monitored to determine overall effectiveness. Although the current trend is to assess and compare these two domains independently (Cox et al., 1992), future developments may include a combination of the two functions (quality-adjusted survival) for an overall comparison (Hwang et al., 1996). The results of this stage largely decide whether the FDA will approve the drug for clinical usage. For detailed procedures and data analysis, I recommend looking into other standard textbooks and papers (Pocock, 1983; Peto et al., 1976, 1977) as well as
keeping up with all the new publications from the FDA website. In general, a standard operating procedure called "good clinical practice" (GCP) is required and should be strictly followed (U.S. Department of Health and Human Services, 1997).

Stage 4: Market surveillance

After a drug is approved by the FDA and becomes widely used in clinical practice, physicians and users must still provide active reports to monitor any other adverse effects. Such actions are crucial to guaranteeing a drug's safety. However, this stage is usually not included as part of a clinical trial.

Table 10.4 lists 14 major components of a drug trial protocol, which is submitted to an experimental research or ethics committee for approval. Starting from the background and general goal of the trial, one proceeds to discuss the specific goal of the study, naming the drug and proposing its effect on a specific disease, as well as target adverse effects. Then, one's protocol should clarify the criteria for patient inclusion, the dosage schedule, and the procedures for assessing the effectiveness and toxicity of the drug. The study design outlines the selection of the reference group and how randomization and double-blind assessment are to be accomplished. The procedure should be even more detailed for multi-center trials to assure accurate measurement; otherwise, errors may arise due to poor understanding or coordination. Informed consent is always required to honor the patient's autonomy. One must describe how large the sample size should be and explain how one obtained this number. Then, one must set up a quality assurance/control or monitoring procedure to maintain a standard format for such trials. Sometimes, there are patients who cannot tolerate the drug and want to withdraw during the trial; a standard procedure should be clearly written to guarantee minimal or no bias. At the end, the protocol should state the strategy for data analysis and clarify administrative responsibility, such as decisions on termination, etc. After the protocol is completed, it should be sent to the Institutional Review Board (U.S. FDA, 2001) or committee for human rights to go through a review procedure and guarantee that there is no violation of the ethical code of clinical
trials (Declaration of Helsinki, 1975).
Table 10.4  Major items included in a protocol for a clinical drug trial

1. Background and general goal
2. Specific goal of this trial
3. Selection criteria for patients
4. Therapeutic dose and schedule
5. Method for assessing effectiveness
6. Study design
7. Patient registration and random assignment procedure
8. Informed consent
9. Sample size
10. Monitoring the trial process
11. Information and coding procedure
12. Change or modification of protocol
13. Strategy of data analysis
14. Administrative responsibility
10.4 Evaluation for the effectiveness of herbal drugs and other alternative medical treatments

The ethical foundation of clinical drug trials is based on the principles of patient autonomy and nonmaleficence (i.e., doing no harm). People who prescribe herbal drugs or other alternative medicines, such as homeopathy, acupuncture, etc., may not be so familiar with such principles and may follow a different philosophy in treatment evaluation. In my own attempts to persuade some Chinese herbal doctors to conduct clinical trials, these herbalists seemed rather reluctant to accept Western ways of drug evaluation, especially Western clinical drug trials. Since Chinese herbal drugs have been extensively used in mainland China, Japan, Korea, Taiwan and some
southeastern Asian countries, I propose the following principles to guide follow-up studies for the evaluation of such treatments or drugs. The same principles can also be applied to evaluate other alternative medical treatments. Briefly, one should try to use randomized clinical trials for such evaluations whenever possible. However, if the alternative medicine has been practiced for quite a long time and is accepted as a "quasi-standard" treatment, then a follow-up study with blind assessment of outcome may be more feasible.

10.4.1 Clinical trial - The ideal method

Since many Western drugs, such as digoxin, ephedrine and quinine, are derived from herbs and were found to be effective through well-designed clinical trials, these same standard procedures can be applied to Chinese or other herbal drugs. In other words, randomized clinical trials are required for all kinds of new treatments, as long as they have never been applied to a specific health condition. This work may lead to the discovery of further clinical applications for such medicines. The FDA has even published a guidance for industry on botanical drug products to guide such development (U.S. FDA, 2000). In practical terms, the following steps should be taken first:

1. The formulae for Chinese herbal drugs must be standardized, and quality control must be emphasized to assure a standard composition of active ingredients for each preparation. In other words, CMC (Chemistry, Manufacturing, and Control) guidelines should be strictly followed.
2. The diagnosis of disease, criteria for patient selection and assessment of effectiveness all must be standardized. I recommend that the diagnostic criteria for a disease be adopted from standard Western medicine to avoid mixing different disease entities in a single trial.
3. All the major points of clinical trials or GCP, including random assignment, double-blind assessment, informed consent and ethical committee review, should be included.
10.4.2 Follow-up study without random allocation - The pragmatic way

If double-blind clinical trials are unacceptable to people practicing alternative medicine, it may be more practical to omit random allocation. One can simply follow and observe patients who receive alternative treatments and compare the results with those of an established Western medical treatment or placebo. In principle, if one can prevent or control potential confounding and obtain comparability of effect, then a valid conclusion can still be drawn. In a study of traditional breathing-coordinated exercise among hemodialysis patients, Tsai et al. (1995) demonstrated that such exercise significantly improved the quality of life of patients. In reality, this method (see Figure 10.2) may be more acceptable to practitioners of alternative medicine. The protocol of the pragmatic trial proposed by Walker and Anderson (1999) also shares a similar idea. However, one must pay attention to the following points in order to exclude possible confounders:

1. The diagnosis of disease, selection criteria for patients, and the formula and dosage schedule of herbal drugs should be standardized. One should avoid including symptoms as criteria, because a symptom may be a manifestation of many different diseases, raising many alternative explanations. In fact, since diagnosis in Chinese herbal medicine is generally very crude, I recommend adopting disease classifications from formal Western medicine. The types of body constitution of individual patients can be further classified according to the theory of Chinese traditional medicine, but this information is mainly used for supplemental stratification analysis and tests of traditional medical theory.
2. The definition of effectiveness and/or outcome should be standardized and evaluated by experts blind to treatment allocation. Quality of life cannot be the only outcome variable under single-blind assessment, because it depends entirely on the patient's subjective judgement. Moreover, all other causal determinants of outcome should be listed, defined and measured accurately throughout all treatment courses. Stratification or modeling must be performed in the data analysis to control potential confounding.
3. To obtain comprehensive knowledge of the disease under study, I recommend that experts on this particular disease be included in the evaluation team. Moreover, epidemiologists and statisticians should be invited to participate, since they are experts on design and data analysis. Finally, upholding a falsification attitude throughout the evaluation process is crucial for the success of this strategy.
Figure 10.2  A follow-up method to evaluate the effectiveness of Chinese herbal drugs or other alternative medical treatments: patients with a specific disease (instead of a symptom) are recruited by doctors who explain the alternative treatments, but each patient chooses among herbal treatment, placebo treatment, or other treatment; the outcome is then assessed under single-blind conditions.
10.5 Summary

Causal studies are always longitudinal because the development of a health event (a change of health state) requires the passage of time. A follow-up study measures the exposure, other causal determinants and outcome of every member in the population at risk. Experimental studies pre-assign subjects according to the type of study: a clinical trial uses probability to allocate subjects to different groups and measures the results using double-blind methods, in which both researcher and subject are unaware of treatment allocation. A field trial recruits healthy individuals for study, while a community intervention trial enrolls the people of a community into the study. However, field and community trials are often better categorized as observational studies because subjects are not selected according to a pre-determined probability. An observational study simply follows and observes
every subject in the study through time until the individual's development of the event of interest, death, or censoring of the study. One should measure not only the exposure and outcome of interest, but also other causal determinants of outcome. In observational studies, one should also perform stratified analysis and/or modeling to control potential confounding. An ideal reference (non-exposed) group should have comparability of demographic characteristics, measurement and effect, so that other causal determinants cannot explain the effect under study. In counting the follow-up person-years for an exposed group (denominator), one should take into account induction time, and the duration and intensity of exposure. When counting the number of events that occurred (numerator), one must ensure a sufficient minimal induction time and development of effect within the maximal latency period. Inappropriate inclusion of data in the numerator and/or denominator will bias the measurement of the incidence rate. Since in a clinical trial one cannot always recruit a large number of patients in an isolated center or hospital, due to budget and time constraints, one can employ a multi-center trial, recruit subjects from several hospitals, and then perform stratified analysis/modeling in order to control confounding. A more practical way of evaluating Chinese herbal drugs or other alternative medical treatments is through a well-designed follow-up study; at the least, the assessment of outcome should be blind to the treatment.
Quiz of Chapter 10

Please write down the probability (%) that the assertion is true. You will be scored according to the credibility of your subjective judgment.

1. In follow-up studies, all cases developed starting from the time of cohort definition must be included.
2. A clinical trial can still be viewed as a special type of follow-up study, but subject allocation must be based on probability.
3. The following figure was obtained by a group of U.S. investigators: (a plot of the age-adjusted mortality rate of cardiovascular disease against the average water hardness of each state in the U.S.). We can conclude that increased water hardness is related to the increase in cardiovascular disease.
4. The problem of an ecological study is not only the lack of individual measurements of exposure but also the lack of such measurements for all other determinants of outcome or potential confounders.
5. The comparability of effects requires that exposed and non-exposed are similar in terms of occupational or environmental exposures that may influence the outcome under study, but not including the exposure of interest.
6. The general population is not an ideal reference or non-exposed population because it actually contains people with various exposures. By using it as the reference population, one assumes that the number of people with various exposures is small.
7. The goal of double-blind assessment in clinical trials can be achieved similarly in a cohort study by having the people who assess the outcome blind to the exposure status and the people who evaluate exposure blind to the outcome results.
8. One of the major reasons why herbal medicine is still widely received by a lot of people is that both the herbal doctors who prescribe such medicines and the people who accept them do not know or may not follow the Western bioethical principle of first doing no harm.
9. To determine the magnitude of bias due to loss in follow-up, one may perform a random sampling of the people lost to follow-up and draw on more resources to determine any differential loss of people with the disease of interest between the exposed and non-exposed.
10. A case-control study can be viewed as a follow-up study with the estimation of the exposure proportion of population-time at risk performed by random sampling and conducted prospectively.

Answers: (1) F (2) T (3) F (4) T (5) T (6) T (7) T (8) T (9) T (10) T
Chapter 11
Case-Control Study
11.1 When to use case-control design
11.2 Evolution of case-control study
    11.2.1 Cornfield's concept
    11.2.2 Miettinen's concept of density sampling
    11.2.3 Cumulative incidence sampling
    11.2.4 The proportional hazard model
11.3 Practical selection of cases and controls based on the principle of density sampling
11.4 Control of confounding in case-control studies
11.5 Mortality or morbidity odds ratio (MOR) in occupational epidemiological studies - Valid selection of non-exposed occupation(s) and control diseases
    11.5.1 Valid selection of reference (non-exposed) occupation(s)
    11.5.2 Valid selection of reference cause(s) of death
11.6 Summary

Introduction

In causal empirical studies, exposed and non-exposed populations are recruited to allow one to calculate incidence rates and obtain the rate ratio or rate difference. As discussed earlier, an incidence rate is the number of health events that have occurred divided by the population-time at risk. However, the incidence rate and rate ratio are often difficult to calculate because it may not be feasible to obtain information on the population-time at risk (the denominator). In this chapter, we shall present the concept of the case-control study as a general solution for calculating the incidence rate ratio without data on population-time at risk. Beginning with examples of data sets lacking information on the population at risk, we will then review the development of case-control studies and density sampling (time-matched sampling), and
provide basic principles, real examples and practical advice for selecting cases and controls. Finally, we will discuss in detail a special type of case-control study, the mortality or morbidity odds ratio (MOR) study, and show that case-control and follow-up studies are closely related in terms of issues of validity.

11.1 When to use case-control design

Although one may have information on mortality or the occurrence of disease, one may still lack data on the population-time at risk. In such circumstances, one may conduct a case-control study, in which one obtains a sample of the population at risk and, if necessary, a sample of the cases to calculate the incidence rate ratio. Let us examine the following hypothetical example, in which only the numbers of deaths are provided in a study on traffic-related accidents:

Example 11.1  Traffic accidents in Taipei City

In January 1996, the traffic police department of Taipei City published the mortality statistics for December 1995, as shown in Table 11.1. Based on these data, can one conclude that motorcycles are more dangerous than cars and other commuting vehicles? One probably cannot make such a conclusion without information on the number of people using each mode of transportation and their duration of use during the month under study, i.e., the population-time at risk. From an administrative point of view, one may be willing to invest more health resources in the prevention of motorcycle mortality. However, if one aims to determine etiology, then the data are inconclusive for lack of sufficient information on population-time at risk. Would the number of registered vehicles provide the data needed for the denominator of the incidence rate? Certainly not, since one does not know the corresponding number of users for each type of vehicle, the riding time for each user, nor the frequency of use for each. The real denominator is the total person-time at risk for each vehicle type during that month. Only with these data can one calculate and compare incidence rates for the different types of vehicles.
Table 11.1  Mortality statistics due to traffic accidents in Taipei City, Dec. 1995.

Mode of transportation    No. of deaths    Proportion (%)
Motorcycle                15                50.0
Car                        9                30.0
Truck                      1                 3.3
Bicycle                    2                 6.7
Pedestrian                 3                10.0
Total                     30               100.0
Here are some more hypothetical examples, which lack data on population-time at risk.

Example 11.2
Smoking and NTUH lung cancer patients
During 1991-5, the National Taiwan University Hospital (NTUH) hospitalized about 300 new cases of lung cancer. The distribution of smoking status by gender is summarized in Table 11.2.
Table 11.2  Cases of lung cancer at NTUH (National Taiwan University Hospital) during 1991-5, stratified by gender and smoking status.

             Gender
Smoking      Male     Female    Total
Yes          150      50        200
No            50      50        100
Total        200      100       300
Although two-thirds of lung cancer cases occurred in smokers, one cannot conclude that smokers are more likely to develop lung cancer as this number may not represent the proportion of smokers in the general population.
An alternative explanation for the higher proportion is that smokers with lung cancer might be more likely to go to the hospital than non-smokers. To clarify the association between smoking and the development of lung cancer, one needs to quantify the population-time at risk among smokers and non-smokers so that one can calculate and compare incidence rates. However, it is difficult to determine the population at risk for a specific hospital, like NTUH, because there is more than one teaching hospital in metropolitan Taipei, and as a result, some people at risk in Taipei's general population may choose to go to a hospital other than NTUH.

Example 11.3
Endemicity of Scabies - An example of a proportionate morbidity ratio (PMR) study
In March 1995, a dermatologist noted an increase in the number of cases of scabies in his outpatient clinic. Intuitively, he guessed that there might be a scabies epidemic in the community and asked one of his resident physicians to count the number of cases of scabies in his clinic (see Table 11.3). From this table, the proportion of scabies patients seemed to increase from 5.0% to 12.5%. By dividing the two proportions, he obtained a proportionate morbidity ratio (PMR) of 2.5 (see Table 11.8 for a comparison of the O/E ratio, PMR and MOR). With a 2.5 times increase in the proportion of scabies from the previous year, could he conclude that there was an endemic? Although the total number of patients seen in the dermatology clinic was used as the population at risk, a more accurate population at risk should have included healthy individuals outside of the clinic who were not afflicted by any skin disease. In fact, the total number of patients visiting the dermatology clinic in March could represent the population at risk only if the total incidence rate of skin diseases and the health-seeking behavior of patients did not change from March 1994 to 1995. Because the doctor lacked sufficient information to validate such an assumption, he was unable to determine whether there was an endemic of scabies.
Table 11.3  Frequencies and proportions of scabies in a dermatology clinic.

                                   March 1994       March 1995
No. of patients with scabies       100 (5.0%)       250 (12.5%)
Other dermatologic diseases        1,900 (95.0%)    1,750 (87.5%)
Total                              2,000 (100%)     2,000 (100%)
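The PMR of 2.5 quoted above is simply the quotient of the two proportions in Table 11.3. For instance, in Python:

```python
# Proportionate morbidity ratio (PMR) from Table 11.3:
# the March 1995 proportion of scabies over the March 1994 proportion.
pmr = (250 / 2000) / (100 / 2000)
print(pmr)  # 2.5
```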
All three of the above examples illustrate that, oftentimes, data sets only provide the numerator of an incidence rate. On some occasions, data for the denominator of an incidence rate may exist, but because of limited resources and high costs, one may not be able to collect all of the pertinent data, as in the next example.

Example 11.4
Parental and environmental exposures — Congenital malformation
In a study on the relationship of parental occupational and environmental exposures with congenital malformation, investigators recruited about 12,000 newborns and collected parental histories of exposures (Chen et al., 2000). Although it would have been more accurate to measure the content of polychlorinated biphenyls (PCBs) and lead in the umbilical cord blood of every newborn, the investigators decided that it was simply too expensive to measure every case. Here, the case-control design serves as a feasible alternative, in which investigators need only sample and measure a portion of the newborns to estimate incidence or prevalence rates. In fact, one should perform case-control studies to determine causality when data on the population-time at risk are either impossible or too costly to obtain. In a causal study, one needs to obtain incidence rates for the exposed and non-exposed populations. When one lacks data for the denominators of both rates, a general solution is to perform sampling, which quantifies the
odds of these two denominators. The key issue is how to perform such sampling so that one can accurately estimate this odds with the least number of assumptions. For a more thorough understanding, we shall start with the original concept of the case-control study as developed by Cornfield (1951), followed by Miettinen's (1976) concept of density sampling, and finally the proportional hazard model (Prentice and Breslow, 1978; Breslow and Day, 1980).

11.2 Evolution of case-control study

11.2.1 Cornfield's concept (1951)

In response to the availability of only mortality and morbidity statistics in clinical data, Cornfield, over forty years ago, proposed a method to estimate the relative risk (risk ratio). At the time, he was examining the etiologic factors of lung, breast and cervix cancers (1951). His argument was as follows: Suppose that there were a total of (A+B) cases of lung cancer and one wanted to determine the association between lung cancer and cigarette smoking. Table 11.4 summarizes such a condition:
Table 11.4  Illustration of a case-control study design, in which (A+B) cases of lung cancer were collected and stratified by exposure to smoking.

                       Exposure (smoking)
                       Yes     No      Total
Lung cancer            A       B       A+B
Control                c       d       c+d
Population at risk     N1      N0      N1+N0
In this case, one is primarily interested in quantifying how many cases developed, after a suitable induction time, among the exposed (N1) and non-exposed (N0) populations (A/N1 and B/N0, respectively). Then, one can calculate the risk ratio (RR) = (A/N1)/(B/N0). If N1, N0 and B are all non-zero
natural numbers, then RR can be rearranged to be (A/B)/(N1/N0). The value of A/B can be obtained simply by surveying exposure to smoking among the lung cancer cases. N1/N0 may be estimated by taking a random sample of controls (c+d) from the remaining candidate population (N1+N0) - (A+B) and surveying their exposure odds c/d to smoking. However, in order for c/d to be a valid estimate of N1/N0, the remaining population (N1+N0) - (A+B) should approximate the original population, (N1+N0). Thus, (A+B) must be small, which is true only for rare diseases. Thus, Cornfield's method for the case-control study requires the assumption that the disease of interest is rare. As long as this assumption is met, one can then estimate the incidence rate ratio using sampled controls and cases:

(A/N1)/(B/N0) = (A/B)/(N1/N0) ≈ (A/B)/(c/d) ≈ (a/b)/(c/d)

In addition to examining a rare disease, one should also perform random sampling in order to obtain a calculated odds c/d which is a valid estimate of N1/N0. If a completely random sample is not feasible, which is usually the case, then one's sample of controls must be unrelated to the exposure. This is a crucial requirement for case-control studies. Similarly, if one also intends to sample cases (a+b) to estimate A/B, one needs to perform random sampling here as well. Since a completely random sample from all cases is usually not feasible, case series from one or several hospitals can be used, as long as the reasons for going to these hospitals are unrelated to the exposure. Then, a/b can be considered a valid estimate of A/B. To reemphasize briefly: in case-control studies it is extremely crucial that the sampling procedure be unrelated to the exposure, since random sampling is usually not feasible. Cornfield suggested that controls be sampled after the collection of all cases. Accordingly, Cornfield's sampling method was once called a retrospective study, in which patients were asked to recall previous exposure states. However, in conducting such a study, Cornfield needed to invoke two assumptions:
(1) The disease under study is rare. After all cases have been collected, one obtains the odds c/d from the remaining candidate population, (N1+N0) - (A+B). As a result, c/d is a valid estimate of N1/N0 only if the disease of interest is rare, because the remaining population (N1+N0) - (A+B) must approximate the original population, (N1+N0).

(2) The exposure proportion, N1/(N1+N0), remains constant in the general population during the period under study. The exposure proportion, estimated from the control sample (c+d), only represents conditions after the collection of cases. Therefore, Cornfield's sampling method assumes that this proportion remains constant throughout the study.

Although Cornfield's case-control study design was a major breakthrough for causal studies, it required the assumptions of a rare disease and a constant exposure proportion throughout the period in question. Moreover, when Cornfield presented his theory, statistical methods to control confounding were not yet available, which limited its practical use. Years later, Mantel and Haenszel (1959) finally developed a procedure to control confounders, by performing overall tests and summarizing many stratified tables. Their contribution facilitated the practical and broad application of the case-control study.
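As a concrete illustration of the estimate (a/b)/(c/d), here is a minimal sketch (Python) with purely hypothetical counts, together with the commonly used Woolf log-based confidence interval; it shows the arithmetic only, and is deliberately silent about the sampling requirements discussed above:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Exposure odds ratio (a/b)/(c/d) with a Woolf (log-based) 95% CI."""
    or_hat = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Hypothetical counts: 80 of 100 cases exposed, 40 of 100 controls exposed.
print(odds_ratio_ci(80, 20, 40, 60))  # OR = 6.0 with an approximate 95% CI
```

Under Cornfield's rare-disease assumption this odds ratio approximates the risk ratio; note that the code is indifferent to how the cases and controls were sampled, which is exactly why the sampling requirements above carry the whole validity burden.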
11.2.2 Miettinen's concept of density sampling (1976)

Sheehe (1962) and Miettinen (1976) independently developed the concept of density sampling for case-control studies. Because Miettinen's argument is more easily understood, I hereby summarize his concept in my own words: Suppose that the general population is a dynamic population, in which there is constant turnover of members but the size remains stable over time. Then, cases constantly develop from both the exposed (N1) and non-exposed (N0) candidate populations (populations at risk), as depicted in Figure 11.1. Similar to Cornfield's method, Miettinen's concept requires that the selection of cases and controls be unrelated to the exposure. One way to ensure that the collection of cases and controls is unrelated to the exposure is for the investigators to be ignorant of whether a person belongs to the exposed or non-exposed group. Moreover, as mentioned in the previous section, limiting samples to cases in one or more hospitals may be valid, as long as the reasons for entering the hospital are unrelated to the exposure.

Miettinen's significant contribution to case-control study design was to propose collecting controls whenever a case develops, so that controls and cases are matched in time. Thus, by the simultaneous collection of a case and a fixed number of controls, Miettinen's density sampling procedure takes the time factor into account and splits it into very small intervals, as displayed in Figure 11.1. The odds c/d among the sampled controls (c+d) is then a good estimate of the ratio of population-time at risk between the exposed and non-exposed: c/d ≈ [N1(t1-t0)]/[N0(t1-t0)] = N1/N0. If there is some change of N1/N0 during the study period, then c/d is an average of N1/N0 over the study period (Greenland and Thomas, 1982). Thus, one can obtain a more accurate estimate of the incidence rate ratio (IRR) by matching the case series (a+b) and control series (c+d) in time. The IRR is expressed as follows:

IRR = [A/N1(t1-t0)] / [B/N0(t1-t0)] = (A/N1)/(B/N0) ≈ (a/b)/(c/d) = ad/bc

Moreover, there is no need to make the same assumptions as required in Cornfield's concept. With the density sampling method, one does not have to study a rare disease, because at each sampling of controls one only excludes one case from the candidate population. Thus, the remaining candidate population, (N1+N0) - (A+B), where (A+B) = 1 at each sampling, approximately equals (N1+N0). Nor is there any need to assume a constant exposure proportion during the period under study, because the density sampling procedure already takes the average exposure odds c/d. In fact, Cornfield's sampling scheme is simply a special case of the case-control study, in which controls are collected after all of the case series are obtained. In contrast to density sampling, Cornfield's procedure actually measures the cumulative incidence rate (CIR) and is accordingly called "cumulative incidence sampling," which still requires the two aforementioned assumptions.
[Figure 11.1  Miettinen's concept of density sampling (modified from Miettinen, 1976). The exposed (N1) and non-exposed (N0) candidate populations are followed from t0 to t1; cases arise continuously from each pool (A from the exposed, B from the non-exposed), and controls are sampled whenever a case occurs. Incidence rate ratio (IRR) = [A/N1(t1-t0)] / [B/N0(t1-t0)] = (A/N1)/(B/N0). Legend: T = occurrence of new cases; / = controls sampled for each case.]
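To make the mechanics of density sampling concrete, the following simulation sketch (Python) uses hypothetical population sizes and rates, with a true incidence rate ratio of 3. Cases arise daily from a stable dynamic population, and one control is sampled at each case's event time; the case exposure odds divided by the control exposure odds recovers the IRR without any denominator data:

```python
import random

random.seed(1)

# Hypothetical stable dynamic population (sizes and rates are illustrative).
N1, N0 = 2000, 8000      # exposed and non-exposed candidate populations
rate0 = 0.0005           # baseline incidence rate per person per day
true_irr = 3.0           # true incidence rate ratio to be recovered
days = 365

a = b = c = d = 0        # exposed/non-exposed cases; exposed/non-exposed controls
for _ in range(days):
    new_exposed = sum(random.random() < rate0 * true_irr for _ in range(N1))
    new_nonexposed = sum(random.random() < rate0 for _ in range(N0))
    a += new_exposed
    b += new_nonexposed
    # Density sampling: one control per case, drawn at the case's event time,
    # so the chance that a control is exposed is N1/(N1+N0).
    for _ in range(new_exposed + new_nonexposed):
        if random.random() < N1 / (N1 + N0):
            c += 1
        else:
            d += 1

print("estimated IRR:", round((a / b) / (c / d), 2))  # close to 3
```

Nothing in the sketch relies on the disease being rare; the controls serve only to estimate the odds N1/N0 of population-time at risk.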
11.2.3 Cumulative incidence sampling

Now, with Miettinen's concept, one can express the cumulative incidence ratio (CIR) in terms of density sampling or the incidence rate ratio (IRR). Let CIR1 and CIR0 denote the cumulative incidence of cases among the exposed and non-exposed, respectively. Then, the risk ratio is estimated by CIR1/CIR0:

Risk ratio (RR) = CIR1/CIR0 = {1 - exp[-∫ IR1(t)dt]} / {1 - exp[-∫ IR0(t)dt]}

where IR1(t) is the incidence rate of the exposed as a function of time, IR0(t) is the incidence rate of the non-exposed as a function of time, and both integrals run from t0 to t1. If the disease is rare, namely, ∫ IR1(t)dt ≤ 0.1 and ∫ IR0(t)dt ≤ 0.1, then, because 1 - exp(-x) ≈ x for small x,

RR = {1 - exp[-∫ IR1(t)dt]} / {1 - exp[-∫ IR0(t)dt]} ≈ ∫ IR1(t)dt / ∫ IR0(t)dt = IRR
Thus, the IRR is an approximation of the risk ratio when the disease is rare and the observation time is short. Alternatively, if we reconsider the risk ratio approximation from Cornfield's concept and assume that the disease is rare and that there is no change of exposure proportion during the study period, then

Risk ratio = CIR1/CIR0 = (A/N1)/(B/N0) ≈ [A/(N1-A)] / [B/(N0-B)] ≈ (a/b)/(c/d)
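Both approximations lean on the rarity condition. Here is a quick numerical check (Python, with hypothetical constant rates) of how closely the IRR tracks the risk ratio when the cumulative incidences stay well below 0.1:

```python
import math

# Constant rates IR1 and IR0 over (t0, t1), so CIR = 1 - exp(-IR * t).
IR1, IR0, t = 0.002, 0.001, 10   # per person-year; 10 years of follow-up

CIR1 = 1 - math.exp(-IR1 * t)    # ~0.0198
CIR0 = 1 - math.exp(-IR0 * t)    # ~0.0100
print("RR  =", CIR1 / CIR0)      # ~1.99
print("IRR =", IR1 / IR0)        # 2.0 -- nearly identical, since both
                                 # cumulative incidences are far below 0.1
```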
Again, one can calculate a/b from a sample of cases (a+b) to estimate A/B, and one can calculate c/d from a sample of controls (c+d) to estimate (N1-A)/(N0-B). Based on the above concept, if one obtains both cases and controls in a cross-sectional survey, then one has obtained prevalent cases and controls, and one can then calculate the prevalence odds ratio (POR), which is the ratio of the prevalence odds of cases (a/b) to the prevalence odds of controls (c/d). The POR can serve as an estimate of the risk ratio as well:

POR = ad/bc = [(a/N1)(d/N0)] / [(b/N0)(c/N1)] = [CIR1(1-CIR0)] / [CIR0(1-CIR1)] ≈ CIR1/CIR0 = Risk ratio
Therefore, Miettinen's concept of density sampling, which does not require these assumptions, revolutionized case-control study design and can be applied to all kinds of diseases. Specifically, one can now apply case-control studies to non-rare acute diseases, such as injury-related health problems (Tsai et al., 1995). One can also carry out case-control studies concurrently within a cohort to save resources (e.g., as in a cohort study of vinyl chloride monomer exposure and liver cancer (Du and Wang, 1998)). Moreover, with density sampling, one can now conduct case-control studies initiated by the
control group (Greenland, 1985), by multiple control groups (Fairweather, 1987; Liang and Stewart, 1987), etc., in addition to those initiated by a case series. Essentially, as statistical methods advance, the applications of case-control studies grow ever larger. The case-control design can be regarded as a general solution for estimating the incidence rate ratio when one lacks data on the population at risk, because a random sample of controls consistently estimates the odds of population-time at risk between the exposed and the non-exposed ([N1(t1-t0)]/[N0(t1-t0)]).

11.2.4 The proportional hazard model
Cox (1972) proposed a mathematical model to represent the risk of developing a particular disease based on the concept of the proportional hazard. This concept is the same as the incidence rate under the condition that the person has survived up to the specified time. In other words, it successfully deals with time and expresses the incidence rate through a probability model. Prentice and Breslow (1978) applied the proportional hazard model to case-control studies and reached the same conclusion as that obtained from density sampling. Moreover, the direct development of statistical methods from this model has facilitated its practical use (Breslow and Day, 1980; Liang and Stewart, 1987). Finally, it also provides flexibility for modeling the change of exposure as a function of time, further enlarging the potential applications of case-control studies. For those familiar with the concept of modeling, this approach can be very powerful and versatile. However, it may not be easily understandable for most physicians and lay people, who in general are unfamiliar with statistical models. I recommend that beginners in epidemiology first understand the concept of density sampling when utilizing case-control studies. Later, one can study modeling for data analysis or simply consult a statistician.
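For readers who want to see the connection in code, here is a minimal sketch (Python; the matched sets and the crude grid search are purely illustrative) of the conditional likelihood that the proportional hazard model induces for time-matched case-control sets: within each risk set (one case plus its density-sampled controls), the case contributes exp(beta*x_case) / sum_j exp(beta*x_j), and exp(beta) estimates the IRR:

```python
import math

def log_conditional_likelihood(beta, matched_sets):
    """Sum of log conditional likelihood contributions over matched sets.

    Each set is (x_case, [x_control, ...]); x is an exposure indicator.
    """
    ll = 0.0
    for x_case, x_controls in matched_sets:
        denom = math.exp(beta * x_case) + sum(math.exp(beta * x) for x in x_controls)
        ll += beta * x_case - math.log(denom)
    return ll

# Three hypothetical matched sets (1 = exposed, 0 = non-exposed).
sets = [(1, [0, 0]), (1, [0, 1]), (0, [1, 0])]

# Crude grid search for the maximizing beta; exp(beta) estimates the IRR.
best_ll, best_beta = max(
    (log_conditional_likelihood(b / 10, sets), b / 10) for b in range(-30, 31)
)
print("beta =", best_beta, " IRR estimate =", round(math.exp(best_beta), 2))
```

In practice one would maximize this likelihood with standard conditional logistic regression software rather than a grid search; the point here is only that the model's likelihood is built from the same time-matched risk sets as density sampling.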
11.3 Practical selection of cases and controls based on the principle of density sampling (Miettinen, 1985b; Wacholder et al., 1992a, 1992b, 1992c; Rothman and Greenland, 1998)

In practice, one can usually proceed in either of the following two ways. The first approach is to define the population at risk, and then try to collect every new case that occurs during the period of study. For every case, one tries to sample one to several controls matched in time of occurrence and/or age. The second approach is to define where one can collect new cases (the numerator data), then define who the population at risk is, and perform sampling among them. Let us examine several examples for illustration:

Example 11.5
Preventive effect of helmets - An example of density sampling
In a study to determine the preventive effect of different types of helmets against head injury, Tsai et al. (1995) decided to use all Taipei City residents as the population at risk. Therefore, eligible cases were all new cases of motorcycle accidents occurring between Aug. 1 and Oct. 15, 1990 in 16 hospitals governmentally approved for emergency care. The investigators sampled two control groups, emergency room and street, by density sampling. The emergency room control group consisted of motorcyclists who sought emergency care for problems other than head injuries at one of the 16 hospitals. The street controls were matched with cases in time and place of accident. Sampling of the street controls occurred one week after each case, by simply taking four consecutive photos of passing motorcyclists. The investigators used the first motorcyclist appearing in each of the four pictures as a street control and determined his/her helmet-wearing state, age, sex and type of motorcycle. All cases and controls were sampled without any preference to helmet use. After controlling for potential confounders by multiple logistic regression analysis, Tsai et al. found that both control groups showed relatively consistent protective effects for the different types of helmets, as summarized in Table 11.5. Specifically, a full-face helmet showed a protection factor of 3, while a partial coverage helmet provided a doubtful 25% protective effect.
Example 11.6 Etiology of blackfoot disease - An example of cumulative sampling

Chen et al. (1988) conducted a case-control study to examine multiple etiologic agents for blackfoot disease (BFD), an endemic peripheral vascular disease in southern Taiwan. They recruited all living cases of BFD residing in the 4 townships within the endemic area as the case series. Since all individuals within the endemic townships were at risk, they randomly sampled three controls matched for age, sex and residence for every case. A total of 241 BFD patients and 759 matched healthy controls were identified and interviewed. Multiple logistic regression analysis showed that the duration of artesian well water consumption was positively associated with the development of BFD in a dose-response relationship, as summarized in Table 11.6. Both arsenic poisoning and family history of BFD were also found to be significantly associated with the disease.

Returning to the examples introduced earlier in this chapter, we will now discuss how to select cases and controls for each of them. In Example 11.1, all Taipei residents were identified as the population at risk. The odds of population-time at risk for different methods of commuting can then be estimated by taking samples from the street. This approach is similar to the method of sampling street controls in Example 11.5. In Example 11.2, since NTUH is a medical center or tertiary care center, patients may come from all over Taiwan. Thus, one may find it difficult to determine the population at risk, which here consists of people who would go to NTUH if they developed lung cancer. However, one may use NTUH patients with diseases unrelated to the exposure (smoking) as controls, because these patients would probably have entered NTUH if they had developed lung cancer instead. Diseases known to be related to smoking, such as chronic obstructive pulmonary disease, laryngeal cancer, bladder cancer, ischemic heart disease, etc., must be excluded as candidates for control samples. Moreover, for reasons of accessibility or connections, one may want to sample another group of controls from people living in a nearby neighborhood, workers from the same factory, and/or friends or relatives of hospital employees. These people
Table 11.5  Adjusted odds ratios of different risk predictors of head injury computed from multiple logistic regression analysis (Tsai et al., 1995).

                                     Unconditional logistic regression;     Conditional logistic regression;
                                     emergency room controls                matched street controls
                                     (cases n=562; controls n=789)          (daytime cases n=224; controls n=1,094)
Risk predictors                      Odds ratio   95% CI                    Odds ratio   95% CI
Helmet type
  Full face vs. no helmet            0.26         0.14-0.47                 0.36         0.13-0.98
  Partial coverage or full vs. no    0.72         0.38-1.37                 0.73         0.36-1.47
Weather
  Rainy vs. sunny                    1.31         0.91-1.87                 3.32         1.60-6.86
  Cloudy vs. sunny                   1.31         0.96-1.79                 0.72         0.35-1.50
Place
  At intersection vs. not            0.99         0.79-1.24
Motorcycle type
  RS vs. STI*                        0.86         0.56-1.32                 1.55         0.81-2.97
  UB vs. STI*                        1.09         0.76-1.56                 0.83         0.50-1.37
  STIII vs. STI*                     0.92         0.65-1.29                 0.53         0.34-0.83
  STII vs. STI*                      1.03         0.76-1.39                 3.71         2.31-5.95
Riding position
  Driver vs. passenger               1.10         0.83-1.47                 1.22         0.76-1.96
Age (years)
  <=29 vs. >=65                      0.68         0.50-0.92                 0.68         0.26-1.81
  30-64 vs. >=65                     0.63         0.44-0.89                 0.69         0.25-1.87
Sex
  Male vs. female                    1.02         0.78-1.33                 0.64         0.43-0.95

* RS, racing sport type; UB, utility bike; STI, STII and STIII, step-through types with a stroke volume of <=50, 51-99, and >=100 cc, respectively.
Table 11.6  Multiple logistic regression analysis of risk factors associated with blackfoot disease (Chen et al., 1988).

                                             All patients                 Patients at Stage II or III
Variables                                    Adjusted OR    P             Adjusted OR    P
Artesian well water consumption (yrs)
  0                                          1.00                         1.00
  1-29                                       3.04           <0.001        3.32           <0.001
  >=30                                       3.47                         4.03
Arsenic poisoning
  No                                         1.00                         1.00
  Yes                                        2.77           <0.01         3.15           <0.001
Familial blackfoot disease history
  No                                         1.00                         1.00
  Yes                                        3.29           <0.01         3.63           <0.01
Staple food consumed
  Rice                                       1.00                         1.00
  Rice + sweet potato                        1.94           <0.05         2.77           <0.01
  Sweet potato                               1.90                         2.58
Vegetable consumption frequency (days/wk)
  7                                          1.00                         1.00
  <7                                         1.43           <0.05         1.58           <0.05
Egg consumption frequency (days/wk)
  >=4                                        1.00                         1.00
  1 to 3                                     2.08           <0.10         3.68           <0.10
  <1                                         2.30                         3.71
Meat consumption frequency (days/wk)
  >=4                                        1.00                         1.00
  1 to 3                                     1.30           NS            2.18           NS
  <1                                         1.58                         1.69
Formal education
  Yes                                        1.00                         1.00
  No                                         1.03           NS            1.22           NS
Occupational sunshine exposure (hrs/day)
  <6                                         1.00                         1.00
  >=6                                        1.17           NS            1.32           NS

NS = statistically nonsignificant; OR = multivariate-adjusted odds ratio.
will generally serve as a valid sample of the population at risk, as long as cases and controls were not selected under a procedure associated with the exposure. An alternative way to study the association between lung cancer and smoking in this example is to define the total population at risk to be all Taipei residents, rather than only those who would enter NTUH. One then needs to collect additional lung cancer cases from other Taipei hospitals, similar to the comprehensive collection of head injuries in Example 11.5 and BFD cases in Example 11.6, and then sample controls from Taipei residents. In Example 11.4, one can conduct a case-control study of the association of parental occupational and environmental exposures with congenital malformation. First, one can define the cases to be all newborns with congenital anomalies delivered in the Taipei Municipal Maternal and Child Hospital (TMMCH). Then, one can identify the population at risk to be all pregnant women who came to TMMCH for delivery. Controls can be randomly selected from the other healthy newborns and matched in terms of delivery time. Therefore, only a limited number of cord blood samples need to be taken and measured for lead and PCB content, saving a significant amount of resources.

11.4 Control of confounding in case-control studies

In causal studies, one always needs to consider other determinants, which may mix with or confound the effect under study. In Section 7.3, we discussed the control of confounding through restriction in study design and through stratification and/or modeling during data analysis. In practice, one should consider confounding during the sampling design of
case-control studies. One may implement restriction rules for the selection of cases and controls. For instance, in Example 11.2, one may require that asbestos workers be included neither among the cases nor among the controls, since asbestos is a determinant of lung cancer. Table 11.4 illustrates the stratification of smokers among lung cancer cases. This table can be converted to Table 11.7, in which asbestos workers are excluded into a group labeled "other." In general, the group labeled "other" consists of persons excluded from the population at risk to preserve comparability of effect, of contrasted populations and of measurements, as described in Section 7.3. These three requirements will be discussed in more detail in Section 11.5, for the selection of controls from a group of decedents or patients with another disease. In Example 11.5, the researchers deliberately included only motorcycle riders, excluding other modes of transportation leading to head injury. In general, for case-control studies, one must select cases and controls in a manner unrelated to the exposure and consider comparability in terms of effect, contrasted populations and measurements. If the number of subjects is large and one wants to explore the interactive effect of, say, asbestos and smoking on lung cancer, one may then apply the strategy of stratification and modeling. Tables 11.5 and 11.6 illustrate stratified and modeling analyses for controlling confounding; a numerical sketch of the summary-odds-ratio idea follows below.
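The summary measure most often used for such stratified tables is the Mantel-Haenszel odds ratio mentioned earlier. A minimal sketch (Python; the stratum counts are hypothetical):

```python
# Mantel-Haenszel summary odds ratio across confounder strata.
# Each stratum is (a, b, c, d): exposed cases, non-exposed cases,
# exposed controls, non-exposed controls.
def mantel_haenszel_or(strata):
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

strata = [
    (40, 10, 60, 40),   # e.g., smokers
    (10, 20, 30, 120),  # e.g., non-smokers
]
print(round(mantel_haenszel_or(strata), 2))
```

Each stratum contributes a*d/n to the numerator and b*c/n to the denominator, so sparse strata carry proportionally little weight in the summary.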
Table 11.7  An illustration of a case-control study design, in which the group "other" indicates subjects excluded due to potential confounding.

                         Exposure
                         Yes      No       Total
Cases                    ai*      bi       (a+b)i
Other                                      (a+b)i'
Controls                 ci       di       (c+d)i
Population at risk       N1i      N0i      (N1+N0)i

* "i" indicates the ith stratum.
11.5 Mortality or morbidity odds ratio (MOR) in occupational epidemiological studies - Valid selection of non-exposed occupation(s) and control diseases

In occupational epidemiological studies, one of the most commonly used measures of relative mortality (or morbidity) is the observed-to-expected (O/E) ratio. This is the ratio of the observed number of deaths (or events) among an exposed population to an estimate of the expected number of deaths in a reference (non-exposed) population, (a/N1) / (b/N0). The computation of the expected number generally requires information on the size of the population at risk or, more specifically, the number of person-years of follow-up of the workers under study. When it is impossible or too expensive to define the population at risk, a common practice is to compute the proportionate mortality ratio (PMR) as a substitute (Schilling, 1973; Kupper et al., 1978; Decoufle et al., 1980). Conceptually, this ratio is the proportion of the index (exposed) group who develop the disease divided by the proportion of the reference (non-exposed) group who develop the disease. The PMR can be interpreted as the observed-to-expected ratio only under the assumption of equal total death/morbidity rates for the index and the reference populations. (See Table 11.8 for the mathematical definitions of O/E, PMR and MOR.) If

(a+c)/N1 = (b+d)/N0, then PMR = [a/(a+c)] / [b/(b+d)] = (a/N1) / (b/N0) = O/E

Thus, the PMR is precluded from any meaningful quantitative interpretation, because it must also assume that the incidence rates of other diseases among the exposed are less than those among the non-exposed whenever the exposed have a higher incidence rate of the disease of interest (Miettinen and Wang, 1981). In mathematical terms: if one assumes that

(a+c)/N1 = (b+d)/N0, and one finds a/N1 > b/N0,
then c/N1 < d/N0. And this is not necessarily true. (See also Table 11.8.) An alternative measurement, the mortality or morbidity odds ratio (MOR), was proposed and demonstrated to be theoretically equivalent to the exposure odds ratio (EOR) of case-control studies (Miettinen and Wang, 1981). The MOR is a specific type of odds ratio, akin to the POR mentioned in Section 11.2.3, in which one studies the occurrence of disease or death. Equally computable as the PMR when there is no information on the population-time at risk, the MOR can be interpreted as the O/E ratio under the assumption that the death/morbidity rates for the control diseases are equal between the exposed and non-exposed:

If c/N1 = d/N0, then MOR = (a/c) / (b/d) = (a/N1) / (b/N0) = O/E.

Therefore, unlike the PMR, the MOR can serve as a quantitative estimate of the O/E ratio, because the assumption c/N1 = d/N0 is most likely true if one has chosen reference diseases which fulfill three aspects of validity, one of which requires such diseases to be unrelated to the exposure (i.e., the comparability of effect). These validity issues will be discussed in the next section. (See also Table 11.8.) The MOR, however, has not received popular acceptance, probably because the valid selection of non-exposed occupation(s) and reference (control) diseases has not been clarified in a standard epidemiological journal. Consequently, I have deliberately included this topic in my book, although the major part of this section has already been published in an Asian journal (Wang and Miettinen, 1984).
Table 11.8  Data layout and formulas for the observed-to-expected (O/E) ratio, proportionate mortality ratio (PMR) and mortality odds ratio (MOR).

                                   Exposed occupation    Reference occupation (non-exposed)
Number of deaths:
  Cause of interest                a                     b
  Other control cause(s)           c                     d
Population-time of follow-up       N1 = ?                N0 = ?

O/E ratio (rate ratio)             (a/N1) / (b/N0)
PMR+                               [a/(a+c)] / [b/(b+d)]
MOR*                               (a/c) / (b/d)

+ Equal to O/E if (a+c)/N1 = (b+d)/N0; i.e., given equal total death/morbidity rates between the exposed and reference populations.
* Equal to O/E if c/N1 = d/N0; i.e., given equal death/morbidity rates for the control disease(s) between the exposed and non-exposed populations.
11.5.1 Valid selection of reference (non-exposed) occupation(s)

To prevent confounding in MOR studies, valid selection of reference (non-exposed) occupations involves adherence to three aspects of validity, as mentioned in Chapters 7 and 10: (1) comparability of effects, (2) comparability of contrasted populations and (3) comparability of measurement (Wang and Miettinen, 1982).

Comparability of effects

Each reference occupation must possess the following three characteristics in order to have effects comparable with the exposed occupation: (1) it must represent a lesser exposure, or complete non-exposure, to the factor of interest than the index occupation; (2) it should possess all other effects of the index occupation; and (3) it should not involve additional exposures or preventive practices which may influence the outcome of interest.
Consider the study of potential carcinogenic exposure in the lens manufacturing process (Wang et al., 1983). Without a specific factor of interest at the beginning of the study, it was necessary to consider the whole lens manufacturing process to be the exposure. Under such terms, the first requirement for reference occupations was satisfied by other jobs in the optical industry, as well as by other occupations in the involved town. However, to satisfy the further requirements, any occupation with known carcinogenic exposure - asbestos, aromatic amines, arsenic, etc. (Doll and Peto, 1981) - had to be excluded from the reference occupations. This meant the exclusion of pipe fitting and automechanical work, jobs with excessive exposure to asbestos. See also Table 11.9.
Table 11.9  Valid selection of reference occupations as illustrated by the study of cancer risks in lens manufacturing.

Requirements of validity          Suitable reference occupations               Excluded occupations
1. Comparability of effects       Non-lens occupations in the optical          Pipe fitting and
                                  industry; occupations in other               automechanical work.*
                                  industries without carcinogenic
                                  exposure (same town).
2. Comparability of populations   Occupations with similar incomes to          Management, science, law,
                                  lens workers: woolen textile workers,        accounting and engineering
                                  craftsmen, foremen or operatives in          in the optical industry;
                                  other manufacturing, construction or         service work (same town).
                                  transportation industries (same town).
3. Comparability of mortality     No further requirements, in part
   information                    because of the lack of previous
                                  suspicion of occupational cancer
                                  hazards.

* Asbestos exposure.
Comparability of contrasted populations

Occupations with comparable effects may not provide a valid contrast with the exposed occupation because of differences in population characteristics. Although age, sex and race can be controlled by stratified analysis, other determinants of the mortality of interest may be distributed differentially between the exposed and potential reference populations. In general, these risk indicators can be thought of in terms of forces of entry into and exit from the exposed and reference populations. Incomparability of populations arises if a determinant of the outcome of interest is related differentially to entry into and/or exit from the compared occupations. It may also arise if there is any effective difference in preventive programs related to the mortality of interest. Consider again the study of lens manufacturing and cancer mortality. First, this study was limited to white males, with somewhat similar distributions of smoking and dietary habits (the two most important determinants of cancer mortality) between the exposed and non-exposed occupations. Since death certificates do not provide information on these characteristics, the researchers could not control for them in the analysis. Therefore, they needed assurance of comparability of job entry and exit forces. To obtain such comparability, we selected job categories with salaries and physical demands similar to those in the index occupation. Among potential reference jobs in the optical industry, we excluded management, law, science, accounting and engineering because of their higher incomes and lesser physical demands. Among other jobs in town, we selected the following categories of occupation, with wages (Table 11.10) and physical demands similar to those in lens work: woolen textile workers, craftsmen, foremen or operatives in manufacturing, construction or transportation industries. We also considered pre-employment and periodic physical examinations to be potentially important determinants of job entry and exit. However, there was no difference between the index and selected reference occupations, except that the former required a definite rectal examination. Since most optical workers enter their job at a relatively young age (23-30 years of age), it may be presumed that the examination is almost always negative. Such results will
have a negligible effect on cancer mortality. Considerations of comparability of job entry and exit factors also indirectly address smoking and dietary influences, because similar socio-economic levels imply similar smoking habits (Sterling and Weinkam, 1976; Covey and Wynder, 1981) and, presumably, similar dietary patterns as well.
Table 11.10  Average annual wages paid to workers in different industries in the town under study. Each dollar amount is calculated from the data of the Division of Employment Security, Commonwealth of Massachusetts (Wang et al., 1984).

Calendar year                                1940    1945    1950    1955    1960    1965    1970    1975
Optical industry                             1,451   2,346   3,135   3,950   4,418   5,439   7,647   11,028
Woolen textile industry*                     1,233   2,036   2,881   3,474   4,541   5,252   6,561   8,716
Other manufacturing industries               956     963     1,494   1,847   2,028   2,928   2,931   5,812
Service industry                             1,281   2,306   3,109   4,104   4,273   5,652   7,915   10,381
Transportation and communication industry    908     1,474   1,762   3,889   4,079   5,163   7,726   10,529
Construction industry                        1,134   1,969   2,716   3,941   4,692   5,899   7,615   9,615
Average of employed                          1,364   2,139   2,881   3,715   4,150   5,138   7,027   9,543

* The number represents the average of this town and a neighboring town where many decedents worked.
Comparability of measurement

In occupational mortality studies, it is generally necessary to use death certification as the basis of outcome information. However, it is possible that the diagnosis or cause of death may be recorded differently between the
compared occupations. Suspicion of occupational influence on risk and concerns about insurance or liability are some of the factors that may lead to over- or under-representation of the outcome of interest within the compared populations. Thus, it is again necessary to resort to restriction rules as a substitute for control. However, no prior suspicion of cancer resulting from lens work or any of the selected reference occupations existed, and income levels as well as the quality of health services were similar. It thus seems that comparability of mortality information was achieved in this study. Therefore, non-exposure alone is an insufficient criterion for the selection of reference occupation(s). Rather, it is necessary to select occupations comparable with the exposed occupation in terms of extraneous effects, population characteristics and mortality information.

11.5.2 Selection of reference cause(s) of death
A major concern for epidemiologists is to effectively eliminate any bias in the estimation of the O/E ratio when information on the population at risk is unavailable. For the MOR to provide an unbiased estimate of the O/E ratio, the exposed and reference populations should have equal rates of death/morbidity from the reference cause(s). To accomplish this end, one must select reference cause(s) of death (control diseases) which satisfy the three aspects of validity mentioned earlier. Let us examine how cardiovascular diseases were selected as control diseases for the MOR study in Section 11.5.1 (see also Table 11.11). Both lens and reference workers were free of exposure to any known cardiovascular toxins (Rosenman, 1979) (i.e., carbon monoxide, carbon disulfide, nitrates, cobalt, fluorocarbons, etc.) and to any unusual stress (Schnall et al., 1994). As for comparability of job entry and exit factors, a review of company practices revealed that security guards were required to have no history of cardiovascular disease in order to be hired or remain on the job. For this reason, security guards in the optical company were excluded from among the reference occupations, so that cardiovascular disease could be used as a reference cause of death. By choosing reference job titles with similar
incomes and levels of physical demand to those found in lens work, one may presume that the compared populations had similar lifestyles (an influential factor for cardiovascular disease). Besides, neither the exposed nor the reference occupations were suspected of atherogenic exposures. One may then presume that diagnosis and certification of death from cardiovascular disease were conducted in a similar fashion, and that job and company designations on death certificates were accurate. After taking care of all these considerations, we found that lens workers had a 2.9 times increased risk of colorectal cancer. The stratified analysis in Table 11.12 summarizes these findings.
Table 11.11  Selection of cardiovascular disease (CVD) as the reference cause of death.

Validity requirements             Selection considerations
1. Comparability of effects       The contrasted occupations were not exposed to any known
                                  cardiovascular toxins (CO, CS2, nitrates, cobalt, or
                                  fluorocarbons, etc.) nor to unusual stress.
2. Comparability of populations   The contrasted occupations presumably had similar selection
                                  factors for both job entry and exit in terms of CVD risk
                                  indicators - physical demands, income, feasibility of smoking,
                                  etc. Security guards in the optical industry were excluded
                                  from among the reference occupations because they were
                                  forbidden to have a history of CVD throughout their service.
3. Comparability of information   Diagnosis and death-certification of cardiovascular disease
                                  between the compared occupations was similar, given the
                                  absence of suspected atherogenic factors in the compared
                                  populations.
In fact, the above principles can be applied to other case-control studies with controls sampled from patients with other diseases. Investigators do not usually employ such a careful sampling scheme, because they often lack prior information on subjects as to what kinds of occupational exposures will be selected into the study.
Table 11.12  Frequency of death from colorectal cancer (CoRe-CA) and cardiovascular disease (CVD), according to years of employment in lens production and age (Wang et al., 1983).

                               Lens workers
Age      Cause of    Reference    1-19      >=20      Total
         death       workers      years     years
30-44    CoRe-CA     1            1         0         1
         CVD         17           4         2         6
45-59    CoRe-CA     2            2         1         3
         CVD         90           22        15        37
60-74    CoRe-CA     10           3         5         8
         CVD         224          27        42        69
>=75     CoRe-CA     12           4         4         8
         CVD         223          21        27        48
Total    CoRe-CA     25           10        10        20
         CVD         554          74        86        160

Crude mortality odds ratio                  (1)       3.0       2.6       2.8
Standardized mortality odds ratio (sMOR)    (1)       3.2       2.6       2.9
Chi-square(1) (Mantel-Haenszel): 12.4
Chi-square(1) (Mantel extension for the trend): 9.7
Rate ratio (Mantel-Haenszel): point estimate 2.9; 90% confidence interval 1.8-4.8
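The summary figures in Table 11.12 can be reproduced directly from the tabulated counts; the following verification sketch (Python) recovers the crude MOR of about 2.8 and a Mantel-Haenszel summary close to the reported 2.9:

```python
# Each age stratum: (lens CoRe-CA, lens CVD, reference CoRe-CA, reference CVD),
# using the "Total" employment-duration column for lens workers (Table 11.12).
strata = [
    (1, 6, 1, 17),      # 30-44
    (3, 37, 2, 90),     # 45-59
    (8, 69, 10, 224),   # 60-74
    (8, 48, 12, 223),   # >= 75
]

a = sum(s[0] for s in strata)   # 20 lens CoRe-CA deaths
c = sum(s[1] for s in strata)   # 160 lens CVD deaths
b = sum(s[2] for s in strata)   # 25 reference CoRe-CA deaths
d = sum(s[3] for s in strata)   # 554 reference CVD deaths
print("crude MOR:", round((a / c) / (b / d), 2))    # ~2.77

num = sum(ai * di / (ai + ci + bi + di) for ai, ci, bi, di in strata)
den = sum(bi * ci / (ai + ci + bi + di) for ai, ci, bi, di in strata)
print("Mantel-Haenszel MOR:", round(num / den, 2))  # ~2.92
```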
As epidemiologists become more familiar with the concept of case-control design, they will begin to consider it more carefully (Miettinen, 1985b; Wacholder et al., 1992a, 1992b, 1992c).

Example 11.7 VCM workers' risk of liver disease - An example of an MOR study

Here is another example of an MOR study, in which the investigators also selected reference occupations and reference causes of death. In this study, the investigators wanted to determine whether there is an increased risk of primary liver cancer and chronic liver disease among VCM (vinyl chloride monomer) workers (Du and Wang, 1998). We used a cohort of 2,224 workers as the population at risk. Using a labor insurance database, we compared disease occurrences of VCM workers with those of motorcycle and optical equipment manufacturing workers, from Jan. 1985 to Mar. 1994. We used cardiovascular and cerebrovascular diseases as the control (reference) diseases, because these diseases were unrelated to the exposure to VCM. Calculating age-adjusted morbidity odds ratios (see Table 11.13), we showed that there was a significantly increased risk of hospitalization among VCM workers for primary liver cancer and chronic liver disease. Furthermore, although the investigators had assumed that the hospitalization rates for petrochemical, optical equipment and transportation vehicle manufacturing workers older than 45 years of age were the same, computer files in 1988 in fact showed similar values of 1.62 x 10^-2, 2.19 x 10^-2 and 2.02 x 10^-2, respectively. In this study, obtaining information on other diseases occurring within the same occupation was relatively easy for the researchers, because Taiwan's labor insurance provides free medical services for all workers in any company employing more than 5 workers. Consequently, such diseases could serve as controls, as long as they were unrelated to the exposures. In fact, sampling of controls from other diseases of the same population at risk can be conveniently performed in a hospital.
Table 11.13  Frequency and morbidity odds ratio (MOR) of vinyl chloride monomer (VCM) workers hospitalized for primary liver cancer and chronic liver disease, compared with optical workers and motorcycle manufacturers as the non-exposed populations, and cardiovascular-cerebrovascular diseases (CV-CB) as control diseases (Jan. 1985-Mar. 1994) (Du and Wang, 1998).

                          Age stratum   VCM       Optical   MOR           Motorcycle      MOR
                          (years)       workers   workers   (95% C.I.)    manufacturers   (95% C.I.)
Primary liver cancer      <45           1         5         6.4           7
                          >45           7         4         (1.5-11.0)    2               (2.3-17.5)
Chronic liver disease     <45           23        71        2.2           139             3.9
and liver cirrhosis       >45           12        18        (1.3-3.9)     41              (2.3-5.6)
CV-CB disease             <45           13        81                      132
(controls)                >45           27        99                      156
Total number of           <45           621       2,914                   4,703
hospitalized workers      >45           423       753                     1,158
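As a check on the counts in Table 11.13, the age-adjusted MOR for chronic liver disease and cirrhosis comparing VCM workers with optical workers can be recomputed with the Mantel-Haenszel formula (Python); it lands on the tabulated value of about 2.2:

```python
# Strata for Table 11.13: (a, c, b, d) =
# (VCM disease, VCM CV-CB, optical disease, optical CV-CB).
strata = [
    (23, 13, 71, 81),   # < 45 years
    (12, 27, 18, 99),   # >= 45 years
]
num = sum(a * d / (a + c + b + d) for a, c, b, d in strata)
den = sum(b * c / (a + c + b + d) for a, c, b, d in strata)
print(round(num / den, 2))  # ~2.2, matching the tabulated MOR
```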
Example 11.7 provides a good model for Example 11.3. In the latter example, one can conduct an MOR study by sampling controls afflicted with skin diseases whose etiologic factors differ from those of scabies. An increased MOR then serves as valid evidence of endemicity for scabies. With the same design, Liu et al. (2002) discovered the association between nasopharyngeal carcinoma and printing work. The MOR study or case-control design is a good alternative to the PMR study, which was used by the dermatologist in this example. The PMR is often more difficult to interpret, because one utilizes the total number of diseases as a surrogate for population-time at risk, which requires the additional assumption that the total incidence rate of all skin diseases is the same for the two periods.
11.6 Summary

The case-control design is a fundamental solution for causal studies lacking information on population-time at risk. By employing Miettinen's density sampling method (matching cases and controls in time), there is no need to assume either that the disease is rare or that the exposure proportion is constant during the period of study. The most important requirement is that the sampling of both cases and controls must be unrelated to the exposure of interest. Since the density sampling design involves the simultaneous collection of cases and controls, it is not necessarily a retrospective study. Rather, one samples a group of cases to obtain the odds of exposure for the numerator of the incidence rate, while one samples a group of controls to estimate the odds of population-time at risk for the denominator. In practice, the simplest procedure may be either to define the population at risk, collect all cases within it, and perform a random sampling of controls; or to define the catchment (or collection) of cases, then try to identify the population at risk for that catchment, and perform sampling of controls. So far, the PMR has been a commonly used method to explore occupational diseases. However, using the PMR involves the assumption that the total mortality/morbidity rates among the exposed and the non-exposed are equal. This assumption also requires that the mortality/morbidity rates for other diseases among the exposed be less than those among the non-exposed, if the rate of the disease of interest among the exposed is higher than that among the non-exposed. Therefore, an increased mortality/morbidity rate of the disease of interest among the exposed cannot be interpreted quantitatively. As an alternative to the PMR, one can set up an MOR or case-control study, in which one selects both reference occupation(s) and reference disease(s) unrelated to the exposure as one's controls. To prevent confounding in an MOR study, one must carefully select valid non-exposed or reference occupational group(s) and valid reference cause(s) of death. This selection depends on the fulfillment of three conditions of validity: comparability of effects, comparability of contrasted populations and comparability of measurements.
Quiz of Chapter 11

Please write down the probability (%) that each assertion is true. You will be scored according to the credibility of your subjective judgment.

Score %
1. The validity of a follow-up study is always better than that of a case-control study, because the latter is always conducted retrospectively.
2. The case-control study design is a general solution to a causal study without information on the denominators or population-time at risk.
3. In a case-control study conducted by cumulative sampling, one always needs to assume that the disease is rare and that the exposure proportion is constant during the study period.
4. The sampling procedure for a case-control study should not be related to the exposure of interest.
5. If one wants to use lung cancer patients from National Taiwan University Hospital (NTUH) for a case-control study to determine lung cancer's etiology, one may take patients with other lung diseases as controls.
6. If one wants to use lung cancer patients from Taipei City for a case-control study to determine the etiology of lung cancer, one may take all patients from NTUH as the case series, and there is no need to add patients from other hospitals in Taipei.
7. If one wants to use cervical cancer patients from NTUH for a study on the etiologic agents of cervical cancer, one may take patients from orthopedic wards as controls.
8. Matching in time of occurrence is equivalent to density sampling in case-control studies.
9. If the proportions of measles seen in a pediatric clinic during March 1995 and March 1996 were 5% and 20%, then we might conclude that there was a local endemic of measles and that the incidence rate had increased to 4 times that of the previous year.
10. The following statistics were obtained for two companies:
                                            Steel Co. A    Steel Co. B
Frequency of occupational injuries
per 1 million tons of steel production      14             10

We can conclude that occupational injury is worse in company A.
Answer: (1) F (2) T (3) T (4) T (5) F (6) F (7) T (8) T (9) F (10) F
Chapter 12 How to Critically Review an Empirical Study

12.1 Abstract of an empirical study
12.2 Objective of the study
12.3 Validity of measurements in research
  12.3.1 Validity of causal studies: Judgment of comparability between the exposed and reference groups
  12.3.2 Validity of descriptive studies: Representativeness of the sample
12.4 Examination of the results and discussion
  12.4.1 Data analysis
  12.4.2 Is there any new finding in this study, and has the study achieved its goal?
  12.4.3 If I carry out a similar study, how shall I modify the study design and data analysis?
  12.4.4 Practice makes perfect
12.5 How to write up a paper based on an empirical study: Practical advice
12.6 Summary

Introduction

Having discussed the basic principles of research in the previous 11 chapters, we will now explain how these concepts can be used in the critical reading of an empirical study to obtain useful information. We have defined an empirical epidemiological study as a study that involves actual observation or measurement in a population(s). Therefore, it does not include any purely theoretical study, literature review or article on the development of a measurement method not yet applied in the field. In general, a purely theoretical paper starts with definitions and postulates, and derives theories by deduction. Any real-life examples used in
such papers are usually taken from other studies to illustrate the theory, without any new observations. A review article is usually written by a leading expert or pioneer on the subject. After carefully reviewing every important article, he/she provides a broad understanding of the subject matter, and may also shed some light on possible future developments. A review is also not an empirical study, although a report on the development of a new measurement method may include some demonstration, which may be part of an empirical study; accordingly, the principles discussed in this chapter can be applied to that portion of such an article. A case report can also be considered an empirical study, with a very small sample size or number of study subject(s). For beginners in a particular subject area, I recommend that they first read some review articles. One needs to first understand the big picture before reading any specific empirical studies. Otherwise, the views contained in a particular paper may immediately influence one's own opinions and hypotheses, which should be avoided from a refutationist's point of view. Such reviews have been published for several decades in the basic medical sciences. They are usually entitled "Annual review of ..." (biochemistry, physiology, pharmacology, etc.), "Advances in ..." (virus research, genetics, etc.), and "Recent progress in ..." (hormone research, etc.). In public health, similar reviews have recently been published, such as the "Annual review of public health," "Epidemiological review," etc. Moreover, searching databases, such as Medline, also provides one with relevant review articles. Based on the references found in review articles, one may then select some original empirical studies. One should look critically at the details of study design, data analysis, discussion, etc.; it is there that the principles discussed in this chapter are to be applied. In the following paragraphs, we shall discuss the principles of critique based on the classification of inferences in empirical studies as descriptive, causal or a combination of the two (see also Chapter 7). We shall begin with the abstract, followed by the sections on objective, materials and methods, results, and discussion. Finally, we will provide some general guidelines about how to actually write up an empirical study.
12.1 Abstract of an empirical study

In general, all empirical studies require an abstract. Although some journals stipulate the style of the abstract, most do not. Nonetheless, an abstract should be short (e.g., less than 200 words) but must contain all the crucial information from the sections on objective, materials and methods, results, discussion and conclusion. Scientists sometimes call the abstract a miniature of the study, containing everything important. Because many important ideas and facts are written in each manuscript, one must be very selective about what to include in an abstract. One may begin with only one or two sentences to express the objective(s), followed by several sentences to describe how measurements were performed and how subjects were selected into the study. Then, the most significant findings are summarized in several sentences with actual numbers, means and standard deviations, as well as confidence intervals and statistical tests. The final part may include statements of the major discussion, conclusion and future prospects. The length of an abstract is usually less than two hundred and fifty words, depending on the journal. Some journals may explicitly limit the number of words, and one must simply follow their guidelines. In general, an abstract should be as brief and concise as possible, without sacrificing comprehensiveness. One must take great care in writing this part, because most databases only include abstracts and only allow one to search with words contained within them. Moreover, other investigators may only peek at an abstract for a minute to decide whether they will spend time reading the whole manuscript.

12.2 Objective of the study

Every empirical study should clearly state its objective in the introductory section. This section should clearly state why the investigators want to perform such a study and what goal(s) they set out to achieve. In a causal study, one tries to clarify the relationship between exposure and outcome (the causal hypothesis). In a descriptive study, one delineates the facts or contents
that he/she is going to measure and attempts to generalize the findings to a target population. For example, the following statement is typical of causal studies: "We conducted this study to determine the preventive effects of different types of helmets on head injury" (Tsai et al., 1995). One example of an objective for a descriptive study is: "To compare the prevalence of alcoholism in different ethnic groups, we performed a prevalence study of alcoholism in present-day Taiwan aborigines..." (Hwu et al., 1990). Although many empirical studies can be categorized as one of these two types, many studies include both descriptive and causal objectives. For example, in a study of an occupational disease, Wang et al. (1986) wanted to determine both the prevalence and the etiologic factors of polyneuropathy among press-proofing workers. The following statements are also examples: "We performed this study to determine whether there was increased lead absorption among children of an exposed kindergarten and if it were associated with air and soil pollution in the surrounding area" (Wang et al., 1992). "The objective of this study was to investigate the prevalence rate of dermatoses among workers in a ball-bearing factory and its possible association with their exposure to kerosene" (Jee et al., 1986).

In addition to the objectives of the study, the author should provide some background and the current understanding of the problem in the introduction. The explication of past scientific developments should follow a logical sequence without being overly lengthy.

12.3 Validity of measurements in research

In the materials and methods section, one needs to write down the contents of measurement, the measurement methods and the procedure of subject selection. For a reader to judge the validity of these measurements and subject
inclusion, investigators must explicitly state the determinants of each measurement. This should include the name and model number of the measuring instrument, as well as the QA/QC (quality assurance/quality control) procedures used with the measurement. One also needs to list the brand name of each instrument, because different instruments differ in performance and accuracy. Since, in occupational and environmental studies, sampling errors are frequently much larger than direct measurement errors in the laboratory (Liden and Kenny, 1992), one also needs to describe the sampling strategy (Mulhausen and Domiano, 1998) as well as the methods used to transport and preserve the samples. To prevent any measurement bias when assessing outcome, both researchers and subjects should be blind to exposure assignment; likewise, persons evaluating the exposure should not have any information on the outcome. These specific arrangements should be clarified in this section.

In a descriptive study, every measurement method for the contents under study should be described. In a causal study, the investigators must describe in detail the measurement procedures for the exposure and outcome of interest, as well as for other determinants of outcome or potential confounders. For example, in the study of etiologic agents of pre-malignant skin lesions among paraquat manufacturers, investigators needed to determine, in addition to the exposure of interest, whether other potential risk factors, such as coal tars, pitch and cutting oils, were present (Wang et al., 1987). Similarly, to determine the etiologic agent of a hepatitis outbreak among synthetic leather workers, researchers had to assess viral hepatitis, alcohol intake and the use of hepatotoxic medicines, in addition to exposure to dimethylformamide (Wang et al., 1991).

Having written a clear and concise description of their measurement methods, investigators leave it to the reader to judge the accuracy of their measurements. Valid and precise measurement is always an auxiliary hypothesis invoked in the process of conjecture and refutation, and it is the reader's duty to assess the accuracy of measurement to make a sound critique. One can make a well-informed judgment from one's own experience, consultation with an expert or a literature search. Please review Chapter 5 for a more detailed analysis of measurement accuracy and validity.
12.3.1 Validity of causal studies: Judgment of comparability between the exposed and reference (non-exposed) groups

After reviewing the methods of measurement and their accuracy, one should consider the validity of an empirical study. For a causal study, the principles of validity were discussed in Chapters 7, 10 and 11. In brief, one should examine whether all other determinants were ruled out as alternative explanations for the effect of interest. In the paradigm of animal experimentation, one always controls all experimental conditions so that the treatment and placebo (sham) groups are comparable for every extraneous determinant. Although one cannot perform experiments on a human population, one can still apply restriction in subject selection, and implement stratification and modeling, to provide comparability of effects, contrasted populations and measurements (Wang and Miettinen, 1982). In other words, one should examine whether all other determinants are comparably distributed between the exposed and non-exposed (reference) groups. For example, in a study to determine the etiologic agent of pre-malignant skin lesions among paraquat manufacturers, investigators demonstrated that none of the workers were exposed to other skin carcinogens, such as coal tar, pitch, cutting oil and radiotherapy. Moreover, they stratified and controlled age and amount of sunlight exposure, as shown in Table 7.10, eliminating them as confounders (Wang et al., 1987). Similarly, in a study to determine the etiologic agent of a hepatitis outbreak among manufacturers of synthetic leather, researchers showed that none of the workers were positive for anti-hepatitis A IgM antibody and none of the exposed were alcoholics. Moreover, they stratified workers by hepatitis B carrier state, as summarized in Table 7.3, to control for potential confounders.

In a case-control study, validity also depends on the sampling procedure, meaning that the selection of cases and controls should not be related to the exposures of interest. Most case-control studies collect all cases exhibiting the disease of interest and sample a group of healthy controls, surveying the previous exposure histories of these two groups. MOR (mortality or
morbidity odds ratio) studies collect case and control diseases from both the exposed and non-exposed, and thus should pay special attention to the comparability of the two groups. Furthermore, the reference disease(s) should not be related to the exposure, as discussed in Section 11.4.2.

12.3.2 Validity of descriptive studies: Representativeness of the sample

In descriptive studies, the validity of inferences drawn about the source population depends on the representativeness of the sample. To judge the validity of a descriptive study, one should assess the sampling procedure, the response rate and the similarity between respondents and non-respondents with regard to the contents of measurement. If the sample is selected according to probability (or at least through a procedure unrelated to the contents of measurement), then one can make inferences about the source population based on the central limit theorem. If the response rate is low, then one must learn how the investigators compensated for it. Did they perform an additional random sample among the non-respondents to see how different they were from the respondents? Or did they provide supplementary information on the determinants of the content of measurement before drawing any conclusion? If respondents and non-respondents were similar in terms of known determinants of the effect of interest, then one may draw conclusions based on the samples at hand. However, if they were different, then investigators can only draw conclusion(s) about the respondent population. For example, Shu et al. (1987) had a response rate of 22.9% in a study to determine whether or not two different commonly used names of Chinese herbal medicines referred to the same medicine. A random sample of 84 non-respondents showed that 45% of them were irrelevant, because most of them either were not prescribing the medication or were out of the country, as shown in Table 7.5. About one third of the non-respondents were too busy to answer the questionnaire. Since all respondents gave the consistent answer that the two names did in fact denote the same composition of herbs, the authors could reasonably conclude that the sample was representative.
12.4 Examination of the results and discussion

12.4.1 Data analysis

In the results section, one of the most important tasks is to check each table and figure for errors of summation, subtraction or any other type of calculation. One also needs to ask the following questions: Is each paragraph written clearly enough that a reader can judge for him/herself? Was the statistical analysis properly performed to control potential confounding? Is there any alternative method of data analysis or alternative explanation of the results? To answer these questions, one needs statistical knowledge at a level more advanced than that covered in this book. At a minimum, one should know the concepts and practical use of stratified analysis, the general linear model, the multiple logistic model, survival analysis, the Cox regression model and the mixed model. To determine whether there exist any alternative explanations, one should also possess substantive knowledge in the subject area. Or, one may consult experts or perform a literature search to obtain relevant information about alternative determinants of the outcome of interest, in order to examine whether all were already under control. Readers of this book can refer back to Chapters 2-4 to understand how a hypothesis can be corroborated.

12.4.2 Is there any new finding in this study, and has the study achieved its goal?

While reading the results and discussion, one should always ask the following questions: Is this a new finding? Has the study achieved the objective it set out to accomplish in the introductory section? If a reader is very careful, he/she may sometimes find something significant that even the investigators have not noticed or emphasized. One may choose to communicate with the authors about it. If the reader happens to be a peer reviewer, he/she may offer comments that can enrich the content of the article. Of course, the investigators themselves can ask these worthwhile questions in the self-evaluation of their own empirical study.
12.4.3 If I carry out a similar study, how shall I modify the study design and data analysis?

There is a proverb in epidemiology: "You can always perform a better study afterwards." In other words, one will always find something to improve after finishing an empirical study, because there are many things one cannot foresee before proceeding with field data collection and the rest of the study. Thus, one is constantly learning from one's own studies and should attempt to modify future research in study design or data analysis accordingly. Thinking along this line will help one avoid pitfalls from which others have unfortunately suffered.

12.4.4 Practice makes perfect

Actively applying the above procedures to critically review at least 30-50 articles can greatly sharpen one's critical thinking. One can also apply these procedures to one's own presentations, which may help one develop a "knee-jerk reflex": an automatic reaction of applying these basic principles and procedures to evaluate any study. In general, one will begin by searching for the objectives of the study (descriptive, causal or both), the accuracy of measurement, the sampling procedure, the response rate and the data analysis. Then one will look for any new findings and examine the measurements of exposure, outcome and other causal determinants. Finally, one will analyze how subjects were selected into the study and how potential confounding was controlled. Figure 12.1 summarizes these major points and illustrates the flow of thinking.

12.5 How to write up a paper based on an empirical study: Practical advice

Over the last two decades, the biomedical world has been following a trend toward a uniform style of manuscript in all of its journals (International Committee of Medical Journal Editors, 1982, 1997). Such a style was first proposed and adopted by 150 journals at a meeting in Vancouver in 1978. Later, an increasing
number of journals have participated. In a developing country like Taiwan, most of our medical journals voluntarily comply with this proposal, e.g., the Journal of the Formosan Medical Association, the Chinese Journal of Public Health (Taipei), the Proceedings of the National Science Council (ROC), etc. The proposed format requires every manuscript to include a title page, an abstract, keywords, introduction, materials and methods, results, discussion, acknowledgments, references, tables, figures and their legends. Each item has its specific requirements, to which I recommend all readers refer in order to avoid unnecessary corrections or rejections (International Committee of Medical Journal Editors, 1997). Of course, if there is any additional stipulation by the journal to which one is submitting, then one must follow its specific instructions.

To write a research paper, one should begin by preparing a literature review. One needs to collect relevant articles and mark down important points or sentences to which he/she may refer in the future. One can classify and file each article according to the section in which it will be cited. After deciding on the style of the manuscript, one may begin by writing up the materials and methods section, because this section will not change significantly once the data collection is completed. During the data analysis stage, one then tries to summarize the results into tables and figures, and searches for more relevant literature to help continue the process of conjecture and refutation. One can write down notes whenever a new idea arises and file them according to the four sections of the paper.

When the tables and figures, the collection of reference articles and the main points are almost complete, one can begin to write up the results and discussion sections. It may take several hours to arrange the sequence of tables and figures and to decide which will appear first in the results section. Then, one can bring out the notes and examine all the ideas collected during the days of thinking and writing of the previous sections. One may try to group or categorize them into 5-10 major points and arrange them in logical sequence, or according to priority, in the discussion section. Each major point may contain several ideas written in notes, which must also be arranged in logical sequence within a paragraph. Of course, one
must not forget to write down the new or significant findings of his/her empirical research. This part may be especially difficult because one can usually think of several different ways to organize the material; nonetheless, one must choose what one thinks is best. On most occasions, one gets a better idea of the logical sequence of a presentation after reading papers in the same field. After drafts of all the other sections are complete, one should then tackle the introductory section. This part is generally guided by the objectives and the most significant findings in the results and discussion sections. Lastly, one must list and number the references or citations within the paper.

For investigator(s) who are not native speakers of English, it is usually difficult to write in this language at first. One may worry a great deal about grammar, vocabulary, etc. My recommendation is simply to write and disregard grammar in the first draft. After one or two weeks, examine it again for appropriateness of expression and grammar, and make the necessary corrections. The revised draft can then be sent to a friend in a related but not identical field. One should ask the friend to pinpoint unclear and/or confusing parts that need revision or rephrasing. In fact, the first revision will generally take a great deal of time. Moreover, writing and speaking are two very different forms of communication. When speaking, one can use different tones or facial expressions to emphasize main points or help express an idea, which is generally infeasible in writing. Thus, one may need repeated revisions to clarify points. After several revisions, the manuscript will improve in readability and will be ready for submission. Before submitting the article for publication, one should recheck the style and proofread once more to make any final corrections. Although this may sound tedious, it is important to show the editors your diligence, sincerity and careful attention to detail.
12.6 Summary (Figure 12.1)

Introduction (both types of study):
What is the objective of the study?

Material and methods:
Causal study: What are the exposure, outcome and other determinants of outcome in the study? Are these items clearly defined and validly measured? What are the factors that may influence each of these measurements? Are they properly controlled? How are subjects selected into the study? How are the exposed and non-exposed defined and selected? How are cases and controls defined and selected?
Descriptive study: What are the items under study? Are they clearly defined and validly measured? What are the factors that may influence the measurement of each of these items? What population(s) are under study? How will the sample be selected? Will it be selected according to probability? How high is the response rate?

Results and discussion:
Causal study: Are the exposed and non-exposed comparable in terms of effects, contrasted populations and measurements? (Are the exposed and non-exposed comparable for other determinants of outcome?) If not, how do the authors deal with the potential confounding? If there is confounding, what are the magnitude and direction of its effect? Is the data analysis sufficient to control all potential confounding? Or, have all alternative explanations of the effect of interest been falsified?
Descriptive study: Do respondents differ from non-respondents in terms of the items or contents of measurement? If the response rate is low, and non-response is associated with one or more item(s) of measurement, what have the authors done to compensate for it? Are the respondents and non-respondents similar in terms of the determinants of the items of measurement? If not, what are the direction and magnitude of the bias in the results, i.e., over-estimation or under-estimation?
Both types: Is the study written clearly enough for readers to judge for themselves? Is there any error in the statistics, tables or figures? Is there any alternative view or explanation for each table or figure? Is there any alternative method of data analysis? Does the study achieve its objective? Are there any further conclusions that can be drawn which the authors have missed? If you were to conduct a similar study, how would you improve the study design and data analysis?

Figure 12.1 Logical sequence and questions to be asked while critically reviewing an epidemiological study.
Quiz of Chapter 12

Please write down the probability (%) that each assertion is true. You will be scored according to the credibility of your subjective judgment.

1. An environmental engineering company conducted an environmental impact assessment in Chonburi province in 1990. They found that respiratory diseases, cancer, cardiovascular diseases and accidents accounted for 15%, 20%, 18% and 16%, respectively, of total mortality in the area. One can conclude that cancer is the number one cause of death and deserves public attention.

2. During 1991-93, some businessmen built a petrochemical complex in the same area. Five years later, respiratory diseases, cancer, cardiovascular diseases and accidents accounted for 12%, 20%, 20% and 18%, respectively, of total mortality. One can conclude that there was no increase in respiratory mortality during the last 5 years.

3. From assertions 1 and 2, one can conclude that petrochemical factories will not produce any hazardous effect on the local Thai people.

4. The abstract is a summary of the study and contains all the major points of the paper.

5. A confounder must be a major determinant of the outcome under study.

6. The mortality odds ratio (MOR) study is a special case of the case-control study, with controls selected from some other disease(s).

7. The collection of controls in a case-control study is always retrospective.

8. Comparability of effect requires that the exposed and
non-exposed are exactly the same except for the exposure of interest. For example, if one wants to study the health effects of asbestos, then the non-exposed workers should wear clothes and uniforms of the same color (say, white) as the exposed.

9. In a descriptive study, one must always be careful about the definition of the numerator and denominator of a calculated rate or ratio.

10. A PMR (proportionate mortality ratio) study can be interpreted qualitatively but not quantitatively, because it must invoke the assumption that total mortality rates in the exposed and non-exposed are the same.
Answer: (1) T (2) F (3) F (4) T (5) T (6) T (7) F (8) F (9) T (10) T
Chapter 13 Application of Epidemiological Methods in Health Service Research and Policy

13.1 Research questions in health policy: From descriptive and causal to decisional
13.2 Outcome measurement of cost-effectiveness analysis in health and medicine
13.2.1 Estimation of quality-adjusted life years (QALY) for utility and psychometric score assessment of health profile for effectiveness
13.3 Risk assessment: Public health and individual viewpoints
13.3.1 Risk assessment from a public health point of view
13.3.2 Risk perceived by individuals
13.4 Summary

Introduction

It is a common dream of all epidemiologists to apply the results of their research to saving lives or decreasing mortality and/or morbidity. When Snow discovered that diarrhea among London residents was associated with different systems of water supply, his immediate action was to notify the relevant authority and amend the contaminated water supply (Snow, 1936). Finding an increased frequency of lung cancer among cigarette smokers, Doll and Hill hoped to document the magnitude of the effect and argue for public health policy action (Hill, 1965). Similarly, when my colleagues and I found that carbon tetrachloride was responsible for an outbreak of hepatitis in a printing factory, our hope was to remove this solvent from all printing shops (Deng et al., 1987). As epidemiologists, we are eager to turn our new findings into preventive action, which involves health policy-making.

Policy makers in health and medicine rely on the causal inferences or
quantitative data provided by epidemiological research to assist decision-making. In the new field of clinical epidemiology, physicians are using epidemiological concepts to make diagnostic and therapeutic decisions. Although public health policy may be influenced by political and socio-economic concerns, the overall consideration should be based on quantitative information regarding the total health effects on the community. Policy makers can easily obtain qualitative information or expert recommendations; the magnitude of the effect, however, needs to be quantified with data from epidemiological research. This chapter deals with the formulation of questions in health policy and management, and discusses the issues involved in the outcome measurement of cost-effectiveness in health and medicine. I am going to introduce the QALY (quality-adjusted life year) as one of the common units of utility assessment for national resource allocation, and psychometric score assessment of health profiles for effectiveness in clinical decisions. We will also attempt to apply these concepts to risk assessment.

13.1 Research questions in health policy: From descriptive and causal to decisional

Although a wide range of issues may exist, we will attempt to classify research questions in health policy-making into three types: descriptive, causal and decisional. In addition to descriptive and causal questions, health policy involves decision analysis. Specifically, one integrates results from descriptive and causal studies to formulate a decisional question to be applied to health policy-making. First, let us examine a hypothetical example of clinical decision-making by a neurosurgeon.

Example 13.1 Clinical decision-making in a head injury case

A hypothetical case of motorcycle-related head injury was admitted to the Emergency Department of the National Taiwan University Hospital (NTUH). Doctors suspected a subdural hematoma from the physical examination and analysis of computed tomography images. The patient's consciousness also
seemed to be slowly worsening, to a Glasgow Coma Scale score of 8. The neurosurgeon at the NTUH created the following decision tree to facilitate the judgment on surgery (Figure 13.1):
Decision       Condition           Consequence                        Probability   Utility
Operation      Subdural hematoma   Good result                        0.85          1.0
                                   Poor result (partially disabled)   0.10          0.5
                                   Death                              0.05          0
               Epidural hematoma   Good result                        0.85          1.0
                                   Poor result (partially disabled)   0.10          0.5
                                   Death                              0.05          0
               Both hematomas      Good result                        0.70          1.0
                                   Poor result (partially disabled)   0.20          0.5
                                   Death                              0.10          0
No operation                       Good result                        0.40          1.0
                                   Poor result (partially disabled)   0.20          0.5
                                   Death                              0.40          0

(Probability that each condition is present: subdural hematoma 0.80, epidural hematoma 0.15, both 0.05. Total expected utility: operation = 0.895; no operation = 0.5.)
Figure 13.1 A surgical decision tree analysis for a head injury case with suspected subdural hematoma. A good result indicates a complete recovery without any sequelae, while a poor result indicates a permanent partial disability.
Assume that a good result (complete recovery) carries a utility of 1 and a permanent partial disability carries a utility of 0.5. The surgeon can then calculate the total expected utility of each option. To do so, he needs to take into account the probability estimates that the various conditions will occur, as shown in Figure 13.1. These probability estimates were obtained from the results of previous causal and descriptive studies.

First, the surgeon can assess the utility value of operating under each condition. This involves summing, over the consequences, the product of the probability of occurrence and the assigned utility, i.e., the sum of pk ck(u). For example, the expected utility of operating on someone known to have a subdural hematoma is:

(Prob. of good result × assigned utility) + (Prob. of poor result × assigned utility) + (Prob. of death × assigned utility) = (0.85 × 1.0) + (0.10 × 0.5) + (0.05 × 0) = 0.9

The utility of operating on someone known to have an epidural hematoma is:

(0.85 × 1.0) + (0.10 × 0.5) + (0.05 × 0) = 0.9

And the utility of operating on someone known to have both types of hematoma is:

(0.7 × 1.0) + (0.2 × 0.5) + (0.1 × 0) = 0.8

To calculate the total expected utility of operating on someone without definite knowledge of their condition, the surgeon should not simply take the sum of the above expected utilities. Rather, he must weight each by the probability of occurrence of the corresponding condition. For example, the total expected utility of operation on someone with such a head injury is:

(Prob. of subdural hematoma × expected utility of operation for subdural hematoma) + (Prob. of epidural hematoma × expected utility of operation for epidural hematoma) + (Prob. of both conditions × expected utility of operation for both) = (0.8 × 0.9) + (0.15 × 0.9) + (0.05 × 0.8) = 0.895

Since the expected utility of no operation is the same under all conditions, the total expected utility of the decision not to operate on a head injury is:
(0.4 × 1.0) + (0.2 × 0.5) + (0.4 × 0) = 0.5

Since the total expected utility of operation was 0.895, versus 0.5 for no operation, the neurosurgeon recommended an immediate operation.

Although a decision-making question can take a variety of forms, the basic idea can generally be summarized in a diagram, such as the flow diagram in Figure 13.2. The decision-maker must consider and list all possible events (ek) of a decision, assign a probability (pk) to each event, and evaluate the consequence (ck) of each event in terms of utility (ck(u)). One then determines the optimal strategy by averaging out and folding back, as detailed in Example 13.1. Epidemiological research can provide decision-makers with the probabilities or incidence rates and the utility lost or gained under certain conditions during a period of time. On many occasions, one may directly use the collected data to estimate expected utility, such as the QALYs gained from a specific prevention program. Business people commonly attach an equivalent monetary value (EMV) to each consequence, but attaching an EMV to human lives raises serious ethical concerns. Thus, in order to avoid ethical disputes, health professionals have traditionally favored economic analysis that assesses cost per unit of health effect or utility, i.e., cost-utility analysis (Drummond et al., 1997). Some authors consider it a special case of cost-effectiveness analysis (Gold et al., 1996). Let us examine the following examples.

Example 13.2 Need for a material safety data sheet

Researchers found that for five out of ten occupational diseases documented in Taiwan during 1983-90, none of the affected employers and employees knew what kind of chemicals they were exposed to. As a result, public health workers raised the question of whether or not to implement a hazard communication system, specifically material safety data sheets (MSDS), in industrial factories in Taiwan (Wang, 1991). Moreover, they also asked: how many workers and what kinds of factories would benefit from this new system, if implemented?
Figure 13.2 Simplified decision diagram or tree. Each possible event (ek) has a probability pk of resulting in a consequence (ck). Assuming that each consequence has a payoff ck(u), the expected utility of that event is ek(u) = ck(u) × pk, and the overall expected utility of the decision is the sum of ek(u) over all possible events.
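To make the averaging-out-and-folding-back procedure concrete, here is a minimal sketch in Python. This is our illustration, not code from the original study: the function and variable names are hypothetical, while the probabilities and utilities are those of Example 13.1 and Figure 13.1.

```python
# Minimal sketch of "averaging out and folding back" (Example 13.1).
# All names here are illustrative; the numbers come from Figure 13.1.

def expected_utility(branches):
    """Average out one chance node: sum of probability * utility."""
    return sum(p * u for p, u in branches)

# (probability, utility) pairs: good result, poor result, death
op_subdural = [(0.85, 1.0), (0.10, 0.5), (0.05, 0.0)]  # EU = 0.9
op_epidural = [(0.85, 1.0), (0.10, 0.5), (0.05, 0.0)]  # EU = 0.9
op_both     = [(0.70, 1.0), (0.20, 0.5), (0.10, 0.0)]  # EU = 0.8
no_op       = [(0.40, 1.0), (0.20, 0.5), (0.40, 0.0)]  # EU = 0.5

# Fold back: weight each condition's expected utility by the probability
# that the condition is actually present (0.80 / 0.15 / 0.05).
eu_operation = (0.80 * expected_utility(op_subdural) +
                0.15 * expected_utility(op_epidural) +
                0.05 * expected_utility(op_both))

print(round(eu_operation, 3))             # 0.895
print(round(expected_utility(no_op), 3))  # 0.5 -> recommend operation
```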
This example raises a descriptive question: how many workers, stratified by type of industry, would be protected by a system of MSDS? To answer this question, one simply needs to obtain data from the labor statistics bureau, or one needs to design a descriptive study to survey the frequency of workers, stratified by type of factory and/or exposure. If one wants to determine the cost of implementing such a system, then the question becomes decisional and requires a cost-effectiveness analysis. If an epidemiologist is unfamiliar with the quantification of cost, he/she may request the assistance of, or develop a mutual cooperation with, an expert in economic analysis.

Example 13.3 Helmet regulation
Since motorcycle injuries once claimed more than 5,000 lives annually in Taiwan and since many head injuries are fatal (Department of Health, 1996),
the National Health Administration and Congress of Taiwan seriously considered helmet regulation. In the attempt to formulate an effective policy, researchers asked the following questions: Are helmets effective in the prevention of head injury? Do different types of helmet provide a similar magnitude of protection? By conducting a case-control study, Tsai et al. (1995) attempted to answer these two causal questions. They found that full-face helmets provide the greatest protection. If policy-makers wanted to know the cost-effectiveness ratio of a helmet-wearing regulation among motorcyclists, a decisional question, then epidemiologists would need to supply additional information to find the total number of QALYs saved. Thus, they would need to find out how many QALYs are saved per head injury case and how many cases are prevented per year (Tsauo et al., 1999). This study will be described in detail in Section 13.2, on the estimation of QALY. Again, anything related to the quantification of the utility of health depends on epidemiological work.

Example 13.4 Regulation of asbestos factories
At one time, there were more than 30 asbestos factories throughout Taiwan. People living in nearby communities were concerned that they might develop asbestos-related cancer. In response, investigators raised the following questions: Should we completely eliminate asbestos-related factories in Taiwan? And how big is the risk of developing lung cancer and mesothelioma among people living in the neighboring communities? A descriptive study can answer the latter question of risk assessment (Chang et al., 1999), while the former question is decisional and requires a cost-effectiveness or cost-benefit analysis.

As discussed in Chapter 6, the primary unit for counting the utility of health is the number of lives saved. With chronic diseases, which entail both mortality and morbidity, this unit alone is not efficient enough for health policy-making or resource allocation. An arbitrary assignment of a utility of 0.5 to partial permanent disability, as in Example 13.1, also seems imprecise and may not truly reflect a person's preference. Thus, there is a need to develop more sensitive measurements for both clinical decisions and resource allocation.
Because people usually prefer to trade a longer survival under poor quality of life for a shorter survival under good quality of life, a common unit, the quality-adjusted life-year (QALY), was proposed (Weinstein and Stason, 1977). This unit has been widely accepted for cost-effectiveness and cost-utility analysis in health and medicine (Gold et al., 1996; Drummond et al., 1997). While the QALY may be enough for inter-disease comparison and national resource allocation, I believe that a more delicate psychometric assessment of the health profile is needed, at least for clinical decisions about a single disease, to further improve efficiency. Although cost functions (and opportunity costs) may not belong to the field of epidemiology, policy-makers need epidemiologists to quantify data on expected utility or health effect, and epidemiologists need assistance from experts who can perform assessments of cost functions or economic analysis. Thus, a decisional question usually requires multidisciplinary cooperation. If one is trained in both epidemiology and economics, then one should be able to answer such questions on one's own. Recently, public health education has shown a growing interest in creating an additional track to train epidemiologists who can provide such a service in the future.

13.2 Outcome measurement of cost-effectiveness analysis in health and medicine

From 1965 to 1995, the health care expenditure of the United States increased from 5% of GDP (gross domestic product) to 15%, while the proportion spent on preventive services has remained below 1% of all these expenses (Gold et al., 1996). The cost-effectiveness of health care spending is nowadays a major concern throughout the world (Gold et al., 1996; Drummond et al., 1997). The implementation of the National Health Insurance (NHI) system in Taiwan since March of 1995 has covered over 97% of all the people of Taiwan, a great success in distributive justice. However, the quality of health care services is difficult to maintain, and the ever-growing demand for new medical technology makes cost containment an almost impossible mission for policy makers. Therefore,
improving cost-effectiveness has become one of the top goals for the health care industry in Taiwan and the whole world. What an epidemiologist can do is to develop and conduct outcome assessments of the utility and psychometrics of health.

13.2.1 Estimation of quality-adjusted life years (QALY) for utility and psychometric assessment of health profile for effectiveness
Outcome measurement in cost-effectiveness analysis usually requires one to quantify quality-adjusted survival (QAS) by multiplying the survival function by the HRQL (health-related quality of life) function, as outlined in Chapter 6. Different diagnostic and treatment procedures may produce different QAS curves, and the difference between any two of them is the average improvement in effectiveness or utility (in QALYs) gained per case if the superior option is chosen. Let us look at some examples for illustration.

Example 13.5 Expected utility loss from angina pectoris
Suppose that QASn denotes the QAS of an average normal healthy person in the general population without any specific disease, and let QASap denote the QAS of an average patient with angina pectoris. Then, the utility loss for a patient with angina pectoris is simply QASn − QASap. With the survival data from Parker et al. (1946) (in which researchers followed up 1541 cases for 17 years until all cases had died) and the survival data from the life table of the U.S. 1960 general population (National Center for Health Statistics, 1966), one can plot and compare survival with and without the disease, as in Figure 13.3. Assuming that HRQL after the diagnosis of angina pectoris is similar in the past and present, one might take a hypothetical cross-sectional survey of these patients and obtain the HRQL (modified from Fryback et al., 1993), as shown in Table 13.1. Further assume that the HRQL of an average person younger than 70 years old is 1, and that of a person over 70 is 0.9 because of mild physical disability. One also knows that the average age of a person with angina pectoris is 59.
Table 13.1 Yearly survival rates, hypothetical health-related quality of life (HRQL) and quality-adjusted survival time (QAST) after diagnosis of angina pectoris.

Years after diagnosis   Survival rate   HRQL   QAST
0                       1.0000          0.5    0.5434
1                       0.8114          0.7    0.5732
2                       0.7170          0.8    0.5820
3                       0.6524          0.9    0.5540
4                       0.5786          0.9    0.4941
5                       0.5193          0.9    0.4412
6                       0.4611          0.9    0.3952
7                       0.4172          0.9    0.3548
8                       0.3712          0.9    0.3174
9                       0.3342          0.9    0.2690
10                      0.2987          0.8    0.2218
11                      0.2557          0.8    0.1760
12                      0.2136          0.7    0.1391
13                      0.1839          0.7    0.1129
14                      0.1636          0.6    0.0920
15                      0.1429          0.6    0.0723
Then, one can construct curves for the HRQL of patients with angina pectoris and of the general population, as in Figure 13.4. Finally, one can combine the survival curves and the HRQL curves to obtain curves for QASn and QASap, as shown in Figure 13.5. Based on the formula for QSi from Section 6.6.2, QASn and QASap can be calculated. The following is the calculation for QASap:

QSi = {[S(ti) + S(ti+1)] / 2} × {[QoL(ti) + QoL(ti+1)] / 2}

QASap = sum of QSi = [(1.0000 + 0.8114)/2] × [(0.5 + 0.7)/2] + [(0.8114 + 0.7170)/2] × [(0.7 + 0.8)/2] + ... + [(0.1636 + 0.1429)/2] × [(0.6 + 0.6)/2] + ...

In real numbers, QASn − QASap is equal to 16.68 − 5.36 = 11.32 QALYs, which can be regarded as the utility gained from the successful prevention of an average case of angina pectoris.
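As a check on the arithmetic, the following minimal Python sketch (ours, not the authors' code) applies the QSi formula to the columns of Table 13.1. Because the table covers only the first 16 years of follow-up, the sum comes to about 5.27; the remaining years of the 17-year follow-up bring the total to the 5.36 used above. The optional discount rate r corresponds to the discounted estimate quoted below.

```python
# Sketch (ours) of the trapezoid-rule QAS calculation over Table 13.1.
surv = [1.0000, 0.8114, 0.7170, 0.6524, 0.5786, 0.5193, 0.4611, 0.4172,
        0.3712, 0.3342, 0.2987, 0.2557, 0.2136, 0.1839, 0.1636, 0.1429]
qol  = [0.5, 0.7, 0.8, 0.9, 0.9, 0.9, 0.9, 0.9,
        0.9, 0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.6]

def qas(surv, qol, r=0.0):
    """Quality-adjusted survival; r is an optional annual discount rate
    (the text uses r = 0.03 for its discounted estimate)."""
    total = 0.0
    for i in range(len(surv) - 1):
        # QS_i = mean survival in year i times mean HRQL in year i
        qs_i = ((surv[i] + surv[i+1]) / 2) * ((qol[i] + qol[i+1]) / 2)
        total += qs_i / (1.0 + r) ** i
    return total

print(round(qas(surv, qol), 2))  # ~5.27 from the 15 tabulated intervals;
                                 # the full follow-up gives the 5.36 above
```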
If one takes an annual discount rate into consideration (e.g., r = 0.03) (The World Bank, 1993), then the expected utility loss from getting angina pectoris is 12.69 − 4.62 = 8.07 QALYs.
Figure 13.3 Survival curves of patients with angina pectoris and for the population during the thirty years after diagnosis (Wang et al., 2000).
In addition, one can multiply the QALYs gained by the number of affected cases to obtain the expected total utility gained. In practical terms, one can first establish a cohort by enrolling all patients (say, 500 or more) diagnosed with a specific disease Xi at a specific hospital during the last 10 years; this cohort can be followed and cross-checked against the death certificate database maintained by the Department of Health to obtain the survival function. Then, a stratified random sample of 50-100 patients with different durations-to-date can be surveyed with a QOL questionnaire (e.g., the WHOQOL-BREF) and a utility measurement, from which one can estimate the QOL function of the disease Xi.
Figure 13.4 QOL curves of patients with angina pectoris and for the general population during the thirty years after diagnosis (Wang et al., 2000).
Figure 13.5 QAS curves for patients with angina pectoris and for the general population during the thirty years after diagnosis (Wang et al., 2000).
The expected value of QOL is multiplied by the corresponding survival rate at time t to obtain the QAS curve for the disease Xi. Moreover, Hwang et al. (1999) further showed that the curve can be extrapolated throughout life if the cohort has been followed up for about 3-5 years. This breakthrough has made the estimation of QALE (quality-adjusted life expectancy) possible for most chronic diseases. Hwang and Wang (2001) have recently demonstrated that similar results can be obtained if we replace the utility values with psychometric scores for each health profile item or facet, which can be interpreted as survival-weighted psychometric scores. Let us use Example 13.3 for a more detailed illustration of how one can estimate the total utility gained in the first 12 months of the implementation of a helmet law for motorcyclists in Taipei.

Example 13.3 (continued) Helmet regulation
First, investigators randomly selected 400 out of the 8221 registered head injury cases in Taipei during 1989-94, among whom they successfully assessed 99 current health profiles (Tsauo et al., 1999). They converted these profiles to the corresponding utility values of the IHRL (index of health-related quality of life, proposed by Rosser et al., 1992) and constructed the survival function, HRQL and quality-adjusted survival of head injury from the 7 years of follow-up of the 400 cases, as shown in Figures 13.6-13.8. Head injury is one of the main causes of death from motor vehicle injury in Taipei, where its estimated annual incidence rate was 1.82 × 10⁻³ year⁻¹ (Ding et al., 1993). 64.6% of all motor vehicle injuries involved head injuries, and 80% of these injuries involved motorcyclists (Ding et al., 1994). The incidence rate of head injury among motorcyclists was therefore 9.41 × 10⁻⁴ year⁻¹, as shown in Table 13.2. According to Tsai et al. (1995), the odds ratios (OR) of head injury with no helmet, a full-face helmet and a partial coverage helmet were 1.0, 0.31 and 0.73, respectively. At the time, the proportions of motorcyclists with no helmet (Pe1), full-face helmets (Pe2) and partial coverage helmets (Pe3) were 88%, 8.5% and 3.4%,
Figure 13.6 The survival function of head injury cases and the reference population in the 80 months after onset; abscissa: months after onset, ordinate: survival probability (Reprinted from Accident Analysis & Prevention, 31, Tsauo JY, et al. Estimation of utility gain from the helmet law in Taiwan by quality adjusted survival time, 253-63, Copyright (1999), with permission from Elsevier Science).
Figure 13.7 The health-related quality of life (HRQL) of head injury cases and the reference population in the 80 months after onset; abscissa: months after onset, ordinate: HRQL (Reprinted from Accident Analysis & Prevention, 31, Tsauo JY, et al. Estimation of utility gain from the helmet law in Taiwan by quality adjusted survival time, 253-63, Copyright (1999), with permission from Elsevier Science).
respectively (Tsai et al., 1995). Thus, the incidence rate (IR1) of head injury for cyclists without helmets can be obtained by solving the following equation (IRhi denotes the overall incidence rate of head injury among motorcyclists; IR2 and IR3 denote the incidence rates of head injury with full-face and partial coverage helmets, respectively):

Pe1(IR1) + Pe2(IR2) + Pe3(IR3) = IRhi
Pe1(IR1) + Pe2(OR2)(IR1) + Pe3(OR3)(IR1) = IRhi
(0.88)(IR1) + (0.085)(0.31)(IR1) + (0.034)(0.73)(IR1) = 9.41 × 10⁻⁴ year⁻¹
IR1 = 1.01 × 10⁻³ year⁻¹

The incidence rates of head injury for cyclists with full-face and partial coverage helmets could also be derived, as 3.44 × 10⁻⁴ and 8.10 × 10⁻⁴ year⁻¹, respectively.

To estimate the number of prevented cases one year after enforcement of the helmet law, Tsauo et al. assumed that the proportion of motorcyclists with no helmet dropped from 88% to 5%, as reported by the traffic police department of Taipei. They also assumed that the ratio of full-face to partial coverage helmets remained at 8.5:3.4, so that the proportion of cyclists with full-face helmets would be 67.3% and with partial coverage helmets 26.9%. The number of head injury cases in one year before enforcement of the helmet law was calculated to be 2541, as shown in Table 13.2. The number of cases after one year of enforcement of the law was estimated to be 1241 (see Table 13.2). Thus, the number of prevented cases was 1300. Since the expected loss of utility per case of head injury (without consideration of an annual discount rate) is 4.8 QALYs, the total utility gained by the intervention during the first year would be 4.8 × (2541 − 1241) = 6,240 QALYs. This calculation still does not take into account the potential utility gained from reduced severity and improved survival and HRQL due to the protective effect of full-face and partial coverage helmets in cases of head injury.
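The whole back-calculation is easy to verify with a few lines of Python. This is a sketch of ours, not the authors' code; like the authors, it treats the odds ratios as rate ratios relative to the no-helmet group.

```python
# Sketch (ours) reproducing the helmet-law calculation (Table 13.2).
pop = 2.7e6                     # Taipei residents
ir_hi = 9.41e-4                 # = 1.82e-3 * 64.6% * 80%, per year
or_full, or_part = 0.31, 0.73   # odds ratios vs. no helmet (OR = 1.0)

# Before the law: 88% no helmet, 8.5% full-face, 3.4% partial coverage.
ir_none = ir_hi / (0.88 + 0.085 * or_full + 0.034 * or_part)  # ~1.01e-3

cases_before = pop * ir_hi      # ~2541 cases in one year
# After the law: 5% no helmet; the rest split 8.5:3.4 (67.3% vs 26.9%).
cases_after = pop * ir_none * (0.05 + 0.673 * or_full + 0.269 * or_part)

print(round(cases_before), round(cases_after))  # 2541 1241
qalys = 4.8 * (cases_before - cases_after)
print(round(qalys))  # ~6236, i.e., the ~6240 QALYs quoted in the text,
                     # which rounds the case counts before multiplying
```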
Figure 13.8 The quality-adjusted survival of head injury cases and the reference population in the 80 months after onset; abscissa: months after onset, ordinate: quality-adjusted survival (Reprinted from Accident Analysis & Prevention, 31, Tsauo JY, et al. Estimation of utility gain from the helmet law in Taiwan by quality adjusted survival time, 253-63, Copyright (1999), with permission from Elsevier Science).
Table 13.2 Estimation of the expected number of prevented head injuries and QALYs (quality-adjusted life-years) gained under a helmet law enforced for one year. Assumption: the induction time for prevention of head injury is 0 (prevention becomes effective as soon as a person begins wearing a helmet while riding a motorcycle). (Reprinted from Accident Analysis & Prevention, 31, Tsauo JY, et al. Estimation of utility gain from the helmet law in Taiwan by quality adjusted survival time, 253-63, Copyright (1999), with permission from Elsevier Science.)

No. of Taipei residents: 2.7 × 10⁶
Incidence rate of head injury in Taipei: 1.82 × 10⁻³ year⁻¹
Proportion of head injuries due to traffic injuries: 64.6%
Proportion of traffic head injuries involving motorcycles: 80%
Incidence rate of head injury among motorcyclists: 9.41 × 10⁻⁴ year⁻¹ (= 1.82 × 10⁻³ × 64.6% × 80%)
Odds ratio of head injury for cyclists with no helmet / full-face helmet / partial coverage helmet: 1.0 / 0.31 / 0.73
Proportion of cyclists with no helmet before intervention: 88%
Proportion of cyclists with a full-face helmet: 8.5%
Proportion of cyclists with a partial coverage helmet: 3.4%
Incidence rate of head injury for cyclists with no helmet: 1.01 × 10⁻³ year⁻¹
Proportion of cyclists with no helmet after execution of the law: reduced to 5%
Proportion with a full-face helmet: (1 − 5%) × (8.5/12) = 67.3%
Proportion with a partial coverage helmet: (1 − 5%) × (3.4/12) = 26.9%
Number of cases in 1 year before regulation: 2.7 × 10⁶ × 9.41 × 10⁻⁴ year⁻¹ × 1 year = 2541
Number of head injury cases after enforcement of the law for 1 year: 2.7 × 10⁶ × 1.01 × 10⁻³ × (0.05 + 67.3% × 0.31 + 26.9% × 0.73) = 1241
Total utility gained from enforcement of the helmet law in 1 year: 4.8 × (2541 − 1241) = 6240 QALYs
In brief, the effectiveness of health services can be evaluated by the QAS method (for utility) and by survival-weighted psychometric assessment (for the health profile), which can be used both in national health service resource allocation and in clinical decisions, as shown in Figure 13.9.

13.3 Risk assessment: Public health and individual viewpoints

Another major contribution of epidemiologists to health policy and management is risk assessment. Although the definition of risk is not universal, it is useful to distinguish the definitions of hazard, risk and detriment. A hazard is considered to be an intrinsic property or potential of causing harm. Some examples of objects that can pose a hazard to people's health are work materials, equipment, work methods and practices. Risk is usually defined as the probability that an adverse event occurs during a stated period of time (or results from a particular challenge). Detriment is a numerical measure of the expected harm or loss associated with an adverse event; it is generally the integrated product of risk and harm and is often expressed in terms of cost (such as US dollars or sterling, loss in utility of health, or loss of productivity). In modern management systems of occupational safety and health, however, risk is defined as a combination of the likelihood of the adverse event and its consequence, as proposed in British Standard (BS) 8800 (British Standard Institute, 1996); this is also the common notion recognized among toxicologists in their risk assessments (Fan and Chang, 1996). Readers should thus be aware of this possible confusion. As this book focuses on methods of epidemiological research, I followed the epidemiological tradition of delineating risk as the probability only, in Chapter 6.
[Figure 13.9 contrasts two parallel tracks:]

Clinical decision making: maximize the individual patient's utility under resource constraint; based on psychometric theory; WHOQOL health profile (multi-dimensional) plus the survival function; survival-weighted psychometric scores for each facet, E[FAS] = ∫ S(t|xi) E_sub[Q(t|xi)] dt; each patient participates in the clinical decision to maximize the number of QALYs per given cost; how much is the patient willing to pay for a QALY?

National resource allocation: maximize the utility of all people (number of QALYs) under the constraint of the National Health Insurance System (NHIS); based on expected utility theory; standard gamble (utility, e.g., EuroQol) plus the survival function; summarized into a single dimension, quality-adjusted survival, QAS (QALY) = ∫ E[QoL(t|xi)] S(t|xi) dt; cost/QALY (or DALY); how much will the NHIS pay per QALY under the constraint of distributive justice?
Figure 13.9 Utility and psychometric assessments weighted by survival for national resource allocation and clinical decision-making (modified from Wang et al., 2000).
13.3.1 Risk assessment from a public health point of view

In epidemiology and public health, we usually consider disease or death as the adverse event. Risk can then be estimated by the cumulative incidence rate (cumulative proportion) over the specified period of time. In fact, such a risk estimate is commonly based on historical data. Its calculation was discussed earlier in Chapter 6 and can be briefly summarized as follows. Let the incidence rate IR(t) at time t be the rate of new cases developing per unit of person-time observed. The risk of developing the health event during the period t0 to t1 is estimated by:

CIR(t0, t1) = 1 − exp( −∫[t0, t1] IR(t) dt ) = 1 − exp( −Σi (IRi)(Δti) )
where ti is any small interval between t0 and t1, and assuming that the person has not died of other competing causes of death. Alternatively, construct a life table after the jth year of follow-up and apply the Kaplan-Meier method to estimate the risk during years 0 to j. Let R(j) be the risk of contracting the disease by the end of the jth year and IR(y) be the average incidence rate in the yth year; then
- n a - ^ ( y ) ) = i -(i -IR(I)XI -iR(2))---(i-iR(j)) j
Thus, one needs to obtain data on the particular incidence rates during the specified period of time in order to assess the risk. Furthermore, R(j) represents an individual's probability of getting the disease only if one first assumes that this person has not died of other competing causes. Because incidence rates may change with the variation of different determinants, one needs to assess these determinants in order to characterize the risk. Moreover, an identification of the population at risk from these hazards is also needed in an empirical assessment. For example, an occupational and environmental risk assessment usually involves the following four steps, as recommended by the U.S. NRC (National Research
Council) (NRC, 1983):

1. Hazard identification: What health effects will result from the toxicant?
2. Dose-response assessment: Draw a figure or develop a formula expressing what proportion of a population will suffer the effect at different doses. Namely, the actual doses of the agent at the target organ are considered in estimating the incidence rates in a population.
3. Exposure assessment: Evaluate or measure the exposure dose for the particular situation.
4. Risk characterization: Determine the risk or probability of occurrence of the health effect. If information on the population at risk is available, one can estimate the number of expected cases during the specified period of time.

Let us consider Example 13.4 again.

Example 13.4 (continued) Regulation of asbestos factories

Chang et al. (1999) tried to quantify the risk of increased mortality from lung cancer and mesothelioma among people living near asbestos factories. Detrimental health effects could potentially result from exposure to asbestos, which would lead to mesothelioma and lung cancer cases among the exposed. After reviewing the literature, the investigators selected a dose-response model for asbestos-related lung cancer, as follows:

D = O − E = E × (b/100) × X × w

where D: excess deaths from lung cancer = observed (O) − expected (E); b: slope in (fiber-year/cc)⁻¹, dependent on the type of asbestos factory (asbestos cement: 0.5; asbestos textile: 3); X: cumulative exposure dose of asbestos (fiber-years/cc); w: weighting factor, a conversion required because the exposure situation differs from that of the original equation.

The original assumption of this model was that workers work 40 hours
per week and 48 weeks per year. Accordingly, the model needed to be adjusted for the ambient exposure of community residents, who usually stay at home 16 hours per day, 7 days per week and about 50 weeks per year. Thus, the weighting factor was adjusted to (50 × 16 × 7)/(40 × 48) = 2.92. In addition to performing air measurements, the investigators also counted the number of people actually living near each factory, with help from local policemen, and obtained the population sizes for residences 100 m, 200 m and 300 m away from (or diameter distances 200 m, 400 m and 600 m of) the factories. Similarly, the investigators selected the following dose-response model for asbestos-related mesothelioma:

I(t) = K × C × [T^3.2 − (T − D)^3.2], if 0 < t ≤ d
I(t) = K × C × T^3.2, if t > d

where I(t): incidence rate of mesothelioma; C: asbestos concentration during exposure (f/cc); T: interval since first exposure, counted from the present (years); D: exposure duration (years); t > d denotes ever exposure and 0 < t ≤ d current exposure; K: constant, 0.04 × 10⁻⁸ for chrysotile exposure.

Although there were only 5 asbestos textile factories, they accounted for more of the lung cancer risk among nearby residents than the cement factories did, and the EPA of Taiwan issued an order to relocate the textile factories and implement more effective control technology. Tables 13.3 and 13.4 summarize the exposure concentrations and the risks of developing mesothelioma and lung cancer in areas near cement and textile asbestos factories.
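The weighting factor and the cumulative-dose column of Tables 13.3 and 13.4 can be reproduced with a few lines of Python. This is a sketch of ours; since the expected numbers of deaths E are not given in the text, the excess deaths D themselves are not recomputed here.

```python
# Sketch (ours) of the ambient-exposure adjustment behind Tables 13.3-13.4.
w = (50 * 16 * 7) / (40 * 48)   # weighting factor = 2.92 (home vs. work hours)

def cumulative_dose(conc, years=74):
    """Cumulative dose X (fiber-years/cc) for a lifelong resident exposed
    at `conc` f/cc, with the ambient weighting factor applied."""
    return conc * years * w

for conc in (0.006, 0.007, 0.012, 0.02):
    print(conc, round(cumulative_dose(conc), 2))
# 0.006 -> 1.3, 0.007 -> 1.51, 0.012 -> 2.59, 0.02 -> 4.32 (as in the tables)
```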
Table 13.3  Estimated number of excess deaths from lung cancer and mesothelioma, resulting from 74-year exposure to air-borne asbestos among residents near asbestos-cement factories in Taiwan.

Diameter    Population size   Concentration    Dose        Excess deaths
distance    (person)          in air (f/cc)    (f·yr/cc)   mesothelioma   lung cancer
200 m       1221              0.006            1.30        0.177          0.007
400 m       4221              0.006            1.51        0.714          0.011
600 m       9101              0.003            1.30        1.320          0.021
Table 13.4  Estimated number of excess deaths from lung cancer and mesothelioma, resulting from 74-year exposure to air-borne asbestos among residents near asbestos-textile factories in Taiwan.

Diameter    Population size   Concentration    Dose        Excess deaths
distance    (person)          in air (f/cc)    (f·yr/cc)   lung cancer   mesothelioma
200 m       228               0.012            2.59        0.40          0.001
400 m       552               0.02             4.32        1.60          0.004
600 m       960               0.006            1.30        0.84          0.002
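To make the arithmetic concrete, the following minimal sketch (in Python) reproduces the dose and excess-death calculations implied by the two models quoted above. The function names are illustrative rather than taken from Chang et al. (1999), and the expected number of lung cancer deaths E must be supplied from vital statistics.

    # Dose-response models for asbestos, as quoted in Example 13.4
    def cumulative_dose(conc_f_cc, years=74, w=2.92):
        """Weighted cumulative ambient dose X*w in fiber-years/cc, with the
        residential weighting factor w = (50 x 16 x 7)/(40 x 48) folded in."""
        return conc_f_cc * years * w

    def excess_lung_cancer_deaths(expected_e, slope_b, weighted_dose):
        """D = O - E = E * (b/100) * X * w; slope b is 0.5 for asbestos-cement
        and 3 for asbestos-textile factories."""
        return expected_e * (slope_b / 100.0) * weighted_dose

    def mesothelioma_incidence(conc_f_cc, t_years, d_years, k=0.04e-8):
        """I(T) = K*C*T^3.2 while exposure continues, and
        K*C*(T^3.2 - (T-D)^3.2) after exposure has ceased."""
        if t_years <= d_years:
            return k * conc_f_cc * t_years ** 3.2
        return k * conc_f_cc * (t_years ** 3.2 - (t_years - d_years) ** 3.2)

    # Residents within 200 m of a cement factory (first row of Table 13.3):
    x = cumulative_dose(0.006)   # about 1.30 fiber-years/cc, as tabulated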
Question: How much risk is acceptable?

Every day, people face the question of how much risk is acceptable or unacceptable. Although there is no universal consensus, one can use the numbers provided by the OSHA (Occupational Safety and Health Administration, 1997) and the EPA (Environmental Protection Agency, 1991) of the U.S. as references. For example, the U.S. OSHA uses an acceptable working-lifetime (45 years) risk of 10⁻³ as a guide in determining permissible exposure levels for carcinogens. Thus, they have classified different occupations according to their individual lifetime risks of mortality:

High risk occupations:
    fire fighting           27.45 × 10⁻³
    mining/quarrying        20.16 × 10⁻³
Average risk occupations:
    all manufacturing        2.7 × 10⁻³
    all service              1.62 × 10⁻³
Low risk occupations:
    electrical equipment     0.48 × 10⁻³
    retail clothing          0.07 × 10⁻³
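Because the OSHA benchmark is stated per 45-year working lifetime, comparing it with annual statistics requires a conversion. The one-liner below (Python) illustrates a common way to do this, under the added assumption, not stated in the text, of a constant and independent annual risk.

    # Convert a 45-year working-lifetime risk target into an annual risk,
    # assuming a constant, independent risk in each year of work.
    lifetime_risk = 1e-3
    annual_risk = 1 - (1 - lifetime_risk) ** (1 / 45)   # about 2.2e-5 per year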
The U.S. EPA (National Research Council, 1994) and FDA (Food and Drug Administration) consider a lifetime risk of 10⁻⁶ for each contaminant to be acceptable.

From animal to human: Can we obtain a reference value from animal experimentation?

In most situations, human epidemiological data may not be available. In such cases, we may need to extrapolate the potential risk from animal data. However, interspecies differences need to be considered and adjusted for, including body weight, body surface area, life span, pharmacokinetics, metabolism, genetic constitution, repair mechanisms, rate of intake, nutritional conditions, bacterial flora, mode/route of exposure, exposure schedule and competing causes of death. While there is no universal method to account for such differences, we usually employ a mg/kg/day or a mg/m²/day dose-rate scale. The U.S. federal agencies have used some standard reference values for such adjustments, as shown in Tables 13.5 and 13.6.
Table 13.5  Reference values for dose calculations: lifespan, body weight, food and water intake, and nominal air intake for adults.

Species   Sex   Lifespan   Body weight   Food intake,         Water intake   Air intake
                (yr)       (kg)          wet weight (g/day)   (ml/day)       (m³/day)
Human     M     70         75            1500                 2500           20
Human     F     78         60            1500                 2500           20
Mouse     M     2          0.03          5                    5              0.04
Mouse     F     2          0.025         5                    5              0.04
Rat       M     2          0.5           20                   25             0.2
Rat       F     2          0.35          18                   20             0.2
Table 13.6  Reference values for dose calculations: selected lung ventilation values for humans (m³ per time period).

                 Resting     Light activity   Light activity   Heavy activity   Total
                 (m³/8 hr)   (m³/16 hr)       (m³/8 hr)        (m³/8 hr)        (m³/24 hr)
Male (adult)     3.6         19.2             -                -                22.8
Male (adult)     3.6         -                9.6              20.6             33.8
Female (adult)   2.9         18.2             -                -                21.1
Female (adult)   2.9         -                9.1              12               24
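As a small illustration of how the reference values in Table 13.5 are used, the sketch below (Python) converts a drinking-water concentration into a mg/kg/day dose rate for a male rat; the concentration itself is a hypothetical number, not a value from the text.

    # Dose rate = concentration x daily intake / body weight, using Table 13.5
    c_water = 10.0     # mg of compound per litre of drinking water (assumed)
    intake = 0.025     # litres/day (25 ml/day, male rat, Table 13.5)
    weight = 0.5       # kg (male rat, Table 13.5)
    dose_rate = c_water * intake / weight   # = 0.5 mg/kg/day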
Moreover, the following abbreviations are useful:

NOEL (mg/kg body weight/day) = no-observed-effect level
NOAEL (mg/kg body weight/day) = no-observed-adverse-effect level
LOAEL (mg/kg body weight/day) = lowest-observed-adverse-effect level (statistically or biologically significant)
These levels can be calculated from animal and human studies after adjusting for safety factors:

(NOEL, NOAEL, LOAEL) = (Cj × Ij) / (Wj × F1 × F2 × F3 × F4 × F5 × F6)
where
Cj = concentration in mg per unit of contaminated media (air, water, or food)
Ij = intake in units of contaminated media per day (Tables 13.5 and 13.6)
Wj = adult body weight in kg
j = 1 refers to data from an animal study
j = 2 refers to data from a human study

Fi (safety factors):
F1 = 1-10, for intraspecies variation in sensitivity
F2 = 1-10, for potential synergism
F3 = 1-10, for a less appropriate route
F4 = reciprocal of the fraction of total intake contributed by the medium of interest (e.g., if drinking water accounts for only 20% of total intake of a compound, then F4 = 5)
F5 = 1 or 10: F5 = 1 in risk analysis; F5 = 10 if a LOAEL is used instead of a NOEL or NOAEL
F6 = 1 or 10: F6 = 1 for human data; F6 = 10 for animal data

After allowing for lifetime dose and latency period, the lifetime dose estimated from animal data is:
D1 = (C1)(I1)(70/W1)(74 - L)(365)

The corresponding lifetime dose estimated from human data is:

D2 = (C2)(I2)(T2)(365)
where
Ej = lifetime in years
Lj = median latent period in years
L = median human latent period in years
Tj = median exposure time in years, adjusted, if necessary, for latency, remaining lifetime from last exposure, and observation time from last exposure
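The sketch below (Python) strings these pieces together. Because the printed equations were partly garbled, the formulas follow the reconstruction given above and should be read as an illustration of the bookkeeping rather than a definitive implementation; all function names are illustrative.

    def product(factors):
        """Multiply the safety factors F1..F6 together."""
        result = 1.0
        for f in factors:
            result *= f
        return result

    def adjusted_level(conc, intake, weight, safety_factors):
        """(NOEL, NOAEL, LOAEL) = (C x I) / (W x F1 x ... x F6)."""
        return (conc * intake) / (weight * product(safety_factors))

    def animal_lifetime_dose(c1, i1, w1, latency_l, human_wt=70.0, lifetime=74.0):
        """D1: human-equivalent lifetime dose (mg) from animal data."""
        return c1 * i1 * (human_wt / w1) * (lifetime - latency_l) * 365.0

    def human_lifetime_dose(c2, i2, t2):
        """D2 = (C2)(I2)(T2)(365): lifetime dose (mg) from human data."""
        return c2 * i2 * t2 * 365.0

    # e.g., animal data (F6 = 10) and a LOAEL instead of a NOAEL (F5 = 10):
    level = adjusted_level(5.0, 0.025, 0.5, [10, 1, 1, 1, 10, 10])  # 0.00025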
Based on the above calculations, the U.S. EPA has established a series of reference values for different chemicals. To assess the risk of any exposure, one can divide the exposure dose of each chemical by its corresponding U.S. EPA reference value and sum these dose ratios across chemicals. This sum of dose ratios is called the hazard index (HI). When the HI is larger than 1.0, one cannot guarantee that the exposure is safe. Let us consider another example.

Example 13.5 Hazardous waste dumping

Company X dumped hazardous wastes containing at least 10 toxic chemicals, including VCM (vinyl chloride monomer) and tri- and tetrachloroethylene, into local wells. Three years later, when the EPA of Taiwan discovered this fact, it and the local residents demanded that Company X conduct a health risk assessment. Company X hired a consulting firm which claimed to have reviewed all of the relevant epidemiological literature to come up with the lowest level of effect. Instead of using the U.S. EPA standards for calculating no-effect levels, Company X defined its own set of "no effect levels" for these toxic chemicals. It then divided the exposure concentrations of these toxic chemicals measured at the dumping site by its "no effect levels" to come up with a hazard index. This index was about 10-100 times smaller than the hazard index (HI) calculated using U.S. EPA reference values. Is this an acceptable measurement of risk? Is there any risk? Table 13.7 displays the U.S. EPA oral reference doses, the exposure doses measured at the dumping sites in Taoyuan and Chupei, and the corresponding U.S. EPA-derived HI values for the ten toxic chemicals. Later, Lee et al. (2002) conducted a risk assessment and measured 49 off-site residential wells for these compounds. The cancer risk under reasonable maximal exposure (EPA, 1989) exceeded 1.9 × 10⁻⁴, which called for precautionary action. Thus, the community people should discontinue using the contaminated groundwater for any purpose.

In order to provide comparative risk assessment and take appropriate action to reduce risk, one must measure different health hazards in terms of a common unit for valid comparisons to be made.
Table 13.7  U.S. EPA oral reference doses, exposure doses at the hazardous waste dumping sites in the Taoyuan and Chupei areas, and the U.S. EPA-derived hazard indices (HI) for 10 toxic chemicals.

                         US EPA oral      Taoyuan dumping site                 Chupei dumping site
                         reference dose   Exposure dose       HI (dose         Exposure dose         HI (dose
                                          (mg/kg-day)         ratio)           (mg/kg-day)           ratio)
Chemical                 (mg/kg-day)      Max.      Avg.      Max.    Avg.     Max. est.  Avg. est.  Max. est.  Avg. est.
1,1-Dichloroethane       0.1              0.012     0.00069   0.12    0.0069   0.016      0.0012     0.16       0.012
1,2-Dichloroethane       0.00038*         0.0056    0.00026   14.74   0.68     -          -          -          -
1,1-Dichloroethene       0.009            0.064     0.0045    7.11    0.50     0.0053     0.00066    0.59       0.07
cis-1,2-Dichloroethene   0.01             0.064     0.005     6.4     0.50     0.0041     0.00057    0.41       0.06
1,1,1-Trichloroethane    0.09             0.052     0.004     0.58    0.04     0.086      0.0045     0.96       0.05
1,1,2-Trichloroethane    0.004            -         -         -       -        0.00065    0.00016    0.16       0.04
Methylene chloride       0.06             -         -         -       -        0.0035     0.00057    0.06       0.01
Tetrachloroethene        0.01             0.45      0.04      45      4        0.0062     0.0012     0.62       0.12
Trichloroethene          0.006            0.081     0.0068    13.5    1.13     0.003      0.00044    0.5        0.07
Vinyl chloride           0.002*           0.0025    0.00095   1.25    0.48     0.00061    0.00019    0.31       0.10
Total hazard index                                            88.70   7.34                           3.77       0.53

* Since a US EPA oral reference dose was not available, children residing at the sites, weighing 21.5 kg and consuming 1 L of water per day, were used to calculate the oral reference dose.
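A minimal sketch (Python) of the hazard-index arithmetic follows, using the Taoyuan maximum exposure estimates for three of the chemicals in Table 13.7; the dictionary keys are only labels.

    # HI = sum over chemicals of (exposure dose / oral reference dose)
    reference_dose = {                  # US EPA oral RfD, mg/kg-day
        "1,2-dichloroethane": 0.00038,
        "1,1-dichloroethene": 0.009,
        "tetrachloroethene": 0.01,
    }
    exposure_dose = {                   # maximum estimated dose, mg/kg-day
        "1,2-dichloroethane": 0.0056,
        "1,1-dichloroethene": 0.064,
        "tetrachloroethene": 0.45,
    }
    hazard_index = sum(exposure_dose[c] / reference_dose[c] for c in exposure_dose)
    # 14.74 + 7.11 + 45.0, i.e. about 66.9 from these three chemicals alone;
    # an HI above 1.0 means the exposure cannot be guaranteed to be safe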
To perform comparative risk assessment and management, one needs a common unit to represent the utility of health. The recently developed concept of the QALY can serve this purpose. Again, the QAS approach introduced in Chapter 6 and in previous sections may be the best available method to calculate the utility of health in terms of QALYs. Consequently, a comparative risk assessment can be accomplished in more quantitative terms, which can facilitate decision-making based on a consensus.

13.3.2 Risk as perceived by individuals

From an individual perspective, both the adverse effect of a particular future event and its probability are inherently subjective, because the future is uncertain and does not even exist except in the minds of people attempting to anticipate it. Bayesian statistics, as opposed to frequentist statistics, attempts to deal with the subjectivity involved in risk. Recently, Thompson and Adams (Adams, 1995) approached this question with the assumption that risk is culturally constructed. They advocated the theory of risk compensation: it stipulates that everyone has a propensity to take risks, which is influenced by the rewards of risk-taking. Perceptions of risk are influenced by the experience of accidental losses, both one's own and others'. Individual risk-taking decisions represent a balancing act in which perceptions of risk are weighed against the propensity to take risk. The following advice is taken from Adams' book:

1. Everyone is seeking to manage risk. We are all guessing. If we knew for certain, then we would not be dealing with risk.
2. Our guesses are strongly influenced by our cultural backgrounds and beliefs, and at the same time, our guesses tend to reinforce these beliefs. Furthermore, our guesses strongly influence our behavior.
3. In the absence of a reduction in people's propensity to take risks, safety and/or health interventions will redistribute the burden of risk; they will not reduce it.
4. It will never be possible to capture "objective risk," however powerful your computer, because the computer's predictions will then guide behavior intended to influence that which is predicted. (This is similar to Heisenberg's uncertainty principle: the act of measuring the location of a particle alters the position of the particle in an unpredictable way.)
Although Adams' idea itself seems subjective and somewhat premature, it does describe how people look at issues of risk. Since efforts to reduce risk aim at behavior modification, more research and attempts to falsify this hypothesis are needed in the future.

13.4 Summary

In medical and public health policy-making, epidemiologists are confronted with decisional questions, in addition to descriptive and causal questions. A decision analysis generally involves the subjective judgment of probability and preference (utility) for different health events, which may reasonably be based on previous empirical experience. While probabilities or frequencies of health events can be obtained through general epidemiological research, the assessment of preference or utility of health calls for a common unit, the QALY, to be quantified. One can evaluate the effectiveness of different kinds of health services, including preventive, diagnostic, therapeutic and rehabilitative services, by calculating the QAS to determine QALYs gained or lost. As a result, one can directly use QALYs to measure the utility of health and make comparisons for decision-making. Moreover, as the QAS can be extrapolated throughout life, the utility loss from most major diseases can be quantified, which makes national health service resource allocation more feasible. Individual facets of quality of life, or a health profile, can also be assessed through psychometric methods and weighted with survival, which can then be used to improve the effectiveness of clinical decisions. One may be able to perform risk assessment in health with the QALY concept and compare
the different risks in terms of the number of QALYs lost or gained in the future. Since risk, as perceived by an individual, is always subjective and one's attitude toward risk is probably culturally determined, future work on individual risk reduction should also focus on subjective perception, in addition to objective quantification of risk.
Quiz of Chapter 13

Please write down the probability (%) that each assertion is true. You will be scored according to the credibility of your subjective judgment.

Score %
1. The utility value in clinical decision analysis can be assigned as the expected number of QALYs (quality-adjusted life years) gained or lost, which is more efficient than simply giving a value between 0 and 1.
2. The primary unit of utility of health is the number of lives saved or lost, which should always be considered in the cost-effectiveness analysis of health policy.
3. The QALY calculation cannot be used as a primary basis for across-age allocation of health service resources because it ethically discriminates against old-aged people.
4. If we study the lung function impairment from air pollution among Thai school children, the exposed and non-exposed populations should have the same color uniforms.
5. If one studies the hazardous effect of arsenic on school children's IQ (intelligence quotient), he/she should select a non-exposed school with parents of similar educational background.
6. There are only two major determinants of the representativeness of a sample, i.e., the response rate and how much difference there is between the non-respondents and respondents.
7. We shall always consider all determinants of outcome and all determinants of exposure as potential confounders.
8. The calculation of the expected number of QALYs lost or gained is also needed for comparative risk assessment.
9. The U.S. EPA (Environmental Protection Agency) and FDA (Food and Drug Administration) consider a lifetime risk of 10⁻⁶ to be acceptable.
10. The main reason why people take risks is that there is always some reward behind the risk.
References

Adams J. Risk, London, U.K.: UCL Press, 1995.
Albanes D. Beta-carotene and lung cancer: a case study. Am J Clin Nutr, 1999; 69(6): 1345S-1350S.
American Geriatric Society Public Committee. Equitable distribution of limited medical resources. J Am Geriatr Soc, 1989; 37: 1063-1064.
Anderson S, Auquier A, Hauck WW, Oakes D, Vandaele W, Weisberg HI. Statistical methods for comparative studies, Chapter 2, Confounding factors, New York: John Wiley & Sons, 1980, pp. 7-17.
Appell D. The new uncertainty principle. Sci Am, 2001; 284(1): 18-19.
Bala MV, Wood LL, Zarkin GA, Norton EC, Gafni A, O'Brien BJ. Are health states "timeless"? The case of the standard gamble method - a comparison of willingness-to-pay and quality-adjusted life-years. J Clin Epidemiol, 1999; 52: 1047-1053.
Beauchamp TL, Childress JF. Principles of biomedical ethics, 4th ed and 5th ed, New York: Oxford University Press, 1994 and 2001.
Becher H. The concept of residual confounding in regression models and some applications. Stat Med, 1992; 11: 1747-58.
Bell DE, Raiffa H, Tversky A. Descriptive, normative, and prescriptive interactions in decision making, in Bell DE, Raiffa H, Tversky A (eds.), Decision making: Descriptive, normative, and prescriptive interactions, Cambridge, U.K.: Cambridge University Press, 1988, pp. 9-30.
Blalock HM. Conceptualization and measurement in the social sciences, Beverly Hills: Sage Publications, 1982, pp. 7-55.
Boivin JH, Wacholder S. Conditions for confounding of the risk ratio and of the odds ratio. Am J Epidemiol, 1985; 121: 152-8.
Bowra GT, Duffield DP, Osborn AJ, Purchase JPH. Premalignant and neoplastic skin lesions associated with occupational exposure to "tarry" byproducts during manufacture of 4,4'-bipyridyl. Br J Industr Med, 1982; 39: 76-81.
Breslow NE, Day NE. Indirect standardization and multiplicative models for rates, with reference to the age adjustment of cancer incidence and relative frequency data. J Chron Dis, 1975; 28: 289-303.
Breslow NE, Day NE. Statistical methods in cancer research, vol I - The analysis of case-control studies, Lyon: International Agency for Research on Cancer, 1980, pp. 41-81.
Breslow NE, Day NE. Statistical methods in cancer research, vol II - The design and analysis of cohort studies, Lyon: International Agency for Research on Cancer, 1987.
British Standard Institute. Guide to occupational health and safety management systems, British Standard BS 8800, 1996.
Carmines EG, Zeller RA. Reliability and validity assessment, Beverly Hills, California: Sage Publications, Inc., 1979.
Carr-Hill R. Assumption of QALY procedure. Soc Sci Med, 1989; 29: 469-477.
Chang HY, Chen CR, Wang JD. Risk assessment of lung cancer and mesothelioma in people living near asbestos-related factories in Taiwan. Arch Environ Health, 1999; 54: 194-201.
Chang PJ. Factors influencing the accuracies of measurement data obtained from questionnaire interview, Master thesis, Taipei, Taiwan: National Taiwan University, 1988.
Chang PJ, Wang JD. The accuracy of occupational histories obtained from spouses. Progress in Occupational Epidemiology, 1988, 53-62.
Chao KY, Wang JD. Increased lead absorption caused by working next to a lead recycling factory. Am J Industr Med, 1994; 26: 229-235.
Chen CJ, Wu MM, Lee SS, Wang JD, Cheng SH, Wu HY. Atherogenicity and carcinogenicity of high-arsenic artesian well water: multiple risk factors and related malignant neoplasms of blackfoot disease. Arteriosclerosis, 1988; 8: 452-60.
Chen DS, Hsu NH, Sung JL, et al. A mass vaccination program in Taiwan against hepatitis B virus infection in infants of hepatitis B surface antigen-carrier mothers. J Am Med Assoc, 1987; 257: 2597-603.
Chen DS, Sung JL. Hepatitis B virus infection and chronic liver diseases in
Taiwan. Acta Gastroenterol, 1978; 25: 423-430.
Chen JD, Wang JD, Tsai SY, Chao WI. Effects of occupational and non-occupational factors on liver function tests in workers exposed to solvent mixtures. Arch Environ Health, 1997; 52: 270-274.
Chen PC, Doyle PE, Ho CK, Chang PJ, Wang JD. Influence of maternal risk factors on low birthweight, preterm delivery and small for gestational age - a prospective cohort study of pregnancy. Chinese J Public Health, 2000; 19(3): 192-202.
Chiang CY, Wang JD, Lee CH. The disease pattern and demand of the emergency medical services system in Taipei. J Natl Public Health Assoc (R.O.C.), 1986; 6(2): 50-63. (in Chinese)
Chou JH, Hwang PH, Malison MD. An outbreak of type A foodborne botulism in Taiwan due to commercially preserved peanuts. Int J Epidemiol, 1988; 17(4): 899-902.
Cochran WG. Sampling techniques, 3rd ed, New York: John Wiley & Sons, 1977.
Cole P, Morrison AS. Basic issues in population screening for cancer. J Natl Cancer Inst, 1980; 64: 1263-72.
Committee on the Biological Effects of Ionizing Radiations (BEIR). Health effects of exposure to low levels of ionizing radiation, BEIR V, Washington, D.C.: National Academy Press, 1990.
Cooper SP, Downs T, Burau K, Buffler PA, Tucker S, Whitehead L, Wood S, Delclos G, Huang B, Davidson T, Key M. A survey of actinic keratoses among paraquat production workers and a nonexposed friend reference group. Am J Industr Med, 1994; 25: 335-47.
Cooperberg PL, Burhenne HJ. Real-time ultrasonography: diagnostic technique of choice in calculous gallbladder disease. New Engl J Med, 1980; 302: 1277-9.
Copi IM. Introduction to logic, 4th ed, New York: MacMillan Co., 1972.
Cornfield J. A method of estimating comparative rates from clinical data. Application to cancer of the lung, breast and cervix. J Natl Cancer Inst, 1951; 11: 1269-75.
Covey LS, Wynder EL. Smoking habits and occupational status. J Occup
Med, 1981; 23: 537-42.
Cox DR. Regression models and life tables (with discussion). J Roy Stat Soc B, 1972; 34: 187-220.
Cox DR, Fitzpatrick R, Fletcher AE, Gore SM, Spiegelhalter DJ and Jones DR. Quality-of-life assessment: Can we keep it simple? J R Stat Soc A, 1992; 155: 353-393.
Cullen MR, Robins JM, Eskenazi B. Adult inorganic lead intoxication: presentation of 31 new cases and a review of recent advances in the literature. Medicine, 1983; 62: 221-47.
Davis W. Cancer registration and its techniques, Lyon: International Agency for Research on Cancer, 1978, pp. 162-3.
Declaration of Helsinki. Recommendations guiding medical doctors in biomedical research involving human subjects, adopted by the 18th World Medical Assembly, Helsinki, Finland, 1964, and amended by the 52nd World Medical Assembly, Edinburgh, Scotland, October, 2000.
Decoufle P, Thomas TL, Pickle LW. Comparison of proportionate mortality ratio and standardized mortality ratio risk measures. Am J Epidemiol, 1980; 111: 263-9.
Deng JF, Wang JD, Shih TS, Lan FL. Outbreak of carbon tetrachloride poisoning in a color printing factory related to the use of isopropyl alcohol and an air conditioning system in Taiwan. Am J Industr Med, 1987; 12: 11-19.
Department of Health and Human Service (DHHS). Blood lead proficiency testing, 1986-2000, Rockville, Maryland: DHHS, Public Health Service, Center for Disease Control.
Department of Health. Vital statistics, Taipei, Taiwan: Department of Health, Executive Yuan, Republic of China, 1994 and 1996.
Department of Health. Vital statistics, Taipei, Taiwan: Department of Health, Executive Yuan, Republic of China, 1999.
Ding SL, Wang JD, Chen KT. Estimation of case fatality rate and incidence rate of traffic injury in Taiwan - analysis of 4329 victims at a medical center. J Formosan Med Assoc, 1993; 92: S76-S81. (in Chinese)
Ding SL, Pai L, Wang JD, Chen KT. Head injuries in traffic accidents with
emphasis on the comparisons between motorcycle-helmet users and non-users. J Formosan Med Assoc, 1994; 93(Suppl): S42-S48. (in Chinese)
Division of Mental Health, World Health Organization (WHO). Quality of life assessment: An annotated bibliography, Geneva: WHO, 1994.
Dolby GR. The role of statistics in the methodology of the life sciences. Biometrics, 1982; 38: 1069-83.
Doll R. Comparison between registries. Age-standardized rates, in Waterhouse JW, Muir C, Correa P, Powell J (eds.), Cancer incidence in five continents, vol. III, Lyon: International Agency for Research on Cancer, 1976, pp. 453-9.
Doll R, Peto R. The causes of cancer. J Natl Cancer Inst, 1981; 66: 1191-1308.
Drummond MF, Stoddart GL, Torrance GW. Methods for the economic evaluation of health care programmes, 2nd ed, Oxford, U.K.: Oxford University Press, 1997.
Du CL, Chan CC, Wang JD. Comparison of personal and area sampling strategies in assessing workers' exposure to vinyl chloride monomer. Bull Environ Contam Toxicol, 1996; 56: 534-42.
Du CL, Wang JD. Increased morbidity odds ratio of primary liver cancer and cirrhosis of the liver among vinyl chloride monomer workers. Occup Environ Med, 1998; 55(8): 528-32.
Durkheim E. Suicide: A study in sociology, translated by Spaulding JA and Simpson G, Glencoe, Ill.: The Free Press, 1951.
Eccles JC. In praise of falsification, in Tweney RD, Doherty ME, Mynatt CR (eds.), On scientific thinking, New York: Columbia University Press, 1981, pp. 109-10.
Elandt-Johnson RC. Definition of rates: some remarks on their use and misuse. Am J Epidemiol, 1975; 102: 267-71.
Enterline PE, Henderson VL, Marsh GM. Exposure to arsenic and respiratory cancer, a reanalysis. Am J Epidemiol, 1987; 125: 929-938.
European Commission. Information notices on diagnosis of occupational diseases, Luxembourg: Office for Official Publications of the European Communities, 1994.
Evans A. Causation and disease: a chronological journey. Am J Epidemiol, 1978; 108: 249-58.
Fairweather WR. Comparing proportion exposed in case-control studies using several control groups. Am J Epidemiol, 1987; 126: 170-8.
Fan AM, Chang LW (eds.). Toxicology and risk assessment: Principles, methods, and applications, New York: Marcel Dekker, Inc., 1996.
Feinstein AR. An additional basic science for clinical medicine: I. The constraining fundamental paradigms. Ann Intern Med, 1983; 99: 393-7.
Feinstein AR. Clinical epidemiology - the architecture of clinical research, Philadelphia: WB Saunders, 1985.
Feyerabend P. Against method, Thetford, Norfolk, U.K.: Thetford Press Ltd., 1975.
Fink A, Kosecoff J, Chassin M, Brook RH. Consensus methods: characteristics and guidelines for use. Am J Public Health, 1984; 74: 979-83.
Fraser DW. Epidemiology as a liberal art. New Engl J Med, 1987; 316: 309-14.
Freeman III AM. The measurement of environmental and resource values: Theory and methods, Washington, D.C.: Resources for the Future, 1991, pp. 314-66.
Freeman J, Hutchison GB. Prevalence, incidence and duration. Am J Epidemiol, 1980; 112: 707-23.
Frerichs RR, Beeman BL, Coulson AH. Los Angeles airport noise and mortality - faulty analysis and public policy. Am J Public Health, 1980; 70: 357.
Freund JE, Walpole RE. Mathematical statistics, 3rd ed, Englewood Cliffs: Prentice-Hall Inc., 1980, pp. 59-64.
Fryback DG, Dasbach EJ, Klein R, Klein BEK, Dorn N, Peterson K, Martin PA. The Beaver Dam health outcomes study: Initial catalog of health-state quality factors. Med Dec Making, 1993; 13: 89-102.
Garber AM. Advances in cost-effectiveness analysis of health interventions, in Culyer AJ, Newhouse JP (eds.), Handbook of health economics, Amsterdam: Elsevier, 2000, pp. 181-221.
Gibson G, Picker ER, Wagner JL. Evaluative measures and data collection methods for emergency medical services systems. Public Health Rep, 1977; 92: 315-21.
Glass RI. New prospects for epidemiologic investigations. Science, 1986; 234: 951-5.
Gold MR, Siegel JE, Russell LB, Weinstein MC. Cost-effectiveness in health and medicine, 1st ed, Oxford, U.K.: Oxford University Press, 1996.
Gore SM. Assessing clinical trials - first steps. Br Med J, 1981; 282: 1605-7.
Greenland S. Response and follow-up bias in cohort studies. Am J Epidemiol, 1977; 106: 184-7.
Greenland S. Control-initiated case-control studies. Int J Epidemiol, 1985; 14: 130-134.
Greenland S. Interpretation and choice of effect measures in epidemiologic analysis. Am J Epidemiol, 1987a; 125: 761-8.
Greenland S. Quantitative methods in the review of epidemiologic literature. Epidemiol Rev, 1987b; 9: 1-30.
Greenland S. Randomization, statistics, and causal inference. Epidemiology, 1990; 1: 421-429.
Greenland S. Divergent biases in ecologic and individual-level studies. Stat Med, 1992; 11: 1209-23.
Greenland S. Absence of confounding does not correspond to collapsibility of the rate ratio or rate difference. Epidemiology, 1996; 7: 498-501.
Greenland S. Probability logic and probabilistic induction. Epidemiology, 1998a; 9: 322-332.
Greenland S. Induction versus Popper: substance versus semantics. Int J Epidemiol, 1998b; 27: 543-548.
Greenland S. Relation of probability of causation to relative risk and doubling dose: a methodologic error that has become a social problem. Am J Public Health, 1999; 89: 1166-1169.
Greenland S, Robins JM. Identifiability, exchangeability, and epidemiological confounding. Int J Epidemiol, 1986; 15: 413-9.
Greenland S, Robins JM. Conceptual problems in the definition and
interpretation of attributable fractions. Am J Epidemiol, 1988; 128: 1185-1197.
Greenland S, Robins JM. Ecologic studies - biases, misconceptions, and counterexamples. Am J Epidemiol, 1994; 139: 747-760.
Greenland S, Thomas DC. On the need for the rare disease assumption in case-control studies. Am J Epidemiol, 1982; 116: 547-53.
Hatch MC, Beyea J, Nieves JW, Susser M. Cancer near the Three Mile Island Nuclear Plant: radiation emissions. Am J Epidemiol, 1990; 132: 397-412.
Hayes RB. The carcinogenicity of metals in humans. Cancer Causes and Control, 1997; 8: 371-85.
Hill AB. Observation and experiment. New Engl J Med, 1953; 248: 995-1001.
Hill AB. The environment and disease: association or causation? Proc Roy Soc Med, 1965; 58: 295-300.
Hempel CG. Philosophy of natural science, Englewood Cliffs, N.J.: Prentice-Hall, 1966.
Hopkin K. The risks on the table. Scientific American, 2001. URL: http://www.sciam.com/2001/0401issue/0401hopkin.html.
Howson C, Urbach P. Scientific reasoning: The Bayesian approach, 2nd ed, LaSalle, IL: Open Court, 1993.
Huang YS, Deng JF, Wang JD, Lee SD, Wu JC, Wang JY, Tsai YT, Tsay SH. Clinical manifestations and laboratory findings of cases in an outbreak of carbon tetrachloride-induced hepatic injury at a printing factory. J Formosan Med Assoc, 1987; 86: 743-9. (in Chinese)
Hwang JS, Chen YJ, Wang JD, Lai YM, Yang CY, Chan CC. A subject-domain approach to study air pollution effects on schoolchildren's illness absence. Am J Epidemiol, 2000; 152: 67-74.
Hwang JS, Tsauo JY, Wang JD. Estimation of expected quality adjusted survival by cross-sectional survey. Stat Med, 1996; 15: 93-102.
Hwang JS, Wang JD. Monte Carlo estimation of extrapolation of quality-adjusted survival for follow-up studies. Stat Med, 1999; 18: 1627-1640.
Hwang JS, Wang JD. Survival weighted psychometric assessment for health
related quality of life. Quality of Life Research, 2001. (submitted)
Hwang YH, Wang JD. Temporal fluctuation of the lead level in the cord blood of neonates in Taipei. Arch Environ Health, 1990; 45: 42-5.
Hwu HG, Yeh YL, Wang JD. Risk factors of alcoholism among Taiwan aborigines: implications for etiological models and the nosology of alcoholism. Acta Psychiatr Scand, 1991; 83: 267-72.
Hwu HG, Yeh YL, Wang JD, Yeh EK. Alcoholism among Taiwan aborigines defined by the Chinese diagnostic interview schedule: a comparison with alcoholism among Chinese. Acta Psychiatr Scand, 1990; 82: 374-80.
International Agency for Research on Cancer (IARC). IARC monographs on the evaluation of carcinogenic risks to humans, suppl. 7, Overall evaluations of carcinogenicity: An updating of IARC monographs vols. 1-42, Lyon, France: IARC, 1987.
International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. Ann Intern Med, 1982; 96: 766-71.
International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. New Engl J Med, 1997; 336: 309-15.
International Labor Office. Guidelines for the use of ILO international classification of radiographs of pneumoconiosis, revised ed, Geneva: International Labor Office, 1980, pp. 4-17.
Jang CS, Wang JD, Hwang YH, Chang YC. Lead poisoning in a battery recycling smelter. J Occup Safety Health, 1994; 2(2): 11-21.
Jee SH, Kuo HW, Su WP, Chang CH, Sun CC, Wang JD. Photodamage and skin cancer among paraquat workers. Intl J Dermatol, 1995; 34: 466-9.
Jee SH, Wang JD, Sun CC, Chao YF. Prevalence of probable kerosene dermatoses among ball-bearing factory workers. Scand J Work Environ Health, 1986; 12: 61-65.
Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data, New York: John Wiley & Sons, 1980.
Kelsey JL, Whittemore AS, Evans AS, Thompson WD. Methods in
observational epidemiology, 2nd ed, New York: Oxford University Press, 1996.
Kenkel D. Cost of illness approach, in Tolley G, Kenkel D, Fabian R (eds.), Valuing health for policy: an economic approach, Chicago: University of Chicago Press, 1994, pp. 42-71.
Kish L. Survey sampling, New York: John Wiley & Sons, 1965.
Kleinbaum DG, Kupper LL. Applied regression analysis and other multivariable methods, 2nd ed, North Scituate: Duxbury Press, 1988.
Kuhn TS. The structure of scientific revolutions, International Encyclopedia of Unified Science, 1970.
Kupper LL, Kleinbaum DG, Morgenstern H. Principles of epidemiologic research, London, U.K.: Lifetime Learning Pub., 1982.
Kupper LL, McMichael AJ, Symon MJ, Most BM. On the utility of proportional mortality analysis. J Chron Dis, 1978; 31: 15-22.
LaPuma J, Lawlor EF. Quality-adjusted life-years: ethical implications for physicians and policy-makers. J Am Med Assoc, 1990; 263: 2917-21.
Lauwerys RR. Industrial chemical exposure: guidelines for biological monitoring, Davis, CA: Biomedical Publications, 1983, pp. 27-38.
Lee ET. Statistical methods for survival data analysis, New York: John Wiley & Sons.
Lee LJH, Chan CC, Chung CW, Ma YC, Wang GS, Wang JD. Health risk assessment on residents exposed to chlorinated hydrocarbons contaminated in groundwater of a hazardous waste site. J Toxicol Environ Health, 2002; 65: 293-309.
Liang KY, Stewart WF. Polychotomous logistic regression methods for matched case-control studies with multiple case or control groups. Am J Epidemiol, 1987; 125: 720-31.
Liden G, Kenny LC. The performance of respirable dust samplers: sampler bias, precision and inaccuracy. Ann Occup Hygiene, 1992; 36: 1-22.
Lin MR, Yao KPG, Hwang JS, Wang JD. Scale descriptor selection for the Taiwan version of the questionnaire of World Health Organization quality of life. Chinese J Public Health, 1999; 18(4): 262-270. (in Chinese)
Lin RD, Yao KP, Pai L, Yu CT, Wang JD. Reliability and validity of utility
approach to measuring health-related quality of life: An example of patients on hemodialysis. Chinese J Public Health, 1997; 16: 404-416. (in Chinese)
Lin YC, Ko YC, Fu WC, Ou CC. An observation on the air pollution of petroleum industry especially on the relationship between the ambient air sulfur dioxide and the prevalence of asthma cases. Environ Protection, 1981; 4: 43-54.
Liu YH, Du CL, Lin CT, Chan CC, Chen CJ, Wang JD. Increased morbidity from nasopharyngeal carcinoma and chronic pharyngitis or sinusitis among workers at a newspaper printing company. Occup Environ Med, 2002; 58. (in press)
Lubin JH. Case-control methods in the presence of multiple failure times and competing risks. Biometrics, 1985; 41: 49-54.
Luce RD, Tukey JW. Simultaneous conjoint measurement: a new type of fundamental measurement. J Math Psychol, 1964; 1: 1-27.
Maclure M. Popperian refutation in epidemiology. Am J Epidemiol, 1985; 121: 343-50.
MacMahon B, Pugh TF. Causes and entities of disease, in Clark DW and MacMahon B (eds.), Preventive medicine, 1st ed, Boston: Little, Brown and Co., 1967.
MacMahon B, Pugh TF. Epidemiology: principles and methods, Boston: Little, Brown and Company, 1970.
Mantel N. Chi-square tests with one degree of freedom: extension of the Mantel-Haenszel procedure. J Am Stat Assoc, 1963; 58: 690-700.
Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of diseases. J Natl Cancer Inst, 1959; 22: 719-48.
Meecham WC, Shaw N. Effects of jet noise on mortality rates. Br J Audiol, 1979; 13: 77.
Meinert CL, Tonascia S. Clinical trials: Design, conduct, and analysis, Oxford, U.K.: Oxford University Press, 1984.
Michael III M, Boyce WT, Wilcox AJ. Biomedical bestiary, Boston: Little, Brown and Company, 1984.
Michell J. An introduction to the logic of psychological measurement,
Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1990.
Miettinen OS. Components of the crude risk ratio. Am J Epidemiol, 1972a; 96: 168-72.
Miettinen OS. Standardization of risk ratios. Am J Epidemiol, 1972b; 96: 383-8.
Miettinen OS. Proportion of disease caused or prevented by a given exposure, trait, or intervention. Am J Epidemiol, 1974a; 99: 325-32.
Miettinen OS. Confounding and effect modification. Am J Epidemiol, 1974b; 100: 350-353.
Miettinen OS. Estimability and estimation in case-referent studies. Am J Epidemiol, 1976; 103: 226-35.
Miettinen OS. Design options in epidemiologic research. An update. Scand J Work Environ Health, 1982; 8(suppl 1): 7-14.
Miettinen OS. Theoretical epidemiology: principles of occurrence research in medicine, New York: John Wiley & Sons, 1985a, pp. 39-44, 245-250, 266-272.
Miettinen OS. The "case-control" study: valid selection of subjects. J Chron Dis, 1985b; 7: 543-8.
Miettinen OS, Cook EF. Confounding: Essence and detection. Am J Epidemiol, 1981; 114: 593-603.
Miettinen OS, Wang JD. An alternative to the proportionate mortality ratio. Am J Epidemiol, 1981; 114: 144-8.
Morris JN. Uses of epidemiology, 3rd ed, Edinburgh: Churchill Livingstone, 1975, p. 3.
Morrison AS. Sequential pathogenic components of rates. Am J Epidemiol, 1979; 109: 709-18.
Morrison AS. Screening in chronic disease, 2nd ed, New York: Oxford University Press, 1992.
Mulhausen JR, Damiano J. A strategy for assessing and managing occupational exposures, Fairfax, VA: American Industrial Hygiene Association Press, 1998.
National Center for Health Statistics. U.S. State Life Tables for 1959-61, Vol. 2, Nos. 1-26, Public Health Service Publication No. 1252, 1966.
National Research Council. Risk assessment in the Federal Government: Managing the process, Washington, D.C.: National Academy Press, 1983.
National Research Council. Science and judgment in risk assessment, Washington, D.C.: National Academy Press, 1994.
Neuhausen SL. Ethnic differences in cancer risk resulting from genetic variation. Cancer, 1999 Oct 15; 86(8 Suppl): 1755-62.
Nurminen M. Statistical significance - a misconstrued notion in medical research. Scand J Work Environ Health, 1997; 23: 232-235.
Parker RL, Dry TJ, Willius FA, Gage RP. Life expectancy in angina pectoris. JAMA, 1946; 131: 95-100.
Patrick DL, Erickson P. Health status and health policy: quality of life in health care evaluation and resource allocation, New York: Oxford University Press, 1993, pp. 76-112.
Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient, I. Introduction and design. Br J Cancer, 1976; 34: 585-612.
Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient, II. Analysis and examples. Br J Cancer, 1977; 35: 1-39.
Pocock SJ. Clinical trials, a practical approach, New York: John Wiley & Sons, 1983.
Popper KR. Conjectures and refutations: the growth of scientific knowledge, New York: Harper & Row, 1965, pp. 33-65, 215-250.
Popper KR. The open society and its enemies, 5th ed, London: Routledge and Kegan Paul, 1966.
Popper KR. The logic of scientific discovery, New York: Harper and Row, 1968.
Popper KR. Objective knowledge: an evolutionary approach, Oxford: Clarendon Press, 1972.
Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika, 1986; 73: 1-11.
Prentice RL, Breslow NE. Retrospective studies and failure time models.
Biometrika, 1978; 65: 153-8.
Rabinowitz MB, Bellinger D, Leviton A, Wang JD. Lead levels among various deciduous tooth types. Bull Environ Contam Toxicol, 1991b; 47: 602-8.
Rabinowitz MB, Wang JD, Soong WT. Dentine lead and child intelligence in Taiwan. Arch Environ Health, 1991a; 46: 351-60.
Rabinowitz MB, Wang JD, Soong WT. Apparent threshold of lead's effect on child intelligence. Bull Environ Contam Toxicol, 1992; 48: 688-95.
Raiffa H. Decision analysis: Introductory lectures on choices under uncertainty, New York: Addison-Wesley, 1976.
Robins J. Data, design, and background knowledge in etiologic inference. Epidemiology, 2001; 12: 313-320.
Robinson R. Cost-utility analysis. Br Med J, 1993; 307: 859-862.
Rosenberg HM, Ventura SJ, Maurer JD, et al. Births and deaths: United States, 1995, Monthly Statistical Report, Hyattsville, Maryland: National Center for Health Statistics, 1996, Vol. 45, No. 3, Suppl. 2, p. 31.
Rosenman KD. Cardiovascular disease and environmental exposure. Br J Industr Med, 1979; 36: 85-97.
Rosser R, Cottee M, Rabin R, Selai C. Index of health-related quality of life, in Hopkins A (ed.), Measures of quality of life, London, U.K.: Royal College of Physicians of London, 1992, pp. 81-90, 147-153.
Rothman KJ. Causes. Am J Epidemiol, 1976; 104: 587-92.
Rothman KJ. A show of confidence (Editorial). New Engl J Med, 1978; 299: 1362-3.
Rothman KJ. Induction and latent periods. Am J Epidemiol, 1981; 114: 253-9.
Rothman KJ. Modern epidemiology, Boston: Little, Brown and Co., 1986.
Rothman KJ. Causal inference, Chestnut Hill, Massachusetts: Epidemiology Resources Inc., 1988.
Rothman KJ, Greenland S. Modern epidemiology, 2nd ed, Philadelphia, PA: Lippincott-Raven Pub., 1998.
Sackett DL. Bias in analytic research. J Chron Dis, 1979; 32: 51-63.
Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical epidemiology: a
basic science for clinical medicine, 2nd ed, Boston: Little, Brown and Co., 1991, pp. 239-243.
Salmon WC. Confirmation. Scientific American, 1973; May: 75-83.
Savage LJ. The foundations of statistics, New York: Dover Publications, Inc., 1972.
Schilling RSF. Occupational health practice, London: Butterworth, 1973, pp. 172-80.
Schnall PL, Landsbergis PA, Baker D. Job strain and cardiovascular disease. Annu Rev Public Health, 1994; 15: 381-411.
Schouten LJ, Straatman H, Kiemeney LALM, Verbeek ALM. Cancer incidence: life table risk versus cumulative risk. J Epidemiol Comm Health, 1994; 48: 596-600.
Scotto J, Fraumeni JF Jr. Skin (other than melanoma), in Schottenfeld D, Fraumeni JF Jr (eds.), Cancer epidemiology and prevention, Philadelphia: WB Saunders, 1982, pp. 996-1011.
Selikoff IJ, Hammond EC, Churg J. Asbestos exposure, smoking, and neoplasia. J Am Med Assoc, 1968; 204: 104-110.
Sheehe PR. Dynamic risk analysis in retrospective matched-pair studies of disease. Biometrics, 1962; 18: 323-341.
Shu CC, Wang JD, Shen CY, Chen SC. Prescription pattern of Chinese herb drugs among herb drug stores and Chinese herb doctors in Taipei. J Natl Public Health Assoc (R.O.C.), 1987; 6(3): 91-104. (in Chinese)
SilverPlatter. OSH-ROM, Occupational Safety and Health on CD-ROM, Health and Safety Publishing, 1998.
Skoog DA. Principles of instrumental analysis, Philadelphia: Saunders College Publishing, 1985, pp. 5-29.
Snedecor GW, Cochran WG. Statistical methods, 7th ed, Ames: The Iowa State University Press, 1980, pp. 3-16, 44-51, 434-63.
Snow J. Snow on cholera (a reprint of two papers by John Snow), New York: Commonwealth Fund, 1936.
Soong WT, Chao KY, Jang CS, Wang JD. Long-term effect of increased lead absorption on intelligence of children. Arch Environ Health, 1999; 54(4): 297-301.
Spiegelhalter DJ, Gore SM, Fitzpatrick R, Fletcher AE, Jones DR, and Cox DR. Quality of life measures in health care, III: Resource allocation. Br Med J, 1992; 305: 1205-1209.
Spilker B. Guide to clinical trials, New York: Raven Press, 1991.
Staquet MJ, Hays RD, Fayers PM. Quality of life assessment in clinical trials: Methods and practice, Oxford, U.K.: Oxford University Press, 1998.
Stayner LT, Dankovic DA, Lemen RA. Occupational exposure to asbestos and cancer risk: A review of the amphibole hypothesis. Am J Public Health, 1996; 86: 179-186.
Sterling TD, Weinkam JJ. Smoking characteristics by type of employment. J Occup Med, 1976; 18: 743-54.
Stevens SS. On the theory of scales of measurement. Science, 1946; 103: 667-680.
Stolley PD. When genius errs: RA Fisher and the lung cancer controversy. Am J Epidemiol, 1991; 133: 416-25.
Stolley PD, Lasky T. Investigating disease patterns: The science of epidemiology, New York: Scientific American Library, 1995.
Sung JL, Chen DS, Lai MY, et al. Epidemiological study on hepatitis B virus infection in Taiwan. Chinese J Gastroenterol, 1984; 1: 1-9.
Susser M. Judgement and causal inference: criteria in epidemiologic studies. Am J Epidemiol, 1977; 105: 1-15.
Susser M. The logic of Sir Karl Popper and the practice of epidemiology. Am J Epidemiol, 1986; 124: 711-8.
Susser M. What is a cause and how do we know one? A grammar for pragmatic epidemiology. Am J Epidemiol, 1991; 133: 635-48.
Tarski A. Introduction to logic and to the methodology of deductive sciences, Oxford, U.K.: Oxford University Press, 1957.
Tarski A. Truth and proof. Scientific American, 1969; June: 63-77.
Testa MA, Simonson DC. Assessment of quality-of-life outcomes. New Engl J Med, 1996; 334: 835-840.
The European Commission. Commission adopts communication on precautionary principle, URL: http://europa.eu.int/comm/trade/
whats_new/dpp_en.htm (cited 2001 Nov 8)
The World Bank. World development report 1993: Investing in health, Oxford, U.K.: Oxford University Press, 1993.
Torrance GW, Thomas WH, Sackett DL. A utility maximization model for evaluation of health care programs. Health Service Res, 1972; 7: 118-133.
Tsai SJ, Chang YC, Wang JD, Chou JH. Outbreak of type A botulism caused by a commercial food product in Taiwan: clinical and epidemiological investigations. Chin Med J (Taipei), 1990; 46: 43-8.
Tsai TJ, Lai JS, Lee SH, Chen YM, Lan C, Chiang HS. Breathing-coordinated exercise improves the quality of life in hemodialysis patients. J Am Soc Nephrol, 1995; 6: 1392-1400.
Tsai YJ, Wang JD, Huang WF. A case-control study of the effectiveness of different types of helmets for the prevention of head injuries among motorcycle riders in Taipei. Am J Epidemiol, 1995; 142: 974-81.
Tsauo JY, Hwang JS, Wang JD. Estimation of utility gain from the helmet law in Taiwan by quality-adjusted survival time. Accident Anal & Prevention, 1999; 31: 253-63.
Tweney RD, Doherty ME, Mynatt CR. On scientific thinking, New York: Columbia University Press, 1981.
U.S. Department of Commerce, Bureau of Census. Vital statistics, in Statistical abstracts of the United States 1944-5, Washington, D.C.: U.S. Department of Commerce, 1945, p. 82.
U.S. Department of Health and Human Service. International Conference on Harmonization; Good clinical practice: Consolidated guideline; Availability. Fed Reg, 1997, 25692-25709.
U.S. Department of Health, Education, and Welfare. Smoking and health. Report of the Advisory Committee to the Surgeon General of the Public Health Service, U.S. Dept. of Health, Education, and Welfare, Public Health Service, Center for Disease Control, PHS Publication No. 1103, 1964.
U.S. Department of Health, Education, and Welfare. Smoking and health, a report of the Surgeon General, U.S. Dept. of Health, Education, and Welfare, Public Health Service, DHEW Publication No. (PHS) 79-
50066, 1979.
U.S. Department of Labor, Occupational Safety and Health Administration. OSHA preambles: Methylene chloride, VII. Significance of risk, URL: http://www.oshaslc.gov/Preamble/methylch_data/METHYLENE_CL7.html (cited 2001 Nov 8)
U.S. Environmental Protection Agency. Risk assessment guidance for superfund (RAGS), Vol. I, Human health evaluation manual (Part A) - Interim final, EPA/540/1-89/002, Washington, D.C.: U.S. EPA, 1989.
U.S. Environmental Protection Agency. Role of the baseline risk assessment in Superfund remedy selection decisions, URL: http://es.epa.gov/oeca/osre/910422.html (cited 2001 Nov 8)
U.S. Food and Drug Administration. Guidance for industry: botanical drug products, URL: http://www.fda.gov/cder/guidance/index.htm, 2000.
U.S. Food and Drug Administration. Guideline on safety pharmacology studies for human pharmaceuticals, URL: http://www.fda.gov/cder/guidance/3772dft.pdf, 2001a. (cited 2001 Jun 30)
U.S. Food and Drug Administration. Information sheet: Guidance for Institutional Review Boards and clinical investigators, 1998 update, URL: http://www.fda.gov/oc/ohrt/irbs/default.htm, 2001b. (cited 2001 June 23)
Wacholder S, Silverman DT, McLaughlin JK, Mandel JS. Selection of controls in case-control studies, I: principles. Am J Epidemiol, 1992a; 135: 1019-28.
Wacholder S, Silverman DT, McLaughlin JK, Mandel JS. Selection of controls in case-control studies, II: types of controls. Am J Epidemiol, 1992b; 135: 1029-41.
Wacholder S, Silverman DT, McLaughlin JK, Mandel JS. Selection of controls in case-control studies, III: design options. Am J Epidemiol, 1992c; 135: 1042-50.
Walker LG, Anderson J. Testing complementary and alternative medicine within a research protocol. Eur J Cancer, 1999; 35: 1614-1618.
Wang JD. From conjecture and refutation to the documentation of
occupational diseases in Taiwan. Am J Industr Med, 1991; 20: 557-65.
Wang JD, Chang YC. An outbreak of type A botulism due to commercially preserved peanuts - Chang Hwa County. Epidemiol Bull (R.O.C.), 1987; 3(3): 21-23.
Wang JD, Chang YC, Kao KP, Huang CC, Lin CC, Yeh WY. An outbreak of n-hexane-induced polyneuropathy among press proofing workers in Taipei. Am J Industr Med, 1986; 10: 111-118.
Wang JD, Chen JD. Acute and chronic neurological symptoms among paint workers exposed to mixtures of organic solvents. Environ Res, 1993; 61: 107-116.
Wang JD, Huang CC, Hwang YH, Chiang JR, Lin JM, Chen JS. Manganese-induced parkinsonism: an outbreak due to an unrepaired ventilation control system in a ferromanganese smelter. Br J Industr Med, 1989; 46: 856-9.
Wang JD, Huang PH, Lin JM, Su SY, Wu MC. Occupational asthma due to toluene diisocyanate among velcro-like tape manufacturers. Am J Industr Med, 1988; 14: 73-8.
Wang JD, Jang CS, Hwang YH, Chen ZS. Lead contamination around a kindergarten near a battery recycling plant. Bull Environ Contam Toxicol, 1992; 49: 23-30.
Wang JD, Lai MY, Chen JS, Lin JM, Chiang JR, Shiau SJ, Chang WS. Dimethylformamide-induced liver damage among synthetic leather workers. Arch Environ Health, 1991; 46: 161-6.
Wang JD, Lin WM, Hu FC, Hu KH. Occupational risk and the development of premalignant skin lesions among paraquat manufacturers. Br J Industr Med, 1987; 44: 196-200.
Wang JD, Miettinen OS. Occupational mortality studies: principles of validity. Scand J Work Environ Health, 1982; 8: 153-8.
Wang JD, Miettinen OS. The mortality odds ratio (MOR) in occupational mortality studies - selection of reference occupations and reference causes of death. Ann Acad Med (Singapore), 1984; 13(suppl): 312-6.
Wang JD, Soong WT, Chao KY, Hwang YH, Jang CS. Occupational and environmental lead poisoning: Case study of a battery recycling smelter in
Taiwan. J Toxicol Sci, 1998; 23(suppl II): 241-245.
Wang JD, Wegman DH, Smith TJ. Cancer risk in the optical manufacturing industry. Br J Industr Med, 1983; 40: 177-81.
Wang JD, Yu CF, Chung CW, Yao KPG. Evaluation of effectiveness of health service in the 21st century: Quality of life and quality adjusted survival analysis. Formosan J Med, 2000; 4: 65-74. (in Chinese)
Watson JD. The double helix: a personal account of the discovery of the structure of DNA, London: Weidenfeld and Nicolson, 1968.
Waxweiler RJ. Risk among workers exposed to vinyl chloride. Ann New York Acad Sci, 1976; 271: 40-8.
Weed DL. On the logic of causal inference. Am J Epidemiol, 1986; 123: 965-79.
Weinstein MC, Stason WB. Foundations of cost-effectiveness analysis for health and medical practices. New Engl J Med, 1977; 296: 716-21.
Weiss ST. Passive smoking and lung cancer: what is the risk? Am Rev Respir Dis, 1986; 133: 1-3.
WHOQOL Group. The World Health Organization Quality of Life assessment (WHOQOL): Position paper from the World Health Organization. Soc Sci Med, 1995; 41: 1403-9.
WHOQOL Group. The World Health Organization Quality of Life assessment (WHOQOL): Development and general psychometric properties. Soc Sci Med, 1998a; 46(12): 1569-85.
WHOQOL Group. Development of the World Health Organization WHOQOL-BREF quality of life assessment. Psychol Med, 1998b; 28: 551-8.
WHOQOL-Taiwan Group. Introduction to the development of the WHOQOL-Taiwan version. Chin J Public Health, 2000; 19(4): 315-24.
Wing S, Richardson D, Armstrong D, Crawford-Brown D. A reevaluation of cancer incidence near the Three Mile Island nuclear plant: the collision of evidence and assumptions. Environ Health Perspect, 1997; 105: 52-57.
World Health Organization. In Constitution of the World Health Organization, Handbook of basic documents, 5th ed, Geneva: Palais des Nations, 1948, pp. 3-20.
World Health Organization. WHOQOL-BREF: Introduction, administration, scoring and generic version of the assessment - field trial version, Geneva: WHO, 1996.
Yanagawa T. Designing case-control studies. Environ Health Perspect, 1979; 32: 143-56.
Yerushalmy J, Palmer CE. On the methodology of investigations of etiologic factors in chronic diseases. J Chron Dis, 1959; 10: 27-40.
Index

Aborigines, 8, 164, 177, 227 Accuracy, 90, 91 Adjustment, 186, 199, 200 Aflatoxin, 213, 215 Age-specific mortality rate, 129 Alcoholism, 8, 41, 164, 177, 227, 228 Alternative explanation, 27, 65, 68 Alternative medicine, 251 Ambispective study, 237 Angina pectoris, 315 Angiosarcoma of the liver, 242 Asbestos, 61, 71, 108, 185, 328-330 Association: specificity of, 72; strength of, 70 Asthma, 30, 64, 181, 182 Attributable proportion, 144 Attributable risk percent, 144 Auxiliary hypotheses, 21, 46-48
Bayesian analysis/approach, 49, 51 Bernoulli trial, 122 Bias, 189, 213, 216 Biological gradient, 72 Biological plausibility, 74 Birthrate, 122, 129 Blackfoot disease, 207, 272 Botulism, 27
Candidate population, 132 Case-base study, 7 Case-cohort study, 7 Case-control study/design, 7, 206, 235, 259-290 Case-referent study, 7 Causal decision, 62-74 Causal hypotheses, 10, 168, 180 Causal inference, 7-9, 57-77, 162, 166, 168-172 Causal study, 166, 170, 180, 296, 302 Cause, 59 Censor, censoring, 123, 132, 149 Census, 212 Central limit theorem, 165, 217 Chance, 67 Chinese herbal medicine, 45-46, 177-178, 251-254 Clinical decision, 308, 326 Clinical trial, 26, 185, 235, 238, 248-253 Clostridium botulinum, 28-30 Cluster sampling, 226 Coherence, 69 Cohort, 236 Cohort study, 235, 236, 242 Community intervention trial, 235, 236 Comparability, 185, 279-287, 296: of contrasted population, 185, 281; of effects, 185, 279; of measurement, 185, 282 Concentration of exposure, 73 Conceptualization, 88 Conditional proportion, 152 Confounder, 65, 68-69, 183, 188 Confounding, 4, 8, 9, 30, 93, 166, 180-192, 275-276 Congenital malformation, 263, 275 Conjecture and refutation, 22, 25, 32: limitation of, 32-33
Consensus method, 75 Consistency, 41, 65 Construct, 87 Cornfield's concept, 264-266 Cost/benefit analysis, 150 Cost-effective analysis, 314-326 Cost-effectiveness, 308 Cost-utility analysis, 314 Cox, 270, 298 Credibility, 50, 51: of a hypothesis, 50-52 Cross-sectional, 170 Crude birth rate, 129 Crude death rate, 128 Crude rate, 198 Cumulative incidence rate (CIR), 135, 205, 327 Cumulative incidence ratio, 268 Cumulative incidence sampling, 267-270, 272
Death rate, 126, 128, 129 Decision-making, 10, 50, 123, 308, 311 Deduction, 22, 26, 29, 35, 46 Deductive method, 23 Degree of corroboration, 48 Degree of freedom, 217 Demographic data, 122 Denominator, 125, 138, 144, 234, 259 Density sampling, 93, 123, 266-271 Descriptive inference, 8, 162, 168, 172, 297 Descriptive study, 162, 170, 172, 297, 302 Detectable preclinical period, 103 Determinants, 4, 125, 126, 128, 180, 302: of incidence rate, 137-140; of measurements, 85, 116, 125; of outcome, 182, 186, 302 Dimension, 87 Domain, 87 Dose rate, 73 Dose-response relationship, 72 Double blind, 248 Doubling time, 65, 103 Dummy question, 114 Duration, 73, 126, 141
Ecological fallacy, 189, 237 Ecological study, 235, 237, 238 Emergency medical system/service, 8, 163, 176 Empirical test, 17 Epidemiology, 1 Equivalent monetary value (EMV), 311 Etiologic agent, 18, 30 Etiological fraction, 144 Excess fraction, 144 Expected number of prevented cases, 145, 324 Expected utility, 310-311 Experimental study, 234, 235 Exposure odds ratio, 265, 278 Extensiveness of measurement, 98
Failure rate, 133 Falsificationist, 32 Fetal mortality ratio, 130 Field trial, 235 Follow-up study, 233-255: experimental studies, 234-236; observational studies, 235-238
Gold standard, 84, 89 Good clinical practice, 250 Good laboratory practice, 249
Index 365
Hazard rate, 133 Hazardous waste, 334 Head injury, 138, 224, 271,294, 308-311,319-325 Health service or policy, 7, 147 Healthy worker effect, 241 Health-related quality of life (HRQL), 83,154-156,321 Helmet, 271,273,319-325 Herbal medicine, 45-46, 178 HIV (human immuno-deficiency virus), 60,84,219 Hypothesis, 17,39-53, Incidence density, 132 Incidence rate, 132 Incidence rate ratio (IRR), 142, 268 Indirect standardization, 202 Induction, 40-43 joint method of agreement and difference, 42 method of agreement, 41 method of concomitant variation, 43 method of difference, 42 method of residue, 42 Induction time, 73, 139 minimal, 63 Infant mortality rate, 130 Intelligence quotient (IQ), 108, 215 Intensity of exposure, 73
Life expectancy, 6,137 Life table method, 133, 151-153 Loss of follow-up, 247 Malignant skin lesions, 35 Mantel-Haenszel, 186 Material safety data sheet (MSDS), 311 Maximum latency period, 63, 73, 245 Maximum likelihood estimation, 186 Mean duration, 141 Mean squared error (M.S.E), 91 Measurement, 81-117, 83 epidemiological research, 121-157 measurement of effect, 142 validity of, 294 Miettinen's concept, 266-269 Mill's rule of induction, 41—43 Minimal duration of exposure, 245 Modeling, 167, 186 Morbidity rate, 122 Mortality or morbidity odds ratio (MOR), 277-287, 296 Mortality rate, 126, 128-130 Multiple logistic regression, 145
Kaplan-Meier method, 327 Kernel-type smoothing estimator, 148
National resource allocation, 326 Neonatal mortality rate, 130 N-hexane, 27-28, 165 Non-respondent, 175-176, 228, 297, 302 proportion of, 175 Non-sampling errors, 227 Numerator, 138,144,234
Latency period, 63 Lead, 106, 108,179 Lead recycling smelter, 179
Objective knowledge, 75 Observational study, 235, 236
Jet noise, 181
366 Basic Principles and Practical Applications in Epidemiological Research
Observed to expected (O/E) ratio, 277 Odds ratio, 144 exposure, 265, 278 mortality or morbidity odds ratio(MOR), 277-287, 296 Onset of first exposure, 73 Operational definition, 87 Outcome, 239
Parameter, 122,168 Paraquat manufacturers, 35, 187 Person-time, 133 Person-years, 133, 244 Physiologically-based pharmacokinetic (PB-PK) model, 107 Placebo group, 185 Polyneuropathy, 8, 29, 42, 165 Polyvinyl chloride (PVC), 242-247 Popper, 10, 17, 20, 25-26 Population at risk, 233 candidate, 234 cohort, 132 dynamic, 132 index, 206 reference, 206 Population-time, 128 Pragmatic trial, 253 Precision, 90 Precision maximizing weight, 206 Predictive value negative, 100 positive, 100 Press-proofing, 30, 42, 165 Prevalence odds ratio, 269 Prevalence rate, 101, 140 Probability sampling, 212-216 Proportion, 124,126 Proportional hazard model, 270
Proportionate morbidity or mortality ratio (PMR), 262, 277 Prospective study, 235 Psychometric assessment, 326 Puerperal fever, 18
Quality-adjusted life expectancy (QALE), 148 Quality-adjusted life year (QALY), 147, 150,315-326 ethical consideration, 150 Quality-adjusted survival (QAS), 147-156,326 extrapolation of QAS to life time, 149 Quality assurance/quality control, 105, 295 Quality of life (QOL), 6, 83, 87, 337 Quasi-necessary criteria, 65 Quasi-random, 173, 213
Random assignment, 252 Random error, 90-93 Random sampling, 216 Randomization, 67, 216, 250 Rate, 124, 126 Rate difference, 142 Rate ratio, 142 Rating scale, 155-156 Ratio, 126 Recurrency rate, 93, 122, 131 Red shift, 44, 45 Reference group, 239 Reference occupation, 279 Refutational attitude, 21, 27, 31, 36, 75 Relative risk, 122, 142, 264 Relevance, 49 Reliability, 90, 93
Index 367
Remission rate, 131 Repeated random sampling, 217 Response rate, 7, 174, 302 Retrospective study, 265 Risk, 135 perception of, 336 Risk assessment, 325-338 Risk predictors, 273 Risk ratio, 264, 268
Sampling, 211-229 Sampling distribution, 217 Sampling efficiency, 219-221 Scabies, 262 Scale interval, 97 nominal, 94 ordinal, 95 ratio, 98 Scientific hypothesis, 44 Selection with probability proportional to size, 227 Sensitivity, 100 Simple random sampling, 216 Single blind, 254 Socio-behavioral sciences, 86, 88, 96, 109 Specificity, 100 Specificity of association, 72 Standard deviation, 220 Standard gamble, 154-155 Standard operation procedure (SOP), 85, 250 Standardization, 186, 199 Standardized mortality or morbidity ratio (SMR), 202, 206
Stratification, 167, 186,219 Stratified analysis, 240 Stratified random sampling, 219 Sulfur dioxide, 181-182 Systematic bias, 93, 100, 189 Systematic error, 90 Systematic random sampling, 224 Systematic sampling, 212, 226 Summary statistics, 106, 127, 216 Survival function, 123, 148, 315, 317, 319,326 Survival rate, 131,316
Temporality, 62 Termination rate, 141 Time trade-off, 155 Total body burden, 108 Tuberculosis, 124,214
Utility, 147
Validity, 90, 92 of causal studies, 296 of descriptive studies, 297 of measurement, 294 of sociobehavioral measurement, 109 Variance, 92, 221 Verificationist, 32 Vinyl chloride monomer (VCM), 138, 242, 269, 287, 335
WHO quality of life (WHOQOL), 87, 110, 154,317
Basic Principles and Practical Applications in Epidemiological Research

Based on the concept of "conjecture and refutation" from the Popperian philosophy of science, i.e., the search for alternative causes, this book simplifies the design and inference of human observational studies into two types: descriptive and causal. It clarifies how and why causal inference should proceed from the search for alternative explanations or causes, and descriptive inference from the sample at hand to the source population. Furthermore, it links health policy and epidemiological concepts with decisional questions, for which the basic measurement can be quality-adjusted survival time or the quality-adjusted life year.
www.worldscientific.com