Generalized Linear Models, Estimating Functions Multivariate Extensions Department of Biostatistics School of Hygiene & Public Health Johns Hopkins University July 12–14, 1999 Regression analysis has been developed for many years and remains one of the most commonly used statistical tools to help scientists address their scientific inquiries. Inthe year of 1972, thanks to the seminal work by Nelder and Wedderburn, many usefulregression models are unified under the framework of generalized linear models (GLMs).
These include as special cases ANCOVA, multiple regression for continuous responses, lo-gistic regression for binary responses and log-linear models for contingency tables. Theyasserted that each model under consideration can be cast through two components. Oneis the systematic component which relates the mean response to the covariates by spec-ifying a link function. The other is the random component which is needed to accountfor uncertainty in the response variable. Through this common framework, issues thatare commonly faced in regression such as estimating procedure, regression diagnostics,measurement errors in covariates, goodness-of-fit, etc. can be addressed in a unified fash-ion. While there is no doubt that this approach will be in its existence for many yearsto come, it faces two, among others, challenges that has drawn a good deal of attentionamong statistical researchers in the past few decades. One is to deal with the situationwhere the scientists may not have sufficient knowledge about the subject matters to beincorporated by their statistical colleagues to characterize the random mechanism forwhich the data are generated. The work by Wedderburn in 1974, known as the quasilikelihood method, provides a means to cope with this complication. It relaxes the ran-dom component assumption required by GLMs, rather a much weaker assumption on thevariance expression for the response variables is installed. Several questions remain tobe answered regarding the use of this alternative method. First, can the work by Wed-derburn be extended to accommodate more general and realistic variance specifications?Second, does the quasi score method by Wedderburn possess any desirable statisticaloptimality properties? Third, is there any evidence that this method perform well inpractice? The other challenge GLMs face is how to deal with the situation where some of the ob- servations are not necessarily statistically independent of each others. Implicitly, GLMsoperates on the assumption that all the observations are independent statistically of eachothers. More frequently, one witnesses in biomedical studies that such an assumptionmay be invalid. In the past decade or two, the statistical community also witnessed thetremendous development in statistical methods for analyzing data that are correlatedwith each others. A natural questions to raise is: what are the pros and cons of thesenew methods in terms of their suitability in addressing scientific objectives from studiesinvolving correlated data? In this lecture note, we intend to address the above two posed challenges and questions through the vehicle of estimating functions, a concept formally developed in 1960 bythe independent work of Durbin and Godambe. This lecture note has the followingorganization. In Chapter one, we first characterize five motivating examples which involvecorrelated data. We argued that for all examples considered, their scientific objectives canbe cast through regression modeling. Also discussed are (1) the statistical consequencewhen the correlation possessed by the data is ignored in the analysis and (2) quantitiesthat are useful for measuring the degree of associations for correlated data.
Chapter two provides a chronicle overview on some recent developments of regres- sion techniques with focus on generalized linear models, quasi likelihood methods andestimating function approaches. This overview is preceded by some discussions on issues concerning likelihood-based inference. These include the impact of nuisance parametersand cautions in using the popular Wald's procedure.
Chapters three and four may be viewed as a preview of statistical modeling for corre- lated data. In Chapter three, a brief review on historic developments for analyzing pro-portional data leads to the following question: is the binomial assumption made thereina reasonable one? The well known "litter effect" in teratological experiments suggeststhat one of the key assumptions in binomial, namely, all the binary responses from withinare statistically independent of each others, is invalid. We discussed in Chapter threedrawbacks of some attempts in statistical literature in the 1970s and 1980s to relax thetwo key assumptions in binomial for proportional data. The teratological experimentdata also serves to illustrate the usefulness of the quasi likelihood method in dealingwith (1) uncertainty of random mechanisms and (2) impact of nuisance parameters.
Chapter four consists of two parts. In part one, we discussed three different statistical models that are useful for polytomous responses in which the discrete response variablecontains three or more categories. Circumstances under which these models may be, ormay not be, appropriate are highlighted and contrasted through two examples. Thispart also serves to extend GLMs to incorporate either ordinal or nested polytomousresponse variables. In part two, we focused on interpretations of parameters induced bylog-linear models which are popular for analyzing data from contingency tables. Therehas been some renewed interest lately in transferring this approach for contingency tablesto correlated data analysis. Through detailed discussions on interpretations of log-linearparameters, we argued that such an application to correlated data may proceed withcaution.
In Chapter five, we contrasted three different statistical modelings for correlated data: marginal models, random effects models and observation driven models with focus oninterpretations of induced parameters. Such contrasts and circumstances under whichthe proposed models may be most appropriate are illustrated through two examplesintroduced in Chapter one: The Baltimore Eye Survey Study and the SchizophreniaTrial Study.
Chapter six deals with statistical inference for correlated data when either one of three statistical models mentioned above is employed. These methods may be viewed asmultivariate extensions of GLMs and quasi likelihood which were primarily designed foranalyzing univariate responses. Again, these newly developed methods were illustratedthrough the two above mentioned examples.
This lecture note is made possible through the kind invitation of Dr. C. Z. Wei, the former Director of the Institute of Statistical Science, Academia Sinica, Taiwan,R.O.C., to be a part of the lecture series in statistics sponsored by the institute in 1999.
The coordinating efforts by Dr. C.H. Chen, the executive secretary of the lecture seriescommittee, and by the committee members are gratefully acknowledged. The typingof this lecture note and the lectures delivered in July, 1999 in Academic Sinica werediligently prepared by Ms. Patty Hubbard, Department of Biostatistics, Johns HopkinsUniversity and by Ms. Umy Chen, Institute of Statistical Science, Academia Sinica.
Their hard work and patience throughout are beyond what can be described in words –I thank them from the bottom of my heart.
1 Correlated Data
Characteristics of Examples . . . . . . . . . . . . .
Impact of Ignoring Within-Cluster Dependence . . . . . . . .
Incorrect Variance . . . . . . . . . . . . . .
Measures of Within-Cluster Dependence . . . . . . . . . .
2 Regression Analysis: An Overview
Some Issues Concerning Likelihood-Based Inference . . . . . . .
Impact of Nuisance Parameters . . . . . . . . . .
Cautions on Wald's Procedure . . . . . . . . . .
Generalized Linear Models . . . . . . . . . . . . . .
First Two Moments of GLMs . . . . . . . . . . .
Score Function for β . . . . . . . . . . . . .
MLE Versus Weighted Least Squares . . . . . . . . .
Quasi-Likelihood Method . . . . . . . . . . . . . .
Estimating Function Approach . . . . . . . . . . . .
Limitations: Impact of Nuisance Parameters Orthogonality Properties of Conditional Score Functions . . .
(Local) Optimality of Quasi-Score Functions 3 Analysis of Binary Data
Historical Developments . . . . . . . . . . . . . .
Empirical Logit versus Logistic Regression . . . . . . .
Is the Binomial Assumption Reasonable? . . . . . . .
Teratological Experiments . . . . . . . . . . . . . .
Some Cautionary Notes on Beta-Binomial Distributions . . .
An Alternative Inferential Procedure . . . . . . . . .
Further Relaxation of Binomial Assumptions . . . . . . . .
Multiplicative Model . . . . . . . . . . . . .
4 Analysis of Polytomous and Count Data
Types of Polytomous Data . . . . . . . . . . . . . .
Regression Models for Polytomous Responses . . . . . . . .
Polytomous Logistic Regression Models . . . . . . . .
Proportional Odds Models . . . . . . . . . . . .
Continuation Ratio Models . . . . . . . . . . .
Log-Linear Models for Contingency Tables . . . . . . . . .
Interpretations of Log-Linear Model Parameters An Alternative Representation of Log-Linear Models . . . .
5 Statistical Modellings for Correlated Data
Observation Driven Models . . . . . . . . . . . . .
Contrasts of Three Modelling Approaches . . . . . . . . .
6 Statistical Inference for Correlated Data
Quasi-Likelihood Approach . . . . . . . . . . .
Likelihood Approach . . . . . . . . . . . . .
Conditional Likelihood Approach . . . . . . . . . .
Likelihood Approach . . . . . . . . . . . . .
Observation Driven Models . . . . . . . . . . . . .
Likelihood Approach . . . . . . . . . . . . .
Quasi-Likelihood Approach . . . . . . . . . . .
Pre- Post Trial on Schizophrenia Revisited . . . . . . . . .
List of Tables
Some features of five examples in §1.1. . . . . . . . . . .
Ranges of correlation coefficients for a pair of binary responses with cor-responding means πj and πk, respectively. . . . . . . . . .
Proportions of children surviving 2 years free of disease following diagnosisand treatment for neuroblastoma by age and stage of disease at diagnosis(Breslow & McCann, 1971). . . . . . . . . . . . . .
Proportions of fetuses surviving the 21 day lactation period among thosewho were alive at 4 days by treatment group (Weil, 1970). . . . .
The averaged MLE (± s.e.) for θ for different assumed values of φ from1,000 simulations (Williams, 1988). . . . . . . . . . . .
Estimates and standard errors of β1 = log{θT /(1−θT )}−log{θC/(1−θC)}of Weil's data presented in Table 3.2. Upper entry: common φ; lower entry:heterogeneous φ (Liang and Hanfelt, 1994). . . . . . . . .
Simulation results for β1 = log{θT /(1 − θT )} from 1,000 replications mim-icking the data structure of Weil's. The true β1 is 1.129 and φT = 0.317and φC = 0.021 (Liang and Hanfelt, 1994). . . . . . . . .
Regression estimates (± s.e.) based on both proportional odds models andpolytomous logistic regression models for the disturbed dream example(Maxwell, 1961). . . . . . . . . . . . . . . . .
Regression estimates (± s.e.) based on proportional odds models and con-tinuation ratio models for the tonsil size example (Holmes and Williams,1954). . . . . . . . . . . . . . . . . . . .
Regression estimates and standard errors (in parenthesis) for the visualimpairment response from the Baltimore Eye Survey Study. . . . .
Regression estimates and standard errors (in parenthesis) from three mod-elling approaches: Baltimore Eye Survey Study. . . . . . . .
Regression estimates and standard errors (in parenthesis) from three mod-elling approaches: a pre- post trial on schizophrenia. . . . . . .
LIST OF TABLES List of Figures
The plot of the logarithm of V1/V2 versus ρ for selected cluster sizesn(2, 5, 10) and φ(0, 0.2, 0.5, 0.8, 1.0). Here, V1 is the variance of the leastsquared estimate ˆ β1 when the within-cluster dependence is ignored and V2 is the correct variance. We have assumed E(Yij) = β0 + β1xij andρ = corr(Yij, Yik), j < k = 1, . . , n : φ is the ratio of the between clustervariance to the total variance among the xij's. . . . . . . .
The plot of V3/V2 versus ρ for selected n(2, 5, 10) and φ(0, 0.2, 0.5, 0.8, 1.0).
Here V2, ρ, φ, and the assumed model are the same as described in Figure1.1, V3 is the variance of the best unbiased estimate of β1. . . . .
Beta-binomial log-likelihoods for the exposed group based on the datareported by Weil (1970). . . . . . . . . . . . . . .
Predicted risk of VI by race and age among those with 9 years of educationin Baltimore Eye Survey Study (Tielsch et al., 1990): solid line, whites;dashed line, blacks. . . . . . . . . . . . . . . .
Mean responses in PANSS by dropout cohort. . . . . . . . .
Regression analysis is commonly used in biomedical research. A typical assumptionbehind this tool is that all observations are statistically independent, or at least uncor-related with each other. The following research problems in which we have been directlyor indirectly involved suggest that there are many situations for which this assumptionmight not be met.
Example 1.1 (Baltimore Eye Survey Study). Over 5,000 individuals aged 40 and above received a visual examination as part of a population-based prevalence study ofocular disorders (Tielsch et al. 1990). The main objective of the summary is to identifydemographic variables such as race and age that are associated with vision loss. Thecomplication of the study is that data are available on both eyes which are unlikely tobe independent because many causes of eye impairment are binocular.
Example 1.2 (Family Study on Chronic Obstructive Pulmonary Disease (COPD)). Six hundred and thirteen members of 158 COPD cases seen at the Johns Hopkins Hospitalwere examined and given spirometry tests (Cohen, et al., 1977). The primary objectiveis to determine the percent of total variance attributable to unobserved genetic factorsshared among siblings and their parents. It is typical in family studies that the responsevariables, FEV1 in this case, are measured for related individuals, which in turn arecrucial to address scientific questions of interest such as the one stated above.
Example 1.3 (Sister Chromatid Exchange Study). A total of 14 hepatocelluar carci- noma, 14 nasopharyngeal carcinoma and 16 cervical cancer patients and their age-sex-matched controls, were studied to compare the frequency of sister chromatid exchange(SCE) in their peripheral lymphocytes. The main hypothesis, which was tested in thestudy conducted by Professor C.J. Chen, College of Public Health, National Taiwan Uni-versity, is that cancer patients may have a higher frequency of SCE in lymphocytes thanmatched controls. The outcome variable is the number of SCE per cell where 20 cellsfrom each subject were cultured.
Example 1.4 (Pre-Post Trial of Schizophrenia). Five hundred and twenty three pa- tients diagnosed with schizophrenia were randomly allocated amongst six treatments:placebo, haloperidol 20 mg and risperidone at dose levels 2mg, 6mg, 10mg and 16mg.
The primary response variable is the total score obtained on the Positive and Negative CHAPTER 1. CORRELATED DATA Symptom Rating Scale (PANSS), a measure of severity of schizophrenia. This score foreach patient was measured at baseline, weeks 1, 2, 4, 6 and 8 after randomization. Themain objective is to determine if risperidone improves the rate of decline in PANSS overthe two month period.
Example 1.5 (Diabetic Retinopathy Study). Over 1,700 patients with diabetic retinopa- thy and visual acuity of 20/100 or better in both eyes participated in the study (Mur-phy and Patz, 1978). One eye of each patient was randomly selected for laster photo-coagulation and the other was observed without treatment. The main objective is tostudy the effectiveness of laster photo-coagulation in delaying the onset of blindness andthe response variable for each eye is time from randomization to the occurrence of visualacuity less than 5/200.
Characteristics of Examples
Although different in their scientific objectives, the examples introduced above sharesome important characteristics in common that are critical to the statistical developmentpresented in Chapters 5 and 6. First, data in each of these examples are organized in"clusters." For example, in the Baltimore Eye Survey Study, each cluster comprisesobservations of two eyes from an individual. In the family study (Example 1.2), a clusteris formed by a family with observations from related individuals. Secondly, responsevariables from the same cluster are likely to be correlated with each other. In longitudinalstudies where repeated observations from each subject are measured over time, this isknown as "tracking." In family studies, observations from related individuals such assiblings are correlated due to shared genes and/or environments. This explanation canbe applied to the fellow eyes in Example 1.1 as well. Thirdly, the scientific objectivesof these examples can be formulated statistically in terms of regression analyses as onewould have been for independent responses. For example, one could use the conventionallogistic regression to relate the risk of visual impairment for each eye in the BaltimoreEye Survey Study to demographic variables such as age and race. In addition, one canassess how the degree of within-cluster dependence between fellow eyes on variables suchas race, through regression as well. The issue of measuring within-cluster dependencewill be addressed in §1.4, whereas more formal statistical modellings for correlated datawith the focus on regression will be the central theme of Chapter 5.
The examples introduced above, on the other hand, differ from another in some impor- tant ways. First, the type of response varies from binary (Example 1.1), count (Example1.3), continuous (Example 1.2 and 1.4) to survival subject to censoring (Example 1.5).
Secondly, they differ in the structure of within-cluster dependence. For example, if thesampling unit for the Baltimore Eye Survey Study were the household in which eachmember of the sampled household was offered for eye examinations, more complicateddependence structure would result since one would expect responses from family mem-bers to be correlated to each other. Furthermore, the degree of association between felloweyes is likely to be stronger than responses of eyes from two related subjects. Designs ofthis kind are special cases of the so-called "nested design." Examples include interventionstudies in which children (patients) are nested within household (clinicians) which in turnare nested within villages (clinics). Special care is required to model within-cluster de-pendence for data collected from nested designs or from family studies where the degreeof association for siblings is likely to be different than that of parents (Qaqish and Liang, 1.3. IMPACT OF IGNORING WITHIN-CLUSTER DEPENDENCE Finally, the examples differ from one another with respect to their focus. In longitu- dinal studies (Example 1.4), the regression of response variables on explanatory variablessuch as time (in weeks) from randomization, treatment status and their interactions ismost important whereas the within-cluster dependence may be viewed as a nuisance. Onthe other hand, it is typical in family studies (Example 1.2) that the primary focus ison the role genetic factors play in disease as captured by the percent of total variationin response variables attributable to genetic components. In this situation, the within-cluster association represents the primary focus even though regression adjustment forthe regression variable is crucial to separate out the environmental impact from the ge-netic ones. Table 1.1 summarizes both the common factors and differences among thefive examples.
Table 1.1: Some features of five examples in §1.1.
Scientific Focus Impact of Ignoring Within-Cluster Dependence
A natural question to raise is "what happens if one ignores the within-cluster depen-dence and uses the conventional regression method assuming independence among ob-servations?" In the Baltimore Eye Survey example, this amounts to treating 10,398 ob-servations from 5,199 individuals as if they were collected from 10,398 individuals withone observation (could be from the right or left eye) per individual. From a statisti-cal viewpoint, there are at least two consequences as a result of ignoring within-clusterdependence: incorrect assessment of precision of regression coefficient estimates and in-efficient estimation of regression coefficients. We now address these two issues in greaterdetail through a simple model for correlated data.
Let Yij be the jth observation from the ith cluster, j = 1, . . , n, i = 1, . . , m. Here n is the common cluster size and m the number of clusters. We assume for each i, E(Yij) = β0 + β1xij, Var(Yij) = σ2,Cov(Yij, Yik) = σ2ρ, j < k = 1, . . , n. This model assumes the expected value of Y is a simple linear function of a covariate, x, and that the correlation, as measure by correlation coefficient, between each pairof responses from the same cluster has the same value, ρ, say. The focus here is on theinference for β1 when ignoring the fact that responses from the same cluster are correlatedwith each other. This amounts to the use of the ordinary least square estimator ˆ CHAPTER 1. CORRELATED DATA estimate β1 where (yij − yi)(xij − xi) / (xij − xi)2, i=1 j=1 i=1 j=1 with yi and xi be the sample means of the Yi's and xi's from the ith cluster.
Ignoring the within-cluster association leads to the use of V1 = σ2/ (xij − x)2 = σ2/VT , j ij / (nm), as the variance estimate of ˆ β1. The correct variance, V2, of β1 has the form V2 = V1{1 + ρ(nφ − 1)}, where φ = n i −x)2 / VT is the fraction of the total variation in the x's explained by the between cluster variation in xi's. Two important cases of φ deserve special attention.
One is when φ = 0, i.e. x1 = . . , xm. This occurs commonly in longitudinal studies inwhich every subject's response is measured at the same set of times. In this case, β1 isestimated mainly by using the within-cluster changes in Y . The other extreme cases iswhen φ = 1, i.e. xi1 = xi2 = . . = xin for all i ∈ m. This is typcial in studies such asthe Baltimore Eye Survey Study where all the covariates are cluster-specific.
Figure 1.1: The plot of the logarithm of V1/V2 versus ρ for selected clustersizes n(2, 5, 10) and φ(0, 0.2, 0.5, 0.8, 1.0). Here, V1 is the variance of theleast squared estimate ˆ β1 when the within-cluster dependence is ignored and V2 is the correct variance. We have assumed E(Yij) = β0 + β1xij andρ = corr(Yij, Yik), j < k = 1, . . , n : φ is the ratio of the between clustervariance to the total variance among the xij's.
1.3. IMPACT OF IGNORING WITHIN-CLUSTER DEPENDENCE Figure 1.1 shows plots of log(V1/V2) = log{1 + ρ(nφ − 1)} against ρ for some selected n's and φ's. The message is rather clear. For within cluster comparisons, i.e. φ = 0, theconfidence interval based on V1 is wider than should be and discrepancy between V1 andV2 increases with ρ. On the other hand, for between cluster comparisons, i.e. φ = 1, theconfidence interval based on V1 is too narrow and the discrepancy also increases withρ. In either case, invalid scientific conclusions, which could be either false positive ornegative, may be drawn if V1 is used as the variance estimate of ˆ While the ordinary least square estimate ˆ β1 remains unbiased as an estimator of β1, the well known Gauss-Markov estimating theorem suggests that the uncertainty in estimatingβ1 may be reduced by using the weighted least square estimator, ˜ β1 say, which properly accounts for the within-cluster association. Under the specified correlation structure in(1.1), ˜ β1 has the variance of form V3 = V1(1 − ρ){1 + (n − 1)ρ} / {1 − ρ + (1 − φ)}. Figure 1.2 shows plots of V3/V1 against ρ for selected n's and φ's. Interestingly, no The plot of V3/V2 versus ρ for selected n(2, 5, 10) and φ(0, 0.2, 0.5, 0.8, 1.0). Here V2, ρ, φ, and the assumed model are the sameas described in Figure 1.1, V3 is the variance of the best unbiased estimateof β1.
efficiency loss occurs when φ is equal to 0 or 1, irrespective of ρ and n. On the otherhand, a great deal of efficiency loss by using ˆ β1 may result when φ approaches 0.5. Such phenomenon is more apparent with increased ρ but less so for n. The main messagein Figure 1.2 is that ignoring correlation could lead to a substantial loss of statisticalpower especially when both within-cluster and between-cluster information is being usedto estimate β1 and that it is important to examine and incorporate the within-clusterassociation structure so as to help improve upon the efficiency for the β inference.
CHAPTER 1. CORRELATED DATA Measures of Within-Cluster Dependence
Adequately describing the pattern of within-cluster dependence is important in severalregards: (i) it may represent the main scientific objective as is typically the case in familystudies (Example 1.2) and (ii) it may help to more precisely characterize the relationshipbetween the means and covariates as asserted in §1.3.2. In this section, we briefly discusshow one measures the within-cluster association and its patterns.
For continuous responses, the most commonly used measure of dependence between a pair of responses from the same cluster is correlation coefficient, defined as j , Yk ) = { , j < k = 1, . . , n. This quantity takes values between 1 and 1 inclusively. Very little dependence betweenYj and Yk may be claimed if ρ is close to zero and a strong association is evident if ρis close to either 1 or 1. Furthermore, a positive association between Yj and Yk, i.e.
ρ > 0, means Yj tends to be larger than expected if Yk is and vice versa. Finally, thismeasure of association has the desired property that it be dimensionless and symmetric.
For the former, it means that ρ for Yj and Yk is the same as that for cYj and cYk wherec is a non-zero constiuent. Thus, the same magnitude of ρ results whether one uses, forexample, meter or foot as the unit for height or length. For the latter, it matches with theintuition that the association between Yj and Yk has the same concept as that betweenYk and Yj. For a cluster of size greater or equal to 3, it raises the question as to howone distinguishes between, for example, ρ1,2 and ρ1,3, the correlation coefficient betweenY and Y2 and between Y1 and Y3, respectively? The answer to this question depends onthe nature of the clustering and sometimes, on how the scientific objective is formulated.
For longitudinal studies, it is a common belief that the correlation coefficient betweentwo observations adjacent in times is likely to be larger than that when two observationsare far apart in times. This is a part of the notion, known as "tracking" in longitudinalstudies. To capture this phenomenon, one may model ρ(Yj, Yk) = α tj−tk , where tj and tk are times at which Yj and Yk are observed, respectively. This patternis known as the AR-1 (auto-regressive model of order one) model. For more detaileddiscussion on describing and empirically examining patterns of within-subject associationfor longitudinal data, we refer the readers to the work by Diggle (1988), Diggle, Liangand Zeger (1994) and the references therein.
For family studies of first degree relatives, i.e. parents, siblings and offspring, the pattern of within-family association is likely to be different from that from longitudinalstudies. For siblings, it would be sensible to assume that the degree of association,as measured by ρ, is the same for each pair of siblings. However, it would be equallysensible to allow this correlation coefficient (ρSS) to be different than that of parents,ρPP. Indeed, any sensible statistical procedure should be flexible enough to allow one totest the equality of ρSS and ρPP. Assuming that the same environment is shared by allrelatives in the same family, a larger ρSS than ρPP would suggest that for the continuousresponses considered, known as quantitative traits in the field of genetic epidemiology,genetic factors may play a non-trivial role. For more detailed discussion on how one may 1.4. MEASURES OF WITHIN-CLUSTER DEPENDENCE model and estimate patterns of within-family association, see, for example, Beaty et al.
The use of correlation coefficient as a measure of association is less useful for binary responses. This is because the range of ρ is narrowed considerably due to the constraintthat max(0, πj + πk − 1) < Pr(Yj = Yk = 1) < min(πj, πk). where πj = Pr(Yj = 1) and πk = Pr(Yk = 1). Furthermore, the degree of constraintdepends on values of the π's. Table 1.2 shows ranges of ρ as a function of selected πjand πk. For example, if the true πj(πk) is 0.1(0.3), then the corresponding ρ between Yjand Yk must be greater than 0.22, but less than 0.51, much narrower than the idealsituation of (1, 1). This constraint in range of ρ makes the use of ρ much less desirablefor binary responses than it is for continuous responses, especially when characterizingwithin-cluster association represents the main objective. As an alternative, one may Table 1.2: Ranges of correlation coefficients for a pair of binary responseswith corresponding means πj and πk, respectively.
consider the use of the odds ratio (OR) j = Yk = 1) Pr(Yj = Yk = 0) j , Yk ) = Pr(Yj = 1, Yk = 0) Pr(Yj = 0, Yk = 1) as a measure of association between a pair of responses, Yj and Yk. For this quantity, noconstraint is induced other than the fact that it must be positive. A positive associationresults if OR is greater than one and a negative association may be claimed if OR isless than one. Furthermore, the odds ratio is symmetric as well and is familiar to publichealth researchers for its ease in interpretation. Just as in continuous response situations,one can model the odds ratio, or more preferably its log version, through regression. Fora more detailed discussion on the use of odds ratios for within-cluster associations, seeHeagerty and Zeger (1998) for longitudinal data and Liang and Beaty (1991) for familydata. Finally, different measures of association between discrete variables have beensuggested in the literature; see Goodman and Kruskal (1979). We choose OR as theprimary measure of within-cluster dependence for discrete data, mainly because it iseasy to interpret and is familiar to biomedical researchers.
Regression Analysis: An

As mentioned earlier, regression analysis has been and remains one of the most commonlyused statistical tools in biomedical research. In clinical research, this technique is useful tohelp assess the degree of treatment effects when randomized clinical trials are conducted.
It can also be used for clinicians to identify prognostic factors of disease for predictionpurposes. In public health research, regression analysis is instrumental to help identifyrisk factors of disease for prevention purposes. It is also useful in observational studies toadjust for confounding variables that are not considered in the design stage for matching.
In this chapter, we provide a chronicle overview on some recent developments of regressiontechniques with focus on methods including generalized linear models, quasi-likelihoodmethods and estimating function approaches. Since these methods are, by and large,likelihood-based, we first review some key issues in likelihood-based inference that isparticularly relevant to subsequent developments.
Some Issues Concerning Likelihood-Based Infer-

Let f (·; θ, φ) be the probability (density) function for a random variable, Y , of size mindexed by two sets of parameters, θ of p-dimensional and φ of q-dimensional. Here θrepresents parameters of interest reflecting scientific interest and φ the so-called "nui-sance parameters." Nuisance parameters, by definition, are of little intrinsic interest toinvestigators, yet necessary to fully specify the random mechanism, f (·; θ, φ), for whichthe data, Y ≡ y, are generated. A typical example of nuisance parameters is the exposedprobability of controls in case-control studies in which the interest is on the associationbetween the studied disease and the hypothesized exposure of interest, characterized bythe odds ratio, θ. Upon the data were observed, the only quantities in f that are un-known to investigators are θ and φ. The phrase "likelihood function" is formally definedas a function of θ and φ that is proportional to f (y; θ, φ), i.e.
L(θ, φ) ≡ L(θ, φ Y = y) ∝ f (y; θ, φ). CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW Impact of Nuisance Parameters
When the value of nuisance parameters φ is known, one can contrast two distinct valuesof θ, θ1 and θ2, by computing the following likelihood ratio A likelihood ratio which is greater than one would suggest that there is evidence in fa-vor of θ1 instead of θ2 as conveyed by the data through the likelihood function (e.g.
Royall, 1997). Furthermore, the larger the likelihood ratio in magnitude, the strongerthe evidence. However, when φ is unknown to investigators, one faces the additionalcomplication due to the need of specifying φ when computing LR(θ1, θ2). The followingexample from a teratological experiment (Weil, 1970) illustrates that in the absence ofknowledge in φ, the presence of nuisance parameters may have a profound impact onthe likelihood inference for parameters of interest. Figure 2.1 shows plots of log L(θ, φ)against θ for selected φ values. Here the likelihood function is proportional to a productof 16 observations following a beta-binomial distribution indexed by θ, the probability ofsurviving a 21 day lactation period for a fetus and φ, the nuisance parameter characteriz-ing the so-called "litter effect", an effect commonly observed in teratological experiments.
More detailed treatments of analysis for teratological experiments in general and of thisexample in particular will be given in Chapter 3. Choosing θ1 = 0.8 and θ2 = 0.6 as twoplausible values for θ, we note that different values of φ lead to different conclusions as towhich θ is favored more by the data. For example, L(0, 8, 0.8)/L(0.6, 0.8) = exp(2.89)strongly suggests that θ = 0.6 is favored over θ = 0.8 when φ = 0.8 is used, whereasL(0.8, 0, 1)/L(0.6, 0, 1) = exp(5.82) strong supports the opposite conclusion.
Nuisance parameters could have non-trivial impacts on estimating θ as well. The following well known Neyman-Scott problem (Neyman and Scott, 1948) demonstratesthat with many nuisance parameters at hand, the conventional maximum likelihoodapproach for estimating θ may not even be consistent at all. Consider {Yij, j = 1, . . , n},independent observations from the ith of m independent clusters following a normaldistribution with mean φi and a common variance θ, i = 1, . . , m. The likelihood functionfor θ and φ = (φ1, . . , φm) is proportional to L(θ, φ 1, . . , φm) θ− n2 e It is easy to verify that the maximum likelihood estimator (MLE) of θ, as a result ofjointly maximizing the likelihood function with respect to θ and φ, has the form (yij − yi)2/(nm). i=1 j=1 Assuming that the cluster size, n, is fixed by design, it can be easily shown that ˆ converges as m → ∞ to (m − 1)θ/m instead of θ. Such an undesirable phenomenonfor ˆ θ is not specific to the example given above. Indeed, it is well known that when the dimension of nuisance parameters, increase with the sample size, m in this case, the MLEof θ is in general inconsistent (e.g. Andersen, 1970). An important special case of this 2.1. SOME ISSUES CONCERNING LIKELIHOOD-BASED INFERENCE Figure 2.1: Beta-binomial log-likelihoods for the exposed group based onthe data reported by Weil (1970).
kind is on the estimation of the common odds ratio when the one-to-one matched case-control design is adopted (Breslow and Day, 1980). Indeed, it has been shown (Breslow,1981) that ˆ θ converges, as the number of matched pairs increases, to θ2 instead of θ.
Many likelihood-based inferential procedures have been developed to eliminate, or at least to reduce, the impact of nuisance parameters; see for example the work of Kalbfleischand Sprott (1970), Andersen (1970), Cox (1972), Lindsay (1982), Cox and Reed (1987),Reed (1995) and references therein. These methods rely upon the availability of thelikelihood function that is fully and correctly specified. In §2.4 we present an alternativeapproach, the estimating function approach, which also serves to eliminate (or reduce) theimpact of nuisance parameters without the need to fully specify the likelihood function.
Cautions on Wald's Procedure
The basis behind the frequentist approach builds on the repeated sampling, or long run,interpretation for inferential making (e.g. Cox and Hinkley 1974). A central theme ofthis approach is on testing the null hypothesis that θ = θ0, a pre-specified value andon computing confidence intervals for θ. With the exception that f (·; θ, φ) is generatedfrom an exponential family (Lehmann, 1959), one needs to appeal to the large sampleapproximation for the purpose, for example, of deriving confidence intervals. To thisend, three large sample inferential procedures have been developed and employed on adaily basis: likelihood-ratio-based, score-statistic-based and Wald-based. We now brieflydescribe these three procedures from the viewpoint of testing the null hypothesis thatH0 : θ = θ0. Their application to confidence intervals can be converted through hy- CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW pothesis testing in a straightforward manner and will be mostly omitted. As the namesuggested, the likelihood-ratio-based method relies upon the use of the ratio of the like-lihood function evaluated at θ0 and ˆ θ for testing H0, i.e.
1 = 2 log φ) are the MLE of (θ, φ) and ˆ φ(θ0) the MLE of φ under H0. The score-statistic- based procedure relies upon one of the key properties of the likelihood function, namely, E(S(θ, φ); θ, φ) = 0 ∀θ, φ, S(θ, φ) = log L(θ, φ)/∂(θ, φ), known as the score function for (θ, φ). Based on (2.3), one would reject H0, on theintuitive ground, if S(θ0, φ) is "large." To assure that this assessment of S(θ0, φ) beinglarge is not affected by the unit of the Yi's, one needs to standardize S by considering T2 = St(θ0, ˆ I(θ, φ) is known as the Fisher information matrix for (θ, φ). Finally, in contrast to the LRprocedure in which θ0 and ˆ θ are compared indirectly through the likelihood function, the Wald's method uses θ − θ0)t(Iθθ − ItθφI−1I φφ φθ ) (ˆ to test H0, here we have re-expressed I(θ, φ) as I(θ, φ) = where, for example, Iθθ = E(−∂2 log L(θ, φ)/∂θ2) Iθφ = E(−∂2 log L(θ, φ)/∂θ∂φ). Detailed treatments on these three large sample procedures can be found in Rao (1973, §6.e). In particular, it was shown that under some regularity conditions, allthree test statistics converge, as m → ∞, to a χ2p distribution and furthermore, areequivalent asymptotically to each other, i.e. the difference between T1 and T2, and hence T1 and T3 and T2 and T3, is in the order of op( m). Despite the similarity among thesethree procedures when the sample is large, the Wald procedure possess some undesirableproperties that are not shared by the other two procedures. First, this procedure is notinvariant in the following sense. Let δ = δ(θ) be a one-to-one function of θ so that the null 2.2. GENERALIZED LINEAR MODELS hypothesis of θ ≡ θ0 is equivalent to δ ≡ δ(θ0) = δ0. Unlike T1 and T2, the Wald's teststatistic is different numerically depending on the choice of parameters used to formulatethe null hypothesis. The following hypothetical example on testing the hypothesis of anassociation from a 2 × 2 table illustrates the shortcoming. With 8(2) out of 18 cases (17controls) exposed to a pre-specified risk factor, one has the MLE for θ, the odds ratio,equal to 6.0 with an estimated standard error (s.e.) of 5.35. This leads to a Wald teststatistic value of (6 1)2/(5.34)2 = 0.70 with p-value < 0.1. On the other hand, if oneused δ = log θ as the basis for inference, the corresponding Wald's test statistic is equalto (log 6 0)2/(0.89)2 = 4.0, highly statistically significant at the 0.05 level.
The other drawback of the Wald procedure is that in certain circumstances T3 may be "powerless" when the true θ value, θ say, is far different from θ0, i.e. 0 −θ is large inmagnitude. This counter-intuitive phenomenon may be explained at least heuristically asfollows. For simplicity, we assume p = 1 and there is no nuisance parameters, i.e. q = 0.
It is true in general that the asymptotic variance of ˆ θ, I−1(θ), depends on θ. While the numerator of T3, θ − θ0)2, increases as θ − θ0 increases since ˆ θ → θ, so does the denominator of T3, I(θ). Under the circumstances that I(θ) increases at a faster ratethan (ˆ θ − θ0)2 does, T3 becomes arbitrarily small in spite of the fact that the true θ, θ is very different from θ0. This undesirable property of T3 was first observed and explainedfor logistic regression models by Hauck and Donner (1977). It remains an open questionas to how one may characterize such "circumstances" leading to the peculiar behavior ofthe Wald procedure.
For illustration, consider the following time to relapse data from a clinical trial re- ported in Lee (1992). For five breast cancer patients receiving CMF (x = 1) after a radicalmastectomy, the times to relapse were recorded as 23, 16+, 18+, 20+, 24+ in weeks, wherethe + sign represents censoring. The five control patients (x = 0) all experienced re-lapse at 15, 18, 19, 19 and 20 weeks. A proportional hazard model with x, the treatmentstatus, as the sole covariate was fitted to the above data resulting in ˆ β = 16.66 and s.e.( ˆ β) = 1513. This leads to a Wald's test statistic value of 0.0001 with p-value = 0.991.
The other two test statistics, the likelihood ratio and score test statistic, amount to 8.76and 6.90 with p-values 0.003 and 0.009, respectively, matching well with the intuition asconveyed by the data that the new treatment (CMF) prolongs time to relapse for breastcancer patients receiving radical mastectomy.
Generalized Linear Models
Multiple linear regression analysis and logistic regression analysis are daily employedin biomedical research for reasons given at the onset of this chapter. In 1972, thesemethods, among others, are unified under the framework of generalized linear models(GLMs) (Nelder and Wedderburn). We now briefly review this seminal work along withseveral key properties of GLMs.
For a sample of size m, let Yi be the univariate response and xi, a p × 1 vector, be the covariates thought to be related to Yi, i = 1, . . , m. To specify a GLM, the followingtwo components are sufficient (Nelder and Wedderburn, 1972): (A) (Random component).
Yi is generated from an exponential dispersion family CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW (Tweedie, 1947, Jφhanssen, 1987) of the form f (Y iyi − b(θi) i = y; θi, φ) = exp + c(y i; φ) (B) (Systematic component). The expectation of Yi, denoted as µi, is related to xi through the link function h, i.e.
µi = E(Yi xi) = h−1(xtiβ). Some special cases of GLMs are as follows.
Example 2.1 (Multiple linear regression models). This corresponds to a special case with h(µ) ≡ µ, the "identity" link function, a(φ) ≡ φ and Yi being normally distributedwith mean µi = θi and variance φ.
Example 2.2 (Logistic regression models). For a binary response with Y = 1 or 0, this popular model is a special case of GLMs with h(µ) = log{µ/(1 − µ)}, the "logit" link,a(φ) 1, and Yi following a binomial distribution of size 1 and probability of "success"equal to µi.
Example 2.3 (Poisson regression models). For count responses which are common in cohort analysis with the number of "events" such as the death or diagnosis of a disease,this corresponds to the use of the "log" link, i.e. h(µ) = log µ, a(φ) 1 and a Poissondistribution with mean µi as the basis for inference. Note that the index Example 2.4 (Log-linear models). For data that can be expressed in contingency tables, log-linear models are commonly used (e.g. Bishop, Fienberg and Holland, 1975).
Here as in Example 2.3, the index i represents the group of individuals sharing similarcharacteristics which are categorical in nature. This model can also be cast as a GLM withh(µ) = log µ and a Poisson distribution for Yi, which is the number of individuals with xias the covariate values. A major difference between this model and the Poisson regressionmodel lies on the objective of interest. For log-linear models, the primary interest is onexamining the pattern of associations among categorical variables represented by x. Forthe latter, it is of interest to identify risk factors characterized by x that may be associatedwith the risk of a particular event.
We now present several key properties of GLMs that are particularly relevant to the First Two Moments of GLMs
The exponential dispersion family displayed in (2.4) does not explicitly reveal how themoments of Yi, in particular the first two moments, may be related to θi and φ. Simplealgebra suggests that The expression for the variance of Yi explains the rationale behind the phrase "dispersionparameters" for φ. Finally, given the relationship between θi and β through (2.5) and(2.6) above, it is clear that f (Y ; θ, φ) can also be expressed as f (Y ; β, φ).
2.2. GENERALIZED LINEAR MODELS Score Function for β
Assuming that the Yi's are independent and represent a random sample from the targetedpopulation, the likelihood function for β and φ is simply proportional to L(β, φ) f (yi; β, φ) i(β)yi − b(θi(β)) + c(y i; φ) Consequently, the score function for β can be derived as β (β, φ) = log L(β, φ)/∂β = i; β, π)(yi − µi(β)). It is important to note that the score function for β depends only on the first twomoments of the Yi's despite the full specification of f through (A) in (2.4). This formsthe primary motivation behind the use of the quasi-likelihood advocated by Wedderburn(1974) which is the focus of the next section. Furthermore, given that a(φ) appears as aproportional factor in var(Yi), no knowledge on φ is needed to derive the MLE of β, ˆ solving (β, φ) = 0 even though the asymptotic variances of ˆ (β, φ), does depend on φ, where Σ1(β, φ) = lim i; β, φ) MLE Versus Weighted Least Squares
An alternative estimating procedure for β is the well-known weighted least squares ap-proach. It amounts to minimizing the following objective function for β, namely, Q(β, φ) = i − µi(β))2 , Var(Yi; β, φ) the weighted differences between the "observed" (Yi's) and the "expected" (µi's). Again,the dispersion parameter φ plays no role in the estimation of β through minimizing Qin (2.9). While intuitive, this approach, in contrast to the common belief, may lead toinconsistent estimation of β if the variance of Yi depends on the mean and hence on β.
To see this, note that minimizing Q in (2.9) is equivalent to solving ∂Q/∂β ≡ 0, where ∂Q(β, φ)/∂β = i; β, φ)(yi − µi(β)) · (y i, β, φ) i − µi(β))2 Note that the first term in (2.10) is identical to (β, φ), the score function of β, whereasthe second term has in general non-zero expectations regardless of the sample size m.
CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW The resulting solution, ˜ β say, would be inconsistent as an estimator of β unless Var(Yi) is independent of β in which case ∂Q(β, φ)/∂β ≡ Sβ(β, φ) and hence ˜ This undesirable property of the weighted least squares approach can be alleviated through the following simple modification. Assuming the existence of a consistent esti- β say, for β, one remedy is to minimize instead Q(β, φ) = i − µi(β))2 . i=1 Var(Yi; β, φ) It can be shown that the minimizer of Q(β, φ), ˜ β say, is now consistent for any consistent β of β. One such choice is the minimizer of i − µi(β))2. Furthermore, it can be seen that by iterating the minimization process in (2.11) repeatedly, i.e. mini-mizing i − µi(β))2 t+1(β, φ) = , t = 0, 1, . . , i=1 Var(Yi; βt, φ) 0 = β, it leads to ˆ β, the MLE. Indeed, one can show that β and ˆ β are asymptot- ically equivalent to each other in that ˜ β = op(m).
An important aspect of model fitting is model checking which examines the pattern ofdepartures from the fitted model. There are at least two different types of departure:the systematic departure which describes the departure of the model from the dataas a whole, and the isolated departure which identifies individuals whose data are notin accordance with the fitted model (McCullagh and Nelder, 1989, Chapter 12). Thissubsection focuses only on the former one.
For the ith subject with xi as the covariate value, there are two means to estimate µi, the expected value of Yi. One is to estimate by µi( ˆ β) based on the fitted model. The other is to simply estimate µi by Yi = yi, an unbiased estimator. The first approach,which uses the whole data, is legitimate only if the fitted model adequately describesthe data. On the other hand, the second approach, while always legitimate, does notfully utilize the sample data except yi itself. These contrasts between two approachesprovides a means to answer the question as to "how well the model fits the data" bycomparing µi( ˆ β) with yi through, for example, the likelihood function, L(µi, φ). Indeed the quantity "deviance" is proportional to the likelihood ratio test statistic comparingthe null hypothesis that the fitted model is adequate versus the saturated alternative,i.e. the µi's are different and unrelated to each other. More specifically, the deviance fora fitted model is formally defined as µ; Y ) = 2 {log L(µi( ˆβ); φ) log L(yi; φ)}a(φ). Thus, a "small" value in D would indicate that the fitted model describes the data asa whole rather well whereas a "large" value would suggest otherwise. A few cautionarynotes on the use of the deviance, or the Pearson χ2 statistic (Q( ˆ β, φ)), as a measure of 2.3. QUASI-LIKELIHOOD METHOD goodness-of-fit. First, except for the special case of normal responses, the distribution ofD, even asymptotically, is unknown. Second, as the hypothetical example in McCullaghand Nelder (1989, p.123) suggested, any global measure of goodness-of-fit such as thedeviance may have a limited usage for choosing competitive models for the purpose ofpredictions when the x considered falls outside the range formed by the observed xi's,known as extrapolation.
To adopt the generalized linear model approach one needs to specify the random compo-nent for the response variables. Very often one faces the problem where the investigatorsare uncertain about the complete random mechanism by which the data are generated.
This could be due to the fact that the underlying biologic theory is not fully understoodyet and/or no substantial (empirical) experience of similar data from previous studiesis available. On the other hand, the scientific objective can often be adequately charac-terized through regression, i.e. the systematic component may be well described. Withthis in mind, Wedderburn (1974) proposed as an alternative the quasi-likelihood methodwhich replaces the random component (A) in GLMs by (A) (Variance specification). The variance of Yi has the form Var(Yi) = a(φ)V (µi), where V is the known function relating the variance to the mean, µi.
In the absence of the likelihood function that is available for statistical inference, Wedderburn (1974) proposed the use of the quasi-score function to estimate β, i.e. bysolving β (β, φ) = i; β, φ)(yi − µi(β)) = 0. This quasi-score function has the same expression as (2.7) and would be the true scorefunction for β if the Yi's were indeed generated from an exponential dispersion fam-ily. McCullagh (1983) studied the asymptotic behavior of ˆ β, the solution of Sβ ≡ 0, and found that under some regularity conditions, ˆ β is consistent and asymptotically normally distributed with the covariance matrix equal to Σ in (2.8). Firth (1987), on the otherhand, examined the efficiency of ˆ β relative to the MLE of β when the true probability mechanism for the Yi's is not from an exponential dispersion family. Several differenttypes of departure from the exponential dispersion family were established and, as ex-pected, high efficiency of ˆ β is maintained when the departure of f from the exponential dispersion family is only "modest" (Firth, 1987). A remaining question to be addressedis whether ˆ β or the quasi-score function, , possesses any optimal property? This issue will be the main focus of the next section. Indeed, we will consider the more generalsituations than (A), namely (A) (Variance specification). The variance of Yi has the form Var(Yi) = Vi(µi; φ), where Vi is a known function of the mean, µi, and φ, the dispersion parameter.
This specification, of course, includes (A) as a special case with Vi(µi, φ) = a(φ)V (µi).
Returning to the general setup in §2.1 where we assume the data, Y = y, is generatedfrom a probability (density) function f (y; θ, φ) indexed by parameters of interest θ andnuisance parameters φ. In the regression setting, θ would be the regression coefficientwhereas φ would be dispersion parameters. An estimating function for θ is simply afunction of the data y and parameters of interest θ, i.e. g(y; θ). An estimating functionis called unbiased if E(g(Y ; θ); θ, φ) = 0 for all θ and φ values. An important example of unbiased estimating functions fromthe perspective of health research is the quasi-score function, (β, φ) in (2.12) wherea(φ) factors out in (β, φ). While the formal theoretical development of estimatingfunctions was not advanced until the 1960's, scientists grappled with how to designestimating functions which combined observations and unknown interest as early as themid 18th century; see the interesting book on the history of statistics before 1900 byStigler (1986). Meanwhile, the method of moments, advocated by Karl Pearson, maybe viewed as a precursor of the modern estimating function approach. Here, selectedempirical moments such as the sample mean are equated to their expectations which arepresumably dependent on parameters of scientific interest.
The concept of unbiased estimating functions and the corresponding optimality theory was formally developed in the year of 1960 through two independent work by Durbin(1960) and Godambe (1960). We now briefly recast the essence of a series of moregeneral work by Godambe, before addressing the question concerning the optimalityof the quasi-score function. Consider the class of unbiased estimating functions G ={g(y; θ) : E(g(Y ; θ); θ, φ) = 0, ∀θ, φ}. A member of G, g say, is optimal within this class(Godambe, 1960) if it minimizes W (θ, φ) = E1(∂g(Y ; θ)/∂θ)Cov(g(Y ; θ))E1(∂g(Y ; θ)/∂θ). When φ is known, i.e. there is no nuisance parameters, Godambe (1960) showed that thescore function for θ is the optimal one among G. In the more practical situation whereφ is present and unknown, Godambe (1976) showed that g(y; θ) = log f (y t; θ)/∂θ is optical among G where t is a complete, sufficient statistic for φ for fixed θ.
Limitations: Impact of Nuisance Parameters
This modern line of optimal estimating function discussed thus far is subject to the fol-lowing two major limitations. First, optimality considered is ascribed to the estimatingfunction, yet scientists and other practitioners are concerned more about estimators. AsCrowder (1989) put it: "This is like admiring the pram rather than the baby." Thisconcern may be reconciled as follows.
It can be shown that under some regularity β, the solution of g(y; θ) = 0, has minimum variance among solutions of {g(y; θ) = 0 : g ∈ G}. Thus the limitation of estimating functions noted above may notbe too serious an issue as such if one is willing to appeal to a large sample argument.
2.4. ESTIMATING FUNCTION APPROACH Second, nuisance parameters can compromise the optimality theory. Specifically, the optimal estimating function given in (2.16) relies upon the existence of a complete suf-ficient statistic for φ that does not depend on t. This is the case when Y is generatedfrom an exponential family with θ and φ being the "natural (or canonical)" parameters,but more generally t = t(θ). In this more general situation, log f (y t(θ); θ, φ)/∂θ de-pends on φ as well and hence is only locally optimal (Lindsay, 1982). More seriously,the sample space of Y given t(θ) may depend on θ and consequently, f (y t(θ); θ, φ) manynot be differentiable with respect to θ (Kalbfleisch and Sprott, 1970). It is worth notingthat the quasi-score function in (2.12) also suffers from this limitation if the variance ofYi has the more general form in (2.13), in which case the nuisance parameter φ may notfactor out in (2.13).
Orthogonality Properties of Conditional Score Functions
The previous subsection made it clear that with the exception of exponential families, nui-sance parameters could compromise the optimal theory. While many likelihood methodsare potentially available leading to unbiased estimating functions that are independentof nuisance parameters, there is no general guidance as to how one should proceed. Inaddition, unlike the conditional score function in (2.16), there is no known theory to sup-port the optimality of derived estimating functions. For this reason, we redirect our focuson the finding of locally optimal estimating functions in which the impact of nuisanceparameters may be minimized, if not eliminated. Avoiding the complication associatedwith the computation of f (y tθ; θ, φ), where is complete and sufficient for φ for fixedθ, Lindsay (1982) extends the concept of conditional score functions in this more generalsituation, i.e. t ≡ tθ by defining gC(θ, φ) = log L(θ, φ)/∂θ − E(log L(θ, φ)/∂θ tθ), which reduces to log f (y t; θ)/∂θ when = t. Due to the dependence of gC on φ, theconditional score function defined above is only locally optimal among G at the trueφ value (Lindsay, 1982). However, gC(θ, φ), by definition, is orthogonal to the spacespanned by the sufficient statistic for φ. Thus the impact of nuisance parameterson gC(θ, φ) is expected, at least intuitively, to be small. The following "orthogonality"properties enjoyed by gC(θ, φ) represent attempts to quantify the above speculation.
These properties ((a) to (d)) are listed so that gC(θ, φ) implies (a), which implies (b),which implies (c), which implies (d): (a) E(gC(θ, ˆ φθ); θ, φ) = 0 for all θ, φ and any ˆ φθ which is a function of ; (b) E(gC(θ, φ); θ, φ) = 0 for all θ, φ and φ; (c) E(∂gC(θ, φ)/∂φ; θ, φ) = 0 for all θ, φ and φ; (d) Cov(gC(θ, φ), ∂ log L(θ, φ)/∂φ; θ, φ) = 0 for all θ and φ.
Interpretation of these four orthogonality properties can be seen in Liang and Zeger(1995). Furthermore, a rich class of probability (density) functions in which is in-deed complete and sufficient for φ is given by Liang and Tsou (1992). In summary,we have devoted this subsection to the discussion of desirable properties possessed bythe conditional score function proposed by Lindsay (1982). These properties, namely, CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW local optimality and orthogonality to nuisance parameters, form the criteria for the nextsubsection.
Finally, several researchers, including Cox and Reid (1987), Liang (1987), Ferguson, Reid and Cox (1991), Waterman and Lindsay (1996) have investigated strategies forapproximating the conditional score function in the more general situation where thecomplete sufficient statistic for φ for fixed θ may not exist or are difficult to find.
(Local) Optimality of Quasi-Score Functions
We now return to the question posed in §2.3, namely the potential optimality of the quasi-likelihood method, which will be addressed in more general settings. In the previous twosubsections, we have assumed implicitly that the likelihood function, indexed by θ andφ,is readily available for θ inference. As mentioned in §2.3, there are situations whereeither the investigators are uncertain about the random mechanism or the probabilitydistribution is too complicated to write down explicitly. Examples include data fromspatial processes encountered in plant ecology (Besag, 1974) and from DNA sequencing.
Nevertheless, it is common that the parameters of interest θ are well defined reflectingthe scientific interest, even though they may not completely specify the probability dis-tribution for the data y. We assume that the data y = (y1, . . , ym) is decomposed intom "strata" and the yi's, i = 1, . . , m, are uncorrelated with each other. The sizes ofthe yi's are not required to be the same. Furthermore, we assume that θ is common inmeaning to all m strata and that there exists an unbiased estimating function, gi(yi; θ),for the ith stratum, i = 1, . . , m. A simple and trivial example of this setup is that theyi's are independent and identically distributed with mean θ and variance φ and a simplechoice of gi would be yi −θ. A natural question is how to combine {gi(yi; θ), i = 1, . . , m}together into a single unbiased estimating function for θ inference that is optimal in somesense? To address this question we consider a more restricted class (than G) of unbiased estimating functions, namely, G = {g(y; θ) = which is a linear combination of the gi's. It can be shown (e.g. Liang and Zeger, 1995)that the optimal unbiased estimating function within G, g say, is which minimizes W (θ, φ) defined in (2.15). Furthermore, the solution of g = 0, ˆ has the smallest asymptotic variance among {˜ θ : g(y; ˜ θ) = 0, for all g ∈ G}. This setup includes several important examples: Example 2.5 (Quasi-score function). Under (A) and (B) specified in §2.3, one has h(µi) = xiθ, Var(Yi) = a(φ)V (µi). By choosing gi(yi; θ) = yi − µi(θ), one has as the optimal unbiased estimating function 2.4. ESTIMATING FUNCTION APPROACH among G = { m a i)(yi − µi(θ)), the quasi-score function for θ.
Example 2.6 (Correlated data). For yi = (yi, . . , yin ), the observations from the ith cluster, i = 1, . . , m, where ni is the cluster size, the objective of interest is fre-quently captured through the modeling of µij = E(Yij), i.e. h(µij) = xt θ (Liang and Zeger, 1986).
In this case, one such choice of gi would be yi − µi(θ) where µi = i(θ), . . , µin (θ))t and the optimal g among G = {g = i)(yi − µi(θ)), known as the generalized estimating equation (GEE) (Liang and Zeger, 1986).
Example 2.7 (2 × 2 tables). In case-control studies, subjects are stratified into m strata by confounding variables. The primary interest is on the degree of associationbetween the targeted disease and a dichotomous risk factor of interest as measured bythe odds ratio, θ. In this case, yi = (yi1, ni1 − yi1, yi2, ni2 − yi2) where yi1(yi2) is thenumber of exposed among ni1 cases (ni2 controls) in the ith stratum, i = 1, . . , m. Itcan be verified easily that gi(yi; θ) = yi1(ni2 − yi2) − θyi2(ni − yi) is an unbiased estimating function for θ from the ith stratum. While is complicated computationally, a(θ) reduces to 1/(n i1 + ni2) when θ is evaluated at one (McCullagh and Nelder, 1989) and this leads to the well known Mantel-Haenszelestimator of θ (Mantel-Haenszel, 1959).
Implicitly, the optimality theory developed thus far assumes g is functionally inde- pendent of the nuisance parameters φ. This is indeed the case for Example 2.7 if theexpectations derived in computing g are conditional on ti = yi1 + yi2, the total numberof exposed in the ith stratum, i = 1, . . , m. In general, however, the distribution gi,which is functionally independent of φ, does depend on φ. Thus, g = g(y; θ, φ) andis locally optimal among G at the true φ value as is the conditional score function θ(2.17) when the probability mechanism is available for inference. One important excep-tion is the quasi-score function under (A), i.e. a(φ) appears as a proportional factor inVar(Y ). In this case, a(φ) factored out and g = g(y; θ) is globally optimal among G.
Even though g may depend on φ, we now argue that the impact of φ on g and on thecorresponding solution of g = 0 is small. This is because g shared the orthogonalityproperties (b) - (d) enjoyed by the conditional score function, i.e.
CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW (b) E(g(y; θ, φ); θ, φ) = 0 ∀θ, φ and φ, (c) E(∂g(y; θ, φ)/∂φ; θ, φ) = 0 ∀θ, φ and φ, (d) Cov(g(y; θ, φ), ∂ log L(θ, φ)/∂φ; θ, φ) = 0 ∀θ and φ.
One important implication of these properties, at least when the sample size is large, isthat ˆ θ, the solution of g(y; θ, ˆ φθ) = 0, where ˆ φθ is an arbitrary consistent estimator of φ, has the same variance as that when φ is known. That is, the uncertainty in estimating θ,as measured by the variance, is unaltered whether φ is known or not, another indicationthat Chapter 3
Analysis of Binary Data
Binary (dichotomous) response is commonly observed in biomedical research. Examplesinclude death versus alive, ill versus healthy, injured or not in a car accident, depressed ornot in psychiatric research, etc. Sometimes there is a time element involved in defining theresponse. For example, one may be interested in, as shown below in Table 3.1 whethera child diagnosed with neuroblastoma survives through a two year period. Anotherexample, shown in Table 3.2, has the response for each fetus as to whether it survives21 day lactation period or not. These two examples share similar data structure whichis the focus of this chapter, namely, (yi/ni, xi), i = 1, . . , m, where yi is the number of "affected" out of the ni "subjects" in the ith stratum and xi, Table 3.1: Proportions of children surviving 2 years free of disease followingdiagnosis and treatment for neuroblastoma by age and stage of disease atdiagnosis (Breslow & McCann, 1971).
Stage at Diagnosis a p × 1 vector, covariates that are common to all ni subjects. For the neuroblastomaexample, m is equal to 15 with p = 2 covariates, namely, age and stage at diagnosis.
For the teratological experiment example, m would be 32 with one primary covariate,i.e. treatment status: x = 1 if treated and 0 if control. Questions of interest for datathat can be summarized in this kind of structure include (i) can variation in yi/ni beexplained by xi? and (ii) can one predict the affected status using xi? The first questionis important in that it provides a means to identify risk factors that may be useful forprevention purposes. The second question is useful for identifying prognostic factorswhich may have clinical and policy implementation implications.
CHAPTER 3. ANALYSIS OF BINARY DATA Table 3.2: Proportions of fetuses surviving the 21 day lactation periodamong those who were alive at 4 days by treatment group (Weil, 1970).
12/12, 11/11, 10/10, 9/9, 10/11, 9/10, 9/10, 8/9, 8/9, 4/5, 7/9, 4/7, 5/10, 3/6, 3/10, 0/7 13/13, 12/12, 9/9, 9/9, 8/8, 8/8, 12/13, 11/12, 9/10, 9/10, 8/9, 11/13, 4/5, 5/7, 7/10, 7/10 In this chapter, we discuss some analytical issues that are relevant to this type of data structure with special attention paid to the issue of statistical dependence within astratum which is common in data from teratological experiments.
Given the much longer history in multiple linear regression for continuous responses,it was natural that some earlier developments for analyzing binary data appealed tothe conventional linear regression method. This forms the basis behind the first twoapproaches presented below.
Approach 1 (Arcsine analysis). This approach amounts to creating a new response variable through a arcsine transform on the original response variable, that is, y = yi/ni) and imposing the following model i = xtiβ + i, This transformation was motivated by the fact that the variance of y, at least asymptot- ically, is independent of the means, xtβ, a desirable property to have in the conventional linear regression approach; see §2.2.3.
Approach 2 (Empirical logit analysis). This approach is similar to Approach 1 except that a different transformation is applied to yi/ni, namely, y = log(y i/(ni − yi)). This is a special case of a comprehensive treatment for analyzing categorical data developedin the 1960's by Grizzle, Starmer and Koch (1969).
Approach 3 (Modern analysis). This approach, which is a special case of GLMs, differs from the first two approaches in one important aspect. In this modern (for the lack ofa better term) approach, the notion of transformation is applied to the expectation ofyi/ni, πi, rather than the data itself. In other words, for a given "link" function, h say,this approach specifies h(πi) = xtiβ and furthermore, yi follows a binomial distribution. Some common choices of link func-tions include the probit link for bioassay analysis (e.g., Finney, 1964) and the logit linkfor case-control studies (Breslow and Day, 1980).
It is interesting to contrast the modern approach versus the first two approaches. For simplicity, we restrict our attention to the log odds transformation (Approach 2) so thatthe comparison is made between the empirical logit approach and the logistic regressionapproach.
3.1. HISTORICAL DEVELOPMENTS Empirical Logit versus Logistic Regression
We now argue that the empirical logit approach inherits the following three drawbacksthat are not shared by the logistic regression method. First, the regression coefficientsof this approach are difficult to interpret: change in averaged Y  = log{Y /(n − Y )} perunit change in x. The transformation Y  is made for convenience whose average doesnot necessarily possess physical meaning of interest. This is to be contrasted with theinterpretations of the logistic regression coefficients: change in log odds in risk per unitchange in x. These quantities are easy to understand with sound physical meaning forbiomedical researchers. Second, the normal distribution assumption for the Y 's relies heavily on the ni's being large. As can be seen in either example, the ni's vary fromstrata to strata and are not (can't be in teratological experiments) fixed or controlledby designs. This concern, on the other hand, is not shared by the logistic regressionmethod. Indeed, the latter approach can be applied even if ni ≡ 1 for all i so longas m is sufficiently large. Third, with the exception of the intercept, all the regressioncoefficients in logistic regression models are estimable should one adopt the retrospectivesampling, a commonly adopted design in epidemiologic studies. The same thing cannotbe said for the empirical logit approach. Finally, parameters in logistic regressin modelsare the "canonical" parameters in an exponential family which means the intercepts canbe eliminated through conditioning which is critical for the matched case-control designsto be implemented (Breslow and Day, 1980).
Is the Binomial Assumption Reasonable?
Either one of three approaches presented earlier assumes that Yi follows a binomialdistribution with size ni and probability πi, i = 1, . . , m. A natural question to ask iswhether the binomial assumption is adequate? To answer this question, we note that onenatural way to characterize a binomial variate, Y say, is that Y be expressed as a sum ofn binary observations, i.e., Y = Y1 + . . + Yn with Yj = 1 or 0 for all j = 1, . . , n suchthat (1) all the Yj's are statistically independent of each other and (2) Pr(Yj = 1) = π,the common probability for all j = 1, . . , n.
For the neuroblastoma example, it seems sensible to assume that the number of chil- dren with the same age and stage of diagnosis who survived for a two-year period is wellapproximated by a binomial distribution. After all, these children are unrelated to eachother and meanwhile, comparable to each other with regard to major prognostic factorsfor neuroblastoma. On the other hand, it is well known in the teratological experimentliterature (e.g. Haseman and Kupper, 1979) of the tendency for fetuses from the samelitter (stratum in our notation) to respond more alike than fetuses from different litterseven if they all received the same treatment. This is known as the "litter effect" whichmay be explained by the fact that fetuses from the same litter share the common genesand environment which in turn leads to the "likeness" of responses from fetuses of thesame litter. This argument strongly suggest that the statistical independence assumptionof the binomial assumption may be invalid. Indeed, for the sixteen proportions from thetreated group in Table 3.2, only 25% (= 0.328/1.295) of the total variation in the y/n'sis explained by the variation induced by the binomial assumption (Liang and Hanfelt,1994).
In the next, we discuss pros and cons of approaches that have been developed to aid investigators analyze data derived from teratological experiments.
CHAPTER 3. ANALYSIS OF BINARY DATA Teratological (toxicological) experiments are designed to test whether a substance isindeed a carcinogen or toxin; and if so, it would be of interest to establish the dose-response relationship in order to provide guidances for the tolerance level regarding safetyof human populations. Typically, pregnant animals (e.g. rats) are randomized to receivethe substance of different dose levels. The adverse effect such as structural developmentaleffect of the substance in the fetuses is then assessed to address the question stated above.
As mentioned in §3.1.2, the so-called "‘litter effect" is commonly observed and con- sequently, the binomial assumption on Y , the number of affected fetuses from a litter, isinvalid. One of the earliest approaches to address this issue of litter effect, also knownas "extra binomial variation" or more generally "over-dispersion," is to assume that Yfollows a beta binomial distribution (Skellam, 1948) instead. This distribution may bemotivated and derived as follows. Consider yi/ni, i = 1, . . , m, the responses (proportions of fetuses that are affected) from litters receiving the samedosage. For the example presented in Table 3.2, m is equal to 16 for both treated andcontrol groups. To account for variation in responses among litters that are "comparable"(regarding treatment levels) to each other, this approach assumes (1) given πi, Yi πi ∼ Binomial (ni, πi), i = 1, . . , m, (2) πi ∼ Beta (α, γ), α, γ > 0.
Consequently, the beta-binomial distribution for each Yi arrived as Pr(Yi = yi; α, γ) = Pr(Yi = yi πi)f (πi; α, γ)dπi Γ(α + γ) πyi (1 − π πα−1(1 − π Γ(α + γ)Γ(y )Γ(ni − yi + γ) . Γ(α)Γ(γ)Γ(α + γ + ni) The key step of this derivation is assumption (2) which acknowledges that the underlyingrisk, πi, varies from litter to litter by imposing a distribution, beta in this case, on theπi's. Intuitively, the litter effect is pronounced if the variance of πi is large. One difficultyassociated with the representation in (3.1) is lack of clear interpretability for both α andγ. However, noting that the first two moments of a beta distribution have the form α/(α + γ) ≡ θ, θ(1 − θ) = φθ(1 − θ), α + γ + 1 we now argue that the new parameters, θ and φ are more easily interpreted. This isbecause the first two moments of the Yi's can now be expressed as E(Yi/ni) = E(E(Yi/ni πi)) = E(πi) = θ, θ(1 − θ) Var(Yi/ni) = i − 1)φ). 3.2. TERATOLOGICAL EXPERIMENTS Thus θ is simply the overall, or averaged, risk for those randomized to a particular dosegroup. On the other hand, φ, the proportional factor in Var(π), is introduced to accountfor variation that is not solely explained by the binomial variation. Furthermore, note in§3.1.2 that Yi can be expressed as Yi = Yi1 +. .+Yin and that these n i binary responses are not statistically independent of each. If one assumes the correlations for each pairof these Yij's are the same, then it can be shown easily that this common correlationcoefficient is exactly equal to φ. Thus, the notion of variation between litters is equivalentto within litter "likeness" as captured by the same quantity φ.
Finally, to test or to establish dose-response relationship between θ and x, the dosage level, one may consider h(θ) = β0 + xtβ1, in line with the systematic component specified under GLMs.
Some Cautionary Notes on Beta-Binomial Distributions
While intuitive for its derivation, one may use this approach with caution from thefrequentist viewpoint. First, as shown below, the parameter of interest θ and the nuisanceparameter φ are highly intertwined with each other. Consequently, the presence of φis likely to have non-trivial impacts on θ inference; see §2.1 for general discussion onimpacts of nuisance parameters. To see this, Table 3.3 below, which is reported inWilliams (1988), shows through simulations that misspecifying φ value could lead tosevere bias in θ estimation. For example, if the true θ and φ values are 0.125 and0.2, respectively and a sample of 20 beta-binomial observations were generated, theaveraged MLE of θ over 1,000 replications is equal to 0.167 if a φ value of 0.4 wasassumed when deriving θ. This phenomenon is also evident when the beta-binomialdistributions are fitted to the Weil data in Table 3.2. Table 3.4 shows that the MLEof β1 = log(θT /(1 − θT )) log(θC/(1 − θC)) dropped to 0.665 when φT and φC wereforced to be the same. This is to be contrasted with the MLE of β1 when φT and φC areallowed to be different: 1.129. It is interesting to note that the standard error estimateof ˆ β1 when assuming binomial distribution in the Y 's, i.e. ignoring the litter effect is too small: 0.330 compared to 0.464 when a beta-binomial distribution is assumed instead.
Table 3.3: The averaged MLE (± s.e.) for θ for different assumed values ofφ from 1,000 simulations (Williams, 1988).
True parameter values Assumed φ value θ = 0.125, φ = 0.2 θ = 0.25, φ = 0.333 0.1263 ± 0.0012 0.2530 ± 0.0020 0.1118 ± 0.0011 0.2310 ± 0.0020 0.1134 ± 0.0011 0.2266 ± 0.0020 0.1271 ± 0.0011 0.2336 ± 0.0019 0.1457 ± 0.0012 0.2491 ± 0.0018 0.1670 ± 0.0013 0.2685 ± 0.0018 0.1906 ± 0.0014 0.2903 ± 0.0017 Second, the asymptotic normal approximation for the MLE of the β's may not be ready to take place for practical sizes when fitting the beta-binomial distribution to the CHAPTER 3. ANALYSIS OF BINARY DATA Table 3.4: Estimates and standard errors of β1 = log{θT /(1 − θT )} −log{θC/(1 − θC)} of Weil's data presented in Table 3.2. Upper entry: com-mon φ; lower entry: heterogeneous φ (Liang and Hanfelt, 1994).
1± s.e.
0.961 ± 0.330 0.665 ± 0.460 0.19 1.129 ± 0.464 0.32 0.02 Quasi-likelihoodVar(Y ) = (1 − θ) 1.022 ± 0.471 0.21 ×(1 + (n − 1)φ) 1.070 ± 0.470 0.46 0.05 0.961 ± 0.475 2.78 φnθ(1 − θ) 0.961 ± 0.475 4.74 data at hand. Table 3.5 shows results from a simulation in which data mimicking thestructure from Table 3.2 and following beta-binomial distributions with the true θ's andφ's being the MLE's from the last row of Table 3.4 were generated 1,000 times. Eitherallowing φT and φC to be different in the analysis or not, the MLE approach leads toundesirable 95% coverage probability that is far from being symmetric as would havebeen the case if the normal approximation to the distribution of MLE's were ready totake place.
Finally, for a litter of size n in which Y = Y1 + . . + Yn and Y−j defined as (Y1, . . , Yj−1, Yj+1, . . , Yn), one has j = 1 Y−j ) Pr(Yj = 0 Y−j) γ + n − if Y is assumed to follow a beta-binomial distribution indexed by α and γ. In otherwords, the odds for Yj to be affected, i.e. Yj = 1, depends on Y−j, the responses from the rest of the litter, only through , the total number of affected. While appealing mathematically, this expression in (3.2) raises a concern regarding the biologic plausibilityof this distribution as the litter size varies. It seems more intuitive that the effect of Y−j, if any, on Yj would be through its proportion, i.e.
/(n − 1), rather than through An Alternative Inferential Procedure
The beta-binomial distribution is an excellent example in which (1) the probability dis-tribution is indexed by parameters of interest and nuisance parameters, (2) the nuisanceparameters have performed impacts on inference for parameter of interest and (3) yetthe conditional score function approach in §2.4 is not applicable (= y in this case).
Since the main interest of teratological experiments can be cast through regression mod-elling on the first moment of the Y 's, one alternative is to consider the quasi-likelihood 3.2. TERATOLOGICAL EXPERIMENTS Table 3.5: Simulation results for β1 = log{θT /(1 − θT )} from 1,000 repli-cations mimicking the data structure of Weil's. The true β1 is 1.129 andφT = 0.317 and φC = 0.021 (Liang and Hanfelt, 1994).
95% converge prob.
φT = φC Quasi-likelihood1. Var(Y ) = (1 − θ) ×(1 + (n − 1)φ) φT = φC 2. Var(Y ) = φnθ(1 − θ)φT = φC appproach detailed in §2.3. Specifically, one may consider (A) (Variance specification). The variance of Yi has the form of Var(Yi) = niθ(1 − θ)(1 + (ni − 1)φ), or φniθ(1 − θ). (B) (Systematic component). The mean of Yi/ni is modeled as logit θi = β0 + β1xi, i = 1, . . , m. Note that the first variance expression corresponds to that induced by the beta-binomialdistribution whereas the second one is induced from the exponential-dispersion family. Asdiscussed in §2.4.3, while the quasi-score function does depend on φ, yet the impact of φare minimal at least from the theoretical viewpoint. Results below provide some empiricalevidence of the above assertion. Table 3.4 shows for the Weil data that the estimators andthe corresponding s.e.'s from the quasi-likelihood approach are very similar to each otherregardless of the choice of the variance functions and whether φT and φC are assumed tobe the same or not. This observation is confirmed by the simulated study presented inTable 3.5, which shows that this approach, unlike the beta-binomial likelihood approach,leads to excellent coverage probabilities, small bias and excellent agreement between thesample variance and the averaged estimated variance of ˆ β1. More extensive simulations which mimic the data structures that are common in practice can be found in Liang andHanfelt (1994).
In summary, we review in this section, two different approaches for analyzing data from teratological experiments which confront the non-trivial litter effects. Both theo- CHAPTER 3. ANALYSIS OF BINARY DATA retical arguments and empirical results support the use of the quasi-likelihood methodas a sensible alternative to the conventional beta-binomial approach.
Further Relaxation of Binomial Assumptions
Recall in §3.1.2 we discussed two components that constitute the binomial variate Y =Y1 + . . + Yn, namely, the statistical independence and identical distribution in the Yj's,j = 1, . . , n. In §3.2 we addressed ways to relax the first assumption of statistical inde-pendence. In this section, we review some recent developments for attempts to furtherrelax the second assumption of identical distributions. This work may be motivated infamily studies where responses from n related individuals are not only correlated witheach other due to shared genes and environment, but have different risks of disease dueto different make up in risk factor profiles consisting of, for example, age, sex, etc. Draw-backs of these approaches are highlighted so as to provide further motivation for moremodern developments for modelling correlated data to be presented in Chapter 5.
Let xj be the p × 1 vector of covariates for the jth component of a cluster of size n.
Building on the expression in (3.2), Rosner (1984) proposes to consider for j = 1, . . , n j = 1 Y−j , x1, . . , xn) Pr(Yj = 0 Y−j, x1, . . , xn) y) log(γ + n − y − 1) + βtxj. It models explicitly the dependence of the log odds for Yj to be one on Y−j and x1, . . , xn =j  and xj for each j = 1, . . , n. Using the Hammersley-Clifford Theorem (Besag, 1974), this specification in (3.3) does lead to a legitimate joint distribution forY = (Y1, . . , Yn) given (x1, . . , xn), namely, 1 = y1, . . , Yn = yn x1, . . , xn) = sbn−s exp  1 = . . = Yn = 0 x1, . . , xn) (α + ), s ≥ 1, (γ + ), s ≥ 1, b0 = 1, As mentioned in §3.2.1, the effect of Y−j on Yj, as modeled in(3.3), depends on =j , which is a function of the cluster size n. More seriously, the effect of xj on Yj, as 3.3. FURTHER RELAXATION OF BINOMIAL ASSUMPTIONS measured by β, is likely to be diminished as xj is competing directly against Y−j inexplaining the variation in Yj. Furthermore, the random effect motivation conveyed inthe original beta-binomial deviation (i.e. π ∼ Beta(α, γ)) seems to disappear when onetries to incorporate covariate effects for each component j.
Approaching the same issue from a totally different angle, Bahadur (1961) proposed thefollowing join distribution for (Y1, . . , Yn) given (x1, . . , xn) Pr(Y1 = y1, . . , Yn = yn x1, . . , xn) j = yj xj ) ρjkzjzk + + . . + ρ , µj = Pr(Yj = 1 xj), j = 1, . . , n. µj(1 − µj) This is termed "additive model" in that the ratio of Pr(Yj = yj, j = 1, . . , n x1, . . , xn) n Pr(Y j = yj xj ) is additive in the products of the yj's. As a special case, Kupper and Haseman (1978)consider Pr(Yj = yj, j = 1, . . , n x1, . . , xn) Pr(Yj = yj xj)(1 + ρ where ρ is the common correlation coefficient for each pair of the yj's and µj ≡ µ forall j = 1, . . , n. A major drawback of this model, as acknowledged by Kupper andHaseman (1978), is that there is no guarantee that the quantities on the right hand sideof (3.5) are non-negative, a necessary condition for probability functions. Furthermore,as pointed out in §2.4, the range of ρ is constrained by the µj's and as in the beta-binomial distribution, the µj's and ρ are likely to be highly intertwined with each other.
These concerns have limited the usage of this approach.
Similar in spirit to the Rosner model, this third approach considers for j = 1, . . , n j = 1 Y−j , x1, . . , xn) = α + γ j + βtxj . j = 0 Y−j , x1, . . , xn) CHAPTER 3. ANALYSIS OF BINARY DATA Again, utilizing the Hammersley-Clifford theorem, this leads to (Altham, 1978; Connollyand Liang, 1988) Pr(Yj = yj, j = 1, . . , n x1, . . , xn) Pr(Yj = 0, j = 1, . . , n x1, . . , xn) j + γ yj − 1) + βt Due to the similar representations in (3.6), this model is subject to the same drawbacksinherited by the Rosner model discussed earlier.
In summary, we have reviewed in this section three different approaches to extend the binomial distribution allowing statistical dependence and different risks among the Yj'sfrom the same cluster. Drawbacks of these approaches are highlighted. In this regard,the treatment of this chapter may be viewed as a preview of more modern developmentsfor modelling correlated data to be presented in Chapter 5.
Chapter 4
Analysis of Polytomous and
Count Data

In biomedical research, categorical response with three or more possible categories, knownas polytomous response, is also common in practice. Examples of this kind includedegree of injury (mild, modest and severe), different cell types of lung cancer (e.g.,small, squamous, large, etc.) and income level (low, medium and high). Just as inbinary response, researchers are interested in identifying factors that are related to thepolytomous response for prevention, intervention and prediction purposes. The firstpart of this chapter reviews a variety of statistical models that have been proposed inthe literature to address the above issue. We discuss through several real examples thepros and cons of each approach with the emphasis on the interpretations of regressioncoefficients induced by the models.
In social science research, data are frequently expressed in contingency tables since variables observed are mostly categorical. The scientific interest is often on how thesevariables are related to each other. For that purpose, log-linear models (e.g. Bishop,Fienberg and Holland, 1974) have been developed since the 1950's to address the issueof patterns of association among variables collected from the same subjects. Attemptshave also been made in the past decade or two to utilize this approach to help analyzethe types of correlated data introduced in Chapter one. The second part of this chapteris devoted to critically reviewing the pros and especially the cons of log-linear modelswith emphasis on the interpretations of coefficients implied by the models. One usefulreference for this topic is Liang, Zeger and Qaqish (1992) and the discussion therein.
Types of Polytomous Data
Examining examples of polytomous responses listed earlier, one notes that they can beclassified into at least three different types in terms of scale which require different at-tention for modelling. The first type is in a nominal scale. Here categories are regardedas exchangeable such as cell types of lung cancer, blood type or hair color (red, black,brown, etc.). The second type is in a ordinal scale where categories are ordered like theordinal numbers (first, second, third, etc.). Examples include degree of injury, perception CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA of food quality and for that matter hair color (light, medium, dark, etc.). It is importantto point out for ordinal responses that while ordered, the notion of "distance" or "spac-ing" between categories is arbitrary and in some instances, irrelevant. Furthermore, inmany instances, choice and definition of categories are either arbitrary or subjective. Animportant implication of this observation is that any sensible analytical procedure shouldhave the desired property that same conclusions be drawn if new categories are formed bycollapsing adjacent categories of the original scales. As an illustrative example, considerthe following data from a 3 × 3 table relating the height (short, medium or tall) of thehusbands to that of the wives. The simple chi-squared test gives rise to a value of 2.9which provides little evidence of association between husbands and wives heights. Thesame conclusion is drawn if one either collapses the "short" and "medium" categoriestogether (χ21 = 1.45) or collapses instead the "medium" and "tall" categories together(χ21 = 0.90).
The third type of polytomous response is the interval scale in which scores or numericallabels are frequently attached. Examples of this kind include salary, income, age etc. inwhich the scores may be the midpoints of the intervals defined according to the originalscales. In this situation, it is common that differences between scores from differentcategories be interpreted as a measure of separation of the categories.
In the next section, we discuss the pros and cons of three commonly used regression models for polytomous responses of either nominal or ordinal scales. Throughout we willuse the following two examples for illustration.
Example 4.1 (Disturbed dream). The data presented below is taken from Maxwell (1961) on the association between degree of suffering from disturbed dreams and ageamong 223 boys age 5 to 15. It is clear that the response variable, severity of disturbeddreams, is ordinal while the explanatory variable aged is an interval scale.
Degree of suffering from Example 4.2 (Tonsil size). The data below is from Holmes and Williams (1954) who classify 1398 children aged 1-15 years according to their relative tonsil size and whether 4.2. REGRESSION MODELS FOR POLYTOMOUS RESPONSES or not they were carriers of streptococcus pyogenes (SP). The response variable here,tonsil size, is also ordinal and the interest is on assessing the nature and direction of thepossible effect of SP on the enlargement of tonsil size.
Regression Models for Polytomous Responses
Consider a response variable, Y , which has C + 1 possible categories, i.e. Y = 0, 1, 2, . .
or C with C being a positive integer. The interest is on the relationship between Y andx, a p × 1 vector of covariates.
Polytomous Logistic Regression Models
This model represents a straightforward extension of the logistic regression model forbinary response by contrasting the odds for Y to be j = 1, . . , C instead of 0, i.e.
Pr(Y = j x) Pr(Y = 0 x) j + βt j = 1, . . , C. It can easily be seen that the distribution function for Y given x is fully specified by theC logistic regression models above, namely, Pr(Y = j x) = , j = 0, 1, . . , C. with α0 = 0 and β0 = 0, a vector of p zeros. Here βj characterizes the effect of x onthe log odds for being the jth category instead of the category zero, j = 1, . . , C. It isinteresting to point out that for a pair of categories, 0 < j < , the log odds for Y to be instead of j is fully determined by (4.1) as Pr(Y = x) Pr(Y = j x)  − αj ) + (β − βj )tx, which is linear in x as well. The simple equality of βj = β −(β −βj), results of (4.1) and(4.2), explains why subjects whose Y values are different than 0 and j are informativeabout βj. This model is particularly useful for nominal response where categories areconsidered exchangeable. Finally, assuming that Yi follows a multinomial distribution,likelihood inference for αj's and βj's can be carried out and is available in SAS (PROCCATMOD).
Proportional Odds Models
For ordinal responses, McCullagh (1980) proposes the use of proportional odds modelswhich can also be expressed in C logistic regression models: j (x) j + βtx, j = 1, . . , C, j (x) CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA where γj(x) = Pr(Y ≥ j x). This model focuses on the modelling of γj(x),the cumulativeprobability of Y to be in the jth category or higher given x, which is meaningful onlyif Y is ordinal. It assumes that the effect of x on those C log odds is the same for allj = 1, . . , C, an assumption which can be checked empirically. An important featureof this model is that there is no need to assign scores to the C + 1 categories of theresponse variable. Instead, the ordinal feature of Y is incorporated (or acknowledged) by(1) modelling through γj and (2) assuming that the effect of x on log{γj(x)/(1 − γj(x)}is the same, or at least is in the same direction. Furthermore, the interpretation of βis well preserved: the change in log odds for suffering more than less (Example 4.1) perunit change in x. Finally, this model has been implemented in SAS as well (PRODLOGISTIC).
Returning to the disturbed dream example, Table 4.1 shows the MLEs of the αj's and βj's when both the polytomous logistic regression model and the proportional oddsmodel are entertained with a single continuous x: age in years (midpoint of each ageinterval). Results from the proportional odds model fitting suggest a strong negativerelationship between suffering from disturbed dreams and age among 5-15 year old boysin that the odds of suffering from disturbed dreams decreases by a factor of 0.8(= e0.229)per year. Results from fitting the polytomous logistic regression model, which ignores theordinal feature of the response variable, lead to the similar conclusion qualitatively. Wenote, however, that the standard error estimate of ˆ β(0.052) is, as expected, much smaller than that of the three ˆ βj's (0.078, 0.078 and 0.081, respectively). This example serves to illustrate that proper modelling through acknowledging the ordinal feature of theresponse variable could lead to a substantial gain in efficiency for regression coefficientsof interest.
Table 4.1: Regression estimates (± s.e.) based on both proportional oddsmodels and polytomous logistic regression models for the disturbed dreamexample (Maxwell, 1961).
Proportional odds Polytomous logistic 6.402 ± 1.403 6.271 ± 2.108 5.586 ± 1.387 4.784 ± 2.115 4.571 ± 1.377 7.792 ± 2.169 0.263 ± 0.078 0.208 ± 0.078 0.323 ± 0.081 0.229 ± 0.052 Continuation Ratio Models
There are situations where the response variable is "ordered" in a hierarchical (or nested)fashion. Examples include 4.2. REGRESSION MODELS FOR POLYTOMOUS RESPONSES Example 4.3 (Teratological experiment). Chen et al. (1991) present data from a ter- atological experiment in which each litter was randomized into one of three levels ofexposure (low, medium and high) to hydroxyures. Here fetuses from a litter could bedead or alive at birth, and if alive, they could be malformed or normal after a period oftime; see the diagram below. Here the relationship between the proportions of fetusesthat are dead and the dosage level allow investigators to examine the embryo-toxic effectof the targeted substance; whereas the proportions of alive fetuses that are malformedcan be used to address the structural developmental effect of the substance. Here theresponse variable has three categories with a natural order of death, alive but malformedand alive and normal, a decreasing order of health severity. It is hierarchical (or nested)in that the issue concerning the risk for being malformed is only meaningful to thosefetuses who are alive at birth.
Example 4.4 (Insemination experiment). McCullagh and Nelder (1989, p.62) describe an experiment in which milch cows receive treatments to induce pregnancy in such away that each cow will continue the experiments until pregnant as shown below. Herethe response variable represents a series of 1/0 response indicating whether the cow ispregnant or not. However, this response variable is hierarchical (or nested) in that thejth response would be recalled only if the cow was not pregnant in the previous j − 1experiments.
Example 4.5 (Tonsil enlargement). Returning to the tonsil size example (Example 4.2), one might argue that the response variable can be expressed in a way that issimilar to that in Example 4.3. Here each subject may be first classified as enlargedor not regarding the tonsil size; and for those experiencing enlargement, they can befurthered classified as either greatly enlarged or slightly enlarged as shown below. Herethe investigators are in the position to address two related but different questions, the firstbeing whether carrying SP is associated with tonsil enlargement, and the second beingwhether carrying SP is associated with serious enlargement given tonsil enlargement hastaken place.
Diagram 1 (Teratological experiment)
CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA Diagram 2 (Insemination experiment)
Diagram 3 (Tonsil enlargement)
slightly enlarged For ordinal response of this kind, it is clear that proportional odds models may not be appropriate. Instead we consider the continuation ratio model by Mason and Fienberg(1985) logit (πj(x)/γj(x)) = αj + βtjx, j = 0, 1, . . , C − 1, where πj(x) = Pr(Y = j x) and γj(x) = Pr(Y ≥ j x). Like the other two models,this model is represented by C logistic regression models. Here πj(x)/γj(x) can bere-expressed as Pr(Y = j Y ≥ j, x), the conditional probability for Y to be j giventhat Y is either in the jth category or higher. It is rather clear that this conditionalprobability is more meaningful than γj(x) itself in reflecting the scientific question ofinterest. Note that in (4.4) we have allowed the effect of x on πj(x)/γj(x) to be differentfor different j. This would be a sensible approach for Examples 4.3 and 4.5 as, forinstance, there is no scientific reason to expect that the embryo-toxic effect to be the sameas the structural developmental effect. On the other hand, it may be equally sensible 4.3. LOG-LINEAR MODELS FOR CONTINGENCY TABLES to allow the βj's to be the same in Example 4.4, which characterizes the effect of x onthe probability for being pregnant allowing the probability itself (i.e. the αj's) to varyacross experiments. Nevertheless, this assumption of common β can always be examinedstatistically. Returning to the tonsil size example, Table 4.2 gives the MLEs and theirs.e. when both the proportional odds model and the continuation ratio model were fittedto the data shown earlier. Here the sole covariate is the SP carrier status (x = 1 for beingcarrier and 0 if not). Both models lead to conclusions of strong evidence of associationbetween SP carrier and tonsil enlargement, even though the regression coefficients fromthe two models have different interpretations. Results from the proportional odds modelsuggest that the odds for having more serious tonsil size enlargement is 83% (= e0.603 1) higher for the carrier than the non-carriers. On the other hand, results from thecontinuation ratio model indicate that the odds for having tonsil enlargement is 67%(= e0.514 1) higher for the carriers than the non-carriers; whereas the odds for havingtonsil size greatly enlarged among those with tonsil enlargement is 72% (= e0.543 1)higher for the carriers than the non-carriers.
Table 4.2: Regression estimates (± s.e.) based on proportional odds mod-els and continuation ratio models for the tonsil size example (Holmes andWilliams, 1954).
Proportional odds Continuation ratio 0.509 ± 0.056 0.512 ± 0.051 1.363 ± 0.067 1.245 ± 0.093 0.514 ± 0.273 0.543 ± 0.286 0.603 ± 0.227 Log-Linear Models for Contingency Tables
Log-linear models represent a popular approach to analyze data that can be summarizedin contingency tables. One important feature of this approach is that there is no needto specify response variables. This is especially useful in social science research withthe aim to examine the relationship (or the association) among variables considered.
More recently, this log-linear model approach has been advocated by some researchers todeal with correlation data. Examples include cross-over trials where patients receive, bydesign, different treatments over several periods (Jones and Kenward, 1987) and geneticstudies on detection of maternal effects of disease genes (Weinberg, Wilcox and Lie,1998). In this section, we briefly review the work on log-linear models with the emphasison interpretations of specified parameters. It serves to point out the potential for drawingincorrect conclusions when using log-linear models to analyze contingency tables data ordiscrete correlated data.
CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA Interpretations of Log-Linear Model Parameters
We start with a simple I × J contingency table in which Yij represents the number ofindividuals who are in the ith(jth) category of variable 1 (variable 2), i = 1, . . , I, j =1, . . , J. Let Pij be the corresponding probability for an individual to be in the (i, j) cellof the I × J table. With m being the sample size, one has immediately µij = E(Yij) =mPij. It is intuitive that two categorical variables of I and J categories, respectively, arenot associated with each other (or statistically independent) if Pij = Pi+P+j, where Pi+ and P+j are the marginal probabilities, e.g. Pi+ is the probability that anindividual falls into the ith category of the first variable. The expression in (4.5) leads to log n + log Pi + log P+j µ + αi + βj with α1 = 0 = β1 to avoid over-parameterizations. Thus, one can test the assumptionof statistical independence between the two variables by computing the deviance whichcontrasts this no association (or null) model with the saturated model where all theI × J Pij's are unspecified. When this null model is rejected, a natural question, froma modelling viewpoint, is how one may modify (4.6) to acknowledge (or to characterize)the dependence (or association) between two variables. One obvious approach is to addto (4.6) additional parameters (αβ)ij, i.e.
log µij = µ + αi + βj + (αβ)ij with α1 = β1 = 0 = (αβ)1j = (αβ)i1, for all i = 1, . . , I, j = 1, . . , J. The followingrepresentation clearly qualifies the (I − 1) × (J − 1)(αβ)ij's as "association parameters,"namely, for i = 1, . . , I and j = 2, . . , J , is the log odds ratio contrasting the (i, j) cell versus the (1, 1) cell; see the 2 × 2 tablebelow An important question is how to incorporate, under the framework of log-linear models,the additional information that (some of) the variables are ordinal? As an illustration,the table below gives the 12 empirical log odds ratios from the disturbed dream example.
It shows the pattern that the estimated log odds ratios decrease in age for each level ofsuffering and decrease in degree of suffering from disturbed dream for each age group.
This suggests that one approach is to assign scores, si, i = 1, . . , I, tj, j = 1, . . , J toboth variables and model (αβ)ij, the log odds ratios, accordingly, e.g.
(αβ)ij = sitjβ. 4.3. LOG-LINEAR MODELS FOR CONTINGENCY TABLES Degree of suffering from disturbed dreamNot severe Returning to the disturbed dream example, we have fitted the above model i.e. (4.8),to the 5 × 4 contingency table with si = mid-age of the ith age group, i = 1, . . , 5 andtj = j, j = 1, . . , 4. The MLE for β gives 0.205 with s.e. 0.050 which suggests thatthe odds for suffering from disturbed dream with higher unit decreases by a factor of0.81(= e0.205) per year for boys between 5 and 15 years of age. It is interesting tocontrast this approach with the proportional odds model approach in §4.2.2; see Table4.1. Both approaches lead to the same conclusion that there is a convincing negativeassociation between age and degree of suffering from disturbed dreams. One importantdistinction between these two modelling approaches is on the assumptions being madefor the variables and the interpretability of the primary parameter of interest, β in (4.3)and β in (4.8). While both assigning scores to the age variable, which is the intervalscale, the log-linear model approach makes the further assumption by assigning scores tothe other variable, which is considered as a response variable for the proportional oddsmodel approach. One argument against this additional assumption is that it is ratherarbitrary. Specifically, by assuming tj = j, j = 1, . . , 4, one is making the assumptionthat, for example, the "difference" in degree suffering from disturbed dreams between thethird and first categories is twice the "difference" between the second and first categories.
This arbitrariness makes the β coefficient in (4.8) difficult to interpret. The β coefficientinduced by the proportional odds model, on the other hand, does not rely on the assigningof scores for the response variable and is easy to interpret. We note that the deviancesfor both models with 11 degrees of freedom, are very comparable to each other (12.42for the proportional odds model and 14.08 for the log-linear model). This example alsoserves to illustrate that goodness-of-fit measures such as the deviance are secondary (orless relevant) in choosing between models compared to the interpretability of primaryparameters induced by the models.
Moving on to more practical situations of I × J × K contingency tables formed by three categorical variables of I, J and K levels, respectively, let Yijk(µijk) be thenumber (expected number) of individuals who are in the ith, jth and kth category ofvariables 1, 2 and 3, respectively, i = 1, . . , I, j = 1, . . , J , and k = 1, . . , K. Again,the primary interest is on how these three variables are associated with each other. Forthree-dimensional or higher-dimensional contingency tables, more log-linear models areavailable in additional to the obvious null model log µijk = µ + αi + βj + γk, α1 = β1 = γ1 = 0 and the saturated model log µijk = µ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk + (αβγ)ijk. Some interesting intermediate models include (A) log µijk = µ + αi + βj + γk + (αβ)ij, CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA (B) log µijk = µ + αi + βj + γk + (αβ)ij + (αγ)ik, (C) log µijk = µ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk.
To better understand the interpretation of parameters, in particular (αβ)ij, specifiedabove, we consider the following log odds ratio log ijkP11k ≡ log OR which is the log odds ratio relating variables 1 and 2 among (conditional on) thoseindividuals who are in the kth category of variable 3. Depending on models chosen,log ORij k has different expressions and hence different interpretations. For the nullmodel, which implies no association among three variables, log ORij k is equal to zero asexpected. For model (C), this log odds ratio reduces to (αβ)ij, which is the constant logodds ratio common to all K categories of variable 3. On the other hand, if the saturatedmodel is entertained, log ORij k amounts to log odds ratio only for those who are in thekth category of variable 3.
In summary, log-linear models have the following two important features that are worthy of noting. First, interpretations of lower order terms (e.g. (αβ)ij) depend on thepresence or absence of higher order terms (e.g. (αβγ)ijk). More importantly, the two-wayassociation parameters (e.g. (αβ)ij) have the "conditional" log odds ratio interpretations,i.e. log OR relating two variables adjusting for other variables. One important implicaitonof the second feature stated above is the following: when more than one variable isconsidered as response variables, is it sensible to adjust for other response variables ifa primary question of interest is the association between a particular response and anindependent variable? We will return to this in the next chapter.
An Alternative Representation of Log-Linear Models
As mentioned at the onset of §4.3, there has been some attempts recently to adoptlog-linear models as a means to model correlated data. At first glance, this may notseem obvious as log-linear models are applied to aggregated data whereas the interestfor the latter is to model the joint distribution of correlated data from a cluster. Themain purpose of this subsection is to show briefly how to bridge the gap betwen thesetwo seemingly unrelated tasks. The other purpose is through this new representationto reinforce some potential concerns regarding interpretations of regression coefficientsinduced by log-linear models, which were mentioned in §4.3.1.
Recall that what one observes for each subject in the context of contingency tables can be expressed as Y = (Y1, . . , Yj, . . , Yn)t, where n is the number of categorical variables under investigation and Yj is a categoricalvariable of Cj levels, j = 1, . . , n. In the disturbed dream example, n = 2 and C1 = 5and C2 = 4 if Y1 and Y2 represent age and degree suffering from disturbed dreams. Forsimplicity, we assume n = 3 and Cj = 2 for all j = 1, 2, 3 so that Yj = 1 or 0 dependingon whether the subject falls in the second or the first category, respectively. Using thesame notations as in §4.3.1, we now consider a joint distribution for Y = (Y1, Y2, Y3)t 4.3. LOG-LINEAR MODELS FOR CONTINGENCY TABLES Pr(Yj = yj, j = 1, 2, 3) 2y1 + β2y2 + γ2y3 + (αβ)22y1y2 + (αγ)22y2y3 j = 0, j = 1, 2, 3) It can be easily shown (e.g. Liang and Zeger, 1989) that the model in (4.9) is equivalentto the saturated log-linear model for a 2 × 2 × 2 contingency table where Yijk = # of subjects in which (Y1 = i − 1, Y2 = j − 1, Y3 = k − 1) i, j, k = 1, 2. The representation in (4.9) provides an alternative way to interpret the parameters spec-ified under log-linear models by considering for example 1 = 1 y2, y3) = α 2 + (αβ)22y2 + (αγ)22y3 + (αβγ)222y2y3, 1 = 0 y2, y3) the log odds for the first categorical variable, Y1, to be one as a function of Y2 and Y3, theother two variables. Thus, (αβ)22 may be interpreted as the effect, measured by log odds,of Y2 on Y1 adjusting for Y3. In other words, (αβ)22 characterizes the association betweenY1 and Y2 conditional on the status of the third variable, Y3 in this case. Similarly, onecan derive through (4.9) 2 = 1 y1, y3) 2 + (αβ)22y1 + (βγ)22y3 + (αβγ)222y1y3, 2 = 0 y1, y3) 3 = 1 y1, y2) 2 + (αγ)22y1 + (βγ)22y2 + (αβγ)222y1y2. 3 = 0 y1, y2) Representations above serve to illuminate the concerns of using log-linear models tomodel longitudinal data, an important special case of correlated data. In this setting, Yjrepresents for an individual the binary observation measured at the jth visit, j = 1, 2, 3.
An obvious question is whether it is sensible to model as shown in (4.10) the dependenceof the first observation, Y1, on observations measured subsequently, Y2 and Y3 in thiscase. Another drawback of this modelling is that it is not reproducible (e.g. Liang, Zegerand Qaqish, 1992) in that the joint distribution of any subset of Y = (Y1, Y2, Y3) do nothave the same representations as (4.9) for Y itself. This is particularly troublesome forcorrelated data of varying cluster sizes.
Statistical Modellings for
Correlated Data

This chapter is devoted to discussions on statistical modellings for correlated data thathave been developed recently. Recall in Chapter 1 we pointed out through examples thatscientific objectives for correlated data can often be characterized statistically throughregression to (1) assess the relationship between response variables and explanatory vari-ables and (2) examine and describe the degrees and patterns of within-cluster association.
We discuss in this chapter three different approaches: marginal models, random effectsmodels and observation driven models. These models are illustrated through two exam-ples presented in Chapter 1: The Baltimore Eye Survey (Example 1.1) and the pre-posttrial for schizophrenics (Example 1.4). These three models are contrasted with the focuson their distinct interpretations of regression coefficients. To begin with, we recall thatcorrelated data can often be expressed as follows: Let Yi = (Yi1, Yij, . . , Yin )t be the ni observations from the ith cluster, i = 1, . . , m, m being the total number of clusters.
For the jth component, j = 1, . . , ni, we assume the availability of xij, a p × 1 vectorof covariates, that may be associated with Yij. For the Baltimore Eye Survey Example,m = 5, 199, ni = 2 for all subjects and Yi1(Yi2) represents the visual impairment sta-tus (1: impaired; 0: normal) of the left (right) eye of the ith subject. In this examplexi1 = xi2 = xi, i.e. all the covariates are the same for both eyes including age, race andpossibly interactions between the two demographic variables. The primary objective ofthis study is to identify demographic variables that are associated with the risk of visualimpairment. The secondary objective is to examine the degree of fellow eye associationand its possible dependence on demographic variables as well. As for the pre-post trial,m = 170 and the ni's range from 2 to 6 as some of the patients dropped out duringthe course of follow-up. The primary response variable is PANSS measuring the severityof schizophrenia; whereas the main explanatory variables include the treatment status(xi = 1 if receiving rispirodine of 6mg and 0 if receiving haloperidol of 20 mg), time (inweeks) from the baseline, tj(0, 1, 2, 4, 6 and 8) and the interaction between xi and tj.
The main objective of this clinical trial is to assess the efficacy of the new treatment forschizophrenia, rispirodine, as measured by the rate of decline in PANSS.
CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA This approach may be viewed as a multivariate analogue of the quasi-likelihood methodin §2.3: (A) Systematic component h(µij) = xt β, µ ij = E(Yij xij ).
(B) Variance specification Var(Yij) = V (µij; φ), where φ is the over-dispersion param- (C) Covariance specification Cov(Yij, Yik) = C(µij, µik; α), j < k = 1, . . , ni, where C is a known function indexed by association parameters α.
For the Baltimore eye survey example, one may consider logit Pr(Yij = 1) = β0 + β1 Racei + β2 (Age 60) + β3 (Agei +β4 Racei  (Age 60) + β5 Racei  (Age 60)2, (B) Var(Yij) = µij(1 − µij), (C) log OR(Yi1, Yi2) = α0 + α1 Racei + α2 Age.
Several remarks are worth noting. First, implicitly one assumes in (A) of (5.1) that theeffects of a age and race on the risk of visual impairment are the same for both eyes.
While one can test this assumption formally, expert opinion on this is that commonetiology for both eyes is warranted. An interpretation of eβ1 , the coefficient for the Racevariable, is that it represents the odds of visual impairment for any eye of blacks of 60years old relative to the odds for whites of the same age. Finally, recall in §1.4 ourpreference of using odds ratio as a measure of association for categorical variables. Thespecification (C) along with (A) and (B) lead to a full specification of Cov(Yi1, Yi2). Forthe pre-post trial example, one could, for example, consider (A) µij = β0 + β1tj + β2xi + β3xi  tj, (B) Var(Yij) = φ, (C) Cov(Yij, Yik) = φ e tj−tk .
Here the AR-1 correlated structure for the repeated observations from the same patient isimposed as shown in (C). More importantly, the interaction term xi tj, in (A) postulatesthat the difference in averaged PANSS between the treated (risperidone) and control(haloperidol) groups varies linearly in time as characterized by β3. More specifically, β3represents the change of averaged difference in PANSS between two groups per unit (inweeks) change as time progresses.
Finally, some general remarks on marginal models (Liang and Zeger, 1993). First, the β coefficients specified in (A) describe the effects of covariates on the "marginal"expectation of the response variable. For the eye survey example, this marginal expec-tation amounts to the averaged Y for a given eye irrespective of the outcome statusof the fellow eye. For the pre-post trial example, this marginal expectation means the 5.2. RANDOM EFFECTS MODEL averaged Y at a given visit irrespective of the PANSS scores at other visits. Thus βhas the so-called "population averaged" interpretation by contrasting the averaged Y 'sfrom different subpopulations formed by different values in x (Zeger, Liang and Albert,1988). This approach, from this viewpoint, is particularly relevant for studies of publichealth significance. Furthermore, the interpretation of β is the same regardless of thecluster size, ni, and of magnitude of within-cluster association. Finally, just as in quasi-likelihood, the joint distribution of Yi = (Yi1, . . , Yin )t is not fully specified by (A), (B) and (C) and this issue will be addressed in the next chapter.
Random Effects Model
Unlike the marginal model approach which addresses issues of the regression of Y on xand within-cluster association through two separate sets of regression models, randomeffects models and observation driven models intend to address these two issues simulta-neously through a single set of regression models. The essence of random effects models,however, is the introduction of cluster-specific regression coefficients relating Y to x toreflect natural heterogeneity among clusters caused by unmeasured factors. This notionof heterogeneity among clusters, called "random effects" translates into within cluster"likeness" or dependence. Specifically, given bi, random effects specific to the ith cluster,i = 1, . . , m, (A) Yi1, . . , Yij, . . , Yin are statistically independent of each other.
(B) Each Yij follows a GLM with h(E(Yij bi)) = xtijβ + ztijbi, where zij, a q × 1 vector, is generally a subset of xij.
(C) The bi's are assumed to be independent and generated from a common distribution, F (·; δ), with mean zero and indexed by parameters, δ.
For the pre-post trial example, patients vary tremendously in baseline PANSS and inchange of PANSS over time. To acknowledge this variation statistically, one may consider (B) Yij = (β 1 +bi1)tj +β 2 xi +β 3 xi tj + ij , where given bi = (bi0, bi1)t, ij N (0, φ), j = 1, . . , ni, (C) bi follows a bivariate normal distribution with mean (0, 0)t and covariance matrix In this model, β0 and β0+β2 represent the averaged baseline PANSS for patients receivinghaloperidol and risperidone respectively, whereas β1 and β1 +β3 represent averaged ratesof change in PANSS for the two respective groups. Thus negative β1 and β3 would suggestthat receiving risperidone would lead to a greater decline in PANSS over time thanhaloperidol does. Furthermore, bi0 and bi1, i = 1, . . , m, are introduced to account forvariation across patients discussed earlier. The larger the δ00(δ11), the greater variations CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA across subjects in PANSS scores (in rates of change in PANSS). Finally, Var(Yij) =φ + δ00 + 2δ01tj + δ11t2 and Cov(Yij, Yik) = δ00 + δ01(tj + tk) + δ11tjtk, j < k, a result of the introduction of random effects, the bi's. For the eye survey example, onemay consider (B) Yij bi ∼ Binomial(1, µ ), where µ = E(Y ij bi) and (β0 + bi) + β1 Racei + β2 (Agei 60) + β3(Agei 4 Racei  (Agei 60) + β5 Raci  (Agei (C) bi ∼ N (0, δ).
This logistic regression model in (B) is very similar to that under marginal modelswith the exception of the additional individual-specific parameter, bi, which is includedto account for variation among subjects in risks for visual impairment that are notexplained by age, race variables. Indeed, with random intercepts, Zeger, Liang andAlbert (1988) have shown that there is a simple mathematical relationship between βand β in marginal models, namely, logit E(Yij) = logit E(E(Yij bi)) = xtijβ/(c2δ200 + 1)1/2, where 16 3/(15π). Thus, approximately = (c2δ00 + 1) 2 β. We will return to the comparison between these two models in §5.4. Some generalcomments on random effects models. This approach for modelling correlated data isparticularly useful to characterize the "change" in Y within clusters. For example, β +bidescribes the effect of x on Y that is specific to the ith cluster (or subject). Furthermore,the empirical Bayes method (Robbins, 1956) and subsequent developments (e.g. Lairdand Ware, 1982; Zeger and Karim, 1991; Breslow and Clayton, 1993) allow investigatorsto estimate not only β but also bi as well. Thus this modelling is particularly relevantfrom a clinical viewpoint. For example, longitudinal studies would allow clinicians tocommunicate with concerned parents the drop in risk of their own children for sufferingfrom infectious disease if they change their smoking behavior (e.g. from smoking to non-smoking). Finally, given (A), (B) and (C) specified above, the probability distributionfunction for Yi as indexed by β, φ and δ is fully specified.
Observation Driven Models
As in random effects models, this approach addresses simultaneously both issues of re-gression of Y on x and within-cluster association through a set of regression models.
One important feature of the observation driven models that distinguish itself from therandom effects model is that it operates on the premise that correlation within a cluster 5.3. OBSERVATION DRIVEN MODELS arises because one response is explicitly caused by others. A typical example of this kindis the spread of the disease that is contagious within a family or a community. Define Yi,−j = (Yi1, . . , Yi,j−1, Yi,j+1, . . , Yin )t a vector of (ni − 1) × 1 Yij's excluding Yij. Observation driven models may be charac-terized as follows (A) For Hij = H(Yi,−j), a function of Yi,−j, Yij given Hij follows a GLM with h(E(Yij Hij)) = xtijβ + where fν, ν = 1, . . , q, are known, up to α, functions of Hij.
Choice of Hij depends on situations that are dealt with. For longitudinal studies, it isnatural to consider Hij = (Yi1, . . , Yi,j−1), observations prior to the jth visit. On theother hand, for family studies, in particular, of siblings, it would be more sensible toconsider Hij = Yi,j−1, the rest of the family members. If adopting this approach, onemay consider for the eye survey example (A) Yij Yi,3−j ∼ Binomial(1, µ), j = 1, 2 with logit Pr(Yij = 1 Yi,3−j, xij) Racei + β 60) + β Racei  (Agei 60) + β Racei  (Agei Note that with ni ≡ 2, this model is the same as those proposed by Rosner (1984) andConnolly and Liang (1988) discussed in §3.3. It is interesting to note that eβ the interpretation as the relative odds of visual impairment for one eye among blacks of60 years of age versus that of whites of the same age, adjusting for the visual impairmentstatus of the fellow eye. This is to be contrasted with eβ1 under marginal models inwhich no adjustment for the fellow eye status is made. If the objective of the studyis to identify risk factors, in particular demographic ones, for prevention purposes, onemay question the relevance of this observation driven approach as the effects of race,age, etc. on risk of visual impairment for one eye may be diminished due to their directcompetition against the visual status of the fellow eye in explaining Yij; see (5.4). Forthe pre-post trial example, one may consider (A) Yij Yi1, Yi,j−1 is specified as 1 tj + β 2 xi + β + α(Yi,j−1 − β − 1 tj−1 − β 2 xi − β 3 xi  tj−1) + ij , where ij ∼ N (0, φ).
This is known as a AR-1 process in which Yij depends on the past history in Yi onlythrough Yi,j−1, the immediate past observation. We will discuss the relationship amongβ, β and β for this example in §5.4.
CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA Generally speaking, this approach represents a particular way to model the within- cluster dependence mechanism, e.g., past responses cause the present one in longitudinalstudies. This direct modelling is particularly useful for prediction purposes. For example,one may be interested in predicting PANSS two months after the termination of the studyif the patient receiving risperidone had a score of 140, which is very high, at that time.
On the other hand, if the main purpose of the study is on the regression of Y on xone should be cautious about the interpretation of β which may be non-trivial andpotentially misleading especially when the cluster sizes vary. This is especially so fordiscrete response variables; see discussion in §3.2 and §4.3.
Contrasts of Three Modelling Approaches
It is clear that the three approaches presented in Sections 5.1-3 are motivated distinctlyconsidering how within-cluster association may be incorporated in conjunction with theconventional regression of Y on x. With the exception of identity link function, whichis commonly assumed for continuous responses, the three regression coefficients, β1, βand β, are likely to be different numerically and interpretation-wise. In addition, thewithin-cluster association pattern as implied by the models are in general different forthese approaches as well.
We now first examine more closely the special case of continuous responses with identity link. It can be shown mathematically that in this case β = β = β. To see this, note that random effects models specify the effects of xi on the conditionalexpectations of Yi given bi, the random effects. However, noting that E(bi) = 0 asspecified in (C) of §5.2, we have immediately that E(Yij) = E(E(Yij bi)) = E(xtijβ + ztijbi) = xtijβ, which when comparing with (A) in §5.1, leads to β = β. Similarly, if one adopts theobservation driven model, one has E(Yij) = E(E(Yij H(Yi,j−1))) E(xtijβ + αν(yi,j−ν − xti,j−νβ)) Thus, for continuous responses of identity link, the three regression coefficients relatingY to x are identical numerically. The only major difference is on how the within-clusterassociation may be modeled. For example, a random effects model with intercepts as theonly random effects leads to Cov(Yij, Yik) = δ00, a constant covariance for each pair ofobservations from the same cluster. On the other hand, if an AR-1 process is assumedunder observation driven models, one has Cov(Yij, Yik) = α tj−tk which decreases in tj − tk , the "distance" (in time) between two observations from the same cluster.
The equality in β, β and β shown above is indeed an exception rather than the rule.
To elaborate on this, we assume, for simplicity, one is dealing with longitudinal studies of 5.4. CONTRASTS OF THREE MODELLING APPROACHES binary responses and logit link. In this situation, eβ for a covariate (who marginal modelscompares prevalences in subgroups formed by observations with different x values (andhence measured at different times). For random effects models of the same covariate x,eβ describes changes in personal risk due to change in x over time. On the other hand,eβ in observation driven models characterizes changes in incidence for individuals withthe same history in Y but differ in x measured at the same time. It has been shown inthis situation (Zeger, Liang and Albert, 1988) that β is smaller than β in magnitude.
However, one should in general reach similar conclusions for testing the hypothesis of noassociation between Y and x (i.e. β = 0 or β = 0) since β) ˆ On the other hand, by examining (A) in §5.1 and (A) in §5.3 closely, we would speculatethat β would be smaller than β in magnitude due to the fact that (1) both xij andH(Yi,−j) appear in the same regression model in explaining Yij and (2) the Yij's are ingeneral highly correlated with each other.
Statistical Inference for
Correlated Data

In Chapter 5 we introduced and contrasted three statistical modellings for correlateddata. In this chapter, we review approaches to make inference about parameters speci-fied by the models. These approaches include full likelihood methods, conditional like-lihood methods and estimating function methods. Pros and cons of each approach arehighlighted.
Recall that for marginal models only the first two joint moments of the Yi's are requiredto be specified, i.e.
(A) Systematic component h(µij) = xtβ, for all j = 1, . . , n (B) Covariance/variance specification Cov(Yij, Yik) = C(µij, µik; α, β), j ≤ k = 1, . . , ni.
The question is that given this specification in (A) and (B) how to make inference on βand α based on the data at hand. Just as in the univariate response case, we considertwo different approaches.
If the regression of Y on x is the primary interest which usually is the case in, for example,longitudinal studies, one may estimate β by solving 1(β, α) = i; β, α)(yi − µi(β)) = 0. The above equation is know as the generalized estimating equation (GEE) (Liang andZeger, 1986) which reduces to the quasi-score equation defined in §2.3 when ni = 1 for alli = 1, . . , m. Since S1 in (6.1) depends on α as well, one may iterate until convergencebetween solving S1(β, ˆ α(β)) = 0 for β and updating ˆ α(β) with the latest ˆ β solved above CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA (Liang and Zeger, 1986). This approach, which we term GEE1, is robust in the sensethat, just as in quasi-likelihood, (1) no distributional assumption on Yi = (Yi1, . . , Yin )t is needed and (2) consistent estimation of β and correct precision assessments on ˆ available even if the covariance structure for Yi, C(µij, µik; β, α), is incorrectly specified.
For the latter part, see some key references including White (1982), Gourieoux et al.
(1984), Royall (1986) and Liang and Zeger (1986). Furthermore, high efficiency relativeto the MLE of β is usually maintained even if Cov(Yi) is incorrectly specified (e.g. Liangand Zeger, 1986; Zhao, Prentice and Self, 1992). As expected, the degree of efficiency lossdepends on the magnitude of within-cluster correlation (the higher the correlation, thegreater degree of efficiency loss) and on how "well" the specified "working" covariance forYi in (6.1) approximates the true covariance structure. It is important to point out thatincorrectly assuming no association within-cluster may yield high degree of efficiency lossespecially for covariates whose values vary within clusters; see §2.4 and Zhao, Prenticeand Self (1992). More importantly, such an exercise may lead to severe biased estimationin β if the data are missing in a non-trivial manner. This point will be further illuminatedthrough the pre-post trial data analysis to be presented later.
When the within-cluster association is the primary interest, which is typically the case for family studies, one may argue that this GEE1 approach may be inefficient asfar as inference for α is concerned. This is because in GEE1, estimation of β and α istreated as if they were independent (or orthogonal) to each other. As an alternative, onemay solve for δ = (β, α) simultaneously through ∂µ∗ t 2(β, α) = i; δ, γ)(zi − µ∗ i (δ)) = 0, i ) = (Y t i1, . . , Y 2 in , Yi1Yi2, . . , Yi,n i −1 Yini µ∗i = Et(Zi; δ) = (Et(Yi); β), Et(Wi; β, α))t, and γ are parameters characterizing the third and fourth moments of the Yi's as requiredby Cov(Zi) (Liang, Zeger and Qaqish, 1992). Several remarks on this GEE2 approach.
First, with a proper choice of ˆ α(β), the GEE1 method may be viewed as a special case of GEE2 if Cov(Yi, Wi) in the "working" covariance matrix for Zi, Cov(Zi), is set to bezero. It can be seen in this situation that the first p components of S2(δ, γ) reduces toS1(β, α). Second, in some instances considered, Liang, Zeger and Qaqish (1992) foundthat the GEE2 method is more efficiency for estimating β and α than the GEE1 method,although the degree of efficiency gain is much less substantial for β than α. We willreturn to this point through an illustration using the eye survey data as an example.
This new approach, however, creates a complication that is not associated with the useof GEE1. One complication is the need to estimate γ as seen in (6.2). Given thatCov(Zi) is serving as a working covariance matrix and its correct specification is notneeded, one may compute Cov(Zi) as if Yi follows a multivariate normal distribution(Crowder, 1987; Prentice and Zhao, 1991). This resolves the issue of dependence of S2on γ since multivariate normal distributions are fully specified by the first two moments.
Another approach is to replace γ by a "guess-mates," γ∗, a fixed value. Liang, Zegerand Qaqish (1992) found that for cases considered, this approach sacrifices very littlepower loss. However, no general performance of these two approaches for dealing with 6.1. MARGINAL MODELS γ has been established and this issue is likely to be examined on the case by case basis.
Another complication is that the robust property for β enjoyed by GEE1 is no longermaintained. To see this, except when Cov(Yi, Zi) is set to be zero, it can be shown thatthe first p components of S2(δ, γ), S2(δ, γ) say, is a linear combination of yi − µi(β)and Wi − E(Wi; β, α). Thus, S2(δ, γ) is an unbiased estimating function for β only ifE(Wi; β, α) computed is correct, i.e. if Cov(Yi) is correctly specified. In other words,the consistency of ˆ β, the solution of S2(δ, γ) = 0, depends on the correct specification of Cov(Yi); see Liang and Zeger (1995) for a detailed examination of this issue for a specialcase.
Recall that specification (A) and (B) do not fully specify the joint distribution of the Yi'sas only the first two moments of the Yi's are involved in (A) and (B). A natural questionis: does there exist a probability function for Yi that is compatible with (A) and (B)?The work by Gourieroux et al. (1984) and Zhao et al. (1992) suggests that the answer isyes. Dropping the index i temporarily, consider the following "partly exponential family" f (Y = y; β, α) = exp{θty + c(y; α) − b(θ, α)}, where θ, called "canonical parameters," is determined by µ = E(Y ; β) = y exp{θty + c(y; α) − b(θ, α)}dy through the one-to-one correspondence between Here c(y; α) is pre-specified and α is called "shape parameters." Some choices of c(y; α)include c(y; α) = yt which, in conjunction with (6.3), leads to the multivariate normal distribution and forbinary responses c(y; α) = known as the quadratic exponential family (e.g. Zhao and Prentice, 1990; Fitzmauriceand Laird, 1993). With this partly exponential family, it has been shown (Zhao et al.,1992; Fitzmaurice and Laird, 1993) that the score function for δ = (β, α) has the form S(β, α) i (β) i; δ) i − µi(β) Cov(Yi; Wi; δ) Iq×q wi − E(Wi; δ) where Iq×q is the q × q identity matrix and wi = ∂c(yi; α)/∂α. Consequently, the first pcomponents of S(β, α) is simply S1(β, α) in (6.1). Thus S1(β, α) from GEE1 is the scorefunction for β if Yi is from a partly exponential family as long as the Cov(Yi) specified CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA in S1(δ) is the true covariance matrix from Yi. Furthermore, it can be shown (e.g., Zhaoet al., 1992) that β and α are orthogonal to each other, i.e. the covariance between thescore function for β and α is zero. Consequently, the consistency of ˆ β, the MLE of β, is maintained even if c(y; α) is misspecified, a property enjoyed by GEE1.
On the other hand, this likelihood approach based on the partly exponential family may be used with caution. First, choice of c(y; α), which along with µ(β), would leadto the complete specification of f (y; β, α) requires more knowledge about the randommechanism than it does for (A) and (B) only. In the absence of sufficient knowledge,either substantive or empirical, one should proceed with caution. Second, this family isin general not reproducible, i.e. the distribution function of any subset of Yi may not havethe same form as f (Yi) in (6.3). This lack of reproducibility is especially troublesomeif cluster sizes vary which is typically the case for family and longitudinal studies. We Table 6.1: Regression estimates and standard errors (in parenthesis) for thevisual impairment response from the Baltimore Eye Survey Study.
now return to the Baltimore eye survey example to illustrate some of the points madeearlier. Table 6.1 presents results from four approaches: Approach I: logistic regressionusing the data from the left eyes only; Approach II: logistic regression using data fromboth eyes treating them as if they were statistically independent of each other; ApproachIII (IV): GEE1 (GEE2) in which both approaches model log OR(YL, YR) as a functionof race. First, point estimates of logistic regression coefficients from four approaches arevirtually identical suggesting that the etiology of visual impairment is similar for botheyes.
Furthermore, the association between age and risk of visual impairment is strong and positive and varies between blacks and whites; see Figure 6.1. Second, the ordinarylogistic regression approach (Approach II), which ignores within-cluster association, has 6.2. RANDOM EFFECTS MODELS Figure 6.1: Predicted risk of VI by race and age among those with 9 yearsof education in Baltimore Eye Survey Study (Tielsch et al., 1990): solidline, whites; dashed line, blacks.
the tendency, as expected, to underestimate the s.e. of ˆ β compared to that of approaches III and IV, which are known to be correct, at least asymptotically. Third, it is obviousand intuitive that it is less efficient to use data from one eye only; see the last columnof Table 6.1. However, the ratios of the variance of ˆ β between approaches I and IV, the latter using twice the sample size, are approximately equal to 1.6 instead of 2. This isknown as "design effect" in complex survey sampling and is due to the fact that the datafrom fellow eyes are highly correlated with each other. Indeed, results from Table 6.1indicate that the estimated odds ratio relating two fellow eyes in terms of risk for visualimpairment is approximately 9.97(= e2.30) for whites and 17.1(= e2.30+0.54) for blacks.
The significant difference in odds ratios between blacks and whites is consistent with thelong-standing clinical finding in the United States that blacks have higher incidence ofglaucoma and of diabetic mellitis, both of which tend to be bilateral diseases. Finally,there is a substantial efficiency gain in using GEE2 for inference on the pattern of within-cluster association as the ratio of Var( ˆ α) for GEE1 versus GEE2 are 2.22(= (2.64/1.77)2) for the intercept and 2.65(= (45/.255)2) for the race variable.
Random Effects Models
As noted in §5.2, given (A) Conditional on bi, the random effects, Yi1, . . , Yin are statistically independent and for each j, Yij follows a GLM with h(E(Yij bi)) = xt β + zt b (B) bi, a q × 1 vector, ∼ F (·; D) with mean zero, the likelihood function for β, φ and D is completely specified, namely, L(β, φ, D) f (yij bi; β, φ)f (bi; D)dbi. CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA As an example, for binary responses of logit link and for a sole random intercept, i.e.
zij ≡ 1, following N (0, D), the likelihood function has the form L(β, D) yijxtijβ + log 1 + extij One can see from (6.5) that the likelihood function involves q-dimensional integrals, onefrom each cluster and in general, numerical approximation is needed in order to computer(6.4). We now consider two different approaches whose pros and cons will be contrastedthroughout.
Conditional Likelihood Approach
This approach only works when the link function is a canonical link in which the firstpart of the integral in (6.4) has the simple form, namely, f (yij bi; β) = exp  yijxtijβ + yijztijbi − which includes (6.5) as a special case. Some important examples of canonical linksinclude, as mentioned in §2.2, the identity link for normal distributions, log link forPoisson distribution and logit link for binomial distribution. One important feature of (6.6) is that given bi, yijxij and yijzij are jointly minimally sufficient for β and bi, the canonical parameters. Consequently, one can eliminate bi in (6.6) by conditioning yijzij, the minimal sufficient statistics for bi for fixed β, i.e.
w ) yi1, . . , yin ij zij ; β where β = (βw, βz) and xij = (wij, zij) as zij is assumed to be a subset of xij. Thus,through conditioning (see §2.2), it gives rise to a conditional likelihood which is indepen-dent of bi andβz and depends only on βw. In other words, only a subset of β is estimable.
Thus, the usefulness of this conditional likelihood approach depends on whether covari-ates of interest have been declared as random effects or not. For example, if zij ≡ 1, i.e.
random intercepts is the only random effect declared, then only those β from covariateswij's which vary within-cluster are estimable since otherwise Thus it should be clear that this conditional likelihood approach would not work for theeye survey example since all the covariates are cluster-specific rather than eye-specific.
6.2. RANDOM EFFECTS MODELS As another example, consider longitudinal studies in which primary objectives are of-ten centered on the individual change of the response variables over time. If the timevariable, i.e. the times for which the response variables are measured, is considered asa random effect, known as random slopes, then one cannot estimate the averaged rateof change coefficient should the conditional likelihood method be employed. This alongwith the fact that canonical links are required limit the extent to which this method maybe applied. One desirable property of this method, i.e. using Lc(βw) as the basis forinference, is that no distributional assumption on the bi's is needed, which is comfortingas the validity of such an assumption is difficult to verify in general.
An obvious alternative is to use the full likelihood, L(β, φ, D), in (6.4) as the basis forinference on both fixed effects, β, and the variance components, D and φ. For continuousresponses with identity links and normal distribution on both Y b and b itself, i.e.
(A) Yij bi ∼ N (xt β + zt b ij i, φ), (B) bi ∼ MVN(0, D), Yi follows a multivariate normal distribution with mean xtβ and covariance matrix i + φIni×ni .  , zi =  .  . In this case, the restricted maximum likelihood (REML) method (e.g. Patterson andThompson, 1971; Laird and Ware, 1982) for β, φ and D can be used. Furthermore, theempirical Bayes method (Robbins, 1955) can be utilized to help provide more desirableestimates of the bi's. This aspect of inference is useful especially for the purpose ofidentifying clusters that are distinct from others with regard to the behavior of therandom effects. A prime example is in growth studies in which investigators are interestedin identifying children whose "growth rates" as captured by the bi's are lower than the restof the sample. We note that this REML method along with empirical Bayes procedurefor estimating the bi's have been implemented in SAS as a part of PROC MIXED.
In general, the likelihood function involves complicated q-dimensional integrations which are demanding numerically especially when q is modestly large. Two methodshave been developed recently to address this issue. One is to use the Gibbs samplermethod in which the posterior distribution of β, φ, and D given the data is simulatedthrough, for example, the Monte Carlo Markov Chain (MCMC) algorithm (e.g. Zegerand Karim, 1991). The other approach, known as the penalized likelihood method,approximates f (bi Yi) by a normal density function leading to an approximate likelihoodfunction which is less computationally intensive (e.g., Stiratteli et al. 1984; Breslow andClayton, 1993). This approach has been shown through simulations to perform well ingeneral except for binary responses with small cluster sizes (Breslow and Clayton, 1993).
This method has been implemented in SAS as well (PROC GLIMMIX) and has beenrefined more recently to correct the bias occurring in the situation described above (e.g.
Breslow and Lin, 1995; Lin and Breslow, 1996).
Recall that observation driven models tackle the issue of within-cluster association bydirectly modelling the dependence of Yij on some functions of Yi,−j = {Yi1, . . , Yi,j−1, Yi,j+1, . . , Yn }, namely, Hij = Hij(Yi,−j) through (A) For j = 1, . . , ni, Yij Hij follows a GLM with h(E(Yij Hij)) = xtijβ + Some common choices of Hij were discussed in §5.3.
Just as in §6.1.2, a natural question is "does there exist a probability function for Yi thatis compatible with specification (A)?" While the answer no doubt depends on choices ofH and , it is comforting that the Hammerley-Clifford Theorem in Besag (1974) couldserve as a guidance for the question posed above. As an example, for binary responses with q = 1, a choice of (Hij; α) = α i has led to the "multiplicative" model developed by Connolly and Liang (1988).
As an alternative, one can estimate δ = (β, α) by solving  E(Y ij Hij ; δ) ij Hij ) (yij − E(Yij Hij ; δ)) = 0, i=1 j=1 a strategy similar to GEE for marginal models. An advantage of this approach is thatthe statistical package developed for GEE of marginal models such as SAS PROC GEN-MOD can be utilized. On the theoretical end, this approach shares the same robustnessproperty enjoyed by GEE for marginal models.
Returning to the Baltimore Eye Survey Study example, Table 6.2 presents results from three modelling approaches. The first column is based on GEE2 for marginal models asshown in Table 6.1. The second column is derived from observation driven models usingthe GEE1 method. Specifically, this model assumes for each j = 1, 2, logit Pr(Yij = 1 Yi,3−j, Agei, Racei) 1 Racei + β 60) + β Racei  (Agei 60) + β Racei  (Agei +α1Yi,3−j + α2Yi,3−j  Racei. Finally, the third column corresponds to results from fitting a random effects model ofsame covariates as marginal models with the addition of random intercepts. Consistent 6.4. PRE- POST TRIAL ON SCHIZOPHRENIA REVISITED with discussion in §5.4, the estimated β's from the observation driven model are ingeneral smaller in magnitude than that of the β's from the marginal model. In par-ticular, the estimated β for the Race variable is reduced from 0.334 in column oneto 0.099 in column two and as a result, is not significantly different than zero statisti- Table 6.2: Regression estimates and standard errors (in parenthesis) fromthree modelling approaches: Baltimore Eye Survey Study.
 variance of random intercepts.
cally if the observation driven model is employed. A simple and obvious explanation ofthis discrepancy is that the effects of those demographic variables are, to some extent,"explained away" by the competing variate, Yi,3−j, the visual impairment status of thefellow eye; see ˆ α1 and ˆ α2 in column two. On the other hand, the estimated β's from the random effects model are larger in magnitude than the ˆ β's. However, the Wald test statistics for testing the significance of either β or β are similar in magnitude and leadto similar conclusions. For an example, the Wald statistics for the Race variables are3.18(= .334/1.05) from column one and 2.58(= .487/.189) from column three.
Pre- Post Trial on Schizophrenia Revisited
Recall that in this trial more than five hundred schizophrenic patients were randomizedinto six arms, placebo, risperidone of 2, 6,, 10 and 16mg and haloperidol of 20mg. Forillustrative purpose, we focus on two particular groups: risperidone of 6mg (x = 1)and haloperidol of 20mg (x = 0). The main objective is to examine whether the new CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA treatment in risperidone leads to further and faster decline in PANSS, a measure ofseverity in schizophrenia, compared to the then standard treatment in haloperidol. Table Table 6.3: Regression estimates and standard errors (in parenthesis) fromthree modelling approaches: a pre- post trial on schizophrenia.
Random effects model 6.3 presents results form three modelling approaches. For marginal models, we consideredthe use of independence and exchangeable (also known as compound symmetry) matricesas the working covariance for the Yi's. For random effects models, we considered (1)random intercepts only which leads to the compound symmetry covariance matrix inY and (2) random intercepts and slopes. In the latter, this accounts for variations inPANSS changes over time among patients. For observation driven models, we consideredthe following AR-1 model Yij = xtijβ + α(yi,j−1 − xti,j−1β) + ij, j = 1, . . , ni, where the ij ∼ N (0, φ). With the exception of the GEE approach using the inde-pendence matrix as the working matrix, conclusions from all three approaches are verysimilar qualitatively; see comments in §5.4. Specifically, the averaged rate of decline inPANSS for the haloperidol groups is estimated as 1.16 per week; whereas the averagedrate of decline is estimated as 2.35(= 1.16 + 1.19) per week for the risperidone group.
Such a difference is statistically significant at the 0.05 level (1.192/.482 = 2.47, p-value < 0.01). More importantly, this difference appears to be meaningful clinically aswell. Considering two patients of the same PANSS scores before receiving treatments,this result suggests that taking the new treatment would lead to 9.5(= 1.19 × 8) unitgreater decline in PANSS in the eighth week, a substantial improvement from the clinicalpoint-of-view.
6.4. PRE- POST TRIAL ON SCHIZOPHRENIA REVISITED Figure 6.2: Mean responses in PANSS by dropout cohort.
Finally, ignoring within-cluster association would lead to very different and presum- ably misleading conclusions. Specifically, results from GEE assuming the repeated mea-sures from the same patient were statistically independent of each other suggest thatthere is no difference in averaged rate of decline in PANSS between the two groups(0.117 ± 0.545). An explanation for such a striking discrepancy is that this latterapproach is vulnerable to severe and non-trivial missing data phenomenon which is ex-hibited in this example. For example, the risperidone group experienced 36% dropoutrate in the eight week follow-up period; whereas the haloperidol group fared no betterwith a 53% dropout rate. Furthermore, Figure 6.2 strongly suggests that the chance ofdropping out at a given visit depends on the prior history of PANSS scores. For an exam-ple, patients dropped out at weeks 2, 4 and 6 experienced, on average, a sudden increasein PANSS prior to the scheduled visits. This pattern of missing mechanism, known asmissing at random (Little and Rubin, 1987), suggests that ignoring within-cluster associ-ation would, in general, lead to incorrect inference for questions of interest, the differencein rates of change in PANSS between groups in this case. For more detailed treatmentson accounting for non-trivial missing data patterns, see Diggle (1998) and Scharfstein,Rotnitzky and Robbins (1999) and references therein.
CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA [1] Altham, P.M.E. (1978). Two generalizations of the binomial distribution. Applied Statistics, 27, 161–167.
[2] Andersen, E.B. (1970). Asymptotic properties of conditional maximum likelihood estimators. Journal of the Royal Statistical Society, Series B, 32, 283–301.
[3] Bahadur, R.R. (1961). A representation of the joint distribution of responses in n dichotomous items. In Studies on Item Analysis and Prediction, Ed. H. Solomon,pp.158–168. Stanford Mathematical Studies in the Social Sciences VI, Stanford Uni-versity Press, Stanford, California.
[4] Beaty, T.H., Skjaerven, R., Breazeale, D.R. and Liang, K.-Y. (1997). Analyzing sibships correlations in birth weight using large sibships from Norway. Genetic Epi-demiology, 14, 423–433.
[5] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems.
Journal of the Royal Statistical Society, Series B, 36, 192–236.
[6] Bishop,Y.M.M., Fienberg, S.E. and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, Massachussetts.
[7] Breslow, N.E. (1981). Odds ratio estimators when the data are sparse. Biometrika, 68, 73-84.
[8] Breslow, N.E. and Clayton, D.G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 125–34.
[9] Breslow, N.E. and Day, N.E. (1980). Statistical Methods in Cancer Research, Volume I. IARC Scientific Publications No. 32. Lyon.
[10] Breslow, N.E. and Lin, X. (1995). Bias correction in generalized linear mixed models with a single component of dispersion. Biometrika, 82, 81-91.
[11] Breslow, N.E. and McCann, B. (1971). Statistical estimation of prognosis of children with neuroblastoma. Cancer Research, 31, 2098–2103.
[12] Chen, J.J., Kodell, R.L., Howe, R.B. and Gaylor, D.W. (1991). Analysis of trinomial responses for reproductive and developmental toxicity experiments. Biometrics, 47,1049–1058.
[13] Cohen, B.H., Bell, W.C., Jr., Brashears, S., Diamond, E.L., Kreiss, P. et al. (1977).
Risk factors in chronic obstructive pulmonary disease (COPD). American Journalof Epidemiology, 105, 223–232.
[14] Connolly, M., Liang K.-Y. (1988). Conditional logistic regression models for corre- lated binary data. Biometrika, 75, 501-506.
[15] Cox, D.R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society Series B, 74, 187-200.
[16] Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics, Chapman and Hall, London, England.
[17] Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate condi- tional inference, (with discussion). Journal of the Royal Statistical Society, Series B,49, 1-39.
[18] Crowder, M.J. (1987). On linear and quadratic estimating functions. Biometrika, 74, 591-597.
[19] Crowder, M.J. (1989). Comments on Godambe, V.P. and Thompson, M.E. An ex- tension of quasi-likelihood estimation. Journal of Statistical Planning and Inference,22, 137-152.
[20] Diggle, P.J. (1988). An approach to the analysis of repeated measures. Biometrics, [21] Diggle, P.J. (1998). Dealing with missing values in longitudinal studies. In: Recent Advances in the Statistical Analysis of Medical Data, editors B.S. Everitt and G.
Dunn, 203–208. Arnold, London, England.
[22] Diggle, P.J., Liang, K.-Y. and Zeger, S.L. (1994). Analysis of Longitudinal Data.
Oxford University Press, Oxford, England.
[23] Durbin,J. (1960). Estimation of parameters in time-series regression models.
Biometrika, 47, 139-153.
[24] Ferguson, H. Reid, N. and Cox, D.R. (1991). Estimating equations from modified profile likelihood, in Estimating Functions, Godambe, V.P., editor, Oxford Univer-sity Press, Oxford, England.
[25] Finney, D.J. (1964). Statistical Methods in Biological Assay, 2nd edition. Griffin, London, England.
[26] Firth, D. (1987). On the efficiency of quasi-likelihood estimation. Biometrika, 74, [27] Fitzmaurice, G.M. and Laird, N.M. (1993). Regression models for discrete longitu- dinal responses. Statistical Science, 8, 284-304.
[28] Godambe, V.P. (1960). An optimum property of regular maximum likelihood esti- mation. Annals of Mathematical Statistics, 31, 1208–12.
[29] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimat- ing equations. Biometrika, 63, 277-284.
[30] Goodman, L.A. and Kruskal, W.H. (1979). Measures of Association for Cross- Classifications. Springer-Verlag: New York.
[31] Gourieroux, C., Monfort, A., and Trognon, A. (1984). Psuedo-maximum likelihood methods: theory. Econometrica, 52, 681–700.
[32] Grizzle, J.E., Starmer, C.F. and Koch, G.G. (1969). Analysis of categorical data by linear models. Biometrics, 25, 4898–504.
[33] Haseman, J.K. and Kupper, L.L. (1979). Analysis of dichtomous response data from certain toxicological experiments. Biometrics, 35, 281–293.
[34] Hauck, W.W. and Donner, A. (1977). Wald's test as applied to hypotheses in logist analysis. Journal of the American Statistical Association, 72, 851–853.
[35] Heagerty, P. and Zeger, S.L. (1998). Lorelogram: a regression approach to exploring dependence in longitudinal categorical responses. Journal of the American StatisticalAssociation, 93, 150–163.
[36] Holmes, M.C. and Williams, R.E.O. (1954). The distribution of carriers of Stepto- coccus pyogenes among 2413 healthy children. Journal of Hygiene Comb 52, 165-179.
[37] Jones, B. and Kenward, M.G. (1987). Modelling binary data from a three-point cross-over trial. Statistics in Medicine, 6, 555–64.
[38] Jφrgensen, B. (1987). Exponential dispersion models (with discussion). Journal of the Royal Statistical Society, Series B, 49, 127–162.
[39] Kalbfleisch, J.D. and Sprott, D.A. (1970). Application of likelihood methods to models involving a large number of nuisance parameters (with discussion). Journalof the Royal Statistical Society, Series B, 32, 175-208.
[40] Kupper, L.L. and Haseman, J.K. (1978). The use of a correlated binomial model for the analysis of data from certain toxicological experiments. Biometrics, 34, 69–76.
[41] Laird, N.M. and Ware, J.H. (1982). Random-effects models for longitudinal data.
Biometrics, 38, 963–74.
[42] Lee, E. (1992). Statistical Methods for Survival Data Analysis, 2nd edition. John Wiley and Sons, New York, New York.
[43] Lehmann, E.L. (1959). Testing Statitical Hypothesis. John Wiley and Sons, New York, New York.
[44] Liang, K.-Y. (1987). Estimating functions and approximate conditional likelihood.
Biometrika, 74, 695-702.
[45] Liang, K.-Y. and Beaty, T.H. (1991). Measuring familial aggregation by using odds- ratio regression models. Genetic Epidemiology, 8, 361–370.
[46] Liang, K.-Y. and Hanfelt, J. (1994). On he use of the quasi–likelihood method in the teratological experiments. Biometrics, 50, 872-880.
[47] Liang, K.-Y. and Tsou, D. (1992). Empirical Bayes and conditional inference with many nuisance parameters. Biometrika, 79, 261-270.
[48] Liang, K.-Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
[49] Liang, K.-Y. and Zeger, S.L. (1989). A class of logistic regression models for mul- tivariate binary times series. Journal of the American Statistical Association, 84,447-451.
[50] Liang, K.-Y., Zeger, S.L. (1993). Regression analysis for correlated data. Annual Review of Public Health, 14, 43-68.
[51] Liang, K.-Y. and Zeger, S.L. (1995). Inference based on estimating functions in the presence of nuisance parameters (with discussion). Statistical Science, 10, 158-172.
[52] Liang, K.-Y., Zeger, S.L., and Qaqish, B. (1992). Multivariate regression analyses for categorical data (with Discussion). Journal of the Royal Statistical Society, SeriesB, 54, 3–40.
[53] Lin, X. and Breslow, N.E. (1996). Bias correction in generalized linear mixed models with multiple components of dispersion. Journal of the American Statistical Asso-ciation, 91, 1007–16.
[54] Lindsay, B. (1982). Conditional score functions: some optimality results. Biometrika, 69, 503-512.
[55] Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. John Wiley, New York.
[56] Mantel, N. and Haenszel, W. (1959). Statistican aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22,719-748.
[57] Mason, W.B. and Fienberg, S.E. (eds) (1985). Cohort Analysis in Social Research: Beyond the Identification Problem. Springer-Verlag, New York.
[58] Maxwell, A.E. (1961). Analysing Qualitative Data. Methuen, London.
[59] McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society, Series B, 42, 109–42.
[60] McCullagh, P. (1983). Quasi-likelihood functions. Annals of Statistics, 11, 59-67.
[61] McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. Chapman and Hall, New York.
[62] Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalised linear models. Journal of the Royal Statistical Society, Series A 135, 370-384.
[63] Neyman, J. and Scott, E.L. (1948). Consistent estimates based on partially consis- tent observations. Econometrica, 16, 1-32.
[64] Patterson, H.D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545–54.
[65] Prentice, R.L. and Zhao, L.P. (1991). Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics, 47,825–39.
[66] Qaqish, B. and Liang, K.-Y. (1992). Marginal models for correlated binary data with multiple classes and multiple levels of nesting. Biometrics, 48, 939–950.
[67] Rao, C.R. (1973). Linear Statistical Inference and Its Applications, 2nd edition.
John Wiley, New York.
[68] Reid, N. (1995). The roles of conditioning on inference (with discussion). Statistical Science, 10, 138-199.
[69] Robbins, H. (1956). An empirical Bayes approach to statistic. Proceedings of the Third Berkeley Symposium, 1, 157–163.
[70] Rosner, B. (1989). Multivariate methods for clustered binary data with more than one level of nesting. Journal of the American Statistical Association, 84, 373–380.
[71] Royall, R.M. (1986). Model robust inference using maximum likelihood estimators.
International Statistical Review, 54, 221–26.
[72] Royall, R.M. (1997). Statistical Evidence: a Likelihood Paradigm. Chapman and Hall, London, England.
[73] Scharfstein, D.O., Rotnitzky, A. and Robbins, J.M. (1999). Adjusting for nonignor- able drop-out using semiparametric nonresponse models (with discussion). Journalof the American Statistical Association, 94, 1096–1146.
[74] Skellam, J.G. (1948). A probability distribution derived from the binomial distribu- tion by regarding the probability of success as variable between the sets of trials.
Journal of the Royal Statistical Society, Series B, 10, 257–61.
[75] Stigler, S.M. (1986). The History of Statistics. Belknap Press, Cambridge, Mass.
[76] Stiratelli, R., Laird, N. and Ware, J.H. (1984). Random effects models for serial observations with binary responses. Biometrics, 40, 961–71.
[77] Tielsch, J.M., Sommer, A., Witt, K., Katz, J. and Royall, R.M. (1990). Blindness and visual impairment in an American urban population: Baltimore eye survey.
Archives of Ophthalmology, 108, 286-290.
[78] Tweedie, M.C.K. (1947). Functions of a statistical variate with given means, with special reference to Laplacian distributions. Proceedings of the Cambridge Phil.
, 49, 41–49.
[79] Waterman, R.P. and Lindsay, B.G. (1996). A simple and accurate method for ap- proximate conditional inference applied to exponential family models. Journal of theRoyal Statistical Society, Series B, 58, 177–188.
[80] Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models and the Gaussian method. Biometrika, 61, 439–47.
[81] Weil, C.S. (1970). Selection of the valid number of sampling units and a consideration of their combination in toxicological studies involving reproduction, teratogenesis orcarcinogenesis. Food and Cosmetics Toxicology, 8, 177–182.
[82] Weinberg, C.R., Wilcox, A.J. and Lie, R.T. (1998). A log-linear approach to case- parent-triad data: assessing effects of disease genes that act either directly or thoughmaternal effects and that may be subject to parental imprinting. American Journalof Human Genetics, 62,, 969–978.
[83] White, H. (1982). Maximum likelihood estimation of misspecified models. Econo- metrics, 50, 1–25.
[84] Williams, D.A. (1988). Reader reaction: estimation bias using the beta-binomial distribution in teratology. Biometrics, 44, 305–309.
[85] Zeger, S.L. and Karim, M.R. (1991). Generalized linear models with random effects: a Gibbs sampling approach. Journal of the American Statistical Association, 86,79–86.
[86] Zeger, S.L., Liang, K.-Y. and Albert, P. (1988). Models for longitudinal data: a generalized estimating equation approach. Biometrics, 44, 1049–1060.
[87] Zhao, L.P. and Prentice, R.L. (1990). Correlated binary regression using a general- ized quadratic model. Biometrika, 77, 642–48.
[88] Zhao, L.P, Prentice, R.L. and Self, S.G. (1992). Multivariate mean parameter esti- mation by using a partly exponential model. Journal of the Royal Statistical Society,Series B, 54, 805–812.
beta-binomial distribution, 22, 40, 41, likelihood function, 22 likelihood ratio, 22 binary response, 2 likelihood-based inference, 21 biologic plausibility, 43 litter effect, 22local optimality, 33 canonical link, 75 logistic regression model, 51 canonical parameter, 75 longitudinal study, 18 categorical response, 49cluster, 12 marginal model, 61, 62, 69 cluster-specific, 76 maximum likelihood estimator, 22 conditional score function, 32 method of moments, 31 contingency table, 56continuation ratio model, 55 nested design, 13 continuous response, 2 Neyman-Scott problem, 22 correlated data, 34 nuisance parameter, 21 correlation coefficient, 15 observation driven model, 61, 65, 77 odds ratio, 19ordinal response, 50 empirical Bayes, 76 orthogonality, 32, 33 empirical logit, 38, 39estimating function, 24 partly exponential family, 71 exponential dispersion family, 26 polytomous logistic regression model, 53polytomous response, 49 population averaged, 63 Fisher information matrix, 25 proportional odds model, 52 generalized estimating equation, 35 quadratic exponential family, 72 generalized linear model, 26 quasi-likelihood, 28, 30, 44, 69 log-linear model, 27 quasi-score function, 30, 31, 33, 34 logistic regression model, 27, 39multiple linear regression, 27 random effects model, 61, 63, 73 Poisson regression model, 27 random mechanism, 30 polytomous response, 51 regression analysis, 2 random component, 26 restricted maximum likelihood, 76 systematic component, 27 goodness-of-fit, 30 score function, 25 Hammersley-Clifford Theorem, 45 teratological experiment, 22, 40, 53, 54 unbiased estimating function, 31 variance specification, 30, 31, 44 weighted least squares, 28within-cluster dependence, 12


Annu. Rev. Genet. 2002. 36:153–73 ° 2002 by Annual Reviews. All rights reserved STUDYING GENE FUNCTION IN EUKARYOTES BYCONDITIONAL GENE INACTIVATION Manfred Gossen1 and Hermann Bujard2 1Max Delbr¨uck Centrum, Robert-R¨ossle-Strasse 10, D-13125 Berlin, Germany;e-mail:; 2ZMBH, Universit¨at Heidelberg, Im NeuenheimerFeld 282, D-69120 Heidelberg, Germany; e-mail:


Clinical Review & Education 2014 Evidence-Based Guideline for the Managementof High Blood Pressure in AdultsReport From the Panel Members Appointedto the Eighth Joint National Committee (JNC 8) Paul A. James, MD; Suzanne Oparil, MD; Barry L. Carter, PharmD; William C. Cushman, MD;Cheryl Dennison-Himmelfarb, RN, ANP, PhD; Joel Handler, MD; Daniel T. Lackland, DrPH;Michael L. LeFevre, MD, MSPH; Thomas D. MacKenzie, MD, MSPH; Olugbenga Ogedegbe, MD, MPH, MS;Sidney C. Smith Jr, MD; Laura P. Svetkey, MD, MHS; Sandra J. Taler, MD; Raymond R. Townsend, MD;Jackson T. Wright Jr, MD, PhD; Andrew S. Narva, MD; Eduardo Ortiz, MD, MPH

Copyright © 2008-2016 No Medical Care