## Runit.tex

Generalized Linear Models, Estimating Functions
Multivariate Extensions
Department of Biostatistics
School of Hygiene & Public Health
Johns Hopkins University
July 12–14, 1999
Regression analysis has been developed for many years and remains one of the most
commonly used statistical tools to help scientists address their scientiﬁc inquiries. Inthe year of 1972, thanks to the seminal work by Nelder and Wedderburn, many usefulregression models are uniﬁed under the framework of generalized linear models (GLMs).

These include as special cases ANCOVA, multiple regression for continuous responses, lo-gistic regression for binary responses and log-linear models for contingency tables. Theyasserted that each model under consideration can be cast through two components. Oneis the systematic component which relates the mean response to the covariates by spec-ifying a link function. The other is the random component which is needed to accountfor uncertainty in the response variable. Through this common framework, issues thatare commonly faced in regression such as estimating procedure, regression diagnostics,measurement errors in covariates, goodness-of-ﬁt, etc. can be addressed in a uniﬁed fash-ion. While there is no doubt that this approach will be in its existence for many yearsto come, it faces two, among others, challenges that has drawn a good deal of attentionamong statistical researchers in the past few decades. One is to deal with the situationwhere the scientists may not have suﬃcient knowledge about the subject matters to beincorporated by their statistical colleagues to characterize the random mechanism forwhich the data are generated. The work by Wedderburn in 1974, known as the quasilikelihood method, provides a means to cope with this complication. It relaxes the ran-dom component assumption required by GLMs, rather a much weaker assumption on thevariance expression for the response variables is installed. Several questions remain tobe answered regarding the use of this alternative method. First, can the work by Wed-derburn be extended to accommodate more general and realistic variance speciﬁcations?Second, does the quasi score method by Wedderburn possess any desirable statisticaloptimality properties? Third, is there any evidence that this method perform well inpractice?
The other challenge GLMs face is how to deal with the situation where some of the ob-
servations are not necessarily statistically independent of each others. Implicitly, GLMsoperates on the assumption that all the observations are independent statistically of eachothers. More frequently, one witnesses in biomedical studies that such an assumptionmay be invalid. In the past decade or two, the statistical community also witnessed thetremendous development in statistical methods for analyzing data that are correlatedwith each others. A natural questions to raise is: what are the pros and cons of thesenew methods in terms of their suitability in addressing scientiﬁc objectives from studiesinvolving correlated data?
In this lecture note, we intend to address the above two posed challenges and questions
through the vehicle of estimating functions, a concept formally developed in 1960 bythe independent work of Durbin and Godambe. This lecture note has the followingorganization. In Chapter one, we ﬁrst characterize ﬁve motivating examples which involvecorrelated data. We argued that for all examples considered, their scientiﬁc objectives canbe cast through regression modeling. Also discussed are (1) the statistical consequencewhen the correlation possessed by the data is ignored in the analysis and (2) quantitiesthat are useful for measuring the degree of associations for correlated data.

Chapter two provides a chronicle overview on some recent developments of regres-
sion techniques with focus on generalized linear models, quasi likelihood methods andestimating function approaches. This overview is preceded by some discussions on issues
concerning likelihood-based inference. These include the impact of nuisance parametersand cautions in using the popular Wald's procedure.

Chapters three and four may be viewed as a preview of statistical modeling for corre-
lated data. In Chapter three, a brief review on historic developments for analyzing pro-portional data leads to the following question: is the binomial assumption made thereina reasonable one? The well known "litter eﬀect" in teratological experiments suggeststhat one of the key assumptions in binomial, namely, all the binary responses from withinare statistically independent of each others, is invalid. We discussed in Chapter threedrawbacks of some attempts in statistical literature in the 1970s and 1980s to relax thetwo key assumptions in binomial for proportional data. The teratological experimentdata also serves to illustrate the usefulness of the quasi likelihood method in dealingwith (1) uncertainty of random mechanisms and (2) impact of nuisance parameters.

Chapter four consists of two parts. In part one, we discussed three diﬀerent statistical
models that are useful for polytomous responses in which the discrete response variablecontains three or more categories. Circumstances under which these models may be, ormay not be, appropriate are highlighted and contrasted through two examples. Thispart also serves to extend GLMs to incorporate either ordinal or nested polytomousresponse variables. In part two, we focused on interpretations of parameters induced bylog-linear models which are popular for analyzing data from contingency tables. Therehas been some renewed interest lately in transferring this approach for contingency tablesto correlated data analysis. Through detailed discussions on interpretations of log-linearparameters, we argued that such an application to correlated data may proceed withcaution.

In Chapter ﬁve, we contrasted three diﬀerent statistical modelings for correlated data:
marginal models, random eﬀects models and observation driven models with focus oninterpretations of induced parameters. Such contrasts and circumstances under whichthe proposed models may be most appropriate are illustrated through two examplesintroduced in Chapter one: The Baltimore Eye Survey Study and the SchizophreniaTrial Study.

Chapter six deals with statistical inference for correlated data when either one of
three statistical models mentioned above is employed. These methods may be viewed asmultivariate extensions of GLMs and quasi likelihood which were primarily designed foranalyzing univariate responses. Again, these newly developed methods were illustratedthrough the two above mentioned examples.

This lecture note is made possible through the kind invitation of Dr. C. Z. Wei,
the former Director of the Institute of Statistical Science, Academia Sinica, Taiwan,R.O.C., to be a part of the lecture series in statistics sponsored by the institute in 1999.

The coordinating eﬀorts by Dr. C.H. Chen, the executive secretary of the lecture seriescommittee, and by the committee members are gratefully acknowledged. The typingof this lecture note and the lectures delivered in July, 1999 in Academic Sinica werediligently prepared by Ms. Patty Hubbard, Department of Biostatistics, Johns HopkinsUniversity and by Ms. Umy Chen, Institute of Statistical Science, Academia Sinica.

Their hard work and patience throughout are beyond what can be described in words –I thank them from the bottom of my heart.

**1 Correlated Data**
Characteristics of Examples . . . . . . . . . . . . .

Impact of Ignoring Within-Cluster Dependence . . . . . . . .

Incorrect Variance . . . . . . . . . . . . . .

Measures of Within-Cluster Dependence . . . . . . . . . .

**2 Regression Analysis: An Overview**
Some Issues Concerning Likelihood-Based Inference . . . . . . .

Impact of Nuisance Parameters . . . . . . . . . .

Cautions on Wald's Procedure . . . . . . . . . .

Generalized Linear Models . . . . . . . . . . . . . .

First Two Moments of GLMs . . . . . . . . . . .

Score Function for

*β *. . . . . . . . . . . . .

MLE Versus Weighted Least Squares . . . . . . . . .

Quasi-Likelihood Method . . . . . . . . . . . . . .

Estimating Function Approach . . . . . . . . . . . .

Limitations: Impact of Nuisance Parameters
Orthogonality Properties of Conditional Score Functions . . .

(Local) Optimality of Quasi-Score Functions

**3 Analysis of Binary Data**
Historical Developments . . . . . . . . . . . . . .

Empirical Logit versus Logistic Regression . . . . . . .

Is the Binomial Assumption Reasonable? . . . . . . .

Teratological Experiments . . . . . . . . . . . . . .

Some Cautionary Notes on Beta-Binomial Distributions . . .

An Alternative Inferential Procedure . . . . . . . . .

Further Relaxation of Binomial Assumptions . . . . . . . .

Multiplicative Model . . . . . . . . . . . . .

**4 Analysis of Polytomous and Count Data**
Types of Polytomous Data . . . . . . . . . . . . . .

Regression Models for Polytomous Responses . . . . . . . .

Polytomous Logistic Regression Models . . . . . . . .

Proportional Odds Models . . . . . . . . . . . .

Continuation Ratio Models . . . . . . . . . . .

Log-Linear Models for Contingency Tables . . . . . . . . .

Interpretations of Log-Linear Model Parameters
An Alternative Representation of Log-Linear Models . . . .

**5 Statistical Modellings for Correlated Data**
Observation Driven Models . . . . . . . . . . . . .

Contrasts of Three Modelling Approaches . . . . . . . . .

**6 Statistical Inference for Correlated Data**
Quasi-Likelihood Approach . . . . . . . . . . .

Likelihood Approach . . . . . . . . . . . . .

Conditional Likelihood Approach . . . . . . . . . .

Likelihood Approach . . . . . . . . . . . . .

Observation Driven Models . . . . . . . . . . . . .

Likelihood Approach . . . . . . . . . . . . .

Quasi-Likelihood Approach . . . . . . . . . . .

Pre- Post Trial on Schizophrenia Revisited . . . . . . . . .

**List of Tables**
Some features of ﬁve examples in

*§*1.1. . . . . . . . . . .

Ranges of correlation coeﬃcients for a pair of binary responses with cor-responding means

*πj *and

*πk*, respectively. . . . . . . . . .

Proportions of children surviving 2 years free of disease following diagnosisand treatment for neuroblastoma by age and stage of disease at diagnosis(Breslow & McCann, 1971). . . . . . . . . . . . . .

Proportions of fetuses surviving the 21 day lactation period among thosewho were alive at 4 days by treatment group (Weil, 1970). . . . .

The averaged MLE (

*± *s.e.) for

*θ *for diﬀerent assumed values of

*φ *from1,000 simulations (Williams, 1988). . . . . . . . . . . .

Estimates and standard errors of

*β*1 = log

*{θT /*(1

*−θT *)

*}−*log

*{θC/*(1

*−θC*)

*}*of Weil's data presented in Table 3.2. Upper entry: common

*φ*; lower entry:heterogeneous

*φ *(Liang and Hanfelt, 1994). . . . . . . . .

Simulation results for

*β*1 = log

*{θT /*(1

*− θT *)

*} *from 1,000 replications mim-icking the data structure of Weil's. The true

*β*1 is

*−*1

*.*129 and

*φT *= 0

*.*317and

*φC *= 0

*.*021 (Liang and Hanfelt, 1994). . . . . . . . .

Regression estimates (

*± *s.e.) based on both proportional odds models andpolytomous logistic regression models for the disturbed dream example(Maxwell, 1961). . . . . . . . . . . . . . . . .

Regression estimates (

*± *s.e.) based on proportional odds models and con-tinuation ratio models for the tonsil size example (Holmes and Williams,1954). . . . . . . . . . . . . . . . . . . .

Regression estimates and standard errors (in parenthesis) for the visualimpairment response from the Baltimore Eye Survey Study. . . . .

Regression estimates and standard errors (in parenthesis) from three mod-elling approaches: Baltimore Eye Survey Study. . . . . . . .

Regression estimates and standard errors (in parenthesis) from three mod-elling approaches: a pre- post trial on schizophrenia. . . . . . .

*LIST OF TABLES*
**List of Figures**
The plot of the logarithm of

*V*1

*/V*2 versus

*ρ *for selected cluster sizes

*n*(2

*, *5

*, *10) and

*φ*(0

*, *0

*.*2

*, *0

*.*5

*, *0

*.*8

*, *1

*.*0). Here,

*V*1 is the variance of the leastsquared estimate ˆ

*β*1 when the within-cluster dependence is ignored and

*V*2 is the correct variance. We have assumed E(

*Yij*) =

*β*0 +

*β*1

*xij *and

*ρ *= corr(

*Yij, Yik*)

*, j < k *= 1

*, . . , n *:

*φ *is the ratio of the between clustervariance to the total variance among the

*xij*'s. . . . . . . .

The plot of

*V*3

*/V*2 versus

*ρ *for selected

*n*(2

*, *5

*, *10) and

*φ*(0

*, *0

*.*2

*, *0

*.*5

*, *0

*.*8

*, *1

*.*0).

Here

*V*2

*, ρ, φ*, and the assumed model are the same as described in Figure1.1,

*V*3 is the variance of the best unbiased estimate of

*β*1. . . . .

Beta-binomial log-likelihoods for the exposed group based on the datareported by Weil (1970). . . . . . . . . . . . . . .

Predicted risk of VI by race and age among those with 9 years of educationin Baltimore Eye Survey Study (Tielsch et al., 1990): solid line, whites;dashed line, blacks. . . . . . . . . . . . . . . .

Mean responses in PANSS by dropout cohort. . . . . . . . .

*LIST OF FIGURES*
**Chapter 1**
Regression analysis is commonly used in biomedical research. A typical assumptionbehind this tool is that all observations are statistically independent, or at least uncor-related with each other. The following research problems in which we have been directlyor indirectly involved suggest that there are many situations for which this assumptionmight not be met.

*Example 1.1 *(Baltimore Eye Survey Study). Over 5,000 individuals aged 40 and
above received a visual examination as part of a population-based prevalence study ofocular disorders (Tielsch et al. 1990). The main objective of the summary is to identifydemographic variables such as race and age that are associated with vision loss. Thecomplication of the study is that data are available on both eyes which are unlikely tobe independent because many causes of eye impairment are binocular.

*Example 1.2 *(Family Study on Chronic Obstructive Pulmonary Disease (COPD)). Six
hundred and thirteen members of 158 COPD cases seen at the Johns Hopkins Hospitalwere examined and given spirometry tests (Cohen, et al., 1977). The primary objectiveis to determine the percent of total variance attributable to unobserved genetic factorsshared among siblings and their parents. It is typical in family studies that the responsevariables, FEV1 in this case, are measured for related individuals, which in turn arecrucial to address scientiﬁc questions of interest such as the one stated above.

*Example 1.3 *(Sister Chromatid Exchange Study). A total of 14 hepatocelluar carci-
noma, 14 nasopharyngeal carcinoma and 16 cervical cancer patients and their age-sex-matched controls, were studied to compare the frequency of sister chromatid exchange(SCE) in their peripheral lymphocytes. The main hypothesis, which was tested in thestudy conducted by Professor C.J. Chen, College of Public Health, National Taiwan Uni-versity, is that cancer patients may have a higher frequency of SCE in lymphocytes thanmatched controls. The outcome variable is the number of SCE per cell where 20 cellsfrom each subject were cultured.

*Example 1.4 *(Pre-Post Trial of Schizophrenia). Five hundred and twenty three pa-
tients diagnosed with schizophrenia were randomly allocated amongst six treatments:placebo, haloperidol 20 mg and risperidone at dose levels 2mg, 6mg, 10mg and 16mg.

The primary response variable is the total score obtained on the Positive and Negative

*CHAPTER 1. CORRELATED DATA*
Symptom Rating Scale (PANSS), a measure of severity of schizophrenia. This score foreach patient was measured at baseline, weeks 1, 2, 4, 6 and 8 after randomization. Themain objective is to determine if risperidone improves the rate of decline in PANSS overthe two month period.

*Example 1.5 *(Diabetic Retinopathy Study). Over 1,700 patients with diabetic retinopa-
thy and visual acuity of 20/100 or better in both eyes participated in the study (Mur-phy and Patz, 1978). One eye of each patient was randomly selected for laster photo-coagulation and the other was observed without treatment. The main objective is tostudy the eﬀectiveness of laster photo-coagulation in delaying the onset of blindness andthe response variable for each eye is time from randomization to the occurrence of visualacuity less than 5/200.

**Characteristics of Examples**
Although diﬀerent in their scientiﬁc objectives, the examples introduced above sharesome important characteristics in common that are critical to the statistical developmentpresented in Chapters 5 and 6. First, data in each of these examples are organized in"clusters." For example, in the Baltimore Eye Survey Study, each cluster comprisesobservations of two eyes from an individual. In the family study (Example 1.2), a clusteris formed by a family with observations from related individuals. Secondly, responsevariables from the same cluster are likely to be correlated with each other. In longitudinalstudies where repeated observations from each subject are measured over time, this isknown as "tracking." In family studies, observations from related individuals such assiblings are correlated due to shared genes and/or environments. This explanation canbe applied to the fellow eyes in Example 1.1 as well. Thirdly, the scientiﬁc objectivesof these examples can be formulated statistically in terms of regression analyses as onewould have been for independent responses. For example, one could use the conventionallogistic regression to relate the risk of visual impairment for each eye in the BaltimoreEye Survey Study to demographic variables such as age and race. In addition, one canassess how the degree of within-cluster dependence between fellow eyes on variables suchas race, through regression as well. The issue of measuring within-cluster dependencewill be addressed in

*§*1.4, whereas more formal statistical modellings for correlated datawith the focus on regression will be the central theme of Chapter 5.

The examples introduced above, on the other hand, diﬀer from another in some impor-
tant ways. First, the type of response varies from binary (Example 1.1), count (Example1.3), continuous (Example 1.2 and 1.4) to survival subject to censoring (Example 1.5).

Secondly, they diﬀer in the structure of within-cluster dependence. For example, if thesampling unit for the Baltimore Eye Survey Study were the household in which eachmember of the sampled household was oﬀered for eye examinations, more complicateddependence structure would result since one would expect responses from family mem-bers to be correlated to each other. Furthermore, the degree of association between felloweyes is likely to be stronger than responses of eyes from two related subjects. Designs ofthis kind are special cases of the so-called "nested design." Examples include interventionstudies in which children (patients) are nested within household (clinicians) which in turnare nested within villages (clinics). Special care is required to model within-cluster de-pendence for data collected from nested designs or from family studies where the degreeof association for siblings is likely to be diﬀerent than that of parents (Qaqish and Liang,

*1.3. IMPACT OF IGNORING WITHIN-CLUSTER DEPENDENCE*
Finally, the examples diﬀer from one another with respect to their focus. In longitu-
dinal studies (Example 1.4), the regression of response variables on explanatory variablessuch as time (in weeks) from randomization, treatment status and their interactions ismost important whereas the within-cluster dependence may be viewed as a nuisance. Onthe other hand, it is typical in family studies (Example 1.2) that the primary focus ison the role genetic factors play in disease as captured by the percent of total variationin response variables attributable to genetic components. In this situation, the within-cluster association represents the primary focus even though regression adjustment forthe regression variable is crucial to separate out the environmental impact from the ge-netic ones. Table 1.1 summarizes both the common factors and diﬀerences among theﬁve examples.

Table 1.1: Some features of ﬁve examples in

*§*1.1.

Scientiﬁc Focus

**Impact of Ignoring Within-Cluster Dependence**
A natural question to raise is "what happens if one ignores the within-cluster depen-dence and uses the conventional regression method assuming independence among ob-servations?" In the Baltimore Eye Survey example, this amounts to treating 10,398 ob-servations from 5,199 individuals as if they were collected from 10,398 individuals withone observation (could be from the right or left eye) per individual. From a statisti-cal viewpoint, there are at least two consequences as a result of ignoring within-clusterdependence: incorrect assessment of precision of regression coeﬃcient estimates and in-eﬃcient estimation of regression coeﬃcients. We now address these two issues in greaterdetail through a simple model for correlated data.

Let

*Yij *be the

*j*th observation from the

*i*th cluster,

*j *= 1

*, . . , n, i *= 1

*, . . , m*. Here

*n *is the common cluster size and

*m *the number of clusters. We assume for each

*i*,
E(

*Yij*) =

*β*0 +

*β*1

*xij, *Var(

*Yij*) =

*σ*2

*,*Cov(

*Yij, Yik*) =

*σ*2

*ρ, j < k *= 1

*, . . , n.*
This model assumes the expected value of

*Y *is a simple linear function of a covariate,

*x*, and that the correlation, as measure by correlation coeﬃcient, between each pairof responses from the same cluster has the same value,

*ρ*, say. The focus here is on theinference for

*β*1 when ignoring the fact that responses from the same cluster are correlatedwith each other. This amounts to the use of the ordinary least square estimator ˆ

*CHAPTER 1. CORRELATED DATA*
estimate

*β*1 where
(

*yij − yi*)(

*xij − xi*) /
(

*xij − xi*)2

*,*
*i*=1

*j*=1

*i*=1

*j*=1
with

*yi *and

*xi *be the sample means of the

*Yi*'s and

*xi*'s from the

*i*th cluster.

Ignoring the within-cluster association leads to the use of

*V*1 =

*σ*2

*/*
(

*xij − x*)2 =

*σ*2

*/VT ,*
*j ij */ (

*nm*), as the variance estimate of ˆ

*β*1. The correct variance,

*V*2, of

*β*1 has the form

*V*2 =

*V*1

*{*1 +

*ρ*(

*nφ − *1)

*},*
where

*φ *=

*n*
*i −x*)2 /

*VT *is the fraction of the total variation in the

*x*'s explained by
the between cluster variation in

*xi*'s. Two important cases of

*φ *deserve special attention.

One is when

*φ *= 0, i.e.

*x*1 =

*. . , xm*. This occurs commonly in longitudinal studies inwhich every subject's response is measured at the same set of times. In this case,

*β*1 isestimated mainly by using the within-cluster changes in

*Y *. The other extreme cases iswhen

*φ *= 1, i.e.

*xi*1 =

*xi*2 =

*. . *=

*xin *for all

*i ∈ m*. This is typcial in studies such asthe Baltimore Eye Survey Study where all the covariates are cluster-speciﬁc.

Figure 1.1: The plot of the logarithm of

*V*1

*/V*2 versus

*ρ *for selected clustersizes

*n*(2

*, *5

*, *10) and

*φ*(0

*, *0

*.*2

*, *0

*.*5

*, *0

*.*8

*, *1

*.*0). Here,

*V*1 is the variance of theleast squared estimate ˆ

*β*1 when the within-cluster dependence is ignored
and

*V*2 is the correct variance. We have assumed E(

*Yij*) =

*β*0 +

*β*1

*xij *and

*ρ *= corr(

*Yij, Yik*)

*, j < k *= 1

*, . . , n *:

*φ *is the ratio of the between clustervariance to the total variance among the

*xij*'s.

*1.3. IMPACT OF IGNORING WITHIN-CLUSTER DEPENDENCE*
Figure 1.1 shows plots of log(

*V*1

*/V*2) = log

*{*1 +

*ρ*(

*nφ − *1)

*} *against

*ρ *for some selected

*n*'s and

*φ*'s. The message is rather clear. For within cluster comparisons, i.e.

*φ *= 0, theconﬁdence interval based on

*V*1 is wider than should be and discrepancy between

*V*1 and

*V*2 increases with

*ρ*. On the other hand, for between cluster comparisons, i.e.

*φ *= 1, theconﬁdence interval based on

*V*1 is too narrow and the discrepancy also increases with

*ρ*. In either case, invalid scientiﬁc conclusions, which could be either false positive ornegative, may be drawn if

*V*1 is used as the variance estimate of ˆ
While the ordinary least square estimate ˆ

*β*1 remains unbiased as an estimator of

*β*1, the
well known Gauss-Markov estimating theorem suggests that the uncertainty in estimating

*β*1 may be reduced by using the weighted least square estimator, ˜

*β*1 say, which properly
accounts for the within-cluster association. Under the speciﬁed correlation structure in(1.1), ˜

*β*1 has the variance of form

*V*3 =

*V*1(1

*− ρ*)

*{*1 + (

*n − *1)

*ρ} */

*{*1

*− ρ *+

*nρ*(1

*− φ*)

*}.*
Figure 1.2 shows plots of

*V*3

*/V*1 against

*ρ *for selected

*n*'s and

*φ*'s. Interestingly, no
The plot of

*V*3

*/V*2 versus

*ρ *for selected

*n*(2

*, *5

*, *10) and

*φ*(0

*, *0

*.*2

*, *0

*.*5

*, *0

*.*8

*, *1

*.*0). Here

*V*2

*, ρ, φ*, and the assumed model are the sameas described in Figure 1.1,

*V*3 is the variance of the best unbiased estimateof

*β*1.

eﬃciency loss occurs when

*φ *is equal to 0 or 1, irrespective of

*ρ *and

*n*. On the otherhand, a great deal of eﬃciency loss by using ˆ

*β*1 may result when

*φ *approaches 0

*.*5. Such
phenomenon is more apparent with increased

*ρ *but less so for

*n*. The main messagein Figure 1.2 is that ignoring correlation could lead to a substantial loss of statisticalpower especially when both within-cluster and between-cluster information is being usedto estimate

*β*1 and that it is important to examine and incorporate the within-clusterassociation structure so as to help improve upon the eﬃciency for the

*β *inference.

*CHAPTER 1. CORRELATED DATA*
**Measures of Within-Cluster Dependence**
Adequately describing the pattern of within-cluster dependence is important in severalregards: (i) it may represent the main scientiﬁc objective as is typically the case in familystudies (Example 1.2) and (ii) it may help to more precisely characterize the relationshipbetween the means and covariates as asserted in

*§*1.3.2. In this section, we brieﬂy discusshow one measures the within-cluster association and its patterns.

For continuous responses, the most commonly used measure of dependence between
a pair of responses from the same cluster is correlation coeﬃcient, deﬁned as

*j , Yk *) =

*{*
*, j < k *= 1

*, . . , n.*
This quantity takes values between

*−*1 and 1 inclusively. Very little dependence between

*Yj *and

*Yk *may be claimed if

*ρ *is close to zero and a strong association is evident if

*ρ*is close to either 1 or

*−*1. Furthermore, a positive association between

*Yj *and

*Yk*, i.e.

*ρ > *0, means

*Yj *tends to be larger than expected if

*Yk *is and vice versa. Finally, thismeasure of association has the desired property that it be dimensionless and symmetric.

For the former, it means that

*ρ *for

*Yj *and

*Yk *is the same as that for

*cYj *and

*cYk *where

*c *is a non-zero constiuent. Thus, the same magnitude of

*ρ *results whether one uses, forexample, meter or foot as the unit for height or length. For the latter, it matches with theintuition that the association between

*Yj *and

*Yk *has the same concept as that between

*Yk *and

*Yj*. For a cluster of size greater or equal to 3, it raises the question as to howone distinguishes between, for example,

*ρ*1

*,*2 and

*ρ*1

*,*3, the correlation coeﬃcient between

*Y *and

*Y*2 and between

*Y*1 and

*Y*3, respectively? The answer to this question depends onthe nature of the clustering and sometimes, on how the scientiﬁc objective is formulated.

For longitudinal studies, it is a common belief that the correlation coeﬃcient betweentwo observations adjacent in times is likely to be larger than that when two observationsare far apart in times. This is a part of the notion, known as "tracking" in longitudinalstudies. To capture this phenomenon, one may model

*ρ*(

*Yj, Yk*) =

*α tj−tk ,*
where

*tj *and

*tk *are times at which

*Yj *and

*Yk *are observed, respectively. This patternis known as the AR-1 (auto-regressive model of order one) model. For more detaileddiscussion on describing and empirically examining patterns of within-subject associationfor longitudinal data, we refer the readers to the work by Diggle (1988), Diggle, Liangand Zeger (1994) and the references therein.

For family studies of ﬁrst degree relatives, i.e. parents, siblings and oﬀspring, the
pattern of within-family association is likely to be diﬀerent from that from longitudinalstudies. For siblings, it would be sensible to assume that the degree of association,as measured by

*ρ*, is the same for each pair of siblings. However, it would be equallysensible to allow this correlation coeﬃcient (

*ρ*SS) to be diﬀerent than that of parents,

*ρ*PP. Indeed, any sensible statistical procedure should be ﬂexible enough to allow one totest the equality of

*ρ*SS and

*ρ*PP. Assuming that the same environment is shared by allrelatives in the same family, a larger

*ρ*SS than

*ρ*PP would suggest that for the continuousresponses considered, known as quantitative traits in the ﬁeld of genetic epidemiology,genetic factors may play a non-trivial role. For more detailed discussion on how one may

*1.4. MEASURES OF WITHIN-CLUSTER DEPENDENCE*
model and estimate patterns of within-family association, see, for example, Beaty et al.

(1997).

The use of correlation coeﬃcient as a measure of association is less useful for binary
responses. This is because the range of

*ρ *is narrowed considerably due to the constraintthat
max(0

*, πj *+

*πk − *1)

*< *Pr(

*Yj *=

*Yk *= 1)

*< *min(

*πj, πk*)

*.*
where

*πj *= Pr(

*Yj *= 1) and

*πk *= Pr(

*Yk *= 1). Furthermore, the degree of constraintdepends on values of the

*π*'s. Table 1.2 shows ranges of

*ρ *as a function of selected

*πj*and

*πk*. For example, if the true

*πj*(

*πk*) is 0

*.*1(0

*.*3), then the corresponding

*ρ *between

*Yj*and

*Yk *must be greater than

*−*0

*.*22, but less than 0

*.*51, much narrower than the idealsituation of (

*−*1

*, *1). This constraint in range of

*ρ *makes the use of

*ρ *much less desirablefor binary responses than it is for continuous responses, especially when characterizingwithin-cluster association represents the main objective. As an alternative, one may
Table 1.2: Ranges of correlation coeﬃcients for a pair of binary responseswith corresponding means

*πj *and

*πk*, respectively.

consider the use of the odds ratio (OR)

*j *=

*Yk *= 1) Pr(

*Yj *=

*Yk *= 0)

*j , Yk *) =
Pr(

*Yj *= 1

*, Yk *= 0) Pr(

*Yj *= 0

*, Yk *= 1)
as a measure of association between a pair of responses,

*Yj *and

*Yk*. For this quantity, noconstraint is induced other than the fact that it must be positive. A positive associationresults if OR is greater than one and a negative association may be claimed if OR isless than one. Furthermore, the odds ratio is symmetric as well and is familiar to publichealth researchers for its ease in interpretation. Just as in continuous response situations,one can model the odds ratio, or more preferably its log version, through regression. Fora more detailed discussion on the use of odds ratios for within-cluster associations, seeHeagerty and Zeger (1998) for longitudinal data and Liang and Beaty (1991) for familydata. Finally, diﬀerent measures of association between discrete variables have beensuggested in the literature; see Goodman and Kruskal (1979). We choose OR as theprimary measure of within-cluster dependence for discrete data, mainly because it iseasy to interpret and is familiar to biomedical researchers.

*CHAPTER 1. CORRELATED DATA*
**Chapter 2**
**Regression Analysis: An**

Overview
As mentioned earlier, regression analysis has been and remains one of the most commonlyused statistical tools in biomedical research. In clinical research, this technique is useful tohelp assess the degree of treatment eﬀects when randomized clinical trials are conducted.

It can also be used for clinicians to identify prognostic factors of disease for predictionpurposes. In public health research, regression analysis is instrumental to help identifyrisk factors of disease for prevention purposes. It is also useful in observational studies toadjust for confounding variables that are not considered in the design stage for matching.

In this chapter, we provide a chronicle overview on some recent developments of regressiontechniques with focus on methods including generalized linear models, quasi-likelihoodmethods and estimating function approaches. Since these methods are, by and large,likelihood-based, we ﬁrst review some key issues in likelihood-based inference that isparticularly relevant to subsequent developments.

**Some Issues Concerning Likelihood-Based Infer-**

ence
Let

*f *(

*·*;

*θ, φ*) be the probability (density) function for a random variable,

*Y *, of size

*m*indexed by two sets of parameters,

*θ *of p-dimensional and

*φ *of q-dimensional. Here

*θ*represents parameters of interest reﬂecting scientiﬁc interest and

*φ *the so-called "nui-sance parameters." Nuisance parameters, by deﬁnition, are of little intrinsic interest toinvestigators, yet necessary to fully specify the random mechanism,

*f *(

*·*;

*θ, φ*), for whichthe data,

*Y ≡ y*, are generated. A typical example of nuisance parameters is the exposedprobability of controls in case-control studies in which the interest is on the associationbetween the studied disease and the hypothesized exposure of interest, characterized bythe odds ratio,

*θ*. Upon the data were observed, the only quantities in

*f *that are un-known to investigators are

*θ *and

*φ*. The phrase "likelihood function" is formally deﬁnedas a function of

*θ *and

*φ *that is proportional to

*f *(

*y*;

*θ, φ*), i.e.

*L*(

*θ, φ*)

*≡ L*(

*θ, φ Y *=

*y*)

*∝ f *(

*y*;

*θ, φ*)

*.*
*CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW*
**Impact of Nuisance Parameters**
When the value of nuisance parameters

*φ *is known, one can contrast two distinct valuesof

*θ, θ*1 and

*θ*2, by computing the following likelihood ratio
A likelihood ratio which is greater than one would suggest that there is evidence in fa-vor of

*θ*1 instead of

*θ*2 as conveyed by the data through the likelihood function (e.g.

Royall, 1997). Furthermore, the larger the likelihood ratio in magnitude, the strongerthe evidence. However, when

*φ *is unknown to investigators, one faces the additionalcomplication due to the need of specifying

*φ *when computing LR(

*θ*1

*, θ*2). The followingexample from a teratological experiment (Weil, 1970) illustrates that in the absence ofknowledge in

*φ*, the presence of nuisance parameters may have a profound impact onthe likelihood inference for parameters of interest. Figure 2.1 shows plots of log

*L*(

*θ, φ*)against

*θ *for selected

*φ *values. Here the likelihood function is proportional to a productof 16 observations following a beta-binomial distribution indexed by

*θ*, the probability ofsurviving a 21 day lactation period for a fetus and

*φ*, the nuisance parameter characteriz-ing the so-called "litter eﬀect", an eﬀect commonly observed in teratological experiments.

More detailed treatments of analysis for teratological experiments in general and of thisexample in particular will be given in Chapter 3. Choosing

*θ*1 = 0

*.*8 and

*θ*2 = 0

*.*6 as twoplausible values for

*θ*, we note that diﬀerent values of

*φ *lead to diﬀerent conclusions as towhich

*θ *is favored more by the data. For example,

*L*(0

*, *8

*, *0

*.*8)

*/L*(0

*.*6

*, *0

*.*8) = exp(

*−*2

*.*89)strongly suggests that

*θ *= 0

*.*6 is favored over

*θ *= 0

*.*8 when

*φ *= 0

*.*8 is used, whereas

*L*(0

*.*8

*, *0

*, *1)

*/L*(0

*.*6

*, *0

*, *1) = exp(5

*.*82) strong supports the opposite conclusion.

Nuisance parameters could have non-trivial impacts on estimating

*θ *as well. The
following well known Neyman-Scott problem (Neyman and Scott, 1948) demonstratesthat with many nuisance parameters at hand, the conventional maximum likelihoodapproach for estimating

*θ *may not even be consistent at all. Consider

*{Yij, j *= 1

*, . . , n}*,independent observations from the

*i*th of

*m *independent clusters following a normaldistribution with mean

*φi *and a common variance

*θ, i *= 1

*, . . , m*. The likelihood functionfor

*θ *and

*φ *= (

*φ*1

*, . . , φm*) is proportional to

*L*(

*θ, φ*
1

*, . . , φm*)

*∝*
*θ− n*2 e
It is easy to verify that the maximum likelihood estimator (MLE) of

*θ*, as a result ofjointly maximizing the likelihood function with respect to

*θ *and

*φ*, has the form
(

*yij − yi*)2

*/*(

*nm*)

*.*
*i*=1

*j*=1
Assuming that the cluster size,

*n*, is ﬁxed by design, it can be easily shown that ˆ
converges as

*m → ∞ *to (

*m − *1)

*θ/m *instead of

*θ*. Such an undesirable phenomenonfor ˆ

*θ *is not speciﬁc to the example given above. Indeed, it is well known that when the
dimension of nuisance parameters, increase with the sample size,

*m *in this case, the MLEof

*θ *is in general inconsistent (e.g. Andersen, 1970). An important special case of this

*2.1. SOME ISSUES CONCERNING LIKELIHOOD-BASED INFERENCE*
Figure 2.1: Beta-binomial log-likelihoods for the exposed group based onthe data reported by Weil (1970).

kind is on the estimation of the common odds ratio when the one-to-one matched case-control design is adopted (Breslow and Day, 1980). Indeed, it has been shown (Breslow,1981) that ˆ

*θ *converges, as the number of matched pairs increases, to

*θ*2 instead of

*θ*.

Many likelihood-based inferential procedures have been developed to eliminate, or at
least to reduce, the impact of nuisance parameters; see for example the work of Kalbﬂeischand Sprott (1970), Andersen (1970), Cox (1972), Lindsay (1982), Cox and Reed (1987),Reed (1995) and references therein. These methods rely upon the availability of thelikelihood function that is fully and correctly speciﬁed. In

*§*2.4 we present an alternativeapproach, the estimating function approach, which also serves to eliminate (or reduce) theimpact of nuisance parameters without the need to fully specify the likelihood function.

**Cautions on Wald's Procedure**
The basis behind the frequentist approach builds on the repeated sampling, or long run,interpretation for inferential making (e.g. Cox and Hinkley 1974). A central theme ofthis approach is on testing the null hypothesis that

*θ *=

*θ*0, a pre-speciﬁed value andon computing conﬁdence intervals for

*θ*. With the exception that

*f *(

*·*;

*θ, φ*) is generatedfrom an exponential family (Lehmann, 1959), one needs to appeal to the large sampleapproximation for the purpose, for example, of deriving conﬁdence intervals. To thisend, three large sample inferential procedures have been developed and employed on adaily basis: likelihood-ratio-based, score-statistic-based and Wald-based. We now brieﬂydescribe these three procedures from the viewpoint of testing the null hypothesis that

*H*0 :

*θ *=

*θ*0. Their application to conﬁdence intervals can be converted through hy-

*CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW*
pothesis testing in a straightforward manner and will be mostly omitted. As the namesuggested, the likelihood-ratio-based method relies upon the use of the ratio of the like-lihood function evaluated at

*θ*0 and ˆ

*θ *for testing

*H*0, i.e.

1 =

*−*2 log

*φ*) are the MLE of (

*θ, φ*) and ˆ

*φ*(

*θ*0) the MLE of

*φ *under

*H*0. The score-statistic-
based procedure relies upon one of the key properties of the likelihood function, namely,
E(

*S*(

*θ, φ*);

*θ, φ*) = 0

*∀θ, φ,*
*S*(

*θ, φ*) =

*∂ *log

*L*(

*θ, φ*)

*/∂*(

*θ, φ*)

*,*
known as the score function for (

*θ, φ*). Based on (2.3), one would reject

*H*0, on theintuitive ground, if

*S*(

*θ*0

*, φ*) is "large." To assure that this assessment of

*S*(

*θ*0

*, φ*) beinglarge is not aﬀected by the unit of the

*Yi*'s, one needs to standardize

*S *by considering

*T*2 =

*St*(

*θ*0

*, *ˆ

*I*(

*θ, φ*)
is known as the Fisher information matrix for (

*θ, φ*). Finally, in contrast to the LRprocedure in which

*θ*0 and ˆ

*θ *are compared indirectly through the likelihood function, the
Wald's method uses

*θ − θ*0)

*t*(

*Iθθ − ItθφI−*1

*I*
*φφ φθ *)

* *(ˆ
to test

*H*0, here we have re-expressed

*I*(

*θ, φ*) as

*I*(

*θ, φ*) =
where, for example,

*Iθθ *= E(

*−∂*2 log

*L*(

*θ, φ*)

*/∂θ*2)

*Iθφ *= E(

*−∂*2 log

*L*(

*θ, φ*)

*/∂θ∂φ*)

*.*
Detailed treatments on these three large sample procedures can be found in Rao
(1973,

*§*6.e). In particular, it was shown that under some regularity conditions, allthree test statistics converge, as

*m → ∞*, to a

*χ*2

*p *distribution and furthermore, areequivalent asymptotically to each other, i.e. the diﬀerence between

*T*1 and

*T*2, and hence

*T*1 and

*T*3 and

*T*2 and

*T*3, is in the order of

*op*(

*m*). Despite the similarity among thesethree procedures when the sample is large, the Wald procedure possess some undesirableproperties that are not shared by the other two procedures. First, this procedure is notinvariant in the following sense. Let

*δ *=

*δ*(

*θ*) be a one-to-one function of

*θ *so that the null

*2.2. GENERALIZED LINEAR MODELS*
hypothesis of

*θ ≡ θ*0 is equivalent to

*δ ≡ δ*(

*θ*0) =

*δ*0. Unlike

*T*1 and

*T*2, the Wald's teststatistic is diﬀerent numerically depending on the choice of parameters used to formulatethe null hypothesis. The following hypothetical example on testing the hypothesis of anassociation from a 2

*× *2 table illustrates the shortcoming. With 8(2) out of 18 cases (17controls) exposed to a pre-speciﬁed risk factor, one has the MLE for

*θ*, the odds ratio,equal to 6.0 with an estimated standard error (s.e.) of 5.35. This leads to a Wald teststatistic value of (6

*− *1)2

*/*(5

*.*34)2 = 0

*.*70 with p-value

*< *0

*.*1. On the other hand, if oneused

*δ *= log

*θ *as the basis for inference, the corresponding Wald's test statistic is equalto (log 6

*− *0)2

*/*(0

*.*89)2 = 4

*.*0, highly statistically signiﬁcant at the 0.05 level.

The other drawback of the Wald procedure is that in certain circumstances

*T*3 may be
"powerless" when the true

*θ *value,

*θ *say, is far diﬀerent from

*θ*0, i.e.

*θ*0

*−θ *is large inmagnitude. This counter-intuitive phenomenon may be explained at least heuristically asfollows. For simplicity, we assume

*p *= 1 and there is no nuisance parameters, i.e.

*q *= 0.

It is true in general that the asymptotic variance of ˆ

*θ, I−*1(

*θ*), depends on

*θ*. While the
numerator of

*T*3

*, *(ˆ

*θ − θ*0)2, increases as

*θ − θ*0

* *increases since ˆ

*θ → θ*, so does the
denominator of

*T*3

*, I*(

*θ*). Under the circumstances that

*I*(

*θ*) increases at a faster ratethan (ˆ

*θ − θ*0)2 does,

*T*3 becomes arbitrarily small in spite of the fact that the true

*θ, θ *is
very diﬀerent from

*θ*0. This undesirable property of

*T*3 was ﬁrst observed and explainedfor logistic regression models by Hauck and Donner (1977). It remains an open questionas to how one may characterize such "circumstances" leading to the peculiar behavior ofthe Wald procedure.

For illustration, consider the following time to relapse data from a clinical trial re-
ported in Lee (1992). For ﬁve breast cancer patients receiving CMF (

*x *= 1) after a radicalmastectomy, the times to relapse were recorded as 23

*, *16+

*, *18+

*, *20+

*, *24+ in weeks, wherethe + sign represents censoring. The ﬁve control patients (

*x *= 0) all experienced re-lapse at 15

*, *18

*, *19

*, *19 and 20 weeks. A proportional hazard model with

*x*, the treatmentstatus, as the sole covariate was ﬁtted to the above data resulting in ˆ

*β *=

*−*16

*.*66 and
s

*.*e

*.*( ˆ

*β*) = 1513. This leads to a Wald's test statistic value of 0

*.*0001 with p-value = 0

*.*991.

The other two test statistics, the likelihood ratio and score test statistic, amount to 8

*.*76and 6

*.*90 with p-values 0

*.*003 and 0

*.*009, respectively, matching well with the intuition asconveyed by the data that the new treatment (CMF) prolongs time to relapse for breastcancer patients receiving radical mastectomy.

**Generalized Linear Models**
Multiple linear regression analysis and logistic regression analysis are daily employedin biomedical research for reasons given at the onset of this chapter. In 1972, thesemethods, among others, are uniﬁed under the framework of generalized linear models(GLMs) (Nelder and Wedderburn). We now brieﬂy review this seminal work along withseveral key properties of GLMs.

For a sample of size

*m*, let

*Yi *be the univariate response and

*xi*, a

*p × *1 vector, be
the covariates thought to be related to

*Yi, i *= 1

*, . . , m*. To specify a GLM, the followingtwo components are suﬃcient (Nelder and Wedderburn, 1972):
(A) (Random component).

*Yi *is generated from an exponential dispersion family

*CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW*
(Tweedie, 1947, J

*φ*hanssen, 1987) of the form

*f *(

*Y*
*iyi − b*(

*θi*)

*i *=

*y*;

*θi, φ*) = exp
+

*c*(

*y*
*i*;

*φ*)
(B) (Systematic component). The expectation of

*Yi*, denoted as

*µi*, is related to

*xi*
through the link function

*h*, i.e.

*µi *= E(

*Yi xi*) =

*h−*1(

*xtiβ*)

*.*
Some special cases of GLMs are as follows.

*Example 2.1 (Multiple linear regression models)*. This corresponds to a special case
with

*h*(

*µ*)

*≡ µ*, the "identity" link function,

*a*(

*φ*)

*≡ φ *and

*Yi *being normally distributedwith mean

*µi *=

*θi *and variance

*φ*.

*Example 2.2 (Logistic regression models)*. For a binary response with

*Y *= 1 or 0, this
popular model is a special case of GLMs with

*h*(

*µ*) = log

*{µ/*(1

*− µ*)

*}*, the "logit" link,

*a*(

*φ*)

*≡ *1, and

*Yi *following a binomial distribution of size 1 and probability of "success"equal to

*µi*.

*Example 2.3 (Poisson regression models)*. For count responses which are common in
cohort analysis with the number of "events" such as the death or diagnosis of a disease,this corresponds to the use of the "log" link, i.e.

*h*(

*µ*) = log

*µ, a*(

*φ*)

*≡ *1 and a Poissondistribution with mean

*µi *as the basis for inference. Note that the index

*Example 2.4 (Log-linear models)*. For data that can be expressed in contingency
tables, log-linear models are commonly used (e.g. Bishop, Fienberg and Holland, 1975).

Here as in Example 2.3, the index

*i *represents the group of individuals sharing similarcharacteristics which are categorical in nature. This model can also be cast as a GLM with

*h*(

*µ*) = log

*µ *and a Poisson distribution for

*Yi*, which is the number of individuals with

*xi*as the covariate values. A major diﬀerence between this model and the Poisson regressionmodel lies on the objective of interest. For log-linear models, the primary interest is onexamining the pattern of associations among categorical variables represented by

*x*. Forthe latter, it is of interest to identify risk factors characterized by

*x *that may be associatedwith the risk of a particular event.

We now present several key properties of GLMs that are particularly relevant to the

**First Two Moments of GLMs**
The exponential dispersion family displayed in (2.4) does not explicitly reveal how themoments of

*Yi*, in particular the ﬁrst two moments, may be related to

*θi *and

*φ*. Simplealgebra suggests that
The expression for the variance of

*Yi *explains the rationale behind the phrase "dispersionparameters" for

*φ*. Finally, given the relationship between

*θi *and

*β *through (2.5) and(2.6) above, it is clear that

*f *(

*Y *;

*θ, φ*) can also be expressed as

*f *(

*Y *;

*β, φ*).

*2.2. GENERALIZED LINEAR MODELS*
**Score Function for ***β*
Assuming that the

*Yi*'s are independent and represent a random sample from the targetedpopulation, the likelihood function for

*β *and

*φ *is simply proportional to

*L*(

*β, φ*)

*f *(

*yi*;

*β, φ*)

*i*(

*β*)

*yi − b*(

*θi*(

*β*)) +

*c*(

*y*
*i*;

*φ*)
Consequently, the score function for

*β *can be derived as

*β *(

*β, φ*) =

*∂ *log

*L*(

*β, φ*)

*/∂β *=

*i*;

*β, π*)(

*yi − µi*(

*β*))

*.*
It is important to note that the score function for

*β *depends only on the ﬁrst twomoments of the

*Yi*'s despite the full speciﬁcation of

*f *through (A) in (2.4). This formsthe primary motivation behind the use of the quasi-likelihood advocated by Wedderburn(1974) which is the focus of the next section. Furthermore, given that

*a*(

*φ*) appears as aproportional factor in var(

*Yi*), no knowledge on

*φ *is needed to derive the MLE of

*β, *ˆ
solving

*Sβ*(

*β, φ*) = 0 even though the asymptotic variances of ˆ
(

*β, φ*), does depend
on

*φ*, where
Σ

*−*1(

*β, φ*) = lim

*i*;

*β, φ*)

**MLE Versus Weighted Least Squares**
An alternative estimating procedure for

*β *is the well-known weighted least squares ap-proach. It amounts to minimizing the following objective function for

*β*, namely,

*Q*(

*β, φ*) =

*i − µi*(

*β*))2

*,*
Var(

*Yi*;

*β, φ*)
the weighted diﬀerences between the "observed" (

*Yi*'s) and the "expected" (

*µi*'s). Again,the dispersion parameter

*φ *plays no role in the estimation of

*β *through minimizing

*Q*in (2.9). While intuitive, this approach, in contrast to the common belief, may lead toinconsistent estimation of

*β *if the variance of

*Yi *depends on the mean and hence on

*β*.

To see this, note that minimizing

*Q *in (2.9) is equivalent to solving

*∂Q/∂β ≡ *0, where

*∂Q*(

*β, φ*)

*/∂β *=

*i*;

*β, φ*)(

*yi − µi*(

*β*))

*· *(

*y*
*i, β, φ*)

*i − µi*(

*β*))2
Note that the ﬁrst term in (2.10) is identical to

*Sβ*(

*β, φ*), the score function of

*β*, whereasthe second term has in general non-zero expectations regardless of the sample size

*m*.

*CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW*
The resulting solution, ˜

*β *say, would be inconsistent as an estimator of

*β *unless Var(

*Yi*)
is independent of

*β *in which case

*∂Q*(

*β, φ*)

*/∂β ≡ Sβ*(

*β, φ*) and hence ˜
This undesirable property of the weighted least squares approach can be alleviated
through the following simple modiﬁcation. Assuming the existence of a consistent esti-

*β *say, for

*β*, one remedy is to minimize instead

*Q*(

*β, φ*) =

*i − µi*(

*β*))2

*.*
*i*=1 Var(

*Yi*;

*β, φ*)
It can be shown that the minimizer of

*Q*(

*β, φ*)

*, *˜

*β *say, is now consistent for any consistent

*β *of

*β*. One such choice is the minimizer of

*i − µi*(

*β*))2. Furthermore,
it can be seen that by iterating the minimization process in (2.11) repeatedly, i.e. mini-mizing

*i − µi*(

*β*))2

*t*+1(

*β, φ*) =

*, t *= 0

*, *1

*, . . ,*
*i*=1 Var(

*Yi*;

*βt, φ*)
0 =

*β*, it leads to ˆ

*β*, the MLE. Indeed, one can show that

*β *and ˆ

*β *are asymptot-
ically equivalent to each other in that ˜

*β *=

*op*(

*m*).

An important aspect of model ﬁtting is model checking which examines the pattern ofdepartures from the ﬁtted model. There are at least two diﬀerent types of departure:the systematic departure which describes the departure of the model from the dataas a whole, and the isolated departure which identiﬁes individuals whose data are notin accordance with the ﬁtted model (McCullagh and Nelder, 1989, Chapter 12). Thissubsection focuses only on the former one.

For the

*i*th subject with

*xi *as the covariate value, there are two means to estimate

*µi*, the expected value of

*Yi*. One is to estimate by

*µi*( ˆ

*β*) based on the ﬁtted model. The
other is to simply estimate

*µi *by

*Yi *=

*yi*, an unbiased estimator. The ﬁrst approach,which uses the whole data, is legitimate only if the ﬁtted model adequately describesthe data. On the other hand, the second approach, while always legitimate, does notfully utilize the sample data except

*yi *itself. These contrasts between two approachesprovides a means to answer the question as to "how well the model ﬁts the data" bycomparing

*µi*( ˆ

*β*) with

*yi *through, for example, the likelihood function,

*L*(

*µi, φ*). Indeed
the quantity "deviance" is proportional to the likelihood ratio test statistic comparingthe null hypothesis that the ﬁtted model is adequate versus the saturated alternative,i.e. the

*µi*'s are diﬀerent and unrelated to each other. More speciﬁcally, the deviance fora ﬁtted model is formally deﬁned as

*µ*;

*Y *) =

*−*2

*{*log

*L*(

*µi*( ˆ

*β*);

*φ*)

*− *log

*L*(

*yi*;

*φ*)

*}a*(

*φ*)

*.*
Thus, a "small" value in

*D *would indicate that the ﬁtted model describes the data asa whole rather well whereas a "large" value would suggest otherwise. A few cautionarynotes on the use of the deviance, or the Pearson

*χ*2 statistic (

*Q*( ˆ

*β, φ*)), as a measure of

*2.3. QUASI-LIKELIHOOD METHOD*
goodness-of-ﬁt. First, except for the special case of normal responses, the distribution of

*D*, even asymptotically, is unknown. Second, as the hypothetical example in McCullaghand Nelder (1989, p.123) suggested, any global measure of goodness-of-ﬁt such as thedeviance may have a limited usage for choosing competitive models for the purpose ofpredictions when the

*x *considered falls outside the range formed by the observed

*xi*'s,known as extrapolation.

To adopt the generalized linear model approach one needs to specify the random compo-nent for the response variables. Very often one faces the problem where the investigatorsare uncertain about the complete random mechanism by which the data are generated.

This could be due to the fact that the underlying biologic theory is not fully understoodyet and/or no substantial (empirical) experience of similar data from previous studiesis available. On the other hand, the scientiﬁc objective can often be adequately charac-terized through regression, i.e. the systematic component may be well described. Withthis in mind, Wedderburn (1974) proposed as an alternative the quasi-likelihood methodwhich replaces the random component (A) in GLMs by
(A

) (Variance speciﬁcation). The variance of

*Yi *has the form
Var(

*Yi*) =

*a*(

*φ*)

*V *(

*µi*)

*,*
where

*V *is the known function relating the variance to the mean,

*µi*.

In the absence of the likelihood function that is available for statistical inference,
Wedderburn (1974) proposed the use of the quasi-score function to estimate

*β*, i.e. bysolving

*β *(

*β, φ*) =

*i*;

*β, φ*)(

*yi − µi*(

*β*)) = 0

*.*
This quasi-score function has the same expression as (2.7) and would be the true scorefunction for

*β *if the

*Yi*'s were indeed generated from an exponential dispersion fam-ily. McCullagh (1983) studied the asymptotic behavior of ˆ

*β*, the solution of

*Sβ ≡ *0, and
found that under some regularity conditions, ˆ

*β *is consistent and asymptotically normally
distributed with the covariance matrix equal to Σ in (2.8). Firth (1987), on the otherhand, examined the eﬃciency of ˆ

*β *relative to the MLE of

*β *when the true probability
mechanism for the

*Yi*'s is not from an exponential dispersion family. Several diﬀerenttypes of departure from the exponential dispersion family were established and, as ex-pected, high eﬃciency of ˆ

*β *is maintained when the departure of

*f *from the exponential
dispersion family is only "modest" (Firth, 1987). A remaining question to be addressedis whether ˆ

*β *or the quasi-score function,

*Sβ*, possesses any optimal property? This issue
will be the main focus of the next section. Indeed, we will consider the more generalsituations than (A

), namely
(A

) (Variance speciﬁcation). The variance of

*Yi *has the form
Var(

*Yi*) =

*Vi*(

*µi*;

*φ*)

*,*
where

*Vi *is a known function of the mean,

*µi*, and

*φ*, the dispersion parameter.

This speciﬁcation, of course, includes (A

) as a special case with

*Vi*(

*µi, φ*) =

*a*(

*φ*)

*V *(

*µi*).

*CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW*
**Estimating Function Approach**
Returning to the general setup in

*§*2.1 where we assume the data,

*Y *=

*y*, is generatedfrom a probability (density) function

*f *(

*y*;

*θ, φ*) indexed by parameters of interest

*θ *andnuisance parameters

*φ*. In the regression setting,

*θ *would be the regression coeﬃcientwhereas

*φ *would be dispersion parameters. An estimating function for

*θ *is simply afunction of the data

*y *and parameters of interest

*θ*, i.e.

*g*(

*y*;

*θ*). An estimating functionis called unbiased if
E(

*g*(

*Y *;

*θ*);

*θ, φ*) = 0
for all

*θ *and

*φ *values. An important example of unbiased estimating functions fromthe perspective of health research is the quasi-score function,

*Sβ*(

*β, φ*) in (2.12) where

*a*(

*φ*) factors out in

*Sβ*(

*β, φ*). While the formal theoretical development of estimatingfunctions was not advanced until the 1960's, scientists grappled with how to designestimating functions which combined observations and unknown interest as early as themid 18th century; see the interesting book on the history of statistics before 1900 byStigler (1986). Meanwhile, the method of moments, advocated by Karl Pearson, maybe viewed as a precursor of the modern estimating function approach. Here, selectedempirical moments such as the sample mean are equated to their expectations which arepresumably dependent on parameters of scientiﬁc interest.

The concept of unbiased estimating functions and the corresponding optimality theory
was formally developed in the year of 1960 through two independent work by Durbin(1960) and Godambe (1960). We now brieﬂy recast the essence of a series of moregeneral work by Godambe, before addressing the question concerning the optimalityof the quasi-score function. Consider the class of unbiased estimating functions

*G *=

*{g*(

*y*;

*θ*) : E(

*g*(

*Y *;

*θ*);

*θ, φ*) = 0

*, ∀θ, φ}*. A member of

*G*,

*g *say, is optimal within this class(Godambe, 1960) if it minimizes

*W *(

*θ, φ*) = E

*−*1(

*∂g*(

*Y *;

*θ*)

*/∂θ*)Cov(

*g*(

*Y *;

*θ*))E

*−*1(

*∂g*(

*Y *;

*θ*)

*/∂θ*)

*.*
When

*φ *is known, i.e. there is no nuisance parameters, Godambe (1960) showed that thescore function for

*θ *is the optimal one among

*G*. In the more practical situation where

*φ *is present and unknown, Godambe (1976) showed that

*g*(

*y*;

*θ*) =

*∂ *log

*f *(

*y t*;

*θ*)

*/∂θ*
is optical among

*G *where

*t *is a complete, suﬃcient statistic for

*φ *for ﬁxed

*θ*.

**Limitations: Impact of Nuisance Parameters**
This modern line of optimal estimating function discussed thus far is subject to the fol-lowing two major limitations. First, optimality considered is ascribed to the estimatingfunction, yet scientists and other practitioners are concerned more about estimators. AsCrowder (1989) put it: "This is like admiring the pram rather than the baby." Thisconcern may be reconciled as follows.

It can be shown that under some regularity

*β*, the solution of

*g*(

*y*;

*θ*) = 0, has minimum variance among solutions of

*{g*(

*y*;

*θ*) = 0 :

*g ∈ G}*. Thus the limitation of estimating functions noted above may notbe too serious an issue as such if one is willing to appeal to a large sample argument.

*2.4. ESTIMATING FUNCTION APPROACH*
Second, nuisance parameters can compromise the optimality theory. Speciﬁcally, the
optimal estimating function given in (2.16) relies upon the existence of a complete suf-ﬁcient statistic for

*φ *that does not depend on

*t*. This is the case when

*Y *is generatedfrom an exponential family with

*θ *and

*φ *being the "natural (or canonical)" parameters,but more generally

*t *=

*t*(

*θ*). In this more general situation,

*∂ *log

*f *(

*y t*(

*θ*);

*θ, φ*)

*/∂θ *de-pends on

*φ *as well and hence is only locally optimal (Lindsay, 1982). More seriously,the sample space of

*Y *given

*t*(

*θ*) may depend on

*θ *and consequently,

*f *(

*y t*(

*θ*);

*θ, φ*) manynot be diﬀerentiable with respect to

*θ *(Kalbﬂeisch and Sprott, 1970). It is worth notingthat the quasi-score function in (2.12) also suﬀers from this limitation if the variance of

*Yi *has the more general form in (2.13), in which case the nuisance parameter

*φ *may notfactor out in (2.13).

**Orthogonality Properties of Conditional Score Functions**
The previous subsection made it clear that with the exception of exponential families, nui-sance parameters could compromise the optimal theory. While many likelihood methodsare potentially available leading to unbiased estimating functions that are independentof nuisance parameters, there is no general guidance as to how one should proceed. Inaddition, unlike the conditional score function in (2.16), there is no known theory to sup-port the optimality of derived estimating functions. For this reason, we redirect our focuson the ﬁnding of locally optimal estimating functions in which the impact of nuisanceparameters may be minimized, if not eliminated. Avoiding the complication associatedwith the computation of

*f *(

*y tθ*;

*θ, φ*), where

*tθ *is complete and suﬃcient for

*φ *for ﬁxed

*θ*, Lindsay (1982) extends the concept of conditional score functions in this more generalsituation, i.e.

*t ≡ tθ *by deﬁning

*gC*(

*θ, φ*) =

*∂ *log

*L*(

*θ, φ*)

*/∂θ − *E(

*∂ *log

*L*(

*θ, φ*)

*/∂θ tθ*)

*,*
which reduces to

*∂ *log

*f *(

*y t*;

*θ*)

*/∂θ *when

*tθ *=

*t*. Due to the dependence of

*gC *on

*φ*, theconditional score function deﬁned above is only locally optimal among

*G *at the true

*φ *value (Lindsay, 1982). However,

*gC*(

*θ, φ*), by deﬁnition, is orthogonal to the spacespanned by the suﬃcient statistic

*tθ *for

*φ*. Thus the impact of nuisance parameterson

*gC*(

*θ, φ*) is expected, at least intuitively, to be small. The following "orthogonality"properties enjoyed by

*gC*(

*θ, φ*) represent attempts to quantify the above speculation.

These properties ((a) to (d)) are listed so that

*gC*(

*θ, φ*) implies (a), which implies (b),which implies (c), which implies (d):
(a) E(

*gC*(

*θ, *ˆ

*φθ*);

*θ, φ*) = 0 for all

*θ, φ *and any ˆ

*φθ *which is a function of

*tθ*;
(b) E(

*gC*(

*θ, φ*);

*θ, φ*) = 0 for all

*θ, φ *and

*φ*;
(c) E(

*∂gC*(

*θ, φ*)

*/∂φ*;

*θ, φ*) = 0 for all

*θ, φ *and

*φ*;
(d) Cov(

*gC*(

*θ, φ*)

*, ∂ *log

*L*(

*θ, φ*)

*/∂φ*;

*θ, φ*) = 0 for all

*θ *and

*φ*.

Interpretation of these four orthogonality properties can be seen in Liang and Zeger(1995). Furthermore, a rich class of probability (density) functions in which

*tθ *is in-deed complete and suﬃcient for

*φ *is given by Liang and Tsou (1992). In summary,we have devoted this subsection to the discussion of desirable properties possessed bythe conditional score function proposed by Lindsay (1982). These properties, namely,

*CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW*
local optimality and orthogonality to nuisance parameters, form the criteria for the nextsubsection.

Finally, several researchers, including Cox and Reid (1987), Liang (1987), Ferguson,
Reid and Cox (1991), Waterman and Lindsay (1996) have investigated strategies forapproximating the conditional score function in the more general situation where thecomplete suﬃcient statistic for

*φ *for ﬁxed

*θ *may not exist or are diﬃcult to ﬁnd.

**(Local) Optimality of Quasi-Score Functions**
We now return to the question posed in

*§*2.3, namely the potential optimality of the quasi-likelihood method, which will be addressed in more general settings. In the previous twosubsections, we have assumed implicitly that the likelihood function, indexed by

*θ *and

*φ*,is readily available for

*θ *inference. As mentioned in

*§*2.3, there are situations whereeither the investigators are uncertain about the random mechanism or the probabilitydistribution is too complicated to write down explicitly. Examples include data fromspatial processes encountered in plant ecology (Besag, 1974) and from DNA sequencing.

Nevertheless, it is common that the parameters of interest

*θ *are well deﬁned reﬂectingthe scientiﬁc interest, even though they may not completely specify the probability dis-tribution for the data

*y*. We assume that the data

*y *= (

*y*1

*, . . , ym*) is decomposed into

*m *"strata" and the

*yi*'s,

*i *= 1

*, . . , m*, are uncorrelated with each other. The sizes ofthe

*yi*'s are not required to be the same. Furthermore, we assume that

*θ *is common inmeaning to all

*m *strata and that there exists an unbiased estimating function,

*gi*(

*yi*;

*θ*),for the

*i*th stratum,

*i *= 1

*, . . , m*. A simple and trivial example of this setup is that the

*yi*'s are independent and identically distributed with mean

*θ *and variance

*φ *and a simplechoice of

*gi *would be

*yi −θ*. A natural question is how to combine

*{gi*(

*yi*;

*θ*)

*, i *= 1

*, . . , m}*together into a single unbiased estimating function for

*θ *inference that is optimal in somesense?
To address this question we consider a more restricted class (than

*G*) of unbiased
estimating functions, namely,

*G *=

*{g*(

*y*;

*θ*) =
which is a linear combination of the

*gi*'s. It can be shown (e.g. Liang and Zeger, 1995)that the optimal unbiased estimating function within

*G*,

*g *say, is
which minimizes

*W *(

*θ, φ*) deﬁned in (2.15). Furthermore, the solution of

*g *= 0

*, *ˆ
has the smallest asymptotic variance among

*{*˜

*θ *:

*g*(

*y*; ˜

*θ*) = 0, for all

*g ∈ G}*. This setup
includes several important examples:

*Example 2.5 (Quasi-score function)*. Under (A

) and (B) speciﬁed in

*§*2.3, one has

*h*(

*µi*) =

*xiθ, *Var(

*Yi*) =

*a*(

*φ*)

*V *(

*µi*)

*.*
By choosing

*gi*(

*yi*;

*θ*) =

*yi − µi*(

*θ*), one has as the optimal unbiased estimating function

*2.4. ESTIMATING FUNCTION APPROACH*
among

*G *=

*{ m a*
*i*)(

*yi − µi*(

*θ*))

*,*
the quasi-score function for

*θ*.

*Example 2.6 (Correlated data)*. For

*yi *= (

*yi, . . , yin *), the observations from the

*i*th cluster,

*i *= 1

*, . . , m*, where

*ni *is the cluster size, the objective of interest is fre-quently captured through the modeling of

*µij *= E(

*Yij*), i.e.

*h*(

*µij*) =

*xt θ *(Liang and
Zeger, 1986).

In this case, one such choice of

*gi *would be

*yi − µi*(

*θ*) where

*µi *=

*i*(

*θ*)

*, . . , µin *(

*θ*))

*t *and the optimal

*g *among

*G *=

*{g *=

*i*)(

*yi − µi*(

*θ*))

*,*
known as the generalized estimating equation (GEE) (Liang and Zeger, 1986).

*Example 2.7 (*2

*× *2

*tables)*. In case-control studies, subjects are stratiﬁed into

*m*
strata by confounding variables. The primary interest is on the degree of associationbetween the targeted disease and a dichotomous risk factor of interest as measured bythe odds ratio,

*θ*. In this case,

*yi *= (

*yi*1

*, ni*1

*− yi*1

*, yi*2

*, ni*2

*− yi*2) where

*yi*1(

*yi*2) is thenumber of exposed among

*ni*1 cases (

*ni*2 controls) in the

*i*th stratum,

*i *= 1

*, . . , m*. Itcan be veriﬁed easily that

*gi*(

*yi*;

*θ*) =

*yi*1(

*ni*2

*− yi*2)

*− θyi*2(

*ni − yi*)
is an unbiased estimating function for

*θ *from the

*i*th stratum. While
is complicated computationally,

*a*(

*θ*) reduces to 1

*/*(

*n*
*i*1 +

*ni*2) when

*θ *is evaluated at
one (McCullagh and Nelder, 1989) and this leads to the well known Mantel-Haenszelestimator of

*θ *(Mantel-Haenszel, 1959).

Implicitly, the optimality theory developed thus far assumes

*g *is functionally inde-
pendent of the nuisance parameters

*φ*. This is indeed the case for Example 2.7 if theexpectations derived in computing

*g *are conditional on

*ti *=

*yi*1 +

*yi*2, the total numberof exposed in the

*i*th stratum,

*i *= 1

*, . . , m*. In general, however, the distribution

*gi*,which is functionally independent of

*φ*, does depend on

*φ*. Thus,

*g *=

*g*(

*y*;

*θ, φ*) andis locally optimal among

*G *at the true

*φ *value as is the conditional score function

*θ*(2.17) when the probability mechanism is available for inference. One important excep-tion is the quasi-score function under (A

), i.e.

*a*(

*φ*) appears as a proportional factor inVar(

*Y *). In this case,

*a*(

*φ*) factored out and

*g *=

*g*(

*y*;

*θ*) is globally optimal among

*G*.

Even though

*g *may depend on

*φ*, we now argue that the impact of

*φ *on

*g *and on thecorresponding solution of

*g *= 0 is small. This is because

*g *shared the orthogonalityproperties (b) - (d) enjoyed by the conditional score function, i.e.

*CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW*
(b) E(

*g*(

*y*;

*θ, φ*);

*θ, φ*) = 0

*∀θ, φ *and

*φ*,
(c) E(

*∂g*(

*y*;

*θ, φ*)

*/∂φ*;

*θ, φ*) = 0

*∀θ, φ *and

*φ*,
(d) Cov(

*g*(

*y*;

*θ, φ*)

*, ∂ *log

*L*(

*θ, φ*)

*/∂φ*;

*θ, φ*) = 0

*∀θ *and

*φ*.

One important implication of these properties, at least when the sample size is large, isthat ˆ

*θ*, the solution of

*g*(

*y*;

*θ, *ˆ

*φθ*) = 0, where ˆ

*φθ *is an arbitrary consistent estimator of

*φ*,
has the same variance as that when

*φ *is known. That is, the uncertainty in estimating

*θ*,as measured by the variance, is unaltered whether

*φ *is known or not, another indicationthat

**Chapter 3**
**Analysis of Binary Data**
Binary (dichotomous) response is commonly observed in biomedical research. Examplesinclude death versus alive, ill versus healthy, injured or not in a car accident, depressed ornot in psychiatric research, etc. Sometimes there is a time element involved in deﬁning theresponse. For example, one may be interested in, as shown below in Table 3.1 whethera child diagnosed with neuroblastoma survives through a two year period. Anotherexample, shown in Table 3.2, has the response for each fetus as to whether it survives21 day lactation period or not. These two examples share similar data structure whichis the focus of this chapter, namely,
(

*yi/ni, xi*)

*, i *= 1

*, . . , m,*
where

*yi *is the number of "aﬀected" out of the

*ni *"subjects" in the

*i*th stratum and

*xi*,
Table 3.1: Proportions of children surviving 2 years free of disease followingdiagnosis and treatment for neuroblastoma by age and stage of disease atdiagnosis (Breslow & McCann, 1971).

Stage at Diagnosis
a

*p × *1 vector, covariates that are common to all

*ni *subjects. For the neuroblastomaexample,

*m *is equal to 15 with

*p *= 2 covariates, namely, age and stage at diagnosis.

For the teratological experiment example,

*m *would be 32 with one primary covariate,i.e. treatment status:

*x *= 1 if treated and 0 if control. Questions of interest for datathat can be summarized in this kind of structure include (i) can variation in

*yi/ni *beexplained by

*xi*? and (ii) can one predict the aﬀected status using

*xi*? The ﬁrst questionis important in that it provides a means to identify risk factors that may be useful forprevention purposes. The second question is useful for identifying prognostic factorswhich may have clinical and policy implementation implications.

*CHAPTER 3. ANALYSIS OF BINARY DATA*
Table 3.2: Proportions of fetuses surviving the 21 day lactation periodamong those who were alive at 4 days by treatment group (Weil, 1970).

12/12, 11/11, 10/10, 9/9, 10/11, 9/10, 9/10, 8/9,
8/9, 4/5, 7/9, 4/7, 5/10, 3/6, 3/10, 0/7
13/13, 12/12, 9/9, 9/9, 8/8, 8/8, 12/13, 11/12,
9/10, 9/10, 8/9, 11/13, 4/5, 5/7, 7/10, 7/10
In this chapter, we discuss some analytical issues that are relevant to this type of
data structure with special attention paid to the issue of statistical dependence within astratum which is common in data from teratological experiments.

Given the much longer history in multiple linear regression for continuous responses,it was natural that some earlier developments for analyzing binary data appealed tothe conventional linear regression method. This forms the basis behind the ﬁrst twoapproaches presented below.

*Approach 1 (Arcsine analysis)*. This approach amounts to creating a new response
variable through a arcsine transform on the original response variable, that is,

*y *=

*yi/ni*) and imposing the following model

*i *=

*xtiβ *+

*i,*
This transformation was motivated by the fact that the variance of

*y*, at least asymptot-
ically, is independent of the means,

*xtβ*, a desirable property to have in the conventional
linear regression approach; see

*§*2.2.3.

*Approach 2 (Empirical logit analysis)*. This approach is similar to Approach 1 except
that a diﬀerent transformation is applied to

*yi/ni*, namely,

*y *= log(

*y*
*i/*(

*ni − yi*)). This
is a special case of a comprehensive treatment for analyzing categorical data developedin the 1960's by Grizzle, Starmer and Koch (1969).

*Approach 3 (Modern analysis)*. This approach, which is a special case of GLMs, diﬀers
from the ﬁrst two approaches in one important aspect. In this modern (for the lack ofa better term) approach, the notion of transformation is applied to the expectation of

*yi/ni, πi*, rather than the data itself. In other words, for a given "link" function,

*h *say,this approach speciﬁes

*h*(

*πi*) =

*xtiβ*
and furthermore,

*yi *follows a binomial distribution. Some common choices of link func-tions include the probit link for bioassay analysis (e.g., Finney, 1964) and the logit linkfor case-control studies (Breslow and Day, 1980).

It is interesting to contrast the modern approach versus the ﬁrst two approaches. For
simplicity, we restrict our attention to the log odds transformation (Approach 2) so thatthe comparison is made between the empirical logit approach and the logistic regressionapproach.

*3.1. HISTORICAL DEVELOPMENTS*
**Empirical Logit versus Logistic Regression**
We now argue that the empirical logit approach inherits the following three drawbacksthat are not shared by the logistic regression method. First, the regression coeﬃcientsof this approach are diﬃcult to interpret: change in averaged

*Y *= log

*{Y /*(

*n − Y *)

*} *perunit change in

*x*. The transformation

*Y *is made for convenience whose average doesnot necessarily possess physical meaning of interest. This is to be contrasted with theinterpretations of the logistic regression coeﬃcients: change in log odds in risk per unitchange in

*x*. These quantities are easy to understand with sound physical meaning forbiomedical researchers. Second, the normal distribution assumption for the

*Y *'s relies
heavily on the

*ni*'s being large. As can be seen in either example, the

*ni*'s vary fromstrata to strata and are not (can't be in teratological experiments) ﬁxed or controlledby designs. This concern, on the other hand, is not shared by the logistic regressionmethod. Indeed, the latter approach can be applied even if

*ni ≡ *1 for all

*i *so longas

*m *is suﬃciently large. Third, with the exception of the intercept, all the regressioncoeﬃcients in logistic regression models are estimable should one adopt the retrospectivesampling, a commonly adopted design in epidemiologic studies. The same thing cannotbe said for the empirical logit approach. Finally, parameters in logistic regressin modelsare the "canonical" parameters in an exponential family which means the intercepts canbe eliminated through conditioning which is critical for the matched case-control designsto be implemented (Breslow and Day, 1980).

**Is the Binomial Assumption Reasonable?**
Either one of three approaches presented earlier assumes that

*Yi *follows a binomialdistribution with size

*ni *and probability

*πi, i *= 1

*, . . , m*. A natural question to ask iswhether the binomial assumption is adequate? To answer this question, we note that onenatural way to characterize a binomial variate,

*Y *say, is that

*Y *be expressed as a sum of

*n *binary observations, i.e.,

*Y *=

*Y*1 +

*. . *+

*Yn *with

*Yj *= 1 or 0 for all

*j *= 1

*, . . , n *suchthat (1) all the

*Yj*'s are statistically independent of each other and (2) Pr(

*Yj *= 1) =

*π*,the common probability for all

*j *= 1

*, . . , n*.

For the neuroblastoma example, it seems sensible to assume that the number of chil-
dren with the same age and stage of diagnosis who survived for a two-year period is wellapproximated by a binomial distribution. After all, these children are unrelated to eachother and meanwhile, comparable to each other with regard to major prognostic factorsfor neuroblastoma. On the other hand, it is well known in the teratological experimentliterature (e.g. Haseman and Kupper, 1979) of the tendency for fetuses from the samelitter (stratum in our notation) to respond more alike than fetuses from diﬀerent litterseven if they all received the same treatment. This is known as the "litter eﬀect" whichmay be explained by the fact that fetuses from the same litter share the common genesand environment which in turn leads to the "likeness" of responses from fetuses of thesame litter. This argument strongly suggest that the statistical independence assumptionof the binomial assumption may be invalid. Indeed, for the sixteen proportions from thetreated group in Table 3.2, only 25% (= 0

*.*328

*/*1

*.*295) of the total variation in the

*y/n*'sis explained by the variation induced by the binomial assumption (Liang and Hanfelt,1994).

In the next, we discuss pros and cons of approaches that have been developed to aid
investigators analyze data derived from teratological experiments.

*CHAPTER 3. ANALYSIS OF BINARY DATA*
Teratological (toxicological) experiments are designed to test whether a substance isindeed a carcinogen or toxin; and if so, it would be of interest to establish the dose-response relationship in order to provide guidances for the tolerance level regarding safetyof human populations. Typically, pregnant animals (e.g. rats) are randomized to receivethe substance of diﬀerent dose levels. The adverse eﬀect such as structural developmentaleﬀect of the substance in the fetuses is then assessed to address the question stated above.

As mentioned in

*§*3.1.2, the so-called "‘litter eﬀect" is commonly observed and con-
sequently, the binomial assumption on

*Y *, the number of aﬀected fetuses from a litter, isinvalid. One of the earliest approaches to address this issue of litter eﬀect, also knownas "extra binomial variation" or more generally "over-dispersion," is to assume that

*Y*follows a beta binomial distribution (Skellam, 1948) instead. This distribution may bemotivated and derived as follows. Consider

*yi/ni, i *= 1

*, . . , m,*
the responses (proportions of fetuses that are aﬀected) from litters receiving the samedosage. For the example presented in Table 3.2,

*m *is equal to 16 for both treated andcontrol groups. To account for variation in responses among litters that are "comparable"(regarding treatment levels) to each other, this approach assumes
(1) given

*πi, Yi πi ∼ *Binomial (

*ni, πi*)

*, i *= 1

*, . . , m*,
(2)

*πi ∼ *Beta (

*α, γ*)

*, α, γ > *0.

Consequently, the beta-binomial distribution for each

*Yi *arrived as
Pr(

*Yi *=

*yi*;

*α, γ*) =
Pr(

*Yi *=

*yi πi*)

*f *(

*πi*;

*α, γ*)

*dπi*
Γ(

*α *+

*γ*)

*πyi *(1

*− π*
*πα−*1(1

*− π*
Γ(

*α *+

*γ*)Γ(

*y*
*iα*)Γ(

*ni − yi *+

*γ*)

*.*
Γ(

*α*)Γ(

*γ*)Γ(

*α *+

*γ *+

*ni*)
The key step of this derivation is assumption (2) which acknowledges that the underlyingrisk,

*πi*, varies from litter to litter by imposing a distribution, beta in this case, on the

*πi*'s. Intuitively, the litter eﬀect is pronounced if the variance of

*πi *is large. One diﬃcultyassociated with the representation in (3.1) is lack of clear interpretability for both

*α *and

*γ*. However, noting that the ﬁrst two moments of a beta distribution have the form

*α/*(

*α *+

*γ*)

*≡ θ,*
*θ*(1

*− θ*) =

*φθ*(1

*− θ*)

*,*
*α *+

*γ *+ 1
we now argue that the new parameters,

*θ *and

*φ *are more easily interpreted. This isbecause the ﬁrst two moments of the

*Yi*'s can now be expressed as
E(

*Yi/ni*) = E(E(

*Yi/ni πi*)) = E(

*πi*) =

*θ,*
*θ*(1

*− θ*)
Var(

*Yi/ni*) =

*i − *1)

*φ*)

*.*
*3.2. TERATOLOGICAL EXPERIMENTS*
Thus

*θ *is simply the overall, or averaged, risk for those randomized to a particular dosegroup. On the other hand,

*φ*, the proportional factor in Var(

*π*), is introduced to accountfor variation that is not solely explained by the binomial variation. Furthermore, note in

*§*3.1.2 that

*Yi *can be expressed as

*Yi *=

*Yi*1 +

*. .*+

*Yin *and that these

*n*
*i *binary responses
are not statistically independent of each. If one assumes the correlations for each pairof these

*Yij*'s are the same, then it can be shown easily that this common correlationcoeﬃcient is exactly equal to

*φ*. Thus, the notion of variation between litters is equivalentto within litter "likeness" as captured by the same quantity

*φ*.

Finally, to test or to establish dose-response relationship between

*θ *and

*x*, the dosage
level, one may consider

*h*(

*θ*) =

*β*0 +

*xtβ*1

*,*
in line with the systematic component speciﬁed under GLMs.

**Some Cautionary Notes on Beta-Binomial Distributions**
While intuitive for its derivation, one may use this approach with caution from thefrequentist viewpoint. First, as shown below, the parameter of interest

*θ *and the nuisanceparameter

*φ *are highly intertwined with each other. Consequently, the presence of

*φ*is likely to have non-trivial impacts on

*θ *inference; see

*§*2.1 for general discussion onimpacts of nuisance parameters. To see this, Table 3.3 below, which is reported inWilliams (1988), shows through simulations that misspecifying

*φ *value could lead tosevere bias in

*θ *estimation. For example, if the true

*θ *and

*φ *values are 0

*.*125 and0

*.*2, respectively and a sample of 20 beta-binomial observations were generated, theaveraged MLE of

*θ *over 1,000 replications is equal to 0

*.*167 if a

*φ *value of 0

*.*4 wasassumed when deriving

*θ*. This phenomenon is also evident when the beta-binomialdistributions are ﬁtted to the Weil data in Table 3.2. Table 3.4 shows that the MLEof

*β*1 = log(

*θT /*(1

*− θT *))

*− *log(

*θC/*(1

*− θC*)) dropped to

*−*0

*.*665 when

*φT *and

*φC *wereforced to be the same. This is to be contrasted with the MLE of

*β*1 when

*φT *and

*φC *areallowed to be diﬀerent:

*−*1

*.*129. It is interesting to note that the standard error estimateof ˆ

*β*1 when assuming binomial distribution in the

*Y *'s, i.e. ignoring the litter eﬀect is too
small: 0

*.*330 compared to 0

*.*464 when a beta-binomial distribution is assumed instead.

Table 3.3: The averaged MLE (

*± *s.e.) for

*θ *for diﬀerent assumed values of

*φ *from 1,000 simulations (Williams, 1988).

True parameter values
Assumed

*φ *value

*θ *= 0

*.*125

*, φ *= 0

*.*2

*θ *= 0

*.*25

*, φ *= 0

*.*333
0

*.*1263

*± *0

*.*0012
0

*.*2530

*± *0

*.*0020
0

*.*1118

*± *0

*.*0011
0

*.*2310

*± *0

*.*0020
0

*.*1134

*± *0

*.*0011
0

*.*2266

*± *0

*.*0020
0

*.*1271

*± *0

*.*0011
0

*.*2336

*± *0

*.*0019
0

*.*1457

*± *0

*.*0012
0

*.*2491

*± *0

*.*0018
0

*.*1670

*± *0

*.*0013
0

*.*2685

*± *0

*.*0018
0

*.*1906

*± *0

*.*0014
0

*.*2903

*± *0

*.*0017
Second, the asymptotic normal approximation for the MLE of the

*β*'s may not be
ready to take place for practical sizes when ﬁtting the beta-binomial distribution to the

*CHAPTER 3. ANALYSIS OF BINARY DATA*
Table 3.4: Estimates and standard errors of

*β*1 = log

*{θT /*(1

*− θT *)

*} −*log

*{θC/*(1

*− θC*)

*} *of Weil's data presented in Table 3.2. Upper entry: com-mon

*φ*; lower entry: heterogeneous

*φ *(Liang and Hanfelt, 1994).

1

*± *s.e.

*−*0

*.*961

*± *0

*.*330

*−*0

*.*665

*± *0

*.*460 0.19

*−*1

*.*129

*± *0

*.*464 0.32 0.02
Quasi-likelihoodVar(

*Y *) =

*nθ*(1

*− θ*)

*−*1

*.*022

*± *0

*.*471 0.21

*×*(1 + (

*n − *1)

*φ*)

*−*1

*.*070

*± *0

*.*470 0.46 0.05

*−*0

*.*961

*± *0

*.*475 2.78

*φnθ*(1

*− θ*)

*−*0

*.*961

*± *0

*.*475 4.74
data at hand. Table 3.5 shows results from a simulation in which data mimicking thestructure from Table 3.2 and following beta-binomial distributions with the true

*θ*'s and

*φ*'s being the MLE's from the last row of Table 3.4 were generated 1,000 times. Eitherallowing

*φT *and

*φC *to be diﬀerent in the analysis or not, the MLE approach leads toundesirable 95% coverage probability that is far from being symmetric as would havebeen the case if the normal approximation to the distribution of MLE's were ready totake place.

Finally, for a litter of size

*n *in which

*Y *=

*Y*1 +

*. . *+

*Yn *and

*Y−j *deﬁned as
(

*Y*1

*, . . , Yj−*1

*, Yj*+1

*, . . , Yn*), one has

*j *= 1

* Y−j *)
Pr(

*Yj *= 0

* Y−j*)

*γ *+

*n −*
if

*Y *is assumed to follow a beta-binomial distribution indexed by

*α *and

*γ*. In otherwords, the odds for

*Yj *to be aﬀected, i.e.

*Yj *= 1, depends on

*Y−j*, the responses from the
rest of the litter, only through

, the total number of aﬀected. While appealing
mathematically, this expression in (3.2) raises a concern regarding the biologic plausibilityof this distribution as the litter size varies. It seems more intuitive that the eﬀect of

*Y−j*,
if any, on

*Yj *would be through its proportion, i.e.

*/*(

*n − *1), rather than through

**An Alternative Inferential Procedure**
The beta-binomial distribution is an excellent example in which (1) the probability dis-tribution is indexed by parameters of interest and nuisance parameters, (2) the nuisanceparameters have performed impacts on inference for parameter of interest and (3) yetthe conditional score function approach in

*§*2.4 is not applicable (

*tθ *=

*y *in this case).

Since the main interest of teratological experiments can be cast through regression mod-elling on the ﬁrst moment of the

*Y *'s, one alternative is to consider the quasi-likelihood

*3.2. TERATOLOGICAL EXPERIMENTS*
Table 3.5: Simulation results for

*β*1 = log

*{θT /*(1

*− θT *)

*} *from 1,000 repli-cations mimicking the data structure of Weil's. The true

*β*1 is

*−*1

*.*129 and

*φT *= 0

*.*317 and

*φC *= 0

*.*021 (Liang and Hanfelt, 1994).

95% converge prob.

*φT *=

*φC*
Quasi-likelihood1. Var(

*Y *) =

*nθ*(1

*− θ*)

*×*(1 + (

*n − *1)

*φ*)

*φT *=

*φC*
2. Var(

*Y *) =

*φnθ*(1

*− θ*)

*φT *=

*φC*
appproach detailed in

*§*2.3. Speciﬁcally, one may consider
(A

) (Variance speciﬁcation). The variance of

*Yi *has the form of
Var(

*Yi*) =

*niθ*(1

*− θ*)(1 + (

*ni − *1)

*φ*)

*, *or

*φniθ*(1

*− θ*)

*.*
(B) (Systematic component). The mean of

*Yi/ni *is modeled as
logit

*θi *=

*β*0 +

*β*1

*xi, i *= 1

*, . . , m.*
Note that the ﬁrst variance expression corresponds to that induced by the beta-binomialdistribution whereas the second one is induced from the exponential-dispersion family. Asdiscussed in

*§*2.4.3, while the quasi-score function does depend on

*φ*, yet the impact of

*φ*are minimal at least from the theoretical viewpoint. Results below provide some empiricalevidence of the above assertion. Table 3.4 shows for the Weil data that the estimators andthe corresponding s.e.'s from the quasi-likelihood approach are very similar to each otherregardless of the choice of the variance functions and whether

*φT *and

*φC *are assumed tobe the same or not. This observation is conﬁrmed by the simulated study presented inTable 3.5, which shows that this approach, unlike the beta-binomial likelihood approach,leads to excellent coverage probabilities, small bias and excellent agreement between thesample variance and the averaged estimated variance of ˆ

*β*1. More extensive simulations
which mimic the data structures that are common in practice can be found in Liang andHanfelt (1994).

In summary, we review in this section, two diﬀerent approaches for analyzing data
from teratological experiments which confront the non-trivial litter eﬀects. Both theo-

*CHAPTER 3. ANALYSIS OF BINARY DATA*
retical arguments and empirical results support the use of the quasi-likelihood methodas a sensible alternative to the conventional beta-binomial approach.

**Further Relaxation of Binomial Assumptions**
Recall in

*§*3.1.2 we discussed two components that constitute the binomial variate

*Y *=

*Y*1 +

*. . *+

*Yn*, namely, the statistical independence and identical distribution in the

*Yj*'s,

*j *= 1

*, . . , n*. In

*§*3.2 we addressed ways to relax the ﬁrst assumption of statistical inde-pendence. In this section, we review some recent developments for attempts to furtherrelax the second assumption of identical distributions. This work may be motivated infamily studies where responses from

*n *related individuals are not only correlated witheach other due to shared genes and environment, but have diﬀerent risks of disease dueto diﬀerent make up in risk factor proﬁles consisting of, for example, age, sex, etc. Draw-backs of these approaches are highlighted so as to provide further motivation for moremodern developments for modelling correlated data to be presented in Chapter 5.

Let

*xj *be the

*p × *1 vector of covariates for the

*j*th component of a cluster of size

*n*.

Building on the expression in (3.2), Rosner (1984) proposes to consider for

*j *= 1

*, . . , n*
*j *= 1

* Y−j , x*1

*, . . , xn*)
Pr(

*Yj *= 0

* Y−j, x*1

*, . . , xn*)

*y*)

*− *log(

*γ *+

*n −*
*y − *1) +

*βtxj.*
It models explicitly the dependence of the log odds for

*Yj *to be one on

*Y−j *and

*x*1

*, . . , xn*
=

*j *and

*xj *for each

*j *= 1

*, . . , n*. Using the Hammersley-Cliﬀord Theorem
(Besag, 1974), this speciﬁcation in (3.3) does lead to a legitimate joint distribution for

*Y *= (

*Y*1

*, . . , Yn*) given (

*x*1

*, . . , xn*), namely,
1 =

*y*1

*, . . , Yn *=

*yn x*1

*, . . , xn*)
=

*sbn−s *exp
1 =

*. . *=

*Yn *= 0

* x*1

*, . . , xn*)
(

*α *+

)

*, s ≥ *1

*,*
(

*γ *+

)

*, s ≥ *1

*,*
*b*0 = 1

*,*
As mentioned in

*§*3.2.1, the eﬀect of

*Y−j *on

*Yj*, as modeled in(3.3), depends on

=

*j *,
which is a function of the cluster size

*n*. More seriously, the eﬀect of

*xj *on

*Yj*, as

*3.3. FURTHER RELAXATION OF BINOMIAL ASSUMPTIONS*
measured by

*β*, is likely to be diminished as

*xj *is competing directly against

*Y−j *inexplaining the variation in

*Yj*. Furthermore, the random eﬀect motivation conveyed inthe original beta-binomial deviation (i

*.*e

*. π ∼ *Beta(

*α, γ*)) seems to disappear when onetries to incorporate covariate eﬀects for each component

*j*.

Approaching the same issue from a totally diﬀerent angle, Bahadur (1961) proposed thefollowing join distribution for (

*Y*1

*, . . , Yn*) given (

*x*1

*, . . , xn*)
Pr(

*Y*1 =

*y*1

*, . . , Yn *=

*yn x*1

*, . . , xn*)

*j *=

*yj xj *)

*ρjkzjzk *+
+

*. . *+

*ρ*
*, µj *= Pr(

*Yj *= 1

* xj*)

*, j *= 1

*, . . , n.*
*µj*(1

*− µj*)
This is termed "additive model" in that the ratio of
Pr(

*Yj *=

*yj, j *= 1

*, . . , n x*1

*, . . , xn*)

*n *Pr(

*Y*
*j *=

*yj xj *)
is additive in the products of the

*yj*'s. As a special case, Kupper and Haseman (1978)consider
Pr(

*Yj *=

*yj, j *= 1

*, . . , n x*1

*, . . , xn*)
Pr(

*Yj *=

*yj xj*)(1 +

*ρ*
where

*ρ *is the common correlation coeﬃcient for each pair of the

*yj*'s and

*µj ≡ µ *forall

*j *= 1

*, . . , n*. A major drawback of this model, as acknowledged by Kupper andHaseman (1978), is that there is no guarantee that the quantities on the right hand sideof (3.5) are non-negative, a necessary condition for probability functions. Furthermore,as pointed out in

*§*2.4, the range of

*ρ *is constrained by the

*µj*'s and as in the beta-binomial distribution, the

*µj*'s and

*ρ *are likely to be highly intertwined with each other.

These concerns have limited the usage of this approach.

Similar in spirit to the Rosner model, this third approach considers for

*j *= 1

*, . . , n*
*j *= 1

* Y−j , x*1

*, . . , xn*) =

*α *+

*γ*
*j *+

*βtxj .*
*j *= 0

* Y−j , x*1

*, . . , xn*)

*CHAPTER 3. ANALYSIS OF BINARY DATA*
Again, utilizing the Hammersley-Cliﬀord theorem, this leads to (Altham, 1978; Connollyand Liang, 1988)
Pr(

*Yj *=

*yj, j *= 1

*, . . , n x*1

*, . . , xn*)
Pr(

*Yj *= 0

*, j *= 1

*, . . , n x*1

*, . . , xn*)

*j *+

*γ*
*yj − *1) +

*βt*
Due to the similar representations in (3.6), this model is subject to the same drawbacksinherited by the Rosner model discussed earlier.

In summary, we have reviewed in this section three diﬀerent approaches to extend the
binomial distribution allowing statistical dependence and diﬀerent risks among the

*Yj*'sfrom the same cluster. Drawbacks of these approaches are highlighted. In this regard,the treatment of this chapter may be viewed as a preview of more modern developmentsfor modelling correlated data to be presented in Chapter 5.

**Chapter 4**
**Analysis of Polytomous and**

Count Data
In biomedical research, categorical response with three or more possible categories, knownas polytomous response, is also common in practice. Examples of this kind includedegree of injury (mild, modest and severe), diﬀerent cell types of lung cancer (e.g.,small, squamous, large, etc.) and income level (low, medium and high). Just as inbinary response, researchers are interested in identifying factors that are related to thepolytomous response for prevention, intervention and prediction purposes. The ﬁrstpart of this chapter reviews a variety of statistical models that have been proposed inthe literature to address the above issue. We discuss through several real examples thepros and cons of each approach with the emphasis on the interpretations of regressioncoeﬃcients induced by the models.

In social science research, data are frequently expressed in contingency tables since
variables observed are mostly categorical. The scientiﬁc interest is often on how thesevariables are related to each other. For that purpose, log-linear models (e.g. Bishop,Fienberg and Holland, 1974) have been developed since the 1950's to address the issueof patterns of association among variables collected from the same subjects. Attemptshave also been made in the past decade or two to utilize this approach to help analyzethe types of correlated data introduced in Chapter one. The second part of this chapteris devoted to critically reviewing the pros and especially the cons of log-linear modelswith emphasis on the interpretations of coeﬃcients implied by the models. One usefulreference for this topic is Liang, Zeger and Qaqish (1992) and the discussion therein.

**Types of Polytomous Data**
Examining examples of polytomous responses listed earlier, one notes that they can beclassiﬁed into at least three diﬀerent types in terms of scale which require diﬀerent at-tention for modelling. The ﬁrst type is in a nominal scale. Here categories are regardedas exchangeable such as cell types of lung cancer, blood type or hair color (red, black,brown, etc.). The second type is in a ordinal scale where categories are ordered like theordinal numbers (ﬁrst, second, third, etc.). Examples include degree of injury, perception

*CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA*
of food quality and for that matter hair color (light, medium, dark, etc.). It is importantto point out for ordinal responses that while ordered, the notion of "distance" or "spac-ing" between categories is arbitrary and in some instances, irrelevant. Furthermore, inmany instances, choice and deﬁnition of categories are either arbitrary or subjective. Animportant implication of this observation is that any sensible analytical procedure shouldhave the desired property that same conclusions be drawn if new categories are formed bycollapsing adjacent categories of the original scales. As an illustrative example, considerthe following data from a 3

*× *3 table relating the height (short, medium or tall) of thehusbands to that of the wives. The simple chi-squared test gives rise to a value of 2.9which provides little evidence of association between husbands and wives heights. Thesame conclusion is drawn if one either collapses the "short" and "medium" categoriestogether (

*χ*21 = 1

*.*45) or collapses instead the "medium" and "tall" categories together(

*χ*21 = 0

*.*90).

The third type of polytomous response is the interval scale in which scores or numericallabels are frequently attached. Examples of this kind include salary, income, age etc. inwhich the scores may be the midpoints of the intervals deﬁned according to the originalscales. In this situation, it is common that diﬀerences between scores from diﬀerentcategories be interpreted as a measure of separation of the categories.

In the next section, we discuss the pros and cons of three commonly used regression
models for polytomous responses of either nominal or ordinal scales. Throughout we willuse the following two examples for illustration.

*Example 4.1 (Disturbed dream)*. The data presented below is taken from Maxwell
(1961) on the association between degree of suﬀering from disturbed dreams and ageamong 223 boys age 5 to 15. It is clear that the response variable, severity of disturbeddreams, is ordinal while the explanatory variable aged is an interval scale.

Degree of suﬀering from

*Example 4.2 (Tonsil size)*. The data below is from Holmes and Williams (1954) who
classify 1398 children aged 1-15 years according to their relative tonsil size and whether

*4.2. REGRESSION MODELS FOR POLYTOMOUS RESPONSES*
or not they were carriers of streptococcus pyogenes (SP). The response variable here,tonsil size, is also ordinal and the interest is on assessing the nature and direction of thepossible eﬀect of SP on the enlargement of tonsil size.

**Regression Models for Polytomous Responses**
Consider a response variable,

*Y *, which has

*C *+ 1 possible categories, i.e.

*Y *= 0

*, *1

*, *2

*, . .*

or

*C *with

*C *being a positive integer. The interest is on the relationship between

*Y *and

*x*, a

*p × *1 vector of covariates.

**Polytomous Logistic Regression Models**
This model represents a straightforward extension of the logistic regression model forbinary response by contrasting the odds for

*Y *to be

*j *= 1

*, . . , C *instead of 0, i.e.

Pr(

*Y *=

*j x*)
Pr(

*Y *= 0

* x*)

*j *+

*βt*
*j *= 1

*, . . , C.*
It can easily be seen that the distribution function for

*Y *given

*x *is fully speciﬁed by the

*C *logistic regression models above, namely,
Pr(

*Y *=

*j x*) =

*, j *= 0

*, *1

*, . . , C.*
with

*α*0 = 0 and

*β*0 = 0, a vector of

*p *zeros. Here

*βj *characterizes the eﬀect of

*x *onthe log odds for being the

*j*th category instead of the category zero,

*j *= 1

*, . . , C*. It isinteresting to point out that for a pair of categories, 0

*< j < *, the log odds for

*Y *to be

* *instead of

*j *is fully determined by (4.1) as
Pr(

*Y *=

* x*)
Pr(

*Y *=

*j x*)

* − αj *) + (

*β − βj *)

*tx,*
which is linear in

*x *as well. The simple equality of

*βj *=

*β −*(

*β −βj*), results of (4.1) and(4.2), explains why subjects whose

*Y *values are diﬀerent than 0 and

*j *are informativeabout

*βj*. This model is particularly useful for nominal response where categories areconsidered exchangeable. Finally, assuming that

*Yi *follows a multinomial distribution,likelihood inference for

*αj*'s and

*βj*'s can be carried out and is available in SAS (PROCCATMOD).

**Proportional Odds Models**
For ordinal responses, McCullagh (1980) proposes the use of proportional odds modelswhich can also be expressed in

*C *logistic regression models:

*j *(

*x*)

*j *+

*βtx,*
*j *= 1

*, . . , C,*
*j *(

*x*)

*CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA*
where

*γj*(

*x*) = Pr(

*Y ≥ j x*). This model focuses on the modelling of

*γj*(

*x*),the cumulativeprobability of

*Y *to be in the jth category or higher given

*x*, which is meaningful onlyif

*Y *is ordinal. It assumes that the eﬀect of

*x *on those

*C *log odds is the same for all

*j *= 1

*, . . , C*, an assumption which can be checked empirically. An important featureof this model is that there is no need to assign scores to the

*C *+ 1 categories of theresponse variable. Instead, the ordinal feature of

*Y *is incorporated (or acknowledged) by(1) modelling through

*γj *and (2) assuming that the eﬀect of

*x *on log

*{γj*(

*x*)

*/*(1

*− γj*(

*x*)

*}*is the same, or at least is in the same direction. Furthermore, the interpretation of

*β*is well preserved: the change in log odds for suﬀering more than less (Example 4.1) perunit change in

*x*. Finally, this model has been implemented in SAS as well (PRODLOGISTIC).

Returning to the disturbed dream example, Table 4.1 shows the MLEs of the

*αj*'s
and

*βj*'s when both the polytomous logistic regression model and the proportional oddsmodel are entertained with a single continuous

*x*: age in years (midpoint of each ageinterval). Results from the proportional odds model ﬁtting suggest a strong negativerelationship between suﬀering from disturbed dreams and age among 5-15 year old boysin that the odds of suﬀering from disturbed dreams decreases by a factor of 0

*.*8(= e

*−*0

*.*229)per year. Results from ﬁtting the polytomous logistic regression model, which ignores theordinal feature of the response variable, lead to the similar conclusion qualitatively. Wenote, however, that the standard error estimate of ˆ

*β*(0

*.*052) is, as expected, much smaller
than that of the three ˆ

*βj*'s (0

*.*078

*, *0

*.*078 and 0

*.*081, respectively). This example serves
to illustrate that proper modelling through acknowledging the ordinal feature of theresponse variable could lead to a substantial gain in eﬃciency for regression coeﬃcientsof interest.

Table 4.1: Regression estimates (

*± *s.e.) based on both proportional oddsmodels and polytomous logistic regression models for the disturbed dreamexample (Maxwell, 1961).

Proportional odds
Polytomous logistic
6

*.*402

*± *1

*.*403
6

*.*271

*± *2

*.*108
5

*.*586

*± *1

*.*387
4

*.*784

*± *2

*.*115
4

*.*571

*± *1

*.*377
7

*.*792

*± *2

*.*169

*−*0

*.*263

*± *0

*.*078

*−*0

*.*208

*± *0

*.*078

*−*0

*.*323

*± *0

*.*081

*−*0

*.*229

*± *0

*.*052

**Continuation Ratio Models**
There are situations where the response variable is "ordered" in a hierarchical (or nested)fashion. Examples include

*4.2. REGRESSION MODELS FOR POLYTOMOUS RESPONSES*
*Example 4.3 (Teratological experiment)*. Chen et al. (1991) present data from a ter-
atological experiment in which each litter was randomized into one of three levels ofexposure (low, medium and high) to hydroxyures. Here fetuses from a litter could bedead or alive at birth, and if alive, they could be malformed or normal after a period oftime; see the diagram below. Here the relationship between the proportions of fetusesthat are dead and the dosage level allow investigators to examine the embryo-toxic eﬀectof the targeted substance; whereas the proportions of alive fetuses that are malformedcan be used to address the structural developmental eﬀect of the substance. Here theresponse variable has three categories with a natural order of death, alive but malformedand alive and normal, a decreasing order of health severity. It is hierarchical (or nested)in that the issue concerning the risk for being malformed is only meaningful to thosefetuses who are alive at birth.

*Example 4.4 (Insemination experiment)*. McCullagh and Nelder (1989, p.62) describe
an experiment in which milch cows receive treatments to induce pregnancy in such away that each cow will continue the experiments until pregnant as shown below. Herethe response variable represents a series of 1

*/*0 response indicating whether the cow ispregnant or not. However, this response variable is hierarchical (or nested) in that the

*j*th response would be recalled only if the cow was not pregnant in the previous

*j − *1experiments.

*Example 4.5 (Tonsil enlargement)*. Returning to the tonsil size example (Example
4.2), one might argue that the response variable can be expressed in a way that issimilar to that in Example 4.3. Here each subject may be ﬁrst classiﬁed as enlargedor not regarding the tonsil size; and for those experiencing enlargement, they can befurthered classiﬁed as either greatly enlarged or slightly enlarged as shown below. Herethe investigators are in the position to address two related but diﬀerent questions, the ﬁrstbeing whether carrying SP is associated with tonsil enlargement, and the second beingwhether carrying SP is associated with serious enlargement given tonsil enlargement hastaken place.

**Diagram 1 **(Teratological experiment)

*CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA*
**Diagram 2 **(Insemination experiment)

**Diagram 3 **(Tonsil enlargement)

slightly enlarged
For ordinal response of this kind, it is clear that proportional odds models may not be
appropriate. Instead we consider the continuation ratio model by Mason and Fienberg(1985)
logit (

*πj*(

*x*)

*/γj*(

*x*)) =

*αj *+

*βtjx, j *= 0

*, *1

*, . . , C − *1

*,*
where

*πj*(

*x*) = Pr(

*Y *=

*j x*) and

*γj*(

*x*) = Pr(

*Y ≥ j x*). Like the other two models,this model is represented by

*C *logistic regression models. Here

*πj*(

*x*)

*/γj*(

*x*) can bere-expressed as Pr(

*Y *=

*j Y ≥ j, x*), the conditional probability for

*Y *to be

*j *giventhat

*Y *is either in the

*j*th category or higher. It is rather clear that this conditionalprobability is more meaningful than

*γj*(

*x*) itself in reﬂecting the scientiﬁc question ofinterest. Note that in (4.4) we have allowed the eﬀect of

*x *on

*πj*(

*x*)

*/γj*(

*x*) to be diﬀerentfor diﬀerent

*j*. This would be a sensible approach for Examples 4.3 and 4.5 as, forinstance, there is no scientiﬁc reason to expect that the embryo-toxic eﬀect to be the sameas the structural developmental eﬀect. On the other hand, it may be equally sensible

*4.3. LOG-LINEAR MODELS FOR CONTINGENCY TABLES*
to allow the

*βj*'s to be the same in Example 4.4, which characterizes the eﬀect of

*x *onthe probability for being pregnant allowing the probability itself (i.e. the

*αj*'s) to varyacross experiments. Nevertheless, this assumption of common

*β *can always be examinedstatistically. Returning to the tonsil size example, Table 4.2 gives the MLEs and theirs.e. when both the proportional odds model and the continuation ratio model were ﬁttedto the data shown earlier. Here the sole covariate is the SP carrier status (

*x *= 1 for beingcarrier and 0 if not). Both models lead to conclusions of strong evidence of associationbetween SP carrier and tonsil enlargement, even though the regression coeﬃcients fromthe two models have diﬀerent interpretations. Results from the proportional odds modelsuggest that the odds for having more serious tonsil size enlargement is 83% (= e0

*.*603

*−*1) higher for the carrier than the non-carriers. On the other hand, results from thecontinuation ratio model indicate that the odds for having tonsil enlargement is 67%(= e0

*.*514

*− *1) higher for the carriers than the non-carriers; whereas the odds for havingtonsil size greatly enlarged among those with tonsil enlargement is 72% (= e0

*.*543

*− *1)higher for the carriers than the non-carriers.

Table 4.2: Regression estimates (

*± *s.e.) based on proportional odds mod-els and continuation ratio models for the tonsil size example (Holmes andWilliams, 1954).

Proportional odds
Continuation ratio
0

*.*509

*± *0

*.*056
0

*.*512

*± *0

*.*051

*−*1

*.*363

*± *0

*.*067
1

*.*245

*± *0

*.*093
0

*.*514

*± *0

*.*273
0

*.*543

*± *0

*.*286
0

*.*603

*± *0

*.*227

**Log-Linear Models for Contingency Tables**
Log-linear models represent a popular approach to analyze data that can be summarizedin contingency tables. One important feature of this approach is that there is no needto specify response variables. This is especially useful in social science research withthe aim to examine the relationship (or the association) among variables considered.

More recently, this log-linear model approach has been advocated by some researchers todeal with correlation data. Examples include cross-over trials where patients receive, bydesign, diﬀerent treatments over several periods (Jones and Kenward, 1987) and geneticstudies on detection of maternal eﬀects of disease genes (Weinberg, Wilcox and Lie,1998). In this section, we brieﬂy review the work on log-linear models with the emphasison interpretations of speciﬁed parameters. It serves to point out the potential for drawingincorrect conclusions when using log-linear models to analyze contingency tables data ordiscrete correlated data.

*CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA*
**Interpretations of Log-Linear Model Parameters**
We start with a simple

*I × J *contingency table in which

*Yij *represents the number ofindividuals who are in the

*i*th(

*j*th) category of variable 1 (variable 2),

*i *= 1

*, . . , I, j *=1

*, . . , J*. Let

*Pij *be the corresponding probability for an individual to be in the (

*i, j*) cellof the

*I × J *table. With

*m *being the sample size, one has immediately

*µij *= E(

*Yij*) =

*mPij*. It is intuitive that two categorical variables of

*I *and

*J *categories, respectively, arenot associated with each other (or statistically independent) if

*Pij *=

*Pi*+

*P*+

*j,*
where

*Pi*+ and

*P*+

*j *are the marginal probabilities, e.g.

*Pi*+ is the probability that anindividual falls into the

*i*th category of the ﬁrst variable. The expression in (4.5) leads to
log

*n *+ log

*Pi *+ log

*P*+

*j*
*µ *+

*αi *+

*βj*
with

*α*1 = 0 =

*β*1 to avoid over-parameterizations. Thus, one can test the assumptionof statistical independence between the two variables by computing the deviance whichcontrasts this no association (or null) model with the saturated model where all the

*I × J Pij*'s are unspeciﬁed. When this null model is rejected, a natural question, froma modelling viewpoint, is how one may modify (4.6) to acknowledge (or to characterize)the dependence (or association) between two variables. One obvious approach is to addto (4.6) additional parameters (

*αβ*)

*ij*, i.e.

log

*µij *=

*µ *+

*αi *+

*βj *+ (

*αβ*)

*ij*
with

*α*1 =

*β*1 = 0 = (

*αβ*)1

*j *= (

*αβ*)

*i*1, for all

*i *= 1

*, . . , I, j *= 1

*, . . , J*. The followingrepresentation clearly qualiﬁes the (

*I − *1)

*× *(

*J − *1)(

*αβ*)

*ij*'s as "association parameters,"namely, for

*i *= 1

*, . . , I *and

*j *= 2

*, . . , J *,
is the log odds ratio contrasting the (

*i, j*) cell versus the (1

*, *1) cell; see the 2

*× *2 tablebelow
An important question is how to incorporate, under the framework of log-linear models,the additional information that (some of) the variables are ordinal? As an illustration,the table below gives the 12 empirical log odds ratios from the disturbed dream example.

It shows the pattern that the estimated log odds ratios decrease in age for each level ofsuﬀering and decrease in degree of suﬀering from disturbed dream for each age group.

This suggests that one approach is to assign scores,

*si, i *= 1

*, . . , I, tj, j *= 1

*, . . , J *toboth variables and model (

*αβ*)

*ij*, the log odds ratios, accordingly, e.g.

(

*αβ*)

*ij *=

*sitjβ.*
*4.3. LOG-LINEAR MODELS FOR CONTINGENCY TABLES*
Degree of suﬀering from disturbed dreamNot severe
Returning to the disturbed dream example, we have ﬁtted the above model i.e. (4.8),to the 5

*× *4 contingency table with

*si *= mid-age of the

*i*th age group,

*i *= 1

*, . . , *5 and

*tj *=

*j, j *= 1

*, . . , *4. The MLE for

*β *gives

*−*0

*.*205 with s.e. 0

*.*050 which suggests thatthe odds for suﬀering from disturbed dream with higher unit decreases by a factor of0

*.*81(= e

*−*0

*.*205) per year for boys between 5 and 15 years of age. It is interesting tocontrast this approach with the proportional odds model approach in

*§*4.2.2; see Table4.1. Both approaches lead to the same conclusion that there is a convincing negativeassociation between age and degree of suﬀering from disturbed dreams. One importantdistinction between these two modelling approaches is on the assumptions being madefor the variables and the interpretability of the primary parameter of interest,

*β *in (4.3)and

*β *in (4.8). While both assigning scores to the age variable, which is the intervalscale, the log-linear model approach makes the further assumption by assigning scores tothe other variable, which is considered as a response variable for the proportional oddsmodel approach. One argument against this additional assumption is that it is ratherarbitrary. Speciﬁcally, by assuming

*tj *=

*j, j *= 1

*, . . , *4, one is making the assumptionthat, for example, the "diﬀerence" in degree suﬀering from disturbed dreams between thethird and ﬁrst categories is twice the "diﬀerence" between the second and ﬁrst categories.

This arbitrariness makes the

*β *coeﬃcient in (4.8) diﬃcult to interpret. The

*β *coeﬃcientinduced by the proportional odds model, on the other hand, does not rely on the assigningof scores for the response variable and is easy to interpret. We note that the deviancesfor both models with 11 degrees of freedom, are very comparable to each other (12

*.*42for the proportional odds model and 14

*.*08 for the log-linear model). This example alsoserves to illustrate that goodness-of-ﬁt measures such as the deviance are secondary (orless relevant) in choosing between models compared to the interpretability of primaryparameters induced by the models.

Moving on to more practical situations of

*I × J × K *contingency tables formed
by three categorical variables of

*I, J *and

*K *levels, respectively, let

*Yijk*(

*µijk*) be thenumber (expected number) of individuals who are in the

*i*th

*, j*th and

*k*th category ofvariables 1, 2 and 3, respectively,

*i *= 1

*, . . , I, j *= 1

*, . . , J *, and

*k *= 1

*, . . , K*. Again,the primary interest is on how these three variables are associated with each other. Forthree-dimensional or higher-dimensional contingency tables, more log-linear models areavailable in additional to the obvious null model
log

*µijk *=

*µ *+

*αi *+

*βj *+

*γk, α*1 =

*β*1 =

*γ*1 = 0
and the saturated model
log

*µijk *=

*µ *+

*αi *+

*βj *+

*γk *+ (

*αβ*)

*ij *+ (

*αγ*)

*ik *+ (

*βγ*)

*jk *+ (

*αβγ*)

*ijk.*
Some interesting intermediate models include
(A) log

*µijk *=

*µ *+

*αi *+

*βj *+

*γk *+ (

*αβ*)

*ij*,

*CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA*
(B) log

*µijk *=

*µ *+

*αi *+

*βj *+

*γk *+ (

*αβ*)

*ij *+ (

*αγ*)

*ik*,
(C) log

*µijk *=

*µ *+

*αi *+

*βj *+

*γk *+ (

*αβ*)

*ij *+ (

*αγ*)

*ik *+ (

*βγ*)

*jk*.

To better understand the interpretation of parameters, in particular (

*αβ*)

*ij*, speciﬁedabove, we consider the following log odds ratio
log

*ijkP*11

*k ≡ *log OR
which is the log odds ratio relating variables 1 and 2 among (conditional on) thoseindividuals who are in the

*k*th category of variable 3. Depending on models chosen,log OR

*ij k *has diﬀerent expressions and hence diﬀerent interpretations. For the nullmodel, which implies no association among three variables, log OR

*ij k *is equal to zero asexpected. For model (C), this log odds ratio reduces to (

*αβ*)

*ij*, which is the constant logodds ratio common to all

*K *categories of variable 3. On the other hand, if the saturatedmodel is entertained, log OR

*ij k *amounts to log odds ratio only for those who are in the

*k*th category of variable 3.

In summary, log-linear models have the following two important features that are
worthy of noting. First, interpretations of lower order terms (e.g. (

*αβ*)

*ij*) depend on thepresence or absence of higher order terms (e.g. (

*αβγ*)

*ijk*). More importantly, the two-wayassociation parameters (e.g. (

*αβ*)

*ij*) have the "conditional" log odds ratio interpretations,i.e. log OR relating two variables adjusting for other variables. One important implicaitonof the second feature stated above is the following: when more than one variable isconsidered as response variables, is it sensible to adjust for other response variables ifa primary question of interest is the association between a particular response and anindependent variable? We will return to this in the next chapter.

**An Alternative Representation of Log-Linear Models**
As mentioned at the onset of

*§*4.3, there has been some attempts recently to adoptlog-linear models as a means to model correlated data. At ﬁrst glance, this may notseem obvious as log-linear models are applied to aggregated data whereas the interestfor the latter is to model the joint distribution of correlated data from a cluster. Themain purpose of this subsection is to show brieﬂy how to bridge the gap betwen thesetwo seemingly unrelated tasks. The other purpose is through this new representationto reinforce some potential concerns regarding interpretations of regression coeﬃcientsinduced by log-linear models, which were mentioned in

*§*4.3.1.

Recall that what one observes for each subject in the context of contingency tables
can be expressed as

*Y *= (

*Y*1

*, . . , Yj, . . , Yn*)

*t,*
where

*n *is the number of categorical variables under investigation and

*Yj *is a categoricalvariable of

*Cj *levels,

*j *= 1

*, . . , n*. In the disturbed dream example,

*n *= 2 and

*C*1 = 5and

*C*2 = 4 if

*Y*1 and

*Y*2 represent age and degree suﬀering from disturbed dreams. Forsimplicity, we assume

*n *= 3 and

*Cj *= 2 for all

*j *= 1

*, *2

*, *3 so that

*Yj *= 1 or 0 dependingon whether the subject falls in the second or the ﬁrst category, respectively. Using thesame notations as in

*§*4.3.1, we now consider a joint distribution for

*Y *= (

*Y*1

*, Y*2

*, Y*3)

*t*
*4.3. LOG-LINEAR MODELS FOR CONTINGENCY TABLES*
Pr(

*Yj *=

*yj, j *= 1

*, *2

*, *3)
2

*y*1 +

*β*2

*y*2 +

*γ*2

*y*3 + (

*αβ*)22

*y*1

*y*2 + (

*αγ*)22

*y*2

*y*3

*j *= 0

*, j *= 1

*, *2

*, *3)
It can be easily shown (e.g. Liang and Zeger, 1989) that the model in (4.9) is equivalentto the saturated log-linear model for a 2

*× *2

*× *2 contingency table where

*Yijk *= # of subjects in which (

*Y*1 =

*i − *1

*, Y*2 =

*j − *1

*, Y*3 =

*k − *1)

*i, j, k *= 1

*, *2

*.*
The representation in (4.9) provides an alternative way to interpret the parameters spec-iﬁed under log-linear models by considering for example
1 = 1

* y*2

*, y*3) =

*α*
2 + (

*αβ*)22

*y*2 + (

*αγ*)22

*y*3 + (

*αβγ*)222

*y*2

*y*3

*,*
1 = 0

* y*2

*, y*3)
the log odds for the ﬁrst categorical variable,

*Y*1, to be one as a function of

*Y*2 and

*Y*3, theother two variables. Thus, (

*αβ*)22 may be interpreted as the eﬀect, measured by log odds,of

*Y*2 on

*Y*1 adjusting for

*Y*3. In other words, (

*αβ*)22 characterizes the association between

*Y*1 and

*Y*2 conditional on the status of the third variable,

*Y*3 in this case. Similarly, onecan derive through (4.9)
2 = 1

* y*1

*, y*3)
2 + (

*αβ*)22

*y*1 + (

*βγ*)22

*y*3 + (

*αβγ*)222

*y*1

*y*3

*,*
2 = 0

* y*1

*, y*3)
3 = 1

* y*1

*, y*2)
2 + (

*αγ*)22

*y*1 + (

*βγ*)22

*y*2 + (

*αβγ*)222

*y*1

*y*2

*.*
3 = 0

* y*1

*, y*2)
Representations above serve to illuminate the concerns of using log-linear models tomodel longitudinal data, an important special case of correlated data. In this setting,

*Yj*represents for an individual the binary observation measured at the

*j*th visit,

*j *= 1

*, *2

*, *3.

An obvious question is whether it is sensible to model as shown in (4.10) the dependenceof the ﬁrst observation,

*Y*1, on observations measured subsequently,

*Y*2 and

*Y*3 in thiscase. Another drawback of this modelling is that it is not reproducible (e.g. Liang, Zegerand Qaqish, 1992) in that the joint distribution of any subset of

*Y *= (

*Y*1

*, Y*2

*, Y*3) do nothave the same representations as (4.9) for

*Y *itself. This is particularly troublesome forcorrelated data of varying cluster sizes.

*CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA*
**Chapter 5**
**Statistical Modellings for**

Correlated Data
This chapter is devoted to discussions on statistical modellings for correlated data thathave been developed recently. Recall in Chapter 1 we pointed out through examples thatscientiﬁc objectives for correlated data can often be characterized statistically throughregression to (1) assess the relationship between response variables and explanatory vari-ables and (2) examine and describe the degrees and patterns of within-cluster association.

We discuss in this chapter three diﬀerent approaches: marginal models, random eﬀectsmodels and observation driven models. These models are illustrated through two exam-ples presented in Chapter 1: The Baltimore Eye Survey (Example 1.1) and the pre-posttrial for schizophrenics (Example 1.4). These three models are contrasted with the focuson their distinct interpretations of regression coeﬃcients. To begin with, we recall thatcorrelated data can often be expressed as follows: Let

*Yi *= (

*Yi*1

*, Yij, . . , Yin *)

*t *be the

*ni *observations from the

*i*th cluster,

*i *= 1

*, . . , m, m *being the total number of clusters.

For the

*j*th component,

*j *= 1

*, . . , ni*, we assume the availability of

*xij*, a

*p × *1 vectorof covariates, that may be associated with

*Yij*. For the Baltimore Eye Survey Example,

*m *= 5

*, *199

*, ni *= 2 for all subjects and

*Yi*1(

*Yi*2) represents the visual impairment sta-tus (1: impaired; 0: normal) of the left (right) eye of the

*i*th subject. In this example

*xi*1 =

*xi*2 =

*xi*, i.e. all the covariates are the same for both eyes including age, race andpossibly interactions between the two demographic variables. The primary objective ofthis study is to identify demographic variables that are associated with the risk of visualimpairment. The secondary objective is to examine the degree of fellow eye associationand its possible dependence on demographic variables as well. As for the pre-post trial,

*m *= 170 and the

*ni*'s range from 2 to 6 as some of the patients dropped out duringthe course of follow-up. The primary response variable is PANSS measuring the severityof schizophrenia; whereas the main explanatory variables include the treatment status(

*xi *= 1 if receiving rispirodine of 6mg and 0 if receiving haloperidol of 20 mg), time (inweeks) from the baseline,

*tj*(0

*, *1

*, *2

*, *4

*, *6 and 8) and the interaction between

*xi *and

*tj*.

The main objective of this clinical trial is to assess the eﬃcacy of the new treatment forschizophrenia, rispirodine, as measured by the rate of decline in PANSS.

*CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA*
This approach may be viewed as a multivariate analogue of the quasi-likelihood methodin

*§*2.3:
(A) Systematic component

*h*(

*µij*) =

*xt β, µ*
*ij *= E(

*Yij xij *).

(B) Variance speciﬁcation Var(

*Yij*) =

*V *(

*µij*;

*φ*), where

*φ *is the over-dispersion param-
(C) Covariance speciﬁcation Cov(

*Yij, Yik*) = C(

*µij, µik*;

*α*)

*, j < k *= 1

*, . . , ni*, where

*C*
is a known function indexed by association parameters

*α*.

For the Baltimore eye survey example, one may consider
logit Pr(

*Yij *= 1) =

*β*0 +

*β*1 Race

*i *+

*β*2 (Age

*−*
60) +

*β*3 (Age

*i*
+

*β*4 Race

*i *(Age

*−*
60) +

*β*5 Race

*i *(Age

*− *60)2

*,*
(B) Var(

*Yij*) =

*µij*(1

*− µij*),
(C) log OR(

*Yi*1

*, Yi*2) =

*α*0 +

*α*1 Race

*i *+

*α*2 Age.

Several remarks are worth noting. First, implicitly one assumes in (A) of (5.1) that theeﬀects of a age and race on the risk of visual impairment are the same for both eyes.

While one can test this assumption formally, expert opinion on this is that commonetiology for both eyes is warranted. An interpretation of e

*β*1 , the coeﬃcient for the Racevariable, is that it represents the odds of visual impairment for any eye of blacks of 60years old relative to the odds for whites of the same age. Finally, recall in

*§*1.4 ourpreference of using odds ratio as a measure of association for categorical variables. Thespeciﬁcation (C) along with (A) and (B) lead to a full speciﬁcation of Cov(

*Yi*1

*, Yi*2). Forthe pre-post trial example, one could, for example, consider
(A)

*µij *=

*β*0 +

*β*1

*tj *+

*β*2

*xi *+

*β*3

*xi tj*,
(B) Var(

*Yij*) =

*φ*,
(C) Cov(

*Yij, Yik*) =

*φ e tj−tk *.

Here the AR-1 correlated structure for the repeated observations from the same patient isimposed as shown in (C). More importantly, the interaction term

*xi tj*, in (A) postulatesthat the diﬀerence in averaged PANSS between the treated (risperidone) and control(haloperidol) groups varies linearly in time as characterized by

*β*3. More speciﬁcally,

*β*3represents the change of averaged diﬀerence in PANSS between two groups per unit (inweeks) change as time progresses.

Finally, some general remarks on marginal models (Liang and Zeger, 1993). First,
the

*β *coeﬃcients speciﬁed in (A) describe the eﬀects of covariates on the "marginal"expectation of the response variable. For the eye survey example, this marginal expec-tation amounts to the averaged

*Y *for a given eye irrespective of the outcome statusof the fellow eye. For the pre-post trial example, this marginal expectation means the

*5.2. RANDOM EFFECTS MODEL*
averaged

*Y *at a given visit irrespective of the PANSS scores at other visits. Thus

*β*has the so-called "population averaged" interpretation by contrasting the averaged

*Y *'sfrom diﬀerent subpopulations formed by diﬀerent values in

*x *(Zeger, Liang and Albert,1988). This approach, from this viewpoint, is particularly relevant for studies of publichealth signiﬁcance. Furthermore, the interpretation of

*β *is the same regardless of thecluster size,

*ni*, and of magnitude of within-cluster association. Finally, just as in quasi-likelihood, the joint distribution of

*Yi *= (

*Yi*1

*, . . , Yin *)

*t *is not fully speciﬁed by (A), (B)
and (C) and this issue will be addressed in the next chapter.

**Random Eﬀects Model**
Unlike the marginal model approach which addresses issues of the regression of

*Y *on

*x*and within-cluster association through two separate sets of regression models, randomeﬀects models and observation driven models intend to address these two issues simulta-neously through a single set of regression models. The essence of random eﬀects models,however, is the introduction of cluster-speciﬁc regression coeﬃcients relating

*Y *to

*x *toreﬂect natural heterogeneity among clusters caused by unmeasured factors. This notionof heterogeneity among clusters, called "random eﬀects" translates into within cluster"likeness" or dependence. Speciﬁcally, given

*bi*, random eﬀects speciﬁc to the

*i*th cluster,

*i *= 1

*, . . , m*,
(A

)

*Yi*1

*, . . , Yij, . . , Yin *are statistically independent of each other.

(B

) Each

*Yij *follows a GLM with

*h*(E(

*Yij bi*)) =

*xtijβ *+

*ztijbi,*
where

*zij*, a

*q × *1 vector, is generally a subset of

*xij*.

(C

) The

*bi*'s are assumed to be independent and generated from a common distribution,

*F *(

*·*;

*δ*), with mean zero and indexed by parameters,

*δ*.

For the pre-post trial example, patients vary tremendously in baseline PANSS and inchange of PANSS over time. To acknowledge this variation statistically, one may consider
(B

)

*Yij *= (

*β*
1 +

*bi*1)

*tj *+

*β*
2

*xi *+

*β*
3

*xi tj *+

*ij *, where given

*bi *= (

*bi*0

*, bi*1)

*t, ij*
*N *(0

*, φ*)

*, j *= 1

*, . . , ni*,
(C

)

*bi *follows a bivariate normal distribution with mean (0

*, *0)

*t *and covariance matrix
In this model,

*β*0 and

*β*0+

*β*2 represent the averaged baseline PANSS for patients receivinghaloperidol and risperidone respectively, whereas

*β*1 and

*β*1 +

*β*3 represent averaged ratesof change in PANSS for the two respective groups. Thus negative

*β*1 and

*β*3 would suggestthat receiving risperidone would lead to a greater decline in PANSS over time thanhaloperidol does. Furthermore,

*bi*0 and

*bi*1

*, i *= 1

*, . . , m*, are introduced to account forvariation across patients discussed earlier. The larger the

*δ*00(

*δ*11), the greater variations

*CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA*
across subjects in PANSS scores (in rates of change in PANSS). Finally, Var(

*Yij*) =

*φ *+

*δ*00 + 2

*δ*01

*tj *+

*δ*11

*t*2 and
Cov(

*Yij, Yik*) =

*δ*00 +

*δ*01(

*tj *+

*tk*) +

*δ*11

*tjtk, j < k,*
a result of the introduction of random eﬀects, the

*bi*'s. For the eye survey example, onemay consider
(B

)

*Yij bi ∼ *Binomial(1

*, µ *), where

*µ *= E(

*Y*
*ij bi*) and
(

*β*0 +

*bi*) +

*β*1 Race

*i *+

*β*2 (Age

*i*
60) +

*β*3(Age

*i*
4 Race

*i *(Age

*i*
60) +

*β*5 Rac

*i *(Age

*i*
(C

)

*bi ∼ N *(0

*, δ*).

This logistic regression model in (B

) is very similar to that under marginal modelswith the exception of the additional individual-speciﬁc parameter,

*bi*, which is includedto account for variation among subjects in risks for visual impairment that are notexplained by age, race variables. Indeed, with random intercepts, Zeger, Liang andAlbert (1988) have shown that there is a simple mathematical relationship between

*β*and

*β *in marginal models, namely,
logit E(

*Yij*) = logit E(E(

*Yij bi*)) =

*xtijβ/*(

*c*2

*δ*200 + 1)1

*/*2

*,*
where 16 3

*/*(15

*π*). Thus, approximately
= (

*c*2

*δ*00 + 1) 2

*β.*
We will return to the comparison between these two models in

*§*5.4. Some generalcomments on random eﬀects models. This approach for modelling correlated data isparticularly useful to characterize the "change" in

*Y *within clusters. For example,

*β *+

*bi*describes the eﬀect of

*x *on

*Y *that is speciﬁc to the

*i*th cluster (or subject). Furthermore,the empirical Bayes method (Robbins, 1956) and subsequent developments (e.g. Lairdand Ware, 1982; Zeger and Karim, 1991; Breslow and Clayton, 1993) allow investigatorsto estimate not only

*β *but also

*bi *as well. Thus this modelling is particularly relevantfrom a clinical viewpoint. For example, longitudinal studies would allow clinicians tocommunicate with concerned parents the drop in risk of their own children for suﬀeringfrom infectious disease if they change their smoking behavior (e.g. from smoking to non-smoking). Finally, given (A

), (B

) and (C

) speciﬁed above, the probability distributionfunction for

*Yi *as indexed by

*β, φ *and

*δ *is fully speciﬁed.

**Observation Driven Models**
As in random eﬀects models, this approach addresses simultaneously both issues of re-gression of

*Y *on

*x *and within-cluster association through a set of regression models.

One important feature of the observation driven models that distinguish itself from therandom eﬀects model is that it operates on the premise that correlation within a cluster

*5.3. OBSERVATION DRIVEN MODELS*
arises because one response is explicitly caused by others. A typical example of this kindis the spread of the disease that is contagious within a family or a community. Deﬁne

*Yi,−j *= (

*Yi*1

*, . . , Yi,j−*1

*, Yi,j*+1

*, . . , Yin *)

*t*
a vector of (

*ni − *1)

*× *1

*Yij*'s excluding

*Yij*. Observation driven models may be charac-terized as follows
(A

) For

*Hij *=

*H*(

*Yi,−j*), a function of

*Yi,−j, Yij *given

*Hij *follows a GLM with

*h*(E(

*Yij Hij*)) =

*xtijβ *+
where

*fν, ν *= 1

*, . . , q*, are known, up to

*α*, functions of

*Hij*.

Choice of

*Hij *depends on situations that are dealt with. For longitudinal studies, it isnatural to consider

*Hij *= (

*Yi*1

*, . . , Yi,j−*1), observations prior to the

*j*th visit. On theother hand, for family studies, in particular, of siblings, it would be more sensible toconsider

*Hij *=

*Yi,j−*1, the rest of the family members. If adopting this approach, onemay consider for the eye survey example
(A

)

*Yij Yi,*3

*−j ∼ *Binomial(1

*, µ*),

*j *= 1

*, *2 with
logit Pr(

*Yij *= 1

* Yi,*3

*−j, xij*)
Race

*i *+

*β*
60) +

*β*
Race

*i *(Age

*i*
60) +

*β*
Race

*i *(Age

*i*
Note that with

*ni ≡ *2, this model is the same as those proposed by Rosner (1984) andConnolly and Liang (1988) discussed in

*§*3.3. It is interesting to note that e

*β*
the interpretation as the relative odds of visual impairment for one eye among blacks of60 years of age versus that of whites of the same age, adjusting for the visual impairmentstatus of the fellow eye. This is to be contrasted with e

*β*1 under marginal models inwhich no adjustment for the fellow eye status is made. If the objective of the studyis to identify risk factors, in particular demographic ones, for prevention purposes, onemay question the relevance of this observation driven approach as the eﬀects of race,age, etc. on risk of visual impairment for one eye may be diminished due to their directcompetition against the visual status of the fellow eye in explaining

*Yij*; see (5.4). Forthe pre-post trial example, one may consider
(A

)

*Yij Yi*1

*, Yi,j−*1 is speciﬁed as
1

*tj *+

*β*
2

*xi *+

*β*
+

*α*(

*Yi,j−*1

*− β −*
1

*tj−*1

*− β*
2

*xi − β*
3

*xi tj−*1) +

*ij ,*
where

*ij ∼ N *(0

*, φ*).

This is known as a AR-1 process in which

*Yij *depends on the past history in

*Yi *onlythrough

*Yi,j−*1, the immediate past observation. We will discuss the relationship among

*β, β *and

*β *for this example in

*§*5.4.

*CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA*
Generally speaking, this approach represents a particular way to model the within-
cluster dependence mechanism, e.g., past responses cause the present one in longitudinalstudies. This direct modelling is particularly useful for prediction purposes. For example,one may be interested in predicting PANSS two months after the termination of the studyif the patient receiving risperidone had a score of 140, which is very high, at that time.

On the other hand, if the main purpose of the study is on the regression of

*Y *on

*x*one should be cautious about the interpretation of

*β *which may be non-trivial andpotentially misleading especially when the cluster sizes vary. This is especially so fordiscrete response variables; see discussion in

*§*3.2 and

*§*4.3.

**Contrasts of Three Modelling Approaches**
It is clear that the three approaches presented in Sections 5.1-3 are motivated distinctlyconsidering how within-cluster association may be incorporated in conjunction with theconventional regression of

*Y *on

*x*. With the exception of identity link function, whichis commonly assumed for continuous responses, the three regression coeﬃcients,

*β*1

*, β*and

*β*, are likely to be diﬀerent numerically and interpretation-wise. In addition, thewithin-cluster association pattern as implied by the models are in general diﬀerent forthese approaches as well.

We now ﬁrst examine more closely the special case of continuous responses with
identity link. It can be shown mathematically that in this case

*β *=

*β *=

*β.*
To see this, note that random eﬀects models specify the eﬀects of

*xi *on the conditionalexpectations of

*Yi *given

*bi*, the random eﬀects. However, noting that E(

*bi*) = 0 asspeciﬁed in (C

) of

*§*5.2, we have immediately that
E(

*Yij*) = E(E(

*Yij bi*)) = E(

*xtijβ *+

*ztijbi*) =

*xtijβ,*
which when comparing with (A) in

*§*5.1, leads to

*β *=

*β*. Similarly, if one adopts theobservation driven model, one has
E(

*Yij*) = E(E(

*Yij H*(

*Yi,j−*1)))
E(

*xtijβ *+

*αν*(

*yi,j−ν − xti,j−νβ*))
Thus, for continuous responses of identity link, the three regression coeﬃcients relating

*Y *to

*x *are identical numerically. The only major diﬀerence is on how the within-clusterassociation may be modeled. For example, a random eﬀects model with intercepts as theonly random eﬀects leads to Cov(

*Yij, Yik*) =

*δ*00, a constant covariance for each pair ofobservations from the same cluster. On the other hand, if an AR-1 process is assumedunder observation driven models, one has Cov(

*Yij, Yik*) =

*α tj−tk *which decreases in

* tj − tk *, the "distance" (in time) between two observations from the same cluster.

The equality in

*β, β *and

*β *shown above is indeed an exception rather than the rule.

To elaborate on this, we assume, for simplicity, one is dealing with longitudinal studies of

*5.4. CONTRASTS OF THREE MODELLING APPROACHES*
binary responses and logit link. In this situation, e

*β *for a covariate (who marginal modelscompares prevalences in subgroups formed by observations with diﬀerent

*x *values (andhence measured at diﬀerent times). For random eﬀects models of the same covariate

*x*,e

*β *describes changes in personal risk due to change in

*x *over time. On the other hand,e

*β *in observation driven models characterizes changes in incidence for individuals withthe same history in

*Y *but diﬀer in

*x *measured at the same time. It has been shown inthis situation (Zeger, Liang and Albert, 1988) that

*β *is smaller than

*β *in magnitude.

However, one should in general reach similar conclusions for testing the hypothesis of noassociation between

*Y *and

*x *(i.e.

*β *= 0 or

*β *= 0) since

*β*)

*≈ *ˆ
On the other hand, by examining (A) in

*§*5.1 and (A

) in

*§*5.3 closely, we would speculatethat

*β *would be smaller than

*β *in magnitude due to the fact that (1) both

*xij *and

*H*(

*Yi,−j*) appear in the same regression model in explaining

*Yij *and (2) the

*Yij*'s are ingeneral highly correlated with each other.

*CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA*
**Chapter 6**
**Statistical Inference for**

Correlated Data
In Chapter 5 we introduced and contrasted three statistical modellings for correlateddata. In this chapter, we review approaches to make inference about parameters speci-ﬁed by the models. These approaches include full likelihood methods, conditional like-lihood methods and estimating function methods. Pros and cons of each approach arehighlighted.

Recall that for marginal models only the ﬁrst two joint moments of the

*Yi*'s are requiredto be speciﬁed, i.e.

(A) Systematic component

*h*(

*µij*) =

*xtβ, *for all

*j *= 1

*, . . , n*
(B) Covariance/variance speciﬁcation Cov(

*Yij, Yik*) =

*C*(

*µij, µik*;

*α, β*)

*, j ≤ k *= 1

*, . . , ni*.

The question is that given this speciﬁcation in (A) and (B) how to make inference on

*β*and

*α *based on the data at hand. Just as in the univariate response case, we considertwo diﬀerent approaches.

If the regression of

*Y *on

*x *is the primary interest which usually is the case in, for example,longitudinal studies, one may estimate

*β *by solving
1(

*β, α*) =

*i*;

*β, α*)(

*yi − µi*(

*β*)) = 0

*.*
The above equation is know as the generalized estimating equation (GEE) (Liang andZeger, 1986) which reduces to the quasi-score equation deﬁned in

*§*2.3 when

*ni *= 1 for all

*i *= 1

*, . . , m*. Since

*S*1 in (6.1) depends on

*α *as well, one may iterate until convergencebetween solving

*S*1(

*β, *ˆ

*α*(

*β*)) = 0 for

*β *and updating ˆ

*α*(

*β*) with the latest ˆ

*β *solved above

*CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA*
(Liang and Zeger, 1986). This approach, which we term GEE1, is robust in the sensethat, just as in quasi-likelihood, (1) no distributional assumption on

*Yi *= (

*Yi*1

*, . . , Yin *)

*t*
is needed and (2) consistent estimation of

*β *and correct precision assessments on ˆ
available even if the covariance structure for

*Yi, C*(

*µij, µik*;

*β, α*), is incorrectly speciﬁed.

For the latter part, see some key references including White (1982), Gourieoux et al.

(1984), Royall (1986) and Liang and Zeger (1986). Furthermore, high eﬃciency relativeto the MLE of

*β *is usually maintained even if Cov(

*Yi*) is incorrectly speciﬁed (e.g. Liangand Zeger, 1986; Zhao, Prentice and Self, 1992). As expected, the degree of eﬃciency lossdepends on the magnitude of within-cluster correlation (the higher the correlation, thegreater degree of eﬃciency loss) and on how "well" the speciﬁed "working" covariance for

*Yi *in (6.1) approximates the true covariance structure. It is important to point out thatincorrectly assuming no association within-cluster may yield high degree of eﬃciency lossespecially for covariates whose values vary within clusters; see

*§*2.4 and Zhao, Prenticeand Self (1992). More importantly, such an exercise may lead to severe biased estimationin

*β *if the data are missing in a non-trivial manner. This point will be further illuminatedthrough the pre-post trial data analysis to be presented later.

When the within-cluster association is the primary interest, which is typically the
case for family studies, one may argue that this GEE1 approach may be ineﬃcient asfar as inference for

*α *is concerned. This is because in GEE1, estimation of

*β *and

*α *istreated as if they were independent (or orthogonal) to each other. As an alternative, onemay solve for

*δ *= (

*β, α*) simultaneously through

*∂µ∗ t*
2(

*β, α*) =

*i*;

*δ, γ*)(

*zi − µ∗*
*i *(

*δ*)) = 0

*,*
*i *) = (

*Y t*
*i*1

*, . . , Y *2

*in , Yi*1

*Yi*2

*, . . , Yi,n*
*i −*1

*Yini*
*µ∗i *= E

*t*(

*Zi*;

*δ*) = (E

*t*(

*Yi*);

*β*)

*, *E

*t*(

*Wi*;

*β, α*))

*t,*
and

*γ *are parameters characterizing the third and fourth moments of the

*Yi*'s as requiredby Cov(

*Zi*) (Liang, Zeger and Qaqish, 1992). Several remarks on this GEE2 approach.

First, with a proper choice of ˆ

*α*(

*β*), the GEE1 method may be viewed as a special case
of GEE2 if Cov(

*Yi, Wi*) in the "working" covariance matrix for

*Zi*, Cov(

*Zi*), is set to bezero. It can be seen in this situation that the ﬁrst

*p *components of

*S*2(

*δ, γ*) reduces to

*S*1(

*β, α*). Second, in some instances considered, Liang, Zeger and Qaqish (1992) foundthat the GEE2 method is more eﬃciency for estimating

*β *and

*α *than the GEE1 method,although the degree of eﬃciency gain is much less substantial for

*β *than

*α*. We willreturn to this point through an illustration using the eye survey data as an example.

This new approach, however, creates a complication that is not associated with the useof GEE1. One complication is the need to estimate

*γ *as seen in (6.2). Given thatCov(

*Zi*) is serving as a working covariance matrix and its correct speciﬁcation is notneeded, one may compute Cov(

*Zi*) as if

*Yi *follows a multivariate normal distribution(Crowder, 1987; Prentice and Zhao, 1991). This resolves the issue of dependence of

*S*2on

*γ *since multivariate normal distributions are fully speciﬁed by the ﬁrst two moments.

Another approach is to replace

*γ *by a "guess-mates,"

*γ∗*, a ﬁxed value. Liang, Zegerand Qaqish (1992) found that for cases considered, this approach sacriﬁces very littlepower loss. However, no general performance of these two approaches for dealing with

*6.1. MARGINAL MODELS*
*γ *has been established and this issue is likely to be examined on the case by case basis.

Another complication is that the robust property for

*β *enjoyed by GEE1 is no longermaintained. To see this, except when Cov(

*Yi, Zi*) is set to be zero, it can be shown thatthe ﬁrst

*p *components of

*S*2(

*δ, γ*)

*, S*2

*,β*(

*δ, γ*) say, is a linear combination of

*yi − µi*(

*β*)and

*Wi − *E(

*Wi*;

*β, α*). Thus,

*S*2

*,β*(

*δ, γ*) is an unbiased estimating function for

*β *only ifE(

*Wi*;

*β, α*) computed is correct, i.e. if Cov(

*Yi*) is correctly speciﬁed. In other words,the consistency of ˆ

*β*, the solution of

*S*2(

*δ, γ*) = 0, depends on the correct speciﬁcation of
Cov(

*Yi*); see Liang and Zeger (1995) for a detailed examination of this issue for a specialcase.

Recall that speciﬁcation (A) and (B) do not fully specify the joint distribution of the

*Yi*'sas only the ﬁrst two moments of the

*Yi*'s are involved in (A) and (B). A natural questionis: does there exist a probability function for

*Yi *that is compatible with (A) and (B)?The work by Gourieroux et al. (1984) and Zhao et al. (1992) suggests that the answer isyes. Dropping the index

*i *temporarily, consider the following "partly exponential family"

*f *(

*Y *=

*y*;

*β, α*) = exp

*{θty *+

*c*(

*y*;

*α*)

*− b*(

*θ, α*)

*},*
where

*θ*, called "canonical parameters," is determined by

*µ *= E(

*Y *;

*β*) =

*y *exp

*{θty *+

*c*(

*y*;

*α*)

*− b*(

*θ, α*)

*}dy*
through the one-to-one correspondence between
Here

*c*(

*y*;

*α*) is pre-speciﬁed and

*α *is called "shape parameters." Some choices of

*c*(

*y*;

*α*)include

*c*(

*y*;

*α*) =

*yt*
which, in conjunction with (6.3), leads to the multivariate normal distribution and forbinary responses

*c*(

*y*;

*α*) =
known as the quadratic exponential family (e.g. Zhao and Prentice, 1990; Fitzmauriceand Laird, 1993). With this partly exponential family, it has been shown (Zhao et al.,1992; Fitzmaurice and Laird, 1993) that the score function for

*δ *= (

*β, α*) has the form

*S*(

*β, α*)

*i *(

*β*)

*i*;

*δ*)

*i − µi*(

*β*)
Cov(

*Yi*;

*Wi*;

*δ*)

*Iq×q*
*wi − *E(

*Wi*;

*δ*)
where

*Iq×q *is the

*q × q *identity matrix and

*wi *=

*∂c*(

*yi*;

*α*)

*/∂α*. Consequently, the ﬁrst

*p*components of

*S*(

*β, α*) is simply

*S*1(

*β, α*) in (6.1). Thus

*S*1(

*β, α*) from GEE1 is the scorefunction for

*β *if

*Yi *is from a partly exponential family as long as the Cov(

*Yi*) speciﬁed

*CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA*
in

*S*1(

*δ*) is the true covariance matrix from

*Yi*. Furthermore, it can be shown (e.g., Zhaoet al., 1992) that

*β *and

*α *are orthogonal to each other, i.e. the covariance between thescore function for

*β *and

*α *is zero. Consequently, the consistency of ˆ

*β*, the MLE of

*β*, is
maintained even if

*c*(

*y*;

*α*) is misspeciﬁed, a property enjoyed by GEE1.

On the other hand, this likelihood approach based on the partly exponential family
may be used with caution. First, choice of

*c*(

*y*;

*α*), which along with

*µ*(

*β*), would leadto the complete speciﬁcation of

*f *(

*y*;

*β, α*) requires more knowledge about the randommechanism than it does for (A) and (B) only. In the absence of suﬃcient knowledge,either substantive or empirical, one should proceed with caution. Second, this family isin general not reproducible, i.e. the distribution function of any subset of

*Yi *may not havethe same form as

*f *(

*Yi*) in (6.3). This lack of reproducibility is especially troublesomeif cluster sizes vary which is typically the case for family and longitudinal studies. We
Table 6.1: Regression estimates and standard errors (in parenthesis) for thevisual impairment response from the Baltimore Eye Survey Study.

now return to the Baltimore eye survey example to illustrate some of the points madeearlier. Table 6.1 presents results from four approaches:

*Approach I*: logistic regressionusing the data from the left eyes only;

*Approach II*: logistic regression using data fromboth eyes treating them as if they were statistically independent of each other;

*ApproachIII (IV)*: GEE1 (GEE2) in which both approaches model log OR(

*YL, YR*) as a functionof race. First, point estimates of logistic regression coeﬃcients from four approaches arevirtually identical suggesting that the etiology of visual impairment is similar for botheyes.

Furthermore, the association between age and risk of visual impairment is strong
and positive and varies between blacks and whites; see Figure 6.1. Second, the ordinarylogistic regression approach (Approach II), which ignores within-cluster association, has

*6.2. RANDOM EFFECTS MODELS*
Figure 6.1: Predicted risk of VI by race and age among those with 9 yearsof education in Baltimore Eye Survey Study (Tielsch et al., 1990): solidline, whites; dashed line, blacks.

the tendency, as expected, to underestimate the s.e. of ˆ

*β *compared to that of approaches
III and IV, which are known to be correct, at least asymptotically. Third, it is obviousand intuitive that it is less eﬃcient to use data from one eye only; see the last columnof Table 6.1. However, the ratios of the variance of ˆ

*β *between approaches I and IV, the
latter using twice the sample size, are approximately equal to 1.6 instead of 2. This isknown as "design eﬀect" in complex survey sampling and is due to the fact that the datafrom fellow eyes are highly correlated with each other. Indeed, results from Table 6.1indicate that the estimated odds ratio relating two fellow eyes in terms of risk for visualimpairment is approximately 9

*.*97(= e2

*.*30) for whites and 17

*.*1(= e2

*.*30+0

*.*54) for blacks.

The signiﬁcant diﬀerence in odds ratios between blacks and whites is consistent with thelong-standing clinical ﬁnding in the United States that blacks have higher incidence ofglaucoma and of diabetic mellitis, both of which tend to be bilateral diseases. Finally,there is a substantial eﬃciency gain in using GEE2 for inference on the pattern of within-cluster association as the ratio of Var( ˆ

*α*) for GEE1 versus GEE2 are 2

*.*22(= (2

*.*64

*/*1

*.*77)2)
for the intercept and 2

*.*65(= (45

*/.*255)2) for the race variable.

**Random Eﬀects Models**
As noted in

*§*5.2, given
(A

) Conditional on

*bi*, the random eﬀects,

*Yi*1

*, . . , Yin *are statistically independent
and for each

*j, Yij *follows a GLM with

*h*(E(

*Yij bi*)) =

*xt β *+

*zt b*
(B)

*bi*, a

*q × *1 vector,

*∼ F *(

*·*;

*D*) with mean zero,
the likelihood function for

*β, φ *and

*D *is completely speciﬁed, namely,

*L*(

*β, φ, D*)

*∝*
*f *(

*yij bi*;

*β, φ*)

*f *(

*bi*;

*D*)

*dbi.*
*CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA*
As an example, for binary responses of logit link and for a sole random intercept, i.e.

*zij ≡ *1, following

*N *(0

*, D*), the likelihood function has the form

*L*(

*β, D*)

*yijxtijβ *+
log 1 + e

*xtij*
One can see from (6.5) that the likelihood function involves

*q*-dimensional integrals, onefrom each cluster and in general, numerical approximation is needed in order to computer(6.4). We now consider two diﬀerent approaches whose pros and cons will be contrastedthroughout.

**Conditional Likelihood Approach**
This approach only works when the link function is a canonical link in which the ﬁrstpart of the integral in (6.4) has the simple form, namely,

*f *(

*yij bi*;

*β*) = exp

*yijxtijβ *+

*yijztijbi −*
which includes (6.5) as a special case. Some important examples of canonical linksinclude, as mentioned in

*§*2.2, the identity link for normal distributions, log link forPoisson distribution and logit link for binomial distribution. One important feature of
(6.6) is that given

*bi,*
*yijxij *and

*yijzij *are jointly minimally suﬃcient for

*β *and

*bi*, the canonical parameters. Consequently, one can eliminate

*bi *in (6.6) by conditioning

*yijzij*, the minimal suﬃcient statistics for

*bi *for ﬁxed

*β*, i.e.

*w *)

*∝*
*yi*1

*, . . , yin*
*ij zij *;

*β*
where

*β *= (

*βw, βz*) and

*xij *= (

*wij, zij*) as

*zij *is assumed to be a subset of

*xij*. Thus,through conditioning (see

*§*2.2), it gives rise to a conditional likelihood which is indepen-dent of

*bi *and

*βz *and depends only on

*βw*. In other words, only a subset of

*β *is estimable.

Thus, the usefulness of this conditional likelihood approach depends on whether covari-ates of interest have been declared as random eﬀects or not. For example, if

*zij ≡ *1, i.e.

random intercepts is the only random eﬀect declared, then only those

*β *from covariates

*wij*'s which vary within-cluster are estimable since otherwise
Thus it should be clear that this conditional likelihood approach would not work for theeye survey example since all the covariates are cluster-speciﬁc rather than eye-speciﬁc.

*6.2. RANDOM EFFECTS MODELS*
As another example, consider longitudinal studies in which primary objectives are of-ten centered on the individual change of the response variables over time. If the timevariable, i.e. the times for which the response variables are measured, is considered asa random eﬀect, known as random slopes, then one cannot estimate the averaged rateof change coeﬃcient should the conditional likelihood method be employed. This alongwith the fact that canonical links are required limit the extent to which this method maybe applied. One desirable property of this method, i.e. using

*Lc*(

*βw*) as the basis forinference, is that no distributional assumption on the

*bi*'s is needed, which is comfortingas the validity of such an assumption is diﬃcult to verify in general.

An obvious alternative is to use the full likelihood,

*L*(

*β, φ, D*), in (6.4) as the basis forinference on both ﬁxed eﬀects,

*β*, and the variance components,

*D *and

*φ*. For continuousresponses with identity links and normal distribution on both

*Y b *and

*b *itself, i.e.

(A

)

*Yij bi ∼ N *(

*xt β *+

*zt b*
*ij i, φ*),
(B

)

*bi ∼ *MVN(0

*, D*),

*Yi *follows a multivariate normal distribution with mean

*xtβ *and covariance matrix

*i *+

*φIni×ni*
.

*, zi *= .

*.*
In this case, the restricted maximum likelihood (REML) method (e.g. Patterson andThompson, 1971; Laird and Ware, 1982) for

*β, φ *and

*D *can be used. Furthermore, theempirical Bayes method (Robbins, 1955) can be utilized to help provide more desirableestimates of the

*bi*'s. This aspect of inference is useful especially for the purpose ofidentifying clusters that are distinct from others with regard to the behavior of therandom eﬀects. A prime example is in growth studies in which investigators are interestedin identifying children whose "growth rates" as captured by the

*bi*'s are lower than the restof the sample. We note that this REML method along with empirical Bayes procedurefor estimating the

*bi*'s have been implemented in SAS as a part of PROC MIXED.

In general, the likelihood function involves complicated

*q*-dimensional integrations
which are demanding numerically especially when

*q *is modestly large. Two methodshave been developed recently to address this issue. One is to use the Gibbs samplermethod in which the posterior distribution of

*β, φ*, and

*D *given the data is simulatedthrough, for example, the Monte Carlo Markov Chain (MCMC) algorithm (e.g. Zegerand Karim, 1991). The other approach, known as the penalized likelihood method,approximates

*f *(

*bi Yi*) by a normal density function leading to an approximate likelihoodfunction which is less computationally intensive (e.g., Stiratteli et al. 1984; Breslow andClayton, 1993). This approach has been shown through simulations to perform well ingeneral except for binary responses with small cluster sizes (Breslow and Clayton, 1993).

This method has been implemented in SAS as well (PROC GLIMMIX) and has beenreﬁned more recently to correct the bias occurring in the situation described above (e.g.

Breslow and Lin, 1995; Lin and Breslow, 1996).

*CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA*
**Observation Driven Models**
Recall that observation driven models tackle the issue of within-cluster association bydirectly modelling the dependence of

*Yij *on some functions of

*Yi,−j *=

*{Yi*1

*, . . , Yi,j−*1

*, Yi,j*+1

*, . . , Yn },*
namely,

*Hij *=

*Hij*(

*Yi,−j*) through
(A

) For

*j *= 1

*, . . , ni, Yij Hij *follows a GLM with

*h*(E(

*Yij Hij*)) =

*xtijβ *+
Some common choices of

*Hij *were discussed in

*§*5.3.

Just as in

*§*6.1.2, a natural question is "does there exist a probability function for

*Yi *thatis compatible with speciﬁcation (A

)?" While the answer no doubt depends on choices of

*H *and

*fν*, it is comforting that the Hammerley-Cliﬀord Theorem in Besag (1974) couldserve as a guidance for the question posed above. As an example, for binary responses
with

*q *= 1, a choice of

*fν*(

*Hij*;

*α*) =

*α*
*i *has led to the "multiplicative" model
developed by Connolly and Liang (1988).

As an alternative, one can estimate

*δ *= (

*β, α*) by solving

*∂*E(

*Y*
*ij Hij *;

*δ*)

*ij Hij *) (

*yij − *E(

*Yij Hij *;

*δ*)) = 0

*,*
*i*=1

*j*=1
a strategy similar to GEE for marginal models. An advantage of this approach is thatthe statistical package developed for GEE of marginal models such as SAS PROC GEN-MOD can be utilized. On the theoretical end, this approach shares the same robustnessproperty enjoyed by GEE for marginal models.

Returning to the Baltimore Eye Survey Study example, Table 6.2 presents results from
three modelling approaches. The ﬁrst column is based on GEE2 for marginal models asshown in Table 6.1. The second column is derived from observation driven models usingthe GEE1 method. Speciﬁcally, this model assumes for each

*j *= 1

*, *2,
logit Pr(

*Yij *= 1

* Yi,*3

*−j, *Age

*i, *Race

*i*)
1 Race

*i *+

*β*
60) +

*β*
Race

*i *(Age

*i*
60) +

*β*
Race

*i *(Age

*i*
+

*α*1

*Yi,*3

*−j *+

*α*2

*Yi,*3

*−j *Race

*i.*
Finally, the third column corresponds to results from ﬁtting a random eﬀects model ofsame covariates as marginal models with the addition of random intercepts. Consistent

*6.4. PRE- POST TRIAL ON SCHIZOPHRENIA REVISITED*
with discussion in

*§*5.4, the estimated

*β*'s from the observation driven model are ingeneral smaller in magnitude than that of the

*β*'s from the marginal model. In par-ticular, the estimated

*β *for the Race variable is reduced from 0.334 in column oneto 0.099 in column two and as a result, is not signiﬁcantly diﬀerent than zero statisti-
Table 6.2: Regression estimates and standard errors (in parenthesis) fromthree modelling approaches: Baltimore Eye Survey Study.

* *variance of random intercepts.

cally if the observation driven model is employed. A simple and obvious explanation ofthis discrepancy is that the eﬀects of those demographic variables are, to some extent,"explained away" by the competing variate,

*Yi,*3

*−j*, the visual impairment status of thefellow eye; see ˆ

*α*1 and ˆ

*α*2 in column two. On the other hand, the estimated

*β*'s from
the random eﬀects model are larger in magnitude than the ˆ

*β*'s. However, the Wald test
statistics for testing the signiﬁcance of either

*β *or

*β *are similar in magnitude and leadto similar conclusions. For an example, the Wald statistics for the Race variables are3

*.*18(=

*.*334

*/*1

*.*05) from column one and 2

*.*58(=

*.*487

*/.*189) from column three.

**Pre- Post Trial on Schizophrenia Revisited**
Recall that in this trial more than ﬁve hundred schizophrenic patients were randomizedinto six arms, placebo, risperidone of 2, 6,, 10 and 16mg and haloperidol of 20mg. Forillustrative purpose, we focus on two particular groups: risperidone of 6mg (

*x *= 1)and haloperidol of 20mg (

*x *= 0). The main objective is to examine whether the new

*CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA*
treatment in risperidone leads to further and faster decline in PANSS, a measure ofseverity in schizophrenia, compared to the then standard treatment in haloperidol. Table
Table 6.3: Regression estimates and standard errors (in parenthesis) fromthree modelling approaches: a pre- post trial on schizophrenia.

Random eﬀects model
6.3 presents results form three modelling approaches. For marginal models, we consideredthe use of independence and exchangeable (also known as compound symmetry) matricesas the working covariance for the

*Yi*'s. For random eﬀects models, we considered (1)random intercepts only which leads to the compound symmetry covariance matrix in

*Y *and (2) random intercepts and slopes. In the latter, this accounts for variations inPANSS changes over time among patients. For observation driven models, we consideredthe following AR-1 model

*Yij *=

*xtijβ *+

*α*(

*yi,j−*1

*− xti,j−*1

*β*) +

*ij, j *= 1

*, . . , ni,*
where the

*ij ∼ N *(0

*, φ*). With the exception of the GEE approach using the inde-pendence matrix as the working matrix, conclusions from all three approaches are verysimilar qualitatively; see comments in

*§*5.4. Speciﬁcally, the averaged rate of decline inPANSS for the haloperidol groups is estimated as 1.16 per week; whereas the averagedrate of decline is estimated as 2

*.*35(= 1

*.*16 + 1

*.*19) per week for the risperidone group.

Such a diﬀerence is statistically signiﬁcant at the 0.05 level (

*−*1

*.*192

*/.*482 =

*−*2

*.*47, p-value

*< *0

*.*01). More importantly, this diﬀerence appears to be meaningful clinically aswell. Considering two patients of the same PANSS scores before receiving treatments,this result suggests that taking the new treatment would lead to 9

*.*5(= 1

*.*19

*× *8) unitgreater decline in PANSS in the eighth week, a substantial improvement from the clinicalpoint-of-view.

*6.4. PRE- POST TRIAL ON SCHIZOPHRENIA REVISITED*
Figure 6.2: Mean responses in PANSS by dropout cohort.

Finally, ignoring within-cluster association would lead to very diﬀerent and presum-
ably misleading conclusions. Speciﬁcally, results from GEE assuming the repeated mea-sures from the same patient were statistically independent of each other suggest thatthere is no diﬀerence in averaged rate of decline in PANSS between the two groups(

*−*0

*.*117

*± *0

*.*545). An explanation for such a striking discrepancy is that this latterapproach is vulnerable to severe and non-trivial missing data phenomenon which is ex-hibited in this example. For example, the risperidone group experienced 36% dropoutrate in the eight week follow-up period; whereas the haloperidol group fared no betterwith a 53% dropout rate. Furthermore, Figure 6.2 strongly suggests that the chance ofdropping out at a given visit depends on the prior history of PANSS scores. For an exam-ple, patients dropped out at weeks 2, 4 and 6 experienced, on average, a sudden increasein PANSS prior to the scheduled visits. This pattern of missing mechanism, known asmissing at random (Little and Rubin, 1987), suggests that ignoring within-cluster associ-ation would, in general, lead to incorrect inference for questions of interest, the diﬀerencein rates of change in PANSS between groups in this case. For more detailed treatmentson accounting for non-trivial missing data patterns, see Diggle (1998) and Scharfstein,Rotnitzky and Robbins (1999) and references therein.

*CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA*
[1] Altham, P.M.E. (1978). Two generalizations of the binomial distribution.

*Applied*
*Statistics*,

*27*, 161–167.

[2] Andersen, E.B. (1970). Asymptotic properties of conditional maximum likelihood
estimators.

*Journal of the Royal Statistical Society, Series B*,

*32*, 283–301.

[3] Bahadur, R.R. (1961). A representation of the joint distribution of responses in

*n*
dichotomous items. In

*Studies on Item Analysis and Prediction*, Ed. H. Solomon,pp.158–168. Stanford Mathematical Studies in the Social Sciences VI, Stanford Uni-versity Press, Stanford, California.

[4] Beaty, T.H., Skjaerven, R., Breazeale, D.R. and Liang, K.-Y. (1997). Analyzing
sibships correlations in birth weight using large sibships from Norway.

*Genetic Epi-demiology*,

*14*, 423–433.

[5] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems.

*Journal of the Royal Statistical Society, Series B*,

*36*, 192–236.

[6] Bishop,Y.M.M., Fienberg, S.E. and Holland, P.W. (1975).

*Discrete Multivariate*
*Analysis: Theory and Practice*. MIT Press, Cambridge, Massachussetts.

[7] Breslow, N.E. (1981). Odds ratio estimators when the data are sparse.

*Biometrika*,

*68*, 73-84.

[8] Breslow, N.E. and Clayton, D.G. (1993). Approximate inference in generalized linear
mixed models.

*Journal of the American Statistical Association*,

*88*, 125–34.

[9] Breslow, N.E. and Day, N.E. (1980).

*Statistical Methods in Cancer Research, Volume*
*I*. IARC Scientiﬁc Publications No. 32. Lyon.

[10] Breslow, N.E. and Lin, X. (1995). Bias correction in generalized linear mixed models
with a single component of dispersion.

*Biometrika*,

*82*, 81-91.

[11] Breslow, N.E. and McCann, B. (1971). Statistical estimation of prognosis of children
with neuroblastoma.

*Cancer Research*,

*31*, 2098–2103.

[12] Chen, J.J., Kodell, R.L., Howe, R.B. and Gaylor, D.W. (1991). Analysis of trinomial
responses for reproductive and developmental toxicity experiments.

*Biometrics*,

*47*,1049–1058.

[13] Cohen, B.H., Bell, W.C., Jr., Brashears, S., Diamond, E.L., Kreiss, P. et al. (1977).

Risk factors in chronic obstructive pulmonary disease (COPD).

*American Journalof Epidemiology*,

*105*, 223–232.

[14] Connolly, M., Liang K.-Y. (1988). Conditional logistic regression models for corre-
lated binary data.

*Biometrika*,

*75*, 501-506.

[15] Cox, D.R. (1972). Regression models and life tables (with discussion).

*Journal of*
*the Royal Statistical Society Series B*,

*74*, 187-200.

[16] Cox, D. R. and Hinkley, D. V. (1974).

*Theoretical Statistics*, Chapman and Hall,
London, England.

[17] Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate condi-
tional inference, (with discussion).

*Journal of the Royal Statistical Society, Series B*,

*49*, 1-39.

[18] Crowder, M.J. (1987). On linear and quadratic estimating functions.

*Biometrika*,

*74*, 591-597.

[19] Crowder, M.J. (1989). Comments on Godambe, V.P. and Thompson, M.E. An ex-
tension of quasi-likelihood estimation.

*Journal of Statistical Planning and Inference*,

*22*, 137-152.

[20] Diggle, P.J. (1988). An approach to the analysis of repeated measures.

*Biometrics*,
[21] Diggle, P.J. (1998). Dealing with missing values in longitudinal studies. In:

*Recent*
*Advances in the Statistical Analysis of Medical Data*, editors B.S. Everitt and G.

Dunn, 203–208. Arnold, London, England.

[22] Diggle, P.J., Liang, K.-Y. and Zeger, S.L. (1994).

*Analysis of Longitudinal Data*.

Oxford University Press, Oxford, England.

[23] Durbin,J. (1960). Estimation of parameters in time-series regression models.

*Biometrika*,

*47*, 139-153.

[24] Ferguson, H. Reid, N. and Cox, D.R. (1991). Estimating equations from modiﬁed
proﬁle likelihood, in

*Estimating Functions*, Godambe, V.P., editor, Oxford Univer-sity Press, Oxford, England.

[25] Finney, D.J. (1964).

*Statistical Methods in Biological Assay*, 2nd edition. Griﬃn,
London, England.

[26] Firth, D. (1987). On the eﬃciency of quasi-likelihood estimation.

*Biometrika*,

*74*,
[27] Fitzmaurice, G.M. and Laird, N.M. (1993). Regression models for discrete longitu-
dinal responses.

*Statistical Science*,

*8*, 284-304.

[28] Godambe, V.P. (1960). An optimum property of regular maximum likelihood esti-
mation.

*Annals of Mathematical Statistics*,

*31*, 1208–12.

[29] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimat-
ing equations.

*Biometrika*,

*63*, 277-284.

[30] Goodman, L.A. and Kruskal, W.H. (1979).

*Measures of Association for Cross-*
*Classiﬁcations*. Springer-Verlag: New York.

[31] Gourieroux, C., Monfort, A., and Trognon, A. (1984). Psuedo-maximum likelihood
methods: theory.

*Econometrica*,

*52*, 681–700.

[32] Grizzle, J.E., Starmer, C.F. and Koch, G.G. (1969). Analysis of categorical data by
linear models.

*Biometrics*,

*25*, 4898–504.

[33] Haseman, J.K. and Kupper, L.L. (1979). Analysis of dichtomous response data from
certain toxicological experiments.

*Biometrics*,

*35*, 281–293.

[34] Hauck, W.W. and Donner, A. (1977). Wald's test as applied to hypotheses in logist
analysis.

*Journal of the American Statistical Association*,

*72*, 851–853.

[35] Heagerty, P. and Zeger, S.L. (1998). Lorelogram: a regression approach to exploring
dependence in longitudinal categorical responses.

*Journal of the American StatisticalAssociation*,

*93*, 150–163.

[36] Holmes, M.C. and Williams, R.E.O. (1954). The distribution of carriers of

*Stepto-*
*coccus pyogenes *among 2413 healthy children.

*Journal of Hygiene Comb 52*, 165-179.

[37] Jones, B. and Kenward, M.G. (1987). Modelling binary data from a three-point
cross-over trial.

*Statistics in Medicine*,

*6*, 555–64.

[38] J

*φ*rgensen, B. (1987). Exponential dispersion models (with discussion).

*Journal of*
*the Royal Statistical Society, Series B*,

*49*, 127–162.

[39] Kalbﬂeisch, J.D. and Sprott, D.A. (1970). Application of likelihood methods to
models involving a large number of nuisance parameters (with discussion).

*Journalof the Royal Statistical Society, Series B*,

*32*, 175-208.

[40] Kupper, L.L. and Haseman, J.K. (1978). The use of a correlated binomial model for
the analysis of data from certain toxicological experiments.

*Biometrics*,

*34*, 69–76.

[41] Laird, N.M. and Ware, J.H. (1982). Random-eﬀects models for longitudinal data.

*Biometrics*,

*38*, 963–74.

[42] Lee, E. (1992).

*Statistical Methods for Survival Data Analysis*, 2nd edition. John
Wiley and Sons, New York, New York.

[43] Lehmann, E.L. (1959).

*Testing Statitical Hypothesis*. John Wiley and Sons, New
York, New York.

[44] Liang, K.-Y. (1987). Estimating functions and approximate conditional likelihood.

*Biometrika*,

*74*, 695-702.

[45] Liang, K.-Y. and Beaty, T.H. (1991). Measuring familial aggregation by using odds-
ratio regression models.

*Genetic Epidemiology*,

*8*, 361–370.

[46] Liang, K.-Y. and Hanfelt, J. (1994). On he use of the quasi–likelihood method in
the teratological experiments.

*Biometrics*,

*50*, 872-880.

[47] Liang, K.-Y. and Tsou, D. (1992). Empirical Bayes and conditional inference with
many nuisance parameters.

*Biometrika*,

*79*, 261-270.

[48] Liang, K.-Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized
linear models.

*Biometrika*,

*73*, 13–22.

[49] Liang, K.-Y. and Zeger, S.L. (1989). A class of logistic regression models for mul-
tivariate binary times series.

*Journal of the American Statistical Association*,

*84*,447-451.

[50] Liang, K.-Y., Zeger, S.L. (1993). Regression analysis for correlated data.

*Annual*
*Review of Public Health*,

*14*, 43-68.

[51] Liang, K.-Y. and Zeger, S.L. (1995). Inference based on estimating functions in the
presence of nuisance parameters (with discussion).

*Statistical Science*,

*10*, 158-172.

[52] Liang, K.-Y., Zeger, S.L., and Qaqish, B. (1992). Multivariate regression analyses
for categorical data (with Discussion).

*Journal of the Royal Statistical Society, SeriesB*,

*54*, 3–40.

[53] Lin, X. and Breslow, N.E. (1996). Bias correction in generalized linear mixed models
with multiple components of dispersion.

*Journal of the American Statistical Asso-ciation*,

*91*, 1007–16.

[54] Lindsay, B. (1982). Conditional score functions: some optimality results.

*Biometrika*,

*69*, 503-512.

[55] Little, R.J.A. and Rubin, D.B. (1987).

*Statistical Analysis with Missing Data*. John
Wiley, New York.

[56] Mantel, N. and Haenszel, W. (1959). Statistican aspects of the analysis of data
from retrospective studies of disease.

*Journal of the National Cancer Institute*,

*22*,719-748.

[57] Mason, W.B. and Fienberg, S.E. (eds) (1985).

*Cohort Analysis in Social Research:*
*Beyond the Identiﬁcation Problem*. Springer-Verlag, New York.

[58] Maxwell, A.E. (1961).

*Analysing Qualitative Data*. Methuen, London.

[59] McCullagh, P. (1980). Regression models for ordinal data (with discussion).

*Journal*
*of the Royal Statistical Society, Series B*,

*42*, 109–42.

[60] McCullagh, P. (1983). Quasi-likelihood functions.

*Annals of Statistics*,

*11*, 59-67.

[61] McCullagh, P. and Nelder, J.A. (1989).

*Generalized Linear Models*. Chapman and
Hall, New York.

[62] Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalised linear models.

*Journal*
*of the Royal Statistical Society, Series A 135*, 370-384.

[63] Neyman, J. and Scott, E.L. (1948). Consistent estimates based on partially consis-
tent observations.

*Econometrica*,

*16*, 1-32.

[64] Patterson, H.D. and Thompson, R. (1971). Recovery of inter-block information when
block sizes are unequal.

*Biometrika*,

*58*, 545–54.

[65] Prentice, R.L. and Zhao, L.P. (1991). Estimating equations for parameters in means
and covariances of multivariate discrete and continuous responses.

*Biometrics*,

*47*,825–39.

[66] Qaqish, B. and Liang, K.-Y. (1992). Marginal models for correlated binary data
with multiple classes and multiple levels of nesting.

*Biometrics*,

*48*, 939–950.

[67] Rao, C.R. (1973).

*Linear Statistical Inference and Its Applications*, 2nd edition.

John Wiley, New York.

[68] Reid, N. (1995). The roles of conditioning on inference (with discussion).

*Statistical*
*Science*,

*10*, 138-199.

[69] Robbins, H. (1956). An empirical Bayes approach to statistic.

*Proceedings of the*
*Third Berkeley Symposium*,

*1*, 157–163.

[70] Rosner, B. (1989). Multivariate methods for clustered binary data with more than
one level of nesting.

*Journal of the American Statistical Association*,

*84*, 373–380.

[71] Royall, R.M. (1986). Model robust inference using maximum likelihood estimators.

*International Statistical Review*,

*54*, 221–26.

[72] Royall, R.M. (1997).

*Statistical Evidence: a Likelihood Paradigm*. Chapman and
Hall, London, England.

[73] Scharfstein, D.O., Rotnitzky, A. and Robbins, J.M. (1999). Adjusting for nonignor-
able drop-out using semiparametric nonresponse models (with discussion).

*Journalof the American Statistical Association*,

*94*, 1096–1146.

[74] Skellam, J.G. (1948). A probability distribution derived from the binomial distribu-
tion by regarding the probability of success as variable between the sets of trials.

*Journal of the Royal Statistical Society, Series B*,

*10*, 257–61.

[75] Stigler, S.M. (1986).

*The History of Statistics*. Belknap Press, Cambridge, Mass.

[76] Stiratelli, R., Laird, N. and Ware, J.H. (1984). Random eﬀects models for serial
observations with binary responses.

*Biometrics*,

*40*, 961–71.

[77] Tielsch, J.M., Sommer, A., Witt, K., Katz, J. and Royall, R.M. (1990). Blindness
and visual impairment in an American urban population: Baltimore eye survey.

*Archives of Ophthalmology*,

*108*, 286-290.

[78] Tweedie, M.C.K. (1947). Functions of a statistical variate with given means, with
special reference to Laplacian distributions.

*Proceedings of the Cambridge Phil.*

Society,

*49*, 41–49.

[79] Waterman, R.P. and Lindsay, B.G. (1996). A simple and accurate method for ap-
proximate conditional inference applied to exponential family models.

*Journal of theRoyal Statistical Society, Series B*,

*58*, 177–188.

[80] Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models
and the Gaussian method.

*Biometrika*,

*61*, 439–47.

[81] Weil, C.S. (1970). Selection of the valid number of sampling units and a consideration
of their combination in toxicological studies involving reproduction, teratogenesis orcarcinogenesis.

*Food and Cosmetics Toxicology*,

*8*, 177–182.

[82] Weinberg, C.R., Wilcox, A.J. and Lie, R.T. (1998). A log-linear approach to case-
parent-triad data: assessing eﬀects of disease genes that act either directly or thoughmaternal eﬀects and that may be subject to parental imprinting.

*American Journalof Human Genetics*,

*62,*, 969–978.

[83] White, H. (1982). Maximum likelihood estimation of misspeciﬁed models.

*Econo-*
*metrics*,

*50*, 1–25.

[84] Williams, D.A. (1988). Reader reaction: estimation bias using the beta-binomial
distribution in teratology.

*Biometrics*,

*44*, 305–309.

[85] Zeger, S.L. and Karim, M.R. (1991). Generalized linear models with random eﬀects:
a Gibbs sampling approach.

*Journal of the American Statistical Association*,

*86*,79–86.

[86] Zeger, S.L., Liang, K.-Y. and Albert, P. (1988). Models for longitudinal data: a
generalized estimating equation approach.

*Biometrics*,

*44*, 1049–1060.

[87] Zhao, L.P. and Prentice, R.L. (1990). Correlated binary regression using a general-
ized quadratic model.

*Biometrika*,

*77*, 642–48.

[88] Zhao, L.P, Prentice, R.L. and Self, S.G. (1992). Multivariate mean parameter esti-
mation by using a partly exponential model.

*Journal of the Royal Statistical Society,Series B*, 54, 805–812.

beta-binomial distribution, 22, 40, 41,
likelihood function, 22
likelihood ratio, 22
binary response, 2
likelihood-based inference, 21
biologic plausibility, 43
litter eﬀect, 22local optimality, 33
canonical link, 75
logistic regression model, 51
canonical parameter, 75
longitudinal study, 18
categorical response, 49cluster, 12
marginal model, 61, 62, 69
cluster-speciﬁc, 76
maximum likelihood estimator, 22
conditional score function, 32
method of moments, 31
contingency table, 56continuation ratio model, 55
nested design, 13
continuous response, 2
Neyman-Scott problem, 22
correlated data, 34
nuisance parameter, 21
correlation coeﬃcient, 15
observation driven model, 61, 65, 77
odds ratio, 19ordinal response, 50
empirical Bayes, 76
orthogonality, 32, 33
empirical logit, 38, 39estimating function, 24
partly exponential family, 71
exponential dispersion family, 26
polytomous logistic regression model, 53polytomous response, 49
population averaged, 63
Fisher information matrix, 25
proportional odds model, 52
generalized estimating equation, 35
quadratic exponential family, 72
generalized linear model, 26
quasi-likelihood, 28, 30, 44, 69
log-linear model, 27
quasi-score function, 30, 31, 33, 34
logistic regression model, 27, 39multiple linear regression, 27
random eﬀects model, 61, 63, 73
Poisson regression model, 27
random mechanism, 30
polytomous response, 51
regression analysis, 2
random component, 26
restricted maximum likelihood, 76
systematic component, 27
goodness-of-ﬁt, 30
score function, 25
Hammersley-Cliﬀord Theorem, 45
teratological experiment, 22, 40, 53, 54
unbiased estimating function, 31
variance speciﬁcation, 30, 31, 44
weighted least squares, 28within-cluster dependence, 12

Source: http://www.stat.nthu.edu.tw/EngWeb/Teachers/kyliang930225.pdf

Annu. Rev. Genet. 2002. 36:153–73 ° 2002 by Annual Reviews. All rights reserved STUDYING GENE FUNCTION IN EUKARYOTES BYCONDITIONAL GENE INACTIVATION Manfred Gossen1 and Hermann Bujard2 1Max Delbr¨uck Centrum, Robert-R¨ossle-Strasse 10, D-13125 Berlin, Germany;e-mail: mgossen@mdc-berlin.de; 2ZMBH, Universit¨at Heidelberg, Im NeuenheimerFeld 282, D-69120 Heidelberg, Germany; e-mail: h.bujard@zmbh.uni-heidelberg.de

Clinical Review & Education 2014 Evidence-Based Guideline for the Managementof High Blood Pressure in AdultsReport From the Panel Members Appointedto the Eighth Joint National Committee (JNC 8) Paul A. James, MD; Suzanne Oparil, MD; Barry L. Carter, PharmD; William C. Cushman, MD;Cheryl Dennison-Himmelfarb, RN, ANP, PhD; Joel Handler, MD; Daniel T. Lackland, DrPH;Michael L. LeFevre, MD, MSPH; Thomas D. MacKenzie, MD, MSPH; Olugbenga Ogedegbe, MD, MPH, MS;Sidney C. Smith Jr, MD; Laura P. Svetkey, MD, MHS; Sandra J. Taler, MD; Raymond R. Townsend, MD;Jackson T. Wright Jr, MD, PhD; Andrew S. Narva, MD; Eduardo Ortiz, MD, MPH