Runit.tex
Generalized Linear Models, Estimating Functions
Multivariate Extensions
Department of Biostatistics
School of Hygiene & Public Health
Johns Hopkins University
July 12–14, 1999
Regression analysis has been developed for many years and remains one of the most
commonly used statistical tools to help scientists address their scientific inquiries. Inthe year of 1972, thanks to the seminal work by Nelder and Wedderburn, many usefulregression models are unified under the framework of generalized linear models (GLMs).
These include as special cases ANCOVA, multiple regression for continuous responses, lo-gistic regression for binary responses and log-linear models for contingency tables. Theyasserted that each model under consideration can be cast through two components. Oneis the systematic component which relates the mean response to the covariates by spec-ifying a link function. The other is the random component which is needed to accountfor uncertainty in the response variable. Through this common framework, issues thatare commonly faced in regression such as estimating procedure, regression diagnostics,measurement errors in covariates, goodness-of-fit, etc. can be addressed in a unified fash-ion. While there is no doubt that this approach will be in its existence for many yearsto come, it faces two, among others, challenges that has drawn a good deal of attentionamong statistical researchers in the past few decades. One is to deal with the situationwhere the scientists may not have sufficient knowledge about the subject matters to beincorporated by their statistical colleagues to characterize the random mechanism forwhich the data are generated. The work by Wedderburn in 1974, known as the quasilikelihood method, provides a means to cope with this complication. It relaxes the ran-dom component assumption required by GLMs, rather a much weaker assumption on thevariance expression for the response variables is installed. Several questions remain tobe answered regarding the use of this alternative method. First, can the work by Wed-derburn be extended to accommodate more general and realistic variance specifications?Second, does the quasi score method by Wedderburn possess any desirable statisticaloptimality properties? Third, is there any evidence that this method perform well inpractice?
The other challenge GLMs face is how to deal with the situation where some of the ob-
servations are not necessarily statistically independent of each others. Implicitly, GLMsoperates on the assumption that all the observations are independent statistically of eachothers. More frequently, one witnesses in biomedical studies that such an assumptionmay be invalid. In the past decade or two, the statistical community also witnessed thetremendous development in statistical methods for analyzing data that are correlatedwith each others. A natural questions to raise is: what are the pros and cons of thesenew methods in terms of their suitability in addressing scientific objectives from studiesinvolving correlated data?
In this lecture note, we intend to address the above two posed challenges and questions
through the vehicle of estimating functions, a concept formally developed in 1960 bythe independent work of Durbin and Godambe. This lecture note has the followingorganization. In Chapter one, we first characterize five motivating examples which involvecorrelated data. We argued that for all examples considered, their scientific objectives canbe cast through regression modeling. Also discussed are (1) the statistical consequencewhen the correlation possessed by the data is ignored in the analysis and (2) quantitiesthat are useful for measuring the degree of associations for correlated data.
Chapter two provides a chronicle overview on some recent developments of regres-
sion techniques with focus on generalized linear models, quasi likelihood methods andestimating function approaches. This overview is preceded by some discussions on issues
concerning likelihood-based inference. These include the impact of nuisance parametersand cautions in using the popular Wald's procedure.
Chapters three and four may be viewed as a preview of statistical modeling for corre-
lated data. In Chapter three, a brief review on historic developments for analyzing pro-portional data leads to the following question: is the binomial assumption made thereina reasonable one? The well known "litter effect" in teratological experiments suggeststhat one of the key assumptions in binomial, namely, all the binary responses from withinare statistically independent of each others, is invalid. We discussed in Chapter threedrawbacks of some attempts in statistical literature in the 1970s and 1980s to relax thetwo key assumptions in binomial for proportional data. The teratological experimentdata also serves to illustrate the usefulness of the quasi likelihood method in dealingwith (1) uncertainty of random mechanisms and (2) impact of nuisance parameters.
Chapter four consists of two parts. In part one, we discussed three different statistical
models that are useful for polytomous responses in which the discrete response variablecontains three or more categories. Circumstances under which these models may be, ormay not be, appropriate are highlighted and contrasted through two examples. Thispart also serves to extend GLMs to incorporate either ordinal or nested polytomousresponse variables. In part two, we focused on interpretations of parameters induced bylog-linear models which are popular for analyzing data from contingency tables. Therehas been some renewed interest lately in transferring this approach for contingency tablesto correlated data analysis. Through detailed discussions on interpretations of log-linearparameters, we argued that such an application to correlated data may proceed withcaution.
In Chapter five, we contrasted three different statistical modelings for correlated data:
marginal models, random effects models and observation driven models with focus oninterpretations of induced parameters. Such contrasts and circumstances under whichthe proposed models may be most appropriate are illustrated through two examplesintroduced in Chapter one: The Baltimore Eye Survey Study and the SchizophreniaTrial Study.
Chapter six deals with statistical inference for correlated data when either one of
three statistical models mentioned above is employed. These methods may be viewed asmultivariate extensions of GLMs and quasi likelihood which were primarily designed foranalyzing univariate responses. Again, these newly developed methods were illustratedthrough the two above mentioned examples.
This lecture note is made possible through the kind invitation of Dr. C. Z. Wei,
the former Director of the Institute of Statistical Science, Academia Sinica, Taiwan,R.O.C., to be a part of the lecture series in statistics sponsored by the institute in 1999.
The coordinating efforts by Dr. C.H. Chen, the executive secretary of the lecture seriescommittee, and by the committee members are gratefully acknowledged. The typingof this lecture note and the lectures delivered in July, 1999 in Academic Sinica werediligently prepared by Ms. Patty Hubbard, Department of Biostatistics, Johns HopkinsUniversity and by Ms. Umy Chen, Institute of Statistical Science, Academia Sinica.
Their hard work and patience throughout are beyond what can be described in words –I thank them from the bottom of my heart.
1 Correlated Data
Characteristics of Examples . . . . . . . . . . . . .
Impact of Ignoring Within-Cluster Dependence . . . . . . . .
Incorrect Variance . . . . . . . . . . . . . .
Measures of Within-Cluster Dependence . . . . . . . . . .
2 Regression Analysis: An Overview
Some Issues Concerning Likelihood-Based Inference . . . . . . .
Impact of Nuisance Parameters . . . . . . . . . .
Cautions on Wald's Procedure . . . . . . . . . .
Generalized Linear Models . . . . . . . . . . . . . .
First Two Moments of GLMs . . . . . . . . . . .
Score Function for
β . . . . . . . . . . . . .
MLE Versus Weighted Least Squares . . . . . . . . .
Quasi-Likelihood Method . . . . . . . . . . . . . .
Estimating Function Approach . . . . . . . . . . . .
Limitations: Impact of Nuisance Parameters
Orthogonality Properties of Conditional Score Functions . . .
(Local) Optimality of Quasi-Score Functions
3 Analysis of Binary Data
Historical Developments . . . . . . . . . . . . . .
Empirical Logit versus Logistic Regression . . . . . . .
Is the Binomial Assumption Reasonable? . . . . . . .
Teratological Experiments . . . . . . . . . . . . . .
Some Cautionary Notes on Beta-Binomial Distributions . . .
An Alternative Inferential Procedure . . . . . . . . .
Further Relaxation of Binomial Assumptions . . . . . . . .
Multiplicative Model . . . . . . . . . . . . .
4 Analysis of Polytomous and Count Data
Types of Polytomous Data . . . . . . . . . . . . . .
Regression Models for Polytomous Responses . . . . . . . .
Polytomous Logistic Regression Models . . . . . . . .
Proportional Odds Models . . . . . . . . . . . .
Continuation Ratio Models . . . . . . . . . . .
Log-Linear Models for Contingency Tables . . . . . . . . .
Interpretations of Log-Linear Model Parameters
An Alternative Representation of Log-Linear Models . . . .
5 Statistical Modellings for Correlated Data
Observation Driven Models . . . . . . . . . . . . .
Contrasts of Three Modelling Approaches . . . . . . . . .
6 Statistical Inference for Correlated Data
Quasi-Likelihood Approach . . . . . . . . . . .
Likelihood Approach . . . . . . . . . . . . .
Conditional Likelihood Approach . . . . . . . . . .
Likelihood Approach . . . . . . . . . . . . .
Observation Driven Models . . . . . . . . . . . . .
Likelihood Approach . . . . . . . . . . . . .
Quasi-Likelihood Approach . . . . . . . . . . .
Pre- Post Trial on Schizophrenia Revisited . . . . . . . . .
List of Tables
Some features of five examples in
§1.1. . . . . . . . . . .
Ranges of correlation coefficients for a pair of binary responses with cor-responding means
πj and
πk, respectively. . . . . . . . . .
Proportions of children surviving 2 years free of disease following diagnosisand treatment for neuroblastoma by age and stage of disease at diagnosis(Breslow & McCann, 1971). . . . . . . . . . . . . .
Proportions of fetuses surviving the 21 day lactation period among thosewho were alive at 4 days by treatment group (Weil, 1970). . . . .
The averaged MLE (
± s.e.) for
θ for different assumed values of
φ from1,000 simulations (Williams, 1988). . . . . . . . . . . .
Estimates and standard errors of
β1 = log
{θT /(1
−θT )
}−log
{θC/(1
−θC)
}of Weil's data presented in Table 3.2. Upper entry: common
φ; lower entry:heterogeneous
φ (Liang and Hanfelt, 1994). . . . . . . . .
Simulation results for
β1 = log
{θT /(1
− θT )
} from 1,000 replications mim-icking the data structure of Weil's. The true
β1 is
−1
.129 and
φT = 0
.317and
φC = 0
.021 (Liang and Hanfelt, 1994). . . . . . . . .
Regression estimates (
± s.e.) based on both proportional odds models andpolytomous logistic regression models for the disturbed dream example(Maxwell, 1961). . . . . . . . . . . . . . . . .
Regression estimates (
± s.e.) based on proportional odds models and con-tinuation ratio models for the tonsil size example (Holmes and Williams,1954). . . . . . . . . . . . . . . . . . . .
Regression estimates and standard errors (in parenthesis) for the visualimpairment response from the Baltimore Eye Survey Study. . . . .
Regression estimates and standard errors (in parenthesis) from three mod-elling approaches: Baltimore Eye Survey Study. . . . . . . .
Regression estimates and standard errors (in parenthesis) from three mod-elling approaches: a pre- post trial on schizophrenia. . . . . . .
LIST OF TABLES
List of Figures
The plot of the logarithm of
V1
/V2 versus
ρ for selected cluster sizes
n(2
, 5
, 10) and
φ(0
, 0
.2
, 0
.5
, 0
.8
, 1
.0). Here,
V1 is the variance of the leastsquared estimate ˆ
β1 when the within-cluster dependence is ignored and
V2 is the correct variance. We have assumed E(
Yij) =
β0 +
β1
xij and
ρ = corr(
Yij, Yik)
, j < k = 1
, . . , n :
φ is the ratio of the between clustervariance to the total variance among the
xij's. . . . . . . .
The plot of
V3
/V2 versus
ρ for selected
n(2
, 5
, 10) and
φ(0
, 0
.2
, 0
.5
, 0
.8
, 1
.0).
Here
V2
, ρ, φ, and the assumed model are the same as described in Figure1.1,
V3 is the variance of the best unbiased estimate of
β1. . . . .
Beta-binomial log-likelihoods for the exposed group based on the datareported by Weil (1970). . . . . . . . . . . . . . .
Predicted risk of VI by race and age among those with 9 years of educationin Baltimore Eye Survey Study (Tielsch et al., 1990): solid line, whites;dashed line, blacks. . . . . . . . . . . . . . . .
Mean responses in PANSS by dropout cohort. . . . . . . . .
LIST OF FIGURES
Chapter 1
Regression analysis is commonly used in biomedical research. A typical assumptionbehind this tool is that all observations are statistically independent, or at least uncor-related with each other. The following research problems in which we have been directlyor indirectly involved suggest that there are many situations for which this assumptionmight not be met.
Example 1.1 (Baltimore Eye Survey Study). Over 5,000 individuals aged 40 and
above received a visual examination as part of a population-based prevalence study ofocular disorders (Tielsch et al. 1990). The main objective of the summary is to identifydemographic variables such as race and age that are associated with vision loss. Thecomplication of the study is that data are available on both eyes which are unlikely tobe independent because many causes of eye impairment are binocular.
Example 1.2 (Family Study on Chronic Obstructive Pulmonary Disease (COPD)). Six
hundred and thirteen members of 158 COPD cases seen at the Johns Hopkins Hospitalwere examined and given spirometry tests (Cohen, et al., 1977). The primary objectiveis to determine the percent of total variance attributable to unobserved genetic factorsshared among siblings and their parents. It is typical in family studies that the responsevariables, FEV1 in this case, are measured for related individuals, which in turn arecrucial to address scientific questions of interest such as the one stated above.
Example 1.3 (Sister Chromatid Exchange Study). A total of 14 hepatocelluar carci-
noma, 14 nasopharyngeal carcinoma and 16 cervical cancer patients and their age-sex-matched controls, were studied to compare the frequency of sister chromatid exchange(SCE) in their peripheral lymphocytes. The main hypothesis, which was tested in thestudy conducted by Professor C.J. Chen, College of Public Health, National Taiwan Uni-versity, is that cancer patients may have a higher frequency of SCE in lymphocytes thanmatched controls. The outcome variable is the number of SCE per cell where 20 cellsfrom each subject were cultured.
Example 1.4 (Pre-Post Trial of Schizophrenia). Five hundred and twenty three pa-
tients diagnosed with schizophrenia were randomly allocated amongst six treatments:placebo, haloperidol 20 mg and risperidone at dose levels 2mg, 6mg, 10mg and 16mg.
The primary response variable is the total score obtained on the Positive and Negative
CHAPTER 1. CORRELATED DATA
Symptom Rating Scale (PANSS), a measure of severity of schizophrenia. This score foreach patient was measured at baseline, weeks 1, 2, 4, 6 and 8 after randomization. Themain objective is to determine if risperidone improves the rate of decline in PANSS overthe two month period.
Example 1.5 (Diabetic Retinopathy Study). Over 1,700 patients with diabetic retinopa-
thy and visual acuity of 20/100 or better in both eyes participated in the study (Mur-phy and Patz, 1978). One eye of each patient was randomly selected for laster photo-coagulation and the other was observed without treatment. The main objective is tostudy the effectiveness of laster photo-coagulation in delaying the onset of blindness andthe response variable for each eye is time from randomization to the occurrence of visualacuity less than 5/200.
Characteristics of Examples
Although different in their scientific objectives, the examples introduced above sharesome important characteristics in common that are critical to the statistical developmentpresented in Chapters 5 and 6. First, data in each of these examples are organized in"clusters." For example, in the Baltimore Eye Survey Study, each cluster comprisesobservations of two eyes from an individual. In the family study (Example 1.2), a clusteris formed by a family with observations from related individuals. Secondly, responsevariables from the same cluster are likely to be correlated with each other. In longitudinalstudies where repeated observations from each subject are measured over time, this isknown as "tracking." In family studies, observations from related individuals such assiblings are correlated due to shared genes and/or environments. This explanation canbe applied to the fellow eyes in Example 1.1 as well. Thirdly, the scientific objectivesof these examples can be formulated statistically in terms of regression analyses as onewould have been for independent responses. For example, one could use the conventionallogistic regression to relate the risk of visual impairment for each eye in the BaltimoreEye Survey Study to demographic variables such as age and race. In addition, one canassess how the degree of within-cluster dependence between fellow eyes on variables suchas race, through regression as well. The issue of measuring within-cluster dependencewill be addressed in
§1.4, whereas more formal statistical modellings for correlated datawith the focus on regression will be the central theme of Chapter 5.
The examples introduced above, on the other hand, differ from another in some impor-
tant ways. First, the type of response varies from binary (Example 1.1), count (Example1.3), continuous (Example 1.2 and 1.4) to survival subject to censoring (Example 1.5).
Secondly, they differ in the structure of within-cluster dependence. For example, if thesampling unit for the Baltimore Eye Survey Study were the household in which eachmember of the sampled household was offered for eye examinations, more complicateddependence structure would result since one would expect responses from family mem-bers to be correlated to each other. Furthermore, the degree of association between felloweyes is likely to be stronger than responses of eyes from two related subjects. Designs ofthis kind are special cases of the so-called "nested design." Examples include interventionstudies in which children (patients) are nested within household (clinicians) which in turnare nested within villages (clinics). Special care is required to model within-cluster de-pendence for data collected from nested designs or from family studies where the degreeof association for siblings is likely to be different than that of parents (Qaqish and Liang,
1.3. IMPACT OF IGNORING WITHIN-CLUSTER DEPENDENCE
Finally, the examples differ from one another with respect to their focus. In longitu-
dinal studies (Example 1.4), the regression of response variables on explanatory variablessuch as time (in weeks) from randomization, treatment status and their interactions ismost important whereas the within-cluster dependence may be viewed as a nuisance. Onthe other hand, it is typical in family studies (Example 1.2) that the primary focus ison the role genetic factors play in disease as captured by the percent of total variationin response variables attributable to genetic components. In this situation, the within-cluster association represents the primary focus even though regression adjustment forthe regression variable is crucial to separate out the environmental impact from the ge-netic ones. Table 1.1 summarizes both the common factors and differences among thefive examples.
Table 1.1: Some features of five examples in
§1.1.
Scientific Focus
Impact of Ignoring Within-Cluster Dependence
A natural question to raise is "what happens if one ignores the within-cluster depen-dence and uses the conventional regression method assuming independence among ob-servations?" In the Baltimore Eye Survey example, this amounts to treating 10,398 ob-servations from 5,199 individuals as if they were collected from 10,398 individuals withone observation (could be from the right or left eye) per individual. From a statisti-cal viewpoint, there are at least two consequences as a result of ignoring within-clusterdependence: incorrect assessment of precision of regression coefficient estimates and in-efficient estimation of regression coefficients. We now address these two issues in greaterdetail through a simple model for correlated data.
Let
Yij be the
jth observation from the
ith cluster,
j = 1
, . . , n, i = 1
, . . , m. Here
n is the common cluster size and
m the number of clusters. We assume for each
i,
E(
Yij) =
β0 +
β1
xij, Var(
Yij) =
σ2
,Cov(
Yij, Yik) =
σ2
ρ, j < k = 1
, . . , n.
This model assumes the expected value of
Y is a simple linear function of a covariate,
x, and that the correlation, as measure by correlation coefficient, between each pairof responses from the same cluster has the same value,
ρ, say. The focus here is on theinference for
β1 when ignoring the fact that responses from the same cluster are correlatedwith each other. This amounts to the use of the ordinary least square estimator ˆ
CHAPTER 1. CORRELATED DATA
estimate
β1 where
(
yij − yi)(
xij − xi) /
(
xij − xi)2
,
i=1
j=1
i=1
j=1
with
yi and
xi be the sample means of the
Yi's and
xi's from the
ith cluster.
Ignoring the within-cluster association leads to the use of
V1 =
σ2
/
(
xij − x)2 =
σ2
/VT ,
j ij / (
nm), as the variance estimate of ˆ
β1. The correct variance,
V2, of
β1 has the form
V2 =
V1
{1 +
ρ(
nφ − 1)
},
where
φ =
n
i −x)2 /
VT is the fraction of the total variation in the
x's explained by
the between cluster variation in
xi's. Two important cases of
φ deserve special attention.
One is when
φ = 0, i.e.
x1 =
. . , xm. This occurs commonly in longitudinal studies inwhich every subject's response is measured at the same set of times. In this case,
β1 isestimated mainly by using the within-cluster changes in
Y . The other extreme cases iswhen
φ = 1, i.e.
xi1 =
xi2 =
. . =
xin for all
i ∈ m. This is typcial in studies such asthe Baltimore Eye Survey Study where all the covariates are cluster-specific.
Figure 1.1: The plot of the logarithm of
V1
/V2 versus
ρ for selected clustersizes
n(2
, 5
, 10) and
φ(0
, 0
.2
, 0
.5
, 0
.8
, 1
.0). Here,
V1 is the variance of theleast squared estimate ˆ
β1 when the within-cluster dependence is ignored
and
V2 is the correct variance. We have assumed E(
Yij) =
β0 +
β1
xij and
ρ = corr(
Yij, Yik)
, j < k = 1
, . . , n :
φ is the ratio of the between clustervariance to the total variance among the
xij's.
1.3. IMPACT OF IGNORING WITHIN-CLUSTER DEPENDENCE
Figure 1.1 shows plots of log(
V1
/V2) = log
{1 +
ρ(
nφ − 1)
} against
ρ for some selected
n's and
φ's. The message is rather clear. For within cluster comparisons, i.e.
φ = 0, theconfidence interval based on
V1 is wider than should be and discrepancy between
V1 and
V2 increases with
ρ. On the other hand, for between cluster comparisons, i.e.
φ = 1, theconfidence interval based on
V1 is too narrow and the discrepancy also increases with
ρ. In either case, invalid scientific conclusions, which could be either false positive ornegative, may be drawn if
V1 is used as the variance estimate of ˆ
While the ordinary least square estimate ˆ
β1 remains unbiased as an estimator of
β1, the
well known Gauss-Markov estimating theorem suggests that the uncertainty in estimating
β1 may be reduced by using the weighted least square estimator, ˜
β1 say, which properly
accounts for the within-cluster association. Under the specified correlation structure in(1.1), ˜
β1 has the variance of form
V3 =
V1(1
− ρ)
{1 + (
n − 1)
ρ} /
{1
− ρ +
nρ(1
− φ)
}.
Figure 1.2 shows plots of
V3
/V1 against
ρ for selected
n's and
φ's. Interestingly, no
The plot of
V3
/V2 versus
ρ for selected
n(2
, 5
, 10) and
φ(0
, 0
.2
, 0
.5
, 0
.8
, 1
.0). Here
V2
, ρ, φ, and the assumed model are the sameas described in Figure 1.1,
V3 is the variance of the best unbiased estimateof
β1.
efficiency loss occurs when
φ is equal to 0 or 1, irrespective of
ρ and
n. On the otherhand, a great deal of efficiency loss by using ˆ
β1 may result when
φ approaches 0
.5. Such
phenomenon is more apparent with increased
ρ but less so for
n. The main messagein Figure 1.2 is that ignoring correlation could lead to a substantial loss of statisticalpower especially when both within-cluster and between-cluster information is being usedto estimate
β1 and that it is important to examine and incorporate the within-clusterassociation structure so as to help improve upon the efficiency for the
β inference.
CHAPTER 1. CORRELATED DATA
Measures of Within-Cluster Dependence
Adequately describing the pattern of within-cluster dependence is important in severalregards: (i) it may represent the main scientific objective as is typically the case in familystudies (Example 1.2) and (ii) it may help to more precisely characterize the relationshipbetween the means and covariates as asserted in
§1.3.2. In this section, we briefly discusshow one measures the within-cluster association and its patterns.
For continuous responses, the most commonly used measure of dependence between
a pair of responses from the same cluster is correlation coefficient, defined as
j , Yk ) =
{
, j < k = 1
, . . , n.
This quantity takes values between
−1 and 1 inclusively. Very little dependence between
Yj and
Yk may be claimed if
ρ is close to zero and a strong association is evident if
ρis close to either 1 or
−1. Furthermore, a positive association between
Yj and
Yk, i.e.
ρ > 0, means
Yj tends to be larger than expected if
Yk is and vice versa. Finally, thismeasure of association has the desired property that it be dimensionless and symmetric.
For the former, it means that
ρ for
Yj and
Yk is the same as that for
cYj and
cYk where
c is a non-zero constiuent. Thus, the same magnitude of
ρ results whether one uses, forexample, meter or foot as the unit for height or length. For the latter, it matches with theintuition that the association between
Yj and
Yk has the same concept as that between
Yk and
Yj. For a cluster of size greater or equal to 3, it raises the question as to howone distinguishes between, for example,
ρ1
,2 and
ρ1
,3, the correlation coefficient between
Y and
Y2 and between
Y1 and
Y3, respectively? The answer to this question depends onthe nature of the clustering and sometimes, on how the scientific objective is formulated.
For longitudinal studies, it is a common belief that the correlation coefficient betweentwo observations adjacent in times is likely to be larger than that when two observationsare far apart in times. This is a part of the notion, known as "tracking" in longitudinalstudies. To capture this phenomenon, one may model
ρ(
Yj, Yk) =
α tj−tk ,
where
tj and
tk are times at which
Yj and
Yk are observed, respectively. This patternis known as the AR-1 (auto-regressive model of order one) model. For more detaileddiscussion on describing and empirically examining patterns of within-subject associationfor longitudinal data, we refer the readers to the work by Diggle (1988), Diggle, Liangand Zeger (1994) and the references therein.
For family studies of first degree relatives, i.e. parents, siblings and offspring, the
pattern of within-family association is likely to be different from that from longitudinalstudies. For siblings, it would be sensible to assume that the degree of association,as measured by
ρ, is the same for each pair of siblings. However, it would be equallysensible to allow this correlation coefficient (
ρSS) to be different than that of parents,
ρPP. Indeed, any sensible statistical procedure should be flexible enough to allow one totest the equality of
ρSS and
ρPP. Assuming that the same environment is shared by allrelatives in the same family, a larger
ρSS than
ρPP would suggest that for the continuousresponses considered, known as quantitative traits in the field of genetic epidemiology,genetic factors may play a non-trivial role. For more detailed discussion on how one may
1.4. MEASURES OF WITHIN-CLUSTER DEPENDENCE
model and estimate patterns of within-family association, see, for example, Beaty et al.
(1997).
The use of correlation coefficient as a measure of association is less useful for binary
responses. This is because the range of
ρ is narrowed considerably due to the constraintthat
max(0
, πj +
πk − 1)
< Pr(
Yj =
Yk = 1)
< min(
πj, πk)
.
where
πj = Pr(
Yj = 1) and
πk = Pr(
Yk = 1). Furthermore, the degree of constraintdepends on values of the
π's. Table 1.2 shows ranges of
ρ as a function of selected
πjand
πk. For example, if the true
πj(
πk) is 0
.1(0
.3), then the corresponding
ρ between
Yjand
Yk must be greater than
−0
.22, but less than 0
.51, much narrower than the idealsituation of (
−1
, 1). This constraint in range of
ρ makes the use of
ρ much less desirablefor binary responses than it is for continuous responses, especially when characterizingwithin-cluster association represents the main objective. As an alternative, one may
Table 1.2: Ranges of correlation coefficients for a pair of binary responseswith corresponding means
πj and
πk, respectively.
consider the use of the odds ratio (OR)
j =
Yk = 1) Pr(
Yj =
Yk = 0)
j , Yk ) =
Pr(
Yj = 1
, Yk = 0) Pr(
Yj = 0
, Yk = 1)
as a measure of association between a pair of responses,
Yj and
Yk. For this quantity, noconstraint is induced other than the fact that it must be positive. A positive associationresults if OR is greater than one and a negative association may be claimed if OR isless than one. Furthermore, the odds ratio is symmetric as well and is familiar to publichealth researchers for its ease in interpretation. Just as in continuous response situations,one can model the odds ratio, or more preferably its log version, through regression. Fora more detailed discussion on the use of odds ratios for within-cluster associations, seeHeagerty and Zeger (1998) for longitudinal data and Liang and Beaty (1991) for familydata. Finally, different measures of association between discrete variables have beensuggested in the literature; see Goodman and Kruskal (1979). We choose OR as theprimary measure of within-cluster dependence for discrete data, mainly because it iseasy to interpret and is familiar to biomedical researchers.
CHAPTER 1. CORRELATED DATA
Chapter 2
Regression Analysis: An
Overview
As mentioned earlier, regression analysis has been and remains one of the most commonlyused statistical tools in biomedical research. In clinical research, this technique is useful tohelp assess the degree of treatment effects when randomized clinical trials are conducted.
It can also be used for clinicians to identify prognostic factors of disease for predictionpurposes. In public health research, regression analysis is instrumental to help identifyrisk factors of disease for prevention purposes. It is also useful in observational studies toadjust for confounding variables that are not considered in the design stage for matching.
In this chapter, we provide a chronicle overview on some recent developments of regressiontechniques with focus on methods including generalized linear models, quasi-likelihoodmethods and estimating function approaches. Since these methods are, by and large,likelihood-based, we first review some key issues in likelihood-based inference that isparticularly relevant to subsequent developments.
Some Issues Concerning Likelihood-Based Infer-
ence
Let
f (
·;
θ, φ) be the probability (density) function for a random variable,
Y , of size
mindexed by two sets of parameters,
θ of p-dimensional and
φ of q-dimensional. Here
θrepresents parameters of interest reflecting scientific interest and
φ the so-called "nui-sance parameters." Nuisance parameters, by definition, are of little intrinsic interest toinvestigators, yet necessary to fully specify the random mechanism,
f (
·;
θ, φ), for whichthe data,
Y ≡ y, are generated. A typical example of nuisance parameters is the exposedprobability of controls in case-control studies in which the interest is on the associationbetween the studied disease and the hypothesized exposure of interest, characterized bythe odds ratio,
θ. Upon the data were observed, the only quantities in
f that are un-known to investigators are
θ and
φ. The phrase "likelihood function" is formally definedas a function of
θ and
φ that is proportional to
f (
y;
θ, φ), i.e.
L(
θ, φ)
≡ L(
θ, φ Y =
y)
∝ f (
y;
θ, φ)
.
CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW
Impact of Nuisance Parameters
When the value of nuisance parameters
φ is known, one can contrast two distinct valuesof
θ, θ1 and
θ2, by computing the following likelihood ratio
A likelihood ratio which is greater than one would suggest that there is evidence in fa-vor of
θ1 instead of
θ2 as conveyed by the data through the likelihood function (e.g.
Royall, 1997). Furthermore, the larger the likelihood ratio in magnitude, the strongerthe evidence. However, when
φ is unknown to investigators, one faces the additionalcomplication due to the need of specifying
φ when computing LR(
θ1
, θ2). The followingexample from a teratological experiment (Weil, 1970) illustrates that in the absence ofknowledge in
φ, the presence of nuisance parameters may have a profound impact onthe likelihood inference for parameters of interest. Figure 2.1 shows plots of log
L(
θ, φ)against
θ for selected
φ values. Here the likelihood function is proportional to a productof 16 observations following a beta-binomial distribution indexed by
θ, the probability ofsurviving a 21 day lactation period for a fetus and
φ, the nuisance parameter characteriz-ing the so-called "litter effect", an effect commonly observed in teratological experiments.
More detailed treatments of analysis for teratological experiments in general and of thisexample in particular will be given in Chapter 3. Choosing
θ1 = 0
.8 and
θ2 = 0
.6 as twoplausible values for
θ, we note that different values of
φ lead to different conclusions as towhich
θ is favored more by the data. For example,
L(0
, 8
, 0
.8)
/L(0
.6
, 0
.8) = exp(
−2
.89)strongly suggests that
θ = 0
.6 is favored over
θ = 0
.8 when
φ = 0
.8 is used, whereas
L(0
.8
, 0
, 1)
/L(0
.6
, 0
, 1) = exp(5
.82) strong supports the opposite conclusion.
Nuisance parameters could have non-trivial impacts on estimating
θ as well. The
following well known Neyman-Scott problem (Neyman and Scott, 1948) demonstratesthat with many nuisance parameters at hand, the conventional maximum likelihoodapproach for estimating
θ may not even be consistent at all. Consider
{Yij, j = 1
, . . , n},independent observations from the
ith of
m independent clusters following a normaldistribution with mean
φi and a common variance
θ, i = 1
, . . , m. The likelihood functionfor
θ and
φ = (
φ1
, . . , φm) is proportional to
L(
θ, φ
1
, . . , φm)
∝
θ− n2 e
It is easy to verify that the maximum likelihood estimator (MLE) of
θ, as a result ofjointly maximizing the likelihood function with respect to
θ and
φ, has the form
(
yij − yi)2
/(
nm)
.
i=1
j=1
Assuming that the cluster size,
n, is fixed by design, it can be easily shown that ˆ
converges as
m → ∞ to (
m − 1)
θ/m instead of
θ. Such an undesirable phenomenonfor ˆ
θ is not specific to the example given above. Indeed, it is well known that when the
dimension of nuisance parameters, increase with the sample size,
m in this case, the MLEof
θ is in general inconsistent (e.g. Andersen, 1970). An important special case of this
2.1. SOME ISSUES CONCERNING LIKELIHOOD-BASED INFERENCE
Figure 2.1: Beta-binomial log-likelihoods for the exposed group based onthe data reported by Weil (1970).
kind is on the estimation of the common odds ratio when the one-to-one matched case-control design is adopted (Breslow and Day, 1980). Indeed, it has been shown (Breslow,1981) that ˆ
θ converges, as the number of matched pairs increases, to
θ2 instead of
θ.
Many likelihood-based inferential procedures have been developed to eliminate, or at
least to reduce, the impact of nuisance parameters; see for example the work of Kalbfleischand Sprott (1970), Andersen (1970), Cox (1972), Lindsay (1982), Cox and Reed (1987),Reed (1995) and references therein. These methods rely upon the availability of thelikelihood function that is fully and correctly specified. In
§2.4 we present an alternativeapproach, the estimating function approach, which also serves to eliminate (or reduce) theimpact of nuisance parameters without the need to fully specify the likelihood function.
Cautions on Wald's Procedure
The basis behind the frequentist approach builds on the repeated sampling, or long run,interpretation for inferential making (e.g. Cox and Hinkley 1974). A central theme ofthis approach is on testing the null hypothesis that
θ =
θ0, a pre-specified value andon computing confidence intervals for
θ. With the exception that
f (
·;
θ, φ) is generatedfrom an exponential family (Lehmann, 1959), one needs to appeal to the large sampleapproximation for the purpose, for example, of deriving confidence intervals. To thisend, three large sample inferential procedures have been developed and employed on adaily basis: likelihood-ratio-based, score-statistic-based and Wald-based. We now brieflydescribe these three procedures from the viewpoint of testing the null hypothesis that
H0 :
θ =
θ0. Their application to confidence intervals can be converted through hy-
CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW
pothesis testing in a straightforward manner and will be mostly omitted. As the namesuggested, the likelihood-ratio-based method relies upon the use of the ratio of the like-lihood function evaluated at
θ0 and ˆ
θ for testing
H0, i.e.
1 =
−2 log
φ) are the MLE of (
θ, φ) and ˆ
φ(
θ0) the MLE of
φ under
H0. The score-statistic-
based procedure relies upon one of the key properties of the likelihood function, namely,
E(
S(
θ, φ);
θ, φ) = 0
∀θ, φ,
S(
θ, φ) =
∂ log
L(
θ, φ)
/∂(
θ, φ)
,
known as the score function for (
θ, φ). Based on (2.3), one would reject
H0, on theintuitive ground, if
S(
θ0
, φ) is "large." To assure that this assessment of
S(
θ0
, φ) beinglarge is not affected by the unit of the
Yi's, one needs to standardize
S by considering
T2 =
St(
θ0
, ˆ
I(
θ, φ)
is known as the Fisher information matrix for (
θ, φ). Finally, in contrast to the LRprocedure in which
θ0 and ˆ
θ are compared indirectly through the likelihood function, the
Wald's method uses
θ − θ0)
t(
Iθθ − ItθφI−1
I
φφ φθ )
(ˆ
to test
H0, here we have re-expressed
I(
θ, φ) as
I(
θ, φ) =
where, for example,
Iθθ = E(
−∂2 log
L(
θ, φ)
/∂θ2)
Iθφ = E(
−∂2 log
L(
θ, φ)
/∂θ∂φ)
.
Detailed treatments on these three large sample procedures can be found in Rao
(1973,
§6.e). In particular, it was shown that under some regularity conditions, allthree test statistics converge, as
m → ∞, to a
χ2
p distribution and furthermore, areequivalent asymptotically to each other, i.e. the difference between
T1 and
T2, and hence
T1 and
T3 and
T2 and
T3, is in the order of
op(
m). Despite the similarity among thesethree procedures when the sample is large, the Wald procedure possess some undesirableproperties that are not shared by the other two procedures. First, this procedure is notinvariant in the following sense. Let
δ =
δ(
θ) be a one-to-one function of
θ so that the null
2.2. GENERALIZED LINEAR MODELS
hypothesis of
θ ≡ θ0 is equivalent to
δ ≡ δ(
θ0) =
δ0. Unlike
T1 and
T2, the Wald's teststatistic is different numerically depending on the choice of parameters used to formulatethe null hypothesis. The following hypothetical example on testing the hypothesis of anassociation from a 2
× 2 table illustrates the shortcoming. With 8(2) out of 18 cases (17controls) exposed to a pre-specified risk factor, one has the MLE for
θ, the odds ratio,equal to 6.0 with an estimated standard error (s.e.) of 5.35. This leads to a Wald teststatistic value of (6
− 1)2
/(5
.34)2 = 0
.70 with p-value
< 0
.1. On the other hand, if oneused
δ = log
θ as the basis for inference, the corresponding Wald's test statistic is equalto (log 6
− 0)2
/(0
.89)2 = 4
.0, highly statistically significant at the 0.05 level.
The other drawback of the Wald procedure is that in certain circumstances
T3 may be
"powerless" when the true
θ value,
θ say, is far different from
θ0, i.e.
θ0
−θ is large inmagnitude. This counter-intuitive phenomenon may be explained at least heuristically asfollows. For simplicity, we assume
p = 1 and there is no nuisance parameters, i.e.
q = 0.
It is true in general that the asymptotic variance of ˆ
θ, I−1(
θ), depends on
θ. While the
numerator of
T3
, (ˆ
θ − θ0)2, increases as
θ − θ0
increases since ˆ
θ → θ, so does the
denominator of
T3
, I(
θ). Under the circumstances that
I(
θ) increases at a faster ratethan (ˆ
θ − θ0)2 does,
T3 becomes arbitrarily small in spite of the fact that the true
θ, θ is
very different from
θ0. This undesirable property of
T3 was first observed and explainedfor logistic regression models by Hauck and Donner (1977). It remains an open questionas to how one may characterize such "circumstances" leading to the peculiar behavior ofthe Wald procedure.
For illustration, consider the following time to relapse data from a clinical trial re-
ported in Lee (1992). For five breast cancer patients receiving CMF (
x = 1) after a radicalmastectomy, the times to relapse were recorded as 23
, 16+
, 18+
, 20+
, 24+ in weeks, wherethe + sign represents censoring. The five control patients (
x = 0) all experienced re-lapse at 15
, 18
, 19
, 19 and 20 weeks. A proportional hazard model with
x, the treatmentstatus, as the sole covariate was fitted to the above data resulting in ˆ
β =
−16
.66 and
s
.e
.( ˆ
β) = 1513. This leads to a Wald's test statistic value of 0
.0001 with p-value = 0
.991.
The other two test statistics, the likelihood ratio and score test statistic, amount to 8
.76and 6
.90 with p-values 0
.003 and 0
.009, respectively, matching well with the intuition asconveyed by the data that the new treatment (CMF) prolongs time to relapse for breastcancer patients receiving radical mastectomy.
Generalized Linear Models
Multiple linear regression analysis and logistic regression analysis are daily employedin biomedical research for reasons given at the onset of this chapter. In 1972, thesemethods, among others, are unified under the framework of generalized linear models(GLMs) (Nelder and Wedderburn). We now briefly review this seminal work along withseveral key properties of GLMs.
For a sample of size
m, let
Yi be the univariate response and
xi, a
p × 1 vector, be
the covariates thought to be related to
Yi, i = 1
, . . , m. To specify a GLM, the followingtwo components are sufficient (Nelder and Wedderburn, 1972):
(A) (Random component).
Yi is generated from an exponential dispersion family
CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW
(Tweedie, 1947, J
φhanssen, 1987) of the form
f (
Y
iyi − b(
θi)
i =
y;
θi, φ) = exp
+
c(
y
i;
φ)
(B) (Systematic component). The expectation of
Yi, denoted as
µi, is related to
xi
through the link function
h, i.e.
µi = E(
Yi xi) =
h−1(
xtiβ)
.
Some special cases of GLMs are as follows.
Example 2.1 (Multiple linear regression models). This corresponds to a special case
with
h(
µ)
≡ µ, the "identity" link function,
a(
φ)
≡ φ and
Yi being normally distributedwith mean
µi =
θi and variance
φ.
Example 2.2 (Logistic regression models). For a binary response with
Y = 1 or 0, this
popular model is a special case of GLMs with
h(
µ) = log
{µ/(1
− µ)
}, the "logit" link,
a(
φ)
≡ 1, and
Yi following a binomial distribution of size 1 and probability of "success"equal to
µi.
Example 2.3 (Poisson regression models). For count responses which are common in
cohort analysis with the number of "events" such as the death or diagnosis of a disease,this corresponds to the use of the "log" link, i.e.
h(
µ) = log
µ, a(
φ)
≡ 1 and a Poissondistribution with mean
µi as the basis for inference. Note that the index
Example 2.4 (Log-linear models). For data that can be expressed in contingency
tables, log-linear models are commonly used (e.g. Bishop, Fienberg and Holland, 1975).
Here as in Example 2.3, the index
i represents the group of individuals sharing similarcharacteristics which are categorical in nature. This model can also be cast as a GLM with
h(
µ) = log
µ and a Poisson distribution for
Yi, which is the number of individuals with
xias the covariate values. A major difference between this model and the Poisson regressionmodel lies on the objective of interest. For log-linear models, the primary interest is onexamining the pattern of associations among categorical variables represented by
x. Forthe latter, it is of interest to identify risk factors characterized by
x that may be associatedwith the risk of a particular event.
We now present several key properties of GLMs that are particularly relevant to the
First Two Moments of GLMs
The exponential dispersion family displayed in (2.4) does not explicitly reveal how themoments of
Yi, in particular the first two moments, may be related to
θi and
φ. Simplealgebra suggests that
The expression for the variance of
Yi explains the rationale behind the phrase "dispersionparameters" for
φ. Finally, given the relationship between
θi and
β through (2.5) and(2.6) above, it is clear that
f (
Y ;
θ, φ) can also be expressed as
f (
Y ;
β, φ).
2.2. GENERALIZED LINEAR MODELS
Score Function for β
Assuming that the
Yi's are independent and represent a random sample from the targetedpopulation, the likelihood function for
β and
φ is simply proportional to
L(
β, φ)
f (
yi;
β, φ)
i(
β)
yi − b(
θi(
β)) +
c(
y
i;
φ)
Consequently, the score function for
β can be derived as
β (
β, φ) =
∂ log
L(
β, φ)
/∂β =
i;
β, π)(
yi − µi(
β))
.
It is important to note that the score function for
β depends only on the first twomoments of the
Yi's despite the full specification of
f through (A) in (2.4). This formsthe primary motivation behind the use of the quasi-likelihood advocated by Wedderburn(1974) which is the focus of the next section. Furthermore, given that
a(
φ) appears as aproportional factor in var(
Yi), no knowledge on
φ is needed to derive the MLE of
β, ˆ
solving
Sβ(
β, φ) = 0 even though the asymptotic variances of ˆ
(
β, φ), does depend
on
φ, where
Σ
−1(
β, φ) = lim
i;
β, φ)
MLE Versus Weighted Least Squares
An alternative estimating procedure for
β is the well-known weighted least squares ap-proach. It amounts to minimizing the following objective function for
β, namely,
Q(
β, φ) =
i − µi(
β))2
,
Var(
Yi;
β, φ)
the weighted differences between the "observed" (
Yi's) and the "expected" (
µi's). Again,the dispersion parameter
φ plays no role in the estimation of
β through minimizing
Qin (2.9). While intuitive, this approach, in contrast to the common belief, may lead toinconsistent estimation of
β if the variance of
Yi depends on the mean and hence on
β.
To see this, note that minimizing
Q in (2.9) is equivalent to solving
∂Q/∂β ≡ 0, where
∂Q(
β, φ)
/∂β =
i;
β, φ)(
yi − µi(
β))
· (
y
i, β, φ)
i − µi(
β))2
Note that the first term in (2.10) is identical to
Sβ(
β, φ), the score function of
β, whereasthe second term has in general non-zero expectations regardless of the sample size
m.
CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW
The resulting solution, ˜
β say, would be inconsistent as an estimator of
β unless Var(
Yi)
is independent of
β in which case
∂Q(
β, φ)
/∂β ≡ Sβ(
β, φ) and hence ˜
This undesirable property of the weighted least squares approach can be alleviated
through the following simple modification. Assuming the existence of a consistent esti-
β say, for
β, one remedy is to minimize instead
Q(
β, φ) =
i − µi(
β))2
.
i=1 Var(
Yi;
β, φ)
It can be shown that the minimizer of
Q(
β, φ)
, ˜
β say, is now consistent for any consistent
β of
β. One such choice is the minimizer of
i − µi(
β))2. Furthermore,
it can be seen that by iterating the minimization process in (2.11) repeatedly, i.e. mini-mizing
i − µi(
β))2
t+1(
β, φ) =
, t = 0
, 1
, . . ,
i=1 Var(
Yi;
βt, φ)
0 =
β, it leads to ˆ
β, the MLE. Indeed, one can show that
β and ˆ
β are asymptot-
ically equivalent to each other in that ˜
β =
op(
m).
An important aspect of model fitting is model checking which examines the pattern ofdepartures from the fitted model. There are at least two different types of departure:the systematic departure which describes the departure of the model from the dataas a whole, and the isolated departure which identifies individuals whose data are notin accordance with the fitted model (McCullagh and Nelder, 1989, Chapter 12). Thissubsection focuses only on the former one.
For the
ith subject with
xi as the covariate value, there are two means to estimate
µi, the expected value of
Yi. One is to estimate by
µi( ˆ
β) based on the fitted model. The
other is to simply estimate
µi by
Yi =
yi, an unbiased estimator. The first approach,which uses the whole data, is legitimate only if the fitted model adequately describesthe data. On the other hand, the second approach, while always legitimate, does notfully utilize the sample data except
yi itself. These contrasts between two approachesprovides a means to answer the question as to "how well the model fits the data" bycomparing
µi( ˆ
β) with
yi through, for example, the likelihood function,
L(
µi, φ). Indeed
the quantity "deviance" is proportional to the likelihood ratio test statistic comparingthe null hypothesis that the fitted model is adequate versus the saturated alternative,i.e. the
µi's are different and unrelated to each other. More specifically, the deviance fora fitted model is formally defined as
µ;
Y ) =
−2
{log
L(
µi( ˆ
β);
φ)
− log
L(
yi;
φ)
}a(
φ)
.
Thus, a "small" value in
D would indicate that the fitted model describes the data asa whole rather well whereas a "large" value would suggest otherwise. A few cautionarynotes on the use of the deviance, or the Pearson
χ2 statistic (
Q( ˆ
β, φ)), as a measure of
2.3. QUASI-LIKELIHOOD METHOD
goodness-of-fit. First, except for the special case of normal responses, the distribution of
D, even asymptotically, is unknown. Second, as the hypothetical example in McCullaghand Nelder (1989, p.123) suggested, any global measure of goodness-of-fit such as thedeviance may have a limited usage for choosing competitive models for the purpose ofpredictions when the
x considered falls outside the range formed by the observed
xi's,known as extrapolation.
To adopt the generalized linear model approach one needs to specify the random compo-nent for the response variables. Very often one faces the problem where the investigatorsare uncertain about the complete random mechanism by which the data are generated.
This could be due to the fact that the underlying biologic theory is not fully understoodyet and/or no substantial (empirical) experience of similar data from previous studiesis available. On the other hand, the scientific objective can often be adequately charac-terized through regression, i.e. the systematic component may be well described. Withthis in mind, Wedderburn (1974) proposed as an alternative the quasi-likelihood methodwhich replaces the random component (A) in GLMs by
(A
) (Variance specification). The variance of
Yi has the form
Var(
Yi) =
a(
φ)
V (
µi)
,
where
V is the known function relating the variance to the mean,
µi.
In the absence of the likelihood function that is available for statistical inference,
Wedderburn (1974) proposed the use of the quasi-score function to estimate
β, i.e. bysolving
β (
β, φ) =
i;
β, φ)(
yi − µi(
β)) = 0
.
This quasi-score function has the same expression as (2.7) and would be the true scorefunction for
β if the
Yi's were indeed generated from an exponential dispersion fam-ily. McCullagh (1983) studied the asymptotic behavior of ˆ
β, the solution of
Sβ ≡ 0, and
found that under some regularity conditions, ˆ
β is consistent and asymptotically normally
distributed with the covariance matrix equal to Σ in (2.8). Firth (1987), on the otherhand, examined the efficiency of ˆ
β relative to the MLE of
β when the true probability
mechanism for the
Yi's is not from an exponential dispersion family. Several differenttypes of departure from the exponential dispersion family were established and, as ex-pected, high efficiency of ˆ
β is maintained when the departure of
f from the exponential
dispersion family is only "modest" (Firth, 1987). A remaining question to be addressedis whether ˆ
β or the quasi-score function,
Sβ, possesses any optimal property? This issue
will be the main focus of the next section. Indeed, we will consider the more generalsituations than (A
), namely
(A
) (Variance specification). The variance of
Yi has the form
Var(
Yi) =
Vi(
µi;
φ)
,
where
Vi is a known function of the mean,
µi, and
φ, the dispersion parameter.
This specification, of course, includes (A
) as a special case with
Vi(
µi, φ) =
a(
φ)
V (
µi).
CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW
Estimating Function Approach
Returning to the general setup in
§2.1 where we assume the data,
Y =
y, is generatedfrom a probability (density) function
f (
y;
θ, φ) indexed by parameters of interest
θ andnuisance parameters
φ. In the regression setting,
θ would be the regression coefficientwhereas
φ would be dispersion parameters. An estimating function for
θ is simply afunction of the data
y and parameters of interest
θ, i.e.
g(
y;
θ). An estimating functionis called unbiased if
E(
g(
Y ;
θ);
θ, φ) = 0
for all
θ and
φ values. An important example of unbiased estimating functions fromthe perspective of health research is the quasi-score function,
Sβ(
β, φ) in (2.12) where
a(
φ) factors out in
Sβ(
β, φ). While the formal theoretical development of estimatingfunctions was not advanced until the 1960's, scientists grappled with how to designestimating functions which combined observations and unknown interest as early as themid 18th century; see the interesting book on the history of statistics before 1900 byStigler (1986). Meanwhile, the method of moments, advocated by Karl Pearson, maybe viewed as a precursor of the modern estimating function approach. Here, selectedempirical moments such as the sample mean are equated to their expectations which arepresumably dependent on parameters of scientific interest.
The concept of unbiased estimating functions and the corresponding optimality theory
was formally developed in the year of 1960 through two independent work by Durbin(1960) and Godambe (1960). We now briefly recast the essence of a series of moregeneral work by Godambe, before addressing the question concerning the optimalityof the quasi-score function. Consider the class of unbiased estimating functions
G =
{g(
y;
θ) : E(
g(
Y ;
θ);
θ, φ) = 0
, ∀θ, φ}. A member of
G,
g say, is optimal within this class(Godambe, 1960) if it minimizes
W (
θ, φ) = E
−1(
∂g(
Y ;
θ)
/∂θ)Cov(
g(
Y ;
θ))E
−1(
∂g(
Y ;
θ)
/∂θ)
.
When
φ is known, i.e. there is no nuisance parameters, Godambe (1960) showed that thescore function for
θ is the optimal one among
G. In the more practical situation where
φ is present and unknown, Godambe (1976) showed that
g(
y;
θ) =
∂ log
f (
y t;
θ)
/∂θ
is optical among
G where
t is a complete, sufficient statistic for
φ for fixed
θ.
Limitations: Impact of Nuisance Parameters
This modern line of optimal estimating function discussed thus far is subject to the fol-lowing two major limitations. First, optimality considered is ascribed to the estimatingfunction, yet scientists and other practitioners are concerned more about estimators. AsCrowder (1989) put it: "This is like admiring the pram rather than the baby." Thisconcern may be reconciled as follows.
It can be shown that under some regularity
β, the solution of
g(
y;
θ) = 0, has minimum variance among solutions of
{g(
y;
θ) = 0 :
g ∈ G}. Thus the limitation of estimating functions noted above may notbe too serious an issue as such if one is willing to appeal to a large sample argument.
2.4. ESTIMATING FUNCTION APPROACH
Second, nuisance parameters can compromise the optimality theory. Specifically, the
optimal estimating function given in (2.16) relies upon the existence of a complete suf-ficient statistic for
φ that does not depend on
t. This is the case when
Y is generatedfrom an exponential family with
θ and
φ being the "natural (or canonical)" parameters,but more generally
t =
t(
θ). In this more general situation,
∂ log
f (
y t(
θ);
θ, φ)
/∂θ de-pends on
φ as well and hence is only locally optimal (Lindsay, 1982). More seriously,the sample space of
Y given
t(
θ) may depend on
θ and consequently,
f (
y t(
θ);
θ, φ) manynot be differentiable with respect to
θ (Kalbfleisch and Sprott, 1970). It is worth notingthat the quasi-score function in (2.12) also suffers from this limitation if the variance of
Yi has the more general form in (2.13), in which case the nuisance parameter
φ may notfactor out in (2.13).
Orthogonality Properties of Conditional Score Functions
The previous subsection made it clear that with the exception of exponential families, nui-sance parameters could compromise the optimal theory. While many likelihood methodsare potentially available leading to unbiased estimating functions that are independentof nuisance parameters, there is no general guidance as to how one should proceed. Inaddition, unlike the conditional score function in (2.16), there is no known theory to sup-port the optimality of derived estimating functions. For this reason, we redirect our focuson the finding of locally optimal estimating functions in which the impact of nuisanceparameters may be minimized, if not eliminated. Avoiding the complication associatedwith the computation of
f (
y tθ;
θ, φ), where
tθ is complete and sufficient for
φ for fixed
θ, Lindsay (1982) extends the concept of conditional score functions in this more generalsituation, i.e.
t ≡ tθ by defining
gC(
θ, φ) =
∂ log
L(
θ, φ)
/∂θ − E(
∂ log
L(
θ, φ)
/∂θ tθ)
,
which reduces to
∂ log
f (
y t;
θ)
/∂θ when
tθ =
t. Due to the dependence of
gC on
φ, theconditional score function defined above is only locally optimal among
G at the true
φ value (Lindsay, 1982). However,
gC(
θ, φ), by definition, is orthogonal to the spacespanned by the sufficient statistic
tθ for
φ. Thus the impact of nuisance parameterson
gC(
θ, φ) is expected, at least intuitively, to be small. The following "orthogonality"properties enjoyed by
gC(
θ, φ) represent attempts to quantify the above speculation.
These properties ((a) to (d)) are listed so that
gC(
θ, φ) implies (a), which implies (b),which implies (c), which implies (d):
(a) E(
gC(
θ, ˆ
φθ);
θ, φ) = 0 for all
θ, φ and any ˆ
φθ which is a function of
tθ;
(b) E(
gC(
θ, φ);
θ, φ) = 0 for all
θ, φ and
φ;
(c) E(
∂gC(
θ, φ)
/∂φ;
θ, φ) = 0 for all
θ, φ and
φ;
(d) Cov(
gC(
θ, φ)
, ∂ log
L(
θ, φ)
/∂φ;
θ, φ) = 0 for all
θ and
φ.
Interpretation of these four orthogonality properties can be seen in Liang and Zeger(1995). Furthermore, a rich class of probability (density) functions in which
tθ is in-deed complete and sufficient for
φ is given by Liang and Tsou (1992). In summary,we have devoted this subsection to the discussion of desirable properties possessed bythe conditional score function proposed by Lindsay (1982). These properties, namely,
CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW
local optimality and orthogonality to nuisance parameters, form the criteria for the nextsubsection.
Finally, several researchers, including Cox and Reid (1987), Liang (1987), Ferguson,
Reid and Cox (1991), Waterman and Lindsay (1996) have investigated strategies forapproximating the conditional score function in the more general situation where thecomplete sufficient statistic for
φ for fixed
θ may not exist or are difficult to find.
(Local) Optimality of Quasi-Score Functions
We now return to the question posed in
§2.3, namely the potential optimality of the quasi-likelihood method, which will be addressed in more general settings. In the previous twosubsections, we have assumed implicitly that the likelihood function, indexed by
θ and
φ,is readily available for
θ inference. As mentioned in
§2.3, there are situations whereeither the investigators are uncertain about the random mechanism or the probabilitydistribution is too complicated to write down explicitly. Examples include data fromspatial processes encountered in plant ecology (Besag, 1974) and from DNA sequencing.
Nevertheless, it is common that the parameters of interest
θ are well defined reflectingthe scientific interest, even though they may not completely specify the probability dis-tribution for the data
y. We assume that the data
y = (
y1
, . . , ym) is decomposed into
m "strata" and the
yi's,
i = 1
, . . , m, are uncorrelated with each other. The sizes ofthe
yi's are not required to be the same. Furthermore, we assume that
θ is common inmeaning to all
m strata and that there exists an unbiased estimating function,
gi(
yi;
θ),for the
ith stratum,
i = 1
, . . , m. A simple and trivial example of this setup is that the
yi's are independent and identically distributed with mean
θ and variance
φ and a simplechoice of
gi would be
yi −θ. A natural question is how to combine
{gi(
yi;
θ)
, i = 1
, . . , m}together into a single unbiased estimating function for
θ inference that is optimal in somesense?
To address this question we consider a more restricted class (than
G) of unbiased
estimating functions, namely,
G =
{g(
y;
θ) =
which is a linear combination of the
gi's. It can be shown (e.g. Liang and Zeger, 1995)that the optimal unbiased estimating function within
G,
g say, is
which minimizes
W (
θ, φ) defined in (2.15). Furthermore, the solution of
g = 0
, ˆ
has the smallest asymptotic variance among
{˜
θ :
g(
y; ˜
θ) = 0, for all
g ∈ G}. This setup
includes several important examples:
Example 2.5 (Quasi-score function). Under (A
) and (B) specified in
§2.3, one has
h(
µi) =
xiθ, Var(
Yi) =
a(
φ)
V (
µi)
.
By choosing
gi(
yi;
θ) =
yi − µi(
θ), one has as the optimal unbiased estimating function
2.4. ESTIMATING FUNCTION APPROACH
among
G =
{ m a
i)(
yi − µi(
θ))
,
the quasi-score function for
θ.
Example 2.6 (Correlated data). For
yi = (
yi, . . , yin ), the observations from the
ith cluster,
i = 1
, . . , m, where
ni is the cluster size, the objective of interest is fre-quently captured through the modeling of
µij = E(
Yij), i.e.
h(
µij) =
xt θ (Liang and
Zeger, 1986).
In this case, one such choice of
gi would be
yi − µi(
θ) where
µi =
i(
θ)
, . . , µin (
θ))
t and the optimal
g among
G =
{g =
i)(
yi − µi(
θ))
,
known as the generalized estimating equation (GEE) (Liang and Zeger, 1986).
Example 2.7 (2
× 2
tables). In case-control studies, subjects are stratified into
m
strata by confounding variables. The primary interest is on the degree of associationbetween the targeted disease and a dichotomous risk factor of interest as measured bythe odds ratio,
θ. In this case,
yi = (
yi1
, ni1
− yi1
, yi2
, ni2
− yi2) where
yi1(
yi2) is thenumber of exposed among
ni1 cases (
ni2 controls) in the
ith stratum,
i = 1
, . . , m. Itcan be verified easily that
gi(
yi;
θ) =
yi1(
ni2
− yi2)
− θyi2(
ni − yi)
is an unbiased estimating function for
θ from the
ith stratum. While
is complicated computationally,
a(
θ) reduces to 1
/(
n
i1 +
ni2) when
θ is evaluated at
one (McCullagh and Nelder, 1989) and this leads to the well known Mantel-Haenszelestimator of
θ (Mantel-Haenszel, 1959).
Implicitly, the optimality theory developed thus far assumes
g is functionally inde-
pendent of the nuisance parameters
φ. This is indeed the case for Example 2.7 if theexpectations derived in computing
g are conditional on
ti =
yi1 +
yi2, the total numberof exposed in the
ith stratum,
i = 1
, . . , m. In general, however, the distribution
gi,which is functionally independent of
φ, does depend on
φ. Thus,
g =
g(
y;
θ, φ) andis locally optimal among
G at the true
φ value as is the conditional score function
θ(2.17) when the probability mechanism is available for inference. One important excep-tion is the quasi-score function under (A
), i.e.
a(
φ) appears as a proportional factor inVar(
Y ). In this case,
a(
φ) factored out and
g =
g(
y;
θ) is globally optimal among
G.
Even though
g may depend on
φ, we now argue that the impact of
φ on
g and on thecorresponding solution of
g = 0 is small. This is because
g shared the orthogonalityproperties (b) - (d) enjoyed by the conditional score function, i.e.
CHAPTER 2. REGRESSION ANALYSIS: AN OVERVIEW
(b) E(
g(
y;
θ, φ);
θ, φ) = 0
∀θ, φ and
φ,
(c) E(
∂g(
y;
θ, φ)
/∂φ;
θ, φ) = 0
∀θ, φ and
φ,
(d) Cov(
g(
y;
θ, φ)
, ∂ log
L(
θ, φ)
/∂φ;
θ, φ) = 0
∀θ and
φ.
One important implication of these properties, at least when the sample size is large, isthat ˆ
θ, the solution of
g(
y;
θ, ˆ
φθ) = 0, where ˆ
φθ is an arbitrary consistent estimator of
φ,
has the same variance as that when
φ is known. That is, the uncertainty in estimating
θ,as measured by the variance, is unaltered whether
φ is known or not, another indicationthat
Chapter 3
Analysis of Binary Data
Binary (dichotomous) response is commonly observed in biomedical research. Examplesinclude death versus alive, ill versus healthy, injured or not in a car accident, depressed ornot in psychiatric research, etc. Sometimes there is a time element involved in defining theresponse. For example, one may be interested in, as shown below in Table 3.1 whethera child diagnosed with neuroblastoma survives through a two year period. Anotherexample, shown in Table 3.2, has the response for each fetus as to whether it survives21 day lactation period or not. These two examples share similar data structure whichis the focus of this chapter, namely,
(
yi/ni, xi)
, i = 1
, . . , m,
where
yi is the number of "affected" out of the
ni "subjects" in the
ith stratum and
xi,
Table 3.1: Proportions of children surviving 2 years free of disease followingdiagnosis and treatment for neuroblastoma by age and stage of disease atdiagnosis (Breslow & McCann, 1971).
Stage at Diagnosis
a
p × 1 vector, covariates that are common to all
ni subjects. For the neuroblastomaexample,
m is equal to 15 with
p = 2 covariates, namely, age and stage at diagnosis.
For the teratological experiment example,
m would be 32 with one primary covariate,i.e. treatment status:
x = 1 if treated and 0 if control. Questions of interest for datathat can be summarized in this kind of structure include (i) can variation in
yi/ni beexplained by
xi? and (ii) can one predict the affected status using
xi? The first questionis important in that it provides a means to identify risk factors that may be useful forprevention purposes. The second question is useful for identifying prognostic factorswhich may have clinical and policy implementation implications.
CHAPTER 3. ANALYSIS OF BINARY DATA
Table 3.2: Proportions of fetuses surviving the 21 day lactation periodamong those who were alive at 4 days by treatment group (Weil, 1970).
12/12, 11/11, 10/10, 9/9, 10/11, 9/10, 9/10, 8/9,
8/9, 4/5, 7/9, 4/7, 5/10, 3/6, 3/10, 0/7
13/13, 12/12, 9/9, 9/9, 8/8, 8/8, 12/13, 11/12,
9/10, 9/10, 8/9, 11/13, 4/5, 5/7, 7/10, 7/10
In this chapter, we discuss some analytical issues that are relevant to this type of
data structure with special attention paid to the issue of statistical dependence within astratum which is common in data from teratological experiments.
Given the much longer history in multiple linear regression for continuous responses,it was natural that some earlier developments for analyzing binary data appealed tothe conventional linear regression method. This forms the basis behind the first twoapproaches presented below.
Approach 1 (Arcsine analysis). This approach amounts to creating a new response
variable through a arcsine transform on the original response variable, that is,
y =
yi/ni) and imposing the following model
i =
xtiβ +
i,
This transformation was motivated by the fact that the variance of
y, at least asymptot-
ically, is independent of the means,
xtβ, a desirable property to have in the conventional
linear regression approach; see
§2.2.3.
Approach 2 (Empirical logit analysis). This approach is similar to Approach 1 except
that a different transformation is applied to
yi/ni, namely,
y = log(
y
i/(
ni − yi)). This
is a special case of a comprehensive treatment for analyzing categorical data developedin the 1960's by Grizzle, Starmer and Koch (1969).
Approach 3 (Modern analysis). This approach, which is a special case of GLMs, differs
from the first two approaches in one important aspect. In this modern (for the lack ofa better term) approach, the notion of transformation is applied to the expectation of
yi/ni, πi, rather than the data itself. In other words, for a given "link" function,
h say,this approach specifies
h(
πi) =
xtiβ
and furthermore,
yi follows a binomial distribution. Some common choices of link func-tions include the probit link for bioassay analysis (e.g., Finney, 1964) and the logit linkfor case-control studies (Breslow and Day, 1980).
It is interesting to contrast the modern approach versus the first two approaches. For
simplicity, we restrict our attention to the log odds transformation (Approach 2) so thatthe comparison is made between the empirical logit approach and the logistic regressionapproach.
3.1. HISTORICAL DEVELOPMENTS
Empirical Logit versus Logistic Regression
We now argue that the empirical logit approach inherits the following three drawbacksthat are not shared by the logistic regression method. First, the regression coefficientsof this approach are difficult to interpret: change in averaged
Y = log
{Y /(
n − Y )
} perunit change in
x. The transformation
Y is made for convenience whose average doesnot necessarily possess physical meaning of interest. This is to be contrasted with theinterpretations of the logistic regression coefficients: change in log odds in risk per unitchange in
x. These quantities are easy to understand with sound physical meaning forbiomedical researchers. Second, the normal distribution assumption for the
Y 's relies
heavily on the
ni's being large. As can be seen in either example, the
ni's vary fromstrata to strata and are not (can't be in teratological experiments) fixed or controlledby designs. This concern, on the other hand, is not shared by the logistic regressionmethod. Indeed, the latter approach can be applied even if
ni ≡ 1 for all
i so longas
m is sufficiently large. Third, with the exception of the intercept, all the regressioncoefficients in logistic regression models are estimable should one adopt the retrospectivesampling, a commonly adopted design in epidemiologic studies. The same thing cannotbe said for the empirical logit approach. Finally, parameters in logistic regressin modelsare the "canonical" parameters in an exponential family which means the intercepts canbe eliminated through conditioning which is critical for the matched case-control designsto be implemented (Breslow and Day, 1980).
Is the Binomial Assumption Reasonable?
Either one of three approaches presented earlier assumes that
Yi follows a binomialdistribution with size
ni and probability
πi, i = 1
, . . , m. A natural question to ask iswhether the binomial assumption is adequate? To answer this question, we note that onenatural way to characterize a binomial variate,
Y say, is that
Y be expressed as a sum of
n binary observations, i.e.,
Y =
Y1 +
. . +
Yn with
Yj = 1 or 0 for all
j = 1
, . . , n suchthat (1) all the
Yj's are statistically independent of each other and (2) Pr(
Yj = 1) =
π,the common probability for all
j = 1
, . . , n.
For the neuroblastoma example, it seems sensible to assume that the number of chil-
dren with the same age and stage of diagnosis who survived for a two-year period is wellapproximated by a binomial distribution. After all, these children are unrelated to eachother and meanwhile, comparable to each other with regard to major prognostic factorsfor neuroblastoma. On the other hand, it is well known in the teratological experimentliterature (e.g. Haseman and Kupper, 1979) of the tendency for fetuses from the samelitter (stratum in our notation) to respond more alike than fetuses from different litterseven if they all received the same treatment. This is known as the "litter effect" whichmay be explained by the fact that fetuses from the same litter share the common genesand environment which in turn leads to the "likeness" of responses from fetuses of thesame litter. This argument strongly suggest that the statistical independence assumptionof the binomial assumption may be invalid. Indeed, for the sixteen proportions from thetreated group in Table 3.2, only 25% (= 0
.328
/1
.295) of the total variation in the
y/n'sis explained by the variation induced by the binomial assumption (Liang and Hanfelt,1994).
In the next, we discuss pros and cons of approaches that have been developed to aid
investigators analyze data derived from teratological experiments.
CHAPTER 3. ANALYSIS OF BINARY DATA
Teratological (toxicological) experiments are designed to test whether a substance isindeed a carcinogen or toxin; and if so, it would be of interest to establish the dose-response relationship in order to provide guidances for the tolerance level regarding safetyof human populations. Typically, pregnant animals (e.g. rats) are randomized to receivethe substance of different dose levels. The adverse effect such as structural developmentaleffect of the substance in the fetuses is then assessed to address the question stated above.
As mentioned in
§3.1.2, the so-called "‘litter effect" is commonly observed and con-
sequently, the binomial assumption on
Y , the number of affected fetuses from a litter, isinvalid. One of the earliest approaches to address this issue of litter effect, also knownas "extra binomial variation" or more generally "over-dispersion," is to assume that
Yfollows a beta binomial distribution (Skellam, 1948) instead. This distribution may bemotivated and derived as follows. Consider
yi/ni, i = 1
, . . , m,
the responses (proportions of fetuses that are affected) from litters receiving the samedosage. For the example presented in Table 3.2,
m is equal to 16 for both treated andcontrol groups. To account for variation in responses among litters that are "comparable"(regarding treatment levels) to each other, this approach assumes
(1) given
πi, Yi πi ∼ Binomial (
ni, πi)
, i = 1
, . . , m,
(2)
πi ∼ Beta (
α, γ)
, α, γ > 0.
Consequently, the beta-binomial distribution for each
Yi arrived as
Pr(
Yi =
yi;
α, γ) =
Pr(
Yi =
yi πi)
f (
πi;
α, γ)
dπi
Γ(
α +
γ)
πyi (1
− π
πα−1(1
− π
Γ(
α +
γ)Γ(
y
iα)Γ(
ni − yi +
γ)
.
Γ(
α)Γ(
γ)Γ(
α +
γ +
ni)
The key step of this derivation is assumption (2) which acknowledges that the underlyingrisk,
πi, varies from litter to litter by imposing a distribution, beta in this case, on the
πi's. Intuitively, the litter effect is pronounced if the variance of
πi is large. One difficultyassociated with the representation in (3.1) is lack of clear interpretability for both
α and
γ. However, noting that the first two moments of a beta distribution have the form
α/(
α +
γ)
≡ θ,
θ(1
− θ) =
φθ(1
− θ)
,
α +
γ + 1
we now argue that the new parameters,
θ and
φ are more easily interpreted. This isbecause the first two moments of the
Yi's can now be expressed as
E(
Yi/ni) = E(E(
Yi/ni πi)) = E(
πi) =
θ,
θ(1
− θ)
Var(
Yi/ni) =
i − 1)
φ)
.
3.2. TERATOLOGICAL EXPERIMENTS
Thus
θ is simply the overall, or averaged, risk for those randomized to a particular dosegroup. On the other hand,
φ, the proportional factor in Var(
π), is introduced to accountfor variation that is not solely explained by the binomial variation. Furthermore, note in
§3.1.2 that
Yi can be expressed as
Yi =
Yi1 +
. .+
Yin and that these
n
i binary responses
are not statistically independent of each. If one assumes the correlations for each pairof these
Yij's are the same, then it can be shown easily that this common correlationcoefficient is exactly equal to
φ. Thus, the notion of variation between litters is equivalentto within litter "likeness" as captured by the same quantity
φ.
Finally, to test or to establish dose-response relationship between
θ and
x, the dosage
level, one may consider
h(
θ) =
β0 +
xtβ1
,
in line with the systematic component specified under GLMs.
Some Cautionary Notes on Beta-Binomial Distributions
While intuitive for its derivation, one may use this approach with caution from thefrequentist viewpoint. First, as shown below, the parameter of interest
θ and the nuisanceparameter
φ are highly intertwined with each other. Consequently, the presence of
φis likely to have non-trivial impacts on
θ inference; see
§2.1 for general discussion onimpacts of nuisance parameters. To see this, Table 3.3 below, which is reported inWilliams (1988), shows through simulations that misspecifying
φ value could lead tosevere bias in
θ estimation. For example, if the true
θ and
φ values are 0
.125 and0
.2, respectively and a sample of 20 beta-binomial observations were generated, theaveraged MLE of
θ over 1,000 replications is equal to 0
.167 if a
φ value of 0
.4 wasassumed when deriving
θ. This phenomenon is also evident when the beta-binomialdistributions are fitted to the Weil data in Table 3.2. Table 3.4 shows that the MLEof
β1 = log(
θT /(1
− θT ))
− log(
θC/(1
− θC)) dropped to
−0
.665 when
φT and
φC wereforced to be the same. This is to be contrasted with the MLE of
β1 when
φT and
φC areallowed to be different:
−1
.129. It is interesting to note that the standard error estimateof ˆ
β1 when assuming binomial distribution in the
Y 's, i.e. ignoring the litter effect is too
small: 0
.330 compared to 0
.464 when a beta-binomial distribution is assumed instead.
Table 3.3: The averaged MLE (
± s.e.) for
θ for different assumed values of
φ from 1,000 simulations (Williams, 1988).
True parameter values
Assumed
φ value
θ = 0
.125
, φ = 0
.2
θ = 0
.25
, φ = 0
.333
0
.1263
± 0
.0012
0
.2530
± 0
.0020
0
.1118
± 0
.0011
0
.2310
± 0
.0020
0
.1134
± 0
.0011
0
.2266
± 0
.0020
0
.1271
± 0
.0011
0
.2336
± 0
.0019
0
.1457
± 0
.0012
0
.2491
± 0
.0018
0
.1670
± 0
.0013
0
.2685
± 0
.0018
0
.1906
± 0
.0014
0
.2903
± 0
.0017
Second, the asymptotic normal approximation for the MLE of the
β's may not be
ready to take place for practical sizes when fitting the beta-binomial distribution to the
CHAPTER 3. ANALYSIS OF BINARY DATA
Table 3.4: Estimates and standard errors of
β1 = log
{θT /(1
− θT )
} −log
{θC/(1
− θC)
} of Weil's data presented in Table 3.2. Upper entry: com-mon
φ; lower entry: heterogeneous
φ (Liang and Hanfelt, 1994).
1
± s.e.
−0
.961
± 0
.330
−0
.665
± 0
.460 0.19
−1
.129
± 0
.464 0.32 0.02
Quasi-likelihoodVar(
Y ) =
nθ(1
− θ)
−1
.022
± 0
.471 0.21
×(1 + (
n − 1)
φ)
−1
.070
± 0
.470 0.46 0.05
−0
.961
± 0
.475 2.78
φnθ(1
− θ)
−0
.961
± 0
.475 4.74
data at hand. Table 3.5 shows results from a simulation in which data mimicking thestructure from Table 3.2 and following beta-binomial distributions with the true
θ's and
φ's being the MLE's from the last row of Table 3.4 were generated 1,000 times. Eitherallowing
φT and
φC to be different in the analysis or not, the MLE approach leads toundesirable 95% coverage probability that is far from being symmetric as would havebeen the case if the normal approximation to the distribution of MLE's were ready totake place.
Finally, for a litter of size
n in which
Y =
Y1 +
. . +
Yn and
Y−j defined as
(
Y1
, . . , Yj−1
, Yj+1
, . . , Yn), one has
j = 1
Y−j )
Pr(
Yj = 0
Y−j)
γ +
n −
if
Y is assumed to follow a beta-binomial distribution indexed by
α and
γ. In otherwords, the odds for
Yj to be affected, i.e.
Yj = 1, depends on
Y−j, the responses from the
rest of the litter, only through
, the total number of affected. While appealing
mathematically, this expression in (3.2) raises a concern regarding the biologic plausibilityof this distribution as the litter size varies. It seems more intuitive that the effect of
Y−j,
if any, on
Yj would be through its proportion, i.e.
/(
n − 1), rather than through
An Alternative Inferential Procedure
The beta-binomial distribution is an excellent example in which (1) the probability dis-tribution is indexed by parameters of interest and nuisance parameters, (2) the nuisanceparameters have performed impacts on inference for parameter of interest and (3) yetthe conditional score function approach in
§2.4 is not applicable (
tθ =
y in this case).
Since the main interest of teratological experiments can be cast through regression mod-elling on the first moment of the
Y 's, one alternative is to consider the quasi-likelihood
3.2. TERATOLOGICAL EXPERIMENTS
Table 3.5: Simulation results for
β1 = log
{θT /(1
− θT )
} from 1,000 repli-cations mimicking the data structure of Weil's. The true
β1 is
−1
.129 and
φT = 0
.317 and
φC = 0
.021 (Liang and Hanfelt, 1994).
95% converge prob.
φT =
φC
Quasi-likelihood1. Var(
Y ) =
nθ(1
− θ)
×(1 + (
n − 1)
φ)
φT =
φC
2. Var(
Y ) =
φnθ(1
− θ)
φT =
φC
appproach detailed in
§2.3. Specifically, one may consider
(A
) (Variance specification). The variance of
Yi has the form of
Var(
Yi) =
niθ(1
− θ)(1 + (
ni − 1)
φ)
, or
φniθ(1
− θ)
.
(B) (Systematic component). The mean of
Yi/ni is modeled as
logit
θi =
β0 +
β1
xi, i = 1
, . . , m.
Note that the first variance expression corresponds to that induced by the beta-binomialdistribution whereas the second one is induced from the exponential-dispersion family. Asdiscussed in
§2.4.3, while the quasi-score function does depend on
φ, yet the impact of
φare minimal at least from the theoretical viewpoint. Results below provide some empiricalevidence of the above assertion. Table 3.4 shows for the Weil data that the estimators andthe corresponding s.e.'s from the quasi-likelihood approach are very similar to each otherregardless of the choice of the variance functions and whether
φT and
φC are assumed tobe the same or not. This observation is confirmed by the simulated study presented inTable 3.5, which shows that this approach, unlike the beta-binomial likelihood approach,leads to excellent coverage probabilities, small bias and excellent agreement between thesample variance and the averaged estimated variance of ˆ
β1. More extensive simulations
which mimic the data structures that are common in practice can be found in Liang andHanfelt (1994).
In summary, we review in this section, two different approaches for analyzing data
from teratological experiments which confront the non-trivial litter effects. Both theo-
CHAPTER 3. ANALYSIS OF BINARY DATA
retical arguments and empirical results support the use of the quasi-likelihood methodas a sensible alternative to the conventional beta-binomial approach.
Further Relaxation of Binomial Assumptions
Recall in
§3.1.2 we discussed two components that constitute the binomial variate
Y =
Y1 +
. . +
Yn, namely, the statistical independence and identical distribution in the
Yj's,
j = 1
, . . , n. In
§3.2 we addressed ways to relax the first assumption of statistical inde-pendence. In this section, we review some recent developments for attempts to furtherrelax the second assumption of identical distributions. This work may be motivated infamily studies where responses from
n related individuals are not only correlated witheach other due to shared genes and environment, but have different risks of disease dueto different make up in risk factor profiles consisting of, for example, age, sex, etc. Draw-backs of these approaches are highlighted so as to provide further motivation for moremodern developments for modelling correlated data to be presented in Chapter 5.
Let
xj be the
p × 1 vector of covariates for the
jth component of a cluster of size
n.
Building on the expression in (3.2), Rosner (1984) proposes to consider for
j = 1
, . . , n
j = 1
Y−j , x1
, . . , xn)
Pr(
Yj = 0
Y−j, x1
, . . , xn)
y)
− log(
γ +
n −
y − 1) +
βtxj.
It models explicitly the dependence of the log odds for
Yj to be one on
Y−j and
x1
, . . , xn
=
j and
xj for each
j = 1
, . . , n. Using the Hammersley-Clifford Theorem
(Besag, 1974), this specification in (3.3) does lead to a legitimate joint distribution for
Y = (
Y1
, . . , Yn) given (
x1
, . . , xn), namely,
1 =
y1
, . . , Yn =
yn x1
, . . , xn)
=
sbn−s exp
1 =
. . =
Yn = 0
x1
, . . , xn)
(
α +
)
, s ≥ 1
,
(
γ +
)
, s ≥ 1
,
b0 = 1
,
As mentioned in
§3.2.1, the effect of
Y−j on
Yj, as modeled in(3.3), depends on
=
j ,
which is a function of the cluster size
n. More seriously, the effect of
xj on
Yj, as
3.3. FURTHER RELAXATION OF BINOMIAL ASSUMPTIONS
measured by
β, is likely to be diminished as
xj is competing directly against
Y−j inexplaining the variation in
Yj. Furthermore, the random effect motivation conveyed inthe original beta-binomial deviation (i
.e
. π ∼ Beta(
α, γ)) seems to disappear when onetries to incorporate covariate effects for each component
j.
Approaching the same issue from a totally different angle, Bahadur (1961) proposed thefollowing join distribution for (
Y1
, . . , Yn) given (
x1
, . . , xn)
Pr(
Y1 =
y1
, . . , Yn =
yn x1
, . . , xn)
j =
yj xj )
ρjkzjzk +
+
. . +
ρ
, µj = Pr(
Yj = 1
xj)
, j = 1
, . . , n.
µj(1
− µj)
This is termed "additive model" in that the ratio of
Pr(
Yj =
yj, j = 1
, . . , n x1
, . . , xn)
n Pr(
Y
j =
yj xj )
is additive in the products of the
yj's. As a special case, Kupper and Haseman (1978)consider
Pr(
Yj =
yj, j = 1
, . . , n x1
, . . , xn)
Pr(
Yj =
yj xj)(1 +
ρ
where
ρ is the common correlation coefficient for each pair of the
yj's and
µj ≡ µ forall
j = 1
, . . , n. A major drawback of this model, as acknowledged by Kupper andHaseman (1978), is that there is no guarantee that the quantities on the right hand sideof (3.5) are non-negative, a necessary condition for probability functions. Furthermore,as pointed out in
§2.4, the range of
ρ is constrained by the
µj's and as in the beta-binomial distribution, the
µj's and
ρ are likely to be highly intertwined with each other.
These concerns have limited the usage of this approach.
Similar in spirit to the Rosner model, this third approach considers for
j = 1
, . . , n
j = 1
Y−j , x1
, . . , xn) =
α +
γ
j +
βtxj .
j = 0
Y−j , x1
, . . , xn)
CHAPTER 3. ANALYSIS OF BINARY DATA
Again, utilizing the Hammersley-Clifford theorem, this leads to (Altham, 1978; Connollyand Liang, 1988)
Pr(
Yj =
yj, j = 1
, . . , n x1
, . . , xn)
Pr(
Yj = 0
, j = 1
, . . , n x1
, . . , xn)
j +
γ
yj − 1) +
βt
Due to the similar representations in (3.6), this model is subject to the same drawbacksinherited by the Rosner model discussed earlier.
In summary, we have reviewed in this section three different approaches to extend the
binomial distribution allowing statistical dependence and different risks among the
Yj'sfrom the same cluster. Drawbacks of these approaches are highlighted. In this regard,the treatment of this chapter may be viewed as a preview of more modern developmentsfor modelling correlated data to be presented in Chapter 5.
Chapter 4
Analysis of Polytomous and
Count Data
In biomedical research, categorical response with three or more possible categories, knownas polytomous response, is also common in practice. Examples of this kind includedegree of injury (mild, modest and severe), different cell types of lung cancer (e.g.,small, squamous, large, etc.) and income level (low, medium and high). Just as inbinary response, researchers are interested in identifying factors that are related to thepolytomous response for prevention, intervention and prediction purposes. The firstpart of this chapter reviews a variety of statistical models that have been proposed inthe literature to address the above issue. We discuss through several real examples thepros and cons of each approach with the emphasis on the interpretations of regressioncoefficients induced by the models.
In social science research, data are frequently expressed in contingency tables since
variables observed are mostly categorical. The scientific interest is often on how thesevariables are related to each other. For that purpose, log-linear models (e.g. Bishop,Fienberg and Holland, 1974) have been developed since the 1950's to address the issueof patterns of association among variables collected from the same subjects. Attemptshave also been made in the past decade or two to utilize this approach to help analyzethe types of correlated data introduced in Chapter one. The second part of this chapteris devoted to critically reviewing the pros and especially the cons of log-linear modelswith emphasis on the interpretations of coefficients implied by the models. One usefulreference for this topic is Liang, Zeger and Qaqish (1992) and the discussion therein.
Types of Polytomous Data
Examining examples of polytomous responses listed earlier, one notes that they can beclassified into at least three different types in terms of scale which require different at-tention for modelling. The first type is in a nominal scale. Here categories are regardedas exchangeable such as cell types of lung cancer, blood type or hair color (red, black,brown, etc.). The second type is in a ordinal scale where categories are ordered like theordinal numbers (first, second, third, etc.). Examples include degree of injury, perception
CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA
of food quality and for that matter hair color (light, medium, dark, etc.). It is importantto point out for ordinal responses that while ordered, the notion of "distance" or "spac-ing" between categories is arbitrary and in some instances, irrelevant. Furthermore, inmany instances, choice and definition of categories are either arbitrary or subjective. Animportant implication of this observation is that any sensible analytical procedure shouldhave the desired property that same conclusions be drawn if new categories are formed bycollapsing adjacent categories of the original scales. As an illustrative example, considerthe following data from a 3
× 3 table relating the height (short, medium or tall) of thehusbands to that of the wives. The simple chi-squared test gives rise to a value of 2.9which provides little evidence of association between husbands and wives heights. Thesame conclusion is drawn if one either collapses the "short" and "medium" categoriestogether (
χ21 = 1
.45) or collapses instead the "medium" and "tall" categories together(
χ21 = 0
.90).
The third type of polytomous response is the interval scale in which scores or numericallabels are frequently attached. Examples of this kind include salary, income, age etc. inwhich the scores may be the midpoints of the intervals defined according to the originalscales. In this situation, it is common that differences between scores from differentcategories be interpreted as a measure of separation of the categories.
In the next section, we discuss the pros and cons of three commonly used regression
models for polytomous responses of either nominal or ordinal scales. Throughout we willuse the following two examples for illustration.
Example 4.1 (Disturbed dream). The data presented below is taken from Maxwell
(1961) on the association between degree of suffering from disturbed dreams and ageamong 223 boys age 5 to 15. It is clear that the response variable, severity of disturbeddreams, is ordinal while the explanatory variable aged is an interval scale.
Degree of suffering from
Example 4.2 (Tonsil size). The data below is from Holmes and Williams (1954) who
classify 1398 children aged 1-15 years according to their relative tonsil size and whether
4.2. REGRESSION MODELS FOR POLYTOMOUS RESPONSES
or not they were carriers of streptococcus pyogenes (SP). The response variable here,tonsil size, is also ordinal and the interest is on assessing the nature and direction of thepossible effect of SP on the enlargement of tonsil size.
Regression Models for Polytomous Responses
Consider a response variable,
Y , which has
C + 1 possible categories, i.e.
Y = 0
, 1
, 2
, . .
or
C with
C being a positive integer. The interest is on the relationship between
Y and
x, a
p × 1 vector of covariates.
Polytomous Logistic Regression Models
This model represents a straightforward extension of the logistic regression model forbinary response by contrasting the odds for
Y to be
j = 1
, . . , C instead of 0, i.e.
Pr(
Y =
j x)
Pr(
Y = 0
x)
j +
βt
j = 1
, . . , C.
It can easily be seen that the distribution function for
Y given
x is fully specified by the
C logistic regression models above, namely,
Pr(
Y =
j x) =
, j = 0
, 1
, . . , C.
with
α0 = 0 and
β0 = 0, a vector of
p zeros. Here
βj characterizes the effect of
x onthe log odds for being the
jth category instead of the category zero,
j = 1
, . . , C. It isinteresting to point out that for a pair of categories, 0
< j < , the log odds for
Y to be
instead of
j is fully determined by (4.1) as
Pr(
Y =
x)
Pr(
Y =
j x)
− αj ) + (
β − βj )
tx,
which is linear in
x as well. The simple equality of
βj =
β −(
β −βj), results of (4.1) and(4.2), explains why subjects whose
Y values are different than 0 and
j are informativeabout
βj. This model is particularly useful for nominal response where categories areconsidered exchangeable. Finally, assuming that
Yi follows a multinomial distribution,likelihood inference for
αj's and
βj's can be carried out and is available in SAS (PROCCATMOD).
Proportional Odds Models
For ordinal responses, McCullagh (1980) proposes the use of proportional odds modelswhich can also be expressed in
C logistic regression models:
j (
x)
j +
βtx,
j = 1
, . . , C,
j (
x)
CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA
where
γj(
x) = Pr(
Y ≥ j x). This model focuses on the modelling of
γj(
x),the cumulativeprobability of
Y to be in the jth category or higher given
x, which is meaningful onlyif
Y is ordinal. It assumes that the effect of
x on those
C log odds is the same for all
j = 1
, . . , C, an assumption which can be checked empirically. An important featureof this model is that there is no need to assign scores to the
C + 1 categories of theresponse variable. Instead, the ordinal feature of
Y is incorporated (or acknowledged) by(1) modelling through
γj and (2) assuming that the effect of
x on log
{γj(
x)
/(1
− γj(
x)
}is the same, or at least is in the same direction. Furthermore, the interpretation of
βis well preserved: the change in log odds for suffering more than less (Example 4.1) perunit change in
x. Finally, this model has been implemented in SAS as well (PRODLOGISTIC).
Returning to the disturbed dream example, Table 4.1 shows the MLEs of the
αj's
and
βj's when both the polytomous logistic regression model and the proportional oddsmodel are entertained with a single continuous
x: age in years (midpoint of each ageinterval). Results from the proportional odds model fitting suggest a strong negativerelationship between suffering from disturbed dreams and age among 5-15 year old boysin that the odds of suffering from disturbed dreams decreases by a factor of 0
.8(= e
−0
.229)per year. Results from fitting the polytomous logistic regression model, which ignores theordinal feature of the response variable, lead to the similar conclusion qualitatively. Wenote, however, that the standard error estimate of ˆ
β(0
.052) is, as expected, much smaller
than that of the three ˆ
βj's (0
.078
, 0
.078 and 0
.081, respectively). This example serves
to illustrate that proper modelling through acknowledging the ordinal feature of theresponse variable could lead to a substantial gain in efficiency for regression coefficientsof interest.
Table 4.1: Regression estimates (
± s.e.) based on both proportional oddsmodels and polytomous logistic regression models for the disturbed dreamexample (Maxwell, 1961).
Proportional odds
Polytomous logistic
6
.402
± 1
.403
6
.271
± 2
.108
5
.586
± 1
.387
4
.784
± 2
.115
4
.571
± 1
.377
7
.792
± 2
.169
−0
.263
± 0
.078
−0
.208
± 0
.078
−0
.323
± 0
.081
−0
.229
± 0
.052
Continuation Ratio Models
There are situations where the response variable is "ordered" in a hierarchical (or nested)fashion. Examples include
4.2. REGRESSION MODELS FOR POLYTOMOUS RESPONSES
Example 4.3 (Teratological experiment). Chen et al. (1991) present data from a ter-
atological experiment in which each litter was randomized into one of three levels ofexposure (low, medium and high) to hydroxyures. Here fetuses from a litter could bedead or alive at birth, and if alive, they could be malformed or normal after a period oftime; see the diagram below. Here the relationship between the proportions of fetusesthat are dead and the dosage level allow investigators to examine the embryo-toxic effectof the targeted substance; whereas the proportions of alive fetuses that are malformedcan be used to address the structural developmental effect of the substance. Here theresponse variable has three categories with a natural order of death, alive but malformedand alive and normal, a decreasing order of health severity. It is hierarchical (or nested)in that the issue concerning the risk for being malformed is only meaningful to thosefetuses who are alive at birth.
Example 4.4 (Insemination experiment). McCullagh and Nelder (1989, p.62) describe
an experiment in which milch cows receive treatments to induce pregnancy in such away that each cow will continue the experiments until pregnant as shown below. Herethe response variable represents a series of 1
/0 response indicating whether the cow ispregnant or not. However, this response variable is hierarchical (or nested) in that the
jth response would be recalled only if the cow was not pregnant in the previous
j − 1experiments.
Example 4.5 (Tonsil enlargement). Returning to the tonsil size example (Example
4.2), one might argue that the response variable can be expressed in a way that issimilar to that in Example 4.3. Here each subject may be first classified as enlargedor not regarding the tonsil size; and for those experiencing enlargement, they can befurthered classified as either greatly enlarged or slightly enlarged as shown below. Herethe investigators are in the position to address two related but different questions, the firstbeing whether carrying SP is associated with tonsil enlargement, and the second beingwhether carrying SP is associated with serious enlargement given tonsil enlargement hastaken place.
Diagram 1 (Teratological experiment)
CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA
Diagram 2 (Insemination experiment)
Diagram 3 (Tonsil enlargement)
slightly enlarged
For ordinal response of this kind, it is clear that proportional odds models may not be
appropriate. Instead we consider the continuation ratio model by Mason and Fienberg(1985)
logit (
πj(
x)
/γj(
x)) =
αj +
βtjx, j = 0
, 1
, . . , C − 1
,
where
πj(
x) = Pr(
Y =
j x) and
γj(
x) = Pr(
Y ≥ j x). Like the other two models,this model is represented by
C logistic regression models. Here
πj(
x)
/γj(
x) can bere-expressed as Pr(
Y =
j Y ≥ j, x), the conditional probability for
Y to be
j giventhat
Y is either in the
jth category or higher. It is rather clear that this conditionalprobability is more meaningful than
γj(
x) itself in reflecting the scientific question ofinterest. Note that in (4.4) we have allowed the effect of
x on
πj(
x)
/γj(
x) to be differentfor different
j. This would be a sensible approach for Examples 4.3 and 4.5 as, forinstance, there is no scientific reason to expect that the embryo-toxic effect to be the sameas the structural developmental effect. On the other hand, it may be equally sensible
4.3. LOG-LINEAR MODELS FOR CONTINGENCY TABLES
to allow the
βj's to be the same in Example 4.4, which characterizes the effect of
x onthe probability for being pregnant allowing the probability itself (i.e. the
αj's) to varyacross experiments. Nevertheless, this assumption of common
β can always be examinedstatistically. Returning to the tonsil size example, Table 4.2 gives the MLEs and theirs.e. when both the proportional odds model and the continuation ratio model were fittedto the data shown earlier. Here the sole covariate is the SP carrier status (
x = 1 for beingcarrier and 0 if not). Both models lead to conclusions of strong evidence of associationbetween SP carrier and tonsil enlargement, even though the regression coefficients fromthe two models have different interpretations. Results from the proportional odds modelsuggest that the odds for having more serious tonsil size enlargement is 83% (= e0
.603
−1) higher for the carrier than the non-carriers. On the other hand, results from thecontinuation ratio model indicate that the odds for having tonsil enlargement is 67%(= e0
.514
− 1) higher for the carriers than the non-carriers; whereas the odds for havingtonsil size greatly enlarged among those with tonsil enlargement is 72% (= e0
.543
− 1)higher for the carriers than the non-carriers.
Table 4.2: Regression estimates (
± s.e.) based on proportional odds mod-els and continuation ratio models for the tonsil size example (Holmes andWilliams, 1954).
Proportional odds
Continuation ratio
0
.509
± 0
.056
0
.512
± 0
.051
−1
.363
± 0
.067
1
.245
± 0
.093
0
.514
± 0
.273
0
.543
± 0
.286
0
.603
± 0
.227
Log-Linear Models for Contingency Tables
Log-linear models represent a popular approach to analyze data that can be summarizedin contingency tables. One important feature of this approach is that there is no needto specify response variables. This is especially useful in social science research withthe aim to examine the relationship (or the association) among variables considered.
More recently, this log-linear model approach has been advocated by some researchers todeal with correlation data. Examples include cross-over trials where patients receive, bydesign, different treatments over several periods (Jones and Kenward, 1987) and geneticstudies on detection of maternal effects of disease genes (Weinberg, Wilcox and Lie,1998). In this section, we briefly review the work on log-linear models with the emphasison interpretations of specified parameters. It serves to point out the potential for drawingincorrect conclusions when using log-linear models to analyze contingency tables data ordiscrete correlated data.
CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA
Interpretations of Log-Linear Model Parameters
We start with a simple
I × J contingency table in which
Yij represents the number ofindividuals who are in the
ith(
jth) category of variable 1 (variable 2),
i = 1
, . . , I, j =1
, . . , J. Let
Pij be the corresponding probability for an individual to be in the (
i, j) cellof the
I × J table. With
m being the sample size, one has immediately
µij = E(
Yij) =
mPij. It is intuitive that two categorical variables of
I and
J categories, respectively, arenot associated with each other (or statistically independent) if
Pij =
Pi+
P+
j,
where
Pi+ and
P+
j are the marginal probabilities, e.g.
Pi+ is the probability that anindividual falls into the
ith category of the first variable. The expression in (4.5) leads to
log
n + log
Pi + log
P+
j
µ +
αi +
βj
with
α1 = 0 =
β1 to avoid over-parameterizations. Thus, one can test the assumptionof statistical independence between the two variables by computing the deviance whichcontrasts this no association (or null) model with the saturated model where all the
I × J Pij's are unspecified. When this null model is rejected, a natural question, froma modelling viewpoint, is how one may modify (4.6) to acknowledge (or to characterize)the dependence (or association) between two variables. One obvious approach is to addto (4.6) additional parameters (
αβ)
ij, i.e.
log
µij =
µ +
αi +
βj + (
αβ)
ij
with
α1 =
β1 = 0 = (
αβ)1
j = (
αβ)
i1, for all
i = 1
, . . , I, j = 1
, . . , J. The followingrepresentation clearly qualifies the (
I − 1)
× (
J − 1)(
αβ)
ij's as "association parameters,"namely, for
i = 1
, . . , I and
j = 2
, . . , J ,
is the log odds ratio contrasting the (
i, j) cell versus the (1
, 1) cell; see the 2
× 2 tablebelow
An important question is how to incorporate, under the framework of log-linear models,the additional information that (some of) the variables are ordinal? As an illustration,the table below gives the 12 empirical log odds ratios from the disturbed dream example.
It shows the pattern that the estimated log odds ratios decrease in age for each level ofsuffering and decrease in degree of suffering from disturbed dream for each age group.
This suggests that one approach is to assign scores,
si, i = 1
, . . , I, tj, j = 1
, . . , J toboth variables and model (
αβ)
ij, the log odds ratios, accordingly, e.g.
(
αβ)
ij =
sitjβ.
4.3. LOG-LINEAR MODELS FOR CONTINGENCY TABLES
Degree of suffering from disturbed dreamNot severe
Returning to the disturbed dream example, we have fitted the above model i.e. (4.8),to the 5
× 4 contingency table with
si = mid-age of the
ith age group,
i = 1
, . . , 5 and
tj =
j, j = 1
, . . , 4. The MLE for
β gives
−0
.205 with s.e. 0
.050 which suggests thatthe odds for suffering from disturbed dream with higher unit decreases by a factor of0
.81(= e
−0
.205) per year for boys between 5 and 15 years of age. It is interesting tocontrast this approach with the proportional odds model approach in
§4.2.2; see Table4.1. Both approaches lead to the same conclusion that there is a convincing negativeassociation between age and degree of suffering from disturbed dreams. One importantdistinction between these two modelling approaches is on the assumptions being madefor the variables and the interpretability of the primary parameter of interest,
β in (4.3)and
β in (4.8). While both assigning scores to the age variable, which is the intervalscale, the log-linear model approach makes the further assumption by assigning scores tothe other variable, which is considered as a response variable for the proportional oddsmodel approach. One argument against this additional assumption is that it is ratherarbitrary. Specifically, by assuming
tj =
j, j = 1
, . . , 4, one is making the assumptionthat, for example, the "difference" in degree suffering from disturbed dreams between thethird and first categories is twice the "difference" between the second and first categories.
This arbitrariness makes the
β coefficient in (4.8) difficult to interpret. The
β coefficientinduced by the proportional odds model, on the other hand, does not rely on the assigningof scores for the response variable and is easy to interpret. We note that the deviancesfor both models with 11 degrees of freedom, are very comparable to each other (12
.42for the proportional odds model and 14
.08 for the log-linear model). This example alsoserves to illustrate that goodness-of-fit measures such as the deviance are secondary (orless relevant) in choosing between models compared to the interpretability of primaryparameters induced by the models.
Moving on to more practical situations of
I × J × K contingency tables formed
by three categorical variables of
I, J and
K levels, respectively, let
Yijk(
µijk) be thenumber (expected number) of individuals who are in the
ith
, jth and
kth category ofvariables 1, 2 and 3, respectively,
i = 1
, . . , I, j = 1
, . . , J , and
k = 1
, . . , K. Again,the primary interest is on how these three variables are associated with each other. Forthree-dimensional or higher-dimensional contingency tables, more log-linear models areavailable in additional to the obvious null model
log
µijk =
µ +
αi +
βj +
γk, α1 =
β1 =
γ1 = 0
and the saturated model
log
µijk =
µ +
αi +
βj +
γk + (
αβ)
ij + (
αγ)
ik + (
βγ)
jk + (
αβγ)
ijk.
Some interesting intermediate models include
(A) log
µijk =
µ +
αi +
βj +
γk + (
αβ)
ij,
CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA
(B) log
µijk =
µ +
αi +
βj +
γk + (
αβ)
ij + (
αγ)
ik,
(C) log
µijk =
µ +
αi +
βj +
γk + (
αβ)
ij + (
αγ)
ik + (
βγ)
jk.
To better understand the interpretation of parameters, in particular (
αβ)
ij, specifiedabove, we consider the following log odds ratio
log
ijkP11
k ≡ log OR
which is the log odds ratio relating variables 1 and 2 among (conditional on) thoseindividuals who are in the
kth category of variable 3. Depending on models chosen,log OR
ij k has different expressions and hence different interpretations. For the nullmodel, which implies no association among three variables, log OR
ij k is equal to zero asexpected. For model (C), this log odds ratio reduces to (
αβ)
ij, which is the constant logodds ratio common to all
K categories of variable 3. On the other hand, if the saturatedmodel is entertained, log OR
ij k amounts to log odds ratio only for those who are in the
kth category of variable 3.
In summary, log-linear models have the following two important features that are
worthy of noting. First, interpretations of lower order terms (e.g. (
αβ)
ij) depend on thepresence or absence of higher order terms (e.g. (
αβγ)
ijk). More importantly, the two-wayassociation parameters (e.g. (
αβ)
ij) have the "conditional" log odds ratio interpretations,i.e. log OR relating two variables adjusting for other variables. One important implicaitonof the second feature stated above is the following: when more than one variable isconsidered as response variables, is it sensible to adjust for other response variables ifa primary question of interest is the association between a particular response and anindependent variable? We will return to this in the next chapter.
An Alternative Representation of Log-Linear Models
As mentioned at the onset of
§4.3, there has been some attempts recently to adoptlog-linear models as a means to model correlated data. At first glance, this may notseem obvious as log-linear models are applied to aggregated data whereas the interestfor the latter is to model the joint distribution of correlated data from a cluster. Themain purpose of this subsection is to show briefly how to bridge the gap betwen thesetwo seemingly unrelated tasks. The other purpose is through this new representationto reinforce some potential concerns regarding interpretations of regression coefficientsinduced by log-linear models, which were mentioned in
§4.3.1.
Recall that what one observes for each subject in the context of contingency tables
can be expressed as
Y = (
Y1
, . . , Yj, . . , Yn)
t,
where
n is the number of categorical variables under investigation and
Yj is a categoricalvariable of
Cj levels,
j = 1
, . . , n. In the disturbed dream example,
n = 2 and
C1 = 5and
C2 = 4 if
Y1 and
Y2 represent age and degree suffering from disturbed dreams. Forsimplicity, we assume
n = 3 and
Cj = 2 for all
j = 1
, 2
, 3 so that
Yj = 1 or 0 dependingon whether the subject falls in the second or the first category, respectively. Using thesame notations as in
§4.3.1, we now consider a joint distribution for
Y = (
Y1
, Y2
, Y3)
t
4.3. LOG-LINEAR MODELS FOR CONTINGENCY TABLES
Pr(
Yj =
yj, j = 1
, 2
, 3)
2
y1 +
β2
y2 +
γ2
y3 + (
αβ)22
y1
y2 + (
αγ)22
y2
y3
j = 0
, j = 1
, 2
, 3)
It can be easily shown (e.g. Liang and Zeger, 1989) that the model in (4.9) is equivalentto the saturated log-linear model for a 2
× 2
× 2 contingency table where
Yijk = # of subjects in which (
Y1 =
i − 1
, Y2 =
j − 1
, Y3 =
k − 1)
i, j, k = 1
, 2
.
The representation in (4.9) provides an alternative way to interpret the parameters spec-ified under log-linear models by considering for example
1 = 1
y2
, y3) =
α
2 + (
αβ)22
y2 + (
αγ)22
y3 + (
αβγ)222
y2
y3
,
1 = 0
y2
, y3)
the log odds for the first categorical variable,
Y1, to be one as a function of
Y2 and
Y3, theother two variables. Thus, (
αβ)22 may be interpreted as the effect, measured by log odds,of
Y2 on
Y1 adjusting for
Y3. In other words, (
αβ)22 characterizes the association between
Y1 and
Y2 conditional on the status of the third variable,
Y3 in this case. Similarly, onecan derive through (4.9)
2 = 1
y1
, y3)
2 + (
αβ)22
y1 + (
βγ)22
y3 + (
αβγ)222
y1
y3
,
2 = 0
y1
, y3)
3 = 1
y1
, y2)
2 + (
αγ)22
y1 + (
βγ)22
y2 + (
αβγ)222
y1
y2
.
3 = 0
y1
, y2)
Representations above serve to illuminate the concerns of using log-linear models tomodel longitudinal data, an important special case of correlated data. In this setting,
Yjrepresents for an individual the binary observation measured at the
jth visit,
j = 1
, 2
, 3.
An obvious question is whether it is sensible to model as shown in (4.10) the dependenceof the first observation,
Y1, on observations measured subsequently,
Y2 and
Y3 in thiscase. Another drawback of this modelling is that it is not reproducible (e.g. Liang, Zegerand Qaqish, 1992) in that the joint distribution of any subset of
Y = (
Y1
, Y2
, Y3) do nothave the same representations as (4.9) for
Y itself. This is particularly troublesome forcorrelated data of varying cluster sizes.
CHAPTER 4. ANALYSIS OF POLYTOMOUS AND COUNT DATA
Chapter 5
Statistical Modellings for
Correlated Data
This chapter is devoted to discussions on statistical modellings for correlated data thathave been developed recently. Recall in Chapter 1 we pointed out through examples thatscientific objectives for correlated data can often be characterized statistically throughregression to (1) assess the relationship between response variables and explanatory vari-ables and (2) examine and describe the degrees and patterns of within-cluster association.
We discuss in this chapter three different approaches: marginal models, random effectsmodels and observation driven models. These models are illustrated through two exam-ples presented in Chapter 1: The Baltimore Eye Survey (Example 1.1) and the pre-posttrial for schizophrenics (Example 1.4). These three models are contrasted with the focuson their distinct interpretations of regression coefficients. To begin with, we recall thatcorrelated data can often be expressed as follows: Let
Yi = (
Yi1
, Yij, . . , Yin )
t be the
ni observations from the
ith cluster,
i = 1
, . . , m, m being the total number of clusters.
For the
jth component,
j = 1
, . . , ni, we assume the availability of
xij, a
p × 1 vectorof covariates, that may be associated with
Yij. For the Baltimore Eye Survey Example,
m = 5
, 199
, ni = 2 for all subjects and
Yi1(
Yi2) represents the visual impairment sta-tus (1: impaired; 0: normal) of the left (right) eye of the
ith subject. In this example
xi1 =
xi2 =
xi, i.e. all the covariates are the same for both eyes including age, race andpossibly interactions between the two demographic variables. The primary objective ofthis study is to identify demographic variables that are associated with the risk of visualimpairment. The secondary objective is to examine the degree of fellow eye associationand its possible dependence on demographic variables as well. As for the pre-post trial,
m = 170 and the
ni's range from 2 to 6 as some of the patients dropped out duringthe course of follow-up. The primary response variable is PANSS measuring the severityof schizophrenia; whereas the main explanatory variables include the treatment status(
xi = 1 if receiving rispirodine of 6mg and 0 if receiving haloperidol of 20 mg), time (inweeks) from the baseline,
tj(0
, 1
, 2
, 4
, 6 and 8) and the interaction between
xi and
tj.
The main objective of this clinical trial is to assess the efficacy of the new treatment forschizophrenia, rispirodine, as measured by the rate of decline in PANSS.
CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA
This approach may be viewed as a multivariate analogue of the quasi-likelihood methodin
§2.3:
(A) Systematic component
h(
µij) =
xt β, µ
ij = E(
Yij xij ).
(B) Variance specification Var(
Yij) =
V (
µij;
φ), where
φ is the over-dispersion param-
(C) Covariance specification Cov(
Yij, Yik) = C(
µij, µik;
α)
, j < k = 1
, . . , ni, where
C
is a known function indexed by association parameters
α.
For the Baltimore eye survey example, one may consider
logit Pr(
Yij = 1) =
β0 +
β1 Race
i +
β2 (Age
−
60) +
β3 (Age
i
+
β4 Race
i (Age
−
60) +
β5 Race
i (Age
− 60)2
,
(B) Var(
Yij) =
µij(1
− µij),
(C) log OR(
Yi1
, Yi2) =
α0 +
α1 Race
i +
α2 Age.
Several remarks are worth noting. First, implicitly one assumes in (A) of (5.1) that theeffects of a age and race on the risk of visual impairment are the same for both eyes.
While one can test this assumption formally, expert opinion on this is that commonetiology for both eyes is warranted. An interpretation of e
β1 , the coefficient for the Racevariable, is that it represents the odds of visual impairment for any eye of blacks of 60years old relative to the odds for whites of the same age. Finally, recall in
§1.4 ourpreference of using odds ratio as a measure of association for categorical variables. Thespecification (C) along with (A) and (B) lead to a full specification of Cov(
Yi1
, Yi2). Forthe pre-post trial example, one could, for example, consider
(A)
µij =
β0 +
β1
tj +
β2
xi +
β3
xi tj,
(B) Var(
Yij) =
φ,
(C) Cov(
Yij, Yik) =
φ e tj−tk .
Here the AR-1 correlated structure for the repeated observations from the same patient isimposed as shown in (C). More importantly, the interaction term
xi tj, in (A) postulatesthat the difference in averaged PANSS between the treated (risperidone) and control(haloperidol) groups varies linearly in time as characterized by
β3. More specifically,
β3represents the change of averaged difference in PANSS between two groups per unit (inweeks) change as time progresses.
Finally, some general remarks on marginal models (Liang and Zeger, 1993). First,
the
β coefficients specified in (A) describe the effects of covariates on the "marginal"expectation of the response variable. For the eye survey example, this marginal expec-tation amounts to the averaged
Y for a given eye irrespective of the outcome statusof the fellow eye. For the pre-post trial example, this marginal expectation means the
5.2. RANDOM EFFECTS MODEL
averaged
Y at a given visit irrespective of the PANSS scores at other visits. Thus
βhas the so-called "population averaged" interpretation by contrasting the averaged
Y 'sfrom different subpopulations formed by different values in
x (Zeger, Liang and Albert,1988). This approach, from this viewpoint, is particularly relevant for studies of publichealth significance. Furthermore, the interpretation of
β is the same regardless of thecluster size,
ni, and of magnitude of within-cluster association. Finally, just as in quasi-likelihood, the joint distribution of
Yi = (
Yi1
, . . , Yin )
t is not fully specified by (A), (B)
and (C) and this issue will be addressed in the next chapter.
Random Effects Model
Unlike the marginal model approach which addresses issues of the regression of
Y on
xand within-cluster association through two separate sets of regression models, randomeffects models and observation driven models intend to address these two issues simulta-neously through a single set of regression models. The essence of random effects models,however, is the introduction of cluster-specific regression coefficients relating
Y to
x toreflect natural heterogeneity among clusters caused by unmeasured factors. This notionof heterogeneity among clusters, called "random effects" translates into within cluster"likeness" or dependence. Specifically, given
bi, random effects specific to the
ith cluster,
i = 1
, . . , m,
(A
)
Yi1
, . . , Yij, . . , Yin are statistically independent of each other.
(B
) Each
Yij follows a GLM with
h(E(
Yij bi)) =
xtijβ +
ztijbi,
where
zij, a
q × 1 vector, is generally a subset of
xij.
(C
) The
bi's are assumed to be independent and generated from a common distribution,
F (
·;
δ), with mean zero and indexed by parameters,
δ.
For the pre-post trial example, patients vary tremendously in baseline PANSS and inchange of PANSS over time. To acknowledge this variation statistically, one may consider
(B
)
Yij = (
β
1 +
bi1)
tj +
β
2
xi +
β
3
xi tj +
ij , where given
bi = (
bi0
, bi1)
t, ij
N (0
, φ)
, j = 1
, . . , ni,
(C
)
bi follows a bivariate normal distribution with mean (0
, 0)
t and covariance matrix
In this model,
β0 and
β0+
β2 represent the averaged baseline PANSS for patients receivinghaloperidol and risperidone respectively, whereas
β1 and
β1 +
β3 represent averaged ratesof change in PANSS for the two respective groups. Thus negative
β1 and
β3 would suggestthat receiving risperidone would lead to a greater decline in PANSS over time thanhaloperidol does. Furthermore,
bi0 and
bi1
, i = 1
, . . , m, are introduced to account forvariation across patients discussed earlier. The larger the
δ00(
δ11), the greater variations
CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA
across subjects in PANSS scores (in rates of change in PANSS). Finally, Var(
Yij) =
φ +
δ00 + 2
δ01
tj +
δ11
t2 and
Cov(
Yij, Yik) =
δ00 +
δ01(
tj +
tk) +
δ11
tjtk, j < k,
a result of the introduction of random effects, the
bi's. For the eye survey example, onemay consider
(B
)
Yij bi ∼ Binomial(1
, µ ), where
µ = E(
Y
ij bi) and
(
β0 +
bi) +
β1 Race
i +
β2 (Age
i
60) +
β3(Age
i
4 Race
i (Age
i
60) +
β5 Rac
i (Age
i
(C
)
bi ∼ N (0
, δ).
This logistic regression model in (B
) is very similar to that under marginal modelswith the exception of the additional individual-specific parameter,
bi, which is includedto account for variation among subjects in risks for visual impairment that are notexplained by age, race variables. Indeed, with random intercepts, Zeger, Liang andAlbert (1988) have shown that there is a simple mathematical relationship between
βand
β in marginal models, namely,
logit E(
Yij) = logit E(E(
Yij bi)) =
xtijβ/(
c2
δ200 + 1)1
/2
,
where 16 3
/(15
π). Thus, approximately
= (
c2
δ00 + 1) 2
β.
We will return to the comparison between these two models in
§5.4. Some generalcomments on random effects models. This approach for modelling correlated data isparticularly useful to characterize the "change" in
Y within clusters. For example,
β +
bidescribes the effect of
x on
Y that is specific to the
ith cluster (or subject). Furthermore,the empirical Bayes method (Robbins, 1956) and subsequent developments (e.g. Lairdand Ware, 1982; Zeger and Karim, 1991; Breslow and Clayton, 1993) allow investigatorsto estimate not only
β but also
bi as well. Thus this modelling is particularly relevantfrom a clinical viewpoint. For example, longitudinal studies would allow clinicians tocommunicate with concerned parents the drop in risk of their own children for sufferingfrom infectious disease if they change their smoking behavior (e.g. from smoking to non-smoking). Finally, given (A
), (B
) and (C
) specified above, the probability distributionfunction for
Yi as indexed by
β, φ and
δ is fully specified.
Observation Driven Models
As in random effects models, this approach addresses simultaneously both issues of re-gression of
Y on
x and within-cluster association through a set of regression models.
One important feature of the observation driven models that distinguish itself from therandom effects model is that it operates on the premise that correlation within a cluster
5.3. OBSERVATION DRIVEN MODELS
arises because one response is explicitly caused by others. A typical example of this kindis the spread of the disease that is contagious within a family or a community. Define
Yi,−j = (
Yi1
, . . , Yi,j−1
, Yi,j+1
, . . , Yin )
t
a vector of (
ni − 1)
× 1
Yij's excluding
Yij. Observation driven models may be charac-terized as follows
(A
) For
Hij =
H(
Yi,−j), a function of
Yi,−j, Yij given
Hij follows a GLM with
h(E(
Yij Hij)) =
xtijβ +
where
fν, ν = 1
, . . , q, are known, up to
α, functions of
Hij.
Choice of
Hij depends on situations that are dealt with. For longitudinal studies, it isnatural to consider
Hij = (
Yi1
, . . , Yi,j−1), observations prior to the
jth visit. On theother hand, for family studies, in particular, of siblings, it would be more sensible toconsider
Hij =
Yi,j−1, the rest of the family members. If adopting this approach, onemay consider for the eye survey example
(A
)
Yij Yi,3
−j ∼ Binomial(1
, µ),
j = 1
, 2 with
logit Pr(
Yij = 1
Yi,3
−j, xij)
Race
i +
β
60) +
β
Race
i (Age
i
60) +
β
Race
i (Age
i
Note that with
ni ≡ 2, this model is the same as those proposed by Rosner (1984) andConnolly and Liang (1988) discussed in
§3.3. It is interesting to note that e
β
the interpretation as the relative odds of visual impairment for one eye among blacks of60 years of age versus that of whites of the same age, adjusting for the visual impairmentstatus of the fellow eye. This is to be contrasted with e
β1 under marginal models inwhich no adjustment for the fellow eye status is made. If the objective of the studyis to identify risk factors, in particular demographic ones, for prevention purposes, onemay question the relevance of this observation driven approach as the effects of race,age, etc. on risk of visual impairment for one eye may be diminished due to their directcompetition against the visual status of the fellow eye in explaining
Yij; see (5.4). Forthe pre-post trial example, one may consider
(A
)
Yij Yi1
, Yi,j−1 is specified as
1
tj +
β
2
xi +
β
+
α(
Yi,j−1
− β −
1
tj−1
− β
2
xi − β
3
xi tj−1) +
ij ,
where
ij ∼ N (0
, φ).
This is known as a AR-1 process in which
Yij depends on the past history in
Yi onlythrough
Yi,j−1, the immediate past observation. We will discuss the relationship among
β, β and
β for this example in
§5.4.
CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA
Generally speaking, this approach represents a particular way to model the within-
cluster dependence mechanism, e.g., past responses cause the present one in longitudinalstudies. This direct modelling is particularly useful for prediction purposes. For example,one may be interested in predicting PANSS two months after the termination of the studyif the patient receiving risperidone had a score of 140, which is very high, at that time.
On the other hand, if the main purpose of the study is on the regression of
Y on
xone should be cautious about the interpretation of
β which may be non-trivial andpotentially misleading especially when the cluster sizes vary. This is especially so fordiscrete response variables; see discussion in
§3.2 and
§4.3.
Contrasts of Three Modelling Approaches
It is clear that the three approaches presented in Sections 5.1-3 are motivated distinctlyconsidering how within-cluster association may be incorporated in conjunction with theconventional regression of
Y on
x. With the exception of identity link function, whichis commonly assumed for continuous responses, the three regression coefficients,
β1
, βand
β, are likely to be different numerically and interpretation-wise. In addition, thewithin-cluster association pattern as implied by the models are in general different forthese approaches as well.
We now first examine more closely the special case of continuous responses with
identity link. It can be shown mathematically that in this case
β =
β =
β.
To see this, note that random effects models specify the effects of
xi on the conditionalexpectations of
Yi given
bi, the random effects. However, noting that E(
bi) = 0 asspecified in (C
) of
§5.2, we have immediately that
E(
Yij) = E(E(
Yij bi)) = E(
xtijβ +
ztijbi) =
xtijβ,
which when comparing with (A) in
§5.1, leads to
β =
β. Similarly, if one adopts theobservation driven model, one has
E(
Yij) = E(E(
Yij H(
Yi,j−1)))
E(
xtijβ +
αν(
yi,j−ν − xti,j−νβ))
Thus, for continuous responses of identity link, the three regression coefficients relating
Y to
x are identical numerically. The only major difference is on how the within-clusterassociation may be modeled. For example, a random effects model with intercepts as theonly random effects leads to Cov(
Yij, Yik) =
δ00, a constant covariance for each pair ofobservations from the same cluster. On the other hand, if an AR-1 process is assumedunder observation driven models, one has Cov(
Yij, Yik) =
α tj−tk which decreases in
tj − tk , the "distance" (in time) between two observations from the same cluster.
The equality in
β, β and
β shown above is indeed an exception rather than the rule.
To elaborate on this, we assume, for simplicity, one is dealing with longitudinal studies of
5.4. CONTRASTS OF THREE MODELLING APPROACHES
binary responses and logit link. In this situation, e
β for a covariate (who marginal modelscompares prevalences in subgroups formed by observations with different
x values (andhence measured at different times). For random effects models of the same covariate
x,e
β describes changes in personal risk due to change in
x over time. On the other hand,e
β in observation driven models characterizes changes in incidence for individuals withthe same history in
Y but differ in
x measured at the same time. It has been shown inthis situation (Zeger, Liang and Albert, 1988) that
β is smaller than
β in magnitude.
However, one should in general reach similar conclusions for testing the hypothesis of noassociation between
Y and
x (i.e.
β = 0 or
β = 0) since
β)
≈ ˆ
On the other hand, by examining (A) in
§5.1 and (A
) in
§5.3 closely, we would speculatethat
β would be smaller than
β in magnitude due to the fact that (1) both
xij and
H(
Yi,−j) appear in the same regression model in explaining
Yij and (2) the
Yij's are ingeneral highly correlated with each other.
CHAPTER 5. STATISTICAL MODELLINGS FOR CORRELATED DATA
Chapter 6
Statistical Inference for
Correlated Data
In Chapter 5 we introduced and contrasted three statistical modellings for correlateddata. In this chapter, we review approaches to make inference about parameters speci-fied by the models. These approaches include full likelihood methods, conditional like-lihood methods and estimating function methods. Pros and cons of each approach arehighlighted.
Recall that for marginal models only the first two joint moments of the
Yi's are requiredto be specified, i.e.
(A) Systematic component
h(
µij) =
xtβ, for all
j = 1
, . . , n
(B) Covariance/variance specification Cov(
Yij, Yik) =
C(
µij, µik;
α, β)
, j ≤ k = 1
, . . , ni.
The question is that given this specification in (A) and (B) how to make inference on
βand
α based on the data at hand. Just as in the univariate response case, we considertwo different approaches.
If the regression of
Y on
x is the primary interest which usually is the case in, for example,longitudinal studies, one may estimate
β by solving
1(
β, α) =
i;
β, α)(
yi − µi(
β)) = 0
.
The above equation is know as the generalized estimating equation (GEE) (Liang andZeger, 1986) which reduces to the quasi-score equation defined in
§2.3 when
ni = 1 for all
i = 1
, . . , m. Since
S1 in (6.1) depends on
α as well, one may iterate until convergencebetween solving
S1(
β, ˆ
α(
β)) = 0 for
β and updating ˆ
α(
β) with the latest ˆ
β solved above
CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA
(Liang and Zeger, 1986). This approach, which we term GEE1, is robust in the sensethat, just as in quasi-likelihood, (1) no distributional assumption on
Yi = (
Yi1
, . . , Yin )
t
is needed and (2) consistent estimation of
β and correct precision assessments on ˆ
available even if the covariance structure for
Yi, C(
µij, µik;
β, α), is incorrectly specified.
For the latter part, see some key references including White (1982), Gourieoux et al.
(1984), Royall (1986) and Liang and Zeger (1986). Furthermore, high efficiency relativeto the MLE of
β is usually maintained even if Cov(
Yi) is incorrectly specified (e.g. Liangand Zeger, 1986; Zhao, Prentice and Self, 1992). As expected, the degree of efficiency lossdepends on the magnitude of within-cluster correlation (the higher the correlation, thegreater degree of efficiency loss) and on how "well" the specified "working" covariance for
Yi in (6.1) approximates the true covariance structure. It is important to point out thatincorrectly assuming no association within-cluster may yield high degree of efficiency lossespecially for covariates whose values vary within clusters; see
§2.4 and Zhao, Prenticeand Self (1992). More importantly, such an exercise may lead to severe biased estimationin
β if the data are missing in a non-trivial manner. This point will be further illuminatedthrough the pre-post trial data analysis to be presented later.
When the within-cluster association is the primary interest, which is typically the
case for family studies, one may argue that this GEE1 approach may be inefficient asfar as inference for
α is concerned. This is because in GEE1, estimation of
β and
α istreated as if they were independent (or orthogonal) to each other. As an alternative, onemay solve for
δ = (
β, α) simultaneously through
∂µ∗ t
2(
β, α) =
i;
δ, γ)(
zi − µ∗
i (
δ)) = 0
,
i ) = (
Y t
i1
, . . , Y 2
in , Yi1
Yi2
, . . , Yi,n
i −1
Yini
µ∗i = E
t(
Zi;
δ) = (E
t(
Yi);
β)
, E
t(
Wi;
β, α))
t,
and
γ are parameters characterizing the third and fourth moments of the
Yi's as requiredby Cov(
Zi) (Liang, Zeger and Qaqish, 1992). Several remarks on this GEE2 approach.
First, with a proper choice of ˆ
α(
β), the GEE1 method may be viewed as a special case
of GEE2 if Cov(
Yi, Wi) in the "working" covariance matrix for
Zi, Cov(
Zi), is set to bezero. It can be seen in this situation that the first
p components of
S2(
δ, γ) reduces to
S1(
β, α). Second, in some instances considered, Liang, Zeger and Qaqish (1992) foundthat the GEE2 method is more efficiency for estimating
β and
α than the GEE1 method,although the degree of efficiency gain is much less substantial for
β than
α. We willreturn to this point through an illustration using the eye survey data as an example.
This new approach, however, creates a complication that is not associated with the useof GEE1. One complication is the need to estimate
γ as seen in (6.2). Given thatCov(
Zi) is serving as a working covariance matrix and its correct specification is notneeded, one may compute Cov(
Zi) as if
Yi follows a multivariate normal distribution(Crowder, 1987; Prentice and Zhao, 1991). This resolves the issue of dependence of
S2on
γ since multivariate normal distributions are fully specified by the first two moments.
Another approach is to replace
γ by a "guess-mates,"
γ∗, a fixed value. Liang, Zegerand Qaqish (1992) found that for cases considered, this approach sacrifices very littlepower loss. However, no general performance of these two approaches for dealing with
6.1. MARGINAL MODELS
γ has been established and this issue is likely to be examined on the case by case basis.
Another complication is that the robust property for
β enjoyed by GEE1 is no longermaintained. To see this, except when Cov(
Yi, Zi) is set to be zero, it can be shown thatthe first
p components of
S2(
δ, γ)
, S2
,β(
δ, γ) say, is a linear combination of
yi − µi(
β)and
Wi − E(
Wi;
β, α). Thus,
S2
,β(
δ, γ) is an unbiased estimating function for
β only ifE(
Wi;
β, α) computed is correct, i.e. if Cov(
Yi) is correctly specified. In other words,the consistency of ˆ
β, the solution of
S2(
δ, γ) = 0, depends on the correct specification of
Cov(
Yi); see Liang and Zeger (1995) for a detailed examination of this issue for a specialcase.
Recall that specification (A) and (B) do not fully specify the joint distribution of the
Yi'sas only the first two moments of the
Yi's are involved in (A) and (B). A natural questionis: does there exist a probability function for
Yi that is compatible with (A) and (B)?The work by Gourieroux et al. (1984) and Zhao et al. (1992) suggests that the answer isyes. Dropping the index
i temporarily, consider the following "partly exponential family"
f (
Y =
y;
β, α) = exp
{θty +
c(
y;
α)
− b(
θ, α)
},
where
θ, called "canonical parameters," is determined by
µ = E(
Y ;
β) =
y exp
{θty +
c(
y;
α)
− b(
θ, α)
}dy
through the one-to-one correspondence between
Here
c(
y;
α) is pre-specified and
α is called "shape parameters." Some choices of
c(
y;
α)include
c(
y;
α) =
yt
which, in conjunction with (6.3), leads to the multivariate normal distribution and forbinary responses
c(
y;
α) =
known as the quadratic exponential family (e.g. Zhao and Prentice, 1990; Fitzmauriceand Laird, 1993). With this partly exponential family, it has been shown (Zhao et al.,1992; Fitzmaurice and Laird, 1993) that the score function for
δ = (
β, α) has the form
S(
β, α)
i (
β)
i;
δ)
i − µi(
β)
Cov(
Yi;
Wi;
δ)
Iq×q
wi − E(
Wi;
δ)
where
Iq×q is the
q × q identity matrix and
wi =
∂c(
yi;
α)
/∂α. Consequently, the first
pcomponents of
S(
β, α) is simply
S1(
β, α) in (6.1). Thus
S1(
β, α) from GEE1 is the scorefunction for
β if
Yi is from a partly exponential family as long as the Cov(
Yi) specified
CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA
in
S1(
δ) is the true covariance matrix from
Yi. Furthermore, it can be shown (e.g., Zhaoet al., 1992) that
β and
α are orthogonal to each other, i.e. the covariance between thescore function for
β and
α is zero. Consequently, the consistency of ˆ
β, the MLE of
β, is
maintained even if
c(
y;
α) is misspecified, a property enjoyed by GEE1.
On the other hand, this likelihood approach based on the partly exponential family
may be used with caution. First, choice of
c(
y;
α), which along with
µ(
β), would leadto the complete specification of
f (
y;
β, α) requires more knowledge about the randommechanism than it does for (A) and (B) only. In the absence of sufficient knowledge,either substantive or empirical, one should proceed with caution. Second, this family isin general not reproducible, i.e. the distribution function of any subset of
Yi may not havethe same form as
f (
Yi) in (6.3). This lack of reproducibility is especially troublesomeif cluster sizes vary which is typically the case for family and longitudinal studies. We
Table 6.1: Regression estimates and standard errors (in parenthesis) for thevisual impairment response from the Baltimore Eye Survey Study.
now return to the Baltimore eye survey example to illustrate some of the points madeearlier. Table 6.1 presents results from four approaches:
Approach I: logistic regressionusing the data from the left eyes only;
Approach II: logistic regression using data fromboth eyes treating them as if they were statistically independent of each other;
ApproachIII (IV): GEE1 (GEE2) in which both approaches model log OR(
YL, YR) as a functionof race. First, point estimates of logistic regression coefficients from four approaches arevirtually identical suggesting that the etiology of visual impairment is similar for botheyes.
Furthermore, the association between age and risk of visual impairment is strong
and positive and varies between blacks and whites; see Figure 6.1. Second, the ordinarylogistic regression approach (Approach II), which ignores within-cluster association, has
6.2. RANDOM EFFECTS MODELS
Figure 6.1: Predicted risk of VI by race and age among those with 9 yearsof education in Baltimore Eye Survey Study (Tielsch et al., 1990): solidline, whites; dashed line, blacks.
the tendency, as expected, to underestimate the s.e. of ˆ
β compared to that of approaches
III and IV, which are known to be correct, at least asymptotically. Third, it is obviousand intuitive that it is less efficient to use data from one eye only; see the last columnof Table 6.1. However, the ratios of the variance of ˆ
β between approaches I and IV, the
latter using twice the sample size, are approximately equal to 1.6 instead of 2. This isknown as "design effect" in complex survey sampling and is due to the fact that the datafrom fellow eyes are highly correlated with each other. Indeed, results from Table 6.1indicate that the estimated odds ratio relating two fellow eyes in terms of risk for visualimpairment is approximately 9
.97(= e2
.30) for whites and 17
.1(= e2
.30+0
.54) for blacks.
The significant difference in odds ratios between blacks and whites is consistent with thelong-standing clinical finding in the United States that blacks have higher incidence ofglaucoma and of diabetic mellitis, both of which tend to be bilateral diseases. Finally,there is a substantial efficiency gain in using GEE2 for inference on the pattern of within-cluster association as the ratio of Var( ˆ
α) for GEE1 versus GEE2 are 2
.22(= (2
.64
/1
.77)2)
for the intercept and 2
.65(= (45
/.255)2) for the race variable.
Random Effects Models
As noted in
§5.2, given
(A
) Conditional on
bi, the random effects,
Yi1
, . . , Yin are statistically independent
and for each
j, Yij follows a GLM with
h(E(
Yij bi)) =
xt β +
zt b
(B)
bi, a
q × 1 vector,
∼ F (
·;
D) with mean zero,
the likelihood function for
β, φ and
D is completely specified, namely,
L(
β, φ, D)
∝
f (
yij bi;
β, φ)
f (
bi;
D)
dbi.
CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA
As an example, for binary responses of logit link and for a sole random intercept, i.e.
zij ≡ 1, following
N (0
, D), the likelihood function has the form
L(
β, D)
yijxtijβ +
log 1 + e
xtij
One can see from (6.5) that the likelihood function involves
q-dimensional integrals, onefrom each cluster and in general, numerical approximation is needed in order to computer(6.4). We now consider two different approaches whose pros and cons will be contrastedthroughout.
Conditional Likelihood Approach
This approach only works when the link function is a canonical link in which the firstpart of the integral in (6.4) has the simple form, namely,
f (
yij bi;
β) = exp
yijxtijβ +
yijztijbi −
which includes (6.5) as a special case. Some important examples of canonical linksinclude, as mentioned in
§2.2, the identity link for normal distributions, log link forPoisson distribution and logit link for binomial distribution. One important feature of
(6.6) is that given
bi,
yijxij and
yijzij are jointly minimally sufficient for
β and
bi, the canonical parameters. Consequently, one can eliminate
bi in (6.6) by conditioning
yijzij, the minimal sufficient statistics for
bi for fixed
β, i.e.
w )
∝
yi1
, . . , yin
ij zij ;
β
where
β = (
βw, βz) and
xij = (
wij, zij) as
zij is assumed to be a subset of
xij. Thus,through conditioning (see
§2.2), it gives rise to a conditional likelihood which is indepen-dent of
bi and
βz and depends only on
βw. In other words, only a subset of
β is estimable.
Thus, the usefulness of this conditional likelihood approach depends on whether covari-ates of interest have been declared as random effects or not. For example, if
zij ≡ 1, i.e.
random intercepts is the only random effect declared, then only those
β from covariates
wij's which vary within-cluster are estimable since otherwise
Thus it should be clear that this conditional likelihood approach would not work for theeye survey example since all the covariates are cluster-specific rather than eye-specific.
6.2. RANDOM EFFECTS MODELS
As another example, consider longitudinal studies in which primary objectives are of-ten centered on the individual change of the response variables over time. If the timevariable, i.e. the times for which the response variables are measured, is considered asa random effect, known as random slopes, then one cannot estimate the averaged rateof change coefficient should the conditional likelihood method be employed. This alongwith the fact that canonical links are required limit the extent to which this method maybe applied. One desirable property of this method, i.e. using
Lc(
βw) as the basis forinference, is that no distributional assumption on the
bi's is needed, which is comfortingas the validity of such an assumption is difficult to verify in general.
An obvious alternative is to use the full likelihood,
L(
β, φ, D), in (6.4) as the basis forinference on both fixed effects,
β, and the variance components,
D and
φ. For continuousresponses with identity links and normal distribution on both
Y b and
b itself, i.e.
(A
)
Yij bi ∼ N (
xt β +
zt b
ij i, φ),
(B
)
bi ∼ MVN(0
, D),
Yi follows a multivariate normal distribution with mean
xtβ and covariance matrix
i +
φIni×ni
.
, zi = .
.
In this case, the restricted maximum likelihood (REML) method (e.g. Patterson andThompson, 1971; Laird and Ware, 1982) for
β, φ and
D can be used. Furthermore, theempirical Bayes method (Robbins, 1955) can be utilized to help provide more desirableestimates of the
bi's. This aspect of inference is useful especially for the purpose ofidentifying clusters that are distinct from others with regard to the behavior of therandom effects. A prime example is in growth studies in which investigators are interestedin identifying children whose "growth rates" as captured by the
bi's are lower than the restof the sample. We note that this REML method along with empirical Bayes procedurefor estimating the
bi's have been implemented in SAS as a part of PROC MIXED.
In general, the likelihood function involves complicated
q-dimensional integrations
which are demanding numerically especially when
q is modestly large. Two methodshave been developed recently to address this issue. One is to use the Gibbs samplermethod in which the posterior distribution of
β, φ, and
D given the data is simulatedthrough, for example, the Monte Carlo Markov Chain (MCMC) algorithm (e.g. Zegerand Karim, 1991). The other approach, known as the penalized likelihood method,approximates
f (
bi Yi) by a normal density function leading to an approximate likelihoodfunction which is less computationally intensive (e.g., Stiratteli et al. 1984; Breslow andClayton, 1993). This approach has been shown through simulations to perform well ingeneral except for binary responses with small cluster sizes (Breslow and Clayton, 1993).
This method has been implemented in SAS as well (PROC GLIMMIX) and has beenrefined more recently to correct the bias occurring in the situation described above (e.g.
Breslow and Lin, 1995; Lin and Breslow, 1996).
CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA
Observation Driven Models
Recall that observation driven models tackle the issue of within-cluster association bydirectly modelling the dependence of
Yij on some functions of
Yi,−j =
{Yi1
, . . , Yi,j−1
, Yi,j+1
, . . , Yn },
namely,
Hij =
Hij(
Yi,−j) through
(A
) For
j = 1
, . . , ni, Yij Hij follows a GLM with
h(E(
Yij Hij)) =
xtijβ +
Some common choices of
Hij were discussed in
§5.3.
Just as in
§6.1.2, a natural question is "does there exist a probability function for
Yi thatis compatible with specification (A
)?" While the answer no doubt depends on choices of
H and
fν, it is comforting that the Hammerley-Clifford Theorem in Besag (1974) couldserve as a guidance for the question posed above. As an example, for binary responses
with
q = 1, a choice of
fν(
Hij;
α) =
α
i has led to the "multiplicative" model
developed by Connolly and Liang (1988).
As an alternative, one can estimate
δ = (
β, α) by solving
∂E(
Y
ij Hij ;
δ)
ij Hij ) (
yij − E(
Yij Hij ;
δ)) = 0
,
i=1
j=1
a strategy similar to GEE for marginal models. An advantage of this approach is thatthe statistical package developed for GEE of marginal models such as SAS PROC GEN-MOD can be utilized. On the theoretical end, this approach shares the same robustnessproperty enjoyed by GEE for marginal models.
Returning to the Baltimore Eye Survey Study example, Table 6.2 presents results from
three modelling approaches. The first column is based on GEE2 for marginal models asshown in Table 6.1. The second column is derived from observation driven models usingthe GEE1 method. Specifically, this model assumes for each
j = 1
, 2,
logit Pr(
Yij = 1
Yi,3
−j, Age
i, Race
i)
1 Race
i +
β
60) +
β
Race
i (Age
i
60) +
β
Race
i (Age
i
+
α1
Yi,3
−j +
α2
Yi,3
−j Race
i.
Finally, the third column corresponds to results from fitting a random effects model ofsame covariates as marginal models with the addition of random intercepts. Consistent
6.4. PRE- POST TRIAL ON SCHIZOPHRENIA REVISITED
with discussion in
§5.4, the estimated
β's from the observation driven model are ingeneral smaller in magnitude than that of the
β's from the marginal model. In par-ticular, the estimated
β for the Race variable is reduced from 0.334 in column oneto 0.099 in column two and as a result, is not significantly different than zero statisti-
Table 6.2: Regression estimates and standard errors (in parenthesis) fromthree modelling approaches: Baltimore Eye Survey Study.
variance of random intercepts.
cally if the observation driven model is employed. A simple and obvious explanation ofthis discrepancy is that the effects of those demographic variables are, to some extent,"explained away" by the competing variate,
Yi,3
−j, the visual impairment status of thefellow eye; see ˆ
α1 and ˆ
α2 in column two. On the other hand, the estimated
β's from
the random effects model are larger in magnitude than the ˆ
β's. However, the Wald test
statistics for testing the significance of either
β or
β are similar in magnitude and leadto similar conclusions. For an example, the Wald statistics for the Race variables are3
.18(=
.334
/1
.05) from column one and 2
.58(=
.487
/.189) from column three.
Pre- Post Trial on Schizophrenia Revisited
Recall that in this trial more than five hundred schizophrenic patients were randomizedinto six arms, placebo, risperidone of 2, 6,, 10 and 16mg and haloperidol of 20mg. Forillustrative purpose, we focus on two particular groups: risperidone of 6mg (
x = 1)and haloperidol of 20mg (
x = 0). The main objective is to examine whether the new
CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA
treatment in risperidone leads to further and faster decline in PANSS, a measure ofseverity in schizophrenia, compared to the then standard treatment in haloperidol. Table
Table 6.3: Regression estimates and standard errors (in parenthesis) fromthree modelling approaches: a pre- post trial on schizophrenia.
Random effects model
6.3 presents results form three modelling approaches. For marginal models, we consideredthe use of independence and exchangeable (also known as compound symmetry) matricesas the working covariance for the
Yi's. For random effects models, we considered (1)random intercepts only which leads to the compound symmetry covariance matrix in
Y and (2) random intercepts and slopes. In the latter, this accounts for variations inPANSS changes over time among patients. For observation driven models, we consideredthe following AR-1 model
Yij =
xtijβ +
α(
yi,j−1
− xti,j−1
β) +
ij, j = 1
, . . , ni,
where the
ij ∼ N (0
, φ). With the exception of the GEE approach using the inde-pendence matrix as the working matrix, conclusions from all three approaches are verysimilar qualitatively; see comments in
§5.4. Specifically, the averaged rate of decline inPANSS for the haloperidol groups is estimated as 1.16 per week; whereas the averagedrate of decline is estimated as 2
.35(= 1
.16 + 1
.19) per week for the risperidone group.
Such a difference is statistically significant at the 0.05 level (
−1
.192
/.482 =
−2
.47, p-value
< 0
.01). More importantly, this difference appears to be meaningful clinically aswell. Considering two patients of the same PANSS scores before receiving treatments,this result suggests that taking the new treatment would lead to 9
.5(= 1
.19
× 8) unitgreater decline in PANSS in the eighth week, a substantial improvement from the clinicalpoint-of-view.
6.4. PRE- POST TRIAL ON SCHIZOPHRENIA REVISITED
Figure 6.2: Mean responses in PANSS by dropout cohort.
Finally, ignoring within-cluster association would lead to very different and presum-
ably misleading conclusions. Specifically, results from GEE assuming the repeated mea-sures from the same patient were statistically independent of each other suggest thatthere is no difference in averaged rate of decline in PANSS between the two groups(
−0
.117
± 0
.545). An explanation for such a striking discrepancy is that this latterapproach is vulnerable to severe and non-trivial missing data phenomenon which is ex-hibited in this example. For example, the risperidone group experienced 36% dropoutrate in the eight week follow-up period; whereas the haloperidol group fared no betterwith a 53% dropout rate. Furthermore, Figure 6.2 strongly suggests that the chance ofdropping out at a given visit depends on the prior history of PANSS scores. For an exam-ple, patients dropped out at weeks 2, 4 and 6 experienced, on average, a sudden increasein PANSS prior to the scheduled visits. This pattern of missing mechanism, known asmissing at random (Little and Rubin, 1987), suggests that ignoring within-cluster associ-ation would, in general, lead to incorrect inference for questions of interest, the differencein rates of change in PANSS between groups in this case. For more detailed treatmentson accounting for non-trivial missing data patterns, see Diggle (1998) and Scharfstein,Rotnitzky and Robbins (1999) and references therein.
CHAPTER 6. STATISTICAL INFERENCE FOR CORRELATED DATA
[1] Altham, P.M.E. (1978). Two generalizations of the binomial distribution.
Applied
Statistics,
27, 161–167.
[2] Andersen, E.B. (1970). Asymptotic properties of conditional maximum likelihood
estimators.
Journal of the Royal Statistical Society, Series B,
32, 283–301.
[3] Bahadur, R.R. (1961). A representation of the joint distribution of responses in
n
dichotomous items. In
Studies on Item Analysis and Prediction, Ed. H. Solomon,pp.158–168. Stanford Mathematical Studies in the Social Sciences VI, Stanford Uni-versity Press, Stanford, California.
[4] Beaty, T.H., Skjaerven, R., Breazeale, D.R. and Liang, K.-Y. (1997). Analyzing
sibships correlations in birth weight using large sibships from Norway.
Genetic Epi-demiology,
14, 423–433.
[5] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems.
Journal of the Royal Statistical Society, Series B,
36, 192–236.
[6] Bishop,Y.M.M., Fienberg, S.E. and Holland, P.W. (1975).
Discrete Multivariate
Analysis: Theory and Practice. MIT Press, Cambridge, Massachussetts.
[7] Breslow, N.E. (1981). Odds ratio estimators when the data are sparse.
Biometrika,
68, 73-84.
[8] Breslow, N.E. and Clayton, D.G. (1993). Approximate inference in generalized linear
mixed models.
Journal of the American Statistical Association,
88, 125–34.
[9] Breslow, N.E. and Day, N.E. (1980).
Statistical Methods in Cancer Research, Volume
I. IARC Scientific Publications No. 32. Lyon.
[10] Breslow, N.E. and Lin, X. (1995). Bias correction in generalized linear mixed models
with a single component of dispersion.
Biometrika,
82, 81-91.
[11] Breslow, N.E. and McCann, B. (1971). Statistical estimation of prognosis of children
with neuroblastoma.
Cancer Research,
31, 2098–2103.
[12] Chen, J.J., Kodell, R.L., Howe, R.B. and Gaylor, D.W. (1991). Analysis of trinomial
responses for reproductive and developmental toxicity experiments.
Biometrics,
47,1049–1058.
[13] Cohen, B.H., Bell, W.C., Jr., Brashears, S., Diamond, E.L., Kreiss, P. et al. (1977).
Risk factors in chronic obstructive pulmonary disease (COPD).
American Journalof Epidemiology,
105, 223–232.
[14] Connolly, M., Liang K.-Y. (1988). Conditional logistic regression models for corre-
lated binary data.
Biometrika,
75, 501-506.
[15] Cox, D.R. (1972). Regression models and life tables (with discussion).
Journal of
the Royal Statistical Society Series B,
74, 187-200.
[16] Cox, D. R. and Hinkley, D. V. (1974).
Theoretical Statistics, Chapman and Hall,
London, England.
[17] Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate condi-
tional inference, (with discussion).
Journal of the Royal Statistical Society, Series B,
49, 1-39.
[18] Crowder, M.J. (1987). On linear and quadratic estimating functions.
Biometrika,
74, 591-597.
[19] Crowder, M.J. (1989). Comments on Godambe, V.P. and Thompson, M.E. An ex-
tension of quasi-likelihood estimation.
Journal of Statistical Planning and Inference,
22, 137-152.
[20] Diggle, P.J. (1988). An approach to the analysis of repeated measures.
Biometrics,
[21] Diggle, P.J. (1998). Dealing with missing values in longitudinal studies. In:
Recent
Advances in the Statistical Analysis of Medical Data, editors B.S. Everitt and G.
Dunn, 203–208. Arnold, London, England.
[22] Diggle, P.J., Liang, K.-Y. and Zeger, S.L. (1994).
Analysis of Longitudinal Data.
Oxford University Press, Oxford, England.
[23] Durbin,J. (1960). Estimation of parameters in time-series regression models.
Biometrika,
47, 139-153.
[24] Ferguson, H. Reid, N. and Cox, D.R. (1991). Estimating equations from modified
profile likelihood, in
Estimating Functions, Godambe, V.P., editor, Oxford Univer-sity Press, Oxford, England.
[25] Finney, D.J. (1964).
Statistical Methods in Biological Assay, 2nd edition. Griffin,
London, England.
[26] Firth, D. (1987). On the efficiency of quasi-likelihood estimation.
Biometrika,
74,
[27] Fitzmaurice, G.M. and Laird, N.M. (1993). Regression models for discrete longitu-
dinal responses.
Statistical Science,
8, 284-304.
[28] Godambe, V.P. (1960). An optimum property of regular maximum likelihood esti-
mation.
Annals of Mathematical Statistics,
31, 1208–12.
[29] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimat-
ing equations.
Biometrika,
63, 277-284.
[30] Goodman, L.A. and Kruskal, W.H. (1979).
Measures of Association for Cross-
Classifications. Springer-Verlag: New York.
[31] Gourieroux, C., Monfort, A., and Trognon, A. (1984). Psuedo-maximum likelihood
methods: theory.
Econometrica,
52, 681–700.
[32] Grizzle, J.E., Starmer, C.F. and Koch, G.G. (1969). Analysis of categorical data by
linear models.
Biometrics,
25, 4898–504.
[33] Haseman, J.K. and Kupper, L.L. (1979). Analysis of dichtomous response data from
certain toxicological experiments.
Biometrics,
35, 281–293.
[34] Hauck, W.W. and Donner, A. (1977). Wald's test as applied to hypotheses in logist
analysis.
Journal of the American Statistical Association,
72, 851–853.
[35] Heagerty, P. and Zeger, S.L. (1998). Lorelogram: a regression approach to exploring
dependence in longitudinal categorical responses.
Journal of the American StatisticalAssociation,
93, 150–163.
[36] Holmes, M.C. and Williams, R.E.O. (1954). The distribution of carriers of
Stepto-
coccus pyogenes among 2413 healthy children.
Journal of Hygiene Comb 52, 165-179.
[37] Jones, B. and Kenward, M.G. (1987). Modelling binary data from a three-point
cross-over trial.
Statistics in Medicine,
6, 555–64.
[38] J
φrgensen, B. (1987). Exponential dispersion models (with discussion).
Journal of
the Royal Statistical Society, Series B,
49, 127–162.
[39] Kalbfleisch, J.D. and Sprott, D.A. (1970). Application of likelihood methods to
models involving a large number of nuisance parameters (with discussion).
Journalof the Royal Statistical Society, Series B,
32, 175-208.
[40] Kupper, L.L. and Haseman, J.K. (1978). The use of a correlated binomial model for
the analysis of data from certain toxicological experiments.
Biometrics,
34, 69–76.
[41] Laird, N.M. and Ware, J.H. (1982). Random-effects models for longitudinal data.
Biometrics,
38, 963–74.
[42] Lee, E. (1992).
Statistical Methods for Survival Data Analysis, 2nd edition. John
Wiley and Sons, New York, New York.
[43] Lehmann, E.L. (1959).
Testing Statitical Hypothesis. John Wiley and Sons, New
York, New York.
[44] Liang, K.-Y. (1987). Estimating functions and approximate conditional likelihood.
Biometrika,
74, 695-702.
[45] Liang, K.-Y. and Beaty, T.H. (1991). Measuring familial aggregation by using odds-
ratio regression models.
Genetic Epidemiology,
8, 361–370.
[46] Liang, K.-Y. and Hanfelt, J. (1994). On he use of the quasi–likelihood method in
the teratological experiments.
Biometrics,
50, 872-880.
[47] Liang, K.-Y. and Tsou, D. (1992). Empirical Bayes and conditional inference with
many nuisance parameters.
Biometrika,
79, 261-270.
[48] Liang, K.-Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized
linear models.
Biometrika,
73, 13–22.
[49] Liang, K.-Y. and Zeger, S.L. (1989). A class of logistic regression models for mul-
tivariate binary times series.
Journal of the American Statistical Association,
84,447-451.
[50] Liang, K.-Y., Zeger, S.L. (1993). Regression analysis for correlated data.
Annual
Review of Public Health,
14, 43-68.
[51] Liang, K.-Y. and Zeger, S.L. (1995). Inference based on estimating functions in the
presence of nuisance parameters (with discussion).
Statistical Science,
10, 158-172.
[52] Liang, K.-Y., Zeger, S.L., and Qaqish, B. (1992). Multivariate regression analyses
for categorical data (with Discussion).
Journal of the Royal Statistical Society, SeriesB,
54, 3–40.
[53] Lin, X. and Breslow, N.E. (1996). Bias correction in generalized linear mixed models
with multiple components of dispersion.
Journal of the American Statistical Asso-ciation,
91, 1007–16.
[54] Lindsay, B. (1982). Conditional score functions: some optimality results.
Biometrika,
69, 503-512.
[55] Little, R.J.A. and Rubin, D.B. (1987).
Statistical Analysis with Missing Data. John
Wiley, New York.
[56] Mantel, N. and Haenszel, W. (1959). Statistican aspects of the analysis of data
from retrospective studies of disease.
Journal of the National Cancer Institute,
22,719-748.
[57] Mason, W.B. and Fienberg, S.E. (eds) (1985).
Cohort Analysis in Social Research:
Beyond the Identification Problem. Springer-Verlag, New York.
[58] Maxwell, A.E. (1961).
Analysing Qualitative Data. Methuen, London.
[59] McCullagh, P. (1980). Regression models for ordinal data (with discussion).
Journal
of the Royal Statistical Society, Series B,
42, 109–42.
[60] McCullagh, P. (1983). Quasi-likelihood functions.
Annals of Statistics,
11, 59-67.
[61] McCullagh, P. and Nelder, J.A. (1989).
Generalized Linear Models. Chapman and
Hall, New York.
[62] Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalised linear models.
Journal
of the Royal Statistical Society, Series A 135, 370-384.
[63] Neyman, J. and Scott, E.L. (1948). Consistent estimates based on partially consis-
tent observations.
Econometrica,
16, 1-32.
[64] Patterson, H.D. and Thompson, R. (1971). Recovery of inter-block information when
block sizes are unequal.
Biometrika,
58, 545–54.
[65] Prentice, R.L. and Zhao, L.P. (1991). Estimating equations for parameters in means
and covariances of multivariate discrete and continuous responses.
Biometrics,
47,825–39.
[66] Qaqish, B. and Liang, K.-Y. (1992). Marginal models for correlated binary data
with multiple classes and multiple levels of nesting.
Biometrics,
48, 939–950.
[67] Rao, C.R. (1973).
Linear Statistical Inference and Its Applications, 2nd edition.
John Wiley, New York.
[68] Reid, N. (1995). The roles of conditioning on inference (with discussion).
Statistical
Science,
10, 138-199.
[69] Robbins, H. (1956). An empirical Bayes approach to statistic.
Proceedings of the
Third Berkeley Symposium,
1, 157–163.
[70] Rosner, B. (1989). Multivariate methods for clustered binary data with more than
one level of nesting.
Journal of the American Statistical Association,
84, 373–380.
[71] Royall, R.M. (1986). Model robust inference using maximum likelihood estimators.
International Statistical Review,
54, 221–26.
[72] Royall, R.M. (1997).
Statistical Evidence: a Likelihood Paradigm. Chapman and
Hall, London, England.
[73] Scharfstein, D.O., Rotnitzky, A. and Robbins, J.M. (1999). Adjusting for nonignor-
able drop-out using semiparametric nonresponse models (with discussion).
Journalof the American Statistical Association,
94, 1096–1146.
[74] Skellam, J.G. (1948). A probability distribution derived from the binomial distribu-
tion by regarding the probability of success as variable between the sets of trials.
Journal of the Royal Statistical Society, Series B,
10, 257–61.
[75] Stigler, S.M. (1986).
The History of Statistics. Belknap Press, Cambridge, Mass.
[76] Stiratelli, R., Laird, N. and Ware, J.H. (1984). Random effects models for serial
observations with binary responses.
Biometrics,
40, 961–71.
[77] Tielsch, J.M., Sommer, A., Witt, K., Katz, J. and Royall, R.M. (1990). Blindness
and visual impairment in an American urban population: Baltimore eye survey.
Archives of Ophthalmology,
108, 286-290.
[78] Tweedie, M.C.K. (1947). Functions of a statistical variate with given means, with
special reference to Laplacian distributions.
Proceedings of the Cambridge Phil.
Society,
49, 41–49.
[79] Waterman, R.P. and Lindsay, B.G. (1996). A simple and accurate method for ap-
proximate conditional inference applied to exponential family models.
Journal of theRoyal Statistical Society, Series B,
58, 177–188.
[80] Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models
and the Gaussian method.
Biometrika,
61, 439–47.
[81] Weil, C.S. (1970). Selection of the valid number of sampling units and a consideration
of their combination in toxicological studies involving reproduction, teratogenesis orcarcinogenesis.
Food and Cosmetics Toxicology,
8, 177–182.
[82] Weinberg, C.R., Wilcox, A.J. and Lie, R.T. (1998). A log-linear approach to case-
parent-triad data: assessing effects of disease genes that act either directly or thoughmaternal effects and that may be subject to parental imprinting.
American Journalof Human Genetics,
62,, 969–978.
[83] White, H. (1982). Maximum likelihood estimation of misspecified models.
Econo-
metrics,
50, 1–25.
[84] Williams, D.A. (1988). Reader reaction: estimation bias using the beta-binomial
distribution in teratology.
Biometrics,
44, 305–309.
[85] Zeger, S.L. and Karim, M.R. (1991). Generalized linear models with random effects:
a Gibbs sampling approach.
Journal of the American Statistical Association,
86,79–86.
[86] Zeger, S.L., Liang, K.-Y. and Albert, P. (1988). Models for longitudinal data: a
generalized estimating equation approach.
Biometrics,
44, 1049–1060.
[87] Zhao, L.P. and Prentice, R.L. (1990). Correlated binary regression using a general-
ized quadratic model.
Biometrika,
77, 642–48.
[88] Zhao, L.P, Prentice, R.L. and Self, S.G. (1992). Multivariate mean parameter esti-
mation by using a partly exponential model.
Journal of the Royal Statistical Society,Series B, 54, 805–812.
beta-binomial distribution, 22, 40, 41,
likelihood function, 22
likelihood ratio, 22
binary response, 2
likelihood-based inference, 21
biologic plausibility, 43
litter effect, 22local optimality, 33
canonical link, 75
logistic regression model, 51
canonical parameter, 75
longitudinal study, 18
categorical response, 49cluster, 12
marginal model, 61, 62, 69
cluster-specific, 76
maximum likelihood estimator, 22
conditional score function, 32
method of moments, 31
contingency table, 56continuation ratio model, 55
nested design, 13
continuous response, 2
Neyman-Scott problem, 22
correlated data, 34
nuisance parameter, 21
correlation coefficient, 15
observation driven model, 61, 65, 77
odds ratio, 19ordinal response, 50
empirical Bayes, 76
orthogonality, 32, 33
empirical logit, 38, 39estimating function, 24
partly exponential family, 71
exponential dispersion family, 26
polytomous logistic regression model, 53polytomous response, 49
population averaged, 63
Fisher information matrix, 25
proportional odds model, 52
generalized estimating equation, 35
quadratic exponential family, 72
generalized linear model, 26
quasi-likelihood, 28, 30, 44, 69
log-linear model, 27
quasi-score function, 30, 31, 33, 34
logistic regression model, 27, 39multiple linear regression, 27
random effects model, 61, 63, 73
Poisson regression model, 27
random mechanism, 30
polytomous response, 51
regression analysis, 2
random component, 26
restricted maximum likelihood, 76
systematic component, 27
goodness-of-fit, 30
score function, 25
Hammersley-Clifford Theorem, 45
teratological experiment, 22, 40, 53, 54
unbiased estimating function, 31
variance specification, 30, 31, 44
weighted least squares, 28within-cluster dependence, 12
Source: http://www.stat.nthu.edu.tw/EngWeb/Teachers/kyliang930225.pdf
Annu. Rev. Genet. 2002. 36:153–73 ° 2002 by Annual Reviews. All rights reserved STUDYING GENE FUNCTION IN EUKARYOTES BYCONDITIONAL GENE INACTIVATION Manfred Gossen1 and Hermann Bujard2 1Max Delbr¨uck Centrum, Robert-R¨ossle-Strasse 10, D-13125 Berlin, Germany;e-mail: [email protected]; 2ZMBH, Universit¨at Heidelberg, Im NeuenheimerFeld 282, D-69120 Heidelberg, Germany; e-mail: [email protected]
Clinical Review & Education 2014 Evidence-Based Guideline for the Managementof High Blood Pressure in AdultsReport From the Panel Members Appointedto the Eighth Joint National Committee (JNC 8) Paul A. James, MD; Suzanne Oparil, MD; Barry L. Carter, PharmD; William C. Cushman, MD;Cheryl Dennison-Himmelfarb, RN, ANP, PhD; Joel Handler, MD; Daniel T. Lackland, DrPH;Michael L. LeFevre, MD, MSPH; Thomas D. MacKenzie, MD, MSPH; Olugbenga Ogedegbe, MD, MPH, MS;Sidney C. Smith Jr, MD; Laura P. Svetkey, MD, MHS; Sandra J. Taler, MD; Raymond R. Townsend, MD;Jackson T. Wright Jr, MD, PhD; Andrew S. Narva, MD; Eduardo Ortiz, MD, MPH