library(arm)
library(Sleuth3)
library(tidyverse)
library(vcdExtra)
library(magrittr)
5 Log-lin-Reg
After successful completion of this module, you will be able to:
- Write out the expression of a Poisson regression model, including the response distribution and the model equation.
- Interpret the results of a Poisson regression.
- Identify potential problems with the fit of a Poisson regression model.
- Describe models for over dispersion of Poisson counts in the regression setting.
- Evaluate and compare models in the Poisson regression setting.
- Describe how log linear models can be used for general contingency table data.
In R:
- Fit Poisson, quasi-Poisson and negative binomial regression models, and make comparisons among them.
- Fit log linear models to general contingency table data.
- Use simulations to understand problems that can arise when the sample size is small and/or the counts are small.
- Prepare and submit an R markdown script.
Task list
In order to achieve these learning outcomes, please make sure to complete the following:
Review Module 5 Readings and LecturesSubmit Module 5 LabSubmit Module 5 HomeworkParticipate in the Module 5 Discussion- R Quiz
- Content Quiz
w5 Lectures
Book is better than lectures!
Code
#
# Example using Agresti's crab data
#
library(ggplot2)
#install.packages("glm2")
library(glm2)
help(crabs, package = "glm2")
data(crabs, package = "glm2")
#
# Take a look at the number of satellites (males) vs Width
#
= crabs
work ggplot(work,aes(Width,Satellites)) + geom_point()
#
# There are some zeroes, let's adjust for log transformation
# and re-plot
#
$lSat = log(ifelse(work$Satellites==0,0.1,work$Satellites))
workggplot(work,aes(Width,lSat)) + geom_point() + ylab("log(Satellites)")
ggplot(work,aes(Dark,lSat)) + geom_point() + ylab("log(Satellites)")
ggplot(work,aes(GoodSpine,lSat)) + geom_point() + ylab("log(Satellites)")
ggplot(work,aes(Width,lSat)) + geom_point() + facet_wrap(~Dark+GoodSpine)+ ylab("log(Satellites)")
#
# Fit a rich model with all interactions
#
= glm(Satellites~Width*Dark*GoodSpine,data=work,family=poisson)
mod1 summary(mod1)
#
# Look at residuals and fitted values
#
$res = residuals(mod1)
work$fits = predict.glm(mod1,type="response")
work
ggplot(work,aes(Width,res)) + geom_point()
ggplot(work,aes(Satellites,fits)) + geom_point()
Code
#
# Fit and compare quasi-Poisson and negative binomial models to the Crab
# data
#
library(ggplot2)
library(MASS) # access the glm.nb function
library(glm2) # access the crab data
data(crabs, package = "glm2")
= crabs
work
# fit the Poisson, quasi-Poisson and negative binomial models
= glm(Satellites~Width*Dark*GoodSpine,data=work,family=poisson)
mod1 = glm(Satellites~Width*Dark*GoodSpine,data=work,family=quasipoisson)
mod2 = glm.nb(Satellites~Width*Dark*GoodSpine,data=work)
mod3
# compare coefficient estimates
cbind(coef(mod1),coef(mod2),coef(mod3))
# compare output from the two models that account for over dispersion
summary(mod2)
summary(mod3)
# look at some residual plots
$res2 = resid(mod2)
work$res3 = resid(mod3)
work
ggplot(work,aes(Width,res2)) + geom_point()
ggplot(work,aes(Width,res3)) + geom_point()
# conclusion: favor the negative binomial approach over the quasi-poisson
# approach (likelihood-based, we get AIC)
# more to come: almost 36% of the satellite counts are zero...another
# modeling approach is to use a zero-inflated model...module 6
Code
#
# Example using Sleuth exercise 22.26
#
library(Sleuth3)
library(MASS)
library(ggplot2)
help(ex2226)
= ex2226
work names(work)
# exploratory analysis
# are there zeroes?
summary(work$Moons)
$Moons1 = ifelse(work$Moons==0,0.1,work$Moons)
work
# log moons versus distance
ggplot(work,aes(Distance,log(Moons1))) + geom_point()
# log moons versus diameter
ggplot(work,aes(Diameter,log(Moons1))) + geom_point()
ggplot(work,aes(log(Diameter),log(Moons1))) + geom_point()
# log moons versus mass
ggplot(work,aes(Mass,log(Moons1))) + geom_point()
ggplot(work,aes(log(Mass),log(Moons1))) + geom_point()
# log diameter versus log mass
ggplot(work,aes(log(Diameter),log(Mass))) + geom_point()
cor(log(work$Diameter),log(work$Mass))
# fit a rich model--with n = 13, I won't include interaction
# terms
= glm(Moons~Distance+log(Diameter)+log(Mass),data=work,family=poisson)
mod1 summary(mod1)
# look at residuals, dispersion, goodness of fit
$res1 = residuals(mod1)
workggplot(work,aes(Distance,res1)) + geom_point()
ggplot(work,aes(Diameter,res1)) + geom_point()
ggplot(work,aes(Mass,res1)) + geom_point()
41.897/9
1-pchisq(41.897,9)
# there is evidence of over dispersion, fit a NB model:
= glm.nb(Moons~Distance+log(Diameter)+log(Mass),data=work)
mod2 summary(mod2)
# look at residuals again
$res2 = residuals(mod2)
workggplot(work,aes(Distance,res2)) + geom_point()
# It appears that the diameter of the planet and its mass are associated
# with the number of moons the planet has.
Log Linear Regression
Crabs by width and color.
All examples use poisson.
\[ Y-I = log(\lambda_i) = \beta_0 + ... + \beta_kX_{k-1i} \]
Link is log.
pdf = Lye-L / y!
non constant variance.
lambda may not be wnough to represent variation.
Multiplicative vs add error terms mean GLm is better than mlr. even if variance is constant. MLR assumes additivity.
Coef fit by ML.
Everything is similar to binomial counts.
Poisson counts look at y and exp vars.
If one is zero, DNE, add a num to create a plot.
Crab Example
Agresti ex.
173 crabs, data in glm2 library
width, dark and and good spawn are exp.
fitting a rich model.
Script above.
plotting sats by width shows lots of zeroes but no pattern. Lots of ladies with no men. Maybe a hint of inc relationship.
Adjusting for zero by adding .01 to zero only. then take log.
plotting sats by width again shows the zeroes by -2 and everything else pretty much the same.
Sats by dark and light shows maybe more by dark, and for better spine condition.
facet wrap by yes no for spine and dark doesn’t show much, but dark and spine bad is not populated much.
GLM for sats~width*dark*spine in poisson family with automatic log link.
dispersion is assumed one by default.
three way interacition is significant. but we can’t make inference without looking at disperison.
Resids and fits with predict by type response for fits. and residuals().
width by resids shows the line near zero and others for the fact that the data are discrete.
Resids are spread out and theres a lot above 2 and 3.
Plotting fitted by observed counts, we expect y = x. That is not the case. Lots near zero and spread near 1 - 5.
We want to fit for extra poisson variation. The plots make her think dispersion is at fault.
Model Fit
Talk about fit with poisson regression
deviance GOF and extra poisson variation via the crabs.
there are dev, pearson, and std-resids. dev are default in R.
If the model is correct and sample is large dev resids look normal.
dev GOF test is informal test of adequecy, not good.
compares reg model to sat model
a large p may mean good fit or inadequate data
small p, model is correst poisson is wrong, outliers.
As in log reg, dev gof stat is sum os squared resids. resid deviance.
In the poisson case chi 2 requires means to be large (most greater than 5). Counts give idea of means.
For crabs Resid dev is 550 on 165 df. 75% are less than or eq to 5. We shouldn’t use dev GOf.
extra poisson variation. like binom variation from more var in obseveed counts. Var in poisson via lambda.
If E-poisson var exists var too small and SE too small.
Checking, is it likely, is there clustering, missing exp variation.
We could compare to sample var and mean to groups to counts with identical exp vars . not often possible.
Dev GOF gives an idea., calc disp param, examine dev resids plots.
= resid/ df. here = 3.33. suggestind extra var.
Looking at dev resids by width shows lots above 2 and below -2 means there could be extra dispersion.
Over Dispersion in R
We can use quisi poisson lM like we did with binom by adding an extra param.
Neg binom regression maight work.
mass and glm2 libs.
mass for glm.nb for neg binom. mass before glm to get correct crab data.
fit poisson , quasi and neg.
glm family is poisson, and quisipoisson then nb is glm.nb w/o family.
coef for estimates. poisson and quasi are the same as before. SE does. nb does slightly change the ests.
Summary for quasi and nb. quasi shows 1 sig variable. disp is 3.25.
Remove the three way interaction for further analysis.
glm.nb shows shows different everything but the same significance. We do get and AIC for this for model comparison.
Plotting resid plots for these shows via resid(mod) to data$col.
quasi is the same as the first bc coefs are the same and resids are based on those, but the glm.nb shows a different scale and less above and below 2 or -2.
nb model does a good job for extra dispersion in this case. plus its likelihood based.
We see zero inflation and that is mod 6.
Data Analysis Strategy
dealing with poisson regression and looking at a large example.
be clear about objectives,
what element can answer
what is response and exp var. How do they relate(hyp)
if poisson count perform exp analysis. Look for zeroes in response. Plot log of responses agaisnst exp vars.
relative to the q of interest and samle size fit a rich model.
examing the resids, check dev GOF and look at psi.
refit with nb
perfrom reduction and anova or aic. but don’t use model selection bias.
look at resid plots again.
Ex vai slueth for planets,
what char of planets makes moons.
sleuth mass ggplot.
name dist dia mass moons are variables.
explore for zeroes, and add.01 for plotting log.
summary shows min and first q are zero, at least 25% are zero.
log moons by dist is zero inflated but maybe negatively related.
dist by dia is also zero inflated but maybe curved and increasing. dist by log dia is linear.
mad via log moons is curved, log mass is inc. both dia and mass are inc. mog mass and log dia is near y = x. Both may not be good in the model, but a rich model is what we are going for at first.
glm moons by .og dia log mass and dist.
psi is 21/9 is greater than 2 and maybe there’s dispersion.
plotting resids by exp variables:
- mass is zero and one large one, spread is good.
- dia is the same
- dist is more spread out but no pattern.
- all show zero inflation.
- dispersion is 2.4
X squared GOF is low.
Fitting a nb mod:
mass and dia are again sig.
resids:
- dist is normal, no pattern
it appears dia and mass are important to a planet having moons, but dia and mass are cov. check diff models with aic and anova.
Log Linear Models for cont tables.
example and notation.
Agresti table for weed and booze. alc yes, cigs, yes/no, weed yes/no.
question isn’t about variation in use from one to another but looking for association of booze, cigs, and weed.
Do smokers smoke it all and such.
sub-notation, I 1-2 for booze yes and no, j, 1-2 for cigs and k for weed. n_111 for weed booze and cig yes.
pi_ijk for prob by cat.
N = 2276 for total participants, mu_ijk = N x pi_ijk.
log(mu_ijk) = beta_0 + … + b_7X_iY_jZ_k in the parlance of the course.
expamine reg coef of interaction terms for associations.
if b_7 is zero and three ways have no chance then they are ind.
for the two ways its the same thought, we’ll consider three ways and two ways in lab.
Lab
HW
- (2 points) Recall the obesity problem from Homework 1. The data are as follows:
CVD Death
Yes No
Obese 16 2045
Not obese 7 1044
Using Poisson log linear regression, test for independence between obesity and CVD death outcome. (Hint: this is equivalent to testing that the interaction term in the model is zero.) How does your answer compare to a chi-square test on the same data?
<- expand.grid(
o ob = factor(c("Obese", "NotObese"), levels = c("Obese", "NotObese")),
death = factor(c("Yes", "No"), levels = c("Yes", "No"))
)$Freq <- c(16, 7, 2045, 1044)
o
<- xtabs(data = o, Freq~ob + death)
o_tab o_tab
death
ob Yes No
Obese 16 2045
NotObese 7 1044
<- glm(data = o,
m1 family = poisson,
~ob+death)
Freq<- summary(m1)
s1 s1
Call:
glm(formula = Freq ~ ob + death, family = poisson, data = o)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.7234 0.2089 13.04 <2e-16 ***
obNotObese -0.6734 0.0379 -17.77 <2e-16 ***
deathNo 4.9001 0.2093 23.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4376.49698 on 3 degrees of freedom
Residual deviance: 0.11741 on 1 degrees of freedom
AIC: 32.796
Number of Fisher Scoring iterations: 3
<- glm(data = o,
m1 family = poisson,
~ob+death)
Freq<- summary(m1)
s1 s1
Call:
glm(formula = Freq ~ ob + death, family = poisson, data = o)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.7234 0.2089 13.04 <2e-16 ***
obNotObese -0.6734 0.0379 -17.77 <2e-16 ***
deathNo 4.9001 0.2093 23.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4376.49698 on 3 degrees of freedom
Residual deviance: 0.11741 on 1 degrees of freedom
AIC: 32.796
Number of Fisher Scoring iterations: 3
The statsistical independence model for log lin reg:
\(\mu_{ij} = \mu \alpha \beta_j\)
Obese , not obese are one variable. Death yes or no are another.
\[ \mu_{ij} = n \pi_i + \pi_{+j}, \ i = 1 \ to \ I, \ j = 1 \ to \ J \]
Subscript +j means column j total.
Chi squared
<- sum(o$Freq)
n
<- 16
n11 <- 2045
n12 <- 7
n21 <- 1044
n22
<- n11 + n21
j1 <- n12 + n22
j2 <- n11 + n12
i1 <- n21 + n22
i2
<- i1*j1/n
mu_11 <- i1*j2/n
mu_12 <- i2*j1/n
mu_21 <- i2*j2/n
mu_22
<- (((n11 - mu_11)^2)/mu_11) + (((n12 - mu_12)^2)/mu_12) + (((n21 - mu_21)^2)/mu_21) + (((n22 - mu_22)^2)/mu_22)
Xsquared
pchisq(Xsquared, df = 1, lower.tail = F)
[1] 0.7340665
<- log(mu_11) + log(mu_22) - log(mu_12) - log(mu_21)
ans exp(ans)
[1] 1
*j1/n i1
[1] 15.23233
Quizzes
R
All questions rely on the data in the quine dataset in the MASS library. The dataset contains information about children from Walgett, New South Wales, Australia. They were classified by Ethnicity (Eth), Age, Sex and Learner (Lrn) status and the number of days (Days) absent from school in a particular school year was recorded.
library(MASS)
data("quine")
<- quine df
Question 1
All of the explanatory variables, Eth, Age, Sex, and Lrn are categorical. What does this mean for the analysis we should perform? (Select one) Group of answer choices
We can use log linear regression using combinations of these explanatory variables.
Question 2
Fit a model that contains all of the two-way interactions between the explanatory variables. You should notice that the effect corresponding to AgeF3:LrnSL cannot be estimated. Why is this? (Select one) Group of answer choices
glm(data = df,
family = poisson,
~(Eth + Age + Sex + Lrn)^2) |> summary() Days
Call:
glm(formula = Days ~ (Eth + Age + Sex + Lrn)^2, family = poisson,
data = df)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.93246 0.09826 29.843 < 2e-16 ***
EthN -0.17399 0.12134 -1.434 0.1516
AgeF1 -0.04270 0.12691 -0.336 0.7365
AgeF2 -0.08632 0.16164 -0.534 0.5933
AgeF3 -0.15290 0.11898 -1.285 0.1987
SexM -0.71452 0.12229 -5.843 5.14e-09 ***
LrnSL 0.21608 0.14558 1.484 0.1377
EthN:AgeF1 -0.92889 0.14657 -6.337 2.34e-10 ***
EthN:AgeF2 -1.33398 0.13504 -9.879 < 2e-16 ***
EthN:AgeF3 -0.11242 0.13478 -0.834 0.4042
EthN:SexM 0.43902 0.09208 4.768 1.86e-06 ***
EthN:LrnSL 0.26415 0.11378 2.322 0.0203 *
AgeF1:SexM -0.05565 0.16303 -0.341 0.7328
AgeF2:SexM 1.09942 0.15281 7.195 6.26e-13 ***
AgeF3:SexM 1.15949 0.13859 8.366 < 2e-16 ***
AgeF1:LrnSL -0.13019 0.15688 -0.830 0.4066
AgeF2:LrnSL 0.37340 0.14563 2.564 0.0103 *
AgeF3:LrnSL NA NA NA NA
SexM:LrnSL 0.04143 0.13718 0.302 0.7627
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2073.5 on 145 degrees of freedom
Residual deviance: 1368.7 on 128 degrees of freedom
AIC: 1993.1
Number of Fisher Scoring iterations: 5
|> filter(Age == "F3" & Lrn == "SL") df
Eth | Sex | Age | Lrn | Days |
---|
No students.
Question 3
From the summary information you obtained in question 2, there is evidence of over dispersion. What is this evidence? (Select one) Group of answer choices
1368.7/128
[1] 10.69297
The residual deviance is a lot bigger than the residual degrees of freedom.
Question 4
Fit the same model you fit in question 2, except use the glm.nb function. Using this model fit, what is the expected number of days absent for a male of Aboriginal ethnicity in the F2 age group who is an average learner? (Select one) Group of answer choices
glm.nb(data = df,
~(Eth + Age + Sex + Lrn)^2) |> summary() Days
Call:
glm.nb(formula = Days ~ (Eth + Age + Sex + Lrn)^2, data = df,
init.theta = 1.60364105, link = log)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.00155 0.33709 8.904 < 2e-16 ***
EthN -0.24591 0.39135 -0.628 0.52977
AgeF1 -0.02546 0.41615 -0.061 0.95121
AgeF2 -0.54884 0.54393 -1.009 0.31296
AgeF3 -0.25735 0.40558 -0.635 0.52574
SexM -0.77181 0.38021 -2.030 0.04236 *
LrnSL 0.38919 0.48421 0.804 0.42153
EthN:AgeF1 -0.70000 0.43646 -1.604 0.10876
EthN:AgeF2 -1.23283 0.42962 -2.870 0.00411 **
EthN:AgeF3 0.04721 0.44883 0.105 0.91622
EthN:SexM 0.36240 0.29430 1.231 0.21818
EthN:LrnSL 0.06847 0.34040 0.201 0.84059
AgeF1:SexM 0.02257 0.47360 0.048 0.96198
AgeF2:SexM 1.55330 0.51325 3.026 0.00247 **
AgeF3:SexM 1.25227 0.45539 2.750 0.00596 **
AgeF1:LrnSL -0.43101 0.47948 -0.899 0.36870
AgeF2:LrnSL 0.52074 0.48567 1.072 0.28363
AgeF3:LrnSL NA NA NA NA
SexM:LrnSL 0.07187 0.40805 0.176 0.86019
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1.6036) family taken to be 1)
Null deviance: 235.23 on 145 degrees of freedom
Residual deviance: 167.53 on 128 degrees of freedom
AIC: 1100.5
Number of Fisher Scoring iterations: 1
Theta: 1.604
Std. Err.: 0.214
2 x log-likelihood: -1062.546
<- 3.00155
a # a is Aborig, female, F0, AL
# a + Aborig + male + F2 + AL + interactions.
# d is a + F2 + Male + F2:Male
<- 3.00155 - 0.5488 - 0.77181 + 1.55330
d *d a
[1] 9.707733
exp(d)
[1] 25.38707
C
Question 1
Consider the deviance goodness-of-fit test. Under what conditions is it valid for Poisson regression? (i.e., when is the Poisson distribution well-approximated by a Normal distribution?)
When all of the Poisson counts are large (at least 5)
Question 2
Consider the deviance goodness-of-fit test. When it is valid, what possibility is suggested by a small p-value? Group of answer choices
We are using the wrong model for λ
Question 3
Consider the deviance goodness-of-fit test. When it is valid, what possibility is suggested by a large p-value? Group of answer choices
We have chosen the correct model for the data.
Question 4
Which of the following is the best approach to assess the fit of a Poisson Regression Model? Group of answer choices
Look at the residuals and examine the deviance goodness of fit test.