5 Log-lin-Reg

After successful completion of this module, you will be able to:

  • Write out the expression of a Poisson regression model, including the response distribution and the model equation.
  • Interpret the results of a Poisson regression.
  • Identify potential problems with the fit of a Poisson regression model.
  • Describe models for over dispersion of Poisson counts in the regression setting.
  • Evaluate and compare models in the Poisson regression setting.
  • Describe how log linear models can be used for general contingency table data.

In R:

  • Fit Poisson, quasi-Poisson and negative binomial regression models, and make comparisons among them.
  • Fit log linear models to general contingency table data.
  • Use simulations to understand problems that can arise when the sample size is small and/or the counts are small.
  • Prepare and submit an R markdown script.

Task list

In order to achieve these learning outcomes, please make sure to complete the following:

  • Review Module 5 Readings and Lectures
  • Submit Module 5 Lab
  • Submit Module 5 Homework
  • Participate in the Module 5 Discussion
  • R Quiz
  • Content Quiz
library(arm)
library(Sleuth3)
library(tidyverse)
library(vcdExtra)
library(magrittr)

w5 Lectures

Book is better than lectures!

Code
#
# Example using Agresti's crab data
#

library(ggplot2)
#install.packages("glm2")
library(glm2)
help(crabs, package = "glm2")
data(crabs, package = "glm2")

#
# Take a look at the number of satellites (males) vs Width
#

work = crabs
ggplot(work,aes(Width,Satellites)) + geom_point()

#
# There are some zeroes, let's adjust for log transformation
# and re-plot
#

work$lSat = log(ifelse(work$Satellites==0,0.1,work$Satellites))
ggplot(work,aes(Width,lSat)) + geom_point() + ylab("log(Satellites)")

ggplot(work,aes(Dark,lSat)) + geom_point() + ylab("log(Satellites)")
ggplot(work,aes(GoodSpine,lSat)) + geom_point() + ylab("log(Satellites)")
ggplot(work,aes(Width,lSat)) + geom_point()  + facet_wrap(~Dark+GoodSpine)+ ylab("log(Satellites)")


#
# Fit a rich model with all interactions
#

mod1 = glm(Satellites~Width*Dark*GoodSpine,data=work,family=poisson)
summary(mod1)

#
# Look at residuals and fitted values
#

work$res = residuals(mod1)
work$fits = predict.glm(mod1,type="response")

ggplot(work,aes(Width,res)) + geom_point()
ggplot(work,aes(Satellites,fits)) + geom_point()
Code
#
# Fit and compare quasi-Poisson and negative binomial models to the Crab
# data
# 

library(ggplot2)
library(MASS)     # access the glm.nb function
library(glm2)     # access the crab data

data(crabs, package = "glm2")

work = crabs

# fit the Poisson, quasi-Poisson and negative binomial models

mod1 = glm(Satellites~Width*Dark*GoodSpine,data=work,family=poisson)
mod2 = glm(Satellites~Width*Dark*GoodSpine,data=work,family=quasipoisson)
mod3 = glm.nb(Satellites~Width*Dark*GoodSpine,data=work)

# compare coefficient estimates

cbind(coef(mod1),coef(mod2),coef(mod3))

# compare output from the two models that account for over dispersion

summary(mod2)
summary(mod3)

# look at some residual plots

work$res2 = resid(mod2)
work$res3 = resid(mod3)

ggplot(work,aes(Width,res2)) + geom_point()
ggplot(work,aes(Width,res3)) + geom_point()

# conclusion: favor the negative binomial approach over the quasi-poisson
# approach (likelihood-based, we get AIC)

# more to come: almost 36% of the satellite counts are zero...another
# modeling approach is to use a zero-inflated model...module 6
Code
#
# Example using Sleuth exercise 22.26
#

library(Sleuth3)
library(MASS)
library(ggplot2)

help(ex2226)
work = ex2226
names(work)

# exploratory analysis

# are there zeroes?

summary(work$Moons)
work$Moons1 = ifelse(work$Moons==0,0.1,work$Moons)

# log moons versus distance

ggplot(work,aes(Distance,log(Moons1))) + geom_point()

# log moons versus diameter

ggplot(work,aes(Diameter,log(Moons1))) + geom_point()
ggplot(work,aes(log(Diameter),log(Moons1))) + geom_point()

# log moons versus mass

ggplot(work,aes(Mass,log(Moons1))) + geom_point()
ggplot(work,aes(log(Mass),log(Moons1))) + geom_point()

# log diameter versus log mass

ggplot(work,aes(log(Diameter),log(Mass))) + geom_point()

cor(log(work$Diameter),log(work$Mass))      

# fit a rich model--with n = 13, I won't include interaction
# terms 

mod1 = glm(Moons~Distance+log(Diameter)+log(Mass),data=work,family=poisson)
summary(mod1)

# look at residuals, dispersion, goodness of fit

work$res1 = residuals(mod1)
ggplot(work,aes(Distance,res1)) + geom_point()
ggplot(work,aes(Diameter,res1)) + geom_point()
ggplot(work,aes(Mass,res1)) + geom_point()
41.897/9
1-pchisq(41.897,9)

# there is evidence of over dispersion, fit a NB model:
 
mod2 = glm.nb(Moons~Distance+log(Diameter)+log(Mass),data=work)
summary(mod2)


# look at residuals again

work$res2 = residuals(mod2)
ggplot(work,aes(Distance,res2)) + geom_point()

# It appears that the diameter of the planet and its mass are associated
# with the number of moons the planet has.

Log Linear Regression

Crabs by width and color.

All examples use poisson.

\[ Y-I = log(\lambda_i) = \beta_0 + ... + \beta_kX_{k-1i} \]

Link is log.

pdf = Lye-L / y!

non constant variance.

lambda may not be wnough to represent variation.

Multiplicative vs add error terms mean GLm is better than mlr. even if variance is constant. MLR assumes additivity.

Coef fit by ML.

Everything is similar to binomial counts.

Poisson counts look at y and exp vars.

If one is zero, DNE, add a num to create a plot.

Crab Example

Agresti ex.

173 crabs, data in glm2 library

width, dark and and good spawn are exp.

fitting a rich model.

Script above.

plotting sats by width shows lots of zeroes but no pattern. Lots of ladies with no men. Maybe a hint of inc relationship.

Adjusting for zero by adding .01 to zero only. then take log.

plotting sats by width again shows the zeroes by -2 and everything else pretty much the same.

Sats by dark and light shows maybe more by dark, and for better spine condition.

facet wrap by yes no for spine and dark doesn’t show much, but dark and spine bad is not populated much.

GLM for sats~width*dark*spine in poisson family with automatic log link.

dispersion is assumed one by default.

three way interacition is significant. but we can’t make inference without looking at disperison.

Resids and fits with predict by type response for fits. and residuals().

width by resids shows the line near zero and others for the fact that the data are discrete.

Resids are spread out and theres a lot above 2 and 3.

Plotting fitted by observed counts, we expect y = x. That is not the case. Lots near zero and spread near 1 - 5.

We want to fit for extra poisson variation. The plots make her think dispersion is at fault.

Model Fit

Talk about fit with poisson regression

deviance GOF and extra poisson variation via the crabs.

there are dev, pearson, and std-resids. dev are default in R.

If the model is correct and sample is large dev resids look normal.

dev GOF test is informal test of adequecy, not good.

compares reg model to sat model

a large p may mean good fit or inadequate data

small p, model is correst poisson is wrong, outliers.

As in log reg, dev gof stat is sum os squared resids. resid deviance.

In the poisson case chi 2 requires means to be large (most greater than 5). Counts give idea of means.

For crabs Resid dev is 550 on 165 df. 75% are less than or eq to 5. We shouldn’t use dev GOf.

extra poisson variation. like binom variation from more var in obseveed counts. Var in poisson via lambda.

If E-poisson var exists var too small and SE too small.

Checking, is it likely, is there clustering, missing exp variation.

We could compare to sample var and mean to groups to counts with identical exp vars . not often possible.

Dev GOF gives an idea., calc disp param, examine dev resids plots.

= resid/ df. here = 3.33. suggestind extra var.

Looking at dev resids by width shows lots above 2 and below -2 means there could be extra dispersion.

Over Dispersion in R

We can use quisi poisson lM like we did with binom by adding an extra param.

Neg binom regression maight work.

mass and glm2 libs.

mass for glm.nb for neg binom. mass before glm to get correct crab data.

fit poisson , quasi and neg.

glm family is poisson, and quisipoisson then nb is glm.nb w/o family.

coef for estimates. poisson and quasi are the same as before. SE does. nb does slightly change the ests.

Summary for quasi and nb. quasi shows 1 sig variable. disp is 3.25.

Remove the three way interaction for further analysis.

glm.nb shows shows different everything but the same significance. We do get and AIC for this for model comparison.

Plotting resid plots for these shows via resid(mod) to data$col. 

quasi is the same as the first bc coefs are the same and resids are based on those, but the glm.nb shows a different scale and less above and below 2 or -2.

nb model does a good job for extra dispersion in this case. plus its likelihood based.

We see zero inflation and that is mod 6.

Data Analysis Strategy

dealing with poisson regression and looking at a large example.

be clear about objectives,

what element can answer

what is response and exp var. How do they relate(hyp)

if poisson count perform exp analysis. Look for zeroes in response. Plot log of responses agaisnst exp vars.

relative to the q of interest and samle size fit a rich model.

examing the resids, check dev GOF and look at psi.

refit with nb

perfrom reduction and anova or aic. but don’t use model selection bias.

look at resid plots again.

Ex vai slueth for planets,

what char of planets makes moons.

sleuth mass ggplot.

name dist dia mass moons are variables.

explore for zeroes, and add.01 for plotting log.

summary shows min and first q are zero, at least 25% are zero.

log moons by dist is zero inflated but maybe negatively related.

dist by dia is also zero inflated but maybe curved and increasing. dist by log dia is linear.

mad via log moons is curved, log mass is inc. both dia and mass are inc. mog mass and log dia is near y = x. Both may not be good in the model, but a rich model is what we are going for at first.

glm moons by .og dia log mass and dist.

psi is 21/9 is greater than 2 and maybe there’s dispersion.

plotting resids by exp variables:

  • mass is zero and one large one, spread is good.
  • dia is the same
  • dist is more spread out but no pattern.
  • all show zero inflation.
  • dispersion is 2.4

X squared GOF is low.

Fitting a nb mod:

mass and dia are again sig.

resids:

  • dist is normal, no pattern

it appears dia and mass are important to a planet having moons, but dia and mass are cov. check diff models with aic and anova.

Log Linear Models for cont tables.

example and notation.

Agresti table for weed and booze. alc yes, cigs, yes/no, weed yes/no.

question isn’t about variation in use from one to another but looking for association of booze, cigs, and weed.

Do smokers smoke it all and such.

sub-notation, I 1-2 for booze yes and no, j, 1-2 for cigs and k for weed. n_111 for weed booze and cig yes.

pi_ijk for prob by cat.

N = 2276 for total participants, mu_ijk = N x pi_ijk.

log(mu_ijk) = beta_0 + … + b_7X_iY_jZ_k in the parlance of the course.

expamine reg coef of interaction terms for associations.

if b_7 is zero and three ways have no chance then they are ind.

for the two ways its the same thought, we’ll consider three ways and two ways in lab.

Lab

HW

  1. (2 points) Recall the obesity problem from Homework 1. The data are as follows:
        CVD Death

            Yes         No
Obese       16        2045
Not obese   7         1044

Using Poisson log linear regression, test for independence between obesity and CVD death outcome. (Hint: this is equivalent to testing that the interaction term in the model is zero.) How does your answer compare to a chi-square test on the same data?

o <- expand.grid(
  ob = factor(c("Obese", "NotObese"), levels = c("Obese", "NotObese")),
  death = factor(c("Yes", "No"), levels = c("Yes", "No"))
)
o$Freq <- c(16, 7, 2045,  1044)

o_tab <- xtabs(data = o, Freq~ob + death)
o_tab
          death
ob          Yes   No
  Obese      16 2045
  NotObese    7 1044
m1 <- glm(data = o, 
    family = poisson, 
    Freq~ob+death)
s1 <- summary(m1)
s1

Call:
glm(formula = Freq ~ ob + death, family = poisson, data = o)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.7234     0.2089   13.04   <2e-16 ***
obNotObese   -0.6734     0.0379  -17.77   <2e-16 ***
deathNo       4.9001     0.2093   23.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4376.49698  on 3  degrees of freedom
Residual deviance:    0.11741  on 1  degrees of freedom
AIC: 32.796

Number of Fisher Scoring iterations: 3
m1 <- glm(data = o, 
    family = poisson, 
    Freq~ob+death)
s1 <- summary(m1)
s1

Call:
glm(formula = Freq ~ ob + death, family = poisson, data = o)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.7234     0.2089   13.04   <2e-16 ***
obNotObese   -0.6734     0.0379  -17.77   <2e-16 ***
deathNo       4.9001     0.2093   23.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4376.49698  on 3  degrees of freedom
Residual deviance:    0.11741  on 1  degrees of freedom
AIC: 32.796

Number of Fisher Scoring iterations: 3

The statsistical independence model for log lin reg:

\(\mu_{ij} = \mu \alpha \beta_j\)

Obese , not obese are one variable. Death yes or no are another.

\[ \mu_{ij} = n \pi_i + \pi_{+j}, \ i = 1 \ to \ I, \ j = 1 \ to \ J \]

Subscript +j means column j total.

Chi squared

n <- sum(o$Freq)

n11 <- 16
n12 <- 2045
n21 <- 7
n22 <- 1044

j1 <- n11 + n21
j2 <- n12 + n22
i1 <- n11 + n12
i2 <- n21 + n22

mu_11 <- i1*j1/n
mu_12 <- i1*j2/n
mu_21 <- i2*j1/n
mu_22 <- i2*j2/n

Xsquared <- (((n11 - mu_11)^2)/mu_11) + (((n12 - mu_12)^2)/mu_12) + (((n21 - mu_21)^2)/mu_21) + (((n22 - mu_22)^2)/mu_22)

pchisq(Xsquared, df = 1, lower.tail = F)
[1] 0.7340665
ans <- log(mu_11) + log(mu_22) - log(mu_12) - log(mu_21)
exp(ans)
[1] 1
i1*j1/n
[1] 15.23233

Quizzes

R

All questions rely on the data in the quine dataset in the MASS library. The dataset contains information about children from Walgett, New South Wales, Australia. They were classified by Ethnicity (Eth), Age, Sex and Learner (Lrn) status and the number of days (Days) absent from school in a particular school year was recorded.

library(MASS)
data("quine")
df <- quine

Question 1

All of the explanatory variables, Eth, Age, Sex, and Lrn are categorical. What does this mean for the analysis we should perform? (Select one) Group of answer choices

We can use log linear regression using combinations of these explanatory variables.

Question 2

Fit a model that contains all of the two-way interactions between the explanatory variables. You should notice that the effect corresponding to AgeF3:LrnSL cannot be estimated. Why is this? (Select one) Group of answer choices

glm(data = df, 
    family = poisson, 
    Days~(Eth + Age + Sex + Lrn)^2) |> summary()

Call:
glm(formula = Days ~ (Eth + Age + Sex + Lrn)^2, family = poisson, 
    data = df)

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.93246    0.09826  29.843  < 2e-16 ***
EthN        -0.17399    0.12134  -1.434   0.1516    
AgeF1       -0.04270    0.12691  -0.336   0.7365    
AgeF2       -0.08632    0.16164  -0.534   0.5933    
AgeF3       -0.15290    0.11898  -1.285   0.1987    
SexM        -0.71452    0.12229  -5.843 5.14e-09 ***
LrnSL        0.21608    0.14558   1.484   0.1377    
EthN:AgeF1  -0.92889    0.14657  -6.337 2.34e-10 ***
EthN:AgeF2  -1.33398    0.13504  -9.879  < 2e-16 ***
EthN:AgeF3  -0.11242    0.13478  -0.834   0.4042    
EthN:SexM    0.43902    0.09208   4.768 1.86e-06 ***
EthN:LrnSL   0.26415    0.11378   2.322   0.0203 *  
AgeF1:SexM  -0.05565    0.16303  -0.341   0.7328    
AgeF2:SexM   1.09942    0.15281   7.195 6.26e-13 ***
AgeF3:SexM   1.15949    0.13859   8.366  < 2e-16 ***
AgeF1:LrnSL -0.13019    0.15688  -0.830   0.4066    
AgeF2:LrnSL  0.37340    0.14563   2.564   0.0103 *  
AgeF3:LrnSL       NA         NA      NA       NA    
SexM:LrnSL   0.04143    0.13718   0.302   0.7627    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2073.5  on 145  degrees of freedom
Residual deviance: 1368.7  on 128  degrees of freedom
AIC: 1993.1

Number of Fisher Scoring iterations: 5
df |> filter(Age == "F3" & Lrn == "SL")
Eth Sex Age Lrn Days

No students.

Question 3

From the summary information you obtained in question 2, there is evidence of over dispersion. What is this evidence? (Select one) Group of answer choices

1368.7/128  
[1] 10.69297

The residual deviance is a lot bigger than the residual degrees of freedom.

Question 4

Fit the same model you fit in question 2, except use the glm.nb function. Using this model fit, what is the expected number of days absent for a male of Aboriginal ethnicity in the F2 age group who is an average learner? (Select one) Group of answer choices

glm.nb(data = df, 
       Days~(Eth + Age + Sex + Lrn)^2) |> summary()

Call:
glm.nb(formula = Days ~ (Eth + Age + Sex + Lrn)^2, data = df, 
    init.theta = 1.60364105, link = log)

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.00155    0.33709   8.904  < 2e-16 ***
EthN        -0.24591    0.39135  -0.628  0.52977    
AgeF1       -0.02546    0.41615  -0.061  0.95121    
AgeF2       -0.54884    0.54393  -1.009  0.31296    
AgeF3       -0.25735    0.40558  -0.635  0.52574    
SexM        -0.77181    0.38021  -2.030  0.04236 *  
LrnSL        0.38919    0.48421   0.804  0.42153    
EthN:AgeF1  -0.70000    0.43646  -1.604  0.10876    
EthN:AgeF2  -1.23283    0.42962  -2.870  0.00411 ** 
EthN:AgeF3   0.04721    0.44883   0.105  0.91622    
EthN:SexM    0.36240    0.29430   1.231  0.21818    
EthN:LrnSL   0.06847    0.34040   0.201  0.84059    
AgeF1:SexM   0.02257    0.47360   0.048  0.96198    
AgeF2:SexM   1.55330    0.51325   3.026  0.00247 ** 
AgeF3:SexM   1.25227    0.45539   2.750  0.00596 ** 
AgeF1:LrnSL -0.43101    0.47948  -0.899  0.36870    
AgeF2:LrnSL  0.52074    0.48567   1.072  0.28363    
AgeF3:LrnSL       NA         NA      NA       NA    
SexM:LrnSL   0.07187    0.40805   0.176  0.86019    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.6036) family taken to be 1)

    Null deviance: 235.23  on 145  degrees of freedom
Residual deviance: 167.53  on 128  degrees of freedom
AIC: 1100.5

Number of Fisher Scoring iterations: 1

              Theta:  1.604 
          Std. Err.:  0.214 

 2 x log-likelihood:  -1062.546 
a <- 3.00155
# a is Aborig, female, F0, AL
# a + Aborig + male + F2 + AL + interactions. 
# d is a     + F2   +  Male    + F2:Male  
d <- 3.00155 - 0.5488 - 0.77181 + 1.55330 
a*d
[1] 9.707733
exp(d)
[1] 25.38707

C

Question 1

Consider the deviance goodness-of-fit test. Under what conditions is it valid for Poisson regression? (i.e., when is the Poisson distribution well-approximated by a Normal distribution?)

When all of the Poisson counts are large (at least 5)

Question 2

Consider the deviance goodness-of-fit test. When it is valid, what possibility is suggested by a small p-value? Group of answer choices

We are using the wrong model for λ

Question 3

Consider the deviance goodness-of-fit test. When it is valid, what possibility is suggested by a large p-value? Group of answer choices

We have chosen the correct model for the data.

Question 4

Which of the following is the best approach to assess the fit of a Poisson Regression Model? Group of answer choices

Look at the residuals and examine the deviance goodness of fit test.

____END____