ST-518 HW 5

Author

Chris Eckerson

Published

May 17, 2024

output: pdf_document

5 Log-Lin Reg

R Questions

1. Log-lin Reg on obeisty

(2 points) Recall the obesity problem from Homework 1. The data are as follows:

        CVD Death

            Yes         No
Obese       16        2045
Not obese   7         1044

Using Poisson log linear regression, test for independence between obesity and CVD death outcome. (Hint: this is equivalent to testing that the interaction term in the model is zero.) How does your answer compare to a chi-square test on the same data?

o <- expand.grid(
  ob = factor(c("Obese", "NotObese"), levels = c("Obese", "NotObese")),
  death = factor(c("Yes", "No"), levels = c("Yes", "No"))
)
o$Freq <- c(16, 7, 2045,  1044)

o_tab <- xtabs(data = o, Freq~ob + death)
o_tab

          death
ob          Yes   No
  Obese      16 2045
  NotObese    7 1044

ob	death	Freq
Obese	Yes	16
NotObese	Yes	7
Obese	No	2045
NotObese	No	1044

o2 <- expand.dft(o)

m1 <- glm(data = o, 
    family = poisson, 
    Freq~ob+death)
s1 <- summary(m1)
s1


Call:
glm(formula = Freq ~ ob + death, family = poisson, data = o)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.7234     0.2089   13.04   <2e-16 ***
obNotObese   -0.6734     0.0379  -17.77   <2e-16 ***
deathNo       4.9001     0.2093   23.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4376.49698  on 3  degrees of freedom
Residual deviance:    0.11741  on 1  degrees of freedom
AIC: 32.796

Number of Fisher Scoring iterations: 3

pchisq(m1$deviance, df = m1$df.residual, lower.tail = FALSE)

[1] 0.7318639

m1 <- glm(data = o_tab, 
    family = poisson, 
    Freq~ob+death)
s1 <- summary(m1)
s1


Call:
glm(formula = Freq ~ ob + death, family = poisson, data = o_tab)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.7234     0.2089   13.04   <2e-16 ***
obNotObese   -0.6734     0.0379  -17.77   <2e-16 ***
deathNo       4.9001     0.2093   23.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4376.49698  on 3  degrees of freedom
Residual deviance:    0.11741  on 1  degrees of freedom
AIC: 32.796

Number of Fisher Scoring iterations: 3

pchisq(0.11741, df = 1, lower.tail = F)

[1] 0.7318607

chisq.test(o_tab, correct = F)


    Pearson's Chi-squared test

data:  o_tab
X-squared = 0.11541, df = 1, p-value = 0.7341

chisq.test(o_tab)


    Pearson's Chi-squared test with Yates' continuity correction

data:  o_tab
X-squared = 0.014031, df = 1, p-value = 0.9057

Chi squared test for independence suggests evidense of an association.

The chi squared test of independence and the log-linear regression chi squared stat are the same. At least, they are if I turn off the continuity correction

2. Elephants

(3 points) Although male elephants are capable of reproducing by 14 to 17 years of age, younger adult males are usually unsuccessful in competing with their larger elders for the attention of receptive females. Since male elephants continue to grow throughout their lifetimes, and since larger males tend to be more successful at mating, the males most likely to pass their genes to future generations are those whose characteristics enable them to live long lives. Joyce Poole studied a population of African elephants in Amboseli National Park, Kenya, for 8 years. You can explore this data set in case2201 in the Sleuth3 library. This data frame contains the number of successful matings and ages (at the study’s beginning) of 41 male elephants.

rm(list = ls())
data(case2201)
df <- case2201
head(df)

Age	Matings
27	0
28	1
28	1
28	1
28	3
29	0

Give an estimated model for describing the number of successful matings as a function age, using

A. Model w/ lm and sqrt

simple linear regression after taking a square root transformation of the number of successful matings;

lmsqrt <- lm(sqrt(Matings)~Age, data = df)
summary(lmsqrt)


Call:
lm(formula = sqrt(Matings) ~ Age, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.90532 -0.33654  0.07767  0.45871  1.09468 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.81220    0.56867  -1.428 0.161187    
Age          0.06320    0.01561   4.049 0.000236 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6493 on 39 degrees of freedom
Multiple R-squared:  0.296, Adjusted R-squared:  0.2779 
F-statistic:  16.4 on 1 and 39 DF,  p-value: 0.0002362

\[ sqrt(Matings) = -.81 + .063Age \] First mating around 12.9 years.

B. lm w/ log + 1

simple linear regression after taking a logarithmic transformation (after adding 1);

df$mlp1 <- log(df$Matings + 1)

lmlp1 <- lm(mlp1~Age, data = df)
summary(lmlp1)


Call:
lm(formula = mlp1 ~ Age, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.49087 -0.33939  0.06607  0.35376  0.81171 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.69893    0.45861  -1.524 0.135567    
Age          0.05093    0.01259   4.046 0.000238 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5237 on 39 degrees of freedom
Multiple R-squared:  0.2957,    Adjusted R-squared:  0.2776 
F-statistic: 16.37 on 1 and 39 DF,  p-value: 0.0002385

\[ log(Matings + 1) = -0.69893 + 0.05093Age \]

0.69893/0.05093

[1] 13.72335

First mating around 13.7 years.

C. log-linear

log-linear regression.

Be sure to examine residuals from each of these models. How do the models compare? Please be specific. Is there evidence of over dispersion? If so, fit another model and report results from that model. If not, why not?

df_tab <- table(df[,1:2])
df_tab

    Matings
Age  0 1 2 3 4 5 6 7 9
  27 1 0 0 0 0 0 0 0 0
  28 0 3 0 1 0 0 0 0 0
  29 3 0 3 0 0 0 0 0 0
  30 0 1 0 0 0 0 0 0 0
  32 0 0 1 0 0 0 0 0 0
  33 0 0 1 3 1 0 0 0 0
  34 0 2 1 1 0 0 0 0 0
  36 0 0 0 0 0 1 1 0 0
  37 0 2 0 0 0 0 1 0 0
  38 0 0 1 0 0 0 0 0 0
  39 0 1 0 0 0 0 0 0 0
  41 0 0 0 1 0 0 0 0 0
  42 0 0 0 0 1 0 0 0 0
  43 1 0 1 1 1 0 0 0 1
  44 0 0 0 1 0 0 0 0 0
  45 0 0 0 0 0 1 0 0 0
  47 0 0 0 0 0 0 0 1 0
  48 0 0 1 0 0 0 0 0 0
  52 0 0 0 0 0 0 0 0 1

glm1 <- glm(data = df, 
            family = poisson, 
            Matings~Age)
summary(glm1)


Call:
glm(formula = Matings ~ Age, family = poisson, data = df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
Age          0.06869    0.01375   4.997 5.81e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.372  on 40  degrees of freedom
Residual deviance: 51.012  on 39  degrees of freedom
AIC: 156.46

Number of Fisher Scoring iterations: 5

glm1psi <- 51.012/39

Psi of 1.3 isn’t wild.

\[ log(\mu_{ij}) = -1.58201 + 0.06869 \lambda_{ij}^{Age} \]

Probability of mating given age of 50 is 87%.

plogis((-1.58201 + 0.06869*27), lower.tail = T)

[1] 0.567736

df$sqres <- residuals(lmsqrt)
df$p1res <- residuals(lmlp1)
df$glmres <- residuals(glm1)

df |> ggplot() + 
  aes(x = Age, y = sqres) + 
  geom_point() + 
  labs(title = "sqrt")

df |> ggplot() + 
  aes(x = Age, y = p1res) + 
  geom_point()+ 
  labs(title = "log + 1")

df |> ggplot() + 
  aes(x = Age, y = glmres) + 
  geom_point()+ 
  labs(title = "glm")

There are a couple of values in the glm that are on the edge of the plus or minus 2 cutoff for outliers. That same low point at age equal to 43 is a little off in all the residual plots.

nb <- glm.nb(data = df, 
       Matings~Age)
summary(nb)


Call:
glm.nb(formula = Matings ~ Age, data = df, init.theta = 16.48629005, 
    link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.58837    0.59393  -2.674  0.00749 ** 
Age          0.06887    0.01519   4.534  5.8e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(16.4863) family taken to be 1)

    Null deviance: 65.199  on 40  degrees of freedom
Residual deviance: 44.498  on 39  degrees of freedom
AIC: 157.92

Number of Fisher Scoring iterations: 1

              Theta:  16.5 
          Std. Err.:  25.7 

 2 x log-likelihood:  -151.923

44.498/ 39

[1] 1.140974

Psi is a little lower, but it wasn’t bad before.

df$nbres <- residuals(nb)

df |> ggplot() + 
  aes(x = Age, y = nbres) + 
  geom_point()+ 
  labs(title = "nb")

The negative binomial residuals look the same.

summary(lmsqrt)


Call:
lm(formula = sqrt(Matings) ~ Age, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.90532 -0.33654  0.07767  0.45871  1.09468 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.81220    0.56867  -1.428 0.161187    
Age          0.06320    0.01561   4.049 0.000236 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6493 on 39 degrees of freedom
Multiple R-squared:  0.296, Adjusted R-squared:  0.2779 
F-statistic:  16.4 on 1 and 39 DF,  p-value: 0.0002362

An increase of 1 year in Age results in a .06 more matings. This one has a lower intercept than the log plus one lm.

summary(lmlp1)


Call:
lm(formula = mlp1 ~ Age, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.49087 -0.33939  0.06607  0.35376  0.81171 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.69893    0.45861  -1.524 0.135567    
Age          0.05093    0.01259   4.046 0.000238 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5237 on 39 degrees of freedom
Multiple R-squared:  0.2957,    Adjusted R-squared:  0.2776 
F-statistic: 16.37 on 1 and 39 DF,  p-value: 0.0002385

An increase of 1 year in Age results in a .05 more matings.

sgl <- summary(glm1)
sgl


Call:
glm(formula = Matings ~ Age, family = poisson, data = df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
Age          0.06869    0.01375   4.997 5.81e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.372  on 40  degrees of freedom
Residual deviance: 51.012  on 39  degrees of freedom
AIC: 156.46

Number of Fisher Scoring iterations: 5

sgl$coefficients |> exp()

             Estimate Std. Error      z value Pr(>|z|)
(Intercept) 0.2055619   1.723955   0.05476055 1.003682
Age         1.0711071   1.013841 148.02412630 1.000001

An increase of 1 year in Age results in a 7% increase in matings.

pchisq(51.012, df = 39, lower.tail = F)

[1] 0.09425638

p-values for all three models are low, but glm is the highest.

Conceptual Questions

3. Log-lin v Linear transfromed.

(2 points) What is the difference between a log-linear model and a linear model after the log transformation of the response?

If the log transformation results in constant variance then, MLR could be used. However, glms don’t assume constant variance or normality.

Multiplicative vs additivity in error terms is the major difference. In MLR additivity is assumed.

4. Elephants

(3 points) This question refers to the elephant mating data from question 2 above.

A.

Both the binomial and the Poisson distributions provide probability models for random counts. Which distribution is more appropriate to model the number of successful matings for male African elephants, and why?

Elephant mating doesn’t have fixed sample size or constant probability, so Poisson is better in this instance.

B

In the following plot, we see that the spread of responses is larger for larger values of the mean response. Is this something to be concerned about if we perform a Poisson log-linear regression?

ggplot(df) + aes(Age, Matings) + geom_point() + 
  geom_smooth(se = T, method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Poisson assumes that variance will go up with mean. The lambda parameter is the only parameter, if one goes up so does the other.

C

Performing a Poisson log-linear regression results in the following output:

                Estimate    Std. Error z value    Pr(> |z|)
(Intercept)     -1.58201    0.54462     -2.905    0.00368 **
Age             0.06869     0.01375     4.997     5.81e-07 ***

Residual Deviance: 51.01 on 39 degrees of freedom

What are of the mean and variance of the distribution of counts of successful matings (in 8 years) for elephants who are aged 25 years at the beginning of the observation period? What are the mean and variance for elephants who are aged 45 years?

I do not know what that means, but lets try and figure it out. 25 to 33 year old elephants. Distribution of counts suggests we are not talking about the glm at all. I guess it wants me to filter by age and get the mean and variance.

d <- df |> filter(Age > 24 & Age < 34)
mean(d$Matings)

[1] 1.666667

sd(d$Matings)^2

[1] 1.529412

The mean and variance of the distribution of Matings for elephants 25 to 33 are 1.66 and 1.53 respectively.

MV <- function(A) {-1.58201 + 0.06869*A}

mean(MV(seq(25, 33, 1))) |> exp()

[1] 1.506818

(sd(MV(seq(25, 33, 1)))^2) |> exp()

[1] 1.036021

For the estimated distribution between 25 and 33, mean is 1.5 and var is 1.04.

A <- 45
MV(A) |> exp()

[1] 4.522387

The mean and variance for elephants at 45 is 4.5 Matings.