library(VGAM)  ## crashi data
library(dplyr) ## mutate(), near()
data(crashi)
hour <- rownames(crashi) ## grab the hours
crashi2 <- stack(crashi) ## combine 7 columns of crashes into 1
names(crashi2) <- c("Count", "Day")
crashi2$Day <- factor(crashi2$Day, levels(crashi2$Day)[c(2,6,7,5,1,3,4)]) # make sure the days are ordered correctly
crashi2$Hour <- as.numeric(rep(hour, ncol(crashi)))
crashi2 <- crashi2 |> mutate(
  Day.Cat = ifelse(
    (Day != "Fri" & Day != "Sat" & Day != "Sun"), "Weekday", as.character(Day)
  ),
  Time.Cat = cut(Hour,
                 breaks = c(-1, 5.5, 11.5, 18.5, 25),
                 labels = c("Early.Morn", "Morn", "Afternoon", "Evening"))
)
ST-518 HW 8
R Question
1. Consider the crashi data
(5 points) Consider the crashi data in the VGAM library that we examined in Lab. Sometimes people will fit multiple linear regression models to data like this, not recognizing that the responses are counts.
(a) Fit a multiple linear regression
Using the Time.Cat and Day.Cat variables that we created in Lab, fit a multiple linear regression model to the counts. Be sure to include the interaction between Time.Cat and Day.Cat just as we did in the lab.
How does the model fit compare to the negative binomial regression model that you fit in lab? Please explain.
library(MASS) ## glm.nb()
mod.mlr <- glm(
  Count ~ Day.Cat * Time.Cat,
  data = crashi2
)
mod.nb <- glm.nb(
  Count ~ Day.Cat * Time.Cat,
  data = crashi2
)
summary(mod.mlr)
Call:
glm(formula = Count ~ Day.Cat * Time.Cat, data = crashi2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.833 8.528 2.443 0.015719 *
Day.CatSat 17.000 12.061 1.410 0.160729
Day.CatSun 28.833 12.061 2.391 0.018044 *
Day.CatWeekday -8.667 9.535 -0.909 0.364827
Time.CatMorn 53.333 12.061 4.422 1.85e-05 ***
Time.CatAfternoon 106.167 11.622 9.135 3.87e-16 ***
Time.CatEvening 48.767 12.650 3.855 0.000170 ***
Day.CatSat:Time.CatMorn -28.333 17.057 -1.661 0.098751 .
Day.CatSun:Time.CatMorn -54.167 17.057 -3.176 0.001810 **
Day.CatWeekday:Time.CatMorn 11.333 13.485 0.840 0.401967
Day.CatSat:Time.CatAfternoon -40.857 16.436 -2.486 0.014009 *
Day.CatSun:Time.CatAfternoon -66.833 16.436 -4.066 7.65e-05 ***
Day.CatWeekday:Time.CatAfternoon -14.119 12.994 -1.087 0.278944
Day.CatSat:Time.CatEvening -21.800 17.889 -1.219 0.224883
Day.CatSun:Time.CatEvening -63.233 17.889 -3.535 0.000541 ***
Day.CatWeekday:Time.CatEvening -18.783 14.143 -1.328 0.186128
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 436.4019)
Null deviance: 255316 on 167 degrees of freedom
Residual deviance: 66333 on 152 degrees of freedom
AIC: 1515.1
Number of Fisher Scoring iterations: 2
summary(mod.nb)
Call:
glm.nb(formula = Count ~ Day.Cat * Time.Cat, data = crashi2,
init.theta = 13.71581333, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.03655 0.14196 21.391 < 2e-16 ***
Day.CatSat 0.59664 0.19159 3.114 0.001845 **
Day.CatSun 0.86878 0.18883 4.601 4.21e-06 ***
Day.CatWeekday -0.53785 0.16314 -3.297 0.000977 ***
Time.CatMorn 1.26976 0.18588 6.831 8.42e-12 ***
Time.CatAfternoon 1.80763 0.17802 10.154 < 2e-16 ***
Time.CatEvening 1.20621 0.19392 6.220 4.97e-10 ***
Day.CatSat:Time.CatMorn -0.76247 0.25673 -2.970 0.002979 **
Day.CatSun:Time.CatMorn -1.28668 0.25617 -5.023 5.09e-07 ***
Day.CatWeekday:Time.CatMorn 0.57318 0.21117 2.714 0.006642 **
Day.CatSat:Time.CatAfternoon -0.80471 0.24505 -3.284 0.001024 **
Day.CatSun:Time.CatAfternoon -1.22433 0.24335 -5.031 4.88e-07 ***
Day.CatWeekday:Time.CatAfternoon 0.34012 0.20273 1.678 0.093415 .
Day.CatSat:Time.CatEvening -0.66810 0.26801 -2.493 0.012675 *
Day.CatSun:Time.CatEvening -1.55050 0.27088 -5.724 1.04e-08 ***
Day.CatWeekday:Time.CatEvening 0.03632 0.22114 0.164 0.869524
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(13.7158) family taken to be 1)
Null deviance: 880.82 on 167 degrees of freedom
Residual deviance: 172.14 on 152 degrees of freedom
AIC: 1439.5
Number of Fisher Scoring iterations: 1
Theta: 13.72
Std. Err.: 1.93
2 x log-likelihood: -1405.533
MLR model:
For a linear regression with only categorical predictors, every coefficient multiplies a dummy (indicator) variable. With an intercept, one level of each factor (here Fri and Early.Morn) is folded into the baseline, so the mean model is
\(\mu(Y|Day, Time) = \beta_0 + \beta_1Sat + \beta_2Sun + \beta_3Weekday +\) \(\beta_4Morn + \beta_5Afternoon + \beta_6Evening + interactions\)
Negative Binomial:
The negative binomial approaches a Poisson model as its dispersion parameter \(\theta\) goes to infinity. On the log scale, with the same baselines,
\(log(\lambda_i) = \beta_0 + \beta_1Sat + \beta_2Sun + \beta_3Weekday +\) \(\beta_4Morn + \beta_5Afternoon + \beta_6Evening + interactions\)
For a Poisson the mean and variance both equal \(\lambda_i\); the negative binomial relaxes this to \(Var(Y_i) = \lambda_i + \lambda_i^2/\theta\). Comparing the estimated coefficients:
summary(mod.mlr)$coefficients[1:16]
[1] 20.833333 17.000000 28.833333 -8.666667 53.333333 106.166667
[7] 48.766667 -28.333333 -54.166667 11.333333 -40.857143 -66.833333
[13] -14.119048 -21.800000 -63.233333 -18.783333
summary(mod.nb)$coefficients[1:16]
[1] 3.03655427 0.59663628 0.86877975 -0.53785430 1.26976054 1.80763282
[7] 1.20621030 -0.76246537 -1.28668142 0.57317806 -0.80470838 -1.22433047
[13] 0.34011643 -0.66809524 -1.55049823 0.03632441
20.833333 + 17.000000
[1] 37.83333
exp(3.03655427+0.59663628)
[1] 37.83333
The calculations above show that, because the Day.Cat × Time.Cat model is saturated in the cells of the data, the two methods produce the same fitted cell means; the negative binomial (nb) coefficients are simply those means on the log scale. Even so, the nb model is a better fit. Its residual deviance is far smaller relative to its degrees of freedom, and the counts clearly do not come from a normal distribution with constant variance, as multiple linear regression assumes.
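The equality of the fitted values is not specific to these data; it is a property of saturated factor models. A minimal standalone sketch (toy data and names of my own, not the crashi counts) shows that an identity-link Gaussian fit and a log-link count fit both reproduce the cell means:

```r
## Illustrative only: toy data, not the crashi counts.
d <- data.frame(g = factor(rep(c("a", "b"), each = 3)),
                y = c(2, 4, 6, 10, 20, 30))
fit.id  <- glm(y ~ g, data = d)                    # identity link, like the MLR
fit.log <- glm(y ~ g, family = poisson, data = d)  # log link, like the Poisson/NB
fitted(fit.id)   # cell means: 4, 4, 4, 20, 20, 20
fitted(fit.log)  # the same cell means
```

Because every cell gets its own parameter, both likelihoods are maximized at the observed cell means, whatever the link function.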
c2 <- crashi2
c2$resmlr <- residuals(mod.mlr)
c2$resnb <- residuals(mod.nb, type = "pearson")
c2$fitmlr <- mod.mlr$fitted.values
c2$fit.nb <- mod.nb$fitted.values
ggplot(c2) +
aes(x = Day.Cat, y = resmlr, colour = Time.Cat) +
geom_point() +
labs(title = "mlr resids")
ggplot(c2) +
aes(x = Day.Cat, y = resnb, colour = Time.Cat) +
geom_point() +
labs(title = "nb resids")
Looking at the residuals, it is clear that the nb model is much better: the mlr residuals reach roughly 50, while the nb Pearson residuals stay around 2 or 3.
(b) Compare the predicted number of crashes
Now compare the predicted number of crashes in the early morning on Saturdays using your model from part (a) and the negative binomial model from the Lab. Are they the same? Different?
newdat <- data.frame(Day.Cat = "Sat", Time.Cat = "Early.Morn")

## predict using the MLR and the negative binomial regression
predict(mod.mlr, newdata = newdat, type = "response")
1
37.83333
predict(mod.nb, newdata = newdat, type = "response")
1
37.83333
c2$predmlr <- predict(mod.mlr, c2)
c2$prednb <- predict(mod.nb, c2, type = "response")
(sum(near(c2$predmlr, c2$prednb, tol = .0000000000001)) / length(c2$prednb)) * 100
[1] 14.88095
Not only are the estimates for Saturday early morning the same, all of the predictions agree up to floating-point error (about 15% are identical to within 1 x 10^-13), for the reasons explained in part (a).
(c) Examine the standard errors
Now examine the standard errors associated with the predictions from part (b). Are they the same? Different? Which model do you prefer? Please explain.
## predictions with standard errors on the response scale
predict(mod.mlr, newdata = newdat, type = "response", se.fit = TRUE)
$fit
1
37.83333
$se.fit
[1] 8.528403
$residual.scale
[1] 20.89023
predict(mod.nb, newdata = newdat, type = "response", se.fit = TRUE)
$fit
1
37.83333
$se.fit
1
4.868124
$residual.scale
[1] 1
I am not sure how these homework assignments work in gradescope, but I answered all of these questions in part a. The question about comparing model fit was general enough that I thought I was supposed to answer all these questions.
The SE for the mlr prediction is higher than for the nb, and the nb also fits the data better, so I prefer the negative binomial model. Referring to the residual plot in part (a), the mlr residuals are scattered widely, and the residual scale above shows that the standard deviation of the mlr residuals is much larger.
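The two se.fit values are also mechanically related. A minimal standalone sketch (simulated data and names of my own) of the delta method that predict() uses for a log link: the response-scale standard error is the link-scale standard error times the fitted mean.

```r
## Illustrative only: simulated Poisson data, not the crashi counts.
set.seed(1)
d <- data.frame(x = factor(rep(c("a", "b"), each = 20)))
d$y <- rpois(40, lambda = rep(c(10, 30), each = 20))
m  <- glm(y ~ x, family = poisson, data = d)
nd <- data.frame(x = "b")
p.link <- predict(m, newdata = nd, type = "link", se.fit = TRUE)
p.resp <- predict(m, newdata = nd, type = "response", se.fit = TRUE)
## delta method: SE(mu-hat) = mu-hat * SE(eta-hat) for a log link
all.equal(unname(p.resp$se.fit),
          unname(exp(p.link$fit) * p.link$se.fit))  # TRUE
```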
Conceptual Question
2. Issues with large datasets
(5 points) Please find an article or web posting in which the authors discuss the statistical issues with large datasets. Write a short paragraph in which you identify at least one of the issues that the authors raise. Also please comment on whether you agree with or disagree with their assessment. Please submit either a PDF copy of the article you read, or a web link to the article or post.
The article is Fan, Jianqing, Fang Han, and Han Liu. “Challenges of big data analysis.” National science review 1.2 (2014): 293-314.
Challenges of big data analysis pdf
https://academic.oup.com/nsr/article/1/2/293/1397586
The authors discuss noise accumulation on page 6. In brief, classification can be degraded by many weak predictors that add noise without reducing classification error: if a dataset has 200 predictors but only 2 are really useful, the noise added by the other 198 can be enough to ruin the classifier. Noise accumulation makes variable selection important, but "spurious correlations" make selection harder with big data. In high dimensions, unrelated variables can appear strongly correlated purely by chance, which leads to selecting variables that are not actually relevant. The presence of spurious variables also harms inference, since variances tend to be seriously underestimated. Finally, incidental endogeneity arises when some predictors are correlated with the error term.
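The spurious-correlation point is easy to reproduce by simulation (the sizes and names below are my own choices, not from the article): with many independent predictors and a small sample, the largest sample correlation with a completely unrelated response is sizeable purely by chance.

```r
## Illustrative only: everything here is independent noise.
set.seed(42)
n <- 50; p <- 1000
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rnorm(n)            # generated independently of every column of X
max(abs(cor(X, y)))      # typically well above 0.4 despite independence
```

Selecting the predictor with that largest correlation would look impressive in-sample while being pure noise.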
This is a pretty long article and I do not have time to read the whole thing now. However, I have been stuck at work while trying to predict the types of trees that will regenerate in forests with and without different types of disturbance. I have known that I have all of these issues, even though I didn't know they had names. There are many things that could be predictive of forest type, but they are all weak and there is a lot of noise. I very much agree with the authors that these are problems when dealing with big data. I have been working with a random forests model, hoping to fix some other issues that would help it. I am excited to read the rest of this paper and do some more digging: I have zero-inflation, overdispersion, noise accumulation, and incidental endogeneity in my dataset.