6 Zero-Inflation

Please load these libraries that you’ll need for this lab:

library(arm)
library(Sleuth3)
library(tidyverse)
library(vcdExtra)
library(magrittr)
library(MASS)
library(pscl)
library(VGAM)
In this lab, we’ll cover some more details about zero-inflated and hurdle models. We will discuss why zero-inflated models are sometimes needed, the difference between zero-inflated models and hurdle models, how to compare various types of fitted models, and how to check residuals to assess model assumptions.
Quine Data
We will use data from schools in Australia, available in the MASS library, about the number of days that children are absent from a particular school.
Some data exploration
# ?quine
schools <- quine
head(schools)
Eth | Sex | Age | Lrn | Days |
---|---|---|---|---|
A | M | F0 | SL | 2 |
A | M | F0 | SL | 11 |
A | M | F0 | SL | 14 |
A | M | F0 | AL | 5 |
A | M | F0 | AL | 5 |
A | M | F0 | AL | 13 |
For each child in the data set, we have information on the child’s Ethnicity, Sex, Age, Learning Disability Status, and Days absent from school (which we will treat as the response variable). All of the explanatory variables are categorical with two categories each, except for Age which has four categories.
We’ll start with some exploration of the data.
ggplot(data = schools, aes(x = Days, y = after_stat(density))) +
geom_histogram(bins = 12, colour = "black", fill = "white") +
ggtitle("Days Absent from School")
You can see from the histogram above that most of the children only missed between 5 and 10 days of school. However, there are perhaps more 0’s than we would expect from a standard Poisson or Negative Binomial Model. It’s rather difficult to tell what will be the best model for these data just based on this simple histogram, but we will investigate the various models proposed in lecture throughout the rest of the lab.
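To make this claim a bit more concrete, here is a quick check (a sketch, not part of the original lab code) comparing the observed number of zero counts to the number we would expect under a Poisson distribution with the same mean as Days:

## observed zeros vs. zeros expected under a Poisson with the same mean
c(observed.zeros = sum(schools$Days == 0),
  expected.zeros = round(nrow(schools) * dpois(0, lambda = mean(schools$Days)), 1))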
Next, let’s create some side-by-side boxplots to explore how each of the categorical explanatory variables is associated with the response.
ggplot(data = schools, aes(x = Eth, y = Days)) +
geom_boxplot() +
ggtitle("Days Absent vs. Ethnicity") +
scale_x_discrete(labels = c("Aboriginal", "Not Aboriginal"))
Here it looks like there are generally higher numbers of days absent among the Aboriginal students, and there’s also more variability in that group.
ggplot(data = schools, aes(x = Sex, y = Days)) +
geom_boxplot() +
ggtitle("Days Absent vs. Gender") +
scale_x_discrete(labels = c("Female", "Male"))
It’s difficult to detect large differences here, though there is some indication of more variation in the distribution of male students.
ggplot(data = schools, aes(x = Age, y = Days)) +
geom_boxplot() +
ggtitle("Days Absent vs. Age") +
scale_x_discrete(labels = c("Primary", "F1", "F2", "F3"))
There seems to be some skewness in the F1 distribution, and you could investigate that further by cross-tabulating Age
with some of the other explanatory variables.
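For example, one quick way to do that cross-tabulation (a sketch; table() is one of several options):

## cross-tabulate Age with the other categorical explanatory variables
table(schools$Age, schools$Lrn)
table(schools$Age, schools$Sex)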
ggplot(data = schools, aes(x = Lrn, y = Days)) +
geom_boxplot() +
ggtitle("Days Absent vs. Learning") +
scale_x_discrete(labels = c("Average Learner", "Slow Learner"))
Finally, there are no clear differences between these two distributions, but we should investigate further by fitting some models.
One thing to notice about many of the boxplots is that there are (at least what R is declaring to be) outliers on the upper end of all of the boxplots. But remember that count distributions tend to be right skewed (just revisit the histogram above!), and so these may not really be outliers, but rather just large counts that we can expect from some count distributions.
As an aside, take a look at the help file for the geom_boxplot() function.
# ?geom_boxplot
If you scroll down in the help file to where it describes the “Computed Variables,” you’ll see that after_stat(ymax)
is defined as “upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR” (“upper hinge” is the 75th percentile). This is a rather standard, though arbitrary rule for declaring something an outlier. Also remember that with Poisson count data, as the counts increase, so does the variance. Therefore, when looking at boxplots of count data, an “outlier” may simply be an indication of large variance, or it may be an indication of over dispersion.
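To make that rule concrete for these data, here is a quick sketch (not part of the original lab code) of the upper-fence computation for Days:

## upper hinge + 1.5 * IQR rule for the Days variable
upper.fence <- quantile(schools$Days, 0.75) + 1.5 * IQR(schools$Days)
upper.fence
sum(schools$Days > upper.fence)  ## number of observations flagged as "outliers"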
Zero Inflation
As discussed in the narrated lectures this week, the 0’s in the quine
dataset might be divided into two categories: true 0’s and excess 0’s (though we will never be able to tell definitively which are true and which are excess based on the data alone). In this particular example, some of the excess 0’s might be present if parents of the children force them to go to school every day while the true 0’s might be present for those children that truly never got sick or had to miss school for another reason. This is purely speculative, but, for count data, it can be helpful to think about why there might be zero-inflated data.
Zero-Inflated Models
Recall from the Module 6 lectures that the zero-inflated Poisson regression model is a two part model with:
\(logit(\pi_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_5 X_{5i} + \beta_6 X_{6i}\)
and
\(log(\lambda_i) = \gamma_0 + \gamma_1 X_{1i} + \gamma_2 X_{2i} + \gamma_3 X_{3i} + \gamma_4 X_{4i} + \gamma_5 X_{5i} + \gamma_6 X_{6i}\)
with \(X_1\) as the indicator for Ethnicity, \(X_2\) as the indicator for Sex, \(X_3\), \(X_4\), \(X_5\) as the three indicator variables for Age, and \(X_6\) as the indicator variable for Learning Status.
Note 1: As usual, for categorical variables, we need \(k - 1\) indicator variables in the model for each explanatory variable, where \(k\) is the number of categories for a particular explanatory variable.
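Putting the two parts together, the zero-inflated Poisson model implies (written here in standard ZIP notation, as a reminder rather than a quote from the lecture) that \(P(Y_i = 0) = \pi_i + (1 - \pi_i)e^{-\lambda_i}\) and \(P(Y_i = y) = (1 - \pi_i)\frac{e^{-\lambda_i}\lambda_i^{y}}{y!}\) for \(y = 1, 2, \ldots\), so \(\pi_i\) is the probability of an excess zero and \(\lambda_i\) is the mean of the Poisson part.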
If we expect there to be over dispersion in the counts, we might also consider fitting a zero-inflated negative binomial model. This is really quite similar to the idea of the zero-inflated Poisson model (we even have the same link function, the log-link). The only difference is that the negative binomial model has an extra parameter to estimate and allows for the possibility of over dispersion in the counts.
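Concretely, under the parameterization used by glm.nb() and zeroinfl() (a standard result, stated here for reference), the negative binomial count part has \(Var(Y_i) = \lambda_i + \lambda_i^2/\theta\), so smaller values of \(\theta\) correspond to more over dispersion, and the Poisson model is recovered in the limit as \(\theta \to \infty\).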
Hurdle Models
The purpose of hurdle models is the same as the purpose of zero-inflated models: to account for excess 0’s. As discussed in the Sarul and Sahin reading, however, the two types of models can sometimes give quite different results.
Note 2: As with the previous lab, fitting all of the different models we are about to fit does actually qualify as data snooping. In reality, we would want to think about whether a hurdle model or a zero-inflated Poisson model or a zero-inflated negative binomial model, etc., is most reasonable for the particular data we have before doing any model fitting.
Model Evaluation, Comparison, and Information Criteria
Let’s first fit a zero-inflated Poisson model using all four covariates, and compare it to the usual Poisson regression model.
mod.pois0 <- zeroinfl(Days ~ Eth + Sex + Age + Lrn,
                      dist = "poisson", data = schools)
summary(mod.pois0)
Call:
zeroinfl(formula = Days ~ Eth + Sex + Age + Lrn, data = schools, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-4.3969 -1.8974 -0.7468 1.4181 9.4340
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.71883 0.06480 41.956 < 2e-16 ***
EthN -0.44061 0.04190 -10.517 < 2e-16 ***
SexM 0.18904 0.04253 4.445 8.77e-06 ***
AgeF1 -0.32048 0.06968 -4.599 4.24e-06 ***
AgeF2 0.24602 0.06212 3.960 7.49e-05 ***
AgeF3 0.43720 0.06781 6.447 1.14e-10 ***
LrnSL 0.34400 0.05155 6.674 2.50e-11 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.799218 1.398261 -3.432 0.000599 ***
EthN 2.061952 1.079103 1.911 0.056030 .
SexM 1.010530 0.759919 1.330 0.183588
AgeF1 -0.005473 1.050231 -0.005 0.995842
AgeF2 -0.326430 1.074230 -0.304 0.761224
AgeF3 0.061397 1.119582 0.055 0.956266
LrnSL 0.213368 0.862528 0.247 0.804618
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 14
Log-likelihood: -1047 on 14 Df
Recall that when we run the zero-inflated model, we get estimated regression coefficients for the zero-inflated part of the model and separate estimated regression coefficients for the Poisson part of the model.
mod.pois <- glm(Days ~ Eth + Sex + Age + Lrn,
                family = "poisson", data = schools)
summary(mod.pois)
Call:
glm(formula = Days ~ Eth + Sex + Age + Lrn, family = "poisson",
data = schools)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.71538 0.06468 41.980 < 2e-16 ***
EthN -0.53360 0.04188 -12.740 < 2e-16 ***
SexM 0.16160 0.04253 3.799 0.000145 ***
AgeF1 -0.33390 0.07009 -4.764 1.90e-06 ***
AgeF2 0.25783 0.06242 4.131 3.62e-05 ***
AgeF3 0.42769 0.06769 6.319 2.64e-10 ***
LrnSL 0.34894 0.05204 6.705 2.02e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2073.5 on 145 degrees of freedom
Residual deviance: 1696.7 on 139 degrees of freedom
AIC: 2299.2
Number of Fisher Scoring iterations: 5
We see that the zero-inflated model has a much lower AIC than the usual Poisson model; also, as expected, the zero-inflated model uses twice as many degrees of freedom as the usual Poisson model since we have twice as many parameters to estimate in the zero-inflated model.
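As a quick check of that AIC claim (a sketch; a fuller comparison of all six models appears below):

## AICs of the zero-inflated and ordinary Poisson fits
AIC(mod.pois0, mod.pois)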
We can also repeat what we did above using a negative binomial model to account for the (possible) overdispersion.
mod.nb0 <- zeroinfl(Days ~ Eth + Sex + Age + Lrn,
                    dist = "negbin", data = schools)
summary(mod.nb0)
Call:
zeroinfl(formula = Days ~ Eth + Sex + Age + Lrn, data = schools, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.1836 -0.7060 -0.2712 0.5118 3.5941
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.88387 0.21756 13.256 < 2e-16 ***
EthN -0.50956 0.14979 -3.402 0.00067 ***
SexM 0.16743 0.15760 1.062 0.28808
AgeF1 -0.44161 0.22856 -1.932 0.05335 .
AgeF2 0.05128 0.23216 0.221 0.82519
AgeF3 0.30937 0.23604 1.311 0.18996
LrnSL 0.28252 0.17544 1.610 0.10732
Log(theta) 0.39640 0.13171 3.010 0.00262 **
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -26.6848 484.4629 -0.055 0.956
EthN 13.7292 424.5594 0.032 0.974
SexM 11.7614 233.3545 0.050 0.960
AgeF1 0.1852 1.4154 0.131 0.896
AgeF2 -1.1579 1.6862 -0.687 0.492
AgeF3 -0.7733 1.5920 -0.486 0.627
LrnSL -0.6861 1.3114 -0.523 0.601
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 1.4865
Number of iterations in BFGS optimization: 33
Log-likelihood: -542.1 on 15 Df
# compare to the negative binomial model without zero-inflation
mod.nb <- glm.nb(Days ~ Eth + Sex + Age + Lrn, data = schools)
summary(mod.nb)
Call:
glm.nb(formula = Days ~ Eth + Sex + Age + Lrn, data = schools,
init.theta = 1.274892646, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.89458 0.22842 12.672 < 2e-16 ***
EthN -0.56937 0.15333 -3.713 0.000205 ***
SexM 0.08232 0.15992 0.515 0.606710
AgeF1 -0.44843 0.23975 -1.870 0.061425 .
AgeF2 0.08808 0.23619 0.373 0.709211
AgeF3 0.35690 0.24832 1.437 0.150651
LrnSL 0.29211 0.18647 1.566 0.117236
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1.2749) family taken to be 1)
Null deviance: 195.29 on 145 degrees of freedom
Residual deviance: 167.95 on 139 degrees of freedom
AIC: 1109.2
Number of Fisher Scoring iterations: 1
Theta: 1.275
Std. Err.: 0.161
2 x log-likelihood: -1093.151
The structure of the output for these two models looks very similar to the structure of the output for the Poisson regression models, except that one additional parameter is estimated in each model (as compared to the analogous Poisson model): \(\theta\), the dispersion parameter. In both models, \(\hat\theta\) is a fairly small finite value (about 1.3 to 1.5), which means the estimated variance is well above the mean, indicating some over dispersion of the counts. We will test this more formally using a drop-in-deviance test in a moment, but for completeness, let’s also fit the Poisson hurdle model and the negative binomial hurdle model.
mod.pois.hurdle <- hurdle(Days ~ Eth + Sex + Age + Lrn,
                          dist = "poisson", data = schools)
summary(mod.pois.hurdle)
Call:
hurdle(formula = Days ~ Eth + Sex + Age + Lrn, data = schools, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-4.397 -1.897 -0.747 1.418 9.434
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.71879 0.06481 41.953 < 2e-16 ***
EthN -0.44065 0.04190 -10.517 < 2e-16 ***
SexM 0.18907 0.04253 4.446 8.76e-06 ***
AgeF1 -0.32057 0.06969 -4.600 4.23e-06 ***
AgeF2 0.24600 0.06212 3.960 7.50e-05 ***
AgeF3 0.43725 0.06782 6.448 1.14e-10 ***
LrnSL 0.34408 0.05156 6.674 2.49e-11 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.799162 1.398219 3.432 0.000598 ***
EthN -2.062031 1.079091 -1.911 0.056018 .
SexM -1.010397 0.759840 -1.330 0.183601
AgeF1 0.005372 1.050158 0.005 0.995918
AgeF2 0.326481 1.074203 0.304 0.761182
AgeF3 -0.061336 1.119559 -0.055 0.956309
LrnSL -0.213461 0.862480 -0.247 0.804524
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 12
Log-likelihood: -1047 on 14 Df
mod.nb.hurdle <- hurdle(Days ~ Eth + Sex + Age + Lrn,
                        dist = "negbin", data = schools)
summary(mod.nb.hurdle)
Call:
hurdle(formula = Days ~ Eth + Sex + Age + Lrn, data = schools, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.1934 -0.7283 -0.2897 0.4978 3.6377
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.85794 0.21858 13.075 < 2e-16 ***
EthN -0.49846 0.15231 -3.273 0.00107 **
SexM 0.14525 0.15953 0.910 0.36257
AgeF1 -0.45409 0.23094 -1.966 0.04927 *
AgeF2 0.07481 0.23408 0.320 0.74926
AgeF3 0.35777 0.23801 1.503 0.13280
LrnSL 0.31736 0.17734 1.790 0.07352 .
Log(theta) 0.39929 0.15466 2.582 0.00983 **
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.799162 1.398219 3.432 0.000598 ***
EthN -2.062031 1.079091 -1.911 0.056018 .
SexM -1.010397 0.759840 -1.330 0.183601
AgeF1 0.005372 1.050158 0.005 0.995918
AgeF2 0.326481 1.074203 0.304 0.761182
AgeF3 -0.061336 1.119559 -0.055 0.956309
LrnSL -0.213461 0.862480 -0.247 0.804524
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 1.4908
Number of iterations in BFGS optimization: 14
Log-likelihood: -542.5 on 15 Df
We can now compare all six of these models using AIC or BIC to see if any of the models are preferable. Based on intuition, we would expect the negative binomial models to be better than the Poisson models since, from the exploratory analysis, we expected there to be over dispersion in the counts of days missed from school.
AIC(mod.pois0, mod.pois, mod.pois.hurdle, mod.nb0, mod.nb, mod.nb.hurdle)
Model | df | AIC |
---|---|---|
mod.pois0 | 14 | 2121.462 |
mod.pois | 7 | 2299.184 |
mod.pois.hurdle | 14 | 2121.451 |
mod.nb0 | 15 | 1114.118 |
mod.nb | 8 | 1109.151 |
mod.nb.hurdle | 15 | 1115.070 |
As expected, the negative binomial models all have much lower AIC than the Poisson models. However, after we account for this over dispersion, the three negative binomial models are relatively similar in terms of AIC. Therefore, we would probably prefer the simplest model (the model that does not incorporate zero-inflation) here.
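As a rough version of the drop-in-deviance test mentioned earlier, we can compare the ordinary Poisson and negative binomial fits, which differ only in the dispersion parameter. This is only a sketch: the null value of \(\theta\) sits on the boundary of the parameter space, so the naive chi-squared p-value is conservative.

## drop-in-deviance style comparison of mod.pois and mod.nb
dev.change <- as.numeric(2 * (logLik(mod.nb) - logLik(mod.pois)))
pchisq(dev.change, df = 1, lower.tail = FALSE)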
We mentioned above that the hurdle model sometimes gives estimates and results very similar to those of the zero-inflated model, but sometimes the results are quite different. We can compare the zero-inflated negative binomial model to the negative binomial hurdle model here to see if there are any major differences in the coefficient estimates and/or their standard errors. Again, this is just an academic exercise, because based on the AIC analysis above, we’d recommend using the model without zero-inflation.
tab1 <- cbind(round(summary(mod.nb0)$coefficients[[1]][, 1], 3),
              round(summary(mod.nb.hurdle)$coefficients[[1]][, 1], 3))
colnames(tab1) <- c("Zero-Inf Coefs", "Hurdle Coefs")
tab1
Zero-Inf Coefs Hurdle Coefs
(Intercept) 2.884 2.858
EthN -0.510 -0.498
SexM 0.167 0.145
AgeF1 -0.442 -0.454
AgeF2 0.051 0.075
AgeF3 0.309 0.358
LrnSL 0.283 0.317
Log(theta) 0.396 0.399
tab2 <- cbind(round(summary(mod.nb0)$coefficients[[1]][, 2], 3),
              round(summary(mod.nb.hurdle)$coefficients[[1]][, 2], 3))
colnames(tab2) <- c("Zero-Inf SEs", "Hurdle SEs")
tab2
Zero-Inf SEs Hurdle SEs
(Intercept) 0.218 0.219
EthN 0.150 0.152
SexM 0.158 0.160
AgeF1 0.229 0.231
AgeF2 0.232 0.234
AgeF3 0.236 0.238
LrnSL 0.175 0.177
Log(theta) 0.132 0.155
In this particular instance, the coefficients for the negative binomial part of the model and their standard errors are quite similar. You are asked to compare the coefficients and their standard errors for the “extra zero” part of these two models at the end of the lab.
Let’s next carry out a more formal Vuong test to compare the zero-inflated negative binomial model to the usual negative binomial model.
Vuong test
Because the zero-inflated Poisson model and the usual Poisson model are not nested, we can’t use a drop-in-deviance test to compare the models (and the same goes for the negative binomial model vs. the zero-inflated negative binomial model).
The Vuong test has a null hypothesis that the models are indistinguishable, with a large positive test statistic indicating that the first model passed to the function is better (below, we put the zero-inflated model first) and a large negative test statistic indicating that the second model is better.
vuong(mod.nb0, mod.nb)
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic H_A p-value
Raw 1.315020 model1 > model2 0.094252
AIC-corrected -0.723184 model2 > model1 0.234783
BIC-corrected -3.763784 model2 > model1 8.3681e-05
Note 1: The Vuong test is asymptotic, so if the sample size of the data set is not very large, the test is unreliable and should not be used.
Note 2: We can also use the test to compare the Poisson models just for fun.
vuong(mod.pois0, mod.pois)
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic H_A p-value
Raw 2.667898 model1 > model2 0.0038164
AIC-corrected 2.473082 model1 > model2 0.0066977
BIC-corrected 2.182454 model1 > model2 0.0145380
Here, there is fairly strong evidence that the zero-inflated Poisson model is better than the non-zero-inflated Poisson model. However, as discussed above, neither seems appropriate since there is over dispersion in the counts of days absent from school.
To conclude, it seems evident that there is over dispersion in the days absent variable, but that these counts are not zero-inflated. In this instance, the appropriate model to use for interpretation and inference is the negative binomial regression model.
Coefficient Interpretation
Even though zero-inflation is not apparent in the days absent counts, we’re going to proceed with interpreting the estimated regression coefficients for the zero-inflated negative binomial model just so you can see one approach. Recall that the coefficient estimates for this model are:
summary(mod.nb0)$coefficients[[1]][, 1]
(Intercept) EthN SexM AgeF1 AgeF2 AgeF3
2.8838678 -0.5095556 0.1674285 -0.4416084 0.0512790 0.3093699
LrnSL Log(theta)
0.2825202 0.3964049
summary(mod.nb0)$coefficients[[2]][, 1]
(Intercept) EthN SexM AgeF1 AgeF2 AgeF3
-26.6847779 13.7291994 11.7614329 0.1852184 -1.1578503 -0.7732624
LrnSL
-0.6861468
The first set of coefficients is from the negative binomial part of the model and the second set of coefficients is from the zero-inflation part of the model. If we think in the framework of some of the zeros as true zeros and some of the zeros as excess zeros, then we might interpret the Ethnicity
coefficient estimate in the negative binomial model, -0.5096, in the following way:
“A person of non-Aboriginal descent is predicted to be absent from school \(1 - e^{-0.5096} \approx 0.399\), or about 39.9%, fewer days than a person of Aboriginal descent at this particular school, among all of those with a risk of being absent (i.e., among all of those that are not the excess zeroes), provided these two people are of the same gender, age, and learner status.”
That’s rather a mouthful of an interpretation, but we have to be careful when interpreting regression coefficient estimates from a model with multiple explanatory variables – the interpretation of a single coefficient estimate has to be made while holding the values of the other explanatory variables fixed.
Notice that this interpretation is exactly the same as the interpretation in the usual negative binomial model except for the extra parenthetical we added about the excess zeroes.
Similarly, we interpret regression coefficient estimates from the zero-inflated part of the model in the same way that we would interpret estimates from a logistic regression model.
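For reference, here is one way (a sketch, using the same extraction syntax as above) to put both sets of estimates on a multiplicative scale: exponentiating the count-part coefficients gives multiplicative effects on the mean number of days absent, and exponentiating the zero-part coefficients gives multiplicative effects on the odds of being an excess zero. (The last row of the first set is Log(theta), whose exponential is just the estimate of theta.)

## multiplicative effects on the mean count (negative binomial part)
round(exp(summary(mod.nb0)$coefficients[[1]][, 1]), 3)
## multiplicative effects on the odds of an excess zero (zero-inflation part)
round(exp(summary(mod.nb0)$coefficients[[2]][, 1]), 3)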
Looking at Residuals
Let’s now consider a few residual plots from the zero-inflated negative binomial model.
schools$residuals.pearson <- residuals(mod.nb0, type = "pearson")
schools$fitted.vals <- mod.nb0$fitted.values

ggplot(data = schools, aes(x = Eth, y = residuals.pearson)) +
geom_point()
ggplot(data = schools, aes(x = Sex, y = residuals.pearson)) +
geom_point()
ggplot(data = schools, aes(x = Age, y = residuals.pearson)) +
geom_point()
ggplot(data = schools, aes(x = Lrn, y = residuals.pearson)) +
geom_point()
ggplot(data = schools, aes(x = fitted.vals, y = residuals.pearson)) +
geom_point()
We see from these residual plots that there are a few residuals larger than 3. However, with this many observations, it is not too surprising that a few points have large residuals. We also see in the residuals vs. fitted values plot that the spread of the residuals is relatively constant across all fitted values. Overall, there is no cause for concern in using the zero-inflated negative binomial model for these data. As discussed above, however, since the zero-inflated model has a similar (actually slightly larger) AIC than the non-zero-inflated model, we would probably prefer the simpler non-zero-inflated model.
Lab Questions
Compare the coefficients
Compare the coefficients and their standard errors for the “extra zero” parts of the mod.nb0
and mod.nb.hurdle
models. Are the coefficients and standard errors similar or different? Explain.
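As a starting point, the extraction mirrors tab1 and tab2 above, except that it uses the second element of the coefficient list (the extra-zero part); a sketch:

tab3 <- cbind(round(summary(mod.nb0)$coefficients[[2]][, 1:2], 3),
              round(summary(mod.nb.hurdle)$coefficients[[2]][, 1:2], 3))
colnames(tab3) <- c("Zero-Inf Coefs", "Zero-Inf SEs", "Hurdle Coefs", "Hurdle SEs")
tab3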
We saw above that, after accounting for over dispersion, there wasn’t really any evidence of zero-inflation in the quine
data set. From looking at exploratory plots of the data, however, it can be somewhat difficult to tell whether there is over dispersion, zero-inflation or both. Let’s run the following code to simulate what counts would look like under a Poisson model (no zero-inflation, no over dispersion), a zero-inflated Poisson model (zero-inflation, no over dispersion), and a negative binomial model (over dispersion, no zero-inflation). We will need to install and load the VGAM
package to easily simulate from a zero-inflated Poisson model.
# library(VGAM)
n <- 150  ## set the sample size
pois <- rpois(n, lambda = 5)
negbin <- rnbinom(n, mu = 5, size = 1.4)
pois0 <- rzipois(n, lambda = 6.25, pstr0 = 0.2)
sim.df <- as.data.frame(cbind(pois, negbin, pois0))
1 Make three histograms
Run the above code and then make three histograms of the three sets of simulated data. Also, get summary statistics for each of the three sets of simulations.
hist(pois)
hist(pois0)
hist(negbin)
summary(pois)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.000 5.000 4.933 6.000 10.000
summary(pois0)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 3.00 5.00 5.08 7.00 12.00
summary(negbin)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 4.000 5.273 7.750 23.000
2
What are some of the differences between the three histograms? Also, what do you notice about the negative binomial simulated data? (which applies to our data analysis above)
The pois counts are roughly symmetric around their mean. The pois0 counts look similar except for a spike of extra zeroes. The negbin counts are strongly right-skewed, with a lower median and a much higher maximum; in other words, they are more dispersed, which is also what we saw in the days-absent counts above.
3
Repeat the simulation of the negative binomial data, but try changing the size
parameter to a few different values. What does the size parameter seem to control?
hist(rnbinom(n*1000, mu = 5, size = 1))
hist(rnbinom(n*1000, mu = 5, size = 1.4))
hist(rnbinom(n*1000, mu = 5, size = 3))
hist(rnbinom(n*1000, mu = 5, size = 5))
hist(rnbinom(n*1000, mu = 5, size = 1400))
The size parameter seems to control the amount of dispersion: for small values the counts are strongly right-skewed, with many zeroes and a long upper tail, and as size increases the distribution becomes more concentrated and looks more and more like a Poisson with the same mean.
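One way to see this more directly (a quick sketch) is to compute the sample variance of simulated counts for several values of size; for a negative binomial, \(Var(Y) = \mu + \mu^2/\text{size}\), so the variance should shrink toward the Poisson value (here, 5) as size grows.

## sample variance of negative binomial draws with mu = 5 for several sizes
sapply(c(1, 1.4, 3, 5, 1400), function(s) var(rnbinom(1e5, mu = 5, size = s)))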
4
Finally, simulate some data from a zero-inflated negative binomial distribution using the rzinegbin
function in the VGAM
package.
# ?rzinegbin
sim.df$negbin0 <- rzinegbin(n = n, size = 1.4, munb = 5, pstr0 = 0.2)
Try plotting a histogram of the zero-inflated negative binomial data. Is it easy or difficult to tell, based on these histograms, whether data come from the Poisson, negative binomial, zero-inflated Poisson, or zero-inflated negative binomial models?
hist(sim.df$negbin)
hist(sim.df$negbin0)
hist(rzinegbin(n = 10000, size = 1.4, munb = 5, pstr0 = 0.2))
The sets with lots of zeroes seem easy enough to pick out. Yes, I think they are easy enough to distinguish based on the histograms.