ST-518 HW 1

Author

Chris Eckerson

Conceptual Questions

1. Numerical or categorical

For each of the following, determine whether the response variable is numerical or categorical. If the response variable is categorical, is it binary? If it is not binary, list possible categories for the response variable.

The first one is confusing because it isn’t a response variable. I am going to ignore the “response” part.

In a survey, college students were asked how many hours per week they spend on the internet.
- Numerical or categorical
  - I’d treat Hours as categorical, They will be estimating not measuring, nominally they spend x hours.
- Binary?
  - No
- List possible categories
- 0-4, 4-8, 8-12, … or 1, 2, 3, 4,…
In a survey, college students were asked, “What percentage of the time that you spend on the internet is not for course work?”
- Numerical or categorical
  - Again, they aren’t measuring the time, they are estimating. I’d treat it as Categorical.
- Binary?
  - No
- List possible categories
  - 0-10%, 10-20% etc.
In a survey, college students were asked, “What is your primary mode of transportation when traveling between campus and home?”
- Numerical or categorical
  - Categorical
- Binary?
  - No
- List possible categories
  - personal vehicle, foot, bike, metro, cab, other.
In a survey, college students were asked whether or not they live on campus.
- Numerical or categorical
  - Catergorical
- Binary?
  - Yes
- List possible categories
  - Yes or No
In a survey, college students were asked how many of their meals they prepare at home per week.
- Numerical or categorical
  - Either, they could actually count this. How to treat it in a model depends on what you are trying to get.
- Binary?
  - No
- List possible categories
  - 1, 2, 3,… or 1-5, 6-10, …
In a survey, college students were asked, “Which of the five main food groups constitutes the majority of your diet?”
- Numerical or categorical
  - Categorical
- Binary?
  - No
- List possible categories
  - Grain, Fruit, Veg, Dairy, …
In a survey, smart phone users were asked whether or not they have used a web-based taxi service like Uber or Lyft.
- Numerical or categorical
  - Categorical
- Binary?
  - Yes
- List possible categories
  - Yes or No
In a survey, smart phone users were asked how many times they used a web-based taxi service in the past three months.
- Numerical or categorical
  - Either, that data could be pulled from the phone. If you wanted to use it quantitatively you could. If it’s an estimate, it’s better as a categorical.
- Binary?
  - No
- List possible categories
  - 0, 1-10, …

2. Multiple-choice exam

On a multiple-choice exam, each of the 20 questions has 2 possible answers and only one correct response. Suppose, for each question, one student selects his responses completely at random.

Re-write: On a multiple choice test there are 2 options and 1 correct answer. Bob selects his responses at random (a) For Bob, X’s 1 through 20 represent if he answered correctly. Y is the total number answered correctly What is the distribution of \(X_i\)? What is the distribution of \(Y\)?

Let \(X_i,\ i = 1,...,20\) represent whether the student answered the \(i\)th question correctly, and let \(Y\) represent the total number of answers the student gets correct. What is the distribution of \(X_i\)? What is the distribution of \(Y\)?

The question boils down to what are the proper distributions for binary data and count data. Bernoulli distributions are for binary data, Binomial for the sum of binary variables.

\(Pr(Z = 1) = 0.5^1(1−0.5)^{1−1}\ = \ 0.5 \times 1 \ = \ .5\)

\(Pr(X = y) = \begin{pmatrix} n \\ 20 \end{pmatrix}0.5^{20}(1-0.5)^{n-20}\)

What is the probability that the student passes the test (i.e., scores at least 70%)? (You will learn more about this distribution in the next module, but you should be able to calculate this in R using the pbinom function.)

\(Pr(X = 14) = \begin{pmatrix} 14 \\ 20 \end{pmatrix}0.5^{20}(1-0.5)^{14-20}\)

print(paste("The probability of getting a 70 percent by guessing with 50 50 odds is",
            pbinom(q = .7*(20), size = 20, prob = .5, lower.tail = F) |> round(4)))

[1] "The probability of getting a 70 percent by guessing with 50 50 odds is 0.0207"

Correction

Answer to Part b) is incorrect. We want \(P(Y ≥ 14) = 1 - P(Y ≤ 13) = 1 - pbinom(x=13, n=20, p=0.5)=.058\)

1 - (pbinom(q = 13, size = 20, prob = .5) |> round(4))

[1] 0.0577

# same as
pbinom(q = 13, size = 20, prob = .5, lower.tail = F) |> round(4)

[1] 0.0577

df <- NULL

success <- (0:20)/20
nums <- dbinom(0:20, 20, .5)
df <- data.frame(success, nums)

df |> ggplot() + 
  aes(success, nums) + 
  geom_point() + 
  geom_vline(xintercept = .7)

Looking for the area on the right of the line That is

3. Online, on campus, or a hybrid

Do you prefer taking courses online, on campus, or a hybrid of the two? Suppose these preferences occur with probabilities (\(p_1, p_2, p_3\)). For \(N = 3\) independent subjects, let the observed frequencies be \((n_1, n_2, n_3)\). That is, we observe \(n_1\) subjects who prefer taking courses online, etc.

Explain how you can determine \(n_3\) from knowing \(n_1\) and \(n_2\).

If there are three people and each gets to pick one preference, then 3 minus the number who prefer online or hybrid is the number left who like campus.
List all the possible observations \((n_1, n_2, n_3)\), with \(n = 3\).

x <- expand.grid(0:3, 0:3, 0:3)
x |> mutate(sums = rowSums(x[,1:3])) |> 
  filter(sums == 3) |> 
  select(1:3)

Var1	Var2	Var3
3	0	0
2	1	0
1	2	0
0	3	0
2	0	1
1	1	1
0	2	1
1	0	2
0	1	2
0	0	3

R Question

4. Consider ex2117

Consider the data in ex2117 of the Sleuth3 package.

In what format are these data (case, tabular, frequency, other)? Please explain.

ex2117 is the Effect of Stress During Conception on Odds of a Male Birth.

ex2117 |> head()

Group	Time	Number	PctBoys
Control	none	20337	51.2
Exposed	13-16mo	71	52.1
Exposed	12-7mo	789	49.6
Exposed	0-6mo	1922	48.9
Exposed	1st trimester	290	46.0

It’s pretty much a frequency table. I am not sure if a the two numerical columns change that. If one were to convert the numeric columns to the number of boys alone, that would be an ideal frequency table.

Create a new R data frame for the data, called HW1data. In that data frame, create a NumBoys column that represents the number of births that were boys (be sure to round to the nearest integer).

HW1data <- ex2117 |> mutate(
  NumBoys = (PctBoys/100)*Number, 
  NumBoys = round(NumBoys)
)

HW1data

Group	Time	Number	PctBoys	NumBoys
Control	none	20337	51.2	10413
Exposed	13-16mo	71	52.1	37
Exposed	12-7mo	789	49.6	391
Exposed	0-6mo	1922	48.9	940
Exposed	1st trimester	290	46.0	133

Use an appropriate R command or commands to change the format to something else (you can pick what).

HW1data |> expand.dft(freq = "NumBoys") |> head()

Group	Time	Number	PctBoys
Control	none	20337	51.2
Control	none	20337	51.2
Control	none	20337	51.2
Control	none	20337	51.2
Control	none	20337	51.2
Control	none	20337	51.2

y <- xtabs(NumBoys ~ Group + Time + Number, HW1data)
y

, , Number = 71

         Time
Group     0-6mo 12-7mo 13-16mo 1st trimester  none
  Control     0      0       0             0     0
  Exposed     0      0      37             0     0

, , Number = 290

         Time
Group     0-6mo 12-7mo 13-16mo 1st trimester  none
  Control     0      0       0             0     0
  Exposed     0      0       0           133     0

, , Number = 789

         Time
Group     0-6mo 12-7mo 13-16mo 1st trimester  none
  Control     0      0       0             0     0
  Exposed     0    391       0             0     0

, , Number = 1922

         Time
Group     0-6mo 12-7mo 13-16mo 1st trimester  none
  Control     0      0       0             0     0
  Exposed   940      0       0             0     0

, , Number = 20337

         Time
Group     0-6mo 12-7mo 13-16mo 1st trimester  none
  Control     0      0       0             0 10413
  Exposed     0      0       0             0     0