10 Final Project
Rubric
No rubric has been posted yet.
Instructions
Choose a dataset, even a personal one with instructor approval.
- Census income
- Forest Cover Type
- Crash data: combine 3 or more of the crash data files in the VGAM package (see ?crashi)
- or one with instructor approval
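For the crash data option, one possible way to build a single dataset from three of the VGAM tables is to reshape each table to long format and merge on hour and day. This is only a sketch: it assumes each table has hours as rows and days of the week as columns (check ?crashi to confirm), and it uses simulated stand-in tables so that it runs without VGAM. The helper names to_long and fake_table, and the choice of three tables, are made up for illustration.

```r
# Reshape an hour-by-day table (rows = hours, columns = days) to long format.
# `value_name` is the column name given to the counts.
to_long <- function(tab, value_name) {
  long <- data.frame(
    hour = rep(as.integer(rownames(tab)), times = ncol(tab)),
    day  = rep(colnames(tab), each = nrow(tab)),
    stringsAsFactors = FALSE
  )
  # as.matrix() unrolls column by column, matching the hour/day order above.
  long[[value_name]] <- as.vector(as.matrix(tab))
  long
}

# Stand-in tables with simulated counts (NOT real crash data).
# With VGAM loaded, tables such as crashi, crashmc, and crashbc would
# take their place, assuming they share this 24 x 7 layout.
set.seed(1)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
fake_table <- function(rate) {
  m <- matrix(rpois(24 * 7, rate), nrow = 24,
              dimnames = list(as.character(0:23), days))
  as.data.frame(m)
}
injuries   <- fake_table(30)
motorcycle <- fake_table(3)
bicycle    <- fake_table(2)

# Merge the three long tables into one row per (hour, day) combination.
combined <- Reduce(
  function(x, y) merge(x, y, by = c("hour", "day")),
  list(to_long(injuries, "injuries"),
       to_long(motorcycle, "motorcycle"),
       to_long(bicycle, "bicycle"))
)
head(combined)
```

The long layout gives one response column per crash type, which makes it easier to model, say, motorcycle crashes as a function of hour and day.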
Progress Report is two questions and a description of the data:
- 2 Substantive questions
- At least 1 explanatory, to understand variation in a response on the basis of several explanatory variables.
- Describe the data in your own words
- Clear indications of the response variables you will use and the methods you will use to address your questions.
- At least one and no more than three exploratory data plots relevant to your analysis.
Final Project is a cohesive report that explains:
- the data,
- your questions,
- your approach,
- any problems you encounter,
- additional assumptions you have to make, and
- a summary of your findings.
More info on reports in the week 10 reading, Data Analysis Writing.
Overview
For this project, you must work on your own. You may interact by email with the course TA and/or the course instructor, and you may ask questions on the Project Canvas Discussion, but you may not interact with any of your fellow students outside of the Project Canvas Discussion. You may choose one of the following datasets for your project. You may also use a dataset of your own choosing, but you must get the instructor’s approval first.
Datasets
Census income (https://archive.ics.uci.edu/ml/datasets/Adult). Potential responses: whether or not a person makes more than 50k; whether or not they completed college; etc.
Forest cover type (https://archive.ics.uci.edu/ml/datasets/Covertype). Potential responses: cover type (combine categories somehow); whether or not the area is a wilderness area; etc.
Crash data: there are eight different data files in the VGAM package in R (after loading the VGAM package, type ?crashi to learn about them). We looked at the crashi data in Lab 8, but you may take any combination of 3 or more of these data files to construct a new dataset and address some interesting questions about crashes. Potential responses: number of crashes involving motorcycles; number of crashes involving alcohol; etc.
Deadlines and Deliverables
There are two deadlines:
- Deadline 1: due end of week 9. This is a progress report, submitted as a PDF generated from an R Markdown file. It should include the following:
  - A brief description of the data you are using for your project and the questions that you will address. You must address at least two substantive questions about the data you choose. We expect that at least one of the questions will be explanatory in nature; that is, you’ll use the data to understand variation in a response on the basis of several explanatory variables. Also, your description of the data should go beyond saying which of the three options above you chose: describe the data in your own words.
  - Clear indications of (a) the response variable(s) you will use and (b) the methods (e.g., logistic regression) you will use to address your questions of interest.
  - At least one and no more than three exploratory data plots relevant to your analysis.
- Deadline 2: due end of week 10. This is the completed project report. The deliverable is a PDF report that you created from an R Markdown file. It is your responsibility to verify that your code is well documented and completely self-contained.
Please note that your report should not be a sequential list of what you did. Rather, it should be a cohesive report that explains the data, your questions, your approach, any problems you encounter or additional assumptions you have to make, and a summary of your findings.
We have uploaded a document titled DataAnalysisReports.pdf that provides some guidance about the anatomy of your report.
- Audience is my supervisor
- Outline
  - Executive Summary
    - Summary of study
    - Answer to questions
  - Body
    - Data
    - Analysis
    - Results
  - Discussion
    - Reprise the questions and results of the Executive Summary
    - New questions, going forward
  - Appendices
It’s not just a report; it’s a conversation between you and your supervisor.
Statistical Summary Example:
There is convincing evidence of a difference in the mean response between two treatment groups, A and B (p < 0.0001 from Welch’s t-test on 32.4 df). The 95% confidence interval for this difference runs from 1.34 to 3.51 units, with the mean for treatment A being higher than the mean for treatment B.
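The numbers in a summary like this can be pulled straight from a t.test() result. The sketch below uses simulated data, so the groups, sample sizes, and values are hypothetical and will not reproduce the figures in the example above. It relies on the fact that base R's t.test() runs Welch's test by default (var.equal = FALSE).

```r
# Simulated responses for two hypothetical treatment groups (illustration only).
set.seed(42)
a <- rnorm(20, mean = 12, sd = 2)
b <- rnorm(18, mean = 10, sd = 3)

# Welch's two-sample t-test (the default when var.equal = FALSE).
fit <- t.test(a, b)

df    <- unname(fit$parameter)  # Welch-Satterthwaite df, usually non-integer
p_val <- fit$p.value            # two-sided p-value
ci    <- fit$conf.int           # 95% CI for mean(A) - mean(B)

# Assemble the pieces needed for the written summary.
cat(sprintf(
  "%s on %.1f df: p = %s; 95%% CI for the difference runs from %.2f to %.2f units\n",
  trimws(fit$method), df, format.pval(p_val, digits = 2), ci[1], ci[2]
))
```

Reporting the test, the degrees of freedom, the p-value, and the interval together lets the reader judge both the strength of evidence and the size of the difference.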
On using “statistically significant” in writing:
Use the p-value to convey strength of evidence: a p-value in the range of p < 0.0001 provides convincing evidence against the null hypothesis, whereas a p-value around p = 0.05 provides only marginal evidence against the null hypothesis.
Don’t use “I am 95% confident that…”