In this Statistical Communication exercise, the learners take an already completed regression analysis and write a report of 250-400 words describing the analysis. This exercise consists of a 40-50 minute example that the teacher goes through to demonstrate and establish expectations, followed by a 50-70 minute period for the learners to emulate that writing process on a new analysis.
The example analysis is a logistic regression model selected through a stepwise process. To save time, the stepwise process can be skipped, as can the explanation of any variables not used in the final model. The data are from UCI's Breast Cancer dataset, found here (https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data) and described here (https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.names).
The graphs and R output from the analysis are as follows:
Figure list:
### Figure 1: First six rows of the dataset
head(dat)
### Figure 2: Summary statistics of the dataset
summary(dat)
### Figure 3: Summary of full model
mod_full = glm(Class ~ ., family="binomial", data=dat)
summary(mod_full)
### Figure 4: Odds ratios of full model
round(exp(mod_full$coef),3)
### Figure 5: Summary of stepwise fitted model
library(MASS) # provides stepAIC
mod_step <- stepAIC(mod_full, direction="both") # start the search from the full model
summary(mod_step)
### Figure 6: Odds ratios of stepwise model
round(exp(mod_step$coef),3)
### Figure 7: Normal Quantile-Quantile Plot
### Figure 8: Leverage Plot
plot(mod_step)
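The figure code above assumes a data frame named `dat` already exists. A minimal sketch of one way to create it is below. Only the `Class` column name is confirmed by the model formula; the other column names are assumptions based on the attribute list in the breast-cancer.names file, the `?` missing-value code is likewise assumed from that file, and `load_breast_cancer` is a hypothetical helper name.

```r
# Hypothetical helper: read the UCI breast-cancer file into a data frame.
# Column names are assumed from breast-cancer.names; the file has no header row.
load_breast_cancer <- function(path) {
  col_names <- c("Class", "age", "menopause", "tumor_size", "inv_nodes",
                 "node_caps", "deg_malig", "breast", "breast_quad", "irradiat")
  dat <- read.csv(path, header = FALSE, col.names = col_names,
                  na.strings = "?", stringsAsFactors = TRUE)
  # Degree of malignancy is coded 1/2/3; treat it as a category, not a number.
  dat$deg_malig <- factor(dat$deg_malig)
  dat
}

# Usage (needs network access, so it is left commented out):
# dat <- load_breast_cancer("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data")
```

Reading every column as a factor matches the write-up, which treats all explanatory variables as categorical.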
Use the following checklist as a template.
1. What is the purpose of the analysis?
Specifically: What is the response variable, and what are we trying to do with it?
In this example: Trying to classify recurrence or non-recurrence of cancer using the other variables available. Since this is a binary response, we will use logistic regression.
2. What is the relevant data available for this analysis?
Specifically: How many cases do we have? What format are the variables in? What are the most common responses and what are their distributions?
In this example: Recurrence within five years (categorical, binary, response), Age (categorized by decade), Pre or Post Menopause, etc.
3. What sort of data preparation was done?
Specifically: What was done to the data in order to make the analysis possible?
In this example: We removed cases for which some of the data was missing, and cases that were in rare categories, such as age < 20 and age > 70.
4. What is the model that was used?
In this example: A logistic model of three variables.
5. How did you come up with this model?
In this example: With stepwise regression, working in both directions, optimizing on AIC.
6. What are some key features of the model?
Specifically: Draw by hand a mock-up of what a table would look like when describing the model.
What are the significant variables? What does their effect size mean? Use the general-example-exception principle to tell a story about a typical case or a typical effect size.
7. How well does this model work?
Specifically: Use the summary information and the diagnostic plots to explain how well the model assumptions fit and how well the model performs.
Example written as demonstration
1. What is the purpose of the analysis?
To predict whether there is a recurrence of cancer (e.g. in the next five years).
Predicting a binary outcome (logistic regression, classification).
2. What is the relevant data available for this analysis?
We have a dataset from UCI, of 10 variables of the breast cancer history of 256 women. These variables include
‘Class’, a binary response variable of recurrence or return of cancer within 5 years,
‘Age’ as a categorical variable (10-19, 20-29, 30-39, … , 90-99),
menopausal status (binary),
tumor size in mm (categorical 0-4, 5-9, … , 50-54),
number of tumor nodes (categorical 0-2, 3-5, 6-8, …),
whether the nodes are capped (binary),
degree of malignancy (categorical 1,2,3),
breast (binary),
quadrant (categorical, 5 categories),
radiation therapy used (binary).
3. What sort of data preparation was done?
Before analysis, we removed cases with missing data, and those with rare categories (e.g. ages less than 20 or more than 70, menopausal before 40).
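The preparation described above could be sketched roughly as follows. The column names `age` and `menopause`, the level labels (`10-19`, `lt40`, and so on), and the exact set of age bands treated as rare are assumptions about how the data is coded; `prepare_cases` is a hypothetical helper, and the toy data frame is made up only so the block runs on its own.

```r
# Hypothetical helper: drop incomplete cases and cases in rare categories.
prepare_cases <- function(dat,
                          rare_ages = c("10-19", "70-79", "80-89", "90-99"),
                          rare_menopause = "lt40") {
  dat <- dat[complete.cases(dat), ]          # remove cases with missing data
  keep <- !(dat$age %in% rare_ages) & !(dat$menopause %in% rare_menopause)
  droplevels(dat[keep, ])                    # forget the now-empty factor levels
}

# Toy data, made up for illustration only
toy <- data.frame(
  Class     = factor(c("no", "yes", "no", "yes", "no")),
  age       = factor(c("30-39", "10-19", "40-49", "50-59", "70-79")),
  menopause = factor(c("premeno", "premeno", "lt40", NA, "ge40"))
)
clean <- prepare_cases(toy)
nrow(clean)
```

Calling `droplevels` matters here because an empty level left in a factor would otherwise still show up in `summary(dat)`.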
4. What is the model that was used?
We attempted a full model, using every explanatory variable in the dataset. This model, however, was very difficult to interpret and of little use to clinics and hospitals. Only one parameter (malignancy 3 vs. malignancy 1) was statistically significant. Due to these difficulties we opted for a simpler model instead:
Log-Odds (Class) as a function of (number of nodes), (malignancy), and (radiation usage).
5. How did you come up with this model?
We started with a full model of all 9 variables without any interactions, and we used stepwise variable selection optimized on the Akaike Information Criterion to come up with a simpler model.
6. What are some key features of the model?
Residual deviance is 263 compared to a null deviance of 310, which equates roughly to an r-squared of 1 - (263/310) = 0.15. So this model does not predict very well whether a recurrence happens or not. It should be noted that the full model only has an r-squared equivalent of 0.20.
However, we do have some useful indicators. The number of nodes matters substantially:
The odds of recurrence are 2.74 (CI: 2-4) and 2.90 (CI: 2.1-4.2) times as high for women with 3-5 and 6-8 nodes respectively, when compared to those having 0-2 nodes, holding other variables constant. When there are 9 or more nodes, the odds of recurrence are 6.42 (CI: 5-10) times as high.
The odds of recurrence for malignancy 1 and 2 are about the same, but the odds are about 4 times as high for those who have stage 3 cancer (malignancy 3).
The odds of recurrence when radiation was used are unclear due to a large standard error.
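As a rough check on the arithmetic above, the deviance-based r-squared equivalent and the odds ratios with intervals can be computed as sketched below. The helper names are hypothetical, the intervals are Wald intervals from `confint.default` (the write-up does not say which interval method was used), and the `mtcars` model is only a stand-in so the block runs without the cancer data.

```r
# Deviance-based r-squared equivalent: 1 - residual deviance / null deviance
pseudo_r2 <- function(resid_dev, null_dev) 1 - resid_dev / null_dev

round(pseudo_r2(263, 310), 2)   # the deviances quoted in the write-up

# Hypothetical helper: odds ratios with Wald confidence limits for a fitted glm
odds_ratio_table <- function(model, level = 0.95) {
  est <- coef(model)
  ci  <- confint.default(model, level = level)  # Wald interval on the log-odds scale
  round(exp(cbind(OR = est, ci)), 3)            # exponentiate to the odds scale
}

# Stand-in model on a built-in dataset (not the cancer data)
stand_in <- glm(am ~ wt, family = "binomial", data = mtcars)
pseudo_r2(stand_in$deviance, stand_in$null.deviance)
odds_ratio_table(stand_in)
```

With the real analysis, `stand_in` would be replaced by `mod_step`, and the resulting table is close to the hand-drawn mock-up the checklist asks for.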
7. How well does this model work?
A normal quantile-quantile plot reveals that there is a major break from normality in the residuals. We are not concerned about this because of the binary nature of the responses. Furthermore, we would expect leaps in the Q-Q plot because every variable we used is categorical, so a smooth progression is nearly impossible.
A leverage plot does not reveal any outliers or overly leveraged points. There are two notable cases with leverage that are potentially influential on the model; however, neither of these deviates from the model as a whole.
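The two diagnostic plots discussed above correspond to panels 2 and 5 of R's `plot` method for fitted models. A minimal sketch for drawing them and listing high-leverage cases follows. The cutoff of twice the mean leverage is a common rule of thumb chosen here as an assumption, `flag_high_leverage` is a hypothetical helper, and the `mtcars` model is a stand-in for `mod_step`.

```r
# Hypothetical helper: list cases whose leverage exceeds a rule-of-thumb cutoff
flag_high_leverage <- function(model, multiple = 2) {
  h <- hatvalues(model)                       # leverage of each case
  flagged <- h > multiple * mean(h)
  data.frame(case = names(h)[flagged],
             leverage = unname(h[flagged]),
             cooks_distance = unname(cooks.distance(model)[flagged]))
}

# Stand-in model on a built-in dataset (not the cancer data)
stand_in <- glm(am ~ wt, family = "binomial", data = mtcars)
flag_high_leverage(stand_in)

if (interactive()) {
  plot(stand_in, which = 2)   # normal quantile-quantile plot of residuals
  plot(stand_in, which = 5)   # residuals vs leverage, with Cook's distance contours
}
```

Listing the flagged cases next to their Cook's distance makes it easier to say in the report whether a high-leverage case is also influential.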
Exercise Portion
Comment: The 'exercise' dataset is the 'trees' data from the datasets package in base R. This analysis is simpler than the example: it is a linear regression rather than a logistic one, and the model is pre-selected rather than selected through a stepwise process. There are still some twists. Specifically, the 'species' category is meaningless (this is mentioned in the documentation), the model includes a polynomial term, and although the model fits reasonably well, there will be some diagnostic issues because the model is misapplied: volume should scale with height TIMES girth-squared, not height PLUS girth-squared.
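For the instructor's reference, the multiplicative relationship mentioned in the comment can be checked with a log-log fit, sketched below. This is not part of the learner exercise; it assumes the unmodified `trees` data from base R, and `mod_log` is a name introduced here. If volume scales with height times girth-squared, the slopes on log(Height) and log(Girth) should come out near 1 and 2.

```r
# Taking logs turns Volume = k * Height * Girth^2 into a linear model:
# log(Volume) = log(k) + 1 * log(Height) + 2 * log(Girth)
mod_log <- lm(log(Volume) ~ log(Height) + log(Girth), data = trees)
round(coef(mod_log), 2)
```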
Dataset information
This data set provides measurements of the girth, height and volume of timber in 31 felled black cherry trees. Note that girth is the diameter of the tree (in inches) measured at 4 ft 6 in above the ground.
A data frame with 31 observations on 3 variables.
[,1] Girth numeric Tree diameter in inches (1 inch = 2.5 cm)
[,2] Height numeric Height in ft (1 foot = 12 inches = 30 cm)
[,3] Volume numeric Volume of timber in cubic ft
[,4] Species categorical Made up entirely
Source
Ryan, T. A., Joiner, B. L. and Ryan, B. F. (1976) The Minitab Student Handbook. Duxbury Press.
Figure list:
### Figure 1: Raw Data of trees
trees
### Figure 2: Summary information
summary(trees)
### Figure 3: Summary of polynomial model
# I() is needed so that Girth^2 is fitted as a squared term
mod_poly = lm(Volume ~ Species + Girth + I(Girth^2) + Height, data=trees)
summary(mod_poly)
### Figure 4: Predicted vs Actual
plot(predict(mod_poly) ~ trees$Volume)
### Figure 5: Predicted vs Residual
plot(mod_poly$resid ~ predict(mod_poly))
### Figure 6: Normal Quantile-Quantile Plot
### Figure 7: Leverage Plot
plot(mod_poly)
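One detail worth checking in the Figure 3 code: inside an R formula, `^` means term crossing rather than arithmetic, so a squared term has to be wrapped in `I()` to be fitted as a polynomial term. A small sketch on the unmodified `trees` data (without the made-up Species column; the model names are introduced here for illustration):

```r
# Without I(), Girth^2 collapses to plain Girth, so no squared term is fitted
mod_no_square <- lm(Volume ~ Girth + Girth^2 + Height, data = trees)
names(coef(mod_no_square))

# With I(), the squared girth enters the model as its own column
mod_square <- lm(Volume ~ Girth + I(Girth^2) + Height, data = trees)
names(coef(mod_square))
```

Comparing the two sets of coefficient names is a quick way to confirm that the polynomial term is actually in the model before handing the output to learners.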
Checklist
1. What is the purpose of the analysis?
Specifically: What is the response variable, and what are we trying to do with it?
2. What is the relevant data available for this analysis?
Specifically: How many cases do we have? What format are the variables in? What are the most common responses and what are their distributions?
3. What is the model that was used?
4. What are some key features of the model?
Specifically: Draw by hand a mock-up of what a table would look like when describing the model.
What are the significant variables? What does their effect size mean? Use the general-example-exception principle to tell a story about a typical case or a typical effect size.
5. How well does this model work?