% Final
% Andrew G. Dunn^1^
% ^1^[email protected]
\vfill
Andrew G. Dunn, Northwestern University Predictive Analytics Program
Prepared for PREDICT-410: Regression & Multivariate Analysis.
Formatted using markdown, pandoc, and \LaTeX. References managed using BibTeX and pandoc-citeproc.
\newpage
When validating the in-sample fit of a linear regression model, what are two assumptions that must be validated using the model residuals, and how does one validate each of these assumptions?
Response:
By 'in-sample' I assume this inquiry to be about building models for inference.
Statistical inference is focused on a set of formal hypotheses, denoted by $H_0$ (the null hypothesis) and $H_1$ (the alternative hypothesis).
When we fit a statistical model, we have underlying assumptions about the probabilistic structure of that model. All of our statistical inference is derived from those probabilistic assumptions. Hence, if our estimated model, which depends upon the sample data, does not conform to these probabilistic assumptions, then our inference will be incorrect.[^1]
The two ways to validate a model in sample are examination of the R-Square and analysis of the residuals. Two assumptions that must be validated using the model residuals are:

- The relationship between the response and the regressors is linear, at least approximately. This is validated by plotting the residuals against the fitted values (and against each regressor): a patternless scatter about zero supports linearity, while systematic curvature indicates a violation.
- The errors are normally distributed (and uncorrelated). Normality is validated with a normal Q-Q plot or histogram of the residuals: points falling close to the reference line support the assumption (see the sketch after this list).
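As an illustration, here is a minimal residual-diagnostics sketch in Python. The data are simulated and the names `X` and `y` are assumptions, not part of the assignment:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical data: replace with the actual sample (variable names are assumptions).
rng = np.random.default_rng(410)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

# Fit OLS with an intercept and extract the residuals.
model = sm.OLS(y, sm.add_constant(X)).fit()
resid = model.resid

# Linearity: residuals vs fitted values should scatter randomly about zero.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(model.fittedvalues, resid)
axes[0].axhline(0, color="grey", linestyle="--")
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")
axes[0].set_title("Residuals vs fitted (linearity)")

# Normality: residual quantiles should track the 45-degree reference line.
sm.qqplot(resid, line="45", fit=True, ax=axes[1])
axes[1].set_title("Normal Q-Q plot (normality)")
plt.tight_layout()
plt.show()
```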
- Write down the null and alternate hypotheses for the t-test for X1.
- Compute the t-statistic associated with the regression coefficient for X1.
- Write down the null and alternate hypotheses for the Overall F-test.
- Compute the F-statistic for the Overall F-test.
- Compute the R-Squared value.
- Compute the Adjusted R-Squared value.
- Compute the AIC value.
- Compute the BIC value.
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
---|---|---|---|---|---
Model | 5 | 677.74649 | 135.54930 | |
Error | 18 | 153.76309 | 8.54239 | |
Corrected Total | 23 | 831.50958 | | |
Table: Analysis of Variance
Statistic | Value
---|---
Root MSE | 2.92274
Dependent Mean | 34.62917
Coeff Var | 8.44010
R-Square |
Adj R-Square |
Table: Estimator Performance
| Variable | DF | Parameter Estimate | Standard Error | t-value |
|---|---|---|---|---|
Table: Parameter Estimates
Response:
For the t-test for X1, the null and alternate hypotheses are:

$$H_0\colon \beta_1 = 0 \qquad H_1\colon \beta_1 \neq 0$$

where $\beta_1$ denotes the regression coefficient for X1.
With the table output of this model we have a fairly easy job of computing the t-statistic:

$$t = \frac{\hat{\beta}_1}{\operatorname{se}(\hat{\beta}_1)}$$
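Under $H_0$ this statistic follows a t-distribution with $n - p - 1 = 18$ degrees of freedom (the Error degrees of freedom in the ANOVA table), so at $\alpha = 0.05$ the decision rule is:

$$\text{reject } H_0 \text{ when } \lvert t \rvert > t_{0.025,\,18} \approx 2.101$$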
For the Overall F-test, the null and alternate hypotheses are:

$$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_5 = 0 \qquad H_1\colon \beta_j \neq 0 \text{ for at least one } j$$

The test statistic is $F = MSR/MSE$, which has an F-distribution with $p = 5$ and $n - p - 1 = 18$ degrees of freedom.
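Substituting the mean squares from the Analysis of Variance table:

$$F = \frac{MSR}{MSE} = \frac{135.54930}{8.54239} \approx 15.87$$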
We consider:
- The Total Sum of Squares is the total variation in the sample
- The Regression Sum of Squares is the variation in the sample that has been explained by the regression model
- The Error Sum of Squares is the variation in the sample that cannot be explained
Term | Definition | Formula
---|---|---
SST | Total Sum of Squares | $\sum_{i=1}^{n} (y_i - \bar{y})^2$
SSR | Regression Sum of Squares | $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
SSE | Error Sum of Squares | $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
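As a check, the decomposition $SST = SSR + SSE$ holds exactly for the Analysis of Variance table above:

$$831.50958 = 677.74649 + 153.76309$$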
The Coefficient of Determination, R-Square, is:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

With the table output of this model we have a fairly easy job of computing the R-Square value.
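Substituting the sums of squares from the Analysis of Variance table:

$$R^2 = \frac{677.74649}{831.50958} \approx 0.8151$$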
We consider the Adjusted R-Square, which penalizes the model for regressors that do not improve the fit. The standard regression notation uses $n$ for the number of observations and $p$ for the number of regressors; for this model $n = 24$ and $p = 5$. The Adjusted R-Square is:

$$R^2_{adj} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)}$$

With the table output of this model we have a fairly easy job of computing the Adjusted R-Square value.
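Substituting the table values, with $n = 24$ and $p = 5$:

$$R^2_{adj} = 1 - \frac{153.76309/18}{831.50958/23} = 1 - \frac{8.54239}{36.15259} \approx 0.7637$$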
In the case of ordinary least squares regression, the Akaike Information Criterion (stated up to an additive constant, with $p$ counting all estimated parameters, intercept included) is:

$$AIC = n \ln\!\left(\frac{SSE}{n}\right) + 2p$$
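Under this convention, substituting $n = 24$, $SSE = 153.76309$, and $p = 6$ (five coefficients plus the intercept):

$$AIC = 24 \ln\!\left(\frac{153.76309}{24}\right) + 2(6) \approx 44.58 + 12 = 56.58$$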
In the case of ordinary least squares regression, the Bayesian analogue (BIC, reported by some software as SBC) replaces the $2p$ penalty with $p \ln(n)$:

$$BIC = n \ln\!\left(\frac{SSE}{n}\right) + p \ln(n)$$
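Again with $n = 24$ and $p = 6$:

$$BIC = 24 \ln\!\left(\frac{153.76309}{24}\right) + 6 \ln(24) \approx 44.58 + 19.07 = 63.65$$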
Suppose that you were asked to build a linear regression model for a continuous response variable denoted by Y using only the categorical predictor variable gender. In your sample data set gender takes three values: M for male, F for Female, and U for Unknown (Missing).
We would model gender using indicator (dummy) variables. Because gender takes the values M, F, and U, it is a three-level categorical variable. We would specify the model as:

$$Y = \beta_0 + \beta_1 I_M + \beta_2 I_F + \varepsilon$$

where $I_M = 1$ for male observations (0 otherwise) and $I_F = 1$ for female observations (0 otherwise).

In this situation, even though gender has three levels, we include only two indicators and treat Unknown (U) as the base category, against which the other levels are assessed. Including indicators for all three levels alongside the intercept would create perfect collinearity: the dummy variable trap.
As we are modeling gender with indicator variables, we evaluate the gender effect by interpreting the fitted mean response at each of the three levels:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \text{ (M)}, \qquad \hat{Y} = \hat{\beta}_0 + \hat{\beta}_2 \text{ (F)}, \qquad \hat{Y} = \hat{\beta}_0 \text{ (U)}$$

By interpreting these three fitted means, or graphing them, we can consider the effect of gender on Y; formally, a gender effect corresponds to rejecting the joint hypothesis $H_0\colon \beta_1 = \beta_2 = 0$ with an F-test.
Suppose that in addition to gender, your model had to include a continuous predictor variable (X1). Including this variable, how do you test for a 'gender effect' on Y?
We would add the continuous variable X1 to our model as follows:

$$Y = \beta_0 + \beta_1 I_M + \beta_2 I_F + \beta_3 X_1 + \varepsilon$$
To test for a gender effect, we would evaluate the fitted model in the same fashion as above, now as three parallel lines in X1:

$$\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_1) + \hat{\beta}_3 X_1 \text{ (M)}, \qquad \hat{Y} = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_3 X_1 \text{ (F)}, \qquad \hat{Y} = \hat{\beta}_0 + \hat{\beta}_3 X_1 \text{ (U)}$$

A gender effect again corresponds to the joint hypothesis $H_0\colon \beta_1 = \beta_2 = 0$, which can be evaluated with a partial F-test comparing the models with and without gender; a sketch follows.
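As an illustration, a minimal sketch of this test in Python using statsmodels. The data are simulated, and every name here (`gender`, `x1`, `y`) is an assumption, not part of the original problem:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical sample: replace with the actual data (all names are assumptions).
rng = np.random.default_rng(410)
n = 120
df = pd.DataFrame({
    "gender": rng.choice(["M", "F", "U"], size=n),
    "x1": rng.normal(size=n),
})
df["y"] = (10
           + df["gender"].map({"M": 1.5, "F": -0.5, "U": 0.0})
           + 2.0 * df["x1"]
           + rng.normal(size=n))

# Order the categories so that U (Unknown) is the base level of the dummies.
df["gender"] = pd.Categorical(df["gender"], categories=["U", "M", "F"])

# Fit the reduced model (no gender) and the full model (gender + X1).
reduced = smf.ols("y ~ x1", data=df).fit()
full = smf.ols("y ~ gender + x1", data=df).fit()
print(full.params)  # Intercept, gender[T.M], gender[T.F], x1

# Partial F-test for a gender effect: H0 is beta_M = beta_F = 0.
print(anova_lm(reduced, full))
```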
Footnotes

[^1]: Dr. Chad Bhatti, *Statistical Inference Versus Predictive Modeling in OLS Regression*.