DMA Midterm
Midterm notes for HBA1 DMA Course.
Decision Trees
Regression
Definition of a Regression
A regression estimates the relationship between a dependent ($y$) variable and one or more independent ($x$) variables, producing an equation that predicts $y$ from the $x$ values.
Running a Regression
Below are the steps for performing a regression analysis. The midterm itself will not ask you to perform these steps, but it may ask questions about this procedure. A code sketch of the full procedure follows the list.
1. Think about variable dependencies. If there are obvious ones such as “year built” vs. “age”, or possible ones such as “graduated university” and “qualified”, take note of them before moving on. These notes help guide your later analysis or your qualitative analysis.
2. Look for dependencies among your variables. You can do so either visually using a scatter plot or numerically via a correlation matrix.
    - It is “good” to have a relationship between the $x$ variables and the $y$ variable, but not necessarily bad if there aren’t strong relationships here.
    - It is “bad” to have a strong correlation among the $x$ variables. If this is the case, then you should pick one of the correlated variables and drop the rest. This will prevent the regression from “blowing up.”
3. Check p-Values: if any are above 0.05, mark that variable to be dropped. Do not drop more than one variable at a time. Instead, drop one of the insignificant variables and re-run the regression. Repeat until no insignificant variables remain.
4. If all variables are significant, check the residual plots for each $x$ variable and for the predicted $\hat{y}$ values. We want an even cloud above and below the line at all points. Below are examples of “bad” residual plots:
    - There is a portion of the plot where there are more dots above the line than below. This means the variable is a bad fit there. Consider adding an Interaction Variable. After adding any new variable, repeat from step 2.
    - The plot “funnels in” or “funnels out.” While this may just be the best fit, it means your standard error is not consistent across the data. Consider dropping this variable and restarting from step 3.
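The midterm will not ask for code, but here is a minimal sketch of the same procedure in Python, assuming pandas and statsmodels with a hypothetical DataFrame `df` (target column `y`, candidate predictors `x1`, `x2`, `x3`):

```python
import pandas as pd
import statsmodels.api as sm

predictors = ["x1", "x2", "x3"]  # hypothetical candidate variables

# Step 2: look for dependencies among the variables.
# Strongly correlated predictors should be pruned down to one.
print(df[predictors].corr())

# Steps 3-4: fit, then drop the single worst insignificant variable and re-run.
while True:
    X = sm.add_constant(df[predictors])  # add the intercept term
    model = sm.OLS(df["y"], X).fit()
    pvals = model.pvalues.drop("const")  # ignore the intercept's p-value
    if pvals.max() <= 0.05:
        break                            # every remaining variable is significant
    predictors.remove(pvals.idxmax())    # drop ONE variable at a time

# Step 4: inspect the residuals for an even cloud / no funnel shape,
# e.g. by plotting model.resid against each predictor.
residuals = model.resid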
Interpreting Results of a Regression
There are many results that standard regression analysis tools output.
Coefficients
This is the primary output of a regression. Using the outputted coefficients, we can construct a function to predict values in the form
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$
It’s important to remember here that a coefficient of 0.00001 does not mean the variable is insignificant. That’s what the p-Values are for! It’s very possible that while $b_1 = 0.00001$, $x_1$ was actually the weight of a container ship measured in grams! That would make $b_1 x_1$ pretty large, wouldn’t it…
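A quick worked check with made-up numbers: a 50,000-tonne ship is $5 \times 10^{10}$ grams, so
$$b_1 x_1 = 0.00001 \times 5 \times 10^{10} = 5 \times 10^{5}.$$
That “tiny” coefficient contributes 500,000 to the prediction.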
$R^2$ Values
An $R^2$ value tells us how well our line fits our data. This, however, is not a good indicator of a good regression, because even insignificant variables can improve our fit.
Imagine I asked you to predict the grades of everyone in the class based on study time. You get an $R^2$ value of 0.7. Great! But now I add in the heights of everyone in the class. By sheer luck, it’s possible that there’s some tiny amount of correlation between height and grades that brings your $R^2$ value to 0.8. Does this make it a good regression? No! Height is likely insignificant.
Adjusted $R^2$ Values
We noted above in [[#r-2-values|$R^2$ Values]] that $R^2$ is not a good measure of the quality of a regression, because I can just add more and more variables to increase it.
Well, adjusted $R^2$ just takes our $R^2$ value and adjusts it for the number of variables we included. This new value is a good measure of the effectiveness of our regression.
For more information see: Adjusted R2
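Continuing the hypothetical statsmodels sketch from earlier, both values come straight off the fitted model:

```python
# R^2 never falls when junk variables are added;
# adjusted R^2 penalizes the extra variables instead.
print(model.rsquared)      # plain R^2
print(model.rsquared_adj)  # adjusted R^2
```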
Standard Error of the Regression
Also called the residual standard error, it measures the average distance that the observed values fall from the regression line.
Formula:
$$S = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}}$$
where:
- $e_i$ are the residuals,
- $n$ = number of observations,
- $k$ = number of predictors.
IN THIS COURSE we use $S$ as the standard deviation of a predicted value. Thus, if asked “what is the distribution of our predicted values,” we can put it on a Normal Distribution with a mean of $\hat{y}$ and a standard deviation of $S$.
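Under the same assumptions as the earlier sketch (hypothetical `df`, `predictors`, and fitted `model`), $S$ and the course’s Normal-distribution interpretation might look like this; the 95% level is just an example:

```python
import numpy as np
from scipy import stats

n = len(df)          # number of observations
k = len(predictors)  # number of predictors
S = np.sqrt((model.resid ** 2).sum() / (n - k - 1))

# Course interpretation: a prediction is Normal(mean = y_hat, sd = S),
# so an approximate 95% range around the first fitted value is:
y_hat = model.fittedvalues.iloc[0]
low, high = stats.norm.interval(0.95, loc=y_hat, scale=S)
```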
Other Information
Dummy Variables
It is vitally important that, when giving variables to a regression, none of the variables can be used to describe one another. Or, more simply, the variables must be truly independent.
This is because the regression machine will try tweaking each of the variables slightly to see what values give us the best results. If two of the levers it can adjust do effectively the same thing, it gets confused and “blows up.”
So, let’s consider trying to represent a cardinal direction (North, South, East, West):
| North | South | East | West |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
But notice something important:
If South, East, and West are all 0, then we automatically know North must be 1.
That means the North column is redundant information and can be perfectly predicted by the other three.
The solution is simple:
- Drop one dummy variable from the set (often called the reference category).
So, we could drop “North” and let it be the baseline:
| South | East | West | Interpretation |
|---|---|---|---|
| 0 | 0 | 0 | North |
| 1 | 0 | 0 | South |
| 0 | 1 | 0 | East |
| 0 | 0 | 1 | West |
Now, the regression interprets each coefficient as the effect relative to North, and we’ve avoided redundancy.
Additional Note: We could’ve picked any of them to drop. I just picked North, but you could’ve picked East or West or… you get the idea.
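As a sketch, this is exactly what pandas’ `get_dummies` does with `drop_first=True` (the data below is made up; note that pandas drops the alphabetically first category, “East”, rather than “North”):

```python
import pandas as pd

df = pd.DataFrame({"direction": ["North", "South", "East", "West", "North"]})

# drop_first=True removes one dummy (the reference category) to avoid
# the redundancy described above; all-zeros then means "East".
dummies = pd.get_dummies(df["direction"], drop_first=True)
print(dummies)  # columns: North, South, West
```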