DMA Midterm

Midterm notes for HBA1 DMA Course.

Decision Trees

Regression

Definition of a Regression

Running a Regression

Below are the steps for performing a regression analysis (a code sketch of the full loop follows the list). The midterm itself will not ask you to perform these steps, but may ask you questions regarding this procedure.

  1. Think about variable dependencies. If there are obvious ones such as “year built” vs “age,” or possible ones such as “graduated university” and “qualified,” take note of them before moving on. These notes will help guide your later analysis and your qualitative interpretation.

  2. Look for dependencies among your variables.

You can do so either visually using a scatter plot or numerically via a correlation matrix.

  • It is “good” to have relationships between the $x$ variables and the $y$ variable, but it is not necessarily bad if there aren’t strong relationships here.
  • It is “bad” to have a strong correlation among the $x$ variables themselves (multicollinearity). If this is the case, then you should pick one of the correlated variables and drop the rest. This will prevent the regression from “blowing up.”
  3. Check p-values: if any are above 0.05, mark those variables as insignificant. Do not drop more than one variable at a time. Instead, drop one of the insignificant variables and re-run the regression. Repeat until no insignificant variables remain.

  4. If all variables are significant, now check the residual plots for all of the $x$ variables and for the $y$ variable.

We want an even cloud above and below the line at all points. Below are examples of “bad” residual plots:

  1. There is a portion of the residual plot where there are more dots above the line than below it. This means the variable is a bad fit in that region. Consider adding an Interaction Variable. After adding any new variable, repeat from step 2.
  2. The plot “funnels in” or “funnels out.” While this may still be the best-fit line, it means your standard error is not consistent across the range (heteroskedasticity), so it cannot be trusted. Consider dropping this variable and restarting from step 3.
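Here is a minimal sketch of this loop in Python, assuming pandas, statsmodels, and matplotlib, with a made-up houses.csv dataset (the file and column names are hypothetical, and this is not the course’s official tooling):

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical data: predict "price" from every other column.
df = pd.read_csv("houses.csv")           # assumed file; any DataFrame works
y = df["price"]
X = df.drop(columns=["price"])

# Step 2: correlation matrix. Look for x variables that are strongly
# correlated with each other, and drop all but one of each such group.
print(X.corr())

# Steps 3-4: fit, drop the single worst insignificant variable (p > 0.05),
# and re-run, repeating until every remaining variable is significant.
while True:
    model = sm.OLS(y, sm.add_constant(X)).fit()
    pvals = model.pvalues.drop("const")    # intercept is never a candidate
    if pvals.max() <= 0.05:
        break
    X = X.drop(columns=[pvals.idxmax()])   # drop ONE variable, then re-run

# Step 4 continued: eyeball the residual cloud for uneven or funnel shapes.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```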

Interpreting Results of a Regression

There are many results that standard regression analysis tools output.

Coefficients

This is the primary output of a regression. Using the outputted coefficients, we can construct a function to predict values in the form

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

It’s important to remember here that a coefficient of 0.00001 does not mean the variable is insignificant. That’s what the p-values are for! It’s very possible that while $b_1 = 0.00001$, $x_1$ was actually the weight of a container ship measured in grams! That would make the term $b_1 x_1$ pretty large, wouldn’t it…
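To make this concrete (the numbers are invented): a loaded container ship weighs on the order of $2 \times 10^{11}$ grams, so

$$b_1 x_1 = 0.00001 \times 2 \times 10^{11} = 2{,}000{,}000$$

Hardly an insignificant term.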

$R^2$ Values

An $R^2$ value tells us how well our line fits our data. This, however, is not by itself a good indicator of a good regression, because even insignificant variables can improve our fit.

Imagine I asked you to predict the grades of everyone in the class based on study time. You get an $R^2$ value of 0.7. Great! But now, I add in the heights of everyone in the class. By sheer luck, it’s possible that there’s some tiny amount of correlation between height and grades that brings your $R^2$ value to 0.8. Does this make it a good regression? No! Height is likely insignificant.
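This is easy to demonstrate with simulated data (a sketch, not course material): height below is pure noise, yet $R^2$ can still tick up when we add it.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
study = rng.uniform(0, 10, n)
grades = 60 + 3 * study + rng.normal(0, 5, n)  # grades driven by study time
height = rng.normal(170, 10, n)                # pure noise, unrelated to grades

m1 = sm.OLS(grades, sm.add_constant(pd.DataFrame({"study": study}))).fit()
m2 = sm.OLS(grades, sm.add_constant(pd.DataFrame({"study": study,
                                                  "height": height}))).fit()

print(m1.rsquared, m2.rsquared)          # R^2 can only go up (or stay equal)
print(m1.rsquared_adj, m2.rsquared_adj)  # adjusted R^2 usually drops here
```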

Adjusted $R^2$ Values

We noted above in [[#r-2-values|$R^2$ Values]] that $R^2$ is not a good measure of the quality of a regression, because I can just add more and more variables to increase it.

Well, adjusted $R^2$ just takes our $R^2$ value and adjusts it downward for the number of variables we included. This new value is a good measure of the effectiveness of our regression.
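Concretely, with $n$ observations and $k$ predictors (the same $n$ and $k$ as in the standard error formula below), the standard adjustment is:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

Each extra variable increases $k$, so a new variable only raises adjusted $R^2$ if it improves the fit by more than the penalty costs.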

For more information see: Adjusted $R^2$

Standard Error of the Regression

Also called the residual standard error, it measures the average distance that the observed values fall from the regression line.

Formula:

$$s_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}}$$

where:

  • $e_i$ are the residuals,
  • $n$ = number of observations,
  • $k$ = number of predictors.

IN THIS COURSE we use $s_e$ as the standard deviation of a predicted value. Thus, if asked “what is the distribution of our predicted values,” we can put it on a Normal Distribution with a mean of $\hat{y}$ and a standard deviation of $s_e$.
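For example (made-up numbers): if the regression predicts $\hat{y} = 500$ with $s_e = 25$, the chance the actual value exceeds 540 is a normal-tail lookup:

```python
from scipy.stats import norm

y_hat = 500.0  # predicted value from the regression (invented)
s_e = 25.0     # standard error of the regression (invented)

# Per the course convention above, the prediction is Normal(y_hat, s_e).
p = 1 - norm.cdf(540, loc=y_hat, scale=s_e)
print(f"P(actual > 540) = {p:.3f}")  # about 0.055
```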

Other Information

Dummy Variables

It is vitally important that, when giving variables to a regression, none of the variables can be used to describe one another. More simply: the variables must be truly independent.

This is because the regression machine will try tweaking each of the variables slightly to see what values give us the best results. If two of the levers it can adjust do effectively the same thing, it gets confused and “blows up.”

So, let’s say we are trying to represent a cardinal direction (North, South, East, West). A natural first attempt is one column per direction:

| North | South | East | West |
| ----- | ----- | ---- | ---- |
| 1     | 0     | 0    | 0    |
| 0     | 1     | 0    | 0    |
| 0     | 0     | 1    | 0    |
| 0     | 0     | 0    | 1    |

But notice something important:
If South, East, and West are all 0, then we automatically know North must be 1.

That means the North column is redundant information and can be perfectly predicted by the other three.

The solution is simple:

  • Drop one dummy variable from the set (often called the reference category).

So, we could drop “North” and let it be the baseline:

| South | East | West | Interpretation |
| ----- | ---- | ---- | -------------- |
| 0     | 0    | 0    | North          |
| 1     | 0    | 0    | South          |
| 0     | 1    | 0    | East           |
| 0     | 0    | 1    | West           |

Now, the regression interprets each coefficient as the effect relative to North, and we’ve avoided redundancy.

Additional Note: We could’ve picked any of them to drop. I just picked North, but you could’ve picked East or West or… you get the idea.
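If you build dummies with pandas (a sketch; the column name is made up), get_dummies can drop the reference category for you. Note that drop_first=True drops the alphabetically first level (East here), not North, so build the columns manually if you care which one is the baseline:

```python
import pandas as pd

df = pd.DataFrame({"direction": ["North", "South", "East", "West", "North"]})

# drop_first=True keeps k-1 columns for k categories; the dropped level
# (alphabetically first, so "East") becomes the baseline.
dummies = pd.get_dummies(df["direction"], drop_first=True)
print(dummies)
```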