DMA - Regressions
Regression is one of the three core concepts in the HBA1 DMA course.
Scope
In this course, we will only ever do linear regressions.
When to Use Regression
We use regression when we want to find the relationship between variables, or in other words, the impact of one or more factors on some outcome. Some examples are below:
- How do a house’s location, size, age, etc. influence its market value?
- How do salespeople and location in a store influence how many items we sell?
- How do temperature, humidity, and wind speed influence ice cream sales?
How to Perform Regression
Performing a regression involves several nuanced steps. In practice, performing every step can be overcomplicated; nonetheless, this note will go through each step and its purpose.
Step 1: Think about Variable Dependencies
At this step, you want to think about whether any of your independent variables depend on each other. That is, if some variables obviously influence each other, such as “year built” and “age of ship”, or “graduated university” and “qualified”, then take note of them before moving on. These notes will help guide your analysis later, including your qualitative analysis.
Step 2: Check Correlations Among Variables
You can do this step two ways, either visually using a scatter plot, or numerically using a correlation matrix.
- It is “good” to have a relationship between the independent variables and the dependent variable, but it is not necessarily bad if there aren’t strong relationships here.
- It is “bad” to have a strong correlation among the independent variables. If this is the case, then you should pick one of the correlated variables and drop the rest. This is important, as the regression engine assumes that all the independent variables are independent of each other.
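As a sketch of the numerical route, here is one way to scan a correlation matrix for problem pairs. The variables and the 0.8 threshold are made up for illustration; "size_sqft" and "size_sqm" are deliberately the same measurement in different units:

```python
import numpy as np

# Hypothetical data: three candidate predictors, where "size_sqft" and
# "size_sqm" are really the same measurement in different units.
rng = np.random.default_rng(42)
size_sqft = rng.uniform(500, 3000, size=100)
size_sqm = size_sqft * 0.0929 + rng.normal(scale=5, size=100)  # near-duplicate
age_years = rng.uniform(0, 50, size=100)

X = np.column_stack([size_sqft, size_sqm, age_years])
corr = np.corrcoef(X, rowvar=False)  # 3x3 correlation matrix of the predictors

# Flag any pair of predictors whose |correlation| exceeds a chosen threshold
threshold = 0.8
n_vars = corr.shape[0]
flagged = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)
           if abs(corr[i, j]) > threshold]
print(flagged)  # the sqft/sqm pair gets flagged; drop one of the two
```

Any flagged pair is a candidate for the “pick one, drop the rest” rule above.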
Step 3: Run the Regression
After running the regression, the key step is to check the p-values. If any of them are above 0.05, mark that variable to be dropped. Do not drop more than one variable at a time. Instead, drop one of the insignificant variables and re-run the regression. Repeat until no insignificant variables remain.
Dropping one at a time is important: at one step you may have 3 insignificant variables, but dropping just one of them could cause the other two to become significant again!
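The one-at-a-time loop can be sketched in Python. This is a hedged illustration: the data is synthetic, and the OLS p-values are computed by hand with numpy/scipy, whereas in practice your regression tool reports them for you:

```python
import numpy as np
from scipy import stats

# Synthetic data: x1 and x2 genuinely drive y; noise_var is unrelated.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise_var = rng.normal(size=n)              # unrelated to y on purpose
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

def ols_pvalues(X, y):
    """Two-sided t-test p-values for each column of X (intercept first)."""
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - p)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(coef / se), df=n - p)

X = np.column_stack([np.ones(n), x1, x2, noise_var])
names = ["const", "x1", "x2", "noise"]

# Backward elimination: drop ONE insignificant variable per pass
while True:
    pvals = ols_pvalues(X, y)[1:]           # skip the intercept
    if pvals.size == 0:
        break
    worst = int(np.argmax(pvals))
    if pvals[worst] <= 0.05:
        break                               # everything left is significant
    X = np.delete(X, worst + 1, axis=1)
    names.pop(worst + 1)

print(names)  # the pure-noise variable will usually be eliminated
```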
Step 4: Check Residual Plots
Now, once all your variables are significant, you should check your residual plots!
Residual plots plot the residuals. A residual is the difference between the actual value and the predicted value. That is, if you drew your line of best fit, the vertical distance between an actual data point and that line is its residual.
What you want to see is an even cloud of points. If you see any patterns, such as regions where the residuals are all positive or all negative, or a funneling shape where the spread narrows or widens, then your model is not capturing some aspect of the data, and you may need to consider adding polynomial terms or interaction terms.
Solutions or responses to bad residual plots:
- If your bad residual plot is an independent variable plot, then this variable could be the problem. Consider dropping the variable or adding an interaction variable.
- If your bad residual plot is the “fitted values” plot, then consider interaction terms.
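As a rough numeric illustration of a “bad” residual plot (all data here is made up): if the true relationship is quadratic but we fit a straight line, the residuals show a clear pattern instead of an even cloud. The curvature check at the end is just a numeric stand-in for eyeballing the plot:

```python
import numpy as np

# The true relationship is quadratic, but we fit only a straight line.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = 2.0 + 3.0 * x + 1.5 * x**2 + rng.normal(scale=1.0, size=200)

# Fit the (misspecified) linear model y = b0 + b1*x by least squares
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
residuals = y - fitted

# Numeric stand-in for the residuals-vs-fitted plot: regress the residuals
# on fitted and fitted**2; a clearly nonzero quadratic coefficient means
# there is curvature the model failed to capture.
B = np.column_stack([np.ones_like(fitted), fitted, fitted**2])
curve_coef, *_ = np.linalg.lstsq(B, residuals, rcond=None)
print(curve_coef[2])  # clearly nonzero here -> pattern in the residuals
```

Adding an x² term to the model would flatten this pattern back into an even cloud.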
Interpreting Results of a Regression
There are many results that standard regression analysis tools output.
Coefficients
This is the primary output of a regression. Using the outputted coefficients, we can construct a function to predict values in the form
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$
It’s important to remember here that a coefficient of 0.00001 does not mean the variable is insignificant. That’s what the p-values are for! It’s very possible that while $\beta_1 = 0.00001$, $x_1$ was actually the weight of a container ship measured in grams! That would make $\beta_1 x_1$ pretty large, wouldn’t it…
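To make the units point concrete, here is a small sketch (made-up data and names) showing that re-expressing the same predictor in grams instead of kilograms shrinks its coefficient by 1000×, without making the predictor any less important:

```python
import numpy as np

# Hypothetical data: the same weight predictor in kilograms vs. grams.
rng = np.random.default_rng(7)
weight_kg = rng.uniform(50_000, 200_000, size=100)   # container ship weights
weight_g = weight_kg * 1000.0
cost = 5.0 + 0.002 * weight_kg + rng.normal(scale=10.0, size=100)

def slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    A = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

b_kg = slope(weight_kg, cost)
b_g = slope(weight_g, cost)
# b_g is 1000x smaller than b_kg, yet b_g * weight_g == b_kg * weight_kg:
# the predictor contributes exactly the same amount either way.
print(b_kg, b_g)
```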
$R^2$ Values
An $R^2$ value tells us how well our line fits our data. This, however, is not by itself a good indicator of a good regression, because even insignificant variables can improve our fit.
Imagine I asked you to predict the grades of everyone in the class based on study time. You get an $R^2$ value of 0.7. Great! But now, I add in the heights of everyone in the class. By sheer luck, it’s possible that there’s some tiny amount of correlation between height and grades that brings your $R^2$ value to 0.8. Does this make it a good regression? No! Height is likely insignificant.
Adjusted $R^2$ Values
We noted above in [[#r-2-values|$R^2$ Values]] that $R^2$ is not a good measure of the quality of a regression, because I can just add more and more variables to increase it.
Well, adjusted $R^2$ just takes our $R^2$ value and adjusts it for the number of variables we included. This new value is a good measure of the effectiveness of our regression.
For more information see: Adjusted R2
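A small sketch of both quantities on synthetic data, using the standard adjustment $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$: adding a junk predictor can only raise plain $R^2$, while adjusted $R^2$ penalizes the extra variable:

```python
import numpy as np

# Made-up grades example: study time drives grades, height does not.
rng = np.random.default_rng(3)
n = 30
study_time = rng.uniform(0, 10, size=n)
height = rng.normal(170, 10, size=n)            # unrelated to grades
grades = 50 + 4 * study_time + rng.normal(scale=8, size=n)

def r_squared(X, y):
    """Plain R^2 of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adj_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2_small = r_squared(study_time[:, None], grades)
r2_big = r_squared(np.column_stack([study_time, height]), grades)
# Plain R^2 can never decrease when a variable is added;
# adjusted R^2 discounts the fit for the extra variable.
print(r2_small, r2_big)
print(adj_r_squared(r2_small, n, 1), adj_r_squared(r2_big, n, 2))
```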
Standard Error of the Regression
Also called the residual standard error, it measures the average distance that the observed values fall from the regression line.
Formula:
$$S = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}}$$
where:
- $e_i$ are the residuals,
- $n$ = number of observations,
- $k$ = number of predictors.
IN THIS COURSE we use $S$ as the standard deviation of a predicted value. Thus, if asked “what is the distribution of our predicted values,” we can put it on a Normal Distribution with a mean of $\hat{y}$ and a standard deviation of $S$.
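The formula above is a short computation once you have the residuals. A minimal sketch on made-up data with one predictor ($k = 1$):

```python
import numpy as np

# Synthetic data whose true noise scale is 4, so S should land near 4.
rng = np.random.default_rng(5)
n, k = 50, 1
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=4.0, size=n)

# Fit by least squares, then apply S = sqrt(SSE / (n - k - 1))
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef
S = np.sqrt(np.sum(residuals**2) / (n - k - 1))
print(S)  # an estimate of the noise scale around the regression line
```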
Other Information for Reference
Dummy Variables
It is vitally important that, when giving variables to a regression, none of the variables can be derived from one another. Or more simply, that the variables are truly independent.
This is because the regression machine will try tweaking each of the variables slightly to see what values give us the best results. If two of the levers it can adjust do effectively the same thing, it gets confused and “blows up.”
So, let’s consider trying to represent a cardinal direction (North, South, East, West) with one dummy variable per direction:
| North | South | East | West |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
But notice something important:
If South, East, and West are all 0, then we automatically know North must be 1.
That means the North column is redundant information and can be perfectly predicted by the other three.
The solution is simple:
- Drop one dummy variable from the set (often called the reference category).
So, we could drop “North” and let it be the baseline:
| South | East | West | Interpretation |
|---|---|---|---|
| 0 | 0 | 0 | North |
| 1 | 0 | 0 | South |
| 0 | 1 | 0 | East |
| 0 | 0 | 1 | West |
Now, the regression interprets each coefficient as the effect relative to North, and we’ve avoided redundancy.
Additional Note: We could’ve picked any of them to drop. I just picked North, but you could’ve picked East or West or… you get the idea.
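If you are using pandas, one way to build these dummies with a chosen reference category is sketched below; the category order is fixed explicitly so that `drop_first` removes North rather than whichever category happens to sort first:

```python
import pandas as pd

# Made-up observations of a cardinal-direction variable
directions = pd.Series(["North", "South", "East", "West", "North"])

# Fix the category order so drop_first removes "North" as the baseline
cat = pd.Categorical(directions,
                     categories=["North", "South", "East", "West"])
dummies = pd.get_dummies(cat, drop_first=True)
print(dummies)
# A row of all zeros now means "North" (the reference category).
```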
Reference States
When dealing with categorical variables in regression analysis, it’s essential to establish a reference state (or baseline category) for each categorical variable. This reference state serves as a comparison point for interpreting the effects of other categories within the same variable.
For example, consider a categorical variable “Education Level” with the following categories:
| Education Level | Description |
|---|---|
| High School | Completed high school education |
| Bachelor’s | Completed bachelor’s degree |
| Master’s | Completed master’s degree |
| PhD | Completed doctoral degree |
If we choose “High School” as the reference state, the regression coefficients for the other categories will indicate the effect of having that level of education compared to having only a high school education.
| Education Level | Coefficient Interpretation |
|---|---|
| Bachelor’s | Effect of having a bachelor’s degree vs. high school |
| Master’s | Effect of having a master’s degree vs. high school |
| PhD | Effect of having a PhD vs. high school |
By establishing a reference state, we can clearly understand how each category within a categorical variable influences the dependent variable in relation to the baseline category.
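A tiny worked example (all salary numbers invented): with only dummy variables and an intercept, the intercept recovers the reference group’s mean, and each coefficient recovers that group’s mean minus the reference group’s mean:

```python
import numpy as np

# Made-up salaries (in thousands) by education level.
salaries = {
    "High School": [40, 42, 38],   # reference state
    "Bachelor's": [55, 53, 57],
    "Master's": [65, 63],
    "PhD": [75, 80],
}

levels = ["Bachelor's", "Master's", "PhD"]   # High School = baseline
y, rows = [], []
for level, vals in salaries.items():
    for v in vals:
        y.append(v)
        # intercept column + one dummy column per non-reference level
        rows.append([1.0] + [1.0 if level == lvl else 0.0 for lvl in levels])

X = np.array(rows)
coef, *_ = np.linalg.lstsq(X, np.array(y, dtype=float), rcond=None)
# coef = [HS mean, Bachelor's - HS, Master's - HS, PhD - HS]
print(coef)  # -> approximately [40, 15, 24, 37.5]
```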
Interaction Terms
Interaction terms in regression analysis are used to capture the combined effect of two or more independent variables on the dependent variable. They help us understand how the relationship between one independent variable and the dependent variable changes depending on the level of another independent variable.
For example, consider a regression model that predicts sales based on advertising spend and seasonality. An interaction term between advertising spend and seasonality would allow us to see if the effect of advertising spend on sales differs during different seasons.
The interaction term is typically created by multiplying the two independent variables together. In our example, the interaction term would be:
$$\text{Advertising Spend} \times \text{Seasonality}$$
Including this interaction term in the regression model allows us to assess whether the impact of advertising spend on sales is influenced by the season. If the coefficient of the interaction term is significant, it indicates that the effect of advertising spend on sales varies depending on the season.
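A minimal sketch of building and fitting an interaction term (synthetic data, made-up names), where advertising genuinely works harder in summer:

```python
import numpy as np

# True model: ad spend has slope 0.5 off-season and 1.0 in summer,
# i.e. a true interaction coefficient of 0.5.
rng = np.random.default_rng(9)
n = 300
ad_spend = rng.uniform(0, 100, size=n)
is_summer = rng.integers(0, 2, size=n).astype(float)   # 1 = summer season
sales = (10 + 0.5 * ad_spend + 5 * is_summer
         + 0.5 * ad_spend * is_summer
         + rng.normal(scale=3.0, size=n))

# The interaction column is literally the product of the two predictors
interaction = ad_spend * is_summer
X = np.column_stack([np.ones(n), ad_spend, is_summer, interaction])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coef[3])  # estimate of the interaction effect, close to 0.5
```

If the interaction coefficient came out insignificant instead, you would drop the interaction column and conclude that advertising works the same in every season.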