Linear Models

Linear Regression: A Whole Lot of Ways Something Can Go Wrong

Chances are you have encountered the need to perform a linear regression, either for interpolation purposes or for predictive purposes. The latter, in Machine Learning jargon, is referred to as model training, or just training. Here is a list that highlights points to watch for, along with some suggestions on how to fix potential issues. This information can be found, in greater detail, in a slew of books on regression analysis and statistics, so I am not going to belabour citations here.

  1. The relationship between the dependent variable and the independent variables is linear. After all, our model is generically defined as
    $$Y = \sum_{k} f_k x_k + \epsilon$$
    and if the actual relationship is not linear then our model is misspecified for the problem at hand.

    1. Errors it would cause: A misspecified model will make your estimated coefficients biased and inconsistent, which means the model is useless for making meaningful predictions.

    2. How to fix it: Sometimes, if one has a guess as to what the relationship between the independent variables and the dependent variable looks like, one can transform the variables through known functions (for example, log) so that the relationship becomes linear, as in the sketch below.
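
    As a quick illustration, here is a minimal sketch (using numpy and statsmodels; the data and coefficients are made up for the example) where the dependent variable grows exponentially, so regressing its log against x recovers a linear relationship:

    ```python
    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: y depends exponentially on x, so a straight
    # line through (x, y) would be a misspecified model.
    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 100)
    y = np.exp(0.5 * x) * rng.lognormal(0, 0.1, size=x.size)

    # Transform the dependent variable: log(y) = 0.5 * x + noise,
    # which is linear in x and safe to fit with OLS.
    fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
    print(fit.params)  # intercept ~ 0, slope ~ 0.5
    ```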

  2. The independent variables are not random, and no exact linear relation exists between two or more of them. If there is a (near-)linear relationship between the independent variables, the model suffers from multicollinearity.

    1. Errors it would cause: Multicollinearity leaves your estimated coefficients unbiased and consistent, but it inflates their standard errors, making the t-statistics associated with the regression coefficients unreliable. This leads to false negatives, or type II errors: the incorrect failure to reject the null hypothesis.

    2. How to fix it: Generally, multicollinearity is a matter of degree. A simple and, for the most part, adequate way to check for it is to look at the pairwise correlations between the independent variables and make sure none of them are heavily correlated; see the sketch below.
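
    A minimal sketch of that pairwise check with pandas (the column names and the 0.9 loading are made up for the example):

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical design matrix with three independent variables,
    # where x3 is deliberately built to be nearly collinear with x1.
    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "x1": rng.normal(size=200),
        "x2": rng.normal(size=200),
    })
    df["x3"] = 0.9 * df["x1"] + 0.1 * rng.normal(size=200)

    # Pairwise correlations; off-diagonal entries near +/-1 are a red flag.
    print(df.corr())
    ```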

  3. The expected value of the error term is zero. This is just a consequence of solving for the regression coefficients by minimizing the sum of the squared residuals; a quick numerical check follows.
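
    A minimal sketch of that check (simulated data, made-up coefficients): as long as an intercept is included, the fitted residuals average to zero up to floating-point error.

    ```python
    import numpy as np
    import statsmodels.api as sm

    # Hypothetical linear data.
    rng = np.random.default_rng(2)
    x = rng.normal(size=100)
    y = 1.0 + 2.0 * x + rng.normal(size=100)

    # With a constant term, OLS forces the residuals to sum to zero.
    resid = sm.OLS(y, sm.add_constant(x)).fit().resid
    print(resid.mean())  # ~ 0, up to machine precision
    ```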

  4. The variance of the error term remains constant for all observations (homoscedasticity).

    1. Errors it would cause: If the error terms across observations do not have a constant variance, then the t-statistics calculated to determine the significance of the regression coefficients are going to be inaccurate. Too many type I errors (false positives) will occur here, i.e. incorrect rejections of the null hypothesis.

    2. How to test for it: The Breusch-Pagan test is the simplest and most general method. The basic idea is that if no conditional heteroskedasticity exists, then the independent variables will not be able to explain the variation in the squared residuals.

    3. How to fix it: Use GLS instead of OLS, or use robust (White-corrected) standard errors. Both the test and the fix are sketched below.
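
    A minimal sketch of both, using statsmodels (the data is simulated with an error variance that grows with x, to force heteroskedasticity):

    ```python
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    # Hypothetical data where the error variance grows with x.
    rng = np.random.default_rng(3)
    x = np.linspace(1, 10, 200)
    y = 1.0 + 2.0 * x + rng.normal(scale=x)

    X = sm.add_constant(x)
    fit = sm.OLS(y, X).fit()

    # Breusch-Pagan: regress the squared residuals on X; a small p-value
    # means X explains the residual variance, i.e. heteroskedasticity.
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
    print(lm_pvalue)

    # The fix: refit with White-corrected (robust) standard errors.
    robust_fit = sm.OLS(y, X).fit(cov_type="HC0")
    print(robust_fit.bse)
    ```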

  5. The error term is uncorrelated across observations (no serial correlation).

    1. Errors it would cause: The t-statistics calculated to determine the significance of the regression coefficients are going to be inaccurate. Once again, too many type I errors (false positives) will occur here, i.e. incorrect rejections of the null hypothesis.

    2. How to test for it: the Durbin-Watson statistic (DWS). The general idea is that if there is no serial correlation, then DWS should hover around 2.
      $$ DWS = \frac{\sum_{t=2}^{T}\left(\epsilon_t - \epsilon_{t-1}\right)^2}{\sum_{t=1}^{T}\epsilon_t^2} $$
      When expanded and summed, this equation can be roughly written as
      $$ DWS \approx \frac{\mathrm{variance}(\epsilon_t) - 2\,\mathrm{covariance}(\epsilon_t, \epsilon_{t-1}) + \mathrm{variance}(\epsilon_{t-1})}{\mathrm{variance}(\epsilon_t)} $$
      But the variance terms should be roughly constant across observations (that was assumption 4), and if there is no serial correlation then the covariance term must be zero. Which means
      $$ DWS \approx 2 $$
      The DW test involves tabulated bounds: a result below the lower bound implies the presence of serial correlation, a result above the upper bound implies its absence, and the test is inconclusive otherwise.

      1. As a side note, DWS isn't going to work for autoregressive models (a topic for some other time).

    3. How to fix it: There are a couple of alternatives, but the Hansen method is the one generally shipped with standard statistical packages. It corrects the standard errors to account for the presence of serial correlation; see the sketch below.
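
    A minimal sketch: the statistic is available directly in statsmodels, and the closely related Newey-West/HAC correction (the same flavour of fix the Hansen method applies) is a refit away. The AR(1) errors here are simulated so the statistic falls well below 2.

    ```python
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(4)
    x = np.linspace(0, 10, 200)

    # Build serially correlated AR(1) errors: e_t = 0.8 * e_{t-1} + noise.
    eps = np.zeros(200)
    for t in range(1, 200):
        eps[t] = 0.8 * eps[t - 1] + rng.normal()
    y = 1.0 + 2.0 * x + eps

    X = sm.add_constant(x)
    fit = sm.OLS(y, X).fit()
    print(durbin_watson(fit.resid))  # well below 2: positive serial correlation

    # Serial-correlation-robust (Newey-West / HAC) standard errors.
    hac_fit = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
    print(hac_fit.bse)
    ```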

  6. The error term is normally distributed.

    1. Errors it would cause: If the errors are not normally distributed, then the standard errors of the estimates, and thus the t-statistics, are unreliable, particularly in small samples.

    2. How to fix it: Increase the number of observations in your data set (in large samples the test statistics become approximately valid), or use a different equation to describe the relationship. A quick way to check the assumption is sketched below.
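
    A minimal sketch of that check, applying the Jarque-Bera test to the residuals (the fat-tailed Student-t errors are simulated so the test has something to reject):

    ```python
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import jarque_bera

    # Hypothetical data with fat-tailed (Student-t) errors.
    rng = np.random.default_rng(5)
    x = rng.normal(size=300)
    y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=300)

    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Jarque-Bera checks the skewness and kurtosis of the residuals
    # against what a normal distribution would imply.
    jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(fit.resid)
    print(jb_pvalue)  # small p-value: reject normality of the errors
    ```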