Project 3: Regression

What Is Regression?

What is regression, and how does it work? Linear regression is the simplest and most widely used regression technique. It is a statistical method that models the relationship between one or more independent variables and a dependent variable, with the goal of predicting the value of the dependent variable from the values of the independent variables. Linear regression assumes a linear relationship between the independent variables and the dependent variable: it finds the best-fitting straight line through the data points, such that the difference between the observed values and the values predicted by the line (the residuals) is minimized.
The math:
Given n data points (xᵢ, yᵢ), where xᵢ represents the independent variable(s) and yᵢ represents the dependent variable, linear regression finds coefficients β₀ (the intercept) and β₁, β₂, ..., βₚ (the slopes) such that the linear equation

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε

minimizes the sum of squared residuals ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)², where ŷᵢ is the predicted value of yᵢ.
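
To make the least-squares idea concrete, here is a minimal sketch in Python using scikit-learn on synthetic data (the true coefficients 3 and 2 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one independent variable, generated as y = 3 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(0.0, 1.0, size=100)

model = LinearRegression()  # ordinary least squares
model.fit(x, y)             # finds the β values minimizing Σ(yᵢ - ŷᵢ)²

print(model.intercept_, model.coef_)  # estimates of β0 and β1
```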

Introduction

Imagine asking someone about their ideal home, and they respond with details about the color of the roof tiles or the age of the HVAC system. Not a typical conversation. But delve into the wealth of data provided by this competition's dataset, and you'll find that the nuances of a property extend far beyond the number of bathrooms or the size of the backyard. In this competition, we're challenged to unravel the intricate web of factors that influence the price of a house. Can we accurately predict the price of a house based on its features and amenities?
For this analysis, I will be using the "House Prices" dataset from Kaggle, which has 79 explanatory variables describing almost every aspect of residential homes in Ames, Iowa. The dataset features a mix of nominal and ordinal variables for 1460 homes. We will attempt to predict the final price of each home.

Experiment 1: Pre-processing

For the first experiment, we will attempt to predict house prices using the dataset with all of its outliers. We will use both a linear regression model and a random forest model.
I'll start off by checking the dataset for null and duplicated values. A few variables have too many null values and will be removed from the dataset: Id, Alley, FireplaceQu, PoolQC, Fence, and MiscFeature. For the remaining features with null values, I will fill nominal variables with "NA" and ordinal variables with 0. Later on, we will create dummy variables for the categorical features and remove outliers.
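
In code, this cleaning step might look roughly like the sketch below, assuming the Kaggle train.csv file; splitting columns by dtype via select_dtypes is my own shorthand for the nominal/ordinal distinction, not necessarily how the original code does it:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # Kaggle House Prices training data

# Drop the columns with too many missing values, plus Id.
df = df.drop(columns=["Id", "Alley", "FireplaceQu", "PoolQC",
                      "Fence", "MiscFeature"])
df = df.drop_duplicates()

# Approximate the nominal/ordinal split by column dtype.
nominal_cols = df.select_dtypes(include="object").columns
numeric_cols = df.select_dtypes(exclude="object").columns

df[nominal_cols] = df[nominal_cols].fillna("NA")  # nominal -> "NA"
df[numeric_cols] = df[numeric_cols].fillna(0)     # ordinal  -> 0
```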

Experiment 1: Data Understanding and Visualization

Figure 1: Distribution of SalePrice

This histogram allows us to visualize the distribution of SalePrice (house prices), our target variable. We can see that the data is slightly right-skewed due to some high outliers.
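
For reference, a plot like this can be reproduced, and the skew quantified, with a few lines (a sketch continuing from the cleaned dataframe above):

```python
import matplotlib.pyplot as plt

df["SalePrice"].hist(bins=50)
plt.xlabel("SalePrice")
plt.ylabel("Number of houses")
plt.title("Distribution of SalePrice")
plt.show()

print(df["SalePrice"].skew())  # a positive value confirms the right skew
```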
Let's take a look at some other features and their distributions.
Figure 2: Distribution of Features

We can see that most of the categorical variables are dominated by a single class, with only sparse counts in the other classes. Most continuous variables are also heavily skewed, and many show a high frequency at 0, indicating that a house lacks that feature entirely.

Experiment 1: Modeling

First, we create dummy variables for our categorical features. After splitting the dataframe into training and testing sets, we fit a linear regression model and a random forest regression model on the training set and predict on the testing set. The resulting residuals plots look promising.
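
A sketch of this modeling step, continuing from the cleaned dataframe; the split ratio and random seed here are illustrative choices, not necessarily the ones used in the original code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = pd.get_dummies(df.drop(columns=["SalePrice"]))  # dummy-code categoricals
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(random_state=42).fit(X_train, y_train)

lr_pred = lr.predict(X_test)
rf_pred = rf.predict(X_test)
```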
Figure 3: Linear Regression Residuals Plot

Our model appears to have performed very well in predicting the sale prices of Iowa homes. The pink dots in the visualization represent the predicted versus actual sale prices of each house. The black line, which signifies a perfect prediction, facilitates the assessment of the model's accuracy—the closer the pink dots align with the black line, the more accurate the model's predictions. Notably, only a few outliers deviate significantly from the expected values, suggesting generally reliable performance.
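
A plot like this one can be produced with a few lines of matplotlib (a sketch using the predictions from the modeling step; the colors are cosmetic):

```python
import matplotlib.pyplot as plt

plt.scatter(y_test, lr_pred, color="pink", alpha=0.6)
# Reference line: a perfect model would put every point on y = x.
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color="black")
plt.xlabel("Actual SalePrice")
plt.ylabel("Predicted SalePrice")
plt.show()
```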
Figure 4: Random Forest Regression Residuals Plot

There is a slight difference between the residuals plot of the random forest model and that of the linear regression model. It appears that the random forest model performed notably better in predicting houses with a sale price under $250,000. Beyond this threshold, the pattern in the residuals plot resembles that of the linear regression model, with noticeable outliers.

Experiment 1: Evaluation

We evaluated our models using root mean squared error (RMSE) and the coefficient of determination (R²). For the linear regression model, R² is 0.88619, meaning the model explains 88.62% of the variance in SalePrice; this score tells us how well the model predicts the target. The RMSE for this model is 29,545.12, which tells us that the model's predictions differ from the actual values by $29,545.12 on average. For predicting house prices, this is a relatively low mean error and demonstrates strong predictive performance.
Comparatively, R² for the random forest regression model is 0.8928, indicating that it explains 89.28% of the variance in SalePrice. This is higher than the linear regression model, consistent with what we observed in the residuals plots. The RMSE for the random forest model is 28,674.09, nearly $1,000 lower than that of the linear regression model. A smaller RMSE signifies better accuracy in predicting SalePrice; so far, the random forest regressor is proving to be more accurate.
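
Both metrics are straightforward to compute with scikit-learn (a sketch continuing from the modeling code above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

for name, pred in [("Linear Regression", lr_pred), ("Random Forest", rf_pred)]:
    rmse = np.sqrt(mean_squared_error(y_test, pred))  # root mean squared error
    r2 = r2_score(y_test, pred)                       # coefficient of determination
    print(f"{name}: RMSE={rmse:,.2f}, R²={r2:.4f}")
```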

Experiment 2: Pre-processing

For our second experiment, we will attempt to predict house prices using the dataset after removing SalePrice outliers. Once again, we will use both a linear regression model and a random forest model.
The only difference in pre-processing with this experiment is that we will be removing the outliers from SalePrice.

Experiment 2: Data Understanding and Visualization

Figure 5: Distribution of SalePrice with outliers

This boxplot allows us to visualize the distribution of SalePrice, including the median, the minimum and maximum whisker values, and the quartile ranges. We can see many data points above the upper whisker (the "maximum" line); these are the points we want to remove.
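
A sketch of that removal, assuming the standard 1.5 × IQR rule that boxplots use to flag outliers (the exact threshold in the original code may differ):

```python
q1 = df["SalePrice"].quantile(0.25)
q3 = df["SalePrice"].quantile(0.75)
iqr = q3 - q1  # interquartile range

upper = q3 + 1.5 * iqr  # the boxplot's upper whisker
df_no_outliers = df[df["SalePrice"] <= upper]
```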
Figure 6: Distribution of SalePrice without outliers

This boxplot shows our SalePrice distribution after removing the big outliers.

Experiment 2: Modeling

We'll use the same models for regression on the newly processed dataset.
Figure 7: Linear Regression Residuals Plot

The residuals plot for the linear regression model appears more scattered than the one in experiment 1. However, the scale has changed now that the outliers are gone: the upper limit is 350,000 instead of 800,000, so in effect we've zoomed in on the previous residuals plot.
Figure 8: Random Forest Regression Residuals Plot

The residuals plot for the random forest regression model also appears to be more scattered compared to the one in experiment 1. We'll need to examine the evaluation metrics to accurately assess the models.

Experiment 2: Evaluation

With the linear regression model, R² is 0.8661: the model explains 86.61% of the variance in SalePrice, lower than the coefficient obtained on the dataset with outliers. The RMSE, however, is 20,637.97, nearly $9,000 lower than the previous linear regression model's. Since the two metrics now disagree, we'll use RMSE as the main evaluation metric for all our models; by that measure, the second experiment demonstrates improved performance over the first for the linear regression model.
As for the random forest regression model, 86.98% of the variance is explained. Once again, this is higher than the linear regression model, though lower than the random forest's coefficient in the first experiment. The RMSE for this model is 20,352.30, only about $300 lower than this experiment's linear regression model, but roughly $8,000 lower than the RMSE of the random forest model in the first experiment. Therefore, based on RMSE, this model, combined with the removal of outliers, yields the most accurate predictions for SalePrice.

Experiment 3: Pre-processing

For our third experiment, we will attempt to predict house prices using the dataset with its SalePrice outliers left in. This time we will use Google's open-source library, TensorFlow Decision Forests (TFDF). TFDF provides a variety of tree-based models, such as random forests, offering a more diverse set of algorithms that can handle more complex scenarios.
The preprocessing for experiment 3 is the same as that done in experiment 1.
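
Training a TFDF random forest looks roughly like the sketch below, assuming train_df and test_df are the cleaned training and testing splits from experiment 1's pre-processing; note that TFDF consumes pandas dataframes directly, so no dummy coding is needed:

```python
import tensorflow_decision_forests as tfdf

# TFDF handles categorical columns natively, so we pass the raw dataframes.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    train_df, label="SalePrice", task=tfdf.keras.Task.REGRESSION
)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    test_df, label="SalePrice", task=tfdf.keras.Task.REGRESSION
)

model = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)
model.fit(train_ds)
```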

Experiment 3: Data Understanding and Visualization

We're going to take another look at the distribution of the different features. This will be a good reference once we see the number of times each feature is used as the root node in our model.
Figure 9: Distribution of Features

Experiment 3: Modeling

The following plot shows how the model's RMSE decreases as trees are added.
Figure 10: RMSE Per Number of Trees

The first 50 trees appear to have made the biggest impact in reducing the RMSE.
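This curve comes from the model's training logs, which TFDF exposes through its inspector (a sketch continuing from the training code above):

```python
import matplotlib.pyplot as plt

logs = model.make_inspector().training_logs()

plt.plot([log.num_trees for log in logs],
         [log.evaluation.rmse for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("RMSE (out-of-bag)")
plt.show()
```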
Out of the importance values, we're going to take a closer look at "NUM_AS_ROOT." This value will show us the number of times each feature was used as the root node. Features that are frequently selected as the root node are considered more important as they have a strong influence on the overall decision-making process of the trees.
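NUM_AS_ROOT is one of the variable importances reported by the model's inspector (a sketch):

```python
inspector = model.make_inspector()

# Each entry is a (feature, importance) pair, sorted by importance.
for feature, count in inspector.variable_importances()["NUM_AS_ROOT"]:
    print(feature.name, count)
```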
Figure 11: Number of Root Nodes Bar Graph

The bar graph helps us visualize the features that had the biggest impact in reducing the variance in the model. OverallQual, which rates the overall material and finish of a house, was used as a root node 110 times, giving it the most significant influence in the model. ExterQual, which evaluates the quality of the material on the exterior of a house, comes in second with 53 root nodes. GarageCars and Neighborhood follow in third and fourth place.

Experiment 3: Evaluation

The RMSE for this model is 29,651.58, indicating an average difference of $29,651.58 between the predicted and actual values. Surprisingly, this is the highest RMSE among all the models in our experiments. The performance of this model was below expectations, making it the least accurate in predicting SalePrice.
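For reference, the RMSE of a TFDF model can be obtained by compiling it with an MSE metric and evaluating on the test set (a sketch):

```python
import math

model.compile(metrics=["mse"])
evaluation = model.evaluate(test_ds, return_dict=True)
rmse = math.sqrt(evaluation["mse"])
print(f"RMSE: {rmse:,.2f}")
```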

Impact

The models developed in this analysis carry the risk of being misleading if deployed and used on other datasets. Given the complexity of the dataset, featuring numerous variables and entries, the preprocessing and cleaning process was challenging and prone to errors. Despite my efforts, there may still be inaccuracies in the dataset that could result in misleading predictions.
On the other hand, the insights gained from the third experiment could prove beneficial to individuals seeking to sell a house. By identifying the features that exert the greatest influence on the sale price of a house, sellers can prioritize improving or highlighting those features. For instance, our TFDF model highlighted that the overall material and finish of a house had the most significant impact on sale price.

Conclusion

Through our experiments, we saw that factors beyond the obvious ones influence house prices, which makes the prediction task more complex. Each experiment showcased a different approach to analyzing the same dataset and its potential for predicting the target variable.
The removal of outliers from our target variable notably enhanced the accuracy of both regression models. However, the unexpected weaker performance of the TFDF model prompts further inquiry. It raises questions about the potential impact of removing outliers before implementing the TFDF model and whether the presence of unnecessary variables, possibly due to the dummy variable trap, affected model prioritization.

References

Dataset: Anna Montoya, DataCanary. (2016). House Prices - Advanced Regression Techniques. Kaggle. https://kaggle.com/competitions/house-prices-advanced-regression-techniques

Code

You can access my code here: experiments 1 & 2, experiment 3.