
Project 3

Using regression models to predict the sale price of houses. The goal of this project is to experiment with different features, pre-processing methods, and regression techniques to improve the model's performance.


Introduction

For this project, I will experiment with linear regression to predict house prices using this Kaggle dataset: https://www.kaggle.com/code/nohamahmoud22/housing-prices-eda-ml-2-competition/data.

The main goal of this project is to gain hands-on experience working with a regression problem. Concretely, I will work on improving the performance of my regression model by selecting different features and applying different pre-processing methods, including dropping null values, fixing skewness, and converting data types. I will also apply additional regression techniques where they can improve the model's performance.

This Housing dataset was compiled by Dean De Cock for use in data science education. It contains 79 explanatory variables describing almost every aspect of residential homes in Ames, Iowa, and these 79 features will be used to predict the final sale price of each home in the Ames area. This dataset is a good fit for this project since the target, the SalePrice column, contains continuous values, so we need to build and train a regression model to predict it. Moreover, the dataset contains both numerical and categorical data, which gives me a chance to explore different pre-processing methods to improve my predictive model.


Brief Overview of Regression

At a high level, regression is a statistical method used to estimate the relationship between one dependent variable (usually denoted by Y) and one or more independent variables. It can be used to assess the strength of the relationship between the dependent and independent variables and to model their future relationship. Regression models predict a continuous value; predicting the price of a house given features like its size, area, and number of bedrooms is one of the most common examples. Regression is a supervised learning technique.

Regression analysis includes linear regression, multiple linear regression, and nonlinear regression. Linear regression is the simplest variation of regression analysis and one of the most common and widely used techniques. To achieve the main goal of this project, we will dive into how linear regression works.

Linear regression is a linear model that estimates a linear relationship between the input variables and a single output variable. In other words, the output variable (target) can be calculated from a linear combination of one or more input variables (features). Because linear regression models a linear relationship, it captures how the value of the dependent variable changes with the value of the independent variable.

Mathematically, we can represent a linear regression as: y = a0 + a1x + ε

y = Dependent variable (target)

x = Independent variable (feature)

a0 = Intercept of the line (gives an additional degree of freedom)

a1 = Linear regression coefficient 

ε = random error

The linear regression model provides a sloped straight line representing the relationship between the dependent variable and independent variables. Here is the image showing a simple linear regression model:

[Image: a simple linear regression model — a fitted straight line through the data points]
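As a small illustration (not from the project notebook), the intercept a0 and slope a1 of that line can be recovered from data with the closed-form least-squares formulas; the toy data and variable names below are my own:

```python
import numpy as np

# Toy data generated from y = 2 + 3x + noise, so we know the true line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, 100)

# Ordinary least squares in closed form:
#   a1 = cov(x, y) / var(x),   a0 = mean(y) - a1 * mean(x)
a1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a0 = y.mean() - a1 * x.mean()

print(round(a0, 2), round(a1, 2))  # close to the true intercept 2 and slope 3
```

The same fit is what scikit-learn's LinearRegression computes internally for one feature.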

Data Understanding

To better understand this housing dataset, I will observe the correlations between the features, especially the features most correlated with the target, SalePrice. I will create heat maps using the seaborn package to visualize the correlations between features and make it easy for readers to get a better understanding of the data.
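A sketch of the computation behind such a heat map, using made-up stand-in data (the real file comes from the Kaggle link above, and the column names here only mirror a few Ames columns):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Ames data, for illustration only.
rng = np.random.default_rng(1)
n = 200
quality = rng.integers(1, 11, n)
area = rng.uniform(500, 4000, n)
df = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "YrSold": rng.integers(2006, 2011, n),  # essentially unrelated to price here
    "SalePrice": 20000 * quality + 50 * area + rng.normal(0, 10000, n),
})

corr = df.corr()
# The features most correlated with SalePrice surface at the top of this ranking:
print(corr["SalePrice"].drop("SalePrice").sort_values(ascending=False))

# With seaborn installed, the heat map itself is one call:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```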


Moreover, I will also observe the distribution and skewness of the target. I will again use seaborn to create a distribution plot that visualizes the distribution and skewness of our target.
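The skewness itself can be checked numerically as well. A minimal sketch on a made-up right-skewed price series (log-normal, standing in for SalePrice):

```python
import numpy as np
import pandas as pd

# A positively (right-) skewed stand-in for SalePrice.
rng = np.random.default_rng(2)
price = pd.Series(np.exp(rng.normal(12, 0.4, 1000)), name="SalePrice")

skew_before = price.skew()            # clearly positive for right-skewed data
skew_after = np.log1p(price).skew()   # a log transform pulls it back near 0
print(f"before log: {skew_before:.2f}, after log: {skew_after:.2f}")

# sns.histplot(price, kde=True) would draw the distribution plot from the post.
```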

Click the button below to read more about my data understanding step.


Experiment 1: Preprocessing

For the preprocessing step of experiment 1, I will remove some columns that contain NaN values and impute other missing values with the median of the corresponding column. Moreover, I will also convert some variables from categorical to numerical in order to build and train the regression model.
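A minimal sketch of these three moves on a toy frame (the column names only mimic Ames columns, and the 50% drop threshold is my own illustrative choice):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the mixed-type Ames columns.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 70.0],      # numeric with gaps
    "PoolQC":      [np.nan, np.nan, np.nan, np.nan, "Gd"],  # mostly missing
    "CentralAir":  ["Y", "N", "Y", "Y", "N"],               # categorical
    "SalePrice":   [200000, 150000, 250000, 180000, 300000],
})

# 1) Drop columns that are mostly NaN (the threshold is a judgment call).
df = df.loc[:, df.isna().mean() < 0.5]

# 2) Impute the remaining numeric gaps with the column median.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

# 3) Convert categorical variables to numerical (one-hot encoding here).
df = pd.get_dummies(df, columns=["CentralAir"], drop_first=True)

print(df.columns.tolist())
```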

Click the button below to read more about my preprocessing step for experiment 1. 


Experiment 1: Modeling

I will build and train a linear regression model for the first experiment. There are some notable reasons I chose linear regression for this dataset. First, I have already converted all the categorical variables to numerical ones, so linear regression is well suited to a dataset that contains only numerical variables. Moreover, the target (SalePrice) contains continuous values, so a linear model like linear regression is a natural choice for predicting it. Furthermore, the model is simple and easy to understand and interpret.

For experiment 1, I will predict the sale price based on all the features in the dataset, to see how the linear regression model predicts our target (SalePrice) when it is given every feature, including some that are not correlated with the target.
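The shape of that setup, sketched with synthetic all-numeric data (standing in for the preprocessed dataset; the irrelevant columns simulate the uncorrelated features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 features, only the first two actually drive the target.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, 500)

# Hold out a test split, fit on everything, and score on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination on held-out data
print(f"R^2 on held-out data: {r2:.2f}")
```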

Click the button below to read more about my modeling step for experiment 1.


Experiment 1: Evaluation

I will evaluate the performance of the linear regression model on this dataset through Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the coefficient of determination.

Mean squared error (MSE) is the average of the squared errors; the larger the number, the larger the error. Root mean squared error (RMSE) is simply the square root of MSE, which puts the error back in the units of the target.

The coefficient of determination is a measure of how much of the variability in one factor can be explained by its relationship to another related factor. This measure, known as the "goodness of fit," is typically reported as a value between 0.0 and 1.0. Concretely, a value of 1.0 indicates a perfect fit, and thus a highly reliable model for future forecasts, while a value of 0.0 indicates that the model fails to explain the data at all.
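All three metrics are available in scikit-learn. A small example with made-up true and predicted prices (the numbers are for illustration only):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true vs. predicted sale prices.
y_true = np.array([200000.0, 150000.0, 250000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 245000.0, 310000.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)  # coefficient of determination

print(f"MSE={mse:.0f}  RMSE={rmse:.1f}  R^2={r2:.3f}")
```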

Click the button below to read more about my evaluation step for experiment 1.


Experiment 2

Using the most correlated features to the target to predict the target.

For experiment 2, I will only use the features most correlated with the target (SalePrice) for the prediction. These features were selected after I observed the correlations between features during the data understanding step above. Moreover, I will also apply a log transformation to the target to correct its skewness.

I will again use a linear regression model in experiment 2, so I can compare how the same model performs on this dataset when using all the features versus using only the features most correlated with the log-transformed target.
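A sketch of this pipeline on synthetic stand-in data (the feature names mirror Ames columns but the values are fake, and the 0.3 correlation threshold is my own illustrative choice):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: price is driven by quality and living area, not YrSold.
rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n),
    "GrLivArea": rng.uniform(500, 4000, n),
    "YrSold": rng.integers(2006, 2011, n),
})
df["SalePrice"] = np.exp(11 + 0.08 * df["OverallQual"]
                         + 0.0003 * df["GrLivArea"] + rng.normal(0, 0.1, n))

# Keep only the features whose correlation with SalePrice passes a threshold.
corr = df.corr()["SalePrice"].drop("SalePrice")
features = corr[corr.abs() > 0.3].index.tolist()

# Fix the positive skew of the target with a log transform before fitting.
X, y_log = df[features], np.log1p(df["SalePrice"])
X_train, X_test, y_train, y_test = train_test_split(X, y_log, random_state=0)
model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(features, round(rmse, 3))
```

Note that MSE and RMSE computed on a log-transformed target come out on the log scale, which is why they are much smaller numbers than errors in dollars.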

I will also use the MSE, RMSE, and coefficient of determination values to evaluate the performance of my model.

Click the button below to read more about my experiment 2.


Experiment 3

Using Sequential Feature Selection to select three features to predict the target.

For experiment 3, I will select features for the prediction using a Sequential Feature Selector. Sequential feature selection belongs to the wrapper methods of feature selection. The algorithm iteratively adds (or removes) one feature at a time, evaluating the model on each candidate subset, until it reaches the requested number of features that gives the best performance.
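A minimal sketch, assuming scikit-learn's SequentialFeatureSelector (the synthetic data is mine; only the first three columns actually drive the target, so forward selection should find exactly those):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only columns 0, 1, 2 carry signal; the rest are noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(0, 0.3, 300)

# Greedy forward selection: repeatedly add the feature that most improves
# the cross-validated score until three features are chosen.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3,
                                direction="forward")
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())  # indices of the chosen columns
print(selected)
```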

I will also keep building and training a linear regression model for experiment 3, so I can compare how the same regression model performs with three different ways of selecting features.

I will also use the MSE, RMSE, and coefficient of determination values to evaluate the performance of my model.

Click the button below to read more about experiment 3.


Conclusion

Throughout this project on getting hands-on experience with a regression problem using this Housing dataset, I ran three different experiments with different ways of selecting features and different preprocessing methods to build and train linear regression models.


I decided to build and train the same linear regression model for all three experiments because I wanted to compare the three different ways of selecting features and see which one would help the regression model perform best on this dataset. Moreover, building the same model for all three experiments also helps me understand how effectively it performs in different scenarios, which in turn helps decide which setup would best fit the problem we are trying to solve.

Although I built and trained the same linear regression model throughout the three experiments, in each experiment I selected features and preprocessed the data in a different way. This is why the performance of the models differs considerably between experiments, as the MSE, RMSE, and coefficient of determination values in each evaluation step show.

For the first experiment, I used all the features in the dataset to build and train the linear regression model, without filtering out any irrelevant ones. Not surprisingly, the model did not perform well, with a very high MSE (703797924) and RMSE (26529). There are two main reasons I can think of for this poor performance. The first is that I used all the features for prediction without filtering for correlation with the target (SalePrice), so there was a high chance that irrelevant features with low correlation to the target were included. The other notable reason is that the target (SalePrice) is positively skewed, as noted in the data understanding step. I will have to do further observation and research to find effective solutions to these two problems and improve the model's performance. However, the coefficient of determination in experiment 1 is 0.87, which is fairly high: about 87% of the variation in the data is explained by the model.

In experiment 2, I used only the features most correlated with SalePrice to predict the target, and I also fixed the skewness of the target using a log transformation. Notably, the model's performance improved significantly, with a very low MSE (0.0222) and RMSE (0.149) on the log-transformed target. The coefficient of determination is also high, at 0.87. Overall, the linear regression model built and trained on the most correlated features and the log-transformed target did a very good job of prediction on this Housing dataset.

For experiment 3, I selected 3 of the features using a Sequential Feature Selector: 'OverallQual', 'BsmtFinSF1', and 'GrLivArea'. The model built and trained on those 3 features also did a good job, with a relatively low MSE (0.0299) and RMSE (0.173). The coefficient of determination in experiment 3 is likewise 0.87.

In short, out of the 3 different ways of selecting features, the linear regression models built and trained in experiments 2 and 3 performed much better than the model built and trained on all the features, without filtering, in experiment 1. I am confident in concluding that selecting suitable, important features plays a major role not only in building and training a regression model but also in improving its performance.


Click the button below to download my Jupyter Notebook to view my code for this project.



©2022 by ℅ D. Proudly created with Wix.com
