
Project 2: Income Classification.
This project uses classifier models to predict whether a new sample will have an income of more than $50K a year or not.
Introduce the problem
Income is affected by many different factors, such as work experience, level of education, and where you live. This project focuses on predicting whether a person makes more than $50K a year or not, based on features including age, education level, occupation, marital status, and gender. Through this project, we will be able to answer two questions:
1. Is it true that people who earn a higher-level degree are more likely to make more money?
2. Which classifier works better on this income dataset: a decision tree classifier or an SVM classifier with a pipeline? I will compare the two on the same prediction task.
Introduce the data
This dataset is a record of a group of people with different backgrounds, such as different age groups, education levels, and marital statuses, together with their income.
It contains the following columns: age, education level, work-class, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, and income.
I chose this dataset for two reasons. First, its data suits the classification purpose of this project. Second, it records both the education level and the income of each person, which will help me answer the question stated above about how differences in education level affect income.
Link to the dataset: https://www.kaggle.com/lodetomasi1995/income-classification
Pre-processing the data
For the preprocessing step, I will remove an irrelevant column that is not needed for the purpose of this project. I will also remove the '?' values, which signify missing values.
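The two preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the exact code used for the project. The name of the dropped column (`fnlwgt`) is an assumption, since the write-up does not say which column was removed, and the pattern also tolerates spaces around '?', which is an assumption about how the raw file is formatted.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the income dataset.
raw = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "fnlwgt": [77516, 83311, 215646, 234721],  # hypothetical "irrelevant" column
    "occupation": ["Adm-clerical", "Exec-managerial", " ?", "Handlers-cleaners"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K"],
})


def preprocess(df, drop_columns=("fnlwgt",)):
    """Drop unneeded columns and remove rows that contain '?' placeholders."""
    cleaned = df.drop(columns=list(drop_columns))
    # Treat '?' (optionally surrounded by spaces) as a missing value.
    cleaned = cleaned.replace(r"^\s*\?\s*$", np.nan, regex=True)
    # Remove every row that has at least one missing value.
    return cleaned.dropna().reset_index(drop=True)


clean = preprocess(raw)
print(clean)
```

Dropping rows is the simplest way to handle the '?' values; filling them with the most frequent category would be an alternative if too many rows were lost.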
Data Understanding
For data understanding, I will use the 'seaborn' package in Python to create different types of visualizations, such as violin plots, pie charts, and bar graphs. They show the distribution of our target and how specific factors, such as the number of work hours per week and the level of education, affect income.
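As a rough illustration of the numbers behind such a bar graph, the sketch below computes the share of people earning more than $50K for each education level. The sample rows are made up for the example, and the column names `education` and `income` are assumptions about how the dataset is labelled.

```python
import pandas as pd

# Made-up rows; the real dataset has many more people and columns.
sample = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors",
                  "Masters", "Masters", "HS-grad", "Bachelors"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K",
               ">50K", ">50K", ">50K", ">50K"],
})

# Fraction of people in each education group whose income is '>50K'.
share_high_income = (
    sample.assign(high_income=sample["income"].eq(">50K"))
    .groupby("education")["high_income"]
    .mean()
    .sort_values()
)
print(share_high_income)
```

A Series like this could then be handed to a plotting function, for example a seaborn bar plot, to compare the education levels visually.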
Modeling
I will use both a decision tree and an SVM classifier from sklearn to make predictions for this project.
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. I chose a decision tree for two reasons. First, this classifier is easy to understand and interpret. Second, decision trees can work with both numerical and categorical data (in sklearn, categorical columns have to be encoded as numbers first). Since this income dataset contains both kinds of data, a decision tree is a natural choice for it.
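A minimal sketch of training a decision tree on mixed numerical and categorical columns is shown below. The rows are made up, only three feature columns are used, and one-hot encoding the `education` column is an assumed encoding choice; the write-up does not state how the categorical columns were encoded.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up training rows with numerical and categorical features.
people = pd.DataFrame({
    "age": [25, 47, 38, 52, 29, 41, 33, 58],
    "hours-per-week": [40, 50, 45, 60, 35, 40, 38, 55],
    "education": ["HS-grad", "Masters", "Bachelors", "Doctorate",
                  "HS-grad", "Bachelors", "HS-grad", "Masters"],
    "income": ["<=50K", ">50K", ">50K", ">50K",
               "<=50K", "<=50K", "<=50K", ">50K"],
})
features = people.drop(columns="income")
target = people["income"]

# One-hot encode the categorical column, pass numeric columns through as-is.
encoder = ColumnTransformer(
    [("education", OneHotEncoder(handle_unknown="ignore"), ["education"])],
    remainder="passthrough",
)
encoded_features = encoder.fit_transform(features)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(encoded_features, target)

# Predict the income class of a new, unseen person.
new_person = pd.DataFrame(
    {"age": [45], "hours-per-week": [50], "education": ["Masters"]}
)
prediction = tree.predict(encoder.transform(new_person))
print(prediction)
```

Because the learned rules are simple threshold tests, the fitted tree can be inspected afterwards, which is part of what makes this model easy to interpret.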
Support Vector Machines (SVMs) are widely used for classification, which suits the purpose of this project. Many practitioners like SVMs because they can reach good accuracy with modest computation on small and medium-sized datasets. I chose an SVM with a pipeline to compare it against the decision tree and see which classification algorithm works better on this income dataset.
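The sketch below shows one way an SVM with a pipeline could be set up in sklearn. The steps inside the pipeline (scaling the numeric columns and one-hot encoding the categorical one) and the RBF kernel are assumptions, since the write-up does not list them, and the rows are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Made-up training rows with numerical and categorical features.
people = pd.DataFrame({
    "age": [25, 47, 38, 52, 29, 41, 33, 58],
    "hours-per-week": [40, 50, 45, 60, 35, 40, 38, 55],
    "education": ["HS-grad", "Masters", "Bachelors", "Doctorate",
                  "HS-grad", "Bachelors", "HS-grad", "Masters"],
    "income": ["<=50K", ">50K", ">50K", ">50K",
               "<=50K", "<=50K", "<=50K", ">50K"],
})
features = people.drop(columns="income")
target = people["income"]

# Scale numeric columns and one-hot encode the categorical column.
prepare = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "hours-per-week"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["education"]),
])

# The pipeline applies the preparation steps and then fits the SVM.
svm_model = Pipeline([
    ("prepare", prepare),
    ("svm", SVC(kernel="rbf")),
])
svm_model.fit(features, target)
predictions = svm_model.predict(features)
print(predictions)
```

Wrapping the steps in a pipeline means the same scaling and encoding are applied automatically whenever the model is fitted or asked for predictions.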
Evaluation
After building and training my models, I can conclude that both of them work well on this income dataset. The decision tree model has an accuracy score of 0.84, and the SVM with pipeline model gives the same accuracy score of 0.84.
I used a confusion matrix for evaluation because it directly shows the counts of True Positives, False Positives, True Negatives, and False Negatives. The visualization created by the confusion matrix function shows at a glance how well the model's predictions on "unseen" data match the true labels.
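To make the four values concrete, here is a small sketch that builds a confusion matrix with sklearn from hand-written labels. The labels are invented for the example and are not the project's results; treating '>50K' as the positive class is also an assumption.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true labels and model predictions for eight people.
y_true = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K"]
y_pred = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"]

# Rows are true classes, columns are predicted classes, in the given label order.
matrix = confusion_matrix(y_true, y_pred, labels=["<=50K", ">50K"])
true_neg, false_pos, false_neg, true_pos = matrix.ravel()

accuracy = accuracy_score(y_true, y_pred)
print(matrix)
print(true_neg, false_pos, false_neg, true_pos, accuracy)
```

For these invented labels the matrix is [[4, 1], [1, 2]], so 6 of the 8 predictions are correct and the accuracy is 0.75.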
Story telling
Through this project, we are now able to answer the question stated in the problem about how different levels of education affect someone's income. After building and training our two classification algorithms, a decision tree and an SVM with pipeline, both of which give a high accuracy score, we can predict that people with a higher-level degree are more likely to make a higher income. Notably, income is not affected only by level of education. Through this project and the classification models we built, we can also predict that men tend to make more money than women.
After comparing how the decision tree and the SVM with pipeline work on this income dataset, we can conclude that both classification algorithms work well and that their outputs are almost the same. However, this does not mean that a decision tree and an SVM will always perform the same. Each classification algorithm performs differently depending on the dataset, so it is up to you to choose the classification model that works best for your dataset and your purpose.
References
Income dataset. https://www.kaggle.com/lodetomasi1995/income-classification
Understanding Confusion Matrix. https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
Support Vector Machines. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html