Machine Learning Dataset Tour (2): Boston Housing

TL;DR: You can view my work on my GitHub.

In this post, I briefly introduce the Boston Housing dataset and share my solution with some explanations.

What’s Boston Housing?

The dataset consists of information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.

The following table shows the meaning of each variable (column) in the dataset:

Variable	Description
CRIM	per capita crime rate by town
ZN	proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS	proportion of non-retail business acres per town.
CHAS	Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX	nitric oxides concentration (parts per 10 million)
RM	average number of rooms per dwelling
AGE	proportion of owner-occupied units built prior to 1940
DIS	weighted distances to five Boston employment centres
RAD	index of accessibility to radial highways
TAX	full-value property-tax rate per $10,000
PTRATIO	pupil-teacher ratio by town
B	1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT	% lower status of the population
MEDV	Median value of owner-occupied homes in $1000’s

For more details, you can visit this website.

Goal and Evaluation Metric

Goal

Predict the value of MEDV.

Evaluation Metric

RMSE, i.e. $RMSE(y) = \sqrt{\frac{1}{n}\displaystyle\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
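
RMSE is the square root of the mean squared difference between the true and predicted values. Here is a minimal sketch of that computation with NumPy; the helper name `rmse` and the sample numbers are my own, for illustration only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the residuals, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative MEDV values in $1000s, not real model output
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [25.0, 20.6, 33.7, 35.4]

print(rmse(y_true, y_pred))  # about 1.32
```

Because the errors are squared before averaging, a single large miss (the last house here) weighs more than several small ones.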

My Solution

Although decision-tree-based models perform very well on tabular data, I chose a linear model for practice.
- Specifically, I chose Lasso regression, following scikit-learn's cheat sheet.
- I will try plain linear regression in a later post, stay tuned!
- I will also demonstrate using fewer features to produce a similar result in a later post.
Because there are no null values, we do not need to fill or drop anything.
Since the test data has no ground-truth labels, we will rely on cross-validation on the training data.

Step-by-step Explanation

1. Prepare Data and Make Some Explanations

Using pandas' info() method, we find that all features are numeric, so there is no need for one-hot encoding.

Using pandas' isnull() method, we see that there are no missing values, so nothing needs to be filled in.
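
The two checks above can be sketched as follows. The small DataFrame is a stand-in with a few of the dataset's column names and illustrative values; in the real notebook, `df` would be the loaded training data.

```python
import pandas as pd

# Stand-in frame: a few Boston Housing column names with illustrative values
df = pd.DataFrame({
    "CRIM": [0.006, 0.027, 0.027, 0.032],
    "RM": [6.575, 6.421, 7.185, 6.998],
    "CHAS": [0, 0, 0, 1],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

df.info()  # prints the dtype and non-null count of every column

# True when every column has a numeric dtype, so no one-hot encoding is needed
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)

# Missing values per column; all zeros means nothing to fill or drop
missing_per_column = df.isnull().sum()

print(all_numeric)
print(missing_per_column)
```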

2. Visualization

I would like to see how each feature relates to the price, so I draw a plot of each feature against MEDV.

As there are a lot of plots, you can check out the results here.
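
One way to draw such plots is a row of scatter plots, one per feature, with MEDV on the y-axis. This is only a sketch: the three features, the handful of illustrative rows, and the output file name are my assumptions, and the actual plots in the notebook may look different.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A handful of illustrative rows; the real notebook would use the full training set
df = pd.DataFrame({
    "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
    "PTRATIO": [15.3, 17.8, 17.8, 18.7, 18.7],
    "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
})

features = [col for col in df.columns if col != "MEDV"]
fig, axes = plt.subplots(
    1, len(features), figsize=(4 * len(features), 3.5), squeeze=False
)

# One scatter plot per feature, always against the target MEDV
for ax, feature in zip(axes[0], features):
    ax.scatter(df[feature], df["MEDV"], s=12)
    ax.set_xlabel(feature)
    ax.set_ylabel("MEDV")

fig.tight_layout()
fig.savefig("feature_vs_medv.png")  # hypothetical output file name
```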

3. Check Feature Importance

Maybe we don't need to put every feature into the model, so I use XGBoost to estimate feature importance.
Many decision-tree-based models can report feature importance because they choose each split by measuring how much it improves a criterion such as information gain or Gini impurity (or, for regression, the reduction in squared error).
After running an XGBoost regressor and asking for the gain-based importance, we get the following results:

xgbr.get_booster().get_score(importance_type='gain')

# Out:
{'LSTAT': 1170.5912806284505,
 'RM': 407.86173977156585,
 'NOX': 114.57274366695417,
 'CRIM': 48.06237422548619,
 'DIS': 65.78717631524752,
 'PTRATIO': 109.93901440305558,
 'AGE': 26.895764984946428,
 'TAX': 56.70414882315789,
 'B': 30.702484215217385,
 'INDUS': 15.392935187827584,
 'CHAS': 45.74311416166666,
 'RAD': 18.774944386363632,
 'ZN': 9.174343283333334}

The three most important features by gain are LSTAT, RM, and NOX, with PTRATIO close behind in fourth place.
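
The dictionary comes back unordered, so it is easier to read once sorted. Here is a small sketch that ranks the scores from the output above (copied and rounded to two decimals); the variable names are my own.

```python
# Gain scores copied from the output above, rounded to two decimals
gain = {
    "LSTAT": 1170.59, "RM": 407.86, "NOX": 114.57, "CRIM": 48.06,
    "DIS": 65.79, "PTRATIO": 109.94, "AGE": 26.90, "TAX": 56.70,
    "B": 30.70, "INDUS": 15.39, "CHAS": 45.74, "RAD": 18.77, "ZN": 9.17,
}

# Sort features from highest to lowest gain
ranked = sorted(gain.items(), key=lambda item: item[1], reverse=True)

# Show the five strongest features
for name, score in ranked[:5]:
    print(f"{name:8s}{score:10.2f}")
```

The ranking starts with LSTAT, RM, NOX, PTRATIO, and DIS, and ZN comes last.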

4. Build Model

Let's build our model with Lasso regression!
I create the model object with the parameter alpha=0.1:

lasso = Lasso(alpha=0.1)

I also do a train-test split, train the model on the training part, and predict on the held-out test part.
This gives an RMSE of 4.48. Let's see whether cross-validation can improve on that.
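
The split-train-predict flow could look roughly like this. To keep the sketch self-contained I use synthetic data from `make_regression` with the shape of the original dataset (506 rows, 13 features); the 80/20 split and the random seeds are my assumptions, so the printed RMSE will not match the 4.48 reported here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (NOT the real dataset)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Hold out 20% of the rows for evaluation (split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# RMSE on the held-out rows
y_pred = lasso.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Hold-out RMSE: {rmse:.2f}")
```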

Scikit-learn provides Lasso regression with built-in cross-validation, called LassoCV. After training, we get an RMSE of 4.37. Hmm, the value is slightly lower.
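
LassoCV uses cross-validation to pick the regularization strength alpha instead of fixing it by hand. Below is a minimal sketch on synthetic data; the candidate alphas, the 5 folds, and the 80/20 split are my assumptions rather than the settings behind the 4.37 above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (NOT the real dataset)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate regularization strengths to compare (an assumption)
candidate_alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

# 5-fold cross-validation on the training rows selects the best alpha,
# then the model is refit on all training rows with that alpha
lasso_cv = LassoCV(alphas=candidate_alphas, cv=5, random_state=0)
lasso_cv.fit(X_train, y_train)

rmse_cv = np.sqrt(mean_squared_error(y_test, lasso_cv.predict(X_test)))
print(f"Selected alpha: {lasso_cv.alpha_}")
print(f"Hold-out RMSE: {rmse_cv:.2f}")
```

The selected value is available as `lasso_cv.alpha_`, which is handy for checking whether a hand-picked alpha such as 0.1 was already close to the best candidate.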

5. Generate the Result of Competition

Let's submit the result to Kaggle: we get a score of 5.16976 and rank #4 on the leaderboard. You can see the leaderboard here.

Conclusion

In this post, I gave a brief introduction to the famous Boston Housing dataset and walked through my solution step by step. Furthermore, I showed that we can use Lasso regression with cross-validation, although it improves the RMSE only slightly (from 4.48 to 4.37).
See you in the next tour!

You can view my work on my GitHub.