# Machine Learning Dataset Tour (2): Boston Housing

TL;DR: You can view my work on my GitHub.

In this post, I will give a brief introduction to the Boston Housing dataset and share my solution with some explanations.

## What’s Boston Housing?

The dataset consists of information collected by U.S. Census Service concerning housing in the area of Boston Mass.

The following show the meaning of each variable (column) in the dataset:

Variable | Description |
---|---|
CRIM | per capita crime rate by town |
ZN | proportion of residential land zoned for lots over 25,000 sq. ft. |
INDUS | proportion of non-retail business acres per town |
CHAS | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
NOX | nitric oxides concentration (parts per 10 million) |
RM | average number of rooms per dwelling |
AGE | proportion of owner-occupied units built prior to 1940 |
DIS | weighted distances to five Boston employment centres |
RAD | index of accessibility to radial highways |
TAX | full-value property-tax rate per $10,000 |
PTRATIO | pupil-teacher ratio by town |
B | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town |
LSTAT | % lower status of the population |
MEDV | median value of owner-occupied homes in $1000’s |

For more details, you can visit this website.

## Goal and Evaluation Metric

### Goal

- Predict the value of `MEDV`.

### Evaluation Metric

- RMSE, i.e. `sqrt(mean((y_true - y_pred)^2))`

## My Solution

- Though decision-tree-based models perform very well on tabular data, I will choose a linear regression model for practice.
- Specifically, I choose Lasso regression according to scikit-learn’s cheat sheet.
- I will try plain linear regression in later posts, stay tuned!
- I will also demonstrate using fewer features to produce a similar result in later posts.

- Because there are no null values, we do not need to fill or drop any missing values.
- Since the test data contains no ground truth, we will train the model with cross validation.
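Since the test labels are hidden, cross validation on the training data is the standard way to estimate performance. A minimal sketch with scikit-learn (the data here is synthetic; the shapes merely mimic the real 13-feature dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Boston training data (13 features, like the real set)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# 5-fold CV; scikit-learn reports negated MSE, so flip the sign and take the root
scores = cross_val_score(Lasso(alpha=0.1), X, y,
                         scoring='neg_mean_squared_error', cv=5)
rmse_per_fold = np.sqrt(-scores)
print(rmse_per_fold.mean())
```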

## Step-by-Step Explanation

### 1. Prepare Data and Make Some Explanations

Using pandas’ `info()` method, we find all features are numeric, so there is no need for one-hot encoding.

Using pandas’ `isnull()` method, we confirm there are no empty values, so there is nothing to fill.
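These two checks look roughly like this (a sketch on a toy DataFrame; the real notebook would call them on the loaded dataset):

```python
import pandas as pd

# Toy frame with the same flavor as a few Boston columns
df = pd.DataFrame({'CRIM': [0.006, 0.027], 'RM': [6.58, 6.42], 'MEDV': [24.0, 21.6]})

df.info()                    # dtypes: all numeric -> no one-hot encoding needed
missing = df.isnull().sum()  # per-column count of missing values
print(missing)               # all zeros -> nothing to fill or drop
```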

### 2. Visualization

I would like to find the relationship between each feature and the price, so I draw some plots to visualize it.

As there are a lot of plots, you can check out the results here.

### 3. Check Feature Importance

Maybe we don’t need to put every feature into the model, so I use XGBoost to find the feature importance.

Many decision-tree-based models can compute feature importance, because they split the data by calculating
information gain or the Gini index.

After running XGBoost Regressor, we get the following results:

```
xgbr.get_booster().get_score(importance_type='gain')
# Out:
{'LSTAT': 1170.5912806284505,
 'RM': 407.86173977156585,
 'NOX': 114.57274366695417,
 'CRIM': 48.06237422548619,
 'DIS': 65.78717631524752,
 'PTRATIO': 109.93901440305558,
 'AGE': 26.895764984946428,
 'TAX': 56.70414882315789,
 'B': 30.702484215217385,
 'INDUS': 15.392935187827584,
 'CHAS': 45.74311416166666,
 'RAD': 18.774944386363632,
 'ZN': 9.174343283333334}
```
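To rank these programmatically rather than by eye, the gain dict can simply be sorted:

```python
# Gains reported by XGBoost above (rounded for readability)
gains = {'LSTAT': 1170.59, 'RM': 407.86, 'NOX': 114.57, 'CRIM': 48.06,
         'DIS': 65.79, 'PTRATIO': 109.94, 'AGE': 26.90, 'TAX': 56.70,
         'B': 30.70, 'INDUS': 15.39, 'CHAS': 45.74, 'RAD': 18.77, 'ZN': 9.17}

# Sort feature names by gain, largest first
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked[:3])  # ['LSTAT', 'RM', 'NOX']
```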

Looking at the gains, the three most important features are `LSTAT`, `RM`, and `NOX` (with `PTRATIO` close behind).

### 4. Build Model

Let’s build our model with Lasso Regression!

I create the model object with the parameter `alpha=0.1`:

```
lasso = Lasso(alpha=0.1)
```

I also do a train-test split to train the model, and use the held-out test data to produce the predictions.
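The split-fit-score loop looks roughly like this (synthetic data as a stand-in; the real notebook uses the Boston features and `MEDV` as the target):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

pred = lasso.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE on the held-out split
print(rmse)
```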

We then obtain an `RMSE` of `4.48`. Let’s see whether cross validation can boost the performance or not.

Scikit-learn provides Lasso regression with built-in cross validation, called `LassoCV()`. After training, we get an `RMSE` of `4.37`. Nice, the error goes down a bit.
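`LassoCV` folds the search for `alpha` into cross validation; a sketch of the same pipeline (again on synthetic stand-in data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5-fold CV over a grid of alphas chosen automatically by scikit-learn
lasso_cv = LassoCV(cv=5).fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, lasso_cv.predict(X_test)))
print(lasso_cv.alpha_, rmse)  # chosen alpha and held-out RMSE
```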

### 5. Generate the Result of Competition

Let’s upload the result to Kaggle: we gain a score of `5.16976` and rank #4 on the leaderboard. You can see the
leaderboard here.

## Conclusion

In this post, I give a brief introduction to the famous Boston Housing dataset and walk through my solution step by step. I also show that adding cross validation to Lasso regression improves the RMSE only slightly, from 4.48 to 4.37.

See you in the next tour!

You can view my work on my GitHub.