Predicting Housing Prices With Python
Millions of people visit online sites to buy and sell houses. For those who want to sell their house, finding the right price is really difficult. Home sellers must pick a price that is not so high that no one buys, and not so low that they leave money on the table. But choosing this price requires understanding numerous complex variables and how each one feeds into the price.
These variables can include size, location, property tax, the quality of the education in the area, or the crime rate. Understanding how these factors play into the price of a house is a problem best solved by a computer algorithm.
Machine learning algorithms are great at taking a very large amount of data and finding the correlations and connections between what are called features. Features are the properties used to predict something. If you’re trying to predict whether to recommend a product to someone, the features could be products they’ve bought in the past, their interests, or even what season it is.
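As a toy illustration (the names and numbers here are invented for the example, not taken from the dataset), a single training example can be thought of as a vector of feature values paired with the value we want to predict:

```python
# A toy example: each house is a vector of feature values,
# and the label is the quantity we want to predict (its price).
house_features = {
    "rooms": 6.5,           # average number of rooms
    "crime_rate": 0.02,     # per-capita crime rate in the area
    "pupil_teacher": 15.3,  # pupil-teacher ratio of local schools
}

# A model learns a mapping from feature vectors like this...
X = [list(house_features.values())]
# ...to target values like this (price in $1000's).
y = [450.0]

print(X, y)
```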
In this project, I used a machine-learning algorithm to figure out how to take a set of features about a house and output the best prediction of its price.
The dataset I used is from the UCI Machine Learning Repository. It dates to 1978 and contains 506 entries, each described by 14 attributes, covering a variety of suburbs in Boston. Although 506 entries is not an astounding number, it suits the purpose of showing how machine learning can be used to predict home prices. Below is a list, taken directly from the dataset’s page, of what each value represents, in order:
CRIM — per capita crime rate by town
ZN — proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS — proportion of non-retail business acres per town
CHAS — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX — nitric oxides concentration (parts per 10 million)
RM — average number of rooms per dwelling
AGE — proportion of owner-occupied units built prior to 1940
DIS — weighted distances to five Boston employment centres
RAD — index of accessibility to radial highways
TAX — full-value property-tax rate per $10,000
PTRATIO — pupil-teacher ratio by town
B — 1000(Bk − 0.63)² where Bk is the proportion of blacks by town
LSTAT — % lower status of the population
MEDV — Median value of owner-occupied homes in $1000’s
For the purposes of this project, the dataset was limited to four columns: the features “RM”, “LSTAT”, and “PTRATIO”, plus the target “MEDV”. Other small changes were made, such as removing unnecessary values and adjusting the prices for inflation. In this case, we don’t need all of the features to train our machine learning model.
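A sketch of that preprocessing step, using pandas with a few made-up rows standing in for the real CSV (the column names follow the dataset, but the inflation multiplier below is a placeholder, not necessarily the one actually used):

```python
import pandas as pd

# A few made-up rows standing in for the real 506-entry dataset.
data = pd.DataFrame({
    "RM":      [6.575, 6.421, 7.185],
    "LSTAT":   [4.98, 9.14, 4.03],
    "PTRATIO": [15.3, 17.8, 17.8],
    "MEDV":    [24.0, 21.6, 34.7],  # median home value in $1000's (1978)
})

# Keep only the columns the model needs.
data = data[["RM", "LSTAT", "PTRATIO", "MEDV"]]

# Adjust 1978 prices for inflation (21,000x is a placeholder
# multiplier that also converts from $1000's to dollars).
data["MEDV"] = data["MEDV"] * 21_000

print(data.describe())
```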
Here is a basic summary of what the code does:
- Imports the necessary libraries
- Prints out the basic statistics to understand the data
- Creates plots that show the data to gain a better understanding of it
- Creates a matrix showing different correlations
- Imports functionality that calculates a score between 0 and 1 representing the accuracy of the model, where 0 means the model is no better than always predicting the mean of the target variable and 1 means the predictions are perfectly accurate
- Splits the data into training and testing sets
- Trains the model to make accurate predictions
- Evaluates the model’s accuracy
- Tests the predictions on 3 somewhat arbitrary sets of features
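The scoring step above is the coefficient of determination, R². A minimal implementation of the formula (the function name here is my own; scikit-learn provides the same computation as r2_score):

```python
def performance_metric(y_true, y_pred):
    """R^2: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]

# Always predicting the mean scores exactly 0.
mean = sum(y_true) / len(y_true)
baseline = performance_metric(y_true, [mean] * 4)
print(baseline)  # 0.0

# Predictions close to the true values score near 1.
good = performance_metric(y_true, [2.5, 0.0, 2.0, 8.0])
print(good)
```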
The model was able to reach a score of 84.4%, which is quite good considering the relatively small number of training examples in the dataset.
Here is the code for the project:
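The original code listing did not survive in this copy of the article. As a stand-in, here is a compact sketch of the same pipeline. Note the assumptions: it generates synthetic data in place of the real CSV, and it fits ordinary least squares with NumPy, whereas the original project may well have used a different estimator (e.g. a scikit-learn regressor):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: RM, LSTAT, PTRATIO -> MEDV.
n = 506
X = np.column_stack([
    rng.normal(6.3, 0.7, n),  # RM: rooms per dwelling
    rng.uniform(2, 38, n),    # LSTAT: % lower-status population
    rng.uniform(12, 22, n),   # PTRATIO: pupil-teacher ratio
])
# Price rises with rooms, falls with LSTAT and PTRATIO, plus noise.
y = 50 * X[:, 0] - 5 * X[:, 1] - 8 * X[:, 2] + rng.normal(0, 20, n)

# Split the data into training and testing sets (80/20).
idx = rng.permutation(n)
train, test = idx[:405], idx[405:]

# Train: fit ordinary least squares (with an intercept column).
A = np.column_stack([X[train], np.ones(len(train))])
coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)

# Evaluate: R^2 on the held-out test set.
A_test = np.column_stack([X[test], np.ones(len(test))])
pred = A_test @ coef
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"test R^2: {r2:.3f}")

# Predict prices for three arbitrary sets of features.
clients = np.array([[5.0, 17.0, 15.0],
                    [4.0, 32.0, 22.0],
                    [8.0, 3.0, 12.0]])
print(np.column_stack([clients, np.ones(3)]) @ coef)
```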
This project demonstrated real potential to help homeowners sell their homes at a price as close as possible to their true value. This kind of technology is already being used at major online real estate companies and shows great promise to become the go-to method for determining home sale prices.
Thank you for reading. If you enjoyed it, be sure to “clap” for this article. If you want to connect with me on LinkedIn, here is the link.