1 Introduction

Home matters to everyone. It is not only the geographical space we live in, but also part of our physical and financial security. Smarter decisions, informed by relevant data, inspiration and knowledge, can make a huge impact. Home market value prediction is one of the most classic problems in this area: it takes both internal and external house characteristics into account, such as home facts, location and market conditions. By providing professional and effective valuations to buyers, sellers, real estate agencies and mortgage professionals, it improves information transparency and helps establish a fairer trading environment.

Ordinary least squares (OLS) is a basic quantitative method for estimating the unknown parameters in a linear regression model, and it serves as our starting point. In this exercise, we build an OLS model to predict 700 home sale prices in San Francisco from 2012-2015 based on 9,433 known data entries. Accuracy and generalizability cannot be reached unless we identify the critical predictors, which requires a great amount of work in feature engineering, feature selection and model validation.

With these ideas in mind, several steps are taken to detect significant predictors, including data gathering, pre-processing, exploratory analysis and feature engineering. A baseline model is then built. However, this model fails to pass the spatial autocorrelation test, so an advanced model is developed that takes spatial autocorrelation into account. The accuracy and generalizability of our advanced model are then measured and validated.

In summary, there are 22 significant predictors in our final model, covering house characteristics ranging from demographics, environment, housing and public facilities to crime and education. The model explains around 75% of the variance in sale prices, with a mean absolute error (MAE) of about 230,000 and a mean absolute percentage error (MAPE) of about 23%. In general, it is a good model that captures both local demographics and the physical characteristics of the housing.

2 Methods

2.1 Hedonic Pricing Model

The hedonic model is both a theoretical and a practical framework. In this instance, it helps us deconstruct house sale price into a bundle of house characteristics. This requires that the composite good being valued can be reduced to its constituent parts. In the real estate market, the value of a house can be deconstructed into three broad components.

\(Price = \text{Internal Conditions} + \text{Amenities} + \text{Spatial Structure} + \text{Error}\)

Internal conditions include parameters such as property area, construction age and number of rooms, which capture a house's physical facts and features. Exposure to amenities mostly covers a variety of place-based characteristics such as road density, accessibility to parks and recreation, school scores and even nearby crime. Spatial structure refers to the relationship of values within a single variable at nearby locations. It turns out to play a key role in the prediction, since houses inside a specific geographical area tend to share similar market values.

2.2 Ordinary Least Squares

Ordinary Least Squares (OLS) is a statistical method used to examine the relationship between a variable of interest and one or more explanatory variables. It is one of the most basic quantitative methods in linear regression. There are several techniques for measuring an OLS model's performance, namely its accuracy and generalizability.

Accuracy means that the model's predicted values should not deviate significantly from the observed values. Here are some common metrics that provide a measure of goodness of fit:

  • Correlation Analysis: Correlation is a standardized measure of the strength of the relationship between variables, ranging from -1 to 1. One of the assumptions of regression analysis is no multicollinearity, meaning that no two independent variables should be highly correlated with each other. Therefore, only a subset of the variables we initially selected is retained in the final model, based on the correlation matrix.

  • R-Square Calculation: R-Square is also called the coefficient of determination. Higher values indicate a better model, as R-Square is the proportion of variance in the dependent variable that is explained by the model. In this instance, it can be interpreted as the proportion of variance in sale price explained by the features included in our model.

  • Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE): MAE is the average of the absolute errors between predicted and true values, which is easy to understand and efficient. Similarly, MAPE can be even more intuitive because it is interpreted as a relative error. In our case, these two metrics, which can be interpreted directly in the context of sale prices, are selected as the key measurements rather than R-Square.
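These metrics are straightforward to compute. As an illustrative sketch (our analysis itself is carried out in R, but the arithmetic is the same in any language), the Python function below computes R-Square, MAE and MAPE from observed and predicted prices; the prices shown are made up for demonstration:

```python
import numpy as np

def accuracy_metrics(observed, predicted):
    """Compute R-Square, MAE and MAPE for a set of predictions."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = observed - predicted
    ss_res = np.sum(residuals ** 2)                     # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                            # coefficient of determination
    mae = np.mean(np.abs(residuals))                    # average absolute error
    mape = np.mean(np.abs(residuals) / observed)        # relative error (assumes observed > 0)
    return r2, mae, mape

# Toy example with hypothetical sale prices
obs = [500_000, 750_000, 1_200_000]
pred = [550_000, 700_000, 1_100_000]
r2, mae, mape = accuracy_metrics(obs, pred)
```

MAE stays in the units of sale price (dollars), while MAPE is unitless, which is why both are reported side by side throughout this study.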

Generalizability has two meanings here. First, the model should also perform well when applied to unseen data. Second, it should be equally accurate across different urban contexts.

  • Cross-Validation: Cross-validation is used to test the first type, out-of-sample generalizability. It makes it possible to test generalizability against many subsets of the data instead of a single hold-out set.

3 Data

3.1 Data Gathering

We collect various categories of relevant variables including house internal characteristics, amenities and spatial structure.

  • Data in the first category, house internal characteristics (e.g. property area, number of baths, stories, construction type, zoning code, etc.), comes from the raw data provided in class.

  • The second category covers amenities, including facility information, road density, 311 cases, crimes and neighborhood demographic data (race, median income, percentage of population with a bachelor's degree, percentage of unemployment, etc.), sourced from 2012-2015 U.S. Census data and San Francisco Open Data.

  • Last but not least, we use Zillow.com to obtain scores for nearly all K-12 schools in San Francisco, as well as some on-market data such as median house value for each neighborhood.

3.2 Data Pre-processing

After gathering all the datasets needed, we deal with missing and erroneous data.

3.3 Exploratory Analysis

Exploratory data analysis is an important approach for summarizing a dataset's main characteristics and identifying effective variables. In order to make full use of these datasets, we mutate some of them into measurable parameters using the k-nearest neighbors algorithm, buffering and zonal statistics tools (processing details are given in 3.4 Feature Engineering). The outcome falls into two situations:

  • For numeric variables, we calculate their mean values, standard deviations and create a correlation matrix.

As the correlation matrix below shows, most of our variables are independent of each other.

  • For categorical variables, a bar plot can be generated.

3.4 Feature Engineering

Based on the outcome of 3.3 Exploratory Analysis, we take a further step with the selected variables to build more significant predictors. Various statistical techniques are used in this process.

  • Zoning Code: Categorize the raw zoning codes into a dummy variable, where 1 represents zoning codes associated with higher sale prices and 0 represents those with lower values;

  • Age, Year and Month of Sale: Mutate the 6-digit SaleDate into SaleYear and SaleMonth, and calculate the age of each house by subtracting its BuiltYear from its SaleYear;

  • Race: Calculate a racial diversity index following https://www.csun.edu/~hcmth031/MDITUS.pdf;

  • Median Income, % Bachelor’s Degree, % Vacant Housing, % Detached House Unit, % Median House Value: Attach each house’s demographic data according to the neighborhood it belongs to;

  • Building Violations: Calculate the mean distance to the 5 nearest violations;

  • School Score: Calculate the mean score of the 3 nearest K-12 schools;

  • 311 Cases, Facilities: Count the number within a 500-meter buffer;

  • Road Density: Calculate line density and extract values to points in ArcGIS.
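Two of the techniques above, the mean distance to the k nearest points and the count within a fixed buffer, can be sketched in a few lines. The following Python illustration works on toy projected coordinates in meters; the actual processing used geospatial tools such as ArcGIS, and the point data below is hypothetical:

```python
import math

def k_nearest_mean_distance(home, points, k=5):
    """Mean distance from a home to its k nearest points (e.g. building violations)."""
    dists = sorted(math.dist(home, p) for p in points)
    return sum(dists[:k]) / min(k, len(dists))

def count_within_buffer(home, points, radius=500.0):
    """Number of points (e.g. 311 cases, facilities) within a radius buffer."""
    return sum(1 for p in points if math.dist(home, p) <= radius)

# Toy projected coordinates (meters) for one home and nearby events
home = (0.0, 0.0)
events = [(100, 0), (0, 200), (300, 400), (600, 800), (50, 50), (2000, 2000)]
vio_nn5 = k_nearest_mean_distance(home, events, k=5)   # mean of 5 nearest distances
case_cnt = count_within_buffer(home, events, radius=500.0)
```

The same pattern, swapping distance for score, yields the 3-nearest-schools mean used for School Score.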

In this section, four independent variables of particular interest are selected, and we plot their scatterplots against sale prices.

It can be concluded that variables such as median house value and property area have a positive association with sale prices, while variables like percentage of single houses and racial diversity have a negative relationship with sale prices. We then explore how they are distributed in the spatial context.

4 Result

4.1 Baseline Model

We divide the 9,433 entries into a 60% training set and a 40% test set. The sale price is then regressed on the selected predictors.
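The 60/40 split amounts to shuffling the entries and cutting at the 60% mark. A minimal Python sketch (the actual split was done in R, and the exact set sizes depend on rounding):

```python
import random

def train_test_split(rows, train_frac=0.6, seed=42):
    """Shuffle rows reproducibly and split into training and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

entries = list(range(9433))             # stand-in for the 9,433 data entries
train, test = train_test_split(entries)
```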

## 
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(sf_train.training) %>% 
##     dplyr::select(SalePrice, LotArea_Ne, PropArea_N, SaleYr, 
##         vio_nn5, X311cnt, Unemployment, facility.Buffer, SchoolScore, 
##         RoadDensity, Month, mdhv_mean, PCT_singleHouse, PCT_bachelor, 
##         median.income, racial, poverty.rate, PropClassC, ConstTypeC, 
##         ZoneCode01, Baths, stoFill))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7032608  -203945   -19094   162820  2485759 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -3.439e+06  2.915e+05 -11.797  < 2e-16 ***
## LotArea_Ne       6.647e-01  5.718e-02  11.626  < 2e-16 ***
## PropArea_N       3.082e+02  9.614e+00  32.061  < 2e-16 ***
## SaleYr           1.619e+05  4.719e+03  34.302  < 2e-16 ***
## vio_nn5         -4.804e+01  2.017e+01  -2.381 0.017277 *  
## X311cnt          1.613e+03  7.834e+02   2.059 0.039572 *  
## Unemployment    -2.064e+03  2.351e+03  -0.878 0.380058    
## facility.Buffer -6.690e+02  2.414e+02  -2.771 0.005606 ** 
## SchoolScore     -7.372e+02  2.910e+03  -0.253 0.800057    
## RoadDensity     -7.193e+03  9.184e+02  -7.833 5.67e-15 ***
## Month            8.865e+03  1.584e+03   5.595 2.31e-08 ***
## mdhv_mean        1.171e+03  5.318e+01  22.015  < 2e-16 ***
## PCT_singleHouse -3.361e+03  3.298e+02 -10.191  < 2e-16 ***
## PCT_bachelor     7.147e+03  8.114e+02   8.809  < 2e-16 ***
## median.income    1.196e+00  2.179e-01   5.489 4.22e-08 ***
## racial          -7.575e+05  7.264e+04 -10.428  < 2e-16 ***
## poverty.rate     2.737e+03  1.261e+03   2.170 0.030052 *  
## PropClassCD      1.033e+06  2.773e+05   3.725 0.000197 ***
## PropClassCDA     8.691e+05  3.015e+05   2.883 0.003952 ** 
## PropClassCF      8.716e+05  2.858e+05   3.050 0.002302 ** 
## PropClassCLZ     4.729e+05  2.959e+05   1.598 0.110055    
## PropClassCOZ    -4.763e+05  4.811e+05  -0.990 0.322201    
## PropClassCTH     5.325e+05  3.018e+05   1.764 0.077748 .  
## PropClassCTIC   -4.296e+05  2.853e+05  -1.506 0.132179    
## PropClassCZ      5.800e+05  2.779e+05   2.087 0.036941 *  
## PropClassCZBM    2.486e+05  3.142e+05   0.791 0.428936    
## PropClassCZEU    3.289e+06  3.932e+05   8.365  < 2e-16 ***
## ConstTypeC       5.454e+04  1.389e+05   0.393 0.694499    
## ZoneCode01      -1.005e+04  2.619e+04  -0.384 0.701087    
## Baths            1.471e+04  6.471e+03   2.274 0.023012 *  
## stoFill          1.475e+05  1.048e+04  14.082  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 390600 on 5646 degrees of freedom
## Multiple R-squared:  0.6867, Adjusted R-squared:  0.685 
## F-statistic: 412.5 on 30 and 5646 DF,  p-value: < 2.2e-16

After all the work in data wrangling and feature engineering, we establish our first model based on 21 selected variables. The table above is the summary of our baseline regression model. As shown in the table, most of our predictors are significant.

R_square MAE MAPE
0.6799069 261631.5 0.2591792

From the regression results above, it can be seen that the R-Square of the baseline model is smaller than 0.7, meaning that less than 70% of the variance in sale prices is explained by our model. However, Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE) are the more important indicators, and they turn out to be around 260,000 and 0.26 respectively.

We also plot the predicted sale price as a function of the observed price. The orange line represents a perfect fit, while the yellow line represents the baseline model's predicted fit. It can be seen that the model fits the data fairly well.

4.2 Spatial Lag Problem

The baseline regression is a reasonable model, but not yet satisfactory. A closer look should be taken to check whether there is spatial autocorrelation in the residuals, meaning that high values cluster near other high values and low values near other low values. A simple first check is to produce a location-based residual map and observe its spatial pattern.

However, it would be premature to draw conclusions from visual inspection alone. Therefore, we introduce the Moran’s I index into our study. Moran’s I has become one of the most widely used tests for spatial autocorrelation: values close to 1 indicate strong positive spatial autocorrelation, values close to -1 indicate strong negative spatial autocorrelation, and 0 indicates none.
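Global Moran's I can be computed directly from a spatial weight matrix. The Python sketch below is illustrative only (the report's test was run in R); it applies the standard formula \(I = \frac{n}{\sum_{ij} w_{ij}} \cdot \frac{\sum_{ij} w_{ij} z_i z_j}{\sum_i z_i^2}\), where \(z_i\) are deviations from the mean, to a toy example in which high values neighbor high values:

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for values x given a spatial weight matrix W (zero diagonal)."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()                      # deviations from the mean
    num = (w * np.outer(z, z)).sum()      # sum_ij w_ij * z_i * z_j
    den = (z ** 2).sum()                  # sum_i z_i^2
    return (len(x) / w.sum()) * num / den

# Toy example: four locations on a line, adjacent pairs as neighbors;
# low values sit next to low, high next to high, so I should be positive
vals = [10.0, 12.0, 30.0, 33.0]
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
I = morans_i(vals, W)
```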

As the plot above shows, the Moran’s I index (value = 0.206) is strongly positive, suggesting that spatial autocorrelation affects the baseline model’s predictive ability to a great extent. To reduce this impact, a new variable called the spatial lag is created, equal to the mean sale price of the 5 nearest houses.
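Constructing the spatial lag is a simple k-nearest-neighbors average. A minimal Python sketch with hypothetical coordinates and prices (the real feature was built in R on projected coordinates):

```python
import math

def spatial_lag(target, homes, prices, k=5):
    """Mean sale price of the k nearest houses (the 'lag_price' predictor)."""
    order = sorted(range(len(homes)), key=lambda i: math.dist(target, homes[i]))
    nearest = order[:k]                   # indices of the k closest houses
    return sum(prices[i] for i in nearest) / len(nearest)

# Toy coordinates and prices (hypothetical)
homes = [(0, 1), (1, 0), (2, 2), (5, 5), (0.5, 0.5), (10, 10)]
prices = [900_000, 950_000, 1_000_000, 400_000, 920_000, 300_000]
lag = spatial_lag((0, 0), homes, prices, k=5)
```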

4.3 Advanced model regression

To improve our model, the spatial lag is included in the advanced model. The correlation matrix is shown below. It is safe to conclude that, for most of the variables, the correlation between any two predictors is not high. As for the few predictors with high values in the matrix, we keep them because excluding them hurts the accuracy of the model, which matters more to us for prediction.

The advanced regression model is estimated and the results are shown below. Based on the p-values in the summary, most of our predictors are significant.

## 
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(sf_train.training) %>% 
##     dplyr::select(SalePrice, LotArea_Ne, PropArea_N, SaleYr, 
##         vio_nn5, lag_price, X311cnt, Unemployment, facility.Buffer, 
##         SchoolScore, RoadDensity, Month, mdhv_mean, PCT_singleHouse, 
##         PCT_bachelor, median.income, racial, poverty.rate, PropClassC, 
##         ConstTypeC, ZoneCode01, Baths, stoFill))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4958162  -164685    -6561   134961  2182734 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.732e+06  2.479e+05 -11.021  < 2e-16 ***
## LotArea_Ne       3.284e-01  4.906e-02   6.694 2.39e-11 ***
## PropArea_N       2.177e+02  8.386e+00  25.960  < 2e-16 ***
## SaleYr           1.426e+05  4.027e+03  35.421  < 2e-16 ***
## vio_nn5         -2.169e+01  1.713e+01  -1.266 0.205465    
## lag_price        6.268e-01  1.339e-02  46.815  < 2e-16 ***
## X311cnt          3.194e+03  6.658e+02   4.797 1.65e-06 ***
## Unemployment    -1.217e+03  1.996e+03  -0.610 0.542139    
## facility.Buffer -3.660e+02  2.050e+02  -1.785 0.074297 .  
## SchoolScore     -6.067e+03  2.473e+03  -2.454 0.014175 *  
## RoadDensity     -3.842e+03  7.828e+02  -4.909 9.43e-07 ***
## Month            8.137e+03  1.345e+03   6.050 1.54e-09 ***
## mdhv_mean        3.980e+02  4.806e+01   8.282  < 2e-16 ***
## PCT_singleHouse -1.587e+03  2.825e+02  -5.619 2.02e-08 ***
## PCT_bachelor     3.366e+03  6.934e+02   4.854 1.24e-06 ***
## median.income    2.333e-01  1.861e-01   1.254 0.209979    
## racial          -4.149e+05  6.209e+04  -6.682 2.58e-11 ***
## poverty.rate     2.643e+03  1.071e+03   2.469 0.013578 *  
## PropClassCD      6.219e+05  2.355e+05   2.640 0.008307 ** 
## PropClassCDA     5.250e+05  2.560e+05   2.051 0.040334 *  
## PropClassCF      5.579e+05  2.427e+05   2.299 0.021541 *  
## PropClassCLZ     3.125e+05  2.512e+05   1.244 0.213448    
## PropClassCOZ    -2.029e+05  4.084e+05  -0.497 0.619240    
## PropClassCTH     3.918e+05  2.562e+05   1.529 0.126249    
## PropClassCTIC   -5.547e+05  2.422e+05  -2.290 0.022044 *  
## PropClassCZ      2.403e+05  2.360e+05   1.018 0.308541    
## PropClassCZBM   -2.454e+04  2.668e+05  -0.092 0.926697    
## PropClassCZEU    2.623e+06  3.341e+05   7.852 4.86e-15 ***
## ConstTypeC       2.392e+05  1.179e+05   2.028 0.042615 *  
## ZoneCode01       3.979e+04  2.226e+04   1.788 0.073853 .  
## Baths            2.006e+04  5.494e+03   3.652 0.000263 ***
## stoFill          9.659e+04  8.958e+03  10.783  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 331500 on 5645 degrees of freedom
## Multiple R-squared:  0.7743, Adjusted R-squared:  0.7731 
## F-statistic: 624.7 on 31 and 5645 DF,  p-value: < 2.2e-16

To test accuracy, the R-Square, MAE and MAPE on the test set are used to measure goodness of fit. From the lower MAE and higher R-Square of the new model, we conclude that our advanced model is better than the baseline model.

R_Squared MAE MAPE
0.7645983 227291 0.2231394

K-fold cross-validation is used to test the first type of generalizability. In other words, model builders can decide whether their model is overfit based on the cross-validation results. But how does the algorithm work?

In the algorithm, the data is split into k folds, 100 in this case. The model is then trained k times, each time on k-1 folds. After each training run, predictions are made for the excluded fold and a goodness-of-fit metric, MAE in this instance, is recorded. The distribution of the MAEs represents how the model performs on different subsets of the data: the more concentrated the histogram, the more generalizable the model and the less likely it is to be overfit. The distribution of MAE for our model is shown below.
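The procedure can be sketched as follows. This is an illustrative Python version using a one-variable regression on synthetic data (the actual cross-validation was run in R on the full advanced model):

```python
import random
import statistics

def kfold_mae(xs, ys, k=10, seed=0):
    """k-fold cross-validation: fit y = a + b*x on k-1 folds, record MAE on the held-out fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k roughly equal folds
    maes = []
    for held_out in folds:
        train = [i for i in idx if i not in held_out]
        mx = sum(xs[i] for i in train) / len(train)
        my = sum(ys[i] for i in train) / len(train)
        b = sum((xs[i] - mx) * (ys[i] - my) for i in train) / \
            sum((xs[i] - mx) ** 2 for i in train)  # OLS slope
        a = my - b * mx                            # OLS intercept
        maes.append(sum(abs(ys[i] - (a + b * xs[i])) for i in held_out) / len(held_out))
    return maes

# Synthetic data: price roughly linear in area, plus noise
random.seed(1)
xs = [100 + 10 * i for i in range(100)]
ys = [1000 * x + random.gauss(0, 5000) for x in xs]
maes = kfold_mae(xs, ys, k=10)
spread = statistics.stdev(maes)   # tight spread suggests the model is not overfit
```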

meanMAE standardDeviationMAE
224032.9 35335.02

Based on the result, most of the MAEs of our model range from about 200,000 to 250,000, from which we conclude that our advanced model is not overfit.

Predicted prices as a function of observed prices are then plotted. Compared to the baseline model, our new model fits the data better and lies closer to the orange line, which stands for a perfect fit.

Based on the residual map, the distribution of residuals is less clustered than in the residual map of the baseline model. For example, the blue cluster in the center of San Francisco and the brown cluster in the southwest part of the city have now disappeared.

Moran’s I is again used to measure the spatial autocorrelation in our new model. The spatial pattern is largely eliminated once we take the spatial lag into account: the Moran’s I value decreases from 0.206 to 0.0121.

A map of predicted values for all San Francisco houses is shown below. Compared to the map of observed house sale prices, both maps show strong spatial clustering, and the distributions are almost the same.

To test the second type of generalizability, MAPE by neighborhood is mapped. Although some neighborhoods have lower sale prices while others have higher ones, the difference in MAPE is slight, about 1 percentage point.

The relationship between mean sale price by neighborhood and MAPE by neighborhood is depicted in the figure below. The MAPE changes only slightly as sale price increases. This suggests that our model’s errors are homoscedastic, which is one of the OLS regression assumptions.

A further generalizability test is conducted. We download race and income data via tidycensus to characterize differences in San Francisco’s urban context. Tracts where more than 50% of the population is white are defined as ‘Majority White’ in the racial context, and tracts with a median income higher than the mean value, 43,245, are defined as ‘High Income’.

Although there are some missing values in the census data, the results do give us insight into whether the model generalizes to different urban contexts.

Given the small difference between “Majority White” and “Majority Non-White” census tracts, we can conclude that our model generalizes well with respect to racial context. The small difference that remains indicates that the model slightly over-assesses majority-White tracts.

Mean Absolute Percent Error of test set sales by neighborhood racial context
Majority Non-White Majority White
0.1958659 0.2169658

Given the small difference between “High Income” and “Low Income” census tracts, we can conclude that our model is generalizable with respect to income context. The small difference that remains indicates that the model slightly over-assesses low-income tracts.

Mean Absolute Percent Error of test set sales by neighborhood income context
High Income Low Income
0.2051991 0.2020016

5 Discussion and Conclusion

  • The final model performs well in predicting house market value, with a mean absolute error (MAE) of about 230,000 and a mean absolute percent error (MAPE) of about 23%. The R-Square reaches nearly 75%, meaning the model accounts for about 75% of the variance in sale prices.

  • We identify a variety of significant and interesting variables in modeling. For example, we adopt a racial diversity index in our model, which shows a strong negative relationship with sale prices. Moreover, we take road density into consideration, since higher density improves the convenience of daily life. Last but not least, a house’s sale price reflects its surrounding market conditions, and the spatial lag becomes the most important indicator in our advanced model. Houses with high sale prices cluster in the northern and central parts of the city, while most low values lie in the west and south.

  • Although our proposed model performs well, there are still several limitations regarding generalizability. Our model slightly over-assesses majority-White tracts and low-income tracts. Some spatial autocorrelation still exists even after we take the spatial lag into consideration. A more careful feature engineering process and a more sophisticated machine learning algorithm would help improve the model.

  • In general, we would recommend our model to Zillow. Though it may not be as powerful as Zestimate® (Zillow’s home valuation model), we believe that our domain knowledge can inspire engineers and data scientists to discover more opportunities. A housing market with scientific valuation and the freedom to share this information will be beneficial to residents of San Francisco.