Machine Learning and Recoup Value on Residential Renovation Projects

Our homes are often our biggest investment. We invest our hard-earned money, our future and our emotions in them. It is unlike any other investment: we are in love with this one, and we want it to be both financially and emotionally rewarding. The return on investment (ROI) for a renovation project is a simple metric that measures the return obtained relative to the investment made in the project. ROI tells us whether the investment will change the house’s value positively or negatively.

Accurately predicting ROI for different renovation projects has many advantages. For instance, homeowners can decide which part of the house to renovate based on their budget, needs and future goals. If they want to sell in the near future, they would consider renovation projects that fetch a high ROI. Similarly, accurate ROI predictions can help realtors and lenders make better risk evaluations when deciding on house transactions or loans.

Predicting ROI, however, is no simple task. Even people who are experienced in home renovations, or those in the business of buying and selling homes, can only make crude estimates. They mostly rely on their years of experience of the market, using a few data points obtained from their prior work and from observing the market over previous years.

The main reason ROI is difficult to predict accurately is that a large number of factors influence it. Some of these factors influence the value directly, such as house location, configuration, and renovations, while a range of economic and geographical factors about the neighborhood influence the valuation indirectly. A bathroom addition in San Francisco may produce a greater return than one in a smaller city such as Minneapolis. The neighborhood also greatly influences ROI. For instance, modifications to homes that have schools or a hospital in close vicinity may offer better returns.

All this means that one needs to look at all the data surrounding these factors and figure out how they influence the ROI.


As mentioned earlier, ROI depends on several factors such as house location, configuration and renovation details, along with geographical and economic details of the neighborhood. In order to successfully model ROI as a function of these data points, it is necessary to collect and aggregate such data. One of the main challenges we encountered was identifying the large number of factors that we believe could influence ROI, and then collecting and aggregating the data from different sources.

We considered the following data for modeling the ROI:

  • House details: Contains basic configuration of the house and its location
  • Renovation details: Contains the details about the renovation project undertaken for the house. It may include the type of project, whether it’s an addition or renovation, the size of room added, cost, etc.
  • Neighborhood details: Contains details about the number and quality of schools, hospitals, parks present in the region, crime rate, cost of living, population, population across different ethnic groups, etc.
  • City details: Contains details about employment rate, economy, average family income, etc.

We show some of these data types and their attributes in detail below. We show only a few sample attributes related to different geographical, economic and neighborhood data, which we collect and aggregate to study their influence on renovation and housing values.

House data

APN | address | city | county | zip code | state | type | num_bed
#5127005013 | 2116 Adair St | Los Angeles | Los Angeles | 90011 | CA | Apartment House | 6.0

num_bath | total_rooms | home_size (sq ft) | lot_size (sq ft) | built_date | latitude | longitude
9 | 6 | 3752 | 9797 | 1907 | 34.025886 | -118.259782

Table 1

Renovation details from permits:

Renovation records for houses are collected from permits made publicly available by various city boards. As the data from different city boards come in different formats, we have to normalize, clean and format them to make them consistent. Also, in many instances, one or more attributes are missing, and we fill in such fields using our internal or third-party tools. For instance, the project costs and project type are estimated using our proprietary “Kukun Estimator” and NLP tool, respectively.

parcel no | permit no | permit date | permit description | cost | project
#5127005013 | #130427000018633 | 2013 | plumbing, 1 or 2 family dwelling, no plan check replace and relocate (2) hot water heaters. | | plumbing
#5127005013 | #030447000003405 | 2003 | hvac, 1 or 2 family dwelling, no plan check replace 2 direct vent heaters in unit #2116 | | hvac
#5127005013 | #030447000003407 | 2003 | hvac, 1 or 2 family dwelling, no plan check install/replace 1 direct vent heater in unit #2116 | | hvac
#5127005013 | #020167000013873 | 2002 | bldg-alter/repair, apartment, no plan check miscellaneous drywall repair | | general

Table 2

Home sold history

parcel no | sold date | sold price
#5127005013 | 01/13/2004 | 329000
#5127005013 | 09/27/2002 | 210000

Table 3

Home construction

year | 1 unit | 2 units | 3-4 units | 5 or more | total
2010 | 4,008 | 346 | 325 | 5,715 | 10,394

Table 4

City and neighborhood details:

year | unemployment | employment | gdp | avg per capita income | construction price index | consumer price index | cost of living | crime rate | violent crimes | inflation rate
2010 | 764,850 | 5,689,711 | 58,211 | 42,540 | 85.4 | 225.894 | 26780 | 301.4 | 311.6 | 1.625

Table 5

Population details

attribute | population
male | 1,889,064
female | 1,903,557
white | 1,888,158
asian | 426,959
american indian | 28,215
hispanic/latino | 1,838,822

Economical factors

attribute | population
under age 18 | 874,525
age 20-24 | 314,543
age 25-34 | 638,900
age 35-49 | 831,705
age 50-64 | 616,317
age 65 and over | 396,696


School details

name | address | zip code | preschool | elementary | middle school | high school | rating | type | lat | long
Goethe International Charter School | 12500 Braddock Drive | 90066 | yes | yes | no | no | 9 | Public charter | 33.9872 | -118.4223


Hospital details

name | address | zip code | beds | type | latitude | longitude
LAC+USC Medical Center | 1200 N State St | 90033 | 600 | General Acute Care Hospital | 34.0577 | -118.2103

Generating ROI values

Many classes of machine-learning algorithms for prediction work by learning the relation between input and output. For example, if one wants to build a model that predicts whether an email is spam, one needs to feed the algorithm a large amount of so-called “training data” that contains a lot of emails, each with a corresponding label that specifies whether it is spam or not. The algorithm then figures out the relation or pattern between the input (independent variables) and the output (dependent variable), which is subsequently used to make predictions for future data points.
One of the main bottlenecks in ROI prediction is that no data source exists for ROI ground truth values, even though we can gather many input data points related to home renovation, the neighborhood, and economic and geographical conditions. We therefore have to infer ROI values indirectly by looking at the house sale transaction history and renovation details from permits.
We achieve this with the house sold and renovation (permit) data by using two main criteria:
(i) Consider only those houses that have sold at least twice in the last five years.
(ii) Consider only houses that have been renovated at least once and had a valid permit between two sold transactions.
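
As a rough illustration of these two criteria, the sketch below pairs up consecutive sales of a parcel and keeps only the pairs with at least one permit in between. The function name, the record layouts and the `since` cutoff are our own illustrative assumptions, and because the permit dates shown in Table 2 carry only a year, the comparison is done at year granularity.

```python
from collections import defaultdict
from datetime import date


def renovated_sale_pairs(sales, permits, since):
    """Return (parcel, earlier_sale, later_sale, permit_years) tuples.

    sales:   iterable of (parcel, sold_on: date, price)   (assumed layout)
    permits: iterable of (parcel, permit_year: int)       (assumed layout)
    since:   only sales on or after this date are considered, standing in
             for the "last five years" window of criterion (i)
    """
    history_by_parcel = defaultdict(list)
    for parcel, sold_on, price in sales:
        if sold_on >= since:
            history_by_parcel[parcel].append((sold_on, price))

    years_by_parcel = defaultdict(list)
    for parcel, year in permits:
        years_by_parcel[parcel].append(year)

    pairs = []
    for parcel, history in history_by_parcel.items():
        history.sort()  # oldest sale first
        # Criterion (i): a pair exists only if the parcel sold at least twice.
        for earlier, later in zip(history, history[1:]):
            # Criterion (ii): at least one permit between the two sales
            # (inclusive by year, since permit dates are year-only here).
            between = sorted(
                y for y in years_by_parcel[parcel]
                if earlier[0].year <= y <= later[0].year
            )
            if between:
                pairs.append((parcel, earlier, later, between))
    return pairs


# Records taken from the sample tables above.
sales = [
    ("5127005013", date(2004, 1, 13), 329000),
    ("5127005013", date(2002, 9, 27), 210000),
]
permits = [
    ("5127005013", 2013),
    ("5127005013", 2003),
    ("5127005013", 2003),
    ("5127005013", 2002),
]
pairs = renovated_sale_pairs(sales, permits, since=date(2000, 1, 1))
```

With the sample records, the 2013 plumbing permit falls outside the two sales and is ignored, while the 2002 and 2003 permits are kept.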

parcel no | sold date | sold price
#5127005013 | 01/13/2004 | 329,000
#5127005013 | 09/27/2002 | 210,000
#5127005013 | 2003 | 210,000*cpi

In Table 3, a house is sold twice: once in 2002 for $210,000, and then again in 2004 for $329,000. The house was renovated three times between 2002 and 2004: two HVAC jobs and one general repair job. The value of the house therefore changed from $210K to $329K due to a combination of renovation activity and market changes. If we know the percentage increase in house values in the region, we can estimate the price change from 2002 to 2003 and from 2003 to 2004. We also use the percentage change in GDP and the construction price index to estimate the change in house prices between two different years. Empirically, we found the construction price index to be a good, if not perfect, measure of the average change in house prices from year to year.
Once we have the sale prices in two consecutive years and the renovation activities between them, we can separate the part of the change in house value that comes from market conditions from the part that comes from renovation activity. By applying the above procedure to our large house transaction and permit records for different cities, we can then generate the expected ROI (percentage change in overall house value) outputs for renovation projects.
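
To make the arithmetic concrete, here is a minimal sketch of one way such a market-adjusted percentage change could be computed. The function name and the exact formula are our assumptions for illustration: the first sale price is scaled by the ratio of a price index (for example the construction price index) at the two sale dates, and the remaining change is expressed as a fraction of that market-adjusted baseline.

```python
def market_adjusted_roi(first_price, second_price, index_at_first_sale, index_at_second_sale):
    """Percentage change in house value after removing market drift.

    The index arguments are values of some price index (assumed to be
    available for both sale dates); only their ratio matters.
    """
    if index_at_first_sale <= 0 or first_price <= 0:
        raise ValueError("price and index at the first sale must be positive")
    # What the house would have been worth at the second sale date
    # if only the market had moved.
    baseline = first_price * (index_at_second_sale / index_at_first_sale)
    # Remaining change, attributed here to the renovation activity.
    return (second_price - baseline) / baseline


# Sale prices from Table 3; the index values are made up for illustration.
roi = market_adjusted_roi(210_000, 329_000, index_at_first_sale=100.0, index_at_second_sale=112.0)
```

A positive result means the house gained more than the market alone would explain; a negative result means it gained less.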


4a. Features

The next step is to extract information and features from these data points that are relevant to our problem of ROI prediction. Because each data point comes from a different source and is in a different format, we need a way to extract and process the information. This requires carefully designing multiple features from the different tables, such as the school and permit tables. For instance, we can consider distance, rating and whether it’s public or private as features for schools. Similarly, for a permit description, we apply the bag-of-words technique.

Property feature: <<house type feature>, <room feature>, <size>, <lot size>, <built year>>, where

<house type feature> is obtained using one hot encoding (as follows) to encode different house types.

  apartment: [0,0,0,0,0,1], single_family: [0,0,0,0,1,0], duplex: [0,0,0,1,0,0], multi_family: [0,0,1,0,0,0],
  condominium: [0,1,0,0,0,0], multi_family_dwellings: [1,0,0,0,0,0]

and <room feature> is obtained as follows:
<number of bedrooms, number of bathrooms, total number of rooms>
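
The encoding above can be sketched in a few lines. The list order below is chosen so that the output matches the vectors listed for each house type; the function and constant names are illustrative assumptions.

```python
# Position i in the vector corresponds to HOUSE_TYPES[i].
HOUSE_TYPES = [
    "multi_family_dwellings",
    "condominium",
    "multi_family",
    "duplex",
    "single_family",
    "apartment",
]


def one_hot_house_type(house_type):
    """Return a list with a single 1 at the position of the given house type."""
    if house_type not in HOUSE_TYPES:
        raise ValueError(f"unknown house type: {house_type!r}")
    return [1 if known == house_type else 0 for known in HOUSE_TYPES]


apartment_vector = one_hot_house_type("apartment")
```

Raising an error for an unknown type is one possible design choice; an all-zero vector or an extra "other" slot would work as well.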

Renovation feature: <renovation cost>, <project type>, <description feature>, where the description feature is obtained through the bag-of-words technique, which counts how often each word of a pre-defined vocabulary appears in the text. The vocabulary is prepared by finding all the unique words appearing in a large corpus of permit descriptions, and then filtering out the most frequent words, stop words and other less meaningful words. A vocabulary built from such a corpus is usually large, and as a result the description features are large as well.
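
A minimal sketch of the bag-of-words count, using a tiny hand-picked vocabulary purely for illustration (a real vocabulary would be built from the permit corpus as described above). The tokenizer, which keeps only runs of letters, is also an assumption.

```python
import re
from collections import Counter


def bag_of_words(description, vocabulary):
    """Count how often each vocabulary word occurs in the description."""
    tokens = re.findall(r"[a-z]+", description.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]


# Hypothetical five-word vocabulary.
vocabulary = ["replace", "relocate", "heaters", "drywall", "repair"]
vector = bag_of_words("replace and relocate (2) hot water heaters.", vocabulary)
# vector == [1, 1, 1, 0, 0]
```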

City features: Encode various information about the city, such as its GDP, size, population, average cost of living and average income. All the house renovation data in a particular city will have the same set of city features. We use data from the last ten years, and create the city feature for each year as shown below:

<<gdp>,<cost_of_living>,<consumer_price_index>,<construction price index>,<population>,<employment>,<unemployment>,<population density>,<number of houses constructed>,<average family income>,<dollar value>,<house sold>,<total permits>,<crime rate>>

Schools: For each school level, we find the five schools nearest to the house being renovated and build this feature from their distances, ratings and school types as follows:

<<distances_of_5_nearest_preschools>, <distances_of_5_nearest_elementary_schools>, <distances_of_5_nearest_middleschools>, <distances_of_5_nearest_highschools>, <ratings_of_5_nearest_preschools>, <ratings_of_5_nearest_elementary_schools>, <ratings_of_5_nearest_middleschools>, <ratings_of_5_nearest_highschools>, <type_hist_school>>, where the <type_hist_school> gives the count of each school type for the 5 closest schools.

We use a similar concept for designing the features for hospitals, malls, and parks.
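
The nearest-neighbor features can be sketched as follows. How distance is measured is not specified above, so this sketch assumes great-circle (haversine) distance in kilometers; the function names and the dictionary layout for a school record are likewise illustrative. The caller would run it once per school level (preschool, elementary, and so on) and once per amenity type.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_place_features(house, places, k=5):
    """Distances, ratings and type histogram of the k places nearest to the house.

    house:  (lat, lon)
    places: list of dicts with keys "lat", "lon", "rating", "type" (assumed layout)
    """
    with_distance = [
        (haversine_km(house[0], house[1], p["lat"], p["lon"]), p) for p in places
    ]
    with_distance.sort(key=lambda pair: pair[0])
    nearest = with_distance[:k]
    distances = [d for d, _ in nearest]
    ratings = [p["rating"] for _, p in nearest]
    type_hist = dict(Counter(p["type"] for _, p in nearest))
    return distances, ratings, type_hist


# House and school coordinates taken from the sample tables above.
house = (34.025886, -118.259782)
schools = [
    {"lat": 33.9872, "lon": -118.4223, "rating": 9, "type": "Public charter"},
]
distances, ratings, type_hist = nearest_place_features(house, schools)
```

If fewer than k places exist, the returned lists are shorter than k, so a real pipeline would need some padding rule to keep the feature length fixed.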

Once we generate these features based on different data points, we concatenate them (link them together) to obtain a single feature for each renovation transaction of the house.

Feature normalization: The features created above from different data sources fall in different numerical ranges. For instance, the date is in the range of 2000-2050, the cost can be in the hundreds of thousands, and the number of rooms or ratings will be in a much smaller range. In order to treat all the features equally, we normalize them to the same range (0-1). This is achieved by obtaining the minimum and maximum values for each feature, subtracting the minimum from each feature value, and dividing the result by the difference between the maximum and the minimum. All features will then be in the 0-1 range.
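
A minimal sketch of this min-max scaling for a single feature column follows. The function name is illustrative, and mapping a constant column to all zeros is our assumption for the edge case where the maximum equals the minimum.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the 0-1 range: (v - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Constant feature: avoid dividing by zero (assumed convention).
        return [0.0 for _ in values]
    span = highest - lowest
    return [(v - lowest) / span for v in values]


years = min_max_normalize([2000, 2025, 2050])
# years == [0.0, 0.5, 1.0]
```

In practice the minimum and maximum would be computed on the training data and reused unchanged when normalizing new houses.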

4b. Regression

Our goal is to build a model that takes different feature points related to house, renovation and neighborhood, and predicts an ROI. This is referred to as regression in statistics and machine-learning lingo. Before moving on to our problem, here’s an explanation of regression in simple terms.

Let’s consider a simple example first, with a single input feature (independent variable) x1. Let us imagine this feature is one of our data points (say GDP). We want to find out how this single feature x1 affects the ROI (dependent variable) y1. If we assume that the relation between these two variables is linear, we can then express it mathematically as:

y1 = w1 * x1

This basically indicates that the ROI y1 is proportional to x1 and varies linearly with respect to x1. The parameter w1 decides by what factor y1 is changed when x1 is changed, and whether in positive or negative direction.

So, the entire problem boils down to figuring out the parameter w1. If we can discover w1, we can estimate y1 for any data point x1 using the above relation.

The regression technique in machine learning basically figures out the parameter w1 by observing many training data points x1 and y1.
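
For this single-feature model without an intercept, the least-squares estimate has a simple closed form: w1 is the sum of x1*y1 over all training pairs divided by the sum of x1 squared. The sketch below applies it to made-up numbers; the function name and the data are illustrative assumptions, not values from our dataset.

```python
def fit_slope(xs, ys):
    """Least-squares slope w1 for the model y = w1 * x (line through the origin)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    denominator = sum(x * x for x in xs)
    if denominator == 0:
        raise ValueError("at least one x must be non-zero")
    numerator = sum(x * y for x, y in zip(xs, ys))
    return numerator / denominator


# Made-up training pairs that lie exactly on the line y = 2 * x.
w1 = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# w1 == 2.0
prediction = w1 * 4.0  # estimate y for a new data point x1 = 4
```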


Fig 1. This illustrative figure shows a few data points where y1 increases linearly with x1. How much y1 increases or decreases with respect to x1 is given by the slope of the dotted line that best fits these points.


If we have many examples of feature x1 (GDP) and the corresponding output y1 (ROI), we can find a parameter that best explains all these data points in some average sense. In Fig 1, we show four data pairs (x1, y1), where each data point approximately follows the relation y1 = w1 * x1. One way to find w1 from all these points is to look for an imaginary line (shown as a dotted line) that best fits them. Once we determine the slope of this line, we have w1.

The problem considered above is the simplest case, with only a single feature (x1). In reality, many features can influence the output. In our ROI prediction, we have a large number of features (GDP, schools, hospitals, house configuration, renovation details) that could influence ROI. Yet we can still apply the same logic, and formulate the regression problem as:

y = w1 * x1 + w2 * x2 + w3 * x3 + … + wn * xn

This can be written in vector notation as:

y = WX

where y is a scalar value and W and X are vectors. The dot product between the features X and the parameters W (element-wise multiplication followed by a sum) gives us the ROI y. The problem is exactly the same as before but operates in a higher dimension. With a single feature, we find the line that best fits the one-dimensional data points. With multiple features, we find a higher-dimensional hyperplane that best explains our high-dimensional data points. For instance, if we consider 1,000 features, we need to imagine Fig. 1 in 1,001-dimensional space, and figure out how ‘y’ changes with respect to the 1,000 ‘xi’ features. One can use any available estimation technique, such as ordinary least squares or ridge regression, to find W. Ordinary least squares, for example, stacks the training feature vectors into a matrix and relies on the pseudo-inverse of that matrix to get W.
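
For readers who want the formulas, the standard closed-form solutions for the two techniques named above are written out below. The notation beyond W and X is our assumption: A is the matrix whose j-th row is the feature vector of the j-th training example, the bold y is the vector of the corresponding ROI values, I is the identity matrix, and λ ≥ 0 is the ridge penalty. The ordinary least squares formula requires AᵀA to be invertible.

```latex
% Prediction for one example with n features
y = W \cdot X = \sum_{i=1}^{n} w_i x_i

% Ordinary least squares over m training examples
W_{\mathrm{OLS}} = \arg\min_{W} \lVert A W - \mathbf{y} \rVert_2^2
                 = (A^{\top} A)^{-1} A^{\top} \mathbf{y}

% Ridge regression adds a penalty on the size of W
W_{\mathrm{ridge}} = \arg\min_{W} \lVert A W - \mathbf{y} \rVert_2^2 + \lambda \lVert W \rVert_2^2
                   = (A^{\top} A + \lambda I)^{-1} A^{\top} \mathbf{y}
```

With a single feature and no penalty, the first formula reduces to the slope of the best-fit line through the origin discussed earlier.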

An ROI algorithm developed with a principled approach like ours on a real dataset can predict returns by learning patterns from historical data rather than relying on hand-written rules. In our approach, we have assumed that the relationship between ROI and the independent variables is linear. However, the underlying dependence may be non-linear. As we collect and aggregate more data, it may become possible to model such non-linear relations between variables.
