
Salt Lake County Real Estate Analysis


We will be analyzing real estate data for different cities in Salt Lake County listed as of November 2022. In this project, we will be scraping Zillow, an American tech real-estate marketplace company. Zillow is one of the most popular real estate sites because it offers one of the most robust suites of tools for home professionals and it sources postings from both the MLS (Multiple Listing Service) and non-MLS sources. Non-MLS sources include for-sale-by-owner listings, non-MLS foreclosures, and auctions. Additionally, Zillow has a large database of over 135 million properties.

After gathering the data, we will create different visualizations with geospatial analysis and charts to uncover preliminary trends between different variables. We will also use linear regression and unsupervised machine learning algorithms such as Principal Component Analysis and K-Means Clustering to gain meaningful insights into the data we collected.


Resources

  • Analysis Software: Jupyter Lab 3.4.4
  • Programming Languages: Python 3.10
  • Data Source: ZillowData.csv - Zillow.com

Webscraping Process


Scrape Zillow using BeautifulSoup

BeautifulSoup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. We scraped data for different cities in Salt Lake County from Zillow. To avoid problems with Zillow blocking repeated requests, we saved all of the HTML files locally in the data folder; the path to the data folder is stored in the DATA_PATH variable. The ZillowData folder contains the HTML source code for the different cities, where each file name contains the city name with the corresponding page number from Zillow:

wvc1.html draper1.html ... slc1.html
wvc2.html draper2.html ... slc2.html
wvc3.html draper3.html ... slc3.html
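
As a rough illustration, each saved page can be parsed with BeautifulSoup along the following lines; the listing-card tag and class name below are placeholders, since Zillow's markup changes over time:

    import os
    from bs4 import BeautifulSoup

    DATA_PATH = 'data'   # path to the data folder, as described above

    # Parse one of the saved result pages for Salt Lake City
    with open(os.path.join(DATA_PATH, 'ZillowData', 'slc1.html'), encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # Each listing card contains the price, address, beds, baths, and sqft text.
    # The class name below is a placeholder -- inspect the saved HTML for the real one.
    for card in soup.find_all('article', class_='property-card'):
        print(card.get_text(separator=' ', strip=True))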

The Data

We were able to scrape 14 different variables associated with each of the 2,107 properties in this dataset. The data selection, processing, and transformation Python code can be viewed in the ZillowUT_Data.ipynb file and the cleaned data can be viewed in the ZillowData.csv file.

We also checked the data types and converted any numbers that were read in as strings to numerical values. In particular, we encoded the type column as a binary proptype variable that takes the value 0 or 1. That is:

$$ \text{proptype} = \begin{cases} 0, & \text{condo/townhouse/lot} \\ 1, & \text{single family house} \end{cases} $$

Finally, we replaced all null values in beds and baths with 0 and removed all listings with missing or erroneous longitude and latitude values.
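
A minimal sketch of these cleaning steps, assuming the scraped listings are already in a DataFrame df; the exact label stored in the type column is an assumption:

    import numpy as np

    # Encode property type as a binary variable (label string is an assumption)
    df['proptype'] = np.where(df['type'] == 'single family house', 1, 0)

    # Replace missing bed and bath counts with 0
    df[['beds', 'baths']] = df[['beds', 'baths']].fillna(0)

    # Drop listings with missing latitude/longitude values
    df = df.dropna(subset=['latitude', 'longitude'])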


Removing Outliers

From the scatterplot below, we can see that the cities Holladay, Salt Lake City, Sandy, South Jordan, West Jordan, and Draper have problematic outliers.


BC Home Price Scatter

To remove these outliers, we calculated the interquartile range to determine the outlier cutoff.

$$ IQR = Q_3 - Q_1 $$

The outlier cutoff is 1.5 times the interquartile range: the lower bound is $Q_1$ minus the cutoff and the upper bound is $Q_3$ plus the cutoff. Data points below the lower bound or above the upper bound are removed to get a more precise analysis without outliers. Here is an example of how we removed the outliers for each city.

Python Code:


    import numpy as np

    # Salt Lake City listing prices
    outlier = df.loc[df["city"] == "Salt Lake City"]
    outlier = outlier['price']

    # Remove Salt Lake City outliers
    # Calculate the interquartile range
    q25, q75 = np.percentile(outlier, 25), np.percentile(outlier, 75)
    iqr = q75 - q25

    # Calculate the outlier cutoff
    cut_off = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off

    # Identify outliers
    outliers = [x for x in outlier if x < lower or x > upper]

    # Remove listings outside the [lower, upper] bounds
    df = df.drop(df[(df['city'] == 'Salt Lake City') & (df['price'] < lower)].index)
    df = df.drop(df[(df['city'] == 'Salt Lake City') & (df['price'] > upper)].index)
                

The second scatterplot illustrates the data without the outliers:

AC Home Price Scatter

Preliminary Analysis


Average Home Price

From the chart above, the cities with the highest average home prices are Draper and Holladay, while the cities with the lowest average home prices are Kearns and Millcreek. The ordering remains relatively similar for average home sqft: the cities with the highest average home sqft are Draper and Herriman, and the cities with the lowest average home sqft are still Kearns and Millcreek. This could potentially be explained by the locations of the cities, which is discussed later in the analysis.

Average Home Sqft

However, the order of the cities changes dramatically based on average cost per sqft. From the chart below, we can see that the cities with the highest average cost per sqft are Millcreek and Holladay, and the cities with the lowest average cost per sqft are West Valley City and Bluffdale.

Average Cost per Sqft

We also looked into the relationship between the list price of the homes and the sqft. Based on the scatterplot generated, we can see a positive correlation between price and sqft: as the sqft of a home increases, its price also tends to increase.

Price vs Sqft Scatterplot
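
For reference, a minimal sketch of the aggregation behind these charts, assuming the cleaned DataFrame df with city, price, and sqft columns (the cost_per_sqft column name is ours):

    # City-level averages used for the bar charts above
    df['cost_per_sqft'] = df['price'] / df['sqft']
    city_summary = df.groupby('city')[['price', 'sqft', 'cost_per_sqft']].mean()
    print(city_summary.sort_values('price', ascending=False))

    # Correlation between list price and square footage
    print(df[['price', 'sqft']].corr())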

GeoSpatial Analysis


saltlake

Geospatial data defines specific geographical locations, either in the form of latitude and longitude coordinates or text fields with names of geographical areas, such as countries or states. Geospatial charts combine geospatial data with other forms of data to create map-based charts. Our geodata contains (x, y) coordinates of geographical locations. The geometric shapes in a GeoSeries or GeoDataFrame object are simply a collection of coordinates in an arbitrary space. The Coordinate Reference System (CRS) tells Python how those coordinates relate to places on the Earth and is used to project the location coordinates onto a map for visualization. We will use the WGS84 latitude-longitude projection.

We will use two of the variables, latitude and longitude, of each listing to visualize the data for Salt Lake County with GeoPandas, a Python library that extends pandas for working with geospatial data and making maps. The WGS84 projection is referenced with the authority code EPSG:4326.

Python Code:


    import geopandas as gpd
    from shapely.geometry import Point

    # Build point geometries from the longitude/latitude columns
    crs = 'EPSG:4326'
    geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
    geo_df = gpd.GeoDataFrame(df, crs=crs, geometry=geometry)
                

We then created heat maps showcasing the distribution of property prices and sqft:

Python Code:


    import numpy as np
    import matplotlib.pyplot as plt

    # utah_map: base map GeoDataFrame loaded earlier in the notebook
    geo_df['price_log'] = np.log(geo_df['price'])   # log scale to reduce skew
    fig, ax = plt.subplots(figsize=(10, 10))
    utah_map.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
    geo_df.plot(column='price_log', ax=ax, cmap='inferno',
                legend=True, legend_kwds={'shrink': 0.5}, markersize=15)
                
Price Heat Map

From the price heat map of Salt Lake County, we can see that a majority of the higher priced real estate properties are located on the east side of the Salt Lake valley. This corresponds with our previous analysis since Draper and Holladay are located in this area. Additionally, lower priced homes are located on the west side of Salt Lake City.

Python Code:


    # Same approach for square footage
    geo_df['sqft_log'] = np.log(geo_df['sqft'])
    fig, ax = plt.subplots(figsize=(10, 10))
    utah_map.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
    geo_df.plot(column='sqft_log', ax=ax, cmap='inferno',
                legend=True, legend_kwds={'shrink': 0.5},
                alpha=.5)
                
Sqft Heat Map

In the sqft heat map of Salt Lake County, we can see that the real estate properties with more square feet are located on the east and southwest sides of the Salt Lake Valley, which is understandable since those locations are further from the city center.


Linear Regression


Ordinary Least Squares Assumptions:

  1. The linear regression model is “linear in parameters.”
  2. There is a random sampling of observations.
  3. There is homoscedasticity and no autocorrelation.
  4. The standard errors assume that the covariance matrix of the errors is correctly specified.

Note that the regression output below also reports a large condition number (9.78e+03), which might indicate strong multicollinearity or other numerical problems.

We now develop a multiple linear regression model for real estate property prices in Salt Lake County, which we could use to come up with a list price for properties coming on the market. Our model takes the form:

$$ Y = β_0 + β_1x_1 + β_2x_2 + \cdots + β_nx_n + ε $$

where $x_n$ are predictive variables.

Python Code:


    import statsmodels.formula.api as sm

    mod = sm.ols(formula="price ~ sqft + beds + baths + proptype", data=df)
    res = mod.fit()
    print(res.summary())
                

Output:

                            
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                  price   R-squared:                       0.640
    Model:                            OLS   Adj. R-squared:                  0.639
    Method:                 Least Squares   F-statistic:                     872.9
    Date:                Mon, 21 Nov 2022   Prob (F-statistic):               0.00
    Time:                        02:36:54   Log-Likelihood:                -26554.
    No. Observations:                1972   AIC:                         5.312e+04
    Df Residuals:                    1967   BIC:                         5.315e+04
    Df Model:                           4                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept   1.024e+05   1.28e+04     7.974      0.000    7.72e+04    1.28e+05
    sqft         180.8805      5.669     31.907      0.000     169.763     191.998
    beds       -8116.8371   3965.622     -2.047      0.041   -1.59e+04    -339.575
    baths       2.722e+04   5920.118      4.598      0.000    1.56e+04   3.88e+04
    proptype    4.042e+04   9865.190      4.097      0.000    2.11e+04    5.98e+04
    ==============================================================================
    Omnibus:                      882.045   Durbin-Watson:                   1.447
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7474.571
    Skew:                           1.898   Prob(JB):                         0.00
    Kurtosis:                      11.750   Cond. No.                     9.78e+03
    ==============================================================================
                

In the multiple linear regression model, the intercept is about 102,400 and the R-squared is 0.640. The R-squared is the proportion of the variation in the dependent variable that is predictable from the independent variables, so around 64.0% of the variability observed in the list price is explained by this regression model. Additionally, all the variables are statistically significant at an alpha of 5% and can be used to predict the listed price of a real estate property. In particular, each additional sqft increases the predicted price by about $181, and each additional bathroom increases it by about $27,220. Meanwhile, each additional bedroom actually decreases the predicted price by about $8,117. We can also see that a single family home is predicted to list for about $40,420 more than an otherwise comparable condo, townhouse, or lot.
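
As an illustration of how the fitted model could price a property coming on the market, here is a sketch scoring a hypothetical 2,000 sqft single family home with 3 beds and 2 baths (the listing values are made up):

    import pandas as pd

    # Hypothetical new listing: 2,000 sqft single family home with 3 beds and 2 baths
    new_listing = pd.DataFrame({'sqft': [2000], 'beds': [3], 'baths': [2], 'proptype': [1]})
    print(res.predict(new_listing))
    # ~102,400 + 181*2,000 - 8,117*3 + 27,220*2 + 40,420  ≈  $535,000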


Machine Learning Algorithms (Deeper Analysis)


For this analysis, we will be using Principal Component Analysis, K-Means Clustering, and Hierarchical Clustering to understand how the prices, sqft, and cost per sqft differ between zipcodes.


Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is one of the most widely used unsupervised machine learning algorithms across a variety of applications: exploratory data analysis, dimensionality reduction, information compression, and data de-noising. PCA is a dimensionality reduction technique that transforms a set of features in a dataset into a smaller number of features called principal components, while at the same time trying to retain as much information from the original dataset as possible. Since we have 3 different variables, we have a three-dimensional data set; PCA works with any number of variables and lets us reduce the data to a two-dimensional PCA plot. PCA can also tell us which variable is the most valuable for clustering the data and how much of the variation the two-dimensional graph accounts for.

Python Code:

  
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.decomposition import PCA

    # X: per-zipcode feature matrix (price, sqft, cost per sqft) built earlier from the summary table
    pca_model = PCA(n_components=3)
    X_PCA = pca_model.fit_transform(X)
    df_plot = pd.DataFrame(X_PCA, columns=['PC1', 'PC2', 'PC3'])
    df_plot.head()

    # Annotate each zipcode at its (PC1, PC2) coordinates
    fig, ax1 = plt.subplots(figsize=(10, 6))
    ax1.set_xlim(X_PCA[:, 0].min() - 1, X_PCA[:, 0].max() + 1)
    ax1.set_ylim(X_PCA[:, 1].min() - 1, X_PCA[:, 1].max() + 1)

    for i, name in enumerate(summary['zipcode'].values):
        ax1.annotate(name, (X_PCA[i, 0], X_PCA[i, 1]), ha='center', fontsize=10)
                

Principal Component Analysis starts by calculating the average of each variable and, using these averages, finds the center of the data. It then shifts the data so that this center is at the origin. From here, it computes the principal components. The principal components are vectors, but they are not chosen at random. The first principal component (PC1) is the linear combination of the original variables that explains the greatest amount of variance in the features; equivalently, it is the direction that minimizes the sum of squared distances from the data points to their projections onto that line.

In order to maximize variance, the first weight vector $w_{(1)}$ thus has to satisfy:

$$ \begin{aligned} w_{(1)} &= \text{arg } \displaystyle{\max_{||w|| = 1}} \left( \sum_{i} t_{1(i)}^2 \right) \\ &= \text{arg } \displaystyle{\max_{||w|| = 1}} \left( \sum_{i} {\left( x_{(i)} \cdot w \right)}^2 \right) \end{aligned} $$

Since $w_{(1)}$ has been defined to be a unit vector, it equivalently also satisfies:

$$ w_{(1)} = \text{arg max} \left( \frac{w^TX^TXw}{w^Tw} \right) $$

The quantity to be maximized can be recognized as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as $X^TX$ is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when $w$ is the corresponding eigenvector.

In our analysis, we require more than one component. The $k^{th}$ component can be found by subtracting the first $k − 1$ principal components from $X$ and then finding the weight vector which extracts the maximum variance from this new data matrix:

$$ \hat X_k = X - \sum_{s=1}^{k-1}{Xw_{(s)}w_{(s)}^T} $$ $$ \begin{aligned} w_{(k)} &= \text{arg } \displaystyle{\max_{||w|| = 1}} \left( ||\hat X_{k}w||^2 \right) \\ &= \text{arg max} \left( \frac{w^T\hat X_k^T\hat X_kw}{w^Tw} \right) \end{aligned} $$

The sum of squared distances for the best fit line is the eigenvalue for PC1. The second component (PC2) is orthogonal to the first, and it explains the greatest amount of variance left after the first principal component. Then we find PC3 which is perpendicular to PC1 and PC2. The number of PCs is either the number of variables or the number of samples, whichever is smaller.
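
To make the eigenvector result above concrete, here is a small numerical check on synthetic data (not our Zillow features) that the first principal component direction found by sklearn matches the leading eigenvector of $X^TX$ for centered data:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(100, 3))            # toy three-variable data set
    X_centered = X_demo - X_demo.mean(axis=0)     # shift the center of the data to the origin

    # Leading eigenvector of X^T X (largest eigenvalue)
    eigvals, eigvecs = np.linalg.eigh(X_centered.T @ X_centered)
    w1 = eigvecs[:, np.argmax(eigvals)]

    # First principal component direction from sklearn (sign may be flipped)
    pc1 = PCA(n_components=3).fit(X_centered).components_[0]
    print(np.isclose(abs(pc1 @ w1), 1.0))         # True: the two directions coincide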

PCA

Once all the principal components are figured out, you can use the eigenvalues to determine the proportion of variation that each PC accounts for. Then you can create a scree plot which is a graphical representation of the percentages of variation that each PC accounts for.

Python Code:


    # Proportion of variance explained by each principal component
    var_ratio = pca_model.explained_variance_ratio_

    fig, ax1 = plt.subplots(figsize=(10, 6))
    plt.plot([1, 2, 3], var_ratio, '-o')
                
scree-plot

In this scree plot, we can see that PC1 and PC2 account for the vast majority of the variation. This means that a two-dimensional graph using just PC1 and PC2 would be a good approximation of the three-dimensional data, since it would account for about 99.2% of the variation. A one-dimensional graph using PC1 alone would account for only 59.98% of the variation.

Output:


    Explained Variance Ratios of the PCA: [0.59977592 0.39255202 0.00767206]
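
As a quick sanity check of the percentages quoted above, the cumulative explained variance can be read directly off the fitted pca_model:

    import numpy as np

    # Cumulative proportion of variance explained by the first k principal components
    print(np.cumsum(pca_model.explained_variance_ratio_))
    # [0.5998 0.9923 1.    ]  ->  PC1 alone ~60%, PC1 + PC2 ~99.2%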
                

K-Means Cluster Analysis

K-means clustering starts from a set of initial cluster centers and assigns every point to its nearest center using Euclidean distance. It then recomputes each center as the mean of the points assigned to it and reassigns the points based on the new means, repeating until the cluster assignments no longer change. The algorithm is restarted from different initial centers as many times as we tell it to, and it returns the best overall clustering found. A bare-bones sketch of this loop is shown after the illustration below.

Different K-Means
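
For intuition, the assignment and update steps described above can be written out directly. This is only a bare-bones NumPy sketch (empty-cluster handling and multiple restarts are omitted), not the scikit-learn implementation we use below:

    import numpy as np

    def lloyd_kmeans(X, k, n_iter=100, seed=0):
        """Bare-bones illustration of the K-means loop described above."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]       # initial cluster centers
        for _ in range(n_iter):
            # Assign every point to its nearest center (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each center as the mean of the points assigned to it
            # (empty clusters are not handled in this sketch)
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):                    # clusters no longer change
                break
            centers = new_centers
        return labels, centers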

First, we need to determine the best value of K. An easy method for doing so is the elbow curve. To create an elbow curve, we plot the number of clusters (the values of K) on the x-axis and the values of a selected objective function on the y-axis. The total intra-cluster distance is one of the most common objective functions to use; it measures the amount of variation remaining within the clusters. For our elbow curve, we therefore plot the number of clusters on the x-axis and the total intra-cluster distance on the y-axis.
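
A sketch of how the elbow curve can be produced with scikit-learn, assuming X is the per-zipcode feature matrix used throughout this section; KMeans exposes the total intra-cluster distance as inertia_:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Total intra-cluster distance (inertia) for K = 1..10
    k_values = range(1, 11)
    inertia = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in k_values]

    plt.plot(list(k_values), inertia, '-o')
    plt.xlabel('Number of clusters (K)')
    plt.ylabel('Total intra-cluster distance')
    plt.show()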

Intra-Cluster-Distance

Using the "elbow" or "knee of a curve" as a cutoff point is a common heuristic in mathematical optimization to choose a point where diminishing returns are no longer worth the additional cost. In clustering, this means choosing a number of clusters such that adding another cluster does not yield a noticeably better model. The intuition is that increasing the number of clusters will naturally improve the fit (explain more of the variation) since there are more parameters (more clusters) to use, but at some point this becomes over-fitting, and the elbow reflects this. We can see that the total intra-cluster distance is large for k = 1 and decreases as we increase k until k = 6, after which it tapers off and gets only marginally smaller; the slope becomes roughly constant after k = 6. This indicates that k = 6 is a good choice, so we will now cluster the zip codes into six clusters using K-means.

Python Code:


    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Fit K-means with six clusters and label each zipcode
    kmeans = KMeans(n_clusters=6)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)

    fig, ax1 = plt.subplots(figsize=(10, 6))
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50)

    for i, name in enumerate(summary['zipcode'].values):
        ax1.annotate(name, (X[i, 0], X[i, 1]), ha='center', fontsize=10)
                
K-Mean

Python Code:


    # Zip codes assigned to K-means cluster label i
    df2 = summary['zipcode'].values
    cluster1 = df2[kmeans.labels_ == i].tolist()
                

where i is the cluster label (0 through 5); running the snippet for each label produces the six clusters listed below.
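
For reference, a small loop over all six labels collects every cluster at once (a sketch using the fitted kmeans model and the summary table from above):

    # Print the zip codes that belong to each K-means cluster
    for i in range(6):
        members = summary['zipcode'].values[kmeans.labels_ == i].tolist()
        print(f"Cluster {i + 1}: {members}")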

Cluster 1: 84107, 84116, 84118, 84104, 84119, 84123
Cluster 2: 84020, 84092, 84121, 84093, 84095, 84065, 84096
Cluster 3: 84009, 84094, 84088, 84081, 84084, 84047, 84128, 84120, 84070, 84044, 84129
Cluster 4: 84101, 84111
Cluster 5: 84105, 84108, 84117, 84109, 84124, 84103
Cluster 6: 84106, 84102, 84115

Python Code:


    from pandas.plotting import parallel_coordinates
    import matplotlib.pyplot as plt
    import seaborn as sns

    palette = sns.color_palette("Set1", 10)
    X_df = summary.drop(['zipcode'], axis=1)

    def display_parallel_coordinates_centroids(df, num_clusters):
        '''Display a parallel coordinates plot for the centroids in df'''
        fig = plt.figure(figsize=(12, 5))
        parallel_coordinates(df, 'cluster', color=palette)
        plt.title(f'Parallel coordinates of the {num_clusters} cluster centroids')
                
Parallel-Centroids

From the Parallel Coordinates Centroids plot, we can see the variation across the variables for each of the clusters found by the K-means algorithm. We can identify that the real estate properties located in the Cluster 4 zip codes have relatively low average sqft but high cost per sqft, and the Cluster 2 zip codes have high average prices and high average sqft.
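
For completeness, the helper above might be invoked along these lines; building the centroid DataFrame from kmeans.cluster_centers_ assumes the centers line up with the columns of X_df:

    import pandas as pd

    # Centroid coordinates of the fitted K-means model, one row per cluster
    centroids = pd.DataFrame(kmeans.cluster_centers_, columns=X_df.columns)
    centroids['cluster'] = centroids.index
    display_parallel_coordinates_centroids(centroids, 6)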


K-Means Clustering with Principal Component Analysis

Applying K-means to the Principal Component Analysis projection of the data gives us an additional way to validate the clustering: we use dimensionality reduction as a feature extractor and check whether the same groups appear. The updated PCA plot, colored by cluster, is consistent with the earlier clustering, with the points split into six sections:

Python Code:


    fig,ax1 = plt.subplots(figsize=(10, 6))

    ax1.set_xlim(X_PCA[:,0].min()-1,X_PCA[:,0].max()+1)
    ax1.set_ylim(X_PCA[:,1].min()-1,X_PCA[:,1].max()+1)

    plt.scatter(X_PCA[:, 0], X_PCA[:, 1], c=y_kmeans, s=50)

    for i,name in enumerate(summary['zipcode'].values):
        ax1.annotate(name, (X_PCA[i,0], X_PCA[i,1]), ha='center',fontsize=10)
                
K-Mean-PCA

Hierarchical Cluster Analysis

Similar to K-means clustering, hierarchical clustering, used here in its agglomerative (bottom-up) form, works with groups (clusters) of data points. The algorithm starts by declaring each point its own cluster, then repeatedly merges the two most similar clusters until a declared stopping point is reached. Hierarchical clustering is often paired with heatmaps, where it organizes the rows and columns based on similarity, making correlations in the data easy to see.

Python Code:


    import plotly.figure_factory as ff

    ff.create_dendrogram(X_PCA, color_threshold=0, labels=summary['zipcode'].values)
                
dendogram

Additionally, we created a dendrogram to decide how many clusters to make. A dendrogram is a graph that places the individual points along the x-axis and then connects them at increasing heights as they are merged into clusters. Similar to the elbow curve, it gives us a better idea of the ideal number of clusters to use. Based on the dendrogram above, we will now use hierarchical clustering with complete linkage and Euclidean distance to sort the zipcodes into six clusters.

Python Code:


    import matplotlib.pyplot as plt
    from sklearn.cluster import AgglomerativeClustering

    # Agglomerative clustering with complete linkage and Euclidean distance
    # (in newer scikit-learn versions the affinity argument is named metric)
    fig, ax1 = plt.subplots(figsize=(10, 6))
    agg_cluster_model = AgglomerativeClustering(linkage="complete", affinity='euclidean', n_clusters=6)
    y_pred = agg_cluster_model.fit_predict(X)

    plt.scatter(X[:, 0], X[:, 1], c=y_pred, marker="o")
    for i, name in enumerate(summary['zipcode'].values):
        ax1.annotate(name, (X[i, 0], X[i, 1]), ha='center', fontsize=10)
                
Hierarchical

Python Code:


    # Zip codes assigned to hierarchical cluster label i
    df2 = summary['zipcode'].values
    cluster1 = df2[agg_cluster_model.labels_ == i].tolist()
                

where i is the cluster label (0 through 5); running the snippet for each label produces the six clusters listed below.

Cluster 1: 84020, 84092, 84121, 84093, 84095, 84065, 84109, 84096
Cluster 2: 84009, 84094, 84088, 84081, 84084, 84047, 84128, 84120, 84070, 84044, 84129
Cluster 3: 84106, 84102, 84115
Cluster 4: 84101, 84111
Cluster 5: 84105, 84108, 84117, 84124, 84103
Cluster 6: 84107, 84116, 84118, 84104, 84119, 84123

Hierarchical Clustering with Principal Component Analysis

Applying hierarchical clustering to the Principal Component Analysis projection of the data likewise gives us an additional way to validate the clustering. The updated PCA plot, colored by the hierarchical cluster labels, is consistent with the earlier clustering, with the points split into six sections:

Python Code:


    fig,ax1 = plt.subplots(figsize=(10, 6))

    ax1.set_xlim(X_PCA[:,0].min()-1,X_PCA[:,0].max()+1)
    ax1.set_ylim(X_PCA[:,1].min()-1,X_PCA[:,1].max()+1)

    plt.scatter(X_PCA[:, 0], X_PCA[:, 1], c=y_pred, s=50)

    for i,name in enumerate(summary['zipcode'].values):
        ax1.annotate(name, (X_PCA[i,0], X_PCA[i,1]), ha='center',fontsize=10)
                
Hierarchical-PCA