
Salt Lake City Real Estate


We will be analyzing SLC real estate data. The dataset contains multiple listing service (MLS) real estate transactions for houses sold in 2015-16 in zip code 84103 (the SLC Avenues neighborhood). We are primarily interested in regressing SoldPrice on the house attributes (property size, house size, number of bedrooms, etc.).


Data Overview


  • There are 206 different variables associated with each of the 348 houses in this dataset. Background information on real estate data is available online and can help make sense of the column headers in the dataset.

  • We are only keeping houses with a list price between 200,000 and 1,000,000 dollars. This is an arbitrary choice, and we realize that some people are high rollers, but for our purposes, we will treat listings outside this range as outliers.

  • We will be keeping these variables:

    ['Acres', 'Deck', 'GaragCap', 'Latitude', 'Longitude', 'LstPrice', 'Patio', 'PkgSpacs',
     'PropType', 'SoldPrice', 'Taxes', 'TotBed', 'TotBth', 'TotSqf', 'YearBlt']

  • We will check the datatypes and convert any numbers that were read as strings to numerical values. In particular, we convert TotSqf to an integer and add a column titled Prop_Type_num (Prop) that is (see the code sketch after this list):

$$ Prop_i = \begin{cases} 0, & \text{condo or townhouse} \\ 1, & \text{single family house} \end{cases} $$

  • We removed the listings with erroneous Longitude (one has Longitude = 0) and Taxes values (two have unreasonably large values).
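A minimal sketch of these loading and cleaning steps is below; the filename, the 'Single Family' label, and the Taxes cutoff are assumptions for illustration, not values from the original analysis.

Python Code:


    import pandas as pd

    # Load the MLS transactions (hypothetical filename)
    df = pd.read_csv('slc_real_estate_84103.csv')

    # Keep only the variables listed above
    cols = ['Acres', 'Deck', 'GaragCap', 'Latitude', 'Longitude', 'LstPrice', 'Patio',
            'PkgSpacs', 'PropType', 'SoldPrice', 'Taxes', 'TotBed', 'TotBth', 'TotSqf', 'YearBlt']
    df = df[cols]

    # Keep list prices between $200,000 and $1,000,000
    df = df[(df['LstPrice'] >= 200_000) & (df['LstPrice'] <= 1_000_000)]

    # TotSqf was read as a string; strip any thousands separators and convert to int
    df['TotSqf'] = df['TotSqf'].astype(str).str.replace(',', '').astype(int)

    # Prop_Type_num: 0 for condo/townhouse, 1 for single family
    # (assumes the single-family label in PropType is 'Single Family')
    df['Prop_Type_num'] = (df['PropType'] == 'Single Family').astype(int)

    # Drop the listing with an erroneous Longitude and the two unreasonable Taxes values
    df = df[df['Longitude'] != 0]
    df = df[df['Taxes'] < 20_000]  # placeholder cutoff for the two outliers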

Exploratory Data Analysis


Python Code:


    import seaborn as sns

    # Pairwise relationships in the training subset, colored by property type
    rel = sns.pairplot(data = train_subset, hue="PropType", diag_kws={'bw': 0.2})
                
real-estate-scatter

In our exploratory analysis, there are 260 observations and 18 variables in the training dataset. Based on the correlation scatterplot, the columns with the strongest correlation are LstPrice and SoldPrice; their relationship is nearly linear. Additionally, there are more Single Family listings than any other property type.
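These counts can be verified directly (a sketch, assuming the training DataFrame is named train):

Python Code:


    # Shape of the training data: (260 observations, 18 variables)
    print(train.shape)

    # Listing counts per property type; Single Family dominates
    print(train['PropType'].value_counts())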


Geospatial Analysis


Geospatial data defines specific geographical locations, either in the form of latitude and longitude coordinates or text fields with names of geographical areas, such as countries or states. Geospatial charts combine geospatial data with other forms of data to create map-based charts.

Two of the variables are the latitude and longitude of each listing. Salt Lake City is laid out on an east-west, north-south grid, so even a simple plot of latitude against longitude makes sense. Thus, we will create a scatterplot of these two variables.

Python Code:


    # Listing coordinates, colored by list price
    rel2 = sns.scatterplot(x = train['Latitude'], y = train['Longitude'], hue = train['LstPrice'], palette = 'hot')
                
real-estate-geospatial

As the chart shows, price tends to increase with both latitude and longitude: houses farther north and east are more expensive, while houses closer to (-111.91, 40.770) are cheaper.


Linear Regression


Ordinary Least Squares Assumptions:

  1. The linear regression model is “linear in parameters.”
  2. There is random sampling of observations.
  3. There is homoscedasticity and no autocorrelation.

Two caveats from the statsmodels summary also apply: the standard errors assume that the covariance matrix of the errors is correctly specified, and the condition number is large (1.27e+06 in the output below), which might indicate strong multicollinearity or other numerical problems.

We will now create a linear regression model to regress SoldPrice on some of the other variables.

$$ SoldPrice = \beta_0 + \beta_1 x $$

where x is one of the other variables.

After we run the regression, we will make a scatterplot of list price vs sold price and overlay the prediction with the regression model.

Python Code:


    # Regression Model (statsmodels formula interface)
    import statsmodels.formula.api as smf
    uni_mod = smf.ols(formula="SoldPrice ~ LstPrice", data = train)
    uni_result = uni_mod.fit()
    print(uni_result.summary())
                

Output:


                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:              SoldPrice   R-squared:                       0.990
    Model:                            OLS   Adj. R-squared:                  0.990
    Method:                 Least Squares   F-statistic:                 2.611e+04
    Date:                Thu, 21 Jul 2022   Prob (F-statistic):          2.97e-261
    Time:                        13:50:46   Log-Likelihood:                -2927.2
    No. Observations:                 260   AIC:                             5858.
    Df Residuals:                     258   BIC:                             5866.
    Df Model:                           1                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept   1.317e+04   2948.953      4.465      0.000    7361.378     1.9e+04
    LstPrice       0.9441      0.006    161.593      0.000       0.933       0.956
    ==============================================================================
    Omnibus:                      123.259   Durbin-Watson:                   2.015
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              881.446
    Skew:                          -1.750   Prob(JB):                    3.95e-192
    Kurtosis:                      11.313   Cond. No.                     1.27e+06
    ==============================================================================
                

Based on the OLS regression, the R-squared is 0.990, meaning that 99.0% of the variation in SoldPrice is explained by the model. The coefficient on LstPrice is 0.9441. In other words, each additional dollar of list price increases the predicted sold price by about 94 cents.
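Plugging the fitted coefficients into the model gives the prediction equation. For example, a house listed at 500,000 dollars has a predicted sold price of roughly 485,000 dollars:

$$ \widehat{SoldPrice} = 13{,}170 + 0.9441 \cdot LstPrice $$

$$ \widehat{SoldPrice} \approx 13{,}170 + 0.9441 \times 500{,}000 \approx 485{,}220 $$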

Python Code:


    import matplotlib.pyplot as plt

    # List price vs. sold price, with the fitted line from uni_result overlaid
    plt.scatter(train['LstPrice'], train['SoldPrice'],  color='red')
    plt.plot(train['LstPrice'], uni_result.predict(train), color='blue')
                
lstprice-soldprice-scatter

Multilinear Regression


We are now developing a multilinear regression model for house prices in this neighborhood. We could use this to come up with a list price for houses coming on the market, so we will not include the list price in our model; we also ignore the categorical variable Prop_Type for now. Our model is now of the form:

$$ SoldPrice = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n $$

where the x's are predictive variables.

Next, we look at the difference between list price and sold price explicitly and calculate two new columns for our dataset: DiffPriceAbsolute and DiffPriceRelative.

  • DiffPriceAbsolute - This is the difference between sold price and list price. If it is positive, the house sold for more than it was listed at.
  • DiffPriceRelative - This is the relative difference between sold price and list price. A value of 1.1 here means that the house sold for 110% of the asking price, and 0.9 means the house sold for 90% of the asking price.
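A minimal sketch of constructing these two columns (assuming the training DataFrame is named train):

Python Code:


    # Absolute difference: positive means the house sold above its list price
    train['DiffPriceAbsolute'] = train['SoldPrice'] - train['LstPrice']

    # Relative difference: 1.1 means the sale was 110% of the asking price
    train['DiffPriceRelative'] = train['SoldPrice'] / train['LstPrice']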

After that, we create two new models, one to predict DiffPriceAbsolute and one to predict DiffPriceRelative, using the same predictive variables as in the last model, and compare which of the two makes better predictions.

Finally, we train two more models that predict DiffPriceAbsolute and DiffPriceRelative based on just SoldPrice, and create a scatterplot for each model.


SoldPrice Regression

Python Code:


    # Regress SoldPrice on the predictive variables (list price excluded)
    uni_mod0 = smf.ols(formula="SoldPrice ~ Acres + Taxes + TotBed + TotBth + TotSqf + YearBlt", data = train)
    uni_result0 = uni_mod0.fit()
    print(uni_result0.summary())
                

Output:


                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:              SoldPrice   R-squared:                       0.838
    Model:                            OLS   Adj. R-squared:                  0.835
    Method:                 Least Squares   F-statistic:                     218.7
    Date:                Thu, 21 Jul 2022   Prob (F-statistic):           4.45e-97
    Time:                        13:51:02   Log-Likelihood:                -3291.9
    No. Observations:                 260   AIC:                             6598.
    Df Residuals:                     253   BIC:                             6623.
    Df Model:                           6                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept   2.765e+05   2.94e+05      0.942      0.347   -3.02e+05    8.55e+05
    Acres       1.169e+05   6.15e+04      1.900      0.059   -4276.513    2.38e+05
    Taxes         58.6029      5.466     10.721      0.000      47.838      69.368
    TotBed      1.286e+04   6045.229      2.127      0.034     951.492    2.48e+04
    TotBth      6050.4777   8140.361      0.743      0.458   -9981.025    2.21e+04
    TotSqf        45.5661      9.878      4.613      0.000      26.113      65.019
    YearBlt      -96.9146    151.794     -0.638      0.524    -395.855     202.025
    ==============================================================================
    Omnibus:                       30.768   Durbin-Watson:                   2.174
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.356
    Skew:                           0.759   Prob(JB):                     3.85e-10
    Kurtosis:                       4.302   Cond. No.                     2.90e+05
    ==============================================================================
                

DiffPriceAbsolute Regression

Python Code:


    uni_mod1 = smf.ols(formula="DiffPriceAbsolute ~ Acres + Taxes + TotBed + TotBth + TotSqf + YearBlt", data = train)
    uni_result1 = uni_mod1.fit()
    print(uni_result1.summary())
                

Output:


                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:      DiffPriceAbsolute   R-squared:                       0.261
    Model:                            OLS   Adj. R-squared:                  0.244
    Method:                 Least Squares   F-statistic:                     14.92
    Date:                Thu, 21 Jul 2022   Prob (F-statistic):           1.32e-14
    Time:                        13:51:02   Log-Likelihood:                -2927.3
    No. Observations:                 260   AIC:                             5869.
    Df Residuals:                     253   BIC:                             5894.
    Df Model:                           6                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept   2.867e+04   7.22e+04      0.397      0.692   -1.14e+05    1.71e+05
    Acres      -1.235e+04   1.51e+04     -0.815      0.416   -4.22e+04    1.75e+04
    Taxes         -1.1223      1.345     -0.834      0.405      -3.771       1.527
    TotBed       528.3073   1487.544      0.355      0.723   -2401.240    3457.854
    TotBth     -2969.8850   2003.091     -1.483      0.139   -6914.743     974.973
    TotSqf        -4.9697      2.431     -2.045      0.042      -9.756      -0.183
    YearBlt       -9.4079     37.352     -0.252      0.801     -82.968      64.152
    ==============================================================================
    Omnibus:                       98.703   Durbin-Watson:                   2.026
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              618.218
    Skew:                          -1.371   Prob(JB):                    5.70e-135
    Kurtosis:                      10.039   Cond. No.                     2.90e+05
    ==============================================================================
                

DiffPriceRelative Regression

Python Code:


    uni_mod2 = smf.ols(formula="DiffPriceRelative ~ Acres + Taxes + TotBed + TotBth + TotSqf + YearBlt", data = train)
    uni_result2 = uni_mod2.fit()
    print(uni_result2.summary())
                

Output:


                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:      DiffPriceRelative   R-squared:                       0.079
    Model:                            OLS   Adj. R-squared:                  0.057
    Method:                 Least Squares   F-statistic:                     3.618
    Date:                Thu, 21 Jul 2022   Prob (F-statistic):            0.00184
    Time:                        13:51:02   Log-Likelihood:                 486.61
    No. Observations:                 260   AIC:                            -959.2
    Df Residuals:                     253   BIC:                            -934.3
    Df Model:                           6                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept      1.0221      0.143      7.131      0.000       0.740       1.304
    Acres         -0.0145      0.030     -0.483      0.629      -0.074       0.045
    Taxes       9.371e-07   2.67e-06      0.351      0.726   -4.32e-06    6.19e-06
    TotBed         0.0025      0.003      0.838      0.403      -0.003       0.008
    TotBth        -0.0080      0.004     -2.022      0.044      -0.016      -0.000
    TotSqf     -3.941e-06   4.82e-06     -0.817      0.415   -1.34e-05    5.56e-06
    YearBlt    -1.261e-05   7.41e-05     -0.170      0.865      -0.000       0.000
    ==============================================================================
    Omnibus:                      119.899   Durbin-Watson:                   2.053
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              840.861
    Skew:                          -1.697   Prob(JB):                    2.57e-183
    Kurtosis:                      11.130   Cond. No.                     2.90e+05
    ==============================================================================
                

The multilinear regression model has a lower R-squared (0.838) than the simple linear regression model that includes the list price (0.990), so the simple model explains more of the variation. Additionally, of the two new models, the one predicting DiffPriceAbsolute is better than the one predicting DiffPriceRelative because its R-squared is greater (0.261 vs. 0.079). This also shows in the fitted plots: the points for DiffPriceAbsolute lie closer to the regression line, reflecting the additional explained variation. Although DiffPriceAbsolute is the better of the two, its R-squared is still not high.
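A quick way to compare the three fits side by side, using the result objects from above:

Python Code:


    # rsquared is an attribute of every fitted statsmodels result
    for name, res in [('SoldPrice', uni_result0),
                      ('DiffPriceAbsolute', uni_result1),
                      ('DiffPriceRelative', uni_result2)]:
        print(name, 'R-squared:', round(res.rsquared, 3))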

Python Code:


    from sklearn.model_selection import train_test_split
    from sklearn import linear_model

    lr = linear_model.LinearRegression()

    # Single predictor (SoldPrice) and target (DiffPriceAbsolute)
    X = train[['SoldPrice']].values
    y = train['DiffPriceAbsolute'].values

    # Split into training and test sets (note: test_size=0.9 holds out 90% for testing)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9, random_state=1)
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)

    # Fit on the training split and report R^2 on both splits
    lr.fit(X_train, y_train)
    print('train: ', lr.score(X_train, y_train), ' test: ', lr.score(X_test, y_test))
                
DiffPriceAbsolute

Python Code:


    from sklearn.model_selection import train_test_split
    from sklearn import linear_model

    lr = linear_model.LinearRegression()

    # Single predictor (SoldPrice) and target (DiffPriceRelative)
    X = train[['SoldPrice']].values
    y = train['DiffPriceRelative'].values

    # Split into training and test sets (note: test_size=0.9 holds out 90% for testing)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9, random_state=1)
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)

    # Fit on the training split and report R^2 on both splits
    lr.fit(X_train, y_train)
    print('train: ', lr.score(X_train, y_train), ' test: ', lr.score(X_test, y_test))
                
DiffPriceRelative

Incorporating a Categorical Variable

Above, we considered houses, townhouses, and condos together. Now we will distinguish between houses, townhouses, and condos. Consider the two regression models:

$$ SoldPrice = \beta_0 + \beta_1 \, Prop $$

$$ SoldPrice = \beta_0 + \beta_1 \, Prop + \beta_2 \, TotSqf $$

From the first model, property type is significant in predicting the sold price. On the other hand, the second model indicates that once we account for total square footage, property type is no longer predictive. This suggests that property type matters mainly because condos and townhouses tend to be smaller than single-family houses.
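A sketch of fitting both models with the same formula interface as above; the result names are ours, and we use the Prop_Type_num column created during data cleaning as the 0/1 Prop variable:

Python Code:


    # Model 1: property type alone
    cat_result1 = smf.ols(formula="SoldPrice ~ Prop_Type_num", data = train).fit()

    # Model 2: property type plus total square footage
    cat_result2 = smf.ols(formula="SoldPrice ~ Prop_Type_num + TotSqf", data = train).fit()

    print(cat_result1.summary())
    print(cat_result2.summary())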

We will also make a scatterplot of TotSqf vs. SoldPrice where the house types are colored differently to illustrate this explanation.

Python Code:


    # TotSqf vs. SoldPrice, colored by property type
    rel3 = sns.scatterplot(x = train['TotSqf'], y = train['SoldPrice'], hue = train['PropType'])
                
TotSqf-SoldPrice-Scatter