1 Introduction

Predicting home prices can be challenging, especially when relying on automated valuation models (AVMs) like Zillow’s Zestimate. While these tools offer a rough estimate, they often fail to capture the specifics of neighborhoods that don’t follow typical trends. Inaccurate home price predictions can have serious impacts, affecting everything from mortgage decisions to housing policies.

Philadelphia is a city of diverse neighborhoods, each with its own culture, economy, and housing characteristics. To make accurate predictions, it’s important to capture these local factors. That’s why our team worked on building a better model for predicting home prices in the city, using data specific to Philadelphia.

We collected open data from OpenDataPhilly, focusing on key variables that would help improve our model’s accuracy. Using ordinary least squares (OLS) regression, we refined the model by adjusting the variables until we found the best combination to predict prices effectively across different neighborhoods. Our goal was to not only lower the error rate but also create a model that works well in all parts of the city.

Throughout the project, we faced challenges with missing or incomplete data, particularly in some neighborhoods. Despite these issues, we identified 19 important predictors, like median household income and proximity to high-performing schools, which helped improve our model. Although it’s not perfect, our model represents a meaningful step forward in predicting housing prices in Philadelphia.

By using local data and a careful approach, we have built a model that offers more reliable price estimates, providing better insights for homeowners, buyers, and policymakers.

library(tidyverse)
library(dplyr)
library(olsrr)
library(sf)
library(caret)
library(tmap)
library(fastDummies)
library(tidycensus)
library(spdep)
library(sfdep)
library(curl)
library(jtools) 
library(zip)
library(janitor)
library(spatstat)
library(terra)
library(ggcorrplot)
library(kableExtra)
library(gridExtra)
library(ggpubr)
library(ggplot2)
library(osmdata)
library(mapview)
library(ckanr)
library(FNN)
library(grid)
library(patchwork)
library(psych)

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c('#FFEDA0', '#FEB24C', '#FC4E2A', '#E31A1C', '#BD0026')
quantile_colors <- c("darkblue", "lightblue", "yellow", "lightcoral", "darkred")
quantile_colors_2 <- c("darkblue", "dodgerblue", "yellow", "orange", "darkred")
pal_full <- c("#34495e", "#3498db", "#2ecc71", "#f1c40f", "#e74c3c", "#9b59b6", "#1abc9c", "#f39c12", "#d35400")
corr_mat_pal <- c("#2ecc71", "white", "#e74c3c")

2 Data

2.1 Data Gathering

We began by gathering data to provide a comprehensive view of the city of Philadelphia. The city’s shape and neighborhood boundaries were obtained from OpenDataPhilly, giving us a detailed map of the city’s layout and divisions by tract.

The core of our analysis focused on housing market data. This included essential details about individual homes, such as sale price, square footage, number of bedrooms, year built, and ownership status, all provided by Zillow. The categories documented in the dataset are realistic factors that people weigh when shopping for a house. To complement this, we added socioeconomic information from the 2022 5-year American Community Survey (ACS), which gave us insight into neighborhood demographics such as total population and median household income. These variables were important for understanding the broader economic conditions affecting home prices in each area.

In Philadelphia, the value of a home is strongly influenced by its surrounding neighborhood, so our team selected variables that could reflect the characteristics typically associated with house values. Therefore, we categorized our variables into three main areas:

  1. internal characteristics of the homes
  2. amenities and public services
  3. spatial structure

For the internal characteristics, we included variables such as the year the house was built, considering that older homes may have higher value due to historical significance or lower value due to potential maintenance issues. Additionally, square footage and total area were key factors, as larger homes typically have higher prices. The number of rooms, bathrooms, and bedrooms also contributed to the value, with larger homes commanding higher prices. Garage space was another feature we included, as homes with garages tend to be more desirable and often sell for more.

For the spatial structure, we incorporated median household income data from the ACS Census, which provides insight into the economic conditions of each neighborhood. Finally, we used a spatial lag price feature to account for the influence of nearby home prices, helping to adjust the model based on local market trends.

phl_path <- ("data/philadelphia-neighborhoods.geojson")

phl <- st_read(phl_path, quiet = TRUE) %>%
        st_as_sf(crs = 4326)%>%
        st_transform('EPSG:2272')

data_path <- ("data/studentData.geojson")
house <- st_read(data_path, quiet = TRUE)

drop <- c("objectid", "assessment_date", "beginning_point", "book_and_page", "category_code", 
          "cross_reference", "date_exterior_condition", "house_number", "location", "owner_1", 
          "owner_2", "parcel_number", "recording_date", "registry_number", "sale_date",
          "mailing_address_1", "mailing_address_2", "mailing_care_of", "mailing_zip", "mailing_street", 
          "mailing_city_state", "building_code", "geographic_ward", "state_code", "street_code", 
          "street_name", "street_designation", "street_direction", "suffix", "building_code_new",
          "year_built_estimate", "pin", "unit", "exempt_land", "building_code_description")

house <- house %>%
  mutate(non_resident_owner = mailing_address_1 == mailing_street) %>%
  select(-all_of(drop)) %>%
  st_transform(crs = 2272) %>%
  filter(sale_price <= 2000000) %>%
  mutate(
    nb = st_knn(geometry, 5),
    wt = st_weights(nb),
    nb15 = st_knn(geometry, 15),
    wt15 = st_weights(nb15),
    
    # Handle total area and livable area
    total_area = ifelse(is.na(total_area) | total_area == 0, purrr::map_dbl(find_xj(total_area, nb), mean, na.rm = TRUE), total_area),
    total_livable_area = ifelse(is.na(total_livable_area) | total_livable_area == 0, purrr::map_dbl(find_xj(total_livable_area, nb), mean, na.rm = TRUE), total_livable_area),
    
    # Handle house age
    house_age = round(ifelse(2023 - year_built == 2023, purrr::map_dbl(find_xj(2023 - year_built, nb), mean, na.rm = TRUE), 2023 - year_built)),
    
    # Handle number of bathrooms, bedrooms, and rooms
    number_of_bathrooms = ifelse(number_of_bathrooms == 0, 1, number_of_bathrooms),
    number_of_bathrooms = round(ifelse(is.na(number_of_bathrooms), purrr::map_dbl(find_xj(number_of_bathrooms, nb), mean, na.rm = TRUE), number_of_bathrooms)),
    number_of_bedrooms = round(ifelse(is.na(number_of_bedrooms) | number_of_bedrooms == 31, purrr::map_dbl(find_xj(number_of_bedrooms, nb), mean, na.rm = TRUE), number_of_bedrooms)),
    number_of_rooms = round(ifelse(is.na(number_of_rooms) | number_of_rooms == 32, purrr::map_dbl(find_xj(number_of_rooms, nb), mean, na.rm = TRUE), number_of_rooms)),
    number_of_rooms = if_else(is.na(number_of_rooms),
                                number_of_bedrooms + number_of_bathrooms,
                                number_of_rooms),
    
    # Handle interior and exterior condition
    interior_condition = round(ifelse(is.na(interior_condition), purrr::map_dbl(find_xj(interior_condition, nb), mean, na.rm = TRUE), interior_condition)),
    exterior_condition = round(ifelse(is.na(exterior_condition), purrr::map_dbl(find_xj(exterior_condition, nb), mean, na.rm = TRUE), exterior_condition)),
    
    # Handle number of stories
    number_stories = round(ifelse(is.na(number_stories), purrr::map_dbl(find_xj(number_stories, nb), mean, na.rm = TRUE), number_stories)),
    
    # Handle garage spaces
    garage_spaces = ifelse(garage_spaces == 72 | garage_spaces == 12, 
                           round(purrr::map_dbl(find_xj(garage_spaces, nb), mean, na.rm = TRUE), 0),
                           garage_spaces),
    garage_spaces = ifelse(is.na(garage_spaces), 0, garage_spaces), 
    
    # Handle sale price and spatial lag
    sale_price = ifelse(is.na(sale_price) | sale_price == 0, 
                        purrr::map_dbl(find_xj(sale_price, nb), mean, na.rm = TRUE), 
                        sale_price),
    price_lag = st_lag(sale_price, nb15, wt15, na_ok = T)
  ) %>%
  
  filter(house_age != -1) %>%
  
  
  select(-c(nb, wt))

house <- house[phl, ]
keep_columns <- sapply(house, function(col) length(unique(col)) > 1) #drop columns with only one unique value (i.e., no variance)
house <- house[, keep_columns]

For the amenities and public services, we considered factors that influence neighborhood attractiveness. We examined how crime levels in different neighborhoods affect housing prices using data from OpenDataPhilly. The data record both violent (Part I) offenses such as aggravated assault, rape, and arson, as well as non-violent (Part II) offenses like simple assault, prostitution, gambling, and fraud.

We also looked at proximity to parks, which included various recreational facilities such as playgrounds, recreation centers, older adult centers, swimming pools, and environmental education centers. Homes located near these amenities generally have higher values due to the appeal of access to green spaces and recreational options.

For healthcare access, we used data on Pennsylvania hospitals, which are institutions that provide medical or surgical treatment to sick or injured individuals. Proximity to hospitals is often viewed as a positive feature, especially in urban areas, where access to healthcare services can significantly impact a home’s desirability.

Public transportation data from OpenDataPhilly was another key factor, with a focus on SEPTA routes, stops, and locations. Homes near reliable public transportation options tend to have higher values, as easy access to transit is an important consideration for many buyers.

We also factored in proximity to highly rated schools. School performance was evaluated based on Pennsylvania System of School Assessment (PSSA) scores and the number of serious incidents reported to School District Police. Homes near schools with strong performance metrics and a safe environment are typically more attractive, particularly for families.

Data Sources:

  • Park and recreation program sites - This dataset includes recreation centers, playgrounds, older adult centers, swimming pools, and environmental education centers.

  • Hospitals - this dataset provides information on hospitals in Pennsylvania, which are defined as institutions in which sick or injured persons are given medical or surgical treatment.

  • Highly rated schools - School-level PSSA proficiency scores and counts of serious incidents reported to School District Police.

  • Crime - Crime incidents from the Philadelphia Police Department. Part I crimes include violent offenses such as aggravated assault, rape, arson, among others. Part II crimes include simple assault, prostitution, gambling, fraud, and other non-violent offenses.

  • Bus and Trolley Stops (https://opendataphilly.org/datasets/septa-routes-stops-locations/) - Bus and trolley stops in Philadelphia.


2.1.1 Handling Missing Data and Outliers

Dealing with missing data is always a challenge, and we encountered gaps in our dataset, particularly where variables were inconsistently recorded due to differences in how neighboring areas defined certain attributes. To address this, we used a method called missing data imputation, where we filled in the blanks with reasonable estimates based on the existing data. For example, if a home was missing its year of construction, we estimated it using data from similar homes in the same neighborhood.
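
To make the idea concrete, here is a minimal sketch of neighbor-based imputation using the same sfdep helpers (st_knn() and find_xj()) that appear in the cleaning chunk above; the homes object and year_built column are illustrative names, not the exact call we ran.

# Sketch: fill missing values with the mean of the k nearest properties
# (`homes` and `year_built` are placeholders; the real cleaning chunk applies
# the same pattern to total_area, bathrooms, conditions, and so on)
library(sf)
library(sfdep)
library(dplyr)
library(purrr)

impute_from_neighbors <- function(pts, col, k = 5) {
  nb <- st_knn(st_geometry(pts), k)                          # k nearest neighbors per point
  vals <- pts[[col]]
  nb_means <- map_dbl(find_xj(vals, nb), mean, na.rm = TRUE) # mean of each point's neighbors
  pts[[col]] <- ifelse(is.na(vals), nb_means, vals)          # fill only the gaps
  pts
}

# homes <- impute_from_neighbors(homes, "year_built", k = 5)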

Outliers, or extreme values that didn’t align with the overall trends, were carefully reviewed. We analyzed these outliers to determine whether they represented genuine rare cases or potential errors. For instance, if a home’s price was significantly higher or lower than the average in its neighborhood, we flagged it for further inspection. If these outliers were legitimate data points, we kept them in the analysis. However, if they seemed to be mistakes, we either corrected the values or removed them from the dataset to prevent them from distorting the results.
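
As a rough illustration of that review step (not the exact rule we applied), one could flag sales that sit far from their neighborhood average; the sales data frame, neighborhood column, and three-standard-deviation threshold below are all illustrative choices.

# Sketch: flag sale prices far from their neighborhood average (threshold is illustrative)
library(dplyr)

flag_price_outliers <- function(sales, k_sd = 3) {
  sales %>%
    group_by(neighborhood) %>%                        # hypothetical grouping column
    mutate(
      nbhd_mean  = mean(sale_price, na.rm = TRUE),
      nbhd_sd    = sd(sale_price, na.rm = TRUE),
      is_outlier = abs(sale_price - nbhd_mean) > k_sd * nbhd_sd
    ) %>%
    ungroup()
}

# flagged <- flag_price_outliers(sales) %>% filter(is_outlier)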

## School Rate -----------------------------------------------------
schools <- read.csv('data/2023-2024 Master School List.csv')
PSSA_data <- read.csv('data/2023PSSAKeystoneActual.csv')
incident_data <- read.csv("data/Serious_Incident_Counts_School.csv")

schools <- schools %>%
  rename(ulcs_code = ULCS.Code) %>%
  separate(col = GPS.Location, into = c('Lat', 'Long'), sep = ', ') %>%
  mutate(
    Lat = as.numeric(Lat),
    Long = as.numeric(Long)
  ) %>%
  filter(!is.na(Lat) & !is.na(Long)) %>%
  st_as_sf(coords = c("Long", "Lat"), crs = "EPSG:4326") %>%
  dplyr::select(ulcs_code, School.Name..ULCS., School.Level, Admission.Type)

PSSA_data <- PSSA_data %>%
  dplyr::filter(category == 'All' & grade == 'Grades 3-8' & profadv_score != 's') %>%
  mutate(profadv_score = as.double(profadv_score)) %>%
  group_by(ulcs_code, publicationname) %>%
  summarize(mean_profadv = mean(profadv_score))

incident_data_summary <- incident_data %>%
  group_by(ULCS.Code) %>%
  summarize(total_incidents = sum(`Num.of.Incidents`, na.rm = TRUE))

schools <- schools %>%
  left_join(PSSA_data, by = 'ulcs_code') %>%
  left_join(incident_data_summary, by = c("ulcs_code" = "ULCS.Code")) %>%
  mutate(total_incidents = ifelse(is.na(total_incidents), 0, total_incidents)) %>%
  drop_na() %>%
  st_transform('EPSG:2272') %>%
  mutate(mean_profadv_norm = (mean_profadv - min(mean_profadv, na.rm = TRUE)) / (max(mean_profadv, na.rm = TRUE) - min(mean_profadv, na.rm = TRUE))) %>%
  mutate(total_incidents_norm = (max(total_incidents, na.rm = TRUE) - total_incidents) / (max(total_incidents, na.rm = TRUE) - min(total_incidents, na.rm = TRUE))) %>%
  mutate(overall_score = 0.7 * mean_profadv_norm + 0.3 * total_incidents_norm)
  


# crime -----------------------------------------------------
crime <- read.csv("data/incidents_part1_part2.csv")

crime <- crime %>%
  dplyr::select(objectid, location_block, text_general_code, lat, lng) %>% 
  filter(!is.na(lat) & !is.na(lng)) %>% 
  st_as_sf(coords = c("lng", "lat"), crs = 4326) %>% 
  st_transform('EPSG:2272') %>% 
  st_intersection(phl)

crime_summ <- crime %>%
  group_by(text_general_code) %>%
  summarize(crime_count = n()) %>%
  arrange(desc(crime_count)) 


# Trolley and Buses -----------------------------------------------------
trol <- 
  rbind(
    st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson") %>% 
    mutate(type = "trolley") %>%
    select(type),
      st_read("https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson") %>%
      mutate(type = "trolley") %>%
      select(type)) %>%
   st_transform("EPSG:2272")

bus <- st_read("data/Busstop2022.geojson") %>%
        st_transform("EPSG:2272") %>%
        st_intersection(phl) %>%
  mutate(type = "bus") %>%
  select(type)

tran <- 
  rbind(trol,bus)


# park -----------------------------------------------------
park <- st_read("data/PPR_Program_Sites.geojson") %>%
  st_transform('EPSG:2272') %>%
  st_intersection(phl)



# Hospital -----------------------------------------------------
hosp <- st_read("data/DOH_Hospitals202311.geojson") %>%
  st_as_sf(crs = 4326)%>%
  st_transform('EPSG:2272') %>%
  st_intersection(phl)
acs_vars <- c("B01001_001E",
              "B25001_001E",
              "B25002_003E",
              "B19013_001E",
              "B19058_002E",
              "B25003_002E",
              "B25006_002E",
              "B25006_001E"
              )
acsTractsPHL.2023 <- get_acs(geography = "tract",
                             variables = acs_vars, 
                             year=2022, state=42,
                             county=101, geometry=TRUE, 
                             output = "wide") %>%
  st_transform('EPSG:2272')

acsTractsPHL.2023 <- acsTractsPHL.2023 %>%
  dplyr::select (GEOID, NAME, all_of(acs_vars)) %>%
  rename (totalPop = B01001_001E,
          totalHU = B25001_001E,
          totalVacant = B25002_003E,
          medHHInc = B19013_001E,
          HHAssistedInc = B19058_002E,
          OwnerOccH = B25003_002E,
          WhiteHomeowner = B25006_002E,
          TotalOccH = B25006_001E) %>%
  dplyr::filter(totalPop != 0) %>%
  mutate(HHOccupiedRate = ifelse(OwnerOccH==0,0,OwnerOccH/totalHU),
         WhiteHOrate = ifelse(WhiteHomeowner==0,0,WhiteHomeowner/TotalOccH),
         AssistedIncRate = ifelse(HHAssistedInc==0,0,HHAssistedInc/totalHU),
         medHHInc = ifelse(is.na(medHHInc),mean(medHHInc,na.rm=TRUE),medHHInc))
## join all the variable data to house data
st_c <- st_coordinates
house <- house %>% 
  st_join(acsTractsPHL.2023 %>% select(medHHInc,NAME), join = st_intersects)%>%
  mutate(schools_nn4 = nn_function(st_c(house), st_c(schools), 4),
         crime_nn4 = nn_function(st_c(house), st_c(crime), 4),
         park_nn4 = nn_function(st_c(house), st_c(park), 4),
         tran_nn4 = nn_function(st_c(house), st_c(tran), 4),
         trol_nn4 = nn_function(st_c(house), st_c(trol), 4),
         bus_nn4 = nn_function(st_c(house), st_c(bus), 4),
         hosp_nn4 = nn_function(st_c(house), st_c(hosp), 4))
reg_data_or <- house %>%
  filter(toPredict == "MODELLING") %>%
  select(sale_price,
         house_age,
         total_area,
         total_livable_area,
         number_of_bathrooms,
         number_of_bedrooms,
         number_of_rooms,
         fireplaces,
         interior_condition,
         exterior_condition,
         number_stories,
         garage_spaces,
         schools_nn4,
         crime_nn4,
         park_nn4,
         tran_nn4,
         hosp_nn4,
         medHHInc,
         price_lag) %>%
  na.omit()

2.2 Feature Engineering

To improve the model’s performance, we engineered some additional features from our data. Feature engineering is the process of creating new variables or transforming existing ones to better capture important patterns. For instance, we created proximity features that measured how close each house was to parks, schools, public transportation, and hospitals, since these amenities often affect home values.
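
The proximity features rely on the nn_function() helper sourced from the urbanSpatial functions.r above; conceptually, it averages the distance from each house to its k nearest amenities. A minimal equivalent using the FNN package (already loaded) might look like the sketch below, assuming coordinate matrices in the same projected CRS (EPSG:2272, in feet).

# Sketch: average distance from each house to its k nearest amenities
library(FNN)
library(sf)

mean_nn_distance <- function(from_xy, to_xy, k = 4) {
  knn <- get.knnx(data = to_xy, query = from_xy, k = k)  # k nearest amenities per house
  rowMeans(knn$nn.dist)                                  # mean of the k distances
}

# Example in the spirit of park_nn4 from the join chunk above:
# house$park_nn4 <- mean_nn_distance(st_coordinates(house), st_coordinates(park), k = 4)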

We also transformed categorical variables into numeric variables. Categorical variables are factors like the type of neighborhood or the house’s ownership status, which aren’t numbers. By converting them into numerical values, we could include them in our regression model, making sure our analysis captured their impact on housing prices.
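
For the categorical recoding, the fastDummies package loaded above offers a direct route; the zoning column in this toy example is an illustrative categorical variable, not necessarily one used in our final model.

# Sketch: one-hot (dummy) encoding of a categorical predictor with fastDummies
library(fastDummies)

toy <- data.frame(
  sale_price = c(250000, 310000, 150000),
  zoning     = c("RSA5", "RM1", "RSA5")   # illustrative categorical variable
)

toy_dummies <- dummy_cols(toy,
                          select_columns = "zoning",
                          remove_first_dummy = TRUE,       # drop one level to avoid perfect collinearity in OLS
                          remove_selected_columns = TRUE)  # keep only the numeric dummy columns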

2.2.1 Table of Summary Statistics

The summary statistics table provides an overview of key variables such as house age, house size, and proximity to services, and it also helps identify inconsistencies or gaps in our data, such as missing values or zeros that could distort our model. Based on the insights from this table, we went back and cleaned the data by replacing missing or unrealistic values (e.g., zeros or NAs) with more realistic estimates, using the average values of the five nearest neighboring properties.

summ_stats <- reg_data_or %>% st_drop_geometry()
summ_stats <- summ_stats %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::select(-total_area, -number_of_bedrooms, -number_stories, -garage_spaces) %>% 
  psych::describe() %>% dplyr::select(mean:median,min,max) %>%
  mutate(Mean = as.character(round(mean)),
         SD = as.character(round(sd)),
         Median = as.character(round(median)),
         Min = as.character(round(min, 2)),
         Max = as.character(round(max))) %>%
  dplyr::select(-(mean:max)) %>%
  # filter(Max > 1) %>%
  as.data.frame()

summ_stats$Description <- 
  c("Sale price of the house (USD)", 
    "Age of the house (in years)", 
    "Total livable area (in square feet)", 
    "Number of bathrooms", 
    "Number of rooms", 
    "Number of fireplaces", 
    "Overall condition of the interior", 
    "Overall condition of the exterior", 
    "Distance to nearest schools (nearest 4 neighbors)", 
    "Crime rate within the nearest 4 neighbors", 
    "Distance to nearest parks (nearest 4 neighbors)", 
    "Distance to nearest public transportation (nearest 4 neighbors)", 
    "Distance to nearest hospitals (nearest 4 neighbors)", 
    "Median household income (nearest 4 neighbors)", 
    "Spatial lag of sale price (15 nearest neighbors)")

summ_stats %>%
  dplyr::select(Description, Mean, SD, Median, Min, Max) %>%  # Reordering columns
  kbl(caption = "Table 1: Summary Statistics of Numeric Variables", align = rep("c", 8)) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 1: Summary Statistics of Numeric Variables
Variable             Description                                               Mean     SD      Median   Min       Max
sale_price           Sale price of the house (USD)                             265116   201379  226000   10100     2000000
house_age            Age of the house (in years)                               86       29      98       0         273
total_livable_area   Total livable area (in square feet)                       1324     509     1204     64        15246
number_of_bathrooms  Number of bathrooms                                       1        0       1        1         6
number_of_rooms      Number of rooms                                           4        2       4        1         15
fireplaces           Number of fireplaces                                      0        0       0        0         5
interior_condition   Overall condition of the interior                         4        1       4        0         7
exterior_condition   Overall condition of the exterior                         4        1       4        1         7
schools_nn4          Average distance to the 4 nearest schools (ft)            3404     1668    2929     1093.17   13260
crime_nn4            Average distance to the 4 nearest crime incidents (ft)    150      108     127      0         2897
park_nn4             Average distance to the 4 nearest parks (ft)              3349     1423    2992     834.55    12133
tran_nn4             Average distance to the 4 nearest transit stops (ft)      581      364     474      53.87     6522
hosp_nn4             Average distance to the 4 nearest hospitals (ft)          10518    4994    9306     1638.36   32006
medHHInc             Median household income of the census tract (USD)         64837    29489   59534    14983     172610
price_lag            Spatial lag of sale price (15 nearest neighbors)          263595   158948  224920   35440     1274340

2.2.2 Correlation Across Numeric Variables

For this project, the correlation matrix helps us understand how the numeric variables relate to one another. It highlights the strength and direction of these relationships, allowing us to identify which variables are highly correlated. This is useful for later model development, as it helps us decide which variables to keep or exclude. Based on this analysis, we decided to remove three variables from the model: "total area", "number of bedrooms", and "exterior condition". These variables showed high correlation with similar features such as "total livable area" and "interior condition", which could make them redundant and introduce multicollinearity into the model.

numeric_vars <- reg_data_or %>% 
  st_drop_geometry()%>%
  select(where(is.numeric))

corr <- round(cor(numeric_vars), 1)
p.mat <- cor_pmat(numeric_vars)

ggcorrplot(corr,
           colors = corr_mat_pal,
           #p.mat = cor_pmat(numeric_vars),
  type="lower",
  insig = "blank",
  tl.cex = 6) +  
  labs(title = "Correlation across Numeric Variables",
       caption = "Figure 2") 

2.2.3 Scatterplots: Relationships Between Sale Price and Key Variables

This part of the analysis uses scatterplots to visualize how four key variables (distance to crime spots, median household income, the spatial lag of sale price, and total livable area) relate to home sale prices. Plotting these relationships reveals the general trends; for example, homes with more livable area tend to sell for more. However, slightly counter to our initial hypothesis, the positive relationship between median household income and home prices is not especially strong or consistent across the data. We also included a line of best fit to highlight these trends. This helps us understand which factors have the strongest influence on home prices and ensures we capture these effects in our regression model.

reg_scatterpl <- reg_data_or %>% st_drop_geometry()


var_labels <- c(
  "crime_nn4" = "Distance to Crime Spots",
  "medHHInc" = "Median Household Income",
  "price_lag" = "Sale Price of Nearby Homes",
  "total_livable_area" = "Total Livable Area"
)

reg_scatterpl %>%
  dplyr::select(sale_price, 
                total_livable_area, 
                price_lag, 
                crime_nn4,
                medHHInc) %>%
  filter(sale_price <= 1000000) %>%
  filter(total_livable_area <= 10000) %>%
  gather(Variable, Value, -sale_price) %>% 
  ggplot(aes(Value, sale_price)) +
    geom_point(color = "black", shape = 16, size = 0.5, alpha = 0.2) + 
    geom_smooth(method = "lm", se=F, colour = "#FA7800") + 
    facet_wrap(~Variable, nrow = 2, scales = "free_x", labeller = as_labeller(var_labels)) +  
    labs(
      title = "Scatterplots: Price as a function of 4 Neighborhood/House Variables",
      x = "",
      y = "Sale Price (USD)",
      caption = "Figure 3"
    ) +
    theme_minimal() + 
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      strip.text = element_text(size = 10, face = "italic"),
      panel.grid = element_line(color = "gray90")  
    )

2.2.4 Map of Property Sale Price

This map visualizes the sale prices of properties across Philadelphia, highlighting price disparities across neighborhoods. Each point represents a property, with its color indicating the price range, from homes priced under $250k (yellow) to those exceeding $1 million (red). This distribution allows us to easily observe the geographic spread of property values, with higher-priced homes concentrated in certain areas.

reg_data_or <- reg_data_or %>% 
  mutate(sale_class = cut(sale_price, breaks = c(0, 250000, 500000, 750000, 1000000, max(reg_data_or$sale_price, na.rm=TRUE))))

ggplot()+
    geom_sf(data=phl,fill='grey80',color='transparent')+
    geom_sf(data=reg_data_or, size=0.3, aes(colour = sale_class))+
    geom_sf(data=phl,fill='transparent',color='black')+
    labs(
      title = "Sale Price of Houses in Philadelphia",
      caption = "Figure 4") +
  scale_color_manual(values = palette5,
                    name = "Sales Price (USD)",
                    na.value = 'grey80',
                    labels = c('$0-$250k', '$250k-$500k', '$500k-$750k', '$750k-$1m', '$1m+'))+
    theme(
      plot.title = element_text(hjust = 0.5, size = 13),
      axis.title = element_blank(),  
      axis.text = element_blank(),   
      axis.ticks = element_blank(),  
      panel.grid = element_blank()
    )


2.3 Three Interesting Maps

2.3.1 Crime Density Map by Type

This map shows the concentration of different types of crimes across the city, allowing us to see which areas have higher crime rates and what types of crimes are most prevalent.

top_crimes <- crime_summ %>%
  group_by(text_general_code) %>%
  summarise(total_count = sum(crime_count)) %>%
  arrange(desc(total_count)) %>%
  slice(1:10) %>%
  pull(text_general_code)

### Top crimes are "Thefts", "Motor Vehicle Theft", "Vandalism/Criminal Mischief"  "Theft from Vehicle"   "Fraud"  "Aggravated Assault No Firearm"  "Burglary Residential" "Narcotic / Drug Law Violations"
top_crimes_summ <- crime %>%
  filter(text_general_code %in% c("Thefts", "Motor Vehicle Theft", "Vandalism/Criminal Mischief", "Theft from Vehicle", "Fraud", "Burglary Residential"))

crime_types <- c("Thefts", "Motor Vehicle Theft", "Vandalism/Criminal Mischief", "Theft from Vehicle", "Fraud", "Burglary Residential")

plot_list <- list()

for (crime_type in crime_types) {
  crime_filtered <- top_crimes_summ %>%
    filter(text_general_code == crime_type)
  
  
  crime_coords <- st_coordinates(crime_filtered)  
  
  
  crime_coords_df <- data.frame(X = crime_coords[, 1], Y = crime_coords[, 2])
  
 
  p <- ggplot() +
    geom_sf(data = phl, fill = "grey40") +
    stat_density2d(data = crime_coords_df,
                   aes(x = X, y = Y, fill = ..level.., alpha = ..level..),
                   size = 0.01, bins = 40, geom = 'polygon') +
    scale_fill_gradient(low = "#25CB10", high = "#FA7800",
                        breaks = c(0.000000003, 0.00000003),
                        labels = c("Minimum", "Maximum"), 
                        name = "Density") +
    scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
    labs(title = paste(crime_type)) +  
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, size = 9),
      axis.title = element_blank(),  
      axis.text = element_blank(),   
      axis.ticks = element_blank(),  
      panel.grid = element_blank()
    )
  
  plot_list[[crime_type]] <- p
}

# Combine the individual plots into a single page
combined_plot <- wrap_plots(plot_list, ncol = 3) +
  plot_annotation(title = "Crime Density Maps for Philadelphia in 2023",
                  theme = theme(
                    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
                  ))


combined_plot

2.3.2 Highly Rated Schools Point Map

This point map displays the locations of highly rated schools across the city, helping to identify which neighborhoods have access to top-performing educational institutions.

schools <- schools %>%
  mutate(overall_score_class = cut(overall_score, 
                                   breaks = quantile(overall_score, probs = seq(0, 1, 0.2), na.rm = TRUE), 
                                   include.lowest = TRUE))

ggplot() +
  geom_sf(data = phl, fill = 'grey80', color = 'transparent') +
  geom_sf(data = schools, aes(colour = overall_score_class), size = 2.5) +
  scale_color_manual(values = c("#B2F7EF", "#A0DAE2", "#88BCC4", "#577A8A", "#1A2C5B"), 
                     name = "School Overall Score", 
                     labels = c("Lowest", "Low", "Medium", "High", "Highest")) +
  geom_sf(data = phl, fill = 'transparent', color = 'black') +
  labs(title = "School Overall Score based on Student Performance and Incidents") +
  mapTheme() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 12, face = "bold", margin = margin(b = 10)),  
    legend.position = "right",  
    legend.title = element_text(size = 10),  
    legend.text = element_text(size = 8),  
    plot.margin = margin(t = 30, r = 5, b = 10, l = 10),
    panel.border = element_blank(), 
    axis.title = element_blank(),  
    axis.text = element_blank(),  
    axis.ticks = element_blank() 
  )

2.3.3 Hospital Density Map

This map highlights the density of hospitals in Philadelphia, showing which areas have higher or lower access to healthcare facilities.

hosp_coords <- st_coordinates(hosp)  

hosp_coords_df <- data.frame(X = hosp_coords[, 1], Y = hosp_coords[, 2])

# Estimate the 2D kernel density surface (stat_density2d() uses MASS::kde2d internally)
# so that the legend breaks can be set relative to the observed density range
hosp_density <- MASS::kde2d(hosp_coords_df$X, hosp_coords_df$Y, n = 100)
density_range <- range(hosp_density$z, na.rm = TRUE)

# Dynamic breaks at the low, middle, and high end of the density surface
breaks_dynamic <- c(density_range[1], mean(density_range), density_range[2])

ggplot() +
  geom_sf(data = phl, fill = "grey40") +  
  stat_density2d(data = hosp_coords_df,   
                 aes(x = X, y = Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 80, geom = 'polygon') +  
  scale_fill_gradient(low = "#4F81BD", high = "#C0504D",  
                      breaks = breaks_dynamic,  
                      labels = c("Low", "Medium", "High"),  
                      name = "Density", guide = "colorbar") +  
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +  
  labs(title = "Density of Hospitals in Philadelphia") +  
  theme_minimal() +  
  theme(
    plot.title = element_text(hjust = 0.5, size = 13),  
    axis.title = element_blank(),  
    axis.text = element_blank(),   
    axis.ticks = element_blank(),  
    panel.grid = element_blank(),   
    legend.position = "right"  
  )

3 Methods

3.1 Data Mining & Cleaning Process

We began by preparing the Zillow housing data to ensure accuracy for modeling. This involved removing extreme outliers, such as homes priced above $2 million, and addressing missing or incorrect values. For example, houses with unrealistic ages (like -1 or 2023) were excluded or imputed, and missing values in fields like total livable area were replaced with the average values of nearby homes. Variables without clear definitions were also removed to avoid confusion. Outliers, like unusually large garage counts (12 or 72 spaces), were adjusted to align with more realistic neighborhood averages.

After cleaning the housing data, we integrated it with additional datasets on schools, crime rates, hospitals, and transportation. These datasets were aligned with the geographical boundaries of Philadelphia, ensuring they could be used together for analysis and model development.

3.2 Model Development and Testing

For our analysis, we split the data into 60% for training and 40% for testing, so that the model was fit on one portion of the data and evaluated on another. We developed a linear regression model using 23,505 samples and 15 predictors to estimate housing prices, and we additionally evaluated it with 100-fold cross-validation to check its robustness and generalizability. Model accuracy was evaluated with the two metrics below (a minimal sketch of each follows the list):

  • Mean Absolute Error (MAE): the average absolute difference between the model's predicted prices and the actual prices.
  • Mean Absolute Percentage Error (MAPE): the average percentage difference between predicted and actual prices, providing insight into the model's relative accuracy.
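
Both metrics reduce to small helper functions, equivalent in spirit to the computations used later in the test-set and cross-validation chunks:

# Sketch: MAE and MAPE for vectors of observed and predicted prices
mae  <- function(obs, pred) mean(abs(obs - pred), na.rm = TRUE)
mape <- function(obs, pred) mean(abs(obs - pred) / obs, na.rm = TRUE)  # multiply by 100 for percent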

reg_data <- reg_data_or %>%
  select(
    -total_area,
    -number_of_bedrooms,
    -exterior_condition,
    -sale_class
    )

3.3 Train linear regression model

# set the seed to 1
set.seed(1)


inTrain <- createDataPartition(
              y = paste(reg_data$fireplaces,reg_data$interior_condition,reg_data$garage_spaces), 
              p = .60, list = FALSE)  # p = .60 = split to 60/40 ratio for training & test set

phl.training <- reg_data[inTrain,] # defining training set
phl.test <- reg_data[-inTrain,]    # defining testing set

reg.training <- lm(sale_price ~ ., data = phl.training %>% st_drop_geometry)


# plot(reg.training) # useful plots to look at model fit

phl.test <- phl.test %>%
              mutate(sale_price.Predict = predict(reg.training, phl.test), 
               sale_price.Error = sale_price.Predict - sale_price,
               sale_price.AbsError = abs(sale_price.Predict - sale_price),
               sale_price.APE = (abs(sale_price - sale_price.Predict)) / sale_price)

mape1 <- mean(phl.test$sale_price.APE, na.rm = TRUE) 
mae1 <- mean(phl.test$sale_price.AbsError, na.rm = TRUE) 
summ_reg.training <- summary(reg.training)
summ_coeff1 <- as.data.frame(summary(reg.training)$coefficients) # same as summ_coeff

footnote1 <- paste0("R-squared: ",round(summ_reg.training$r.squared,3),
                    "; adjusted R-squared: ",round(summ_reg.training$adj.r.squared,3),
                    "; F-statistic: ", round(summ_reg.training$fstatistic[1],3), 
                    " (df=", round(summ_reg.training$fstatistic[2],3), 
                    "; ", round(summ_reg.training$fstatistic[3],3), 
                    "); N = ", length(summ_reg.training$residuals),
                    "; Residual Std. Err.:", round(summ_reg.training$sigma,3), 
                    " (df=",summ_reg.training$df[2],")")

footnote2 <- "* p < 0.1; ** p < 0.05; *** p < 0.01"

summ_coeff1 %>% 
  mutate(`Sale Price` = round(Estimate,3),
         SE = round(`Std. Error`,3),
         `Sale Price` = case_when(
           `Pr(>|t|)` >= 0.1 ~ as.character(`Sale Price`),
           `Pr(>|t|)` < 0.1 & `Pr(>|t|)` >= 0.05 ~ paste0(`Sale Price`,"*"),
           `Pr(>|t|)` < 0.05 & `Pr(>|t|)` >= 0.01 ~ paste0(`Sale Price`,"**"),
           `Pr(>|t|)` < 0.01 ~ paste0(`Sale Price`,"***")
         )) %>% 
  dplyr::select(`Sale Price`,SE) %>% 
  kbl(caption = "Table | Summary of Training set", align = rep("c", 8)) %>%
  footnote(c(footnote1,footnote2)) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Table | Summary of Training set
Sale Price SE
(Intercept) -8555.297 7903.303
house_age -174.954*** 35.616
total_livable_area 113.777*** 2.082
number_of_bathrooms 42072.322*** 2337.001
number_of_rooms -3795.512*** 587.898
fireplaces 36560.084*** 3223.245
interior_condition -28184.343*** 1184.918
number_stories 4227.509** 1791.033
garage_spaces 12798.941*** 1877.633
schools_nn4 -0.011 0.925
crime_nn4 15.411 10.670
park_nn4 -4.57*** 0.927
tran_nn4 -1.207 2.522
hosp_nn4 -0.555** 0.249
medHHInc 0.445*** 0.046
price_lag 0.689*** 0.009
Note:
R-squared: 0.754; adjusted R-squared: 0.754; F-statistic: 2892.17 (df=15; 14121); N = 14137; Residual Std. Err.:100726.883 (df=14121)
* p < 0.1; ** p < 0.05; *** p < 0.01
  data.frame(MAE = sprintf('%.2f',round(mae1,3)),
             MAPE = round(mape1,3)) %>% 
    kbl(caption = "Table | MAE and MAPE of one test set",
        align = c("c","c")) %>% 
    kable_classic(full_width = F, html_font = "Cambria") %>% 
    column_spec(1, width = "10em") %>% 
    column_spec(2, width = "10em")
Table | MAE and MAPE of one test set
MAE MAPE
62634.56 0.351

The summary table for the training set shows the results of our linear regression model predicting house sale prices. The "Sale Price" column lists the coefficients, which tell us how much each variable shifts the predicted price, and the "SE" column gives the standard error, indicating the precision of each estimate. The coefficients show that house age, number of rooms, interior condition, and distance to parks, transit stops, and hospitals are negatively associated with sale price, meaning that increases in these variables tend to reduce the predicted sale price.

The R squared value of 0.754 suggests that our model explains about 75% of the variation in house prices, which is a good fit. However, we note that several of our predictors have p-values greater than 0.1, suggesting that they are not statistically significant and may not contribute meaningfully to the model.

We then evaluated our model using the test dataset to assess its performance and accuracy. Table 3 shows the MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error). The MAE measures the absolute difference between predicted and actual values, while MAPE expresses the error as a percentage, offering a sense of relative accuracy.

In our test, the MAE is 62,634.56, indicating that predicted house sale prices deviate from the actual values by roughly $62,635 on average, and the MAPE is 0.351, suggesting that predictions differ from the actual values by about 35.1%.

3.4 K-fold Cross Validation

customSummary <- function(data, lev = NULL, model = NULL) {
  mape <- mean((abs(data$obs - data$pred)) / data$obs) * 100
  mae <- mean(abs(data$obs - data$pred))
  rmse <- sqrt(mean((data$obs - data$pred)^2))
  rsq <- cor(data$obs, data$pred)^2
  out <- c(MAE = mae, RMSE = rmse, Rsquared = rsq, MAPE = mape)
  out
}

train_control <- trainControl(method = "cv",
                              number = 100,
                              summaryFunction = customSummary)

model <- train(sale_price ~ .,
               data = reg_data %>% st_drop_geometry,
               trControl = train_control,
               method = "lm",
               na.action = na.exclude)

print_model <- as.data.frame(print(model))
## Linear Regression 
## 
## 23505 samples
##    15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 23269, 23269, 23271, 23270, 23270, 23271, ... 
## Resampling results:
## 
##   MAE       RMSE      Rsquared   MAPE    
##   63110.41  99928.55  0.7526762  35.91969
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# final model
model$finalModel
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Coefficients:
##         (Intercept)            house_age   total_livable_area  
##          -4290.3175            -180.4330             114.7380  
## number_of_bathrooms      number_of_rooms           fireplaces  
##          42556.8271           -3973.0084           32276.0356  
##  interior_condition       number_stories        garage_spaces  
##         -28481.6594            3530.0517           14600.4378  
##         schools_nn4            crime_nn4             park_nn4  
##             -0.1119              21.4555              -4.7618  
##            tran_nn4             hosp_nn4             medHHInc  
##             -2.4100              -0.6793               0.4518  
##           price_lag  
##              0.6866

We used k-fold cross validation to ensure that the performance metrics are robust and generalizable. The cross-validated coefficients again show that distance to schools is negatively associated with housing prices (the closer a property is to a school, the higher its predicted price), which is the more intuitive direction. The MAE and MAPE are 63,110.41 and 35.92% respectively, indicating roughly a 36% discrepancy between predicted and actual housing prices, which suggests there is still considerable room to improve the model's accuracy.

3.5 Distribution of MAE

ggplot(data = model$resample) +
  ggplot2::geom_histogram(aes(x = MAE), bins = 12) +
  labs(title = "Distribution of MAE",
       subtitle = "K-Fold Cross Validation; k = 100",
       x = "Mean Absolute Error",
       y = "Count")

The majority of MAE values are clustered between 55,000 and 70,000, following a roughly bell-shaped curve. This pattern suggests an approximately normal distribution of fold-level errors, indicating that the model generalizes consistently across folds.

3.6 ScatterPlot of Mean MAPE vs. Mean Sale Price

This graph uses MAPE to analyze prediction accuracy by neighborhood, plotting mean MAPE as a function of the neighborhood's mean sale price. From the plot, we observe higher average MAPE in areas where average house sale prices are low.

house1 <- house %>%
  filter(toPredict == "MODELLING")
house1 <- house1 %>%
  mutate(pred = predict(model$finalModel, house1),  
         MAPE = abs((sale_price - pred) / sale_price) * 100) 

summary_data <- house1 %>%
  group_by(NAME) %>%
  summarize(mean_sale_price = mean(sale_price, na.rm = TRUE),  
            mean_MAPE = mean(MAPE, na.rm = TRUE),  
            .groups = 'drop')  


ggplot(summary_data, aes(x = mean_sale_price, y = mean_MAPE)) +
  geom_point(color = "blue", size = 1) +
  labs(title = "Scatter Plot of Mean MAPE vs. Mean Sale Price",
       x = "Mean Sale Price",
       y = "Mean MAPE (%)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

3.7 Predicted Prices as a function of observed price

The figure below depicts the difference between our predictions and the actual prices. To read the graph, the solid line represents a perfect prediction (predicted price equal to observed price), while the dashed line shows the fitted trend of predicted prices as a function of observed prices. The model predicts the majority of housing prices reasonably well, but there are some clear errors, particularly underestimates at the higher end of observed sale prices.

phl.test %>%
  dplyr::select(sale_price.Predict, sale_price) %>%
    ggplot(aes(sale_price, sale_price.Predict)) +
  ggplot2::geom_point(alpha = 0.4) +
  geom_abline(color = pal_full[3]) +
  geom_smooth(method = "lm", se = FALSE, color = pal_full[3], lwd = .5, lty = "dashed") +
  labs(title="Predicted sale price as a function of observed price",
       x = "Observed Sale Price", y = "Predicted Sale Price") +
  theme_minimal()

phl.test <- phl.test %>%
  mutate(quantile_group = ntile(sale_price.Predict, 5))  

quantile_ranges <- quantile(phl.test$sale_price.Predict, probs = seq(0, 1, by = 0.2))

quantile_labels <- c(
  paste0("<= ", round(quantile_ranges[2], 2)),
  paste0(round(quantile_ranges[2], 2), " - ", round(quantile_ranges[3], 2)),
  paste0(round(quantile_ranges[3], 2), " - ", round(quantile_ranges[4], 2)),
  paste0(round(quantile_ranges[4], 2), " - ", round(quantile_ranges[5], 2)),
  paste0("> ", round(quantile_ranges[5], 2))
)
ggplot() +
  geom_sf(data = phl, color = "black", size = 0.5) +  
 geom_sf(data = phl.test, aes(color = factor(quantile_group)), size = 0.8, alpha = 0.5) +
  scale_color_manual(values = quantile_colors_2, labels = quantile_labels, name = "Predicted Price Quintile") +   
  theme_minimal() +
  labs(title = "Predicted Sale Price Map")+ 
  theme(panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0),
        panel.grid = element_blank(),
        axis.line = element_blank(),
        axis.title = element_blank(),  
        axis.text = element_blank(),  
        axis.ticks = element_blank())

The map above shows the predicted sale price distribution across Philadelphia, with values divided into quintiles. The warmer colors (oranges and reds) indicate areas with higher predicted sale prices, while the cooler colors (blues) represent lower predicted sale prices. It shows that the lower central and some southern parts of Philadelphia have higher predicted sale prices, while predicted prices are generally lower in the upper central and southwestern parts, as shown by the blue areas.

3.8 Residual Maps

3.8.1 Residual Maps by House

spatial_resids <- reg_data %>%
                    na.omit() %>%
                    st_as_sf()

spatial_resids$residuals <- model$finalModel$residuals
spatial_resids <- spatial_resids %>%
                        st_as_sf() %>%
                        st_join(phl)


spatial_resids <- spatial_resids %>%
  mutate(quantile_group = ntile(residuals, 5))  

quantile_ranges <- quantile(spatial_resids$residuals, probs = seq(0, 1, by = 0.2))

quantile_labels <- c(
  paste0("<= ", round(quantile_ranges[2], 2)),
  paste0(round(quantile_ranges[2], 2), " - ", round(quantile_ranges[3], 2)),
  paste0(round(quantile_ranges[3], 2), " - ", round(quantile_ranges[4], 2)),
  paste0(round(quantile_ranges[4], 2), " - ", round(quantile_ranges[5], 2)),
  paste0("> ", round(quantile_ranges[5], 2))
)


ggplot() +
  geom_sf(data = phl, color = "black", size = 0.5) +  
  geom_sf(data = spatial_resids, aes(color = factor(quantile_group)), size = 1, alpha = 0.5) +
  scale_color_manual(values = quantile_colors_2, labels = quantile_labels, name = "Residual Quintile") +   
  theme_minimal() +
  labs(title = "Residuals Overlay on Philadelphia Map", 
       subtitle = "Point Residuals Colored by Quintile") +
  theme(panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0),
        panel.grid = element_blank(),
        axis.line = element_blank(),
        axis.title = element_blank(),  
        axis.text = element_blank(),  
        axis.ticks = element_blank())

3.8.2 Residual Maps by tract

avg_resid_x_hood <-  spatial_resids %>%
    group_by(NAME) %>%
    summarize(mean_resid = mean(residuals)) %>%
  st_drop_geometry()

avg_resid_x_hood <- left_join(phl, avg_resid_x_hood) %>%
                          st_as_sf()


avg_resid_x_hood <- avg_resid_x_hood %>%
  mutate(quantile_group = ntile(mean_resid , 5))  



ggplot() +
  geom_sf(data = phl, fill = NA, color = "black", size = 0.5) +
  geom_sf(data = avg_resid_x_hood, aes(fill = factor(quantile_group))) +  
  scale_fill_manual(values = quantile_colors, 
                    name = "Residual Quantile",
                    labels = c("Low", "Moderately Low", "Moderate", "Moderately High", "High")) +  
  # Set theme and labels
  theme_minimal() +
  labs(title = "Residuals Overlay on Philadelphia Map", 
       subtitle = "Mean Residuals by Neighborhood") +
  theme(panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0),
        panel.grid = element_blank(),
        axis.title = element_blank(),  
        axis.text = element_blank(),  
        axis.ticks = element_blank())

These two maps display the residuals for each house and the average residual by tract, divided into five levels based on quintiles. The maps reveal that home sale prices tend to be underestimated in the western and central parts of Philadelphia, while they are overestimated in the northeastern areas. This suggests that the model struggles to fully capture some of the local housing market dynamics in these regions, potentially due to unaccounted neighborhood-specific factors.

3.9 Observed and Permuted Moran’s I

# adapted from code chunk: autocorr
house_nb <- house %>% 
            mutate(nb = st_knn(geometry, k = 5),
                   wt = st_weights(nb),
                   .before = 1)


# calculate global moran's i for sale_price
glbl_moran <- global_moran(house_nb$sale_price, house_nb$nb, house_nb$wt)$`I` # 0.5687531

# moran monte carlo
moranMC = global_moran_perm(house_nb$sale_price, house_nb$nb, house_nb$wt, alternative = "two.sided", 999)
moranMC
moranMCres = moranMC$res |>
  as.data.frame() %>% 
  rename(moranMCres = 1)

ggplot(moranMCres) +
  geom_histogram(aes(moranMCres), bins = 100) +
  geom_vline(xintercept = glbl_moran, color = pal_full[5]) +
  labs(title = "Observed and Permuted Moran's I",
       subtitle = "Red line indicates the observed Moran's I",
       x = "Moran's I",
       y = "Count",
       caption = "Figure 15")

Moran’s I is a measure that tells how similar or different things are across a map based on their location. In the context of our housing price prediction, we used Moran’s I to check if there is a pattern in how our prediction errors are spread out across different neighborhoods. If Moran’s I shows a strong value, it means that the errors are not random but are clustered in certain areas, suggesting that the model might not be capturing some location-based factors well. This helps us understand whether we need to include more spatial features in the model to improve accuracy.

Running a global Moran’s I test, the red line represents the value of the observed Moran’s I statistic, and the black bars depict the distribution of Permuted Moran’s I. We can see that the red line is significantly greater than the values from the permuted distribution, indicating that there is a significant positive spatial autocorrelation, confirming that this spatial pattern affects the quality of our model.

# this can be made prettier
spdep::moran.plot(house_nb$sale_price, nb2listw(house_nb$nb),
           xlab = "Sale Price", 
           ylab = "Spatial Lag")

3.10 Split by Census Group

The income distribution map and error analysis also show key insights into our model’s generalizability across Philadelphia’s socioeconomic landscape. High-income tracts, primarily in the northwest and lower central regions, show a lower MAPE of 0.25, indicating better relative accuracy, while low-income areas exhibit a higher MAPE of 0.43, suggesting greater prediction deviations. However, the MAE is significantly higher in high-income areas ($79,266.85) compared to low-income areas ($50,150.79), likely due to the higher property values in affluent neighborhoods.

These discrepancies highlight the model’s limitations in adapting to diverse income levels. The elevated MAPE in low-income areas suggests an underrepresentation of unique factors influencing these markets, while the high MAE in wealthier areas reflects larger absolute discrepancies due to higher base prices. Addressing these variations through additional neighborhood-specific factors could enhance model robustness, supporting fairer predictions across Philadelphia’s socioeconomic spectrum.

census_api_key("e6ed37e145b901c8c418a193d5bc174948045488", overwrite = TRUE)


phl_acs <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001E",
                        "B02001_002E",
                        "B19013_001E"), 
          year=2022, 
          state = "PA", 
          county = "Philadelphia", 
          geometry=TRUE, 
          output="wide") %>%
  st_transform('EPSG:2272') %>%
  rename(TotalPop = B01003_001E, 
         Whites = B02001_002E,
         MedHHInc = B19013_001E) %>%
  dplyr::select(-NAME, -starts_with("B"))%>%
  mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop,0),
         year = "2022")

city_mean_inc <- round(mean(phl_acs$MedHHInc,na.rm=T),2)
phl_acs <- phl_acs %>% 
  mutate(incomeContext = ifelse(MedHHInc > city_mean_inc, "High Income", "Low Income"),  # really above/below avg inc
         raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"))
phl.test <- st_join(phl.test, phl_acs, join = st_within)
# Plot with different colors for "white" and "non-white"
ggplot()+
  geom_sf(data = phl_acs%>%na.omit,aes(fill = as.factor(incomeContext)), color = "transparent")+
  scale_fill_manual(values = c("High Income" = "lightblue", "Low Income" = "lightcoral")) +
  geom_sf(data = phl_acs, fill = "transparent")+
  theme_void()+
  labs(fill = "Income Level", 
       title = "Income Distribution by Tract",
       subtitle = "Split by median household income")

#  Test set MAPE by neighborhood income context
summary_table <- phl.test %>%
  na.omit(incomeContext)%>%
  st_drop_geometry()%>%
  group_by(incomeContext) %>%
  summarise(
    mape = round(mean(sale_price.APE, na.rm = TRUE),2), 
    mae = round(mean(sale_price.AbsError, na.rm = TRUE),2)   
  )

summary_table%>%
 kbl(caption = "Table | MAPE and MAE by Neighborhood Income Context",
        align = c("c","c")) %>% 
    kable_classic(full_width = F, html_font = "Cambria") %>% 
    column_spec(1, width = "10em") %>% 
    column_spec(2, width = "10em")
Table | MAPE and MAE by Neighborhood Income Context
incomeContext mape mae
High Income 0.25 79266.85
Low Income 0.43 50150.79

4 Discussion

The linear regression model captures 75.3% of the variation in housing prices, as indicated by the R-squared value. However, with an MAE of $63,110 and a MAPE of 35.9%, a substantial portion of price variation remains unexplained, highlighting areas for further refinement. Key predictors include total livable area, number of bathrooms, and garage spaces, all of which positively influence home prices. For instance, every additional square foot of livable area increases the predicted price by $114.74, demonstrating the market's preference for larger homes. Similarly, each additional bathroom and garage space adds significant value, indicating that buyers prioritize these amenities. Conversely, external factors like crime rates, proximity to parks, hospitals, and transportation had minimal impact, suggesting that home features outweigh neighborhood services in buyer decisions.

When evaluating the model’s generalizability, several key findings emerged. Firstly, the model demonstrates fair accuracy on the test set, confirming its ability to predict new data with reasonable reliability. However, generalizability across different geographies and socioeconomic groups reveals limitations. Analyzing MAPE by tract shows that prediction accuracy varies geographically, with lower errors in middle-income neighborhoods where price trends are more stable. In contrast, the model struggles in high- and low-income areas, where greater volatility and unobserved neighborhood-specific factors, such as renovation trends or local market conditions, impact prices.

A demographic split analysis further emphasizes these limitations. The model’s higher MAPE in low-income areas (0.43) versus high-income areas (0.25) indicates that it struggles to generalize across socioeconomically diverse neighborhoods. This discrepancy may be due to unique factors influencing low-income tracts that the model does not capture, resulting in higher relative error rates. Additionally, the residual mapping shows clustering of spatial errors, particularly in Southwest Philadelphia, suggesting unaccounted-for local dynamics that affect model accuracy.

In conclusion, while the model performs adequately on new data, its generalizability across geographies and socioeconomics is limited. To enhance the model’s robustness, integrating additional neighborhood-specific variables, such as walkability scores, proximity to employment hubs, or access to local amenities, could capture the complexity of Philadelphia’s housing market more effectively. This would reduce spatial and socioeconomic biases, leading to more equitable and consistent predictions across the city.

5 Conclusion

While our linear regression model effectively captures a significant portion of the variation in housing prices, explaining 75.3% of price fluctuations, it still leaves room for improvement, particularly in areas where housing prices are more volatile.

Given the model's overall performance, including the MAE of $63,110 and MAPE of 35.9%, we would cautiously recommend this model to Zillow as a baseline tool for predicting housing prices in Philadelphia. However, there are limitations, especially in high-end and low-income neighborhoods where prediction errors are more pronounced and spatial variation is not fully accounted for.