1.Introduction

This analysis aims to build a prediction model of robbery incidents in Chicago. Robbery is a crime that involves direct confrontation between the perpetrator and the victim, often using force or threats. This element of direct interaction causes complexities in both the occurrence and reporting of such incidents, which can lead to a high level of selection bias. Predicting robberies is prone to selection bias due to underreporting of incidents, where victims may not report crimes due to various factors like fear. Additionally, areas with lower levels of trust in law enforcement may see fewer reported robberies, even if they experience similar or higher rates of such crimes.

Robbery is influenced by multiple spatial factors, such as neighborhood demographics, economic conditions, and locations. Thus, building a spatial risk model for robbery requires accounting for these nuances and biases to create an accurate and fair model. This model aims to enhance the understanding of spatial patterns of robbery and improve predictive accuracy and generalizability across different neighborhoods.

2.Data Wrangling

2.1 load packages and functions

knitr::opts_chunk$set(echo = TRUE, warning=FALSE)

library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat.explore)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt)   # for KDE and ML risk class intervals
# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

2.2 Read in Data from Chicago

2.2.1 Police Data

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
  
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

2.2.2 Robbery Data

When I used “Socrata” to download crime data, the results are all “NA”. So I download the .csv file from the website, and use “read.csv” to get the data instead.

robbery <- 
  read.csv("Crimes_-_2017_20241016.csv") %>% 
    filter(Primary.Type == "ROBBERY") %>%
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()

2.2.3 Chicago Boundary

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271')

2.2.4 Visualize Robbery Data

# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = robbery, colour="red", size=0.1, show.legend = "point") +
  labs(title= "Robbery, Chicago - 2017") +
  mapTheme(title_size = 14),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(robbery)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of robbery") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

2.3 Creating a fishnet grid

## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data

fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

2.3.1 Aggregate points to the fishnet

## add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(robbery) %>% 
  mutate(countrobbery = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countrobbery = replace_na(countrobbery, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

2.3.2 Visualize Robbery Fishnet

ggplot() +
  geom_sf(data = crime_net, aes(fill = countrobbery), color = NA) +
  scale_fill_viridis() +
  labs(title = "Figure 2: Count of Robberies for the fishnet") +
  mapTheme()

2.4 Preparing Risk Factors

In this analysis, I chose the following risk factors:

population, median household income, unemployed population (Source: ACS Data)
points of grocery stores, schools (Source: the City of Chicago’s open data https://data.cityofchicago.org/)

Population: Higher population density may lead to increased robbery rates due to more potential targets.

Median Household Income: Areas with lower median incomes often experience higher crime rates, including robberies, as economic stress can drive criminal behavior.

Unemployed number: high unemployment rates may link to high robbery rates.

Grocery Point Data: The presence of grocery stores can attract foot traffic, potentially increasing robbery opportunities. And grocery stores are a common site for robberies.

Schools Point Data: Schools influence robbery rates by serving as community hubs that foster engagement and social cohesion, which can deter criminal activities.

2.4.1 grocery stores

grocery <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Grocery-Stores-2013/53t8-wyrc") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "grocery.2013")

2.4.2 schools

school <- 
  read.socrata("https://data.cityofchicago.org/Education/CPS_School_Locations_1617/75e5-35kf") %>%
    dplyr::select(Y = lat, X = long) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "school.2017")

2.4.3 census data

census_api_key("e6ed37e145b901c8c418a193d5bc174948045488", overwrite = TRUE,install=TRUE)

acsTractsChi.2017 <- 
  get_acs(geography = "tract", 
          variables = c("B01001_001E", # total population
                        "B19013_001E", #median household income
                        "B23025_005E" # unemployed number
                        ), 
          year=2017, 
          state=17, 
          county=031, 
          geometry=TRUE, 
          output="wide") %>%
  st_transform('ESRI: 102271')%>%
  rename(total_pop.2017 = B01001_001E,
          med_HH_Income.2017 = B19013_001E,
         unemployed.2017 = B23025_005E)

pop <- 
  acsTractsChi.2017 %>%
    dplyr::select(GEOID,value = total_pop.2017) %>%
    na.omit() %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "pop.2017")

medhhinc <- 
  acsTractsChi.2017 %>%
    dplyr::select(GEOID,value = med_HH_Income.2017) %>%
    na.omit() %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "medhhinc.2017")

unemployed <- 
  acsTractsChi.2017 %>%
    dplyr::select(GEOID,value = unemployed.2017) %>%
    na.omit() %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "unemployed.2017")

2.4.4 Chicago neighborhoods data

## Neighborhoods to use in LOOCV in a bit
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

2.4.5 combine all the risk fators to a dataframe

For the grocery store and school point data, I checked which grid each point is in and assign a value to that grid. For the census data, I turn each grid of fishnet into a point based on centroid, and check which tract each point belongs to and assign the corresponding value of census data.

vars_net1 <- rbind(school,grocery) %>%
  st_join(fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  left_join(fishnet, ., by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>%
  dplyr::select(-`<NA>`) %>%
  ungroup()


fishnet_centroid <- st_centroid(fishnet)
vars_net2 <- fishnet_centroid%>%
  st_join(rbind(pop,medhhinc,unemployed), join = st_within) %>%
  pivot_wider(names_from = Legend, values_from = value)%>%
  dplyr::select(uniqueID,pop.2017,medhhinc.2017,unemployed.2017,geometry)

Because the spatial range of Census Data and fishnet is slightly different, the values of census data in some grids are NA. I wrote a function to set the NA grids as their nearest non-NA value.

# Function to fill NA values with nearest non-NA
fill_na_with_nearest <- function(data, var_name) {
  # Create a copy of the data with only non-NA rows for the specified variable
  non_na_data <- data %>% filter(!is.na(!!sym(var_name)))
  
  # Find the index of the nearest non-NA feature for each row with NA
  nearest_indices <- st_nearest_feature(data, non_na_data)
  
  # Replace NA values with the corresponding nearest non-NA value
  data[[var_name]] <- ifelse(is.na(data[[var_name]]),
                             non_na_data[[var_name]][nearest_indices],
                             data[[var_name]])
  
  return(data)
}

# Apply the function to each variable
vars_net2 <- vars_net2 %>%
  fill_na_with_nearest("pop.2017") %>%
  fill_na_with_nearest("medhhinc.2017") %>%
  fill_na_with_nearest("unemployed.2017") %>%
  st_drop_geometry()

vars_net <- merge(vars_net1,vars_net2,by = "uniqueID")

2.4.6 Nearest Neighbor Feature of Groceries and Schools

Here we calculate the mean distance of the nearest 3 grocery stores / schools of each grid.

## create NN from abandoned cars
vars_net <- vars_net %>%
    mutate(grocery.nn = nn_function(st_coordinates(st_centroid(vars_net)), 
                                           st_coordinates(grocery),
                                           k = 3))
vars_net <- vars_net %>%
    mutate(school.nn = nn_function(st_coordinates(st_centroid(vars_net)), 
                                           st_coordinates(school),
                                           k = 3))

## Visualize the NN feature

vars_net.long <- 
  dplyr::select(vars_net, -"uniqueID",-"grocery.2013",-"school.2017") %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme(title_size = 14)}

do.call(grid.arrange,c(mapList, ncol = 3, top = "risk Factors Fishnet"))

## important to drop the geometry from joining features
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
    st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

3. Spatial Processes

3.1 Local Moran’s I for fishnet grid cells

Local Moran’s I is a statistical measure used to assess local spatial pattern. The null hypothesis is that robbery counts at specific locations are randomly distributed in relation to their neighboring areas. The analysis produces several key statistics, including the p-value and a designation for Significant Hotspots, which are grid cells exhibiting higher than expected robbery counts under random conditions (p-values ≤ 0.05). In this analysis, in cases where a low p-value is evident, indicating statistical significance, it signifies a hotspot for robbery-related incidents.

## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods... 
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

# print(final_net.weights, zero.policy=TRUE)

## see ?localmoran
morans_robbery <- localmoran(final_net$countrobbery, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()


# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(morans_robbery, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(robbery_Count = countrobbery, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme(title_size = 14) + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, ROBBERY"))

3.2 Distance to Hot spot

# generates warning from NN
final_net <- final_net %>% 
  mutate(robbery.isSig = 
           ifelse(morans_robbery[,5] <= 0.001, 1, 0)) %>%
  mutate(robbery.isSig.dist = 
           nn_function(st_coordinates(st_centroid(final_net)),
                       st_coordinates(st_centroid(filter(final_net, 
                                           robbery.isSig == 1))), 
                       k = 1))

ggplot() +
      geom_sf(data = final_net, aes(fill=robbery.isSig.dist), colour=NA) +
      scale_fill_viridis(name="NN Distance") +
      labs(title="Robbery NN Distance") +
      mapTheme()

3.3 Correlation between Robbery and Risk Factors

Here I drew the scatter plots of robbery and all the risk factors. We can see that the correlation coeffecient (r) of independent variables school.nn and grocery.nn are at about -0.37, meaning that with the distance of groceries/schools increase, the robbery number decreases. This is contrary to our previous assumption about the school.

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -name, -District, -grocery.2013, -school.2017) %>%
    gather(Variable, Value, -countrobbery) %>%
  mutate(Value = as.numeric(Value))

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countrobbery, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countrobbery)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "orange") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Robbery count as a function of risk factors") +
  plotTheme()

## `geom_smooth()` using formula = 'y ~ x'

4. Modeling and Cross Validation

4.1 Modeling

After several attempts, I found that removing the population variable would bring the best balance between accuracy and generalizability.

# View(crossValidate)

## define the variables we want
reg.ss.vars <- c("grocery.nn",
                 "school.nn",
                 "robbery.isSig.dist",
                 #"pop.2017",
                 "medhhinc.2017",
                 "unemployed.2017"
                 )

## RUN REGRESSIONS
reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",                           
  dependentVariable = "countrobbery",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countrobbery, Prediction, geometry)

reg.summary <-
  mutate(reg.ss.spatialCV, 
         Error = Prediction - countrobbery,
         Regression = "Spatial LOGO-CV: Spatial Process") %>%
    st_sf()

4.2 Accuracy and generalizability

Here we calculate the Mean Absolute Error (MAE) of the prediction and actual data of robberies in 2018.Fro the histogram we can see that the bias are mainly within 3. I also calculated the mean MAE and the standard deviation of MAE, which are 2.69 and 2.44 respectively, reflecting an unsatisfactory accuracy. As for the generalizability, I calculated the mean error in both non-White majority and white majority area, and get -0.17 and 0.17 respectively, indicating that the model on average under-predicts in Majority_Non_White neighborhoods and over-predicts in Majority_White neighborhoods.It looks like this algorithm generalizes good with respect to race.

4.2.1 Histogram of Model Error

# calculate errors by NEIGHBORHOOD
error_by_reg_and_fold <- 
  reg.ss.spatialCV %>%
    group_by(cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countrobbery, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>% 
  arrange(desc(MAE))

## Simple feature collection with 96 features and 4 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: 341164.6 ymin: 552874.5 xmax: 367164.6 ymax: 594874.5
## Projected CRS: NAD_1983_HARN_StatePlane_Illinois_East_FIPS_1201
## # A tibble: 96 × 5
##    cvID            Mean_Error   MAE SD_MAE                              geometry
##    <chr>                <dbl> <dbl>  <dbl>                        <GEOMETRY [m]>
##  1 Loop                -21.6  21.6   21.6  POLYGON ((357664.6 579374.5, 357664.…
##  2 Rush & Division     -19.4  19.4   19.4  POLYGON ((358664.6 580874.5, 358164.…
##  3 Printers Row        -14.1  14.1   14.1  POLYGON ((358664.6 577374.5, 358164.…
##  4 Boystown            -10.6  10.6   10.6  POLYGON ((356664.6 585874.5, 357164.…
##  5 Streeterville        -7.01  7.01   7.01 MULTIPOLYGON (((358664.6 580374.5, 3…
##  6 Fuller Park          -6.10  6.10   6.10 POLYGON ((358164.6 569374.5, 358164.…
##  7 Rogers Park          -6.07  6.07   6.07 POLYGON ((354164.6 593374.5, 353664.…
##  8 River North          -5.61  5.61   5.61 POLYGON ((357164.6 580874.5, 356664.…
##  9 Douglas              -5.60  5.60   5.60 POLYGON ((358664.6 574374.5, 358664.…
## 10 Little Village        5.33  5.33   5.33 POLYGON ((349164.6 572874.5, 349164.…
## # ℹ 86 more rows

error_by_reg_and_fold %>% 
  arrange(MAE)

## Simple feature collection with 96 features and 4 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: 341164.6 ymin: 552874.5 xmax: 367164.6 ymax: 594874.5
## Projected CRS: NAD_1983_HARN_StatePlane_Illinois_East_FIPS_1201
## # A tibble: 96 × 5
##    cvID                  Mean_Error    MAE SD_MAE                       geometry
##    <chr>                      <dbl>  <dbl>  <dbl>                 <GEOMETRY [m]>
##  1 Jefferson Park            0.0226 0.0226 0.0226 POLYGON ((345164.6 590374.5, …
##  2 Edison Park              -0.0333 0.0333 0.0333 POLYGON ((342664.6 592874.5, …
##  3 Hegewisch                -0.0358 0.0358 0.0358 POLYGON ((363664.6 552874.5, …
##  4 North Park                0.106  0.106  0.106  POLYGON ((349164.6 590374.5, …
##  5 Garfield Ridge           -0.121  0.121  0.121  POLYGON ((345164.6 568374.5, …
##  6 Bridgeport                0.123  0.123  0.123  POLYGON ((355664.6 573874.5, …
##  7 Irving Park               0.128  0.128  0.128  POLYGON ((349164.6 586874.5, …
##  8 Norwood Park              0.143  0.143  0.143  MULTIPOLYGON (((341164.6 5898…
##  9 Hyde Park                 0.182  0.182  0.182  POLYGON ((360664.6 569374.5, …
## 10 Sauganash,Forest Glen     0.196  0.196  0.196  POLYGON ((347664.6 589874.5, …
## # ℹ 86 more rows

## plot histogram of OOF (out of fold) errors
error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
  scale_x_continuous(breaks = seq(0, 11, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "LOGO-CV",
         x="Mean Absolute Error", y="Count")

4.2.2 Map of Model Error

error_with_neighborhood <- st_join(error_by_reg_and_fold, neighborhoods)
mae_by_neighborhood <- error_with_neighborhood %>%
  group_by(name) %>%  # Replace with the actual neighborhood column name
  summarize(MAE = mean(MAE, na.rm = TRUE))

grid.arrange(ncol=2,
ggplot()+
  geom_sf(data=error_by_reg_and_fold,aes(fill=MAE),color = NA)+
  scale_fill_viridis()+
  mapTheme()+
  labs(title = "MAE by Fishnet",
       subtitle = "Spatial distribution of MAE values"),



ggplot() +
  geom_sf(data = mae_by_neighborhood, aes(fill = MAE), color = NA) +  # `color = NA` removes borders between neighborhoods
  scale_fill_viridis(option = "plasma", name = "MAE") +  # Use a color scale (e.g., "plasma" or "viridis")
  mapTheme() +
  labs(title = "MAE by Neighborhood",
       subtitle = "Spatial distribution of MAE values"))

4.2.3 Table of MAE and SD MAE

mean_errors <- error_by_reg_and_fold %>%
  st_drop_geometry()%>%
  summarise(regression = "Spatial LOGO-CV: Spatial Process",
    Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2))

mean_errors %>%
  kable(col.names = c("Regression", "Mean MAE", "SD_MAE"),
        caption = "Mean MAE and SD MAE",
        align = "c") %>%
  kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover", "condensed"),
                position = "center") %>%
  row_spec(0, bold = TRUE, background = "#f7f7f7") %>%  # Header row styling
  column_spec(1, width = "16em") %>%  # Adjust column widths
  column_spec(2, width = "8em")%>%
  column_spec(3, width = "8em")

Mean MAE and SD MAE
Regression	Mean MAE	SD_MAE
Spatial LOGO-CV: Spatial Process	2.69	3.44

4.2.4 Table of Raw Error by Neighborhood Demographic Context

tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

# Mean errors table
racialerror <- reg.summary %>% 
    st_centroid() %>%
    st_join(tracts18) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "Figure 14: Mean error by neighborhood racial context") %>%
        kable_styling("striped", full_width = F)  
racialerror

4.2.5 Kernel Density vs Predictions

To compare the traditional Kernel Density method and Risk Predictions that I built, I summarized the KDE values and my prediction into a fishnet, and then break them into five classes. The map below is generated of the risk categories for both model types with a sample of robbery in 2018 points overlaid. The bar chart below shows that the two models have very similar predictions at the highest two risk levels, while prediction model estimates a lower share of the lowest risk level than KDE. This result indicates that no obvious advantage is shown than the traditional KDE method.

# demo of kernel width
robbery_ppp <- as.ppp(st_coordinates(robbery), W = st_bbox(final_net))
robbery_KD.1000 <- spatstat.explore::density.ppp(robbery_ppp, 1000)
robbery_KD.1500 <- spatstat.explore::density.ppp(robbery_ppp, 1500)
robbery_KD.2000 <- spatstat.explore::density.ppp(robbery_ppp, 2000)
robbery_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(robbery_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(robbery_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(robbery_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

robbery_KD.df$Legend <- factor(robbery_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

ggplot(data=robbery_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis(name="Density") +
  labs(title = "Kernel density with 3 different search radii") +
  mapTheme(title_size = 14)

as.data.frame(robbery_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value)) +
     geom_sf(data = sample_n(robbery, 1500), size = .5) +
     scale_fill_viridis(name = "Density") +
     labs(title = "Kernel density of 2017 robbery") +
     mapTheme(title_size = 14)

robbery18 <- 
  read.csv("Crimes_-_2018_20241017.csv") %>% 
  filter(Primary.Type == "ROBBERY") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct() %>%
  .[fishnet,]

robbery_KDE_sum <- as.data.frame(robbery_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 


kde_breaks <- classIntervals(robbery_KDE_sum$value, 
                             n = 5, "fisher")


robbery_KDE_sf <- robbery_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery18) %>% mutate(robberyCount = 1), ., sum) %>%
    mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label, Risk_Category, robberyCount)

ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction, 
                             n = 5, "fisher")
robbery_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions",
         Risk_Category =classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery18) %>% mutate(robberyCount = 1), ., sum) %>%
      mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label,Risk_Category, robberyCount)

rbind(robbery_KDE_sf, robbery_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(robbery18, 3000), size = .2, colour = "black",alpha = 0.4) +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2017 robbery risk predictions; 2018 robbery") +
    mapTheme(title_size = 14)

rbind(robbery_KDE_sf, robbery_risk_sf) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countrobbery = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_crimes = countrobbery / sum(countrobbery)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE, name = "Model") +
      labs(title = "Risk prediction vs. Kernel density, 2018 robbery",
           y = "% of Test Set robbery (per model)",
           x = "Risk Category") +
  theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

5. Conclusion

This model has been effective in identifying robbery hot spots within Chicago, and has shown that its utility spans across different racial groups. However, it performs comparably to the original algorithm used by the police in high-risk categories, which doesn’t show a obvious advantage. Additionally, the average MAE of 2.69 also reflects an unsatisfactory accuracy.

As a result, I would not recommend using this model for predicting robberies in Chicago’s policing strategies. With technological advancements, future opportunities may arise for gathering more precise and detailed features, potentially improving such models’ effectiveness. While there is room for refining this geospatial prediction model, it is equally important for police departments in Chicago and elsewhere to address selection bias issues in their approaches.

Robbery Risk Prediction

Jingmiao Fei

2024-10-18