This analysis aims to build a prediction model of robbery incidents in Chicago. Robbery is a crime that involves direct confrontation between the perpetrator and the victim, often using force or threats. This element of direct interaction causes complexities in both the occurrence and reporting of such incidents, which can lead to a high level of selection bias. Predicting robberies is prone to selection bias due to underreporting of incidents, where victims may not report crimes due to various factors like fear. Additionally, areas with lower levels of trust in law enforcement may see fewer reported robberies, even if they experience similar or higher rates of such crimes.
Robbery is influenced by multiple spatial factors, such as neighborhood demographics, economic conditions, and locations. Thus, building a spatial risk model for robbery requires accounting for these nuances and biases to create an accurate and fair model. This model aims to enhance the understanding of spatial patterns of robbery and improve predictive accuracy and generalizability across different neighborhoods.
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat.explore)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt) # for KDE and ML risk class intervals
# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = beat_num)
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
When I used “Socrata” to download crime data, the results are all “NA”. So I download the .csv file from the website, and use “read.csv” to get the data instead.
robbery <-
read.csv("Crimes_-_2017_20241016.csv") %>%
filter(Primary.Type == "ROBBERY") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct()
chicagoBoundary <-
st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
st_transform('ESRI:102271')
# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = robbery, colour="red", size=0.1, show.legend = "point") +
labs(title= "Robbery, Chicago - 2017") +
mapTheme(title_size = 14),
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(robbery)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of robbery") +
mapTheme(title_size = 14) + theme(legend.position = "none"))
## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <-
st_make_grid(chicagoBoundary,
cellsize = 500,
square = TRUE) %>%
.[chicagoBoundary] %>% # fast way to select intersecting polygons
st_sf() %>%
mutate(uniqueID = 1:n())
## add a value of 1 to each crime, sum them with aggregate
crime_net <-
dplyr::select(robbery) %>%
mutate(countrobbery = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countrobbery = replace_na(countrobbery, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = crime_net, aes(fill = countrobbery), color = NA) +
scale_fill_viridis() +
labs(title = "Figure 2: Count of Robberies for the fishnet") +
mapTheme()
In this analysis, I chose the following risk factors:
population, median household income, unemployed population (Source: ACS Data)
points of grocery stores, schools (Source: the City of Chicago’s open data https://data.cityofchicago.org/)
Population: Higher population density may lead to increased robbery rates due to more potential targets.
Median Household Income: Areas with lower median incomes often experience higher crime rates, including robberies, as economic stress can drive criminal behavior.
Unemployed number: high unemployment rates may link to high robbery rates.
Grocery Point Data: The presence of grocery stores can attract foot traffic, potentially increasing robbery opportunities. And grocery stores are a common site for robberies.
Schools Point Data: Schools influence robbery rates by serving as community hubs that foster engagement and social cohesion, which can deter criminal activities.
grocery <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Grocery-Stores-2013/53t8-wyrc") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "grocery.2013")
school <-
read.socrata("https://data.cityofchicago.org/Education/CPS_School_Locations_1617/75e5-35kf") %>%
dplyr::select(Y = lat, X = long) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "school.2017")
census_api_key("e6ed37e145b901c8c418a193d5bc174948045488", overwrite = TRUE,install=TRUE)
acsTractsChi.2017 <-
get_acs(geography = "tract",
variables = c("B01001_001E", # total population
"B19013_001E", #median household income
"B23025_005E" # unemployed number
),
year=2017,
state=17,
county=031,
geometry=TRUE,
output="wide") %>%
st_transform('ESRI: 102271')%>%
rename(total_pop.2017 = B01001_001E,
med_HH_Income.2017 = B19013_001E,
unemployed.2017 = B23025_005E)
pop <-
acsTractsChi.2017 %>%
dplyr::select(GEOID,value = total_pop.2017) %>%
na.omit() %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "pop.2017")
medhhinc <-
acsTractsChi.2017 %>%
dplyr::select(GEOID,value = med_HH_Income.2017) %>%
na.omit() %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "medhhinc.2017")
unemployed <-
acsTractsChi.2017 %>%
dplyr::select(GEOID,value = unemployed.2017) %>%
na.omit() %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "unemployed.2017")
## Neighborhoods to use in LOOCV in a bit
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
For the grocery store and school point data, I checked which grid each point is in and assign a value to that grid. For the census data, I turn each grid of fishnet into a point based on centroid, and check which tract each point belongs to and assign the corresponding value of census data.
vars_net1 <- rbind(school,grocery) %>%
st_join(fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
left_join(fishnet, ., by = "uniqueID") %>%
spread(Legend, count, fill=0) %>%
dplyr::select(-`<NA>`) %>%
ungroup()
fishnet_centroid <- st_centroid(fishnet)
vars_net2 <- fishnet_centroid%>%
st_join(rbind(pop,medhhinc,unemployed), join = st_within) %>%
pivot_wider(names_from = Legend, values_from = value)%>%
dplyr::select(uniqueID,pop.2017,medhhinc.2017,unemployed.2017,geometry)
Because the spatial range of Census Data and fishnet is slightly different, the values of census data in some grids are NA. I wrote a function to set the NA grids as their nearest non-NA value.
# Function to fill NA values with nearest non-NA
fill_na_with_nearest <- function(data, var_name) {
# Create a copy of the data with only non-NA rows for the specified variable
non_na_data <- data %>% filter(!is.na(!!sym(var_name)))
# Find the index of the nearest non-NA feature for each row with NA
nearest_indices <- st_nearest_feature(data, non_na_data)
# Replace NA values with the corresponding nearest non-NA value
data[[var_name]] <- ifelse(is.na(data[[var_name]]),
non_na_data[[var_name]][nearest_indices],
data[[var_name]])
return(data)
}
# Apply the function to each variable
vars_net2 <- vars_net2 %>%
fill_na_with_nearest("pop.2017") %>%
fill_na_with_nearest("medhhinc.2017") %>%
fill_na_with_nearest("unemployed.2017") %>%
st_drop_geometry()
vars_net <- merge(vars_net1,vars_net2,by = "uniqueID")
Here we calculate the mean distance of the nearest 3 grocery stores / schools of each grid.
## create NN from abandoned cars
vars_net <- vars_net %>%
mutate(grocery.nn = nn_function(st_coordinates(st_centroid(vars_net)),
st_coordinates(grocery),
k = 3))
vars_net <- vars_net %>%
mutate(school.nn = nn_function(st_coordinates(st_centroid(vars_net)),
st_coordinates(school),
k = 3))
## Visualize the NN feature
vars_net.long <-
dplyr::select(vars_net, -"uniqueID",-"grocery.2013",-"school.2017") %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size = 14)}
do.call(grid.arrange,c(mapList, ncol = 3, top = "risk Factors Fishnet"))
## important to drop the geometry from joining features
final_net <-
left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
Local Moran’s I is a statistical measure used to assess local spatial pattern. The null hypothesis is that robbery counts at specific locations are randomly distributed in relation to their neighboring areas. The analysis produces several key statistics, including the p-value and a designation for Significant Hotspots, which are grid cells exhibiting higher than expected robbery counts under random conditions (p-values ≤ 0.05). In this analysis, in cases where a low p-value is evident, indicating statistical significance, it signifies a hotspot for robbery-related incidents.
## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods...
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
# print(final_net.weights, zero.policy=TRUE)
## see ?localmoran
morans_robbery <- localmoran(final_net$countrobbery, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
# join local Moran's I results to fishnet
final_net.localMorans <-
cbind(morans_robbery, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(robbery_Count = countrobbery,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size = 14) + theme(legend.position="bottom")}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, ROBBERY"))
# generates warning from NN
final_net <- final_net %>%
mutate(robbery.isSig =
ifelse(morans_robbery[,5] <= 0.001, 1, 0)) %>%
mutate(robbery.isSig.dist =
nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(filter(final_net,
robbery.isSig == 1))),
k = 1))
ggplot() +
geom_sf(data = final_net, aes(fill=robbery.isSig.dist), colour=NA) +
scale_fill_viridis(name="NN Distance") +
labs(title="Robbery NN Distance") +
mapTheme()
Here I drew the scatter plots of robbery and all the risk factors. We can see that the correlation coeffecient (r) of independent variables school.nn and grocery.nn are at about -0.37, meaning that with the distance of groceries/schools increase, the robbery number decreases. This is contrary to our previous assumption about the school.
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -name, -District, -grocery.2013, -school.2017) %>%
gather(Variable, Value, -countrobbery) %>%
mutate(Value = as.numeric(Value))
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countrobbery, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countrobbery)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "orange") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Robbery count as a function of risk factors") +
plotTheme()
## `geom_smooth()` using formula = 'y ~ x'
After several attempts, I found that removing the population variable would bring the best balance between accuracy and generalizability.
# View(crossValidate)
## define the variables we want
reg.ss.vars <- c("grocery.nn",
"school.nn",
"robbery.isSig.dist",
#"pop.2017",
"medhhinc.2017",
"unemployed.2017"
)
## RUN REGRESSIONS
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countrobbery",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countrobbery, Prediction, geometry)
reg.summary <-
mutate(reg.ss.spatialCV,
Error = Prediction - countrobbery,
Regression = "Spatial LOGO-CV: Spatial Process") %>%
st_sf()
Here we calculate the Mean Absolute Error (MAE) of the prediction and actual data of robberies in 2018.Fro the histogram we can see that the bias are mainly within 3. I also calculated the mean MAE and the standard deviation of MAE, which are 2.69 and 2.44 respectively, reflecting an unsatisfactory accuracy. As for the generalizability, I calculated the mean error in both non-White majority and white majority area, and get -0.17 and 0.17 respectively, indicating that the model on average under-predicts in Majority_Non_White neighborhoods and over-predicts in Majority_White neighborhoods.It looks like this algorithm generalizes good with respect to race.
# calculate errors by NEIGHBORHOOD
error_by_reg_and_fold <-
reg.ss.spatialCV %>%
group_by(cvID) %>%
summarize(Mean_Error = mean(Prediction - countrobbery, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
error_by_reg_and_fold %>%
arrange(desc(MAE))
## Simple feature collection with 96 features and 4 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: 341164.6 ymin: 552874.5 xmax: 367164.6 ymax: 594874.5
## Projected CRS: NAD_1983_HARN_StatePlane_Illinois_East_FIPS_1201
## # A tibble: 96 × 5
## cvID Mean_Error MAE SD_MAE geometry
## <chr> <dbl> <dbl> <dbl> <GEOMETRY [m]>
## 1 Loop -21.6 21.6 21.6 POLYGON ((357664.6 579374.5, 357664.…
## 2 Rush & Division -19.4 19.4 19.4 POLYGON ((358664.6 580874.5, 358164.…
## 3 Printers Row -14.1 14.1 14.1 POLYGON ((358664.6 577374.5, 358164.…
## 4 Boystown -10.6 10.6 10.6 POLYGON ((356664.6 585874.5, 357164.…
## 5 Streeterville -7.01 7.01 7.01 MULTIPOLYGON (((358664.6 580374.5, 3…
## 6 Fuller Park -6.10 6.10 6.10 POLYGON ((358164.6 569374.5, 358164.…
## 7 Rogers Park -6.07 6.07 6.07 POLYGON ((354164.6 593374.5, 353664.…
## 8 River North -5.61 5.61 5.61 POLYGON ((357164.6 580874.5, 356664.…
## 9 Douglas -5.60 5.60 5.60 POLYGON ((358664.6 574374.5, 358664.…
## 10 Little Village 5.33 5.33 5.33 POLYGON ((349164.6 572874.5, 349164.…
## # ℹ 86 more rows
error_by_reg_and_fold %>%
arrange(MAE)
## Simple feature collection with 96 features and 4 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: 341164.6 ymin: 552874.5 xmax: 367164.6 ymax: 594874.5
## Projected CRS: NAD_1983_HARN_StatePlane_Illinois_East_FIPS_1201
## # A tibble: 96 × 5
## cvID Mean_Error MAE SD_MAE geometry
## <chr> <dbl> <dbl> <dbl> <GEOMETRY [m]>
## 1 Jefferson Park 0.0226 0.0226 0.0226 POLYGON ((345164.6 590374.5, …
## 2 Edison Park -0.0333 0.0333 0.0333 POLYGON ((342664.6 592874.5, …
## 3 Hegewisch -0.0358 0.0358 0.0358 POLYGON ((363664.6 552874.5, …
## 4 North Park 0.106 0.106 0.106 POLYGON ((349164.6 590374.5, …
## 5 Garfield Ridge -0.121 0.121 0.121 POLYGON ((345164.6 568374.5, …
## 6 Bridgeport 0.123 0.123 0.123 POLYGON ((355664.6 573874.5, …
## 7 Irving Park 0.128 0.128 0.128 POLYGON ((349164.6 586874.5, …
## 8 Norwood Park 0.143 0.143 0.143 MULTIPOLYGON (((341164.6 5898…
## 9 Hyde Park 0.182 0.182 0.182 POLYGON ((360664.6 569374.5, …
## 10 Sauganash,Forest Glen 0.196 0.196 0.196 POLYGON ((347664.6 589874.5, …
## # ℹ 86 more rows
## plot histogram of OOF (out of fold) errors
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
scale_x_continuous(breaks = seq(0, 11, by = 1)) +
labs(title="Distribution of MAE", subtitle = "LOGO-CV",
x="Mean Absolute Error", y="Count")
error_with_neighborhood <- st_join(error_by_reg_and_fold, neighborhoods)
mae_by_neighborhood <- error_with_neighborhood %>%
group_by(name) %>% # Replace with the actual neighborhood column name
summarize(MAE = mean(MAE, na.rm = TRUE))
grid.arrange(ncol=2,
ggplot()+
geom_sf(data=error_by_reg_and_fold,aes(fill=MAE),color = NA)+
scale_fill_viridis()+
mapTheme()+
labs(title = "MAE by Fishnet",
subtitle = "Spatial distribution of MAE values"),
ggplot() +
geom_sf(data = mae_by_neighborhood, aes(fill = MAE), color = NA) + # `color = NA` removes borders between neighborhoods
scale_fill_viridis(option = "plasma", name = "MAE") + # Use a color scale (e.g., "plasma" or "viridis")
mapTheme() +
labs(title = "MAE by Neighborhood",
subtitle = "Spatial distribution of MAE values"))
mean_errors <- error_by_reg_and_fold %>%
st_drop_geometry()%>%
summarise(regression = "Spatial LOGO-CV: Spatial Process",
Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2))
mean_errors %>%
kable(col.names = c("Regression", "Mean MAE", "SD_MAE"),
caption = "Mean MAE and SD MAE",
align = "c") %>%
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed"),
position = "center") %>%
row_spec(0, bold = TRUE, background = "#f7f7f7") %>% # Header row styling
column_spec(1, width = "16em") %>% # Adjust column widths
column_spec(2, width = "8em")%>%
column_spec(3, width = "8em")
Regression | Mean MAE | SD_MAE |
---|---|---|
Spatial LOGO-CV: Spatial Process | 2.69 | 3.44 |
tracts18 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2018, state=17, county=031, geometry=T) %>%
st_transform('ESRI:102271') %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
.[neighborhoods,]
# Mean errors table
racialerror <- reg.summary %>%
st_centroid() %>%
st_join(tracts18) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Figure 14: Mean error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
racialerror
To compare the traditional Kernel Density method and Risk Predictions that I built, I summarized the KDE values and my prediction into a fishnet, and then break them into five classes. The map below is generated of the risk categories for both model types with a sample of robbery in 2018 points overlaid. The bar chart below shows that the two models have very similar predictions at the highest two risk levels, while prediction model estimates a lower share of the lowest risk level than KDE. This result indicates that no obvious advantage is shown than the traditional KDE method.
# demo of kernel width
robbery_ppp <- as.ppp(st_coordinates(robbery), W = st_bbox(final_net))
robbery_KD.1000 <- spatstat.explore::density.ppp(robbery_ppp, 1000)
robbery_KD.1500 <- spatstat.explore::density.ppp(robbery_ppp, 1500)
robbery_KD.2000 <- spatstat.explore::density.ppp(robbery_ppp, 2000)
robbery_KD.df <- rbind(
mutate(data.frame(rasterToPoints(mask(raster(robbery_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(robbery_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(robbery_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft."))
robbery_KD.df$Legend <- factor(robbery_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
ggplot(data=robbery_KD.df, aes(x=x, y=y)) +
geom_raster(aes(fill=layer)) +
facet_wrap(~Legend) +
coord_sf(crs=st_crs(final_net)) +
scale_fill_viridis(name="Density") +
labs(title = "Kernel density with 3 different search radii") +
mapTheme(title_size = 14)
as.data.frame(robbery_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
geom_sf(data = sample_n(robbery, 1500), size = .5) +
scale_fill_viridis(name = "Density") +
labs(title = "Kernel density of 2017 robbery") +
mapTheme(title_size = 14)
robbery18 <-
read.csv("Crimes_-_2018_20241017.csv") %>%
filter(Primary.Type == "ROBBERY") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct() %>%
.[fishnet,]
robbery_KDE_sum <- as.data.frame(robbery_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean)
kde_breaks <- classIntervals(robbery_KDE_sum$value,
n = 5, "fisher")
robbery_KDE_sf <- robbery_KDE_sum %>%
mutate(label = "Kernel Density",
Risk_Category = classInt::findCols(kde_breaks),
Risk_Category = case_when(
Risk_Category == 5 ~ "5th",
Risk_Category == 4 ~ "4th",
Risk_Category == 3 ~ "3rd",
Risk_Category == 2 ~ "2nd",
Risk_Category == 1 ~ "1st")) %>%
cbind(
aggregate(
dplyr::select(robbery18) %>% mutate(robberyCount = 1), ., sum) %>%
mutate(robberyCount = replace_na(robberyCount, 0))) %>%
dplyr::select(label, Risk_Category, robberyCount)
ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction,
n = 5, "fisher")
robbery_risk_sf <-
reg.ss.spatialCV %>%
mutate(label = "Risk Predictions",
Risk_Category =classInt::findCols(ml_breaks),
Risk_Category = case_when(
Risk_Category == 5 ~ "5th",
Risk_Category == 4 ~ "4th",
Risk_Category == 3 ~ "3rd",
Risk_Category == 2 ~ "2nd",
Risk_Category == 1 ~ "1st")) %>%
cbind(
aggregate(
dplyr::select(robbery18) %>% mutate(robberyCount = 1), ., sum) %>%
mutate(robberyCount = replace_na(robberyCount, 0))) %>%
dplyr::select(label,Risk_Category, robberyCount)
rbind(robbery_KDE_sf, robbery_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(robbery18, 3000), size = .2, colour = "black",alpha = 0.4) +
facet_wrap(~label, ) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2017 robbery risk predictions; 2018 robbery") +
mapTheme(title_size = 14)
rbind(robbery_KDE_sf, robbery_risk_sf) %>%
st_drop_geometry() %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countrobbery = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Pcnt_of_test_set_crimes = countrobbery / sum(countrobbery)) %>%
ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE, name = "Model") +
labs(title = "Risk prediction vs. Kernel density, 2018 robbery",
y = "% of Test Set robbery (per model)",
x = "Risk Category") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
This model has been effective in identifying robbery hot spots within Chicago, and has shown that its utility spans across different racial groups. However, it performs comparably to the original algorithm used by the police in high-risk categories, which doesn’t show a obvious advantage. Additionally, the average MAE of 2.69 also reflects an unsatisfactory accuracy.
As a result, I would not recommend using this model for predicting robberies in Chicago’s policing strategies. With technological advancements, future opportunities may arise for gathering more precise and detailed features, potentially improving such models’ effectiveness. While there is room for refining this geospatial prediction model, it is equally important for police departments in Chicago and elsewhere to address selection bias issues in their approaches.