The Emil City Department of Housing and Community Development (HCD) has long struggled with low participation in its home repair tax credit program, largely due to a random outreach approach that may not effectively reach homeowners likely to benefit. To address this, we analyze past client-level data and develop a predictive model for targeted outreach. This analysis will use logistic regression to classify homeowners by their likelihood of utilizing the tax credit, aiming to focus outreach resources more effectively. We also use a cost-benefit analysis to help HCD quantify the estimated revenue and optimize its engagement strategy.
rm(list=ls())
options(scipen=10000000)
library(tidyverse)
library(kableExtra)
library(caret)
library(knitr)
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(scales)
library(rstatix)
library(ggpubr)
library(crosstable)
library(ggcorrplot)
library(rsample)
library(gridExtra)
palette5 <- c("darkblue", "lightblue", "yellow", "lightcoral", "darkred")
palette4 <- c("lightblue","orange","lightcoral","yellow")
palette3 <- c("darkblue", "yellow", "darkred")
palette2 <- c("lightblue","lightcoral")
# Functions and data directory
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
# load dataset
housing <- read.csv("housingSubsidy.csv")
A correlation plot is used to evaluate relationships and multicollinearity between the numeric variables. The plot below indicates high correlations (absolute r-values of 0.8 or greater) among several predictors, involving the unemployment rate and the inflation rate in particular. Therefore, Unemployment Rate and Inflation Rate are removed to improve the final model's performance.
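To make the multicollinearity screen explicit, the pairs above the cutoff can be listed directly. This is a minimal sketch using only objects already loaded (the housing data and the tidyverse); it is not part of the original output.
# List predictor pairs with an absolute Pearson correlation of 0.8 or greater
num_vars <- housing %>% select(where(is.numeric)) %>% select(-X)
cor_mat <- cor(num_vars)
high_pairs <- which(abs(cor_mat) >= 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high_pairs[, 1]],
           var2 = colnames(cor_mat)[high_pairs[, 2]],
           r = round(cor_mat[high_pairs], 2))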
# correlation matrix
corr <- round(cor(housing %>% select(where(is.numeric))%>%select(-X)), 1)
p.mat <- cor_pmat(housing %>% select(where(is.numeric))%>%select(-X))
ggcorrplot(corr, p.mat = p.mat, hc.order = TRUE,
type = "lower", insig = "blank",
ggtheme = ggplot2::theme_gray,
colors = palette5, title = "Feature correlations", lab = TRUE)
Here we use t-tests to examine the relationship between the numeric variables and the dependent variable. Indicators with a p-value less than 0.05 are considered meaningful predictors.
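As a minimal illustration of what the pipeline below does for every numeric variable at once, a single predictor can be tested directly with base R:
# Example: t-test of age by credit uptake (the pipeline below applies
# rstatix::t_test to all numeric variables simultaneously)
t.test(age ~ y, data = housing)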
ttest <- housing %>%
select(-X)%>%
pivot_longer(cols = where(is.numeric), names_to = "variable", values_to = "value") %>%
filter(variable != "y_numeric") %>%
group_by(variable) %>%
rstatix::t_test(value ~ y) %>%
adjust_pvalue(method = "BH") %>%
add_significance() %>%
select(variable,
p,
p.adj,
p.adj.signif)
ttest %>%
kbl() %>%
kable_minimal()
variable | p | p.adj | p.adj.signif |
---|---|---|---|
age | 0.0021600 | 0.0024300 | ** |
campaign | 0.0000000 | 0.0000000 | **** |
cons.conf.idx | 0.0057700 | 0.0057700 | ** |
cons.price.idx | 0.0000001 | 0.0000001 | **** |
inflation_rate | 0.0000000 | 0.0000000 | **** |
pdays | 0.0000000 | 0.0000000 | **** |
previous | 0.0000000 | 0.0000000 | **** |
spent_on_repairs | 0.0000000 | 0.0000000 | **** |
unemploy_rate | 0.0000000 | 0.0000000 | **** |
The chi-square test is typically used to examine the relationship between categorical independent variables and the dependent variable. Based on these tests, we will drop the three predictors with p-values greater than 0.05 ("mortgage", "taxbill_in_phl", and "day_of_week") during feature engineering.
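For a single predictor, the equivalent standalone test is a sketch like the following; the crosstable() call below runs the same test for every categorical variable.
# Example: chi-square test of mortgage status against credit uptake
chisq.test(table(housing$mortgage, housing$y))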
cat_vars <- colnames(housing %>% select(where(is.character)))
crosstable(housing,
cat_vars,
by=y,
total="both",
percent_pattern="{n} ({p_row}/{p_col})",
percent_digits=0,
test = TRUE) %>%
as_flextable()
label | variable | no | yes | Total | test |
---|---|---|---|---|---|
job | admin. | 879 (87%/24%) | 133 (13%/29%) | 1012 (25%) | p value: <0.0001 |
blue-collar | 823 (93%/22%) | 61 (7%/14%) | 884 (21%) | ||
entrepreneur | 140 (95%/4%) | 8 (5%/2%) | 148 (4%) | ||
housemaid | 99 (90%/3%) | 11 (10%/2%) | 110 (3%) | ||
management | 294 (91%/8%) | 30 (9%/7%) | 324 (8%) | ||
retired | 128 (77%/3%) | 38 (23%/8%) | 166 (4%) | ||
self-employed | 146 (92%/4%) | 13 (8%/3%) | 159 (4%) | ||
services | 358 (91%/10%) | 35 (9%/8%) | 393 (10%) | ||
student | 63 (77%/2%) | 19 (23%/4%) | 82 (2%) | ||
technician | 611 (88%/17%) | 80 (12%/18%) | 691 (17%) | ||
unemployed | 92 (83%/3%) | 19 (17%/4%) | 111 (3%) | ||
unknown | 35 (90%/1%) | 4 (10%/1%) | 39 (1%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) | ||
marital | divorced | 403 (90%/11%) | 43 (10%/10%) | 446 (11%) | p value: 0.0165 |
married | 2257 (90%/62%) | 252 (10%/56%) | 2509 (61%) | ||
single | 998 (87%/27%) | 155 (13%/34%) | 1153 (28%) | ||
unknown | 10 (91%/0.3%) | 1 (9%/0.2%) | 11 (0.3%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) | ||
education | basic.4y | 391 (91%/11%) | 38 (9%/8%) | 429 (10%) | p value: 0.0014 |
basic.6y | 211 (93%/6%) | 17 (7%/4%) | 228 (6%) | ||
basic.9y | 531 (93%/14%) | 43 (7%/10%) | 574 (14%) | ||
high.school | 824 (89%/22%) | 97 (11%/22%) | 921 (22%) | ||
illiterate | 1 (100%/0.03%) | 0 (0%/0%) | 1 (0.02%) | ||
professional.course | 470 (88%/13%) | 65 (12%/14%) | 535 (13%) | ||
university.degree | 1099 (87%/30%) | 165 (13%/37%) | 1264 (31%) | ||
unknown | 141 (84%/4%) | 26 (16%/6%) | 167 (4%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) | ||
taxLien | no | 2913 (88%/79%) | 402 (12%/89%) | 3315 (80%) | p value: <0.0001 |
unknown | 754 (94%/21%) | 49 (6%/11%) | 803 (19%) | ||
yes | 1 (100%/0.03%) | 0 (0%/0%) | 1 (0.02%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) | ||
mortgage | no | 1637 (89%/45%) | 202 (11%/45%) | 1839 (45%) | p value: 0.7307 |
unknown | 96 (91%/3%) | 9 (9%/2%) | 105 (3%) | ||
yes | 1935 (89%/53%) | 240 (11%/53%) | 2175 (53%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) | ||
taxbill_in_phl | no | 693 (90%/19%) | 77 (10%/17%) | 770 (19%) | p value: 0.3495 |
yes | 2975 (89%/81%) | 374 (11%/83%) | 3349 (81%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) | ||
contact | cellular | 2277 (86%/62%) | 375 (14%/83%) | 2652 (64%) | p value: <0.0001 |
telephone | 1391 (95%/38%) | 76 (5%/17%) | 1467 (36%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) | ||
month | apr | 179 (83%/5%) | 36 (17%/8%) | 215 (5%) | p value: <0.0001 |
aug | 572 (90%/16%) | 64 (10%/14%) | 636 (15%) | ||
dec | 10 (45%/0.3%) | 12 (55%/3%) | 22 (1%) | ||
jul | 652 (92%/18%) | 59 (8%/13%) | 711 (17%) | ||
jun | 462 (87%/13%) | 68 (13%/15%) | 530 (13%) | ||
mar | 20 (42%/1%) | 28 (58%/6%) | 48 (1%) | ||
may | 1288 (93%/35%) | 90 (7%/20%) | 1378 (33%) | ||
nov | 403 (90%/11%) | 43 (10%/10%) | 446 (11%) | ||
oct | 44 (64%/1%) | 25 (36%/6%) | 69 (2%) | ||
sep | 38 (59%/1%) | 26 (41%/6%) | 64 (2%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) | ||
day_of_week | fri | 685 (89%/19%) | 83 (11%/18%) | 768 (19%) | p value: 0.9723 |
mon | 757 (89%/21%) | 98 (11%/22%) | 855 (21%) | ||
thu | 764 (89%/21%) | 96 (11%/21%) | 860 (21%) | ||
tue | 750 (89%/20%) | 91 (11%/20%) | 841 (20%) | ||
wed | 712 (90%/19%) | 83 (10%/18%) | 795 (19%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) | ||
poutcome | failure | 387 (85%/11%) | 67 (15%/15%) | 454 (11%) | p value: <0.0001 |
nonexistent | 3231 (92%/88%) | 292 (8%/65%) | 3523 (86%) | ||
success | 50 (35%/1%) | 92 (65%/20%) | 142 (3%) | ||
Total | 3668 (89%) | 451 (11%) | 4119 (100%) |
Key Findings:
The “Campaign” feature indicates the frequency of contacts made during the tax credit program. Analysis reveals that individuals who received more contacts throughout the program are less likely to utilize the tax credit.
The “Previous” feature measures the number of contacts made with individuals before the tax credit program began. Unlike the first finding, those who were contacted more often prior to the program are more likely to take advantage of the tax credit than those with fewer prior interactions. This suggests that public awareness leading up to the program may significantly influence participation.
The inflation rate feature shows that homeowners appear less likely to take the tax credit when the inflation rate is high.
Finally, the data show that acceptance of the tax credit tends to be higher during periods of low unemployment (when the unemployment rate is negative). This implies that tax credits are more appealing in a strong economy, as individuals may be more willing to seek external resources for home repairs.
# Continuous Outcomes Visualization
housing %>%
dplyr::select(y, where(is.numeric), -X, -y_numeric) %>%
gather(Variable, value, -y) %>%
filter(!(Variable == "pdays" & value == 999)) %>%
ggplot(aes(y, value, fill=y)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean") +
facet_wrap(~Variable, scales = "free", ncol=3, labeller = labeller(Variable = c(
`age` = "Age",
`previous` = "Contact before Campaign",
`unemploy_rate` = "Unemployment Rate",
`cons.price.idx` = "Consumer Price Index",
`cons.conf.idx` = "Consumer Confidence Index",
`campaign` = "Contacts for Campaign",
`inflation_rate` = "Inflation Rate",
`spent_on_repairs` = "Amount Spent on Repairs"))) +
scale_fill_manual(values = palette2) +
labs(x="Used Credit", y="Value",
title = "Feature associations with the likelihood of taking tax credit",
subtitle = "(Continuous Outcomes, average value)")+
theme(legend.position = "none")
To gain a deeper insight into the distribution characteristics of these continuous variables and to facilitate comparisons among numeric features with varying scales, a density plot has been generated. This visualization uncovers surprising differences in the distributions of the Consumer Confidence Index, Consumer Price Index, and Amount Spent on Repairs between homeowners who accepted the tax credit and those who did not.
hmm2 <- housing %>%
pivot_longer(cols = where(is.numeric), names_to = "variable", values_to = "value") %>%
filter(variable != "y_numeric", variable != "X")%>%
filter(!(variable == "pdays" & value == 999))
ggplot(hmm2) +
geom_density(aes(x = value, fill = y), alpha = 0.5) +
facet_wrap(~variable, scales = "free") +
scale_fill_manual(values = palette2) +
labs(x="Output Variables", y="Density",
title = "Feature associations with the likelihood of entering a program",
subtitle = "(Continuous outcomes)") +
theme_minimal()+
theme(axis.text.x = element_text(hjust = 1, angle = 45))
The density plots show that homeowners who were contacted before the campaign are less likely to refuse the credit, while those with higher education or who are reached by cellular phone are more likely to take it.
# Categorical Outcomes Visualization
housing %>%
select(where(is.character))%>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free", ncol=4,
labeller= labeller(Variable = c(
`contact` = "Means of Contact",
`day_of_week` = "Day of Week",
`education` = "Educational Attainment",
`pdays` = "Days Elapsed After Contact",
`job` = "Job",
`marital` = "Marital Status",
`month` = "Month of Last Contact",
`mortgage` = "Mortgage Status",
`poutcome` = "Previous Campaign Outcome",
`taxbill_in_phl` = "Residing in Philadelphia",
`taxLien` = "Tax Liens"))) +
scale_fill_manual(values = palette2) +
labs(x="Took Credit", y="Count",
title = "Feature associations with the likelihood of taking tax credit",
subtitle = "Multiple category features") +
theme(axis.text.x = element_text(angle=45, hjust=1))
We found that the "taxLien" feature in the original housing data has only one record of "yes", which would cause errors in the regression analysis. We therefore combined "unknown" and "yes" into "yes or unknown".
To improve the model’s predictive accuracy, we created new fields to capture homeowner characteristics in greater detail:
Quarter: Months are grouped into yearly quarters, namely Quarter 1, Quarter 2, Quarter 3, and Quarter 4.
Education Status: Educational levels are consolidated into basic ("basic.4y", "basic.6y", "basic.9y", and "illiterate"), secondary ("high.school" and "professional.course"), and higher ("university.degree") education.
Last Contact: The time since a homeowner was last approached is transformed from days to weeks.
Employment: Jobs are grouped into unemployed, blue collar/services, white collar, and self-employed.
Age Group: Age is categorized into four levels: Under 25, 26-45, 45-70, and Above 70.
Marriage: Marital status is simplified to married and other.
These adjustments aim to provide more comprehensive and informative data for the predictive model.
housing <-
housing %>%
mutate(taxLien = case_when(
taxLien == "unknown" | taxLien == "yes" ~ "yes or unknown",
TRUE ~ "no"))
housing_new <- housing%>%
select(-unemploy_rate, -inflation_rate, - mortgage, -taxbill_in_phl, -day_of_week, -X)%>%
# Quarter
mutate(month_special = case_when(
month == "jan" | month == "feb"| month == "mar" ~ "Quarter 1",
month == "apr" |month == "may" | month == "jun" ~ "Quarter 2",
month == "jul" |month == "aug" | month == "sep" ~ "Quarter 3",
month == "oct" | month == "nov"| month == "dec" ~ "Quarter 4"))%>%
# Education
mutate(EducationStatus = case_when(
education %in% c("basic.9y", "basic.6y", "basic.4y", "illiterate") ~ "basic education",
education %in% c("high.school", "professional.course") ~ "secondary education",
education == "university.degree" ~ "higher education",
education == "unknown" ~ "unknown"))%>%
# Prior Contact
mutate(LastContact = case_when(pdays == 999 ~ "No Contact",
pdays <= 7 ~ "1 Week",
pdays > 7 & pdays <= 14 ~ "2 Weeks",
pdays > 14 & pdays <= 21 ~ "3 Weeks"))%>%
# Employment
mutate(Employment = case_when(job %in% c("student", "unemployed", "retired") ~ "unemployed",
job %in% c("blue-collar", "services","housemaid") ~ "bluecollar_services",
job %in% c("admin.", "management","technician","entrepreneur") ~ "whitecollar",
job == "self-employed" ~ "self-employed",
job == "unknown" ~ "unknown"))%>%
# Age Groups
mutate(AgeGroup = case_when(
age <= 25 ~ "Under 25",
age >= 26 & age < 45 ~ "26-45",
age >= 45 & age <= 70 ~ "45-70",
TRUE ~ "Above 70"))%>%
# marital
mutate(Marriage = case_when(marital =="married" ~ "married", TRUE ~ "other"))%>%
select(-month, -education, -pdays, -previous, -age, -job, -marital) %>%
na.omit()
To assess predictive capacity, both the base categories and the new engineered categories are tested using chi-square tests. It is observed that all engineered variables are likely to be significant predictors (p < 0.05).
cat_vars <- colnames(housing_new %>% select(where(is.character)))
crosstable(housing_new,
cat_vars,
by=y,
total="both",
percent_pattern="{n} ({p_row}/{p_col})",
percent_digits=0,
test = TRUE) %>%
as_flextable()
label | variable | no | yes | Total | test |
---|---|---|---|---|---|
taxLien | no | 2380 (88%/78%) | 329 (12%/89%) | 2709 (79%) | p value: <0.0001 |
yes or unknown | 677 (94%/22%) | 42 (6%/11%) | 719 (21%) | ||
Total | 3057 (89%) | 371 (11%) | 3428 (100%) | ||
contact | cellular | 1885 (86%/62%) | 306 (14%/82%) | 2191 (64%) | p value: <0.0001 |
telephone | 1172 (95%/38%) | 65 (5%/18%) | 1237 (36%) | ||
Total | 3057 (89%) | 371 (11%) | 3428 (100%) | ||
poutcome | failure | 333 (86%/11%) | 52 (14%/14%) | 385 (11%) | p value: <0.0001 |
nonexistent | 2680 (92%/88%) | 241 (8%/65%) | 2921 (85%) | ||
success | 44 (36%/1%) | 78 (64%/21%) | 122 (4%) | ||
Total | 3057 (89%) | 371 (11%) | 3428 (100%) | ||
month_special | Quarter 1 | 12 (33%/0.4%) | 24 (67%/6%) | 36 (1%) | p value: <0.0001 |
Quarter 2 | 1651 (91%/54%) | 164 (9%/44%) | 1815 (53%) | ||
Quarter 3 | 999 (89%/33%) | 120 (11%/32%) | 1119 (33%) | ||
Quarter 4 | 395 (86%/13%) | 63 (14%/17%) | 458 (13%) | ||
Total | 3057 (89%) | 371 (11%) | 3428 (100%) | ||
EducationStatus | basic education | 1084 (92%/35%) | 97 (8%/26%) | 1181 (34%) | p value: 0.0005 |
higher education | 933 (87%/31%) | 135 (13%/36%) | 1068 (31%) | ||
secondary education | 917 (89%/30%) | 114 (11%/31%) | 1031 (30%) | ||
unknown | 123 (83%/4%) | 25 (17%/7%) | 148 (4%) | ||
Total | 3057 (89%) | 371 (11%) | 3428 (100%) | ||
LastContact | 1 Week | 40 (35%/1%) | 74 (65%/20%) | 114 (3%) | p value: <0.0001 |
2 Weeks | 8 (50%/0.3%) | 8 (50%/2%) | 16 (0.5%) | ||
3 Weeks | 4 (67%/0.1%) | 2 (33%/1%) | 6 (0.2%) | ||
No Contact | 3005 (91%/98%) | 287 (9%/77%) | 3292 (96%) | ||
Total | 3057 (89%) | 371 (11%) | 3428 (100%) | ||
Employment | bluecollar_services | 1280 (92%/42%) | 107 (8%/29%) | 1387 (40%) | p value: <0.0001 |
self-employed | 146 (92%/5%) | 13 (8%/4%) | 159 (5%) | ||
unemployed | 283 (79%/9%) | 76 (21%/20%) | 359 (10%) | ||
unknown | 35 (90%/1%) | 4 (10%/1%) | 39 (1%) | ||
whitecollar | 1313 (88%/43%) | 171 (12%/46%) | 1484 (43%) | ||
Total | 3057 (89%) | 371 (11%) | 3428 (100%) | ||
AgeGroup | 26-45 | 1942 (90%/64%) | 206 (10%/56%) | 2148 (63%) | p value: <0.0001 |
45-70 | 975 (88%/32%) | 130 (12%/35%) | 1105 (32%) | ||
Above 70 | 18 (49%/1%) | 19 (51%/5%) | 37 (1%) | ||
Under 25 | 122 (88%/4%) | 16 (12%/4%) | 138 (4%) | ||
Total | 3057 (89%) | 371 (11%) | 3428 (100%) | ||
Marriage | married | 1914 (90%/63%) | 211 (10%/57%) | 2125 (62%) | p value: 0.0316 |
other | 1143 (88%/37%) | 160 (12%/43%) | 1303 (38%) | ||
Total | 3057 (89%) | 371 (11%) | 3428 (100%) |
Two regression analyses are performed: one using all original features without engineered variables, termed the kitchen sink model, and another using a combination of selected original features and engineered variables, referred to as the engineered model. The kitchen sink model provides a baseline for assessing current performance and predictive strength, while the engineered model aims to enhance predictive accuracy.
For both models, data is divided into training and testing sets with a 65:35 ratio. The training set enables the model to learn the relationships between input features and the target variable, while the testing set allows us to evaluate how well it might generalize to new data.
# Split Data - Kitchen Sink - Base Model
set.seed(3)
kitchenSink_base <- housing %>%
select(-X)%>%
initial_split(prop = 0.65, strata = y)
kitchenSink_Train <- training(kitchenSink_base)
kitchenSink_Test <- testing(kitchenSink_base)
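As a quick sanity check (not part of the original output), the stratified split should preserve the roughly 11% uptake rate in both sets:
# Compare outcome balance across the training and testing sets
prop.table(table(kitchenSink_Train$y))
prop.table(table(kitchenSink_Test$y))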
# Regression - Kitchen Sink - Base Model
kitchenSink_reg_base <- glm(y_numeric ~ ., data = kitchenSink_Train%>%select(-y), family = binomial(link = "logit"))
print(summary(kitchenSink_reg_base))
##
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"),
## data = kitchenSink_Train %>% select(-y))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -224.6270224 130.8520063 -1.717 0.08604 .
## age 0.0200011 0.0083695 2.390 0.01686 *
## jobblue-collar -0.2323833 0.2791547 -0.832 0.40515
## jobentrepreneur -0.9565802 0.5513127 -1.735 0.08272 .
## jobhousemaid -0.0950196 0.4803501 -0.198 0.84319
## jobmanagement -0.7209197 0.3155988 -2.284 0.02235 *
## jobretired -0.5449148 0.3858183 -1.412 0.15784
## jobself-employed -0.7209029 0.4168511 -1.729 0.08374 .
## jobservices -0.6363281 0.3220334 -1.976 0.04816 *
## jobstudent 0.2712142 0.4517188 0.600 0.54824
## jobtechnician -0.0271477 0.2360705 -0.115 0.90845
## jobunemployed 0.1942047 0.3869979 0.502 0.61579
## jobunknown -0.0635066 0.8061861 -0.079 0.93721
## maritalmarried 0.0614137 0.2429193 0.253 0.80041
## maritalsingle 0.0449526 0.2826534 0.159 0.87364
## maritalunknown 0.4736009 1.2541293 0.378 0.70570
## educationbasic.6y 0.4742905 0.4096971 1.158 0.24700
## educationbasic.9y 0.3121802 0.3509960 0.889 0.37378
## educationhigh.school 0.3750402 0.3354481 1.118 0.26356
## educationilliterate -10.6803086 324.7442168 -0.033 0.97376
## educationprofessional.course 0.5818050 0.3563273 1.633 0.10251
## educationuniversity.degree 0.4814763 0.3351456 1.437 0.15083
## educationunknown 0.1981585 0.4483817 0.442 0.65853
## taxLienyes or unknown 0.0356006 0.2139590 0.166 0.86785
## mortgageunknown -0.7374280 0.6023976 -1.224 0.22089
## mortgageyes -0.1502254 0.1448624 -1.037 0.29973
## taxbill_in_phlyes 0.0233118 0.1905802 0.122 0.90265
## contacttelephone -0.9399485 0.2885220 -3.258 0.00112 **
## monthaug -0.2730262 0.4514694 -0.605 0.54534
## monthdec 0.9001675 0.6655152 1.353 0.17619
## monthjul -0.0956177 0.3622389 -0.264 0.79181
## monthjun -0.0497620 0.4630236 -0.107 0.91441
## monthmar 1.4510575 0.5920736 2.451 0.01425 *
## monthmay -0.3866090 0.3026045 -1.278 0.20139
## monthnov -0.3004088 0.4369576 -0.688 0.49177
## monthoct 0.1091957 0.5530223 0.197 0.84347
## monthsep -0.1528016 0.6354501 -0.240 0.80997
## day_of_weekmon -0.1484093 0.2235659 -0.664 0.50680
## day_of_weekthu 0.0323693 0.2204016 0.147 0.88324
## day_of_weektue -0.0082397 0.2300987 -0.036 0.97143
## day_of_weekwed 0.0551076 0.2310660 0.238 0.81150
## campaign -0.0704786 0.0422581 -1.668 0.09535 .
## pdays -0.0001002 0.0007036 -0.142 0.88676
## previous 0.3488757 0.2197628 1.588 0.11240
## poutcomenonexistent 0.8369304 0.3546299 2.360 0.01827 *
## poutcomesuccess 1.3042224 0.7053637 1.849 0.06446 .
## unemploy_rate -0.3847595 0.5099880 -0.754 0.45058
## cons.price.idx 1.7026570 0.8699614 1.957 0.05033 .
## cons.conf.idx 0.0883545 0.0291188 3.034 0.00241 **
## inflation_rate -0.9451996 0.4602678 -2.054 0.04002 *
## spent_on_repairs 0.0133261 0.0105139 1.267 0.20498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1849.1 on 2676 degrees of freedom
## Residual deviance: 1451.1 on 2626 degrees of freedom
## AIC: 1553.1
##
## Number of Fisher Scoring iterations: 11
McFadden's pseudo R-squared values between 0.2 and 0.4 are considered indicative of a good model. Here, the kitchen sink model has a McFadden value of 0.215, indicating that it is a good model.
pR2(kitchenSink_reg_base)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -725.5675846 -924.5440389 397.9529085 0.2152158 0.1381347 0.2769392
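As a check, McFadden's value can be reproduced from the log-likelihoods reported above, since it is defined as 1 minus the ratio of the fitted to the null log-likelihood:
# McFadden = 1 - llh / llhNull, using the values reported by pR2()
1 - (-725.5675846 / -924.5440389)
# [1] 0.2152158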
# Confusion Matrix - Kitchen Sink - Base Model
kitchenSink_Test_Prob <- data.frame(outcome = as.factor(kitchenSink_Test$y_numeric),
probs = predict(kitchenSink_reg_base, kitchenSink_Test, type = "response"))%>%
mutate(pred_outcome = as.factor(ifelse(probs > 0.5 , 1, 0)))
print(caret::confusionMatrix(kitchenSink_Test_Prob$pred_outcome, kitchenSink_Test_Prob$outcome,
positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1267 126
## 1 17 32
##
## Accuracy : 0.9008
## 95% CI : (0.8842, 0.9158)
## No Information Rate : 0.8904
## P-Value [Acc > NIR] : 0.1097
##
## Kappa : 0.2714
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.20253
## Specificity : 0.98676
## Pos Pred Value : 0.65306
## Neg Pred Value : 0.90955
## Prevalence : 0.10957
## Detection Rate : 0.02219
## Detection Prevalence : 0.03398
## Balanced Accuracy : 0.59465
##
## 'Positive' Class : 1
##
# ROC - Kitchen Sink - Base Model
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit_base <- train(y ~ .,
data= kitchenSink_Train %>% dplyr::select(-y_numeric),
method="glm",
family="binomial",
metric="ROC",
trControl = ctrl)
cvFit_base
## Generalized Linear Model
##
## 2677 samples
## 19 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 2650, 2651, 2650, 2650, 2651, 2650, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7563496 0.9823732 0.195
The AUC of the kitchen sink model on the test set is 0.78, indicating that this is an acceptable model.
# AUC - kitchen sink model
auc(kitchenSink_Test_Prob$outcome, kitchenSink_Test_Prob$probs)
## Area under the curve: 0.7827
#Split Data - Engineered Model
set.seed(3)
housing_new_base <- initial_split(housing_new, prop = 0.65, strata = y)
housing_new_Train <- training(housing_new_base)
housing_new_Test <- testing(housing_new_base)
# Regression - Engineered Model
housing_new_reg_base <- glm(y_numeric ~ ., data = housing_new_Train%>%select(-y), family = binomial(link = "logit"))
print(summary(housing_new_reg_base))
##
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"),
## data = housing_new_Train %>% select(-y))
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 37.251364 15.498960 2.403
## taxLienyes or unknown -0.148304 0.247401 -0.599
## contacttelephone -0.726832 0.285273 -2.548
## campaign -0.074630 0.045556 -1.638
## poutcomenonexistent 0.373449 0.239062 1.562
## poutcomesuccess 0.859562 0.989886 0.868
## cons.price.idx 0.119465 0.162044 0.737
## cons.conf.idx 0.036218 0.019400 1.867
## spent_on_repairs -0.009059 0.001144 -7.918
## month_specialQuarter 2 -1.629941 0.541217 -3.012
## month_specialQuarter 3 -1.496015 0.564341 -2.651
## month_specialQuarter 4 -1.805193 0.568178 -3.177
## EducationStatushigher education 0.284656 0.252719 1.126
## EducationStatussecondary education 0.098664 0.222595 0.443
## EducationStatusunknown 0.199158 0.394143 0.505
## LastContact2 Weeks -0.400895 0.737170 -0.544
## LastContact3 Weeks -1.224408 1.301543 -0.941
## LastContactNo Contact -1.240500 1.030104 -1.204
## Employmentself-employed -0.265903 0.448200 -0.593
## Employmentunemployed 0.189811 0.273004 0.695
## Employmentunknown 0.112469 0.680309 0.165
## Employmentwhitecollar 0.093004 0.225617 0.412
## AgeGroup45-70 0.129911 0.177598 0.731
## AgeGroupAbove 70 0.416523 0.477452 0.872
## AgeGroupUnder 25 -0.021054 0.378700 -0.056
## Marriageother 0.002423 0.166855 0.015
## Pr(>|z|)
## (Intercept) 0.01624 *
## taxLienyes or unknown 0.54887
## contacttelephone 0.01084 *
## campaign 0.10138
## poutcomenonexistent 0.11825
## poutcomesuccess 0.38521
## cons.price.idx 0.46098
## cons.conf.idx 0.06192 .
## spent_on_repairs 0.00000000000000241 ***
## month_specialQuarter 2 0.00260 **
## month_specialQuarter 3 0.00803 **
## month_specialQuarter 4 0.00149 **
## EducationStatushigher education 0.26001
## EducationStatussecondary education 0.65759
## EducationStatusunknown 0.61335
## LastContact2 Weeks 0.58656
## LastContact3 Weeks 0.34684
## LastContactNo Contact 0.22849
## Employmentself-employed 0.55300
## Employmentunemployed 0.48689
## Employmentunknown 0.86869
## Employmentwhitecollar 0.68018
## AgeGroup45-70 0.46448
## AgeGroupAbove 70 0.38300
## AgeGroupUnder 25 0.95566
## Marriageother 0.98841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1526.9 on 2227 degrees of freedom
## Residual deviance: 1208.4 on 2202 degrees of freedom
## AIC: 1260.4
##
## Number of Fisher Scoring iterations: 6
The engineered model has a higher proportion of statistically significant variables than the kitchen sink model. Its McFadden value of 0.209 indicates that it is also a satisfactory model, though slightly below the kitchen sink model's 0.215.
# McFadden value - Engineered Model
pR2(housing_new_reg_base)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -604.2119708 -763.4676012 318.5112609 0.2085951 0.1332099 0.2685246
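The same check applies here, using the log-likelihoods reported above:
# McFadden = 1 - llh / llhNull for the engineered model
1 - (-604.2119708 / -763.4676012)
# [1] 0.2085951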
# Confusion Matrix - Engineered Model
housing_new_Test_Prob <- data.frame(outcome = as.factor(housing_new_Test$y_numeric),
probs = predict(housing_new_reg_base, housing_new_Test, type = "response")) %>%
mutate(pred_outcome = as.factor(ifelse(probs > 0.5 , 1, 0)))
print(caret::confusionMatrix(housing_new_Test_Prob$pred_outcome, housing_new_Test_Prob$outcome,
positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1053 102
## 1 17 28
##
## Accuracy : 0.9008
## 95% CI : (0.8825, 0.9172)
## No Information Rate : 0.8917
## P-Value [Acc > NIR] : 0.1648
##
## Kappa : 0.2799
##
## Mcnemar's Test P-Value : 0.00000000000001358
##
## Sensitivity : 0.21538
## Specificity : 0.98411
## Pos Pred Value : 0.62222
## Neg Pred Value : 0.91169
## Prevalence : 0.10833
## Detection Rate : 0.02333
## Detection Prevalence : 0.03750
## Balanced Accuracy : 0.59975
##
## 'Positive' Class : 1
##
# ROC - Engineered Model
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit_new <- train(y ~ .,
data= housing_new_Train %>% dplyr::select(-y_numeric),
method="glm",
family="binomial",
metric="ROC",
trControl = ctrl)
cvFit_new
## Generalized Linear Model
##
## 2228 samples
## 13 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 2206, 2206, 2206, 2206, 2206, 2206, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7657807 0.9848947 0.2316667
The AUC of the engineered model is 0.778, similar to the kitchen sink model's 0.783.
# AUC - Engineered Model
auc(housing_new_Test_Prob$outcome, housing_new_Test_Prob$probs)
## Area under the curve: 0.778
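To test whether the two AUCs differ formally (an addition beyond the original analysis), pROC's DeLong test can be used; because the two models are scored on different test sets (the engineered data drops rows during cleaning), the comparison is unpaired:
# Sketch: unpaired DeLong test comparing the two test-set ROC curves
roc_base <- roc(kitchenSink_Test_Prob$outcome, kitchenSink_Test_Prob$probs)
roc_new <- roc(housing_new_Test_Prob$outcome, housing_new_Test_Prob$probs)
roc.test(roc_base, roc_new, paired = FALSE)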
The validity of the engineered model is supported by the distributions of predicted probabilities: probabilities for observed positives cluster toward 1, while those for observed negatives cluster toward 0.
# Creating probabilities plot for Base Model
a <- ggplot(kitchenSink_Test_Prob, aes(x = probs, fill = as.factor(outcome))) +
geom_density() +
facet_grid(outcome ~ .) +
scale_fill_manual(values = palette2) +
labs(x = "Credit", y = "Density of probabilities",
title = "Dist. of predicted probabilities by observed outcome - Kitchen Sink Model") +
theme(strip.text.x = element_text(size = 7), plot.title = element_text(size = 10))
# Creating probabilities plot for Engineered Model
b <- ggplot(housing_new_Test_Prob, aes(x = probs, fill = as.factor(outcome))) +
geom_density() +
facet_grid(outcome ~ .) +
scale_fill_manual(values = palette2) +
labs(x = "Credit", y = "Density of probabilities",
title = "Dist. of predicted probabilities by observed outcome - Engineered Model") +
theme(strip.text.x = element_text(size = 7), plot.title = element_text(size = 10))
ggarrange(a, b, nrow = 2)
The cross-validated results for the kitchen sink model show an AUC of 0.756, a sensitivity of 0.982, and a specificity of 0.195. Note that caret treats the first factor level ("no") as the positive class here, so the high sensitivity reflects the model's strength at identifying homeowners who do not take the credit, while the low specificity reflects its difficulty identifying those who do. The engineered model improves the AUC to 0.766 and the specificity to 0.232 while maintaining a sensitivity of 0.985. Both models handle the majority class well, but the engineered model shows a modest gain on the minority class, highlighting the potential benefit of feature engineering.
# CV Goodness of Fit - Base Model
grid.arrange(ncol = 1,
dplyr::select(cvFit_base$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit_base$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "darkblue") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "darkred", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics \nKitchen Sink Model",
subtitle = "Across-fold mean represented as dotted lines") +
theme(plot.title = element_text(size = 10), plot.subtitle = element_text(size = 7)),
# CV Goodness of Fit - Engineered Model
dplyr::select(cvFit_new$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit_new$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "darkblue") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "darkred", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics \nFeature Engineered Model",
subtitle = "Across-fold mean represented as dotted lines") +
theme(plot.title = element_text(size = 10), plot.subtitle = element_text(size = 7)))
### 4.3.2 ROC Curve
In this comparison, the Base Model (AUC = 0.783) slightly outperforms the Engineered Model (AUC = 0.778), though the difference is minimal. Both models demonstrate a similar ability to distinguish between classes, with ROC curves that closely overlap. The nearly identical AUC values suggest that the feature engineering applied had little impact on overall model performance in this case.
# ROC Curve Plot - Base Model
a <- ggplot(kitchenSink_Test_Prob, aes(d = as.numeric(kitchenSink_Test_Prob$outcome), m = probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "darkblue", size = .7) +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = .5, color = "darkred") +
labs(title = "ROC Curve - Base Model", subtitle = "AUC = 0.783")+
theme(plot.title = element_text(size = 10), plot.subtitle = element_text(size = 7))
# ROC Curve Plot - Engineered Model
b <- ggplot(housing_new_Test_Prob, aes(d = as.numeric(housing_new_Test_Prob$outcome), m = probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "darkblue", size = .7) +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = .5, color = "darkred") +
labs(title = "ROC Curve - Engineered Model", subtitle = "AUC = 0.778")+
theme(plot.title = element_text(size = 10), plot.subtitle = element_text(size = 7))
ggarrange(a, b, nrow = 1)
Assumptions:
Marketing: $2,850 in marketing resources is spent per homeowner contacted.
Credit: $5,000 per homeowner, which can be used toward home improvement.
Sale premium: homes that transact after taking the credit sell with a $10,000 premium, on average.
Neighborhood premium: surrounding homes gain $56,000 in value, on average.
Uptake: 25% of people who enter the program ultimately use the credit.
Based on these assumptions, the value of each home that enters the credit is $66,000 ($10,000 + $56,000), without evaluating costs to the city.
Cases:
True Positive: Assuming that 25% of the homeowners we predict will take the credit actually use the $5,000 credit, net revenue is calculated as: Revenue = (Count * -2850) + 0.25 * Count * (-5000 + 10000 + 56000). The formula includes both the credit and the marketing costs incurred by the city.
False Positive: For each false positive, there is a loss equivalent to the marketing costs allocated, leading to a revenue loss of $2850 per case.
True Negative: With no credit or marketing costs applied, there’s no revenue gain, totaling $0.
False Negative: A revenue gain of $0 is also assumed for false negatives, as the reasons participants may have taken credit are beyond the scope of the marketing campaign.
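As a worked check of the true-positive formula, the base model produces 32 true positives at the 0.5 threshold (see the confusion matrix above), which reproduces the revenue figure in the cost/benefit table below:
# Revenue for 32 true positives: marketing cost plus 25% credit uptake
tp <- 32
(tp * -2850) + 0.25 * tp * (-5000 + 10000 + 56000)
# [1] 396800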
Based on these assumptions, an iterative function is used to identify the threshold that maximizes revenue. For the engineered model, revenue peaks at a threshold of 0.17, suggesting this as the optimal value. This threshold is notably lower than the commonly recommended 0.5. Note that the total revenue of the feature-engineered model is lower than that of the kitchen sink model.
# Cost Benefits Table - Base Model
cost_benefit_table_base <-
kitchenSink_Test_Prob %>%
count(pred_outcome, outcome) %>%
summarize(True_Negative = sum(n[pred_outcome==0 & outcome==0]),
True_Positive = sum(n[pred_outcome==1 & outcome==1]),
False_Negative = sum(n[pred_outcome==0 & outcome==1]),
False_Positive = sum(n[pred_outcome==1 & outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Revenue =
case_when(Variable == "True_Negative" ~ Count * 0,
Variable == "True_Positive" ~ ((Count * -2850) + (0.25 * Count * (-5000 + 10000 + 56000))),
Variable == "False_Negative" ~ Count * 0,
Variable == "False_Positive" ~ (Count * -2850))) %>%
bind_cols(data.frame(Description = c(
"Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no
credit was allocated.",
"Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took
the credit.",
"We predicted that a homeowner would not take the credit but they did.",
"Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit
allocated.")))
kable(cost_benefit_table_base,
caption = "Cost/Benefit Table - Base Data") %>% kable_styling()
Variable | Count | Revenue | Description |
---|---|---|---|
True_Negative | 1267 | 0 | Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated. |
True_Positive | 32 | 396800 | Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit. |
False_Negative | 126 | 0 | We predicted that a homeowner would not take the credit but they did. |
False_Positive | 17 | -48450 | Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated. |
# Cost Benefits Table - Engineered Model
cost_benefit_table_new <-
housing_new_Test_Prob %>%
count(pred_outcome, outcome) %>%
summarize(True_Negative = sum(n[pred_outcome==0 & outcome==0]),
True_Positive = sum(n[pred_outcome==1 & outcome==1]),
False_Negative = sum(n[pred_outcome==0 & outcome==1]),
False_Positive = sum(n[pred_outcome==1 & outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Revenue =
case_when(Variable == "True_Negative" ~ Count * 0,
Variable == "True_Positive" ~ ((Count * -2850) + (0.25 * Count * (-5000 + 10000 + 56000))),
Variable == "False_Negative" ~ Count * 0,
Variable == "False_Positive" ~ (Count * -2850))) %>%
bind_cols(data.frame(Description = c(
"Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no
credit was allocated.",
"Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took
the credit.",
"We predicted that a homeowner would not take the credit but they did.",
"Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit
allocated.")))
kable(cost_benefit_table_new,
caption = "Cost/Benefit Table - Engineered Data") %>% kable_styling()
Variable | Count | Revenue | Description |
---|---|---|---|
True_Negative | 1053 | 0 | Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated. |
True_Positive | 28 | 347200 | Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit. |
False_Negative | 102 | 0 | We predicted that a homeowner would not take the credit but they did. |
False_Positive | 17 | -48450 | Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated. |
# Thresholds function - Base Model
iterateThresholds_base <- function(data) {
x = .01
all_prediction <- data.frame()
while (x <= 1) {
this_prediction <-
data %>%
mutate(pred_outcome = ifelse(probs > x, 1, 0)) %>%
count(pred_outcome, outcome) %>%
summarize(True_Negative = sum(n[pred_outcome==0 & outcome==0]),
True_Positive = sum(n[pred_outcome==1 & outcome==1]),
False_Negative = sum(n[pred_outcome==0 & outcome==1]),
False_Positive = sum(n[pred_outcome==1 & outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Revenue =
ifelse(Variable == "True_Negative", (Count * 0),
ifelse(Variable == "True_Positive", ((Count * -2850) + (0.25 * Count * (-5000 + 10000 + 56000))),
ifelse(Variable == "False_Negative", (Count * 0),
ifelse(Variable == "False_Positive", (Count * -2850), 0
)
)
)
),
Threshold = x)
all_prediction <- rbind(all_prediction, this_prediction)
x <- x + .01
}
return(all_prediction)
}
# Thresholds function - Engineered Model
iterateThresholds_new <- function(data) {
x = .01
all_prediction <- data.frame()
while (x <= 1) {
this_prediction <-
data %>%
mutate(pred_outcome = ifelse(probs > x, 1, 0)) %>%
count(pred_outcome, outcome) %>%
summarize(True_Negative = sum(n[pred_outcome==0 & outcome==0]),
True_Positive = sum(n[pred_outcome==1 & outcome==1]),
False_Negative = sum(n[pred_outcome==0 & outcome==1]),
False_Positive = sum(n[pred_outcome==1 & outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Revenue =
ifelse(Variable == "True_Negative", (Count * 0),
ifelse(Variable == "True_Positive", ((Count * -2850) + (0.25 * Count * (-5000 + 10000 + 56000))),
ifelse(Variable == "False_Negative", (Count * 0),
ifelse(Variable == "False_Positive", (Count * -2850), 0
)
)
)
),
Threshold = x)
all_prediction <- rbind(all_prediction, this_prediction)
x <- x + .01
}
return(all_prediction)
}
Our analysis of the test datasets reveals that the engineered model generates a lower maximum revenue than the base model, peaking at a threshold of approximately 0.17 (versus roughly 0.20 for the base model). Both optima are significantly lower than the conventional 0.5. This discrepancy arises because our primary objective in this scenario is not balanced class accuracy but the optimization of net revenue, which places a greater emphasis on capturing true positives.
# Confusion Matrix - Base Model Threshold
baseThreshold <- iterateThresholds_base(kitchenSink_Test_Prob)
baseThreshold_revenue <-
baseThreshold %>%
group_by(Threshold) %>%
summarize(Revenue = sum(Revenue))
a <- baseThreshold %>%
ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette4) +
labs(title = "Revenue by confusion matrix type and threshold", subtitle = "Base Model",
y = "Revenue") + theme(plot.title = element_text(size = 10), plot.subtitle = element_text(size = 7)) +
guides(colour=guide_legend(title = "Confusion Matrix"))
# Confusion Matrix - Engineered Model Threshold
engineeredThreshold <- iterateThresholds_new(housing_new_Test_Prob)
engineeredThreshold_revenue <-
engineeredThreshold %>%
group_by(Threshold) %>%
summarize(Revenue = sum(Revenue))
b <- engineeredThreshold %>%
ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette4) +
labs(title = "Revenue by confusion matrix type and threshold", subtitle = "Engineered Model",
y = "Revenue") + theme(plot.title = element_text(size = 10), plot.subtitle = element_text(size = 7)) +
guides(colour=guide_legend(title = "Confusion Matrix"))
ggarrange(a,b, nrow=2)
# Revenue and Credits by Threshold for Base Model
baseThreshold_revenue <-
baseThreshold %>%
mutate(TookCredit = ifelse(Variable == "True_Positive", (Count * .25),
ifelse(Variable == "False_Negative", Count, 0))) %>%
group_by(Threshold) %>%
summarize(Total_Revenue = sum(Revenue),
Total_Count_Of_Credits = sum(TookCredit))
# Revenue and Credits by Threshold for Engineered Model
engineeredThreshold_revenue <-
engineeredThreshold %>%
mutate(TookCredit = ifelse(Variable == "True_Positive", (Count * .25),
ifelse(Variable == "False_Negative", Count, 0))) %>%
group_by(Threshold) %>%
summarize(Total_Revenue = sum(Revenue),
Total_Count_Of_Credits = sum(TookCredit))
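Rather than reading the optimum off the plots, the revenue-maximizing threshold for each model can be pulled directly (a small convenience sketch using the summaries built above):
# Extract the revenue-maximizing threshold for each model
bind_rows(
  baseThreshold_revenue %>% slice_max(Total_Revenue, n = 1, with_ties = FALSE) %>%
    mutate(Model = "Base Model"),
  engineeredThreshold_revenue %>% slice_max(Total_Revenue, n = 1, with_ties = FALSE) %>%
    mutate(Model = "Engineered Model"))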
# Revenue Plot for Engineered Model
grid.arrange(ncol = 1,
ggplot(engineeredThreshold_revenue)+
geom_line(aes(x = Threshold, y = Total_Revenue),color = "darkblue")+
geom_vline(xintercept = pull(arrange(engineeredThreshold_revenue, -Total_Revenue)[1,1]),color = "red")+
labs(title = "Total Revenue By Threshold - Engineered Model",
subtitle = "Vertical Line Denotes Optimal Threshold")+
theme(plot.title = element_text(size = 10), plot.subtitle = element_text(size = 7)),
# Credits Plot for Engineered Model
ggplot(engineeredThreshold_revenue)+
geom_line(aes(x = Threshold, y = Total_Count_Of_Credits),color = "darkblue")+
geom_vline(xintercept = pull(arrange(engineeredThreshold_revenue, -Total_Count_Of_Credits)[1,1]),color = "red")+
labs(title = "Total Count of Credits By Threshold - Engineered Model",
subtitle = "Vertical Line Denotes Optimal Threshold") +
theme(plot.title = element_text(size = 10), plot.subtitle = element_text(size = 7)))
# Optimal Threshold - Base Model
optimalthreshold_base <-
baseThreshold_revenue %>%
dplyr::select(Threshold, Total_Revenue, Total_Count_Of_Credits)%>%
mutate(Model = "Base Model")
optimalthreshold_base_table <- optimalthreshold_base %>%
filter(row_number() %in% c(20, 50))
# Optimal Threshold - Engineered Model
optimalthreshold_new <-
engineeredThreshold_revenue %>%
dplyr::select(Threshold, Total_Revenue, Total_Count_Of_Credits)%>%
mutate(Model = "Engineered Model")
optimalthreshold_new_table <- optimalthreshold_new %>%
filter(row_number() %in% c(17, 50))
optimalthreshold_table <- rbind(optimalthreshold_base_table, optimalthreshold_new_table) %>%
select(Threshold, Model, Total_Revenue, Total_Count_Of_Credits)
kable(optimalthreshold_table, caption = "Cost/Benefit Table") %>% kable_styling()
Threshold | Model | Total_Revenue | Total_Count_Of_Credits |
---|---|---|---|
0.20 | Base Model | 854800 | 89.75 |
0.50 | Base Model | 348350 | 134.00 |
0.17 | Engineered Model | 650550 | 76.00 |
0.50 | Engineered Model | 298750 | 109.00 |
In conclusion, it is not advisable to put this model into production: the feature-engineered model did not significantly improve regression performance and produced lower projected revenue. A more thorough feature engineering process is necessary, along with additional relevant variables, to enable more rigorous and accurate predictions. By integrating more data and features, we can enhance the model's predictive capability, ultimately leading to more effective decision-making support.
Qualitative and quantitative factors missing from the current dataset may also merit consideration.
By adopting a continuously iterative approach that merges financial and spatial aspects, Emil City can create a more targeted, proactive, and effective model to achieve its goal of safe homes and vibrant communities.