Getting accurate and reliable traffic crash data is important as it helps inform road safety policies. However, traffic crashes are often under reported, which could have implications on the true costs of road crashes, identifying which groups are more vulnerable, and developing the appropriate road safety strategies. This model looks at the spatial risk factors of traffic crashes in Chicago that would make the risk of traffic crashes leading to injuries even higher.
We used the traffic crash data set downloaded from Chicago’s open data portal. This data set includes information recorded by Chicago’s Police Department. For our model, we only considered traffic crashes that resulted in injuries in 2018.
#Map vehicle crash data
options(scipen=10000)
grid.arrange(ncol = 2,
ggplot()+
geom_sf(data = chicagoBoundary) +
geom_sf(data = traffic_crash, size = .5, color = "#440154FF") +
labs(title = 'Traffic Crashes in Chicago, 2018')+
mapTheme(),
ggplot()+
geom_sf(data = chicagoBoundary, fill = "#D4D4D4") +
stat_density2d(data = data.frame(st_coordinates(traffic_crash)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = .01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
labs(title = 'Density of traffic crashes in Chicago, 2018') +
mapTheme()+
theme(legend.position = 'none'))
We created a fishnet with a cell size of 500 ft, which serves as the basic modeling unit. The plot below shows the traffic crashes by grid cell.
## R Markdown
#-----Fishnet-----
fishnet <- st_make_grid(chicagoBoundary, cellsize = 500, square = TRUE) %>%
.[chicagoBoundary] %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
#-----Count traffic crashes in fishnet-----
crash_net =
dplyr::select(traffic_crash) %>%
mutate(traffic_crash_count = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(traffic_crash_count = replace_na(traffic_crash_count, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
#---Plot traffic crashes by grid cell-----
ggplot(data = crash_net) +
geom_sf(aes(fill = traffic_crash_count), color = NA) +
scale_fill_viridis()+
labs(title = 'Traffic Crash Counts by fishnet, 2018') +
mapTheme()
Risk factors that could affect traffic crashes include the following: liquor stores, street light outages, pot holes, red light crossings, speed camera violations and bus stops. The plots below show how the count of risk factors are spatially distributed. Liquor Retail
and Potholes
are heavily clustered in Chicago’s downtown.
#-----Risk factors-----
#311streetlights
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street Lights Out")
#liquor stores
liquorRetail <-
read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%
filter(business_activity == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor Retail")
#311 potholes
potholes <-
read.socrata("https://data.cityofchicago.org/resource/_311-potholes.json") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Potholes")
#red light crossings
redlight <-
read.socrata("https://data.cityofchicago.org/resource/spqx-js37.json") %>%
mutate(year = substr(violation_date,1,4)) %>% filter(year == "2018")%>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Red Light Crossings")
#speed camera violations
speedcamera <-
read.socrata("https://data.cityofchicago.org/resource/hhkd-xvj4.json") %>%
mutate(year = substr(violation_date,1,4)) %>% filter(year == "2018")%>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Speed Cam Violations")
#bus stops
busstops <-
st_read("~/GitHub/MUSA-508/CTA_BusStops.shp") %>%
st_transform('ESRI:102271') %>%
#dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Bus Stops")
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
#-----Count of Risk Factors by Fishnet-----
vars_net <-
bind_rows(streetLightsOut, liquorRetail, potholes, redlight, speedcamera, busstops)%>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
vars_net.long <-
gather(vars_net, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol=3, top="Risk Factors by Fishnet"))
Next, we calculated the average nearest neighbor distance to check the exposure of a grid cell to risk factors.
#-----Feature Engineering - Nearest Neighbors-----
st_c <- st_coordinates
st_coid <- st_centroid
vars_net <-
vars_net %>%
mutate(
streetLightsOut.nn =
nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
liquorRetail.nn =
nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
potholes.nn =
nn_function(st_c(st_coid(vars_net)), st_c(potholes),3),
redlight.nn =
nn_function(st_c(st_coid(vars_net)), st_c(redlight),3),
speedcamera.nn =
nn_function(st_c(st_coid(vars_net)), st_c(speedcamera),3),
busstops.nn =
nn_function(st_c(st_coid(vars_net)), st_c(busstops),3))
vars_net.long.nn <-
dplyr::select(vars_net, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol = 3, top = "Nearest Neighbor risk Factors by Fishnet"))
We calculated the Local Moran’s I statistic to check for spatial autocorrelation or clustering. The plots below show the areas with greater statistical significance and concentration of crashes. Traffic crashes are concentrated in the downtown area.
## important to drop the geometry from joining features
final_net <-
left_join(crash_net, st_drop_geometry(vars_net), by="uniqueID")
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
#st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods...
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
# print(final_net.weights, zero.policy=TRUE)
## see ?localmoran
local_morans <- localmoran(final_net$traffic_crash_count, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
# join local Moran's I results to fishnet
final_net.localMorans <-
cbind(local_morans, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(traffic_crash_count = traffic_crash_count,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size = 12) + theme(legend.position="bottom")}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Traffic Crash"))
We checked whether a cell belongs to a significant cluster and calculated its distance to a significant traffic crash hotspot.
final_net <- final_net %>%
mutate(crash.isSig =
ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
mutate(crash.isSig.dist =
nn_function(st_c(st_coid(final_net)),
st_c(st_coid(filter(final_net,
crash.isSig == 1))),
k = 1))
ggplot() +
geom_sf(data = final_net, aes(fill=crash.isSig.dist), colour=NA) +
scale_fill_viridis(name="NN Distance") +
labs(title="Distance to Significant Crash Hotspot") +
mapTheme()
The scatterplots show the correlation of spatial risk factors to the count and nearest neighbor of traffic crashes.
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -name) %>%
gather(Variable, Value, -traffic_crash_count)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, traffic_crash_count, use = "complete.obs"))
ggplot(correlation.long, aes(Value, traffic_crash_count)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#238A8DFF") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Traffic crash as a function of risk factors") +
plotTheme()
The plot below shows that the distribution of the traffic crash counts is skewed, which indicates that Poisson Regression might be better suited than OLS regression to model this count outcome.
#Histogram of Dependent Variable
ggplot(final_net, aes(traffic_crash_count)) +
geom_histogram(binwidth = 1, fill = "#238A8DFF") +
labs(title = "Traffic Crashes Distribution")+plotTheme()
We used two different methods of cross validation for our model. We ran a random k-fold cross validation, dividing the samples into 100 folds, using Just Risk Factors
and Spatial Process
which includes the risk factors and the Local Moran’s I features. We also used a Leave one group out cross validation (LOGO-CV), with each neighborhood taking a turn as a hold out.Cross validating with the Spatial Process
features produces larger errors.
## define the variables we want
reg.vars <- c("potholes.nn", "redlight.nn", "speedcamera.nn","streetLightsOut.nn")
reg.ss.vars <- c("potholes.nn", "redlight.nn", "speedcamera.nn","streetLightsOut.nn","crash.isSig", "crash.isSig.dist")
## Random k-fold Reg: Just Risk Factors and Spatial Process
reg.cv <- crossValidate.1(
dataset = final_net,
id = "cvID",
dependentVariable = "traffic_crash_count",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, traffic_crash_count, Prediction, geometry)
reg.ss.cv <- crossValidate.1(
dataset = final_net,
id = "cvID",
dependentVariable = "traffic_crash_count",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, traffic_crash_count, Prediction, geometry)
## Spatial LOGO-CV Reg
reg.spatialCV <- crossValidate.1(
dataset = final_net,
id = "name",
dependentVariable = "traffic_crash_count",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, traffic_crash_count, Prediction, geometry)
reg.ss.spatialCV <- crossValidate.1(
dataset = final_net,
id = "name",
dependentVariable = "traffic_crash_count",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, traffic_crash_count, Prediction, geometry)
#Bind of observed and predicted counts and errors
reg.summary <-
rbind(
mutate(reg.cv, Error = Prediction - traffic_crash_count,
Regression = "Random k-fold CV: \nJust Risk Factors"),
mutate(reg.ss.cv, Error = Prediction - traffic_crash_count,
Regression = "Random k-fold CV: \nSpatial Process"),
mutate(reg.spatialCV, Error = Prediction - traffic_crash_count,
Regression = "Spatial LOGO-CV: \nJust Risk Factors"),
mutate(reg.ss.spatialCV, Error = Prediction - traffic_crash_count,
Regression = "Spatial LOGO-CV: \nSpatial Process")) %>%
st_sf()
ggplot() +
geom_sf(data = reg.summary, aes(fill = Prediction, colour = Prediction)) +
scale_fill_viridis() +
scale_colour_viridis() +
facet_wrap(~Regression) +
labs(title="Map of Model Errors", subtitle = "Random K-Fold and Spatial Cross Validation") +
mapTheme()
We visualized the distribution of the Mean Absolute Error (MAE) for both cross validation methods. The table below shows the mean and standard deviation of the MAE by regression. Adding Spatial Process
features minimizes the errors.
#Calculate errors by NEIGHBORHOOD
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - traffic_crash_count, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
error_by_reg_and_fold %>% arrange(desc(MAE))
error_by_reg_and_fold %>% arrange(MAE)
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, fill = "#238A8DFF") +
facet_wrap(~Regression) +
geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 11, by = 1)) +
labs(title="Distribution of MAE", subtitle = "k-fold Cross Validation vs. LOGO-CV",
x="Mean Absolute Error", y="Count") +
plotTheme()
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),SD_MAE = round(sd(MAE), 2)) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#DCE319FF") %>%
row_spec(4, color = "black", background = "#DCE319FF")
Regression | Mean_MAE | SD_MAE |
---|---|---|
Random k-fold CV: Just Risk Factors | 0.87 | 0.61 |
Random k-fold CV: Spatial Process | 0.73 | 0.51 |
Spatial LOGO-CV: Just Risk Factors | 2.49 | 3.80 |
Spatial LOGO-CV: Spatial Process | 1.69 | 2.48 |
The plots below map out the LOGO-CV errors and shows where errors are larger if the local spatial process is not included. The errors are largest in the significant hotspot locations.
error_by_reg_and_fold %>%
filter(str_detect(Regression, "LOGO")) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Crash errors by LOGO-CV Regression") +
mapTheme2() + theme(legend.position="bottom")
In the figures below, we can see that our model over predicts in areas with low crash rates and under predicts in areas with high crash rates.
st_drop_geometry(reg.summary) %>%
group_by(Regression) %>%
mutate(crash_Decile = ntile(traffic_crash_count, 10)) %>%
group_by(Regression, crash_Decile) %>%
summarize(meanObserved = mean(traffic_crash_count, na.rm=T),
meanPrediction = mean(Prediction, na.rm=T)) %>%
gather(Variable, Value, -Regression, -crash_Decile) %>%
ggplot(aes(crash_Decile, Value, shape = Variable)) +
geom_point(size = 2) + geom_path(aes(group = crash_Decile), colour = "#565656") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) + xlim(0,10) +
labs(title = "Predicted and observed crashes by observed crash decile") +
plotTheme()
We then pulled Census data (focusing on race) to check how the model generalizes across different neighborhood contexts. The map and the table below show that the model may be under predicting for majority non-white areas and over predicting for majority white areas.
reg.summary %>%
filter(str_detect(Regression, "LOGO")) %>%
st_centroid() %>%
st_join(tracts18) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#DCE319FF")
Regression | Majority_Non_White | Majority_White |
---|---|---|
Spatial LOGO-CV: Just Risk Factors | -0.1535244 | 0.1857320 |
Spatial LOGO-CV: Spatial Process | -0.0026718 | 0.0171617 |
ggplot()+
geom_sf(data = tracts18, aes(fill = raceContext))+
scale_fill_manual(values = c("#238A8DFF", "#DCE319FF"),
labels = c('<50% white', '>50% white'),
name = '')+
labs(title = 'Majority white and majority\nnon-white neighborhoods, Chicago')+
mapTheme2()
To check how our model fares, we compared the kernel density and risk predictions for traffic crashes, overlaid with observed traffic crash incidents in 2019. If the model is well fit, the highest risk category should match with high crash density areas.
traffic_crash19 <- st_read("https://data.cityofchicago.org/api/geospatial/85ca-t3if?method=export&format=GeoJSON") %>% filter(crash_type == "INJURY AND / OR TOW DUE TO CRASH") %>% mutate(year = substr(crash_date,1,4)) %>% filter(year == "2019" & most_severe_injury %in% c('FATAL', 'INCAPACITATING INJURY', 'NONINCAPACITATING INJURY'))%>% st_transform('ESRI:102271') %>%
distinct()%>%
.[fishnet,]
crash_ppp <- as.ppp(st_coordinates(traffic_crash), W = st_bbox(final_net))
crash_KD.1000 <- spatstat.core::density.ppp(crash_ppp, 1000)
crash_KD.1500 <- spatstat.core::density.ppp(crash_ppp, 1500)
crash_KD.2000 <- spatstat.core::density.ppp(crash_ppp, 2000)
crash_KDE_sf <- as.data.frame(crash_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(traffic_crash) %>% mutate(traffic_crash_count= 1), ., sum) %>%
mutate(traffic_crash_count = replace_na(traffic_crash_count, 0))) %>%
dplyr::select(label, Risk_Category, traffic_crash_count)
crash_risk_sf <-
reg.ss.spatialCV %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(traffic_crash) %>% mutate(traffic_crash_count = 1), ., sum) %>%
mutate(traffic_crash_count = replace_na(traffic_crash_count, 0))) %>%
dplyr::select(label,Risk_Category, traffic_crash_count)
rbind(crash_KDE_sf, crash_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(traffic_crash19, 3000), size = .4, colour = "#404040") +
facet_wrap(~label, ) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2018 traffic crash risk predictions; 2019 traffic crashes") +
mapTheme()
The plot below shows that in the highest risk category, the risk prediction model is only marginally better than the Kernel Density, suggesting that the model isn’t very robust.
#Bar plot
rbind(crash_KDE_sf, crash_risk_sf) %>%
st_set_geometry(NULL) %>% na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(traffic_crash_count = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crash = traffic_crash_count / sum(traffic_crash_count)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crash)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_manual(values = c("#238A8DFF", "#DCE319FF"), name = '')+
#scale_fill_viridis(discrete = TRUE) +
labs(title = "Risk prediction vs. Kernel density, 2019 traffic crashes") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))+
plotTheme()
Our risk prediction model looked at how bus stops, 311 reports on potholes and street light outages, red light crossings, liquor stores, and speed camera violations influence the traffic crash incidents. In building our model, it is important to acknowledge that traffic crash counts are generally underreported and that our data set only includes information collected by Chicago’s police department. That said, we should use caution when interpreting the results of our model. For example, the density maps above show that the highest concentration of traffic incidents are in Chicago’s downtown. Is this because there are more police officers stationed in the downtown area? The number of 311 reports on potholes are also highest in the downtown. Does this mean there are more potholes in the area? Is it possible that downtown residents are more likely to submit a service request with 311?
Beyond spatial risk features, our model did not account for several factors that likely play a larger role in determining the likelihood of crashes. Teenage drivers, who are generally more inexperienced, are more likely to be involved in accidents. Those who are on their phones, are distracted while driving, are under the influence, or are overspeeding, also tend to figure in traffic crashes. Future work should include these factors alongside other spatial indicators.