1 Welcome and Schedule

  • 6:30pm Review Coding Warmup 1

  • 6:45pm Some Background

    • RMarkdown Basics
    • Census Data
  • 7:15pm Spatial Data Overview

  • 7:45pm Break + Debugging (15 minutes)

  • 8:00pm Coding

1.1 Readings

Racism In, Racism Out

Tidymodels Ch 1/2

Figure 1.2: The data science process

Figure 1.3: A schematic for the typical modeling process

Exploratory data analysis (EDA): Initially there is a back and forth between numerical analysis and data visualization (represented in Figure 1.2) where different discoveries lead to more questions and data analysis side-quests to gain more understanding.

Feature engineering: The understanding gained from EDA results in the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (using the ratio of two predictors). Chapter 8 focuses entirely on this important step.

Model tuning and selection (large circles with alternating segments): A variety of models are generated and their performance is compared. Some models require parameter tuning in which some structural parameters must be specified or optimized. The alternating segments within the circles signify the repeated data splitting used during resampling (see Chapter 10).

Model evaluation: During this phase of model development, we assess the model’s performance metrics, examine residual plots, and conduct other EDA-like analyses to understand how well the models work. In some cases, formal between-model comparisons (Chapter 11) help you understand whether any differences in models are within the experimental noise.

Ch 2. tidyverse

1.2 Spatial Examples

Real-world problems require accurate projections of geographic relationships.

  • Find the ten neighborhoods in a city with the longest fire department response times.
  • Use real-time satellite imagery to detect early warnings of forest fires.
  • Detect changes in high-resolution satellite imagery that suggest building additions that do not match official building permits.
  • Estimate the impact of bar openings on nearby crime, business patterns, and property values.
  • Determine whether low-income households face a higher risk of sea-level rise.

1.2.1 Data Types

  • Objects: sets of points that form discrete shapes, like roads and buildings
  • Fields: spatially continuous processes, like temperature

1.2.2 Vectors

  • Points: each feature is a single point
  • Lines: each feature is multiple, ordered points
  • Polygons: each feature is multiple, ordered points, with the first and last point the same

Common file formats: CSV (coordinates in columns), Shapefile, GeoJSON
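As a minimal sketch of the three vector types, here is how each can be built by hand with the sf package (used throughout these notes); the coordinates are made-up points in downtown Chicago:

library(sf)

# A point: a single (x, y) = (longitude, latitude) pair
pt <- st_point(c(-87.63, 41.88))

# A line: multiple ordered points
ln <- st_linestring(rbind(c(-87.63, 41.88),
                          c(-87.62, 41.89),
                          c(-87.61, 41.89)))

# A polygon: multiple ordered points, with the first and last the same
pg <- st_polygon(list(rbind(c(-87.63, 41.88),
                            c(-87.62, 41.89),
                            c(-87.63, 41.90),
                            c(-87.63, 41.88))))

st_read() reads Shapefiles and GeoJSON directly; a CSV with coordinate columns goes through read_csv() plus st_as_sf(), as shown later.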

1.2.3 Coordinate Reference System (CRS)

Defines how a dataset’s coordinates map onto the Earth’s surface

Very common EPSG codes are 4326 (WGS 84) and 4269 (NAD 83, used by the Census Bureau)
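As a quick sketch of working with a CRS in sf, using the demo shapefile that ships with the package:

library(sf)

# North Carolina counties, included with sf
nc <- st_read(system.file("shape/nc.shp", package = "sf"))

st_crs(nc)                          # inspect the current CRS
nc_wgs84 <- st_transform(nc, 4326)  # re-project to WGS 84 (EPSG 4326)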

1.3 Census Data Hierarchy

Geographies nest in a hierarchy: nation → state → county → census tract → block group → block.

2 Spatial Data

2.1 Setup

  • Get Census API Key
  • Run census_api_key("YOUR API KEY GOES HERE", install=TRUE)
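The code below assumes a set of packages is already loaded. A minimal setup chunk would look something like this (the exact list is inferred from the functions used later):

library(tidyverse)   # read_csv, dplyr verbs, ggplot2, stringr
library(sf)          # st_as_sf, st_join, st_transform, ...
library(tidycensus)  # census_api_key, load_variables, get_decennial, get_acs
library(mapview)     # quick interactive maps
library(leaflet)     # customizable interactive maps
library(scales)      # percent() axis/legend labels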

2.2 Example from last week

Let’s quickly show how the Assessor data we used last week can be converted to an sf object.

ccao <- read_csv('../../files/Assessor__Archived_05-11-2022__-_Residential_Modeling_Characteristics__Chicago_.zip')

# Oops: st_as_sf() expects coords as (x, y) = (longitude, latitude)
mini <- ccao %>% st_as_sf(coords=c("Latitude", "Longitude")) %>% slice_sample(n=1000)
mini
st_crs(mini) <- 4326
mapview(mini) # this is backwards!

# Correct order: longitude first, then latitude
mini2 <- ccao %>% st_as_sf(coords=c("Longitude", "Latitude")) %>% slice_sample(n=1000)
st_crs(mini2) <- 4326
mapview(mini2)

# Clean up
rm(mini)
rm(mini2)

3 Census

3.1 Exploring variables

Starting with the 2020 census…let’s look at institutional populations by county.

c20var <- tidycensus::load_variables(2020, "pl")

c20var %>% view()
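Rather than scrolling through the viewer, you can also filter the variable table programmatically. A small sketch (the search terms are just a guess at what to look for):

# Search variable labels for the group-quarters categories we care about
c20var %>%
  filter(str_detect(str_to_lower(label), "college|correctional"))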

We want to create an sf object (county_census below) that joins the geometries with the counts/values we are interested in showing.
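The chunk that builds county_census is not shown here; below is a sketch of how it might be constructed with tidycensus. The P5_* codes are assumptions based on the variable names in the plots — verify them against c20var before relying on them.

# Sketch: county-level group-quarters counts from the 2020 PL file,
# with total population as the summary variable for computing shares.
county_census <- get_decennial(
  geography = "county",
  variables = c(college_housing_pop = "P5_008N",  # college/university student housing (assumed code)
                adult_inc = "P5_003N"),           # adult correctional facilities (assumed code)
  summary_var = "P1_001N",                        # total population
  year = 2020,
  sumfile = "pl",
  geometry = TRUE,
  keep_geo_vars = TRUE   # keeps STATEFP, used in the filters below
)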

Note that making maps, like regular plots, requires thoughtful selection of the area you want to show, based on how much information a reader can actually take in.

ggplot(county_census %>% filter(variable == "college_housing_pop"), aes(fill=value)) +
  geom_sf(color=NA) +
  coord_sf(crs=4269) +
  scale_fill_viridis_c(option = "magma")

Just the Midwest…

ggplot(county_census %>% filter(STATEFP %in% c(17, 18, 26, 39, 55),
                         variable == "college_housing_pop"), aes(fill=value)) +
  geom_sf(color=NA) +
  coord_sf(crs=4269) +
  scale_fill_viridis_c(option = "magma") +
  labs(title="College Housing Population", fill="")

ggplot(county_census %>% filter(STATEFP %in% c(17, 18, 26, 39, 55),
                         variable == "adult_inc"), aes(fill=value)) +
  geom_sf(color=NA) +
  coord_sf(crs=4269) +
  scale_fill_viridis_c(option = "magma") +
  labs(title="Adult Incarcerated", fill="")

ggplot(county_census %>% filter(STATEFP %in% c(17, 18, 26, 39, 55)), aes(fill=value/summary_value)) +
  geom_sf(color=NA) +
  coord_sf(crs=4269) +
  facet_wrap(~variable) +
  scale_fill_viridis_c(labels=percent) +
  labs(title="Institutional Populations", fill="")

4 Fancy Maps / American Community Survey

Let’s map Chicago at the tract level by major census racial categories.

4.1 Setting up Data

Finding variables

Census Reporter

Getting our data together: let’s find the percent White/Black/Other alone for Illinois by tract.
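The chunk that builds chicago (the city boundary) and chicago_tracts_acs is not shown; here is a sketch under stated assumptions: the B02001_* codes follow the table named in the plot caption below, “Other” is mapped to Some Other Race alone, and tigris::places stands in for whatever boundary source was actually used.

# City boundary from TIGER/Line place polygons (assumed source)
chicago <- tigris::places(state = "IL", cb = TRUE, year = 2021) %>%
  filter(NAME == "Chicago")

# Tract-level race counts from the 2021 ACS, Table B02001
chicago_tracts_acs <- get_acs(
    geography = "tract", state = "IL", year = 2021,
    variables = c(Total = "B02001_001", White = "B02001_002",
                  Black = "B02001_003", Other = "B02001_007"),
    geometry = TRUE, output = "wide"
  ) %>%
  mutate(White.Share = WhiteE / TotalE,
         Black.Share = BlackE / TotalE,
         Other.Share = OtherE / TotalE) %>%
  st_filter(chicago)   # keep tracts intersecting the city boundary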

 mapview(chicago) + mapview(chicago_tracts_acs, alpha=.05, col.regions='green')

4.2 ggplot2

chicago_tracts_acs %>% 
  pivot_longer(contains('Share'), names_to='target', values_to='percent') %>%
  ggplot(aes(fill = percent)) +
  facet_wrap(~target, nrow = 2) +
  geom_sf(color = NA) +
  theme_void() + 
  scale_fill_viridis_c() + 
  labs(title='Chicago Population by Tract', fill='Share of Pop.', caption='2021 ACS Table B02001')

4.3 leaflet

First leaflet map

leaflet() %>%
  addTiles() 
leaflet() %>%
  addTiles() %>% 
  addPolygons(data = chicago_tracts_acs)

Making palette

pal1 <-
  colorNumeric(
    palette = "Oranges",
    domain = chicago_tracts_acs$White.Share,
    na.color = "Grey"
  )
leaflet() %>%
  addTiles() %>% 
  addPolygons(
    data = chicago_tracts_acs,
    fillColor = ~ pal1(White.Share),
    weight = 0.5,
    opacity = 0.5,
    color = "white",
    dashArray = 3,
    fillOpacity = 0.7)

Make labels

# sprintf() below fills in the %s placeholders; a plain string works here (str_glue is unnecessary)
label_str <- "<strong>Tract %s</strong><br/>White Alone (Pct): %s<br/>"
labels <- sprintf(label_str,
                chicago_tracts_acs$NAME,
                percent(chicago_tracts_acs$White.Share, accuracy = .1)) %>% 
  lapply(htmltools::HTML)

Add labels/legend

m <- leaflet() %>%
  addTiles() %>% addPolygons(
    data = chicago_tracts_acs,
    fillColor = ~ pal1(White.Share),
    weight = 0.5,
    opacity = 0.5,
    color = "white",
    dashArray = 3,
    fillOpacity = 0.7,
    highlight = highlightOptions(
      weight = 5,
      color = "#666",
      dashArray = "",
      fillOpacity = 0.7,
      bringToFront = TRUE
    ),
    label = labels,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px",
      direction = "auto"
    )
  ) %>%
  addLegend(
    pal = pal1,
    values = chicago_tracts_acs$White.Share,
    opacity = 0.7,
    title = NULL,
    position = "bottomright"
  )
  

m

5 Spatial Relationships

5.1 Divvy

Let’s understand who took a Divvy ride in July 2021 and map it!

Open data

Divvy by community area

temp <- tempfile()
download.file("https://divvy-tripdata.s3.amazonaws.com/202107-divvy-tripdata.zip", 
              temp)
unzip(temp, list=TRUE)
divvy <- read_csv(unz(temp, "202107-divvy-tripdata.csv"))
unlink(temp)

community_areas <- st_read("https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON")
## Reading layer `OGRGeoJSON' from data source 
##   `https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON' 
##   using driver `GeoJSON'
## Simple feature collection with 77 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84
divvy <- 
  divvy %>% st_as_sf(coords=c("start_lng", "start_lat"), remove=FALSE)

st_crs(divvy) <- st_crs(community_areas)

divvy_joined <- divvy %>% select(ride_id, rideable_type) %>% st_join(
  community_areas %>% select(community)
)

divvy_counts <- divvy_joined %>% 
  as.data.frame() %>%  # drop the sf class so summarize() doesn't union geometries
  group_by(community) %>%
  summarize(cnt = n(),
            # share of each community area's rides taken on electric bikes
            pct_electric = length(ride_id[rideable_type == 'electric_bike']) / cnt)

ggplot(community_areas %>% left_join(divvy_counts),
       aes(fill=pct_electric)) +
  geom_sf(color='grey') +
  coord_sf() +
  scale_fill_viridis_c(labels = percent) +
  labs(title="Percent of Divvy Rides which were Electric", subtitle="July 2021", fill="Percent Electric")

ggplot(community_areas %>% left_join(divvy_counts),
       aes(fill=cnt*pct_electric)) +
  geom_sf(color='grey') +
  coord_sf() +
  scale_fill_viridis_c() +
  labs(title="Total Electric Rides", fill="")

5.2 Police and Crime

Example from the textbook.

station_df <- read_csv("https://github.com/DataScienceForPublicPolicy/diys/raw/main/data/chicago-police-stations.csv")
crime_df <- read_csv("https://github.com/DataScienceForPublicPolicy/diys/raw/main/data/chicago-crime-2018.csv") 

#Convert station_df to sf
station_sf <- station_df  %>% 
                st_as_sf(coords = c("LONGITUDE", "LATITUDE"))

#Remove NAs from crime_df and convert to sf
crime_sf <- crime_df %>% 
                filter(!is.na(Longitude) & !is.na(Latitude)) %>% 
                st_as_sf(coords = c("Longitude", "Latitude"))

#Set CRS
st_crs(station_sf) <- 4326
st_crs(crime_sf) <- 4326

#Transform
station_sf <- st_transform(station_sf, 32616)
crime_sf <- st_transform(crime_sf, 32616)

district_sf <- st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON")
## Reading layer `OGRGeoJSON' from data source 
##   `https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON' 
##   using driver `GeoJSON'
## Simple feature collection with 25 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## Geodetic CRS:  WGS 84
district_sf <- st_transform(district_sf, st_crs(crime_sf))

crime_sf <- crime_sf %>% st_join(district_sf,
  join = st_intersects)

crime_count <- crime_sf %>% as.data.frame() %>%
  filter(!is.na(dist_label)) %>%
  group_by(dist_num) %>%
  summarize(n_incidents = n())

district_sf <- district_sf %>% left_join(crime_count, 
                                         by = "dist_num")
ggplot(data = district_sf, aes(fill = n_incidents / 1000)) +
  geom_sf(color = "black", size = 0.05) +
  scale_fill_viridis_c("Number of Incidents (’000)", option = "magma") +
  guides(fill = guide_colourbar(barwidth = 15, barheight = 1)) +
  theme_void() + theme(legend.position = "bottom")

# Pairwise distances between the district polygons (not used below)
district_dist <- st_distance(district_sf)

dist_mat <- st_distance(x = crime_sf,
                        y = station_sf)

# Minimum distance to station for each crime
crime_min_dist <- apply(X = dist_mat,
                        MARGIN = 1,
                        FUN = min)
# Add distance as variable in crime_sf
crime_sf <- crime_sf %>% mutate(dist_station = crime_min_dist)
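As an aside, sf can find nearest neighbors in one step with st_nearest_feature, which avoids holding the full distance matrix in memory; a sketch that should give the same distances:

# Index of the nearest station for each crime, then element-wise distances
nearest_idx <- st_nearest_feature(crime_sf, station_sf)
crime_min_dist2 <- st_distance(crime_sf, station_sf[nearest_idx, ], by_element = TRUE)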

Let’s plot the kernel density of these distances.

One might wonder how this distribution of crimes’ distances to the nearest station compares to the overall distribution of distances to the nearest station for all of Chicago. If police station locations are chosen to be in high-crime areas (or if they attract crime), then we would expect the distances between crimes and stations to be concentrated on smaller distances relative to the distances to stations for all of Chicago. If crimes avoid police stations, then we would expect their distribution to be pushed farther out relative to greater Chicago.

#Custom theme
custom_theme <- theme_bw() +
                theme(plot.title = element_text(size = 9),
                axis.title.x = element_text(size = 9),
                axis.title.y = element_text(size = 9))

ggplot(data = crime_sf,  aes(x = dist_station)) +
  geom_density(color = "white", fill = "orange", alpha = 0.6) +
  scale_x_continuous( "Distance to nearest station (meters)", labels = scales::comma) +
  scale_y_continuous("Density", labels = scales::percent) +
  custom_theme

Constructing a benchmark. If we believe police station location has some relationship with the location of crime, we need to construct a benchmark – some comparison group that gives context. One strategy involves constructing a point vector containing a random or regular set of points placed throughout Chicago. This benchmark assumes that if distance did not matter, there would be an equal chance of crime at every location in Chicago. When we compare the equal-chance distribution to the actual crime-distance distribution, we can infer whether distance to a police station has any relationship with crime.

To construct this equal-chance distribution, we first create a single polygon for Chicago by merging the police district polygons (district_sf) with the st_union function. In effect, the borders between all polygons are removed, leaving a single large city-limits boundary. From this new city polygon, we draw a hexagonal grid of points (\(n=10000\) to be exact) using st_sample. Then, the same distance calculations are applied to find the minimum distance from each of the \(n=10000\) points to a police station.

#Re-project district shapefile to UTM 16N
district_sf <- st_transform(district_sf, crs = 32616)

#Take union
outline_chicago <- district_sf %>% st_geometry() %>% st_union()

#Draw sample of 'hexagonal' points from Chicago
points_chicago <-
  st_sample(x = outline_chicago, size = 10000, type = "hexagonal")

#Distances between points and police stations
dist_points_station <-
  st_distance(x = points_chicago,  y = station_sf)

#Find distance to nearest station for each point
points_min_dist <-
  apply(X = dist_points_station, MARGIN = 1, FUN = min)

#Convert points_chicago to sf data frame
points_chicago <- points_chicago %>%
  st_coordinates() %>% as.data.frame() %>%
  mutate(dist_station = points_min_dist) %>%
  st_as_sf(coords = 1:2) %>%
  st_set_crs(32616)

We compare the police districts, the grid of points, and the minimum distance to a police station. To an extent, the minimum distance to a police station loosely follows the boundaries of the police districts – there are areas that are well covered and others that are far less so. We can see that most of Chicago is within 5 km of a police station, with the exception of the airport in the northwestern corner.

# Sample 1,000 points from Chicago's outline (built above with st_union())
points_chicago_sub <- st_sample(x = outline_chicago,
                                size = 1000,
                                type = "hexagonal")

# Plot police districts
gg_district <- ggplot() +
  geom_sf(data = district_sf) +
  ggtitle("(A) Police districts") +
  theme_void() + theme(plot.title = element_text(size = 10, hjust = 0.5))

# Plot: 1000 points
gg_points_full <- ggplot() +
  geom_sf(data = outline_chicago, color = "blue", fill = NA) +
  geom_sf(
    data = points_chicago_sub,
    color = "black",
    fill = NA,
    shape = 19,
    size = 0.01
  ) +
  ggtitle("(B) Grid of points") +
  theme_void() + theme(plot.title = element_text(size = 10, hjust = 0.5))

# Plot distance to the nearest station
gg_dist <-  ggplot() +
  geom_sf(data = points_chicago,
          aes(color = dist_station / 1e3),
          size = 0.5) +
  ggtitle("(C) Distance to nearest station") +
  scale_color_viridis_c(
    "Dist. (km)",
    option = "magma",
    breaks = seq(0, 7.5, length = 4),
    labels = c("0", "2.5", "5", "7.5+")
  ) +
  theme_void() + theme(plot.title = element_text(size = 10, hjust = 0.5))

#Plot together
gridExtra::grid.arrange(gg_district,
                        gg_points_full,
                        gg_dist,
                        nrow = 1)

Comparing distance distributions. To answer the original question, we plot kernel densities of the two distance distributions below. The vast majority of crimes (red line) in this data set occur within 2.5 km of a police station, while our sample of points from all of Chicago has a lower density in this distance range. This means that crimes tend to occur closer to police stations than Chicago locations in general. Is this causal? It is hard to draw a firm conclusion.

custom_theme <- theme_bw() +
              theme(plot.title = element_text(size = 9),
              axis.title.x = element_text(size = 9),
              axis.title.y = element_text(size = 9))

#Graph
ggplot() +
    geom_density(
      data = points_chicago %>% filter(dist_station < 8.5e3),
      aes(x = dist_station, fill = "A", color = "A"),
      alpha = 0.5) +
    geom_density(
      data = crime_sf %>% filter(dist_station < 8.5e3),
      aes(x = dist_station, fill = "B", color = "B"),
      alpha = 0.5) +
    geom_hline(yintercept = 0) +
    scale_x_continuous(
      "Distance to nearest station (meters)",
      labels = scales::comma) +
    ylab("Density") +
    scale_color_manual(
      "",
      labels = c("Equal chance", "Crimes"),
      values = c(NA, "#cf556a")) +
    scale_fill_manual( "",
      labels = c("Equal chance", "Crimes"),
      values = c("#301867", NA)) +
    custom_theme

There are many factors that could contribute to this trend. Perhaps police stations are placed in high-crime neighborhoods, or perhaps police put more effort into areas near the station, etc. Drawing a causal inference is quite challenging without an experimental design.

Nonetheless, it may be informative to focus on a few crime types and other attributes of the data. Perhaps whether an incident leads to an arrest differs with distance from a police station, which offers a clue. We compare the distance distribution of narcotics incidents that led to an arrest versus those that did not. Interestingly, we see some evidence that the chance of an arrest could depend on the distance between the incident and the police station. That said, there are many other relationships we would want to explore more deeply before making any decisions.

ggplot(data = crime_sf %>% 
         filter(`Primary Type` == "NARCOTICS") %>% filter(dist_station < 6e3),
          aes(x = dist_station, fill = Arrest, color = Arrest)) +
        geom_density(alpha = 0.5, color = NA) +
        geom_hline(yintercept = 0) +
        scale_x_continuous("Distance to station (meters)", labels = scales::comma) +
        ylab("Density") +
        scale_fill_manual("Arrest made?", labels = c("False", "True"), 
                          values = c("grey80", "violetred3")) +
    custom_theme +
        theme(axis.text.y = element_blank(),
          legend.position = "bottom")