1 Introduction

2 Sales Ratio Analysis

3 Modeling Overassessment

3.1 Specifications

3.1.1 recipe

3.2 Workflow Prep
3.3 Workflow
3.4 Classifier Accuracy Metrics
3.5 Predictions

4 Modeling Assessment

4.1 Model Evaluation

joined_mini <- dplyr::tbl(con, 'sales') %>%
  mutate(year = as.double(str_sub(sale_date, 1, 4))) %>%
  left_join(dplyr::tbl(con, 'assessments'), 
            by=c('parcel_num'='PARCELNO', 'year')) %>%
  filter(property_c == 401,
         sale_price >= 2000,
         ASSESSEDVALUE >= 1000,
         between(year, 2012, 2019)) %>%
  collect() %>%
  filter(str_detect(str_to_lower(sale_terms), 'valid arm'))

ratios <- cmfproperty::reformat_data(
  joined_mini,
  'sale_price',
  'ASSESSEDVALUE',
  'year'
)

## [1] "Filtered out non-arm's length transactions"
## [1] "Inflation adjusted to 2019"

stats <- cmfproperty::calc_iaao_stats(ratios)
output <- cmfproperty::diagnostic_plots(ratios = ratios, stats = stats, min_reporting_yr =  2012, max_reporting_yr = 2019)

detroit_targ <- st_intersection(wayne_tracts, detroit %>% select(geometry))

detroit_tracts <- wayne_tracts %>% filter(GEOID %in% detroit_targ$GEOID)


targ_geo <- tbl(con, 'attributes') %>% 
  filter(property_c == 401) %>%
  select(parcel_num, Longitude, Latitude) %>%
  collect() %>%
  filter(!is.na(Latitude)) %>%
  st_as_sf(coords=c("Longitude", "Latitude"))

st_crs(targ_geo) <- st_crs(detroit_tracts)

parcel_geo <- targ_geo %>% st_join(detroit_tracts, join=st_intersects) %>%
  as.data.frame() %>%
  select(-geometry)

1 Introduction

The Detroit housing market has experienced historic levels of foreclosures, disinvestment, and demolitions. Over 1 in 4 properties has been foreclosed due to property tax foreclosure since 2011 and many of these foreclosures were due to inaccurate, inflated assessments. These assessments remain problematic even after the City of Detroit undertook its first citywide reassessment in over 60 years which became effective in 2017.

Since the beginning of the coronavirus pandemic, tax foreclosures have been halted and assessments have become more accurate. Detroit’s housing market has begun to recover with some neighborhoods even gentrifying. Yet, the system remains inequitable especially for low-income homeowners of color.

output[[2]]

This analysis focuses on single family homes (class 401) which were taxable, sold for more than $2000, and marked as arm’s length by the assessor. Additionally, using the cmfproperty package, the IAAO arm’s length standard was applied to the data. This will present a conservative picture of the housing market in Detroit. Note that homes in Detroit as supposed to be assessed at most at 50% of their fair market value.

output[[1]]

homes_counts <- dplyr::tbl(con, 'assessments') %>% filter(propclass == 401) %>%
  count(year) %>% collect()

ggplot(homes_counts, aes(x=year, y=n)) +
  geom_line(color='light blue', size=2) +
  scale_y_continuous(labels=scales::comma, limit=c(0, NA)) +
  scale_x_continuous(breaks=scales::pretty_breaks()) +
  labs(x='Year', y='Count of 401 properties', title='Number of Homes in Detroit \nDecreased from 2011')

2 Sales Ratio Analysis

The sales ratio is a key unit for analyzing the accuracy of assessments. It is calculated by taking the ratio of a property’s sale price to its assessment at the time of sale. The sales ratios can be evaluated using metrics from the International Association of Assessing Officers.

iaao <- cmfproperty::iaao_graphs(stats=stats, ratios=ratios, min_reporting_yr = 2012, max_reporting_yr = 2019, jurisdiction_name = 'Detroit')

For 2019, the COD in Detroit was 43.44 which did not meet the IAAO standard for uniformity.

iaao[[2]]

In 2019, the PRD in Detroit, was 1.209 which does not meet the IAAO standard for vertical equity.

iaao[[4]]

In 2019, the PRB in Detroit was -0.354 which indicates that sales ratios decrease by 35.4% when home values double. This does not meet the IAAO standard.

iaao[[6]]

bs <- cmfproperty::binned_scatter(ratios = ratios, min_reporting_yr = 2012, max_reporting_yr = 2019, jurisdiction_name = 'Detroit')

In 2019, the most expensive homes (the top decile) were assessed at 22.9% of their value and the least expensive homes (the bottom decile) were assessed at 72.6%. In other words, the least expensive homes were assessed at 3.18 times the rate applied to the most expensive homes. Across our sample from 2012 to 2019, the most expensive homes were assessed at 27.1% of their value and the least expensive homes were assessed at 111.8%, which is 4.12 times the rate applied to the most expensive homes.

bs[[2]]

3 Modeling Overassessment

3.1 Specifications

targ_sales_16 <- 
  tbl(con, 'sales') %>% filter(year(sale_date) == 2016,
                                 sale_price > 2000,
                                 property_c == 401) %>%
      select(-c(grantor, grantee, ecf, property_c)) %>%
      arrange(desc(sale_price)) %>% 
      collect() %>%
      filter(str_detect(str_to_lower(sale_terms), 'valid arm')) %>%
      distinct(parcel_num, .keep_all=TRUE)


model_data <- tbl(con, 'assessments') %>% 
  filter(year == 2016 | year == 2019, 
         propclass == 401,
         ASSESSEDVALUE > 2000) %>%
  collect() %>%
  left_join(
    targ_sales_16,
    by=c('PARCELNO'='parcel_num')
  ) %>%
  left_join(
    tbl(con, 'attributes') %>% select(parcel_num, Neighborhood, total_squa, 
                                  heightcat, extcat, has_garage, has_basement,
                                  bathcat, total_floo, year_built, Longitude, Latitude) %>%
      collect(),
    by=c('PARCELNO'='parcel_num')
  ) %>%
  left_join(
    parcel_geo,
    by=c('PARCELNO'='parcel_num')
  )

foreclosures_10_to_15 <- tbl(con, 'foreclosures') %>%
  rename(parcel_num = prop_parcelnum) %>%
  pivot_longer(!c(prop_addr, parcel_num), names_to='Year') %>%
  filter(!is.na(value)) %>%
  left_join(tbl(con, 'attributes') %>% select(parcel_num, Neighborhood)) %>%
  filter(between(as.numeric(Year), 2010, 2015),
         !is.na(Neighborhood)) %>%
  count(Neighborhood) %>% left_join(
    tbl(con, 'attributes') %>% filter(property_c == 401) %>% count(Neighborhood) %>%
  rename(total_prop = n)
  ) %>%
  collect() %>%
  mutate(foreclosures_per_prop = `n` / total_prop) %>%
  select(-n, -total_prop)

## Joining, by = "parcel_num"
## Joining, by = "Neighborhood"

share_over_14_to_15 <- ratios %>% filter(between(SALE_YEAR, 2014, 2015)) %>%
  group_by(Neighborhood = ecf) %>%
  summarize(pct_over_50 = length(parcel_num[RATIO > .5]) / n())


avg_sp_per_sqft <- ratios %>% filter(between(SALE_YEAR, 2014, 2018)) %>%
  left_join(tbl(con, 'attributes') %>% select(parcel_num, total_squa) %>% collect()) %>%
  group_by(Neighborhood = ecf) %>%
  summarize(avg_sp_per_sq = mean(SALE_PRICE / total_squa, na.rm=T),
            med_sp_per_sq = median(SALE_PRICE / total_squa, na.rm=T),
            n())

## Joining, by = "parcel_num"

model_data <- model_data %>% mutate(RATIO = ASSESSEDVALUE / sale_price) %>%
  mutate(
  class2 = if_else(RATIO <= .5, 'Under', 'Over'),
  sp_log = log10(sale_price),
  across(contains('cat') | contains('has'), factor)
  ) %>%
  left_join(share_over_14_to_15) %>%
  left_join(foreclosures_10_to_15) %>%
  left_join(avg_sp_per_sqft)

## Joining, by = "Neighborhood"
## Joining, by = "Neighborhood"
## Joining, by = "Neighborhood"

Since we don’t have a lot of arm’s length sales in 2016 (about 4200), we may want to use a training method which resamples our data. Initially, I will use 10-fold cross validation to train and aggregate ten models. For our classification model, we will initially use a random forest of 500 trees. More on that later.

3.1.1 recipe

class_recipe <- recipe(
  class2 ~
    PARCELNO + Neighborhood + ASSESSEDVALUE +
    total_squa + heightcat + extcat + has_garage +
    has_basement + bathcat + total_floo + pct_over_50 +
    foreclosures_per_prop +
    year_built + Latitude + Longitude,
  data = sale_model_data
) %>%
  update_role(PARCELNO, new_role = 'PARCELID') %>%
  step_log(ASSESSEDVALUE) %>%
  step_impute_mean(total_squa, total_floo, year_built, Latitude, Longitude, pct_over_50, foreclosures_per_prop) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  prep()

Our recipe is above. Key steps highlighted:

Logging assessed value, ensures we capture variability in assessed value which ranges from 1000 to 120000.

3.2 Workflow Prep

Our tidymodels workflow requires a model, formula, and recipe. Some of these pieces will be the same across our three specifications.

sale_model_data <- model_data %>% filter(!is.na(RATIO), year == 2016)

folds <- rsample::vfold_cv(sale_model_data, v = 5, strata = class2)

roc_res <- metric_set(roc_auc)

class_model <-
   boost_tree(trees = 250,
              tree_depth = tune(), learn_rate = tune(), 
              min_n = tune()) %>% 
   set_engine("xgboost") %>% 
   set_mode("classification")

class_recipe <- recipe(
  class2 ~
    PARCELNO + Neighborhood + 
    ASSESSEDVALUE +
    total_squa + heightcat + extcat + has_garage +
    has_basement + bathcat + total_floo + pct_over_50 +
    foreclosures_per_prop +
    year_built + GEOID,
  data = sale_model_data
) %>%
  step_log(ASSESSEDVALUE) %>%
  step_impute_mean(total_squa, total_floo, year_built, pct_over_50, foreclosures_per_prop) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_other(threshold = 1e-4) %>%
  prep()

3.3 Workflow

first_workflow <-
  workflow() %>%
  add_model(class_model) %>%
  add_recipe(class_recipe)

initial_vals <- first_workflow %>%
  tune_grid(
    folds,
    grid = 6,
    metrics = roc_res,
    control = control_grid(verbose=TRUE)
  )

## i Fold1: preprocessor 1/1

## v Fold1: preprocessor 1/1

## i Fold1: preprocessor 1/1, model 1/6

## v Fold1: preprocessor 1/1, model 1/6

## i Fold1: preprocessor 1/1, model 1/6 (predictions)

## i Fold1: preprocessor 1/1, model 2/6

## v Fold1: preprocessor 1/1, model 2/6

## i Fold1: preprocessor 1/1, model 2/6 (predictions)

## i Fold1: preprocessor 1/1, model 3/6

## v Fold1: preprocessor 1/1, model 3/6

## i Fold1: preprocessor 1/1, model 3/6 (predictions)

## i Fold1: preprocessor 1/1, model 4/6

## v Fold1: preprocessor 1/1, model 4/6

## i Fold1: preprocessor 1/1, model 4/6 (predictions)

## i Fold1: preprocessor 1/1, model 5/6

## v Fold1: preprocessor 1/1, model 5/6

## i Fold1: preprocessor 1/1, model 5/6 (predictions)

## i Fold1: preprocessor 1/1, model 6/6

## v Fold1: preprocessor 1/1, model 6/6

## i Fold1: preprocessor 1/1, model 6/6 (predictions)

## i Fold2: preprocessor 1/1

## v Fold2: preprocessor 1/1

## i Fold2: preprocessor 1/1, model 1/6

## v Fold2: preprocessor 1/1, model 1/6

## i Fold2: preprocessor 1/1, model 1/6 (predictions)

## i Fold2: preprocessor 1/1, model 2/6

## v Fold2: preprocessor 1/1, model 2/6

## i Fold2: preprocessor 1/1, model 2/6 (predictions)

## i Fold2: preprocessor 1/1, model 3/6

## v Fold2: preprocessor 1/1, model 3/6

## i Fold2: preprocessor 1/1, model 3/6 (predictions)

## i Fold2: preprocessor 1/1, model 4/6

## v Fold2: preprocessor 1/1, model 4/6

## i Fold2: preprocessor 1/1, model 4/6 (predictions)

## i Fold2: preprocessor 1/1, model 5/6

## v Fold2: preprocessor 1/1, model 5/6

## i Fold2: preprocessor 1/1, model 5/6 (predictions)

## i Fold2: preprocessor 1/1, model 6/6

## v Fold2: preprocessor 1/1, model 6/6

## i Fold2: preprocessor 1/1, model 6/6 (predictions)

## i Fold3: preprocessor 1/1

## v Fold3: preprocessor 1/1

## i Fold3: preprocessor 1/1, model 1/6

## v Fold3: preprocessor 1/1, model 1/6

## i Fold3: preprocessor 1/1, model 1/6 (predictions)

## i Fold3: preprocessor 1/1, model 2/6

## v Fold3: preprocessor 1/1, model 2/6

## i Fold3: preprocessor 1/1, model 2/6 (predictions)

## i Fold3: preprocessor 1/1, model 3/6

## v Fold3: preprocessor 1/1, model 3/6

## i Fold3: preprocessor 1/1, model 3/6 (predictions)

## i Fold3: preprocessor 1/1, model 4/6

## v Fold3: preprocessor 1/1, model 4/6

## i Fold3: preprocessor 1/1, model 4/6 (predictions)

## i Fold3: preprocessor 1/1, model 5/6

## v Fold3: preprocessor 1/1, model 5/6

## i Fold3: preprocessor 1/1, model 5/6 (predictions)

## i Fold3: preprocessor 1/1, model 6/6

## v Fold3: preprocessor 1/1, model 6/6

## i Fold3: preprocessor 1/1, model 6/6 (predictions)

## i Fold4: preprocessor 1/1

## v Fold4: preprocessor 1/1

## i Fold4: preprocessor 1/1, model 1/6

## v Fold4: preprocessor 1/1, model 1/6

## i Fold4: preprocessor 1/1, model 1/6 (predictions)

## i Fold4: preprocessor 1/1, model 2/6

## v Fold4: preprocessor 1/1, model 2/6

## i Fold4: preprocessor 1/1, model 2/6 (predictions)

## i Fold4: preprocessor 1/1, model 3/6

## v Fold4: preprocessor 1/1, model 3/6

## i Fold4: preprocessor 1/1, model 3/6 (predictions)

## i Fold4: preprocessor 1/1, model 4/6

## v Fold4: preprocessor 1/1, model 4/6

## i Fold4: preprocessor 1/1, model 4/6 (predictions)

## i Fold4: preprocessor 1/1, model 5/6

## v Fold4: preprocessor 1/1, model 5/6

## i Fold4: preprocessor 1/1, model 5/6 (predictions)

## i Fold4: preprocessor 1/1, model 6/6

## v Fold4: preprocessor 1/1, model 6/6

## i Fold4: preprocessor 1/1, model 6/6 (predictions)

## i Fold5: preprocessor 1/1

## v Fold5: preprocessor 1/1

## i Fold5: preprocessor 1/1, model 1/6

## v Fold5: preprocessor 1/1, model 1/6

## i Fold5: preprocessor 1/1, model 1/6 (predictions)

## i Fold5: preprocessor 1/1, model 2/6

## v Fold5: preprocessor 1/1, model 2/6

## i Fold5: preprocessor 1/1, model 2/6 (predictions)

## i Fold5: preprocessor 1/1, model 3/6

## v Fold5: preprocessor 1/1, model 3/6

## i Fold5: preprocessor 1/1, model 3/6 (predictions)

## i Fold5: preprocessor 1/1, model 4/6

## v Fold5: preprocessor 1/1, model 4/6

## i Fold5: preprocessor 1/1, model 4/6 (predictions)

## i Fold5: preprocessor 1/1, model 5/6

## v Fold5: preprocessor 1/1, model 5/6

## i Fold5: preprocessor 1/1, model 5/6 (predictions)

## i Fold5: preprocessor 1/1, model 6/6

## v Fold5: preprocessor 1/1, model 6/6

## i Fold5: preprocessor 1/1, model 6/6 (predictions)

initial_vals %>% collect_metrics()

ABCDEFGHIJ0123456789

min_n <int>	tree_depth <int>	learn_rate <dbl>	.metric <chr>	.estimator <chr>	mean <dbl>	n <int>	std_err <dbl>
24	4	0.041792945	roc_auc	binary	0.6543502	5	0.008393203
8	11	0.310133177	roc_auc	binary	0.6069698	5	0.007912766
10	1	0.007477383	roc_auc	binary	0.6224625	5	0.011267180
29	8	0.049946504	roc_auc	binary	0.6501577	5	0.008170375
36	7	0.001257071	roc_auc	binary	0.6422660	5	0.011852205
15	13	0.002661261	roc_auc	binary	0.6490685	5	0.010378464

ctrl <- control_bayes(verbose = TRUE)

your_search <- 
  first_workflow %>%
  tune_bayes(
    resamples = folds,
    metrics = roc_res,
    initial = initial_vals,
    iter = 25,
    control = ctrl
  )

## Optimizing roc_auc using the expected improvement

##

## -- Iteration 1 -----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6544 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=21, tree_depth=7, learn_rate=0.00102

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6446 (+/-0.0117)

##

## -- Iteration 2 -----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6544 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=38, tree_depth=1, learn_rate=0.0283

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6353 (+/-0.0105)

##

## -- Iteration 3 -----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6544 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=25, tree_depth=4, learn_rate=0.305

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6268 (+/-0.0074)

##

## -- Iteration 4 -----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6544 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=13, tree_depth=5, learn_rate=0.0552

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.644 (+/-0.00868)

##

## -- Iteration 5 -----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6544 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=4, tree_depth=8, learn_rate=0.0385

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6371 (+/-0.00726)

##

## -- Iteration 6 -----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6544 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=24, tree_depth=1, learn_rate=0.00211

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6077 (+/-0.00744)

##

## -- Iteration 7 -----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6544 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=31, tree_depth=7, learn_rate=0.0186

## i Estimating performance

## v Estimating performance

## <3 Newest results:   roc_auc=0.6569 (+/-0.00876)

##

## -- Iteration 8 -----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6569 (@iter 7)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=39, tree_depth=6, learn_rate=0.0388

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6569 (+/-0.00846)

##

## -- Iteration 9 -----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6569 (@iter 7)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=32, tree_depth=6, learn_rate=0.0313

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6567 (+/-0.00884)

##

## -- Iteration 10 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6569 (@iter 7)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=25, tree_depth=10, learn_rate=0.00548

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6543 (+/-0.0106)

##

## -- Iteration 11 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6569 (@iter 7)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=40, tree_depth=8, learn_rate=0.0245

## i Estimating performance

## v Estimating performance

## <3 Newest results:   roc_auc=0.6584 (+/-0.00871)

##

## -- Iteration 12 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6584 (@iter 11)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=38, tree_depth=15, learn_rate=0.015

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6581 (+/-0.00807)

##

## -- Iteration 13 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6584 (@iter 11)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=40, tree_depth=13, learn_rate=0.012

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6574 (+/-0.00828)

##

## -- Iteration 14 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6584 (@iter 11)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=39, tree_depth=15, learn_rate=0.0372

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6537 (+/-0.00748)

##

## -- Iteration 15 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6584 (@iter 11)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=39, tree_depth=15, learn_rate=0.00122

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.641 (+/-0.0111)

##

## -- Iteration 16 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6584 (@iter 11)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=23, tree_depth=15, learn_rate=0.0133

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.655 (+/-0.00887)

##

## -- Iteration 17 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6584 (@iter 11)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=40, tree_depth=9, learn_rate=0.0159

## i Estimating performance

## v Estimating performance

## <3 Newest results:   roc_auc=0.6588 (+/-0.00813)

##

## -- Iteration 18 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6588 (@iter 17)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=40, tree_depth=10, learn_rate=0.0206

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6576 (+/-0.00794)

##

## -- Iteration 19 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6588 (@iter 17)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=39, tree_depth=8, learn_rate=0.0167

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.657 (+/-0.00883)

##

## -- Iteration 20 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6588 (@iter 17)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=38, tree_depth=9, learn_rate=0.0213

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6576 (+/-0.00862)

##

## -- Iteration 21 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6588 (@iter 17)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=40, tree_depth=7, learn_rate=0.0256

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6567 (+/-0.00803)

##

## -- Iteration 22 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6588 (@iter 17)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=4, tree_depth=9, learn_rate=0.00557

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6477 (+/-0.00609)

##

## -- Iteration 23 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6588 (@iter 17)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=24, tree_depth=8, learn_rate=0.0142

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6559 (+/-0.00919)

##

## -- Iteration 24 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6588 (@iter 17)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=2, tree_depth=15, learn_rate=0.00593

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6303 (+/-0.00337)

##

## -- Iteration 25 ----------------------------------------------------------------

##

## i Current best:      roc_auc=0.6588 (@iter 17)

## i Gaussian process model

## v Gaussian process model

## i Generating 5000 candidates

## i Predicted candidates

## i min_n=30, tree_depth=15, learn_rate=0.00995

## i Estimating performance

## v Estimating performance

## (x) Newest results:  roc_auc=0.6568 (+/-0.00894)

autoplot(your_search, type = "performance")

best_results <- 
   your_search %>% 
   select_best()

model_fit <- 
   first_workflow %>% 
   finalize_workflow(best_results) %>%
   fit_resamples(folds, control=control_resamples(save_pred=TRUE))

our_results <- collect_metrics(model_fit, summarize=TRUE)
our_results %>%
  kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))

.metric	.estimator	mean	n	std_err	.config
accuracy	binary	0.6213246	5	0.0071011	Preprocessor1_Model1
roc_auc	binary	0.6588066	5	0.0081346	Preprocessor1_Model1

our_preds <- collect_predictions(model_fit, summarize=TRUE) 
our_preds %>%
  count(.pred_class) %>%
  kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))

.pred_class	n
Over	2841
Under	1416

conf_mat(our_preds, estimate=.pred_class, truth=class2)

##           Truth
## Prediction Over Under
##      Over  1809  1032
##      Under  580   836

Our model has pretty mediocre accuracy of 0.621.

3.4 Classifier Accuracy Metrics

our_preds <- collect_predictions(model_fit, summarize=TRUE) 
multi_metric <- metric_set(recall, specificity, precision, accuracy, f_meas)
multi_metric(our_preds, truth=class2, estimate=.pred_class) %>%
  kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))

.metric	.estimator	.estimate
recall	binary	0.7572206
specificity	binary	0.4475375
precision	binary	0.6367476
accuracy	binary	0.6213296
f_meas	binary	0.6917782

Some initial views on our model:

sale_model_data %>%
  mutate(pred = our_preds$.pred_class,
         bin = ntile(sale_price, 10)) %>%
  group_by(bin) %>%
  summarize(avg_sp = dollar(mean(sale_price)),
            share_correct = percent(sum(class2 == pred) / n()),
            share_over = percent(sum(class2 == 'Over') / n()))  %>%
  kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))

bin	avg_sp	share_correct	share_over
1	$4,774.79	72%	100%
2	$8,590.98	69%	96%
3	$12,594.83	77%	89%
4	$16,311.45	77%	81%
5	$20,472.82	73%	71%
6	$25,080.46	60%	53%
7	$29,715.93	56%	34%
8	$36,277.41	46%	20%
9	$46,286.91	38%	14%
10	$85,590.79	51%	4%

roc_curve(our_preds, class2, .pred_Over) %>%
  autoplot()

3.5 Predictions

We now fit/train our model on all the sale data and run predictions on all properties.

trained <- first_workflow %>% 
   finalize_workflow(best_results) %>% fit(sale_model_data)

## [14:27:44] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

all_predictions <- trained %>% augment(model_data %>% filter(year == 2016))

all_predictions %>% count(.pred_class)  %>%
  kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))

.pred_class	n
Over	129594
Under	63939

Now lets model our predictions across race and make a map.

wayne <- get_acs(geography = "tract", 
              variables = c(white_alone = "B02001_002",
                            total_pop = "B02001_001",
                            medincome = "B19013_001"),
              state=26, county=163,
              output='wide') %>%
  mutate(pct_nonwhite = (total_popE - white_aloneE) / total_popE) %>%
  select(GEOID, pct_nonwhite, medincomeE)

## Getting data from the 2015-2019 5-year ACS

geoid_over <- all_predictions %>% group_by(GEOID) %>%
  summarize(
    pct_over = sum(.pred_class == 'Over') / n(),
    size = n()
  ) %>%
  filter(!is.na(GEOID)) %>%
  left_join(wayne)

## Joining, by = "GEOID"

ggplot(geoid_over, aes(x=pct_over, y=pct_nonwhite)) +
  geom_point(alpha=.3, size=3) +
  geom_smooth(se=FALSE) +
  labs(x='Pct Overassessed Prediction', y='Pct NonWhite Tract', title='Predicted Overassessment by NonWhite Pop')

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

library(leaflet)
geo_data <- detroit_tracts %>%
  left_join(geoid_over)

## Joining, by = "GEOID"

label_str <- str_glue("<strong>Tract %s</strong><br> Pct Over %s<br>")
labels <- sprintf(label_str,
                geo_data$GEOID,
                percent(geo_data$pct_over, .1)) %>% 
  lapply(htmltools::HTML)
### make palette
pal1 <-
  colorNumeric(
    palette = "Oranges",
    domain = geo_data$pct_over,
    na.color = "Grey"
  )
  
m <- leaflet() %>%
  addTiles() %>% addPolygons(
    data = geo_data,
    fillColor = ~ pal1(pct_over),
    weight = 0.5,
    opacity = 0.5,
    color = "white",
    dashArray = 3,
    fillOpacity = 0.7,
    highlight = highlightOptions(
      weight = 5,
      color = "#666",
      dashArray = "",
      fillOpacity = 0.7,
      bringToFront = TRUE
    ),
    label = labels,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px",
      direction = "auto"
    )
  ) %>%
  addLegend(
    pal = pal1,
    values = geo_data$pct_over,
    opacity = 0.7,
    title = NULL,
    position = "bottomright"
  )
  
m

4 Modeling Assessment

Generating our own assessments (for 2019) is very similar to modeling overassessment. Initially, I will use the same recipe and formula except replacing class with sale price. I will also demonstrate the xgboost (boosted decision tree) package. Boosted decision trees are similar to random forests except each tree is created iteratively in a process of continuous improvement.

Training and testing will occur across 2014 to 2018 with a 90/10 split based on time. Predictions will then be made for 2019 compared to the baseline of actual assessed values. Workflow is the same except that xgboost requires us to bake our data first.

time_split <- rsample::initial_time_split(
  ratios %>% select(parcel_num, SALE_PRICE_ADJ, sale_date, ASSESSED_VALUE_ADJ, SALE_YEAR) %>% 
    filter(between(SALE_YEAR, 2013, 2018)) %>% 
    arrange(sale_date) %>% left_join(
      model_data %>% filter(year == 2019) %>%
        select(-c(RATIO, class2, sp_log, sale_date)), by=c('parcel_num'='PARCELNO')
    ) %>% mutate(sp_log = log10(SALE_PRICE_ADJ),
                 sale_date = ymd(sale_date)),
  .9)

train <- training(time_split) 
test <- testing(time_split)

train_resample <- rsample::vfold_cv(train, 3)

linear_reg_spec <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

rf_spec <-
  rand_forest(mtry = tune(),
              min_n = tune(),
              trees = 250) %>%
  set_engine("ranger") %>%
  set_mode("regression")

xgb_spec <-
  boost_tree(
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune(),
    min_n = tune(),
    sample_size = tune(),
    trees = tune()
  ) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
   
reg_recipe <- recipe(sp_log ~  parcel_num + Neighborhood + ASSESSEDVALUE +
    total_squa + heightcat + extcat + has_garage +
    has_basement + bathcat + total_floo + pct_over_50 +
    foreclosures_per_prop +
    year_built + GEOID + sale_date,
                       data=train) %>%
  step_date(sale_date, features = c("dow", "month", "year"), keep_original_cols = FALSE) %>%
  update_role(parcel_num, new_role = 'PARCELID') %>%
  step_log(ASSESSEDVALUE) %>%
  step_impute_mean(ASSESSEDVALUE, total_squa, total_floo, 
                   year_built, pct_over_50, foreclosures_per_prop) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep()

my_set <- workflow_set(
  preproc = list(reg_recp = reg_recipe),
  models = list(linear_reg = linear_reg_spec, random_forest = rf_spec, boosted = xgb_spec)
)

grid_ctrl <-
   control_grid(
      save_pred = FALSE,
      save_workflow = FALSE
   )

grid_results <-
   my_set %>%
   workflow_map(
      seed = 1503,
      resamples = train_resample,
      grid = 6,
      control = grid_ctrl,
      verbose = TRUE
   )

## i 1 of 3 tuning:     reg_recp_linear_reg

## ! Fold1: internal: A correlation computation is required, but `estimate` is const...

## ! Fold2: internal: A correlation computation is required, but `estimate` is const...

## ! Fold3: internal: A correlation computation is required, but `estimate` is const...

## v 1 of 3 tuning:     reg_recp_linear_reg (6.7s)

## i 2 of 3 tuning:     reg_recp_random_forest

## i Creating pre-processing data to finalize unknown parameter: mtry

## v 2 of 3 tuning:     reg_recp_random_forest (2m 16.8s)

## i 3 of 3 tuning:     reg_recp_boosted

## v 3 of 3 tuning:     reg_recp_boosted (5m 40.9s)

grid_results

ABCDEFGHIJ0123456789

wflow_id <chr>	info <list>	option <list>	result <list>
reg_recp_linear_reg	<tibble[,4]>	<wrkflw__>	<tune_results[,4]>
reg_recp_random_forest	<tibble[,4]>	<wrkflw__>	<tune_results[,4]>
reg_recp_boosted	<tibble[,4]>	<wrkflw__>	<tune_results[,4]>

autoplot(
   grid_results,
   rank_metric = "rmse",  # <- how to order models
   metric = "rmse",       # <- which metric to visualize
   select_best = TRUE     # <- one point per workflow
)

autoplot(
   grid_results,
   id = "reg_recp_random_forest",  # <- how to order models
   metric = "rmse",       # <- which metric to visualize
)

best_results <- 
   grid_results %>% 
   extract_workflow_set_result("reg_recp_random_forest") %>% 
   select_best(metric = "rmse")
best_results

ABCDEFGHIJ0123456789

mtry <int>	min_n <int>	.config <chr>
2	25	Preprocessor1_Model4

best_results_fit <- 
   grid_results %>% 
   extract_workflow("reg_recp_random_forest") %>% 
   finalize_workflow(best_results) %>% 
   last_fit(split = time_split)

best_results_fit %>% 
   collect_predictions() %>% 
   ggplot(aes(x = sp_log, y = .pred)) + 
   geom_abline(color = "gray50", lty = 2) + 
   geom_point(alpha = 0.5) + 
   coord_obs_pred() + 
   labs(x = "observed", y = "predicted")

4.1 Model Evaluation

model_fit_reg <-
   grid_results %>% 
   extract_workflow("reg_recp_random_forest") %>% 
   finalize_workflow(best_results) %>%
  fit(train)

our_preds <- model_fit_reg %>% augment(train)

multi_metric <- metric_set(mape, rmse, rsq)
multi_metric(our_preds, truth=sp_log, estimate=.pred) %>%
  kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))

.metric	.estimator	.estimate
mape	standard	3.4658654
rmse	standard	0.1894850
rsq	standard	0.6748417

Our model:

our_preds <- model_fit_reg %>% augment(
  test
)

multi_metric(our_preds, truth=sp_log, estimate=.pred) %>%
  kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))

.metric	.estimator	.estimate
mape	standard	4.0012170
rmse	standard	0.2219580
rsq	standard	0.4845976

Actual assessments:

multi_metric(our_preds, truth=sp_log, estimate=log10(2*ASSESSED_VALUE_ADJ)) %>%
  kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))

.metric	.estimator	.estimate
mape	standard	5.0026929
rmse	standard	0.2868362
rsq	standard	0.4240782

ratios_mini <- cmfproperty::reformat_data(
  our_preds %>% mutate(av_pred = 0.5 * 10^.pred),
  'SALE_PRICE_ADJ',
  'av_pred',
  'SALE_YEAR'
)

## [1] "Filtered out non-arm's length transactions"
## [1] "Inflation adjusted to 2018"

bs <- cmfproperty::binned_scatter(ratios = ratios_mini, min_reporting_yr = 2018, max_reporting_yr = 2018,
                                  jurisdiction_name = 'Detroit')

In 2018, the most expensive homes (the top decile) were assessed at 29.4% of their value and the least expensive homes (the bottom decile) were assessed at 78.0%. In other words, the least expensive homes were assessed at 2.65 times the rate applied to the most expensive homes. Across our sample from 2018 to 2018, the most expensive homes were assessed at 29.4% of their value and the least expensive homes were assessed at 78.0%, which is 2.65 times the rate applied to the most expensive homes.

bs[[2]]

our_preds2 <- model_fit_reg %>% augment(
  model_data %>% 
    filter(year == 2019) %>% 
    mutate(sale_date = ymd('2019-01-01')) %>%
    rename(parcel_num = PARCELNO)
)

our_preds2 %>% 
         select(parcel_num, ASSESSEDVALUE, .pred) %>% 
         mutate(.pred = 0.5 * 10^.pred) %>% 
         pivot_longer(!parcel_num) %>%
ggplot(aes(x=value, color=name, fill=name)) +
  geom_density(alpha=.3) +
  scale_x_log10(labels=dollar) +
  labs(x = 'Assessed Value', y='Density', 
       fill = 'Type', color='Type', title='Density of Predictions and AV')

library(DALEXtra)

## Loading required package: DALEX

## Welcome to DALEX (version: 2.4.0).
## Find examples and detailed introduction at: http://ema.drwhy.ai/

## 
## Attaching package: 'DALEX'

## The following object is masked from 'package:dplyr':
## 
##     explain

explain_reg <- explain_tidymodels(
  model_fit_reg,
  data = train %>% select(-contains('ADJ'), -`n()`),
  y = train$sp_log,
  verbose=TRUE
)

## Preparation of a new explainer is initiated
##   -> model label       :  workflow  ( [33m default [39m )
##   -> data              :  26527  rows  26  cols 
##   -> target variable   :  26527  values 
##   -> predict function  :  yhat.workflow  will be used ( [33m default [39m )
##   -> predicted values  :  No value for predict function target column. ( [33m default [39m )
##   -> model_info        :  package tidymodels , ver. 0.1.4 , task regression ( [33m default [39m ) 
##   -> predicted values  :  numerical, min =  3.668785 , mean =  4.460449 , max =  5.803948  
##   -> residual function :  difference between y and yhat ( [33m default [39m )
##   -> residuals         :  numerical, min =  -0.9887813 , mean =  -0.0001163372 , max =  2.263115  
##  [32m A new explainer has been created! [39m

targ <- model_parts(explain_reg, loss_function = loss_root_mean_square)

plot(targ)

Part 4

Eric Langowski

1 Introduction

2 Sales Ratio Analysis

3 Modeling Overassessment

3.1 Specifications

3.1.1 recipe

3.2 Workflow Prep

3.3 Workflow

3.4 Classifier Accuracy Metrics

3.5 Predictions

4 Modeling Assessment

4.1 Model Evaluation