Detroit Part 4

Assignment

Instead of traditional problem sets, this course has a single four-part assignment in which you will build on your previous work each week with new material from the course. You will explore property assessment in Detroit, Michigan and create an assessment model. After completing the assignment, you will wrap your model into a report that analyzes its effectiveness using the ethical and other frameworks from class, and give a brief presentation to the class.

Submissions

Each week you will submit two files on Blackboard: your code (.Rmd) file and the knitted output of your code. Blackboard will not accept HTML files, so you must zip the two files together.

Final Submission

Create final_report.Rmd in the reports folder, copying the YAML/framework from part_3.Rmd.

Bring together your previous submissions into one cohesive report. The report should offer a brief overview of the problem (assessment), general trends in the property data, your model, why your model is better than the alternatives, and any technical or ethical critiques.

Your final submission will build upon your part 3 submission by ‘switching out’ the model you use and adding a conclusion.

Part A, New Assessment Models

Mirror Section 15.3 from the textbook for your assessment model only!

  • Create a workflow_set() of three different model types. You may choose any that are compatible with tidymodels; suggested models and tuning parameters are below. I encourage you to consider using one model not from this list, but that is optional.
linear_reg_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

rf_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 250) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

xgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             min_n = tune(), sample_size = tune(), trees = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

my_set <- workflow_set(
  preproc = list(name_for_your_recipe = your_recipe), # REPLACE WITH YOUR RECIPE
  models = list(linear_reg = linear_reg_spec, random_forest = rf_spec, boosted = xgb_spec)
)

The textbook applies slightly different preprocessing steps to these models, but they should work reasonably well with your current recipe. If you run into any compatibility issues, I recommend starting with a simple recipe and adding steps back one at a time.

  • Now, apply workflow_map() to your workflow_set(). You should resample your training data and create a small grid. Please use at least 3 resamples and a grid of at least 5. This may take a long time (10+ minutes), so I recommend starting very small and gradually increasing your resamples/grid. Note that I’ve included verbose = TRUE here to help you debug; please do not include this output in your final project.
grid_ctrl <-
   control_grid(
      save_pred = FALSE,
      save_workflow = FALSE
   )

grid_results <-
   my_set %>%
   workflow_map(
      seed = 1503,
      resamples = your_resamples,  # REPLACE WITH YOUR RESAMPLES
      grid = 5,
      control = grid_ctrl,
      verbose = TRUE
   )
   
  • Now, use rank_results() on your selected performance metric and autoplot() to replicate Figure 15.1. Does one model perform significantly better than the others? Select the model you judge best and finalize it (see Section 15.5).
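A sketch of the ranking and plotting step. This assumes your chosen metric is rmse; swap in whichever metric you selected.

```r
library(tidymodels)
library(workflowsets)

# Rank every workflow in the set by the chosen metric,
# keeping only the best tuning result per workflow
grid_results %>%
  rank_results(rank_metric = "rmse", select_best = TRUE)

# Replicate Figure 15.1: one point per workflow, ordered by rank
autoplot(grid_results, rank_metric = "rmse", metric = "rmse", select_best = TRUE)
```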
best_results <- 
   grid_results %>% 
   extract_workflow_set_result("best model type name") %>% # REPLACE
   select_best(metric = "your metric") # REPLACE
best_results

best_results_fit <- 
   grid_results %>% 
   extract_workflow("best model type name") %>% # REPLACE
   finalize_workflow(best_results) %>% 
   last_fit(split = your_rsample_data) # the output of rsample::initial_split() or rsample::initial_time_split()
  • Consider making a simple visualization of predicted / observed values from your best model similar to Figure 15.5
best_results_fit %>% 
   collect_predictions() %>% 
   ggplot(aes(x = target_variable, y = .pred)) + 
   geom_abline(color = "gray50", lty = 2) + 
   geom_point(alpha = 0.5) + 
   coord_obs_pred() + 
   labs(x = "observed", y = "predicted")

Part B, Hyperparameter Exploration for Classification

Mirroring Section 14.2.3, take your current workflow and use tune_bayes() to create a small tuning grid for your classification model. You will need to:

  • Identify appropriate hyperparameters to tune for your chosen model type and set them equal to tune() in your workflow. (Notes: do not include mtry in your tuning grid; if you use logistic_reg(), you must use the glmnet engine to have tuning parameters.)
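As one illustration (not a requirement), a boosted-tree classification spec with tunable hyperparameters might look like the sketch below; your_recipe is a placeholder for your own classification recipe, and mtry is deliberately left out per the note above.

```r
library(tidymodels)

# Hypothetical classification spec with several tune() placeholders
xgb_class_spec <-
  boost_tree(trees = tune(), tree_depth = tune(),
             learn_rate = tune(), min_n = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

your_workflow <-
  workflow() %>%
  add_recipe(your_recipe) %>%  # REPLACE WITH YOUR RECIPE
  add_model(xgb_class_spec)
```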
  • Manually create a start_grid and evaluate your workflow to create initial values for tune_bayes()
initial_vals <- your_workflow %>%
  tune_grid(
    resampled_data,  # REPLACE WITH YOUR RESAMPLES
    grid = 4,
    metrics = your_metric_set
  )
  • Run a Bayesian search
ctrl <- control_bayes(verbose = TRUE)

your_search <- 
  your_workflow %>%
  tune_bayes(
    resamples = ..., # REPLACE
    metrics = ..., # REPLACE
    initial = initial_vals, # note: you may simply pass a number here, e.g. 6, for a random search
    iter = 25,
    control = ctrl
  )
  • Call show_best and finalize your model
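That last step might be sketched as follows, reusing the placeholder names from above (replace "your metric" with your actual metric name):

```r
library(tidymodels)

# Inspect the top hyperparameter combinations from the Bayesian search
show_best(your_search, metric = "your metric")  # REPLACE

# Take the single best combination and lock it into the workflow
best_param <- select_best(your_search, metric = "your metric")  # REPLACE

final_wf <- finalize_workflow(your_workflow, best_param)
```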

Part C, Conclusion & Presentation

Write a four-paragraph conclusion to your file. Include information on your model type, its performance on your chosen objective function, and any ethical or implementation issues (e.g., should Detroit use your model?).

In class on the 30th, everyone will give a brief presentation on their work. You may present your knitted Rmd file or pull some of your graphs into a slide deck. Your presentation should be at most five minutes. Broadly, aim to answer whether your model should be implemented, drawing on the information in your conclusion and assignment.

Grading Overview

For each assignment, you will be graded on substantial completion (demonstrated by attempting all parts). When submitting parts 2, 3, and 4, you will additionally be graded on your incorporation of feedback and new concepts from the course, and on the correction of any flagged issues.

The assignment culminates in a final submission of code/report and a presentation. Code will be graded on reproducibility, conceptual understanding, and accuracy. The report will be an R Markdown file that knits together graphs, tables, and ethical frameworks; it should be concise, including only relevant information from Parts 1-4. The report will be used to give a five-minute presentation to the class on your model and the ethical/technical issues with Detroit property assessment.

Asg.  Points  Category                                 Notes
1     5       Substantial Completion                   Attempted all parts
2     5       Substantial Completion                   Attempted all parts
2     5       Incorporation of Feedback/New Concepts   From Part 1
3     10      Substantial Completion                   Attempted all parts
3     10      Incorporation of Feedback/New Concepts   From Part 2
4     30      Final Code                               Reproducible (10), Concepts (10), Accurate (10)
4     20      Final Report                             Via R Markdown HTML; contextualized analysis and ethics
4     15      Final Presentation                       3-5 minute presentation on model and insights