1 Tidymodels

1.1 Workflows
1.2 Recipies

1 Tidymodels

By now, we have looked at broom (tidy, glance, augment), rsample (splitting data), parsnip (specifying models), and yardstick (evaluating models).

1.1 Workflows

We are now going to look at starting to put together all the pieces into a coherent workflow, similar to how you might chain mutate/group by/summarize/left join/etc in tidyverse as one step.

We are continuing to build our modeling pipeline.

Figure 1.3:

Now let’s introduce the model workflow. We won’t really see the benefit of using this until we start using a lot of different models.

lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model)

data(ames)

ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_formula(Sale_Price ~ Longitude + Latitude)

lm_fit <- fit(lm_wflow, ames_train)

1.2 Recipies

A recipe is also an object that defines a series of steps for data processing. Unlike the formula method inside a modeling function, the recipe defines the steps without immediately executing them; it is only a specification of what should be done. ~ Sec 8.1

Here’s the example from the textbook:

simple_ames <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>% #data is used here for column types only
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_dummy(all_nominal_predictors())
simple_ames

## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          4
## 
## Operations:
## 
## Log transformation on Gr_Liv_Area
## Dummy variables from all_nominal_predictors()

Another key point from 8.3, all preprocessing/feature engineering steps only use the training data. Otherwise, your model might utilize information leakage and lead to overfitting.

There’s a ton of different recipe steps

Let’s go back and look at the divvy data and see how we could use a recipe for data processing. Recall the example from class…

Key steps:

Process recipe using training data
Apply recipe using training data
Apply recipe to testing

divvy_data <- read_csv('https://github.com/erhla/pa470spring2022/raw/main/static/lectures/week_3_data.csv')

grouped <- rsample::initial_split(divvy_data)
train <- training(grouped)
test  <- testing(grouped)


lm_model <-
  parsnip::linear_reg() %>%
  set_engine("lm") %>%
  set_mode('regression')

divvy_rec <- recipe(rides ~ ., data=train)

divvy_rec

## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          7

divvy_rec <- recipe(rides ~ solar_rad + started_hour +
           temp + wind + interval_rain + avg_speed, data=train) %>%
  step_mutate(hour = factor(hour(started_hour))) %>%
  step_date(started_hour, features=c("dow", "month")) %>%
  step_holiday(started_hour, holidays = timeDate::listHolidays("US"),
               keep_original_cols = FALSE) %>%
  step_dummy(all_nominal_predictors())


#divvy_rec %>% prep() %>% bake(test) %>% view()

divvy_workflow <-
  workflow() %>%
  add_model(lm_model) %>%
  add_recipe(divvy_rec)

divvy_workflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_mutate()
## • step_date()
## • step_holiday()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

divvy_fit <- divvy_workflow %>%
  fit(data=train)

divvy_fit %>% tidy() %>% view()

divvy_preds <- 
  augment(divvy_fit, test)

Recall that our simple model had a MAPE of 136 and a RMSE of 352.

solar_rad + factor(hour(started_hour)) + 
           factor(wday(started_hour)) +
           factor(month(started_hour)) +
           temp + wind + interval_rain + avg_speed

This is essentially the same model

divvy_fit %>% extract_fit_parsnip() %>% tidy()

ABCDEFGHIJ0123456789

term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>
(Intercept)	1678.0108554	166.56727693	10.07407269
solar_rad	0.9409949	0.04595032	20.47852554
temp	17.9202663	1.45230564	12.33918382
wind	-29.6243234	5.70018908	-5.19707732
interval_rain	-125.5217163	10.96280275	-11.44978334
avg_speed	-61.1437651	5.71911585	-10.69112197
started_hour_USChristmasDay	-469.5217633	82.48504326	-5.69220485
started_hour_USColumbusDay	-234.2909005	93.75079805	-2.49908167
started_hour_USCPulaskisBirthday	NA	NA	NA
started_hour_USDecorationMemorialDay	NA	NA	NA

yardstick::mape(divvy_preds, 
     truth = rides,
     estimate = .pred)

ABCDEFGHIJ0123456789

.metric <chr>	.estimator <chr>	.estimate <dbl>
mape	standard	135.0463

yardstick::rmse(divvy_preds, 
     truth = rides,
     estimate = .pred)

ABCDEFGHIJ0123456789

.metric <chr>	.estimator <chr>	.estimate <dbl>
rmse	standard	348.9933

ggplot(divvy_preds, aes(x=.pred)) +
  geom_density()

Let’s try to improve the model now

divvy_rec <- recipe(rides ~ ., data=train) %>%
  step_mutate(hour = factor(hour(started_hour)),
              bad_weather = if_else(solar_rad <= 5 & temp <= 5, 1, 0),
              nice_weather = if_else(solar_rad >= 25 & temp >= 15, 1, 0)) %>%
  step_date(started_hour, features=c("dow", "month")) %>%
  step_holiday(started_hour, holidays = timeDate::listHolidays("US"),
               keep_original_cols = FALSE) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_corr(all_predictors())

divvy_workflow <-
  workflow() %>%
  add_model(lm_model) %>%
  add_recipe(divvy_rec)

divvy_workflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_mutate()
## • step_date()
## • step_holiday()
## • step_unknown()
## • step_dummy()
## • step_corr()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

divvy_fit <- divvy_workflow %>%
  fit(data=train)

divvy_preds <- 
  augment(divvy_fit, test, new_data =  test)


yardstick::mape(divvy_preds, 
     truth = rides,
     estimate = .pred)

ABCDEFGHIJ0123456789

.metric <chr>	.estimator <chr>	.estimate <dbl>
mape	standard	125.1599

yardstick::rmse(divvy_preds, 
     truth = rides,
     estimate = .pred)

ABCDEFGHIJ0123456789

.metric <chr>	.estimator <chr>	.estimate <dbl>
rmse	standard	339.5152

ggplot(divvy_preds, aes(x=.pred)) +
  geom_density()

Using decision trees

tree_model <-
  parsnip::decision_tree(tree_depth=5) %>%
  set_engine("rpart") %>%
  set_mode('regression')

divvy_workflow <-
  workflow() %>%
  add_model(tree_model) %>%
  add_recipe(divvy_rec)

divvy_workflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_mutate()
## • step_date()
## • step_holiday()
## • step_unknown()
## • step_dummy()
## • step_corr()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (regression)
## 
## Main Arguments:
##   tree_depth = 5
## 
## Computational engine: rpart

divvy_fit <- divvy_workflow %>%
  fit(data=train)

divvy_preds <- 
  augment(divvy_fit, test)


yardstick::mape(divvy_preds, 
     truth = rides,
     estimate = .pred)

ABCDEFGHIJ0123456789

.metric <chr>	.estimator <chr>	.estimate <dbl>
mape	standard	421.5899

yardstick::rmse(divvy_preds, 
     truth = rides,
     estimate = .pred)

ABCDEFGHIJ0123456789

.metric <chr>	.estimator <chr>	.estimate <dbl>
rmse	standard	439.9574

ggplot(divvy_preds, aes(x=.pred)) +
  geom_density()

tree_fit <- divvy_fit %>% 
  extract_fit_engine()
rpart.plot::rpart.plot(tree_fit)

Week 5

1 Tidymodels

1.1 Workflows

1.2 Recipies