1 What is prediction?

Causal inference aims to estimate treatment effects with a high degree of confidence; it answers why. Prediction, in contrast, answers who, what, when, and where: it anticipates what will happen.

“Given what the model has seen before, and assuming the new data follow the same paradigm, what will the outcomes be in this new dataset?”

2 Bias-Variance Trade-off

Accuracy: how close your prediction is to the true value
Error: how far your prediction is from the true value

\(Error = Reducible + Irreducible\)

\(Error = (Bias^2 + Variance) + Irreducible\)

Irreducible error: natural uncertainty/sampling error
Reducible error: squared bias + variance (under squared-error loss, the bias term enters squared)
Bias: difference between the model's average prediction and the theoretical true model (e.g. error due to erroneous assumptions)
Variance: how much the model's predictions change when it is trained on different samples; a high-variance model learns random, irrelevant patterns in the data
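
The decomposition above can be written out for squared-error loss. This is a sketch under assumptions not stated in these notes: the outcome is generated as \(y = f(x) + \varepsilon\) with \(E[\varepsilon] = 0\) and \(Var(\varepsilon) = \sigma^2\), \(\hat{f}\) is the fitted model, and the expectation is taken over training samples and noise. The symbols \(f\), \(\hat{f}\), and \(\sigma^2\) are introduced here for illustration only.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{Bias^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{Variance}
  + \underbrace{\sigma^2}_{Irreducible}
```

The first two terms depend on the model and are the reducible part; the last term depends only on the noise in the data.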

2.1 The Trade-off

Underfit: high bias; the model misses real structure in the data (e.g. a straight line fitted to a quadratic process)
Overfit: high variance; the model fits the noise and therefore fails out of sample
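
The underfit/overfit contrast can be illustrated with a small simulation in base R. This is a minimal sketch, not part of the original notes: the quadratic data-generating process, the noise level, the sample sizes, and the polynomial degrees (1, 2, 10) are all assumptions chosen for illustration.

```r
# Simulate a noisy quadratic process (all settings here are illustrative).
set.seed(1)
simulate <- function(n) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = x^2 + rnorm(n, sd = 0.5))
}
train <- simulate(30)    # small training sample
test  <- simulate(1000)  # large sample standing in for "new data"

# Fit a polynomial of the given degree on fit_data and
# return its mean squared error on eval_data.
mse <- function(degree, fit_data, eval_data) {
  fit <- lm(y ~ poly(x, degree), data = fit_data)
  mean((eval_data$y - predict(fit, newdata = eval_data))^2)
}

degrees <- c(1, 2, 10)  # underfit, about right, flexible enough to overfit
results <- data.frame(
  degree    = degrees,
  train_mse = sapply(degrees, mse, fit_data = train, eval_data = train),
  test_mse  = sapply(degrees, mse, fit_data = train, eval_data = test)
)
results
```

Training error can only fall as the degree rises, because each polynomial nests the previous one. The straight line does poorly on both sets (bias), while a very flexible fit typically looks good in training and worse out of sample (variance).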

Figure 9.7

3 Objective/Loss functions

Table 9.3
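
As a concrete illustration of what a loss function computes, here is a minimal base R sketch of three widely used ones: root mean squared error, mean absolute error, and binary log loss. These may or may not match the entries in Table 9.3, which is not reproduced here. The function names (`my_rmse`, `my_mae`, `my_log_loss`) and the toy vectors are hypothetical, chosen to avoid clashing with the yardstick functions of similar names.

```r
# Root mean squared error: penalises large misses more heavily.
my_rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Mean absolute error: every unit of error counts the same.
my_mae <- function(truth, estimate) mean(abs(truth - estimate))

# Binary log loss: `truth` is coded 0/1, `prob` is the predicted P(truth = 1).
# Probabilities are clipped away from 0 and 1 to avoid log(0).
my_log_loss <- function(truth, prob, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)
  -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}

truth    <- c(3, 5, 2)
estimate <- c(2, 5, 4)
my_rmse(truth, estimate)  # sqrt((1 + 0 + 4) / 3), about 1.29
my_mae(truth, estimate)   # (1 + 0 + 2) / 3 = 1

my_log_loss(c(1, 0), c(0.5, 0.5))  # log(2): a coin-flip prediction
```

The choice of loss determines what the model is rewarded for: squared error is sensitive to outliers, absolute error less so, and log loss rewards well-calibrated probabilities rather than just correct labels.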

4 Cross Validation

Resampling example from tidymodels: https://www.tidymodels.org/start/resampling/

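library(tidymodels)  # attaches rsample, parsnip, yardstick, workflows, tune, dplyr

data(cells, package = "modeldata")

# Hold out a test set (25% by default), stratified on the outcome so that
# both sets keep a similar class balance. The seed value is arbitrary.
set.seed(123)
cell_split <- initial_split(cells %>% select(-case), 
                            strata = class)
cell_train <- training(cell_split)
cell_test  <- testing(cell_split)

# Class proportions in the training set
cell_train %>% 
  count(class) %>% 
  mutate(prop = n/sum(n))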

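# Random forest specification; the "ranger" engine requires the ranger package.
rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

# Fit on the training set only. The seed value is arbitrary.
set.seed(234)
rf_fit <- 
  rf_mod %>% 
  fit(class ~ ., data = cell_train)
rf_fit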

## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000,
##      num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1),
##      probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      1514 
## Number of independent variables:  56 
## Mtry:                             7 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1263639

# Training-set performance: optimistic, because the model has already seen these rows.
rf_training_pred <- 
  rf_fit %>% augment(cell_train)

rf_training_pred %>%                
  roc_auc(truth = class, .pred_PS)

rf_training_pred %>%                
  accuracy(truth = class, .pred_class)

# Test-set performance: a more honest estimate, computed on unseen data.
rf_testing_pred <- 
  rf_fit %>% augment(cell_test)

rf_testing_pred %>%                   
  roc_auc(truth = class, .pred_PS)

rf_testing_pred %>%                  
  accuracy(truth = class, .pred_class)

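# 10-fold cross-validation on the training set. The seed value is arbitrary.
set.seed(345)
folds <- rsample::vfold_cv(cell_train, v = 10)

rf_wf <- 
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)

# Fit the workflow once per fold, assessing each fit on its held-out fold.
set.seed(456)
rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds)

# Per-fold metrics
collect_metrics(rf_fit_rs, summarize = FALSE)

# Metrics averaged across the folds
collect_metrics(rf_fit_rs, summarize = TRUE)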

# Test-set performance again, for comparison with the resampled estimates
rf_testing_pred %>%                   
  roc_auc(truth = class, .pred_PS)

rf_testing_pred %>%                   
  accuracy(truth = class, .pred_class)
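
To make the mechanics of v-fold cross-validation explicit, here is a minimal base R sketch of the same idea done by hand. It is not how `vfold_cv()` and `fit_resamples()` are implemented; it only mirrors the logic. The built-in `mtcars` data, the linear model `mpg ~ wt + hp`, the choice of 5 folds, and RMSE as the metric are all stand-ins chosen so the example runs without extra packages.

```r
set.seed(42)  # arbitrary seed so the fold assignment is reproducible
k <- 5

# Randomly assign each row to one of k folds of (nearly) equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold.
fold_rmse <- sapply(seq_len(k), function(i) {
  analysis   <- mtcars[fold_id != i, ]
  assessment <- mtcars[fold_id == i, ]
  fit  <- lm(mpg ~ wt + hp, data = analysis)
  pred <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$mpg - pred)^2))
})

fold_rmse        # one out-of-sample estimate per fold
mean(fold_rmse)  # the cross-validated estimate
```

Every row is used for assessment exactly once and never by a model that was trained on it, which is why the averaged metric is a better guide to out-of-sample performance than the training-set metric.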

Week 6
