Today we are going to review the traditional base R approach to linear regression and then reapply that framework in a simplified version of the tidymodels pipeline.
Let’s look at our Divvy data. It has been augmented and aggregated so that each row contains the citywide number of rides in an hour along with information on traffic and weather.
library(tidyverse)
library(tidymodels)
library(lubridate)

divvy_data <- read_csv('https://github.com/erhla/pa470spring2023/raw/main/static/lectures/week_3_data.csv')
glimpse(divvy_data)
## Rows: 8,760
## Columns: 8
## $ started_hour <dttm> 2021-01-01 00:00:00, 2021-01-01 01:00:00, 2021-01-01 02…
## $ rides <dbl> 35, 48, 57, 17, 14, 29, 48, 51, 49, 46, 83, 75, 55, 65, …
## $ avg_speed <dbl> 26.93289, 25.38051, 20.30257, 19.98055, 24.68098, 26.436…
## $ temp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ wind <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ humidity <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ solar_rad <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ interval_rain <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
There are a lot of missing values! Our weather data comes from the open data set on beaches; the traffic data derives from CTA bus speeds.
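A quick way to count the missing values in each column (a standard dplyr idiom, not part of the original data prep):
# count missing values per column
divvy_data %>%
  summarise(across(everything(), ~ sum(is.na(.x))))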
Let’s look at how the variables are related to each other using corrr from tidymodels.
divvy_data %>%
dplyr::select(-started_hour) %>%
corrr::correlate() %>%
corrr::fashion()
divvy_data %>%
select(-started_hour) %>%
corrr::correlate() %>%
corrr::rplot(colors='Brown')
We can see that rides are most correlated with temperature and solar radiation.
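As a spot check, we can pull a single pairwise correlation with base R (using "complete.obs" to drop the hours with missing weather):
# correlation between rides and temperature, ignoring rows with missing weather
cor(divvy_data$rides, divvy_data$temp, use = "complete.obs")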
How important might categorical variables be? Let’s run an ANOVA (analysis of variance) test on hour of day.
aov(rides ~ hour(started_hour), data=divvy_data) %>% broom::tidy()
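Note that hour() returns a number from 0 to 23, so the call above fits hour as a single numeric trend. To treat hour of day as a categorical grouping, you would wrap it in factor() (a variant on the call above, not what the original output shows):
# hour of day as a 24-level factor
aov(rides ~ factor(hour(started_hour)), data = divvy_data) %>% broom::tidy()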
What other time based trends should we consider?
ggplot(divvy_data, aes(x=started_hour, y=rides)) +
geom_smooth()
Starting with solar radiation, let’s look at what broom has to offer.
m1 <- lm(rides ~ solar_rad, data=divvy_data)
m1 %>% broom::tidy()
m1 %>% broom::glance()
m1 %>% broom::augment()
From glance() we can see that the R² is 0.274, which is modest.
One way to see what is going on is to check whether the residuals look normally distributed.
ggplot(m1 %>% augment(), aes(x=.resid)) +
geom_density(fill='navy', alpha=.6)
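A quantile-quantile plot is another common normality check (this uses ggplot2's stat_qq layers, which aren't part of the original walkthrough):
# QQ plot of the residuals against a normal reference line
ggplot(m1 %>% augment(), aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()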
How does solar radiation look?
ggplot(m1 %>% augment(), aes(x=solar_rad)) +
geom_density(fill='navy', alpha=.6)
ggplot(m1 %>% augment(), aes(x=log(solar_rad))) +
geom_density(fill='navy', alpha=.6)
m1_log <- lm(rides ~ log(solar_rad), data=divvy_data %>% filter(solar_rad > 0))
m1_log %>% broom::tidy()
m1_log %>% broom::glance()
m1_log %>% broom::augment()
ggplot(m1_log %>% augment(), aes(x=.std.resid)) +
geom_density(fill='navy', alpha=.6)
This is better, but we can’t really justify dropping the hours of darkness (the rows where solar_rad is zero).
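One alternative that keeps the dark hours is log1p(), which maps zero to zero instead of dropping those rows (an assumption about how you might handle darkness, not the approach used above):
# log(1 + solar_rad) is defined at zero, so no hours are dropped
m1_log1p <- lm(rides ~ log1p(solar_rad), data = divvy_data)
m1_log1p %>% broom::glance()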
Let’s add some time.
m2 <- lm(rides ~ solar_rad + factor(hour(started_hour)), data=divvy_data)
m2 %>% glance()
m3 <- lm(rides ~ solar_rad + factor(hour(started_hour)) +
factor(wday(started_hour)) +
factor(month(started_hour)) +
temp + wind + interval_rain + avg_speed, data=divvy_data)
m3 %>% glance()
ggplot(m3 %>% augment(), aes(x=.std.resid)) +
geom_density(fill='navy', alpha=.6)
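Because glance() returns a one-row tibble per model, it is easy to compare the three fits side by side; a quick sketch with purrr:
# stack the one-row glance() summaries for m1, m2, and m3
list(m1 = m1, m2 = m2, m3 = m3) %>%
  purrr::map_dfr(broom::glance, .id = "model") %>%
  dplyr::select(model, r.squared, adj.r.squared, sigma, AIC)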
tidymodels is a collection of packages, analogous to the tidyverse (which is ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, plus lubridate, dbplyr, DBI, rvest, readxl, etc.). Let’s briefly look at what we get with tidymodels.
The core tidymodels packages work together to enable a wide variety of modeling approaches:
tidymodels is a meta-package that installs and loads the core packages listed below that you need for modeling and machine learning.
rsample provides infrastructure for efficient data splitting and resampling.
parsnip is a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.
recipes is a tidy interface to data pre-processing tools for feature engineering.
workflows bundle your pre-processing, modeling, and post-processing together.
tune helps you optimize the hyperparameters of your model and pre-processing steps.
yardstick measures the effectiveness of models using performance metrics.
broom converts the information in common statistical R objects into user-friendly, predictable formats.
dials creates and manages tuning parameters and parameter grids.
There’s a bunch of additional packages including corrr and more specialized models (like spatialsample).
There are three main metric types in yardstick: class, class probability, and numeric. Each type of metric has standardized argument syntax, and all metrics return the same kind of output (a tibble with 3 columns). This standardization allows metrics to easily be grouped together and used with grouped data frames for computing on multiple resamples at once. Below are the three types of metrics, along with the types of the inputs they take.
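In short: class metrics take a factor truth column and a factor estimate (for example, accuracy); class probability metrics take a factor truth column plus numeric probability columns (for example, roc_auc); and numeric metrics take a numeric truth column and a numeric estimate (for example, rmse and mape, which we use below).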
Let’s construct a basic tidymodels pipeline that trains a regression model and uses it to score the test set. The key components:
grouped <- rsample::initial_split(divvy_data)
train <- training(grouped)
test <- testing(grouped)
lm_model <-
parsnip::linear_reg() %>%
set_engine("lm") %>%
fit(rides ~ solar_rad + factor(hour(started_hour)) +
factor(wday(started_hour)) +
factor(month(started_hour)) +
temp + wind + interval_rain + avg_speed, data=train)
preds <-
predict(lm_model, test %>% filter(month(started_hour) >= 5))
test_preds <- lm_model %>%
augment(test %>% filter(month(started_hour) >=5))
mape (mean absolute percentage error): the average percent difference between the forecast and actual values
rmse (root mean square error): the standard deviation of the residuals
yardstick::mape(test_preds,
truth = rides,
estimate = .pred)
yardstick::rmse(test_preds,
truth = rides,
estimate = .pred)
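With tidymodels loaded, yardstick can also bundle several metrics into a single call via metric_set(); a quick sketch (the particular metric choices here are ours):
# compute mape, rmse, and R-squared in one call
reg_metrics <- metric_set(mape, rmse, rsq)
reg_metrics(test_preds, truth = rides, estimate = .pred)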
ggplot(test_preds, aes(x=.pred)) +
geom_density()
This model isn’t great. Can you improve it?
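One direction is to express the same model with recipes and workflows, which makes it easy to bolt on preprocessing such as imputing the missing weather values. A rough sketch, assuming the column names above (the particular steps, and keeping hour numeric rather than as a factor, are our choices, not a prescribed recipe):
# recipe: derive day-of-week/month/hour features, drop the raw timestamp, impute missing weather
divvy_rec <- recipe(rides ~ ., data = train) %>%
  step_date(started_hour, features = c("dow", "month")) %>%
  step_time(started_hour, features = "hour") %>%
  step_rm(started_hour) %>%
  step_impute_mean(all_numeric_predictors())

# bundle the recipe with a linear model and fit on the training set
divvy_wf <- workflow() %>%
  add_recipe(divvy_rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

divvy_fit <- fit(divvy_wf, data = train)
divvy_fit %>% augment(test) %>% yardstick::rmse(truth = rides, estimate = .pred)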
Now let’s switch gears and see how broom and purrr work with clustering. First, simulate three clusters of points:
set.seed(27)
centers <- tibble(
cluster = factor(1:3),
num_points = c(100, 150, 50), # number points in each cluster
x1 = c(5, 0, -3), # x1 coordinate of cluster center
x2 = c(-1, 1, -2) # x2 coordinate of cluster center
)
labelled_points <-
centers %>%
mutate(
x1 = map2(num_points, x1, rnorm),
x2 = map2(num_points, x2, rnorm)
) %>%
select(-num_points) %>%
unnest(cols = c(x1, x2))
ggplot(labelled_points, aes(x1, x2, color = cluster)) +
geom_point(alpha = 0.5)
points <-
labelled_points %>%
select(-cluster)
kclust <- kmeans(points, centers = 3)
kclust
## K-means clustering with 3 clusters of sizes 148, 51, 101
##
## Cluster means:
## x1 x2
## 1 0.08853475 1.045461
## 2 -3.14292460 -2.000043
## 3 5.00401249 -1.045811
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 298.9415 108.8112 243.2092
## (between_SS / total_SS = 82.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
summary(kclust)
## Length Class Mode
## cluster 300 -none- numeric
## centers 6 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
cluster (300 values) contains information about each point
centers, withinss, and size (3 values) contain information about each cluster
totss, tot.withinss, betweenss, and iter (1 value) contain information about the full clustering
augment(kclust, points)
tidy(kclust)
glance(kclust)
While these summaries are useful, they would not have been too difficult to extract from the data set yourself. The real power comes from combining these analyses with other tools like dplyr.
Let’s say we want to explore the effect of different choices of k, from 1 to 9, on this clustering. First cluster the data 9 times, each time using a different value of k, then create columns containing the tidied, glanced, and augmented data:
kclusts <-
tibble(k = 1:9) %>%
mutate(
kclust = map(k, ~kmeans(points, .x)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, points)
)
kclusts
We can turn these into three separate data sets, each representing a different kind of output: one using tidy(), one using augment(), and one using glance().
clusters <-
kclusts %>%
unnest(cols = c(tidied))
assignments <-
kclusts %>%
unnest(cols = c(augmented))
clusterings <-
kclusts %>%
unnest(cols = c(glanced))
Now we can plot the original points using the data from augment(), with each point colored according to the predicted cluster.
p1 <-
ggplot(assignments, aes(x = x1, y = x2)) +
geom_point(aes(color = .cluster), alpha = 0.8) +
facet_wrap(~ k)
p1
Already we get a good sense of the proper number of clusters (3), and of how the k-means algorithm behaves when k is too high or too low. We can then add the cluster centers using the data from tidy():
p2 <- p1 + geom_point(data = clusters, size = 10, shape = "x")
p2
The data from glance() fills a different but equally important purpose; it lets us view trends of some summary statistics across values of k. Of particular interest is the total within-cluster sum of squares, saved in the tot.withinss column.
ggplot(clusterings, aes(k, tot.withinss)) +
geom_line() +
geom_point()
This represents the variance within the clusters. It decreases as k increases, but notice a bend (or “elbow”) around k = 3. This bend indicates that additional clusters beyond the third have little value.
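If it helps, you can mark that candidate elbow directly on the scree plot (picking k = 3 is our reading of the bend, not something the code computes):
# annotate the suspected elbow at k = 3
ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point() +
  geom_vline(xintercept = 3, linetype = "dashed")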