1 Regressions Review

2 Regression Prediction

2.1 Tidymodels Primer
2.2 yardstick
2.3 Getting Predictions / First Pipeline

2.3.1 Split Data
2.3.2 Model Framework
2.3.3 Predictions
2.3.4 yardstick / evaluate model

3 Unsupervised

3.1 Exploratory clustering

Today we are going to review the traditional base R methods of linear regression and then apply that framework to a simplified version of the tidymodels pipeline.

1 Regressions Review

Let’s look at our Divvy data. It has been augmented and aggregated so that each row gives the number of rides citywide in one hour, along with traffic and weather information for that hour.

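library(tidyverse)   # read_csv, dplyr, ggplot2, purrr, tidyr
library(lubridate)   # hour(), wday(), month()
library(tidymodels)  # broom, rsample, parsnip, yardstick

divvy_data <- read_csv('https://github.com/erhla/pa470spring2023/raw/main/static/lectures/week_3_data.csv')

glimpse(divvy_data)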

## Rows: 8,760
## Columns: 8
## $ started_hour  <dttm> 2021-01-01 00:00:00, 2021-01-01 01:00:00, 2021-01-01 02…
## $ rides         <dbl> 35, 48, 57, 17, 14, 29, 48, 51, 49, 46, 83, 75, 55, 65, …
## $ avg_speed     <dbl> 26.93289, 25.38051, 20.30257, 19.98055, 24.68098, 26.436…
## $ temp          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ wind          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ humidity      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ solar_rad     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ interval_rain <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

There are a lot of missing values! Our weather data comes from the open data set on beaches. The traffic variable is derived from data on CTA bus speeds.

Let’s look at how the variables are related to each other using corrr from tidymodels.

divvy_data %>%
  dplyr::select(-started_hour) %>%
  corrr::correlate() %>%
  corrr::fashion()

term           rides  avg_speed  temp  wind  humidity  solar_rad  interval_rain
rides                 -.17       .47   -.00  -.16      .52        -.08
avg_speed      -.17              -.06  -.10  -.00      -.21       -.09
temp           .47    -.06             -.15  .11       .32        .01
wind           -.00   -.10       -.15        -.01      .04        .05
humidity       -.16   -.00       .11   -.01            -.17       .17
solar_rad      .52    -.21       .32   .04   -.17                 -.07
interval_rain  -.08   -.09       .01   .05   .17       -.07

divvy_data %>%
  select(-started_hour) %>%
  corrr::correlate() %>%
  corrr::rplot(colors='Brown')

We can see that rides is most strongly correlated with solar radiation (.52) and temperature (.47).
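With this many missing values, it matters how the correlations treat NA. corrr::correlate() defaults to use = "pairwise.complete.obs", so each pair of columns is correlated over the rows where both are present. The base R sketch below shows the same idea on a made-up data frame (the values and the name toy are illustrative, not the Divvy data):

```r
# Toy data with missing values (illustrative only, not the Divvy data)
toy <- data.frame(
  rides     = c(35, 48, 57, 17, 14, 29),
  temp      = c(NA, NA, 50, 41, 39, 45),
  avg_speed = c(26.9, 25.4, 20.3, 20.0, 24.7, 26.4)
)

# Default cor() propagates NA: any pair involving temp is NA
cor_default <- cor(toy)

# Pairwise-complete: each pair uses the rows where both values are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_default["rides", "temp"]   # NA
cor_pairwise["rides", "temp"]  # computed from the 4 rows where temp is present
```

One consequence is that different cells of the correlation table can be based on different numbers of rows.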

How important might categorical variables be? Let’s run an ANOVA (analysis of variance) test on hour of day.

aov(rides ~ hour(started_hour), data=divvy_data) %>% broom::tidy()

term <chr>	df <dbl>	sumsq <dbl>	meansq <dbl>	statistic <dbl>	p.value <dbl>
hour(started_hour)	1	556927583	556927583	1409.034	4.729126e-286
Residuals	8746	3456899855	395255	NA	NA
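Note that the table above reports df = 1 for hour, which means hour was treated as a number (a single slope) rather than as a category. A minimal sketch on simulated data shows the difference when the hour is wrapped in factor(). The data frame sim and its columns are hypothetical stand-ins, not the Divvy data:

```r
set.seed(1)
# Simulated hourly data (hypothetical): 30 days x 24 hours with a midday peak
sim <- data.frame(hr = rep(0:23, times = 30))
sim$rides <- 500 - 3 * (sim$hr - 14)^2 + rnorm(nrow(sim), sd = 50)

# Numeric hour: one slope, so the term uses a single degree of freedom
fit_numeric <- aov(rides ~ hr, data = sim)

# Factor hour: one mean per hour, so the term uses 24 - 1 = 23 degrees of freedom
fit_factor <- aov(rides ~ factor(hr), data = sim)

df_numeric <- summary(fit_numeric)[[1]]$Df[1]
df_factor  <- summary(fit_factor)[[1]]$Df[1]
c(numeric = df_numeric, factor = df_factor)
```

On the Divvy data the analogous call would be aov(rides ~ factor(hour(started_hour)), data = divvy_data); its output is not shown here.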

What other time based trends should we consider?

ggplot(divvy_data, aes(x=started_hour, y=rides)) +
  geom_smooth()

Starting with solar radiation, let’s look at what broom has to offer.

m1 <- lm(rides ~ solar_rad, data=divvy_data)

m1 %>% broom::tidy()

term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	599.099273	10.06078541	59.54796	0
solar_rad	1.695705	0.03790607	44.73438	0

m1 %>% broom::glance()

r.squared <dbl>	adj.r.squared <dbl>	sigma <dbl>	statistic <dbl>	p.value <dbl>	df <dbl>	logLik <dbl>	AIC <dbl>	BIC <dbl>
0.2739009	0.273764	616.3888	2001.165	0	1	-41620.83	83247.65	83267.38

m1 %>% broom::augment()

.rownames <chr>	rides <dbl>	solar_rad <dbl>	.fitted <dbl>	.resid <dbl>	.hat <dbl>	.sigma <dbl>	.cooksd <dbl>
3280	986	340.0	1175.6389	-1.896389e+02	0.0003343144	616.4415	1.583291e-05
3281	1268	283.0	1078.9837	1.890163e+02	0.0002619253	616.4415	1.232151e-05
3282	1792	220.0	972.1543	8.198457e+02	0.0002105073	616.3441	1.862836e-04
3283	1541	145.0	844.9764	6.960236e+02	0.0001884378	616.3729	1.201822e-04
3284	1231	155.0	861.9335	3.690665e+02	0.0001889222	616.4261	3.387793e-05
3285	804	9.0	614.3606	1.896394e+02	0.0002569435	616.4415	1.216685e-05
3286	608	2.0	602.4907	5.509318e+00	0.0002642551	616.4469	1.056110e-08
3287	329	2.0	602.4907	-2.734907e+02	0.0002642551	616.4355	2.602547e-05
3288	201	2.0	602.4907	-4.014907e+02	0.0002642551	616.4223	5.608728e-05
3289	145	2.0	602.4907	-4.574907e+02	0.0002642551	616.4149	7.282458e-05

From glance() we can see that R² is 0.274, which is modest: solar radiation alone explains about 27% of the variance in hourly rides.
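To make the r.squared column concrete, here is a small sketch that computes R² by hand as one minus the ratio of residual to total sum of squares, and checks it against what lm() reports. The data are simulated; x and y are hypothetical stand-ins for solar_rad and rides:

```r
set.seed(42)
# Simulated predictor and response (hypothetical stand-ins for solar_rad and rides)
x <- runif(200, 0, 800)
y <- 600 + 1.7 * x + rnorm(200, sd = 600)

fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)    # total variation around the mean
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_manual, summary(fit)$r.squared)
```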

One way to see what is going on is to check whether the residuals look normally distributed.

ggplot(m1 %>% augment(), aes(x=.resid)) +
  geom_density(fill='navy', alpha=.6)

How is solar radiation itself distributed?

ggplot(m1 %>% augment(), aes(x=solar_rad)) +
  geom_density(fill='navy', alpha=.6)

ggplot(m1 %>% augment(), aes(x=log(solar_rad))) +
  geom_density(fill='navy', alpha=.6)

m1_log <- lm(rides ~ log(solar_rad), data=divvy_data %>% filter(solar_rad > 0))

m1_log %>% broom::tidy()

term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	297.8392	16.24530	18.33388	2.846897e-72
log(solar_rad)	187.7363	3.85289	48.72609	0.000000e+00

m1_log %>% broom::glance()

r.squared <dbl>	adj.r.squared <dbl>	sigma <dbl>	statistic <dbl>	p.value <dbl>	df <dbl>	logLik <dbl>	AIC <dbl>	BIC <dbl>
0.362576	0.3624233	600.3339	2374.232	0	1	-32640.39	65286.78	65305.79

m1_log %>% broom::augment()

rides <dbl>	log(solar_rad) <dbl>	.fitted <dbl>	.resid <dbl>	.hat <dbl>	.sigma <dbl>	.cooksd <dbl>
986	5.8289456	1392.1437	-406.1436580	0.0004708223	600.3729	1.078474e-04
1268	5.6454469	1357.6943	-89.6942958	0.0004363831	600.4042	4.874847e-06
1792	5.3936275	1310.4187	481.5813258	0.0003936367	600.3595	1.267538e-04
1541	4.9767337	1232.1526	308.8474066	0.0003343519	600.3868	4.427579e-05
1231	5.0434251	1244.6730	-13.6729821	0.0003428738	600.4058	8.899034e-08
804	2.1972246	710.3380	93.6620438	0.0003050345	600.4041	3.714714e-06
608	0.6931472	427.9681	180.0318992	0.0005545481	600.3994	2.496343e-05
329	0.6931472	427.9681	-98.9681008	0.0005545481	600.4039	7.543897e-06
201	0.6931472	427.9681	-226.9681008	0.0005545481	600.3955	3.967664e-05
145	0.6931472	427.9681	-282.9681008	0.0005545481	600.3898	6.167089e-05

ggplot(m1_log %>% augment(), aes(x=.std.resid)) +
  geom_density(fill='navy', alpha=.6)

This is better, but the log model only uses hours with solar radiation above zero, and we can’t really justify dropping darkness.
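One possible way to keep the dark hours, offered here as an assumption rather than the approach used in this lecture, is log1p(), which computes log(1 + x) and maps zero to zero. A sketch on simulated data (the vectors solar and rides are made up):

```r
set.seed(7)
# Simulated data (hypothetical): a third of the hours are dark (solar = 0)
solar <- c(rep(0, 100), runif(200, 1, 800))
rides <- 300 + 190 * log1p(solar) + rnorm(300, sd = 100)
sim <- data.frame(solar, rides)

log(0)    # -Inf, which lm() cannot use
log1p(0)  # 0, so dark hours can stay in the model

# log() needs the zero rows filtered out; log1p() does not
fit_log   <- lm(rides ~ log(solar),   data = subset(sim, solar > 0))
fit_log1p <- lm(rides ~ log1p(solar), data = sim)

c(log = nobs(fit_log), log1p = nobs(fit_log1p))
```

Another option would be a separate indicator variable for darkness. Either way the two models are fit to different rows, so their fit statistics are not directly comparable.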

Let’s add some time variables.

m2 <- lm(rides ~ solar_rad + factor(hour(started_hour)), data=divvy_data)

m2 %>% glance()

r.squared <dbl>	adj.r.squared <dbl>	sigma <dbl>	statistic <dbl>	p.value <dbl>	df <dbl>	logLik <dbl>	AIC <dbl>	BIC <dbl>
0.6537618	0.6521885	426.5676	415.558	0	24	-39655.75	79363.51	79534.5

m3 <- lm(rides ~ solar_rad + factor(hour(started_hour)) + 
           factor(wday(started_hour)) +
           factor(month(started_hour)) +
           temp + wind + interval_rain + avg_speed, data=divvy_data)

m3 %>% glance()

r.squared <dbl>	adj.r.squared <dbl>	sigma <dbl>	statistic <dbl>	p.value <dbl>	df <dbl>	logLik <dbl>	AIC <dbl>	BIC <dbl>
0.7745487	0.7726027	342.5185	398.0207	0	41	-34746.13	69578.27	69856.68

ggplot(m3 %>% augment(), aes(x=.std.resid)) +
  geom_density(fill='navy', alpha=.6)
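A caution when comparing the glance() rows above: lm() silently drops any row with a missing value in a variable the model uses, so models with more predictors may be fit to fewer rows. logLik, AIC, and BIC are only comparable across models fit to the same rows. The sketch below, on simulated data with hypothetical column names, shows how to check the row counts and refit on a common set of complete cases:

```r
set.seed(3)
n <- 500
sim <- data.frame(
  solar = runif(n, 0, 800),
  temp  = rnorm(n, 60, 15)
)
sim$rides <- 400 + 1.5 * sim$solar + 8 * sim$temp + rnorm(n, sd = 200)
sim$temp[1:150] <- NA   # pretend the temperature sensor was offline

small <- lm(rides ~ solar, data = sim)
big   <- lm(rides ~ solar + temp, data = sim)
c(small = nobs(small), big = nobs(big))   # the two fits use different rows

# Refit both on the rows that are complete for every variable used
complete <- sim[complete.cases(sim), ]
small_cc <- lm(rides ~ solar, data = complete)
big_cc   <- lm(rides ~ solar + temp, data = complete)

anova(small_cc, big_cc)   # F-test for the added term
AIC(small_cc, big_cc)     # comparable, since both fits use the same rows
```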

2 Regression Prediction

2.1 Tidymodels Primer

Tidymodels is a collection of packages, much like the tidyverse (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, plus lubridate, dbplyr, DBI, rvest, readxl, etc.). Let’s briefly look at what tidymodels offers.

Tidymodels packages

The core tidymodels packages work together to enable a wide variety of modeling approaches:

tidymodels is a meta-package that installs and loads the core packages listed below that you need for modeling and machine learning.

rsample provides infrastructure for efficient data splitting and resampling.

parsnip is a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.

recipes is a tidy interface to data pre-processing tools for feature engineering.

workflows bundles your pre-processing, modeling, and post-processing together.

tune helps you optimize the hyperparameters of your model and pre-processing steps.

yardstick measures the effectiveness of models using performance metrics.

broom converts the information in common statistical R objects into user-friendly, predictable formats.

dials creates and manages tuning parameters and parameter grids.

There are a number of additional packages, including corrr and more specialized ones (like spatialsample, for spatial resampling).

2.2 yardstick

Metric types

There are three main metric types in yardstick: class, class probability, and numeric. Each type of metric has standardized argument syntax, and all metrics return the same kind of output (a tibble with 3 columns). This standardization allows metrics to easily be grouped together and used with grouped data frames for computing on multiple resamples at once. Below are the three types of metrics, along with the types of the inputs they take.

Class metrics (hard predictions)
- truth - factor
- estimate - factor
Class probability metrics (soft predictions)
- truth - factor
- estimate - multiple numeric columns containing class probabilities
Numeric metrics
- truth - numeric
- estimate - numeric
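The input types above can be illustrated with a short sketch that calls one numeric metric and one class metric from yardstick on made-up predictions. The data frames numeric_preds and class_preds and their values are hypothetical:

```r
library(yardstick)

# Toy predictions (illustrative values only)
numeric_preds <- data.frame(
  truth    = c(10, 20, 30, 40),
  estimate = c(12, 18, 33, 37)
)
class_preds <- data.frame(
  truth    = factor(c("a", "b", "a", "b"), levels = c("a", "b")),
  estimate = factor(c("a", "b", "b", "b"), levels = c("a", "b"))
)

# Numeric metric: truth and estimate are both numeric columns
rmse_tbl <- rmse(numeric_preds, truth = truth, estimate = estimate)

# Class metric: truth and estimate are both factors with the same levels
acc_tbl <- accuracy(class_preds, truth = truth, estimate = estimate)

# Both results share the columns .metric, .estimator, .estimate
rbind(rmse_tbl, acc_tbl)
```

Because every metric returns the same three columns, results from different metrics can be stacked into one table as shown in the last line.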

2.3 Getting Predictions / First Pipeline

Let’s construct a basic tidymodels pipeline. This pipeline trains a regression model on the training set and then uses the trained model to score the test set. Key components:

rsample (splitting data)
parsnip (linear_reg, set_engine, set_mode, fit)
yardstick (mape, rmse)
broom (augment).

2.3.1 Split Data

grouped <- rsample::initial_split(divvy_data)

train <- training(grouped)
test  <- testing(grouped)
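initial_split() assigns rows to the two sets at random, with 3/4 of the rows going to training by default, so the metrics further down will change from run to run unless a seed is set. A minimal sketch with a seed and an explicit proportion follows. The toy data frame and the seed value are arbitrary choices for illustration, not part of the original pipeline:

```r
library(rsample)

set.seed(470)   # arbitrary seed; makes the random split reproducible

# Toy data frame standing in for divvy_data (hypothetical)
toy <- data.frame(id = 1:100, rides = rpois(100, lambda = 50))

split <- initial_split(toy, prop = 0.75)   # 0.75 is also the default proportion

train_toy <- training(split)
test_toy  <- testing(split)

c(train = nrow(train_toy), test = nrow(test_toy))
```

For hourly data like this, a purely random split mixes past and future hours; rsample also provides initial_time_split() for splitting in time order, which may be worth considering.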

2.3.2 Model Framework

lm_model <-
  parsnip::linear_reg() %>%
  set_engine("lm") %>%
  fit(rides ~ solar_rad + factor(hour(started_hour)) + 
           factor(wday(started_hour)) +
           factor(month(started_hour)) +
           temp + wind + interval_rain + avg_speed, data=train)

2.3.3 Predictions

preds <- 
  predict(lm_model, test %>% filter(month(started_hour) >= 5)) 

test_preds <- lm_model %>% 
  augment(test %>% filter(month(started_hour) >=5))
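The code above scores only rows from May onward. A plausible reason, which is an inference and not stated in the source, is that the glimpse() output shows the weather columns missing at the start of 2021, and a linear model cannot produce a prediction for a row with a missing predictor. A base R sketch of that behavior on simulated data (column names are hypothetical):

```r
set.seed(11)
# Simulated data (hypothetical): temp is missing for the first 40 rows
sim <- data.frame(temp = c(rep(NA, 40), rnorm(160, 60, 15)))
sim$rides <- 200 + 10 * ifelse(is.na(sim$temp), 60, sim$temp) + rnorm(200, sd = 50)

fit <- lm(rides ~ temp, data = sim)   # rows with NA temp are dropped when fitting
nobs(fit)

preds <- predict(fit, newdata = sim)  # one prediction per row of newdata
sum(is.na(preds))                     # rows with missing temp get NA predictions
```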

2.3.4 yardstick / evaluate model

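mape (mean absolute percentage error): the average absolute difference between prediction and actual, expressed as a percentage of the actual value.
rmse (root mean squared error): the square root of the average squared residual, in the same units as rides.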
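Both metrics are simple enough to compute by hand, which helps when interpreting them. The sketch below uses made-up numbers (truth and estimate are illustrative vectors, not model output) and shows how a single small actual value can dominate the percentage error:

```r
# Made-up actual and predicted values (illustrative only)
truth    <- c(100, 200, 50, 4)
estimate <- c(110, 180, 60, 12)

# Mean absolute percentage error, in percent
mape_manual <- mean(abs((truth - estimate) / truth)) * 100

# Root mean squared error, in the units of the outcome
rmse_manual <- sqrt(mean((truth - estimate)^2))

# Per-row percentage errors: the row with truth = 4 contributes 200% on its own
abs((truth - estimate) / truth) * 100

c(mape = mape_manual, rmse = rmse_manual)
```

Hours with very few rides have the same effect on the real data, so a large mape does not by itself mean the predictions are far off in absolute terms.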

yardstick::mape(test_preds, 
     truth = rides,
     estimate = .pred)

.metric <chr>	.estimator <chr>	.estimate <dbl>
mape	standard	121.2486

yardstick::rmse(test_preds, 
     truth = rides,
     estimate = .pred)

.metric <chr>	.estimator <chr>	.estimate <dbl>
rmse	standard	341.6638

ggplot(test_preds, aes(x=.pred)) +
  geom_density()

This model isn’t great. Can you improve it?

3 Unsupervised

Tidymodels K Means

set.seed(27)

centers <- tibble(
  cluster = factor(1:3), 
  num_points = c(100, 150, 50),  # number points in each cluster
  x1 = c(5, 0, -3),              # x1 coordinate of cluster center
  x2 = c(-1, 1, -2)              # x2 coordinate of cluster center
)

labelled_points <- 
  centers %>%
  mutate(
    x1 = map2(num_points, x1, rnorm),
    x2 = map2(num_points, x2, rnorm)
  ) %>% 
  select(-num_points) %>% 
  unnest(cols = c(x1, x2))

ggplot(labelled_points, aes(x1, x2, color = cluster)) +
  geom_point(alpha = 0.5)

points <- 
  labelled_points %>% 
  select(-cluster)

kclust <- kmeans(points, centers = 3)
kclust

## K-means clustering with 3 clusters of sizes 148, 51, 101
## 
## Cluster means:
##            x1        x2
## 1  0.08853475  1.045461
## 2 -3.14292460 -2.000043
## 3  5.00401249 -1.045811
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 298.9415 108.8112 243.2092
##  (between_SS / total_SS =  82.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

summary(kclust)

##              Length Class  Mode   
## cluster      300    -none- numeric
## centers        6    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

cluster (300 values) contains information about each point
centers, withinss, and size (3 values) contain information about each cluster
totss, tot.withinss, betweenss, and iter (1 value) contain information about the full clustering
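These components can be read directly off the fitted object with $. The sketch below regenerates three blobs around the same centers using base R only (so the individual points differ from the ones above) and checks that the total sum of squares splits into within- and between-cluster parts. The names pts and km, and the nstart value, are choices made for this illustration:

```r
set.seed(27)
# Three simulated blobs around the same centers as above (base R only)
pts <- rbind(
  cbind(rnorm(100,  5), rnorm(100, -1)),
  cbind(rnorm(150,  0), rnorm(150,  1)),
  cbind(rnorm(50,  -3), rnorm(50,  -2))
)
km <- kmeans(pts, centers = 3, nstart = 10)

length(km$cluster)   # one label per point
dim(km$centers)      # one row per cluster, one column per coordinate
km$size              # number of points in each cluster

# The total sum of squares splits into within- and between-cluster parts
all.equal(km$totss, km$tot.withinss + km$betweenss)
```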

augment(kclust, points)

x1 <dbl>	x2 <dbl>	.cluster <fct>
6.90716256	-2.740711244	3
6.14487689	-2.448710956	3
4.23546926	-0.945931996	3
3.54256750	0.286809836	3
3.90653112	0.407969640	3
5.29524122	-1.578524032	3
5.00688594	-1.772652539	3
6.15741089	-1.683247027	3
7.13463789	-2.165246682	3
5.23784461	-2.423052818	3

tidy(kclust)

x1 <dbl>	x2 <dbl>	size <int>	withinss <dbl>	cluster <fct>
0.08853475	1.045461	148	298.9415	1
-3.14292460	-2.000043	51	108.8112	2
5.00401249	-1.045811	101	243.2092	3

glance(kclust)

totss <dbl>	tot.withinss <dbl>	betweenss <dbl>	iter <int>
3724.125	650.9619	3073.163	2

3.1 Exploratory clustering

While these summaries are useful, they would not have been too difficult to extract from the data set yourself. The real power comes from combining these analyses with other tools like dplyr.

Let’s say we want to explore the effect of different choices of k, from 1 to 9, on this clustering. First cluster the data 9 times, each using a different value of k, then create columns containing the tidied, glanced and augmented data:

kclusts <- 
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~kmeans(points, .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, points)
  )

kclusts

k <int>	kclust <list>	tidied <list>	glanced <list>	augmented <list>
1	<S3: kmeans>	<tibble[,5]>	<tibble[,4]>	<tibble[,3]>
2	<S3: kmeans>	<tibble[,5]>	<tibble[,4]>	<tibble[,3]>
3	<S3: kmeans>	<tibble[,5]>	<tibble[,4]>	<tibble[,3]>
4	<S3: kmeans>	<tibble[,5]>	<tibble[,4]>	<tibble[,3]>
5	<S3: kmeans>	<tibble[,5]>	<tibble[,4]>	<tibble[,3]>
6	<S3: kmeans>	<tibble[,5]>	<tibble[,4]>	<tibble[,3]>
7	<S3: kmeans>	<tibble[,5]>	<tibble[,4]>	<tibble[,3]>
8	<S3: kmeans>	<tibble[,5]>	<tibble[,4]>	<tibble[,3]>
9	<S3: kmeans>	<tibble[,5]>	<tibble[,4]>	<tibble[,3]>

We can turn these into three separate data sets, one each from tidy(), augment(), and glance(), because each represents a different type of data.

clusters <- 
  kclusts %>%
  unnest(cols = c(tidied))

assignments <- 
  kclusts %>% 
  unnest(cols = c(augmented))

clusterings <- 
  kclusts %>%
  unnest(cols = c(glanced))

Now we can plot the original points using the data from augment(), with each point colored according to the predicted cluster.

p1 <- 
  ggplot(assignments, aes(x = x1, y = x2)) +
  geom_point(aes(color = .cluster), alpha = 0.8) + 
  facet_wrap(~ k)
p1

Already we get a good sense of the proper number of clusters (3), and how the k-means algorithm functions when k is too high or too low. We can then add the centers of the cluster using the data from tidy():

p2 <- p1 + geom_point(data = clusters, size = 10, shape = "x")
p2

The data from glance() fills a different but equally important purpose; it lets us view trends of some summary statistics across values of k. Of particular interest is the total within sum of squares, saved in the tot.withinss column.

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point()

This represents the variance within the clusters. It decreases as k increases, but notice a bend (or “elbow”) around k = 3. This bend indicates that additional clusters beyond the third have little value.
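The elbow can also be read numerically by looking at how much tot.withinss drops with each added cluster. The base R sketch below regenerates three blobs around the same centers (the points differ from those above) and uses nstart = 25, an addition not in the original code, because a single random start can land on a poor solution and make the curve noisier:

```r
set.seed(27)
# Three simulated blobs around the same centers as above (base R only)
pts <- rbind(
  cbind(rnorm(100,  5), rnorm(100, -1)),
  cbind(rnorm(150,  0), rnorm(150,  1)),
  cbind(rnorm(50,  -3), rnorm(50,  -2))
)

ks  <- 1:9
wss <- sapply(ks, function(k) kmeans(pts, centers = k, nstart = 25)$tot.withinss)

# Reduction in tot.withinss gained by each extra cluster
drops <- -diff(wss)
data.frame(k = ks[-1], drop = round(drops, 1))
```

The drops for k = 2 and k = 3 should be far larger than any later ones, which is the numeric counterpart of the bend in the plot.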

Week 3 Regression and KMeans

PA 470 Spring 2023

Eric Langowski
