## Tuning in unsupervised settings

In *supervised modeling* scenarios, we observe values of a
target (or “response”) variable, and we measure the success of our model
based on how well it predicts future response values. To select
hyperparameter values, we **tune** them, trying many
possible values and measuring how well each performs when predicting
target values of test data.

In the *unsupervised modeling* setting of `tidyclust`, there is no such objective measure of success.
Clustering analyses are typically exploratory rather than testable.
Nonetheless, the core tuning principle of varying inputs and quantifying
results is still applicable.
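
As a minimal sketch of "varying inputs and quantifying results", the following uses only base R (`stats::kmeans()` on the built-in `faithful` data), not the `tidyclust` interface. The dataset, the range of `k`, and the choice of `nstart` are assumptions made for illustration.

```r
# Base-R sketch (not tidyclust): vary k, fit, and record one summary per fit.
set.seed(123)
x <- scale(faithful)  # built-in two-column dataset, standardized

ks <- 1:6
wss_ratio <- sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 10)
  # within-cluster sum of squares as a share of the total sum of squares
  fit$tot.withinss / fit$totss
})

data.frame(num_clusters = ks, wss_ratio = round(wss_ratio, 3))
```

Each row pairs one input (the number of clusters) with one quantified result, which is the same shape of information that a tuning run collects.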

## Specify and fit a model

In this example, we will fit a \(k\)-means cluster model to the
`palmerpenguins` dataset, using only the bill length and bill
depth of penguins as predictors.

(Please refer to the k-means vignette for an in-depth discussion of this model specification.)

Our goal will be to select an appropriate number of clusters for the model based on metrics.

First, we set up cross-validation samples for our data:

`penguins_cv <- vfold_cv(penguins, v = 5)`
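
To illustrate what a 5-fold split amounts to, here is a base-R sketch that assigns each row to one of five folds at random. It is not how `rsample` is implemented internally; the row count of 333 is inferred from the split sizes printed later in this section (266 + 67), and the seed is arbitrary.

```r
# Base-R sketch of 5-fold assignment; rsample's internals may differ.
set.seed(42)
n <- 333  # row count implied by the split sizes printed in this section
fold_id <- sample(rep(1:5, length.out = n))

# Each fold is held out once; the other four folds are used for fitting.
assessment_sizes <- as.vector(table(fold_id))
analysis_sizes <- n - assessment_sizes
rbind(analysis = analysis_sizes, assessment = assessment_sizes)
```

Three folds hold 67 rows and two hold 66, so the analysis sets have 266 or 267 rows.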

Next, we specify our model with a tuning parameter, make a workflow,
and establish a range of possible values of `num_clusters` to try:

```
# Model specification with the number of clusters left open for tuning
kmeans_spec <- k_means(num_clusters = tune())

# Use only bill length and bill depth as predictors
penguins_rec <- recipe(~ bill_length_mm + bill_depth_mm,
  data = penguins
)

kmeans_wflow <- workflow(penguins_rec, kmeans_spec)

# Candidate values of num_clusters to try
clust_num_grid <- grid_regular(num_clusters(),
  levels = 10
)

clust_num_grid
#> # A tibble: 10 × 1
#>    num_clusters
#>           <int>
#>  1            1
#>  2            2
#>  3            3
#>  4            4
#>  5            5
#>  6            6
#>  7            7
#>  8            8
#>  9            9
#> 10           10
```

Then, we can use `tune_cluster()` to compute metrics on
each cross-validation split, for each possible choice of number of
clusters.

```
res <- tune_cluster(
  kmeans_wflow,
  resamples = penguins_cv,
  grid = clust_num_grid,
  control = control_grid(save_pred = TRUE, extract = identity),
  metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio)
)

res
#> # Tuning results
#> # 5-fold cross-validation
#> # A tibble: 5 × 6
#>   splits           id    .metrics          .notes   .extracts .predictions
#>   <list>           <chr> <list>            <list>   <list>    <list>
#> 1 <split [266/67]> Fold1 <tibble [30 × 5]> <tibble> <tibble>  <tibble>
#> 2 <split [266/67]> Fold2 <tibble [30 × 5]> <tibble> <tibble>  <tibble>
#> 3 <split [266/67]> Fold3 <tibble [30 × 5]> <tibble> <tibble>  <tibble>
#> 4 <split [267/66]> Fold4 <tibble [30 × 5]> <tibble> <tibble>  <tibble>
#> 5 <split [267/66]> Fold5 <tibble [30 × 5]> <tibble> <tibble>  <tibble>
```

```
res_metrics <- res %>% collect_metrics()
res_metrics
#> # A tibble: 30 × 7
#>    num_clusters .metric          .estimator     mean     n std_err .config
#>           <int> <chr>            <chr>         <dbl> <int>   <dbl> <chr>
#>  1            1 sse_ratio        standard      1         5 0       Prepro…
#>  2            1 sse_total        standard   8971.        5 1.19e+2 Prepro…
#>  3            1 sse_within_total standard   8971.        5 1.19e+2 Prepro…
#>  4            2 sse_ratio        standard      0.321     5 1.41e-3 Prepro…
#>  5            2 sse_total        standard   8971.        5 1.19e+2 Prepro…
#>  6            2 sse_within_total standard   2885.        5 4.73e+1 Prepro…
#>  7            3 sse_ratio        standard      0.202     5 2.21e-3 Prepro…
#>  8            3 sse_total        standard   8971.        5 1.19e+2 Prepro…
#>  9            3 sse_within_total standard   1809.        5 3.47e+1 Prepro…
#> 10            4 sse_ratio        standard      0.160     5 5.79e-3 Prepro…
#> # … with 20 more rows
```

### Choosing hyperparameters

In supervised learning, we would choose the model with the best value of a target metric. However, the metrics used for clustering models generally have no such optimum at an intermediate number of clusters: with more clusters in the model, we would always expect the within-cluster sum of squares to be smaller.
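
To make this concrete, here is a sketch of the usual definitions behind these metrics, assuming squared Euclidean distance. The notation is introduced here for illustration and is not taken from the `tidyclust` documentation: \(x_1, \dots, x_n\) are the observations, \(C_1, \dots, C_k\) are the clusters, \(\bar{x}_j\) is the centroid of cluster \(C_j\), and \(\bar{x}\) is the overall mean.

\[
\mathrm{WSS} = \sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \bar{x}_j \rVert^2,
\qquad
\mathrm{TSS} = \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2,
\qquad
\text{ratio} = \frac{\mathrm{WSS}}{\mathrm{TSS}}.
\]

With \(k = 1\), the single centroid is the overall mean, so \(\mathrm{WSS} = \mathrm{TSS}\) and the ratio is 1, which agrees with the first rows of the metrics output. TSS does not depend on \(k\), and splitting a cluster and recomputing its centroids cannot increase WSS, so the best attainable ratio can only shrink as \(k\) grows.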

A common approach to choosing a number of clusters is to look for an “elbow”, or notable bend, in the plot of WSS/TSS ratio by cluster number:

```
res_metrics %>%
  filter(.metric == "sse_ratio") %>%
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ylab("mean WSS/TSS ratio, over 5 folds") +
  xlab("Number of clusters") +
  scale_x_continuous(breaks = 1:10)
```

With each additional cluster, the WSS/TSS ratio decreases, and the size of the decrease shrinks as the number of clusters grows. We might argue that the drop from two clusters to three, or from three to four, is noticeably larger than the subsequent drops, so three or four clusters is a reasonable choice.
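
The size of each drop can also be computed rather than read off the plot. The sketch below uses only the four mean `sse_ratio` values that are visible in the `collect_metrics()` output above; the values for five or more clusters are truncated there and are deliberately not filled in. The vector names `ratio` and `drops` are introduced here for illustration.

```r
# Mean WSS/TSS ratios for k = 1..4, copied from the collect_metrics() output;
# values for larger k are truncated there and are not reproduced here.
ratio <- c(`1` = 1, `2` = 0.321, `3` = 0.202, `4` = 0.160)

# Improvement gained by adding one more cluster (named by the larger k).
drops <- -diff(ratio)
drops
```

The move from one cluster to two accounts for by far the largest drop, followed by much smaller gains from the third and fourth clusters. Deciding where the elbow falls still requires the values for larger numbers of clusters and remains a judgment call.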