A potential solution, marking as the answer for now:
First, rescale the variables that you want to include in your distance matrix. In this case I give the coordinate variables (x_cent and y_cent) a larger weight by rescaling them to the range 0 to 10, while tot_pop is rescaled to the range 0 to 1.
dat$x_cent <- scales::rescale(dat$x_cent, to = c(0, 10))
dat$y_cent <- scales::rescale(dat$y_cent, to = c(0, 10))
dat$tot_pop <- scales::rescale(dat$tot_pop, to = c(0, 1))
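To see why the target range acts as a weight, here is a minimal base-R sketch. The toy values and the helper rescale_to() are made up for illustration (the helper mirrors the min-max formula that scales::rescale() applies to numeric input), and it assumes Euclidean distance, which is what stats::dist() computes by default.

```r
# Toy data (hypothetical values, not from the question's dataset).
toy <- data.frame(
  x_cent  = c(0, 50, 100),
  tot_pop = c(1000, 9000, 5000)
)

# Hypothetical helper: min-max rescaling of x onto the range `to`.
rescale_to <- function(x, to = c(0, 1)) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * (to[2] - to[1]) + to[1]
}

# Both variables on 0-1: location and population count equally.
equal <- data.frame(
  x   = rescale_to(toy$x_cent, c(0, 1)),
  pop = rescale_to(toy$tot_pop, c(0, 1))
)

# Coordinate on 0-10, population on 0-1: location dominates the distance.
weighted <- data.frame(
  x   = rescale_to(toy$x_cent, c(0, 10)),
  pop = rescale_to(toy$tot_pop, c(0, 1))
)

# Pairwise Euclidean distances as plain matrices.
d_equal    <- as.matrix(stats::dist(equal))
d_weighted <- as.matrix(stats::dist(weighted))

# Compare how far point 1 is from points 2 and 3 under each scaling.
d_equal[1, 2:3]
d_weighted[1, 2:3]
```

With equal ranges, point 1 is equally far from points 2 and 3. Once the coordinate is stretched to 0-10, point 1 is clearly closer to its spatial neighbour (point 2), so clusters built on these distances will tend to be more spatially compact.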
Second, subset the data to include only the covariates with which you are calculating distance:
dat <- dat[, c("x_cent", "y_cent", "tot_pop")]
Next, calculate the distance matrix:
dist <- distances::distances(as.data.frame(dat))
Calculate clusters using the scclust package and append the cluster assignments to the dataset. This package allows you to put a constraint on cluster size (here, size_constraint = 10).
clust <- scclust::hierarchical_clustering(distances = dist, size_constraint = 10)
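library(dplyr)  # attaches the %>% pipe used below
# bind_cols() gives the unnamed cluster column the name `...4` (it is the fourth column), so rename it
final <- dplyr::bind_cols(dat, clust) %>% dplyr::rename(block = `...4`)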
You can see how many observations exist per cluster:
investigate_cluster <- dplyr::group_by(final, block) %>% dplyr::summarise(count = length(block))
head(investigate_cluster)
# A tibble: 6 x 2
  block     count
  <scclust> <int>
1 0            10
2 1            10
3 2            10
4 3            10
5 4            10
6 5            10
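As far as I understand the scclust documentation, size_constraint is a minimum cluster size rather than an exact one, so clusters can come out larger than 10 when the number of rows is not a multiple of the constraint. A minimal sketch to check this on made-up data (the toy data frame and the object names are hypothetical; it assumes the distances and scclust packages are installed):

```r
# Hypothetical toy data: 95 random points, deliberately not a multiple of 10.
set.seed(1)
toy <- data.frame(x = runif(95), y = runif(95))

# Same two calls as in the answer, applied to the toy data.
toy_dist  <- distances::distances(toy)
toy_clust <- scclust::hierarchical_clustering(distances = toy_dist,
                                              size_constraint = 10)

# Tabulate cluster sizes; as.integer() drops the scclust class and attributes.
sizes <- table(as.integer(toy_clust))
sizes

# Check that every cluster has at least 10 members.
min(sizes) >= 10
```

If you need clusters of exactly equal size, inspect the sizes this way before relying on the result.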
And easily visualize your clusters:
library(ggplot2)  # provides ggplot(), aes(), geom_point() and the theme functions
ggplot(final, mapping = aes(x = x_cent, y = y_cent, color = factor(block))) +
  geom_point() +
  ggConvexHull::geom_convexhull(alpha = .5, aes(fill = factor(block))) +
  theme_bw() +
  theme(legend.position = "none")