Density-based clustering — DBSCAN (1996), HDBSCAN (2013), and their descendants — encode cluster identity through point-to-point geometric distance. Structure-Centric clustering encodes cluster identity through topological organization within the data's native feature space.
In low dimensions, distance is meaningful. Two points being three units apart in a two-dimensional plane carries information about their similarity. Distance-based clustering methods exploit this fact — DBSCAN's ε-neighborhoods, HDBSCAN's mutual-reachability distance, hierarchical clustering's pairwise distances. They all measure geometry.
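To see this working as intended, here is a minimal sketch (using scikit-learn, a library choice assumed here rather than stated in the text) of DBSCAN recovering three well-separated 2-D blobs, where its main parameter is a literal geometric radius.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three well-separated blobs in 2-D: absolute distance is informative here.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.5, random_state=0)

# eps is a literal geometric radius: points within 0.7 units are neighbors.
labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)
print(np.unique(labels))  # three cluster ids (and -1 for any noise points)
```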
In high dimensions, this assumption breaks. As dimensionality grows past a few dozen, every point becomes roughly equidistant from every other point. The variance of pairwise distances collapses relative to the mean — the so-called concentration of measure phenomenon. Geometric similarity becomes meaningless.
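A quick numerical check makes the effect concrete. The snippet below (illustrative only) measures the spread of pairwise distances relative to their mean as dimensionality grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000, 5000):
    X = rng.normal(size=(500, dim))
    d = pdist(X)  # all pairwise Euclidean distances
    print(f"dim={dim:>5}  std/mean = {d.std() / d.mean():.3f}")
# The ratio shrinks toward zero: every point ends up roughly equidistant
# from every other point.
```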
The standard workaround is dimensionality reduction: compress the data to two or ten dimensions with UMAP, PCA, or t-SNE, then apply a distance-based clustering method to the reduced space. This is the BERTopic pipeline. This is how most social-listening platforms work. And every step of compression destroys the structural information that defines the clusters.
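For concreteness, the reduce-then-cluster pipeline typically looks something like the sketch below. The umap-learn and hdbscan packages and the specific parameter values are assumptions for illustration, not prescriptions from the text.

```python
import numpy as np
import umap
import hdbscan

# Stand-in for real document embeddings, e.g. 768-d sentence vectors.
X = np.random.default_rng(0).normal(size=(2000, 768))

# Step 1: compress to a handful of dimensions.
reduced = umap.UMAP(n_components=5, n_neighbors=15, min_dist=0.0).fit_transform(X)

# Step 2: cluster in the compressed space. Whatever structure the
# projection discarded is unrecoverable from this point on.
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
```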
Structure-centric clustering replaces the question "how far apart are two points?" with a different question: "how are points organized within a graph defined by their nearest neighbors?"
Once the data has been embedded into a kNN graph, distances are no longer needed. What matters is the topology — which points are connected to which, which regions form dense subgraphs, where the boundaries between communities lie. The kNN graph captures local structure faithfully even in 5,000 dimensions, because it only needs to identify each point's nearest neighbors, not measure absolute distances.
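A minimal sketch of that construction, assuming scikit-learn's kneighbors_graph as the graph builder (the actual stacks may use a different one):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5000))   # 1,000 points in 5,000 dimensions

# mode="connectivity" keeps a 1 for each point's k nearest neighbors and
# nothing else: the absolute distances are discarded.
A = kneighbors_graph(X, n_neighbors=15, mode="connectivity")
A = A.maximum(A.T)                  # symmetrize into an undirected kNN graph

# From here, clustering is a question about this graph's topology
# (dense subgraphs, community boundaries), not about any metric.
print(A.shape, A.nnz)
```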
From this reframe, three properties follow that no distance-based method can match.

- Parameters become scale-invariant. Because cluster identity is structural, the same parameters that work on a 1,000-point sample work on a 1,000,000-point dataset; they describe a topological property, not a metric distance.
- Every point gets assigned to a cluster. There is no need for a noise category, because the topological organization is unambiguous even at boundary regions.
- Validity metrics survive into thousands of dimensions, because they too operate on the kNN graph, not on pairwise distances.
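As one illustration of the third point, a validity score can be read directly off the kNN graph, for example the mean fraction of each point's graph neighbors that share its cluster label. The sketch below is a generic graph-based score written for illustration, not the specific metric the structure-centric stacks use.

```python
import numpy as np
from scipy.sparse import csr_matrix

def neighborhood_agreement(A: csr_matrix, labels: np.ndarray) -> float:
    """Mean fraction of each point's kNN-graph neighbors sharing its label.

    Reads only the graph's connectivity, so it behaves the same way in
    2 dimensions or in 5,000.
    """
    scores = []
    for i in range(A.shape[0]):
        neighbors = A.indices[A.indptr[i]:A.indptr[i + 1]]
        if len(neighbors) > 0:
            scores.append(np.mean(labels[neighbors] == labels[i]))
    return float(np.mean(scores))
```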
| Era | Method | Primitive | Native dim. | Discards data? |
|---|---|---|---|---|
| 1967 | K-Means (MacQueen) | Euclidean centroid distance | Low | No (forces all points) |
| 1996 | DBSCAN (Ester et al.) | ε-neighborhood density | Low | Yes (noise) |
| 2013 | HDBSCAN (Campello et al.) | Mutual reachability distance | Low | Yes (noise) |
| 2026 | Structure-Centric (Elmahdi) | kNN graph topology | 5,000+ | No |
Each era of clustering has been defined by a primitive — a fundamental quantity the algorithm operates on. K-Means measured centroid distance. DBSCAN measured neighborhood density. HDBSCAN measured mutual reachability. Structure-Centric methods measure graph topology. The shift in primitive enables the shift in capability.
The structure-centric paradigm has produced two complete clustering stacks — one for low-dimensional data, one for native high-dimensional data. They share the same philosophy but differ in implementation. Each is superior in its regime.
For data that has been (or will be) reduced to two dimensions, including the BERTopic pipeline, UMAP-then-cluster workflows, and any case where dimensionality reduction is appropriate. See the low-D stack →

For data that lives in its native high-dimensional space: gene expression, materials properties, multi-modal scientific data, anywhere reduction destroys signal. See the high-D stack →