Field Note No. 02  ·  Structure-Centric ML
arXiv: 2603.13339  ·  A. Elmahdi, PhD
Three decades of density-based clustering have rested on the wrong primitive.

Structure, not distance.

Density-based clustering — DBSCAN (1996), HDBSCAN (2013), and their descendants — encode cluster identity through point-to-point geometric distance. Structure-Centric clustering encodes cluster identity through topological organization within the data's native feature space.

Why thirty-year-old clustering fails on modern data.

In low dimensions, distance is meaningful. Two points being three units apart in a two-dimensional plane carries information about their similarity. Distance-based clustering methods exploit this fact — DBSCAN's ε-neighborhoods, HDBSCAN's mutual-reachability distance, hierarchical clustering's pairwise distances. They all measure geometry.

In high dimensions, this assumption breaks. As dimensionality grows past a few dozen, every point becomes roughly equidistant from every other point. The variance of pairwise distances collapses relative to the mean — the so-called concentration of measure phenomenon. Geometric similarity becomes meaningless.
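The collapse is easy to reproduce. A small numpy experiment (sample size and dimensions chosen arbitrarily for illustration) compares the spread of pairwise distances, relative to their mean, in 2 versus 1,000 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_spread(dim, n=200):
    # Sample points uniformly in the unit hypercube and measure how
    # much pairwise distances vary relative to their mean.
    X = rng.uniform(size=(n, dim))
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))
    d = d[np.triu_indices(n, k=1)]  # unique pairs only
    return d.std() / d.mean()

print(relative_spread(2))     # wide spread: distances are informative
print(relative_spread(1000))  # spread collapses: near-equidistance
```

The high-dimensional ratio comes out more than an order of magnitude smaller than the two-dimensional one: the distances have concentrated around their mean, exactly the failure mode described above.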

The standard workaround is dimensionality reduction: compress the data to two or ten dimensions with UMAP, PCA, or t-SNE, then apply a distance-based clustering method in the reduced space. This is the BERTopic pipeline. This is how virtually every social-listening platform works. And every step of compression destroys the structural information that defines the clusters.
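A minimal sketch of the reduce-then-cluster pattern, using scikit-learn's PCA and DBSCAN as stand-ins (BERTopic itself pairs UMAP with HDBSCAN; the toy data and parameter values here are illustrative, not from the note):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy stand-in for embeddings: two Gaussian blobs in 50 dimensions.
X = np.vstack([rng.normal(0, 1, (100, 50)),
               rng.normal(10, 1, (100, 50))])

# Step 1: compress to 2 dimensions (UMAP in the BERTopic pipeline;
# PCA here to keep the sketch dependency-light).
Z = PCA(n_components=2).fit_transform(X)

# Step 2: distance-based clustering in the reduced space.
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(Z)
print(len(set(labels) - {-1}))  # clusters found in the compressed space
```

On a toy example this simple the pipeline works; the note's argument is about what the compression step throws away when the structure is genuinely high-dimensional.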

What if cluster identity were encoded topologically?

Structure-centric clustering replaces the question "how far apart are two points?" with a different question: "how are points organized within a graph defined by their nearest neighbors?"

Once the data has been embedded into a kNN graph, distances are no longer needed. What matters is the topology — which points are connected to which, which regions form dense subgraphs, where the boundaries between communities lie. The kNN graph captures local structure faithfully even in 5,000 dimensions, because building it requires only the ranking of each point's distances — which neighbors are closest — not their absolute magnitudes.
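One way to see that the graph, not the metric, carries the information: uniformly rescaling the data changes every pairwise distance but leaves the kNN adjacency unchanged. A brute-force numpy sketch (illustrative only; `knn_adjacency` is a hypothetical helper, not an API from the note):

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_adjacency(X, k=5):
    # Squared pairwise distances via the Gram matrix, then keep each
    # point's k nearest neighbors. Distances are used only to rank
    # neighbors; the resulting graph retains no distance information.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]
    adj = np.zeros(d2.shape, dtype=bool)
    adj[np.arange(len(X))[:, None], nbrs] = True
    return adj

X = rng.normal(size=(100, 5000))          # 100 points, 5,000 dimensions
same = np.array_equal(knn_adjacency(X), knn_adjacency(X * 1000.0))
print(same)  # True: rescaling every distance leaves the topology intact
```

Brute force is fine at this size; at scale, approximate nearest-neighbor indexes serve the same role without changing the argument.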

Cluster identity should be a property of how points are connected, not how far apart they are.
— The structure-centric thesis

From this reframe, three properties follow that no distance-based method can match. First, parameters become scale-invariant. Because cluster identity is structural, the same parameters that work on a 1,000-point sample work on a 1,000,000-point dataset — they describe a topological property, not a metric distance. Second, every point gets assigned to a cluster. There is no need for a noise category, because the topological organization is unambiguous even at boundary regions. Third, validity metrics survive into thousands of dimensions, because they too operate on the kNN graph, not on pairwise distances.
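The first two properties can be probed with a toy experiment: cluster two well-separated Gaussian blobs as connected components of a mutual-kNN graph, with the same k at two very different sample sizes. This is a sketch of the general idea only — not the SLCD or Graph-SCOPE algorithms — and the blob data and k=10 are illustrative assumptions:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(1)

def cluster(n_per_blob, k=10, dim=50):
    # Two well-separated Gaussian blobs; a cluster is a connected
    # component of the mutual-kNN graph. Same k at every sample size.
    X = np.vstack([rng.normal(0, 1, (n_per_blob, dim)),
                   rng.normal(8, 1, (n_per_blob, dim))])
    g = kneighbors_graph(X, k, mode="connectivity")
    g = g.minimum(g.T)                       # keep only mutual edges
    _, labels = connected_components(g, directed=False)
    return labels

small, large = cluster(100), cluster(2000)
half = len(large) // 2
# Every point receives a label (no noise category), and the two
# blobs never share a component at either scale with the same k.
print(set(large[:half]).isdisjoint(large[half:]))  # → True
```

Whether each blob collapses to exactly one component depends on the data; what the sketch shows is that k describes "local neighborhood" equally well at both sample sizes, and that component labeling leaves no point unassigned.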

Where this fits in clustering history.

Era    Method                        Primitive                      Native dim.   Discards data?
1967   K-Means (MacQueen)            Euclidean centroid distance    Low           No (forces all points)
1996   DBSCAN (Ester et al.)         ε-neighborhood density         Low           Yes (noise)
2013   HDBSCAN (Campello et al.)     Mutual-reachability distance   Low           Yes (noise)
2026   Structure-Centric (Elmahdi)   kNN graph topology             5,000+        No

Each era of clustering has been defined by a primitive — a fundamental quantity the algorithm operates on. K-Means measured centroid distance. DBSCAN measured neighborhood density. HDBSCAN measured mutual reachability. Structure-Centric methods measure graph topology. The shift in primitive enables the shift in capability.

The paradigm comes in two forms.

The structure-centric paradigm has produced two complete clustering stacks — one for low-dimensional data, one for native high-dimensional data. They share the same philosophy but differ in implementation. Each is superior in its regime.

LOW-D STACK

SCOPE → AdaBox → SLCD

For data that has been (or will be) reduced to two dimensions. Includes the BERTopic pipeline, UMAP-then-cluster workflows, and any case where dimensionality reduction is appropriate.

See the low-D stack →
HIGH-D STACK

Graph-SCOPE → AdaGraph → DA-Sampler → SLCD

For data that lives in its native high-dimensional space. Gene expression, materials properties, multi-modal scientific data — anywhere reduction destroys signal.

See the high-D stack →