Field Note No. 06  ·  Structure-Centric ML
arXiv: 2603.13339  ·  A. Elmahdi, PhD
Seven scientific domains where native-dimensional clustering makes previously inaccessible problems tractable.

A research program.

The structure-centric paradigm is not a single tool with applications — it is a foundational framework that enables a new class of scientific discoveries across multiple high-dimensional, discovery-driven fields. The roadmap below identifies seven such domains and the specific research questions that become tractable for the first time.

Cancer genomics & precision oncology.

The bottleneck: Standard tools (WGCNA, hierarchical clustering with elbow methods) discard 70–97% of gene expression data as background, hiding the disease-specific signals that matter for patient stratification and biomarker discovery.

The opening: AdaGraph + SLCD enable native clustering of patient-level gene expression across hundreds of dimensions. The lung adenocarcinoma study established proof of concept on 246 patients; a 10,000-patient TCGA pan-cancer analysis is a direct extension.

Priority directions: Tumor subtype discovery from primary tumor expression. Treatment response prediction from pre-treatment biomarkers. Recurrence-risk module discovery in currently understudied populations (never-smokers, pediatric cancers, rare malignancies).

Drug discovery & target identification.

The bottleneck: The LINCS L1000 dataset characterizes drug effects through transcriptional signatures across 978 landmark genes × thousands of drugs × multiple cell lines. Existing methods cluster these signatures only after dimensionality reduction, losing the precise mechanistic information that distinguishes related compounds.

The opening: Native high-dimensional clustering preserves the full transcriptional signature, enabling identification of drug compounds that perturb the same molecular pathways even when their structures differ.

Priority directions: Drug repurposing through transcriptional similarity. Off-target effect identification through unexpected pathway clustering. Polypharmacology mapping.
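The repurposing-by-similarity idea can be sketched independently of the AdaGraph stack: treat each drug's 978-gene signature as a vector and look for near-duplicates in the native space. The signature matrix and the planted drug pair below are synthetic stand-ins, not LINCS data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_drugs, n_genes = 200, 978          # LINCS-style: 978 landmark genes

# Synthetic signature matrix; two drugs secretly share a pathway-level signature
S = rng.normal(size=(n_drugs, n_genes))
shared = rng.normal(size=n_genes)
S[7] = shared + 0.3 * rng.normal(size=n_genes)
S[123] = shared + 0.3 * rng.normal(size=n_genes)

# Cosine similarity computed in the full 978-dimensional space
Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
sim = Sn @ Sn.T
np.fill_diagonal(sim, -np.inf)       # ignore self-similarity

i, j = np.unravel_index(np.argmax(sim), sim.shape)
print(sorted((int(i), int(j))))      # the planted pair: [7, 123]
```

The same nearest-pair scan after heavy dimensionality reduction is where related compounds start to blur together, which is the claim of the paragraph above.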

Materials informatics & discovery science.

The bottleneck: Materials databases (SuperCon, Materials Project, JARVIS-DFT, AFLOW) contain millions of compositional and structural records. Existing pipelines either require expert specification of the number of clusters or produce trivial 2- to 4-cluster decompositions.

The opening: The SuperCon benchmark demonstrated that AdaGraph rediscovers known physics-based families without supervision. The same approach applied to less-studied properties (thermoelectric performance, photovoltaic efficiency, catalytic activity) becomes a discovery tool for new materials families.

Priority directions: Battery cathode discovery. Photocatalyst family identification. Novel superconductor candidates from composition alone. Mineral exploration through multi-modal geological data clustering.

Neuroscience & connectomics.

The bottleneck: Single-cell RNA sequencing of brain tissue produces datasets with tens of thousands of cells and 20,000+ genes per cell. Standard analysis pipelines require aggressive PCA reduction before clustering, conflating cell types that differ in fewer than 50 genes.

The opening: Native high-dimensional clustering preserves rare-cell-type signals. Functional connectivity analysis of fMRI data — where each region's connectivity to every other region is a feature — benefits from the same property.
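The aggressive-reduction failure mode is easy to reproduce on synthetic data: project three cell types onto too few principal components and the two types that differ only in a small gene module collapse into one. The sketch below uses scikit-learn's PCA and KMeans as generic stand-ins; the cell counts, module size, and effect sizes are invented, and nothing here is the paper's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_per_type, n_genes = 200, 500
labels_true = np.repeat([0, 1, 2], n_per_type)

X = 0.1 * rng.normal(size=(3 * n_per_type, n_genes))
X[labels_true > 0, 0] += 10.0        # types 1 and 2 share a strong program...
X[labels_true == 1, 1:41] += 0.8     # ...but differ only in a 40-gene module
X[labels_true == 2, 1:41] -= 0.8

km = KMeans(n_clusters=3, n_init=10, random_state=0)

# Aggressive reduction: keep one principal component, then cluster
X_red = PCA(n_components=1).fit_transform(X)
ari_reduced = adjusted_rand_score(labels_true, km.fit_predict(X_red))

# Native-dimensional clustering on the full expression matrix
ari_native = adjusted_rand_score(labels_true, km.fit_predict(X))
print(ari_reduced < ari_native)      # the 40-gene distinction survives only natively
```

The retained component captures the shared program, so types 1 and 2 project to the same point; in the native space the 40-gene module keeps them cleanly apart.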

Priority directions: Rare neuron subtype discovery. Functional brain network identification. Disease-state vs healthy-state discrimination in neurodegenerative diseases.

Climate science & earth observation.

The bottleneck: Climate models output gridded data with hundreds of variables per grid cell. Identifying climate regimes, ENSO phases, or extreme-event clusters requires multi-decadal analysis at high resolution — producing datasets too large and too high-dimensional for standard clustering.

The opening: SLCD parameter transfer from sample regions to global grids enables consistent clustering at planetary scale.
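SLCD's transfer mechanism isn't reproduced here, but the workflow it implies (calibrate the clustering on a tractable sample region, then apply the fitted parameters unchanged to the full grid) can be sketched with scikit-learn's MiniBatchKMeans. The grid size, variable count, and regime structure below are invented for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(42)

# Toy "global grid": 60,000 cells x 12 climate variables (invented numbers)
grid = rng.normal(size=(60_000, 12))
grid[:25_000, :4] += 4.0             # synthetic regime A
grid[25_000:45_000, 4:8] -= 4.0      # synthetic regime B; the rest is regime C

# Step 1: fit on a small sample region only
sample_idx = rng.choice(len(grid), size=2_000, replace=False)
model = MiniBatchKMeans(n_clusters=3, n_init=10, random_state=0).fit(grid[sample_idx])

# Step 2: transfer the fitted parameters unchanged to the full grid
labels = model.predict(grid)
print(labels.shape, np.unique(labels).size)   # (60000,) 3
```

The point of the pattern is consistency: every grid cell is labeled by the same fitted model, so regime boundaries are comparable across the whole domain.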

Priority directions: Climate regime classification. Extreme weather event clustering. Drought-typology mapping. Marine heatwave structure discovery.

Astronomy & cosmological survey data.

The bottleneck: Sky surveys (Gaia, Rubin/LSST, JWST) catalog billions of objects with photometric, spectral, and astrometric features. Identifying stellar populations, galaxy types, or transient event classes requires clustering at population scale in feature spaces of hundreds of dimensions.

The opening: Graph-SCOPE provides an unsupervised quality signal for k-selection where the silhouette criterion fails. AdaGraph clusters objects in their native multi-band photometric space.
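Graph-SCOPE itself isn't reproduced here; the baseline it is claimed to improve on, an unsupervised sweep over k scored by silhouette, can be sketched as below on synthetic, well-separated data (a regime where silhouette still works; the paper's claim concerns regimes where it does not).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic "photometric catalog": 4 well-separated populations in 8 bands
X, _ = make_blobs(n_samples=600, n_features=8, centers=4,
                  cluster_std=0.6, random_state=0)

# Model selection: sweep k and keep the best-scoring clustering
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)   # recovers the 4 planted populations
```

Any replacement quality signal slots into the same loop: score each candidate k, take the argmax.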

Priority directions: Stellar stream identification in Gaia data. Galaxy morphological discovery. Transient event taxonomy from time-series features.

Epidemiology & public health.

The bottleneck: Disease outbreak analysis combines demographic, geographic, clinical, and genetic features per case. Identifying transmission clusters, outbreak strains, or population-level risk patterns requires methods that handle this multi-modal high-dimensional structure.

The opening: AdaGraph's native multi-modal clustering identifies disease subtypes and outbreak signatures that emerge only when all features are considered jointly.

Priority directions: Infectious disease subtype discovery from genomic-clinical data fusion. Outbreak source attribution. Long-term sequela clustering (e.g., long-COVID phenotypes).

One paradigm, seven discovery problems.

Every domain on this roadmap shares the same structural form: high-dimensional, ambiguous, sparse, multi-modal data, with discovery as the goal. The structure-centric paradigm was designed precisely for this class of problem. Each domain is large in its own right; collectively, they define a research program that could occupy a lab for a decade.

A lab of three to five PhD students working across two or three of these domains simultaneously, with the structure-centric stack as foundational infrastructure, would produce a body of papers and discoveries that establishes the paradigm as a lasting contribution to computational science.