OHSL Breast Cancer Data Alliance (BCDA)

Breast cancer (BC) is the most commonly diagnosed cancer in women worldwide, accounting for 24.5% of all cancers (Globocan 2020). Of the new breast cancer cases in 2020, Asia accounted for the largest share: 45.4% among females of all ages. Asia also accounted for the largest share of new cases among females under 45 years of age (52.9%), and of breast cancer deaths (51.4%) and five-year prevalence (51.1%) in the same age group. In the United States, too, the breast is the most common cancer site among females of all ages (39.9%), with the percentage rising to 45.6% among females under 50 years of age. [1]

In the United States, during 2001-2015, incidence rates of early-onset metastatic breast cancer increased sharply among non-Hispanic (NH) white, NH black, Hispanic, and Asian/Pacific Islander (API) women [2].

Asia accounts for 44% of the world’s BC deaths and 39% of all newly diagnosed BC cases. Among the Asian countries, approximately 25% of female cancer cases in India are BC. The incidence rate was found to be 25.8 per 100,000 women and the mortality rate 12.7 per 100,000 women (2017). Another troubling aspect of BC in India is the increased incidence of the disease in younger women (between the ages of 30 and 40). Presently, almost 48% of patients with BC in India are below 50 years of age, and the incidence of BC in women between the ages of 25 and 40 has trended upward over the past 25 years. The latest surveillance trends from 2000 to 2014, based on registries from 71 countries, estimated the 5-year survival ratio in India to be 66.1%, the lowest among the countries included in the study [3].

Because breast cancer is a significant global health concern, with substantial variations in incidence and mortality rates across regions, the OHSL Breast Cancer Data Alliance (BCDA) project aims to leverage Big Data and precision medicine to better understand and address the disease, focusing particularly on risk factors and global trends.

Leveraging Big Data for Precision Medicine

Advances in a wide array of scientific technologies have brought data of unprecedented volume and complexity into the oncology research space. These novel big data resources are applied across a variety of contexts, from health services research using data from insurance claims, cancer registries, and electronic health records, to deeper and broader genomic characterizations of disease. Recently, the emergence of Big Data technologies has generated a lot of interest among the medical community concerned with breast health. Available storage capacities have increased exponentially during the last three decades, leading to a greater volume and variety of stored medical data (e.g., mammography scans, 3D ultrasound, MRI, genomic data, pathological data, etc.). However, Big Data is not confined to genomics. It includes DNA/RNA sequencing (bulk, single cell, ChIPseq (CUT&RUN), ATACseq, spatial, cell-free…); proteomics; screens (CRISPR/Cas9, RNAi, drug libraries); digital pathology (H&E, highly multiplexed IF, imaging mass cytometry); radiomics (mammography, CT, MRI, PET (18FDG, 18FES, 18FTL, Na18F, etc.)); clinical data (SEER, NCDB, claims data, EMR); and real-world data (CancerLinQ, Project GENIE, PCORI). [4]

The role of Big Data in precision medicine is gaining importance. Precision medicine draws on Big Data to account for individual variability in genes, environment, and lifestyle. Several multigene assays are available to estimate the risk of relapse after definitive surgery.

Examples include: [5]

  • MammaPrint uses fresh/frozen tissue or formalin-fixed paraffin-embedded (FFPE) samples and examines 1,391 genes by microarray assay; the results for 70 genes are used to classify patients as high or low risk for relapse.
  • Oncotype DX uses FFPE samples and examines 21 genes, including five reference genes, to estimate the risk of relapse at 10 years along with the benefit of adjuvant endocrine therapy and chemotherapy.
  • PAM50-based Prosigna (NanoString Technologies) examines a 50-gene signature and estimates the risk of recurrence (ROR).
  • EndoPredict (Myriad Genetics) is a clinically validated multianalyte gene expression test that can predict the ROR in patients with breast cancer at 10 years.

The big promise of Big Data is to allow the exploitation of all data sources, including unstructured ones such as textual patient reports or images, thus influencing medical research and, ultimately, patient care.

One expected outcome is a transition to novel therapies for the aggressive breast cancers seen among younger women. Patient advocacy groups can play multiple roles in helping to maximize and leverage big data to better serve patients. [6]

The OHSL Breast Cancer Data Alliance (BCDA)

To provide global data resources on breast cancer (OMICS and clinical, with an initial focus on risk factors), OHSL has established a global Breast Cancer Data Alliance (BCDA). This consortium brings together clinical and non-clinical professionals and researchers from around the globe to help identify key parameters that could explain current trends in BC in women.

A review on Big Data and breast cancer [7] from Northwestern Medical School laid the foundation for the BCDA, which takes a two-pronged approach to Big Data and breast cancer: a breast cancer data catalog, and an updated review to follow the earlier one.

Accurate bioinformatic analysis, supplemented by experimental verification of markers identified through a cross-platform Data Catalog compiled from different regions of the world, would be a key milestone towards better understanding and treatment of cancer. A primary requirement for this is the creation of the Data Catalog.

Now, OHSL has entered into a strategic partnership with Alation to use their technology to create a global data catalog on cancer to consolidate critical informational sources into a single metadata repository.

The data catalog provides users with a simple, searchable and unified resource that enables data discovery, no matter the sources’ original location.
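
To make the idea of a searchable metadata repository concrete, here is a minimal sketch in Python. It is not Alation's API and not the BCDA's actual schema: the `CatalogEntry` class, the `search` function, the tag vocabulary and the example entries are all hypothetical and chosen only for illustration. The point it shows is that the catalog stores descriptions and locations of datasets, not the data themselves, so discovery works regardless of where each source lives.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntry:
    """One metadata record; the data itself stays at its original source."""
    name: str
    source_location: str  # where the dataset actually lives
    tags: frozenset = field(default_factory=frozenset)


def search(entries, required_tags):
    """Return the entries that carry every requested tag (case-insensitive)."""
    wanted = {t.lower() for t in required_tags}
    return [e for e in entries if wanted <= {t.lower() for t in e.tags}]


# Hypothetical entries for illustration only; not real BCDA records.
CATALOG = [
    CatalogEntry("example-registry-a", "https://example.org/a",
                 frozenset({"clinical", "asia", "risk-factors"})),
    CatalogEntry("example-omics-b", "https://example.org/b",
                 frozenset({"omics", "europe"})),
    CatalogEntry("example-cohort-c", "https://example.org/c",
                 frozenset({"clinical", "omics", "asia"})),
]

if __name__ == "__main__":
    for entry in search(CATALOG, ["Clinical", "Asia"]):
        print(entry.name, "->", entry.source_location)
```

A production catalog would add controlled vocabularies, access conditions and provenance, but the same pattern of tagging records and filtering on tags underlies the search.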

Motivation: The vast amount of breast cancer data and resources prompted us to focus our search initially on OMICS and clinical data resources, with a special focus on breast cancer risk factors, global cohorts and repositories, and related publications with associated data access modalities.

Current: We are creating common meta tags to classify open breast cancer datasets from different geographical regions, while conducting comparative analytics and research to identify features unique to region-specific breast cancers. There is a special focus on breast cancer in LMICs (Low- and Middle-Income Countries). In December 2022, the work of the Breast Cancer Data Alliance was presented at the 11th General Assembly and International Conference of the Asian Pacific Organization for Cancer Prevention (APOCP, December 8-10, 2022, Kolkata, India; https://apocp.info/?page_id=766).

APOCP 11th General Conference on Cancer Prevention
APOCP Group Photo


Furthermore, cataloging of breast cancer resources with data on primary prevention has been initiated in partnership with the International Breast Cancer and Nutrition Consortium. This collaboration has since expanded into the preparation of a Horizon EU Grant application under the Staying Healthy program, where the objectives are:

  • (i) to implement personalized preventive interventions in preadolescents with and without increased risk for breast cancer,
  • (ii) develop a multidata knowledge hub (kHUB) including an expanded multidata catalog to identify individual breast cancer risk factors and make evidence-based recommendations and
  • (iii) validate biomarkers of oxidative stress (OS)-mediated epigenetic impact and breast cancer risk, and measure the impact of the intervention in a preadolescent population.

This conceptual approach is summarized in Figures 1 and 2 below. We have chosen breast cancer as a case study because it is a specific non-communicable disease (NCD) that now occurs more often in younger people than before and whose risk is established during childhood. The ultimate aim is to develop HeLP (Health literacy programme), based on our KAP (Knowledge, Attitude and Practices); MuRCaPP (Multidata resource catalog for primary prevention); NED (Nutriepigenomics database); NuBI (Nutrition-based intervention design software); RAP (Risk Assessment and Prediction software); and RisQ (Risk Information self-questionnaire). All of these are interconnected, as depicted in Figures 1 and 2. AI/ML/DL will be used as appropriate in creating the relevant software.

Figure 1
Figure 1: Conceptual approach to NCD prevention: Exposures trigger OS, itself the source of epigenetic drift leading to abnormal cellular aging, a main cause for many NCDs. Heightened risk is set early on for diseases occurring in adults, as the epigenome is very malleable in childhood. Epigenetic plasticity also permits reversion towards normalcy if the pathways involved are targeted. These epigenetic pathways can be identified by biomarkers of epigenetic drift that also provide risk assessment for specific NCDs, thanks to lab-on-chip technology. PREPPINGα (PRotecting the Epigenome to Prevent Prospective Illness through Nutrition in Generation ɑ) will focus on aspects shown in the blue horizon box to develop a process of NCD prevention based on a multidata-derived knowledge hub (kHUB) to revert epigenetic drift and its consequences for the breast, hence correcting the very source of NCD.
Figure 2
Figure 2: Project Structure: Summary of how the different steps of the project feed into each other, with the knowledge HUB (kHUB) as the source for validation of our intervention and for scaling up the NCD prevention strategy to large communities of users for long-term impact. HeLP - Health literacy programme; KAP - Knowledge, Attitude and Practices; MuRCaPP - Multidata resource catalog for primary prevention; NED - Nutriepigenomics database; NuBI - Nutrition-based intervention design software; RAP - Risk Assessment and Prediction software; RisQ - Risk Information self-questionnaire.

These foreseen objectives are realistic, verifiable and measurable on different levels, namely biological (epigenetics-mediated biomarkers), behavioral (dietary changes and health literacy knowledge), socioeconomic (household income level, Human Development Index as a measure of development, and Gender Development Index) and environmental (exposure to Persistent Organic Pollutants and Endocrine Disrupting Chemicals).

As part of this process, OHSL participated in the Phase 1 Think Tank Program held at the Institut de Cancérologie de l’Ouest (15, rue André Boquel, Angers, France), on-site and via videoconference, on November 16, 2023. At this event, OHSL co-chaired two discussion groups:

  1. Multidata resources for cancer prevention - “Is data gathering sufficient to identify who needs health intervention?”
  2. Expert Discussion on Multidata resources
OHSL Co-chaired two discussions
Phase 1 Think Tank Program
EU Think Tank Working Group, November 2023

Project Goals

The primary goal of the OHSL BCDA is to provide global data resources on breast cancer, with an initial focus on risk factors. Key objectives include:

  1. Data Catalog Creation: Develop a comprehensive global data catalog on cancer using Alation's technology to consolidate critical informational sources into a single metadata repository.

  2. Data Classification and Analytics: Implement common metatags to classify open datasets for breast cancer from different regions and conduct comparative analytics to identify region-specific characteristics.

  3. Focus on LMICs: Place special emphasis on Low- and Middle-Income Countries (LMICs) to address breast cancer disparities.

  4. Primary Prevention: Collaborate with the International Breast Cancer and Nutrition Consortium to catalog breast cancer resources related to primary prevention.
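
Objectives 2 and 3 above can be illustrated with a small Python sketch, offered only as one possible reading of "common metatags" and "comparative analytics". The field names (`region`, `data_type`, `lmic`), the helper functions and the placeholder datasets are assumptions for illustration, not the BCDA's actual tags or holdings. The sketch counts how many datasets of each type are catalogued per region and lists those flagged as coming from LMICs, which is the kind of coverage summary that can reveal regional gaps.

```python
from collections import Counter, defaultdict

# Hypothetical tagged datasets; names, regions and flags are placeholders.
DATASETS = [
    {"name": "dataset-1", "region": "Region A", "data_type": "clinical", "lmic": True},
    {"name": "dataset-2", "region": "Region A", "data_type": "omics", "lmic": True},
    {"name": "dataset-3", "region": "Region B", "data_type": "clinical", "lmic": False},
    {"name": "dataset-4", "region": "Region B", "data_type": "clinical", "lmic": False},
]


def coverage_by_region(datasets):
    """Count datasets per data type within each region."""
    table = defaultdict(Counter)
    for d in datasets:
        table[d["region"]][d["data_type"]] += 1
    return {region: dict(counts) for region, counts in table.items()}


def lmic_datasets(datasets):
    """Return the names of datasets tagged as coming from LMICs."""
    return [d["name"] for d in datasets if d["lmic"]]


if __name__ == "__main__":
    for region, counts in coverage_by_region(DATASETS).items():
        print(region, counts)
    print("LMIC datasets:", lmic_datasets(DATASETS))
```

Because every dataset carries the same tags, comparisons across regions reduce to simple grouping; the harder work lies in agreeing on the tag vocabulary itself.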

Collaboration and Partnerships

The BCDA project involves collaboration with international researchers, healthcare professionals, patient advocacy groups, and organizations. Notably, our partnership with the International Breast Cancer and Nutrition Consortium extends into a Horizon EU Grant application, aiming to implement personalized preventive interventions and validate biomarkers.

  • Dr. Shruti Shukla (Co-ordinator, BCDA)
  • Prof. (Dr.) G. K. Rath (Former Chief of Cancer Institute, AIIMS, New Delhi & Head of NCI-INDIA)
  • Dr. Ravi Mehrotra (Honorary Chief Medical Officer, OHSL)
  • Dr. Susan E Clare (Northwestern University, Chicago)
  • Dr. Sophie A Lelièvre (Institut de Cancérologie de l’Ouest & Professor Emerita, Purdue)
  • Dr. Sulma I Mohammed (Professor, Cancer Biology, Purdue)
  • Dr. Mariana Bustamante Eduardo (Northwestern University, Chicago)
  • Pamela L. Shaw (Northwestern University, Chicago)
  • Anil Srivastava (President, OHSL)
  • Cezary Mazurek (Director, Poznan Supercomputing and Networking Center, Poland)
  • Mary Beth Terry (Professor, Department of Epidemiology - School of Public Health, Columbia University)
  • Sharon Ross (Program Director, Division of Cancer Prevention, National Cancer Institute, NIH, USA; member of the IBCN external advisory board)
  • Members of World Health Organization (WHO)
  • Alation


[1] https://gco.iarc.fr/today/online-analysis-pie

[2] DeSantis et al. Breast Cancer Res Treat. 2019 Feb;173(3):743-747.

[3] Madhav MR, Nayagam SG, Biyani K, Pandey V, Kamal DG, Sabarimurugan S, Ramesh N, Gothandam KM, Jayaraj R. Epidemiologic analysis of breast cancer incidence, prevalence, and mortality in India: Protocol for a systematic review and meta-analyses. Medicine (Baltimore). 2018 Dec;97(52):e13680. doi: 10.1097/MD.0000000000013680. PMID: 30593138; PMCID: PMC6314759

[4] Daniel Stover, MD (daniel.stover@osumc.edu) at SABCS 2021, https://www.sabcs.org/Program/Daily-Schedule/Day-1

[5] Naito Y, Urasaki T. Precision medicine in breast cancer. Chin Clin Oncol. 2018 Jun;7(3):29. doi: 10.21037/cco.2018.06.04. PMID: 30056731.

[6] Jourquin J, Reffey SB, Jernigan C, Levy M, Zinser G, Sabelko K, Pietenpol J, Sledge G Jr. Susan G. Komen Big Data for Breast Cancer Initiative: How Patient Advocacy Organizations Can Facilitate Using Big Data to Improve Patient Outcomes. JCO Precis Oncol. 2019 Sep 12;3:PO.19.00184. doi: 10.1200/PO.19.00184. PMID: 32923852

[7] Clare, S., Shaw, P. “Big Data” for breast cancer: where to look and what you will find. npj Breast Cancer 2, 16031 (2016)


BCDA Catalogue Preview

Figure 1: Catalogue Main View
Figure 2: BCDA Catalogue View 1
Figure 3: BCDA Catalogue View 2 (Data Tables)
Figure 4: Non-US Breast Cancer Data View
Figure 5: Non-US Breast Cancer Data Tables View
Figure 6: Multi-Data Resource