Pan-Cancer Machine Learning Predictors of Primary Site of Origin and Molecular Subtype

Abstract: The American Cancer Society estimates that approximately 5% of all metastatic tumors have no defined primary tissue of origin and are classified as cancers of unknown primary origin (CUPs). The current standard of care for CUP patients depends on immunohistochemistry (IHC)-based approaches to identify the primary site. Even when post-mortem evaluation is added to IHC-based tests, the primary site is identified for only 25% of CUPs, which underscores the acute need for better methods of determining the site of origin. CUP patients are therefore given generic chemotherapeutic agents, resulting in poor prognosis. When the tissue of origin is known, patients can be given site-specific therapy with significant improvement in clinical outcome. Similarly, identifying the primary origin of metastatic cancer is of great importance for designing treatment. Identification of the primary site of origin is an important first step, but it may not be sufficient information for optimal treatment of the patient. Recent studies, primarily from The Cancer Genome Atlas (TCGA) project, have revealed molecular subtypes with distinct clinical outcomes in several cancer types. The molecular subtype captures the fundamental mechanisms driving the cancer and provides information that is essential for its optimal treatment. Thus, along with the primary site of origin, the molecular subtype of a tumor is emerging as a criterion for personalized medicine and patient entry into clinical trials. However, no comprehensive toolset is available for precise identification of tissue of origin or molecular subtype for precision medicine and translational research.

Methods and Findings: We posited that metastatic tumors harbor the gene expression profiles of the primary tissue of origin of the cancer. Therefore, we decided to learn the characteristics of the primary tumors using the large number of cancer genome profiles available from the TCGA project. Our predictors were trained for 33 cancer types and for the 11 cancers that have established molecular subtypes. We estimated the accuracy of several machine learning models using cross-validation and external validation sets. Extensive testing on independent test sets revealed that the primary-site predictors had a median sensitivity of 97.2% and a median specificity of 99.9%, without leaving any tumor unclassified. Subtype classifiers achieved a median sensitivity of 87.7% and specificity of 94.5% in cross-validation, and a median sensitivity of 79.6% and specificity of 94.6% in two external datasets totaling 1,999 samples. Importantly, these external data show that our classifiers can robustly predict the primary origin of a cancer from microarray data, metastatic cancer samples, and patient-derived xenograft (PDX) mouse models.

Conclusion: We have demonstrated the utility of gene expression profiles, combined with machine learning algorithms, for solving the important clinical challenge of identifying the primary site of origin and the molecular subtype of cancers. We show, for the first time to our knowledge, that our pan-cancer classifiers can predict the primary tissue of origin of multiple cancers from metastatic samples. The predictors will be made available as open source software, free for academic non-commercial use.
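
For readers unfamiliar with how per-class sensitivity and specificity are summarized for a multi-class predictor, the following is a minimal sketch, not the authors' implementation. It assumes each cancer type is scored one-vs-rest from true and predicted labels and that the median is then taken across classes. The function name `per_class_sens_spec` and the toy labels are hypothetical and are not data from the study.

```python
from statistics import median


def per_class_sens_spec(y_true, y_pred):
    """Return {label: (sensitivity, specificity)} from one-vs-rest counts.

    Hypothetical helper for illustration; only labels present in y_true
    are scored, so every sensitivity has a non-zero denominator.
    """
    results = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        tn = sum(t != label and p != label for t, p in pairs)
        sensitivity = tp / (tp + fn)
        # Specificity is undefined if there are no negatives for this class.
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        results[label] = (sensitivity, specificity)
    return results


# Toy labels (assumed for illustration only, not results from the study).
y_true = ["BRCA", "BRCA", "LUAD", "LUAD", "COAD", "COAD"]
y_pred = ["BRCA", "LUAD", "LUAD", "LUAD", "COAD", "COAD"]

scores = per_class_sens_spec(y_true, y_pred)
median_sensitivity = median(sens for sens, _ in scores.values())
median_specificity = median(spec for _, spec in scores.values())
```

In this toy case one BRCA sample is misclassified as LUAD, which lowers BRCA sensitivity and LUAD specificity while leaving COAD unaffected; the median across classes is less sensitive to a single poorly predicted class than the mean would be.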

Bio: Coming soon!