Alvarez-Romero C, Martínez-García A, Bernabeu-Wittel M, Parra-Calderón CL.
Health Res Policy Syst. 2023; 21 (1)
Background: Digital transformation in healthcare and the growth of health data generation and collection are important challenges for the secondary use of healthcare records in the health research field. Likewise, due to the ethical and legal constraints for using sensitive data, understanding how health data are managed by dedicated infrastructures called data hubs is essential to facilitating data sharing and reuse.
Methods: To capture the different data governance behind health data hubs across Europe, a survey focused on analysing the feasibility of linking individual-level data between data collections and the generation of health data governance patterns was carried out. The target audience of this study was national, European, and global data hubs. In total, the designed survey was sent to a representative list of 99 health data hubs in January 2022.
Results: In total, 41 survey responses received until June 2022 were analysed. Stratification methods were performed to cover the different levels of granularity identified in some data hubs' characteristics. Firstly, a general pattern of data governance for data hubs was defined. Afterward, specific profiles were defined, generating specific data governance patterns through the stratifications in terms of the kind of organization (centralized versus decentralized) and role (data controller or data processor) of the health data hub respondents.
Conclusions: The analysis of the responses from health data hub respondents across Europe provided a list of the most frequent aspects, which concluded with a set of specific best practices on data management and governance, taking into account the constraints of sensitive data. In summary, a data hub should work in a centralized way, providing a Data Processing Agreement and a formal procedure to identify data providers, as well as data quality control, data integrity and anonymization methods.
Jiménez-Santos MJ, Nogueira-Rodríguez A, Piñeiro-Yáñez E, López-Fernández H, García-Martín S, Gómez-Plana P, Reboiro-Jato M, Gómez-López G, Glez-Peña D, Al-Shahrour F.
Nucleic Acids Res. 2023; 51 (W1)
Genomics studies routinely confront researchers with long lists of tumor alterations detected in patients. Such lists are difficult to interpret since only a minority of the alterations are relevant biomarkers for diagnosis and for designing therapeutic strategies. PanDrugs is a methodology that facilitates the interpretation of tumor molecular alterations and guides the selection of personalized treatments. To do so, PanDrugs scores gene actionability and drug feasibility to provide a prioritized evidence-based list of drugs. Here, we introduce PanDrugs2, a major upgrade of PanDrugs that, in addition to somatic variant analysis, supports a new integrated multi-omics analysis which simultaneously combines somatic and germline variants, copy number variation and gene expression data. Moreover, PanDrugs2 now considers cancer genetic dependencies to extend tumor vulnerabilities providing therapeutic options for untargetable genes. Importantly, a novel intuitive report to support clinical decision-making is generated. PanDrugs database has been updated, integrating 23 primary sources that support >74K drug-gene associations obtained from 4642 genes and 14 659 unique compounds. The database has also been reimplemented to allow semi-automatic updates to facilitate maintenance and release of future versions. PanDrugs2 does not require login and is freely available at https://www.pandrugs.org/.
Pérez-Díez I, Andreu Z, Hidalgo MR, Perpiñá-Clérigues C, Fantín L, Fernandez-Serra A, de la Iglesia-Vaya M, Lopez-Guerrero JA, García-García F.
Cancers (Basel). 2023; 15 (11)
Pancreatic ductal adenocarcinoma (PDAC) prognoses and treatment responses remain devastatingly poor due partly to the highly heterogeneous, aggressive, and immunosuppressive nature of this tumor type. The intricate relationship between the stroma, inflammation, and immunity remains vaguely understood in the PDAC microenvironment. Here, we performed a meta-analysis of stroma- and immune-related gene expression in the PDAC microenvironment to improve disease prognosis and therapeutic development. We selected 21 PDAC studies from the Gene Expression Omnibus and ArrayExpress databases, including 922 samples (320 controls and 602 cases). Differential gene enrichment analysis identified 1153 significantly dysregulated genes in PDAC patients that contribute to a desmoplastic stroma and an immunosuppressive environment (the hallmarks of PDAC tumors). The results highlighted two gene signatures related to the immune and stromal environments that cluster PDAC patients into high- and low-risk groups, impacting patients' stratification and therapeutic decision making. Moreover, the immune genes HCP5, SLFN13, IRF9, IFIT2, and IFI35 are linked to the prognosis of PDAC patients for the first time.
Martínez-García A, Alvarez-Romero C, Román-Villarán E, Bernabeu-Wittel M, Parra-Calderón CL.
Heliyon. 2023; 9 (5)
Background: The FAIR principles, under the open science paradigm, aim to improve the Findability, Accessibility, Interoperability and Reusability of digital data. In this sense, the FAIR4Health project aimed to apply the FAIR principles in the health research field. For this purpose, a workflow and a set of tools were developed to apply FAIR principles in health research datasets, and validated through the demonstration of the potential impact that this strategy has on health research management outcomes.
Objective: This paper aims to describe the analysis of the impact on health research management outcomes of the FAIR4Health solution.
Methods: To analyse the impact on health research management outcomes in terms of time and economic savings, a survey was designed and sent to experts on data management with expertise in the use of the FAIR4Health solution. Then, differences between the time and costs needed to perform the techniques with (i) standalone research, and (ii) using the proposed solution, were analyzed.
Results: In the context of the health research management outcomes, the survey analysis concluded that 56.57% of the time and 16800 EUR per month could be saved if the FAIR4Health solution is used.
Conclusions: Adopting FAIR principles in health research through the FAIR4Health solution saves time and, consequently, costs in the execution of research involving data management techniques.
Perpiñá-Clérigues C, Mellado S, Català-Senent JF, Ibáñez F, Costa P, Marcos M, Guerri C, García-García F, Pascual M.
Biol Sex Differ. 2023; 14 (1)
Lipids represent essential components of extracellular vesicles (EVs), playing structural and regulatory functions during EV biogenesis, release, targeting, and cell uptake. Importantly, lipidic dysregulation has been linked to several disorders, including metabolic syndrome, inflammation, and neurological dysfunction. Our recent results demonstrated the involvement of plasma EV microRNAs as possible amplifiers and biomarkers of neuroinflammation and brain damage induced by ethanol intoxication during adolescence. Considering the possible role of plasma EV lipids as regulatory molecules and biomarkers, we evaluated how acute ethanol intoxication differentially affected the lipid composition of plasma EVs in male and female adolescents and explored the participation of the immune response. Plasma EVs were extracted from humans and wild-type (WT) and Toll-like receptor 4 deficient (TLR4-KO) mice. Preprocessing and exploratory analyses were conducted after the extraction of EV lipids and data acquisition by mass spectrometry. Comparisons between ethanol-intoxicated and control human female and male individuals and ethanol-treated and untreated WT and TLR4-KO female and male mice were used to analyze the differential abundance of lipids. Annotation of lipids into their corresponding classes and a lipid set enrichment analysis were carried out to evaluate biological functions. We demonstrated, for the first time, that acute ethanol intoxication induced a higher enrichment of distinct plasma EV lipid species in human female adolescents than in males. We observed a higher content of the PA, LPC, unsaturated FA, and FAHFA lipid classes in females, whereas males showed enrichment in PI. These lipid classes participate in the formation, release, and uptake of EVs and the activation of the immune response. 
Moreover, we observed changes in EV lipid composition between ethanol-treated WT and TLR4-KO mice (e.g., enrichment of glycerophosphoinositols in ethanol-treated WT males), and the sex-based differences in lipid abundance are more notable in WT mice than in TLR4-KO mice. All data and results generated have been made openly available on a web-based platform ( http://bioinfo.cipf.es/sal ). Our results suggest that binge ethanol drinking in human female adolescents leads to a higher content of plasma EV lipid species associated with EV biogenesis and the propagation of neuroinflammatory responses than in males. In addition, we discovered greater differences in lipid abundance between sexes in WT mice compared to TLR4-KO mice. Our findings also support the potential use of EV-enriched lipids as biomarkers of ethanol-induced neuroinflammation during adolescence.
Çubuk C, Loucera C, Peña-Chilet M, Dopazo J.
Int J Mol Sci. 2023; 24 (8)
The reprogramming of metabolism is a recognized cancer hallmark. It is well known that different signaling pathways regulate and orchestrate this reprogramming that contributes to cancer initiation and development. However, recent evidence suggests that several metabolites could play a relevant role in regulating signaling pathways. To assess the potential role of metabolites in the regulation of signaling pathways, both metabolic and signaling pathway activities of Breast invasive Carcinoma (BRCA) have been modeled using mechanistic models. Gaussian Processes, powerful machine learning methods, were used in combination with SHapley Additive exPlanations (SHAP), a recent methodology that conveys causality, to obtain potential causal relationships between the production of metabolites and the regulation of signaling pathways. A total of 317 metabolites were found to have a strong impact on signaling circuits. The results presented here point to a crosstalk between signaling and metabolic pathways that is more complex than previously thought.
Guaita-Cespedes M, Grillo-Risco R, Hidalgo MR, Fernández-Veledo S, Burks DJ, de la Iglesia-Vayá M, Galán A, Garcia-Garcia F.
Biol Sex Differ. 2023; 14 (1)
As the housekeeping genes (HKG) generally involved in maintaining essential cell functions are typically assumed to exhibit constant expression levels across cell types, they are commonly employed as internal controls in gene expression studies. Nevertheless, HKG expression profiles may vary according to different variables, introducing systematic errors into experimental results. Sex bias can indeed affect expression; however, to date, sex has not typically been considered as a biological variable. In this study, we evaluate the expression profiles of six classical housekeeping genes (four metabolic: GAPDH, HPRT, PPIA, and UBC, and two ribosomal: 18S and RPL19) to determine expression stability in adipose tissues (AT) of Homo sapiens and Mus musculus and check sex bias and their overall suitability as internal controls. We also assess the expression stability of all genes included in distinct whole-transcriptome microarrays available from the Gene Expression Omnibus database to identify sex-unbiased housekeeping genes (suHKG) suitable for use as internal controls. We apply a novel computational strategy based on meta-analysis techniques to identify any sexual dimorphisms in mRNA expression stability in AT and to properly validate potential candidates. Just over half of the considered studies properly reported the sex of the human samples; however, not enough female mouse samples were found to be included in this analysis. We found differences in the HKG expression stability in humans between female and male samples, with females presenting greater instability. We propose a suHKG signature that includes experimentally validated classical HKG like PPIA and RPL19 and novel potential markers for human AT, and that discards others like the extensively used 18S gene due to its sex-based variability in adipose tissue. Orthologs have also been assayed and proposed for a mouse WAT suHKG signature.
All results generated during this study are readily available by accessing an open web resource ( https://bioinfo.cipf.es/metafun-HKG ) for consultation and reuse in further studies. This sex-based research proves that certain classical housekeeping genes fail to function adequately as controls when analyzing human adipose tissue considering sex as a variable. We confirm RPL19 and PPIA suitability as sex-unbiased human and mouse housekeeping genes derived from sex-specific expression profiles, and propose new ones such as RPS8 and UBB.
Rodi M, Gross C, Sandri TL, Berner L, Marcet-Houben M, Kocak E, Pogoda M, Casadei N, Köhler C, Kreidenweiss A, Agnandji ST, Gabaldón T, Ossowski S, Held J.
Front Cell Infect Microbiol. 2023; 13
Introduction: Mansonella species are filarial parasites that infect humans worldwide. Although these infections are common, knowledge of the pathology and diversity of the causative species is limited. Furthermore, the lack of sequencing data for Mansonella species shows that their research is neglected. Apart from Mansonella perstans, a potential new species called Mansonella sp "DEUX" has been identified in Gabon, which is prevalent at high frequencies. We aimed to further determine if Mansonella sp "DEUX" is a genotype of M. perstans, or if these are two sympatric species.
Methods: We screened individuals in the area of Fougamou, Gabon for Mansonella mono-infections and generated de novo assemblies from the respective samples. For evolutionary analysis, a phylogenetic tree was reconstructed, and the differences and divergence times are presented. In addition, mitogenomes were generated and phylogenies based on 12S rDNA and cox1 were created.
Results: We successfully generated whole genomes for M. perstans and Mansonella sp "DEUX". Phylogenetic analysis based on annotated protein sequences supports the hypothesis of two distinct species. The inferred evolutionary analysis suggested that M. perstans and Mansonella sp "DEUX" separated around 778,000 years ago. Analysis based on mitochondrial marker genes supports our hypothesis of two sympatric human Mansonella species.
Discussion: The results presented indicate that Mansonella sp "DEUX" is a new Mansonella species. These findings reflect the neglect of this research topic. The availability of whole genome data will allow further investigations of these species.
Gundogdu P, Alamo I, Nepomuceno-Chamorro IA, Dopazo J, Loucera C.
Biology (Basel). 2023; 12 (4)
Single-cell RNA sequencing is increasing our understanding of the behavior of complex tissues or organs, by providing unprecedented details on the complex cell type landscape at the level of individual cells. Cell type definition and functional annotation are key steps to understanding the molecular processes behind the underlying cellular communication machinery. However, the exponential growth of scRNA-seq data has made the task of manually annotating cells unfeasible, due not only to an unparalleled resolution of the technology but also to an ever-increasing heterogeneity of the data. Many supervised and unsupervised methods have been proposed to automatically annotate cells. Supervised approaches for cell-type annotation outperform unsupervised methods except when new (unknown) cell types are present. Here, we introduce SigPrimedNet, an artificial neural network approach that leverages (i) efficient training by means of a sparsity-inducing signaling circuits-informed layer, (ii) feature representation learning through supervised training, and (iii) unknown cell-type identification by fitting an anomaly detection method on the learned representation. We show that SigPrimedNet can efficiently annotate known cell types while keeping a low false-positive rate for unseen cells across a set of publicly available datasets. In addition, the learned representation acts as a proxy for signaling circuit activity measurements, which provide useful estimations of the cell functionalities.
Upchurch S, Palumbo E, Adams J, Bujold D, Bourque G, Nedzel J, Graham K, Kagda MS, Assis P, Hitz B, Righi E, Guigó R, Wold BJ, GA4GH RNA-Seq Task Team.
Bioinformatics. 2023; 39 (4)
Large-scale sharing of genomic quantification data requires standardized access interfaces. In this Global Alliance for Genomics and Health project, we developed RNAget, an API for secure access to genomic quantification data in matrix form. RNAget provides for slicing matrices to extract desired subsets of data and is applicable to all expression matrix-format data, including RNA sequencing and microarrays. Further, it generalizes to quantification matrices of other sequence-based genomics such as ATAC-seq and ChIP-seq. https://ga4gh-rnaseq.github.io/schema/docs/index.html.
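As a rough illustration of what slicing a quantification matrix means, the sketch below selects a subset of features (rows) and samples (columns) from an in-memory matrix by their identifiers. It is not the RNAget API and makes no calls to it; the function name `slice_matrix` and the toy gene and sample identifiers are hypothetical, chosen only for this example.

```python
import numpy as np


def slice_matrix(values, feature_ids, sample_ids, features=None, samples=None):
    """Return the sub-matrix for the requested feature and sample IDs.

    Hypothetical helper for illustration only; it does not reproduce
    the RNAget interface. Passing None keeps every row or column.
    """
    values = np.asarray(values)
    # Map each identifier to its row or column position.
    row_index = {f: i for i, f in enumerate(feature_ids)}
    col_index = {s: j for j, s in enumerate(sample_ids)}
    wanted_rows = feature_ids if features is None else features
    wanted_cols = sample_ids if samples is None else samples
    rows = [row_index[f] for f in wanted_rows]
    cols = [col_index[s] for s in wanted_cols]
    # np.ix_ builds an open mesh so rows and columns are selected jointly.
    return values[np.ix_(rows, cols)]


# Toy expression matrix: 3 features x 4 samples.
expression = np.arange(12).reshape(3, 4)
genes = ["g1", "g2", "g3"]
runs = ["s1", "s2", "s3", "s4"]
subset = slice_matrix(expression, genes, runs,
                      features=["g3", "g1"], samples=["s2", "s4"])
print(subset)  # rows follow the requested order: g3 first, then g1
```

Returning rows and columns in the order requested, rather than in storage order, keeps the result aligned with the caller's own identifier lists.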
Bueno-Fortes S, Berral-Gonzalez A, Sánchez-Santos JM, Martin-Merino M, De Las Rivas J.
Bioinform Adv. 2023; 3 (1)
Motivation: Modern genomic technologies allow us to perform genome-wide analysis to find gene markers associated with the risk and survival in cancer patients. Accurate risk prediction and patient stratification based on robust gene signatures is a key path forward in personalized treatment and precision medicine. Several authors have proposed the identification of gene signatures to assign risk in patients with breast cancer (BRCA), and some of these signatures have been implemented within commercial platforms in the clinic, such as Oncotype and Prosigna. However, these platforms are black boxes in which the influence of selected genes as survival markers is unclear and where the risk scores provided cannot be clearly related to the standard clinicopathological tumor markers obtained by immunohistochemistry (IHC), which guide clinical and therapeutic decisions in breast cancer.
Results: Here, we present a framework to discover a robust list of gene expression markers associated with survival that can be biologically interpreted in terms of the three main biomolecular factors (IHC clinical markers: ER, PR and HER2) that define clinical outcome in BRCA. To test and ensure the reproducibility of the results, we compiled and analyzed two independent datasets with a large number of tumor samples (1024 and 879) that include full genome-wide expression profiles and survival data. Using these two cohorts, we obtained a robust subset of gene survival markers that correlate well with the major IHC clinical markers used in breast cancer. The geneset of survival markers that we identify (which includes 34 genes) significantly improves the risk prediction provided by the genesets included in the commercial platforms: Oncotype (16 genes) and Prosigna (50 genes, i.e. PAM50). Furthermore, some of the genes identified have recently been proposed in the literature as new prognostic markers and may deserve more attention in current clinical trials to improve breast cancer risk prediction.
Availability and implementation: All data integrated and analyzed in this research will be available on GitHub (https://github.com/jdelasrivas-lab/breastcancersurvsign), including the R scripts and protocols used for the analyses.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Piñero J, Rodriguez Fraga PS, Valls-Margarit J, Ronzano F, Accuosto P, Lambea Jane R, Sanz F, Furlong LI.
Comput Struct Biotechnol J. 2023; 21
The use of molecular biomarkers to support disease diagnosis, monitor its progression, and guide drug treatment has gained traction in the last decades. While only a dozen biomarkers have been approved for their exploitation in the clinic by the FDA, many more are evaluated in the context of translational research and clinical trials. Furthermore, the information on which biomarkers are measured, for which purpose, and in relation to which conditions is not readily accessible: biomarkers used in clinical studies available through resources such as ClinicalTrials.gov are described as free text, posing significant challenges in finding, analyzing, and processing them by both humans and machines. We present a text mining strategy to identify proteomic and genomic biomarkers used in clinical trials and classify them according to the methodologies by which they are measured. We find more than 3000 biomarkers used in the context of 2600 diseases. By analyzing this dataset, we uncover patterns of use of biomarkers across therapeutic areas over time, including the biomarker type and their specificity. These data are made available at the Clinical Biomarker App at https://www.disgenet.org/biomarkers/, a new portal that enables the exploration of biomarkers extracted from the clinical studies available at ClinicalTrials.gov and enriched with information from the scientific literature. The App features several metrics that assess the specificity of the biomarkers, facilitating their selection and prioritization. Overall, the Clinical Biomarker App is a valuable and timely resource about clinical biomarkers, to accelerate biomarker discovery, development, and application.
López-López D, Roldán G, Fernández-Rueda JL, Bostelmann G, Carmona R, Aquino V, Perez-Florido J, Ortuño F, Pita G, Núñez-Torres R, González-Neira A, CSVS Crowdsourcing Group, Peña-Chilet M, Dopazo J.
Hum Genomics. 2023; 17 (1)
Despite being a very common type of genetic variation, the distribution of copy-number variations (CNVs) in the population is still poorly understood. The knowledge of the genetic variability, especially at the level of the local population, is a critical factor for distinguishing pathogenic from non-pathogenic variation in the discovery of new disease variants. Here, we present the SPAnish Copy Number Alterations Collaborative Server (SPACNACS), which currently contains copy number variation profiles obtained from more than 400 genomes and exomes of unrelated Spanish individuals. By means of a collaborative crowdsourcing effort, whole genome and whole exome sequencing data, produced by local genomic projects and for other purposes, are continuously collected. Once both the Spanish ancestry and the lack of kinship with other individuals in SPACNACS have been checked, the CNVs are inferred from these sequences and used to populate the database. A web interface allows querying the database with different filters that include ICD10 upper categories. This allows discarding samples from the disease under study and obtaining pseudo-control CNV profiles from the local population. We also show here additional studies on the local impact of CNVs in some phenotypes and on pharmacogenomic variants. SPACNACS can be accessed at: http://csvs.clinbioinfosspa.es/spacnacs/ . SPACNACS facilitates disease gene discovery by providing detailed information on the local variability of the population and exemplifies how to reuse genomic data produced for other purposes to build a local reference database.
Núñez-Moreno G, Tamayo A, Ruiz-Sánchez C, Cortón M, Mínguez P.
Hum Genet. 2023; 142 (4)
DNA variants altering the pre-mRNA splicing process represent an underestimated cause of human genetic diseases. Their association with disease traits should be confirmed using functional assays from patient cell lines or alternative models to detect aberrant mRNAs. Long-read sequencing is a suitable technique to identify and quantify mRNA isoforms. Available isoform detection and/or quantification tools are generally designed for the whole transcriptome analysis. However, experiments focusing on genes of interest need more precise data fine-tuning and visualization tools. Here we describe VIsoQLR, an interactive analyzer, viewer and editor for the semi-automated identification and quantification of known and novel isoforms using long-read sequencing data. VIsoQLR is tailored to thoroughly analyze mRNA expression in splicing assays of selected genes. Our tool takes sequences aligned to a reference, and for each gene, it defines consensus splice sites and quantifies isoforms. VIsoQLR introduces features to edit the splice sites through dynamic and interactive graphics and tables, allowing accurate manual curation. Known isoforms detected by other methods can also be imported as references for comparison. A benchmark against two other popular transcriptome-based tools shows VIsoQLR's accurate performance on both detection and quantification of isoforms. Here, we present VIsoQLR principles and features and its applicability in a case study example using nanopore-based long-read sequencing. VIsoQLR is available at https://github.com/TBLabFJD/VIsoQLR .
Perez-Florido J, Casimiro-Soriguer CS, Ortuño F, Fernandez-Rueda JL, Aguado A, Lara M, Riazzo C, Rodriguez-Iglesias MA, Camacho-Martinez P, Merino-Diaz L, Pupo-Ledo I, de Salazar A, Viñuela L, Fuentes A, Chueca N, The Andalusian Covid-Sequencing Initiative, García F, Dopazo J, Lepe JA.
Int J Mol Sci. 2023; 24 (3)
Recombination is an evolutionary strategy to quickly acquire new viral properties inherited from the parental lineages. The systematic survey of the SARS-CoV-2 genome sequences of the Andalusian genomic surveillance strategy has allowed the detection of an unexpectedly high number of co-infections, which constitute the ideal scenario for the emergence of new recombinants. Whole genome sequencing of SARS-CoV-2 has been carried out as part of the genomic surveillance programme. Sample sources included the main hospitals in the Andalusia region. In addition to the increase of co-infections and known recombinants, three novel SARS-CoV-2 delta-omicron and omicron-omicron recombinant variants with two break points have been detected. Our observations document an epidemiological scenario in which co-infection and recombination are detected more frequently. Finally, we describe a family case in which co-infection is followed by the detection of a recombinant made from the two co-infecting variants. This increased number of recombinants raises the risk of emergence of recombinant variants with increased transmissibility and pathogenicity.
de la Fuente L, Del Pozo-Valero M, Perea-Romero I, Blanco-Kelly F, Fernández-Caballero L, Cortón M, Ayuso C, Mínguez P.
Int J Mol Sci. 2023; 24 (2)
Screening for pathogenic variants in the diagnosis of rare genetic diseases can now be performed on all genes thanks to the application of whole exome and genome sequencing (WES, WGS). Yet the repertoire of gene-disease associations is not complete. Several computer-based algorithms and databases integrate distinct gene-gene functional networks to accelerate the discovery of gene-disease associations. We hypothesize that the ability of every type of information to extract relevant insights is disease-dependent. We compiled 33 functional networks classified into 13 knowledge categories (KCs) and observed large variability in their ability to recover genes associated with 91 genetic diseases, as measured using efficiency and exclusivity. We developed GLOWgenes, a network-based algorithm that applies random walk with restart to evaluate KCs' ability to recover genes from a given list associated with a phenotype and modulates the prediction of new candidates accordingly. Comparison with other integration strategies and tools shows that our disease-aware approach can boost the discovery of new gene-disease associations, especially for the less obvious ones. KC contribution also varies if obtained using recently discovered genes. Applied to 15 unsolved WES, GLOWgenes proposed three new genes to be involved in the phenotypes of patients with syndromic inherited retinal dystrophies.
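For readers unfamiliar with random walk with restart, the technique named in the abstract above, the following is a minimal generic sketch on a small undirected network. It is not the GLOWgenes implementation: the function name, the restart probability of 0.3, the convergence settings and the toy four-node graph are all assumptions made for illustration.

```python
import numpy as np


def random_walk_with_restart(adjacency, seeds, restart=0.3,
                             tol=1e-10, max_iter=10_000):
    """Score every node by its proximity to the seed nodes.

    Generic textbook formulation (an assumption, not taken from the
    paper): p <- (1 - restart) * W @ p + restart * p0, where W is the
    column-normalised adjacency matrix and p0 is uniform over seeds.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # avoid dividing by zero for isolated nodes
    transition = adjacency / col_sums
    p0 = np.zeros(adjacency.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * transition @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Toy path graph 0 - 1 - 2 - 3, with node 0 as the only known disease gene.
path_graph = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
scores = random_walk_with_restart(path_graph, seeds=[0])
ranking = np.argsort(-scores)  # candidate nodes, closest to the seed first
```

In this toy case the scores fall off with distance from the seed, which is the property a network-based prioritisation relies on when ranking candidate genes.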
Niarakis A, Ostaszewski M, Mazein A, Kuperstein I, Kutmon M, Gillespie ME, Funahashi A, Acencio ML, Hemedan A, Aichem M, Klein K, Czauderna T, Burtscher F, Yamada TG, Hiki Y, Hiroi NF, Hu F, Pham N, Ehrhart F, Willighagen EL, Valdeolivas A, Dugourd A, Messina F, Esteban-Medina M, Peña-Chilet M, Rian K, Soliman S, Aghamiri SS, Puniya BL, Naldi A, Helikar T, Singh V, Fernández MF, Bermudez V, Tsirvouli E, Montagud A, Noël V, de Leon MP, Maier D, Bauch A, Gyori BM, Bachman JA, Luna A, Pinero J, Furlong LI, Balaur I, Rougny A, Jarosz Y, Overall RW, Phair R, Perfetto L, Matthews L, Rex DAB, Orlic-Milacic M, Cristobal MGL, De Meulder B, Ravel JM, Jassal B, Satagopam V, Wu G, Golebiewski M, Gawron P, Calzone L, Beckmann JS, Evelo CT, D’Eustachio P, Schreiber F, Saez-Rodriguez J, Dopazo J, Kuiper M, Valencia A, Wolkenhauer O, Kitano H, Barillot E, Auffray C, Balling R, Schneider R, the COVID-19 Disease Map Community.
The COVID-19 Disease Map project is a large-scale community effort uniting 277 scientists from 130 Institutions around the globe. We use high-quality, mechanistic content describing SARS-CoV-2-host interactions and develop interoperable bioinformatic pipelines for novel target identification and drug repurposing. Community-driven and highly interdisciplinary, the project is collaborative and supports community standards, open access, and the FAIR data principles. The coordination of community work allowed for an impressive step forward in building interfaces between Systems Biology tools and platforms. Our framework links key molecules highlighted from broad omics data analysis and computational modeling to dysregulated pathways in a cell-, tissue- or patient-specific manner. We also employ text mining and AI-assisted analysis to identify potential drugs and drug targets and use topological analysis to reveal interesting structural features of the map. The proposed framework is versatile and expandable, offering a significant upgrade in the arsenal used to understand virus-host interactions and other complex pathologies.
López-Cerdán A, Andreu Z, Hidalgo MR, Grillo-Risco R, Català-Senent JF, Soler-Sáez I, Neva-Alejo A, Gordillo F, de la Iglesia-Vayá M, García-García F.
Biol Sex Differ. 2022; 13 (1)
Background: In recent decades, increasing longevity (among other factors) has fostered a rise in Parkinson's disease incidence. Although not exhaustively studied in this devastating disease, the impact of sex represents a critical variable in Parkinson's disease as epidemiological and clinical features differ between males and females.
Methods: To study sex bias in Parkinson's disease, we conducted a systematic review to select sex-labeled transcriptomic data from three relevant brain tissues: the frontal cortex, the striatum, and the substantia nigra. We performed differential expression analysis on each study chosen. Then we summarized the individual differential expression results with three tissue-specific meta-analyses and a global all-tissues meta-analysis. Finally, results from the meta-analysis were functionally characterized using different functional profiling approaches.
Results: The tissue-specific meta-analyses linked Parkinson's disease to the enhanced expression of MED31 in the female frontal cortex and the dysregulation of 237 genes in the substantia nigra. The global meta-analysis detected 15 genes with sex-differential patterns in Parkinson's disease, which participate in mitochondrial function, oxidative stress, neuronal degeneration, and cell death. Furthermore, functional analyses identified pathways, protein-protein interaction networks, and transcription factors that differed by sex. While male patients exhibited changes in oxidative stress based on metal ions, inflammation, and angiogenesis, female patients exhibited dysfunctions in mitochondrial and lysosomal activity, antigen processing and presentation functions, and glutamic and purine metabolism. All results generated during this study are readily available by accessing an open web resource ( http://bioinfo.cipf.es/metafun-pd/ ) for consultation and reuse in further studies.
ConclusionsOur in silico approach has highlighted sex-based differential mechanisms in typical Parkinson's disease hallmarks (inflammation, mitochondrial dysfunction, and oxidative stress). Additionally, we have identified specific genes and transcription factors for male and female Parkinson's disease patients that represent potential candidates as biomarkers for diagnosis.
Sorzano COS, Vilas JL, Ramírez-Aportela E, Krieger J, Del Hoyo D, Herreros D, Fernandez-Giménez E, Marchán D, Macías JR, Sánchez I, Del Caño L, Fonseca-Reyna Y, Conesa P, García-Mena A, Burguet J, García Condado J, Méndez García J, Martínez M, Muñoz-Barrutia A, Marabini R, Vargas J, Carazo JM.
Faraday Discuss. 2022; 240 (0)
The number of maps deposited in public databases (Electron Microscopy Data Bank, EMDB) determined by cryo-electron microscopy has quickly grown in recent years. With this rapid growth, it is critical to guarantee their quality. So far, map validation has primarily focused on the agreement between maps and models. From the image processing perspective, the validation has been mostly restricted to using two half-maps and the measurement of their internal consistency. In this article, we suggest that map validation can be taken much further from the point of view of image processing if 2D classes, particles, angles, coordinates, defoci, and micrographs are also provided. We present a progressive validation scheme that qualifies a result validation status from 0 to 5 and offers three optional qualifiers (A, W, and O) that can be added. The simplest validation state is 0, while the most complete would be 5AWO. This scheme has been implemented in a website https://biocomp.cnb.csic.es/EMValidationService/ to which reconstructed maps and their ESI can be uploaded.
Marcet-Houben M, Alvarado M, Ksiezopolska E, Saus E, de Groot PWJ, Gabaldón T.
BMC Biol. 2022; 20 (1)
BackgroundCandida glabrata is an opportunistic yeast pathogen thought to have a large genetic and phenotypic diversity and a highly plastic genome. However, the lack of chromosome-level genome assemblies representing this diversity limits our ability to accurately establish how chromosomal structure and gene content vary across strains.
ResultsHere, we expanded publicly available assemblies by using long-read sequencing technologies in twelve diverse strains, obtaining a final set of twenty-one chromosome-level genomes spanning the known C. glabrata diversity. Using comparative approaches, we inferred variation in chromosome structure and determined the pan-genome, including an analysis of the adhesin gene repertoire. Our analysis uncovered four new adhesin orthogroups and inferred a rich ancestral adhesion repertoire, which was subsequently shaped through a still ongoing process of gene loss, gene duplication, and gene conversion.
ConclusionsC. glabrata has a largely stable pan-genome except for a highly variable subset of genes encoding cell wall-associated functions. Adhesin repertoire was established for each strain and showed variability among clades.
Pérez-Granado J, Piñero J, Furlong LI.
Front Genet. 2022; 13
Our knowledge of complex disorders has increased in the last years thanks to the identification of genetic variants (GVs) significantly associated with disease phenotypes by genome-wide association studies (GWAS). However, we do not understand yet how these GVs functionally impact disease pathogenesis or their underlying biological mechanisms. Among the multiple post-GWAS methods available, fine-mapping and colocalization approaches are commonly used to identify causal GVs, meaning those with a biological effect on the trait, and their functional effects. Despite the variety of post-GWAS tools available, there is no guideline for method eligibility or validity, even though these methods work under different assumptions when accounting for linkage disequilibrium and integrating molecular annotation data. Moreover, there is no benchmarking of the available tools. In this context, we have applied two different fine-mapping and colocalization methods to the same GWAS on major depression (MD) and expression quantitative trait loci (eQTL) datasets. Our goal is to perform a systematic comparison of the results obtained by the different tools. To that end, we have evaluated their results at different levels: fine-mapped and colocalizing GVs, their target genes and tissue specificity according to gene expression information, as well as the biological processes in which they are involved. Our findings highlight the importance of fine-mapping as a key step for subsequent analysis. Notably, the colocalizing variants, altered genes and targeted tissues differed between methods, even regarding their biological implications. This contribution illustrates an important issue in post-GWAS analysis with relevant consequences on the use of GWAS results for elucidation of disease pathobiology, drug target prioritization and biomarker discovery.
Naranjo-Ortiz MA, Molina M, Fuentes D, Mixão V, Gabaldón T.
Gigascience. 2022; 11
BackgroundRecent technological developments have made genome sequencing and assembly highly accessible and widely used. However, the presence in sequenced organisms of certain genomic features such as high heterozygosity, polyploidy, aneuploidy, heterokaryosis, or extreme compositional biases can challenge current standard assembly procedures and result in highly fragmented assemblies. Hence, we hypothesized that genome databases must contain a nonnegligible fraction of low-quality assemblies that result from such type of intrinsic genomic factors.
FindingsHere we present Karyon, a Python-based toolkit that uses raw sequencing data and de novo genome assembly to assess several parameters and generate informative plots to assist in the identification of noncanonical genomic traits. Karyon includes automated de novo genome assembly and variant calling pipelines. We tested Karyon by diagnosing 35 highly fragmented publicly available assemblies from 19 different Mucorales (Fungi) species.
ConclusionsOur results show that 10 (28.57%) of the assemblies presented signs of unusual genomic configurations, suggesting that these are common, at least for some lineages within the Fungi.
Loucera C, Perez-Florido J, Casimiro-Soriguer CS, Ortuño FM, Carmona R, Bostelmann G, Martínez-González LJ, Muñoyerro-Muñiz D, Villegas R, Rodriguez-Baño J, Romero-Gomez M, Lorusso N, Garcia-León J, Navarro-Marí JM, Camacho-Martinez P, Merino-Diaz L, Salazar A, Viñuela L, The Andalusian Covid-Sequencing Initiative, Lepe JA, Garcia F, Dopazo J.
Viruses. 2022; 14 (9)
ObjectivesMore than two years into the COVID-19 pandemic, SARS-CoV-2 still remains a global public health problem. Successive waves of infection have produced new SARS-CoV-2 variants with new mutations for which the impact on COVID-19 severity and patient survival is uncertain.
MethodsA total of 764 SARS-CoV-2 genomes, sequenced from COVID-19 patients hospitalized from 19 February 2020 to 30 April 2021, along with their clinical data, were used for survival analysis.
ResultsA significant association of B.1.1.7, the alpha lineage, with patient mortality (log hazard ratio (LHR) = 0.51, C.I. = [0.14,0.88]) was found upon adjustment by all the covariates known to affect COVID-19 prognosis. Moreover, survival analysis of mutations in the SARS-CoV-2 genome revealed 27 of them were significantly associated with higher mortality of patients. Most of these mutations were located in the genes coding for the S, ORF8, and N proteins.
ConclusionsThis study illustrates how a combination of genomic and clinical data can provide solid evidence for the impact of viral lineage on patient survival.
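The log hazard ratio reported in the Results above can be read on the more familiar hazard-ratio scale by exponentiating the estimate and its interval bounds. The following is a minimal sketch of that conversion only; the function name `log_hr_to_hr` is a hypothetical helper introduced for illustration and is not part of the study's analysis code.

```python
import math


def log_hr_to_hr(log_hr, ci_low, ci_high):
    """Exponentiate a log hazard ratio and its confidence bounds.

    Hypothetical helper for illustration; exp() is monotonic, so the
    interval bounds can be transformed one by one.
    """
    return math.exp(log_hr), math.exp(ci_low), math.exp(ci_high)


# Values taken from the abstract: LHR = 0.51, C.I. = [0.14, 0.88]
hr, hr_low, hr_high = log_hr_to_hr(0.51, 0.14, 0.88)
print(f"HR = {hr:.2f}, C.I. = [{hr_low:.2f}, {hr_high:.2f}]")
```

On this scale the estimate is roughly 1.67, with an interval of roughly 1.15 to 2.41. Because the interval excludes 1, it is consistent with the association with mortality described in the abstract.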
Jiménez-Santos MJ, García-Martín S, Fustero-Torre C, Di Domenico T, Gómez-López G, Al-Shahrour F.
Mol Oncol. 2022; 16 (21)
Tumour heterogeneity is one of the main characteristics of cancer and can be categorised into inter- or intratumour heterogeneity. This heterogeneity has been revealed as one of the key causes of treatment failure and relapse. Precision oncology is an emerging field that seeks to design tailored treatments for each cancer patient according to epidemiological, clinical and omics data. This discipline relies on bioinformatics tools designed to compute scores to prioritise available drugs, with the aim of helping clinicians in treatment selection. In this review, we describe the current approaches for therapy selection depending on which type of tumour heterogeneity is being targeted and the available next-generation sequencing data. We cover intertumour heterogeneity studies and individual treatment selection using genomics variants, expression data or multi-omics strategies. We also describe intratumour dissection through clonal inference and single-cell transcriptomics, in each case providing bioinformatics tools for tailored treatment selection. Finally, we discuss how these therapy selection workflows could be integrated into the clinical practice.
Loucera C, Carmona R, Esteban-Medina M, Bostelmann G, Muñoyerro-Muñiz D, Villegas R, Peña-Chilet M, Dopazo J.
Despite the extensive vaccination campaigns in many countries, COVID-19 is still a major worldwide health problem because of its associated morbidity and mortality. Therefore, finding efficient treatments as fast as possible is a pressing need. Drug repurposing constitutes a convenient alternative when the need for new drugs in an unexpected medical scenario is urgent, as is the case with COVID-19. Using data from a central registry of electronic health records (the Andalusian Population Health Database, BPS), the effect on patient survival of drugs taken for other indications before hospitalization was studied in a retrospective cohort of 15,968 individuals, comprising all COVID-19 patients hospitalized in Andalusia between January and November 2020. Covariate-adjusted hazard ratios and analysis of lymphocyte progression curves support a significant association between consumption of 21 different drugs and better patient survival. In contrast, one drug, furosemide, was associated with a significant increase in patient mortality.
Snyder M, Iraola-Guzmán S, Saus E, Gabaldón T.
Cancers (Basel). 2022; 14 (16)
Colorectal cancer (CRC) is the third most prevalent cancer worldwide, with nearly two million newly diagnosed cases each year. The survival of patients with CRC greatly depends on the cancer stage at the time of diagnosis, with worse prognosis for more advanced cases. Consequently, considerable effort has been directed towards improving population screening programs for early diagnosis and identifying prognostic markers that can better inform treatment strategies. In recent years, long non-coding RNAs (lncRNAs) have been recognized as promising molecules, with diagnostic and prognostic potential in many cancers, including CRC. Although large-scale genome and transcriptome sequencing surveys have identified many lncRNAs that are altered in CRC, most of their roles in disease onset and progression remain poorly understood. Here, we critically review the variety of detection methods and types of supporting evidence for the involvement of lncRNAs in CRC. In addition, we provide a reference catalog that features the most clinically relevant lncRNAs in CRC. These lncRNAs were selected based on recent studies sorted by stringent criteria for both supporting experimental evidence and reproducibility.
Iancu IF, Perea-Romero I, Núñez-Moreno G, de la Fuente L, Romero R, Ávila-Fernandez A, Trujillo-Tiebas MJ, Riveiro-Álvarez R, Almoguera B, Martín-Mérida I, Del Pozo-Valero M, Damián-Verde A, Cortón M, Ayuso C, Minguez P.
Int J Mol Sci. 2022; 23 (15)
The introduction of NGS in genetic diagnosis has increased the repertoire of variants and genes involved and the amount of genomic information produced. We built an allelic-frequency (AF) database for a heterogeneous cohort of genetic diseases to explore the aggregated genomic information and boost diagnosis in inherited retinal dystrophies (IRD). We retrospectively selected 5683 index-cases with clinical exome sequencing tests available, 1766 with IRD and the rest with diverse genetic diseases. We calculated a subcohort's IRD-specific AF and compared it with suitable pseudocontrols. For non-solved IRD cases, we prioritized variants with a significant increment of frequencies, with eight variants that may help to explain the phenotype, and 10/11 of uncertain significance that were reclassified as probably pathogenic according to ACMG. Moreover, we developed a method to highlight genes with more frequent pathogenic variants in IRD cases than in pseudocontrols weighted by the increment of benign variants in the same comparison. We identified 18 genes for further studies that provided new insights in five cases. This resource can also help one to calculate the carrier frequency in IRD genes. A cohort-specific AF database assists with variants and genes prioritization and operates as an engine that provides a new hypothesis in non-solved cases, augmenting the diagnosis rate.
Moya-García AA, González-Jiménez A, Moreno F, Stephens C, Lucena MI, Ranea JAG.
Genes (Basel). 2022; 13 (7)
Among adverse drug reactions, drug-induced liver injury presents particular challenges because of its complexity, and the underlying mechanisms are still not completely characterized. Our knowledge of the topic is limited and based on the assumption that a drug acts on one molecular target. We have leveraged drug polypharmacology, i.e., the ability of a drug to bind multiple targets and thus perturb several biological processes, to develop a systems pharmacology platform that integrates all drug-target interactions. Our analysis sheds light on the molecular mechanisms of drugs involved in drug-induced liver injury and provides new hypotheses to study this phenomenon.
Pérez-Granado J, Piñero J, Medina-Rivera A, Furlong LI.
Genes (Basel). 2022; 13 (7)
Understanding the molecular basis of major depression is critical for identifying new potential biomarkers and drug targets to alleviate its burden on society. Leveraging available GWAS data and functional genomic tools to assess regulatory variation could help explain the role of major depression-associated genetic variants in disease pathogenesis. We have conducted a fine-mapping analysis of genetic variants associated with major depression and applied a pipeline focused on gene expression regulation by using two complementary approaches: cis-eQTL colocalization analysis and alteration of transcription factor binding sites. The fine-mapping process uncovered putative causally associated variants whose proximal genes were linked with major depression pathophysiology. Four colocalizing genetic variants altered the expression of five genes, highlighting the role of SLC12A5 in neuronal chloride homeostasis and MYRF in nervous system myelination and oligodendrocyte differentiation. The transcription factor binding analysis revealed the potential role of rs62259947 in modulating P4HTM expression by altering the YY1 binding site, altogether regulating hypoxia response. Overall, our pipeline could prioritize putative causal genetic variants in major depression. More importantly, it can be applied when only index genetic variants are available. Finally, the presented approach enabled the proposal of mechanistic hypotheses of these genetic variants and their role in disease pathogenesis.
Leis A, Casadevall D, Albanell J, Posso M, Macià F, Castells X, Ramírez-Anguita JM, Martínez Roldán J, Furlong LI, Sanz F, Ronzano F, Mayer MA.
JMIR Cancer. 2022; 8 (3)
BackgroundA cancer diagnosis is a source of psychological and emotional stress, which are often maintained for sustained periods of time that may lead to depressive disorders. Depression is one of the most common psychological conditions in patients with cancer. According to the Global Cancer Observatory, breast and colorectal cancers are the most prevalent cancers in both sexes and across all age groups in Spain.
ObjectiveThis study aimed to compare the prevalence of depression in patients before and after the diagnosis of breast or colorectal cancer, as well as to assess the usefulness of the analysis of free-text clinical notes in 2 languages (Spanish or Catalan) for detecting depression in combination with encoded diagnoses.
MethodsWe carried out an analysis of the electronic health records from a general hospital by considering the different sources of clinical information related to depression in patients with breast and colorectal cancer. This analysis included ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) diagnosis codes and unstructured information extracted by mining free-text clinical notes via natural language processing tools based on Systematized Nomenclature of Medicine Clinical Terms that mentions symptoms and drugs used for the treatment of depression.
ResultsWe observed that the percentage of patients diagnosed with depressive disorders significantly increased after cancer diagnosis in the 2 types of cancer considered: breast and colorectal cancers. We identified more patients with depression by mining free-text clinical notes than by relying exclusively on ICD-9-CM codes, increasing the number of patients diagnosed with depression by 34.8% (441/1269). In addition, the number of patients with depression who received chemotherapy was higher than the number who did not receive this treatment, with significant differences (P<.001).
ConclusionsThis study provides new clinical evidence of the depression-cancer comorbidity and supports the use of natural language processing for extracting and analyzing free-text clinical notes from electronic health records, contributing to the identification of additional clinical data that complements those provided by coded data to improve the management of these patients.
Loucera C, Perez-Florido J, Casimiro-Soriguer CS, Ortuño FM, Carmona R, Bostelmann G, Martínez-González LJ, Muñoyerro-Muñiz D, Villegas R, Rodriguez-Baño J, Romero-Gomez M, Lorusso N, Garcia-León J, Navarro-Marí JM, Camacho-Martinez P, Merino-Diaz L, de Salazar A, Viñuela L, Lepe JA, Garcia F, Dopazo J, The Andalusian COVID-19 sequencing initiative.
After more than two years of the COVID-19 pandemic, SARS-CoV-2 still remains a global public health problem. Successive waves of infection have produced new SARS-CoV-2 variants with new mutations whose impact on COVID-19 severity and patient survival is uncertain. A total of 764 SARS-CoV-2 genomes sequenced from COVID-19 patients, hospitalized from 19th February 2020 to 30th April 2021, along with their clinical data, were used for survival analysis. A significant association of B.1.1.7, the alpha lineage, with patient mortality (log hazard ratio, LHR = 0.51, C.I. = [0.14,0.88]) was found upon adjustment by all the covariates known to affect COVID-19 prognosis. Moreover, survival analysis of mutations in the SARS-CoV-2 genome revealed 27 of them to be significantly associated with higher mortality of patients. Most of these mutations were located in the S, ORF8 and N proteins. This study illustrates how a combination of genomic and clinical data provides solid evidence on the impact of viral lineage on patient survival.
Alvarez-Romero C, Martinez-Garcia A, Ternero Vega J, Díaz-Jimènez P, Jimènez-Juan C, Nieto-Martín MD, Román Villarán E, Kovacevic T, Bokan D, Hromis S, Djekic Malbasa J, Beslać S, Zaric B, Gencturk M, Sinaci AA, Ollero Baturone M, Parra Calderón CL.
JMIR Med Inform. 2022; 10 (6)
BackgroundOwing to the nature of health data, their sharing and reuse for research are limited by legal, technical, and ethical implications. In this sense, to address that challenge and facilitate and promote the discovery of scientific knowledge, the Findable, Accessible, Interoperable, and Reusable (FAIR) principles help organizations to share research data in a secure, appropriate, and useful way for other researchers.
ObjectiveThe objective of this study was the FAIRification of existing health research data sets and applying a federated machine learning architecture on top of the FAIRified data sets of different health research performing organizations. The entire FAIR4Health solution was validated through the assessment of a federated model for real-time prediction of 30-day readmission risk in patients with chronic obstructive pulmonary disease (COPD).
MethodsThe application of the FAIR principles on health research data sets in 3 different health care settings enabled a retrospective multicenter study for the development of specific federated machine learning models for the early prediction of 30-day readmission risk in patients with COPD. This predictive model was generated upon the FAIR4Health platform. Finally, an observational prospective study with 30 days follow-up was conducted in 2 health care centers from different countries. The same inclusion and exclusion criteria were used in both retrospective and prospective studies.
ResultsClinical validation was demonstrated through the implementation of federated machine learning models on top of the FAIRified data sets from different health research performing organizations. The federated model for predicting the 30-day hospital readmission risk was trained using retrospective data from 4944 patients with COPD. The assessment of the predictive model was performed using the data of 100 recruited (22 from Spain and 78 from Serbia) out of 2070 observed (records viewed) patients during the observational prospective study, which was executed from April 2021 to September 2021. Significant accuracy (0.98) and precision (0.25) of the predictive model generated upon the FAIR4Health platform were observed. Therefore, the generated prediction of 30-day readmission risk was confirmed in 87% (87/100) of cases.
ConclusionsImplementing a FAIR data policy in health research performing organizations to facilitate data sharing and reuse is relevant and needed, following the discovery, access, integration, and analysis of health research data. The FAIR4Health project proposes a technological solution in the health domain to facilitate alignment with the FAIR principles.
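The Results above report a high accuracy alongside a much lower precision. The sketch below shows, under stated assumptions, how the two metrics are computed from a confusion matrix and why they can diverge when the positive class (readmission) is rare. The counts are hypothetical and chosen only to illustrate the arithmetic; they are not the study's data, and the function names are illustrative.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Fraction of positive predictions that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical confusion matrix for a rare outcome (NOT study data):
# 1000 patients, of whom only 10 are truly readmitted.
tp, fp, fn, tn = 5, 15, 5, 975

acc = accuracy(tp, fp, fn, tn)
prec = precision(tp, fp)
print(f"accuracy = {acc:.2f}, precision = {prec:.2f}")
```

With these made-up counts, the many true negatives dominate accuracy, while precision depends only on the small group of positive predictions. This is one common way for the two figures to differ widely.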
López-Sánchez M, Loucera C, Peña-Chilet M, Dopazo J.
Hum Mol Genet. 2022; 31 (12)
Recent studies have demonstrated a relevant role of the host genetics in the coronavirus disease 2019 (COVID-19) prognosis. Most of the 7000 rare diseases described to date have a genetic component, typically highly penetrant. However, this vast spectrum of genetic variability remains yet unexplored with respect to possible interactions with COVID-19. Here, a mathematical mechanistic model of the COVID-19 molecular disease mechanism has been used to detect potential interactions between rare disease genes and the COVID-19 infection process and downstream consequences. Out of the 2518 disease genes analyzed, causative of 3854 rare diseases, a total of 254 genes have a direct effect on the COVID-19 molecular disease mechanism and 207 have an indirect effect revealed by a significant strong correlation. This remarkable potential of interaction occurs for >300 rare diseases. Mechanistic modeling of COVID-19 disease map has allowed a holistic systematic analysis of the potential interactions between the loss of function in known rare disease genes and the pathological consequences of COVID-19 infection. The results identify links between disease genes and COVID-19 hallmarks and demonstrate the usefulness of the proposed approach for future preventive measures in some rare diseases.
Carmona-Pírez J, Poblador-Plou B, Poncel-Falcó A, Rochat J, Alvarez-Romero C, Martínez-García A, Angioletti C, Almada M, Gencturk M, Sinaci AA, Ternero-Vega JE, Gaudet-Blavignac C, Lovis C, Liperoti R, Costa E, Parra-Calderón CL, Moreno-Juste A, Gimeno-Miguel A, Prados-Torres A.
Int J Environ Res Public Health. 2022; 19 (4)
The current availability of electronic health records represents an excellent research opportunity on multimorbidity, one of the most relevant public health problems nowadays. However, it also poses a methodological challenge due to the current lack of tools to access, harmonize and reuse research datasets. In FAIR4Health, a European Horizon 2020 project, a workflow to implement the FAIR (findability, accessibility, interoperability and reusability) principles on health datasets was developed, as well as two tools aimed at facilitating the transformation of raw datasets into FAIR ones and the preservation of data privacy. As part of this project, we conducted a multicentric retrospective observational study to apply the aforementioned FAIR implementation workflow and tools to five European health datasets for research on multimorbidity. We applied a federated frequent pattern growth association algorithm to identify the most frequent combinations of chronic diseases and their association with mortality risk. We identified several multimorbidity patterns clinically plausible and consistent with the bibliography, some of which were strongly associated with mortality. Our results show the usefulness of the solution developed in FAIR4Health to overcome the difficulties in data management and highlight the importance of implementing a FAIR data policy to accelerate responsible health research.
Casimiro-Soriguer CS, Loucera C, Peña-Chilet M, Dopazo J.
Sci Rep. 2022; 12 (1)
Gut microbiome is gaining interest because of its links with several diseases, including colorectal cancer (CRC), as well as the possibility of being used to obtain non-intrusive predictive disease biomarkers. Here we performed a meta-analysis of 1042 fecal metagenomic samples from seven publicly available studies. We used an interpretable machine learning approach based on functional profiles, instead of the conventional taxonomic profiles, to produce a highly accurate predictor of CRC with better precision than those of previous proposals. Moreover, this approach is also able to discriminate samples with adenoma, which makes this approach very promising for CRC prevention by detecting early stages in which intervention is easier and more effective. In addition, interpretable machine learning methods allow extracting features relevant for the classification, which reveals basic molecular mechanisms accounting for the changes undergone by the microbiome functional landscape in the transition from healthy gut to adenoma and CRC conditions. Functional profiles have demonstrated superior accuracy in predicting CRC and adenoma conditions than taxonomic profiles and additionally, in a context of explainable machine learning, provide useful hints on the molecular mechanisms operating in the microbiota behind these conditions.
Gundogdu P, Loucera C, Alamo-Alvarez I, Dopazo J, Nepomuceno I.
BioData Min. 2022; 15 (1)
BackgroundSingle-cell RNA sequencing (scRNA-seq) data provide valuable insights into cellular heterogeneity which is significantly improving the current knowledge on biology and human disease. One of the main applications of scRNA-seq data analysis is the identification of new cell types and cell states. Deep neural networks (DNNs) are among the best methods to address this problem. However, this performance comes at the cost of a lack of interpretability in the results. In this work we propose an intelligible pathway-driven neural network to correctly solve cell-type related problems at single-cell resolution while providing a biologically meaningful representation of the data.
ResultsIn this study, we explored the deep neural networks constrained by several types of prior biological information, e.g. signaling pathway information, as a way to reduce the dimensionality of the scRNA-seq data. We have tested the proposed biologically-based architectures on thousands of cells of human and mouse origin across a collection of public datasets in order to check the performance of the model. Specifically, we tested the architecture across different validation scenarios that try to mimic how unknown cell types are clustered by the DNN and how it correctly annotates cell types by querying a database in a retrieval problem. Moreover, our approach proved to be comparable to other less interpretable DNN approaches constrained by using protein-protein interactions gene regulation data. Finally, we show how the latent structure learned by the network could be used to visualize and to interpret the composition of human single cell datasets.
ConclusionsHere we demonstrate how the integration of pathways, which convey fundamental information on functional relationships between genes, with DNNs, that provide an excellent classification framework, results in an excellent alternative to learn a biologically meaningful representation of scRNA-seq data. In addition, the introduction of prior biological knowledge in the DNN reduces the size of the network architecture. Comparative results demonstrate a superior performance of this approach with respect to other similar approaches. As an additional advantage, the use of pathways within the DNN structure enables easy interpretability of the results by connecting features to cell functionalities by means of the pathway nodes, as demonstrated with an example with human melanoma tumor cells.
Loucera C, Peña-Chilet M, Esteban-Medina M, Muñoyerro-Muñiz D, Villegas R, Lopez-Miranda J, Rodriguez-Baño J, Túnez I, Bouillon R, Dopazo J, Quesada Gomez JM.
Sci Rep. 2021; 11 (1)
COVID-19 is a major worldwide health problem because of the acute respiratory distress syndrome and mortality it causes. Several lines of evidence have suggested a relationship between the vitamin D endocrine system and severity of COVID-19. We present a survival study on a retrospective cohort of 15,968 patients, comprising all COVID-19 patients hospitalized in Andalusia between January and November 2020. Based on a central registry of electronic health records (the Andalusian Population Health Database, BPS), prescription of vitamin D or its metabolites within 15-30 days before hospitalization was recorded. The effect of prescription of vitamin D (metabolites) for other indications before hospitalization was studied with respect to patient survival. Kaplan-Meier survival curves and hazard ratios support an association between prescription of these metabolites and patient survival. This association was stronger for calcifediol (hazard ratio, HR = 0.67, with 95% confidence interval, CI, of [0.50-0.91]) than for cholecalciferol (HR = 0.75, with 95% CI of [0.61-0.91]) when prescribed 15 days prior to hospitalization. Although the relation is maintained, there is a general decrease of this effect when a longer period of 30 days prior to hospitalization is considered (calcifediol HR = 0.73, with 95% CI [0.57-0.95] and cholecalciferol HR = 0.88, with 95% CI [0.75-1.03]), suggesting that the association was stronger when the prescription was closer to the hospitalization.
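The abstract above relies on Kaplan-Meier survival curves. As a minimal sketch of how such a curve is estimated, the block below implements the standard product-limit estimator in plain Python. The function name `kaplan_meier` and the toy follow-up data are hypothetical and unrelated to the study's cohort; an actual analysis would more likely use a dedicated survival-analysis library.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function.

    durations: follow-up time for each patient.
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) pairs, one per
    distinct time at which at least one death occurred.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # deaths and censorings leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving time t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data (days of follow-up; NOT from the study)
toy_durations = [5, 8, 8, 12, 15, 20]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_durations, toy_events)
for time, surv in km_curve:
    print(f"t = {time:>2}: S(t) = {surv:.3f}")
```

Censored patients (event 0) leave the risk set without lowering the curve, which is what lets the estimator use incomplete follow-up. Comparing such curves between prescribed and non-prescribed groups is the kind of analysis the abstract describes, although the study's exact procedure is not given here.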