Automatic extraction of biomedical knowledge from scientific publications

4 February, 2020

The efficient retrieval and processing of biomedical information from textual sources is a crucial step towards better characterizing the molecular entities at play in paediatric cancers.

Large databases of medical literature such as PubMed, which contains more than 25 million references to journal articles in biomedicine, store complex and heterogeneous knowledge whose extraction and integration through Natural Language Processing (NLP) techniques is key to advancing research and knowledge discovery.

Davide Cirillo, a postdoctoral researcher in Alfonso Valencia’s group at Barcelona Supercomputing Center (BSC), was recently awarded the “José Castillejo” mobility grant funded by the Spanish Ministry of Education. He worked on a project focused on NLP in paediatric oncology at IBM Research – Zurich, from October to December 2019.

During his stay at partner IBM, Davide developed a bioinformatics tool that uses NLP approaches for content analytics of biomedical publications on distinct childhood cancers. The tool automatically extracts mentions of bio-entities (genes, proteins, drugs, diseases and others) and their semantic relationships, making it possible to unify and interpret biomedical information from both unstructured text sources and the multi-omics datasets available within the iPC consortium.
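To give a rough idea of what extracting bio-entity mentions and their relationships can look like in practice, here is a minimal Python sketch. It is not the tool developed at IBM, whose implementation is not described here. It assumes a tiny hand-written lexicon (the hypothetical `LEXICON` below) and treats two entities as related simply because they appear in the same sentence, which is a much cruder notion than the semantic relationships mentioned above. The function names and the example text are likewise illustrative assumptions.

```python
import re
from itertools import combinations

# Hypothetical toy lexicon (an assumption for illustration): term -> entity type.
LEXICON = {
    "MYCN": "gene",
    "ALK": "gene",
    "neuroblastoma": "disease",
    "crizotinib": "drug",
}


def extract_mentions(sentence):
    """Return (term, entity type, start offset) for every lexicon hit, in text order."""
    mentions = []
    for term, entity_type in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            mentions.append((term, entity_type, match.start()))
    return sorted(mentions, key=lambda mention: mention[2])


def cooccurrence_pairs(text):
    """Pair up distinct lexicon terms that are mentioned in the same sentence."""
    pairs = []
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        terms = sorted({term for term, _, _ in extract_mentions(sentence)})
        pairs.extend(combinations(terms, 2))
    return pairs


# Illustrative input text, written for this example.
text = "MYCN amplification is frequent in neuroblastoma. Crizotinib inhibits ALK."

for sentence in re.split(r"(?<=[.!?])\s+", text):
    print([(term, etype) for term, etype, _ in extract_mentions(sentence)])
print(cooccurrence_pairs(text))
```

A real system would typically replace the lexicon lookup with trained named-entity recognition models and normalise mentions to database identifiers, and would classify the type of each relationship rather than rely on co-occurrence alone.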