Text-Mining of miRNA Biomarkers from PubMed Abstracts

An initiative to develop a text-mining tool that extracts miRNA biomarkers from PubMed abstracts.

In recent years, microRNAs (miRNAs) have been studied as biomarkers for human diseases because they control and regulate gene expression. However, research into how specific miRNAs influence specific diseases is slowed by a significant obstacle in the medical industry: the constantly updated PubMed database contains thousands of unsorted and complex abstracts. With this in mind, during the summer of 2020, we, Alexander Liang and Matt Laws, began a six-week internship under Dr. Nestoras Karathanasis, a professor at Thomas Jefferson University. Our internship focused on implementing a text-mining program in the R programming language. Our software automatically and efficiently extracts miRNA biomarker-disease relationships from publicly available abstracts on PubMed.

Figure: Diagram of the tasks we performed in our internship. The workflow for creating the R text-mining tool for PubMed reads: Understanding the field, Learning R, Reading Literature, Extracting miRNAs, Finding Relationship Sentences, Referencing HMDD, Cleaning Code, Evaluating our findings.

For each abstract, our program extracts six pieces of important information:

  • diseases: all diseases mentioned in the abstract
  • miRNAs: all miRNAs mentioned in the abstract
  • relationships: all sentences containing relationships between a disease and miRNA
  • PMID: the identifier for each abstract
  • organisms: organisms that the miRNAs belong to
  • countries: geographic information of research
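To make the extraction step more concrete, here is a minimal sketch of how miRNA mentions and relationship sentences might be pulled from one abstract. It is written in Python for illustration only; the tool itself was written in R, and this is not its actual code. The regular expression, the function names, the disease list, the sample text, and the placeholder PMID are all assumptions made up for this example. The organism and country fields are left out for brevity.

```python
import re

# Hypothetical pattern for miRNA mentions such as "miR-21",
# "hsa-miR-122-5p", "let-7a", or "microRNA-155". This is an assumption
# for illustration, not the pattern used by the original R tool.
MIRNA_PATTERN = re.compile(
    r"\b(?:[a-z]{3}-)?(?:miR|miRNA|microRNA|let)-?\d+[a-z]?(?:-\d+)?(?:-[35]p)?\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize_abstract(pmid, text, disease_terms):
    """Return one table row (as a dict) for a single abstract."""
    lowered = text.lower()
    # Diseases: dictionary lookup against a user-supplied list of terms.
    diseases = sorted(d for d in disease_terms if d.lower() in lowered)
    # miRNAs: every distinct regex match in the abstract.
    mirnas = sorted({m.group(0) for m in MIRNA_PATTERN.finditer(text)})
    # Relationships: sentences that mention both a miRNA and a disease.
    relationships = [
        s
        for s in split_sentences(text)
        if MIRNA_PATTERN.search(s)
        and any(d.lower() in s.lower() for d in disease_terms)
    ]
    return {
        "PMID": pmid,
        "diseases": diseases,
        "miRNAs": mirnas,
        "relationships": relationships,
    }


# Invented sample text and placeholder PMID, used only to exercise the sketch.
sample_text = (
    "miR-21 is upregulated in hepatocellular carcinoma. "
    "We collected serum from patients. "
    "Low let-7a levels were also observed."
)
summary = summarize_abstract("00000000", sample_text, ["hepatocellular carcinoma"])
print(summary)
```

In this sketch, only the first sentence counts as a relationship because it is the only one that names both a miRNA and a disease. Matching on co-occurrence within a sentence is the simplest possible rule; a real tool would likely need stricter criteria.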

The end product is a data table whose rows contain succinct summaries of thousands of abstracts, generated at a rate of approximately one abstract summary per second. The program compiles the information into a concise, easy-to-access spreadsheet in which each row contains all six pieces of extracted information.

Following extraction, we evaluated the accuracy of our tool using the manually curated Human microRNA Disease Database (HMDD) [1] as our ground truth. We employed three statistics:

  • recall, which measures the fraction of the relationships listed in HMDD that the program extracts
  • precision, which measures the fraction of the relationships the program extracts that are listed in HMDD
  • F-score, which combines the previous two statistics as their harmonic mean

At the end of the internship, the program’s best performance in finding miRNA-disease relationships for a single disease was a recall of 0.731, a precision of 0.864, and an F-score of 0.792.
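As an illustration of how these statistics relate, the following Python sketch computes precision, recall, and F-score from two sets of (miRNA, disease) pairs. The function names and the toy pairs are hypothetical placeholders, not HMDD content or output from our tool, which was written in R. The final lines check that the reported F-score matches the harmonic mean of the reported precision and recall.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, truth):
    """Compare extracted (miRNA, disease) pairs against ground-truth pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Precision: share of extracted pairs that are in the ground truth.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of ground-truth pairs that were extracted.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f_score(precision, recall)


# Toy placeholder pairs, invented purely to exercise the functions.
predicted_pairs = {("miR-A", "disease X"), ("miR-B", "disease X"), ("miR-C", "disease X")}
truth_pairs = {
    ("miR-A", "disease X"),
    ("miR-B", "disease X"),
    ("miR-D", "disease X"),
    ("miR-E", "disease X"),
}
toy_precision, toy_recall, toy_f = evaluate(predicted_pairs, truth_pairs)
print(toy_precision, toy_recall, toy_f)

# Consistency check on the figures reported in the text.
reported_f = f_score(precision=0.864, recall=0.731)
print(round(reported_f, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, a tool cannot reach a high F-score by excelling at only one of precision or recall.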

Figure: miRNA biomarkers for hepatocellular carcinoma

Read More

For more information, you can download our presentation slides.

References

  • [1] Huang, Z., Shi, J., Gao, Y., Cui, C., Zhang, S., Li, J., et al. (2019). HMDD v3.0: A database for experimentally supported human microRNA-disease associations. Nucleic Acids Res. 47, D1013–D1017. doi:10.1093/nar/gky1010. PubMed PMID: 30364956.

About Authors

Alexander Liang : Summer Intern

Matt Laws : Summer Intern

Supervisor

Nestoras Karathanasis : Teaching Assistant Professor
