In recent years, microRNAs (miRNAs) have been studied as biomarkers in human diseases because they regulate gene expression. However, research into how specific miRNAs influence specific diseases is slowed by a significant obstacle in the medical industry: the constantly updated PubMed database, which contains thousands of unsorted, complex abstracts. With this in mind, during the summer of 2020, we, Alexander Liang and Matt Laws, began a six-week internship under Dr. Nestoras Karathanasis, a professor at Thomas Jefferson University. Our internship focused on implementing a text-mining program in the R programming language. The software automatically extracts miRNA biomarker-disease relationships from publicly available abstracts on PubMed.
Diagram of the tasks we performed in our internship
For each abstract, our program extracts six pieces of information:
- diseases: all diseases mentioned in the abstract
- miRNAs: all miRNAs mentioned in the abstract
- relationships: all sentences containing relationships between a disease and miRNA
- PMID: the identifier for each abstract
- organisms: organisms that the miRNAs belong to
- countries: geographic information of research
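To make the extraction step concrete, here is a minimal sketch of how four of the six fields could be pulled from one abstract. It is written in Python for illustration only; our tool was written in R, and this is not its actual code. The miRNA pattern, the two-entry disease vocabulary, the naive sentence splitter, and the rule that a "relationship" is any sentence mentioning both a miRNA and a disease are all assumptions made for this example. Organisms and countries are left out.

```python
import re

# Assumed pattern for miRNA mentions such as "miR-21", "hsa-miR-122-5p", "let-7a".
# The matching rules of the original R tool are not described in the text.
MIRNA_RE = re.compile(
    r"\b(?:[a-z]{3}-)?(?:miR|mir|let)-\d+[a-z]?(?:-\d+)?(?:-[35]p)?\b"
)

# Hypothetical disease vocabulary; a real run would need a far larger list.
DISEASES = ["hepatocellular carcinoma", "breast cancer"]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(pmid, abstract):
    """Return diseases, miRNAs, and co-mention sentences for one abstract."""
    lowered = abstract.lower()
    diseases = [d for d in DISEASES if d in lowered]
    mirnas = sorted(set(MIRNA_RE.findall(abstract)))
    # Assumption: a sentence counts as a relationship if it mentions
    # at least one miRNA and at least one known disease.
    relationships = [
        s for s in split_sentences(abstract)
        if MIRNA_RE.search(s) and any(d in s.lower() for d in DISEASES)
    ]
    return {
        "PMID": pmid,
        "diseases": diseases,
        "miRNAs": mirnas,
        "relationships": relationships,
    }


# Made-up abstract text and placeholder PMID, for illustration only.
example = summarize(
    "00000000",
    "miR-21 is upregulated in hepatocellular carcinoma. "
    "We also measured hsa-miR-122-5p in serum. "
    "Samples were collected in 2019.",
)
print(example)
```

A co-mention rule like this is simple but can flag sentences where the miRNA and the disease are unrelated, which is one reason precision needs to be measured against a curated database.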
The end product is a data table in which each row is a succinct summary of one abstract and contains all six pieces of extracted information. The program generates these summaries at a rate of approximately one abstract per second and compiles them into a concise, easy-to-access spreadsheet.
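As a rough illustration of the table layout, the sketch below writes one row per abstract to CSV text. It is in Python rather than the R we used, and the column names, the "; " separator for multi-valued fields, and the example row (including its placeholder PMID) are assumptions made for this example, not the tool's actual output format.

```python
import csv
import io

# Assumed column order, based on the six fields listed above.
FIELDS = ["PMID", "diseases", "miRNAs", "relationships", "organisms", "countries"]


def to_csv(rows):
    """Serialize summary rows to CSV text, joining list-valued fields with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        flat = {
            key: "; ".join(value) if isinstance(value, list) else value
            for key, value in row.items()
        }
        writer.writerow(flat)
    return buffer.getvalue()


# Hypothetical row with a placeholder PMID, for illustration only.
example_rows = [
    {
        "PMID": "00000000",
        "diseases": ["hepatocellular carcinoma"],
        "miRNAs": ["miR-21", "miR-122"],
        "relationships": ["miR-21 is upregulated in hepatocellular carcinoma."],
        "organisms": ["Homo sapiens"],
        "countries": ["USA"],
    }
]
print(to_csv(example_rows))
```

Joining lists into one cell keeps the "one row per abstract" shape; the `csv` module quotes any cell that contains a comma, so relationship sentences with commas stay in a single column.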
Following extraction, we evaluated the accuracy of our tool using the manually curated Human microRNA Disease Database (HMDD) [1] as our ground truth. We employed three statistics:
- recall, which measures the program’s ability to extract relevant relationships according to HMDD
- precision, which measures the program’s ability to only generate true relationships according to HMDD
- f-score, which is the harmonic mean of the previous two statistics
At the end of the internship, the program’s best performance on finding miRNA-disease relationships for a single disease was a recall of 0.731, a precision of 0.864, and an f-score of 0.792.
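The three statistics can be sketched as follows, treating the tool's output and the HMDD entries as sets of (miRNA, disease) pairs. This is a Python illustration rather than the R code we used, and the pairs below are placeholder values, not real HMDD content. The last line checks that the harmonic mean of the reported recall and precision reproduces the reported f-score.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0.0 if both are zero."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


def evaluate(predicted, truth):
    """Return (recall, precision, f_score) for two sets of (miRNA, disease) pairs."""
    true_positives = len(predicted & truth)
    recall = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision, harmonic_mean(recall, precision)


# Placeholder pairs for illustration only (not taken from HMDD).
truth = {("miR-1", "disease X"), ("miR-2", "disease X"),
         ("miR-3", "disease X"), ("miR-4", "disease X")}
predicted = {("miR-1", "disease X"), ("miR-2", "disease X"),
             ("miR-9", "disease X")}

recall, precision, f_score = evaluate(predicted, truth)
print(recall, precision, f_score)

# Consistency check on the figures reported above.
print(round(harmonic_mean(0.731, 0.864), 3))  # prints 0.792
```

In the toy example, two of the four true pairs are found (recall 0.5) and two of the three predictions are correct (precision about 0.667).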
miRNA biomarkers for hepatocellular carcinoma
Read More
For more information you can download our presentation slides.
References
- [1] Huang, Z, Shi, J, Gao, Y, Cui, C, Zhang, S, Li, J, et al. (2019). HMDD v3.0: A database for experimentally supported human microRNA-disease associations. Nucleic Acids Res. 47, D1013–D1017. doi:10.1093/nar/gky1010. PubMed PMID:30364956.
About Authors
Alexander Liang
Summer Intern
Matt Laws
Summer Intern
Supervisor
Nestoras Karathanasis
Teaching Assistant Professor