The authors have declared that no competing interests exist.
Conceived and designed the experiments: SVL JB SP SA HYK ZL TS YVdP FG. Performed the experiments: SVL JB CHW KH FG. Analyzed the data: SVL JB CHW SP YVdP FG. Wrote the paper: SVL JB SP FG.
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (
The richness of information available in the vast biomedical literature has motivated many studies and resources to include textual data as an information source
During the last decade, the development of fully automated text mining techniques has attracted wide interest, resulting in several general-purpose stand-alone text mining tools
The gene mentions recognised in text are in red and the extracted event structures in blue. The normalization algorithm further maps the ambiguous gene mentions to unique database identifiers (in green).
In this work, we bring together the two independent lines of research on event extraction and gene normalization by combining two state-of-the-art systems from the BioNLP Shared Task and the BioCreative challenge. Integrating these approaches with a previously released gene family assignment algorithm, we further broaden the normalization scope to cover not only gene and protein identifiers, but also more general gene families that group evolutionarily related and functionally similar genes across species. Additionally, we present and evaluate a novel normalization algorithm using canonical symbols and taxonomic assignment. Our analyses illustrate that these different normalization algorithms exhibit different properties, and we demonstrate how they can be used to complement each other, providing a powerful means to query information in the literature at varying levels of detail. All methods are run on all PubMed (PM) abstracts and PubMed Central (PMC) open access full texts, resulting in a unique dataset for text mining researchers, bioinformaticians and biologists. A novel API allows customized querying of this data in a variety of applications.
In comparison to previous large-scale analyses
The text mining pipeline applied to all PM abstracts and PMC full-text articles consists of several consecutive steps. After downloading and pre-processing all data from the source literature databases and identifying the sentences in all articles, the first step towards information extraction entails the recognition of gene and protein mentions in text. Next, event extraction is performed to detect statements of biological processes and regulatory associations that involve the mentioned genes and proteins. Then, a gene normalization step is applied to resolve the ambiguous gene symbols to database identifiers. Finally, all data is integrated to extend our previously introduced resource, EVEX, for custom browsing and querying. A general overview of the text mining pipeline is depicted in
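The consecutive steps of the pipeline can be sketched as follows. This is a minimal illustrative skeleton only: all function names, the toy lexicon and the trigger-word heuristic are invented stand-ins for the actual components (the gene/protein recognizer, TEES and GenNorm) described in the following subsections.

```python
# Illustrative sketch of the pipeline stages; every component here is a
# hypothetical stand-in for the real tools described in the text.

def split_sentences(document: str) -> list[str]:
    """Naive sentence splitter standing in for the real pre-processing step."""
    return [s.strip() for s in document.split(". ") if s.strip()]

def recognize_genes(sentence: str) -> list[str]:
    """Toy gene/protein mention detector: matches tokens against a tiny lexicon."""
    lexicon = {"p53", "Esr-1", "DUN1"}
    return [tok.strip(",.") for tok in sentence.split() if tok.strip(",.") in lexicon]

def extract_events(sentence: str, genes: list[str]) -> list[dict]:
    """Toy event extractor: one 'Regulation' event per trigger word found."""
    events = []
    if "regulates" in sentence and len(genes) >= 2:
        events.append({"type": "Regulation", "cause": genes[0], "theme": genes[1]})
    return events

def run_pipeline(document: str) -> list[dict]:
    all_events = []
    for sentence in split_sentences(document):
        genes = recognize_genes(sentence)
        all_events.extend(extract_events(sentence, genes))
    return all_events

doc = "p53 regulates DUN1 in this assay. Esr-1 was also measured."
print(run_pipeline(doc))  # one Regulation event: cause p53, theme DUN1
```

A subsequent normalization step (not shown) would then map each extracted gene mention to database identifiers, as detailed below.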
The black arrows represent previously published tools, which have all been integrated in this study to create a unified text mining pipeline. Furthermore, the various opportunities for combining the different methods for gene normalization are presented by the colored edges and discussed in detail in the text.
While PM abstracts can be processed relatively straightforwardly, we have implemented a few novel pre-processing steps to extract information from full-text PMC articles. First, we apply a Unicode-to-ASCII mapping, building on tools introduced for the ST'11
We perform the detection of gene and protein mentions in text (
For event extraction (
We previously applied the original version of TEES to 19 million PM abstracts to produce the first large-scale event dataset
Task | Event type | Example text fragment | Count
ST'09 / ST'11 (GE) | Transcription | during meiosis, … | 561K
ST'09 / ST'11 (GE) | Gene expression | DUN1 was not identified as differentially … | 10453K
ST'09 / ST'11 (GE) | Localization | the subcellular … | 1805K
ST'09 / ST'11 (GE) | Protein catabolism | hyperglycemia leads to CREB protein … | 279K
ST'09 / ST'11 (GE) | Binding | in vitro … | 6154K
ST'09 / ST'11 (GE) | Regulation | the … | 3194K
ST'09 / ST'11 (GE) | Positive regulation | p27 (Kip1) … | 9955K
ST'09 / ST'11 (GE) | Negative regulation | miR-198 functions to … | 6010K
ST'11 (EPI) | (De)phosphorylation | mutants were … | 1005K
ST'11 (EPI) | (De)hydroxylation | reduced Hif1alpha … | 16K
ST'11 (EPI) | (De)ubiquitination | – | 89K
ST'11 (EPI) | DNA (de)methylation | NGFI-A binding to its consensus sequence is inhibited by … | 173K
ST'11 (EPI) | (De)glycosylation | TibA was the first … | 172K
ST'11 (EPI) | (De)acetylation | in human myocytes where over-expressed Sir2 … | 135K
ST'11 (EPI) | (De)methylation | increased … | 147K
ST'11 (EPI) | Catalysis | an enzyme that … | 43K
ST'11 (REL) | Protein-Component | truncation of the … | 2178K
ST'11 (REL) | Subunit-Complex | the … | 681K
All event types included in this study, their counts, and example text fragments. Phosphorylation is only listed once, but was originally also included in the ST'09 and ST'11 GE data. The EPI types, with the exception of Catalysis, specifically include a positive and a reverse variant.
To rank the event predictions according to their reliability, TEES assigns a confidence score for each classification step. These scores are aggregated to normalized event scores using the
We previously released a large-scale event extraction dataset covering only abstracts and the event types of the ST'09 as part of the EVEX database
Method | Lexical variation | Synonymy | Orthology | Species-specific
Canonicalization | yes | no | no | no |
Family assignment | yes | yes | yes | no |
Gene normalization | yes | yes | no | yes |
The different normalization methods applied in this study, and whether or not they account for lexical variation, synonymy, orthology and species-specific resolution. By creating combinations of these algorithms, their individual strengths can be aggregated.
The canonicalization algorithm as previously implemented within the EVEX resource has two main goals. First, it resolves small lexical variations in spelling, mapping both “
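The first goal, resolving small lexical variations, can be sketched as follows. This is a simplified illustration only and does not reproduce the exact rules of the EVEX canonicalization algorithm; the normalization pattern is an assumption.

```python
import re

# Minimal sketch of symbol canonicalization: strip hyphens, underscores
# and whitespace, and lowercase, so that small lexical variants such as
# "Esr-1" and "esr1" map to the same canonical form. (The actual EVEX
# rules may differ; this is illustrative.)

def canonicalize(symbol: str) -> str:
    return re.sub(r"[-_\s]", "", symbol).lower()

print(canonicalize("Esr-1"), canonicalize("esr1"))  # both yield "esr1"
```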
The original EVEX release additionally included a family assignment algorithm, resolving the canonical symbols to the most plausible gene family as defined by HomoloGene (eukaryotes,
The original family assignment relies solely on the canonical forms of the gene symbols to determine the most plausible gene family to a specific gene mention. Disambiguation between different candidate gene families is performed by selecting the family that contains the most genes with this specific canonical form as synonym. In practice, this results in the interpretation of “
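The disambiguation rule described above, selecting the candidate family containing the most genes that carry the canonical form as a synonym, can be sketched as follows. The family synonym table is invented toy data.

```python
from collections import Counter

# Sketch of the majority-count disambiguation described in the text:
# among candidate families, pick the one whose member genes most often
# carry the mention's canonical form as a synonym. (Toy family table.)

FAMILY_SYNONYMS = {
    "fam_A": ["esr1", "esr1", "esr1"],  # three member genes with synonym "esr1"
    "fam_B": ["esr1"],                  # one member gene with that synonym
}

def assign_family(canonical, families):
    counts = Counter({fam: syns.count(canonical) for fam, syns in families.items()})
    family, hits = counts.most_common(1)[0]
    return family if hits > 0 else None

print(assign_family("esr1", FAMILY_SYNONYMS))  # fam_A wins with 3 matching genes
```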
To provide even more detailed normalization, in this study we apply the GenNorm normalization algorithm
As GenNorm was developed in the context of the BioCreative III task, it produces document-level normalizations, i.e. a set of Entrez Gene identifiers relevant to each input document. This is achieved by first identifying one or more candidate normalizations for each gene mention in the text. These candidates are then aggregated at the document level to produce a final set of normalizations that is consistent across the article. However, in order to integrate the GenNorm results with the event analyses, mention-level gene normalizations are needed. We therefore extended GenNorm to revisit the original per-mention candidates and to choose, for each mention, the candidate most consistent with the document-level set.
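The mention-level extension can be sketched as follows: each mention's candidate identifiers are intersected with the document-level set, and a consistent candidate is kept. The identifiers and the first-match tie-breaking rule are illustrative assumptions, not GenNorm's actual implementation.

```python
# Sketch of refining document-level normalizations to the mention level:
# keep, for each mention, a candidate that also appears in the final
# document-level identifier set. (Toy identifiers; illustrative only.)

def refine_mentions(mention_candidates, document_ids):
    resolved = {}
    for mention, candidates in mention_candidates.items():
        consistent = [c for c in candidates if c in document_ids]
        resolved[mention] = consistent[0] if consistent else None
    return resolved

doc_ids = {7157, 1017}                      # document-level normalization set
mentions = {"p53": [7157, 24842], "cdk2": [1017], "xyz": [999]}
print(refine_mentions(mentions, doc_ids))   # xyz stays unresolved (None)
```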
For gene mentions where full resolution into an Entrez Gene identifier is not possible, GenNorm still assigns the most likely organism of the mention, using its stand-alone open source module SR4GN
The three normalization algorithms described above exhibit different properties (
This novel approach can improve the accuracy of the family assignment by using the newly introduced mention-level gene normalizations and taxonomic assignments, taking into account the context of the full document. Conversely, the family assignments can assist in extending the coverage of the gene normalization algorithm itself (
The effects of combining these different methods for both family as well as gene ID assignment are detailed in the Results and Discussion Section.
All the data we have generated in this study are made publicly available through bulk downloads, an upgraded version of the EVEX web application, and through a novel programmatic interface, which allows custom querying of both the event structures and the normalization data. We have invested substantial engineering efforts into assuring that this large dataset can be efficiently queried, providing real-time response times even for queries involving complex structures occurring tens of thousands of times in the data. We solicit community feedback on both the website and the API, as these resources will be closely maintained and further improved upon in future efforts.
We have run all methods detailed in the previous section on all 21.9 million PM abstracts and 460 thousand PMC open access full-text articles (data downloaded on June 25, 2012). To make the processing times manageable in practice, the pipeline was parallelized over more than a hundred cluster machines, enabling the processing of all data in a matter of days. An analysis of the total run time shows that the most time-consuming tasks in the pipeline are the syntactic analysis (41%), gene recognition (8%), gene normalization (35%) and event extraction (14%).
The automated processing of all available PM abstracts and PMC full-texts yielded more than 40 million detailed biomolecular events among 76 million gene/protein mentions (
This plot illustrates that this study covers normalized event data across all domains and kingdoms. It was created with iTOL
Abstracts | Full text | Total | Entrez Gene | Ensembl Genomes | |
Articles | 6.4M | 384K | 6.8M | – | – |
Sentences | 54.8M | 66.9M | 121.6M | – | – |
Gene/protein mentions | 43.3M | 33.3M | 76.5M | 28.8M (37.6%) | 47.9M (62.6%) |
Events | 23.5M | 16.7M | 40.2M | 16.3M (40.5%) | 26.0M (64.7%) |
Extraction statistics for PubMed abstracts and PubMed Central full texts with at least one identified gene/protein mention. The last two columns state the number of mentions/events that could be fully normalized to Entrez Gene identifiers or Ensembl Genomes families.
The various normalization strategies can be combined with the event extraction results by defining equality of different event occurrences across documents. For instance, two events can be regarded as equal when they have the same event structure and their arguments pertain to the exact same gene identifiers. Using this definition, the number of instances in the data can be reduced by grouping similar statements together. However, not all events can be fully normalized using Entrez Gene IDs, as some of the participating arguments may not have been assigned to gene identifiers. Out of the original set of 40.2 million events, we were able to map 16.3 million (40.5%) to unique identifiers, together resulting in a smaller set of 1.5 million unique normalized events (
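The grouping of event occurrences into unique normalized events can be sketched as follows; the event representation and key function are simplified assumptions (real events may have deeper, nested structures).

```python
# Sketch of grouping event occurrences across documents: two occurrences
# are merged when they share the event type and the same normalized gene
# identifiers in the same argument roles. (Toy flat event records.)

def event_key(event):
    return (event["type"], event["cause_id"], event["theme_id"])

def group_events(occurrences):
    groups = {}
    for ev in occurrences:
        groups.setdefault(event_key(ev), []).append(ev)
    return groups

occurrences = [
    {"type": "Regulation", "cause_id": 7157, "theme_id": 1017, "pmid": 1},
    {"type": "Regulation", "cause_id": 7157, "theme_id": 1017, "pmid": 2},
    {"type": "Binding",    "cause_id": 7157, "theme_id": 1017, "pmid": 3},
]
grouped = group_events(occurrences)
print(len(grouped))  # 3 occurrences collapse into 2 unique normalized events
```

Defining `event_key` over gene family identifiers instead of gene identifiers yields the interolog-style grouping discussed next.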
The equality of events may alternatively be defined through the gene families, regarding two events as equal when they have the same structure and involve entities from the same gene families. This definition groups together interologs, conserved interactions between homologous pairs of proteins, and can thus support comparative genomics use-cases. Out of the original set of 40.2 million events, we linked 26.0 million events (64.7%) to gene families from Ensembl Genomes, a significantly higher fraction than that linked to Entrez Gene (
The addition of full-text event extraction significantly increases the coverage of information in biomolecular text, nearly doubling the text mining dataset in size compared to using PubMed abstracts only (
To further assess the added value of processing full-text articles rather than just abstracts, we analysed the retrieval of equivalent events within and across articles, defining events as equal when their event structure is equivalent and their arguments pertain to the same Entrez Gene identifiers. We found that only 7% of all events extracted from the body of a full-text PMC article could also be found in its abstract, results that are in line with previous reports
The event extraction algorithm of TEES was previously evaluated in the framework of the BioNLP Shared Tasks
To establish whether these official benchmark results, evaluated on small domain-specific corpora, can be extrapolated to event predictions over the entire literature, we have manually evaluated a sample of predicted events to determine the precision of the extraction system. To this end, random sets of 100 events each were drawn from PMC article bodies and from PM/PMC abstracts. Recall was not evaluated, as this would require full annotation of the evaluation documents, an extremely time-consuming task.
Despite the relatively small scale of this evaluation, the observed general trend is well in line with the official results (
Both the evaluations of the BioNLP ST'11 GE task development set (3021 events, ST evaluation scripts) as well as a fully random sample (200 events, manually evaluated) are depicted. Events are ordered by their confidence scores, and plotted at different precision/recall trade-off points.
By ranking events using the automatically assigned scores, subsets of the event data can be selected with higher precision at the cost of lower recall. We observe the relation between precision and confidence scores both on the ST data and the manual evaluation (
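The precision/recall trade-off obtained by thresholding on the confidence score can be sketched as follows; the scores are invented toy values, not actual TEES output.

```python
# Sketch of confidence-based filtering: raising the threshold keeps a
# smaller, higher-precision subset of events. (Invented toy scores.)

def filter_by_confidence(events, threshold):
    return [ev for ev in events if ev["score"] >= threshold]

events = [{"id": i, "score": s} for i, s in enumerate([1.2, 2.8, 3.1, 3.9, 4.5])]
for threshold in (0.0, 3.0, 4.0):
    kept = filter_by_confidence(events, threshold)
    print(threshold, len(kept))  # higher thresholds retain fewer events
```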
The full set of events extracted within this study has an average confidence score of 3. When looking at the subsets of events that can be normalized into either Ensembl Genomes families or Entrez Gene identifiers, the average confidence value of the events rises to 3.06 and 3.09, respectively. Both these differences are statistically significant (
Finally, note that the event extraction step is limited by its aim to extract only events within single sentences, never crossing sentence boundaries. It was previously determined that intersentence events account for 6–9% of all data
The GenNorm algorithm was thoroughly evaluated through the BioCreative III challenge, which uses Threshold Average Precision (TAP) scores. The gold standard BioCreative test set consists of 50 PMC full-text articles manually annotated with Entrez Gene identifiers
To assess the performance of the GenNorm algorithm in combination with the canonicalization and family assignments (
However, by far the most important cause for the drop in performance is the fact that the event extraction pipeline does not process figures or tables. Indeed, such input data can not be processed by standard event extraction techniques, and has thus been
Precision | Recall | F-score | |
GenNorm | 38.1 | 26.9 | 31.5 |
Canonical | |||
HomoloGene | 27.6 | 19.2 | 22.7 |
Ensembl | 29.5 | 10.7 | 15.7 |
Ensembl Genomes | 31.1 | 20.7 | 24.9 |
Canonical + GenNorm | 35.3 | 29.8 | |
GenNorm + Ens. Genomes | 33.0 | 30.4 | 31.7 |
Canonical + GenNorm + Ens. Genomes | 32.4 |
Performance of the various algorithms for Entrez Gene identifier assignment, as measured on the BioCreative III dataset. The canonical and family assignment algorithms both refer to the combined procedure which uses the taxonomic assignments by GenNorm to enable species-specific ID disambiguation (
Evaluating the ability of the gene family assignments for determining unique, species-specific gene identifiers (
These analyses illustrate the added value of using the canonicalization algorithm on top of the GenNorm predictions, producing the highest possible performance within these task settings. While the gene families cannot be used to further improve on this combination in general, their biggest contribution lies in their ability to increase recall and cover more event occurrences. In future work, we plan to further evaluate these opportunities on the CRAFT corpus, a recently released mention-level gene normalization dataset
A multi-level approach to normalization is applicable not only to the identification of unique genes for textual mentions, but also to the assignment of families. Considering the fact that GenNorm achieves higher precision compared to the EVEX family assignment algorithm (
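The combination strategy just described, preferring a family derived from the GenNorm gene identifier and falling back to the canonical-symbol family assignment, can be sketched as follows. Both lookup tables are invented toy data.

```python
# Sketch of the multi-level fallback for family assignment: use the
# family of the GenNorm gene ID when available (high precision), and
# fall back to the canonical-symbol assignment otherwise (high recall).
# Both mapping tables below are invented for illustration.

GENE_TO_FAMILY = {7157: "fam_p53"}                    # Entrez Gene ID -> family
CANONICAL_TO_FAMILY = {"p53": "fam_p53", "esr1": "fam_esr"}

def assign_family(gene_id, canonical):
    if gene_id is not None and gene_id in GENE_TO_FAMILY:
        return GENE_TO_FAMILY[gene_id]                # GenNorm-derived path
    return CANONICAL_TO_FAMILY.get(canonical)         # canonical fallback

print(assign_family(7157, "p53"))   # resolved via the gene identifier
print(assign_family(None, "esr1"))  # resolved via the canonical fallback
```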
Precision | Recall | F-score | |
EVEX (original) | 18.9 | 41.9 | 26.0 |
EVEX (adapted) | 19.8 | 46.5 | 27.5 |
EVEX (adapted) + GenNorm | 21.5 | ||
GenNorm |
Performance of the gene family assignment using Ensembl Genomes definitions, as measured on a modified version of the BioCreative III dataset, translating gold gene IDs to their correct families. The last row depicts family assignments based on GenNorm only.
When interpreting these results, it is important to realise that this evaluation is performed on a dataset of family assignments, created as such by translating the manually annotated gene identifiers to families. In fact, mentions such as “the Esr-1 family” are out of scope for the BioCreative III normalization task and thus not annotated in this corpus. When the family assignment correctly determines the family ID for such a mention, this will show up as a false positive in our evaluation, artificially lowering precision rates. However, as there is, to the best of our knowledge, no other suitable mention-level gold standard dataset for evaluating gene family assignment in text, we accept these BioCreative III results as broadly indicative of recall and lower-bound estimates of precision.
To further investigate the added value of using the original family assignment as a fallback mechanism, we have evaluated this algorithm in the context of a study on NADP(H) metabolism in
In addition to illustrating the benefits of the specific combination strategy considered here, these results demonstrate how access to different layers of normalization granularity for the events extracted from text makes it possible to combine the different normalization strategies according to the specific use-case, selecting either for high recall or high precision.
As demonstrated above, many of the algorithms applied in this study can be evaluated separately in terms of precision, recall and F-score. However, performance rates may vary drastically according to the domain or evaluation setup, as revealed by the normalization evaluation. Additionally, due to the complex interplay of the various components in the text mining pipeline and due to the influence of confidence-based filtering, the usability of this data as a whole can only be assessed within specific real-world use cases. In this section, a few promising example applications of this resource are illustrated, including data integration, database curation and pathway reconstruction.
To demonstrate the application of the text mining data in the context of database curation, the data was integrated with experimental information from STRING v.9, a rich resource of protein associations incorporating data from many major domain databases, including high-throughput experiments, computationally inferred annotations, and manually curated pathways
Database | Total # of pairs | Text mining match | Coverage |
PID | 998 | 820 | 82% |
HPRD | 1,057 | 694 | 66% |
DIP | 4,085 | 1,738 | 43% |
GRID | 28,735 | 8,346 | 29% |
KEGG | 72,620 | 19,739 | 27% |
MINT | 13,805 | 2,851 | 21% |
IntAct | 10,281 | 1,984 | 19% |
Reactome | 7,871 | 1,402 | 18% |
BIND | 6,453 | 1,135 | 18% |
BioCyc | 810 | 25 | 3% |
Number of unique high-confidence protein pairs in STRING, and the proportion of these pairs for which an event is found through text mining.
A very broad variation in coverage is observed, ranging from over 80% for the PID database, to just a few percent for BioCyc. This broad variation is expected: the PID database, for example, consists solely of manually curated associations and requires literature support, while BioCyc additionally contains many sequence-based, computationally predicted associations, which are not expected to substantially overlap with existing literature. The high recall against fully curated databases like PID serves as an indirect verification of the text mining system and illustrates the potential of text mining for applications in database curation support. We further note that the recall numbers found here are expected to rise as more full-text articles become open access and thus available for text mining.
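The coverage figures in the table above amount to the fraction of curated protein pairs for which at least one text-mined event links the same pair, which can be sketched as follows. The pairs and the undirected-match assumption are illustrative toy data, not the actual STRING comparison.

```python
# Sketch of the coverage computation: the fraction of curated protein
# pairs matched by at least one text-mined event. Pairs are treated as
# undirected via frozenset. (Invented toy pairs, not real STRING data.)

def coverage(curated_pairs, mined_pairs):
    matched = sum(1 for pair in curated_pairs if frozenset(pair) in mined_pairs)
    return matched / len(curated_pairs)

curated = [(7157, 1017), (7157, 4609), (1017, 4609), (999, 1000)]
mined = {frozenset(p) for p in [(1017, 7157), (4609, 7157)]}
print(f"{coverage(curated, mined):.0%}")  # 2 of 4 pairs matched -> 50%
```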
Another promising application of event extraction involves assisting pathway curation and analysis. While previous work on this topic has involved small-scale evaluation with manual mapping between gene symbols and identifiers
A) Close interactions of p53 from KEGG pathway hsa04115. B) The highest confidence predicted event from EVEX for each directed KEGG association. All are correct and correspond to the KEGG interaction type. Event visualizations were made with stav
For each pair of proteins in the KEGG
For each KEGG interaction that connects a directed pair of proteins, we have further selected the event with the highest confidence score. These events are shown in
This example directly demonstrates the benefits of our data for pathway reconstruction and for providing textual evidence for known interactions. However, this approach could also straightforwardly be applied to assist pathway curators in the search for candidate interactions established in the literature but not previously incorporated in pathway models.
We have presented a text mining analysis that combines structured event extraction with gene normalization – two major lines of research in the BioNLP community – to process all PubMed abstracts and all open access full texts in PubMed Central. The applied text mining pipeline represents the state of the art, confirmed by the results of community-wide shared task evaluations. The resulting dataset contains 40 million biomolecular events among 76 million gene and protein mentions from over 5000 species. Covering protein metabolism, fundamental molecular events, regulatory control, epigenetics, post-translational modifications and protein domains and complexes, this resource is unprecedented in semantic scope, literature coverage, and data retrieval facilities.
In this study, we have extended the gene normalization step of the text mining pipeline to produce mention-level results rather than only document-level ones. Additionally, we integrated a canonicalization algorithm and gene family definitions from HomoloGene, Ensembl and Ensembl Genomes, enabling a multi-level normalization strategy. We have demonstrated that such an integrative normalization method is useful to resolve cases where gene families are mentioned rather than individual genes, and those where the exact organism or substrain is difficult to distinguish. Further, specific normalization combinations allow selecting for either high recall or high precision of the results. By publicly releasing all our data, we hope to encourage the exploitation of this information also in other text mining studies and frameworks.
The detailed evaluations presented in this study illustrate that there is still room for improving the algorithms behind the various text mining components. However, by integrating these components into a unified pipeline and running them on the whole of PubMed and PubMed Central open access documents, instance-level recall issues are minimized and the chances of finding any specific piece of biologically relevant information are significantly increased. Indeed, it is sufficient to only extract a certain biological event once for it to be useful in a specific case-study. Additionally, we have shown that the confidence values, automatically assigned to the textual event predictions, can be used for selecting high-precision subsets of the data at various thresholds. Together, these opportunities were illustrated on example use-cases, obtaining high recall against manually curated databases such as PID. Further, we have shown on a specific pathway example that text mining data allows for accurate extraction of relevant literature and interaction partners of uniquely identified genes. These promising results demonstrate the potential of text mining applications in database curation and knowledge summarization.
All data produced in this study has been integrated into the EVEX resource (
In future work, we plan to target additional biological use-cases to further evaluate and improve our integrative methods. In the framework of these future studies, we plan to improve the EVEX website and API to accommodate also researchers outside the field of BioNLP. Finally, we will build upon the normalization evaluations presented in this study to further enhance gene normalization in the context of event extraction, performing additional evaluations on a recently released mention-level corpus, as well as augmenting the textual events with information derived from figures and tables.
This file provides additional details on the pathway curation use-case, which describes a subsection of the
(XLS)
SVL and YVdP acknowledge the support of Ghent University (Multidisciplinary Research Partnership 'Bioinformatics: from nucleotides to networks'). Computational resources were provided by CSC – IT Center for Science Ltd, Espoo, Finland.