Chunk #16 — Construction and content — Literature dataset

Source
SNPs3D: candidate gene and SNP selection for association studies.

Text

The abstracts of all Medline entries associated with each gene in the NCBI Gene database [56] are the source of words and terms. In the current version, there are 80,249 Medline references linked to 19,228 human genes. Word types (parts of speech) are identified using SVMtagger [10]. Keyterms are constructed from single nouns and adjectives, adjective/noun pairs, and continuous strings of words classified as adjectives or nouns. For example, the phrase 'blood pressure' occurring in an abstract would result in three keyterms: 'blood', 'pressure', and 'blood pressure'. Terms occurring only once are removed. There are currently 266,337 keyterms in total.
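
The keyterm construction described above can be illustrated with a minimal Python sketch. It is not the SNPs3D implementation. It assumes the tagger output is already available as (word, tag) pairs with the hypothetical coarse tags "ADJ" and "NOUN", and the function names are invented for illustration. It also reads "continuous strings" as each maximal run of adjacent adjectives and nouns, and reads "occurring only once" as a total count of one across all abstracts. Both readings are assumptions.

```python
from collections import Counter

# Hypothetical coarse tags; the actual SVMtagger tag set is not assumed here.
KEY_TAGS = {"ADJ", "NOUN"}


def keyterms_from_tagged(tagged):
    """Return the keyterm occurrences (with repeats) for one tagged abstract.

    `tagged` is an iterable of (word, tag) pairs.
    """
    terms = []
    run = []  # current run of adjacent adjectives/nouns
    # The trailing sentinel flushes a run that ends the abstract.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in KEY_TAGS:
            run.append(word.lower())
            continue
        # Single nouns and adjectives.
        terms.extend(run)
        # Adjacent two-word pairs within the run.
        terms.extend(" ".join(run[i:i + 2]) for i in range(len(run) - 1))
        # The whole run, when it is longer than a pair.
        if len(run) > 2:
            terms.append(" ".join(run))
        run = []
    return terms


def build_vocabulary(tagged_abstracts):
    """Count keyterms over all abstracts and drop those seen only once."""
    counts = Counter()
    for tagged in tagged_abstracts:
        counts.update(keyterms_from_tagged(tagged))
    return {term for term, n in counts.items() if n > 1}
```

With the paper's own example, `keyterms_from_tagged([("blood", "NOUN"), ("pressure", "NOUN")])` yields the three keyterms 'blood', 'pressure', and 'blood pressure'.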