Database Details

This page describes the SemMedDB schema: the database tables, their fields, and the relationships between the tables. The schema was revised recently, and the revised schema shown here was used to build the latest databases, semmedVER30 and semmedVER30_A. The previous version of the database schema is documented separately. An example row is given for each table.

Tables:

Name: CITATIONS table
This table contains relevant metadata for each PubMed citation and has the following data fields:

  • PMID: PubMed identifier of the citation
  • ISSN: ISSN identifier of the journal or the proceedings where the article was published
  • DP: Publication date for the citation
  • EDAT: The date when the citation was added to PubMed
  • PYEAR: Completion date for the citation
PMID      ISSN       DP        EDAT        PYEAR
19851774  1432-203X  2009 Dec  2010 01 21  2009

Name: GENERIC_CONCEPT table
This table contains the UMLS Metathesaurus concepts that are considered too generic based upon the 2006AA release. Concepts that are not stored in this table are considered novel. This table is used to populate the SUBJECT_NOVELTY and OBJECT_NOVELTY columns in the PREDICATION table defined below. Data fields in this table are as follows:

  • CONCEPT_ID: Auto generated primary key for each concept
  • CUI: Concept identifier (CUI) of the concept
  • PREFERRED_NAME: Preferred name of the concept
CONCEPT_ID  CUI       PREFERRED_NAME
1956        C0699748  Pathogenesis
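
The novelty rule described above can be sketched as a simple membership test. This is only an illustration, not the code used to build the database: the function name `novelty` is hypothetical, and the encoding (1 for novel, 0 for generic) is an assumption, since this page does not state how the novelty columns are encoded.

```python
# Assumption: 1 marks a novel concept and 0 a generic one.
# The set below holds only the CUI from the example row ("Pathogenesis").
GENERIC_CUIS = {"C0699748"}


def novelty(cui: str, generic_cuis: set) -> int:
    """Return 0 if the CUI is listed in GENERIC_CONCEPT, otherwise 1."""
    return 0 if cui in generic_cuis else 1


print(novelty("C0699748", GENERIC_CUIS))  # listed as generic, prints 0
print(novelty("C1326386", GENERIC_CUIS))  # not in this toy set, prints 1
```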

Name: SENTENCE table
This table contains information about individual sentences from PubMed citations and includes the following data fields:

  • SENTENCE_ID: Auto-generated primary key for each sentence
  • PMID: The PubMed identifier of the citation that the sentence belongs to
  • TYPE: 'ti' for the title of the citation and 'ab' for the abstract
  • NUMBER: The location of the sentence within the title or the abstract
  • SENTENCE: The actual string of this sentence
SENTENCE_ID  PMID      TYPE  NUMBER  SENTENCE
111751521    19855969  ti    1       Rheumatoid arthritis in patient with homozygous haemoglobin C disease.
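
Because NUMBER gives the position within the title or within the abstract, both TYPE and NUMBER are needed to put the sentences of one citation back in reading order. A minimal sketch follows; the helper `reading_order` is hypothetical, and the two abstract rows are placeholders rather than real data.

```python
# Title sentences sort before abstract sentences.
TYPE_RANK = {"ti": 0, "ab": 1}


def reading_order(row: dict) -> tuple:
    """Sort key: title first, then abstract, each ordered by NUMBER."""
    return (TYPE_RANK[row["TYPE"]], row["NUMBER"])


rows = [
    # Placeholder abstract rows (not real data).
    {"PMID": "19855969", "TYPE": "ab", "NUMBER": 2, "SENTENCE": "(placeholder)"},
    # The title row from the example above.
    {"PMID": "19855969", "TYPE": "ti", "NUMBER": 1,
     "SENTENCE": "Rheumatoid arthritis in patient with homozygous "
                 "haemoglobin C disease."},
    {"PMID": "19855969", "TYPE": "ab", "NUMBER": 1, "SENTENCE": "(placeholder)"},
]

ordered = sorted(rows, key=reading_order)
for row in ordered:
    print(row["TYPE"], row["NUMBER"], row["SENTENCE"])
```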

Name: PREDICATION table
Each record in this table identifies a unique predication. The data fields are as follows:

  • PREDICATION_ID: Auto-generated primary key for each unique predication
  • SENTENCE_ID: Foreign key to the SENTENCE table
  • PMID: The PubMed identifier of the citation that the predication belongs to
  • PREDICATE: The string representation of each predicate (for example TREATS, PROCESS_OF)
  • SUBJECT_CUI: CUI of the subject of the predication
  • SUBJECT_NAME: Preferred name of the subject of the predication
  • SUBJECT_SEMTYPE: Semantic type of the subject of the predication
  • SUBJECT_NOVELTY: Novelty of the subject of the predication
  • OBJECT_CUI: CUI of the object of the predication
  • OBJECT_NAME: Preferred name of the object of the predication
  • OBJECT_SEMTYPE: Semantic type of the object of the predication
  • OBJECT_NOVELTY: Novelty of the object of the predication
PREDICATION_ID  SENTENCE_ID  PMID      PREDICATE  SUBJECT_CUI  ...  OBJECT_CUI  ...  OBJECT_NOVELTY
1252467         3369924      16655556  AFFECTS    C1306232     ...  C1326386    ...  1
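
The SENTENCE_ID foreign key makes it possible to retrieve the source sentence for each predication. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in. The column types are assumptions, only a subset of the columns is declared, and the sentence row is a placeholder because the text of that sentence is not shown on this page.

```python
import sqlite3

PLACEHOLDER = "(sentence text not shown on this page)"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SENTENCE (
    SENTENCE_ID INTEGER PRIMARY KEY,
    PMID        TEXT,
    "TYPE"      TEXT,
    "NUMBER"    INTEGER,
    SENTENCE    TEXT
);
CREATE TABLE PREDICATION (
    PREDICATION_ID INTEGER PRIMARY KEY,
    SENTENCE_ID    INTEGER REFERENCES SENTENCE(SENTENCE_ID),
    PMID           TEXT,
    PREDICATE      TEXT,
    SUBJECT_CUI    TEXT,
    OBJECT_CUI     TEXT,
    OBJECT_NOVELTY INTEGER
);
""")

# Placeholder sentence row: TYPE, NUMBER and the text are assumptions.
conn.execute(
    "INSERT INTO SENTENCE VALUES (?, ?, ?, ?, ?)",
    (3369924, "16655556", "ab", 1, PLACEHOLDER),
)
# Values taken from the PREDICATION example row above.
conn.execute(
    "INSERT INTO PREDICATION VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1252467, 3369924, "16655556", "AFFECTS", "C1306232", "C1326386", 1),
)

# Join each predication to the sentence it was extracted from.
rows = conn.execute(
    """
    SELECT p.PREDICATION_ID, p.PREDICATE, p.SUBJECT_CUI, p.OBJECT_CUI, s.SENTENCE
    FROM PREDICATION AS p
    JOIN SENTENCE AS s ON s.SENTENCE_ID = p.SENTENCE_ID
    WHERE p.PMID = ?
    """,
    ("16655556",),
).fetchall()
print(rows)
```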

Name: PREDICATION_AUX table
This table holds auxiliary information for the predications recorded in the PREDICATION table. There is a one-to-one relationship between the PREDICATION and PREDICATION_AUX tables. It includes the following data fields:

  • PREDICATION_AUX_ID: Auto-generated primary key for the auxiliary information of each unique predication
  • PREDICATION_ID: Foreign key to the PREDICATION table
The remaining fields in the PREDICATION_AUX table provide mention-level information for the elements of the predication.
  • SUBJECT_TEXT: Text that maps to the subject
  • SUBJECT_DIST: The distance of the subject mention (counted in noun phrases) from the predicate mention (0 for certain indicator types, such as NOM)
  • SUBJECT_MAXDIST: The number of potential arguments (in noun phrases) from the predicate mention in the direction of the subject mention (0 for certain indicator types, such as NOM)
  • SUBJECT_START_INDEX: First character position (in document) of text denoting subject entity
  • SUBJECT_END_INDEX: Last character position (in document) of text denoting subject entity
  • SUBJECT_SCORE: The confidence score of the mapping between the subject text and the subject concept
  • INDICATOR_TYPE: The type of the predicate, such as VERB for verbal predicates, and NOM for nominalizations and other argument-taking nouns. For a full list of indicator types, see the Appendix in [2]
  • PREDICATE_START_INDEX: First character position (in document) of text denoting relation
  • PREDICATE_END_INDEX: Last character position (in document) of text denoting relation
  • OBJECT_*: The fields representing information about the object, in the same way the SUBJECT_* fields do for the subject
  • CURR_TIMESTAMP: The timestamp for the record
PREDICATION_AUX_ID  PREDICATION_ID  SUBJECT_TEXT  SUBJECT_DIST  SUBJECT_MAXDIST  ...  OBJECT_TEXT    ...  OBJECT_SCORE
1252473             1252467         severing      1             2                ...  transpiration  ...  888
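
One way to picture the one-to-one relationship is a UNIQUE constraint on PREDICATION_ID, so that a second auxiliary row for the same predication is rejected. The sketch below again uses an in-memory sqlite3 database as a stand-in. Whether the distributed database actually declares such a constraint is not stated on this page, so treat it as an assumption; the second PREDICATION_AUX_ID (1252474) is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PREDICATION_AUX (
    PREDICATION_AUX_ID INTEGER PRIMARY KEY,
    PREDICATION_ID     INTEGER NOT NULL UNIQUE,  -- assumed constraint
    SUBJECT_TEXT       TEXT,
    SUBJECT_DIST       INTEGER,
    SUBJECT_MAXDIST    INTEGER,
    OBJECT_TEXT        TEXT,
    OBJECT_SCORE       INTEGER
);
""")

insert = "INSERT INTO PREDICATION_AUX VALUES (?, ?, ?, ?, ?, ?, ?)"

# Values taken from the PREDICATION_AUX example row above.
conn.execute(insert, (1252473, 1252467, "severing", 1, 2, "transpiration", 888))

# A second row for the same predication violates the UNIQUE constraint.
try:
    conn.execute(insert, (1252474, 1252467, "severing", 1, 2, "transpiration", 888))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM PREDICATION_AUX").fetchone()[0]
print(duplicate_rejected, row_count)
```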

Name: COREFERENCE table
This table has coreference information generated by SemRep with Anaphora (option -A). It includes the following data fields:

  • COREFERENCE_ID: Auto-generated primary key for each unique coreference
  • PMID: The PubMed identifier of the citation that the coreference belongs to
  • ANA_CUI: The CUI of the anaphor element of the coreference
  • ANA_NAME: The preferred name of the anaphor element of the coreference
  • ANA_SEMTYPE: The semantic type of the anaphor element of the coreference
  • ANA_TEXT: Text that maps to the anaphor
  • ANA_SENTENCE_ID: The foreign key to SENTENCE of the anaphor element of the coreference
  • ANA_START_INDEX: First character position (in document) of text denoting the anaphor
  • ANA_END_INDEX: Last character position (in document) of text denoting the anaphor
  • ANA_SCORE: The confidence score of the mapping between the anaphor text and the anaphor concept
  • ANT_CUI: The CUI of the antecedent element of the coreference
  • ANT_NAME: The preferred name of the antecedent element of the coreference
  • ANT_SEMTYPE: The semantic type of the antecedent element of the coreference
  • ANT_TEXT: Text that maps to the antecedent
  • ANT_SENTENCE_ID: The foreign key to SENTENCE of the antecedent element of the coreference
  • ANT_START_INDEX: First character position (in document) of text denoting the antecedent
  • ANT_END_INDEX: Last character position (in document) of text denoting the antecedent
  • ANT_SCORE: The confidence score of the mapping between the antecedent text and the antecedent concept
  • CURR_TIMESTAMP: The timestamp for the record
COREFERENCE_ID  PMID      ANA_CUI   ANA_NAME  ANA_SEMTYPE  ...  ANT_CUI   ...  CURR_TIMESTAMP
35539           11000385  C0029235  Organism  orgm         ...  C0317850  ...  2017-01-26 17:21:42

Name: ENTITY table
This table stores entity information taken from the ENTITY records of full-fielded output. It includes the following data fields:

  • ENTITY_ID: Auto-generated primary key for each unique entity
  • SENTENCE_ID: The foreign key to the SENTENCE table
  • CUI: The CUI of the entity
  • NAME: The preferred name of the entity
  • TYPE: The semantic type of the entity
  • GENE_ID: The EntrezGene ID of the entity
  • GENE_NAME: The EntrezGene name of the entity
  • TEXT: Text in the utterance that maps to the entity
  • START_INDEX: First character position (in document) of text denoting entity
  • END_INDEX: Last character position (in document) of text denoting entity
  • SCORE: The confidence score
ENTITY_ID  SENTENCE_ID  CUI       NAME  TYPE  ...  TEXT  START_INDEX  END_INDEX  SCORE
12845406   3369924      C0806140  Flow  orga  ...  flow  154          158        790
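
The START_INDEX and END_INDEX fields can be used to recover a mention from the document text. In the example row, END_INDEX minus START_INDEX (158 - 154) equals the length of "flow", which suggests that the end offset points just past the last character. That is an inference from a single row, and the offset base is not stated on this page, so the sketch below treats both as assumptions. The helper `mention_text` and the document string are hypothetical.

```python
def mention_text(document: str, start_index: int, end_index: int) -> str:
    """Slice a mention out of the document text.

    Assumes 0-based offsets with an exclusive end offset.
    """
    return document[start_index:end_index]


# Hypothetical document: 154 filler characters followed by the mention.
document = "x" * 154 + "flow" + " (remaining text)"

span = mention_text(document, 154, 158)
print(span)  # prints "flow" for this synthetic document
```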




The entity-relationship diagram for SemMedDB version 3.0 and later is shown below:

ER Diagram
  1. Fiszman, M., et al. (2004). Abstraction summarization for managing the biomedical research literature. Proceedings of the HLT-NAACL Workshop on Computational Lexical Semantics, pp. 76-83.
  2. Kilicoglu, H., et al. (2011). Constructing a semantic predication gold standard from the biomedical literature. BMC Bioinformatics, 12:486.