Semantically annotating eHealth data using the SNOMED-CT ontology

In this blog post, I will take you through the different steps needed to set up a local MongoDB instance with a snapshot of the SNOMED-CT ontology, and to use this data to semantically annotate a dataset. As such, the blog post can be divided into the following parts:

  1. Introduction to the original dataset and the related machine learning task
  2. Setting up the SNOMED-CT MongoDB locally
  3. Annotating the original dataset using RDFLib
  4. Possibilities with our new semantically annotated dataset

Of course, the SNOMED-CT ontology can be replaced by any other ontology (allowing you to skip step 2), although this will require some adaptations to the methodology discussed below. This is especially relevant for people who cannot obtain a SNOMED-CT license.

Introduction to the original dataset & related ML task

The dataset I will be using throughout this blog post is a publicly available dataset called MigBase (Çelik, Ufuk, et al. “Diagnostic accuracy comparison of artificial immune algorithms for primary headaches.” Computational and Mathematical Methods in Medicine 2015 (2015)). It contains very clean data related to primary headache disorders. We are given different properties of a headache attack (such as the symptoms and the duration) and must predict to which kind of disorder the attack belongs (migraine, cluster headache, or tension headache). I did some minimal pre-processing of this file (such as discarding columns with only one unique value and throwing away the single sample with ‘no headache’ as its class). It can be found here. Here’s a small sample of the dataset:

Small sample of the MigBase dataset (most features are binary)

All features are quite self-explanatory, except for the durationGroup one. Here, I got some more context from the original author himself:

Snippet out of the mail from Ufuk Çelik

Setting up the SNOMED-CT MongoDB

Now let’s set up a local SNOMED-CT MongoDB. Turns out this is not as easy as it first seems… But once we have this up and running, we can perform local queries and map, for example, our symptom names to existing concepts from the SNOMED ontology. This allows us to exploit the vast amount of knowledge already available within this ontology.

Requirements: MongoDB, Java, Maven, a license to the SNOMED-CT ontology, git, npm and nodejs (version > 4.4.2)

Disclaimer: everything was done on Ubuntu, with an i7-7820HQ processor and 16 GiB of RAM.

1. Clone the required repositories with git

git clone

git clone

2. Getting the SNOMED-CT RF2 Files

This is quite an annoying step, which will require you to make an account with an official authority. This can either be done through UMLS or by your local authority.

3. Conversion of SNOMED-CT RF2 files to a JSON format

Go to the rf2-to-json-conversion directory that we just got from GitHub. Then, install the Maven project using the mvn install command. A new directory called target will be created, containing the JARs. Now we modify one of the config files (I chose the enConfig, since I want my knowledge base to be in English; the words that need to be modified are bold):

<?xml version="1.0" encoding="UTF-8"?>
<editionName>International Edition</editionName>


These are the parameters that need to be set:

  • $EFF_TIME: release date/effective time for the final package; if it combines the International Edition and an extension, use the later date, usually the extension’s. The format is yyyymmdd. The latest SNOMED version (at the time of writing) is 31 July 2017.
  • $EXP_TIME: date when a warning needs to appear in the browser to announce that the data may be deprecated. I’m not sure whether this parameter is of much importance; I just took my $EFF_TIME + 1 year.
  • $OUTPUT_FOLDER: the folder where the constructed JSON files will be stored
  • $PATH_TO_SNOMED_RELEASE: the path to the snapshot folder in the downloaded SNOMED-CT ontology (from UMLS or your local authority)
  • Extension Releases: if you have extension files, you can specify their paths here
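Putting this together, the relevant part of the config file might look as follows. This is only a sketch: the exact element names may differ slightly in your version of the conversion tool, and the paths are placeholders you must replace with your own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <editionName>International Edition</editionName>
    <effectiveTime>20170731</effectiveTime>             <!-- $EFF_TIME -->
    <expirationTime>20180731</expirationTime>           <!-- $EXP_TIME: EFF_TIME + 1 year -->
    <outputFolder>/home/user/snomed-json</outputFolder> <!-- $OUTPUT_FOLDER -->
    <foldersBaselineLoad>
        <!-- $PATH_TO_SNOMED_RELEASE: the Snapshot folder of your RF2 release -->
        <folder>/path/to/SnomedCT_RF2Release/Snapshot</folder>
    </foldersBaselineLoad>
    <processInMemory>true</processInMemory> <!-- set to false if you have less than 8 GB of RAM -->
</config>
```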

When we are done with the config file, we can run the compiled JAR:

java -Xmx8g -jar target/rf2-to-json-conversion-1.2-jar-with-dependencies.jar conf/enConfig.xml

If you do not have enough RAM on your computer (8 GB), make sure to set <processInMemory> to false in the config XML, or lower the memory allocation of your JVM (the -Xmx8g flag). When the conversion is done (it takes some time…), we should have three JSON files in our output folder: concepts.json, manifest.json and text-index.json

4. Upload the newly created JSON files to MongoDB

First, we edit the script. Set json_dir to the correct path (where you can find the 3 json files created in the previous step) and set desturl to localhost or a remote machine. Run it, and a new file, json.tgz, will be created in your home directory. Now edit the second script: fill in the correct edition and importDate, the correct json_dir and the zip_fn location (which is the location of the .tgz created by the previous script). Start a mongod (on port 27017, but this is the default port) and run the script:

./ en-edition 20170731

5. Start the REST API

Go to the directory of the other repository we cloned at the beginning of this section, run npm install and then sudo node app.js

Finally, we are done! Examples to test if all the steps worked, in the language of your preference, can be found here.

Running the Python example gave me the following output:

Bilirubin test kit (physical object)
Methylphenyltetrahydropyridine (substance)
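If you prefer to query MongoDB directly instead of going through the REST API, a minimal sketch with pymongo could look like this. Note that the database name (en-edition) and collection layout (a concepts collection with a descriptions array) are assumptions based on the import step above; adapt them to whatever the import scripts created on your machine.

```python
import re


def build_description_query(term):
    """Build a MongoDB filter matching concepts whose description
    contains the given term (case-insensitive substring match)."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return {"descriptions.term": pattern}


if __name__ == "__main__":
    # Hypothetical usage against the local SNOMED-CT MongoDB;
    # database and collection names depend on the import scripts.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    collection = client["en-edition"]["concepts"]
    for concept in collection.find(build_description_query("bilirubin")).limit(5):
        print(concept.get("defaultTerm"))
```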

Semantically annotating our CSV-data (semantic encoding) and decoding with rdflib

Now that we have a local copy of SNOMED-CT set up, we can start writing some Python code that annotates our data. It consists of two functions:

  • encode: input is the csv-data, the output is a semantic file (e.g. in Turtle syntax)
  • decode: input is a semantic file, the output is the original csv-data

The main code will look as follows:

The main code (including a test: original =?= decode(encode(original)))

Of course, for this to work, we need to implement our encode and decode functions. Both functions make use of the same URIs; therefore, we keep the concepts and the mappings from string values to URIs in separate files:

Python file with the different types and predicates for our headache data. We created all of these ourselves, but if predicates or concepts already exist that express the same thing, it would be better to reuse those.
Python file with a bunch of dicts that convert a string (occurring in the CSV) to a URI and vice versa

Now let’s get started with our encode and decode functions. It turns out that they are actually quite simple. For the encoding part, we just iterate over the rows in our CSV and map each of the features to a triple, using the predicates and types we created ourselves. The code can be found below:

The encoding code

An important part of this code is the getDescriptionByString function: here we query our local SNOMED-CT database to find related concepts so that we can link to them, one of the main principles of linked data. The decoding part is just as simple: a single SPARQL query, after which we iterate over the results:

The decoding code

Another nice functionality of the rdflib library is that it allows you to host a local endpoint of your semantic data. Here is some Python code to do this:

And here are some screenshots of the endpoint:

The landing page contains a SPARQL query editor
One of the pages about a headache
We can even generate graphs of our KB…

(Some) possibilities with our new semantically annotated data…

  • Traverse our knowledge graph to look for new features
  • Create artificial samples to combat the data imbalance problem, based on a knowledge base
  • Create vector embeddings of our knowledge graphs to serve as new features for our machine learning
  • Create a ‘prototype’ graph for each of our classes (based on expert knowledge) and measure graph similarities to these prototype graphs (by using e.g. the Weisfeiler-Lehman kernel)
  • Enrich our decision tree visualizations (by getting, for example, the descriptions from SNOMED for each of the symptoms)

I implemented a few of the listed possibilities. For the full code, or a more thorough explanation, I refer to our GitHub repository and the paper published in BMC Medical Informatics and Decision Making.
