A datahack for medical research problems

HealthHack 2013 was a weekend hackfest solving problems that medical researchers face. HealthHack brought together medical researchers, bioinformaticians, software developers, data analysts, data visualisers and designers. Together, they created new software tools to analyse, visualise and communicate data.

I want to display a heap of genetic information at once…

We need a simple and highly visually effective software package to display genetic and epigenetic marks, where we can view many different layers of information at once, and easily switch between viewing a single region of one gene, to viewing many genes, to even viewing a whole chromosome.

There are many publicly available datasets that could be used to create such visualisation software, and it would be worth considering the currently available systems for viewing such data (e.g. Galaxy, UCSC Genome Browser, IGV, SeqMonk). While each of these has its benefits, none provides a holistic view of the data that allows optimal visualisation.

Every cell in the body contains all the genes of the genome – the entire DNA content of a cell – but not every cell in the body is the same. For example, a heart cell is very different from a neuron. How can these cells be so different if they all contain the same genes?

The answer is that each cell type turns on, or uses, a distinct set of genes. This means that each cell type makes its own complement of protein products that help determine the cell type’s function.

So what controls whether a gene is turned on or off? We know this is influenced by how easily accessible the gene is to the factors required to turn the gene on: the more tightly packaged the gene is, the less likely it is to be accessible to factors that bind and activate it, and the less likely it will be turned on.

We now know that the different packaging of the DNA can be correlated with various marks that are made to the DNA, called ‘epigenetic marks’. These epigenetic marks can be considered as punctuation marks in the genome – they allow the cell to interpret how to read the information contained in the DNA sequence. Due to transformational changes in the way we examine these marks throughout the whole genome, biologists are creating a wealth of data that reports not only the DNA sequence, but also the amount of various epigenetic marks, and the amount of different factors bound to the DNA throughout the whole genome. This landslide of data is highly complex, and difficult for bench biologists to interrogate.

Here is a bunch of problems that geneticists would love help with!

Inside each cell in our body are more than 30,000 genes (the ‘genome’) that are expressed at different levels to control a wide range of processes. One example is the differentiation of blood cells: a single blood stem cell in the bone marrow can give rise to many different mature blood cells, progressively developing from progenitors with the potential to become many different cells, to mature cells with very specialised functions. What key genes are turned on and off at various stages of differentiation? Can we associate functions of these cells with their genomic profiles?

Using whole genome techniques to measure the expression levels of all the genes in a cell, we can gather expression values of all the genes across a variety of cell types as a matrix of values. Then we can analyse this data in multiple ways to obtain biological meaning and generate hypotheses. As a simple example, suppose that a gene is expressed highly in only one cell type out of many. Perhaps this implies that this gene is crucial in creating these cells, and blocking the action of this gene may lead to therapeutic outcomes if too many of these cells cause disease. Hence we need tools that let us look at this data in different ways. We also want to empower biologists without a programming background to carry out some analysis in intuitive ways. So here are some questions which may have tractable answers over a weekend of coding, based on a matrix of gene expression values:

  • Use d3 (JavaScript) to plot the expression profile of a gene in various ways – e.g. quickly switch between bar plots and box plots, or change the presentation based on the “aggregation” metric (sum over strains or mutation status, for example). Input could simply be a dictionary of values corresponding to one row of the matrix mentioned above.
  • Many genes work together to create a phenotype, and there are many ways of grouping genes into “networks”. Some common ways of creating gene networks also create visualisation challenges, due to the sheer number of possible points and connections. In this context, evaluate some commonly used network visualisation tools, such as a minimum spanning tree diagram, on iOS and other tablet and handheld platforms. Input could be a graph consisting of a set of nodes and edges. What can and can’t tablet devices do compared to mouse/keyboard combinations?
  • Correlation calculations are often performed on subsets of the matrix mentioned above. The high number of possible subset combinations often makes such calculations quite slow. What strategies could be employed to speed up these calculations?
  • Clustering is often performed on genes, but not very often on cells. However, for large datasets, clustering on cells is very useful for the biologist. What ideas could we explore to visualise clusters of related cells based on quantitative measures such as correlations?
  • Heatmaps are often used to show clusters in a matrix of data, but they often have limited usefulness. Can we come up with some new ways of creating heatmaps?
  • Many of the bioinformatics methods are developed in R and deployed as R packages. What are some practically useful implementation strategies for deploying applications, which can provide the user with interface tools such as d3, while leveraging R as the analysis engine?
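
As a starting point for the first question in the list above, here is a minimal sketch of the data preparation behind an “aggregation” switch. It is written in Python rather than d3, and the sample names, the mutation-status annotation and the function name aggregate_profile are all hypothetical. It takes one row of the matrix as a dictionary and collapses it into one value per group, which is the shape a bar plot would need.

```python
from collections import defaultdict
from statistics import mean


def aggregate_profile(profile, groups, metric=mean):
    """Collapse one gene's per-sample values into one value per group.

    profile: dict mapping sample name -> expression value (one matrix row).
    groups:  dict mapping sample name -> group label (e.g. mutation status).
    metric:  function applied to the list of values in each group.
    """
    grouped = defaultdict(list)
    for sample, value in profile.items():
        grouped[groups[sample]].append(value)
    return {group: metric(values) for group, values in grouped.items()}


# Hypothetical row of the expression matrix: sample name -> expression value.
profile = {"s1": 2.0, "s2": 4.0, "s3": 10.0, "s4": 14.0}
# Hypothetical annotation: sample name -> mutation status.
status = {"s1": "wild_type", "s2": "wild_type", "s3": "mutant", "s4": "mutant"}

by_mean = aggregate_profile(profile, status)              # mean per group
by_sum = aggregate_profile(profile, status, metric=sum)   # sum per group
```

Passing a different metric, or a different grouping dictionary such as strain, changes the aggregation without touching the plotting code.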

I need a good visualisation tool for literature research!

It would be of great benefit to the biomedical research community worldwide to develop a simple online tool that visually displays the results of a literature search in a manner that takes into account both the ranking of the references as well as their similarity to one another. Such a tool would enable researchers and clinicians to more efficiently browse the literature to not only find relevant articles but also gain an overview of a given research field.

A fundamental task in biomedical research and clinical practice is to browse the existing literature. This task is performed millions of times per day across the globe with the same basic purpose in mind: namely, to acquire information from the existing body of biomedical knowledge. For example, a clinician might want to learn about a new treatment option for one of their patients, or a PhD student might need to become familiar with an established field of research before undertaking their own studies.

In general, there are several major impediments to efficient literature browsing. First, the sheer volume of available literature can be overwhelming. For instance, there were more than 20 million references in standard biomedical databases by 2010, and this figure is likely to grow exponentially in coming years. Second, the search tools that exist to facilitate literature browsing are generally unwieldy and inefficient. Finally, it is difficult to identify references that are similar to one another – for instance, those that deal with the same question or topic – despite this being highly desirable when browsing the literature.

The standard biomedical literature search tool is PubMed, which provides an interface for accessing various reference databases in the biomedical sciences. Despite its popularity, PubMed presents its search results via a cumbersome interface, with references displayed in chronological order and with no attempt to rank results based on likely relevance or importance. Given that more than one third of PubMed searches return ≥100 references, finding articles of greatest relevance can be highly time-consuming. In recent years, several PubMed derivatives have been developed (e.g. Anne O’Tate, EBIMed), which attempt to rank and/or cluster results in a more user-friendly interface. However, these alternatives have failed to gain significant traction.

Many people also use Google to search the biomedical literature. The related Google Scholar aims to rank documents the way researchers do, weighing the full text of each document, where it was published, who it was written by, as well as how often and how recently it has been cited in other scholarly literature. While Google Scholar does an OK job of finding relevant articles, one problem that it shares with PubMed and its derivatives is that it fails to identify references that are similar to one another. In this regard, several attempts have been made to develop bibliometric mapping tools that visually cluster similar references together, based either on shared semantics (i.e. the words contained within the article) or citation networks. However, even though such programs (e.g. VOSviewer) are freely available, they have not been merged with mainstream search tools such as PubMed or Google Scholar.
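
To make “similarity based on shared words” concrete, here is a minimal sketch that scores two references by the cosine similarity of their word counts. This is one simple possibility, not the method used by any of the tools named above, and the example titles and function names are made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic words in a title or abstract."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two word-count vectors (0.0 when either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reference titles, used only to exercise the functions.
title_1 = word_counts("Gait analysis with depth sensors")
title_2 = word_counts("Depth sensors for gait analysis in rehabilitation")
title_3 = word_counts("Primer design for mutation panels")

close = cosine_similarity(title_1, title_2)    # shares four words
distant = cosine_similarity(title_1, title_3)  # shares no words
```

A pairwise score like this could feed a clustering or layout step, so that similar references end up near one another on screen while the search ranking controls their prominence.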

Visualising regions of interest in large graphs

Large graphs arise commonly in bioinformatic analysis, for instance as the result of inferring gene regulatory networks, or building graphs of DNA sequences for genome assembly. These graphs are generally too large to lay out and visualise interactively, having tens of thousands to potentially billions of nodes. It would, however, be useful to be able to interactively visualise a restricted neighbourhood of such a graph.

Option 1:

One instance in which this would be useful is the analysis of sequence variants represented in a localised area of a genome assembly. Genome sequencing of a population of cells results in hundreds of millions of sequence “reads”. Each of these reads (which can be thought of as strings of A, C, G and T with a typical length of 100 characters) represents a set of linearly adjacent nucleotides from a strand of DNA present in a single cell taken from this population.

To generate the complete sequence of a genome, these individual reads must be put together, much like jigsaw pieces, based upon the presence of overlapping substrings. One approach to solving this problem is to create a k-mer graph. For some value of k (smaller than the read length) a graph is created, where each node represents a string of length k (a k-mer) present in one or more reads. A directed edge exists where two k-mers share an overlap of k-1 characters (for example CATT -> ATTG). In this way, a read may be represented as a path through the graph, and the reconstructed genome sequence is (ideally) encoded as a longer path through the same graph.
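
The construction just described can be sketched in a few lines of Python. This is a toy version that holds everything in memory, so it would not scale to hundreds of millions of reads, and the function name build_kmer_graph and the two example reads are hypothetical.

```python
from collections import Counter


def build_kmer_graph(reads, k):
    """Build the k-mer graph of a set of reads.

    Returns (counts, edges): counts maps each k-mer to the number of times it
    occurs across all reads, and edges maps each k-mer to the sorted list of
    k-mers in the graph that it overlaps by k-1 characters.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1

    edges = {}
    for kmer in counts:
        suffix = kmer[1:]  # the k-1 characters a successor must start with
        edges[kmer] = sorted(
            suffix + base for base in "ACGT" if suffix + base in counts
        )
    return counts, edges


# Two hypothetical overlapping reads, with k = 4.
reads = ["CATTG", "ATTGC"]
counts, edges = build_kmer_graph(reads, 4)
# edges["CATT"] is ["ATTG"], matching the CATT -> ATTG example above.
```

Each read is then a path through edges, and the k-mer counts give a rough measure of how well supported each node is.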

In a cancer sample several different DNA variations may be present, particularly at clinically relevant positions of the genome, arising from the clonal structure of the population of cells. The presence of a sequence variation – such as the replacement, insertion, or deletion of nucleotides – will create alternative paths in a localised region of this graph, and visualisation of these alternatives may present a means for researchers to understand the spectrum of variants at a given locus.

The aim of this project is to interactively construct and display, given a seed genomic region of interest, the structure of the k-mer graph implied by a set of sequence reads in such a way as to show clearly the variants present, with a visual indication of their frequency of occurrence (determined by the number of reads that indicate their presence).
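
One way to obtain the frequency indication is to count, for every edge, how many reads support it, and then compare the edges that leave a branching k-mer. The sketch below assumes an edge is supported by a read when the two k-mers are adjacent in that read, which is a narrower rule than the overlap definition above. The reads and function names are hypothetical.

```python
from collections import Counter


def edge_support(reads, k):
    """Count how many reads contain each pair of adjacent k-mers."""
    support = Counter()
    for read in reads:
        # A set, so that a read supports each edge at most once.
        pairs = {
            (read[i:i + k], read[i + 1:i + k + 1])
            for i in range(len(read) - k)
        }
        support.update(pairs)
    return support


def branch_frequencies(support, kmer):
    """Fraction of supporting reads that follow each edge leaving kmer."""
    outgoing = {dst: n for (src, dst), n in support.items() if src == kmer}
    total = sum(outgoing.values())
    return {dst: n / total for dst, n in outgoing.items()}


# Hypothetical reads: three carry an A after ACGT, one carries a C.
reads = ["ACGTA", "ACGTA", "ACGTA", "ACGTC"]
support = edge_support(reads, 4)
freqs = branch_frequencies(support, "ACGT")
```

In a display, these fractions could drive edge thickness or colour, so that a rare variant path is visibly thinner than the dominant one.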

Option 2:

Gene regulatory networks, particularly those containing indirect interactions, are complex to visualise both because of their size and because they generally do not have a planar embedding. However, it is possible to avoid both of these problems by concentrating on a subset of graph edges in a restricted neighbourhood of such a graph.

A web-based component for interactive viewing of a node and its neighbours up to the nth degree would be of considerable utility. For node positioning purposes, only a subset of edges that have a planar embedding would be considered, allowing simple interactive force-directed layout to be effective.

Each node and edge will have its own set of properties (for example: gene information, expression level in a variety of samples, pairwise correlation of expression between genes) and ideally this information would be integrated into the visualisation of the graph neighbourhood. Similarly, clever navigation of the graph (to change the focal point, interactively updating the neighbourhood and layout) would be a desirable feature.
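
Extracting the neighbourhood of a focal node is a breadth-first search that stops after n steps. The sketch below assumes the network is available as an adjacency dictionary. The gene names are placeholders, and selecting a planar subset of edges and laying the nodes out are left aside.

```python
from collections import deque


def neighbourhood(adjacency, focus, max_degree):
    """Return {node: distance} for all nodes within max_degree steps of focus."""
    distance = {focus: 0}
    queue = deque([focus])
    while queue:
        node = queue.popleft()
        if distance[node] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical undirected network stored as an adjacency dictionary.
network = {
    "geneA": ["geneB", "geneC"],
    "geneB": ["geneA", "geneD"],
    "geneC": ["geneA"],
    "geneD": ["geneB", "geneE"],
    "geneE": ["geneD"],
}

around_a = neighbourhood(network, "geneA", 2)
around_e = neighbourhood(network, "geneE", 1)  # the focal point moved to geneE
```

Changing the focal point is then just another call with a different node, and the distances returned could be used to fade or shrink nodes further from the focus.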

Create a gait analysis tool

I’d like to build an open source library for OpenNI that uses depth sensors (e.g. Microsoft Kinect) as a research measurement tool for gait analysis and joint range of motion, which are common outcome measures in rehabilitation. This would make capturing this data much easier and cheaper. I have a depth camera that could be used over the weekend to build and test the code.

A tool for quickly designing primers

As a pathology researcher, I would love a tool that would enable me to design primers quickly, so I can add new genes/mutations to the panels of genes that I test for. It could be similar to Optimus Primer, though something that fits my needs better.