minus80

Storing big biological data: build once, reuse many.

Big data is a hot topic right now, especially in the biological sciences. There is no precise definition of how big data has to be to count as "big", but we are going to assume that many biological datasets qualify.

Also, data doesn't need to be big to be cumbersome. When talking about data analysis, I like to use the analogy of moving furniture. The task is not hard to describe: the couch needs to be moved up two flights of stairs and into the living room. I'm not sure about you, but when the couch finally does clear the door frame (by millimeters and only in a certain orientation), I immediately wonder who, if anyone, enforces standardized door sizes, and whether couch engineers mathematically model couch dimensions before manufacturing them. Perhaps, like me, you don't have the foresight to check whether your couch will fit into your apartment before signing a lease and loading the moving truck. It turns out this isn't an uncommon problem, especially when moving newer couches into older apartments[1].

Moving couches is a "do once, reuse many" problem. You go through the stress and pain of hauling that giant piece of furniture into your living space one time in order to lounge and relax on it, hopefully, many times. As scientists, we often sign the lease on our data before we have to move it in. When the time finally comes to process half a terabyte of DNA sequence reads, it can feel like you are the only one trying to haul a king-sized mattress up ten flights of stairs. However, with enough person power and the right tools[2], moving furniture and analyzing data can both be made much more convenient.

Furniture and Freezers

How about another analogy? Bear with me. Let's talk data. Bases, not bits. Not furniture, but freezers. Most labs have a minus 80°C freezer. And whatever a fancy database is capable of, it pales in comparison to the amount of data stored in a minus 80°C freezer.

Raw samples are prepped into working stocks and stored in the freezer. When an analysis needs to be done, samples are generated from the frozen stock and the assay is carried out. The same concept should apply to digital data analysis. Raw data is processed and "frozen". When an analysis needs to be done, data is generated from the frozen data and the analysis is run.

Minus80 is a Python library built on this analogy: data objects used in an analysis are pulled from a frozen working stock, which was itself processed from raw data.

Some key concepts minus80 implements:

• Processed, "Frozen" data and "Raw" data are kept separate. The process of prepping raw data (either digital or organic) influences the information contained within it. Duplication of data is acceptable if it was prepared using a different protocol. It is OK to have several different prepared versions of raw data "stocked in the freezer". It is also OK to toss stock if you are running out of room, if you can reliably reproduce the stock using a protocol.
• Samples or data generated from the frozen stock are single use. Manipulating data that comes from the stock has no influence on the stock itself. Data is frozen in a certain state. If you transform data from the stock, it's up to you to re-freeze it.
• Sometimes the same sample's data can be frozen under different circumstances, which leads to duplicated data. While digital data can be linked symbolically (with a shortcut), copies can sometimes be convenient. The lab analog would be splitting stock into different tubes so more than one person can work with it.
• Finally, this freezer analogy only goes so far. There are going to be some discrepancies; this is only a data model.
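To make the first two concepts concrete, here is a minimal sketch of the idea in plain Python. It does not use minus80 at all: `ToyFreezer`, its `freeze` and `thaw` methods, and the stock names are hypothetical stand-ins invented for illustration, and the JSON-file storage is an assumption, not how minus80 stores anything.

```python
import json
import tempfile
from pathlib import Path


class ToyFreezer:
    """Hypothetical stand-in for a freezer: one JSON file per frozen stock."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def freeze(self, name, data):
        # Serialize the current state of `data` under `name`.
        (self.directory / f"{name}.json").write_text(json.dumps(data))

    def thaw(self, name):
        # Each call re-reads the file, so callers get an independent copy.
        return json.loads((self.directory / f"{name}.json").read_text())


freezer = ToyFreezer(tempfile.mkdtemp())

# Raw data stays as it is; it is never modified by the prep step.
raw = {"reads": ["ACGT", "acgt", "NNNN"]}

# Prepare the raw data with one "protocol" and freeze the result.
prepared = {"reads": [r.upper() for r in raw["reads"] if "N" not in r]}
freezer.freeze("upper_no_N", prepared)

# Pull a working sample and manipulate it; the stock is untouched.
sample = freezer.thaw("upper_no_N")
sample["reads"].append("TTTT")
stock_after_edit = freezer.thaw("upper_no_N")

# Re-freezing is an explicit step, here under a new name.
freezer.freeze("upper_no_N_plus_TTTT", sample)
```

Because every `thaw` call builds a fresh object, editing `sample` cannot leak back into the stock. Keeping the changes means choosing to freeze them again, which is the behavior the list above describes.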

Here is a diagram showing some of these relationships:

Multiple sources of raw data go into a frozen dataset. From this, two different data analysis datasets are built. Manipulating data from the data-analysis stage doesn't influence the frozen dataset it was built from (you don't normally add a solution back to a stock). However, the output from the data analysis can be frozen itself.

Minus80 comes with two main components. First, it implements several connections necessary to be able to freeze data. This will be covered in a different post. Second, minus80 comes with two Python objects that utilize this functionality. Let's take a look at those here.

Accessions and Cohorts

An accession is an experimental entry about a sample, along with metadata about its collection. It distinguishes between "samples" in cases where collections differ in space or time. For example, an experiment containing a single sample, sampled over the course of ten time points, would equal ten experimental accessions. Another example: an experiment that contains a single sample, but ten different tissues or other spatially differentiated components, would again contain ten experimental accessions.
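The counting rule can be sketched with plain dictionaries, before any minus80 objects are involved. The sample, time point, and tissue labels below are made up for illustration.

```python
from itertools import product

samples = ["Sample1"]
time_points = [f"t{i}" for i in range(1, 11)]    # ten hypothetical time points
tissues = [f"tissue{i}" for i in range(1, 11)]   # ten hypothetical tissues

# One accession per (sample, time point) combination.
timecourse = [
    {"name": f"{sample}_{time}", "sample": sample, "time": time}
    for sample, time in product(samples, time_points)
]

# One accession per (sample, tissue) combination.
by_tissue = [
    {"name": f"{sample}_{tissue}", "sample": sample, "tissue": tissue}
    for sample, tissue in product(samples, tissues)
]

print(len(timecourse), len(by_tissue))
```

In both cases a single biological sample yields ten accessions, each carrying the metadata that tells it apart from the others.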

An accession object comes ready to use in minus80. The function signature for an accession is pretty straightforward. Start up IPython:

``````
import minus80 as m80

?m80.Accession
Init signature: m80.Accession(name, files=None, **kwargs)
Docstring:
From google: Definition (noun): a new item added to an existing collection of books, paintings, or artifacts.

An Accession is an item that exists in an experimental collection.

Most of the time an accession is interoperable with a *sample*. However,
the term sample can become confusing when an experiment has multiple
samplings from the same sample, e.g. timecourse or different tissues.

Init docstring:
Create a new accession.

Parameters
----------
name : str
The name of the accession
files : iterable of str
Files associated with the accession
**kwargs : keyword arguments
Any number of key=value arguments that

Returns
-------
An accession object

``````

We just need a name for the accession. Optionally, we can provide a list of data files and any number of key-value pairs for metadata. Let's create three accessions from a hypothetical time course dataset:

``````
t1 = m80.Accession('Sample1_t1',files=['raw_t1.txt'],time='t1')
t2 = m80.Accession('Sample1_t2',files=['raw_t2.txt'],time='t2')
t3 = m80.Accession('Sample1_t3',files=['raw_t3.txt'],time='t3')
``````

Now that we are happy with our small set of accessions, let's freeze them into a Cohort. This action preserves the state of the Accessions and gives them a specific context by assigning them a name. Let's call this group of accessions "Sample1_Timecourse". Instances of Accession objects do not persist across Python sessions, whereas Cohorts can be reused many times across sessions. Accessions are one-time-use objects. Cohorts are reuse-many objects.

The Cohort object has a method to create a new Cohort based on a collection of Accessions:

``````
m80.Cohort.from_accessions?

Signature: m80.Cohort.from_accessions(name, accessions)
Docstring:
Create a Cohort from an iterable of Accessions.

Parameters
----------
name : str
The name of the Cohort
accessions : iterable of Accessions
The accessions that will be frozen in the cohort
under the given name

Returns
-------
A Cohort object
``````

Let's create the Cohort object:

``````
c = m80.Cohort.from_accessions('Sample1_Timecourse',[t1,t2,t3])
``````

Exiting our IPython session will cause us to lose the Accession objects, but since the Cohort was frozen we can recover them in the next session.

``````
# A new IPython session
import minus80 as m80
c = m80.Cohort('Sample1_Timecourse')
``````

Accessions can be recreated by indexing the cohort by Accession name:

``````
t1 = c['Sample1_t1']
t1
# Output:
Accession(Sample1_t1,files=['raw_t1.txt'],{'time': 't1', 'AID': 1})
``````
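One way to picture how a named cohort can outlive the session that built it is a small on-disk table keyed by cohort name and accession name. The sketch below uses Python's built-in `sqlite3` module. `ToyCohort`, its table layout, and the file name are assumptions made for illustration, not minus80's actual storage format or API.

```python
import json
import os
import sqlite3
import tempfile


class ToyCohort:
    """Hypothetical cohort: named rows in a SQLite file (not minus80's schema)."""

    def __init__(self, name, db_path):
        self.name = name
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS accessions ("
            "cohort TEXT, name TEXT, files TEXT, metadata TEXT, "
            "PRIMARY KEY (cohort, name))"
        )

    def add(self, name, files=(), **metadata):
        # Store the accession's state as JSON text under (cohort, name).
        self.db.execute(
            "INSERT OR REPLACE INTO accessions VALUES (?, ?, ?, ?)",
            (self.name, name, json.dumps(list(files)), json.dumps(metadata)),
        )
        self.db.commit()

    def __getitem__(self, name):
        # Rebuild a fresh accession-like dict from the stored row.
        row = self.db.execute(
            "SELECT files, metadata FROM accessions WHERE cohort = ? AND name = ?",
            (self.name, name),
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return {"name": name, "files": json.loads(row[0]), "metadata": json.loads(row[1])}

    def close(self):
        self.db.close()


db_path = os.path.join(tempfile.mkdtemp(), "toy_freezer.db")

# "First session": build the cohort, then drop every in-memory object.
first = ToyCohort("Sample1_Timecourse", db_path)
first.add("Sample1_t1", files=["raw_t1.txt"], time="t1")
first.close()

# "Second session": only the cohort name is needed to get the data back.
second = ToyCohort("Sample1_Timecourse", db_path)
fresh = second["Sample1_t1"]
second.close()
```

The cohort name acts as the label on the freezer box: nothing from the first session has to survive in memory for the second session to find the same accessions.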

Note: Accessions created from the Cohort object don't influence the object they came from. But as the data model above shows, new Cohorts can be created from Accession objects that come from other Cohorts. For instance:

``````
t1 = c['Sample1_t1']
t2 = c['Sample1_t2']
# Create a new Cohort from Accessions pulled from c
c2 = m80.Cohort.from_accessions('Sample1_subset',[t1,t2])
``````