Lab State vs. Metadata
data models, metadata, state
TL;DR
Defining metadata is hard and often subjective. We need to leverage tools to help us define and manage metadata, thinking about it more as lab state.
Full Content
Metadata is "data that describes other data", or "data that provides context to other data".
Oftentimes, understanding and deciding which metadata attributes to keep track of during
a project or analysis can be a tedious job. One area where metadata requires especially lengthy
lists and descriptions is bioinformatics projects, as the data may not at first seem physically tied
to the lab. Especially when the data is stored in the cloud, without metadata describing where it came from,
there is no way to tie it back to any wet-lab events.
Many entities and organizations have come up with specific metadata attributes that everyone must adhere to,
while also allowing a flexible set of additional attributes. These can be thought of as the "stringent"
metadata vs. the "extra" metadata. When I write code, I create what I call Base Models for data, and
then allow the user to extend those models with one or more Extended Models. For example, the Base Model
for a PCR experiment could include attributes such as primers, master mix, cycle conditions, cycle number,
and sample name. This model would look something like this:
| Primer | Master Mix | Cycle Conditions | Cycle Number | Sample Name |
|---|---|---|---|---|
| V4+V5 | KAPA | 60/72/65 | 20 | A |
Above is a data model shown as one row of a table. When I talk about a "data model", I'm talking about the
minimum set of attributes required to describe an object, which can be represented (in one form) as the
column names in a row. For example, a square requires at a minimum an attribute of "side length" to
describe it. A rectangle's data model requires at least two attributes: the height and the
width. As the complexity of the object you're trying to describe increases, you can imagine the data
model needing to incorporate more attributes (imagine how many attributes you'd need to describe a
particular species of bird).
We could go as far as saying that every conceivable PCR experiment requires at least these attributes
to describe, in a somewhat intelligible manner, what occurred; I'm sure many people could fill hours of
discussion debating what this minimum set of criteria should be, but that is beside the point. The point
is that this in no way describes the sample or provides context. Thus, an Extended Model allows scientists
to add more metadata attributes on a per-assay basis. In other words, **the Extended Model of a PCR assay
done last year might differ from the Extended Model of one done today**, but both share the **Base Model**.
This gives us the flexibility to combine all our PCR data and run permutations, analyses, and graphs based
on the shared PCR traits that we know each assay contains.
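To make the idea concrete, here is a minimal sketch of the Base/Extended Model pattern using plain Python
dataclasses. The class names, extra fields, and values are hypothetical illustrations, not the exact models
I use:

```python
from dataclasses import dataclass

@dataclass
class PCRBaseModel:
    """Minimum attributes every PCR assay must carry (the Base Model)."""
    primer: str
    master_mix: str
    cycle_conditions: str
    cycle_number: int
    sample_name: str

@dataclass
class PCRAssayLastYear(PCRBaseModel):
    """One hypothetical Extended Model: extras chosen for that project."""
    gel_image_path: str = ""
    operator: str = ""

@dataclass
class PCRAssayToday(PCRBaseModel):
    """A different hypothetical Extended Model, sharing the same Base Model."""
    thermocycler_id: str = ""
    room_humidity_percent: float = 0.0

# Because both Extended Models inherit the Base Model, a mixed collection of
# assays can still be queried on the attributes they are guaranteed to share.
assays = [
    PCRAssayLastYear("V4+V5", "KAPA", "60/72/65", 20, "A", operator="DB"),
    PCRAssayToday("V4+V5", "KAPA", "60/72/65", 25, "B", thermocycler_id="TC-2"),
]
shared_view = [(a.sample_name, a.primer, a.cycle_number) for a in assays]
print(shared_view)  # [('A', 'V4+V5', 20), ('B', 'V4+V5', 25)]
```

The design point is simply that downstream analysis only ever leans on the Base Model's attributes, so
adding a new Extended Model never breaks old queries.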
The idea of Base and Extended Models, or stringent and extra metadata, can become much more complicated as
time progresses. This is usually due to the scientist's inability to foresee which attributes may become
important, or to overlooking data that seems so easily assumed it appears unimportant. And this is no knock
on the scientist; it's extremely hard, even with extensive planning and whole teams, to know what data you
will need in the future.
The fact is, determining which metadata attributes to add to an experiment is a lot harder than it seems,
and there is almost never a correct answer. The descriptors tied to an experiment are placed there because
the scientist thought of them, and they are completely disconnected from any other lab activity. What I mean
by this is that if the scientist realizes a particular metadata attribute was incorrect, there's no way to dynamically update the metadata appended to an experiment without manually finding the error and changing it themselves. In this sense, the scientist is the subjective vessel by which metadata travels from whatever is determined to be the truth to the experiment it is attached to.
The base reality of any experiment is that it occurs during a very specific period within the lab's overall
lifespan. At the moment an experiment is completed, the lab is in a certain state that could theoretically
be attached to the experiment to allow perfect replicability. This, of course, is not possible and would
require an unfathomable number of metadata attributes to describe that state: time of day, humidity,
temperature, altitude, etc. None of these are used as metadata attributes in experiments right now except
in extreme cases.
The idea of a lab state is currently science fiction but could become a reality sooner than we might expect.
The reason it seems so far away is that we use so many different tools to go from data collection to
data publication. There are various buckets in which we work, all of which are very disconnected and
require the scientist to act as the conduit, or glue, between them. Not only does this place immense
pressure on the scientist, but it disconnects our work and allows new errors to seep into our
workplace (especially undetected errors). Lots of activity goes on in the lab throughout the day,
whether it's experiments in progress, cleaning, stocking, new items being delivered, new samples being
created, or any other such tasks.
The scientist deals with many challenges and uses a series of disconnected solutions to address
them: Slack/Discord/Google Teams for communication, an Excel sheet for a cleaning log, a
whiteboard for notifications, sensors describing lab conditions, an online sheet for inventory, sticky
notes on a machine to indicate someone is using it, BOX/OneDrive/GoogleDrive for data storage, various
compute sources for analytics, or whatever may be the case for your lab. The amalgamation of these events
defines the total state of the lab, which is essentially a set of extended metadata attributes that could
be appended to each assay at the time of completion.
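Purely as a thought experiment, here is what appending such a lab-state snapshot to an assay record might
look like in code; every data source, field, and value below is hypothetical and stands in for whatever
tools a given lab actually uses:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class LabStateSnapshot:
    """Hypothetical roll-up of the disconnected sources listed above."""
    captured_at: str
    temperature_c: float        # e.g. a room sensor
    humidity_percent: float     # e.g. a room sensor
    instruments_in_use: list    # e.g. sticky notes / a booking log
    last_cleaning: str          # e.g. the cleaning spreadsheet
    inventory_alerts: list      # e.g. the inventory sheet

def snapshot_lab_state() -> LabStateSnapshot:
    """Placeholder: a real ecosystem would query each tool instead of hard-coding values."""
    return LabStateSnapshot(
        captured_at=datetime.now(timezone.utc).isoformat(),
        temperature_c=21.5,
        humidity_percent=40.0,
        instruments_in_use=["thermocycler-2"],
        last_cleaning="weekly deep clean, Monday",
        inventory_alerts=["KAPA master mix running low"],
    )

# At the moment an assay completes, the snapshot rides along as extended metadata.
assay_record = {"sample_name": "A", "primer": "V4+V5"}
assay_record["lab_state"] = asdict(snapshot_lab_state())
```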
What the scientific community needs is a comprehensive ecosystem in which to work, one that is both
flexible in the tools it supports and simple to integrate with existing solutions. This system
could not be niche, supporting just data storage, nor could it focus only on analytics and graphics.
Those aspects are the bread and butter of any scientific lab churning out papers and research, but they
do not provide sufficient metadata in our attempts at creating true scientific replicability. We need a
system that holds information about all the major components of a lab at various timepoints, or states,
which can then be exported with experiments to create a deeper level of science. I don't know, food for thought.
-Dane
Email me at dane@liminalbios.com if you have any thoughts on this post.