The Insidious Costs of Poor Scientific Data Management

Data takes time and money to collect and is the core raw material of a lab, yet it is often managed poorly and used inefficiently. How can you get more mileage, more papers and more impact out of your data?

The postdoc left the lab over a year ago. He shared some of his data with you on Dropbox, but some pieces are missing, and the data that is available has headers that are uninterpretable. He has a new laptop now and can’t find the missing pieces. He also doesn’t remember what the header abbreviations were meant to mean (TTES_1? It made sense at the time). He has published his paper and moved on to a different job, so he is no longer motivated to look for them. That data would be useful for another project a student is proposing, but now it is not going to be available. The student will have to start from scratch. Sound familiar?

The cost of data

Data is the core of all empirical scientific research. It takes time and money to collect, usually public taxpayer money. Take the example of the postdoc described above. Let’s say he spent two years collecting data and writing up a paper before he moved on to a job outside academia, and that the direct data collection cost was $30,000. The student who could have used his dataset and added to it is now set back a year and will have to collect the same data again, at a cost of another $30,000. Altogether this wastes one year of student stipend plus an additional $30,000. Assuming a stipend of roughly $25,000 a year, the cost of the student’s project is now some $55,000 higher than it could have been. It has also cost the lab a more robust paper with a larger dataset and the opportunity to validate a previous result. Now imagine the hundreds of thousands of such cases in academia every year and the enormous cost to the public and to our scientific progress.
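The arithmetic behind that figure can be written out as a rough sketch. The $30,000 collection cost and the one-year delay come from the example above; the $25,000 annual stipend is an assumption inferred from the $55,000 total, and the function name is purely illustrative.

```python
def extra_project_cost(data_collection_cost, annual_stipend, years_lost=1.0):
    """Estimate the added cost of re-collecting a dataset that was lost.

    Adds the repeated data collection cost to the stipend paid during
    the time the student spends repeating the work.
    """
    return data_collection_cost + annual_stipend * years_lost


# The stipend value is an assumption implied by the post's total, not a quoted figure.
extra = extra_project_cost(data_collection_cost=30_000, annual_stipend=25_000)
print(f"Extra cost: ${extra:,.0f}")  # prints "Extra cost: $55,000"
```

Plugging in your own lab's collection costs and stipends gives a quick sense of what a single lost dataset is worth.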

Managing data for impact

Better data management leads to more cost-efficient, better-validated and more impactful outcomes. As a PI, it is important to have a data management strategy for the lab to ensure that you are getting the most out of all that effort. What does this mean?

Plan ahead

What defines the vision of the lab, and what kinds of metadata and data elements are important to collect consistently over time for larger-scale, multipurpose views of the data? Even if a particular project does not seek to evaluate the impact of a given control element, it may be useful for everyone to collect it, since it could have relevance down the line. Drawing up this vision is an important exercise.

Set standards

Maintain templates and standards that all lab members use consistently to collect data, and take the trouble to check that data adheres to these standards. For example, how do you record and store the particular task associated with an EEG recording, and the sampling and filter settings used? If everyone used the same format and nomenclature, recordings would be easy to compare across lab members. How do you record basic data points relevant to all studies? For example, does your lab have a standard format for recording the subjects studied and a way to contact them should the need arise for a future follow-up by someone else? We provide some EEG metadata standards here for those interested.
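As one illustration of what a shared template with built-in checks might look like, here is a minimal Python sketch. The field names, the task vocabulary and the validation rules are all hypothetical choices made for this example, not a published standard; a lab would substitute its own.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical controlled vocabulary; each lab would define its own task names.
ALLOWED_TASKS = {"resting_eyes_closed", "resting_eyes_open", "oddball"}


@dataclass(frozen=True)
class EEGRecordingMetadata:
    """One record per EEG recording, filled in the same way by every lab member."""
    subject_id: str
    task: str
    sampling_rate_hz: float
    highpass_hz: float
    lowpass_hz: float
    recorded_by: str

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; an empty list means the record passes."""
        issues = []
        if not self.subject_id.strip():
            issues.append("subject_id is empty")
        if self.task not in ALLOWED_TASKS:
            issues.append(f"unknown task: {self.task!r}")
        if self.sampling_rate_hz <= 0:
            issues.append("sampling_rate_hz must be positive")
        if not 0 <= self.highpass_hz < self.lowpass_hz:
            issues.append("filter band must satisfy 0 <= highpass_hz < lowpass_hz")
        elif self.lowpass_hz > self.sampling_rate_hz / 2:
            issues.append("lowpass_hz exceeds the Nyquist frequency")
        return issues


record = EEGRecordingMetadata("S002", "TTES_1", 256.0, 0.5, 200.0, "student_b")
for issue in record.problems():
    print(issue)
```

Running a check like this when data is saved, rather than months later, catches an undocumented abbreviation such as TTES_1 while the person who invented it still remembers what it means.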

Index and archive

Have a way to ensure that data is well archived and indexed by all lab members so that when people leave the lab it remains available as a database for current lab members to use. This is not a simple task, but it is well worth doing. There are various software products and tools available to help with this, but regardless of which one you use, you will need to pay attention to cleaning, restructuring and formatting existing data according to some set of standards. For EEG and associated metadata, Brainbase will soon be available as a lab data management tool that will do this and a whole lot more. (While Brainbase is still in beta and available by invitation only, you can look inside now and see how it works.)
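To make the idea of an index concrete, the sketch below walks a shared archive folder and collects one row per recording. It assumes, purely for illustration, that each recording is stored alongside a small JSON file ending in `.meta.json` with `subject_id`, `task` and `recorded_by` fields; that layout and those names are hypothetical and are not a description of how Brainbase or any other tool works.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_index(archive_root: Path) -> List[Dict[str, str]]:
    """Collect one index row per '*.meta.json' sidecar file under archive_root."""
    rows = []
    for sidecar in sorted(archive_root.rglob("*.meta.json")):
        with sidecar.open(encoding="utf-8") as handle:
            meta = json.load(handle)
        rows.append({
            # Store paths relative to the archive so the index survives a move.
            "path": sidecar.relative_to(archive_root).as_posix(),
            # Missing fields become empty strings so gaps are visible in the index.
            "subject_id": str(meta.get("subject_id", "")),
            "task": str(meta.get("task", "")),
            "recorded_by": str(meta.get("recorded_by", "")),
        })
    return rows


def find_by_task(index: List[Dict[str, str]], task: str) -> List[Dict[str, str]]:
    """Return every indexed recording that used the given task."""
    return [row for row in index if row["task"] == task]
```

With an index like this, a new student can ask which recordings of a given task already exist in the lab before planning any new data collection.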

Share and collaborate

When people are proprietary about data, it usually comes from one of two motivations: the desire not to be scooped or lose credit, or the fear that someone will find an error in their work. Managing data well within the lab can go a long way toward minimizing errors and helping ensure credit is assigned where it is due. Going further, creating a positive culture of sharing within the lab can raise the overall quality of output, because different people with different perspectives may bring a new lens to the same data. Once the lab has published the papers it has envisioned, making that data available for collaborations and for the larger community can further amplify its impact.

See related post ‘Crowdsourcing, Citizen Science and Data Sharing’

Combing through the data in your lab to reformat and organize it to a set of standards is no simple or pleasant task. However, managing data effectively is a responsibility researchers owe not just to the taxpayers and foundations that support the research but also to the progress and goals of science.
