Houston expert: Navigating dark data within research and innovation
Is it necessary to share ALL your data? Is transparency a good thing, or does it make researchers “vulnerable,” as author Nathan Schneider suggests in the Chronicle of Higher Education article “Why Researchers Shouldn’t Share All Their Data”?
Dark Data Defined
Dark data is the universe of information an organization collects, processes and stores, oftentimes for compliance reasons, without putting it to further use. In research, it is the data that never makes it into the official publication of a project. According to the Gartner Glossary, “storing and securing data typically incurs more expense (and sometimes greater risk) than value.”
This topic is reminiscent of the file drawer effect, in which the results of a study influence whether the study is published at all. Negative results can be just as important as results that support a hypothesis.
Publication bias, the pressure to publish only positive results that support the principal investigator’s (PI’s) hypothesis, is arguably not good science. In an article in the Indian Journal of Anaesthesia, Priscilla Joys Nagarajan et al. wrote: “It is speculated that every significant result in the published world has 19 non-significant counterparts in file drawers.” Those unpublished results are one form of dark data.
But what should you do with all the excess information that did not make it to publication, most likely because of various constraints? Should everything, meaning every little tidbit, be readily available to the research community?
Schneider doesn’t think it should be. In his article, he writes that he hides some findings in a paper notebook or behind a password, and he keeps interviews and transcripts offline altogether to protect his sources.
Open-source software communities tend to regard total transparency as inherently good. What are the advantages of total transparency? You may make connections between projects that you wouldn’t have otherwise. You can easily reproduce a peer’s experiment. You can even become more meticulous in your note-taking and experimental methods since you know it’s not private information. Similarly, journalists will recognize this thought pattern as the recent, popular call to engage in “open journalism.” Essentially, an author’s entire writing and editing process can be recorded, step by step.
This trend has led researchers to tools like Jupyter, an open-source notebook environment, and GitHub, a platform for hosting version-controlled projects. Version control can record every change that occurs along a project’s timeline. But are unorganized, excessive amounts of unpublishable data really what transparency means? Or do they confuse those looking for meaningful research that is meticulously curated?
The Big Idea
And what about the “vulnerability” claim? Sharing every edit and every new direction taken can open a scientist up to scoffers and even harassment. In industry, opening up dark data can extend to publishing salaries, which can feel unfair to underrepresented, marginalized populations.
In Model View Culture, Ellen Marie Dash wrote: “Let’s give safety and consent the absolute highest priority, with openness and transparency prioritized explicitly below those. This means digging deep, properly articulating in detail what problems you are trying to solve with openness and transparency, and handling them individually or in smaller groups.”
This article originally appeared on the University of Houston's The Big Idea. Sarah Hill, the author of this piece, is the communications manager for the UH Division of Research.