
Selecting Research Data for Curation

By Cosmas Fletcher Mbewe

As a novice in research data management, I naively assumed that saving all data would require nothing more than purchasing additional hard drives. Lee and Stvilia (2017) soon shattered that illusion. Their study of institutional repository staff makes it clear that while well-endowed universities are willing to store just about any form of data, there is no established process for assessing which types of data should be stored in the long term. For instance, one interviewee reported that the university had a ten-year retention policy, but none of its datasets had yet reached that mark.

What, then, prevents us from keeping everything? Whyte and Wilson (2010) present an important counterargument to my earlier belief that "storage is cheap": while the cost of storage itself decreases, the expenses related to metadata generation, backups, and maintenance remain considerable. The DCC Curation Lifecycle Model (Higgins, 2008) has helped me see that appraisal should be the basis of every curatorial practice.

The next question is what makes data valuable. Whyte and Wilson (2010) suggest criteria such as relevance to the organization's mission, scientific importance, uniqueness, redistributability, non-replicability, economic value, and documentation. However, according to Lee and Stvilia (2017), many repositories work with informal ReadMe metadata, spending up to 70–80% of their budget on this. Similarly, Tenopir, Birch, and Allard (2012) found gaps between the needs of researchers and the services offered by libraries. These are not purely technological issues; they also reveal a clash between our aspirations and what is actually possible.

Most alarming of all is the lack of expertise. Lee and Stvilia (2017) found subject specialists in only five of the thirteen organizations they studied. How could a central organization without such expertise possibly make judgments on research data from different disciplines? According to Borgman, Wallis, and Enyedy (2007), it is critical to understand the context in which scientific communities produce data. I strongly believe that data appraisal needs to be a joint effort among researchers, librarians, and subject specialists. Data citations and usage rates can inform the decision but should not determine it.

To preserve everything is utopian; to preserve nothing is a disaster. My point is that sound data appraisal is a precondition for digital preservation that holds up in the long run. Without proper appraisal procedures, any attempt at preservation may prove disastrous.

References

Borgman, C. L., Wallis, J. C., & Enyedy, N. (2007). Little science confronts the data deluge: Habitat ecology, embedded sensor networks, and digital libraries. International Journal on Digital Libraries, 7(1–2), 17–30.

Higgins, S. (2008). The DCC Curation Lifecycle Model. International Journal of Digital Curation, 3(1), 134–140.

Lee, D. J., & Stvilia, B. (2017). Practices of research data curation in institutional repositories: A qualitative view from repository staff. PLoS ONE, 12(3), e0173987.

Tenopir, C., Birch, B., & Allard, S. (2012). Academic libraries and research data services. Association of College and Research Libraries.

Whyte, A., & Wilson, A. (2010). How to appraise and select research data for curation. Digital Curation Centre.

Picture: depicting a thorough scrutiny of data to be appraised