Understand the data
“The issue when people decide to investigate data is often that data in a spreadsheet or an unstructured mass generally looks fine. It is once you start to try and utilise it in some way, or apply some kind of interrogation to it, that you realise what the problems with the data are.”
(Data Provider Liaison Lead, Data Pitch)
The data and the issue that will be addressed using the data need to be well aligned. This will require a combination of expertise from different areas, including:
- understanding whether an intended outcome can be achieved through the use of data;
- knowledge of what data will be needed;
- knowledge that this data exists, and where;
- knowledge of what the data includes, i.e. the metadata describing the data, such as its level of granularity, in order to assess its usefulness;
- access to both metadata and actual data; and
- authority to share both metadata and actual data.
This may require the production of an information asset register or inventory, even if only a subset of the organisation’s data is used.
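As a rough illustration, a single inventory entry might record the points listed above: where the data is held, its granularity, and whether there is authority to share the metadata or the data itself. The sketch below is a minimal, hypothetical schema in Python; the field names and example values are assumptions for illustration, not a prescribed register format.

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    """One entry in a simple information asset register (hypothetical schema)."""
    name: str                   # e.g. "Ticket sales, 2023"
    owner: str                  # team or role with authority over the data
    location: str               # system or store where the data lives
    granularity: str            # e.g. "per transaction", "daily aggregate"
    contains_personal_data: bool
    metadata_shareable: bool    # can the metadata be shared externally?
    data_shareable: bool        # is there authority to share the actual data?
    notes: str = ""

# A register is then simply a collection of such entries that can be filtered,
# for example to find assets whose metadata can be shown to potential data users.
register = [
    DataAsset(
        name="Ticket sales",
        owner="Commercial analytics team",
        location="Sales data warehouse",
        granularity="Per transaction",
        contains_personal_data=True,
        metadata_shareable=True,
        data_shareable=False,
        notes="Only a synthesised subset may be shared externally.",
    ),
]

shareable_metadata = [asset.name for asset in register if asset.metadata_shareable]
```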
In order to explore the data, especially if there is limited awareness of its potential uses, it may be useful to provide sample data. Access to sample data allows potential data users to judge whether, and how, their proposed solution can be achieved. Alternatively, a synthesised subset of the data may be made available, allowing the data to be explored without the risk of a data breach.
What is synthetic data?
Synthetic data is an artificially generated stand-in for a subset of the original data that no longer contains any real records. In all other respects it resembles the real data: it preserves the metadata, noise and other features of the original dataset, most importantly its statistical distribution. Synthetic datasets can be used to train algorithms that are then tested on the real dataset, without revealing real data.
Data can be fully or partially synthetic. Fully synthetic data contains no original values at all; re-identification of any single unit is almost impossible, while all variables remain fully available. In partially synthetic data, only the sensitive values are replaced. This reduces dependence on the model used to generate the synthetic values, but means that some disclosure remains possible because true values are retained in the dataset.
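As a rough illustration of the fully synthetic case, the sketch below fits simple per-column models to a real table and samples entirely new rows from them. The column names, the stand-in "real" dataset and the choice of normal/categorical models are illustrative assumptions, not a recommended generation method; real projects would use dedicated synthetic-data tooling and a proper disclosure-risk assessment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for the real dataset (hypothetical columns and values).
real = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "region": rng.choice(["north", "south", "east", "west"], size=1000),
    "spend": rng.gamma(shape=2.0, scale=50.0, size=1000),
})

def fully_synthetic(df: pd.DataFrame, n: int, rng: np.random.Generator) -> pd.DataFrame:
    """Sample n new rows from simple per-column models fitted to df.

    Numeric columns are modelled as normal distributions, categorical columns
    by their observed frequencies. Columns are sampled independently, so
    correlations between variables are NOT preserved; a serious generator
    would model the joint distribution instead.
    """
    out = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            out[col] = rng.normal(df[col].mean(), df[col].std(), size=n)
        else:
            freqs = df[col].value_counts(normalize=True)
            out[col] = rng.choice(freqs.index.to_numpy(), size=n, p=freqs.to_numpy())
    return pd.DataFrame(out)

synthetic = fully_synthetic(real, n=1000, rng=rng)
# No row in `synthetic` corresponds to a real individual, but column-level
# summaries (means, category frequencies) approximate the original data.
```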
Key resources:
Valuing information as an asset (Higson & Waltho, 2009): This white paper aims to support senior executives and policy makers with the transformation of information culture and practices within their organisations.
http://faculty.london.edu/chigson/research/InformationAsset.pdf
Data inventory guide (Johns Hopkins University, Center for Government Excellence): This is a practical guide to understanding what a data inventory is and how to build one, explaining the concepts and providing practical guidance and references.
https://labs.centerforgov.org/data-governance/data-inventory/
Designing data governance (Khatri & Brown, 2010): This paper offers a framework for data governance issues to help practitioners design data governance structures effectively.
https://doi.org/10.1145/1629175.1629210
Anonymisation and open data: An introduction to managing the risk of re-identification (Thereaux et al., 2019): This report describes the current state of the art of data anonymisation and provides guidance for risk management.
https://theodi.org/article/anonymisation-and-synthetic-data-towards-trustworthy-data/