Glossary
This toolkit is intended to be accessible to users from a wide range of backgrounds. We have tried to avoid technical language, but the following glossary may be of use.
- Access Controls: Security measures applied by a data holder or provider to any data user with which it proposes to share its data. These include placing terms and conditions on the use or reuse of the data, or allowing the data user access to the data only under some specified data environment.
- Anonymisation: Techniques for lowering the risk that data subjects can be identified from data, typically by removing or aggregating data that would identify, or help to identify, data subjects, combined with other measures such as adding noise.
- Confidential Data: A term from common law, referring to data or personal information that is shared in confidence with another party, such as a lawyer, accountant or doctor, in order to allow that party to act in their client’s interest. Such information should not be shared with any third party except with the clear consent of the client; sharing it without that consent is called a breach of confidence. Confidentiality agreements are often implicit, arising wherever confidentiality is a reasonable expectation of the client.
- Data Controller: A legal term from the GDPR, to refer to a person, company, or other body that determines the purpose and means of personal data processing (this can be determined alone, or jointly with another person/company/body). The Data Controller is responsible for what happens with the data, and held accountable for any breaches of data protection.
- Data Environment: The context in which data is held. Data environments are characterised by agents with access to the data, other datasets with which the data may come into contact, governance arrangements for the data, and infrastructure used to store it. Typically, a dataset will be stored in a range of different data environments. Data sharing usually involves moving the shared data from one environment into another.
- Data Owner or Provider: An entity that owns a dataset; this could, for example, be a company with sales data, or a GP with a patient database. For data sharing to take place, a data owner or provider must facilitate access to the data for a data user. Note that if the data owner facilitates access to personal data, this will be regulated by GDPR; if the data is not personal data, it will not be.
- Data Processing: A legal term from the GDPR, to refer to any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction of data.
- Data Processor: A legal term from the GDPR, meaning a natural or legal person, public authority, agency or other body which processes personal data on behalf of the Data Controller. A Data Processor is not necessarily also a Data Controller: it could be a third party, such as an external analyst, to which a Data Controller delegates the processing.
- Data Protection Impact Assessment (DPIA): A risk assessment for the use of data that is mandatory under GDPR for processing that is likely to result in a high risk to individuals. It should describe the processing; assess its necessity and proportionality; assess risks to individuals; and identify measures to mitigate those risks.
- Data Sharing: The sharing of data between entities, typically for a specific purpose. This can happen between companies, or departments within an organisation. The data owner or provider provides a data user with access to some of its data. If the data is personal data, then the sharing will be regulated by GDPR.
- Data Subject: A legal term from the GDPR, referring to a living person who is or can be identified through data. The term does not extend to institutions, organisations, or deceased individuals.
- Data User: A person or entity that uses data for their own purposes, for example for business, academic work or in government. Data users may transform data, for example by cleaning it up, merging it with other datasets, or feeding it into other systems.
- Functional Anonymisation: A risk management approach to anonymisation that accepts that whether data is anonymous or not is a function of the relationship between those data and their environment, and not a property of the data itself. Hence functional anonymisation goes beyond manipulation of the data, and encompasses manipulation of the data environment.
- GDPR: The General Data Protection Regulation; an EU regulation that has applied since May 2018. GDPR provides new definitions of terms such as data processing or anonymisation, and defines different bases on which data processing is allowed. It goes further than previous legislation in protecting data subjects and, as an EU regulation, unifies data protection across the EU, thereby allowing the flow of data across the single market.
- Metadata: Data that describes the properties of data. Metadata can be attached to a dataset, and can therefore be used to understand whether that dataset is of interest to potential users, without giving them access to the data itself. Of particular importance is metadata describing the provenance of data.
- Open Data: Data that is freely available on the internet, without access controls.
- Provenance: Metadata that records the inputs, entities, systems, and processes that were involved in the creation of data, and so documents its origins.
- Pseudonymisation: Techniques involving the substitution of identifiers that are easily attributed to individuals with a replacement value, for example an ID number, where the key linking the two is stored separately. Re-identification of the data is possible by reference to the original key; without the key, the data can be treated as anonymised.
- Synthetic Data: Data that has been created algorithmically rather than generated by real-world events. It is generally used to explore datasets before sharing, as a stand-in for test datasets of production or operational data, to validate models, and to train machine learning models.
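
As an illustration of the Pseudonymisation entry above, the following is a minimal Python sketch, not a recommended implementation. The function names `pseudonymise` and `reidentify`, the field name `name`, and the sample records are all hypothetical and chosen only for this example. It shows identifiers being replaced with random ID numbers, with the key that links each ID to the original identifier returned separately so that it can be held apart from the shared data.

```python
import secrets


def pseudonymise(records, id_field):
    """Replace the identifier in each record with a random ID.

    Returns the pseudonymised records and a key table mapping each
    random ID back to the original identifier. The key table is meant
    to be stored separately from the pseudonymised data.
    """
    key_table = {}  # random ID -> original identifier
    assigned = {}   # original identifier -> random ID
    output = []
    for record in records:
        original = record[id_field]
        if original not in assigned:
            pseudonym = secrets.token_hex(8)
            # Regenerate in the (very unlikely) event of a clash.
            while pseudonym in key_table:
                pseudonym = secrets.token_hex(8)
            assigned[original] = pseudonym
            key_table[pseudonym] = original
        # Copy the record, swapping the identifier for its random ID.
        output.append({**record, id_field: assigned[original]})
    return output, key_table


def reidentify(records, key_table, id_field):
    """Restore the original identifiers using the key table."""
    return [{**r, id_field: key_table[r[id_field]]} for r in records]


# Hypothetical sample data, invented for this example.
patients = [
    {"name": "Alice Example", "visits": 3},
    {"name": "Bob Example", "visits": 1},
    {"name": "Alice Example", "visits": 2},
]

pseudonymised, key_table = pseudonymise(patients, "name")
restored = reidentify(pseudonymised, key_table, "name")
```

Here the same person receives the same random ID in every record, so the records can still be linked to one another, while re-identification requires the key table. Note that this sketch only replaces the direct identifier; other fields in a real dataset may still help identify a data subject.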