The next generation of customer data management solutions
Challenge identifier: DPC5-2017
Uniserv GmbH is a customer data management company based in Germany. With almost half a century’s experience, they help companies maximise the value of their customer data. Their products support a range of data management scenarios, from data quality and warehousing to data migration and master data management, with applications in CRM, eBusiness, marketing, CDI/MDM and business intelligence.
Customer data is a critical asset for every company. To create added value from it – for example, to improve engagement, targeted marketing, and customer loyalty – it is almost always necessary to combine it with other sources of data, internal or external, which are extremely heterogeneous in terms of formats, structure, and vocabulary. It is quite common to find a “name” field in one system, and a “surname” with a separate “surname prefix” in another. The same is true, to an even larger extent, for address and organisation records: different data models are used across systems, with different fields and varying numbers of fields, preventing an easy mapping from one system to the other. These fields are also often used in unexpected ways. It is quite common to see information about people stored in organisation records, organisation names placed in address fields, or information about multiple people appearing in the record of a single individual. A human would instantly spot such errors, but they are hard for a computer to correct. A standard lookup approach, in which the records are compared against reference data, is risky and may lead to information loss: correct, unambiguous reference data is rarely available (with the possible exception of address data in the Western world), making this traditional approach unsustainable.
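To make the mapping problem concrete, the following minimal sketch normalises two hypothetical source schemas – one with a single “name” field, one with a separate “surname prefix” – into a common shape. All field names and records here are illustrative assumptions, not Uniserv’s actual schemas.

```python
# Minimal sketch: normalising heterogeneous person records into one
# common shape. Field names and records are hypothetical examples.

def normalise(record: dict) -> dict:
    """Map a source record to a common {given_name, surname} shape."""
    if "name" in record:
        # System A stores the whole surname in a single "name" field.
        return {"given_name": record.get("first_name", ""),
                "surname": record["name"]}
    # System B splits the surname into a prefix ("van", "de", ...) and
    # the surname proper; recombine them for the common shape.
    prefix = record.get("surname_prefix", "")
    surname = record.get("surname", "")
    return {"given_name": record.get("given_name", ""),
            "surname": f"{prefix} {surname}".strip()}

system_a = {"first_name": "Jan", "name": "Berg"}
system_b = {"given_name": "Jan", "surname_prefix": "van den", "surname": "Berg"}

print(normalise(system_a))  # {'given_name': 'Jan', 'surname': 'Berg'}
print(normalise(system_b))  # {'given_name': 'Jan', 'surname': 'van den Berg'}
```

A hand-written mapping like this must be rewritten for every new source system, which is exactly why a learned, general approach is attractive.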
New ideas are needed: with deep learning techniques, a system may be trained to recognise the correct structure of malformed input data at the same level of performance as a human operator, or higher. This challenge aims to explore the use of deep learning methods to master different types of data heterogeneity, in particular in the context of noisy data. Uniserv is looking for deep learning solutions trained to reconcile entities from different sources and identify errors in input data, at a level of performance comparable to that of humans.
Uniserv will provide challenge winners with access to datasets containing synthetic name and address information in semi-structured and unstructured form, including gold standards. The data represents typical customer master data in enterprise applications such as CRM or ERP systems and will cover entities from Germany and, to a lesser extent, France, the Netherlands, and the UK.
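As an illustration of what such a gold-standard pairing might look like – the exact format Uniserv provides is not specified here, so this layout and these field names are assumptions – an unstructured input string and its correct structured counterpart:

```python
# Hypothetical gold-standard pair: a raw, semi-structured input string
# and the correct structured record. Layout and field names are
# illustrative assumptions, not Uniserv's actual format.

raw = "Müller GmbH, Hauptstr. 12, 70173 Stuttgart"

gold = {
    "organisation": "Müller GmbH",
    "street": "Hauptstr.",
    "house_number": "12",
    "postal_code": "70173",
    "city": "Stuttgart",
    "country": "DE",
}

# A trained model should recover `gold` from `raw`; evaluation then
# compares the predicted fields against the gold fields.
predicted = dict(gold)  # placeholder standing in for a model's output
correct = sum(predicted[k] == gold[k] for k in gold)
print(f"field accuracy: {correct / len(gold):.2f}")  # field accuracy: 1.00
```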
The Uniserv data can be used for training; however, it will not be sufficient on its own. Challenge winners are expected to enrich it with publicly available sources – for example, OpenStreetMap (addresses, buildings, organisations etc.) or names and addresses from public registries (such as electoral rolls), where the information can be used without infringing data protection rules.
Winning solutions should deliver deep learning algorithms for entity reconciliation, tested on data from at least three European countries, and capable of processing global customer data, including different encodings, right-to-left writing systems etc. The algorithms should be self-learning and improve over time as more data is fed into the system.
Submissions should explain clearly how they plan to demonstrate the value of their algorithms. They are expected to meet the following performance targets by the end of their six-month acceleration in Data Pitch:
- Production time: single requests should not take much longer than 10 ms on a standard 4-core / 8 GB machine, for instance an Amazon EC2 c4.xlarge instance. The system should be easy to scale horizontally with additional hardware resources.
- Training time: applicants may use specialised GPU hardware. Horizontal scalability is expected when training takes longer than one week on a standard Amazon GPU instance, for instance an Amazon p2.8xlarge instance (8 GPUs, 32 vCPU cores, 488 GB memory, 96 GB GPU memory).
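One simple way to check a per-request latency target like the 10 ms figure above is a small timing harness; in this sketch, `process` is a trivial stand-in for the actual reconciliation system, and the harness reports the mean wall-clock time per call.

```python
import time

def process(record: dict) -> dict:
    """Trivial stand-in for the actual reconciliation step."""
    return {"surname": record.get("name", "")}

def mean_latency_ms(fn, payload, n: int = 1000) -> float:
    """Average wall-clock time per call to fn(payload), in milliseconds."""
    start = time.perf_counter()
    for _ in range(n):
        fn(payload)
    return (time.perf_counter() - start) / n * 1000.0

latency = mean_latency_ms(process, {"name": "Berg"})
print(f"mean latency: {latency:.3f} ms")
```

Averaging over many calls smooths out scheduler jitter; for a production check one would also look at tail latencies (e.g. the 95th percentile), not only the mean.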
The project also delivers a generalised CRM data model, which serves both as the input schema and as the schema describing the correct, relational output of the algorithm.
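A generalised CRM data model of this kind could be sketched, for example, as a small set of relational types. The type and field names below are illustrative assumptions; the model actually delivered by the project may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of a generalised CRM data model; names and
# structure are assumptions, not the project's actual deliverable.

@dataclass
class Address:
    street: str
    house_number: str
    postal_code: str
    city: str
    country: str  # ISO 3166-1 alpha-2 code, e.g. "DE"

@dataclass
class Person:
    given_name: str
    surname: str
    surname_prefix: Optional[str] = None  # e.g. "van den", "de"

@dataclass
class Party:
    """One customer record: an organisation and/or its contact people."""
    organisation: Optional[str] = None
    people: List[Person] = field(default_factory=list)
    addresses: List[Address] = field(default_factory=list)

party = Party(
    organisation="Müller GmbH",
    people=[Person("Jan", "Berg", surname_prefix="van den")],
    addresses=[Address("Hauptstr.", "12", "70173", "Stuttgart", "DE")],
)
print(party.people[0].surname_prefix)  # van den
```

Keeping people, organisations, and addresses as separate related types is what lets the model represent the error cases described earlier, such as several people hidden in one individual's record.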
- Superior entity reconciliation technology leading to a substantial increase in data processing speed and cost effectiveness.
- A sizable (>50%) reduction in the time needed for handling the standard semi-structured identification fields (see above).
- Levels of accuracy comparable to those of human operators, or exceeding them.
- Advances in data cleansing and organisation through better handling of semi-structured identification fields.
- Tools assisting data stewards in organisations with tedious, error-prone tasks.