Anonymization of sensitive health data and synthetic data generation – opportunities for hospitals

By Tuomo Pentikäinen

April 20, 2021


We all know that the future of healthcare is built on data. And not on ordinary data, but on highly sensitive health data. And on big data: really big data, more than any one stakeholder has alone. But we might have enough together, if we share it wisely.

Why is this topical just now? There are basically two reasons. The first is that advances in artificial intelligence (and machine learning in particular) have made it possible to use large amounts of data. This has been seen, for instance, in the rapid development of personalized medicine as well as in different predictive data-intensive models. The second reason is that advances in electronic health records, registries and in data harmonization practices (e.g. OMOP/EHDEN) have made data technically more accessible and usable.


Big data for hospitals - why use it?

For hospitals, there are many possible use cases, a few of which are listed here:

  • Better diagnostics and prognostic modelling by AI-enhanced models. These models are built by using big data and can be compared with patient-specific data. Very often in these use cases it is desirable not only to use the existing model to make predictions or to support decision making, but also to continuously enrich the model by feeding new data into the system.
  • Personalized medicine. More data, including genotype and phenotype information, can help to understand the patient’s characteristics and enable personalized medicine.
  • Research projects. Big data is needed for research projects, which are particularly important for university or research hospitals.
  • Better hospital management. Data can help to understand patient or personnel flows and bottlenecks. It can also be used to compare efficacy or performance, e.g. between clinics or hospitals, or to track progress and developments over time.


Why isn’t big data being utilized to its full potential by hospitals?

The biggest obstacle to the full utilization of data in the use cases above is that data cannot be easily shared, especially when it comes from multiple sources. The problem has three root causes:

  1. The data needed by hospitals is sensitive health data, and the use and sharing of sensitive health data is strictly regulated (e.g. GDPR, HIPAA, and other legislation)
  2. Data owners see their data as an asset, and are reluctant to share it without clear visibility and control over how it is used
  3. Data from different sources is not usually harmonized


How can we fix this, and allow hospitals to fully utilize big data?

A good solution to the problem of data sharing for hospitals should fulfil the following requirements:

  1. The quality and usability of the data must be as high as possible
  2. Privacy must not be compromised
  3. There must be full transparency and traceability of what has been done to the data, and by whom
  4. Computational and human resources needed for data governance should be reasonable
  5. It must also be possible to utilize complex, sensitive data types, e.g. genome data, imaging data, and signal data


Data lake architecture to solve the problem of using big data in hospitals

The VEIL.AI Anonymization Engine fulfils the criteria above; our solution applied to a data lake architecture is described in the picture below.

In this data lake solution, anonymization and synthetic data generation are built-in functionalities of the data lake. The anonymization and synthetic data generation processes can be highly automated, requiring minimal expert user input.
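As a rough illustration of what an automated anonymization step can look like, the sketch below applies one well-known technique, k-anonymity through generalization and suppression. It is not a description of the VEIL.AI Anonymization Engine. The field names, the 10-year age bands, the 3-digit postcode prefix and the example records are all assumptions made up for this example.

```python
from collections import Counter

def generalize(record):
    """Coarsen the quasi-identifiers: 10-year age band, 3-digit postcode prefix.
    Both choices are illustrative assumptions, not recommended settings."""
    low = (record["age"] // 10) * 10
    return (f"{low}-{low + 9}", record["postcode"][:3] + "**")

def k_anonymize(records, k):
    """Release only records whose generalized group has at least k members.
    Records in smaller groups are suppressed (dropped)."""
    keys = [generalize(r) for r in records]
    group_sizes = Counter(keys)
    released = []
    for record, key in zip(records, keys):
        if group_sizes[key] >= k:
            released.append({
                "age_band": key[0],
                "postcode": key[1],
                "diagnosis": record["diagnosis"],
            })
    return released

# Hypothetical example records, invented for this sketch.
records = [
    {"age": 34, "postcode": "00100", "diagnosis": "I10"},
    {"age": 37, "postcode": "00120", "diagnosis": "E11"},
    {"age": 31, "postcode": "00180", "diagnosis": "I10"},
    {"age": 62, "postcode": "33100", "diagnosis": "J45"},
]
released = k_anonymize(records, k=3)
```

With k=3, the three records in the same age band and postcode area are released in generalized form, while the single record that would stand alone in its group is withheld. Real health data would call for stronger guarantees than this simple check provides.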

The architecture supports stable, regularly updated data inputs, and is even capable of supporting continuous data streams, e.g. from medical devices. And it’s built to support not just structured data, but also complex multimodal data types including genome, imaging and signal data.

Advanced and adjustable quality assurance and reporting functionalities can also be easily included in the pipelines. Data releases to either dedicated and controlled sub data lakes, separate sandboxes, or outside the data lake are supported.

In a key advance, the architecture allows for multi-party anonymization or synthetic data generation: two or more hospitals can simultaneously anonymize or generate synthetic data in order to form a larger database, without sharing sensitive personal data with one another.
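To give a feel for how parties can compute something jointly without revealing their own inputs, here is a minimal sketch of additive secret sharing, a standard building block in multi-party computation. It is an assumption chosen for illustration and not the protocol used by VEIL.AI. The function names, the modulus and the example counts are hypothetical, and the whole exchange is simulated in a single process.

```python
import secrets

# All arithmetic is done modulo a large prime so that a single share
# looks like a random number and reveals nothing on its own.
MODULUS = 2**61 - 1

def make_shares(value, n_parties):
    """Split a private value into n random shares that sum to it (mod MODULUS)."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def secure_sum(private_values):
    """Simulate each party sharing its value with all others and
    publishing only the sum of the shares it received."""
    n = len(private_values)
    # Row i holds the shares produced by party i; column j is what party j receives.
    share_matrix = [make_shares(v, n) for v in private_values]
    partial_sums = [
        sum(share_matrix[i][j] for i in range(n)) % MODULUS
        for j in range(n)
    ]
    # Only the partial sums are combined; no party's raw value is exposed.
    return sum(partial_sums) % MODULUS

# Hypothetical per-hospital counts of patients in one generalized group.
hospital_counts = [4, 7, 2]
total = secure_sum(hospital_counts)
```

Here the hospitals learn that the pooled group contains 13 patients, which could for instance be compared against a minimum group size, while no hospital sees another hospital's individual count. A production system would also need secure channels between parties and protection against dishonest participants.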


How can your hospital unleash the full potential of big data?

Interested in finding out whether these developments in the utilization of big data can be applied to your hospital environment? Get in contact with our expert team by clicking the Contact us button, and unleash the full potential of your data.
