EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records

Nature

ABSTRACT Privacy concerns often arise as the key bottleneck for the sharing of data between consumers and data holders, particularly for sensitive data such as Electronic Health Records

(EHR). This impedes the application of data analytics and ML-based innovations with tremendous potential. One promising approach for such privacy concerns is to instead use synthetic data.

We propose a generative modeling framework, EHR-Safe, for generating highly realistic and privacy-preserving synthetic EHR data. EHR-Safe is based on a two-stage model that consists of

sequential encoder-decoder networks and generative adversarial networks. Our innovations focus on the key challenging aspects of real-world EHR data: heterogeneity, sparsity, coexistence of

numerical and categorical features with distinct characteristics, and time-varying features with highly-varying sequence lengths. Under numerous evaluations, we demonstrate that the fidelity

of EHR-Safe is almost-identical with real data (<3% accuracy difference for the models trained on them) while yielding almost-ideal performance in practical privacy metrics.

INTRODUCTION Electronic Health Records (EHR) provide tremendous potential for enhancing

patient care, embedding performance measures in clinical practice, and facilitating clinical research. Statistical estimation and machine learning models trained on EHR data can be used to

diagnose diseases (such as diabetes1), track patient wellness2, and predict how patients respond to specific drugs3. To develop such models, researchers and practitioners need access to

data. However, data privacy concerns and patient confidentiality regulations continue to pose a major barrier to data access4,5,6. Conventional methods to anonymize data can be tedious and

costly7,8. They can distort important features from the original dataset, decreasing the utility of the data significantly, and they can be susceptible to privacy attacks even when the

de-identification process is in accordance with existing standards9. Synthetic data open new horizons for data sharing10. With two key properties, synthetic data can be extremely useful: (1)

high fidelity (i.e., the synthesized data are useful for the task of interest, such as giving similar downstream performance when a diagnostic model is trained on them), and (2) satisfaction of certain

privacy measures (i.e., the synthesized data do not reveal any real patient’s identity). Generative models have shown notable success in generating synthetic data11,12,13,14,15. They are

trained to synthesize data from a given random noise vector or a feature that the model is conditioned on. This comes with the premise, for privacy preservation, that the data samples

synthesized from random vectors should be distinct from the real ones. Among generative models, Generative Adversarial Networks (GANs)16 have particularly gained traction as they can

synthesize highly realistic samples from the actual distribution of real data. The notable success of GANs in synthesizing high-dimensional complex data has been shown for images17,

speech18, text19 and time-series15. Recent works have also adapted GANs for privacy-preserving data generation, with methods such as adding noise to model weights20 or modified adversarial

training21. When it comes to synthetic EHR data generation, there are multiple fundamental challenges. EHR data contain heterogeneous features with different characteristics and

distributions. There can be numerical features (e.g., blood pressure) as well as categorical features, with many (e.g., medical codes) or two (e.g., mortality outcome) categories. We note

that EHR data with images and free-form text are beyond the scope of this paper. Some of these features might be static (i.e., not varying during the modeling window), while others are

time-varying, such as regular or sporadic lab measurements or diagnoses. Feature distributions might come from quite different families—categorical distributions might be highly nonuniform

(e.g., if there are minority groups), and numerical distributions might be highly skewed (e.g., a small proportion of values being very large while the vast majority are small). Ideally, a

generative model should have sufficient capacity to model all these types of features. Depending on a patient’s condition, the number of visits might vary drastically—some patients might

visit a clinic only once, whereas some might visit hundreds of times, leading to a variance in sequence lengths that is typically much higher compared to other time-series data. There might

also be a high ratio of missing features across different patients and time steps, as not all lab measurements or other input data might have been collected. An effective generative model

should be realistic in synthesizing missing patterns. GANs have been extended to healthcare data, particularly for EHR. Refs. 22,23,24 apply various GAN variants to EHR data. However, these

variants have limitations regarding the aforementioned fundamental aspects of real-world EHR data, such as dealing with missing features, varying feature length (rather than fixed length),

categorical features (beyond numerical), and static features (beyond time series). These fundamental challenges require a holistic re-design in GAN-based synthetic data generation systems.

In this paper, our goal is to push the state-of-the-art by designing a framework that can jointly represent these diverse data modalities while preserving the privacy of source training

data. EHR-Safe, overviewed in Fig. 1, generates synthetic data that maintain the relevant statistical properties of the downstream tasks while preserving the privacy of the original data.

Our methodological innovations are key to this—we introduce approaches for encoding/decoding features, normalizing complex distributions, conditioning adversarial training, and representing

missing data. We demonstrate our results on two large-scale real-world EHR datasets: MIMIC-III25,26,27 and eICU28. We demonstrate superior synthetic data generation on a range of fidelity

and privacy metrics, often outperforming the previous works by a large margin. RESULTS DATASETS We utilize two real-world de-identified EHR datasets to showcase the EHR-Safe framework: (1)

MIMIC-III (https://physionet.org/content/mimiciii/1.4/), (2) eICU (https://eicu-crd.mit.edu/gettingstarted/access/). Both are inpatient datasets that consist of varying lengths of sequences

and include multiple static and temporal features with missing components. MIMIC-III The total number of patients is 19,946. Among more than 3000 features, we select 90 heterogeneous

features that have high correlations with the mortality outcome (details can be found in Supplementary Information). The 90 features consist of (1) 3 static numerical features (e.g., age),

(2) 3 static categorical features (e.g., marital status), (3) 75 temporal numerical features (e.g., respiratory rate), (4) 8 temporal categorical features (e.g., heart rhythm), and (5) 1

measurement time. The sequence lengths vary between 1 and 30. EICU The total number of patients is 198,707. There are (1) 3 static numerical features (age, gender, mortality), (2) 1 static

categorical feature (condition code), (3) 162 temporal numerical features, and (4) 1 measurement time. Among 162 temporal numerical features, we only select 50 features whose average number

of observations is higher than 1 per patient. We set the maximum length of sequence as 50. For longer sequences, we only use the last 50 time steps. For both datasets, we divide the patients

into disjoint train and test datasets with 80 and 20% ratios. We only use the training split to train EHR-Safe. At inference, we generate synthetic train and test datasets from random

vectors (note that EHR-Safe can generate an arbitrary amount of synthetic samples). We apply standard outlier removal (removing samples whose values fall outside the

range between the 0.1 and 99.9 percentiles) to exclude the outliers from the original datasets. More details on datasets, training and evaluation can be found in Supplementary

Information. FIDELITY The fidelity metrics assess the quality of synthetically generated data by measuring how realistic the synthetic data are compared with real data (more details are

provided in Supplementary Information). Higher fidelity implies that it is more difficult to differentiate between synthetic and real data. For generative modeling, there is no standard way

of evaluating the fidelity of the generated synthetic data samples, and often different works base their evaluations on different methods. In this section, we evaluate the fidelity of

synthetic data with multiple quantitative and qualitative analyses, including training on synthetic/testing on real and KS-statistics. More results (including t-SNE analyses, comparison of

distributions, propensity scores, and feature importance) can be found in Supplementary Information. STATISTICAL SIMILARITY We provide quantitative comparisons of statistical similarity

between original and synthetic data that compare the distributions of the generated synthetic data and original data per each feature (including the missing patterns). For numeric variables,

we report the mean, standard deviation, missing rates, and KS-statistics. For categorical data, we report the ratio of each category. We only report the results with the 15 temporal

numerical features (with lowest missing rates) and all static numerical features. Table 1 summarizes the results for temporal and static numerical features, and most statistics are

well-aligned between original and synthetic data (KS-statistics are mostly lower than 0.03). Additional results of the top 50 temporal numerical features and categorical features can be

found in Supplementary Information. UTILITY—ML MODEL DEVELOPMENT ON SYNTHETIC VS. REAL DATA As one of the most important use cases of synthetic data is enabling machine learning innovations,

we focus on the fidelity metric that compares a predictive model performance when it is trained on synthetic vs. real data. Similar model performance would indicate that the synthetic data

captures the relevant informative content for the task. We focus on the mortality prediction task29,30, one of the most important machine learning tasks for EHR. We train four different

predictive models (Gradient Boosting Tree Ensemble (GBDT), Random Forest (RF), Logistic Regression (LR), Gated Recurrent Units (GRU)). Table 2 compares the performance of the predictive

models. In most scenarios, they are highly similar in terms of AUC. On MIMIC-III, the best model (GBDT) on synthetic data is only 0.026 worse than the best model on real data, whereas on

eICU, the best model (RF) on synthetic data is only 0.009 worse than the best model on real data. In Supplementary Information, we also provide the algorithmic fairness analysis across

multiple subgroups divided by static categorical features (such as gender and religion). Additionally, we evaluate the utility of the synthetic data with a random subset of features and

multiple target variables. The goal is to evaluate the predictive capability of each dataset regardless of which features and targets are being used. We choose random subsets with 30

features and two target variables (mortality and gender) and test the hypothesis that the performance difference between the trained models by original and synthetic data is greater than

_X_. In a practical setting, the choice of X would enable data owners to define a constraint on the acceptable fidelity of synthetic data. We report results with _X_ = 0.04 for illustrative

purposes. We obtain the _p_-value (computed by a one-sample t-test) that allows us to reject this hypothesis. As can be seen in Table 2, for MIMIC-III mortality prediction, we can reject the

hypothesis that the AUC difference is greater than 0.04 with a _p_-value smaller than 0.01 (the average AUC difference is 0.009). For eICU gender prediction, we achieve a 0.019 average AUC difference

with _p_-value smaller than 0.001. PRIVACY Unlike de-identified data, there is no straightforward one-to-one mapping between real and synthetic data (generated from random vectors). However,

there may be some indirect privacy leakage risks built on correlations between the synthetic data and partial information from real data. We consider three different privacy attacks that

represent known approaches that adversaries may apply to de-anonymize private data (details are provided in Fig. 2 and Supplementary Information):

* MEMBERSHIP INFERENCE ATTACK: The adversary explores the probability of data being a member of the training data used for training the synthetic data generation model31.
* RE-IDENTIFICATION ATTACK: The adversary explores the probability of some features being re-identified using synthetic data and matching to the training data32.
* ATTRIBUTE INFERENCE ATTACK: The adversary predicts the value of sensitive features using synthetic data33.

These metrics are highly practical as they represent the expected risks that currently prevent sharing of conventionally anonymized data. Furthermore, they

are highly interpretable, as results for these metrics directly measure the risks associated with sharing synthetic data. Table 3 summarizes the results along with the ideal achievable value

for each metric. As Table 3 shows, the privacy metrics are very close to the ideal in all cases. The success rate of inferring whether a sample of the

original data was used for training the model is very close to random chance. For the attribute inference attack, we focus on the prediction task of inferring specific attributes

(gender, religion and marital status) using other attributes as features. We compare prediction accuracy when training a kNN classifier with real data against another kNN classifier trained

with synthetic data. The results demonstrate that access to synthetic data does not lead to higher prediction performance on specific attributes as compared to access to the original data.
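The comparison described above can be sketched in Python as follows. This is a minimal illustration rather than the exact protocol (which is given in Supplementary Information): the squared Euclidean distance, the value of `k`, the evaluation on held-out records, the function names, and the toy numbers are all assumptions made for the example.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training points closest to `query`."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def attack_accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of held-out records whose sensitive attribute is guessed correctly."""
    hits = sum(knn_predict(train_x, train_y, q, k) == y for q, y in zip(test_x, test_y))
    return hits / len(test_x)


# Toy records, made up for illustration: one feature -> one binary sensitive attribute.
real_x = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
real_y = [0, 0, 0, 1, 1, 1]
synth_x = [(0.05,), (0.15,), (0.25,), (0.95,), (1.05,), (1.15,)]
synth_y = [0, 0, 0, 1, 1, 1]
held_x, held_y = [(0.12,), (1.08,)], [0, 1]

acc_real = attack_accuracy(real_x, real_y, held_x, held_y)
acc_synth = attack_accuracy(synth_x, synth_y, held_x, held_y)
# Under this reading, synthetic data add no attack power when acc_synth <= acc_real.
```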

More results for privacy with different distance metrics can be found in Supplementary Information. DISCUSSION We provide ablation studies on key components of EHR-Safe in Table 4 (top): (1)

stochastic normalization, (2) explicit mask modeling, and (3) categorical embedding. All three components are observed to substantially contribute to the quality of synthetic data

generation. Supplementary Information further illustrates the impact of stochastic normalization in terms of CDF curves. In Table 4 (bottom), we compare EHR-Safe to three alternative methods

(TimeGAN15, RC-GAN34, C-RNN-GAN35) proposed for time-series synthetic data generation. Note that the alternative methods are not designed to handle all the challenges of EHR data, such as

varying length sequences, missingness and joint representation of static and time-varying features (please see Supplementary Information on how we modify them for these functionalities).

Thus, they significantly underperform EHR-Safe, as shown in Table 4. Post-processing can further improve the statistical similarity of the synthetic data. Perfectly matching the

distributions of synthetic and real data might be particularly challenging for features with skewness or CDFs with discrete jumps. For some scenarios where EHR-Safe might have a shortcoming

in matching the distributions, a proposed post-processing method (details can be found in Supplementary Information) can further refine the generated data and improve the fidelity results

for statistical similarity. The post-processing method is based on matching the ratios of samples in different buckets for the real and synthetic data. Note that this procedure is not a

learning-based method (i.e., no trainable parameters). With this procedure, we can significantly improve the statistical similarity: KS-statistics are less than or equal to 0.01 for all features.
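For reference, the KS-statistic used as the similarity measure is the largest absolute gap between the empirical CDFs of a real and a synthetic feature. The Python sketch below computes it for the observed values of one feature. The function name `ks_statistic` is ours, and missing entries are assumed to have been dropped beforehand, since the missing rate is reported separately.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: max over x of |F_real(x) - F_synth(x)|."""
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both empirical CDFs are step functions, so the maximum gap is attained
    # at one of the pooled sample points.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means their supports do not overlap.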

However, the drawbacks are the additional complexity of generating synthetic data and a slight degradation of the utility metrics (e.g., AUC changed from 0.749 to 0.730 on MIMIC-III with

Random Forest). There is not much difference in the proposed privacy metrics (e.g., membership-inference attack metric changed from 0.493 to 0.489 on MIMIC-III). We demonstrate that EHR-Safe

achieves very strong empirical privacy results when considering multiple practical privacy metrics. However, EHR-Safe does not provide theoretical privacy guarantees (e.g., differential

privacy) unless its training is modified by randomly perturbing the models21,36. Note that the EHR-Safe framework can be directly adapted to incorporate differential privacy. For instance, DP-SGD37 can be

used to train the encoder-decoder and WGAN-GP models to achieve a differentially private generator and decoder with respect to the original data. Since synthetic data are generated through

the differentially private generator and decoder using the random vector as the inputs, the generated synthetic data are also differentially private with respect to the original data. Even

though these approaches can be applied to EHR-Safe, they may decrease fidelity, as the added noise would hurt the generative model training. For the proposed metrics, the

specific assumptions and models might pose limitations. The proposed fidelity metrics that reflect the downstream machine learning use cases depend on the model type. For future work, it

would be interesting to study which fidelity metrics would correspond to the performance of the best achievable model. Similarly, the proposed privacy attacks employ certain assumptions

about the methodology and model of the attacker (e.g., nearest neighbor search for very high-dimensional data might be suboptimal). It would be interesting to understand the theoretically

achievable privacy. Most of our results are very close to the ideal achievable performance, indicating one could have high confidence in using our method in the real world. The result that

has the most room for improvement is statistical similarity, as it is not equally high for all features. Improving it consistently across all features can be done with further advances in

generative modeling. Various follow-up directions remain important for future work. The EHR data of this paper’s focus are heterogeneous structured data, and we show significant advancement

over the prior state-of-the-art that focused on more limited data types. A natural extension is to integrate the generative modeling capability for text and image data, as modern EHR

datasets often contain both. Realistic generation of text and image data would require high capacity and deep decoders. However, such decoders would come with extra training challenges, and

effective training of them could require a much higher number of data samples. In addition, extra training difficulties would arise due to the fact that training dynamics for different

modalities are different. Utilizing _foundation models_ that are pre-trained on publicly available data is shown to be one of the key drivers of the recent research progress for deep

learning on image and text data (including generative modeling). However, publicly available general purpose image and text datasets often come from very different domains, and their

relevance to real-world EHR data would be low. In this paper, we verify the performance of EHR-Safe on two healthcare provider datasets which consist of admitted patients. An important

follow-up work would be on applying EHR-Safe on out-patient medical datasets from primary care or insurance companies. Scaling synthetic data generation for a complete EHR dataset with many

features is another important future work. From a modeling perspective, there is no fundamental limitation for scaling—EHR-Safe can be trained to generate a very high number of features

without hitting computational issues. However, we expect degradation in the generation quality for rarely-observed features (e.g., almost 90% of the MIMIC-III features are measured less than

1 time per visit, on average). Weak data coverage would constitute the fundamental challenge. In conclusion, we propose a generative modeling framework for EHR data, EHR-Safe, that can

generate highly realistic synthetic EHR data that are robust to privacy attacks. EHR-Safe is based on generative adversarial networks modeling applied to the encoded representations of the

raw data. We introduce multiple innovations in the EHR-Safe architecture and training mechanisms that are motivated by the key challenges in EHR data. These innovations enable EHR-Safe to

demonstrate high fidelity (almost-identical properties with real data when desired downstream capabilities are considered) with almost-ideal privacy preservation. METHODS This research

follows Google AI principles (https://ai.google/principles/), was reviewed by the Google Health Ethics Committee, and uses solely publicly available datasets. The overall EHR-Safe framework is

illustrated in Fig. 1d. To synthesize EHR data, we adopt generative adversarial networks (GANs). EHR data are heterogeneous (see Fig. 1b), including time-varying and static features that are

partially available. Direct modeling of raw EHR data is thus challenging for GANs. To circumvent this, we propose utilizing a sequential encoder-decoder architecture to learn the mapping

from the raw EHR data to low-dimensional representations and vice versa. While learning the mapping, esoteric distributions of various numerical and categorical features pose a great

challenge; for example, some values or numerical ranges might be much more common, dominating the distribution, while the capability of modeling rare cases is crucial. Our proposed methods

for feature mapping are key to handling such data: they convert the features to distributions for which the training of the encoder-decoder and the GAN is more stable and accurate. The mapped low-dimensional

representations, generated by the encoder, are used for GAN training; at test time, the GAN generates new representations, which the decoder then converts to raw EHR data. Algorithm 1 overviews

the training procedure for EHR-Safe. In the following subsections, we explain the key components. FEATURE REPRESENTATIONS EHR data often consist of both static and time-varying features.

Each static and temporal feature can be further categorized into either numeric or categorical. Measurement time for time-varying features is another important feature. Overall, the five

categories of features for the patient index _i_ are: (1) measurement time as _u_, (2) static numeric feature (e.g., age) as $\mathbf{s}^{n}$, (3) static categorical feature (e.g., marital status) as

$\mathbf{s}^{c}$, (4) time-varying numerical feature (e.g., vital signs) as $\mathbf{t}^{n}$, (5) time-varying categorical feature (e.g., heart rhythm) as $\mathbf{t}^{c}$. The sequence length of time-varying features is

denoted as _T_(_i_). Note that each patient record may have a different sequence length. With all these features, the given training data can be represented as:

$$D = \{\mathbf{s}^{n}(i), \mathbf{s}^{c}(i), \{u_{\tau}(i), \mathbf{t}_{\tau}^{n}(i), \mathbf{t}_{\tau}^{c}(i)\}_{\tau=1}^{T(i)}\}_{i=1}^{N},$$ (1)

where _N_

is the total number of patient records. EHR datasets often contain missing features as patients might visit clinics sporadically, and not all measurements or information are collected

completely at all visits. In order to generate realistic synthetic data, missingness patterns should also be generated in a realistic way. We denote the binary mask as _m_, with 1/0 values

based on whether a feature is observed (_m_ = 1) or not (_m_ = 0). The missingness for the features is represented as

$$D_{M} = \{\mathbf{m}^{n}(i), \mathbf{m}^{c}(i), \{\mathbf{m}_{\tau}^{n}(i), \mathbf{m}_{\tau}^{c}(i)\}_{\tau=1}^{T(i)}\}_{i=1}^{N}.$$ (2)

Note that

there is no missingness for measurement time—we assume time is always given whenever at least one time-varying feature is observed. Figure 3 visualizes how the raw data are converted into

four categories of features: (1) measurement time, (2) time-varying features, (3) mask features, (4) static features. ENCODING AND DECODING CATEGORICAL FEATURES Handling categorical features

poses a unique challenge beyond numerical features, as meaningful discrete mappings need to be learned. One-hot encoding is one possible solution; however, if some features have a large

number of categories (such as the medical codes), the number of dimensions would significantly increase, hurting the GAN training and data efficiency38. We propose encoding and decoding

categorical features to obtain learnable mappings to be used for generative modeling. We first encode the categorical features ($\mathbf{s}^{c}$) into one-hot encoded features ($\mathbf{s}^{co}$). Here, we use the notation for static categorical features, but the procedure is the same for temporal categorical features. Then, we employ a categorical encoder ($CE^{s}$) to transform the one-hot encoded features into the latent representations ($\mathbf{s}^{ce}$):

$$\mathbf{s}^{ce} = CE^{s}[\mathbf{s}^{co}] = CE^{s}[s_{1}^{co}, ..., s_{K}^{co}],$$ (3)

where _K_ is the number of categorical features. Lastly, we use the multi-head decoders ($[CF_{1}^{s}, ..., CF_{K}^{s}]$) to recover the original one-hot encoded data from the latent representations:

$$\hat{\mathbf{s}}_{k}^{co} = CF_{k}^{s}[\mathbf{s}^{ce}]$$ (4)

Both the encoder ($CE^{s}$) and the multi-head decoders ($[CF_{1}^{s}, ..., CF_{K}^{s}]$) are trained with the softmax cross-entropy objective ($L_{c}$):

$$\min_{CE^{s}, CF_{1}^{s}, ..., CF_{K}^{s}} \sum_{k=1}^{K} L_{c}(CF_{k}^{s}[CE^{s}[s_{1}^{co}, ..., s_{K}^{co}]], s_{k}^{co}).$$ (5)

We use separate encoder-decoder models for static and temporal categorical features.
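To make objective (5) concrete, the sketch below evaluates it for a single record in plain Python. The linear encoder, the linear decoder heads, the latent width, and the category counts are assumptions chosen for brevity; the text does not specify the network architectures, and no training loop is shown.

```python
import math
import random


def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]


def linear(weights, x):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax_cross_entropy(logits, target_index):
    """-log softmax(logits)[target_index], computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]


def reconstruction_loss(sample, sizes, enc_w, head_ws):
    """Sum over the K decoder heads of the cross-entropy against feature k."""
    # Concatenate the K one-hot vectors: [s_1^co, ..., s_K^co].
    x = [v for idx, size in zip(sample, sizes) for v in one_hot(idx, size)]
    latent = linear(enc_w, x)  # s^ce = CE^s[s^co] (linear encoder is an assumption)
    return sum(
        softmax_cross_entropy(linear(head_w, latent), idx)  # L_c(CF_k^s[s^ce], s_k^co)
        for head_w, idx in zip(head_ws, sample)
    )


rng = random.Random(0)
sizes = [3, 5]    # hypothetical: K = 2 categorical features with 3 and 5 categories
latent_dim = 4    # hypothetical latent width
enc_w = [[rng.gauss(0, 0.1) for _ in range(sum(sizes))] for _ in range(latent_dim)]
head_ws = [[[rng.gauss(0, 0.1) for _ in range(latent_dim)] for _ in range(s)] for s in sizes]
loss = reconstruction_loss([2, 0], sizes, enc_w, head_ws)
```

Using one head per feature keeps each softmax restricted to that feature's own categories, while the shared latent vector is what the later stages consume.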

The transformed representations are denoted as $\mathbf{s}^{ce}$ and $\mathbf{t}^{ce}$, respectively.

ALGORITHM 1 Pseudo-code of EHR-Safe training.

INPUT: Original data $D = \{\mathbf{s}^{n}(i), \mathbf{s}^{c}(i), \{u_{\tau}(i), \mathbf{t}_{\tau}^{n}(i), \mathbf{t}_{\tau}^{c}(i)\}_{\tau=1}^{T(i)}\}_{i=1}^{N}$

1: Generate missing patterns of $D$: $D_{M} = \{\mathbf{m}^{n}(i), \mathbf{m}^{c}(i), \{\mathbf{m}_{\tau}^{n}(i), \mathbf{m}_{\tau}^{c}(i)\}_{\tau=1}^{T(i)}\}_{i=1}^{N}$

2: Transform categorical data ($\mathbf{s}^{c}$, $\mathbf{t}^{c}$) into one-hot encoded data ($\mathbf{s}^{co}$, $\mathbf{t}^{co}$)

3: Train static categorical encoder and decoder:

$$\min_{CE^{s}, CF_{1}^{s}, ..., CF_{K}^{s}} \sum_{k=1}^{K} L_{c}(CF_{k}^{s}[CE^{s}[s_{1}^{co}, ..., s_{K}^{co}]], s_{k}^{co})$$ (6)

4: Train temporal categorical encoder and decoder:

$$\min_{CE^{t}, CF_{1}^{t}, ..., CF_{K}^{t}} \sum_{k=1}^{K} L_{c}(CF_{k}^{t}[CE^{t}[t_{1}^{co}, ..., t_{K}^{co}]], t_{k}^{co})$$ (7)

5: Transform one-hot encoded data ($\mathbf{s}^{co}$, $\mathbf{t}^{co}$) to categorical embeddings ($\mathbf{s}^{ce}$, $\mathbf{t}^{ce}$)

6: Stochastic normalization for numerical features ($\mathbf{s}^{n}$, $\mathbf{t}^{n}$, _u_) (see Algorithm 2)

7: Train encoder-decoder model using Equation (11)

8: Generate original encoder states $\mathbf{e}$ using trained encoder (_E_), original data $D$ and missing patterns $D_{M}$

9: Train generator (_G_) and discriminator (_D_) using WGAN-GP:

$$\max_{G} \min_{D} \frac{1}{N}\sum_{i=1}^{N} D(\mathbf{e}[i]) - \frac{1}{N}\sum_{i=1}^{N} D(\hat{\mathbf{e}}[i]) + \eta [(\| \nabla D(\tilde{\mathbf{e}}[i]) \| - 1)^{2}]$$ (8)

OUTPUT: Trained generator (_G_), trained decoder (_F_), trained categorical decoders ($CF^{s}$, $CF^{t}$)

STOCHASTIC NORMALIZATION FOR NUMERICAL FEATURES One prominent challenge for training GANs is mode collapse38, i.e., the generative model overemphasizes the generation of some commonly observed data values. Especially for

distributions where the probability mass is concentrated within a small numerical range, this can be a severe issue. For EHR data, such distributions are indeed observed for many features.
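The remedy developed in the rest of this subsection (Algorithms 2 and 3) can be sketched in a few lines of Python. This is an illustrative reading of the pseudo-code, not the authors' implementation: the function names, the sorted ordering of unique values, and the clamping of out-of-range inputs are assumptions.

```python
import random
from bisect import bisect_right
from collections import Counter


def stochastic_normalize(values, rng):
    """Replace each value by a uniform draw from an interval whose width is its frequency."""
    n = len(values)
    counts = Counter(values)
    params, lower = {}, 0.0
    for val in sorted(counts):  # assumption: unique values are visited in sorted order
        upper = lower + counts[val] / n
        params[val] = (lower, upper)
        lower = upper
    normalized = [rng.uniform(*params[v]) for v in values]
    return normalized, params


def stochastic_renormalize(normalized, params):
    """Map each normalized number back to the value whose interval contains it."""
    ordered = sorted(params.items(), key=lambda item: item[1][0])
    uppers = [upper for _, (_, upper) in ordered]
    restored = []
    for x in normalized:
        # Index of the first interval whose upper bound exceeds x; inputs at or
        # beyond the last bound are clamped to the last value (an assumption).
        k = min(bisect_right(uppers, x), len(ordered) - 1)
        restored.append(ordered[k][0])
    return restored


values = [1] + [2] * 7 + [3] * 2  # ratios (0.1, 0.7, 0.2)
normalized, params = stochastic_normalize(values, random.Random(0))
restored = stochastic_renormalize(normalized, params)
```

Because the intervals are disjoint apart from shared endpoints, the round trip recovers the original values except when a draw lands exactly on an interval boundary, which has negligible probability. At generation time the same `params` would be applied to the model's outputs.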

Some numerical clinical features might have values from a discrete set of observations (e.g., high respiratory pressure values coming as multiples of 5—35, 40, 45, etc.) or from highly

nonuniform distributions, yielding cumulative distribution functions (CDFs) that are discontinuous or with significant jumps. Directly generating such numerical features coming from highly

discontinuous CDFs can be challenging for GANs, as they are known to suffer from mode collapse and would have the tendency to generate common values for all samples. To circumvent this issue

and obtain high fidelity, we propose a normalization/renormalization method, shown in Algorithms 2 and 3, that maps the raw feature distributions to and from a more uniform distribution that is easier to model with GANs. The procedure is: (1) estimate the ratio of each unique value in the original feature; (2) assign each unique value an interval in the normalized feature space whose width equals its ratio; (3) map each occurrence of a value into its interval in a uniformly random way. For example, with 3 original values (1, 2, 3) and corresponding ratios (0.1, 0.7, 0.2), we map 1 into [0, 0.1], 2 into [0.1, 0.8], and 3 into [0.8, 1.0].

ALGORITHM 2 Pseudo-code of stochastic normalization.

INPUT: Original feature _X_

1: uniq(X) = Unique values of _X_, N = Length of (_X_)
2: lower-bound = 0.0, upper-bound = 0.0, $\hat{X}=X$
3: FOR val in uniq(X) DO
4:     Find index of _X_ whose value = val as idx(val)
5:     Compute the frequency (ratio) of val as ratio(val) = Length of idx(val) / N
6:     upper-bound = lower-bound + ratio(val)
7:     $\hat{X}$[idx(val)] ~ Uniform(lower-bound, upper-bound)
8:     params[val] = [lower-bound, upper-bound]
9:     lower-bound = upper-bound
10: END FOR

OUTPUT: Normalized feature ($\hat{X}$), normalization parameters (params)

ALGORITHM 3 Pseudo-code of stochastic renormalization.

INPUT: Normalized feature ($\hat{X}$), normalization parameters (params)

1: $X=\hat{X}$
2: FOR param in params.keys DO
3:     Find index of $\hat{X}$ whose value is in [param.values] as idx(param)
4:     X[idx(param)] = param
5: END FOR

OUTPUT: Original feature _X_

As shown in Supplementary Information, the proposed stochastic normalization can be highly effective in

transforming features with discontinuous CDFs into approximately uniform distributions while allowing for perfect renormalization into the original feature space. The ablation in Table 4 shows that

stochastic normalization significantly improves the results of EHR-Safe. We also note that the stochastic normalization method is highly effective for handling skewed distributions

that might correspond to features with outliers. Stochastic normalization maps the original feature space (with outliers) into a normalized feature space (with uniform distribution), and

then the applied renormalization recreates the skewed distributions with outliers. ENCODER-DECODER ARCHITECTURE Given the described encoding scheme for numerical and categorical features,

next, we describe the employed architecture for jointly extracting the representations from multiple types of data, including static, temporal, measurement time, and mask features. We

propose to encode these heterogeneous features into joint representations from which the synthetic data samples are generated. High-dimensional sparse data are challenging to model with

GANs, as they might cause convergence stability and mode collapse issues, and they might be less data efficient38 To address this, using an encoder-decoder model is beneficial as it

condenses high-dimensional heterogeneous features into latent representations that are low dimensional and compact. The encoder model (_F_) inputs the static data (S_n_, S_c__e_), temporal

data (T_n_, T_c__e_), time data (_u_), and mask data (${{{{\bf{m}}}}}^{n},{{{{\bf{m}}}}}^{c},{{{{\bf{m}}}}}_{\tau }^{n},{{{{\bf{m}}}}}_{\tau }^{c}$) and generates the encoder states (E),

as shown in Fig. 4 and below equations.
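To make the numerical inputs concrete: the encoding scheme described earlier suggests that numerical features reach the encoder in stochastically normalized form (Algorithm 2), and Algorithm 3 maps decoded values back. The following is a minimal pure-Python sketch of both procedures under stated assumptions, not the authors' implementation. The function names `stochastic_normalize` and `stochastic_renormalize`, the sorted iteration order over unique values, and the half-open interval lookup during renormalization are illustrative choices that do not come from the paper.

```python
import random

def stochastic_normalize(x, rng=None):
    """Sketch of Algorithm 2: map each unique value to a uniform draw in its own interval."""
    if rng is None:
        rng = random.Random()
    n = len(x)
    x_hat = [0.0] * n
    params = {}
    lower = 0.0
    # Assumption: unique values are visited in sorted order (the paper does not fix an order).
    for val in sorted(set(x)):
        idx = [i for i, v in enumerate(x) if v == val]
        upper = lower + len(idx) / n  # interval width = relative frequency of val
        for i in idx:
            x_hat[i] = lower + (upper - lower) * rng.random()
        params[val] = (lower, upper)
        lower = upper
    return x_hat, params

def stochastic_renormalize(x_hat, params):
    """Sketch of Algorithm 3: map each normalized value back to the value owning its interval."""
    intervals = sorted(params.items(), key=lambda kv: kv[1][0])
    x = []
    for v in x_hat:
        # Assumption: half-open intervals [lower, upper), so a boundary value maps to one category;
        # anything at or beyond the last upper bound falls back to the last value.
        for val, (_, upper) in intervals:
            if v < upper:
                x.append(val)
                break
        else:
            x.append(intervals[-1][0])
    return x

# Example from the text: values (1, 2, 3) with ratios (0.1, 0.7, 0.2).
original = [1] + [2] * 7 + [3] * 2
normalized, params = stochastic_normalize(original, random.Random(0))
restored = stochastic_renormalize(normalized, params)
```

With this input, `params` assigns the value 2 an interval of approximately [0.1, 0.8] (up to floating-point rounding), and `restored` matches `original`, which illustrates the round trip between Algorithms 2 and 3.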

$$\mathbf{e}=E(\mathbf{s}^{n},\mathbf{s}^{ce},\mathbf{t}^{n},\mathbf{t}^{ce},u,\mathbf{m}^{n},\mathbf{m}^{c},\mathbf{m}_{\tau}^{n},\mathbf{m}_{\tau}^{c})$$ (9)

The decoder model (_F_) inputs these encoded representations ($\mathbf{e}$) and aims to recover the original static, temporal, measurement time, and mask data.

$$\hat{\mathbf{s}}^{n},\hat{\mathbf{s}}^{ce},\hat{\mathbf{t}}^{n},\hat{\mathbf{t}}^{ce},\hat{u},\hat{\mathbf{m}}^{n},\hat{\mathbf{m}}^{c},\hat{\mathbf{m}}_{\tau}^{n},\hat{\mathbf{m}}_{\tau}^{c}=F(\mathbf{e})$$ (10)

If the decoder model can recover the original heterogeneous data correctly, it can be inferred that $\mathbf{e}$ contains most of the information in the original heterogeneous data. For temporal, measurement time and static features, we use mean square error ($L_{m}$) as the reconstruction loss. Note that we compute the errors only when the features are observed. For the mask features, we use the binary cross entropy ($L_{c}$) as the reconstruction loss because the mask features consist of binary variables. Thus, our full reconstruction loss becomes:

$$\begin{array}{ll}\min\, L_{c}(\hat{\mathbf{m}}^{n},\mathbf{m}^{n})+L_{c}(\hat{\mathbf{m}}^{c},\mathbf{m}^{c})+L_{c}(\hat{\mathbf{m}}_{\tau}^{n},\mathbf{m}_{\tau}^{n})+L_{c}(\hat{\mathbf{m}}_{\tau}^{c},\mathbf{m}_{\tau}^{c})+\\ \qquad\lambda [L_{m}(\hat{u},u)+L_{m}(\mathbf{m}^{n}\hat{\mathbf{s}}^{n},\mathbf{m}^{n}\mathbf{s}^{n})+L_{m}(\mathbf{m}^{c}\hat{\mathbf{s}}^{ce},\mathbf{m}^{c}\mathbf{s}^{ce})+L_{m}(\mathbf{m}_{\tau}^{n}\hat{\mathbf{t}}^{n},\mathbf{m}_{\tau}^{n}\mathbf{t}^{n})+L_{m}(\mathbf{m}_{\tau}^{c}\hat{\mathbf{t}}^{ce},\mathbf{m}_{\tau}^{c}\mathbf{t}^{ce})],\end{array}$$ (11)

where _λ_ is the hyper-parameter to balance the cross entropy loss and mean squared loss.

ADVERSARIAL TRAINING The trained encoder model is used

to map raw data into encoded representations, which are then used for GAN training so that the trained generative model can generate realistic encoded representations that can be decoded into realistic raw data. We first utilize the trained encoder to generate original encoder states ($\mathbf{e}$) using the original raw data, so that the original dataset gets converted into $D_{e}=\{\mathbf{e}(i)\}_{i=1}^{N}$. Next, we use the generative adversarial network (GAN) training framework to generate synthetic encoder states $\hat{\mathbf{e}}$ to make the synthetic encoder states dataset $\hat{D}_{e}$. More specifically, the generator (_G_) uses the random vector ($\mathbf{z}$) to generate synthetic encoder states as follows.

$$\hat{\mathbf{e}}=G(\mathbf{z})$$ (12)

Then, the discriminator _D_ tries to distinguish the original encoder states $\mathbf{e}$ from the synthetic encoder states $\hat{\mathbf{e}}$. As the GAN framework, we adopt Wasserstein GAN39 with Gradient Penalty40 due to its training stability for heterogeneous data types. The optimization problem can be stated as:

$$\begin{array}{ll}\mathop{\max }\limits_{G}\mathop{\min }\limits_{D}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}D(\mathbf{e}[i])-\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}D(\hat{\mathbf{e}}[i])+\eta [{(\| \nabla D(\tilde{\mathbf{e}}[i])\| -1)}^{2}]\\ \,\mbox{where}\,\,\tilde{\mathbf{e}}[i]=\epsilon \mathbf{e}[i]+(1-\epsilon )\hat{\mathbf{e}}[i]\,\mbox{and}\,\epsilon \sim U[0,1],\end{array}$$ (13)

where _η_ is the WGAN-GP hyper-parameter, which is set to 10. Figure 4 describes the proposed GAN model with generator and discriminator architectures based on multi-layer perceptron (MLP).

INFERENCE The inference process of EHR-Safe is summarized in Algorithm 4. After training both the encoder-decoder and GAN models, we can generate synthetic heterogeneous data from any random vector. Note that only the trained generator and decoder are used for inference. As shown in Fig. 5, the trained generator uses the random vector to generate synthetic encoder states.

$$\hat{\mathbf{e}}=G(\mathbf{z})\,\mbox{where}\,\mathbf{z} \sim N(0,I)$$ (14)

Then, the trained decoder (_F_) uses the synthetic encoder states as the inputs to generate synthetic temporal ($\hat{\mathbf{t}}^{n},\hat{\mathbf{t}}^{ce}$), static ($\hat{\mathbf{s}}^{n},\hat{\mathbf{s}}^{ce}$), time ($\hat{u}$), and mask ($\hat{\mathbf{m}}^{n},\hat{\mathbf{m}}^{c},\hat{\mathbf{m}}_{\tau}^{n},\hat{\mathbf{m}}_{\tau}^{c}$) data.
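The inference steps described so far (sample a random vector, apply the generator, then apply the decoder) can be sketched in a few lines of Python. This only illustrates the data flow: `toy_generator` and `toy_decoder` are hypothetical stand-ins rather than the trained EHR-Safe networks, and the latent dimension and the output field names are assumptions made for the example.

```python
import random

def sample_latents(m, dim, rng):
    """Draw m random vectors z ~ N(0, I), one standard normal per coordinate."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]

def generate_synthetic(m, dim, generator, decoder, seed=None):
    """Data flow of inference: z -> e_hat = G(z) -> decoded record F(e_hat)."""
    rng = random.Random(seed)
    records = []
    for z in sample_latents(m, dim, rng):
        e_hat = generator(z)            # synthetic encoder state
        records.append(decoder(e_hat))  # synthetic features, still in normalized space
    return records

# Hypothetical stand-ins so the sketch runs; the real G and F are trained networks.
def toy_generator(z):
    return [0.5 * v for v in z]

def toy_decoder(e_hat):
    return {"static": e_hat[:1], "temporal": e_hat[1:]}

synthetic = generate_synthetic(m=5, dim=4, generator=toy_generator,
                               decoder=toy_decoder, seed=0)
```

Because only `generator` and `decoder` are needed at this stage, `m` can be chosen freely, including values larger than the size of the original dataset.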

$$\hat{\mathbf{s}}^{n},\hat{\mathbf{s}}^{ce},\hat{\mathbf{t}}^{n},\hat{\mathbf{t}}^{ce},\hat{u},\hat{\mathbf{m}}^{n},\hat{\mathbf{m}}^{c},\hat{\mathbf{m}}_{\tau}^{n},\hat{\mathbf{m}}_{\tau}^{c}=F(\hat{\mathbf{e}})$$ (15)

Representations for the static and temporal categorical features are decoded using the decoders in Fig. 6 to generate synthetic static categorical ($\hat{\mathbf{s}}^{c}$) data and temporal categorical ($\hat{\mathbf{t}}^{c}$) data.

$$\hat{\mathbf{s}}^{c}=C{F}^{s}(\hat{\mathbf{s}}^{ce}),\hat{\mathbf{t}}^{c}=C{F}^{t}(\hat{\mathbf{t}}^{ce})$$ (16)

The generated synthetic data are represented as:

$$\hat{D}=\{\hat{\mathbf{s}}^{n}(i),\hat{\mathbf{s}}^{c}(i),\{\hat{\mathbf{u}}_{\tau}(i),\hat{\mathbf{t}}_{\tau}^{n}(i),\hat{\mathbf{t}}_{\tau}^{c}(i)\}_{\tau=1}^{\hat{T}(i)}\}_{i=1}^{M}$$ (17)

$$\hat{D}_{M}=\{\hat{\mathbf{m}}^{n}(i),\hat{\mathbf{m}}^{c}(i),\{\hat{\mathbf{m}}_{\tau}^{n}(i),\hat{\mathbf{m}}_{\tau}^{c}(i)\}_{\tau=1}^{\hat{T}(i)}\}_{i=1}^{M}$$ (18)

Note that with the trained models, we can generate an arbitrary number of synthetic data samples

(even more than the original data).

ALGORITHM 4 Pseudo-code of EHR-Safe inference.

INPUT: Trained generator (_G_), trained decoder (_F_), the number of synthetic data (_M_), trained categorical decoders ($C{F}^{s}$, $C{F}^{t}$)

1: Sample _M_ random vectors $\mathbf{z} \sim N(0,I)$
2: Generate synthetic embeddings: $\hat{\mathbf{e}}=G(\mathbf{z})$
3: Decode synthetic embeddings to synthetic data: $\hat{\mathbf{s}}^{n},\hat{\mathbf{s}}^{ce},\hat{\mathbf{t}}^{n},\hat{\mathbf{t}}^{ce},\hat{u},\hat{\mathbf{m}}^{n},\hat{\mathbf{m}}^{c},\hat{\mathbf{m}}_{\tau}^{n},\hat{\mathbf{m}}_{\tau}^{c}=F(\hat{\mathbf{e}})$
4: Decode synthetic categorical embeddings: $\hat{\mathbf{s}}^{c}=C{F}^{s}(\hat{\mathbf{s}}^{ce}),\hat{\mathbf{t}}^{c}=C{F}^{t}(\hat{\mathbf{t}}^{ce})$
5: Renormalize synthetic numerical data ($\hat{\mathbf{s}}^{n},\hat{\mathbf{t}}^{n},\hat{u}$) (see Algorithm 3)

OUTPUT: Synthetic data $\hat{D}=\{\hat{\mathbf{s}}^{n}(i),\hat{\mathbf{s}}^{c}(i),\{\hat{\mathbf{u}}_{\tau}(i),\hat{\mathbf{t}}_{\tau}^{n}(i),\hat{\mathbf{t}}_{\tau}^{c}(i)\}_{\tau=1}^{\hat{T}(i)}\}_{i=1}^{M}$ and synthetic missing pattern $\hat{D}_{M}=\{\hat{\mathbf{m}}^{n}(i),\hat{\mathbf{m}}^{c}(i),\{\hat{\mathbf{m}}_{\tau}^{n}(i),\hat{\mathbf{m}}_{\tau}^{c}(i)\}_{\tau=1}^{\hat{T}(i)}\}_{i=1}^{M}$

DATA AVAILABILITY The data used for the training, validation, and test sets are publicly available. All data were collected entirely from openly available

sources. The following websites can be used to access the EHR datasets used in this study: MIMIC-III (https://physionet.org/content/mimiciii/1.4/), eICU

(https://eicu-crd.mit.edu/gettingstarted/access/).

REFERENCES

1. Zhu, T., Li, K., Herrero, P. & Georgiou, P. Deep learning for diabetes: a systematic review. _IEEE J. Biomed. Health Inform._ 25, 2744–2757 (2020).
2. Yu, L., Chan, W. M., Zhao, Y. & Tsui, K.-L. Personalized health monitoring system of elderly wellness at the community level in Hong Kong. _IEEE Access_ 6, 35558–35567 (2018).
3. Liu, R. et al. Systematic pan-cancer analysis of mutation–treatment interactions using large real-world clinicogenomics data. _Nat. Med._ 28, 1656–1661 (2022).
4. Abouelmehdi, K., Beni-Hssane, A., Khaloufi, H. & Saadi, M. Big data security and privacy in healthcare: a review. _Procedia Comput. Sci._ 113, 73–80 (2017).
5. Iyengar, A., Kundu, A. & Pallis, G. Healthcare informatics and privacy. _IEEE Internet Comput._ 22, 29–31 (2018).
6. Ray, P. & Wimalasiri, J. The need for technical solutions for maintaining the privacy of EHR. In _Proc. 2006 International Conference of the IEEE Engineering in Medicine and Biology Society_, 4686–4689 (IEEE, 2006).
7. Azarm-Daigle, M., Kuziemsky, C. & Peyton, L. A review of cross organizational healthcare data sharing. _Procedia Comput. Sci._ 63, 425–432 (2015).
8. Uzuner, Ö., Luo, Y. & Szolovits, P. Evaluating the state-of-the-art in automatic de-identification. _J. Am. Med. Inform. Assoc._ 14, 550–563 (2007).
9. Janmey, V. & Elkin, P. L. Re-identification risk in HIPAA de-identified datasets: the MVA attack. _AMIA Annu. Symp. Proc._ 2018, 1329–1337 (2018).
10. Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. _Nat. Biomed. Eng._ 5, 493–497 (2021).
11. Goodfellow, I. et al. Generative adversarial nets. In _Proc. 27th International Conference on Neural Information Processing Systems_, Vol. 27, 2672–2680 (2014).
12. Van den Oord, A. et al. Conditional image generation with PixelCNN decoders. In _Proc. 30th International Conference on Neural Information Processing Systems_, 4797–4805 (2016).
13. Van den Oord, A. et al. Wavenet: a generative model for raw audio. Preprint at https://arxiv.org/abs/1609.03499 (2016).
14. Nowozin, S., Cseke, B. & Tomioka, R. _f_-GAN: training generative neural samplers using variational divergence minimization. In _Proc. 30th International Conference on Neural Information Processing Systems_, 271–279 (2016).
15. Yoon, J., Jarrett, D. & Van der Schaar, M. Time-series generative adversarial networks. In _Proc. 33rd Conference on Neural Information Processing Systems_ (2019).
16. Creswell, A. et al. Generative adversarial networks: an overview. _IEEE Signal Process. Mag._ 35, 53–65 (2018).
17. Karras, T., Aila, T., Laine, S. & Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In _Proc. International Conference on Learning Representations (ICLR)_ (2018).
18. Kong, J., Kim, J. & Bae, J. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. _Adv. Neural Inf. Process. Syst._ 33, 17022–17033 (2020).
19. de Masson d’Autume, C., Mohamed, S., Rosca, M. & Rae, J. Training language GANs from scratch. In _Proc. 33rd Conference on Neural Information Processing Systems_ (2019).
20. Liu, Y., Peng, J., James, J. & Wu, Y. PPGAN: privacy-preserving generative adversarial network. In _Proc. 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS)_, 985–989 (IEEE, 2019).
21. Jordon, J., Yoon, J. & Van Der Schaar, M. PATE-GAN: generating synthetic data with differential privacy guarantees. In _Proc. 2019 International Conference On Learning Representations_ (2019).
22. Jarrett, D., Bica, I. & van der Schaar, M. Time-series generation by contrastive imitation. _Adv. Neural Inf. Process. Syst._ 34, 28968–28982 (2021).
23. Choi, E. et al. Generating multi-label discrete patient records using generative adversarial networks. _PMLR_ 68, 286–305 (2017).
24. Lu, C., Reddy, C. K., Wang, P., Nie, D. & Ning, Y. Multi-label clinical time-series generation via conditional GAN. Preprint at https://arxiv.org/abs/2204.04797 (2022).
25. Johnson, A., Pollard, T. & Mark, R. MIMIC-III clinical database (version 1.4). _PhysioNet_ 10 (2016). https://physionet.org/content/mimiciii/1.4/.
26. Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. _Sci. Data_ 3, 160035 (2016).
27. Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. _Circulation_ 101, e215–e220 (2000).
28. Pollard, T. J. et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. _Sci. Data_ 5, 180178 (2018).
29. Sadeghi, R., Banerjee, T. & Romine, W. Early hospital mortality prediction using vital signals. _Smart Health_ 9, 265–274 (2018).
30. Sheikhalishahi, S., Balaraman, V. & Osmani, V. Benchmarking machine learning models on eICU critical care dataset. Preprint at https://arxiv.org/abs/1910.00964 (2019).
31. Liu, G. et al. SocInf: membership inference attacks on social media health data with machine learning. _IEEE Trans. Comput. Soc. Syst._ 6, 907–921 (2019).
32. Su, D., Huynh, H. T., Chen, Z., Lu, Y. & Lu, W. Re-identification attack to privacy-preserving data analysis with noisy sample-mean. In _Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining_, 1045–1053 (2020).
33. Mehnaz, S. et al. Are your sensitive attributes private? Novel model inversion attribute inference attacks on classification models. In _Proc. 31st USENIX Security Symposium (USENIX Security 22)_, 4579–4596 (2022).
34. Esteban, C., Hyland, S. L. & Rätsch, G. Real-valued (medical) time series generation with recurrent conditional GANs. Preprint at https://arxiv.org/abs/1706.02633 (2017).
35. Mogren, O. C-RNN-GAN: continuous recurrent neural networks with adversarial training. Preprint at https://arxiv.org/abs/1611.09904 (2016).
36. Torkzadehmahani, R., Kairouz, P. & Paten, B. DP-CGAN: differentially private synthetic data and label generation. In _Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops_ (2019).
37. Abadi, M. et al. Deep learning with differential privacy. In _Proc. 2016 ACM SIGSAC Conference on Computer and Communications Security_, 308–318 (2016).
38. Saxena, D. & Cao, J. Generative adversarial networks (GANs) challenges, solutions, and future directions. _ACM Comput. Surv. (CSUR)_ 54, 1–42 (2021).
39. Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein generative adversarial networks. _PMLR_ 70, 214–223 (2017).
40. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. C. Improved training of Wasserstein GANs. In _Proc. 31st International Conference on Neural Information Processing Systems_, 5769–5779 (2017).

ACKNOWLEDGEMENTS This work was approved by Google, and no extramural funding

was used for this project. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * Google Cloud, 1155 Borregas Ave, Sunnyvale, CA, USA Jinsung Yoon, Michel Mizrahi, Nahid Farhady Ghalaty, Thomas

Jarvinen, Ashwin S. Ravi, Peter Brune, Fanyu Kong, Dave Anderson, George Lee, Farhana Bandukwala, Sercan Ö. Arık & Tomas Pfister * Google LLC, 1600 Amphitheatre Pkwy, Mountain View, CA,

USA Arie Meir & Elli Kanal CONTRIBUTIONS J.Y., F.B., S.A. and T.P. initiated the

project. J.Y. and S.A. designed the model architecture and training methodology. J.Y., M.M., N.F.G., T.J. and S.A. contributed to metric developments. J.Y., M.M., N.F.G., A.S.R., P.B., F.K.

and D.A. contributed to developing scalable pipelines and software infrastructure. J.Y., M.M., N.F.G., T.J. and S.A. contributed to the overall experimental design and analyses. M.M., N.F.G.

and F.B. contributed to data preprocessing. J.Y., M.M., G.L., A.M., F.B., E.K., S.A. and T.P. managed the project. J.Y., M.M., N.F.G., T.J., A.M., F.B., E.K., S.A. and T.P. wrote the paper.

CORRESPONDING AUTHOR Correspondence to Jinsung Yoon. ETHICS DECLARATIONS COMPETING INTERESTS This work was approved by Google, and no extramural funding was used for this project. All

authors are affiliated with Google. The authors have no other competing interests to declare. ADDITIONAL INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with regard to

jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTARY INFORMATION RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under

a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate

credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article

are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and

your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this

license, visit http://creativecommons.org/licenses/by/4.0/. ABOUT THIS ARTICLE CITE THIS ARTICLE Yoon, J., Mizrahi, M., Ghalaty, N.F. _et al._ EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. _npj Digit. Med._ 6, 141 (2023). https://doi.org/10.1038/s41746-023-00888-7 * Received: 19 January 2023 * Accepted: 26 July 2023 * Published: 11 August 2023 * DOI: https://doi.org/10.1038/s41746-023-00888-7
