Generative AI Data Augmentation and Missing Value Completion Method for Multi-Source Heterogeneous Big Data

Linfeng Yang

doi:10.22158/mmse.v8n3p66

Abstract

The proliferation of multi-source heterogeneous big data across healthcare, industrial Internet of Things (IIoT), and smart city domains has introduced two pervasive bottlenecks in data analytics: widespread missing values and insufficient high-quality training samples. Traditional missing value imputation methods fail to capture complex nonlinear correlations across heterogeneous data modalities, while existing data augmentation techniques are mostly designed for single-modal data and neglect intrinsic cross-source associations. To address these gaps, this paper proposes Hetero-GenAI, a unified generative artificial intelligence framework for joint missing value completion and data augmentation in multi-source heterogeneous big data. First, a heterogeneous data embedding module with cross-modal attention maps numerical, categorical, temporal, and textual features into a shared latent space, explicitly modeling inter-source and intra-source feature dependencies. Second, a missing-aware conditional diffusion model performs adaptive imputation by integrating missing masks as soft constraints, removing the need for prior assumptions about the missingness mechanism (MCAR, MAR, or MNAR). Third, a distribution-aligned augmentation strategy generates diverse, realistic samples via latent space interpolation and semantic perturbation while preserving cross-modal semantic coherence and distribution consistency. Extensive experiments on two real-world datasets, the MIMIC-IV clinical dataset and the NASA C-MAPSS industrial sensor dataset, demonstrate that at a 30% missing rate the proposed method reduces numerical imputation RMSE by 12.3%–21.7% and improves categorical imputation F1-score by 8.5%–15.2% compared with state-of-the-art baselines. Furthermore, the augmentation pipeline improves downstream classification and prediction task performance by 9.2%–14.6%, verifying its dual effectiveness in data completion and quality enhancement.
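The imputation figures above are reported as RMSE at a fixed missing rate. As a rough illustration of how such a number can be computed, the following minimal sketch hides a fraction of known entries, fills them in, and scores only the hidden cells. It is not the Hetero-GenAI model: the column-mean imputer is a placeholder, the uniform random (MCAR-style) masking is an assumption about the evaluation protocol, and all function names are hypothetical.

```python
import math
import random


def make_mask(n_rows, n_cols, missing_rate, seed=0):
    """Return a boolean grid; True marks an entry hidden from the imputer.

    Assumption: entries are hidden uniformly at random (MCAR-style).
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    hidden = set(rng.sample(cells, round(missing_rate * len(cells))))
    return [[(i, j) in hidden for j in range(n_cols)] for i in range(n_rows)]


def mean_impute(data, mask):
    """Placeholder imputer: fill each hidden entry with the mean of the
    visible entries in the same column (0.0 if the column is fully hidden)."""
    n_rows, n_cols = len(data), len(data[0])
    filled = [row[:] for row in data]
    for j in range(n_cols):
        visible = [data[i][j] for i in range(n_rows) if not mask[i][j]]
        col_mean = sum(visible) / len(visible) if visible else 0.0
        for i in range(n_rows):
            if mask[i][j]:
                filled[i][j] = col_mean
    return filled


def masked_rmse(truth, imputed, mask):
    """Root mean squared error computed over hidden entries only."""
    errors = [
        (truth[i][j] - imputed[i][j]) ** 2
        for i in range(len(truth))
        for j in range(len(truth[0]))
        if mask[i][j]
    ]
    if not errors:
        raise ValueError("mask hides no entries")
    return math.sqrt(sum(errors) / len(errors))


def relative_reduction(baseline_rmse, proposed_rmse):
    """Percentage by which proposed_rmse is lower than baseline_rmse."""
    return 100.0 * (baseline_rmse - proposed_rmse) / baseline_rmse


if __name__ == "__main__":
    # Toy numerical table; values are illustrative only.
    table = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
    mask = make_mask(len(table), len(table[0]), missing_rate=0.3)
    baseline = masked_rmse(table, mean_impute(table, mask), mask)
    print(f"mean-imputation RMSE on hidden cells: {baseline:.3f}")
```

Under this reading, a "12.3% reduction" would correspond to `relative_reduction(baseline_rmse, proposed_rmse)` returning 12.3, with both errors measured on the same hidden cells.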

This work is licensed under a Creative Commons Attribution 4.0 International License.

Modern Management Science & Engineering
