GENESIS

GENESIS is a specialized synthetic data generation system that transforms real manufacturing data into high-fidelity synthetic datasets, protecting intellectual property while fueling industrial AI.

What is GENESIS?

GENESIS is a manufacturing-focused synthetic data generation system built on state-of-the-art generative AI architectures. It learns the statistical structure of real production data and generates novel, privacy-safe datasets that faithfully replicate the behavior of the original manufacturing processes.

Rather than simply copying data, GENESIS actively models complex distributions across mixed data types: from continuous sensor readings and categorical defect classifications to time-series production sequences. It adapts on-the-fly to new data, manages the full model lifecycle, and embeds manufacturing domain knowledge directly into the generation pipeline through a flexible custom function system.

Key Features

GENESIS provides a comprehensive suite of capabilities designed to transform raw manufacturing data into production-ready synthetic datasets, enabling industrial AI development without compromising operational data privacy.

🧠

CTGAN & VAE Architectures

State-of-the-art generative models designed to learn complex distributions from manufacturing datasets, handling mixed continuous, categorical, and time-series data types.

⚡

Training-on-the-Fly

Dynamic model configuration and immediate training upon data submission, with hyperparameters tuned to each dataset's characteristics. The system adapts to each new production line or equipment type without requiring manual setup.

🔒

Schema-Based Generation

Generate realistic manufacturing data using a structural metadata descriptor (data skeleton) that communicates column types, constraints, and relationships to the trained model. After initial training on real data, inference can proceed from the schema alone, without resubmitting the original dataset each time.
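A data skeleton of this kind might look like the following minimal sketch. All field names here (`table`, `columns`, `rows_requested`, and so on) are illustrative assumptions, not the actual GENESIS schema format:

```python
import json

# Hypothetical data skeleton: column types, constraints, and relationships
# for a fictional quality-control table. Field names are illustrative only.
skeleton = {
    "table": "press_line_qc",
    "columns": [
        {"name": "temperature_c", "type": "continuous", "min": 140.0, "max": 210.0},
        {"name": "cycle_time_s", "type": "continuous", "min": 8.0, "max": 30.0},
        {"name": "defect_class", "type": "nominal",
         "categories": ["none", "scratch", "warp", "crack"]},
        {"name": "passed_qc", "type": "binary"},
    ],
    "rows_requested": 5000,
}

# Serialized, this descriptor is all the trained model needs at inference time.
payload = json.dumps(skeleton, indent=2)
print(payload)
```

Because the skeleton carries only structure and constraints, it can be stored and shared freely without exposing any real production records.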

🔄

Model Lifecycle Management

Full support for training, saving, loading, versioning, and fine-tuning. Models persist across facilities and continuously improve as new production data accumulates.

🛠️

Custom Function Application

Encode manufacturing domain knowledge directly into generation workflows. Define priority-ordered transformations for quality metrics, defect classifications, and business rules.
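One way to picture priority-ordered transformations is as a sorted list of (priority, function) pairs applied to each generated row. The rules below are invented examples for illustration, not built-in GENESIS functions:

```python
# Minimal sketch of priority-ordered custom functions, assuming each rule
# is a (priority, function) pair applied to every generated row in order.
# Both rules below are hypothetical domain rules, not GENESIS built-ins.

def clip_temperature(row):
    # Business rule: sensor readings must stay within physical plant limits.
    row["temperature_c"] = max(140.0, min(210.0, row["temperature_c"]))
    return row

def derive_pass_fail(row):
    # Business rule: a sample passes QC only when no defect was assigned.
    row["passed_qc"] = row["defect_class"] == "none"
    return row

rules = [(10, clip_temperature), (20, derive_pass_fail)]

def apply_rules(row, rules):
    # Lower priority numbers run first, so later rules see earlier results.
    for _, fn in sorted(rules, key=lambda r: r[0]):
        row = fn(row)
    return row

sample = {"temperature_c": 250.0, "defect_class": "scratch"}
print(apply_rules(sample, rules))  # temperature clipped, passed_qc derived
```

Ordering by priority matters when one rule depends on a value another rule produces, which is exactly the situation quality metrics and derived labels create.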

Integrated Evaluation

What truly sets GENESIS apart is the seamless integration of statistical quality evaluation directly into the generation pipeline: not as an afterthought, but as a first-class component of every production run.

📊

Built-in Quality Metrics

Every generation job automatically runs a suite of statistical evaluation metrics purpose-built for manufacturing data. Distribution fidelity, feature-level divergence, and sample novelty are all measured without any additional configuration.
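As an intuition for what a distribution-fidelity metric measures, here is a plain total variation distance between two categorical columns. This is a generic textbook metric sketched for illustration, not necessarily the exact formula GENESIS applies:

```python
from collections import Counter

def total_variation(real, synthetic):
    """Total variation distance between the empirical distributions of two
    categorical columns: 0.0 means identical, 1.0 means fully disjoint."""
    p, q = Counter(real), Counter(synthetic)
    categories = set(p) | set(q)
    n_p, n_q = len(real), len(synthetic)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)

# Toy defect-class columns: 90/10 split in the real data, 85/15 in the synthetic.
real = ["none"] * 90 + ["scratch"] * 10
synth = ["none"] * 85 + ["scratch"] * 15
print(round(total_variation(real, synth), 3))  # 0.05
```

A low score indicates the synthetic column reproduces the real class balance closely, which is the kind of check that runs automatically on every generation job.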

📋

Structured Quality Reports

Evaluation results are delivered as structured JSON reports alongside the generated dataset. Each report quantifies how faithfully the synthetic data reproduces the statistical properties of the original, giving teams an objective measure of generation quality.

🎯

Manufacturing-Grade Standards

Unlike generic evaluation frameworks, GENESIS applies metrics calibrated for industrial data: measuring temporal coherence in time-series outputs, handling mixed-type columns, and accounting for the statistical properties typical of production-line measurements.

Tabular & Time-Series Data

GENESIS is designed from the ground up to handle the two primary data structures found across industrial manufacturing environments. Each type is supported by a dedicated processing strategy, ensuring that both the statistical and temporal properties of the original data are faithfully captured and reproduced.

🗂️

Tabular Data

Tabular data is the most common format in manufacturing: structured records where each row represents a production sample, measurement, or quality inspection event, and each column carries a specific attribute โ€” from continuous sensor readings to categorical defect codes and binary pass/fail labels.

GENESIS trains a CTGAN (Conditional Tabular GAN) model on tabular inputs. CTGAN is specifically designed to handle the statistical challenges of real-world manufacturing tables: highly imbalanced class distributions, multi-modal continuous variables, and mixed numeric and categorical columns. After training, the model can generate new rows that faithfully replicate the joint distribution across all columns, preserving complex inter-feature correlations such as the relationship between temperature readings, cycle times, and defect occurrence rates.

Supported column types include: continuous numeric values, discrete numeric values, ordinal categories, and nominal categories. Missing values and outliers are handled during preprocessing before training begins.
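The preprocessing described above can be sketched in a few lines: median imputation for numeric gaps and integer encoding for nominal categories. This is a simplified stand-in for illustration; GENESIS's actual preprocessing internals may differ:

```python
import statistics

# Toy inspection records with one missing sensor reading.
rows = [
    {"temperature_c": 185.2, "defect_class": "none"},
    {"temperature_c": None,  "defect_class": "scratch"},
    {"temperature_c": 191.7, "defect_class": "none"},
]

# Impute missing continuous values with the column median.
observed = [r["temperature_c"] for r in rows if r["temperature_c"] is not None]
median = statistics.median(observed)
for r in rows:
    if r["temperature_c"] is None:
        r["temperature_c"] = median

# Encode nominal categories as stable integer codes for model consumption.
categories = sorted({r["defect_class"] for r in rows})
encoding = {c: i for i, c in enumerate(categories)}
for r in rows:
    r["defect_code"] = encoding[r["defect_class"]]

print(rows)
```

Handling gaps and encodings before training is what lets the model see a clean, fully numeric view of mixed-type manufacturing tables.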

📈

Time-Series Data

Time-series data captures the temporal evolution of manufacturing processes: sequences of measurements recorded at regular or irregular intervals across one or more sensors, production cycles, or experimental runs. This structure is fundamental to predictive maintenance, anomaly detection, and process optimization use cases.

For time-series inputs, GENESIS employs a VAE (Variational Autoencoder) architecture adapted to sequential data. The model learns a compact latent representation of the underlying temporal dynamics, encoding not just the distribution of individual values but also the autocorrelation structure and the characteristic patterns of variation across time. Experiment groupings and sequence boundaries are preserved during training, so that the model does not conflate measurements from different production runs or equipment instances.

Generated time-series sequences respect temporal coherence: trends, periodic patterns, and transient events are reproduced in a statistically consistent manner. The training process requires real time-series data to learn these dynamics; once trained, the model can generate novel sequences that maintain the same temporal properties without resubmitting the original records.
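Preserving experiment groupings amounts to partitioning flat records by their run identifier before training, so no sequence ever mixes measurements from different runs. A minimal sketch, assuming each record carries a `run_id` and a timestep `t` (illustrative field names):

```python
from collections import defaultdict

# Flat sensor records from two separate production runs.
records = [
    {"run_id": "A", "t": 0, "vibration": 0.11},
    {"run_id": "B", "t": 0, "vibration": 0.09},
    {"run_id": "A", "t": 1, "vibration": 0.14},
    {"run_id": "B", "t": 1, "vibration": 0.10},
]

# Rebuild each run as its own ordered sequence; runs never get conflated.
sequences = defaultdict(list)
for rec in sorted(records, key=lambda r: (r["run_id"], r["t"])):
    sequences[rec["run_id"]].append(rec["vibration"])

print(dict(sequences))
```

Training on per-run sequences rather than the flat table is what lets the model learn autocorrelation within a run instead of spurious patterns across runs.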

How GENESIS Processes Manufacturing Data

To understand how GENESIS operates in practice, consider a production line dataset lifecycle: from raw sensor ingestion, through adaptive model training, to synthetic data generation and quality validation.

1 - Ingest

Training Data Submission

Manufacturing data arrives as a JSON payload through the API. GENESIS automatically identifies the data type (tabular or time-series) and routes it through the appropriate preprocessing strategy. Sensor readings are normalized, categorical variables are encoded, and missing values are handled without manual intervention.
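A training submission might be shaped like the following sketch. The field names (`job`, `data_type`, `records`) are assumptions for illustration, not the documented GENESIS API:

```python
import json

# Hypothetical training submission for a tabular dataset; the structure
# and field names are illustrative, not the actual GENESIS request format.
request_body = {
    "job": "train",
    "data_type": "tabular",
    "records": [
        {"temperature_c": 185.2, "cycle_time_s": 12.4, "defect_class": "none"},
        {"temperature_c": 203.9, "cycle_time_s": 15.1, "defect_class": "warp"},
    ],
}

# The payload is serialized once and submitted to the API as-is; type
# detection and preprocessing then happen server-side without intervention.
print(json.dumps(request_body, indent=2))
```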

2 - Train

Dynamic Model Adaptation

The system analyzes the incoming data structure and immediately begins training. Hyperparameters are tuned dynamically based on data characteristics. For time-series inputs, temporal dependencies and experiment groupings are preserved throughout the learning process.

3 - Persist

Model Lifecycle Management

Once training completes, the model is saved alongside structural metadata and is ready for deployment. Trained models can be shared across multiple manufacturing facilities, versioned, and continuously improved through fine-tuning as new production data becomes available.

4 - Generate

Schema-Based or Data-Guided Inference

During inference, GENESIS operates in two modes. In schema-based mode, a structural metadata descriptor (data skeleton) communicates column types, constraints, and statistical properties to the trained model, enabling generation without resubmitting the original dataset. In data-guided mode, real reference data accompany the request and serve as the baseline against which the synthetic output is evaluated.

5 - Transform

Function-Enhanced Generation

Custom functions encode manufacturing domain rules directly into the generation pipeline. Users define priority-ordered transformations, such as enforcing value boundaries, adding noise, or fixing a precise data behaviour, which are applied either during generation from scratch or as post-processing over existing synthetic datasets.

6 - Evaluate

Integrated Quality Assessment

GENESIS calculates statistical distances between real and generated datasets, assesses distribution adherence, and measures sample novelty. Evaluation results are delivered as structured JSON reports for further inspection or downstream consumption.

System Architecture Layers

GENESIS is designed as a microservice-based layered architecture, where each tier has a clearly defined responsibility and communicates with adjacent layers through well-defined interfaces. This separation of concerns makes the system easy to extend, deploy, and integrate into existing industrial infrastructures.

Layer 1: User Interface

The outermost layer is designed with ease of use as its primary goal. It exposes all GENESIS capabilities through both a visual web interface and a programmatic API, so that data scientists and developers can interact with the system in whichever way fits their workflow. Users submit requests, monitor job progress, retrieve generated datasets, and inspect quality reports without needing to understand the internals of the generation process. The interface abstracts away all complexity and presents GENESIS as a single, coherent service.

Layer 2: Middleware - Orchestrator, Input Coherence Check, and Persistence

The middleware layer is the operational backbone of the system. An Orchestrator component receives every incoming request from the interface layer, validates it, and routes it to the appropriate downstream service. Before any processing begins, an Input Coherence Check inspects the submitted payload for structural consistency: verifying that data schemas are well-formed, that required fields are present, and that configuration parameters fall within acceptable ranges. This validation step prevents malformed or contradictory inputs from propagating into the generation pipeline. Alongside orchestration and validation, the middleware manages Persistence: trained models, structural metadata, and generation histories are stored and versioned here, ensuring that every artifact produced by GENESIS remains retrievable and reproducible across sessions and facilities.
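The coherence check can be pictured as a small validator that collects every structural problem before the payload moves downstream. The rules and field names below are illustrative assumptions; the actual GENESIS validation is more extensive:

```python
# Minimal sketch of an input coherence check: required fields present,
# enumerations valid, and numeric parameters within acceptable ranges.
# Field names and bounds are hypothetical, not the real GENESIS rules.
def check_coherence(payload):
    errors = []
    for field in ("job", "data_type"):
        if field not in payload:
            errors.append(f"missing required field: {field}")
    if payload.get("data_type") not in (None, "tabular", "time_series"):
        errors.append("data_type must be 'tabular' or 'time_series'")
    rows = payload.get("rows_requested", 1)
    if not (1 <= rows <= 1_000_000):
        errors.append("rows_requested out of range")
    return errors

# A malformed request is rejected with every problem reported at once.
print(check_coherence({"job": "generate", "rows_requested": 0}))
```

Collecting all errors in one pass, rather than failing on the first, gives callers a complete picture of what to fix before resubmitting.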

Layer 3: Generator - Server and Core Library

The innermost layer is where synthetic data is actually produced. It is composed of two tightly integrated components: a Generator Server that handles the execution lifecycle of each generation or training job, and a Core Library that implements the full suite of generative AI capabilities. The core library encapsulates the CTGAN and VAE model architectures, the adaptive training logic, the preprocessing and post-processing pipelines, the custom function execution engine, and the statistical evaluation suite. When a job arrives from the middleware, the server instantiates the appropriate core library components, runs the requested operation, and returns the results upward through the stack. Because the core library is encapsulated as a standalone component, it can be updated, replaced, or extended independently of the rest of the system, making it straightforward to integrate new generative architectures as the field evolves.

Get in Touch

Contact & Resources

📧

Email

For inquiries and technical support

angelo.marguglio@eng.it
mattiagiuseppe.marzano@eng.it
๐Ÿข

Organization

Engineering Ingegneria Informatica S.p.A.

Research & Innovation

💻

Source Code

Explore the codebase and documentation

⚡ View on GitHub