Dataset Import Guide

Overview

This guide addresses GitHub Issue #116, “Add functionality to pick up existing datasets”, by describing two approaches for working with existing simulation data in EasyVVUQ, plus a hybrid that combines them:

  1. Campaign Creation: Import data to create full EasyVVUQ campaigns

  2. Analysis-Only: Direct analysis without campaign overhead

  3. Hybrid: Create a campaign for organization, then extract its DataFrame for custom analysis

Quick Start: Choose Your Approach

Decision Matrix

Your Goal                                 Recommended Approach
Quick analysis of existing results        Analysis-Only
Building UQ workflow                      Campaign Creation
Need both flexibility and organization    Hybrid
Maximum performance                       Analysis-Only
Collaborative project                     Campaign Creation
Custom analysis + future expansion        Hybrid

Approach 1: Campaign Creation + Analysis

When to Use

  • Building ongoing UQ workflows

  • Need to add more runs or resample

  • Want full parameter management

  • Using EasyVVUQ sampling methods

  • Collaborative projects (shared database)

Basic Example

import easyvvuq as uq
from easyvvuq.utils import dataset_importer

# Paired lists: input_files[i] and output_files[i] belong to the same run
input_files = ["run1/input.json", "run2/input.json"]
output_files = ["run1/output.csv", "run2/output.csv"]

# Method 1: From directory structure
campaign = dataset_importer.create_campaign_from_directory(
    root_dir="/path/to/simulation/data",
    campaign_name="my_campaign",
    work_dir="/path/to/work/dir"
)

# Method 2: From file lists
campaign = dataset_importer.create_campaign_from_files(
    input_files=input_files,
    output_files=output_files,
    campaign_name="my_campaign",
    work_dir="/path/to/work/dir"
)

# Method 3: Campaign class method
campaign = uq.Campaign.from_existing_data(
    name="my_campaign",
    input_files=input_files,
    output_files=output_files,
    work_dir="/path/to/work/dir"
)

Advanced Features

import numpy as np
import easyvvuq as uq

# Get collated results
df = campaign.get_collation_result()

# Apply EasyVVUQ analysis
analysis = uq.analysis.EnsembleBoot(
    qoi_cols=[('output_column', 0)],
    stat_func=np.mean
)
campaign.apply_analysis(analysis)
results = campaign.get_last_analysis()

# Extend campaign with more runs
# (new_runs is a placeholder for the additional run data you want to import)
campaign.add_external_runs(new_runs)
# Drawing new samples assumes a sampler has been set on the campaign
campaign.draw_samples(num_samples=50)
campaign.execute().collate()

Benefits

  • Full parameter space management

  • Can add more runs and resample

  • Database storage and retrieval

  • Run status tracking

  • Integration with EasyVVUQ sampling methods

  • Suitable for ongoing UQ workflows

Considerations

  • Higher memory usage

  • Campaign database overhead

  • Requires parameter definitions
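
The last point means every imported parameter needs an entry describing at least its type and default value. A minimal sketch of deriving such entries from imported input dictionaries is shown below. The helper `infer_params` is hypothetical (it is not part of `dataset_importer`), and both the type names and the choice of the first run's value as the default are assumptions for illustration.

```python
def infer_params(runs):
    """Build a params-style dict from a list of per-run input dicts.

    Hypothetical helper: the type names and the use of the first run's
    value as the default are assumptions, not EasyVVUQ behaviour.
    """
    type_names = {bool: "boolean", int: "integer", float: "float", str: "string"}
    params = {}
    for name, value in runs[0].items():
        params[name] = {"type": type_names[type(value)], "default": value}
    return params


runs = [
    {"parameter1": 1.0, "parameter2": 2.0, "parameter3": "string_value"},
    {"parameter1": 1.5, "parameter2": 2.5, "parameter3": "string_value"},
]
params = infer_params(runs)
print(params)
```

Inferred definitions are a starting point only; ranges and defaults usually need to be reviewed by hand.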

Approach 2: Analysis-Only (Direct DataFrame Analysis)

When to Use

  • Quick analysis tasks

  • Custom analysis methods

  • Integration with other tools

  • Maximum performance needed

  • Exploratory data analysis

Basic Example

import pandas as pd
import numpy as np
import easyvvuq as uq

# Load your data directly into DataFrame
df = pd.DataFrame({
    ('run_id', 0): range(100),
    ('x1', 0): np.random.uniform(0, 1, 100),
    ('x2', 0): np.random.uniform(0, 1, 100),
    ('output', 0): np.random.normal(0, 1, 100)
})

# Method 1: Direct pandas/numpy analysis
mean_output = np.mean(df[('output', 0)])
std_output = np.std(df[('output', 0)])
correlation = np.corrcoef(df[('x1', 0)], df[('output', 0)])[0, 1]

# Method 2: Use EasyVVUQ analysis classes directly
ensemble_analysis = uq.analysis.EnsembleBoot(
    qoi_cols=[('output', 0)],
    stat_func=np.mean
)
results = ensemble_analysis.analyse(df)

Advanced Custom Analysis

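# Custom sensitivity measure: difference in mean output between runs above
# and below the parameter's median (a simple median split, not the Morris
# elementary-effects method, despite the function name)
def elementary_effects(df, param_col, output_col):
    median_val = df[param_col].median()
    high_group = df[df[param_col] > median_val][output_col]
    low_group = df[df[param_col] <= median_val][output_col]
    return high_group.mean() - low_group.mean()

# Statistical tests: compare outputs for high vs. low x1 (df from Basic Example)
from scipy import stats
high_x1 = df[('x1', 0)] > df[('x1', 0)].median()
group1 = df[('output', 0)][high_x1]
group2 = df[('output', 0)][~high_x1]
t_stat, p_value = stats.ttest_ind(group1, group2)

# Machine learning: linear fit of the output on both parameters
from sklearn.linear_model import LinearRegression
X = df[[('x1', 0), ('x2', 0)]].to_numpy()
y = df[('output', 0)].to_numpy()
model = LinearRegression()
model.fit(X, y)
coefficients = model.coef_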

Benefits

  • No campaign database to set up or query, so setup and analysis are faster

  • Direct DataFrame manipulation

  • Custom analysis methods easy to implement

  • Works with any DataFrame structure

  • Easy integration with pandas, numpy, scipy, sklearn

  • Suitable for quick analysis and exploration

Considerations

  • No parameter management

  • Can’t easily add more runs

  • Manual data organization

Approach 3: Hybrid (Best of Both Worlds)

When to Use

  • Complex workflows

  • Need both organization and flexibility

  • Production UQ pipelines

  • Mixed analysis requirements

Example

import easyvvuq as uq
from easyvvuq.utils import dataset_importer

# Create campaign for organization
# (input_files and output_files are paired lists of paths, one entry per run)
campaign = dataset_importer.create_campaign_from_files(
    input_files=input_files,
    output_files=output_files,
    campaign_name="hybrid_campaign"
)

# Extract DataFrame for custom analysis
df = campaign.get_collation_result()

# Custom analysis (my_custom_analysis is a placeholder for your own function)
custom_results = my_custom_analysis(df)

# Campaign analysis
campaign.apply_analysis(uq.analysis.EnsembleBoot(qoi_cols=[('output', 0)]))
campaign_results = campaign.get_last_analysis()

# Still can extend campaign (new_runs is a placeholder for the extra run data)
campaign.add_external_runs(new_runs)

Benefits

  • Organized campaign management

  • Custom analysis flexibility

  • Can combine EasyVVUQ tools with custom methods

  • Extensible for future work

  • Best choice for complex workflows

Data Format Requirements

Campaign Creation Approach

Requires both input parameters AND output results:

Input Files (JSON, YAML, or CSV):

{
  "parameter1": 1.0,
  "parameter2": 2.0,
  "parameter3": "string_value"
}

Output Files (CSV, JSON, or YAML):

time,temperature,pressure
0,300.0,1.0
1,301.0,1.01
2,302.0,1.02
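
To make the pairing of inputs and outputs concrete, the sketch below attaches one run's parameters to every row of its output table using plain pandas. The combined layout is an assumption chosen for illustration; the table produced by a campaign's collation step may be organized differently.

```python
import io
import json

import pandas as pd

# Contents of one run's input and output files, inlined for the example.
input_text = '{"parameter1": 1.0, "parameter2": 2.0}'
output_text = "time,temperature,pressure\n0,300.0,1.0\n1,301.0,1.01\n2,302.0,1.02\n"

params = json.loads(input_text)
outputs = pd.read_csv(io.StringIO(output_text))

# Attach the run's parameters to every output row (scalars are broadcast).
run_table = outputs.assign(**params)
print(run_table)
```

Repeating this per run and concatenating the tables gives a single DataFrame in which each row carries both the parameters and the results.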

Analysis-Only Approach

Works with any pandas DataFrame structure:

# Flexible DataFrame format
df = pd.DataFrame({
    'param1': [1.0, 2.0, 3.0],
    'param2': [10.0, 20.0, 30.0],
    'output1': [100.0, 150.0, 200.0],
    'output2': [0.1, 0.2, 0.3]
})

# Or EasyVVUQ multi-index format
df = pd.DataFrame({
    ('param1', 0): [1.0, 2.0, 3.0],
    ('param2', 0): [10.0, 20.0, 30.0],
    ('output1', 0): [100.0, 150.0, 200.0]
})
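
The two layouts differ only in their column labels, so converting between them is a relabelling step. The sketch below uses standard pandas calls; treating every column as a scalar with index 0 is an assumption that holds for the examples in this guide but not for vector-valued outputs.

```python
import pandas as pd

flat = pd.DataFrame({
    "param1": [1.0, 2.0, 3.0],
    "output1": [100.0, 150.0, 200.0],
})

# Relabel each column "name" as ("name", 0).
multi = flat.copy()
multi.columns = pd.MultiIndex.from_tuples([(name, 0) for name in flat.columns])

# And back again, keeping only the first level of the column index.
restored = multi.copy()
restored.columns = multi.columns.get_level_values(0)

print(list(multi.columns))
```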

Directory Structure Support

Supported Structures

Run Directories:

data/
├── run_001/
│   ├── input.json
│   └── output.csv
├── run_002/
│   ├── input.json
│   └── output.csv
└── ...
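
A layout like this can be scanned with the standard library alone. The sketch below is not the importer's implementation; `find_run_files` is a hypothetical helper, and the default file names simply mirror the tree above.

```python
import tempfile
from pathlib import Path


def find_run_files(root_dir, input_name="input.json", output_name="output.csv"):
    """Return sorted (input, output) path pairs for complete run directories."""
    pairs = []
    for run_dir in sorted(p for p in Path(root_dir).iterdir() if p.is_dir()):
        input_file, output_file = run_dir / input_name, run_dir / output_name
        if input_file.is_file() and output_file.is_file():
            pairs.append((input_file, output_file))
    return pairs


# Build a small example tree in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("run_001", "run_002"):
        run_dir = Path(tmp) / name
        run_dir.mkdir()
        (run_dir / "input.json").write_text("{}")
        (run_dir / "output.csv").write_text("time\n0\n")
    (Path(tmp) / "run_003").mkdir()  # incomplete run: no files, so it is skipped
    pairs = find_run_files(tmp)
    run_names = [input_file.parent.name for input_file, _ in pairs]

print(run_names)
```

Skipping incomplete directories silently is a design choice; a stricter variant could raise an error instead.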

File Lists:

data/
├── inputs/
│   ├── params_001.json
│   ├── params_002.json
│   └── ...
└── outputs/
    ├── results_001.csv
    ├── results_002.csv
    └── ...
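
With separate input and output folders, files have to be matched by run number rather than by directory. The sketch below shows one way to do that, assuming the number is the last group of digits before the extension; `pair_by_index` is a hypothetical helper, not part of the importer.

```python
import re


def pair_by_index(input_names, output_names):
    """Pair file names that share the same trailing run number."""
    def index_of(name):
        match = re.search(r"(\d+)\.\w+$", name)
        if match is None:
            raise ValueError(f"no run number in {name!r}")
        return int(match.group(1))

    outputs = {index_of(name): name for name in output_names}
    # Keep only inputs that have a matching output, ordered by run number.
    return [(name, outputs[index_of(name)])
            for name in sorted(input_names, key=index_of)
            if index_of(name) in outputs]


pairs = pair_by_index(
    ["params_002.json", "params_001.json"],
    ["results_001.csv", "results_002.csv", "results_003.csv"],
)
print(pairs)
```

The resulting pairs can be split into the two parallel lists that the file-list import method expects.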

File Format Support

Input Files

  • JSON: Parameter definitions

  • YAML: Configuration files

  • CSV: Tabular parameter data

Output Files

  • CSV: Simulation results (most common)

  • JSON: Structured output data

  • YAML: Configuration outputs
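
Supporting several formats usually comes down to dispatching on the file extension. The sketch below illustrates the idea; `load_file` is a hypothetical helper, and the YAML branch assumes the third-party PyYAML package is installed.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_file(path):
    """Load a JSON, YAML, or CSV file into plain Python objects."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        return json.loads(path.read_text())
    if suffix in (".yaml", ".yml"):
        import yaml  # PyYAML, imported lazily so JSON/CSV work without it
        return yaml.safe_load(path.read_text())
    if suffix == ".csv":
        with open(path, newline="") as f:
            return list(csv.DictReader(f))  # values are returned as strings
    raise ValueError(f"Unsupported file format: {suffix}")


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "input.json"
    json_path.write_text('{"parameter1": 1.0}')
    csv_path = Path(tmp) / "output.csv"
    csv_path.write_text("time,temperature\n0,300.0\n")
    params = load_file(json_path)
    rows = load_file(csv_path)

print(params, rows)
```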

Analysis Classes Compatibility

Works with Both Approaches

  • EnsembleBoot: Bootstrap analysis

  • BasicStats: Basic statistics
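
For readers unfamiliar with bootstrap analysis, the sketch below shows the underlying idea in plain NumPy: resample the data with replacement, recompute the statistic each time, and read a confidence interval off the resulting distribution. It is an illustration only, not EnsembleBoot's implementation, and `bootstrap_mean_ci` is a hypothetical helper.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and take the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return values.mean(), low, high


rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=100)
mean, low, high = bootstrap_mean_ci(sample)
print(f"mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```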

Requires Campaign (with proper samplers)

  • PCEAnalysis: Polynomial Chaos Expansion

  • SCAnalysis: Stochastic Collocation

  • QMCAnalysis: Quasi-Monte Carlo

Analysis-Only Compatible

  • Direct pandas/numpy operations

  • scipy.stats functions

  • sklearn models

  • Custom analysis functions

Performance Comparison

Aspect           Campaign Creation   Analysis-Only   Hybrid
Memory Usage     Higher              Lower           Medium
Setup Time       Longer              Shorter         Medium
Analysis Speed   Medium              Fastest         Medium
Extensibility    High                Low             High
Flexibility      Medium              High            High

Error Handling

Campaign Creation

try:
    campaign = dataset_importer.create_campaign_from_directory(
        root_dir="/path/to/data",
        campaign_name="my_campaign"
    )
except FileNotFoundError:
    print("Data directory not found")
except ValueError as e:
    print(f"Invalid data format: {e}")

Analysis-Only

# Load outside the try block so df still exists if the analysis step fails
df = pd.read_csv("results.csv")  # flat column names, e.g. "output"

try:
    analysis = uq.analysis.EnsembleBoot(qoi_cols=['output'])
    results = analysis.analyse(df)
except Exception as e:
    print(f"Analysis failed: {e}")
    # Fallback to direct pandas analysis
    mean_val = df['output'].mean()

Code Examples

The examples in this guide can be copied and adapted to your own data. The test suite in tests/test_dataset_importer.py contains further working examples of the import functionality.

Best Practices

Campaign Creation

  • Use consistent parameter names across runs

  • Ensure all required parameters are present

  • Use appropriate file formats (JSON for parameters, CSV for results)

  • Test with small datasets first
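
The first two points can be checked before any import is attempted. The sketch below compares each run's parameter names against the first run; `check_parameter_names` is a hypothetical helper written for this guide.

```python
def check_parameter_names(runs):
    """Return {run index: (missing names, unexpected names)} relative to run 0."""
    expected = set(runs[0])
    problems = {}
    for i, run in enumerate(runs):
        missing, extra = expected - set(run), set(run) - expected
        if missing or extra:
            problems[i] = (sorted(missing), sorted(extra))
    return problems


runs = [
    {"parameter1": 1.0, "parameter2": 2.0},
    {"parameter1": 1.5, "parameter2": 2.5},
    {"parameter1": 2.0, "Parameter2": 3.0},  # inconsistent capitalisation
]
problems = check_parameter_names(runs)
print(problems)
```

An empty result means every run defines exactly the same parameters as the first one.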

Analysis-Only

  • Use pandas DataFrame best practices

  • Handle missing data appropriately

  • Use vectorized operations for performance

  • Consider memory usage for large datasets
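
A short pandas sketch of these practices, using made-up values and hypothetical column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": [0.1, 0.5, 0.9, 0.3],
    "output": [1.0, np.nan, 3.0, 2.0],
})

# Handle missing data explicitly: drop runs with no recorded output.
clean = df.dropna(subset=["output"])

# Vectorized: one column expression instead of a Python loop over rows.
clean = clean.assign(scaled=clean["output"] / clean["output"].max())

# Check memory use before loading much larger datasets the same way.
n_bytes = int(df.memory_usage(deep=True).sum())
print(clean, n_bytes)
```

Whether to drop or impute missing outputs depends on why they are missing; dropping is used here only because it is the simplest option.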

Hybrid

  • Start with campaign creation for organization

  • Extract DataFrame for custom analysis

  • Use campaign features for extension

  • Document both approaches in your workflow

Migration from Existing Workflows

From Manual Analysis

# Before: Manual file loading
for file in files:
    data = pd.read_csv(file)
    # manual analysis...

# After: Analysis-Only approach
# Stack all result files into one DataFrame, then analyse it in one call
df = pd.concat((pd.read_csv(file) for file in files), ignore_index=True)
results = uq.analysis.EnsembleBoot(qoi_cols=['output']).analyse(df)

From Other UQ Tools

# Convert existing data to EasyVVUQ format
# (convert_to_easyvvuq_format is a placeholder for your own conversion code)
df_easyvvuq = convert_to_easyvvuq_format(external_data)
campaign = dataset_importer.create_campaign_from_dataframe(df_easyvvuq)

Troubleshooting

Common Issues

Campaign Creation:

  • File not found: Check file paths and permissions

  • Parameter mismatch: Ensure consistent parameter names

  • Invalid format: Verify JSON/CSV syntax

Analysis-Only:

  • Column not found: Check DataFrame column names

  • Analysis failed: Verify data types and ranges

  • Memory error: Consider chunking large datasets
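
Chunking works well for statistics that can be accumulated incrementally, such as a mean. The sketch below uses the `chunksize` argument of `pandas.read_csv`; the in-memory CSV and the column name `output` are stand-ins for a large results file.

```python
import io

import pandas as pd

# Ten rows of example data standing in for a file too large to load at once.
csv_text = "output\n" + "\n".join(str(float(i)) for i in range(10))

total, count = 0.0, 0
# chunksize makes read_csv yield DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["output"].sum()
    count += len(chunk)

mean_output = total / count
print(mean_output)
```

Statistics that need all values at once, such as a median or a bootstrap over the full sample, cannot be accumulated this way and need a different strategy.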

Getting Help

  1. Check the error message carefully

  2. Review the demo scripts for examples

  3. Check the EasyVVUQ documentation

  4. Use the test suite for reference implementations