Dataset Import Guide
Overview
This guide addresses GitHub Issue #116, “Add functionality to pick up existing datasets”, by describing two approaches for working with existing simulation data in EasyVVUQ, plus a hybrid that combines them:
Campaign Creation: Import data to create a full EasyVVUQ campaign
Analysis-Only: Analyze a DataFrame directly, without campaign overhead
Hybrid: Create a campaign for organization, then extract its DataFrame for custom analysis
Quick Start: Choose Your Approach
Decision Matrix
| Your Goal | Recommended Approach |
|---|---|
| Quick analysis of existing results | Analysis-Only |
| Building UQ workflow | Campaign Creation |
| Need both flexibility and organization | Hybrid |
| Maximum performance | Analysis-Only |
| Collaborative project | Campaign Creation |
| Custom analysis + future expansion | Hybrid |
Approach 1: Campaign Creation + Analysis
When to Use
Building ongoing UQ workflows
Need to add more runs or resample
Want full parameter management
Using EasyVVUQ sampling methods
Collaborative projects (shared database)
Basic Example
import easyvvuq as uq
from easyvvuq.utils import dataset_importer
# Method 1: From directory structure
campaign = dataset_importer.create_campaign_from_directory(
    root_dir="/path/to/simulation/data",
    campaign_name="my_campaign",
    work_dir="/path/to/work/dir"
)
# Method 2: From file lists (the two lists must be in matching run order)
input_files = ["run1/input.json", "run2/input.json"]
output_files = ["run1/output.csv", "run2/output.csv"]
campaign = dataset_importer.create_campaign_from_files(
    input_files=input_files,
    output_files=output_files,
    campaign_name="my_campaign",
    work_dir="/path/to/work/dir"
)
# Method 3: Campaign class method
campaign = uq.Campaign.from_existing_data(
    name="my_campaign",
    input_files=input_files,
    output_files=output_files,
    work_dir="/path/to/work/dir"
)
Advanced Features
import numpy as np
import easyvvuq as uq
# Get collated results
df = campaign.get_collation_result()
# Apply EasyVVUQ analysis
analysis = uq.analysis.EnsembleBoot(
    qoi_cols=[('output_column', 0)],
    stat_func=np.mean
)
campaign.apply_analysis(analysis)
results = campaign.get_last_analysis()
# Extend campaign with more runs
campaign.add_external_runs(new_runs)
campaign.draw_samples(num_samples=50)
campaign.execute().collate()
Benefits
Full parameter space management
Can add more runs and resample
Database storage and retrieval
Run status tracking
Integration with EasyVVUQ sampling methods
Suitable for ongoing UQ workflows
Considerations
Higher memory usage
Campaign database overhead
Requires parameter definitions
Approach 2: Analysis-Only (Direct DataFrame Analysis)
When to Use
Quick analysis tasks
Custom analysis methods
Integration with other tools
Maximum performance needed
Exploratory data analysis
Basic Example
import pandas as pd
import numpy as np
import easyvvuq as uq
# Build a DataFrame of results (synthetic data here; load your own in practice)
df = pd.DataFrame({
    ('run_id', 0): range(100),
    ('x1', 0): np.random.uniform(0, 1, 100),
    ('x2', 0): np.random.uniform(0, 1, 100),
    ('output', 0): np.random.normal(0, 1, 100)
})
# Method 1: Direct pandas/numpy analysis
mean_output = np.mean(df[('output', 0)])
std_output = np.std(df[('output', 0)])
correlation = np.corrcoef(df[('x1', 0)], df[('output', 0)])[0, 1]
# Method 2: Use EasyVVUQ analysis classes directly
ensemble_analysis = uq.analysis.EnsembleBoot(
    qoi_cols=[('output', 0)],
    stat_func=np.mean
)
results = ensemble_analysis.analyse(df)
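For readers unfamiliar with the bootstrap, the resampling idea behind a bootstrap estimate of the mean can be sketched in plain NumPy. This is an illustrative sketch only, not EasyVVUQ's implementation; the function name `bootstrap_mean_ci` and its parameters are hypothetical.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative only)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and record each resample's mean.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return values.mean(), low, high

samples = np.random.default_rng(42).normal(0.0, 1.0, 100)
mean, low, high = bootstrap_mean_ci(samples)
print(f"mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The interval width gives a rough sense of how well 100 runs pin down the mean output; more runs or a less noisy output should narrow it.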
Advanced Custom Analysis
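# Custom sensitivity measure: difference in mean output between runs with the
# parameter above its median and runs at or below it (a crude proxy, not the
# Morris elementary-effects method, despite the function name)
def elementary_effects(df, param_col, output_col):
    median_val = df[param_col].median()
    high_group = df[df[param_col] > median_val][output_col]
    low_group = df[df[param_col] <= median_val][output_col]
    return high_group.mean() - low_group.mean()
effect_x1 = elementary_effects(df, ('x1', 0), ('output', 0))
# Statistical tests: compare outputs for high vs. low x1
from scipy import stats
median_x1 = df[('x1', 0)].median()
group1 = df[df[('x1', 0)] > median_x1][('output', 0)]
group2 = df[df[('x1', 0)] <= median_x1][('output', 0)]
t_stat, p_value = stats.ttest_ind(group1, group2)
# Machine learning: linear fit of the output on both parameters
from sklearn.linear_model import LinearRegression
X = df[[('x1', 0), ('x2', 0)]].to_numpy()
y = df[('output', 0)].to_numpy()
model = LinearRegression()
model.fit(X, y)
coefficients = model.coef_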
Benefits
No campaign overhead - maximum performance
Direct DataFrame manipulation
Custom analysis methods easy to implement
Works with any DataFrame structure
Easy integration with pandas, numpy, scipy, sklearn
Suitable for quick analysis and exploration
Considerations
No parameter management
Can’t easily add more runs
Manual data organization
Approach 3: Hybrid (Best of Both Worlds)
When to Use
Complex workflows
Need both organization and flexibility
Production UQ pipelines
Mixed analysis requirements
Example
# Create campaign for organization
campaign = dataset_importer.create_campaign_from_files(
    input_files=input_files,
    output_files=output_files,
    campaign_name="hybrid_campaign"
)
# Extract DataFrame for custom analysis
df = campaign.get_collation_result()
# Custom analysis
custom_results = my_custom_analysis(df)
# Campaign analysis
campaign.apply_analysis(uq.analysis.EnsembleBoot(qoi_cols=[('output', 0)]))
campaign_results = campaign.get_last_analysis()
# Still can extend campaign
campaign.add_external_runs(new_runs)
Benefits
Organized campaign management
Custom analysis flexibility
Can combine EasyVVUQ tools with custom methods
Extensible for future work
Best choice for complex workflows
Data Format Requirements
Campaign Creation Approach
Requires both input parameters AND output results:
Input Files (JSON, YAML, or CSV):
{
    "parameter1": 1.0,
    "parameter2": 2.0,
    "parameter3": "string_value"
}
Output Files (CSV, JSON, or YAML):
time,temperature,pressure
0,300.0,1.0
1,301.0,1.01
2,302.0,1.02
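To make the two formats concrete, the following sketch parses one run's input and output with the Python standard library and joins them into flat records. It only illustrates the data shapes shown above; it is not how the importer itself reads files, and the variable names are hypothetical.

```python
import csv
import io
import json

# In-memory stand-ins for one run's input.json and output.csv.
input_text = '{"parameter1": 1.0, "parameter2": 2.0, "parameter3": "string_value"}'
output_text = "time,temperature,pressure\n0,300.0,1.0\n1,301.0,1.01\n2,302.0,1.02\n"

params = json.loads(input_text)
rows = [
    {key: float(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(output_text))
]

# One record per output row, with the run's parameters repeated on each row.
records = [{**params, **row} for row in rows]
print(records[0])
```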
Analysis-Only Approach
Works with any pandas DataFrame structure:
# Flexible DataFrame format
df = pd.DataFrame({
    'param1': [1.0, 2.0, 3.0],
    'param2': [10.0, 20.0, 30.0],
    'output1': [100.0, 150.0, 200.0],
    'output2': [0.1, 0.2, 0.3]
})
# Or EasyVVUQ multi-index format
df = pd.DataFrame({
    ('param1', 0): [1.0, 2.0, 3.0],
    ('param2', 0): [10.0, 20.0, 30.0],
    ('output1', 0): [100.0, 150.0, 200.0]
})
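If your data is in the flat format but an analysis expects tuple-style column labels, one way to convert is to relabel the columns. This is a minimal pandas sketch that assumes every column holds a scalar quantity, so the second level is always 0.

```python
import pandas as pd

flat = pd.DataFrame({
    "param1": [1.0, 2.0, 3.0],
    "output1": [100.0, 150.0, 200.0],
})

# Relabel each column "name" as the tuple ("name", 0).
multi = flat.copy()
multi.columns = pd.MultiIndex.from_tuples([(name, 0) for name in flat.columns])

print(multi[("output1", 0)].mean())
```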
Directory Structure Support
Supported Structures
Run Directories:
data/
├── run_001/
│ ├── input.json
│ └── output.csv
├── run_002/
│ ├── input.json
│ └── output.csv
└── ...
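For the run-directory layout above, matching input and output file lists can be collected with `pathlib`. The helper `find_run_files` below is a hypothetical sketch, not part of `dataset_importer`; it assumes every complete run holds one `input.json` and one `output.csv`.

```python
import tempfile
from pathlib import Path

def find_run_files(root_dir, input_name="input.json", output_name="output.csv"):
    """Return sorted, matching lists of input and output paths under root_dir."""
    input_files, output_files = [], []
    for run_dir in sorted(p for p in Path(root_dir).iterdir() if p.is_dir()):
        inp, out = run_dir / input_name, run_dir / output_name
        # Skip incomplete runs so the two lists stay aligned.
        if inp.is_file() and out.is_file():
            input_files.append(str(inp))
            output_files.append(str(out))
    return input_files, output_files

# Demo on a throwaway directory tree that mimics the layout above.
with tempfile.TemporaryDirectory() as root:
    for name in ("run_001", "run_002"):
        run_dir = Path(root) / name
        run_dir.mkdir()
        (run_dir / "input.json").write_text('{"parameter1": 1.0}')
        (run_dir / "output.csv").write_text("time,temperature\n0,300.0\n")
    (Path(root) / "run_003").mkdir()  # incomplete run: no files
    input_files, output_files = find_run_files(root)
    print(len(input_files), len(output_files))
```

Skipping incomplete runs silently is a design choice for this sketch; a stricter version could raise an error instead.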
File Lists:
data/
├── inputs/
│ ├── params_001.json
│ ├── params_002.json
│ └── ...
└── outputs/
├── results_001.csv
├── results_002.csv
└── ...
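With separate `inputs/` and `outputs/` folders, the two file lists need to be in the same run order. One way to guarantee this is to pair files by their trailing run number, as in the hypothetical helper `pair_by_index` sketched below (standard library only; the naming pattern is assumed from the layout above).

```python
import re

def pair_by_index(input_names, output_names):
    """Pair file names that share the same trailing run number."""
    def run_index(name):
        match = re.search(r"(\d+)\.\w+$", name)
        if match is None:
            raise ValueError(f"No run number found in {name!r}")
        return int(match.group(1))

    outputs = {run_index(name): name for name in output_names}
    pairs = []
    for name in sorted(input_names, key=run_index):
        idx = run_index(name)
        if idx not in outputs:
            raise ValueError(f"No output file matches {name!r}")
        pairs.append((name, outputs[idx]))
    return pairs

pairs = pair_by_index(
    ["inputs/params_002.json", "inputs/params_001.json"],
    ["outputs/results_001.csv", "outputs/results_002.csv"],
)
input_files = [inp for inp, _ in pairs]
output_files = [out for _, out in pairs]
```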
File Format Support
Input Files
JSON: Parameter definitions
YAML: Configuration files
CSV: Tabular parameter data
Output Files
CSV: Simulation results (most common)
JSON: Structured output data
YAML: Configuration outputs
Analysis Classes Compatibility
Works with Both Approaches
EnsembleBoot: Bootstrap analysis
BasicStats: Basic statistics
Requires Campaign (with proper samplers)
PCEAnalysis: Polynomial Chaos Expansion
SCAnalysis: Stochastic Collocation
QMCAnalysis: Quasi-Monte Carlo
Analysis-Only Compatible
Direct pandas/numpy operations
scipy.stats functions
sklearn models
Custom analysis functions
Performance Comparison
| Aspect | Campaign Creation | Analysis-Only | Hybrid |
|---|---|---|---|
| Memory Usage | Higher | Lower | Medium |
| Setup Time | Longer | Shorter | Medium |
| Analysis Speed | Medium | Fastest | Medium |
| Extensibility | High | Low | High |
| Flexibility | Medium | High | High |
Error Handling
Campaign Creation
try:
    campaign = dataset_importer.create_campaign_from_directory(
        root_dir="/path/to/data",
        campaign_name="my_campaign"
    )
except FileNotFoundError:
    print("Data directory not found")
except ValueError as e:
    print(f"Invalid data format: {e}")
Analysis-Only
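# Load outside the try block so the fallback below never sees an undefined df
df = pd.read_csv("results.csv")
try:
    # pd.read_csv produces flat column names such as 'output'
    analysis = uq.analysis.EnsembleBoot(qoi_cols=['output'])
    results = analysis.analyse(df)
except Exception as e:
    print(f"Analysis failed: {e}")
    # Fallback to direct pandas analysis
    mean_val = df['output'].mean()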
Code Examples
The examples in this guide can be copied and adapted to your own use case. The test suite in tests/test_dataset_importer.py provides further practical examples of the functionality.
Best Practices
Campaign Creation
Use consistent parameter names across runs
Ensure all required parameters are present
Use appropriate file formats (JSON for parameters, CSV for results)
Test with small datasets first
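A quick pre-import check can catch inconsistent parameter names before a campaign is created. The sketch below uses only the standard library; `check_parameter_names` is a hypothetical helper that treats the first run as the reference.

```python
def check_parameter_names(runs):
    """Return {run_index: (missing, extra)} for runs whose keys differ from run 0."""
    if not runs:
        return {}
    expected = set(runs[0])
    problems = {}
    for i, params in enumerate(runs):
        missing, extra = expected - set(params), set(params) - expected
        if missing or extra:
            problems[i] = (sorted(missing), sorted(extra))
    return problems

runs = [
    {"parameter1": 1.0, "parameter2": 2.0},
    {"parameter1": 1.5, "parameter2": 2.5},
    {"parameter1": 2.0, "param2": 3.0},  # inconsistent name
]
problems = check_parameter_names(runs)
print(problems)
```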
Analysis-Only
Use pandas DataFrame best practices
Handle missing data appropriately
Use vectorized operations for performance
Consider memory usage for large datasets
Hybrid
Start with campaign creation for organization
Extract DataFrame for custom analysis
Use campaign features for extension
Document both approaches in your workflow
Migration from Existing Workflows
From Manual Analysis
# Before: Manual file loading
for file in files:
    data = pd.read_csv(file)
    # manual analysis...
# After: Analysis-Only approach
# (load_all_data_to_dataframe is a placeholder for your own loading code)
df = load_all_data_to_dataframe(files)
results = uq.analysis.EnsembleBoot(qoi_cols=['output']).analyse(df)
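The `load_all_data_to_dataframe` helper used above is not defined in this guide. One possible sketch, assuming each file is a CSV with the same columns, stacks the files with pandas and tags every row with a `run_id` (a column name chosen here for illustration).

```python
import io
import pandas as pd

def load_all_data_to_dataframe(files):
    """Read each CSV source and stack the rows, tagging each with a run_id."""
    frames = []
    for run_id, source in enumerate(files):
        frame = pd.read_csv(source)
        frame.insert(0, "run_id", run_id)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# In-memory stand-ins for CSV files on disk; pd.read_csv also accepts paths.
files = [io.StringIO("output\n1.0\n2.0\n"), io.StringIO("output\n3.0\n")]
df = load_all_data_to_dataframe(files)
print(df)
```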
From Other UQ Tools
# Convert existing data to EasyVVUQ format
df_easyvvuq = convert_to_easyvvuq_format(external_data)
campaign = dataset_importer.create_campaign_from_dataframe(df_easyvvuq)
Troubleshooting
Common Issues
Campaign Creation:
File not found: Check file paths and permissions
Parameter mismatch: Ensure consistent parameter names
Invalid format: Verify JSON/CSV syntax
Analysis-Only:
Column not found: Check DataFrame column names
Analysis failed: Verify data types and ranges
Memory error: Consider chunking large datasets
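As an example of chunking, the mean of a column can be accumulated without loading the whole file, using the `chunksize` argument of `pd.read_csv`. This is a minimal sketch; the column name `output` and the chunk size are assumptions.

```python
import io
import pandas as pd

# In-memory stand-in for a large results file; pass a path in practice.
csv_source = io.StringIO("output\n" + "\n".join(str(float(i)) for i in range(10)))

total, count = 0.0, 0
# chunksize makes read_csv yield DataFrames of at most that many rows.
for chunk in pd.read_csv(csv_source, chunksize=4):
    total += chunk["output"].sum()
    count += len(chunk)

mean_output = total / count
print(mean_output)
```

Statistics that cannot be accumulated this way, such as exact medians, need a different strategy.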
Getting Help
Check the error message carefully
Review the demo scripts for examples
Check the EasyVVUQ documentation
Use the test suite for reference implementations