.. _dataset_import:

Dataset Import Guide
====================

Overview
--------

This guide addresses **GitHub Issue #116: "Add functionality to pick up existing datasets"** by describing **two distinct approaches** for working with existing simulation data in EasyVVUQ, plus a hybrid that combines them:

1. **Campaign Creation**: Import data to create full EasyVVUQ campaigns
2. **Analysis-Only**: Direct analysis without campaign overhead
3. **Hybrid**: Create a campaign for organization, then extract its DataFrame for custom analysis

Quick Start: Choose Your Approach
---------------------------------

Decision Matrix
~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - **Your Goal**
     - **Recommended Approach**
   * - Quick analysis of existing results
     - **Analysis-Only**
   * - Building UQ workflow
     - **Campaign Creation**
   * - Need both flexibility and organization
     - **Hybrid**
   * - Maximum performance
     - **Analysis-Only**
   * - Collaborative project
     - **Campaign Creation**
   * - Custom analysis + future expansion
     - **Hybrid**

Approach 1: Campaign Creation + Analysis
----------------------------------------

When to Use
~~~~~~~~~~~

- Building ongoing UQ workflows
- Need to add more runs or resample
- Want full parameter management
- Using EasyVVUQ sampling methods
- Collaborative projects (shared database)

Basic Example
~~~~~~~~~~~~~

.. code-block:: python

    import easyvvuq as uq
    from easyvvuq.utils import dataset_importer

    # Method 1: From directory structure
    campaign = dataset_importer.create_campaign_from_directory(
        root_dir="/path/to/simulation/data",
        campaign_name="my_campaign",
        work_dir="/path/to/work/dir"
    )

    # Method 2: From file lists
    campaign = dataset_importer.create_campaign_from_files(
        input_files=["run1/input.json", "run2/input.json"],
        output_files=["run1/output.csv", "run2/output.csv"],
        campaign_name="my_campaign",
        work_dir="/path/to/work/dir"
    )

    # Method 3: Campaign class method
    # (input_files and output_files are paired lists, as in Method 2)
    campaign = uq.Campaign.from_existing_data(
        name="my_campaign",
        input_files=input_files,
        output_files=output_files,
        work_dir="/path/to/work/dir"
    )
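
The file-list methods above assume that ``input_files`` and ``output_files`` are already paired run by run. A minimal sketch of building such lists with the standard library follows. It assumes each run lives in its own ``run_*`` directory containing ``input.json`` and ``output.csv``; the helper name ``collect_run_files`` and these file names are illustrative assumptions, not part of EasyVVUQ.

.. code-block:: python

    from pathlib import Path


    def collect_run_files(root_dir, input_name="input.json", output_name="output.csv"):
        """Return paired (input_files, output_files) lists for complete runs.

        Hypothetical helper: run directories missing either file are skipped,
        so the two lists always have the same length and order.
        """
        input_files, output_files = [], []
        for run_dir in sorted(Path(root_dir).glob("run_*")):
            inp, out = run_dir / input_name, run_dir / output_name
            if inp.is_file() and out.is_file():
                input_files.append(str(inp))
                output_files.append(str(out))
        return input_files, output_files

Skipping incomplete runs keeps the lists aligned; whether to skip or to fail loudly is a design choice that depends on how much missing data is acceptable.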

Advanced Features
~~~~~~~~~~~~~~~~~

.. code-block:: python

    import numpy as np
    import easyvvuq as uq

    # Get collated results
    df = campaign.get_collation_result()

    # Apply EasyVVUQ analysis
    analysis = uq.analysis.EnsembleBoot(
        qoi_cols=[('output_column', 0)],
        stat_func=np.mean
    )
    campaign.apply_analysis(analysis)
    results = campaign.get_last_analysis()

    # Extend campaign with more runs
    # (new_runs is a placeholder for the additional externally produced runs)
    campaign.add_external_runs(new_runs)
    campaign.draw_samples(num_samples=50)
    campaign.execute().collate()

Benefits
~~~~~~~~

- Full parameter space management
- Can add more runs and resample
- Database storage and retrieval
- Run status tracking
- Integration with EasyVVUQ sampling methods
- Suitable for ongoing UQ workflows

Considerations
~~~~~~~~~~~~~~

- Higher memory usage
- Campaign database overhead
- Requires parameter definitions
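
Because a campaign needs parameter definitions, it can help to draft them from the imported input files. The sketch below assumes an EasyVVUQ-style ``params`` dictionary of the form ``{"name": {"type": ..., "default": ...}}``; the helper ``infer_params`` and the exact type names are assumptions for illustration, and the result should be reviewed by hand since inference from a few samples is only a heuristic.

.. code-block:: python

    def infer_params(records):
        """Draft a parameter-definition dict from a list of input dicts.

        Hypothetical helper: the type is taken from the first value seen and
        that value becomes the default. Mixed types raise a ValueError.
        """
        # Assumed type names; adjust them to what your EasyVVUQ version expects.
        type_names = {bool: "boolean", int: "integer", float: "float", str: "string"}
        params = {}
        for record in records:
            for name, value in record.items():
                type_name = type_names.get(type(value))
                if type_name is None:
                    raise ValueError(f"Unsupported type for {name!r}: {type(value).__name__}")
                if name not in params:
                    params[name] = {"type": type_name, "default": value}
                elif params[name]["type"] != type_name:
                    raise ValueError(f"Inconsistent type for parameter {name!r}")
        return params


    params = infer_params([
        {"parameter1": 1.0, "parameter2": 2.0, "parameter3": "string_value"},
        {"parameter1": 1.5, "parameter2": 2.5, "parameter3": "other_value"},
    ])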

Approach 2: Analysis-Only (Direct DataFrame Analysis)
------------------------------------------------------

When to Use
~~~~~~~~~~~

- Quick analysis tasks
- Custom analysis methods
- Integration with other tools
- Maximum performance needed
- Exploratory data analysis

Basic Example
~~~~~~~~~~~~~

.. code-block:: python

    import pandas as pd
    import numpy as np
    import easyvvuq as uq

    # Load your data directly into DataFrame
    df = pd.DataFrame({
        ('run_id', 0): range(100),
        ('x1', 0): np.random.uniform(0, 1, 100),
        ('x2', 0): np.random.uniform(0, 1, 100),
        ('output', 0): np.random.normal(0, 1, 100)
    })

    # Method 1: Direct pandas/numpy analysis
    mean_output = np.mean(df[('output', 0)])
    std_output = np.std(df[('output', 0)])
    correlation = np.corrcoef(df[('x1', 0)], df[('output', 0)])[0, 1]

    # Method 2: Use EasyVVUQ analysis classes directly
    ensemble_analysis = uq.analysis.EnsembleBoot(
        qoi_cols=[('output', 0)],
        stat_func=np.mean
    )
    results = ensemble_analysis.analyse(df)
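
For readers who want to see what a bootstrap estimate involves, here is a minimal NumPy-only sketch of a percentile bootstrap for the mean. It illustrates the general technique and is not EasyVVUQ's ``EnsembleBoot`` implementation; the function name ``bootstrap_mean_ci`` and its defaults are assumptions.

.. code-block:: python

    import numpy as np


    def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
        """Percentile-bootstrap confidence interval for the mean of `values`."""
        values = np.asarray(values, dtype=float)
        rng = np.random.default_rng(seed)
        # Resample with replacement: one row of indices per bootstrap replicate.
        idx = rng.integers(0, len(values), size=(n_boot, len(values)))
        boot_means = values[idx].mean(axis=1)
        low, high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return values.mean(), low, high


    rng = np.random.default_rng(42)
    sample = rng.normal(loc=0.0, scale=1.0, size=100)
    mean, low, high = bootstrap_mean_ci(sample)

The fixed ``seed`` makes the interval reproducible, which is convenient when comparing results between runs of the same script.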

Advanced Custom Analysis
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

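    # Uses `df` from the basic example above
    import numpy as np
    from scipy import stats
    from sklearn.linear_model import LinearRegression

    # Custom sensitivity analysis: difference in mean output between runs
    # above and at-or-below the parameter's median (a simple median-split
    # measure, not the Morris elementary-effects method)
    def elementary_effects(df, param_col, output_col):
        median_val = df[param_col].median()
        high_group = df[df[param_col] > median_val][output_col]
        low_group = df[df[param_col] <= median_val][output_col]
        return high_group.mean() - low_group.mean()

    # Statistical tests: do the two halves of the x1 range differ in output?
    median_x1 = df[('x1', 0)].median()
    group1 = df.loc[df[('x1', 0)] > median_x1, ('output', 0)]
    group2 = df.loc[df[('x1', 0)] <= median_x1, ('output', 0)]
    t_stat, p_value = stats.ttest_ind(group1, group2)

    # Machine learning: linear fit of the output on the input parameters
    X = df[[('x1', 0), ('x2', 0)]].to_numpy()
    y = df[('output', 0)].to_numpy()
    model = LinearRegression()
    model.fit(X, y)
    coefficients = model.coef_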

Benefits
~~~~~~~~

- No campaign overhead, so setup and analysis are faster
- Direct DataFrame manipulation
- Custom analysis methods easy to implement
- Works with any DataFrame structure
- Easy integration with pandas, numpy, scipy, sklearn
- Suitable for quick analysis and exploration

Considerations
~~~~~~~~~~~~~~

- No parameter management
- Can't easily add more runs
- Manual data organization

Approach 3: Hybrid (Best of Both Worlds)
-----------------------------------------

When to Use
~~~~~~~~~~~

- Complex workflows
- Need both organization and flexibility
- Production UQ pipelines
- Mixed analysis requirements

Example
~~~~~~~

.. code-block:: python

    import easyvvuq as uq
    from easyvvuq.utils import dataset_importer

    # Create campaign for organization
    campaign = dataset_importer.create_campaign_from_files(
        input_files=input_files,
        output_files=output_files,
        campaign_name="hybrid_campaign"
    )

    # Extract DataFrame for custom analysis
    df = campaign.get_collation_result()

    # Custom analysis (my_custom_analysis is a placeholder for your own function)
    custom_results = my_custom_analysis(df)

    # Campaign analysis
    campaign.apply_analysis(uq.analysis.EnsembleBoot(qoi_cols=[('output', 0)]))
    campaign_results = campaign.get_last_analysis()

    # Still can extend campaign
    campaign.add_external_runs(new_runs)

Benefits
~~~~~~~~

- Organized campaign management
- Custom analysis flexibility
- Can combine EasyVVUQ tools with custom methods
- Extensible for future work
- Best choice for complex workflows

Data Format Requirements
------------------------

Campaign Creation Approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Requires both input parameters AND output results:

**Input Files** (JSON, YAML, or CSV):

.. code-block:: json

    {
      "parameter1": 1.0,
      "parameter2": 2.0,
      "parameter3": "string_value"
    }

**Output Files** (CSV, JSON, or YAML):

.. code-block:: text

    time,temperature,pressure
    0,300.0,1.0
    1,301.0,1.01
    2,302.0,1.02
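
To make the two formats concrete, the following sketch parses one run (JSON parameters plus CSV results) with the standard library only. The helper ``read_run`` is hypothetical and assumes every CSV column is numeric.

.. code-block:: python

    import csv
    import io
    import json


    def read_run(input_text, output_text):
        """Parse one run: JSON parameters plus CSV results.

        Returns (params, rows). Every CSV field is converted to float, so a
        non-numeric column raises a ValueError.
        """
        params = json.loads(input_text)
        rows = [
            {key: float(value) for key, value in row.items()}
            for row in csv.DictReader(io.StringIO(output_text))
        ]
        return params, rows


    params, rows = read_run(
        '{"parameter1": 1.0, "parameter2": 2.0, "parameter3": "string_value"}',
        "time,temperature,pressure\n0,300.0,1.0\n1,301.0,1.01\n2,302.0,1.02\n",
    )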

Analysis-Only Approach
~~~~~~~~~~~~~~~~~~~~~~

Works with any pandas DataFrame structure:

.. code-block:: python

    import pandas as pd

    # Flexible DataFrame format
    df = pd.DataFrame({
        'param1': [1.0, 2.0, 3.0],
        'param2': [10.0, 20.0, 30.0],
        'output1': [100.0, 150.0, 200.0],
        'output2': [0.1, 0.2, 0.3]
    })

    # Or EasyVVUQ multi-index format
    df = pd.DataFrame({
        ('param1', 0): [1.0, 2.0, 3.0],
        ('param2', 0): [10.0, 20.0, 30.0],
        ('output1', 0): [100.0, 150.0, 200.0]
    })
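
If existing results use the flat layout but an analysis expects ``(name, index)`` column keys, the columns can be relabelled. A minimal sketch, assuming every column holds a scalar quantity so that index ``0`` is appropriate (``to_tuple_columns`` is an illustrative name):

.. code-block:: python

    import pandas as pd


    def to_tuple_columns(flat_df):
        """Return a copy of `flat_df` whose columns are (name, 0) pairs."""
        out = flat_df.copy()
        out.columns = pd.MultiIndex.from_tuples([(name, 0) for name in flat_df.columns])
        return out


    flat = pd.DataFrame({
        "param1": [1.0, 2.0, 3.0],
        "output1": [100.0, 150.0, 200.0],
    })
    wide = to_tuple_columns(flat)

The original DataFrame is left untouched, so both layouts remain available to downstream code.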

Directory Structure Support
---------------------------

Supported Structures
~~~~~~~~~~~~~~~~~~~~~

**Run Directories**:

.. code-block:: text

    data/
    ├── run_001/
    │   ├── input.json
    │   └── output.csv
    ├── run_002/
    │   ├── input.json
    │   └── output.csv
    └── ...

**File Lists**:

.. code-block:: text

    data/
    ├── inputs/
    │   ├── params_001.json
    │   ├── params_002.json
    │   └── ...
    └── outputs/
        ├── results_001.csv
        ├── results_002.csv
        └── ...
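
With the file-list layout, inputs and outputs have to be matched by run. One possible way to do this, sketched below, is to pair files that share the same numeric id in their names. The helper ``pair_by_run_id`` and the naming convention it relies on are assumptions, not EasyVVUQ behaviour.

.. code-block:: python

    import re
    from pathlib import Path


    def pair_by_run_id(input_paths, output_paths):
        """Pair input and output files that share the same numeric run id.

        The id is the last group of digits in the file stem, so
        "params_001.json" and "results_001.csv" both map to "001".
        Files without a partner are ignored.
        """
        def run_id(path):
            digits = re.findall(r"\d+", Path(path).stem)
            return digits[-1] if digits else None

        outputs = {run_id(p): p for p in output_paths if run_id(p) is not None}
        pairs = []
        for inp in sorted(input_paths):
            rid = run_id(inp)
            if rid in outputs:
                pairs.append((inp, outputs[rid]))
        return pairs


    pairs = pair_by_run_id(
        ["inputs/params_002.json", "inputs/params_001.json", "inputs/params_003.json"],
        ["outputs/results_001.csv", "outputs/results_002.csv"],
    )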

File Format Support
--------------------

Input Files
~~~~~~~~~~~

- **JSON**: Parameter definitions
- **YAML**: Configuration files
- **CSV**: Tabular parameter data

Output Files
~~~~~~~~~~~~

- **CSV**: Simulation results (most common)
- **JSON**: Structured output data
- **YAML**: Configuration outputs
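
As an illustration of handling these formats uniformly, the sketch below dispatches on the file extension. It is a hypothetical helper (``load_data_file``), not the importer's actual loader; YAML support assumes the third-party PyYAML package is installed.

.. code-block:: python

    import csv
    import json
    from pathlib import Path


    def load_data_file(path):
        """Load a JSON, CSV, or YAML file based on its extension.

        JSON and YAML return the decoded object; CSV returns a list of row
        dicts with string values.
        """
        path = Path(path)
        suffix = path.suffix.lower()
        if suffix == ".json":
            with path.open() as fh:
                return json.load(fh)
        if suffix == ".csv":
            with path.open(newline="") as fh:
                return list(csv.DictReader(fh))
        if suffix in (".yaml", ".yml"):
            import yaml  # PyYAML; imported lazily because it is optional
            with path.open() as fh:
                return yaml.safe_load(fh)
        raise ValueError(f"Unsupported file format: {suffix!r}")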

Analysis Classes Compatibility
------------------------------

Works with Both Approaches
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``EnsembleBoot``: Bootstrap analysis
- ``BasicStats``: Basic statistics

Requires Campaign (with proper samplers)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``PCEAnalysis``: Polynomial Chaos Expansion
- ``SCAnalysis``: Stochastic Collocation
- ``QMCAnalysis``: Quasi-Monte Carlo

Analysis-Only Compatible
~~~~~~~~~~~~~~~~~~~~~~~~

- Direct pandas/numpy operations
- scipy.stats functions
- sklearn models
- Custom analysis functions

Performance Comparison
----------------------

.. list-table::
   :header-rows: 1
   :widths: 20 20 20 20

   * - **Aspect**
     - **Campaign Creation**
     - **Analysis-Only**
     - **Hybrid**
   * - **Memory Usage**
     - Higher
     - Lower
     - Medium
   * - **Setup Time**
     - Longer
     - Shorter
     - Medium
   * - **Analysis Speed**
     - Medium
     - Fastest
     - Medium
   * - **Extensibility**
     - High
     - Low
     - High
   * - **Flexibility**
     - Medium
     - High
     - High
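
The ratings in this table are qualitative. To check how they apply to a particular dataset, a rough measurement such as the following sketch can help. The helper ``profile_analysis`` is an assumption for illustration; it reports the DataFrame's in-memory size, not the peak memory of the analysis itself.

.. code-block:: python

    import time

    import numpy as np
    import pandas as pd


    def profile_analysis(df, func):
        """Return (result, seconds, megabytes) for running `func(df)` once."""
        megabytes = df.memory_usage(deep=True).sum() / 1e6
        start = time.perf_counter()
        result = func(df)
        seconds = time.perf_counter() - start
        return result, seconds, megabytes


    df = pd.DataFrame({"output": np.arange(1000, dtype=float)})
    result, seconds, megabytes = profile_analysis(df, lambda d: d["output"].mean())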

Error Handling
--------------

Campaign Creation
~~~~~~~~~~~~~~~~~

.. code-block:: python

    from easyvvuq.utils import dataset_importer

    try:
        campaign = dataset_importer.create_campaign_from_directory(
            root_dir="/path/to/data",
            campaign_name="my_campaign"
        )
    except FileNotFoundError:
        print("Data directory not found")
    except ValueError as e:
        print(f"Invalid data format: {e}")

Analysis-Only
~~~~~~~~~~~~~

.. code-block:: python

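    import pandas as pd
    import easyvvuq as uq

    # Load outside the try block so `df` exists for the fallback
    df = pd.read_csv("results.csv")
    try:
        # A plain CSV yields flat column names, so use 'output' here
        analysis = uq.analysis.EnsembleBoot(qoi_cols=['output'])
        results = analysis.analyse(df)
    except Exception as e:
        print(f"Analysis failed: {e}")
        # Fallback to direct pandas analysis
        mean_val = df['output'].mean()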
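
Many analysis failures come down to a missing or misnamed column. Checking for the required columns up front gives a clearer error than a failure deep inside an analysis call. A small sketch, with ``require_columns`` as a hypothetical helper:

.. code-block:: python

    import pandas as pd


    def require_columns(df, columns):
        """Raise a KeyError naming every requested column missing from `df`."""
        missing = [col for col in columns if col not in df.columns]
        if missing:
            raise KeyError(f"Missing columns: {missing}; available: {list(df.columns)}")


    df = pd.DataFrame({"x1": [0.1, 0.2], "output": [1.0, 2.0]})
    require_columns(df, ["x1", "output"])  # passes silently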

Code Examples
-------------

The code examples in this guide cover all three approaches and can be adapted to specific use cases. The test suite in ``tests/test_dataset_importer.py`` provides further practical examples of how to use the functionality.

Best Practices
--------------

Campaign Creation
~~~~~~~~~~~~~~~~~

- Use consistent parameter names across runs
- Ensure all required parameters are present
- Use appropriate file formats (JSON for parameters, CSV for results)
- Test with small datasets first
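
The first two points can be checked before import. The sketch below compares each run's parameter names against those of the first run; ``find_inconsistent_runs`` is an illustrative helper operating on already-loaded input dictionaries.

.. code-block:: python

    def find_inconsistent_runs(records):
        """Return {run_index: (missing, extra)} relative to the first record's keys."""
        if not records:
            return {}
        expected = set(records[0])
        problems = {}
        for i, record in enumerate(records):
            keys = set(record)
            if keys != expected:
                problems[i] = (sorted(expected - keys), sorted(keys - expected))
        return problems


    problems = find_inconsistent_runs([
        {"parameter1": 1.0, "parameter2": 2.0},
        {"parameter1": 1.5},
        {"parameter1": 2.0, "parameter2": 3.0, "parameter3": "x"},
    ])

An empty result means every run defines exactly the same parameters as the first one.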

Analysis-Only
~~~~~~~~~~~~~

- Use pandas DataFrame best practices
- Handle missing data appropriately
- Use vectorized operations for performance
- Consider memory usage for large datasets

Hybrid
~~~~~~

- Start with campaign creation for organization
- Extract DataFrame for custom analysis
- Use campaign features for extension
- Document both approaches in your workflow

Migration from Existing Workflows
---------------------------------

From Manual Analysis
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

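    import pandas as pd
    import easyvvuq as uq

    # Before: Manual file loading (files is your list of result CSV paths)
    for file in files:
        data = pd.read_csv(file)
        # manual analysis...

    # After: Analysis-Only approach on a single combined DataFrame
    df = pd.concat((pd.read_csv(file) for file in files), ignore_index=True)
    results = uq.analysis.EnsembleBoot(qoi_cols=['output']).analyse(df)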

From Other UQ Tools
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

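    from easyvvuq.utils import dataset_importer

    # Convert existing data to EasyVVUQ format
    # (convert_to_easyvvuq_format is a placeholder for your own conversion code)
    df_easyvvuq = convert_to_easyvvuq_format(external_data)
    campaign = dataset_importer.create_campaign_from_dataframe(df_easyvvuq)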

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**Campaign Creation**:

- *File not found*: Check file paths and permissions
- *Parameter mismatch*: Ensure consistent parameter names
- *Invalid format*: Verify JSON/CSV syntax

**Analysis-Only**:

- *Column not found*: Check DataFrame column names
- *Analysis failed*: Verify data types and ranges
- *Memory error*: Consider chunking large datasets
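
For the memory error case, one option is to stream the file in chunks with ``pandas.read_csv`` and accumulate only the statistics needed. A minimal sketch for a column mean, assuming a CSV source; the function name ``chunked_mean`` and the chunk size are illustrative:

.. code-block:: python

    import io

    import pandas as pd


    def chunked_mean(csv_source, column, chunksize=10_000):
        """Compute the mean of one CSV column without loading the whole file."""
        total, count = 0.0, 0
        for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
            values = chunk[column].dropna()  # ignore missing entries
            total += values.sum()
            count += len(values)
        if count == 0:
            raise ValueError(f"No non-missing values in column {column!r}")
        return total / count


    csv_text = "x,output\n1,1.0\n2,\n3,5.0\n"
    mean_output = chunked_mean(io.StringIO(csv_text), "output", chunksize=2)

The same pattern extends to other statistics that can be accumulated incrementally, such as sums, counts, minima, and maxima.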

Getting Help
~~~~~~~~~~~~

1. Check the error message carefully
2. Review the demo scripts for examples
3. Check the EasyVVUQ documentation
4. Use the test suite for reference implementations