Utilities

Utility functions and data structures.

Configuration

load_builders_from_config(config_path: Optional[Union[str, Path]] = None, config_dict: Optional[Dict[str, Any]] = None, config_str: Optional[str] = None, path_key: Optional[str] = None, dataset_names: Optional[List[str]] = None) -> Dict[str, xaitimesynth.TimeSeriesBuilder]

Loads and creates TimeSeriesBuilder instances from various configuration sources.

This function can load configurations from a dictionary, a YAML file path, or a string containing YAML content. Exactly one of config_path, config_dict, or config_str must be provided.

Args:
- `config_path` (Optional[Union[str, Path]]): Path to a YAML configuration file.
- `config_dict` (Optional[Dict[str, Any]]): A dictionary containing the configuration.
- `config_str` (Optional[str]): A string containing YAML configuration.
- `path_key` (Optional[str]): A key (or path using '/' as separator) within the configuration dictionary where the dataset definitions are located. If None, the top-level dictionary is assumed to contain the dataset definitions. Example: "experiments/datasets". Default is None.
- `dataset_names` (Optional[List[str]]): A list of specific dataset names to load. If None, all datasets found at the specified location are loaded. Default is None.

Returns:
- Dict[str, TimeSeriesBuilder]: A dictionary where keys are the dataset names and values are the configured TimeSeriesBuilder instances.

Raises:
- ValueError: If not exactly one configuration source is provided, if the configuration source is invalid, if the path_key does not lead to a dictionary, or if required keys are missing.
- FileNotFoundError: If config_path is provided and the file does not exist.
- yaml.YAMLError: If config_str or the file at config_path contains invalid YAML.
- AttributeError: If a specified component function name does not exist in the xaitimesynth package.

Detailed Configuration Structure: The configuration (whether from file, string, or dict) must ultimately resolve to a Python dictionary. This dictionary contains dataset definitions, either at the top level or nested under the path_key.

  Each dataset definition (the value associated with a dataset name key) is a
  dictionary specifying the parameters for a `TimeSeriesBuilder` and its components.
  Key elements include:
  - Builder arguments: `n_timesteps`, `n_samples`, `n_dimensions`, `random_state`, etc.
  - `classes` (list, mandatory): A list of dictionaries, each defining a class.
      - `id` (mandatory): The class label.
      - `weight` (float, optional): Sampling weight for the class.
      - `signals` (list, optional): List of signal component dictionaries.
          - `function` (str, mandatory): Name of a signal generator function (e.g., "random_walk").
          - `params` (dict, optional): Parameters for the generator function.
          - `dimensions` (list, optional): Dimensions to apply to.
          - `shared_randomness` (bool, optional).
          - Location keys (optional): `start_pct`, `end_pct`, `length_pct` (float only),
            `random_location`, `shared_location`. Note: `length_pct` for signals only
            accepts a scalar float; stochastic forms (tuple/list/range) are not supported.
      - `features` (list, optional): List of feature component dictionaries.
          - `function` (str, mandatory): Name of a feature generator function (e.g., "peak").
          - `params` (dict, optional): Parameters for the generator function.
          - Location keys (optional): `start_pct`, `end_pct`, `length_pct`, `random_location`,
            `shared_location`. `length_pct` accepts a scalar float, a list of floats
            (discrete choices), or ``{range: [min, max]}`` for uniform per-sample sampling.
          - `dimensions` (list, optional): Dimensions to apply to.
          - `shared_randomness` (bool, optional).

Example YAML Structure (config.yaml):

```yaml
# Option 1: Top-level dataset definition (path_key=None)
my_dataset_1:
  n_timesteps: 150
  n_samples: 200
  n_dimensions: 2
  random_state: 42
  classes:
    - id: 0 # Class 0 definition
      weight: 1.0
      signals:
        - function: random_walk
          params: { step_size: 0.1 }
          dimensions: [0, 1] # Apply to both dimensions
        - function: gaussian_noise
          params: { sigma: 0.05 }
          # dimensions omitted -> applies to all
      features: [] # No specific features for class 0

    - id: 1 # Class 1 definition
      weight: 1.5 # Sample class 1 more often
      signals:
        - { function: random_walk, params: { step_size: 0.1 }, dimensions: [0, 1] }
        - { function: gaussian_noise, params: { sigma: 0.05 } }
      features:
        - function: peak
          params: { amplitude: 1.5, width: 3 }
          length_pct: 0.1 # Feature length is 10% of total timesteps
          random_location: true # Place it randomly
          dimensions: [0] # Only in dimension 0
          shared_location: false # If dim had >1 element, location would differ
        - function: constant
          params: { value: -1.0 }
          start_pct: 0.7
          end_pct: 0.9
          dimensions: [1] # Only in dimension 1

# Option 2: Nested dataset definitions (path_key="experiments/datasets")
experiments:
  datasets:
    dataset_nested:
      n_timesteps: 80
      n_samples: 50
      classes:
        - id: 0
          signals: [ { function: seasonal, params: { period: 10 } } ]
        # ... potentially more classes ...
```
YAML Anchors and Aliases: YAML's anchor/alias feature can be used to reuse configuration across multiple datasets. This is particularly useful for defining common settings, signals, or features.

  Example:
  ```yaml
  # Define common settings with anchor (&)
  common: &common_settings
    n_timesteps: 100
    n_samples: 1000
    random_state: 42
    normalization: "zscore"

  # Define common signal configuration
  base_random_walk: &base_signal
    function: random_walk
    params:
      step_size: 0.1

  # Use aliases (*) to reference the anchors
  dataset_a:
    <<: *common_settings  # Merges all common settings
    n_dimensions: 1
    classes:
      - id: 0
        signals:
          - <<: *base_signal  # Use the common signal definition

  dataset_b:
    <<: *common_settings
    n_samples: 2000  # Override specific settings
    n_dimensions: 2
    classes:
      - id: 0
        signals:
          - <<: *base_signal
            dimensions: [0, 1]  # Add dimensions parameter
  ```

  The `<<:` syntax is a YAML merge key that merges all key-value pairs from the
  referenced anchor into the current mapping.
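The merge semantics mirror Python dict unpacking: the aliased mapping is merged first, then keys written directly in the current mapping override it. A stdlib sketch of what `dataset_b` above resolves to (values copied from the YAML example):

```python
# The `&common_settings` anchor as a plain dict
common_settings = {
    "n_timesteps": 100,
    "n_samples": 1000,
    "random_state": 42,
    "normalization": "zscore",
}

# `<<: *common_settings` followed by local keys behaves like dict unpacking:
# anchored keys come in first, local keys (n_samples, n_dimensions) win.
dataset_b = {**common_settings, "n_samples": 2000, "n_dimensions": 2}
```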

Example Usage:

```python
from xaitimesynth.parser import load_builders_from_config

# Load all datasets from top level of a file
builders_file = load_builders_from_config(config_path="config.yaml")

# Load only 'dataset_c' from a nested path in a file
builders_c = load_builders_from_config(
    config_path="config.yaml",
    path_key="experiments/datasets",
    dataset_names=["dataset_c"]
)

# Load from a dictionary
my_config = {
    "my_dataset": {"n_timesteps": 10, "classes": [{"id": 0}]}
}
builders_dict = load_builders_from_config(config_dict=my_config)

# Load from a YAML string
yaml_str = "my_data:\n  n_timesteps: 5"
builders_str = load_builders_from_config(config_str=yaml_str)
```

Source code in xaitimesynth/parser.py
def load_builders_from_config(
    config_path: Optional[Union[str, Path]] = None,
    config_dict: Optional[Dict[str, Any]] = None,
    config_str: Optional[str] = None,
    path_key: Optional[str] = None,
    dataset_names: Optional[List[str]] = None,
) -> Dict[str, "xaitimesynth.TimeSeriesBuilder"]:
    """Loads and creates TimeSeriesBuilder instances from various configuration sources.

    This function can load configurations from a dictionary, a YAML file path,
    or a string containing YAML content. Exactly one of `config_path`,
    `config_dict`, or `config_str` must be provided.

    Args:
        config_path (Optional[Union[str, Path]]): Path to a YAML configuration file.
        config_dict (Optional[Dict[str, Any]]): A dictionary containing the configuration.
        config_str (Optional[str]): A string containing YAML configuration.
        path_key (Optional[str]): A key (or path using '/' as separator) within the
            configuration dictionary where the dataset definitions are located.
            If None, assumes the top-level dictionary contains the dataset definitions.
            Example: "experiments/datasets". Default is None.
        dataset_names (Optional[List[str]]): A list of specific dataset names to load.
            If None, all datasets found at the specified location are loaded.
            Default is None.

    Returns:
        Dict[str, TimeSeriesBuilder]: A dictionary where keys are the dataset names
        and values are the configured TimeSeriesBuilder instances.

    Raises:
        ValueError: If not exactly one configuration source is provided, if the
                    configuration source is invalid, the path_key does not lead
                    to a dictionary, or required keys are missing.
        FileNotFoundError: If config_path is provided and the file does not exist.
        yaml.YAMLError: If config_str or the file at config_path contains invalid YAML.
        AttributeError: If a specified component function name does not exist in
                        the xaitimesynth package.

    Detailed Configuration Structure:
        The configuration (whether from file, string, or dict) must ultimately resolve
        to a Python dictionary. This dictionary contains dataset definitions, either at
        the top level or nested under the `path_key`.

        Each dataset definition (the value associated with a dataset name key) is a
        dictionary specifying the parameters for a `TimeSeriesBuilder` and its components.
        Key elements include:
        - Builder arguments: `n_timesteps`, `n_samples`, `n_dimensions`, `random_state`, etc.
        - `classes` (list, mandatory): A list of dictionaries, each defining a class.
            - `id` (mandatory): The class label.
            - `weight` (float, optional): Sampling weight for the class.
            - `signals` (list, optional): List of signal component dictionaries.
                - `function` (str, mandatory): Name of a signal generator function (e.g., "random_walk").
                - `params` (dict, optional): Parameters for the generator function.
                - `dimensions` (list, optional): Dimensions to apply to.
                - `shared_randomness` (bool, optional).
                - Location keys (optional): `start_pct`, `end_pct`, `length_pct` (float only),
                  `random_location`, `shared_location`. Note: `length_pct` for signals only
                  accepts a scalar float; stochastic forms (tuple/list/range) are not supported.
            - `features` (list, optional): List of feature component dictionaries.
                - `function` (str, mandatory): Name of a feature generator function (e.g., "peak").
                - `params` (dict, optional): Parameters for the generator function.
                - Location keys (optional): `start_pct`, `end_pct`, `length_pct`, `random_location`,
                  `shared_location`. `length_pct` accepts a scalar float, a list of floats
                  (discrete choices), or ``{range: [min, max]}`` for uniform per-sample sampling.
                - `dimensions` (list, optional): Dimensions to apply to.
                - `shared_randomness` (bool, optional).


    Example YAML Structure (config.yaml):
        ```yaml
        # Option 1: Top-level dataset definition (path_key=None)
        my_dataset_1:
          n_timesteps: 150
          n_samples: 200
          n_dimensions: 2
          random_state: 42
          classes:
            - id: 0 # Class 0 definition
              weight: 1.0
              signals:
                - function: random_walk
                  params: { step_size: 0.1 }
                  dimensions: [0, 1] # Apply to both dimensions
                - function: gaussian_noise
                  params: { sigma: 0.05 }
                  # dimensions omitted -> applies to all
              features: [] # No specific features for class 0

            - id: 1 # Class 1 definition
              weight: 1.5 # Sample class 1 more often
              signals:
                - { function: random_walk, params: { step_size: 0.1 }, dimensions: [0, 1] }
                - { function: gaussian_noise, params: { sigma: 0.05 } }
              features:
                - function: peak
                  params: { amplitude: 1.5, width: 3 }
                  length_pct: 0.1 # Feature length is 10% of total timesteps
                  random_location: true # Place it randomly
                  dimensions: [0] # Only in dimension 0
                  shared_location: false # If dim had >1 element, location would differ
                - function: constant
                  params: { value: -1.0 }
                  start_pct: 0.7
                  end_pct: 0.9
                  dimensions: [1] # Only in dimension 1

        # Option 2: Nested dataset definitions (path_key="experiments/datasets")
        experiments:
          datasets:
            dataset_nested:
              n_timesteps: 80
              n_samples: 50
              classes:
                - id: 0
                  signals: [ { function: seasonal, params: { period: 10 } } ]
                # ... potentially more classes ...
        ```

    YAML Anchors and Aliases:
        YAML's anchor/alias feature can be used to reuse configuration across multiple datasets.
        This is particularly useful for defining common settings, signals, or features.

        Example:
        ```yaml
        # Define common settings with anchor (&)
        common: &common_settings
          n_timesteps: 100
          n_samples: 1000
          random_state: 42
          normalization: "zscore"

        # Define common signal configuration
        base_random_walk: &base_signal
          function: random_walk
          params:
            step_size: 0.1

        # Use aliases (*) to reference the anchors
        dataset_a:
          <<: *common_settings  # Merges all common settings
          n_dimensions: 1
          classes:
            - id: 0
              signals:
                - <<: *base_signal  # Use the common signal definition

        dataset_b:
          <<: *common_settings
          n_samples: 2000  # Override specific settings
          n_dimensions: 2
          classes:
            - id: 0
              signals:
                - <<: *base_signal
                  dimensions: [0, 1]  # Add dimensions parameter
        ```

        The `<<:` syntax is a YAML merge key that merges all key-value pairs from the
        referenced anchor into the current mapping.

    Example Usage:
        ```python
        from xaitimesynth.parser import load_builders_from_config

        # Load all datasets from top level of a file
        builders_file = load_builders_from_config(config_path="config.yaml")

        # Load only 'dataset_c' from a nested path in a file
        builders_c = load_builders_from_config(
            config_path="config.yaml",
            path_key="experiments/datasets",
            dataset_names=["dataset_c"]
        )

        # Load from a dictionary
        my_config = {
            "my_dataset": {"n_timesteps": 10, "classes": [{"id": 0}]}
        }
        builders_dict = load_builders_from_config(config_dict=my_config)

        # Load from a YAML string
        yaml_str = "my_data:\n  n_timesteps: 5"
        builders_str = load_builders_from_config(config_str=yaml_str)
        ```
    """
    # --- 1. Validate and Load configuration dictionary ---
    provided_configs = sum(
        arg is not None for arg in [config_path, config_dict, config_str]
    )
    if provided_configs != 1:
        raise ValueError(
            "Exactly one of config_path, config_dict, or config_str must be provided."
        )

    loaded_config_dict: Dict[str, Any]

    if config_dict is not None:
        if not isinstance(config_dict, dict):
            raise ValueError("config_dict must be a dictionary.")
        loaded_config_dict = config_dict
    elif config_str is not None:
        try:
            loaded_config_dict = yaml.safe_load(config_str)
            if not isinstance(loaded_config_dict, dict):
                raise ValueError(
                    "config_str is valid YAML but not a dictionary-based config."
                )
        except yaml.YAMLError as e:
            raise yaml.YAMLError(f"Could not parse config_str as YAML: {e}")
        except Exception as e:
            raise ValueError(f"Could not load config from config_str: {e}")
    elif config_path is not None:
        path = Path(config_path)
        if not path.is_file():
            raise FileNotFoundError(f"Configuration file not found: {path}")
        try:
            with open(path, "r") as f:
                loaded_config_dict = yaml.safe_load(f)
            if not isinstance(loaded_config_dict, dict):
                raise ValueError(
                    f"File at {path} is valid YAML but not a dictionary-based config."
                )
        except yaml.YAMLError as e:
            raise yaml.YAMLError(f"Could not parse file {path} as YAML: {e}")
        except Exception as e:
            raise ValueError(f"Could not load config from file {path}: {e}")
    # This else should be unreachable due to the initial check, but added for safety
    else:
        raise ValueError("Internal error: No configuration source identified.")

    # --- 2. Locate the dataset definitions within the dictionary ---
    datasets_dict = loaded_config_dict
    if path_key:
        keys = path_key.split("/")
        try:
            for key in keys:
                datasets_dict = datasets_dict[key]
        except KeyError:
            raise ValueError(f"Path key '{path_key}' not found in configuration.")
        except TypeError:
            raise ValueError(f"Element at path '{path_key}' is not a dictionary.")

    if not isinstance(datasets_dict, dict):
        raise ValueError(
            f"Configuration at path '{path_key or 'top-level'}' is not a dictionary of datasets."
        )

    # --- 3. Filter and Create Builders ---
    builders = {}
    datasets_to_load = (
        dataset_names if dataset_names is not None else datasets_dict.keys()
    )

    for name in datasets_to_load:
        if name not in datasets_dict:
            print(
                f"Warning: Dataset '{name}' requested but not found in configuration."
            )
            continue

        single_dataset_config = datasets_dict[name]
        if not isinstance(single_dataset_config, dict):
            print(
                f"Warning: Configuration for dataset '{name}' is not a dictionary. Skipping."
            )
            continue

        # Check if the dictionary looks like a dataset config (must have 'classes')
        if "classes" not in single_dataset_config:
            print(
                f"Warning: Configuration for '{name}' does not contain a 'classes' key. Skipping."
            )
            continue

        try:
            builders[name] = _create_single_builder_from_dict(single_dataset_config)
        except (ValueError, AttributeError) as e:
            print(f"Error creating builder for dataset '{name}': {e}")
            # Re-raise the exception after printing the context
            raise ValueError(f"Error processing dataset '{name}': {e}") from e

    return builders

Data Structures

TimeSeriesComponents dataclass

Stores the separate components of a generated time series.

This dataclass is designed to hold the individual components that constitute a synthetic time series. By storing these components separately, it facilitates ground truth evaluation of XAI (Explainable AI) methods, allowing for a deeper understanding of how each component contributes to the final time series.

Attributes:

- `background` (ndarray): Background signal, the base structure component (e.g., constant, random walk).
- `features` (Optional[Dict[str, ndarray]]): Dictionary mapping feature names to their vector representations. Defaults to None.
- `feature_masks` (Optional[Dict[str, ndarray]]): Dictionary of boolean masks indicating feature locations. Defaults to None.
- `aggregated` (Optional[ndarray]): The final aggregated time series after combining components. Defaults to None.

Source code in xaitimesynth/data_structures.py
@dataclass
class TimeSeriesComponents:
    """Stores the separate components of a generated time series.

    This dataclass is designed to hold the individual components that constitute
    a synthetic time series. By storing these components separately, it facilitates
    ground truth evaluation of XAI (Explainable AI) methods, allowing for a deeper
    understanding of how each component contributes to the final time series.

    Attributes:
        background (np.ndarray): Background signal, the base structure component (e.g., constant, random walk).
        features (Optional[Dict[str, np.ndarray]]): Dictionary mapping feature names to their vector representations. Defaults to None.
        feature_masks (Optional[Dict[str, np.ndarray]]): Dictionary of boolean masks indicating feature locations. Defaults to None.
        aggregated (Optional[np.ndarray]): The final aggregated time series after combining components. Defaults to None.
    """

    background: np.ndarray
    features: Optional[Dict[str, np.ndarray]] = None
    feature_masks: Optional[Dict[str, np.ndarray]] = None
    aggregated: Optional[np.ndarray] = None

    def __post_init__(self):
        """Validate that components have compatible shapes with the background."""
        expected_length = self.background.shape[0]  # Time dimension length

        # Check features components
        if self.features is not None:
            for feature_name, feature_data in self.features.items():
                # For features, we only validate that the time dimension matches
                # This allows dimension-specific features to be 1D arrays
                if feature_data.shape[0] != expected_length:
                    raise ValueError(
                        f"The feature '{feature_name}' first dimension "
                        f"{feature_data.shape[0]} doesn't match "
                        f"background first dimension {expected_length}."
                    )

        # Check feature masks components
        if self.feature_masks is not None:
            for mask_name, mask_data in self.feature_masks.items():
                # Feature masks should also match at least in the time dimension
                if mask_data.shape[0] != expected_length:
                    raise ValueError(
                        f"The feature mask '{mask_name}' first dimension "
                        f"{mask_data.shape[0]} doesn't match "
                        f"background first dimension {expected_length}."
                    )

        # Check aggregated component
        if self.aggregated is not None:
            if self.aggregated.shape != self.background.shape:
                raise ValueError(
                    f"The 'aggregated' component shape {self.aggregated.shape} "
                    f"doesn't match background shape {self.background.shape}."
                )

__post_init__()

Validate that components have compatible shapes with the background.

Source code in xaitimesynth/data_structures.py
def __post_init__(self):
    """Validate that components have compatible shapes with the background."""
    expected_length = self.background.shape[0]  # Time dimension length

    # Check features components
    if self.features is not None:
        for feature_name, feature_data in self.features.items():
            # For features, we only validate that the time dimension matches
            # This allows dimension-specific features to be 1D arrays
            if feature_data.shape[0] != expected_length:
                raise ValueError(
                    f"The feature '{feature_name}' first dimension "
                    f"{feature_data.shape[0]} doesn't match "
                    f"background first dimension {expected_length}."
                )

    # Check feature masks components
    if self.feature_masks is not None:
        for mask_name, mask_data in self.feature_masks.items():
            # Feature masks should also match at least in the time dimension
            if mask_data.shape[0] != expected_length:
                raise ValueError(
                    f"The feature mask '{mask_name}' first dimension "
                    f"{mask_data.shape[0]} doesn't match "
                    f"background first dimension {expected_length}."
                )

    # Check aggregated component
    if self.aggregated is not None:
        if self.aggregated.shape != self.background.shape:
            raise ValueError(
                f"The 'aggregated' component shape {self.aggregated.shape} "
                f"doesn't match background shape {self.background.shape}."
            )

Normalization

normalize(data: np.ndarray, method: str = 'zscore', **kwargs) -> np.ndarray

Normalize data using specified method.

Applies a normalization method to the input data based on the specified method. Supports 'zscore' (standardization), 'minmax' (min-max scaling), and 'none' (no normalization).

Parameters:

- `data` (ndarray): Input array to normalize. Required.
- `method` (str): Normalization method ("zscore", "minmax", or "none"). Defaults to "zscore".
- `**kwargs`: Additional parameters for specific normalization methods.

Returns:

- np.ndarray: Normalized data according to the specified method.

Raises:

- ValueError: If an invalid normalization method is specified.

Source code in xaitimesynth/functions.py
def normalize(data: np.ndarray, method: str = "zscore", **kwargs) -> np.ndarray:
    """Normalize data using specified method.

    Applies a normalization method to the input data based on the specified method.
    Supports 'zscore' (standardization), 'minmax' (min-max scaling), and 'none' (no normalization).

    Args:
        data (np.ndarray): Input array to normalize.
        method (str): Normalization method ("zscore", "minmax", or "none"). Defaults to "zscore".
        **kwargs: Additional parameters for specific normalization methods.

    Returns:
        np.ndarray: Normalized data according to specified method.

    Raises:
        ValueError: If an invalid normalization method is specified.
    """
    if method == "minmax":
        feature_range = kwargs.get("feature_range", (0, 1))
        return minmax_normalize(data, feature_range)
    elif method == "zscore":
        epsilon = kwargs.get("epsilon", 1e-10)
        return zscore_normalize(data, epsilon)
    elif method == "none":
        return data
    else:
        raise ValueError(
            f"Invalid normalization method: {method}. "
            "Choose 'zscore', 'minmax', or 'none'."
        )