YAML Configuration Reference¶

This page is the format reference for YAML dataset definitions. This can be helpful when you want to:

Share dataset configurations more easily
Keep data generation settings separate from analysis code
Define multiple dataset variants in one file
Track dataset configuration changes in git

Quick start¶

You can load yaml data configs with:

from xaitimesynth.parser import load_builders_from_config

builders = load_builders_from_config(config_path="config.yaml")

for name, builder in builders.items():
    dataset = builder.build()

You can also store dataset configurations from code to yaml:

import yaml
from xaitimesynth import TimeSeriesBuilder, gaussian_noise, peak

# Define dataset in Python
builder = (
    TimeSeriesBuilder(n_timesteps=100, n_samples=200)
    # your dataset definition goes here...
)

# Export to dictionary
config = builder.to_config()

# Save to YAML file
with open("config.yaml", "w") as f:
    yaml.dump({"my_dataset": config}, f)

Basic Structure¶

Each dataset in your YAML file needs a name (the top-level key) and a configuration that mirrors the TimeSeriesBuilder API:

dataset_name:
  # Builder parameters
  n_timesteps: 100
  n_samples: 200
  n_dimensions: 1
  random_state: 42

  # Class definitions (required)
  classes:
    - id: 0
      signals:
        - function: random_walk
          params: { step_size: 0.2 }
        - function: gaussian_noise
          params: { sigma: 0.1 }
      features:
        - function: constant
          params: { value: -1.0 }
          start_pct: 0.4
          end_pct: 0.6
    - id: 1
      # ... class 1 definition

You can define multiple datasets in the same file - each top-level key becomes a separate builder.

Configuration Reference¶

Builder Parameters¶

These go at the top level of each dataset definition:

Key	Type	Default	Description
`n_timesteps`	int	100	Length of each time series
`n_samples`	int	1000	Total number of samples to generate
`n_dimensions`	int	1	Number of channels (for multivariate)
`random_state`	int	None	Random seed for reproducibility
`normalization`	str	"zscore"	Normalization method ("zscore", "minmax", "none")
`data_format`	str	"channels_first"	Output shape format

Class Definition¶

Each class in the classes list defines one class label and its components:

Key	Type	Required	Description
`id`	int	Yes	Class label (0, 1, 2, ...)
`weight`	float	No	Sampling weight for class balance (default: 1.0)
`signals`	list	No	Background signal components
`features`	list	No	Discriminative feature components

Signal Configuration¶

Signals define the background patterns in your time series. Each signal in the signals list:

Key	Type	Required	Description
`function`	str	Yes	Generator name (e.g., "random_walk", "gaussian_noise")
`params`	dict	No	Parameters passed to the generator
`dimensions`	list	No	Which dimensions to apply to (null = all)
`start_pct`, `end_pct`	float	No	Position (0-1) for partial coverage
`length_pct`	float	No	Length as fraction for random placement (scalar only; stochastic forms not supported for signals)
`random_location`	bool	No	Place at random position each sample

Feature Configuration¶

Features are the class-discriminating patterns. Each feature in the features list:

Key	Type	Required	Description
`function`	str	Yes	Generator name (e.g., "peak", "constant")
`params`	dict	No	Parameters passed to the generator
`start_pct`, `end_pct`	float	No*	Fixed position (0-1)
`length_pct`	see below	No*	Length for random placement
`random_location`	bool	No	Place randomly (requires `length_pct`)
`dimensions`	list	No	Which dimensions to apply to

*You must specify either start_pct/end_pct for fixed position, or length_pct with random_location: true.

Stochastic feature lengths with `length_pct`¶

length_pct controls the feature window size. It accepts three forms, both in Python and YAML:

Form	Python API	YAML syntax	Effect
Fixed	`length_pct=0.5`	`length_pct: 0.5`	Same length every sample
Discrete choices	`length_pct=[0.25, 0.5]`	`length_pct: [0.25, 0.5]`	Randomly pick one value per sample
Uniform range	`length_pct=(0.25, 0.75)`	`length_pct: {range: [0.25, 0.75]}`	Sample uniformly per sample

YAML note: YAML has no tuple type, so a plain list like [0.25, 0.75] is always treated as discrete choices, not a range. Use the {range: [...]} dict form to express a uniform range in YAML.

features:
  # Fixed length — always 30% of the series
  - function: peak
    params: { amplitude: 1.5 }
    random_location: true
    length_pct: 0.3

  # Discrete choices — randomly pick 25% or 50% per sample
  - function: constant
    params: { value: 1.0 }
    random_location: true
    length_pct: [0.25, 0.5]

  # Uniform range — sample any length between 25% and 75% per sample
  - function: trend
    params: { slope: 0.05 }
    random_location: true
    length_pct: {range: [0.25, 0.75]}

The Python API uses a tuple for ranges:

# Python equivalents of the three YAML forms above
.add_feature(peak(amplitude=1.5),   random_location=True, length_pct=0.3)
.add_feature(constant(value=1.0),   random_location=True, length_pct=[0.25, 0.5])
.add_feature(trend(slope=0.05),     random_location=True, length_pct=(0.25, 0.75))

to_config() serializes tuples as {range: [...]} so configurations round-trip faithfully through YAML.

Available Functions¶

The function field must match the name of a component function in the package (e.g., gaussian_noise, peak, random_walk). Use list_signal_components() and list_feature_components() to discover available functions programmatically. See the Usage Guide for details.

Loading Configurations¶

The load_builders_from_config() function provides several ways to load configurations:

from xaitimesynth.parser import load_builders_from_config

# Load all datasets from a file
builders = load_builders_from_config(config_path="config.yaml")

# Load from a nested path within the file
# Useful if you organize datasets under categories
builders = load_builders_from_config(
    config_path="config.yaml",
    path_key="experiments/ablation_study"
)

# Load only specific datasets by name
builders = load_builders_from_config(
    config_path="config.yaml",
    dataset_names=["dataset_a", "dataset_b"]
)

# Load from a Python dictionary (useful for testing)
builders = load_builders_from_config(config_dict=my_config_dict)

# Load from a YAML string
builders = load_builders_from_config(config_str=yaml_string)

The function returns a dictionary mapping dataset names to TimeSeriesBuilder instances.

Reusing Configuration with YAML Anchors¶

YAML has built-in support for reusing configuration blocks. This is helpful when multiple datasets share common settings.

Use &name to define an anchor and *name to reference it. Use <<: to merge an anchor's contents:

# Define common settings once
common: &common
  n_timesteps: 100
  n_samples: 500
  random_state: 42

# Define reusable signal configurations
gaussian_background: &gaussian_background
  function: gaussian_noise
  params: { sigma: 1.0 }

# Use anchors in dataset definitions
dataset_a:
  <<: *common                    # Merge common settings
  classes:
    - id: 0
      signals: [*gaussian_background]

dataset_b:
  <<: *common
  n_samples: 1000                # Override specific values
  classes:
    - id: 0
      signals: [*gaussian_background]

Exporting Python Configurations to YAML¶

If you've built a dataset programmatically and want to save its configuration for later use, you can export it with to_config():

import yaml
from xaitimesynth import TimeSeriesBuilder, gaussian_noise, peak

# Define dataset in Python
builder = (
    TimeSeriesBuilder(n_timesteps=100, n_samples=200)
    .for_class(0)
    .add_signal(gaussian_noise(sigma=0.1))
    .for_class(1)
    .add_signal(gaussian_noise(sigma=0.1))
    .add_feature(peak(amplitude=1.0), start_pct=0.3, end_pct=0.6)
)

# Export to dictionary
config = builder.to_config()

# Save to YAML file
with open("config.yaml", "w") as f:
    yaml.dump({"my_dataset": config}, f)

This enables round-trip conversion: define in Python, save to YAML, reload later.