# YAML Configuration Reference
This page is the format reference for YAML dataset definitions. Defining datasets in YAML is helpful when you want to:
- Share dataset configurations more easily
- Keep data generation settings separate from analysis code
- Define multiple dataset variants in one file
- Track dataset configuration changes in git
## Quick start
You can load YAML dataset configurations with:

```python
from xaitimesynth.parser import load_builders_from_config

builders = load_builders_from_config(config_path="config.yaml")
for name, builder in builders.items():
    dataset = builder.build()
```
You can also export dataset configurations from Python code to YAML:

```python
import yaml

from xaitimesynth import TimeSeriesBuilder, gaussian_noise, peak

# Define dataset in Python
builder = (
    TimeSeriesBuilder(n_timesteps=100, n_samples=200)
    # your dataset definition goes here...
)

# Export to dictionary
config = builder.to_config()

# Save to YAML file
with open("config.yaml", "w") as f:
    yaml.dump({"my_dataset": config}, f)
```
## Basic Structure

Each dataset in your YAML file needs a name (the top-level key) and a configuration that mirrors the `TimeSeriesBuilder` API:
```yaml
dataset_name:
  # Builder parameters
  n_timesteps: 100
  n_samples: 200
  n_dimensions: 1
  random_state: 42

  # Class definitions (required)
  classes:
    - id: 0
      signals:
        - function: random_walk
          params: { step_size: 0.2 }
        - function: gaussian_noise
          params: { sigma: 0.1 }
      features:
        - function: constant
          params: { value: -1.0 }
          start_pct: 0.4
          end_pct: 0.6
    - id: 1
      # ... class 1 definition
```
You can define multiple datasets in the same file; each top-level key becomes a separate builder.
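For example, a single file can hold two variants side by side (the dataset names and parameter values here are illustrative):

```yaml
small_variant:
  n_timesteps: 50
  n_samples: 100
  classes:
    - id: 0
      signals:
        - function: gaussian_noise
          params: { sigma: 0.1 }

large_variant:
  n_timesteps: 500
  n_samples: 2000
  classes:
    - id: 0
      signals:
        - function: random_walk
          params: { step_size: 0.2 }
```

Loading this file returns one builder per top-level key.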
## Configuration Reference
### Builder Parameters
These go at the top level of each dataset definition:
| Key | Type | Default | Description |
|---|---|---|---|
| `n_timesteps` | int | 100 | Length of each time series |
| `n_samples` | int | 1000 | Total number of samples to generate |
| `n_dimensions` | int | 1 | Number of channels (for multivariate) |
| `random_state` | int | None | Random seed for reproducibility |
| `normalization` | str | "zscore" | Normalization method ("zscore", "minmax", "none") |
| `data_format` | str | "channels_first" | Output shape format |
### Class Definition

Each class in the `classes` list defines one class label and its components:
| Key | Type | Required | Description |
|---|---|---|---|
| `id` | int | Yes | Class label (0, 1, 2, ...) |
| `weight` | float | No | Sampling weight for class balance (default: 1.0) |
| `signals` | list | No | Background signal components |
| `features` | list | No | Discriminative feature components |
### Signal Configuration

Signals define the background patterns in your time series. Each signal in the `signals` list:
| Key | Type | Required | Description |
|---|---|---|---|
| `function` | str | Yes | Generator name (e.g., "random_walk", "gaussian_noise") |
| `params` | dict | No | Parameters passed to the generator |
| `dimensions` | list | No | Which dimensions to apply to (null = all) |
| `start_pct`, `end_pct` | float | No | Position (0–1) for partial coverage |
| `length_pct` | float | No | Length as fraction for random placement (scalar only; stochastic forms not supported for signals) |
| `random_location` | bool | No | Place at random position each sample |
### Feature Configuration

Features are the class-discriminating patterns. Each feature in the `features` list:
| Key | Type | Required | Description |
|---|---|---|---|
| `function` | str | Yes | Generator name (e.g., "peak", "constant") |
| `params` | dict | No | Parameters passed to the generator |
| `start_pct`, `end_pct` | float | No* | Fixed position (0–1) |
| `length_pct` | see below | No* | Length for random placement |
| `random_location` | bool | No | Place randomly (requires `length_pct`) |
| `dimensions` | list | No | Which dimensions to apply to |

*You must specify either `start_pct`/`end_pct` for a fixed position, or `length_pct` with `random_location: true`.
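The two placement modes look like this side by side (parameter values are illustrative):

```yaml
features:
  # Fixed position: the feature always spans timesteps 30%-60%
  - function: peak
    params: { amplitude: 1.0 }
    start_pct: 0.3
    end_pct: 0.6

  # Random placement: a window 20% long, positioned randomly per sample
  - function: constant
    params: { value: 1.0 }
    length_pct: 0.2
    random_location: true
```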
### Stochastic feature lengths with `length_pct`
`length_pct` controls the feature window size. It accepts three forms, in both Python and YAML:
| Form | Python API | YAML syntax | Effect |
|---|---|---|---|
| Fixed | `length_pct=0.5` | `length_pct: 0.5` | Same length every sample |
| Discrete choices | `length_pct=[0.25, 0.5]` | `length_pct: [0.25, 0.5]` | Randomly pick one value per sample |
| Uniform range | `length_pct=(0.25, 0.75)` | `length_pct: {range: [0.25, 0.75]}` | Draw a length uniformly from the range each sample |
YAML note: YAML has no tuple type, so a plain list like `[0.25, 0.75]` is always treated as discrete choices, not a range. Use the `{range: [...]}` dict form to express a uniform range in YAML.
```yaml
features:
  # Fixed length: always 30% of the series
  - function: peak
    params: { amplitude: 1.5 }
    random_location: true
    length_pct: 0.3

  # Discrete choices: randomly pick 25% or 50% per sample
  - function: constant
    params: { value: 1.0 }
    random_location: true
    length_pct: [0.25, 0.5]

  # Uniform range: sample any length between 25% and 75% per sample
  - function: trend
    params: { slope: 0.05 }
    random_location: true
    length_pct: {range: [0.25, 0.75]}
```
The Python API uses a tuple for ranges:
```python
# Python equivalents of the three YAML forms above
.add_feature(peak(amplitude=1.5), random_location=True, length_pct=0.3)
.add_feature(constant(value=1.0), random_location=True, length_pct=[0.25, 0.5])
.add_feature(trend(slope=0.05), random_location=True, length_pct=(0.25, 0.75))
```
`to_config()` serializes tuples as `{range: [...]}` so configurations round-trip faithfully through YAML.
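A minimal sketch of how the three forms can be disambiguated at parse time. `resolve_length_pct` is a hypothetical helper for illustration only, not part of xaitimesynth:

```python
import random


def resolve_length_pct(spec, rng=None):
    """Resolve one of the three length_pct forms into a concrete length.

    Hypothetical helper, not xaitimesynth's actual implementation.
    """
    rng = rng or random.Random()
    # Python tuple or YAML-loaded {"range": [lo, hi]} dict: uniform range
    if isinstance(spec, tuple) or (isinstance(spec, dict) and "range" in spec):
        lo, hi = spec["range"] if isinstance(spec, dict) else spec
        return rng.uniform(lo, hi)
    # Plain list: discrete choices, one picked per sample
    if isinstance(spec, list):
        return rng.choice(spec)
    # Scalar: fixed length every sample
    return float(spec)


resolve_length_pct(0.3)                      # always 0.3
resolve_length_pct([0.25, 0.5])              # 0.25 or 0.5
resolve_length_pct({"range": [0.25, 0.75]})  # anywhere in [0.25, 0.75]
```

Note how a list and a tuple take different branches, which is exactly why YAML (which has no tuple type) needs the dict form to express a range.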
### Available Functions

The `function` field must match the name of a component function in the package (e.g., `gaussian_noise`, `peak`, `random_walk`). Use `list_signal_components()` and `list_feature_components()` to discover available functions programmatically. See the Usage Guide for details.
## Loading Configurations

The `load_builders_from_config()` function provides several ways to load configurations:
```python
from xaitimesynth.parser import load_builders_from_config

# Load all datasets from a file
builders = load_builders_from_config(config_path="config.yaml")

# Load from a nested path within the file
# (useful if you organize datasets under categories)
builders = load_builders_from_config(
    config_path="config.yaml",
    path_key="experiments/ablation_study",
)

# Load only specific datasets by name
builders = load_builders_from_config(
    config_path="config.yaml",
    dataset_names=["dataset_a", "dataset_b"],
)

# Load from a Python dictionary (useful for testing)
builders = load_builders_from_config(config_dict=my_config_dict)

# Load from a YAML string
builders = load_builders_from_config(config_str=yaml_string)
```
The function returns a dictionary mapping dataset names to `TimeSeriesBuilder` instances.
## Reusing Configuration with YAML Anchors
YAML has built-in support for reusing configuration blocks. This is helpful when multiple datasets share common settings.
Use `&name` to define an anchor and `*name` to reference it. Use `<<:` to merge an anchor's contents:
```yaml
# Define common settings once
common: &common
  n_timesteps: 100
  n_samples: 500
  random_state: 42

# Define reusable signal configurations
gaussian_background: &gaussian_background
  function: gaussian_noise
  params: { sigma: 1.0 }

# Use anchors in dataset definitions
dataset_a:
  <<: *common  # Merge common settings
  classes:
    - id: 0
      signals: [*gaussian_background]

dataset_b:
  <<: *common
  n_samples: 1000  # Override specific values
  classes:
    - id: 0
      signals: [*gaussian_background]
```
## Exporting Python Configurations to YAML

If you've built a dataset programmatically and want to save its configuration for later use, you can export it with `to_config()`:
```python
import yaml

from xaitimesynth import TimeSeriesBuilder, gaussian_noise, peak

# Define dataset in Python
builder = (
    TimeSeriesBuilder(n_timesteps=100, n_samples=200)
    .for_class(0)
    .add_signal(gaussian_noise(sigma=0.1))
    .for_class(1)
    .add_signal(gaussian_noise(sigma=0.1))
    .add_feature(peak(amplitude=1.0), start_pct=0.3, end_pct=0.6)
)

# Export to dictionary
config = builder.to_config()

# Save to YAML file
with open("config.yaml", "w") as f:
    yaml.dump({"my_dataset": config}, f)
```
This enables round-trip conversion: define in Python, save to YAML, reload later.
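For the builder above, the saved file would look roughly like this (the exact keys, defaults, and ordering depend on `to_config()` and the YAML dumper):

```yaml
my_dataset:
  n_timesteps: 100
  n_samples: 200
  classes:
    - id: 0
      signals:
        - function: gaussian_noise
          params: { sigma: 0.1 }
    - id: 1
      signals:
        - function: gaussian_noise
          params: { sigma: 0.1 }
      features:
        - function: peak
          params: { amplitude: 1.0 }
          start_pct: 0.3
          end_pct: 0.6
```

Passing this file back through `load_builders_from_config()` reconstructs an equivalent builder.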