Datasets

Ready-made dataset generators. Each function returns the standard xaitimesynth dictionary with ground-truth feature masks included.

generate_cylinder_bell_funnel(n_samples: int = 300, n_timesteps: int = 128, weights: Optional[List[float]] = None, random_state: Optional[int] = None, normalization: str = 'none', data_format: str = 'channels_first') -> Dict[str, Any]

Generate a Cylinder-Bell-Funnel (CBF) dataset with ground-truth feature masks.

Recreates the classic CBF time series benchmark (Saito, 2000) using xaitimesynth's builder, so each sample comes with a boolean feature mask (stored under "feature_masks") that marks the exact window containing the class-discriminating pattern.

The three classes differ only inside a randomly placed window [a, b]:

Cylinder (0):  constant plateau of amplitude (6 + η)
Bell     (1):  linearly increasing ramp  0  → (6 + η)
Funnel   (2):  linearly decreasing ramp (6 + η) → 0

Outside [a, b] all classes share the same Gaussian noise background ε(t) ~ N(0,1). The amplitude noise η ~ N(0,1) is drawn fresh for every sample.
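The three class shapes can be reproduced with plain NumPy. The sketch below is a standalone illustration, not the library's code: the helper name cbf_sample is hypothetical, the window is treated as the half-open index range a to b-1, and it assumes the pattern is added on top of the noise background.

```python
import numpy as np


def cbf_sample(kind, a, b, n_timesteps=128, rng=None):
    """Return (series, mask) for one CBF-style sample (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    length = b - a
    eta = rng.standard_normal()                 # amplitude noise, one draw per sample
    series = rng.standard_normal(n_timesteps)   # background noise eps(t) ~ N(0, 1)

    if kind == "cylinder":
        pattern = np.full(length, 6.0 + eta)            # constant plateau
    elif kind == "bell":
        pattern = np.linspace(0.0, 6.0 + eta, length)   # increasing ramp
    elif kind == "funnel":
        pattern = np.linspace(6.0 + eta, 0.0, length)   # decreasing ramp
    else:
        raise ValueError(f"unknown class: {kind!r}")

    mask = np.zeros(n_timesteps, dtype=bool)
    mask[a:b] = True             # ground-truth window
    series[a:b] += pattern       # assumption: pattern is added to the noise
    return series, mask


rng = np.random.default_rng(0)
series, mask = cbf_sample("bell", a=20, b=60, rng=rng)
```

Here mask marks exactly the 40 timesteps that carry the ramp, which is the role the feature mask plays in the generated dataset.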

Approximation vs. original: The original formulation draws a ~ Uniform[16, 32] (the window never starts before timestep 16) and b - a ~ Uniform[32, 96]. This implementation samples the window length uniformly from 25% to 75% of the series (length_pct=(0.25, 0.75)), which is [32, 96] timesteps at the default n_timesteps=128, and places it at a fully random start position, so the window can begin at timestep 0. The length distribution matches the original at the default length; the start distribution is wider. For XAI benchmarking the ground-truth mask is what matters, so this difference is intentional.
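The difference between the two start distributions is easy to see by sampling both. This is a rough sketch under stated assumptions: it uses integer-valued uniform draws, and it assumes the "fully random" start is uniform over every position where the window still fits inside the series. How the builder actually places the window is not shown in this page.

```python
import numpy as np

rng = np.random.default_rng(0)
n_timesteps, n_draws = 128, 10_000

# Original formulation: a ~ Uniform[16, 32], b - a ~ Uniform[32, 96].
a_orig = rng.integers(16, 33, size=n_draws)      # upper bound is exclusive
len_orig = rng.integers(32, 97, size=n_draws)

# Described implementation: same length range, start anywhere the window fits.
# Assumption: start ~ Uniform[0, n_timesteps - length].
len_impl = rng.integers(32, 97, size=n_draws)
a_impl = rng.integers(0, n_timesteps - len_impl + 1)

print("original start range:      ", a_orig.min(), a_orig.max())
print("implementation start range:", a_impl.min(), a_impl.max())
```

Under these assumptions the original starts stay within 16 to 32, while the sampled starts cover a much wider range beginning at 0.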

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_samples | int | Total number of time series to generate. | 300 |
| n_timesteps | int | Length of each time series. | 128 |
| weights | list of float | Sampling weight for each of the three classes [w_cylinder, w_bell, w_funnel]. Each weight must be positive; the weights are normalised internally, so they need not sum to 1. If None, classes are balanced (weight 1/3 each). | None |
| random_state | int | Seed for reproducibility. | None |
| normalization | str | Normalisation applied to each generated series. "none" preserves the raw CBF signal values (recommended for comparison with the original). Other options: "zscore", "minmax". | 'none' |
| data_format | str | Output tensor layout. "channels_first" gives X shape (n_samples, 1, n_timesteps) (PyTorch convention); "channels_last" gives (n_samples, n_timesteps, 1). | 'channels_first' |
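The two data_format layouts hold the same values with the axes reordered, so an array in one layout can be converted to the other with NumPy alone. The sketch below uses a zero-filled stand-in array rather than a generated dataset; the variable names are illustrative.

```python
import numpy as np

n_samples, n_timesteps = 90, 128

# Stand-in for dataset["X"] generated with data_format="channels_first".
X_first = np.zeros((n_samples, 1, n_timesteps))

# Swap the channel and time axes to get the "channels_last" layout.
X_last = np.transpose(X_first, (0, 2, 1))

# Drop the single channel to get a 2-D array with the same shape as a feature mask.
X_flat = X_first[:, 0, :]

print(X_last.shape, X_flat.shape)
```

The 2-D view is convenient when comparing per-timestep values against a mask of shape (n_samples, n_timesteps).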

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dict | Dict[str, Any] | Standard xaitimesynth dataset dictionary with the keys listed below. |

  • "X": numpy array of shape (n_samples, 1, n_timesteps) (or channels-last equivalent).
  • "y": numpy array of shape (n_samples,) with class labels 0 (Cylinder), 1 (Bell), 2 (Funnel).
  • "feature_masks": dict mapping feature name → boolean array of shape (n_samples, n_timesteps). True where the class-discriminating window is located.
  • "metadata": generation metadata dict.
  • "components": per-sample component breakdown.
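One way a boolean mask of shape (n_samples, n_timesteps) can be used is to score an attribution map against the ground truth. The sketch below is only an illustration with hand-made arrays: the function mass_in_mask and the metric itself are hypothetical and are not part of xaitimesynth.

```python
import numpy as np


def mass_in_mask(attributions, mask):
    """Per-sample fraction of absolute attribution that falls inside the mask."""
    abs_attr = np.abs(attributions)
    total = abs_attr.sum(axis=1)
    inside = (abs_attr * mask).sum(axis=1)
    # Samples with zero total attribution get a score of 0 instead of NaN.
    return np.divide(inside, total, out=np.zeros_like(total, dtype=float), where=total > 0)


# Two toy samples of 8 timesteps each.
mask = np.zeros((2, 8), dtype=bool)
mask[0, 2:5] = True
mask[1, 4:8] = True

attributions = np.zeros((2, 8))
attributions[0, 2:5] = 1.0   # all attribution inside the window
attributions[1, :] = 0.5     # attribution spread evenly over the series

scores = mass_in_mask(attributions, mask)
print(scores)
```

A score near 1 means the explanation concentrates on the class-discriminating window; the second toy sample scores 0.5 because its window covers half of the series.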

Raises:

| Type | Description |
| --- | --- |
| ValueError | If weights has the wrong length or contains non-positive values. |
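The validation and normalisation rules for weights can be tried in isolation. The helper below mirrors the checks in the source listing as a standalone function; the name normalise_class_weights and the n_classes parameter are illustrative and do not exist in the library.

```python
def normalise_class_weights(weights=None, n_classes=3):
    """Validate class weights and rescale them to sum to 1 (illustrative helper)."""
    if weights is None:
        return [1 / n_classes] * n_classes   # balanced classes
    weights = list(weights)
    if len(weights) != n_classes:
        raise ValueError(
            f"weights must have exactly {n_classes} elements, got {len(weights)}"
        )
    if any(w <= 0 for w in weights):
        raise ValueError("All weights must be positive")
    total = sum(weights)
    return [w / total for w in weights]


# Weights need not sum to 1: [2, 1, 1] becomes [0.5, 0.25, 0.25].
normalised = normalise_class_weights([2, 1, 1])
print(normalised)
```

Passing two weights, or a zero or negative weight, raises ValueError in this sketch, matching the conditions listed above.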

References

Saito, N. (2000). Local feature extraction and its applications using a library of bases. Topics in Analysis and Its Applications: Selected Theses, 269–451. World Scientific.

Example

>>> dataset = generate_cylinder_bell_funnel(n_samples=90, random_state=42)
>>> X, y = dataset["X"], dataset["y"]
>>> X.shape
(90, 1, 128)
>>> import numpy as np
>>> np.bincount(y)
array([30, 30, 30])
>>> masks = dataset["feature_masks"]

Source code in xaitimesynth/datasets.py
def generate_cylinder_bell_funnel(
    n_samples: int = 300,
    n_timesteps: int = 128,
    weights: Optional[List[float]] = None,
    random_state: Optional[int] = None,
    normalization: str = "none",
    data_format: str = "channels_first",
) -> Dict[str, Any]:
    """Generate a Cylinder-Bell-Funnel (CBF) dataset with ground-truth feature masks.

    Recreates the classic CBF time series benchmark (Saito, 2000) using
    xaitimesynth's builder, so each sample comes with a boolean feature mask
    (stored under ``"feature_masks"``) that marks the exact window containing
    the class-discriminating pattern.

    The three classes differ only inside a randomly placed window [a, b]:

    .. code-block:: text

        Cylinder (0):  constant plateau of amplitude (6 + η)
        Bell     (1):  linearly increasing ramp  0  → (6 + η)
        Funnel   (2):  linearly decreasing ramp (6 + η) → 0

    Outside [a, b] all classes share the same Gaussian noise background ε(t) ~ N(0,1).
    The amplitude noise η ~ N(0,1) is drawn fresh for every sample.

    **Approximation vs. original:**
    The original formulation draws ``a ~ Uniform[16, 32]`` (window never starts
    before timestep 16) and ``b - a ~ Uniform[32, 96]``.  This implementation
    samples the window *length* uniformly from 25% to 75% of the series
    (``length_pct=(0.25, 0.75)``), which is [32, 96] timesteps at the default
    ``n_timesteps=128``, and places it at a *fully random* start position, so
    the window can begin at timestep 0.  The length distribution matches the
    original at the default length; the start distribution is wider.  For XAI
    benchmarking the ground-truth mask is what matters, so this difference is
    intentional.

    Args:
        n_samples (int): Total number of time series to generate. Default 300.
        n_timesteps (int): Length of each time series. Default 128.
        weights (list of float, optional): Sampling weight for each of the three
            classes ``[w_cylinder, w_bell, w_funnel]``. Each weight must be
            positive; the weights are normalised internally, so they need not
            sum to 1. If ``None``, classes are balanced (weight 1/3 each).
            Default None.
        random_state (int, optional): Seed for reproducibility. Default None.
        normalization (str): Normalisation applied to each generated series.
            ``"none"`` preserves the raw CBF signal values (recommended for
            comparison with the original). Other options: ``"zscore"``,
            ``"minmax"``. Default ``"none"``.
        data_format (str): Output tensor layout.  ``"channels_first"`` gives
            ``X`` shape ``(n_samples, 1, n_timesteps)`` (PyTorch convention);
            ``"channels_last"`` gives ``(n_samples, n_timesteps, 1)``.
            Default ``"channels_first"``.

    Returns:
        dict: Standard xaitimesynth dataset dictionary with keys:

            - ``"X"``: numpy array of shape ``(n_samples, 1, n_timesteps)``
              (or channels-last equivalent).
            - ``"y"``: numpy array of shape ``(n_samples,)`` with class labels
              0 (Cylinder), 1 (Bell), 2 (Funnel).
            - ``"feature_masks"``: dict mapping feature name → boolean array of
              shape ``(n_samples, n_timesteps)``.  ``True`` where the
              class-discriminating window is located.
            - ``"metadata"``: generation metadata dict.
            - ``"components"``: per-sample component breakdown.

    Raises:
        ValueError: If ``weights`` has the wrong length or contains non-positive values.

    References:
        Saito, N. (2000). Local feature extraction and its applications using a
        library of bases. *Topics in Analysis and Its Applications: Selected
        Theses*, 269–451. World Scientific.

    Example:
        >>> dataset = generate_cylinder_bell_funnel(n_samples=90, random_state=42)
        >>> X, y = dataset["X"], dataset["y"]
        >>> X.shape
        (90, 1, 128)
        >>> import numpy as np
        >>> np.bincount(y)
        array([30, 30, 30])
        >>> masks = dataset["feature_masks"]
    """
    # --- Validate and normalise weights -------------------------------------
    if weights is None:
        weights = [1 / 3, 1 / 3, 1 / 3]
    else:
        weights = list(weights)
        if len(weights) != 3:
            raise ValueError(
                f"weights must have exactly 3 elements (one per class), got {len(weights)}"
            )
        if any(w <= 0 for w in weights):
            raise ValueError("All weights must be positive")
        total = sum(weights)
        weights = [w / total for w in weights]

    # --- Per-sample feature generators -------------------------------------
    # η ~ N(0,1) is drawn from rng so it varies across samples.

    def _cylinder(n_timesteps, rng, length, **kwargs):
        """Constant level shift of amplitude (6 + η)."""
        eta = rng.randn()
        return np.full(length, 6.0 + eta)

    def _bell(n_timesteps, rng, length, **kwargs):
        """Linearly increasing ramp from 0 to (6 + η)."""
        eta = rng.randn()
        return np.linspace(0, 6.0 + eta, length)

    def _funnel(n_timesteps, rng, length, **kwargs):
        """Linearly decreasing ramp from (6 + η) to 0."""
        eta = rng.randn()
        return np.linspace(6.0 + eta, 0, length)

    # --- Build dataset ------------------------------------------------------
    dataset = (
        TimeSeriesBuilder(
            n_timesteps=n_timesteps,
            n_samples=n_samples,
            normalization=normalization,
            random_state=random_state,
            data_format=data_format,
        )
        .for_class(0, weight=weights[0])  # Cylinder
        .add_signal(gaussian_noise(mu=0, sigma=1))
        .add_feature(
            manual(generator=_cylinder),
            random_location=True,
            length_pct=(0.25, 0.75),  # b - a ~ Uniform[32, 96] when n_timesteps=128
        )
        .for_class(1, weight=weights[1])  # Bell
        .add_signal(gaussian_noise(mu=0, sigma=1))
        .add_feature(
            manual(generator=_bell),
            random_location=True,
            length_pct=(0.25, 0.75),
        )
        .for_class(2, weight=weights[2])  # Funnel
        .add_signal(gaussian_noise(mu=0, sigma=1))
        .add_feature(
            manual(generator=_funnel),
            random_location=True,
            length_pct=(0.25, 0.75),
        )
        .build()
    )

    return dataset