
Metrics

Functions for evaluating XAI attribution methods against ground truth feature masks.

Evaluation Functions

auc_roc_score(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro', normalize: bool = False) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]

Area Under the ROC Curve for attribution ranking.

Measures how well attributions discriminate between ground truth and non-ground-truth timesteps. Score of 0.5 = random, 1.0 = perfect.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |
| `normalize` | `bool` | If True, normalize to [-1, 1] range: `(AUC - 0.5) / 0.5`. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) in range [0, 1] (or [-1, 1] if normalized). Higher is better. |

References

Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition letters, 27(8), 861-874.

Source code in xaitimesynth/metrics.py
def auc_roc_score(
    attributions: np.ndarray,
    dataset: Dict,
    sample_indices: Optional[List[int]] = None,
    dim_indices: Optional[List[int]] = None,
    average: Optional[str] = "macro",
    normalize: bool = False,
) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]:
    """Area Under the ROC Curve for attribution ranking.

    Measures how well attributions discriminate between ground truth
    and non-ground-truth timesteps. Score of 0.5 = random, 1.0 = perfect.

    Args:
        attributions: Attribution values, shape (n_samples, n_timesteps, n_dims).
        dataset: Dataset dictionary from TimeSeriesBuilder.build().
        sample_indices: Which samples to evaluate. Defaults to all.
        dim_indices: Which dimensions to evaluate. Defaults to all.
        average: Aggregation method:
            - 'macro': Mean across all samples and dimensions -> float
            - 'per_sample': Mean per sample across dimensions -> Dict[sample_idx, score]
            - 'per_dimension': Mean per dimension across samples -> Dict[dim_idx, score]
            - None: No aggregation -> Dict[(sample_idx, dim_idx), score]
        normalize: If True, normalize to [-1, 1] range: (AUC - 0.5) / 0.5

    Returns:
        Score(s) in range [0, 1] (or [-1, 1] if normalized). Higher is better.

    References:
        Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition
        letters, 27(8), 861-874.
    """
    attr, masks, sample_indices, dim_indices = _prepare_inputs(
        attributions, dataset, sample_indices, dim_indices
    )

    results = {}
    for i, s in enumerate(sample_indices):
        for j, d in enumerate(dim_indices):
            a, m = attr[i, :, j], masks[i, :, j]

            if np.all(m) or not np.any(m):
                auc = 0.5
            else:
                # Compute AUC-ROC via trapezoidal rule
                thresholds = np.unique(a)
                thresholds = np.append(thresholds, thresholds.max() + 1)

                n_pos, n_neg = np.sum(m), np.sum(~m)
                tpr_list, fpr_list = [], []

                for thresh in sorted(thresholds, reverse=True):
                    pred = a >= thresh
                    tpr_list.append(np.sum(pred & m) / n_pos)
                    fpr_list.append(np.sum(pred & ~m) / n_neg)

                tpr = np.array(tpr_list)
                fpr = np.array(fpr_list)

                # Sort by FPR for proper integration
                order = np.argsort(fpr)
                auc = float(np.trapezoid(tpr[order], fpr[order]))

            if normalize:
                auc = (auc - 0.5) / 0.5

            results[(s, d)] = auc

    return _aggregate_results(results, sample_indices, dim_indices, average)
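
For intuition, the quantity computed per sample-dimension pair can be sketched via the rank-based (Mann-Whitney) equivalence of AUC-ROC, which agrees with the trapezoidal integration in the source. This toy snippet is purely illustrative and does not use the library API:

```python
import numpy as np

def auc_rank(attr, mask):
    """AUC-ROC via the Mann-Whitney equivalence: the probability that a
    randomly chosen ground-truth timestep outranks a non-ground-truth one."""
    pos, neg = attr[mask], attr[~mask]
    # Count pairwise wins; ties count as half a win.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

mask = np.array([True, True, False, False])
print(auc_rank(np.array([0.9, 0.8, 0.1, 0.2]), mask))  # perfect ranking -> 1.0
print(auc_rank(np.array([0.1, 0.2, 0.9, 0.8]), mask))  # inverted ranking -> 0.0
```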

auc_pr_score(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro', normalize: bool = False) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]

Area Under the Precision-Recall Curve for attribution ranking.

Measures precision-recall trade-off at different thresholds. Particularly useful for sparse ground truth (low prevalence). Baseline = prevalence.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |
| `normalize` | `bool` | If True, normalize relative to prevalence: `(AUC - prevalence) / (1 - prevalence)`. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) in range [0, 1]. Higher is better. Baseline = prevalence. |

Source code in xaitimesynth/metrics.py
def auc_pr_score(
    attributions: np.ndarray,
    dataset: Dict,
    sample_indices: Optional[List[int]] = None,
    dim_indices: Optional[List[int]] = None,
    average: Optional[str] = "macro",
    normalize: bool = False,
) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]:
    """Area Under the Precision-Recall Curve for attribution ranking.

    Measures precision-recall trade-off at different thresholds. Particularly
    useful for sparse ground truth (low prevalence). Baseline = prevalence.

    Args:
        attributions: Attribution values, shape (n_samples, n_timesteps, n_dims).
        dataset: Dataset dictionary from TimeSeriesBuilder.build().
        sample_indices: Which samples to evaluate. Defaults to all.
        dim_indices: Which dimensions to evaluate. Defaults to all.
        average: Aggregation method:
            - 'macro': Mean across all samples and dimensions -> float
            - 'per_sample': Mean per sample across dimensions -> Dict[sample_idx, score]
            - 'per_dimension': Mean per dimension across samples -> Dict[dim_idx, score]
            - None: No aggregation -> Dict[(sample_idx, dim_idx), score]
        normalize: If True, normalize relative to prevalence:
            (AUC - prevalence) / (1 - prevalence)

    Returns:
        Score(s) in range [0, 1]. Higher is better. Baseline = prevalence.
    """
    attr, masks, sample_indices, dim_indices = _prepare_inputs(
        attributions, dataset, sample_indices, dim_indices
    )

    results = {}
    for i, s in enumerate(sample_indices):
        for j, d in enumerate(dim_indices):
            a, m = attr[i, :, j], masks[i, :, j]

            n_pos = np.sum(m)
            prevalence = n_pos / m.size if m.size > 0 else 0.0

            if n_pos == m.size:
                auc = 1.0
            elif n_pos == 0:
                auc = 0.0
            else:
                thresholds = np.unique(a)
                thresholds = np.append(thresholds, thresholds.max() + 1)

                prec_list, rec_list = [], []
                for thresh in sorted(thresholds, reverse=True):
                    pred = a >= thresh
                    tp = np.sum(pred & m)
                    fp = np.sum(pred & ~m)
                    prec = tp / (tp + fp) if (tp + fp) > 0 else 1.0
                    rec = tp / n_pos
                    prec_list.append(prec)
                    rec_list.append(rec)

                prec = np.array(prec_list)
                rec = np.array(rec_list)

                # Sort by recall, keep max precision per recall
                order = np.argsort(rec)
                rec_sorted, prec_sorted = rec[order], prec[order]

                unique_rec = np.unique(rec_sorted)
                unique_prec = np.array(
                    [np.max(prec_sorted[rec_sorted == r]) for r in unique_rec]
                )

                # Add (0, 1) anchor point
                if unique_rec[0] != 0:
                    unique_rec = np.concatenate([[0], unique_rec])
                    unique_prec = np.concatenate([[1.0], unique_prec])

                auc = float(np.trapezoid(unique_prec, unique_rec))

            if normalize:
                if prevalence >= 1.0:
                    auc = 0.0
                else:
                    auc = (auc - prevalence) / (1.0 - prevalence)

            results[(s, d)] = auc

    return _aggregate_results(results, sample_indices, dim_indices, average)
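
As a toy illustration of the precision-recall quantity, the snippet below computes average precision, a close cousin of the max-interpolated trapezoidal estimate used in the source (the two estimators can differ slightly on small inputs). It is standalone and does not use the library API:

```python
import numpy as np

def average_precision(attr, mask):
    """Mean of the precision values at each ground-truth hit when
    timesteps are ranked by attribution (descending)."""
    order = np.argsort(-attr)
    hits = mask[order]
    ranks = np.arange(1, len(attr) + 1)
    precision_at_rank = np.cumsum(hits) / ranks
    return precision_at_rank[hits].mean()

attr = np.array([0.9, 0.8, 0.7, 0.1])
mask = np.array([True, False, True, False])
# Positives land at ranks 1 and 3: (1/1 + 2/3) / 2 = 0.8333...
print(average_precision(attr, mask))
```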

nac_score(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro', ground_truth_only: bool = True) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]

Normalized Attribution Correspondence (z-score at ground truth).

Standardizes attributions (z-score), then takes mean at ground truth locations. Positive = attributions elevated at features. Negative = inverse.

Also known as Normalised Scanpath Saliency (NSS) in eye-tracking literature.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |
| `ground_truth_only` | `bool` | If True, evaluate at mask locations. If False, evaluate at non-mask locations (useful for checking background). | `True` |

Returns:

| Type | Description |
| --- | --- |
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) with no fixed range. Positive = good, negative = inverted. |

References

Peters et al. (2005). Components of bottom-up gaze allocation in natural images. Vision Research, 45(18), 2397-2416.

Source code in xaitimesynth/metrics.py
def nac_score(
    attributions: np.ndarray,
    dataset: Dict,
    sample_indices: Optional[List[int]] = None,
    dim_indices: Optional[List[int]] = None,
    average: Optional[str] = "macro",
    ground_truth_only: bool = True,
) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]:
    """Normalized Attribution Correspondence (z-score at ground truth).

    Standardizes attributions (z-score), then takes mean at ground truth
    locations. Positive = attributions elevated at features. Negative = inverse.

    Also known as Normalised Scanpath Saliency (NSS) in eye-tracking literature.

    Args:
        attributions: Attribution values, shape (n_samples, n_timesteps, n_dims).
        dataset: Dataset dictionary from TimeSeriesBuilder.build().
        sample_indices: Which samples to evaluate. Defaults to all.
        dim_indices: Which dimensions to evaluate. Defaults to all.
        average: Aggregation method:
            - 'macro': Mean across all samples and dimensions -> float
            - 'per_sample': Mean per sample across dimensions -> Dict[sample_idx, score]
            - 'per_dimension': Mean per dimension across samples -> Dict[dim_idx, score]
            - None: No aggregation -> Dict[(sample_idx, dim_idx), score]
        ground_truth_only: If True, evaluate at mask locations. If False,
            evaluate at non-mask locations (useful for checking background).

    Returns:
        Score(s) with no fixed range. Positive = good, negative = inverted.

    References:
        Peters et al. (2005). Components of bottom-up gaze allocation in
        natural images. Vision Research, 45(18), 2397-2416.
    """
    attr, masks, sample_indices, dim_indices = _prepare_inputs(
        attributions, dataset, sample_indices, dim_indices
    )

    results = {}
    for i, s in enumerate(sample_indices):
        for j, d in enumerate(dim_indices):
            a, m = attr[i, :, j], masks[i, :, j]

            region = m if ground_truth_only else ~m

            if not np.any(region):
                results[(s, d)] = 0.0
            else:
                std = np.std(a)
                if std == 0:
                    results[(s, d)] = 0.0
                else:
                    z = (a - np.mean(a)) / std
                    results[(s, d)] = float(np.mean(z[region]))

    return _aggregate_results(results, sample_indices, dim_indices, average)
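
The core of the score is a z-score over the full series, averaged inside the mask, exactly as in the source loop. A standalone toy example (not using the library API):

```python
import numpy as np

def nac(attr, mask):
    """Mean z-scored attribution over the ground-truth region."""
    z = (attr - attr.mean()) / attr.std()
    return z[mask].mean()

attr = np.array([3.0, 3.0, 1.0, 1.0])
mask = np.array([True, True, False, False])
# Mask region sits exactly one standard deviation above the mean.
print(nac(attr, mask))   # -> 1.0
print(nac(attr, ~mask))  # background check -> -1.0
```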

relevance_mass_accuracy(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro') -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]

Ratio of attribution mass inside ground truth regions.

Measures what fraction of total attribution "mass" falls within the ground truth mask. Higher is better (1.0 = all mass inside mask).

Formula: sum(attr[mask]) / sum(attr)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |

Returns:

| Type | Description |
| --- | --- |
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) in range [0, 1]. Higher is better. |

References

Arras et al. (2022). CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations. Information Fusion, 81, 14-40.

Source code in xaitimesynth/metrics.py
def relevance_mass_accuracy(
    attributions: np.ndarray,
    dataset: Dict,
    sample_indices: Optional[List[int]] = None,
    dim_indices: Optional[List[int]] = None,
    average: Optional[str] = "macro",
) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]:
    """Ratio of attribution mass inside ground truth regions.

    Measures what fraction of total attribution "mass" falls within the
    ground truth mask. Higher is better (1.0 = all mass inside mask).

    Formula: sum(attr[mask]) / sum(attr)

    Args:
        attributions: Attribution values, shape (n_samples, n_timesteps, n_dims).
        dataset: Dataset dictionary from TimeSeriesBuilder.build().
        sample_indices: Which samples to evaluate. Defaults to all.
        dim_indices: Which dimensions to evaluate. Defaults to all.
        average: Aggregation method:
            - 'macro': Mean across all samples and dimensions -> float
            - 'per_sample': Mean per sample across dimensions -> Dict[sample_idx, score]
            - 'per_dimension': Mean per dimension across samples -> Dict[dim_idx, score]
            - None: No aggregation -> Dict[(sample_idx, dim_idx), score]

    Returns:
        Score(s) in range [0, 1]. Higher is better.

    References:
        Arras et al. (2022). CLEVR-XAI: A benchmark dataset for the ground truth
        evaluation of neural network explanations. Information Fusion, 81, 14-40.
    """
    attr, masks, sample_indices, dim_indices = _prepare_inputs(
        attributions, dataset, sample_indices, dim_indices
    )

    results = {}
    for i, s in enumerate(sample_indices):
        for j, d in enumerate(dim_indices):
            a, m = attr[i, :, j], masks[i, :, j]
            total = np.sum(a)
            results[(s, d)] = float(np.sum(a[m]) / total) if total > 0 else 0.0

    return _aggregate_results(results, sample_indices, dim_indices, average)
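
The formula is a one-liner; note it implicitly assumes non-negative attributions (signed attribution maps are often passed through `np.abs` first, though that preprocessing is a convention, not something this function does). A standalone toy example:

```python
import numpy as np

attr = np.array([4.0, 4.0, 1.0, 1.0])
mask = np.array([True, True, False, False])

# Fraction of total attribution mass inside the ground truth mask.
rma = attr[mask].sum() / attr.sum()
print(rma)  # 8 of 10 units of mass inside the mask -> 0.8
```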

relevance_rank_accuracy(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro') -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]

Fraction of top-K attributions that fall within ground truth.

Selects K timesteps with highest attribution (where K = mask size), then measures what fraction of these are actually in the mask. Higher is better (1.0 = top-K perfectly matches mask).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |

Returns:

| Type | Description |
| --- | --- |
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) in range [0, 1]. Higher is better. |

References

Arras et al. (2022). CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations. Information Fusion, 81, 14-40.

Source code in xaitimesynth/metrics.py
def relevance_rank_accuracy(
    attributions: np.ndarray,
    dataset: Dict,
    sample_indices: Optional[List[int]] = None,
    dim_indices: Optional[List[int]] = None,
    average: Optional[str] = "macro",
) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]:
    """Fraction of top-K attributions that fall within ground truth.

    Selects K timesteps with highest attribution (where K = mask size),
    then measures what fraction of these are actually in the mask.
    Higher is better (1.0 = top-K perfectly matches mask).

    Args:
        attributions: Attribution values, shape (n_samples, n_timesteps, n_dims).
        dataset: Dataset dictionary from TimeSeriesBuilder.build().
        sample_indices: Which samples to evaluate. Defaults to all.
        dim_indices: Which dimensions to evaluate. Defaults to all.
        average: Aggregation method:
            - 'macro': Mean across all samples and dimensions -> float
            - 'per_sample': Mean per sample across dimensions -> Dict[sample_idx, score]
            - 'per_dimension': Mean per dimension across samples -> Dict[dim_idx, score]
            - None: No aggregation -> Dict[(sample_idx, dim_idx), score]

    Returns:
        Score(s) in range [0, 1]. Higher is better.

    References:
        Arras et al. (2022). CLEVR-XAI: A benchmark dataset for the ground truth
        evaluation of neural network explanations. Information Fusion, 81, 14-40.
    """
    attr, masks, sample_indices, dim_indices = _prepare_inputs(
        attributions, dataset, sample_indices, dim_indices
    )

    results = {}
    for i, s in enumerate(sample_indices):
        for j, d in enumerate(dim_indices):
            a, m = attr[i, :, j], masks[i, :, j]
            k = int(np.sum(m))
            if k == 0:
                results[(s, d)] = 0.0
            else:
                top_k = np.argpartition(-a, k - 1)[:k]
                results[(s, d)] = float(np.sum(m[top_k]) / k)

    return _aggregate_results(results, sample_indices, dim_indices, average)
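
The top-K selection mirrors the `np.argpartition` trick in the source (a partial sort: `argpartition(-a, k - 1)[:k]` returns the indices of the K largest values, in no particular order). A standalone toy example:

```python
import numpy as np

attr = np.array([0.9, 0.2, 0.8, 0.1, 0.3])
mask = np.array([True, True, False, False, False])

k = int(mask.sum())                        # K = mask size = 2
top_k = np.argpartition(-attr, k - 1)[:k]  # indices of the 2 largest attributions
score = mask[top_k].sum() / k
print(score)  # top-2 indices are {0, 2}; only index 0 is in the mask -> 0.5
```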

pointing_game(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro') -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]

Whether the maximum attribution falls within ground truth.

A simple binary check: is the single highest-attributed timestep inside the ground truth mask? Returns 1.0 if yes, 0.0 if no.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |

Returns:

| Type | Description |
| --- | --- |
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) of 0.0 or 1.0 per sample-dimension, aggregated per `average`. |

References

Zhang et al. (2018). Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10), 1084-1102.

Source code in xaitimesynth/metrics.py
def pointing_game(
    attributions: np.ndarray,
    dataset: Dict,
    sample_indices: Optional[List[int]] = None,
    dim_indices: Optional[List[int]] = None,
    average: Optional[str] = "macro",
) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]:
    """Whether the maximum attribution falls within ground truth.

    A simple binary check: is the single highest-attributed timestep
    inside the ground truth mask? Returns 1.0 if yes, 0.0 if no.

    Args:
        attributions: Attribution values, shape (n_samples, n_timesteps, n_dims).
        dataset: Dataset dictionary from TimeSeriesBuilder.build().
        sample_indices: Which samples to evaluate. Defaults to all.
        dim_indices: Which dimensions to evaluate. Defaults to all.
        average: Aggregation method:
            - 'macro': Mean across all samples and dimensions -> float
            - 'per_sample': Mean per sample across dimensions -> Dict[sample_idx, score]
            - 'per_dimension': Mean per dimension across samples -> Dict[dim_idx, score]
            - None: No aggregation -> Dict[(sample_idx, dim_idx), score]

    Returns:
        Score(s) of 0.0 or 1.0 per sample-dimension, aggregated per `average`.

    References:
        Zhang et al. (2018). Top-down neural attention by excitation backprop.
        International Journal of Computer Vision, 126(10), 1084-1102.
    """
    attr, masks, sample_indices, dim_indices = _prepare_inputs(
        attributions, dataset, sample_indices, dim_indices
    )

    results = {}
    for i, s in enumerate(sample_indices):
        for j, d in enumerate(dim_indices):
            a, m = attr[i, :, j], masks[i, :, j]
            results[(s, d)] = 1.0 if m[np.argmax(a)] else 0.0

    return _aggregate_results(results, sample_indices, dim_indices, average)
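
The per-pair check reduces to a single `argmax` lookup, as in the source. A standalone toy example:

```python
import numpy as np

mask = np.array([False, True, False])

# Hit: the highest attribution (index 1) lies inside the mask.
hit = 1.0 if mask[np.argmax(np.array([0.1, 0.9, 0.3]))] else 0.0
print(hit)  # -> 1.0

# Miss: the highest attribution (index 0) lies outside the mask.
miss = 1.0 if mask[np.argmax(np.array([0.9, 0.1, 0.3]))] else 0.0
print(miss)  # -> 0.0
```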