Metrics¶
Functions for evaluating XAI attribution methods against ground truth feature masks.
Evaluation Functions¶
auc_roc_score(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro', normalize: bool = False) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]
¶
Area Under the ROC Curve for attribution ranking.
Measures how well attributions discriminate between ground truth and non-ground-truth timesteps. Score of 0.5 = random, 1.0 = perfect.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |
| `normalize` | `bool` | If True, normalize to the `[-1, 1]` range: `(AUC - 0.5) / 0.5`. | `False` |

Returns:

| Type | Description |
|---|---|
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) in range `[0, 1]` (or `[-1, 1]` if normalized). Higher is better. |
References
Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition letters, 27(8), 861-874.
Source code in xaitimesynth/metrics.py
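The core computation can be sketched for a single (sample, dimension) slice using the rank-sum (Mann-Whitney U) identity for AUC. This is an illustrative numpy sketch of the metric's idea, not the library's own implementation; the helper name `auc_roc_1d` is hypothetical, and ties are not rank-averaged here:

```python
import numpy as np

def auc_roc_1d(attr, mask):
    """AUC-ROC for one 1-D attribution slice against a binary mask.

    Uses the rank-sum identity: AUC = (R_pos - n_pos*(n_pos+1)/2) / (n_pos*n_neg),
    where R_pos is the sum of 1-based ranks of the masked (positive) timesteps.
    """
    attr = np.asarray(attr, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    n_pos, n_neg = mask.sum(), (~mask).sum()
    # 1-based ranks by ascending attribution (ties not averaged in this sketch).
    order = attr.argsort()
    ranks = np.empty_like(attr)
    ranks[order] = np.arange(1, len(attr) + 1)
    return (ranks[mask].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

With `normalize=True`, the documented transform is simply `(auc - 0.5) / 0.5`, mapping a random ranking to 0 and a perfect one to 1.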
auc_pr_score(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro', normalize: bool = False) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]
¶
Area Under the Precision-Recall Curve for attribution ranking.
Measures precision-recall trade-off at different thresholds. Particularly useful for sparse ground truth (low prevalence). Baseline = prevalence.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |
| `normalize` | `bool` | If True, normalize relative to prevalence: `(AUC - prevalence) / (1 - prevalence)`. | `False` |

Returns:

| Type | Description |
|---|---|
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) in range `[0, 1]`. Higher is better. Baseline = prevalence. |
Source code in xaitimesynth/metrics.py
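The precision-recall idea can likewise be sketched for a single slice via average precision (the step-wise area under the PR curve): rank timesteps by attribution, then average the precision at each ground-truth hit. A minimal numpy sketch, with the hypothetical helper name `auc_pr_1d`:

```python
import numpy as np

def auc_pr_1d(attr, mask):
    """Average precision (step-wise AUC-PR) for one 1-D attribution slice."""
    attr = np.asarray(attr, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    order = np.argsort(-attr)              # rank timesteps by descending attribution
    hits = mask[order]                     # which ranked positions are ground truth
    tp = np.cumsum(hits)                   # true positives at each cutoff k
    precision = tp / np.arange(1, len(attr) + 1)
    return precision[hits].mean()          # mean precision at each ground-truth hit
```

The baseline for this score is the prevalence (`mask.mean()`), which is why the documented normalization rescales by `(AUC - prevalence) / (1 - prevalence)` rather than around 0.5.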
nac_score(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro', ground_truth_only: bool = True) -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]
¶
Normalized Attribution Correspondence (z-score at ground truth).
Standardizes attributions (z-score), then takes mean at ground truth locations. Positive = attributions elevated at features. Negative = inverse.
Also known as Normalised Scanpath Saliency (NSS) in eye-tracking literature.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |
| `ground_truth_only` | `bool` | If True, evaluate at mask locations. If False, evaluate at non-mask locations (useful for checking background). | `True` |

Returns:

| Type | Description |
|---|---|
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) with no fixed range. Positive = good, negative = inverted. |
References
Peters et al. (2005). Components of bottom-up gaze allocation in natural images. Vision Research, 45(18), 2397-2416.
Source code in xaitimesynth/metrics.py
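The two-step recipe (z-score, then average at the relevant locations) is short enough to sketch directly. An illustrative numpy version for one slice, with the hypothetical helper name `nac_1d`:

```python
import numpy as np

def nac_1d(attr, mask, ground_truth_only=True):
    """Normalized Attribution Correspondence for one 1-D attribution slice.

    Standardize attributions to zero mean / unit variance, then average the
    z-scores at mask locations (or at non-mask locations for a background check).
    """
    attr = np.asarray(attr, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    z = (attr - attr.mean()) / attr.std()
    return z[mask].mean() if ground_truth_only else z[~mask].mean()
```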
relevance_mass_accuracy(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro') -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]
¶
Ratio of attribution mass inside ground truth regions.
Measures what fraction of total attribution "mass" falls within the ground truth mask. Higher is better (1.0 = all mass inside mask).
Formula: sum(attr[mask]) / sum(attr)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |

Returns:

| Type | Description |
|---|---|
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) in range `[0, 1]`. Higher is better. |
References
Arras et al. (2022). CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations. Information Fusion, 81, 14-40.
Source code in xaitimesynth/metrics.py
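The stated formula `sum(attr[mask]) / sum(attr)` is a one-liner per slice. An illustrative numpy sketch (the helper name `relevance_mass_1d` is hypothetical):

```python
import numpy as np

def relevance_mass_1d(attr, mask):
    """Fraction of total attribution mass falling inside the ground truth mask."""
    attr = np.asarray(attr, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    return attr[mask].sum() / attr.sum()
```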
relevance_rank_accuracy(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro') -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]
¶
Fraction of top-K attributions that fall within ground truth.
Selects K timesteps with highest attribution (where K = mask size), then measures what fraction of these are actually in the mask. Higher is better (1.0 = top-K perfectly matches mask).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |

Returns:

| Type | Description |
|---|---|
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) in range `[0, 1]`. Higher is better. |
References
Arras et al. (2022). CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations. Information Fusion, 81, 14-40.
Source code in xaitimesynth/metrics.py
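The top-K selection can be sketched per slice in a few lines of numpy. Illustrative only (the helper name `relevance_rank_1d` is hypothetical, and tie-breaking among equal attributions follows `argsort` order):

```python
import numpy as np

def relevance_rank_1d(attr, mask):
    """Fraction of the top-K attributed timesteps (K = mask size) inside the mask."""
    attr = np.asarray(attr, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    k = int(mask.sum())                # K = number of ground-truth timesteps
    top_k = np.argsort(-attr)[:k]     # indices of the K largest attributions
    return mask[top_k].mean()
```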
pointing_game(attributions: np.ndarray, dataset: Dict, sample_indices: Optional[List[int]] = None, dim_indices: Optional[List[int]] = None, average: Optional[str] = 'macro') -> Union[float, Dict[int, float], Dict[Tuple[int, int], float]]
¶
Whether the maximum attribution falls within ground truth.
A simple binary check: is the single highest-attributed timestep inside the ground truth mask? Returns 1.0 if yes, 0.0 if no.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `attributions` | `ndarray` | Attribution values, shape `(n_samples, n_timesteps, n_dims)`. | *required* |
| `dataset` | `Dict` | Dataset dictionary from `TimeSeriesBuilder.build()`. | *required* |
| `sample_indices` | `Optional[List[int]]` | Which samples to evaluate. Defaults to all. | `None` |
| `dim_indices` | `Optional[List[int]]` | Which dimensions to evaluate. Defaults to all. | `None` |
| `average` | `Optional[str]` | Aggregation method: `'macro'` = mean across all samples and dimensions -> `float`; `'per_sample'` = mean per sample across dimensions -> `Dict[sample_idx, score]`; `'per_dimension'` = mean per dimension across samples -> `Dict[dim_idx, score]`; `None` = no aggregation -> `Dict[(sample_idx, dim_idx), score]`. | `'macro'` |

Returns:

| Type | Description |
|---|---|
| `Union[float, Dict[int, float], Dict[Tuple[int, int], float]]` | Score(s) of 0.0 or 1.0 per sample-dimension, aggregated according to `average`. |
References
Zhang et al. (2018). Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10), 1084-1102.
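The binary check described above reduces to an argmax lookup per slice. An illustrative numpy sketch (the helper name `pointing_game_1d` is hypothetical):

```python
import numpy as np

def pointing_game_1d(attr, mask):
    """1.0 if the single highest-attributed timestep lies inside the mask, else 0.0."""
    attr = np.asarray(attr, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    return float(mask[np.argmax(attr)])
```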