Behavioral Benchmarks

Behavioral benchmarks compare model behavioral responses to human psychophysical data from experiments measuring choices, reaction times, and other behavioral outputs.

Prerequisites: Complete Data Packaging first; you need a registered BehavioralAssembly.

Time: ~30 minutes to implement (more involved than neural benchmarks because you write __call__ yourself).


Quick Start: Minimal Template

Copy this template and modify for your data:

# vision/brainscore_vision/benchmarks/yourbenchmark/__init__.py
from brainscore_vision import benchmark_registry
from .benchmark import YourBehavioralBenchmark

benchmark_registry['YourBenchmark-accuracy'] = YourBehavioralBenchmark

# vision/brainscore_vision/benchmarks/yourbenchmark/benchmark.py
from brainscore_vision import load_dataset, load_stimulus_set, load_metric
from brainscore_vision.benchmarks import BenchmarkBase
from brainscore_vision.benchmark_helpers.screen import place_on_screen
from brainscore_vision.model_interface import BrainModel
from brainscore_vision.benchmark_helpers import bound_score

BIBTEX = """@article{YourName2024, ...}"""

class YourBehavioralBenchmark(BenchmarkBase):
    def __init__(self):
        self._assembly = load_dataset('YourDataset')
        self._fitting_stimuli = load_stimulus_set('YourDataset_fitting')
        self._metric = load_metric('accuracy')  # or 'i2n', 'error_consistency'
        self._visual_degrees = 8

        super().__init__(
            identifier='YourBenchmark-accuracy',
            version=1,
            ceiling_func=lambda: self._metric.ceiling(self._assembly),
            parent='behavior',
            bibtex=BIBTEX
        )

    def __call__(self, candidate: BrainModel):
        # 1. Scale fitting stimuli and set up task
        fitting_stimuli = place_on_screen(self._fitting_stimuli,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees)
        candidate.start_task(BrainModel.Task.probabilities, fitting_stimuli)

        # 2. Scale test stimuli and get predictions
        stimulus_set = place_on_screen(self._assembly.stimulus_set,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees)
        predictions = candidate.look_at(stimulus_set, number_of_trials=1)

        # 3. Compare to human data
        raw_score = self._metric(predictions, self._assembly)
        return bound_score(raw_score / self.ceiling)

See Registration and Testing below.


Behavioral Benchmark Checklist

Before submitting your behavioral benchmark:

  • [ ] Inherits from BenchmarkBase directly (not a helper class)
  • [ ] Implements __call__ method
  • [ ] Uses fitting stimuli separate from test stimuli
  • [ ] Uses start_task() to configure behavioral task
  • [ ] Implements ceiling calculation (split-half or cross-subject)
  • [ ] Includes proper bibtex citation
  • [ ] Tests verify expected scores for known models
  • [ ] Considers chance correction if applicable (e.g., 1/N for N-way choice; see the sketch after this checklist)
  • [ ] Uses bound_score() to clamp scores between [0, 1]
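
If chance correction applies to your task, a minimal sketch is shown below (the helper name and the rescaling choice are illustrative, not a Brain-Score API):

# Hypothetical helper: rescale raw accuracy so that chance performance maps to 0.
# For an N-way choice task, chance level is 1/N.
def chance_corrected_accuracy(raw_accuracy: float, num_choices: int) -> float:
    chance = 1.0 / num_choices
    return max(0.0, (raw_accuracy - chance) / (1.0 - chance))

# Example: 60% raw accuracy on a 2-way task is only 20% above-chance performance.
assert abs(chance_corrected_accuracy(0.60, 2) - 0.20) < 1e-9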

Which Metric Should I Use?

┌──────────────────────────────────────────────────────────────────────────────
│  What aspect of behavior are you comparing?
│
│  ├── Overall correctness
│  │   └── Use Accuracy
│  │       Example: Geirhos2021
│  │
│  ├── Which specific images are hard/easy
│  │   └── Use I2N (image-level consistency)
│  │       Example: Rajalingham2018
│  │
│  ├── Error patterns (accounting for chance)
│  │   └── Use Error Consistency (Cohen's Kappa)
│  │       Example: Geirhos2021
│  │
│  ├── Derived scalar values (integrals, thresholds)
│  │   └── Use Value Delta
│  │       Example: Ferguson2024
│  │
│  └── Temporal dynamics of recognition
│      └── Use OST (Object Solution Times)
│          Example: Kar2019
└──────────────────────────────────────────────────────────────────────────────

Tutorial Overview

This tutorial covers everything you need to build a behavioral benchmark:

  1. Why no helper class - Understanding the diversity of behavioral data
  2. Full examples - Annotated code from Rajalingham2018 and Ferguson2024
  3. Metrics - Choosing between Accuracy, I2N, Error Consistency, Value Delta, and OST
  4. Fitting stimuli - When and how to provide training data
  5. Registration & testing - Getting your benchmark into Brain-Score

Example Benchmarks

| Benchmark | Description | Task Type | Metric |
|---|---|---|---|
| Rajalingham2018 | Human vs model object recognition patterns | Probabilities | I2N |
| Ferguson2024 | Visual search asymmetry tasks | Probabilities | Value Delta |
| Hebart2023 | Odd-one-out similarity judgments | Odd-one-out | Accuracy |
| Geirhos2021 | Shape vs texture bias measurements | Label | Bias comparison |

Key Characteristics:
- Use BehavioralAssembly data structures
- Employ behavioral metrics (accuracy, similarity, consistency)
- Focus on choice patterns and response distributions
- No BehavioralBenchmark helper class - inherit from BenchmarkBase directly


Inheritance Structure

Unlike neural benchmarks, behavioral benchmarks do not have a helper class. You inherit directly from BenchmarkBase:

┌──────────────────────────────────────────────────────────────────────────────
│  Benchmark (ABC)
│  └── BenchmarkBase(Benchmark)
│      └── Your behavioral benchmark (implements __call__ directly)
│
│  Note: No BehavioralBenchmark helper class exists!
│  You implement __call__ yourself, calling place_on_screen, start_task, and look_at.
└──────────────────────────────────────────────────────────────────────────────

Why no helper class? As covered in Data Packaging, BehavioralAssembly has no enforced dimensions: behavioral data can be 1D (correct/incorrect), 2D (probability distributions), or other shapes. This flexibility means behavioral experiments are too diverse for a single abstraction like NeuralBenchmark.
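
For intuition, here is a minimal sketch of two valid shapes, assuming brainio's BehavioralAssembly constructor (the coordinate names are illustrative):

from brainio.assemblies import BehavioralAssembly

# 1D: one response per presentation (e.g., the chosen label)
choices = BehavioralAssembly(
    ['dog', 'cat', 'dog'],
    coords={'stimulus_id': ('presentation', ['img1', 'img2', 'img3'])},
    dims=['presentation'])

# 2D: one probability per (presentation, choice) pair
probabilities = BehavioralAssembly(
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
    coords={'stimulus_id': ('presentation', ['img1', 'img2', 'img3']),
            'choice': ('choice', ['dog', 'cat'])},
    dims=['presentation', 'choice'])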


Example 1: Rajalingham2018 (I2N Metric)

This benchmark compares model and human object recognition patterns at the image level:

# Located: brainscore_vision/benchmarks/rajalingham2018/benchmarks/benchmark.py

class _DicarloRajalingham2018(BenchmarkBase):
    def __init__(self, metric, metric_identifier):
        # The metric (I2N) compares image-by-image behavioral patterns
        self._metric = metric

        # Fitting stimuli: separate training data for the model's readout
        # This ensures fair comparison - the model learns from different images
        self._fitting_stimuli = load_stimulus_set('objectome.public')

        # Human behavioral data (lazy-loaded to save memory)
        self._assembly = LazyLoad(lambda: load_assembly('private'))

        # Experimental parameters
        self._visual_degrees = 8    # Stimuli shown at 8° visual angle
        self._number_of_trials = 2  # 2 trials per stimulus

        super().__init__(
            identifier='Rajalingham2018-' + metric_identifier,
            version=2,
            # Ceiling: how well can humans predict other humans?
            ceiling_func=lambda: self._metric.ceiling(self._assembly),
            parent='behavior',
            bibtex=BIBTEX
        )

    def __call__(self, candidate: BrainModel) -> Score:
        # 1. Scale FITTING stimuli to model's visual field
        fitting_stimuli = place_on_screen(
            self._fitting_stimuli,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees
        )

        # 2. Set up task: model learns to output probabilities over object classes
        #    fitting_stimuli provides training data for the readout
        candidate.start_task(BrainModel.Task.probabilities, fitting_stimuli)

        # 3. Scale TEST stimuli to model's visual field
        stimulus_set = place_on_screen(
            self._assembly.stimulus_set,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees
        )

        # 4. Present test stimuli and get model's probability outputs
        probabilities = candidate.look_at(stimulus_set, number_of_trials=self._number_of_trials)

        # 5. Compare model probabilities to human choices using I2N metric
        #    I2N = image-level consistency, measuring per-image agreement
        score = self._metric(probabilities, self._assembly)

        # 6. Normalize by ceiling
        ceiling = self.ceiling
        score = self.ceil_score(score, ceiling)

        return score

**Key patterns in Rajalingham2018:**
- **Fitting stimuli**: Separate training data (`objectome.public`) from test data
- **Probabilities task**: Model outputs probability distributions over classes
- **I2N metric**: Measures image-by-image behavioral consistency
- **Custom ceiling**: Uses the metric's `ceiling()` method (split-half reliability)

Example 2: Ferguson2024 (Value Delta Metric)

This benchmark measures visual search asymmetry - how model performance patterns match humans across different conditions:

# Located: brainscore_vision/benchmarks/ferguson2024/benchmark.py

class _Ferguson2024ValueDelta(BenchmarkBase):
    def __init__(self, experiment, precompute_ceiling=True):
        self._experiment = experiment  # e.g., 'tilted_line', 'color', 'gray_hard'

        # Value Delta metric: compares scalar values (integrals) between model and human
        self._metric = load_metric('value_delta', scale=0.75)

        # Fitting stimuli: training data specific to this experiment type
        self._fitting_stimuli = gather_fitting_stimuli(combine_all=False, experiment=self._experiment)

        # Human behavioral data for this specific experiment
        self._assembly = load_dataset(f'Ferguson2024_{self._experiment}')

        # Experimental parameters
        self._visual_degrees = 8
        self._number_of_trials = 3  # 3 trials per stimulus

        # Pre-computed ceiling (saves time, same result as computing each time)
        self._ceiling = calculate_ceiling(precompute_ceiling, self._experiment, 
                                          self._assembly, self._metric, num_loops=500)

        super().__init__(
            identifier=f"Ferguson2024{self._experiment}-value_delta",
            version=1,
            ceiling_func=lambda: self._ceiling,
            parent='behavior',
            bibtex=BIBTEX
        )

    def __call__(self, candidate: BrainModel) -> Score:
        # 1. Add truth labels to stimuli (oddball vs same)
        self._assembly.stimulus_set["image_label"] = np.where(
            self._assembly.stimulus_set["image_number"] % 2 == 0, "oddball", "same"
        )
        self._fitting_stimuli["image_label"] = np.where(
            self._fitting_stimuli["image_number"] % 2 == 0, "oddball", "same"
        )

        # 2. Scale fitting stimuli and set up binary classification task
        fitting_stimuli = place_on_screen(
            self._fitting_stimuli,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees
        )
        candidate.start_task(BrainModel.Task.probabilities, fitting_stimuli)

        # 3. Scale test stimuli and get model predictions
        stimulus_set = place_on_screen(
            self._assembly.stimulus_set,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees
        )
        model_labels_raw = candidate.look_at(stimulus_set, number_of_trials=self._number_of_trials)

        # 4. Process model outputs (softmax + threshold → binary choices)
        model_labels = process_model_choices(model_labels_raw)

        # 5. Compute "integral" metric (area under performance curve)
        human_integral = get_integral_data(self._assembly, self._experiment)['integral']
        model_integral = get_integral_data(model_labels, self._experiment)['integral']

        # 6. Compare integrals using Value Delta metric
        raw_score = self._metric(model_integral, human_integral)

        # 7. Normalize by ceiling (clamp to [0, 1])
        ceiling = self._ceiling
        score = Score(min(max(raw_score / ceiling, 0), 1))
        score.attrs['raw'] = raw_score
        score.attrs['ceiling'] = ceiling

        return score

**Key patterns in Ferguson2024:**
- **Multiple experiment variants**: Same benchmark structure, different stimulus types
- **Pre-computed ceilings**: Stored to avoid expensive recalculation
- **Custom post-processing**: Converts probabilities to discrete choices
- **Derived metric**: Computes "integral" from raw responses before comparison

Behavioral Benchmark Call Flow

┌──────────────────────────────────────────────────────────────────────────────
│  INSIDE behavioral_benchmark.__call__(candidate)
├──────────────────────────────────────────────────────────────────────────────
│
│  STEP 1: Scale FITTING stimuli
│
│  fitting_stimuli = place_on_screen(self._fitting_stimuli, ...)
│      └──→ Training data for model's readout layer
│
├──────────────────────────────────────────────────────────────────────────────
│
│  STEP 2: Configure task with fitting stimuli
│
│  candidate.start_task(BrainModel.Task.probabilities, fitting_stimuli)
│      │
│      │  Model internally:
│      │  1. Extracts features from fitting_stimuli
│      │  2. Trains logistic regression: features → image_label
│      │  3. Stores classifier for later use
│
├──────────────────────────────────────────────────────────────────────────────
│
│  STEP 3: Scale TEST stimuli
│
│  stimulus_set = place_on_screen(self._assembly.stimulus_set, ...)
│
├──────────────────────────────────────────────────────────────────────────────
│
│  STEP 4: Get model predictions
│
│  probabilities = candidate.look_at(stimulus_set, number_of_trials=N)
│      │
│      │  Model internally:
│      │  1. Extracts features from test stimuli
│      │  2. Applies trained classifier to get probabilities
│      │  3. Returns BehavioralAssembly with shape (stimuli, classes)
│      │
│      └──→ Returns BehavioralAssembly: (presentations × choices)
│
├──────────────────────────────────────────────────────────────────────────────
│
│  STEP 5: Compare to human data using behavioral metric
│
│  raw_score = self._metric(probabilities, self._assembly)
│      └──→ I2N: per-image consistency
│           Accuracy: proportion correct
│           Value Delta: scalar comparison
│
├──────────────────────────────────────────────────────────────────────────────
│
│  STEP 6: Normalize by ceiling
│
│  score = raw_score / ceiling  (or custom normalization)
│
└──────────────────────────────────────────────────────────────────────────────

Understanding Ceilings

The ceiling represents the maximum achievable score given the noise in the human data. If humans disagree with each other 20% of the time, a model can't be expected to match humans better than humans match themselves.

Common ceiling approaches for behavioral benchmarks:

# 1. Metric's built-in ceiling (e.g., I2N)
ceiling_func=lambda: self._metric.ceiling(self._assembly)

# 2. Split-half reliability across subjects
def _compute_ceiling(self):
    subjects = list(set(self._assembly.coords['subject'].values))
    scores = []
    for _ in range(100):  # Bootstrap
        np.random.shuffle(subjects)
        half1, half2 = subjects[:len(subjects)//2], subjects[len(subjects)//2:]
        score = self._metric(
            self._assembly.sel(subject=half1).mean('subject'),
            self._assembly.sel(subject=half2).mean('subject')
        )
        scores.append(score.values)
    return Score(np.mean(scores))

# 3. Pre-computed ceiling (for expensive calculations)
ceiling_func=lambda: Score(0.85)  # Stored value

Normalized score: raw_score / ceiling ensures scores are comparable across benchmarks with different noise levels.
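
With illustrative numbers:

raw_score = 0.42    # model-human agreement measured by the metric
ceiling = 0.70      # human-human agreement (the explainable signal)
normalized = raw_score / ceiling   # 0.60: the model captures 60% of the explainable behavior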


Common Behavioral Metrics

Behavioral benchmarks use diverse metrics to capture different aspects of human-model alignment. Here's an overview of the most common ones:

Metric Comparison Table

| Metric | Registry Key | What It Measures | When to Use |
|---|---|---|---|
| Accuracy | accuracy | Proportion of correct responses | Simple classification tasks |
| I2N | i2n | Image-level error pattern consistency | Fine-grained behavioral similarity |
| Error Consistency | error_consistency | Cohen's Kappa for error agreement | When error patterns matter more than accuracy |
| Accuracy Distance | accuracy_distance | Relative distance between accuracies | Condition-wise performance matching |
| Value Delta | value_delta | Exponential decay of the scalar difference | Comparing derived metrics (e.g., integrals) |
| OST | ost | Object Solution Time correlation | Temporal dynamics of recognition |

Note: All paths below are relative to brainscore_vision/metrics/ in the vision repository.


Accuracy

The simplest metric - the proportion of matching responses:

# Located: metrics/accuracy/metric.py

class Accuracy(Metric):
    def __call__(self, source, target) -> Score:
        values = source == target
        center = np.mean(values)
        score = Score(center)
        score.attrs['error'] = np.std(values)
        return score

When to use: Basic classification accuracy when you just need "how often does the model get it right?"
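
Based on the __call__ above, a minimal usage sketch (plain numpy arrays here; in a benchmark you would pass the model's label outputs and the human data):

import numpy as np
from brainscore_vision import load_metric

metric = load_metric('accuracy')
model_labels = np.array(['dog', 'cat', 'dog'])
human_labels = np.array(['dog', 'cat', 'cat'])
score = metric(model_labels, human_labels)  # element-wise matches: 2 of 3 -> ~0.67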


I2N (Image-Level Consistency)

Measures whether model and humans make the same errors on the same individual images:

                          Model A        Model B        Humans
                          (90% acc)      (90% acc)
Image 1 (dog vs cat)        ✓              ✗              ✓
Image 2 (car vs bus)        ✓              ✓              ✓
Image 3 (bird vs plane)     ✗              ✓              ✗    ← Model A matches human error!
Image 4 (dog vs wolf)       ✓              ✗              ✓
Image 5 (cat vs tiger)      ✗              ✗              ✗    ← Both match human error
...

I2N(Model A, Humans) = 0.85   ← High: same error patterns
I2N(Model B, Humans) = 0.45   ← Low: different error patterns

Accuracy would miss this distinction - both models score 90%!

# Located: metrics/i1i2/metric.py

class _I(_Behavior_Metric):
    def _call_single(self, source_probabilities, target, random_state):
        # 1. Build "response matrix" from model probabilities
        source_response_matrix = self.target_distractor_scores(source_probabilities)

        # 2. Convert to d-prime (sensitivity) scores
        source_response_matrix = self.dprimes(source_response_matrix)

        # 3. Build response matrix from human data
        target_half = self.generate_halves(target, random_state)[0]
        target_response_matrix = self.build_response_matrix_from_responses(target_half)
        target_response_matrix = self.dprimes(target_response_matrix)

        # 4. Correlate the two response matrices (per-image patterns)
        correlation = self.correlate(source_response_matrix, target_response_matrix)
        return correlation

When to use: When you care about which specific images are hard, not just overall accuracy.
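
A sketch of how the pieces combine, following the Rajalingham2018 pattern above ('i2n' is the registry key from the metric table; the wrapper function is illustrative):

from brainio.assemblies import BehavioralAssembly
from brainscore_vision import load_metric

def ceiled_i2n(probabilities: BehavioralAssembly, human_assembly: BehavioralAssembly):
    """Ceiling-normalized I2N: model probabilities vs. per-trial human choices."""
    metric = load_metric('i2n')
    raw = metric(probabilities, human_assembly)
    ceiling = metric.ceiling(human_assembly)  # split-half reliability of the human data
    return raw / ceiling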


Error Consistency (Cohen's Kappa)

Measures agreement in error patterns, accounting for chance:

# Located: metrics/error_consistency/metric.py

class ErrorConsistency(Metric):
    """
    Implements Cohen's Kappa from Geirhos et al., 2020
    https://arxiv.org/abs/2006.16736
    """
    def compare_single_subject(self, source, target):
        correct_source = source.values == source['truth'].values
        correct_target = target.values == target['truth'].values

        # Expected consistency by chance
        accuracy_source = np.mean(correct_source)
        accuracy_target = np.mean(correct_target)
        expected = accuracy_source * accuracy_target + (1 - accuracy_source) * (1 - accuracy_target)

        # Observed consistency
        observed = (correct_source == correct_target).sum() / len(target)

        # Cohen's Kappa: adjust for chance
        kappa = (observed - expected) / (1.0 - expected)
        return Score(kappa)

When to use: When comparing error patterns across different conditions (e.g., shape vs. texture experiments).
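
A small worked example of the formula above (numbers are illustrative):

# Model is correct on 90% of trials, the human subject on 85%.
accuracy_source, accuracy_target = 0.90, 0.85
# Agreement expected by chance alone: both correct, or both wrong.
expected = accuracy_source * accuracy_target + (1 - accuracy_source) * (1 - accuracy_target)  # 0.78
observed = 0.90              # they actually agree (both right or both wrong) on 90% of trials
kappa = (observed - expected) / (1.0 - expected)   # ~0.55, well above chance-level agreement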


Accuracy Distance

Compares accuracy patterns across experimental conditions:

# Located: metrics/accuracy_distance/metric.py

class AccuracyDistance(Metric):
    """
    Measures relative distance between source and target accuracies,
    adjusted for maximum possible difference.
    """
    def compare_single_subject(self, source, target):
        source_correct = source.values.flatten() == target['truth'].values
        target_correct = target.values == target['truth'].values

        source_mean = np.mean(source_correct)
        target_mean = np.mean(target_correct)

        # Normalize by maximum possible distance
        maximum_distance = max(1 - target_mean, target_mean)
        relative_distance = 1 - abs(source_mean - target_mean) / maximum_distance

        return Score(relative_distance)

When to use: When you want to compare performance patterns across conditions, not just overall accuracy.
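
A worked example with illustrative numbers:

# Humans are 80% correct in this condition, the model 70%.
source_mean, target_mean = 0.70, 0.80
maximum_distance = max(1 - target_mean, target_mean)                        # 0.80
relative_distance = 1 - abs(source_mean - target_mean) / maximum_distance   # 1 - 0.10/0.80 = 0.875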


Value Delta (Scalar Comparison)

Compares two scalar values using an exponentially decaying distance:

# Located: metrics/value_delta/metric.py

class ValueDelta(Metric):
    def __init__(self, scale: float = 1.00):
        self.scale = scale

    def __call__(self, source_value: float, target_value: float) -> Score:
        # Compute absolute difference
        abs_diff = float(np.abs(source_value - target_value))

        # Exponential decay: identical values → 1.0, large differences → 0.0
        center = 1 / (np.exp(self.scale * abs_diff))

        return Score(min(max(center, 0), 1))

When to use: When comparing derived metrics like integrals, thresholds, or other scalar summaries.
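
With the scale of 0.75 used in the Ferguson2024 example above, illustrative differences behave as follows:

import numpy as np

scale = 0.75
for abs_diff in (0.0, 0.5, 2.0):
    score = 1 / np.exp(scale * abs_diff)   # equivalent to exp(-scale * |difference|)
    print(abs_diff, round(float(score), 3))
# 0.0 -> 1.0, 0.5 -> ~0.687, 2.0 -> ~0.223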


OST (Object Solution Times)

Measures whether models predict the temporal dynamics of human object recognition:

# Located: metrics/ost/metric.py

class OSTCorrelation(Metric):
    """
    Object Solution Times metric from Kar et al., Nature Neuroscience 2019.
    Tests whether models predict WHEN object identity emerges.
    """
    def compute_osts(self, train_source, test_source, test_osts):
        # For each time bin, train classifier and check if threshold is reached
        for time_bin_start in sorted(set(train_source['time_bin_start'].values)):
            # Train classifier at this time point
            classifier.fit(time_train_source, time_train_source['image_label'])
            prediction_probabilities = classifier.predict_proba(time_test_source)

            # Check if model reaches human-like performance threshold
            source_i1 = self.i1(prediction_probabilities)
            # ... interpolate to find exact crossing time

        return predicted_osts

    def correlate(self, predicted_osts, target_osts):
        # Spearman correlation: do models predict which images are "hard"?
        correlation, p = spearmanr(predicted_osts, target_osts)
        return Score(correlation)

When to use: When testing recurrent/temporal models on time-resolved behavioral data.
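
The final correlation step can be illustrated in isolation (the solution times below are made-up values; only the spearmanr call mirrors the metric's correlate method):

import numpy as np
from scipy.stats import spearmanr

predicted_osts = np.array([ 90, 150, 210, 140, 180])   # model: per-image solution times (ms)
target_osts    = np.array([100, 160, 220, 150, 170])   # measured per-image solution times (ms)
correlation, p = spearmanr(predicted_osts, target_osts)  # do the same images take longest?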


Choosing the Right Behavioral Metric

┌──────────────────────────────────────────────────────────────────────────────
│  Decision Guide: Which Behavioral Metric?
├──────────────────────────────────────────────────────────────────────────────
│
│  Q: What aspect of behavior do you want to compare?
│
│  ├── Overall correctness → Accuracy
│  │
│  ├── Which specific images are hard/easy → I2N
│  │
│  ├── Error patterns (accounting for chance) → Error Consistency
│  │
│  ├── Performance across conditions → Accuracy Distance
│  │
│  ├── Derived scalar values (integrals, thresholds) → Value Delta
│  │
│  └── When recognition happens (temporal) → OST
│
└──────────────────────────────────────────────────────────────────────────────

Implementing Your Own Behavioral Benchmark

import numpy as np
from brainscore_vision.benchmarks import BenchmarkBase
from brainscore_vision import load_stimulus_set, load_dataset, load_metric
from brainscore_vision.benchmark_helpers.screen import place_on_screen
from brainscore_vision.model_interface import BrainModel
from brainscore_core.metrics import Score

BIBTEX = """
@article{author2024,
  title={Your Paper Title},
  author={Author, A.},
  journal={Journal},
  year={2024}
}
"""

class MyBehavioralBenchmark(BenchmarkBase):
    def __init__(self):
        # Load data
        self._assembly = load_dataset('MyExperiment2024')
        self._fitting_stimuli = load_stimulus_set('MyExperiment2024_fitting')
        self._stimulus_set = load_stimulus_set('MyExperiment2024')

        # Set up metric
        self._metric = load_metric('accuracy')  # or 'i2n', etc.

        # Experimental parameters
        self._visual_degrees = 8
        self._number_of_trials = 1

        super().__init__(
            identifier='MyExperiment2024-accuracy',
            version=1,
            ceiling_func=lambda: self._compute_ceiling(),
            parent='behavior',
            bibtex=BIBTEX
        )

    def __call__(self, candidate: BrainModel) -> Score:
        # 1. Scale fitting stimuli
        fitting_stimuli = place_on_screen(
            self._fitting_stimuli,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees
        )

        # 2. Set up task
        candidate.start_task(BrainModel.Task.probabilities, fitting_stimuli)

        # 3. Scale test stimuli
        stimulus_set = place_on_screen(
            self._stimulus_set,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees
        )

        # 4. Get model predictions
        predictions = candidate.look_at(stimulus_set, number_of_trials=self._number_of_trials)

        # 5. Compare to human data
        raw_score = self._metric(predictions, self._assembly)

        # 6. Normalize by ceiling
        ceiling = self.ceiling
        ceiled_score = raw_score / ceiling
        ceiled_score.attrs['raw'] = raw_score
        ceiled_score.attrs['ceiling'] = ceiling

        return ceiled_score

    def _compute_ceiling(self) -> Score:
        """Split-half reliability across subjects."""
        subjects = list(set(self._assembly.coords['subject'].values))
        scores = []

        for _ in range(100):  # Bootstrap
            np.random.shuffle(subjects)
            half1 = subjects[:len(subjects)//2]
            half2 = subjects[len(subjects)//2:]

            data1 = self._assembly.sel(presentation=self._assembly['subject'].isin(half1))
            data2 = self._assembly.sel(presentation=self._assembly['subject'].isin(half2))

            score = self._metric(data1.mean('subject'), data2.mean('subject'))
            scores.append(score.values)

        ceiling = Score(np.mean(scores))
        ceiling.attrs['error'] = np.std(scores)
        return ceiling

Key Differences from Neural Benchmarks

| Aspect | Neural Benchmark | Behavioral Benchmark |
|---|---|---|
| Helper class | NeuralBenchmark | None (use BenchmarkBase) |
| Model setup | start_recording(region, time_bins) | start_task(task, fitting_stimuli) |
| Output | NeuroidAssembly (activations) | BehavioralAssembly (choices) |
| Fitting data | Not needed | Usually required |
| Implement __call__ | No (inherited) | Yes (required) |

Benchmark Hierarchy

The parent parameter determines where your benchmark appears in the leaderboard:

super().__init__(
    identifier='Ferguson2024tilted_line-value_delta',
    parent='behavior',  # Groups under behavioral benchmarks
    ...
)
| Parent Value | Leaderboard Position |
|---|---|
| 'behavior' | Under behavioral benchmarks |
| 'V1', 'IT', etc. | Under brain region (for neural-behavioral hybrids) |

Pre-defined categories (V1, V2, V4, IT, behavior) are registered in the Brain-Score database. Your benchmark's parent links to one of these existing categories.

The website automatically aggregates scores from benchmarks sharing the same parent.


Registration

# __init__.py
from brainscore_vision import benchmark_registry
from .benchmark import MyBehavioralBenchmark

benchmark_registry['MyExperiment2024-accuracy'] = MyBehavioralBenchmark

Plugin Directory Structure

vision/brainscore_vision/benchmarks/mybenchmark/
├── __init__.py          # Registration (imports from benchmark.py)
├── benchmark.py         # Benchmark implementation (class + factory)
├── test.py              # Unit tests
└── requirements.txt     # Dependencies (optional)

Fitting Stimuli Requirements

Fitting stimuli are training data that the model uses to learn task-specific readouts. Whether you need them depends on the task type:

| Task Type | Fitting Stimuli Required? | Why | Example |
|---|---|---|---|
| BrainModel.Task.probabilities | Yes | Model needs labeled examples to train a classifier | Object categorization |
| BrainModel.Task.label | No | Uses model's pre-trained label outputs | ImageNet classification |
| BrainModel.Task.odd_one_out | No | Compares feature similarities directly | Triplet similarity |

Important: Fitting stimuli should be separate from test stimuli to ensure fair evaluation. The model learns from fitting stimuli, then is tested on unseen test stimuli.

# Good: Separate fitting and test sets
self._fitting_stimuli = load_stimulus_set('MyExperiment2024_training')
self._test_stimuli = load_stimulus_set('MyExperiment2024_test')

# Bad: Using the same stimuli for both
self._fitting_stimuli = self._test_stimuli  # Don't do this!
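
If your experiment does not ship a dedicated fitting set, one option is to hold out part of a single stimulus set, as sketched below (the identifier and the alternating split are illustrative; a StimulusSet behaves like a pandas DataFrame):

from brainscore_vision import load_stimulus_set

stimulus_set = load_stimulus_set('MyExperiment2024')       # hypothetical identifier
stimulus_ids = sorted(stimulus_set['stimulus_id'].values)
fitting_ids = set(stimulus_ids[::2])                       # every other stimulus for fitting
fitting_stimuli = stimulus_set[stimulus_set['stimulus_id'].isin(fitting_ids)]
test_stimuli = stimulus_set[~stimulus_set['stimulus_id'].isin(fitting_ids)]
assert set(fitting_stimuli['stimulus_id']).isdisjoint(set(test_stimuli['stimulus_id']))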

Testing Your Benchmark

Every benchmark should include tests to verify it loads and produces expected scores:

# vision/brainscore_vision/benchmarks/myexperiment2024/test.py

import pytest
from brainscore_vision import load_benchmark, load_model

@pytest.mark.private_access
class TestMyBehavioralBenchmark:
    def test_benchmark_loads(self):
        """Test that benchmark can be loaded"""
        benchmark = load_benchmark('MyExperiment2024-accuracy')
        assert benchmark is not None
        assert benchmark.identifier == 'MyExperiment2024-accuracy'

    def test_ceiling_valid(self):
        """Test ceiling is computed and in valid range"""
        benchmark = load_benchmark('MyExperiment2024-accuracy')
        ceiling = benchmark.ceiling
        assert 0 < ceiling <= 1, f"Ceiling {ceiling} out of expected range"

    def test_benchmark_score(self):
        """Test benchmark produces expected score for a known model"""
        benchmark = load_benchmark('MyExperiment2024-accuracy')
        model = load_model('alexnet')

        score = benchmark(model)

        # Document expected score for regression testing
        assert 0 <= score <= 1
        assert hasattr(score, 'attrs')
        # Optional: assert abs(score - 0.35) < 0.05  # Expected ~0.35 for alexnet

Common Issues and Solutions

Problem: "Model predictions have wrong shape"

The model output doesn't match expected dimensions for the task.

# Solution: Verify task type matches expected output
# For probabilities task, output should be (presentations × choices)
candidate.start_task(BrainModel.Task.probabilities, fitting_stimuli)
predictions = candidate.look_at(stimulus_set)
print(f"Shape: {predictions.shape}, Dims: {predictions.dims}")

Problem: "Fitting stimuli labels don't match test stimuli"

The image_label column values are inconsistent between fitting and test sets.

# Solution: Ensure consistent labeling
fitting_labels = set(fitting_stimuli['image_label'].unique())
test_labels = set(test_stimuli['image_label'].unique())
assert fitting_labels == test_labels, f"Label mismatch: {fitting_labels} vs {test_labels}"

Problem: "Ceiling calculation takes too long"

Split-half bootstrapping with many iterations is slow.

# Solution: Pre-compute ceiling and store the value
# Calculate once:
ceiling_value = self._compute_ceiling()
print(f"Pre-computed ceiling: {ceiling_value}")

# Then use the value directly:
ceiling_func=lambda: Score(0.85)  # Use pre-computed value

Problem: "Score exceeds 1.0 or is negative"

Ceiling normalization isn't clamping properly.

# Solution: Clamp the final score
raw_score = self._metric(predictions, self._assembly)
ceiling = self.ceiling
ceiled_score = min(max(raw_score / ceiling, 0), 1)  # Clamp to [0, 1]

Problem: "start_task fails with unknown task"

Task type not recognized by the model.

# Solution: Use standard task types
from brainscore_vision.model_interface import BrainModel

# Valid tasks:
BrainModel.Task.passive        # Neural recording only
BrainModel.Task.label          # Discrete labels
BrainModel.Task.probabilities  # Probability distributions
BrainModel.Task.odd_one_out    # Similarity-based choices

Next Steps