Neural Benchmarks
Neural benchmarks evaluate whether artificial neural networks develop internal representations similar to those found in biological brains. Rather than just matching behavioral outputs, these benchmarks ask the question: does the model process information the way the brain does?
By comparing model activations to neural recordings (fMRI, electrophysiology, EEG, etc.), we can assess whether a model has learned representations that are fundamentally "brain-like." Models that accurately predict neural activity across brain regions provide evidence that they may have discovered similar computational solutions to the ones evolution found.
Prerequisites: Complete Data Packaging first; you need a registered `NeuroidAssembly`.
Time: ~20 minutes to implement a standard benchmark.
Quick Start: Minimal Template
Copy this template and modify for your data:
```python
# vision/brainscore_vision/benchmarks/yourbenchmark/__init__.py
from brainscore_vision import benchmark_registry
from .benchmark import YourBenchmarkIT

benchmark_registry['YourBenchmark.IT-pls'] = YourBenchmarkIT
```

```python
# vision/brainscore_vision/benchmarks/yourbenchmark/benchmark.py
from brainscore_vision import load_dataset, load_metric, load_ceiling
from brainscore_vision.benchmark_helpers.neural_common import NeuralBenchmark

BIBTEX = """@article{YourName2024, ...}"""


def YourBenchmarkIT():
    # Load your registered data
    assembly = load_dataset('YourDataset')
    assembly_repetition = load_dataset('YourDataset')  # Keep repetitions for ceiling

    # Average repetitions for model comparison
    assembly = assembly.mean(dim='repetition')

    return NeuralBenchmark(
        identifier='YourBenchmark.IT-pls',
        version=1,
        assembly=assembly,
        similarity_metric=load_metric('pls'),
        visual_degrees=8,  # From your experiment
        number_of_trials=50,
        ceiling_func=lambda: load_ceiling('internal_consistency')(assembly_repetition),
        parent='IT',  # Or 'V1', 'V2', 'V4'
        bibtex=BIBTEX
    )
```
That's it for a standard neural benchmark. See Registration and Testing below.
Neural Benchmark Checklist
Before submitting your neural benchmark:
- [ ] Uses `NeuroidAssembly` with correct dimensions (presentation × neuroid × time_bin)
- [ ] Loads data with `average_repetitions=True` for model comparison
- [ ] Loads data with `average_repetitions=False` for ceiling calculation
- [ ] Uses internal consistency ceiling (split-half reliability)
- [ ] Specifies correct `visual_degrees` from original experiment
- [ ] Includes proper `bibtex` citation
- [ ] Tests verify expected scores for known models
- [ ] Uses `bound_score()` to clamp scores between [0, 1]
Which Pattern Should I Use?
```
Do you have standard neural recordings (electrophysiology/fMRI)?
│
├── YES → Use NeuralBenchmark (Quick Start above)
│         Examples: MajajHong2015, FreemanZiemba2013
│
└── NO → What's special about your data?
    │
    ├── Separate train/test splits?
    │   └── Use TrainTestNeuralBenchmark
    │       Example: Papale2025
    │
    ├── Temporal dynamics / multiple time bins?
    │   └── Use BenchmarkBase with custom __call__
    │       Example: Kar2019
    │
    ├── RSA / representational similarity?
    │   └── Use BenchmarkBase with RDM metric
    │       Example: Coggan2024_fMRI
    │
    └── Neuronal properties (tuning curves, receptive fields)?
        └── Use PropertiesBenchmark
            Example: Marques2020
```
Tutorial Overview
This tutorial covers everything you need to build a neural benchmark:
- Class hierarchy: Understanding `NeuralBenchmark` and when to use alternatives
- Full examples: Annotated code from MajajHong2015, Kar2019, Papale2025, and Coggan2024
- Design patterns: Using coordinates, repetitions, time bins, and regions effectively
- Metrics: Choosing between PLS, Ridge, CKA, RDM, and others
- Registration & testing: Getting your benchmark into Brain-Score
Example Benchmarks
| Benchmark | Description | Brain Region | Key Features |
|---|---|---|---|
| MajajHong2015 | Monkey electrophysiology: object recognition | V4, IT | PLS metric, 8° visual degrees, 50 trials |
| Kar2019 | IT responses with object solution times | IT | Temporal dynamics, OST metric |
| Papale2025 | Monkey electrophysiology: extensive spiking data | V1, V2, V4, IT | Train/test split, reliability filtering |
| Coggan2024_fMRI | Human fMRI: amodal completion | V1, V2, V4, IT | RSA/RDM metric, 9° visual degrees |
Key Characteristics:
- Use NeuroidAssembly data structures
- Employ regression-based metrics (PLS, Ridge) or similarity metrics (RSA)
- Support temporal dynamics
- Compare across brain regions (V1, V2, V4, IT)
Inheritance Structure
Neural benchmarks use the NeuralBenchmark helper class, which is part of a three-level hierarchy:
```
Benchmark (ABC)
├── Abstract interface defining required methods
├── __call__(candidate) → Score
├── identifier, ceiling, version, bibtex properties
└── Located: brainscore_vision.benchmarks.Benchmark
        ▲
BenchmarkBase(Benchmark)
├── Helper class implementing standard functions
├── Automatic ceiling caching
├── Version and metadata management
└── Located: brainscore_vision.benchmarks.BenchmarkBase
        ▲
NeuralBenchmark(BenchmarkBase)
├── Specialized for neural recording comparisons
├── Handles: start_recording(), place_on_screen(), time bins
├── Built-in explained_variance ceiling normalization
└── Located: brainscore_vision.benchmark_helpers.neural_common
```
What NeuralBenchmark Provides
When you use the NeuralBenchmark helper class, you get these features automatically:
- Calls `candidate.start_recording(region, time_bins)` to set up neural recording
- Scales stimuli to the model's visual field via `place_on_screen()`
- Squeezes single time bins for static benchmarks
- Normalizes scores using `explained_variance(raw_score, ceiling)`
Key insight: You don't need to implement `__call__`, call `look_at`, or load a stimulus set; `NeuralBenchmark` handles all of that internally.
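For orientation, the inherited `__call__` reduces to roughly these steps (a paraphrase of the call flow detailed later; attribute names here are illustrative, not the literal source):

```python
# Sketch of what NeuralBenchmark.__call__(candidate) does internally (paraphrased)
def __call__(self, candidate):
    # 1. Configure the model to record from the target region and time bins
    candidate.start_recording(self.region, time_bins=self.timebins)
    # 2. Rescale stimuli so model and experiment see the same visual angle
    stimulus_set = place_on_screen(self._assembly.stimulus_set,
                                   target_visual_degrees=candidate.visual_degrees(),
                                   source_visual_degrees=self._visual_degrees)
    # 3. Present stimuli and collect model activations
    source_assembly = candidate.look_at(stimulus_set, number_of_trials=self._number_of_trials)
    # 4. Compare to biological data, then ceiling-normalize
    raw_score = self._similarity_metric(source_assembly, self._assembly)
    return explained_variance(raw_score, self.ceiling)
```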
NeuralBenchmark Parameters:
| Parameter | Required | Description |
|---|---|---|
| `identifier` | Yes | Unique name following the `AuthorYear.region-metric` convention (e.g., `MajajHong2015.IT-pls`) |
| `version` | Yes | Integer version number; increment when changes affect scores |
| `assembly` | Yes | `NeuroidAssembly` with biological recordings (typically averaged repetitions) |
| `similarity_metric` | Yes | Metric for comparing model to brain (e.g., `pls`, `ridge`) |
| `visual_degrees` | Yes | Stimulus size in degrees of visual angle from the original experiment |
| `number_of_trials` | Yes | Number of stimulus presentations per image |
| `ceiling_func` | Yes | Function returning the maximum achievable Score (uses non-averaged data) |
| `parent` | Yes | Position in the leaderboard hierarchy (`V1`, `V2`, `V4`, `IT`, or custom) |
| `bibtex` | Yes | Citation for the original neuroscience paper |
| `timebins` | No | Time windows for temporal analysis; defaults to `[(70, 170)]` ms |
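As a sketch of the optional `timebins` parameter, a temporal variant of the Quick Start benchmark could pass multiple bins explicitly; the identifier below is hypothetical and the other variables are reused from the Quick Start template:

```python
benchmark = NeuralBenchmark(
    identifier='YourBenchmark.IT-temporal-pls',  # hypothetical temporal variant
    version=1,
    assembly=assembly,
    similarity_metric=load_metric('spantime_pls'),  # temporal metric (see Metrics below)
    visual_degrees=8,
    number_of_trials=50,
    ceiling_func=lambda: load_ceiling('internal_consistency')(assembly_repetition),
    parent='IT',
    bibtex=BIBTEX,
    timebins=[(t, t + 10) for t in range(70, 250, 10)]  # eighteen 10 ms windows
)
```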
Example 1: MajajHong2015
Here's the complete structure of a neural benchmark using `NeuralBenchmark`:

```python
from brainscore_vision import load_metric, load_ceiling
from brainscore_vision.benchmark_helpers.neural_common import NeuralBenchmark
# (The Metric/Ceiling type imports, load_assembly, and BIBTEX are defined
#  elsewhere in the plugin source; this excerpt shows the benchmark structure.)


def _DicarloMajajHong2015Region(region: str, access: str, identifier_metric_suffix: str,
                                similarity_metric: Metric, ceiler: Ceiling):
    # Load data WITH individual repetitions (for ceiling calculation)
    assembly_repetition = load_assembly(average_repetitions=False, region=region)
    # Load data with repetitions AVERAGED (for model comparison)
    assembly = load_assembly(average_repetitions=True, region=region)

    return NeuralBenchmark(
        # Unique identifier: <dataset>.<region>-<metric> (e.g., "MajajHong2015.IT-pls")
        identifier=f'MajajHong2015.{region}-{identifier_metric_suffix}',
        # Version number: increment when benchmark changes would affect scores
        version=3,
        # Neural data assembly: the biological recordings to compare against
        # Uses averaged repetitions for cleaner model-to-brain comparison
        assembly=assembly,
        # Metric for comparing model activations to neural recordings
        # Typically PLS regression for neural benchmarks
        similarity_metric=similarity_metric,
        # Size of stimuli in degrees of visual angle (as shown in original experiment)
        # Models must scale their input to match this visual field size
        visual_degrees=8,
        # Number of stimulus presentations per image in the original experiment
        # Supports stochastic models; deterministic models return same output each trial
        number_of_trials=50,
        # Function to compute the data ceiling (maximum achievable score)
        # Uses NON-averaged data to estimate noise/reliability via split-half
        ceiling_func=lambda: ceiler(assembly_repetition),
        # Parent category in benchmark hierarchy (V1, V2, V4, IT, or behavior)
        # Determines where this benchmark appears in the leaderboard tree
        parent=region,
        # BibTeX citation for the original neuroscience paper
        bibtex=BIBTEX
    )


# Factory function that creates the benchmark
def DicarloMajajHong2015ITPLS():
    ceiler = load_ceiling('internal_consistency')
    return _DicarloMajajHong2015Region(
        region='IT',
        access='public',
        identifier_metric_suffix='pls',
        similarity_metric=load_metric('pls'),
        ceiler=ceiler
    )
```
Example 2: Kar2019 (Custom Temporal Benchmark)
When you need custom `__call__` logic (e.g., temporal dynamics), inherit from `BenchmarkBase` directly:

```python
import numpy as np

from brainscore_core import Score
from brainscore_vision import load_dataset, load_metric
from brainscore_vision.benchmarks import BenchmarkBase, ceil_score
from brainscore_vision.benchmark_helpers.screen import place_on_screen
# (BIBTEX and the temporally_varying check are defined elsewhere in the plugin.)

# Time bins: 10ms windows from 70-250ms (captures temporal dynamics)
TIME_BINS = [(t, t + 10) for t in range(70, 250, 10)]


class DicarloKar2019OST(BenchmarkBase):
    def __init__(self):
        # Ceiling computed offline (not split-half, custom calculation)
        ceiling = Score(.79)
        super().__init__(
            identifier='Kar2019-ost',
            version=2,
            ceiling_func=lambda: ceiling,
            parent='IT',
            bibtex=BIBTEX
        )
        # Load data and metric in __init__
        self._assembly = load_dataset('Kar2019')
        self._similarity_metric = load_metric('ost')  # Object Solution Time metric
        self._visual_degrees = 8
        self._time_bins = TIME_BINS

    def __call__(self, candidate):
        # Start temporal recording (multiple time bins)
        candidate.start_recording('IT', time_bins=self._time_bins)
        # Scale stimuli to model's visual field
        stimulus_set = place_on_screen(
            self._assembly.stimulus_set,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees
        )
        # Quick check: reject static models that can't predict temporal dynamics
        check_recordings = candidate.look_at(stimulus_set[:1], number_of_trials=44)
        if not temporally_varying(check_recordings):
            return Score(np.nan)  # Early exit for incompatible models
        # Full evaluation
        recordings = candidate.look_at(stimulus_set, number_of_trials=44)
        score = self._similarity_metric(recordings, self._assembly)
        return ceil_score(score, self.ceiling)
```
Example 3: Papale2025 (Train/Test Split)
For benchmarks with separate training and test sets, use `TrainTestNeuralBenchmark`:

```python
from brainscore_vision import load_dataset, load_metric
from brainscore_vision.benchmark_helpers.neural_common import (
    TrainTestNeuralBenchmark, average_repetition, filter_reliable_neuroids
)
from brainscore_vision.utils import LazyLoad
# (BIBTEX is defined elsewhere in the plugin.)

VISUAL_DEGREES = 18          # Large field of view (monkey viewing distance)
RELIABILITY_THRESHOLD = 0.3  # Filter out unreliable neurons


def _Papale2025(region, similarity_metric, identifier_metric_suffix):
    # LazyLoad defers S3 fetching until data is actually needed
    train_assembly = LazyLoad(lambda: load_assembly(region, split='train', average_repetitions=False))
    test_assembly = LazyLoad(lambda: load_assembly(region, split='test', average_repetitions=True))
    test_assembly_repetition = LazyLoad(lambda: load_assembly(region, split='test', average_repetitions=False))

    return TrainTestNeuralBenchmark(
        identifier=f'Papale2025.{region}-{identifier_metric_suffix}',
        version=2,
        # Separate train and test assemblies
        train_assembly=train_assembly,
        test_assembly=test_assembly,
        similarity_metric=similarity_metric,
        ceiling_func=lambda: load_metric('internal_consistency')(test_assembly_repetition),
        visual_degrees=VISUAL_DEGREES,
        number_of_trials=1,
        parent=region,
        bibtex=BIBTEX
    )


def load_assembly(region, split, average_repetitions):
    assembly = load_dataset(f'Papale2025_{split}')
    # Filter neurons by reliability (remove noisy recordings)
    assembly = filter_reliable_neuroids(assembly, RELIABILITY_THRESHOLD, 'reliability')
    assembly = assembly.sel(region=region)
    if average_repetitions:
        assembly = average_repetition(assembly)
    return assembly
```
Example 4: Coggan2024_fMRI (RSA/RDM Metric)
For fMRI with representational similarity analysis, create a custom benchmark class:

```python
import numpy as np

from brainscore_vision import load_dataset
from brainscore_vision.benchmarks import BenchmarkBase
from brainscore_vision.benchmark_helpers.screen import place_on_screen
# (RSA, get_score, ceiler, get_ceiling, and BIBTEX are defined elsewhere in the plugin.)


class Coggan2024_fMRI_Benchmark(BenchmarkBase):
    def __init__(self, identifier, assembly, ceiling_func, visual_degrees, **kwargs):
        super().__init__(identifier=identifier, ceiling_func=ceiling_func, **kwargs)
        self._assembly = assembly  # Pre-computed human RSM (not raw fMRI)
        self._visual_degrees = visual_degrees
        self.region = np.unique(assembly['region'])[0]

    def __call__(self, candidate):
        # Scale stimuli
        stimulus_set = place_on_screen(
            self._assembly.stimulus_set,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=self._visual_degrees
        )
        # Get model activations
        candidate.start_recording(self.region, time_bins=[(0, 250)])
        source_assembly = candidate.look_at(stimulus_set, number_of_trials=1)
        # Compute RSM (Representational Similarity Matrix) from model
        source_rsm = RSA(source_assembly)
        # Compare model RSM to human fMRI RSM
        raw_score = get_score(source_rsm, self._assembly)
        ceiling = self._ceiling_func(self._assembly)
        return ceiler(raw_score, ceiling)


def _Coggan2024_Region(region: str):
    assembly = load_dataset('Coggan2024_fMRI')
    assembly = assembly.sel(region=region)
    return Coggan2024_fMRI_Benchmark(
        identifier=f'tong.Coggan2024_fMRI.{region}-rdm',
        version=1,
        assembly=assembly,
        visual_degrees=9,
        ceiling_func=get_ceiling,
        parent=region,
        bibtex=BIBTEX
    )
```
Benchmark Pattern Summary
| Benchmark | Base Class | When to Use |
|---|---|---|
| MajajHong2015 | `NeuralBenchmark` | Standard neural predictivity with PLS/Ridge |
| Kar2019 | `BenchmarkBase` | Custom temporal dynamics, early rejection |
| Papale2025 | `TrainTestNeuralBenchmark` | Explicit train/test splits |
| Coggan2024_fMRI | Custom subclass | RSA/RDM, pre-computed representations |
Neural Benchmark Call Flow
High-level call flow (Steps 1-4):
```
USER CODE

    score = brainscore_vision.score('alexnet', 'MajajHong2015.IT-pls')
        │
        ▼
STEP 1: score() in brainscore_vision/__init__.py

    def score(model_identifier, benchmark_identifier):
        model = load_model(model_identifier)
        benchmark = load_benchmark(benchmark_identifier)
        score = benchmark(model)
        return score
        │
        ▼
STEP 2: load_benchmark() finds the plugin

    1. Searches ALL __init__.py files in brainscore_vision/benchmarks/
    2. Looks for: benchmark_registry['MajajHong2015.IT-pls']
    3. Finds match in: benchmarks/majajhong2015/__init__.py
    4. Imports the module and calls the factory function
        │
        ▼
STEP 3: Plugin __init__.py executes

    # benchmarks/majajhong2015/__init__.py
    from brainscore_vision import benchmark_registry
    from .benchmark import DicarloMajajHong2015ITPLS

    benchmark_registry['MajajHong2015.IT-pls'] = DicarloMajajHong2015ITPLS
        │
        ▼
STEP 4: Benchmark instance created, __call__ executed

    benchmark = NeuralBenchmark(...)  # Instance created with all parameters
    score = benchmark(model)          # __call__ runs the evaluation
    return score                      # Ceiling-normalized Score returned
```
What happens inside benchmark(model) (Step 4 in detail):
```
INSIDE NeuralBenchmark.__call__(candidate)

STEP 4a: Configure the model for recording

    candidate.start_recording('IT', time_bins=[(70, 170)])
        │
        └── "record from IT-mapped layers, return the 70-170ms bin"

STEP 4b: Scale stimuli to the model's visual field

    stimulus_set = place_on_screen(
        assembly.stimulus_set,                         # Original images
        target_visual_degrees=model.visual_degrees(),  # e.g., 8°
        source_visual_degrees=8                        # Original experiment's visual degrees
    )
        │
        └── Images resized/padded so they span the same visual angle

STEP 4c: Present stimuli and extract activations

    source_assembly = candidate.look_at(stimulus_set, number_of_trials=50)
        │
        │   INSIDE look_at(), for each image batch:
        │   1. Load images from stimulus_paths
        │   2. Preprocess: resize, normalize, convert to tensor
        │   3. Forward pass through the neural network
        │   4. Hooks capture activations at the target layer (e.g. layer4)
        │   5. Flatten: (batch, C, H, W) → (batch, C*H*W neuroids)
        │   6. Store in a NeuroidAssembly with stimulus_id coordinates
        │
        └── Returns NeuroidAssembly: (presentations × neuroids × time_bins)

STEP 4d: Compare model activations to biological recordings

    raw_score = similarity_metric(source_assembly, self._assembly)
        │
        │   INSIDE the PLS metric:
        │   1. Align by stimulus_id (same presentation order)
        │   2. Cross-validation split (stratify by object_name)
        │   3. For each fold:
        │      a. Fit PLS: model_activations → neural_data
        │      b. Predict on held-out stimuli
        │      c. Pearson correlation per neuroid
        │   4. Average correlations across folds and neuroids
        │
        └── Returns Score (e.g., 0.65) with metadata

STEP 4e: Normalize by ceiling (explained variance)

    ceiled_score = explained_variance(raw_score, self.ceiling)
        │
        │   Formula: ceiled_score = raw_score² / ceiling
        │   Example: raw_score=0.65, ceiling=0.82
        │            ceiled_score = 0.65² / 0.82 ≈ 0.515
        │
        └── Returns final Score between 0 and 1
```
⚠️ Critical: A common misconception in neural encoding is that because split-half consistency is computed with a Pearson correlation (r), the resulting reliability coefficient must also be treated as a raw correlation that needs squaring to become variance. However, Classical Test Theory defines reliability (after Spearman-Brown correction) as the ratio of true-score variance to total variance: $$ \text{Reliability} = \frac{\text{Var(True)}}{\text{Var(Total)}} $$ The reliability coefficient is therefore already on the scale of explained variance (r²). As a result, the ceiled score is computed as r_model² / reliability.
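Spelled out, with $r_{\text{half}}$ the split-half Pearson correlation between two halves of the repetitions:

$$ \text{reliability} = \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}}, \qquad \text{ceiled score} = \frac{r_{\text{model}}^2}{\text{reliability}} $$

With $r_{\text{model}} = 0.65$ and reliability $= 0.82$, this gives $0.65^2 / 0.82 \approx 0.515$, matching the example in the call-flow diagram above.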
The Internal Consistency Ceiling
For neural benchmarks, the ceiling is typically computed using `internal_consistency`:

```python
from brainscore_vision import load_ceiling, load_metric

ceiler = load_ceiling('internal_consistency')
benchmark = _DicarloMajajHong2015Region(
    region='IT',
    access='public',
    identifier_metric_suffix='pls',
    similarity_metric=load_metric('pls'),
    ceiler=ceiler
)
```
The ceiling answers: "How well can we predict one half of the biological data from the other half?" It represents the limit set by the true signal inside the noisy data.
This sets the upper bound for any model: if the biological data is only 80% reliable (80% signal, 20% noise), a model that explains 80% of the variance is effectively perfect.
Since every dataset has a different amount of noise, we cannot compare raw correlations directly. We normalize the raw score by the ceiling so we can compare a model's performance across datasets:
$$ \text{Normalized Ceiled Score} = \frac{\text{What the Model Explained (Raw Score)}}{\text{What was Theoretically Possible to Explain (Ceiling)}} $$
Common Neural Metrics
Brain-Score provides several metrics for comparing model representations to neural data. Each has different strengths:
Metric Comparison Table
| Metric | Registry Key | What It Measures | When to Use |
|---|---|---|---|
| PLS Regression | `pls` | Linear mapping from model → neural responses | Default choice; handles high-dimensional data well |
| Ridge Regression | `ridge` | Regularized linear mapping | Explicit regularization control; interpretable |
| RidgeCV | `ridgecv_split` | Auto-regularized linear mapping | Auto-tunes regularization strength |
| Linear Regression | `linear_predictivity` | Unregularized linear mapping | Small datasets; risk of overfitting |
| Neuron-to-Neuron | `neuron_to_neuron` | Best single model unit per neuron | Interpretable 1:1 unit correspondences |
| CKA | `cka` | Representational geometry alignment | Comparing representational structure |
| RDM | `rdm` | Stimulus similarity structure correlation | Classic RSA; stimulus similarity structures |
⚠️ Note: All paths below are relative to `brainscore_vision/metrics/` in the vision repository.
Regression-Based Metrics (Encoding Models)
The most common approach uses regression to learn a mapping from model activations to neural responses:
PLS Regression (Default)
Partial Least Squares is the standard metric for neural benchmarks:
```python
# Located: metrics/regression_correlation/metric.py
class CrossRegressedCorrelation(Metric):
    def __call__(self, source: DataAssembly, target: DataAssembly) -> Score:
        # Cross-validation handles train/test splits
        return self.cross_validation(source, target, apply=self.apply, aggregate=self.aggregate)

    def apply(self, source_train, target_train, source_test, target_test):
        # 1. Fit PLS: model_activations → neural_data (training set)
        self.regression.fit(source_train, target_train)
        # 2. Predict neural data from model activations (test set)
        prediction = self.regression.predict(source_test)
        # 3. Correlate predictions with actual neural data
        score = self.correlation(prediction, target_test)
        return score

    def aggregate(self, scores):
        # Median correlation across neuroids (robust to outliers)
        return scores.median(dim='neuroid')
```
Why PLS? PLS handles high-dimensional model activations (often thousands of units) mapping to neural responses. It finds latent components that maximize covariance between model and brain, providing robust predictions even when model dimensions >> stimuli.
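To make the mapping concrete, here is a standalone sketch of the fit-predict-correlate loop using scikit-learn's PLSRegression on synthetic data (the shapes and n_components value are illustrative, not Brain-Score internals):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
model_activations = rng.standard_normal((200, 4096))  # 200 stimuli x 4096 model units
neural_responses = rng.standard_normal((200, 168))    # 200 stimuli x 168 neurons

train, test = slice(0, 160), slice(160, 200)
pls = PLSRegression(n_components=25, scale=False)
pls.fit(model_activations[train], neural_responses[train])
prediction = pls.predict(model_activations[test])

# Pearson correlation per neuron between predicted and actual responses
per_neuron_r = np.array([
    np.corrcoef(prediction[:, i], neural_responses[test][:, i])[0, 1]
    for i in range(neural_responses.shape[1])
])
print(f"median r across neurons: {np.median(per_neuron_r):.3f}")
```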
Ridge Regression
Ridge adds L2 regularization to prevent overfitting:
```python
# Located: metrics/regression_correlation/metric.py

# Standard Ridge (fixed alpha=1)
metric = load_metric('ridge')

# RidgeCV (auto-tunes regularization)
metric = load_metric('ridgecv_split')  # For pre-split train/test data
```

When to use Ridge over PLS:
- When you want explicit control over regularization strength
- When interpretability of weights matters
- RidgeCV is ideal when you don't know the optimal regularization strength
Neuron-to-Neuron Matching
Finds the best single model unit for each biological neuron:
```python
# Located: metrics/regression_correlation/metric.py
metric = load_metric('neuron_to_neuron')
```

When to use: When you want interpretable 1:1 correspondences, or to test whether individual model units behave like individual neurons.
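Conceptually, the matching picks, for each recorded neuron, the single model unit that correlates best on training stimuli, then evaluates that fixed pairing on held-out stimuli. A minimal numpy sketch of the idea (not the plugin's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
model_units = rng.standard_normal((100, 512))  # stimuli x model units
neurons = rng.standard_normal((100, 64))       # stimuli x recorded neurons

train, test = slice(0, 80), slice(80, 100)

def corr_matrix(a, b):
    # Column-wise Pearson correlations between all pairs of columns in a and b
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return a.T @ b / len(a)

# Choose the best-matching model unit per neuron on the training split ...
best_unit = np.abs(corr_matrix(model_units[train], neurons[train])).argmax(axis=0)
# ... then score that fixed pairing on held-out stimuli
test_r = np.array([
    np.corrcoef(model_units[test][:, best_unit[i]], neurons[test][:, i])[0, 1]
    for i in range(neurons.shape[1])
])
print(f"median held-out r: {np.median(test_r):.3f}")
```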
Representational Similarity Metrics
These metrics compare the geometry of representations rather than predicting neural responses directly:
RDM (Representational Dissimilarity Matrix)
Classic Representational Similarity Analysis (RSA):
```python
# Located: metrics/rdm/metric.py
metric = load_metric('rdm')     # Single comparison
metric = load_metric('rdm_cv')  # Cross-validated
```

How it works:
1. Compute pairwise distances between stimulus responses (for both model and brain)
2. Compare the resulting distance matrices with Spearman correlation

When to use:
- Comparing representational structure independent of linear transforms
- When you care about which stimuli are represented similarly vs. differently
- Classic RSA paradigms
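A compact sketch of those two steps with scipy, using correlation distance for the RDMs (distance choices vary across studies, so treat this as an illustrative default):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
model_responses = rng.standard_normal((50, 300))  # stimuli x model units
brain_responses = rng.standard_normal((50, 80))   # stimuli x neurons/voxels

# Step 1: pairwise dissimilarities between stimuli (condensed upper triangle)
model_rdm = pdist(model_responses, metric='correlation')
brain_rdm = pdist(brain_responses, metric='correlation')

# Step 2: rank-correlate the two dissimilarity structures
rho, _ = spearmanr(model_rdm, brain_rdm)
print(f"RDM similarity (Spearman rho): {rho:.3f}")
```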
CKA (Centered Kernel Alignment)
Measures similarity of representational geometry:
```python
# Located: metrics/cka/metric.py
metric = load_metric('cka')     # Single comparison
metric = load_metric('cka_cv')  # Cross-validated
```

When to use:
- Comparing overall representational geometry
- Invariant to orthogonal transformations and isotropic scaling
- Good for comparing layers across different architectures
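For intuition, linear CKA has a simple closed form on centered activation matrices (the standard Kornblith et al. 2019 formulation; the plugin's implementation may differ in details):

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (stimuli x features) activation matrices."""
    x = x - x.mean(axis=0)  # center each feature
    y = y - y.mean(axis=0)
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(x.T @ y, 'fro') ** 2
    denominator = np.linalg.norm(x.T @ x, 'fro') * np.linalg.norm(y.T @ y, 'fro')
    return numerator / denominator

rng = np.random.default_rng(3)
a = rng.standard_normal((100, 256))
q, _ = np.linalg.qr(rng.standard_normal((256, 256)))  # random orthogonal matrix
print(linear_cka(a, a @ q))  # ~1.0: invariant to orthogonal transformations
```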
Choosing the Right Metric
```
Decision Guide: Which Metric Should I Use?

Q: Do you want to predict neural responses from model activations?
├── YES → Use regression-based metrics
│    ├── Default: PLS (handles high dimensions well)
│    ├── Want regularization control? → Ridge
│    └── Want 1:1 unit matching? → Neuron-to-Neuron
│
└── NO → Use representational similarity metrics
     ├── Classic RSA paradigm? → RDM
     └── Comparing representational geometry? → CKA

Most Brain-Score neural benchmarks use: pls (default) or ridge
```
Cross-Validation vs. Fixed Split
Each metric comes in two variants:
| Suffix | Example | Use Case |
|---|---|---|
| `_cv` | `pls_cv`, `ridge_cv` | Default. Cross-validates on a single dataset |
| `_split` | `pls_split`, `ridge_split` | Use with pre-defined train/test splits |
```python
# Cross-validation (most common): metric handles splitting
metric = load_metric('pls')  # Alias for 'pls_cv'

# Fixed split: you provide separate train and test data
metric = load_metric('pls_split')
score = metric(source_train, source_test, target_train, target_test)
```
Temporal Metrics
For time-resolved neural data, use the `spantime_` prefix:
```python
metric = load_metric('spantime_pls')    # PLS across time bins
metric = load_metric('spantime_ridge')  # Ridge across time bins
```
These treat time as a sample dimension, pooling across time bins when fitting the regression.
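Here, "pooling across time bins" means the time dimension is folded into the sample dimension before fitting; a small numpy sketch of that reshape (shapes illustrative):

```python
import numpy as np

n_stimuli, n_neuroids, n_time_bins = 200, 168, 18
recordings = np.zeros((n_stimuli, n_neuroids, n_time_bins))

# Treat each (stimulus, time_bin) pair as one regression sample:
# (stimuli, neuroids, time) -> (stimuli * time, neuroids)
samples = recordings.transpose(0, 2, 1).reshape(n_stimuli * n_time_bins, n_neuroids)
print(samples.shape)  # (3600, 168)
```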
⚠️ Note: Feel free to implement your own metric plugin.
Design Decisions in Neural Benchmarks
Different benchmarks use coordinates strategically to ask specific scientific questions. Understanding these patterns helps you design benchmarks that capture what you want to measure.
Using Repetitions for Ceiling Computation
The pattern: Load data twiceβonce with repetitions averaged (for model comparison), once with repetitions kept (for ceiling).
```python
# From MajajHong2015
assembly = load_assembly(average_repetitions=True, region='IT')              # For metric
assembly_repetition = load_assembly(average_repetitions=False, region='IT')  # For ceiling

ceiling_func = lambda: ceiler(assembly_repetition)  # Split-half uses repetitions
```
Why? The ceiling measures how consistent the biological data is with itself. If a neuron responds differently across repetitions of the same stimulus, that variability sets an upper limit on predictability. You need the individual repetitions to compute this split-half reliability.
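To make the ceiling concrete, here is a minimal split-half sketch for synthetic repeated measurements of a single neuron, with the Spearman-Brown correction (illustrative only, not the `internal_consistency` plugin itself):

```python
import numpy as np

rng = np.random.default_rng(5)
signal = rng.standard_normal(200)                          # true response per stimulus
responses = signal + 0.8 * rng.standard_normal((50, 200))  # 50 noisy repetitions

# Average odd and even repetitions into two halves, then correlate the halves
half_a = responses[0::2].mean(axis=0)
half_b = responses[1::2].mean(axis=0)
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown correction estimates reliability of the full repetition set
reliability = 2 * r_half / (1 + r_half)
print(f"split-half r: {r_half:.3f}, corrected reliability: {reliability:.3f}")
```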
Using Coordinates for Stratified Cross-Validation
The pattern: Pass a stratification_coord to ensure balanced sampling across stimulus categories.
```python
# From FreemanZiemba2013
similarity_metric = load_metric('pls', crossvalidation_kwargs=dict(
    stratification_coord='texture_type'  # Balance texture vs. noise images
))
```
Why? If your dataset has distinct stimulus categories (textures vs. noise, objects vs. scenes, etc.), random splits might accidentally put all of one category in training. Stratification ensures each fold has balanced representation, giving more reliable score estimates.
| Benchmark | Stratification Coord | Purpose |
|---|---|---|
| FreemanZiemba2013 | `texture_type` | Balance texture and spectrally-matched noise images |
| MajajHong2015 | `object_name` | Balance across object categories |
| Custom | `image_category`, `difficulty`, etc. | Balance any relevant experimental condition |
Using Time Bins for Temporal Dynamics
The pattern: Define multiple time bins to capture how representations evolve over time.
```python
# From Kar2019 - Object Solution Times
TIME_BINS = [(time_bin_start, time_bin_start + 10)
             for time_bin_start in range(70, 250, 10)]  # 70-250ms in 10ms steps

candidate.start_recording('IT', time_bins=TIME_BINS)
```
Why? Some scientific questions require temporal resolution:
- Kar2019: Tests whether models predict when object identity emerges (recurrent processing)
- Static benchmarks: Use single bin [(70, 170)] for overall response
```python
# Static benchmark (default)
TIME_BINS = [(70, 170)]  # Single 100ms window

# Temporal benchmark
TIME_BINS = [(t, t + 10) for t in range(70, 250, 10)]  # 18 time bins × 10ms each
```
Using Region Coordinates to Slice Data
The pattern: Filter assembly by brain region to create region-specific benchmarks from a single dataset.
```python
# From MajajHong2015 - separate V4 and IT benchmarks
def load_assembly(region: str, average_repetitions: bool):
    assembly = load_dataset('MajajHong2015')
    assembly = assembly.sel(neuroid=assembly['region'] == region)  # Filter by region
    if average_repetitions:
        assembly = assembly.mean(dim='repetition')
    return assembly

# Creates separate benchmarks
benchmark_IT = NeuralBenchmark(identifier='MajajHong2015.IT-pls', ...)
benchmark_V4 = NeuralBenchmark(identifier='MajajHong2015.V4-pls', ...)
```
Why? Different brain regions have different computational roles. By slicing the same dataset, you can ask: "How well does the model predict V4 vs. IT?" without packaging separate datasets.
Using Stimulus Coordinates for Specialized Analyses
The pattern: Use stimulus metadata coordinates to compute neuronal properties or specialized metrics.
```python
# From FreemanZiemba2013 - Texture Modulation properties
def freemanziemba2013_properties(responses, baseline):
    # Uses 'type', 'family', 'sample' coordinates to organize responses
    responses = responses.sortby(['type', 'family', 'sample'])
    type = np.array(sorted(set(responses.type.values)))      # texture vs. noise
    family = np.array(sorted(set(responses.family.values)))  # texture family
    sample = np.array(sorted(set(responses.sample.values)))  # specific sample
    # Reshape using coordinate structure
    responses = responses.values.reshape(n_neuroids, len(type), len(family), len(sample))
    # Compute texture modulation index from structured data
    texture_modulation_index = calc_texture_modulation(responses[:, 1], responses[:, 0])
```
Why? Rich coordinate metadata enables complex analyses beyond simple predictivity. The FreemanZiemba2013 benchmark computes texture modulation indices by leveraging the experimental structure encoded in coordinates.
Key Insight: The coordinates you include in your `NeuroidAssembly` during data packaging determine what scientific questions your benchmark can answer. Plan your coordinates based on what you want to measure.
Implementation Patterns
Pattern 1: NeuralBenchmark (Recommended)
When to use: Standard neural predictivity with PLS/Ridge regression metrics.
```python
from brainscore_vision import load_metric, load_ceiling
from brainscore_vision.benchmark_helpers.neural_common import NeuralBenchmark


def DicarloMajajHong2015ITPLS():
    assembly = load_assembly(average_repetitions=True, region='IT')
    assembly_repetition = load_assembly(average_repetitions=False, region='IT')
    ceiler = load_ceiling('internal_consistency')
    return NeuralBenchmark(
        identifier='MajajHong2015.IT-pls',
        version=3,
        assembly=assembly,
        similarity_metric=load_metric('pls'),
        visual_degrees=8,
        number_of_trials=50,
        ceiling_func=lambda: ceiler(assembly_repetition),
        parent='IT',
        bibtex=BIBTEX
    )
```
Pattern 2: PropertiesBenchmark
When to use: Comparing neuronal properties like tuning curves, receptive field sizes, surround suppression.
```python
from brainscore_vision import load_dataset, load_metric
from brainscore_vision.benchmark_helpers.properties_common import PropertiesBenchmark
# (cavanaugh2002_properties, NeuronalPropertyCeiling, and BIBTEX come from the plugin source.)


def MarquesCavanaugh2002V1SurroundSuppressionIndex():
    assembly = load_dataset('Cavanaugh2002a')
    similarity_metric = load_metric('ks_similarity', property_name='surround_suppression_index')
    return PropertiesBenchmark(
        identifier='Marques2020_Cavanaugh2002-surround_suppression_index',
        assembly=assembly,
        neuronal_property=cavanaugh2002_properties,
        similarity_metric=similarity_metric,
        timebins=[(70, 170)],
        ceiling_func=NeuronalPropertyCeiling(similarity_metric),
        parent='V1-surround_modulation',
        bibtex=BIBTEX
    )
```
Pattern 3: BenchmarkBase (Custom Logic)
When to use: Custom preprocessing, RSA metrics, non-standard analysis.
```python
from brainscore_core import Score
from brainscore_vision import load_dataset, load_metric, load_stimulus_set
from brainscore_vision.benchmarks import BenchmarkBase
from brainscore_vision.benchmark_helpers.screen import place_on_screen
from brainscore_vision.model_interface import BrainModel
# (BIBTEX is defined elsewhere in the plugin.)


class _Bracci2019RSA(BenchmarkBase):
    def __init__(self, region):
        self._region = region  # Recorded-from region, used in __call__
        self._stimulus_set = load_stimulus_set('Bracci2019')
        self._human_assembly = load_dataset('Bracci2019')
        self._metric = load_metric('rdm')
        super().__init__(
            identifier=f'Bracci2019.{region}-rdm',
            version=1,
            ceiling_func=lambda: 1,
            parent='Bracci2019',
            bibtex=BIBTEX
        )

    def __call__(self, candidate: BrainModel):
        # 1. Start recording
        candidate.start_recording(self._region, [(70, 170)])
        # 2. Scale stimuli
        stimulus_set = place_on_screen(
            self._stimulus_set,
            target_visual_degrees=candidate.visual_degrees(),
            source_visual_degrees=8
        )
        # 3. Get model activations
        dnn_assembly = candidate.look_at(stimulus_set, number_of_trials=1)
        # 4. Custom preprocessing and comparison
        ceiling = self._get_human_ceiling(self._human_assembly)
        similarity = self._metric(dnn_assembly, self._human_assembly)
        score = Score(similarity / ceiling)
        score.attrs['raw'] = similarity
        score.attrs['ceiling'] = ceiling
        return score
```
Pattern Comparison
| Aspect | NeuralBenchmark | PropertiesBenchmark | BenchmarkBase |
|---|---|---|---|
| Abstraction | High | Medium | Low (full control) |
| Implements `__call__` | No (inherited) | No (inherited) | Yes (required) |
| Calls `look_at` | No (automatic) | No (automatic) | Yes (explicit) |
| Custom preprocessing | No | Limited | Yes |
| Use case | Standard neural | Neuronal properties | RSA, custom |
| Examples | MajajHong2015 | Marques2020 | Bracci2019 |
Implementing Your Own Neural Benchmark
```python
from brainscore_vision import load_dataset, load_metric, load_ceiling
from brainscore_vision.benchmark_helpers.neural_common import NeuralBenchmark, average_repetition
from brainscore_core.metrics import Score

BIBTEX = """
@article{author2024,
    title={Your Paper Title},
    author={Author, A.},
    journal={Journal},
    year={2024}
}
"""

# Constants from your experiment
VISUAL_DEGREES = 8     # Stimulus size in degrees of visual angle
NUMBER_OF_TRIALS = 50  # Number of presentations per image


def load_assembly(region: str, average_repetitions: bool = True):
    """Load neural data, optionally averaging across repetitions."""
    assembly = load_dataset('MyExperiment2024')
    # Filter by brain region
    assembly = assembly.sel(neuroid=assembly['region'] == region)
    # Average repetitions for model comparison (keep for ceiling)
    if average_repetitions:
        assembly = average_repetition(assembly)
    return assembly


def MyExperiment2024ITPLS():
    """
    IT cortex benchmark using PLS regression.

    Returns a NeuralBenchmark that compares model activations
    to IT neural responses using partial least squares.
    """
    # Load averaged data for metric computation
    assembly = load_assembly(region='IT', average_repetitions=True)
    # Load non-averaged data for ceiling (needs repetitions for split-half)
    assembly_repetition = load_assembly(region='IT', average_repetitions=False)
    # Internal consistency ceiling
    ceiler = load_ceiling('internal_consistency')
    return NeuralBenchmark(
        identifier='MyExperiment2024.IT-pls',
        version=1,
        assembly=assembly,
        similarity_metric=load_metric('pls'),
        visual_degrees=VISUAL_DEGREES,
        number_of_trials=NUMBER_OF_TRIALS,
        ceiling_func=lambda: ceiler(assembly_repetition),
        parent='IT',
        bibtex=BIBTEX
    )


def MyExperiment2024V4PLS():
    """V4 cortex benchmark using PLS regression."""
    assembly = load_assembly(region='V4', average_repetitions=True)
    assembly_repetition = load_assembly(region='V4', average_repetitions=False)
    ceiler = load_ceiling('internal_consistency')
    return NeuralBenchmark(
        identifier='MyExperiment2024.V4-pls',
        version=1,
        assembly=assembly,
        similarity_metric=load_metric('pls'),
        visual_degrees=VISUAL_DEGREES,
        number_of_trials=NUMBER_OF_TRIALS,
        ceiling_func=lambda: ceiler(assembly_repetition),
        parent='V4',
        bibtex=BIBTEX
    )
```
Key Differences from Behavioral Benchmarks
| Aspect | Neural Benchmark | Behavioral Benchmark |
|---|---|---|
| Helper class | `NeuralBenchmark` | None (use `BenchmarkBase`) |
| Model setup | `start_recording(region, time_bins)` | `start_task(task, fitting_stimuli)` |
| Output | `NeuroidAssembly` (activations) | `BehavioralAssembly` (choices) |
| Fitting data | Not needed | Usually required |
| Implement `__call__` | No (inherited) | Yes (required) |
Benchmark Hierarchy
The parent parameter determines where your benchmark appears in the leaderboard:
| Parent Value | Leaderboard Position |
|---|---|
| `'V1'`, `'V2'`, `'V4'`, `'IT'` | Under existing brain region |
| `'behavior'` | Under behavioral benchmarks |
Pre-defined categories (V1, V2, V4, IT, behavior) are registered in the Brain-Score database. Your benchmark's parent links to one of these existing categories.
Examples from existing benchmarks:
```python
# MajajHong2015 - region-based parent (V4 or IT)
NeuralBenchmark(
    identifier=f'MajajHong2015.{region}-pls',
    parent=region,  # 'V4' or 'IT'
    ...
)

# Kar2019 - IT cortex benchmark
super().__init__(
    identifier='Kar2019-ost',
    parent='IT',
    ...
)

# Papale2025 - monkey electrophysiology across regions
TrainTestNeuralBenchmark(
    identifier=f'Papale2025.{region}-{metric}',
    parent=region,  # 'V1', 'V2', 'V4', 'IT'
    ...
)

# Coggan2024_fMRI - custom subclass, also region-based
Coggan2024_fMRI_Benchmark(
    identifier=f'tong.Coggan2024_fMRI.{region}-rdm',
    parent=region,
    ...
)
```
The website automatically aggregates scores from benchmarks sharing the same parent.
Registration
Register your benchmark in __init__.py:
```python
# vision/brainscore_vision/benchmarks/majajhong2015/__init__.py
from brainscore_vision import benchmark_registry
from .benchmark import (
    DicarloMajajHong2015V4PLS,
    DicarloMajajHong2015ITPLS,
    MajajHongV4PublicBenchmark,
    MajajHongITPublicBenchmark,
)

# Register each benchmark variant
benchmark_registry['MajajHong2015.V4-pls'] = DicarloMajajHong2015V4PLS
benchmark_registry['MajajHong2015.IT-pls'] = DicarloMajajHong2015ITPLS
benchmark_registry['MajajHong2015.V4.public-pls'] = MajajHongV4PublicBenchmark
benchmark_registry['MajajHong2015.IT.public-pls'] = MajajHongITPublicBenchmark
```
Note: Each factory function is registered directly. Brain-Score calls the function when the benchmark is loaded, creating a fresh instance each time.
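For example, loading by identifier invokes the registered factory and returns a fresh instance:

```python
from brainscore_vision import load_benchmark

benchmark = load_benchmark('MajajHong2015.IT-pls')  # calls DicarloMajajHong2015ITPLS()
print(benchmark.identifier)  # 'MajajHong2015.IT-pls'
```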
Plugin Directory Structure
```
vision/brainscore_vision/benchmarks/majajhong2015/
├── __init__.py       # Registration (imports from benchmark.py)
├── benchmark.py      # Benchmark implementation (factory functions)
├── test.py           # Unit tests
└── requirements.txt  # Dependencies (optional)
```
Testing Your Benchmark
Every benchmark should include tests to verify it loads and produces expected scores:
```python
# vision/brainscore_vision/benchmarks/majajhong2015/test.py
from pathlib import Path

import pytest
from pytest import approx

from brainscore_vision import load_benchmark


@pytest.mark.parametrize("benchmark, expected", [
    ('MajajHong2015.V4-pls', approx(0.89, abs=0.01)),
    ('MajajHong2015.IT-pls', approx(0.82, abs=0.01)),
])
def test_MajajHong2015(benchmark, expected):
    # Load precomputed features to speed up testing
    filepath = Path(__file__).parent / 'alexnet-majaj2015.private-features.12.nc'
    benchmark = load_benchmark(benchmark)
    # Run with precomputed features instead of a full model (run_test is a local helper)
    score = run_test(benchmark=benchmark, precomputed_features_filepath=filepath)
    assert score == expected
```
Testing approaches:
- Precomputed features: Store model activations to speed up tests (as shown above)
- Quick smoke test: Just verify the benchmark loads without running a full evaluation (see the sketch below)
- Known score regression: Document expected scores to catch breaking changes
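A smoke test can be as small as loading by identifier (a minimal sketch; it assumes only that the benchmark registers under the name used above):

```python
from brainscore_vision import load_benchmark


def test_benchmark_loads():
    benchmark = load_benchmark('MajajHong2015.IT-pls')
    assert benchmark.identifier == 'MajajHong2015.IT-pls'
```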
Common Issues and Solutions
Problem: "Ceiling is greater than 1"
The ceiling calculation may be returning raw correlation values instead of proper ceiling estimates.
```python
# Solution: Ensure ceiling_func uses non-averaged data with split-half
ceiling_func = lambda: ceiler(assembly_repetition)  # NOT assembly (averaged)
```
Problem: "Score is negative"
Negative scores usually indicate stimulus alignment issues between model and biological data.
```python
# Solution: Check stimulus_id alignment
model_ids = set(model_assembly['stimulus_id'].values)
bio_ids = set(biological_assembly['stimulus_id'].values)
assert model_ids == bio_ids, f"Mismatched IDs: {model_ids ^ bio_ids}"
```
Problem: "Model activations shape mismatch"
The model's layer output doesn't match expected dimensions.
```python
# Solution: Verify the region-to-layer mapping
# Check that model.start_recording() is using the correct layer
candidate.start_recording('IT', time_bins=[(70, 170)])
# Ensure your model's layer map includes 'IT' → appropriate layer
```
Problem: "Time bins not found"
Assembly missing temporal dimension required by the benchmark.
```python
# Solution: Ensure the assembly has a time_bin dimension
assert 'time_bin' in assembly.dims or len(assembly.dims) == 2
# For static benchmarks, NeuralBenchmark squeezes single time bins automatically
```
Problem: "PLS regression fails to converge"
Too few stimuli relative to the number of PLS components can make the fit unstable.

```python
# Solution: Check data dimensions
print(f"Samples: {len(assembly['presentation'])}, Features: {len(assembly['neuroid'])}")
# If the fit is unstable, reduce the number of PLS components,
# or switch to Ridge for explicit regularization
```
Next Steps
- Behavioral Benchmarks: Create benchmarks for behavioral data
- Vision vs Language: Differences between vision and language benchmarks