ECCV 2026

LinStereo

Linear-Complexity Global Attention for Multi-Scale Iterative Stereo Matching

Linear-complexity global attention for iterative stereo on a frozen Depth Anything V3 backbone — strong on standard benchmarks and zero-shot to new domains, including underwater.

Australian Centre for Robotics, The University of Sydney  ·  Corresponding author  ·  Project lead

Drag to reveal

One frozen model, many domains, zero fine-tuning. Drag the handle: the stereo input on one side, LinStereo's dense disparity on the other — the same network across street scenes, indoor and outdoor benchmarks, and underwater footage it has never seen.


A single LinStereo model — frozen Depth Anything V3 backbone, trained on SceneFlow only — runs zero-shot across all six datasets, from KITTI street scenes to real underwater SQUID footage it was never trained on. Black = no ground truth.

TL;DR. LinStereo redesigns the iterative stereo update loop around Position-Aware Linear Attention (PALA) — a global, O(N) operator on a frozen Depth Anything V3 backbone — so reliable depth propagates across the whole image in a single step. It generalizes zero-shot to underwater scenes where local photometric cues collapse, while staying linear per iteration.

28% lower AbsRel · TartanAir-UW ocean
26% lower AbsRel · SQUID (real underwater)
31% lower RMSE vs FoundationStereo · TartanAir-UW
37% lower occluded EPE · Middlebury (H)

Competitive with ViT-L systems on standard benchmarks and best on underwater generalization — all from a ViT-B backbone trained on SceneFlow only, with no underwater, real-world, or domain-specific data at any stage.

Abstract

Existing Vision Foundation Model (VFM)-based iterative stereo pipelines under-exploit three information pathways: multi-scale backbone features are collapsed into single-level correlations, geometric priors remain untapped at initialization, and context propagates only locally. These gaps widen under degraded photometric cues, making underwater scenes a stringent generalization test. To address these gaps, we propose LinStereo, built upon Depth Anything V3, whose core is a Position-Aware Linear Attention (PALA) module that replaces local recurrence with global aggregation at linear cost, propagating reliable estimates from well-matched regions into degraded areas while preserving disparity structure. PALA is made effective by two enabling components: Hierarchical Semantic Cost Volumes (HSCV), which supply scale-aligned correlations from the VFM feature hierarchy, and a Depth Prior Initialization (DPI) that converts monocular depth into a metrically calibrated warm start. LinStereo achieves state-of-the-art performance on standard benchmarks and strong cross-domain generalization, especially on underwater scenes, where severe photometric degradation makes stereo matching particularly challenging. There it surpasses all compared methods with consistent gains (AbsRel improvements of 28% on TartanAir-UW and 26% on SQUID, a real-world underwater dataset).

Method

Modern VFM backbones encode rich multi-scale semantics and implicit geometry, but the iterative update loop still consumes them through a narrow interface inherited from lightweight encoders. LinStereo closes this gap with a single PALA-centred decoder, enabled by HSCV and DPI — not three independent modules.

LinStereo architecture: frozen Depth Anything V3 backbone feeding Hierarchical Semantic Cost Volumes, Depth Prior Initialization, and the Position-Aware Linear Attention updater.
Architecture. A frozen Depth Anything V3 backbone produces, in a single forward pass, multi-scale stereo features and a monocular depth map. The PALA updater replaces the ConvGRU with global linear attention over cost-volume features, using 2D rotary position encoding and an adaptive gating branch.
WHAT to match

HSCV

Hierarchical Semantic Cost Volumes build a per-scale correlation volume from each VFM scale (s ∈ {4,8,16}), kept simultaneously queryable, each with its own 4-level disparity pyramid.

WHERE to start

DPI

Depth Prior Initialization turns DA3's affine-invariant monocular depth into a metric disparity warm start via SIFT-based scale–shift alignment — refinement no longer starts from zero.

HOW to refine

PALA

Position-Aware Linear Attention aggregates global context at O(N) per iteration, propagating reliable disparity from textured regions into degraded, textureless areas.

01

Position-Aware Linear Attention (PALA)

Standard linear attention reaches O(N) by precomputing KᵀV and sharing it across queries — but that collapses all spatial relationships into one global summary, making attention position-agnostic. PALA restores spatial structure with 2D rotary position encoding applied asymmetrically: only the numerator is position-augmented, while the denominator keeps the plain kernels for stable normalization.

$$O_{\text{attn}} = \tilde{Q}\,(\tilde{K}^{\top} V) \cdot \left(\hat{Q}\hat{K} + \varepsilon\right)^{-1}$$

A local spatial encoding on the value branch and an adaptive gate let each location absorb new evidence where matching is confident and preserve its estimate where it is not — so reliable disparity gradually fills uncertain regions. Cost stays O(N·C²); one PALA iteration is as cheap as a ConvGRU step (3.50 ms vs 3.63 ms).
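The asymmetric normalization described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the `elu(x) + 1` feature map, the particular 2D rotary layout (half the channels rotate with the row index, half with the column index), the row-sum form of the denominator, and all function names are assumptions. The local value encoding and the adaptive gate are omitted.

```python
import numpy as np

def feature_map(x):
    """Positive kernel feature map, elu(x) + 1 (an assumed choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def rope_2d(x, ys, xs, base=10000.0):
    """Rotate channel pairs of x (N, C) by angles tied to (row, col); needs C % 4 == 0."""
    n, c = x.shape
    half = c // 2
    freqs = base ** (-np.arange(0, half, 2) / half)      # (half / 2,)
    out = np.empty_like(x)
    for pos, sl in ((ys, slice(0, half)), (xs, slice(half, c))):
        ang = pos[:, None] * freqs[None, :]              # (N, half / 2)
        cos, sin = np.cos(ang), np.sin(ang)
        a, b = x[:, sl][:, 0::2], x[:, sl][:, 1::2]
        rot = np.empty((n, half))
        rot[:, 0::2] = a * cos - b * sin
        rot[:, 1::2] = a * sin + b * cos
        out[:, sl] = rot
    return out

def position_aware_linear_attention(q, k, v, ys, xs, eps=1e-6):
    """Numerator uses rotated kernels, denominator uses the plain positive kernels."""
    q_hat, k_hat = feature_map(q), feature_map(k)
    q_til, k_til = rope_2d(q_hat, ys, xs), rope_2d(k_hat, ys, xs)
    kv = k_til.T @ v                      # (C, C_v): computed once, shared by all queries
    num = q_til @ kv                      # (N, C_v): cost grows linearly with N
    den = q_hat @ k_hat.sum(axis=0) + eps # (N,): strictly positive, so division is stable
    return num / den[:, None]

# Toy 6 x 8 feature grid flattened to N = 48 tokens with C = 8 channels.
rng = np.random.default_rng(0)
h, w, c = 6, 8, 8
ys, xs = np.divmod(np.arange(h * w), w)
q, k, v = (rng.standard_normal((h * w, c)) for _ in range(3))
out = position_aware_linear_attention(q, k, v, ys, xs)
```

Because the rotation can make individual numerator weights negative, normalizing by the rotated kernels could drive the denominator toward zero. Keeping the plain positive kernels there avoids that, which is one reading of the "stable normalization" remark.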

02

Hierarchical Semantic Cost Volumes (HSCV)

Instead of one fixed-resolution correlation volume, HSCV builds one per VFM scale and keeps the whole set accessible, so PALA can re-query any scale at every iteration. Each volume carries its own disparity-axis pyramid for coarse-to-fine matching while preserving spatial resolution — a two-level hierarchy across scales and within each scale.

Replacing the hierarchy with a single pooled feature degrades EPE 1.01→1.07.
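A small NumPy sketch of the two-level hierarchy, assuming each per-scale volume is an all-pairs correlation along image rows (as in RAFT-Stereo-style pipelines) and that the disparity pyramid is built by average-pooling only the matching axis. Both choices, and the names `correlation_volume`, `disparity_pyramid` and `build_hscv`, are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def correlation_volume(feat_l, feat_r):
    """Row-wise all-pairs correlation: (H, W, C) x (H, W, C) -> (H, W, W)."""
    c = feat_l.shape[-1]
    return np.einsum("hwc,hvc->hwv", feat_l, feat_r) / np.sqrt(c)

def disparity_pyramid(volume, levels=4):
    """Halve only the last (matching) axis per level; H and W are preserved."""
    pyramid = [volume]
    for _ in range(levels - 1):
        v = pyramid[-1]
        even = v.shape[-1] // 2 * 2
        pyramid.append(0.5 * (v[..., 0:even:2] + v[..., 1:even:2]))
    return pyramid

def build_hscv(feats_l, feats_r, levels=4):
    """One pyramid per backbone scale, all kept so any scale can be re-queried."""
    return {
        s: disparity_pyramid(correlation_volume(feats_l[s], feats_r[s]), levels)
        for s in feats_l
    }

# Random stand-ins for backbone features of a 64 x 128 image at scales 4, 8, 16.
rng = np.random.default_rng(0)
feats_l = {s: rng.standard_normal((64 // s, 128 // s, 16)) for s in (4, 8, 16)}
feats_r = {s: rng.standard_normal((64 // s, 128 // s, 16)) for s in (4, 8, 16)}
hscv = build_hscv(feats_l, feats_r)
```

Keeping the volumes in a dictionary keyed by scale mirrors the idea that the updater can look up any scale at any iteration instead of receiving one pooled feature.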

03

Depth Prior Initialization (DPI)

Disparity is inversely proportional to depth, so DPI converts the monocular prediction via d = α/D_mono + β, solving the scale–shift (α, β) by least squares from sparse SIFT correspondences along epipolar lines. Refinement starts from a geometrically plausible state rather than zero.

Honest fallback: when inliers < 20 it reverts to zero-disparity init — firing on 0% of TartanAir-UW and 3.7% of SQUID frames, costing only +0.08 px EPE.
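The alignment and its fallback can be sketched as an ordinary least-squares fit. This is a hedged illustration: the SIFT matching and outlier rejection are assumed to happen upstream, so the sketch simply counts the correspondences it receives, and the function names are hypothetical.

```python
import numpy as np

MIN_INLIERS = 20  # fallback threshold quoted in the text

def fit_scale_shift(mono_depth_samples, sparse_disparity):
    """Least-squares (alpha, beta) for d = alpha / D_mono + beta."""
    design = np.stack(
        [1.0 / mono_depth_samples, np.ones_like(mono_depth_samples)], axis=1
    )
    (alpha, beta), *_ = np.linalg.lstsq(design, sparse_disparity, rcond=None)
    return alpha, beta

def depth_prior_init(mono_depth, ys, xs, sparse_disparity):
    """Dense disparity warm start, or zeros when too few correspondences survive."""
    if len(sparse_disparity) < MIN_INLIERS:
        return np.zeros_like(mono_depth)
    alpha, beta = fit_scale_shift(mono_depth[ys, xs], sparse_disparity)
    return alpha / mono_depth + beta

# Synthetic check: disparities generated with alpha = 50, beta = 2 at 40 pixels.
rng = np.random.default_rng(0)
mono_depth = rng.uniform(1.0, 10.0, size=(32, 48))
ys, xs = rng.integers(0, 32, size=40), rng.integers(0, 48, size=40)
sparse = 50.0 / mono_depth[ys, xs] + 2.0
init = depth_prior_init(mono_depth, ys, xs, sparse)
fallback = depth_prior_init(mono_depth, ys[:10], xs[:10], sparse[:10])
```

The fit is linear in (α, β) because the unknowns multiply 1/D_mono and a constant, so no iterative solver is needed.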

Qualitative comparison

On well-conditioned scenes recent methods look near-identical; underwater, the differences are stark and visible at a glance.

Underwater qualitative comparison grid: left/right images, ground truth, FoundationStereo, MGStereo, DEFOM-Stereo, Stereo Anywhere, and Ours, on TartanAir-UW and SQUID.
Underwater (TartanAir-UW & SQUID). Columns: input, ground truth, FoundationStereo, MGStereo, DEFOM-Stereo, Stereo Anywhere, and Ours. Baselines over-estimate depth in distant, backscatter-heavy regions and waver in the near field; LinStereo keeps a consistent scale at both ranges. Black = no GT.
Standard-benchmark qualitative comparison on Booster and ETH3D.
Standard benchmarks (Booster Q & ETH3D). LinStereo recovers sharper thin structures and handles non-Lambertian surfaces (transparent bottles, car windows).

More qualitative results

Per-benchmark comparisons, one scene per row: the input alongside recent baselines and LinStereo (Ours, highlighted) — across standard, underwater, real-world, and synthetic domains.

Reproduced from the paper's supplementary material. See the paper for full discussion.

SeaStereo Dataset

A physically-rendered underwater stereo corpus with dense disparity ground truth, released with the paper.

~40K stereo pairs  ·  7 Jerlov water types  ·  1000+ configurations  ·  dense GT disparity
SeaStereo rendering pipeline: ShapeNet objects composited over marine backgrounds in Blender under Jerlov water models.
Rendering pipeline. ShapeNetCore foreground objects composited over real marine backgrounds (coral, fish, shipwrecks) and rendered in Blender under varying Jerlov water types — spanning 1000+ configurations with varied trajectories, focal lengths, inter-ocular distances, and seafloor depths.

More dataset samples

Each card shows the left & right stereo views and the dense disparity ground truth, spanning Jerlov water types I–3C, seafloor depths, and camera configurations.

Each card: Left · Right · Disp.
Jerlov IA · shallow
Jerlov IB · shallow
Jerlov IC · shallow
Jerlov 3C · shallow
Jerlov IB · shallow
Jerlov II · shallow
Jerlov III · shallow
Jerlov I · deep
Jerlov II · deep
Jerlov I · deep
Jerlov IA · deep
Jerlov IB · shallow

Water-condition explorer

Hold the method fixed, change the water. As turbidity and depth climb, LinStereo holds.

LinStereo result under varying water conditions

Standard benchmarks

Zero-shot cross-domain, official weights. We keep LinStereo a ViT-B, SceneFlow-only model to isolate the decoder's contribution — competitive with ViT-L systems and extra-data methods, and best among comparable models on occluded Middlebury.

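| Method | Extra data | KITTI 2015 EPE↓ | KITTI 2015 bad3↓ | KITTI 2012 EPE↓ | KITTI 2012 bad3↓ | Middlebury (H) All EPE↓ | Middlebury (H) All bad2↓ | Middlebury (H) Noc EPE↓ | Middlebury (H) Noc bad2↓ | Middlebury (H) Occ EPE↓ | Middlebury (H) Occ bad2↓ | ETH3D EPE↓ | ETH3D bad1↓ | Booster (Q) EPE↓ | Booster (Q) bad2↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSMNet | | 4.05 | 28.51 | 3.77 | 27.34 | 11.48 | 36.06 | 10.56 | 32.11 | 17.69 | 62.58 | 2.15 | 15.13 | 14.17 | 49.79 |
| RAFT-Stereo | | 1.13 | 5.69 | 0.90 | 4.35 | 1.92 | 12.6 | 1.09 | 8.65 | 3.31 | 26.39 | 0.36 | 3.3 | 4.18 | 17.64 |
| IGEV-Stereo | | 1.21 | 6.03 | 1.03 | 5.13 | 2.63 | 11.93 | 2.27 | 9.49 | 5.02 | 26.04 | 0.33 | 4.0 | 4.26 | 17.58 |
| UniMatch | | 1.63 | 11.01 | 1.71 | 12.04 | 4.21 | 34.72 | 3.96 | 31.52 | 5.70 | 56.51 | 0.69 | 18.61 | 8.57 | 47.30 |
| Selective-RAFT | | 1.27 | 6.68 | 1.08 | 5.19 | 2.34 | 12.04 | 2.05 | 9.45 | 4.17 | 27.4 | 0.34 | 4.36 | 4.14 | 19.52 |
| Selective-IGEV | | 1.25 | 6.06 | 1.08 | 5.64 | 2.59 | 11.79 | 2.31 | 9.22 | 4.35 | 28.10 | 0.33 | 4.05 | 4.62 | 19.28 |
| MoCha-Stereo | | 1.29 | 5.97 | 1.02 | 4.83 | 2.66 | 10.18 | 2.49 | 7.96 | 3.84 | 24.16 | 0.28 | 3.47 | 3.88 | 16.82 |
| NMRF | | 1.17 | 5.31 | 0.92 | 4.63 | 2.91 | 13.36 | 2.73 | 10.90 | 4.22 | 29.27 | 0.31 | 3.8 | 5.05 | 26.22 |
| IGEV++ | | 1.27 | 6.22 | 1.20 | 6.81 | 3.21 | 10.85 | 2.41 | 7.86 | 6.83 | 28.68 | 0.35 | 4.60 | 5.00 | 18.62 |
| MGStereo | | 1.13 | 5.62 | 0.87 | 4.16 | 1.15 | 8.39 | 0.85 | 5.67 | 2.89 | 26.50 | 0.25 | 1.88 | 2.26 | 11.02 |
| Stereo Anywhere ViT-L | | 1.07 | 4.93 | 0.83 | 3.90 | 0.94 | 6.96 | 0.79 | 4.75 | 2.67 | 20.34 | 0.24 | 2.66 | 2.21 | 9.91 |
| DEFOM-Stereo ViT-L | | 1.06 | 4.99 | 0.84 | 3.76 | 0.84 | 5.91 | 0.90 | 5.76 | 2.11 | 19.34 | 0.35 | 2.35 | 3.52 | 13.20 |
| MonSter ViT-L | | 0.89 | 3.58 | 0.75 | 3.58 | 1.98 | 10.41 | 2.03 | 8.79 | 2.62 | 21.56 | 0.23 | 1.41 | 3.08 | 17.45 |
| BridgeDepth ViT-L | | 1.13 | 4.90 | 0.86 | 4.39 | 2.21 | 10.27 | 2.25 | 8.66 | 2.68 | 20.42 | 0.24 | 1.33 | 4.71 | 18.79 |
| FoundationStereo ViT-L | | 0.95 | 3.65 | 0.71 | 3.18 | 0.71 | 3.47 | 0.42 | 1.46 | 2.67 | 16.44 | 0.21 | 1.14 | 1.77 | 6.89 |
| LinStereo ViT-B (Ours) | | 1.01 | 4.54 | 0.76 | 3.49 | 0.83 | 6.01 | 0.89 | 5.67 | 1.33 | 8.69 | 0.24 | 2.41 | 2.14 | 9.93 |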

Bold = best, underline = second-best among comparable methods (no extra data). FoundationStereo (ViT-L + extra data, grey) is the strongest baseline overall and is shown for reference. LinStereo's occluded EPE of 1.33 on Middlebury is 37% below the previous best (DEFOM-Stereo, 2.11), direct evidence of PALA propagating reliable disparity into occluded pixels. EPE and badX (% of pixels whose error exceeds the X-pixel threshold) are reported per benchmark.
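For reference, the two disparity metrics in this table can be computed as below. The sketch assumes badX is a pure absolute threshold in pixels over valid ground-truth pixels; individual benchmarks may add their own masking rules, and the helper names are illustrative.

```python
import numpy as np

def epe(pred, gt, valid):
    """End-point error: mean absolute disparity error over valid pixels (px)."""
    return float(np.abs(pred - gt)[valid].mean())

def bad_x(pred, gt, valid, threshold):
    """Percentage of valid pixels whose absolute error exceeds `threshold` px."""
    return float(100.0 * (np.abs(pred - gt)[valid] > threshold).mean())

def relative_reduction(baseline, ours):
    """Fractional improvement of a lower-is-better metric."""
    return (baseline - ours) / baseline

# Tiny example: the last pixel has no ground truth and is masked out.
pred = np.array([1.0, 2.0, 3.0, 10.0])
gt = np.array([1.0, 2.5, 6.5, 0.0])
valid = np.array([True, True, True, False])
print(epe(pred, gt, valid), bad_x(pred, gt, valid, 2.0))

# Occluded EPE on Middlebury (H): 2.11 (DEFOM-Stereo) vs 1.33 (LinStereo), about 37%.
print(round(100 * relative_reduction(2.11, 1.33)))
```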

Underwater generalization

The headline domain. Trained on SceneFlow only, LinStereo matches or beats every baseline on every metric on both benchmarks, including methods trained with real-world or domain-specific data.

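TartanAir-UW (simulated, long-range backscatter)

| Method | Extra data | AbsRel↓ | SqRel↓ | RMSE↓ | LogRMSE↓ | A1↑ | A2↑ | A3↑ |
|---|---|---|---|---|---|---|---|---|
| UniMatch | | 0.17 | 2.59 | 6.35 | 0.26 | 0.759 | 0.830 | 0.837 |
| MoCha-Stereo | | 0.15 | 2.30 | 6.02 | 0.25 | 0.769 | 0.835 | 0.843 |
| NMRF | | 0.11 | 1.08 | 4.63 | 0.21 | 0.811 | 0.856 | 0.870 |
| PSMNet | | 0.10 | 1.02 | 4.56 | 0.21 | 0.813 | 0.857 | 0.871 |
| RAFT-Stereo | | 0.08 | 0.65 | 4.36 | 0.20 | 0.902 | 0.963 | 0.984 |
| MGStereo | | 0.08 | 0.55 | 3.69 | 0.19 | 0.911 | 0.968 | 0.987 |
| Stereo Anywhere | | 0.06 | 0.41 | 3.24 | 0.18 | 0.946 | 0.980 | 0.989 |
| DEFOM-Stereo | | 0.05 | 0.41 | 3.13 | 0.17 | 0.953 | 0.982 | 0.990 |
| Selective-Stereo (RAFT) | | 0.09 | 0.67 | 4.31 | 0.21 | 0.902 | 0.962 | 0.981 |
| Selective-Stereo (IGEV) | | 0.11 | 1.02 | 5.01 | 0.23 | 0.856 | 0.942 | 0.976 |
| IGEV-Stereo | | 0.10 | 0.90 | 4.68 | 0.21 | 0.891 | 0.955 | 0.979 |
| IGEV++ | | 0.09 | 0.81 | 4.37 | 0.20 | 0.905 | 0.962 | 0.983 |
| FoundationStereo | | 0.05 | 0.40 | 3.01 | 0.16 | 0.959 | 0.984 | 0.991 |
| LinStereo ViT-B (Ours) | | 0.04 | 0.19 | 2.08 | 0.09 | 0.959 | 0.992 | 0.998 |

SQUID (real-world, color attenuation)

| Method | Extra data | AbsRel↓ | SqRel↓ | RMSE↓ | LogRMSE↓ | A1↑ | A2↑ | A3↑ |
|---|---|---|---|---|---|---|---|---|
| UniMatch | | 1.38 | 20.91 | 6.42 | 0.66 | 0.561 | 0.660 | 0.734 |
| MoCha-Stereo | | 0.15 | 0.61 | 1.55 | 0.20 | 0.875 | 0.931 | 0.958 |
| NMRF | | 0.43 | 5.27 | 2.57 | 0.33 | 0.844 | 0.904 | 0.935 |
| PSMNet | | 0.46 | 4.31 | 3.43 | 0.51 | 0.736 | 0.813 | 0.853 |
| RAFT-Stereo | | 0.07 | 0.27 | 1.25 | 0.12 | 0.937 | 0.971 | 0.987 |
| MGStereo | | 0.09 | 0.71 | 1.99 | 0.16 | 0.925 | 0.958 | 0.976 |
| Stereo Anywhere | | 0.07 | 0.40 | 1.46 | 0.13 | 0.937 | 0.976 | 0.985 |
| DEFOM-Stereo | | 0.09 | 0.63 | 2.00 | 0.16 | 0.915 | 0.954 | 0.977 |
| Selective-Stereo (RAFT) | | 0.11 | 0.22 | 1.16 | 0.16 | 0.877 | 0.940 | 0.968 |
| Selective-Stereo (IGEV) | | 0.08 | 0.21 | 1.05 | 0.14 | 0.932 | 0.966 | 0.980 |
| IGEV-Stereo | | 0.20 | 1.35 | 2.68 | 0.46 | 0.760 | 0.821 | 0.863 |
| IGEV++ | | 0.06 | 0.22 | 1.11 | 0.12 | 0.950 | 0.981 | 0.990 |
| FoundationStereo | | 0.07 | 0.30 | 1.36 | 0.13 | 0.940 | 0.976 | 0.987 |
| LinStereo ViT-B (Ours) | | 0.04 | 0.12 | 0.90 | 0.08 | 0.970 | 0.990 | 0.996 |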

The largest gains track each benchmark's dominant degradation: 31% lower RMSE than FoundationStereo on TartanAir-UW (long-range backscatter) and 26% lower AbsRel than IGEV++ on SQUID (color attenuation). All baselines shown; AbsRel/SqRel/RMSE/LogRMSE lower is better, A1/A2/A3 higher is better.
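The depth metrics named in this caption follow the usual monocular-depth conventions, sketched below. It is an assumption here that they are evaluated on metric depth over valid pixels, with the standard 1.25, 1.25² and 1.25³ accuracy thresholds for A1 to A3; the function name is illustrative.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth errors and threshold accuracies; pred and gt must be positive."""
    diff = pred - gt
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "AbsRel": float(np.mean(np.abs(diff) / gt)),
        "SqRel": float(np.mean(diff ** 2 / gt)),
        "RMSE": float(np.sqrt(np.mean(diff ** 2))),
        "LogRMSE": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "A1": float(np.mean(ratio < 1.25)),
        "A2": float(np.mean(ratio < 1.25 ** 2)),
        "A3": float(np.mean(ratio < 1.25 ** 3)),
    }

gt = np.array([1.0, 2.0, 4.0])
perfect = depth_metrics(gt, gt)          # all errors 0, all accuracies 1
doubled = depth_metrics(2.0 * gt, gt)    # every prediction is twice too far
```

AbsRel and the A-thresholds are scale-relative, which is why they are sensitive to the global scale drift that long-range backscatter tends to induce, while RMSE is dominated by the largest absolute errors.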

Real-world deployment

Beyond simulation: a controlled laboratory water tank at close range (< 2 m), with AprilTag + CAD-model ground truth down to sub-millimetre — including ~3 mm taut ropes as a fine-structure stress test.

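| Method | Extra data | AbsRel↓ | SqRel↓ | RMSE↓ | LogRMSE↓ | A1↑ | A2↑ | A3↑ |
|---|---|---|---|---|---|---|---|---|
| PSMNet | | 0.18 | 0.16 | 0.34 | 0.28 | 0.87 | 0.90 | 0.93 |
| UniMatch | | 0.08 | 0.04 | 0.17 | 0.16 | 0.93 | 0.96 | 0.98 |
| Stereo Anywhere | | 0.09 | 0.06 | 0.20 | 0.19 | 0.92 | 0.93 | 0.97 |
| RAFT-Stereo | | 0.06 | 0.03 | 0.15 | 0.14 | 0.94 | 0.96 | 0.99 |
| MGStereo | | 0.08 | 0.05 | 0.18 | 0.17 | 0.93 | 0.94 | 0.98 |
| Selective-Stereo (RAFT) | | 0.07 | 0.03 | 0.14 | 0.14 | 0.94 | 0.97 | 0.99 |
| Selective-Stereo (IGEV) | | 0.08 | 0.04 | 0.16 | 0.16 | 0.91 | 0.95 | 0.99 |
| IGEV-Stereo | | 0.06 | 0.02 | 0.13 | 0.12 | 0.94 | 0.97 | 0.99 |
| IGEV++ | | 0.05 | 0.02 | 0.12 | 0.12 | 0.96 | 0.97 | 0.99 |
| FoundationStereo | | 0.07 | 0.04 | 0.18 | 0.17 | 0.93 | 0.94 | 0.98 |
| LinStereo (Ours, T=8) | | 0.04 | 0.01 | 0.07 | 0.07 | 0.98 | 0.99 | 1.00 |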

Bold = best. Extra data (✓) = trained beyond SceneFlow. AbsRel/SqRel/RMSE/LogRMSE lower is better; accuracy thresholds A1/A2/A3 higher is better. Scored only on valid-GT pixels projecting onto the CAD mesh. LinStereo — SceneFlow-only — leads every metric at this near-range, fine-structure target. Qualitative close-up in More qualitative results → Real-world tank.

Efficiency & runtime

LinStereo trades raw speed for accuracy and zero-shot generalisation: it is a large model built on a frozen Depth Anything V3 backbone. The cost is explicit below — yet its core PALA update operator costs about the same per iteration as a local ConvGRU, despite being global and linear in image size.

Computational efficiency

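| Method | Params (M)↓ | GFLOPs↓ | Time (ms)↓ | FPS↑ |
|---|---|---|---|---|
| LightStereo-S | 3.44 | 45.4 | 9.9 | 101.0 |
| CoEx | 2.73 | 70.6 | 11.0 | 91.0 |
| CGI-Stereo | 3.50 | 81.6 | 12.8 | 78.1 |
| Fast-ACVNet | 3.08 | 104.6 | 13.2 | 76.0 |
| ADStereo | 7.10 | 201.6 | 15.4 | 64.7 |
| RAFT-Stereo† | 9.87 | 296.4 | 18.0 | 55.7 |
| RT-IGEV | 4.17 | 354.2 | 24.7 | 40.5 |
| MobileStereoNet | 2.35 | 175.6 | 37.7 | 26.5 |
| AANet | 3.93 | | 65.8 | 15.2 |
| LinStereo (Ours, T=2) | 127.0 | 770.4 | 80 | 12.5 |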

Bold = best. Measured at 480×640 on a single NVIDIA RTX 4500 (24 GB); LinStereo at T = 2 (accuracy elsewhere uses T = 8). LinStereo is a large, accuracy-first model — not a lightweight real-time network.

Per-iteration update operator

| Update operator | Latency (ms)↓ |
|---|---|
| PALA (Ours) | 3.50 ± 0.05 |
| RAFT-Stereo ConvGRU | 3.63 ± 0.06 |
| IGEV ConvGRU | 3.43 ± 0.03 |

Per-update latency at 480×640 (± 95% CI). PALA's global, linear-complexity attention costs essentially the same per iteration as the local ConvGRU operators it replaces.
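A latency figure of the form "mean ± 95% CI" can be produced with a harness like the one below. This is a generic sketch, not the authors' benchmarking script: it assumes a normal approximation (1.96 standard errors), uses wall-clock timing, and the names `time_operator` and `mean_ci95` are hypothetical. On a GPU the device would need to be synchronized before each clock read.

```python
import time
import numpy as np

def time_operator(fn, warmup=5, repeats=50):
    """Return per-call wall-clock latencies in milliseconds."""
    for _ in range(warmup):          # discard cold-start effects
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return np.asarray(samples)

def mean_ci95(samples_ms):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    s = np.asarray(samples_ms, dtype=float)
    half_width = 1.96 * s.std(ddof=1) / np.sqrt(s.size)
    return float(s.mean()), float(half_width)

# Stand-in workload: a small matrix product instead of a real update operator.
x = np.ones((128, 128))
latencies = time_operator(lambda: x @ x, warmup=2, repeats=20)
mean_ms, ci_ms = mean_ci95(latencies)
print(f"{mean_ms:.3f} ± {ci_ms:.3f} ms")
```

With intervals this narrow, the three operators in the table differ by roughly 0.1 to 0.2 ms per update, which is small compared with the 80 ms total runtime at T = 2 reported earlier in this section.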

BibTeX

@inproceedings{wang2026linstereo,
  title         = {LinStereo: Linear-Complexity Global Attention for
                   Multi-Scale Iterative Stereo Matching},
  author        = {Wang, Yiran and Turner, Oliver and Ila, Viorela},
  booktitle     = {European Conference on Computer Vision (ECCV)},
  year          = {2026},
  eprint        = {2606.25437},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}