Alpha Research · Paper 3

Statistical Arbitrage in Crypto: An Honest Out-of-Sample Audit

Bryan Vine · June 16, 2026


Abstract

Statistical arbitrage — PCA-residual mean-reversion and cointegration pairs — is the natural place to look for edge in a young, inefficient market, and crypto backtests routinely report >90% pair win-rates. We test whether any of it survives realistic costs out-of-sample, on a daily 30-coin panel (2023–2026) and an hourly ~50-coin panel (2024–2026), with every position estimated strictly walk-forward. It does not. Daily PCA-residual (Avellaneda–Lee s-score) reversion on liquid coins has no edge even gross (Sharpe −0.06) and is strongly negative net (−0.58 at 5 bp/side). The hourly version posts an apparent +0.46 net Sharpe — but it is a stale-price artifact: the edge rises monotonically as liquidity falls (high-liquidity 0.45 → illiquid 1.83), and under realistic per-coin costs it collapses everywhere (illiquid 1.83 → −1.39 at 50 bp, where illiquid alts actually trade). Tellingly, at the unrealistically-low 5 bp the signal is not overfit noise (deflated Sharpe 0.71, PBO 0.21) — it is real non-synchronous-trading reversion that is simply uncapturable. Finally, cointegration pairs are a textbook multiple-testing mirage: 210 pairs that pass the in-sample cointegration test lose out-of-sample even before costs (Sharpe −0.87). There is no net-of-cost crypto stat-arb edge for a small participant; the honest contribution is showing precisely why the headline numbers are illusory.

One-line takeaway. No net-of-cost crypto stat-arb edge survives out-of-sample: the apparent hourly signal is non-synchronous-trading mean-reversion in illiquid coins — real, but uncapturable once you pay their true spreads — and in-sample-selected cointegration pairs lose out-of-sample.

1. Introduction

If alpha lives anywhere in crypto, it should be in statistical arbitrage — exploiting transient mispricings between related coins via mean-reversion. The two canonical engines are PCA-residual reversion (Avellaneda & Lee's “s-score” [1]) and cointegration pairs (Engle–Granger [2]), and crypto backtests of both routinely look spectacular. This paper asks the only question that matters: does any of it survive realistic costs, out-of-sample, for a participant who is not a colocated market maker? Following the series' pattern (Paper 1: a real premium with no tradable edge; Paper 2: a thin premium that is tradable), we pre-register hypotheses and let honest costs and the rigor protocol decide:

H1. PCA-residual and cointegration stat-arb net Sharpe is ≤ 0.5 (often negative) out-of-sample after realistic costs. (Confirmed.)
H2. In-sample-selected cointegration pairs collapse out-of-sample. (Confirmed — 210 pairs lose OOS, even gross.)
H3. Any apparent edge concentrates in illiquid coins (a stale-price artifact) and dies at their real spreads. (Confirmed — monotonic in liquidity; negative at realistic costs.)

2. Data & leakage control

Two panels: a daily 30-coin spot panel (Binance, 2023–2026) and an hourly ~50-coin panel (2024–2026) carrying volume for a liquidity split. Leakage control is structural: the s-score at bar $t$ is estimated only from the trailing window ending at $t-1$, and the resulting weights earn the $t\!-\!1\!\to\!t$ return — so the strategy is walk-forward by construction, never fit on data it trades. Cointegration pairs are re-selected each formation window and traded only out-of-sample. Costs are charged on realized turnover; we report a full per-coin cost sweep because, as we show, the conclusion lives or dies on cost realism.
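The timing convention above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: `walk_forward_pnl`, `contrarian_signal`, and the list-of-bars return layout are hypothetical names and choices, and the toy signal merely stands in for the s-score.

```python
from typing import Callable, List, Sequence

Bar = Sequence[float]  # one return per coin for a single bar (assumed layout)


def contrarian_signal(history: Sequence[Bar]) -> List[float]:
    """Toy stand-in for the s-score: fade the last bar's cross-sectional move.

    Weights are dollar-neutral and scaled to gross exposure 1.0.
    """
    last = history[-1]
    mean = sum(last) / len(last)
    raw = [-(r - mean) for r in last]
    gross = sum(abs(x) for x in raw)
    return [x / gross for x in raw] if gross > 0 else [0.0] * len(last)


def walk_forward_pnl(
    returns: Sequence[Bar],
    signal_fn: Callable[[Sequence[Bar]], List[float]],
    window: int,
    cost_per_side: float,
) -> List[float]:
    """Net PnL per bar, with weights fit only on bars strictly before t."""
    prev_w = [0.0] * len(returns[0])
    pnl = []
    for t in range(window, len(returns)):
        history = returns[t - window:t]  # bars t-window .. t-1, never bar t
        w = signal_fn(history)
        gross = sum(wi * ri for wi, ri in zip(w, returns[t]))  # earns t-1 -> t
        turnover = sum(abs(wi - pi) for wi, pi in zip(w, prev_w))
        pnl.append(gross - cost_per_side * turnover)  # cost on realized turnover
        prev_w = w
    return pnl
```

Because the slice `returns[t - window:t]` excludes bar `t`, the signal cannot see the return it is about to earn; any look-ahead would have to be introduced inside `signal_fn` itself.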

3. Method

PCA-residual s-score (Avellaneda–Lee). Over a rolling window we extract the top-$k$ eigenportfolios of standardized returns, regress each coin on them, and model the cumulative residual $X_i$ as an OU process with equilibrium mean $m_i$ and equilibrium standard deviation $\sigma_{eq,i}$. The standardized deviation $s_i=(X_i-m_i)/\sigma_{eq,i}$ is the signal; we go contrarian on it (buy oversold residuals, sell overbought), dollar-neutral, gross exposure 1.0.

Cointegration pairs. Each formation window we test all pairs for cointegration (Engle–Granger, $p<0.05$), select the strongest, and trade the spread z-score (enter $|z|>2$, exit $|z|<0.5$) out-of-sample.
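The two signals can be sketched as follows. This is a hedged illustration rather than the paper's implementation: `s_score` assumes the factor regression has already been run and takes one coin's window of residuals, fitting the OU process through its AR(1) discretization; `pair_positions` assumes the spread z-score series is already computed. Both function names, the guards, and the sign convention (+1 means long the spread) are choices made here.

```python
import math
from typing import List, Optional, Sequence


def s_score(residuals: Sequence[float]) -> Optional[float]:
    """s = (X - m) / sigma_eq for one coin, or None if no reversion is found."""
    if len(residuals) < 4:
        return None
    # Cumulative residual X_k: the process modeled as OU.
    x: List[float] = []
    total = 0.0
    for e in residuals:
        total += e
        x.append(total)

    # Discrete OU is an AR(1): X_{k+1} = a + b * X_k + noise (least squares).
    lag, lead = x[:-1], x[1:]
    n = len(lag)
    mean_lag, mean_lead = sum(lag) / n, sum(lead) / n
    var_lag = sum((u - mean_lag) ** 2 for u in lag)
    if var_lag == 0.0:
        return None
    cov = sum((u - mean_lag) * (v - mean_lead) for u, v in zip(lag, lead))
    b = cov / var_lag
    a = mean_lead - b * mean_lag
    if not 0.0 < b < 1.0:
        return None  # no usable mean reversion in this window

    noise_var = sum((v - a - b * u) ** 2 for u, v in zip(lag, lead)) / (n - 2)
    if noise_var == 0.0:
        return None
    m = a / (1.0 - b)                                # equilibrium mean
    sigma_eq = math.sqrt(noise_var / (1.0 - b * b))  # equilibrium std
    return (x[-1] - m) / sigma_eq


def pair_positions(z_scores: Sequence[float],
                   enter: float = 2.0, exit_: float = 0.5) -> List[int]:
    """Spread position per bar: enter at |z| > enter, flatten at |z| < exit_."""
    pos, out = 0, []
    for z in z_scores:
        if pos == 0:
            if z > enter:
                pos = -1  # spread rich: short it
            elif z < -enter:
                pos = 1   # spread cheap: long it
        elif abs(z) < exit_:
            pos = 0
        out.append(pos)
    return out
```

In a walk-forward setting, the position decided on a bar's z-score would be applied to the following bar's spread return; the gap between the entry and exit thresholds is what keeps the rule from flipping on every small wiggle.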

3.1 The rigor gauntlet

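Every configuration passes through the same sequence: (i) a degenerate-signal check; (ii) the deflated Sharpe ratio [4] and the probability of backtest overfitting (PBO) via CSCV [5], both computed over the configuration grid; (iii) purged and embargoed walk-forward evaluation; (iv) a liquidity-tercile decomposition with a per-tercile cost sweep, the decisive test for stale-price artifacts; and (v) the +0.3 mean-reversion floor as the bar to clear.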
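For readers unfamiliar with the deflated Sharpe ratio, a minimal sketch of the published formula [4] follows. The function names and argument layout are assumptions made here; all Sharpe inputs are per-period (not annualized), `kurt` is raw kurtosis (3 for a normal), and `n_trials` must be at least 2.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()


def expected_max_sharpe(var_trials: float, n_trials: int) -> float:
    """Expected best Sharpe among n_trials configurations with no true edge."""
    return math.sqrt(var_trials) * (
        (1.0 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    )


def deflated_sharpe(sr: float, n_obs: int, skew: float, kurt: float,
                    var_trials: float, n_trials: int) -> float:
    """Probability that the observed Sharpe beats the selection-adjusted hurdle.

    var_trials is the variance of Sharpe ratios across the tried configurations.
    """
    sr0 = expected_max_sharpe(var_trials, n_trials)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _STD_NORMAL.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)
```

The statistic is a probability, so a value such as 0.71 reads as the chance that the selected configuration's Sharpe exceeds what the best of the tried configurations would show by luck alone. Holding everything else fixed, searching a larger grid raises the hurdle and lowers the result.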

4. Experiments & results

4.1 Daily stat-arb on liquid coins is dead (H1)

On the 30 most-liquid coins, daily PCA-residual reversion has no edge before costs (gross Sharpe −0.06) and is firmly negative net (−0.58 at 5 bp/side, −1.1 at 10 bp, −2.13 at 20 bp), with a −74% maximum drawdown and a 49% win rate (Figure 1). There is simply nothing to harvest where you could actually trade it cheaply.

**Figure 1.** Daily PCA-residual stat-arb on the 30 liquid coins: gross PnL (dashed) is flat-to-down; net of a modest 5 bp/side it bleeds out. No edge even before costs.

4.2 The hourly “edge” is a stale-price artifact (H3)

At hourly frequency the strategy posts an apparent +0.46 net Sharpe. Decomposing by liquidity reveals it for what it is: the Sharpe rises monotonically as liquidity falls, 0.45 (high-liquidity) → 0.98 (mid) → 1.83 (illiquid alts). This is the signature of non-synchronous-trading mean-reversion [3], where stale prices in thinly-traded coins “revert” mechanically. The decisive test is the per-coin cost sweep (Figure 2): at an unrealistically low 5 bp/side every tercile is positive, and by 20 bp only the illiquid tercile still is (0.75), with the liquid tercile already negative. At the spreads illiquid alts actually carry (50–100 bp), the illiquid tercile is deeply negative too: 1.83 → −1.39 (50 bp) → −3.46 (80 bp). The edge lives exactly where it cannot be captured.
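A toy simulation makes the stale-price mechanism concrete. It is an illustration under simplifying assumptions, not the paper's test: one common factor, a known beta of 1, Gaussian shocks, and a coin whose price prints on each bar only with probability `1 - p_stale`. Untraded bars show a zero return, and the accumulated true return appears at the next print. The function name and parameters are hypothetical.

```python
import random


def residual_autocorr(p_stale: float, n_bars: int = 20000, seed: int = 0) -> float:
    """Lag-1 autocorrelation of a coin's observed residual versus the factor."""
    rng = random.Random(seed)
    pending = 0.0  # true return accumulated since the last print
    resid = []
    for _ in range(n_bars):
        factor = rng.gauss(0.0, 1.0)
        idio = rng.gauss(0.0, 1.0)
        pending += factor + idio          # true return keeps accruing
        if rng.random() >= p_stale:       # the coin trades on this bar
            observed, pending = pending, 0.0
        else:                             # stale price: observed return is zero
            observed = 0.0
        resid.append(observed - factor)   # residual against the factor

    mean = sum(resid) / len(resid)
    num = sum((u - mean) * (v - mean) for u, v in zip(resid[:-1], resid[1:]))
    den = sum((u - mean) ** 2 for u in resid)
    return num / den
```

With `p_stale = 0` the residual is pure idiosyncratic noise and its autocorrelation is near zero. With a stale coin the residual is negatively autocorrelated: a missed factor move shows up as a residual of the opposite sign, then reverses when the price finally prints. That reversal is the kind of pattern a contrarian residual signal picks up, even though nobody could have traded at the stale price.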

**Figure 2.** Net Sharpe by liquidity tercile across one-way cost levels. At 5 bp the “edge” is largest in the illiquid coins (red); at the realistic illiquid-coin spread (shaded, 50–100 bp) every tercile is deeply negative. The apparent alpha is a stale-price artifact.

Liquidity tercile (net Sharpe, by one-way cost)	5 bp	20 bp	50 bp	80 bp
High liquidity	0.45	−0.86	−3.42	−5.79
Mid liquidity	0.98	−0.42	−3.14	−5.65
Low liquidity (illiquid)	1.83	0.75	−1.39	−3.46
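The arithmetic behind each row of such a sweep is simple: subtract cost times turnover from the gross return on every bar, then recompute the Sharpe. The sketch below assumes per-bar gross returns and turnover are already available for one tercile. The function names, the default cost grid, and the hourly annualization factor of 24 × 365 are assumptions made here.

```python
import math
import statistics
from typing import Dict, Sequence


def net_sharpe(gross: Sequence[float], turnover: Sequence[float],
               cost_bps: float, bars_per_year: int) -> float:
    """Annualized Sharpe after charging cost_bps per side on traded notional."""
    cost = cost_bps / 1e4
    net = [g - cost * t for g, t in zip(gross, turnover)]
    sd = statistics.stdev(net)
    if sd == 0.0:
        return float("nan")
    return statistics.mean(net) / sd * math.sqrt(bars_per_year)


def cost_sweep(gross: Sequence[float], turnover: Sequence[float],
               cost_levels_bps: Sequence[float] = (5, 20, 50, 80),
               bars_per_year: int = 24 * 365) -> Dict[float, float]:
    """Net Sharpe at each one-way cost level for one liquidity tercile."""
    return {c: net_sharpe(gross, turnover, c, bars_per_year)
            for c in cost_levels_bps}
```

A useful by-product is the break-even cost, the mean gross return per bar divided by the mean turnover per bar: a strategy that turns over most of its book every hour needs a large gross edge per bar just to survive a few basis points per side.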

Crucially, this is not an overfitting story. At the (fictional) 5 bp cost, the hourly strategy passes selection rigor — deflated Sharpe 0.71, PBO 0.21, walk-forward OOS 0.36. The reversion is statistically real; it is just economically uncapturable. That distinction is the point.

4.3 Cointegration pairs: a multiple-testing mirage (H2)

Cointegration pairs fail differently — through selection. Across formation windows, 210 pairs passed the in-sample cointegration test ($p<0.05$). Traded out-of-sample, they lose money even before costs (Sharpe −0.87 gross, −0.89 net; Figure 3). Testing hundreds of pairs guarantees spurious in-sample cointegration that does not persist — the headline “>90% win-rate” backtests are selecting noise.
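The scale of the selection problem follows from simple counting. The sketch below assumes every pair in a panel is tested once at level $\alpha$ and that no pair is truly cointegrated; the source does not state which panel the pairs are drawn from, so the 30- and 50-coin cases are illustrative only, and the function name is hypothetical.

```python
from math import comb
from typing import Tuple


def spurious_pair_count(n_coins: int, alpha: float = 0.05) -> Tuple[int, float]:
    """(number of unordered pairs, expected false positives per window)."""
    n_pairs = comb(n_coins, 2)
    # Linearity of expectation: valid even though pair tests share coins.
    return n_pairs, alpha * n_pairs


for n in (30, 50):
    pairs, expected = spurious_pair_count(n)
    print(f"{n} coins: {pairs} pairs, ~{expected:.1f} expected to pass by chance")
```

Thirty coins already give 435 pairs, so roughly 22 would pass a 5% test in each formation window by chance alone. Over several windows, a pool on the order of the 210 selected pairs can accumulate without any genuine cointegration being present.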

**Figure 3.** Out-of-sample performance of cointegration pairs: 210 in-sample-selected “cointegrated” pairs lose out-of-sample even gross of costs, a textbook multiple-testing collapse.

5. Discussion

Crypto looks like the ideal habitat for statistical arbitrage, and the backtests oblige. Every one of them, here, is an illusion — but the illusions have two distinct anatomies worth separating.

There is no net-of-cost statistical-arbitrage edge for a small participant in crypto: the apparent hourly signal is non-synchronous-trading mean-reversion in illiquid coins — real, but uncapturable once you pay their true spreads — and in-sample-selected cointegration pairs are a multiple-testing mirage that loses out-of-sample.

The PCA-residual edge is a measurement illusion: stale prices in illiquid coins manufacture mechanical reversion that a naive flat-cost backtest counts as alpha, but that the coins' real 50–100 bp spreads erase. The cointegration edge is a selection illusion: enough pairs, enough tests, and some will look cointegrated in-sample by chance. Neither survives the honest treatment. This completes a clean triptych across the series — Paper 1's real-but-untradable premium, Paper 2's thin-but-tradable factor, and Paper 3's tradable-looking-but-illusory edge — and underscores the program's thesis: in efficient-enough markets, rigorous cost modeling and out-of-sample discipline are the alpha, because they are what separate the one real edge from the many fake ones.

6. Limitations & future work

Cost model. We use per-side cost levels rather than coin-by-coin measured spreads; the conclusion is robust because we sweep the full plausible range and illiquid alts demonstrably trade at 50–100 bp, but a tick-level spread series would sharpen the per-coin picture.

No order book. We test signal economics, not execution. A colocated market maker posting passively (negative-fee, queue-priority) faces a different cost structure, which is the subject of Paper 5 (liquidity provision), not this paper.

Universe & survivorship. We use today's liquid names; a point-in-time, delisted-inclusive universe would, if anything, strengthen the negative (dead coins add stale-price noise, not edge).

Scope. We test PCA-residual and Engle–Granger pairs; richer structures (Johansen baskets, ML-selected spreads, lead–lag across venues) are unlikely to overturn the cost and selection problems, but a cross-venue lead–lag study on a fast feed is the one direction with a non-trivial prior, and it is left to future work.

References

[1] M. Avellaneda & J.-H. Lee (2010). “Statistical Arbitrage in the US Equities Market.” Quantitative Finance 10(7). (The PCA-residual “s-score” method.)
[2] R. Engle & C. Granger (1987). “Co-integration and Error Correction: Representation, Estimation, and Testing.” Econometrica 55(2).
[3] A. Lo & A. C. MacKinlay (1990). “When Are Contrarian Profits Due to Stock Market Overreaction?” Review of Financial Studies 3(2). (Non-synchronous trading manufactures spurious mean-reversion.)
[4] D. Bailey & M. López de Prado (2014). “The Deflated Sharpe Ratio.” Journal of Portfolio Management.
[5] D. Bailey, J. Borwein, M. López de Prado & Q. Zhu (2017). “The Probability of Backtest Overfitting.” Journal of Computational Finance.
[6] B. Vine (2026). “Crypto Carry: The Funding-Rate Cross-Section” & “The Volatility Risk Premium, Cross-Asset.” Alpha Research, Papers 1–2.