Alpha Research · Paper 3

Statistical Arbitrage in Crypto: An Honest Out-of-Sample Audit

Abstract

Statistical arbitrage (PCA-residual mean-reversion and cointegration pairs) is the natural place to look for edge in a young, inefficient market, and crypto backtests routinely report >90% pair win-rates. We test whether any of it survives realistic costs out-of-sample, on a daily 30-coin panel (2023–2026) and an hourly ~50-coin panel (2024–2026), with every position estimated strictly walk-forward. It does not. Daily PCA-residual (Avellaneda–Lee s-score) reversion on liquid coins has no edge even gross (Sharpe −0.06) and is strongly negative net (−0.58 at 5 bp/side). The hourly version posts an apparent +0.46 net Sharpe, but it is a stale-price artifact: the edge rises monotonically as liquidity falls (high-liquidity 0.45 → illiquid 1.83), and under realistic per-coin costs it collapses everywhere (illiquid 1.83 → −1.39 at 50 bp, where illiquid alts actually trade). Tellingly, at the unrealistically low 5 bp the signal is not overfit noise (deflated Sharpe 0.71, PBO 0.21); it is real non-synchronous-trading reversion that is simply uncapturable. Finally, cointegration pairs are a textbook multiple-testing mirage: 210 pairs that pass the in-sample cointegration test lose out-of-sample even before costs (Sharpe −0.87). There is no net-of-cost crypto stat-arb edge for a small participant; the honest contribution is showing precisely why the headline numbers are illusory.

One-line takeaway. No net-of-cost crypto stat-arb edge survives out-of-sample: the apparent hourly signal is non-synchronous-trading mean-reversion in illiquid coins — real, but uncapturable once you pay their true spreads — and in-sample-selected cointegration pairs lose out-of-sample.

1. Introduction

If alpha lives anywhere in crypto, it should be in statistical arbitrage: exploiting transient mispricings between related coins via mean-reversion. The two canonical engines are PCA-residual reversion (Avellaneda & Lee's “s-score” [1]) and cointegration pairs (Engle–Granger [2]), and crypto backtests of both routinely look spectacular. This paper asks the only question that matters: does any of it survive realistic costs, out-of-sample, for a participant who is not a colocated market maker? Following the series' pattern (Paper 1: a real premium with no tradable edge; Paper 2: a thin premium that is tradable), we pre-register hypotheses (H1–H3, taken up in turn in Section 4) and let honest costs and the rigor protocol decide.

2. Data & leakage control

Two panels: a daily 30-coin spot panel (Binance, 2023–2026) and an hourly ~50-coin panel (2024–2026) carrying volume for a liquidity split. Leakage control is structural: the s-score at bar $t$ is estimated only from the trailing window ending at $t-1$, and the resulting weights earn the $t\!-\!1\!\to\!t$ return — so the strategy is walk-forward by construction, never fit on data it trades. Cointegration pairs are re-selected each formation window and traded only out-of-sample. Costs are charged on realized turnover; we report a full per-coin cost sweep because, as we show, the conclusion lives or dies on cost realism.
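The leakage rule and the turnover-based cost charge described above can be sketched as follows. This is a minimal illustration, not the paper's actual backtester: the function names, the data layout (`returns[t][i]` is the return of coin `i` over bar t−1 → t) and the toy `contrarian_last_bar` signal are all assumptions made for the example, and weight drift between rebalances is ignored.

```python
from typing import Callable, List, Sequence

Panel = Sequence[Sequence[float]]  # panel[t][i]: return of coin i over bar t-1 -> t


def contrarian_last_bar(history: Panel) -> List[float]:
    """Hypothetical stand-in signal: fade the last bar's cross-sectional move.

    Weights sum to zero (dollar-neutral) and have gross exposure 1.0.
    """
    last = history[-1]
    mean = sum(last) / len(last)
    demeaned = [r - mean for r in last]
    gross = sum(abs(d) for d in demeaned)
    if gross == 0.0:
        return [0.0] * len(last)
    return [-d / gross for d in demeaned]


def walk_forward_pnl(
    returns: Panel,
    signal_fn: Callable[[Panel], List[float]],
    window: int,
    cost_per_side: float,  # e.g. 0.0005 for 5 bp per side
) -> List[float]:
    """Net per-bar PnL; weights for bar t see only data through bar t-1."""
    prev_w = [0.0] * len(returns[0])
    pnl = []
    for t in range(window, len(returns)):
        history = returns[t - window:t]  # slice ends at t-1, bar t is never seen
        w = signal_fn(history)
        gross = sum(wi * ri for wi, ri in zip(w, returns[t]))
        # Realized turnover: total absolute change in target weights.
        turnover = sum(abs(wi - pi) for wi, pi in zip(w, prev_w))
        pnl.append(gross - cost_per_side * turnover)
        prev_w = w
    return pnl
```

Because the history slice stops one bar before the return that is earned, any signal plugged in as `signal_fn` is walk-forward by construction, which is the structural property the section relies on.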

3. Method

PCA-residual s-score (Avellaneda–Lee). Over a rolling window we extract the top-$k$ eigenportfolios of standardized returns, regress each coin on them, and model the cumulative residual $X_i$ as an OU process with equilibrium mean $m_i$ and equilibrium standard deviation $\sigma_{eq,i}$. The standardized deviation $s_i=(X_i-m_i)/\sigma_{eq,i}$ is the signal; we go contrarian on it (buy oversold residuals, sell overbought), dollar-neutral, gross exposure 1.0.

Cointegration pairs. Each formation window we test all pairs for cointegration (Engle–Granger, $p<0.05$), select the strongest, and trade the spread z-score (enter $|z|>2$, exit $|z|<0.5$) out-of-sample.
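For the PCA-residual s-score, the OU step can be sketched in a few lines. The sketch below assumes the per-bar regression residuals of one coin against the eigenportfolios have already been computed upstream (the PCA and regression are not shown), and it uses the standard discretization of an OU process as an AR(1) fitted by least squares. The function name `ou_s_score` and the choice to reject windows with no stable reversion are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import Sequence, Tuple


def ou_s_score(residual_returns: Sequence[float]) -> Tuple[float, float, float]:
    """Fit an AR(1) to the cumulative residual and return (s, m, sigma_eq)."""
    # Cumulative residual X_k: running sum of the per-bar regression residuals.
    x, total = [], 0.0
    for e in residual_returns:
        total += e
        x.append(total)

    # OLS fit of X_{k+1} = a + b * X_k + noise (a discretized OU process).
    prev, nxt = x[:-1], x[1:]
    n = len(prev)
    mean_prev, mean_nxt = sum(prev) / n, sum(nxt) / n
    var_prev = sum((u - mean_prev) ** 2 for u in prev)
    cov = sum((u - mean_prev) * (v - mean_nxt) for u, v in zip(prev, nxt))
    b = cov / var_prev
    a = mean_nxt - b * mean_prev
    if not 0.0 < b < 1.0:
        raise ValueError("no stable mean reversion in this window (b outside (0, 1))")

    noise_var = sum((v - a - b * u) ** 2 for u, v in zip(prev, nxt)) / (n - 2)
    m = a / (1.0 - b)                                # equilibrium mean
    sigma_eq = math.sqrt(noise_var / (1.0 - b * b))  # equilibrium std dev
    s = (x[-1] - m) / sigma_eq                       # s-score at the window end
    return s, m, sigma_eq
```

A contrarian book would then take a long position where `s` is strongly negative and a short position where it is strongly positive, rescaled across coins to be dollar-neutral with gross exposure 1.0.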

3.1 The rigor gauntlet

The protocol has five parts. A degenerate-signal check comes first; then deflated Sharpe [4] and PBO via CSCV [5] over the configuration grid; purged and embargoed walk-forward evaluation; a liquidity-tercile decomposition with a per-tercile cost sweep (the decisive test for stale-price artifacts); and the +0.3 mean-reversion floor as the bar a strategy must clear.
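As a rough guide to what the deflated Sharpe step computes, the sketch below follows the published formulas in [4]: the observed Sharpe is compared against the best Sharpe one would expect from the same number of zero-edge trials, with a correction for skewness and kurtosis. All argument names are assumptions for illustration; `sr` and `sharpe_std` are per-bar (not annualized) quantities, `kurt` is raw kurtosis (3 for a normal), and `n_trials` must be at least 2. How the paper counts trials over its configuration grid is not stated, so no attempt is made to reproduce the reported 0.71.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(n_trials: int, sharpe_std: float) -> float:
    """Approximate expected best Sharpe among n_trials strategies with no edge."""
    g = EULER_GAMMA
    return sharpe_std * (
        (1.0 - g) * _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
        + g * _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    )


def deflated_sharpe(
    sr: float,          # observed per-bar Sharpe of the selected configuration
    n_obs: int,         # number of return observations
    skew: float,        # skewness of the returns
    kurt: float,        # raw kurtosis of the returns (3.0 if normal)
    n_trials: int,      # number of configurations tried
    sharpe_std: float,  # std dev of per-bar Sharpe across those configurations
) -> float:
    """Probability that the true Sharpe exceeds the zero-edge selection benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    z = (sr - benchmark) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)
```

The design point is that the benchmark rises with the number of configurations tried, so the same observed Sharpe earns a lower score the more of the grid was searched to find it.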

4. Experiments & results

4.1 Daily stat-arb on liquid coins is dead (H1)

On the 30 most-liquid coins, daily PCA-residual reversion has no edge before costs (gross Sharpe −0.06) and is firmly negative net (−0.58 at 5 bp/side, −1.1 at 10 bp, −2.13 at 20 bp), with a −74% maximum drawdown and a 49% win rate (Figure 1). There is simply nothing to harvest where you could actually trade it cheaply.

Figure 1. Daily PCA-residual stat-arb on the 30 liquid coins: gross PnL (dashed) is flat-to-down; net of a modest 5 bp/side it bleeds out. No edge even before costs.

4.2 The hourly “edge” is a stale-price artifact (H3)

At hourly frequency the strategy posts an apparent +0.46 net Sharpe. But decomposing by liquidity reveals it for what it is: the Sharpe rises monotonically as liquidity falls, from 0.45 (high-liquidity) to 0.98 (mid) to 1.83 (illiquid alts). That is the signature of non-synchronous-trading mean-reversion [3], where stale prices in thinly traded coins “revert” mechanically. The decisive test is the per-coin cost sweep (Figure 2): all three terciles are positive only at an unrealistically low 5 bp/side. At the spreads illiquid alts actually carry (50–100 bp), the illiquid tercile is deeply negative (1.83 → −1.39 at 50 bp → −3.46 at 80 bp), and the high- and mid-liquidity terciles are already negative at 20 bp. The edge lives exactly where it cannot be captured.
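The mechanism behind the stale-price artifact can be reproduced in a toy simulation. In the sketch below, a coin's true return is the market return plus idiosyncratic noise, but its price only prints on bars where it trades; the residual against the market then shows negative lag-1 autocorrelation even though nothing is actually mean-reverting. Everything here is an illustrative assumption (unit beta, Gaussian returns, the `trade_prob` values, the function names) and none of it is calibrated to the paper's panels.

```python
import random
from typing import List, Sequence


def lag1_autocorr(x: Sequence[float]) -> float:
    """Sample lag-1 autocorrelation of a series."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    return cov / var


def simulate_residuals(trade_prob: float, n_bars: int = 50000, seed: int = 0) -> List[float]:
    """Residual (observed coin return minus market return) under sporadic trading."""
    rng = random.Random(seed)
    true_log_price = 0.0
    last_print = 0.0  # log price at the coin's most recent trade
    residuals = []
    for _ in range(n_bars):
        market = rng.gauss(0.0, 0.01)
        idio = rng.gauss(0.0, 0.01)
        true_log_price += market + idio  # unit beta to the market, by assumption
        if rng.random() < trade_prob:
            # The coin trades: its print catches up on everything since the last trade.
            observed = true_log_price - last_print
            last_print = true_log_price
        else:
            observed = 0.0               # stale price, no observed move this bar
        residuals.append(observed - market)
    return residuals


stale = lag1_autocorr(simulate_residuals(trade_prob=0.5))  # thinly traded coin
fresh = lag1_autocorr(simulate_residuals(trade_prob=1.0))  # trades every bar
print(f"lag-1 residual autocorrelation: stale={stale:.3f}, fresh={fresh:.3f}")
```

In this toy setup the stale coin's residual is clearly negatively autocorrelated while the continuously traded coin's is close to zero, which is what a contrarian residual strategy would read as reversion. Capturing it would require trading at the stale print, which is exactly what the cost sweep prices in.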

Figure 2. Net Sharpe by liquidity tercile across one-way cost levels. At 5 bp the “edge” is largest in the illiquid coins (red); at the realistic illiquid-coin spread (shaded, 50–100 bp) every tercile is deeply negative. The apparent alpha is a stale-price artifact.
Net Sharpe (by one-way cost)   5 bp    20 bp    50 bp    80 bp
High liquidity                 0.45    −0.86    −3.42    −5.79
Mid liquidity                  0.98    −0.42    −3.14    −5.65
Low liquidity (illiquid)       1.83     0.75    −1.39    −3.46
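A cost sweep of the kind tabulated above amounts to re-pricing the same gross PnL and turnover series at each cost level. The sketch below shows one way to do that; the function name, the argument layout and the annualization factor are assumptions for illustration (8760 would be the natural bars-per-year for an hourly crypto panel), and it does not reproduce the paper's numbers.

```python
import math
from typing import Dict, Sequence


def net_sharpe_sweep(
    gross_pnl: Sequence[float],   # per-bar gross PnL of the strategy
    turnover: Sequence[float],    # per-bar sum of absolute weight changes
    costs_bp: Sequence[float],    # one-way cost levels in basis points
    bars_per_year: int,
) -> Dict[float, float]:
    """Annualized net Sharpe at each one-way cost level."""
    result = {}
    for cost in costs_bp:
        rate = cost / 1e4  # basis points to a fraction
        net = [g - rate * t for g, t in zip(gross_pnl, turnover)]
        mean = sum(net) / len(net)
        var = sum((v - mean) ** 2 for v in net) / (len(net) - 1)
        result[cost] = mean / math.sqrt(var) * math.sqrt(bars_per_year)
    return result
```

Running it once per liquidity tercile, with that tercile's own PnL and turnover, gives one row of the table per call. Since cost drag scales with turnover, a high-turnover hourly signal loses Sharpe quickly as the assumed spread widens.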

Crucially, this is not an overfitting story. At the (fictional) 5 bp cost, the hourly strategy passes selection rigor — deflated Sharpe 0.71, PBO 0.21, walk-forward OOS 0.36. The reversion is statistically real; it is just economically uncapturable. That distinction is the point.

4.3 Cointegration pairs: a multiple-testing mirage (H2)

Cointegration pairs fail differently — through selection. Across formation windows, 210 pairs passed the in-sample cointegration test ($p<0.05$). Traded out-of-sample, they lose money even before costs (Sharpe −0.87 gross, −0.89 net; Figure 3). Testing hundreds of pairs guarantees spurious in-sample cointegration that does not persist — the headline “>90% win-rate” backtests are selecting noise.
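The scale of the selection problem is easy to work out. The sketch below counts the candidate pairs in a universe and the number expected to pass a $p<0.05$ test purely by chance if no pair were truly cointegrated. The universe sizes are taken from the two panels as an illustrative assumption, since the text does not say which panel the pairs are drawn from, and the function names are hypothetical. The expected count holds by linearity of expectation, so it does not need the tests to be independent.

```python
from math import comb


def n_candidate_pairs(n_coins: int) -> int:
    """Number of distinct unordered pairs in a universe of n_coins."""
    return comb(n_coins, 2)


def expected_false_positives(n_coins: int, alpha: float = 0.05) -> float:
    """Expected number of pairs passing at level alpha when none is cointegrated."""
    return n_candidate_pairs(n_coins) * alpha


def bonferroni_threshold(n_coins: int, alpha: float = 0.05) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_candidate_pairs(n_coins)


for coins in (30, 50):
    print(
        coins,
        n_candidate_pairs(coins),
        round(expected_false_positives(coins), 2),
        bonferroni_threshold(coins),
    )
```

With 30 coins there are 435 candidate pairs per formation window, so roughly 22 would pass at the 5% level by chance alone; with 50 coins the count is 1225 pairs and about 61 chance passes. A Bonferroni-style cutoff would be on the order of $10^{-4}$, far stricter than the $p<0.05$ screen.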

Figure 3. Out-of-sample performance of cointegration pairs: 210 in-sample-selected “cointegrated” pairs lose out-of-sample even gross of costs — a textbook multiple-testing collapse.

5. Discussion

Crypto looks like the ideal habitat for statistical arbitrage, and the backtests oblige. Every one of them, here, is an illusion — but the illusions have two distinct anatomies worth separating.

There is no net-of-cost statistical-arbitrage edge for a small participant in crypto: the apparent hourly signal is non-synchronous-trading mean-reversion in illiquid coins — real, but uncapturable once you pay their true spreads — and in-sample-selected cointegration pairs are a multiple-testing mirage that loses out-of-sample.

The PCA-residual edge is a measurement illusion: stale prices in illiquid coins manufacture mechanical reversion that a naive flat-cost backtest counts as alpha, but that the coins' real 50–100 bp spreads erase. The cointegration edge is a selection illusion: enough pairs, enough tests, and some will look cointegrated in-sample by chance. Neither survives the honest treatment. This completes a clean triptych across the series — Paper 1's real-but-untradable premium, Paper 2's thin-but-tradable factor, and Paper 3's tradable-looking-but-illusory edge — and underscores the program's thesis: in efficient-enough markets, rigorous cost modeling and out-of-sample discipline are the alpha, because they are what separate the one real edge from the many fake ones.

6. Limitations & future work

Cost model. We use per-side cost levels rather than coin-by-coin measured spreads; the conclusion is robust because we sweep the full plausible range and illiquid alts demonstrably trade at 50–100 bp, but a tick-level spread series would sharpen the per-coin picture.

No order book. We test signal economics, not execution. A colocated market maker posting passively (negative-fee, queue-priority) faces a different cost structure, which is the subject of Paper 5 (liquidity provision), not this paper.

Universe & survivorship. Today's liquid names; a point-in-time, delisted-inclusive universe would, if anything, strengthen the negative (dead coins add stale-price noise, not edge).

Scope. We test PCA-residual and Engle–Granger pairs; richer structures (Johansen baskets, ML-selected spreads, lead–lag across venues) are unlikely to overturn the cost and selection problems, but a cross-venue lead–lag study on a fast feed is the one direction with a non-trivial prior, and it is left to future work.

References

  1. M. Avellaneda & J.-H. Lee (2010). “Statistical Arbitrage in the US Equities Market.” Quantitative Finance 10(7). (The PCA-residual “s-score” method.)
  2. R. Engle & C. Granger (1987). “Co-integration and Error Correction: Representation, Estimation, and Testing.” Econometrica 55(2).
  3. A. Lo & A. C. MacKinlay (1990). “When Are Contrarian Profits Due to Stock Market Overreaction?” Review of Financial Studies 3(2). (Non-synchronous trading manufactures spurious mean-reversion.)
  4. D. Bailey & M. López de Prado (2014). “The Deflated Sharpe Ratio.” Journal of Portfolio Management.
  5. D. Bailey, J. Borwein, M. López de Prado & Q. Zhu (2017). “The Probability of Backtest Overfitting.” Journal of Computational Finance.
  6. B. Vine (2026). “Crypto Carry: The Funding-Rate Cross-Section” & “The Volatility Risk Premium, Cross-Asset.” Alpha Research, Papers 1–2.