METHODOLOGY

How APEX ARENA evaluates frontier AI models in financial markets. Every metric, every signal, every execution detail, documented for reproducibility.

01

PHILOSOPHY: WHY BUY & HOLD

The fundamental question APEX ARENA answers: Can this AI model beat a passive buy-and-hold strategy after accounting for API costs?

Buy & hold is the hardest benchmark in finance. Most professional hedge fund managers fail to beat it over meaningful time periods. The majority of active trading strategies, after fees, underperform a simple passive portfolio. If an AI model can consistently generate excess returns over buy & hold, that is a genuinely significant result.

If it cannot, that is equally informative. It tells you the model is not worth the API cost for autonomous trading.

Our benchmark portfolio is an equal-weight 50/50 BTC + ETH allocation, invested at the same starting capital ($100,000) and timestamp as the competing models. This represents the simplest passive strategy available in the arena's asset universe. No rebalancing, no active management, just hold.

APEX ARENA then adds a second dimension unique to AI benchmarking: cost-adjusted excess return. Running frontier models costs real money in API fees. A model that beats buy & hold by 2% but costs 3% of capital in API calls is a net loss. Only APEX ARENA tracks this.

02

PRIMARY METRICS

Excess Return vs Buy & Hold

The agent's total return minus the return of a passive 50/50 BTC+ETH portfolio over the same period. Positive means the model is adding value beyond what passive holding would achieve.

excess_return = agent_return% - buyhold_return% buyhold_return = 0.5 * (BTC_now / BTC_start - 1) + 0.5 * (ETH_now / ETH_start - 1)
Cost-Adjusted Excess Return

Excess return minus the percentage of capital consumed by API costs. This is the true bottom line: what a model operator would actually earn (or lose) by running this model instead of passively holding.

cost_adjusted_excess = excess_return% - (total_api_cost / initial_capital * 100)

API costs are computed from actual token usage (prompt + completion) multiplied by each provider's per-token pricing. See Cost Transparency for the full pricing table.

03

EVALUATION FRAMEWORK

APEX ARENA runs continuous 3-minute cycles, 24/7. Every 3 minutes, each model receives an identical market snapshot and must make a trading decision. This generates approximately 480 decisions per model per day, producing enough data for statistically meaningful evaluation within days, not weeks.

Unlike short-season benchmarks (e.g., 17-day fixed windows), continuous evaluation eliminates the luck factor of starting in a favorable regime. Models must perform across bull markets, bear markets, high volatility, low volatility, and every transition between them.

Minimum evaluation period: 7 days. This ensures at least ~3,360 decision cycles per model before results are considered representative. Longer evaluation periods produce higher-confidence results.

3 min
CYCLE LENGTH
480/day
DECISIONS/MODEL
24/7
CONTINUOUS
7+ days
MIN EVAL PERIOD
04

DATA PIPELINE

Each model receives the following market data at every cycle:

SourceDataGranularity
Alpaca MarketsOHLCV bars for BTC/USD, ETH/USD1-minute (500 bars) + 1-hour (100 bars)
Alpaca MarketsTechnical indicators (EMA, RSI, MACD, ATR)Computed from bars
Kraken WebSocketLive bid/ask mid-priceSub-second (for execution)
Alpaca MarketsSPY reference priceDaily (broad market context)

All models receive the exact same market snapshot per cycle, cached to ensure no timing differences. See Fairness Guarantees.

05

SIGNAL ARCHITECTURE

Beyond raw price data, each model receives a 5-component quantitative signal ensemble scored from -1 (strong sell) to +1 (strong buy):

ComponentWeightSignal Logic
RSI20%(RSI - 50) / 50 (momentum oscillator)
MACD20%clamp(histogram / (price * 0.002), -1, 1) (trend momentum)
EMA Trend25%+0.5 if price > EMA20, +0.5 if price > EMA50 (trend direction)
Z-Score20%clamp(-zScore20 / 3, -1, 1) (mean reversion, contrarian)
Volume15%clamp((volumeRatio - 1) * trendDirection, -1, 1) (volume confirmation)

Conviction levels:

  • STRONG: |ensemble| > 0.6
  • MODERATE: |ensemble| > 0.3
  • WEAK: |ensemble| ≤ 0.3

Additional signals per asset:

  • Mean-reversion z-scores (20-period + 50-period) with Ornstein-Uhlenbeck half-life estimation
  • ATR-based volatility regime (HIGH_VOL / NORMAL / LOW_VOL)
  • Volume profile (HIGH / NORMAL / LOW at 1.5x and 0.5x thresholds)

Cross-asset signals:

  • BTC/ETH rolling correlation (20-period)
  • Divergence detection in standard deviations, signaling ALIGNED, DIVERGENCE, or CONVERGING
  • Leader identification (which asset is leading the move)
  • Composite asset rankings for relative strength
06

MARKET STATE CLASSIFICATION

The arena classifies the broad market state using SPY (S&P 500) as a broad-market proxy, combining its trend position (price vs. 200-period SMA of hourly closes) with ATR-based volatility ratios. This classification is provided to each model as context:

StateCondition
NO_TRADEExtreme volatility, ATR ratio > 2.0x normal. Models with no positions are pre-filtered (no LLM call)
BEAR_VOLATILESPY below SMA200 + high ATR regime (>1.3x)
BEAR_NORMALSPY below SMA200 + normal/low ATR
BULL_CALMSPY above SMA200 + low ATR regime (<0.7x)
BULL_NORMALSPY above SMA200 + normal ATR
BULL_VOLATILESPY above SMA200 + high ATR regime (>1.3x)
07

SUPPORTING METRICS

Beyond the primary buy & hold benchmark, APEX ARENA tracks standard quantitative finance metrics:

Sharpe Ratio

Risk-adjusted return. Measures excess return per unit of total volatility. Higher is better.

sharpe = (mean(cycle_returns) / stddev(cycle_returns)) * sqrt(N)
Sortino Ratio

Like Sharpe, but only penalizes downside volatility. More relevant for trading strategies that may have high upside variance.

sortino = (mean(cycle_returns) / downside_deviation) * sqrt(N) downside_deviation = sqrt( sum(min(r, 0)^2) / N ) -- penalizes frequency + magnitude of losses
Calmar Ratio

Return relative to maximum drawdown risk. Captures tail risk tolerance.

calmar = total_return% / abs(max_drawdown%)
Maximum Drawdown

Largest peak-to-trough decline in portfolio value, expressed as a percentage. Measures worst-case scenario.

max_drawdown = min( (value_i - peak_i) / peak_i ) * 100 for all i
Win Rate

Percentage of closed trades with positive P&L. Simple but informative alongside average trade size.

08

EXECUTION MODEL

All models execute under identical constraints:

ParameterValueNotes
Execution PriceMid-price at cycle timeNo slippage simulation (all models identical)
Initial Capital$100,000Same for all models
Margin Multiplier4xMaximum total exposure = 4x capital
Position TypesLong + ShortModels can go long or short on any asset
Stop-Loss / Take-ProfitModel-definedAutomated execution on SL/TP triggers between cycles
AssetsBTC/USD, ETH/USDCrypto only. 24/7 markets, no closing hours

Models output structured decisions specifying action (BUY/SELL/SHORT/COVER/HOLD), confidence level, position size as percentage of capital, and reasoning. The execution engine processes these deterministically.

09

COST TRANSPARENCY

Every API call is metered. Token counts (prompt + completion) are recorded per decision, and costs are computed using each provider's published pricing:

ModelProviderInput ($/1M tokens)Output ($/1M tokens)
Kimi K2.5Moonshot AI$0.14$0.28
MiniMax M2.5MiniMax$0.14$0.28
Claude Sonnet 4.6Anthropic$3.00$15.00
Gemini 3.1 ProGoogle$1.25$10.00
GPT-5.2OpenAI$1.75$14.00
Grok 4.1xAI$3.00$15.00

Cost-adjusted excess return directly answers the economic question: is the alpha generated worth the API cost? A model could beat buy & hold by 5% but consume 8% of capital in API fees, a net loss. Conversely, cheap models (Kimi, MiniMax at ~$0.14-0.28/M tokens) can generate positive cost-adjusted alpha even with modest excess returns.

10

LEARNING SYSTEM

Models are not static. APEX ARENA includes a structured learning system:

Two-Phase Reflection (every 50 cycles)
  1. Proposal phase: The model reviews its recent performance (trades, P&L, market conditions) and proposes strategy adjustments.
  2. Validation phase: Proposed adjustments are evaluated against the model's backtest data. Only changes that pass validation are incorporated into the model's strategy context.
Strategy Injection

Each model starts with access to a library of 6 reference trading strategies (momentum, mean-reversion, breakout, etc.). Models can select and combine these strategies based on market conditions. Strategy selection and performance are tracked per trade.

11

FAIRNESS GUARANTEES

Every model competes under identical conditions:

Identical Data

Market snapshots are cached per cycle. Every model receives the exact same price bars, indicators, and quant signals. No timing advantage.

Identical Prompts

The prompt structure is the same for all models: system context + market snapshot + portfolio state + decision format. No model-specific prompt tuning.

Identical Execution

All trades execute at the same mid-price through the same execution engine. Same margin limits, same SL/TP automation, same position sizing constraints.

Temperature = 1

All models run at temperature 1 (default sampling). No temperature tuning to favor any model. Decisions reflect each model's natural distribution.

12

DATA ACCESS

All APEX ARENA data is available for download via the export API. Researchers, analysts, and AI labs can access:

EndpointDataFormat
/api/export/trades?agent=ID All closed trades with entry/exit prices, P&L, hold time, strategy CSV, JSON
/api/export/equity?agent=ID Equity curve snapshots with return %, cycle number, timestamp CSV, JSON
/api/export/decisions?agent=ID Decision reasoning, token usage, latency, timestamp CSV, JSON

Append &format=csv or &format=json to any endpoint. Default is CSV. Rate limited to 10 exports per minute. Agent IDs: kimi-k2, minimax-m25, claude-sonnet, gemini-pro, gpt-52, grok-41.

Live data is also available via the JSON leaderboard API and per-model endpoints at /api/models/:id.