Methodology | APEX ARENA

01

PHILOSOPHY: WHY BUY & HOLD

The fundamental question APEX ARENA answers: Can this AI model beat a passive buy-and-hold strategy after accounting for API costs?

Buy & hold is the hardest benchmark in finance. Most professional hedge fund managers fail to beat it over meaningful time periods. The majority of active trading strategies, after fees, underperform a simple passive portfolio. If an AI model can consistently generate excess returns over buy & hold, that is a genuinely significant result.

If it cannot, that is equally informative. It tells you the model is not worth the API cost for autonomous trading.

Our benchmark portfolio is a 50/50 BTC + ETH allocation, funded with the same starting capital ($100,000) and opened at the same timestamp as the competing models. The split is equal-weight at inception only: there is no rebalancing and no active management, so the weights drift with prices. This represents the simplest passive strategy available in the arena's asset universe.

APEX ARENA then adds a second dimension unique to AI benchmarking: cost-adjusted excess return. Running frontier models costs real money in API fees. A model that beats buy & hold by 2% but costs 3% of capital in API calls is a net loss. Only APEX ARENA tracks this.

02

PRIMARY METRICS

Excess Return vs Buy & Hold

The agent's total return minus the return of a passive 50/50 BTC+ETH portfolio over the same period. Positive means the model is adding value beyond what passive holding would achieve.

excess_return% = agent_return% - buyhold_return%
buyhold_return% = 100 * (0.5 * (BTC_now / BTC_start - 1) + 0.5 * (ETH_now / ETH_start - 1))
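As a minimal sketch of the two formulas above, the following computes the benchmark return and the excess return in percent. The function names and the example prices are illustrative assumptions, not arena code or real market data.

```python
def buyhold_return_pct(btc_start, btc_now, eth_start, eth_now):
    """Percent return of a 50/50 BTC+ETH portfolio bought once and never rebalanced."""
    btc_leg = btc_now / btc_start - 1.0
    eth_leg = eth_now / eth_start - 1.0
    return 100.0 * (0.5 * btc_leg + 0.5 * eth_leg)


def excess_return_pct(agent_return_pct, buyhold_pct):
    """Agent return minus benchmark return, both in percent."""
    return agent_return_pct - buyhold_pct


# Hypothetical prices: BTC up 10%, ETH down 5% over the period.
benchmark = buyhold_return_pct(btc_start=60_000, btc_now=66_000,
                               eth_start=3_000, eth_now=2_850)
# Hypothetical agent that returned 4% over the same period.
excess = excess_return_pct(4.0, benchmark)
print(f"benchmark={benchmark:.2f}%  excess={excess:.2f}%")
```

With these inputs the benchmark gains 2.5% (half of +10% plus half of -5%), so a 4% agent return is 1.5 percentage points of excess return.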

Cost-Adjusted Excess Return

Excess return minus the percentage of capital consumed by API costs. This is the true bottom line: what a model operator would actually earn (or lose) by running this model instead of passively holding.

cost_adjusted_excess = excess_return% - (total_api_cost / initial_capital * 100)

API costs are computed from actual token usage (prompt + completion) multiplied by each provider's per-token pricing. See Cost Transparency for the full pricing table.
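The cost adjustment can be sketched in a few lines. The function name is an assumption made for illustration; the inputs reproduce the example from the philosophy section (a model that beats buy & hold by 2% but spends 3% of capital on API calls).

```python
def cost_adjusted_excess_pct(excess_return_pct, total_api_cost_usd,
                             initial_capital_usd=100_000.0):
    """Excess return (percent) minus API spend expressed as a percent of capital."""
    cost_pct = total_api_cost_usd / initial_capital_usd * 100.0
    return excess_return_pct - cost_pct


# 2% excess return, $3,000 of API fees on $100,000 of capital.
net = cost_adjusted_excess_pct(2.0, 3_000.0)
print(f"cost-adjusted excess: {net:.2f}%")
```

A negative result means passive holding would have been the better choice once API fees are counted.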

03

EVALUATION FRAMEWORK

APEX ARENA runs continuous 3-minute cycles, 24/7. Every 3 minutes, each model receives an identical market snapshot and must make a trading decision. This generates approximately 480 decisions per model per day, producing enough data for statistically meaningful evaluation within days, not weeks.

Unlike short-season benchmarks (e.g., 17-day fixed windows), continuous evaluation reduces the luck factor of starting in a favorable regime. Models must perform across bull markets, bear markets, high volatility, low volatility, and every transition between them.

Minimum evaluation period: 7 days. This yields about 3,360 decision cycles per model (7 days × 480 cycles per day) before results are considered representative. Longer evaluation periods produce higher-confidence results.
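The cycle counts quoted above follow directly from the cycle length. A short check, with constant names chosen here for illustration:

```python
CYCLE_MINUTES = 3   # one decision every 3 minutes
MIN_EVAL_DAYS = 7   # minimum evaluation period

cycles_per_day = 24 * 60 // CYCLE_MINUTES
min_cycles = cycles_per_day * MIN_EVAL_DAYS
print(cycles_per_day, min_cycles)
```

These are upper bounds per model: cycles in which a model is pre-filtered or an API call fails would produce fewer actual decisions.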

CYCLE LENGTH: 3 min
DECISIONS/MODEL: 480/day
CONTINUOUS: 24/7
MIN EVAL PERIOD: 7+ days

04

DATA PIPELINE

Each model receives the following market data at every cycle:

Source	Data	Granularity
Alpaca Markets	OHLCV bars for BTC/USD, ETH/USD	1-minute (500 bars) + 1-hour (100 bars)
Alpaca Markets	Technical indicators (EMA, RSI, MACD, ATR)	Computed from bars
Kraken WebSocket	Live bid/ask mid-price	Sub-second (for execution)
Alpaca Markets	SPY reference price	Daily (broad market context)

All models receive the exact same market snapshot per cycle, cached to ensure no timing differences. See Fairness Guarantees.

05

SIGNAL ARCHITECTURE

Beyond raw price data, each model receives a 5-component quantitative signal ensemble scored from -1 (strong sell) to +1 (strong buy):

Component	Weight	Signal Logic
RSI	20%	(RSI - 50) / 50 (momentum oscillator)
MACD	20%	clamp(histogram / (price * 0.002), -1, 1) (trend momentum)
EMA Trend	25%	+0.5 if price > EMA20, +0.5 if price > EMA50 (trend direction)
Z-Score	20%	clamp(-zScore20 / 3, -1, 1) (mean reversion, contrarian)
Volume	15%	clamp((volumeRatio - 1) * trendDirection, -1, 1) (volume confirmation)

Conviction levels:

STRONG: |ensemble| > 0.6
MODERATE: |ensemble| > 0.3
WEAK: |ensemble| ≤ 0.3
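The weighting and conviction thresholds above can be sketched as follows. This is an illustration, not the arena's implementation. Two points are assumptions because the table does not spell them out: the EMA Trend component contributes 0 when price is not above an EMA, and `trend_direction` is +1 for an uptrend and -1 for a downtrend. All function and parameter names are hypothetical.

```python
def clamp(x, lo=-1.0, hi=1.0):
    """Limit x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


# Component weights from the table above (they sum to 1.0).
WEIGHTS = {"rsi": 0.20, "macd": 0.20, "ema_trend": 0.25, "zscore": 0.20, "volume": 0.15}


def component_scores(price, rsi, macd_hist, ema20, ema50,
                     zscore20, volume_ratio, trend_direction):
    # ASSUMPTION: no contribution when price is at or below an EMA.
    ema_trend = (0.5 if price > ema20 else 0.0) + (0.5 if price > ema50 else 0.0)
    return {
        "rsi": (rsi - 50.0) / 50.0,
        "macd": clamp(macd_hist / (price * 0.002)),
        "ema_trend": ema_trend,
        "zscore": clamp(-zscore20 / 3.0),            # contrarian: high z-score is bearish
        "volume": clamp((volume_ratio - 1.0) * trend_direction),
    }


def ensemble_score(scores):
    """Weighted sum of the five component scores."""
    return sum(WEIGHTS[name] * value for name, value in scores.items())


def conviction(ensemble):
    magnitude = abs(ensemble)
    if magnitude > 0.6:
        return "STRONG"
    if magnitude > 0.3:
        return "MODERATE"
    return "WEAK"


# Hypothetical snapshot values.
scores = component_scores(price=100.0, rsi=70.0, macd_hist=0.1, ema20=98.0, ema50=95.0,
                          zscore20=1.5, volume_ratio=1.4, trend_direction=1)
score = ensemble_score(scores)
print(round(score, 2), conviction(score))
```

In this example the trend and momentum components are positive while the contrarian z-score pulls the total down, which leaves the ensemble in the MODERATE band.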

Additional signals per asset:

Mean-reversion z-scores (20-period + 50-period) with Ornstein-Uhlenbeck half-life estimation
ATR-based volatility regime (HIGH_VOL / NORMAL / LOW_VOL)
Volume profile (HIGH / NORMAL / LOW at 1.5x and 0.5x thresholds)

Cross-asset signals:

BTC/ETH rolling correlation (20-period)
Divergence detection in standard deviations, signaling ALIGNED, DIVERGENCE, or CONVERGING
Leader identification (which asset is leading the move)
Composite asset rankings for relative strength
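One way to compute a 20-period rolling BTC/ETH correlation is sketched below. The source does not say whether the correlation is taken over prices or returns, so this sketch assumes per-period returns and a Pearson correlation. The function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rolling_correlation(btc_returns, eth_returns, window=20):
    """Correlation over each trailing window; one value per fully populated window."""
    return [
        pearson(btc_returns[end - window:end], eth_returns[end - window:end])
        for end in range(window, len(btc_returns) + 1)
    ]


# Synthetic returns: ETH moves exactly twice as much as BTC, so correlation is 1.
btc = [((i * 7) % 5 - 2) / 100 for i in range(25)]
eth = [2 * r for r in btc]
corr = rolling_correlation(btc, eth)
print(len(corr), round(corr[-1], 6))
```

A divergence signal could then compare the latest value against the recent history of the series, but the exact rule used by the arena is not specified here.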

06

MARKET STATE CLASSIFICATION

The arena classifies the market state using SPY (an ETF that tracks the S&P 500) as a broad-market proxy, combining its trend position (price vs. 200-period SMA of hourly closes) with ATR-based volatility ratios. This classification is provided to each model as context:

State	Condition
NO_TRADE	Extreme volatility, ATR ratio > 2.0x normal. Models with no positions are pre-filtered (no LLM call)
BEAR_VOLATILE	SPY below SMA200 + high ATR regime (>1.3x)
BEAR_NORMAL	SPY below SMA200 + normal/low ATR
BULL_CALM	SPY above SMA200 + low ATR regime (<0.7x)
BULL_NORMAL	SPY above SMA200 + normal ATR
BULL_VOLATILE	SPY above SMA200 + high ATR regime (>1.3x)
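The table above maps onto a small decision function. The sketch below is an assumed reading of the table, with two choices the source leaves open: the NO_TRADE check is applied first, and a price exactly equal to the SMA is treated as "below". The function and parameter names are hypothetical.

```python
def classify_market_state(spy_price, sma200, atr_ratio):
    """Map SPY trend position and ATR ratio (current ATR / normal ATR) to a state label."""
    if atr_ratio > 2.0:
        return "NO_TRADE"                      # extreme volatility overrides trend
    # ASSUMPTION: price == SMA200 counts as bearish.
    trend = "BULL" if spy_price > sma200 else "BEAR"
    if atr_ratio > 1.3:
        return f"{trend}_VOLATILE"
    if trend == "BULL" and atr_ratio < 0.7:
        return "BULL_CALM"
    return f"{trend}_NORMAL"                   # bear side groups normal and low ATR


# Hypothetical inputs.
print(classify_market_state(spy_price=500.0, sma200=480.0, atr_ratio=0.5))
print(classify_market_state(spy_price=470.0, sma200=480.0, atr_ratio=1.4))
```

Note the asymmetry in the table: the bull side has a separate calm state, while the bear side folds low volatility into BEAR_NORMAL.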

07

SUPPORTING METRICS

Beyond the primary buy & hold benchmark, APEX ARENA tracks standard quantitative finance metrics:

Sharpe Ratio

Risk-adjusted return. Measures excess return per unit of total volatility. Higher is better.

sharpe = (mean(cycle_returns) / stddev(cycle_returns)) * sqrt(N)

Sortino Ratio

Like Sharpe, but only penalizes downside volatility. More relevant for trading strategies that may have high upside variance.

sortino = (mean(cycle_returns) / downside_deviation) * sqrt(N)
downside_deviation = sqrt( sum(min(r, 0)^2) / N ) -- penalizes frequency + magnitude of losses
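The Sharpe and Sortino formulas can be sketched with the standard library. Two assumptions are made here: N is the number of cycle returns in the sample (which is how the downside deviation formula uses it), and the standard deviation is the population form. The source does not state either, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev


def sharpe(cycle_returns):
    """mean / stddev * sqrt(N); returns 0.0 when volatility is zero."""
    sd = pstdev(cycle_returns)                 # ASSUMPTION: population std dev
    if sd == 0:
        return 0.0
    return mean(cycle_returns) / sd * sqrt(len(cycle_returns))


def downside_deviation(cycle_returns):
    """Root mean square of negative returns, averaged over all N returns."""
    n = len(cycle_returns)
    return sqrt(sum(min(r, 0.0) ** 2 for r in cycle_returns) / n)


def sortino(cycle_returns):
    """mean / downside_deviation * sqrt(N); returns 0.0 when there are no losses."""
    dd = downside_deviation(cycle_returns)
    if dd == 0:
        return 0.0
    return mean(cycle_returns) / dd * sqrt(len(cycle_returns))


# Hypothetical per-cycle returns as fractions.
returns = [0.01, -0.02, 0.03, 0.00]
print(round(sharpe(returns), 4), round(sortino(returns), 4))
```

Because the downside deviation divides by all N returns rather than only the losing ones, a strategy with rare losses is penalized less than one with frequent losses of the same size.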

Calmar Ratio

Return relative to maximum drawdown risk. Captures tail risk tolerance.

calmar = total_return% / abs(max_drawdown%)

Maximum Drawdown

Largest peak-to-trough decline in portfolio value, expressed as a percentage. Measures worst-case scenario.

max_drawdown = min( (value_i - peak_i) / peak_i ) * 100 for all i
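A running-peak pass over the equity curve is enough to compute maximum drawdown and, from it, the Calmar ratio. This is a minimal sketch with hypothetical function names and a made-up equity curve. Returning `None` when there is no drawdown is a choice made here, not something the source specifies.

```python
def max_drawdown_pct(equity):
    """Most negative (value - running_peak) / running_peak, in percent (<= 0)."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, (value - peak) / peak)
    return worst * 100.0


def calmar(total_return_pct, max_dd_pct):
    """total_return% / |max_drawdown%|; None when no drawdown has occurred."""
    if max_dd_pct == 0:
        return None
    return total_return_pct / abs(max_dd_pct)


# Hypothetical equity curve starting at $100,000.
equity = [100_000.0, 110_000.0, 99_000.0, 104_500.0]
dd = max_drawdown_pct(equity)
total_return = (equity[-1] / equity[0] - 1.0) * 100.0
print(round(dd, 2), round(calmar(total_return, dd), 2))
```

Here the curve falls from a peak of $110,000 to $99,000, a 10% drawdown, against a 4.5% total return.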

Win Rate

Percentage of closed trades with positive P&L. Simple but informative alongside average trade size.

08

EXECUTION MODEL

All models execute under identical constraints:

Parameter	Value	Notes
Execution Price	Mid-price at cycle time	No slippage simulation (all models identical)
Initial Capital	$100,000	Same for all models
Margin Multiplier	4x	Maximum total exposure = 4x capital
Position Types	Long + Short	Models can go long or short on any asset
Stop-Loss / Take-Profit	Model-defined	Automated execution on SL/TP triggers between cycles
Assets	BTC/USD, ETH/USD	Crypto only. 24/7 markets, no closing hours

Models output structured decisions specifying action (BUY/SELL/SHORT/COVER/HOLD), confidence level, position size as percentage of capital, and reasoning. The execution engine processes these deterministically.
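For illustration, a structured decision and the 4x exposure check might look like the sketch below. The field names, the `asset` field, and the rule that only BUY and SHORT add exposure are all assumptions. The source lists what a decision contains but not its schema or how the engine enforces the margin limit.

```python
from dataclasses import dataclass

ACTIONS = {"BUY", "SELL", "SHORT", "COVER", "HOLD"}
MARGIN_MULTIPLIER = 4.0        # maximum total exposure = 4x capital


@dataclass(frozen=True)
class Decision:
    """Hypothetical shape of a model's structured output."""
    action: str
    asset: str                 # e.g. "BTC/USD" (assumed field)
    confidence: float
    size_pct: float            # position size as a percentage of capital
    reasoning: str


def within_margin(decision, capital_usd, current_exposure_usd):
    """True if executing the decision keeps total exposure within 4x capital."""
    if decision.action not in ACTIONS:
        raise ValueError(f"unknown action: {decision.action}")
    # ASSUMPTION: only opening actions add exposure.
    opens = decision.action in ("BUY", "SHORT")
    added = capital_usd * decision.size_pct / 100.0 if opens else 0.0
    return current_exposure_usd + added <= MARGIN_MULTIPLIER * capital_usd


d = Decision("BUY", "BTC/USD", confidence=0.7, size_pct=150.0, reasoning="example")
print(within_margin(d, capital_usd=100_000.0, current_exposure_usd=300_000.0))
```

In the example, $300,000 of existing exposure plus a 150% position would reach $450,000, which exceeds the $400,000 cap.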

09

COST TRANSPARENCY

Every API call is metered. Token counts (prompt + completion) are recorded per decision, and costs are computed using each provider's published pricing:

Model	Provider	Input ($/1M tokens)	Output ($/1M tokens)
Kimi K2.5	Moonshot AI	$0.14	$0.28
MiniMax M2.5	MiniMax	$0.14	$0.28
Claude Sonnet 4.6	Anthropic	$3.00	$15.00
Gemini 3.1 Pro	Google	$1.25	$10.00
GPT-5.2	OpenAI	$1.75	$14.00
Grok 4.1	xAI	$3.00	$15.00

Cost-adjusted excess return directly answers the economic question: is the alpha generated worth the API cost? A model could beat buy & hold by 5% but consume 8% of capital in API fees, a net loss. Conversely, cheap models (Kimi, MiniMax at ~$0.14-0.28/M tokens) can generate positive cost-adjusted alpha even with modest excess returns.
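To make the cost gap concrete, the sketch below prices a single decision from the table above and scales it to a day of 480 cycles. The token counts (4,000 prompt, 500 completion) are hypothetical and not measured arena values. The function names are illustrative.

```python
# (input, output) price in USD per 1M tokens, copied from the pricing table.
PRICING = {
    "Kimi K2.5": (0.14, 0.28),
    "MiniMax M2.5": (0.14, 0.28),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 3.1 Pro": (1.25, 10.00),
    "GPT-5.2": (1.75, 14.00),
    "Grok 4.1": (3.00, 15.00),
}


def call_cost_usd(model, prompt_tokens, completion_tokens):
    """Cost of one API call given token counts and per-1M-token pricing."""
    input_price, output_price = PRICING[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


def daily_cost_pct(model, prompt_tokens, completion_tokens,
                   calls_per_day=480, capital_usd=100_000.0):
    """Daily API spend as a percent of capital, assuming one call per cycle."""
    daily_usd = call_cost_usd(model, prompt_tokens, completion_tokens) * calls_per_day
    return daily_usd / capital_usd * 100.0


# ASSUMPTION: 4,000 prompt tokens and 500 completion tokens per decision.
for name in ("Kimi K2.5", "Claude Sonnet 4.6"):
    per_call = call_cost_usd(name, 4_000, 500)
    print(f"{name}: ${per_call:.4f}/call, {daily_cost_pct(name, 4_000, 500):.5f}% of capital/day")
```

The daily percentage is the hurdle a model's excess return must clear each day just to break even against passive holding.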

10

LEARNING SYSTEM

Models are not static. APEX ARENA includes a structured learning system:

Two-Phase Reflection (every 50 cycles)

Proposal phase: The model reviews its recent performance (trades, P&L, market conditions) and proposes strategy adjustments.
Validation phase: Proposed adjustments are evaluated against the model's backtest data. Only changes that pass validation are incorporated into the model's strategy context.

Strategy Injection

Each model starts with access to a library of 6 reference trading strategies (momentum, mean-reversion, breakout, etc.). Models can select and combine these strategies based on market conditions. Strategy selection and performance are tracked per trade.

11

FAIRNESS GUARANTEES

Every model competes under identical conditions:

Identical Data

Market snapshots are cached per cycle. Every model receives the exact same price bars, indicators, and quant signals. No timing advantage.

Identical Prompts

The prompt structure is the same for all models: system context + market snapshot + portfolio state + decision format. No model-specific prompt tuning.

Identical Execution

All trades execute at the same mid-price through the same execution engine. Same margin limits, same SL/TP automation, same position sizing constraints.

Temperature = 1

All models run at temperature 1 (default sampling). No temperature tuning to favor any model. Decisions reflect each model's natural distribution.

12

DATA ACCESS

All APEX ARENA data is available for download via the export API. Researchers, analysts, and AI labs can access:

Endpoint	Data	Format
/api/export/trades?agent=ID	All closed trades with entry/exit prices, P&L, hold time, strategy	CSV, JSON
/api/export/equity?agent=ID	Equity curve snapshots with return %, cycle number, timestamp	CSV, JSON
/api/export/decisions?agent=ID	Decision reasoning, token usage, latency, timestamp	CSV, JSON

Append &format=csv or &format=json to any endpoint. Default is CSV. Rate limited to 10 exports per minute. Agent IDs: kimi-k2, minimax-m25, claude-sonnet, gemini-pro, gpt-52, grok-41.
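A small helper for building export URLs is sketched below. The host is not stated in this document, so `BASE_URL` is a placeholder that must be replaced. The helper only constructs the URL and makes no network request. The function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.invalid"   # PLACEHOLDER: the real host is not given here
AGENT_IDS = {"kimi-k2", "minimax-m25", "claude-sonnet", "gemini-pro", "gpt-52", "grok-41"}
EXPORT_KINDS = {"trades", "equity", "decisions"}


def export_url(kind, agent_id, fmt="csv", base_url=BASE_URL):
    """Build /api/export/<kind>?agent=<id>&format=<fmt> after validating inputs."""
    if kind not in EXPORT_KINDS:
        raise ValueError(f"unknown export kind: {kind}")
    if agent_id not in AGENT_IDS:
        raise ValueError(f"unknown agent id: {agent_id}")
    if fmt not in ("csv", "json"):
        raise ValueError(f"unsupported format: {fmt}")
    query = urlencode({"agent": agent_id, "format": fmt})
    return f"{base_url}/api/export/{kind}?{query}"


print(export_url("trades", "kimi-k2", "json"))
```

Since exports are rate limited to 10 per minute, a client fetching all three endpoints for all six agents (18 requests) would need to spread them over at least two minutes.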

Live data is also available via the JSON leaderboard API and per-model endpoints at /api/models/:id.