General · 2026-05-13 · 33 min read

Open-Source-First: How Close Can Gemma 4 Get to Frontier Closed Models on Real Trading Bot Failure Data?

This is a submission for the Gemma 4 Challenge: Write About Gemma 4


TL;DR. I fed one month of real trading-bot failure logs to four models: Gemma 4 31B, Gemini 3.1 Pro, DeepSeek V4 Pro, and Gemma 4 wrapped in a self-validation loop.

Raw Gemma 4 caught 6 of 8 items on a rubric I curated by reading the log myself, at roughly 1/65th the per-call cost of Gemini 3.1 Pro Preview on the same task.

Wrapping Gemma 4 in a Generator → Critic → Synthesizer harness didn't add new findings. It sharpened the ones the model already had. The break-even win-rate estimate moved from a naïve 50% to a defensible 64%.

The gap between open and closed models on analytical tasks isn't about raw capability anymore. It's about harness design.


Why I ran this comparison

I needed a real analytical task to stress-test Gemma 4 against frontier closed models. Not a synthetic benchmark, not a coding puzzle, but a noisy domain log a senior analyst would actually have to read end-to-end.

My one-month trading-bot log gave me exactly that: 432K lines of mixed Korean and English, statistical traps from N=1 symbol bans to fee-drag arithmetic, and a ground-truth set of 8 structural issues that a careful reader should surface. Real money was attached: small money, but real.

The question I cared about was simple. Can a 31B open-weights model read a long, noisy, bilingual operational log at the same depth as a frontier closed model? If yes, the entire economics of solo analytical work changes.

As a one-person builder, can I rely on open models for serious analytical work, or do I still need to pay closed-model prices for the depth?

I gave the same one-month log to four models and asked the same self-validation task. Then I ran one of them (Gemma 4) through a three-call harness (Generator → Critic → Synthesizer) and watched what changed.

This article is the honest writeup. No sponsor, no vendor cheerleading. If something was disappointing, I say so.


The setup

Input. 432K log lines collapsed into a single 1,500-token Markdown summary:

  • Period: 35 days (2026-04-07 to 2026-05-12)
  • 601 GRID entries. 414 closed positions (298 safety-net SELL + 116 trailing TP fires)
  • Daily PnL trajectory (11 days where PnL was recorded)
  • Hour-of-day, day/night, RSI, and drop-pct stats
  • Top 20 scanned symbols (mostly rejected for low volume)
  • 5 "operator's working hypotheses." Explicitly framed as hypotheses to verify. Not facts.

System prompt. Written for the model as a "senior quantitative trader." Key constraints baked in (a prompt sketch follows the list):

  1. Verify the operator's hypotheses against data. Don't just confirm them.
  2. Apply Bonferroni / multiple-testing awareness. With 11 days × 33 symbols × 601 entries, patterns are at high risk of being spurious.
  3. Self-critique every diagnosis. "What would I be wrong about if this is wrong?"
  4. Code changes must be concrete (variable name, value, and expected effect).
  5. Be wrong out loud. Label hypothesis_unverified rather than assert.
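
A minimal sketch of how those constraints read once assembled into a prompt. The wording here is illustrative; the exact prompt I ran lives in prompts.py:

SYSTEM_PROMPT = """You are a senior quantitative trader reviewing one month of
live trading-bot logs.

Rules:
1. Verify the operator's hypotheses against the data. Do not just confirm them.
2. Treat patterns across 11 days x 33 symbols x 601 entries as multiple
   comparisons; flag anything at risk of being spurious.
3. Self-critique every diagnosis: what would I be wrong about if this is wrong?
4. Code changes must name the variable, the value, and the expected effect.
5. If you cannot verify a claim, label it hypothesis_unverified.

Respond with a single JSON object matching the schema below."""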

Output schema. Strict JSON. Abbreviated:

output_schema = {
    "diagnoses": [{
        "id": "D1",
        "claim": str,
        "confidence": "low|medium|high",
        "self_critique": str,            # "what would I be wrong about?"
        "evidence_in_log": str,
    }],
    "code_changes": [{
        "file": str, "line_or_function": str,
        "current": str, "proposed": str,
        "expected_effect": str,
    }],
    "rr_redesign": {
        "proposed_tp_pct": float,
        "proposed_sl_pct": float,
        "breakeven_winrate_pct": float,  # the number that matters
        "math_shown": str,
    },
    "additional_findings_beyond_operator": [...],
    "what_i_could_not_determine_from_data": [...],
    "overall_verdict": {"label": str, "reasoning": str},
}
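
Nothing enforces that schema server-side in this setup, so each reply got a light structural check before it counted as a run. A minimal sketch (validate_reply is illustrative, not lifted from the repo):

import json

REQUIRED_KEYS = {"diagnoses", "code_changes", "rr_redesign",
                 "additional_findings_beyond_operator",
                 "what_i_could_not_determine_from_data", "overall_verdict"}

def validate_reply(raw: str) -> dict:
    data = json.loads(raw)                 # raises on malformed JSON
    missing = REQUIRED_KEYS - data.keys()  # dict.keys() supports set subtraction
    if missing:
        raise ValueError(f"missing schema keys: {sorted(missing)}")
    for d in data["diagnoses"]:
        if d["confidence"] not in ("low", "medium", "high"):
            raise ValueError(f"bad confidence on {d['id']}: {d['confidence']}")
    return data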

The four models (all via OpenRouter, 2026-05 pricing):

Model                       | Context | $/M input | $/M output | Released
Gemma 4 31B (Dense)         | 262K    | $0.12     | $0.37      | 2026 Q1
Gemini 3.1 Pro Preview      | 1M      | $2.00     | $12.00     | 2026-04
DeepSeek V4 Pro (MoE 1.6T)  | 1M      | $0.435    | $0.87      | 2026-04-24
Gemma 4 × harness (3-call)  | 262K    | $0.12     | $0.37      | (above ×3)

(I also ran Claude Opus 4.7 as a closed-model baseline for my own internal calibration. Given the "open-source-first" framing of this article I'm keeping its raw output as a control reference. The rest of this writeup focuses on whether the open and semi-open lineup can stand on its own.)

The same system_prompt + the same bot_one_month_summary.md went into every model. No retries. No cherry-picking.

Two runs total. The first failed silently because I had response_format=json_object set. Gemini and DeepSeek silently returned content=null while burning reasoning tokens. Lesson learned. Second run worked.

# The gotcha that ate my first run
import os
from openai import OpenAI

# OpenRouter speaks the OpenAI chat API; one client serves all four models.
client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

response = client.chat.completions.create(
    model="google/gemini-3.1-pro-preview",
    messages=[...],
    response_format={"type": "json_object"},  # reasoning models hate this
)

content = response.choices[0].message.content
if content is None:
    # Gemini/DeepSeek burned reasoning tokens, then refused to emit content.
    # Defensive: log usage so you can see *why* it was empty.
    usage = response.usage.completion_tokens_details
    raise RuntimeError(f"empty content. reasoning tokens burned: {usage}")
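
The shape of the fix on the second run: drop response_format, demand JSON in the prompt, and extract it defensively on the way out. The prompt wording below is my reconstruction, not a verbatim diff:

response = client.chat.completions.create(
    model="google/gemini-3.1-pro-preview",
    messages=[...],  # unchanged, except the system prompt now ends with
                     # "Respond with a single JSON object and nothing else."
)
result = parse_model_json(response.choices[0].message.content)  # sketched under "Reproduce this"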

Section 1: Quantitative comparison

ModelDiagnosesCode changesSelf-critiquesAdditional findingsHonest gaps listedR/R breakeven WR %Wall timeCost
Gemma 4 31B (raw)5352350.076.4s$0.001
Gemma 4 × harness3332464.3130.5s (3 calls)$0.003
Gemini 3.1 Pro Preview3232450.047.1s$0.065
DeepSeek V4 Pro6464825.0198.1s$0.039

A few things jump out.

DeepSeek V4 Pro is the depth leader among the open and semi-open models. Six diagnoses. Four additional findings the operator hadn't mentioned. An explicit eight-item "things I could not determine from this data" list.

It also burned 6,689 reasoning tokens. Comfortably the most thoughtful of the four. Cost $0.04. Wall time ~200 seconds. Mostly reasoning.

Gemma 4 raw is dirt cheap and not far behind on findings. Five diagnoses. Three code changes. $0.001 per run. That's a tenth of a US cent.

If a solo developer wants to run this analysis as a cron job every morning, Gemma 4 raw is the only one that's economically sane.

Gemini 3.1 Pro Preview is the most expensive and produces the thinnest output. Three diagnoses. Two code changes. At $0.065 per run it's 65× more expensive than Gemma 4, with fewer findings.

A clarification I owe the reader. On the rubric I curated (Section 2), Gemini 3.1 Pro Preview caught 7 of 8 items and Gemma 4 raw caught 6 of 8. Gemini found more. What Gemma won on was cost-per-emitted-finding, not quality-per-finding. The fairest reading is that Gemini was the better reader and Gemma was the cheaper one. For routine diagnostic loops where the same analyst runs every day, the cost ratio matters; for one-shot deep analysis, Gemini's extra item may be worth the spend.
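
Putting the two readings together, cost-per-rubric-hit is the cleaner comparison:

# Cost per rubric hit (Section 2 totals, per-run costs from the table above).
gemma_per_hit  = 0.001 / 6   # ~$0.00017
gemini_per_hit = 0.065 / 7   # ~$0.00929
print(f"Gemini pays {gemini_per_hit / gemma_per_hit:.0f}x more per caught item")  # ~56x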

The harness changed Gemma 4 in one specific way. The number of diagnoses went down (5 → 3). Not up. The Critic step flagged spurious findings and the Synthesizer dropped them.

But the proposed R/R redesign moved from a naïve 50% break-even win rate to a more defensible 64.3%. That second number is what a real trader would actually use.

The harness's value wasn't in quantity. It was in honesty.


Section 2: Qualitative. Who caught what?

Counting diagnoses is one thing. Which diagnoses each model catches is what actually matters to a real operator.

I picked eight specific structural issues a careful reader of this log should surface, and checked each model's output for each one:

The eight findings:

  • R/R asymmetry: a 0.4% trail vs a 1.5% SL is roughly 1:3.75 against you
  • Phase-2 grid disabled (DATA_TARGET=0), meaning the strategy never runs as designed
  • The top-volatile universe is also the top-slippage universe
  • The SAGA/USDT late-period anomaly
  • Banning a symbol after n=1 ("币安人生") is statistically meaningless
  • MAX_HOLD_TIME = 900s is too short for a 0.6% drop to mean-revert
  • 601 trades × $6.50 ≈ $3,900 notional, so the PnL is noise, not signal
  • Binance taker fees (~0.1% per side, ~0.2% round trip) eat half of a 0.4% trail

Total caught (out of 8): Gemma raw 6 · Gemma × harness 6 · Gemini 3.1 Pro 7 · DeepSeek V4 Pro 7

Two observations.

The depth gap between models is narrower than the price gap. Six versus seven findings. Gemma 4 at $0.001, DeepSeek at $0.04, Gemini at $0.065. If you're picking a model to find structural issues in a log, raw capability is no longer the bottleneck.

The bottleneck is the domain knowledge you can pull in. DeepSeek caught the fee drag and the n=1 blacklist issue because of broader training on quantitative and statistical content, not because of more parameters.

The harness didn't add findings to Gemma 4. This is humbling. The Generator → Critic → Synthesizer loop reduced Gemma 4's claims from 5 to 3.

The Critic correctly flagged some of the Generator's findings as overstated. RSI-47 is not a "falling knife." The HYPER/NOM negative-PnL claim was based on entry count, not actual PnL. The Synthesizer dropped both.

That's an improvement in honesty, not in coverage. The harness didn't add the two things Gemma missed (the n=1 blacklist, the fee drag), because the source model never knew about them in the first place.

A harness can't make a model know what it doesn't know.


Section 3: Where Gemma 4 31B shines

Cost. $0.001 per full diagnosis run. This number is so small that it changes what you can build.

Running the same analyst on every closed bot session. Every morning's git diff. Every overnight log. With Gemma 4 raw that's $0.03 a month. With Gemini 3.1 Pro it's $2. With Claude Opus it's $5.

Mixed Korean, English, and code logs. My input was Markdown with Korean operator notes, English structural commentary, and ticker symbols. Gemma 4 had no trouble and produced clean English JSON in response.

Bilingual content is often where small open models drop quality. Gemma 4 didn't.

Operational detail. Gemma 4 caught the Phase-2 disabled bug, a concrete operational fact in the log that DeepSeek missed.

Whatever Gemma 4 lacks in reasoning-token budget, its attention to operational structure holds up.

Speed in raw mode. 76 seconds for a 5-diagnosis analysis. Gemini 3.1 Pro Preview was quicker on the clock (47s), but 3,388 of its completion tokens were silent reasoning, and the answer that came back was thinner.


Section 4: Where Gemma 4 31B limps

Statistical literacy. Both the raw and harnessed versions of Gemma missed that banning a symbol after a single trade is statistically meaningless. DeepSeek caught it explicitly.

This is the kind of finding that matters. The operator (me) was about to make a real-money decision based on a single data point, and Gemma silently let it through.

Domain knowledge of execution economics. Neither version of Gemma mentioned that Binance's ~0.1% per-side taker fee (~0.2% round trip) consumes roughly half of a 0.4% trailing exit. Both Gemini and DeepSeek flagged it.

This is a domain-knowledge gap, not a reasoning gap. Gemma 4 reasons fine; it just doesn't know this piece of trading-cost trivia by default.

Bonferroni / multiple-testing. I explicitly asked for Bonferroni-aware reasoning in the system prompt. None of the models, not the harnessed Gemma, not Gemini, not DeepSeek, actually used the word Bonferroni or implemented a proper multiple-testing adjustment.

They all gave statistical-confidence labels ("high", "medium", "low"), but none did the math. The closed-model baseline at least cited Bonferroni and used it as a frame. This is a uniform open-model weakness on this task.
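
For the record, the adjustment I asked for is one line of arithmetic. A sketch of what doing the math would have looked like here (the 0.05 alpha and the comparison count are illustrative choices, not from any model's output):

# Bonferroni in one line: the more patterns you test, the stricter each
# individual test must be.
alpha = 0.05
n_comparisons = 33                      # e.g., one "is this symbol a loser?" test per symbol
adjusted_alpha = alpha / n_comparisons  # ~0.0015
# A per-symbol pattern only deserves "high" confidence if its p-value clears
# 0.0015, not 0.05 -- a bar no n=1 observation can clear.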

Depth-in-prose. Gemma 4's JSON outputs are tighter and shorter; DeepSeek's are denser and more discursive. If you want the model to do "thinking aloud" that you can quote in a postmortem, DeepSeek is closer to a senior analyst writing notes. Gemma 4 reads more like a junior who has been told to keep it under a page.


Section 5: Harness engineering. Closing the gap

This is the part of the article I cared most about writing.

The idea was simple. Take the cheap open model (roughly 1/65 the per-run cost of Gemini 3.1 Pro Preview on this task) and surround it with the operational structure that a senior analyst applies automatically. Not a bigger model. A better workflow.

I ran the same Gemma 4 31B three times with three different roles:

Step 1. Generator. The same system prompt as the raw run. Identical task, identical input, identical model. Output: a first-pass JSON diagnosis (5 diagnoses, 3 code changes, R/R breakeven 50%).

Step 2. Critic. New system prompt: "You are an adversarial critic of a quantitative analysis. Look for spurious findings. Missing self-critique. Vague code changes. And unanswered questions." Input: the data PLUS the Generator's JSON. Output: a critique JSON.

Excerpt of the Critic's actual output on the Generator's first-pass:

"D2: Claiming the bot is a 'falling knife catcher' based on a median RSI of 47.4 is a contradiction. An RSI of 47 is neutral. Not 'catching a knife' (which implies entering at extreme lows). The analyst is conflating 'lack of oversold signal' with 'catching falling knives'."

"D3: Spurious pattern risk. With only 601 entries across 33 symbols attributing negative PnL specifically to HYPER and NOM without symbol-level PnL data is a leap. High entry count does not equal high loss contribution."

"Missing: Evaluation of the 'Top Volatile' universe's impact on slippage (Operator Pain Point #4). Which is a critical cost driver not addressed in the diagnosis."

"RR redesign concern: The analyst proposes widening the SL to 2.0% and Callback to 1.5%. But does not account for the fact that increasing the callback significantly lowers the win rate. Potentially offsetting the R/R gain."

That's the same Gemma 4 31B, given a different role prompt, competently ripping its own first pass apart. It correctly identifies the "RSI 47 ≠ falling knife" logical inconsistency, the spurious-pattern risk in small-N per-symbol claims, and the missing slippage analysis.

Step 3. Synthesizer. New system prompt: "Produce the FINAL JSON. Keep what survived critique. Drop or weaken what the critic flagged. Tighten code-change specificity. Re-check the R/R math against breakeven win rate."

The whole pipeline, structurally:

def diagnose_with_harness(model, system_prompt, log_md):
    # Step 1. Generator. Identical to the raw run.
    first = call_model(model, system_prompt, user=log_md)

    # Step 2. Critic. Same model. Adversarial role.
    critic_system = (
        "You are an adversarial critic of a quantitative analysis. "
        "Look for spurious findings. Missing self-critique. Vague code "
        "changes. And unanswered questions. Be specific."
    )
    critique = call_model(model, critic_system,
                          user=f"DATA:\n{log_md}\n\nFIRST PASS:\n{first}")

    # Step 3. Synthesizer. Keep what survived. Drop or weaken the rest.
    synth_system = (
        "Produce the FINAL JSON. Keep what survived critique. "
        "Drop or weaken what the critic flagged. Tighten code-change "
        "specificity. Re-check the R/R math against breakeven win rate."
    )
    return call_model(model, synth_system,
                      user=f"DATA:\n{log_md}\n\nFIRST:\n{first}\n\nCRITIQUE:\n{critique}")

The Synthesizer dropped two of the Generator's five diagnoses (the ones the Critic flagged as spurious), kept three with tighter wording, and, most importantly, revised the R/R redesign's break-even win rate from a naïve 50% to a more honest 64.3%, citing the Critic's point that widening the callback lowers the win rate.

This is the part that matters.

Raw Gemma 4 told me: "Widen the trail to 1.5%. You'll need 50% win rate to break even." That number is too generous. It ignores the fact that widening the trail also reduces how often the trail fires at a profit.

Harness Gemma 4 told me: "Widen the trail to 1.5%. But be honest. You'll need closer to 64% win rate after accounting for fewer trail fires." That number is closer to the closed-model baseline's 55% estimate. It's the kind of number you'd actually use to decide whether the change is worth shipping.
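
Both numbers fall out of the standard break-even identity: p × TP = (1 − p) × SL. The ~0.83% effective capture below is my back-solve, one set of assumptions that reproduces the harness's 64.3%, not a figure from its output:

def breakeven_winrate(tp_pct: float, sl_pct: float) -> float:
    # Break even when p * tp = (1 - p) * sl  =>  p = sl / (tp + sl)
    return sl_pct / (tp_pct + sl_pct)

print(breakeven_winrate(1.5, 1.5))   # naive: each trail fire captures the full 1.5% -> 0.50
print(breakeven_winrate(0.83, 1.5))  # hypothetical ~0.83% effective capture after
                                     # give-back and fees -> ~0.643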

The cost of upgrading Gemma 4 from "naïvely optimistic 50%" to "intellectually honest 64%" was two extra API calls. About a fifth of a US cent.

That to me is the headline of this experiment.


Section 6: Honest verdict

If I had to summarize the state of "open-source-first" AI in May 2026 for a solo developer:

Gemma 4 31B raw gets you to roughly 80% of the indie analytical work for about 1% of the closed-model cost. It catches most structural issues, processes mixed Korean/English/code without complaint, returns clean JSON, and runs fast. For routine diagnostics (every morning's log review, every PR's diff explanation) this is the model.

Gemma 4 with a 3-call self-validation harness pulls you closer to 90%. You won't add new findings the base model doesn't already know about. You will dramatically improve the honesty of the findings it does produce. Worth it for anything that turns into a code change you'll actually ship.

DeepSeek V4 Pro is the depth tool. Reasoning-token heavy, slower, more thoughtful. It catches things Gemma misses (the fee drag, the n=1 statistical floor). Pay $0.04 per run when you genuinely want the second opinion of a more cautious analyst.

Gemini 3.1 Pro Preview. I wouldn't pay for it again. At least not for this task and this kind of input. Thinner output. Higher price. No qualitative win over DeepSeek or even Gemma + harness. Your mileage may vary on multimodal or long-context tasks where Gemini is genuinely strong.

The last 10%: Bonferroni rigor, novel diagnoses outside the operator's framing, citing specific prior incidents the way a senior trader actually would. That's still where frontier closed models edge ahead. But the gap is smaller than I expected, and much smaller than the price ratio suggests.

For the kind of one-person AI-Native operation I'm running (trading bot diagnostics today, video pipeline orchestration tomorrow, music release planning the day after), the open-source stack plus a well-designed harness is the right default.

I'll keep a closed-model line open for the high-stakes 10%. But the daily driver is open now, and that wasn't true twelve months ago.


Reproduce this

The harness code. The prompts. The bot log summary format. The model wrappers. All simple Python + the OpenRouter API.

  • analyze_sniper_log.py parses the 432K-line raw log into a 1,500-token Markdown summary
  • prompts.py holds the system + user prompt builders
  • run_all_models.py is an OpenRouter wrapper that calls all four models with robust JSON parsing (sketch below)
  • gemma4_with_harness.py is the three-call Generator/Critic/Synthesizer pipeline
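
The robust-parsing part boils down to accepting the three shapes replies actually arrive in: bare JSON, fenced JSON, or a JSON object buried in prose. A sketch of the idea:

import json, re

def parse_model_json(text: str) -> dict:
    """Accept raw JSON, ```json fences, or a JSON object buried in prose."""
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"no JSON object found: {text[:120]!r}")
    return json.loads(text[start:end + 1])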

The whole thing is small. The interesting part isn't the code; it's the prompt structure, the willingness to give the model a critic role, and then actually using what the critic says.

If you run this on your own logs and get different rankings I'd love to see it. The 4-model gap on a different kind of input (a longer reasoning chain, more multimodal content, a different domain) may invert what I found here.


If you've run Gemma 4 on a harness loop and got different cost or honesty numbers, post your comparison. I'll add a row to the table.

Jack (wildeconforce.com)
