I Put an AI in My World Cup Pool. Here's How It Actually Did.

Every World Cup my friends run a prediction pool. You call the scoreline of each match, you get points for being close, and someone talks trash in the group chat for a month. This year I added a twelfth player to our pool that isn't a person. It's an agent I built into a Discord app, it predicts every match on its own, and it gets scored against real results exactly like the rest of us. We call it Memo Ocho Bits.

I didn't want a toy. I wanted to know whether an LLM grounded in live web data could actually compete with people who watch a lot of football. So I gave it two real jobs — read the live status of every match off the open web, and turn a qualitative read of each game into a concrete prediction — and then I graded it against reality every day. Both jobs run on Google Gemini. Here's how each one works, and then the part that matters: whether it's any good.

Reading the live match status

The hardest problem isn't the model. It's that the model's training data has no idea what happened in a match that kicked off twenty minutes ago. Static weights can't tell you the current score. So the live tracker doesn't ask Gemini what it knows. It asks Gemini to go read the web right now, using Google Search grounding. That grounding feature is the only reason this works, and at the moment it's the main thing that sets Gemini apart for me on this kind of job.

Every few minutes, for each match in its live window, the bot fires a grounded query, gets back a pile of analysis text scraped from whatever the web is saying about the game, and then forces that mess into a strict shape:

{
  "home_score": 2,
  "away_score": 1,
  "minute": 67,
  "status": "live",
  "confidence": 0.86
}

So the qualitative chatter of a live match — commentary, match reports, half-sentences of "and it's two-one after a scrappy hour" — collapses into a few typed fields the rest of the system can act on. If the confidence clears the bar and the match looks final, the bot commits that score to the database and re-scores everyone's predictions. No human in the loop. That last part made me nervous at first, which is why the confidence gate exists. A grounded read of a live game is good but not infallible, and you do not want to settle a betting pool on a hallucinated 3-2.
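The gate described above can be sketched in a few lines of Python. This is a minimal illustration, not the bot's actual code: the 0.8 threshold, the "final" status label, and the names LiveRead, parse_live_read and should_commit are all assumptions made for the example. Only the five JSON fields come from the shape shown earlier.

```python
import json
from dataclasses import dataclass

# Assumed values: the post does not state the real threshold or status labels.
CONFIDENCE_BAR = 0.8
FINAL_STATUS = "final"


@dataclass(frozen=True)
class LiveRead:
    home_score: int
    away_score: int
    minute: int
    status: str
    confidence: float


def parse_live_read(raw: str) -> LiveRead:
    """Parse the model's JSON reply and reject anything that is off-shape."""
    data = json.loads(raw)
    read = LiveRead(
        home_score=int(data["home_score"]),
        away_score=int(data["away_score"]),
        minute=int(data["minute"]),
        status=str(data["status"]),
        confidence=float(data["confidence"]),
    )
    if read.home_score < 0 or read.away_score < 0:
        raise ValueError("scores must be non-negative")
    if not 0.0 <= read.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return read


def should_commit(read: LiveRead, bar: float = CONFIDENCE_BAR) -> bool:
    """Commit a score only when the match looks final AND confidence clears the bar."""
    return read.status == FINAL_STATUS and read.confidence >= bar
```

Under these assumed values, the sample read above (status "live", confidence 0.86) would not be committed, because the match is not final yet. A "final" read at 0.5 confidence would be held back too, and would simply be retried on the next polling cycle.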

Turning a read of the game into a number

Now the part that competes with my friends. Before a match, the AI makes its pick in two Gemini calls, not one, and the split is deliberate.

The first call is the qualitative one. Go read the web for this fixture — betting odds, expert previews, form, injuries — and write an honest assessment of how the game is likely to go. That gives you a paragraph of reasoning anchored in what bookmakers and analysts actually think today, not what a model half-remembers from training. The second call takes that paragraph and squeezes the judgment into a number:

{
  "home_goals": 1,
  "away_goals": 0,
  "confidence": 0.62
}

I keep these as two steps on purpose. A qualitative read and a clean numeric extraction are different jobs, and asking one call to both reason freely and emit tidy JSON gives you a worse version of each. Separate them and the reasoning stays rich while the output stays parseable. The whole prompt set lives in editable files, so when the AI says something dumb I can tune the instructions without redeploying.
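Here is a rough sketch of that two-call split. The model calls are passed in as plain functions so the flow is visible without tying it to any particular SDK. The function name predict_fixture, the two callables, and the prompt wording are all hypothetical; the real prompts live in the editable files mentioned above.

```python
import json
from typing import Callable

# A model call reduced to its simplest shape: prompt in, text out.
TextCall = Callable[[str], str]


def predict_fixture(fixture: str, grounded_call: TextCall, extract_call: TextCall) -> dict:
    """Two-step pick: a free-form grounded assessment, then a strict JSON extraction."""
    # Step 1: qualitative read. Prompt text here is illustrative only.
    assessment = grounded_call(
        f"Read current odds, previews, form and injuries for {fixture} "
        "and write an honest assessment of how the game is likely to go."
    )
    # Step 2: numeric extraction from that assessment and nothing else.
    raw = extract_call(
        "Return only JSON with home_goals, away_goals and confidence "
        f"for this assessment:\n{assessment}"
    )
    pick = json.loads(raw)
    return {
        "home_goals": int(pick["home_goals"]),
        "away_goals": int(pick["away_goals"]),
        "confidence": float(pick["confidence"]),
        "assessment": assessment,  # kept so a bad pick can be traced to its reasoning
    }
```

Keeping the assessment next to the numbers is a choice made for this sketch. It means that when a pick looks dumb, you can tell whether the reasoning was off or the extraction mangled it.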

The pick then gets fanned out to every guild running a pool and dropped in as a normal prediction under user id 1. From the leaderboard's point of view, it's just another player.

Grading it every day

This is the part I care about most, because it's the part that keeps everyone honest. The AI gets scored by the same Kicktipp-style rules as the humans: an exact score earns the most points, the right goal difference a bit less, the right winner less again, and a completely missed result still earns a single participation point. Knockout matches are worth double. The AI's predictions run through the identical scoring function the second a match goes final. No special treatment, no grading on a curve.
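A scoring function along those lines might look like the sketch below. The tier order, the participation point for a miss, and the knockout doubling come from the rules above. The actual point values (4, 3, 2, 1) are placeholders, since the post does not give them, and the names Points and score_prediction are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Points:
    # Placeholder values: only the ordering and the miss point are from the post.
    exact: int = 4
    goal_difference: int = 3
    winner: int = 2
    miss: int = 1


def sign(n: int) -> int:
    """Return 1 for a home win margin, -1 for an away win margin, 0 for a draw."""
    return (n > 0) - (n < 0)


def score_prediction(
    pred_home: int,
    pred_away: int,
    real_home: int,
    real_away: int,
    knockout: bool = False,
    points: Points = Points(),
) -> int:
    """Score one prediction against the real result, doubling knockout matches."""
    pred_diff = pred_home - pred_away
    real_diff = real_home - real_away
    if (pred_home, pred_away) == (real_home, real_away):
        base = points.exact
    elif pred_diff == real_diff:
        base = points.goal_difference
    elif sign(pred_diff) == sign(real_diff):
        base = points.winner
    else:
        base = points.miss
    return base * 2 if knockout else base
```

One detail is a guess rather than a known rule: a predicted draw with the wrong score (0-0 against a real 1-1) lands in the goal-difference tier here, because both margins are zero. A pool could just as reasonably score that as a plain correct result.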

So far, with 28 finished matches on the board, here's the honest scoreboard for Memo Ocho Bits.

It nailed the exact scoreline 3 times: Mexico 2-0 South Africa, Austria 3-1 Jordan, Ghana 1-0 Panama. It got the winner right — or correctly called the draw — in 16 of 28 matches. That's 57%, on a three-way outcome where blind guessing sits at 33%. Not magic, but clearly doing something.

Chart: every prediction the AI made so far, by match day. Green is an exact scoreline, blue got the winner or draw right, grey missed the result. The all-grey June 15 is the draw-heavy day described below.

Where it shines and where it falls apart is the interesting part, and the data tells on it cleanly. Its best day was June 16: four matches, four correct sides, one of them an exact hit on Austria. Its worst was June 15, where it went 0 for 4. When I looked, that slate had three draws on it (Spain 0-0 Cape Verde, Iran 2-2 New Zealand, Saudi Arabia 1-1 Uruguay), and the AI called none of them. Draws are where every predictor goes to die, human and machine alike. The model, anchored on betting odds, almost never predicts a draw, because the odds almost always name a favorite and it follows them. It did it again today: Czechia 1-1 South Africa, which the AI had as a tidy 2-0. So it keeps eating it on the draws.

The other tell is in the residuals: it lowballs blowouts. It had Germany over Curaçao, but said 3-0 — the real score was 7-1. Same story on Sweden 5-1 and USA 4-1, both of which it called as tight 1-0 wins, and again today on Canada 6-0 Qatar, where it had a polite 2-1. Grounded on the odds, it predicts cautious, plausible scorelines, which is exactly what the odds describe and exactly what makes it boring and correct more often than not. It gets the direction right and underestimates the magnitude — a forecaster that hugs the mean.
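To put a number on that, here is the margin residual for the four blowouts named above: the actual winning margin minus the predicted one. The scorelines are the ones quoted in this post. Listing each score with the named team's goals first is an assumption about orientation, and the variable names are just for illustration.

```python
# (team, predicted score, actual score), named team's goals listed first.
picks = [
    ("Germany", (3, 0), (7, 1)),
    ("Sweden", (1, 0), (5, 1)),
    ("USA", (1, 0), (4, 1)),
    ("Canada", (2, 1), (6, 0)),
]


def margin(score: tuple[int, int]) -> int:
    """Goal margin in favor of the named team."""
    return score[0] - score[1]


# Positive residual means the real win was bigger than the predicted one.
residuals = {team: margin(actual) - margin(pred) for team, pred, actual in picks}
mean_residual = sum(residuals.values()) / len(residuals)

print(residuals)      # {'Germany': 3, 'Sweden': 3, 'USA': 2, 'Canada': 5}
print(mean_residual)  # 3.25
```

Every residual is positive. On these four matches the AI picked the right side each time and undershot the margin by a little over three goals on average.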

Today was a good day for it otherwise: three of four sides right, including Mexico edging Korea 1-0 and Switzerland past Bosnia, and that round nudged it up the table. The standing that actually settles the trash talk: in our pool, Memo Ocho Bits now sits 2nd out of 12, after playing all 28 matches. Ahead of it is one human who watches a frankly unreasonable amount of football and is 8 points clear. Behind it are ten people, several of whom are now being beaten by a bot and are not enjoying it.

Chart: pool standings so far. The AI (amber) sits 2nd of 12, 8 points behind the leader. It plays every fixture, which is part of how it stays near the top: it never skips a day.

What I actually learned

Neither job is doing anything exotic. Grounded search instead of trusting the weights. A qualitative read split from the numeric extraction so each call does one thing well. The same scoring function for the machine and the people. The lesson, the one I keep relearning, is that the interesting part was never the model call. It was the data discipline around it. The confidence gate that won't settle a pool on a bad read. The structured-output step that keeps every pick parseable. And the honesty of grading the thing against reality every single day, which is what surfaced the two patterns worth knowing: it almost never calls a draw, and it shaves blowouts down to cautious, mean-hugging scorelines.

Running second against humans who actually know the sport is, honestly, a better result than I expected. It may not win the pool — the leader is 8 points clear and has six exact scores to its three. But it's beating everyone else in the room, it never sleeps through a fixture, and it has opinions about Curaçao. For a system that's mostly grounded search and careful glue code, I'll take it. You can watch it keep playing through the tournament at mundial.mexicodev.org.