Can LLMs Reason by Predicting the Next Word?

The Detective Novel Test

The most interesting part is from 28:30-28:45.

Ilya Sutskever on why next-word prediction requires real understanding and reasoning:

"Say you read a detective novel complicated plot a storyline different characters lots of events Mysteries like Clues... at the last page of the book the detective has got all the clues gathered, all the people, and saying okay I'm going to reveal the identity of whoever committed the crime and that person's name is...predict that word... by predicting those words better and better and better the understanding of the text keeps on increasing... in order to predict that next word figure out from all of the agents that were there and all of their strengths or weaknesses or their intentions and the context and to be able to predict that word who who was the murderer that requires some amount of reasoning a fair amount of reasoning"

From a fireside chat with Jensen Huang at NVIDIA GTC 2023 (March 23, 2023) - click the image above to watch from timestamp 28:03.


What This Is

We generate logic puzzles where suspects have various attributes (they were at certain places, with certain objects, and so on). Through a series of propositions, the logical clues, you must deduce who the killer is. Everyone except the killer has a provable alibi. The catch? This is an instance of SAT (Boolean satisfiability), the canonical NP-complete problem: given a set of logical constraints, find a truth assignment that satisfies them all.

Here's what makes this interesting: We use a SAT solver to generate these puzzles, then ask an LLM to solve them through next-token prediction. The LLM faces the exact same NP-complete problem that created the puzzle in the first place.

This setup lets us answer some interesting questions, which the sections below explore.


How the Games Are Generated

The clever part: we use the same NP-complete problem to create the puzzles that the LLM will need to solve.

Each game is generated through an iterative process using SymPy's SAT solver:

"""Simple integration test for Clue logic game."""

from clue_models import DEFAULT_CONFIG
from generate_clue_game import (
    create_game,
    generate_game_until_unique_solution,
    setup_scenario,
)
from solver import check_solution_count
from ui import print_propositions, print_scenario


def test_generate_and_solve_game():
    """Test that a generated game has a unique solution."""
    print("Running single game test...")

    # Create and setup game
    game = create_game(DEFAULT_CONFIG, seed=42)
    setup_scenario(game)

    # Print scenario for visibility
    print_scenario(game)

    # Generate propositions until unique solution
    identified_killer = generate_game_until_unique_solution(game, verbose=True)

    # Print the full game record
    print("\n" + "=" * 60)
    print("GAME RECORD")
    print("=" * 60)
    print_propositions(game, verbose=True)
    print()

    # Verify the solution
    count, possible_killers = check_solution_count(game)

    # Assertions
    assert count == 1, f"Expected 1 possible killer, got {count}: {possible_killers}"
    assert identified_killer == game.killer, (
        f"Identified {identified_killer} but actual killer is {game.killer}"
    )
    assert possible_killers[0] == game.killer, (
        f"Solver found {possible_killers[0]} but actual killer is {game.killer}"
    )

    print("✅ Test passed: Unique solution found and verified!")
    return True


if __name__ == "__main__":
    test_generate_and_solve_game()

The generation algorithm:

  1. Create suspects: Generate 8 people with random attributes across 6 categories (place, food, material, institution, company, technology)
  2. Pick the killer: Randomly select one suspect as the true culprit
  3. Generate propositions: Iteratively add logical clues like:
    • "Joe was with cement"
    • "(Will with cement) OR (Bob with pizza and at the library)"
    • "If Jane was at the church, Jane is not the killer"
  4. Validate with SAT: Before adding each proposition, use the SAT solver to ensure it doesn't accidentally eliminate the true killer
  5. Check convergence: After each proposition, check if exactly one suspect remains
  6. Repeat: Continue until the SAT solver confirms only one person could be the killer

The key insight: Every proposition is verified against the ground truth using SAT solving. This guarantees 100% solvability—the puzzle always has exactly one solution. The LLM must now perform the same logical deduction the SAT solver used to generate the game.
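
To make the loop concrete, here's a minimal, self-contained sketch of the same idea using SymPy's satisfiable. It uses a single attribute (place) per suspect rather than the six categories in the real generator, and the names and helpers here are illustrative, not the ones in generate_clue_game.py:

"""Minimal sketch of SAT-verified puzzle generation (simplified to one
attribute per suspect; the real generator uses six categories)."""

import random
from itertools import combinations

from sympy import symbols
from sympy.logic.boolalg import And, Not, Or
from sympy.logic.inference import satisfiable

SUSPECTS = ["Joe", "Bob", "Will", "Linda"]
PLACES = ["library", "church", "warehouse"]

killer = {s: symbols(f"killer_{s}") for s in SUSPECTS}
at = {(s, p): symbols(f"at_{s}_{p}") for s in SUSPECTS for p in PLACES}


def exactly_one(vars_):
    """Exactly one of the given Boolean variables is true."""
    return And(Or(*vars_), *[Not(And(a, b)) for a, b in combinations(vars_, 2)])


# Base theory: one killer, each suspect in exactly one place, and the crime
# happened at the warehouse, so the killer must have been there.
base = And(
    exactly_one(list(killer.values())),
    *[exactly_one([at[(s, p)] for p in PLACES]) for s in SUSPECTS],
    *[Or(Not(killer[s]), at[(s, "warehouse")]) for s in SUSPECTS],
)


def possible_killers(clues):
    """Suspects the SAT solver cannot yet rule out."""
    theory = And(base, *clues)
    return [s for s in SUSPECTS if satisfiable(And(theory, killer[s])) is not False]


rng = random.Random(42)

# Ground truth: pick the killer and where everyone actually was.
true_killer = rng.choice(SUSPECTS)
truth = {s: rng.choice(["library", "church"]) for s in SUSPECTS}
truth[true_killer] = "warehouse"
ground_truth = And(killer[true_killer], *[at[(s, truth[s])] for s in SUSPECTS])

clues = []
while len(possible_killers(clues)) > 1:
    # Candidate clue: "X was at P", optionally OR-ed with a second such fact.
    s = rng.choice(SUSPECTS)
    candidate = at[(s, rng.choice(PLACES))]
    if rng.random() < 0.5:
        t = rng.choice(SUSPECTS)
        candidate = Or(candidate, at[(t, rng.choice(PLACES))])
    # Keep it only if it holds in the ground-truth world, so no clue can ever
    # eliminate the real killer.
    if satisfiable(And(base, ground_truth, *clues, candidate)) is not False:
        clues.append(candidate)

print("True killer:", true_killer)
print("SAT solver deduces:", possible_killers(clues))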


The Single-Token Trick

To measure the LLM's reasoning precisely, we need clean confidence scores. That's where single-token strings come in.

Why Single Tokens Matter

Names like "Joe", "Bob", "Will", and "Linda" are carefully chosen because they each tokenize to exactly one token in GPT-4o's tokenizer. This means:

Clean confidence measurement: When the LLM predicts "Joe" as the killer, we get a single log probability value representing its confidence. No multi-token aggregation needed—just one clean number.

Compare this to a name like "Elizabeth" (multiple tokens). To get the overall confidence, you'd need to aggregate probabilities across tokens, which introduces complexity and potential measurement artifacts. With single tokens, we get the LLM's raw, unambiguous confidence in its answer.

Implementation Details

All game elements are single-token strings.

We verify this in tests and extract log probabilities directly from the API response. The formula is simple: confidence = e^(logprob). This gives us a probability between 0 and 1 representing how confident the model is in its prediction.
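
A minimal sketch of both checks, assuming the tiktoken package for the tokenizer and the standard confidence = e^(logprob) conversion; the specific names and the example logprob value are illustrative:

"""Sketch: verify single-token names and convert a logprob to a confidence."""

import math

import tiktoken

NAMES = ["Joe", "Bob", "Will", "Linda"]

enc = tiktoken.encoding_for_model("gpt-4o")
for name in NAMES:
    token_ids = enc.encode(name)
    assert len(token_ids) == 1, f"{name!r} is not a single token: {token_ids}"


def confidence(logprob: float) -> float:
    """The model's probability for the predicted token: e^(logprob)."""
    return math.exp(logprob)


# With the OpenAI Python client (v1), the answer token's log probability can be
# read from a chat-completions response requested with logprobs=True, e.g.
# response.choices[0].logprobs.content[0].logprob, and converted as below.
print(confidence(-0.105))  # ~0.90, i.e. the model is about 90% sure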

This precision lets us measure exactly how confident the model is at each prediction, enabling calibration analysis and fine-tuning strategies.


Does the Model Know When It's Right?

A well-calibrated model should be more confident when correct and less confident when wrong. If it says "I'm 80% sure," it should be right about 80% of the time.

We analyzed predictions from gpt-4.1-nano to see how well it calibrates on this logical reasoning task:

Model Calibration Curve

Reading the Calibration Curve

What we typically see is that the calibration curve reveals when the model "knows what it knows", and when it is confidently wrong. These confident mistakes are particularly interesting: they show us exactly where the model's reasoning breaks down.
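
Here's a small sketch of how such a reliability curve can be computed, assuming each prediction is logged as a (confidence, was_correct) pair; the bin count and record format are assumptions, not the repo's schema:

"""Sketch of a calibration (reliability) curve from logged predictions."""

from collections import defaultdict


def calibration_curve(predictions, n_bins=10):
    """Bucket predictions by confidence; compare mean confidence to accuracy."""
    bins = defaultdict(list)
    for conf, correct in predictions:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    curve = []
    for idx in sorted(bins):
        rows = bins[idx]
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        curve.append((round(mean_conf, 3), round(accuracy, 3), len(rows)))
    # A well-calibrated model keeps mean confidence close to accuracy per bucket.
    return curve


print(calibration_curve([(0.95, True), (0.92, False), (0.55, True), (0.20, False)]))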


Try It Yourself

Here's an actual game from our dataset. Read through the propositions (clues) and see if you can figure out who the killer is. Then reveal the answer to see if the LLM got it right—and how confident it was.


How Well Does It Actually Work?

Here's how gpt-4.1-nano performs on these NP-complete logic puzzles:

What These Numbers Mean

The Random Chance Baseline

With 8 suspects, random guessing gives you 12.5% accuracy. Any performance significantly above this suggests genuine logical reasoning rather than pattern matching or lucky guessing.

These are hard problems: the model must perform the same logical deduction the SAT solver performed when generating the puzzles. Every mistake the model makes is an opportunity to study where its reasoning fails, and potentially to fix it through targeted fine-tuning.
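
As a reference point, the headline numbers boil down to a comparison against that 1-in-8 baseline; a tiny sketch, assuming a list of per-puzzle outcomes:

"""Sketch of the accuracy-vs-chance comparison."""


def evaluate(outcomes, n_suspects=8):
    """outcomes: one boolean per puzzle, True if the killer was identified."""
    accuracy = sum(outcomes) / len(outcomes)
    baseline = 1 / n_suspects  # random guessing: 12.5% with 8 suspects
    return accuracy, baseline


acc, base = evaluate([True, False, True, True])
print(f"accuracy={acc:.0%} vs chance={base:.1%}")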


Teaching It to Reason Better

Now that we know where the model fails, can we teach it to reason better through fine-tuning?

The Strategic Question

Not all training examples are equally valuable. Fine-tuning on cases the model already solves correctly is unlikely to help—it already knows how to handle those. The interesting question: which mistakes should we teach it to fix?

We generate multiple fine-tuning datasets to test different strategies:

1. Most Confident + Wrong (Target: Overconfidence)

Examples where the model was highly confident but incorrect. These are systematic blind spots—patterns where the model has learned the wrong thing. Training on these should help it recognize when it's about to make a confident mistake.

Hypothesis: High impact. These expose deeply learned errors.

2. Least Confident + Wrong (Target: Edge Cases)

Examples where the model struggled and failed. These are genuinely hard cases where the reasoning is complex or ambiguous.

Hypothesis: Medium impact. May help with similar edge cases but might not generalize.

3. Correct Predictions (Control Group)

Examples where the model got it right. Serves as a baseline to reinforce existing correct reasoning.

Hypothesis: Low impact. Reinforces what the model already does, but adds little new information.

4. All Cases (Kitchen Sink)

The full dataset, both right and wrong predictions.

Hypothesis: Variable impact depending on the distribution.
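
To make the comparison concrete, here's a sketch of slicing a prediction log into these four datasets and writing them in the OpenAI chat fine-tuning JSONL format; the record fields (prompt, answer, confidence, correct) and the cutoff k are assumptions, not the repo's actual schema:

"""Sketch: build the four fine-tuning datasets from a prediction log."""

import json


def build_datasets(records, k=50):
    """records: dicts with prompt, answer (true killer), confidence, correct."""
    wrong = sorted(
        (r for r in records if not r["correct"]),
        key=lambda r: r["confidence"],
        reverse=True,
    )
    return {
        "most_confident_wrong": wrong[:k],     # systematic blind spots
        "least_confident_wrong": wrong[-k:],   # genuinely hard edge cases
        "correct_control": [r for r in records if r["correct"]][:k],
        "all_cases": records,                  # the kitchen sink
    }


def write_finetune_jsonl(records, path):
    """One chat-format row per puzzle: the prompt and the correct single token."""
    with open(path, "w") as f:
        for r in records:
            row = {"messages": [
                {"role": "user", "content": r["prompt"]},
                {"role": "assistant", "content": r["answer"]},
            ]}
            f.write(json.dumps(row) + "\n")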

The Experiment

By comparing these strategies, we can test empirically whether it's more efficient to train on a small number of confident mistakes or on every mistake regardless of confidence. This is a mechanistic interpretability question disguised as a training efficiency question.

The bet: confident mistakes reveal systematic reasoning failures that generalize across many similar cases. Fix those, and the model should improve dramatically with minimal training data.


Why This Matters

By having the model solve NP-complete SAT problems through next-token prediction, we can:

  1. Test the limits of reasoning: These are computationally hard problems. Can next-token prediction encode genuine logical deduction?

  2. Measure confidence precisely: Single-token predictions give us clean probability measurements, revealing when the model "knows what it knows."

  3. Identify failure modes: Where does logical reasoning break down? Confident mistakes show us systematic errors in how the model reasons.

  4. Improve strategically: By training on specific failure patterns, we can teach the model to reason better—and measure exactly how much better.

Ilya Sutskever said predicting the killer in a detective novel requires "a fair amount of reasoning." We've built a controlled environment to test exactly that claim. The results suggest that yes, LLMs can perform complex logical reasoning through next-token prediction—but they're far from perfect. Every mistake is a window into how they think.

That's the interesting part: we're not just testing if LLMs can solve logic puzzles. We're testing how statistical models learn to reason, and we can measure it with single-token precision.


Explore the full codebase on GitHub