AI Learns to Do Science Better by Playing Battleship

To revolutionize scientific discovery, artificial intelligence must first master the art of efficient decision-making. A recent study suggests that teaching AI to play Battleship offers a surprisingly effective training ground for this goal. By forcing models to make the most of limited resources, researchers have developed techniques that could transform how AI designs experiments and pursues hypotheses in complex scientific fields.

The Cost of Curiosity

Scientific research is fundamentally a game of resource management. Researchers must decide which hypotheses to test and which simulations to run, often facing strict constraints on time, money, or data availability. As Valerio Pepe, a research scientist who led the study before joining OpenAI, notes, “You can get only so much data because getting data is either expensive or time-consuming.”

The challenge for AI is not just to find answers, but to find them efficiently. This requires mastering what Pepe calls “cheap interventions” for information seeking—strategies that maximize the value of every single query or experiment. To test this, the team turned to a classic board game.

A Collaborative Twist on a Classic Game

The researchers designed a specialized, collaborative version of Battleship. In this variation, one player acts as the “questioner,” generating queries about the hidden ship locations, while another acts as the “answerer.” The goal is for the team to pinpoint and sink all vessels in the fewest number of rounds possible.
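The questioner/answerer loop can be sketched in a few lines. This is a toy illustration of the setup, not the study's code; the board, ship positions, and the naive random questioner are all invented for the example.

```python
import random

SIZE = 6
ships = {(1, 2), (1, 3), (4, 4), (4, 5)}  # hidden ship cells (toy board)

def answerer(query):
    """Report whether the queried cell contains a ship segment."""
    return query in ships

def questioner(history):
    """Naive baseline strategy: query a random cell not yet asked about."""
    asked = {q for q, _ in history}
    options = [(r, c) for r in range(SIZE) for c in range(SIZE)
               if (r, c) not in asked]
    return random.choice(options)

# Play until every ship cell has been found; rounds used is the score.
history, found = [], set()
while found != ships:
    q = questioner(history)
    hit = answerer(q)
    history.append((q, hit))
    if hit:
        found.add(q)
print(f"Sank all ships in {len(history)} rounds")
```

A smarter questioner replaces the random strategy with the information-seeking strategies the study evaluates; the scoring stays the same.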

This setup allowed the team to rigorously compare the decision-making skills of large language models (LLMs) against human players. The study, presented at the International Conference on Learning Representations (ICLR), pitted AI models against a control group of 42 human participants.

Initially, the results highlighted a gap in efficiency:
* Humans consistently won in fewer moves than Llama-4-Scout, Meta’s efficiency-focused AI model.
* GPT-5, OpenAI’s premier reasoning model, outperformed both the humans and Llama-4-Scout in raw performance.

However, raw power was not the only metric. The researchers sought a way to optimize cost-effectiveness, aiming for a model that could compete with top-tier reasoning engines at a fraction of the computational expense.

Optimizing for Information Gain

To bridge the gap, the team applied principles from Bayesian experimental design. This statistical framework updates the probability of competing hypotheses as evidence arrives, allowing researchers to choose the experiment expected to yield the greatest information gain.

The scientists optimized their models to:
1. Ask questions that maximize the probability of hitting targets.
2. Maximize the amount of new information gained per question.
3. Look ahead multiple turns to anticipate future outcomes.
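The second objective can be made concrete. Under a uniform prior over candidate ship configurations, a hit/miss query's expected information gain is simply the entropy of its answer, so a greedy questioner picks the cell whose answer is least predictable. This is an illustrative sketch with a made-up hypothesis set, not the authors' implementation:

```python
import math
from itertools import product

# Toy hypothesis space: each hypothesis is a set of occupied cells,
# all equally likely a priori.
hypotheses = [frozenset(h) for h in [
    {(0, 0), (0, 1)}, {(0, 1), (0, 2)}, {(1, 0), (1, 1)}, {(2, 2), (2, 3)},
]]

def expected_information_gain(cell, hyps):
    """Entropy of the hit/miss answer under a uniform prior.
    Because the answer is deterministic given the true hypothesis,
    this equals the expected reduction in uncertainty about it."""
    p_hit = sum(cell in h for h in hyps) / len(hyps)
    if p_hit in (0.0, 1.0):
        return 0.0  # answer already certain: querying gains nothing
    return -(p_hit * math.log2(p_hit) + (1 - p_hit) * math.log2(1 - p_hit))

cells = set(product(range(3), range(4)))
best = max(cells, key=lambda c: expected_information_gain(c, hypotheses))
print(best)  # (0, 1): a hit there splits the hypotheses exactly in half
```

The third objective extends this greedy rule by simulating several query/answer rounds ahead before committing to a question.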

A critical breakthrough came in the method of communication. The researchers found that accuracy and efficiency soared when AI players communicated using snippets of code rather than natural language. Code provided a precise, unambiguous structure for logic that natural language often lacks in complex reasoning tasks.
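To see why code helps, compare the vague question "are any ships near the top?" with an executable predicate the answerer can evaluate mechanically. The function and board below are hypothetical, invented purely to illustrate the idea:

```python
# A code-based query: "near the top" is replaced by a precise,
# checkable condition over the board state.
def query(board):
    """Is at least one ship segment in the top two rows?"""
    return any(cell == "ship" for row in board[:2] for cell in row)

board = [
    ["ship", "water", "water"],
    ["water", "water", "water"],
    ["water", "ship", "ship"],
]
print(query(board))  # → True
```

The answer is fully determined by the code, leaving no room for the ambiguity that natural-language phrasings like "near" introduce.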

The Result: Efficiency Over Raw Power

These optimizations dramatically improved the performance of the smaller model. The refined Llama-4-Scout:
* Won in fewer moves than GPT-5 two-thirds of the time.
* Achieved these results at approximately one-hundredth of the cost.
* Won in an average of seven fewer moves than the human players.

This outcome demonstrates that a smaller, well-tuned model can outperform a larger, more expensive one if it employs superior strategic reasoning and efficient communication protocols.

From Board Games to Lab Work

While Battleship is a simplified environment compared to the messy reality of chemistry or biology, the underlying logic remains relevant. Scientific samples do not always provide clear-cut “hits” or “misses,” but the need to navigate a vast “hypothesis space” is universal.

Yuanqi Du, a researcher focused on AI for chemistry who was not involved in the study, emphasizes the broader implications: “The framework will be very useful to measure whether language models are really making progress in deciding which hypotheses to pursue among all possibilities.”

Conclusion

By treating scientific inquiry as a strategic game of information maximization, this study provides a scalable method for evaluating and improving AI’s decision-making capabilities. As AI moves from simple puzzles to complex laboratory tasks, the ability to ask the right questions efficiently will be just as critical as the ability to answer them.