Thompson Sampling via Fine-Tuning of LLMs for Bayesian Optimization
Scalable Bayesian Optimization in Complex Discrete Spaces with Thompson Sampling Via Fine-Tuning (ToSFiT) of LLMs
Thompson Sampling via Fine-Tuning (ToSFiT) is a new optimization algorithm developed by researchers at ETH Zürich and IBM Research Zurich. By using large language models (LLMs), the approach tackles the problem of exploring large, complex discrete search spaces, where gradient-based approaches fail. ToSFiT is a scalable Bayesian optimization (BO) method that avoids the computationally expensive step of maximizing an acquisition function.
By incrementally adapting an LLM so that it reflects what has been learned about the search space, the technique comes with strong theoretical performance guarantees and improves real-world efficiency. The team behind this work includes Nicolas Menet, Aleksandar Terzić, and Andreas Krause from ETH Zürich and Abbas Rahimi from IBM Research Zurich.
Overcoming Discrete Domain Optimization Challenges
Bayesian optimization is essential for automated discovery and large-scale experimental design when evaluating the reward function is costly or time-consuming. BO maintains a posterior distribution over the unknown reward function and uses this statistical model to guide the search toward promising configurations. The traditional way of selecting new candidates is to maximize an acquisition function that balances exploitation (improving on current solutions) and exploration (trying new possibilities).
Thompson sampling (TS) stands out among acquisition strategies because of its strong empirical performance and state-of-the-art convergence guarantees. TS draws a realization of the reward function from the posterior, treats that realization as the acquisition function, and selects the point that maximizes it.
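The selection rule described above can be sketched in a few lines. This is a minimal illustration over a tiny candidate set, assuming a Gaussian posterior with made-up means and variances; the function name thompson_step and all numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def thompson_step(mean, cov, rng):
    """Draw one reward realization from a Gaussian posterior and return its argmax."""
    sample = rng.multivariate_normal(mean, cov)  # one plausible reward function
    return int(np.argmax(sample))                # the candidate that maximizes it

rng = np.random.default_rng(0)

# Hypothetical posterior over three candidates (illustrative numbers only).
mean = np.array([0.1, 0.5, 0.4])
cov = np.diag([0.04, 0.01, 0.25])

# Repeating the step shows how often each candidate gets selected.
picks = [thompson_step(mean, cov, rng) for _ in range(2000)]
freq = np.bincount(picks, minlength=3) / len(picks)
print(freq)
```

The selection frequencies estimate, for each candidate, the posterior probability that it is the maximizer. The third candidate is picked fairly often despite its lower mean because its uncertainty is large, which is how Thompson sampling explores.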
However, the lack of gradients in large unstructured discrete domains, such as amino acid sequences or quantum circuit code, makes effective search difficult. A protein search space with 20 amino acids and a maximum sequence length of 100 contains more candidates than there are atoms in the universe, making an exhaustive search impossible. Without gradients, maximizing an acquisition function over such a combinatorial space is intractable, since it can require iterating over every point.
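The size claim is easy to check with exact integer arithmetic. The sketch below counts all sequences of length 1 to 100 over a 20-letter alphabet and compares the result with 10^80, a commonly quoted rough estimate for the number of atoms in the observable universe (that estimate is an assumption added here, not a figure from the article).

```python
# Count amino acid sequences of length 1..100 over a 20-letter alphabet.
ALPHABET_SIZE = 20
MAX_LENGTH = 100

n_sequences = sum(ALPHABET_SIZE**k for k in range(1, MAX_LENGTH + 1))

# Rough, commonly quoted order of magnitude (assumption for illustration).
atoms_in_universe = 10**80

print(f"sequences: about 10^{len(str(n_sequences)) - 1}")
print("exceeds atom estimate:", n_sequences > atoms_in_universe)
```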
LLMs as Generative Optimizers
The researchers built ToSFiT to scale BO to complex, high-dimensional discrete spaces. Instead of maximizing an acquisition function, ToSFiT uses a generative LLM to directly parameterize the probability of maximality (PoM), that is, the probability that a candidate solution is optimal. Candidates generated by the model are treated as Thompson samples, which avoids costly acquisition function maximization.
ToSFiT builds on the Variational Bayesian Optimistic Sampling (VBOS) framework. Importantly, ToSFiT optimizes a prompt-conditioned, pre-trained language model. This builds on the knowledge the model already has and speeds up learning. During online fine-tuning, the VBOS objective is used to carefully adapt the model parameters toward the posterior PoM.
To compute the reward posterior in closed form and condition on observations efficiently, the researchers perform scalable Gaussian process (GP) inference with linear kernels over learned features. As a result, memory and computational complexity scale as Θ(dim(H)²), where H denotes the feature space, rather than with the number of past observations.
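To see why the cost depends on the feature dimension and not on the number of observations, consider the weight-space view of a GP with a linear kernel, which is Bayesian linear regression. The sketch below is a generic textbook version and not the authors' implementation; the class name LinearKernelGP, the prior and noise variances, and the random features standing in for learned embeddings are all assumptions for illustration.

```python
import numpy as np

class LinearKernelGP:
    """GP with kernel k(x, x') = phi(x) . phi(x'), stored in weight space.

    Only a d-vector and a d x d matrix are kept, so memory is quadratic in
    the feature dimension d and independent of the number of observations.
    """

    def __init__(self, dim, prior_var=1.0, noise_var=0.1):
        self.noise_var = noise_var
        self.mean = np.zeros(dim)            # posterior mean of the weights
        self.cov = prior_var * np.eye(dim)   # posterior covariance of the weights

    def update(self, phi, y):
        """Condition on one observation (phi, y) with a rank-one update, O(d^2)."""
        cov_phi = self.cov @ phi
        gain = cov_phi / (self.noise_var + phi @ cov_phi)
        self.mean = self.mean + gain * (y - phi @ self.mean)
        self.cov = self.cov - np.outer(gain, cov_phi)

    def predict(self, phi):
        """Posterior mean and variance of the reward at features phi."""
        return phi @ self.mean, phi @ self.cov @ phi

# Synthetic data: random features stand in for learned embeddings.
rng = np.random.default_rng(0)
d = 4
true_w = rng.normal(size=d)
feats = rng.normal(size=(50, d))
ys = feats @ true_w + rng.normal(scale=np.sqrt(0.1), size=50)

gp = LinearKernelGP(d)
for phi, y in zip(feats, ys):
    gp.update(phi, y)

print(gp.predict(feats[0]))
```

After 50 updates the stored state is still one vector of length d and one d × d matrix, and the result matches the batch closed-form posterior.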
For LLM fine-tuning, the researchers draw on reinforcement learning techniques: gradient estimation is stabilized with the REINFORCE Leave-One-Out (RLOO) baseline. Group Relative Policy Optimization (GRPO) uses the same advantage function as RLOO, up to scaling.
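The leave-one-out baseline is simple to compute: each sample's reward is compared with the average reward of the other samples in the same batch. The sketch below is a generic illustration with hypothetical function names and made-up rewards. It also shows that the RLOO advantage equals the mean-centered reward multiplied by k/(k-1); GRPO as commonly described additionally divides by the group's standard deviation, which is a further rescaling.

```python
import numpy as np

def rloo_advantages(rewards):
    """Reward minus the mean reward of the other k-1 samples in the batch."""
    rewards = np.asarray(rewards, dtype=float)
    k = rewards.size
    if k < 2:
        raise ValueError("RLOO needs at least two samples per batch")
    baseline = (rewards.sum() - rewards) / (k - 1)  # leave-one-out mean
    return rewards - baseline

def mean_centered(rewards):
    """Reward minus the mean over the whole group (GRPO-style centering)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

rewards = [1.0, 2.0, 3.0, 6.0]   # illustrative rewards for four generations
adv = rloo_advantages(rewards)
centered = mean_centered(rewards)
k = len(rewards)

print(adv)
print(k / (k - 1) * centered)    # same values as adv
```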
Theoretical Guarantees and Policy Initialization
This work provides strong theoretical support for ToSFiT. The researchers derived a new regret bound for the variational formulation of Thompson sampling, showing that cumulative regret scales with the maximal information gain (γT) rather than with the size of the search space (|X|). This greatly improves on the previous bound for exact VBOS, which scaled as Õ(√(T|X|)) and is vacuous in large domains. For a linear kernel in d dimensions, γT grows only as O(d log T), so the new bound stays meaningful even when the search space is enormous.
The theoretical analysis stresses the importance of careful adaptation. The approximation error between the VBOS maximizer (πt) and the policy actually used for sampling (π̃t) can dominate the cumulative regret. Initializing ToSFiT from a pre-trained, prompt-conditioned model ensures that the policy starts in a favorable region of the probability simplex. According to the empirical studies, a strong initial policy yields much better performance, and careful adaptation (using low learning rates) is needed to preserve existing knowledge and prevent performance from stagnating.
Validating Multiple Tasks
Empirical validation on three diverse search tasks demonstrated ToSFiT's sample efficiency and low computational overhead.
FAQ Response Refinement: This natural language task refines responses so that they are semantically aligned with an unknown ground-truth response, using a Qwen3-1.7B model.
Thermally Stable Protein Search: This task optimizes amino acid sequences for thermal stability, a key property in drug development. Sequences are sampled from ProtGPT2, and the search space grows exponentially with sequence length.
Quantum Circuit Design: A Qwen2.5-Coder-1.5B model navigates a large discrete space of valid quantum programs to generate Qiskit circuits that prepare low-energy quantum states in unknown environments.
Because Unguided Generation does not use feedback, it quickly plateaus at a poor reward level in all experiments. Post-Generation TS, a conventional BO approach that runs over a fixed, pre-generated subset of candidates, finds good solutions quickly but saturates early. In contrast, ToSFiT performs Bayesian optimization over the entire solution space and identifies candidates with higher rewards. It also explored more efficiently than Actor Critic and Soft Actor Critic by being optimistic in the face of uncertainty.
Thompson sampling naturally generates many candidates, which makes it well suited to batched optimization. With ToSFiT, batching considerably improves iteration efficiency, reaching target performance in fewer rounds, at the cost of slightly lower sample efficiency. This matters when observations are delayed or take a long time to obtain.
The results show that combining principled Bayesian optimization with strong foundation models can handle difficult discrete search problems. Future work may explore jointly learned task-adaptive embeddings or Bayesian neural networks, or restrict updates to a portion of the generative model to reduce computational cost.