— Shreyas Sharma
TL;DR: A carefully chosen in-context example during RL can double reward, enable inference-time scaling, and change the distribution of cognitive behaviors in an LLM.
Code to reproduce results: https://github.com/shreyassharma1/prompt-priming-blog
While working on a project analyzing cognitive patterns in reasoning traces, I hit a configuration that produced no chain-of-thought reasoning or trace lengthening even after 256 steps of GRPO. This configuration trained Llama-3.1 8B Instruct on Countdown, a reasoning task from ReasoningGym (Stojanovski et al., 2025). In Countdown, given a list of numbers and a target value, you must find an arithmetic expression that evaluates to the target, using each number at most once. What makes Countdown a good testbed for reasoning is that it benefits from search: trying candidate expressions, verifying them, and backtracking when they fail. In principle, RL should teach a model to allocate more inference-time compute to this kind of exploration. That was the hope, at least.
Here’s an example Countdown task from ReasoningGym’s implementation:
Question: Calculate 139 using all of these numbers: 36, 29, 95, 32, 4, 15. Each number may be used at most once.
Answer: 15 - 4 + 95 + 36 - 32 + 29
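Scoring an answer only requires checking number usage and evaluating the expression. A minimal verifier along those lines (my own illustrative sketch, not ReasoningGym's actual implementation):

```python
import re

def verify_countdown(expression: str, numbers: list[int], target: int) -> bool:
    """Check that an arithmetic expression uses each allowed number at most
    once and evaluates to the target. Illustrative sketch only."""
    # Reject anything but digits, whitespace, and + - * / ( ).
    if not re.fullmatch(r"[\d\s+\-*/()]+", expression):
        return False
    # Every integer literal must come from the allowed pool, at most once each.
    pool = list(numbers)
    for tok in re.findall(r"\d+", expression):
        n = int(tok)
        if n not in pool:
            return False
        pool.remove(n)
    try:
        value = eval(expression)  # safe here: characters whitelisted above
    except (SyntaxError, ZeroDivisionError):
        return False
    return abs(value - target) < 1e-9
```

On the example above, `verify_countdown("15 - 4 + 95 + 36 - 32 + 29", [36, 29, 95, 32, 4, 15], 139)` returns `True`.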
Before describing what went wrong and the fix, here's the training configuration:
| Setting | Value |
|---|---|
| Model | meta-llama/Llama-3.1-8B-Instruct with LoRA (rank 32) |
| Algorithm | GRPO |
| Group size | 8 rollouts per problem, 64 problems per batch |
| Training set | 16,384 problems (256 GRPO steps at 64 problems/step) |
| Learning rate | 1e-5 |
| Max generation length | 1,024 tokens |
| Temperature | 1.0 |
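For readers less familiar with GRPO: its defining step is the group-relative advantage. With 8 rollouts per problem, each rollout's reward is normalized against its own group's mean and standard deviation. A sketch of the standard formulation (Tinker's internals may differ):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: subtract each group's mean reward and
    divide by its std. `rewards` has shape (num_problems, group_size),
    e.g. (64, 8) for the config above. Standard GRPO sketch."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + 1e-8)

# One GRPO step in this setup: 64 problems x 8 rollouts each.
adv = grpo_advantages(np.random.rand(64, 8))
```

Because advantages are relative within a group, a problem where all 8 rollouts fail (or all succeed) contributes no learning signal, which matters for the plateau described below.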
The system prompt I used to encourage chain-of-thought and inference-time scaling was:
"Reason through each problem step by step before giving the final answer. State the final answer clearly at the end."
I used Tinker (Thinking Machines Lab, 2025) for all my training and inference.
In the zero-shot configuration (system prompt plus task, no examples) the model quickly converged to short, formulaic outputs. Mean trace length stayed under two sentences throughout training, and reward plateaued around 0.4. The model was essentially guessing single expressions without any search process.
I tried several variations of "think step by step" style instructions in the system prompt. None of them changed the outcome in a meaningful way. The model remained stuck in the same low-effort regime.
Finally, I tried inserting an example question, reasoning trace, and answer into the prompt. The reasoning trace was short and simple: six sentences that try one solution, check it, realize it is wrong, try an alternative, and confirm it works. Here’s the exact example I added to the prompt:
I need to make 150 from 25, 3, and 75. Let me try: 75 + 25 = 100. That leaves 3. 100 * 3 = 300, too big. What about 25 * 3 = 75, then 75 + 75 = 150. That works! 25 * 3 + 75.
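Concretely, priming just means prepending the example as a user/assistant exchange ahead of the real question. A sketch of how the two prompt variants can be assembled (the exact chat-template layout and the example question's wording are assumptions; only the trace above is verbatim):

```python
SYSTEM_PROMPT = (
    "Reason through each problem step by step before giving the final "
    "answer. State the final answer clearly at the end."
)

# Question wording is an assumption, mirroring ReasoningGym's task format.
EXAMPLE_QUESTION = (
    "Calculate 150 using all of these numbers: 25, 3, 75. "
    "Each number may be used at most once."
)
# Verbatim trace from the post.
EXAMPLE_TRACE = (
    "I need to make 150 from 25, 3, and 75. Let me try: 75 + 25 = 100. "
    "That leaves 3. 100 * 3 = 300, too big. What about 25 * 3 = 75, "
    "then 75 + 75 = 150. That works! 25 * 3 + 75."
)

def build_messages(question: str, primed: bool) -> list[dict]:
    """Assemble the chat prompt with or without the in-context example."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if primed:
        messages.append({"role": "user", "content": EXAMPLE_QUESTION})
        messages.append({"role": "assistant", "content": EXAMPLE_TRACE})
    messages.append({"role": "user", "content": question})
    return messages
```

The zero-shot configuration is `primed=False`; the run described next used `primed=True`.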
Surprisingly, this worked quite well! Mean trace length went from ~13 sentences at step 0 to ~22 sentences at step 64, and settled at ~15.5 sentences by the final step, 256. Interestingly, all of these counts exceed the example's six sentences, and length initially grows as more RL is applied before settling back. This was enough to break the 0.4 reward plateau: the primed run reached a final reward of ~0.8 after the same number of training steps.
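Sentence counts like these can be approximated with a simple punctuation-split heuristic (a sketch, not necessarily the exact counter used for the numbers above):

```python
import re

def sentence_count(trace: str) -> int:
    """Rough sentence count: split on runs of terminal punctuation and
    count non-empty fragments. A crude heuristic for tracking trace length."""
    return len([s for s in re.split(r"[.!?]+", trace) if s.strip()])
```

Note that this heuristic over-counts traces full of arithmetic (periods and numbers interleave), so any such counter should be sanity-checked against a few traces by hand.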