Thinking models are LLMs optimized for reasoning and problem-solving. They have built-in Chain-of-Thought capabilities, making them more effective at complex tasks. Key models include:
DeepSeek R1
OpenAI o1/o3
Gemini 2.0 Flash Thinking
LLaMA 3.1
These models handle reasoning internally, requiring simpler prompts and less
explicit guidance to get optimal results.
1. Keep Prompts Concise and Direct
Thinking models work best when given concise, direct, and structured prompts. Too much information can actually reduce accuracy. The best approach is to state the problem clearly and let the model figure out the steps.
Good Example:
What are the main differences between classical and operant conditioning?
Poor Example:
In psychology, there are different learning theories. Classical conditioning was discovered by Pavlov, while operant conditioning was developed by Skinner. Could you please explain the difference between classical conditioning and operant conditioning? Make sure to include an example for each.
Fewer instructions allow the model to engage its reasoning process
naturally.
2. Allow Extra Reasoning Time for Complex Problems
More complex problems benefit from additional reasoning time. Thinking models use reasoning tokens, which allow them to internally process a problem before outputting an answer.
By prompting the model to take its time, you can improve the quality of the response. However, this also increases token usage, impacting cost.
Good Example:
Analyze the economic impact of renewable energy adoption over the next 20 years. Consider factors such as job creation, energy prices, and carbon reduction. Take your time and think through each aspect carefully.
Poor Example:
How does renewable energy impact the economy? Answer quickly.
Encouraging longer reasoning helps for multi-step problems, improving
accuracy significantly.
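A quick sketch of how the cost side shows up in practice: the snippet below sends the "take your time" prompt and reads back the reasoning-token count from the response's usage details. This assumes the OpenAI SDK, where reasoning models report a completion_tokens_details.reasoning_tokens field; other providers may expose this count differently or not at all, and "o1" is again just a placeholder model name.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

prompt = (
    "Analyze the economic impact of renewable energy adoption over the next 20 years. "
    "Consider factors such as job creation, energy prices, and carbon reduction. "
    "Take your time and think through each aspect carefully."
)

response = client.chat.completions.create(
    model="o1",  # placeholder reasoning model
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)

# Reasoning tokens are billed like output tokens, so longer thinking costs more.
details = getattr(response.usage, "completion_tokens_details", None)
if details is not None and details.reasoning_tokens is not None:
    print("Reasoning tokens used:", details.reasoning_tokens)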
3. Avoid Few-Shot and Chain-of-Thought Prompting
Traditional few-shot prompting (where you give examples) and Chain-of-Thought prompting strategies reduce performance for thinking models.
According to research, thinking models performed worse when given few-shot examples. This contrasts with older models, where few-shot learning improved results. Thinking models are already designed to break down problems internally, so explicit step-by-step guidance can interfere with their reasoning.
Good Example:
What is the capital of Canada?
Poor Example:
Example 1:
Q: What is the capital of France?
A: Paris

Example 2:
Q: What is the capital of Japan?
A: Tokyo

Now answer this: What is the capital of Canada?
For thinking models, zero-shot prompts worked better than few-shot
prompts.
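If you want to verify this on your own tasks, a simple side-by-side comparison is enough. The sketch below runs the zero-shot and few-shot versions of the same question through the same (placeholder) reasoning model so you can compare the answers; it assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.

from openai import OpenAI

client = OpenAI()

zero_shot = "What is the capital of Canada?"

few_shot = (
    "Example 1:\nQ: What is the capital of France?\nA: Paris\n\n"
    "Example 2:\nQ: What is the capital of Japan?\nA: Tokyo\n\n"
    "Now answer this: What is the capital of Canada?"
)

for label, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    response = client.chat.completions.create(
        model="o1",  # placeholder reasoning model
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)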
4. Use Thinking Models for Complex Multi-Step Tasks
Thinking models perform best on tasks that require five or more steps.
When solving problems with 3-5 steps, thinking models offered a slight improvement over standard models. For simpler tasks (fewer than 3 steps), performance may actually degrade compared to traditional LLMs, because they "overthink." If a task is highly structured or simple, a regular LLM like GPT-4 may be a better choice.
Good Example:
Break down the process of solving a complex physics problem involving momentum conservation. Explain each step clearly and logically.
Poor Example:
What is 2+2?
To check how many steps a problem requires, run it through the web version of a reasoning model and count the reasoning steps it displays.
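One way to act on this advice in code is a small router that sends short, simple tasks to a standard model and multi-step problems to a thinking model. The sketch below is illustrative only: the estimate_steps heuristic, the step threshold, and the model names ("gpt-4o-mini", "gpt-4o", "o1") are assumptions, not part of the article.

from openai import OpenAI

client = OpenAI()

def estimate_steps(task: str) -> int:
    """Crude heuristic: ask a cheap standard model how many reasoning steps
    the task needs. You can also inspect the task manually, as suggested above."""
    probe = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder standard model
        messages=[{
            "role": "user",
            "content": "How many distinct reasoning steps does this task need? "
                       f"Reply with a single integer.\n\nTask: {task}",
        }],
    )
    try:
        return int((probe.choices[0].message.content or "").strip())
    except ValueError:
        return 1

def solve(task: str) -> str:
    # Thinking models shine at roughly five or more steps; simpler tasks
    # go to a standard LLM to avoid "overthinking" and extra cost.
    model = "o1" if estimate_steps(task) >= 5 else "gpt-4o"  # placeholder names
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return response.choices[0].message.content

print(solve("What is 2+2?"))
print(solve("Break down the process of solving a complex physics problem "
            "involving momentum conservation. Explain each step clearly."))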
5. Structure Prompts Carefully When You Need Structured Output
For regular LLMs, developers typically use delimiters like triple quotation marks, XML tags, or section titles to clearly define distinct sections of the input. This makes it easier for the model to interpret the information correctly.
Thinking models, however, struggle with structured outputs but can be guided to maintain consistency. If you need a structured response (e.g., JSON, tables, fixed formats), structure your prompt carefully.
Good Example:
[Task: Summarize the following text]
Text: The mitochondrion is the powerhouse of the cell. It produces ATP, the energy currency of the cell, through cellular respiration.
Poor Example:
Summarize this: The mitochondrion is the powerhouse of the cell. It produces ATP, the energy currency of the cell, through cellular respiration.
If structured output is critical, you’re better off using a standard LLM
instead of a thinking model.
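If you do need structured output from a thinking model, label each section of the prompt and spell out the exact format, then validate the reply before using it. The sketch below assumes the OpenAI Python SDK; the JSON schema ("summary", "key_terms") and the model name are illustrative assumptions.

import json
from openai import OpenAI

client = OpenAI()

text = ("The mitochondrion is the powerhouse of the cell. It produces ATP, "
        "the energy currency of the cell, through cellular respiration.")

# Label each section of the input and spell out the exact output format.
prompt = (
    "[Task: Summarize the following text]\n"
    f"[Text]\n{text}\n[/Text]\n"
    '[Output format: JSON with keys "summary" (string) and "key_terms" '
    "(list of strings). Return only the JSON.]"
)

response = client.chat.completions.create(
    model="o1",  # placeholder reasoning model
    messages=[{"role": "user", "content": prompt}],
)

raw = response.choices[0].message.content or ""
try:
    data = json.loads(raw)
    print(data["summary"], data["key_terms"])
except (json.JSONDecodeError, KeyError):
    # Thinking models drift on format more often than standard LLMs,
    # so always validate and be ready to retry or fall back.
    print("Model did not return the expected JSON:\n", raw)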
6. Use Ensembling for High-Stakes Problems
For high-stakes or complex problems, ensembling improves performance.
Ensembling involves running multiple prompts (either the same prompt multiple times or variations of the prompt) and aggregating the results. This approach increases accuracy but raises costs because multiple queries are required.
Example of Ensembling:
# Prompt 1:
What are the primary causes of climate change? Provide a well-reasoned answer.

# Prompt 2:
Explain the major contributors to climate change, focusing on human activities and natural factors.

# Prompt 3:
Explain what causes climate change
<Context>
# [Response 1 + Response 2]
</Context>
While ensembling boosts performance, it’s expensive and should only be used
when high accuracy is critical.
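A minimal sketch of this pattern: run two phrasings of the question, then feed both responses back as context for a final aggregation prompt, mirroring the example above. It assumes the OpenAI Python SDK, and the model name is a placeholder; note that every variant is a separate billed call.

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="o1",  # placeholder reasoning model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

# Variations of the same question (each call is billed separately).
variants = [
    "What are the primary causes of climate change? Provide a well-reasoned answer.",
    "Explain the major contributors to climate change, focusing on human "
    "activities and natural factors.",
]
answers = [ask(v) for v in variants]

# Aggregate: feed the earlier responses back as context for a final answer.
context = "\n\n".join(f"[Response {i + 1}]\n{a}" for i, a in enumerate(answers))
final = ask(
    "Explain what causes climate change.\n"
    f"<Context>\n{context}\n</Context>\n"
    "Combine the points above into a single, consistent answer."
)
print(final)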
Prompting thinking models requires a different mindset and approach compared to traditional LLMs. By following these guidelines, you can optimize your interactions with thinking models and get the best possible responses.