LLM Prompt Learning
Publish date: Jul 1, 2025
Tags: LLM Prompt
To improve the performance of LLMs on specific tasks, the prompt is the pivotal element we need to craft carefully. However, prompt engineering is never only about tweaking the prompt itself.
Framework
Understand the task
There are many ways to categorize the tasks we want an LLM to solve; the most common are by knowledge domain and by input and output data types. Before anything else, we should know what task we want to solve and which dataset we will use to test performance.
Type of Task
By knowledge domain, there are categories like math reasoning, science reasoning, code generation, intelligent agents, precise instruction following, and text understanding and generation.
By inputs and outputs, we can categorize tasks into text-to-text, text-to-audio, image-to-text, etc. To differentiate further, we should also consider whether the task is supervised (classification, regression), unsupervised (clustering, retrieval, dimensionality reduction), or generative.
The type of task determines how we define the metrics that evaluate the LLM's performance, and how we design the prompts that optimize it.
Metrics
Designing proper metrics is the key to launching a successful product for users or clients. For classification tasks, accuracy, precision, recall, and F1 score are the usual choices. For generation tasks, BLEU and ROUGE scores are commonly used. Customized rule-based scoring or manual annotation is also widely used to evaluate performance on certain tasks.
When a single metric cannot capture all aspects of a task, we can combine multiple metrics to evaluate overall performance.
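As a minimal sketch of how these metrics might be wired up for a classification task (assuming scikit-learn is installed; the labels below are made up for illustration):

```python
# Minimal metric computation for a classification task (scikit-learn assumed installed).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold labels and LLM predictions after parsing the model outputs.
y_true = ["positive", "negative", "positive", "neutral"]
y_pred = ["positive", "negative", "neutral", "neutral"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For generation tasks, packages such as rouge-score and sacrebleu provide ROUGE and BLEU in a similar plug-in fashion.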
Dataset of the task
Several stages might require additional datasets of our own. In most circumstances they are optional, because relevant knowledge can be retrieved by the LLM from an external knowledge base through a RAG system, or supplied directly in the prompt if it is not too complex.
First is data collection for fine-tuning. It is needed when the LLM must memorize or generalize particular knowledge of the relevant fields. This post-training process can fix almost anything we are not satisfied with in the outputs. For example, we could train the model to learn the preferred wording and phrasing of a specific area for a specific target audience.
Second is data collection for evaluation. If we don't have a test dataset, it usually means there is no standard way to evaluate performance. In this case, we can use human evaluators, or better yet, a simulated environment that reproduces the test scenario.
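When no labelled test set exists, a lightweight rule-based check is often a practical starting point before investing in human evaluation. Below is a hypothetical sketch; the rules and the llm_answer input are placeholders, not part of any real pipeline:

```python
# Hypothetical rule-based evaluation when no gold test set is available.
# Each rule returns True/False; the score is the fraction of rules satisfied.
import json

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def evaluate_answer(llm_answer: str) -> float:
    rules = [
        lambda a: len(a) > 0,                # non-empty response
        lambda a: "sorry" not in a.lower(),  # no refusal boilerplate
        lambda a: is_valid_json(a),          # output must be parseable JSON
    ]
    return sum(rule(llm_answer) for rule in rules) / len(rules)

print(evaluate_answer('{"answer": 42}'))  # 1.0 under these toy rules
```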
LLM and prompt
Screen available LLMs
Given the type of task we need to tackle, we should first screen the available LLMs and pick the most suitable one for that domain. In most cases, leaderboards are the best reference for comparing the performance of different LLMs on a certain type of task.
Some resources to start with:
- Hugging Face: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
- SuperCLUE: https://www.superclueai.com/
Craft prompts for candidate LLMs
I firmly believe that those who invented a tool are the ones who truly understand how to wield it masterfully. Just like in programming, the official documentation is the ultimate guideline for users to follow. The guidance may be specific to a model series or situation, so writing or generating prompts for different models will be necessary.
The following links are well documented and have easy-to-follow examples demonstrating how to use each LLM effectively.
A few thoughts: after several version iterations, the documentation usually provides recommendations for migrating to newer models. With more powerful models to be published in the future, deploying LLMs will become quite similar to deploying SDKs. I suppose this means that prompt engineering skills will be as important as software engineering skills are today.
Claude
Claude: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
General principles:
- Be explicit. Ask for specific features, using phrases like “go beyond the basics”, “fully-featured”, “as many as possible”.
- Add context. Explain why certain behavior (format or feature) is important.
- Make sure examples and details align with the instructions in the prompt (a short sketch of these principles follows this list).
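A minimal sketch of the first two principles applied through the Anthropic Python SDK. The model ID and the task are placeholders, not prescriptions; check the current docs for available models:

```python
# Sketch: an explicit prompt with context, sent via the Anthropic Python SDK.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

prompt = (
    "Summarize the attached incident report for an on-call engineer. "          # explicit task
    "Go beyond the basics: include root cause, impact, and as many "            # explicit phrases
    "follow-up actions as possible. Keep it under 200 words because it will "   # context: why the
    "be pasted into a paging alert with a strict size limit."                   # constraint matters
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```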
Specific situations:
Format of responses:
- Tell the model what to do instead of what not to do.
- Use XML format indicators.
- Match prompt style to the desired output. Reducing markdown in the prompt can reduce the volume of markdown in the output (see the sketch below).
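A small sketch of the XML-tag idea. The tag names are arbitrary conventions chosen for this example, and the response text is faked to show the parsing step:

```python
# Sketch: asking for XML-tagged output and extracting the section we need.
import re

prompt = (
    "Review the code snippet below. Put your analysis inside <analysis> tags "
    "and the corrected code inside <fixed_code> tags.\n\n"
    "<code>\ndef add(a, b): return a - b\n</code>"
)

# Suppose `response_text` holds the model's reply; parsing is then trivial.
response_text = (
    "<analysis>The function subtracts instead of adding.</analysis>\n"
    "<fixed_code>def add(a, b): return a + b</fixed_code>"
)

match = re.search(r"<fixed_code>(.*?)</fixed_code>", response_text, re.DOTALL)
if match:
    print(match.group(1).strip())
```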
Leverage thinking & interleaved thinking capabilities: thinking here means reflection, reasoning, and planning. Especially for prompts involving tool use, explicitly tell the model to think through the results of each step before determining the following actions.
Optimise parallel tool calling: simply using the example prompt below can boost the success rate of parallel tool calling to ~100%.
For maximum efficiency, whenever you need to perform multiple independent operations, invoke all relevant tools simultaneously rather than sequentially.
Reduce file creation in agentic coding:
Claude 4 models sometimes create temporary files as a scratchpad. Use the following instruction to have them cleaned up.
If you create any temporary new files, scripts, or helper files for iteration, clean up these files by removing them at the end of the task.
Enhance visual and frontend code generation:
Encourage the model with phrases like “Don’t hold back. Give it your all.”
Explicitly ask for things like “thoughtful details”, “as many relevant features and interactions as possible”, and “apply design principles: hierarchy, contrast, balance, and movement”.
Avoid focusing on passing tests and hard-coding:
Please write a high quality, general purpose solution. Implement a solution that works correctly for all valid inputs, not just the test cases. Do not hard-code values or create solutions that only work for specific test inputs. Instead, implement the actual logic that solves the problem generally.
Focus on understanding the problem requirements and implementing the correct algorithm. Tests are there to verify correctness, not to define the solution. Provide a principled implementation that follows best practices and software design principles.
If the task is unreasonable or infeasible, or if any of the tests are incorrect, please tell me. The solution should be robust, maintainable, and extendable.
Tips for extended thinking models
Extended thinking is a feature of some models. It is mostly achieved by applying reinforcement learning to the LLM training process, so that the model learns to use a reasoning process to analyse more complex problems.
It’s not possible to force tool use in extended thinking models.
Thinking blocks and tool results are part of a continuous reasoning flow, so preserving those blocks is necessary for multi-turn conversations.
Models with interleaved thinking are able to gather information from tool calls and then think further to generate the final answer, while models without this feature only think about which tools to use, not after fetching useful information from them.
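A rough sketch of what these tips can look like with the Anthropic Python SDK. The shape of the thinking parameter and the model ID follow my reading of the current docs and should be verified; the conversation content is a placeholder:

```python
# Rough sketch: enabling extended thinking and preserving thinking blocks
# across turns with the Anthropic SDK. Verify parameter names against the docs.
import anthropic

client = anthropic.Anthropic()
history = [{"role": "user", "content": "Plan the migration of a 2 TB Postgres database."}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",                     # placeholder model ID
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},  # assumed parameter shape
    messages=history,
)

# Append the assistant turn with ALL content blocks (thinking + text),
# so the continuous reasoning flow is preserved for the next turn.
history.append({"role": "assistant", "content": response.content})
history.append({"role": "user", "content": "Now add a rollback plan."})
```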
OpenAI
OpenAI: https://platform.openai.com/docs/guides/prompt-engineering
Deepseek
Deepseek: https://api-docs.deepseek.com/prompt-library
As stated in some of the documentation, using a prompt generator built on LLMs is also a great idea.
Performance
Evaluate on the test dataset
We have defined the core metrics earlier, so we can run the LLM on the test dataset and calculate the metrics.
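The loop itself can be as simple as the sketch below; call_llm and parse_label are hypothetical placeholders for whichever API client and output parser you use, and the resulting lists feed into the metric code shown earlier.

```python
# Sketch of a test-set evaluation loop. `call_llm` and `parse_label` are
# hypothetical placeholders for your API client and output parser.
def run_on_test_set(test_set, call_llm, parse_label):
    y_true, y_pred = [], []
    for example in test_set:          # each example: {"input": ..., "label": ...}
        raw_output = call_llm(example["input"])
        y_true.append(example["label"])
        y_pred.append(parse_label(raw_output))
    return y_true, y_pred             # feed into the metric code shown earlier
```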
Evaluate in the real world
Our test dataset is just a small sample of the whole population, and sometimes even a biased one, because the real-world environment is never constant. So we need to continuously evaluate our product in the real world and keep improving it based on our users' feedback.
Continuous Improvement
As business needs change over time, we need to keep monitoring the metrics and revisiting the process we built. There are typically several ways to further improve performance:
- Prompt optimisation: other prompt structures might trigger LLMs to “think” more deeply at minimal cost.
- Post-training techniques: when computing resources and datasets are sufficient.
- Redesign of the workflow: divide and conquer. The task might be broken down into several easier-to-solve subtasks, so that the strengths of different LLMs can be combined to achieve much better results (a toy sketch follows this list).
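As a toy illustration of the divide-and-conquer idea, the two-step split and the call_llm helper below are hypothetical, not a prescribed pipeline:

```python
# Toy sketch of workflow decomposition: split a report-writing task into
# extraction and drafting subtasks, which could even go to different models.
# `call_llm` is a hypothetical helper wrapping whichever API client you use.
def write_report(raw_notes: str, call_llm) -> str:
    # Subtask 1: a cheaper/faster model extracts structured facts.
    facts = call_llm(
        "Extract the key facts from these notes as a bullet list:\n" + raw_notes,
        model="small-model",            # placeholder model name
    )
    # Subtask 2: a stronger model drafts the report from the clean facts.
    report = call_llm(
        "Write a one-page report for executives based on these facts:\n" + facts,
        model="strong-model",           # placeholder model name
    )
    return report
```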