Tags:: #[[Research Paper]] #[[prompting [[Large Language Models (LLM)]]]] #[[reflection ([[Large Language Models (LLM)]])]]
Summary
Overview
SELF-REFINE is a method for improving outputs from large language models (LLMs) through iterative self-feedback and refinement. This approach uses the same LLM to generate an initial output, provide feedback, and refine it iteratively without the need for supervised training or additional data.
Key Findings
Performance Improvement: Evaluations using GPT-3.5 and GPT-4 across seven tasks show that SELF-REFINE improves performance by about 20%. Outputs are preferred by humans and score better on metrics.
Complex Task Handling: LLMs often struggle with complex tasks requiring intricate solutions. Traditional refinement methods need domain-specific data and supervision. SELF-REFINE mimics human iterative refinement, where an initial draft is revised based on self-feedback.
Iterative Process: The method alternates between two steps, FEEDBACK and REFINE, iterating until no further improvements are needed (a minimal sketch of this loop follows below).
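A minimal sketch of the loop, assuming a generic `llm(prompt)` completion function; the prompt wording, STOP signal, and iteration cap are placeholders, not the paper's actual few-shot prompts:
```python
def self_refine(task, llm, max_iters=4):
    """Minimal SELF-REFINE loop: generate an initial output, then alternate FEEDBACK and REFINE."""
    output = llm(f"Task: {task}\nWrite an initial answer.")
    history = []  # keep every (output, feedback) pair so the model does not repeat past mistakes
    for _ in range(max_iters):
        feedback = llm(
            f"Task: {task}\nCurrent answer: {output}\n"
            "Give specific, actionable feedback. If no changes are needed, reply STOP."
        )
        if "STOP" in feedback:  # stopping condition: the model judges the output good enough
            break
        history.append((output, feedback))
        past = "\n\n".join(f"Attempt:\n{o}\nFeedback:\n{f}" for o, f in history)
        output = llm(
            f"Task: {task}\nPrevious attempts and feedback:\n{past}\n"
            "Write an improved answer that addresses all of the feedback."
        )
    return output
```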
Specific Task Performance
Strong Performance:
Constrained Generation: Generating a sentence containing up to 30 given concepts. Iterative refinement allows correction of initial mistakes and better exploration of possible outputs.
Preference-based Tasks: Dialogue Response Generation, Sentiment Reversal, Acronym Generation. Significant gains due to improved alignment with human preferences.
Weaker Performance:
Math Reasoning: Difficulty in accurately identifying nuanced errors in reasoning chains.
Additional Insights
Avoiding Repetition: SELF-REFINE avoids repeating past mistakes by appending the entire history of previous outputs and feedback to the prompt in the REFINE step.
Role-based Feedback: A suggested improvement is to assign the feedback step specific roles or dimensions, such as performance, reliability, and readability.
Related Method: Provide the LLM with a scoring rubric listing the dimensions along which it should evaluate the output.
Specific Feedback Importance: Results are significantly better with specific feedback compared to generic feedback.
Iteration Impact: Results improve significantly with the number of iterations (i.e., feedback-refine loops), but with diminishing marginal gains from each additional loop. In some cases, such as Acronym Generation, quality could improve in one aspect while declining in another; the authors' solution was to have the feedback step generate numeric scores for the different quality aspects, leading to a more balanced evaluation (see the sketch after this list).
Model Size Impact: SELF-REFINE works across model sizes, but a small enough model (Vicuna-13B) fails to generate feedback consistently in the required format and often fails to refine even when given hard-coded feedback.
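A rough sketch of rubric-style feedback with per-aspect numeric scores; the aspect names, JSON format, and stopping threshold are illustrative assumptions, not taken from the paper:
```python
import json

ASPECTS = ["correctness", "fluency", "conciseness"]  # illustrative rubric dimensions

def score_feedback(task, output, llm):
    """Ask the model for a 0-10 score plus one sentence of feedback per rubric dimension."""
    prompt = (
        f"Task: {task}\nAnswer: {output}\n"
        f"Score the answer on {', '.join(ASPECTS)} from 0 to 10 and give one sentence "
        'of feedback per aspect. Respond as JSON: {"scores": {...}, "feedback": {...}}'
    )
    reply = json.loads(llm(prompt))
    return reply["scores"], reply["feedback"]

def should_stop(scores, threshold=8):
    """Stop refining only when every aspect clears the threshold, so improving one
    aspect at the expense of another does not go unnoticed."""
    return all(score >= threshold for score in scores.values())
```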
Relevant [[ChatGPT]] conversations: here, here, here
Evaluating Generative AI Applications: Challenges and Solutions
Challenges in Evaluation:
Evaluating custom AI applications that generate free-form text is a barrier to progress.
Evaluations of general-purpose models like LLMs use standardized tests (MMLU, HumanEval) and platforms (LMSYS Chatbot Arena, HELM).
Current evaluation tools face limitations such as data leakage and subjective human preferences.
Types of Applications:
Unambiguous Right-or-Wrong Responses:
Examples: Extracting job titles from resumes, routing customer emails.
Evaluation involves creating labeled test sets, which is costly but manageable.
Free-Text Output:
Examples: Summarizing customer emails, writing research articles.
Evaluation is challenging due to the variability of good responses.
Often relies on using a more advanced LLM as an evaluator, but the results can be noisy and the evaluations are expensive to run (a minimal LLM-as-judge sketch follows below).
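A minimal sketch of that LLM-as-judge pattern, using the OpenAI Python client; the model name, grading scale, and rubric wording are assumptions:
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_summary(source_email: str, summary: str, model: str = "gpt-4o") -> int:
    """Have a stronger LLM grade a free-text summary on a 1-5 scale."""
    prompt = (
        "You are grading an email summary.\n"
        f"Email:\n{source_email}\n\nSummary:\n{summary}\n\n"
        "Rate the summary from 1 (unusable) to 5 (excellent) for faithfulness and coverage. "
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce (but not eliminate) noise in the judgment
    )
    return int(response.choices[0].message.content.strip())
```
Even at temperature 0, repeated judgments can disagree, which is the noisiness mentioned above; averaging several runs reduces variance but multiplies cost.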
Cost and Time Considerations:
[[evals]] can significantly increase development costs.
Running [[evals]] is time-consuming, slowing down experimentation and iteration.
Future Outlook:
Optimistic about developing better evaluation techniques, possibly using agentic workflows such as [[reflection ([[Large Language Models (LLM)]])]].
Richer Context for RAG (Retrieval-Augmented Generation)
New Development:
Researchers at Stanford developed RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval). Link to paper here.
RAPTOR provides graduated levels of detail in text summaries, optimizing context within LLM input limits.
How RAPTOR Works:
Processes documents through recursive cycles of embedding, clustering, and summarizing, building a tree of summaries at increasing levels of abstraction.
Uses SBERT encoder for embedding, Gaussian mixture model (GMM) for clustering, and GPT-3.5-turbo for summarizing.
At query time, retrieves and ranks tree nodes by cosine similarity to the user prompt, filling the available input length (a rough sketch follows this list).
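A rough sketch of one build-and-retrieve cycle, using sentence-transformers and scikit-learn as stand-ins for the paper's SBERT encoder and GMM clustering; the encoder name, cluster count, summarizer callback, and the omission of soft/overlapping cluster assignment are simplifying assumptions:
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the paper's SBERT encoder

def build_level(chunks, summarize, n_clusters=4):
    """One embed -> cluster -> summarize pass; RAPTOR repeats this on the summaries to grow a tree."""
    embeddings = encoder.encode(chunks)
    gmm = GaussianMixture(n_components=min(n_clusters, len(chunks)))
    labels = gmm.fit_predict(embeddings)
    summaries = []
    for cluster in sorted(set(labels)):
        members = [chunks[i] for i, label in enumerate(labels) if label == cluster]
        summaries.append(summarize("\n".join(members)))  # e.g. a GPT-3.5-turbo call
    return summaries

def retrieve(query, nodes, top_k=5):
    """Rank all tree nodes (original chunks plus summaries) by cosine similarity to the query."""
    node_vecs = encoder.encode(nodes, normalize_embeddings=True)
    query_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = node_vecs @ query_vec  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:top_k]
    return [nodes[i] for i in top]
```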
Results:
RAPTOR outperformed other retrievers on the QASPER test set.
Importance:
Recent LLMs can process very long inputs, but doing so is costly and time-consuming.
RAPTOR enables models with tighter input limits to access more context efficiently.
Conclusion:
RAPTOR offers a promising solution for developers facing challenges with input context length.
This may be a relevant technique to reference if you get around to implementing [[Project: Hierarchical File System Summarization using [[Large Language Models (LLM)]]]]