Evaluating Generative AI Applications: Challenges and Solutions
Challenges in Evaluation:
Evaluating custom AI applications generating free-form text is a barrier to progress.
Evaluations of general-purpose models like LLMs use standardized tests (MMLU, HumanEval) and platforms (LMSYS Chatbot Arena, HELM).
Current evaluation tools face limitations such as data leakage (benchmark questions appearing in training data) and reliance on subjective human preferences.
Types of Applications:
Unambiguous Right-or-Wrong Responses:
Examples: Extracting job titles from resumes, routing customer emails.
Evaluation involves creating labeled test sets, which is costly but manageable.
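For this right-or-wrong case, the eval itself is just comparing outputs to labels. A minimal sketch, where `extract_job_title` is a hypothetical stand-in for the application under test:
```python
# Minimal sketch of scoring an extractor against a hand-labeled test set.
# `extract_job_title` is a hypothetical stand-in for the application under test.

def extract_job_title(resume_text: str) -> str:
    """Hypothetical LLM-backed extractor being evaluated."""
    raise NotImplementedError("plug in the application under test")

def accuracy(test_set: list[tuple[str, str]]) -> float:
    """Fraction of examples where the predicted title matches the hand label."""
    correct = sum(
        extract_job_title(text).strip().lower() == label.strip().lower()
        for text, label in test_set
    )
    return correct / len(test_set)

# Usage: test_set = [("...resume text...", "Senior Data Engineer"), ...]
# print(f"Accuracy: {accuracy(test_set):.1%}")
```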
Free-Text Output:
Examples: Summarizing customer emails, writing research articles.
Evaluation is challenging due to the variability of good responses.
Evaluation often relies on advanced LLMs as judges, but the results can be noisy and the extra model calls expensive.
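A hedged sketch of the LLM-as-judge approach, assuming `call_llm` is a hypothetical wrapper around whichever strong judge model is available; averaging several samples is one way to dampen the noise:
```python
# LLM-as-judge sketch for scoring free-text summaries on a 1-5 scale.
# `call_llm` is a hypothetical wrapper around the judge model's API.

JUDGE_PROMPT = """You are grading a summary of a customer email.

Email:
{email}

Summary:
{summary}

Score the summary from 1 (unusable) to 5 (excellent) for accuracy and completeness.
Reply with only the integer score."""

def call_llm(prompt: str) -> str:
    """Hypothetical call to an advanced judge model."""
    raise NotImplementedError("plug in your LLM client")

def judge_summary(email: str, summary: str, n_samples: int = 3) -> float:
    """Average several judge samples to reduce the noise of any single judgment."""
    scores = []
    for _ in range(n_samples):
        reply = call_llm(JUDGE_PROMPT.format(email=email, summary=summary))
        try:
            scores.append(int(reply.strip()))
        except ValueError:
            continue  # skip malformed judge replies
    return sum(scores) / len(scores) if scores else float("nan")
```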
Cost and Time Considerations:
[[evals]] can significantly increase development costs.
Running [[evals]] is time-consuming, slowing down experimentation and iteration.
Future Outlook:
Optimistic about developing better evaluation techniques, possibly using agentic workflows such as [[reflection ([[Large Language Models (LLM)]])]].
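One loose sketch of how a reflection loop could feed into evals (my extrapolation, not from the source beyond the idea of reflection): the judge drafts an assessment, then critiques and revises its own draft. `call_llm` is the same hypothetical wrapper as above.
```python
# Loose sketch of a reflection-style agentic loop applied to evaluation:
# draft an assessment, critique it, revise it.
# `call_llm` is the same hypothetical LLM wrapper as in the earlier sketch.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client")

def reflective_assessment(task: str, output: str, rounds: int = 2) -> str:
    """Draft an assessment of `output`, then repeatedly critique and revise it."""
    assessment = call_llm(
        f"Task: {task}\nOutput: {output}\n"
        "Assess the quality of the output in a few sentences."
    )
    for _ in range(rounds):
        critique = call_llm(
            f"Critique this assessment for gaps or errors:\n{assessment}"
        )
        assessment = call_llm(
            "Revise the assessment using the critique.\n"
            f"Assessment:\n{assessment}\n\nCritique:\n{critique}"
        )
    return assessment
```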
Richer Context for RAG (Retrieval-Augmented Generation)
New Development:
Researchers at Stanford developed RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval).
RAPTOR provides summaries at graduated levels of detail, packing more useful context within an LLM's input limit.
How RAPTOR Works:
Processes documents through repeated cycles of embedding, clustering, and summarizing.
Uses an SBERT encoder for embedding, a Gaussian mixture model (GMM) for clustering, and GPT-3.5-turbo for summarizing each cluster.
Retrieves and ranks excerpts by cosine similarity to the user's prompt, selecting as much relevant text as fits within the input length.
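A rough sketch of the build-and-retrieve idea, not the paper's implementation: it assumes sentence-transformers and scikit-learn are installed, uses a small SBERT model as a stand-in encoder, and leaves the GPT-3.5-turbo summarization step as a hypothetical `summarize_cluster` function. Retrieval here ranks every node (leaf chunks and summaries alike) by cosine similarity, in the spirit of RAPTOR's collapsed-tree querying.
```python
# Sketch of RAPTOR-style recursive summarization and retrieval (illustrative only).
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the paper's SBERT encoder

def summarize_cluster(chunks: list[str]) -> str:
    """Hypothetical LLM call (e.g., GPT-3.5-turbo) that condenses a cluster into one summary."""
    raise NotImplementedError("plug in your LLM client")

def build_tree(chunks: list[str], n_levels: int = 3, clusters_per_level: int = 4) -> list[str]:
    """Recursively embed, cluster, and summarize; return all nodes (leaf chunks + summaries)."""
    nodes = list(chunks)
    level = chunks
    for _ in range(n_levels):
        if len(level) <= 1:
            break
        embeddings = encoder.encode(level)
        n_clusters = min(clusters_per_level, len(level))
        labels = GaussianMixture(n_components=n_clusters).fit_predict(embeddings)
        summaries = [
            summarize_cluster([t for t, lab in zip(level, labels) if lab == c])
            for c in range(n_clusters)
        ]
        nodes.extend(summaries)   # summaries join the pool of retrievable nodes
        level = summaries         # the next cycle clusters the summaries themselves
    return nodes

def retrieve(query: str, nodes: list[str], k: int = 5) -> list[str]:
    """Rank all tree nodes by cosine similarity to the query and return the top k."""
    query_vec = encoder.encode([query])
    node_vecs = encoder.encode(nodes)
    scores = cosine_similarity(query_vec, node_vecs)[0]
    ranked = sorted(zip(scores, nodes), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:k]]
```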
Results:
RAPTOR outperformed other retrievers on the QASPER test set.
Importance:
Recent LLMs can process very long inputs, but doing so is costly and time-consuming.
RAPTOR enables models with tighter input limits to access more context efficiently.
Conclusion:
RAPTOR offers a promising solution for developers facing challenges with input context length.
This may be a relevant technique to reference if you get around to implementing [[Project: Hierarchical File System Summarization using [[Large Language Models (LLM)]]]].