
RAG for SKILLS: Retrieval Augmented Execution (SkillRAE)

2026-05-13 Science & Technology
Discover AI
88.6k subscribers


Description

Standard AI engineering assumed that if you simply feed an LLM the correct tool repository, it has enough "reasoning" power to figure out the execution. SkillRAE proves this is mathematically naive. The LLM is essentially a raw CPU. Delivering uncompiled, isolated tools into its context window forces the LLM to do dependency resolution on the fly, which it fails at. The new insight is that Retrieval is not enough; execution requires Compilation. By explicitly rescuing and grafting boundary conditions (subunits) into a logically bound, low-token payload, we bypass the LLM's stateless amnesia and give it a fully resolved blueprint.

All rights with the authors: "SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution" by Xiangcheng Meng, Shu Wang, and Yixiang Fang∗, The Chinese University of Hong Kong, Shenzhen. arXiv:2605.10114

#airesearch #scienceexplained #aiexplained

Top Comments (8)

@luciderIamI 2026-05-13

You're right. Let me cut the generic and get to what actually matters.

The real problem with RAG right now: most implementations treat retrieval and generation as separate stages with a handoff. That handoff is where information dies. The vector database returns chunks. The LLM receives chunks. But the LLM has no memory of why those chunks were selected, no ability to say "this doesn't match what I actually need," and no way to go back and ask the retriever for something different. It's a pipeline, not a loop.

What would actually help:

1. Make retrieval a dialogue, not a query

Current RAG: user asks → one search → LLM answers. What works: user asks → search → LLM evaluates what it got → decides if it's sufficient → either answers or reformulates and searches again. This is active retrieval, and it's the difference between a system that looks up answers and one that finds them. The key shift: the LLM needs to be able to signal "insufficient" and specify what's missing. Most architectures don't give it that channel.

2. Stop embedding documents. Embed relationships.

A chunk in isolation loses most of its meaning. "The policy was updated": which policy? When? What did it replace? Who authorized it? Current RAG throws away this structure. What would help: graph-augmented retrieval. Entities, their versions, their relationships, their contradictions over time. The retrieval layer should return context, not just similar text, so the LLM knows whether it's looking at current or outdated information, primary or derived sources.

3. Compression that preserves uncertainty

Contextual compression usually means "summarize the chunks to fit the context window." That's lossy in the wrong way: it strips the hedges, qualifiers, and contradictory evidence that the LLM needs to give calibrated answers. Better compression: preserve the shape of the evidence. If three sources disagree, the compressed context should say "Source A claims X, Source B claims Y, Source C is ambiguous," not "The answer is approximately X."

4. Evaluation that measures the right thing

Current benchmarks test whether the final answer is correct. But in production, what kills you is:
- The system retrieves nothing relevant and hallucinates confidently
- The system retrieves outdated information and presents it as current
- The system retrieves contradictory information and picks the wrong one

We need eval frameworks that score retrieval quality independently (precision@k, coverage, temporal accuracy, contradiction detection), not just end-to-end answer correctness.

5. Graceful degradation as a first-class feature

When retrieval confidence is low, most systems either hallucinate or say "I don't know." Both are failures. What would help: a spectrum of responses based on evidence quality. "Here's what the documentation says, but it's 18 months old." "I found conflicting accounts; here's the range." "I don't have direct sources for this, but here's a related principle that might apply." This requires the system to know the quality of what it retrieved, which means metadata and provenance need to travel with the chunks all the way to generation.

The through-line: RAG needs to stop being a pipeline architecture and become a cognitive loop, one where retrieval, evaluation, and generation can iterate, where the system knows what it doesn't know, and where the quality of evidence shapes the confidence of the answer. Right now it's mostly plumbing. The next step is judgment.
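Two of the retrieval-level metrics named in point 4 above, precision@k and coverage, can be scored without looking at the generated answer at all. The following is a minimal sketch, not a full eval framework. The function names, the chunk ids, and the hand-labeled `relevant` set are all hypothetical, and "coverage" is interpreted here as recall within the top k.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk ids that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def coverage_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled-relevant chunk ids that appear in the top k.

    Assumption: 'coverage' is treated as recall@k. With nothing labeled
    relevant there is nothing to cover, so the function returns 1.0.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 1.0
    found = relevant.intersection(retrieved[:k])
    return len(found) / len(relevant)


# Hypothetical example: ranked retriever output and a hand-labeled relevant set.
retrieved_ids = ["c1", "c7", "c3", "c9"]
relevant_ids = {"c1", "c3", "c5"}

print(precision_at_k(retrieved_ids, relevant_ids, k=3))
print(coverage_at_k(retrieved_ids, relevant_ids, k=3))
```

Scoring these per query, separately from answer correctness, shows whether a failure started in retrieval or in generation.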

2 2 replies
@luciderIamI 2026-05-13

If you're building this, here's what I'd actually do. Not theory, sequence:

Phase 1: Fix the handoff (week 1-2)
Build a retrieval evaluator that sits between search and generation. Simple classifier: given the query and the retrieved chunks, score sufficiency (0-1). If below threshold, trigger re-retrieval with a reformulated query. This turns your pipeline into a loop. Start with a small model for this; don't burn tokens on the evaluator.

Phase 2: Add provenance metadata to every chunk (week 2-3)
Every chunk in your index needs: source, date, version, author/owner, and a "last validated" timestamp. This isn't for display; it's for the LLM to reason about staleness and authority. Feed this metadata into the prompt explicitly: "The following sources were retrieved. Source A is from 2023, Source B is from 2026." Let the LLM do temporal reasoning.

Phase 3: Implement contradiction detection (week 3-4)
When you retrieve multiple chunks, run a lightweight comparison: do these say the same thing? If not, flag it in the prompt. "Retrieved sources disagree on X. Source 1 says... Source 2 says..." This prevents the LLM from silently picking the first or most confident-sounding chunk.

Phase 4: Graph layer for relationships (month 2)
Move beyond flat chunks. Extract entities and relationships at index time. When retrieving, pull the chunk plus its local graph neighborhood: what it references, what references it, what superseded it. This is where you stop embedding documents and start embedding context.

Phase 5: Calibrated response generation (ongoing)
Train or prompt the LLM to modulate its output based on evidence quality signals. Low confidence retrieval → qualified language. High confidence, single source → direct answer. Conflicting sources → present the range. This is the difference between a system that answers and a system that reasons.

What I'd skip:
- Fancy embedding models before fixing chunking. Bad chunks with good embeddings are still bad chunks.
- Multi-hop retrieval frameworks unless you have clean entity resolution. Most implementations just chain noise.
- Agentic RAG with tool use until the basic loop is solid. Adding tools to a broken loop multiplies failure modes.

What I'd invest in early:
- A retrieval evaluation dataset that captures your actual failure modes: outdated info, near-miss semantic matches, queries that need multiple document types. Generic benchmarks won't find your specific breakage.
- Logging that captures the full loop: original query, reformulated queries, retrieved chunks with scores, evaluator decisions, final prompt context. When the system fails, you need to know if it was retrieval, evaluation, or generation.

The real recommendation: start with the loop. Everything else (graph, compression, contradiction detection) plugs into it. Without the loop, you're just adding sophistication to a pipeline that fundamentally can't correct itself.
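The Phase 1 loop described above (search, score sufficiency, reformulate, search again) can be sketched as follows. This is only an illustration under stated assumptions: `retrieve_with_loop`, `LoopResult`, the toy corpus, and the three stand-in callables are hypothetical names. In a real system the search would hit an index, and both the evaluator and the reformulator would be small models rather than the trivial rules used here.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LoopResult:
    chunks: List[str]
    score: float
    queries_tried: List[str]  # kept so the full loop can be logged


def retrieve_with_loop(
    query: str,
    search: Callable[[str], List[str]],
    score_sufficiency: Callable[[str, List[str]], float],
    reformulate: Callable[[str], str],
    threshold: float = 0.5,
    max_rounds: int = 3,
) -> LoopResult:
    """Search, score the result against the original query, retry if too weak."""
    tried: List[str] = []
    best_chunks: List[str] = []
    best_score = -1.0
    current = query
    for _ in range(max_rounds):
        tried.append(current)
        chunks = search(current)
        score = score_sufficiency(query, chunks)  # always judged vs. the original query
        if score > best_score:
            best_chunks, best_score = chunks, score
        if score >= threshold:
            break
        current = reformulate(current)
    return LoopResult(best_chunks, max(best_score, 0.0), tried)


# --- Toy stand-ins (assumptions, for demonstration only) ---
CORPUS = [
    "refund policy: refunds are issued within 30 days",
    "shipping takes 5 business days",
    "reimbursement requests need a receipt",
]
SYNONYMS = {"money": "refund", "rules": "policy"}


def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def toy_search(q: str) -> List[str]:
    # Keyword overlap in place of a real vector or BM25 index.
    return [c for c in CORPUS if tokens(q) & tokens(c)]


def toy_score(query: str, chunks: List[str]) -> float:
    # Placeholder for a small classifier returning a 0-1 sufficiency score.
    return 1.0 if chunks else 0.0


def toy_reformulate(q: str) -> str:
    # Placeholder for model-driven query rewriting.
    return " ".join(SYNONYMS.get(word, word) for word in q.lower().split())


result = retrieve_with_loop("money back rules", toy_search, toy_score, toy_reformulate)
print(result.queries_tried)
print(result.chunks)
```

Returning `queries_tried` alongside the chunks is a deliberate choice: it gives the logging described under "What I'd invest in early" something to record for each round.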

2 3 replies
@timmygilbert4102 2026-05-13

In before we converge back to Skill2Vec embedding 😂 Man, I wish I had a computer. I'm still learning to set up my phone for programming.

0 1 replies
@jordanerhardt5038 2026-05-13

“Not overkill at all”

0
@luciderIamI 2026-05-13

Took five minutes. You're welcome.

0 1 replies
@dr.plaque2359 2026-05-14

What do you think about making your videos 10 minutes instead of 20 minutes or longer? You need to grow your audience

0 2 replies
@Peter-w3t4z 2026-05-13

Bench mogging

2
@luciderIamI 2026-05-13

The honest answer: most RAG systems fail at retrieval, not generation. The LLM is usually fine. The problem is what gets fed into it. Here's what actually moves the needle, ranked by impact:

1. Fix chunking before you touch anything else

Fixed-size chunking (e.g., every 512 tokens) severs semantic units. A sentence split across two chunks becomes unretrievable. An answer needing context from adjacent paragraphs gets broken. Semantic chunking, which splits on meaning (headings, sections) rather than character counts, is now table stakes. Parent-document retrieval (store embeddings for child chunks, return larger parent spans) also helps preserve coherence.

2. Retrieval is the ceiling. Treat it that way.

If you retrieve three irrelevant chunks, the LLM either hallucinates to fill the gap or produces a useless non-answer. Most evaluation frameworks test generation quality, not retrieval precision. That's backwards. Practical upgrades:
- Hybrid search: Combine vector similarity with keyword search (BM25). Keyword search is 10-50ms and nails exact terms; vector search handles conceptual variation. Fuse results with Reciprocal Rank Fusion (RRF).
- Re-ranking: Retrieve top 20-50 cheaply via vector similarity, then re-score with a cross-encoder or LLM-based ranker. Feed only the top 3-8 to the generator. This is one of the highest-ROI upgrades in RAG.
- Query expansion: Generate multiple paraphrases of the user's question (multi-query), or generate a hypothetical answer document and embed that instead (HyDE). Different phrasing hits different vocabulary in the index.

3. Context compression and "lost in the middle"

Even with good retrieval, the final prompt gets noisy: repeated info, irrelevant paragraphs, overlapping chunks. Contextual compression (filtering or summarizing retrieved content before it reaches the LLM) reduces token usage and helps the model focus. Also, LLM performance degrades when relevant information is buried in the middle of long contexts. Reranking and compression address this directly.

4. Active/iterative retrieval

The best RAG systems aren't one-shot. They behave like a careful researcher: evaluate whether retrieved evidence is sufficient, reformulate the query if confidence is low, and retrieve again. This includes self-RAG patterns where the system decides when to retrieve rather than doing it blindly on every query.

5. Indexing and staleness

Vector stores without governance metadata act as flat buckets. When definitions change, policies update, or schemas drift, the index becomes stale and returns semantically similar but factually outdated content. Every chunk should carry owner, effective date, last-validated timestamp, and lineage identifier. Build invalidation triggers so glossary or schema changes automatically update the index.

6. Performance is architectural, not code-level

A naive RAG pipeline adds 800ms–2s of latency per query. The dominant bottleneck is usually LLM inference, not retrieval. If vector search takes 100ms and generation takes 3000ms, cutting search time in half saves 50ms out of 3100ms, which improves total latency by only about 1.6%. Optimize the slowest component first. Key architectural moves:
- Caching for common queries and pre-computed embeddings
- Quality-latency budgets: Let users specify how fast vs. how thorough. 200ms budget → keyword + small model. 2000ms budget → hybrid + large model + reranking
- Graceful degradation: If retrieval confidence is low, return a fallback rather than forcing a bad answer

Bottom line: RAG quality is frequently an indexing problem disguised as an LLM problem. Get retrieval right (chunking, hybrid search, reranking, active retrieval) and generation becomes the easy part.
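The hybrid-search bullet in point 2 above mentions fusing keyword and vector results with Reciprocal Rank Fusion. As a minimal sketch: RRF scores each document by summing 1 / (k + rank) over every ranking it appears in, so no score normalization between BM25 and vector similarity is needed. The function name and the document ids below are hypothetical; k = 60 is the commonly used default constant, not something the comment specifies.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranked list.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken alphabetically for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword (BM25) search and a vector search.
bm25_ranking = ["d3", "d1", "d2"]
vector_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['d1', 'd3', 'd4', 'd2']
```

Documents that rank well in both lists (here `d1` and `d3`) rise to the top, which is the behavior the hybrid-search bullet relies on. The fused list would then feed the re-ranking step.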

0 3 replies
