Sources:

METIS

Background

  • RAG allows LLMs to generate better responses with external knowledge, but incorporating external knowledge increases response delay
  • No systematic way to balance the tradeoff between response quality and response delay
  • METIS tackles this by jointly scheduling queries and adapting the key RAG configurations of each query
  • RAG query steps (see the sketch after this list):
    • Retrieval - the RAG system retrieves one or more relevant context chunks
    • Synthesis - combines these chunks and the RAG query to form one or more LLM calls that generate the response
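A minimal sketch of these two steps (illustrative only, not METIS code; the word-overlap retriever and the stub LLM are stand-ins for a real embedding index and model):

```python
# Minimal two-step RAG sketch: retrieve relevant chunks, then synthesize an
# answer with a single LLM call. Toy word-overlap scoring stands in for
# embedding similarity; `llm` is any callable that maps a prompt to text.
from typing import Callable, List

def retrieve(query: str, corpus: List[str], num_chunks: int) -> List[str]:
    """Return the num_chunks corpus chunks sharing the most words with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return ranked[:num_chunks]

def synthesize(query: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Combine the retrieved chunks and the query into one LLM call."""
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)

# Usage with a stub LLM:
corpus = ["Paris is the capital of France.", "The Nile is a river in Africa."]
stub_llm = lambda prompt: f"(answer produced from a {len(prompt)}-char prompt)"
question = "What is the capital of France?"
print(synthesize(question, retrieve(question, corpus, num_chunks=1), stub_llm))
```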

Goals

  • Optimize queries based on resources while maintaining high quality

How

  • METIS reduces RAG response delays by jointly deciding per-query configuration and query scheduling based on available resources
  • Two-level design
    • Pruning configuration space:
      • prunes the full configuration space down to a small per-query set of configurations that keep accuracy high
    • RAG Scheduler:
      • jointly optimizes configuration and scheduling to reduce response delay by choosing the configuration that best fits into available GPU memory
  • In short: estimate each query's performance, reduce its configuration space, then the scheduler chooses the best configuration and sends it to the execution engine (see the sketch after this list)
  • Metric considerations:
    • resource cost (GPU requirement)
    • latency
    • memory consumption (KV Cache size)
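A minimal sketch of the joint configuration/scheduling idea described above (the data model and the greedy pick-lowest-latency-that-fits policy are assumptions for illustration, not METIS's actual algorithm):

```python
# Greedy sketch: from the quality-pruned candidate configurations, choose the
# lowest-latency one whose estimated KV-cache footprint fits free GPU memory.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Config:
    num_chunks: int            # how many chunks to retrieve
    synthesis_method: str      # e.g. "stuff" vs. "map_reduce"
    intermediate_length: int   # tokens per intermediate summary (0 if unused)
    est_kv_cache_gb: float     # estimated KV-cache size (memory consumption)
    est_latency_s: float       # estimated response delay

def pick_config(pruned: List[Config], free_gpu_gb: float) -> Optional[Config]:
    feasible = [c for c in pruned if c.est_kv_cache_gb <= free_gpu_gb]
    return min(feasible, key=lambda c: c.est_latency_s) if feasible else None

# With only 6 GB free, the 8-chunk config is infeasible and the 4-chunk one wins.
candidates = [
    Config(8, "stuff", 0, est_kv_cache_gb=9.0, est_latency_s=1.2),
    Config(4, "map_reduce", 120, est_kv_cache_gb=5.0, est_latency_s=1.8),
]
print(pick_config(candidates, free_gpu_gb=6.0))
```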

RAG System and Configuration

  • Abstract model for adjusting RAG context
  • their configuration knobs:
    • Num_chunks - how many chunks to retrieve
    • synthesis_method - how to synthesize
    • intermediate_length - how long is each summary
  • Performance Metrics eval
    • Response quality - calculates the F1 score of the generated response against the ground truth (see the sketch after this list)
      • F1 score is the harmonic mean of precision and recall
        • precision - fraction of generated words that also appear in the ground truth
        • recall - fraction of ground-truth words that appear in the generated response
      • Intuition:
        • F1 = 1 (perfect precision and recall)
        • F1 = 0 (completely wrong)
    • Response delay - measures the time elapsed from when the RAG system receives a RAG request to when it completes generating the response
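A short sketch of token-level F1 (a common way to score a generated answer against the ground truth; the exact tokenization and normalization used in the paper may differ):

```python
# Token-level F1: harmonic mean of precision and recall over word overlap.
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    # Tokens shared between prediction and ground truth (counted with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # fraction of generated words that are correct
    recall = overlap / len(gt_tokens)       # fraction of ground-truth words recovered
    return 2 * precision * recall / (precision + recall)

print(token_f1("the capital of France is Paris", "Paris is the capital of France"))  # 1.0
print(token_f1("I do not know", "Paris"))                                            # 0.0
```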

Pruning configuration space

  • Estimating query profile (mapped to configuration knobs in the sketch after this list):
    • query complexity - intricacy of the query, like “why?” questions relative to “yes/no” questions
      • in paper, dimension output is binary (High/low)
    • joint reasoning requirement - whether multiple pieces of information are needed to answer the query
      • in paper, dimension output is binary (Yes/No)
    • Pieces of information required - distinct, standalone pieces of information required to fully answer the query
      • in paper, dimension output is a number (1-10)
    • Length of the summarization - if the query is complex and needs a lot of different information, it is often necessary to first summarize the relevant information chunks
      • in paper, dimension output is a number (30-200)
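A hypothetical sketch of how such a profile could be mapped to the configuration knobs above (the thresholds and rules here are made up for illustration; the paper's rule-based mapping differs):

```python
# Illustrative profile-to-configuration mapping (assumed rules, not the paper's).
from dataclasses import dataclass

@dataclass
class QueryProfile:
    complexity_high: bool   # "why?"-style vs. yes/no question
    joint_reasoning: bool   # multiple pieces of information must be combined
    num_info_pieces: int    # 1-10 distinct pieces of information needed
    summary_length: int     # 30-200 tokens per intermediate summary

def profile_to_config(p: QueryProfile) -> dict:
    # Retrieve at least as many chunks as distinct pieces of information,
    # with extra slack for complex queries.
    num_chunks = p.num_info_pieces + (2 if p.complexity_high else 0)
    # Queries that need joint reasoning over many chunks summarize each chunk
    # first; simple queries stuff the chunks into a single prompt.
    if p.joint_reasoning and p.num_info_pieces > 3:
        return {"num_chunks": num_chunks,
                "synthesis_method": "map_reduce",
                "intermediate_length": p.summary_length}
    return {"num_chunks": num_chunks,
            "synthesis_method": "stuff",
            "intermediate_length": 0}

print(profile_to_config(QueryProfile(True, True, 5, 120)))
# {'num_chunks': 7, 'synthesis_method': 'map_reduce', 'intermediate_length': 120}
```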

METIS System Design

Rule-based mapping

Joint Configuration-Scheduling

A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge

Abstract

  • Main contributions:
    • review of storage and retrieval technologies
    • comparison of several advanced VDB solutions
    • outline emerging opportunities for coupling VDBs with LLMs

Background on Vector DB

  • Stores data in vectors
  • All searches are fuzzy: results are ranked by similarity, computed between the query embedding and the stored database embeddings (see the brute-force sketch below)
  • Vector Databases (VDBs) are tools designed to efficiently store and manage high-dimensional vectors
    • Two core functions:
      • vector storage
        • quantization
        • compression
        • distributed storage mechanisms
      • vector retrieval
        • indexing techniques (Tree-based, hashing, graph-based, quantization-based techniques)
  • Compared to traditional databases, VDBs have three significant advantages:
    • VDBs possess efficient and accurate vector retrieval capabilities
    • Vector databases support the storage and query of complex and unstructured data
    • Vector databases have high scalability and real-time processing capabilities

High level overview

Algorithms for each component
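As a concrete baseline for the similarity-based retrieval described in the Background bullets above, here is a brute-force cosine-similarity search sketch (illustrative only; the survey's tree-, hash-, graph-, and quantization-based indexes exist precisely to avoid this exhaustive scan):

```python
# Brute-force vector retrieval: rank every stored embedding by cosine
# similarity to the query and return the top-k indices.
import numpy as np

def cosine_top_k(query: np.ndarray, db: np.ndarray, k: int = 3) -> np.ndarray:
    q = query / np.linalg.norm(query)                      # unit-length query
    d = db / np.linalg.norm(db, axis=1, keepdims=True)     # unit-length rows
    scores = d @ q                                         # cosine similarity per row
    return np.argsort(-scores)[:k]                         # indices of top-k matches

# Usage: 1,000 random 128-dim embeddings; a slightly perturbed copy of row 42
# should still be matched to row 42 (fuzzy, similarity-based search).
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 128))
query = db[42] + 0.05 * rng.normal(size=128)
print(cosine_top_k(query, db, k=3))
```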