
Building a RAG Pipeline with llama.cpp in Python
Image by Editor | Midjourney
Using llama.cpp enables efficient and accessible inference of large language models (LLMs) on local devices, particularly when running on CPUs. This article takes that capability to a full retrieval augmented generation (RAG) level, providing a practical, example-based guide to building a RAG pipeline with this framework using Python.
Step-by-Step Process
First, we install the required packages:
pip install llama-cpp-python
pip install langchain langchain-community sentence-transformers chromadb
pip install pypdf requests pydantic tqdm
Keep in mind that the initial package setup may take a few minutes to complete if none of these components have been installed before in your working environment.
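If you want to confirm the installation before moving on, a quick version check can help. The snippet below is an optional, minimal sketch that simply prints the installed versions of the key packages; the exact versions will depend on your environment:

# Optional sanity check: print the installed versions of the key packages.
# Assumes the pip installs above completed without errors.
from importlib.metadata import version

for pkg in ["llama-cpp-python", "langchain", "chromadb", "sentence-transformers"]:
    print(f"{pkg}: {version(pkg)}")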
After installing llama.cpp, LangChain, and other components such as pypdf for handling PDF documents in the document corpus, it's time to import everything we need.
import os
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.llms import LlamaCpp
import requests
from tqdm import tqdm
import time
Time to get started with the real process. The first thing we need is to download an LLM locally. Although in a real scenario you may want a bigger LLM, to keep our example relatively lightweight we will load a relatively small LLM (I know, that just sounded contradictory!), namely the Llama 2 7B quantized model, available from Hugging Face:
model_path = "llama-2-7b-chat.Q4_K_M.gguf"

if not os.path.exists(model_path):
    print(f"Downloading {model_path}...")
    # You may want to change the model URL to another of your choice
    model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"
    response = requests.get(model_url, stream=True)
    total_size = int(response.headers.get('content-length', 0))

    with open(model_path, 'wb') as f:
        for data in tqdm(response.iter_content(chunk_size=1024), total=total_size//1024):
            f.write(data)
    print("Download complete!")
Intuitively, we now need to set up another major component of any RAG system: the document base. In this example, we will create a mechanism to read documents in multiple formats, including .pdf and .txt, and for simplicity we will provide a default sample text document built on the fly, saving it into our newly created docs directory and adding it to the documents list. To try it yourself with an extra level of fun, make sure you load actual documents of your own.
os.makedirs("docs", exist_ok=True)

# Sample text for demonstration purposes
with open("docs/sample.txt", "w") as f:
    f.write("""
    Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based
    and generation-based approaches for natural language processing tasks. It involves
    retrieving relevant information from a knowledge base and then using that information
    to generate more accurate and informed responses.

    RAG models first retrieve documents that are relevant to a given query, then use
    these documents as additional context for language generation. This approach helps
    to ground the model's responses in factual information and reduces hallucinations.

    The llama.cpp library is a C/C++ implementation of Meta's LLaMA model, optimized
    for CPU usage. It allows running LLaMA models on consumer hardware without
    requiring high-end GPUs.

    LocalAI is a framework that enables running AI models locally without relying on
    cloud services. It provides APIs compatible with OpenAI's interfaces, allowing
    developers to use their own models with the same code they would use for OpenAI
    services.
    """)

documents = []
for file in os.listdir("docs"):
    if file.endswith(".pdf"):
        loader = PyPDFLoader(os.path.join("docs", file))
        documents.extend(loader.load())
    elif file.endswith(".txt"):
        loader = TextLoader(os.path.join("docs", file))
        documents.extend(loader.load())

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

chunks = text_splitter.split_documents(documents)
Notice that after processing the documents, we split them into chunks, which is a common practice in RAG systems for improving retrieval accuracy and ensuring the LLM effectively processes manageable inputs within its context window.
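It can also be useful to take a quick look at what the splitter produced. The following optional sketch, which assumes the chunks list created above, prints the number of chunks and a short preview of the first one:

# Optional: inspect the result of the splitting step.
# Each chunk is a LangChain Document with page_content and metadata.
print(f"Number of chunks: {len(chunks)}")
if chunks:
    print(f"First chunk source: {chunks[0].metadata.get('source', 'Unknown')}")
    print(f"First chunk preview: {chunks[0].page_content[:200]}...")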
Both LLMs and RAG systems need to handle numerical representations of text rather than raw text; therefore, we next build a vector store that contains embeddings of our text documents. Chroma is a lightweight, open-source vector database for efficiently storing and querying embeddings.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
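Before wiring in the LLM, you may want to sanity-check retrieval on its own. As a rough sketch, Chroma's similarity_search method can fetch the chunks closest to a test query and confirm that relevant passages are being found (the query string here is just an example):

# Optional: query the vector store directly, without the LLM.
# Returns the k most similar chunks to the query string.
results = vectorstore.similarity_search("What is llama.cpp?", k=2)
for i, doc in enumerate(results):
    print(f"Hit {i+1}: {doc.page_content[:120]}...")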
Now llama.cpp enters the scene to initialize our previously downloaded LLM. To do this, a LlamaCpp object is instantiated with the model path and other settings like model temperature, maximum context length, and so on.
llm = LlamaCpp(
    model_path=model_path,
    temperature=0.7,
    max_tokens=2000,
    n_ctx=4096,
    verbose=False
)
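At this point the model can already be queried directly, without any retrieval. The following optional sketch sends a standalone prompt to verify that the GGUF file loaded correctly and to get a feel for CPU inference speed; depending on your LangChain version, you may need to call the object directly instead of using invoke:

# Optional: test the model on its own before adding retrieval.
# Note: CPU inference is slow, so even a short completion may take a while.
test_output = llm.invoke("Q: In one sentence, what is retrieval augmented generation? A:")
# On older LangChain versions, llm("...") behaves the same way.
print(test_output)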
We're getting closer to the inference show, and just a few actors remain to appear on stage. One is the RAG prompt template, an elegant way to define how the retrieved context and the user query are combined into a single, well-structured input for the LLM during inference.
template = """
Answer the question based on the following context:

{context}

Question: {question}
Answer: """

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
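If you are curious about what the LLM will actually receive, the template can be rendered by hand. The sketch below uses made-up placeholder text for the context and simply prints the combined prompt the chain will build for each query:

# Optional: render the template manually to preview the final prompt layout.
# The context string is placeholder text used only for illustration.
preview = prompt.format(
    context="RAG combines retrieval with generation ...",
    question="What is RAG?"
)
print(preview)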
Finally, we put everything together to create our RAG pipeline based on llama.cpp.
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)
Let's review the building blocks of the RAG pipeline we just created for a better understanding:
- llm: the LLM downloaded and then initialized using llama.cpp.
- chain_type: a way to specify how the retrieved documents in a RAG system are put together and sent to the LLM, with "stuff" meaning that all retrieved context is injected into the prompt.
- retriever: initialized on top of the vector store and configured to fetch the three most relevant document chunks (the sketch after this list shows how these settings can be varied).
- return_source_documents=True: used to obtain information about which document chunks were used to answer the user's question.
- chain_type_kwargs={"prompt": prompt}: enables the use of our recently defined custom template to format the retrieval-augmented input into a presentable form for the LLM.
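For reference, these retriever settings are easy to adjust. The sketch below is a possible variation, not used in the rest of this article, that retrieves five chunks with maximal marginal relevance (MMR) search to reduce redundancy among the retrieved passages; treat it as an illustration of the available knobs rather than a recommended configuration:

# Optional variation: retrieve 5 chunks using maximal marginal relevance (MMR)
# instead of plain similarity search, to reduce redundancy among results.
alt_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5}
)

alt_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=alt_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)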
To finalize and see everything in action, we define and use a pipeline-driving function, ask_question(), that runs the RAG pipeline to answer the user's questions.
def ask_question(question):
    start_time = time.time()
    result = rag_pipeline({"query": question})
    end_time = time.time()

    print(f"Question: {question}")
    print(f"Answer: {result['result']}")
    print(f"Time taken: {end_time - start_time:.2f} seconds")
    print("\nSource documents:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"Document {i+1}:")
        print(f"Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"Content: {doc.page_content[:150]}...\n")
Now let's try out our pipeline with some specific questions.
ask_question("What is RAG and how does it work?")
ask_question("What is llama.cpp?")
ask_question("How does LocalAI relate to cloud AI services?")
Result:
Question: What is RAG and how does it work?
Answer: RAG is a combination of retrieval-based and generation-based approaches for natural language processing tasks. It involves retrieving relevant information from a knowledge base and using that information to generate more accurate and informed responses. RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.
Time taken: 195.05 seconds

Source documents:
Document 1:
Source: docs/sample.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches for natural language processing ...

Document 2:
Source: docs/sample.txt
Content: on consumer hardware without requiring high-end GPUs.

LocalAI is a framework that enables running AI models locally without relying on cloud ...

Question: What is llama.cpp?
Answer: llama.cpp is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models on consumer hardware without requiring high-end GPUs.
Time taken: 35.61 seconds

Source documents:
Document 1:
Source: docs/sample.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches for natural language processing ...

Document 2:
Source: docs/sample.txt
Content: on consumer hardware without requiring high-end GPUs.

LocalAI is a framework that enables running AI models locally without relying on cloud ...

Question: How does LocalAI relate to cloud AI services?
Answer: LocalAI is a framework that enables running AI models locally without relying on cloud services. It provides APIs compatible with OpenAI's interfaces, allowing developers to use their own models with the same code they would use for OpenAI services. This means that LocalAI allows developers to use their own AI models, trained on their own data, without having to rely on cloud-based services.
Time taken: 182.07 seconds

Source documents:
Document 1:
Source: docs/sample.txt
Content: on consumer hardware without requiring high-end GPUs.

LocalAI is a framework that enables running AI models locally without relying on cloud ...

Document 2:
Source: docs/sample.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches for natural language processing ...
Wrapping Up
This article demonstrated how to set up and use a local RAG pipeline efficiently with llama.cpp, a popular framework for running inference on existing LLMs locally in a lightweight and portable fashion. You should now be able to apply these newly learned skills in your own projects.