

Implementing Multi-Modal RAG Systems
Image by Author | Ideogram
Large language models (LLMs) have evolved and permeated our lives so much, and so quickly, that many of us have become dependent on them in all kinds of scenarios. Once people realize how helpful products such as ChatGPT are for text generation, few can avoid relying on them. However, the answer is sometimes inaccurate, which calls for an output-enhancement technique such as retrieval-augmented generation, or RAG.
RAG is a framework that enhances LLM output by incorporating real-time retrieval of external knowledge. Multi-modal RAG systems take this a step further by enabling the retrieval and processing of information across multiple data formats, such as text and image data.
In this article, we will implement multi-modal RAG using text, audio, and image data.
Multi-Modal RAG System
Multi-modal RAG systems combine several data types in the knowledge base to produce better output. There are many ways to implement them, but what matters is building a system that works well in production rather than one that is merely fancy.
In this tutorial, we will enhance a RAG system by building a knowledge base that contains both image and audio data. For the full code base, you can visit the accompanying GitHub repository.
The workflow is summarized in the image below.
It is a bit small to read as is, so click to enlarge or save and zoom in as needed. The workflow can be summarized in seven steps, which are:
- Extract Images
- Embed Images
- Store Image Embeddings
- Process Audio
- Store Audio Embeddings
- Retrieve Data
- Generate and Output Response
As this requires substantial resources, we will use Google Colab with access to a GPU. More specifically, we will use an A100 GPU, since the RAM requirement for this tutorial is relatively high.
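Before going further, it can help to confirm that the Colab runtime actually has a GPU attached (Colab ships with PyTorch preinstalled, so this check works right away). A minimal sketch:

import torch

# Confirm that a CUDA device is visible to PyTorch before loading the heavier models.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected -- switch the Colab runtime type to a GPU instance.")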
Let's start by installing all the libraries needed for this tutorial.
pip install pdf2image Pillow chromadb torch torchvision torchaudio transformers librosa ipython open-clip-torch qwen_vl_utils sentence-transformers
You can visit the PyTorch website to see which build works for your system and environment.
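For example, a CUDA 12.1 build of PyTorch can be installed with a command like the one below; treat it as an illustration and copy the exact command the PyTorch selector gives you for your setup.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121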
Additionally, image extraction from PDFs sometimes does not work correctly. If that happens, install the following tool.
apt-get update
apt-get install -y poppler-utils
With the environment and tools ready, we will import all the necessary libraries.
import os
from pdf2image import convert_from_path
from PIL import Image
import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
import torch
from transformers import (
    CLIPProcessor, CLIPModel,
    WhisperProcessor, WhisperForConditionalGeneration,
    Qwen2VLForConditionalGeneration, Qwen2VLProcessor
)
import librosa
from sentence_transformers import SentenceTransformer
from qwen_vl_utils import process_vision_info
from IPython.display import display, Image as IPImage
In this tutorial, we will use both image data from PDFs and audio files (.mp3) prepared beforehand. We will use the Short Cooking Recipe from Unilever for the PDF file and the Gordon Ramsay Cooking Audio file from YouTube. You can find both files in the dataset folder of the GitHub repository.
Put all the files in the dataset folder, and we are ready to go.
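For reference, the dataset folder should end up looking something like this (the file names below are only illustrative; use whatever your downloaded files are called):

dataset/
├── cooking_recipe.pdf         # hypothetical name for the Unilever recipe PDF
└── gordon_ramsay_cooking.mp3  # hypothetical name for the YouTube audio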
We will start by processing the image data from a PDF file. To do that, we will extract each PDF page as an image with the following code.
output_dir = "dataset"
image_output_dir = "extracted_images"

def convert_pdfs_to_images(folder, image_output_dir):
    if not os.path.exists(image_output_dir):
        os.makedirs(image_output_dir)

    pdf_files = [f for f in os.listdir(folder) if f.endswith('.pdf')]
    all_images = {}

    for doc_id, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(folder, pdf_file)
        # Render every page of the PDF as an image
        images = convert_from_path(pdf_path, dpi=100)

        image_paths = []
        for i, image in enumerate(images):
            image_path = os.path.join(image_output_dir, f"{doc_id}_page_{i}.png")
            image.save(image_path, "PNG")
            image_paths.append(image_path)

        all_images[doc_id] = image_paths
    return all_images

all_images = convert_pdfs_to_images(output_dir, image_output_dir)
Once all the images have been extracted from the PDF file, we will generate image embeddings with the CLIP model. CLIP is a multi-modal model developed by OpenAI, designed to understand the relationship between image and text data.
In our pipeline, we use CLIP to generate image embeddings that we will later store in the ChromaDB vector database and use to retrieve relevant images based on text queries.
To generate the image embeddings, we will use the following code.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(image_paths):
    embeddings = []
    for path in image_paths:
        image = Image.open(path)
        inputs = processor(images=image, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            image_embedding = model.get_image_features(**inputs).cpu().numpy()
        embeddings.append(image_embedding)
    return embeddings

image_embeddings = {}
for doc_id, paths in all_images.items():
    image_embeddings[doc_id] = embed_images(paths)
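As a quick sanity check that text and images really share one embedding space, you can embed a short text query with the same CLIP model and compare it against one of the page embeddings. This is only an illustration (the query string is made up) and not part of the pipeline itself:

import numpy as np

def embed_text(text):
    # Embed a text query with the same CLIP model used for the images.
    inputs = processor(text=[text], return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        return model.get_text_features(**inputs).cpu().numpy()

query_vec = embed_text("a quick vegetable recipe")[0]
page_vec = image_embeddings[0][0].flatten()

# Cosine similarity between the text query and the first extracted page.
similarity = np.dot(query_vec, page_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(page_vec))
print(f"CLIP text-image similarity: {similarity:.3f}")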
Next, we will process the audio data to generate text transcriptions using the Whisper model. Whisper is an OpenAI model that uses a transformer-based architecture to generate text from audio input.
We are not using Whisper for embedding in our pipeline. Instead, it is only responsible for audio transcription. We transcribe the audio in chunks and then use Sentence Transformers to generate embeddings for the transcription chunks.
To process the audio transcription, we will use the following code.
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)

# Define how much audio goes into each chunk (in seconds)
def transcribe_audio(audio_path, chunk_length=30):
    audio, sr = librosa.load(audio_path, sr=16000)
    chunk_size = chunk_length * sr
    chunks = [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]

    transcription_chunks = []
    for chunk in chunks:
        inputs = whisper_processor(chunk, sampling_rate=sr, return_tensors="pt").to(device)
        inputs["attention_mask"] = torch.ones_like(inputs.input_features)
        with torch.no_grad():
            predicted_ids = whisper_model.generate(**inputs, max_length=448)
        chunk_transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        transcription_chunks.append(chunk_transcription)

    full_transcription = " ".join(transcription_chunks)
    return full_transcription, transcription_chunks

audio_files = [f for f in os.listdir(output_dir) if f.endswith('.mp3')]
audio_transcriptions = {}
for audio_id, audio_file in enumerate(audio_files):
    audio_path = os.path.join(output_dir, audio_file)
    full_transcription, transcription_chunks = transcribe_audio(audio_path)
    audio_transcriptions[audio_id] = {
        "full_transcription": full_transcription,
        "chunks": transcription_chunks
    }
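To sanity-check the transcription before indexing it, you can print a short preview of the first chunk; this is purely an inspection step:

# Preview the first transcription chunk of the first audio file.
if audio_transcriptions:
    print(audio_transcriptions[0]["chunks"][0][:300])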
With everything in place, we will store our embeddings in the ChromaDB vector database. We will keep the image and audio transcription data in separate collections, since they have different embedding characteristics. We will also initialize the embedding functions for both the image and audio transcription data.
client = chromadb.PersistentClient(path="chroma_db")
embedding_function = OpenCLIPEmbeddingFunction()
text_embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Delete existing collections (if needed)
try:
    client.delete_collection(name="image_collection")
    client.delete_collection(name="audio_collection")
    print("Deleted existing collections.")
except Exception as e:
    print(f"Collections do not exist or could not be deleted: {e}")

image_collection = client.create_collection(name="image_collection", embedding_function=embedding_function)
audio_collection = client.create_collection(name="audio_collection")

# Store the CLIP image embeddings together with the path of each extracted page
for doc_id, embeddings in image_embeddings.items():
    for i, embedding in enumerate(embeddings):
        image_collection.add(
            ids=[f"image_{doc_id}_{i}"],
            embeddings=[embedding.flatten().tolist()],
            metadatas=[{"doc_id": str(doc_id), "image_path": all_images[doc_id][i]}]
        )

# Store the transcription chunks with their sentence-transformer embeddings
for audio_id, transcription_data in audio_transcriptions.items():
    transcription_chunks = transcription_data["chunks"]
    for chunk_id, chunk in enumerate(transcription_chunks):
        chunk_embedding = text_embedding_model.encode(chunk)
        audio_collection.add(
            ids=[f"audio_{audio_id}_chunk_{chunk_id}"],
            embeddings=[chunk_embedding.tolist()],
            metadatas=[{
                "audio_id": str(audio_id),
                "audio_path": audio_files[audio_id],
                "chunk_id": str(chunk_id)
            }],
            documents=[chunk]
        )
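A quick way to verify that everything was indexed is to check the collection sizes; ChromaDB collections expose a count() method:

# Confirm that both collections were populated.
print("Indexed images:", image_collection.count())
print("Indexed audio chunks:", audio_collection.count())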
Our RAG system is almost ready! The only thing left to do is set up retrieval from the ChromaDB vector database.
For example, let's try retrieving the top two results from both the image and audio collections using a text query.
def retrieve_data(query, top_k=2):
    # OpenCLIP embedding for the image collection
    query_embedding_image = embedding_function([query])[0]
    # SentenceTransformer embedding for the audio collection
    query_embedding_audio = text_embedding_model.encode(query)

    image_results = image_collection.query(
        query_embeddings=[query_embedding_image],
        n_results=top_k
    )

    audio_results = audio_collection.query(
        query_embeddings=[query_embedding_audio.tolist()],
        n_results=top_k
    )

    retrieved_images = [metadata["image_path"] for metadata in image_results["metadatas"][0] if "image_path" in metadata]
    retrieved_chunks = audio_results["documents"][0] if "documents" in audio_results else []

    return retrieved_images, retrieved_chunks

query = "What are the healthiest ingredients to use in the recipe you have?"
retrieved_images, retrieved_chunks = retrieve_data(query)
print("Retrieved Images:", retrieved_images)
print("Retrieved Audio Chunks:", retrieved_chunks)
The results for both retrievals are shown in the output below.
Retrieved Images: ['extracted_images/0_page_3.png', 'extracted_images/0_page_12.png']
Retrieved Audio Chunks: [” Lemon. Zest the lemon. Over. Smells incredible. And then finally seal the deal with a touch of grated parmesan cheese. Give your veg some attitude and you’ll get amazingly elegant dishes on a budget that are always guaranteed to impress. What more do you want from great cooking? Cheap to make, easy to cook and absolutely stunning. For me, food always has to be impressive. But when it comes to desserts,”, ” and one third of your protein, chicken. With a dish that takes literally minutes to put together, it’s really important to get everything organized. Everything needs to be at your fingertips. Touch of olive oil. Get that pan really nice and ready. Just starting to smoke. Drop the chicken in first. Just salt, pepper. Open up those little strands of chicken.”] |
For image retrieval, it returns the image paths we stored as metadata in the vector database. For audio retrieval, it returns the transcription chunks most closely related to the text query.
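Since the image results are just file paths, you can render them inline in the notebook with the IPython display helpers imported earlier; this is a convenience for inspection rather than part of the pipeline:

# Render the retrieved PDF pages inline in the notebook.
for path in retrieved_images:
    display(IPImage(filename=path, width=400))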
With data retrieval in place, we will set up the generative model using the Qwen-VL model. It is a multi-modal LLM that can handle text and image data and generate text responses from the multi-modal data we pass to it.
In our pipeline, Qwen-VL produces the final answer by taking both the retrieved images and the audio transcription chunks as context.
Let's set up the model with the following code.
vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
).cuda().eval()

# Bound the resolution at which the processor resizes input images
min_pixels = 256 * 256
max_pixels = 1024 * 1024
vl_model_processor = Qwen2VLProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
Then, we prepare the input data, process it, and generate the text output.
chat_template = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": retrieved_images[0]},  # First retrieved image
            {"type": "image", "image": retrieved_images[1]},  # Second retrieved image
            {"type": "text", "text": query},  # User query
            {"type": "text", "text": "Audio Context: " + " ".join(retrieved_chunks)}  # Include audio data
        ],
    }
]

text = vl_model_processor.apply_chat_template(
    chat_template, tokenize=False, add_generation_prompt=True
)

image_inputs, _ = process_vision_info(chat_template)
inputs = vl_model_processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = vl_model.generate(**inputs, max_new_tokens=100)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
The result is shown in the output below.
The healthiest ingredients to use in the recipe are:
1. **Lemon** – Gives a burst of citrus flavor and is a good source of vitamin C.
2. **Parmesan Cheese** – A good source of calcium and protein.
3. **Chicken** – A lean protein source that is rich in essential amino acids.
4. **Olive Oil** – A healthy fat that is rich in monounsaturated fatty acids.
5. **Zest** – Adds a burst of flavor
As you can see, the result takes both the image and audio data into account.
That's all you need to build a multi-modal RAG system. You can swap in your own files and adapt the code to your needs.
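For instance, one convenient adaptation is wrapping retrieval and generation into a single helper. The sketch below assumes the objects defined earlier (retrieve_data, vl_model, vl_model_processor) are still in scope, and the example query is made up:

def answer_query(query, top_k=2, max_new_tokens=100):
    # Retrieve the most relevant pages and transcription chunks for the query.
    images, chunks = retrieve_data(query, top_k=top_k)

    # Build a chat message containing the retrieved images and the audio context.
    messages = [{
        "role": "user",
        "content": (
            [{"type": "image", "image": path} for path in images]
            + [{"type": "text", "text": query},
               {"type": "text", "text": "Audio Context: " + " ".join(chunks)}]
        ),
    }]

    text = vl_model_processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)
    inputs = vl_model_processor(
        text=[text], images=image_inputs, padding=True, return_tensors="pt"
    ).to("cuda")

    generated_ids = vl_model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return vl_model_processor.batch_decode(trimmed, skip_special_tokens=True)[0]

print(answer_query("Which ingredients appear in both the PDF and the audio?"))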
Conclusion
Retrieval-augmented generation, or RAG, is a framework that enhances LLM output using external knowledge. In multi-modal RAG systems, we make use of data beyond plain text, such as image and audio data.
In this article, we implemented multi-modal RAG using text, audio, and image data. We used CLIP for image embeddings, Whisper for audio transcription, SentenceTransformer for text embeddings, ChromaDB for vector storage, and Qwen-VL for multimodal text generation.
I hope this has helped!