

Implementing Multi-Modal RAG Systems
Image by Author | Ideogram
Large language models (LLMs) have evolved and permeated our lives so much, and so quickly, that many of us have become dependent on them in all kinds of scenarios. Once people realize how helpful products such as ChatGPT are for text generation, few can avoid relying on them. However, the answer is sometimes inaccurate, which calls for an output-enhancement technique such as retrieval-augmented generation, or RAG.
RAG is a framework that enhances LLM output by incorporating real-time retrieval of external knowledge. Multi-modal RAG systems take this a step further by enabling the retrieval and processing of information across multiple data formats, such as text and image data.
In this article, we will implement multi-modal RAG using text, audio, and image data.
Multi-Modal RAG System
Multi-modal RAG systems combine several data types in the knowledge base to produce better output. There are many ways to implement them, but what matters is building a system that works well in production rather than one that is merely fancy.
In this tutorial, we will enhance a RAG system by building a knowledge base that contains both image and audio data. For the full code base, you can visit the accompanying GitHub repository.
The workflow is summarized in the image below.
It is a bit small to read as is, so click to enlarge or save and zoom in as needed. The workflow can be summarized in seven steps, which are:
- Extract Images
- Embed Images
- Store Image Embeddings
- Process Audio
- Store Audio Embeddings
- Retrieve Data
- Generate and Output Response
As this requires substantial resources, we will use Google Colab with access to a GPU. More specifically, we will use an A100 GPU, since the RAM requirement for this tutorial is relatively high.
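Before going further, it can help to confirm that the Colab runtime actually has a GPU attached (Colab ships with PyTorch preinstalled, so this check works right away). A minimal sketch:

import torch

# Confirm that a CUDA device is visible to PyTorch before loading the heavier models.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected -- switch the Colab runtime type to a GPU instance.")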
Let's start by installing all the libraries needed for this tutorial.
pip install pdf2image Pillow chromadb torch torchvision torchaudio transformers librosa ipython open-clip-torch qwen_vl_utils sentence-transformers
You can visit the PyTorch website to see which build works for your system and environment.
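For example, a CUDA 12.1 build of PyTorch can be installed with a command like the one below; treat it as an illustration and copy the exact command the PyTorch selector gives you for your setup.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121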
Additionally, image extraction from PDFs sometimes does not work correctly. If that happens, install the following tool.
apt-get update
apt-get install -y poppler-utils
With the environment and tools ready, we will import all the necessary libraries.
import os
from pdf2image import convert_from_path
from PIL import Image
import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
import torch
from transformers import (
    CLIPProcessor, CLIPModel,
    WhisperProcessor, WhisperForConditionalGeneration,
    Qwen2VLForConditionalGeneration, Qwen2VLProcessor
)
import librosa
from sentence_transformers import SentenceTransformer
from qwen_vl_utils import process_vision_info
from IPython.display import display, Image as IPImage
In this tutorial, we will use both image data from PDFs and audio files (.mp3) prepared beforehand. We will use the Short Cooking Recipe from Unilever for the PDF file and the Gordon Ramsay Cooking Audio file from YouTube. You can find both files in the dataset folder of the GitHub repository.
Put all the files in the dataset folder, and we are ready to go.
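For reference, the dataset folder should end up looking something like this (the file names below are only illustrative; use whatever your downloaded files are called):

dataset/
├── cooking_recipe.pdf         # hypothetical name for the Unilever recipe PDF
└── gordon_ramsay_cooking.mp3  # hypothetical name for the YouTube audio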
We will start by processing the image data from a PDF file. To do that, we will extract each PDF page as an image with the following code.
output_dir = "dataset"
image_output_dir = "extracted_images"

def convert_pdfs_to_images(folder, image_output_dir):
    if not os.path.exists(image_output_dir):
        os.makedirs(image_output_dir)

    pdf_files = [f for f in os.listdir(folder) if f.endswith('.pdf')]
    all_images = {}

    for doc_id, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(folder, pdf_file)
        # Render every page of the PDF as an image
        images = convert_from_path(pdf_path, dpi=100)

        image_paths = []
        for i, image in enumerate(images):
            image_path = os.path.join(image_output_dir, f"{doc_id}_page_{i}.png")
            image.save(image_path, "PNG")
            image_paths.append(image_path)

        all_images[doc_id] = image_paths
    return all_images

all_images = convert_pdfs_to_images(output_dir, image_output_dir)
Once all the images have been extracted from the PDF file, we will generate image embeddings with the CLIP model. CLIP is a multi-modal model developed by OpenAI, designed to understand the relationship between image and text data.
In our pipeline, we use CLIP to generate image embeddings that we will later store in the ChromaDB vector database and use to retrieve relevant images based on text queries.
To generate the image embeddings, we will use the following code.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(image_paths):
    embeddings = []
    for path in image_paths:
        image = Image.open(path)
        inputs = processor(images=image, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            image_embedding = model.get_image_features(**inputs).cpu().numpy()
        embeddings.append(image_embedding)
    return embeddings

image_embeddings = {}
for doc_id, paths in all_images.items():
    image_embeddings[doc_id] = embed_images(paths)
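As a quick sanity check that text and images really share one embedding space, you can embed a short text query with the same CLIP model and compare it against one of the page embeddings. This is only an illustration (the query string is made up) and not part of the pipeline itself:

import numpy as np

def embed_text(text):
    # Embed a text query with the same CLIP model used for the images.
    inputs = processor(text=[text], return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        return model.get_text_features(**inputs).cpu().numpy()

query_vec = embed_text("a quick vegetable recipe")[0]
page_vec = image_embeddings[0][0].flatten()

# Cosine similarity between the text query and the first extracted page.
similarity = np.dot(query_vec, page_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(page_vec))
print(f"CLIP text-image similarity: {similarity:.3f}")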
Next, we will process the audio data to generate text transcriptions using the Whisper model. Whisper is an OpenAI model that uses a transformer-based architecture to generate text from audio input.
We are not using Whisper for embedding in our pipeline. Instead, it is only responsible for audio transcription. We transcribe the audio in chunks and then use Sentence Transformers to generate embeddings for the transcription chunks.
To process the audio transcription, we will use the following code.
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)

# Define how much audio goes into each chunk (in seconds)
def transcribe_audio(audio_path, chunk_length=30):
    audio, sr = librosa.load(audio_path, sr=16000)
    chunk_size = chunk_length * sr
    chunks = [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]

    transcription_chunks = []
    for chunk in chunks:
        inputs = whisper_processor(chunk, sampling_rate=sr, return_tensors="pt").to(device)
        inputs["attention_mask"] = torch.ones_like(inputs.input_features)
        with torch.no_grad():
            predicted_ids = whisper_model.generate(**inputs, max_length=448)
        chunk_transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        transcription_chunks.append(chunk_transcription)

    full_transcription = " ".join(transcription_chunks)
    return full_transcription, transcription_chunks

audio_files = [f for f in os.listdir(output_dir) if f.endswith('.mp3')]
audio_transcriptions = {}
for audio_id, audio_file in enumerate(audio_files):
    audio_path = os.path.join(output_dir, audio_file)
    full_transcription, transcription_chunks = transcribe_audio(audio_path)
    audio_transcriptions[audio_id] = {
        "full_transcription": full_transcription,
        "chunks": transcription_chunks
    }
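To sanity-check the transcription before indexing it, you can print a short preview of the first chunk; this is purely an inspection step:

# Preview the first transcription chunk of the first audio file.
if audio_transcriptions:
    print(audio_transcriptions[0]["chunks"][0][:300])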
With everything in place, we will store our embeddings in the ChromaDB vector database. We will keep the image and audio transcription data in separate collections, since they have different embedding characteristics. We will also initialize the embedding functions for both the image and audio transcription data.
client = chromadb.PersistentClient(path="chroma_db")
embedding_function = OpenCLIPEmbeddingFunction()
text_embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Delete existing collections (if needed)
try:
    client.delete_collection(name="image_collection")
    client.delete_collection(name="audio_collection")
    print("Deleted existing collections.")
except Exception as e:
    print(f"Collections do not exist or could not be deleted: {e}")

image_collection = client.create_collection(name="image_collection", embedding_function=embedding_function)
audio_collection = client.create_collection(name="audio_collection")

# Store the CLIP image embeddings together with the path of each extracted page
for doc_id, embeddings in image_embeddings.items():
    for i, embedding in enumerate(embeddings):
        image_collection.add(
            ids=[f"image_{doc_id}_{i}"],
            embeddings=[embedding.flatten().tolist()],
            metadatas=[{"doc_id": str(doc_id), "image_path": all_images[doc_id][i]}]
        )

# Store the transcription chunks with their sentence-transformer embeddings
for audio_id, transcription_data in audio_transcriptions.items():
    transcription_chunks = transcription_data["chunks"]
    for chunk_id, chunk in enumerate(transcription_chunks):
        chunk_embedding = text_embedding_model.encode(chunk)
        audio_collection.add(
            ids=[f"audio_{audio_id}_chunk_{chunk_id}"],
            embeddings=[chunk_embedding.tolist()],
            metadatas=[{
                "audio_id": str(audio_id),
                "audio_path": audio_files[audio_id],
                "chunk_id": str(chunk_id)
            }],
            documents=[chunk]
        )
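A quick way to verify that everything was indexed is to check the collection sizes; ChromaDB collections expose a count() method:

# Confirm that both collections were populated.
print("Indexed images:", image_collection.count())
print("Indexed audio chunks:", audio_collection.count())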
Our RAG system is almost ready! The only thing left to do is set up retrieval from the ChromaDB vector database.
For example, let's try retrieving the top two results from both the image and audio collections using a text query.
def retrieve_data(query, top_k=2):
    # OpenCLIP embedding for the image collection
    query_embedding_image = embedding_function([query])[0]
    # SentenceTransformer embedding for the audio collection
    query_embedding_audio = text_embedding_model.encode(query)

    image_results = image_collection.query(
        query_embeddings=[query_embedding_image],
        n_results=top_k
    )

    audio_results = audio_collection.query(
        query_embeddings=[query_embedding_audio.tolist()],
        n_results=top_k
    )

    retrieved_images = [metadata["image_path"] for metadata in image_results["metadatas"][0] if "image_path" in metadata]
    retrieved_chunks = audio_results["documents"][0] if "documents" in audio_results else []

    return retrieved_images, retrieved_chunks

query = "What are the healthiest ingredients to use in the recipe you have?"
retrieved_images, retrieved_chunks = retrieve_data(query)
print("Retrieved Images:", retrieved_images)
print("Retrieved Audio Chunks:", retrieved_chunks)
The results for both retrievals are shown in the output below.
Retrieved Images: ['extracted_images/0_page_3.png', 'extracted_images/0_page_12.png']
Retrieved Audio Chunks: [” Lemon. Zest the lemon. Over. Smells incredible. And then finally seal the deal with a touch of grated parmesan cheese. Give your veg some attitude and you’ll get amazingly elegant dishes on a budget that are always guaranteed to impress. What more do you want from great cooking? Cheap to make, easy to cook and absolutely stunning. For me, food always has to be impressive. But when it comes to desserts,”, ” and one third of your protein, chicken. With a dish that takes literally minutes to put together, it’s really important to get everything organized. Everything needs to be at your fingertips. Touch of olive oil. Get that pan really nice and ready. Just starting to smoke. Drop the chicken in first. Just salt, pepper. Open up those little strands of chicken.”] |
For image retrieval, it returns the image paths we stored as metadata in the vector database. For audio retrieval, it returns the transcription chunks most closely related to the text query.
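Since the image results are just file paths, you can render them inline in the notebook with the IPython display helpers imported earlier; this is a convenience for inspection rather than part of the pipeline:

# Render the retrieved PDF pages inline in the notebook.
for path in retrieved_images:
    display(IPImage(filename=path, width=400))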
With data retrieval in place, we will set up the generative model using the Qwen-VL model. It is a multi-modal LLM that can handle text and image data and generate text responses from the multi-modal data we pass to it.
In our pipeline, Qwen-VL produces the final answer by taking both the retrieved images and the audio transcription chunks as context.
Let's set up the model with the following code.
vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
).cuda().eval()

# Bound the resolution at which the processor resizes input images
min_pixels = 256 * 256
max_pixels = 1024 * 1024
vl_model_processor = Qwen2VLProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
Then, we prepare the input data, process it, and generate the text output.
chat_template = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": retrieved_images[0]},  # First retrieved image
            {"type": "image", "image": retrieved_images[1]},  # Second retrieved image
            {"type": "text", "text": query},  # User query
            {"type": "text", "text": "Audio Context: " + " ".join(retrieved_chunks)}  # Include audio data
        ],
    }
]

text = vl_model_processor.apply_chat_template(
    chat_template, tokenize=False, add_generation_prompt=True
)

image_inputs, _ = process_vision_info(chat_template)
inputs = vl_model_processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = vl_model.generate(**inputs, max_new_tokens=100)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
The result is shown in the output below.
The healthiest ingredients to use in the recipe are:
1. **Lemon** – Gives a burst of citrus flavor and is a good source of vitamin C.
2. **Parmesan Cheese** – A good source of calcium and protein.
3. **Chicken** – A lean protein source that is rich in essential amino acids.
4. **Olive Oil** – A healthy fat that is rich in monounsaturated fatty acids.
5. **Zest** – Adds a burst of flavor
As you can see, the result takes both the image and audio data into account.
That's all you need to build a multi-modal RAG system. You can swap in your own files and adapt the code to your needs.
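For instance, one convenient adaptation is wrapping retrieval and generation into a single helper. The sketch below assumes the objects defined earlier (retrieve_data, vl_model, vl_model_processor) are still in scope, and the example query is made up:

def answer_query(query, top_k=2, max_new_tokens=100):
    # Retrieve the most relevant pages and transcription chunks for the query.
    images, chunks = retrieve_data(query, top_k=top_k)

    # Build a chat message containing the retrieved images and the audio context.
    messages = [{
        "role": "user",
        "content": (
            [{"type": "image", "image": path} for path in images]
            + [{"type": "text", "text": query},
               {"type": "text", "text": "Audio Context: " + " ".join(chunks)}]
        ),
    }]

    text = vl_model_processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)
    inputs = vl_model_processor(
        text=[text], images=image_inputs, padding=True, return_tensors="pt"
    ).to("cuda")

    generated_ids = vl_model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return vl_model_processor.batch_decode(trimmed, skip_special_tokens=True)[0]

print(answer_query("Which ingredients appear in both the PDF and the audio?"))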
Conclusion
Retrieval-augmented generation, or RAG, is a framework that enhances LLM output using external knowledge. In multi-modal RAG systems, we make use of data beyond plain text, such as image and audio data.
In this article, we implemented multi-modal RAG using text, audio, and image data. We used CLIP for image embeddings, Whisper for audio transcription, SentenceTransformer for text embeddings, ChromaDB for vector storage, and Qwen-VL for multimodal text generation.
I hope this has helped!