Usage with Faiss

#32

by hakanbicer9608 - opened Mar 13, 2024

Discussion

Lue-C

Jan 10, 2025

Hi,
I want to use multilingual-e5-large with FAISS and have some code working, but the results are much worse than those I get with chromadb. I suspect the cause is the index type or the norm of the vector space. This is my code:

import numpy as np
import faiss
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
import pickle

text_chunks = ["Dies ist Chunk 1", "Dies ist Chunk 2", "Dies ist Chunk 3"]
metadatas = [
    {"source": "Dokument A", "page": 1},
    {"source": "Dokument B", "page": 5},
    {"source": "Dokument C", "page": 3}
]

# initializing embedding model
model = HuggingFaceEmbeddings(cache_folder="../models/embeddings/multilingual-e5-large_new", model_name='intfloat/multilingual-e5-large')

# converting text chunks to vectors
embeddings = model.embed_documents(text_chunks)

# convert to float32
embeddings = np.array(embeddings).astype('float32')

# get dimension of vector space
dimension = embeddings.shape[1]

# create faiss index
index = faiss.IndexFlatIP(dimension)

# add to index
index.add(embeddings)

# save index
faiss.write_index(index, "text_chunks_index.faiss")

# saving metadata
with open("metadatas.pkl", "wb") as f:
    pickle.dump(metadatas, f)

# defining and embedding query
query = "Just an example query"
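# embed_query takes a single string (embed_documents expects a list of strings);
# FAISS needs a 2-D float32 array of shape (n_queries, dimension)
query_vector = np.array([model.embed_query(query)], dtype='float32')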

# load the saved index and search for similar vectors
loaded_index = faiss.read_index("text_chunks_index.faiss")
k = 2
distances, indices = loaded_index.search(query_vector, k)

print("\n search results:")
for i, idx in enumerate(indices[0]):
    print(f"text: {text_chunks[idx]}")
    print(f"metadata: {metadatas[idx]}")
    print(f"Inner product: {distances[0][i]}")

When I use a Chroma database instead, like this:

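import chromadb
from langchain.vectorstores import Chroma
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings

# docs (the chunked documents) and db_directory are defined elsewhere
embeddings = HuggingFaceEmbeddings(cache_folder='../models/embeddings/multilingual-e5-large_new', model_name='intfloat/multilingual-e5-large')
db = Chroma.from_documents(docs, embeddings, persist_directory=db_directory)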

query = "Just an example query"
matches = db.similarity_search_with_relevance_scores(query, k=2)

I get far better results. How can I adjust the FAISS code to get results similar or equal to those obtained with Chroma?

Regards
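
On the suspicion about the norm of the vector space, here is a minimal sketch of why normalization matters when an inner-product index is compared with an L2-based one (Chroma's default distance is, as far as I know, squared L2). It uses only NumPy and made-up 4-dimensional vectors, not real E5 embeddings. The helper names `l2_normalize`, `rank_by_inner_product` and `rank_by_l2` are hypothetical. For unit-length vectors the squared L2 distance equals 2 - 2 * (inner product), so both metrics rank the documents identically. Without normalization, a long vector can win on inner product alone.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row of a 2-D array to unit length."""
    x = np.asarray(x, dtype="float32")
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def rank_by_inner_product(docs, query):
    """Indices of docs sorted by descending inner product with query."""
    return np.argsort(-(docs @ query))

def rank_by_l2(docs, query):
    """Indices of docs sorted by ascending Euclidean distance to query."""
    return np.argsort(np.linalg.norm(docs - query, axis=1))

# Toy vectors (illustrative only): doc 0 is long but only loosely aligned
# with the query, doc 1 is short but closely aligned.
docs = np.array([[3.0, 3.0, 0.0, 0.0],
                 [0.1, 0.9, 0.0, 0.0],
                 [1.0, 0.0, 0.0, 0.0]], dtype="float32")
query = np.array([0.0, 1.0, 0.0, 0.0], dtype="float32")

# Raw vectors: the two metrics disagree.
raw_ip_rank = rank_by_inner_product(docs, query)
raw_l2_rank = rank_by_l2(docs, query)

# Unit-length vectors: inner product is cosine similarity, and both agree.
docs_n = l2_normalize(docs)
query_n = l2_normalize(query[None, :])[0]
norm_ip_rank = rank_by_inner_product(docs_n, query_n)
norm_l2_rank = rank_by_l2(docs_n, query_n)

# Check the identity ||a - b||^2 = 2 - 2 * (a . b) for unit vectors.
sq_l2 = np.sum((docs_n - query_n) ** 2, axis=1)
ip = docs_n @ query_n

print("raw inner-product ranking:", raw_ip_rank.tolist())
print("raw L2 ranking:           ", raw_l2_rank.tolist())
print("normalized IP ranking:    ", norm_ip_rank.tolist())
print("normalized L2 ranking:    ", norm_l2_rank.tolist())
```

With FAISS itself, calling `faiss.normalize_L2` on the float32 document array before `index.add` and on the query array before `search` should have the same effect, since it normalizes the rows in place. Whether this explains the gap to Chroma here is an assumption that would need checking against the real embeddings.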
