Usage with Faiss
#32
by hakanbicer9608 - opened
Hi,
I want to use multilingual-e5-large with FAISS and got some code working. But the results are really bad compared to the results obtained with chromadb. I think this could be due to the indexing method/norm of the vector space. This is my code:
import numpy as np
import faiss
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
import pickle
text_chunks = ["Dies ist Chunk 1", "Dies ist Chunk 2", "Dies ist Chunk 3"]
metadatas = [
    {"source": "Dokument A", "page": 1},
    {"source": "Dokument B", "page": 5},
    {"source": "Dokument C", "page": 3},
]
# initializing embedding model
model = HuggingFaceEmbeddings(cache_folder="../models/embeddings/multilingual-e5-large_new", model_name='intfloat/multilingual-e5-large')
# converting text chunks to vectors
embeddings = model.embed_documents(text_chunks)
# convert to float32
embeddings = np.array(embeddings).astype('float32')
# get dimension of vector space
dimension = embeddings.shape[1]
# create faiss index
index = faiss.IndexFlatIP(dimension)
# add to index
index.add(embeddings)
# save index
faiss.write_index(index, "text_chunks_index.faiss")
# saving metadata
with open("metadatas.pkl", "wb") as f:
    pickle.dump(metadatas, f)
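# defining and embedding query
query = "Just an example query"
# embed_query takes a single string; wrap the result so the array has shape (1, dimension)
query_vector = np.array([model.embed_query(query)]).astype('float32')
# load the saved index
loaded_index = faiss.read_index("text_chunks_index.faiss")
# search for similar vectors
k = 2
distances, indices = loaded_index.search(query_vector, k)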
print("\n search results:")
for i, idx in enumerate(indices[0]):
    print(f"text: {text_chunks[idx]}")
    print(f"metadata: {metadatas[idx]}")
    print(f"Inner product: {distances[0][i]}")
When I use a Chroma database like this:
import chromadb
from langchain.vectorstores import Chroma
embeddings = HuggingFaceEmbeddings(cache_folder='../models/embeddings/multilingual-e5-large_new', model_name='intfloat/multilingual-e5-large')
db = Chroma.from_documents(docs, embeddings, persist_directory=db_directory)
query = "Just an example query"
matches = db.similarity_search_with_relevance_scores(query, k=2)
I get far better results. How can I adjust the faiss code to get results similar or equal to those obtained with chroma?
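To illustrate what I mean by the norm of the vector space, here is a minimal numpy-only sketch. The 2-D vectors are made-up toy values, not real model output, and `l2_normalize` is a hypothetical helper. It only shows that a raw inner product can rank a long vector first even when another vector points in a closer direction, while the inner product of unit-length vectors (cosine similarity) does not.

```python
import numpy as np

# Toy 2-D "embeddings" (hypothetical values, not real model output).
docs = np.array(
    [
        [10.0, 0.0],  # long vector, 45 degrees away from the query
        [0.6, 0.8],   # unit-length vector, close in direction to the query
    ],
    dtype="float32",
)
query = np.array([[0.7071, 0.7071]], dtype="float32")


def l2_normalize(x):
    """Scale each row to unit length (hypothetical helper)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Raw inner product: the score a flat inner-product index computes.
raw_scores = query @ docs.T
# Inner product of unit vectors, which equals cosine similarity.
cos_scores = l2_normalize(query) @ l2_normalize(docs).T

raw_best = int(np.argmax(raw_scores))
cos_best = int(np.argmax(cos_scores))
print("best match by raw inner product:", raw_best)
print("best match by cosine similarity:", cos_best)
```

In the FAISS code this would correspond to normalizing both the document matrix and the query vector before `index.add` and `search`; `faiss.normalize_L2` does that in place for float32 arrays. Whether this actually explains the gap to Chroma is an assumption I have not verified.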
Regards