Questions about the training data
I really appreciate your work!
I have some questions about the specifics of the training data. Regarding the sentence in the paper: "Given a pretrained LLM, we append an [EOS] token to the end of the query and document, and then feed them into the LLM to obtain the query and document embeddings by taking the last layer [EOS] vector."
What is the format of each training example? Is it "Instruct: {task_definition} \n Query: {q+}[EOS]{doc}[EOS]", or are the query and document fed into the model separately? Are all the queries in a batch the same? Is there corresponding code on your GitHub that I can refer to for this part? I couldn't find it.
It follows a bi-encoder architecture: the query uses the "Instruct: {task_definition} \n Query: {q+} [EOS]" format, and the document uses the "{doc} [EOS]" format. The queries are randomly sampled from the training set, so they are not the same within a batch.
We do not have public training code for this model, but you can refer to https://github.com/texttron/tevatron for training embedding models.
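To make the two input formats concrete, here is a minimal sketch of how the query side and document side could be built as separate strings. The helper names, the example task definition, and the exact whitespace around the newline are assumptions for illustration. The literal "[EOS]" is a placeholder for the tokenizer's real end-of-sequence token, and this is not the authors' training code.

```python
# Hypothetical helpers; "[EOS]" is a placeholder for the tokenizer's actual
# end-of-sequence token (for example tokenizer.eos_token).
EOS = "[EOS]"


def format_query(task_definition: str, query: str, eos: str = EOS) -> str:
    """Query side: instruction, then the query text, then EOS."""
    return f"Instruct: {task_definition}\nQuery: {query}{eos}"


def format_document(doc: str, eos: str = EOS) -> str:
    """Document side: raw text followed by EOS, with no instruction prefix."""
    return f"{doc}{eos}"


# Illustrative task definition and texts (assumptions, not training data).
task = "Given a web search query, retrieve relevant passages that answer the query"
query_text = format_query(task, "how do bi-encoders work")
doc_text = format_document("A bi-encoder embeds queries and documents independently.")

# The two strings are encoded in separate forward passes; they are never
# concatenated into a single input sequence.
print(query_text)
print(doc_text)
```

Because the query and document are encoded independently, document embeddings can be computed once and reused across queries.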
Thanks very much for your reply!!!
Is the bi-encoder actually a single mistral-7b model (a Siamese network), or two separate models with different parameters? Also, what is the specific format of a single training datapoint? My understanding is that it contains a query, a positive doc, and a hard negative doc, with negative sampling performed during training. Is that right?
Yes, it is a Siamese network with shared parameters.
As you said, a training datapoint has a query, a positive doc and a hard negative doc. Other documents from the same batch are used as in-batch negatives.
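The datapoint layout described above (query, positive doc, hard negative doc, plus in-batch negatives) can be illustrated with a small contrastive loss in plain Python. This is only a sketch under assumptions: the InfoNCE-style loss, the cosine similarity, the temperature value, and the function names are illustrative choices, and the toy vectors stand in for embeddings that the single shared encoder would produce.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(queries, positives, hard_negatives, temperature=0.02):
    """Mean contrastive loss over a batch (temperature is an assumed value).

    For query i the candidate pool is every positive and every hard negative
    in the batch. Only positives[i] counts as correct; all other candidates
    act as negatives (in-batch negatives plus hard negatives).
    """
    q = [normalize(v) for v in queries]
    candidates = [normalize(v) for v in positives + hard_negatives]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, c) / temperature for c in candidates]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_denominator - logits[i]  # -log softmax at the positive
    return total / len(q)


# Toy 2-d "embeddings"; in practice all three lists come from the same model.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
hard_negatives = [[0.5, 0.5], [0.4, 0.6]]

aligned = info_nce_loss(queries, positives, hard_negatives)
swapped = info_nce_loss(queries, positives[::-1], hard_negatives)
print(aligned, swapped)
```

When each query is paired with its own positive, the loss is much lower than when the positives are swapped, which is the signal that pushes matching query-document pairs together.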
Got it. Thanks!