Questions about the training data

#33

by xxxcliu - opened Mar 5, 2024

Mar 5, 2024

I really appreciate your work!
I have some questions about the specifics of the training data. Regarding the sentence in the paper: "Given a pretrained LLM, we append an [EOS] token to the end of the query and document, and then feed them into the LLM to obtain the query and document embeddings by taking the last layer [EOS] vector."
What is the format of each training example? Is it like "Instruct: {task_definition} \n Query: {q+}[EOS]{doc}[EOS]"? Or are the query and document fed into the model separately? Are the queries in a batch the same? Is there corresponding code on your GitHub that I can refer to for this part? I couldn't find it.

intfloat

Owner Mar 6, 2024

It follows a bi-encoder architecture: the query follows the "Instruct: {task_definition} \n Query: {q+} [EOS]" format, and the document follows the "{doc} [EOS]" format. The queries are randomly sampled from the training set; they are not the same within a batch.
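
As a rough illustration of the two formats above (not the authors' training code), the query and document strings could be built separately like this. The helper names `format_query` and `format_document` are hypothetical, the `"</s>"` EOS string is an assumption (in practice one would take it from the tokenizer), and the exact whitespace around the newline is also an assumption.

```python
# Minimal sketch of the query/document input formats described above.
# Assumption: "</s>" stands in for the model's [EOS] token string.
EOS = "</s>"


def format_query(task_definition: str, query: str, eos: str = EOS) -> str:
    """Build the query-side input: instruction, then the query, then [EOS]."""
    return f"Instruct: {task_definition}\nQuery: {query}{eos}"


def format_document(doc: str, eos: str = EOS) -> str:
    """Build the document-side input: the raw document, then [EOS]."""
    return f"{doc}{eos}"


if __name__ == "__main__":
    # The query and the document are encoded as two separate sequences.
    print(format_query("Given a web search query, retrieve relevant passages",
                       "how to bake bread"))
    print(format_document("Bread is baked from flour, water, yeast and salt."))
```

The point of the sketch is that the two strings are never concatenated: each is passed through the encoder on its own, and the embedding is read from the last-layer [EOS] position.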

We do not have public training code for this model, but you can refer to https://github.com/texttron/tevatron for training embedding models.

xxxcliu

Mar 6, 2024

Thanks very much for your reply!!!
Is the bi-encoder actually a single mistral-7b model (a Siamese network) or two separate models (with different parameters)? Besides, what is the specific format of a single training datapoint? As I understand it, it should contain a query, a positive doc, and a hard negative doc, with negative sampling performed during training?

intfloat

Owner Mar 7, 2024

Yes, it is a Siamese network with shared parameters.

As you said, a training datapoint has a query, a positive doc and a hard negative doc. Other documents from the same batch are used as in-batch negatives.
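
To make the role of in-batch negatives concrete, here is a small pure-Python sketch of a contrastive (InfoNCE-style) loss over a batch of (query, positive doc, hard negative doc) embeddings. It is an illustration under stated assumptions, not the training code for this model: the function names `cosine` and `info_nce_loss` are hypothetical, cosine similarity as the scoring function is an assumption, and the temperature value is a placeholder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, positives, hard_negatives, temperature=0.05):
    """Mean contrastive loss over a batch.

    For query i, the target is positives[i]. Every other positive in the
    batch and every hard negative in the batch serves as a negative.
    The temperature default is an assumed placeholder value.
    """
    # Candidate pool shared by all queries: all positives, then all hard negatives.
    candidates = positives + hard_negatives
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the i-th candidate as the correct class.
        total += log_z - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    positives = [[1.0, 0.0], [0.0, 1.0]]
    hard_negatives = [[-1.0, 0.0], [0.0, -1.0]]
    print(info_nce_loss(queries, positives, hard_negatives))
```

Because the network is Siamese, the same encoder would produce all three sets of embeddings; only the input formatting differs between queries and documents.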

xxxcliu

Mar 10, 2024

Got it. Thanks!
