bert-base-multilingual-cased-segment1
This is a version of multilingual BERT (bert-base-multilingual-cased) in which the segment embedding for segment 1 is copied into the embedding for segment 0. Yes, that's all there is to it. We have found that this improves performance substantially in low-resource setups for word-level tasks (e.g. an average improvement of 2.5 LAS on a variety of UD treebanks). More details are to be released in our LREC2022 paper titled: Frustratingly Easy Performance Improvements for Cross-lingual Transfer: A Tale on BERT and Segment Embeddings.
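These embeddings are generated by the following code:
import torch
from transformers import AutoModel
baseEmbeddings = AutoModel.from_pretrained("bert-base-multilingual-cased")
# Keep a detached copy of the segment (token type) embeddings
tte = baseEmbeddings.embeddings.token_type_embeddings.weight.clone().detach()
# Overwrite segment 0 with segment 1; no_grad permits the in-place write to a parameter
with torch.no_grad():
    baseEmbeddings.embeddings.token_type_embeddings.weight[0,:] = tte[1,:]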
More details and other varieties can be found in the repo: https://bitbucket.org/robvanderg/segmentembeds/
Note that when using this model on a single-sentence task (or word-level task), the results should be similar to those obtained with the original bert-base-multilingual-cased when the token type id is set to 1 for all tokens.
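The note above can be checked with a minimal sketch. It assumes a tiny, randomly initialised BertModel (the config sizes, seed, and input ids are made up for illustration) in place of the real checkpoint, so nothing has to be downloaded. It compares the modified model fed segment id 0 against the unmodified model fed segment id 1.

```python
import copy

import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny configuration; the real checkpoint is much larger.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    type_vocab_size=2,
)
torch.manual_seed(0)
original = BertModel(config).eval()  # eval() disables dropout

# Apply the segment-embedding copy to a separate copy of the model.
modified = copy.deepcopy(original)
with torch.no_grad():
    tte = modified.embeddings.token_type_embeddings.weight
    tte[0, :] = tte[1, :].clone()

# Arbitrary token ids for a single "sentence".
input_ids = torch.tensor([[1, 5, 8, 13, 2]])
zeros = torch.zeros_like(input_ids)
ones = torch.ones_like(input_ids)

with torch.no_grad():
    out_modified = modified(input_ids=input_ids, token_type_ids=zeros).last_hidden_state
    out_original = original(input_ids=input_ids, token_type_ids=ones).last_hidden_state

# True when both setups produce the same hidden states.
same = torch.allclose(out_modified, out_original, atol=1e-6)
print(same)
```

With untouched weights the two setups feed identical embeddings into the encoder. After fine-tuning, the two models would be trained separately, which is presumably why the note says "similar" rather than "identical".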