Top 10 languages (Percentage of total tokens in the data):
English: 26%
Indonesian: 11%
Russian: 11%
Portuguese: 5%
Spanish: 4%
Portuguese: 4%
Polish: 3%
Dutch: 2%
Turkish: 1%
Chinese: 0.1%
CommonCrawl, plus synthetic data generated with GPT-3.5-Turbo and GPT-4
Initialization: xlm-roberta-large, which was pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages.
First stage: contrastive pre-training with 1 billion weakly supervised text pairs.
Second stage: fine-tuning on datasets from the E5-Mistral paper, i.e. synthetic data generated with GPT-3.5-Turbo and GPT-4.
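A minimal sketch of the in-batch-negative contrastive (InfoNCE) objective commonly used for this kind of weakly supervised pair pre-training; the function name, temperature value, and embedding sizes are illustrative assumptions, not taken from the E5 codebase.

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb, passage_emb: (batch, dim) embeddings of weakly aligned text pairs."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # Cosine-similarity matrix; the diagonal holds the positive pairs,
    # and every other passage in the batch serves as a negative.
    logits = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Usage with random stand-in embeddings:
loss = info_nce_loss(torch.randn(32, 1024), torch.randn(32, 1024))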
Top languages (by number of examples in the data):
1. English (en)
2. Russian (ru)
3. Japanese (ja)
4. Chinese (zh)
5. French (fr)
6. German (de)
7. Portuguese (pt)
8. Dutch (nl)
9. Polish (pl)
10. Spanish (es)
Monolingual data from CommonCrawl and Wikipedia. Bilingual translation pairs from web-page bitext mining.
17B monolingual sentences from CommonCrawl (version 2019-35) and Wikipedia (05-21-2020 dump). 6B translation pairs from bilingual data. For each language, the maximum number of sentences is capped at 100 million. The data cover 109+ languages.
First stage: pre-training the encoder with masked language modeling (MLM) on monolingual data and translation language modeling (TLM) on bilingual sentence pairs.
Second stage: dual-encoder training. Two versions of the model were trained: one uses the public multilingual cased BERT vocabulary (119,547 tokens), and the other uses a customized vocabulary extracted from the training data.
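A simplified sketch of the translation-ranking objective with an additive margin used for dual-encoder training on bitext pairs; this is a one-directional version with illustrative margin and scale values (the full objective also sums the target-to-source direction).

import torch
import torch.nn.functional as F

def additive_margin_ranking_loss(src_emb: torch.Tensor, tgt_emb: torch.Tensor,
                                 margin: float = 0.3, scale: float = 20.0) -> torch.Tensor:
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    sim = src @ tgt.T                                              # (batch, batch) cosine similarities
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)  # subtract the margin from the positives only
    labels = torch.arange(sim.size(0), device=sim.device)           # the aligned translation sits on the diagonal
    return F.cross_entropy(scale * sim, labels)

loss = additive_margin_ranking_loss(torch.randn(64, 768), torch.randn(64, 768))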
Top languages (by number of examples in the data):
1. English (en)
2. Russian (ru)
3. Japanese (ja)
4. Chinese (zh)
5. French (fr)
6. German (de)
7. Portuguese (pt)
8. Dutch (nl)
9. Polish (pl)
10. Spanish (es)
Multiple datasets, including Wikipedia, Reddit, Stackexchange, and MS-Marco
First stage: Contrastive pre-training with 1B multilingual text pairs from 100 languages.
Data from the Wikipedia, mC4, Multilingual CC News, NLLB, Reddit, S2ORC, Stackexchange, xP3, and miscellaneous SBERT datasets.
Second stage: supervised fine-tuning on 1.6M labeled examples from the MS-Marco Passage & Document, NQ, TriviaQA, SQuAD, NLI, ELI5, NLLB, DuReader Retrieval, FEVER, HotpotQA, Quora Duplicate Questions, Mr. TyDi, and MIRACL datasets.
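If this entry is the multilingual E5 family, retrieval use typically prepends "query: " and "passage: " prefixes and mean-pools token embeddings. The usage sketch below is hedged: the checkpoint name and prefixes are assumptions drawn from the E5 papers, not information stated in this table.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "intfloat/multilingual-e5-large"  # assumed checkpoint, for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)          # ignore padding when averaging
    emb = (out * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(emb, dim=-1)

q = embed(["query: how do vaccines work"])
p = embed(["passage: Vaccines train the immune system to recognize pathogens."])
print((q @ p.T).item())                                   # cosine similarity of the pair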
The model is also capable of embedding and generation in non-English languages as seen in its TyDi QA performance.
However, major performance gains on non-English tasks are likely possible through both data and architecture changes.
Based on Mistral 7B. Fine-tuned with adaptations of the E5 and Tulu 2 data.
The final model was fine-tuned from Mistral 7B and Mixtral 8x7B. S2ORC was added to E5 to increase its coverage of scientific data.
The model is also capable of embedding and generation in non-English languages as seen in its TyDi QA performance.
However, major performance gains on non-English tasks are likely possible through both data and architecture changes.
Adaptations of E5 and Tulu 2 data
The final model was fine-tuned from Mistral 7B and Mixtral 8x7B. S2ORC was added to E5 to increase its coverage of scientific data.
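Since these models are decoder-only LMs repurposed for embeddings, a common way to obtain a text vector is to mean-pool the final hidden states. The sketch below illustrates that idea only; the base checkpoint name and pooling choice are assumptions, not the model's official embedding interface.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"   # assumed base checkpoint, for illustration only
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")  # requires accelerate

def embed(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (1, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean over non-padding tokens
    return F.normalize(emb.float(), dim=-1)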
Three-stage training:
Stage 1: Web Crawl for Topic Similarity (1.4B question-and-answer pairs)
Stage 2: Search Queries for Content Quality (over 3 million search queries from search engines)
Stage 3: Embeddings Optimized for Compression: The final stage ensures that the models work well with vector compression methods
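The sort of compression the third stage targets can be illustrated with simple scalar (int8) and binary quantization of the output vectors; the value ranges and bit packing below are illustrative, not Cohere's actual scheme.

import numpy as np

def quantize_int8(emb: np.ndarray, lo: float = -1.0, hi: float = 1.0) -> np.ndarray:
    """Map float values in [lo, hi] to int8, a 4x size reduction vs float32."""
    scaled = (np.clip(emb, lo, hi) - lo) / (hi - lo)       # -> [0, 1]
    return (scaled * 255 - 128).astype(np.int8)

def quantize_binary(emb: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension, packed into bits (32x smaller than float32)."""
    return np.packbits((emb > 0).astype(np.uint8), axis=-1)

emb = np.random.randn(4, 1024).astype(np.float32)
print(quantize_int8(emb).shape, quantize_binary(emb).shape)  # (4, 1024) (4, 128)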
Assumption: this model is a fine-tuned version of one of Cohere's foundation models, trained on 1B English text pairs.
Trained on an extensive corpus of text pairs from a broad spectrum of domains.
Includes data from domains such as finance, science, medicine, and law; no other information about the training data is available. The authors used training techniques
from the RetroMAE and SetFit papers.
Based on BERT uncased. Fine-tuned on datasets such as MNLI and SNLI.
Pre-trained using BERT uncased, which was trained on approximately 3.3 billion words from the BookCorpus dataset (about 11,000 unpublished books) and English Wikipedia.
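A minimal sketch of the SBERT-style objective typically used when fine-tuning a BERT encoder on NLI data such as MNLI/SNLI: classify the concatenation (u, v, |u - v|) of the two pooled sentence embeddings into entailment/neutral/contradiction. Layer sizes and names are illustrative assumptions.

import torch
import torch.nn as nn

class NliHead(nn.Module):
    def __init__(self, dim: int = 768, num_labels: int = 3):
        super().__init__()
        self.classifier = nn.Linear(3 * dim, num_labels)

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # u, v: (batch, dim) pooled embeddings of premise and hypothesis
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(features)

head = NliHead()
logits = head(torch.randn(8, 768), torch.randn(8, 768))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))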
700M pairs from an undisclosed proprietary dataset for pre-training and 30M triplets for fine-tuning.
The model is built on a custom dataset assembled from extensive internet data,
ensuring no overlap with MTEB test sets (except MS Marco), and is trained on over 700 million pairs with fine-tuning on 30 million high-quality triplets.
700M pairs from an undisclosed proprietary dataset for pre-training and 30M triplets for fine-tuning.
The model is built on a custom dataset assembled from extensive internet data,
ensuring no overlap with MTEB test sets (except MS Marco), and is trained on over 700 million pairs with fine-tuning on 30 million high-quality triplets.
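A sketch of a triplet objective of the kind commonly used for this sort of (anchor, positive, negative) fine-tuning stage; the margin value and the use of cosine distance are assumptions for illustration.

import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor, negative: torch.Tensor,
                 margin: float = 0.2) -> torch.Tensor:
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    d_pos = 1 - (a * p).sum(-1)          # cosine distance to the positive
    d_neg = 1 - (a * n).sum(-1)          # cosine distance to the negative
    return F.relu(d_pos - d_neg + margin).mean()

loss = triplet_loss(torch.randn(16, 1024), torch.randn(16, 1024), torch.randn(16, 1024))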
Up to 1024 (user defined)
512
Open Source
2024
* Allowed for commercial usage
** Cost to deploy an embedding model as a single instance of a SageMaker endpoint
*** Cost to deploy an embedding model as a single instance of an EC2 endpoint; you need to dockerize the model and perform all the infrastructure-related steps yourself
Crafted by seasoned machine learning engineers with extensive backgrounds in top-tier tech companies.