| base_model: | |
| - allenai/scibert_scivocab_cased | |
| datasets: | |
| - ExponentialScience/DLT-Tweets | |
| - ExponentialScience/DLT-Patents | |
| - ExponentialScience/DLT-Scientific-Literature | |
| language: | |
| - en | |
| license: cc-by-nc-4.0 | |
| library_name: transformers | |
| pipeline_tag: feature-extraction | |
| # LedgerBERT | |
| [Paper: DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain](https://huggingface.co/papers/2602.22045) | [GitHub Repository](https://github.com/dlt-science/DLT-Corpus) | |
| ## Model Description | |
| ### Model Summary | |
| LedgerBERT is a domain-adapted language model specialized for the Distributed Ledger Technology (DLT) field. It was created through continual pre-training of SciBERT on the DLT-Corpus, a comprehensive collection of 2.98 billion tokens from scientific literature, patents, and social media focused on blockchain, cryptocurrencies, and distributed ledger systems. | |
| LedgerBERT captures DLT-specific terminology and concepts, making it particularly effective for NLP tasks involving blockchain technologies, cryptocurrency discourse, smart contracts, consensus mechanisms, and related domain-specific content. | |
| - **Developed by:** Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu | |
| - **Model type:** BERT-base encoder (bidirectional transformer) | |
| - **Language:** English | |
| - **License:** CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International) | |
| - **Base model:** SciBERT (allenai/scibert_scivocab_cased) | |
| - **Training corpus:** DLT-Corpus (2.98 billion tokens) | |
| ### Model Architecture | |
| - **Architecture:** BERT-base | |
| - **Parameters:** 110 million | |
| - **Hidden size:** 768 | |
| - **Number of layers:** 12 | |
| - **Attention heads:** 12 | |
| - **Vocabulary size:** 30,522 (SciBERT vocabulary) | |
| - **Max sequence length:** 512 tokens | |
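As a rough sanity check on the figures above, the parameter count can be reproduced from the listed dimensions. The sketch below assumes the standard BERT-base layout, which the card does not spell out: a feed-forward size of 3,072, two token-type embeddings, and a pooler layer. The function name `bert_parameter_count` is illustrative only.

```python
def bert_parameter_count(vocab_size, hidden=768, layers=12, max_positions=512,
                         intermediate=3072, type_vocab=2):
    """Count the parameters of a BERT encoder from its dimensions.

    intermediate=3072 and type_vocab=2 are assumed BERT-base defaults;
    they are not stated in the model card.
    """
    layer_norm = 2 * hidden  # scale and bias

    # Word, position, and token-type embeddings plus one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections (weights + biases)
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward block: up-projection, down-projection, and one LayerNorm
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden) + layer_norm

    # Pooler: a single dense layer over the first token's hidden state
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_parameter_count(vocab_size=30_522)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

With the values listed above this comes to roughly 109.5 million parameters, consistent with the rounded figure of 110 million. A masked-language-modeling head would add a small number of extra parameters on top of the encoder.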
| ## Intended Uses | |
| ### Primary Use Cases | |
| LedgerBERT is designed for NLP tasks in the DLT domain, including, but not limited to: | |
| - **Named Entity Recognition (NER)**: Identifying DLT-specific entities such as consensus mechanisms (e.g., Proof of Stake), blockchain platforms (e.g., Ethereum, Hedera), and cryptographic concepts (e.g., Merkle tree, hashing) | |
| - **Text Classification**: Categorizing DLT-related documents, patents, or social media posts | |
| - **Sentiment Analysis**: Analyzing sentiment in cryptocurrency news and social media | |
| - **Information Extraction**: Extracting technical concepts and relationships from DLT literature | |
| - **Document Retrieval**: Building search systems for DLT content | |
| - **Question Answering (QA)**: Creating QA systems for blockchain and cryptocurrency topics | |
| ### Out-of-Scope Uses | |
| - **Real-time trading systems**: LedgerBERT should not be used as the sole basis for automated trading decisions | |
| - **Investment advice**: Not suitable for providing financial or investment recommendations without proper disclaimers | |
| - **General-purpose NLP**: LedgerBERT retains general language understanding but is optimized for DLT-specific tasks; it is not intended as a general-purpose model | |
| - **Legal or regulatory compliance**: Should not be used for legal interpretation without expert review | |
| ## Training Details | |
| ### Training Data | |
| LedgerBERT was continually pre-trained on the **DLT-Corpus**, consisting of: | |
| - **Scientific Literature**: 37,440 documents, 564M tokens (1978-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature | |
| - **Patents**: 49,023 documents, 1,296M tokens (1990-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Patents | |
| - **Social Media**: 22.03M documents, 1,120M tokens (2013-mid 2023). See https://huggingface.co/datasets/ExponentialScience/DLT-Tweets | |
| **Total:** 22.12 million documents, 2.98 billion tokens | |
| For more details, see: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402 | |
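The totals above follow directly from the three sub-corpus figures. The short check below recomputes them and also shows each source's share of the tokens. Note that the social media document count (22.03M) is already rounded in the card, so the document total is approximate.

```python
# Figures copied from the list above; token counts are in millions.
corpus = {
    "Scientific Literature": {"documents": 37_440, "tokens_millions": 564},
    "Patents": {"documents": 49_023, "tokens_millions": 1_296},
    "Social Media": {"documents": 22_030_000, "tokens_millions": 1_120},  # rounded in the card
}

total_documents = sum(part["documents"] for part in corpus.values())
total_tokens_millions = sum(part["tokens_millions"] for part in corpus.values())

# Fraction of all tokens contributed by each source
token_shares = {
    name: part["tokens_millions"] / total_tokens_millions
    for name, part in corpus.items()
}

print(f"Documents: {total_documents / 1e6:.2f} million")
print(f"Tokens: {total_tokens_millions / 1e3:.2f} billion")
for name, share in token_shares.items():
    print(f"{name}: {share:.1%} of tokens")
```

Patents contribute the largest share of tokens even though social media accounts for almost all of the documents, because individual patents are far longer than individual posts.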
| ### Training Procedure | |
| **Continual Pre-training:** | |
| Starting from SciBERT (which already captures multidisciplinary scientific content), LedgerBERT was trained using Masked Language Modeling (MLM) on the DLT-Corpus to adapt the model to DLT-specific terminology and concepts. | |
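To make the MLM objective concrete, here is a minimal pure-Python sketch of the masking step. It assumes the original BERT recipe: each non-special token is selected with a fixed probability, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The card only states that standard BERT masking was used, so the 80/10/10 split, the function name `mask_tokens`, and the token IDs in the example are assumptions for illustration.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_id, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token IDs.

    Labels are -100 (ignored by the loss) everywhere except at selected
    positions, where they hold the original token ID.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)

    for i, token in enumerate(token_ids):
        # Never mask special tokens; select other tokens with mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Hypothetical IDs: 101/102 stand in for the start/end tokens, 103 for the mask token.
example_ids = [101, 2054, 3021, 4512, 1996, 5011, 2933, 102]
masked, labels = mask_tokens(example_ids, vocab_size=30_522, mask_id=103,
                             special_ids={101, 102}, rng=random.Random(0))
print(masked)
print(labels)
```

In practice this step is usually delegated to a data collator rather than written by hand; the sketch is only meant to show what the model is asked to predict.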
| **Training hyperparameters:** | |
| - **Epochs:** 3 | |
| - **Learning rate:** 5×10⁻⁵ with linear decay schedule | |
| - **MLM probability:** 0.15 (standard BERT masking) | |
| - **Warmup ratio:** 0.10 | |
| - **Batch size:** 12 per device | |
| - **Sequence length:** 512 tokens | |
| - **Weight decay:** 0.01 | |
| - **Optimizer:** Stable AdamW | |
| - **Precision:** bfloat16 | |
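The learning-rate settings above describe a linear warmup followed by a linear decay. The sketch below shows one common way such a schedule is computed, assuming the rate rises from zero to the peak over the first 10% of steps and then falls linearly to zero. The exact scheduler used for LedgerBERT is not given in the card, so the function name `linear_schedule_lr` and the rounding of the warmup step count are assumptions.

```python
def linear_schedule_lr(step, total_steps, peak_lr=5e-5, warmup_ratio=0.10):
    """Learning rate at a given optimizer step for warmup + linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from peak_lr at the end of warmup to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical run length, for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```

For a run of 1,000 steps, the rate peaks at step 100 and is half the peak at step 550, the midpoint of the decay phase.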
| ## Limitations and Biases | |
| ### Known Limitations | |
| - **Language coverage**: English only; does not support other languages | |
| - **Temporal coverage**: Training data extends to mid-2023 for social media; may not capture very recent terminology | |
| - **Domain specificity**: Optimized for DLT tasks; may underperform on general-purpose benchmarks compared to models like RoBERTa | |
| - **Context length**: Limited to 512 tokens; longer documents require truncation or chunking | |
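For documents longer than the 512-token limit, one option is to split the token IDs into overlapping windows and encode each window separately. The following is a minimal sketch under stated assumptions: two positions per window are reserved for the start and end special tokens, and a 64-token overlap is an arbitrary choice, not a recommendation from the authors. The function name `chunk_token_ids` is illustrative.

```python
def chunk_token_ids(token_ids, max_length=512, overlap=64, num_special_tokens=2):
    """Split token IDs into overlapping windows that fit the model's limit.

    Each window holds at most max_length - num_special_tokens IDs, leaving
    room for the special tokens added when the window is encoded.
    """
    window = max_length - num_special_tokens
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    if not token_ids:
        return []

    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window reached the end of the document
        start += step
    return chunks


# A stand-in "document" of 1,200 token IDs
document_ids = list(range(1200))
chunks = chunk_token_ids(document_ids)
print([len(chunk) for chunk in chunks])
```

Fast tokenizers in Transformers can also produce overlapping windows directly through the `return_overflowing_tokens` and `stride` arguments, which avoids handling special tokens by hand.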
| ### Potential Biases | |
| The model may reflect biases present in the training data: | |
| - **Geographic bias**: English-language sources may over-represent certain regions | |
| - **Platform bias**: Social media data only from Twitter/X; other platforms not represented | |
| - **Temporal bias**: More recent DLT developments are more heavily represented | |
| - **Market bias**: Training during periods of market volatility may influence sentiment understanding | |
| - **Source bias**: Certain cryptocurrencies (e.g., Bitcoin, Ethereum) are more discussed than others | |
| ### Ethical Considerations | |
| - **Market manipulation risk**: Could be misused to analyze or generate content for market manipulation | |
| - **Investment decisions**: Should not be used as the sole basis for financial decisions without proper risk disclaimers | |
| - **Misinformation**: May reproduce or fail to identify false claims present in training data | |
| - **Privacy**: While usernames were removed from social media data, care should be taken not to re-identify individuals | |
| ## How to Use | |
| ### Basic Usage | |
| ```python | |
| from transformers import AutoTokenizer, AutoModel | |
| # Load model and tokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT") | |
| model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT") | |
| # Example text | |
| text = "Ethereum uses Proof of Stake consensus mechanism for transaction validation." | |
| # Tokenize and encode | |
| inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True) | |
| # Get contextual token embeddings | |
| outputs = model(**inputs) | |
| # Shape: (batch_size, sequence_length, 768) | |
| embeddings = outputs.last_hidden_state | |
| ``` | |
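The snippet above yields one vector per token. For tasks such as document retrieval, a single vector per text is often more convenient. One common approach, shown here as a sketch rather than the authors' method, is to average the token vectors while ignoring padding positions. Random tensors stand in for `outputs.last_hidden_state` and `inputs["attention_mask"]` so the example runs without downloading the model; `mean_pool` is an illustrative name.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings per sequence, skipping padded positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


# Stand-ins for model outputs: 2 sequences, 5 token positions, hidden size 768
hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])  # second sequence has 2 padding positions

sentence_embeddings = mean_pool(hidden, mask)
print(sentence_embeddings.shape)
```

With the real model, the call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`. Whether mean pooling or another strategy works best for LedgerBERT is not stated in the card and is worth validating on the target task.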
| ### Fine-tuning for NER | |
| ```python | |
| from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer | |
| # Example tag set (placeholder); replace with the labels of your NER task | |
| label_list = ["O", "B-ENT", "I-ENT"] | |
| # Load for token classification | |
| tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT") | |
| model = AutoModelForTokenClassification.from_pretrained( | |
|     "ExponentialScience/LedgerBERT", | |
|     num_labels=len(label_list),  # Set based on your NER task | |
| ) | |
| # Fine-tune on your dataset | |
| training_args = TrainingArguments( | |
|     output_dir="./results", | |
|     learning_rate=1e-5, | |
|     per_device_train_batch_size=16, | |
|     num_train_epochs=20, | |
|     warmup_steps=500, | |
| ) | |
| # train_dataset and eval_dataset are placeholders for your own tokenized datasets with token-aligned labels | |
| trainer = Trainer( | |
|     model=model, | |
|     args=training_args, | |
|     train_dataset=train_dataset, | |
|     eval_dataset=eval_dataset, | |
| ) | |
| trainer.train() | |
| ``` | |
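Before training, word-level NER labels have to be aligned with the subword tokens the tokenizer produces. A common convention, sketched below, labels only the first subword of each word and marks special tokens and continuation subwords with -100 so the loss ignores them. The `word_ids` list mimics what a fast tokenizer's `BatchEncoding.word_ids()` returns; the sentence split, subword boundaries, and label IDs are made up for illustration.

```python
def align_labels_with_tokens(word_ids, word_labels, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    word_ids holds, for each token, the index of the word it came from,
    or None for special tokens.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(ignore_index)          # continuation subword
        previous_word = word_id
    return aligned


# Hypothetical example: 5 words, where the first word splits into two subwords
words = ["Ethereum", "uses", "Proof", "of", "Stake"]
word_labels = [1, 0, 2, 3, 3]                  # made-up label IDs
word_ids = [None, 0, 0, 1, 2, 3, 4, None]      # as returned by a fast tokenizer

token_labels = align_labels_with_tokens(word_ids, word_labels)
print(token_labels)
```

Labeling every subword instead of only the first is an equally valid choice; the important part is that the evaluation uses the same convention as training.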
| ### Fine-tuning for Sentiment Analysis | |
| A fine-tuned version for market sentiment is available at: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment | |
| ```python | |
| from transformers import AutoTokenizer, AutoModelForSequenceClassification | |
| tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment") | |
| model = AutoModelForSequenceClassification.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment") | |
| text = "Bitcoin reaches new all-time high amid institutional adoption" | |
| inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True) | |
| outputs = model(**inputs) | |
| # Index of the highest-scoring class; model.config.id2label maps it to a label name | |
| predictions = outputs.logits.argmax(dim=-1) | |
| ``` | |
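The `argmax` above returns only a class index. To get a score for each class, the logits can be passed through a softmax. A minimal pure-Python sketch follows; the logit values are made up for illustration, and the label name for each index should be read from `model.config.id2label` rather than assumed.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


example_logits = [-1.2, 0.3, 2.1]  # made-up values for a three-class model
probabilities = softmax(example_logits)
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)

print([round(p, 3) for p in probabilities])
print(predicted_index)
```

With PyTorch tensors the same result is available through `outputs.logits.softmax(dim=-1)`.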
| ## Citation | |
| If you use LedgerBERT in your research, please cite: | |
| ```bibtex | |
| @misc{hernandez2026dlt-corpus, | |
| title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain}, | |
| author={Walter Hernandez Cruz and Peter Devine and Nikhil Vadgama and Paolo Tasca and Jiahua Xu}, | |
| year={2026}, | |
| eprint={2602.22045}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CL}, | |
| url={https://arxiv.org/abs/2602.22045}, | |
| } | |
| ``` | |
| ## Related Resources | |
| - **DLT-Corpus Collection**: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402 | |
| - **Scientific Literature Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature | |
| - **Patents Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Patents | |
| - **Social Media Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Tweets | |
| - **Sentiment Analysis Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News | |
| - **Fine-tuned Market Sentiment Model**: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment | |
| ## Model Card Contact | |
| For questions or feedback about LedgerBERT, please open an issue on the model repository or contact the authors through the DLT-Corpus GitHub repository: https://github.com/dlt-science/DLT-Corpus |