How to use PerceiveIO/tinyllama_92M with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="PerceiveIO/tinyllama_92M")
```

```python
# Load the tokenizer and model directly.
# A text-generation checkpoint is loaded with the causal-LM auto class.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("PerceiveIO/tinyllama_92M")
model = AutoModelForCausalLM.from_pretrained("PerceiveIO/tinyllama_92M")
```
How to use PerceiveIO/tinyllama_92M with vLLM:

Install from pip and serve the model:

```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "PerceiveIO/tinyllama_92M"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "PerceiveIO/tinyllama_92M",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker:

```bash
docker model run hf.co/PerceiveIO/tinyllama_92M
```
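The curl call to the completions endpoint can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_completion_request` and `complete` are hypothetical, and reading the text from `choices[0]["text"]` assumes the server follows the usual OpenAI-style completions response shape.

```python
import json
import urllib.request

MODEL_ID = "PerceiveIO/tinyllama_92M"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build a POST request for the /v1/completions endpoint (hypothetical helper)."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text.

    Assumes an OpenAI-style response body: {"choices": [{"text": ...}, ...]}.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Requires a server that is already running, for example:
# print(complete("Once upon a time,"))
```

For a server listening on another port, pass a different `base_url`, for example `base_url="http://localhost:30000"`.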
How to use PerceiveIO/tinyllama_92M with SGLang:

Install from pip and serve the model:

```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "PerceiveIO/tinyllama_92M" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "PerceiveIO/tinyllama_92M",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images:

```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "PerceiveIO/tinyllama_92M" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "PerceiveIO/tinyllama_92M",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
How to use PerceiveIO/tinyllama_92M with Docker Model Runner:

```bash
docker model run hf.co/PerceiveIO/tinyllama_92M
```
library_name: transformers
tags: []

# tinyllamas_92M

## Model Details
```python
max_seq_len = 256
vocab_size = 8192
dim = 768
n_layers = 12
n_heads = 12
n_kv_heads = 12
```
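As a rough sanity check on the model name, the settings above can be turned into a parameter estimate. This is a sketch under stated assumptions, not the author's accounting: it assumes a llama2.c-style Llama block (bias-free attention projections, a three-matrix SwiGLU feed-forward whose hidden size is 2/3 of `4 * dim` rounded up to `multiple_of = 32`, two RMSNorm weight vectors per layer, and an output projection tied to the token embedding). The function name `estimate_params` is hypothetical.

```python
def estimate_params(dim, n_layers, n_heads, n_kv_heads, vocab_size,
                    multiple_of=32, tied_embeddings=True):
    """Estimate the parameter count of a llama2.c-style model (assumed layout)."""
    head_dim = dim // n_heads
    kv_dim = n_kv_heads * head_dim

    # Attention: wq and wo are dim x dim; wk and wv are dim x kv_dim.
    attention = 2 * dim * dim + 2 * dim * kv_dim

    # SwiGLU feed-forward: hidden size is 2/3 of 4*dim, rounded up to multiple_of.
    hidden = int(2 * (4 * dim) / 3)
    hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)
    feed_forward = 3 * dim * hidden

    # Two RMSNorm weight vectors per layer (attention norm and FFN norm).
    norms = 2 * dim
    per_layer = attention + feed_forward + norms

    embedding = vocab_size * dim
    output = 0 if tied_embeddings else vocab_size * dim
    final_norm = dim
    return n_layers * per_layer + embedding + output + final_norm


total = estimate_params(dim=768, n_layers=12, n_heads=12,
                        n_kv_heads=12, vocab_size=8192)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes to about 91.2M parameters, close to the 92M in the model name. An untied output projection would add another `vocab_size * dim` weights.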
### Training Data

- https://huggingface.co/datasets/roneneldan/TinyStories
- Tokenized using: https://github.com/karpathy/llama2.c?tab=readme-ov-file#custom-tokenizers
#### Training Hyperparameters

```python
batch_size = 64  # if gradient_accumulation_steps > 1, this is the micro-batch size
dropout = 0.0
# adamw optimizer
gradient_accumulation_steps = 8  # used to simulate larger batch sizes
learning_rate = 1e-3  # max learning rate
max_iters = 34000  # total number of training iterations
weight_decay = 3e-4
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 1000  # how many steps to warm up for
```
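The hyperparameters above imply a learning-rate schedule and a per-step token budget, which the following sketch makes concrete. It assumes the schedule used by llama2.c-style training scripts: linear warmup followed by cosine decay. The values `lr_decay_iters = max_iters` and `min_lr = 0.0` are assumptions, since the card does not list them. The token count also assumes that `gradient_accumulation_steps` is the total across all GPUs and that sequences are `max_seq_len = 256` tokens long.

```python
import math

# Values taken from the hyperparameter listing.
batch_size = 64
gradient_accumulation_steps = 8
learning_rate = 1e-3
max_iters = 34000
warmup_iters = 1000
max_seq_len = 256

# Assumptions: not stated in the card.
lr_decay_iters = max_iters
min_lr = 0.0


def get_lr(it):
    """Linear warmup to learning_rate, then cosine decay to min_lr."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)


# Tokens consumed per optimizer step under the assumptions above.
tokens_per_iter = batch_size * gradient_accumulation_steps * max_seq_len

for step in (0, 500, 1000, 17500, 34000):
    print(step, f"{get_lr(step):.2e}")
print("tokens per iteration:", tokens_per_iter)
```

With these settings each optimizer step covers 512 sequences, or 131,072 tokens.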
### Results

Trained on 4x V100 GPUs.

```text
Run summary:
iter        34000
loss/train  0.8704
loss/val    0.9966
tokens      983040000
```