Batch processing on a GPU?
I'm pretty new to the transformers package. Can anyone provide example code for how a Gemma 3 VLM can be used to batch process images on a CUDA GPU? In my case, I have a list of local files that I want to process using a common prompt. Currently I'm only able to process each image sequentially on my CUDA GPU.
To process images in bulk using the Gemma 3 VLM model on a CUDA GPU, you can use PyTorch along with Tesseract OCR to extract text from the images and then send those texts to the model for inference. First, install the necessary libraries like torch, transformers, pytesseract, and Pillow. Then, load the model and tokenizer using transformers, and use Tesseract to process each image individually. To optimize batch processing, you can loop through all the images in a directory and generate text for each of them. The code below illustrates this process, using the GPU to perform the inferences:
here !
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import pytesseract
import os

# Define the path to the Tesseract OCR executable (if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"  # Adjust as needed

# Function to process an image with OCR
def process_image(image_path):
    # Load the image
    img = Image.open(image_path)
    # Use Tesseract to extract text
    text = pytesseract.image_to_string(img)
    return text

# Function to process the batch of images
def process_batch(image_paths, model, tokenizer, device):
    texts = []
    for image_path in image_paths:
        print(f"Processing {image_path}...")
        # Step 1: Process the image with OCR (convert image to text)
        ocr_text = process_image(image_path)
        # Step 2: Use the model for inference (based on the extracted text)
        inputs = tokenizer(ocr_text, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_length=1024)
        # Decode the model's response
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        texts.append(generated_text)
    return texts

# Load the model and tokenizer
model_name = "gemma-3-4b-it"  # Or any other model you have
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda')

# Path to the folder with the images
image_folder = "/path/to/your/images"

# List of image paths
image_paths = [
    os.path.join(image_folder, fname)
    for fname in os.listdir(image_folder)
    if fname.endswith(('.jpg', '.png'))
]

# Process the images in bulk
generated_texts = process_batch(image_paths, model, tokenizer, 'cuda')

# Display the results
for idx, generated_text in enumerate(generated_texts):
    print(f"Generated text for image {image_paths[idx]}: {generated_text}\n")
@Ayorinha thanks for replying. I'm guessing you asked an LLM my question and pasted the response? What you provided doesn't make sense. The code is using pytesseract to extract the text from my image then feeding the text into Gemma 3 without any prompt from me. It's treating Gemma 3 like an LLM rather than a VLM, and it doesn't provide any prompt. This is not how Gemma 3 is meant to be used. I should be feeding the image and my prompt into Gemma 3. My aim is to do visual question answering (VQA) of the images I have.
I should mention I have already consulted with Sonnet 3.7 on this question and it wasn't able to figure it out. Maybe a more experienced transformers user could coax the right answer out of it, but I couldn't.
Sorry, man.
Explain to me what you did.
OMG, this should work correctly for visual question answering (VQA)?
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import os
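For reference, here is a minimal sketch of what batched VQA over local files with one shared prompt might look like. It is not a confirmed answer from this thread. It assumes a recent transformers release in which `processor.apply_chat_template` accepts a list of conversations and a `"path"` key for local images (swap in `"url"` or a PIL image under `"image"` if your version differs). The helper names `chunked`, `build_conversations`, and `answer_in_batches`, along with the batch size and token limit, are hypothetical choices for illustration.

```python
from typing import Iterator, Sequence

MODEL_ID = "google/gemma-3-4b-it"


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_conversations(image_paths: Sequence[str], prompt: str) -> list:
    """Build one single-turn conversation per image, all sharing `prompt`."""
    return [
        [
            {
                "role": "user",
                "content": [
                    # Assumption: the chat template accepts a local file under "path".
                    {"type": "image", "path": str(path)},
                    {"type": "text", "text": prompt},
                ],
            }
        ]
        for path in image_paths
    ]


def answer_in_batches(image_paths, prompt, batch_size=4, max_new_tokens=64):
    """Run the shared prompt over all images, `batch_size` images per forward pass."""
    # Heavy dependencies are imported here so the helpers above work without them.
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    # Left padding keeps every prompt flush against the tokens being generated.
    processor.tokenizer.padding_side = "left"
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    ).to("cuda")
    model.eval()

    answers = []
    for batch in chunked(image_paths, batch_size):
        inputs = processor.apply_chat_template(
            build_conversations(batch, prompt),
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
            padding=True,  # pad the prompts in this batch to a common length
        ).to(model.device, dtype=torch.bfloat16)
        prompt_len = inputs["input_ids"].shape[-1]
        with torch.inference_mode():
            outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Drop the prompt tokens so only the newly generated answer is decoded.
        answers.extend(
            processor.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
        )
    return answers
```

Calling `answer_in_batches(image_paths, "What animal is on the candy?")` would then return one answer string per image, in input order. The batch size is the main knob to tune against available GPU memory.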
Hi,
Apologies for the late reply, and thanks for reaching out to us. Could you please confirm whether the above-mentioned issue is resolved, or whether you require any additional assistance?
Thanks.