	---
	license: apache-2.0
	---
	# Metom (めとむ)

	Metom is a Vision Transformer (ViT) based Kuzushiji classifier.
	The model takes an image containing a single character and returns the predicted character.

	めとむは Vision Transformer (ViT) ベースのくずし字分類器です。
	モデルは1文字が写った画像を受け取り、その文字がどの文字であるかを返します。

	The Japanese section follows the English section. (日本語セクションは英語セクションの後に続きます。)

	--------------------------------------------------------------------------------

	This model was trained on the [日本古典籍くずし字データセット](http://codh.rois.ac.jp/char-shape/book/).
	The dataset contains 1,086,326 character images covering 4,328 distinct Kuzushiji characters.
	Only the 2,703 characters that appear at least 5 times in the dataset were used.
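
The frequency filter described above can be sketched in a few lines. This is a minimal illustration, not the authors' preprocessing code: the function name `filter_rare_classes`, the `(image_id, label)` sample layout, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def filter_rare_classes(samples, min_count=5):
    """Keep only samples whose label occurs at least `min_count` times.

    `samples` is a list of (image_id, label) pairs; this layout is a
    hypothetical stand-in for the real dataset format.
    """
    counts = Counter(label for _, label in samples)
    return [(image_id, label) for image_id, label in samples if counts[label] >= min_count]


# Toy data: "一" appears 5 times and is kept, "倉" appears twice and is dropped.
toy_samples = [(i, "一") for i in range(5)] + [(5, "倉"), (6, "倉")]
kept = filter_rare_classes(toy_samples)
print(len(kept), sorted({label for _, label in kept}))
```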

	The dataset was split into train, validation, and test subsets in a ratio of 3:1:1.
	As a result, the train subset contained 649,932 characters, the validation subset contained 216,644 characters, and the test subset contained 216,645 characters.
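
As a quick arithmetic check, the three subset sizes sum to 1,083,221 characters, and a plain floor-based 3:1:1 split of that total reproduces the reported numbers. The rounding rule below is an assumption that happens to match; the README does not say how the split was actually implemented or whether it was stratified by class.

```python
def split_sizes(n_samples, ratios=(3, 1, 1)):
    """Return subset sizes for an integer ratio split.

    Every subset except the last gets floor(n * ratio / total); the last
    subset takes the remainder so the sizes always sum to n_samples.
    This rounding rule is an assumption, not the authors' documented method.
    """
    total = sum(ratios)
    sizes = [n_samples * r // total for r in ratios[:-1]]
    sizes.append(n_samples - sum(sizes))
    return tuple(sizes)


n_used = 649_932 + 216_644 + 216_645  # characters remaining after filtering
print(n_used)               # 1083221
print(split_sizes(n_used))  # (649932, 216644, 216645)
```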

	The model was trained on the train subset, and hyperparameters were tuned based on the performance on the validation subset.
	The final evaluation on the test subset yielded a micro accuracy (accuracy over all 216,645 test images) of 0.9722 and a macro accuracy (mean of the per-class accuracies over the 2,703 classes) of 0.8354.
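
The gap between the two figures comes from the averaging: every image counts equally in the micro accuracy, while every class counts equally in the macro accuracy, so rare classes that are predicted poorly pull the macro figure down. The sketch below shows the two computations on toy labels; the function name and the data are illustrative assumptions, not the evaluation script used for this model.

```python
from collections import defaultdict


def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy for two equal-length label sequences."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    # Micro: fraction of all samples predicted correctly.
    micro = sum(correct.values()) / sum(total.values())
    # Macro: unweighted mean of the per-class accuracies.
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro


# Toy example: a frequent class predicted well and a rare class predicted badly.
y_true = ["一"] * 8 + ["倉"] * 2
y_pred = ["一"] * 8 + ["一", "倉"]
micro, macro = micro_macro_accuracy(y_true, y_pred)
print(micro, macro)  # 0.9 0.75
```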

	## Usage
	See also the [Google Colab Notebook](https://colab.research.google.com/drive/1jFMZENoTjjum3qlBxV0Q5dTxmpCvqlpf?usp=sharing).
	1. Install dependencies (not required on Google Colab)
	```sh
	python -m pip install einops torch torchvision "transformers>=5.1.0"

	# Optional: only needed for FlashAttention-2 (this also applies on Google Colab)
	pip install flash-attn --no-build-isolation
	```

	2. Run the following code
	```python
	from io import BytesIO

	from PIL import Image
	import requests
	import torch
	from transformers import AutoImageProcessor, AutoModel

	repo_name = "SakanaAI/Metom"
	device = "cuda"
	torch_dtype = torch.float32  # Can also be set to `torch.float16` or `torch.bfloat16`

	def get_image(image_url: str) -> Image.Image:
	    """Download an image and convert it to RGB."""
	    return Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")

	processor = AutoImageProcessor.from_pretrained(repo_name)
	model = AutoModel.from_pretrained(
	    repo_name,
	    dtype=torch_dtype,
	    attn_implementation="sdpa",  # Can also be `"eager"`, `"flash_attention_2"`, or another backend supported in transformers v5 (https://huggingface.co/docs/transformers/main/en/attention_interface)
	    trust_remote_code=True,
	).to(device=device)
	# transformers v4 is still supported through a dedicated revision:
	# model = AutoModel.from_pretrained(
	#     repo_name,
	#     torch_dtype=torch_dtype,
	#     _attn_implementation="sdpa",  # Can also be `"eager"` or `"flash_attention_2"`
	#     trust_remote_code=True,
	#     revision="transformers-v4",  # Use this revision
	# ).to(device=device)

	image1 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example1_4E00.jpg")  # An example image
	image_array1 = processor(images=image1, return_tensors="pt")["pixel_values"].to(device=device, dtype=torch_dtype)
	with torch.inference_mode():
	    print(model.get_predictions(image_array1))  # Returns the predicted label
	    # ['一']

	image2 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example2_5B9A.jpg")  # An example image
	image3 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example3_5009.jpg")  # An example image
	image_array2 = processor(images=[image2, image3], return_tensors="pt")["pixel_values"].to(device=device, dtype=torch_dtype)
	with torch.inference_mode():
	    print(model.get_topk_labels(image_array2))  # Returns the top-k predicted labels (labels only)
	    # [['定', '芝', '乏', '淀', '実'], ['倉', '衾', '斜', '会', '急']]
	    print(model.get_topk_labels(image_array2, k=3, return_probs=True))  # Returns the top-k predicted labels (labels with probabilities)
	    # [[('定', 0.9979110360145569), ('芝', 0.0002953446237370372), ('乏', 0.0001281465229112655)], [('倉', 0.9862518906593323), ('衾', 0.0005956498789601028), ('斜', 0.000399815384298563)]]
	```
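
The example file names appear to end in the hexadecimal Unicode code point of the character shown (`4E00` is 一, `5B9A` is 定, `5009` is 倉), which matches the top predictions above. Assuming that naming convention holds, a small helper can derive the expected label for a sanity check. The helper name `expected_label_from_url` is hypothetical and is not part of the model's API.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def expected_label_from_url(image_url):
    """Decode the trailing hex field of an example file name into a character.

    Assumes names of the form `<prefix>_<HEX>.jpg`, where <HEX> is a Unicode
    code point; this convention is inferred from the three example files only.
    """
    stem = PurePosixPath(urlparse(image_url).path).stem  # e.g. "example1_4E00"
    return chr(int(stem.rsplit("_", 1)[1], 16))


base = "https://huggingface.co/SakanaAI/Metom/resolve/main/examples/"
for name in ("example1_4E00.jpg", "example2_5B9A.jpg", "example3_5009.jpg"):
    print(name, expected_label_from_url(base + name))
```

Comparing this value with the first element returned by `model.get_predictions` gives a quick check that the model and processor loaded correctly.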

	## Citation
	```bibtex
	@misc{Metom,
	    url = {https://huggingface.co/SakanaAI/Metom},
	    title = {Metom},
	    author = {Imajuku, Yuki and Clanuwat, Tarin}
	}
	```

	--------------------------------------------------------------------------------

	本モデルは[日本古典籍くずし字データセット](http://codh.rois.ac.jp/char-shape/book/)を用いて訓練されました。
	このデータセットには4,328種1,086,326枚のくずし字画像が含まれています。
	ただし、データセット中に5回以上出現する2,703種類の文字のみを利用しました。

	データセットは訓練、検証、テストの3つのセットに、比率が3:1:1となるように分割されました。
	その結果、訓練セットには649,932枚、検証セットには216,644枚、テストセットには216,645枚の画像が含まれました。

	本モデルは訓練セットのみを用いて学習され、検証セットにおける性能を見ながらハイパーパラメータを調整しました。
	最終的にテストセットにおける評価の結果、216,645枚全体の正解率は0.9722となり、2,703種類のクラス別正解率の平均は0.8354となりました。

	## 使用方法
	[Google Colab Notebook](https://colab.research.google.com/drive/1jFMZENoTjjum3qlBxV0Q5dTxmpCvqlpf?usp=sharing)もご確認ください。
	1. 依存ライブラリをインストールする (Google Colabを使う場合は不要)
	```sh
	python -m pip install einops torch torchvision "transformers>=5.1.0"

	# 任意 (FlashAttention-2を使いたい場合はGoogle Colabを使う時でも必要)
	pip install flash-attn --no-build-isolation
	```

	2. 以下のコードを実行する
	```python
	from io import BytesIO

	from PIL import Image
	import requests
	import torch
	from transformers import AutoImageProcessor, AutoModel

	repo_name = "SakanaAI/Metom"
	device = "cuda"
	torch_dtype = torch.float32  # `torch.float16` や `torch.bfloat16` も指定可能

	def get_image(image_url: str) -> Image.Image:
	    """画像をダウンロードして RGB に変換する"""
	    return Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")

	processor = AutoImageProcessor.from_pretrained(repo_name)
	model = AutoModel.from_pretrained(
	    repo_name,
	    dtype=torch_dtype,
	    attn_implementation="sdpa",  # `"eager"`, `"flash_attention_2"` および transformers v5 でサポートされている Attention backends を指定可能 (https://huggingface.co/docs/transformers/main/en/attention_interface)
	    trust_remote_code=True,
	).to(device=device)
	# transformers v4 もサポート
	# model = AutoModel.from_pretrained(
	#     repo_name,
	#     torch_dtype=torch_dtype,
	#     _attn_implementation="sdpa",  # `"eager"` や `"flash_attention_2"` も指定可能
	#     trust_remote_code=True,
	#     revision="transformers-v4",  # この revision を使用
	# ).to(device=device)

	image1 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example1_4E00.jpg")  # 画像例
	image_array1 = processor(images=image1, return_tensors="pt")["pixel_values"].to(device=device, dtype=torch_dtype)
	with torch.inference_mode():
	    print(model.get_predictions(image_array1))  # 予測ラベルを返す
	    # ['一']

	image2 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example2_5B9A.jpg")  # 画像例
	image3 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example3_5009.jpg")  # 画像例
	image_array2 = processor(images=[image2, image3], return_tensors="pt")["pixel_values"].to(device=device, dtype=torch_dtype)
	with torch.inference_mode():
	    print(model.get_topk_labels(image_array2))  # 上位k件の予測ラベルを返す (ラベルのみ)
	    # [['定', '芝', '乏', '淀', '実'], ['倉', '衾', '斜', '会', '急']]
	    print(model.get_topk_labels(image_array2, k=3, return_probs=True))  # 上位k件の予測ラベルを返す (ラベルと確率)
	    # [[('定', 0.9979110360145569), ('芝', 0.0002953446237370372), ('乏', 0.0001281465229112655)], [('倉', 0.9862518906593323), ('衾', 0.0005956498789601028), ('斜', 0.000399815384298563)]]
	```

	## 引用
	```bibtex
	@misc{Metom,
	    url = {https://huggingface.co/SakanaAI/Metom},
	    title = {Metom},
	    author = {Imajuku, Yuki and Clanuwat, Tarin}
	}
	```