| --- |
| license: apache-2.0 |
| --- |
| # Metom (めとむ) |
|
|
| The **Metom** is a Vision Transformer (ViT) based **Kuzushiji** classifier. |
| The model takes an image with one character and returns what the character is. |
|
|
| **めとむ**は Vision Transformer (ViT) ベースの**くずし字**分類器です。 |
| モデルは1文字が写った画像を受け取り、その文字がどの文字であるかを返します。 |
|
|
| *Japanese section follows English section (日本語セクションは英語セクションの後に続きます。)* |
|
|
| -------------------------------------------------------------------------------- |
|
|
| This model was trained by using [日本古典籍くずし字データセット](http://codh.rois.ac.jp/char-shape/book/). |
| This dataset contains 1,086,326 characters in 4,328 types of Kuzushiji. |
| However, we used only 2,703 types of characters that appeared at least 5 times in the dataset. |
|
|
| The dataset was split into train, validation, and test subsets in a ratio of 3:1:1. |
| As a result, the train subset contained 649,932 characters, the validation subset contained 216,644 characters, and the test subset contained 216,645 characters. |
|
|
| The model was trained on the train subset, and hyperparameters were tuned based on the performance on the validation subset. |
| The final evaluation on the test subset yielded a micro accuracy of 0.9722 and a macro accuracy of 0.8354. |
|
|
| ## Usage |
| Please see also [Google Colab Notebook](https://colab.research.google.com/drive/1jFMZENoTjjum3qlBxV0Q5dTxmpCvqlpf?usp=sharing). |
| 1. Install dependencies (Not required on Google Colab) |
| ```sh |
| python -m pip install einops torch torchvision "transformers>=5.1.0" |
| |
| # Optional (This is also required on Google Colab if you want to use FlashAttention-2) |
| pip install flash-attn --no-build-isolation |
| ``` |
|
|
| 2. Run the following code |
| ```python |
| from io import BytesIO |
| |
| from PIL import Image |
| import requests |
| import torch |
| from transformers import AutoImageProcessor, AutoModel |
| |
| repo_name = "SakanaAI/Metom" |
| device = "cuda" |
| torch_dtype = torch.float32 # This can also set `torch.float16` or `torch.bfloat16` |
| |
| def get_image(image_url: str) -> Image.Image: |
| return Image.open(BytesIO(requests.get(image_url).content)).convert("RGB") |
| |
| processor = AutoImageProcessor.from_pretrained(repo_name) |
| model = AutoModel.from_pretrained( |
| repo_name, |
| dtype=torch_dtype, |
| attn_implementation="sdpa", # This can also set `"eager"`, `"flash_attention_2"` or other methods supported in transformers v5 (https://huggingface.co/docs/transformers/main/en/attention_interface) |
| trust_remote_code=True |
| ).to(device=device) |
| # We still support transformers v4 |
| # model = AutoModel.from_pretrained( |
| # repo_name, |
| # torch_dtype=torch_dtype, |
| # _attn_implementation="sdpa", # This can also set `"eager"` or `"flash_attention_2"` |
| # trust_remote_code=True, |
| # revision="transformers-v4", # Use this revision |
| # ).to(device=device) |
| |
| image1 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example1_4E00.jpg") # An example image |
| image_array1 = processor(images=image1, return_tensors="pt")["pixel_values"].to(device=device, dtype=torch_dtype) |
| with torch.inference_mode(): |
| print(model.get_predictions(image_array1)) # Returns the prediction label |
| # ['一'] |
| |
| image2 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example2_5B9A.jpg") # An example image |
| image3 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example3_5009.jpg") # An example image |
| image_array2 = processor(images=[image2, image3], return_tensors="pt")["pixel_values"].to(device=device, dtype=torch_dtype) |
| with torch.inference_mode(): |
| print(model.get_topk_labels(image_array2)) # Returns top-k prediction labels (label only) |
| # [['定', '芝', '乏', '淀', '実'], ['倉', '衾', '斜', '会', '急']] |
| print(model.get_topk_labels(image_array2, k=3, return_probs=True)) # Returns prediction top-k labels (label with probability) |
| # [[('定', 0.9979110360145569), ('芝', 0.0002953446237370372), ('乏', 0.0001281465229112655)], [('倉', 0.9862518906593323), ('衾', 0.0005956498789601028), ('斜', 0.000399815384298563)]] |
| ``` |
|
|
| ## Citation |
| ```bibtex |
| @misc{Metom, |
| url = {[https://huggingface.co/SakanaAI/Metom](https://huggingface.co/SakanaAI/Metom)}, |
| title = {Metom}, |
| author = {Imajuku, Yuki and Clanuwat, Tarin} |
| } |
| ``` |
|
|
| -------------------------------------------------------------------------------- |
|
|
| 本モデルは[日本古典籍くずし字データセット](http://codh.rois.ac.jp/char-shape/book/)を用いて訓練されました。 |
| このデータセットは4,328種1,086,326枚のくずし字画像が含まれています。 |
| ですが、データセット中に最低5回以上出現する2,703種類の文字のみを利用しました。 |
|
|
| データセットは訓練、検証、テストの3つのセットに、比率が3:1:1となるように分割されました。 |
| その結果、訓練セットは649,932枚、検証セットは216,644枚、テストセットは216,645枚、画像が含まれました。 |
|
|
| 本モデルは訓練セットのみを用いて学習され、検証セットにおける性能を見ながらハイパーパラメータを調整しました。 |
| 最終的にテストセットにおける評価の結果、216,645枚全体の正解率は0.9722となり、2,703種類のクラス別正解率の平均は0.8354となりました。 |
|
|
| ## 使用方法 |
| [Google Colab Notebook](https://colab.research.google.com/drive/1jFMZENoTjjum3qlBxV0Q5dTxmpCvqlpf?usp=sharing)もご確認ください。 |
| 1. 依存ライブラリをインストールする (Google Colabを使う場合は不要) |
| ```sh |
| python -m pip install einops torch torchvision "transformers>=5.1.0" |
| |
| # 任意 (FlashAttention-2を使いたい場合はGoogle Colabを使う時でも必要) |
| pip install flash-attn --no-build-isolation |
| ``` |
|
|
| 2. 以下のコードを実行する |
| ```python |
| from io import BytesIO |
| |
| from PIL import Image |
| import requests |
| import torch |
| from transformers import AutoImageProcessor, AutoModel |
| |
| repo_name = "SakanaAI/Metom" |
| device = "cuda" |
| torch_dtype = torch.float32 # `torch.float16` や `torch.bfloat16` も指定可能 |
| |
| def get_image(image_url: str) -> Image.Image: |
| return Image.open(BytesIO(requests.get(image_url).content)).convert("RGB") |
| |
| processor = AutoImageProcessor.from_pretrained(repo_name) |
| model = AutoModel.from_pretrained( |
| repo_name, |
| dtype=torch_dtype, |
| attn_implementation="sdpa", # `"eager"`, `"flash_attention_2"` および transformers v5 でサポートされている Attention backends を指定可能 (https://huggingface.co/docs/transformers/main/en/attention_interface) |
| trust_remote_code=True |
| ).to(device=device) |
| # transformers v4 もサポート |
| # model = AutoModel.from_pretrained( |
| # repo_name, |
| # torch_dtype=torch_dtype, |
| # _attn_implementation="sdpa", # `"eager"` や `"flash_attention_2"` も指定可能 |
| # trust_remote_code=True, |
| # revision="transformers-v4", # この revision を使用 |
| # ).to(device=device) |
| |
| image1 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example1_4E00.jpg") # 画像例 |
| image_array1 = processor(images=image1, return_tensors="pt")["pixel_values"].to(device=device, dtype=torch_dtype) |
| with torch.inference_mode(): |
| print(model.get_predictions(image_array1)) # 予測ラベルを返す |
| # ['一'] |
| |
| image2 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example2_5B9A.jpg") # 画像例 |
| image3 = get_image("https://huggingface.co/SakanaAI/Metom/resolve/main/examples/example3_5009.jpg") # 画像例 |
| image_array2 = processor(images=[image2, image3], return_tensors="pt")["pixel_values"].to(device=device, dtype=torch_dtype) |
| with torch.inference_mode(): |
| print(model.get_topk_labels(image_array2)) # 上位k件の予測ラベルを返す (ラベルのみ) |
| # [['定', '芝', '乏', '淀', '実'], ['倉', '衾', '斜', '会', '急']] |
| print(model.get_topk_labels(image_array2, k=3, return_probs=True)) # 上位k件の予測ラベルを返す (ラベルと確率) |
| # [[('定', 0.9979110360145569), ('芝', 0.0002953446237370372), ('乏', 0.0001281465229112655)], [('倉', 0.9862518906593323), ('衾', 0.0005956498789601028), ('斜', 0.000399815384298563)]] |
| ``` |
|
|
| ## 引用 |
| ```bibtex |
| @misc{Metom, |
| url = {[https://huggingface.co/SakanaAI/Metom](https://huggingface.co/SakanaAI/Metom)}, |
| title = {Metom}, |
| author = {Imajuku, Yuki and Clanuwat, Tarin} |
| } |
| ``` |
|
|