	---
	license: apache-2.0
	language:
	- en
	- ru
	library_name: gigacheck
	tags:
	- text-classification
	- ai-detection
	- multilingual
	- gigacheck
	datasets:
	- iitolstykh/LLMTrace_classification
	base_model:
	- mistralai/Mistral-7B-v0.3
	---

	# GigaCheck-Classifier-Multi

	<p style="text-align: center;">
	<div align="center">
	<img src="https://raw.githubusercontent.com/sweetdream779/LLMTrace-info/refs/heads/main/images/logo/GigaCheck-classifier-multi.PNG" width="40%"/>
	</div>
	<p align="center">
	<a href="https://sweetdream779.github.io/LLMTrace-info"> 🌐 LLMTrace Website </a> |
	<a href="http://arxiv.org/abs/2509.21269"> 📜 LLMTrace Paper on arXiv </a> |
	<a href="https://huggingface.co/datasets/iitolstykh/LLMTrace_classification"> 🤗 LLMTrace - Classification Dataset </a> |
	<a href="https://github.com/ai-forever/gigacheck"> GitHub </a>
	</p>

	## Model Card

	### Model Description

	This is the official `GigaCheck-Classifier-Multi` model from the `LLMTrace` project. It is a multilingual transformer-based model trained for the binary classification of text as either `human` or `ai`.

	The model was trained jointly on the English and Russian portions of the `LLMTrace Classification dataset`. It is designed to be a robust baseline for detecting AI-generated content across multiple domains, text lengths and prompt types.

	For complete details on the training data, methodology, and evaluation, see the [LLMTrace paper on arXiv](http://arxiv.org/abs/2509.21269).

	### Intended Use & Limitations

	This model is intended for academic research, analysis of AI-generated content, and as a baseline for developing more advanced detection tools.

	Limitations:
	* The model's performance may degrade on text generated by LLMs released after its training date (September 2025).
	* It is not infallible: it can produce false positives (flagging human text as AI) and false negatives (missing AI-generated text).
	* Performance may vary on domains or styles of text not well-represented in the training data.


	## Evaluation

	The model was evaluated on the test split of the `LLMTrace Classification dataset`, which was not seen during training. Performance metrics are reported below:

	| Metric           | Value (%) |
	|------------------|-----------|
	| F1 Score (AI)    | 98.64     |
	| F1 Score (Human) | 98.00     |
	| Mean Accuracy    | 98.46     |
	| TPR @ FPR=0.01   | 97.93     |
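
For readers unfamiliar with the last row: TPR @ FPR=0.01 is the share of AI-written texts that are caught when the decision threshold is set so that at most 1% of human-written texts are flagged. The sketch below shows one common way to compute such a number from per-text AI scores. The function name `tpr_at_fpr`, the toy scores, and the convention that a score strictly above the threshold means `ai` are assumptions made for illustration; this is not the evaluation code used to produce the table.

```python
import math


def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.01):
    """Return (threshold, tpr) with the false-positive rate on human texts <= target_fpr.

    A text counts as flagged when its AI score is strictly greater than the threshold.
    """
    if not human_scores or not ai_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")

    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we may flag by mistake without exceeding the target.
    allowed = math.floor(target_fpr * len(ranked))
    # Using the (allowed + 1)-th highest human score as the threshold means
    # at most `allowed` human texts score strictly above it.
    threshold = ranked[allowed]
    tpr = sum(score > threshold for score in ai_scores) / len(ai_scores)
    return threshold, tpr


# Toy scores (hypothetical): four human texts and four AI texts.
human = [0.05, 0.10, 0.20, 0.90]
ai = [0.95, 0.80, 0.60, 0.15]
print(tpr_at_fpr(human, ai, target_fpr=0.25))  # → (0.2, 0.75)
```

With a larger held-out set, the same function can be called with `target_fpr=0.01` to pick an operating threshold that matches the setting reported in the table.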


	## Quick start

	Requirements:
	- Python 3.11
	- [gigacheck](https://github.com/ai-forever/gigacheck)

	```bash
	pip install git+https://github.com/ai-forever/gigacheck
	```

	### Inference with transformers (with `trust_remote_code=True`)

	```python
	from transformers import AutoModel
	import torch

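	# Load the classifier together with its custom modeling code from the Hub.
	gigacheck_model = AutoModel.from_pretrained(
	    "iitolstykh/GigaCheck-Classifier-Multi",
	    trust_remote_code=True,  # required because the repo ships its own model class
	    device_map="cuda:0",  # assumes a CUDA GPU is available
	    torch_dtype=torch.bfloat16,
	)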

	text = """To be, or not to be, that is the question:
	Whether ’tis nobler in the mind to suffer
	The slings and arrows of outrageous fortune,
	Or to take arms against a sea of troubles
	And by opposing end them.
	"""

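	# The model takes a list of strings; newlines are flattened to spaces first.
	output = gigacheck_model([text.replace("\n", " ")])

	# Map each predicted class id to its label name from the model config.
	print([gigacheck_model.config.id2label[int(c_id)] for c_id in output.pred_label_ids])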
	```
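
To score several texts at once, the single-text call above can be wrapped in a small helper. This is a minimal sketch, not part of the `gigacheck` API: `normalize` and `classify_texts` are hypothetical names, and the helper assumes the model accepts a list with more than one string and returns an object exposing `pred_label_ids`, as in the single-text call. Note that `normalize` collapses all whitespace runs, which is slightly stricter than the plain newline replacement used above. The `FakeModel` stand-in and its label mapping are invented so that the snippet runs without downloading weights.

```python
from types import SimpleNamespace


def normalize(text):
    """Collapse newlines and repeated whitespace into single spaces."""
    return " ".join(text.split())


def classify_texts(model, texts):
    """Return one label string per input text.

    Assumes `model(list_of_str)` returns an object with `pred_label_ids`
    and that `model.config.id2label` maps integer ids to label names.
    """
    cleaned = [normalize(t) for t in texts]
    output = model(cleaned)
    id2label = model.config.id2label
    return [id2label[int(c_id)] for c_id in output.pred_label_ids]


class FakeModel:
    """Stand-in for the real model, used only to make this sketch runnable."""

    # Assumed label mapping for the demo; read the real one from the model config.
    config = SimpleNamespace(id2label={0: "human", 1: "ai"})

    def __call__(self, texts):
        # Toy rule: alternate class ids by position, unrelated to the real classifier.
        return SimpleNamespace(pred_label_ids=[i % 2 for i in range(len(texts))])


print(classify_texts(FakeModel(), ["To be,\nor not to be", "that is the question"]))
```

In real use, `FakeModel()` would be replaced by the `gigacheck_model` loaded above; batch size limits depend on available GPU memory.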

	### Inference with gigacheck

	```python
	import torch
	from transformers import AutoConfig
	from gigacheck.inference.src.mistral_detector import MistralDetector

	model_name = "iitolstykh/GigaCheck-Classifier-Multi"

	# Read detector settings (sequence length, label names) from the model config.
	config = AutoConfig.from_pretrained(model_name)
	model = MistralDetector(
	    max_seq_len=config.max_length,
	    with_detr=config.with_detr,
	    id2label=config.id2label,
	    device="cuda:0" if torch.cuda.is_available() else "cpu",  # fall back to CPU without a GPU
	).from_pretrained(model_name)

	text = """To be, or not to be, that is the question:
	Whether ’tis nobler in the mind to suffer
	The slings and arrows of outrageous fortune,
	Or to take arms against a sea of troubles
	And by opposing end them.
	"""

	output = model.predict(text.replace("\n", " "))
	print(output)
	```


	## Citation

	If you use this model in your research, please cite our papers:

	```bibtex
	@article{Layer2025LLMTrace,
	  title={{LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text}},
	  author={Irina Tolstykh and Aleksandra Tsybina and Sergey Yakubson and Maksim Kuprashevich},
	  journal={arXiv preprint arXiv:2509.21269},
	  year={2025}
	}
	@article{tolstykh2024gigacheck,
	  title={{GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization}},
	  author={Irina Tolstykh and Aleksandra Tsybina and Sergey Yakubson and Aleksandr Gordeev and Vladimir Dokholyan and Maksim Kuprashevich},
	  journal={arXiv preprint arXiv:2410.23728},
	  year={2024}
	}
	```