EETQ

The EETQ library supports int8 per-channel weight-only quantization on NVIDIA GPUs. Its high-performance GEMM and GEMV kernels come from FasterTransformer and TensorRT-LLM. EETQ requires no calibration dataset, and the model does not need to be quantized in advance. Because the quantization is per-channel, the loss of accuracy is negligible.
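The following pure-Python sketch makes "per-channel weight-only" concrete. It applies symmetric int8 quantization with one scale per output channel (row). It illustrates the general technique only and is not EETQ's implementation. The function names are hypothetical. Each scale is computed from the weights alone, with no activation statistics, which is consistent with EETQ needing no calibration data. Giving each row its own scale also keeps a small-magnitude row from being rounded to zero because of a large weight elsewhere in the tensor.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_channel(weight: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per output channel (row)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # The scale depends only on this row's weights, so no activation
        # statistics (calibration data) are needed to compute it.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric int8 range.
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows: List[List[int]], scales: List[float]) -> Matrix:
    """Approximate the original weights as q * scale, row by row."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def max_abs_error(a: List[float], b: List[float]) -> float:
    """Largest elementwise absolute difference between two rows."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    # Toy 2x4 weight matrix whose rows differ in magnitude by about 150x.
    W = [
        [0.01, -0.02, 0.015, 0.005],
        [1.5, -3.0, 2.25, 0.75],
    ]

    # Per-channel: each row gets its own scale.
    q, s = quantize_per_channel(W)
    W_hat = dequantize_per_channel(q, s)

    # Per-tensor, for comparison: one scale for the whole matrix, set by the
    # largest weight anywhere (here the -3.0 in row 1).
    q_t, s_t = quantize_per_channel([[w for row in W for w in row]])
    flat_hat = dequantize_per_channel(q_t, s_t)[0]

    print("row 0 error, per-channel:", max_abs_error(W[0], W_hat[0]))
    print("row 0 error, per-tensor: ", max_abs_error(W[0], flat_hat[:4]))
```

With a single per-tensor scale, most of row 0 rounds to zero. With per-channel scales, its error stays within half of its own, much smaller, scale.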

Make sure you have installed eetq from the release page.

pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl
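The prebuilt wheel's filename encodes the environment it was built for. It targets CPython 3.10 (cp310) on Linux x86_64, built against CUDA 12.1 (cu121) and PyTorch 2.1.2. The sketch below compares the current environment with those tags. It assumes an exact match is required, and the helper name is hypothetical. Treat a mismatch as a warning sign, not proof that installation will fail.

```python
import platform
import sys
from typing import List, Optional, Tuple

# Targets read off the wheel filename:
# EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl
WHEEL_PYTHON = (3, 10)
WHEEL_SYSTEM = "Linux"
WHEEL_MACHINE = "x86_64"
WHEEL_TORCH = "2.1.2"
WHEEL_CUDA = "12.1"


def wheel_mismatches(
    python_version: Tuple[int, int],
    system: str,
    machine: str,
    torch_version: Optional[str],
    cuda_version: Optional[str],
) -> List[str]:
    """Return reasons why the prebuilt wheel may not fit this environment."""
    problems = []
    if tuple(python_version) != WHEEL_PYTHON:
        problems.append(f"Python {python_version[0]}.{python_version[1]} != 3.10")
    if system != WHEEL_SYSTEM or machine != WHEEL_MACHINE:
        problems.append(f"platform {system}/{machine} != Linux/x86_64")
    # torch.__version__ may carry a local suffix such as "+cu121"; ignore it.
    if torch_version is None or torch_version.split("+")[0] != WHEEL_TORCH:
        problems.append(f"torch {torch_version} != {WHEEL_TORCH}")
    # torch.version.cuda is None on CPU-only PyTorch builds.
    if cuda_version != WHEEL_CUDA:
        problems.append(f"CUDA {cuda_version} != {WHEEL_CUDA}")
    return problems


if __name__ == "__main__":
    try:
        import torch
        torch_version, cuda_version = torch.__version__, torch.version.cuda
    except ImportError:
        torch_version, cuda_version = None, None
    issues = wheel_mismatches(
        sys.version_info[:2], platform.system(), platform.machine(),
        torch_version, cuda_version,
    )
    print("wheel tags match" if not issues else "\n".join(issues))
```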

Alternatively, you can install EETQ from the source code at https://github.com/NetEase-FuXi/EETQ. EETQ requires a GPU with a CUDA compute capability of at least 7.0 and at most 8.9.

git clone https://github.com/NetEase-FuXi/EETQ.git
cd EETQ/
git submodule update --init --recursive
pip install .
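Two quick checks can catch common setup problems. The first confirms that the GPU falls inside the supported compute-capability range. The second confirms, after installation, that the package can be found. This is a minimal sketch that assumes the importable module is named eetq; the helper names are hypothetical. For reference, a V100 reports 7.0 and an A100 reports 8.0, both inside the stated range. An H100 reports 9.0, which falls outside it.

```python
import importlib.util
from typing import Tuple

# Supported CUDA compute-capability range stated for EETQ (inclusive).
MIN_CAPABILITY = (7, 0)
MAX_CAPABILITY = (8, 9)


def capability_supported(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability lies within 7.0 to 8.9."""
    # Tuples compare lexicographically, so (8, 6) < (8, 9) < (9, 0).
    return MIN_CAPABILITY <= tuple(capability) <= MAX_CAPABILITY


def eetq_installed() -> bool:
    """True if a module named 'eetq' (assumed name) can be located."""
    return importlib.util.find_spec("eetq") is not None


if __name__ == "__main__":
    print("eetq importable:", eetq_installed())
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        # Returns (major, minor) for the current CUDA device.
        cap = torch.cuda.get_device_capability()
        print(f"compute capability {cap[0]}.{cap[1]} supported:",
              capability_supported(cap))
    else:
        print("No CUDA device visible to PyTorch.")
```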

A non-quantized model can be quantized while it is being loaded with from_pretrained.

from transformers import AutoModelForCausalLM, EetqConfig

path = "/path/to/model"
# Quantize the model's weights to int8 while it is loaded
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", quantization_config=quantization_config)
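EETQ stores the converted linear-layer weights as int8 rather than 16-bit floats, so the weight memory of those layers roughly halves. The back-of-the-envelope arithmetic below is a sketch, not a measurement of EETQ. It assumes an fp16 baseline and one fp16 scale per output channel. It also ignores embeddings, norms, activations, and any layers left unconverted. The layer shapes are hypothetical, chosen to resemble one decoder layer of a 7B-parameter Llama-style model. On a loaded model, model.get_memory_footprint() reports the actual figure.

```python
from typing import Iterable, Tuple

# Hypothetical (out_features, in_features) shapes for the linear layers of one
# decoder layer, chosen to resemble a 7B-parameter Llama-style model.
LAYER_SHAPES = (
    [(4096, 4096)] * 4       # q, k, v, o projections
    + [(11008, 4096)] * 2    # gate and up projections
    + [(4096, 11008)]        # down projection
)


def linear_weight_bytes(
    shapes: Iterable[Tuple[int, int]],
    bytes_per_weight: int,
    bytes_per_scale: int = 0,
) -> int:
    """Storage for linear-layer weights plus optional per-output-channel scales."""
    total = 0
    for out_features, in_features in shapes:
        total += out_features * in_features * bytes_per_weight
        # Per-channel quantization keeps one scale per output channel.
        total += out_features * bytes_per_scale
    return total


if __name__ == "__main__":
    fp16 = linear_weight_bytes(LAYER_SHAPES, bytes_per_weight=2)
    # Assumption: int8 weights plus one fp16 (2-byte) scale per output channel.
    int8 = linear_weight_bytes(LAYER_SHAPES, bytes_per_weight=1, bytes_per_scale=2)
    mib = 1024 ** 2
    print(f"fp16: {fp16 / mib:.1f} MiB, int8: {int8 / mib:.1f} MiB, "
          f"ratio {int8 / fp16:.4f}")
```

The per-channel scales add only one value per output row. For large layers, the ratio therefore stays just above one half.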

A quantized model can be saved with save_pretrained and loaded again with from_pretrained.

quant_path = "/path/to/save/quantized/model"
model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
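Reloading works without passing quantization_config again because save_pretrained records the quantization settings in the saved config.json. The stdlib-only check below inspects that file. It assumes the settings live under a quantization_config key with quant_method set to "eetq", and the helper name is hypothetical.

```python
import json
import os
from typing import Optional


def saved_quant_method(model_dir: str) -> Optional[str]:
    """Return quantization_config["quant_method"] from model_dir/config.json, if any."""
    config_path = os.path.join(model_dir, "config.json")
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    quant_config = config.get("quantization_config")
    # Non-quantized checkpoints have no quantization_config entry.
    if not isinstance(quant_config, dict):
        return None
    return quant_config.get("quant_method")


if __name__ == "__main__":
    quant_path = "/path/to/save/quantized/model"  # placeholder path from the docs
    if os.path.isfile(os.path.join(quant_path, "config.json")):
        print("quant_method:", saved_quant_method(quant_path))
```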