Qwen-72B-Chat-Int8

Introduction

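Qwen-72B (通义千问-72B) is the 72-billion-parameter model in the Qwen (Tongyi Qianwen) large language model series developed by Alibaba Cloud. Qwen-72B is a Transformer-based large language model pretrained on a very large corpus. The pretraining data is diverse and broad in coverage, including large volumes of web text, professional books, code, and more. On top of Qwen-72B, we used alignment techniques to build Qwen-72B-Chat, an AI assistant based on the large language model. This repository hosts the Int8 quantized version of Qwen-72B-Chat.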

The main features of Qwen-72B are:

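Large-scale, high-quality training corpus: pretrained on more than 3 trillion tokens, including high-quality Chinese, English, multilingual, code, and mathematics data that covers both general and specialized domains. The distribution of the pretraining corpus was optimized through extensive comparative experiments.
Strong performance: on multiple Chinese and English downstream evaluation tasks (covering commonsense reasoning, code, mathematics, translation, and more), Qwen-72B significantly outperforms existing open-source models. See the evaluation results later in this document for details.
More comprehensive vocabulary: compared with current open-source models whose vocabularies focus mainly on Chinese and English, Qwen-72B uses a vocabulary of about 150,000 tokens. This vocabulary is friendlier to multiple languages, so users can strengthen and extend capabilities for some languages without expanding the vocabulary.
Longer context support: Qwen-72B supports a context length of 32k.
System prompt following: by adjusting the system prompt, Qwen-72B-Chat can perform role playing, language style transfer, task setting, and behavior setting.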

For more details about the open-source Qwen-72B model, we recommend referring to the GitHub repository.

Hugging Face: https://huggingface.co/Qwen/Qwen-72B-Chat-Int8
GitHub: https://github.com/QwenLM/Qwen

Requirements

Python 3.8 or later
PyTorch 2.0 or later
CUDA 11.4 or later is recommended (relevant for GPU users, flash-attention users, and so on)
At least 82 GB of GPU memory (for example, 2xA100-80G or 3xV100-32G)
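
The GPU memory requirement above can be checked with simple arithmetic. The sketch below sums per-card memory for the two example configurations and compares the total with the 82 GB minimum. The function name is illustrative, and it assumes that memory simply adds up across identical cards, which ignores per-card overhead and uneven layer placement.

```python
REQUIRED_GB = 82  # minimum total GPU memory stated in the requirements above


def meets_requirement(num_gpus: int, gb_per_gpu: int, required_gb: int = REQUIRED_GB) -> bool:
    """Return True if the combined memory of identical GPUs reaches required_gb.

    Assumption: memory adds up linearly across cards; real deployments lose
    some capacity to per-card overhead and uneven layer placement.
    """
    return num_gpus * gb_per_gpu >= required_gb


# The first two entries are the examples from the requirements list;
# the single-card entry is added only to show a failing case.
configs = {"2xA100-80G": (2, 80), "3xV100-32G": (3, 32), "1xA100-80G": (1, 80)}
for name, (count, gb) in configs.items():
    status = "ok" if meets_requirement(count, gb) else "insufficient"
    print(f"{name}: {count * gb} GB total, {status}")
```

A single 80 GB card falls just short of the stated minimum, which is why both listed examples use multiple cards.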

Dependencies

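To run Qwen-72B-Chat-Int8, make sure the requirements above are met, then run the following pip commands to install the dependencies. If you have trouble installing auto-gptq, we recommend searching the official repo for a suitable precompiled wheel.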

pip install "transformers>=4.32.0" accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
pip install auto-gptq optimum

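Note: precompiled auto-gptq packages have strict requirements on the torch version and its CUDA version. In addition, because of recent updates, you may encounter version errors raised by transformers, optimum, or peft. We recommend using the latest versions that satisfy one of the following combinations: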
torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
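
As a rough aid, the two version combinations above can be encoded as a small check. The sketch below is not part of the Qwen tooling: the function names and the labels "newer" and "older" are illustrative, `torch==2.1` is read as any 2.1.x release (an assumption), and version strings are compared by their first three numeric components only.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn '2.1.0+cu118' into (2, 1, 0); local tags and suffixes like 'dev0' are ignored."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to three components so '0.5' compares equal to '0.5.0'.
    return tuple((parts + [0, 0, 0])[:3])


def matching_combination(versions: dict) -> str:
    """Return 'newer', 'older', or 'none' for a dict of package name -> version string."""
    torch = parse_version(versions["torch"])
    gptq = parse_version(versions["auto-gptq"])
    tf = parse_version(versions["transformers"])
    opt = parse_version(versions["optimum"])
    peft = parse_version(versions["peft"])
    # First combination: torch 2.1 with the newer auto-gptq stack.
    if (torch[:2] == (2, 1) and gptq >= (0, 5, 1) and tf >= (4, 35, 0)
            and opt >= (1, 14, 0) and peft >= (0, 6, 1)):
        return "newer"
    # Second combination: torch 2.0.x with the older auto-gptq stack.
    if (torch[:2] == (2, 0) and gptq < (0, 5, 0) and tf < (4, 35, 0)
            and opt < (1, 14, 0) and (0, 5, 0) <= peft < (0, 6, 0)):
        return "older"
    return "none"


example = {"torch": "2.1.0+cu118", "auto-gptq": "0.5.1",
           "transformers": "4.35.2", "optimum": "1.14.0", "peft": "0.6.1"}
print(matching_combination(example))
```

A result of "none" means the installed packages mix the two combinations, which is the situation the note above warns about.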

In addition, we recommend installing the flash-attention library (flash attention 2 is now supported) for higher efficiency and lower GPU memory usage.

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Below are optional. Installing them might be slow.
# pip install csrc/layer_norm
# If the version of flash-attn is higher than 2.1.1, the following is not needed.
# pip install csrc/rotary
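
To decide whether the optional `csrc/rotary` step applies, it can help to look up the installed flash-attn version first. The sketch below uses only the standard library. The helper names are illustrative, and it assumes the package is registered under the distribution name `flash-attn`; a source install may report a different name.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def rotary_extension_needed(flash_attn_version: str) -> bool:
    """Per the install comments above, csrc/rotary is only needed up to flash-attn 2.1.1."""
    numbers = tuple(int(p) for p in flash_attn_version.split(".")[:3] if p.isdigit())
    return numbers <= (2, 1, 1)


version = installed_version("flash-attn")  # assumed distribution name
if version is None:
    print("flash-attn is not installed (it is optional)")
else:
    print("flash-attn", version, "- rotary extension needed:", rotary_extension_needed(version))
```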

Quickstart

The following example shows how to use the Qwen-72B-Chat-Int8 model:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat-Int8", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat-Int8",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好！很高兴为你提供帮助。

# Qwen-72B-Chat can perform role playing, language style transfer, task setting, and behavior setting through the system prompt.
response, _ = model.chat(tokenizer, "你好呀", history=None, system="请用二次元可爱语气和我说话")
print(response)
# 哎呀，你好哇！是怎么找到人家的呢？是不是被人家的魅力吸引过来的呀~(≧▽≦)/~

response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs")
print(response)
# Your colleague is a shining example of dedication and hard work. Their commitment to their job is truly commendable, and it shows in the quality of their work. 
# They are an asset to the team, and their efforts do not go unnoticed. Keep up the great work!
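
The calls above each start from `history=None`. Since `model.chat` returns the updated history alongside the response, a multi-turn conversation can be built by passing that history into the next call. The sketch below shows the pattern with a stand-in chat function so that it runs without GPUs. `run_dialogue` and `fake_chat` are hypothetical helpers, and treating the history as a list of (query, response) pairs is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Assumed history shape: a list of (query, response) pairs.
History = List[Tuple[str, str]]


def run_dialogue(chat: Callable[..., Tuple[str, History]], queries: List[str]) -> History:
    """Send queries one by one, passing the returned history into the next call."""
    history: Optional[History] = None
    for query in queries:
        response, history = chat(query, history=history)
        print(f"user: {query}\nassistant: {response}")
    return history or []


def fake_chat(query: str, history: Optional[History] = None) -> Tuple[str, History]:
    """Stand-in for the real model: replies with the turn number and extends the history."""
    history = list(history or [])
    response = f"reply {len(history) + 1} to {query!r}"
    history.append((query, response))
    return response, history


transcript = run_dialogue(fake_chat, ["hello", "tell me more"])
```

With the real model loaded as in the example above, the same helper could be called as `run_dialogue(lambda q, history=None: model.chat(tokenizer, q, history=history), queries)`.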

Note: running quantized models with vLLM requires installing our vLLM fork. Int8 models are not yet supported; an update is planned soon.

Author: Jeebiz  Created: 2023-12-15 22:42
Last edited by: Jeebiz  Updated: 2025-05-12 09:20