
(Prototype) Graph Mode Dynamic Quantization on BERT

Created On: Jul 28, 2020 | Last Updated: Jan 16, 2024 | Last Verified: Nov 05, 2024

Author: Supriya Rao

Introduction

This tutorial introduces the steps to do post-training dynamic quantization in graph mode. Dynamic quantization converts a float model to a quantized model with static int8 data types for the weights and dynamic quantization for the activations: the activations are quantized dynamically (per batch) to int8, while the weights are quantized to static int8. Graph mode quantization operates on the model graph and requires minimal user intervention to quantize the model. To be able to use graph mode quantization, the float model needs to be either traced or scripted first.

Advantages of graph mode quantization include:

  • In graph mode we can inspect the code that is executed in the forward function (e.g. the aten function calls), and quantization is achieved through module and graph manipulations.

  • A simple quantization flow with minimal manual steps.

  • It unlocks the possibility of doing higher-level optimizations, such as automatic precision selection.

For additional details on graph mode static quantization, please see the `graph mode static quantization tutorial <https://pytorch.org/tutorials/prototype/graph_mode_static_quantization_tutorial.html>`_.

In short, the graph mode dynamic `quantization API <https://pytorch.org/docs/master/quantization.html#torch-quantization>`_ looks like this:

import torch
from torch.quantization import per_channel_dynamic_qconfig
from torch.quantization import quantize_dynamic_jit

ts_model = torch.jit.script(float_model) # or torch.jit.trace(float_model, input)

quantized = quantize_dynamic_jit(ts_model, {'': per_channel_dynamic_qconfig})

1. Quantizing the BERT Model

The installation steps and details about the model are identical to the steps in the Eager Mode tutorial. Please refer to `this tutorial <https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html#install-pytorch-and-huggingface-transformers>`_ for more details.

1.1 Setup

Once all the necessary packages are downloaded and installed, we set up the code. We start with the necessary imports and the model configuration.

import logging
import numpy as np
import os
import random
import sys
import time
import torch

from argparse import Namespace
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from tqdm import tqdm
from transformers import (BertConfig, BertForSequenceClassification, BertTokenizer,)
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from torch.quantization import per_channel_dynamic_qconfig
from torch.quantization import quantize_dynamic_jit

def ids_tensor(shape, vocab_size):
    # Creates a random int32 tensor of the given shape with values in [0, vocab_size)
    return torch.randint(0, vocab_size, shape, dtype=torch.int, device='cpu')

# Setup logging
logger = logging.getLogger(__name__)
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.WARN)

logging.getLogger("transformers.modeling_utils").setLevel(
   logging.WARN)  # Reduce logging

print(torch.__version__)

torch.set_num_threads(1)
print(torch.__config__.parallel_info())

1.2 Download the GLUE dataset

Before running the MRPC task, we download the GLUE data by running the script below and unpack it to a directory named glue_data.

python download_glue_data.py --data_dir='glue_data' --tasks='MRPC'

1.3 Set global BERT configurations

To run this experiment we first need a fine-tuned BERT model. We provide the BERT model fine-tuned on the MRPC task `here <https://download.pytorch.org/tutorial/MRPC.zip>`_. To save time, you can download the model file (~400 MB) directly into your local folder $OUT_DIR.

configs = Namespace()

# The output directory for the fine-tuned model, $OUT_DIR.
configs.output_dir = "./MRPC/"

# The data directory for the MRPC task in the GLUE benchmark, $GLUE_DIR/$TASK_NAME.
configs.data_dir = "./glue_data/MRPC"

# The model name or path for the pre-trained model.
configs.model_name_or_path = "bert-base-uncased"
# The maximum length of an input sequence
configs.max_seq_length = 128

# Prepare GLUE task.
configs.task_name = "MRPC".lower()
configs.processor = processors[configs.task_name]()
configs.output_mode = output_modes[configs.task_name]
configs.label_list = configs.processor.get_labels()
configs.model_type = "bert".lower()
configs.do_lower_case = True

# Set the device, batch size, topology, and caching flags.
configs.device = "cpu"
configs.per_gpu_eval_batch_size = 8
configs.n_gpu = 0
configs.local_rank = -1
configs.overwrite_cache = False

# Set random seed for reproducibility.
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
set_seed(42)

tokenizer = BertTokenizer.from_pretrained(
    configs.output_dir, do_lower_case=configs.do_lower_case)

model = BertForSequenceClassification.from_pretrained(configs.output_dir, torchscript=True)
model.to(configs.device)

1.4 Quantizing the BERT model with graph mode quantization

1.4.1 Script/trace the model

The input to graph mode quantization is a TorchScript model, so you'll need to either script or trace the model first. Currently, scripting the BERT model is not supported, so we trace the model here.

We first identify the inputs to be passed to the model. Here we trace the model with the largest possible input size that will be passed during evaluation: a batch size of 8 and a sequence length of 128, based on the inputs passed in the evaluation step below. Using the maximum possible shape expected during inference while tracing the model is a limitation of the huggingface BERT model, as mentioned `here <https://huggingface.co/transformers/v2.3.0/torchscript.html#dummy-inputs-and-standard-lengths>`_.

We trace the model using ``torch.jit.trace``.

input_ids = ids_tensor([8, 128], 2)
token_type_ids = ids_tensor([8, 128], 2)
attention_mask = ids_tensor([8, 128], vocab_size=2)
dummy_input = (input_ids, attention_mask, token_type_ids)
traced_model = torch.jit.trace(model, dummy_input)

1.4.2 Specify qconfig_dict

qconfig_dict = {'': per_channel_dynamic_qconfig}

The qconfig is a named tuple of the observers for activations and weights. For dynamic quantization we use a dummy activation observer to mimic the dynamic quantization process that happens in the operator at runtime. For the weight tensors we recommend using per-channel quantization, which helps improve the final accuracy. ``qconfig_dict`` is a dictionary with submodule names as keys and the qconfig for each module as values; an empty key means the qconfig is applied to the whole model unless it is overridden by more specific configurations. The qconfig for each module is either looked up in the dictionary or falls back to the qconfig of its parent module.
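
For reference, the qconfig itself can be printed to see which observers it bundles (a minimal sketch; the exact observer classes shown depend on your PyTorch version):

from torch.quantization import per_channel_dynamic_qconfig

# The qconfig is a named tuple of observer factories; printing it shows the
# placeholder activation observer and the per-channel weight observer.
print(per_channel_dynamic_qconfig)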

Right now ``qconfig_dict`` is the only way to configure how the model is quantized, and it is done at the granularity of modules. That is, we only support one type of qconfig for each module, and the qconfig of a submodule overrides the qconfig of its parent module. For example, if we have

qconfig = {
    '' : qconfig_global,
    'sub' : qconfig_sub,
    'sub.fc1' : qconfig_fc,
    'sub.fc2': None
}

then module ``sub.fc1`` will be configured with ``qconfig_fc``, all other child modules in ``sub`` will be configured with ``qconfig_sub``, and ``sub.fc2`` will not be quantized. All other modules in the model will be quantized with qconfig_global.

qconfig_dict = {'': per_channel_dynamic_qconfig}
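
To illustrate the override semantics described above, here is a hypothetical variant (not used in the rest of this tutorial) that quantizes everything except the classification head; ``classifier`` is the head's attribute name in BertForSequenceClassification:

# Hypothetical example only: per-channel dynamic quantization everywhere,
# but leave the classification head in fp32 by mapping it to None.
selective_qconfig_dict = {
    '': per_channel_dynamic_qconfig,
    'classifier': None,
}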

1.4.3 Quantize the model (one-line API)

We call the one-line API (similar to Eager Mode) to perform quantization as follows.

quantized_model = quantize_dynamic_jit(traced_model, qconfig_dict)
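
As an optional sanity check (not part of the original flow), we can run both the traced FP32 model and the quantized model on the dummy input used for tracing and compare the logits; as in the evaluation loop below, the first element of the returned tuple holds the logits:

# Optional sanity check: compare FP32 and INT8 logits on the tracing input.
with torch.no_grad():
    fp32_logits = traced_model(input_ids, attention_mask, token_type_ids)[0]
    int8_logits = quantized_model(input_ids, attention_mask, token_type_ids)[0]
print('Max absolute difference:', (fp32_logits - int8_logits).abs().max().item())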

2. Evaluation

We reuse the tokenization and evaluation functions from HuggingFace.

def evaluate(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # multi-gpu eval
        if args.n_gpu > 1:
            model = torch.nn.DataParallel(model)

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)

            with torch.no_grad():
                inputs = {'input_ids':      batch[0],
                          'attention_mask': batch[1]}
                labels = batch[3]
                if args.model_type != 'distilbert':
                    inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None  # XLM, DistilBERT and RoBERTa don't use segment_ids
                outputs = model(**inputs)
                logits = outputs[0]
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = labels.detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)

        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results

def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    processor = processors[task]()
    output_mode = output_modes[task]
    # Load data features from cache or dataset file
    cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}'.format(
        'dev' if evaluate else 'train',
        list(filter(None, args.model_name_or_path.split('/'))).pop(),
        str(args.max_seq_length),
        str(task)))
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        label_list = processor.get_labels()
        if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']:
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
        features = convert_examples_to_features(examples,
                                                tokenizer,
                                                label_list=label_list,
                                                max_length=args.max_seq_length,
                                                output_mode=output_mode,)
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset

def time_model_evaluation(model, configs, tokenizer):
    eval_start_time = time.time()
    result = evaluate(configs, model, tokenizer, prefix="")
    eval_end_time = time.time()
    eval_duration_time = eval_end_time - eval_start_time
    print(result)
    print("Evaluate total time (seconds): {0:.1f}".format(eval_duration_time))

2.1 Check Model Size

We print the model size to show the benefit of quantization.

def print_size_of_model(model):
    if isinstance(model, torch.jit.RecursiveScriptModule):
        torch.jit.save(model, "temp.p")
    else:
        torch.jit.save(torch.jit.script(model), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')

print("Size of model before quantization")
print_size_of_model(traced_model)
print("Size of model after quantization")

print_size_of_model(quantized_model)
Size of model before quantization
Size (MB): 438.242141
Size of model after quantization
Size (MB): 184.354759

2.2 Run the evaluation

We evaluate the FP32 and the quantized model and compare the F1 scores. Note that the performance numbers below were produced on a development machine and would likely improve on production servers.

time_model_evaluation(traced_model, configs, tokenizer)
time_model_evaluation(quantized_model, configs, tokenizer)
FP32 model results -
'f1': 0.901
Time taken - 188.0s

INT8 model results -
'f1': 0.902
Time taken - 157.4s

3. Debugging the Quantized Model

We can debug the quantized model by passing in the debug option.

quantized_model = quantize_dynamic_jit(traced_model, qconfig_dict, debug=True)

If debug is set to True:

  • We can access the attributes of the quantized model the same way as in a TorchScript model, e.g. model.fc1.weight (this may be harder if you use module lists or sequentials).

  • All arithmetic operations are carried out in floating point with numerics matching the final quantized model, which allows for debugging.

quantized_model_debug = quantize_dynamic_jit(traced_model, qconfig_dict, debug=True)

Calling ``quantize_dynamic_jit`` is equivalent to calling ``prepare_dynamic_jit`` followed by ``convert_dynamic_jit``. Using the one-line API is recommended, but if you wish to debug or analyze the model after each step, the multi-line API comes in handy.
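
A minimal sketch of the multi-line flow, assuming ``prepare_dynamic_jit`` and ``convert_dynamic_jit`` are importable from ``torch.quantization`` as in current prototype builds:

from torch.quantization import prepare_dynamic_jit, convert_dynamic_jit

# Step 1: insert observers/placeholders according to qconfig_dict.
prepared_model = prepare_dynamic_jit(traced_model, qconfig_dict)
# The prepared model can be inspected here, e.g. print(prepared_model.graph)
# Step 2: convert to the final (or debug) quantized model.
quantized_model_multi_step = convert_dynamic_jit(prepared_model, debug=False)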

3.1 Evaluate the debug model

# Evaluate the debug model
time_model_evaluation(quantized_model_debug, configs, tokenizer)
Size (MB): 438.406429

INT8 (debug=True) model results -
'f1': 0.897

Note that the accuracy of the debug version is close to, but not exactly the same as, the non-debug version: the debug version uses floating-point ops to emulate quantized ops, so the numerics match only approximately. This is the case only for per-channel quantization (we are working on improving this). Per-tensor quantization (using the default dynamic quantization qconfig) has an exact numerics match between the debug and non-debug versions.
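
For example, a minimal sketch of using the default per-tensor dynamic qconfig instead (``default_dynamic_qconfig`` from ``torch.quantization``), for which the debug and non-debug versions are expected to match exactly:

from torch.quantization import default_dynamic_qconfig

# Per-tensor dynamic quantization: debug and non-debug numerics match exactly.
per_tensor_qconfig_dict = {'': default_dynamic_qconfig}
quantized_per_tensor = quantize_dynamic_jit(traced_model, per_tensor_qconfig_dict)
quantized_per_tensor_debug = quantize_dynamic_jit(traced_model, per_tensor_qconfig_dict, debug=True)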

print(str(quantized_model_debug.graph))

Snippet of the printed graph -

%111 : Tensor = prim::GetAttr[name="bias"](%110)
%112 : Tensor = prim::GetAttr[name="weight"](%110)
%113 : Float(768:1) = prim::GetAttr[name="4_scale_0"](%110)
%114 : Int(768:1) = prim::GetAttr[name="4_zero_point_0"](%110)
%115 : int = prim::GetAttr[name="4_axis_0"](%110)
%116 : int = prim::GetAttr[name="4_scalar_type_0"](%110)
%4.quant.6 : Tensor = aten::quantize_per_channel(%112, %113, %114, %115, %116)
%4.dequant.6 : Tensor = aten::dequantize(%4.quant.6)
%1640 : bool = prim::Constant[value=1]()
%input.5.scale.1 : float, %input.5.zero_point.1 : int = aten::_choose_qparams_per_tensor(%input.5, %1640)
%input.5.quant.1 : Tensor = aten::quantize_per_tensor(%input.5, %input.5.scale.1, %input.5.zero_point.1, %74)
%input.5.dequant.1 : Float(8:98304, 128:768, 768:1) = aten::dequantize(%input.5.quant.1)
%119 : Tensor = aten::linear(%input.5.dequant.1, %4.dequant.6, %111)

We can see that there is no ``quantized::linear_dynamic`` in the model; instead we see the numerically equivalent pattern ``aten::_choose_qparams_per_tensor`` - ``aten::quantize_per_tensor`` - ``aten::dequantize`` - ``aten::linear``.
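
To make the pattern concrete, here is an illustrative (eager, fp32-weight) sketch of what that sequence computes for a single linear layer; the qparams are derived from the activation's min/max, which is what ``aten::_choose_qparams_per_tensor`` does at runtime:

import torch
import torch.nn.functional as F

# Illustrative reference only, not the actual kernel: dynamically pick
# qparams for the activation, quantize/dequantize it, then run fp32 linear.
x = torch.randn(8, 768)          # activation
w = torch.randn(768, 768)        # fp32 weight (kept in float for brevity)
b = torch.zeros(768)

x_min, x_max = x.min().item(), x.max().item()
scale = (x_max - x_min) / 255.0                  # quint8 has 256 levels
zero_point = int(round(-x_min / scale))
x_dq = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8).dequantize()
out = F.linear(x_dq, w, b)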

# Get the size of the debug model
print_size_of_model(quantized_model_debug)
Size (MB): 438.406429

The size of the debug model is close to that of the floating-point model because all the weights are still in float and have not yet been quantized and frozen; this allows you to inspect the weights. You can access the weight attributes directly in the TorchScript model. Accessing the weights in the debug model is the same as accessing the weights in a TorchScript model:

print(quantized_model.bert.encoder.layer._c.getattr('0').attention.self.query.weight)
tensor([[-0.0157,  0.0257, -0.0269,  ...,  0.0158,  0.0764,  0.0548],
        [-0.0325,  0.0345, -0.0423,  ..., -0.0528,  0.1382,  0.0069],
        [ 0.0106,  0.0335,  0.0113,  ..., -0.0275,  0.0253, -0.0457],
        ...,
        [-0.0090,  0.0512,  0.0555,  ...,  0.0277,  0.0543, -0.0539],
        [-0.0195,  0.0943,  0.0619,  ..., -0.1040,  0.0598,  0.0465],
        [ 0.0009, -0.0949,  0.0097,  ..., -0.0183, -0.0511, -0.0085]],
        grad_fn=<CloneBackward>)

The scale and zero_point of the corresponding weight can be accessed as follows -

print(quantized_model.bert.encoder.layer._c.getattr('0').attention.self.query.getattr('4_scale_0'))
print(quantized_model.bert.encoder.layer._c.getattr('0').attention.self.query.getattr('4_zero_point_0'))

Since we use per-channel quantization, we get a per-channel scale tensor.

tensor([0.0009, 0.0011, 0.0010, 0.0011, 0.0034, 0.0013, 0.0010, 0.0010, 0.0013,
        0.0012, 0.0011, 0.0010, 0.0010, 0.0010, 0.0010, 0.0010, 0.0009, 0.0015,
        0.0016, 0.0036, 0.0012, 0.0009, 0.0010, 0.0014, 0.0008, 0.0008, 0.0008,
        ...,
        0.0019, 0.0023, 0.0013, 0.0018, 0.0012, 0.0031, 0.0015, 0.0013, 0.0014,
        0.0022, 0.0011, 0.0024])

Zero-point tensor -

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        ..,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       dtype=torch.int32)

4. Comparing Results with Eager Mode

The following results show the F1 score and model size of the same model quantized with Eager Mode quantization by following the steps mentioned in the `tutorial <https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html#evaluate-the-inference-accuracy-and-time>`_. The results show that Eager and Graph Mode quantization produce identical results for this model.

FP32 model results -
Size (MB): 438.016605
'f1': 0.901

INT8 model results -
Size (MB): 182.878029
'f1': 0.902

5. Benchmarking the Model

We benchmark the models with dummy input and compare the float model with the Eager Mode and Graph Mode quantized models on a production server machine.
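
The benchmark below loads serialized TorchScript models from paths held in ``args`` (not defined in this snippet). A minimal sketch, with purely illustrative file names, of how such files could be produced from the models built earlier:

# Illustrative only: serialize the traced FP32 model and the graph mode
# quantized model so the benchmark can load them from disk.
torch.jit.save(traced_model, 'bert_mrpc_float.pt')
torch.jit.save(quantized_model, 'bert_mrpc_graph_quantized.pt')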

def benchmark(model):
    model = torch.jit.load(model)
    model.eval()
    torch.set_num_threads(1)
    input_ids = ids_tensor([8, 128], 2)
    token_type_ids = ids_tensor([8, 128], 2)
    attention_mask = ids_tensor([8, 128], vocab_size=2)
    elapsed = 0
    for _i in range(50):
        start = time.time()
        output = model(input_ids, token_type_ids, attention_mask)
        end = time.time()
        elapsed = elapsed + (end - start)
    print('Elapsed time: ', (elapsed / 50), ' s')
    return
print("Running benchmark for Float model")
benchmark(args.jit_model_path_float)
print("Running benchmark for Eager Mode Quantized model")
benchmark(args.jit_model_path_eager)
print("Running benchmark for Graph Mode Quantized model")
benchmark(args.jit_model_path_graph)
Running benchmark for Float model
Elapsed time: 4.49 s
Running benchmark for Eager Mode Quantized model
Elapsed time: 2.67 s
Running benchmark for Graph Mode Quantized model
Elapsed time: 2.69 s
As we can see, both the graph mode and eager mode quantized models achieve a similar speedup over the floating-point model.

Conclusion

In this tutorial we demonstrated how to convert a well-known, state-of-the-art NLP model like BERT into a dynamically quantized model using graph mode, with performance on par with Eager Mode. Dynamic quantization can reduce the size of the model while having only a limited impact on accuracy.

Thanks for reading! As always, we welcome any feedback, so please create an issue `here <https://github.com/pytorch/pytorch/issues>`_ if you have any.
