{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(prototype) FX Graph Mode Post Training Dynamic Quantization\n============================================================\n\n**Author**: [Jerry Zhang](https://github.com/jerryzh168)\n\nThis tutorial introduces the steps to do post training dynamic\nquantization in graph mode based on `torch.fx`. We have a separate\ntutorial for [FX Graph Mode Post Training Static\nQuantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html),\ncomparison between FX Graph Mode Quantization and Eager Mode\nQuantization can be found in the [quantization\ndocs](https://pytorch.org/docs/master/quantization.html#quantization-api-summary)\n\ntldr; The FX Graph Mode API for dynamic quantization looks like the\nfollowing:\n\n``` {.python}\nimport torch\nfrom torch.ao.quantization import default_dynamic_qconfig, QConfigMapping\n# Note that this is temporary, we'll expose these functions to torch.ao.quantization after official releasee\nfrom torch.quantization.quantize_fx import prepare_fx, convert_fx\n\nfloat_model.eval()\n# The old 'fbgemm' is still available but 'x86' is the recommended default.\nqconfig = get_default_qconfig(\"x86\")\nqconfig_mapping = QConfigMapping().set_global(qconfig)\nprepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs) # fuse modules and insert observers\n# no calibration is required for dynamic quantization\nquantized_model = convert_fx(prepared_model) # convert the model to a dynamically quantized model\n```\n\nIn this tutorial, we'll apply dynamic quantization to an LSTM-based next\nword-prediction model, closely following the word language model from\nthe PyTorch examples. We will copy the code from [Dynamic Quantization\non an LSTM Word Language\nModel](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html)\nand omit the descriptions.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Define the model:\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# imports\nimport os\nfrom io import open\nimport time\nimport copy\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n# Model Definition\nclass LSTMModel(nn.Module):\n    \"\"\"Container module with an encoder, a recurrent module, and a decoder.\"\"\"\n\n    def __init__(self, ntoken, ninp, nhid, nlayers, dropout=0.5):\n        super(LSTMModel, self).__init__()\n        self.drop = nn.Dropout(dropout)\n        self.encoder = nn.Embedding(ntoken, ninp)\n        self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)\n        self.decoder = nn.Linear(nhid, ntoken)\n\n        self.init_weights()\n\n        self.nhid = nhid\n        self.nlayers = nlayers\n\n    def init_weights(self):\n        initrange = 0.1\n        self.encoder.weight.data.uniform_(-initrange, initrange)\n        self.decoder.bias.data.zero_()\n        self.decoder.weight.data.uniform_(-initrange, initrange)\n\n    def forward(self, input, hidden):\n        emb = self.drop(self.encoder(input))\n        output, hidden = self.rnn(emb, hidden)\n        output = self.drop(output)\n        decoded = self.decoder(output)\n        return decoded, hidden\n\n\ndef init_hidden(lstm_model, bsz):\n    # get the weight tensor and create the hidden states on the same device\n    weight = lstm_model.encoder.weight\n    # get the weight from a quantized model (weight is a callable there)\n    if not isinstance(weight, torch.Tensor):\n        weight = weight()\n    device = weight.device\n    nlayers = lstm_model.rnn.num_layers\n    nhid = lstm_model.rnn.hidden_size\n    return (torch.zeros(nlayers, bsz, nhid, device=device),\n            torch.zeros(nlayers, bsz, nhid, device=device))\n\n\n# Load Text Data\nclass Dictionary(object):\n    def __init__(self):\n        self.word2idx = {}\n        self.idx2word = []\n\n    def add_word(self, word):\n        if word not in self.word2idx:\n            self.idx2word.append(word)\n            self.word2idx[word] = len(self.idx2word) - 1\n        return self.word2idx[word]\n\n    def __len__(self):\n        return len(self.idx2word)\n\n\nclass Corpus(object):\n    def __init__(self, path):\n        self.dictionary = Dictionary()\n        self.train = self.tokenize(os.path.join(path, 'wiki.train.tokens'))\n        self.valid = self.tokenize(os.path.join(path, 'wiki.valid.tokens'))\n        self.test = self.tokenize(os.path.join(path, 'wiki.test.tokens'))\n\n    def tokenize(self, path):\n        \"\"\"Tokenizes a text file.\"\"\"\n        assert os.path.exists(path)\n        # Add words to the dictionary\n        with open(path, 'r', encoding=\"utf8\") as f:\n            for line in f:\n                words = line.split() + ['<eos>']\n                for word in words:\n                    self.dictionary.add_word(word)\n\n        # Tokenize file content\n        with open(path, 'r', encoding=\"utf8\") as f:\n            idss = []\n            for line in f:\n                words = line.split() + ['<eos>']\n                ids = []\n                for word in words:\n                    ids.append(self.dictionary.word2idx[word])\n                idss.append(torch.tensor(ids).type(torch.int64))\n            ids = torch.cat(idss)\n\n        return ids\n\nmodel_data_filepath = 'data/'\n\ncorpus = Corpus(model_data_filepath + 'wikitext-2')\n\nntokens = len(corpus.dictionary)\n\n# Load Pretrained Model\nmodel = LSTMModel(\n    ntoken = ntokens,\n    ninp = 512,\n    nhid = 256,\n    nlayers = 5,\n)\n\nmodel.load_state_dict(\n    torch.load(\n        model_data_filepath + 'word_language_model_quantize.pth',\n        map_location=torch.device('cpu'),\n        weights_only=True\n    )\n)\n\nmodel.eval()\nprint(model)\n\nbptt = 25\ncriterion = nn.CrossEntropyLoss()\neval_batch_size = 1\n\n# create test data set\ndef batchify(data, bsz):\n    # Work out how cleanly we can divide the dataset into bsz parts.\n    nbatch = data.size(0) // bsz\n    # Trim off any extra elements that wouldn't cleanly fit (remainders).\n    data = data.narrow(0, 0, nbatch * bsz)\n    # Evenly divide the data across the bsz batches.\n    return data.view(bsz, -1).t().contiguous()\n\ntest_data = batchify(corpus.test, eval_batch_size)\nexample_inputs = (next(iter(test_data))[0])\n\n# Evaluation functions\ndef get_batch(source, i):\n    seq_len = min(bptt, len(source) - 1 - i)\n    data = source[i:i+seq_len]\n    target = source[i+1:i+1+seq_len].reshape(-1)\n    return data, target\n\ndef repackage_hidden(h):\n    \"\"\"Wraps hidden states in new Tensors, to detach them from their history.\"\"\"\n\n    if isinstance(h, torch.Tensor):\n        return h.detach()\n    else:\n        return tuple(repackage_hidden(v) for v in h)\n\ndef evaluate(model_, data_source):\n    # Turn on evaluation mode, which disables dropout.\n    model_.eval()\n    total_loss = 0.\n    hidden = init_hidden(model_, eval_batch_size)\n    with torch.no_grad():\n        for i in range(0, data_source.size(0) - 1, bptt):\n            data, targets = get_batch(data_source, i)\n            output, hidden = model_(data, hidden)\n            hidden = repackage_hidden(hidden)\n            output_flat = output.view(-1, ntokens)\n            total_loss += len(data) * criterion(output_flat, targets).item()\n    return total_loss / (len(data_source) - 1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "2. Post Training Dynamic Quantization\n=====================================\n\nNow we can dynamically quantize the model. We use the same functions as\nin post training static quantization, but with a dynamic qconfig.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx\nfrom torch.ao.quantization import default_dynamic_qconfig, float_qparams_weight_only_qconfig, QConfigMapping\n\n# Full docs for the supported qconfigs for floating point modules/ops and for\n# QConfigMapping can be found in the quantization docs:\n# https://pytorch.org/docs/stable/quantization.html\nqconfig_mapping = (QConfigMapping()\n    .set_object_type(nn.Embedding, float_qparams_weight_only_qconfig)\n    .set_object_type(nn.LSTM, default_dynamic_qconfig)\n    .set_object_type(nn.Linear, default_dynamic_qconfig)\n)\n\n# Reload the float model, because the quantization API modifies the model\n# in place and we want to keep the original model for later comparison.\nmodel_to_quantize = LSTMModel(\n    ntoken = ntokens,\n    ninp = 512,\n    nhid = 256,\n    nlayers = 5,\n)\n\nmodel_to_quantize.load_state_dict(\n    torch.load(\n        model_data_filepath + 'word_language_model_quantize.pth',\n        map_location=torch.device('cpu'),\n        weights_only=True\n    )\n)\n\nmodel_to_quantize.eval()\n\nprepared_model = prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)\nprint(\"prepared model:\", prepared_model)\nquantized_model = convert_fx(prepared_model)\nprint(\"quantized model:\", quantized_model)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For dynamically quantized modules, `prepare_fx` leaves the modules\nthemselves unchanged, but it inserts weight observers for dynamically\nquantizable functionals and torch ops. It also fuses patterns such as\nConv + BatchNorm and Linear + ReLU.\n\nIn `convert_fx` we convert the float modules to dynamically quantized\nmodules and convert the float ops to dynamically quantized ops. We can\nsee that in the example model, `nn.Embedding`, `nn.Linear` and `nn.LSTM`\nare dynamically quantized.\n" ] },
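{ "cell_type": "markdown", "metadata": {}, "source": [ "To make this concrete, we can walk the converted `GraphModule` and list the submodules that were swapped for quantized counterparts. A minimal sketch (assuming the `torch.ao.nn.quantized` namespace of recent PyTorch releases; older releases expose the same classes under `torch.nn.quantized`):\n\n``` {.python}\nimport torch.ao.nn.quantized as nnq\nimport torch.ao.nn.quantized.dynamic as nnqd\n\n# list every submodule that convert_fx replaced with a quantized version\nfor name, module in quantized_model.named_modules():\n    if isinstance(module, (nnq.Embedding, nnqd.Linear, nnqd.LSTM)):\n        print(name, '->', type(module).__name__)\n```\n" ] },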
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we can compare the size and runtime of the quantized model.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def print_size_of_model(model):\n    torch.save(model.state_dict(), \"temp.p\")\n    print('Size (MB):', os.path.getsize(\"temp.p\")/1e6)\n    os.remove('temp.p')\n\nprint_size_of_model(model)\nprint_size_of_model(quantized_model)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "There is a 4x size reduction because we quantized all the weights in the\nmodel (`nn.Embedding`, `nn.Linear` and `nn.LSTM`) from float32 (4 bytes)\nto int8 (1 byte).\n" ] },
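{ "cell_type": "markdown", "metadata": {}, "source": [ "As a back-of-the-envelope check: the weights dominate the size of this model, so storing them in int8 should land near a quarter of the float parameter bytes. A quick sketch against the float model:\n\n``` {.python}\n# bytes used by the float32 parameters (element_size() is 4 for float32)\nfloat_bytes = sum(p.numel() * p.element_size() for p in model.parameters())\nprint('float parameter size (MB):', float_bytes / 1e6)\n# int8 storage is 1 byte per weight, so we expect roughly a 4x reduction\nprint('expected quantized size (MB): ~', float_bytes / 4 / 1e6)\n```\n" ] },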
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "torch.set_num_threads(1)\n\ndef time_model_evaluation(model, test_data):\n    s = time.time()\n    loss = evaluate(model, test_data)\n    elapsed = time.time() - s\n    print('''loss: {0:.3f}\\nelapsed time (seconds): {1:.1f}'''.format(loss, elapsed))\n\ntime_model_evaluation(model, test_data)\ntime_model_evaluation(quantized_model, test_data)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "There is a roughly 2x speedup for this model. Note that the speedup may\nvary depending on the model, device, build, input batch size, threading,\netc.\n\n3. Conclusion\n=============\n\nThis tutorial introduces the API for post training dynamic quantization\nin FX Graph Mode, which dynamically quantizes the same modules as Eager\nMode Quantization.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }