{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(prototype) FX Graph Mode Post Training Dynamic Quantization\n============================================================\n\n**Author**: [Jerry Zhang](https://github.com/jerryzh168)\n\nThis tutorial introduces the steps to do post training dynamic\nquantization in graph mode based on `torch.fx`. We have a separate\ntutorial for [FX Graph Mode Post Training Static\nQuantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html),\ncomparison between FX Graph Mode Quantization and Eager Mode\nQuantization can be found in the [quantization\ndocs](https://pytorch.org/docs/master/quantization.html#quantization-api-summary)\n\ntldr; The FX Graph Mode API for dynamic quantization looks like the\nfollowing:\n\n``` {.python}\nimport torch\nfrom torch.ao.quantization import default_dynamic_qconfig, QConfigMapping\n# Note that this is temporary, we'll expose these functions to torch.ao.quantization after official releasee\nfrom torch.quantization.quantize_fx import prepare_fx, convert_fx\n\nfloat_model.eval()\n# The old 'fbgemm' is still available but 'x86' is the recommended default.\nqconfig = get_default_qconfig(\"x86\")\nqconfig_mapping = QConfigMapping().set_global(qconfig)\nprepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs) # fuse modules and insert observers\n# no calibration is required for dynamic quantization\nquantized_model = convert_fx(prepared_model) # convert the model to a dynamically quantized model\n```\n\nIn this tutorial, we'll apply dynamic quantization to an LSTM-based next\nword-prediction model, closely following the word language model from\nthe PyTorch examples. We will copy the code from [Dynamic Quantization\non an LSTM Word Language\nModel](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html)\nand omit the descriptions.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Define the model:\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# imports\nimport os\nfrom io import open\nimport time\nimport copy\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n# Model Definition\nclass LSTMModel(nn.Module):\n    \"\"\"Container module with an encoder, a recurrent module, and a decoder.\"\"\"\n\n    def __init__(self, ntoken, ninp, nhid, nlayers, dropout=0.5):\n        super(LSTMModel, self).__init__()\n        self.drop = nn.Dropout(dropout)\n        self.encoder = nn.Embedding(ntoken, ninp)\n        self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)\n        self.decoder = nn.Linear(nhid, ntoken)\n\n        self.init_weights()\n\n        self.nhid = nhid\n        self.nlayers = nlayers\n\n    def init_weights(self):\n        initrange = 0.1\n        self.encoder.weight.data.uniform_(-initrange, initrange)\n        self.decoder.bias.data.zero_()\n        self.decoder.weight.data.uniform_(-initrange, initrange)\n\n    def forward(self, input, hidden):\n        emb = self.drop(self.encoder(input))\n        output, hidden = self.rnn(emb, hidden)\n        output = self.drop(output)\n        decoded = self.decoder(output)\n        return decoded, hidden\n\n\ndef init_hidden(lstm_model, bsz):\n    # get the weight tensor and create the hidden states on the same device\n    weight = lstm_model.encoder.weight\n    # get the weight from a quantized model (weight is a callable there)\n    if not isinstance(weight, torch.Tensor):\n        weight = weight()\n    device = weight.device\n    nlayers = lstm_model.rnn.num_layers\n    nhid = lstm_model.rnn.hidden_size\n    return (torch.zeros(nlayers, bsz, nhid, device=device),\n            torch.zeros(nlayers, bsz, nhid, device=device))\n\n\n# Load Text Data\nclass Dictionary(object):\n    def __init__(self):\n        self.word2idx = {}\n        self.idx2word = []\n\n    def add_word(self, word):\n        if word not in self.word2idx:\n            self.idx2word.append(word)\n            self.word2idx[word] = len(self.idx2word) - 1\n        return self.word2idx[word]\n\n    def __len__(self):\n        return len(self.idx2word)\n\n\nclass Corpus(object):\n    def __init__(self, path):\n        self.dictionary = Dictionary()\n        self.train = self.tokenize(os.path.join(path, 'wiki.train.tokens'))\n        self.valid = self.tokenize(os.path.join(path, 'wiki.valid.tokens'))\n        self.test = self.tokenize(os.path.join(path, 'wiki.test.tokens'))\n\n    def tokenize(self, path):\n        \"\"\"Tokenizes a text file.\"\"\"\n        assert os.path.exists(path)\n        # Add words to the dictionary\n        with open(path, 'r', encoding=\"utf8\") as f:\n            for line in f:\n                words = line.split() + ['<eos>']\n                for word in words:\n                    self.dictionary.add_word(word)\n\n        # Tokenize file content\n        with open(path, 'r', encoding=\"utf8\") as f:\n            idss = []\n            for line in f:\n                words = line.split() + ['<eos>']\n                ids = []\n                for word in words:\n                    ids.append(self.dictionary.word2idx[word])\n                idss.append(torch.tensor(ids).type(torch.int64))\n            ids = torch.cat(idss)\n\n        return ids\n\nmodel_data_filepath = 'data/'\n\ncorpus = Corpus(model_data_filepath + 'wikitext-2')\n\nntokens = len(corpus.dictionary)\n\n# Load Pretrained Model\nmodel = LSTMModel(\n    ntoken = ntokens,\n    ninp = 512,\n    nhid = 256,\n    nlayers = 5,\n)\n\nmodel.load_state_dict(\n    torch.load(\n        model_data_filepath + 'word_language_model_quantize.pth',\n        map_location=torch.device('cpu'),\n        weights_only=True\n    )\n)\n\nmodel.eval()\nprint(model)\n\nbptt = 25\ncriterion = nn.CrossEntropyLoss()\neval_batch_size = 1\n\n# create test data set\ndef batchify(data, bsz):\n    # Work out how cleanly we can divide the dataset into bsz parts.\n    nbatch = data.size(0) // bsz\n    # Trim off any extra elements that wouldn't cleanly fit (remainders).\n    data = data.narrow(0, 0, nbatch * bsz)\n    # Evenly divide the data across the bsz batches.\n    return data.view(bsz, -1).t().contiguous()\n\ntest_data = batchify(corpus.test, eval_batch_size)\nexample_inputs = (next(iter(test_data))[0])\n\n# Evaluation functions\ndef get_batch(source, i):\n    seq_len = min(bptt, len(source) - 1 - i)\n    data = source[i:i+seq_len]\n    target = source[i+1:i+1+seq_len].reshape(-1)\n    return data, target\n\ndef repackage_hidden(h):\n    \"\"\"Wraps hidden states in new Tensors, to detach them from their history.\"\"\"\n\n    if isinstance(h, torch.Tensor):\n        return h.detach()\n    else:\n        return tuple(repackage_hidden(v) for v in h)\n\ndef evaluate(model_, data_source):\n    # Turn on evaluation mode, which disables dropout.\n    model_.eval()\n    total_loss = 0.\n    hidden = init_hidden(model_, eval_batch_size)\n    with torch.no_grad():\n        for i in range(0, data_source.size(0) - 1, bptt):\n            data, targets = get_batch(data_source, i)\n            output, hidden = model_(data, hidden)\n            hidden = repackage_hidden(hidden)\n            output_flat = output.view(-1, ntokens)\n            total_loss += len(data) * criterion(output_flat, targets).item()\n    return total_loss / (len(data_source) - 1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "2. Post Training Dynamic Quantization\n=====================================\n\nNow we can dynamically quantize the model. We use the same functions as\nin post training static quantization, but with a dynamic qconfig.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx\nfrom torch.ao.quantization import default_dynamic_qconfig, float_qparams_weight_only_qconfig, QConfigMapping\n\n# Full docs for the supported qconfigs for floating point modules/ops and for\n# QConfigMapping can be found in the quantization docs:\n# https://pytorch.org/docs/stable/quantization.html\nqconfig_mapping = (QConfigMapping()\n    .set_object_type(nn.Embedding, float_qparams_weight_only_qconfig)\n    .set_object_type(nn.LSTM, default_dynamic_qconfig)\n    .set_object_type(nn.Linear, default_dynamic_qconfig)\n)\n\n# Reload the float model, because the quantization API modifies the model\n# in place and we want to keep the original model for later comparison.\nmodel_to_quantize = LSTMModel(\n    ntoken = ntokens,\n    ninp = 512,\n    nhid = 256,\n    nlayers = 5,\n)\n\nmodel_to_quantize.load_state_dict(\n    torch.load(\n        model_data_filepath + 'word_language_model_quantize.pth',\n        map_location=torch.device('cpu'),\n        weights_only=True\n    )\n)\n\nmodel_to_quantize.eval()\n\nprepared_model = prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)\nprint(\"prepared model:\", prepared_model)\nquantized_model = convert_fx(prepared_model)\nprint(\"quantized model:\", quantized_model)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For dynamically quantized modules, `prepare_fx` leaves the modules\nthemselves unchanged, but it inserts weight observers for dynamically\nquantizable functionals and torch ops. It also fuses patterns such as\nConv + BatchNorm and Linear + ReLU.\n\nIn `convert_fx` we convert the float modules to dynamically quantized\nmodules and convert the float ops to dynamically quantized ops. We can\nsee that in the example model, `nn.Embedding`, `nn.Linear` and `nn.LSTM`\nare dynamically quantized.\n" ] },
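{ "cell_type": "markdown", "metadata": {}, "source": [ "To make this concrete, we can walk the converted `GraphModule` and list the submodules that were swapped for quantized counterparts. A minimal sketch (assuming the `torch.ao.nn.quantized` namespace of recent PyTorch releases; older releases expose the same classes under `torch.nn.quantized`):\n\n``` {.python}\nimport torch.ao.nn.quantized as nnq\nimport torch.ao.nn.quantized.dynamic as nnqd\n\n# list every submodule that convert_fx replaced with a quantized version\nfor name, module in quantized_model.named_modules():\n    if isinstance(module, (nnq.Embedding, nnqd.Linear, nnqd.LSTM)):\n        print(name, '->', type(module).__name__)\n```\n" ] },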
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we can compare the size and runtime of the quantized model.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def print_size_of_model(model):\n    torch.save(model.state_dict(), \"temp.p\")\n    print('Size (MB):', os.path.getsize(\"temp.p\")/1e6)\n    os.remove('temp.p')\n\nprint_size_of_model(model)\nprint_size_of_model(quantized_model)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "There is a 4x size reduction because we quantized all the weights in the\nmodel (`nn.Embedding`, `nn.Linear` and `nn.LSTM`) from float32 (4 bytes)\nto int8 (1 byte).\n" ] },
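{ "cell_type": "markdown", "metadata": {}, "source": [ "As a back-of-the-envelope check: the weights dominate the size of this model, so storing them in int8 should land near a quarter of the float parameter bytes. A quick sketch against the float model:\n\n``` {.python}\n# bytes used by the float32 parameters (element_size() is 4 for float32)\nfloat_bytes = sum(p.numel() * p.element_size() for p in model.parameters())\nprint('float parameter size (MB):', float_bytes / 1e6)\n# int8 storage is 1 byte per weight, so we expect roughly a 4x reduction\nprint('expected quantized size (MB): ~', float_bytes / 4 / 1e6)\n```\n" ] },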
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "torch.set_num_threads(1)\n\ndef time_model_evaluation(model, test_data):\n    s = time.time()\n    loss = evaluate(model, test_data)\n    elapsed = time.time() - s\n    print('''loss: {0:.3f}\\nelapsed time (seconds): {1:.1f}'''.format(loss, elapsed))\n\ntime_model_evaluation(model, test_data)\ntime_model_evaluation(quantized_model, test_data)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "There is a roughly 2x speedup for this model. Note that the speedup may\nvary depending on the model, device, build, input batch size, threading,\netc.\n\n3. Conclusion\n=============\n\nThis tutorial introduces the API for post training dynamic quantization\nin FX Graph Mode, which dynamically quantizes the same modules as Eager\nMode Quantization.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }