{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(beta) Dynamic Quantization on an LSTM Word Language Model\n==========================================================\n\n**Author**: [James Reed](https://github.com/jamesr66a)\n\n**Edited by**: [Seth Weidman](https://github.com/SethHWeidman/)\n\nIntroduction\n------------\n\nQuantization involves converting the weights and activations of your\nmodel from float to int, which can result in smaller model size and\nfaster inference with only a small hit to accuracy.\n\nIn this tutorial, we will apply the easiest form of quantization\n-[dynamic\nquantization](https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic)\n-to an LSTM-based next word-prediction model, closely following the\n[word language\nmodel](https://github.com/pytorch/examples/tree/master/word_language_model)\nfrom the PyTorch examples.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# imports\nimport os\nfrom io import open\nimport time\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Define the model\n===================\n\nHere we define the LSTM model architecture, following the\n[model](https://github.com/pytorch/examples/blob/master/word_language_model/model.py)\nfrom the word language model example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class LSTMModel(nn.Module):\n \"\"\"Container module with an encoder, a recurrent module, and a decoder.\"\"\"\n\n def __init__(self, ntoken, ninp, nhid, nlayers, dropout=0.5):\n super(LSTMModel, self).__init__()\n self.drop = nn.Dropout(dropout)\n self.encoder = nn.Embedding(ntoken, ninp)\n self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)\n self.decoder = nn.Linear(nhid, ntoken)\n\n self.init_weights()\n\n self.nhid = nhid\n self.nlayers = nlayers\n\n def init_weights(self):\n initrange = 0.1\n self.encoder.weight.data.uniform_(-initrange, initrange)\n self.decoder.bias.data.zero_()\n self.decoder.weight.data.uniform_(-initrange, initrange)\n\n def forward(self, input, hidden):\n emb = self.drop(self.encoder(input))\n output, hidden = self.rnn(emb, hidden)\n output = self.drop(output)\n decoded = self.decoder(output)\n return decoded, hidden\n\n def init_hidden(self, bsz):\n weight = next(self.parameters())\n return (weight.new_zeros(self.nlayers, bsz, self.nhid),\n weight.new_zeros(self.nlayers, bsz, self.nhid))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "2. Load in the text data\n========================\n\nNext, we load the [Wikitext-2\ndataset](https://www.google.com/search?q=wikitext+2+data) into a\n`Corpus`, again following the\n[preprocessing](https://github.com/pytorch/examples/blob/master/word_language_model/data.py)\nfrom the word language model example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Dictionary(object):\n    def __init__(self):\n        self.word2idx = {}\n        self.idx2word = []\n\n    def add_word(self, word):\n        if word not in self.word2idx:\n            self.idx2word.append(word)\n            self.word2idx[word] = len(self.idx2word) - 1\n        return self.word2idx[word]\n\n    def __len__(self):\n        return len(self.idx2word)\n\n\nclass Corpus(object):\n    def __init__(self, path):\n        self.dictionary = Dictionary()\n        self.train = self.tokenize(os.path.join(path, 'train.txt'))\n        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))\n        self.test = self.tokenize(os.path.join(path, 'test.txt'))\n\n    def tokenize(self, path):\n        \"\"\"Tokenizes a text file.\"\"\"\n        assert os.path.exists(path)\n        # Add words to the dictionary\n        with open(path, 'r', encoding=\"utf8\") as f:\n            for line in f:\n                words = line.split() + ['<eos>']\n                for word in words:\n                    self.dictionary.add_word(word)\n\n        # Tokenize file content\n        with open(path, 'r', encoding=\"utf8\") as f:\n            idss = []\n            for line in f:\n                words = line.split() + ['<eos>']\n                ids = []\n                for word in words:\n                    ids.append(self.dictionary.word2idx[word])\n                idss.append(torch.tensor(ids).type(torch.int64))\n            ids = torch.cat(idss)\n\n        return ids\n\nmodel_data_filepath = 'data/'\n\ncorpus = Corpus(model_data_filepath + 'wikitext-2')" ] },
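{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, the short cell below is an optional, illustrative exercise of the `Dictionary` class defined above on a handful of made-up words; it is not part of the original preprocessing.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative only: exercise the Dictionary class on a few made-up words.\ntoy_dict = Dictionary()\nfor w in ['the', 'cat', 'sat', 'the']:\n    toy_dict.add_word(w)\n\nprint(len(toy_dict))             # 3 unique words\nprint(toy_dict.word2idx['cat'])  # index assigned on first insertion\nprint(toy_dict.idx2word[0])      # 'the'" ] },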
{ "cell_type": "markdown", "metadata": {}, "source": [ "3. Load the pretrained model\n============================\n\nThis is a tutorial on dynamic quantization, a quantization technique\nthat is applied after a model has been trained. Therefore, we\\'ll simply\nload some pretrained weights into this model architecture; these weights\nwere obtained by training for five epochs using the default settings in\nthe word language model example.\n\nBefore running this tutorial, download the required pre-trained model:\n\n```bash\nwget https://s3.amazonaws.com/pytorch-tutorial-assets/word_language_model_quantize.pth\n```\n\nPlace the downloaded file in the data directory or update\n`model_data_filepath` accordingly.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ntokens = len(corpus.dictionary)\n\nmodel = LSTMModel(\n    ntoken = ntokens,\n    ninp = 512,\n    nhid = 256,\n    nlayers = 5,\n)\n\nmodel.load_state_dict(\n    torch.load(\n        model_data_filepath + 'word_language_model_quantize.pth',\n        map_location=torch.device('cpu'),\n        weights_only=True\n    )\n)\n\nmodel.eval()\nprint(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let\\'s generate some text to ensure that the pretrained model is\nworking properly. As before, we follow the\n[generation script](https://github.com/pytorch/examples/blob/master/word_language_model/generate.py)\nfrom the word language model example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "input_ = torch.randint(ntokens, (1, 1), dtype=torch.long)\nhidden = model.init_hidden(1)\ntemperature = 1.0\nnum_words = 1000\n\nwith open(model_data_filepath + 'out.txt', 'w') as outf:\n    with torch.no_grad():  # no tracking history\n        for i in range(num_words):\n            output, hidden = model(input_, hidden)\n            word_weights = output.squeeze().div(temperature).exp().cpu()\n            word_idx = torch.multinomial(word_weights, 1)[0]\n            input_.fill_(word_idx)\n\n            word = corpus.dictionary.idx2word[word_idx]\n\n            outf.write(str(word.encode('utf-8')) + ('\\n' if i % 20 == 19 else ' '))\n\n            if i % 100 == 0:\n                print('| Generated {}/{} words'.format(i, num_words))\n\nwith open(model_data_filepath + 'out.txt', 'r') as outf:\n    all_output = outf.read()\n    print(all_output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It\\'s no GPT-2, but it looks like the model has started to learn the\nstructure of language!\n\nWe\\'re almost ready to demonstrate dynamic quantization.\n" ] },
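{ "cell_type": "markdown", "metadata": {}, "source": [ "As a brief aside, the generation loop above turns the model output into next-word probabilities by dividing the logits by a temperature, exponentiating, and drawing from `torch.multinomial`. The sketch below, using made-up logits, shows that this is the same as sampling from a temperature-scaled softmax.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Aside (illustrative only): temperature sampling with made-up logits.\nlogits = torch.tensor([2.0, 1.0, 0.1])\ntemperature = 0.8\n\nweights = (logits / temperature).exp()             # what the loop above builds\nprobs = torch.softmax(logits / temperature, dim=0)\nprint(torch.allclose(weights / weights.sum(), probs))  # True: same distribution\n\n# torch.multinomial normalizes non-negative weights internally.\nprint(torch.multinomial(weights, num_samples=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "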
We just need to\ndefine a few more helper functions:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bptt = 25\ncriterion = nn.CrossEntropyLoss()\neval_batch_size = 1\n\n# create test data set\ndef batchify(data, bsz):\n # Work out how cleanly we can divide the dataset into ``bsz`` parts.\n nbatch = data.size(0) // bsz\n # Trim off any extra elements that wouldn't cleanly fit (remainders).\n data = data.narrow(0, 0, nbatch * bsz)\n # Evenly divide the data across the ``bsz`` batches.\n return data.view(bsz, -1).t().contiguous()\n\ntest_data = batchify(corpus.test, eval_batch_size)\n\n# Evaluation functions\ndef get_batch(source, i):\n seq_len = min(bptt, len(source) - 1 - i)\n data = source[i:i+seq_len]\n target = source[i+1:i+1+seq_len].reshape(-1)\n return data, target\n\ndef repackage_hidden(h):\n \"\"\"Wraps hidden states in new Tensors, to detach them from their history.\"\"\"\n\n if isinstance(h, torch.Tensor):\n return h.detach()\n else:\n return tuple(repackage_hidden(v) for v in h)\n\ndef evaluate(model_, data_source):\n # Turn on evaluation mode which disables dropout.\n model_.eval()\n total_loss = 0.\n hidden = model_.init_hidden(eval_batch_size)\n with torch.no_grad():\n for i in range(0, data_source.size(0) - 1, bptt):\n data, targets = get_batch(data_source, i)\n output, hidden = model_(data, hidden)\n hidden = repackage_hidden(hidden)\n output_flat = output.view(-1, ntokens)\n total_loss += len(data) * criterion(output_flat, targets).item()\n return total_loss / (len(data_source) - 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Test dynamic quantization\n============================\n\nFinally, we can call `torch.quantization.quantize_dynamic` on the model!\nSpecifically,\n\n- We specify that we want the `nn.LSTM` and `nn.Linear` modules in our\n model to be quantized\n- We specify that we want weights to be converted to `int8` values\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch.quantization\n\nquantized_model = torch.quantization.quantize_dynamic(\n model, {nn.LSTM, nn.Linear}, dtype=torch.qint8\n)\nprint(quantized_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model looks the same; how has this benefited us? 
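One quick check is to list the submodules that were swapped out for dynamically quantized versions (a small sketch; the exact quantized class names vary between PyTorch releases):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# List the submodules replaced by dynamic quantization.\n# Note: the exact quantized class names differ across PyTorch versions.\nfor name, module in quantized_model.named_modules():\n    if 'quantized' in type(module).__module__:\n        print(name, '->', type(module))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "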
First, we see a\nsignificant reduction in model size:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def print_size_of_model(model):\n torch.save(model.state_dict(), \"temp.p\")\n print('Size (MB):', os.path.getsize(\"temp.p\")/1e6)\n os.remove('temp.p')\n\nprint_size_of_model(model)\nprint_size_of_model(quantized_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, we see faster inference time, with no difference in evaluation\nloss:\n\nNote: we set the number of threads to one for single threaded\ncomparison, since quantized models run single threaded.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "torch.set_num_threads(1)\n\ndef time_model_evaluation(model, test_data):\n s = time.time()\n loss = evaluate(model, test_data)\n elapsed = time.time() - s\n print('''loss: {0:.3f}\\nelapsed time (seconds): {1:.1f}'''.format(loss, elapsed))\n\ntime_model_evaluation(model, test_data)\ntime_model_evaluation(quantized_model, test_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running this locally on a MacBook Pro, without quantization, inference\ntakes about 200 seconds, and with quantization it takes just about 100\nseconds.\n\nConclusion\n==========\n\nDynamic quantization can be an easy way to reduce model size while only\nhaving a limited effect on accuracy.\n\nThanks for reading! As always, we welcome any feedback, so please create\nan issue [here](https://github.com/pytorch/pytorch/issues) if you have\nany.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }