{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(beta) Dynamic Quantization on an LSTM Word Language Model\n==========================================================\n\n**Author**: [James Reed](https://github.com/jamesr66a)\n\n**Edited by**: [Seth Weidman](https://github.com/SethHWeidman/)\n\nIntroduction\n------------\n\nQuantization involves converting the weights and activations of your\nmodel from float to int, which can result in smaller model size and\nfaster inference with only a small hit to accuracy.\n\nIn this tutorial, we will apply the easiest form of quantization\n-[dynamic\nquantization](https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic)\n-to an LSTM-based next word-prediction model, closely following the\n[word language\nmodel](https://github.com/pytorch/examples/tree/master/word_language_model)\nfrom the PyTorch examples.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# imports\nimport os\nfrom io import open\nimport time\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Define the model\n===================\n\nHere we define the LSTM model architecture, following the\n[model](https://github.com/pytorch/examples/blob/master/word_language_model/model.py)\nfrom the word language model example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class LSTMModel(nn.Module):\n \"\"\"Container module with an encoder, a recurrent module, and a decoder.\"\"\"\n\n def __init__(self, ntoken, ninp, nhid, nlayers, dropout=0.5):\n super(LSTMModel, self).__init__()\n self.drop = nn.Dropout(dropout)\n self.encoder = nn.Embedding(ntoken, ninp)\n self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)\n self.decoder = nn.Linear(nhid, ntoken)\n\n self.init_weights()\n\n self.nhid = nhid\n self.nlayers = nlayers\n\n def init_weights(self):\n initrange = 0.1\n self.encoder.weight.data.uniform_(-initrange, initrange)\n self.decoder.bias.data.zero_()\n self.decoder.weight.data.uniform_(-initrange, initrange)\n\n def forward(self, input, hidden):\n emb = self.drop(self.encoder(input))\n output, hidden = self.rnn(emb, hidden)\n output = self.drop(output)\n decoded = self.decoder(output)\n return decoded, hidden\n\n def init_hidden(self, bsz):\n weight = next(self.parameters())\n return (weight.new_zeros(self.nlayers, bsz, self.nhid),\n weight.new_zeros(self.nlayers, bsz, self.nhid))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "2. Load in the text data\n========================\n\nNext, we load the [Wikitext-2\ndataset](https://www.google.com/search?q=wikitext+2+data) into a\n`Corpus`, again following the\n[preprocessing](https://github.com/pytorch/examples/blob/master/word_language_model/data.py)\nfrom the word language model example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Dictionary(object):\n    def __init__(self):\n        self.word2idx = {}\n        self.idx2word = []\n\n    def add_word(self, word):\n        if word not in self.word2idx:\n            self.idx2word.append(word)\n            self.word2idx[word] = len(self.idx2word) - 1\n        return self.word2idx[word]\n\n    def __len__(self):\n        return len(self.idx2word)\n\n\nclass Corpus(object):\n    def __init__(self, path):\n        self.dictionary = Dictionary()\n        self.train = self.tokenize(os.path.join(path, 'train.txt'))\n        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))\n        self.test = self.tokenize(os.path.join(path, 'test.txt'))\n\n    def tokenize(self, path):\n        \"\"\"Tokenizes a text file.\"\"\"\n        assert os.path.exists(path)\n        # Add words to the dictionary\n        with open(path, 'r', encoding=\"utf8\") as f:\n            for line in f:\n                words = line.split() + ['<eos>']\n                for word in words:\n                    self.dictionary.add_word(word)\n\n        # Tokenize file content\n        with open(path, 'r', encoding=\"utf8\") as f:\n            idss = []\n            for line in f:\n                words = line.split() + ['<eos>']\n                ids = []\n                for word in words:\n                    ids.append(self.dictionary.word2idx[word])\n                idss.append(torch.tensor(ids).type(torch.int64))\n            ids = torch.cat(idss)\n\n        return ids\n\nmodel_data_filepath = 'data/'\n\ncorpus = Corpus(model_data_filepath + 'wikitext-2')" ] },
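{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, the short cell below is an optional, illustrative exercise of the `Dictionary` class defined above on a handful of made-up words; it is not part of the original preprocessing.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative only: exercise the Dictionary class on a few made-up words.\ntoy_dict = Dictionary()\nfor w in ['the', 'cat', 'sat', 'the']:\n    toy_dict.add_word(w)\n\nprint(len(toy_dict))             # 3 unique words\nprint(toy_dict.word2idx['cat'])  # index assigned on first insertion\nprint(toy_dict.idx2word[0])      # 'the'" ] },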
{ "cell_type": "markdown", "metadata": {}, "source": [ "3. Load the pretrained model\n============================\n\nThis is a tutorial on dynamic quantization, a quantization technique\nthat is applied after a model has been trained. Therefore, we\\'ll simply\nload some pretrained weights into this model architecture; these weights\nwere obtained by training for five epochs using the default settings in\nthe word language model example.\n\nBefore running this tutorial, download the required pre-trained model:\n\n```bash\nwget https://s3.amazonaws.com/pytorch-tutorial-assets/word_language_model_quantize.pth\n```\n\nPlace the downloaded file in the data directory or update\n`model_data_filepath` accordingly.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ntokens = len(corpus.dictionary)\n\nmodel = LSTMModel(\n    ntoken = ntokens,\n    ninp = 512,\n    nhid = 256,\n    nlayers = 5,\n)\n\nmodel.load_state_dict(\n    torch.load(\n        model_data_filepath + 'word_language_model_quantize.pth',\n        map_location=torch.device('cpu'),\n        weights_only=True\n    )\n)\n\nmodel.eval()\nprint(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let\\'s generate some text to ensure that the pretrained model is\nworking properly. As before, we follow the\n[generation script](https://github.com/pytorch/examples/blob/master/word_language_model/generate.py)\nfrom the word language model example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "input_ = torch.randint(ntokens, (1, 1), dtype=torch.long)\nhidden = model.init_hidden(1)\ntemperature = 1.0\nnum_words = 1000\n\nwith open(model_data_filepath + 'out.txt', 'w') as outf:\n    with torch.no_grad():  # no tracking history\n        for i in range(num_words):\n            output, hidden = model(input_, hidden)\n            word_weights = output.squeeze().div(temperature).exp().cpu()\n            word_idx = torch.multinomial(word_weights, 1)[0]\n            input_.fill_(word_idx)\n\n            word = corpus.dictionary.idx2word[word_idx]\n\n            outf.write(str(word.encode('utf-8')) + ('\\n' if i % 20 == 19 else ' '))\n\n            if i % 100 == 0:\n                print('| Generated {}/{} words'.format(i, num_words))\n\nwith open(model_data_filepath + 'out.txt', 'r') as outf:\n    all_output = outf.read()\n    print(all_output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It\\'s no GPT-2, but it looks like the model has started to learn the\nstructure of language!\n\nWe\\'re almost ready to demonstrate dynamic quantization.\n" ] },
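{ "cell_type": "markdown", "metadata": {}, "source": [ "As a brief aside, the generation loop above turns the model output into next-word probabilities by dividing the logits by a temperature, exponentiating, and drawing from `torch.multinomial`. The sketch below, using made-up logits, shows that this is the same as sampling from a temperature-scaled softmax.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Aside (illustrative only): temperature sampling with made-up logits.\nlogits = torch.tensor([2.0, 1.0, 0.1])\ntemperature = 0.8\n\nweights = (logits / temperature).exp()             # what the loop above builds\nprobs = torch.softmax(logits / temperature, dim=0)\nprint(torch.allclose(weights / weights.sum(), probs))  # True: same distribution\n\n# torch.multinomial normalizes non-negative weights internally.\nprint(torch.multinomial(weights, num_samples=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "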
We just need to\ndefine a few more helper functions:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bptt = 25\ncriterion = nn.CrossEntropyLoss()\neval_batch_size = 1\n\n# create test data set\ndef batchify(data, bsz):\n # Work out how cleanly we can divide the dataset into ``bsz`` parts.\n nbatch = data.size(0) // bsz\n # Trim off any extra elements that wouldn't cleanly fit (remainders).\n data = data.narrow(0, 0, nbatch * bsz)\n # Evenly divide the data across the ``bsz`` batches.\n return data.view(bsz, -1).t().contiguous()\n\ntest_data = batchify(corpus.test, eval_batch_size)\n\n# Evaluation functions\ndef get_batch(source, i):\n seq_len = min(bptt, len(source) - 1 - i)\n data = source[i:i+seq_len]\n target = source[i+1:i+1+seq_len].reshape(-1)\n return data, target\n\ndef repackage_hidden(h):\n \"\"\"Wraps hidden states in new Tensors, to detach them from their history.\"\"\"\n\n if isinstance(h, torch.Tensor):\n return h.detach()\n else:\n return tuple(repackage_hidden(v) for v in h)\n\ndef evaluate(model_, data_source):\n # Turn on evaluation mode which disables dropout.\n model_.eval()\n total_loss = 0.\n hidden = model_.init_hidden(eval_batch_size)\n with torch.no_grad():\n for i in range(0, data_source.size(0) - 1, bptt):\n data, targets = get_batch(data_source, i)\n output, hidden = model_(data, hidden)\n hidden = repackage_hidden(hidden)\n output_flat = output.view(-1, ntokens)\n total_loss += len(data) * criterion(output_flat, targets).item()\n return total_loss / (len(data_source) - 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Test dynamic quantization\n============================\n\nFinally, we can call `torch.quantization.quantize_dynamic` on the model!\nSpecifically,\n\n- We specify that we want the `nn.LSTM` and `nn.Linear` modules in our\n model to be quantized\n- We specify that we want weights to be converted to `int8` values\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch.quantization\n\nquantized_model = torch.quantization.quantize_dynamic(\n model, {nn.LSTM, nn.Linear}, dtype=torch.qint8\n)\nprint(quantized_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model looks the same; how has this benefited us? 
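One quick check is to list the submodules that were swapped out for dynamically quantized versions (a small sketch; the exact quantized class names vary between PyTorch releases):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# List the submodules replaced by dynamic quantization.\n# Note: the exact quantized class names differ across PyTorch versions.\nfor name, module in quantized_model.named_modules():\n    if 'quantized' in type(module).__module__:\n        print(name, '->', type(module))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "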
First, we see a\nsignificant reduction in model size:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def print_size_of_model(model):\n torch.save(model.state_dict(), \"temp.p\")\n print('Size (MB):', os.path.getsize(\"temp.p\")/1e6)\n os.remove('temp.p')\n\nprint_size_of_model(model)\nprint_size_of_model(quantized_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, we see faster inference time, with no difference in evaluation\nloss:\n\nNote: we set the number of threads to one for single threaded\ncomparison, since quantized models run single threaded.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "torch.set_num_threads(1)\n\ndef time_model_evaluation(model, test_data):\n s = time.time()\n loss = evaluate(model, test_data)\n elapsed = time.time() - s\n print('''loss: {0:.3f}\\nelapsed time (seconds): {1:.1f}'''.format(loss, elapsed))\n\ntime_model_evaluation(model, test_data)\ntime_model_evaluation(quantized_model, test_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running this locally on a MacBook Pro, without quantization, inference\ntakes about 200 seconds, and with quantization it takes just about 100\nseconds.\n\nConclusion\n==========\n\nDynamic quantization can be an easy way to reduce model size while only\nhaving a limited effect on accuracy.\n\nThanks for reading! As always, we welcome any feedback, so please create\nan issue [here](https://github.com/pytorch/pytorch/issues) if you have\nany.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }