{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chatbot Tutorial\n================\n\n**Author:** [Matthew Inkawhich](https://github.com/MatthewInkawhich)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we explore a fun and interesting use-case of recurrent\nsequence-to-sequence models. We will train a simple chatbot using movie\nscripts from the [Cornell Movie-Dialogs\nCorpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html).\n\nConversational models are a hot topic in artificial intelligence\nresearch. Chatbots can be found in a variety of settings, including\ncustomer service applications and online helpdesks. These bots are often\npowered by retrieval-based models, which output predefined responses to\nquestions of certain forms. In a highly restricted domain like a\ncompany's IT helpdesk, these models may be sufficient, however, they are\nnot robust enough for more general use-cases. Teaching a machine to\ncarry out a meaningful conversation with a human in multiple domains is\na research question that is far from solved. Recently, the deep learning\nboom has allowed for powerful generative models like Google's [Neural\nConversational Model](https://arxiv.org/abs/1506.05869), which marks a\nlarge step towards multi-domain generative conversational models. In\nthis tutorial, we will implement this kind of model in PyTorch.\n\n![](https://pytorch.org/tutorials/_static/img/chatbot/bot.png){.align-center}\n\n``` {.python}\n> hello?\nBot: hello .\n> where am I?\nBot: you re in a hospital .\n> who are you?\nBot: i m a lawyer .\n> how are you doing?\nBot: i m fine .\n> are you my friend?\nBot: no .\n> you're under arrest\nBot: i m trying to help you !\n> i'm just kidding\nBot: i m sorry .\n> where are you from?\nBot: san francisco .\n> it's time for me to leave\nBot: i know .\n> goodbye\nBot: goodbye .\n```\n\n**Tutorial Highlights**\n\n- Handle loading and preprocessing of [Cornell Movie-Dialogs\n Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)\n dataset\n- Implement a sequence-to-sequence model with [Luong attention\n mechanism(s)](https://arxiv.org/abs/1508.04025)\n- Jointly train encoder and decoder models using mini-batches\n- Implement greedy-search decoding module\n- Interact with trained chatbot\n\n**Acknowledgments**\n\nThis tutorial borrows code from the following sources:\n\n1) Yuan-Kuei Wu's pytorch-chatbot implementation:\n \n2) Sean Robertson's practical-pytorch seq2seq-translation example:\n \n3) FloydHub Cornell Movie Corpus preprocessing code:\n \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Preparations\n============\n\nTo get started,\n[download](https://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip)\nthe Movie-Dialogs Corpus zip file.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# and put in a ``data/`` directory under the current directory.\n#\n# After that, let\u2019s import some necessities.\n#\n\nimport torch\nfrom torch.jit import script, trace\nimport torch.nn as nn\nfrom torch import optim\nimport torch.nn.functional as F\nimport csv\nimport random\nimport re\nimport os\nimport unicodedata\nimport codecs\nfrom io 
import open\nimport itertools\nimport math\nimport json\n\n\n# If the current accelerator is available,\n# we will use it. Otherwise, we use the CPU.\ndevice = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else \"cpu\"\nprint(f\"Using {device} device\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load & Preprocess Data\n======================\n\nThe next step is to reformat our data file and load the data into\nstructures that we can work with.\n\nThe [Cornell Movie-Dialogs\nCorpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)\nis a rich dataset of movie character dialog:\n\n- 220,579 conversational exchanges between 10,292 pairs of movie\n    characters\n- 9,035 characters from 617 movies\n- 304,713 total utterances\n\nThis dataset is large and diverse, and there is a great variation of\nlanguage formality, time periods, sentiment, etc. Our hope is that this\ndiversity makes our model robust to many forms of inputs and queries.\n\nFirst, we'll take a look at some lines of our datafile to see the\noriginal format.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "corpus_name = \"movie-corpus\"\ncorpus = os.path.join(\"data\", corpus_name)\n\ndef printLines(file, n=10):\n    with open(file, 'rb') as datafile:\n        lines = datafile.readlines()\n    for line in lines[:n]:\n        print(line)\n\nprintLines(os.path.join(corpus, \"utterances.jsonl\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create formatted data file\n==========================\n\nFor convenience, we'll create a nicely formatted data file in which\neach line contains a tab-separated *query sentence* and a *response\nsentence* pair.\n\nThe following functions facilitate the parsing of the raw\n`utterances.jsonl` data file.\n\n- `loadLinesAndConversations` splits each line of the file into a\n    dictionary of lines with fields: `lineID`, `characterID`, and `text`,\n    and then groups them into conversations with fields:\n    `conversationID`, `movieID`, and `lines`.\n- `extractSentencePairs` extracts pairs of sentences from\n    conversations\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Splits each line of the file to create lines and conversations\ndef loadLinesAndConversations(fileName):\n    lines = {}\n    conversations = {}\n    with open(fileName, 'r', encoding='iso-8859-1') as f:\n        for line in f:\n            lineJson = json.loads(line)\n            # Extract fields for line object\n            lineObj = {}\n            lineObj[\"lineID\"] = lineJson[\"id\"]\n            lineObj[\"characterID\"] = lineJson[\"speaker\"]\n            lineObj[\"text\"] = lineJson[\"text\"]\n            lines[lineObj['lineID']] = lineObj\n\n            # Extract fields for conversation object\n            if lineJson[\"conversation_id\"] not in conversations:\n                convObj = {}\n                convObj[\"conversationID\"] = lineJson[\"conversation_id\"]\n                convObj[\"movieID\"] = lineJson[\"meta\"][\"movie_id\"]\n                convObj[\"lines\"] = [lineObj]\n            else:\n                convObj = conversations[lineJson[\"conversation_id\"]]\n                convObj[\"lines\"].insert(0, lineObj)\n            conversations[convObj[\"conversationID\"]] = convObj\n\n    return lines, conversations\n\n\n# Extracts pairs of sentences from conversations\ndef extractSentencePairs(conversations):\n    qa_pairs = []\n    for conversation in conversations.values():\n        # Iterate over all the lines of the conversation\n        for i in range(len(conversation[\"lines\"]) - 1):  # We ignore the last line (no answer for it)\n            inputLine = 
conversation[\"lines\"][i][\"text\"].strip()\n targetLine = conversation[\"lines\"][i+1][\"text\"].strip()\n # Filter wrong samples (if one of the lists is empty)\n if inputLine and targetLine:\n qa_pairs.append([inputLine, targetLine])\n return qa_pairs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll call these functions and create the file. We'll call it\n`formatted_movie_lines.txt`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Define path to new file\ndatafile = os.path.join(corpus, \"formatted_movie_lines.txt\")\n\ndelimiter = '\\t'\n# Unescape the delimiter\ndelimiter = str(codecs.decode(delimiter, \"unicode_escape\"))\n\n# Initialize lines dict and conversations dict\nlines = {}\nconversations = {}\n# Load lines and conversations\nprint(\"\\nProcessing corpus into lines and conversations...\")\nlines, conversations = loadLinesAndConversations(os.path.join(corpus, \"utterances.jsonl\"))\n\n# Write new csv file\nprint(\"\\nWriting newly formatted file...\")\nwith open(datafile, 'w', encoding='utf-8') as outputfile:\n writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\\n')\n for pair in extractSentencePairs(conversations):\n writer.writerow(pair)\n\n# Print a sample of lines\nprint(\"\\nSample lines from file:\")\nprintLines(datafile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load and trim data\n==================\n\nOur next order of business is to create a vocabulary and load\nquery/response sentence pairs into memory.\n\nNote that we are dealing with sequences of **words**, which do not have\nan implicit mapping to a discrete numerical space. Thus, we must create\none by mapping each unique word that we encounter in our dataset to an\nindex value.\n\nFor this we define a `Voc` class, which keeps a mapping from words to\nindexes, a reverse mapping of indexes to words, a count of each word and\na total word count. The class provides methods for adding a word to the\nvocabulary (`addWord`), adding all words in a sentence (`addSentence`)\nand trimming infrequently seen words (`trim`). 
More on trimming later.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Default word tokens\nPAD_token = 0 # Used for padding short sentences\nSOS_token = 1 # Start-of-sentence token\nEOS_token = 2 # End-of-sentence token\n\nclass Voc:\n def __init__(self, name):\n self.name = name\n self.trimmed = False\n self.word2index = {}\n self.word2count = {}\n self.index2word = {PAD_token: \"PAD\", SOS_token: \"SOS\", EOS_token: \"EOS\"}\n self.num_words = 3 # Count SOS, EOS, PAD\n\n def addSentence(self, sentence):\n for word in sentence.split(' '):\n self.addWord(word)\n\n def addWord(self, word):\n if word not in self.word2index:\n self.word2index[word] = self.num_words\n self.word2count[word] = 1\n self.index2word[self.num_words] = word\n self.num_words += 1\n else:\n self.word2count[word] += 1\n\n # Remove words below a certain count threshold\n def trim(self, min_count):\n if self.trimmed:\n return\n self.trimmed = True\n\n keep_words = []\n\n for k, v in self.word2count.items():\n if v >= min_count:\n keep_words.append(k)\n\n print('keep_words {} / {} = {:.4f}'.format(\n len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)\n ))\n\n # Reinitialize dictionaries\n self.word2index = {}\n self.word2count = {}\n self.index2word = {PAD_token: \"PAD\", SOS_token: \"SOS\", EOS_token: \"EOS\"}\n self.num_words = 3 # Count default tokens\n\n for word in keep_words:\n self.addWord(word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can assemble our vocabulary and query/response sentence pairs.\nBefore we are ready to use this data, we must perform some\npreprocessing.\n\nFirst, we must convert the Unicode strings to ASCII using\n`unicodeToAscii`. Next, we should convert all letters to lowercase and\ntrim all non-letter characters except for basic punctuation\n(`normalizeString`). 
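\n\nFor instance, a made-up input run through the `normalizeString` function defined in the next cell comes out as:\n\n``` {.python}\nnormalizeString(\"Aren't   you COMING?!\")\n# -> \"aren t you coming ? !\"\n```\n\n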
Finally, to aid in training convergence, we will\nfilter out sentences with length greater than the `MAX_LENGTH` threshold\n(`filterPairs`).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "MAX_LENGTH = 10 # Maximum sentence length to consider\n\n# Turn a Unicode string to plain ASCII, thanks to\n# https://stackoverflow.com/a/518232/2809427\ndef unicodeToAscii(s):\n return ''.join(\n c for c in unicodedata.normalize('NFD', s)\n if unicodedata.category(c) != 'Mn'\n )\n\n# Lowercase, trim, and remove non-letter characters\ndef normalizeString(s):\n s = unicodeToAscii(s.lower().strip())\n s = re.sub(r\"([.!?])\", r\" \\1\", s)\n s = re.sub(r\"[^a-zA-Z.!?]+\", r\" \", s)\n s = re.sub(r\"\\s+\", r\" \", s).strip()\n return s\n\n# Read query/response pairs and return a voc object\ndef readVocs(datafile, corpus_name):\n print(\"Reading lines...\")\n # Read the file and split into lines\n lines = open(datafile, encoding='utf-8').\\\n read().strip().split('\\n')\n # Split every line into pairs and normalize\n pairs = [[normalizeString(s) for s in l.split('\\t')] for l in lines]\n voc = Voc(corpus_name)\n return voc, pairs\n\n# Returns True if both sentences in a pair 'p' are under the MAX_LENGTH threshold\ndef filterPair(p):\n # Input sequences need to preserve the last word for EOS token\n return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH\n\n# Filter pairs using the ``filterPair`` condition\ndef filterPairs(pairs):\n return [pair for pair in pairs if filterPair(pair)]\n\n# Using the functions defined above, return a populated voc object and pairs list\ndef loadPrepareData(corpus, corpus_name, datafile, save_dir):\n print(\"Start preparing training data ...\")\n voc, pairs = readVocs(datafile, corpus_name)\n print(\"Read {!s} sentence pairs\".format(len(pairs)))\n pairs = filterPairs(pairs)\n print(\"Trimmed to {!s} sentence pairs\".format(len(pairs)))\n print(\"Counting words...\")\n for pair in pairs:\n voc.addSentence(pair[0])\n voc.addSentence(pair[1])\n print(\"Counted words:\", voc.num_words)\n return voc, pairs\n\n\n# Load/Assemble voc and pairs\nsave_dir = os.path.join(\"data\", \"save\")\nvoc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)\n# Print some pairs to validate\nprint(\"\\npairs:\")\nfor pair in pairs[:10]:\n print(pair)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another tactic that is beneficial to achieving faster convergence during\ntraining is trimming rarely used words out of our vocabulary. Decreasing\nthe feature space will also soften the difficulty of the function that\nthe model must learn to approximate. 
We will do this as a two-step\nprocess:\n\n1) Trim words used under `MIN_COUNT` threshold using the `voc.trim`\n function.\n2) Filter out pairs with trimmed words.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "MIN_COUNT = 3 # Minimum word count threshold for trimming\n\ndef trimRareWords(voc, pairs, MIN_COUNT):\n # Trim words used under the MIN_COUNT from the voc\n voc.trim(MIN_COUNT)\n # Filter out pairs with trimmed words\n keep_pairs = []\n for pair in pairs:\n input_sentence = pair[0]\n output_sentence = pair[1]\n keep_input = True\n keep_output = True\n # Check input sentence\n for word in input_sentence.split(' '):\n if word not in voc.word2index:\n keep_input = False\n break\n # Check output sentence\n for word in output_sentence.split(' '):\n if word not in voc.word2index:\n keep_output = False\n break\n\n # Only keep pairs that do not contain trimmed word(s) in their input or output sentence\n if keep_input and keep_output:\n keep_pairs.append(pair)\n\n print(\"Trimmed from {} pairs to {}, {:.4f} of total\".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))\n return keep_pairs\n\n\n# Trim voc and pairs\npairs = trimRareWords(voc, pairs, MIN_COUNT)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Prepare Data for Models\n=======================\n\nAlthough we have put a great deal of effort into preparing and massaging\nour data into a nice vocabulary object and list of sentence pairs, our\nmodels will ultimately expect numerical torch tensors as inputs. One way\nto prepare the processed data for the models can be found in the\n[seq2seq translation\ntutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).\nIn that tutorial, we use a batch size of 1, meaning that all we have to\ndo is convert the words in our sentence pairs to their corresponding\nindexes from the vocabulary and feed this to the models.\n\nHowever, if you're interested in speeding up training and/or would like\nto leverage GPU parallelization capabilities, you will need to train\nwith mini-batches.\n\nUsing mini-batches also means that we must be mindful of the variation\nof sentence length in our batches. To accommodate sentences of different\nsizes in the same batch, we will make our batched input tensor of shape\n*(max\\_length, batch\\_size)*, where sentences shorter than the\n*max\\_length* are zero padded after an *EOS\\_token*.\n\nIf we simply convert our English sentences to tensors by converting\nwords to their indexes(`indexesFromSentence`) and zero-pad, our tensor\nwould have shape *(batch\\_size, max\\_length)* and indexing the first\ndimension would return a full sequence across all time-steps. However,\nwe need to be able to index our batch along time, and across all\nsequences in the batch. Therefore, we transpose our input batch shape to\n*(max\\_length, batch\\_size)*, so that indexing across the first\ndimension returns a time step across all sentences in the batch. We\nhandle this transpose implicitly in the `zeroPadding` function.\n\n![](https://pytorch.org/tutorials/_static/img/chatbot/seq2seq_batches.png){.align-center}\n\nThe `inputVar` function handles the process of converting sentences to\ntensor, ultimately creating a correctly shaped zero-padded tensor. 
It\nalso returns a tensor of `lengths` for each of the sequences in the\nbatch, which will be passed to our encoder later.\n\nThe `outputVar` function performs a similar function to `inputVar`, but\ninstead of returning a `lengths` tensor, it returns a binary mask tensor\nand a maximum target sentence length. The binary mask tensor has the\nsame shape as the output target tensor, but every element that is a\n*PAD\\_token* is 0 and all others are 1.\n\n`batch2TrainData` simply takes a bunch of pairs and returns the input\nand target tensors using the aforementioned functions.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def indexesFromSentence(voc, sentence):\n    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]\n\n\ndef zeroPadding(l, fillvalue=PAD_token):\n    return list(itertools.zip_longest(*l, fillvalue=fillvalue))\n\ndef binaryMatrix(l, value=PAD_token):\n    m = []\n    for i, seq in enumerate(l):\n        m.append([])\n        for token in seq:\n            if token == PAD_token:\n                m[i].append(0)\n            else:\n                m[i].append(1)\n    return m\n\n# Returns padded input sequence tensor and lengths\ndef inputVar(l, voc):\n    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]\n    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])\n    padList = zeroPadding(indexes_batch)\n    padVar = torch.LongTensor(padList)\n    return padVar, lengths\n\n# Returns padded target sequence tensor, padding mask, and max target length\ndef outputVar(l, voc):\n    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]\n    max_target_len = max([len(indexes) for indexes in indexes_batch])\n    padList = zeroPadding(indexes_batch)\n    mask = binaryMatrix(padList)\n    mask = torch.BoolTensor(mask)\n    padVar = torch.LongTensor(padList)\n    return padVar, mask, max_target_len\n\n# Returns all items for a given batch of pairs\ndef batch2TrainData(voc, pair_batch):\n    pair_batch.sort(key=lambda x: len(x[0].split(\" \")), reverse=True)\n    input_batch, output_batch = [], []\n    for pair in pair_batch:\n        input_batch.append(pair[0])\n        output_batch.append(pair[1])\n    inp, lengths = inputVar(input_batch, voc)\n    output, mask, max_target_len = outputVar(output_batch, voc)\n    return inp, lengths, output, mask, max_target_len\n\n\n# Example for validation\nsmall_batch_size = 5\nbatches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])\ninput_variable, lengths, target_variable, mask, max_target_len = batches\n\nprint(\"input_variable:\", input_variable)\nprint(\"lengths:\", lengths)\nprint(\"target_variable:\", target_variable)\nprint(\"mask:\", mask)\nprint(\"max_target_len:\", max_target_len)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define Models\n=============\n\nSeq2Seq Model\n-------------\n\nThe brains of our chatbot is a sequence-to-sequence (seq2seq) model. The\ngoal of a seq2seq model is to take a variable-length sequence as an\ninput, and return a variable-length sequence as an output using a\nfixed-sized model.\n\n[Sutskever et al.](https://arxiv.org/abs/1409.3215) discovered that by\nusing two separate recurrent neural nets together, we can accomplish\nthis task. One RNN acts as an **encoder**, which encodes a variable\nlength input sequence to a fixed-length context vector. In theory, this\ncontext vector (the final hidden layer of the RNN) will contain semantic\ninformation about the query sentence that is input to the bot. 
The\nsecond RNN is a **decoder**, which takes an input word and the context\nvector, and returns a guess for the next word in the sequence and a\nhidden state to use in the next iteration.\n\n![](https://pytorch.org/tutorials/_static/img/chatbot/seq2seq_ts.png){.align-center}\n\nImage source:\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoder\n=======\n\nThe encoder RNN iterates through the input sentence one token\n(e.g.\u00a0word) at a time, at each time step outputting an \"output\" vector\nand a \"hidden state\" vector. The hidden state vector is then passed to\nthe next time step, while the output vector is recorded. The encoder\ntransforms the context it saw at each point in the sequence into a set\nof points in a high-dimensional space, which the decoder will use to\ngenerate a meaningful output for the given task.\n\nAt the heart of our encoder is a multi-layered Gated Recurrent Unit,\ninvented by [Cho et al.](https://arxiv.org/pdf/1406.1078v3.pdf) in 2014.\nWe will use a bidirectional variant of the GRU, meaning that there are\nessentially two independent RNNs: one that is fed the input sequence in\nnormal sequential order, and one that is fed the input sequence in\nreverse order. The outputs of each network are summed at each time step.\nUsing a bidirectional GRU will give us the advantage of encoding both\npast and future contexts.\n\nBidirectional RNN:\n\n![](https://pytorch.org/tutorials/_static/img/chatbot/RNN-bidirectional.png){.align-center\nwidth=\"70.0%\"}\n\nImage source: \n\nNote that an `embedding` layer is used to encode our word indices in an\narbitrarily sized feature space. For our models, this layer will map\neach word to a feature space of size *hidden\\_size*. When trained, these\nvalues should encode semantic similarity between similar meaning words.\n\nFinally, if passing a padded batch of sequences to an RNN module, we\nmust pack and unpack padding around the RNN pass using\n`nn.utils.rnn.pack_padded_sequence` and\n`nn.utils.rnn.pad_packed_sequence` respectively.\n\n**Computation Graph:**\n\n> 1) Convert word indexes to embeddings.\n> 2) Pack padded batch of sequences for RNN module.\n> 3) Forward pass through GRU.\n> 4) Unpack padding.\n> 5) Sum bidirectional GRU outputs.\n> 6) Return output and final hidden state.\n\n**Inputs:**\n\n- `input_seq`: batch of input sentences; shape=*(max\\_length,\n batch\\_size)*\n- `input_lengths`: list of sentence lengths corresponding to each\n sentence in the batch; shape=*(batch\\_size)*\n- `hidden`: hidden state; shape=*(n\\_layers x num\\_directions,\n batch\\_size, hidden\\_size)*\n\n**Outputs:**\n\n- `outputs`: output features from the last hidden layer of the GRU\n (sum of bidirectional outputs); shape=*(max\\_length, batch\\_size,\n hidden\\_size)*\n- `hidden`: updated hidden state from GRU; shape=*(n\\_layers x\n num\\_directions, batch\\_size, hidden\\_size)*\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class EncoderRNN(nn.Module):\n def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):\n super(EncoderRNN, self).__init__()\n self.n_layers = n_layers\n self.hidden_size = hidden_size\n self.embedding = embedding\n\n # Initialize GRU; the input_size and hidden_size parameters are both set to 'hidden_size'\n # because our input size is a word embedding with number of features == hidden_size\n self.gru = nn.GRU(hidden_size, hidden_size, n_layers,\n dropout=(0 if n_layers == 1 else dropout), 
bidirectional=True)\n\n def forward(self, input_seq, input_lengths, hidden=None):\n # Convert word indexes to embeddings\n embedded = self.embedding(input_seq)\n # Pack padded batch of sequences for RNN module\n packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)\n # Forward pass through GRU\n outputs, hidden = self.gru(packed, hidden)\n # Unpack padding\n outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)\n # Sum bidirectional GRU outputs\n outputs = outputs[:, :, :self.hidden_size] + outputs[:, : ,self.hidden_size:]\n # Return output and final hidden state\n return outputs, hidden" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decoder\n=======\n\nThe decoder RNN generates the response sentence in a token-by-token\nfashion. It uses the encoder's context vectors, and internal hidden\nstates to generate the next word in the sequence. It continues\ngenerating words until it outputs an *EOS\\_token*, representing the end\nof the sentence. A common problem with a vanilla seq2seq decoder is that\nif we rely solely on the context vector to encode the entire input\nsequence's meaning, it is likely that we will have information loss.\nThis is especially the case when dealing with long input sequences,\ngreatly limiting the capability of our decoder.\n\nTo combat this, [Bahdanau et al.](https://arxiv.org/abs/1409.0473)\ncreated an \"attention mechanism\" that allows the decoder to pay\nattention to certain parts of the input sequence, rather than using the\nentire fixed context at every step.\n\nAt a high level, attention is calculated using the decoder's current\nhidden state and the encoder's outputs. The output attention weights\nhave the same shape as the input sequence, allowing us to multiply them\nby the encoder outputs, giving us a weighted sum which indicates the\nparts of encoder output to pay attention to. [Sean\nRobertson's](https://github.com/spro) figure describes this very well:\n\n![](https://pytorch.org/tutorials/_static/img/chatbot/attn2.png){.align-center}\n\n[Luong et al.](https://arxiv.org/abs/1508.04025) improved upon Bahdanau\net al.'s groundwork by creating \"Global attention\". The key difference\nis that with \"Global attention\", we consider all of the encoder's hidden\nstates, as opposed to Bahdanau et al.'s \"Local attention\", which only\nconsiders the encoder's hidden state from the current time step. Another\ndifference is that with \"Global attention\", we calculate attention\nweights, or energies, using the hidden state of the decoder from the\ncurrent time step only. Bahdanau et al.'s attention calculation requires\nknowledge of the decoder's state from the previous time step. Also,\nLuong et al.\u00a0provides various methods to calculate the attention\nenergies between the encoder output and decoder output which are called\n\"score functions\":\n\n![](https://pytorch.org/tutorials/_static/img/chatbot/scores.png){.align-center\nwidth=\"60.0%\"}\n\nwhere $h_t$ = current target decoder state and $\\bar{h}_s$ = all encoder\nstates.\n\nOverall, the Global attention mechanism can be summarized by the\nfollowing figure. Note that we will implement the \"Attention Layer\" as a\nseparate `nn.Module` called `Attn`. 
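\n\nWritten out (following Luong et al.), the three score functions in the figure above are\n\n$$\\mathrm{score}(h_t, \\bar{h}_s) = \\begin{cases} h_t^\\top \\bar{h}_s & \\text{dot} \\\\ h_t^\\top W_a \\bar{h}_s & \\text{general} \\\\ v_a^\\top \\tanh(W_a [h_t ; \\bar{h}_s]) & \\text{concat} \\end{cases}$$\n\nand they correspond to the `dot_score`, `general_score`, and `concat_score` methods of `Attn`. 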
The output of this module is a\nsoftmax normalized weights tensor of shape *(batch\\_size, 1,\nmax\\_length)*.\n\n![](https://pytorch.org/tutorials/_static/img/chatbot/global_attn.png){.align-center\nwidth=\"60.0%\"}\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Luong attention layer\nclass Attn(nn.Module):\n def __init__(self, method, hidden_size):\n super(Attn, self).__init__()\n self.method = method\n if self.method not in ['dot', 'general', 'concat']:\n raise ValueError(self.method, \"is not an appropriate attention method.\")\n self.hidden_size = hidden_size\n if self.method == 'general':\n self.attn = nn.Linear(self.hidden_size, hidden_size)\n elif self.method == 'concat':\n self.attn = nn.Linear(self.hidden_size * 2, hidden_size)\n self.v = nn.Parameter(torch.FloatTensor(hidden_size))\n\n def dot_score(self, hidden, encoder_output):\n return torch.sum(hidden * encoder_output, dim=2)\n\n def general_score(self, hidden, encoder_output):\n energy = self.attn(encoder_output)\n return torch.sum(hidden * energy, dim=2)\n\n def concat_score(self, hidden, encoder_output):\n energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()\n return torch.sum(self.v * energy, dim=2)\n\n def forward(self, hidden, encoder_outputs):\n # Calculate the attention weights (energies) based on the given method\n if self.method == 'general':\n attn_energies = self.general_score(hidden, encoder_outputs)\n elif self.method == 'concat':\n attn_energies = self.concat_score(hidden, encoder_outputs)\n elif self.method == 'dot':\n attn_energies = self.dot_score(hidden, encoder_outputs)\n\n # Transpose max_length and batch_size dimensions\n attn_energies = attn_energies.t()\n\n # Return the softmax normalized probability scores (with added dimension)\n return F.softmax(attn_energies, dim=1).unsqueeze(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have defined our attention submodule, we can implement the\nactual decoder model. For the decoder, we will manually feed our batch\none time step at a time. This means that our embedded word tensor and\nGRU output will both have shape *(1, batch\\_size, hidden\\_size)*.\n\n**Computation Graph:**\n\n> 1) Get embedding of current input word.\n> 2) Forward through unidirectional GRU.\n> 3) Calculate attention weights from the current GRU output from (2).\n> 4) Multiply attention weights to encoder outputs to get new\n> \\\"weighted sum\\\" context vector.\n> 5) Concatenate weighted context vector and GRU output using Luong\n> eq. 5.\n> 6) Predict next word using Luong eq. 
6 (without softmax).\n> 7) Return output and final hidden state.\n\n**Inputs:**\n\n- `input_step`: one time step (one word) of input sequence batch;\n shape=*(1, batch\\_size)*\n- `last_hidden`: final hidden layer of GRU; shape=*(n\\_layers x\n num\\_directions, batch\\_size, hidden\\_size)*\n- `encoder_outputs`: encoder model's output; shape=*(max\\_length,\n batch\\_size, hidden\\_size)*\n\n**Outputs:**\n\n- `output`: softmax normalized tensor giving probabilities of each\n word being the correct next word in the decoded sequence;\n shape=*(batch\\_size, voc.num\\_words)*\n- `hidden`: final hidden state of GRU; shape=*(n\\_layers x\n num\\_directions, batch\\_size, hidden\\_size)*\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class LuongAttnDecoderRNN(nn.Module):\n def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):\n super(LuongAttnDecoderRNN, self).__init__()\n\n # Keep for reference\n self.attn_model = attn_model\n self.hidden_size = hidden_size\n self.output_size = output_size\n self.n_layers = n_layers\n self.dropout = dropout\n\n # Define layers\n self.embedding = embedding\n self.embedding_dropout = nn.Dropout(dropout)\n self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))\n self.concat = nn.Linear(hidden_size * 2, hidden_size)\n self.out = nn.Linear(hidden_size, output_size)\n\n self.attn = Attn(attn_model, hidden_size)\n\n def forward(self, input_step, last_hidden, encoder_outputs):\n # Note: we run this one step (word) at a time\n # Get embedding of current input word\n embedded = self.embedding(input_step)\n embedded = self.embedding_dropout(embedded)\n # Forward through unidirectional GRU\n rnn_output, hidden = self.gru(embedded, last_hidden)\n # Calculate attention weights from the current GRU output\n attn_weights = self.attn(rnn_output, encoder_outputs)\n # Multiply attention weights to encoder outputs to get new \"weighted sum\" context vector\n context = attn_weights.bmm(encoder_outputs.transpose(0, 1))\n # Concatenate weighted context vector and GRU output using Luong eq. 5\n rnn_output = rnn_output.squeeze(0)\n context = context.squeeze(1)\n concat_input = torch.cat((rnn_output, context), 1)\n concat_output = torch.tanh(self.concat(concat_input))\n # Predict next word using Luong eq. 6\n output = self.out(concat_output)\n output = F.softmax(output, dim=1)\n # Return output and final hidden state\n return output, hidden" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define Training Procedure\n=========================\n\nMasked loss\n-----------\n\nSince we are dealing with batches of padded sequences, we cannot simply\nconsider all elements of the tensor when calculating loss. We define\n`maskNLLLoss` to calculate our loss based on our decoder's output\ntensor, the target tensor, and a binary mask tensor describing the\npadding of the target tensor. 
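\n\nIn symbols, the per-step loss computed in the next cell is\n\n$$\\mathcal{L} = \\frac{1}{\\sum_i m_i} \\sum_{i : m_i = 1} -\\log p_i(y_i),$$\n\nwhere $m_i$ is the mask entry for sequence $i$ of the batch and $p_i(y_i)$ is the probability the decoder assigned to the target word $y_i$. 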
This loss function calculates the average\nnegative log likelihood of the elements that correspond to a *1* in the\nmask tensor.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def maskNLLLoss(inp, target, mask):\n nTotal = mask.sum()\n crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))\n loss = crossEntropy.masked_select(mask).mean()\n loss = loss.to(device)\n return loss, nTotal.item()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Single training iteration\n=========================\n\nThe `train` function contains the algorithm for a single training\niteration (a single batch of inputs).\n\nWe will use a couple of clever tricks to aid in convergence:\n\n- The first trick is using **teacher forcing**. This means that at\n some probability, set by `teacher_forcing_ratio`, we use the current\n target word as the decoder's next input rather than using the\n decoder's current guess. This technique acts as training wheels for\n the decoder, aiding in more efficient training. However, teacher\n forcing can lead to model instability during inference, as the\n decoder may not have a sufficient chance to truly craft its own\n output sequences during training. Thus, we must be mindful of how we\n are setting the `teacher_forcing_ratio`, and not be fooled by fast\n convergence.\n- The second trick that we implement is **gradient clipping**. This is\n a commonly used technique for countering the \"exploding gradient\"\n problem. In essence, by clipping or thresholding gradients to a\n maximum value, we prevent the gradients from growing exponentially\n and either overflow (NaN), or overshoot steep cliffs in the cost\n function.\n\n![](https://pytorch.org/tutorials/_static/img/chatbot/grad_clip.png){.align-center\nwidth=\"60.0%\"}\n\nImage source: Goodfellow et al. *Deep Learning*. 
2016.\n\n\n**Sequence of Operations:**\n\n> 1) Forward pass entire input batch through encoder.\n> 2) Initialize decoder inputs as SOS\\_token, and hidden state as the\n> encoder\\'s final hidden state.\n> 3) Forward input batch sequence through decoder one time step at a\n> time.\n> 4) If teacher forcing: set next decoder input as the current target;\n> else: set next decoder input as current decoder output.\n> 5) Calculate and accumulate loss.\n> 6) Perform backpropagation.\n> 7) Clip gradients.\n> 8) Update encoder and decoder model parameters.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,\n encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):\n\n # Zero gradients\n encoder_optimizer.zero_grad()\n decoder_optimizer.zero_grad()\n\n # Set device options\n input_variable = input_variable.to(device)\n target_variable = target_variable.to(device)\n mask = mask.to(device)\n # Lengths for RNN packing should always be on the CPU\n lengths = lengths.to(\"cpu\")\n\n # Initialize variables\n loss = 0\n print_losses = []\n n_totals = 0\n\n # Forward pass through encoder\n encoder_outputs, encoder_hidden = encoder(input_variable, lengths)\n\n # Create initial decoder input (start with SOS tokens for each sentence)\n decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])\n decoder_input = decoder_input.to(device)\n\n # Set initial decoder hidden state to the encoder's final hidden state\n decoder_hidden = encoder_hidden[:decoder.n_layers]\n\n # Determine if we are using teacher forcing this iteration\n use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False\n\n # Forward batch of sequences through decoder one time step at a time\n if use_teacher_forcing:\n for t in range(max_target_len):\n decoder_output, decoder_hidden = decoder(\n decoder_input, decoder_hidden, encoder_outputs\n )\n # Teacher forcing: next input is current target\n decoder_input = target_variable[t].view(1, -1)\n # Calculate and accumulate loss\n mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])\n loss += mask_loss\n print_losses.append(mask_loss.item() * nTotal)\n n_totals += nTotal\n else:\n for t in range(max_target_len):\n decoder_output, decoder_hidden = decoder(\n decoder_input, decoder_hidden, encoder_outputs\n )\n # No teacher forcing: next input is decoder's own current output\n _, topi = decoder_output.topk(1)\n decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])\n decoder_input = decoder_input.to(device)\n # Calculate and accumulate loss\n mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])\n loss += mask_loss\n print_losses.append(mask_loss.item() * nTotal)\n n_totals += nTotal\n\n # Perform backpropagation\n loss.backward()\n\n # Clip gradients: gradients are modified in place\n _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)\n _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)\n\n # Adjust model weights\n encoder_optimizer.step()\n decoder_optimizer.step()\n\n return sum(print_losses) / n_totals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training iterations\n===================\n\nIt is finally time to tie the full training procedure together with the\ndata. 
The `trainIters` function is responsible for running\n`n_iterations` of training given the passed models, optimizers, data,\netc. This function is quite self explanatory, as we have done the heavy\nlifting with the `train` function.\n\nOne thing to note is that when we save our model, we save a tarball\ncontaining the encoder and decoder `state_dicts` (parameters), the\noptimizers' `state_dicts`, the loss, the iteration, etc. Saving the\nmodel in this way will give us the ultimate flexibility with the\ncheckpoint. After loading a checkpoint, we will be able to use the model\nparameters to run inference, or we can continue training right where we\nleft off.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every, clip, corpus_name, loadFilename):\n\n # Load batches for each iteration\n training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])\n for _ in range(n_iteration)]\n\n # Initializations\n print('Initializing ...')\n start_iteration = 1\n print_loss = 0\n if loadFilename:\n start_iteration = checkpoint['iteration'] + 1\n\n # Training loop\n print(\"Training...\")\n for iteration in range(start_iteration, n_iteration + 1):\n training_batch = training_batches[iteration - 1]\n # Extract fields from batch\n input_variable, lengths, target_variable, mask, max_target_len = training_batch\n\n # Run a training iteration with batch\n loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,\n decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)\n print_loss += loss\n\n # Print progress\n if iteration % print_every == 0:\n print_loss_avg = print_loss / print_every\n print(\"Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}\".format(iteration, iteration / n_iteration * 100, print_loss_avg))\n print_loss = 0\n\n # Save checkpoint\n if (iteration % save_every == 0):\n directory = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size))\n if not os.path.exists(directory):\n os.makedirs(directory)\n torch.save({\n 'iteration': iteration,\n 'en': encoder.state_dict(),\n 'de': decoder.state_dict(),\n 'en_opt': encoder_optimizer.state_dict(),\n 'de_opt': decoder_optimizer.state_dict(),\n 'loss': loss,\n 'voc_dict': voc.__dict__,\n 'embedding': embedding.state_dict()\n }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define Evaluation\n=================\n\nAfter training a model, we want to be able to talk to the bot ourselves.\nFirst, we must define how we want the model to decode the encoded input.\n\nGreedy decoding\n---------------\n\nGreedy decoding is the decoding method that we use during training when\nwe are **NOT** using teacher forcing. In other words, for each time\nstep, we simply choose the word from `decoder_output` with the highest\nsoftmax value. This decoding method is optimal on a single time-step\nlevel.\n\nTo facilitate the greedy decoding operation, we define a\n`GreedySearchDecoder` class. 
When run, an object of this class takes an\ninput sequence (`input_seq`) of shape *(input\\_seq length, 1)*, a scalar\ninput length (`input_length`) tensor, and a `max_length` to bound the\nresponse sentence length. The input sentence is evaluated using the\nfollowing computational graph:\n\n**Computation Graph:**\n\n> 1) Forward input through encoder model.\n>\n> 2) Prepare encoder's final hidden layer to be first hidden input to\n> the decoder.\n>\n> 3) Initialize decoder's first input as SOS\\_token.\n>\n> 4) Initialize tensors to append decoded words to.\n>\n> 5) Iteratively decode one word token at a time:\n>     a) Forward pass through decoder.\n>     b) Obtain most likely word token and its softmax score.\n>     c) Record token and score.\n>     d) Prepare current token to be next decoder input.\n>\n> 6) Return collections of word tokens and scores.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class GreedySearchDecoder(nn.Module):\n    def __init__(self, encoder, decoder):\n        super(GreedySearchDecoder, self).__init__()\n        self.encoder = encoder\n        self.decoder = decoder\n\n    def forward(self, input_seq, input_length, max_length):\n        # Forward input through encoder model\n        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)\n        # Prepare encoder's final hidden layer to be first hidden input to the decoder\n        decoder_hidden = encoder_hidden[:self.decoder.n_layers]\n        # Initialize decoder input with SOS_token\n        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token\n        # Initialize tensors to append decoded words to\n        all_tokens = torch.zeros([0], device=device, dtype=torch.long)\n        all_scores = torch.zeros([0], device=device)\n        # Iteratively decode one word token at a time\n        for _ in range(max_length):\n            # Forward pass through decoder\n            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)\n            # Obtain most likely word token and its softmax score\n            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)\n            # Record token and score\n            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)\n            all_scores = torch.cat((all_scores, decoder_scores), dim=0)\n            # Prepare current token to be next decoder input (add a dimension)\n            decoder_input = torch.unsqueeze(decoder_input, 0)\n        # Return collections of word tokens and scores\n        return all_tokens, all_scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate my text\n================\n\nNow that we have our decoding method defined, we can write functions for\nevaluating a string input sentence. The `evaluate` function manages the\nlow-level process of handling the input sentence. We first format the\nsentence as an input batch of word indexes with *batch\\_size==1*. We do\nthis by converting the words of the sentence to their corresponding\nindexes, and transposing the dimensions to prepare the tensor for our\nmodels. We also create a `lengths` tensor which contains the length of\nour input sentence. In this case, `lengths` is scalar because we are\nonly evaluating one sentence at a time (batch\\_size==1). Next, we obtain\nthe decoded response sentence tensor using our `GreedySearchDecoder`\nobject (`searcher`). Finally, we convert the response's indexes to words\nand return the list of decoded words.\n\n`evaluateInput` acts as the user interface for our chatbot. 
When called,\nan input text field will spawn in which we can enter our query sentence.\nAfter typing our input sentence and pressing *Enter*, our text is\nnormalized in the same way as our training data, and is ultimately fed\nto the `evaluate` function to obtain a decoded output sentence. We loop\nthis process, so we can keep chatting with our bot until we enter either\n\"q\" or \"quit\".\n\nFinally, if a sentence is entered that contains a word that is not in\nthe vocabulary, we handle this gracefully by printing an error message\nand prompting the user to enter another sentence.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):\n ### Format input sentence as a batch\n # words -> indexes\n indexes_batch = [indexesFromSentence(voc, sentence)]\n # Create lengths tensor\n lengths = torch.tensor([len(indexes) for indexes in indexes_batch])\n # Transpose dimensions of batch to match models' expectations\n input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)\n # Use appropriate device\n input_batch = input_batch.to(device)\n lengths = lengths.to(\"cpu\")\n # Decode sentence with searcher\n tokens, scores = searcher(input_batch, lengths, max_length)\n # indexes -> words\n decoded_words = [voc.index2word[token.item()] for token in tokens]\n return decoded_words\n\n\ndef evaluateInput(encoder, decoder, searcher, voc):\n input_sentence = ''\n while(1):\n try:\n # Get input sentence\n input_sentence = input('> ')\n # Check if it is quit case\n if input_sentence == 'q' or input_sentence == 'quit': break\n # Normalize sentence\n input_sentence = normalizeString(input_sentence)\n # Evaluate sentence\n output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)\n # Format and print response sentence\n output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]\n print('Bot:', ' '.join(output_words))\n\n except KeyError:\n print(\"Error: Encountered unknown word.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run Model\n=========\n\nFinally, it is time to run our model!\n\nRegardless of whether we want to train or test the chatbot model, we\nmust initialize the individual encoder and decoder models. In the\nfollowing block, we set our desired configurations, choose to start from\nscratch or set a checkpoint to load from, and build and initialize the\nmodels. 
Feel free to play with different model configurations to\noptimize performance.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Configure models\nmodel_name = 'cb_model'\nattn_model = 'dot'\n#``attn_model = 'general'``\n#``attn_model = 'concat'``\nhidden_size = 500\nencoder_n_layers = 2\ndecoder_n_layers = 2\ndropout = 0.1\nbatch_size = 64\n\n# Set checkpoint to load from; set to None if starting from scratch\nloadFilename = None\ncheckpoint_iter = 4000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample code to load from a checkpoint:\n\n``` {.python}\nloadFilename = os.path.join(save_dir, model_name, corpus_name,\n '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),\n '{}_checkpoint.tar'.format(checkpoint_iter))\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load model if a ``loadFilename`` is provided\nif loadFilename:\n # If loading on same machine the model was trained on\n checkpoint = torch.load(loadFilename)\n # If loading a model trained on GPU to CPU\n #checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))\n encoder_sd = checkpoint['en']\n decoder_sd = checkpoint['de']\n encoder_optimizer_sd = checkpoint['en_opt']\n decoder_optimizer_sd = checkpoint['de_opt']\n embedding_sd = checkpoint['embedding']\n voc.__dict__ = checkpoint['voc_dict']\n\n\nprint('Building encoder and decoder ...')\n# Initialize word embeddings\nembedding = nn.Embedding(voc.num_words, hidden_size)\nif loadFilename:\n embedding.load_state_dict(embedding_sd)\n# Initialize encoder & decoder models\nencoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)\ndecoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)\nif loadFilename:\n encoder.load_state_dict(encoder_sd)\n decoder.load_state_dict(decoder_sd)\n# Use appropriate device\nencoder = encoder.to(device)\ndecoder = decoder.to(device)\nprint('Models built and ready to go!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run Training\n============\n\nRun the following block if you want to train the model.\n\nFirst we set training parameters, then we initialize our optimizers, and\nfinally we call the `trainIters` function to run our training\niterations.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Configure training/optimization\nclip = 50.0\nteacher_forcing_ratio = 1.0\nlearning_rate = 0.0001\ndecoder_learning_ratio = 5.0\nn_iteration = 4000\nprint_every = 1\nsave_every = 500\n\n# Ensure dropout layers are in train mode\nencoder.train()\ndecoder.train()\n\n# Initialize optimizers\nprint('Building optimizers ...')\nencoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)\ndecoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)\nif loadFilename:\n encoder_optimizer.load_state_dict(encoder_optimizer_sd)\n decoder_optimizer.load_state_dict(decoder_optimizer_sd)\n\n# If you have an accelerator, configure it to call\nfor state in encoder_optimizer.state.values():\n for k, v in state.items():\n if isinstance(v, torch.Tensor):\n state[k] = v.to(device)\n\nfor state in decoder_optimizer.state.values():\n for k, v in state.items():\n if isinstance(v, torch.Tensor):\n state[k] = v.to(device)\n\n# Run training iterations\nprint(\"Starting 
Training!\")\ntrainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,\n embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,\n print_every, save_every, clip, corpus_name, loadFilename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run Evaluation\n==============\n\nTo chat with your model, run the following block.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Set dropout layers to ``eval`` mode\nencoder.eval()\ndecoder.eval()\n\n# Initialize search module\nsearcher = GreedySearchDecoder(encoder, decoder)\n\n# Begin chatting (uncomment and run the following line to begin)\n# evaluateInput(encoder, decoder, searcher, voc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion\n==========\n\nThat's all for this one, folks. Congratulations, you now know the\nfundamentals to building a generative chatbot model! If you're\ninterested, you can try tailoring the chatbot's behavior by tweaking the\nmodel and training parameters and customizing the data that you train\nthe model on.\n\nCheck out the other tutorials for more cool deep learning applications\nin PyTorch!\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }