{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLP From Scratch: Generating Names with a Character-Level RNN\n=============================================================\n\n**Author**: [Sean Robertson](https://github.com/spro)\n\nThis tutorials is part of a three-part series:\n\n- [NLP From Scratch: Classifying Names with a Character-Level\n RNN](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)\n- [NLP From Scratch: Generating Names with a Character-Level\n RNN](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html)\n- [NLP From Scratch: Translation with a Sequence to Sequence Network\n and\n Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)\n\nThis is our second of three tutorials on \\\"NLP From Scratch\\\". In the\n[first\ntutorial](/tutorials/intermediate/char_rnn_classification_tutorial) we\nused a RNN to classify names into their language of origin. This time\nwe\\'ll turn around and generate names from languages.\n\n``` {.sh}\n> python sample.py Russian RUS\nRovakov\nUantov\nShavakov\n\n> python sample.py German GER\nGerren\nEreng\nRosher\n\n> python sample.py Spanish SPA\nSalla\nParer\nAllan\n\n> python sample.py Chinese CHI\nChan\nHang\nIun\n```\n\nWe are still hand-crafting a small RNN with a few linear layers. The big\ndifference is instead of predicting a category after reading in all the\nletters of a name, we input a category and output one letter at a time.\nRecurrently predicting characters to form language (this could also be\ndone with words or other higher order constructs) is often referred to\nas a \\\"language model\\\".\n\n**Recommended Reading:**\n\nI assume you have at least installed PyTorch, know Python, and\nunderstand Tensors:\n\n- For installation instructions\n- `/beginner/deep_learning_60min_blitz`{.interpreted-text role=\"doc\"}\n to get started with PyTorch in general\n- `/beginner/pytorch_with_examples`{.interpreted-text role=\"doc\"} for\n a wide and deep overview\n- `/beginner/former_torchies_tutorial`{.interpreted-text role=\"doc\"}\n if you are former Lua Torch user\n\nIt would also be useful to know about RNNs and how they work:\n\n- [The Unreasonable Effectiveness of Recurrent Neural\n Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)\n shows a bunch of real life examples\n- [Understanding LSTM\n Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)\n is about LSTMs specifically but also informative about RNNs in\n general\n\nI also suggest the previous tutorial,\n`/intermediate/char_rnn_classification_tutorial`{.interpreted-text\nrole=\"doc\"}\n\nPreparing the Data\n------------------\n\n```{=html}\n
<div class=\"alert alert-info\"><h4>NOTE:</h4><p>Download the data from <a href=\"https://download.pytorch.org/tutorial/data.zip\">here</a> and extract it to the current directory.</p></div>
\n```\nSee the last tutorial for more details on this process. In short, there\nare a bunch of plain text files `data/names/[Language].txt` with a name\nper line. We split lines into an array, convert Unicode to ASCII, and\nend up with a dictionary `{language: [names ...]}`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from io import open\nimport glob\nimport os\nimport unicodedata\nimport string\n\nall_letters = string.ascii_letters + \" .,;'-\"\nn_letters = len(all_letters) + 1 # Plus EOS marker\n\ndef findFiles(path): return glob.glob(path)\n\n# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427\ndef unicodeToAscii(s):\n    return ''.join(\n        c for c in unicodedata.normalize('NFD', s)\n        if unicodedata.category(c) != 'Mn'\n        and c in all_letters\n    )\n\n# Read a file and split into lines\ndef readLines(filename):\n    with open(filename, encoding='utf-8') as some_file:\n        return [unicodeToAscii(line.strip()) for line in some_file]\n\n# Build the category_lines dictionary, a list of lines per category\ncategory_lines = {}\nall_categories = []\nfor filename in findFiles('data/names/*.txt'):\n    category = os.path.splitext(os.path.basename(filename))[0]\n    all_categories.append(category)\n    lines = readLines(filename)\n    category_lines[category] = lines\n\nn_categories = len(all_categories)\n\nif n_categories == 0:\n    raise RuntimeError('Data not found. Make sure that you downloaded data '\n        'from https://download.pytorch.org/tutorial/data.zip and extracted it to '\n        'the current directory.')\n\nprint('# categories:', n_categories, all_categories)\nprint(unicodeToAscii(\"O'N\u00e9\u00e0l\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating the Network\n====================\n\nThis network extends [the last tutorial\\'s\nRNN](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)\nwith an extra argument for the category tensor, which is concatenated\nalong with the others. The category tensor is a one-hot vector just like\nthe letter input.\n\nWe will interpret the output as the probability of the next letter. When\nsampling, the most likely output letter is used as the next input\nletter.\n\nI added a second linear layer `o2o` (after combining hidden and output)\nto give it more muscle to work with. There\\'s also a dropout layer,\nwhich [randomly zeros parts of its\ninput](https://arxiv.org/abs/1207.0580) with a given probability (here\n0.1) and is usually used to fuzz inputs to prevent overfitting. 
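\n\nAs a quick sketch of what that dropout layer does (a minimal, illustrative example; which elements get zeroed is random, so your output will differ):\n\n``` {.python}\nimport torch\nimport torch.nn as nn\n\ndrop = nn.Dropout(0.1)\nx = torch.ones(1, 8)\n# In training mode, each element is zeroed with probability 0.1 and the\n# survivors are scaled by 1 / (1 - 0.1), so the expected sum is unchanged.\nprint(drop(x))  # e.g. tensor([[1.1111, 1.1111, 0.0000, 1.1111, ...]])\n```\n\n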
Here\nwe\\'re using it towards the end of the network to purposely add some\nchaos and increase sampling variety.\n\n![](https://i.imgur.com/jzVrf7f.png)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torch.nn as nn\n\nclass RNN(nn.Module):\n def __init__(self, input_size, hidden_size, output_size):\n super(RNN, self).__init__()\n self.hidden_size = hidden_size\n\n self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)\n self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)\n self.o2o = nn.Linear(hidden_size + output_size, output_size)\n self.dropout = nn.Dropout(0.1)\n self.softmax = nn.LogSoftmax(dim=1)\n\n def forward(self, category, input, hidden):\n input_combined = torch.cat((category, input, hidden), 1)\n hidden = self.i2h(input_combined)\n output = self.i2o(input_combined)\n output_combined = torch.cat((hidden, output), 1)\n output = self.o2o(output_combined)\n output = self.dropout(output)\n output = self.softmax(output)\n return output, hidden\n\n def initHidden(self):\n return torch.zeros(1, self.hidden_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training\n========\n\nPreparing for Training\n----------------------\n\nFirst of all, helper functions to get random pairs of (category, line):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import random\n\n# Random item from a list\ndef randomChoice(l):\n return l[random.randint(0, len(l) - 1)]\n\n# Get a random category and random line from that category\ndef randomTrainingPair():\n category = randomChoice(all_categories)\n line = randomChoice(category_lines[category])\n return category, line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each timestep (that is, for each letter in a training word) the\ninputs of the network will be `(category, current letter, hidden state)`\nand the outputs will be `(next letter, next hidden state)`. So for each\ntraining set, we\\'ll need the category, a set of input letters, and a\nset of output/target letters.\n\nSince we are predicting the next letter from the current letter for each\ntimestep, the letter pairs are groups of consecutive letters from the\nline - e.g. for `\"ABCD\"` we would create (\\\"A\\\", \\\"B\\\"), (\\\"B\\\",\n\\\"C\\\"), (\\\"C\\\", \\\"D\\\"), (\\\"D\\\", \\\"EOS\\\").\n\n![](https://i.imgur.com/JH58tXY.png)\n\nThe category tensor is a [one-hot\ntensor](https://en.wikipedia.org/wiki/One-hot) of size\n`<1 x n_categories>`. 
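\n\nFor instance, here is a minimal sketch of building such a one-hot row by hand (the 18 categories and index 3 are assumed just for illustration; the `categoryTensor` helper below does exactly this using `all_categories`):\n\n``` {.python}\nimport torch\n\nn_categories = 18   # assumed here for illustration\ncategory_index = 3  # in the real code: all_categories.index(category)\ncategory_tensor = torch.zeros(1, n_categories)  # <1 x n_categories> zeros\ncategory_tensor[0][category_index] = 1          # a single 1 marks the category\n```\n\n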
When training we feed it to the network at every\ntimestep - this is a design choice, it could have been included as part\nof initial hidden state or some other strategy.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# One-hot vector for category\ndef categoryTensor(category):\n li = all_categories.index(category)\n tensor = torch.zeros(1, n_categories)\n tensor[0][li] = 1\n return tensor\n\n# One-hot matrix of first to last letters (not including EOS) for input\ndef inputTensor(line):\n tensor = torch.zeros(len(line), 1, n_letters)\n for li in range(len(line)):\n letter = line[li]\n tensor[li][0][all_letters.find(letter)] = 1\n return tensor\n\n# ``LongTensor`` of second letter to end (EOS) for target\ndef targetTensor(line):\n letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]\n letter_indexes.append(n_letters - 1) # EOS\n return torch.LongTensor(letter_indexes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience during training we\\'ll make a `randomTrainingExample`\nfunction that fetches a random (category, line) pair and turns them into\nthe required (category, input, target) tensors.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Make category, input, and target tensors from a random category, line pair\ndef randomTrainingExample():\n category, line = randomTrainingPair()\n category_tensor = categoryTensor(category)\n input_line_tensor = inputTensor(line)\n target_line_tensor = targetTensor(line)\n return category_tensor, input_line_tensor, target_line_tensor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training the Network\n====================\n\nIn contrast to classification, where only the last output is used, we\nare making a prediction at every step, so we are calculating loss at\nevery step.\n\nThe magic of autograd allows you to simply sum these losses at each step\nand call backward at the end.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "criterion = nn.NLLLoss()\n\nlearning_rate = 0.0005\n\ndef train(category_tensor, input_line_tensor, target_line_tensor):\n target_line_tensor.unsqueeze_(-1)\n hidden = rnn.initHidden()\n\n rnn.zero_grad()\n\n loss = torch.Tensor([0]) # you can also just simply use ``loss = 0``\n\n for i in range(input_line_tensor.size(0)):\n output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)\n l = criterion(output, target_line_tensor[i])\n loss += l\n\n loss.backward()\n\n for p in rnn.parameters():\n p.data.add_(p.grad.data, alpha=-learning_rate)\n\n return output, loss.item() / input_line_tensor.size(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To keep track of how long training takes I am adding a\n`timeSince(timestamp)` function which returns a human readable string:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\nimport math\n\ndef timeSince(since):\n now = time.time()\n s = now - since\n m = math.floor(s / 60)\n s -= m * 60\n return '%dm %ds' % (m, s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training is business as usual - call train a bunch of times and wait a\nfew minutes, printing the current time and loss every `print_every`\nexamples, and keeping store of an average loss per `plot_every` examples\nin `all_losses` for 
plotting later.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rnn = RNN(n_letters, 128, n_letters)\n\nn_iters = 100000\nprint_every = 5000\nplot_every = 500\nall_losses = []\ntotal_loss = 0 # Reset every ``plot_every`` ``iters``\n\nstart = time.time()\n\nfor iter in range(1, n_iters + 1):\n output, loss = train(*randomTrainingExample())\n total_loss += loss\n\n if iter % print_every == 0:\n print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, loss))\n\n if iter % plot_every == 0:\n all_losses.append(total_loss / plot_every)\n total_loss = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the Losses\n===================\n\nPlotting the historical loss from all\\_losses shows the network\nlearning:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nplt.figure()\nplt.plot(all_losses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sampling the Network\n====================\n\nTo sample we give the network a letter and ask what the next one is,\nfeed that in as the next letter, and repeat until the EOS token.\n\n- Create tensors for input category, starting letter, and empty hidden\n state\n- Create a string `output_name` with the starting letter\n- Up to a maximum output length,\n - Feed the current letter to the network\n - Get the next letter from highest output, and next hidden state\n - If the letter is EOS, stop here\n - If a regular letter, add to `output_name` and continue\n- Return the final name\n\n```{=html}\n
<div class=\"alert alert-info\"><h4>NOTE:</h4><p>Rather than having to give it a starting letter, another strategy would have been to include a \"start of string\" token in training and have the network choose its own starting letter.</p></div>
\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "max_length = 20\n\n# Sample from a category and starting letter\ndef sample(category, start_letter='A'):\n with torch.no_grad(): # no need to track history in sampling\n category_tensor = categoryTensor(category)\n input = inputTensor(start_letter)\n hidden = rnn.initHidden()\n\n output_name = start_letter\n\n for i in range(max_length):\n output, hidden = rnn(category_tensor, input[0], hidden)\n topv, topi = output.topk(1)\n topi = topi[0][0]\n if topi == n_letters - 1:\n break\n else:\n letter = all_letters[topi]\n output_name += letter\n input = inputTensor(letter)\n\n return output_name\n\n# Get multiple samples from one category and multiple starting letters\ndef samples(category, start_letters='ABC'):\n for start_letter in start_letters:\n print(sample(category, start_letter))\n\nsamples('Russian', 'RUS')\n\nsamples('German', 'GER')\n\nsamples('Spanish', 'SPA')\n\nsamples('Chinese', 'CHI')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercises\n=========\n\n- Try with a different dataset of category -\\> line, for example:\n - Fictional series -\\> Character name\n - Part of speech -\\> Word\n - Country -\\> City\n- Use a \\\"start of sentence\\\" token so that sampling can be done\n without choosing a start letter\n- Get better results with a bigger and/or better shaped network\n - Try the `nn.LSTM` and `nn.GRU` layers\n - Combine multiple of these RNNs as a higher level network\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }