{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLP From Scratch: Generating Names with a Character-Level RNN\n=============================================================\n\n**Author**: [Sean Robertson](https://github.com/spro)\n\nThis tutorials is part of a three-part series:\n\n- [NLP From Scratch: Classifying Names with a Character-Level\n RNN](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)\n- [NLP From Scratch: Generating Names with a Character-Level\n RNN](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html)\n- [NLP From Scratch: Translation with a Sequence to Sequence Network\n and\n Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)\n\nThis is our second of three tutorials on \\\"NLP From Scratch\\\". In the\n[first\ntutorial](/tutorials/intermediate/char_rnn_classification_tutorial) we\nused a RNN to classify names into their language of origin. This time\nwe\\'ll turn around and generate names from languages.\n\n``` {.sh}\n> python sample.py Russian RUS\nRovakov\nUantov\nShavakov\n\n> python sample.py German GER\nGerren\nEreng\nRosher\n\n> python sample.py Spanish SPA\nSalla\nParer\nAllan\n\n> python sample.py Chinese CHI\nChan\nHang\nIun\n```\n\nWe are still hand-crafting a small RNN with a few linear layers. The big\ndifference is instead of predicting a category after reading in all the\nletters of a name, we input a category and output one letter at a time.\nRecurrently predicting characters to form language (this could also be\ndone with words or other higher order constructs) is often referred to\nas a \\\"language model\\\".\n\n**Recommended Reading:**\n\nI assume you have at least installed PyTorch, know Python, and\nunderstand Tensors:\n\n- For installation instructions\n- `/beginner/deep_learning_60min_blitz`{.interpreted-text role=\"doc\"}\n to get started with PyTorch in general\n- `/beginner/pytorch_with_examples`{.interpreted-text role=\"doc\"} for\n a wide and deep overview\n- `/beginner/former_torchies_tutorial`{.interpreted-text role=\"doc\"}\n if you are former Lua Torch user\n\nIt would also be useful to know about RNNs and how they work:\n\n- [The Unreasonable Effectiveness of Recurrent Neural\n Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)\n shows a bunch of real life examples\n- [Understanding LSTM\n Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)\n is about LSTMs specifically but also informative about RNNs in\n general\n\nI also suggest the previous tutorial,\n`/intermediate/char_rnn_classification_tutorial`{.interpreted-text\nrole=\"doc\"}\n\nPreparing the Data\n------------------\n\n```{=html}\n
<div class=\"alert alert-info\"><h4>NOTE:</h4><p>Download the data from <a href=\"https://download.pytorch.org/tutorial/data.zip\">here</a> and extract it to the current directory.</p></div>
\n```\nSee the last tutorial for more details on this process. In short, there\nare a bunch of plain text files `data/names/[Language].txt` with a name\nper line. We split lines into an array, convert Unicode to ASCII, and\nend up with a dictionary `{language: [names ...]}`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from io import open\nimport glob\nimport os\nimport unicodedata\nimport string\n\nall_letters = string.ascii_letters + \" .,;'-\"\nn_letters = len(all_letters) + 1 # Plus EOS marker\n\ndef findFiles(path): return glob.glob(path)\n\n# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427\ndef unicodeToAscii(s):\n    return ''.join(\n        c for c in unicodedata.normalize('NFD', s)\n        if unicodedata.category(c) != 'Mn'\n        and c in all_letters\n    )\n\n# Read a file and split into lines\ndef readLines(filename):\n    with open(filename, encoding='utf-8') as some_file:\n        return [unicodeToAscii(line.strip()) for line in some_file]\n\n# Build the category_lines dictionary, a list of lines per category\ncategory_lines = {}\nall_categories = []\nfor filename in findFiles('data/names/*.txt'):\n    category = os.path.splitext(os.path.basename(filename))[0]\n    all_categories.append(category)\n    lines = readLines(filename)\n    category_lines[category] = lines\n\nn_categories = len(all_categories)\n\nif n_categories == 0:\n    raise RuntimeError('Data not found. Make sure that you downloaded data '\n        'from https://download.pytorch.org/tutorial/data.zip and extracted it to '\n        'the current directory.')\n\nprint('# categories:', n_categories, all_categories)\nprint(unicodeToAscii(\"O'N\u00e9\u00e0l\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating the Network\n====================\n\nThis network extends [the last tutorial\\'s\nRNN](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)\nwith an extra argument for the category tensor, which is concatenated\nalong with the others. The category tensor is a one-hot vector just like\nthe letter input.\n\nWe will interpret the output as the probability of the next letter. When\nsampling, the most likely output letter is used as the next input\nletter.\n\nI added a second linear layer `o2o` (after combining hidden and output)\nto give it more muscle to work with. There\\'s also a dropout layer,\nwhich [randomly zeros parts of its\ninput](https://arxiv.org/abs/1207.0580) with a given probability (here\n0.1) and is usually used to fuzz inputs to prevent overfitting. 
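\n\nAs a quick sketch of what that dropout layer does (a minimal, illustrative example; which elements get zeroed is random, so your output will differ):\n\n``` {.python}\nimport torch\nimport torch.nn as nn\n\ndrop = nn.Dropout(0.1)\nx = torch.ones(1, 8)\n# In training mode, each element is zeroed with probability 0.1 and the\n# survivors are scaled by 1 / (1 - 0.1), so the expected sum is unchanged.\nprint(drop(x))  # e.g. tensor([[1.1111, 1.1111, 0.0000, 1.1111, ...]])\n```\n\n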
Here\nwe\\'re using it towards the end of the network to purposely add some\nchaos and increase sampling variety.\n\n![](https://i.imgur.com/jzVrf7f.png)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torch.nn as nn\n\nclass RNN(nn.Module):\n def __init__(self, input_size, hidden_size, output_size):\n super(RNN, self).__init__()\n self.hidden_size = hidden_size\n\n self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)\n self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)\n self.o2o = nn.Linear(hidden_size + output_size, output_size)\n self.dropout = nn.Dropout(0.1)\n self.softmax = nn.LogSoftmax(dim=1)\n\n def forward(self, category, input, hidden):\n input_combined = torch.cat((category, input, hidden), 1)\n hidden = self.i2h(input_combined)\n output = self.i2o(input_combined)\n output_combined = torch.cat((hidden, output), 1)\n output = self.o2o(output_combined)\n output = self.dropout(output)\n output = self.softmax(output)\n return output, hidden\n\n def initHidden(self):\n return torch.zeros(1, self.hidden_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training\n========\n\nPreparing for Training\n----------------------\n\nFirst of all, helper functions to get random pairs of (category, line):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import random\n\n# Random item from a list\ndef randomChoice(l):\n return l[random.randint(0, len(l) - 1)]\n\n# Get a random category and random line from that category\ndef randomTrainingPair():\n category = randomChoice(all_categories)\n line = randomChoice(category_lines[category])\n return category, line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each timestep (that is, for each letter in a training word) the\ninputs of the network will be `(category, current letter, hidden state)`\nand the outputs will be `(next letter, next hidden state)`. So for each\ntraining set, we\\'ll need the category, a set of input letters, and a\nset of output/target letters.\n\nSince we are predicting the next letter from the current letter for each\ntimestep, the letter pairs are groups of consecutive letters from the\nline - e.g. for `\"ABCD\"` we would create (\\\"A\\\", \\\"B\\\"), (\\\"B\\\",\n\\\"C\\\"), (\\\"C\\\", \\\"D\\\"), (\\\"D\\\", \\\"EOS\\\").\n\n![](https://i.imgur.com/JH58tXY.png)\n\nThe category tensor is a [one-hot\ntensor](https://en.wikipedia.org/wiki/One-hot) of size\n`<1 x n_categories>`. 
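\n\nFor instance, here is a minimal sketch of building such a one-hot row by hand (the 18 categories and index 3 are assumed just for illustration; the `categoryTensor` helper below does exactly this using `all_categories`):\n\n``` {.python}\nimport torch\n\nn_categories = 18   # assumed here for illustration\ncategory_index = 3  # in the real code: all_categories.index(category)\ncategory_tensor = torch.zeros(1, n_categories)  # <1 x n_categories> zeros\ncategory_tensor[0][category_index] = 1          # a single 1 marks the category\n```\n\n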
When training we feed it to the network at every\ntimestep - this is a design choice, it could have been included as part\nof initial hidden state or some other strategy.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# One-hot vector for category\ndef categoryTensor(category):\n li = all_categories.index(category)\n tensor = torch.zeros(1, n_categories)\n tensor[0][li] = 1\n return tensor\n\n# One-hot matrix of first to last letters (not including EOS) for input\ndef inputTensor(line):\n tensor = torch.zeros(len(line), 1, n_letters)\n for li in range(len(line)):\n letter = line[li]\n tensor[li][0][all_letters.find(letter)] = 1\n return tensor\n\n# ``LongTensor`` of second letter to end (EOS) for target\ndef targetTensor(line):\n letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]\n letter_indexes.append(n_letters - 1) # EOS\n return torch.LongTensor(letter_indexes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience during training we\\'ll make a `randomTrainingExample`\nfunction that fetches a random (category, line) pair and turns them into\nthe required (category, input, target) tensors.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Make category, input, and target tensors from a random category, line pair\ndef randomTrainingExample():\n category, line = randomTrainingPair()\n category_tensor = categoryTensor(category)\n input_line_tensor = inputTensor(line)\n target_line_tensor = targetTensor(line)\n return category_tensor, input_line_tensor, target_line_tensor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training the Network\n====================\n\nIn contrast to classification, where only the last output is used, we\nare making a prediction at every step, so we are calculating loss at\nevery step.\n\nThe magic of autograd allows you to simply sum these losses at each step\nand call backward at the end.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "criterion = nn.NLLLoss()\n\nlearning_rate = 0.0005\n\ndef train(category_tensor, input_line_tensor, target_line_tensor):\n target_line_tensor.unsqueeze_(-1)\n hidden = rnn.initHidden()\n\n rnn.zero_grad()\n\n loss = torch.Tensor([0]) # you can also just simply use ``loss = 0``\n\n for i in range(input_line_tensor.size(0)):\n output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)\n l = criterion(output, target_line_tensor[i])\n loss += l\n\n loss.backward()\n\n for p in rnn.parameters():\n p.data.add_(p.grad.data, alpha=-learning_rate)\n\n return output, loss.item() / input_line_tensor.size(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To keep track of how long training takes I am adding a\n`timeSince(timestamp)` function which returns a human readable string:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\nimport math\n\ndef timeSince(since):\n now = time.time()\n s = now - since\n m = math.floor(s / 60)\n s -= m * 60\n return '%dm %ds' % (m, s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training is business as usual - call train a bunch of times and wait a\nfew minutes, printing the current time and loss every `print_every`\nexamples, and keeping store of an average loss per `plot_every` examples\nin `all_losses` for 
plotting later.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rnn = RNN(n_letters, 128, n_letters)\n\nn_iters = 100000\nprint_every = 5000\nplot_every = 500\nall_losses = []\ntotal_loss = 0 # Reset every ``plot_every`` ``iters``\n\nstart = time.time()\n\nfor iter in range(1, n_iters + 1):\n output, loss = train(*randomTrainingExample())\n total_loss += loss\n\n if iter % print_every == 0:\n print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, loss))\n\n if iter % plot_every == 0:\n all_losses.append(total_loss / plot_every)\n total_loss = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the Losses\n===================\n\nPlotting the historical loss from all\\_losses shows the network\nlearning:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nplt.figure()\nplt.plot(all_losses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sampling the Network\n====================\n\nTo sample we give the network a letter and ask what the next one is,\nfeed that in as the next letter, and repeat until the EOS token.\n\n- Create tensors for input category, starting letter, and empty hidden\n state\n- Create a string `output_name` with the starting letter\n- Up to a maximum output length,\n - Feed the current letter to the network\n - Get the next letter from highest output, and next hidden state\n - If the letter is EOS, stop here\n - If a regular letter, add to `output_name` and continue\n- Return the final name\n\n```{=html}\n
<div class=\"alert alert-info\"><h4>NOTE:</h4><p>Rather than having to give it a starting letter, another strategy would have been to include a \"start of string\" token in training and have the network choose its own starting letter.</p></div>
\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "max_length = 20\n\n# Sample from a category and starting letter\ndef sample(category, start_letter='A'):\n with torch.no_grad(): # no need to track history in sampling\n category_tensor = categoryTensor(category)\n input = inputTensor(start_letter)\n hidden = rnn.initHidden()\n\n output_name = start_letter\n\n for i in range(max_length):\n output, hidden = rnn(category_tensor, input[0], hidden)\n topv, topi = output.topk(1)\n topi = topi[0][0]\n if topi == n_letters - 1:\n break\n else:\n letter = all_letters[topi]\n output_name += letter\n input = inputTensor(letter)\n\n return output_name\n\n# Get multiple samples from one category and multiple starting letters\ndef samples(category, start_letters='ABC'):\n for start_letter in start_letters:\n print(sample(category, start_letter))\n\nsamples('Russian', 'RUS')\n\nsamples('German', 'GER')\n\nsamples('Spanish', 'SPA')\n\nsamples('Chinese', 'CHI')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercises\n=========\n\n- Try with a different dataset of category -\\> line, for example:\n - Fictional series -\\> Character name\n - Part of speech -\\> Word\n - Country -\\> City\n- Use a \\\"start of sentence\\\" token so that sampling can be done\n without choosing a start letter\n- Get better results with a bigger and/or better shaped network\n - Try the `nn.LSTM` and `nn.GRU` layers\n - Combine multiple of these RNNs as a higher level network\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }