{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLP From Scratch: Classifying Names with a Character-Level RNN\n==============================================================\n\n**Author**: [Sean Robertson](https://github.com/spro)\n\nThis tutorials is part of a three-part series:\n\n- [NLP From Scratch: Classifying Names with a Character-Level\n RNN](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)\n- [NLP From Scratch: Generating Names with a Character-Level\n RNN](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html)\n- [NLP From Scratch: Translation with a Sequence to Sequence Network\n and\n Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)\n\nWe will be building and training a basic character-level Recurrent\nNeural Network (RNN) to classify words. This tutorial, along with two\nother Natural Language Processing (NLP) \\\"from scratch\\\" tutorials\n`/intermediate/char_rnn_generation_tutorial`{.interpreted-text\nrole=\"doc\"} and\n`/intermediate/seq2seq_translation_tutorial`{.interpreted-text\nrole=\"doc\"}, show how to preprocess data to model NLP. In particular,\nthese tutorials show how preprocessing to model NLP works at a low\nlevel.\n\nA character-level RNN reads words as a series of characters -outputting\na prediction and \\\"hidden state\\\" at each step, feeding its previous\nhidden state into each next step. We take the final prediction to be the\noutput, i.e. which class the word belongs to.\n\nSpecifically, we\\'ll train on a few thousand surnames from 18 languages\nof origin, and predict which language a name is from based on the\nspelling.\n\nRecommended Preparation\n-----------------------\n\nBefore starting this tutorial it is recommended that you have installed\nPyTorch, and have a basic understanding of Python programming language\nand Tensors:\n\n- For installation instructions\n- `/beginner/deep_learning_60min_blitz`{.interpreted-text role=\"doc\"}\n to get started with PyTorch in general and learn the basics of\n Tensors\n- `/beginner/pytorch_with_examples`{.interpreted-text role=\"doc\"} for\n a wide and deep overview\n- `/beginner/former_torchies_tutorial`{.interpreted-text role=\"doc\"}\n if you are former Lua Torch user\n\nIt would also be useful to know about RNNs and how they work:\n\n- [The Unreasonable Effectiveness of Recurrent Neural\n Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)\n shows a bunch of real life examples\n- [Understanding LSTM\n Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)\n is about LSTMs specifically but also informative about RNNs in\n general\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preparing Torch\n===============\n\nSet up torch to default to the right device use GPU acceleration\ndepending on your hardware (CPU or CUDA).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\n\n# Check if CUDA is available\ndevice = torch.device('cpu')\nif torch.cuda.is_available():\n device = torch.device('cuda')\n\ntorch.set_default_device(device)\nprint(f\"Using device = {torch.get_default_device()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preparing the Data\n==================\n\nDownload the data from\n[here](https://download.pytorch.org/tutorial/data.zip) and extract it to\nthe current directory.\n\nIncluded in the `data/names` directory are 18 text files named as\n`[Language].txt`. Each file contains a bunch of names, one name per\nline, mostly romanized (but we still need to convert from Unicode to\nASCII).\n\nThe first step is to define and clean our data. Initially, we need to\nconvert Unicode to plain ASCII to limit the RNN input layers. This is\naccomplished by converting Unicode strings to ASCII and allowing only a\nsmall set of allowed characters.\n"
]
},
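{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the `data/names` directory is not already on disk, a minimal sketch like the following can download and extract the archive, assuming the URL above is reachable from your environment and that writing into the current directory is acceptable.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\nimport zipfile\nimport urllib.request\n\n# A minimal sketch: download and unpack the dataset only if it is not\n# already present in the current directory.\nif not os.path.isdir(\"data/names\"):\n    urllib.request.urlretrieve(\"https://download.pytorch.org/tutorial/data.zip\", \"data.zip\")\n    with zipfile.ZipFile(\"data.zip\", \"r\") as zf:\n        zf.extractall(\".\")  # creates the data/names directory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the data available locally, we can define the cleaning step described above.\n"
]
},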
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import string\nimport unicodedata\n\n# We can use \"_\" to represent an out-of-vocabulary character, that is, any character we are not handling in our model\nallowed_characters = string.ascii_letters + \" .,;'\" + \"_\"\nn_letters = len(allowed_characters)\n\n# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427\ndef unicodeToAscii(s):\n return ''.join(\n c for c in unicodedata.normalize('NFD', s)\n if unicodedata.category(c) != 'Mn'\n and c in allowed_characters\n )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here\\'s an example of converting a unicode alphabet name to plain ASCII.\nThis simplifies the input layer\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print (f\"converting '\u015alus\u00e0rski' to {unicodeToAscii('\u015alus\u00e0rski')}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Turning Names into Tensors\n==========================\n\nNow that we have all the names organized, we need to turn them into\nTensors to make any use of them.\n\nTo represent a single letter, we use a \\\"one-hot vector\\\" of size\n`<1 x n_letters>`. A one-hot vector is filled with 0s except for a 1 at\nindex of the current letter, e.g. `\"b\" = <0 1 0 0 0 ...>`.\n\nTo make a word we join a bunch of those into a 2D matrix\n``.\n\nThat extra 1 dimension is because PyTorch assumes everything is in\nbatches - we\\'re just using a batch size of 1 here.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Find letter index from all_letters, e.g. \"a\" = 0\ndef letterToIndex(letter):\n # return our out-of-vocabulary character if we encounter a letter unknown to our model\n if letter not in allowed_characters:\n return allowed_characters.find(\"_\")\n else:\n return allowed_characters.find(letter)\n\n# Turn a line into a ,\n# or an array of one-hot letter vectors\ndef lineToTensor(line):\n tensor = torch.zeros(len(line), 1, n_letters)\n for li, letter in enumerate(line):\n tensor[li][0][letterToIndex(letter)] = 1\n return tensor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are some examples of how to use `lineToTensor()` for a single and\nmultiple character string.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print (f\"The letter 'a' becomes {lineToTensor('a')}\") #notice that the first position in the tensor = 1\nprint (f\"The name 'Ahn' becomes {lineToTensor('Ahn')}\") #notice 'A' sets the 27th index to 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations, you have built the foundational tensor objects for this\nlearning task! You can use a similar approach for other RNN tasks with\ntext.\n\nNext, we need to combine all our examples into a dataset so we can\ntrain, test and validate our models. For this, we will use the [Dataset\nand\nDataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)\nclasses to hold our dataset. Each Dataset needs to implement three\nfunctions: `__init__`, `__len__`, and `__getitem__`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from io import open\nimport glob\nimport os\nimport time\n\nimport torch\nfrom torch.utils.data import Dataset\n\nclass NamesDataset(Dataset):\n\n def __init__(self, data_dir):\n self.data_dir = data_dir #for provenance of the dataset\n self.load_time = time.localtime #for provenance of the dataset\n labels_set = set() #set of all classes\n\n self.data = []\n self.data_tensors = []\n self.labels = []\n self.labels_tensors = []\n\n #read all the ``.txt`` files in the specified directory\n text_files = glob.glob(os.path.join(data_dir, '*.txt'))\n for filename in text_files:\n label = os.path.splitext(os.path.basename(filename))[0]\n labels_set.add(label)\n lines = open(filename, encoding='utf-8').read().strip().split('\\n')\n for name in lines:\n self.data.append(name)\n self.data_tensors.append(lineToTensor(name))\n self.labels.append(label)\n\n #Cache the tensor representation of the labels\n self.labels_uniq = list(labels_set)\n for idx in range(len(self.labels)):\n temp_tensor = torch.tensor([self.labels_uniq.index(self.labels[idx])], dtype=torch.long)\n self.labels_tensors.append(temp_tensor)\n\n def __len__(self):\n return len(self.data)\n\n def __getitem__(self, idx):\n data_item = self.data[idx]\n data_label = self.labels[idx]\n data_tensor = self.data_tensors[idx]\n label_tensor = self.labels_tensors[idx]\n\n return label_tensor, data_tensor, data_label, data_item"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we can load our example data into the `NamesDataset`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"alldata = NamesDataset(\"data/names\")\nprint(f\"loaded {len(alldata)} items of data\")\nprint(f\"example = {alldata[0]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the dataset object allows us to easily split the data into train and test sets. Here we create a 80/20\n\n: split but the `torch.utils.data` has more useful utilities. Here we\n specify a generator since we need to use the\n\nsame device as PyTorch defaults to above.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"train_set, test_set = torch.utils.data.random_split(alldata, [.85, .15], generator=torch.Generator(device=device).manual_seed(2024))\n\nprint(f\"train examples = {len(train_set)}, validation examples = {len(test_set)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have a basic dataset containing **20074** examples where each\nexample is a pairing of label and name. We have also split the dataset\ninto training and testing so we can validate the model that we build.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating the Network\n====================\n\nBefore autograd, creating a recurrent neural network in Torch involved\ncloning the parameters of a layer over several timesteps. The layers\nheld hidden state and gradients which are now entirely handled by the\ngraph itself. This means you can implement a RNN in a very \\\"pure\\\" way,\nas regular feed-forward layers.\n\nThis CharRNN class implements an RNN with three components. First, we\nuse the [nn.RNN\nimplementation](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).\nNext, we define a layer that maps the RNN hidden layers to our output.\nAnd finally, we apply a `softmax` function. Using `nn.RNN` leads to a\nsignificant improvement in performance, such as cuDNN-accelerated\nkernels, versus implementing each layer as a `nn.Linear`. It also\nsimplifies the implementation in `forward()`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch.nn as nn\nimport torch.nn.functional as F\n\nclass CharRNN(nn.Module):\n def __init__(self, input_size, hidden_size, output_size):\n super(CharRNN, self).__init__()\n\n self.rnn = nn.RNN(input_size, hidden_size)\n self.h2o = nn.Linear(hidden_size, output_size)\n self.softmax = nn.LogSoftmax(dim=1)\n\n def forward(self, line_tensor):\n rnn_out, hidden = self.rnn(line_tensor)\n output = self.h2o(hidden[0])\n output = self.softmax(output)\n\n return output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then create an RNN with 58 input nodes, 128 hidden nodes, and 18\noutputs:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"n_hidden = 128\nrnn = CharRNN(n_letters, n_hidden, len(alldata.labels_uniq))\nprint(rnn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After that we can pass our Tensor to the RNN to obtain a predicted\noutput. Subsequently, we use a helper function, `label_from_output`, to\nderive a text label for the class.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def label_from_output(output, output_labels):\n top_n, top_i = output.topk(1)\n label_i = top_i[0].item()\n return output_labels[label_i], label_i\n\ninput = lineToTensor('Albert')\noutput = rnn(input) #this is equivalent to ``output = rnn.forward(input)``\nprint(output)\nprint(label_from_output(output, alldata.labels_uniq))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training\n========\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training the Network\n====================\n\nNow all it takes to train this network is show it a bunch of examples,\nhave it make guesses, and tell it if it\\'s wrong.\n\nWe do this by defining a `train()` function which trains the model on a\ngiven dataset using minibatches. RNNs RNNs are trained similarly to\nother networks; therefore, for completeness, we include a batched\ntraining method here. The loop (`for i in batch`) computes the losses\nfor each of the items in the batch before adjusting the weights. This\noperation is repeated until the number of epochs is reached.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import random\nimport numpy as np\n\ndef train(rnn, training_data, n_epoch = 10, n_batch_size = 64, report_every = 50, learning_rate = 0.2, criterion = nn.NLLLoss()):\n \"\"\"\n Learn on a batch of training_data for a specified number of iterations and reporting thresholds\n \"\"\"\n # Keep track of losses for plotting\n current_loss = 0\n all_losses = []\n rnn.train()\n optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)\n\n start = time.time()\n print(f\"training on data set with n = {len(training_data)}\")\n\n for iter in range(1, n_epoch + 1):\n rnn.zero_grad() # clear the gradients\n\n # create some minibatches\n # we cannot use dataloaders because each of our names is a different length\n batches = list(range(len(training_data)))\n random.shuffle(batches)\n batches = np.array_split(batches, len(batches) //n_batch_size )\n\n for idx, batch in enumerate(batches):\n batch_loss = 0\n for i in batch: #for each example in this batch\n (label_tensor, text_tensor, label, text) = training_data[i]\n output = rnn.forward(text_tensor)\n loss = criterion(output, label_tensor)\n batch_loss += loss\n\n # optimize parameters\n batch_loss.backward()\n nn.utils.clip_grad_norm_(rnn.parameters(), 3)\n optimizer.step()\n optimizer.zero_grad()\n\n current_loss += batch_loss.item() / len(batch)\n\n all_losses.append(current_loss / len(batches) )\n if iter % report_every == 0:\n print(f\"{iter} ({iter / n_epoch:.0%}): \\t average batch loss = {all_losses[-1]}\")\n current_loss = 0\n\n return all_losses"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now train a dataset with minibatches for a specified number of\nepochs. The number of epochs for this example is reduced to speed up the\nbuild. You can get better results with different parameters.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"start = time.time()\nall_losses = train(rnn, train_set, n_epoch=27, learning_rate=0.15, report_every=5)\nend = time.time()\nprint(f\"training took {end-start}s\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plotting the Results\n====================\n\nPlotting the historical loss from `all_losses` shows the network\nlearning:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\nimport matplotlib.ticker as ticker\n\nplt.figure()\nplt.plot(all_losses)\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluating the Results\n======================\n\nTo see how well the network performs on different categories, we will\ncreate a confusion matrix, indicating for every actual language (rows)\nwhich language the network guesses (columns). To calculate the confusion\nmatrix a bunch of samples are run through the network with `evaluate()`,\nwhich is the same as `train()` minus the backprop.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def evaluate(rnn, testing_data, classes):\n confusion = torch.zeros(len(classes), len(classes))\n\n rnn.eval() #set to eval mode\n with torch.no_grad(): # do not record the gradients during eval phase\n for i in range(len(testing_data)):\n (label_tensor, text_tensor, label, text) = testing_data[i]\n output = rnn(text_tensor)\n guess, guess_i = label_from_output(output, classes)\n label_i = classes.index(label)\n confusion[label_i][guess_i] += 1\n\n # Normalize by dividing every row by its sum\n for i in range(len(classes)):\n denom = confusion[i].sum()\n if denom > 0:\n confusion[i] = confusion[i] / denom\n\n # Set up plot\n fig = plt.figure()\n ax = fig.add_subplot(111)\n cax = ax.matshow(confusion.cpu().numpy()) #numpy uses cpu here so we need to use a cpu version\n fig.colorbar(cax)\n\n # Set up axes\n ax.set_xticks(np.arange(len(classes)), labels=classes, rotation=90)\n ax.set_yticks(np.arange(len(classes)), labels=classes)\n\n # Force label at every tick\n ax.xaxis.set_major_locator(ticker.MultipleLocator(1))\n ax.yaxis.set_major_locator(ticker.MultipleLocator(1))\n\n # sphinx_gallery_thumbnail_number = 2\n plt.show()\n\n\n\nevaluate(rnn, test_set, classes=alldata.labels_uniq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pick out bright spots off the main axis that show which\nlanguages it guesses incorrectly, e.g. Chinese for Korean, and Spanish\nfor Italian. It seems to do very well with Greek, and very poorly with\nEnglish (perhaps because of overlap with other languages).\n"
]
},
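{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond the confusion matrix, it can help to have a single summary number. The following is a minimal sketch that computes overall accuracy on the held-out set by reusing `label_from_output`; the `test_accuracy` helper name is just for illustration.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def test_accuracy(rnn, testing_data, classes):\n    # Count how many held-out names the model labels correctly\n    correct = 0\n    rnn.eval()\n    with torch.no_grad():\n        for i in range(len(testing_data)):\n            (label_tensor, text_tensor, label, text) = testing_data[i]\n            output = rnn(text_tensor)\n            guess, guess_i = label_from_output(output, classes)\n            if guess == label:\n                correct += 1\n    return correct / len(testing_data)\n\nprint(f\"overall test accuracy = {test_accuracy(rnn, test_set, alldata.labels_uniq):.1%}\")"
]
},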
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercises\n=========\n\n- Get better results with a bigger and/or better shaped network\n - Adjust the hyperparameters to enhance performance, such as\n changing the number of epochs, batch size, and learning rate\n - Try the `nn.LSTM` and `nn.GRU` layers\n - Modify the size of the layers, such as increasing or decreasing\n the number of hidden nodes or adding additional linear layers\n - Combine multiple of these RNNs as a higher level network\n- Try with a different dataset of line -\\> label, for example:\n - Any word -\\> language\n - First name -\\> gender\n - Character name -\\> writer\n - Page title -\\> blog or subreddit\n"
]
}
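,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a possible starting point for the `nn.LSTM` exercise above, here is a minimal sketch of a hypothetical `CharLSTM` variant. It keeps the same structure as `CharRNN`; the main difference is that `nn.LSTM` returns a `(hidden, cell)` state pair, which `forward` unpacks before the output layer.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class CharLSTM(nn.Module):\n    def __init__(self, input_size, hidden_size, output_size):\n        super(CharLSTM, self).__init__()\n\n        # Same structure as CharRNN, with the recurrent layer swapped for an LSTM\n        self.lstm = nn.LSTM(input_size, hidden_size)\n        self.h2o = nn.Linear(hidden_size, output_size)\n        self.softmax = nn.LogSoftmax(dim=1)\n\n    def forward(self, line_tensor):\n        # nn.LSTM returns (output, (hidden_state, cell_state))\n        lstm_out, (hidden, cell) = self.lstm(line_tensor)\n        output = self.h2o(hidden[0])\n        output = self.softmax(output)\n\n        return output\n\n# It can be trained and evaluated with the same helpers as before, for example:\n# lstm = CharLSTM(n_letters, n_hidden, len(alldata.labels_uniq))\n# all_losses = train(lstm, train_set, n_epoch=10, learning_rate=0.15, report_every=5)"
]
}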
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}