{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adversarial Example Generation\n==============================\n\n**Author:** [Nathan Inkawhich](https://github.com/inkawhich)\n\nIf you are reading this, hopefully you can appreciate how effective some\nmachine learning models are. Research is constantly pushing ML models to\nbe faster, more accurate, and more efficient. However, an often\noverlooked aspect of designing and training models is security and\nrobustness, especially in the face of an adversary who wishes to fool\nthe model.\n\nThis tutorial will raise your awareness to the security vulnerabilities\nof ML models, and will give insight into the hot topic of adversarial\nmachine learning. You may be surprised to find that adding imperceptible\nperturbations to an image *can* cause drastically different model\nperformance. Given that this is a tutorial, we will explore the topic\nvia example on an image classifier. Specifically, we will use one of the\nfirst and most popular attack methods, the Fast Gradient Sign Attack\n(FGSM), to fool an MNIST classifier.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Threat Model\n============\n\nFor context, there are many categories of adversarial attacks, each with\na different goal and assumption of the attacker's knowledge. However, in\ngeneral the overarching goal is to add the least amount of perturbation\nto the input data to cause the desired misclassification. There are\nseveral kinds of assumptions of the attacker's knowledge, two of which\nare: **white-box** and **black-box**. A *white-box* attack assumes the\nattacker has full knowledge and access to the model, including\narchitecture, inputs, outputs, and weights. A *black-box* attack assumes\nthe attacker only has access to the inputs and outputs of the model, and\nknows nothing about the underlying architecture or weights. There are\nalso several types of goals, including **misclassification** and\n**source/target misclassification**. A goal of *misclassification* means\nthe adversary only wants the output classification to be wrong but does\nnot care what the new classification is. A *source/target\nmisclassification* means the adversary wants to alter an image that is\noriginally of a specific source class so that it is classified as a\nspecific target class.\n\nIn this case, the FGSM attack is a *white-box* attack with the goal of\n*misclassification*. With this background information, we can now\ndiscuss the attack in detail.\n\nFast Gradient Sign Attack\n=========================\n\nOne of the first and most popular adversarial attacks to date is\nreferred to as the *Fast Gradient Sign Attack (FGSM)* and is described\nby Goodfellow et. al.\u00a0in [Explaining and Harnessing Adversarial\nExamples](https://arxiv.org/abs/1412.6572). The attack is remarkably\npowerful, and yet intuitive. It is designed to attack neural networks by\nleveraging the way they learn, *gradients*. The idea is simple, rather\nthan working to minimize the loss by adjusting the weights based on the\nbackpropagated gradients, the attack *adjusts the input data to maximize\nthe loss* based on the same backpropagated gradients. 
In other words,\nthe attack uses the gradient of the loss w.r.t. the input data, then\nadjusts the input data to maximize the loss.\n\nBefore we jump into the code, let's look at the famous\n[FGSM](https://arxiv.org/abs/1412.6572) panda example and extract some\nnotation.\n\n![](https://pytorch.org/tutorials/_static/img/fgsm_panda_image.png)\n\nFrom the figure, $\\mathbf{x}$ is the original input image correctly\nclassified as a \"panda\", $y$ is the ground truth label for $\\mathbf{x}$,\n$\\mathbf{\\theta}$ represents the model parameters, and\n$J(\\mathbf{\\theta}, \\mathbf{x}, y)$ is the loss that is used to train\nthe network. The attack backpropagates the gradient back to the input\ndata to calculate $\\nabla_{x} J(\\mathbf{\\theta}, \\mathbf{x}, y)$. Then,\nit adjusts the input data by a small step ($\\epsilon$ or $0.007$ in the\npicture) in the direction (i.e.\n$sign(\\nabla_{x} J(\\mathbf{\\theta}, \\mathbf{x}, y))$) that will maximize\nthe loss. The resulting perturbed image, $x'$, is then *misclassified*\nby the target network as a \"gibbon\" when it is still clearly a \"panda\".\n\nHopefully now the motivation for this tutorial is clear, so let's jump\ninto the implementation.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.optim as optim\nfrom torchvision import datasets, transforms\nimport numpy as np\nimport matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Implementation\n==============\n\nIn this section, we will discuss the input parameters for the tutorial,\ndefine the model under attack, then code the attack and run some tests.\n\nInputs\n------\n\nThere are only two inputs for this tutorial, and they are defined as\nfollows:\n\n- `epsilons` - List of epsilon values to use for the run. It is\n important to keep 0 in the list because it represents the model\n performance on the original test set. Also, intuitively we would\n expect the larger the epsilon, the more noticeable the perturbations\n but the more effective the attack in terms of degrading model\n accuracy. Since the data range here is $[0,1]$, no epsilon value\n should exceed 1.\n- `pretrained_model` - Path to the pretrained MNIST model which was\n trained with\n [pytorch/examples/mnist](https://github.com/pytorch/examples/tree/master/mnist).\n For simplicity, download the pretrained model\n [here](https://drive.google.com/file/d/1HJV2nUHJqclXQ8flKvcWmjZ-OU5DGatl/view?usp=drive_link).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "epsilons = [0, .05, .1, .15, .2, .25, .3]\npretrained_model = \"data/lenet_mnist_model.pth\"\n# Set random seed for reproducibility\ntorch.manual_seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Model Under Attack\n==================\n\nAs mentioned, the model under attack is the same MNIST model from\n[pytorch/examples/mnist](https://github.com/pytorch/examples/tree/master/mnist).\nYou may train and save your own MNIST model or you can download and use\nthe provided model. The *Net* definition and test dataloader here have\nbeen copied from the MNIST example.
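\n\nIf you would rather train your own copy than download one, a minimal sketch of a training loop is shown below (run it after the next cell, which defines `Net` and `device`; the optimizer, learning rate, and epoch count here are illustrative assumptions, not necessarily the settings used to produce the provided checkpoint):\n\n```python\n# Illustrative training sketch only -- hyperparameters are assumptions, and the\n# save path matches the pretrained_model path above (the data/ folder must exist).\ntrain_loader = torch.utils.data.DataLoader(\n    datasets.MNIST('../data', train=True, download=True,\n                   transform=transforms.Compose([\n                       transforms.ToTensor(),\n                       transforms.Normalize((0.1307,), (0.3081,)),\n                   ])),\n    batch_size=64, shuffle=True)\n\nnet = Net().to(device)\noptimizer = optim.Adam(net.parameters(), lr=1e-3)\nnet.train()\nfor epoch in range(3):\n    for data, target in train_loader:\n        data, target = data.to(device), target.to(device)\n        optimizer.zero_grad()\n        loss = F.nll_loss(net(data), target)\n        loss.backward()\n        optimizer.step()\ntorch.save(net.state_dict(), 'data/lenet_mnist_model.pth')\n```\n\n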
The purpose of this section is to\ndefine the model and dataloader, then initialize the model and load the\npretrained weights.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# LeNet Model definition\nclass Net(nn.Module):\n    def __init__(self):\n        super(Net, self).__init__()\n        self.conv1 = nn.Conv2d(1, 32, 3, 1)\n        self.conv2 = nn.Conv2d(32, 64, 3, 1)\n        self.dropout1 = nn.Dropout(0.25)\n        self.dropout2 = nn.Dropout(0.5)\n        self.fc1 = nn.Linear(9216, 128)\n        self.fc2 = nn.Linear(128, 10)\n\n    def forward(self, x):\n        x = self.conv1(x)\n        x = F.relu(x)\n        x = self.conv2(x)\n        x = F.relu(x)\n        x = F.max_pool2d(x, 2)\n        x = self.dropout1(x)\n        x = torch.flatten(x, 1)\n        x = self.fc1(x)\n        x = F.relu(x)\n        x = self.dropout2(x)\n        x = self.fc2(x)\n        output = F.log_softmax(x, dim=1)\n        return output\n\n# MNIST Test dataset and dataloader declaration\ntest_loader = torch.utils.data.DataLoader(\n    datasets.MNIST('../data', train=False, download=True, transform=transforms.Compose([\n        transforms.ToTensor(),\n        transforms.Normalize((0.1307,), (0.3081,)),\n    ])),\n    batch_size=1, shuffle=True)\n\n# We want to be able to run the model on an accelerator\n# such as CUDA, MPS, MTIA, or XPU. If the current accelerator is available, we will use it. Otherwise, we use the CPU.\ndevice = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else \"cpu\"\nprint(f\"Using {device} device\")\n\n# Initialize the network\nmodel = Net().to(device)\n\n# Load the pretrained model\nmodel.load_state_dict(torch.load(pretrained_model, map_location=device, weights_only=True))\n\n# Set the model in evaluation mode. In this case, this is for the Dropout layers\nmodel.eval()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FGSM Attack\n===========\n\nNow, we can define the function that creates the adversarial examples by\nperturbing the original inputs. The `fgsm_attack` function takes three\ninputs: *image* is the original clean image ($x$), *epsilon* is the\npixel-wise perturbation amount ($\\epsilon$), and *data\\_grad* is the\ngradient of the loss w.r.t. the input image\n($\\nabla_{x} J(\\mathbf{\\theta}, \\mathbf{x}, y)$).
The function then\ncreates the perturbed image as\n\n$$perturbed\\_image = image + epsilon*sign(data\\_grad) = x + \\epsilon * sign(\\nabla_{x} J(\\mathbf{\\theta}, \\mathbf{x}, y))$$\n\nFinally, in order to maintain the original range of the data, the\nperturbed image is clipped to the range $[0,1]$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# FGSM attack code\ndef fgsm_attack(image, epsilon, data_grad):\n    # Collect the element-wise sign of the data gradient\n    sign_data_grad = data_grad.sign()\n    # Create the perturbed image by adjusting each pixel of the input image\n    perturbed_image = image + epsilon*sign_data_grad\n    # Adding clipping to maintain [0,1] range\n    perturbed_image = torch.clamp(perturbed_image, 0, 1)\n    # Return the perturbed image\n    return perturbed_image\n\n# Restores the tensors to their original scale\ndef denorm(batch, mean=[0.1307], std=[0.3081]):\n    \"\"\"\n    Convert a batch of tensors to their original scale.\n\n    Args:\n        batch (torch.Tensor): Batch of normalized tensors.\n        mean (torch.Tensor or list): Mean used for normalization.\n        std (torch.Tensor or list): Standard deviation used for normalization.\n\n    Returns:\n        torch.Tensor: batch of tensors without normalization applied to them.\n    \"\"\"\n    if isinstance(mean, list):\n        mean = torch.tensor(mean).to(device)\n    if isinstance(std, list):\n        std = torch.tensor(std).to(device)\n\n    return batch * std.view(1, -1, 1, 1) + mean.view(1, -1, 1, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Testing Function\n================\n\nFinally, the central result of this tutorial comes from the `test`\nfunction. Each call to this test function performs a full test step on\nthe MNIST test set and reports a final accuracy. However, notice that\nthis function also takes an *epsilon* input. This is because the `test`\nfunction reports the accuracy of a model that is under attack from an\nadversary with strength $\\epsilon$. More specifically, for each sample\nin the test set, the function computes the gradient of the loss w.r.t.\nthe input data ($data\\_grad$), creates a perturbed image with\n`fgsm_attack` ($perturbed\\_data$), then checks to see if the perturbed\nexample is adversarial. In addition to testing the accuracy of the\nmodel, the function also saves and returns some successful adversarial\nexamples to be visualized later.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test( model, device, test_loader, epsilon ):\n\n    # Accuracy counter\n    correct = 0\n    adv_examples = []\n\n    # Loop over all examples in test set\n    for data, target in test_loader:\n\n        # Send the data and label to the device\n        data, target = data.to(device), target.to(device)\n\n        # Set requires_grad attribute of tensor.
Important for Attack\n        data.requires_grad = True\n\n        # Forward pass the data through the model\n        output = model(data)\n        init_pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability\n\n        # If the initial prediction is wrong, don't bother attacking, just move on\n        if init_pred.item() != target.item():\n            continue\n\n        # Calculate the loss\n        loss = F.nll_loss(output, target)\n\n        # Zero all existing gradients\n        model.zero_grad()\n\n        # Calculate gradients of model in backward pass\n        loss.backward()\n\n        # Collect ``data_grad``\n        data_grad = data.grad.data\n\n        # Restore the data to its original scale\n        data_denorm = denorm(data)\n\n        # Call FGSM Attack\n        perturbed_data = fgsm_attack(data_denorm, epsilon, data_grad)\n\n        # Reapply normalization\n        perturbed_data_normalized = transforms.Normalize((0.1307,), (0.3081,))(perturbed_data)\n\n        # Re-classify the perturbed image\n        output = model(perturbed_data_normalized)\n\n        # Check for success\n        final_pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability\n        if final_pred.item() == target.item():\n            correct += 1\n            # Special case for saving 0 epsilon examples\n            if epsilon == 0 and len(adv_examples) < 5:\n                adv_ex = perturbed_data.squeeze().detach().cpu().numpy()\n                adv_examples.append( (init_pred.item(), final_pred.item(), adv_ex) )\n        else:\n            # Save some adv examples for visualization later\n            if len(adv_examples) < 5:\n                adv_ex = perturbed_data.squeeze().detach().cpu().numpy()\n                adv_examples.append( (init_pred.item(), final_pred.item(), adv_ex) )\n\n    # Calculate final accuracy for this epsilon\n    final_acc = correct/float(len(test_loader))\n    print(f\"Epsilon: {epsilon}\\tTest Accuracy = {correct} / {len(test_loader)} = {final_acc}\")\n\n    # Return the accuracy and an adversarial example\n    return final_acc, adv_examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run Attack\n==========\n\nThe last part of the implementation is to actually run the attack. Here,\nwe run a full test step for each epsilon value in the *epsilons* input.\nFor each epsilon we also save the final accuracy and some successful\nadversarial examples to be plotted in the coming sections. Notice how\nthe printed accuracies decrease as the epsilon value increases. Also,\nnote the $\\epsilon=0$ case represents the original test accuracy, with\nno attack.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "accuracies = []\nexamples = []\n\n# Run test for each epsilon\nfor eps in epsilons:\n    acc, ex = test(model, device, test_loader, eps)\n    accuracies.append(acc)\n    examples.append(ex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Results\n=======\n\nAccuracy vs Epsilon\n-------------------\n\nThe first result is the accuracy versus epsilon plot. As alluded to\nearlier, as epsilon increases we expect the test accuracy to decrease.\nThis is because larger epsilons mean we take a larger step in the\ndirection that will maximize the loss. Notice the trend in the curve is\nnot linear even though the epsilon values are linearly spaced. For\nexample, the accuracy at $\\epsilon=0.05$ is only about 4% lower than\n$\\epsilon=0$, but the accuracy at $\\epsilon=0.2$ is 25% lower than\n$\\epsilon=0.15$.
Also, notice the accuracy of the model hits random\naccuracy for a 10-class classifier between $\\epsilon=0.25$ and\n$\\epsilon=0.3$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=(5,5))\nplt.plot(epsilons, accuracies, \"*-\")\nplt.yticks(np.arange(0, 1.1, step=0.1))\nplt.xticks(np.arange(0, .35, step=0.05))\nplt.title(\"Accuracy vs Epsilon\")\nplt.xlabel(\"Epsilon\")\nplt.ylabel(\"Accuracy\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample Adversarial Examples\n===========================\n\nRemember the idea of no free lunch? In this case, as epsilon increases\nthe test accuracy decreases **BUT** the perturbations become more easily\nperceptible. In reality, there is a tradeoff between accuracy\ndegradation and perceptibility that an attacker must consider. Here, we\nshow some examples of successful adversarial examples at each epsilon\nvalue. Each row of the plot shows a different epsilon value. The first\nrow is the $\\epsilon=0$ examples which represent the original \"clean\"\nimages with no perturbation. The title of each image shows the \"original\nclassification -\\> adversarial classification.\" Notice that the\nperturbations start to become evident at $\\epsilon=0.15$ and are quite\nevident at $\\epsilon=0.3$. However, in all cases humans are still\ncapable of identifying the correct class despite the added noise.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Plot several examples of adversarial samples at each epsilon\ncnt = 0\nplt.figure(figsize=(8,10))\nfor i in range(len(epsilons)):\n    for j in range(len(examples[i])):\n        cnt += 1\n        plt.subplot(len(epsilons),len(examples[0]),cnt)\n        plt.xticks([], [])\n        plt.yticks([], [])\n        if j == 0:\n            plt.ylabel(f\"Eps: {epsilons[i]}\", fontsize=14)\n        orig,adv,ex = examples[i][j]\n        plt.title(f\"{orig} -> {adv}\")\n        plt.imshow(ex, cmap=\"gray\")\nplt.tight_layout()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Where to go next?\n=================\n\nHopefully this tutorial gives some insight into the topic of adversarial\nmachine learning. There are many potential directions to go from here.\nThis attack represents the very beginning of adversarial attack research,\nand since then there have been many subsequent ideas for how to attack and\ndefend ML models from an adversary. In fact, at NIPS 2017 there was an\nadversarial attack and defense competition and many of the methods used\nin the competition are described in this paper: [Adversarial Attacks and\nDefences Competition](https://arxiv.org/pdf/1804.00097.pdf). The work on\ndefense also leads into the idea of making machine learning models more\n*robust* in general, to both naturally perturbed and adversarially\ncrafted inputs.\n\nAnother direction to go is adversarial attacks and defense in different\ndomains. Adversarial research is not limited to the image domain; check\nout [this](https://arxiv.org/pdf/1801.01944.pdf) attack on\nspeech-to-text models. But perhaps the best way to learn more about\nadversarial machine learning is to get your hands dirty. Try to\nimplement a different attack from the NIPS 2017 competition, and see how\nit differs from FGSM.
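\n\nFor instance, one natural follow-up is an *iterative* variant of FGSM that takes several small steps and keeps the total perturbation inside an $\\epsilon$-ball around the original image. The sketch below is only an illustration of that idea and is not part of this tutorial; the step size `alpha` and `num_iter` are arbitrary illustrative choices:\n\n```python\n# Illustrative sketch of an iterative FGSM-style attack; alpha and num_iter\n# are arbitrary choices, not tuned values.\ndef iterative_fgsm(model, image, target, epsilon, alpha=0.01, num_iter=10):\n    # image is expected in the original [0,1] scale, as returned by denorm()\n    normalize = transforms.Normalize((0.1307,), (0.3081,))\n    perturbed = image.clone().detach()\n    for _ in range(num_iter):\n        perturbed.requires_grad_(True)\n        loss = F.nll_loss(model(normalize(perturbed)), target)\n        model.zero_grad()\n        loss.backward()\n        with torch.no_grad():\n            perturbed = perturbed + alpha * perturbed.grad.sign()\n            # keep the accumulated perturbation within the epsilon-ball and valid range\n            perturbed = torch.max(torch.min(perturbed, image + epsilon), image - epsilon)\n            perturbed = torch.clamp(perturbed, 0, 1)\n    return perturbed.detach()\n```\n\n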
Then, try to defend the model from your own\nattacks.\n\nA further direction to go, depending on available resources, is to\nmodify the code to process the examples in batches, in parallel, or in a\ndistributed fashion, rather than attacking one example at a time in the\n`test()` loop that runs above for each epsilon.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }