{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TorchVision Object Detection Finetuning Tutorial\n================================================\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this tutorial, we will be finetuning a pre-trained [Mask\nR-CNN](https://arxiv.org/abs/1703.06870) model on the [Penn-Fudan\nDatabase for Pedestrian Detection and\nSegmentation](https://www.cis.upenn.edu/~jshi/ped_html/). It contains\n170 images with 345 instances of pedestrians, and we will use it to\nillustrate how to use the new features in torchvision in order to train\nan object detection and instance segmentation model on a custom dataset.\n\nDefining the Dataset\n====================\n\nThe reference scripts for training object detection, instance\nsegmentation and person keypoint detection allows for easily supporting\nadding new custom datasets. The dataset should inherit from the standard\n`torch.utils.data.Dataset`{.interpreted-text role=\"class\"} class, and\nimplement `__len__` and `__getitem__`.\n\nThe only specificity that we require is that the dataset `__getitem__`\nshould return a tuple:\n\n- image: `torchvision.tv_tensors.Image`{.interpreted-text\n role=\"class\"} of shape `[3, H, W]`, a pure tensor, or a PIL Image of\n size `(H, W)`\n- target: a dict containing the following fields\n - `boxes`,\n `torchvision.tv_tensors.BoundingBoxes`{.interpreted-text\n role=\"class\"} of shape `[N, 4]`: the coordinates of the `N`\n bounding boxes in `[x0, y0, x1, y1]` format, ranging from `0` to\n `W` and `0` to `H`\n - `labels`, integer `torch.Tensor`{.interpreted-text role=\"class\"}\n of shape `[N]`: the label for each bounding box. `0` represents\n always the background class.\n - `image_id`, int: an image identifier. It should be unique\n between all the images in the dataset, and is used during\n evaluation\n - `area`, float `torch.Tensor`{.interpreted-text role=\"class\"} of\n shape `[N]`: the area of the bounding box. This is used during\n evaluation with the COCO metric, to separate the metric scores\n between small, medium and large boxes.\n - `iscrowd`, uint8 `torch.Tensor`{.interpreted-text role=\"class\"}\n of shape `[N]`: instances with `iscrowd=True` will be ignored\n during evaluation.\n - (optionally) `masks`,\n `torchvision.tv_tensors.Mask`{.interpreted-text role=\"class\"} of\n shape `[N, H, W]`: the segmentation masks for each one of the\n objects\n\nIf your dataset is compliant with above requirements then it will work\nfor both training and evaluation codes from the reference script.\nEvaluation code will use scripts from `pycocotools` which can be\ninstalled with `pip install pycocotools`.\n\nOne note on the `labels`. The model considers class `0` as background.\nIf your dataset does not contain the background class, you should not\nhave `0` in your `labels`. For example, assuming you have just two\nclasses, *cat* and *dog*, you can define `1` (not `0`) to represent\n*cats* and `2` to represent *dogs*. 
\n\nAdditionally, if you want to use aspect ratio grouping during training\n(so that each batch only contains images with similar aspect ratios),\nthen it is recommended to also implement a `get_height_and_width`\nmethod, which returns the height and the width of the image. If this\nmethod is not provided, we query all elements of the dataset via\n`__getitem__`, which loads the image into memory and is slower than if\na custom method is provided.\n\nWriting a custom dataset for PennFudan\n--------------------------------------\n\nLet's write a dataset class for the PennFudan data. First, let's\ndownload the dataset and extract the [zip\nfile](https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip):\n\n``` {.python}\nwget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip -P data\ncd data && unzip PennFudanPed.zip\n```\n\nWe have the following folder structure:\n\n    PennFudanPed/\n      PedMasks/\n        FudanPed00001_mask.png\n        FudanPed00002_mask.png\n        FudanPed00003_mask.png\n        FudanPed00004_mask.png\n        ...\n      PNGImages/\n        FudanPed00001.png\n        FudanPed00002.png\n        FudanPed00003.png\n        FudanPed00004.png\n\nHere is one example of an image with its corresponding segmentation\nmask:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nfrom torchvision.io import read_image\n\n\nimage = read_image(\"data/PennFudanPed/PNGImages/FudanPed00046.png\")\nmask = read_image(\"data/PennFudanPed/PedMasks/FudanPed00046_mask.png\")\n\nplt.figure(figsize=(16, 8))\nplt.subplot(121)\nplt.title(\"Image\")\nplt.imshow(image.permute(1, 2, 0))\nplt.subplot(122)\nplt.title(\"Mask\")\nplt.imshow(mask.permute(1, 2, 0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So each image has a corresponding segmentation mask, where each color\ncorresponds to a different instance. Let's write a\n`torch.utils.data.Dataset`{.interpreted-text role=\"class\"} class for\nthis dataset. In the code below, we are wrapping images, bounding boxes\nand masks into `torchvision.tv_tensors.TVTensor`{.interpreted-text\nrole=\"class\"} classes so that we will be able to apply torchvision's\nbuilt-in transformations ([new Transforms\nAPI](https://pytorch.org/vision/stable/transforms.html)) for the given\nobject detection and segmentation task. Namely, image tensors will be\nwrapped by `torchvision.tv_tensors.Image`{.interpreted-text\nrole=\"class\"}, bounding boxes into\n`torchvision.tv_tensors.BoundingBoxes`{.interpreted-text role=\"class\"}\nand masks into `torchvision.tv_tensors.Mask`{.interpreted-text\nrole=\"class\"}. As `torchvision.tv_tensors.TVTensor`{.interpreted-text\nrole=\"class\"} are `torch.Tensor`{.interpreted-text role=\"class\"}\nsubclasses, wrapped objects are also tensors and inherit the plain\n`torch.Tensor`{.interpreted-text role=\"class\"} API. 
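\n\nAs a quick, self-contained illustration of this wrapping (a sketch that\nis independent of the dataset class below, using arbitrary values):\n\n``` {.python}\nimport torch\nfrom torchvision import tv_tensors\n\n# wrap an arbitrary uint8 tensor as an Image and a single box as BoundingBoxes\nimg = tv_tensors.Image(torch.randint(0, 256, (3, 4, 6), dtype=torch.uint8))\nboxes = tv_tensors.BoundingBoxes([[0, 0, 3, 2]], format='XYXY', canvas_size=(4, 6))\n\n# TVTensors are regular tensors, so the usual Tensor API still applies\nprint(img.shape, img.dtype)      # torch.Size([3, 4, 6]) torch.uint8\nprint(boxes.shape, boxes.dtype)  # torch.Size([1, 4]) torch.int64\nprint(isinstance(img, torch.Tensor), isinstance(boxes, torch.Tensor))  # True True\n```\n\n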
For more information\nabout torchvision `tv_tensors` see [this\ndocumentation](https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_getting_started.html#what-are-tvtensors).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\nimport torch\n\nfrom torchvision.io import read_image\nfrom torchvision.ops.boxes import masks_to_boxes\nfrom torchvision import tv_tensors\nfrom torchvision.transforms.v2 import functional as F\n\n\nclass PennFudanDataset(torch.utils.data.Dataset):\n    def __init__(self, root, transforms):\n        self.root = root\n        self.transforms = transforms\n        # load all image files, sorting them to\n        # ensure that they are aligned\n        self.imgs = list(sorted(os.listdir(os.path.join(root, \"PNGImages\"))))\n        self.masks = list(sorted(os.listdir(os.path.join(root, \"PedMasks\"))))\n\n    def __getitem__(self, idx):\n        # load images and masks\n        img_path = os.path.join(self.root, \"PNGImages\", self.imgs[idx])\n        mask_path = os.path.join(self.root, \"PedMasks\", self.masks[idx])\n        img = read_image(img_path)\n        mask = read_image(mask_path)\n        # instances are encoded as different colors\n        obj_ids = torch.unique(mask)\n        # first id is the background, so remove it\n        obj_ids = obj_ids[1:]\n        num_objs = len(obj_ids)\n\n        # split the color-encoded mask into a set\n        # of binary masks\n        masks = (mask == obj_ids[:, None, None]).to(dtype=torch.uint8)\n\n        # get bounding box coordinates for each mask\n        boxes = masks_to_boxes(masks)\n\n        # there is only one class\n        labels = torch.ones((num_objs,), dtype=torch.int64)\n\n        image_id = idx\n        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])\n        # suppose all instances are not crowd\n        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)\n\n        # Wrap sample and targets into torchvision tv_tensors:\n        img = tv_tensors.Image(img)\n\n        target = {}\n        target[\"boxes\"] = tv_tensors.BoundingBoxes(boxes, format=\"XYXY\", canvas_size=F.get_size(img))\n        target[\"masks\"] = tv_tensors.Mask(masks)\n        target[\"labels\"] = labels\n        target[\"image_id\"] = image_id\n        target[\"area\"] = area\n        target[\"iscrowd\"] = iscrowd\n\n        if self.transforms is not None:\n            img, target = self.transforms(img, target)\n\n        return img, target\n\n    def __len__(self):\n        return len(self.imgs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all for the dataset. Now let's define a model that can perform\npredictions on this dataset.\n\nDefining your model\n===================\n\nIn this tutorial, we will be using [Mask\nR-CNN](https://arxiv.org/abs/1703.06870), which is built on top of\n[Faster R-CNN](https://arxiv.org/abs/1506.01497). Faster R-CNN is a\nmodel that predicts both bounding boxes and class scores for potential\nobjects in the image.\n\n![image](https://pytorch.org/tutorials/_static/img/tv_tutorial/tv_image03.png)\n\nMask R-CNN adds an extra branch to Faster R-CNN, which also predicts\nsegmentation masks for each instance.\n\n![image](https://pytorch.org/tutorials/_static/img/tv_tutorial/tv_image04.png)\n\nThere are two common situations where one might want to modify one of\nthe available models in the TorchVision Model Zoo. The first is when we\nwant to start from a pre-trained model and just finetune the last\nlayer. 
The\nother is when we want to replace the backbone of the model with a\ndifferent one (for faster predictions, for example).\n\nLet's see how we would do one or the other in the following sections.\n\n1 - Finetuning from a pretrained model\n--------------------------------------\n\nLet's suppose that you want to start from a model pre-trained on COCO\nand want to finetune it for your particular classes. Here is a possible\nway of doing it:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torchvision\nfrom torchvision.models.detection.faster_rcnn import FastRCNNPredictor\n\n# load a model pre-trained on COCO\nmodel = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=\"DEFAULT\")\n\n# replace the classifier with a new one that has a\n# user-defined number of classes\nnum_classes = 2  # 1 class (person) + background\n# get the number of input features for the classifier\nin_features = model.roi_heads.box_predictor.cls_score.in_features\n# replace the pre-trained head with a new one\nmodel.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2 - Modifying the model to add a different backbone\n---------------------------------------------------\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torchvision\nfrom torchvision.models.detection import FasterRCNN\nfrom torchvision.models.detection.rpn import AnchorGenerator\n\n# load a pre-trained model for classification and return\n# only the features\nbackbone = torchvision.models.mobilenet_v2(weights=\"DEFAULT\").features\n# ``FasterRCNN`` needs to know the number of\n# output channels in a backbone. For mobilenet_v2, it's 1280,\n# so we need to add it here\nbackbone.out_channels = 1280\n\n# let's make the RPN generate 5 x 3 anchors per spatial\n# location, with 5 different sizes and 3 different aspect\n# ratios. We have a Tuple[Tuple[int]] because each feature\n# map could potentially have different sizes and\n# aspect ratios\nanchor_generator = AnchorGenerator(\n    sizes=((32, 64, 128, 256, 512),),\n    aspect_ratios=((0.5, 1.0, 2.0),)\n)\n\n# let's define which feature maps we will use to perform the\n# region of interest cropping, as well as the size of the\n# crop after rescaling.\n# if your backbone returns a Tensor, ``featmap_names`` is expected to\n# be ['0']. 
More generally, the backbone should return an\n# ``OrderedDict[Tensor]``, and in ``featmap_names`` you can choose which\n# feature maps to use.\nroi_pooler = torchvision.ops.MultiScaleRoIAlign(\n    featmap_names=['0'],\n    output_size=7,\n    sampling_ratio=2\n)\n\n# put the pieces together inside a Faster R-CNN model\nmodel = FasterRCNN(\n    backbone,\n    num_classes=2,\n    rpn_anchor_generator=anchor_generator,\n    box_roi_pool=roi_pooler\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Object detection and instance segmentation model for PennFudan Dataset\n======================================================================\n\nIn our case, we want to finetune from a pre-trained model, given that\nour dataset is very small, so we will be following approach number 1.\n\nHere we also want to compute the instance segmentation masks, so we\nwill be using Mask R-CNN:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torchvision\nfrom torchvision.models.detection.faster_rcnn import FastRCNNPredictor\nfrom torchvision.models.detection.mask_rcnn import MaskRCNNPredictor\n\n\ndef get_model_instance_segmentation(num_classes):\n    # load an instance segmentation model pre-trained on COCO\n    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=\"DEFAULT\")\n\n    # get the number of input features for the classifier\n    in_features = model.roi_heads.box_predictor.cls_score.in_features\n    # replace the pre-trained head with a new one\n    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)\n\n    # now get the number of input features for the mask classifier\n    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels\n    hidden_layer = 256\n    # and replace the mask predictor with a new one\n    model.roi_heads.mask_predictor = MaskRCNNPredictor(\n        in_features_mask,\n        hidden_layer,\n        num_classes\n    )\n\n    return model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it: `model` is now ready to be trained and evaluated on your\ncustom dataset.\n\nPutting everything together\n===========================\n\nIn `references/detection/`, we have a number of helper functions to\nsimplify training and evaluating detection models. Here, we will use\n`references/detection/engine.py` and `references/detection/utils.py`.\nJust download everything under `references/detection` to your folder and\nuse them here. 
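\n\nThe code cell below uses `wget`; if you are not on Linux or do not have\n`wget` installed, a pure-Python alternative along the following lines\nshould work as well (a sketch that uses only the standard library and\nfetches the same five files):\n\n``` {.python}\nimport urllib.request\n\nbase = 'https://raw.githubusercontent.com/pytorch/vision/main/references/detection/'\nfor name in ['engine.py', 'utils.py', 'coco_utils.py', 'coco_eval.py', 'transforms.py']:\n    urllib.request.urlretrieve(base + name, name)\n```\n\n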
On Linux, if you have `wget`, you can download them with the commands\nbelow:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "os.system(\"wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/engine.py\")\nos.system(\"wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/utils.py\")\nos.system(\"wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/coco_utils.py\")\nos.system(\"wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/coco_eval.py\")\nos.system(\"wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/transforms.py\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since v0.15.0, torchvision provides a [new Transforms\nAPI](https://pytorch.org/vision/stable/transforms.html) to easily write\ndata augmentation pipelines for object detection and segmentation tasks.\n\nLet's write some helper functions for data augmentation /\ntransformation:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torchvision.transforms import v2 as T\n\n\ndef get_transform(train):\n    transforms = []\n    if train:\n        transforms.append(T.RandomHorizontalFlip(0.5))\n    transforms.append(T.ToDtype(torch.float, scale=True))\n    transforms.append(T.ToPureTensor())\n    return T.Compose(transforms)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Testing `forward()` method (Optional)\n=====================================\n\nBefore iterating over the dataset, it's good to see what the model\nexpects during training and inference on sample data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import utils\n\nmodel = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=\"DEFAULT\")\ndataset = PennFudanDataset('data/PennFudanPed', get_transform(train=True))\ndata_loader = torch.utils.data.DataLoader(\n    dataset,\n    batch_size=2,\n    shuffle=True,\n    collate_fn=utils.collate_fn\n)\n\n# For Training\nimages, targets = next(iter(data_loader))\nimages = list(image for image in images)\ntargets = [{k: v for k, v in t.items()} for t in targets]\noutput = model(images, targets)  # Returns losses and detections\nprint(output)\n\n# For inference\nmodel.eval()\nx = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]\npredictions = model(x)  # Returns predictions\nprint(predictions[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now write the main function which performs the training and the\nvalidation:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from engine import train_one_epoch, evaluate\n\n# train on the GPU or on the CPU, if a GPU is not available\ndevice = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n\n# our dataset has two classes only - background and person\nnum_classes = 2\n# use our dataset and defined transformations\ndataset = PennFudanDataset('data/PennFudanPed', get_transform(train=True))\ndataset_test = PennFudanDataset('data/PennFudanPed', get_transform(train=False))\n\n# split the dataset into a train and a test set\nindices = torch.randperm(len(dataset)).tolist()\ndataset = torch.utils.data.Subset(dataset, indices[:-50])\ndataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])\n\n# define training and validation 
data loaders\ndata_loader = torch.utils.data.DataLoader(\n    dataset,\n    batch_size=2,\n    shuffle=True,\n    collate_fn=utils.collate_fn\n)\n\ndata_loader_test = torch.utils.data.DataLoader(\n    dataset_test,\n    batch_size=1,\n    shuffle=False,\n    collate_fn=utils.collate_fn\n)\n\n# get the model using our helper function\nmodel = get_model_instance_segmentation(num_classes)\n\n# move model to the right device\nmodel.to(device)\n\n# construct an optimizer\nparams = [p for p in model.parameters() if p.requires_grad]\noptimizer = torch.optim.SGD(\n    params,\n    lr=0.005,\n    momentum=0.9,\n    weight_decay=0.0005\n)\n\n# and a learning rate scheduler\nlr_scheduler = torch.optim.lr_scheduler.StepLR(\n    optimizer,\n    step_size=3,\n    gamma=0.1\n)\n\n# let's train it just for 2 epochs\nnum_epochs = 2\n\nfor epoch in range(num_epochs):\n    # train for one epoch, printing every 10 iterations\n    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)\n    # update the learning rate\n    lr_scheduler.step()\n    # evaluate on the test dataset\n    evaluate(model, data_loader_test, device=device)\n\nprint(\"That's it!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So after one epoch of training, we obtain a COCO-style box mAP above 50\nand a mask mAP of 65.\n\nBut what do the predictions look like? Let's take one image from the\ndataset and verify.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfrom torchvision.utils import draw_bounding_boxes, draw_segmentation_masks\n\n\nimage = read_image(\"data/PennFudanPed/PNGImages/FudanPed00046.png\")\neval_transform = get_transform(train=False)\n\nmodel.eval()\nwith torch.no_grad():\n    x = eval_transform(image)\n    # convert RGBA -> RGB and move to device\n    x = x[:3, ...].to(device)\n    predictions = model([x, ])\n    pred = predictions[0]\n\n\nimage = (255.0 * (image - image.min()) / (image.max() - image.min())).to(torch.uint8)\nimage = image[:3, ...]\npred_labels = [f\"pedestrian: {score:.3f}\" for label, score in zip(pred[\"labels\"], pred[\"scores\"])]\npred_boxes = pred[\"boxes\"].long()\noutput_image = draw_bounding_boxes(image, pred_boxes, pred_labels, colors=\"red\")\n\nmasks = (pred[\"masks\"] > 0.7).squeeze(1)\noutput_image = draw_segmentation_masks(output_image, masks, alpha=0.5, colors=\"blue\")\n\n\nplt.figure(figsize=(12, 12))\nplt.imshow(output_image.permute(1, 2, 0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results look good!\n\nWrapping up\n===========\n\nIn this tutorial, you have learned how to create your own training\npipeline for object detection models on a custom dataset. For that, you\nwrote a `torch.utils.data.Dataset`{.interpreted-text role=\"class\"} class\nthat returns the images and the ground truth boxes and segmentation\nmasks. You also leveraged a Mask R-CNN model pre-trained on COCO\ntrain2017 in order to perform transfer learning on this new dataset.\n\nFor a more complete example, which includes multi-machine / multi-GPU\ntraining, check `references/detection/train.py`, which is present in the\ntorchvision repository.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }