{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Semi-Supervised Learning using USB built upon PyTorch\n=====================================================\n\n**Author**: [Hao Chen](https://github.com/Hhhhhhao)\n\nUnified Semi-supervised learning Benchmark (USB) is a semi-supervised\nlearning (SSL) framework built upon PyTorch. Based on Datasets and\nModules provided by PyTorch, USB becomes a flexible, modular, and\neasy-to-use framework for semi-supervised learning. It supports a\nvariety of semi-supervised learning algorithms, including `FixMatch`,\n`FreeMatch`, `DeFixMatch`, `SoftMatch`, and so on. It also supports a\nvariety of imbalanced semi-supervised learning algorithms. The benchmark\nresults across different datasets of computer vision, natural language\nprocessing, and speech processing are included in USB.\n\nThis tutorial will walk you through the basics of using the USB lighting\npackage. Let\\'s get started by training a `FreeMatch`/`SoftMatch` model\non CIFAR-10 using pretrained Vision Transformers (ViT)! And we will show\nit is easy to change the semi-supervised algorithm and train on\nimbalanced datasets.\n\n![](https://pytorch.org/tutorials/_static/img/usb_semisup_learn/code.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Introduction to `FreeMatch` and `SoftMatch` in Semi-Supervised Learning\n=======================================================================\n\nHere we provide a brief introduction to `FreeMatch` and `SoftMatch`.\nFirst, we introduce a famous baseline for semi-supervised learning\ncalled `FixMatch`. `FixMatch` is a very simple framework for\nsemi-supervised learning, where it utilizes a strong augmentation to\ngenerate pseudo labels for unlabeled data. It adopts a confidence\nthresholding strategy to filter out the low-confidence pseudo labels\nwith a fixed threshold set. `FreeMatch` and `SoftMatch` are two\nalgorithms that improve upon `FixMatch`. `FreeMatch` proposes adaptive\nthresholding strategy to replace the fixed thresholding strategy in\n`FixMatch`. The adaptive thresholding progressively increases the\nthreshold according to the learning status of the model on each class.\n`SoftMatch` absorbs the idea of confidence thresholding as an weighting\nmechanism. It proposes a Gaussian weighting mechanism to overcome the\nquantity-quality trade-off in pseudo-labels. In this tutorial, we will\nuse USB to train `FreeMatch` and `SoftMatch`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use USB to Train `FreeMatch`/`SoftMatch` on CIFAR-10 with only 40 labels\n========================================================================\n\nUSB is easy to use and extend, affordable to small groups, and\ncomprehensive for developing and evaluating SSL algorithms. USB provides\nthe implementation of 14 SSL algorithms based on Consistency\nRegularization, and 15 tasks for evaluation from CV, NLP, and Audio\ndomain. It has a modular design that allows users to easily extend the\npackage by adding new algorithms and tasks. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Use USB to Train `FreeMatch`/`SoftMatch` on CIFAR-10 with only 40 labels\n========================================================================\n\nUSB is easy to use and extend, affordable to small groups, and\ncomprehensive for developing and evaluating SSL algorithms. USB provides\nthe implementation of 14 SSL algorithms based on consistency\nregularization, and 15 evaluation tasks from the CV, NLP, and audio\ndomains. It has a modular design that allows users to easily extend the\npackage by adding new algorithms and tasks. It also offers a Python API\nfor easier adaptation to different SSL algorithms on new data.\n\nNow, let\\'s use USB to train `FreeMatch` and `SoftMatch` on CIFAR-10.\nFirst, we need to install the USB package `semilearn` and import the\nnecessary API functions from USB. If you are running this in Google\nColab, install `semilearn` by running: `!pip install semilearn`.\n\nBelow is a list of functions we will use from `semilearn`:\n\n- `get_dataset`: to load the dataset; here we use CIFAR-10\n- `get_data_loader`: to create the train (labeled and unlabeled) and test data loaders; the unlabeled train loader provides both strong and weak augmentations of the unlabeled data\n- `get_net_builder`: to create a model; here we use a pretrained ViT\n- `get_algorithm`: to create the semi-supervised learning algorithm; here we use `FreeMatch` and `SoftMatch`\n- `get_config`: to get the default configuration of the algorithm\n- `Trainer`: a Trainer class for training and evaluating the algorithm on the dataset\n\nNote that a CUDA-enabled backend is required for training with the\n`semilearn` package. See [Enabling CUDA in Google\nColab](https://pytorch.org/tutorials/beginner/colab#enabling-cuda) for\ninstructions on enabling CUDA in Google Colab.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import semilearn\nfrom semilearn import get_dataset, get_data_loader, get_net_builder, get_algorithm, get_config, Trainer" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "After importing the necessary functions, we first set the hyperparameters\nof the algorithm.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "config = {\n    'algorithm': 'freematch',\n    'net': 'vit_tiny_patch2_32',\n    'use_pretrain': True,\n    'pretrain_path': 'https://github.com/microsoft/Semi-supervised-learning/releases/download/v.0.0.0/vit_tiny_patch2_32_mlp_im_1k_32.pth',\n\n    # optimization configs\n    'epoch': 1,\n    'num_train_iter': 500,\n    'num_eval_iter': 500,\n    'num_log_iter': 50,\n    'optim': 'AdamW',\n    'lr': 5e-4,\n    'layer_decay': 0.5,\n    'batch_size': 16,\n    'eval_batch_size': 16,\n\n    # dataset configs\n    'dataset': 'cifar10',\n    'num_labels': 40,\n    'num_classes': 10,\n    'img_size': 32,\n    'crop_ratio': 0.875,\n    'data_dir': './data',\n    'ulb_samples_per_class': None,\n\n    # algorithm specific configs\n    'hard_label': True,\n    'T': 0.5,\n    'ema_p': 0.999,\n    'ent_loss_ratio': 0.001,\n    'uratio': 2,\n    'ulb_loss_ratio': 1.0,\n\n    # device configs\n    'gpu': 0,\n    'world_size': 1,\n    'distributed': False,\n    'num_workers': 4,\n}\nconfig = get_config(config)" ] },
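{ "cell_type": "markdown", "metadata": {}, "source": [ "`get_config` merges the options above with the algorithm\\'s default\nconfiguration for everything we did not specify. As an optional sanity\ncheck, the next cell is a small sketch that prints a few of the resolved\nvalues; it assumes the returned config object exposes the options as\nattributes, which is the same access pattern used in the cells below.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Optional sanity check: inspect a few resolved options on the merged config.\n# Assumes attribute-style access, as used by the data loading cells below.\nprint('algorithm :', config.algorithm)\nprint('dataset   :', config.dataset, 'with', config.num_labels, 'labels')\nprint('batch size:', config.batch_size, '| unlabeled ratio:', config.uratio)" ] },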
{ "cell_type": "markdown", "metadata": {}, "source": [ "Then, we load the dataset, create the data loaders for training and\ntesting, and specify the model and algorithm to use.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dataset_dict = get_dataset(config, config.algorithm, config.dataset, config.num_labels, config.num_classes, data_dir=config.data_dir, include_lb_to_ulb=config.include_lb_to_ulb)\ntrain_lb_loader = get_data_loader(config, dataset_dict['train_lb'], config.batch_size)\ntrain_ulb_loader = get_data_loader(config, dataset_dict['train_ulb'], int(config.batch_size * config.uratio))\neval_loader = get_data_loader(config, dataset_dict['eval'], config.eval_batch_size)\nalgorithm = get_algorithm(config, get_net_builder(config.net, from_name=False), tb_log=None, logger=None)" ] },
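{ "cell_type": "markdown", "metadata": {}, "source": [ "As another optional sanity check, the next cell is a small sketch that\ninspects the splits returned by `get_dataset`. It assumes `dataset_dict`\nis a plain dictionary and that its entries are standard PyTorch-style\ndatasets implementing `__len__`, in line with USB being built upon\nPyTorch Datasets.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Optional sanity check on the splits returned by get_dataset.\n# Assumes the entries are standard PyTorch-style datasets with __len__.\nprint('splits:', list(dataset_dict.keys()))\nprint('labeled train samples  :', len(dataset_dict['train_lb']))\nprint('unlabeled train samples:', len(dataset_dict['train_ulb']))\nprint('evaluation samples     :', len(dataset_dict['eval']))" ] },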
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can start training the algorithm on CIFAR-10 with 40 labels now. We\ntrain for 500 iterations and evaluate every 500 iterations.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = Trainer(config, algorithm)\ntrainer.fit(train_lb_loader, train_ulb_loader, eval_loader)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let\\'s evaluate the trained model on the validation set. After\ntraining for 500 iterations with `FreeMatch` on only 40 labels of\nCIFAR-10, we obtain a classifier that achieves around 87% accuracy on the\nvalidation set.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer.evaluate(eval_loader)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Use USB to Train `SoftMatch` with a specific imbalanced algorithm on imbalanced CIFAR-10\n========================================================================================\n\nNow let\\'s say we have an imbalanced labeled set and an imbalanced\nunlabeled set of CIFAR-10, and we want to train a `SoftMatch` model on\nthem. We create the imbalanced labeled and unlabeled sets of CIFAR-10 by\nsetting `lb_imb_ratio` and `ulb_imb_ratio` to 10, and by specifying the\nnumber of labeled and unlabeled samples with `num_labels` and\n`ulb_num_labels`. We also replace the `algorithm` with `softmatch`.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "config = {\n    'algorithm': 'softmatch',\n    'net': 'vit_tiny_patch2_32',\n    'use_pretrain': True,\n    'pretrain_path': 'https://github.com/microsoft/Semi-supervised-learning/releases/download/v.0.0.0/vit_tiny_patch2_32_mlp_im_1k_32.pth',\n\n    # optimization configs\n    'epoch': 1,\n    'num_train_iter': 500,\n    'num_eval_iter': 500,\n    'num_log_iter': 50,\n    'optim': 'AdamW',\n    'lr': 5e-4,\n    'layer_decay': 0.5,\n    'batch_size': 16,\n    'eval_batch_size': 16,\n\n    # dataset configs\n    'dataset': 'cifar10',\n    'num_labels': 1500,\n    'num_classes': 10,\n    'img_size': 32,\n    'crop_ratio': 0.875,\n    'data_dir': './data',\n    'ulb_samples_per_class': None,\n    'lb_imb_ratio': 10,\n    'ulb_imb_ratio': 10,\n    'ulb_num_labels': 3000,\n\n    # algorithm specific configs\n    'hard_label': True,\n    'T': 0.5,\n    'ema_p': 0.999,\n    'ent_loss_ratio': 0.001,\n    'uratio': 2,\n    'ulb_loss_ratio': 1.0,\n\n    # device configs\n    'gpu': 0,\n    'world_size': 1,\n    'distributed': False,\n    'num_workers': 4,\n}\nconfig = get_config(config)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Then, we re-load the dataset, re-create the data loaders for training and\ntesting, and specify the model and algorithm to use.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dataset_dict = get_dataset(config, config.algorithm, config.dataset, config.num_labels, config.num_classes, data_dir=config.data_dir, include_lb_to_ulb=config.include_lb_to_ulb)\ntrain_lb_loader = get_data_loader(config, dataset_dict['train_lb'], config.batch_size)\ntrain_ulb_loader = get_data_loader(config, dataset_dict['train_ulb'], int(config.batch_size * config.uratio))\neval_loader = get_data_loader(config, dataset_dict['eval'], config.eval_batch_size)\nalgorithm = get_algorithm(config, get_net_builder(config.net, from_name=False), tb_log=None, logger=None)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can now train the algorithm on the imbalanced CIFAR-10 dataset. We\ntrain for 500 iterations and evaluate every 500 iterations.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = Trainer(config, algorithm)\ntrainer.fit(train_lb_loader, train_ulb_loader, eval_loader)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let\\'s evaluate the trained model on the validation set.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer.evaluate(eval_loader)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "References:\n\n- \\[1\\] USB: <https://github.com/microsoft/Semi-supervised-learning>\n- \\[2\\] Kihyuk Sohn et al. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence\n- \\[3\\] Yidong Wang et al. FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning\n- \\[4\\] Hao Chen et al. SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }