
Ease-of-use quantization for PyTorch with Intel® Neural Compressor

Overview

Most deep learning applications use 32-bit floating-point precision for inference. However, low-precision data types, especially int8, are attracting more attention because of the significant performance boost they offer. A key concern when adopting low precision is how to easily mitigate the possible accuracy loss and meet a predefined accuracy requirement.

Intel® Neural Compressor aims to address this concern by extending PyTorch with accuracy-driven automatic tuning strategies that help users quickly find the best quantized model on Intel hardware, including Intel Deep Learning Boost (Intel DL Boost) and Intel Advanced Matrix Extensions (Intel AMX).

Intel® Neural Compressor has been released as an open-source project on GitHub.

Features

  • Ease-of-use Python API: Intel® Neural Compressor provides simple frontend Python APIs and utilities for users to do neural network compression with only a few lines of code changes. Typically, only 5 to 6 clauses need to be added to the original code.

  • Quantization: Intel® Neural Compressor supports an accuracy-driven automatic tuning process for post-training static quantization, post-training dynamic quantization, and quantization-aware training on PyTorch FX graph mode and eager mode.

This tutorial mainly focuses on the quantization part. For how to use Intel® Neural Compressor to do pruning and distillation, please refer to the corresponding documents in the Intel® Neural Compressor GitHub repo.

Getting Started

Installation

# install stable version from pip
pip install neural-compressor

# install nightly version from pip
pip install -i https://test.pypi.org/simple/ neural-compressor

# install stable version from conda
conda install neural-compressor -c conda-forge -c intel

Supported Python versions are 3.6, 3.7, 3.8, and 3.9.

Usages

Minor code changes are required for users to get started with the Intel® Neural Compressor quantization API. Both PyTorch FX graph mode and eager mode are supported.

Intel® Neural Compressor takes an FP32 model and a yaml configuration file as inputs. To construct the quantization process, users can specify the settings below either via the yaml configuration file or via Python APIs:

  1. Calibration Dataloader (Needed for static quantization)

  2. Evaluation Dataloader

  3. Evaluation Metric

Intel® Neural Compressor supports some popular dataloaders and evaluation metrics. For how to configure them in the yaml configuration file, users can refer to Built-in Datasets.
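
For reference, a built-in dataloader and metric can also be declared entirely in the yaml file. The sketch below is only illustrative; the ImageFolder dataset key, its fields, and the paths are assumptions and should be checked against the Built-in Datasets documentation:

# conf.yaml (illustrative sketch only)
quantization:
    calibration:
        dataloader:
            batch_size: 1
            dataset:
                ImageFolder:
                    root: /path/to/calibration/dataset

evaluation:
    accuracy:
        metric:
            topk: 1
        dataloader:
            batch_size: 1
            dataset:
                ImageFolder:
                    root: /path/to/evaluation/dataset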

If users want to use a self-developed dataloader or evaluation metric, Intel® Neural Compressor supports this through the registration of a customized dataloader/metric in Python code.
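
As a rough sketch of the dataloader case (the MyDataset class and the random samples below are hypothetical, reusing the common.DataLoader pattern that appears later in this tutorial), a self-developed dataset that yields (input, label) pairs can be wrapped and registered on the quantizer:

# sketch: registering a customized calibration dataloader
import torch
from neural_compressor.experimental import Quantization, common

class MyDataset(object):
    """Minimal dataset yielding (input, label) pairs."""
    def __init__(self, samples):
        self.samples = samples                      # list of (tensor, int) tuples
    def __getitem__(self, index):
        data, label = self.samples[index]
        return data, label
    def __len__(self):
        return len(self.samples)

samples = [(torch.randn(1, 28, 28), 0) for _ in range(10)]   # placeholder calibration data
quantizer = Quantization("./conf.yaml")
quantizer.calib_dataloader = common.DataLoader(MyDataset(samples))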

For the yaml configuration file format, please refer to the yaml template.

The code changes required for Intel® Neural Compressor are highlighted with a comment on the line above them.

Model

In this tutorial, the LeNet model is used to demonstrate how to use Intel® Neural Compressor.

# main.py
import torch
import torch.nn as nn
import torch.nn.functional as F

# LeNet Model definition
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc1_drop = nn.Dropout()
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.reshape(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc1_drop(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net()
model.load_state_dict(torch.load('./lenet_mnist_model.pth'))

The pretrained model weight lenet_mnist_model.pth comes from here.

Accuracy driven quantization

Intel® Neural Compressor supports accuracy-driven automatic tuning to generate the optimal int8 model which meets a predefined accuracy goal.

Below is an example of how to quantize a simple network on PyTorch FX graph mode by auto-tuning.

# conf.yaml
model:
    name: LeNet
    framework: pytorch_fx

evaluation:
    accuracy:
        metric:
            topk: 1

tuning:
    accuracy_criterion:
        relative: 0.01

# main.py
model.eval()

from torchvision import datasets, transforms
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                   ])),
    batch_size=1)

# launch code for Intel® Neural Compressor
from neural_compressor.experimental import Quantization
quantizer = Quantization("./conf.yaml")
quantizer.model = model
quantizer.calib_dataloader = test_loader
quantizer.eval_dataloader = test_loader
q_model = quantizer()
q_model.save('./output')

In the conf.yaml file, the built-in metric top1 of Intel® Neural Compressor is specified as the evaluation method, and a 1% relative accuracy loss is set as the accuracy target for auto-tuning. Intel® Neural Compressor will traverse all possible quantization config combinations at the per-op level to find the optimal int8 model that reaches the predefined accuracy target.

Besides those built-in metrics, Intel® Neural Compressor also supports customized metrics through Python code:

# conf.yaml
model:
    name: LeNet
    framework: pytorch_fx

tuning:
    accuracy_criterion:
        relative: 0.01

# main.py
model.eval()

from torchvision import datasets, transforms
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                   ])),
    batch_size=1)

# define a customized metric
class Top1Metric(object):
    def __init__(self):
        self.correct = 0
    def update(self, output, label):
        pred = output.argmax(dim=1, keepdim=True)
        self.correct += pred.eq(label.view_as(pred)).sum().item()
    def reset(self):
        self.correct = 0
    def result(self):
        return 100. * self.correct / len(test_loader.dataset)

# launch code for Intel® Neural Compressor
from neural_compressor.experimental import Quantization
quantizer = Quantization("./conf.yaml")
quantizer.model = model
quantizer.calib_dataloader = test_loader
quantizer.eval_dataloader = test_loader
quantizer.metric = Top1Metric()
q_model = quantizer()
q_model.save('./output')

In the above example, a class that contains the update() and result() functions is implemented to record the per-mini-batch results and calculate the final accuracy at the end.

Quantization aware training

Besides post-training static quantization and post-training dynamic quantization, Intel® Neural Compressor supports quantization-aware training with an accuracy-driven automatic tuning mechanism.

Below is an example of how to do quantization aware training on a simple network on PyTorch FX graph mode.

# conf.yaml
model:
    name: LeNet
    framework: pytorch_fx

quantization:
    approach: quant_aware_training

evaluation:
    accuracy:
        metric:
            topk: 1

tuning:
    accuracy_criterion:
        relative: 0.01

# main.py
model.eval()

from torchvision import datasets, transforms
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=1)

import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.0001, momentum=0.1)

def training_func(model):
    model.train()
    for epoch in range(1, 3):
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                  epoch, batch_idx * len(data), len(train_loader.dataset),
                  100. * batch_idx / len(train_loader), loss.item()))

# launch code for Intel® Neural Compressor
from neural_compressor.experimental import Quantization
quantizer = Quantization("./conf.yaml")
quantizer.model = model
quantizer.q_func = training_func
quantizer.eval_dataloader = test_loader
q_model = quantizer()
q_model.save('./output')
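
Post-training dynamic quantization, mentioned above as another supported approach, can be selected through the same yaml schema. The snippet below is a minimal sketch: it assumes the approach value post_training_dynamic_quant used in Intel® Neural Compressor yaml configurations, and it reuses the launch code pattern from the earlier examples (dynamic quantization does not need a calibration dataloader):

# conf.yaml
model:
    name: LeNet
    framework: pytorch_fx

quantization:
    approach: post_training_dynamic_quant

evaluation:
    accuracy:
        metric:
            topk: 1

tuning:
    accuracy_criterion:
        relative: 0.01

# main.py
model.eval()

# launch code for Intel® Neural Compressor
from neural_compressor.experimental import Quantization
quantizer = Quantization("./conf.yaml")
quantizer.model = model
quantizer.eval_dataloader = test_loader
q_model = quantizer()
q_model.save('./output')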

Performance only quantization

Intel® Neural Compressor supports directly yielding an int8 model with a dummy dataset for performance benchmarking purposes.

Below is an example of how to quantize a simple network on PyTorch FX graph mode with a dummy dataset.

# conf.yaml
model:
    name: lenet
    framework: pytorch_fx

# main.py
model.eval()

# launch code for Intel® Neural Compressor
from neural_compressor.experimental import Quantization, common
from neural_compressor.experimental.data.datasets.dummy_dataset import DummyDataset
quantizer = Quantization("./conf.yaml")
quantizer.model = model
quantizer.calib_dataloader = common.DataLoader(DummyDataset([(1, 1, 28, 28)]))
q_model = quantizer()
q_model.save('./output')
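
Because this flow targets performance benchmarking, a rough latency comparison can be run on the saved int8 model. The sketch below uses plain PyTorch timing rather than any Intel® Neural Compressor benchmarking utility, and it loads the int8 model with the load helper shown in the Deployment section below:

# sketch: rough latency comparison between the FP32 and int8 models
import time
import torch
from neural_compressor.utils.pytorch import load

int8_model = load('./output', model)        # rebuild the quantized model from ./output
dummy_input = torch.randn(1, 1, 28, 28)

def measure(m, iterations=100):
    m.eval()
    with torch.no_grad():
        for _ in range(10):                 # warm-up runs
            m(dummy_input)
        start = time.time()
        for _ in range(iterations):
            m(dummy_input)
    return (time.time() - start) / iterations

print('FP32 latency per image: {:.6f}s'.format(measure(model)))
print('INT8 latency per image: {:.6f}s'.format(measure(int8_model)))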

Quantization outputs

Users can see how many ops get quantized from the log printed by Intel® Neural Compressor, like below:

2021-12-08 14:58:35 [INFO] |********Mixed Precision Statistics*******|
2021-12-08 14:58:35 [INFO] +------------------------+--------+-------+
2021-12-08 14:58:35 [INFO] |        Op Type         | Total  |  INT8 |
2021-12-08 14:58:35 [INFO] +------------------------+--------+-------+
2021-12-08 14:58:35 [INFO] |  quantize_per_tensor   |   2    |   2   |
2021-12-08 14:58:35 [INFO] |         Conv2d         |   2    |   2   |
2021-12-08 14:58:35 [INFO] |       max_pool2d       |   1    |   1   |
2021-12-08 14:58:35 [INFO] |          relu          |   1    |   1   |
2021-12-08 14:58:35 [INFO] |       dequantize       |   2    |   2   |
2021-12-08 14:58:35 [INFO] |       LinearReLU       |   1    |   1   |
2021-12-08 14:58:35 [INFO] |         Linear         |   1    |   1   |
2021-12-08 14:58:35 [INFO] +------------------------+--------+-------+

The quantized model will be generated under the ./output directory, which contains two files:

  1. best_configure.yaml

  2. best_model_weights.pt

The first file contains the quantization configuration of each op; the second file contains the int8 weights and the zero point and scale information of the activations.

Deployment

Users can use the code below to load the quantized model and then run inference or a performance benchmark.

from neural_compressor.utils.pytorch import load
int8_model = load('./output', model)
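
The loaded int8 model can then be evaluated in the same way as the original FP32 model. The sketch below assumes the MNIST test_loader defined earlier in this tutorial:

# sketch: run inference with the loaded int8 model and report top-1 accuracy
import torch

int8_model.eval()
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        output = int8_model(data)
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()
print('int8 model accuracy: {:.2f}%'.format(100. * correct / len(test_loader.dataset)))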

Tutorials

Please visit the Intel® Neural Compressor GitHub repo for more tutorials.

