Inductor C++ Wrapper Tutorial

Authors: Chunyuan Wu, Bin Bao, Jiong Gong

Introduction

Python, as the primary interface of PyTorch, is easy to use and efficient for development and debugging. Inductor's default wrapper generates Python code to invoke generated kernels and external kernels. However, for deployments that require high performance, Python, as an interpreted language, runs relatively slowly compared to compiled languages.

We implemented an Inductor C++ wrapper by leveraging the PyTorch C++ APIs to generate pure C++ code that combines the generated and external kernels. This allows for the execution of each captured Dynamo graph in pure C++, thereby reducing the Python overhead within the graph.

Enabling the API

This feature is still in the prototype stage. To activate it, add the following to your code:

import torch._inductor.config as config
config.cpp_wrapper = True

This will speed up your models by reducing the Python overhead of the Inductor wrapper.
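Note that config.cpp_wrapper must be set before torch.compile is called on the model. Inductor config options can typically also be passed through the options argument of torch.compile; the following is a small sketch of that alternative, assuming the option key matches the Inductor config name:

import torch

def fn(x):
    return x + 1

# Alternative sketch: pass the Inductor config key via the torch.compile options dict
opt_fn = torch.compile(fn, options={"cpp_wrapper": True})
print(opt_fn(torch.randn(1)))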

Example code

We will use the frontend code below as an example:

import torch

def fn(x):
    return torch.tensor(list(range(2, 40, 2)), device=x.device) + x

x = torch.randn(1)
opt_fn = torch.compile()(fn)
y = opt_fn(x)
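To get a rough feel for the overhead difference, the two wrappers can be compared with a simple timing loop. The snippet below is only an illustrative sketch (absolute numbers depend on the model, the build, and the hardware, and a tiny function like this mostly measures per-call overhead); it reuses the example above and calls torch._dynamo.reset() so the function is recompiled after the config change:

import time

import torch
import torch._dynamo
import torch._inductor.config as config

def fn(x):
    return torch.tensor(list(range(2, 40, 2)), device=x.device) + x

x = torch.randn(1)

def bench(compiled_fn, iters=1000):
    compiled_fn(x)  # warm-up call triggers compilation
    start = time.perf_counter()
    for _ in range(iters):
        compiled_fn(x)
    return time.perf_counter() - start

# Default Python wrapper
config.cpp_wrapper = False
python_time = bench(torch.compile(fn))

# C++ wrapper; reset Dynamo so the new config takes effect on recompilation
torch._dynamo.reset()
config.cpp_wrapper = True
cpp_time = bench(torch.compile(fn))

print(f"Python wrapper: {python_time:.4f}s, C++ wrapper: {cpp_time:.4f}s")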

For CPU

The main part of the Inductor-generated code with the default Python wrapper will look like this:

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (1, ), (1, ))
    buf0 = empty_strided((19, ), (1, ), device='cpu', dtype=torch.float32)
    cpp_fused_add_lift_fresh_0(c_void_p(constant0.data_ptr()), c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )

With the C++ wrapper turned on, the generated call function becomes a C++ function, inductor_entry_cpp, in the compiled C++ extension module:

std::vector<at::Tensor> inductor_entry_cpp(const std::vector<at::Tensor>& args) {
    at::Tensor arg0_1 = args[0];
    at::Tensor constant0 = args[1];
    auto buf0 = at::empty_strided({19L, }, {1L, }, at::device(at::kCPU).dtype(at::kFloat));
    cpp_fused_add_lift_fresh_0((long*)(constant0.data_ptr()), (float*)(arg0_1.data_ptr()), (float*)(buf0.data_ptr()));
    arg0_1.reset();
    return {buf0};
}

module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'c2buojsvlqbywxe3itb43hldieh4jqulk72iswa2awalwev7hjn2', False)

def _wrap_func(f):
    def g(args):
        args_tensor = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
        constants_tensor = [constant0]
        args_tensor.extend(constants_tensor)

        return f(args_tensor)
    return g
call = _wrap_func(module.inductor_entry_cpp)
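To inspect the wrapper code that Inductor generates for your own model, one option is to enable output-code logging; the sketch below assumes the torch._logging artifact logging available in recent PyTorch versions (the generated Python or C++ wrapper code is then printed to the log):

import torch
import torch._inductor.config as config

# Ask the compile stack to log the code it generates
torch._logging.set_logs(output_code=True)

config.cpp_wrapper = True

def fn(x):
    return torch.tensor(list(range(2, 40, 2)), device=x.device) + x

opt_fn = torch.compile(fn)
opt_fn(torch.randn(1))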

For GPU
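The GPU path assumes the input tensor already lives on a CUDA device when the compiled function is called; a minimal variation of the frontend example above:

x = torch.randn(1, device="cuda")
opt_fn = torch.compile()(fn)
y = opt_fn(x)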

Based on the same example code, the generated code for GPU will look like this:

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (1, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0) # no-op to ensure context
        buf0 = empty_strided((19, ), (1, ), device='cuda', dtype=torch.float32)
        # Source Nodes: [add, tensor], Original ATen: [aten.add, aten.lift_fresh]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_lift_fresh_0.run(constant0, arg0_1, buf0, 19, grid=grid(19), stream=stream0)
        run_intermediate_hooks('add', buf0)
        del arg0_1
        return (buf0, )

With the C++ wrapper turned on, the equivalent C++ code below will be generated:

std::vector<at::Tensor> inductor_entry_cpp(const std::vector<at::Tensor>& args) {
    at::Tensor arg0_1 = args[0];
    at::Tensor constant0 = args[1];

    at::cuda::CUDAGuard device_guard(0);
    auto buf0 = at::empty_strided({19L, }, {1L, }, at::TensorOptions(c10::Device(at::kCUDA, 0)).dtype(at::kFloat));
    // Source Nodes: [add, tensor], Original ATen: [aten.add, aten.lift_fresh]
    if (triton_poi_fused_add_lift_fresh_0 == nullptr) {
        triton_poi_fused_add_lift_fresh_0 = loadKernel("/tmp/torchinductor_user/mm/cmm6xjgijjffxjku4akv55eyzibirvw6bti6uqmfnruujm5cvvmw.cubin", "triton_poi_fused_add_lift_fresh_0_0d1d2d3");
    }
    CUdeviceptr var_0 = reinterpret_cast<CUdeviceptr>(constant0.data_ptr());
    CUdeviceptr var_1 = reinterpret_cast<CUdeviceptr>(arg0_1.data_ptr());
    CUdeviceptr var_2 = reinterpret_cast<CUdeviceptr>(buf0.data_ptr());
    auto var_3 = 19;
    void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
    cudaStream_t stream0 = at::cuda::getCurrentCUDAStream(0);
    launchKernel(triton_poi_fused_add_lift_fresh_0, 1, 1, 1, 1, 0, kernel_args_var_0, stream0);
    arg0_1.reset();
    return {buf0};
}

module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'czbpeilh4qqmbyejdgsbpdfuk2ss5jigl2qjb7xs4gearrjvuwem', True)

def _wrap_func(f):
    def g(args):
        args_tensor = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
        constants_tensor = [constant0]
        args_tensor.extend(constants_tensor)

        return f(args_tensor)
    return g
call = _wrap_func(module.inductor_entry_cpp)
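Whichever wrapper is used, the compiled function should match eager execution numerically; below is a quick sanity check reusing the example fn (shown for the GPU case, but the same check works on CPU):

import torch
import torch._inductor.config as config

config.cpp_wrapper = True

def fn(x):
    return torch.tensor(list(range(2, 40, 2)), device=x.device) + x

x = torch.randn(1, device="cuda")
opt_fn = torch.compile(fn)

# Compare the compiled (C++ wrapper) output against eager execution
torch.testing.assert_close(opt_fn(x), fn(x))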

Conclusion

In this tutorial, we introduced a new C++ wrapper in TorchInductor that speeds up your models with just two lines of code changes. We explained the motivation for this new feature and walked through the easy-to-use API that activates it. Furthermore, we demonstrated the Inductor-generated code with the default Python wrapper and with the new C++ wrapper on both CPU and GPU to visually showcase the difference between the two wrappers.

This feature is still in the prototype stage. If you have any feature requests or run into any issues, please file a bug report on GitHub issues.

