nvidia-cusparselt-cu13

NVIDIA cuSPARSELt

NVIDIA Proprietary Software, 5 versions
NVIDIA Corporation <cuda_installer@nvidia.com>
Installation
pip install nvidia-cusparselt-cu13
poetry add nvidia-cusparselt-cu13
pipenv install nvidia-cusparselt-cu13
conda install nvidia-cusparselt-cu13
Description

###################################################################################
cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication
###################################################################################

NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a structured sparse matrix with a 50% sparsity ratio:

.. math::

   D = Activation(\alpha op(A) \cdot op(B) + \beta op(C) + bias)

where :math:`op(A)` and :math:`op(B)` refer to in-place operations such as transpose/non-transpose, and :math:`\alpha` and :math:`\beta` are scalars or vectors.

The cuSPARSELt APIs allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.
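
To make the formula above concrete, the following pure-Python sketch evaluates it on small dense lists. It is only a reference model and not the cuSPARSELt API: the names ``transpose``, ``relu``, and ``matmul_epilogue`` are hypothetical, and the sketch assumes scalar alpha and beta, a ReLU activation, a per-row bias, and C used as given (no transpose).

```python
def transpose(m):
    """Return the transpose of a row-major list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def relu(x):
    """Assumed activation for this sketch: max(x, 0)."""
    return x if x > 0 else 0.0


def matmul_epilogue(a, b, c, alpha=1.0, beta=0.0, bias=None,
                    trans_a=False, trans_b=False, activation=relu):
    """Reference model of D = activation(alpha*op(A)*op(B) + beta*C + bias)."""
    op_a = transpose(a) if trans_a else a
    op_b = transpose(b) if trans_b else b
    rows, inner, cols = len(op_a), len(op_b), len(op_b[0])
    assert len(op_a[0]) == inner, "inner dimensions must match"
    if bias is None:
        bias = [0.0] * rows  # one bias value per output row (assumption)
    d = []
    for i in range(rows):
        row = []
        for j in range(cols):
            acc = sum(op_a[i][k] * op_b[k][j] for k in range(inner))
            row.append(activation(alpha * acc + beta * c[i][j] + bias[i]))
        d.append(row)
    return d


# A is 2x4 with two zeros in each group of four values (50% sparsity).
A = [[1.0, 0.0, 2.0, 0.0],
     [0.0, -3.0, 0.0, 4.0]]
B = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0],
     [7.0, 8.0]]
C = [[1.0, 1.0],
     [1.0, 1.0]]
D = matmul_epilogue(A, B, C, alpha=2.0, beta=1.0, bias=[0.5, -100.0])
print(D)  # [[23.5, 29.5], [0.0, 0.0]]
```

The second output row is clamped to zero by the assumed ReLU activation, because the large negative bias pushes both entries below zero.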

Download: `developer.nvidia.com/cusparselt/downloads <https://developer.nvidia.com/cusparselt/downloads>`_

Provide Feedback: `Math-Libs-Feedback@nvidia.com <mailto:Math-Libs-Feedback@nvidia.com?subject=cuSPARSELt-Feedback>`_

Examples: `cuSPARSELt Example 1 <https://github.com/NVIDIA/CUDALibrarySamples/tree/main/cuSPARSELt/matmul>`_, `cuSPARSELt Example 2 <https://github.com/NVIDIA/CUDALibrarySamples/tree/main/cuSPARSELt/matmul_advanced>`_

Blog posts:

  • `Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt <https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/>`_
  • `Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines <https://developer.nvidia.com/blog/structured-sparsity-in-the-nvidia-ampere-architecture-and-applications-in-search-engines/>`__
  • `Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture <https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31552/>`__

================================================================================
Key Features
================================================================================

  • NVIDIA Sparse MMA tensor core support

  • Mixed-precision computation support:

    +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
    | Input A/B    | Input C        | Output D        | Compute     | Block scaled                    | Supported SM arch  |
    +==============+================+=================+=============+=================================+====================+
    | FP32         | FP32           | FP32            | FP32        | No                              | 8.0, 8.6, 8.7,     |
    +--------------+----------------+-----------------+-------------+                                 + 9.0, 10.0, 10.3,   +
    | BF16         | BF16           | BF16            | FP32        |                                 | 11.0, 12.0, 12.1   |
    +--------------+----------------+-----------------+-------------+                                 +                    +
    | FP16         | FP16           | FP16            | FP32        |                                 |                    |
    +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
    | FP16         | FP16           | FP16            | FP16        | No                              | 9.0                |
    +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
    | INT8         | INT8           | INT8            | INT32       | No                              | 8.0, 8.6, 8.7,     |
    +              +----------------+-----------------+             +                                 + 9.0, 10.0, 11.0,   +
    |              | INT32          | INT32           |             |                                 | 12.0, 12.1         |
    +              +----------------+-----------------+             +                                 +                    +
    |              | FP16           | FP16            |             |                                 |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | BF16           | BF16            |             |                                 |                    |
    +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
    | E4M3         | FP16           | E4M3            | FP32        | No                              | 9.0, 10.0, 10.3,   |
    +              +----------------+-----------------+             +                                 + 11.0, 12.0, 12.1   +
    |              | BF16           | E4M3            |             |                                 |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | FP16           | FP16            |             |                                 |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | BF16           | BF16            |             |                                 |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | FP32           | FP32            |             |                                 |                    |
    +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
    | E5M2         | FP16           | E5M2            | FP32        | No                              | 9.0, 10.0, 10.3,   |
    +              +----------------+-----------------+             +                                 + 11.0, 12.0, 12.1   +
    |              | BF16           | E5M2            |             |                                 |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | FP16           | FP16            |             |                                 |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | BF16           | BF16            |             |                                 |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | FP32           | FP32            |             |                                 |                    |
    +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
    | E4M3         | FP16           | E4M3            | FP32        | A/B/D_OUT_SCALE = VEC64_UE8M0   | 10.0, 10.3, 11.0,  |
    +              +----------------+-----------------+             +                                 + 12.0, 12.1         +
    |              | BF16           | E4M3            |             | D_SCALE = 32F                   |                    |
    +              +----------------+-----------------+             +---------------------------------+                    +
    |              | FP16           | FP16            |             | A/B_SCALE = VEC64_UE8M0         |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | BF16           | BF16            |             |                                 |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | FP32           | FP32            |             |                                 |                    |
    +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
    | E2M1         | FP16           | E2M1            | FP32        | A/B/D_SCALE = VEC32_UE4M3       | 10.0, 10.3, 11.0,  |
    +              +----------------+-----------------+             +                                 + 12.0, 12.1         +
    |              | BF16           | E2M1            |             | D_SCALE = 32F                   |                    |
    +              +----------------+-----------------+             +---------------------------------+                    +
    |              | FP16           | FP16            |             | A/B_SCALE = VEC32_UE4M3         |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | BF16           | BF16            |             |                                 |                    |
    +              +----------------+-----------------+             +                                 +                    +
    |              | FP32           | FP32            |             |                                 |                    |
    +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+

  • Matrix pruning and compression functionalities

  • Activation functions, bias vector, and output scaling

  • Batched computation (multiple matrices in a single run)

  • GEMM Split-K mode

  • Auto-tuning functionality (see ``cusparseLtMatmulSearch()``)

  • NVTX ranging and logging functionalities
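
The pruning and compression features listed above can be pictured with a small pure-Python model. It assumes the 2:4 pattern discussed in the NVIDIA Ampere structured-sparsity material (keep the two largest-magnitude values in every group of four). The function names ``prune_2_4``, ``compress_2_4``, and ``decompress_2_4`` are hypothetical, and the value/position layout is an illustration only, not the library's API or its internal compressed format.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in each group of four."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes; ties keep the earlier index.
        keep = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def compress_2_4(pruned_row):
    """Split a 2:4-pruned row into kept values and their in-group positions."""
    values, positions = [], []
    for start in range(0, len(pruned_row), 4):
        group = pruned_row[start:start + 4]
        nonzero = [i for i, v in enumerate(group) if v != 0.0]
        assert len(nonzero) <= 2, "row does not satisfy the 2:4 pattern"
        # Pad with unused positions so every group stores exactly two entries.
        padding = [i for i in range(4) if i not in nonzero]
        kept = sorted(nonzero + padding[:2 - len(nonzero)])
        values.extend(group[i] for i in kept)
        positions.extend(kept)
    return values, positions


def decompress_2_4(values, positions):
    """Rebuild the dense (pruned) row from the compressed representation."""
    row = []
    for g in range(0, len(values), 2):
        group = [0.0] * 4
        for v, p in zip(values[g:g + 2], positions[g:g + 2]):
            group[p] = v
        row.extend(group)
    return row


dense = [0.1, -2.0, 0.3, 4.0, 5.0, 0.0, -0.5, 0.2]
pruned = prune_2_4(dense)
values, positions = compress_2_4(pruned)
print(pruned)     # [0.0, -2.0, 0.0, 4.0, 5.0, 0.0, -0.5, 0.0]
print(values)     # [-2.0, 4.0, 5.0, -0.5]
print(positions)  # [1, 3, 0, 2]
```

The compressed value list is half the length of the dense row, which is where the 50% sparsity ratio pays off in storage. The positions play the role of the metadata needed to place each kept value back into its group.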

================================================================================
Support
================================================================================

  • Supported SM Architectures: SM 8.0, SM 8.6, SM 8.7, SM 8.9, SM 9.0, SM 10.0, SM 10.3, SM 11.0, SM 12.0, SM 12.1
  • Supported CPU architectures and operating systems:

    +------------+--------------------+
    | OS         | CPU archs          |
    +============+====================+
    | Windows    | x86_64             |
    +------------+--------------------+
    | Linux      | x86_64, Arm64      |
    +------------+--------------------+

================================================================================
Documentation
================================================================================

Please refer to https://docs.nvidia.com/cuda/cusparselt/index.html for the cuSPARSELt documentation.

================================================================================
Installation
================================================================================

The cuSPARSELt wheel can be installed as follows:

.. code-block:: bash

   pip install nvidia-cusparselt-cuXX

where ``XX`` is the CUDA major version (for example, ``nvidia-cusparselt-cu13`` for CUDA 13).
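
As a quick check after installation, the sketch below builds the wheel name from a CUDA major version and asks the Python packaging metadata whether that wheel is present. It uses only the standard-library ``importlib.metadata`` module. The helper names ``wheel_name`` and ``installed_version`` are hypothetical, and the script makes no assumption about where the wheel places the shared library.

```python
from importlib import metadata


def wheel_name(cuda_major):
    """Build the wheel name for a CUDA major version, e.g. 13 -> ...-cu13."""
    return f"nvidia-cusparselt-cu{int(cuda_major)}"


def installed_version(cuda_major):
    """Return the installed wheel version, or None if it is not installed."""
    try:
        return metadata.version(wheel_name(cuda_major))
    except metadata.PackageNotFoundError:
        return None


name = wheel_name(13)
version = installed_version(13)
print(f"{name}: {version if version else 'not installed'}")
```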