SYCL: Integrated Compiler Runtime for Accelerated Deep Learning

Language: English

LLMs and generative models have become the mainstream deep learning architectures across industries, and the customized optimizations they need have driven a lot of development in deep learning compilers. However, most frameworks that support exascale model training and fine-tuning, such as PyTorch or JAX, carry extensive device-specific compiler and runtime code that performs well on only one type of hardware. To democratize deep learning models and benchmark them across different devices, we need a device-agnostic compiler backend that can run on Nvidia, AMD, or Intel hardware (x86 CPUs and other ISAs, or GPUs supported by LLVM/clang). This talk focuses on how to create such backends using SYCL (a Khronos standard) and how to add platform-specific optimizations.

The talk covers three main topics:

Understanding Intel's LLVM-based SYCL compiler and the optimizations it applies when generating code for the different ISAs of standard CPUs and GPUs. This part also introduces the SYCL runtime, its dependency on clang, math kernels, standard OpenCL optimizations for CPU architectures, and device-agnostic GPU code.
Building custom kernels in SYCL to run sample miniature models (mini LLMs) on low-end, low-power cards (Nvidia, AMD, or Intel GPUs). Understanding how MLIR translates the common SYCL codebase into device-specific code. This part also covers some general SYCL principles and the main differences from CUDA and ROCm.
Working with DPCT (Intel's DPC++ Compatibility Tool, part of its SYCL-based tooling), which statically translates CUDA code to SYCL. We will look at automatic lexical translation, characterization, and host-device coupling code (with a miniature example if time permits). Attendees will also try to replicate a GitHub project that translates a "small" LLM written natively in CUDA/C++ to SYCL.
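
As a rough illustration of the second topic, here is a minimal sketch of a custom SYCL kernel, assuming a SYCL 2020 compiler such as DPC++. The `saxpy` helper and the `HAVE_SYCL` macro are hypothetical names chosen for this example and do not come from the talk. Where CUDA would use a `__global__` function, a `<<<grid, block>>>` launch, and explicit `cudaMemcpy` calls, SYCL expresses the kernel as a C++ lambda submitted to a queue, with buffers and accessors handling data movement. The host-loop fallback is only there so the sketch still builds with a plain C++17 compiler.

```cpp
#include <cstddef>
#include <vector>

// Assumption: use SYCL when the header is available, otherwise fall back to the host.
#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>
#define HAVE_SYCL 1
#else
#define HAVE_SYCL 0
#endif

// Hypothetical helper: computes y = a * x + y element-wise (SAXPY).
// Assumes x and y have the same length.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
#if HAVE_SYCL
    sycl::queue q;  // default selector: picks whichever device the runtime prefers
    {
        // Buffers wrap host memory; the runtime decides when to copy to the device.
        sycl::buffer<float, 1> xb(x.data(), sycl::range<1>(x.size()));
        sycl::buffer<float, 1> yb(y.data(), sycl::range<1>(y.size()));
        q.submit([&](sycl::handler& h) {
            // Accessors declare how the kernel uses each buffer.
            sycl::accessor xa(xb, h, sycl::read_only);
            sycl::accessor ya(yb, h, sycl::read_write);
            // One work-item per element; the same source targets any backend.
            h.parallel_for(sycl::range<1>(y.size()), [=](sycl::id<1> i) {
                ya[i] = a * xa[i] + ya[i];
            });
        });
    }  // buffer destructors wait for the kernel and write results back into y
#else
    // Host-only fallback so the example compiles without a SYCL toolchain.
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
#endif
}
```

Calling `saxpy(2.0f, x, y)` with two equally sized vectors updates `y` in place. The kernel body itself contains nothing vendor-specific, which is the property the talk relies on when targeting different GPUs from one codebase.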

As more communities standardize on a unified compiler runtime (the Triton backend is one example), its scope will grow, and engineers will be able to write custom code without worrying about IR translation across devices. The talk also includes a brush-up on standard C++ practices, since the entire framework is built on top of C++.