Block-Based Parallel Programming

Parallel programming can be intimidating, but it doesn't need to be! There's a new paradigm for parallel programming that's newcomer-friendly, highly productive, and performant: block-based programming.

C++

Tools

Python

Language

Block-based programming models divide inputs into local arrays (tiles) that are processed concurrently by groups of threads (blocks). Users write sequential, array-centric code, and the framework handles parallelization, synchronization, and data movement behind the scenes. Block-based models have been around for a long time, but in recent years they've grown in popularity for GPU programming through languages and frameworks such as [Triton](https://openai.com/index/triton/), [JAX/Pallas](https://docs.jax.dev/en/latest/pallas/index.html), and [Warp](https://nvidia.github.io/warp/modules/tiles.html), which aim to make parallelism more accessible and code more portable.
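To make the division of labor concrete, here is a minimal sketch in plain Python. It is not the API of cuTile, Triton, Pallas, or Warp: the names `saxpy_tile`, `split_into_tiles`, `tiled_saxpy`, and `TILE_SIZE` are hypothetical, and a standard-library thread pool stands in for the framework that would normally map tiles onto blocks of GPU threads. It illustrates the shape of the model, not its performance.

```python
from concurrent.futures import ThreadPoolExecutor

TILE_SIZE = 4  # hypothetical tile width, chosen only for illustration


def saxpy_tile(a, x_tile, y_tile):
    """User code: sequential, array-centric work on one tile (a * x + y)."""
    return [a * x + y for x, y in zip(x_tile, y_tile)]


def split_into_tiles(values, tile_size):
    """Divide an input into contiguous local arrays (tiles)."""
    return [values[i:i + tile_size] for i in range(0, len(values), tile_size)]


def tiled_saxpy(a, x, y, tile_size=TILE_SIZE):
    """Stand-in for the framework: hand each tile to a worker, then reassemble."""
    x_tiles = split_into_tiles(x, tile_size)
    y_tiles = split_into_tiles(y, tile_size)
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in tile order, so the output can be
        # concatenated directly without extra bookkeeping.
        out_tiles = pool.map(lambda t: saxpy_tile(a, *t), zip(x_tiles, y_tiles))
        return [value for tile in out_tiles for value in tile]


x = list(range(10))
y = [1] * 10
result = tiled_saxpy(2, x, y)
```

The only code that describes the computation is `saxpy_tile`, which contains no threads, locks, or index arithmetic. Everything else plays the role that a block-based framework would take over.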

In this example-driven talk, we'll cover the basics of block-based programming in both Python and C++. We'll present cuTile, NVIDIA's new block-based programming model for Python, C++, and other languages, and Tile IR, the new compiler stack that it is built on, revealing new details about both for the first time. We'll also compare and contrast block-based models with traditional parallel programming models.

We'll look at a variety of examples, including a new tile-based large language model demo built on [LLAMA3](https://github.com/meta-llama/llama3), a stencil code, and an FFT solver.

In this session, you'll:
- Learn the best practices for writing block-based parallel applications for CPUs and GPUs.
- Gain insight into the performance of block-based code and how it actually gets executed.
- Discover how to reason about and debug block-based applications.
- Understand the differences between block-based and traditional parallel programming and when each paradigm should be used.

By the end of the session, you'll understand how block-based programming enables more intuitive, portable, and efficient development of high-performance, data-parallel applications.

Bryce Adelstein Lelbach

Bryce Adelstein Lelbach has spent over a decade developing programming languages, compilers, and software libraries. He is passionate about parallel programming and strives to make it more accessible for everyone.

Bryce is a Principal Architect at NVIDIA, where he leads programming language efforts and drives the technical roadmap for NVIDIA's compute compilers and libraries.

He is one of the leaders of the systems programming language community, having served as chair of the Standard C++ Library Evolution group and the US standards committee for programming languages (INCITS/PL22). He has been an organizer and program chair for many conferences over the years.

On the C++ Committee, he has personally worked on concurrency primitives, parallel algorithms, executors, and multidimensional arrays. He is one of the founding developers of the HPX parallel runtime system.

Outside of work, Bryce is passionate about airplanes and watches. He lives in Midtown Manhattan with his girlfriend and dog.

NDC { TechTown }
