As an ML engineer I want to develop and automate the process of building Python packages (e.g., Flash Attention, vLLM) optimized for AMD ROCm and PyTorch. At the moment there are no published wheels for the packages we want to use in our deployments on AMD GPUs. Examples of these packages include:
- quantization: AWQ, GPTQ, bitsandbytes (the only one of the three with an alpha release supporting multiple backends)
- Flash Attention 2
- inference optimization frameworks: vLLM
These packages should be production-ready and distributed as wheels through a public repository for streamlined deployment and usage.
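For reference, a minimal sketch of what a manual build looks like today, assuming the upstream rocm/pytorch image and a flash-attention source checkout (the image tag and paths are illustrative, not pinned choices):

```shell
# Build a wheel inside the upstream ROCm + PyTorch image.
# "rocm/pytorch:latest" is a placeholder tag; we would pin the one
# matching our target ROCm and PyTorch versions.
docker run --rm -it \
  -v "$PWD/flash-attention:/src" \
  rocm/pytorch:latest \
  bash -c '
    cd /src &&
    # Build a binary wheel for this package only (no dependency wheels);
    # the output lands in /src/dist/.
    pip wheel --no-deps --wheel-dir dist .
  '
```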
While getting started on this, the team has been using ml-labs to build these packages and test things out. ml-labs is a great environment for experimenting and figuring out what we need, but it is by no means a production environment.
We have the following requirements for this work:
- an environment where we can use the upstream Docker images that are based on rocm/pytorch
- establish CI/CD processes that build and publish these packages
Both could run on GitLab CI, which also has a package registry where we can publish our wheels.
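A minimal sketch of such a pipeline, assuming the upstream image as the build environment and the project-level PyPI package registry (the image tags, package path, and stage layout are assumptions, not decisions):

```yaml
# .gitlab-ci.yml (sketch)
stages:
  - build
  - publish

build-wheel:
  stage: build
  image: rocm/pytorch:latest   # placeholder tag; pin to our ROCm/PyTorch versions
  script:
    # Package path is a placeholder; one job per package in practice.
    - pip wheel --no-deps --wheel-dir dist ./flash-attention
  artifacts:
    paths:
      - dist/

publish-wheel:
  stage: publish
  image: python:3.11
  script:
    - pip install twine
    # Publish to this project's PyPI package registry, authenticated
    # with the job token GitLab injects into every CI job.
    - python -m twine upload
        --repository-url "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/pypi"
        --username gitlab-ci-token --password "${CI_JOB_TOKEN}" dist/*
```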
One thing to consider: if we use the upstream Docker images to do this work, we need to make sure that the packages we build will work in our environment. The upstream images are based on Ubuntu while ours are based on Debian. While we don't expect any issues, this is something to keep in mind.
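A cheap way to catch such a mismatch early would be a smoke-test step that installs the freshly built wheel into our Debian-based image and imports it (the image name and module name below are placeholders):

```shell
# Install the wheel into our Debian-based runtime image and verify it imports.
# "our-registry/debian-pytorch-rocm:latest" and "flash_attn" are hypothetical names.
docker run --rm -v "$PWD/dist:/dist" our-registry/debian-pytorch-rocm:latest \
  bash -c 'pip install /dist/*.whl &&
           python -c "import flash_attn; print(flash_attn.__version__)"'
```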