GPU-accelerated Llama3.java inference in pure Java using TornadoVM
- #Java
- #GPU-Acceleration
- #AI
- Llama3 models can be run in pure Java with GPU acceleration using TornadoVM.
- TornadoVM enables parallel computing features for enhanced performance on GPUs.
- Supports various hardware backends including NVIDIA, Intel, and Apple Silicon (via OpenCL).
- Performance metrics provided for different GPUs (e.g., RTX 4090 achieves 66.07 tokens/s for Llama-3.2-1B).
- Requires Java 21, TornadoVM with OpenCL/PTX backends, and Maven for building.
- Detailed setup instructions for cloning, building, and running the project.
- Supports FP16 models with optional Q8_0 and Q4_0 quantization.
- Includes command-line options for model execution, memory management, and debugging.
- Roadmap aims for performance parity with the fastest implementations, such as llama.cpp.
- Funded by EU Horizon Europe and UKRI grants.
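The parallelism TornadoVM exploits can be illustrated with the transformer's core operation, a matrix-vector multiply whose outer loop has fully independent iterations. The sketch below is plain Java and not taken from the project; in TornadoVM itself, the outer loop would carry an `@Parallel` annotation and be scheduled through a task graph onto the GPU.

```java
public class MatVec {
    // y = W * x, with W stored row-major as rows x cols.
    // Each iteration of the outer loop writes a distinct y[r] and reads
    // only shared inputs, so the rows can run in parallel on a GPU.
    static void matVec(float[] w, float[] x, float[] y, int rows, int cols) {
        for (int r = 0; r < rows; r++) { // independent per-row work
            float sum = 0f;
            for (int c = 0; c < cols; c++) {
                sum += w[r * cols + c] * x[c];
            }
            y[r] = sum;
        }
    }

    public static void main(String[] args) {
        float[] w = {1, 2, 3, 4}; // 2x2 matrix
        float[] x = {1, 1};
        float[] y = new float[2];
        matVec(w, x, y, 2, 2);
        System.out.println(y[0] + " " + y[1]); // prints "3.0 7.0"
    }
}
```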
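The Q8_0 option mentioned above follows the GGUF-style scheme: weights are split into fixed-size blocks (32 values per block in GGUF's Q8_0), each stored as one float scale plus signed 8-bit integers. The sketch below shows that idea in plain Java; the names and API are illustrative, not the project's actual code.

```java
public class Q8Block {
    static final int BLOCK = 32; // GGUF's Q8_0 block size

    // Quantize one block: scale = max|v| / 127, q[i] = round(v[i] / scale).
    static byte[] quantize(float[] v, float[] scaleOut) {
        float max = 0f;
        for (float f : v) max = Math.max(max, Math.abs(f));
        if (max == 0f) {            // all-zero block: zero scale, zero codes
            scaleOut[0] = 0f;
            return new byte[v.length];
        }
        float scale = max / 127f;
        scaleOut[0] = scale;
        byte[] q = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            q[i] = (byte) Math.round(v[i] / scale);
        }
        return q;
    }

    // Dequantize: v[i] is recovered approximately as q[i] * scale.
    static float[] dequantize(byte[] q, float scale) {
        float[] v = new float[q.length];
        for (int i = 0; i < q.length; i++) v[i] = q[i] * scale;
        return v;
    }
}
```

The per-block scale bounds the round-trip error at roughly scale/2 per value, which is why Q8_0 trades a 4x size reduction (versus FP32) for only a small accuracy loss.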