GPU-accelerated Llama3.java inference in pure Java using TornadoVM
- #Java
- #GPU-Acceleration
- #AI
- Llama3 models can be run in pure Java with GPU acceleration using TornadoVM.
- TornadoVM enables parallel computing features for enhanced performance on GPUs.
- Supports various hardware backends including NVIDIA, Intel, and Apple Silicon (via OpenCL).
- Performance metrics provided for different GPUs (e.g., RTX 4090 achieves 66.07 tokens/s for Llama-3.2-1B).
- Requires Java 21, TornadoVM with OpenCL/PTX backends, and Maven for building.
- Detailed setup instructions for cloning, building, and running the project.
- Supports FP16 models with optional Q8_0 and Q4_0 quantization.
- Includes command-line options for model execution, memory management, and debugging.
- Roadmap aims for performance parity with the fastest implementations, such as llama.cpp.
- Funded by EU Horizon Europe and UKRI grants.
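The parallelism TornadoVM exploits can be illustrated with the transformer's core operation, a matrix-vector multiply whose outer loop has fully independent iterations. The sketch below is plain Java and not taken from the project; in TornadoVM itself, the outer loop would carry an `@Parallel` annotation and be scheduled through a task graph onto the GPU.

```java
public class MatVec {
    // y = W * x, with W stored row-major as rows x cols.
    // Each iteration of the outer loop writes a distinct y[r] and reads
    // only shared inputs, so the rows can run in parallel on a GPU.
    static void matVec(float[] w, float[] x, float[] y, int rows, int cols) {
        for (int r = 0; r < rows; r++) { // independent per-row work
            float sum = 0f;
            for (int c = 0; c < cols; c++) {
                sum += w[r * cols + c] * x[c];
            }
            y[r] = sum;
        }
    }

    public static void main(String[] args) {
        float[] w = {1, 2, 3, 4}; // 2x2 matrix
        float[] x = {1, 1};
        float[] y = new float[2];
        matVec(w, x, y, 2, 2);
        System.out.println(y[0] + " " + y[1]); // prints "3.0 7.0"
    }
}
```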
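The Q8_0 option mentioned above follows the GGUF-style scheme: weights are split into fixed-size blocks (32 values per block in GGUF's Q8_0), each stored as one float scale plus signed 8-bit integers. The sketch below shows that idea in plain Java; the names and API are illustrative, not the project's actual code.

```java
public class Q8Block {
    static final int BLOCK = 32; // GGUF's Q8_0 block size

    // Quantize one block: scale = max|v| / 127, q[i] = round(v[i] / scale).
    static byte[] quantize(float[] v, float[] scaleOut) {
        float max = 0f;
        for (float f : v) max = Math.max(max, Math.abs(f));
        if (max == 0f) {            // all-zero block: zero scale, zero codes
            scaleOut[0] = 0f;
            return new byte[v.length];
        }
        float scale = max / 127f;
        scaleOut[0] = scale;
        byte[] q = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            q[i] = (byte) Math.round(v[i] / scale);
        }
        return q;
    }

    // Dequantize: v[i] is recovered approximately as q[i] * scale.
    static float[] dequantize(byte[] q, float scale) {
        float[] v = new float[q.length];
        for (int i = 0; i < q.length; i++) v[i] = q[i] * scale;
        return v;
    }
}
```

The per-block scale bounds the round-trip error at roughly scale/2 per value, which is why Q8_0 trades a 4x size reduction (versus FP32) for only a small accuracy loss.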