GPU-accelerated Llama3.java inference in pure Java using TornadoVM

a year ago
#Java #GPU-Acceleration #AI

  • Llama3 models can be run in native Java with GPU acceleration using TornadoVM.
  • TornadoVM offloads the parallel parts of the Java inference code to the GPU to improve performance.
  • Supports various hardware backends including NVIDIA, Intel, and Apple Silicon (via OpenCL).
  • Reports throughput for several GPUs (e.g., an RTX 4090 reaches 66.07 tokens/s on Llama-3.2-1B).
  • Requires Java 21, TornadoVM with OpenCL/PTX backends, and Maven for building.
  • Provides detailed setup instructions for cloning, building, and running the project.
  • Supports FP16 models with optional Q8_0 and Q4_0 quantization.
  • Includes command-line options for model execution, memory management, and debugging.
  • The roadmap targets performance parity with the fastest implementations, such as llama.cpp.
  • Funded by EU Horizon Europe and UKRI grants.
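
To make the Q8_0 quantization mentioned above more concrete, here is a minimal sketch in plain Java. It assumes the commonly described GGML Q8_0 layout: weights are grouped in blocks of 32, and each block stores one scale plus 32 signed 8-bit values. The class and method names (`Q8_0Sketch`, `quantize`, `dequantize`) are hypothetical and are not taken from the project. The sketch also keeps the scale as a 32-bit float for simplicity, whereas GGUF files store it as a 16-bit float, so it illustrates the idea rather than the project's actual kernels.

```java
/** Illustrative sketch of Q8_0-style block quantization (names are hypothetical). */
final class Q8_0Sketch {
    static final int BLOCK_SIZE = 32; // Q8_0 groups weights in blocks of 32

    /** One quantized block: a shared scale plus 32 signed 8-bit values. */
    record Block(float scale, byte[] quants) {}

    /** Quantizes weights whose length is a multiple of BLOCK_SIZE. */
    static Block[] quantize(float[] weights) {
        if (weights.length % BLOCK_SIZE != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + BLOCK_SIZE);
        }
        Block[] blocks = new Block[weights.length / BLOCK_SIZE];
        for (int b = 0; b < blocks.length; b++) {
            int base = b * BLOCK_SIZE;
            // Find the largest magnitude in the block; it maps to +/-127.
            float maxAbs = 0f;
            for (int i = 0; i < BLOCK_SIZE; i++) {
                maxAbs = Math.max(maxAbs, Math.abs(weights[base + i]));
            }
            float scale = maxAbs / 127f;
            float inverse = (scale == 0f) ? 0f : 1f / scale; // avoid dividing by zero
            byte[] quants = new byte[BLOCK_SIZE];
            for (int i = 0; i < BLOCK_SIZE; i++) {
                quants[i] = (byte) Math.round(weights[base + i] * inverse);
            }
            blocks[b] = new Block(scale, quants);
        }
        return blocks;
    }

    /** Reconstructs approximate float weights from the quantized blocks. */
    static float[] dequantize(Block[] blocks) {
        float[] out = new float[blocks.length * BLOCK_SIZE];
        for (int b = 0; b < blocks.length; b++) {
            for (int i = 0; i < BLOCK_SIZE; i++) {
                out[b * BLOCK_SIZE + i] = blocks[b].quants()[i] * blocks[b].scale();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[] weights = new float[64];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = (float) Math.sin(i);
        }
        float[] restored = dequantize(quantize(weights));
        float maxError = 0f;
        for (int i = 0; i < weights.length; i++) {
            maxError = Math.max(maxError, Math.abs(restored[i] - weights[i]));
        }
        System.out.println("max round-trip error: " + maxError);
    }
}
```

Each weight then occupies roughly one byte plus a small per-block overhead for the scale, and the round-trip error per weight is bounded by about half the block's scale. Q4_0 follows the same block idea with 4-bit values, trading more accuracy for a smaller footprint.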