A 40-Line Fix Eliminated a 400x Performance Gap
4 months ago
- #Performance
- #Linux-Kernel
- #OpenJDK
- A 40-line code change in OpenJDK replaced slow parsing of a `/proc` file with a `clock_gettime()` call for measuring thread CPU time, making the measurement 30x to 400x faster.
- The old method involved reading and parsing `/proc/self/task/<tid>/stat`, which required multiple syscalls, file I/O, and complex string parsing, leading to high latency, especially under concurrency.
- The new method uses `clock_gettime(CLOCK_THREAD_CPUTIME_ID)`, which directly accesses kernel thread scheduling data with a single syscall, avoiding file operations and parsing overhead.
- On Linux, the kernel encodes the clock type in the low bits of a CPU-time `clockid_t`, so the JVM can flip those bits to request user-only CPU time. This works around a POSIX limitation: the standard thread clock reports only total CPU time (user + system).
- A further optimization was identified: manually constructing a `clockid` with PID=0, which the kernel interprets as the calling thread, skips a radix tree lookup in the kernel and yields an additional 13% performance gain.
- The fix highlights the importance of revisiting old assumptions, leveraging kernel internals, and understanding the trade-offs between POSIX compliance and platform-specific optimizations.
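
To make the contrast concrete, here is a minimal C sketch of both approaches, written for Linux only. It is not the OpenJDK patch. The helper names (`make_thread_clockid`, `clock_ns`, `user_ns_from_proc`) and the `CPUCLOCK_*` constants are hypothetical; the constants mirror the kernel's internal `clockid_t` bit layout, which is not a stable public API and should be treated as an assumption. Production code would more likely get the clock ID from `pthread_getcpuclockid()` and change only the type bits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Assumed layout of a Linux CPU-time clockid_t (copied from kernel internals):
 * bits 0-1 = clock type, bit 2 = per-thread flag, upper bits = ~pid. */
#define CPUCLOCK_TYPE_VIRT     1u  /* user-mode CPU time only */
#define CPUCLOCK_TYPE_SCHED    2u  /* total CPU time (user + system) */
#define CPUCLOCK_PERTHREAD_BIT 4u  /* thread clock rather than process clock */

/* Build a per-thread CPU clock ID by hand; tid == 0 means "calling thread". */
clockid_t make_thread_clockid(pid_t tid, unsigned type) {
    assert(type <= CPUCLOCK_TYPE_SCHED);
    return (clockid_t)((~(unsigned)tid << 3) | CPUCLOCK_PERTHREAD_BIT | type);
}

/* New approach: one syscall, no file I/O. Returns nanoseconds, or -1 on error. */
long long clock_ns(clockid_t cid) {
    struct timespec ts;
    if (clock_gettime(cid, &ts) != 0) return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Old approach: open, read, and parse /proc/self/task/<tid>/stat.
 * Returns user-mode CPU time in nanoseconds, or -1 on error. */
long long user_ns_from_proc(void) {
    char path[64], buf[1024];
    snprintf(path, sizeof path, "/proc/self/task/%ld/stat",
             (long)syscall(SYS_gettid));
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* Field 2 (the command name) may contain spaces or ')', so resume
     * parsing after the last ')' in the line. */
    char *p = strrchr(buf, ')');
    unsigned long utime_ticks;
    /* Skip state (field 3) and fields 4-13, then read utime (field 14). */
    if (!p || sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu",
                     &utime_ticks) != 1)
        return -1;
    return (long long)utime_ticks * (1000000000LL / sysconf(_SC_CLK_TCK));
}
```

Under these assumptions, `clock_ns(CLOCK_THREAD_CPUTIME_ID)` gives the POSIX total (user + system) for the calling thread, `clock_ns(make_thread_clockid(tid, CPUCLOCK_TYPE_VIRT))` gives user-only time for a thread ID, and `clock_ns(make_thread_clockid(0, CPUCLOCK_TYPE_VIRT))` is the PID=0 variant for the calling thread. The `/proc` value is reported in clock ticks, so it is much coarser than the nanosecond clock and the two will not match exactly.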