Friday, February 27, 2015

Interpret Page Fault Metrics

A page fault occurs when a program accesses an address on a page that is not in the current set of memory-resident pages. When a page fault occurs, the thread that triggered it is put into a wait state while the operating system locates the page and restores it to physical memory. It is important to distinguish between minor/soft and major/hard page faults:

  • Minor - occurs when the page is resident elsewhere in memory. This can happen because the page is no longer part of the working set but has not yet been moved to disk, or because it was already resident as the result of a prefetch operation.
  • Major - occurs when the page is not in physical memory at all and must be read from backing store (the page file or a memory-mapped file created by the process), incurring disk I/O.
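On Unix-like systems you can observe these two counters directly. A minimal Python sketch using the standard `resource` module (the buffer size and per-page touch stride are arbitrary choices for illustration):

```python
import resource

# Snapshot fault counters for the current process (Unix only).
# ru_minflt counts minor/soft faults (resolved without disk I/O);
# ru_majflt counts major/hard faults (page had to come from disk).
before = resource.getrusage(resource.RUSAGE_SELF)

# Touch a fresh block of memory: newly allocated pages are typically
# backed lazily, so first-touch writes show up as minor faults.
buf = bytearray(16 * 1024 * 1024)      # 16 MiB
for i in range(0, len(buf), 4096):     # write one byte per 4 KiB page
    buf[i] = 1

after = resource.getrusage(resource.RUSAGE_SELF)
print("minor faults:", after.ru_minflt - before.ru_minflt)
print("major faults:", after.ru_majflt - before.ru_majflt)
```

On a machine with ample free memory you should see the minor count climb while the major count stays near zero; major faults rising under load is the signal to worry about.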

Why bother?

  • Poor latency - minor faults can usually be ignored; major faults, however, can be detrimental to application performance. With insufficient physical memory, excessive hard faulting puts disk I/O on the critical path and needs to be fixed immediately.
  • Poor CPU utilization - a direct result of thrashing, where the system spends its time paging rather than doing useful work.

What next?

  • Increase physical memory - the easy one to start with; although, if you already own a large amount of RAM, chances are you need to go back to the design room, as adding more might just delay the problem.
  • Reduce overall memory usage - think right data types, de-duplication, effective (de)serialization.
  • Improve memory locality - choose algorithms based on their data access patterns to reduce page faults.
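To make the "right data types" point concrete, here is a small Python sketch comparing the footprint of boxed ints in a list against packed machine ints in an `array.array` (the element count is an arbitrary choice):

```python
import array
import sys

n = 100_000
as_list = list(range(n))               # list of boxed int objects
as_array = array.array("q", range(n))  # packed 8-byte signed ints

# Approximate footprints: the list's pointer table plus every boxed
# int object, versus the array's single contiguous buffer.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)
print(f"list : {list_bytes:,} bytes")
print(f"array: {array_bytes:,} bytes")
```

The packed representation occupies several times fewer pages, which directly shrinks the working set the pager has to keep resident.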

Wednesday, February 25, 2015

JVM Pressure - Context Switching Overhead


Context switching (CS) is a valuable service provided by the underlying operating system. It prevents greedy processes from hogging the CPU and time-shares the CPU between multiple threads/tasks/processes to create an illusion of continuous progress. However, suspending the first process and scheduling the second requires the kernel to store the state of the first and load the state of the second. The time this takes is referred to as context-switching overhead.
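That overhead is easy to estimate with a classic ping-pong microbenchmark: two processes bounce a byte across a pair of pipes, so each round trip forces at least two context switches. A rough Unix-only Python sketch (ROUNDS is an arbitrary choice, and the figure includes pipe syscall cost, so treat it as an upper bound rather than a precise measurement):

```python
import os
import time

ROUNDS = 10_000
p2c_r, p2c_w = os.pipe()  # parent -> child
c2p_r, c2p_w = os.pipe()  # child  -> parent

pid = os.fork()
if pid == 0:
    # Child: echo every byte straight back.
    for _ in range(ROUNDS):
        os.read(p2c_r, 1)
        os.write(c2p_w, b"x")
    os._exit(0)

start = time.perf_counter()
for _ in range(ROUNDS):
    os.write(p2c_w, b"x")   # wakes the child...
    os.read(c2p_r, 1)       # ...and blocks until it answers
elapsed = time.perf_counter() - start
os.waitpid(pid, 0)

# Each round trip is at least two switches (parent -> child -> parent).
print(f"~{elapsed / (2 * ROUNDS) * 1e6:.1f} us per switch (upper bound)")
```

Even this crude number makes the point: every switch burns time that does no application work, and the cache effects described below add to it.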

Why bother?

  • With frequent context switching, the probability that a context's data is still in the CPU's cache shrinks, so it must be fetched from main memory - a far more costly operation than reading it from cache. Further, dirty data currently resident in the CPU cache needs to be committed to RAM. Together, these two operations per context switch make the problem grave.
  • Slows down general user-space processing: code that would normally find its instructions and data in the CPU's L1/L2 cache now needs to fetch them from main memory.
  • Increases latency for the process, as explained above.

How to interpret data?

CS can be classified as follows:
  • Voluntary - the process makes itself unrunnable, for example by waiting for an I/O or synchronization operation to complete. This value can therefore be used to infer the frequency of blocking calls in your application process.
    • Hints: think non-blocking, asynchronous, reactive modes
  • Pre-emptive/Non-voluntary - processes contending for the CPU are switched out even though their task had not completed. Simply put, there are far more runnable threads than the CPU can service.
    • Hints: reduce threads, use thread pools, reduce task granularity
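Both counters are exposed per process on Unix-like systems. A minimal Python sketch using `resource.getrusage` (the sleep loop is just a stand-in for real blocking calls):

```python
import resource
import time

before = resource.getrusage(resource.RUSAGE_SELF)

# Each sleep blocks the process, so the kernel counts a voluntary
# context switch; involuntary switches (preemption) depend on system
# load and cannot be forced deterministically from here.
for _ in range(5):
    time.sleep(0.01)

after = resource.getrusage(resource.RUSAGE_SELF)
print("voluntary  :", after.ru_nvcsw - before.ru_nvcsw)
print("involuntary:", after.ru_nivcsw - before.ru_nivcsw)
```

On a live system, `pidstat -w <pid>` reports the same split (cswch/s vs. nvcswch/s) without instrumenting the process.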