Diagnosing JVM Memory Leaks in Production: Tools, Techniques, and Prevention Patterns
Memory leaks in JVM-based systems are one of those nasty problems: the symptoms often creep in slowly, performance degrades, occasional OutOfMemoryErrors pop up, and diagnosing in production is challenging. But with the right tools, techniques, and patterns, you can catch them before they bring down your service or cause serious user impact.
This article walks through how you can diagnose JVM memory leaks in production (heap dumps, GC log analysis, etc.), illustrated with examples and insights, and then suggests some common avoidance patterns to reduce the risk.
What is a JVM Memory Leak?
In Java (or any language with GC), a “memory leak” means objects that are no longer needed are still referenced, so the garbage collector cannot reclaim them. Over time, these accumulate, filling up the heap, causing GC to work harder, increasing pause times, potentially ending in an OutOfMemoryError. It’s different from simply requiring more memory; it’s about retention of useless data.
Sometimes what looks like a leak is just excessive object allocation or too small a heap, or poor GC tuning. Diagnosis helps distinguish among these.
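To make the definition concrete, here is a minimal sketch of the kind of code that leaks. The class name SessionRegistry and its methods are hypothetical, invented only for illustration: a static map receives an entry on every login, and nothing forces callers to remove it again.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: a static registry that grows unless callers remember to clean up.
final class SessionRegistry {
    // A static map lives as long as its class loader, so every entry stays strongly reachable.
    private static final Map<String, byte[]> SESSIONS = new HashMap<>();

    static void onLogin(String sessionId) {
        SESSIONS.put(sessionId, new byte[1024]); // per-session state
    }

    // If no code path ever calls this, entries for finished sessions are retained forever.
    static void onLogout(String sessionId) {
        SESSIONS.remove(sessionId);
    }

    static int size() {
        return SESSIONS.size();
    }
}
```

The garbage collector is working correctly here: the entries are still reachable through the static field, so they are not garbage, even though the application will never use them again.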
Symptoms to Watch For in Production
Before diving into heap dumps or logs, watch for signs that you may have a memory leak.
Memory leaks in a JVM application rarely announce themselves loudly at first. Instead, they leave subtle traces in the system’s behavior. One of the earliest signs is a steady increase in heap usage that doesn’t return to its original baseline after garbage collection. Normally, you expect to see a saw-tooth pattern — memory climbs as the application allocates objects, then drops sharply when the garbage collector runs. With a leak, however, each drop lands a little higher than the last, and over time the baseline creeps upward.
As the leak worsens, garbage collection begins to struggle. Full GCs run more often, but each one reclaims less memory than before. The service may still appear responsive, but latency starts to creep in as GC pauses grow longer. Operations teams often notice this as sluggish response times during peak load or sudden spikes in CPU usage tied to GC activity. Left unchecked, the leak eventually manifests in the most visible way possible: a java.lang.OutOfMemoryError. By this point, the JVM is unable to free enough space to continue allocating new objects, and the application either crashes or becomes unresponsive.
These symptoms can look deceptively similar to other issues, such as undersized heaps, misconfigured GC tuning, or simply increased workload. That’s why observing the long-term behavior — especially whether memory usage consistently fails to reset after full GC — is the key clue that a memory leak may be at play.
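One way to watch the post-GC baseline from inside the process is the standard java.lang.management API. The sketch below sums what each heap memory pool still held right after its most recent collection. The class and method names are illustrative, and how you export the value (a metrics gauge, a log line) is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Illustrative helper, not part of any library.
final class PostGcHeapSampler {
    /** Bytes still in use across heap pools, measured after each pool's most recent collection. */
    static long postGcUsedBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache, and other non-heap pools
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (afterGc != null) {
                total += afterGc.getUsed();
            }
        }
        return total;
    }
}
```

Sampling this value periodically and plotting it gives a rough view of the baseline: in a healthy service it hovers around a stable level, while with a leak it trends upward. Because each pool reports its own most recent collection, treat the sum as an approximation rather than an exact post-full-GC figure.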
Tools & Techniques for Diagnosis
Here are the main ways to dig in, what to collect, what to look at, and how to analyze.
| Tool / Artifact | What It Gives You | How to Use It in Production | Examples & Insights |
|---|---|---|---|
| Heap Dump | A snapshot of all live objects (types, references, sizes) at a point in time. It shows what is being retained, what is growing, and which references are preventing GC. | Capture one dump while the app is “healthy” (baseline), then another when the suspected leak is manifesting. Trigger dumps with JVM flags or external tools and analyze them offline (MAT, HeapHero, VisualVM, etc.). Be mindful: generating a large heap dump in production can pause the application, needs disk space, and raises security concerns. | E.g. comparing two dumps in HeapHero shows that Map<String, Payload[]> instances balloon from 100 to 10,000, and MAT shows that most are referenced from a static cache which is never cleaned up. The HeapHero blog has a similar example in which the top objects by retained size pointed to the leak suspects. |
| Garbage Collection Log Analysis | GC logs show how memory is used over time: the frequency of minor and major (full) GCs, how much memory each GC reclaims, GC pause times, etc. These patterns help distinguish a leak from a high allocation rate. | Enable detailed GC logging (e.g. -Xlog:gc* on JDK 9+, or the older -XX:+PrintGCDetails etc.). Collect and retain GC logs over time. Use tools like GCeasy, GCViewer, or homegrown dashboards to plot heap usage vs time, memory reclaimed per full GC, etc. Correlate with traffic patterns. | A healthy heap shows a steady “saw-tooth” pattern: usage climbs, then drops back to roughly the same level after each GC. In the “memory leak” pattern, the baseline after each full GC creeps up instead of reverting: full GCs free less and less each time, and memory usage keeps rising despite GC. |
| Runtime Monitoring / Profiling | Gives you live metrics: heap usage (young, old gen), GC pause durations, allocation rates; possibly even which classes are allocating heavily, or which threads hold many references. | Use lightweight profilers, JMX exporters (e.g. via Prometheus / Micrometer), Java Flight Recorder / Mission Control (if safe in production), or tools with low overhead. Also, periodic histograms (e.g. jmap -histo) to see class instance count growth. | Example: Using jmap -histo regularly (say every hour), you detect that instances of MyLargeObject are increasing over time; then you dump and analyze; find that thread locals or a static collection is holding onto them. Or JFR traces show that certain allocation sites are hot. |
| Comparative Analysis | Comparing snapshots over time (heap dumps or histograms) helps find which objects/classes are growing (leaking). | Keep a baseline snapshot, then later ones. Use tools that can diff heap dumps (MAT has diffing tools). Also track histograms over time or in a DB to see trends. | In one case, dump every 5 minutes for several hours; collect histograms; subtract classes that don’t increase; find union of classes that consistently increase. Use linear regression on counts over time to detect slow but steady growth. |
| GC Patterns / Heap Behavior | Understanding how GC behavior should normally look (healthy pattern) vs what looks pathological helps diagnose early. | Plot heap usage over time (after minor GC, after full GC), look for healthy saw‐tooth vs creeping baseline; monitor frequency/impact of Full GC; measure how much is reclaimed after full GC; set thresholds/alerts. | E.g., in the “Acute Memory Leak” pattern: first full GC reduces from 60 GB → 22 GB, next from 60 → 25, then 60 → 30 etc: reclaimed memory shrinks as leak worsens. Or detect “Consecutive Full GCs” (multiple full GC back-to-back) where GC doesn’t have time/resources to free much. |
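The linear-regression idea from the Comparative Analysis row can be sketched in a few lines. This is a plain least-squares slope over (time, value) samples, where the value might be a class's instance count from successive histograms or the heap usage after successive full GCs. The class name GrowthTrend is hypothetical, and the threshold at which a slope counts as a leak is something you would have to calibrate for your own service.

```java
// Illustrative helper: least-squares slope of y over x.
final class GrowthTrend {
    /** Returns the fitted growth rate, in units of y per unit of x. */
    static double slope(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            num += dx * (y[i] - meanY);
            den += dx * dx;
        }
        if (den == 0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        return num / den;
    }
}
```

Fed the post-full-GC figures from the “Acute Memory Leak” example above (22, 25, and 30 GB at collections 0, 1, and 2), the fitted slope is about 4, meaning the baseline grows by roughly 4 GB per full GC. A slope that stays near zero over many samples suggests ordinary fluctuation rather than a leak.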
Example Flow: Diagnosing a Leak in Production
Imagine a production microservice that slowly starts consuming more memory day after day. At first, everything looks fine, but over the course of a week, latency begins to rise and GC pause times become noticeable. The operations team observes that even after a full GC, the heap usage never returns to its original baseline. Instead, it creeps up: 50 GB after the first full GC, then 52 GB the next day, then 54 GB, and so on. This is a classic sign of a memory leak.
To investigate, engineers begin by enabling detailed GC logging and capturing data over time. The logs confirm the suspicion: each full GC is reclaiming less memory than before. To dig deeper, they generate a baseline heap dump right after a restart when the service is stable. Later, when memory consumption spikes, they take another dump for comparison. Using a tool like Eclipse MAT or HeapHero, the team compares the two and quickly spots the issue — instances of DataCacheEntry have grown from one million to over five million, and the retained size of a static map holding these entries accounts for the bulk of the growth.
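Heap dumps like the two described here are usually taken with external tools such as jmap or jcmd, or automatically with the -XX:+HeapDumpOnOutOfMemoryError flag. As an alternative, the sketch below triggers a dump from inside the process. It assumes a HotSpot-based JVM, since HotSpotDiagnosticMXBean is HotSpot-specific, and the HeapDumper class name is made up for this example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;

// Illustrative helper; assumes a HotSpot JVM.
final class HeapDumper {
    /**
     * Writes a heap dump of reachable objects to the given file.
     * The target must not exist yet; recent JDKs also require a ".hprof" suffix.
     */
    static void dumpLiveObjects(Path target) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            // live=true restricts the dump to objects that are still reachable.
            bean.dumpHeap(target.toString(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The application is paused while the dump is written, so on a large heap this should sit behind an authenticated admin endpoint or a manual trigger rather than a timer.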
Following the trail, they inspect the code and realize that the cache was configured without an eviction policy. Old entries were never being removed, and the map kept growing indefinitely. On top of that, some event listeners were being registered but never deregistered, compounding the leak.
The fix was straightforward but critical: introduce TTL-based eviction for cache entries and ensure that listeners are properly cleaned up. After redeploying, the team continues monitoring. The GC logs now show a healthy saw-tooth pattern again — heap usage rises with load, but after each full GC it drops back to a stable baseline. The creeping memory line is gone, and the application runs smoothly without further issues.
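A TTL-based fix of the kind described could look roughly like the sketch below. It is not the code from this scenario: TtlCache is a hypothetical class, the clock is injected so that expiry can be tested, and a production service would more likely use an established caching library than a hand-rolled map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Illustrative TTL cache; names and structure are assumptions, not the article's code.
final class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clockMillis;

    TtlCache(long ttlMillis, LongSupplier clockMillis) {
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value, clockMillis.getAsLong() + ttlMillis));
    }

    /** Returns the cached value, or null if it is absent or has expired. */
    V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (e.expiresAtMillis <= clockMillis.getAsLong()) {
            entries.remove(key, e); // remove only if nobody replaced the entry meanwhile
            return null;
        }
        return e.value;
    }

    /** Sweeps out expired entries; call periodically so entries that are never read still get evicted. */
    void evictExpired() {
        long now = clockMillis.getAsLong();
        entries.values().removeIf(e -> e.expiresAtMillis <= now);
    }

    /** Current entry count, suitable for exporting as a "cache size" metric. */
    int size() {
        return entries.size();
    }
}
```

In real use the clock would be System::currentTimeMillis and evictExpired would run on a scheduled executor. The periodic sweep matters: expiring entries only on read would leave never-read entries in the map, which is the same leak in a slower form.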
Avoidance Patterns
Prevention is often easier than cure. These are patterns and coding practices that reduce risk of leaks:
| Pattern / Practice | What To Do | Why It Helps / Examples |
|---|---|---|
| Use Weak / Soft References Carefully | For caches or maps where you don’t need strong references forever, consider WeakHashMap, WeakReference, or SoftReference, depending on how the entries should respond to memory pressure. Note that WeakHashMap holds its keys weakly, not its values. | If cache entries are only useful while memory is available, these references won’t prevent GC from reclaiming them when needed. Misuse can cause other problems (e.g. unpredictable removal, GC overhead). |
| Bounded Caches / Eviction Policies | If you maintain caches, make sure size limits and TTL (time‐to‐live) or LRU eviction are in place. | Without bounds, rarely used entries accumulate. A cache that never evicts is a leak. |
| Proper resource cleanup | Close streams, sockets, DB connections; deregister listeners; clear ThreadLocals; clean up scheduled tasks. | Many leaks are due to references maintained unintentionally through long‐lived threads or static holders. |
| Minimal static state | Keep static fields to a minimum; avoid large static collections unless carefully managed. | Statics live for the life of the classloader/JVM, so everything they reference stays reachable. A large or growing static collection is a common source of leaks. |
| Small immutable objects | Where possible, prefer small immutable objects to reduce accidental retention. Also minimize object graph complexity. | When objects have many nested references, it is easier to hold onto something by mistake through a deep reference chain. |
| Avoid unnecessary retention via logging, tracing, metrics | Instrumentation, logging (especially cached log messages or contexts), and metrics objects sometimes keep references. Be careful what you capture and store. | E.g. storing full request objects in logs, or large payloads in traces without sampling. |
| Test/Load test with production-like scenarios | Before rollout, run long‐running integration or load tests, monitor memory behavior, GC, heap dumps, etc. | Helps catch leaks before production. If the leak only appears after hours/days, long tests can expose it. |
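As one illustration of the bounded-cache row, the JDK's LinkedHashMap can act as a small LRU cache by overriding removeEldestEntry. The LruCache class below is a sketch under that approach. It is not thread-safe on its own, so concurrent callers would need to wrap it (for example with Collections.synchronizedMap) or use a dedicated caching library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative size-bounded LRU cache built on LinkedHashMap; not thread-safe.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true: iteration order runs from least to most recently used
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the least recently used entry.
        return size() > maxEntries;
    }
}
```

With a limit of 2, inserting a and b, reading a, and then inserting c evicts b, because b is the least recently used entry at that point. The map can never hold more than maxEntries entries, which puts a hard ceiling on what the cache retains.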
Practical Considerations & Caveats in Production
- Performance impact: Taking heap dumps, enabling very detailed GC logging, or using heavy profiling in production comes with costs (pause time, CPU, I/O). You’ll need to balance risk vs impact. Perhaps schedule dumps during lower usage, sample profilers, ensure your system can absorb the overhead.
- Security & privacy: Heap dumps can contain sensitive data (user data, secrets, etc.). Treat them carefully: access controls, encryption, discard after use, possibly censor or anonymize if needed.
- Storage & retention: GC logs and heap dumps can be large. Plan for disk space, rotation policies, archival. Only keep what’s needed for diagnosis.
- False positives: Some growth in memory usage is normal (e.g. warming up caches, new features being used more, user growth). What matters is unexpected growth or growth where GC fails to reclaim what it should. Always compare with baseline, usage trend, business as usual.
- Native / off-heap memory: Not all leaks are in the Java heap. Native memory (e.g. direct ByteBuffers, JNI allocations), thread stacks, metaspace, etc. may also leak or grow, and GC/heap-based tools won’t always catch them. The Oracle troubleshooting docs cover diagnosing native memory leaks separately.
- Garbage Collector selection & tuning: The choice of GC algorithm (Parallel, G1, ZGC, Shenandoah, etc.) affects how heap behaves: pause times, how quickly memory is reclaimed, how full generations grow, etc. Sometimes what looks like a leak is just suboptimal GC configuration.
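For the off-heap caveat above, part of the picture is visible through standard JMX beans. The sketch below reads the memory held by direct ByteBuffers from the platform's BufferPoolMXBean. The DirectBufferProbe class is hypothetical, and the pool name "direct" is what HotSpot-based JDKs report, so treat it as an assumption on other JVMs. JNI allocations and thread stacks do not show up here; for those, HotSpot's Native Memory Tracking (-XX:NativeMemoryTracking=summary, read with jcmd) is the usual starting point.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Illustrative probe for direct (off-heap) ByteBuffer usage.
final class DirectBufferProbe {
    /** Bytes currently held by direct buffers, or -1 if no "direct" pool is reported. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }
}
```

Exporting this value next to the heap metrics helps separate the two cases: a process whose resident memory grows while the post-GC heap baseline stays flat points toward native or off-heap growth rather than a heap leak.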
Example: Putting It All Together
Let’s walk through a fictional but plausible real-world example.
Scenario: A microservice in production begins consuming more memory each day. After a week, the host starts swapping and latency spikes. Full GCs are happening more often; each Full GC frees less memory, and after full GC, heap usage drops only slightly from 50 GB to ~45 GB; over subsequent days the minimum after GC rises (50→52→54…).
Steps taken:
- Enable detailed GC logs; begin monitoring with a GC log analyzer dashboard.
- Capture baseline heap dump after a clean restart and moderate traffic.
- After memory usage is visibly high (before OOM), capture another heap dump.
- Compare in MAT: see that instances of DataCacheEntry have increased from ~1M → 5M. The retained size of a static Map<String, DataCacheEntry> accounts for most of the difference.
- Investigate code: realize that cache eviction was mis-configured (never triggered), and TTL not applied properly. Also, some old listener registrations were never removed.
- Fix: apply TTL eviction to cache, ensure listeners deregister in shutdown / deregistration paths, add monitoring of “cache size” to see if it’s growing.
- Deploy, monitor again: GC logs show saw-tooth pattern restored, baseline after full GC remains stable, memory usage no longer creeping over days.
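The listener half of the fix can be sketched as follows. The idea is that registering a listener hands back a handle whose close() removes it, so cleanup is hard to forget, and that the listener count is exposed so it can be monitored alongside the cache size. EventBus and Subscription are hypothetical names for this sketch, not types from the scenario's codebase.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Illustrative event bus whose subscriptions can be closed to deregister the listener.
final class EventBus {
    /** Handle returned by subscribe(); close() removes the listener from the bus. */
    interface Subscription extends AutoCloseable {
        @Override
        void close(); // narrowed so callers need not handle a checked exception
    }

    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    Subscription subscribe(Consumer<String> listener) {
        listeners.add(listener);
        // After close(), the bus no longer holds a reference to the listener.
        return () -> listeners.remove(listener);
    }

    void publish(String event) {
        for (Consumer<String> listener : listeners) {
            listener.accept(event);
        }
    }

    /** Export as a gauge: a count that only ever rises is an early leak signal. */
    int listenerCount() {
        return listeners.size();
    }
}
```

Because Subscription extends AutoCloseable, short-lived subscribers can use try-with-resources, and longer-lived components can close their subscriptions in their own shutdown path.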
Summary
Diagnosing JVM memory leaks in production is about combining multiple sources of information:
- GC log behavior (patterns, full GC reclamation, heap usage over time)
- Heap dumps (what is being retained, what is growing)
- Runtime class/instance count histograms or profiles
- Comparing across time (baseline vs problem)
Follow these up with code and architecture fixes: coding patterns that avoid accumulating unused references, good resource cleanup, bounded caching, etc.
Useful Links
Here are some valuable resources for further reading and tools:
- Oracle’s Java™ Platform, Standard Edition Troubleshooting Guide: Memory Leaks section
- HeapHero: Analyzing Java Heap Dumps
- GCeasy: Universal GC Log Analyser, diagnosing memory leaks with GC logs
- yCrash / HeapHero dashboards & tooling for spotting GC / memory leak patterns

