Memory leak in TRX causes heap to get used up in the hbase regionserver.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Trafodion | Fix Committed | Critical | Joanie Cooper | r1.0
Bug Description
Running longevity tests on a system revealed that we have a memory leak in the TRX code that causes the hbase regionserver heap to accumulate structures.
The Java Flight Recorder heap section only shows classes that take up more than 0.5% of the total heap; since our heap is so large (32 GB), it takes a while for our classes to show up.
The first object count capture I did was at 36 mins into the run.
Class | Instances | Size (bytes) | Percentage of Heap (%)
---|---|---|---
byte[] | 7,056,680 | 10,569,632,596 | 93.782
java.util.TreeMap | 1,465,205 | 70,329,828 | 0.624

(the remaining org.apache.* and java.util.* rows in this capture were truncated)
-------
At 127 mins into the run:
Class | Instances | Size (bytes) | Percentage of Heap (%)
---|---|---|---
byte[] | 19,458,266 | 12,997,350,664 | 88.254
java.util.TreeMap | 4,293,451 | 206,085,636 | 1.399
java.lang.Object[] | 2,693,893 | 102,438,228 | 0.696

(the remaining org.apache.* and java.util.* rows in this capture were truncated)
-------
At 203 mins into the run (notice that explicit transaction-related classes now appear):
Class | Instances | Size (bytes) | Percentage of Heap (%)
---|---|---|---
byte[] | 28,810,276 | 13,622,665,676 | 80.878
java.util.TreeMap | 6,455,446 | 309,861,408 | 1.84
java.lang.Object[] | 4,043,772 | 151,503,184 | 0.899
java.lang.Long | 4,375,168 | 105,004,044 | 0.623

(the remaining org.apache.* and java.util.* rows in this capture were truncated)
-------
At 335 mins into the run:
Class | Instances | Size (bytes) | Percentage of Heap (%)
---|---|---|---
byte[] | 46,075,938 | 14,785,540,816 | 73.006
java.util.TreeMap | 10,275,938 | 493,245,048 | 2.435
java.lang.Object[] | 6,453,036 | 243,666,964 | 1.203
java.lang.Long | 6,884,497 | 165,227,928 | 0.816

(the remaining org.apache.* and java.util.* rows in this capture were truncated)
NOTE: In the table above, many classes are in the 3.3 million instance range. Are those all related to each other? And some are in the 6 million instance range, which could be viewed as two instances for every one of the 3.3 million objects. Again, are those related?
-------
And the last object count I took, at 463 mins into the run:
Class | Instances | Size (bytes) | Percentage of Heap (%)
---|---|---|---
byte[] | 60,097,314 | 15,249,837,392 | 65.914
java.util.TreeMap | 13,563,104 | 651,028,968 | 2.814
java.lang.Object[] | 8,428,334 | 315,143,268 | 1.362
java.lang.Long | 9,104,852 | 218,516,460 | 0.944
java.util.TreeSet | 9,001,684 | 144,026,952 | 0.623

(the remaining org.apache.* and java.util.* rows in this capture were truncated)
Changed in trafodion:
milestone: none → r1.0
assignee: nobody → Joanie Cooper (joanie-cooper)
status: New → In Progress
I’ve been researching two hbase-trx coprocessor memory leaks: one in the TrxTransactionState object and one in the TransactionalRegionScannerHolder object.
For the TrxTransactionState memory leak, we were a little too conservative in cleaning up dependent objects. I have added additional code to clean up the TrxTransactionState objects when the transaction state is ABORT or READ_ONLY. This fix was submitted with git job 955.
Git job 955 is having problems passing seabase gate tests due to an intermittent
failure in seabase/TEST010.
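As a rough illustration of the cleanup path described above (a sketch only, with stand-in class and member names such as DemoTransactionState, writeSet, and dependents rather than the actual TrxTransactionState fields):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for TrxTransactionState. The real class holds far
// more state (write sets, scan ranges, dependent transactions, etc.).
class DemoTransactionState {

    enum Status { PENDING, COMMITTED, ABORTED, READ_ONLY }

    private Status status = Status.PENDING;
    // Hypothetical dependent structures that pin byte[]/TreeMap data on the heap.
    private List<byte[]> writeSet = new ArrayList<>();
    private List<DemoTransactionState> dependents = new ArrayList<>();

    void setStatus(Status newStatus) {
        this.status = newStatus;
        // The idea behind the fix: once a transaction ends up aborted or
        // read-only it can no longer conflict with anything, so its dependent
        // structures can be released right away instead of waiting for the
        // more conservative cleanup path.
        if (newStatus == Status.ABORTED || newStatus == Status.READ_ONLY) {
            clearDependentState();
        }
    }

    private void clearDependentState() {
        // Dropping these references lets the regionserver GC reclaim the
        // structures this transaction was holding onto.
        writeSet = new ArrayList<>();
        dependents = new ArrayList<>();
    }

    Status getStatus() { return status; }
}
```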
The second memory leak is in the TransactionalRegionScannerHolder object. This object is created in the TrxRegionEndpoint coprocessor when an "openScanner" call is made and is placed on a per-region scanners map. It stays alive and is used by "performScan". The object is released and removed from the scanners map when a "closeScanner" call is made.
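A minimal sketch of that open/perform/close bookkeeping follows; the names (ScannerRegistrySketch, ScannerHolder) are hypothetical and only approximate the hbase-trx structures:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the scanner-holder bookkeeping. These are not the
// actual hbase-trx signatures, just the shape of the lifecycle.
class ScannerRegistrySketch {

    static class ScannerHolder {
        final long scannerId;
        final long transactionId;
        final long createTimeMillis = System.currentTimeMillis();

        ScannerHolder(long scannerId, long transactionId) {
            this.scannerId = scannerId;
            this.transactionId = transactionId;
        }
    }

    // One map per region in the real coprocessor; a single map here.
    private final Map<Long, ScannerHolder> scanners = new ConcurrentHashMap<>();
    private final AtomicLong nextScannerId = new AtomicLong();

    // "openScanner": create a holder and register it on the map.
    long openScanner(long transactionId) {
        long id = nextScannerId.incrementAndGet();
        scanners.put(id, new ScannerHolder(id, transactionId));
        return id;
    }

    // "performScan": look the holder up and use it; it stays registered.
    ScannerHolder performScan(long scannerId) {
        return scanners.get(scannerId);
    }

    // "closeScanner": the only place the holder is released. If this call
    // never arrives, the holder stays in the map forever -- the leak
    // described below.
    void closeScanner(long scannerId) {
        scanners.remove(scannerId);
    }

    int openScannerCount() {
        return scanners.size();
    }
}
```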
I added some extra line tracing and can see a possible leak.
It looks like for some transactions we perform the full "openScanner", "performScan", and "closeScanner" series of calls. This clears the TransactionalRegionScannerHolder object out of the scanners map.
For others, we only perform an "openScanner". We never seem to get a "performScan" or "closeScanner" call.
This leaves the TransactionalRegionScannerHolder scanners map constantly growing, holding objects that can never be cleaned up. This would definitely have performance implications.
We are looking at the hbase-trx endpoint coprocessor client/server interaction for a change in behavior.
However, scanners are normally managed and cleaned using leases. We do not use leases with our scanners. We allow the underlying hbase scanner to be managed by the client scanner lease timeout. If we do have TransactionalRegionScannerHolder scanners that are left stranded, we have no mechanism to clean them out.
I've added a cleaner method to the hbase-trx coprocessor endpoint chore thread. This tests for stale
hbase-trx scanners and cleans them at regular intervals.
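A rough sketch of what such a chore-style sweep can look like, using a plain ScheduledExecutorService and an assumed 10-minute staleness threshold in place of the actual coprocessor chore thread and its configuration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a periodic cleaner for stranded scanner holders. The real fix
// hangs this off the TrxRegionEndpoint chore thread; the threshold and the
// data structure here are illustrative only.
class StaleScannerCleanerSketch {

    // scannerId -> last-touched timestamp (millis); stands in for the
    // per-region scanners map of holder objects.
    private final Map<Long, Long> scannerLastTouched = new ConcurrentHashMap<>();

    // Scanners idle longer than this are treated as stranded (assumed value).
    private static final long STALE_THRESHOLD_MS = TimeUnit.MINUTES.toMillis(10);

    private final ScheduledExecutorService chore =
            Executors.newSingleThreadScheduledExecutor();

    void start() {
        // Run the sweep at a regular interval, like the coprocessor chore thread.
        chore.scheduleAtFixedRate(this::sweepStaleScanners, 1, 1, TimeUnit.MINUTES);
    }

    // Record activity for a scanner (open or performScan).
    void touch(long scannerId) {
        scannerLastTouched.put(scannerId, System.currentTimeMillis());
    }

    // Remove any scanner that has not been touched within the threshold,
    // so holders whose closeScanner call never arrives cannot accumulate.
    void sweepStaleScanners() {
        long now = System.currentTimeMillis();
        scannerLastTouched.entrySet()
                .removeIf(e -> now - e.getValue() > STALE_THRESHOLD_MS);
    }

    void stop() {
        chore.shutdownNow();
    }
}
```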
This could be directly correlated with the recently observed increase in hbase scanners experiencing client scanner timeouts.
These changes are being tested with the performance team on zircon4.