Analysis of the trafodion.dtm.log on bronto05 demonstrates that we have a window of opportunity for an UnknownTransactionException to be received for abort and commit TransactionManager requests.

449039 2015-04-01 16:54:30,308 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449040 2015-04-01 16:54:30,309 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449041 2015-04-01 16:54:30,310 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449042 2015-04-01 16:54:30,314 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449043 2015-04-01 17:00:18,959 INFO dtm.HBaseTxClient: useForgotten is true
449044 2015-04-01 17:00:18,959 INFO dtm.HBaseTxClient: forceForgotten is false
449045 2015-04-01 17:00:18,984 INFO dtm.TmAuditTlog: forceControlPoint is false
449046 2015-04-01 17:00:18,984 INFO dtm.TmAuditTlog: useAutoFlush is false
449047 2015-04-01 17:00:18,984 INFO dtm.TmAuditTlog: ageCommitted is false
449048 2015-04-01 17:00:18,984 INFO dtm.TmAuditTlog: disableBlockCache is false
449049 2015-04-01 17:00:19,046 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
449050 2015-04-01 17:00:21,002 INFO dtm.HBaseAuditControlPoint: disableBlockCache is false
449051 2015-04-01 17:00:21,004 INFO dtm.HBaseAuditControlPoint: useAutoFlush is false
449052 2015-04-01 18:25:58,248 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
449053 2015-04-01 18:25:58,250 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
449054 2015-04-01 18:25:59,064 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
449055 2015-04-01 18:25:59,065 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
449056 2015-04-01 18:25:59,880 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
449057 2015-04-01 18:25:59,880 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
449058 2015-04-01 19:22:21,209 INFO dtm.HBaseTxClient: useForgotten is true
449059 2015-04-01 19:22:21,210 INFO dtm.HBaseTxClient: forceForgotten is false
449060 2015-04-01 19:22:21,240 INFO dtm.TmAuditTlog: forceControlPoint is false
449061 2015-04-01 19:22:21,240 INFO dtm.TmAuditTlog: useAutoFlush is false
449062 2015-04-01 19:22:21,240 INFO dtm.TmAuditTlog: ageCommitted is false
449063 2015-04-01 19:22:21,240 INFO dtm.TmAuditTlog: disableBlockCache is false
449064 2015-04-01 19:22:21,327 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
449065 2015-04-01 19:22:23,543 INFO dtm.HBaseAuditControlPoint: disableBlockCache is false
449066 2015-04-01 19:22:23,546 INFO dtm.HBaseAuditControlPoint: useAutoFlush is false
449067 2015-04-02 17:18:14,651 ERROR transactional.TransactionManager: doCommitX, received incorrect result size: 0
449068 2015-04-02 17:18:14,691 ERROR dtm.TmAuditTlog: deleteAgedEntries Exception getting results for table index 2; org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=3, exceptions:
449069 Thu Apr 02 17:18:13 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2051fb6, java.io.IOException: Call to bronto08.usa.hp.com/15.250.49.15:60020 failed on local exception: java.io.IOException: Connection reset by peer
449070 Thu Apr 02 17:18:14 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2051fb6, java.net.ConnectException: Connection refused
449071 Thu Apr 02 17:18:14 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2051fb6, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: bronto08.usa.hp.com/15.250.49.15:60020
449072
449073 2015-04-02 17:18:14,802 ERROR dtm.TmAuditTlog: deleteAgedEntries Exception java.lang.RuntimeException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=3, exceptions:
449074 Thu Apr 02 17:18:13 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2051fb6, java.io.IOException: Call to bronto08.usa.hp.com/15.250.49.15:60020 failed on local exception: java.io.IOException: Connection reset by peer
449075 Thu Apr 02 17:18:14 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2051fb6, java.net.ConnectException: Connection refused
449076 Thu Apr 02 17:18:14 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2051fb6, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: bronto08.usa.hp.com/15.250.49.15:60020
449077
449078 2015-04-02 17:18:14,803 ERROR dtm.TmAuditTlog: addControlPoint Exception java.lang.RuntimeException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=3, exceptions:
449079 Thu Apr 02 17:18:13 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2051fb6, java.io.IOException: Call to bronto08.usa.hp.com/15.250.49.15:60020 failed on local exception: java.io.IOException: Connection reset by peer
449080 Thu Apr 02 17:18:14 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2051fb6, java.net.ConnectException: Connection refused
449081 Thu Apr 02 17:18:14 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2051fb6, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: bronto08.usa.hp.com/15.250.49.15:60020
449082
449083 2015-04-02 17:18:15,466 ERROR transactional.TransactionManager: doCommitX, received incorrect result size: 0
449084 2015-04-02 17:18:16,281 ERROR transactional.TransactionManager: doCommitX, received incorrect result size: 0
449085 2015-04-02 17:18:45,998 WARN dtm.HBaseTxClient: TRAF RCOV THREAD:Starting recovery with 1 regions to recover. First region hostname: bronto06.usa.hp.com,60020,765acd4b5ae00570ae1790a82067ac07 Recovery iterations: 0
449086 2015-04-02 17:18:47,675 ERROR dtm.HBaseAuditControlPoint: getNextAuditSeqNum IOException setting up scan for TRAFODION._DTM_.TLOG3_CONTROL_POINT
449087 2015-04-03 19:09:11,654 INFO dtm.HBaseTxClient: Exit RET_EXCEPTION prepareCommit, txid: 12884923020
449088 2015-04-03 19:09:11,658 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449089 2015-04-03 19:09:11,658 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449090 2015-04-03 19:09:11,659 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449091 2015-04-03 19:12:22,076 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449092 2015-04-03 19:12:22,076 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449093 2015-04-03 19:12:22,077 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449094 2015-04-03 19:12:22,077 ERROR transactional.TransactionManager: Abort HasException true: java.io.IOException: UnknownTransactionException
449095 2015-04-03 19:12:39,311 ERROR transactional.TransactionManager: doCommitX, coprocessor UnknownTransactionException: java.io.IOException: UnknownTransactionException
449096 2015-04-03 19:12:39,311 ERROR transactional.TransactionManager: exception in doCommitX for transaction: 12884923027 org.apache.hadoop.hbase.client.transactional.UnknownTransactionException
449097 2015-04-03 19:12:39,312 ERROR dtm.HBaseTxClient: Returning from HBaseTxClient:completeRequest, ts.completeRequest: EXCEPTION txid: 12884923027
449098 2015-04-03 19:12:44,356 ERROR transactional.TransactionManager: doCommitX, coprocessor UnknownTransactionException: java.io.IOException: UnknownTransactionException
449099 2015-04-03 19:12:44,356 ERROR transactional.TransactionManager: exception in doCommitX for transaction: 12884923028 org.apache.hadoop.hbase.client.transactional.UnknownTransactionException
449100 2015-04-03 19:12:44,381 ERROR dtm.HBaseTxClient: Returning from HBaseTxClient:completeRequest, ts.completeRequest: EXCEPTION txid: 12884923028
449101 2015-04-03 19:31:24,806 INFO dtm.HBaseTxClient: useForgotten is true
449102 2015-04-03 19:31:24,807 INFO dtm.HBaseTxClient: forceForgotten is false
449103 2015-04-03 19:31:24,826 INFO dtm.TmAuditTlog: forceControlPoint is false
449104 2015-04-03 19:31:24,826 INFO dtm.TmAuditTlog: useAutoFlush is false
449105 2015-04-03 19:31:24,826 INFO dtm.TmAuditTlog: ageCommitted is false
449106 2015-04-03 19:31:24,826 INFO dtm.TmAuditTlog: disableBlockCache is false
449107 2015-04-03 19:31:24,879 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
449108 2015-04-03 19:31:26,528 INFO dtm.HBaseAuditControlPoint: disableBlockCache is false
449109 2015-04-03 19:31:26,529 INFO dtm.HBaseAuditControlPoint: useAutoFlush is false
449110 2015-04-06 01:06:26,368 INFO dtm.HBaseTxClient: useForgotten is true
449111 2015-04-06 01:06:26,369 INFO dtm.HBaseTxClient: forceForgotten is false
449112 2015-04-06 01:06:26,399 INFO dtm.TmAuditTlog: forceControlPoint is false
449113 2015-04-06 01:06:26,399 INFO dtm.TmAuditTlog: useAutoFlush is false
449114 2015-04-06 01:06:26,399 INFO dtm.TmAuditTlog: ageCommitted is false
449115 2015-04-06 01:06:26,399 INFO dtm.TmAuditTlog: disableBlockCache is false
449116 2015-04-06 01:06:26,475 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
449117 2015-04-06 01:06:28,349 INFO dtm.HBaseAuditControlPoint: disableBlockCache is false
449118 2015-04-06 01:06:28,351 INFO dtm.HBaseAuditControlPoint: useAutoFlush is false

[seapilot@bronto05 logs]$ grep "received incorrect" trafodion*dtm* |more
trafodion.dtm.log:2015-04-01 18:25:58,248 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
trafodion.dtm.log:2015-04-01 18:25:58,250 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
trafodion.dtm.log:2015-04-01 18:25:59,064 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
trafodion.dtm.log:2015-04-01 18:25:59,065 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
trafodion.dtm.log:2015-04-01 18:25:59,880 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
trafodion.dtm.log:2015-04-01 18:25:59,880 ERROR transactional.TransactionManager: doAbortX, received incorrect result size: 0
trafodion.dtm.log:2015-04-02 17:18:14,651 ERROR transactional.TransactionManager: doCommitX, received incorrect result size: 0
trafodion.dtm.log:2015-04-02 17:18:15,466 ERROR transactional.TransactionManager: doCommitX, received incorrect result size: 0
trafodion.dtm.log:2015-04-02 17:18:16,281 ERROR transactional.TransactionManager: doCommitX, received incorrect result size: 0

A grep for "UnknownTransaction" returns more than 10 million matching lines:

[seapilot@bronto05 logs]$ grep UnknownTransaction trafodion.dtm* |wc
10174993 91574941 1612100211

The TransactionManager logic does not always increment the retry counter of its do-while loop: in conditions where refresh is set to false, the counter is left unchanged. Because the counter never advances in those cases, we continually resend the request to a region for a transaction identifier that is not in the region's current transactionId list.
The loop cannot be exited, and the millions of exceptions it generates flood the system and render the instance virtually unusable.
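
The failure mode described above can be illustrated with a minimal, self-contained sketch of a bounded retry loop. This is not the actual TransactionManager code: the names BoundedRetrySketch, RegionCall, sendWithBoundedRetry and the nested UnknownTransactionException stand-in are hypothetical, and the retry limit of 3 is an assumed value. The only point it shows is that the retry counter is incremented on every failed attempt, whether or not a refresh is requested, so the do-while loop always terminates.

```java
import java.io.IOException;

public class BoundedRetrySketch {

    /** Hypothetical stand-in for one request sent to a region. */
    public interface RegionCall {
        void invoke(long transactionId) throws IOException;
    }

    /** Hypothetical stand-in for the exception reported in the log. */
    public static class UnknownTransactionException extends IOException {
        public UnknownTransactionException(long transactionId) {
            super("UnknownTransactionException: " + transactionId);
        }
    }

    /**
     * Sends the request, retrying on failure at most maxRetries times in total.
     * Returns the number of attempts made when the request succeeds.
     */
    public static int sendWithBoundedRetry(RegionCall call, long transactionId, int maxRetries)
            throws IOException {
        int retryCount = 0;
        boolean retry;
        do {
            retry = false;
            try {
                call.invoke(transactionId);
            } catch (IOException e) {
                // Assumption for this sketch: an unknown transaction does not
                // call for a refresh, while other failures do.
                boolean refresh = !(e instanceof UnknownTransactionException);
                if (refresh) {
                    // A real implementation might re-read region locations here.
                }
                // Count the attempt regardless of the value of refresh, so the
                // refresh == false path can no longer loop forever.
                retryCount++;
                if (retryCount >= maxRetries) {
                    throw new IOException("Giving up on transaction " + transactionId
                            + " after " + retryCount + " attempts", e);
                }
                retry = true;
            }
        } while (retry);
        return retryCount + 1;
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        try {
            // A region that never knows the transaction: every attempt fails.
            sendWithBoundedRetry(txId -> {
                calls[0]++;
                throw new UnknownTransactionException(txId);
            }, 12884923027L, 3);
        } catch (IOException e) {
            System.out.println(e.getMessage() + " (region called " + calls[0] + " times)");
        }
    }
}
```

Whether an UnknownTransactionException should be retried at all, or should end the loop immediately, is a separate design decision that this sketch does not settle; it only bounds the number of attempts.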