CDH NameNode abnormal stop: Error: flush failed for required journal (JournalAndStream(mgr=QJM to

The error log is as follows (entries are listed newest first, so read it from the bottom up):

2020-12-09 14:07:56,509 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx:8485, xxx:8485, xxx:8485], stream=QuorumOutputStream starting at txid 74798133))
2020-12-09 14:07:56,499 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 74798133
        at java.lang.Thread.run(Thread.java:748)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:243)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:711)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:521)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:55)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:385)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:525)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:109)
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:286)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27401)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
xxx:8485: IPC's epoch 33 is less than the last promised epoch 34

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27401)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
xxx:8485: IPC's epoch 33 is less than the last promised epoch 34

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27401)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
xxx:8485: IPC's epoch 33 is less than the last promised epoch 34
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
2020-12-09 14:07:56,496 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx:8485, xxx:8485, xxx:8485], stream=QuorumOutputStream starting at txid 74798133))
2020-12-09 14:07:56,494 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 7611ms to send a batch of 2 edits (179 bytes) to remote journal xxx:8485
        at java.lang.Thread.run(Thread.java:748)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:389)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:396)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:187)
        at com.sun.proxy.$Proxy19.journal(Unknown Source)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.Client.call(Client.java:1355)
        at org.apache.hadoop.ipc.Client.call(Client.java:1445)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27401)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 33 is less than the last promised epoch 34
2020-12-09 14:07:56,492 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal xxx:8485 failed to write txns 74798134-74798135. Will try to write to this JN again after the next log roll.
]
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27401)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
, xxx:8485: IPC's epoch 33 is less than the last promised epoch 34
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27401)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
2020-12-09 14:07:55,886 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. Exceptions so far: [xxx:8485: IPC's epoch 33 is less than the last promised epoch 34
]
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27401)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
, xxx:8485: IPC's epoch 33 is less than the last promised epoch 34
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27401)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
2020-12-09 14:07:54,883 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. Exceptions so far: [xxx:8485: IPC's epoch 33 is less than the last promised epoch 34
        at java.lang.Thread.run(Thread.java:748)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:389)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:396)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:187)
        at com.sun.proxy.$Proxy19.journal(Unknown Source)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.Client.call(Client.java:1355)
        at org.apache.hadoop.ipc.Client.call(Client.java:1445)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27401)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 33 is less than the last promised epoch 34
2020-12-09 14:07:49,776 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal xxx:8485 failed to write txns 74798134-74798135. Will try to write to this JN again after the next log roll.
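
The line that matters in the log above is the JournalNode rejection. In QJM, every NameNode that becomes the writer obtains a new epoch number, and JournalNodes refuse writes that carry an older epoch. As a minimal sketch (the function name find_epoch_rejections is illustrative and not part of Hadoop), the epoch pairs can be pulled out of a long log like this:

```python
import re

# Pattern for the JournalNode rejection message seen in the log above.
EPOCH_RE = re.compile(r"IPC's epoch (\d+) is less than the last promised epoch (\d+)")


def find_epoch_rejections(log_text):
    """Return the distinct (writer_epoch, promised_epoch) pairs found in log_text."""
    pairs = {(int(writer), int(promised)) for writer, promised in EPOCH_RE.findall(log_text)}
    return sorted(pairs)


# Two lines copied from the log above.
sample = (
    "xxx:8485: IPC's epoch 33 is less than the last promised epoch 34\n"
    "org.apache.hadoop.ipc.RemoteException(java.io.IOException): "
    "IPC's epoch 33 is less than the last promised epoch 34\n"
)

for writer, promised in find_epoch_rejections(sample):
    print(f"writer epoch {writer} rejected; JournalNodes already promised epoch {promised}")
```

If the promised epoch is higher than the writer's epoch, another NameNode has probably taken over as the writer in the meantime, so the next step is to find out why this NameNode fell behind.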

With HA configured, one of the NameNodes stopped. The key message is “IPC's epoch 33 is less than the last promised epoch 34”: the JournalNodes had already promised a newer epoch (34) to another writer, so this NameNode, still on epoch 33, could no longer write edits and shut itself down. The cause is probably a network failure. According to the log, every time the other NameNode starts, it probes port 8485 on the three JournalNode services and reports a failure, which again points to a network problem. The troubleshooting steps were as follows:
ifconfig -a: check whether the network card is dropping packets
check that SELINUX=disabled is set correctly in /etc/sysconfig/selinux
/etc/init.d/iptables status: check whether the firewall is running; Hadoop runs in an intranet environment, and as far as I remember the firewall was turned off when the cluster was deployed
check the firewalls on the three JournalNode servers one by one; they were all turned off
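
The checks above can be complemented by testing directly whether port 8485 accepts TCP connections from the NameNode host. The following is only a sketch under assumptions: the helper names port_open and check_journalnodes are made up for illustration, and the host names in the usage note are placeholders.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_journalnodes(hosts, port=8485, timeout=3.0):
    """Map each JournalNode host to whether its RPC port accepted a connection."""
    return {host: port_open(host, port, timeout) for host in hosts}
```

For example, check_journalnodes(["jn1", "jn2", "jn3"]) run on the NameNode host returns a dictionary such as {"jn1": True, ...}. A successful connection only shows that the port is reachable at that moment; it does not rule out intermittent packet loss.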

Solutions found online:
1) Increase the JournalNode write timeout,
for example dfs.qjournal.write-txns.timeout.ms = 90000

In practice this kind of timeout happens easily in production, so the default 20 s timeout should be raised to a larger value such as 60 s or 90 s.

Add the following property to hdfs-site.xml in the etc/hadoop directory of the Hadoop installation:

<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>

On a CDH cluster, search for dfs.qjournal.write-txns.timeout.ms in the HDFS configuration interface and change the value there.
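
On a cluster where hdfs-site.xml is edited by hand, the timeout change can also be scripted. This is a minimal sketch, not an official tool: the function set_property is hypothetical, and it uses only Python's standard xml.etree module.

```python
import xml.etree.ElementTree as ET


def set_property(hdfs_site_xml, name, value):
    """Return hdfs-site.xml text with the named property set to value (added if absent)."""
    root = ET.fromstring(hdfs_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = str(value)
            break
    else:
        # No existing entry: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


original = "<configuration></configuration>"
updated = set_property(original, "dfs.qjournal.write-txns.timeout.ms", 60000)
print(updated)
```

Note that this parser drops XML comments when the file is rewritten, so keep a backup of the original file. The NameNode has to be restarted before the new value takes effect.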
2) Adjust the NameNode's Java parameters so that full GC is triggered earlier, which keeps each full GC shorter.
3) The NameNode's default full GC mode is Parallel GC, which is stop-the-world (STW); switch it to CMS. Adjust the NameNode startup parameters:
-XX:+UseCompressedOops
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
-XX:SoftRefLRUPolicyMSPerMB=0
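
JVM -XX options are case-sensitive and must not contain spaces inside a flag, and the JVM refuses to start on an option it does not recognize. Before pasting an option string into the NameNode configuration, a quick syntax check can catch copy-and-paste damage. The sketch below is an assumption-laden helper (malformed_flags is a made-up name); it checks only the shape of each token, not whether the option exists in a given JVM version.

```python
import re

# Accepts -XX:+Name or -XX:-Name (boolean flags) and -XX:Name=value (valued flags).
FLAG_RE = re.compile(r"^-XX:(?:[+-][A-Za-z][A-Za-z0-9]*|[A-Za-z][A-Za-z0-9]*=\S+)$")


def malformed_flags(options):
    """Return the whitespace-separated tokens of options that are not well-formed -XX flags."""
    return [token for token in options.split() if not FLAG_RE.match(token)]


good = "-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75"
bad = "-XX: +UseParNewGC"

print(malformed_flags(good))
print(malformed_flags(bad))
```

An empty list means every token has a valid shape; any token returned should be retyped by hand.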
