hadoop2 - NameNodes fail to start on HA cluster - FATAL errors in JournalNode logs


I am having a problem with my Hadoop cluster. It is in HA mode, and both NameNodes fail to start.

CentOS 7.3, Hortonworks Ambari 2.4.2, Hortonworks HDP 2.5.3

Ambari stderr:

2017-04-06 10:49:49,039 - Getting jmx metrics from NN failed. URL: http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl -s 'http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmp0cnzmd 2>/tmp/tmprazgwz' returned 7.

2017-04-06 10:49:51,041 - Getting jmx metrics from NN failed. URL: http://master03.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl -s 'http://master03.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmp_hlny7 2>/tmp/tmpocott8' returned 7.

... (tries several times, then) ...

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
    NameNode().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
    upgrade_suspended=params.upgrade_suspended, env=env)
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 184, in namenode
    if is_this_namenode_active() is False:
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
    return function(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 562, in is_this_namenode_active
    raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
resource_management.core.exceptions.Fail: The NameNode nn1 is not listed as Active or Standby, waiting...
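As far as I understand, curl's exit code 7 just means "failed to connect to host", so Ambari never got any JMX response from either NameNode. Something like the following (hostname and port taken from the log above) shows the same check done manually:

curl -v 'http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
echo "curl exit code: $?"   # 7 here means nothing is listening on 50070, i.e. the NameNode web UI is down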

Ambari stdout:

2017-04-06 10:53:20,521 - call returned (255, '17/04/06 10:53:20 INFO ipc.Client: Retrying connect to server: master03.mydomain.local/10.0.109.21:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)\n17/04/06 10:53:20 WARN ipc.Client: Failed to connect to server: master03.mydomain.local/10.0.109.21:8020: retries get failed due to exceeded maximum allowed retries number: 1
2017-04-06 10:53:20,522 - No active NameNode was found after 5 retries. Will return current NameNode HA states
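The HA state Ambari is polling for can also be queried directly. A minimal check, assuming the usual Ambari service IDs nn1 and nn2 (the real IDs are listed under dfs.ha.namenodes.<nameservice> in hdfs-site.xml):

sudo -u hdfs hdfs haadmin -getServiceState nn1   # prints "active" or "standby" once the NameNode is up
sudo -u hdfs hdfs haadmin -getServiceState nn2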

NameNode log:

2017-04-06 10:11:43,561 FATAL Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.0.109.20:8485, 10.0.109.21:8485, 10.0.109.22:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1 endTxId: 1 isInProgress: true but logger 10.0.109.20:8485 had seen txid 1865764 committed
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(QuorumJournalManager.java:336)
        at (some class at other class at ...)

Some more logs from the NameNode:

2017-04-06 10:11:42,380 INFO  ipc.Server (Server.java:logException(2401)) - IPC Server handler 72 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 9.1.10.14:37173 Call#2322 Retry#0
org.apache.hadoop.ipc.RetriableException: NameNode still not started
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:2057)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1414)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:118)
        at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:29064)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
2017-04-06 10:11:42,390 INFO  namenode.NameNode (NameNode.java:startCommonServices(876)) - NameNode RPC up at: bigm02.etstur.local/9.1.10.21:8020
2017-04-06 10:11:42,391 INFO  namenode.FSNamesystem (FSNamesystem.java:startStandbyServices(1286)) - Starting services required for standby state
2017-04-06 10:11:42,393 INFO  ha.EditLogTailer (EditLogTailer.java:<init>(117)) - Will roll logs on active node at bigm03.etstur.local/9.1.10.22:8020 every 120 seconds.
2017-04-06 10:11:42,397 INFO  ha.StandbyCheckpointer (StandbyCheckpointer.java:start(129)) - Starting standby checkpoint thread...
Checkpointing active NN at http://bigm03.etstur.local:50070
Serving checkpoints at http://bigm02.etstur.local:50070
2017-04-06 10:11:43,371 INFO  namenode.FSNamesystem (FSNamesystem.java:stopStandbyServices(1329)) - Stopping services started for standby state
2017-04-06 10:11:43,372 WARN  ha.EditLogTailer (EditLogTailer.java:doWork(349)) - Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:347)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:476)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
2017-04-06 10:11:43,475 INFO  namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1130)) - Starting services required for active state
2017-04-06 10:11:43,485 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(435)) - Starting recovery process for unclosed journal segments...
2017-04-06 10:11:43,534 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(437)) - Started new epoch 17
2017-04-06 10:11:43,535 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(263)) - Beginning recovery of unclosed segment starting at txid 1
2017-04-06 10:11:43,557 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(272)) - Recovery prepare phase complete. Responses:
9.1.10.20:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 14 lastCommittedTxId: 1865764
9.1.10.21:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 14 lastCommittedTxId: 1865764
2017-04-06 10:11:43,560 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(296)) - Using longest log: 9.1.10.20:8485=segmentState {
  startTxId: 1
  endTxId: 1
  isInProgress: true
}
lastWriterEpoch: 14
lastCommittedTxId: 1865764

2017-04-06 10:11:43,561 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [9.1.10.20:8485, 9.1.10.21:8485, 9.1.10.22:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1 endTxId: 1 isInProgress: true but logger 9.1.10.20:8485 had seen txid 1865764 committed
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(QuorumJournalManager.java:336)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:455)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:621)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1459)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1139)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1915)
        at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
        at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1783)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1631)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
        at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
2017-04-06 10:11:43,562 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2017-04-06 10:11:43,563 INFO  namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bigm02.etstur.local/9.1.10.21
************************************************************/
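If it helps in reading that stack trace: the recovery decided to synchronize a segment that starts and ends at txid 1, while the JournalNode already reports txid 1865764 as committed, so the journal metadata and the edit segments look inconsistent. What each JournalNode actually has on disk can be checked directly; the path below is an assumption (it is whatever dfs.journalnode.edits.dir plus the nameservice ID resolves to on my nodes):

ls -l /hadoop/hdfs/journal/mycluster/current/            # edit segment files (assumed path)
cat /hadoop/hdfs/journal/mycluster/current/committed-txid
cat /hadoop/hdfs/journal/mycluster/current/last-writer-epoch
cat /hadoop/hdfs/journal/mycluster/current/last-promised-epoch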

And although the JournalNodes started successfully, they have the following error, which I find suspicious:

2017-04-05 17:15:05,653 ERROR RECEIVED SIGNAL 15: SIGTERM

And the background of this error is as follows...

Yesterday I noticed that one of the DataNodes had failed and stopped. There were the following errors in its logs:

2017-04-05 15:50:11,168 ERROR datanode.DataNode (BPServiceActor.java:run(752)) - Initialization failed for Block pool <registering> (Datanode Uuid be2286f5-00d7-4758-b89a-45e2304cabe3) service to master02.mydomain.local/10.0.109.23:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:596)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1483)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1448)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:319)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:267)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:740)
        at java.lang.Thread.run(Thread.java:745)
2017-04-05 15:50:11,168 ERROR datanode.DataNode (BPServiceActor.java:run(752)) - Initialization failed for Block pool <registering> (Datanode Uuid be2286f5-00d7-4758-b89a-45e2304cabe3) service to master02.mydomain.local/10.0.109.23:8020. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 13, volumes configured: 14, volumes failed: 1, volume failures tolerated: 0

2017-04-05 17:15:36,968 INFO  common.Storage (Storage.java:tryLock(774)) - Lock on /grid/13/hadoop/hdfs/data/in_use.lock acquired by nodename 31353@data02.mydomain.local
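The "volume failures tolerated: 0" part comes from dfs.datanode.failed.volumes.tolerated, which defaults to 0, so a single bad disk is enough to stop the whole DataNode. The effective value can be checked like this (the property itself is changed in hdfs-site.xml, i.e. the HDFS configs in Ambari):

hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated   # 0 means any failed volume stops the DataNode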

Although I am seeing volume errors, I am able to browse /grid/13/.
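Being able to browse the mount does not necessarily prove the disk is healthy; the DataNode also needs the directory to be mounted read-write and writable by the hdfs user. Roughly the checks I would expect to matter here (same path as above):

mount | grep /grid/13                                        # mounted, and not read-only?
sudo -u hdfs touch /grid/13/hadoop/hdfs/data/test_write && echo writable
dmesg | tail -n 50 | grep -iE 'error|i/o'                    # recent kernel I/O errors?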

So I wanted to try the answers in the following Stack Overflow question:

datanode not starts correctly

First, I deleted the data folder under /grid/13/hadoop/hdfs (/grid/13/hadoop/hdfs/data) and tried to start the DataNode.

It failed again with the same errors, so I went for a NameNode format. The cluster is new and empty, so a solution that includes formatting is fine for me:

(In my first try I gave the block pool ID instead of the clusterID, and the command failed.)

./hdfs namenode -format -clusterid <myclusterid> 
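For reference, the clusterID for that command can be read from the VERSION file of an existing NameNode (or JournalNode) metadata directory; the path below is an assumption, it is whatever dfs.namenode.name.dir points to. My understanding is also that after reformatting one NameNode, the other one is normally re-synced with -bootstrapStandby rather than formatted as well:

grep clusterID /hadoop/hdfs/namenode/current/VERSION          # assumed dfs.namenode.name.dir
sudo -u hdfs hdfs namenode -bootstrapStandby                  # on the other NameNode, after the format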

After the format, one of the NameNodes failed. When I tried to restart the HDFS components, both NameNodes failed.

Any comments are appreciated.

