初めに
皆さんAmbari使っていますか?
便利ですよね。
設定は管理してくれるし、サービスも管理してくれる。
でも…
まれにドツボにはまります。
この記事は、Ambariのサービス操作がにっちもさっちもいかなくなった時の最終手段を記載します。
事態の背景
Sandboxはらくちんなのですが、メモリをバカ食いして、20GB割り当ててもDockerが悲鳴を上げます。
そこで、当面使用しないサービスを、片っ端から削除していました。
削除後のAmbari画面(Flume、Supersetなどが減っていますね)
ところが、起動中にFlumeを消したところ、どうやら存在しないServiceを起動しようとしてスタックしてしまうようです。
状況調査
Ambariのサーバログを確認します。
Docker psコマンドでコンテナIDを確認し、接続します。
C:\Users\Monta>docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS
~前略~
3e9ed3a59f43 hortonworks/sandbox-hdp:2.6.5 "/usr/sbin/init" 8 weeks ago Up About an hour 22/tcp, 4200/tcp, 8080/tcp
~後略~
C:\Users\Monta>docker exec -it 3e9ed3a59f43 /bin/bash
接続後、Ambari-serverのログを確認します。
/var/log/ambari-server/ambari-server.log
[root@sandbox-hdp /]# tail -f /var/log/ambari-server/ambari-server.log
27 Oct 2018 02:05:07,476 WARN [ambari-action-scheduler] ExecutionCommandWrapper:346 - The service FLUME is not installed in the cluster. No repository version will be sent for this command.
27 Oct 2018 02:05:07,477 WARN [ambari-action-scheduler] ActionScheduler:316 - Exception received
java.lang.RuntimeException: org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
at org.apache.ambari.server.actionmanager.ExecutionCommandWrapper.getExecutionCommand(ExecutionCommandWrapper.java:266)
at org.apache.ambari.server.actionmanager.ActionScheduler.abortOperationsForStage(ActionScheduler.java:880)
at org.apache.ambari.server.actionmanager.ActionScheduler.processCancelledRequestsList(ActionScheduler.java:1172)
at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:333)
at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:310)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
at org.apache.ambari.server.state.cluster.ClusterImpl.getService(ClusterImpl.java:864)
at org.apache.ambari.server.actionmanager.ExecutionCommandWrapper.getExecutionCommand(ExecutionCommandWrapper.java:237)
... 5 more
27 Oct 2018 02:05:08,481 WARN [ambari-action-scheduler] ExecutionCommandWrapper:346 - The service FLUME is not installed in the cluster. No repository version will be sent for this command.
27 Oct 2018 02:05:08,482 WARN [ambari-action-scheduler] ActionScheduler:316 - Exception received
java.lang.RuntimeException: org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
at org.apache.ambari.server.actionmanager.ExecutionCommandWrapper.getExecutionCommand(ExecutionCommandWrapper.java:266)
at org.apache.ambari.server.actionmanager.ActionScheduler.abortOperationsForStage(ActionScheduler.java:880)
at org.apache.ambari.server.actionmanager.ActionScheduler.processCancelledRequestsList(ActionScheduler.java:1172)
at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:333)
at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:310)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
at org.apache.ambari.server.state.cluster.ClusterImpl.getService(ClusterImpl.java:864)
at org.apache.ambari.server.actionmanager.ExecutionCommandWrapper.getExecutionCommand(ExecutionCommandWrapper.java:237)
完璧にやらかしていますね。
Ambari側ではFlumeを起動しようとしていますが、うまくいかずキャンセルしようとしたところ、Flumeサービスがないということで、例外になっています。
Ambariのコード(ActionScheduler.javaなど)をしばらく眺めましたが、手を入れるのが怖いので、とりあえずDBを手動で修正することにしました。
状況調査
しばらくやみくもにDBを確認していましたが、見つからず。ググりました。
どうやら、以下のページが参考になりそう。
オペレーションのタスクIDを画面のファイル名から確認します。
いか、私の実際の画面で、taskIDは994です。
Ambariの情報はPostgreSQLに格納されているので、確認しましょう。
デフォルトでは ユーザはambari,パスワードはbigdataです。
[root@sandbox-hdp ~]# su postgres
bash-4.2$ psql -U ambari ambari
ambari=> select task_id, role, role_command, status from host_role_command where task_id = 994;
task_id | role | role_command | status
---------+---------------+--------------+---------
994 | FLUME_HANDLER | START | PENDING
(1 row)
ambari=> update host_role_command set status = 'ABORTED' where task_id = 994 and status = 'PENDING';
UPDATE 1
その後、Ambari-Serverをリスタートします。
[root@sandbox-hdp /]# ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start..........................
Server started listening on 8080
DB configs consistency check found warnings. See /var/log/ambari-server/ambari-server-check-database.log for more details.
再度、Operationを確認すると…
ちゃんとFlume StartがAborted扱いになっています!!
しかし、裏で処理が走っているらしく、エラーが出続けています。Ambari-serverをリスタートしても同様でした。
27 Oct 2018 02:56:04,529 WARN [ambari-action-scheduler] ActionScheduler:316 - Exception received
org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
at org.apache.ambari.server.state.cluster.ClusterImpl.getService(ClusterImpl.java:864)
at org.apache.ambari.server.actionmanager.ActionScheduler.processInProgressStage(ActionScheduler.java:716)
at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:417)
at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:310)
at java.lang.Thread.run(Thread.java:748)
27 Oct 2018 02:56:05,537 WARN [ambari-action-scheduler] ActionScheduler:316 - Exception received
org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
at org.apache.ambari.server.state.cluster.ClusterImpl.getService(ClusterImpl.java:864)
at org.apache.ambari.server.actionmanager.ActionScheduler.processInProgressStage(ActionScheduler.java:716)
at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:417)
at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:310)
at java.lang.Thread.run(Thread.java:748)
Ambari DBをチェックする方法はないか?
ふと思いついて、ambari-serverコマンドをたたいたところ、check-databaseというオプションがあるようなので、とりあえず実行してみる。
[root@sandbox-hdp /]# ambari-server check-database
Using python /usr/bin/python
Checking database
DB configs consistency check found warnings. See "/var/log/ambari-server/ambari-server-check-database.log" for more details.
Ambari Server 'check-database' completed
ログ内容を確認します。
2018-10-27 03:03:20,702 INFO - ******************************* Check database started *******************************
2018-10-27 03:03:20,702 INFO - Ensuring that the schema set for Postgres is correct
2018-10-27 03:03:20,721 INFO - Checking for configs that are not mapped to any service
2018-10-27 03:03:21,191 WARN - You have config(s): application-properties-INITIAL,storm-site-INITIAL that is(are) not mapped (in serviceconfigmapping table) to any
service!
2018-10-27 03:03:21,191 INFO - Checking for configs selected more than once
2018-10-27 03:03:21,194 INFO - Checking for hosts without state
2018-10-27 03:03:21,195 INFO - Checking host component states count equals host component desired states count
2018-10-27 03:03:21,197 INFO - Checking services and their configs
2018-10-27 03:03:21,197 INFO - Getting ambari metainfo instance
2018-10-27 03:03:24,069 INFO - Executing query 'GET_SERVICES_WITHOUT_CONFIGS'
2018-10-27 03:03:24,070 INFO - Executing query 'GET_SERVICE_CONFIG_WITHOUT_MAPPING'
2018-10-27 03:03:24,070 INFO - Getting stack info from database
2018-10-27 03:03:24,071 INFO - Executing query 'GET_SERVICES_WITH_CONFIGS'
2018-10-27 03:03:24,079 INFO - Comparing service configs from stack with configs that we got from db
2018-10-27 03:03:24,079 INFO - Getting services from metainfo
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / SQOOP
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / DRUID
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / MAPREDUCE2
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / HDFS
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / TEZ
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / OOZIE
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / ZOOKEEPER
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / HBASE
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / YARN
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / PIG
2018-10-27 03:03:24,079 INFO - Processing HDP-2.6 / RANGER
2018-10-27 03:03:24,080 INFO - Processing HDP-2.6 / HIVE
2018-10-27 03:03:24,080 INFO - Processing HDP-2.6 / SLIDER
2018-10-27 03:03:24,080 INFO - Processing HDP-2.6 / ZEPPELIN
2018-10-27 03:03:24,080 INFO - Processing HDP-2.6 / KAFKA
2018-10-27 03:03:24,080 INFO - Processing HDP-2.6 / SPARK2
2018-10-27 03:03:24,080 INFO - Comparing required service configs from stack with mapped service configs from db
2018-10-27 03:03:24,080 INFO - Getting services which has mapped configs which are not selected in clusterconfig
2018-10-27 03:03:24,087 INFO - Checking Topology tables
2018-10-27 03:03:24,089 INFO - Checking for tables with large physical size
2018-10-27 03:03:24,090 INFO - The database table host_role_command is currently 4.867 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,091 INFO - The database table execution_command is currently 3.555 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,091 INFO - The database table stage is currently 0.305 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,092 INFO - The database table request is currently 0.273 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,092 INFO - The database table alert_history is currently 1.031 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,092 INFO - Checking config group host mappings
2018-10-27 03:03:24,586 INFO - ******************************* Check database completed *******************************
致命的な点はなさそうです。
原因がわからんので、その後、コンテナを再起動し、サービスも起動したところ、正常に稼働するようになりました。