More than 5 years have passed since last update.

Ambariのオペレーションを強制終了する方法

Last updated at 2018-10-27Posted at 2018-10-27

初めに

皆さんAmbari使っていますか？
便利ですよね。

設定は管理してくれるし、サービスも管理してくれる。

でも…

まれにドツボにはまります。

この記事は、Ambariのサービス操作がにっちもさっちもいかなくなった時の最終手段を記載します。

事態の背景

Sandboxはらくちんなのですが、メモリをバカ食いして、20GB割り当ててもDockerが悲鳴を上げます。

そこで、当面使用しないサービスを、片っ端から削除していました。
削除後のAmbari画面（Flume、Supersetなどが減っていますね）

ところが、起動中にFlumeを消したところ、どうやら存在しないServiceを起動しようとしてスタックしてしまうようです。

状況調査

Ambariのサーバログを確認します。
Docker psコマンドでコンテナIDを確認し、接続します。

HDPコンテナへの接続

C:\Users\Monta>docker ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS              PORTS   
　　　　　　　　　　　~前略~                                          
3e9ed3a59f43        hortonworks/sandbox-hdp:2.6.5   "/usr/sbin/init"         8 weeks ago         Up About an hour    22/tcp, 4200/tcp, 8080/tcp 
　　　　　　　　　　　~後略~   
C:\Users\Monta>docker exec -it 3e9ed3a59f43  /bin/bash

接続後、Ambari-serverのログを確認します。
/var/log/ambari-server/ambari-server.log

ambari-server.log

[root@sandbox-hdp /]# tail -f /var/log/ambari-server/ambari-server.log

27 Oct 2018 02:05:07,476  WARN [ambari-action-scheduler] ExecutionCommandWrapper:346 - The service FLUME is not installed in the cluster. No repository version will be sent for this command.
27 Oct 2018 02:05:07,477  WARN [ambari-action-scheduler] ActionScheduler:316 - Exception received
java.lang.RuntimeException: org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
        at org.apache.ambari.server.actionmanager.ExecutionCommandWrapper.getExecutionCommand(ExecutionCommandWrapper.java:266)
        at org.apache.ambari.server.actionmanager.ActionScheduler.abortOperationsForStage(ActionScheduler.java:880)
        at org.apache.ambari.server.actionmanager.ActionScheduler.processCancelledRequestsList(ActionScheduler.java:1172)
        at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:333)
        at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:310)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
        at org.apache.ambari.server.state.cluster.ClusterImpl.getService(ClusterImpl.java:864)
        at org.apache.ambari.server.actionmanager.ExecutionCommandWrapper.getExecutionCommand(ExecutionCommandWrapper.java:237)
        ... 5 more
27 Oct 2018 02:05:08,481  WARN [ambari-action-scheduler] ExecutionCommandWrapper:346 - The service FLUME is not installed in the cluster. No repository version will be sent for this command.
27 Oct 2018 02:05:08,482  WARN [ambari-action-scheduler] ActionScheduler:316 - Exception received
java.lang.RuntimeException: org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
        at org.apache.ambari.server.actionmanager.ExecutionCommandWrapper.getExecutionCommand(ExecutionCommandWrapper.java:266)
        at org.apache.ambari.server.actionmanager.ActionScheduler.abortOperationsForStage(ActionScheduler.java:880)
        at org.apache.ambari.server.actionmanager.ActionScheduler.processCancelledRequestsList(ActionScheduler.java:1172)
        at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:333)
        at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:310)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
        at org.apache.ambari.server.state.cluster.ClusterImpl.getService(ClusterImpl.java:864)
        at org.apache.ambari.server.actionmanager.ExecutionCommandWrapper.getExecutionCommand(ExecutionCommandWrapper.java:237)

完璧にやらかしていますね。
Ambari側ではFlumeを起動しようとしていますが、うまくいかずキャンセルしようとしたところ、Flumeサービスがないということで、例外になっています。

Ambariのコード（ActionScheduler.javaなど）をしばらく眺めましたが、手を入れるのが怖いので、とりあえずDBを手動で修正することにしました。

状況調査

しばらくやみくもにDBを確認していましたが、見つからず。ググりました。
どうやら、以下のページが参考になりそう。

オペレーションのタスクIDを画面のファイル名から確認します。
いか、私の実際の画面で、taskIDは994です。

Ambariの情報はPostgreSQLに格納されているので、確認しましょう。
デフォルトではユーザはambari,パスワードはbigdataです。

チェック

[root@sandbox-hdp ~]# su postgres
bash-4.2$ psql -U ambari ambari
ambari=> select task_id, role, role_command, status from host_role_command where task_id = 994;
 task_id |     role      | role_command | status
---------+---------------+--------------+---------
     994 | FLUME_HANDLER | START        | PENDING
(1 row)
ambari=> update host_role_command set status = 'ABORTED' where task_id = 994 and status = 'PENDING';
UPDATE 1

その後、Ambari-Serverをリスタートします。

Ambariサーバリスタート

[root@sandbox-hdp /]# ambari-server restart
Using python  /usr/bin/python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start..........................
Server started listening on 8080

DB configs consistency check found warnings. See /var/log/ambari-server/ambari-server-check-database.log for more details.

再度、Operationを確認すると…

ちゃんとFlume StartがAborted扱いになっています！！

しかし、裏で処理が走っているらしく、エラーが出続けています。Ambari-serverをリスタートしても同様でした。

ambari-server.log

27 Oct 2018 02:56:04,529  WARN [ambari-action-scheduler] ActionScheduler:316 - Exception received
org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
        at org.apache.ambari.server.state.cluster.ClusterImpl.getService(ClusterImpl.java:864)
        at org.apache.ambari.server.actionmanager.ActionScheduler.processInProgressStage(ActionScheduler.java:716)
        at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:417)
        at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:310)
        at java.lang.Thread.run(Thread.java:748)
27 Oct 2018 02:56:05,537  WARN [ambari-action-scheduler] ActionScheduler:316 - Exception received
org.apache.ambari.server.ServiceNotFoundException: Service not found, clusterName=Sandbox, serviceName=FLUME
        at org.apache.ambari.server.state.cluster.ClusterImpl.getService(ClusterImpl.java:864)
        at org.apache.ambari.server.actionmanager.ActionScheduler.processInProgressStage(ActionScheduler.java:716)
        at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:417)
        at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:310)
        at java.lang.Thread.run(Thread.java:748)

Ambari DBをチェックする方法はないか？

ふと思いついて、ambari-serverコマンドをたたいたところ、check-databaseというオプションがあるようなので、とりあえず実行してみる。

check-database

[root@sandbox-hdp /]# ambari-server check-database
Using python  /usr/bin/python
Checking database
DB configs consistency check found warnings. See "/var/log/ambari-server/ambari-server-check-database.log" for more details.
Ambari Server 'check-database' completed

ログ内容を確認します。

/var/log/ambari-server/ambari-server-check-database.log

2018-10-27 03:03:20,702  INFO - ******************************* Check database started *******************************
2018-10-27 03:03:20,702  INFO - Ensuring that the schema set for Postgres is correct
2018-10-27 03:03:20,721  INFO - Checking for configs that are not mapped to any service
2018-10-27 03:03:21,191  WARN - You have config(s): application-properties-INITIAL,storm-site-INITIAL that is(are) not mapped (in serviceconfigmapping table) to any
 service!
2018-10-27 03:03:21,191  INFO - Checking for configs selected more than once
2018-10-27 03:03:21,194  INFO - Checking for hosts without state
2018-10-27 03:03:21,195  INFO - Checking host component states count equals host component desired states count
2018-10-27 03:03:21,197  INFO - Checking services and their configs
2018-10-27 03:03:21,197  INFO - Getting ambari metainfo instance
2018-10-27 03:03:24,069  INFO - Executing query 'GET_SERVICES_WITHOUT_CONFIGS'
2018-10-27 03:03:24,070  INFO - Executing query 'GET_SERVICE_CONFIG_WITHOUT_MAPPING'
2018-10-27 03:03:24,070  INFO - Getting stack info from database
2018-10-27 03:03:24,071  INFO - Executing query 'GET_SERVICES_WITH_CONFIGS'
2018-10-27 03:03:24,079  INFO - Comparing service configs from stack with configs that we got from db
2018-10-27 03:03:24,079  INFO - Getting services from metainfo
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / SQOOP
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / DRUID
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / MAPREDUCE2
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / HDFS
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / TEZ
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / OOZIE
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / ZOOKEEPER
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / HBASE
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / YARN
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / PIG
2018-10-27 03:03:24,079  INFO - Processing HDP-2.6 / RANGER
2018-10-27 03:03:24,080  INFO - Processing HDP-2.6 / HIVE
2018-10-27 03:03:24,080  INFO - Processing HDP-2.6 / SLIDER
2018-10-27 03:03:24,080  INFO - Processing HDP-2.6 / ZEPPELIN
2018-10-27 03:03:24,080  INFO - Processing HDP-2.6 / KAFKA
2018-10-27 03:03:24,080  INFO - Processing HDP-2.6 / SPARK2
2018-10-27 03:03:24,080  INFO - Comparing required service configs from stack with mapped service configs from db
2018-10-27 03:03:24,080  INFO - Getting services which has mapped configs which are not selected in clusterconfig
2018-10-27 03:03:24,087  INFO - Checking Topology tables
2018-10-27 03:03:24,089  INFO - Checking for tables with large physical size
2018-10-27 03:03:24,090  INFO - The database table host_role_command is currently 4.867 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,091  INFO - The database table execution_command is currently 3.555 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,091  INFO - The database table stage is currently 0.305 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,092  INFO - The database table request is currently 0.273 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,092  INFO - The database table alert_history is currently 1.031 MB and is within normal limits (3000.000)
2018-10-27 03:03:24,092  INFO - Checking config group host mappings
2018-10-27 03:03:24,586  INFO - ******************************* Check database completed *******************************

致命的な点はなさそうです。
原因がわからんので、その後、コンテナを再起動し、サービスも起動したところ、正常に稼働するようになりました。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up