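Overview
This article tries out a liveness (up/down) monitoring setup built on blackbox_exporter.
Sysdig itself has no built-in liveness or external (synthetic) monitoring, so a proxy server like the one described here is set up to fill that gap in infrastructure monitoring.
In addition, Sysdig Agent 10.5.0 and later support Prometheus-native scrape configuration, which is what this article uses.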
As of agent v10.5.0, Sysdig supports the native Prometheus service discovery and you can configure in prometheus.yaml in the same way you do for native Prometheus.
The new version of promscrape is named promscrape.v2.
promscrape.v2 supports all types of scraping configuration, such as federation, blackbox-exporter, and so on.
ibmcloud CLI login
The setup is done from the CLI.
export REGION="jp-tok"
export RESOURCE_GROUP="khayama-rg"
ibmcloud login -a cloud.ibm.com -r $REGION -g $RESOURCE_GROUP
Creating the Sysdig service instance
Provision IBM Cloud Monitoring with Sysdig as the place where the monitoring results are aggregated.
# ibmcloud catalog service sysdig-monitor
export SYSDIG_NAME=khayama-sysdig
export SYSDIG_PLAN=graduated-tier # or lite
ibmcloud resource service-instance-create $SYSDIG_NAME sysdig-monitor $SYSDIG_PLAN $REGION
Creating the Sysdig service credentials
Get the instance ID of the provisioned Sysdig instance.
export SYSDIG_ID=$(ibmcloud resource service-instance --output JSON $SYSDIG_NAME | jq -r '.[].id')
echo $SYSDIG_ID
Create the credentials needed to install the agent.
ibmcloud resource service-key-create "$SYSDIG_NAME"-service-key --instance-id $SYSDIG_ID
ibmcloud resource service-keys
Check the Sysdig access key.
export SYSDIG_ACCESS_KEY=$(ibmcloud resource service-key --output JSON "$SYSDIG_NAME"-service-key | jq -r '.[].credentials."Sysdig Access Key"')
echo $SYSDIG_ACCESS_KEY
Preparing the proxy server environment
The proxy server runs Red Hat Enterprise Linux.
[root@khayama-proxy ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
Installing the Sysdig Agent
Pick the private endpoint from "Regions and endpoints", then install the agent on the proxy node by following "Configuring a Sysdig agent".
[root@khayama-proxy ~]# SYSDIG_ACCESS_KEY=xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbcccccc
[root@khayama-proxy ~]# COLLECTOR_ENDPOINT=ingest.private.jp-tok.monitoring.cloud.ibm.com
[root@khayama-proxy ~]# curl -sL https://ibm.biz/install-sysdig-agent | sudo bash -s -- --access_key $SYSDIG_ACCESS_KEY --collector $COLLECTOR_ENDPOINT --collector_port 6443 --secure true --tags role:proxy,location:tok04 --additional_conf 'sysdig_capture_enabled: false'
* Detecting operating system
* Installing Sysdig public key
* Installing Sysdig repository
* Installing kernel headers
* Installing Sysdig Agent
* Setting access key
* Setting tags
* Setting collector endpoint
* Setting collector port
* Setting connection security
* Adding additional configuration to dragent.yaml
Restarting dragent (via systemctl): [ OK ]
Confirm that the Sysdig Agent version is 10.5.0 or later.
[root@khayama-proxy ~]# /opt/draios/bin/dragent --version
10.5.0
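The version requirement can also be checked in a script rather than by eye. The following is a minimal sketch, assuming GNU `sort -V` is available (it is on RHEL 8); `version_ge` is a hypothetical helper name, and the installed version is a sample value rather than live output.

```bash
#!/usr/bin/env bash
# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# Uses GNU sort's -V (version sort) so that 10.10.0 sorts after 10.5.0.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Sample value; on the proxy server this could instead come from:
#   installed="$(/opt/draios/bin/dragent --version)"
installed="10.5.0"
required="10.5.0"

if version_ge "$installed" "$required"; then
  echo "agent $installed meets the minimum version $required"
else
  echo "agent $installed is older than $required; upgrade first"
fi
```

A plain string comparison would rank 10.10.0 below 10.5.0, which is why a version-aware sort is used here.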
The agent configuration can be checked with the following command.
[root@khayama-proxy ~]# cat /opt/draios/etc/dragent.yaml
customerid: xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbcccccc
tags: role:proxy,location:tok04
collector: ingest.private.jp-tok.monitoring.cloud.ibm.com
collector_port: 6443
ssl: true
sysdig_capture_enabled: false
Installing blackbox_exporter
Install it on the proxy server by following "prometheus/blackbox_exporter: Blackbox prober exporter".
Download the binary from GitHub and extract it.
wget -c https://github.com/prometheus/blackbox_exporter/releases/download/v0.17.0/blackbox_exporter-0.17.0.linux-amd64.tar.gz -O - | tar -xz
cd blackbox_exporter-0.17.0.linux-amd64
./blackbox_exporter -h
Place the binary and the configuration file as follows.
mv blackbox_exporter /usr/local/bin
mkdir -p /etc/blackbox
mv blackbox.yml /etc/blackbox
Add extra settings to the icmp probe.
- blackbox_exporter/CONFIGURATION.md at master · prometheus/blackbox_exporter
- blackbox_exporter/example.yml at master · prometheus/blackbox_exporter
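# Assumes the shipped blackbox.yml ends with the icmp module (prober: icmp),
# so that the indented lines appended below extend that module.
cat <<EOF >> /etc/blackbox/blackbox.yml
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
EOF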
blackbox_exporter --config.check --config.file="/etc/blackbox/blackbox.yml"
Configure it as a service so that it keeps running.
cat <<EOF > /lib/systemd/system/blackbox.service
[Unit]
Description=Blackbox Exporter Service
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/blackbox_exporter \
--config.file=/etc/blackbox/blackbox.yml \
--web.listen-address=":9115"
Restart=always
[Install]
WantedBy=multi-user.target
EOF
Finally, grant the required permission and start blackbox.service.
# to allow any user the ability to use ping
sysctl -w net.ipv4.ping_group_range='0 2147483647'
systemctl enable blackbox.service
systemctl start blackbox.service
systemctl status blackbox.service
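The `net.ipv4.ping_group_range` setting above holds an inclusive range of group IDs that are allowed to open unprivileged ICMP echo sockets; the kernel default is "1 0", an empty range. The following is a minimal sketch of that rule, assuming a hypothetical group ID of 1000 and using sample ranges instead of reading the live value; `gid_in_ping_range` is a name invented for illustration.

```bash
#!/usr/bin/env bash
# Succeeds when group ID $1 lies inside the inclusive range $2..$3.
gid_in_ping_range() {
  local gid="$1" low="$2" high="$3"
  [ "$gid" -ge "$low" ] && [ "$gid" -le "$high" ]
}

# Sample ranges: the one set above, and the kernel default "1 0" (empty range).
# On the proxy server the live range could be read with:
#   read -r low high < /proc/sys/net/ipv4/ping_group_range
for range in "0 2147483647" "1 0"; do
  read -r low high <<< "$range"
  if gid_in_ping_range 1000 "$low" "$high"; then
    echo "range '$range': gid 1000 may open ICMP echo sockets"
  else
    echo "range '$range': gid 1000 may not open ICMP echo sockets"
  fi
done
```

Note that `sysctl -w` changes only the running kernel; keeping the value across reboots would need an entry under /etc/sysctl.d/.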
Verifying blackbox_exporter
probe_success 1 means the target is up.
To see more detail, add the debug=true parameter as follows.
[root@khayama-proxy ~]# curl "http://localhost:9115/probe?module=icmp&target=10.193.37.176&debug=true"
Logs for the probe:
ts=2020-09-23T07:22:53.810006347Z caller=main.go:304 module=icmp target=10.193.37.176 level=info msg="Beginning probe" probe=icmp timeout_seconds=5
ts=2020-09-23T07:22:53.810227284Z caller=icmp.go:84 module=icmp target=10.193.37.176 level=info msg="Resolving target address" ip_protocol=ip4
ts=2020-09-23T07:22:53.810278023Z caller=icmp.go:84 module=icmp target=10.193.37.176 level=info msg="Resolved target address" ip=10.193.37.176
ts=2020-09-23T07:22:53.810324592Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Creating socket"
ts=2020-09-23T07:22:53.810535428Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Creating ICMP packet" seq=34885 id=25639
ts=2020-09-23T07:22:53.810577682Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Writing out packet"
ts=2020-09-23T07:22:53.810673503Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Waiting for reply packets"
ts=2020-09-23T07:22:53.813713568Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Found matching reply packet"
ts=2020-09-23T07:22:53.813791262Z caller=main.go:304 module=icmp target=10.193.37.176 level=info msg="Probe succeeded" duration_seconds=0.003687872
Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 4.7291e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.003687872
# HELP probe_icmp_duration_seconds Duration of icmp request by phase
# TYPE probe_icmp_duration_seconds gauge
probe_icmp_duration_seconds{phase="resolve"} 4.7291e-05
probe_icmp_duration_seconds{phase="rtt"} 0.003104699
probe_icmp_duration_seconds{phase="setup"} 0.00025286
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.984007035e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
Module configuration:
prober: icmp
timeout: 5s
http:
ip_protocol_fallback: true
tcp:
ip_protocol_fallback: true
icmp:
preferred_ip_protocol: ip4
ip_protocol_fallback: true
dns:
ip_protocol_fallback: true
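The probe_success line in the output above is all that is needed to decide whether a target is up. The following is a minimal sketch that turns probe output into "up", "down", or "unknown", assuming the text format shown above; `probe_state` is a hypothetical helper, and the sample text stands in for a live response.

```bash
#!/usr/bin/env bash
# Reads probe output on stdin and prints "up" when probe_success is 1,
# "down" when it has another value, and "unknown" when the line is absent.
probe_state() {
  awk '$1 == "probe_success" { found = 1; state = ($2 == 1) ? "up" : "down" }
       END { if (found) print state; else print "unknown" }'
}

# Sample input copied from the probe output format; "# HELP" and "# TYPE"
# lines are ignored because their first field is "#".
sample='# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1'

printf '%s\n' "$sample" | probe_state
```

On the proxy server, the same function could be fed from `curl -s "http://localhost:9115/probe?module=icmp&target=10.193.37.176"` instead of the sample text.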
Configuring blackbox_exporter probe targets
Set up the ICMP liveness monitoring targets, based on the Prometheus Configuration example.
Here both localhost and 10.193.37.176 are registered, to check whether the data can be viewed per segment in Sysdig.
cat <<EOF > /opt/draios/etc/prometheus.yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [icmp]  # Look for an icmp response.
    static_configs:
      - targets:
        - localhost      # Target to probe with icmp
        - 10.193.37.176  # Target to probe with icmp
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.
EOF
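The relabel rules above can be read in three steps: the original target address is copied into the `target` URL parameter, the same value becomes the `instance` label, and the address that is actually scraped is replaced with the exporter's own. The following is a minimal sketch of the request URL each target ends up with; `probe_url` is a hypothetical helper written for illustration, and the exact URL the agent sends may differ in detail.

```bash
#!/usr/bin/env bash
# Prints the probe URL for one target:
#   $1 = exporter address (the final __address__)
#   $2 = module           (from params.module)
#   $3 = probe target     (the original __address__, now the target parameter)
probe_url() {
  local exporter="$1" module="$2" target="$3"
  printf 'http://%s/probe?module=%s&target=%s\n' "$exporter" "$module" "$target"
}

# The two targets registered in static_configs.
for target in localhost 10.193.37.176; do
  probe_url 127.0.0.1:9115 icmp "$target"
done
```

Each printed URL can be passed to curl on the proxy server to reproduce by hand what a single scrape does.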
Scrape configuration for the blackbox_exporter endpoint
Because Sysdig Agent 10.5.0 and later support Prometheus-native scrape configuration, add that setting and restart the agent.
cat <<EOF >> /opt/draios/etc/dragent.yaml
prometheus:
  enabled: true
  prom_service_discovery: true
EOF
service dragent restart
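Checking the data
On the "Dashboard > Add dashboard" screen of IBM Cloud Monitoring with Sysdig, confirm that a panel can be added with the following settings.
Even for the same metric, the data can be viewed separately for each instance.
- Metrics: probe_success
- Segmentation: instance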
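Notification settings
Once the panel is created, set up email notification from "Create Alert".
(Add the email notification channel beforehand.)
Creating the alert as "Multiple Alerts" sets up alerts for every "instance" and "job" segment at once.
The notification email body can use "{{instance}}" and "{{job}}" as variables.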
Checking the notification
Power off "10.193.37.176" and check that a notification arrives.
The outage was visible in the data.
The notification email was also received.
Conclusion
This completes a liveness monitoring setup using blackbox_exporter and Sysdig.
To make the proxy server redundant, run more than one.
References
- Configuring Sysdig Agent - Sysdig Documentation
- Working with Prometheus Metrics
- How To Install and Configure Blackbox Exporter for Prometheus
Reference: Troubleshooting
Adding the following setting to dragent.yaml and restarting the agent makes detailed logs available.
cat <<EOF >> /opt/draios/etc/dragent.yaml
log:
  file_priority: debug
EOF
service dragent restart && tail -f /opt/draios/logs/draios.log
For example, log lines like the following confirm that the configuration in this article is working.
2020-09-28 04:32:47.025, 21727.21762, Debug, promscrape:1010: have metrics for job 1
2020-09-28 04:32:47.025, 21727.21762, Debug, promscrape:1010: have metrics for job 2
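Searching the debug log for these lines by hand gets tedious, so they can be filtered out instead. The following is a minimal sketch that lists the promscrape job numbers that reported metrics, assuming the log line format of the two sample lines above; `jobs_with_metrics` is a hypothetical helper, and other agent versions may word the message differently.

```bash
#!/usr/bin/env bash
# Reads agent log text on stdin and prints each distinct job number that
# appears in a "have metrics for job N" promscrape message, in numeric order.
jobs_with_metrics() {
  sed -n 's/.*promscrape:[0-9]*: have metrics for job \([0-9][0-9]*\).*/\1/p' | sort -un
}

# Sample input copied from the log lines shown above.
sample='2020-09-28 04:32:47.025, 21727.21762, Debug, promscrape:1010: have metrics for job 1
2020-09-28 04:32:47.025, 21727.21762, Debug, promscrape:1010: have metrics for job 2'

printf '%s\n' "$sample" | jobs_with_metrics
```

On the proxy server the function could read the real log with `jobs_with_metrics < /opt/draios/logs/draios.log`; empty output would suggest that no scrape job has produced metrics yet.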
Reference: Running with Python 3
Running the Sysdig Agent without Python 3 installed produces the following error.
Error, sdchecks[0] /opt/draios/lib/python-deps2.7/OpenSSL/crypto.py:12: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
Install Python 3.
dnf install python38 -y
python3 -V
alternatives --set python /usr/bin/python3
This alone still leaves the following error.
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] Traceback (most recent call last):
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/bin/sdchecks", line 33, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from sdchecks import Application
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/sdchecks.py", line 28, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] import config
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/config.py", line 22, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from util import get_os, yLoader
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/util.py", line 44, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from utils.platform import Platform
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/utils/platform.py", line 6, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from utils.dockerutil import get_client
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/utils/dockerutil.py", line 15, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from docker import Client
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/docker/__init__.py", line 6, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from .client import Client, AutoVersionClient, from_env # flake8: noqa
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/docker/client.py", line 5, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] import requests
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/requests/__init__.py", line 112, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from . import utils
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/requests/utils.py", line 39, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] DEFAULT_CA_BUNDLE_PATH = certs.where()
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/certifi/core.py", line 37, in where
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] _CACERT_PATH = str(_CACERT_CTX.__enter__())
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/usr/lib64/python3.8/contextlib.py", line 113, in __enter__
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] return next(self.gen)
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/usr/lib64/python3.8/importlib/resources.py", line 201, in path
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] with open_binary(package, resource) as fp:
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/usr/lib64/python3.8/importlib/resources.py", line 91, in open_binary
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] return reader.open_resource(resource)
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "<frozen importlib._bootstrap_external>", line 988, in open_resource
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] FileNotFoundError: [Errno 2] No such file or directory: '/opt/draios/lib/python-deps/certifi/cacert.pem'
In that case, run the following commands next.
python3 -m pip install --upgrade pip
python3 -m pip --version
python3 -m pip install certifi
cp "$(python3 -m certifi)" /opt/draios/lib/python-deps/certifi/cacert.pem
Finally, confirm that this command prints no errors.
service dragent restart && tail -f /opt/draios/logs/draios.log | grep Error
Reference: List of metrics
[root@khayama-proxy ~]# curl "http://localhost:9115/metrics"
# HELP blackbox_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which blackbox_exporter was built.
# TYPE blackbox_exporter_build_info gauge
blackbox_exporter_build_info{branch="HEAD",goversion="go1.14.4",revision="1bc768014cf6815f7e9d694e0292e77dd10f3235",version="0.17.0"} 1
# HELP blackbox_exporter_config_last_reload_success_timestamp_seconds Timestamp of the last successful configuration reload.
# TYPE blackbox_exporter_config_last_reload_success_timestamp_seconds gauge
blackbox_exporter_config_last_reload_success_timestamp_seconds 1.6008451361406033e+09
# HELP blackbox_exporter_config_last_reload_successful Blackbox exporter config loaded successfully.
# TYPE blackbox_exporter_config_last_reload_successful gauge
blackbox_exporter_config_last_reload_successful 1
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 10
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.14.4"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 834256
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 834256
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.444856e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 274
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 3.436808e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 834256
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 6.49216e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.794048e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 4258
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 6.49216e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.6715648e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 4532
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 3472
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 28152
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 32768
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.473924e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 787456
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 393216
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 393216
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.2827136e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 7
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 9
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.8173952e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.60084513566e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.33978624e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes -1
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
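A listing this long is easier to take in as a summary. The following is a minimal sketch that counts the declared metrics per type from the `# TYPE <name> <type>` lines of the Prometheus text format; `count_metric_types` is a hypothetical helper, and the sample is a shortened excerpt rather than the full output.

```bash
#!/usr/bin/env bash
# Reads Prometheus text exposition format on stdin and prints
# "<type> <count>" for each metric type, sorted by type name.
count_metric_types() {
  awk '$1 == "#" && $2 == "TYPE" { n[$4]++ }
       END { for (t in n) print t, n[t] }' | sort
}

# Shortened sample in the same format as the /metrics output above.
sample='# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 10
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 274
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds_count 0'

printf '%s\n' "$sample" | count_metric_types
```

On the proxy server the input could come from `curl -s "http://localhost:9115/metrics"` instead of the sample.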