
JSON JSON JSON (Handling JSON in Splunk)

Last updated at 2020-08-18, posted at 2020-05-02

.conf20 has moved online, so the trip to Las Vegas is off.
Since the opportunity is there, I'm going to submit to the Call for Papers anyway, following the guidelines.

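My earlier article "Extracting fields from JSON" was about props.conf, so this time I'll take on turning JSON into a table.

You could also say I'm just packaging what I wrote in "Splitting multivalues" into a talk.

When I asked on Slack whether anyone had requests, I was told

how to handle invalid json, and json with syslog headers
so I'll do my best on those as well.

What should I add to what I wrote in "Extracting fields from JSON"?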

JSON format

The "Introducing JSON" page describes the format.
Splunk cannot extract fields unless the JSON is well-formed.

Basic form

basic1

| makeresults 
| eval _raw=" {\"A\":\"values\",\"arrays\": [{ \"B\": 1, \"C\": 2 }, { \"D\": 1, \"E\": 2 }]}"
| spath

Perfectly ordinary JSON. It comes out as a single row, as you would expect.

Basic form 2

basic2

| makeresults 
| eval _raw="{
\"name\":\"John\",
\"age\":30,
\"cars\":[ \"Ford\", \"BMW\", \"Fiat\" ]
}"
| spath

This is the shape we'll be dealing with this time. Because the array holds several _values_, Splunk treats the field as a _multivalue_.

age	cars{}	name
30	Ford BMW Fiat	John

Turning it into a table

table1

...
| stats values(age) as age values(name) as name by cars{}

or

table2

...
| mvexpand cars{}
| table cars{} age name

cars{}	age	name
BMW	30	John
Fiat	30	John
Ford	30	John

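The difference between _mvexpand_ and _stats_

mvexpand has a memory limit, as described at https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Mvexpand .
The world is a big place, and there really are cases where mvexpand cannot expand everything.

Expanding with stats by, on the other hand, is not subject to that limit.
However, every value has to be unique, because identical values are grouped into one row.
So the heart of this technique is shaping the values so that they can be expanded with stats by.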
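
To make the uniqueness requirement concrete, here is a minimal Python sketch (not SPL) that models the two ways of expanding a multivalue field. The helper names `mvexpand` and `stats_by` and the duplicated "Ford" value are assumptions made for illustration; they only imitate the row-level behavior of the Splunk commands, not their memory characteristics.

```python
def mvexpand(event, field):
    """Model of mvexpand: one row per value, duplicates are kept."""
    return [{**event, field: value} for value in event[field]]


def stats_by(event, field):
    """Model of `stats ... by field`: one row per distinct value, sorted."""
    rest = {k: v for k, v in event.items() if k != field}
    return [{field: value, **rest} for value in sorted(set(event[field]))]


# Hypothetical event: the basic2 sample with "Ford" listed twice.
event = {"name": "John", "age": 30, "cars{}": ["Ford", "BMW", "Fiat", "Ford"]}

expanded = mvexpand(event, "cars{}")
grouped = stats_by(event, "cars{}")

print(len(expanded), len(grouped))
```

In this model the duplicated value survives the mvexpand-style expansion but collapses in the grouped one, which is why the values need to be made unique before relying on stats by.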

Basic form 3

As a reference JSON:
I already expanded this data in "Displaying Japan's COVID-19 infection status in Splunk (GitHub version)", so I'll use it here.

data.json

 { [-] 
    demography: [ [+] 
    ] 
    prefectures-data: { [+] 
    } 
    prefectures-map: [ [+] 
    ] 
    transition: { [+] 
    } 
    updated: { [+] 
    } 
 }

The data is held as one array per date, so a single value can be fetched like this.

Getting a single value from the object

sourcetype=toyo_json | head 1
| spath prefectures-data.carriers{0}{0}

prefectures-data.carriers{0}{0}
2020
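That is how the value is retrieved.
It would be easier to work with if the number inside `{}` could be a variable, but in Splunk it seems the only options are _a single number_ or _nothing_.

This data is awkward to handle, so let's generalize it.

Nested arrays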

| makeresults
| eval _raw="{\"A\": [[2020, 2, 17, 38], [2020, 2, 18, 44], [2020, 2, 19, 50]]}"
| spath

A{}{}
2020 2 17 38 2020 2 18 44 2020 2 19 50

To put it mildly, this is very awkward to work with.
In general, I think it is easier to split the arrays into one row each first.

Trying it out

| makeresults
| eval _raw="{\"A\": [[2020, 2, 17, 38], [2020, 2, 18, 44], [2020, 2, 19, 50]]}"
| rex max_match=0 "(?<blacket>\[[^\][]+\])"
| mvexpand blacket
| spath input=blacket path={3} output=carriers
| eval timestamp=spath(blacket,"{0}")."0".spath(blacket,"{1}").spath(blacket,"{2}") 
| eval timestamp=strptime(timestamp,"%Y%m%d")
| table blacket timestamp carriers

blacket	timestamp	carriers
[2020, 2, 17, 38]	1581865200.000000	38
[2020, 2, 18, 44]	1581951600.000000	44
[2020, 2, 19, 50]	1582038000.000000	50
Once the data is split per array, it becomes much easier to process with `eval`.
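
For readers who want to check the logic outside Splunk, the same idea can be sketched in Python: split the outer array into one row per inner array, then build a date from the parts. The function name `rows_from_nested` and the `date` output key are assumptions for illustration. Building the date from numeric parts also sidesteps the literal "0" in the eval above, which only pads correctly for single-digit months.

```python
import json
from datetime import datetime

raw = '{"A": [[2020, 2, 17, 38], [2020, 2, 18, 44], [2020, 2, 19, 50]]}'


def rows_from_nested(raw_json, key="A"):
    """Turn [[year, month, day, count], ...] into one dict per inner array."""
    rows = []
    for year, month, day, carriers in json.loads(raw_json)[key]:
        date = datetime(year, month, day)
        rows.append({"date": date.strftime("%Y-%m-%d"), "carriers": carriers})
    return rows


rows = rows_from_nested(raw)
for row in rows:
    print(row)
```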

Basic form 4

Arrays of different lengths

| makeresults
| eval _raw="{\"A\": [38, 18, 44, 2, 19, 50],\"B\":[2, 28, 2, 17, 34]}"
| spath
| eval tmp=mvzip('A{}','B{}')

A length mismatch is a real nuisance. With mvzip, the extra values of the longer array are dropped.

Trying it out 2

| makeresults
| eval _raw="{\"A\": [38, 18, 44, 2, 19, 50],\"B\":[2, 28, 2, 17, 34]}"
| spath
| fields - _*
| eval counter=mvrange(0,mvcount('A{}'))
| mvexpand counter
| foreach A* B* [ eval <<FIELD>> = mvindex('<<FIELD>>',counter)]
| fields - counter *{}

A	B
38	2
18	28
44	2
2	17
19	34
50

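Field names containing {} do not carry through foreach well, so it is better to rename them away first.
mvexpand simply creates the rows in order from the top, so when you can use it, it is the better choice.
The alternative is stats list(*) as * by counter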
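
The counter technique can be modeled in Python as follows. This is only a sketch of the idea: `mvindex` here is a hypothetical stand-in that returns None when the index is out of range, and, unlike the search above (which counts the values of A), `expand_by_counter` uses the longest array so that it does not matter which field is longer.

```python
a = [38, 18, 44, 2, 19, 50]
b = [2, 28, 2, 17, 34]

# Pairing like mvzip stops at the shorter array, so the last value of A is lost.
zipped = list(zip(a, b))


def mvindex(values, i):
    """Stand-in for mvindex: None when the index is out of range."""
    return values[i] if 0 <= i < len(values) else None


def expand_by_counter(fields):
    """One row per counter value; shorter arrays yield None at the end."""
    longest = max(len(values) for values in fields.values())
    return [
        {name: mvindex(values, i) for name, values in fields.items()}
        for i in range(longest)
    ]


rows = expand_by_counter({"A": a, "B": b})
print(len(zipped), len(rows))
```

Because each row is built by index rather than by pairing, adding more fields only means adding more entries to the dictionary.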

Applications

Nested but clean JSON

| makeresults 
| eval _raw="{\"uid\":\"a82ee257\",\"name\":\"Throughput Utilization\",\"axisXType\":\"DateTime\",\"elementReports\":[{\"element\":{\"id\":\"001\",\"name\":\"NS-001\",\"type\":\"NetworkSegment\"},\"series\":[{\"uid\":\"3242d4e4\",\"instance\":\"0\",\"name\":\"Utilization\",\"data\":[{\"x\":1551051000000,\"y\":0.0},{\"x\":1551051300000,\"y\":3.1},{\"x\":1551136800000,\"y\":7.4},{\"x\":1551137100000,\"y\":1.6}],\"e\":1}]},{\"element\":{\"id\":\"002\",\"name\":\"NS-002\",\"type\":\"NetworkSegment\"},\"series\":[{\"uid\":\"4654d4e4\",\"instance\":\"0\",\"name\":\"Utilization\",\"data\":[{\"x\":1551051000000,\"y\":0.3},{\"x\":1551051300000,\"y\":0.0},{\"x\":1551051600000,\"y\":0.0},{\"x\":1551137100000,\"y\":2.12}],\"e\":1}]},{\"element\":{\"id\":\"003\",\"name\":\"NS-003\",\"type\":\"NetworkSegment\"},\"series\":[{\"uid\":\"2481d4e6\",\"instance\":\"0\",\"name\":\"Utilization\",\"data\":[{\"x\":1551051000000,\"y\":0.0},{\"x\":1551051300000,\"y\":0.0},{\"x\":1551051900000,\"y\":0.0},{\"x\":1551136800000,\"y\":0.0}],\"e\":1}]},{\"element\":{\"id\":\"004\",\"name\":\"NS-004\",\"type\":\"NetworkSegment\"},\"series\":[]}]}" 
| spath path=elementReports{} output=elementReports 
| mvexpand elementReports 
| spath input=elementReports path=series{}.data{} output=data 
| mvexpand data 
| spath input=elementReports 
| spath input=data 
| fields - series{}.data* elementReports data _* 
| rename series{}.* as *

spath and mvexpand are used to split the objects into rows step by step.
As for where to start cutting: from the largest unit inward.
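
The "largest unit first" order can be sketched in Python on a shortened version of the sample. The `report` dictionary below is an abbreviated, hand-made subset of the event above, and `flatten` is a hypothetical helper. Keeping an element that has no series as a row without x/y is an assumption of this sketch, not a statement about how Splunk treats that case.

```python
report = {
    "name": "Throughput Utilization",
    "elementReports": [
        {
            "element": {"id": "001", "name": "NS-001"},
            "series": [
                {
                    "uid": "3242d4e4",
                    "data": [
                        {"x": 1551051000000, "y": 0.0},
                        {"x": 1551051300000, "y": 3.1},
                    ],
                }
            ],
        },
        {"element": {"id": "004", "name": "NS-004"}, "series": []},
    ],
}


def flatten(report):
    rows = []
    # First cut: the largest unit, one elementReports entry at a time.
    for element_report in report["elementReports"]:
        element = element_report["element"]
        base = {"element.id": element["id"], "element.name": element["name"]}
        # Second cut: every data point of every series in that element.
        points = [
            (series["uid"], point)
            for series in element_report["series"]
            for point in series["data"]
        ]
        if not points:
            rows.append(base)  # assumption: keep elements without data
            continue
        for uid, point in points:
            rows.append({**base, "uid": uid, "x": point["x"], "y": point["y"]})
    return rows


rows = flatten(report)
print(len(rows))
```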

A version that doesn't need spath

Prompted by https://answers.splunk.com/answers/823169/is-there-a-way-to-have-splunk-recognize-the-nested.html
I looked carefully at the JSON and concluded that JSON field extraction was not needed at all.

props.conf

 [testing]
 SHOULD_LINEMERGE=false
 LINE_BREAKER=(,\s*){\"statementData
 NO_BINARY_CHECK=true
 TRUNCATE=0
 TIME_PREFIX=date\":\s*\"
 TIME_FORMAT=%F
 INDEXED_EXTRACTIONS = none
 KV_MODE = none
 TRANSFORMS-kv = json_kv

transforms.conf

 [json_kv]
 REGEX = {\"value\": (\S+), \"dataCode\": \"(\w+)\"}
 FORMAT = $2::$1
 REPEAT_MATCH = true
 WRITE_META = true

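Because WRITE_META is set, the fields are written into the index (or so it seems).
If nothing is specified, JSON field extraction kicks in, so both settings are explicitly set to none.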
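
To see what the REGEX and FORMAT pair in transforms.conf does, here is a Python sketch that applies the same pattern repeatedly (as REPEAT_MATCH does) and maps the second capture group to the field name and the first to its value. The sample event and the dataCode values "ABC" and "XYZ" are made up for illustration; the real data from the Answers thread is not reproduced here.

```python
import re

# Same pattern as in transforms.conf, with braces escaped for Python.
PAIR = re.compile(r'\{"value": (\S+), "dataCode": "(\w+)"\}')


def extract_pairs(event):
    """Equivalent of FORMAT = $2::$1 applied to every match."""
    return {code: value for value, code in PAIR.findall(event)}


# Hypothetical event with two value/dataCode objects.
event = (
    '{"statementData": {"overview": ['
    '{"value": 12.5, "dataCode": "ABC"}, '
    '{"value": -3, "dataCode": "XYZ"}]}}'
)

fields = extract_pairs(event)
print(fields)
```

The extracted values stay strings here, since the pattern captures raw text rather than parsed JSON.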

After INDEXED_EXTRACTIONS=json

[json_kv_null]
SOURCE = field:statementData{}.overview{}.value
DEST_KEY = queue
FORMAT = nullQueue

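I thought something like this would let me drop the field, but

I was kindly told: "That only works on whole events; for this, use INGEST_EVAL."

About INDEXED_EXTRACTIONS=json

It cannot be used for JSON-like logs that need SEDCMD.
Which means
JSON_TRIM_BRACES_IN_ARRAY_NAMES = true cannot be used either.

Capitalized booleans and single-quoted strings

true and false must be lowercase.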

invalid_json.spl

index=_internal | head 1| fields _raw
| eval _raw="[{'ID': 123, 'Name': 'hostname', 'SetupComplete': True, 'Plugin': 'someplugin', 'PluginName': 'someplugin', 'DomainName': 'something', 'DomainEmail': '', 'dontknow': '', 'Address': '1.2.3.4', 'BackupIntervalString': 'Manual', 'LastBackupString': 'Never (1 uploaded)', 'LastBackupAttemptString': 'Never', 'NextBackupString': '', 'Protocol': 'scp', 'Location': '', 'BaselineState': 'N/A', 'LastBackupCompliant': False, 'LastBackupCompliantString': 'N/A', 'ComplianceScore': -1, 'RetryInterval': 45, 'NumRetries': 0, 'KeepVersions': 0, 'Owner': 'someone@something.com', 'State': 'Idle', 'Uptime': 'Not monitored', 'BackupStatus': 'OK', 'BackupDU': '100MB', 'Manufacturer': 'dontknow', 'Model': 'dontknow', 'AssetID': '', 'Serial': '', 'Firmware': '', 'ApprovedBackups': 0, 'CurrentApproved': False, 'NumBackups': 1, 'Disabled': 'No', 'DomainDisabled': False, 'ApprovedState': 'good', 'IsPush': False, 'Updated': '0001-01-01T00:00:00Z'},"
| rex mode=sed "s/True/true/g s/False/false/g s/\'/\"/g s/(?s).({.*}).*/\1/"
| spath

Some tools emit strings wrapped in ' (single quote) like this, but JSON proper, and therefore Splunk, only accepts " (double quote). So I fixed everything up with `rex mode=sed`.
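
The same clean-up can be tried in Python to confirm that the result really parses as JSON. This is a minimal sketch that mirrors the sed expressions one by one; `normalize_pseudo_json` is a hypothetical helper and the sample is a shortened version of the event above. Like the sed version, it is naive: a string value that itself contains True, False, or an apostrophe would be mangled.

```python
import json
import re


def normalize_pseudo_json(raw):
    """Lowercase the booleans, swap the quotes, keep only the {...} part."""
    text = raw.replace("True", "true").replace("False", "false")
    text = text.replace("'", '"')
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found")
    return json.loads(match.group(0))


# Shortened sample: leading "[" and trailing "," are left over from a list.
sample = (
    "[{'ID': 123, 'Name': 'hostname', 'SetupComplete': True, "
    "'LastBackupCompliant': False, 'DomainEmail': ''},"
)

event = normalize_pseudo_json(sample)
print(event)
```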

Writing it in English

I'll get to work on it from here.
Submission Deadline May 20

Examples: https://conf.splunk.com/pdfs/2020/conf20-abstract-examples.pdf
Guidelines: https://conf.splunk.com/pdfs/2020/conf20-abstract-examples.pdf

The Call for Papers page

I am submitting this paper on behalf of*

All accepted sessions regardless of session type will receive support during the session to assist with filtering and answering audience Q&A, allowing the presenter(s) to focus on presenting their content.

Session Type
What is the session format?*

Each .conf20 breakout session may have A MAXIMUM of 2 presenters (1 primary speaker, 1 co-speaker).

One focus of .conf20 is to exhibit alternative breakout session formats to diversify our attendee learning and speak to the virtual conference experience. Please indicate if you have ideas for alternative formats, or if your submission would lend itself to exploring alternative formats. If you answer YES, we will reach out to you to discuss.

Do you have alternative format ideas? Would your submission lend itself to alternative formats?*
Yes
No
Session Track (Session Track Descriptions)*
Please select A MAXIMUM OF 3 key themes or topics that best match the content you want to cover.*
Advanced Architectures
Getting Started with Splunk
Alerting
Internet of Things
Artificial Intelligence and Machine Learning
Mobile
Augmented Reality / Virtual Reality
Natural Language Processing
Dashboards and Analytics
SPL Techniques
Federated Search
Splunk on your Cloud Journey
Getting Data In
Streaming Data
Session Title
Your session title should clearly indicate the topic of your presentation and do so in a way that attracts potential attendees to your virtual session (Limit 150 characters)*
JSON JSON JSON 136 characters remaining
Abstract
In a few sentences, please explain what your submission will cover (no bullets, please).
If selected, your abstract will be included in our marketing materials, on our website, and will be responsible for driving attendance to your session (Limit 1,000 characters). EXAMPLE ABSTRACT*

abstract

In your index, do you handle the JSON?
For those who have indexed JSON, but don't know what to do, I'll tell you how to get it right.
Don't afraid many multivalues.
I would like to explain how to process JSON into tables in this session.

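Japanese draft of the abstract, translated:
Do you handle JSON in your indexes?
For those who have indexed JSON but are not sure what to do with it, I will show you what to do.
Don't be afraid of multivalues.
In this session, I would like to explain how to process JSON into tables.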
763 characters remaining

Thinking about your proposed submission, what are the 3 major takeaways a fellow attendee will learn? Please note, with the online nature of this year's event, you will have the ability to share offline resources with session attendees.
Learning Objective 1*
multivalue handling
Learning Objective 2*
foreach usage
Learning Objective 3*
mvexpand and stats
Please select the relevant Skill Level for your submission. (Skill Level Descriptions)*
Your content primarily focuses on which industry:*
Your content primarily focuses on which of the following Splunk Products:*
Which other Splunk products do you plan to cover in your session:*
None
Splunk IT Service Intelligence
Splunk App for Infrastructure
Splunk Machine Learning Toolkit
Splunk Business Flow
Splunk Mission Control
Splunk Cloud
Splunk Phantom
Splunk Connected Experiences (Mobile, AR, VR, NLP, TV)
Splunk User Behavior Analytics
Splunk Data Fabric Search
SignalFx Infrastructure Monitoring
Splunk Data Stream Processor
SignalFx Microservices APM
Splunk Enterprise
VictorOps
Splunk Enterprise Security
Other
If selected, do you plan to show a product demo as part of your session?*
What types of roles would benefit from attending your session?*
Citizen Data Scientist
Splunk Admin
Cloud Specialist
Splunk Analyst
Data Analyst
Splunk Architect
Industry Line of Business Leader
Will this session need to be kept under Product Embargo prior to .conf for any reason?*
Yes
No
With .conf20 now being a virtual event, if selected as a presenter, you may be asked to pre-record your session content or present live at a predetermined time. We appreciate your flexibility.
As a Splunk employee presenting on behalf of Splunk, your slides and audio recordings will automatically be shared with .conf20 attendees.
If proposal is accepted to present at .conf20, I agree to the following:
Have no less than (3) meetings with a designated Splunk Content Development Manager between June and September 2020 to help strengthen and/or align your presentation.*
I agree.
Submit all presentation and other required materials by outlined due dates. If presentation is NOT submitted by due date, Splunk maintains the right to cancel the presenter/presentation and remove from the .conf20 agenda.*
I agree.
If any component of this session changes prior to .conf20, Splunk reserves the right to change session status to ‘decline’.*
I agree.
By submitting a session proposal, I agree to the Splunk Terms of Use.

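After this, I entered which country I live in and a short bio.
For the bio, all I wrote was my Splunk>Answers URL and a description of it.
Then I clicked through an agreement saying the presentation may be used in various ways, and that was it.

Extracting JSON from syslog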

sample

Nov 14 03:23:42 hostname rsyslogd-pstats:{ "name": "action 0", "origin": "core.action", "processed": 50996, "failed": 0, "suspended": 0, "suspended.duration": 0, "resumed": 0 }
Nov 14 03:23:42 hostname rsyslogd-pstats:{ "name": "action 1", "origin": "core.action", "processed": 50996, "failed": 0, "suspended": 0, "suspended.duration": 0, "resumed": 0 }
Nov 14 03:23:42 hostname rsyslogd-pstats:{ "name": "global", "origin": "dynstats", "values": { } }
Nov 14 03:23:42 hostname rsyslogd-pstats:{ "name": "imuxsock", "origin": "imuxsock", "submitted": 0, "ratelimit.discarded": 0, "ratelimit.numratelimiters": 0 }

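The log itself is reused as is from an earlier post.
For something this simple, SEDCMD = s/.*?({.*)/\1/g is enough.

One thing to watch with SEDCMD is that LINE_BREAKER is evaluated first,
so as long as you keep that in mind you should be fine.

Also, the regex has a noticeable impact on performance, so the step count shown on regex101.com matters.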
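
To check the header-stripping expression before putting it into props.conf, the same substitution can be run in Python. This is a sketch under the assumption that each event is one line and that the first `{` starts the JSON payload; `strip_syslog_header` is a hypothetical helper name.

```python
import json
import re

# Same idea as s/.*?({.*)/\1/ : drop everything before the first "{".
HEADER = re.compile(r"^.*?(\{.*)$", re.DOTALL)


def strip_syslog_header(line):
    return HEADER.sub(r"\1", line, count=1)


line = (
    'Nov 14 03:23:42 hostname rsyslogd-pstats:{ "name": "action 0", '
    '"origin": "core.action", "processed": 50996, "failed": 0, '
    '"suspended": 0, "suspended.duration": 0, "resumed": 0 }'
)

payload = json.loads(strip_syslog_header(line))
print(payload["name"], payload["processed"])
```

A line without any `{` passes through unchanged, so such events would still fail JSON parsing afterwards.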

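Summary

As long as the preprocessing for the foreach counter technique is done properly, it does not matter how many fields there are, so I'm hoping the proposal gets appreciated for that.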
