More than 5 years have passed since last update.

PythonでmongoDBを操作する～その６：aggregate編～

Posted at 2020-02-05

当記事の記載範囲

この記事ではPythonでmongodbに接続してから、aggregate（SQLで言うところの集計関数）の使い方について記載します。
mongodbの起動やpymongoのインストール方法については以下の記事をご覧いただければ幸いです。
https://qiita.com/bc_yuuuuuki/items/2b92598434f6cc320112

準備データ

準備データは以下の記事でmongoDBに突っ込んだQiitaの記事情報を使用します
[Python]Qiitaの記事情報をmongoDBに突っ込んだ

aggregateの使い方

mongoDBのaggregateの使い方はSQLに慣れているとイマイチピンと来ないです。
下表がSQLとaggregateの対比表です。

SQL	aggregate
WHERE	$match
GROUP BY	$group
HAVING	$match
SELECT	$project
ORDER BY	$sort
LIMIT	$limit
SUM()	$sum
COUNT()	$sum

mongoDBの操作クラス

pymongoを使用して色々とmongoDBを使用するクラスを作ってます。

mongo_sample.py

from pymongo import MongoClient

class MongoSample(object):

    def __init__(self, dbName, collectionName):
        self.client = MongoClient()
        self.db = self.client[dbName] #DB名を設定
        self.collection = self.db.get_collection(collectionName)

    def aggregate(self, filter, **keyword):
        return self.collection.aggregate(filter, keyword)

aggregate呼び出し用の関数を作っているだけです。

mongoDBからデータを取得する

まずは、コードです。

aggregate_sample.py

from mongo_sample import MongoSample
import pprint
# arg1:DB Name
# arg2:Collection Name
mongo = MongoSample("db", "qiita")

# 最大値
pipeline = [
    {"$group":{ "_id":"title","page_max_view":{"$max":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------最大値-----------------------------")
pprint.pprint(list(results))

# 最小値
pipeline = [
    {"$group":{ "_id":"title","page_min_view":{"$min":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------最小値-----------------------------")
pprint.pprint(list(results))

# 平均値
pipeline = [
    {"$group":{ "_id":"average","page_average_view":{"$avg":"$page_views_count"}}}
]

# 合計
pipeline = [
    {"$group":{"_id":"page_total_count","total":{"$sum":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------平均値-----------------------------")
pprint.pprint(list(results))

# tag毎の出現回数カウント
pipeline = [
    { "$unwind": "$tag_list"}, 
    { "$group": { "_id": "$tag_list", "count": { "$sum":1}}},
    { "$sort": {"count": -1, "_id":1}}
]

results = mongo.aggregate(pipeline)
print("------------------------集計値-----------------------------")
pprint.pprint(list(results))

やっていることは大したことはありません。
最大値、最小値、平均値、タグ毎のカウントを取得しています。

pprintはinstallする必要があります。

pip install pprint

それぞれ、mongoDBの操作方法と見比べていきます。

最大値／最小値／平均値／合計値

まずは、mongoDBのコマンド
例は最大値のみとします。maxをminやavg、sumに変えれば最小値／平均値／最大値になります。

db.qiita.aggregate([{$group:{_id:"page_max_views",total:{$max:"$page_views_count"}}}])

aggregate_sample.py

pipeline = [
    {"$group":{ "_id":"title","page_max_view":{"$max":"$page_views_count"}}}
]

実行結果

[{'_id': 'title', 'page_max_view': 2461}]

このやり方だと、"_id"を"title"に固定して、全レコード中の最大値を取得するという結果になりました。

ただ、どの記事が一番読まれているのかを知りたいので記事のタイトルを表示したいです。

mongoDBコマンド

> db.qiita.aggregate([{$project:{title:1,page_views_count:1}},{$group:{_id:"$title", total:{$max:"$page_views_count"}}},{$sort:{total:-1}}])
{ "_id" : "Pythonでmongodbを操作する～その２：find編～", "total" : 2461 }
{ "_id" : "Pythonでmongodbを操作する～その３：update編～", "total" : 1137 }
{ "_id" : "Pythonでmongodbを操作する～その４：insert編～", "total" : 1102 }
{ "_id" : "pymongoを使った様々な検索条件（AND／OR／部分一致／範囲検索）", "total" : 1019 }
（略）

このコマンドだと、記事のタイトルとページの閲覧回数を見れました。
当たり前ですが、記事名でグループ化されているのであまり意味のない集計ですね。。
グループ化が必要ない最大値であれば、findでsortしてlimitをかけるのが良さそうです。

tag1毎の最大値を取得してみます。

> db.qiita.aggregate([{$group:{_id:"$tag1", total:{$max:"$page_views_count"}}},{$sort:{total:-1}}])
{ "_id" : "Python", "total" : 2461 }
{ "_id" : "Vagrant", "total" : 946 }
{ "_id" : "Java", "total" : 617 }
{ "_id" : "Hyperledger", "total" : 598 }
{ "_id" : "solidity", "total" : 363 }
{ "_id" : "Ethereum", "total" : 347 }
{ "_id" : "ブロックチェーン", "total" : 232 }
{ "_id" : "Blockchain", "total" : 201 }
{ "_id" : "coverage", "total" : 199 }

はい。良い感じで取得出来ました。

一応、pythonのコードも変えてみます。

aggregate_sample.py

# 最大値
pipeline = [
    {"$group":{ "_id":"$tag1","page_max_view":{"$max":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------最大値-----------------------------")
pprint.pprint(list(results))

タグ毎の集計

タグ毎に何個記事を書いているか集計したいと思います。
集計はtag_listという項目を使用しますが、このデータは以下のようになっています。

> db.qiita.find({},{_id:0,tag_list:1})
{ "tag_list" : [ "Python", "MongoDB", "Python3", "pymongo" ] }
{ "tag_list" : [ "Python", "Python3" ] }
{ "tag_list" : [ "Python", "Python3", "Blockchain", "ブロックチェーン", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Blockchain", "ブロックチェーン", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Blockchain", "Ethereum", "Hyperledger", "Hyperledger-sawtooth" ] }
{ "tag_list" : [ "ブロックチェーン", "Hyperledger", "Hyperledger-sawtooth" ] }
{ "tag_list" : [ "Java", "ブロックチェーン", "Hyperledger", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "ブロックチェーン", "Hyperledger", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "Hyperledger", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Java", "ブロックチェーン", "Hyperledger", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Hyperledger", "Hyperledger-Iroha", "Hyperledger-burrow", "Hyperledger-sawtooth", "Hyperledger-besu" ] }
{ "tag_list" : [ "Vagrant", "VirtualBox", "Hyper-V" ] }
{ "tag_list" : [ "Java", "Ethereum", "solidity", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "solidity", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Ethereum", "ブロックチェーン" ] }

SQLでこんな形式で格納されているデータを集計するとかなり面倒くさいですね。。

mongoDBではunwindというものを使うことでLIST形式のデータを分割して集計可能になります。

> db.qiita.aggregate( { $project:{tag_list:1}}, { $unwind: "$tag_list"}, { $group: { _id: "$tag_list", count: { $sum:1}}},{ $sort: {"count": -1, "_id":1}} )
{ "_id" : "ブロックチェーン", "count" : 16 }
{ "_id" : "Ethereum", "count" : 11 }
{ "_id" : "Java", "count" : 10 }
{ "_id" : "Python", "count" : 9 }
{ "_id" : "Python3", "count" : 9 }
{ "_id" : "Hyperledger", "count" : 7 }
{ "_id" : "Hyperledger-Iroha", "count" : 7 }
{ "_id" : "MongoDB", "count" : 7 }
{ "_id" : "web3j", "count" : 7 }
{ "_id" : "solidity", "count" : 4 }
{ "_id" : "Blockchain", "count" : 3 }
{ "_id" : "Hyperledger-sawtooth", "count" : 3 }
{ "_id" : "Hyper-V", "count" : 1 }
{ "_id" : "Hyperledger-besu", "count" : 1 }
{ "_id" : "Hyperledger-burrow", "count" : 1 }
{ "_id" : "Vagrant", "count" : 1 }
{ "_id" : "VirtualBox", "count" : 1 }
{ "_id" : "coverage", "count" : 1 }
{ "_id" : "pymongo", "count" : 1 }
{ "_id" : "truffle", "count" : 1 }

pythonのコードは「{ "$project": {"tag_list": 1}}」を入れていません。
あっても無くても、結果が変わりませんでした。
イマイチ、このprojectの使い方がよくわからない。

感想

SQLに慣れていると分かり辛い部分が多々あるが、unwindなどを使用することで柔軟な集計が出来そう

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

PythonでmongoDBを操作する～その６：aggregate編～

当記事の記載範囲

準備データ

aggregateの使い方

mongoDBの操作クラス

mongoDBからデータを取得する

最大値／最小値／平均値／合計値

タグ毎の集計

感想

関連記事