0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 3 years have passed since last update.

sql で percentile を算出してランク付けする

Posted at

Overview

テーブル形式の大量のデータに対して、なんらかの値が

閾値 ランク
上位80% A
60% - 80% B
40% - 60% C
20% - 40% D
下位20% E

みたいにランク付けしたいときとかあるじゃないですか。

それを SQL でやります。

$ sqlite3 -version
3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837bb4d6

適当なデータを投入

$ { echo idx,value; seq 10000 | while read x; do echo $x,$RANDOM; done } | \
   sqlite3 -csv -header a.db ".import /dev/stdin hoge"

確認

$ sqlite3 a.db ".schema"
CREATE TABLE hoge(
  "idx" TEXT,
  "value" TEXT
);

$ sqlite3 a.db "SELECT count(*) FROM hoge"
10000

$ sqlite3 -header a.db "SELECT * FROM hoge LIMIT 10"
idx|value
1|16165
2|9345
3|7693
4|1073
5|12271
6|7869
7|3264
8|11153
9|16929
10|14493

パーセンタイルでランク付けして別テーブルに格納

TEXT で入っちゃってるのを INT にキャストしてあるので、そこだけ注意。

a.sql
CREATE TABLE rank AS
WITH q1 AS (
SELECT
  *
  , PERCENT_RANK() OVER(ORDER BY CAST(value AS INT)) as percentile
FROM hoge
)

SELECT
  *
  , CASE
    WHEN percentile >= 0.8 THEN "A"
    WHEN percentile >= 0.6 THEN "B"
    WHEN percentile >= 0.4 THEN "C"
    WHEN percentile >= 0.2 THEN "D"
    ELSE "E"
    END AS rank
FROM q1;

確認

b.sql
SELECT
  COUNT(*)
  , MAX(percentile)
  , MAX(value)
  , rank
FROM rank
GROUP BY rank
$ sqlite3 -header a.db <b.sql
COUNT(*)|MAX(percentile)|MAX(value)|rank
2000|1.0|32767|A
2000|0.7999799979998|26157|B
2000|0.5999599959996|19568|C
2000|0.3999399939994|9999|D
2000|0.1999199919992|999|E

っていう話。
閾値を変えたり、足切りとかにも使えますね。

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?