
More than 5 years have passed since last update.

How to Build n-grams in BigQuery Standard SQL


I wrote a SQL function that takes a string and splits it into n-grams.
Because bigrams and trigrams are used so often, I also defined specialized functions for them separately.

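#standardSQL
-- NGRAM: returns every substring of n consecutive characters in str.
-- SUBSTR and LENGTH count characters (not bytes) for STRING values.
CREATE TEMPORARY FUNCTION NGRAM(str STRING, n INT64)
RETURNS ARRAY<STRING> AS ((
  SELECT ARRAY(SELECT SUBSTR(str, seq, n) FROM UNNEST(T.seqs) AS seq)
  FROM (
    -- Start positions 1 .. LENGTH(str) - n + 1 (empty when n > LENGTH(str)).
    SELECT str, GENERATE_ARRAY(1, LENGTH(str) - n + 1) AS seqs
  ) AS T
));

-- Specialization for n = 2.
CREATE TEMPORARY FUNCTION BIGRAM(str STRING)
RETURNS ARRAY<STRING> AS (NGRAM(str, 2));

-- Specialization for n = 3.
CREATE TEMPORARY FUNCTION TRIGRAM(str STRING)
RETURNS ARRAY<STRING> AS (NGRAM(str, 3));

SELECT BIGRAM('I am an NLPer');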
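
As a possible follow-up, the array returned by the function can be flattened with UNNEST and aggregated. The sketch below is not part of the original post: it repeats the NGRAM definition so that it can run on its own, and the column aliases gram and freq are names chosen here for illustration only.

```sql
#standardSQL
-- Same NGRAM definition as in the article, repeated so this query is self-contained.
CREATE TEMPORARY FUNCTION NGRAM(str STRING, n INT64)
RETURNS ARRAY<STRING> AS ((
  SELECT ARRAY(SELECT SUBSTR(str, seq, n) FROM UNNEST(T.seqs) AS seq)
  FROM (
    SELECT str, GENERATE_ARRAY(1, LENGTH(str) - n + 1) AS seqs
  ) AS T
));

-- Flatten the bigram array into rows and count how often each bigram occurs.
-- "gram" and "freq" are illustrative aliases, not names from the article.
SELECT gram, COUNT(*) AS freq
FROM UNNEST(NGRAM('I am an NLPer', 2)) AS gram
GROUP BY gram
ORDER BY freq DESC, gram;
```

The input string has 13 characters, so it yields 12 bigrams. The bigram ' a' (a space followed by "a") occurs twice and should appear at the top of the result. Whitespace is kept as an ordinary character here, so strip or normalize it beforehand if bigrams that span word boundaries are not wanted.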