
Annotating data with superintendent on Databricks

Posted at 2023-09-19 (last updated at 2023-09-19)

I had no idea a library like this existed.

There is also a related library, ipyannotations, but at the moment it does not work on Databricks.

superintendent supports annotating both text and images.

%pip install superintendent ipyannotations html5lib
dbutils.library.restartPython()

First, pull some sample text from news sites.

import requests
from bs4 import BeautifulSoup

headlines = []
labels = []

# Fetch the Guardian front page and parse it with the html5lib parser
r = requests.get('https://www.theguardian.com/uk').text
soup = BeautifulSoup(r, 'html5lib')
headlines += [headline.text for headline in
              soup.find_all('span', class_='js-headline-text')][:10]
# Pad labels so that there is one label per headline collected so far
labels += ['guardian'] * (len(headlines) - len(labels))

# Same for the Daily Mail, stripping newlines and non-breaking spaces
soup = BeautifulSoup(requests.get('http://www.dailymail.co.uk/home/index.html').text, 'html5lib')
headlines += [headline.text.replace('\n', '').replace('\xa0', '').strip()
              for headline in soup.find_all(class_="linkro-darkred")][:10]
labels += ['daily mail'] * (len(headlines) - len(labels))

headlines

Out[2]: ["Ex-model tells how Russell Brand 'stalked her through London streets demanding sex after they met in a bar forcing her to RUN to flee his advances': Woman to report incident to police - as C4 insiders say Big Brother bosses 'all knew he was a predator'",
 "Inside Russell Brand's rocky relationship with wife's family: How golf legend father-in-law Bernard Gallacher 'begged' daughter Laura to split with the star - as comic's sister-in-law Kirsty deletes Instagram post supporting him in wake of sex scandal",
 "It WAS Russell Brand who Katherine Ryan was talking about: Female comic repeatedly accused him of being a 'sexual predator' during filming for Comedy Central's Roast Battle before he was dropped from what was his last major TV job in the UK",
 'PETER HITCHENS: Trying to have a serious argument with Russell Brand is like playing chess with a squirrel. Why was he given a place in the national debate?',
 "NADINE DORRIES: How can Russell Brand's wife stand by a man accused of sending a car to pick up a girl of 16 from school?",
 "Keir Starmer is accused of 'Brexit betrayal' as he vows to re-write a deal with the EU ahead of a meeting with Emmanuel Macron",
 "Self-styled anti-slavery activist portrayed by Jim Caviezel in 'Sound of Freedom' steps down from Operation Underground Railroad after being accused of sexual misconduct by seven women",
 'Folk singer Roger Whittaker best known for hits Durham Town and New World in the Morning dies aged 87',
 "JAMES MACMANUS: I've dedicated my new book to a beautiful French lover I knew for months in 1974. My wife's not happy - but Marie-Aude's heartbroken fury as she hurled wine at me and fled from my life has haunted me for 50 years",
 'Are YOU one of the 12million Brits missing out on a Covid booster because of NHS penny-pinching?']

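What puzzled me at first was where the annotation results are kept. The answer is that you pass the database location through the database_url argument. If the argument is omitted, an in-memory SQLite database is used. The path must be prefixed with the URL scheme sqlite:///.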
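As a minimal sketch of that URL format, the helper below builds a SQLite URL from a file path. The name sqlite_url is hypothetical and not part of superintendent. It only illustrates that an absolute path such as /databricks/driver/text_annotation.db ends up with four slashes after "sqlite:", three from the scheme and one from the path itself. It assumes a POSIX-style file system, as on a Databricks driver node.

```python
import os


def sqlite_url(path):
    """Build a SQLite database URL from a file path (hypothetical helper)."""
    # The scheme is "sqlite:///". An absolute POSIX path brings its own
    # leading slash, so the result has four slashes in a row.
    return "sqlite:///" + os.path.abspath(path)


example_url = sqlite_url("/databricks/driver/text_annotation.db")
print(example_url)
```

A relative path is resolved against the current working directory first, so the resulting URL always points at an absolute location.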

from superintendent import Superintendent
from ipyannotations.text import ClassLabeller

# Database that stores the annotation results
db_string = "sqlite:////databricks/driver/text_annotation.db"

input_widget = ClassLabeller(options=['professional', 'not professional'])
input_data = headlines
data_labeller = Superintendent(
    database_url=db_string,
    features=input_data,
    labelling_widget=input_widget,
)

data_labeller

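The annotation widget is then displayed.

In this example, you read each news headline and choose whether it sounds professional or not professional.

Note
If the widget does not appear, reload the browser.

Each selection advances the progress bar.

The annotation is now complete.

You can inspect the database at the path specified above.

Note that the database path here does not include the URL scheme.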

import sqlite3
connection = sqlite3.connect("/databricks/driver/text_annotation.db")

cursor = connection.cursor()

sql_query = """SELECT name FROM sqlite_master  
  WHERE type='table';"""
cursor.execute(sql_query)
print(cursor.fetchall())

The results are stored in a table named superintendentdata.

[('superintendentdata',)]
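Because Superintendent takes a URL while sqlite3.connect takes a plain file path, it can help to derive one from the other instead of typing the location twice. A minimal sketch, assuming the URL always uses the sqlite:/// scheme shown earlier; the helper name sqlite_path is hypothetical and not part of any library.

```python
def sqlite_path(url):
    """Return the file path part of a "sqlite:///" URL (hypothetical helper)."""
    prefix = "sqlite:///"
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with " + prefix)
    # Drop only the scheme; the leading slash of an absolute path is kept.
    return url[len(prefix):]


example_path = sqlite_path("sqlite:////databricks/driver/text_annotation.db")
print(example_path)
```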

Display the contents of the table.

df = spark.read.format('jdbc') \
          .options(driver='org.sqlite.JDBC', dbtable='superintendentdata',
                   url='jdbc:sqlite:/databricks/driver/text_annotation.db').load()
display(df)
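If the SQLite JDBC driver happens not to be available on the cluster, the same table could also be read with Python's built-in sqlite3 module. The sketch below is self-contained: it creates a small in-memory table whose name and columns (demo_annotations, id, input, output) are made up for illustration and are not necessarily superintendent's actual schema, then reads it back as a list of dicts.

```python
import sqlite3


def read_table(connection, table):
    """Return every row of `table` as a list of dicts keyed by column name."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = connection.execute("SELECT * FROM " + quoted)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Stand-in database with made-up columns, for illustration only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_annotations (id INTEGER, input TEXT, output TEXT)")
demo.execute("INSERT INTO demo_annotations VALUES (1, 'some headline', 'professional')")
rows = read_table(demo, "demo_annotations")
print(rows)
```

For the database in this article, the corresponding call would connect to /databricks/driver/text_annotation.db and pass superintendentdata as the table name.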

Once the libraries are installed, annotation is quick and easy, which is very convenient.

Image annotation looks like this.

from superintendent import Superintendent
from ipyannotations.images import ClassLabeller
from sklearn.datasets import load_digits

input_data = load_digits().data.reshape(-1, 8, 8)
input_widget = ClassLabeller(
    options=list(range(1, 10)) + [0], image_size=(100, 100))
data_labeller = Superintendent(
    database_url=db_string,
    features=input_data,
    labelling_widget=input_widget,
)
data_labeller

Databricks Quickstart Guide

Databricks free trial
