How to Load the SNLI Dataset

Posted at 2017-08-22

What is SNLI?

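Short for Stanford Natural Language Inference¹
An annotated corpus for learning natural language inference
Each example is a pair of two texts, a premise and a hypothesis, with a manually assigned label
- neutral: the hypothesis can be neither confirmed nor ruled out from the premise
- contradiction: the hypothesis contradicts the premise
- entailment: the hypothesis follows from the premise
- -: no gold label (no annotator consensus)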
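
For training a classifier, these string labels are usually converted to integer class ids, and pairs labeled `-` are skipped. A minimal sketch, assuming a hypothetical `LABEL_TO_ID` mapping and `encode_labels` helper (the id order is an arbitrary choice, not something SNLI defines):

```python
# Hypothetical mapping; the id order is an arbitrary choice for illustration.
LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}


def encode_labels(labels):
    """Convert gold labels to integer ids, skipping pairs without a gold label ("-")."""
    return [LABEL_TO_ID[label] for label in labels if label != "-"]


sample_labels = ["contradiction", "neutral", "-", "entailment"]
encoded = encode_labels(sample_labels)
print(encoded)
```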

Text	Judgments	Hypothesis
A man inspects the uniform of a figure in some East Asian country.	contradiction	The man is sleeping
An older and younger man smiling.	neutral	Two men are smiling and laughing at the cats playing on the floor.
A black race car starts up in front of a crowd of people.	contradiction	A man is driving down a lonely road.
A soccer game with multiple males playing.	entailment	Some men are playing a sport.
A smiling costumed woman is holding an umbrella.	neutral	A happy woman in a fairy costume holds an umbrella.

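Number of examples: about 570,000 in total
- Training: about 550,000
- Validation: about 10,000
- Test: about 10,000
Each pair also comes with syntactic parse data, in the following format.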

{
	"annotator_labels": ["neutral"], 
	"captionID": "3416050480.jpg#4", 
	"gold_label": "neutral", 
	"pairID": "3416050480.jpg#4r1n", 
	"sentence1": "A person on a horse jumps over a broken down airplane.",
	"sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )",
	"sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", 
	"sentence2": "A person is training his horse for a competition.", 
	"sentence2_binary_parse": "( ( A person ) ( ( is ( ( training ( his horse ) ) ( for ( a competition ) ) ) ) . ) )", 
	"sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (VP (VBG training) (NP (PRP$ his) (NN horse)) (PP (IN for) (NP (DT a) (NN competition))))) (. .)))"
}
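
The `sentence1_binary_parse` and `sentence2_binary_parse` fields encode an unlabeled binary tree with space-separated parentheses, so the token sequence can be recovered by dropping the parentheses. A minimal sketch; `tokens_from_binary_parse` is a hypothetical helper name, and it assumes that no token in the sentence is itself a bare parenthesis:

```python
def tokens_from_binary_parse(parse):
    """Return the leaf tokens of a binary parse string in sentence order."""
    return [tok for tok in parse.split() if tok not in ("(", ")")]


# The sentence1_binary_parse value from the record above.
parse = "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )"
tokens = tokens_from_binary_parse(parse)
print(tokens)
```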

Download

The data can be downloaded from the Stanford Natural Language Inference (SNLI) Corpus page.

wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip
unzip snli_1.0.zip

Loading the data

The data is stored in both JSON Lines format (.jsonl) and TSV format (.txt).

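import csv
import pandas as pd
# QUOTE_NONE keeps quote characters inside sentences as literal text
df = pd.read_csv("snli_1.0/snli_1.0_train.txt", sep="\t", quoting=csv.QUOTE_NONE)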
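
The `.jsonl` files hold one JSON object per line, with the fields shown in the example record earlier in this article. A minimal sketch using only the standard library; `read_snli_jsonl` is a hypothetical helper, the sample input keeps only the fields the helper reads, and the second sample record is a made-up placeholder that exists only to show the `-` case:

```python
import io
import json


def read_snli_jsonl(lines):
    """Yield (sentence1, sentence2, gold_label) from an iterable of JSON lines.

    Pairs whose gold_label is "-" (no label) are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["gold_label"] == "-":
            continue
        yield record["sentence1"], record["sentence2"], record["gold_label"]


# Stand-in for an open file; only the fields used above are included.
# The second record is a made-up placeholder to show the "-" case.
sample = io.StringIO(
    json.dumps({"gold_label": "neutral",
                "sentence1": "A person on a horse jumps over a broken down airplane.",
                "sentence2": "A person is training his horse for a competition."}) + "\n"
    + json.dumps({"gold_label": "-",
                  "sentence1": "placeholder premise",
                  "sentence2": "placeholder hypothesis"}) + "\n"
)
pairs = list(read_snli_jsonl(sample))
print(pairs)
```

With the real data, the same function can be given an open file handle, for example `open("snli_1.0/snli_1.0_train.jsonl", encoding="utf-8")`, assuming the archive was unzipped as shown in the download step.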

References

1. Bowman et al., A large annotated corpus for learning natural language inference, 2015.
