Even a Chuunibyou Wants to Learn TEI

Posted at 2018-12-30

Continuing my TEI self-study.
It is self-study, so it does not matter if nobody needs this, right?

The traces of my previous self-study are here.
This time again covers the <encodingDesc> and <profileDesc> parts below.

<encodingDesc>

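The official reference describes it as
"documents the relationship between an electronic text and its source material".
That does not tell me much, so let's check the official example.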

<encodingDesc>
 <p>Basic encoding, capturing lexical information only. All
   hyphenation, punctuation, and variant spellings normalized. No
   formatting or layout information preserved.</p>
</encodingDesc>

<p> is the element for writing prose.
In other words, this is a spot where you may write free text.

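In this example it says, roughly:
"This electronic text captures only the lexical information of the source. All hyphenation, punctuation, and variant spellings are normalized. No formatting or layout information is preserved."
Looking at other documents, my impression is that most of them state how much of the source's information has been kept.
There is also <charDecl>, which belongs to the gaiji module.
At first I took it to indicate much the same thing as @xml:lang.
However, the official reference says it "provides information about nonstandard characters and glyphs", so, although I have not seen an example, it may also be applicable to scripts such as Egyptian hieroglyphs or Minoan Linear A.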
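
As a rough sketch of what such a declaration might look like: the snippet below declares one non-standard character inside <charDecl> and points to it from the text with <g>. The id hypotheticalSign01, the property name, and the property value are placeholders I made up for illustration; they are not taken from any real project.

```xml
<!-- Goes inside <encodingDesc> in the TEI header -->
<charDecl>
 <!-- xml:id is a hypothetical identifier chosen for this sketch -->
 <char xml:id="hypotheticalSign01">
  <!-- localProp records a locally defined property of the character -->
  <localProp name="description" value="sign not available in the encoding used"/>
  <!-- mapping gives a fallback to use where the sign cannot be displayed -->
  <mapping type="standard">?</mapping>
 </char>
</charDecl>

<!-- In the body of the text, <g> refers back to the declaration -->
<p>... <g ref="#hypotheticalSign01"/> ...</p>
```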

<profileDesc>

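The official reference says it "provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting".
This one also looks like it carries a lot of information.
Let's look at the official example.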

<profileDesc>
 <langUsage>
  <language ident="fr">French</language>
 </langUsage>
 <textDesc n="novel">
  <channel mode="w">print; part issues</channel>
  <constitution type="single"/>
  <derivation type="original"/>
  <domain type="art"/>
  <factuality type="fiction"/>
  <interaction type="none"/>
  <preparedness type="prepared"/>
  <purpose type="entertain" degree="high"/>
  <purpose type="inform" degree="medium"/>
 </textDesc>
 <settingDesc>
  <setting>
   <name>Paris, France</name>
   <time>Late 19th century</time>
  </setting>
 </settingDesc>
</profileDesc>

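Suddenly the number of elements explodes...
<langUsage> indicates the languages, dialects, and so on used in the text.
Each <language> inside <langUsage> indicates a language, just as the name says.
In the example it is French.
- ident specifies the language, using a tag that follows BCP 47.
- usage gives the approximate percentage of the text written in that language.
An example follows.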

<langUsage>
 <language ident="en-US" usage="75">modern American English</language>
 <language ident="az-Arab" usage="20">Azerbaijani in Arabic script</language>
 <language ident="x-lap" usage="05">Pig Latin</language>
</langUsage>

So 75% is modern American English, 20% is Azerbaijani written in Arabic script, and 5% is Pig Latin.
For Pig Latin, see Wikipedia.

<textDesc> describes the text itself.
The n attribute on it is something like class in HTML.

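<channel> indicates the medium or channel through which the text is recorded and transmitted.
mode=s (spoken)
mode=w (written)
mode=sw (spoken to be written)
mode=ws (written to be spoken)
mode=m (mixed)
mode=x (unknown or inapplicable) (default)
That makes six values.
For example, the Heike monogatari would fall under mode="sw".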

<channel mode="s">face-to-face conversation</channel>
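
For the Heike monogatari case mentioned above, a minimal sketch could look like the following. The prose inside the element is my own wording, not something from the official examples.

```xml
<!-- mode="sw": spoken material that was later written down -->
<channel mode="sw">oral recitation, later committed to writing</channel>
```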

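<constitution> indicates the internal composition of the text.
type=single (a single complete text) (default)
type=composite (a text made by combining several smaller items, each individually complete, such as an anthology)
type=frags (a text made by combining several smaller items that are not necessarily complete, such as a collection of excerpts)
type=unknown (composition unknown)
That makes four values.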

<constitution type="frags">Prologues only.</constitution>
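
By contrast, a collection whose parts are each complete could be marked as composite. A small sketch, with a made-up description:

```xml
<!-- Hypothetical example: every item in the collection is complete on its own -->
<constitution type="composite">Anthology of complete short stories.</constitution>
```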

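<derivation> indicates the originality of the text.
The word suggests something like legitimacy, but it is really about whether the text is an original or derived from something else.
type=original (original)
type=revision (revision of another text)
type=translation (translation)
type=abridgment (abridged version)
type=plagiarism (plagiarized from another text)
type=traditional (no obvious source; one of several texts derived from a common ancestor)
That makes six values. Plagiarism as a category... that one gave me pause.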

<derivation type="original"/>

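<domain> indicates the most important social context in which the text was produced or for which it is intended.
type=art (art and entertainment)
type=domestic (domestic and private)
type=religious (religious and ceremonial)
type=business (business and workplace)
type=education (education)
type=govt (government and law)
type=public (other public contexts)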

<domain type="domestic"/>
<domain type="rel">religious broadcast</domain>

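<factuality> indicates how far the text is to be regarded as fact or fiction.
type=fiction (fiction)
type=fact (non-fiction)
type=mixed (a mixture of fiction and non-fiction)
type=inapplicable (the fiction/fact distinction is not helpful or appropriate for this text)
You may wonder what the last one is for. My guess is that it is meant for texts such as old chronicles, which were written as history, so that they can be recorded without having to rule on whether their content is fact or fiction.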

<factuality type="fiction"/>

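<interaction> indicates the extent of interaction between those producing the text and those receiving it. In short, whether it is a monologue, a dialogue, and so on.
A well-known example of the dialogue form would be Plato's dialogues.

type=none (no interaction, e.g. a monologue)
type=partial (some interaction, e.g. a monologue with set responses)
type=complete (full interaction, e.g. a face-to-face conversation)
type=inapplicable (this parameter does not apply)

active=singular (a single addressor)
active=plural (many addressors)
active=corporate (a corporate addressor)
active=unknown (number of addressors unknown or unspecifiable)

passive=self (addressed to the originator, e.g. a diary)
passive=single (addressed to one other person, e.g. a letter)
passive=many (addressed to a countable number of others, e.g. a conversation)
passive=group (addressed to an undefined but fixed number of participants, e.g. a lecture)
passive=world (addressed to an indeterminately large number, e.g. a published book)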

<interaction type="complete"
 active="plural" passive="many"/>
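
To see how the three attributes combine, here are two more sketches. The classification of each case is my own reading of the value lists above, so treat it as an assumption.

```xml
<!-- A personal letter: one writer, one addressee, no interaction while writing -->
<interaction type="none" active="singular" passive="single"/>

<!-- A lecture: one speaker addressing a fixed audience, with some responses -->
<interaction type="partial" active="singular" passive="group"/>
```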

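<preparedness> indicates whether the text is improvised or prepared.
The Japanese translation in the reference says just that.

type=none (spontaneous or unprepared)
type=scripted (follows a script)
type=formulaic (follows a predefined set of conventions)
type=revised (polished or revised before presentation)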

<preparedness type="none"/>

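<purpose> indicates the purpose of the text. It is less about the content and more about the function, such as publicity or education.

type=persuade (didactic, advertising, propaganda)
type=express (self-expression)
type=inform (conveying information, educational)
type=entertain (entertainment)

degree='high' (predominant)
degree='medium' (intermediate)
degree='low' (weak)
degree='unknown' (extent unknown)
degree takes no values other than these four.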

<purpose type="persuade" degree="high"/>
<purpose type="entertain" degree="low"/>

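That is a huge number of elements, but I suppose this is the standard in Europe and North America...
Well, it is XML, so I expect I will get used to it as I keep writing.
Basically, you put a type attribute inside the tag, and the space between the opening and closing tags is for supplementary information.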
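
Putting that together, here is a sketch of a complete <textDesc> for a hypothetical personal diary. Every value is my own guess at how such a text might be classified, using only the values listed above.

```xml
<!-- Hypothetical classification of a personal diary -->
<textDesc n="diary">
 <!-- the supplementary note goes between the tags -->
 <channel mode="w">handwritten notebook</channel>
 <constitution type="single"/>
 <derivation type="original"/>
 <domain type="domestic"/>
 <factuality type="fact"/>
 <!-- written by one person, addressed to that same person -->
 <interaction type="none" active="singular" passive="self"/>
 <preparedness type="none"/>
 <purpose type="express" degree="high"/>
</textDesc>
```

The child elements are in the same order as in the official <profileDesc> example above.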

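<settingDesc> indicates the setting in which the language interaction took place.
Taking the Plato's dialogues mentioned earlier, this is where you would note the circumstances in which Socrates and his interlocutors are talking.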

<settingDesc>
 <p>Texts recorded in the
   Canadian Parliament building in Ottawa, between April and November 1988 </p>
</settingDesc>

As the example shows, it indicates things like where and under what circumstances a recording or interview took place.
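
The same information can also be given in structured form with <setting>, as in the Paris example in the first <profileDesc> above. Here is a sketch that restates the Ottawa example this way. The split into <name> and <date> is my own, and the from/to values simply restate "between April and November 1988".

```xml
<settingDesc>
 <setting>
  <name>Canadian Parliament building, Ottawa</name>
  <!-- from/to express the period as year-month values -->
  <date from="1988-04" to="1988-11">between April and November 1988</date>
 </setting>
</settingDesc>
```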

Phew, that was tiring.
Next I will study <xenoData> and <revisionDesc>.
OMEKA also looks interesting, so I would like to try it out.
