
Code that extracts the leading word of each entry from a Latin-script dictionary file to build a simple word-list file

Last updated at 2019-01-24, posted at 2019-01-17

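This is extwords.cpp, a program that extracts a simple, basic dictionary file consisting only of a list of words from general dictionary files such as English-Japanese dictionaries.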

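It supports general dictionary files such as:
・the downloadable Eijiro dictionary,
・the public-domain ejdic-hand-utf8.txt,
・a Tab File decompiled from a stardict dictionary with "stardict-editor",
・any file in which each line starts with an English word, the separator is a non-alphabetic character, and the line ends with \n.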
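The line format in the last item can be illustrated with a small sketch. The helper name `leading_word` is hypothetical and is not part of extwords.cpp. It also assumes that a Shift_JIS file may start a line with a two-byte mark whose first byte is 0x81 (for example "■", 0x81 0xA1), which is skipped before the word is read.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of extwords.cpp): return the leading run of
// ASCII letters in one dictionary line, lower-cased, or "" if the line does
// not start with a letter.
std::string leading_word(const std::string& line) {
    std::size_t pos = 0;
    // Assumption: a Shift_JIS line may begin with a two-byte mark whose
    // first byte is 0x81 (for example 0x81 0xA1); skip both bytes.
    if (line.size() >= 2 && static_cast<unsigned char>(line[0]) == 0x81) {
        pos = 2;
    }
    std::string word;
    while (pos < line.size()) {
        unsigned char ch = static_cast<unsigned char>(line[pos]);
        // Accept only ASCII letters, regardless of the current locale.
        bool is_letter = (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
        if (!is_letter) break;  // the first non-letter is the separator
        word += static_cast<char>(std::tolower(ch));
        ++pos;
    }
    return word;
}
```

For a made-up tab-separated line such as "abandon\tto give up", this returns "abandon". A line that starts with a digit or a symbol yields an empty string and would simply be skipped.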

I wanted to write this with sed, awk, or similar tools, but sed and awk basically cannot handle the Eijiro binary file...

To compile: c++ extwords.cpp -o extwords
Usage: extwords sourcedicfile >dicfile

【GitHub】
https://github.com/fygar256/extwords

extwords.cpp

/*
  Program that extracts the list of headwords from an
  English-Japanese or English-English dictionary file
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>

using namespace std;

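/* Size of the word buffer that main() passes to getword(). */
enum { WORD_MAX = 1000 };

/*
  Read the leading run of letters on the next line into w and skip the
  rest of that line. Lines that do not start with a letter are ignored.
  Returns w, or NULL at end of file.
*/
char *getword(FILE *fp,char *w) {
	int idx;
	int c;

	while(1) {
		idx=0;
		c=fgetc(fp);
		if (c==EOF) return(NULL);
		/* 0x81 is the first byte of a two-byte Shift_JIS character
		   (e.g. the mark at the start of an Eijiro entry); skip both bytes. */
		if (c==0x81) fgetc(fp);
		else ungetc(c,fp);

		/* Collect letters up to the first non-letter, without overflowing w. */
		while((c=fgetc(fp))!=EOF && isalpha(c)) {
			if (idx<WORD_MAX-1) w[idx++]=c;
		}
		w[idx]='\0';

		/* Discard the rest of the line, unless the separator was the
		   newline itself or the file has ended. */
		while(c!='\n' && c!=EOF) c=fgetc(fp);

		if (w[0]!='\0') return(w);
		if (c==EOF) return(NULL);
	}
}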

int	main(int argc,char *argv[])
{
	FILE	*fp;
	char	w[1000];
	vector<string> data;
	string	str;

	if (argc!=2) {
		fprintf(stderr,"Usage: extwords sourcedicfile\n");
		exit(1);
	}

	/* Binary mode, so that Shift_JIS bytes are read unchanged. */
	fp=fopen(argv[1],"rb");
	if (fp==NULL) {
		perror(argv[1]);
		exit(1);
	}

	/* Collect every headword, folded to lower case. */
	while(getword(fp,w)!=NULL) {
		str=w;
		for(auto & c:str) c=tolower((unsigned char)c);
		data.push_back(str);
	}
	fclose(fp);

	/* Sort, then drop duplicates (unique only removes adjacent ones). */
	sort(data.begin(), data.end());
	data.erase(unique(data.begin(), data.end()), data.end());

	for(const auto & i:data)
		cout << i << '\n';
	return 0;
}

Sample output, from ejdic-hand-utf8.txt:

a
aa
aaa
aam
aardvark
aaron
ab
abaci
aback
abacus
abaft
abalone
abandon
abandoned
abandonment
abase
abasement
abash
...
zr
zucchetto
zucchini
zulu
zurich
zwieback
zzz
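The listing is sorted and free of duplicates because the program sorts the collected words and then erases repeated neighbours. A minimal sketch of that step in isolation, with a hypothetical helper name `sorted_unique` that does not appear in extwords.cpp:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: return the words sorted, with duplicates removed.
std::vector<std::string> sorted_unique(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    // std::unique only collapses adjacent equal elements, so the sort
    // has to come first; erase then trims the leftover tail.
    words.erase(std::unique(words.begin(), words.end()), words.end());
    return words;
}
```

Called with the words "zulu", "a", "aa", "a", it returns "a", "aa", "zulu". Because the words are lower-cased before this step, entries that differ only in case collapse into one line.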
