
Passing a list to MeCab for word segmentation raises 'TypeError: in method 'Tagger_parse', argument 2 of type 'char const *''

Last updated at 2020-12-25, posted at 2020-12-25

When I try to segment a list into words with MeCab, I get
'TypeError: in method 'Tagger_parse', argument 2 of type 'char const *''.

The error message says that argument 2 is wrong, so I assumed the problem was in how
my CSV or my code was written, but my research did not lead to a solution.

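My code is based on the site below. I do not need the label indexing,
so I removed it, and then one error after another appeared.
I thought I had removed only the variables that depend on label, so I did not expect so many errors...
Reference site: https://qiita.com/Qazma/items/0daf927e34d22617ddcd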

I am sorry for the trouble, but I would appreciate help from anyone who knows the cause.

Supplementary note: the CSV file has a single column, with one sentence per row.

2020-12-25 11:55:30.878680: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
C:\Users\Katuta\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.2) or chardet (4.0.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Traceback (most recent call last):
  File "ex.py", line 5, in <module>
    padded, one_hot_y, word_index, tokenizer, max_len, vocab_size = wakatigaki.create_tokenizer()
  File "C:\Users\Katuta\gotou\wakatigaki.py", line 21, in create_tokenizer
    text_wakati = wakati.parse(text)
  File "C:\Users\Katuta\AppData\Local\Programs\Python\Python38\lib\site-packages\MeCab.py", line 293, in parse
    return _MeCab.Tagger_parse(self, *args)
TypeError: in method 'Tagger_parse', argument 2 of type 'char const *'
Additional information:
Wrong number or type of arguments for overloaded function 'Tagger_parse'.
  Possible C/C++ prototypes are:
    MeCab::Tagger::parse(MeCab::Model const &,MeCab::Lattice *)
    MeCab::Tagger::parse(MeCab::Lattice *) const
    MeCab::Tagger::parse(char const *)

import MeCab
import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

def create_tokenizer() :
    text_list = []
    with open("C:/Users/Katuta/gotou/corpus_MEIDAI.csv",'r',encoding="utf-8",errors='ignore') as csvfile :
        texts = csv.reader(csvfile)

        for text in texts :
            text_list.append(text)

    # Use MeCab to segment the Japanese text into words.
        wakati_list = []
        for text in text_list :
            text = list(map(str.lower,text))

            wakati = MeCab.Tagger("-O wakati")
            text_wakati = wakati.parse(text)
            wakati.parse('')
            wakati_list.append(text_wakati)

    # Find the number of tokens in the longest sentence.
    # Build the list of text data used by the tokenizer.
        max_len = -1
        split_list = []
        sentences = []
        for text in wakati_list :
            text = text.split()
            split_list.extend(text)
            sentences.append(text)

            if len(text) > max_len :
                max_len = len(text)
        print("Max length of texts: ", max_len)
        vocab_size = len(set(split_list))
        print("Vocabulary size: ", vocab_size)

    # Use Tokenizer to assign each word an index starting from 1.
    # This also builds the dictionary.
        tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<oov>")
        tokenizer.fit_on_texts(split_list)
        word_index = tokenizer.word_index
        print("Dictionary size: ", len(word_index))
        sequences = tokenizer.texts_to_sequences(sentences)

    # Use to_categorical() to create the one-hot vectors, the actual label data passed to the model.
        one_hot_y = tf.keras.utils.to_categorical(sentences)

    # To make the training data a uniform size, pad shorter texts with 0 up to the length of the longest text.
        padded = pad_sequences(sequences, maxlen=max_len, padding="post", truncating="post")
        print("padded sequences: ", padded)

        return padded, one_hot_y, word_index, tokenizer, max_len, vocab_size
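
A note on the `TypeError` in the traceback: `csv.reader` yields every row as a list of strings, even when the file has only one column, and `list(map(str.lower, text))` keeps it a list. The sketch below uses only the standard library and a hypothetical two-line CSV held in memory (a stand-in for corpus_MEIDAI.csv, not the real data) to show what each `text` looks like before it reaches `wakati.parse`.

```python
import csv
import io

# Hypothetical stand-in for corpus_MEIDAI.csv: one column, one sentence per row.
sample_csv = "今日は晴れです\n明日は雨です\n"

rows = list(csv.reader(io.StringIO(sample_csv)))

for row in rows:
    # Each row is a list, even though the file has a single column.
    print(type(row).__name__, row)

# A function that accepts only str will reject these rows as they are.
all_rows_are_lists = all(isinstance(row, list) for row in rows)
print(all_rows_are_lists)
```

Checking `type(text)` just before the failing call is usually the quickest way to confirm this kind of mismatch.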

Solved it myself
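
One caveat about the `str(text)` conversion used in this answer: calling `str()` on a list returns its printed form, brackets and quotes included, so those characters also reach the tokenizer. The following is a minimal sketch of an alternative, assuming each row should simply become its cell contents. `row_to_text` is a hypothetical helper and the sample CSV is made up for illustration.

```python
import csv
import io

# Hypothetical sample data: one column, one sentence per row.
sample_csv = "Hello World\nこんにちは\n"
rows = list(csv.reader(io.StringIO(sample_csv)))


def row_to_text(row):
    """Join the cells of one CSV row into a single lowercase str."""
    return "".join(row).lower()


for row in rows:
    via_str = str(row).lower()    # keeps "[", "]" and the quotes
    via_join = row_to_text(row)   # only the cell contents
    print(repr(via_str), repr(via_join))
```

Either result is a `str`, so both satisfy `Tagger.parse`. The joined version just avoids feeding list punctuation into the segmentation.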

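```python
# Tagger.parse() accepts only a single str (`char const *` on the C++ side).
# csv.reader yields each row as a list, so passing a row directly raises the TypeError.
# Converting text to str before calling parse() avoids the error.
# Use MeCab to segment the Japanese text into words.
wakati_list = []
for text in text_list:
    # Note: str() on a list keeps its brackets and quotes in the resulting string.
    text = str(text).lower()

    wakati = MeCab.Tagger("-Owakati")
    text_wakati = wakati.parse(text)
    wakati.parse('')
    wakati_list.append(text_wakati)
```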
