
Extracting lines that match given conditions from a text file with Python

Python

Posted at 2019-12-22

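Overview

I wrote a Python tool that extracts text using multiple search terms, matched by prefix match, suffix match, partial match, or exact match.
It started as a Python script that found entries containing specific strings in a text and removed them. The extraction step seemed useful on its own, so I rebuilt the tool with that part made configurable.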

Requirements

python 3.7.2
pandas
numpy

An exe is also provided this time, so if you only want to run the tool you do not need Python or these libraries.

Where to get it

The code is published on GitHub.

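What it does

The tool extracts lines using the search terms defined in resources/検索データ.xlsx, applying prefix, suffix, partial, or exact matching according to the settings in resources/appConfig.ini.
It processes the files stored in the data directory (the extract code in this article picks up data/*.csv).
The results are written under the output directory.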
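
As a rough illustration of the settings side, here is a minimal sketch that reads the four keys the code in this article looks up (sortType, regType, encoding, outputPath in the info section). It assumes the ini file is read with the standard configparser module, which the article does not confirm, and every value in SAMPLE_INI is a hypothetical placeholder: the real values are defined in the repository.

```python
import configparser

# Hypothetical contents for resources/appConfig.ini. Only the section and key
# names come from the article; the values are made up for this sketch.
SAMPLE_INI = """
[info]
sortType = 1
regType = 1
encoding = utf-8
outputPath = output/
"""


def load_settings(text):
    """Parse ini text and return the four settings as a dict of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    info = parser["info"]
    return {
        "sortType": info.get("sortType"),
        "regType": info.get("regType"),
        "encoding": info.get("encoding"),
        "outputPath": info.get("outputPath"),
    }


settings = load_settings(SAMPLE_INI)
print(settings)
```

configparser returns every value as a string, so codes such as sortType would need to match whatever value type the enums in the repository expect.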

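Source walkthrough

The following code builds the search condition.

It reads the search strings from 検索データ.xlsx.
Depending on the sort setting, it sorts the strings by length.
Depending on the match type (prefix match, suffix match, and so on), it adds .* to each string and joins the conditions with |.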

    # Requires: import re, numpy as np, pandas as pd.
    # iniFile, SORT_ENUM and REG_ENUM are defined elsewhere in the repository.
    def createReg(self):
        # Load the search terms and the configured sort order.
        searchItems=pd.read_excel('resources/検索データ.xlsx')
        sortTypeCode=iniFile.get('info','sortType')

        searchItemArray=np.asarray(searchItems['検索語'])
        sortType=SORT_ENUM(sortTypeCode)
        if sortType==SORT_ENUM.SORT_LENGTH_ASC or sortType==SORT_ENUM.SORT_LENGTH_DESC:
            # Put each term's length in a second column (named 0) and sort by it.
            searchItemIndex=[]
            for item in searchItemArray:
                searchItemIndex.append(len(item))
            searchSeries=pd.Series(searchItemIndex)
            serchItemDataFrame=pd.concat([searchItems['検索語'],searchSeries],axis=1)
            if sortType==SORT_ENUM.SORT_LENGTH_ASC:
                sortItems=serchItemDataFrame.sort_values(0,ascending=True)
            else:
                sortItems=serchItemDataFrame.sort_values(0,ascending=False)
            searchItemArray=np.asarray(sortItems['検索語'])
        regTypeCode=iniFile.get('info','regType')
        regType=REG_ENUM(regTypeCode)
        # Build one alternation: term1|term2|...
        # Terms are inserted unescaped, so regex metacharacters in them stay active.
        # No ^ or $ anchors are added; with reg.search() in extract(), each
        # alternative can match anywhere in a line.
        regStr=''
        for item in searchItemArray:
            if regStr!='':
                regStr=regStr+'|'
            sItem=item
            if REG_ENUM.REG_TYPE_CONTAIN==regType:
                sItem='.*'+item+'.*'
            elif REG_ENUM.REG_TYPE_FRONT==regType:
                sItem=item+'.*'
            elif REG_ENUM.REG_TYPE_BACKWARD==regType:
                sItem='.*'+item
            elif REG_ENUM.REG_TYPE_EXACT_MATCH==regType:
                sItem=item
            regStr=regStr+sItem
        return re.compile(regStr)
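
For comparison, here is a minimal sketch of one way to make the four match types behave differently when the pattern is used with search(). It is not the repository's implementation: MatchMode and build_pattern are hypothetical names introduced for this example, the terms are escaped with re.escape, and ^ and $ anchors are used instead of .* padding. Sorting longest-first is included to show where the length sort fits in.

```python
import re
from enum import Enum


class MatchMode(Enum):
    """Hypothetical stand-in for the article's REG_ENUM."""
    CONTAIN = "contain"
    FRONT = "front"
    BACKWARD = "backward"
    EXACT = "exact"


def build_pattern(terms, mode, longest_first=True):
    """Compile one regex that matches any of the terms in the given mode."""
    # Sort by length so that longer terms are tried first in the alternation.
    ordered = sorted(terms, key=len, reverse=longest_first)
    # Escape each term so characters such as "." are matched literally.
    body = "(?:" + "|".join(re.escape(t) for t in ordered) + ")"
    if mode is MatchMode.FRONT:
        return re.compile("^" + body)
    if mode is MatchMode.BACKWARD:
        # "$" also matches just before a trailing newline.
        return re.compile(body + "$")
    if mode is MatchMode.EXACT:
        return re.compile("^" + body + "$")
    return re.compile(body)


lines = ["error: disk full\n", "disk error\n", "error\n", "all good\n"]
for mode in MatchMode:
    pattern = build_pattern(["error"], mode)
    hits = [line for line in lines if pattern.search(line)]
    print(mode.name, hits)
```

Because each line read from a file still ends with a newline, the sketch relies on $ matching before that final newline rather than stripping the line first.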

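The following code extracts lines using the condition built above.

It opens each file using with open and checks line by line whether the line matches.
Matching lines are stored in a list.
Finally, the results are written out as text files.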

    # Requires: import glob, os.
    def extract(self):
        reg=self.createReg()
        paths=glob.glob('data/*.csv')

        # Maps each input file name to the list of lines that matched.
        fileDict={}

        for pathName in paths:
            extractList=[]
            with open(pathName,encoding=iniFile.get('info','encoding')) as f:
                # Read line by line and keep every line the pattern matches.
                for targetStr in f:
                    extractStr=reg.search(targetStr)
                    if extractStr:
                        extractList.append(targetStr)
            fileDict[os.path.basename(pathName)]=extractList
        outputPath=iniFile.get('info','outputPath')
        # Write one extract_<input file name>.txt per input file, always as UTF-8.
        for key,data in fileDict.items():
            outputFile=outputPath+'extract_'+key+'.txt'
            with open(outputFile,encoding='utf-8',mode='w') as f:
                for d in data:
                    f.write(d)
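
The same read, filter, and write flow can be tried in isolation with the sketch below. It is a simplified stand-in rather than the repository's code: extract_matching_lines is a hypothetical helper, the directories are passed in as arguments instead of being read from the ini file, and the demo runs in a temporary directory with made-up sample data.

```python
import re
import tempfile
from pathlib import Path


def extract_matching_lines(src_dir, out_dir, pattern, encoding="utf-8"):
    """Copy the lines that match pattern from each *.csv in src_dir to out_dir."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for path in sorted(src_dir.glob("*.csv")):
        # Iterate the file lazily so large inputs are not loaded at once.
        with path.open(encoding=encoding) as f:
            matched = [line for line in f if pattern.search(line)]
        # Same naming scheme as the article: extract_<input file name>.txt
        out_path = out_dir / f"extract_{path.name}.txt"
        with out_path.open("w", encoding="utf-8") as f:
            f.writelines(matched)
        written[path.name] = matched
    return written


# Demo with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "sample.csv").write_text(
        "1,apple\n2,banana\n3,pineapple\n", encoding="utf-8"
    )
    result = extract_matching_lines(data_dir, Path(tmp) / "output", re.compile("apple"))
    print(result)
```

Creating the output directory up front avoids a failure when it does not exist yet, which the original code leaves to the user.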

Usage

See the README on GitHub.
If you just want to try it:
- Put the files you want to process in data
- Set the strings you want to extract in resources/検索データ.xlsx
- Run regExtract.exe

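Possible uses

- Checking a file used for system integration when there are several strings you want to confirm it contains
- Modifying the processing so that specific strings can be converted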
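
The conversion idea could look roughly like the sketch below. It is only an illustration of the approach, not code from the repository: replace_terms and the sample mapping are hypothetical, and the replacement is done with re.sub over an alternation of the escaped terms.

```python
import re


def replace_terms(text, mapping):
    """Replace every key of mapping found in text with its mapped value."""
    # Longest keys first, so that "foobar" wins over "foo" at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Look up the replacement for whichever term actually matched.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(replace_terms("tel: 03-1234, fax: 03-5678", {"tel": "phone", "fax": "facsimile"}))
```

This is also where sorting the terms by length matters: in an alternation the first alternative that matches wins, so the order decides which replacement is applied when one term is a prefix of another.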
