Converting a LINE talk history into a DataFrame

Last updated at 2022-02-15, posted at 2022-02-13

Goal

Take a talk history that LINE exported as a text file and load it into a pandas DataFrame like the one below.

date	time	by	content
2022/2/14	10:10	テスト太郎	hoge hoge
2022/2/14	10:11	テスト花子	[スタンプ]

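Code

DataFrame.append is inefficient because every call returns a new copy of the whole frame, so this version first builds one list per column and converts the lists to a DataFrame once at the end.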

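import re
import pandas as pd


def lineHistoryToDf(filePath):
    # A date line looks like "2022/2/14(月)": the date followed by the weekday in parentheses.
    regex = re.compile(r'^[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}\(.\)$')
    date, time, by, content = [], [], [], []
    today = None  # most recent date line seen so far

    # Assumption: the exported file is UTF-8 encoded.
    with open(filePath, encoding='utf-8') as f:
        for line in f:
            l = line.rstrip('\r\n')
            nl = l.split('\t', 2)
            if regex.match(l):
                today = l
            elif today is None:
                # Header lines before the first date are not talk history.
                continue
            elif len(nl) == 3:
                # time, sender, content (tabs after the second one stay in content)
                date.append(today[:-3])  # drop the trailing "(weekday)"
                time.append(nl[0])
                by.append(nl[1])
                content.append(nl[2])
            elif len(nl) == 2:
                # time and content only, with no sender
                date.append(today[:-3])
                time.append(nl[0])
                by.append(None)
                content.append(nl[1])
            elif content:
                # Continuation of a multi-line message: append to the previous content.
                content[-1] += l

    return pd.DataFrame(data={'date': date, 'time': time, 'by': by, 'content': content},
                        columns=['date', 'time', 'by', 'content'])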

The following version is slow, so I do not recommend it.

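import os
import re
import pandas as pd

# open() does not expand "~", so expand it explicitly.
filePath = os.path.expanduser('~/line_log.txt')

df = pd.DataFrame(columns=['date', 'time', 'by', 'content'])
regex = re.compile(r'^[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}\(.\)$')
today = None  # most recent date line seen so far

# Assumption: the exported file is UTF-8 encoded.
with open(filePath, encoding='utf-8') as f:
    for line in f:
        l = line.rstrip('\r\n')
        nl = l.split('\t', 2)
        if regex.match(l):
            today = l
        elif today is None:
            # Header lines before the first date are not talk history.
            continue
        elif len(nl) > 2:
            nl.insert(0, today[:-3])  # drop the trailing "(weekday)"
            nls = pd.Series(nl, index=df.columns)
            # DataFrame.append copies the whole frame on every call.
            # It was deprecated in pandas 1.4 and removed in 2.0,
            # so this line only runs on pandas < 2.0.
            df = df.append(nls, ignore_index=True)
        elif len(df) > 0:
            # Anything else is appended to the previous message.
            df.iat[-1, -1] += l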
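Approach

The header lines at the top of the text file (the first two lines) are not talk history, so they are skipped.
When a line splits into three or more tab-separated fields (that is, it contains two or more tabs)
- It is treated as a message. Only the first two tabs, reading from the left, act as delimiters; any later tabs are kept as part of the content.
When a line contains fewer tabs than that
- If it splits into exactly two fields: the first version above records it as a time and content with by set to None. The slow version does not treat this case separately.
- If it is a date: it becomes the date for all the lines that follow.
- Otherwise: it is treated as a continuation of the previous message and appended to that message's content.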
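
The rules above can be sketched as a small classifier that looks at one line at a time. The function name `classify_line`, the kind labels ('message', 'system', 'date', 'continuation'), and the sample lines are all made up for illustration; only the regular expression and the splitting rule come from the code in this article.

```python
import re

DATE_LINE = re.compile(r'^[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}\(.\)$')


def classify_line(line):
    """Return a (kind, fields) pair for one line of the export."""
    text = line.rstrip('\r\n')
    fields = text.split('\t', 2)  # tabs after the second stay in the content
    if len(fields) == 3:
        return 'message', fields       # time, sender, content
    if len(fields) == 2:
        return 'system', fields        # time and content, no sender
    if DATE_LINE.match(text):
        return 'date', [text[:-3]]     # drop the trailing "(weekday)"
    return 'continuation', [text]      # belongs to the previous message


# Sample lines made up for illustration.
sample = [
    '2022/2/14(月)\n',
    '10:10\tテスト太郎\thoge\thoge\n',
    '10:11\tテスト花子\t[スタンプ]\n',
    '10:12\ta line without a sender\n',
    'second line of a message\n',
]
for line in sample:
    print(classify_line(line))
```

Keeping the classification separate from the list building makes each rule easy to check on its own.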

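Note that content includes meta markers such as "[ファイル]" (file), "[画像]" (image), and "[スタンプ]" (sticker), so adding separate flag columns for them makes the data easier to analyze. This site looks like a useful reference.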
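
One possible way to add such flags is sketched below. The column names `is_file`, `is_image`, and `is_stamp` are hypothetical, the sample rows are made up, and the exact-match comparison assumes that a marker makes up the whole message.

```python
import pandas as pd

# Sample rows made up for illustration.
df = pd.DataFrame({
    'content': ['hoge hoge', '[スタンプ]', '[画像]', '[ファイル]'],
})

# Hypothetical column names mapped to the marker each one detects.
markers = {
    'is_file': '[ファイル]',
    'is_image': '[画像]',
    'is_stamp': '[スタンプ]',
}
for column, marker in markers.items():
    # Exact match: assumes the marker is the entire content of the message.
    df[column] = df['content'] == marker

print(df)
```

A message in which someone typed the marker text by hand would be flagged as well, so treat the flags as a heuristic.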
