
Using regular expressions in Python

Posted at 2021-01-09

regex.spl

| makeresults 
| eval text="THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now." 
| rex field=text "(?ix)((?P<big_japan>(?P<japan>Japan).*?(?P=japan))) #from Japan to japan"

So named groups and backreferences work in Splunk too. I'll write that up elsewhere; I was so hopeless with re that I decided to practice here.
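The same pattern can be tried in Python's re module, which also supports named groups and the (?P=name) backreference. This is a minimal sketch that assumes only the first sentence of the sample text; the variable names are illustrative and not from the original post.

```python
import re

text = "THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now."

# (?ix): case-insensitive + verbose, so the trailing "# ..." is a regex comment.
pattern = re.compile(
    r"(?ix)((?P<big_japan>(?P<japan>Japan).*?(?P=japan))) # from Japan to japan"
)

m = pattern.search(text)
big_japan = m.group("big_japan")  # from the first "Japan" up to its next repetition
japan = m.group("japan")          # the text that the backreference must repeat

print(big_japan)  # JAPANESE SCHOOL EDUCATION In Japan
print(japan)      # JAPAN
```

Because of the i flag, the backreference also compares case-insensitively, which is why "JAPAN" can be closed by "Japan".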

The official re documentation is very easy to follow.

re

sample.txt

"THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now. The content of education is reduced and students come to have free time more. Furthermore, 'total education time' is taken in all Japanese junior high school. I think this change is bad and Japanese government must change it to original form rapidly for the following reasons. Firstly, many young people this time cannot read or write basic words (Japanese 'kanji.') And, they cannot calculate, too. These things are need in daily life, even if they don't go to college or university. Originally, Japanese student got better score in reading and calculation than any other country's student few decades ago. For, reading, writing, and calculation were very important in Japanese society. Now, however, this good value in old Japan is being reduced. This is very large problem in Japan. Secondly, there is deep gap between the level of high school education and university education. Many students who don't learn the content of high school education cannot catch up with the class in universities. Furthermore, for example, I am medical student, but I don't learn biology in high school. And there are many students like me. In addition, the care of university to us is nearly nothing. So, the level of the study in technology, medicine and so is going down. This is very large problem in Japan, too. Thirdly, as the content of school education is reduced, at the same time, the curiosity of students seems reduced. The new idea and new device are coming from the curiosity, I think. So, the reduction of it means the down of possibility that the evolutional change in various field will happen. This is very large problem in Japan. In conclusion, there are problems like these in Japan, because of the reduction of basic education. Luckily, the Japanese government is planning to change the education system. I hope this change will be going back to old Japanese school education system. \n"

Quoted from https://www.f.waseda.jp/yusukekondo/TALL19/TALL_Spring03.html

search

match only matches at the beginning of the string (as if the pattern started with ^), so I use search instead.
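The difference is easy to see on a small input. A minimal sketch, assuming the first sentence of the sample text; at_start and anywhere are illustrative names.

```python
import re

text = "THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now."

at_start = re.match(r"Japan", text)    # None: the string does not start with "Japan"
anywhere = re.search(r"Japan", text)   # scans forward and finds the first "Japan"

print(at_start)         # None
print(anywhere.span())  # (33, 38)
```

The uppercase "JAPANESE" is skipped here because no re.IGNORECASE flag is given.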

search.py

import re

# text holds the sample.txt string shown above
m = re.compile(r"""
\b(?P<sentence>.*?[.]) # try extracting one sentence
""", re.X)

result = m.search(text)

print(result)

The text is English, so I try cutting it at the period (.).

Result

<_sre.SRE_Match object; span=(0, 78), match='THE JAPANESE SCHOOL EDUCATION In Japan, education>

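len('"THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now.') is 79 including the leading double quote, so the sentence itself is 78 characters long, which agrees with the span.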
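The length check can also be done in code rather than by counting. A minimal sketch, assuming the first two sentences of the sample text; the names pattern and sentence are illustrative.

```python
import re

text = (
    "THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now. "
    "The content of education is reduced and students come to have free time more."
)

pattern = re.compile(r"\b(?P<sentence>.*?[.])")
result = pattern.search(text)
sentence = result.group("sentence")

print(result.span())   # (0, 78)
print(len(sentence))   # 78
# the span is a half-open slice into the searched string
print(text[result.start():result.end()] == sentence)  # True
```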

I wonder why a description of the Match object does not come up for Python 3.

SRE_Match object

getattr.py

import re

# text holds the sample.txt string shown above
m = re.compile(r"""
\b(?P<sentence>.*?[.]) # try extracting one sentence
""", re.X)

result = m.search(text)

# list every public attribute and method of the match object
for i in dir(result):
    if not i.startswith('__'):
        print(f'{i}: {getattr(result, i)}')

I wasn't sure what a Match object actually is, so I listed its attributes and methods.

Result

end: <built-in method end of _sre.SRE_Match object at 0x7fe65d3ba198>
endpos: 1969
expand: <built-in method expand of _sre.SRE_Match object at 0x7fe65d3ba198>
group: <built-in method group of _sre.SRE_Match object at 0x7fe65d3ba198>
groupdict: <built-in method groupdict of _sre.SRE_Match object at 0x7fe65d3ba198>
groups: <built-in method groups of _sre.SRE_Match object at 0x7fe65d3ba198>
lastgroup: sentence
lastindex: 1
pos: 0
re: re.compile('\n\\b(?P<sentence>.*?[.]) # try extracting one sentence\n', re.VERBOSE)
regs: ((0, 78), (0, 78))
span: <built-in method span of _sre.SRE_Match object at 0x7fe65d3ba198>
start: <built-in method start of _sre.SRE_Match object at 0x7fe65d3ba198>
string: THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now. The (... truncated)

This agrees with the Python 2.7 MatchObject documentation. I wondered where the Python 3 version lives; in the Python 3 re documentation it is the "Match Objects" section.
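Calling the methods listed above shows what each one returns. A minimal sketch, assuming a single sentence from the sample text as input; nothing here goes beyond the standard Match object API.

```python
import re

text = "In Japan, education system is changing fast now."
result = re.search(r"\b(?P<sentence>.*?[.])", text)

print(result.group())            # whole match (group 0)
print(result.group("sentence"))  # the named group, here the same text
print(result.groups())           # tuple with one entry per capture group
print(result.groupdict())        # {'sentence': '...'} keyed by group name
print(result.start(), result.end(), result.span())
print(result.lastgroup)          # sentence
```

pos and endpos are the search bounds, while start, end and span describe where the match itself landed.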

findall

findall.py

m = re.compile(r"""
\b(?P<sentence>.*?[.]) # try extracting one sentence
""", re.X)

result = m.findall(text)  # search returns only the first match, findall returns all of them

print(type(result))
print('-'*10)
for i in dir(result):
    if not i.startswith('__'):
        print(f'{i}: {getattr(result, i)}')
print('-'*10)
for i in result:
    print(i)  # the result is a list, so print the items one by one

Use findall when you want every match.

Result

<class 'list'>
----------
append: <built-in method append of list object at 0x7fe65d2dca48>
clear: <built-in method clear of list object at 0x7fe65d2dca48>
copy: <built-in method copy of list object at 0x7fe65d2dca48>
count: <built-in method count of list object at 0x7fe65d2dca48>
extend: <built-in method extend of list object at 0x7fe65d2dca48>
index: <built-in method index of list object at 0x7fe65d2dca48>
insert: <built-in method insert of list object at 0x7fe65d2dca48>
pop: <built-in method pop of list object at 0x7fe65d2dca48>
remove: <built-in method remove of list object at 0x7fe65d2dca48>
reverse: <built-in method reverse of list object at 0x7fe65d2dca48>
sort: <built-in method sort of list object at 0x7fe65d2dca48>
----------
THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now.
The content of education is reduced and students come to have free time more.
Furthermore, 'total education time' is taken in all Japanese junior high school.
I think this change is bad and Japanese government must change it to original form rapidly for the following reasons.
Firstly, many young people this time cannot read or write basic words (Japanese 'kanji.
And, they cannot calculate, too.
These things are need in daily life, even if they don't go to college or university.
Originally, Japanese student got better score in reading and calculation than any other country's student few decades ago.
For, reading, writing, and calculation were very important in Japanese society.
Now, however, this good value in old Japan is being reduced.
This is very large problem in Japan.
Secondly, there is deep gap between the level of high school education and university education.
Many students who don't learn the content of high school education cannot catch up with the class in universities.
Furthermore, for example, I am medical student, but I don't learn biology in high school.
And there are many students like me.
In addition, the care of university to us is nearly nothing.
So, the level of the study in technology, medicine and so is going down.
This is very large problem in Japan, too.
Thirdly, as the content of school education is reduced, at the same time, the curiosity of students seems reduced.
The new idea and new device are coming from the curiosity, I think.
So, the reduction of it means the down of possibility that the evolutional change in various field will happen.
This is very large problem in Japan.
In conclusion, there are problems like these in Japan, because of the reduction of basic education.
Luckily, the Japanese government is planning to change the education system.
I hope this change will be going back to old Japanese school education system.

The result is a list.
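What goes into that list depends on the capture groups in the pattern. A minimal sketch on a made-up input string (not the sample text) comparing zero, one and two groups:

```python
import re

text = "Japan is here. JAPANESE school. Old Japan."

no_group = re.findall(r"(?i)japan\w*", text)      # whole matches
one_group = re.findall(r"(?i)(japan)\w*", text)   # only the group's text
two_groups = re.findall(r"(?i)(japan)(\w*)", text)  # one tuple per match

print(no_group)    # ['Japan', 'JAPANESE', 'Japan']
print(one_group)   # ['Japan', 'JAPAN', 'Japan']
print(two_groups)  # [('Japan', ''), ('JAPAN', 'ESE'), ('Japan', '')]
```

With the single sentence group used above, findall therefore returns plain strings rather than Match objects.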

split

split.py

result1 = re.split(r'(?<=\.)\s', text)  # raw string; the lookbehind keeps the period in each piece


print(type(result1))
print('-'*10)

m2=re.compile(r"""
(?P<japan_txt>japan.*?)\b #test
""",re.VERBOSE|re.IGNORECASE)

{i: [v, m2.search(v).group()] for i, v in enumerate(result1) if m2.search(v)}

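If the goal is only to cut the text into sentences, split() should be enough, so I tried it.
I wanted to treat . as the delimiter but still keep it, so the split actually happens on the whitespace that follows it.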
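The effect of the lookbehind is clearest next to a plain split. A minimal sketch on a made-up three-sentence string; plain and lookbehind are illustrative names.

```python
import re

text = "First one. Second one. Third one."

# Splitting on "period + whitespace" consumes the period.
plain = re.split(r"\.\s", text)
# (?<=\.) only checks that a period precedes; just the whitespace is consumed.
lookbehind = re.split(r"(?<=\.)\s", text)

print(plain)       # ['First one', 'Second one', 'Third one.']
print(lookbehind)  # ['First one.', 'Second one.', 'Third one.']
```

A period that is not followed by whitespace, such as the one in 'kanji.' in the sample text, is not a split point with this pattern.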

Result

<class 'list'>
----------
{0: ['THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now.',
  'JAPANESE'],
 2: ["Furthermore, 'total education time' is taken in all Japanese junior high school.",
  'Japanese'],
 3: ['I think this change is bad and Japanese government must change it to original form rapidly for the following reasons.',
  'Japanese'],
 4: ["Firstly, many young people this time cannot read or write basic words (Japanese 'kanji.') And, they cannot calculate, too.",
  'Japanese'],
 6: ["Originally, Japanese student got better score in reading and calculation than any other country's student few decades ago.",
  'Japanese'],
 7: ['For, reading, writing, and calculation were very important in Japanese society.',
  'Japanese'],
 8: ['Now, however, this good value in old Japan is being reduced.', 'Japan'],
 9: ['This is very large problem in Japan.', 'Japan'],
 16: ['This is very large problem in Japan, too.', 'Japan'],
 20: ['This is very large problem in Japan.', 'Japan'],
 21: ['In conclusion, there are problems like these in Japan, because of the reduction of basic education.',
  'Japan'],
 22: ['Luckily, the Japanese government is planning to change the education system.',
  'Japanese'],
 23: ['I hope this change will be going back to old Japanese school education system.',
  'Japanese']}

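result1 is a list.

After that, I search each sentence for japan regardless of case (re.IGNORECASE) and output the sentences that contain it as a dict of the form index: [sentence, matched text].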
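The same index-to-match mapping can be written as an explicit loop so that each sentence is searched only once. A minimal sketch on a shortened, made-up input; sentences, japan_re and hits are illustrative names.

```python
import re

text = (
    "In Japan, education changes. Students have free time. "
    "The Japanese government plans reform."
)

sentences = re.split(r"(?<=\.)\s", text)
japan_re = re.compile(r"(?P<japan_txt>japan.*?)\b", re.IGNORECASE)

hits = {}
for i, sentence in enumerate(sentences):
    m = japan_re.search(sentence)  # None when the sentence has no "japan"
    if m:
        hits[i] = [sentence, m.group("japan_txt")]

print(hits)
# {0: ['In Japan, education changes.', 'Japan'],
#  2: ['The Japanese government plans reform.', 'Japanese']}
```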

finditer

m2=re.compile(r"""
(?P<japan_txt>japan.*?)\b #test
""",re.VERBOSE|re.IGNORECASE)

result=re.finditer(m2,text)

print(result)

print('-'*10)

for i in result:
  print(i)

finditer returns the results as an iterator.

Result

<callable_iterator object at 0x7fe65d2e5ba8>
----------
<_sre.SRE_Match object; span=(4, 12), match='JAPANESE'>
<_sre.SRE_Match object; span=(33, 38), match='Japan'>
<_sre.SRE_Match object; span=(209, 217), match='Japanese'>
<_sre.SRE_Match object; span=(269, 277), match='Japanese'>
<_sre.SRE_Match object; span=(427, 435), match='Japanese'>
<_sre.SRE_Match object; span=(576, 584), match='Japanese'>
<_sre.SRE_Match object; span=(749, 757), match='Japanese'>
<_sre.SRE_Match object; span=(804, 809), match='Japan'>
<_sre.SRE_Match object; span=(858, 863), match='Japan'>
<_sre.SRE_Match object; span=(1368, 1373), match='Japan'>
<_sre.SRE_Match object; span=(1705, 1710), match='Japan'>
<_sre.SRE_Match object; span=(1760, 1765), match='Japan'>
<_sre.SRE_Match object; span=(1825, 1833), match='Japanese'>
<_sre.SRE_Match object; span=(1934, 1942), match='Japanese'>

Each Match object carries the position (span) and the matched text.
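Those values can be pulled out of each Match object directly instead of reading them off the repr. A minimal sketch, assuming the first sentence of the sample text; found is an illustrative name.

```python
import re

text = "THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now."
japan_re = re.compile(r"(?P<japan_txt>japan.*?)\b", re.IGNORECASE)

# finditer is lazy; the list comprehension consumes it once.
found = [(m.start(), m.end(), m.group()) for m in japan_re.finditer(text)]

print(found)  # [(4, 12, 'JAPANESE'), (33, 38, 'Japan')]
```

Note that the iterator is exhausted after one pass, so it has to be recreated to loop over the matches again.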

groupdict

groupdict.py

m2=re.compile(r"""
(?P<japan_txt>japan.*?)\b #test
""",re.VERBOSE|re.IGNORECASE)

result=re.finditer(m2,text)


[i.groupdict() for i in result]

Result

[{'japan_txt': 'JAPANESE'},
 {'japan_txt': 'Japan'},
 {'japan_txt': 'Japanese'},
 {'japan_txt': 'Japanese'},
 {'japan_txt': 'Japanese'},
 {'japan_txt': 'Japanese'},
 {'japan_txt': 'Japanese'},
 {'japan_txt': 'Japan'},
 {'japan_txt': 'Japan'},
 {'japan_txt': 'Japan'},
 {'japan_txt': 'Japan'},
 {'japan_txt': 'Japan'},
 {'japan_txt': 'Japanese'},
 {'japan_txt': 'Japanese'}]

Each dict maps the name of the capture group (japan_txt) to the text it matched.
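groupdict becomes more useful once a pattern has several named groups, some of which may not take part in a match. A minimal sketch with a hypothetical two-group pattern (stem and suffix are names invented for this example) on a made-up string:

```python
import re

text = "old Japan and Japanese school"
pattern = re.compile(r"(?P<stem>japan)(?P<suffix>ese)?\b", re.IGNORECASE)

rows = [m.groupdict() for m in pattern.finditer(text)]
# default= replaces None for groups that did not participate
rows_default = [m.groupdict(default="") for m in pattern.finditer(text)]

print(rows)
# [{'stem': 'Japan', 'suffix': None}, {'stem': 'Japan', 'suffix': 'ese'}]
print(rows_default)
# [{'stem': 'Japan', 'suffix': ''}, {'stem': 'Japan', 'suffix': 'ese'}]
```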

Summary

I tried a number of things for now, but this is still far from enough.
I'll stop here for the time being.
