
A script that runs morphological analysis on a specified URL

Posted at 2015-04-13

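An experiment in running morphological analysis on the entire content of a given URL.
I tried to strip the HTML tags with a regular expression, but it does not remove everything.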
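One possible way around the leftover markup is to let a real HTML parser do the stripping instead of a regular expression. The following is only a minimal sketch using the standard library's `html.parser`, written for Python 3 (the script in this post targets Python 2). The names `TextExtractor` and `strip_html` are made up for illustration, and the input is assumed to be an already decoded string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Character references such as &amp; arrive here already decoded.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The returned string could then be passed to MeCab in place of the regex-stripped text. Comments, tags, and inline JavaScript or CSS are dropped, which is where the regular expression tends to fall short.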

urlmecab.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: fetch a URL, strip HTML tags, and count nouns with MeCab.
import urllib
import MeCab
import re


def Mecab_file(search_url):
	# Download the raw HTML of the page.
	req = urllib.urlopen(search_url)
	dlText = req.read()

	mt = MeCab.Tagger("mecabrc")
	# Naive tag removal; script/style bodies and entities are left behind.
	p = re.compile(r"<[^>]*?>")
	sus = p.sub("", dlText)

	node = mt.parseToNode(sus)
	words = {}

	# Walk the linked list of nodes returned by MeCab.
	while node:
		word = node.surface
		# Keep only nouns, selected by part-of-speech ID.
		if word and node.posid >=36 and node.posid <=67:
			if word not in words:
				words[word] = 0
			words[word] += 1
		node = node.next
	# Print each word with its count, sorted by word in descending order.
	word_items = sorted(words.items(), reverse=True)
	for word, count in word_items:
		print word, count


while True:
	search_url = raw_input("input URL: ")
	# An empty line ends the loop.
	if not search_url:
		break
	Mecab_file(search_url)

Only nouns are extracted, using MeCab's part-of-speech IDs.

if word and node.posid >=36 and node.posid <=67:

Changing this condition should let you play around with other parts of speech.
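As a rough illustration of how that condition could be made configurable, the sketch below takes the ID range as a parameter and counts with `collections.Counter`. It is written for Python 3 and does not import MeCab itself: `iter_tokens` only assumes node objects with `surface`, `posid`, and `next` attributes, as used in the script. The ID ranges are an assumption based on the IPA dictionary's numbering (other dictionaries may differ), and the names `NOUN_IDS`, `VERB_IDS`, `iter_tokens`, and `count_surfaces` are hypothetical.

```python
from collections import Counter

# Assumed POS-ID ranges for the IPA dictionary; check your dictionary's
# pos-id.def before relying on them.
NOUN_IDS = range(36, 68)  # 36..67, the range used in the script
VERB_IDS = range(31, 34)  # 31..33, assumed to be verbs


def iter_tokens(node):
    """Yield (surface, posid) pairs from a linked list of MeCab-like nodes."""
    while node:
        yield node.surface, node.posid
        node = node.next


def count_surfaces(tokens, pos_ids=NOUN_IDS):
    """Count surfaces whose POS ID is in pos_ids.

    tokens is any iterable of (surface, posid) pairs.
    """
    counts = Counter()
    for surface, posid in tokens:
        if surface and posid in pos_ids:
            counts[surface] += 1
    return counts
```

With a real tagger, something like `count_surfaces(iter_tokens(mt.parseToNode(text)), VERB_IDS)` would count verbs instead of nouns, and `.most_common()` on the result lists the words by frequency rather than alphabetically.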
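The script keeps looping for as long as you enter URLs; pressing Enter on an empty line breaks out of the loop.
The URL has to be entered in full, starting with http://.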
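If typing the scheme every time gets tedious, the input could be normalized before it is fetched. This is just a small sketch for Python 3; `normalize_url` is a hypothetical helper, and defaulting to `http` is an assumption, not something the script does.

```python
def normalize_url(text, default_scheme="http"):
    """Prepend a scheme when the user typed a bare host or path."""
    text = text.strip()
    # Empty input is returned unchanged so it can still end the loop.
    if text and "://" not in text:
        text = default_scheme + "://" + text
    return text
```

Because an empty string comes back unchanged, the blank-line exit still works when the result is used as the loop condition.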
