More than 5 years have passed since last update.

A Script That Runs Morphological Analysis on a Specified URL

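This is an experiment in running morphological analysis on the whole page behind a specified URL.
I tried to strip the HTML tags with a regular expression, but it does not remove everything.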
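
One reason a tag-only regular expression leaves residue is that it deletes the tags themselves but keeps whatever sits between `<script>` or `<style>` tags. The following is a minimal sketch of an alternative, not the approach used in this article: it uses the standard library `html.parser` and is written for Python 3, unlike the Python 2 script in this post. The names `TextExtractor` and `html_to_text` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments found so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped as well.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Passing the downloaded page (decoded to `str`) through `html_to_text` before handing it to MeCab should keep JavaScript and CSS source out of the word counts.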

urlmecab.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: uses raw_input, urllib.urlopen and the print statement.
import urllib
import MeCab
import re

# Naive tag stripper: removes <...> tags, but not the text inside
# <script> or <style> elements.
TAG_PATTERN = re.compile(r"<[^>]*?>")


def Mecab_file(search_url):
    # Download the page and strip the HTML tags.
    req = urllib.urlopen(search_url)
    dlText = req.read()
    req.close()
    text = TAG_PATTERN.sub("", dlText)

    mt = MeCab.Tagger("mecabrc")
    node = mt.parseToNode(text)
    words = {}

    # Walk the node list and count surfaces whose posid is in the noun range.
    while node:
        word = node.surface
        if word and node.posid >=36 and node.posid <=67:
            words[word] = words.get(word, 0) + 1
        node = node.next

    # Print in reverse lexicographic order of the word.
    for word, count in sorted(words.items(), reverse=True):
        print word, count


while True:
    search_url = raw_input(u"input URL: ")
    # An empty line ends the loop.
    if not search_url:
        break
    Mecab_file(search_url)

Only nouns are extracted, using MeCab's part-of-speech IDs.

if word and node.posid >=36 and node.posid <=67:

Changing this condition might allow for all sorts of experiments.
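
As a rough sketch of what changing the condition could look like, the filter below takes the part-of-speech ranges as data instead of hard-coding them. It is written for Python 3 and uses plain `(surface, posid)` pairs as a stand-in for MeCab nodes so that it runs without MeCab. The noun range is the one used in the script; the verb and adjective ranges are assumptions based on the IPADIC `pos-id.def` file and should be checked against the dictionary you actually use. `POS_RANGES` and `count_by_pos` are hypothetical names.

```python
from collections import Counter

# posid ranges, assumed to follow IPADIC's pos-id.def.
POS_RANGES = {
    "noun": (36, 67),       # the range used in urlmecab.py
    "verb": (31, 33),       # assumption: verify against your dictionary
    "adjective": (10, 12),  # assumption: verify against your dictionary
}


def count_by_pos(tokens, pos_names):
    """Count surfaces whose posid falls in any of the named ranges.

    tokens is an iterable of (surface, posid) pairs, standing in for
    the node.surface and node.posid values read from MeCab.
    """
    ranges = [POS_RANGES[name] for name in pos_names]
    counts = Counter()
    for surface, posid in tokens:
        if surface and any(lo <= posid <= hi for lo, hi in ranges):
            counts[surface] += 1
    return counts


# Hand-written sample tokens; the posid values are illustrative only.
tokens = [("猫", 38), ("走る", 31), ("猫", 38), ("は", 16), ("速い", 10)]
noun_counts = count_by_pos(tokens, ["noun"])
other_counts = count_by_pos(tokens, ["verb", "adjective"])
```

Because the result is a `Counter`, `noun_counts.most_common()` lists the words by frequency, which is a different order from the reverse word order the script prints.
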
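The loop keeps running as long as URLs are entered; pressing Enter on an empty line breaks out of it.
The URL has to be entered in full, starting with http://.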
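
If typing the prefix every time is a nuisance, the input could be normalized before it is opened. This is a minimal Python 3 sketch, not part of the original script; `ensure_scheme` is a hypothetical helper, and defaulting to `http` is an assumption.

```python
import re

# Matches a leading URL scheme such as "http://" or "https://".
SCHEME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def ensure_scheme(url, default_scheme="http"):
    """Prepend default_scheme:// when the input has no scheme of its own."""
    url = url.strip()
    # Leave empty input alone so it can still end the input loop.
    if not url or SCHEME_PATTERN.match(url):
        return url
    return default_scheme + "://" + url
```

With this, `ensure_scheme("qiita.com")` gives `http://qiita.com`, while an input that already starts with `https://` is returned unchanged.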
