
Where I got stuck in Collective Intelligence Chapter 3. It is not a listed erratum, so something in my own code is probably wrong.

Posted at 2015-01-25 (last updated at 2015-01-25)

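To deepen my understanding of machine learning, I recently picked up Collective Intelligence (published in Japanese as 集合知プログラミング). In the Hierarchical Clustering section of Chapter 3, I wrote clusters.py, ran its readfile function, and got stuck because the rows were not being indexed the way the code expects.

Collective Intelligence has many typos and few official corrections, so an unofficial errata list has been compiled. This problem is not on that list either, so my own code is probably at fault. If you spot a mistake, please let me know.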

Running readfile('blogdata.txt') from clusters.py

First, while preparing the dataset, I wrote the following code and saved it as clusters.py.

clusters.py

def readfile(filename):
  lines=[line for line in file(filename)]

  # First line is the column titles
  colnames=lines[0].strip().split('\t')[1:]
  rownames=[]
  data=[]

  for line in lines[1:]:
    p=line.strip().split('\t')
    # First column in each row is the rowname
    rownames.append(p[0])
    # The data for this row is the remainder of the row
    data.append([float(x) for x in p[1:]])
  return rownames,colnames,data

Next, I imported that file and ran the following in the interpreter:

blognames, words, data=clusters.readfile('blogdata.txt')

'could not convert string to float: looking'

The interpreter rejects the call with 'could not convert string to float: looking'. Here, blogdata.txt holds data that was parsed with feedparser and saved in the following form:


	four	looking	second	here	music	until	example	want	wrong	easier	series	re	wasn	service	project	person	episode	best	country	asked	much	life	things	big	couple	had	easy	possible	right	old	people	support	later	time	leave	love	working	awesome	such	data	so	years	didn	internet	million	quite	open	future	san	say	saw	note	take	ways	going	where	many	wants	photos	single	technology	being	around	traffic	world	power	favorite	other	image	her	am	number	tv	th	large	small	past	hours	via	company	learn	states	information	its	always	found	week	really	major	also	play	plan	set	see	movie	last	whole	recent	d	continue	anything	into	link	line	posted	us	ago	having	try	video	let	great	makes	tools	next	process	high	move	doing	could	start	system	fact	should	hope	means	stuff	edition	email	less	web	government	five	become	does	chance	told	work	interview	after	order	office	then	them	they	network	another	do	away	com	voice	hand	photo	night	security	marketing	post	months	way	update	together	p	guy	change	history	live	car	write	product	remember	still	now	january	year	space	shows	friend	than	online	only	between	article	comes	these	media	real	read	early	using	business	aren	lot	trying	building	since	month	very	family	put	ve	site	help	actually	event	reason	ask	american	off	clear	pretty	during	x	close	won	probably	else	look	while	user	game	some	doesn	youtube	go	facebook	click	products	started	control	links	software	front	times	exactly	need	able	based	course	she	state	key	problem	both	well	page	twitter	home	he	friends	amp	companies	likely	even	ever	never	call	tell	give	before	better	went	side	content	isn	features	matter	don	m	points	stop	bad	said	against	three	if	make	left	human	yes	yet	deal	popular	down	digital	me	did	run	box	making	may	man	maybe	talk	nbsp	interesting	thing	think	first	long	little	anyone	were	especially	show	black	get	nearly	morning	behind	reading	across	among	those	different	same	running	money	either	users	enough	videos	film	again	important	u	public	search	two	share	coming	through	late	
someone	everyone	house	hard	idea	done	least	part	tool	most	find	please	point	simple	itself	bit	google	often	back	others	bunch	ll	day	text	including	taking	value	almost	thought	latest	add	like	works	buy	minutes	special	under	every	would	phone	must	my	keep	end	over	writing	each	group	got	free	days	already	top	too	took	talking	though	watch	amazon	report	full	however	news	quickly	several	social	everything	why	head	check	no	when	cool	posts	says	goes	sports	today	local	name	turn	place	given	released	any	ideas	sure	written	come	case	good	without	seems	blog	there	program	far	list	design	version	short	might	used	friday	feel	story	store	king	kind	nothing	windows	his	him	art	political	questions	fast	called	once	issues	apple	app	use	few	something	united	six	instead	looks	our	york	their	which	who	ones	view	available	stories	gets	know	press	because	lead	getting	own	made	book
Schneier on Security	1	0	1	2	0	2	1	2	2	1	0	5	0	1	1	0	0	2	2	0	4	0	2	1	2	2	0	1	2	1	4	1	2	6	0	0	0	0	3	2	3	1	0	6	0	0	0	3	0	1	4	0	1	1	5	4	3	0	0	0	2	3	3	0	2	1	0	6	0	0	0	2	0	0	0	1	0	0	0	1	1	2	1	9	0	0	0	0	2	3	0	1	1	3	1	1	0	1	0	0	1	2	0	0	0	15	1	1	1	0	2	0	1	1	0	3	1	1	1	9	0	1	1	9	0	1	0	0	0	0	0	12	0	2	2	0	0	5	0	0	1	1	0	5	20	2	1	5	3	1	0	3	0	1	7	0	2	2	1	0	0	0	0	1	1	1	0	0	0	0	1	2	0	4	0	0	0	4	0	7	4	2	0	6	0	1	0	0	4	0	0	2	1	1	2	0	5	0	0	0	0	1	1	0	1	0	1	3	0	1	1	0	0	0	0	2	0	1	1	0	1	2	0	0	0	0	1	1	1	0	0	0	2	0	4	1	2	0	0	2	0	4	0	5	0	0	0	5	0	0	0	1	6	0	2	2	3	1	2	2	0	0	0	1	0	2	5	0	1	0	0	3	7	1	5	1	0	2	0	0	1	0	4	0	0	9	1	0	3	3	0	1	1	0	1	3	1	3	2	0	0	8	0	1	1	4	2	0	1	0	1	1	3	4	9	0	0	5	0	1	1	0	0	1	0	2	0	4	0	2	1	2	0	1	0	2	0	0	1	1	0	5	0	0	0	0	2	0	0	2	1	1	0	0	0	1	2	1	0	0	0	0	0	3	0	0	0	0	2	1	3	1	0	0	0	0	3	0	1	2	1	0	1	2	0	0	0	0	2	0	0	0	7	1	5	1	4	0	1	5	0	0	2	14	0	0	1	0	0	0	0	0	0	0	0	0	2	0	2	2	1	1	0	2	1	1	4	2	0	0	0	0	0	5	4	1	0	0	2	0	1	0	1	1	0	1	0	0	0	2	1	0	0	0	2	1	1	1	0	0	0	3	0	11	5	13	1	1	3	2	0	7	1	7	0	0	2	0	0
PaulStamatiou.com - Technology, Design and Photography	2	21	13	69	15	38	53	120	5	23	6	115	19	21	5	15	2	47	2	12	141	26	60	29	0	100	34	11	74	29	71	21	34	159	11	31	50	2	36	52	210	28	39	7	3	26	31	17	10	22	2	18	69	12	54	91	66	11	131	13	4	50	76	9	17	18	6	95	105	3	20	13	12 …

As I understand it, while converting the numeric data in the file to float, the code also tried to convert the string "looking" to float, which cannot be done, hence the error.
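The failure can be reproduced in isolation. This is a minimal sketch, assuming the offending line looks like the word-list line in the excerpt above (it starts with a tab). parse_row is a hypothetical helper, not part of the book's code, that mirrors the two parsing steps inside readfile, and it is written for Python 3.

```python
def parse_row(line):
    """Split a tab-separated row the way readfile does: a name, then float values."""
    p = line.strip().split('\t')
    return p[0], [float(x) for x in p[1:]]

# A numeric row parses without trouble.
name, values = parse_row("Schneier on Security\t1\t0\t1\n")

# A row of words fails. strip() removes the leading tab, so "four" is taken
# as the row name and "looking" is the first token handed to float().
try:
    parse_row("\tfour\tlooking\tsecond\n")
    error = None
except ValueError as exc:
    error = str(exc)
```

This would explain why the message names "looking" and not "four". Python 3 wraps the token in quotes in the message, while the message quoted in this post has the Python 2 form.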

The problem: lines[1] contains string data

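What happened (this part is my own guess) is that lines holds, as the element right after a line break, the list of words found in the blogs, followed by the word-count data. With the original code, lines[1] therefore contains the word list, that is, string data, and the loop tries to convert it to float, which I think caused this error.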
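One way to check this guess is to look at the raw beginning of the file. The sketch below is only an illustration: preview_lines is a hypothetical helper, and it takes any iterable of lines so it can be tried without the real file. repr() is used so that tabs and newlines show up as \t and \n.

```python
from itertools import islice

def preview_lines(lines, count=3, width=40):
    """Return repr() of the first `count` lines, each cut to `width` characters."""
    return [repr(line[:width]) for line in islice(lines, count)]

# Possible usage against the real file (assumes blogdata.txt is in the
# current directory):
#
#   with open('blogdata.txt') as f:
#       for i, text in enumerate(preview_lines(f)):
#           print(i, text)
```

If index 0 prints as a bare '\n' and index 1 starts with '\tfour\tlooking', the file begins with an empty line and the word list really is in lines[1].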

So the for loop has to skip that first element before converting to float. I am not used to Python, so the code is ugly, but I think it needs to be written as follows. (In fact, this worked.)

clusters.py

def readfile(filename):
  lines=[line for line in file(filename)]

  # First line is the column titles
  colnames=lines[0].strip().split('\t')[1:]
  rownames=[]
  data=[]

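  # Workaround: in my file lines[1] holds the word list, not numeric data
  first_line=lines[1]

  for line in lines[1:]:
    # Skip the word-list line before recording anything for it,
    # so its first word does not end up in rownames
    if line==first_line: continue
    p=line.strip().split('\t')
    # First column in each row is the rowname
    rownames.append(p[0])
    # The data for this row is the remainder of the row
    data.append([float(x) for x in p[1:]])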
  return rownames,colnames,data
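A more tolerant reader is sketched below as an alternative to skipping one specific line. It assumes, based only on the excerpt above, that the file may start with an empty line and that the word-list line has an empty first cell (a leading tab). read_matrix is a hypothetical name, not the book's function. It uses only constructs that also work in Python 3, where the file() built-in no longer exists.

```python
def read_matrix(lines):
    """Parse tab-separated rows: one header row, then rows of name + counts."""
    # Ignore blank lines so an empty first line is not mistaken for the header.
    # rstrip() drops the newline and trailing tabs but keeps a leading tab,
    # so an empty first header cell stays in place.
    rows = [line.rstrip().split('\t') for line in lines if line.strip()]
    if not rows:
        raise ValueError('no data found')

    # The first header cell is the corner label (possibly empty); skip it.
    colnames = rows[0][1:]
    rownames = []
    data = []
    for cells in rows[1:]:
        rownames.append(cells[0])
        data.append([float(x) for x in cells[1:]])
    return rownames, colnames, data
```

With the real file it could be called as `with open('blogdata.txt') as f: blognames, words, data = read_matrix(f)`. Because the leading tab is kept, the first word ("four" in the excerpt) remains in the column names, whereas strip() followed by [1:] would drop it.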

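Everything I have checked suggests that the original code works as is for other people, so the mistake is most likely in my own code. If you notice anything, please point it out. And if this post happens to help someone, I would be glad.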
