2014-10-22
mahout版本是0.9;hadoop版本是1.2.1。
下载数据集20 newsgroups dataset,解压后得到20news-bydate
目录:
$ cp -R 20news-bydate/*/* 20news-all
在20news-all
目录下,一个子目录代表一个分类,每个分类下有多个文本文件,每个文本文件代表一个样本。
然后,复制到hdfs中:
$ hadoop distcp file:///home/sunlt/20news-all 20news-all
查看一下:
$ hadoop dfs -ls
Warning: $HADOOP_HOME is deprecated.
Found 6 items
......
drwxr-xr-x - sunlt supergroup 0 2014-10-13 10:09 /user/sunlt/20news-all
......
转换成<text,text>
的序列文件:
$ mahout seqdirectory \
-i 20news-all \
-o 20news-seq \
-ow
转换成< Text, VectorWritable >
的序列文件:
$ mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf
此处使用了TF-IDF(词频-逆文档频率)生成向量,也可以指定为TF
。
更多参数,可以使用mahout seq2sparse --help
查看。
拆分数据
将60%的数据用于训练,40%的数据用于测试。
$ mahout split \
-i 20news-vectors/tfidf-vectors \
--trainingOutput 20news-train-vectors \
--testOutput 20news-test-vectors \
--randomSelectionPct 40 \
--overwrite --sequenceFiles -xm sequential
此时:
$ hadoop dfs -ls
Warning: $HADOOP_HOME is deprecated.
Found 10 items
drwxr-xr-x - sunlt supergroup 0 2014-10-13 09:52 /user/sunlt/20_newsgroups
drwxr-xr-x - sunlt supergroup 0 2014-10-13 10:09 /user/sunlt/20news-all
drwxr-xr-x - sunlt supergroup 0 2014-10-13 10:28 /user/sunlt/20news-seq
drwxr-xr-x - sunlt supergroup 0 2014-10-13 11:17 /user/sunlt/20news-test-vectors
drwxr-xr-x - sunlt supergroup 0 2014-10-13 11:17 /user/sunlt/20news-train-vectors
drwxr-xr-x - sunlt supergroup 0 2014-10-13 10:55 /user/sunlt/20news-vectors
训练:
$ mahout trainnb \
-i 20news-train-vectors\
-el \
-o model \
-li labelindex \
-ow \
-c
测试训练好的贝叶斯分类器:
$ mahout testnb \
-i 20news-test-vectors \
-m model \
-l labelindex \
-ow \
-o 20news-testing \
-c
运行结果:
.......
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 3023 93.7946%
Incorrectly Classified Instances : 200 6.2054%
Total Classified Instances : 3223
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i <--Classified as
306 1 0 0 1 2 1 1 16 | 328 a = alt.atheism
0 391 2 1 3 1 0 0 2 | 400 b = comp.windows.x
0 7 345 6 12 12 2 5 1 | 390 c = misc.forsale
0 1 2 399 2 1 1 0 0 | 406 d = rec.motorcycles
0 5 10 2 371 3 0 3 1 | 395 e = sci.electronics
0 5 0 1 4 371 1 2 1 | 385 f = sci.med
1 3 0 2 1 1 317 8 0 | 333 g = talk.politics.guns
1 0 0 1 2 3 11 311 5 | 334 h = talk.politics.misc
22 1 2 1 0 0 8 6 212 | 252 i = talk.religion.misc
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.9105
Accuracy 93.7946%
Reliability 84.0504%
Reliability (standard deviation) 0.2984
对角线代表预测正确的数量。
补充
导出数据:
$ mahout seqdumper -i 20news-testing/part-m-00000 -o ./20news_testing.res
查看:
$ gedit ./20news_testing.res
也可以这样查看:
$ hadoop dfs -cat 20news-testing/part-m-00000
由于是sequence文件,查看的结果是乱码。
将hdfs中某个目录中的文件拷贝到本地:
$ hadoop dfs -copyToLocal 20news-vectors/* .
参考
Twenty Newsgroups Classification Example