2015-04-03
本文讲述如何使用scikit-learn的KNN工具对文本进行分类。
关于KNN
K-近邻算法,简称KNN(k-Nearest Neighbor),是一个相当简单的分类/预测算法。其主要思想就是,选取与待分类/预测数据的最相似的K个训练数据,通过对这K个数据的结果或者分类标号取平均、取众数等方法得到待分类/预测数据的结果或者分类标号。
关于KNN,笔者在 浅入浅出:K近邻算法 有较为详细的介绍。
数据集介绍
数据集是有8个分类的文本数据集,使用了结巴分词对每个文本分词,每个单词当作特征,再利用二元词串构造更多特征,然后去掉停用词,去掉出现次数太多和太少的特征,得到了19630个特征。取1998个样本用于训练,509个用于测试。基于词袋模型的思路将每个文本转换为向量,训练集和测试集分别转换为矩阵,并用python numpy模块将其保存为npy格式。
在 https://github.com/letiantian/dataset 下载text-classification.7z,解压后导入数据:
$ ls
test_data.npy test_labels.npy training_data.npy training_labels.npy
$ ipython
>>> import numpy as np
>>> training_data = np.load("training_data.npy")
>>> training_data.shape
(1998, 19630)
>>> training_labels = np.load("training_labels.npy")
>>> training_labels
array([6, 6, 6, ..., 2, 2, 2])
>>> training_labels.shape
(1998,)
>>> test_data = np.load("test_data.npy")
>>> test_data.shape
(509, 19630)
>>> test_labels = np.load("test_labels.npy")
>>> test_labels.shape
(509,)
如何找一样本的最近k个邻居
方法1:
>>> from sklearn.neighbors import NearestNeighbors
>>> nbrs = NearestNeighbors(n_neighbors=6, algorithm='ball_tree')
>>> nbrs.fit(training_data) # 构造BallTree,可以快速找出6个最近邻居,原理待学习
NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
metric_params=None, n_neighbors=6, p=2, radius=1.0)
>>> distances, indices = nbrs.kneighbors(test_data[0]) # 找training_data中离样本test_data[0]的最近的6个样本
>>> indices # 6个最近样本,每个值是指在training_data中的第几个样本
array([[500, 294, 62, 802, 732, 703]])
>>> distances # 对应的距离
array([[ 13.37908816, 13.60147051, 13.60147051, 13.60147051,
13.60147051, 13.6381817 ]])
也可以依次找出多个测试样本的最近的6个训练样本:
>>> distances, indices = nbrs.kneighbors(test_data[0:2])
>>> indices
array([[ 500, 294, 62, 802, 732, 703],
[ 62, 294, 636, 1945, 802, 1091]])
>>> distances
array([[ 13.37908816, 13.60147051, 13.60147051, 13.60147051,
13.60147051, 13.6381817 ],
[ 7.93725393, 7.93725393, 8.1240384 , 8.36660027,
8.54400375, 8.54400375]])
方法2:
>>> from sklearn.neighbors import BallTree
>>> bt = BallTree(training_data, metric='euclidean')
>>> distances, indices = bt.query(test_data[0], k=6)
>>> indices
array([[500, 62, 802, 294, 732, 703]])
>>> distances
array([[ 13.37908816, 13.60147051, 13.60147051, 13.60147051,
13.60147051, 13.6381817 ]])
基于KNN的文本分类
令k=6:
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=6, metric='euclidean')
>>> knn.fit(training_data, training_labels) # 训练
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
metric_params=None, n_neighbors=6, p=2, weights='uniform')
>>> predict_labels = knn.predict(test_data) # 预测
>>> sum(predict_labels == test_labels)
230
>>> 230./509 # 正确率
0.4518664047151277
令k=20:
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=20, metric='euclidean')
>>> knn.fit(training_data, training_labels) # 训练
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
metric_params=None, n_neighbors=20, p=2, weights='uniform')
>>> predict_labels = knn.predict(test_data) # 预测
>>> sum(predict_labels == test_labels)
276 # 效果比k=6时提升了一些
>>> 276./509 # 正确率
0.5422396856581533
这个正确率并不高。在 基于贝叶斯的文本分类实战 中笔者使用了多项式贝叶斯对同样的数据集进行分类,正确率达到近90%。
做个优化
我们将每个样本归一化,看看效果。
先写一个归一化工具(mytools.py):
# !/usr/bin/env python
# -*- encoding:utf-8 -*-
import numpy as np
def uniformization(X):
if X.ndim != 2:
return None
X2 = X.copy()
X2 = X2.astype(float)
rows = X2.shape[0]
for i in xrange(0, rows):
sum_of_squares = sum(X2[i, :]**2)
if sum_of_squares == 0: continue
sqrt_sum_of_squares = sum_of_squares**0.5
X2[i, :] = X2[i, :] / sqrt_sum_of_squares
return X2
if __name__ == '__main__':
arr = np.array([[1,2,3],[4,5,6],[0,0,0]])
print uniformization(arr)
运行结果如下:
[[ 0.26726124 0.53452248 0.80178373]
[ 0.45584231 0.56980288 0.68376346]
[ 0. 0. 0.
处理原始数据集,生成新的数据:
>>> from mytools import uniformization
>>> new_training_data = uniformization(training_data)
>>> new_test_data = uniformization(test_data)
令k=6:
>>> knn = KNeighborsClassifier(n_neighbors=6, metric='euclidean')
>>> knn.fit(new_training_data, training_labels) # 使用新数据训练
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
metric_params=None, n_neighbors=6, p=2, weights='uniform')
>>> predict_labels = knn.predict(new_test_data) # 预测
>>> sum(predict_labels == test_labels)
294 # 由230提升到294
>>> 294./509 # 正确率有提升
0.5776031434184676
令k=20:
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=20, metric='euclidean')
>>> knn.fit(new_training_data, training_labels) # 使用新数据训练
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
metric_params=None, n_neighbors=20, p=2, weights='uniform')
>>> predict_labels = knn.predict(new_test_data) # 预测
>>> sum(predict_labels == test_labels)
314 # 由276提升到314
>>> 314./509 # 正确率有提升
0.6168958742632613
可以看到,归一化后,预测分类的正确率提升很多。
参考
1.6. Nearest Neighbors
sklearn.neighbors.KNeighborsClassifier