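Besides LDA, you can use Latent Semantic Analysis (LSA) with K-Means. It is not a neural network but rather "classical" clustering, yet it works quite well.
Example in sklearn (taken from here):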
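import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
labels = dataset.target
true_k = np.unique(labels).shape[0]
# TfidfVectorizer takes raw text (TfidfTransformer would need a count matrix)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(dataset.data)
# LSA: truncated SVD followed by L2 normalization of each document vector
svd = TruncatedSVD(true_k)
lsa = make_pipeline(svd, Normalizer(copy=False))
X = lsa.fit_transform(X)
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100)
km.fit(X)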
The cluster assignment labels are now in km.labels_
For example, these are the topics extracted from 20 Newsgroups with LSA:
Cluster 0: space shuttle alaska edu nasa moon launch orbit henry sci
Cluster 1: edu game team games year ca university players hockey baseball
Cluster 2: sale 00 edu 10 offer new distribution subject lines shipping
Cluster 3: israel israeli jews arab jewish arabs edu jake peace israelis
Cluster 4: cmu andrew org com stratus edu mellon carnegie pittsburgh pa
Cluster 5: god jesus christian bible church christ christians people edu believe
Cluster 6: drive scsi card edu mac disk ide bus pc apple
Cluster 7: com ca hp subject edu lines organization writes article like
Cluster 8: car cars com edu engine ford new dealer just oil
Cluster 9: sun monitor com video edu vga east card monitors microsystems
Cluster 10: nasa gov jpl larc gsfc jsc center fnal article writes
Cluster 11: windows dos file edu ms files program os com use
Cluster 12: netcom com edu cramer fbi sandvik 408 writes article people
Cluster 13: armenian turkish armenians armenia serdar argic turks turkey genocide soviet
Cluster 14: uiuc cso edu illinois urbana uxa university writes news cobb
Cluster 15: edu cs university posting host nntp state subject organization lines
Cluster 16: uk ac window mit server lines subject university com edu
Cluster 17: caltech edu keith gatech technology institute prism morality sgi livesey
Cluster 18: key clipper chip encryption com keys escrow government algorithm des
Cluster 19: people edu gun com government don like think just access
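You can also apply non-negative matrix factorization (NMF), which can be interpreted as clustering. All you need to do is take the largest component of each document in the transformed space and use its index as the cluster assignment.
In sklearn: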
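from sklearn.decomposition import NMF
# X must be non-negative here: use the TF-IDF matrix, not the LSA output; k is the number of clusters
nmf = NMF(n_components=k, random_state=1).fit_transform(X)
labels = nmf.argmax(axis=1)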
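To make the NMF variant concrete, here is a small self-contained sketch. It assumes a made-up toy corpus and k = 2 (neither comes from the original example), and it picks init="nndsvd" only to make the run deterministic. Note that NMF is fitted on the TF-IDF matrix directly, since it requires non-negative input.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: three space posts, four sports posts
docs = [
    "nasa launch orbit shuttle moon",
    "shuttle orbit space nasa launch",
    "moon space launch nasa orbit",
    "hockey team players game season",
    "game players baseball team season",
    "team hockey baseball game players",
    "players team game hockey league",
]

# TF-IDF values are non-negative, which NMF requires
X_tfidf = TfidfVectorizer().fit_transform(docs)

k = 2  # assumed number of clusters for this toy example
# W has one row per document and one column per component
W = NMF(n_components=k, init="nndsvd", random_state=1).fit_transform(X_tfidf)

# Each document is assigned to the component with the largest weight
nmf_labels = W.argmax(axis=1)
print(nmf_labels)
```

As with K-Means, the component indices carry no meaning of their own; only the grouping of documents matters.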