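Consider the Hamming distance: the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. From this definition it is clear that we can generate data that has clusters with respect to the Hamming distance but no correlations between the variables.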
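As a quick illustration of the definition, here is a minimal sketch that uses the built-in HammingDistance on a few three-letter words, plus a hand-rolled count of differing positions for comparison. The variable names (d1, d2, byHand) are illustrative only.

```mathematica
(* Built-in Hamming distance between equal-length strings *)
d1 = HammingDistance["aac", "aad"];  (* the words differ only in the last position *)
d2 = HammingDistance["aac", "dba"];  (* the words differ in every position *)

(* The same count written out from the definition:
   compare the characters position by position and count the mismatches *)
byHand = Count[MapThread[UnsameQ, Characters /@ {"aac", "bac"}], True];

{d1, d2, byHand}
(* {1, 3, 1} *)
```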
Here is an example using Mathematica.
Create some categorical data (sequences of 3 symbols, sampled uniformly at random from 4 characters):
chs = CharacterRange["a", "d"];
words = StringJoin @@@ Union[Table[RandomChoice[chs, 3], 40]];
Length[words]
words
(* 29 *)
(* {"aac", "aad", "abb", "aca", "acb", "acd", "adb", "adc", "baa", "bab", "bac", "bad", "bcc", "bcd", "caa", "cab", "cac", "cad", "cbb", "ccb", "cda", "cdb", "dab", "dba", "dbb", "dbd", "dca", "dcc", "dcd"} *)
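Use mosaic plots to show the relationships between the variables (conditional probabilities of the value pairs taken from different columns):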
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MosaicPlot.m"]
wordSeqs = Characters /@ words;
opts = {ColorRules -> {2 -> ColorData[7, "ColorList"]}, ImageSize -> 400};
Grid[{{MosaicPlot[wordSeqs[[All, {1, 2}]],
"ColumnNames" -> {"column 1", "column 2"}, opts],
MosaicPlot[wordSeqs[[All, {2, 3}]],
"ColumnNames" -> {"column 2", "column 3"}, opts],
MosaicPlot[wordSeqs[[All, {1, 3}]],
"ColumnNames" -> {"column 1", "column 3"}, opts]}}, Dividers -> All]
We can see that there is no correlation.
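A numeric counterpart to the first mosaic plot can be sketched as a contingency table of the value pairs in columns 1 and 2. This is only a rough check: the word list below is copied from the sample output above, the name ct is illustrative, and with 29 distinct words the individual cell counts are noisy.

```mathematica
(* The 29 words from the sample output above *)
words = {"aac", "aad", "abb", "aca", "acb", "acd", "adb", "adc", "baa",
   "bab", "bac", "bad", "bcc", "bcd", "caa", "cab", "cac", "cad",
   "cbb", "ccb", "cda", "cdb", "dab", "dba", "dbb", "dbd", "dca",
   "dcc", "dcd"};
chs = CharacterRange["a", "d"];
wordSeqs = Characters /@ words;

(* Rows: value in column 1; columns: value in column 2 *)
ct = Table[Count[wordSeqs[[All, {1, 2}]], {r, c}], {r, chs}, {c, chs}];

TableForm[ct, TableHeadings -> {chs, chs}]
```

The same table for the other column pairs follows by replacing {1, 2} with {2, 3} or {1, 3}.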
Find clusters:
cls = FindClusters[words, 3, DistanceFunction -> HammingDistance]
(* {{"aac", "aad", "adc", "bac"}, {"abb", "acb", "adb", "baa", "bab", "bad",
"caa", "cab", "cac", "cad", "cbb", "ccb", "cda", "cdb", "dab",
"dbb"}, {"aca", "acd", "bcc", "bcd", "dba", "dbd", "dca", "dcc", "dcd"}} *)
If we replace each character with an integer, this plot shows how the Hamming distance forms the clusters:
esrules = Thread[chs -> Range[Length[chs]]];
gr1 = ListPointPlot3D[Characters[cls] /. esrules,
  PlotStyle -> {PointSize[0.02]}, PlotLegends -> Automatic,
  FaceGrids -> {Bottom, Left, Back}];
gr2 = Graphics3D[
Map[Text[#, Characters[#] /. esrules, {1, 1}] &, Flatten[cls]]];
Show[gr1, gr2]
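To complement the plot with numbers, the following sketch computes the cluster sizes and the mean pairwise Hamming distance inside each cluster. The cluster lists are copied from the FindClusters output above (a different random sample would give different clusters), and the names sizes and meanWithin are illustrative.

```mathematica
(* Clusters copied from the FindClusters output above *)
cls = {{"aac", "aad", "adc", "bac"},
   {"abb", "acb", "adb", "baa", "bab", "bad", "caa", "cab", "cac",
    "cad", "cbb", "ccb", "cda", "cdb", "dab", "dbb"},
   {"aca", "acd", "bcc", "bcd", "dba", "dbd", "dca", "dcc", "dcd"}};

(* Number of words in each cluster *)
sizes = Length /@ cls;

(* Mean Hamming distance over all unordered pairs within each cluster *)
meanWithin = N[Mean[HammingDistance @@@ Subsets[#, {2}]]] & /@ cls;

{sizes, meanWithin}
```

Since every word has length 3, each mean lies between 1 and 3; values closer to 1 indicate a tighter cluster.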
Further clustering
Let us make a graph by connecting the words that are at Hamming distance 1 from each other:
mat = Clip[Outer[HammingDistance, words, words], {0, 1}, {0, 0}];
nngr = AdjacencyGraph[mat,
VertexLabels -> Thread[Range[Length[words]] -> words]]
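The Clip call is what turns the distance matrix into an adjacency matrix: values inside the range {0, 1} are kept, and anything outside it (distances 2 and 3) is replaced by 0. A small sketch on four words taken from the sample makes this visible; the names sample, dist and adj are illustrative.

```mathematica
sample = {"aac", "aad", "bac", "dcd"};

(* Full matrix of pairwise Hamming distances *)
dist = Outer[HammingDistance, sample, sample];
(* {{0, 1, 1, 3}, {1, 0, 2, 2}, {1, 2, 0, 3}, {3, 2, 3, 0}} *)

(* Keep 0 and 1 as they are, map everything above 1 to 0 *)
adj = Clip[dist, {0, 1}, {0, 0}];
(* {{0, 1, 1, 0}, {1, 0, 0, 0}, {1, 0, 0, 0}, {0, 0, 0, 0}} *)

SymmetricMatrixQ[adj]
(* True *)
```

The zero diagonal means the resulting graph has no self-loops, and the symmetry gives an undirected graph.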
Now let us find the community clusters:
CommunityGraphPlot[nngr]
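To get the communities as lists of words rather than only as a picture, one option is FindGraphCommunities. This is a sketch, not necessarily the same partition that CommunityGraphPlot draws: the word list is copied from the sample output above, and the detected communities may depend on the method and the Mathematica version.

```mathematica
words = {"aac", "aad", "abb", "aca", "acb", "acd", "adb", "adc", "baa",
   "bab", "bac", "bad", "bcc", "bcd", "caa", "cab", "cac", "cad",
   "cbb", "ccb", "cda", "cdb", "dab", "dba", "dbb", "dbd", "dca",
   "dcc", "dcd"};

(* Same construction as above: connect words at Hamming distance 1 *)
mat = Clip[Outer[HammingDistance, words, words], {0, 1}, {0, 0}];
nngr = AdjacencyGraph[mat];

(* Communities come back as lists of vertex indices 1..29 *)
communities = FindGraphCommunities[nngr];

(* Translate the indices back to the words *)
wordCommunities = Map[words[[#]] &, communities]
```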
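Compare the graph clusters with the ones found by FindClusters (which was forced to find 3). We can see that "bac" is highly central and that "aad" could belong to the green cluster, which corresponds to cluster 1 in the 3D plot.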
Graph data
Here is the edge list of nngr:
{1 <-> 2, 1 <-> 8, 1 <-> 11, 1 <-> 17, 2 <-> 6, 2 <-> 12, 2 <-> 18,
3 <-> 5, 3 <-> 7, 3 <-> 19, 3 <-> 25, 4 <-> 5, 4 <-> 6, 4 <-> 27,
5 <-> 6, 5 <-> 7, 5 <-> 20, 6 <-> 14, 6 <-> 29, 7 <-> 8, 7 <-> 22,
9 <-> 10, 9 <-> 11, 9 <-> 12, 9 <-> 15, 10 <-> 11, 10 <-> 12,
10 <-> 16, 10 <-> 23, 11 <-> 12, 11 <-> 13, 11 <-> 17, 12 <-> 14,
12 <-> 18, 13 <-> 14, 13 <-> 28, 14 <-> 29, 15 <-> 16, 15 <-> 17,
15 <-> 18, 15 <-> 21, 16 <-> 17, 16 <-> 18, 16 <-> 19, 16 <-> 20,
16 <-> 22, 16 <-> 23, 17 <-> 18, 19 <-> 20, 19 <-> 22, 19 <-> 25,
20 <-> 22, 21 <-> 22, 23 <-> 25, 24 <-> 25, 24 <-> 26, 24 <-> 27,
25 <-> 26, 26 <-> 29, 27 <-> 28, 27 <-> 29, 28 <-> 29}