Visualizing 2-letter combinations


10

The answers to this question on SO returned a set of roughly 125 one-to-two letter names: https://stackoverflow.com/questions/6979630/what-1-2-letter-object-names-conflict-with-existing-r-objects

  [1] "Ad" "am" "ar" "as" "bc" "bd" "bp" "br" "BR" "bs" "by" "c"  "C" 
 [14] "cc" "cd" "ch" "ci" "CJ" "ck" "Cl" "cm" "cn" "cq" "cs" "Cs" "cv"
 [27] "d"  "D"  "dc" "dd" "de" "df" "dg" "dn" "do" "ds" "dt" "e"  "E" 
 [40] "el" "ES" "F"  "FF" "fn" "gc" "gl" "go" "H"  "Hi" "hm" "I"  "ic"
 [53] "id" "ID" "if" "IJ" "Im" "In" "ip" "is" "J"  "lh" "ll" "lm" "lo"
 [66] "Lo" "ls" "lu" "m"  "MH" "mn" "ms" "N"  "nc" "nd" "nn" "ns" "on"
 [79] "Op" "P"  "pa" "pf" "pi" "Pi" "pm" "pp" "ps" "pt" "q"  "qf" "qq"
 [92] "qr" "qt" "r"  "Re" "rf" "rk" "rl" "rm" "rt" "s"  "sc" "sd" "SJ"
[105] "sn" "sp" "ss" "t"  "T"  "te" "tr" "ts" "tt" "tz" "ug" "UG" "UN"
[118] "V"  "VA" "Vd" "vi" "Vo" "w"  "W"  "y"

And the R code to import them:

nms <- c("Ad","am","ar","as","bc","bd","bp","br","BR","bs","by","c","C","cc","cd","ch","ci","CJ","ck","Cl","cm","cn","cq","cs","Cs","cv","d","D","dc","dd","de","df","dg","dn","do","ds","dt","e","E","el","ES","F","FF","fn","gc","gl","go","H","Hi","hm","I","ic","id","ID","if","IJ","Im","In","ip","is","J","lh","ll","lm","lo","Lo","ls","lu","m","MH","mn","ms","N","nc","nd","nn","ns","on","Op","P","pa","pf","pi","Pi","pm","pp","ps","pt","q","qf","qq","qr","qt","r","Re","rf","rk","rl","rm","rt","s","sc","sd","SJ","sn","sp","ss","t","T","te","tr","ts","tt","tz","ug","UG","UN","V","VA","Vd","vi","Vo","w","W","y")

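Since the point of the question is to come up with a memorable list of object names to avoid, and most people are not good at making sense of solid blocks of text, I'd like to visualize this.

Unfortunately, I'm not sure of the best way to do this. I had thought of something like a stem-and-leaf plot, except that, since there are no repeated values, each "leaf" would be placed in its own fixed column rather than being left-justified. Or a word-cloud-style adaptation in which letters are sized according to their prevalence.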
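To make the stem-and-leaf idea concrete, here is a minimal base-R sketch. It uses only a short excerpt of the names, and the column order (lower case first, then upper case) is an arbitrary assumption rather than anything the question specifies.

```r
# Short excerpt of the names, for illustration only
nms <- c("Ad", "am", "ar", "as", "bc", "bd", "bp", "br", "BR", "bs", "by")

two    <- nms[nchar(nms) == 2]
first  <- substr(two, 1, 1)   # stems
second <- substr(two, 2, 2)   # leaves

# One fixed column per possible second letter (assumed order)
cols <- c(letters, LETTERS)

# For each stem, print a leaf in its own column and a blank elsewhere
stems <- unique(first)
rows <- sapply(stems, function(s) {
  leaves <- second[first == s]
  paste(ifelse(cols %in% leaves, cols, " "), collapse = "")
})

aligned <- paste(stems, "|", rows)
cat(aligned, sep = "\n")
```

Printed in a monospaced font, identical second letters line up vertically across stems, which is the alignment described above.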

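How can this be visualized most clearly and efficiently?

Visualizations that do either of the following fit the spirit of this question:

  • Primary goal: enhance the memorability of the set of names by revealing patterns in the data

  • Alternate goal: highlight interesting features of the set of names (e.g. ones that help visualize the distribution, the most common letters, etc.)

Answers in R are preferred, but all interesting ideas are welcome.

Ignoring the single-letter names is allowed, since those are easier to give as a separate list.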

Answers:


12

Here's a start: visualize them on a grid of first letter versus second letter:

combi <- c("Ad", "am", "ar", "as", "bc", "bd", "bp", "br", "BR", "bs", 
"by", "c",  "C",  "cc", "cd", "ch", "ci", "CJ", "ck", "Cl", "cm", "cn", 
"cq", "cs", "Cs", "cv", "d",  "D",  "dc", "dd", "de", "df", "dg", "dn", 
"do", "ds", "dt", "e",  "E",  "el", "ES", "F",  "FF", "fn", "gc", "gl", 
"go", "H",  "Hi", "hm", "I",  "ic", "id", "ID", "if", "IJ", "Im", "In", 
"ip", "is", "J",  "lh", "ll", "lm", "lo", "Lo", "ls", "lu", "m",  "MH", 
"mn", "ms", "N",  "nc", "nd", "nn", "ns", "on", "Op", "P",  "pa", "pf", 
"pi", "Pi", "pm", "pp", "ps", "pt", "q",  "qf", "qq", "qr", "qt", "r",  
"Re", "rf", "rk", "rl", "rm", "rt", "s",  "sc", "sd", "SJ", "sn", "sp", 
"ss", "t",  "T",  "te", "tr", "ts", "tt", "tz", "ug", "UG", "UN", "V",  
"VA", "Vd", "vi", "Vo", "w",  "W",  "y")

## first/second letter as factors; one-letter names get NA as second letter
df <- data.frame (first = factor (gsub ("^(.).", "\\1", combi), 
                                  levels = c (LETTERS, letters)),
                  second = factor (gsub ("^.", "", combi), 
                                  levels = c (LETTERS, letters)),
                  combi = combi)

library(ggplot2)
ggplot (data = df, aes (x = first, y = second)) + 
   geom_text (aes (label = combi), size = 3) + 
   ## geom_point () +
   ## grey lines separate the upper-case from the lower-case letters
   geom_vline (xintercept = 26.5, col = "grey") + 
   geom_hline (yintercept = 26.5, col = "grey")

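(Figure: grid of the two-letter names by first and second letter)

## counts per letter; geom_bar () because x is a factor
ggplot (data = df, aes (x = second)) + geom_bar ()

(Figure: histogram of second letters)

ggplot (data = df, aes (x = first)) + geom_bar ()

(Figure: histogram of first letters)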
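As a cross-check that does not depend on ggplot2, the same counts can be tabulated directly. This is only a sketch in base R; the variable names `two`, `first_counts` and `second_counts` are mine, not part of the original answer.

```r
combi <- c("Ad", "am", "ar", "as", "bc", "bd", "bp", "br", "BR", "bs",
"by", "c",  "C",  "cc", "cd", "ch", "ci", "CJ", "ck", "Cl", "cm", "cn",
"cq", "cs", "Cs", "cv", "d",  "D",  "dc", "dd", "de", "df", "dg", "dn",
"do", "ds", "dt", "e",  "E",  "el", "ES", "F",  "FF", "fn", "gc", "gl",
"go", "H",  "Hi", "hm", "I",  "ic", "id", "ID", "if", "IJ", "Im", "In",
"ip", "is", "J",  "lh", "ll", "lm", "lo", "Lo", "ls", "lu", "m",  "MH",
"mn", "ms", "N",  "nc", "nd", "nn", "ns", "on", "Op", "P",  "pa", "pf",
"pi", "Pi", "pm", "pp", "ps", "pt", "q",  "qf", "qq", "qr", "qt", "r",
"Re", "rf", "rk", "rl", "rm", "rt", "s",  "sc", "sd", "SJ", "sn", "sp",
"ss", "t",  "T",  "te", "tr", "ts", "tt", "tz", "ug", "UG", "UN", "V",
"VA", "Vd", "vi", "Vo", "w",  "W",  "y")

# Keep only the two-letter names and split them into their letters
two    <- combi[nchar(combi) == 2]
first  <- substr(two, 1, 1)
second <- substr(two, 2, 2)

# Frequency of each first and second letter, most common first
first_counts  <- sort(table(first),  decreasing = TRUE)
second_counts <- sort(table(second), decreasing = TRUE)

head(first_counts, 5)
head(second_counts, 5)

# How many two-letter names start with an upper-case vs. lower-case letter
c(upper = sum(first %in% LETTERS), lower = sum(first %in% letters))
```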

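I gather:

  • Of the one-letter names,

    • fortunately i, j, k, and l are available (so I can index up to 4-d arrays)
    • unfortunately t (time) and c (concentration) are gone. So are m (mass), V (volume) and F (force). No radius r nor diameter d either.
    • I can still have pressure (p), amount of substance (n), and length l
    • Maybe I'll have to change to Greek names: ε is OK, but then one shouldn't do

      π <- pi

  • I can have whatever lowerUPPER name I want.

  • In general, starting with an upper-case letter is a safer bet than starting with a lower-case one.

  • Do not start with c or d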
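The points about one-letter names can be checked against the list from the question with `setdiff`. This sketch only reflects that list (the one-letter names are copied from it by hand); a given R installation may differ.

```r
# One-letter names that appear in the list from the question
taken <- c("c", "C", "d", "D", "e", "E", "F", "H", "I", "J", "m", "N",
           "P", "q", "r", "s", "t", "T", "V", "w", "W", "y")

# Letters that are still free according to that list
free_lower <- setdiff(letters, taken)
free_upper <- setdiff(LETTERS, taken)

free_lower
free_upper
```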


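Good start. Maybe add quadrant lines (like a large +) to the 2d plot, to make it easier to see where the upper- and lower-case letters are?
Ari B. Friedman

I thought I had. Anyway, here it is. @gsk3: thanks for uploading the images!
cbeleites unhappy with SX, 2011

Nice. And thank you, rather, for an interesting answer to prompt #2. :-)
Ari B. Friedman

Looking at your 2d plot, another suggestion might be to collapse it to a 27x26 grid and vary the symbol or color (or jitter with alpha) according to the upper/lower-case combination of a given letter pair. You could also give the NA row a different color to separate it visually.
Ari B. Friedman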

1
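Before posting the answer I did look at the 27 x 26 version (with color and shape according to whether the first and second letters are upper case). But it did not convey a simple message, so I immediately went back to the bigger grid.
cbeleites unhappy with SX, 2011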
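For reference, the data preparation behind such a collapsed 27 x 26 grid might look like the sketch below. It is not the code either commenter used; the `case` labels ("U" for upper, "l" for lower, "-" for a missing second letter) are an assumed encoding that could then be mapped to color or shape.

```r
# Short excerpt of the names, for illustration only
combi <- c("br", "BR", "cs", "Cs", "lo", "Lo", "ug", "UG", "UN", "c")

first  <- substr(combi, 1, 1)
second <- substr(combi, 2, 2)   # "" for one-letter names

grid <- data.frame(
  combi = combi,
  # 26 columns: first letter, case-folded
  x = factor(tolower(first), levels = letters),
  # 26 rows plus an NA row for the one-letter names
  y = factor(ifelse(second == "", NA, tolower(second)), levels = letters),
  # case pattern of the original name, to be shown as color or shape
  case = paste0(ifelse(first %in% LETTERS, "U", "l"),
                ifelse(second == "", "-",
                       ifelse(second %in% LETTERS, "U", "l")))
)

grid
```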

8

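OK, here's my very quick take on a "periodic table"-like visualization, based on the SO question and on the comments on the other answers. The main problem is the huge difference in the number of variables between packages, which somewhat hinders the visualization... I realize this is very rough, so feel free to change it.

Here is the current output (from my list of installed packages):

(Figure: sample plot)

And the code: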

# Load all the installed packages
lapply(rownames(installed.packages()), require, 
       character.only = TRUE)
# Find variables of length 1 or 2
one_or_two <- unique(apropos("^[a-zA-Z]{1,2}$"))
# Find which package they come from
packages <- lapply(one_or_two, find)
# Some of the variables may belong to multiple packages, so determine the length 
# of each entry in packages and duplicate the names accordingly
lengths <- unlist(lapply(packages, length))
var.data <- data.frame(var = rep(one_or_two, lengths), 
                   package = unlist(packages))

Now we have a data frame like this:

> head(var.data, 10)
   var           package
1   ar     package:stats
2   as   package:methods
3   BD    package:fields
4   bs      package:VGAM
5   bs   package:splines
6   by      package:base
7    c      package:base
8    C     package:stats
9   cm package:grDevices
10   D     package:stats

Now we can split the data by package:

 data.split <- split(var.data, var.data$package)

We can see that most of the variables come from base and stats:

> unlist(lapply(data.split, nrow))
     package:base  package:datasets    package:fields 
               16                 1                 2 
  package:ggplot2 package:grDevices  package:gWidgets 
                2                 1                 1 
  package:lattice      package:MASS    package:Matrix 
                1                 1                 3 
  package:methods      package:mgcv      package:plyr 
                3                 2                 1 
     package:spam   package:splines     package:stats 
                1                 2                14 
 package:survival     package:utils      package:VGAM 
                1                 2                 4 
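To make the imbalance between packages explicit, the counts can be ranked. The sketch below simply re-enters the numbers printed above by hand, so it reflects that particular installation only.

```r
# Counts copied from the output above (package: prefix dropped)
counts <- c(base = 16, datasets = 1, fields = 2, ggplot2 = 2,
            grDevices = 1, gWidgets = 1, lattice = 1, MASS = 1,
            Matrix = 3, methods = 3, mgcv = 2, plyr = 1, spam = 1,
            splines = 2, stats = 14, survival = 1, utils = 2, VGAM = 4)

# Largest contributors first, plus their share of all names in percent
ranked <- sort(counts, decreasing = TRUE)
share  <- round(100 * ranked / sum(ranked))

head(ranked, 3)
head(share, 3)
```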

Finally, the drawing routine:

plot(0, 0, "n", xlim=c(0, 100), ylim=c(0, 120), 
     xaxt="n", yaxt="n", xlab="", ylab="")

side.len.x <- 100 / length(data.split)
side.len.y <- 100 / max(unlist(lapply(data.split, nrow)))
colors <- rainbow(length(data.split), start=0.2, end=0.6)    

for (xcnt in 1:length(data.split))
    {
    posx <- side.len.x * (xcnt-1)

    # Remove "package :" in front of the package name
    pkg <- unlist(strsplit(as.character(data.split[[xcnt]]$package[1]), ":"))
    pkg <- pkg[2]

    # Write the package name
    text(posx + side.len.x/2, 102, pkg, srt=90, cex=0.95, adj=c(0, 0))

    for (ycnt in 1:nrow(data.split[[xcnt]]))
        {
        posy <- side.len.y * (ycnt-1)
        rect(posx, posy, posx+side.len.x*0.85, posy+side.len.y*0.9, col = colors[xcnt])
        text(posx+side.len.x/2, posy+side.len.y/2, data.split[[xcnt]]$var[ycnt], cex=0.7)
        }
    }

1
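Nice! A fun variation would be to group them by category (e.g. graphics packages, data manipulation packages, etc.), color-code them, and then make the overall shape more box-like than histogram-like.
Ari B. Friedman

+1 That's great! :) Very nice work. I think the only thing still needed to get the periodic table feel is the table layout. The standard PT has 2 grids, with one missing elements at the top, and the groups are split/rearranged (as opposed to 1 group = 1 vertical column). Honestly, that's not the part I thought was hard. The coloring and block layout were the parts that excited me most, and it's great to see the ggplot2 code.
Iterator

I need coffee. I see that gsk3's comment made much the same point. :) I guess I was mesmerized by the colors.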
Iterator

1
@Iterator: note that these are all standard R plotting functions, no ggplot2 involved :)
nico

Holy mackerel. You're right! Even more impressive. My conclusion: I need coffee.
Iterator

4

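Here's a letter-based histogram. I considered sizing the first letters by count, but decided against it since that is already encoded in the vertical component.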

# "Load" data
nms <- c("Ad","am","ar","as","bc","bd","bp","br","BR","bs","by","c","C","cc","cd","ch","ci","CJ","ck","Cl","cm","cn","cq","cs","Cs","cv","d","D","dc","dd","de","df","dg","dn","do","ds","dt","e","E","el","ES","F","FF","fn","gc","gl","go","H","Hi","hm","I","ic","id","ID","if","IJ","Im","In","ip","is","J","lh","ll","lm","lo","Lo","ls","lu","m","MH","mn","ms","N","nc","nd","nn","ns","on","Op","P","pa","pf","pi","Pi","pm","pp","ps","pt","q","qf","qq","qr","qt","r","Re","rf","rk","rl","rm","rt","s","sc","sd","SJ","sn","sp","ss","t","T","te","tr","ts","tt","tz","ug","UG","UN","V","VA","Vd","vi","Vo","w","W","y") #all names
two_in_base <- c("ar", "as", "by", "cm", "de", "df", "dt", "el", "gc", "gl", "if", "Im", "is", "lh", "lm", "ls", "pf", "pi", "pt", "qf", "qr", "qt", "Re", "rf", "rm", "rt", "sd", "ts", "vi") # 2-letter names in base R
vowels <- c("a","e","i","o","u")
vowels <- c( vowels, toupper(vowels) )

# Constants
yoffset.singles <- 3

# Define a function to give us consistent X coordinates
returnX <- function(vec) {
  sapply(vec, function(x) seq(length(all.letters))[ x == all.letters ] )
}

# Make df of 2-letter names
combi <- nms[ sapply( nms, function(x) nchar(x)==2 ) ]
combidf <- data.frame( first = substr(combi,1,1), second=substr(combi,2,2) )
library(plyr)
combidf <- arrange(combidf,first,second)

# Add vowels
combidf$first.vwl <- (combidf$first %in% vowels)
combidf$second.vwl <- (combidf$second %in% vowels)

# Flag items only in base R
combidf$in_base <- paste(combidf$first,combidf$second,sep="") %in% two_in_base

# Create a data.frame to hold our plotting information for the first letters
combilist <- dlply(combidf,.(first),function(x) x$second)
combi.first <- data.frame( first = names(combilist), n = sapply(combilist,length) ,stringsAsFactors=FALSE )
combi.first$y <- 0
all.letters <-  c(letters,LETTERS) # arrange(combi.first,desc(n))$first to go in order of prevalence (which may break the one-letter name display)
combi.first$x <- returnX( combi.first$first )

# Create a data.frame to hold plotting information for the second letters
combidf$x <- returnX( combidf$first )
combidf$y <- unlist( by( combidf$second, combidf$first, seq_along ) )

# Make df of 1-letter names
sngldf <- data.frame( sngl = nms[ sapply( nms, function(x) nchar(x)==1 ) ] )
singles.y <- max(combidf$y) + yoffset.singles
sngldf$y <- singles.y
sngldf$x <- returnX( sngldf$sngl )

# Plot
library(ggplot2)
ggplot(data=combidf, aes(x=x,y=y) ) +
  geom_text(aes( label=second, size=3, colour=in_base ), position=position_jitter(w=0,h=.25)) +
  geom_text( data=combi.first, aes( label=first, x=x, y=y, size=4 ) ) +
  geom_text( data=sngldf, aes( label=sngl, x=x, y=y, size=4 ) ) +
  scale_size(name="Order (2-letter names)",limits=c(1,4),breaks=c(1,2),labels=c("Second","First")) +
  scale_x_continuous("",breaks=c(13,39),labels=c("lower","UPPER")) +
  scale_y_continuous("",breaks=c(0,5,singles.y),labels=c("First letter of two-letter names","Second letter of two-letter names","One-letter names") ) +
  coord_equal(1.5) +
  labs( colour="In base R" )

Version with one- and two-letter names on the same plot:

(Figure: letter-based histogram)


2

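Periodic table for 100, Alex. I don't have the code, though. :(

One might think that a "periodic table" package already exists on CRAN. The idea of a coloring scheme and layout for this kind of data could be interesting and useful.

The names could be colored by package and sorted vertically by frequency, e.g. how often they occur in a corpus of code samples on CRAN or in one's local codebase.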
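A rough sketch of that layout rule, one column per package with the most frequently used name at the top, could look like the following. The frequencies are invented purely for illustration; I have no real usage data, and `col` and `row` are hypothetical column names.

```r
# Hypothetical usage frequencies, made up for illustration only
vars <- data.frame(
  var     = c("c", "by", "is", "ar", "sd", "D", "bs", "ns"),
  package = c("base", "base", "base", "stats", "stats", "stats",
              "splines", "splines"),
  freq    = c(900, 120, 300, 15, 200, 40, 30, 10),
  stringsAsFactors = FALSE
)

# One column per package (the "group" of the periodic table)
vars$col <- match(vars$package, unique(vars$package))

# Row within each column: 1 = most frequently used name in that package
vars$row <- ave(-vars$freq, vars$package, FUN = rank)

vars[order(vars$col, vars$row), ]
```

The resulting `col` and `row` values could be fed to `rect()` and `text()` in the same way as in the drawing routine of the other answer, with one color per package.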


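Not sure I follow you... could you briefly sketch your idea? I don't see how the layout of a periodic table would help here...
nico

@nico: I'm thinking of something like this: en.wikipedia.org/wiki/Periodic_table Suppose we replace the noble elements with base R commands. The halogens could be replaced by some other package, and so on. With such a visualization package, I would leave it to the user to specify the nature of the rows, columns, groups and colors. Although I would implement it very crudely, it should be fairly simple. Items in the same group (i.e. package) should be placed near each other. The vertical placement could be determined by frequency of usage.
Iterator

OK, now I get it! Maybe I'll see whether I can come up with something, but I need to find some spare time first... :(
nico

I'm still not quite clear on it, but I'm excited to see what this idea turns into :-)
Ari B. Friedman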
Ari B. Friedman

1
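Had a look on stackexchange: Tal Galili did ask about the PSE a while ago, so I didn't ask. But I just pushed a first piece of code to r-forge (pse.R). Please put some stars into the checkout path: I don't know how to escape them, so they disappear...
cbeleites unhappy with SX, 2011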

1

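The first couple of pages of Chapter 2 of MacKay's ITILA have nice diagrams showing the conditional probabilities of all character pairs in the English language. You may find it of use.

I'm embarrassed to say that I don't remember which program was used to generate them.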
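I can't reproduce those diagrams, but the underlying quantity, the probability of the second letter given the first, is easy to compute for the names here. This is a sketch on a short excerpt of the list; because each name occurs exactly once, it only describes the set of names, not how often they are used.

```r
# Short excerpt of the two-letter names, for illustration only
combi <- c("cc", "cd", "ch", "dc", "dd", "de", "pa", "pf", "pi", "pp")

first  <- substr(combi, 1, 1)
second <- substr(combi, 2, 2)

# Contingency table of first vs. second letter
counts <- table(first, second)

# P(second | first): normalise each row so that it sums to 1
cond <- prop.table(counts, margin = 1)

round(cond, 2)
```

A matrix like `cond` could then be drawn with `image()` or as sized squares, in the spirit of the diagrams in the book.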


1
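That is cool, but it seems to me that it all hinges on some additional information (prevalence) associated with each letter pair. So he is plotting 3 dimensions whereas we are mostly plotting 2. ... I'd love to get prevalence information for R, though. But that is a data-mining operation for another day.
Ari B. Friedman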
Licensed under cc by-sa 3.0 with attribution required.