Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level

108

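I have an R data frame containing a factor that I want to "expand", so that for each factor level there is an associated column in a new data frame containing a 1/0 indicator. For example, suppose I have: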

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))

I want:

df.desired  <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))

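Because some analyses need a completely numeric data frame (e.g., principal component analysis), I thought this feature might be built in. Writing a function to do this shouldn't be too hard, but I can foresee some challenges relating to column names, and if something already exists, I'd rather use that.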

r

— John Horton

131

Use the model.matrix function:

model.matrix( ~ Species - 1, data=iris )

— Greg Snow

1

Can I just add that this method is much faster than the cast approach I had been using.

— Matt Weller, 2013

3

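@GregSnow I reviewed the 2nd paragraph of ?formula, as well as ?model.matrix, but it was unclear (it could just be my lack of depth in matrix algebra and model formulation). After digging further, I've gathered that the -1 just specifies not to include the "intercept" column. If you leave out the -1, you see an intercept column of 1s in the output, and one of the binary columns is left out. You can tell which rows would have a 1 in the omitted column because they are the rows where all the other indicator columns are 0. The documentation seems cryptic; is there another good resource?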

— Ryan Chase
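
To make the effect of -1 described in the comment above concrete, here is a minimal sketch. It assumes a small stand-alone factor named eggs (modeled on the question's data, not taken from any answer) and R's default treatment contrasts.

```r
# Hypothetical factor modeled on the question's data; levels sort as "bar", "foo".
eggs <- factor(c("foo", "foo", "bar", "bar"))

# Default formula: an intercept column of 1s plus indicators for every level
# except the first one ("bar"), which becomes the reference level.
with_intercept <- model.matrix(~ eggs)

# "- 1" removes the intercept, so every level gets its own indicator column.
no_intercept <- model.matrix(~ eggs - 1)

colnames(with_intercept)
colnames(no_intercept)

# The omitted "bar" indicator can be recovered from the reduced coding:
# it is 1 exactly where the remaining indicator is 0.
bar_recovered <- 1 - with_intercept[, "eggsfoo"]
```

Comparing bar_recovered with no_intercept[, "eggsbar"] should show the same values, which is why the reduced coding loses no information.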

1

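@RyanChase, there are many online tutorials and books on R/S (several have brief descriptions on the r-project.org web page). My own learning of S and R has been rather eclectic (and long), so I am not the best person to judge how well current books and tutorials work for beginners. I am, however, a fan of experimenting. Trying something out in a fresh R session can be very enlightening and is not dangerous (the worst that has happened to me is crashing R, which is rare and leads to improvements in R). Stackoverflow is then a good resource for understanding what happened.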

— Greg Snow

7

If you want to convert all factor columns, you can use: model.matrix(~., data=iris)[,-1]

— user890739
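
One caveat with the one-liner above: dropping the intercept column with [,-1] still leaves each factor without an indicator for its reference level. If a full set of indicators per factor is wanted, one possible sketch is to pass full-rank contrasts through the contrasts.arg argument of model.matrix. The data frame df and the helper names is_fac and full_contrasts are illustrative assumptions.

```r
# Hypothetical data: two factors and one numeric column.
df <- data.frame(
  eggs = factor(c("foo", "foo", "bar", "bar")),
  ham  = factor(c("red", "blue", "green", "red")),
  x    = c(1.5, 2.5, 3.5, 4.5)
)

# Reduced coding from the comment above: the first level of every factor is dropped.
reduced <- model.matrix(~ ., data = df)[, -1]

# contrasts(f, contrasts = FALSE) returns an identity matrix with one column per
# level, so supplying it for each factor keeps every level as its own column.
is_fac <- sapply(df, is.factor)
full_contrasts <- lapply(df[is_fac], contrasts, contrasts = FALSE)
full <- model.matrix(~ . - 1, data = df, contrasts.arg = full_contrasts)

colnames(reduced)
colnames(full)
```

The full matrix is deliberately rank-deficient (the indicators of each factor sum to 1), which is fine for uses such as PCA input but not for fitting a linear model with an intercept.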

1

@colin, not fully automatic, but you can use naresid to put the missing values back in after using na.exclude. A quick example:

tmp <- data.frame(x=factor(c('a','b','c',NA,'a'))); tmp2 <- na.exclude(tmp); tmp3 <- model.matrix( ~x-1, tmp2); tmp4 <- naresid(attr(tmp2,'na.action'), tmp3)

— Greg Snow

17

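If your data frame is made up only of factors (or you are working on a subset of variables that are all factors), you can also use the acm.disjonctif function from the ade4 package: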

R> library(ade4)
R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
  eggs.bar eggs.foo ham.blue ham.green ham.red
1        0        1        0         0       1
2        0        1        1         0       0
3        1        0        0         1       0
4        1        0        0         0       1

Not exactly the case you are describing, but it can be useful too...

— juba

Thanks, this helped me a lot because it uses less memory than model.matrix!

— Serhiy, 2015

I like the way the variables are named; I don't like that they are returned as storage-hungry numerics when they should (IMHO) just be logicals.

— dsz
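
For readers who share the storage concern raised in the comment above, a minimal sketch of converting a numeric indicator matrix to logical follows. It uses model.matrix on a hypothetical factor rather than acm.disjonctif so that it runs without extra packages; the same comparison works on any 0/1 matrix or data frame.

```r
# Hypothetical factor modeled on the question's data.
eggs <- factor(c("foo", "foo", "bar", "bar"))

# Numeric (double) 0/1 indicators, 8 bytes per element.
ind_num <- model.matrix(~ eggs - 1)

# Comparing against 1 yields a logical matrix with the same dimensions and
# column names; R stores logicals in 4 bytes per element.
ind_lgl <- ind_num == 1

typeof(ind_num)
typeof(ind_lgl)
```

Most numeric routines coerce TRUE/FALSE back to 1/0 when needed, so the logical form is mainly a storage and readability choice.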

9

A quick way using the reshape2 package:

require(reshape2)

> dcast(df.original, ham ~ eggs, length)

Using ham as value column: use value_var to override.
  ham bar foo
1   1   0   1
2   2   0   1
3   3   1   0
4   4   1   0

Note that this produces precisely the column names you want.

— Prasad Chalasani

Good. But be careful about duplicate values of ham. For example, d <- data.frame(eggs = c("foo", "bar", "foo"), ham = c(1,2,1)); dcast(d, ham ~ eggs, length) makes foo = 2.

— kohske

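@Kohske, true, but I was assuming ham is a unique row ID. If ham is not a unique ID, then you must use some other unique ID (or create a dummy one) and use that in place of ham. Converting a categorical label to a binary indicator only makes sense for unique IDs.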

— Prasad Chalasani, 2011
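
The "create a dummy ID" suggestion above can be sketched as follows. Note that this sketch swaps in base R's table() for dcast so it runs without reshape2; the column name row_id is a hypothetical choice, and the data is kohske's example with a duplicated ham value.

```r
# kohske's example: ham = 1 appears twice, so ham cannot serve as a row ID.
d <- data.frame(eggs = c("foo", "bar", "foo"), ham = c(1, 2, 1))

# Create a dummy unique ID, one per row (hypothetical column name).
d$row_id <- seq_len(nrow(d))

# Cross-tabulate the unique ID against the factor: each row has exactly one
# count of 1, so the table is a proper 0/1 indicator matrix.
tab <- table(d$row_id, d$eggs)

# Turn the table into a data frame and put ham back alongside the indicators.
result <- cbind(as.data.frame(unclass(tab)), ham = d$ham)
result
```

Because every row_id occurs once, no cell can exceed 1, which avoids the foo = 2 problem from the earlier comment.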

6

Dummy variables are probably close to what you want. In that case, model.matrix is useful:

> with(df.original, data.frame(model.matrix(~eggs+0), ham))
  eggsbar eggsfoo ham
1       0       1   1
2       0       1   2
3       1       0   3
4       1       0   4

— kohske

6

A late entry: class.ind from the nnet package

library(nnet)
 with(df.original, data.frame(class.ind(eggs), ham))
  bar foo ham
1   0   1   1
2   0   1   2
3   1   0   3
4   1   0   4

— nel

4

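I just came across this old thread and thought I'd add a function that uses ade4 to take a data frame consisting of factors and/or numeric data and return a data frame with the factors as dummy codes.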

dummy <- function(df) {

    # drop = FALSE keeps a single selected column as a data frame,
    # so column names survive and acm.disjonctif always gets a data frame.
    NUM <- function(dataframe) dataframe[, sapply(dataframe, is.numeric), drop = FALSE]
    FAC <- function(dataframe) dataframe[, sapply(dataframe, is.factor), drop = FALSE]

    require(ade4)
    # Numeric columns pass through unchanged; factor columns become 0/1 indicators.
    DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
    return(DF)
}

Let's try it out.

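# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer
# converted to factors by default and dummy() selects columns with is.factor.
df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), 
            ham = c("red","blue","green","red"), x=rnorm(4),
            stringsAsFactors = TRUE)     
dummy(df)

df2 <-data.frame(eggs = c("foo", "foo", "bar", "bar"), 
            ham = c("red","blue","green","red"),
            stringsAsFactors = TRUE)  
dummy(df2)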

— Tyler Rinker

3

Here is a clearer way to do it. I use model.matrix to create the dummy boolean variables and then merge them back into the original data frame.

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
df.original
#   eggs ham
# 1  foo   1
# 2  foo   2
# 3  bar   3
# 4  bar   4

# Create the dummy boolean variables using the model.matrix() function.
mm <- model.matrix(~eggs-1, df.original)
mm
#   eggsbar eggsfoo
# 1       0       1
# 2       0       1
# 3       1       0
# 4       1       0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"

# Remove the "eggs" prefix from the column names as the OP desired.
# Anchoring with "^" strips only the leading prefix, not "eggs" inside a level name.
colnames(mm) <- sub("^eggs", "", colnames(mm))
mm
#   bar foo
# 1   0   1
# 2   0   1
# 3   1   0
# 4   1   0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"

# Combine the matrix back with the original dataframe.
result <- cbind(df.original, mm)
result
#   eggs ham bar foo
# 1  foo   1   0   1
# 2  foo   2   0   1
# 3  bar   3   1   0
# 4  bar   4   1   0

# At this point, you can select out the columns that you want.

— stackoverflowuser2010

0

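I needed a function to "explode" factors that is a bit more flexible, so I made one based on the acm.disjonctif function from the ade4 package. It lets you choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have "few" levels. Numeric columns are preserved.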

# Function to explode factors that are considered to be categorical,
# i.e., they do not have too many levels.
# - data: The data.frame in which categorical variables will be exploded.
# - values: The exploded values for the value being unequal and equal to a level.
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
# Inspired by the acm.disjonctif function in the ade4 package.
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
  exploders <- colnames(data)[sapply(data, function(col){
      is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
    })]
  if (length(exploders) > 0) {
    exploded <- lapply(exploders, function(exp){
        col <- data[, exp]
        n <- length(col)
        dummies <- matrix(values[1], n, length(levels(col)))
        dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
        colnames(dummies) <- paste(exp, levels(col), sep = '_')
        dummies
      })
    # Only keep numeric data.
    data <- data[sapply(data, is.numeric)]
    # Add exploded values.
    data <- cbind(data, exploded)
  }
  return(data)
}

— rakensi
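
The assignment dummies[(1:n) + n * (unclass(col) - 1)] <- values[2] in the function above relies on R storing matrices in column-major order: element (i, j) of a matrix with n rows sits at linear position i + n * (j - 1). A minimal sketch of just that step, using a hypothetical four-element factor and the function's default values of -0.8 and 0.8:

```r
# Hypothetical factor; levels sort as "bar" (code 1) and "foo" (code 2).
col <- factor(c("foo", "foo", "bar", "bar"))
n <- length(col)

# Start with every cell set to the "not this level" value.
dummies <- matrix(-0.8, n, nlevels(col))
colnames(dummies) <- levels(col)

# For row i with level code j, the linear index i + n * (j - 1) addresses
# cell (i, j), so one vectorized assignment marks each row's own level.
idx <- (1:n) + n * (as.integer(col) - 1)
dummies[idx] <- 0.8

idx
dummies
```

Here as.integer(col) plays the role of unclass(col) in the original; both give the integer level codes.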