Pass a data.frame column name to a function

119

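I'm trying to write a function that accepts a data.frame (x) and a column from it. The function performs some calculations on x and then returns another data.frame. I'm stuck on the best-practice way to pass the column name to the function.

The two minimal examples fun1 and fun2 below produce the desired result, being able to perform operations on x$column, using max() as an example. However, both rely on the (at least to me) seemingly inelegant

call to substitute() and possibly eval()
need to pass the column name as a character vector.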

fun1 <- function(x, column){
  do.call("max", list(substitute(x[a], list(a = column))))
}

fun2 <- function(x, column){
  max(eval((substitute(x[a], list(a = column)))))
}

df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")

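I would like to be able to call the function as fun(df, B), for example. Other options I have considered but not yet tried:

Pass column as an integer giving the column number. I think this would avoid substitute(). Ideally, the function could accept either.
with(x, get(column)), but even if it works, I think this would still require substitute
Make use of formula() and match.call(), neither of which I have much experience with.

Subquestion: Is do.call() preferred over eval()?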

r dataframe r-faq

— kmm

108

You can just use the column name directly:

df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))

There's no need to use substitute, eval, etc.

You can even pass the desired function as a parameter:

fun1 <- function(x, column, fn) {
  fn(x[,column])
}
fun1(df, "B", max)

Alternatively, [[ also works for selecting a single column at a time:

df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[[column]])
}
fun1(df, "B")
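
The question also asks whether one function could accept either a name or a column number. Since [[ takes both, a minimal sketch could look like the following. The helper name col_max and the length check are assumptions added for illustration, not part of the original answer.

```r
df <- data.frame(A = 1:10, B = 2:11, C = 3:12)

# col_max is a hypothetical name used only for this illustration
col_max <- function(x, column) {
  # [[ selects exactly one column, so reject vectors of names or indices
  stopifnot(length(column) == 1)
  max(x[[column]])
}

by_name  <- col_max(df, "B")  # column identified by name
by_index <- col_max(df, 2L)   # same column identified by position
```

Both calls refer to the same column, so they should return the same value.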

— Shane

13

Is there any way to pass the column name without it being a string?

— kmm

2

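You need to pass either the column name quoted as a character string or the integer index of the column. Just passing B will assume that B is an object itself.

— Shane, 2010

I see. I'm not sure how I ended up with the convoluted substitute, eval, etc.

— kmm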

3

Thanks! I found that the [[ solution was the only one that worked for me.

— EcologyTom

1

Hi @Luis, take a look at this answer

— EcologyTom '19

78

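This answer covers many of the same elements as the existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covers things more comprehensively.

Suppose we have a very simple data frame: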

dat <- data.frame(x = 1:4,
                  y = 5:8)

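We'd like to write a function that creates a new column z that is the sum of columns x and y.

A very common stumbling block here is that a natural (but incorrect) attempt often looks like this: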

foo <- function(df,col_name,col1,col2){
      df$col_name <- df$col1 + df$col2
      df
}

#Call foo() like this:    
foo(dat,z,x,y)

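The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks in df for a column literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) objects".

The simplest, and most often recommended, solution is to switch from $ to [[ and pass the function arguments as strings: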
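
The difference is easy to see outside a function. This is a minimal sketch using the same dat as above; the variable names literal and evaluated are made up for the illustration.

```r
dat <- data.frame(x = 1:4, y = 5:8)
col1 <- "x"

literal   <- dat$col1     # looks for a column literally named "col1"
evaluated <- dat[[col1]]  # evaluates col1 first, then looks up column "x"

is.null(literal)  # TRUE: dat has no column called "col1"
evaluated         # the contents of column x
```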

new_column1 <- function(df,col_name,col1,col2){
    #Create new column col_name as sum of col1 and col2
    df[[col_name]] <- df[[col1]] + df[[col2]]
    df
}

> new_column1(dat,"z","x","y")
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12

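This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.

The following two options are more advanced. Many popular packages make use of these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. This section of Hadley's Advanced R book is an excellent reference for some of these issues.

If you really want to save the user from typing all those quotes, one option might be to convert bare, unquoted column names to strings using deparse(substitute()):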

new_column2 <- function(df,col_name,col1,col2){
    col_name <- deparse(substitute(col_name))
    col1 <- deparse(substitute(col1))
    col2 <- deparse(substitute(col2))

    df[[col_name]] <- df[[col1]] + df[[col2]]
    df
}

> new_column2(dat,z,x,y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12

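This is, frankly, probably a bit silly, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.

Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow for other combinations of two variables. In that case we'd likely resort to using eval() on an expression involving the two columns: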
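
One of the "unanticipated points of failure" shows up as soon as new_column2 is called from another function. The sketch below assumes a hypothetical wrapper, add_cols, whose arguments a and b hold column names as strings.

```r
dat <- data.frame(x = 1:4, y = 5:8)

new_column2 <- function(df, col_name, col1, col2) {
  col_name <- deparse(substitute(col_name))
  col1 <- deparse(substitute(col1))
  col2 <- deparse(substitute(col2))
  df[[col_name]] <- df[[col1]] + df[[col2]]
  df
}

# add_cols is a hypothetical wrapper; a and b are meant to hold column names
add_cols <- function(df, a, b) {
  # new_column2 captures the symbols a and b, not the strings they hold
  new_column2(df, z, a, b)
}

# The columns "a" and "b" do not exist, so the assignment fails
res <- tryCatch(add_cols(dat, "x", "y"), error = function(e) e)
inherits(res, "error")
```

The string-based new_column1 has no such problem, because df[[a]] evaluates a before looking up the column.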

new_column3 <- function(df,col_name,expr){
    col_name <- deparse(substitute(col_name))
    df[[col_name]] <- eval(substitute(expr),df,parent.frame())
    df
}

Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:

> new_column3(dat,z,x+y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
  x y  z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
  x y  z
1 1 5  5
2 2 6 12
3 3 7 21
4 4 8 32
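
The third argument to eval(), parent.frame(), matters when the expression mentions something that is not a column of df. A minimal sketch, assuming a hypothetical caller scale_x whose local variable k is used in the expression:

```r
dat <- data.frame(x = 1:4, y = 5:8)

new_column3 <- function(df, col_name, expr) {
  col_name <- deparse(substitute(col_name))
  df[[col_name]] <- eval(substitute(expr), df, parent.frame())
  df
}

# scale_x is a hypothetical caller; k exists only inside its own frame
scale_x <- function(k) {
  new_column3(dat, z, x * k)
}

scaled <- scale_x(10)
scaled$z  # x comes from dat, k from the frame of scale_x
```

Names are looked up in df first and then in the caller's environment, which is how k is found here.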

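So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval, substitute, etc. if you really know what you're doing.

— joran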

1

Not sure why this is not the top answer.

— Ian

Me too! Great explanation!

— Alfredo G Marquez

22

Personally, I think that passing the column as a string is pretty ugly. I like to do something like:

get.max <- function(column,data=NULL){
    column<-eval(substitute(column),data, parent.frame())
    max(column)
}

which will yield:

> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5

Notice that specifying a data.frame is optional. You can even work with functions of your columns:

> get.max(1/mpg,mtcars)
[1] 0.09615385
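
When the column name is only known as a string at run time, a function written this way can still be called, but it takes an extra step. A sketch, assuming the get.max definition above and a hypothetical variable col holding the name:

```r
get.max <- function(column, data = NULL) {
  column <- eval(substitute(column), data, parent.frame())
  max(column)
}

col <- "mpg"  # hypothetical: a column name only known at run time

# Build the call get.max(mpg, mtcars) by turning the string into a symbol
res <- do.call(get.max, list(as.name(col), mtcars))
```

Calling get.max(col, mtcars) directly would instead evaluate col itself and take the max of the string "mpg".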

— Ian Fellows

9

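You need to get out of the habit of thinking that using quotes is ugly. Not using them is ugly! Why? Because you've created a function that can only be used interactively, and it's very difficult to program with.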

— hadley'4

27

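I'm happy to be shown a better way, but I fail to see the difference between this and qplot(x = mpg, data = mtcars). ggplot2 never passes a column as a string, and I think it is better off for it. Why do you say that this can only be used interactively? In what situation would it lead to undesirable results? How is it more difficult to program with? In the body of the post I show how it is more flexible.

— Ian Fellows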

4

5 years later -) .. Why do we need parent.frame()?

— mql4beginner, 2015

15

7 years later: is using quotes still ugly?

— Spacedman '17

11

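Another way is to use the tidy evaluation approach. It is pretty straightforward to pass columns of a data frame either as strings or as bare column names. See more about tidyeval here.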

library(rlang)
library(tidyverse)

set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))

Use column names as strings

fun3 <- function(x, ...) {
  # capture strings and create variables
  dots <- ensyms(...)
  # unquote to evaluate inside dplyr verbs
  summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}

fun3(df, "B")
#>          B
#> 1 1.715065

fun3(df, "B", "D")
#>          B        D
#> 1 1.715065 1.786913

Use bare column names

fun4 <- function(x, ...) {
  # capture expressions and create quosures
  dots <- enquos(...)
  # unquote to evaluate inside dplyr verbs
  summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}

fun4(df, B)
#>          B
#> 1 1.715065

fun4(df, B, D)
#>          B        D
#> 1 1.715065 1.786913
#>

Created on 2019-03-01 by the reprex package (v0.2.1.9000)
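
With newer dplyr releases the same idea can be written with the embrace operator and across(). The following is only a sketch: it assumes dplyr 1.0.0 or later is installed, and fun5 is a hypothetical name that does not appear in the original answer.

```r
library(dplyr)

set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))

# fun5 is a hypothetical name; cols is any tidyselect expression
fun5 <- function(x, cols) {
  # {{ }} forwards the caller's selection into across()
  summarise(x, across({{ cols }}, ~ max(.x, na.rm = TRUE)))
}

bare    <- fun5(df, c(B, D))              # bare column names
strings <- fun5(df, all_of(c("B", "D")))  # names stored as strings
```

Both calls select the same two columns, so the one function handles bare names and strings alike.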

— ung

Related: stackoverflow.com/questions/54940237/...

— 东

1

As an extra thought, if you need to pass the column name unquoted to a custom function, match.call() could also be useful in this case, as an alternative to deparse(substitute()):

df <- data.frame(A = 1:10, B = 2:11)

fun <- function(x, column){
  arg <- match.call()
  max(x[[arg$column]])
}

fun(df, A)
#> [1] 10

fun(df, B)
#> [1] 11

If there is a typo in the column name, it would be safer to stop with an error:

fun <- function(x, column) max(x[[match.call()$column]])
fun(df, typo)
#> Warning in max(x[[match.call()$column]]): no non-missing arguments to max;
#> returning -Inf
#> [1] -Inf

# Stop with error in case of typo
fun <- function(x, column){
  arg <- match.call()
  if (is.null(x[[arg$column]])) stop("Wrong column name")
  max(x[[arg$column]])
}

fun(df, typo)
#> Error in fun(df, typo): Wrong column name
fun(df, A)
#> [1] 10

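Created on 2019-01-11 by the reprex package (v0.2.1)

I do not think I would use this approach, since it involves extra typing and complexity compared with just passing the quoted column name as pointed out in the answers above, but well, it is an approach.

— Valentin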