Reshaping data.frame from wide to long format


163

I'm having some trouble converting my data.frame from a wide table to a long table. At the moment it looks like this:

Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246

Now I would like to transform this data.frame into a long data.frame, something like this:

Code Country        Year    Value
AFG  Afghanistan    1950    20,249
AFG  Afghanistan    1951    21,352
AFG  Afghanistan    1952    22,532
AFG  Afghanistan    1953    23,557
AFG  Afghanistan    1954    24,555
ALB  Albania        1950    8,097
ALB  Albania        1951    8,986
ALB  Albania        1952    10,058
ALB  Albania        1953    11,123
ALB  Albania        1954    12,246

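I have looked at and already tried the melt() and reshape() functions, as some people suggested in similar questions. However, so far I only get messy results.

If possible, I would like to do it with the reshape() function, since it looks a little nicer to handle.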


2
I don't know if that was the problem, but the functions in the reshape package are melt and cast (and recast).
Eduardo Leoni

1
The reshape package has been superseded by reshape2.
IRTFM, 2014

5
And now reshape2 has been superseded by tidyr.
drhagen

Answers:


93

reshape() takes a while to get used to, just like melt/cast. Here is a solution with reshape, assuming your data frame is called d:

reshape(d, 
        direction = "long",
        varying = list(names(d)[3:7]),
        v.names = "Value",
        idvar = c("Code", "Country"),
        timevar = "Year",
        times = 1950:1954)
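The result of this call still holds Value as text, and its row names are generated from the id columns and the year (for example AFG.Afghanistan.1950). A minimal sketch of the cleanup, assuming d is built from the question's table with check.names = FALSE; the name long_d is only illustrative:

```r
# Rebuild the question's data; check.names = FALSE keeps "1950" etc. unchanged.
d <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246",
                header = TRUE, check.names = FALSE)

long_d <- reshape(d,
                  direction = "long",
                  varying = list(names(d)[3:7]),
                  v.names = "Value",
                  idvar = c("Code", "Country"),
                  timevar = "Year",
                  times = 1950:1954)

# Turn "20,249" into 20249.
long_d$Value <- as.numeric(gsub(",", "", long_d$Value))

# Order by country, then year, as in the desired output,
# and drop the generated row names.
long_d <- long_d[order(long_d$Code, long_d$Year), ]
rownames(long_d) <- NULL
```

The ordering step is optional; reshape() itself returns the rows grouped by year.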

153

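Three alternative solutions:

1) With data.table:

You can use the same melt function as in the reshape2 package (the data.table version is an extended and improved implementation). melt from data.table also has more parameters than the melt function from reshape2. For example, you can also specify the name of the variable column: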

library(data.table)
long <- melt(setDT(wide), id.vars = c("Code","Country"), variable.name = "year")

which gives:

> long
    Code     Country year  value
 1:  AFG Afghanistan 1950 20,249
 2:  ALB     Albania 1950  8,097
 3:  AFG Afghanistan 1951 21,352
 4:  ALB     Albania 1951  8,986
 5:  AFG Afghanistan 1952 22,532
 6:  ALB     Albania 1952 10,058
 7:  AFG Afghanistan 1953 23,557
 8:  ALB     Albania 1953 11,123
 9:  AFG Afghanistan 1954 24,555
10:  ALB     Albania 1954 12,246

Some alternative notations:

melt(setDT(wide), id.vars = 1:2, variable.name = "year")
melt(setDT(wide), measure.vars = 3:7, variable.name = "year")
melt(setDT(wide), measure.vars = as.character(1950:1954), variable.name = "year")
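To get the column names from the question directly, melt from data.table also accepts value.name, and variable.factor = FALSE keeps the year column as character instead of a factor. A sketch, assuming data.table is installed; as.data.table() is used here instead of setDT() only so that wide itself is left unmodified:

```r
library(data.table)

wide <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246",
                   header = TRUE, check.names = FALSE)

long <- melt(as.data.table(wide),
             id.vars = c("Code", "Country"),
             variable.name = "Year",
             value.name = "Value",
             variable.factor = FALSE)  # keep Year as character, not factor

# Convert both new columns by reference.
long[, Year := as.integer(Year)]
long[, Value := as.numeric(gsub(",", "", Value))]
```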

2) With tidyr:

library(tidyr)
long <- wide %>% gather(year, value, -c(Code, Country))

Some alternative notations:

wide %>% gather(year, value, -Code, -Country)
wide %>% gather(year, value, -1:-2)
wide %>% gather(year, value, -(1:2))
wide %>% gather(year, value, -1, -2)
wide %>% gather(year, value, 3:7)
wide %>% gather(year, value, `1950`:`1954`)

3) With reshape2:

library(reshape2)
long <- melt(wide, id.vars = c("Code", "Country"))

Some alternative notations that give the same result:

# you can also define the id-variables by column number
melt(wide, id.vars = 1:2)

# as an alternative you can also specify the measure-variables
# all other variables will then be used as id-variables
melt(wide, measure.vars = 3:7)
melt(wide, measure.vars = as.character(1950:1954))

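Notes:

  • reshape2 is retired. Only changes necessary to keep it on CRAN will be made. (source)
  • If you want to exclude NA values, you can add na.rm = TRUE to the melt as well as the gather functions.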
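As an illustration of the na.rm note, here is a small sketch with melt from data.table. The table wide_na and its missing cell are made up for the example:

```r
library(data.table)

# Hypothetical data: the 1951 value for ALB is missing.
wide_na <- data.table(Code = c("AFG", "ALB"),
                      Country = c("Afghanistan", "Albania"),
                      `1950` = c(20249, 8097),
                      `1951` = c(21352, NA))

# Default: the missing cell becomes a row with value NA.
with_na <- melt(wide_na, id.vars = c("Code", "Country"),
                variable.name = "year")

# na.rm = TRUE drops that row from the long result.
without_na <- melt(wide_na, id.vars = c("Code", "Country"),
                   variable.name = "year", na.rm = TRUE)

nrow(with_na)     # 4
nrow(without_na)  # 3
```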

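Another problem with the data is that R will read the values as character values (because of the , in the numbers). You can repair that with gsub and as.numeric: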

long$value <- as.numeric(gsub(",", "", long$value))

Or directly with data.table or dplyr:

# data.table
long <- melt(setDT(wide),
             id.vars = c("Code","Country"),
             variable.name = "year")[, value := as.numeric(gsub(",", "", value))]

# tidyr and dplyr
long <- wide %>% gather(year, value, -c(Code,Country)) %>% 
  mutate(value = as.numeric(gsub(",", "", value)))

Data:

wide <- read.table(text="Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246", header=TRUE, check.names=FALSE)

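Nice answer, just one more tiny reminder: do not put any variables other than id and time in your data frame; melt cannot tell what you want to do in that case.
Jason Goal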

1
@JasonGoal Could you elaborate on that? As I read your comment, it shouldn't be a problem. Just specify both id.vars and measure.vars.
Jaap

@Jaap, then that's good for me. I didn't know id.vars and measure.vars could be specified in the first alternative. Sorry for the mess, it's my fault.
Jason Goal

Sorry to revive this old post, but could someone explain to me why 3 works? I've tested it and it works, but I don't understand what dplyr is doing when it sees -c(var1, var2)...

1
@ReputableMisnomer When tidyr sees -c(var1, var2), it omits these variables when transforming the data from wide to long format.
Jaap

35

Using the reshape package:

#data
x <- read.table(textConnection(
"Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246"), header=TRUE)

library(reshape)

x2 <- melt(x, id = c("Code", "Country"), variable_name = "Year")
x2[,"Year"] <- as.numeric(gsub("X", "" , x2[,"Year"]))
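The gsub("X", "", ...) step is needed because read.table(), with its default check.names = TRUE, turns column names such as 1950 into syntactically valid names such as X1950. A base R sketch of the difference, using a shortened copy of the table (the names txt, x_default and x_asis are illustrative):

```r
txt <- "Code Country 1950 1951
AFG Afghanistan 20,249 21,352
ALB Albania 8,097 8,986"

# Default: names are passed through make.names(), which prefixes an X.
x_default <- read.table(text = txt, header = TRUE)

# check.names = FALSE keeps the header exactly as written.
x_asis <- read.table(text = txt, header = TRUE, check.names = FALSE)

names(x_default)  # "Code" "Country" "X1950" "X1951"
names(x_asis)     # "Code" "Country" "1950"  "1951"
```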

18

With tidyr_1.0.0, another option is pivot_longer:

library(tidyr)
pivot_longer(df1, -c(Code, Country), values_to = "Value", names_to = "Year")
# A tibble: 10 x 4
#   Code  Country     Year  Value 
#   <fct> <fct>       <chr> <fct> 
# 1 AFG   Afghanistan 1950  20,249
# 2 AFG   Afghanistan 1951  21,352
# 3 AFG   Afghanistan 1952  22,532
# 4 AFG   Afghanistan 1953  23,557
# 5 AFG   Afghanistan 1954  24,555
# 6 ALB   Albania     1950  8,097 
# 7 ALB   Albania     1951  8,986 
# 8 ALB   Albania     1952  10,058
# 9 ALB   Albania     1953  11,123
#10 ALB   Albania     1954  12,246

Data

df1 <- structure(list(Code = structure(1:2, .Label = c("AFG", "ALB"), class = "factor"), 
    Country = structure(1:2, .Label = c("Afghanistan", "Albania"
    ), class = "factor"), `1950` = structure(1:2, .Label = c("20,249", 
    "8,097"), class = "factor"), `1951` = structure(1:2, .Label = c("21,352", 
    "8,986"), class = "factor"), `1952` = structure(2:1, .Label = c("10,058", 
    "22,532"), class = "factor"), `1953` = structure(2:1, .Label = c("11,123", 
    "23,557"), class = "factor"), `1954` = structure(2:1, .Label = c("12,246", 
    "24,555"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))
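If numeric columns are wanted straight away, pivot_longer can convert the names and values while reshaping. This is a sketch that assumes tidyr 1.1.0 or later (where, to my knowledge, the names_transform and values_transform arguments were introduced). strip_commas is a helper defined only for this example, and the data is rebuilt here with read.table so the block runs on its own:

```r
library(tidyr)

wide <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246",
                   header = TRUE, check.names = FALSE,
                   stringsAsFactors = FALSE)

# Helper for this example: "20,249" -> 20249
strip_commas <- function(x) as.numeric(gsub(",", "", x))

long <- pivot_longer(wide, -c(Code, Country),
                     names_to = "Year", values_to = "Value",
                     names_transform = list(Year = as.integer),
                     values_transform = list(Value = strip_commas))
```

On older tidyr versions the same conversion can be done after reshaping, for example with as.numeric(gsub(",", "", long$Value)).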

1
This needs more upvotes. According to the Tidyverse blog, gather is being retired and pivot_longer is now the correct way to accomplish this.
Evan Rosica

15

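Given how this question is tagged, I felt it would be useful to share another alternative from base R: stack.

Note, however, that stack does not work with factors. It only works if is.vector is TRUE, and from the documentation for is.vector we find that:

is.vector returns TRUE if x is a vector of the specified mode having no attributes other than names. It returns FALSE otherwise.

I'm using the sample data from @Jaap's answer, where the values in the year columns are factors.

Here's the stack approach: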

cbind(wide[1:2], stack(lapply(wide[-c(1, 2)], as.character)))
##    Code     Country values  ind
## 1   AFG Afghanistan 20,249 1950
## 2   ALB     Albania  8,097 1950
## 3   AFG Afghanistan 21,352 1951
## 4   ALB     Albania  8,986 1951
## 5   AFG Afghanistan 22,532 1952
## 6   ALB     Albania 10,058 1952
## 7   AFG Afghanistan 23,557 1953
## 8   ALB     Albania 11,123 1953
## 9   AFG Afghanistan 24,555 1954
## 10  ALB     Albania 12,246 1954
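stack returns its two columns as values and ind, with ind as a factor. A base R sketch of renaming and converting them to match the layout asked for in the question (long_stack is an illustrative name; wide is rebuilt here so the block runs on its own):

```r
wide <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246",
                   header = TRUE, check.names = FALSE)

long_stack <- cbind(wide[1:2],
                    stack(lapply(wide[-c(1, 2)], as.character)))

# stack() names its columns "values" and "ind".
names(long_stack)[3:4] <- c("Value", "Year")

# ind is a factor, so go through character before converting to integer.
long_stack$Year <- as.integer(as.character(long_stack$Year))
long_stack$Value <- as.numeric(gsub(",", "", long_stack$Value))

# Reorder columns and rows to match the desired output.
long_stack <- long_stack[order(long_stack$Code, long_stack$Year),
                         c("Code", "Country", "Year", "Value")]
rownames(long_stack) <- NULL
```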

11

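Here is another example showing the use of gather from tidyr. You can select the columns to gather either by removing them individually (as I do here) or by explicitly including the years you want.

Note that, to handle the commas (and the X's added if check.names = FALSE is not set), I am also using dplyr's mutate with parse_number from readr to convert the text values back to numbers. These are all part of the tidyverse and so can be loaded together with library(tidyverse):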

wide %>%
  gather(Year, Value, -Code, -Country) %>%
  mutate(Year = parse_number(Year)
         , Value = parse_number(Value))

Returns:

   Code     Country Year Value
1   AFG Afghanistan 1950 20249
2   ALB     Albania 1950  8097
3   AFG Afghanistan 1951 21352
4   ALB     Albania 1951  8986
5   AFG Afghanistan 1952 22532
6   ALB     Albania 1952 10058
7   AFG Afghanistan 1953 23557
8   ALB     Albania 1953 11123
9   AFG Afghanistan 1954 24555
10  ALB     Albania 1954 12246

4

Here is a sqldf solution:

sqldf("Select Code, Country, '1950' As Year, `1950` As Value From wide
        Union All
       Select Code, Country, '1951' As Year, `1951` As Value From wide
        Union All
       Select Code, Country, '1952' As Year, `1952` As Value From wide
        Union All
       Select Code, Country, '1953' As Year, `1953` As Value From wide
        Union All
       Select Code, Country, '1954' As Year, `1954` As Value From wide;")

To build the query without typing everything out, you can use the following.

Thanks to G. Grothendieck for implementing it.

ValCol <- tail(names(wide), -2)

s <- sprintf("Select Code, Country, '%s' As Year, `%s` As Value from wide", ValCol, ValCol)
mquery <- paste(s, collapse = "\n Union All\n")

cat(mquery) #just to show the query
 #> Select Code, Country, '1950' As Year, `1950` As Value from wide
 #>  Union All
 #> Select Code, Country, '1951' As Year, `1951` As Value from wide
 #>  Union All
 #> Select Code, Country, '1952' As Year, `1952` As Value from wide
 #>  Union All
 #> Select Code, Country, '1953' As Year, `1953` As Value from wide
 #>  Union All
 #> Select Code, Country, '1954' As Year, `1954` As Value from wide

sqldf(mquery)
 #>    Code     Country Year  Value
 #> 1   AFG Afghanistan 1950 20,249
 #> 2   ALB     Albania 1950  8,097
 #> 3   AFG Afghanistan 1951 21,352
 #> 4   ALB     Albania 1951  8,986
 #> 5   AFG Afghanistan 1952 22,532
 #> 6   ALB     Albania 1952 10,058
 #> 7   AFG Afghanistan 1953 23,557
 #> 8   ALB     Albania 1953 11,123
 #> 9   AFG Afghanistan 1954 24,555
 #> 10  ALB     Albania 1954 12,246
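The generated query simply stacks one selection per year column. For comparison only, the same row-binding idea can be sketched in base R without sqldf. This illustrates what the UNION ALL does here and is not part of the sqldf approach; pieces and long_union are illustrative names:

```r
wide <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246",
                   header = TRUE, check.names = FALSE)

ValCol <- tail(names(wide), -2)

# One small data frame per year column, like one SELECT per year.
pieces <- lapply(ValCol, function(col) {
  data.frame(wide[c("Code", "Country")],
             Year = col,
             Value = as.character(wide[[col]]),
             stringsAsFactors = FALSE)
})

# rbind plays the role of UNION ALL.
long_union <- do.call(rbind, pieces)
```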

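Unfortunately, I don't think that PIVOT and UNPIVOT would work with R's SQLite. If you want to write your query in a more sophisticated manner, you can also take a look at these posts:

Using sprintf to write up sql queries   or   Pass variables to sqldf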
