对于重复/重叠的条目,我必须检查较长时间的鸟类观察。
来自不同地点(A,B,C)的观察者进行了观察,并将其标记在纸质地图上。这些线与其他物种有关的数据,观察点和观察到的时间间隔加入其中。
通常,观察者在观察的同时通过电话彼此交流,但有时他们会忘记,所以我得到了重复的线路。
我已经将数据简化为可以触及圆的线,因此不必进行空间分析,而只需比较每种物种的时间间隔,就可以确定比较得出的是同一个人。
我现在正在R中寻找一种方法来识别那些条目:
- 在同一天以重叠的间隔进行
- 哪里是同一个物种
- 由不同的观察点(A或B或C或...)制成
在此示例中,我手动找到了同一个人的可能重复的条目。观察点不同(A <-> B),种类相同(Sst),开始时间和结束时间的间隔重叠。
现在,我将在data.frame中创建一个新字段“ duplicate”,为两行提供一个通用ID,以便能够导出它们并稍后决定要做什么。
我在很多地方搜索了已经存在的解决方案,但是没有发现有关我必须对物种的过程进行子集化(最好没有循环)并且必须比较2 + x个观察点的行这一事实。
一些数据可玩:
testdata <- structure(list(bird_id = c("20150712_0810_1410_A_1", "20150712_0810_1410_A_2",
"20150712_0810_1410_A_4", "20150712_0810_1410_A_7", "20150727_1115_1430_C_1",
"20150727_1120_1430_B_1", "20150727_1120_1430_B_2", "20150727_1120_1430_B_3",
"20150727_1120_1430_B_4", "20150727_1120_1430_B_5", "20150727_1130_1430_A_2",
"20150727_1130_1430_A_4", "20150727_1130_1430_A_5", "20150812_0900_1225_B_3",
"20150812_0900_1225_B_6", "20150812_0900_1225_B_7", "20150812_0907_1208_A_2",
"20150812_0907_1208_A_3", "20150812_0907_1208_A_5", "20150812_0907_1208_A_6"
), obsPoint = c("A", "A", "A", "A", "C", "B", "B", "B", "B",
"B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "A"), species = structure(c(11L,
11L, 11L, 11L, 10L, 11L, 10L, 11L, 11L, 11L, 11L, 10L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L), .Label = c("Bf", "Fia", "Grr",
"Kch", "Ko", "Lm", "Rm", "Row", "Sea", "Sst", "Wsb"), class = "factor"),
from = structure(c(1436687150, 1436689710, 1436691420, 1436694850,
1437992160, 1437991500, 1437995580, 1437992360, 1437995960,
1437998360, 1437992100, 1437994000, 1437995340, 1439366410,
1439369600, 1439374980, 1439367240, 1439367540, 1439369760,
1439370720), class = c("POSIXct", "POSIXt"), tzone = ""),
to = structure(c(1436687690, 1436690230, 1436691690, 1436694970,
1437992320, 1437992200, 1437995600, 1437992400, 1437996070,
1437998750, 1437992230, 1437994220, 1437996780, 1439366570,
1439370070, 1439375070, 1439367410, 1439367820, 1439369930,
1439370830), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("bird_id",
"obsPoint", "species", "from", "to"), row.names = c("20150712_0810_1410_A_1",
"20150712_0810_1410_A_2", "20150712_0810_1410_A_4", "20150712_0810_1410_A_7",
"20150727_1115_1430_C_1", "20150727_1120_1430_B_1", "20150727_1120_1430_B_2",
"20150727_1120_1430_B_3", "20150727_1120_1430_B_4", "20150727_1120_1430_B_5",
"20150727_1130_1430_A_2", "20150727_1130_1430_A_4", "20150727_1130_1430_A_5",
"20150812_0900_1225_B_3", "20150812_0900_1225_B_6", "20150812_0900_1225_B_7",
"20150812_0907_1208_A_2", "20150812_0907_1208_A_3", "20150812_0907_1208_A_5",
"20150812_0907_1208_A_6"), class = "data.frame")
我发现了提到的data.table函数foverlaps的部分解决方案,例如https://stackoverflow.com/q/25815032
library(data.table)
#Subsetting the data for each observation point and converting them into data.tables
A <- setDT(testdata[testdata$obsPoint=="A",])
B <- setDT(testdata[testdata$obsPoint=="B",])
C <- setDT(testdata[testdata$obsPoint=="C",])
#Set a key for these subsets (whatever key exactly means. Don't care as long as it works ;) )
setkey(A,species,from,to)
setkey(B,species,from,to)
setkey(C,species,from,to)
#Generate the match results for each obsPoint/species combination with an overlapping interval
matchesAB <- foverlaps(A,B,type="within",nomatch=0L) #nomatch=0L -> remove NA
matchesAC <- foverlaps(A,C,type="within",nomatch=0L)
matchesBC <- foverlaps(B,C,type="within",nomatch=0L)
当然,这种方式“可行”,但实际上并不是我最终想要实现的目标。
首先,我必须对观察点进行硬编码。我宁愿找到一个取任意点数的解决方案。
其次,结果的格式不是我可以轻松恢复的格式。匹配的行实际上放在INTO的同一行中,而我的目标是将这些行放在下面,并在一个新列中将有一个公共标识符。
第三,我必须再次手动检查间隔是否在所有三个点上都重叠(我的数据不是这种情况,但通常可以)
最后,我只想接收一个新的data.frame,其中所有候选者都可以通过组ID进行识别,然后我可以加入这些行,然后将结果导出为一层以供进一步检查。
那么,还有其他人怎么做呢?
for
循环,则将对其+1 !