What are VectorSource and VCorpus in the "tm" (text mining) package in R?


9

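I am not quite sure what exactly VectorSource and VCorpus are in the "tm" package.

The documentation is unclear on these. Can anyone explain them in simple terms?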

Answers:


12

A "Corpus" is a collection of text documents.

VCorpus in tm refers to a "volatile" corpus, which means the corpus is stored in memory and is destroyed when the R object containing it is destroyed.

Contrast this with PCorpus, or permanent corpus, where the documents are stored outside of memory, say in a db.

To create a VCorpus with tm, we need to pass a "Source" object as an argument to the VCorpus function. You can list the available sources with getSources():

[1] "DataframeSource" "DirSource" "URISource" "VectorSource"
[5] "XMLSource" "ZipSource"

A source abstracts an input location, such as a directory or a URI. VectorSource is only for character vectors.
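To illustrate that the source is just an abstraction over where the text comes from, here is a minimal sketch that builds a VCorpus once from a directory and once from a character vector. The directory name `tm_dirsource_demo` and the file names are made up for this illustration; only `VCorpus`, `DirSource`, `VectorSource` and `meta` come from tm.

```r
library(tm)

# Hypothetical scratch directory with two small text files (names are made up)
doc_dir <- file.path(tempdir(), "tm_dirsource_demo")
dir.create(doc_dir, showWarnings = FALSE)
writeLines("This is the first file.", file.path(doc_dir, "a.txt"))
writeLines("This is the second file.", file.path(doc_dir, "b.txt"))

# Same constructor, directory-backed source: one document per matching file
dir_corpus <- VCorpus(DirSource(doc_dir, pattern = "\\.txt$"))

# Same constructor, vector-backed source: one document per vector element
vec_corpus <- VCorpus(VectorSource(c("This is the first string.",
                                     "This is the second string.")))

length(dir_corpus)
length(vec_corpus)
meta(dir_corpus[[1]], "id")   # document id assigned by the source
```

Only the source object changes between the two calls; the corpus constructor stays the same.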

A simple example:

Suppose you have a character vector:

input <- c('This is line one.', 'This is line two.')

Create the source: vecSource <- VectorSource(input)

Then create the corpus: VCorpus(vecSource)
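Put together as one runnable script, the steps above might look like the sketch below. The variable names `vec_source` and `corpus` are my own; the last few lines just poke at the result to show that each element of the corpus is a document object wrapping one string.

```r
library(tm)

input <- c("This is line one.", "This is line two.")

vec_source <- VectorSource(input)   # wrap the character vector as a source
corpus <- VCorpus(vec_source)       # build the in-memory (volatile) corpus

length(corpus)          # number of documents, one per vector element
class(corpus[[1]])      # class of a single document
content(corpus[[1]])    # the text stored in the first document
```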

Hope this helps. You can read more here: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf


5

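In practice, there is a big difference between Corpus and VCorpus.

Corpus uses SimpleCorpus by default, which means that some features of VCorpus will not be available. One that is immediately evident is that SimpleCorpus will not let you keep dashes, underscores, or other punctuation; SimpleCorpus (and therefore Corpus) removes them automatically, whereas VCorpus does not. You can find other limitations of Corpus in the help page ?SimpleCorpus.

Here is an example: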

# Read a text file from internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)

# load the data as a corpus
C.mlk <- Corpus(VectorSource(text))
C.mlk
V.mlk <- VCorpus(VectorSource(text))
V.mlk

The output will be:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 46
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 46

If you inspect the objects:

# inspect the content of the document
inspect(C.mlk[1:2])
inspect(V.mlk[1:2])

You will notice that Corpus unpacks the text:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 2
[1]                                                                                                                                            
[2] And so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream.


<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2
[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 0
[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 139

While VCorpus keeps it together within the object.
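If you cannot reach the URL, the same storage difference can be checked on a tiny made-up input. This is only a sketch and assumes a tm version (0.7 or later, as far as I know) in which Corpus() falls back to SimpleCorpus for a VectorSource; the variable names are mine.

```r
library(tm)

# Two made-up lines, including an empty one as in the speech file
txt <- c("", "I still have a dream.")

simple   <- Corpus(VectorSource(txt))    # expected to be a SimpleCorpus
volatile <- VCorpus(VectorSource(txt))   # always a VCorpus

class(simple)
class(volatile)

# SimpleCorpus: the content is a plain character vector
str(content(simple))

# VCorpus: the content is a list of document objects
class(content(volatile)[[2]])
nchar(content(volatile[[2]]))   # characters in the second document
```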

Let's say you now do the matrix conversion for both:

dtm.C.mlk <- DocumentTermMatrix(C.mlk)
length(dtm.C.mlk$dimnames$Terms)
# 168

dtm.V.mlk <- DocumentTermMatrix(V.mlk)
length(dtm.V.mlk$dimnames$Terms)
# 187

Finally, let's look at the content. This is from Corpus:

grep("[[:punct:]]", dtm.C.mlk$dimnames$Terms, value = TRUE)
# character(0)

And from VCorpus:

grep("[[:punct:]]", dtm.V.mlk$dimnames$Terms, value = TRUE)

 [1] "alabama,"       "almighty,"      "brotherhood."   "brothers."     
 [5] "california."    "catholics,"     "character."     "children,"     
 [9] "city,"          "colorado."      "creed:"         "day,"          
[13] "day."           "died,"          "dream."         "equal."        
[17] "exalted,"       "faith,"         "gentiles,"      "georgia,"      
[21] "georgia."       "hamlet,"        "hampshire."     "happens,"      
[25] "hope,"          "hope."          "injustice,"     "justice."      
[29] "last!"          "liberty,"       "low,"           "meaning:"      
[33] "men,"           "mississippi,"   "mississippi."   "mountainside," 
[37] "nation,"        "nullification," "oppression,"    "pennsylvania." 
[41] "plain,"         "pride,"         "racists,"       "ring!"         
[45] "ring,"          "ring."          "self-evident,"  "sing."         
[49] "snow-capped"    "spiritual:"     "straight;"      "tennessee."    
[53] "thee,"          "today!"         "together,"      "together."     
[57] "tomorrow,"      "true."          "york."

Take a look at the words with punctuation. That is a huge difference, isn't it?
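For a quick offline check, here is a small sketch of the same comparison on two made-up sentences instead of the downloaded speech. I am deliberately not stating the terms you will get from the Corpus side, since that may depend on your tm version; the VCorpus side should keep tokens with trailing punctuation under the default settings.

```r
library(tm)

# Two made-up sentences standing in for the speech text
txt <- c("I still have a dream.",
         "It is a dream deeply rooted in the American dream.")

dtm_simple   <- DocumentTermMatrix(Corpus(VectorSource(txt)))
dtm_volatile <- DocumentTermMatrix(VCorpus(VectorSource(txt)))

# Compare the vocabularies side by side
Terms(dtm_simple)
Terms(dtm_volatile)

# Terms that still carry punctuation in each matrix
grep("[[:punct:]]", Terms(dtm_simple), value = TRUE)
grep("[[:punct:]]", Terms(dtm_volatile), value = TRUE)
```

If you want the VCorpus terms without punctuation, passing control = list(removePunctuation = TRUE) to DocumentTermMatrix is one option.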
