What are VectorSource and VCorpus in the R "tm" (text mining) package?

9

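I am not quite sure what VectorSource and VCorpus in the "tm" package actually are.

The documentation is unclear on these. Can anyone explain them in simple terms?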

r text-mining

— ome

12

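A "Corpus" is a collection of text documents.

VCorpus in tm is a "volatile" corpus: the corpus is held in memory and is destroyed when the R object containing it is destroyed.

Contrast this with PCorpus, a permanent corpus, whose documents are stored outside of memory, for example in a database.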

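To create a VCorpus with tm, we need to pass a "Source" object as an argument to the VCorpus function. You can list the available sources with:
getSources()

[1] "DataframeSource" "DirSource" "URISource" "VectorSource"
[5] "XMLSource" "ZipSource"

A source abstracts the input location, such as a directory or a URI. VectorSource is for character vectors only.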
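To make the idea of a source more concrete, here is a minimal sketch that builds a corpus from two different sources. It assumes tm is installed; the directory name, file names, and variable names (docDir, dirCorpus, vecCorpus) are made up for illustration.

```r
library(tm)

# Hypothetical input: a temporary directory holding two small text files
docDir <- file.path(tempdir(), "tm_dirsource_demo")
dir.create(docDir, showWarnings = FALSE)
writeLines("This is the first document.", file.path(docDir, "a.txt"))
writeLines("This is the second document.", file.path(docDir, "b.txt"))

# DirSource treats each file in the directory as one document
dirCorpus <- VCorpus(DirSource(docDir))
length(dirCorpus)

# VectorSource treats each element of a character vector as one document
vecCorpus <- VCorpus(VectorSource(c("first", "second", "third")))
length(vecCorpus)
```

In both cases VCorpus only sees a Source object, so the code that builds the corpus does not change with where the text comes from.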

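A simple example:

Suppose you have a character vector:

input <- c('This is the first line.', 'This is the second line')

Create the source:

vecSource <- VectorSource(input)

Then create the corpus:

VCorpus(vecSource)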
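Put together, these steps can be run as one short script. This is only a sketch, assuming tm is installed; the variable name corpus is chosen here for illustration.

```r
library(tm)

# Character vector with one element per document
input <- c("This is the first line.", "This is the second line")

# Wrap the vector in a source, then build a volatile (in-memory) corpus
vecSource <- VectorSource(input)
corpus <- VCorpus(vecSource)

# The corpus holds one document per vector element
length(corpus)

# Each document can be converted back to plain text
as.character(corpus[[1]])
```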

Hope this helps. You can read more here: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

— 印地语

5

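Actually, there is a big difference between Corpus and VCorpus.

Corpus uses SimpleCorpus by default, which means that some features of VCorpus will not be available. The most obvious one is that with SimpleCorpus you cannot keep dashes, underscores, or other punctuation marks; SimpleCorpus (or Corpus) removes them automatically, whereas VCorpus does not. You can find the other limitations of Corpus in the help page ?SimpleCorpus.

Here is an example: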

# Read a text file from internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)

# load the data as a corpus
C.mlk <- Corpus(VectorSource(text))
C.mlk
V.mlk <- VCorpus(VectorSource(text))
V.mlk

The output will be:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 46
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 46

If you inspect the objects:

# inspect the content of the document
inspect(C.mlk[1:2])
inspect(V.mlk[1:2])

You will notice that Corpus unpacks the text:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 2
[1]                                                                                                                                            
[2] And so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream.


<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2
[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 0
[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 139

VCorpus, on the other hand, keeps it together inside document objects.
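The same structural difference can be seen on a small self-contained input. The sketch below assumes a recent tm version (0.7 or later), in which Corpus() on a VectorSource yields a SimpleCorpus; the sample sentences and the names simple and volatile are illustrative.

```r
library(tm)

text <- c("I have a dream.", "Let freedom ring!")

simple   <- Corpus(VectorSource(text))   # a SimpleCorpus in recent tm versions
volatile <- VCorpus(VectorSource(text))

# SimpleCorpus: the content is a plain character vector
class(content(simple))

# VCorpus: the content is a list of document objects,
# and each document carries its own metadata
class(content(volatile))
class(volatile[[1]])
meta(volatile[[1]])
```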

Now suppose you convert both to a document-term matrix:

dtm.C.mlk <- DocumentTermMatrix(C.mlk)
length(dtm.C.mlk$dimnames$Terms)
# 168

dtm.V.mlk <- DocumentTermMatrix(V.mlk)
length(dtm.V.mlk$dimnames$Terms)
# 187

Finally, let's look at the content. This is from Corpus:

grep("[[:punct:]]", dtm.C.mlk$dimnames$Terms, value = TRUE)
# character(0)

And this is from VCorpus:

grep("[[:punct:]]", dtm.V.mlk$dimnames$Terms, value = TRUE)

 [1] "alabama,"       "almighty,"      "brotherhood."   "brothers."     
 [5] "california."    "catholics,"     "character."     "children,"     
 [9] "city,"          "colorado."      "creed:"         "day,"          
[13] "day."           "died,"          "dream."         "equal."        
[17] "exalted,"       "faith,"         "gentiles,"      "georgia,"      
[21] "georgia."       "hamlet,"        "hampshire."     "happens,"      
[25] "hope,"          "hope."          "injustice,"     "justice."      
[29] "last!"          "liberty,"       "low,"           "meaning:"      
[33] "men,"           "mississippi,"   "mississippi."   "mountainside," 
[37] "nation,"        "nullification," "oppression,"    "pennsylvania." 
[41] "plain,"         "pride,"         "racists,"       "ring!"         
[45] "ring,"          "ring."          "self-evident,"  "sing."         
[49] "snow-capped"    "spiritual:"     "straight;"      "tennessee."    
[53] "thee,"          "today!"         "together,"      "together."     
[57] "tomorrow,"      "true."          "york."

Look at the words with punctuation attached. That is a huge difference, isn't it?
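If you want a VCorpus but without punctuation in the terms, one option is to ask for punctuation removal when the matrix is built. This is a sketch under that assumption, not part of the example above; the sample sentences and the names dtm_raw and dtm_clean are illustrative.

```r
library(tm)

text <- c("I have a dream.", "Let freedom ring, let freedom ring!")
volatile <- VCorpus(VectorSource(text))

# Default settings: punctuation stays attached to the terms
dtm_raw <- DocumentTermMatrix(volatile)
grep("[[:punct:]]", Terms(dtm_raw), value = TRUE)

# Strip punctuation while the term matrix is built
dtm_clean <- DocumentTermMatrix(volatile,
                                control = list(removePunctuation = TRUE))
grep("[[:punct:]]", Terms(dtm_clean), value = TRUE)
```

An alternative is to clean the corpus itself first with tm_map(volatile, removePunctuation) and then build the matrix from the result.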

— 0