Answers:
Shallow Natural Language Processing (NLP) techniques can be used to extract concepts from a sentence.
-------------------------------------------
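Shallow NLP technique steps:
1) Convert the sentence to lowercase
2) Remove stopwords (the common words of a language; words like for, very, of, are, etc. are common stopwords)
3) Extract n-grams, i.e., contiguous sequences of n items from a given sequence of text (simply increasing n lets the model store more context)
4) Assign a syntactic label (noun, verb, etc.)
5) Extract knowledge from the text through a semantic/syntactic analysis approach, i.e., try to retain the words that carry higher weight in the sentence, such as nouns and verbs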
-------------------------------------------
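Steps 1 to 3 above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the code used for the results in this answer: the stopword set is an assumption (the examples named in step 2 plus "the"), and the function names `tokenize`, `remove_stopwords` and `ngrams` are made up for this sketch.

```python
import re

# Assumed stopword list: only the examples named in step 2, plus "the".
STOPWORDS = {"for", "very", "of", "are", "the"}


def tokenize(sentence):
    """Step 1: lowercase, then keep alphabetic runs (drops "$12" and brackets)."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Step 2: drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def ngrams(tokens, n):
    """Step 3: every contiguous run of n tokens, in sentence order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = ("Complimentary gym access for two for the length of stay "
            "($12 value per person per day)")
tokens = remove_stopwords(tokenize(sentence))
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(tokens)
print(bigrams)
```

With this stopword set, "per" is kept, so n-grams such as "per person" still appear. A real stopword list (for example the one shipped with your NLP library) would likely remove more words.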
Let's check the results of applying the above steps to the given sentence: Complimentary gym access for two for the length of stay ($12 value per person per day).
1-gram results: gym, access, length, stay, value, person, day
Summary of step 1 through 4 of shallow NLP:
1-gram         PoS_Tag  Stopword (Yes/No)?  PoS Tag Description
-------------------------------------------------------------------
Complimentary  NNP                          Proper noun, singular
gym            NN                           Noun, singular or mass
access         NN                           Noun, singular or mass
for            IN       Yes                 Preposition or subordinating conjunction
two            CD                           Cardinal number
for            IN       Yes                 Preposition or subordinating conjunction
the            DT       Yes                 Determiner
length         NN                           Noun, singular or mass
of             IN       Yes                 Preposition or subordinating conjunction
stay           NN                           Noun, singular or mass
($12           CD                           Cardinal number
value          NN                           Noun, singular or mass
per            IN                           Preposition or subordinating conjunction
person         NN                           Noun, singular or mass
per            IN                           Preposition or subordinating conjunction
day)           NN                           Noun, singular or mass
Step 5: Retaining only the nouns and verbs, we end up with gym, access, length, stay, value, person, day
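The filtering in this last step can be sketched as follows. The (word, tag) pairs are copied by hand from the 1-gram table, so no tagger is run here. The rule in `is_kept` is an assumption: it keeps common nouns and verbs but not proper nouns (NNP), which is what the listed result suggests, since "Complimentary" is not retained.

```python
# (word, tag) pairs copied from the 1-gram table; no tagger is run here.
TAGGED = [
    ("complimentary", "NNP"), ("gym", "NN"), ("access", "NN"), ("for", "IN"),
    ("two", "CD"), ("for", "IN"), ("the", "DT"), ("length", "NN"),
    ("of", "IN"), ("stay", "NN"), ("$12", "CD"), ("value", "NN"),
    ("per", "IN"), ("person", "NN"), ("per", "IN"), ("day", "NN"),
]


def is_kept(tag):
    """Assumed rule: keep common nouns (NN, NNS) and verbs (VB*), not NNP."""
    return tag in {"NN", "NNS"} or tag.startswith("VB")


def content_words(tagged):
    """Return the words whose tag passes is_kept, in sentence order."""
    return [word for word, tag in tagged if is_kept(tag)]


kept = content_words(TAGGED)
print(", ".join(kept))
```

Whether proper nouns should be dropped depends on your data; for product or place names you would probably want to keep them.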
Let's increase n to store more context, and remove the stopwords.
2-gram results: complimentary gym, gym access, length stay, stay value
Summary of step 1 through 4 of shallow NLP:
2-gram             Pos Tag
---------------------------
access two         NN CD
complimentary gym  NNP NN
gym access         NN NN
length stay        NN NN
per day            IN NN
per person         IN NN
person per         NN IN
stay value         NN NN
two length         CD NN
value per          NN IN
Step 5: Retaining only the noun/verb combinations, we end up with complimentary gym, gym access, length stay, stay value
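For the 2-grams, one rule that reproduces the four results above is "keep an n-gram only if every tag in it is a noun tag". That rule is my reading of the table, not something the tagger gives you, and the helper names below are made up for this sketch. The tagged 2-grams are copied by hand from the table.

```python
# Tagged 2-grams copied from the 2-gram table: (words, tags).
TAGGED_BIGRAMS = [
    (("access", "two"), ("NN", "CD")),
    (("complimentary", "gym"), ("NNP", "NN")),
    (("gym", "access"), ("NN", "NN")),
    (("length", "stay"), ("NN", "NN")),
    (("per", "day"), ("IN", "NN")),
    (("per", "person"), ("IN", "NN")),
    (("person", "per"), ("NN", "IN")),
    (("stay", "value"), ("NN", "NN")),
    (("two", "length"), ("CD", "NN")),
    (("value", "per"), ("NN", "IN")),
]


def all_nouns(tags):
    """Assumed rule: every tag must be a noun tag (NN, NNS, NNP, NNPS)."""
    return all(tag.startswith("NN") for tag in tags)


def noun_ngrams(tagged_ngrams):
    """Return the n-grams, joined into strings, that consist of nouns only."""
    return [" ".join(words) for words, tags in tagged_ngrams if all_nouns(tags)]


kept_bigrams = noun_ngrams(TAGGED_BIGRAMS)
print(", ".join(kept_bigrams))
```

To also keep verb combinations you could extend `all_nouns` to accept tags starting with "VB".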
3-gram results: complimentary gym access, length stay value, person per day
Summary of step 1 through 4 of shallow NLP:
3-gram                    Pos Tag
-------------------------------------
access two length         NN CD NN
complimentary gym access  NNP NN NN
gym access two            NN NN CD
length stay value         NN NN NN
per person per            IN NN IN
person per day            NN IN NN
stay value per            NN NN IN
two length stay           CD NN NN
value per person          NN IN NN
Step 5: Retaining only the noun/verb combinations, we end up with complimentary gym access, length stay value, person per day
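Things to remember:
Tools:
You can consider using OpenNLP / StanfordNLP for part-of-speech tagging. Most programming languages have a supporting library for OpenNLP / StanfordNLP, so you can pick the language you are most comfortable with. Below is the sample R code I used for PoS tagging.
Sample R code: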
Sys.setenv(JAVA_HOME = 'C:\\Program Files\\Java\\jre7')  # for 32-bit version
library(rJava)
library(openNLP)
library(NLP)

s <- "Complimentary gym access for two for the length of stay $12 value per person per day"

# Tag every word of x with its part of speech.
tagPOS <- function(x, ...) {
  s <- as.String(x)
  # Treat the whole string as one sentence, then split it into word tokens.
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, Maxent_Word_Token_Annotator(), a2)
  # Add a POS feature to each word annotation.
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  # Build the "word/TAG" string and return it together with the raw tags.
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

tagged_str <- tagPOS(s)
tagged_str
#$POStagged
#[1] "Complimentary/NNP gym/NN access/NN for/IN two/CD for/IN the/DT length/NN of/IN stay/NN $/$ 12/CD value/NN per/IN person/NN per/IN day/NN"
#
#$POStags
#[1] "NNP" "NN" "NN" "IN" "CD" "IN" "DT" "NN" "IN" "NN" "$" "CD"
#[13] "NN" "IN" "NN" "IN" "NN"
Additional reading on shallow and deep NLP:
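You need to analyze the sentence structure and extract the corresponding syntactic categories of interest (in this case, I think that would be the noun phrase, which is a phrasal category). For details, see the corresponding Wikipedia article and the "Analyzing Sentence Structure" chapter of the NLTK book.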
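As a rough illustration of pulling noun phrases out of a tagged sentence, here is a small standard-library Python sketch. It does not use NLTK and it is not a real parser: the (word, tag) pairs are copied from the tagger output shown in the R answer above, and the pattern "optional determiner, any adjectives, one or more nouns" is an assumed, deliberately simple definition of a noun phrase. The names `tag_code` and `noun_phrases` are made up for this example.

```python
import re

# (word, tag) pairs copied from the PoS tagger output shown earlier.
TAGGED = [
    ("Complimentary", "NNP"), ("gym", "NN"), ("access", "NN"), ("for", "IN"),
    ("two", "CD"), ("for", "IN"), ("the", "DT"), ("length", "NN"),
    ("of", "IN"), ("stay", "NN"), ("$", "$"), ("12", "CD"), ("value", "NN"),
    ("per", "IN"), ("person", "NN"), ("per", "IN"), ("day", "NN"),
]

# Assumed noun-phrase shape: optional determiner, adjectives, then nouns.
NP_PATTERN = re.compile(r"D?J*N+")


def tag_code(tag):
    """Map a Penn Treebank tag to one letter so a regex can scan the tags."""
    if tag == "DT":
        return "D"
    if tag.startswith("JJ"):
        return "J"
    if tag.startswith("NN"):
        return "N"
    return "O"  # anything else ends a phrase


def noun_phrases(tagged):
    """Return each maximal run of tokens whose tags match NP_PATTERN."""
    codes = "".join(tag_code(tag) for _, tag in tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in NP_PATTERN.finditer(codes)]


phrases = noun_phrases(TAGGED)
print(phrases)
```

Because each tag becomes exactly one character, a match position in the code string is also a token index, which is what lets the regex do the chunking. A chunk grammar in a proper toolkit can express richer patterns than this.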
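As for available software tools for implementing the approach above and beyond, I would suggest considering NLTK (if you prefer Python) or the StanfordNLP software (if you prefer Java). For many other NLP frameworks, libraries and programming support in various languages, see the corresponding (NLP) section of this curated list.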
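If you are an R user, there is a lot of good practical information at http://www.rdatamining.com. Look at their text mining examples.
Also, take a look at the tm package.
This is also a good aggregation site: http://www.tapor.ca/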