Check out this link.
There, they walk you through loading unstructured text to create a word cloud. You can adapt that strategy: instead of producing a word cloud, build a frequency matrix of the terms used. The idea is to take unstructured text and structure it somehow. Via Document Term Matrices, you change everything to lowercase (or uppercase), remove stop words, and find the frequent terms for each job function. You also have the option of stemming words. If you stem words, you will be able to detect different forms of the same word; for example, "programmed" and "programming" could both be stemmed to "program". You can then add the occurrence of these frequent terms as weighted features in your ML model training.
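As a quick, hedged illustration of that stemming behavior (using the SnowballC package that the example below loads anyway), different surface forms collapse onto one stem:

library(SnowballC)
# "programmed", "programming" and "program" all reduce to the stem "program",
# so they end up counted as the same term in the frequency matrix
wordStem(c("programmed", "programming", "program"))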
You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function (a sketch of this is given after step 3 below).
Example:
1) Load the libraries and build the sample data
library(tm)
library(SnowballC)
doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
job1 = "Software Engineer"
doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
job2 = "Quality Assurance"
doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
job3 = "Software Engineer"
# Keep the text as plain character strings (the default in R >= 4.0)
jobInfo = data.frame("text" = c(doc1, doc2, doc3),
                     "job"  = c(job1, job2, job3),
                     stringsAsFactors = FALSE)
2) Now we do some text structuring. I am positive there are quicker or shorter ways to do the following.
# Convert to lowercase
jobInfo$text = sapply(jobInfo$text,tolower)
# Remove Punctuation
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))
# Remove extra white space
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))
# Remove stop words (note: setdiff also drops duplicate words within a document,
# so the "frequency" matrix below effectively records presence/absence per document)
jobInfo$text = sapply(jobInfo$text, function(x){
  paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
})
# Stem words (Also try without stemming?)
jobInfo$text = sapply(jobInfo$text, function(x) {
  paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
})
3) Make a corpus and a document term matrix.
# Create Corpus Source
jobCorpus = Corpus(VectorSource(jobInfo$text))
# Create Document Term Matrix
jobDTM = DocumentTermMatrix(jobCorpus)
# Create Term Frequency Matrix
jobFreq = as.matrix(jobDTM)
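As mentioned earlier, you can also tokenize the text into 2-3 word phrases instead of single terms. A minimal sketch, assuming the ngrams() and words() helpers from the NLP package that tm attaches (the jobPhraseDTM / jobPhraseFreq names are just illustrative); change the 2 to a 3 for three-word phrases. Depending on your tm version, the SimpleCorpus produced by Corpus() may ignore custom tokenizers, so a VCorpus is used here:

# Tokenizer that splits each document into 2-word phrases (bigrams)
BigramTokenizer = function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
# Build a VCorpus so the custom tokenizer is honored
jobVCorpus = VCorpus(VectorSource(jobInfo$text))
# Document term matrix and frequency matrix over phrases instead of single words
jobPhraseDTM = DocumentTermMatrix(jobVCorpus, control = list(tokenize = BigramTokenizer))
jobPhraseFreq = as.matrix(jobPhraseDTM)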
Now we have the term frequency matrix, jobFreq, a (3 by x) matrix: 3 entries (one per document) and x terms.
Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another approach is to keep it simple and use the percentage of job descriptions in which each word appears, say "java" appears in 80% of the "Software Engineer" descriptions but only 50% of the "Quality Assurance" ones.
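Here is a minimal, hedged sketch of that percentage idea (the presence / termShare names are just illustrative): for each job title, compute the share of its descriptions that contain each term.

# 1/0 matrix: does the term appear in the document at all?
presence = (jobFreq > 0) * 1
# Per job title, how many documents contain each term, and how many documents there are
docsContaining = rowsum(presence, jobInfo$job)
docsPerJob = as.vector(table(jobInfo$job))   # same alphabetical order as rowsum()
# Fraction of each job title's descriptions that mention the term
termShare = docsContaining / docsPerJob
# e.g. the share for "java" (which survives the cleaning in the sample data above)
termShare[, "java", drop = FALSE]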
Now it is time to go look up why "assurance" has one "r" and "occurrence" has two.