Statistics and Big Data: smoothing

1

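Example: my job description contains the sentence "Java senior engineer in UK". I want to use a deep learning model to predict two categories for it: English and IT jobs. A traditional classification model can only predict one label, because of the softmax function at the last layer. I could therefore train two neural network models, each predicting "Yes"/"No" for one category, but that becomes too expensive as the number of categories grows. Is there a deep learning or machine learning model that can predict two or more categories simultaneously?

Edit: with 3 labels, the traditional approach encodes the target as [1,0,0], but in my case it would be encoded as [1,1,0] or [1,1,1]. Example: suppose we have 3 labels and a sentence may fit all of them. If the output of the softmax function is [0.45, 0.35, 0.2], should we classify the sentence into 3 labels, 2 labels, or just one? The main problem is: what is a good threshold for deciding between 1, 2, or 3 labels?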
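To make the encoding issue in the question concrete, here is a minimal sketch contrasting softmax scores (which compete and sum to 1) with independent per-label sigmoid scores, one commonly used setup for multi-label targets such as [1,1,0]. The third label name, the logit values, and the 0.5 cut-off are all made up for illustration; the question does not specify them, and the right threshold is exactly what is being asked.

```python
import numpy as np

def softmax(z):
    """Scores that compete with each other and sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    """Independent per-label scores, each in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical labels: the first two come from the question, the third is invented.
labels = ["English", "IT jobs", "Finance"]
# Made-up raw scores (logits) from the last layer of some model.
logits = np.array([2.0, 1.5, -3.0])

softmax_probs = softmax(logits)  # one distribution over the labels
sigmoid_probs = sigmoid(logits)  # one yes/no probability per label

threshold = 0.5  # assumed cut-off, not a recommended value
multi_hot = (sigmoid_probs >= threshold).astype(int)
predicted = [name for name, hit in zip(labels, multi_hot) if hit]

print("softmax:", np.round(softmax_probs, 3))
print("sigmoid:", np.round(sigmoid_probs, 3))
print("multi-hot prediction:", multi_hot.tolist(), predicted)
```

With per-label sigmoids, each label gets its own yes/no decision, so the number of predicted labels is not fixed in advance; the threshold can then be tuned per label on validation data.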

9 machine-learning deep-learning natural-language tensorflow sampling distance non-independent application regression machine-learning logistic mixed-model control-group crossover r multivariate-analysis ecology procrustes-analysis vegan regression hypothesis-testing interpretation chi-squared bootstrap r bioinformatics bayesian exponential beta-distribution bernoulli-distribution conjugate-prior distributions bayesian prior beta-distribution covariance naive-bayes smoothing laplace-smoothing distributions data-visualization regression probit penalized estimation unbiased-estimator fisher-information unbalanced-classes bayesian model-selection aic multiple-regression cross-validation regression-coefficients nonlinear-regression standardization naive-bayes trend machine-learning clustering unsupervised-learning wilcoxon-mann-whitney z-score econometrics generalized-moments method-of-moments machine-learning conv-neural-network image-processing ocr machine-learning neural-networks conv-neural-network tensorflow r logistic scoring-rules probability self-study pdf cdf classification svm resampling forecasting rms volatility-forecasting diebold-mariano neural-networks prediction-interval uncertainty

1

Why add one in inverse document frequency?

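My textbook lists the idf as $\log(1+\frac{N}{n_t})$, where $N$ is the number of documents and $n_t$ is the number of documents containing the term $t$. Wikipedia lists this formula as a smoothed version of the actual $\log(\frac{N}{n_t})$. That one I understand: it ranges from $\log(\frac{N}{N})=0$ to $\infty$, which seems intuitive. But $\log(1+\frac{N}{n_t})$ goes from $\log(1+1)$ to $\infty$, which seems very strange. I know a bit about smoothing from language modeling, but there you add something in the numerator as well as in the denominator, because you are worried about the probability mass. Just adding $1$ does not make sense to me. What are we trying to accomplish here?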
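The two ranges described in the question can be checked numerically. The sketch below evaluates both formulas for a few document frequencies; the corpus size `N = 1000` and the function names are assumptions chosen only for illustration.

```python
import math

def idf_plain(n_docs, doc_freq):
    """Unsmoothed idf: log(N / n_t)."""
    return math.log(n_docs / doc_freq)

def idf_plus_one(n_docs, doc_freq):
    """Textbook variant from the question: log(1 + N / n_t)."""
    return math.log(1 + n_docs / doc_freq)

N = 1000  # hypothetical corpus size

# From a very rare term (n_t = 1) to a term that occurs in every document (n_t = N).
for n_t in (1, 10, 100, N):
    plain = idf_plain(N, n_t)
    plus_one = idf_plus_one(N, n_t)
    print(f"n_t={n_t:5d}  log(N/n_t)={plain:.4f}  log(1+N/n_t)={plus_one:.4f}")

# At n_t = N the plain version is log(1) = 0, while the "+1" version is log(2).
```

For rare terms the two values are nearly identical, since $N/n_t$ dominates the added 1; the visible difference is at the other end, where a term occurring in every document gets weight $\log 2$ instead of 0.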

9 text-mining natural-language smoothing

3

Choice of k knots in regression smoothing spline equivalent to k categorical variables?

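I am working on a predictive cost model in which the patient's age (an integer number of years) is one of the predictor variables. There is a strong nonlinear relationship between age and the risk of a hospital stay, so I am considering a penalized regression smoothing spline for patient age. According to The Elements of Statistical Learning (Hastie et al., 2009, p. 151), the optimal knot placement is one knot per unique value of member age. Given that I keep age as an integer, is the penalized smoothing spline equivalent to running a ridge regression or lasso with 101 distinct age indicator variables, one per age value found in the dataset (minus one as the reference)? Over-parameterization would then be avoided because the coefficient on each age indicator is shrunk toward zero.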
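The indicator-variable side of this comparison can be sketched directly. The code below fits a ridge regression on 101 age indicators using simulated data; the risk curve, noise level, sample size, and penalty `lam` are all invented for illustration. As a simplification, it keeps all 101 indicators with no intercept instead of dropping a reference level, which the ridge penalty makes possible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumption): integer ages 0..100 and a made-up nonlinear risk curve.
n_ages = 101
ages = rng.integers(0, n_ages, size=2000)
y = 0.02 + 0.00005 * (ages - 40) ** 2 + rng.normal(0.0, 0.05, size=ages.size)

# One indicator column per age value (no reference level, no intercept).
X = np.eye(n_ages)[ages]

lam = 5.0  # hypothetical ridge penalty

# Ridge solution: (X'X + lam * I)^{-1} X'y
beta = np.linalg.solve(X.T @ X + lam * np.eye(n_ages), X.T @ y)

# Because X'X is diagonal (the per-age counts), each coefficient is just a
# shrunken per-age mean: sum of y at that age / (count at that age + lam).
counts = np.bincount(ages, minlength=n_ages)
sums = np.bincount(ages, weights=y, minlength=n_ages)
closed_form = sums / (counts + lam)

print(np.allclose(beta, closed_form))
```

In this setup each age's coefficient is shrunk toward zero using only that age's own observations; the penalty does not tie neighbouring ages together, whereas a smoothing spline penalizes curvature across ages.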

9 nonlinear-regression lasso ridge-regression smoothing splines

Questions tagged «smoothing»