
1

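I want to visualise the results of a clustering (produced with protoclust{protoclust}) by creating a scatter plot for each pair of variables used to classify my data, coloured by class and overlaid with an ellipse for the 95% confidence interval of each class (to inspect which class ellipses overlap under each pair of variables). I have implemented the drawing of the ellipses in two different ways, and the resulting ellipses differ (the first implementation gives bigger ellipses). A priori they differ only in size (some different scaling?), since the centres and the axis angles appear to be similar in both. I guess I must be doing something wrong in one of them (hopefully not both), or with the arguments. Can anyone tell me what I'm doing wrong? Here is the code for the two implementations; both are based on the answers to "How can a data ellipse be superimposed on a ggplot2 scatterplot?"

    ### 1st implementation
    ### using ellipse{ellipse}
    library(ellipse)
    library(ggplot2)
    library(RColorBrewer)
    colorpal <- brewer.pal(10, "Paired")
    x <- data$x
    y <- data$y
    group <- data$group
    df <- data.frame(x=x, y=y, group=factor(group))
    df_ell <- data.frame()
    for(g in levels(df$group)){
      df_ell <- rbind(df_ell,
                      cbind(as.data.frame(with(df[df$group==g,],
                                               ellipse(cor(x, y),
                                                       scale=c(sd(x),sd(y)),
                                                       centre=c(mean(x),mean(y))))),
                            group=g))
    }
    p1 <- ggplot(data=df, aes(x=x, y=y,colour=group)) + geom_point() …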
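When two ellipse implementations agree on centre and orientation but not on size, one thing worth checking is the radius constant each one uses. The base-R sketch below is only an illustration under stated assumptions, not a diagnosis of the code in the question: it assumes roughly bivariate normal data and uses the chi-square radius `sqrt(qchisq(0.95, 2))`, which, as far as I know, is also the default in `ellipse::ellipse`. The helper name `data_ellipse` and the simulated data are hypothetical.

```r
# Hypothetical helper: points on a data ellipse built from the sample
# mean and covariance, assuming approximately bivariate normal data.
data_ellipse <- function(x, y, level = 0.95, npoints = 100) {
  centre <- c(mean(x), mean(y))
  sigma  <- cov(cbind(x, y))
  # Chi-square radius with 2 degrees of freedom (an assumption; other
  # implementations may use an F-based or differently scaled constant).
  radius <- sqrt(qchisq(level, df = 2))
  angles <- seq(0, 2 * pi, length.out = npoints)
  circle <- cbind(cos(angles), sin(angles))
  # Map the unit circle through the Cholesky factor of the covariance,
  # scale by the radius, then shift to the centre.
  pts <- sweep(radius * circle %*% chol(sigma), 2, centre, "+")
  data.frame(x = pts[, 1], y = pts[, 2])
}

# Simulated data, for illustration only.
set.seed(42)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.5)
ell <- data_ellipse(x, y)
```

Every point returned has the same squared Mahalanobis distance, `qchisq(0.95, 2)`, from the centre, so an ellipse built with a different constant differs in size only. The data frame can be drawn on an existing scatter plot with `geom_path(data = ell, aes(x = x, y = y))`.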

11 r confidence-interval ggplot2 scatterplot

3

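How can I compare two datasets with a Q-Q plot using ggplot2?

As a newcomer to both statistics and R, I have been having a hard time trying to generate Q-Q plots with an aspect ratio of 1:1. ggplot2 seems to offer far more control over plotting than the default R plotting packages, but I can't see how to do a Q-Q plot in ggplot2 that compares two datasets. So my question is: what is the ggplot2 equivalent of the following?

    qqplot(dataset1, dataset2)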
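One possible approach, sketched here under assumptions: let base R's `qqplot()` compute the paired quantiles without drawing (`plot.it = FALSE`), then pass them to ggplot2 and fix the aspect ratio with `coord_fixed()`. The simulated `dataset1` and `dataset2` are placeholders, and the plotting step is skipped if ggplot2 is not installed.

```r
# Placeholder data standing in for the two datasets in the question.
set.seed(1)
dataset1 <- rnorm(100)
dataset2 <- rnorm(150, mean = 0.5)

# qqplot() returns the sorted, length-matched quantiles as a list(x, y).
qq <- as.data.frame(qqplot(dataset1, dataset2, plot.it = FALSE))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  lims <- range(qq$x, qq$y)  # shared limits so both axes cover the same range
  p <- ggplot(qq, aes(x = x, y = y)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    coord_fixed(ratio = 1, xlim = lims, ylim = lims) +  # 1:1 aspect ratio
    labs(x = "dataset1 quantiles", y = "dataset2 quantiles")
}
```

Printing `p` draws the plot. The dashed identity line is an optional reference: points close to it suggest the two samples have similar distributions.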

11 r distributions ggplot2 qq-plot

2

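How can I plot a continuous-by-continuous interaction in ggplot2?

Suppose I have the data:

    x1 <- rnorm(100,2,10)
    x2 <- rnorm(100,2,10)
    y <- x1+x2+x1*x2+rnorm(100,1,2)
    dat <- data.frame(y=y,x1=x1,x2=x2)
    res <- lm(y~x1*x2,data=dat)
    summary(res)

I would like to plot the continuous-by-continuous interaction so that x1 is on the X axis and x2 is represented by three lines: one for x2 at a Z-score of 0, one at a Z-score of +1, and one at a Z-score of -1, with each line having its own colour and label. How can I do this using ggplot2? For example, it might look something like this (though of course with differently coloured lines rather than different line types): [image not shown]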
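A minimal sketch of one way to do this, assuming that "Z-score of 0, +1, -1" means x2 held at its sample mean and at one standard deviation above and below it. The names `grid`, `z` and `x2_level` are illustrative and not from the question; a seed is added only for reproducibility.

```r
set.seed(1)
x1 <- rnorm(100, 2, 10)
x2 <- rnorm(100, 2, 10)
y <- x1 + x2 + x1 * x2 + rnorm(100, 1, 2)
dat <- data.frame(y = y, x1 = x1, x2 = x2)
res <- lm(y ~ x1 * x2, data = dat)

# Prediction grid: a range of x1 values crossed with three z-scores of x2.
z <- c(-1, 0, 1)
grid <- expand.grid(
  x1 = seq(min(dat$x1), max(dat$x1), length.out = 50),
  z = z
)
grid$x2 <- mean(dat$x2) + grid$z * sd(dat$x2)  # convert z-scores back to x2 units
grid$fit <- predict(res, newdata = grid)
grid$x2_level <- factor(grid$z, levels = z,
                        labels = c("x2 at z = -1", "x2 at z = 0", "x2 at z = +1"))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(grid, aes(x = x1, y = fit, colour = x2_level)) +
    geom_line() +
    labs(y = "predicted y", colour = "Level of x2")
}
```

Each line is the model's predicted y across x1 with x2 fixed, so differing slopes between the three lines show the interaction.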

11 r regression ggplot2 interaction

1

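How to interpret a notched box plot

While doing some EDA, I decided to use box plots to illustrate the difference between two levels of a factor. The way ggplot renders the box plot is satisfactory but slightly simplistic (first plot below). While researching the characteristics of box plots, I started experimenting with notches. I understand that the notches display a CI around the median, and that if the notches of two boxes do not overlap there is "strong evidence" (at a 95% confidence level) that the medians differ. In my case (second plot), the notches do not meaningfully overlap. But why does the bottom of the box on the right-hand side take that strange form? Plotting the same data as a violin plot does not indicate anything unusual about the probability density of the corresponding violin.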
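For reference, the notch limits in base R are computed as median ± 1.58 × IQR / √n. The sketch below uses made-up numbers, not the poster's data, to show that with a small or skewed sample a notch limit can fall outside the box's hinge. That is one possible reason for a folded-looking box, offered as an assumption because the plots are not available here. ggplot2 documents the same 1.58 × IQR / √n rule, though it may compute the quartiles slightly differently.

```r
# Made-up skewed sample, for illustration only.
x <- c(1, 1, 1, 1, 2, 5, 9, 14, 20)

bs <- boxplot.stats(x)
hinges <- bs$stats[c(2, 4)]  # lower and upper hinge (edges of the box)
med    <- bs$stats[3]        # median
notch  <- bs$conf            # lower and upper notch limits

# Same limits computed by hand from the documented rule.
manual <- med + c(-1.58, 1.58) * diff(hinges) / sqrt(length(x))

# TRUE when the lower notch limit extends below the lower hinge.
notch_outside_box <- notch[1] < hinges[1]
```

Here the lower notch limit is below the lower hinge, so a notched box drawn from these values has to fold back on itself at that end.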

11 data-visualization ggplot2 eda

1

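How to plot a fitted gamma distribution and the actual data in one plot?

Load the required packages.

    library(ggplot2)
    library(MASS)

Generate 100,000 numbers that follow a gamma distribution.

    x <- round(rgamma(100000,shape = 2,rate = 0.2),1)
    x <- x[which(x>0)]

Plot the probability density function, supposing we don't know which distribution x follows.

    t1 <- as.data.frame(table(x))
    names(t1) <- c("x","y")
    t1 <- transform(t1,x=as.numeric(as.character(x)))
    t1$y <- t1$y/sum(t1[,2])
    ggplot() + geom_point(data = t1,aes(x = x,y = y)) + theme_classic()

From the plot, the distribution of x looks very much like a gamma distribution, so fitdistr() from the MASS package is used to estimate the shape and rate parameters of the gamma distribution.

    fitdistr(x,"gamma")
    ## output
    ##       shape           rate
    ##   2.0108224880   0.2011198260
    ##  (0.0083543575) …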
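A sketch of one way to overlay the fitted curve on the empirical points, under one assumption worth stating: because x is rounded to one decimal place, the y values in `t1` are proportions per 0.1-wide bin rather than densities, so the fitted density is multiplied by the bin width to put both on the same scale. The names `fit_curve` and `bin_width`, and the seed, are illustrative additions.

```r
library(MASS)

set.seed(123)
x <- round(rgamma(100000, shape = 2, rate = 0.2), 1)
x <- x[x > 0]

# Empirical proportion of observations at each rounded value.
t1 <- as.data.frame(table(x))
names(t1) <- c("x", "y")
t1 <- transform(t1, x = as.numeric(as.character(x)))
t1$y <- t1$y / sum(t1$y)

# Maximum-likelihood gamma fit, as in the question.
fit <- fitdistr(x, "gamma")
shape_hat <- fit$estimate[["shape"]]
rate_hat  <- fit$estimate[["rate"]]

# Fitted density rescaled to a per-bin proportion (bin width 0.1).
bin_width <- 0.1
fit_curve <- data.frame(x = t1$x)
fit_curve$y <- dgamma(fit_curve$x, shape = shape_hat, rate = rate_hat) * bin_width

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot() +
    geom_point(data = t1, aes(x = x, y = y)) +
    geom_line(data = fit_curve, aes(x = x, y = y), colour = "red") +
    theme_classic()
}
```

Without the rescaling, the fitted curve would sit about ten times higher than the points, which can look like a poor fit even when the parameter estimates are close.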

10 r mathematical-statistics goodness-of-fit gamma-distribution ggplot2

2

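Calculating a ROC curve for data

I have 16 trials in which I am trying to authenticate a person from a biometric trait using the Hamming distance. My threshold is set to 3.5. My data is below, and only trial 1 is a true positive:

    Trial   Hamming Distance
    1       0.34
    2       0.37
    3       0.34
    4       0.29
    5       0.55
    6       0.47
    7       0.47
    8       0.32
    9       0.39
    10      0.45
    11      0.42
    12      0.37
    13      0.66
    14      0.39
    15      0.44
    16      0.39

My point of confusion is that I am really unsure how to make a ROC curve (FPR vs. TPR, or FAR vs. FRR) from this data. It doesn't matter which one, but I'm confused about how to go about calculating it. Any help would be appreciated.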
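A minimal sketch of how the ROC points could be computed by sweeping the decision threshold. It rests on two assumptions that the question does not state outright: a trial is accepted when its Hamming distance is at or below the threshold, and trial 1 is the only genuine attempt (all others are impostors). The names `roc_point` and `genuine` are illustrative.

```r
# Hamming distances for trials 1 to 16, copied from the question.
hd <- c(0.34, 0.37, 0.34, 0.29, 0.55, 0.47, 0.47, 0.32,
        0.39, 0.45, 0.42, 0.37, 0.66, 0.39, 0.44, 0.39)

# Assumption: only trial 1 is a genuine attempt.
genuine <- seq_along(hd) == 1

# TPR and FPR when accepting every trial with distance <= threshold.
roc_point <- function(threshold) {
  accepted <- hd <= threshold
  c(threshold = threshold,
    tpr = mean(accepted[genuine]),    # genuine attempts accepted
    fpr = mean(accepted[!genuine]))   # impostor attempts accepted (FAR)
}

# One ROC point per distinct distance, plus a threshold accepting nothing.
thresholds <- c(-Inf, sort(unique(hd)))
roc <- as.data.frame(t(sapply(thresholds, roc_point)))
```

Plotting `roc$fpr` against `roc$tpr` gives the ROC curve; FAR is the FPR and FRR is 1 - TPR. With a single genuine trial the TPR can only be 0 or 1, so the curve is a coarse step function and any estimate from it is very uncertain.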

9 mathematical-statistics roc classification cross-validation pac-learning r anova survival hazard machine-learning data-mining hypothesis-testing regression random-variable non-independent normal-distribution approximation central-limit-theorem interpolation splines distributions kernel-smoothing data-visualization ggplot2 binomial poisson-distribution simulation kalman-filter lasso regularization lme4-nlme model-selection aic mcmc dlm particle-filter panel-data multilevel-analysis entropy graphical-model quantiles qq-plot svm matlab inference dataset algorithms matrix-decomposition modeling interaction expected-value exponential gamma-distribution gibbs probability self-study normality-assumption naive-bayes bayes-optimal-classifier standard-deviation optimization control-chart engineering-statistics references elastic-net aggregation clustering correlation time-series goodness-of-fit statistical-significance sample binary-data estimation chi-squared predictor outliers

Questions tagged «ggplot2»