编辑:我本来以为OP知道哪些观测值来自哪个物种。OP的编辑清楚表明,我原来的方法不可行。我将其留给后代使用,但另一个答案要好得多。作为安慰,我在Stan中编写了一个混合模型。我并不是说贝叶斯方法在这种情况下特别有用,但这只是我可以做出的贡献。
斯坦码
data{
//Number of data points
int N;
real y[N];
real x[N];
}
parameters{
//mixing parameter
real<lower=0, upper =1> theta;
//Regression intercepts
real beta_0[2];
//Regression slopes.
ordered[2] beta_1;
//Regression noise
real<lower=0> sigma[2];
}
model{
//priors
theta ~ beta(5,5);
beta_0 ~ normal(0,1);
beta_1 ~ normal(0,1);
sigma ~ cauchy(0,2.5);
//mixture likelihood
for (n in 1:N){
target+=log_mix(theta,
normal_lpdf(y[n] | beta_0[1] + beta_1[1]*x[n], sigma[1]),
normal_lpdf(y[n] | beta_0[2] + beta_1[2]*x[n], sigma[2]));
}
}
generated quantities {
//posterior predictive distribution
//will allow us to see what points belong are assigned
//to which mixture
matrix[N,2] p;
matrix[N,2] ps;
for (n in 1:N){
p[n,1] = log_mix(theta,
normal_lpdf(y[n] | beta_0[1] + beta_1[1]*x[n], sigma[1]),
normal_lpdf(y[n] | beta_0[2] + beta_1[2]*x[n], sigma[2]));
p[n,2]= log_mix(1-theta,
normal_lpdf(y[n] | beta_0[1] + beta_1[1]*x[n], sigma[1]),
normal_lpdf(y[n] | beta_0[2] + beta_1[2]*x[n], sigma[2]));
ps[n,]= p[n,]/sum(p[n,]);
}
}
从R运行Stan模型
library(tidyverse)
library(rstan)
#Simulate the data
N = 100
x = rnorm(N, 0, 3)
group = factor(sample(c('a','b'),size = N, replace = T))
y = model.matrix(~x*group)%*% c(0,1,0,2)
y = as.numeric(y) + rnorm(N)
d = data_frame(x = x, y = y)
d %>%
ggplot(aes(x,y))+
geom_point()
#Fit the model
N = length(x)
x = as.numeric(x)
y = y
fit = stan('mixmodel.stan',
data = list(N= N, x = x, y = y),
chains = 8,
iter = 4000)
结果
虚线是基本事实,是实线。
原始答案
如果您知道哪个水仙花样品来自哪个品种,则可以估算出品种与土壤PH的相互作用。
您的模型看起来像
ÿ= β0+ β1个品种+ β2PH + β3各种⋅ PH
这是R中的示例。我生成了一些看起来像这样的数据:
显然有两条不同的线,并且这些线对应于两个物种。这是使用线性回归估算线的方法。
library(tidyverse)
#Simulate the data
N = 1000
ph = runif(N,5,8)
species = rbinom(N,1,0.5)
y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
y = as.numeric(y)
d = data_frame(ph = ph, species = species, y = y)
#Estimate the model
model = lm(y~species*ph, data = d)
summary(model)
结果是
> summary(model)
Call:
lm(formula = y ~ species * ph, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.61884 -0.31976 -0.00226 0.33521 1.46428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.85850 0.17484 113.58 <2e-16 ***
species 20.31363 0.24626 82.49 <2e-16 ***
ph 1.01599 0.02671 38.04 <2e-16 ***
species:ph -3.03174 0.03756 -80.72 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4997 on 996 degrees of freedom
Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16
对于标记为0的物种,该线约为
y=19+1⋅PH
对于标记为1的物种,线近似为
y=40−2⋅PH