Statistical methods for data where only a minimum/maximum value is known


29

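Is there a branch of statistics that deals with data for which the exact values are not known, but where for each individual we know either a maximum or a minimum bound on the value?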

I suspect that my problem stems largely from the fact that I am struggling to articulate it in statistical terms, but hopefully an example will help to clarify:

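Say there are two connected populations A and B, such that at some point a member of A may "transition" to B, but the reverse is not possible. The timing of the transition is variable, but non-random. For example, A could be "individuals with no offspring" and B "individuals with at least one offspring". I am interested in the age at which this progression occurs, but I only have cross-sectional data. For any given individual, I can find out whether they belong to A or B. I also know the age of these individuals. For each individual in population A, I know that the age at transition will be greater than their current age. Likewise, for members of B, I know that the age at transition was less than their current age. But I don't know the exact values.

Say I have some other factor that I want to compare with the age of transition. For example, I want to know whether an individual's subspecies or body size affects the age at first offspring. I definitely have some useful information that should inform those questions: on average, of the individuals in A, the older ones will have a later transition. But this information is imperfect, particularly for younger individuals. The reverse holds for population B.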

Are there established methods to deal with this sort of data? I do not necessarily need a full method of how to carry out such an analysis, just some search terms or useful resources to start me off in the right place!

Caveats: I am making the simplifying assumption that transition from A to B is instantaneous. I am also prepared to assume that most individuals will at some point progress to B, assuming they live long enough. And I realise that longitudinal data would be very helpful, but assume that it is not available in this case.

Apologies if this is a duplicate, as I said, part of my problem is that I don't know what I should be searching for. For the same reason, please add other tags if appropriate.

Sample dataset: Ssp indicates one of two subspecies, X or Y. Offspring indicates either no offspring (A) or at least one offspring (B)

 age ssp offsp
  21   Y     A
  20   Y     B
  26   X     B
  33   X     B
  33   X     A
  24   X     B
  34   Y     B
  22   Y     B
  10   Y     B
  20   Y     A
  44   X     B
  18   Y     A
  11   Y     B
  27   X     A
  31   X     B
  14   Y     B
  41   X     B
  15   Y     A
  33   X     B
  24   X     B
  11   Y     A
  28   X     A
  22   X     B
  16   Y     A
  16   Y     B
  24   Y     B
  20   Y     B
  18   X     B
  21   Y     B
  16   Y     B
  24   Y     A
  39   X     B
  13   Y     A
  10   Y     B
  18   Y     A
  16   Y     A
  21   X     A
  26   X     B
  11   Y     A
  40   X     B
   8   Y     A
  41   X     B
  29   X     B
  53   X     B
  34   X     B
  34   X     B
  15   Y     A
  40   X     B
  30   X     A
  40   X     B

Edit: example dataset changed as it wasn't very representative


2
This is an interesting situation. Can you provide your data?
gung - Reinstate Monica

1
I would not be able to post the full dataset but could give an example set.
user2390246

Answers:


26

This is referred to as current status data. You get one cross sectional view of the data, and regarding the response, all you know is that at the observed age of each subject, the event (in your case: transitioning from A to B) has happened or not. This is a special case of interval censoring.

To formally define it, let $T_i$ be the (unobserved) true event time for subject $i$. Let $C_i$ be the inspection time for subject $i$ (in your case: age at inspection). If $C_i < T_i$, the data are right censored. Otherwise, the data are left censored. We are interested in modeling the distribution of $T$. For regression models, we are interested in modeling how that distribution changes with a set of covariates $X$.

To analyze this using interval censoring methods, you want to put your data into the general interval censoring format. That is, for each subject, we have the interval $(l_i, r_i)$, which represents the interval in which we know $T_i$ to be contained. So if subject $i$ is right censored at inspection time $c_i$, we would write $(c_i, \infty)$. If it is left censored at $c_i$, we would represent it as $(0, c_i)$.
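As a minimal sketch of that conversion, the snippet below takes the first six rows of the OP's sample dataset and builds the interval bounds. The column names `l` and `u` mirror the `miceData` layout shown further down; the data frame name `d` is just an illustrative choice.

```r
# First six rows of the OP's sample dataset (age, subspecies, offspring status).
d <- data.frame(
  age   = c(21, 20, 26, 33, 33, 24),
  ssp   = c("Y", "Y", "X", "X", "X", "X"),
  offsp = c("A", "B", "B", "B", "A", "B")
)

# A = no offspring yet: transition age is above current age -> right censored (age, Inf).
# B = at least one offspring: transition age is below current age -> left censored (0, age).
d$l <- ifelse(d$offsp == "A", d$age, 0)
d$u <- ifelse(d$offsp == "A", Inf, d$age)

print(d)
```

A data frame in this shape should be usable with a formula of the form `cbind(l, u) ~ ssp`, in the same way `miceData` is used in the demonstration below.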

Shameless plug: if you want to use regression models to analyze your data, this can be done in R using icenReg (I'm the author). In fact, in a similar question about current status data, the OP put up a nice demo of using icenReg. He starts by showing that ignoring the censoring part and using logistic regression leads to bias (important note: he is referring to using logistic regression without adjusting for age. More on this later.)

Another great package is interval, which contains log-rank statistic tests, among other tools.

EDIT:

@EdM suggested using logistic regression to answer the problem. I was unfairly dismissive of this, saying that you would have to worry about the functional form of time. While I stand behind the statement that you should worry about the functional form of time, I realized that there was a very reasonable transformation that leads to a reasonable parametric estimator.

In particular, if we use log(time) as a covariate in our model with logistic regression, we end up with a proportional odds model with a log-logistic baseline.

To see this, first consider that the proportional odds regression model is defined as

$$\text{Odds}(t \mid X, \beta) = e^{X^T \beta} \, \text{Odds}_o(t)$$

where $\text{Odds}_o(t)$ is the baseline odds of survival at time $t$. Note that the regression effects are the same as with logistic regression. So all we need to do now is show that the baseline distribution is log-logistic.

Now consider a logistic regression with log(Time) as a covariate. We then have

$$P(Y = 1 \mid T = t) = \frac{\exp(\beta_0 + \beta_1 \log(t))}{1 + \exp(\beta_0 + \beta_1 \log(t))}$$

With a little work, you can see this as the CDF of a log-logistic model (with a non-linear transformation of the parameters).
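One way to spell out that step, as a sketch: assume $Y = 1$ codes "the event has occurred by time $t$" and that $\beta_1 > 0$. The symbols $a$ and $b$ are introduced here only for illustration and are not icenReg's parameter names.

$$
\frac{\exp(\beta_0 + \beta_1 \log t)}{1 + \exp(\beta_0 + \beta_1 \log t)}
= \frac{1}{1 + e^{-\beta_0} \, t^{-\beta_1}}
= \frac{1}{1 + (t/a)^{-b}},
\qquad b = \beta_1, \quad a = \exp(-\beta_0 / \beta_1).
$$

The last expression is the CDF of a log-logistic distribution with scale $a$ and shape $b$, since $(t/a)^{-b} = t^{-\beta_1} a^{\beta_1} = t^{-\beta_1} e^{-\beta_0}$. If the response is coded the other way round (1 = event has not yet occurred), the same algebra gives the survival function, with the signs of $\beta_0$ and $\beta_1$ flipped.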

R demonstration that the fits are equivalent:

> library(icenReg)
> data(miceData)
> 
> ## miceData contains current status data about presence 
> ## of tumors at sacrifice in two groups
> ## in interval censored format: 
> ## l = lower end of interval, u = upper end
> ## first three mice all left censored
> 
> head(miceData, 3)
  l   u grp
1 0 381  ce
2 0 477  ce
3 0 485  ce
> 
> ## To fit this with logistic regression, 
> ## we need to extract age at sacrifice
> ## if the observation is left censored, 
> ## this is the upper end of the interval
> ## if right censored, is the lower end of interval
> 
> age <- numeric()
> isLeftCensored <- miceData$l == 0
> age[isLeftCensored] <- miceData$u[isLeftCensored]
> age[!isLeftCensored] <- miceData$l[!isLeftCensored]
> 
> log_age <- log(age)
> resp <- !isLeftCensored
> 
> 
> ## Fitting logistic regression model
> logReg_fit <- glm(resp ~ log_age + grp, 
+                     data = miceData, family = binomial)
> 
> ## Fitting proportional odds regression model with log-logistic baseline
> ## interval censored model
> ic_fit <- ic_par(cbind(l,u) ~ grp, 
+            model = 'po', dist = 'loglogistic', data = miceData)
> 
> summary(logReg_fit)

Call:
glm(formula = resp ~ log_age + grp, family = binomial, data = miceData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1413  -0.8052   0.5712   0.8778   1.8767  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  18.3526     6.7149   2.733  0.00627 **
log_age      -2.7203     1.0414  -2.612  0.00900 **
grpge        -1.1721     0.4713  -2.487  0.01288 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 196.84  on 143  degrees of freedom
Residual deviance: 160.61  on 141  degrees of freedom
AIC: 166.61

Number of Fisher Scoring iterations: 5

> summary(ic_fit)

Model:  Proportional Odds
Baseline:  loglogistic 
Call: ic_par(formula = cbind(l, u) ~ grp, data = miceData, model = "po", 
    dist = "loglogistic")

          Estimate Exp(Est) Std.Error z-value        p
log_alpha    6.603 737.2000   0.07747  85.240 0.000000
log_beta     1.001   2.7200   0.38280   2.614 0.008943
grpge       -1.172   0.3097   0.47130  -2.487 0.012880

final llk =  -80.30575 
Iterations =  10 
> 
> ## Comparing loglikelihoods
> logReg_fit$deviance/(-2) - ic_fit$llk
[1] 2.643219e-12

Note that the effect of grp is the same in each model, and the final log-likelihood differs only by numeric error. The baseline parameters (i.e. intercept and log_age for logistic regression, alpha and beta for the interval censored model) are different parameterizations so they are not equal.

So there you have it: using logistic regression is equivalent to fitting the proportional odds model with a log-logistic baseline distribution. If you're okay with fitting this parametric model, logistic regression is quite reasonable. I do caution that with interval censored data, semi-parametric models are typically favored due to the difficulty of assessing model fit, but if I truly thought there was no place for fully-parametric models I would not have included them in icenReg.


This looks very helpful. I will have a look at the resources you point to and have a play with the icenReg package. I am trying to get my head around why logistic regression is less suitable - @EdM's suggestion looks on the surface as if it should work. Does the bias arise because the "event" - here, having offspring - might have an effect on survival? So, if it decreases survival, we would find that among individuals of a given age, those that have not reproduced will be over-represented?
user2390246

1
@user2390246: You could use logistic regression for current status data. But then you have to do a lot of work getting the functional form of age, and its interaction with other variables, correct. This is very much non-trivial. With survival based models, you can use a semi-parametric baseline (ic_sp in icenReg) and not worry at all about that. In addition, looking at the survival curves for the two groups answers your question correctly. Trying to recreate this from the logistic fit could be done, but again, it is much more work than using survival models.
Cliff AB

I agree with @CliffAB on this. I had a hesitation about recommending logistic regression specifically because of the difficulty of getting the right functional form for the dependency on age. I haven't had any experience with current status data analysis; not having to figure out that form of the dependency on age is a big advantage of that technique. I will keep my answer up nevertheless so that those who later examine this thread will understand how this played out.
EdM

It seems to me that your comment here is the crux of the matter. It would help if you could develop that in your answer. Eg, if you could use the OP's example data to build a LR model & an interval censored survival model, & show how the latter more easily answers the OP's research question.
gung - Reinstate Monica

1
@gung: actually, I've taken a softer stance about logistic regression. I edited my answer to reflect this.
Cliff AB

4

This is a case of censoring/coarse data. Assume you think that your data arises from a distribution with nicely behaved continuous (etc.) pdf $f(x)$ and cdf $F(x)$. The standard solution for time to event data when the exact time $x_i$ of an event for subject $i$ is known is that the likelihood contribution is $f(x_i)$. If we only know that the time was greater than $y_i$ (right-censoring), then the likelihood contribution is $1 - F(y_i)$ under the assumption of independent censoring. If we know that the time is less than $z_i$ (left-censoring), then the likelihood contribution is $F(z_i)$. Finally, if the time falls into some interval $(y_i, z_i]$, then the likelihood contribution would be $F(z_i) - F(y_i)$.
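These four cases can be sketched in a few lines of R. The function name `lik_contrib` and its arguments are hypothetical, and the unit-rate exponential distribution is used purely as an illustrative choice of $F$, not as a recommended model.

```r
# Likelihood contribution of one observation known to lie in (l, u].
# `cdf` and `pdf` are the assumed distribution's CDF and density functions.
lik_contrib <- function(l, u, cdf, pdf) {
  if (l == u) {
    return(pdf(l))      # exact event time: f(x)
  }
  # Covers all censored cases, because cdf(0) = 0 and cdf(Inf) = 1:
  #   right censored (l, Inf): 1 - F(l)
  #   left censored  (0, u):   F(u)
  #   interval       (l, u]:   F(u) - F(l)
  cdf(u) - cdf(l)
}

# Illustration with a unit-rate exponential distribution (an assumption).
exact_c    <- lik_contrib(1, 1,   pexp, dexp)
right_c    <- lik_contrib(1, Inf, pexp, dexp)
left_c     <- lik_contrib(0, 1,   pexp, dexp)
interval_c <- lik_contrib(1, 2,   pexp, dexp)
```

For current status data only the left- and right-censored cases occur, so the full likelihood is a product of $F(c_i)$ and $1 - F(c_i)$ terms.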


1
There's no need for $f(x)$ to be continuous. Or even well behaved. It could be a discrete survival model (so the pdf is undefined and a pmf is used instead) and the rest of what you said would be correct, with a slight adjustment (replace $F(y_i)$ with $F(y_i+)$).
Cliff AB

4

This problem seems like it might be handled well by logistic regression.

You have two states, A and B, and want to examine the probability of whether a particular individual has switched irreversibly from state A to state B. One fundamental predictor variable would be age at the time of observation. The other factor or factors of interest would be additional predictor variables.

Your logistic model would then use the actual observations of A/B state, age, and other factors to estimate the probability of being in state B as a function of those predictors. The age at which that probability passes 0.5 could be used as the estimate of the transition time, and you would then examine the influences of the other factor(s) on that predicted transition time.

Added in response to discussion:

As with any linear model, you need to ensure that your predictors are transformed in a way that they bear a linear relation to the outcome variable, in this case the log-odds of the probability of having moved to state B. That is not necessarily a trivial problem. The answer by @CliffAB shows how a log transformation of the age variable might be used.
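A rough sketch of this approach on the OP's sample dataset is given below, using log(age) as the predictor as discussed above. The names `d`, `in_B`, `fit` and `age50` are illustrative only, and whether log(age) is the right functional form for these data is an assumption that would need checking.

```r
# OP's sample dataset (50 rows), typed in the order posted.
age <- c(21, 20, 26, 33, 33, 24, 34, 22, 10, 20,
         44, 18, 11, 27, 31, 14, 41, 15, 33, 24,
         11, 28, 22, 16, 16, 24, 20, 18, 21, 16,
         24, 39, 13, 10, 18, 16, 21, 26, 11, 40,
          8, 41, 29, 53, 34, 34, 15, 40, 30, 40)
ssp <- strsplit(paste0("YYXXXXYYYY", "XYYXXYXYXX", "YXXYYYYXYY",
                       "YXYYYYXXYX", "YXXXXXYXXX"), "")[[1]]
offsp <- strsplit(paste0("ABBBABBBBA", "BABABBBABB", "AABABBBBBB",
                         "ABABAAABAB", "ABBBBBABAB"), "")[[1]]

# in_B is TRUE when the individual has at least one offspring (state B).
d <- data.frame(age = age, ssp = factor(ssp), in_B = offsp == "B")

# Logistic regression of state on log(age) and subspecies (reference level: X).
fit <- glm(in_B ~ log(age) + ssp, family = binomial, data = d)
b <- coef(fit)

# Age at which the fitted probability of being in B equals 0.5,
# i.e. where the linear predictor is zero, for each subspecies.
age50 <- c(
  X = exp(-b[["(Intercept)"]] / b[["log(age)"]]),
  Y = exp(-(b[["(Intercept)"]] + b[["sspY"]]) / b[["log(age)"]])
)
print(age50)
```

The 0.5 crossing is only a sensible summary of the transition age if the fitted coefficient on log(age) is positive, so that the probability of being in B increases with age.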

Licensed under cc by-sa 3.0 with attribution required.