My personal appraisal of his arguments:
- Here he talks about using p as evidence for the Null, whereas his thesis is that p can't be used as evidence against the Null. So, I think this argument is largely irrelevant.
- I think this is a misunderstanding. Fisherian p testing is strongly rooted in Popper's Critical Rationalism, which states that you cannot support a theory but only criticize it. In that sense there is only a single hypothesis (the Null), and you simply check whether your data are in accordance with it.
- I disagree here. It depends on the test statistic, but p is usually a monotone transformation of an effect size that speaks against the Null: the larger the effect, the smaller the p value, all other things being equal (see the first sketch after this list). Of course, this ordering is no longer valid when comparing p values across different data sets or hypotheses.
- I am not sure I completely understand this statement, but from what I can gather this is less a problem with p than with people using it wrongly. p was intended to have a long-run frequency interpretation, and that is a feature, not a bug (the second sketch after this list illustrates that reading). But you cannot blame p for people taking a single p value as proof of their hypothesis, or for people publishing only results with p < .05.
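
To illustrate the effect-size point, here is a minimal sketch; the one-sample z-test with known sigma, the sample size, and the effect sizes are my own choices for illustration, not from his text:

```python
# Minimal sketch (assumptions mine): a one-sample z-test with known sigma = 1
# and fixed n, showing that p is a monotone decreasing transformation of the
# observed effect size, all other things being equal.
from scipy.stats import norm

n = 30                                       # hypothetical sample size
for effect in [0.0, 0.1, 0.2, 0.3, 0.5]:     # hypothetical observed standardized effects
    z = effect * n ** 0.5                    # test statistic: sqrt(n) * (xbar - mu0) / sigma
    p = 2 * norm.sf(abs(z))                  # two-sided p value under H0: mu = mu0
    print(f"effect = {effect:.1f}  ->  p = {p:.4f}")
```

With n held fixed, the printed p values shrink as the observed effect grows, which is all the bullet above claims.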
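And a second minimal sketch for the long-run frequency reading; the simulation setup (normal data, n = 20, 10,000 replications) is again assumed by me:

```python
# Minimal sketch (assumptions mine): when the Null is true, p is uniformly
# distributed, so in the long run about 5% of p values fall below .05.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
n_reps, n = 10_000, 20
pvals = np.empty(n_reps)
for i in range(n_reps):
    x = rng.normal(loc=0.0, scale=1.0, size=n)      # data generated under H0: mu = 0
    pvals[i] = ttest_1samp(x, popmean=0.0).pvalue   # two-sided one-sample t-test
print("share of p < .05:", (pvals < 0.05).mean())   # close to 0.05
```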
His suggestion of using the likelihood ratio as a measure of evidence is in my opinion a good one (although the idea of a Bayes factor is more general), but the context in which he brings it up is a bit peculiar: First, he leaves the grounds of Fisherian testing, where there is no alternative hypothesis from which to calculate a likelihood ratio; yet p as evidence against the Null is Fisherian, so he conflates Fisher with Neyman-Pearson. Second, most test statistics we use are (functions of) the likelihood ratio, and in that case p is itself a transformation of the likelihood ratio. As Cosma Shalizi puts it:
> among all tests of a given size s, the one with the smallest miss probability, or highest power, has the form "say 'signal' if q(x)/p(x) > t(s), otherwise say 'noise'," and that the threshold t varies inversely with s. The quantity q(x)/p(x) is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if it is sufficiently more likely than noise.
Here $q(x)$ is the density under the state "signal" and $p(x)$ the density under the state "noise". The measure of "sufficiently likely" would here be $P\left(q(X)/p(X) > t_{\mathrm{obs}} \mid H_0\right)$, which is exactly p. Note that in correct Neyman-Pearson testing $t_{\mathrm{obs}}$ is replaced by a fixed threshold $t(s)$ such that $P\left(q(X)/p(X) > t(s) \mid H_0\right) = \alpha$.
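
To make this concrete, here is a minimal sketch with a toy simple-vs-simple setup I made up (noise $X \sim N(0,1)$, signal $X \sim N(1,1)$, a single observation): the likelihood ratio is monotone in x, so the p value computed from the likelihood ratio is just the usual tail probability of x under the Null, and the Neyman-Pearson rule merely fixes the cutoff $t(s)$ in advance so that this Null tail probability equals $\alpha$.

```python
# Minimal sketch (toy setup, assumptions mine): noise X ~ N(0, 1), signal X ~ N(1, 1),
# a single observation. The likelihood ratio q(x)/p(x) = exp(x - 1/2) is monotone
# in x, so thresholding the likelihood ratio is the same as thresholding x itself.
from scipy.stats import norm

def likelihood_ratio(x):
    return norm.pdf(x, loc=1.0) / norm.pdf(x, loc=0.0)   # q(x) / p(x)

x_obs = 1.8                                # hypothetical observed value
lr_obs = likelihood_ratio(x_obs)

# Fisherian reading: p = P(q(X)/p(X) > lr_obs | H0) = P(X > x_obs | H0)
p_value = norm.sf(x_obs)                   # tail probability under the noise model

# Neyman-Pearson reading: fix alpha first, then derive the cutoff t(alpha)
alpha = 0.05
x_cut = norm.isf(alpha)                    # P(X > x_cut | H0) = alpha
t_alpha = likelihood_ratio(x_cut)          # corresponding likelihood-ratio threshold t(s)

print(f"p value         = {p_value:.4f}")
print(f"reject by p     : {p_value < alpha}")
print(f"reject by NP LR : {lr_obs > t_alpha}")   # identical decision
```

In this monotone-likelihood-ratio toy case the two rules reject in exactly the same situations; the difference is only whether the threshold is fixed in advance (Neyman-Pearson) or the tail probability of the observed statistic is reported (Fisher).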