Comments on Pappus' plane - cricket stats: The co-efficient of variation

2010-07-26T07:01:16.566+02:00

Gabe, I don't think there is any point creating this sort of statistic unless you DO talk about what it means to be consistent. As an example, which of these strings of scores would you consider more consistent:
5,10,20,40,80,145
10,15,25,45,50,155
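
To put numbers on the question, here is a minimal sketch that computes the co-efficient of variation (SD divided by mean) for the two strings above. The helper name `cov` and the choice of the sample (n-1) SD are assumptions made for illustration, not something specified in the original post. Both strings sum to 300, so they share an average of 50 and any difference comes purely from the spread.

```python
import math

def cov(scores, ddof=1):
    """Co-efficient of variation: SD / mean.

    ddof=1 gives the sample (n-1) SD, ddof=0 the population (n) SD.
    The function name and the default are illustrative assumptions.
    """
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / (n - ddof)
    return math.sqrt(variance) / mean

first = [5, 10, 20, 40, 80, 145]
second = [10, 15, 25, 45, 50, 155]

for name, scores in (("first", first), ("second", second)):
    print(name, "mean:", sum(scores) / len(scores), "CoV:", round(cov(scores), 4))
```

Under these assumptions the two values come out very close together, which suggests the statistic barely separates the two strings.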

The weakness of your approach is that it penalizes very large scores significantly more than low ones. A triple-hundred is worth the same as 5 ducks, which my intuition says is wrong. The players it throws up as consistent are often in the "don't play big innings" category (Mark Waugh), more so than actually being consistent (David Gower).

Without wanting to overly disparage ItF, 99% of the stuff on there is a gross waste of a statistics database that could be shedding light on real issues in cricket, instead of endlessly rating players. Give your readers a little credit. A non-parametric method will almost certainly produce a "better" (and certainly different) result.

As I stated, my preference is to use the ratio of 2^(log average) and the average. It gives a normally distributed data set that converges on 0.5 as the number of innings played increases, is simple, and produces results that are intuitively correct. As I mentioned to David, I have a post on this in the works, but not the time recently to put it together.

2010-07-25T20:39:53.164+02:00

Well, on the one hand, oooh, but, on the other hand, meh.

I see (and have replicated the results of) your argument. As you suggest, my expectation would have been that the CoV of any sample from an exponential distribution should be 1, but it clearly isn't. By wearing out my F9 button, I reckon the expected CoV is very close to 1/2^(1/n) if you use a sample (n-1) SD, and it's very close to 1-(1/n) if you use a population (n) SD. But it isn't quite that on either count, and you'd need to write a page of proper maths to work out the real relationship. For reasons of time and competence, I'm going to have to leave that maths to someone else...
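
For anyone who would rather not wear out F9, the experiment described above can be sketched in a few lines of Python. This is only a rough simulation under stated assumptions: scores are drawn from an exponential distribution with mean 1 (the CoV does not depend on the scale), and the function names, trial count and seed are all arbitrary choices for illustration. It prints the simulated average CoV next to the two guesses, 1/2^(1/n) and 1-(1/n), without claiming that either is exact.

```python
import math
import random

def cov(scores, ddof=1):
    """SD / mean, with ddof=1 for the sample SD and ddof=0 for the population SD."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / (n - ddof)
    return math.sqrt(variance) / mean

def mean_cov(n, trials=5000, ddof=1, seed=0):
    """Average CoV over many simulated careers of n exponential scores.

    trials and seed are arbitrary illustrative defaults.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        total += cov(sample, ddof)
    return total / trials

for n in (2, 5, 10, 50):
    sample_sd = mean_cov(n, ddof=1)
    population_sd = mean_cov(n, ddof=0)
    print(
        f"n={n:3d}  sample-SD CoV={sample_sd:.3f} (guess {1 / 2 ** (1 / n):.3f})"
        f"  population-SD CoV={population_sd:.3f} (guess {1 - 1 / n:.3f})"
    )
```

Because the same seed is reused, the sample-SD and population-SD columns are computed from identical simulated careers, so the gap between them reflects only the choice of divisor.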

What I can do is to see how it works in the dataset of interest. And the answer is: not especially clearly. I plotted CoV against Inns in tests, and stuck it here. There's certainly the beginnings of the shape your analysis would lead us to expect in the bottom-left, but it's easily counterbalanced by a mass of inconsistent individuals with not many innings, with the net result that there's no discernible trend (my regression software is on a different computer from this one, but I can't believe it'd turn anything up, from the shape of the plot).

Of course, you're on a fairly sticky wicket the minute you start relying on means and SDs of nonsymmetrical distributions, anyway, and that may be at the root of the oddness, here. I do have some nonparametric methods for looking at things like this, but they are way beyond that which would be welcomed by readers of ItFigures.

Russ makes a very helpful point in suggesting that we're dealing with biased samples, here: if every cricketer played the same number of matches, regardless of how good he was, we might have something very different, and I guess it's possible that the mass of dots in the top-left represents a selected sample of players who were dropped before their records normalised (and, indeed, may very well have been dropped for *being* inconsistent - or, at least, dropped for a bunch of empirical behaviours that might cause or be caused by a high CoV).

I don't agree with Russ's suggestion of taking logs at all, though. I don't see any theoretical grounds for doing so, and it doesn't make any cricketing sense to me, either. The whole point of scaling SD by mean, to provide CoV, is to ensure that batsmen with more high scores don't look disproportionately inconsistent. We can't go very far down this road without getting into a philosophical debate about what we mean by "consistent", but I took the view at the outset that a batsman who goes 0, 25, 100 is exactly as consistent as one who goes 0, 50, 200, and I don't see any reason to depart from that view, yet.
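
The scaling property behind that view is easy to check. The sketch below, with a hypothetical helper `cov` using the sample (n-1) SD, confirms that doubling every score leaves the CoV unchanged, because the SD and the mean both double.

```python
import math

def cov(scores, ddof=1):
    """Co-efficient of variation (SD / mean); name and default are illustrative."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / (n - ddof)
    return math.sqrt(variance) / mean

batsman_a = [0, 25, 100]
batsman_b = [0, 50, 200]  # every score of batsman_a doubled

print("CoV A:", cov(batsman_a))
print("CoV B:", cov(batsman_b))
print("Same CoV:", math.isclose(cov(batsman_a), cov(batsman_b)))
```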

2010-07-20T05:15:45.059+02:00

DB, I don't actually think it is okay at all, but although I gathered the data, I haven't had a chance to post on it. In essence, large scores have a disproportionate effect on the perceived inconsistency because they are much further from the mean. Gabriel should really calculate the CoV of the log of each score, although my preference is to use career form (as we've discussed before), as it provides an intuitive, normally distributed ratio of consistency.

I'm going to guess that short careers have even lower CoV than expected. The lower average is probably a result of not scoring a big score (which may well be just luck), which means short careers are probably very consistent (albeit in a bad way). Your simulation (I assume) randomly assigns scores, whereas in real life, a cricketer who scores a hundred in their first few knocks will generally not get dropped.