### Monday, January 28, 2008

## 1800's first-class cricket in England: filling in the gaps

This is Part 3 in my series on first-class cricket in England in the 1800's.

1 - data

2 - classification of matches

3 - filling in the gaps

4 - bowlers

5 - batsmen

6 - bowlers across eras

7 - batsmen across eras

8 - all-rounders (across eras)

9 - wicket-keepers

In this post I detail a method of filling in all the gaps in those early scorecards. By doing so, we can get realistic estimates of bowling averages, despite only knowing about bowled dismissals and team totals. This will mostly be a geek interest post. Though the maths isn't technically hard (it's really just the four basic arithmetic operators), it does go on for a bit.

To begin, let's recall what the important gaps in the early scorecards are. First, bowlers were only credited with a wicket when they bowled a batsman: catches, LBWs, stumpings, and hit wickets were not counted in bowlers' wicket tallies. Second, the number of runs conceded by bowlers was not recorded.

To fill in these gaps, I took a set of scorecards (as old as possible, to try to match the characteristics of the earlier eras) which *do* contain the relevant information. For each card, I broke the dismissals down into three types:

A. bowled

B. other wicket credited to the bowler (catches, etc.)

C. wicket not credited to the bowler (run outs, etc.) or not-outs.

For each bowler who took 1 wicket bowled, I counted how many other wickets he took, out of the possible remaining (i.e., type B above). I did the same for each bowler who took 2 wickets bowled, 3 wickets bowled, and so on.

If you do this for all the scorecards in the sample and add up the corresponding numbers, you can get the probability that a batsman dismissed by a type B wicket was dismissed by a bowler who took 1 wicket bowled, or by a bowler who took 2 wickets bowled, etc.

Put another way: you can get the average fraction of type B wickets taken by a bowler who took 1 wicket bowled, or 2 wickets bowled, etc.
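A minimal Python sketch of that counting, under one reading of the description: for every bowler with k wickets bowled, add up the type B wickets he took and the type B wickets available in his innings, then divide the two sums. The `sample_innings` data and the function name are made up for illustration, and pooling the counts (rather than averaging per-innings ratios) is an assumption.

```python
from collections import defaultdict

# Hypothetical sample: one list per innings, one (bowled, other credited) pair per bowler.
sample_innings = [
    [(1, 1), (3, 2), (0, 1)],
    [(1, 0), (2, 3)],
    [(3, 1), (1, 1)],
]

def type_b_fractions(innings_list):
    """Return {wickets bowled: pooled share of the innings' type B wickets}."""
    taken = defaultdict(int)      # type B wickets taken by bowlers with k bowled
    available = defaultdict(int)  # type B wickets on offer in those bowlers' innings
    for innings in innings_list:
        total_b = sum(other for _, other in innings)
        for bowled, other in innings:
            if bowled == 0:
                continue  # bowlers without a bowled wicket are invisible in the old cards
            taken[bowled] += other
            available[bowled] += total_b
    return {k: taken[k] / available[k] for k in taken if available[k] > 0}

fractions = type_b_fractions(sample_innings)
```

In this toy sample, bowlers with 1 wicket bowled took 2 of the 9 type B wickets on offer in their innings, so their fraction is 2/9.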

The actual numbers (based on matches with the relevant data until part-way through 1863) are as follows:

| wkts bowled | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| frac other wkts | 0,300 | 0,363 | 0,417 | 0,432 | 0,423 | 0,461 | 0,525 |

(The last value here was adjusted by hand, based on later matches.) In this particular dataset, no player ever took 8 or more wickets bowled in an innings; I set the fractions for 8 and 9 wickets mildly arbitrarily at 0,5 (based on the equivalent numbers for later matches).

Now comes the estimate of the wicket tally. Suppose in a scorecard that Smith took 1 wicket bowled, and Jones took 3 wickets bowled. There are four catches with bowler unknown, and there was one run out.

There are four type B wickets, and Smith gets 4*0,300 = 1,200 of them, giving him 2,200 for the innings. Jones gets 4*0,417 = 1,668, giving him 4,668 for the innings.

Of course, that means that the total wickets don't add up to 10. If a bowler only took wickets caught, then he's going to be ignored by this analysis. This means that the estimated wicket tallies will be significantly lower than what they really were. But bowlers who didn't get any wickets bowled will also not have any runs conceded estimated for them, as we will see shortly. We will hope that, by ignoring both wickets and runs conceded in these situations, the bowling averages over a career will be largely unaffected.

(It is also possible, if three bowlers each took 3 wickets bowled for instance, that the estimated wicket tally for an innings could be greater than 10. This isn't a serious problem.)
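The wicket estimate can be sketched in a few lines of Python. This is only an illustration: the fractions are the tabulated ones (with decimal points instead of the commas used in the tables), and `estimate_wickets` and its arguments are names invented here.

```python
# Fraction of an innings' type B wickets, keyed by wickets bowled (from the table above;
# the values for 8 and 9 are the hand-set 0.5).
FRAC_OTHER = {1: 0.300, 2: 0.363, 3: 0.417, 4: 0.432, 5: 0.423,
              6: 0.461, 7: 0.525, 8: 0.5, 9: 0.5}

def estimate_wickets(bowled_by_bowler, type_b_wickets):
    """Estimated tally = wickets bowled + share of the unattributed type B wickets."""
    estimates = {}
    for name, bowled in bowled_by_bowler.items():
        if bowled == 0:
            continue  # no bowled wickets: the bowler is ignored, as described above
        # With 10 bowled there are no type B wickets left, so capping the key at 9 is harmless.
        share = FRAC_OTHER[min(bowled, 9)]
        estimates[name] = bowled + type_b_wickets * share
    return estimates

# The Smith and Jones innings: 1 and 3 wickets bowled, four catches with bowler unknown.
est = estimate_wickets({"Smith": 1, "Jones": 3}, type_b_wickets=4)
total_estimated = sum(est.values())
```

For this innings `total_estimated` comes out below the eight wickets that actually went to bowlers (four bowled plus four caught), which is the shortfall discussed above.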

To estimate the runs conceded by each bowler, I followed a similar procedure to that for type B wickets, finding the average fraction of runs (ignoring byes etc.) that bowlers who took 1 wicket conceded, bowlers who took 2 wickets conceded, and so on. The resulting table looks like this (the wickets now are total wickets, caught, bowled, the lot):

| wkts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| frac | 0,164 | 0,223 | 0,277 | 0,322 | 0,359 | 0,368 | 0,405 | 0,401 | 0,424 | 0,5 |

(The last entry in that table was adjusted by hand, based on the corresponding number for later matches.)

This tells us that, for instance, a bowler who took 4 wickets, on average, conceded 32,2% of the batting team's runs in an innings.

So, for each scorecard, we estimate the number of wickets taken by each bowler, and then use this tally and the second table to estimate the number of runs conceded (based on the batting team's score). We now have wickets and runs, so we can calculate an average!
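Here is a rough Python sketch of the runs step. The estimated wicket tallies are usually not whole numbers, and the text doesn't say how the table is read in that case; linear interpolation between neighbouring entries is purely an assumption of this sketch, as are the function names and the example score of 120.

```python
# Fraction of the batting team's runs conceded, by total wickets taken (index 0 = 1 wicket).
FRAC_RUNS = [0.164, 0.223, 0.277, 0.322, 0.359, 0.368, 0.405, 0.401, 0.424, 0.5]

def runs_fraction(wickets):
    """Look up the runs fraction, interpolating linearly for fractional tallies (assumption)."""
    w = min(max(wickets, 1.0), 10.0)  # clamp to the range covered by the table
    lower = int(w)
    if lower == 10:
        return FRAC_RUNS[-1]
    frac_lower = FRAC_RUNS[lower - 1]
    frac_upper = FRAC_RUNS[lower]
    return frac_lower + (w - lower) * (frac_upper - frac_lower)

def estimate_runs(wickets, team_runs):
    """Estimated runs conceded; team_runs should exclude byes etc."""
    return runs_fraction(wickets) * team_runs

def bowling_average(runs_by_innings, wickets_by_innings):
    """Average over several innings = total estimated runs / total estimated wickets."""
    return sum(runs_by_innings) / sum(wickets_by_innings)

# A bowler credited with 4 wickets in a (hypothetical) team score of 120.
runs_for_four = estimate_runs(4, 120)
```

A whole-number tally simply returns the table entry, so a bowler with 4 wickets is charged 32.2% of the 120 runs.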

But there's a rather large assumption in this model, and that is that the characteristics of wicket-taking and conceding runs don't change much. This is definitely not true in general: by taking a sample of matches from later, the fractions in the first table all decrease (suggesting that more bowlers were used in the latter part of the 19th century than in the 1850's). This could cause a systematic error in the estimates. To fudge my way around this, I take the overall bowling average (which we know from the team totals and the total number of wickets lost) and compare it to the overall estimated bowling average. The estimated bowling averages are scaled up or down according to the ratio of the overall average to its estimate. If that's not clear, I'll come to an example shortly.

Before we dive in and start estimating averages from 1812, it would be prudent to check whether the method actually works. I took a set of about 950 matches from 1888 to 1896 (well after the dataset I used to generate the fractions above), and pretended that I didn't have data on type B wickets or runs conceded. I did the estimates, and then compared the estimated averages with the actual averages, which can be calculated exactly (since there's no missing information).

When I did this (before implementing the fudge factor), there was a clear systematic error: the estimates of the averages were almost always lower than the real averages. According to the estimates, the overall average was 15,07. In reality it was 18,23. So I multiplied all of the estimated averages by 18,23/15,07 = 1,21.
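The fudge factor is a single ratio. A small Python sketch, with function names invented here and a made-up raw estimate of 12.0 for one bowler:

```python
def scale_factor(actual_overall_avg, estimated_overall_avg):
    """Ratio used to scale every estimated average up or down."""
    return actual_overall_avg / estimated_overall_avg

def adjust(estimated_avg, factor):
    """Apply the overall correction to one bowler's estimated average."""
    return estimated_avg * factor

factor = scale_factor(18.23, 15.07)   # the overall figures quoted above
adjusted = adjust(12.0, factor)       # 12.0 is a hypothetical raw estimate
```

The factor rounds to 1.21, so a raw estimate of 12.0 would be lifted to about 14.5.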

Here are the results, with players ordered by wickets taken (in real life). Note that these are not career figures; they are solely based on the sample of about 950 matches. The headings are **est**imated and **act**ual.

| name | mat | wkts est | wkts act | runs est | runs act | avg est | avg act | % error |
|---|---|---|---|---|---|---|---|---|
| J Briggs | 198 | 754,4 | 1172 | 9243,2 | 15930 | 15,16 | 13,59 | +11,5 |
| R Peel | 237 | 744,3 | 1158 | 9830,5 | 17281 | 16,34 | 14,92 | +9,5 |
| AW Mold | 166 | 1063,1 | 1107 | 10992,9 | 15884 | 12,79 | 14,35 | -10,9 |
| W Attewell | 214 | 786,4 | 1087 | 10714,9 | 15960 | 16,86 | 14,68 | +14,8 |
| GA Lohmann | 147 | 766,1 | 1011 | 8168,1 | 13227 | 13,19 | 13,08 | +0,8 |
| JT Hearne | 149 | 887,9 | 956 | 11805,6 | 14476 | 16,45 | 15,14 | +8,6 |
| F Martin | 199 | 773,4 | 950 | 10852,5 | 15128 | 17,36 | 15,92 | +9,0 |
| T Richardson | 97 | 727,2 | 765 | 8332,7 | 10647 | 14,17 | 13,92 | +1,8 |
| E Wainwright | 199 | 666,6 | 730 | 8538,3 | 11870 | 15,85 | 16,26 | -2,5 |
| SMJ Woods | 156 | 617,6 | 729 | 9084,8 | 13795 | 18,20 | 18,92 | -3,8 |
| WH Lockwood | 156 | 561,1 | 618 | 7358,1 | 10067 | 16,22 | 16,29 | -0,4 |
| JJ Ferris | 163 | 402,0 | 616 | 5895,8 | 11155 | 18,14 | 18,11 | +0,2 |
| CTB Turner | 90 | 539,2 | 585 | 5830,0 | 7607 | 13,38 | 13,00 | +2,9 |
| W Wright | 149 | 498,9 | 577 | 6835,2 | 10637 | 16,95 | 18,44 | -8,1 |
| EJ Tyler | 95 | 274,2 | 522 | 4534,0 | 9947 | 20,45 | 19,06 | +7,3 |
| JT Rawlin | 116 | 431,2 | 487 | 6345,1 | 8806 | 18,20 | 18,08 | +0,7 |
| FG Roberts | 127 | 372,5 | 458 | 6046,0 | 9627 | 20,08 | 21,02 | -4,5 |
| W Flowers | 179 | 336,7 | 447 | 4739,3 | 8006 | 17,41 | 17,91 | -2,8 |
| WA Humphreys | 127 | 313,1 | 445 | 6196,0 | 9148 | 24,48 | 20,56 | +19,1 |
| GH Hirst | 107 | 353,5 | 418 | 4685,3 | 7171 | 16,40 | 17,16 | -4,5 |
| A Hearne | 172 | 358,5 | 399 | 5587,0 | 7641 | 19,28 | 19,15 | +0,7 |
| FW Tate | 103 | 328,8 | 362 | 5409,6 | 7836 | 20,35 | 21,65 | -6,0 |
| FS Jackson | 122 | 288,2 | 359 | 4412,3 | 6571 | 18,94 | 18,30 | +3,5 |
| WG Grace | 220 | 206,8 | 358 | 4062,1 | 8022 | 24,29 | 22,41 | +8,4 |
| W Mead | 53 | 254,7 | 351 | 3705,0 | 5605 | 17,99 | 15,97 | +12,7 |
| FJ Shacklock | 100 | 306,4 | 349 | 4053,6 | 6615 | 16,37 | 18,95 | -13,6 |
| A Watson | 87 | 328,4 | 332 | 3663,2 | 4928 | 13,80 | 14,84 | -7,0 |
| JW Sharpe | 75 | 312,1 | 321 | 3647,3 | 4922 | 14,45 | 15,33 | -5,7 |
| AD Pougher | 72 | 223,5 | 312 | 3279,3 | 5260 | 18,15 | 16,86 | +7,7 |
| GA Davidson | 74 | 273,6 | 309 | 3793,4 | 5241 | 17,15 | 16,96 | +1,1 |

It's not spectacular, but it's pretty good considering the paucity of the data that went into the estimates. Of the top 30 wicket-takers in the sample, only 6 have estimates of the bowling average wrong by more than 10%. And while I've truncated the table at 30 entries here, the good estimates keep going for another 30-odd players. The first really wild estimate is for Stephen Whitehead, who took 121 wickets (in the dataset) at an actual average of 21,39, but at an estimated average of 14,95.
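The % error column appears to be the signed difference between the estimated and actual averages, relative to the actual one. A quick Python check on three rows (numbers copied from the table, with decimal points instead of commas); the helper names are invented here, and because the table's averages are already rounded, a recomputed value can differ from the printed one in the last digit for some rows.

```python
def pct_error(estimated_avg, actual_avg):
    """Signed percentage error of the estimate relative to the actual average."""
    return (estimated_avg - actual_avg) / actual_avg * 100

def within_tolerance(estimated_avg, actual_avg, tolerance_pct=10.0):
    """True when the estimate is no more than tolerance_pct away from the actual average."""
    return abs(pct_error(estimated_avg, actual_avg)) <= tolerance_pct

# (name, estimated average, actual average) for three of the tabulated bowlers.
rows = [("R Peel", 16.34, 14.92), ("AW Mold", 12.79, 14.35), ("GA Lohmann", 13.19, 13.08)]
errors = {name: round(pct_error(est, act), 1) for name, est, act in rows}
good = [name for name, est, act in rows if within_tolerance(est, act)]
```

Of these three, Peel and Lohmann land within 10% and Mold does not.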

It is unfortunate, though understandable, that three of those six entries with errors of over 10% belong to bowlers among the top four wicket-takers. The model used for the estimates was based on overall averages, and we would not expect the best bowlers to follow the same trends, in general.

I repeated this exercise for a similarly-sized dataset containing matches from between 1877 and 1888. The results were similar to those above — again 6 errors of more than 10% in the top 30 players, including the third- and fourth-highest wicket-takers. But further down the table the results are better, perhaps because the era in question is closer to that used to generate the parameters in the model. The first wild estimate was for a bowler who took only 71 wickets.

While I'm emphasising the uncertainties in the estimates for the top bowlers, the estimates are still pretty useful. Suppose that you knew that a modern-day Test bowler had an average between 17 and 23 (that is, 20 plus or minus 15%). He could be one of the greatest of all time or merely very good. But you know that he's at least very good, and he's not someone like Brett Lee, taking plenty of wickets (until recently), but at an average of 30.

Now we're almost ready to do the estimates for the first half of the 19th century!
