## 1800's first-class cricket in England: filling in the gaps

This is Part 3 in my series on first-class cricket in England in the 1800's.

1 - data
2 - classification of matches
3 - filling in the gaps
4 - bowlers
5 - batsmen
6 - bowlers across eras
7 - batsmen across eras
8 - all-rounders (across eras)
9 - wicket-keepers

In this post I detail a method of filling in all the gaps in those early scorecards. By doing so, we can get realistic estimates of bowling averages, despite only knowing about bowled dismissals and team totals. This will mostly be a geek interest post. Though the maths isn't technically hard (it's really just the four basic arithmetic operators), it does go on for a bit.

To begin, let's recall what the important gaps in the early scorecards are. First, bowlers were only credited with a wicket when they bowled a batsman; catches, LBWs, stumpings, and hit wickets were not counted in bowlers' wicket tallies. Second, the number of runs conceded by each bowler was not recorded.

To fill in these gaps, I took a set of scorecards (as old as possible, to try to match the characteristics of the earlier eras) which do contain the relevant information. For each card, I broke the dismissals down into three types:

A. bowled
B. other wicket credited to the bowler (catches, etc.)
C. wicket not credited to the bowler (run outs, etc.) or not-outs.

For each bowler who took 1 wicket bowled, I counted how many other wickets he took, out of the possible remaining (ie, type B above). Similarly for each bowler who took 2 wickets bowled, 3 wickets bowled, and so on.

If you do this for all the scorecards in the sample and add up the corresponding numbers, you can get the probability that a batsman dismissed by a type B wicket was dismissed by a bowler who took 1 wicket bowled, or by a bowler who took 2 wickets bowled, etc.

Put another way: you can get the average fraction of type B wickets taken by a bowler who took 1 wicket bowled, or 2 wickets bowled, etc.
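The tallying step can be sketched in a few lines of Python. This is only a minimal illustration, not the code I actually used: the two innings are invented toy data, the names `innings_sample` and `type_b_fractions` are made up for the example, and pooling the counts across innings (rather than averaging innings by innings) is an assumption about how the numbers are added up.

```python
from collections import defaultdict

# Toy innings, invented purely for illustration (not real scorecards).
# Each innings maps a bowler to (wickets bowled, other credited wickets).
innings_sample = [
    {"A": (1, 1), "B": (3, 2), "C": (0, 1)},
    {"A": (2, 1), "B": (1, 0), "C": (1, 2)},
]

def type_b_fractions(innings_list):
    """Share of type B wickets taken, keyed by number of wickets bowled."""
    taken = defaultdict(float)     # type B wickets taken by bowlers with k bowled
    possible = defaultdict(float)  # type B wickets that were available to them
    for innings in innings_list:
        total_b = sum(other for _, other in innings.values())
        for bowled, other in innings.values():
            if bowled == 0:
                continue  # no bowled wickets: invisible in the early scorecards
            taken[bowled] += other
            possible[bowled] += total_b
    return {k: taken[k] / possible[k] for k in possible if possible[k] > 0}

fractions = type_b_fractions(innings_sample)
for k in sorted(fractions):
    print(k, round(fractions[k], 3))
```

With a real sample you would feed in every innings that has full bowling details and read the fractions off the result.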

The actual numbers (based on matches with the relevant data until part-way through 1863) are as follows:
| wkts bowled | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| frac other wkts | 0,300 | 0,363 | 0,417 | 0,432 | 0,423 | 0,461 | 0,525 |

(The last value here was adjusted by hand, based on later matches.) In this particular dataset, no player ever took 8 or more wickets bowled in an innings; I set the fractions for 8 and 9 wickets somewhat arbitrarily at 0,5 (based on the equivalent numbers for later matches).

Now comes the estimate of the wicket tally. Suppose in a scorecard that Smith took 1 wicket bowled, and Jones took 3 wickets bowled. There are four catches with bowler unknown, and there was one run out.

There are four type B wickets. Using the fractions from the table above, Smith gets 4*0,300 = 1,200 of them, giving him 2,200 for the innings. Jones gets 4*0,417 = 1,668, giving him 4,668 for the innings.

Of course, that means that the total wickets don't add up to 10. If a bowler only took wickets caught, then he's going to be ignored by this analysis. This means that the estimated wicket tallies will be significantly lower than what they really were. But bowlers who didn't get any wickets bowled will also not have any runs conceded estimated for them, as we will see shortly. We will hope that, by ignoring both wickets and runs conceded in these situations, the bowling averages over a career will be largely unaffected.

(It is also possible, if three bowlers each took 3 wickets for instance, that the estimated wicket tally for an innings could be greater than 10. This isn't a serious problem.)
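Here is the wicket estimate as a small Python sketch, using the fractions from the first table (Python writes decimals with a point, so 0,300 becomes 0.300). The names `FRAC_OTHER` and `estimate_wickets` are made up for the example, and the lookup only covers 1 to 9 wickets bowled, which is all the table provides.

```python
# Fractions of type B wickets by number of wickets bowled (first table).
# The entries for 8 and 9 are the hand-set 0.5 values mentioned in the text.
FRAC_OTHER = {1: 0.300, 2: 0.363, 3: 0.417, 4: 0.432, 5: 0.423,
              6: 0.461, 7: 0.525, 8: 0.5, 9: 0.5}

def estimate_wickets(bowled_by_bowler, type_b_wickets):
    """Estimated wicket tally per bowler for one innings."""
    return {
        name: bowled + type_b_wickets * FRAC_OTHER[bowled]
        for name, bowled in bowled_by_bowler.items()
        if bowled > 0  # bowlers with no bowled wickets are ignored
    }

# The Smith and Jones innings: four type B wickets to share out.
tallies = estimate_wickets({"Smith": 1, "Jones": 3}, type_b_wickets=4)
for name, wickets in tallies.items():
    print(name, round(wickets, 3))

# Three bowlers with 3 bowled each and one type B wicket: the total exceeds 10.
crowded = estimate_wickets({"X": 3, "Y": 3, "Z": 3}, type_b_wickets=1)
print(round(sum(crowded.values()), 3))
```

The second call shows the over-counting case: each bowler is handed his own share of the single type B wicket without reference to the others, so the shares need not sum to one.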

To estimate the runs conceded by each bowler, I followed a similar procedure to that for type B wickets, finding the average fraction of runs (ignoring byes etc.) that bowlers who took 1 wicket conceded, bowlers who took 2 wickets conceded, and so on. The resulting table looks like this (the wickets now are total wickets, caught, bowled, the lot):
| wkts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| frac | 0,164 | 0,223 | 0,277 | 0,322 | 0,359 | 0,368 | 0,405 | 0,401 | 0,424 | 0,5 |

(The last entry in that table was adjusted by hand, based on the corresponding number for later matches.)

This tells us that, for instance, a bowler who took 4 wickets, on average, conceded 32,2% of the batting team's runs in an innings.

So, for each scorecard, we estimate the number of wickets taken by each bowler, and then use this tally and the second table to estimate the number of runs conceded (based on the batting team's score). We now have wickets and runs, so we can calculate an average!
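A rough Python sketch of the runs step follows. One detail needs an assumption: the estimated wicket tallies are not whole numbers, while the second table is indexed by whole wickets. The sketch interpolates linearly between neighbouring entries and clamps to the range 1 to 10; that choice, and the names `FRAC_RUNS`, `runs_fraction` and `estimate_runs`, are mine for the purposes of illustration. The `runs_off_bat` argument is assumed to be the team total with byes etc. already removed.

```python
# Fraction of the batting side's runs conceded, by total wickets (second table).
# Index 0 corresponds to 1 wicket, index 9 to 10 wickets.
FRAC_RUNS = [0.164, 0.223, 0.277, 0.322, 0.359,
             0.368, 0.405, 0.401, 0.424, 0.5]

def runs_fraction(wickets):
    """Share of runs conceded for a (possibly fractional) wicket tally.

    Linear interpolation between table entries is an assumption.
    """
    w = min(max(wickets, 1.0), 10.0)  # clamp to the range the table covers
    lower = int(w)                    # whole-wicket entry at or below w
    if lower == 10:
        return FRAC_RUNS[9]
    t = w - lower
    return FRAC_RUNS[lower - 1] * (1 - t) + FRAC_RUNS[lower] * t

def estimate_runs(wickets, runs_off_bat):
    """Estimated runs conceded by a bowler in one innings."""
    return runs_fraction(wickets) * runs_off_bat

# A bowler with exactly 4 wickets in an innings of 200 runs off the bat
# (the 200 is a made-up total).
print(round(estimate_runs(4, 200), 1))
# A fractional tally such as 2.2 wickets falls between the entries for 2 and 3.
print(round(runs_fraction(2.2), 4))
```

Rounding the tally to the nearest whole wicket would be another reasonable choice; the sketch just needs some rule for reading a fractional tally off the table.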

But there's a rather large assumption in this model, and that is that the characteristics of wicket-taking and conceding runs don't change much. This is definitely not true in general: by taking a sample of matches from later, the fractions in the first table all decrease (suggesting that more bowlers were used in the latter part of the 19th century than in the 1850's). This could cause a systematic error in the estimates. To fudge my way around this, I take the overall bowling average (which we know from the team totals and the total number of wickets lost) and compare it to the overall estimated bowling average. The estimated bowling averages are scaled up or down according to the ratio of the overall average to its estimate. If that's not clear, I'll come to an example shortly.

Before we dive in and start estimating averages from 1812, it would be prudent to check to see if the method actually works. I took a set of about 950 matches from 1888 to 1896 (well after the dataset I used to generate the fractions above), and pretended that I didn't have data on type B wickets or runs conceded. I do the estimates, and then compare the averages with the actual averages, which can be calculated exactly (since there's no missing information).

When I did this (before implementing the fudge factor), there was a clear systematic error: the estimates of the averages were almost always lower than the real averages. According to the estimates, the overall average was 15,07. In reality it was 18,23. So I multiplied all of the estimated averages by 18,23/15,07 = 1,21.
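The fudge factor is a single division, but for completeness here it is in Python. The function names are made up for the example, and the raw estimate of 12.0 passed to `adjust` is a hypothetical value, not a figure from the dataset.

```python
def scale_factor(actual_overall_avg, estimated_overall_avg):
    """Ratio used to rescale every estimated average."""
    return actual_overall_avg / estimated_overall_avg

def adjust(estimated_avg, factor):
    """Apply the overall correction to one bowler's estimated average."""
    return estimated_avg * factor

factor = scale_factor(18.23, 15.07)
print(round(factor, 2))

# A hypothetical bowler whose raw estimate is 12.0 runs per wicket.
print(round(adjust(12.0, factor), 2))
```

By construction, applying the factor to the overall estimated average of 15,07 returns the known overall average of 18,23.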

Here are the results, with players ordered by wickets taken (in real life). Note that these are not career figures — they are solely based on the sample of about 950 matches. The headings are estimated and actual.
| name | mat | wkts est | wkts act | runs est | runs act | avg est | avg act | % error |
|---|---|---|---|---|---|---|---|---|
| J Briggs | 198 | 754,4 | 1172 | 9243,2 | 15930 | 15,16 | 13,59 | +11,5 |
| R Peel | 237 | 744,3 | 1158 | 9830,5 | 17281 | 16,34 | 14,92 | +9,5 |
| AW Mold | 166 | 1063,1 | 1107 | 10992,9 | 15884 | 12,79 | 14,35 | -10,9 |
| W Attewell | 214 | 786,4 | 1087 | 10714,9 | 15960 | 16,86 | 14,68 | +14,8 |
| GA Lohmann | 147 | 766,1 | 1011 | 8168,1 | 13227 | 13,19 | 13,08 | +0,8 |
| JT Hearne | 149 | 887,9 | 956 | 11805,6 | 14476 | 16,45 | 15,14 | +8,6 |
| F Martin | 199 | 773,4 | 950 | 10852,5 | 15128 | 17,36 | 15,92 | +9,0 |
| T Richardson | 97 | 727,2 | 765 | 8332,7 | 10647 | 14,17 | 13,92 | +1,8 |
| E Wainwright | 199 | 666,6 | 730 | 8538,3 | 11870 | 15,85 | 16,26 | -2,5 |
| SMJ Woods | 156 | 617,6 | 729 | 9084,8 | 13795 | 18,20 | 18,92 | -3,8 |
| WH Lockwood | 156 | 561,1 | 618 | 7358,1 | 10067 | 16,22 | 16,29 | -0,4 |
| JJ Ferris | 163 | 402,0 | 616 | 5895,8 | 11155 | 18,14 | 18,11 | +0,2 |
| CTB Turner | 90 | 539,2 | 585 | 5830,0 | 7607 | 13,38 | 13,00 | +2,9 |
| W Wright | 149 | 498,9 | 577 | 6835,2 | 10637 | 16,95 | 18,44 | -8,1 |
| EJ Tyler | 95 | 274,2 | 522 | 4534,0 | 9947 | 20,45 | 19,06 | +7,3 |
| JT Rawlin | 116 | 431,2 | 487 | 6345,1 | 8806 | 18,20 | 18,08 | +0,7 |
| FG Roberts | 127 | 372,5 | 458 | 6046,0 | 9627 | 20,08 | 21,02 | -4,5 |
| W Flowers | 179 | 336,7 | 447 | 4739,3 | 8006 | 17,41 | 17,91 | -2,8 |
| WA Humphreys | 127 | 313,1 | 445 | 6196,0 | 9148 | 24,48 | 20,56 | +19,1 |
| GH Hirst | 107 | 353,5 | 418 | 4685,3 | 7171 | 16,40 | 17,16 | -4,5 |
| A Hearne | 172 | 358,5 | 399 | 5587,0 | 7641 | 19,28 | 19,15 | +0,7 |
| FW Tate | 103 | 328,8 | 362 | 5409,6 | 7836 | 20,35 | 21,65 | -6,0 |
| FS Jackson | 122 | 288,2 | 359 | 4412,3 | 6571 | 18,94 | 18,30 | +3,5 |
| WG Grace | 220 | 206,8 | 358 | 4062,1 | 8022 | 24,29 | 22,41 | +8,4 |
| W Mead | 53 | 254,7 | 351 | 3705,0 | 5605 | 17,99 | 15,97 | +12,7 |
| FJ Shacklock | 100 | 306,4 | 349 | 4053,6 | 6615 | 16,37 | 18,95 | -13,6 |
| A Watson | 87 | 328,4 | 332 | 3663,2 | 4928 | 13,80 | 14,84 | -7,0 |
| JW Sharpe | 75 | 312,1 | 321 | 3647,3 | 4922 | 14,45 | 15,33 | -5,7 |
| AD Pougher | 72 | 223,5 | 312 | 3279,3 | 5260 | 18,15 | 16,86 | +7,7 |
| GA Davidson | 74 | 273,6 | 309 | 3793,4 | 5241 | 17,15 | 16,96 | +1,1 |

It's not spectacular, but it's pretty good considering the paucity of the data that went into the estimates. Of the top 30 wicket-takers in the sample, only 6 have estimates of the bowling average wrong by more than 10%. And while I've truncated the table at 30 entries here, the good estimates keep going for another 30-odd players. The first really wild estimate is for Stephen Whitehead, who took 121 wickets (in the dataset) at an actual average of 21,39, but at an estimated average of 14,95.

It is unfortunate, though understandable, that three of those six entries with errors of over 10% belong to bowlers among the top four wicket-takers. The model used for the estimates was based on overall averages, and we would not expect the best bowlers to follow the same trends, in general.
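For anyone who wants to check the error counts, here is a short Python sketch that recomputes the percentage errors from the estimated and actual averages in the table of 30 bowlers. Because it starts from the rounded averages, the last digit can differ slightly from the % error column. The names `rows` and `pct_error` are made up for the example.

```python
# (name, estimated average, actual average), in table order (by actual wickets).
rows = [
    ("J Briggs", 15.16, 13.59), ("R Peel", 16.34, 14.92),
    ("AW Mold", 12.79, 14.35), ("W Attewell", 16.86, 14.68),
    ("GA Lohmann", 13.19, 13.08), ("JT Hearne", 16.45, 15.14),
    ("F Martin", 17.36, 15.92), ("T Richardson", 14.17, 13.92),
    ("E Wainwright", 15.85, 16.26), ("SMJ Woods", 18.20, 18.92),
    ("WH Lockwood", 16.22, 16.29), ("JJ Ferris", 18.14, 18.11),
    ("CTB Turner", 13.38, 13.00), ("W Wright", 16.95, 18.44),
    ("EJ Tyler", 20.45, 19.06), ("JT Rawlin", 18.20, 18.08),
    ("FG Roberts", 20.08, 21.02), ("W Flowers", 17.41, 17.91),
    ("WA Humphreys", 24.48, 20.56), ("GH Hirst", 16.40, 17.16),
    ("A Hearne", 19.28, 19.15), ("FW Tate", 20.35, 21.65),
    ("FS Jackson", 18.94, 18.30), ("WG Grace", 24.29, 22.41),
    ("W Mead", 17.99, 15.97), ("FJ Shacklock", 16.37, 18.95),
    ("A Watson", 13.80, 14.84), ("JW Sharpe", 14.45, 15.33),
    ("AD Pougher", 18.15, 16.86), ("GA Davidson", 17.15, 16.96),
]

def pct_error(estimated, actual):
    """Signed percentage error of the estimate relative to the actual average."""
    return (estimated - actual) / actual * 100

big_misses = [name for name, est, act in rows if abs(pct_error(est, act)) > 10]
top_four_misses = [name for name in big_misses if name in {r[0] for r in rows[:4]}]

print(len(big_misses), big_misses)
print(len(top_four_misses), top_four_misses)
```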

I repeated this exercise for a similarly-sized dataset containing matches from between 1877 and 1888. The results were similar to those above — again 6 errors of more than 10% in the top 30 players, including the third- and fourth-highest wicket-takers. But further down the table the results are better, perhaps because the era in question is closer to that used to generate the parameters in the model. The first wild estimate was for a bowler who took only 71 wickets.

While I'm emphasising the uncertainties in the estimates for the top bowlers, the estimates are still pretty useful. Suppose that you knew that a modern-day Test bowler had an average between 17 and 23 (that is, 20 plus or minus 15%). He could be one of the greatest of all-time or merely very good. But you know that he's at least very good, and he's not someone like Brett Lee, taking plenty of wickets (until recently), but at an average of 30.

Now we're almost ready to do the estimates for the first half of the 19th century!