Section 1: Introduction – The Problem of Binned Data
Hypothetically, say you’re given data like this in Table 1 below, and you’re asked to find the mean:
Group        | Frequency
0 to 25      | 114
25 to 50     | 76
50 to 75     | 58
75 to 100    | 51
100 to 250   | 140
250 to 500   | 107
500 to 1000  | 77
1000 to 5000 | 124
5000 or more | 42

Table 1: Example Binned Data. Border cases go to the lower bin.
The immediate problem is that the mean (and the variance, and many other statistics) is an average of exact values, but here we only have ranges of values. There are a few mean-like things that could be done:
1. Take the median instead of the mean, which is somewhere in the ‘100 to 250’ bin, but higher than most of the values in that bin. The answer is still a bin instead of a number, but you have something to work with.
2. Assign each bin a number, such that a '0 to 25' response would be 1, a '25 to 50' response would be 2, and so on up to 9. One could take the mean of the bin numbers and obtain an 'average' bin, in this case 4.93. This number doesn't have a clear translation to the values inside the bins.
3. Impute each binned value to the bin's midpoint and take the mean. Here, a '0 to 25' response would be 12.5, a '25 to 50' response would be 37.5, ... , a '1000 to 5000' response would be 3000. This poses two problems right away. First, unbounded bins like '5000 or more' have ambiguous midpoints, and the calculated mean is sensitive to the choice of imputed upper bound; taking the upper bound literally as infinity yields an infinite mean estimate. Second, midpoints are not realistic. Consider that the sizes of the bins increase approximately exponentially but the frequencies do not. That implies that, in the population as a whole, smaller values are more common than larger values; it's reasonable to assume the same trend holds within bins as well.
Better than all of these workarounds is to fit a continuous distribution to the data and derive our estimate of the mean (and any other parameter estimates) from the fitted distribution.
Section 2: Fitting A Distribution
We can select a distribution, such as an exponential, gamma, or log-normal, and find the probability mass that lands in each bin for that distribution. For example, if we were to fit an exponential with rate = 1/1000 (or mean = 1000), then
( 1 - exp(-.025)) of the mass would be in the '0 to 25' bin,
(exp(-.025) - exp(-.050)) would be in the '25 to 50' bin,
(exp(-1) - exp(-5)) would be in the '1000 to 5000' bin, and
exp(-5) would be in that final '5000 or more' bin. No arbitrary endpoint needed.
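As a quick check in R, those masses can be computed directly from the exponential CDF. This is just an illustrative sketch; the bin boundaries come from Table 1 and the object names are my own:

bounds = c(0, 25, 50, 75, 100, 250, 500, 1000, 5000)  # finite boundaries from Table 1
p_bins = diff(pexp(bounds, rate = 1/1000))            # mass in each bounded bin
p_tail = 1 - pexp(5000, rate = 1/1000)                # mass in '5000 or more', i.e. exp(-5)
round(c(p_bins, p_tail), 4)                           # the nine masses sum to 1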
We can then choose a parameter or set of parameters that minimizes some distance criterion between these probability weights and the observed relative frequencies. Possible criteria are the Kullback-Leibler (K-L) divergence, the negative log-likelihood, or the chi-squared statistic. Here I'll use the K-L divergence for demonstration.
The Kullback-Leibler divergence gives a weighted difference between two discrete distributions. It is calculated as \sum_i{ p_i * log(p_i / q_i) }, where p_i and q_i are the probabilities of distributions p and q at value i, respectively. (There is also a continuous version of the K-L divergence, calculated analogously with an integral, as shown in https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence )
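As a toy numerical example of that formula (the vectors p and q below are made up purely for illustration):

p = c(0.5, 0.3, 0.2)       # 'observed' discrete distribution
q = c(0.4, 0.4, 0.2)       # 'candidate' distribution we compare against
sum(p * log(p / q))        # K-L divergence of p from q, about 0.025; it is 0 only when p and q agree

The function below does exactly this, with q coming from a candidate exponential distribution.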
# K-L divergence between the observed bin probabilities and those of an
# exponential distribution with mean x (i.e. rate 1/x)
get.KL.div.exp = function(x, cutoffs, group_prob)
{
  exp_prob = diff(pexp(cutoffs, rate = 1/x))  # exponential probability mass in each bin
  KL = sum(group_prob * log(group_prob / exp_prob))
  return(KL)
}

result = optim(par = init_par_exp, fn = get.KL.div.exp, method = "Brent",
  lower = 1, upper = 10000, cutoffs = cutoffs, group_prob = group_prob)
Here, 'cutoffs' is a vector of the boundaries of the bins. For the open-ended largest bin, an arbitrarily large upper boundary was used (e.g. the boundary between the largest and second-largest bins times 1000). The vector group_prob holds the observed relative frequencies, and exp_prob holds the probabilities that the candidate exponential distribution places in each bin. The input value x is the mean parameter (so the rate is 1/x), and the returned value KL is the Kullback-Leibler divergence. The value init_par_exp is the initial guess at the mean, and it can be almost anything without disrupting optim().
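For concreteness, the inputs might be set up as follows. This is just a sketch built from the Table 1 counts; the variable names match the snippet above, and the initial guess is arbitrary:

freqs = c(114, 76, 58, 51, 140, 107, 77, 124, 42)                   # counts from Table 1
cutoffs = c(0, 25, 50, 75, 100, 250, 500, 1000, 5000, 5000 * 1000)  # last boundary is arbitrary and very large
group_prob = freqs / sum(freqs)                                     # observed relative frequencies
init_par_exp = 1000                                                 # rough initial guess at the mean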
In the example case of Table 1, the optim function returns an estimate of 1/743 for the rate parameter (or 743 for the 'mean' parameter), which translates to an estimate of 743 for the mean of the binned data. The fitted distribution had a KL-divergence of 0.4275 from the observed relative frequencies.
For comparison, the midpoint-based estimate of the mean was 1307, using an arbitrary largest-value bin of '5000 to 20000'.
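That midpoint-based figure can be reproduced in a couple of lines (reusing the freqs vector from the sketch above, with 12500 as the midpoint of the arbitrary '5000 to 20000' bin):

midpoints = c(12.5, 37.5, 62.5, 87.5, 175, 375, 750, 3000, 12500)
sum(midpoints * freqs) / sum(freqs)   # roughly 1307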
Section 3: Selecting a Distribution
We don’t yet know if the exponential distribution is the most appropriate for this data. Other viable possibilities include the gamma and the log-normal distributions.
We can fit those distributions in R with the same general approach of minimizing the K-L divergence, using the code snippets below. For the log-normal, we use
# K-L divergence between the observed bin probabilities and those of a
# log-normal distribution with parameters meanlog = x[1] and sdlog = x[2]
get.KL.div.lognorm = function(x, cutoffs, group_prob)
{
  lnorm_prob = diff(plnorm(cutoffs, meanlog = x[1], sdlog = x[2]))  # log-normal mass in each bin
  KL = sum(group_prob * log(group_prob / lnorm_prob))
  return(KL)
}

result_lnorm = optim(par = c(init_mean_lognorm, init_sd_lognorm), fn = get.KL.div.lognorm,
  method = "Nelder-Mead", cutoffs = cutoffs, group_prob = group_prob)
Here, x is the vector of the two log-normal parameters, meanlog (mu) and sdlog (sigma), and 'init_mean_lognorm' and 'init_sd_lognorm' are the initial parameter estimates, which we derived from the bin midpoints using a method-of-moments-style calculation. Also note that the optimization method has changed from the single-variable 'Brent' method to the multi-variable 'Nelder-Mead' method. The corresponding code snippet for the gamma distribution follows.
# K-L divergence between the observed bin probabilities and those of a
# gamma distribution with shape = x[1] and scale = x[2]
get.KL.div.gamma = function(x, cutoffs, group_prob)
{
  gamma_prob = diff(pgamma(cutoffs, shape = x[1], scale = x[2]))  # gamma mass in each bin
  KL = sum(group_prob * log(group_prob / gamma_prob))
  return(KL)
}

result_gamma = optim(par = c(init_shape_gamma, init_scale_gamma), fn = get.KL.div.gamma,
  method = "Nelder-Mead", cutoffs = cutoffs, group_prob = group_prob)
Here, x is the vector of the shape and scale parameters, respectively. As in the log-normal case, the initial parameter values are estimated by the method of moments applied to the midpoint-imputed data, as sketched below.
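As a sketch of how those starting values can be obtained from the midpoint-imputed data (this reproduces the initial parameters in Table 2 up to rounding, though the original bookkeeping may differ slightly; it reuses the freqs and midpoints vectors from the earlier sketches):

m = sum(midpoints * freqs) / sum(freqs)           # midpoint-imputed mean, about 1307
v = sum(midpoints^2 * freqs) / sum(freqs) - m^2   # midpoint-imputed (population) variance

init_shape_gamma = m^2 / v                        # gamma method of moments: about 0.211
init_scale_gamma = v / m                          # about 6196

init_mean_lognorm = log(m)                        # crude log-scale starting values: about 7.18
init_sd_lognorm = log(sqrt(v))                    # about 7.95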
The results of fitting all three distributions are shown in Table 2 below.
Distribution | Initial Parameters             | Final Parameters               | KL-Divergence | Est. of Mean
Exponential  | Mean = 1307                    | Mean = 743                     | 0.4275        | 743
Gamma        | Shape = 0.211, Scale = 6196    | Shape = 0.378, Scale = 2317    | 0.0818        | 876
Log-Normal   | meanlog = 7.175, sdlog = 7.954 | meanlog = 5.263, sdlog = 1.932 | 0.0134        | 1241

Table 2: Results of fitting the three candidate distributions by minimizing the K-L divergence.
The choice of distribution matters a great deal: each one yields a substantially different estimate of the mean. In this case, the low K-L divergence of the log-normal distribution makes a compelling case for it being the distribution we ultimately use.
Figure 1 further shows how well the log-normal suits this data. Note the log scale of the x-axis.
Figure 1
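For anyone who wants to recreate a similar picture, here is a rough sketch (not the code behind the original figure) comparing the observed and fitted bin probabilities on a log-scaled x-axis; it reuses cutoffs and group_prob from earlier and the rounded log-normal parameters from Table 2:

lnorm_prob = diff(plnorm(cutoffs, meanlog = 5.263, sdlog = 1.932))   # fitted mass in each bin
mids = (head(cutoffs, -1) + tail(cutoffs, -1)) / 2                   # plotting positions (bin midpoints)
plot(mids, group_prob, log = "x", type = "b", pch = 16,
     xlab = "Value (log scale)", ylab = "Bin probability")
lines(mids, lnorm_prob, type = "b", lty = 2)
legend("topright", legend = c("Observed", "Fitted log-normal"), pch = c(16, NA), lty = c(1, 2))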
Section 4: Uncertainty Estimates
We have point estimates of the population parameters, but nothing yet to describe their uncertainty. There are a few sources of variance we could consider, specifically:
- The uncertainty of the parameters given the observations,
- The sampling distribution that produced these observations.
Methods exist for finding confidence regions for sets of parameters such as the gamma's (see https://www.tandfonline.com/doi/abs/10.1080/02331880500309993?journalCode=gsta20 ), from which we could derive confidence intervals for the mean, but at this exploratory stage of developing the 'debinning' method, I'm going to set that source of uncertainty aside.
The uncertainty from the sampling distribution can be addressed with a Monte Carlo approach. For each of many iterations, generate a new sample by assigning observations to bins with probabilities proportional to the observed frequencies. Then use the debinning method described in the previous sections to estimate the parameters of the fitted distribution and the sample mean. Over many iterations, the parameter estimates form their own distribution, from which approximate confidence intervals and regions can be drawn.
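A minimal sketch of that resampling loop, assuming the log-normal fit and the objects defined earlier (get.KL.div.lognorm, result_lnorm, cutoffs, group_prob, freqs); this is illustrative rather than the exact code behind Table 3:

set.seed(1)
n = sum(freqs)
boot_pars = matrix(NA, nrow = 3000, ncol = 2)
boot_means = numeric(3000)

for (i in 1:3000) {
  # Resample bin memberships with probability proportional to observed frequencies
  # (with these counts, empty bins are essentially impossible; guard against them in general)
  new_counts = as.vector(rmultinom(1, size = n, prob = group_prob))
  new_prob = new_counts / n

  # Refit the log-normal to the resampled bin probabilities
  fit = optim(par = result_lnorm$par, fn = get.KL.div.lognorm,
    method = "Nelder-Mead", cutoffs = cutoffs, group_prob = new_prob)

  boot_pars[i, ] = fit$par
  boot_means[i] = exp(fit$par[1] + fit$par[2]^2 / 2)   # mean of a log-normal
}

quantile(boot_means, c(0.025, 0.975))   # approximate 95% interval for the estimated mean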
For the log-normal distribution applied to our example case in Table 1, we find the results shown in Table 3 and Figures 2 and 3, all after 3000 resamplings of the observed data with replacement.
We get some idea of the scale of the uncertainty. Since we haven't incorporated all the variation sources, these confidence intervals are optimistic, but probably not overly so.
Section 5: Discussion – The To-Do List
- Examine more distributions. There are plenty of other viable candidate distributions, and more can be made viable with transformations such as the logarithmic transform. There are often overlooked features of established distributions, such as their non-centrality parameters, that open up new possibilities as well. There are established distributions with four parameters like the generalized beta that could also be applied, provided they can be fit and optimized with computational efficiency.
- Address overfitting. This wide array of possible distributions brings about another problem: overfitting. The degrees of freedom available for fitting a distribution are only the number of boundaries between bins, or B – 1 if there are B bins. The example given here is an ideal one in which B = 9. In many other situations there are as few as 5 bins; in those cases, a four-parameter distribution should be able to fit the binned data perfectly, regardless of how appropriate it is for describing the underlying data.
- Estimation of values within categories. If we have a viable continuous distribution, then we have some information about the values within a bin: not their exact values, but a better idea than simply “it was between x and y”. For starters, we can impute the conditional mean within each category with a bounded integral (see the sketch after this list), and right away that is a better value to use than the midpoint whenever exact values are needed.
- Incorporation of covariates. We could go further still in estimating these individual values within bins by looking at their predicted bin probabilities, derived in turn from an ordinal logistic regression. For example, an observation that is predicted to be in a higher bin than the one actually observed is likely to have a true value near the top of its bin.
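As a sketch of the conditional-mean idea from the third item above, using R's numerical integration and the rounded log-normal parameters from Table 2 (the helper function and its name are my own):

# Conditional mean of a fitted log-normal within a bin (a, b]:
# the integral of x * f(x) over (a, b) divided by the bin's probability
bin_cond_mean = function(a, b, meanlog, sdlog) {
  num = integrate(function(x) x * dlnorm(x, meanlog, sdlog), lower = a, upper = b)$value
  den = plnorm(b, meanlog, sdlog) - plnorm(a, meanlog, sdlog)
  num / den
}

# Example: a better imputed value for the '1000 to 5000' bin than the midpoint 3000;
# it lands noticeably below 3000 for this right-skewed fit
bin_cond_mean(1000, 5000, meanlog = 5.263, sdlog = 1.932)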
In short, there’s a golden hoard’s worth of information available, if only we are able to slay the beast of binning* to get to it.
* Feel free to use ‘dragon of discretization’ in place of this.
Appendix: Full R Code Available Here