Heteroskedasticity

From Rasmapedia
Revision as of 18:35, 9 May 2021 by Rasmusen p1vaim (talk | contribs) (An Ultra-Small Sample Example)
Jump to navigation Jump to search

An Ultra-Small Sample Example

As I was suffering from food poisoning in Florida last week, I had many hours in between vomiting bouts to while away woozily. So I thought about heteroskedasticity. Suppose you are estimating the temperature of lake water using various thermometers, and the observations are 50, 52, and 64. We will assume the thermometers are all unbiased, so if x_i is observation i and T is the true temperature, E(x_i) = T. The obvious estimator is the sample mean, (50+52+64)/3 = 55.3 (about). This is also the best estimator if the thermometers are all equally good. But what if they are not? That is, what if the disturbances are heteroskedastic?

If we know the variances of the disturbances of each thermometer, it is well known how to proceed, though I forget exactly how at the moment. What you do is to weight each observation by its accuracy. Thus, the right way might be that if the variances of the thermometers are 60, 30, and 10, and the disturbances are normally distributed, you should use the estimator (50/60 + 52/30 + 64/10) /(1/60+ 1/30+ 1/10) = 59.8 (about). But usually you won't know the variances.

So you need to estimate the variances, which you *can* do, if you have at least 3 observations. Consider observation 1, which is 50. What is a good estimate of its variance? First, we need to estimate the population mean. We can do that from observations 2 and 3, which have a sample mean of (52+64)/2 = 58. Then we use our one data point left for estimating the variance, 50. to estimate the variance to be (58-50)^2/1 = 64. We can do the same for observation 2, which uses the sample mean of (50+64)/2 = 57 so the variance is (57-52)^2/1 = 25. Finally we do it for observation 3, which uses the sample mean of (50+52)/2 = 51 so the variance is (64 - 51)^2/1 = 169. Now we estimates of the variance, so we can do a weighted average, as before.

If we want to be fancy, we could iterate, using the weights in the 2-observation mean estimates too.

There is a problem here, though. The reason that with homoskedasticity the sample mean is the best estimator is that it makes the fullest use of the available information. It weights all three observations equally, making use of all of them. Another unbiased estimator is to use just the first two observations, and always throw out the third one, whatever it may be. That wastes information. Similarly, weighting is wasting, if the data is homoskedastic. So there is a cost to weighting.

Weighting also has a benefit, because it lets us use our best information most heavily. If we know exactly how good the information is, there is no downside to weighting. But we don't. We're estimating it. So when we weight, we are hoping our estimation error is small enough that it's worth messing with equal weighting. Is it small enough? That needs proving.

The proof shouldn't be too difficult. But it is too hard to do in my head. I need to look at the proof of the standard weighting scheme when you know the variances. That, I think, comes from doing a calculus minimization of mean squared error for an unbiased estimator.

There is another problem. How good my estimator is depends on the true state of the world. Suppose the data really *is* homoskedastic. Then my estimator will just make things worse. The same is true if the data is just slightly heteroskedastic but has high variances, because in a given sample, it may well happen that one observation is an outlier and I will weight it less heavily than the other two. So the quality of the estimator depends on the amount of heteroskedasticity, which we don't know in advance.

The only way to resolve that problem, I think, is to go Bayesian. If we have a distribution of possible variances (and, more important, of variances of variances), we can find out whether the estimator does well in expectation against that distribution. For example, we might assume that the variance of each observation is drawn randomly and independently from 0 to 100. Better, probably, would be to see how well the estimator does for a variety of true states of the world and then to eyeball it and decide whether it's worth using. I expect what I will do is use python to find two numerical examples for Monte Carlo studies, one where the heteroskedasticty correction helps, and one where it hurts.

Ultra-Small Samples as Low-Fat Modelling

The ultra-small sample I just used is an example of Low-Fat Modelling (or "No-Fat", or "MIT-Style", or, confusingly, "NewJersey Style modelling in computer science, since they use "MIT-Style" to mean the exact opposite). Low-Fat Modelling is the style of economic theory made popular by Joseph Stiglitz and perfected by Jean Tirole. It models an economic idea using the simplest, most specific, model possible, as opposed to the older style of trying to use the fanciest, most general model possible. Thus, instead of assuming there are $N$ goods with a continuum of possible prices, you assume there are 2 goods with 2 prices. This will remind you of international trade's 2x2x2 models (2 countries, 2 goods, 2 inputs), since they have long used Low-Fat Modelling. Sometimes you need at least 3 countries to demonstrate an effect (e.g. the Transfer Paradox), so you do that, but you don't go to N countries.

Thus, for a theorist like me it is natural when thinking about econometric theory to try to estimate a mean rather than a regression coefficient, and to try it with 3 observations rather than N. If we can understand it for a mean with three observations, we can understand it for a regression with M coefficients for a sample with N observations. I would have liked to reduce it to 2 observations here, but then you can't estimate the variance of an observation using the other observations' data.

A Large-Sample Example

Suppose we are trying to estimate mu and we observe x_i = mu + epsilon_i, where epsilon_i is distributed normally, but with a twist. Some percentage theta of the time, epsilon ~ N(0, sigma^2_1), and some percentage of the time (1-theta), epsilon ~ N(0, sigma^2_2), with sigma^2_1 < sigma^2_2. We do not observe mu, theta, sigma^2 _1, or sigma^2_2, and we would like to estimate them all. We do, however, have an infinite amount of data.

Here's what we can do. Start with the easiest parameter, mu. For mu, our estimate will simply be the sample mean, the average of the x_i. The sample mean is consistent even under heteroskedasticity, so in the limit we have an exact estimate of mu.

Note, too, that we can compute the sample variance and get an exact estimate of theta*sigma_1^2 + (1-theta)*sigma_2^2 from it with a bit of calculation that I won't try to work through right now. Alas, that gives us just one statistic for three parameters, but it will be useful later.

The hard task is to figure out sigma_1 and sigma_2. To do that, start by calculating the mean absolute deviation Z of an observation from mu, that is: Z = the limit as N goes to infinity of Sum_i^N |x_i- mu|/N. Now divide the sample in two parts, depending on whether x_i is less than distance W away from mu-hat (which is both the population mean and, with infinite data, the sample mean). Choose W so P percent of the observations in the Far Sample are more than W greater than mu-hat.

The Far Sample will mostly be observations with variance sigma_1^2, though it will be a mix, since sometimes observations with a small variance will turn out to be much bigger than mu anyway.

Now I think maybe I have to assume that someone has taken pity on us and told us that theta = .7. We aren't free and clear, because Z is still one statistic for two unknown parameters, sigma_1^2 and sigma_2^2, but that kind person still deserves our thanks. Now that we have theta and sigma_2^2, we can use the sample variance equation to compute sigma_1^2. And we're done. A consistent estimator of the variances, and we can figure out weights and standard errors.

The Asymptotics Puzzle

This is a good setting for thinking about how we choose estimators. We economists love asymptotic theory, but of course it doesn't actually apply unless you have an infinitely large sample, which we never do. So why do we like good asymptotic properties? I think it's because if the sample is large, any estimator with good asymptotic properties is pretty good. Is that true? Or are there practical consistent estimators that are horrible until you pass some very large threshold? I can think of a silly one like that. Suppose I estimate the mean by setting it equal to 42 if N < 900 and equal to the sample mean if N > 899. That is a consistent estimator, but it is a very bad for N < 900, unless you are lucky enough that mu = 42.

Do asymptotics tell us anything about our Very Small Sample Example? Yes, I think. The idea of my estimators in both examples is the same. If an observation takes a value far from the sample mean, you can guess that it probably has a high variance and should be downweighted, and you can estimate the variance using the information at hand.

Monte Carlo Studies

I should do some Monte Carlos on these two examples. For each, I should try to pick two cases, a Good Case and a Bad Case, to show when the estimator does well and when it does poorly. I should not try to do a range of cases because that would be too confusing, and it is hopeless to try to span the parameter space. Rather, what is useful is to show the sort of situation where an estimator is useful and the sort of situation where it is not. My Python programming ability is plenty good enough to do this.

Asymptotic Efficiency

It would be nice to figure out asymptotic efficiency, but I'd have to relearn my econometric theory for that. Perhaps I could find a co-author fresh from grad school for that.

Median Estimators and Loss Function

 Think hard about this. Do the math. One objeciton yu might make is that even without heteroskedasticity, the outlier of the three observations is most likely to be a big error. So why not drop it? The reason is that it is also th emost importnat IF it is correct.  And that is more important with squared error loss. In fact, I bet it is for any convex loss function. For Absolute deviation loss, I bet it evens out, and it doens' matter if you drop it. That's why that give su a MEDIAN estimator. For Concave loss functions, you DO want to drop it. 

But heteroskedasticity puts a thumb o th eblance, a thumb on the side of dropping the observation, or at least weighting it less

Shrinkage Estimators

  • I bet this is all related to shirnkage estimators too. Need to do some deep thinking here. A shrinkage estimator for James Stein dominance needs 3 different key parameters to be estimated (three "means", the direct effect of three variables, NOT three "nuisance parameters" like the variance and other higher moments). But there HAS to be a link.
  • I need to rewrite my Shrinkage paper. VAR is bad notation for the X-variables. Maybe DATA is better.
  • In James-Stein, I bet the real secret is this (and the Grand Mean estimator too). We have three variables with disturbances attached to their three different means. Suppose we have three numbers that add up to 180: 10, 20, and 150. The average is 60. We expand the little ones and shrink the big ones 50% towards the mean, to get 15, 30, and 105. In so doing we have added absolute amounts of 5 and 10 to the little numbers and subtracted 55 from the big number. There is some reason why that 55 helps us with mean squared error more than the 5 and 10 would hurt us.


  • "Understanding Shrinkage Estimators: From Zero to Oracle to James-Stein." The standard estimator of the population mean is the sample mean, which is unbiased. Constructing an estimator by shrinking the sample mean results in a biased estimator, with an expected value less than the population mean. On the other hand, shrinkage always reduces the estimator's variance and can reduce its mean squared error. This paper tries to explain how that works. I start with estimating a single mean using the zero estimator and the oracle estimator, and continue with the grand-mean estimator. Thus prepared, it is easier to understand the James-Stein estimator, in its simple form with known homogeneous variance and in extensions. The James-Stein estimator combines the oracle estimate's coefficient shrinking with the grand mean estimator's cancelling out of overestimates and underestimates. .