Difference between revisions of "Heteroskedasticity"

From Rasmapedia

Revision as of 14:32, 21 April 2021

An Ultra-Small Sample Example

As I was suffering from food poisoning in Florida last week, I had many hours in between vomiting bouts to while away woozily. So I thought about heteroskedasticity. Suppose you are estimating the temperature of lake water using various thermometers, and the observations are 50, 52, and 64. We will assume the thermometers are all unbiased, so if x_i is observation i and T is the true temperature, E(x_i) = T. The obvious estimator is the sample mean, (50+52+64)/3 = 55.3 (about). This is also the best estimator if the thermometers are all equally good. But what if they are not? That is, what if the disturbances are heteroskedastic?

If we know the variance of each thermometer's disturbance, it is well known how to proceed, though I forget the details at the moment. What you do is weight each observation by its precision, the inverse of its variance. Thus, if the variances of the thermometers are 60, 30, and 10, and the disturbances are normally distributed, I believe the right estimator is (50/60 + 52/30 + 64/10)/(1/60 + 1/30 + 1/10) = 59.8 (about). But usually you won't know the variances.
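The arithmetic above can be checked with a short Python sketch. The function name `inverse_variance_mean` is made up for illustration, and the sketch assumes the variances are known and positive.

```python
def inverse_variance_mean(observations, variances):
    """Weighted mean with weights 1/variance (variances assumed known and positive)."""
    weights = [1.0 / v for v in variances]
    weighted_sum = sum(w * x for w, x in zip(weights, observations))
    return weighted_sum / sum(weights)


# The thermometer readings and the illustrative variances from the text.
estimate = inverse_variance_mean([50, 52, 64], [60, 30, 10])
print(round(estimate, 1))  # 59.8
```

With equal variances the weights cancel and the function returns the ordinary sample mean.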

So you need to estimate the variances, which you *can* do if you have at least 3 observations. Consider observation 1, which is 50. What is a good estimate of its variance? First, we need to estimate the population mean. We can do that from observations 2 and 3, which have a sample mean of (52+64)/2 = 58. Then we use our one remaining data point, 50, to estimate the variance as (58-50)^2/1 = 64. We can do the same for observation 2, which uses the sample mean of (50+64)/2 = 57, so the variance is (57-52)^2/1 = 25. Finally we do it for observation 3, which uses the sample mean of (50+52)/2 = 51, so the variance is (64-51)^2/1 = 169. Now we have estimates of the variances, so we can do a weighted average, as before.
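Here is a rough Python sketch of the leave-one-out procedure just described. The function names are invented for illustration, and the sketch makes no attempt to handle the case where an observation exactly equals the mean of the others, which would give a zero variance estimate.

```python
def leave_one_out_variances(xs):
    """For each observation, squared distance from the mean of the other observations."""
    total = sum(xs)
    n = len(xs)
    variances = []
    for x in xs:
        mean_of_others = (total - x) / (n - 1)
        variances.append((x - mean_of_others) ** 2)
    return variances


def weighted_estimate(xs):
    """Inverse-variance weighted mean using the estimated variances."""
    variances = leave_one_out_variances(xs)
    weights = [1.0 / v for v in variances]  # fails if any estimate is exactly zero
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


data = [50, 52, 64]
print(leave_one_out_variances(data))      # [64.0, 25.0, 169.0]
print(round(weighted_estimate(data), 1))  # 52.6
```

The result lands near the middle observation, because 52 is closest to the mean of the other two and so gets the largest weight.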

If we want to be fancy, we could iterate, using the weights in the 2-observation mean estimates too.

There is a problem here, though. The reason that with homoskedasticity the sample mean is the best estimator is that it makes the fullest use of the available information. It weights all three observations equally, making use of all of them. Another unbiased estimator is to use just the first two observations, and always throw out the third one, whatever it may be. That wastes information. Similarly, weighting is wasting, if the data is homoskedastic. So there is a cost to weighting.

Weighting also has a benefit, because it lets us use our best information most heavily. If we know exactly how good the information is, there is no downside to weighting. But we don't. We're estimating it. So when we weight, we are hoping our estimation error is small enough that it's worth messing with equal weighting. Is it small enough? That needs proving.

The proof shouldn't be too difficult. But it is too hard to do in my head. I need to look at the proof of the standard weighting scheme when you know the variances. That, I think, comes from doing a calculus minimization of mean squared error for an unbiased estimator.
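For reference, the known-variance calculation can be sketched as a constrained minimization. This assumes independent disturbances with known variances sigma_i^2 and restricts attention to linear estimators; the weights must sum to one for unbiasedness, and for an unbiased estimator mean squared error equals variance. The symbols w_i and lambda are introduced here for illustration.

```latex
\begin{aligned}
&\min_{w_1,\dots,w_n}\ \operatorname{Var}\Big(\sum_i w_i x_i\Big) = \sum_i w_i^2 \sigma_i^2
\quad \text{subject to} \quad \sum_i w_i = 1, \\
&\mathcal{L} = \sum_i w_i^2 \sigma_i^2 - \lambda\Big(\sum_i w_i - 1\Big), \qquad
\frac{\partial \mathcal{L}}{\partial w_i} = 2 w_i \sigma_i^2 - \lambda = 0
\ \Rightarrow\ w_i = \frac{\lambda}{2\sigma_i^2}, \\
&\sum_i w_i = 1 \ \Rightarrow\ w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2}.
\end{aligned}
```

This only covers the case of known variances; it says nothing yet about what happens when the variances are themselves estimated.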

There is another problem, though. How good my estimator is depends on the true state of the world. Suppose the data really *is* homoskedastic. Then my estimator will just make things worse. The same is true if the data is just slightly heteroskedastic but has high variances, because in a given sample, it may well happen that one observation is an outlier and I will weight it less heavily than the other two. So the quality of the estimator depends on the amount of heteroskedasticity, which we don't know in advance.

The only way to resolve that problem, I think, is to go Bayesian. If we have a distribution of possible variances (and, more important, of variances of variances), we can find out whether the estimator does well in expectation against that distribution. For example, we might assume that the variance of each observation is drawn randomly and independently from 0 to 100. Better, probably, would be to see how well the estimator does for a variety of true states of the world and then to eyeball it and decide whether it's worth using.
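One way to try the "eyeball it" approach is a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: three observations, normal disturbances, a true temperature of 55 picked arbitrarily, and variances drawn uniformly and independently from 0 to 100. The function names are hypothetical.

```python
import random


def loo_weighted_mean(xs):
    """Weight each observation by 1 / (squared distance from the mean of the others)."""
    total, n = sum(xs), len(xs)
    weights = []
    for x in xs:
        var_hat = (x - (total - x) / (n - 1)) ** 2
        weights.append(1.0 / max(var_hat, 1e-12))  # guard against division by zero
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def compare_mse(trials=20000, true_t=55.0, max_var=100.0, seed=0):
    """Return (MSE of the sample mean, MSE of the leave-one-out weighted mean)."""
    rng = random.Random(seed)
    sq_err_mean = 0.0
    sq_err_weighted = 0.0
    for _ in range(trials):
        variances = [rng.uniform(0.0, max_var) for _ in range(3)]
        xs = [rng.gauss(true_t, v ** 0.5) for v in variances]
        sq_err_mean += (sum(xs) / 3 - true_t) ** 2
        sq_err_weighted += (loo_weighted_mean(xs) - true_t) ** 2
    return sq_err_mean / trials, sq_err_weighted / trials


mse_mean, mse_weighted = compare_mse()
print(mse_mean, mse_weighted)
```

Changing how the variances are drawn (for example, making them all equal) shows how the comparison moves with the true amount of heteroskedasticity.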

A Large-Sample Example

Suppose we are trying to estimate mu and we observe x_i = mu + epsilon_i, where epsilon_i is distributed normally, but with a twist. Some fraction theta of the time, epsilon_i ~ N(0, sigma^2_1), and the remaining fraction (1-theta) of the time, epsilon_i ~ N(0, sigma^2_2), with sigma^2_1 < sigma^2_2. We do not observe mu, theta, sigma^2_1, or sigma^2_2, and we would like to estimate them all. We do, however, have an infinite amount of data.

Here's what we can do. Start with the easiest parameter, mu. For mu, our estimate will simply be the sample mean, the average of the x_i. The sample mean is consistent even under heteroskedasticity, so in the limit we have an exact estimate of mu.

Note, too, that we can compute the sample variance and get an exact estimate of theta*sigma_1^2 + (1-theta)*sigma_2^2 from it with a bit of calculation that I won't try to work through right now. Alas, that gives us just one statistic for three parameters, but it will be useful later.
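A finite-sample simulation can stand in for the infinite data. The sketch below draws from the two-component model and compares the sample mean and variance with mu and theta*sigma_1^2 + (1-theta)*sigma_2^2. The parameter values (mu = 10, theta = 0.7, sigma_1 = 1, sigma_2 = 3) and the function names are assumptions chosen for illustration.

```python
import random


def draw_mixture(n, mu, theta, sigma_1, sigma_2, seed=0):
    """Draw n observations x_i = mu + epsilon_i from the two-variance model."""
    rng = random.Random(seed)
    xs = []
    for _ in range(n):
        sigma = sigma_1 if rng.random() < theta else sigma_2
        xs.append(rng.gauss(mu, sigma))
    return xs


def sample_variance(xs):
    """Mean squared deviation from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


mu, theta, sigma_1, sigma_2 = 10.0, 0.7, 1.0, 3.0
xs = draw_mixture(100_000, mu, theta, sigma_1, sigma_2)
print(sum(xs) / len(xs))                                # close to mu
print(sample_variance(xs))                              # close to the line below
print(theta * sigma_1 ** 2 + (1 - theta) * sigma_2 ** 2)
```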

The hard task is to figure out sigma_1 and sigma_2. To do that, start by calculating the mean absolute deviation Z of an observation from mu, that is: Z = the limit as N goes to infinity of Sum_{i=1}^N |x_i - mu|/N. Now divide the sample into two parts, a Near Sample and a Far Sample, depending on whether x_i is less than distance W away from mu-hat (the sample mean, which with infinite data equals the population mean mu). Choose W so that P percent of the observations fall in the Far Sample, that is, are more than W away from mu-hat.

The Far Sample will mostly be observations with variance sigma_2^2, the larger variance, though it will be a mix, since sometimes observations with a small variance will turn out to be far from mu anyway.
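To get a feel for how mixed the Far Sample is, one can simulate the model and peek at which component each far observation came from. Those labels are available only in a simulation, never in real data. The sketch assumes theta = 0.7, sigma_1 = 1, sigma_2 = 3, mu = 0, and a Far Sample holding the 10 percent of observations farthest from the sample mean; all of these are illustrative choices.

```python
import random


def far_sample_share(n=100_000, theta=0.7, sigma_1=1.0, sigma_2=3.0,
                     far_fraction=0.10, seed=0):
    """Return (W, share of the Far Sample drawn from the larger-variance component)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        high = rng.random() >= theta          # True means the larger variance
        sigma = sigma_2 if high else sigma_1
        draws.append((rng.gauss(0.0, sigma), high))
    mu_hat = sum(x for x, _ in draws) / n
    # Sort by distance from the sample mean; the top far_fraction is the Far Sample.
    deviations = sorted((abs(x - mu_hat), high) for x, high in draws)
    cutoff_index = int(n * (1 - far_fraction))
    w = deviations[cutoff_index][0]
    far = deviations[cutoff_index:]
    share_high = sum(1 for _, high in far if high) / len(far)
    return w, share_high


w, share_high = far_sample_share()
print(w, share_high)
```

With these particular values, nearly all of the Far Sample comes from the component with the larger variance, but a few small-variance observations do slip in.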

Now I think maybe I have to assume that someone has taken pity on us and told us that theta = .7. We aren't free and clear, because Z is still one statistic for two unknown parameters, sigma_1^2 and sigma_2^2, but that kind person still deserves our thanks. Now that we have theta and sigma_2^2, we can use the sample variance equation to compute sigma_1^2. And we're done. A consistent estimator of the variances, and we can figure out weights and standard errors.
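One possible way to pin down both variances once theta is known, offered as an assumption and not necessarily the route intended above, is to combine Z with the sample variance V. For a normal disturbance with standard deviation sigma, the mean absolute deviation is sigma*sqrt(2/pi), so Z = sqrt(2/pi)*(theta*sigma_1 + (1-theta)*sigma_2), while V = theta*sigma_1^2 + (1-theta)*sigma_2^2. That is two equations in two unknowns. The function name `solve_sigmas` is hypothetical.

```python
import math


def solve_sigmas(theta, sample_var, mean_abs_dev):
    """Solve for (sigma_1, sigma_2), with sigma_1 <= sigma_2, given theta, V, and Z."""
    # m is the theta-weighted average of the two standard deviations.
    m = mean_abs_dev / math.sqrt(2.0 / math.pi)
    # V - m^2 equals theta * (1 - theta) * (sigma_2 - sigma_1)^2; clamp for sampling noise.
    spread = math.sqrt(max(sample_var - m * m, 0.0))
    sigma_1 = m - math.sqrt((1.0 - theta) / theta) * spread
    sigma_2 = m + math.sqrt(theta / (1.0 - theta)) * spread
    return sigma_1, sigma_2


# Population values for theta = 0.7, sigma_1 = 1, sigma_2 = 3 (illustrative numbers).
theta = 0.7
v = theta * 1.0 ** 2 + (1 - theta) * 3.0 ** 2
z = math.sqrt(2.0 / math.pi) * (theta * 1.0 + (1 - theta) * 3.0)
print(solve_sigmas(theta, v, z))  # approximately (1.0, 3.0)
```

Squaring the results gives the two variances, from which weights and standard errors follow.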

The Asymptotic Puzzle

This is a good setting for thinking about how we choose estimators. We economists love asymptotic theory, but of course it doesn't actually apply unless you have an infinitely large sample, which we never do. So why do we like good asymptotic properties? I think it's because if the sample is large, any estimator with good asymptotic properties is pretty good. Is that true? Or are there practical consistent estimators that are horrible until you pass some very large threshold? I can think of a silly one like that. Suppose I estimate the mean by setting it equal to 42 if N < 900 and equal to the sample mean if N > 899. That is a consistent estimator, but it is a very bad one for N < 900, unless you are lucky enough that mu = 42.
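The silly estimator is easy to write down in Python. The function and parameter names are made up for illustration.

```python
def silly_estimator(xs, threshold=900, guess=42.0):
    """Return the fixed guess for small samples and the sample mean otherwise."""
    if len(xs) < threshold:
        return guess
    return sum(xs) / len(xs)


print(silly_estimator([50, 52, 64]))  # 42.0, ignoring the data entirely
print(silly_estimator([5.0] * 900))   # 5.0, the sample mean once N reaches 900
```

Nothing in the consistency property rules out moving the threshold from 900 to any larger finite number.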

Do asymptotics tell us anything about our Ultra-Small Sample Example? Yes, I think. The idea of my estimators in both examples is the same. If an observation takes a value far from the sample mean, you can guess that it probably has a high variance and should be downweighted, and you can estimate the variance using the information at hand.