A standard counterintuitive result in statistics is that if the
true model is logit, then it is okay to use a sample selected on the
Y's, which is what the "case-control method" amounts to. You may select
1000 observations with Y=1 and 1000 observations with Y=0 and do
estimation of the effects of every variable but the constant in the
usual way, without any sort of weighting. This was shown in Prentice &
Pyke (1979). They also purport to show that the standard errors may be
computed in the usual way--- that is, using the curvature (2nd
derivative) of the likelihood function.
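Here is a minimal sketch of the coefficient half of the claim, in Python with numpy and statsmodels. The population parameters (alpha = -4, beta = 1) and the seed are my own illustrative choices, with beta nonzero so the slope's consistency is visible:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)

    # Large population in which Y=1 is rare, as in a case-control setting.
    N = 200_000
    x = rng.uniform(0, 1, N)
    y = (-4 + 1.0 * x + rng.logistic(size=N) >= 0).astype(int)

    # Case-control sample: 1,000 of each outcome, selected on Y alone.
    cases = rng.choice(np.flatnonzero(y == 1), 1000, replace=False)
    controls = rng.choice(np.flatnonzero(y == 0), 1000, replace=False)
    idx = np.concatenate([cases, controls])

    # Ordinary unweighted logit on the pooled sample.
    fit = sm.Logit(y[idx], sm.add_constant(x[idx])).fit(disp=0)
    print(fit.params)  # intercept far from -4; slope roughly 1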
This, I was skeptical of. If the
constant is misestimated, how can you deduce the variance of the
disturbance term, and if you can't deduce that, how can you deduce the
standard error of any of the coefficients? Nowhere have I seen a clear
demonstration or an intuition for the result, so I thought there might
be a crucial unnoticed mistake in the math somewhere, as is not unknown
in famous papers (e.g., Hotelling on location, Tullock on
overdissipation, Viner on average cost curves, and Rothschild-Stiglitz
on risk).
Since I did not follow all the steps of the Prentice-Pyke proof and so
did not know of any error in what they did, I tried doing a Monte Carlo
study which seemed to confirm my intuition.
Since then, however, I have seen where my Monte Carlo study went wrong,
and now I believe Prentice and Pyke. Some details are instructive.
1. An intuition-- a bit shaky, I think, but better than nothing (let me
know if it's false). Suppose that a coefficient is estimated
correctly by some estimator. We want to estimate the estimator's
standard error, to know how variable the estimate would be if we
repeated the estimation with different disturbances. For this, we need
to know how noisy the data is. We do not need to know how noisy the data
in the whole population is, however, only how noisy it is in the kind of
sample we draw. If our procedure is to draw a biased sample, then we
need to know what will happen in other biased samples, not in the
population. It is okay to use the sample for this purpose. In using a
standard error, we are not generalizing anything to the population (not
estimating goodness of fit, for example); we are just generalizing to
repeated samples.
2. How to think about repeated sampling and how to do a Monte Carlo
study. What I did was to construct a population of 60,000 data points,
drawing X from a uniform distribution on [0,1] and a disturbance epsilon
from a logistic density, with a "constant" coefficient alpha of -4 and an
X coefficient beta of 0. If alpha + beta*X + epsilon < 0 then Y=0; if
alpha + beta*X + epsilon >= 0 then Y=1. That yields 1,039 points with
Y=1, about 1.7% of the population.
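A sketch of that construction, in Python with numpy. The seed is arbitrary, so the Y=1 count will wander a bit around the 1,039 reported above:

    import numpy as np

    rng = np.random.default_rng(1)

    N = 60_000
    x = rng.uniform(0, 1, N)                     # X uniform on [0,1]
    eps = rng.logistic(size=N)                   # logistic disturbance
    y = (-4 + 0.0 * x + eps >= 0).astype(int)    # alpha = -4, beta = 0

    print(y.sum(), y.mean())  # on the order of 1,000 points with Y=1, roughly 1.7-1.8%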
Our estimation procedure is to combine two random samples of 1,000
observations with Y=0 and 1,000 observations with Y=1 and do a logit
estimate of alpha and beta. We would expect the estimate of alpha to
be wrong-- not close to the true value of -4-- and the estimate of beta
to be right-- close to 0.000-- since we have a large enough sample that
consistent estimates ought to be close to the true parameters.
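A sketch of the estimation step, again with numpy and statsmodels. The block regenerates the population so it runs on its own, and it draws the cases with replacement, since the Y=1 pool is only barely larger than 1,000:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    N = 60_000
    x = rng.uniform(0, 1, N)
    y = (-4 + rng.logistic(size=N) >= 0).astype(int)   # beta = 0

    # 1,000 Y=1's (with replacement: the pool is barely larger),
    # pooled with 1,000 Y=0's.
    cases = rng.choice(np.flatnonzero(y == 1), 1000, replace=True)
    controls = rng.choice(np.flatnonzero(y == 0), 1000, replace=False)
    idx = np.concatenate([cases, controls])

    fit = sm.Logit(y[idx], sm.add_constant(x[idx])).fit(disp=0)
    print(fit.params)  # alpha-hat far from -4, beta-hat near 0

As I understand the standard result, with equal numbers of cases and controls the intercept estimate converges to alpha plus the log of the ratio of the sampling rates, here roughly -4 + log(58,961/1,039), which is about 0-- far from the true -4, as expected.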
Maximum likelihood estimation would give us standard errors based on
the second derivative of the likelihood function or on bootstrapping.
In repeated sampling, we would expect the standard deviation of the
alpha estimates not to be close to the average of the estimates of its
standard error. The question to be investigated is whether the
standard deviation of the beta estimates is close to the average of
the estimates of its standard error.
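A sketch of both routes to a standard error for one case-control sample: the curvature-based SEs that the logit fit reports, and a bootstrap that resamples within the case and control groups separately, to respect the fixed 1,000/1,000 design. The seed and the 200 bootstrap replications are arbitrary choices of mine:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    N = 60_000
    x = rng.uniform(0, 1, N)
    y = (-4 + rng.logistic(size=N) >= 0).astype(int)

    cases = rng.choice(np.flatnonzero(y == 1), 1000, replace=True)
    controls = rng.choice(np.flatnonzero(y == 0), 1000, replace=False)
    idx = np.concatenate([cases, controls])   # cases first, then controls
    xs, ys = x[idx], y[idx]

    fit = sm.Logit(ys, sm.add_constant(xs)).fit(disp=0)
    print("curvature-based SEs:", fit.bse)

    # Stratified bootstrap: resample cases and controls separately.
    boot = []
    for _ in range(200):
        bc = rng.choice(1000, 1000, replace=True)          # case rows
        bk = 1000 + rng.choice(1000, 1000, replace=True)   # control rows
        b = np.concatenate([bc, bk])
        boot.append(sm.Logit(ys[b], sm.add_constant(xs[b])).fit(disp=0).params)
    print("bootstrap SEs:", np.array(boot).std(axis=0))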
So far, so good. Where I made my mistake, I think, is in the
definition of "repeated sampling". Ordinarily in frequentist thinking,
in repeated sampling we keep the X values the same in each sample, and
we draw new disturbances, which combine with the fixed X's to give new
Y's. That also amounts to conditioning on the X's, though we wouldn't
have had to condition on the X's, since our estimator should work fine
even if we changed the X's in each sample too. (If we did change the
X's, though, that would change the information content of each sample---
a sample in which X only varied between .3 and .4 would have less
information and yield worse estimates than one with X varying widely
between .02 and .94. So in small samples, especially, we'd have to make
some allowance for that.)
Here, though, we can't keep the X's fixed. If we did, then although our
first sample would have 1,000 observations with Y=1, our succeeding
samples would have only about 34 (2,000 points times the population rate
of about 1.7%). We wouldn't be using the case-control method.
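A quick check of that count: since beta is zero the X's drop out, and redrawing the 2,000 sample points' disturbances is all that matters (a sketch, with an arbitrary seed):

    import numpy as np

    rng = np.random.default_rng(3)
    new_eps = rng.logistic(size=2000)     # fresh disturbances for the sample
    print((-4 + new_eps >= 0).sum())      # roughly 35 Y=1's survive, not 1,000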
So what we have to do is to think about repeated samples with 1,000 Y=0
observations and 1,000 Y=1's. Turning our usual thinking upside down,
we need to keep the Y's fixed, draw new disturbances, and let the X's
vary. This is especially hard to think about here, because knowing Y
and epsilon does not tell us X-- remember, Y is coarse and contains less
information than alpha + beta*X + epsilon, and beta is zero here too,
making things even worse.
The best way to proceed is to think about repeating the entire
scientific procedure, including the sampling as well as the estimation.
The way I did this was to take 100 n=2000 samples from the 60,000-point
population, each time combining equal-sized subsamples with Y=0 and
with Y=1.
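A sketch of that design: one fixed 60,000-point population, and 100 case-control samples drawn from it with replacement, with the spread of the estimates compared to the average reported standard error. The function name run_mc and its arguments are my own; the text describes only the procedure.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    N = 60_000
    x = rng.uniform(0, 1, N)
    y = (-4 + rng.logistic(size=N) >= 0).astype(int)

    def run_mc(x, y, n_per_group, reps=100):
        ones = np.flatnonzero(y == 1)
        zeros = np.flatnonzero(y == 0)
        est, se = [], []
        for _ in range(reps):
            idx = np.concatenate([rng.choice(ones, n_per_group, replace=True),
                                  rng.choice(zeros, n_per_group, replace=True)])
            fit = sm.Logit(y[idx], sm.add_constant(x[idx])).fit(disp=0)
            est.append(fit.params)
            se.append(fit.bse)
        est, se = np.array(est), np.array(se)
        print("sd of estimates: ", est.std(axis=0))
        print("mean reported SE:", se.mean(axis=0))

    run_mc(x, y, 1000)   # n=2000 samples: heavy reuse of the Y=1 pool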
Recall, however, that there are only 1,039 Y=1 values in the entire
population. Thus, my repeated sampling had to be with replacement, and
was using the same Y=1 observations over and over. It is OK to use the
same X values repeatedly, but these observations also had the same
epsilon values each time, so the samples are not independent in the way
needed for the law of large numbers to work. The standard errors
computed by maximum likelihood came out wrong--- not equal to the
standard deviation of the estimates, but that is to be expected when the
draws are not independent.
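A sketch of how heavy that reuse is: with a pool of 1,039 Y=1 points and 1,000 case draws per sample (with replacement), most of one sample's case draws reappear in the next sample, carrying the same epsilons with them:

    import numpy as np

    rng = np.random.default_rng(5)
    pool = 1_039                               # Y=1 points in the population
    a = rng.choice(pool, 1000, replace=True)   # case draws, sample 1
    b = rng.choice(pool, 1000, replace=True)   # case draws, sample 2
    print(np.isin(a, b).mean())  # roughly 0.6 of sample 1's draws recur in sample 2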
Realizing this, I also tried doing the procedure with 100 n=200
samples instead of 100 n=2000 samples. I still used sampling with
replacement, but now there was less overlap across replications, and so
less dependence between samples. And now the estimated standard errors were
close to the standard deviations.
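Reusing the population and the run_mc sketch from above, the smaller design is just a different call:

    run_mc(x, y, 100)   # n=200 samples: far less reuse of the Y=1 pool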
This, I expect, is what would happen if I did the kind of repeated
sampling that is our thought experiment for the kind of real studies
that use the case-control method. That thought experiment is to take
repeated draws of 60,000-point populations, with the same X's each time
but with different epsilons and hence Y's. Each of the 100 Monte Carlo
samples would be from a different population draw.
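A sketch of that corrected thought experiment: the X's stay fixed, but each replication redraws all 60,000 disturbances before taking its case-control sample. The seed is again an arbitrary choice of mine, and the cases are drawn with replacement to guard against a replication with slightly fewer than 1,000 Y=1's:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    N = 60_000
    x = rng.uniform(0, 1, N)   # the same X's in every replication

    est, se = [], []
    for _ in range(100):
        y = (-4 + rng.logistic(size=N) >= 0).astype(int)   # fresh epsilons
        idx = np.concatenate([rng.choice(np.flatnonzero(y == 1), 1000, replace=True),
                              rng.choice(np.flatnonzero(y == 0), 1000, replace=False)])
        fit = sm.Logit(y[idx], sm.add_constant(x[idx])).fit(disp=0)
        est.append(fit.params)
        se.append(fit.bse)

    est, se = np.array(est), np.array(se)
    print("sd of beta estimates:      ", est[:, 1].std())
    print("mean reported SE for beta: ", se[:, 1].mean())

With the disturbances independent across replications, the two numbers for beta should now come out close to each other, which is what the Prentice-Pyke result predicts.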