Say you are doing a multiple regression with a dummy variable. The classic textbook example is testing for wage differences between men and women while controlling for age:

Y = b0 + b1*D + b2*X + e, where Y is wage, D = 0 if man / 1 if woman, and X is age.

You have a sample size N, and when you run the regression you get a relationship with strong explanatory power (high adjusted R^2) that is statistically significant (low p-values for the F and t statistics). The coefficient b1 in front of the dummy variable is supposed to tell you about the wage differential and is statistically significant. But say you realize that your sample contains almost all men (an extreme example would be N-1 men and 1 woman). You can still get a b1 coefficient with a low p-value, and the overall regression will still have high explanatory power. But does this b1 number mean anything in this case?

Basically I have a sample which is split into 2 categories with a dummy variable, and one of the two subgroups ends up being very, very small compared to the other. Are the regression results to be trusted, specifically the dummy variable coefficient?
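(For reference on what b1 picks up in that specification: men get the fitted line b0 + b2*X, women get (b0 + b1) + b2*X, i.e. the same slope shifted vertically by b1, so b1 is the estimated wage gap at any given age.)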
It seems to me that there are technical requirements regarding the independent variables, namely that there are sufficient data points to generalize (central limit theorem, etc.). However, I just ran a quick test where I imposed a relationship to see what would happen: assume the base salary is $25,000 (b0), the person makes $1,000*age on top of that, with an additional $10,000 if male. The Excel formula I used was

=1000*A2 + 10000*(1-B2) + 25000

where A2 = age and B2 = the male/female indicator (0/1). I ran the multiple regression with a single female present and 9 males to see what would happen. The intercept came out to $35,000, the age coefficient to $1,000, and the man/woman coefficient to -$10,000, which is exactly what I expected. Even increasing the number of women to 11 gave the same results.
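If anyone wants to replay that experiment outside of Excel, here is a rough sketch in Python with statsmodels (same made-up numbers as above, but with noise added, since an exact linear relationship will always be recovered perfectly no matter how the groups are split):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

n_men, n_women = 99, 1                              # deliberately lopsided groups
age = rng.integers(22, 65, size=n_men + n_women).astype(float)
female = np.r_[np.zeros(n_men), np.ones(n_women)]   # D = 0 for men, 1 for women

# same structure as the Excel test: base 25,000 + 1,000*age + 10,000 if male,
# but with noise added so the fit is no longer exact
wage = 25_000 + 1_000 * age + 10_000 * (1 - female) + rng.normal(0, 5_000, size=age.size)

X = sm.add_constant(np.column_stack([female, age]))  # columns: const, D, age
fit = sm.OLS(wage, X).fit()
print(fit.params)     # b0, b1 (dummy), b2 (age)
print(fit.pvalues)

Rerunning it with different seeds shows the catch: with a single woman, the dummy estimate swings around by thousands of dollars from draw to draw, and its p-value is all over the place.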
Well, if you *generated* your Y and X columns using an exact linear relationship, then the regression will find the perfect fit and the exact coefficients.

Say I have 99 pairs of male salaries vs. ages, run a regression and find a fit. Then I add one more data point, a woman's salary vs. her age. If she's Meg Whitman and her reported salary is way up there, and I now run another regression with an added dummy variable, I'll get the same coefficients for the intercept and slope as in my first regression (or very similar), and will also get a coefficient for the dummy variable which the t-statistics will likely show to be significant. If someone just looked at the stats output, they would conclude that a woman's salary is a gazillion dollars higher than that of a man of the same age, with a very low p-value and a very high R^2, so great explanatory power for the overall fit.

But the fit really just explains male salary vs. age, and the statistically significant result for the dummy variable is worthless because it's based on one data point which is an outlier.

I'm trying to understand this better: when you split your data set into 2 groups with a dummy variable, do you have to control somehow for the size of the 2 subgroups in the sample? What if the 2 subgroups have very different sizes? Does it compromise the regression results, and is there any stats test to adjust for it?
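That single-outlier story is easy to reproduce. Here's a rough sketch in Python/statsmodels (all numbers invented): fit the men on their own, then add one wildly overpaid woman plus a dummy for her, and compare.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# 99 men with a noisy salary-vs-age relationship (numbers invented)
age_m = rng.integers(25, 60, size=99).astype(float)
sal_m = 50_000 + 2_000 * age_m + rng.normal(0, 10_000, size=99)
men_only = sm.OLS(sal_m, sm.add_constant(age_m)).fit()

# now append one extreme woman (the "Meg Whitman" point) and a dummy for her
age = np.append(age_m, 55.0)
sal = np.append(sal_m, 5_000_000.0)        # wildly out of line with the men
female = np.append(np.zeros(99), 1.0)
X = sm.add_constant(np.column_stack([female, age]))
with_dummy = sm.OLS(sal, X).fit()

print(men_only.params)      # intercept and slope from the men alone
print(with_dummy.params)    # same intercept and slope, plus a huge dummy coefficient
print(with_dummy.resid[-1])                              # ~0: her point is fit exactly
print(with_dummy.get_influence().hat_matrix_diag[-1])    # leverage 1 for the lone woman

Because the dummy has a whole coefficient to itself for a single observation, it absorbs her point exactly: her residual is zero, her leverage is 1, and the intercept and slope come out identical to the men-only fit. The "significant" t-stat on the dummy is a statement about exactly one data point.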
Mobius Striptease Wrote:
-------------------------------------------------------
> Basically I have a sample which is split into 2
> categories with a dummy variable, and one of the
> two subgroups ends up being very, very small
> compared to the other. Are the regression results
> to be trusted, specifically the dummy variable
> coefficient?

If you have one woman and a bunch of men in your dataset, the interpretation of the coefficient would be the difference in that woman's salary compared to everybody else after controlling for age. The difference in her salary could be because she is a woman, or it could be because of some factor other than her gender and her age. The more other factors you control for, and the more women (and men) in your sample, the more convincing it will be that the coefficient is truly measuring the impact on wage of being a woman.
Mobius Striptease Wrote:
-------------------------------------------------------
> I'm trying to understand this better: when you
> split your data set into 2 groups with a dummy
> variable, do you have to control somehow for the
> size of the 2 subgroups in the sample? What if the
> 2 subgroups have very different sizes? Does it
> compromise the regression results, and is there
> any stats test to adjust for it?

It does not matter at all whether there are very different sizes in the 2 subgroups. However, it may matter if there are insufficient observations in one of the 2 groups. I don't believe there is really a bright-line test for the number of observations in each group. More than 30 is maybe a rule of thumb, though I've seen fewer. Also, inclusion of other relevant variables that may be correlated with the dummy variable is important. A high R-squared is not always enough to say that there aren't omitted variables that could explain away the effect found in the coefficient for the dummy variable.
If you run the regression and there are very few women (or even 1 woman) in the group and the result is statistically significant at whatever alpha level you have chosen, then what you know from the result is that there is a statistically significant difference between the men in the group and the women in the group. Whatever difference you observe between those two groups is not likely to be due to chance, provided that the group is a representative (i.e. random) sample.

However, what you do not know is whether the difference is because of gender, or because of some other factor that makes this person an outlier (and may also be correlated with gender, without being caused directly by gender). If you have a larger sample of women, then the chance of the average being skewed by an outlier is substantially lower. And if you measure other variables and control for them (for example, maybe she earns less because she is younger, and there aren't older women in the workforce because society was full of chauvinists for so long that there aren't any more senior women who could be expected to earn more), you are more likely to get the causal effects right.

This is more or less what gamblingeconomist said. I just thought I'd add another example.
gamblingeconomist Wrote:
-------------------------------------------------------
> It does not matter at all whether there are very
> different sizes in the 2 subgroups. However, it
> may matter if there are insufficient observations
> in one of the 2 groups. I don't believe there is
> really a bright-line test for the number of
> observations in each group. More than 30 is maybe
> a rule of thumb, though I've seen fewer.

Thanks for the input, guys. This is really the heart of the matter, and I fully agree with everything you both said. The question is about quantifying in some way how many observations within a group are "sufficient". I mean, you can do a power analysis before your study to determine the target sample size, but as we all agree, for this particular purpose it's not only the total sample size that is relevant, but also the sizes of the 2 subgroups. What's a minimum sufficient subgroup size?

About this rule of thumb... just like 42 is the answer to the universe, 30 is the answer for everything in statistics. I don't really think there is much validity to it.
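On quantifying "sufficient": you can fold the lopsidedness into an ordinary power calculation, since the two-group comparison is basically a t-test. A back-of-the-envelope sketch with statsmodels (the 0.5 effect size in residual-standard-deviation units, the 80% power, and the ratios are all invented for illustration):

from statsmodels.stats.power import TTestIndPower

# rough power calculation for a two-group comparison with unequal group sizes;
# effect_size is the wage gap measured in residual standard deviations
analysis = TTestIndPower()
for ratio in (1.0, 0.5, 0.1, 0.05):        # ratio = n_women / n_men
    n_men = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                 ratio=ratio, alternative='two-sided')
    print(f"ratio {ratio:>4}: ~{n_men:6.0f} men, ~{n_men * ratio:5.0f} women")

Under these textbook assumptions, as the split gets more lopsided the number needed in the smaller group falls toward a floor (it never drops much below the low 30s with these inputs) while the larger group has to blow up to compensate. Past a point, adding more men buys you almost nothing; it's the count of women that binds, not the total N.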
Hey MS, all I was trying to show was that it is possible to get meaningful results with a small population of women and that the coefficient did return valid information.
Hmm… Not sure I agree with all that stuff. Forget about regression for a second and think about doing the t-test. If the variances of the groups are not equal, you have a problem (the famous Behrens-Fisher problem) and there just aren't great solutions. That means that you generally assume the variances of the two populations are equal (or close enough that you can use some hand-waving theorem that says if the variances are close, your results are close enough to valid to be valid). So the equal-variance t-test is exactly the same test as 1-way ANOVA with two groups. Naturally, you would like to be able to verify your assumption that the variances are equal.

Regression with dummy variables is ANCOVA, and along with the homogeneity of variances you are also using all the data to estimate the slopes, so you are adding a homogeneity-of-slopes assumption. Naturally, you would like to be able to verify that as well. In any event, this isn't just a CLT issue either (the CLT that says the group means are approximately normally distributed doesn't say that the ANCOVA F-statistic has the F-distribution if your residuals aren't normal).

The problem with having two observations in one group and 98 in the other isn't power; it's that you are heavily relying on unverifiable and maybe implausible assumptions. Could you really say that since you have a significant test statistic you believe these two groups have different intercepts because with two observations contributing to an intercept you believe in homogeneity of variances, homogeneity of slopes, and multi-variate conditional normality (and we could throw in a few other assumptions as well)?
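For what it's worth, here is roughly what checking those assumptions looks like in code. This is a Python/scipy/statsmodels sketch on made-up data with a sane split, just to put names on the checks being described:

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

rng = np.random.default_rng(2)

# fake data with a reasonable split, just to show which checks are meant
n_men, n_women = 80, 20
age = rng.integers(25, 60, size=n_men + n_women).astype(float)
female = np.r_[np.zeros(n_men), np.ones(n_women)]
wage = 30_000 + 1_500 * age - 8_000 * female + rng.normal(0, 6_000, size=age.size)

# Behrens-Fisher side of the story: pooled t-test vs Welch's t-test
print(stats.ttest_ind(wage[female == 0], wage[female == 1], equal_var=True))
print(stats.ttest_ind(wage[female == 0], wage[female == 1], equal_var=False))

# homogeneity of variances across the two groups: Levene's test
print(stats.levene(wage[female == 0], wage[female == 1]))

# homogeneity of slopes: add the D*age interaction and look at its t-test
X = sm.add_constant(np.column_stack([female, age, female * age]))
print(sm.OLS(wage, X).fit().summary())

# with a single woman none of this is checkable: her within-group variance
# doesn't exist and the interaction column is just a multiple of the dummy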
JoeyDVivre Wrote:
-------------------------------------------------------
> The problem with having two observations in one
> group and 98 in the other isn't power; it's that
> you are heavily relying on unverifiable and maybe
> implausible assumptions. Could you really say
> that since you have a significant test statistic
> you believe these two groups have different
> intercepts because with two observations
> contributing to an intercept you believe in
> homogeneity of variances, homogeneity of slopes,
> and multi-variate conditional normality (and we
> could throw in a few other assumptions as well)?

To the extent that, with more data, you would actually test these assumptions, I agree with what you are saying. But I think in practice these assumptions are often just assumed.
JoeyDVivre Wrote:
-------------------------------------------------------
> Could you really say
> that since you have a significant test statistic
> you believe these two groups have different
> intercepts because with two observations
> contributing to an intercept you believe in
> homogeneity of variances, homogeneity of slopes,
> and multi-variate conditional normality (and we
> could throw in a few other assumptions as well)?

Thanks, that helps. Can we elaborate on one point here: homogeneity of variances vs. homoscedasticity? I assume these are 2 different things, the first being the equality of variances within each subgroup across all data points regardless of X, the second being that the variances conditional on X must be the same. It's a real mess out there, even in stats books, when they list the regression assumptions. Are these 2 independent assumptions? Homoscedasticity is almost always included, but homogeneity of variances is either absent or used interchangeably with homoscedasticity.
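One way to make that distinction concrete (just my reading of the two terms, and the data below are made up): check for equal residual variance across the two groups with Levene's test, and check for residual variance that is flat in the regressors with a Breusch-Pagan test.

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)

n = 200
age = rng.integers(25, 60, size=n).astype(float)
female = rng.integers(0, 2, size=n).astype(float)
# noise that grows with age but doesn't depend on gender
wage = 30_000 + 1_500 * age - 8_000 * female + rng.normal(0, 100 * age)

X = sm.add_constant(np.column_stack([female, age]))
fit = sm.OLS(wage, X).fit()

# homogeneity of variances in the ANOVA sense: equal error variance across groups
print(stats.levene(fit.resid[female == 0], fit.resid[female == 1]))

# homoscedasticity in the regression sense: error variance constant in the regressors
print(het_breuschpagan(fit.resid, X))   # (LM stat, LM p-value, F stat, F p-value)

In this setup the noise grows with age but not by gender, so the group-wise check and the conditional-on-X check can give different answers, which at least suggests the two aren't just synonyms.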