dummy variables rule


confused by this rule:

“when we want to distinguish between n classes we must use n-1 dummy variables”

what does this exactly mean?

thank you!

I wrote an article on dummy variables that covers this quite well: http://www.financialexamhelp123.com/dummy-variables/

(Full disclosure: as of 4/25/16 there is a charge to read the articles on my website. You can get an idea of the quality of the articles by looking at the free samples here: http://www.financialexamhelp123.com/sample-articles/.)

If you have n mutually exclusive and collectively exhaustive categories and you want to use dummy variables to identify the class to which a data point belongs, you use n − 1 dummy variables.

For example, if you have quarterly sales data, you can have dummy variables for a sale being in, say, the 1st quarter, the 2nd quarter, and the 3rd quarter; 4th quarter sales are identified by zeroes for each of these three variables.
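To make the quarterly example concrete, here is a minimal sketch of the encoding. The function name `quarter_dummies` is just an illustration, and picking the 4th quarter as the omitted class is an assumption that follows the example above; any quarter could serve as the omitted one.

```python
def quarter_dummies(quarter):
    """Return the (Q1, Q2, Q3) dummies for a quarter; Q4 is the omitted class."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    # One dummy per non-omitted quarter: 1 if the sale is in that quarter, else 0.
    return tuple(1 if quarter == q else 0 for q in (1, 2, 3))


for q in (1, 2, 3, 4):
    print(q, quarter_dummies(q))
```

Four distinct quarters map to four distinct patterns of three dummies, so a fourth dummy would carry no extra information.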

If you have n categories that are not mutually exclusive or are not collectively exhaustive, then you need n dummy variables to characterize the data: one for each category.

For example, if you have data on home prices and the homes can possess (or not) central air conditioning, a swimming pool, and a fireplace, then you need 3 dummy variables: one for each characteristic. The reason is that they’re not mutually exclusive: a home can have _both_ a fireplace and a swimming pool, for example.
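A small sketch of the home example, for contrast with mutually exclusive classes. The feature names and the helper `feature_dummies` are made up for illustration.

```python
from itertools import product

# Hypothetical feature names for the three characteristics in the example.
FEATURES = ("central_air", "pool", "fireplace")


def feature_dummies(home_features):
    """Return one 0/1 dummy per feature; a home may have any combination."""
    return {name: int(name in home_features) for name in FEATURES}


print(feature_dummies({"pool", "fireplace"}))

# Every 0/1 combination is a possible home, so no dummy is implied by the others.
all_patterns = list(product((0, 1), repeat=len(FEATURES)))
print(len(all_patterns))
```

Because all eight combinations can occur, knowing two of the dummies tells you nothing about the third, which is why none of them can be dropped.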

sounds good. I have another question on this:

When they say the omitted class (among the dummy variables in an equation) is the reference point, does that apply only to equations that contain nothing but dummy variables, or also to equations that contain both dummy and quantitative variables?

Thanks again!


Both, as the Magician says.

In such an equation, quantitative variables are interpreted as usual. Only the dummy variables are interpreted relative to the omitted class.
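Here is a minimal sketch of that interpretation using NumPy. The data are made up and noise-free, with true coefficients that I chose (intercept 10, slope 2, class effects 3 and 5), so the fit should recover them up to floating-point error.

```python
import numpy as np

# Hypothetical data: three classes (class 0 is the omitted one) and one
# quantitative variable x.
group = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# n - 1 = 2 dummies for n = 3 classes.
d1 = (group == 1).astype(float)
d2 = (group == 2).astype(float)

# Noise-free response built from assumed coefficients.
y = 10.0 + 2.0 * x + 3.0 * d1 + 5.0 * d2

# Design matrix: intercept, quantitative variable, two dummies.
X = np.column_stack([np.ones_like(x), x, d1, d2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(beta, 6))
```

The slope on x reads as usual (a one-unit change in x, holding class fixed), while the coefficients on d1 and d2 are the shifts in the intercept for classes 1 and 2 relative to the omitted class 0.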

Keep in mind, the explanation S2000 gave is accurate for models where we include an intercept term. If we omit an intercept, such as cases of regression through the origin, we don’t need to follow the n-1 dummy rule.

  1. One of the assumptions of multiple regression is that there is no exact linear relationship among two or more independent variables. Roughly speaking, if there are n independent variables, we cannot infer any one of them exactly from a linear combination of the other n-1.

  2. If we can infer one independent variable from a linear relationship among the other n-1 independent variables, we are violating this assumption, and multiple regression will not be appropriate or applicable.

  3. If we use n dummies for n classes, we can infer any one dummy from the other n-1 dummies. It is quite easy: if any one of the other n-1 dummies is 1, the dummy in question is 0; if all of the other n-1 dummies are 0, the dummy in question is 1.

  4. Or, say we define n dummies for n classes: x1, x2, x3, …, xn. Then x1 = 1 − (x2 + x3 + x4 + … + xn), or generally xi = 1 − (x1 + x2 + … + x(i-1) + x(i+1) + … + xn). Because the intercept acts as a variable that always equals 1, this is an exact linear relationship among the independent variables.
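The collinearity described in the list above can be checked numerically. This is a sketch on made-up class labels, using NumPy's `matrix_rank`: with an intercept plus one dummy per class, the dummies in each row sum to 1, which is exactly the intercept column, so the design matrix loses a rank.

```python
import numpy as np

# Hypothetical class labels for six observations in three classes.
classes = np.array([0, 1, 2, 0, 1, 2])
intercept = np.ones(len(classes))

# One dummy column per class (n dummies for n classes).
dummies = np.column_stack([(classes == c).astype(float) for c in range(3)])

X_full = np.column_stack([intercept, dummies])         # intercept + n dummies
X_drop = np.column_stack([intercept, dummies[:, 1:]])  # intercept + (n - 1) dummies

# The dummies in each row sum to 1, i.e. they reproduce the intercept column.
rows_sum_to_one = np.allclose(dummies.sum(axis=1), intercept)

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)

print(rows_sum_to_one)
print(X_full.shape[1], rank_full)  # number of columns vs. rank
print(X_drop.shape[1], rank_drop)
```

The full matrix has four columns but rank three, so its coefficients are not uniquely determined; dropping one dummy leaves three columns of rank three.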

The assumption is that there is no perfect collinearity between independent variables in the model.

Again, the assumption is only violated when there is perfect collinearity. In this case you can’t properly estimate the model.

If you see my earlier post, this is not the case if we fit a model without an intercept but with k dummies for k classes. For exam purposes, though, an intercept will be fit, so you need k-1 dummies for k (mutually exclusive and exhaustive) groups.

I recommend you fit a model with an intercept and k-1 dummies for k groups. Then see what happens when you try to fit a model with an intercept and k dummies. Finally, fit the model with k dummies and omit the intercept. The interpretation of the coefficients is quite different, but you can certainly fit the model with k dummies and without an intercept without necessarily violating the assumption.
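If you want to try this without a stats package, here is a rough sketch of the three fits with NumPy's least-squares routine. The data are made up (three groups A, B, C with two observations each), and `lstsq` is just one convenient way to do the fit; regression software may instead drop a column or report an error for the second model.

```python
import numpy as np

# Hypothetical data: groups A, B, C (coded 0, 1, 2), two observations each.
group = np.array([0, 0, 1, 1, 2, 2])
y = np.array([10.0, 12.0, 20.0, 22.0, 30.0, 34.0])

ones = np.ones(len(y))
dA, dB, dC = [(group == g).astype(float) for g in range(3)]

# Model 1: intercept + k - 1 dummies (A is the omitted class).
X1 = np.column_stack([ones, dB, dC])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Model 2: intercept + k dummies. The matrix is rank deficient, so the
# coefficients are not unique (lstsq returns just one of many solutions).
X2 = np.column_stack([ones, dA, dB, dC])
b2, _, rank2, _ = np.linalg.lstsq(X2, y, rcond=None)

# Model 3: k dummies, no intercept.
X3 = np.column_stack([dA, dB, dC])
b3, *_ = np.linalg.lstsq(X3, y, rcond=None)

print(np.round(b1, 6))       # intercept, then differences from group A
print(X2.shape[1], rank2)    # number of columns vs. rank
print(np.round(b3, 6))       # one coefficient per group
```

In the first model the intercept is the mean of group A and each dummy coefficient is that group's mean minus A's mean; in the third model each coefficient is simply the group's own mean. Both give the same fitted values, only the interpretation changes.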