# Why does the regression intercept formula include the mean values of both variables?

Hopefully someone good with Quant can help here… For single-variable linear regression, the formula for the intercept coefficient is:

Intercept = mean of the dependent variable (Y) - (slope × mean of the independent variable (X))

which is:

$B_0 = \bar{Y} - B_1\bar{X}$

My question is: why does this formula insist on using the mean values of the dependent and independent variables? If a linear relationship exists with a constant slope, the value of the intercept calculated using the formula above should be the same no matter what point on the line the variable values are taken from. Using the (mean X, mean Y) point demonstrates that the regression line passes through this point, but it does not seem to serve any other purpose in this formula. Any thoughts? Thanks in advance and keep on plugging!

The formula doesn’t “insist” on using the mean values. In the mathematical derivation of the two coefficients we end up with two equations in two unknowns (intercept and slope). The formula is simply an algebraic result when we solve for the intercept in terms of the slope.
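For anyone who wants to see where the means come from, here is a sketch of that derivation. It assumes the usual ordinary least squares setup (minimizing the sum of squared residuals), which the thread does not spell out; the symbols $S$, $n$, and the index $i$ are my own notation.

```latex
% Sum of squared residuals for n observations (X_i, Y_i)
\begin{align*}
S(B_0, B_1) &= \sum_{i=1}^{n} \left( Y_i - B_0 - B_1 X_i \right)^2 \\[4pt]
% First equation: set the derivative with respect to B_0 to zero
\frac{\partial S}{\partial B_0} &= -2 \sum_{i=1}^{n} \left( Y_i - B_0 - B_1 X_i \right) = 0 \\
% Second equation: set the derivative with respect to B_1 to zero
\frac{\partial S}{\partial B_1} &= -2 \sum_{i=1}^{n} X_i \left( Y_i - B_0 - B_1 X_i \right) = 0 \\[4pt]
% Dividing the first equation by n is what makes the means appear
\sum_{i=1}^{n} Y_i &= n B_0 + B_1 \sum_{i=1}^{n} X_i
\quad\Longrightarrow\quad
B_0 = \bar{Y} - B_1 \bar{X} \\[4pt]
% Substituting B_0 into the second equation gives the slope
B_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\end{align*}
```

The means show up only because the first condition sums over every observation and is then divided by $n$; nobody chose them as a special point to plug in.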

Thanks for replying, wyantjs. To clarify: the value of the intercept would come out the same using any point on the regression line, correct? I know the formula shows that the line passes through the mean values, but I want to be sure going forward that using the mean values is not necessary to arrive at the correct outcome. Thanks again for any clarification~

You are not understanding. Don’t think of it as “using” the mean values to derive this coefficient. It just so happens that the formula has the mean values in it as a result of some calculus and algebra. If you insist on thinking of it this way, the answer is that it is necessary to use them to arrive at the correct outcome. You cannot just plug any observed pair of X and Y values into this equation and get B0 (assuming you know B1). You will have an error. The point is that you could plug in any X and Y pair that lies on the line, and not just the mean values, but you have to know the value of the intercept term to know whether that pair does in fact lie on the line. Hence, just use the formula as it is.

wyantjs: You are correct, I am still not understanding. To my knowledge, if you have a linear regression line with a known slope, you can pick any point on this line, plug in the numbers, and arrive at the same intercept value regardless of whether the point you chose consists of the mean variable values. I have tested this concept several times, and as long as the slope is the same, it arrives at the same intercept value. I guess I am still confused as to why they even include the mean values in the formula, and why not just say any (X,Y) data point on the regression line? I appreciate the insight, and please don’t feel obligated to post again if this seems like an elementary question~ Thanks!!

You cannot know if a point (X,Y) is on the regression line if you do not already know the equation of the line, and therefore already know the value of the intercept. You are using circular logic here. Of course any point on the line is going to give you the same intercept…that’s embedded in the definition of a line. Try this…take 20 values for X and Y. Do any of them lie on the line? Who the hell knows until you calculate the intercept and slope? You have to find the intercept first. Get it?
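The “try this” experiment can be sketched in a few lines of Python. This is only an illustration: the five data points are made up (not from the thread), `fit_line` is a hypothetical helper name, and the slope uses the standard least squares formula, which I am assuming is what the exam material intends.

```python
from statistics import mean

# Made-up illustrative data; these numbers are an assumption, not from the thread.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # B0 = Y-bar - B1 * X-bar
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# Plug each observed data point into B0 = Y - B1 * X.
# Observed points usually sit off the line, so these values disagree.
implied_from_data = [y - slope * x for x, y in zip(xs, ys)]

# Points built from the fitted equation do lie on the line, so they all
# give back the same intercept, but building them required knowing it already.
on_line = [(x, intercept + slope * x) for x in (0.0, 2.5, 10.0)]
implied_from_line = [y - slope * x for x, y in on_line]

print("fitted slope, intercept:", slope, intercept)
print("intercepts implied by raw data points:", implied_from_data)
print("intercepts implied by points on the line:", implied_from_line)
```

The raw data points each imply a different intercept, and only their average recovers the fitted value, which is the mean formula again. The points that all agree are the ones generated from the line itself, which is the circularity described above.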

wyantjs: I got it now - not sure why that wasn’t sinking in, but it makes sense now! Basically what I need to remember is to use the mean variable values if asked to calc the intercept value on the exam, but I do appreciate the explanations! Have a good week-

Basically, the regression line passes through $(\bar{X}, \bar{Y})$.

See, the thing is, I’m supposed to be the stats guru here but wyantjs and Hydrogen Rainbow are making me feel marginalized.