

Similarly, we see that the model forecasts that a 40 year-old man who is Independent will have an income of 52,697 (cell J23).įigure 3 – Forecasting with categorical data From Figure 3, you can see that the model forecasts that this person would have an income of 24,494 (cell J22) as calculated by the formula =TREND(J4:J19,F4:I19,F22:I22). The model can predict the income of a 25 year old woman who is a Democrat, provided you recognize that the coding is Age = 25, Gender1 = 0, Party1 = 0, Party2 = 1. The output from the Real Statistics Linear Regression data analysis tool on this input is shown in Figure 2.įigure 2 – Regression with categorical data We can now perform regression analysis on this range. The resulting coding is as shown in range F3:J19 of Figure 1. Party2 = 1 if Party is Dem and Party2 = 0 otherwise Party1 = 1 if Party is Rep and Party1 = 0 otherwise
How to use data analysis in excel regression code#
Since Party takes three values (Rep, Dem, Ind), two dummy variables, called Party1 and Party2, are needed to code Party, defined as follows: Gender1 = 1 if Gender is Male and Gender1 = 0 otherwise (i.e.

Since Gender takes two values (Male and Female), one dummy variable, called Gender1, is sufficient to code Gender, defined as follows: In general, if the original data has k categorical values, the model will require k – 1 dummy variables. There are three possible values for the Party affiliation variable and two possible values for the Gender. Each such dummy variable will only take the value 0 or 1 (although in ANOVA using Regression, we describe an alternative coding that takes values 0, 1 or -1).Įxample 1: Create a regression model for the data in range A3:D19 of Figure 1.Īge is a continuous random variable, while Party affiliation and Gender are categorical random variables. You can also create a scatter plot of these residuals.Categorical independent variables can be used in a regression analysis, but first, they need to be coded by one or more dummy variables (also called tag variables). For example, the first data point equals 8500.

The residuals show you how far away the actual data points are fom the predicted data points (using the equation). For example, if price equals $4 and Advertising equals $3000, you might be able to achieve a Quantity Sold of 8536.214 -835.722 * 4 + 0.592 * 3000 = 6970. You can also use these coefficients to do a forecast. For each unit increase in Advertising, Quantity Sold increases with 0.592 units. In other words, for each unit increase in price, Quantity Sold decreases with 835.722 units. The regression line is: y = Quantity Sold = 8536.214 -835.722 * Price + 0.592 * Advertising. Most or all P-values should be below below 0.05. Delete a variable with a high P-value (greater than 0.05) and rerun the regression until Significance F drops below 0.05. If Significance F is greater than 0.05, it's probably better to stop using this set of independent variables. If this value is less than 0.05, you're OK. To check if your results are reliable (statistically significant), look at Significance F ( 0.001).
