Quantitative Data Analysis with Regression
This lecture for social science research methods at the University of Maine at Augusta presents resources on quantitative data analysis to augment readings in your textbook. As we discussed in a previous lecture, the quantitative approach in research methods is very different from the qualitative approach. While the qualitative approach values open questions and after-the-fact classification, the quantitative approach is focused around the gathering of variables that either are originally quantified or can be converted into quantities.
In this lecture, rather than reiterate the essential material in your textbook, we’ll discuss the essential elements of the basic, entry-level method of quantitative analysis involving multiple variables: multiple regression. Multiple regression seeks to describe the relationship between variables in the world through the use of an equation that you already learned about in high school: the equation for a line. Understanding this equation helps us to understand how to interpret the results of research.
From Variables to Relationships to Regression
Although the phrase “multiple regression analysis” might sound daunting, it really doesn’t take too many steps to get there from where you already are. The following video starts right at the beginning, with material we’ve already covered (variables, operationalization, and direction of effect) and takes the final step to multiple regression: an equation that describes the relationship of operationalized variables to one another:
Interpreting a Regression Table
We learned in the video above that regression is a statistical procedure used to describe how one variable (called the dependent variable) is related to one or more other variables (called independent variables). Regression uses those independent variables to predict the value of a dependent variable.
If the goal of regression is prediction, then clearly it is a widely applicable tool. Stock market analysts regularly use regression analysis to predict the movement of stocks; lawyers rely on regression analyses in wrongful death suits; doctors engage in and incorporate medical research that uses regression analysis to assess the effect of various treatments; agricultural biologists use regression to predict pest infestation. Almost any professional occupation you can think of outside the humanities is supported by regression analysis to some degree. If you plan on entering a professional occupation after you graduate, knowing how to read a table of regression results will come in handy.
To review material from the video embedded above, social scientists use regression estimation to estimate the values of an equation like y=b0+(b1*x1)+(b2*x2)+(b3*x3)+(b4*x4), in which:
- y is the dependent variable, the quantity you’d like to predict.
- b0 is called the intercept, because it is the value that y takes when every independent variable (x1, x2 and so on) equals 0.
- x1 is an independent variable, that thing you think may influence the value of y. Other independent variables are referred to as x2, x3, x4 and so on.
- b1 is the slope, which tells you in what way x1 predicts y. Other slopes, corresponding to other independent variables, are referred to as b2, b3, b4 and so on.
A regression equation allows the inclusion of multiple slopes to describe the effect of multiple independent variables on that dependent variable y. Social scientists include as many slopes as there are variables, just tacking the extra ones on at the end. The inclusion of multiple independent variables is why this kind of research tool is called multiple regression analysis.
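To make the equation concrete, here is a minimal Python sketch of how a regression equation turns values of the independent variables into a prediction. The function name `predict` and the numbers in the example are made up for illustration; they are not estimates from any data set.

```python
def predict(intercept, slopes, values):
    """Return y = b0 + b1*x1 + b2*x2 + ... for one observation."""
    if len(slopes) != len(values):
        raise ValueError("need exactly one value per slope")
    return intercept + sum(b * x for b, x in zip(slopes, values))


# Made-up numbers, chosen only to show the arithmetic:
# b0 = 10, b1 = 2, b2 = -1, with x1 = 3 and x2 = 4.
example = predict(10.0, [2.0, -1.0], [3.0, 4.0])
print(example)  # 10 + 2*3 - 1*4 = 12.0
```

Adding another independent variable just means adding one more slope and one more value to the two lists.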
The four elements of a regression equation (also known as a regression model) — y, b0, b1 and x1 — are the most important pieces of information you’ll see presented in regression tables. The trick is to interpret them correctly. Let’s work through a real-life example, voter turnout, starting with raw data, progressing through lines on a graph, and finishing up with the interpretation of a regression table.
Regression: Drawing a Best Fit Line
Imagine you are a political consultant who would like to predict voter turnout in the year 2018 (there are many consultants who’d like to do exactly this). How might you go about making an accurate prediction? The first step is to collect some data. The following are data from the United States Election Project on voter turnout — operationalized as the percentage share of those eligible to vote (based on laws from various time periods having to do with gender, age, race, citizenship status, and criminal record) who do actually vote — in national elections in the United States since 1900:
|Year||% Eligible Who Vote||Presidential Election?|
The best prediction when you have nothing but a bunch of observations of the thing you’d like to predict (“dependent variable”, remember?) is the mean of those observations, commonly called the “average.” The mean is calculated as the total of all observations of a variable, divided by the number of observations made.
The mean voter turnout from 1900 to 2016 is 50.6% (the result of dividing the sum of the second column, 2983.3, by 59, the number of elections observed). So a good guess, at least to start off with, is that about 51% of eligible voters will turn out in the 2018 elections. If we don’t know any other information, the mean is our best prediction. But is there other information we know?
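The arithmetic behind that mean can be checked in a few lines of Python. This sketch uses only the two figures quoted in the text (the column total and the number of elections); the helper `mean` is included just to show the general rule.

```python
def mean(observations):
    """Total of all observations divided by the number of observations."""
    return sum(observations) / len(observations)


# National elections are held every two years, 1900 through 2016.
n_elections = len(range(1900, 2017, 2))
print(n_elections)  # 59

# Sum of the turnout column, as quoted in the lecture.
total_turnout = 2983.3
overall_mean = total_turnout / n_elections
print(round(overall_mean, 1))  # 50.6
```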
Right above this text, a graph presents the same voter turnout data, but as a scatterplot, with dots representing observations and placed according to their position on two dimensions: turnout and year. The guess we made by just going with the mean is represented by a line at 50.6%. To state the prediction in mathematical terms, we claim that the best-fit line is Turnout=50.6 as an equation, with turnout measured in percentage units. “Turnout” takes the place of our y in the regression equation. It is what we would like to explain, our dependent variable.
But could we make a better prediction than that? Well, there certainly is more than one variable presented above; in addition to voter turnout, we have information about the year of the vote. The younger generations have often heard it said that people just don’t get out and vote like they used to. Stated in research terms, that’s a hypothesis, a prediction about how independent variables relate to a dependent variable: as year increases, voter turnout will decrease.
The graph above plots the same data as before. However, using the technique of regression, we’ve estimated values for a new, more complicated equation that includes an independent variable, “year.” The values for that equation are shown, along with the line that represents predicted values for various years. Here, the negative slope indicates a negative effect; as years increase in value, voter turnout decreases. It seems as though this new line fits the data a bit better than the previous estimate, doesn’t it? The line comes closer to most of the dots, after all. This line takes prediction to a new level, from declaring the existence of an overall mean, an average value of turnout for all observations, to a contingent mean. “Contingent” means “dependent on,” so a “contingent mean” is a mean whose value depends on the value of another, independent, variable. The dotted blue line above shows the prediction for a contingent mean generated by the equation Turnout = 164.11 - 0.058*Year. In 1966, the predicted contingent mean is 50% turnout. The predicted contingent mean turnout for the year 2016 is only 47%. This is not a drastic downturn, but it is a downturn.
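For readers who want to see where a best-fit line comes from, the following is a minimal sketch of the standard least-squares calculation for one independent variable, followed by the contingent-mean predictions quoted above. The function names are illustrative, the toy data are made up, and the sketch assumes the line in the figure was fitted by ordinary least squares, which the lecture does not state explicitly.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predicted_turnout(year, intercept=164.11, slope=-0.058):
    """Contingent mean from the lecture's equation Turnout = 164.11 - 0.058*Year."""
    return intercept + slope * year


# Made-up toy data that lie exactly on the line y = 1 + 2x.
toy_intercept, toy_slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(toy_intercept, toy_slope)  # 1.0 2.0

print(round(predicted_turnout(1966)))  # 50
print(round(predicted_turnout(2016)))  # 47
```

The turnout table itself is not reproduced in the sketch, so `fit_line` is shown on toy data only; the two turnout predictions use the intercept and slope reported in the text.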
Could you have seen the downward trend in turnout for yourself, just by looking at the scatterplot without any lines? As a matter of fact, it turns out that in general humans are pretty good at eyeballing patterns like this in two dimensions. Three problems arise, however, from this kind of informal assessment:
- Some humans are better at assessing the fit of a line than others;
- Many of us humans are good at understanding the meaning of a two-dimensional line on a graph. Fewer of us can understand an equation with two independent variables, plotted in three dimensions. Hardly any humans (if any) can visualize, much less understand, a graph that shows data plotted in four or more dimensions, which is usually what multi-variable analysis requires. But most humans can, with a bit of practice, understand a regression table.
- Simply looking at a line and judging that “it looks pretty good to me” isn’t very specific. How good is it? How well does an independent variable do in predicting a dependent variable? How much better than the alternatives might it be? What other independent variables might do a better job of predicting that dependent variable?
To address our human limitations and answer research questions more specifically, three statistical terms have been developed: slope, R-squared and statistical significance. The rest of this primer will explain each one in turn.
Interpreting the Value of Slopes
Look again at the figure shown directly above, which shows a best-fit line and the equation for that line, Turnout = 164.11 – 0.058*Year. We can see from the negative sign of the slope, -0.058, that the effect of year on turnout is negative. But what does the value of the slope mean? How one interprets a slope depends on the sort of variable one encounters. For any sort of variable, it is essential to remember the units of the variable when the variable is operationalized (described in a measurable way). There are two major sorts of variables in regression equations, and the way they treat units is different.
A continuous variable is one that can take any value. The variable for year fits roughly in this category; although a value for year like 1976.7589 doesn’t make sense, a year can take on any integer (whole-number) value. The unit of the variable “year” when operationalized is simply the number of the calendar year recognized in the United States and throughout the West. The general rule for interpreting a continuous variable is that the slope equals the change in the dependent variable for every one-unit change in the independent variable. Remember that the dependent variable, turnout, is measured as the percent of eligible voters who actually vote in a year. The units of the dependent variable “turnout” are percentage points. Knowing the units for the independent variable “year” and the dependent variable “turnout,” we can now interpret the slope. A slope of -0.058 is interpreted like this: for every increase of one year, voter turnout decreases by 0.058 percentage points. We could also say that voter turnout seems to be decreasing by about 1 percentage point every seventeen years or so (0.058*17=0.986).
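The interpretation rule for a continuous variable (slope times the number of units of change) is easy to verify. This short sketch uses the slope from the text; the function name `change_in_turnout` is just an illustrative label.

```python
slope_year = -0.058  # percentage points of turnout per calendar year


def change_in_turnout(years_elapsed, slope=slope_year):
    """Predicted change in turnout (percentage points) over a span of years."""
    return slope * years_elapsed


print(round(change_in_turnout(1), 3))   # -0.058
print(round(change_in_turnout(17), 3))  # -0.986

# How many years does it take for the predicted turnout to fall by one point?
years_per_point = 1 / abs(slope_year)
print(round(years_per_point, 1))  # 17.2
```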
The other sort of variable we will consider here is a dichotomous variable. A dichotomous variable is a variable that can take on one of two values, zero or one. The variable is an answer to a yes/no question, with 1=yes and 0=no. Dichotomous variables are also often called “dummy” variables. Because of the way a dichotomous variable is set up, there is only one possible one-unit increase in the variable: from zero to one. So we interpret the slope of a dichotomous variable as the amount the dependent variable changes when the independent variable equals one, versus when the independent variable equals zero.
Let’s add an additional independent variable to the regression analysis predicting turnout. The variable “presidential election” is a dichotomous variable. The question it answers is, “Was this year’s election a presidential election?” As you can see from the data table at the beginning of this article, 0 is entered when the answer is no and 1 is entered when the answer is yes. We can interpret the slope as the effect of it being a presidential election year on voter turnout.
The figure above plots the new regression equation that best fits the data when the variable “presidential election” is included. The general form of a multiple regression equation y=b0+(b1*x1)+(b2*x2) can be read in this instance as “Turnout = 155.843 - (0.058 * year) + (16.253 * presidential election)”. We would interpret the slope associated with “presidential election year” to mean that in a presidential election year, voter turnout is about 16.3 percentage points higher than in years in which there is no presidential election. To clarify the graph above, presidential election years are the series of dots on top and off-presidential years are the series of dots below. You’ll notice that the introduction of the dichotomous variable “presidential election” creates two lines to best fit the data: one for presidential years and one for non-presidential years. The lines are separated by 16.253 percentage points on the y-axis (the vertical axis). You may notice how much better these two lines fit the data. But by how much? How could we quantify that?
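Here is a small sketch of why a dichotomous variable produces two parallel lines. It plugs the coefficients quoted in the text into the equation; the function name `predict_with_dummy` is an illustrative choice, and the year 2016 is used only as an example input.

```python
def predict_with_dummy(year, presidential):
    """Turnout = 155.843 - 0.058*year + 16.253*presidential (0 or 1)."""
    return 155.843 - 0.058 * year + 16.253 * presidential


# Same year, evaluated on the "presidential" line and the "non-presidential" line.
on_year = predict_with_dummy(2016, 1)
off_year = predict_with_dummy(2016, 0)
print(round(on_year, 1))   # 55.2
print(round(off_year, 1))  # 38.9

# The vertical gap between the two lines is the dummy variable's slope,
# no matter which year we pick.
gap = on_year - off_year
print(round(gap, 3))  # 16.253
```

Because the slope for year is the same in both cases, the two lines never converge; the dummy variable only shifts the line up or down.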
R-Squared: How Much do the Independent Variables Explain?
The three figures predicting turnout above have R-squared values of 0, 0.0407 and 0.7297, respectively. Why, and what do those numbers mean?
To answer that question, let’s imagine that in all of the above figures regarding voter turnout a vertical line is drawn from each data point to the regression line. The figure below shows vertical lines drawn in red to the simplest regression line, the line for the mean value of turnout, Turnout=50.6.
Each of those vertical lines has a length of some amount; if the line is longer for a particular observation, it means that the prediction represented by the regression line Turnout=50.6 was off by a greater amount. The amount by which each prediction is off is called the error; if a data point has a larger value than predicted you have a positive error and if the data point has a smaller value than predicted you have a negative error. You could add up the absolute values of lengths of the lines for all the data points, and that would be called the total error of a regression model. The smaller the total error for a model, the more accurately the model predicts the dependent variable. For technical reasons, social scientists prefer to take the square of the error rather than its absolute value, and then they prefer to take the mean of all squared errors rather than adding the errors up. The effect is the same, however. The mean squared error for a regression model describes the amount of error in the prediction made by the regression. The smaller the mean squared error for a model, the more accurately the model predicts the dependent variable. The mean squared error for the model predicting turnout based only on the mean value of turnout is 95.82. Can we get the mean squared error to decrease by adding independent variables to produce predictions that are contingent means instead of just a single overall mean? That’s a primary goal of research.
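The steps just described (error, total error, mean squared error) can be sketched as follows. The four turnout values are made up to keep the arithmetic easy to follow; they are not taken from the election data, and the mean squared error here is the plain average of the squared errors.

```python
def errors(observed, predicted):
    """Signed error for each observation: positive when the data point is higher."""
    return [o - p for o, p in zip(observed, predicted)]


def total_absolute_error(observed, predicted):
    """Sum of the lengths of the vertical lines."""
    return sum(abs(e) for e in errors(observed, predicted))


def mean_squared_error(observed, predicted):
    """Average of the squared errors."""
    errs = errors(observed, predicted)
    return sum(e ** 2 for e in errs) / len(errs)


# Made-up turnout values, predicted by their own overall mean.
observed = [48.0, 52.0, 55.0, 45.0]
overall_mean = sum(observed) / len(observed)
predicted = [overall_mean] * len(observed)

print(errors(observed, predicted))                # [-2.0, 2.0, 5.0, -5.0]
print(total_absolute_error(observed, predicted))  # 14.0
print(mean_squared_error(observed, predicted))    # 14.5
```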
One more step allows us to interpret R-squared. R-squared compares the mean squared error from a regression that does include a set of independent variables to the mean squared error from a regression that doesn’t include any independent variables and only makes a prediction based on a mean (like our simplest equation, Turnout=50.6).
The figure above shows vertical lines drawn in red to the regression line described by the equation Turnout = 164.11 - (0.058*Year). If you compare this figure to the previous one, you can see that the mean squared error is mildly reduced (from 95.82 to 91.92). How can I conclude that the reduction in error is only “mild”? It is by comparing the mean squared error for this model (91.92) to the mean squared error generated by just the mean (95.82) that we obtain the statistic called R-squared. R-squared is the proportion (or, if you multiply it by 100, the percentage) of error in your dependent variable that is removed by taking into account the independent variables in your regression equation. R-squared will always have a value between 0 (indicating that the independent variables explain nothing) and 1 (indicating that the independent variables explain everything). An equation without any independent variables must have an R-squared of 0, since you’re comparing the equation to itself. The R-squared statistic of 0.0407 for the equation Turnout = 164.11 - (0.058*Year) tells us that just 4.07% of the variation in voter turnout is explained by the progression of years. That’s better than 0%, but not by too much.
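The comparison that produces R-squared can be written as one line of arithmetic: one minus the ratio of the model’s mean squared error to the mean-only mean squared error. The sketch below applies that formula to the two figures quoted in the text; the function name `r_squared` is an illustrative label.

```python
def r_squared(mse_model, mse_mean_only):
    """Proportion of the mean-only error that the model removes."""
    return 1 - mse_model / mse_mean_only


# Model with "year" versus the mean-only model, using the lecture's figures.
print(round(r_squared(91.92, 95.82), 4))  # 0.0407

# A model with no independent variables is compared to itself.
print(r_squared(95.82, 95.82))  # 0.0
```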
We could do better. Look again at the R-squared statistic when we include year and also the variable describing whether it is a presidential election year. Adding both of these variables to our regression equation reduces prediction error by a whopping 72.97%, an impressive feat. Clearly, whether it’s a presidential election year matters a whole lot when we’re trying to explain variation in voter turnout in the United States.
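The same formula can be run in reverse to see roughly how much error remains in the two-variable model. The lecture does not report this mean squared error, so the value below is only what the quoted R-squared of 0.7297 would imply, assuming it was computed against the same mean-only error of 95.82.

```python
def implied_mse(r2, mse_mean_only):
    """Mean squared error implied by an R-squared value: MSE = (1 - R2) * baseline."""
    return (1 - r2) * mse_mean_only


# Implied by the lecture's figures, not reported in the lecture itself.
remaining_error = implied_mse(0.7297, 95.82)
print(round(remaining_error, 1))  # 25.9
```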
The Regression Table
With two or more independent variables and one dependent variable, it is quite challenging to represent regression results in a two-dimensional graph. For that reason (and for reasons of brevity), most often regression results are presented in the form of a table. A regression table presents all the types of relevant information considered above (plus one new type of information) without representing it graphically. Let’s look at a table of regression results for a previous analysis of voter turnout over a different 50-year period, 1960-2010:
The table above displays regression results for three models describing voter turnout between 1960 and 2010, plus a model (Model 1) that only includes the overall mean turnout. Each of the three models adds at least one new independent variable to the model before it. By the convention of regression tables, the equation for each model is read downward in a column, and an empty cell in a table means the variable was not included in a model. Another convention is the inclusion of “N,” the number of observations included for analysis in each regression; in this case N=26, the number of election years between 1960 and 2010. Model 4 introduces two new dichotomous variables:
- “Presidential Party Switch in Previous Election,” which equals 1 if the previous election saw a change in the party of the elected President and equals 0 otherwise
- “Congressional Party Switch in Previous Election,” which equals 1 if the previous election saw a change in the majority party in either house of Congress and equals 0 otherwise.
These variables reflect the hypothesis that party switches should lead to higher turnout in the next election, a hypothesis in turn supported by the theory that changes in the balance of power prompt those turned out of power to mobilize in an effort to reverse the result.
Based on what we have learned so far, we can conclude the following:
- In Models 2, 3 and 4, year is negatively related to turnout. That is, as time passes, turnout declines.
- There is a positive relationship between having an election during a presidential election year and voter turnout. That is, more voters turn out when it is the year of a presidential election.
- Including the variable for presidential election year dramatically increases the explanatory power of the regression. The independent variables in Model 2 explain a whopping 87% of the variability in voter turnout. One might also say that these independent variables reduce the error in prediction by 87%.
- Introducing variables for party switches in power increases the explanatory power of the regression by only one percentage point, not a large increase. While a presidential party switch increases turnout as our hypothesis predicts, a party switch in Congress decreases turnout, contradicting our hypothesis.
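The table conventions described above (one column per model, read downward, with an empty cell for a variable that is not included) can be sketched in Python. Because the 1960-2010 table is not reproduced in this text, the sketch uses the three equations estimated earlier for 1900-2016 instead. The labels “Model A,” “Model B” and “Model C” and the function names are made up for this illustration.

```python
# Each model is one column: a mapping from row label to the reported value.
models = {
    "Model A": {"Intercept": 50.6, "R-squared": 0.0, "N": 59},
    "Model B": {"Intercept": 164.11, "Year": -0.058,
                "R-squared": 0.0407, "N": 59},
    "Model C": {"Intercept": 155.843, "Year": -0.058,
                "Presidential Election": 16.253,
                "R-squared": 0.7297, "N": 59},
}
rows = ["Intercept", "Year", "Presidential Election", "R-squared", "N"]


def cell(model, row):
    """Text for one cell; blank means the variable is not in that model."""
    value = model.get(row)
    return "" if value is None else str(value)


def format_table(models, rows):
    """Lay the models out side by side, one row per variable or statistic."""
    names = list(models)
    lines = [" | ".join(["Variable"] + names)]
    for row in rows:
        lines.append(" | ".join([row] + [cell(models[n], row) for n in names]))
    return lines


def slopes(model):
    """Read one column downward, keeping only the independent variables."""
    skip = {"Intercept", "R-squared", "N"}
    return {name: value for name, value in model.items() if name not in skip}


for line in format_table(models, rows):
    print(line)

print(slopes(models["Model C"]))
```

Reading the last column downward recovers the equation Turnout = 155.843 - (0.058 * year) + (16.253 * presidential election), while the blank cells in the first column show that the mean-only model has no slopes at all.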
In quantitative science enterprises from genetics to agriculture to economics to sociology, the technique of regression is used in some form to describe the relation between independent variables and dependent variables. Therefore, it is crucial for anyone who will have contact with these or other professional fields (from law to medicine, from social work to investment banking) to be able to read and understand regression results in the form of a regression table.
In this lecture, we have reviewed the elements of a regression — a central tool for quantitative data analysis — and discussed rules for the interpretation of a regression. These elements, which include intercepts, slopes and R-squared statistics, allow a knowledgeable reader to describe relations between variables in the context of probability sampling (which is itself a subject for another lecture).
DIY Activity #9: Contingent Means in Politics
As you know from earlier discussion, completion of this DIY activity is optional, although regardless you are responsible for mastering the content of this lecture. If you choose to participate in this optional DIY Activity, I would like you to try your hand at interpreting a table of multiple regression results. The following regression table reports the actual results of a multivariate regression model in which the dependent variable is turnout (operationalized as the percent of the voting-age population of a country that votes) and the independent variables are (operationalized as) and (operationalized). Thirty-four nations are included in the dataset, which combines Pew Research Center data and World Bank statistics.
The dataset used to generate the regression table looks like this:
|Country||Year of Election||% of Voting-Age Population who Voted||Voting Compulsory?||Infant Mortality Rate||GDP Per Capita (US, $1000)|
Variables are operationalized as follows:
- Country name: name of country in the dataset
- Turnout (dependent variable): % of voting-age persons who voted in a recent election
- Year: year of election
- Voting Compulsory?: Dichotomous independent variable in which 1 indicates a national law making voting compulsory in the country, and in which 0 indicates no such law.
- Infant Mortality Rate: number of deaths of infants below the age of 1 in a nation for every 1,000 infants below the age of 1 living in that nation
- GDP Per Capita: all the economic value created by production inside the borders of the nation in the most recent year, plus product taxes, minus subsidies. Measured in currency units converted to thousands of U.S. dollars for purposes of comparison.
Your task is to effectively and accurately interpret the meaning of each slope value in the regression table drawn from this data set. Turn in your report by visiting the “DIY Activities” area of our course Blackboard page and uploading your paper using the “DIY Activity #9: Contingent Means in Politics” link.