Published on February 25, 2020 by Rebecca Bevans. Revised on December 14, 2020.
Linear regression is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total squared error of the model.
There are two main types of linear regression:
- Simple linear regression uses only one independent variable
- Multiple linear regression uses two or more independent variables
In this step-by-step guide, we will walk you through linear regression in R using two sample datasets.
Download the sample datasets to try it yourself.
- Simple regression dataset
- Multiple regression dataset
Table of contents
- Getting started in R
- Load the data into R
- Make sure your data meet the assumptions
- Perform the linear regression analysis
- Check for homoscedasticity
- Visualize the results with a graph
- Report your results
Getting started in R
Start by downloading R and RStudio. Then open RStudio and click on File > New File > R Script.
As we go through each step, you can copy and paste the code from the text boxes directly into your script. To run the code, highlight the lines you want to run and click on the Run button on the top right of the text editor (or press ctrl + enter on the keyboard).
To install the packages you need for the analysis, run this code (you only need to do this once):
install.packages("ggplot2")
install.packages("dplyr")
install.packages("broom")
install.packages("ggpubr")
Next, load the packages into your R environment by running this code (you need to do this every time you restart R):
library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)
Step 1: Load the data into R
Follow these four steps for each dataset:
- In RStudio, go to File > Import dataset > From Text (base).
- Choose the data file you have downloaded (income.data or heart.data), and an Import Dataset window pops up.
- In the Data Frame window, you should see an X (index) column and columns listing the data for each of the variables (income and happiness or biking, smoking, and heart.disease).
- Click on the Import button and the file should appear in your Environment tab on the upper right side of the RStudio screen.
After you’ve loaded the data, check that it has been read in correctly using summary().
Simple regression
summary(income.data)
Because both our variables are quantitative, when we run this function we see a table in our console with a numeric summary of the data. This tells us the minimum, median, mean, and maximum values of the independent variable (income) and dependent variable (happiness):
Multiple regression
summary(heart.data)
Again, because the variables are quantitative, running the code produces a numeric summary of the data for the independent variables (smoking and biking) and the dependent variable (heart disease):
Step 2: Make sure your data meet the assumptions
We can use R to check that our data meet the four main assumptions for linear regression.
Simple regression
- Independence of observations (aka no autocorrelation)
Because we only have one independent variable and one dependent variable, we don’t need to test for any hidden relationships among variables.
If you know that you have autocorrelation within your data (i.e. multiple observations of the same test subject), then do not proceed with a simple linear regression! Use a structured model, like the linear mixed-effects model sketched below, instead.
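For illustration, here is a minimal sketch of such a model, assuming a hypothetical subject column identifying repeated observations and the lme4 package, which is not otherwise used in this guide:
install.packages("lme4")   # only needed once
library(lme4)
# A random intercept per subject accounts for repeated observations of the
# same person ('subject' is a hypothetical column name, not in income.data)
mixed.lm <- lmer(happiness ~ income + (1 | subject), data = income.data)
summary(mixed.lm)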
- Normality
To check whether the dependent variable follows a normal distribution, use the hist() function.
hist(income.data$happiness)
The observations are roughly bell-shaped (more observations in the middle of the distribution, fewer on the tails), so we can proceed with the linear regression.
- Linearity
The relationship between the independent and dependent variable must be linear. We can test this visually with a scatter plot to see if the distribution of data points could be described with a straight line.
plot(happiness ~ income, data = income.data)
The relationship looks roughly linear, so we can proceed with the linear model.
- Homoscedasticity (aka homogeneity of variance)
This means that the prediction error doesn’t change significantly over the range of prediction of the model. We can test this assumption later, after fitting the linear model.
Multiple regression
- Independence of observations (aka no autocorrelation)
Use the cor() function to test the relationship between your independent variables and make sure they aren’t too highly correlated.
cor(heart.data$biking, heart.data$smoking)
When we run this code, the output is 0.015. The correlation between biking and smoking is small (a coefficient of 0.015 indicates almost no linear relationship), so we can include both parameters in our model.
- Normality
Use the hist() function to test whether your dependent variable follows a normal distribution.
hist(heart.data$heart.disease)
The distribution of observations is roughly bell-shaped, so we can proceed with the linear regression.
- Linearity
We can check this using two scatterplots: one for biking and heart disease, and one for smoking and heart disease.
plot(heart.disease ~ biking, data=heart.data)
plot(heart.disease ~ smoking, data=heart.data)
Although the relationship between smoking and heart disease is a bit less clear, it still appears linear. We can proceed with linear regression.
- Homoscedasticity
We will check this after we make the model.
Step 3: Perform the linear regression analysis
Now that you’ve determined your data meet the assumptions, you can perform a linear regression analysis to evaluate the relationship between the independent and dependent variables.
Simple regression: income and happiness
Let’s see if there’s a linear relationship between income and happiness in our survey of 500 people with incomes ranging from $15k to $75k, where happiness is measured on a scale of 1 to 10.
To perform a simple linear regression analysis and check the results, you need to run two lines of code. The first line of code makes the linear model, and the second line prints out the summary of the model:
income.happiness.lm <- lm(happiness ~ income, data = income.data)
summary(income.happiness.lm)
The output looks like this:
This output table first presents the model equation, then summarizes the model residuals (see step 4).
The Coefficients section shows:
- The estimates (Estimate) for the model parameters – the value of the y-intercept (in this case 0.204) and the estimated effect of income on happiness (0.713).
- The standard error of the estimated values (Std. Error).
- The test statistic (t value, in this case the t-statistic).
- The p-value (Pr(>|t|)), aka the probability of finding the given t-statistic if the null hypothesis of no relationship were true.
The final three lines are model diagnostics – the most important thing to note is the p-value (here it is 2.2e-16, or almost zero), which indicates whether the model as a whole fits the data significantly better than a model with no predictors.
From these results, we can say that there is a significant positive relationship between income and happiness (p-value < 0.001), with a 0.713-unit increase (± 0.01 standard error) in happiness for every unit increase in income.
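If you would rather work with these numbers as a data frame than read them off the console, the broom package we loaded earlier can extract them. This is a minimal sketch, not part of the original walkthrough:
tidy(income.happiness.lm)     # one row per coefficient: estimate, std.error, statistic, p.value
glance(income.happiness.lm)   # one-row model summary, including r.squared and the F-test p-value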
Multiple regression: biking, smoking, and heart disease
Let’s see if there’s a linear relationship between biking to work, smoking, and heart disease in our imaginary survey of 500 towns. The rates of biking to work range between 1 and 75%, rates of smoking between 0.5 and 30%, and rates of heart disease between 0.5% and 20.5%.
To test the relationship, we first fit a linear model with heart disease as the dependent variable and biking and smoking as the independent variables. Run these two lines of code:
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
summary(heart.disease.lm)
The output looks like this:
The estimated effect of biking on heart disease is -0.2, while the estimated effect of smoking is 0.178.
This means that for every 1% increase in biking to work, there is an associated 0.2% decrease in the incidence of heart disease. Meanwhile, for every 1% increase in smoking, there is a 0.178% increase in the rate of heart disease.
The standard errors for these regression coefficients are very small, and the t-statistics are very large (-147 and 50.4, respectively). The p-values reflect these small errors and large t-statistics. For both parameters, there is almost zero probability that this effect is due to chance.
Remember that these data are made up for this example, so in real life these relationships would not be nearly so clear!
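If you also want explicit confidence intervals around these estimates rather than just standard errors, base R’s confint() provides them. A quick sketch (this step is not part of the article’s own workflow):
confint(heart.disease.lm, level = 0.95)   # 95% intervals for the intercept, biking, and smoking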
Step 4: Check for homoscedasticity
Before proceeding with data visualization, we should make sure that our models fit the homoscedasticity assumption of the linear model.
Simple regression
We can run plot(income.happiness.lm) to check whether the observed data meet our model assumptions:
par(mfrow=c(2,2))
plot(income.happiness.lm)
par(mfrow=c(1,1))
Note that the par(mfrow()) command will divide the Plots window into the number of rows and columns specified in the brackets. So par(mfrow=c(2,2)) divides it up into two rows and two columns. To go back to plotting one graph in the entire window, set the parameters again and replace the (2,2) with (1,1).
These are the residual plots produced by the code:
Residuals are the unexplained variance. They are not exactly the same as model error, but they are calculated from it, so seeing a bias in the residuals would also indicate a bias in the error.
The most important thing to look for is that the red lines representing the mean of the residuals are all basically horizontal and centered around zero. This means there are no outliers or biases in the data that would make a linear regression invalid.
In the Normal Q-Q plot in the top right, we can see that the real residuals from our model form an almost perfectly one-to-one line with the theoretical residuals from a perfect model.
Based on these residuals, we can say that our model meets the assumption of homoscedasticity.
Multiple regression
Again, we should check that our model is actually a good fit for the data, and that we don’t have large variation in the model error, by running this code:
par(mfrow=c(2,2))
plot(heart.disease.lm)
par(mfrow=c(1,1))
The output looks like this:
As with our simple regression, the residuals show no bias, so we can say our model fits the assumption of homoscedasticity.
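The checks above are visual. If you would like a formal test of homoscedasticity as well, one common option is the Breusch-Pagan test from the lmtest package; this is a sketch under the assumption that lmtest is installed (it is not among the packages loaded above):
install.packages("lmtest")   # only needed once
library(lmtest)
bptest(income.happiness.lm)   # a large p-value gives no evidence against constant error variance
bptest(heart.disease.lm)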
Step 5: Visualize the results with a graph
Next, we can plot the data and the regression line from our linear regression model so that the results can be shared.
Simple regression
Follow 4 steps to visualize the results of your simple linear regression.
- Plot the data points on a graph
income.graph <- ggplot(income.data, aes(x=income, y=happiness)) + geom_point()
income.graph
- Add the linear regression line to the plotted data
Add the regression line using geom_smooth() and typing in lm as your method for creating the line. This will add the line of the linear regression as well as the standard error of the estimate (in this case ± 0.01) as a light grey stripe surrounding the line:
income.graph <- income.graph + geom_smooth(method="lm", col="black")
income.graph
- Add the equation for the regression line.
income.graph <- income.graph + stat_regline_equation(label.x = 3, label.y = 7)
income.graph
- Make the graph ready for publication
We can add some style parameters using theme_bw() and make custom labels using labs().
income.graph + theme_bw() + labs(title = "Reported happiness as a function of income", x = "Income (x$10,000)", y = "Happiness score (0 to 10)")
This produces the finished graph that you can include in your papers:
Multiple regression
The visualization step for multiple regression is more difficult than for simple regression, because we now have two predictors. One option is to plot a plane, but these are difficult to read and not often published.
We will try a different method: plotting the relationship between biking and heart disease at different levels of smoking. In this example, smoking will be treated as a factor with three levels, just for the purposes of displaying the relationships in our data.
There are 7 steps to follow.
- Create a new dataframe with the information needed to plot the model
Use the function expand.grid() to create a dataframe with the parameters you supply. Within this function we will:
- Create a sequence from the lowest to the highest value of your observed biking data;
- Choose the minimum, mean, and maximum values of smoking, in order to make 3 levels of smoking over which to predict rates of heart disease.
plotting.data <- expand.grid(
  biking = seq(min(heart.data$biking), max(heart.data$biking), length.out = 30),
  smoking = c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking)))
This will not create anything new in your console, but you should see a new data frame appear in the Environment tab. Click on it to view it.
- Predict the values of heart disease based on your linear model
Next we will save our ‘predicted y’ values as a new column in the dataset we just created.
plotting.data$predicted.y <- predict.lm(heart.disease.lm, newdata=plotting.data)
- Round the smoking numbers to two decimals
This will make the legend easier to read later on.
plotting.data$smoking <- round(plotting.data$smoking, digits = 2)
- Change the ‘smoking’ variable into a factor
This allows us to plot the interaction between biking and heart disease at each of the three levels of smoking we chose.
plotting.data$smoking <- as.factor(plotting.data$smoking)
- Plot the original data
heart.plot <- ggplot(heart.data, aes(x=biking, y=heart.disease)) + geom_point()
heart.plot
- Add the regression lines
heart.plot <- heart.plot + geom_line(data=plotting.data, aes(x=biking, y=predicted.y, color=smoking), size=1.25)
heart.plot
- Make the graph ready for publication
heart.plot <- heart.plot + theme_bw() + labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking", x = "Biking to work (% of population)", y = "Heart disease (% of population)", color = "Smoking \n (% of population)")
heart.plot
Because this graph has two regression coefficients, the stat_regline_equation() function won’t work here. But if we want to add our regression model to the graph, we can do so like this:
heart.plot + annotate(geom="text", x=30, y=1.75, label="heart disease = 15 + (-0.2*biking) + (0.178*smoking)")
This is the finished graph that you can include in your papers!
Step 6: Report your results
In addition to the graph, include a brief statement explaining the results of the regression model.
Specifically, we found a 0.2% decrease (± 0.0014) in the frequency of heart disease for every 1% increase in biking, and a 0.178% increase (± 0.0035) in the frequency of heart disease for every 1% increase in smoking.
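To pull the exact numbers for a statement like this straight from the fitted model instead of retyping them, you can index the coefficient table. A minimal sketch using only base R:
coefs <- summary(heart.disease.lm)$coefficients
coefs["biking", c("Estimate", "Std. Error")]    # estimate and standard error for biking
coefs["smoking", c("Estimate", "Std. Error")]   # estimate and standard error for smoking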
Rebecca Bevans
Rebecca is working on her PhD in soil ecology and spends her free time writing. She's very happy to be able to nerd out about statistics with all of you.
FAQs
What are the steps to build and evaluate a linear regression model in R?
Steps to Establish a Regression
Create a relationship model using the lm() function in R. Get a summary of the relationship model to know the average error in prediction (also called residuals). To predict the weight of new persons, use the predict() function in R.
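As a minimal sketch of that workflow, with a small made-up height/weight dataset (the numbers and column names here are illustrative only):
heights <- data.frame(height = c(151, 174, 138, 186, 128),
                      weight = c(63, 81, 56, 91, 47))
weight.lm <- lm(weight ~ height, data = heights)
summary(weight.lm)                                               # includes the residuals
predict(weight.lm, newdata = data.frame(height = c(170, 160)))   # predict weights of new persons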
It consists of 3 stages – (1) analyzing the correlation and directionality of the data, (2) estimating the model, i.e., fitting the line, and (3) evaluating the validity and usefulness of the model. First, a scatter plot should be used to analyze the data and check for directionality and correlation of data.
What are the steps to finding the linear regression equation?
The formula for simple linear regression is Y = mX + b, where Y is the response (dependent) variable, X is the predictor (independent) variable, m is the estimated slope, and b is the estimated intercept.
How do you write a simple linear regression in R?
The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where: b0 and b1 are known as the regression beta coefficients or parameters: b0 is the intercept of the regression line; that is the predicted value when x = 0. b1 is the slope of the regression line.
What is linear regression in R programming?
Linear regression is used to predict the value of an outcome variable y on the basis of one or more input predictor variables x. In other words, linear regression is used to establish a linear relationship between the predictor and response variables.
How do you create a good linear regression model?
- It's important you understand the relationship between your dependent variable and all the independent variables and whether they have a linear trend. ...
- It's also important to check and treat the extreme values or outliers in your variables.
What is step by step regression?
Stepwise regression is the step-by-step iterative construction of a regression model that involves the selection of independent variables to be used in a final model. It involves adding or removing potential explanatory variables in succession and testing for statistical significance after each iteration.
What is linear regression for beginners?
Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative.
How long does it take to learn linear regression?
To truly become an expert in regression analysis, you probably need to get a master's degree in statistics, complete a program in data science, or go to school for machine learning, any of which will take you between two and four years.
What are the 6 steps to solving linear equations?
- Step 1: Simplify each side, if needed.
- Step 2: Use Add./Sub. Properties to move the variable term to one side and all other terms to the other side.
- Step 3: Use Mult./Div. ...
- Step 4: Check your answer.
- I find this is the quickest and easiest way to approach linear equations.
Examples of Linear Regression
The weight of the person is linearly related to their height. So, this shows a linear relationship between the height and weight of the person. According to this, as we increase the height, the weight of the person will also increase.
We could use the equation to predict weight if we knew an individual's height. In this example, if an individual was 70 inches tall, we would predict his weight to be: Weight = 80 + 2 x (70) = 220 lbs. In this simple linear regression, we are examining the impact of one independent variable on the outcome.
How do you write a simple regression?
Basically, the simple linear regression model can be expressed by the simple regression formula y = β0 + β1X + ε. In the simple linear regression model, we consider the modelling between one independent variable and the dependent variable.
What is the easiest method to solve a linear equation?
Graphing is one of the simplest ways to solve a system of linear equations. All you have to do is graph each equation as a line and find the point(s) where the lines intersect. Equations already written in slope-intercept form are easy to graph.
How do you solve for r in statistics?
Use the formula (zx)i = (xi – x̄) / sx to calculate a standardized value for each xi, and the formula (zy)i = (yi – ȳ) / sy to calculate a standardized value for each yi. Multiply each pair of standardized values and add the products together. Divide the sum from the previous step by n – 1, where n is the total number of points in our set of paired data. The result of all of this is the correlation coefficient r.
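To see the formula in action, here is a tiny sketch with made-up numbers, using R's scale() to standardize each variable and cor() to check the result:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
r.manual <- sum(scale(x) * scale(y)) / (length(x) - 1)   # sum of products of standardized values over n - 1
r.manual
cor(x, y)   # should match r.manual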
Which is the correct formula for simple linear regression?
Simple Linear Regression: It is a regression model that represents a correlation in the form of an equation. Here the dependent variable, y, is a function of the independent variable, x. It is denoted as Y = a + bX + ε, where 'a' is the y-intercept, b is the slope of the regression line, and ε is the error.
What function is used for linear regression in R?
Summary: R linear regression uses the lm() function to create a regression model given some formula, in the form of Y~X+X2. To look at the model, you use the summary() function. To analyze the residuals, you pull out the $resid variable from your new model.
What are the three requirements of linear regression?
Linearity: The relationship between X and the mean of Y is linear. Homoscedasticity: The variance of the residuals is the same for any value of X. Independence: Observations are independent of each other.
How do you implement simple linear regression?
- Calculate Mean and Variance. ...
- Calculate Covariance. ...
- Estimate Coefficients. ...
- Make Predictions. ...
- Predict Insurance.
How do you improve the accuracy of a linear regression model in R?
- Handling Null/Missing Values.
- Data Visualization.
- Feature Selection and Scaling.
- 3A. Feature Engineering.
- 3B. Feature Transformation.
- Use of Ensemble and Boosting Algorithms.
- Hyperparameter Tuning.
The Statistical Significance
The first step of the regression analysis is to check whether there is a statistically significant relationship between the dependent and the independent variables.
A regression model provides a function that describes the relationship between one or more independent variables and a response, dependent, or target variable. For example, the relationship between height and weight may be described by a linear regression model.
What is the first step in using a regression line?
The first step of linear regression is to test the linearity assumption. This can be done by plotting the values in a graph known as a scatter plot, to observe the relationship between the dependent and independent variable, because if the data is exponentially scattered then there is no point in creating the regression ...
How do you master linear regression?
- Modeling relationships between variables using regression.
- Understanding simple regression models.
- Implementing simple regression models in Excel.
- Implementing simple regression models in R.
- Implementing simple regression models in Python.
Regression analysis is not difficult. If you repeat it enough times you will believe it, and believing it will make it much less daunting. Right? Well, if that did not minimize your fear related to regression analyses, hopefully these quick and dirty pointers will help you out!
Why is linear regression hard?
It turns out that it can be quite difficult to do well, because the X and the Y must have a linear relationship, and the errors must be normally distributed, independent and have equal variance.
How much data is enough for linear regression?
So, how much data do we need to conduct a successful regression analysis? A common rule of thumb is that 10 data observations per predictor variable is a pragmatic lower bound for sample size.
Why does linear regression fail?
Linear and Additive: If you fit a linear model to a non-linear, non-additive data set, the regression algorithm would fail to capture the trend mathematically, thus resulting in an inefficient model. Also, this will result in erroneous predictions on an unseen data set.
What math should I learn before statistics?
1) Learn the core mathematics first, then the statistics
The key mathematics you should be familiar with are mainly linear algebra (vectors, matrices, matrix operations, eigenvalues, eigenvectors, diagonalization, simultaneous equations, etc.)
How do you do linear approximation step by step?
- Step 1: Find a suitable function and center.
- Step 2: Find the point by substituting x = 49 into f(x) = √x.
- Step 3: Find the derivative f'(x).
- Step 4: Substitute x = 49 into the derivative f'(x).
- Step 5: Write the equation of the tangent line using the point and slope found in steps (2) and (4).
- graphing.
- substitution method.
- elimination method.
In simple terms, linear regression is a method of finding the best straight line fitting to the given data, i.e. finding the best linear relationship between the independent and dependent variables.
What is a real life example of linear regression?
Linear regressions can be used in business to evaluate trends and make estimates or forecasts. For example, if a company's sales have increased steadily every month for the past few years, by conducting a linear analysis on the sales data with monthly sales, the company could forecast sales in future months.
What is a real life example of regression?
For example, it can be used to predict the relationship between reckless driving and the total number of road accidents caused by a driver, or, to use a business example, the effect on sales of spending a certain amount of money on advertising. Regression is one of the most common models of machine learning.
What are the assumptions you need to check before starting with linear regression?
- Linear relationship.
- Multivariate normality.
- No or little multicollinearity.
- No auto-correlation.
- Homoscedasticity.
Constructing a regression model. To construct a linear regression model in R, we use the lm() function.
How do you interpret R in linear regression?
The most common interpretation of r-squared is how well the regression model explains observed data. For example, an r-squared of 60% reveals that 60% of the variability observed in the target variable is explained by the regression model.
How do you calculate R² for linear regression in R?
R² = 1 − SSres / SStot
Here, SSres is the sum of squares of the residual errors, and SStot is the total sum of squares.
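As a check, here is a minimal sketch that computes R² by hand for the simple regression model fitted earlier in this guide and compares it with the value summary() reports:
ss.res <- sum(residuals(income.happiness.lm)^2)
ss.tot <- sum((income.data$happiness - mean(income.data$happiness))^2)
1 - ss.res / ss.tot
summary(income.happiness.lm)$r.squared   # should match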
How do you write a linear regression function?
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0).
How do you select variables for linear regression in R?
- Variables that are already proven in the literature to be related to the outcome.
- Variables that can either be considered the cause of the exposure, the outcome, or both.
- Interaction terms of variables that have large main effects.
- Linear Regression.
- Polynomial Regression.
- Stepwise Regression.
- Ridge Regression.
- Lasso Regression.
- ElasticNet Regression.
Essentially, an R-Squared value of 0.9 would indicate that 90% of the variance of the dependent variable being studied is explained by the variance of the independent variable.
What is a good R² value for linear regression?
For example, in scientific studies, the R-squared may need to be above 0.95 for a regression model to be considered reliable. In other domains, an R-squared of just 0.3 may be sufficient if there is extreme variability in the dataset.
What is the R² value in regression?
R² is a measure of the goodness of fit of a model. In regression, the R² coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R² of 1 indicates that the regression predictions perfectly fit the data.
How do you interpret R² in linear regression?
R-squared is the percentage of the dependent variable variation that a linear model explains. 0% represents a model that does not explain any of the variation in the response variable around its mean; in that case, the mean of the dependent variable predicts the dependent variable as well as the regression model does.
What is the R² correlation in R?
r is always between -1 and 1 inclusive. The R-squared value, denoted by R², is the square of the correlation. It measures the proportion of variation in the dependent variable that can be attributed to the independent variable. The R-squared value R² is always between 0 and 1 inclusive.
What is the coefficient of determination in linear regression in R?
The coefficient of determination (R²) is a number between 0 and 1 that measures how well a statistical model predicts an outcome. You can interpret the R² as the proportion of variation in the dependent variable that is predicted by the statistical model.
What does ls() mean in R?
The ls() function in R is used to return a vector of character strings containing all the variables and functions that are defined in the current workspace in R programming. Variables whose names begin with a dot are, by default, not returned.
What is the rm() function in R?
The rm() function in R is used to delete or remove a variable from a workspace.
What is the syntax of the lm() function?
The following is the syntax of the lm() function:
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, …)