PEARSON Function

It measures the correlation between two variables defined by continuous data from a random population.

Author: Adin Lykken
Adin Lykken
Adin Lykken
Consulting | Private Equity

Currently, Adin is an associate at Berkshire Partners, an $16B middle-market private equity fund. Prior to joining Berkshire Partners, Adin worked for just over three years at The Boston Consulting Group as an associate and consultant and previously interned for the Federal Reserve Board and the U.S. Senate.

Adin graduated from Yale University, Magna Cum Claude, with a Bachelor of Arts Degree in Economics.

Reviewed By: Austin Anderson
Austin Anderson
Austin Anderson
Consulting | Data Analysis

Austin has been working with Ernst & Young for over four years, starting as a senior consultant before being promoted to a manager. At EY, he focuses on strategy, process and operations improvement, and business transformation consulting services focused on health provider, payer, and public health organizations. Austin specializes in the health industry but supports clients across multiple industries.

Austin has a Bachelor of Science in Engineering and a Masters of Business Administration in Strategy, Management and Organization, both from the University of Michigan.

Last Updated:November 20, 2023

What is the PEARSON Function?

The Pearson function is a statistical function used to calculate the Pearson correlation coefficient represented by r. It measures the correlation between two variables defined by continuous data from a random population. 

The correlation coefficient is derived from the Pearson correlation test, which determines the direction and strength of the variables giving the output in the range of -1 to 1 to indicate the strength of the association. 

What is the Pearson correlation test?

It determines the direction and strength of two variables and gives the output as a Pearson correlation coefficient. If the value of r is -1, it represents a perfect negative correlation, while a value of +1 represents a perfect positive correlation.

If the value is zero, there is no association between the two variables. 

In practice, it is not easy to see a perfect correlation, i.e., get the r value as either -1 or +1. 

When you calculate the value, the result you will get most of the time will be the decimal value between the upper and lower limit of the coefficient (i.e., it will always range from -1 to 1).

To understand it better, let’s look at an example of a correlation test. Assume four people from a random population have age(x) and income(y) as two different variables. The table looks as below:

Example 01
Name Age(x) Income(y)
Jamie 26 2800
Jonathan 44 3400
Jacob 32 3000
Nathan 28 3000

By representing their age on the X-axis and income on the Y-axis, our scatter plot will look as follows:

Graph

As you can see, our scatter plot is almost linear (linear means a straight line can be drawn through it). So if we draw a straight line through the data, the linear relationship between the data will look as:

Graph 2

This linear relationship is known as the correlation between the variables.

Pearson correlation coefficient (r)

One of the outputs returned by performing the correlation test is the correlation coefficient (r). The value of r signifies two essential traits of the correlation:

  • The direction of the correlation
  • Strength of the correlation

For our data of four people from a random population having age (x) and income (y) as two different variables, the correlation coefficient (r) is 0.969. Since the value of r is positive and close to +1, it defines a positive correlation between the two variables. 

If the value of r was negative and towards -1, it defines a negative correlation between the two variables. But that concludes the direction factor of the correlation of the variables. 

What about strength? The closer the value of r to either -1 or +1, the stronger the correlation between the values. On the other hand, the closer to 0, the value of r, the lower the correlation.

Assume that our data consist of the same age but different income values as below:

Example 02
Name Age(x) Income(y)
Jamie 26 1800
Jonathan 44 1200
Jacob 32 1500
Nathan 28 1700

The scatter plot for the data would look as below:

Chart 3

Do you notice a linear relationship between the data but is in a downward direction? 

This represents the negative correlation between the two variables, i.e., the value of the correlation coefficient is inclined towards -1. When we calculate the value of r for this data, we get the result of  -0.984.

Note

The closer the data points are toward the linear relationship, the stronger the correlation between the two values.

As we have represented how the negative and positive correlation looks on a scatter plot, it’s only fair we show what the scatter plot should look like when the correlation coefficient value is near 0.

Assume the data below with different income values:

Example 03
Name Age(x) Income(y)
Jamie 26 3500
Jonathan 44 2000
Jacob 32 2500
Nathan 28 500

The scatter plot for the data would look as below:
 

Chart 4

The data shows no linearity (we can't draw a straight line through all the points) and is random in all likelihood. When we calculate the data's correlation coefficient (r), we get the result of  -0.107. The value of r does not have any units. 

Getting the coefficient values as -1, +1, or 0 is practically difficult. However, as the values returned are in decimals, you can interpret them by applying specific cutoffs to the values to describe the strength of the correlation coefficients(r). 

For example, if the value falls between 0.00 - 0.10, it represents no correlation, while a value between 0.90-1.00 indicates a powerful correlation between the variables.

PEARSON Function formula

The correlation coefficient (r) is calculated between two variables that are represented by continuous data from a random population using the formula:

Formulax and y

Let's manually calculate the correlation coefficient to understand how this formula makes life better for analysts by saving time. Let's assume our data of four random people for their age and income.

Example 04
Name Age(x) Income(y)
Jamie 26 2800
Jonathan 44 3400
Jacob 32 3000
Nathan 28 3000

We have separated all the values from the formula and will build upon them as we calculate the coefficient value.
 

The steps to follow are:

  • Copy the values for Age and Income in the F and H columns, respectively. You can also reference them using the equal ( =) sign.
  • Calculate the mean for Age and Income using the AVERAGE function such that the average age is 32.5 and the average income is 3050
  • Subtract the value of x from its mean x as well as y from its mean y in columns J & K, respectively
  • For columns L & M, take the square of the values in columns J and K, respectively, using the formula =POWER(J3,2)
  • Multiply the values in columns J & K to get the result in column N
  • In cell L9, take the sum of all the values in column L using the formula =SUM(L3:L6)
  • Similarly, in cells, M9 and N9, take the sum of the values in M & N columns, respectively.
  • Finally, in cell C10, use the formula =N9/(SQRT(L9*M9)) to get the result for Pearson’s correlation coefficient, which results in 0.969.

The syntax for the function is:

=PEARSON(array1, array2)

where,

  • array1 = collection/set of independent values
  • array2 = collection/set of dependent values

Note

When logical or text values are supplied as a part of the array, they are ignored by the function. If the lengths of array1 and array2 are different, you may get an #N/A error.

How to use PEARSON Function in Excel?

Here comes the most awaited moment. Finally, we will use the Excel spreadsheet formula to calculate the value of r. Our data for age and income is as below:

Table

Using the formula = PEARSON(C3:C6, D3:D6), we get the result of 0.969, the same as what we had calculated manually in our Excel spreadsheet. Yep, it's that simple. 

But by now, you know the absolute basics of a correlation test and what we intend to achieve by calculating the correlation coefficient r. 

So now, even if you don't have access to your laptop, you can still grab a piece of paper and pen along with the calculator and get the same value to understand different metrics of financial statements better.

Let's see another example of calculating the value of r using the formula in Excel.

Assume that you prepare the three-statement model for Nike Inc. for the financial year-end of 2019. 

In column C, you have historical as well as the predicted 'Revenue,' while in column D, you have the expected 'Earnings per share.' 

Table

By using the formula = PEARSON(C3:C10, D3:D10) in cell D13, you will get the result of 0.807. This implies that the Revenue and Earnings per share have a linear relationship and a strong positive correlation. 

There aren't any 'outlier' predictions or a sudden spike in either revenue or EPS (for example 100% increase in revenue in 2023 followed by a -10% decrease in revenue in 2024). 

If the prediction has outlier values, the Pearson correlation coefficient deviates toward zero. 

Excel Function

If you get recurring digits or a large decimal number after using the formula, you can use the ROUND function along with the formula such that =ROUND(PEARSON(C3:C10, D3:D10), 3). 

The digit 3 means that we have rounded our value to three digits after the decimal point.  

PEARSON vs. CORREL Function

The CORREL function is similar to the PEARSON function in calculating the correlation coefficient r for two values from random populations.

However, the only advantage that the CORREL function 'held' was that it gave more accurate values than the PEARSON function in Excel versions before 2003 (perhaps because hardly anyone uses them nowadays). 

Since then, Excel has flourished tremendously as software, and if you use either function to calculate the value of r you will get the same result (without any rounding-off errors for either function).

The syntax for the CORREL function is

=CORREL(array1, array2)

where,

  • array1 = collection/set of independent values 
  • array2 = collection/set of dependent values

Note

When logical or text values are supplied as a part of the array, they are ignored by the function. If the lengths of array1 and array2 are different, you may get an #N/A error.

For example, assume that you have the below data for x & y:

Excel

We use the formula =PEARSON(B3:B18, C3:C18) in cell F8 and =CORREL(B3:B18, C3:C18) in cell F9 to get the result of 0.195788 for both the formulas.

Excel

It is ultimately your preference what function you use to find the value of correlation coefficient r, as both give the same result.

Things to remember about the PEARSON Function

You should keep some pointers when using this function:

  • If you reference arrays of different lengths, say C3:C6 as array1 and D3:D7 as array2, you will get an #N/A error since array2 has one element more than array1.
  • You might get an #DIV/0! Error if any reference arrays (array1 or array2) are empty or have a value of zero.

Researched and authored by Akash Bagul | LinkedIn

Free Resources

To continue learning and advancing your career, check out these additional helpful WSO resources: