Paired data, correlation & regression

Paired Sample t-test
Correlation Coefficient
Pearson's Product Moment Correlation Coefficient
Spearman Rank Correlation Coefficient
Least Squares
Regression Equation
Regression Line
Simple Linear Regression
Multiple Regression
Nonlinear Regression
Residual
Multiple Regression Correlation Coefficient
Stepwise Regression
Dummy Variable (in regression)
Transformation to Linearity

Paired Sample t-test

A paired sample t-test is used to determine whether there is a significant difference between the average values of the same measurement made under two different conditions. Both measurements are made on each unit in a sample, and the test is based on the paired differences between these two values. The usual null hypothesis is that the difference in the mean values is zero. For example, the yield of two strains of barley is measured in successive years in twenty different plots of agricultural land (the units) to investigate whether one crop gives a significantly greater yield than the other, on average.

The null hypothesis for the paired sample t-test is
H0: d = µ1 - µ2 = 0
where d is the population mean of the paired differences, and µ1 and µ2 are the population means under the two conditions.

This null hypothesis is tested against one of the following alternative hypotheses, depending on the question posed:
H1: d ≠ 0
H1: d > 0
H1: d < 0

The paired sample t-test is a more powerful alternative to a two sample procedure, such as the two sample t-test, but can only be used when we have matched samples.
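
As an illustration of how the test statistic is built from the paired differences, here is a minimal Python sketch. The function name paired_t_statistic and the yield figures are hypothetical and are not taken from any real trial. The statistic is the mean difference divided by its standard error, and it is referred to a t distribution with n - 1 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for the paired differences first - second."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (divisor n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical yields of two barley strains measured on the same five plots
strain_a = [5.1, 4.8, 6.0, 5.5, 5.9]
strain_b = [4.7, 4.9, 5.4, 5.0, 5.6]
t_stat, df = paired_t_statistic(strain_a, strain_b)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The p-value would then be read from a t table or computed with a statistics package; that step is left out of the sketch.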

Correlation Coefficient

A correlation coefficient is a number between -1 and 1 which measures the degree to which two variables are linearly related. If there is a perfect linear relationship with positive slope between the two variables, the correlation coefficient is 1; more generally, if there is positive correlation, high (low) values of one variable tend to go with high (low) values of the other. If there is a perfect linear relationship with negative slope, the correlation coefficient is -1; if there is negative correlation, high (low) values of one variable tend to go with low (high) values of the other. A correlation coefficient of 0 means that there is no linear relationship between the variables, although a non-linear relationship may still exist.

There are a number of different correlation coefficients that might be appropriate depending on the kinds of variables being studied.

Pearson's Product Moment Correlation Coefficient

Pearson's product moment correlation coefficient, usually denoted by r, is one example of a correlation coefficient. It is a measure of the linear association between two variables that have been measured on interval or ratio scales, such as the relationship between height in inches and weight in pounds. However, it can be misleadingly small when there is a relationship between the variables but it is a non-linear one.

There are procedures, based on r, for making inferences about the population correlation coefficient. However, these make the implicit assumption that the two variables are jointly normally distributed. When this assumption is not justified, a non-parametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate.
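
The following Python sketch shows one way r might be computed from its usual definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name pearson_r and the data are made up for illustration. The second data set is an exact quadratic relationship, chosen to show how r can be small even though the variables are clearly related.

```python
import math


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]  # exact straight line with positive slope
curved = [v ** 2 for v in x]     # exact, but non-linear, relationship

r_linear = pearson_r(x, linear)  # perfect positive linear association
r_curved = pearson_r(x, curved)  # no linear association despite exact dependence
print(r_linear, r_curved)
```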

See also correlation coefficient.

Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient is one example of a correlation coefficient. It is usually calculated on occasions when it is not convenient, economic, or even possible to give actual values to variables, but only to assign a rank order to instances of each variable. It may also be a better indicator that a relationship exists between two variables when the relationship is non-linear.

Commonly used procedures, based on Pearson's Product Moment Correlation Coefficient, for making inferences about the population correlation coefficient make the implicit assumption that the two variables are jointly normally distributed. When this assumption is not justified, a non-parametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate.
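
One common way to obtain the Spearman coefficient is to replace each value by its rank and then compute Pearson's coefficient on the ranks. The Python sketch below follows that route, giving tied values the average of the ranks they occupy. The helper names (average_ranks, pearson_r, spearman_rho) and the data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # increasing but non-linear relationship
rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(rho, r)
```

Because y always increases with x, the ranks agree exactly and the rank coefficient is 1, while Pearson's r on the raw values falls below 1.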

See also correlation coefficient.

Least Squares

The method of least squares is a criterion for fitting a specified model to observed data. For example, it is the most commonly used method of defining a straight line through a set of points on a scatterplot.
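
For a straight line, the least squares criterion chooses the intercept and slope that minimise the sum of squared vertical distances between the points and the line. A minimal Python sketch using the standard closed-form solution is given below; the function name least_squares_line and the data points are invented for illustration.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return intercept, slope


# Hypothetical scatterplot points
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
a, b = least_squares_line(x, y)
print(f"fitted line: Y = {a:.2f} + {b:.2f} X")
```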

See also regression equation.
See also regression line.

Regression Equation

A regression equation allows us to express the relationship between two (or more) variables algebraically. It indicates the nature of the relationship between two (or more) variables. In particular, it indicates the extent to which you can predict some variables by knowing others, or the extent to which some are associated with others.

A linear regression equation is usually written
Y = a + bX + e
where
Y is the dependent variable
a is the intercept
b is the slope or regression coefficient
X is the independent variable (or covariate)
e is the error term

The slope b specifies the average change expected in Y for a one-unit change in X.

The regression equation is often represented on a scatterplot by a regression line.

Regression Line

A regression line is a line drawn through the points on a scatterplot to summarise the relationship between the variables being studied. When it slopes down (from top left to bottom right), this indicates a negative or inverse relationship between the variables; when it slopes up (from bottom left to top right), a positive or direct relationship is indicated.

The regression line often represents the regression equation on a scatterplot.

Simple Linear Regression

Simple linear regression aims to find a linear relationship between a response variable and a possible predictor variable by the method of least squares.

Multiple Regression

Multiple linear regression aims to find a linear relationship between a response variable and several possible predictor variables.

Nonlinear Regression

Nonlinear regression aims to describe the relationship between a response variable and one or more explanatory variables in a non-linear fashion.

Residual

A residual (or error) represents the variation left unexplained after fitting a regression model. It is the difference between the observed value of the response variable and the value predicted by the regression model.
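
As a small illustration, the Python sketch below computes residuals as observed minus fitted values. The data are hypothetical, and the intercept and slope are assumed to have come from a least squares fit of a straight line to these same points.

```python
# Hypothetical data, and coefficients assumed to come from a least squares fit
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
intercept, slope = 0.22, 1.96

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # observed minus fitted

for xi, yi, fi, ri in zip(x, y, fitted, residuals):
    print(f"x={xi}  observed={yi:.2f}  fitted={fi:.2f}  residual={ri:+.2f}")
```

For a least squares line that includes an intercept, the residuals sum to zero apart from rounding error, which is a quick check on the arithmetic.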

Multiple Regression Correlation Coefficient

The multiple regression correlation coefficient, R², is a measure of the proportion of variability explained by, or due to the regression (linear relationship) in a sample of paired data. It is a number between zero and one and a value close to zero suggests a poor model.

A very high value of R² can arise even though the relationship between the two variables is non-linear. The fit of a model should never simply be judged from the R² value.
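
The point can be illustrated with a short Python sketch. Here R² is computed as one minus the ratio of the residual sum of squares to the total sum of squares, for a straight line fitted to data that follow an exact quadratic curve. The function names and data are hypothetical.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(y, fitted):
    """Proportion of the variability in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_residual / ss_total


x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]  # exact quadratic, so a straight line is the wrong model
a, b = least_squares_line(x, y)
fitted = [a + b * xi for xi in x]
r2 = r_squared(y, fitted)
print(f"R^2 = {r2:.3f}")  # about 0.96 despite the curvature
```

A plot of the residuals against x would show a clear curved pattern here, which is why the residuals should be examined as well as R².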

Stepwise Regression

A 'best' regression model is sometimes developed in stages. A list of several potential explanatory variables is available and this list is repeatedly searched for variables which should be included in the model. The best explanatory variable is used first, then the second best, and so on. This procedure is known as stepwise regression.

Dummy Variable (in regression)

In regression analysis we sometimes need to modify the form of non-numeric variables, for example sex, or marital status, to allow their effects to be included in the regression model. This can be done through the creation of dummy variables whose role it is to identify each level of the original variables separately.
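
A minimal Python sketch of dummy coding follows. It makes one assumption that the text above does not state: one level is chosen as a baseline and receives no dummy variable of its own, which is the usual practice when the model also contains an intercept. The function name dummy_code, the is_ prefix and the example values are hypothetical.

```python
def dummy_code(values, baseline):
    """Return one dict of 0/1 indicators per observation.

    Assumption: the baseline level gets no indicator, so a variable with
    k levels produces k - 1 dummy variables.
    """
    levels = sorted(set(values) - {baseline})
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]


marital_status = ["single", "married", "divorced", "married"]
rows = dummy_code(marital_status, baseline="single")
for status, row in zip(marital_status, rows):
    print(status, row)
```

Each indicator can then enter the regression as an ordinary numeric explanatory variable, and its coefficient is interpreted relative to the baseline level.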

Transformation to Linearity

Transformations allow us to change all the values of a variable by applying some mathematical operation. For example, we can multiply or divide every value by a constant, or take its square root. A transformation to linearity is a transformation of the response variable, the independent variable, or both, which produces an approximately linear relationship between the variables.
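
As an example, data that follow an exponential curve become linear when the response is replaced by its natural logarithm. The Python sketch below uses hypothetical data generated from Y = 2 exp(0.5 X), so that log Y = log 2 + 0.5 X, and then fits a straight line to the transformed values by least squares. The function name and the data are assumptions made for illustration.

```python
import math


def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]  # curved on the original scale

log_y = [math.log(yi) for yi in y]        # straight line on the log scale
intercept, slope = least_squares_line(x, log_y)
print(f"log Y = {intercept:.4f} + {slope:.4f} X")
```

The fitted slope recovers the growth rate 0.5 and the fitted intercept recovers log 2, up to rounding error.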