## Categorical data

Terms covered: Contingency Table; Confidence Interval for a Proportion; Confidence Interval for the Difference Between Two Proportions; Expected Frequencies; Observed Frequencies; Chi-Squared Goodness of Fit Test; Chi-Squared Test of Association; Chi-Squared Test of Homogeneity

Contingency Table

A contingency table is a way of summarising the relationship between variables, each of which can take only a small number of values. It is a table of frequencies classified according to the values of the variables in question.

When a population is classified according to two variables it is said to have been 'cross-classified' or subjected to a two-way classification. Higher-order classifications (three-way and beyond) are also possible.

A contingency table is used to summarise categorical data. It may be enhanced by including the percentages that fall into each category.

What you find in the rows of a contingency table is contingent upon (dependent upon) what you find in the columns.
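As an illustration of the definition above, the following minimal sketch builds a two-way contingency table from paired categorical observations and adds the percentage of the total that falls into each cell. The data and the function names (`contingency_table`, `as_percentages`) are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical paired observations: (payment method, customer type).
observations = [
    ("card", "new"), ("card", "returning"), ("cash", "new"),
    ("card", "returning"), ("cash", "returning"), ("card", "new"),
    ("cash", "returning"), ("card", "returning"),
]

def contingency_table(pairs):
    """Return (row_labels, col_labels, counts) for a two-way classification."""
    cell_counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    counts = [[cell_counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, counts

def as_percentages(counts):
    """Express each cell as a percentage of the grand total."""
    total = sum(sum(row) for row in counts)
    return [[100.0 * cell / total for cell in row] for row in counts]

rows, cols, table = contingency_table(observations)
percent = as_percentages(table)
for label, count_row, pct_row in zip(rows, table, percent):
    print(label, count_row, [f"{p:.1f}%" for p in pct_row])
```

Percentages could equally be taken within each row or each column; which is most useful depends on the comparison being made.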

Confidence Interval for a Proportion

A confidence interval gives us a range of values which an unknown population parameter (such as a mean, variance or proportion) is likely to take, based on a given set of sample data.

Sometimes we are interested in the proportion of responses that fall into one of two categories. For example, a firm may wish to know what proportion of its customers pay by credit card as opposed to cash; the manager of a TV station may wish to know what percentage of households in a certain town have more than one TV set; a doctor may be interested in the proportion of patients who benefited from a new drug as opposed to those who did not. In each of these examples, a confidence interval for a proportion specifies a range of values within which the true population proportion may lie.

The procedure for obtaining such an interval is based on the sample proportion, p, that is, the proportion of a sample drawn from the overall population that falls into the category of interest.
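The text does not say which procedure is intended. One common choice, assumed here, is the normal-approximation (Wald) interval, p ± z·sqrt(p(1 − p)/n), which is reasonable for large samples but can behave poorly when n is small or p is close to 0 or 1. The function name and the sample figures below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a population proportion.

    Assumption: the sample is large enough for the normal approximation.
    """
    p = successes / n                                   # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical sample: 120 of 400 customers paid by credit card.
low, high = proportion_ci(120, 400)
print(f"95% interval: {low:.3f} to {high:.3f}")
```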

Confidence Interval for the Difference Between Two Proportions

A confidence interval gives us a range of values which an unknown population parameter (such as a mean, variance or proportion) is likely to take, based on a given set of sample data.

Many occasions arise where we have to compare the proportions of two different populations. For example, a firm may want to compare the proportions of defective items produced by different machines; medical researchers may want to compare the proportions of men and women who suffer heart attacks. In each of these examples, a confidence interval for the difference between two proportions specifies a range of values within which the difference between the two true population proportions may lie.

The procedure for obtaining such an interval is based on the sample proportions, p1 and p2, computed from samples drawn from the two respective populations.
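As a sketch of one possible procedure (an assumption, since the text does not name one), the large-sample normal-approximation interval for p1 − p2 uses the standard error sqrt(p1(1 − p1)/n1 + p2(1 − p2)/n2) and treats the two samples as independent. The function name and the defect counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def difference_ci(x1, n1, x2, n2, confidence=0.95):
    """Normal-approximation interval for the difference p1 - p2.

    Assumptions: two independent samples, both reasonably large.
    """
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    std_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * std_error, diff + z * std_error

# Hypothetical data: machine A gave 30 defectives in 500 items,
# machine B gave 18 defectives in 600 items.
low, high = difference_ci(30, 500, 18, 600)
print(f"95% interval for p1 - p2: {low:.4f} to {high:.4f}")
```

An interval that excludes zero, as this one does, suggests that the two population proportions differ.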

Expected Frequencies

In contingency table problems, the expected frequencies are the frequencies that you would predict ('expect') in each cell of the table, if you knew only the row and column totals, and if you assumed that the variables under comparison were independent.
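Under independence, the expected frequency for a cell is (row total × column total) / grand total. The short sketch below applies that rule to a hypothetical 2 × 2 table; the function name is illustrative only.

```python
def expected_frequencies(observed):
    """Expected cell counts under independence, from the margins alone."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical observed table (rows and columns are two categorical variables).
observed = [[10, 20],
            [30, 40]]
expected = expected_frequencies(observed)
print(expected)
```

Note that the expected table has the same row and column totals as the observed table.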

Observed Frequencies

In contingency table problems, the observed frequencies are the frequencies actually obtained in each cell of the table, from our random sample. When conducting a chi-squared test, the term observed frequencies is used to describe the actual data in the contingency table.

Observed frequencies are compared with the expected frequencies, and large differences between them suggest that the model expressed by the expected frequencies does not describe the data well.

Chi-Squared Goodness of Fit Test

The Chi-Squared Goodness of Fit Test compares the observed data from a sample with a theoretical distribution, such as the Normal or Poisson.
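The usual test statistic is the sum over categories of (observed − expected)² / expected. As a minimal sketch, the example below uses a discrete uniform distribution (a fair die) rather than the Normal or Poisson, purely to keep the arithmetic simple. The data are hypothetical, and the degrees of freedom assume that no parameters were estimated from the sample.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: 60 rolls of a die, counts for faces 1 to 6.
observed = [8, 12, 9, 11, 10, 10]
# Theoretical distribution assumed here: each face equally likely.
expected = [sum(observed) / 6] * 6

statistic = chi_squared_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1   # no parameters estimated from the data
critical_value = 11.070                  # chi-squared table, 5% level, 5 d.f.

print(statistic, degrees_of_freedom)
print("reject fit" if statistic > critical_value else "no evidence against fit")
```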

Chi-Squared Test of Association

The Chi-Squared Test of Association allows the comparison of two attributes in a sample of data to determine if there is any relationship between them.

The idea behind this test is to compare the observed frequencies with the frequencies that would be expected if the null hypothesis of no association (statistical independence) were true. Assuming the variables are independent allows us to predict an expected frequency for each cell in the contingency table.

If the value of the test statistic for the chi-squared test of association is too large, it indicates poor agreement between the observed and expected frequencies, and the null hypothesis of independence (no association) is rejected.
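The comparison described above can be sketched as follows. The statistic is the sum of (observed − expected)² / expected over all cells, with (rows − 1) × (columns − 1) degrees of freedom. The table and the function name are hypothetical, and the decision is made against a tabulated 5% critical value rather than a computed p-value.

```python
def association_test(observed):
    """Return (statistic, degrees_of_freedom) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected under independence
            statistic += (o - e) ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical 2 x 2 table of two attributes recorded on the same sample.
observed = [[10, 20],
            [30, 40]]
statistic, dof = association_test(observed)
critical_value = 3.841   # chi-squared table, 5% level, 1 d.f.

print(round(statistic, 4), dof)
print("reject independence" if statistic > critical_value
      else "no evidence of association")
```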

Chi-Squared Test of Homogeneity

Sometimes several proportions need to be tested simultaneously. A more complex situation arises when several populations have all been classified according to the same variable. We do not generally expect the proportions to be equal across all the classes within a population. We do, however, quite often need to test whether the proportion in each class is the same across all the populations. If this holds for every class, we say the populations are homogeneous with respect to the variable of classification. The test used for this purpose is the Chi-Squared Test of Homogeneity, with hypotheses:
H0: the populations are homogeneous with respect to the variable of classification,
against
H1: the populations are not homogeneous.
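As a minimal sketch of how this test might be carried out, each row below is a separate sample from one population and each column is a class of the common classifying variable. Under H0 every population shares the same class proportions, estimated here by pooling all samples. The data, names and the use of a tabulated 5% critical value are assumptions made for illustration.

```python
def homogeneity_test(samples):
    """Return (statistic, degrees_of_freedom); rows are populations."""
    sample_sizes = [sum(row) for row in samples]
    class_totals = [sum(col) for col in zip(*samples)]
    grand_total = sum(sample_sizes)
    # Pooled estimate of the class proportions common to all populations under H0.
    pooled = [t / grand_total for t in class_totals]
    statistic = 0.0
    for n, row in zip(sample_sizes, samples):
        for o, p in zip(row, pooled):
            e = n * p                      # expected count if populations are homogeneous
            statistic += (o - e) ** 2 / e
    dof = (len(samples) - 1) * (len(class_totals) - 1)
    return statistic, dof

# Hypothetical samples of 100 households from each of three towns,
# classified as (more than one TV set, at most one TV set).
samples = [[30, 70],
           [45, 55],
           [60, 40]]
statistic, dof = homogeneity_test(samples)
critical_value = 5.991   # chi-squared table, 5% level, 2 d.f.

print(round(statistic, 3), dof)
print("reject H0" if statistic > critical_value else "do not reject H0")
```

The arithmetic is the same as for the test of association; the difference lies in the sampling design, since here the sample size from each population is fixed in advance.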