This book is being edited.

The Contents page is inaccurate.

There are errors in the figure numbers.

There is some repetition.

There will be sections added.

The Exercises and Explanations are not there yet.


It is coming.


In the meantime, suggestions, printable or otherwise, comments and anything else can be mailed to


Table of Contents

Introduction to Statistics
   A simple self-teaching program
   Introduction
   Some assumptions
   Now to business!
Definitions, Terminology and the Jargon
   Definitions
   Data Types
      Interval data
      Ordinal data
      Nominal data
   Data Distributions
      Rectangular distribution
      Normal distribution
         Measures of Central Tendency
         Dispersion
      Binomial and Multinomial distributions
      Poisson distribution
Probability
Conditional probability
Odds and Odds Ratio
Design a Study
Observational Studies
Experimental Studies
Sampling
Controlled experimental studies
Randomization
Blinding
Scales, Scores, Indices and their Validation
Hypothesis testing
Significance
Confidence intervals
The Eyeball Test
Descriptive Statistics
Measures of central tendency
Measures of scatter
T-test
Power and Sample Size
The Z-scores
Analysis of variance
Correlation and Regression
Interpretation of the Regression output
Comparing two methods of measurement
The regression method
Analysis of Differences
Non-parametric Statistics
The Chi-square test
Testing for normal distribution with Chi-square
Some very medical statistics
Things we have not discussed
Appendix 1
Some elementary stats books
Appendix 2
Some Good Stuff on the Net



An Introduction to Statistics

for Health Care Workers



A simple self-teaching program


by T.A. Torda






“Smoking is one of the leading causes of statistics[1].”


Why another introductory statistics book?  In fact, why learn statistics at all?  Data from studies and experiments in most sciences requires summarisation in order to be communicated and readily understood. This is one of the objectives served by statistics, and one that is often forgotten.  More obviously, if there is a difference between the outcomes of one treatment and another, we need to decide whether that difference is due to chance factors or is the result of one treatment being really superior to the other. This is the other function of the science of statistics, which then enables us to reach rational conclusions and make decisions which are likely to be correct. Statistics is the cornerstone of evidence-based medicine, where we endeavour to base our clinical decisions on scientific evidence rather than on other factors such as tradition, gut feeling, personal experience or anecdote. This is a very important development in health care, which has taken about half a century to gain unopposed acceptance by rational, scientific health care practitioners.

I recall a good friend, the Professor of Anesthesiology at a prestigious U.S. university, telling me that “...if you need statistics to prove your point, don’t bother!”  That was more than 30 years ago and I suspect that if I put the same point to him today, he would say “If you don’t have statistics to prove your point, don’t bother!”  The greatest problem I see in the delivery of quality health care is to teach the public to demand proof of efficacy rather than be swayed by hearsay and testimonials.


Figure 1. A health claim.


It is an unfortunate corollary of freedom of expression that the most outrageous claims can be made in the field of health without substantiation, and little can be done to enforce any standard of proof for such claims.  Alternative medicine has become a fertile field for promoting therapies whose efficacy is unproven and generally unassessed.  Don’t misunderstand: I know there are some efficacious natural or herbal remedies and there are many careful reviews of alternative medical claims, but the vast majority come without any proof of their claims of effectiveness.  The question this begs is: what is acceptable evidence in such cases?  The answer lies in studies which carefully compare a treatment with no treatment, or with another treatment, and statistics offers the only safe way to evaluate such comparisons.

Often there are difficulties in making such assessments[2] but the issue, which is paramount to us is the place of statistical processes in evaluating experimental evidence.  It is perhaps quite unrealistic to hope for general public understanding of statistical procedures but it is not unreasonable to demand that health care professionals be able to understand simple statistics which are presented in the final evaluation of experimental or observational studies.

I am not a statistician.  I was a medical practitioner and researcher who has used and abused statistics in a long career and I think I understand enough about the use of statistics in biomedical research to pass on some of this knowledge.  I hope to enable you, the reader, to gain some understanding of the meaning and use of statistics and to do some statistical operations sensibly and correctly if you have occasion.  These are the objectives of this learning package.

Another important aim, is to present statistics as a live and, dare I say, interesting branch of the black arts.  It is a subject with controversies, disputation and disagreements among the experts, a subject which should not be condemned to the cook-book presentation which is so common in elementary statistics texts.

A third objective is to enable the reader to distinguish between cases where statistics is used from where it is abused, which, unfortunately is often the case.  Most abuses are more or less obvious, but it takes a certain mind-set to examine numbers that are presented with the appearance of authority.  I want to instill this certain critical skepticism in users of this package so that they are able to examine carefully the buckets of numbers thrown at them in scientific journals.

Most textbooks of statistics are dry, didactic and full of computational stuff which is not of great utility in these days of convenient software.  I have, in contrast therefore, tried to keep the course material easy to read, relevant and illustrated by practical examples.  I have tried to present the mathematics only where it explains underlying ideas.

Throughout this teaching program I use Excel® and its Data Analysis Pack.  It is convenient and useful for the part-time researcher, who is also interested in understanding reported statistics.  A warning, however, is appropriate here.  All statisticians to whom I have spoken have had a low opinion of Excel statistics.  It is not very accurate, it is primitive and its graphics are sometimes misleading[3]. It also lacks some important tests (the “non-parametric tests”, which we shall discuss later).  Why am I using it despite all this?  Because it is there!  You have it on your computer and you may have no other statistical programs to hand.  If, however, you become interested in statistics or you propose to do your own statistical calculations for publication, I strongly recommend you use some other platform.  There are many, many good program packages out there; SPSS, Stata, S-plus and Minitab spring to mind.  I have used them all at some time or other.  Of the many, Minitab has an inexpensive academic edition, or you may be able to obtain a student copy of S-plus. They both include pretty good graphing components (important if you are to present or publish).  The ‘R’ language has the attraction of being free and very powerful but it is quite a handful to learn and is more suitable for those who develop a serious interest in the subject.  In any case, they all read Excel files, so the latter remains a useful option for recording data.


Some assumptions:

I have written the material for the Windows operating system, Win 2000 or later, and I am using Microsoft Word and Excel® from Office 2007®, but you should have no problems using Office 97® or, for that matter, Office XP® or 2003®.  Because it is written with Microsoft Office, I expect it will run on Apple® computers as well.  In order to follow this course, you must know how to use the common functions of your operating system and how to navigate about your storage media.  You will also need to have a working knowledge of spreadsheets, including data entry, simple data manipulation and graphing.  You may find the Excel courses on the Microsoft training website useful; for those new to spreadsheets, I recommend them strongly.  You need to have the “Data Analysis Package” installed in your copy of Excel.  In Excel help, enter “data analysis package” in the search box, for help if needed.

There are many links in this document which show blue and turn purple when you have used them. Whenever I use a new term and do not explain it, I have generally inserted a link to the definition. After looking at the link, you can return to where you were in the text by clicking on the word ‘back’. Links will also take you to where explanations and answers to questions are. There are links, which will open the Excel files used in exercises and also the explanation files, but there are no automatic links back, as you only need to re-open the ‘Word’ file to return to where you left it.

Now to business!


I have found the spreadsheet Excel® very useful to tabulate research data and therefore, I have used its data analysis package quite a lot[4].  It is very simple to use and although it is often criticized, it is really quite useful.  We can make up for some of its shortcomings as we move along.  As I have said, I have three objectives.  The first is to enable you, a clinician, a user of clinical research or clinical researcher to understand and correctly interpret data presented in statistical form.  The second is to help you present or publish results of your research. Finally, if you understand the basics of statistics, you should be able to detect when statistics are misused or abused.

The School of Statistics at UCLA has as its slogan:

Statistics – The study of stability in variation.

This is very cool but does not strike me as being very informative.  I prefer to define statistics as the branch of mathematics that deals with acquiring, analyzing and drawing conclusions from data, especially conclusions which are likely to be right.  If we want a slogan, perhaps the following could serve:

Statistics turns data into information[5].

To this we can add the ability to concisely and accurately communicate information. 

Conventional elementary statistics texts present the subject in a cut and dried, (especially dried) manner.  This is how it is, and for ever more shall be so!  Well, as the song says, “…it ain’t necessarily so!...”  As I read for this text, I found out that even some very simple aspects of the subject are debated by experts.  For example, there is the debate between Bayesians and Frequentists. When faced with a piece of research, Bayesians ask: “What does this piece of research add to our knowledge of the subject?” while the Frequentists’ question is: “What does this study tell us?”  Think a little about the difference.  The Frequentists, who are in the majority among statisticians, regard a parameter (a statistic describing some characteristic of a whole population) as having a single true value.  The Bayesians consider a parameter to have a probability distribution – such as, ‘it is 95% probable that it lies between values x and y’.  They tend to analyse and report statistics differently from the Frequentists.

Another example of disputed territory is the use of parametric or non-parametric tests.  The former (parametric tests) assume that the data conform to some known probability distribution with defined characteristics.  The latter (non-parametric tests) do not make this assumption.  The former, when applied to appropriate data, tend to be ‘more powerful’, that is, more likely to discover a significant difference than the latter.  When are you entitled to use the former and when not?  The world of statistics here divides into two again: the ‘near enough is good enough’ school and the purists.  There are many other areas of contention among real live statisticians.

It is important to emphasise that the science of statistics does not deal in certainty.  If you seek certainty, go to church or join a political rally.  Statistics will only impart probabilities such as: ‘Based on this data, drug A will cure your headache quicker than drug B, with a probability of 100 to 1’, but it will never say that it will cure it quicker.

In the past, data analysis was tedious and time consuming, best kept simple.  With a computer now on every desk (and even in most University garbage bins) methods which depend upon repeated calculations with large masses of data have become easy and quick.  This has popularised new methods such as resampling, which were simply impractical 25 years ago.

Anecdotal impressions are often misleading.  We only need to look at the claims and testimonials of quack medicines or popular washing powders to be convinced of this.  Look at . I like the Hopi ear candles best.

So in order that conclusions from research may be valid, we need to make sure of four simple things:

1.     The data on which our conclusion is based is drawn from a relevant group of experimental subjects (a relevant population). For example, if we want to use a drug in patients with renal failure, information on its effects in fit volunteers would not be reliable for our purpose.

2.     The data is representative of that group. The best way to achieve this is random sampling of an adequate amount of data from within that group. What amount is adequate? Not an easy question, but we consider this when we discuss power and sample size. How best to do the sampling may also be a problem.

3.     We need to be confident that the measurements are accurate and reproducible.

4.     While we try to ensure that the conclusion we draw is valid, at the same time, we must also form an estimate of how probable it is that our conclusion is wrong (see hypothesis testing).

When D’Israeli said “There are three kinds of lies - lies, damn lies and statistics” he was not wrong.  Incomplete, misrepresented and generally bad statistics can mislead.  If a million chimpanzees pounded a million typewriters until the end of time, it is possible that one would type the whole of Hamlet. If you were shown this, could you be persuaded that Shakespeare was a chimp?  It is immensely important to be able to recognize when the data presented to us is misleading and when the conclusions attached to it are unsafe.

There are many books and computer resources on statistics, some of which are quite comprehensible to us, non-mathematicians.  In appendix 1 I name a few, which I have used.  This in no way implies that there are no other good alternatives or even possibly better ones.  I have also listed a few of the huge array of stats resources on the internet in appendix 2.

In this package or course, we shall try to discuss briefly the principles of statistics, simple applications and how to execute them in the Excel spreadsheet. The Excel package has a major limitation, as I have mentioned, in dealing only with certain types of data (parametric). We shall try to overcome this limitation by designing our own spreadsheets to do some tests, which do not come pre-programmed with Excel, although ‘add-ins’ are commercially available to fill this omission.


Exercise 1.   Meet Excel.

Open Excel and check that the Data Analysis Pack is installed:

Open Excel and click on Tools and it should be there, labeled “Data Analysis”.  If it is not, then get your Office CD ready and in Excel, go to Help (F1)/Answer wizard and type in “install data analysis”.  Follow instructions.

Do a few spreadsheet operations, like entering data, filling a column with a sequence of numbers.  Have a quick look at the Data Analysis package. When we have finished, you will be able to use most of the functions.

If you are not familiar with Excel get a simple book or do an introductory course.  Be sure you can use the graphing functions well. Graphs are fundamental in giving an overview of data.

Open Valsalva.xls (in the ‘Data’ folder) and have a good look around. This is a pretty typical research record: a spreadsheet containing data from a study where subjects blew against a resistance for 20 s and had their pulse rate and BP recorded.  When you have finished, don’t save it; in any case, don’t worry about messing it up, because you can always re-copy it from the CD.

Definitions, Terminology and the Jargon.



Here are some of the terms we shall use.

A population includes all members of a defined group. All nurses at St. Elsewhere Hospital.  All 10 year old Australian boys. If we want to know the average height of 10 year olds and perhaps the spread of their size, it is not practical to try to measure every 10 year old in the country. We resort, therefore, to sampling.

A sample is a sub-group taken from a population usually for the purpose of estimating some characteristic of that population, which would be a ‘parameter’.  For example, voters whose intention is polled in a survey, form a sample from the population of all voters.  Great care must be taken in choosing the sample if it is used for inference regarding a population. It must be representative of that whole population of interest. For example income sampling in a rich, bay-side suburb will yield a very different result from a similar sample from a high unemployment area. If sampling were carried out in either locality alone, it would hardly be representative of the whole population.

A statistic is a number, usually derived from a sample, and is descriptive of a variable. It may be used for inference.  For example, we measure the height of all 10 year olds at four randomly chosen schools and find the average to be 112 cm. We conclude that the average 10 year old in Sydney is about 112 cm. tall.  The sample statistic, average height, is used to infer the population parameter of the average height.  We shall learn how to estimate from our sample what the limits are within which the true mean of the population is likely to be. Note: “likely to be”, not “is”.

A parameter is a characteristic of a population, just as a statistic is a characteristic of a sample. For example, the mean height of all 10 year olds is a parameter. In practice, parameters are generally inferred from statistics, not actually measured, because of the obvious difficulty of making the measurements in all members of a population.(back)
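If you happen to have Python to hand (it is free, like ‘R’), the distinction between parameter and statistic can be seen in a few lines. This is only an illustrative sketch: the population of heights below is invented for the purpose.

```python
import random

random.seed(1)

# Hypothetical population: heights (cm) of 10,000 ten-year-olds.
# The numbers are invented purely for illustration.
population = [random.gauss(140, 6) for _ in range(10_000)]
parameter = sum(population) / len(population)      # the population mean

# A random sample of 100 children yields a statistic that estimates it.
sample = random.sample(population, 100)
statistic = sum(sample) / len(sample)              # the sample mean

print(f"population mean (parameter): {parameter:.1f} cm")
print(f"sample mean (statistic):     {statistic:.1f} cm")
```

Run it a few times with different seeds: the statistic wobbles about the parameter, which is exactly why we shall later need to estimate the limits within which the true value is likely to lie.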

A variable is a defined entity we use in our statistical operations. It can be a measured quantity (height, income, plasma cholesterol), an attribute (colour of cars, voting intention) or something we assign, such as a grouping variable (say, experimental group or control group), which we then go on to use to classify the measurements.

Data (singular: datum) are the recorded measurements or characteristics which are used for statistical purposes. For example, the height measurements of the individual 10 year olds make up the data set for calculation of the average height. Data can also consist of attributes, such as “male – female” or “agree strongly – agree – indifferent – disagree – disagree strongly”.

Data Types

Data which can be subjected to statistical analysis are of three broad types:

Interval data. 

Also called continuous data.  This is numeric data, where all ‘real’ values are possible, such as height in cm. I am 178 cm tall; you may be 181 cm. 201 cm is possible, and 165.35 cm is also meaningful. In some contexts (e.g. height) negatives do not make sense; in others (e.g. growth in stock value), they do.  All values must be possible, not only integers.  A limited set such as scores of 1, 2, 3 or 4, where a score of 3.5 is not defined, is not interval data.  With interval data, 2 is twice as big as 1 and 4 is twice as big as 2.  Interval data which is normally distributed (see later) is often called parametric data. Such data can be described by a measure of central tendency or ‘location’ (mean or mode or median) and a measure of dispersion (variance, standard deviation, standard error; see later, in the section on ‘Normal Distribution’).  It is not strictly correct to define parametric data as necessarily normally distributed, although this error is common.  In fact data which is distributed according to some other definable distribution is also ‘parametric’, but this is getting a little precious at this stage.

Ratio data, such as percentages, are a special case of interval data.

Numeric data can be univariate, meaning that we are dealing with a single variable, however many groups or samples that we may have.  An example would be mean blood pressure, even though we have measured this variable in 4 different groups.  Data is bi-variate, when we deal with the relation between two variables (say height and weight of subjects) and can be multi-variate (say effect of age, sex and education on salary).

Data can be cross-sectional, which is a one-dimensional data set, that is taken across a number of sources at one time or time independently such as our blood pressure data. A time series is data, such as mean blood pressure of a group of subjects recorded at weekly intervals.  Longitudinal data (or longitudinal study) refer to following changes in the study population over time.  This is also 1 dimensional.  When it is 2 dimensional, such as following mean BP readings in a number of groups, over time, it is called panel data.

Ordinal data.

This is like small, medium and large.  The values have a logical progression, but not necessarily an arithmetic relation.  A pain score of 3 is greater than 1, but not necessarily 3 times as painful.  The ‘faces scale’ used for estimating severity of pain in children is also a form of ordinal data, because the faces follow a serial trend.  ‘Scores’, so popular in assessing non-measurable things such as pain or satisfaction, are ordinal.  The ‘parametric’ statistics of Excel are not properly used for such ordinal data.  You should use ‘non-parametric’ types of statistics or tests. (If the data set is large, say >30, or the scale has many levels, say >7, and the data set passes a test for normal distribution, the error introduced by using parametric statistics on ordinal data is not great.  Many statisticians accept this practice.)

Figure 2. Faces scale for scoring pain in children[6]


We may have defined what a pain score of 0, 1, 2, 3 or 4 means, but if you average a number of scores to a value of 3.2, you have not defined what that score means.  All you can say is that it is greater than 3 and less than 4.  Thus averaging to a mean is not really a proper operation for such data; the median score would be more meaningful.  Still, the mean is used frequently and people get away with it even in scientific papers!  If the means of such scores are used to compare two groups, or to compare before and after in the same subjects, parametric tests on them may have some sensible interpretation, although when we get to ‘significance’ we shall need to reconsider this question.
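A tiny sketch (in Python, with invented scores) shows why the median is the safer summary for ordinal data: the mean lands on a score the scale never defined.

```python
from statistics import mean, median

# Hypothetical pain scores (0-4) from ten children -- ordinal data.
scores = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4]

print("mean:  ", mean(scores))     # 2.2 -- a score the scale never defined
print("median:", median(scores))   # 2.0 -- an actual point on the scale
```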

Nominal data.

Such data are attributes and are sometimes called categorical data.  Such as: “Votes yes or no”.  “The car is green, red, yellow or black”. Statistical treatment is confined to counts or frequencies: 12345 votes for Smith, 23456 for Jones. 18 green cars, 10 red etc…  Nominal data also requires non-parametric treatment.


Data Distributions: (back)

Data can be distributed in various ways.  The word distribution can refer to frequency or probability and the two, of course, are related.

Rectangular distribution:

Also sometimes called a ‘uniform’ distribution.

Any value, within a range is as likely as any other value within that range.  

Figure 3.

Comparison of the frequency distribution of 200 random numbers between -5 and 5 in a normal (Gaussian) distribution with 200 random numbers in a rectangular or uniform distribution. The left panels show the normal distribution: the upper panel the values, the lower a frequency distribution graph (histogram). On the right, the same for the rectangular distribution. Note the tendency in the normal distribution for the values to cluster near the mean.
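For those who like to experiment, the clustering shown in Figure 3 is easy to reproduce in a few lines of Python (the numbers below are freshly simulated, not those used for the figure): normal values crowd the centre, rectangular ones spread evenly.

```python
import random

random.seed(0)
N = 200

def near_centre(xs):
    """Fraction of values lying within 5/3 units of zero (one SD of the normal)."""
    return sum(abs(x) < 5/3 for x in xs) / N

# 200 values from a normal distribution chosen so ~99.7% fall between -5 and 5
normal = [random.gauss(0, 5/3) for _ in range(N)]
# 200 values from a rectangular (uniform) distribution on the same range
rect = [random.uniform(-5, 5) for _ in range(N)]

print(f"normal values near the centre:      {near_centre(normal):.2f}")   # about 0.68
print(f"rectangular values near the centre: {near_centre(rect):.2f}")     # about 0.33
```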


Normal distribution:

This is the type of distribution most frequently seen in biology and medicine. It is also called the Gaussian distribution[7].  The most frequent values lie close to the mean, and the further they are from the mean the less likely they are to occur. If graphed, the frequency curve is bell-shaped.  This distribution is defined by a complex-looking function, which is of no great importance to us (appendix 3). It is defined by two variables: an indicator of central tendency or location, the mean (μ), and a measure of dispersion or scatter, the standard deviation (σ).  It has certain characteristics which are useful.

As we have already said, the normal distribution is very common in biology and medicine and the chief characteristic is that the further values are from the mean, the less common they are. As an example, consider the height of males. The mean may be about 175 cm. You see many men about 172 cm. or 177 cm but few of 110 cm or 220 cm in height.  Another interesting characteristic is that when you plot the area under the normal distribution curve, called the cumulative density function, it will describe a sigmoid curve, so commonly seen in pharmacology, such as in dose – response curves. If the vertical axis of the plot is between 0 and 1, then this sigmoid describes the fraction of the distribution under the chosen value on the x-axis. For example if x = the mean, then y will be 0.5.


Figure 4.

The normal (Gaussian) probability function with mean = 0 and standard deviation = 1. The mean, median and mode are all the same, a characteristic of the normal distribution (upper panel), and its cumulative density function (lower panel). The CDF represents the probability of a value at or below the point chosen on the X axis. For example, the probability of encountering a value less than 0 would be 0.5.
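Python’s standard library happens to include the normal CDF, so the claim in Figure 4 can be checked directly (a sketch, not part of the Excel course):

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)    # the standard normal distribution

# The CDF gives the probability of a value at or below a chosen point.
print(z.cdf(0))                            # 0.5: half the curve lies below the mean
print(round(z.cdf(1) - z.cdf(-1), 3))      # 0.683: the share within 1 SD of the mean
```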


Measures of Central Tendency :

Mean: Is the arithmetic average, which is the sum of all the values divided by the number of such values. Mathspeak represents this as:

x̄ = (x₁ + x₂ + … + xₙ) / n = (Σxᵢ) / n

x stands for the individual values, Σ means ‘sum the observations from the first to the nth’, and n is the number of values in the data set.  The formula looks daunting but it simply means: sum all the values and divide by the number of such values.

The mean of a population is usually designated μ, and the mean of a sample x̄. The bar above a symbol usually denotes ‘mean’. This is the ‘arithmetic’ mean and there are other kinds of mean in mathematics (geometric and harmonic) but they are of no interest to us at present.

The mean is a very good measure of the central tendency of a symmetrical distribution such as the normal. The trimmed mean is another measure of central tendency, which is sometimes useful.  To protect against undue effect from a few outliers (values much larger or much smaller than the others), you can use this measure to better describe your sample by saying this is the mean of all the values except the three (or five or fifty) largest and smallest.  It is especially useful if you have a large data set. The mean is sometimes called a measure of location of the data.
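If you like to check such things by computer, here is a minimal Python sketch (the values are invented) showing how a single outlier drags the mean, and how a trimmed mean resists it:

```python
from statistics import mean

values = [10, 11, 12, 12, 13, 13, 14, 95]   # one wild outlier (95)

print(mean(values))          # 22.5 -- dragged up by the outlier

# Trimmed mean: drop the largest and smallest value before averaging.
trimmed = sorted(values)[1:-1]
print(mean(trimmed))         # 12.5 -- a better picture of the bulk of the data
```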

Median: (Back)  The middle-ranked value, with as many greater values in the set as there are lesser values.  If the data set has an even number N, than the median is a value half way between the two middle values, ranked N/2 and N/2+1. If the data set contains an odd number, than the median is the value with the rank (N+1)/2.  The median is a better measure of central tendency than the mean for skewed distributions (where there are more values above or below the mean, than on the other side).  Coupled with two percentiles, say 25th. and 75th., it describes skewed distributions better than mean and standard deviation.  It is also less affected by extreme values than the mean.
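A quick Python illustration of the ranking rules just described (invented values):

```python
from statistics import median

odd  = [3, 1, 4, 1, 5]        # sorted: 1 1 3 4 5 -> rank (5+1)/2 = 3rd value
even = [3, 1, 4, 1, 5, 9]     # sorted: 1 1 3 4 5 9 -> mean of 3rd and 4th values

print(median(odd))    # 3
print(median(even))   # 3.5
```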


Figure 5. A skew distribution.

The mean of this distribution is 0.091 but its median is 0.016 and its mode is 1.  In a skew distribution the mean and the median are not the same.  This distribution is skewed to the right. If the small values were few and large ones many, it would be skewed to the left.

Mode: Is the most frequently encountered value. A population can have more than one mode and if there are several modes in a sample, Excel will only report the first it encounters when you do “Descriptive Statistics”.


Exercise 2. Mean, median, mode.

a)            Open Means.xls.

Find the mean, median and mode of the two groups of observations, by using ‘Descriptives’ in the analysis pack (Tools/Data Analysis).

What is the difference in the means?

What is the difference between medians and modes?

Why is there quite a large difference in the means, but not in the medians?

 (Look in Explanations)


b)             Open Valsalva.xls

Find the mean, median, mode, largest and smallest values of the ‘pre-test’ and ‘minimum’ pulse rates. Don’t save the results or you will overwrite the Valsalva file.


Dispersion.(back to “Measures of Scatter)

When we talk of dispersion, we mean the scatter of the data about the mean (or median). When we consider dispersion, it is not material whether a value is greater or smaller than the mean; what we are interested in is its distance from the mean. We assess or express this as the mean of the squared deviations (this gets rid of minuses) of the data points from the mean. If we were simply to take the deviation of each value from the mean and average them, the result in a symmetric distribution would be zero. So to estimate this scatter of the data about the mean we use the variance or standard deviation (SD), or, if we are talking of the means of multiple samples, the standard error of the mean (SEM).

The variance is the mean squared deviation from the mean. We square the deviations to make all values positive. For a population

σ² = Σ(xᵢ − μ)² / N

and for a sample,

s² = Σ(xᵢ − x̄)² / (n − 1)

Why the difference? To find the mean, we must know the number of values in the sample, as well as their sum. If we know the mean and all the values but one, then the last value can be calculated and is therefore not independent. So in this case, n − 1 is the number of degrees of freedom in the sample and we use it to calculate the variance. It results in a slightly larger estimate of variance for a sample than for a population. We shall come across the concept of degrees of freedom again and again. It is sometimes represented by the Greek letter ν (nu). The abbreviation for the variance of a population is usually σ² and for that of a sample, s².

The standard deviation is simply the square root of the variance, abbreviated σ or s. It is a characteristic of the normal distribution, which describes the scatter in the sample or population. So this seemingly complicated process is undertaken to render immaterial whether the deviation is positive or negative, which does not matter when you only want an idea of the magnitude of the scatter of the individual values from the mean. To see the formula for the SD, click here. The region mean ± 1 SD (standard deviation) contains about 68%, mean ± 2 SD contains about 95% and mean ± 3 SD, about 99% of the observations in a normally distributed data set. (back). The standard deviation has little meaning if the data are not normally distributed and should not be quoted in that case.
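Both variance formulas, and the 68% rule, can be verified with a short Python sketch. The eight sample values are invented; the second part simulates a large normal sample.

```python
import random
from statistics import pvariance, variance, mean, stdev

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population formula divides by n; sample formula divides by n - 1.
print(pvariance(sample))   # 4.0   (divide by n = 8)
print(variance(sample))    # ~4.571 (divide by n - 1 = 7, slightly larger)

# Empirical rule: in a large normal sample, mean +/- 1 SD holds ~68% of values.
random.seed(2)
data = [random.gauss(0, 1) for _ in range(10_000)]
m, s = mean(data), stdev(data)
within_1sd = sum(m - s < x < m + s for x in data) / len(data)
print(round(within_1sd, 2))   # about 0.68
```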

Coefficient of variation (CV) is a useful way of scaling the SD. A statement such as SD = 32 has little meaning unless we know whether we are dealing with a sample mean of 10 or of 1,000. In the former case, the scatter would be huge; in the latter, very tight. So if we express the SD in units of the sample mean, it tells us immediately whether we are dealing with a large or a small dispersion. The CV can be expressed as a fraction, or a percentage:

CV = s / x̄, or 100 × s / x̄ %

Using this measure, the CV of the first distribution (s = 32; x̄ = 10) is 3.2; in the second case (x̄ = 1,000), it is 0.032.
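These definitions are easy to check outside Excel. Here is a minimal sketch in Python rather than a spreadsheet; the sample values are invented purely for illustration:

```python
import math

# A small made-up sample (hypothetical values, purely for illustration)
values = [46, 52, 39, 61, 48, 55, 43, 50]

n = len(values)
mean = sum(values) / n

# Sample variance divides by the degrees of freedom, n - 1, not by n
variance = sum((x - mean) ** 2 for x in values) / (n - 1)
sd = math.sqrt(variance)   # standard deviation
cv = sd / mean             # coefficient of variation, as a fraction

print(mean, variance, sd, cv)
```

Excel’s VAR() and STDEV() functions use the same n – 1 formulas; VARP() and STDEVP() are the population versions.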

Standard error of the mean is the equivalent of the SD when dealing with sample means. It depends on the number of values in the sample as well as on their scatter:

SEM = s / √n

Of course, the larger the sample, the more it will resemble the population, and the smaller the SEM will be.

The SEM, sometimes called just SE, represents the uncertainty surrounding the population mean in the same way that the SD represents the uncertainty of individual values in the sample. From a normal population, the means of samples will be distributed about the population mean in a normal manner and the population mean ± 1 SEM will contain about 68% of the sample means; ± 2 SEM, 95% and ± 3 SEM 99%. Conversely, the sample mean ± 2 SEM is 95% probable to contain the population mean and ± 3 SEM, 99% probable.

It is a very important property of sampling, that whatever the distribution of the underlying population may be, the means of the samples drawn from that population will be distributed normally about the mean of the population. This is part of the central limit theorem, and underlies a method called bootstrapping, which we shall discuss much later.
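You can see this property at work with a short simulation, sketched here in Python rather than Excel; the population and sample sizes are arbitrary choices of mine:

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# A decidedly non-normal (rectangular) parent population
population = [random.uniform(0, 100) for _ in range(100_000)]

# Draw many samples of size 25 and record their means
means = [statistics.mean(random.sample(population, 25)) for _ in range(2000)]

# The scatter of the sample means should be close to SD / sqrt(n), the SEM
sd_of_means = statistics.stdev(means)
predicted_sem = statistics.pstdev(population) / 25 ** 0.5

print(sd_of_means, predicted_sem)
```

A frequency histogram of `means` comes out bell shaped even though the parent population is rectangular.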

Standard errors of other statistics than the mean can be calculated and are very useful when trying to determine the likely population value of the statistic.

Kurtosis[8] is the property of a bell-shaped curve which describes whether it is flatter or ‘peakier’ than the normal distribution. Skewness, which we have already met, describes asymmetry. Both are discussed in the Explanations to Exercise 3b. They describe deviations from the normal distribution, but they are not very important statistics.


Binomial and Multinomial distributions:

Binomial is the distribution you meet when an event can have one of two outcomes (heads – tails, alive – dead, etc…), whose probabilities are p and q. If p = q, the frequency distribution is symmetrical and bell shaped, but if p ≠ q, the distribution will be skewed. The sum of the two possible probabilities, (p + q), must equal 1. That is to say, it is a certainty (probability = 1) that one of the two outcomes will occur. We’ll consider this in detail in the next section on probability.

Multinomial distributions are cases where more than 2 outcomes are possible, for example the throws of a die, which can have 6 outcomes with probabilities p, q, r, s, t and u; (p + q + r + s + t + u) = 1, that is, again, the probabilities of all the outcomes add up to 1. The probabilities of each outcome may be equal, as is the case with an honest die, or they may be unequal, like the winning probabilities of horses in a race.


Poisson distribution

This is a distribution of discrete events in time or space, such as the number of goals in a soccer match or the number of lamp-posts per kilometre. It describes events which we can count, but whose non-occurrences we cannot count. If you really want to see the mathematical expression for this, look in Appendix 3, but I would not bother. In that equation, m is the mean (average number of occurrences per unit time, or distance, etc…), but in the Data Analysis Pack it is called λ (lambda).

Figure 6.  A Poisson distribution with a mean of 5. This could be the probability of a given number of telegraph poles per kilometre, or such.
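If you would rather compute Poisson probabilities than look them up, the formula is short enough to sketch in a few lines of Python (m is the mean, as above):

```python
import math

def poisson_prob(k, m):
    """Probability of observing exactly k events when the mean count is m."""
    return math.exp(-m) * m ** k / math.factorial(k)

# The probabilities plotted in Figure 6 (mean of 5)
probs = [poisson_prob(k, 5) for k in range(16)]
print([round(p, 3) for p in probs])
```

In Excel, the same values come from the POISSON() worksheet function.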


Exercise 3: Distributions and descriptive statistics

a)              Open the file (Distrib.xls)  The columns in the file are:

A:  Subject number; B:  Uniform distribution; C:  Normal distribution; D:  Binomial distribution (no. of blue-eyed people in a sample of 5); E:  Poisson distribution with λ (the mean) set to 3.

Have a look at the graphs, on sheet 2 (Graphs) of this file. First look at the scatter of the points, then at the frequency histograms for each type of distribution. 

What are the features of note? (Explanation)


b)              Now let’s look at the Descriptive Statistics function in Excel.

Still in Distrib.xls, open the data analysis pack and select descriptive statistics (Tools/Data Analysis/Descriptive Statistics). We’ll do the stats on the normal distribution values. So:

Select the tick box for Labels in First Row, select or enter the data range C1 to C101, put the output in a new worksheet ply, select ‘Summary statistics’ and a Confidence Level for Mean of, say, 95%. If you wish, select, say, the 5th (or another) smallest and largest value and click OK. The results will be on Sheet 3, which you could rename ‘Descriptives’.

Have a good look at what you have calculated. Read my comments in (Explanations, 3b).



Before we discuss the binomial and multinomial distributions further, let us think a little about probability, the subject that should be dear to the heart of every honest punter.  It all started when the Chevalier de Méré, who was the sort of high roller who gets the red-carpet treatment at every casino, put questions about his chances at the gambling table to the mathematician Blaise Pascal.  These original foundations have been greatly expanded and refined by Fermat, Huygens, Bernoulli and other great mathematicians.

To stay with gambling, if you look at the roulette wheel, the European wheel has 37 slots (0 and 1 to 36), and the American, 38 (0, 00, 1–36). The chance, therefore, of winning on a single-number bet is 1 in 37 or 1 in 38, respectively.  Yet the house only pays 35 to 1 for a win.  So what are a punter’s chances of emerging a winner from the casino?  In the long run, about the chance of a snowball in hell, and that is the reason the casino can offer shows and cheap eats.  That little edge (about 2.7% on the European wheel and 5.3% on the U.S. wheel) pays for it.  Except probably for blackjack, the house edge is even greater for other games.

Now, to get serious, let us define probability as the likelihood of the occurrence of a random event.  Toss a coin and either a head or a tail will come up.  The probability that one or the other of the two outcomes occurs is a certainty, and we write a certainty as:

p = 1

With an unbiased coin the probability of heads is equal to that of tails, and we can write:

p(head) =  p(tail) = 0.5

And that it will come up a head or a tail is the sum of the two probabilities:

p(head or tail) = 0.5 + 0.5 = 1

We can now take the next step and say that if the probability of heads is p(head), then the probability of not-heads is (1 – p(head)); that is, the probability of tails, p(tail) = (1 – p(head)).

If you look at a die, there are six possible outcomes and the total probability of the outcome being one of the six is 1. The probability of any one outcome is, of course 1/6 = 0.1667.

These outcomes, head or tail, or the possible results of a single throw of the die are mutually exclusive.

With a die the probability of a throw of 1 is 0.1667, then the probability of throwing any other number, 2 or 3 or 4 or 5 or 6, is (1 – 0.1667) = 0.8333.

What is the probability of throwing less than 4? Well, that is the same as saying the probability of a 1 or a 2 or a 3. So, p(<4) = p(1) + p(2) + p(3) = 0.1667 + 0.1667 + 0.1667 = 0.5.

If we look at the probability of outcomes A or B, then we sum the probabilities of each:

p(A or B) = p(A) + p(B)

Now let us consider sequences of events. What is the probability of tossing 2 heads in a row? The probability of throwing a head is 0.5. Then we flip again and the probability of getting heads again is 0.5. So in half the cases we were successful on the first flip and in half of those we shall be successful on the second. So the probability of heads twice in a row is half of one half,

p(2 heads) = 0.5 x 0.5 = 0.25

For situations where we are calculating the probability of event A and event B, we multiply the probabilities.
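The two rules can be checked with exact fractions; here is a quick sketch in Python:

```python
from fractions import Fraction

# Addition rule, for mutually exclusive outcomes: p(A or B) = p(A) + p(B)
p_face = Fraction(1, 6)                    # any single face of a die
p_less_than_4 = p_face + p_face + p_face   # a 1, or a 2, or a 3
print(p_less_than_4)                       # 1/2

# Multiplication rule, for independent events: p(A and B) = p(A) x p(B)
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads
print(p_two_heads)                         # 1/4
```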

We flip the coin 4 times. What are the possible outcomes? The first can come down heads or tails, so can the second etc…Let us count the possibilities:

4 heads – there is only 1 way this can happen, four throws of heads.

p(4 heads) = 0.5 x 0.5 x 0.5 x 0.5 = 0.0625

3 heads, 1 tails, in any order: Let us consider the sequence of 3 heads, 1 tails:

p(3 heads, 1 tails) = 0.5³ x 0.5 = 0.0625

but if we are not particular about the order in which the 3 heads and 1 tails occur, it can happen 4 ways: the 1st, 2nd, 3rd or 4th toss can be the tails. So the probability of 3 heads and 1 tails in any order is

4 x 0.0625 = 0.25

2 heads, 2 tails – this can happen in any one of 6 ways, work it out: hhtt, htht, htth, thht, thth, tthh. The total probability, in any order, is

6 x 0.5² x 0.5² = 0.375

1 heads, 3 tails – 4 ways, is similar to 3 heads, 1 tails, or 0.25.

4 tails – 1 way, 0.0625

There are altogether 16 possible sequences after 4 tosses (2⁴ = 16) and if we look at the probabilities, they are

4 heads – 1/16 =  0.0625

3 heads, 1 tails – 4/16 = 0.25

2 heads, 2 tails – 6/16 = 0.375

1 heads, 3 tails –  4/16 = 0.25

4 tails – 1/16 = 0.0625

Sum the probabilities and they come to 1! I told you so: if you sum the probabilities of all the possible outcomes, p = 1.
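Rather than counting sequences by hand, you can enumerate all 16 of them; a sketch in Python:

```python
from itertools import product
from collections import Counter

# All possible sequences of four coin tosses: 2 x 2 x 2 x 2 = 16
sequences = list(product("ht", repeat=4))

# Tally the sequences by their number of heads
ways = Counter(seq.count("h") for seq in sequences)
print([ways[k] for k in range(5)])  # ways to get 0..4 heads: [1, 4, 6, 4, 1]

# Every sequence has probability 0.5 to the 4th power, and they sum to 1
total = sum(ways.values()) * 0.5 ** 4
print(total)
```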

A quicker way to express all this is to say that if p is the probability of one outcome and q that of the other, p + q = 1 and p = (1 – q); then the probability of any outcome can be calculated by expanding the function (p + q)ⁿ, where n is the number of trials:

For 3 trials,

(p + q)³ = p³ + 3p²q + 3pq² + q³

Meaning that for three trials of an experiment whose two possible outcomes have probabilities p and q, results of either 3 p’s or 3 q’s can occur one way each, while 2 p’s and a q can occur three ways, as can 2 q’s and a p. If p has a probability of 0.5, as with coin tosses, the probability of three heads or of three tails will each be 0.5³ = 0.125; 2 heads and 1 tails in any order will have a probability of 3 x 0.5 x 0.5 x 0.5 = 0.375, as will 2 tails and 1 heads. Sum these probabilities for a check, and they do come to 1.0.

The calculation is still similar if the probabilities of the two outcomes are not the same. The probability of one p and 2 q’s in 3 tries will be 3 x p x q². For any number of trials, n, and any probabilities p and q (remember, p + q = 1), the calculations are similar. The general expression for the probability of k occurrences of the p outcome in n trials is

P(k) = [n! / (k!(n – k)!)] x pᵏ x qⁿ⁻ᵏ

where, again, n is the number of trials and the term in square brackets counts the number of orders in which the k occurrences can arise.
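This general expression translates directly into a few lines of Python, using the standard library’s comb, which is the binomial coefficient n!/(k!(n – k)!):

```python
from math import comb

def binomial_prob(k, n, p):
    """Probability of exactly k occurrences in n trials, each with probability p."""
    q = 1 - p
    return comb(n, k) * p ** k * q ** (n - k)

# Reproduce the three-toss coin probabilities worked out above
print(binomial_prob(3, 3, 0.5))  # three heads: 0.125
print(binomial_prob(2, 3, 0.5))  # two heads, one tails, in any order: 0.375
```

The formula works just as well when p and q differ, which is what the exercises below ask you to try.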

Now let us complicate matters. You toss your coin and it comes down heads. If you toss once more, what are the odds of having both tosses come down heads? Well if you were asked what the odds were to get two heads in a row, you know the probability would be 0.25, but this is not what you were asked. Having the first toss in the past has made that a certainty. So what is the probability of heads on the next toss? Well you know it is 0.5! The coin has no memory. If the first toss had been tails, what would be the probability of two heads with the next toss? Zero, of course.

Conditional probability

Come let’s play poker! You don’t know how? How did you spend your time at Uni?

Anyway, the simplest form of the game consists of being dealt 5 cards. You make a bet and you can throw away up to 4 cards to improve your hand and a corresponding number will be re-dealt to you. This is the ‘draw’. Further betting follows. The aim is to get the best hand and win the pot. But let us get back to probabilities and work out what the probability of holding a hand with 3 aces is, after the draw. This can come about in 4 ways:

1.     No aces dealt, but 3 in 4 cards on the draw.

2.     1 ace dealt, 2 more in the draw.

3.     2 aces dealt, 1 more in the draw.

4.      3 aces dealt, none in the draw.

Case 1. You have seen 5 cards, there are 47 left unknown and there are 4 aces in there. The chance of a dealt card in the draw being an ace is 4/47 = 0.085[9]. You draw 4 cards and need to know the probability of receiving 3 aces, which is

4 x 0.085³ x 0.915 ≈ 0.002

Case 2. One ace dealt, and you draw 4 cards for 2 more. The chance of an ace in the first 5 cards is about 0.279. The chance of an ace in the draw is 3/47. You draw 4 more cards and need 2 aces, a probability of 0.021[10]. That makes the probability of this case about 0.006.

Case 3.  The chances of 2 aces in 5 cards are about 0.047. Then you have to draw 1 more ace in 3 cards: a probability of 0.117, giving this case a probability of about 0.005.

Case 4. The probability of being dealt 3 aces in the first 5 cards is 0.004. We’ll take the simple case of not considering improving to 4 aces.

The overall probability of ending up with 3 aces is the sum of the probabilities of the 4 ways in which this can happen: 0.002 + 0.006 + 0.005 + 0.004 ≈ 0.017.


Exercise 4. Probability exercises:

a)    If the probability of blue eyes is 0.3, what is the probability of taking a sample of 5 persons and finding 4 with blue eyes? (Explanations\Exercise 4.doc)

b)    You throw 2 dice. What is the probability of a double 6?

c)     You throw 2 dice. What is the probability of throwing 7?

d)    You throw a die twice. What is the probability of the two throws totaling 7?

e)    You throw a die twice. The first throw is 3. What is the probability of a total of 7 when you take the second throw?


Odds and Odds Ratio

Odds are another way of presenting probability. Familiar territory for us mug punters. The odds are the ratio of the probability that the event of interest will occur to the probability that it will not. The probability that a throw of a die produces a 6 is 1/6 = 0.1667. The odds are 1 to 5. As an example, take the association between hay fever and eczema in children[11]:


                        Eczema, Yes     Eczema, No      Total

Hay Fever, Yes              141             928          1069

Hay Fever, No               420           13525         13945

Total                       561           14453         15014
What is the probability of hay fever in children with or without eczema?

The probability of hay fever in children with eczema: 141/561 = 0.251, about 1 in 4.

The odds are 141 to 420, about 1 to 3.

The probability of hay fever in children without eczema: 928/14453 = 0.064, a little more than 1 in 16.

          The odds are 928 to 13525, about 1 to 14 or 15.

A number of different comparisons can be made:

  1. The difference between the proportions or probabilities: 141/561 – 928/14453 = 0.187, or 18.7%. A child with eczema has an 18.7 percentage-point higher probability of having hay fever than a child without.
  2. The ratio of the proportions, (141/561)/(928/14453) = 3.914. This is the relative risk. A child with eczema has almost 4 times the probability of hay fever of a child without.
  3. The odds ratio: (141/420)/(928/13525) = 4.893.

We review these statistics in a later chapter.
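The three comparisons are simple arithmetic on the four cells of the table; a sketch in Python using the counts above:

```python
# Cells of the 2 x 2 table: hay fever (yes/no) by eczema (yes/no)
a, b = 141, 928      # hay fever: with eczema, without eczema
c, d = 420, 13525    # no hay fever: with eczema, without eczema

p_with = a / (a + c)      # probability of hay fever given eczema
p_without = b / (b + d)   # probability of hay fever without eczema

risk_difference = p_with - p_without   # about 0.187
relative_risk = p_with / p_without     # about 3.914
odds_ratio = (a / c) / (b / d)         # about 4.893

print(round(risk_difference, 3), round(relative_risk, 3), round(odds_ratio, 3))
```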

Design a Study


To do effective research you must set a precise goal for your study: a definite, defined and limited objective. For example, “to compare the heights of normal Sydney school-boys with those of girls of the same age, in age cohorts, between 10 and 13 years” – not “to study the height of 10 to 13 year old boys and girls”. Work out the best way to get your data, then to record it and to evaluate it. This involves:

1.     Selection of the study population.

a.     You could use the classes for appropriate ages in co-educational schools, preferably not confined to one geographic area.

b.     Obtain permission from education authorities, schools, parents and relevant Ethics Committees.

c.     Define “normal” in order to screen the children according to your definition.

2.               Decide how many children you will need for the study. This is not an easy question and we shall deal with it when we discuss power and sample size.

3.               Define your method of measurement.

4.               Prepare the study record.

5.               Define the reporting statistic and your method of evaluation. The reporting statistic is the principal variable upon which you focus your study. In this case it would be height, in centimetres, but in other studies it could be a categorical outcome variable, such as alive or dead, or a scale measure such as a pain score. In any case, it needs to be chosen before starting the study. You can also define other variables (secondary objectives) which you wish to record and evaluate, for example the children’s weight, but the principal reporting statistic must be defined before the study is started.

6.               And now, finally you can get on with it!

Of course, you will make certain that the data is representative of the population of interest, not only of a sub-set.  This is never easy and the better you do it, the wider will be the application of the study.  Any possible limitation imposed by the selection of the study population is usually called external validity.  If all your children came from Sydney schools, the conclusion may not apply to Perth children and even less to Kabul.

From our point of view, the important thing will be to decide, at the end of our data gathering and evaluation, whether the differences we found between the two groups (boys and girls) could easily happen by chance in populations of the same mean height, or whether they show that if we were to look at the whole population of boys and girls in the relevant age groups, we could expect them to differ.  In statsspeak: are the differences between the heights of boys and girls, in age groups between 10 and 13 years of age, significant?  In this context, significant means unlikely to arise by chance, and hence ‘true’ or representative for the population under consideration. This is usually expressed as a p (probability) value; this p is the probability of making an error in accepting the result as true or valid, which is also called a type 1 error.  If p = 0.05, that means that such a result will arise by random chance once in 20 experiments, so there is a 1 in 20 chance of error if we accept the result from the sample as being representative of the population.  Conversely, in 19 instances out of 20, our conclusion in accepting the result as representative will be correct.  The probability which is chosen as significant is usually denoted α (alpha), and an α of 0.05 is often accepted as the upper limit of a significant result.  That is, if p < 0.05, the result is accepted as significant, at least statistically.  This does not guarantee, however, that it is of practical importance.  It is important to realise that this is a very arbitrary limit. The probability of a type 1 error is not very different between studies whose outcome probabilities are 0.049 and 0.051, yet the former will usually be called significant and the latter not!

This concept is so important, that we should look at it from a different perspective.  If we take two samples from a single population, their means are likely to be different, but what interests us is the probability whether the two samples we compared in our research, say 12 year old boys versus girls, come from a single population (girls and boys are of the same height) or two different populations with different distribution of heights. So we look for the probability that two samples from the same population have a difference as great as we found, or greater. If this probability is less than 0.05, we think it likely that we are dealing with two distinct populations.

If p > 0.05 it is often arbitrarily discarded as not significant which is sometimes confused with proof that there is no difference due to the tested treatment, or factor. This is not necessarily true. If you are comparing two samples, this ‘not significant’ result can be due to

1.               Difference between the means is too small

2.               The scatter of values within the groups is too large

3.               The sample is too small

The cure may be larger samples or better selection of subjects, to reduce scatter. This is important:

“Lack of proof of difference is not proof of lack of difference”

To translate this into English, the fact that a study fails to show a significant result does not prove that there is no difference.  Perhaps a study twice the size would have yielded a significant result.  Before you set out on a study, you need to estimate the number of subjects needed.  This we shall discuss in some detail, later.  It is also wise to plan, before you start your investigation, whether you will examine the results at any time before the conclusion of the study. This may protect you from going on with a study of a treatment which is worse than the control, or from trialing on well after the value of the treatment can be accepted.  If you do not plan these reviews ahead, you may be accused of opportunism if your study is terminated early, or of wasting resources and even lives if it is continued too long.

There are two types of studies: Observational and experimental.  In the former we just measure and record (as we did with heights in the example above), in the latter we ‘manipulate’ the sample (say treat one group and give placebo to another).(back)

Observational Studies

These are studies of the type of: How long does a light globe last? We record the time of each globe in our sample to failure, we analyze our data and we derive conclusions.  Measuring survival after diagnosis of a disease is statistically a similar exercise.  What happens to blood pressure with age?  What is the normal plasma creatinine?  No testing of a treatment is involved, no manipulation of a study group is done and no specific comparison group is used.

Experimental Studies

These are studies where we do something to our sample(s).  We compare treatment A with no treatment, or with treatment B and it can involve comparisons of more than 2 groups (for example, treatment A, dose1; treatment A, dose2; and a control group).  What is very important is that the only difference between the compared groups is the tested factor (the treatment or whatever).  There are several strategies to help with this.

1.               Use tightly selected groups: we test Viagra against placebo in males between the ages of 40 and 50, who are in good health but have erectile problems.  We exclude the obese, diabetics and those with a history of psychiatric problems.  Tight selection criteria may result in some difficulty in amassing sufficient subjects but tend to reduce the scatter within groups, which may help reach a significant result.  They may, of course, also limit the applicability of the result (external validity); in our example, the result would not necessarily be transferable to 70 year olds with heart disease.

2.               Use large groups. The larger the sample, the more it will resemble the population but large studies are usually expensive and take a long time to complete.

3.               Paired studies. Subject 2 is paired with subject 1 on the basis of similar characteristics (age, sex, body weight, blood pressure, whatever is relevant…).  This type of study may have other advantages, if pairs are matched, the difference in the study variable between the two subjects is subjected to statistical testing rather than the mean results obtained from groups. The number of subjects needed to produce a significant result is often reduced by this strategy.

4.               Crossover studies. Each subject receives all studied treatments, at some time. This eliminates differences between study and control groups completely except for the studied factor but other problems exist. The effect of treatment 1 may persist into the time period of treatment 2, called the carry-over effect, or the sequence of the treatments may affect the outcome.

5.               Latin square is a method of organizing trials to eliminate the effects of unwanted factors. Suppose we want to try 3 analgesic methods after an operation (treatments A, B, and C), but we know that with the passage of time pain gets less, and therefore the third method tried is likely to seem more effective than the first.  We need to eliminate this effect. Each subject then receives all three treatments, but in an order which makes certain that each treatment is used first as often as it is second or third:

                    Period 1    Period 2    Period 3

Sequence 1              A           B           C

Sequence 2              B           C           A

Sequence 3              C           A           B

Of course, subjects must be assigned randomly and without bias to the treatment sequences. 

6.               Sequential designs are trial designs whose objective is to stop a trial as soon as a proposed degree of probability (p value) is reached, or a pre-set probability of ‘no difference’ is reached. This method is economical of resources but can only be used where outcomes are available immediately. Decisions take the form of ‘A preferred’, ‘B preferred’ or ‘no difference’, and the outcome of each pair of subjects is entered on a special chart.

7.               Replication.  The repeated measurement of a variable in subjects is called replication. For example, the effect of a drug on blood pressure, measured monthly for a year. It is also a form of replication when the same subjects are exposed to both/all treatments as in crossover studies.

8.               Stratification can be used to overcome imbalance between study groups. For example, the treated group has more females than males whereas the reverse is true of the controls. Reporting the results for the two sexes separately may make interpretation easier.

9.               Other designs. There are many complex designs possible, which may be appropriate to some circumstances.  The best design is always the simplest one, which gives the greatest certainty (lowest p value) with the minimum expenditure of resources.


The objective of most studies is to draw conclusions concerning a population, although we only observe a subset of it.  One way to ensure that the sample is representative of its parent population is to sample randomly, meaning that every member of the population has an equal chance of being in the selected sample.  This is often difficult, sometimes impossible. Let us assume that you are trialing a drug in Sydney, Australia.  Your sample is unlikely to contain individuals from Uzbekistan or Patagonia, whose responses to the drug may differ from those in your sample.  You should state clearly in your report what the sampled population was (from Sydney, Australia) and how you selected the individuals in the sample.  This may again limit the applicability, or external validity[12]. Internal validity refers to the elimination of confounding factors within your sample.  Neither is usually a ‘yes – no’ quantity, but a gradation from good to bad.

But is your sample even representative of your locality?  Did you ask the local family medical practitioners to refer the patients for your study?  Well you have just excluded those who visit natural therapy practitioners and those who do not use the services of a physician.  These could conceivably bias your sample, if they react differently from those included.

Let us look at different ways of sampling.  The TV rating companies deliberately sample in a non-random manner.  They tailor their sample to the demographics of interest rather than universality.  Their interest is focused on target groups of their advertisers.  Those who are hard to influence or have no spending power are of lesser interest.  They also need to be able to point to specific groups which advertisers wish to target.  This is targeted sampling.  This has parallels in drug testing.  If you are testing a drug to reduce blood-sugar levels, you may wish to target diabetics, not normals.

The most common sampling technique, however, is to grab what is available.  Send the clipboards to the local mall, or ask patients presenting in the E.R. to participate, ignoring considerations of randomization.  This is convenience sampling, so common in bio-medical research.  When reading research papers I am deeply suspicious of studies which do not describe their sampling method.


Controlled experimental studies.

The objective of a study or trial is often to see if some variable (a treatment or procedure) is better than no treatment or than another treatment. It is desirable that the only difference between the study groups is the treatment; all other factors need to be kept as similar as possible between the two (or more) comparison groups. The untreated (or old-treatment) group is called the control group. When reporting such studies it is good practice to supply data to demonstrate that the groups were similar in characteristics such as severity of disease, age, sex, and whatever else may be relevant. As we have already noted, crossover studies are ideal in eliminating differences. Sometimes comparison is made between the studied group and some other group not included in the study, such as a ‘similar’ group studied last year (a historical control) or a ‘similar’ group studied by another worker. Because we cannot be certain of the similarity or dissimilarity of the study and control groups, such comparisons have less credibility than randomised controlled studies.


It is essential that the two (or several) groups be similar, so bias[13] must not be introduced in the selection of subjects.  To this end, study subjects should be assigned at random to the study groups (treated, placebo etc…) at the time of entry into the study.  There are accepted ways of achieving this, and it must be planned before beginning the study.  Often tables of random numbers are used: say, even numbers go to group 1 and odd to group 2.  Next subject, next number… This is fine, but you can end up with dissimilar numbers in the two groups, which reduces the power of the study. One convenient way to eliminate this inconvenience is to randomize in pairs, so that you assign subjects two at a time: if the random number is odd, the first subject goes into group 1 and the second into group 2, and if the random number is even, vice versa. The random numbers generated by Excel are not really random but very close; they are generated by a complex algorithm which takes its starting point from the date and time (unless you specify another seed for the random number generator). By specifying the same seed it is possible to produce the same sequence of numbers more than once, so if you want unpredictable numbers it is best to leave the seed to Excel.  There is a web site which will generate all sorts of random numbers for you, including equal groups[14]
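The pair-wise scheme described above is easy to sketch outside Excel. Here is a hypothetical Python version (the function name and details are my own illustration, not a standard routine, and not a substitute for a properly documented randomization procedure):

```python
import random

def randomize_in_pairs(n_subjects, seed=None):
    """Assign subjects to groups 1 and 2 two at a time, guaranteeing equal groups."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_subjects // 2):
        if rng.random() < 0.5:
            groups += [1, 2]  # 'odd' number: first of the pair to group 1
        else:
            groups += [2, 1]  # 'even': vice versa
    return groups

assignment = randomize_in_pairs(20, seed=42)
print(assignment.count(1), assignment.count(2))  # always equal: 10 10
```

Because each pair contributes one subject to each group, the two groups can never differ in size.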


Bias in a scientific study is defined as: “Any process which tends to produce results or conclusions that differ systematically from the truth”

To avoid the dreaded bias, the evaluator must not know whether the subject belonged to the treatment or control group.  Only then is the evaluation safe from prejudices and preconceptions.  This is the reason for placebo treatments, as treated v. untreated groups would otherwise be obvious.  It is also possible for subjects to introduce bias especially if the response to treatment is reported rather than measured, such as “how severe was the pain?”.  For this reason it is best if the subjects are also left ignorant as to whether they are in the treatment or placebo groups.  The greatest weight is always placed on research where the subjects are assigned randomly to each study group and the subjects, as well as the evaluators do not know to which group they belong. This then is the randomized, double blind study, the holy grail of comparative studies.  If there is a third person involved, such as a treating doctor or a person making a measurement, blinding should also be applied to him/her, making the study ‘triple blind’ but this term is not often used.


Exercise 5.Random Numbers

 Look at the file random.xls The first column (A) is 20 random assignments of 0 or 1. Note, not necessarily 10 of each. The second (Col. B) is random integers between 0 and 20. Excel can generate random numbers by a function (=RAND()) or by a more complex program in Data Analysis.

a)       How could you count the exact number of 0’s and 1’s if you had a large data set?

Use the Random Number Generator in the Data Analysis Pack to:

b)       Generate a set of 100 random numbers in a uniform distribution, between 1 and 100, with 2 decimal places. Put these in Col. C.

c)       How do you generate integers, rather than decimals?

d)       Generate a normally distributed set of 100 numbers with a mean of 50 and an SD of 17. Put this in Col. D of the spreadsheet. How many values lie outside the 3 SD range?

e)       Generate a set of 100 random ‘0’s and ‘1’s, in about equal numbers to assign 100 subjects to study and control groups. How many 0’s and 1’s did your set contain? Delete this set when finished.

Save your file with the new sets of random numbers in Cols. C and D.

 How could you get exactly 50 each of 0’s and 1’s, in random order? (Explanations)
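If you want to see the balanced-assignment trick outside Excel, here is a sketch in Python (the function name and seed argument are mine, purely for illustration):

```python
import random

def balanced_assignment(n_per_group=50, seed=None):
    """Exactly n_per_group 0's and 1's, in random order."""
    rng = random.Random(seed)
    groups = [0] * n_per_group + [1] * n_per_group
    rng.shuffle(groups)      # random order, but the counts stay exact
    return groups

groups = balanced_assignment(50)
print(groups.count(0), groups.count(1))   # always 50 50
```

One Excel equivalent of the same idea is to put 50 zeros and 50 ones in a column, pair each with =RAND() in the next column, and sort on the random column.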

Scales, Scores, Indices and their Validation.[15]

In general, the quantitation of measurable data such as height, plasma rhubarb concentration or duration is not a problem.  Standards of measurement are available and their precision is known, or at least can be estimated.  Variables that are not measured directly but are derived from other measured values, for example cardiac output (calculated from temperature or absorbance), are less straightforward, but cardiac output at least has a physical existence, even if the indirect measure makes some assumptions and uses approximations.

This is not the case when we are faced with assessments which lack a physical reality but are constructs or subjective estimates.  Examples would be pain scores, anxiety or depression scales, or indices of quality or satisfaction.  These are usually ordinal variables, where each value has a defined meaning and there is a logical progression, but no mathematical relation between the scores can be established.  Proof is required that these ordinal variables measure what we want them to measure, and do so in some ordered and inclusive way.

Linear analog scales are an exception to the ordinal nature of such measurements.  They are probably familiar to you: say, a 100 mm line where one end is identified as “no pain” and the other as “as much pain as you can imagine”, or whatever is appropriate for your variable.  Your subject marks his/her assessment on this line.  A slider on a scale is a slightly more sophisticated version of the same.  This yields an assessment which is a length and is therefore an interval variable.  If the resultant variable is normally distributed, or reasonably close to it, it is appropriate to use parametric methods to evaluate the data.  This is generally accepted, even though proof that the scale is indeed linear is usually lacking.

But to return to the question of validating scales and scores: this is complex and has a number of facets.

1.     Face validity: Does the scale indeed measure what we want it to measure?  A pain score should measure pain, not disability, distress or something else.

2.     Content validity: Does the scale measure all aspects of the variable?  For example, in a phobia scale we ask questions about dogs, cats, spiders and scorpions, but forget heights or crowding.

3.     Construct validity: Tests whether the scale has the relationships we think appropriate for it to have. Does our depression scale relate appropriately to suicide attempts?  Does the anxiety scale we constructed increase in a group of students before an exam and decrease after it?

4.     Internal consistency: Refers to the relationship of the questions within the scoring system.  They should relate sensibly to one another and not measure the same thing over and over.  An example would be QOLRAD, a 25 question scoring system designed to quantify the impact of gastrointestinal problems on quality of life.

5.     Test-retest reliability: Does the scoring system yield consistent results on retesting, if the tested variable has not changed?

6.     Observer reliability: The test result should vary little in the hands of different observers.

Obviously validation of a scale or index is complex and it is all too common to see the bland statement “validated scale” or score, with or without a reference, or any details of what validation has been performed.

Hypothesis testing


Hypothesis testing is the most common way of analysing experimental results.  A hypothesis is proposed, usually “that the treated and the control groups come from the same population”, which means that any difference between them has arisen by chance or, more precisely, by sampling error only. This is called a null hypothesis, abbreviated H0, and implies that the treatment has made no difference.  We then calculate the probability that this is true.  The lower this probability, the less likely it is that the proposition (the null hypothesis) is true.  The threshold probability below which we reject the null hypothesis is called α, in statspeak.  What we actually test when comparing groups is how likely it is that the two (or more) samples are from the same population and differ only because of sampling error.  The alternative hypothesis to H0 is usually called H1. In this case, it would state that the two samples do not come from the same population.  The two hypotheses are, of course, mutually exclusive: the samples either do or do not come from the same population. Thus:

Two types of error can be made: you can reject a true null hypothesis or accept a false one.

                             Null Hypothesis True      Null Hypothesis False

Accept Null Hypothesis       Correct Decision          Type 2 Error

Reject Null Hypothesis       Type 1 Error              Correct Decision


A type 1 error is rejecting the null hypothesis when it is true. Another way of saying the same thing is that we do not accept that the groups come from the same population when in fact they do: we have rejected a true hypothesis. That happens when the probability of the tested groups coming from the same population, and therefore of there being no difference between them, is low (p < 0.05 is usually accepted as low enough), but this result was a fluke (well, sampling error, if you like) and in reality the samples do come from the same population. If we were to draw more samples or repeat the study, the results would most likely not show such a significant difference.

A type 2 error is the converse: accepting the null hypothesis when it is false, and so rejecting a finding as not significant when the two groups really do come from different populations. An illustration may make this clearer, because understanding this is important.

Figure X

The curves represent the frequency distribution of the means of samples drawn from two normally distributed populations. The first population has a mean of 0, the second, 2; both have SD of 1. The mean of our sample was 1 and its SD 1 also. The probability of that sample coming from the first population is quite low – p < 0.001, but let us suppose it does, which is obviously possible. On the basis of the low p value we reject this. We have rejected a true null hypothesis, a type 1 error. If the sample came from the second population, equally possible, the null hypothesis would be false. Were we to accept this, it would be a type 2 error.


The chance of making a type 2 error (accepting a false null hypothesis, or in other words failing to find a significant difference) is commonly the result of having too few subjects. The likelihood of a type 2 error can (and should) be calculated when planning a study. It is related to four factors:

1.               The number of subjects in the samples - larger samples reduce the probability of such errors.

2.               The difference between the sample means – The smaller the difference, the more likely the error.

3.               The scatter (SD) within the groups – big scatter makes it difficult to demonstrate a difference.

4.               The desired level of significance (α): the smaller α, the less likely we are to demonstrate a significant difference at that level.

The power of a study is the probability of detecting a real difference of a postulated size, which equals (1 - the probability of a type 2 error). Another way of thinking about this is to ask: if there really is a difference of a postulated size between the two groups, how likely are we to find it?  Statspeak calls the probability of a type 2 error β, so power = (1 - β). Without calculating the numbers needed to make this error unlikely, it is possible to undertake quite hopeless studies, because the number of subjects needed to show a significant result is too large for the resources available. We shall look at ways of calculating the power of a study, or of estimating the numbers needed to reach a certain power, a little later. (Back)
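To get a feel for the arithmetic, here is a rough sketch in Python using the normal approximation for a two-group comparison of means. The function name and the example figures are mine, for illustration only; proper planning should use a dedicated power program:

```python
from math import sqrt
from statistics import NormalDist

def approx_power(diff, sd, n_per_group, alpha=0.05):
    """Approximate power to detect 'diff' between two group means."""
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)    # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)    # e.g. 1.96 for alpha = 0.05
    return z.cdf(abs(diff) / se - z_crit)

# e.g. detecting a 5-unit difference when the within-group SD is 10:
print(round(approx_power(diff=5, sd=10, n_per_group=64), 2))  # about 0.81
```

Note how the four factors listed above all appear: the difference, the SD, the group size and α.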



Significance

‘Significant’ is the real buzz-word in statistics. When you have estimated the probability that the difference between the means of the sample groups arose by chance variation in samples drawn from the same population, you have obtained a probability which says that the difference will happen fortuitously[16] 5 times, once, or maybe 0.1 times in 100 attempts. It is customary to designate a probability of 0.05 (5 times in 100, or 1 in 20) as good enough odds to regard the difference as worth thinking about, or significant! This magic ‘0.05’ is quite arbitrary and may or may not be appropriate. What it says is how likely the difference between the two groups is to arise by chance in two samples from the same population. To carry it further, if the chance is 1:100 that the two samples are drawn from the same population, then in 99 instances from 100 they came from different populations. If we are comparing a treated and a control group, this can be interpreted to mean that the treatment really is likely to have made a difference. It says nothing, however, about how important the difference between the two samples may be.

Let me illustrate: You design a clever, large, cross-over study to compare the efficacy of two anti-hypertensive drugs, Oldreliable, costing $10 a month and Newhizz, at $50. You evaluate the results and find that the mean blood pressure reduction by Newhizz is 3 mm. Hg. greater than with Oldreliable, p = 0.01, a highly significant difference from the statistical view. Whether the slightly greater reduction in blood pressure is worth the expense is not answered by your statistical evaluation.

Although we now understand what is meant by ‘statistically significant’, we shall consider how we find the answer to this question later. (back)

Confidence intervals.

The calculation of confidence intervals (CI) is both an alternative and a complement to significance testing. For example, the 95% confidence interval of a sample mean is the range which would include 95 of the sample means if 100 samples were drawn from the population. Conversely, this interval has a 95% probability of including the population mean.  It is the range consistent with the likely smallest and largest effect of a treatment. This information is not given by significance testing.

If you have 2 samples and the 95% CI’s for their means do not overlap, it is at least 95% probable that they do not come from the same population. Unlike the t-test alone, which is used to test the significance of the difference between 2 sample means, the CI gives you a clearer idea whether a non-significant difference is due to big scatter within the groups (indicated by a large CI) or to a small difference in the means. On the other hand, the CI does not precisely answer the question of the probability of a type 1 error. CI’s are increasingly used in the medical literature and there is an easy-to-read book on the subject, which gives many examples. The book also contains a useful check-list, used by referees to evaluate the statistical soundness of papers submitted to the British Medical Journal and the Medical Journal of Australia. There is a good argument for including both a p value and CI’s when reporting trial results.

As the 95% CI of a mean is

CI = t × SEM = t × SD/√n

the 95% confidence range is the mean ± CI. For a 95% interval (p = 0.05) and a sample size of 30, t is approximately 2. It is also evident from the √n term that to halve this interval the number of subjects needs to be increased 4-fold, and to reduce it to 1/3, by a factor of 9.
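The 4-fold and 9-fold claims are easy to verify. A minimal sketch in Python, taking t ≈ 2 for a 95% interval (the function name and SD value are mine, for illustration):

```python
from math import sqrt

# CI half-width = t * SD / sqrt(n), taking t = 2 for a 95% interval.
def ci_halfwidth(sd, n, t=2.0):
    return t * sd / sqrt(n)

print(round(ci_halfwidth(10, 30) / ci_halfwidth(10, 120), 1))  # 4x n halves it
print(round(ci_halfwidth(10, 30) / ci_halfwidth(10, 270), 1))  # 9x n: one third
```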


The Eyeball Test

This may well be the first commandment of statistics:

Look at thy data!

Before crunching any numbers, look at your data, using whatever type of graph is appropriate. This may not sound scientific, but it is important and useful. There are three reasons for this:

1. The first reason is to look for anomalies, such as data entry errors or data points which are unlikely to belong to the data set you are considering. For example, you are looking for the average weight of some boys, weighed to the nearest 1/10th of a kilo.

Our data set is:

44.4, 39.0, 40.0, 41.5, 41.1, 41.5, 44.8, 40.4, 43.5, 407, 43.8, 42.7, 39.2, 41.0

Maybe you noticed that the 10th value was missing a decimal point, but if you had 250 numbers it would not have been so obvious. Had you gone ahead with calculating the mean, it would have been about 67.9 instead of 41.7. Gross errors like this are easily detected by graphing the data.
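The same screening idea can be sketched in a few lines of Python, using the weights above (the 3 SD cut-off is just one common rule of thumb):

```python
# The boys' weights from the text; the 10th entry is the data-entry error.
raw = [44.4, 39.0, 40.0, 41.5, 41.1, 41.5, 44.8,
       40.4, 43.5, 407, 43.8, 42.7, 39.2, 41.0]
fixed = [40.7 if x == 407 else x for x in raw]

print(round(sum(raw) / len(raw), 2))      # the typo drags the mean up to ~68
print(round(sum(fixed) / len(fixed), 2))  # ~41.7 once it is corrected

# A crude screen: flag anything more than 3 SD from the corrected mean.
m = sum(fixed) / len(fixed)
sd = (sum((x - m) ** 2 for x in fixed) / (len(fixed) - 1)) ** 0.5
outliers = [x for x in raw if abs(x - m) > 3 * sd]
print(outliers)                           # [407]
```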

2.  The second, is to help you develop models for your data:

The relationship between pulmonary ventilation and arterial CO2 content may serve as an example.

It is obvious that the two variables are closely related, but equally obvious that the relation is not a simple linear one. By using Excel’s trendline function, you can pick what looks like the best fit and the devilishly clever program will even give you an equation relating the variables. In this case, near enough to

Arterial CO2 = 39/ventilation
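If you are curious about the least-squares arithmetic behind such a reciprocal fit, here is a sketch in Python. The ventilation and CO2 pairs are invented to lie near y = 39/x; your own data would replace them:

```python
# Hypothetical ventilation (x) and arterial CO2 (y) pairs, invented to lie
# near y = 39/x, with a touch of noise.
vent = [2, 3, 4, 5, 6, 8, 10]
co2 = [19.8, 12.9, 9.6, 7.9, 6.4, 5.0, 3.8]

# Least-squares fit of the one-parameter model y = k/x:
# minimising sum((y - k/x)^2) gives k = sum(y/x) / sum(1/x^2).
k = sum(y / x for x, y in zip(vent, co2)) / sum(1 / x ** 2 for x in vent)
print(round(k, 1))    # close to 39
```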

3.  The third is to check a model’s assumptions.

You propose to use a t-test to see if the wonder drug made a difference? Well, you know the data need to be normally distributed. Most statistical packages offer tests for this purpose (Excel does not), but a look will usually tell you.

The data are the subjects’ ages from the Eyeball.xls file and the graph is a histogram. An assumption of normality looks decidedly iffy.

Learn to use the Excel graph function in general and to graph your data in various ways, such as scatterplot, column chart or histogram.

Exercise 6. Graphing.

Open Eyeball.xls.

This file contains 4 columns from Valsalva.xls, with the subjects’ record number, age, sex and pre-test SBP.


1.     Plot the ages of the subjects. Use a column chart, and then do the same with a scatterplot of the RecNo (x) against Age (y).

2.     Plot age (x) against pre-test systolic blood pressure (y) as a scatterplot. Do you discern a trend?

3.      Plot the age of the subjects with different symbols for female and male.


Save your work for comparison with mine, in Examples. I describe how I worked these exercises in Explanations


Open Eyeball2.xls

These are 3 series of random numbers, created by Excel’s Random Number Generator. Create a duplicate set of each (Copy/Paste) and sort this second set in ascending order (Data/Sort).

1.     Plot the distribution of the numbers in each of the series, first as a scattergram then as a column chart.

2.     Now plot the sorted numbers, using a scattergram and a column chart. It does not matter, which.

Have a good look at the 3 plots of each dataset. Have a think about them before you look at my example and read my comments.


Descriptive Statistics

Now we are getting into it! The descriptive statistics of a sample describe it with a minimum of numbers. We have met many of these terms already, when we looked at the normal distribution. The ‘Descriptive Statistics’ of Excel include the following:

Measures of central tendency

1.               Mean is the arithmetic average of the values in the sample.  You know already that this is sometimes the most representative value for the measurements in the sample.



2.               Median is the middle value of the set: as many values lie below it as above. Outliers (occasional values much above or below the others in the set) influence it less than the mean. Where you have 1 or 2 weirdly high or low values in your sample, they may influence the mean quite a lot, but not the median or mode.

3.               Mode is the most commonly found value in the sample. This also is not influenced by outliers, but a data set can have more than one mode.


Measures of scatter

1.               Variance is the mean squared deviation from the mean.  (This is again just revision.)

Do you recall why we square the deviations, and why we divide by (n - 1) and not n in getting the mean of the squared deviations? Because when we use a sample, not a population, to derive this measure, the sampling introduces a degree of uncertainty, so we divide by the degrees of freedom, which is 1 less than the number in the sample.

The sum of all the squared residuals,

SS = Σ(x - mean)²

is usually called the total sum of squares, TSS, or just the sum of squares, SS. This is often used in averaging the variances of several groups.

2.               Standard deviation is the measure of scatter of the values in a normally distributed sample. It is meaningless if the data are not normally distributed. It is independent of sign (pos/neg), that is, we are looking at dispersion, regardless of whether the value is larger or smaller than the mean.  It is the square root of the variance.


3.               Standard error (of the mean) is for sample means what the standard deviation is for the individual values. The sample mean ± 2 SE has a 95% probability of containing the population mean. Conversely, draw a number of samples and their means ± 2 SE will contain the population mean 95% of the time.  Its value depends on the SD of the sample (dispersion) and the size of the sample.
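All of these measures of scatter are easy to compute by hand. A sketch in Python with invented data (Excel's Descriptive Statistics tool produces the same figures):

```python
from math import sqrt

# The measures of scatter, computed by hand on invented data.
data = [4.0, 5.5, 3.5, 6.0, 5.0, 4.5, 5.5, 4.0]
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)   # total sum of squares, SS
variance = ss / (n - 1)                   # divide by degrees of freedom, not n
sd = sqrt(variance)                       # standard deviation
sem = sd / sqrt(n)                        # standard error of the mean
print(round(variance, 3), round(sd, 3), round(sem, 3))
```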

Normality of the Data

Neither skewness nor kurtosis is commonly used to decide whether the samples are near enough to a normal distribution for the use of parametric tests.  In general, if the sample contains more than 30 values, and does not look obviously other than normal by the eye-ball test, parametrics are probably OK. If the sample is small or looks suspiciously non-normal, the Kolmogorov–Smirnov single-sample test or the Shapiro–Wilk test is used. These you will find in large stats programs, but they are not included in Excel’s Data Analysis Pack. They compare your sample to a normal distribution and determine whether there is significant deviation from normality in your sample. We shall, however, build a spreadsheet later to provide our own test for normality of data using the NORMDIST function in Excel.


Exercise 7. Descriptives

Open Valsalva.xls. Find the descriptive statistics for SBP before, minimum and end-valsalva. Are they really different? If you look at the confidence intervals, you will get a feel, but the answer to that question is not yet in our power. In the next section, or better, in the section on analysis of variance we shall learn how to find out.(Explanations)



T-test

Remember, the t-test, like other parametric tests (z-test, analysis of variance, regression and correlation), needs interval (continuous) data and the assumption that the data are (at least approximately) normally distributed. This assumption is not necessarily a tight one, but have a look at the plot of the data and make sure that it is not obviously non-normal. If in doubt, test the data. The t-test is the test used to estimate whether there is a significant difference between groups in the means of a variable. If the groups are large (>30), the test becomes the same as a comparison of the means of two normally distributed groups (z distribution) and is also in effect identical to analysis of variance for 2 groups. It can, of course, still safely be used for groups larger than 30.

The theory is the following:

We make the null hypothesis that the samples come from the same population. We estimate the SE of the difference between the means from the variances of the samples. We know that it is 95% probable that the samples do not come from the same population if the difference in their means is greater than about 2 (more precisely, for large samples, 1.96) times that SE. We calculate the ‘t’ statistic for our study, which is the difference of the means divided by the estimated SE of the difference. We used to look this statistic up in tables to see whether its probability was greater or less than some level we set, usually at 0.05, and call significant.

The larger our sample size, the smaller the difference between the means needs to be for t to exceed a given value for the desired probability, so our tables show the required t value for the size of the sample, and we express the sample size in degrees of freedom. This is the number of independent values the variable can assume. We know the mean, thus if we know all values but 1, we can work that one out. That value is therefore not independent, and the degrees of freedom in this instance is n - 1 for each sample. That means that we need to look up t for (n1 + n2 - 2) degrees of freedom to find the probability of the t value. Excel saves us looking up the t-tables and not only calculates the t-value but gives its probability, p, when we do a t-test.
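The arithmetic Excel performs for the unpaired, equal-variance t-test can be sketched from scratch (the data here are invented, blood-pressure style numbers):

```python
from math import sqrt

# Unpaired t-test, equal variances assumed, on invented data.
def t_unpaired(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)        # sum of squares in each group
    ssb = sum((x - mb) ** 2 for x in b)
    df = na + nb - 2                           # degrees of freedom
    pooled_var = (ssa + ssb) / df              # pooled variance estimate
    se = sqrt(pooled_var * (1 / na + 1 / nb))  # SE of the difference in means
    return (ma - mb) / se, df

t, df = t_unpaired([120, 115, 130, 125, 122, 118],
                   [110, 108, 119, 114, 112, 109])
print(round(t, 2), df)   # compare t with the tabled value for df = 10
```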


T-tests differ slightly depending on whether the two groups are independent (contain different subjects) or paired, as when they represent changes in the same subject (BP before and after treatment) or closely matched pairs of subjects. In the first instance (unpaired groups) the difference between the means of the two groups is tested. In the second (paired t-test) it is the difference between the paired values which is tested. In Excel there are also 2 possible tests for unpaired data, depending on whether you assume the same variance in the two groups or not. If the SD is not very different in the two groups, the variance is likely to be roughly the same. If it is very different, then use the t-test for different variances, which uses an estimate of the SE based on both sample SD’s (see App. 3). In fact, the difference between the tests is usually small. It is probably safest to quote the ‘worst case’ result. Alternatively, do the Excel F test (different from the F ratio!) to determine whether the variances are significantly different, and then use the appropriate t-test.

When doing a t-test, you can postulate a difference between the means of the two groups. This makes the question you ask “Is the difference between them significantly different from the postulated value?” If you do not postulate a difference, you are simply asking whether the means are significantly different.

 The other thing to decide is whether to use a one-tailed or two-tailed test. This depends on the hypothesis. Did you ask “is treatment 2 different from treatment 1?” or “is treatment 2 better than treatment 1?” In the first instance we are testing how different the treatments are. It is immaterial whether result A is larger or smaller than result B. In the second example, we want to know whether the BP fall is greater after treatment A than after treatment B. If it is less, (A<B) then we are not interested. If A>B then a one-tailed test will be appropriate. The decision is one of logic, not numbers. A one-tailed test will be significant, with a smaller ‘t’ value than that required for a two-tailed test. (back)

Analysing Rates and Proportions

Where we are dealing with a sample with only 2 possible values (say, two-party-preferred votes), pollsters ask 1000 individuals and, eliminating the ‘don’t knows’, we end up with 350 for party1 and 400 for party2. The pollsters report that support for party1 is at 46.7%. But we know that this is not necessarily the gospel for the whole population. We are dealing with a sample, whose proportion is not likely to be identical to that in the whole population. In fact, to have a 95% probability of including the population proportion, we need a range of 2 SE about the sample proportion.

It can be shown that the SE of a proportion is

SE = √(p(1 - p)/n)

so that in this case (p = 350/750 = 0.467) the SE = 0.018 and 2SE = 0.036. Therefore the population support for party1 is 95% probable to lie between 43.0% and 50.3%, among those who know their preference. When the pollsters want to report honestly, a range of uncertainty must be given, and this range will be greater with smaller samples. And to be really meaningful, they also need to report the sample size and the proportions of non-responders and ‘don’t knows’.
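A sketch of that calculation in Python, using 350 of the 750 decided respondents:

```python
from math import sqrt

# SE of a proportion: 350 of 750 decided respondents favour party1.
n = 750
p = 350 / n
se = sqrt(p * (1 - p) / n)
low, high = p - 2 * se, p + 2 * se
print(round(se, 3))                            # about 0.018
print(round(100 * low, 1), round(100 * high, 1))
```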

Exercise 8. T-tests

Open T-tests.xls. This is a spreadsheet containing some data from Valsalva.xls, sorted by sex.

1.     Do the ‘pre-test’ and ‘minimum’ SBP values for males and females have ‘different’ (read: significantly different) variances? Do the F test for variances on each set of BP values, male versus female.

2.     Are the SBP values significantly different for males v. females on either set?

3.     Are the pre-test and minimum SBP’s different using all subjects? Do t-tests, for paired samples, as the two variables come from the same subjects, equivalent to before and after.


Power and Sample Size.

(Back to ‘Design a Study’; Back to Type II error) We have already mentioned power, as applied to statistics. If β is the probability of a type 2 error (failing to detect a real difference), power is (1 - β), the probability of detecting one. As we said, power is related to the difference we want to detect, the probability of type 1 error we are willing to accept (α), the numbers in the study and the scatter of values in the study groups. The smaller we make α, the higher the risk of a type 2 error (β). By looking at the algorithm for the t statistic, we can work out how to estimate the power of a study, and the numbers required to reach a given power.

SEM, here, refers to the SE of the differences between the means and its estimation can be complicated but it will involve sample sizes and standard deviations.

Excel has neither a function for working out sample size nor one for estimating β. There are web sites which will do it for you (do a search for ‘power’) or use the program PS.EXE (Back)


The Z-scores

We are faced with determining the probability that a value X comes from a normal population described by the mean μ and SD σ. A simple way of doing this is to express the deviation of X from μ in ‘standard’ units (units of the SD) by dividing the difference of the value from the mean by the SD: z = (X - μ)/σ.

We can now look up a table of ‘area under the normal curve’ and see. For example, for z = 2, the area between the mean and z is .477. So the one-tailed probability of z > 2 is 0.023 and the two-tailed probability 0.046, or roughly 5%. Similarly, z > 3 has a probability of well under 1%. This is the same as saying that the probability of a value outside mean ± 2SD is about 5% and outside mean ± 3SD, under 1%.
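Excel gives these probabilities directly; they can also be checked against the standard normal distribution in Python:

```python
from statistics import NormalDist

z = NormalDist()                 # standard normal: mean 0, SD 1
area_0_to_2 = z.cdf(2) - 0.5     # table-style area between the mean and z = 2
one_tail = 1 - z.cdf(2)          # P(z > 2)
two_tail = 2 * one_tail
print(round(area_0_to_2, 3), round(one_tail, 3), round(two_tail, 3))

# converting a raw value to a standard score:
x, m, s = 70, 50, 10
score = (x - m) / s              # 2 SD above the mean
print(score)
```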

This is called ‘normalizing’ the distribution. Z is called the standard normal variate, standard normal deviate or just standard score. Tables exist, as we said, but Excel gives p values directly anyway. The Z distribution tables assume a mean of 0 and deviations in units of the SD. This test is not used much in our line of work, but would be very useful if you have to scale results, as they do in some exams. An example would be two groups, where different assessors have given different marks, but you believe that in fact the average performance is the same for both. By ‘normalizing’ the individual results, then re-converting each z-score using a single mean, say (mean1 + mean2)/2, and an estimate of the combined SD, you correct the marks to the same scale.
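A sketch of that scaling procedure in Python, with made-up marks for two classes (the simple averaging of the two SDs is just one rough way of getting a combined estimate):

```python
from statistics import mean, stdev

# Made-up marks from two markers; we believe the classes perform alike.
class1 = [55, 60, 65, 70, 75]
class2 = [35, 45, 50, 55, 65]

joint_mean = (mean(class1) + mean(class2)) / 2
joint_sd = (stdev(class1) + stdev(class2)) / 2   # a simple rough estimate

def rescale(marks):
    m, s = mean(marks), stdev(marks)
    # z-score each mark, then re-convert with the joint mean and SD
    return [joint_mean + joint_sd * (x - m) / s for x in marks]

print([round(x) for x in rescale(class1)])
print([round(x) for x in rescale(class2)])
```

After rescaling, both sets have the same mean and SD, so the two markers' severity no longer matters.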

The z-statistic is also useful for ‘significance’ testing of ratios or proportions. Say treatment 1, tested in 61 subjects, has a mortality of 8 and treatment 2, in 67 subjects, 10. The mortality rates are 8/61 = 0.131 and 10/67 = 0.149. The best overall estimate of mortality as a proportion or rate would be

p = (8 + 10)/(61 + 67) = 18/128 = 0.141

and its SE is (see analysing rates and proportions) √(p(1 - p)/n). The SE of the difference in rates would be

SE = √(p(1 - p)(1/61 + 1/67)) = 0.062

so that z = (0.149 - 0.131)/0.062 = 0.29. This is well short of the critical value of 1.96 for p = 0.05 and we can conclude that it is reasonably safe to accept the null hypothesis.
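The whole calculation, sketched in Python with the mortality figures above:

```python
from math import sqrt

# z-test for two proportions: 8 deaths in 61 subjects v. 10 in 67.
d1, n1 = 8, 61
d2, n2 = 10, 67
p1, p2 = d1 / n1, d2 / n2
p = (d1 + d2) / (n1 + n2)                   # pooled mortality, 18/128
se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # SE of the difference in rates
z = (p2 - p1) / se
print(round(z, 2))                          # well short of 1.96
```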

Because there is a small difference between the standard normal distribution and the distribution of the z-statistic, a correction called Yates’ correction for continuity should be applied. Probably Excel does this, but I could find no reference to it.

The z-test becomes equivalent to the t-test for groups greater than 30.

If there are more than 2 groups or more than 2 possible outcomes, we shall need another approach, called the Chi-square test or contingency tables, which could also be used to test significance here. We shall consider this when we deal with non-parametric methods. In any case, I have not used the z-test very often, but standardizing scores is useful and it also enables us to determine the significance of proportions.

Exercise 9. Z-scores.

a) Open Scaling.xls

This is an exercise in scaling marks. The two columns of raw marks represent the assessment of two markers on two classes. We think the classes are similar in performance, but this is not reflected by the marks. So we want to scale them to the same mean and scatter, or SD.

To do this, you need to find the mean mark and the variance for each class, then work out a joint mean and a joint SD.

Next you convert each mark into a z-score and re convert them to a mark, using the joint mean and SD.

Does this sound complicated? Well, it is, and it is not very important, but have a try. Then have a look at my effort in Examples\Scaling.xls and at Explanations\Exercise 9.


b) Open Survival.xls

Seventy-four patients were treated with chemotherapy after surgery and 58 survived 5 years. Sixty-seven were treated with surgery alone and 36 survived 5 years. Is this difference significant? My calculations are in Examples\Survival.xls and the explanation in Explanations\Exercise 9.

Analysis of variance

One-way ANOVA(Back_to_t-test)(Back_to_Ex7):

What if we have more than 2 groups to compare, say the systolic blood pressure pre-test, at minimum and at end-valsalva? We could do t-tests between each pair of groups, 3 in all, but remember that if we set α at 0.05, in about 1 case in 20 we are likely to accept a spurious result (type 1 error). So if we do 3 t-tests, the chances are roughly 14% that a fortuitous difference will fool us. If we had 4 groups, we could do 6 t-tests, and the chance of at least one fortuitous ‘significant’ difference would be >26%. Not good enough!
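The 14% and 26% figures come from 1 - 0.95 raised to the number of tests; a two-line check:

```python
# Chance of at least one spurious 'significant' result in k independent
# tests, each run at the stated alpha.
def family_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

print(round(family_error(3), 2))   # 3 t-tests: about 0.14
print(round(family_error(6), 2))   # 6 t-tests: about 0.26
```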

Analysis of variance (or single factor ANOVA) is a better way to go. If the null hypothesis is true (no differences between the three value sets), then the sample means are just those of 3 random samples drawn from the same population. So if we use two different methods of estimating the variability of the sample means, the results should be the same. The two methods available are to estimate the variance of the means from the sample variances, or to estimate it from the means themselves. If indeed the means come from the same population, the two results should be fairly similar. If, however, one (or more) of the groups does not come from the same population, the variance estimated from the sample means will be larger than that estimated from the sample variances, which need not differ even if the means do!

We therefore estimate a variance for all the samples from the variances of each of the groups. This is the ‘within groups variance’. We also estimate a variance from the group means, the ‘among’ or ‘between groups’ variance. As we have said, the two should be more or less the same if all the sample groups are from the same population. Thus the ratio of the two variance estimates tells us whether the total variance is largely from scatter within the groups, or whether the means of the groups are different and the scatter within each is a relatively small contributor. The F ratio is the ratio of the between groups variance to the within groups variance. There are tables for the critical value of F, depending on the number of groups and the total number of subjects in the samples. Of course, Excel saves us having to look this up and gives us a ‘p’ value.
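The within/between arithmetic can be sketched from scratch (the data are invented and the function name is mine; Excel's Single Factor ANOVA tool does the same sums):

```python
# One-way ANOVA from scratch: compare the between-groups and
# within-groups variance estimates via the F ratio.
def one_way_anova(groups):
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)      # among-groups variance
    ms_within = ss_within / (n_total - k)  # within-groups variance
    return ms_between / ms_within          # the F ratio

f = one_way_anova([[120, 115, 130, 125],
                   [110, 108, 119, 114],
                   [121, 117, 128, 124]])
print(round(f, 2))   # look this F up against df = (2, 9)
```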

Like the t-test, analysis of variance (ANOVA) depends on a number of assumptions for its validity:

1.     That the samples are drawn from normal populations.

2.       That the variance of the samples is the same, or at least not too different. It has been suggested that variances which differ by less than 3 fold can be accommodated. Excel’s F-test will only handle two variables but probably multiple comparisons would not be unreasonable here. If the variances appear very different, another test needs to be found.

3.     The samples are independent.

If the data do not meet these 3 requirements, the non-parametric equivalent of ANOVA, the Kruskal-Wallis test, can be used.

The ANOVA tells us whether there are significant differences between the groups. It remains to be decided which groups are different from which. This is not necessarily easy. The commonest, but not necessarily best, way to avoid the dangers of multiple comparisons (the danger of a chance 'significantly different' result) is the Bonferroni correction, applied as t-tests with Bonferroni correction. This takes the p value desired (say < 0.05) and divides it by the number of comparisons, here 3. So if we do 3 t-tests, p < 0.017 (0.05/3) between the two tested groups is required to judge a group different at the 0.05 level. In practice, the correction is easy to apply: all you have to do is enter 0.017 instead of 0.05 in the Alpha box on the t-test panel of Excel, or whatever alpha you want, divided by the number of tests you do.
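The correction is one division; a one-line Python sketch (the function name is my own) makes the rule explicit:

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under the Bonferroni correction.

    This is the value to enter in Excel's Alpha box instead of 0.05."""
    return alpha / n_comparisons
```

For three pairwise t-tests at an overall 0.05 level, each individual test must reach p < 0.05/3, about 0.017.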

Many consider this too conservative and use other tests, but Excel does not provide them. (The Student-Newman-Keuls test is considered by many to be the best.) In any case, plotting the means of the groups with ± 2 SEM or the 95% confidence intervals will give you a good idea of which differences are significant and which are not, as well as the relative importance of the differences in the means and the scatter.

We called this 'one-way' ANOVA because there are more complex forms of analysis of variance, which compare the effects of more than one factor across the groups.

Two factor ANOVA

This is an extension of the analysis of variance which permits 'stratification' of the variable according to some characteristic (grouping into levels by age or whatever), which reduces the variance. Another possible use is to look at the effect and interaction of two categorical grouping variables on the study variable. The example in the SPSS manual is that of marks for an essay, assessed by same-sex and opposite-sex markers who had been shown photographs of the authors. The authors were grouped into attractive, so-so and oh, yuck! groups. The analysis could answer three questions:

1.     Whether attractiveness related to marks,

2.     Whether the marker’s sex influenced the scores,

3.     Whether there is interaction between the effects of attractiveness and the sex of the assessor.

By the way, the conclusion was that the essay marks were influenced by the writers’ looks, if they were attractive and marked by someone of the opposite sex.

The test can be extended to more than 2 grouping variables.

Repeated Measures Anova (or Two-way Anova with Replication, in Excel):

If we look at pulse rates in subjects after exercise, at 1, 5 and 10 minutes, it is obvious that the measurements are not independent of one another, so the data fail to meet the independent-samples criterion of either simple or multiple ANOVA. This is analogous to the paired t-test situation, but for more than 2 groups.

Surprisingly, Excel includes a two-way ANOVA with replication, but not a one-way repeated measures design. The two-way test, however, can be used to perform a simple ANOVA with repeated measures, by simply omitting the second grouping factor.

Exercise 10. Analysis of variance.

a)       Open Anova.xls

Do an analysis of variance (ANOVA) on each set, then t-test group 1 v group 2, group 1 v group 3 and group 2 v group 3.

Using the same data, do a scatter plot, showing each group separately. Plot the means of the 3 groups and their 95% confidence interval as error bars (separate plot). This is not quite straightforward! If you need help, look in Explanations.

b)       Open Replications.xls

This data purports to be pulse rates 1, 5 and 10 min after exercise, in 50 subjects. First we want to see if the mean pulse rates are different at the 3 observation times. That is a one-way ANOVA with replication, but the program we shall use is two-factor ANOVA with replications, and we shall simply enter the count of all our data points in the "Rows per Sample" window, in this case, 50. Hint: As input range, you need to include the col. with Old and Older, as well as the col. labels: a1:d51.

1.     Is there significant difference between the pulse rates at the 3 times?

2.     Which groups are different? In fact how can you tell?

3.     If you work out the CI95 for the means of the groups, they suggest that 1 is different from 2 or 3 but 2 and 3 are not significantly different. This is not true. Why?


Discriminating on the grounds of age, we shall divide our subjects into 2 groups of 25, old and older, and repeat the test. Now our "Rows per Sample" becomes 25.

1.     How do you interpret these results?


If you could answer these questions without reference to the Explanations, you are pretty damn bright!


Correlation and Regression


Regression deals with the relation between two variables, for example the height and weight of persons, and describes this relation. Correlation expresses how 'tight' such a relation is, that is, how predictable one variable is from the other.

Relations between variables have 3 features. First is the nature of the relationship: does y grow as x increases, or does it diminish? When x is 0, is y also 0, or not? Second is the magnitude of the relation: how tight or good the relation between the two variables is. Third is reliability: how probable it is that other samples from the population would demonstrate a similar relation. Therefore, when looking at a regression, we need to answer three questions:

1.     What is the relation between the two variables? This is answered by the regression equation. (Nature)

2.     How strong is the relation between the two variables? The correlation coefficient and its p value answer this. (Magnitude)

3.     What is the likely range of the population values for the statistics (regression intercept and coefficient and the correlation coefficient)? This we find out from the SE or CI estimates of the statistics. (Reliability)


Regression explores the nature of the relation: how y relates to x. This relation can be linear, the only type which we shall explore in detail, or something else. Linear regression applies to the situation where the relationship between the two variables is best described by a straight line.

Such a relation is described by the equation:

y = a + bx

where a is the intercept (sometimes called the constant), the value of y when x = 0, and b is the coefficient, which describes the rate of change of y with x. If it is positive, y grows as x increases; if negative, y diminishes. This regression line, or trend line as it is sometimes called, always passes through the point described by the means of x and y, and is the line of best fit, where the sum of the squares of all the residuals is least.

In the diagram, the residuals are the vertical bars, from the estimated values of y, Ŷ1 and Ŷ2, to the measured values, Y1 and Y2. In other words, a residual is the difference between the actual value of y for a given value of x, and the value predicted by the regression equation. When the sum of the squares of all these residuals (the error) is least, we have found the line of best fit.
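The least-squares line can be computed directly; here is a minimal Python sketch (Excel's Regression tool and trend lines do this for you, and the function name is mine):

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and coefficient b for the line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx   # guarantees the line passes through (mean of x, mean of y)
    return a, b
```

For data lying exactly on y = 1 + 2x, the fit recovers intercept 1 and coefficient 2.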

For continuous variables, we use the method called Pearson product moment regression, which only applies to linear regression. Three criteria need to be fulfilled for the procedure to be valid:

1.     The y values must be distributed normally about the regression line. In English, this implies that y values closer to the trend line (regression line) are more frequent than those further away.

2.     That the variance of y be constant for all values of x.

3.     That the x,y pairs are independent, for example, not contain multiple values from the same subject.

The procedure is quite robust to these assumptions, meaning that they only need to be approximately correct (except for the last). Usually an eye-ball test is sufficient to tell us whether the data set is acceptable. If the x-y plot leaves you in doubt, look at the residuals, which should be randomly distributed about the trend line. We shall look at these plots soon. The fact that the correlation coefficient is reasonably high, or that the correlation is significant, even highly significant, must not fool you into believing that the relation is linear.

If any of these three assumptions is grossly violated, then the non-parametric equivalent of the regression procedure, the Spearman rank order regression, should be used. This test is not included in the Excel Data Analysis pack.

One of the two variables, x, by convention, is designated the independent or explanatory variable. The corresponding y is the dependent variable. Which is which is a logical, not a mathematical, decision. If you are looking at the relation of family income to spending, it is logical to designate income as the independent variable, on which spending depends. If you are exploring the relation between height and weight, it depends on your question or postulate. If this is "are taller people heavier?", make height the independent variable. If you asked whether "heavier people are taller", choose weight as x.

We now come to the correlation coefficient. We already know that this is the indicator of the goodness or tightness of the relation, that is, the predictability of y from x, or vice versa. The abbreviation used is r.

In mathsspeak:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² × Σ(y − ȳ)²]

This is always a value between −1 and +1: 0 means no correlation, and ±1 perfect correlation (the sign simply shows whether y grows or diminishes with x). The square of r, r², is the fraction of the total variance explained by the regression. Of course, the greater this is, the better the fit. If r² is 0.8, then 80% of the variation in y is due to the variation in x, and 20% to other causes, such as measurement error, individual variation or whatever.
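The formula above translates almost word for word into Python (Excel's =CORREL() function gives the same number; the function name here is mine):

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product moment correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den
```

Data lying exactly on a rising line gives r = +1; on a falling line, r = −1.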




Some Practical Points: We have noted that a valid regression procedure requires that the residuals (the differences between the actual and predicted values of y) be normally distributed about the regression line. Checking this is not difficult. Tick 'Residuals' and 'Residual Plot' on the Regression procedure and look at the residual plot, which is the plot of the deviations of the y values from their regression estimates (the residuals) against the corresponding values of x. If they are randomly distributed about the zero line and seem to have the same scatter for all values of x, then two of the assumptions, normal distribution and equal variance, are probably acceptable. This is an important application of the eye-ball test. If in serious doubt, do a histogram of the residuals and see if this looks anything like a normal distribution. As a last resort, if you think it is worthwhile, you can test the residuals for significant difference from the normal distribution by the Kolmogorov-Smirnov test (if you have it in a stats package) or by Chi-square, which we discuss later.

At this point you should look at the RegressionTypes.xls file in Examples. In this file I have included a linear regression (worksheet labeled 'Linear') where the residual plot and the histogram of the residuals are done.

The next sheet is labeled 'Log', a regression where y = a + b·log(x).

This is not a very important relation for us, but I have drawn it to your attention for two reasons:

1.     The value for r2 for a linear regression for this data is 0.82, a very good value. The significance is p<<0.001 – highly significant and if you look at the linear trend line in the x,y plot it is not such a bad fit. It could well masquerade as a linear regression.

2.     The second reason is that when we find out that this is not a linear regression, after all, all is not lost, because we can turn it into a linear relation.

How can we tell that this is not a real linear regression, but something pretending? You only have to look at the residual plot. If you only had the data from x = 300 to x = 1000, it would indeed be hard to tell without the residual plot. But if you look at this, it is obvious that the residuals are not randomly distributed about 0. Something is fishy, and you can track this down by fitting other than linear trend lines. Aren't computers great? I would hate to have to do this by calculator or paper and pencil.

What can we do about this? You can see from the equation above that if you regress y not against x but against log(x) (col. A is no longer x, but log(x)), you will get a linear relation. If you look in 'Linearized Log', this is exactly what I have done: the residuals look a lot more random, and the histogram is a lot closer to normal than it was.

There are many other non-linear relations, which are possible, but we shall not deal with them. In any case, the trick to deal with them is to turn the relation into a linear one before you do a regression analysis.
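The linearizing trick is easy to demonstrate with made-up numbers (these are not the data from the 'Log' worksheet, just an illustration following y = 2 + 5·log(x); fit_line is an ordinary least-squares helper):

```python
from math import log
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Synthetic data following y = 2 + 5*log(x): curved when plotted against x...
xs = [10, 50, 100, 300, 600, 1000]
ys = [2 + 5 * log(x) for x in xs]

# ...but exactly linear once we regress y against log(x) instead of x
a, b = fit_line([log(x) for x in xs], ys)
```

The regression against log(x) recovers the intercept 2 and coefficient 5 exactly, which is what replacing col. A by log(x) achieves in the spreadsheet.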

Interpretation of the Regression output.

Look at the output of the Regression procedure, in ‘Linear’.

1.     Multiple R: This is the correlation coefficient, properly called the Pearson (product moment) correlation coefficient. There is another, the Spearman (non-parametric) correlation coefficient, which we have already mentioned. Here R varies from 0 to 1; 0 being no correlation and 1, perfect correlation. Perfect correlation does not mean identity (y = x); it means that you can predict y from x with no error.

2.     R square: Obviously, the square of the correlation coefficient. This, as you know, is the fraction of the variance of y which is explained by the regression. The remainder of the variation (the residuals) is not explained and includes measurement error, etc. In this instance, r² = 0.98, so there is little unexplained variation in y.

3.     Adjusted R square: An estimate of the population R². Very useful when comparing different models, such as a straight-line relationship with, say, a logarithmic relation. As it is an estimate of the population value, it is probably a more realistic estimate of the explained variance.

4.     Standard error: The square root of the mean residual sum of squares, that is, of the variance not explained by the regression. It is the same as the SD of the residuals, the amount by which y differs from the predicted value, and is in the same units as y.

5.     Observations, is simply the number of value pairs.

6.     ANOVA tests whether the slope of the regression line differs from 0. Of course we know that if there is no relation between x and y, the regression line will have a 0 slope – saying that y does not grow or diminish as x changes.

7.     Intercept: The value of y when x = 0. The p value tests whether this statistic is different from 0, and the 95% confidence intervals are the likely limits of the population value. If they straddle 0, we cannot reject the null hypothesis that the true (population) value of the intercept is 0, at the 0.05 level. Where you have logical reason to do so, the intercept can be forced to 0.

8.     The coefficient: The coefficient defines the rate of growth of y as x changes. If it is negative, then y gets less as x increases in value. In the equation y = a + bx, if b = 0, then y = a, and y has no relation to x.

In the 'Linear' worksheet, the values for the intercept and the coefficient are given. The intercept is tested: is it different from 0? In this instance the p value, the probability of finding an intercept of −3.14 when the true population value is 0, is 0.8, so we conclude that the intercept is not significantly different from 0. The coefficient is also tested for difference from 0. In this instance it is 1.002, and the p value is 3.8 × 10⁻⁴¹, or infinitesimal. This simply means that the chance of the coefficient being 0 is practically nil.

9.     Confidence intervals: The 95% confidence intervals tell the same story. The limits of the intercept, −30 to 24, well and truly overlap 0. Those of the coefficient, 0.96 to 1.05, are well away from 0 and obviously represent a significant coefficient. So, now let's do some practice.


Exercise 11. Correlation and Regression.

Open Valsalva.xls.

Copy Pre-test SBP, DBP and Pulse Rate (not Pulse Pressure) to a new file, which you can save as Regression.xls.

Plot pre-test SBP (as the x value) versus DBP. On a separate graph, plot SBP v pulse rate. Insert regression lines on both plots (right-click on a data point, choose 'Add Trendline' and choose 'Linear'). Try to form an idea whether these correlations are worth pursuing.

The eyeball test of the plots says that it is probably worth doing the regressions, but we shall see what the R2’s are.

Do a regression estimation on both relationships, anyway. Place a tick in the box labeled ‘Residual Plots’ and ‘Line Fit Plots’. This will cause Excel to list the deviation of each y from the predicted value and plot the deviations against x. It will also list the residuals, should you want them for a histogram.

Do a histogram of the residuals. When using the Histogram procedure you do not have to build a table of the bins, Excel will do them. Make sure that the cell in the ‘Bin’ box is such that nothing will be overwritten and for convenience in plotting, format the numbers in the Bin column to no decimal places. This makes the plots of the histograms less cluttered.

Some More Linear Regression.

Open line.xls

What is the intercept and the coefficient of columns 1 and 2? What about cols. 3 and 4? Plot these data with cols.1 and 3 respectively as x’s and 2 and 4 as y’s, then use the regression program.

Remember, the Excel regression tool defines the straight line which best represents our data. In other words, it is the line which, when we calculate the predicted values of y for any value of x, gives the least error compared to our actual values. Put another way, the line is the one which minimizes the (squared) residuals. As we have observed, this line always passes through the point which represents the mean values of x and y. It is defined by its slope (coefficient) and its intercept (constant) on the y axis.

Plot the data in columns 5 and 6. Do a regression analysis, tick the Residual plots and the Line fit plots. What does the residual plot suggest? What does the line-fit plot tell you? Then try the correlation tool. What does this tell you?

The final table shows the statistics of the intercept and the coefficient, labelled here by the column number. The t statistic you know, and the significance is based on this. What was actually tested is whether the intercept differs from 0, and it does. The coefficient, on the other hand, does not differ significantly from 0. You can conclude that there is no significant correlation between x and y. The lower and upper 95% confidence intervals give the range within which it is 95% probable that the true population value of the slope coefficient lies. It can be set to levels other than 95%. If this range includes 0, the value of the statistic is not significantly different from 0 at the 0.05 level. Why the CI is printed twice in the output table is a mystery.

A Case of Non-linear Regression.

The partial pressure of CO2 in the end-tidal exhaled air is inversely proportional to the ventilation of the lung alveoli (as ventilation increases, CO2 decreases). In an experiment, we measured ventilation and end-tidal CO2. The inverse proportional relation is described by

PACO2 = k / VA

where PACO2 stands for the alveolar partial pressure of CO2, VA is the alveolar ventilation and k is a constant which depends on the measurement units.

Open ventilation.xls

Plot Ventilation as x and PECO2 as y and put in a linear trend line. This just does not look right, as there is a decided curve to the relation. Now do a regression estimation on PECO2 versus ventilation, with residual and line-fit plots. The numbers come up highly significant, p < 0.001 and R² = 0.84, but the linear trend line just does not look right. The residuals are not distributed randomly.

What could you do to improve the regression relation? What I did you can see in the Examples file, and I explain it in Explanations.



Comparing two methods of measurement.

This is a special case of two variables x and y, where the results recorded by the two instruments should be identical, not just related, in order for the instruments to be used interchangeably. The purpose of these comparison studies is to identify differences between the two methods, not to point up similarities. This was the subject of a widely quoted paper by two statisticians, Bland and Altman (Lancet 1986; 307-310), who wrote this readable paper for health care workers, not other statisticians. They criticized regression as a method of comparison and offered alternatives which are well suited to the quick appreciation of the differences.

The regression method

Let us compare SBP measured by invasive arterial measurement and by a BP cuff. There is the true blood pressure, and then there are the pressures measured by the two instruments. Both methods have some measurement error. Sometimes we have a time-tested and well validated method which we may choose to consider the 'gold standard' with which we compare the other method, but as usually both methods contain some measurement error, our best estimate of the actual measurement is the mean of the two methods. Differences between the instruments can be random, as in measurement error; systematic, where the error or bias is fixed, such as one thermometer reading 5° higher than another; or proportional, where the error has some relation to the value of the measurement, such as when one machine reads 10% lower than another. Any combination of the three types of error can be present. The objective of a good comparison method is to define which types are present and their magnitude.

To do this with regression analysis is not quite straightforward, but it can be done, provided care is used in interpretation. The real drawback of the method is that in linear regression no allowance can be made for random error in the method chosen as the independent or explanatory variable.

Exercise 12. Two Instruments.

Open Instruments.xls

There are 4 worksheets in the file. Start with ‘Random’. This is two sets of identical measurements, with added random ‘noise’, which serves as measurement error.

 Do a regression analysis, and plot the x and y values and the residuals.

Next do the same with ‘Fixed’, where 8 mm Hg has been subtracted from the cuff values, as if the cuff read consistently a little lower than the invasive pressure. The random error is still there.

In ‘Proportional’, the cuff values are multiplied by 1.3 as if the cuff were over-reading by 30%. Do the regression with x-y plot and residuals.

Finally, 'Fixed+Proportional' combines the two types of error. Do the regression with x-y plot and residuals. Now think about your results, and then look at the Explanations, Exercise 12.



Analysis of Differences

Now let us look at the method suggested by Bland and Altman, sometimes called Analysis of Differences. Basically, this involves plotting the differences between the two methods against their mean. The X axis is the mean of the pair of readings, (X1 + X2)/2, where X1 and X2 are the two measurements, because this is the best estimate which we have of the true value of the variable. On the Y axis is the difference between the pair, X1 − X2. This results in a plot where the differences are easy to see, and inspection may reveal fixed and/or proportional bias. This is assisted by showing the 0-difference line and lines showing ± 2 SD of the differences. This is an estimate of the range which should contain about 95% of the observations and is sometimes called the limits of agreement. The 95% CI of the mean difference can be calculated, and Bland and Altman suggest that if that range overlaps 0, the two methods can be used interchangeably. The exact type and magnitude of the bias can be detected by using linear regression on the differences versus the mean values of the paired observations, and testing for differences of the intercept and the coefficient from 0. Here identity has a coefficient of 0, not 1 as in the case of the regression method, because it is the difference between the readings which we are testing.
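The quantities behind the plot, the bias (mean difference) and the ±2 SD limits of agreement, are simple to compute. A minimal Python sketch (the function name is mine; in the exercises you build the same columns in Excel):

```python
from statistics import mean, stdev

def analysis_of_differences(x1, x2):
    """Bland-Altman style summary: means, differences, bias, limits of agreement."""
    means = [(a + b) / 2 for a, b in zip(x1, x2)]   # x axis of the plot
    diffs = [a - b for a, b in zip(x1, x2)]         # y axis of the plot
    bias = mean(diffs)                              # the fixed (systematic) bias
    sd = stdev(diffs)
    limits = (bias - 2 * sd, bias + 2 * sd)         # 'limits of agreement'
    return means, diffs, bias, limits
```

If one instrument reads a constant 8 units lower than the other, every difference is 8, the bias is 8, and the limits of agreement collapse onto it.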


Exercise 14. Analysis of Differences

Open A&B.xls

This is the same data file as Instruments.xls but we shall analyse it differently. On each worksheet (Random, Fixed, Proportional and Fixed+Proportional) create columns for the mean of the two measurements and their difference.

Plot these columns on a scattergram.

Do the descriptive statistics on the difference column, which you have created.

Do a regression analysis with mean as the x value and the difference as y. You may wish to do a Residual Plot, which will give you a visual indication whether the linear fit is OK.

Tabulate your results for the 4 worksheets, for a good overview. Do the mean and the difference separately, with mean, SE, SD upper and lower 95% CI and add the intercept and the coefficient as well from your regression analysis.

Think about your results. My thoughts are in the Explanations folder, Exercise 14. Comparison of Methods 2.

Probably the best method for comparison of two sets of measurements is least products regression, a fairly complicated method used by chemists, which has not, to my knowledge, penetrated the health care sector. (See Ludbrook J. Clin and Exper Pharmacol and Physiol 1997; 24:191-203 if you are really keen.) It takes into account errors in both measurements, yet yields all the information of the regressions.


Non-parametric Statistics.

We have so far discussed statistics applicable to normally distributed continuous (interval) variables. They are important, but there are ordinal and nominal variables, as well as continuous variables which are not normally distributed. At least theoretically, these data should not be analyzed by the parametric methods which we have discussed. In practical fact, in many instances the parametric and non-parametric methods come up with the same, or at least very similar, answers. As parametric tests are more sensitive (by only about 5%), they should be used where the scatter of the data does not deviate significantly from a normal distribution and other criteria permit. For those cases where data are not suitable for parametric treatment, we have non-parametric equivalents of most methods for testing differences between means. These tests depend on ranking the data and test for differences between the mean ranks. There are even equivalents of the parametric (Pearson) correlation coefficient: the Spearman correlation coefficient and Kendall's Tau. As, however, we have no knowledge of the distribution of our data, we have no equivalent of the SD, so we shall not be able to say that we expect 68% of the values to fall within the ± 1 SD limits, and so on. Nonetheless, whatever the underlying distribution of the data may be, if we draw samples, the distribution of the sample means will be normal. This leads to a whole new class of tests, called re-sampling, upon which we shall touch.




Situation | Parametric test | Non-parametric equivalent
2 independent samples | t-test for independent samples | Mann-Whitney U test (also called Wilcoxon)
2 paired samples | t-test for paired samples | Sign test, Wilcoxon signed-rank test
2 or more independent samples | one-way analysis of variance | Kruskal-Wallis test
2 or more samples, with repeated measures | one-way analysis of variance with replication | Friedman test
Relationship between two variables | Pearson's product moment regression | Spearman's regression, Kendall's Tau


The table above gives the non-parametric equivalents of the parametric tests we have met.

There are many other non-parametric tests for various purposes, which we shall not discuss, but there is one which is so useful that we need to have it in our cupboard. This is the Chi-square (χ²) test.

The Chi-square test.

This test can be used to compare our data or results with any expected results. We can test whether a set of numbers is compatible with a distribution, such as the normal distribution, or whether a politician had special appeal for one sex or the other. As long as we can postulate ‘expected’ results, we can compare them with the actual numbers in our data set, whether the expected number is based on a theoretical distribution or on a hypothesis.

The algorithm for chi-square is simple:

χ² = Σ (O − E)² / E

where O stands for the observed and E for the expected number. If there are only two observed values and two expected (the minimum), a significant error creeps in, because the χ² distribution, which will tell us the probability, is continuous, but the values for O and E are not; they are usually integers. Therefore a correction, called Yates' correction, is applied, and the formula reads

χ² = Σ (|O − E| − 0.5)² / E

The vertical bars around O − E mean absolute value. This correction is only applied to 2x2 tables.

To find the probability of our result, given the expected values, we need to know the degrees of freedom for χ², which is (number of columns − 1) × (number of rows − 1).

The χ² statistic has its own distribution, which we can access via the CHITEST function in Excel. So we are now equipped to develop our χ² spreadsheet. Let us get some data, from the Electoral Commission.
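The formulas above amount to a short loop; a Python sketch (the function name and the yates flag are mine, not part of Excel) of what =CHITEST builds on:

```python
def chi_square(observed, expected, yates=False):
    """Chi-square statistic: the sum of (O - E)^2 / E over all cells.

    With yates=True, Yates' continuity correction is applied,
    (|O - E| - 0.5)^2 / E, which is only used for 2x2 tables."""
    total = 0.0
    for o, e in zip(observed, expected):
        d = abs(o - e) - 0.5 if yates else o - e
        total += d * d / e
    return total
```

For observed counts of 10 and 20 against expected counts of 15 and 15, the uncorrected statistic is 50/15 ≈ 3.33; the Yates-corrected value is smaller, 40.5/15 = 2.7, reflecting the more conservative correction.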

At the 2001 Federal election, on a 2-party-preferred basis, the Coalition polled 1,811,019 and the ALP 1,662,190 votes in NSW. In Bennelong, the PM's electorate, the corresponding numbers were 44,855 and 32,883. Did John Winston Howard do better than the State average? Is the difference in the State between the two parties statistically significant, or likely to arise by chance?

Exercise 15, Chi-square tests.

Let us ask two questions:

1.     Is the Coalition’s vote in Bennelong different from the NSW average?

2.     Is the Coalition’s vote in NSW different from the ALP’s?

We can make a table here, in Word, as the numbers are in front of us. By the way, Word will sum numbers for us: look in Table\Formula.




            Coalition    ALP          Row Total
NSW         1,811,019    1,662,190    3,473,209
Bennelong   44,855       32,883       77,738

Open Excel and insert the table into a new spreadsheet, which you will save later as Chi-square.xls in the Data folder. Now let us answer the two questions.

1.     Is JWH’s vote different from the State average? We know the actual (observed) votes in Bennelong. The expected vote, based on the State average will be

a.     Coalition: 77738*1,811,019/3,473,209 = 40535

b.     ALP: 77738*1,662,190/3,473,209 = 37203

So now it only remains to make our contingency table with rows for the observed and expected values and columns for the Coalition and the ALP (it would not matter if you did it the other way).

Finally, all you need to do is to use the =CHITEST function, get the syntax from the Excel help.

Check the result by setting up your own Chi-square test, using the function (Observed − Expected)²/Expected for each cell, summing them to get the Chi-square statistic, and then looking up the probability by using the =CHIDIST(Chi-square, Degrees of freedom) function.

2.     Now the second question. Is the difference in the State vote significant? Well you have the observed values and if there were no difference, the votes would be split 50:50 so there are the expected values. Voila!
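The expected-vote arithmetic in question 1 can be checked with a short Python sketch (pure arithmetic on the published totals; the dictionary names are just for illustration):

```python
# NSW 2-party-preferred totals and Bennelong observed votes, 2001 election
nsw = {"Coalition": 1811019, "ALP": 1662190}
bennelong_observed = {"Coalition": 44855, "ALP": 32883}

nsw_total = sum(nsw.values())                       # 3,473,209
bennelong_total = sum(bennelong_observed.values())  # 77,738

# Expected Bennelong votes if the electorate matched the State average:
# (electorate total) * (party's State share)
expected = {party: bennelong_total * votes / nsw_total
            for party, votes in nsw.items()}
```

Rounded, this reproduces the expected values worked out above: 40,535 for the Coalition and 37,203 for the ALP.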

A second example:

Now let us examine the sort of cars people drive. We’ll group them into small, medium and large and we shall note the number of each as 100 pass us, in three locations: a rich suburb, a poor suburb and the CBD.






               Small   Medium   Large   Row Total
Rich suburb      …       …        …       100
Poor suburb      …       …        …       100
CBD              …       …        …       100
Column Total     96      84       120     300






The question is whether this pattern is likely to arise by chance. So let us transfer this data to a new worksheet in our spreadsheet and set up our 'expected' values, which must be estimated from our observations. If there were no differences between the locations, then there should be 32 (96/3) small cars, 28 medium and 40 large ones in each location.

What does the p value mean here?

Testing for normal distribution with Chi-square.

Statistics packages with good non-parametric tests contain the one-sample Kolmogorov-Smirnov test. This test compares the data to a known distribution, such as the normal, and if the difference is not significant, we can regard the data as near enough to normally distributed. Excel does not have this test in its data analysis package. We have to make our own, which we can do with the =CHITEST() and =NORMDIST() functions of Excel, although in fact the eyeball test is sometimes sufficient for anyone except reviewers of scientific papers.

A normal distribution is described by its mean and standard deviation. If we use the Descriptive Statistics function on our data, we can obtain these values for our numbers. They can then be used to produce the ‘expected’ values of a normal distribution, for comparison with our actual data.

Let us look at an example, say, survival after diagnosis of Martian swoof disease; the gathered data for 100 unfortunate sufferers is on the Swoof worksheet in Normtest.xls. Let us set up Ho, that our data does not differ significantly from a normal distribution.

Exercise 13. Is the distribution normal?

Open Normtest.xls and use the Swoof worksheet. The numbers represent survival in years, after diagnosis.

The first thing to do is to run Descriptive Statistics to find the mean and SD of our data. Next, make a histogram of the survival times; you can use the ‘bins’ (intervals) Excel makes automatically when you leave ‘bin range’ blank. The reported ‘Frequency’ will become your ‘observed’ data.

Now we get to the complicated bit: we have to set up the ‘expected’ (normally distributed) values. We’ll use the =NORMDIST(x,mean,SD,cumulative) function. Let me explain the syntax.

1.     x is the test value, in this case, the upper limit of the ‘bin’.

2.     The mean and SD we have just calculated for our data.

3.     Cumulative refers to the area under the normal curve, the total probability up to the value x. We set it to ‘true’ if we want the probability up to that limit, as we do here (this is called the cumulative distribution function, CDF), and to ‘false’ if we just want the frequency at that particular value.

Now for the good bit: we want the CDF for each bin separately. That means finding the CDF at the bin’s upper limit and subtracting the CDF of the bins below it, to find the number of values expected in that bin. The area under the curve between, say, 2 and 3 represents the fraction of all the values which lie between those two numbers; to find it, we subtract the area to the left of 2 from the area to the left of 3.

For convenience, I set up an adjacent column with 0 opposite the lowest bin and the top value of the bin below in the other cells. You need to replace ‘More’ in the final bin with a number greater than the maximum in your data. Finally, to convert the probability, a fraction of the total, into the expected number, multiply it by the number of values in your sample, in this case 100. If this is as clear as mud, have a look at the worked examples.

The syntax in the ‘Expected’ column is :

The first value: =100*(NORMDIST(bin1,mean,SD,TRUE) - 0)

The next: =100*(NORMDIST(bin2,mean,SD,TRUE) - NORMDIST(bin1,mean,SD,TRUE))

and so on for each bin, until

The final bin: =100*(NORMDIST(bin11,mean,SD,TRUE) - NORMDIST(bin10,mean,SD,TRUE))

Remember to insert the ’$’ (absolute references) for the mean and SD, or you will get some funny numbers. Type in the first two; after that, copy and paste saves a lot of time. The sum of these ‘expected’ values should be about 100 (there may be a rounding error).
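The same bin-by-bin subtraction can be checked outside Excel. Here is a minimal Python sketch using the standard library’s normal distribution; the bin edges, mean and SD below are hypothetical stand-ins, not the Swoof values:

```python
from statistics import NormalDist

# Hypothetical stand-ins: fitted mean, SD and sample size
mean, sd, n = 5.0, 2.0, 100
bin_uppers = [2, 4, 6, 8, 10, 99]   # 99 replaces 'More': above the data maximum

norm = NormalDist(mean, sd)          # .cdf() is Excel's =NORMDIST(..., TRUE)
expected, prev_cdf = [], 0.0         # the 0 opposite the lowest bin
for upper in bin_uppers:
    cdf = norm.cdf(upper)
    expected.append(n * (cdf - prev_cdf))  # values expected in this bin
    prev_cdf = cdf

# The expected counts should add up to (about) the sample size
print(round(sum(expected), 2))  # -> 100.0
```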

The degrees of freedom equal the number of cells (11) minus the number of estimated parameters (mean and SD, i.e. 2) minus 1, that is 11 − 2 − 1 = 8, multiplied by the number of columns (2) minus 1, which is 1.

I put in a column of the individual Chi-square terms, which sum to 4.18. The p value of H0 (the distribution does not differ from normal) is =chidist(chi-square,DF) = 0.8407, and you can accept H0 with some alacrity.

The =chitest function does not quite work here, because it does not allow for the estimated parameters and would use DF = (c-1)*(r-1) = 10*1.
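Incidentally, when the degrees of freedom are even, the upper-tail chi-square probability has an exact closed form, so the p value above can be cross-checked without Excel. A sketch (4.18 is the rounded statistic, so the last decimal differs slightly from Excel’s 0.8407):

```python
import math

def chidist_even_df(x, df):
    """Upper-tail chi-square probability (Excel's =CHIDIST) for even df."""
    k = df // 2  # closed form: exp(-x/2) * sum over i < k of (x/2)^i / i!
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i)
                                  for i in range(k))

p = chidist_even_df(4.18, 8)
print(round(p, 4))  # -> 0.8405
```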

Do the 2nd and 3rd examples, for practice and see what conclusions you reach.

As a matter of interest, you can do comparisons with other distributions such as binomial or Poisson in Excel, using similar methods. There are a few remarks in Explanations.


Some very medical statistics.


This section is based on National Prescribing Service News, 25/2002: “Making sense of evidence-based jargon”. The data used are quoted from the Women’s Health Initiative paper (JAMA, 2002; 288:321-33) and present the effect of hormone replacement therapy on the incidence of osteoporotic fractures and cardiac events. Two groups were studied, those on HRT (N = 8506) and placebo controls (N = 8102). The percentages of osteoporotic fractures and cardiovascular events are used to illustrate certain statistics: relative risk, absolute risk, risk reduction, number needed to treat and number needed to harm.


Annualised Event Rate          Treated (8506)    Control (8102)

Osteoporotic fracture rate         1.47 %            1.91 %


The first question that arises is whether the difference is significant. The way to clarify this is a contingency table and a Chi-square test. Let us turn the percent incidence into actual (annual) events:

Exercise 15. Another contingency table.

We shall check the significance of the numbers by building a contingency table and doing a Chi square test.

Fractures in the treated group:     8506*1.47/100 ≈ 125

Control group                              8102*1.91/100 ≈ 155

It is easiest now to go to a spreadsheet: HRT.xls, a two-way table showing the number with and without fractures in the treated and the control groups. Next you need to make a table of the expected values, assuming the two groups come from the same population (Ho) – a simple proportion problem of the kind we examined in the last section. Make up your contingency table:



                 Fracture    No Fracture    Row Total

Treated

Control

Column Total





There were 280 fractures among 16,608 subjects; how many do you expect among the 8506 and the 8102, respectively? This is very easy in the spreadsheet.

Then finally you just invoke the CHITEST function of Excel with the actual and expected value ranges (=chitest(actual value range, expected value range)) and up pops the two-tailed p-value!
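For anyone who wants to verify the whole table outside Excel, here is a Python sketch. The fracture counts are derived from the annualised rates in the table (8506 × 1.47/100 ≈ 125, 8102 × 1.91/100 ≈ 155), and with one degree of freedom the upper-tail probability again reduces to the complementary error function:

```python
import math

# 2 x 2 table: rows = treated / control, columns = fracture / no fracture
table = [[125, 8506 - 125],
         [155, 8102 - 155]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# Expected cell = row total * column total / grand total (H0: one population)
chi_sq = 0.0
for i in range(2):
    for j in range(2):
        e = row_totals[i] * col_totals[j] / grand
        chi_sq += (table[i][j] - e) ** 2 / e

p = math.erfc(math.sqrt(chi_sq / 2))  # 1 degree of freedom
print(round(chi_sq, 2), round(p, 3))  # -> 4.93 0.027
```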

Now that we know that the difference is significant at p < 0.05, it is worth continuing to calculate the relative risk or hazard ratio. The risk of fracture with HRT was 1.47 % and in the placebo group it was 1.91 %. The relative risk is

Relative risk = 1.47/1.91 = 0.77

The reduction in risk is 0.23 or 23 %. Looks good, but 23 % of what? The absolute risk is what we should look at! This was reduced from 1.91 % to 1.47 % per annum, which is a reduction of 0.44 % or 0.0044 per subject per year.

It certainly sounds better to say that the annual risk of osteoporotic fracture is reduced by 23 % by HRT than to say that the absolute risk reduction is 0.44 % per annum, or even that “the risk of osteoporotic fracture is reduced by 0.0044 per subject per year”!

The number-needed-to-treat (NNT) is the number of subjects you would need to treat, over a specific time, to reduce the incidence of the adverse outcome by 1. This is readily calculated:

The absolute annual rate of reduction of fractures is 0.0044. We postulated that we want to produce a reduction of 1 fracture, therefore

NNT = 1/(absolute risk reduction) = 1/0.0044 ≈ 227

In this instance, therefore, to prevent 1 fracture per annum, we need to treat 227 patients with HRT. When we discuss producing an adverse event by a treatment, the equivalent of NNT is number-needed-to-harm (NNH).
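The chain of calculations from the two rates to the NNT can be written out in a few lines, using the annualised rates from the table above:

```python
# Annualised osteoporotic fracture rates from the WHI table
risk_treated = 1.47 / 100   # HRT group
risk_control = 1.91 / 100   # placebo group

relative_risk = risk_treated / risk_control            # about 0.77
relative_risk_reduction = 1 - relative_risk            # about 0.23, i.e. 23 %
absolute_risk_reduction = risk_control - risk_treated  # 0.0044 per subject-year

# Number needed to treat for one year to prevent one fracture
nnt = 1 / absolute_risk_reduction
print(round(relative_risk, 2), round(nnt))  # -> 0.77 227
```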

From the same source, we can look at the HRT-induced increase in rate of cardiac events.

Exercise 16. The number-needed-to-harm.

The data for cardiovascular events in the same groups of subjects are in NNH.xls. Check whether the difference is significant, then work out the relative risks, relative risk ratio, the absolute risks, the change in the absolute risk and the NNH.

Why did I start off using an annualized percent incidence rather than the actual number of events? (This is important; see Explanations.)


Things we have not discussed.

There are many statistical tests, useful or arcane, which we have not discussed. The ones which spring to mind include

Multiple linear regression, of the type y = a + b1x1 + b2x2 + …, where x1, x2 are different variables.

Partial correlation analysis, where a correlation is adjusted for one or more variables. For example, the correlation between salary and age, adjusted for experience.

Non-linear regression, where y is a curved function of x, exponential or logarithmic for example.

Curve fitting, where a hypothetical mathematical model is fitted to your data or possible models are compared.

Survival (life) tables.

and there are plenty of others.


Excel statistical functions, which we have not used

Excel is rich in mathematical functions, including many of statistical use. To have a look, open Excel Help and search the index for ‘function’.

Of the procedures in the Data Analysis Pack, there are many which we have not explored. I shall list their uses if you are curious:

Anova: Familiar to you.

Correlation: How strong is the relation between x and y?

Covariance: This is a sort of combination of analysis of variance and regression. An example would be if company A owns shares in B. The share price of B would affect the shares in A. If you did an analysis of shares in A, you could include shares B as a factor.

Descriptive Statistics: Familiar ground.

Exponential Smoothing: A method of taking out short term fluctuations in a series.

F-test Two-Sample for Variances: Do the variances of the two groups differ significantly?

Fourier Analysis: An analysis of power versus frequency in a waveform.

Histogram: You have used this.

Moving Average: A smoothing technique for curves.

Random Number Generator: Not quite random, but near enough. Be sure to use a different ‘seed’ if you don’t want to repeat the previous number series.

Rank and Percentile: Ranks in descending order and assigns a percent value (% of largest in the series).

Regression: An old friend.

Sampling: Takes a sample of a given size from your data. It would be much more useful if you could get it to take a set number of samples.

t-Test: Three types, one for paired data, two for independent groups with equal or unequal variances.

z-Test: Like a t-test, but population variances need to be known.



Appendix 1

Some elementary stats books.

Not in any particular order.

Statistics in Practice. SM Gore and DG Altman. Brit. Med. Assoc. London, 1982. ISBN 0727900854. A collection of relevant articles from the BMJ. I used this a lot in designing this course.

Statistics at Square One. TDV Swinscow. Brit. Med. Assoc. London, 1978. ISBN 0727900358. A good intro. Limited in scope.

Statistics with Confidence. MJ Gardner and DG Altman. Brit. Med. Assoc. London, 1989. ISBN 0727902229. Confidence intervals and statistical guidelines. Shows an alternative approach to probability estimation. All the material is available as DOS computer programs.

Primer of Biostatistics. SA Glantz. McGraw-Hill, 1989, ISBN 0070233721. A very good approach with little jargon or intimidating mathematical terms. Good examples, enough to make your own spreadsheet. I have used this book more than any other to design this course.

Medical Uses of Statistics. Eds JC Bailar and F Mosteller. NEJM Books. ISBN 0910133166. Good discussion, illustrated by examples good and bad from the NEJM. Not much numeric stuff.

Statistical Methods. GW Snedecor and WG Cochran. Iowa State Univ. Press 1987. ISBN 0818315606. Big, dry, hard to read, but apart from survival and a few other things it’s all in there in the mass of numbers and algorithms.

Epidemiology, Biostatistics and Preventive Medicine. JF Jekel, JG Elmore and DL Katz. WB Saunders & Co, Philadelphia, 1996. ISBN 0721652581. Clear, simple approach to statistics used in Medicine. If you buy a book (apart from mine), get this.


Appendix 2.

Some Good Stuff on the Net

There is plenty of interesting statistical material out there. For a start, two discussion groups, sci.stat.consult among them. Although rather high-powered for me, I have had good help with problems on both.

Then there are stats textbooks, calculators and courses on line:

1.     Introductory Statistics: Concepts, Models and Applications. A text and course, a little lighthearted and easy to digest.

2.     Statistics: The Study of Stability in Variation.

3.     The web textbook I used most frequently. Good search engine.

4.     Used to be at Rice U.; the author has now struck out on his own. His text is still there, and rather good.

5.     Another on-line course.

6.     A big collection of statistical resources and URLs.

As general sources, the statistical societies’ and associations’ (US, UK etc.) websites always maintain a lot of links.

Some good stats programs and calculators are also available for nix.

1.     Epicalc, a stats calculator, very good for simple things once you nut out how.

2.     Whatis.exe, another calculator, from the same source as Epicalc.

3.     Dataplot: a great big stats program from the US National Institute of Standards. Does everything, if you can work out how. 77 Meg! Has a big manual, and it is needed.

4.     Modstat, a small, DOS-based stats program which has impressed me very much. Designed by a statistician teaching elementary classes; it has good guides and does most things. Not quite free, at USD 22.00. Old, but was still there last I looked.

5.     PS.exe – a good program for estimating sample sizes and power. You need something like this now to publish or apply for funds.

6.     Simple Interactive Statistical Analysis. A site with a good sample size calculator as well as a whole host of other statistical calculators.

7.     A page of statistical machinery, accessed over the web, easy to use.

8.     A random number generator on the web which will give you all sorts of randomized groupings for more complex research formats (Latin squares, crossovers, multiple groups etc.).

9.     XLStatistics is a large suite of macro-driven worksheets, headed by a good index page which chooses tests on the basis of the type of data to be analysed. It is free for students and educators. It includes most common non-parametric tests and is altogether a sophisticated package, which I recommend. Great value.



Appendix 3.

Some Mathematics That You Don’t Need to Know Off By Heart.

SD (standard deviation) of a population: SD = √( Σ(x − m)² / N ), where m is the population mean and N the number of values.

SD of a sample: SD = √( Σ(x − x̄)² / (n − 1) ), where x̄ is the sample mean and n the sample size.

Standard error of the mean: SEM = SD / √n
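These formulas are easy to check numerically; Python’s standard library implements both versions of the SD (the data below are made up):

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # a made-up sample

pop_sd = statistics.pstdev(data)        # population SD: divides by N
sample_sd = statistics.stdev(data)      # sample SD: divides by n - 1
sem = sample_sd / math.sqrt(len(data))  # standard error of the mean

print(pop_sd, round(sample_sd, 3), round(sem, 3))  # -> 2.0 2.138 0.756
```

Note that the sample SD is always a little larger than the population SD, because of the n − 1 divisor.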


The function which describes the normal distribution is:

y = 1 / (SD·√(2π)) × e^( −(x − m)² / (2·SD²) )

The Student’s ‘t’ distribution:

f(t) = [Γ((ν + 1)/2) / (√(νπ)·Γ(ν/2))] × (1 + t²/ν)^(−(ν + 1)/2), where ν is the degrees of freedom.

The Poisson distribution:

If the average occurrence of an event in time T is m, then the probability of the event occurring X times in time T is

P(X) = e^(−m) · m^X / X!

where X! is ‘factorial X’ = 1 × 2 × 3 × … × X.
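The Poisson formula translates directly into code; a small sketch (an average of m = 2 events per interval is just an example):

```python
import math

def poisson(x, m):
    """Probability of exactly x events when the average count in time T is m."""
    return math.exp(-m) * m ** x / math.factorial(x)

# With an average of 2 events per interval: P(0), P(1), P(2)
probs = [poisson(x, 2.0) for x in range(3)]
print([round(p, 4) for p in probs])  # -> [0.1353, 0.2707, 0.2707]
```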


[1] Fletcher Knebel

[2] Edzard Ernst, Obstacles to research in complementary and alternative medicine, MJA 2003; 179 (6): 279-280. This is worth looking up, if only as a counterbalance to my utter scepticism of ‘natural remedies’ and alternative medicine, which in turn is well reflected in “The Sceptics Dictionary”.

[3] Line Plots. Reese AR. Significance, June 2006, 85-87.

[4] A useful guide to the Excel Data Analysis package, and for statistics in Excel is “Data Analysis Using Excel” by Michael R. Middleton, Duxbury, CA. ISBN 053435968-X. There are several other similar books

[5] I wish this were original, but I read this somewhere.

[6] Hicks CL et al., Pain 2001; 93:173.

[7] Called Gaussian because the German mathematician Carl Friedrich Gauss did not describe it. It may have been first described by Abraham de Moivre, Gauss’s senior by about 100 years.

[8] Kurtosis and skewness are historical statistics. Pearson, about 100 years ago said that any probability distribution can be described by 4 characteristics: The mean, the variance, skewness and kurtosis. This is probably not universally true and of no great importance to you or me, but I thought I had better explain the terms as you are bound to meet them.

[9] We are making a rough calculation here, because the chance of the first card being an ace is 4/47, but the second is 3/46, if the first was an ace and 4/46 if it was not and if the first 2 cards did not contain an ace, your chances have become 0. We are also neglecting the small probability of 4 aces.

[10] The quick and easy way to do these calculations is to use Excel’s =BINOMDIST() function. Look up the syntax in Excel Help.

[11] Bland JM, Altman DG. The odds ratio. British Medical Journal 2000; 320:1468.

[13] Bias is defined as any process, at any stage of inference, which tends to produce results or conclusions that differ systematically from the truth. D.L. Sackett, in “Bias in Analytic Research” (J Chron Dis 1979; 32:51-63), identifies 35 sources of bias in sampling and measurement.

[15] This section is based on: Validating scales and indexes. Bland JM, Altman DG. Brit Med J 324:606; 2002.

[16] This is one of my hobby horses: fortuitous means ‘by chance’, not ‘by good fortune’. Sports commentators please learn this! I know... it is evolving to mean “by good luck” but must we encourage bad usage?
