“Smoking is one of the leading causes of statistics.”[1]
Why another introductory statistics book? In fact, why learn statistics at all? Data from studies and experiments in most sciences requires summarisation in order to be communicated and readily understood. This is one of the objectives served by statistics, and one which is often forgotten. More obviously, if there is a difference between the outcomes of one treatment and another, we need to decide whether such a difference is due to chance factors or is the result of one treatment being really superior to the other. This is the other function of the science of statistics, which then enables us to reach rational conclusions and make decisions which are likely to be correct. Statistics is the cornerstone of evidence-based medicine, where we endeavour to base our clinical decisions on scientific evidence rather than other factors such as traditions, gut feelings, personal experience or anecdote. This is a very important development in health care, which has taken about half a century to gain unopposed acceptance by rational, scientific health care practitioners.
I recall a good friend, the Professor of Anesthesiology at a prestigious U.S. university, telling me that “...if you need statistics to prove your point, don’t bother!” This was more than 30 years ago, and I suspect that if I asked him the same question today, he would say “If you don’t have statistics to prove your point, don’t bother!” The greatest problem I see in the delivery of quality health care is to teach the public to demand proof of efficacy rather than be swayed by hearsay and testimonials.
Figure 1. A health claim.
It is an unfortunate corollary of freedom of expression that the most outrageous claims can be made in the field of health without substantiation, and little can be done to enforce any standard of proof for such claims. Alternative medicine has become a fertile field for promoting therapies whose efficacy is unproven and generally unassessed. Don’t misunderstand, I know there are some efficacious natural or herbal remedies and there are many careful reviews of alternative medical claims, but the vast majority come without any proof of their claims of effectiveness. The question this begs is: what is acceptable evidence in such cases? The answer lies in studies which carefully compare a treatment with no treatment or with another treatment, and statistics offers the only safe way such comparisons should be evaluated.
Often there are difficulties in making such assessments[2], but the issue which is paramount to us is the place of statistical processes in evaluating experimental evidence. It is perhaps quite unrealistic to hope for general public understanding of statistical procedures, but it is not unreasonable to demand that health care professionals be able to understand the simple statistics which are presented in the final evaluation of experimental or observational studies.
I am not a statistician. I was a medical practitioner and researcher who used and abused statistics in a long career, and I think I understand enough about the use of statistics in biomedical research to pass on some of this knowledge. I hope to enable you, the reader, to gain some understanding of the meaning and use of statistics and to do some statistical operations sensibly and correctly if you have occasion. These are the objectives of this learning package.
Another important aim is to present statistics as a live and, dare I say, interesting branch of the black arts. It is a subject with controversies, disputation and disagreements among the experts, a subject which should not be condemned to the cook-book presentation so common in elementary statistics texts.
A third objective is to enable the reader to distinguish cases where statistics is used properly from those where it is abused, which, unfortunately, is often the case. Most abuses are more or less obvious, but it takes a certain mind-set to examine numbers that are presented with the appearance of authority. I want to instill this critical skepticism in users of this package so that they are able to examine carefully the buckets of numbers thrown at them in scientific journals.
Most textbooks of statistics are dry, didactic and full of computational material which is not of great utility in these days of convenient software. I have therefore tried, in contrast, to keep the course material easy to read, relevant and illustrated by practical examples. I have tried to present the mathematics only where it explains underlying ideas.
Throughout this teaching program I use Excel® and its Data Analysis Pack. It is convenient and useful for the part-time researcher who is also interested in understanding reported statistics. A warning, however, is appropriate here. All statisticians to whom I have spoken have had a low opinion of Excel statistics. It is not very accurate, it is primitive and its graphics are sometimes misleading[3]. It also lacks some important tests (the “non-parametric tests”, which we shall discuss later). Why am I using it despite all this? Because it is there! You have it on your computer and you may have no other statistical programs to hand. If, however, you become interested in statistics or you propose to do your own statistical calculations for publication, I strongly recommend you use some other platform. There are many, many good program packages out there, and SPSS, Stata, S-plus and Minitab spring to mind. I have used them all at some time or other. Of the many, Minitab has an inexpensive academic edition, or you may be able to obtain a student copy of S-plus. They both include pretty good graphing components (important if you are to present or publish). The ‘R’ language has the attraction of being free and very powerful, but it is quite a handful to learn and is more suitable for those who develop a serious interest in the subject. In any case, they all read Excel files, so the latter remains a useful option for recording data.
I have written the material for the Windows operating system, Win 2000 or later, and I am using Microsoft Word and Excel® from Office 2007®, but you should have no problems using Office 97®, or for that matter, Office XP® or 2003®. Because it is written with Microsoft Office, I expect it will run on Apple® computers as well. In order to follow this course, you must know how to use the common functions of your operating system and how to navigate about your storage media. You will also need a working knowledge of spreadsheets, including data entry, simple data manipulation and graphing. You may find the Excel courses on the Microsoft training website useful, at http://office.microsoft.com/en-us/training/default.aspx. For those new to spreadsheets, I recommend them strongly. You need to have the “Data Analysis Package” installed in your copy of Excel. If needed, enter “data analysis package” in the search box of Excel help.
There are many links in this document, which show blue and turn purple when you have used them. Whenever I use a new term and do not explain it, I have generally inserted a link to the definition. After looking at the link, you can return to where you were in the text by clicking on the word ‘back’. Links will also take you to where explanations and answers to questions are. There are links which will open the Excel files used in exercises and also the explanation files, but there are no automatic links back, as you only need to re-open the ‘Word’ file to return to where you left it.
I have found the spreadsheet Excel® very useful for tabulating research data, and therefore I have used its data analysis package quite a lot[4]. It is very simple to use and although it is often criticized, it is really quite useful. We can make up for some of its shortcomings as we move along. As I have said, I have three objectives. The first is to enable you, a clinician, a user of clinical research or a clinical researcher, to understand and correctly interpret data presented in statistical form. The second is to help you present or publish results of your research. Finally, if you understand the basics of statistics, you should be able to detect when statistics are misused or abused.
The School of Statistics at UCLA has as its slogan:
Statistics – The study of stability in
variation.
This is very cool but does not strike me as being very informative. I prefer to define statistics as the branch of mathematics which deals with acquiring, analyzing and drawing conclusions from data, especially conclusions which are likely to be right. If we want a slogan, perhaps the following could serve:
To this we can add the ability to concisely and accurately
communicate information.
Conventional elementary statistics texts present the subject in a cut and dried (especially dried) manner. This is how it is, and for ever more shall be so! Well, as the song says, “…it ain’t necessarily so!...” As I read for this text, I found out that even some very simple aspects of the subject are debated by experts. For example, there is the debate between Bayesians and Frequentists. When faced with a piece of research, Bayesians ask: “What does this piece of research add to our knowledge of the subject?” while the Frequentists’ question is: “What does this study tell us?” Think a little about the difference. The Frequentists, who are in the majority among statisticians, regard a parameter (a statistic describing some characteristic of a whole population) as having a single true value. The Bayesians consider a parameter to have a probability distribution – such as, ‘it is 95% probable that it lies between values x and y’. They tend to analyse and report statistics differently from the Frequentists.
Another example of disputed territory is the use of parametric or non-parametric tests. The former (parametric tests) assume that the data conform to some known probability distribution with defined characteristics. The latter (non-parametric tests) do not make this assumption. The former, when applied to appropriate data, tend to be ‘more powerful’, that is, more likely to discover a significant difference than the latter. When are you entitled to use the former and when not? The world of statistics here divides into two again: the ‘near enough is good enough’ camp and the purists. There are many other areas of contention among real live statisticians.
It is important to emphasise that the science of statistics does not deal in certainty. If you seek certainty, go to church or join a political rally. Statistics will only impart probabilities, such as: ‘Based on this data, drug A will cure your headache quicker than drug B, with odds of 100 to 1’, but it will never say that it will cure it quicker.
In the past, data analysis was tedious and time consuming,
best kept simple. With a computer now on
every desk (and even in most University garbage bins) methods which depend upon
repeated calculations with large masses of data have become easy and quick. This has popularised new methods such as
resampling, which were simply impractical 25 years ago.
Anecdotal impressions are often misleading. We only need to look at the claims and
testimonials of quack medicines or popular washing powders to be convinced of
this. Look at http://www.revital.co.uk. I like the Hopi ear candles best.
So in order that conclusions from research may be valid,
we need to make sure of four simple things:
1. The data on which our conclusion is based is drawn from a relevant group of experimental subjects (relevant population). For example, if we want to use a drug on patients with renal failure, information on its effects in fit volunteers would not be reliable for our purpose.
2. The data is representative of that group. The best way to achieve this is random sampling of an adequate amount of data from within that group. What amount is adequate? Not an easy question, but we consider this when we discuss power and sample size. What is the best way to do our sampling may also be a problem.
3. We need to be confident that the measurements are accurate and reproducible.
4. While we try to ensure that the conclusion we draw is valid, we must at the same time also form an estimate of how probable it is that our conclusion is wrong (see hypothesis testing).
When Disraeli said “There are three kinds of lies - lies, damn lies and statistics” he was not wrong. Incomplete, misrepresented and generally bad statistics can mislead. If a million chimpanzees pounded a million typewriters until the end of time, it is possible that one would type the whole of Hamlet. If you were shown this, could you be persuaded that Shakespeare was a chimp? It is immensely important to be able to recognize when the data presented to us is misleading and when the conclusions attached to it are unsafe.
There are many books and computer resources on statistics, some of which are quite comprehensible to us non-mathematicians. In Appendix 1 I name a few which I have used. This in no way implies that there are no other good alternatives, or even possibly better ones. I have also listed a few of the huge array of stats resources on the internet in Appendix 2.
In this package or course, we shall try to discuss briefly
the principles of statistics, simple applications and how to execute them in
the Excel spreadsheet. The Excel package has a major limitation, as I have
mentioned, in dealing only with certain types of data (parametric). We shall
try to overcome this limitation by designing our own spreadsheets to do some
tests, which do not come pre-programmed with Excel, although ‘add-ins’ are
commercially available to fill this omission.
Exercise 1. Meet Excel.
Open Excel and check that the Data Analysis Pack is installed: click on Tools and it should be there, labeled “Data Analysis”. If it is not, then get your Office CD ready and in Excel, go to Help (F1)/Answer wizard and type in “install data analysis”. Follow the instructions.
Do a few spreadsheet operations, like entering
data, filling a column with a sequence of numbers. Have a quick look at the Data Analysis
package. When we have finished, you will be able to use most of the functions.
If you are not familiar with Excel get a simple book or do an
introductory course. Be sure you can use
the graphing functions well. Graphs are fundamental in giving an overview of
data.
Open Valsalva.xls (in the ‘Data’ folder) and have a good look around. This is a pretty typical research record. It is a spreadsheet containing data from a study where subjects blew against a resistance for 20 s and had their pulse rate and BP recorded. When you have finished, don’t save it, but don’t worry about messing it up either; you can always re-copy it from the CD.
A population includes all members of a defined group: all nurses at St. Elsewhere Hospital, or all 10 year old Australian boys. If we want to know the average height of 10 year olds and perhaps the spread of their size, it is not practical to try to measure every 10 year old in the country. We resort, therefore, to sampling.
A sample is a sub-group taken from a population, usually for the purpose of estimating some characteristic of that population, which would be a ‘parameter’. For example, voters whose intention is polled in a survey form a sample from the population of all voters. Great care must be taken in choosing the sample if it is used for inference regarding a population. It must be representative of the whole population of interest. For example, income sampling in a rich bay-side suburb will yield a very different result from a similar sample in a high-unemployment area. If sampling were carried out in either locality alone, it would hardly be representative of the whole population.
A statistic is a number, usually derived from a sample, which is descriptive of a variable. It may be used for inference. For example, we measure the height of all 10 year olds at four randomly chosen schools and find the average to be 112 cm. We conclude that the average 10 year old in Sydney is about 112 cm tall. The sample statistic, average height, is used to infer the population parameter, the average height of the population. We shall learn how to estimate from our sample the limits within which the true mean of the population is likely to be. Note: “likely to be”, not “is”.
A parameter is a characteristic of the population, as a statistic is of a sample. For example, the mean height of all 10 year olds. In practice, parameters are generally inferred from statistics rather than actually measured, because of the obvious difficulty of making the measurements in all members of a population. (back)
A variable is a defined entity we use in our statistical operations. It can be a measured quantity (height, income, plasma cholesterol), an attribute (colour of cars, voting intention) or something we assign, such as a grouping variable (say, experimental group or control group), which we then go on to use to classify the measurements.
Data (singular datum) are the recorded measurements or characteristics which are used for statistical purposes. For example, the height measurements of the individual 10 year olds make up the data set for calculation of the average height. Data can also consist of attributes, such as “male – female” or “agree strongly – agree – indifferent – disagree – disagree strongly”.
Interval data is also called continuous data. This is numeric data where all ‘real’ values are possible, such as height in cm: I am 178 cm tall, you may be 181 cm; 201 cm is possible, and 165.35 cm is also meaningful. In some contexts (e.g. height) negatives do not make sense; in others (e.g. growth in stock value), they do. All values must be possible, not only integers. A limited set such as scores of 1, 2, 3 or 4, where a score of 3.5 is not defined, only 3 and 4 are, is not interval data. With interval data, 2 is twice as big as 1 and 4 is twice as big as 2. Interval data which is normally distributed (see later) is often called parametric data. Such data can be described by a measure of central tendency or ‘location’ (mean or mode or median) and a measure of dispersion (variance, standard deviation, standard error; see later, in the section on ‘Normal Distribution’). It is not strictly correct to define parametric data as necessarily normally distributed, although this error is common. In fact, data which is distributed according to some other definable distribution is also ‘parametric’, but this is getting a little precious at this stage.
Ratio data, such as percentages, are a special case of interval data.
Numeric data can be univariate, meaning that we are dealing with a single variable, however many groups or samples we may have. An example would be mean blood pressure, even though we have measured this variable in 4 different groups. Data is bi-variate when we deal with the relation between two variables (say height and weight of subjects) and can be multi-variate (say the effect of age, sex and education on salary).
Data can be cross-sectional, which is a one-dimensional data set taken across a number of sources at one time, or time-independently, such as our blood pressure data. A time series is data such as the mean blood pressure of a group of subjects recorded at weekly intervals. Longitudinal data (or a longitudinal study) refer to following changes in the study population over time. This is also one-dimensional. When it is two-dimensional, such as following mean BP readings in a number of groups over time, it is called panel data.
Ordinal data is like small, medium and large. The values have a logical progression, but not necessarily an arithmetic relation. A pain score of 3 is greater than 1, but not necessarily 3 times as painful. The ‘faces scale’ used for estimating severity of pain in children is also a form of ordinal data, because the faces follow a serial trend. ‘Scores’, so popular in assessing non-measurable things such as pain or satisfaction, are ordinal. The ‘parametric’ statistics of Excel are not properly used for such ordinal data. You should use ‘non-parametric’ types of statistics or tests. (If the data set is large, say >30, or the scale has many levels, say >7, and the data set passes a test for normal distribution, the error introduced by using parametric statistics on ordinal data is not great. Many statisticians accept this practice.)
Figure 2. Faces scale for scoring pain in children[6]
We may have defined what a pain score of 0, 1, 2, 3 or 4 means, but if you average a number of scores to a value of 3.2, you have not defined what this score means. All you can say is that it is greater than 3 and less than 4. Thus averaging to a mean is not really a proper operation for such data. The median score would be more meaningful. Still, the mean is used frequently and people get away with it even in scientific papers! I expect that if the means of such scores are used to compare 2 groups, or to compare before and after in the same subjects, parametric tests on them will have some sensible interpretation, although when we get to ‘significance’ we’ll need to re-consider this question.
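The point is easy to demonstrate in a few lines of Python (used here only as a free, convenient calculator; the scores below are invented for illustration, not taken from any real study):

```python
import statistics

# Hypothetical pain scores on a 0-4 ordinal scale (invented for illustration)
scores = [3, 4, 3, 2, 4, 3, 1, 4, 3, 4]

mean_score = statistics.mean(scores)      # 3.1 -- a level that does not exist on the scale
median_score = statistics.median(scores)  # 3.0 -- an actual, defined score

print(mean_score, median_score)
```

The mean lands between defined levels, whereas the median is always one of the levels the scale actually defines.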
Nominal data are attributes and are sometimes called categorical data. For example: “votes yes or no”; “the car is green, red, yellow or black”. Statistical treatment is confined to counts or frequencies: 12345 votes for Smith, 23456 for Jones; 18 green cars, 10 red, etc. Nominal data also requires non-parametric treatment.
In a uniform (rectangular) distribution, any value within a range is as likely as any other value within that range.
Figure 3. Comparison of the frequency distribution of 200 random numbers between -5 and 5 in a normal (Gaussian) distribution with 200 random numbers in a rectangular or uniform distribution. The left panels show the normal distribution: the upper shows the values and the lower a frequency distribution graph (histogram). On the right, the same for the rectangular distribution. Note the tendency in the normal distribution for the values to cluster near the mean.
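The clustering shown in Figure 3 can be reproduced with a short Python simulation. The seed and the spread of the normal values are arbitrary choices of mine, picked so that nearly all values fall in the same -5 to 5 range:

```python
import random
import statistics

random.seed(1)

# 200 normal values (mean 0, SD chosen so nearly all fall between -5 and 5)
normal_vals = [random.gauss(0, 1.67) for _ in range(200)]
# 200 uniform (rectangular) values over the same range
uniform_vals = [random.uniform(-5, 5) for _ in range(200)]

def count_near_mean(values, width=1.0):
    """Count values lying within `width` of the sample mean."""
    m = statistics.mean(values)
    return sum(1 for v in values if abs(v - m) <= width)

# Normal values crowd around the mean; uniform values are spread evenly
print(count_near_mean(normal_vals), count_near_mean(uniform_vals))
```

Run it and the count near the mean is far higher for the normal sample than for the uniform one.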
This is the type of distribution most frequently seen in biology and medicine. It is also called the Gaussian distribution[7]. The most frequent values lie close to the mean, and the further they are from the mean the less likely they are to occur. If graphed, the frequency curve is bell-shaped. This distribution is defined by a complex-looking function, which is of no great importance to us (Appendix 3). It is defined by two parameters: an indicator of central tendency or location, the mean (μ), and a measure of dispersion or scatter, the standard deviation (σ). It has certain characteristics which are useful.
As we have already said, the normal distribution is very common in biology and medicine, and its chief characteristic is that the further values are from the mean, the less common they are. As an example, consider the height of males. The mean may be about 175 cm. You see many men about 172 cm or 177 cm, but few of 110 cm or 220 cm in height. Another interesting characteristic is that when you plot the area under the normal distribution curve, called the cumulative distribution function, it will describe a sigmoid curve, so commonly seen in pharmacology, such as in dose–response curves. If the vertical axis of the plot is between 0 and 1, then this sigmoid describes the fraction of the distribution below the chosen value on the x-axis. For example, if x = the mean, then y will be 0.5.
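This sigmoid can be computed directly from the error function in Python's standard library; the sketch below simply checks the statement that y = 0.5 at the mean:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Fraction of a normal distribution lying below x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(normal_cdf(0))            # 0.5 -- half the distribution lies below the mean
print(round(normal_cdf(1), 3))  # 0.841 -- mean + 1 SD cuts off about 84% of the area
```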
Figure 4. The normal (Gaussian) probability function with mean = 0 and standard deviation = 1 (upper panel) and its cumulative distribution function (lower panel). The mean, median and mode are all the same, a characteristic of the normal distribution. The CDF represents the probability of encountering a value below the one chosen on the X axis. For example, the probability of encountering a value less than 0 would be 0.5.
Measures of Central Tendency:
Mean: The arithmetic average, which is the sum of all the values divided by the number of such values. Mathspeak represents this as:

x̄ = (x₁ + x₂ + … + xₙ)/n = Σx/n

where x stands for the individual values, Σ means the sum of the observations from the first to the nth, and n is the number of values in the data set. The formula looks daunting but it simply means: sum all the values and divide by the number of such values.
The mean of a population is usually designated as μ, and the mean of a sample as x̄. The bar above a symbol usually denotes ‘mean’. This is the ‘arithmetic’ mean; there are other kinds of mean in mathematics (geometric and harmonic) but they are of no interest to us at present.
The mean is a very
good measure of the central tendency of a symmetrical distribution such as the
normal. The trimmed mean is another measure of central tendency,
which is sometimes useful. To protect
against undue effect from a few outliers (values much larger or much smaller
than the others), you can use this measure to better describe your sample by
saying this is the mean of all the values except the three (or five
or fifty) largest and smallest. It is
especially useful if you have a large data set. The mean is sometimes called a
measure of location of the data.
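As a quick sketch (the values are made up for illustration), here is the effect of a single outlier on the mean, and how a trimmed mean protects against it:

```python
import statistics

values = [4, 5, 5, 6, 6, 7, 7, 8, 50]  # hypothetical data with one outlier

def trimmed_mean(data, k):
    """Mean after discarding the k smallest and k largest values."""
    trimmed = sorted(data)[k:len(data) - k]
    return statistics.mean(trimmed)

print(round(statistics.mean(values), 2))  # 10.89 -- dragged upward by the outlier
print(round(trimmed_mean(values, 1), 2))  # 6.29  -- closer to the bulk of the data
```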
Median: (Back) The middle-ranked value, with as many greater values in the set as there are lesser values. If the data set has an even number of values, N, then the median is a value half way between the two middle values, ranked N/2 and N/2+1. If the data set contains an odd number, then the median is the value with the rank (N+1)/2. The median is a better measure of central tendency than the mean for skewed distributions (where there are more values on one side of the mean than on the other). Coupled with two percentiles, say the 25th and 75th, it describes skewed distributions better than the mean and standard deviation. It is also less affected by extreme values than the mean.
Figure 5. A skew distribution.
The mean of this distribution is 0.091 but its median is 0.016 and its mode is 1. In a skew distribution the mean and the median are not the same. This distribution is skewed to the right. If the small values were few and the large ones many, it would be skewed to the left.
Mode: The most frequently encountered value. A population can have more than one mode, and if there are several modes in a sample, Excel will only report the first it encounters when you do “Descriptive Statistics”.
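A short Python sketch (with invented pulse rates) shows the even-N median rule and the mode, including the multiple-mode case that Excel glosses over:

```python
import statistics

pulse = [62, 70, 70, 74, 74, 74, 78, 81]  # hypothetical pulse rates, N = 8 (even)

print(statistics.median(pulse))  # 74.0 -- halfway between the values ranked N/2 and N/2+1
print(statistics.mode(pulse))    # 74   -- the most frequently encountered value

# Unlike Excel, multimode reports every mode when there are several
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```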
Exercise 2. Find the mean, median, mode, largest and smallest values of the ‘pre-test’ and ‘minimum’ pulse rates in Valsalva.xls. Don’t save the results or you will overwrite the Valsalva file.
When we talk of dispersion, we mean the scatter of the data about the mean (or median). When we consider dispersion, it is not material whether a value is greater or smaller than the mean; what we are interested in is its distance from the mean. We assess or express this as the mean of the squared deviations (this gets rid of minuses) of the data points from the mean. If we were simply to take the deviation of each value from the mean and average them, the result in a symmetric distribution would be zero. So to estimate this scatter of the data about the mean we use the variance or standard deviation (SD), or if we are talking of the means of multiple samples, the standard error of the mean (SEM).
The variance is the mean squared deviation from the mean. We square the deviations to make all values positive. For a population, σ² = Σ(x − μ)²/n, and for a sample, s² = Σ(x − x̄)²/(n − 1).
Why the difference? To find the mean, we must know the number of values in the sample, as well as their sum. If we know the mean and all the values but one, then the last value can be calculated and is therefore not independent. So in this case, n−1 is the number of degrees of freedom in the sample and we use it to calculate the variance. It results in a slightly larger estimate of variance for a sample than for a population. We shall come across the concept of degrees of freedom again and again. It is sometimes represented by the Greek letter ν (nu). The abbreviation for the variance of a population is usually σ² and for that of a sample, s².
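Python's statistics module implements both divisors, so the effect of using n−1 is easy to see (the data below are arbitrary illustrative values):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # arbitrary values; mean = 5

pop_var = statistics.pvariance(data)  # divides by n     (population)
samp_var = statistics.variance(data)  # divides by n - 1 (sample)

print(pop_var)             # 4.0
print(round(samp_var, 3))  # 4.571 -- slightly larger, as the text explains
```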
The standard deviation is simply the square root of the variance, abbreviated σ or s. It is a characteristic of the normal distribution which describes the scatter in the sample or population. This seemingly complicated process is undertaken to render immaterial whether the deviation is positive or negative, which does not matter when you only want an idea of the magnitude of the scatter of the individual values from the mean. To see the formula for the SD, click here. The region mean ± 1 SD (standard deviation) contains about 68%, mean ± 2 SD about 95%, and mean ± 3 SD about 99% of the observations in a normally distributed data set. (back) The standard deviation has little meaning if the data are not normally distributed and should not be quoted in that case.
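The 68/95/99 rule is easy to verify by simulation. This sketch draws 10,000 values from a normal distribution; the mean of 100 and SD of 15 are arbitrary choices of mine:

```python
import random
import statistics

random.seed(42)
data = [random.gauss(100, 15) for _ in range(10_000)]  # simulated normal data

m = statistics.mean(data)
sd = statistics.stdev(data)

for k in (1, 2, 3):
    inside = sum(1 for x in data if abs(x - m) <= k * sd)
    print(f"mean ± {k} SD contains {inside / len(data):.1%} of the values")
```

With this many values, the three fractions come out close to 68%, 95% and 99.7%.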
Coefficient of variation (CV) is a useful way of scaling the SD. A statement such as SD = 32 has little meaning unless we know whether we are dealing with a sample mean of 10 or 1,000. In the former case, the scatter would be huge; in the latter, very tight. So if we express the SD in units of the sample mean, it tells us immediately whether we are dealing with a large or small dispersion. The CV can be expressed as a fraction, or as a percentage:

CV = s/x̄ (× 100 for a percentage)

Using this measure, the CV of the first distribution (s = 32; x̄ = 10) is 3.2; in the second case (x̄ = 1000), 0.032.
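In code the calculation is a one-liner; the numbers below are the same ones used in the text:

```python
def coefficient_of_variation(sd, mean):
    """Express the SD in units of the mean (multiply by 100 for a percentage)."""
    return sd / mean

print(coefficient_of_variation(32, 10))    # 3.2   -- huge scatter
print(coefficient_of_variation(32, 1000))  # 0.032 -- very tight
```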
Standard error of the mean is the equivalent of the SD when dealing with sample means. It depends on the number of values in the sample as well as the scatter of the sample means. Of course, the larger the sample, the more it will resemble the population.
The SEM, sometimes called just SE, represents the
uncertainty surrounding the population mean in the same way that the SD
represents the uncertainty of individual values in the sample. From a normal
population, the means of samples will be distributed about the population mean
in a normal manner and the population mean ± 1 SEM will contain about 68% of
the sample means; ± 2 SEM, 95% and ± 3 SEM 99%. Conversely, the sample mean ± 2
SEM is 95% probable to contain the population mean and ± 3 SEM, 99% probable.
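The text does not spell out the formula, but the usual definition is SEM = SD/√n, the sample SD divided by the square root of the sample size. A sketch with invented measurements:

```python
import math
import statistics

sample = [4, 5, 5, 6, 6, 7, 7, 8]  # hypothetical measurements; mean = 6

sd = statistics.stdev(sample)      # scatter of the individual values
sem = sd / math.sqrt(len(sample))  # uncertainty of the sample mean: SD / sqrt(n)

print(round(sd, 3))   # 1.309
print(round(sem, 3))  # 0.463
```

Note that quadrupling the sample size halves the SEM, which is why larger samples pin down the population mean more tightly.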
It is a very important property of sampling that, whatever the distribution of the underlying population may be, the means of the samples drawn from that population will tend to be distributed normally about the mean of the population. This is part of the central limit theorem, and underlies a method called bootstrapping, which we shall discuss much later.
Standard errors of other statistics than the mean can be
calculated and are very useful when trying to determine the likely population
value of the statistic.
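This property can be seen in a small simulation: the sketch below draws 2,000 samples of size 30 from a flat (uniform) population, about as un-normal as a distribution gets, and the sample means still behave normally. The sample size and number of samples are arbitrary choices of mine:

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    """Mean of one sample of size n drawn from a uniform (0, 1) population."""
    return statistics.mean(random.uniform(0, 1) for _ in range(n))

means = [sample_mean(30) for _ in range(2000)]

# The sample means centre on the population mean of 0.5 ...
print(round(statistics.mean(means), 2))
# ... and roughly 68% of them fall within 1 SD of the centre, as for a normal curve
sd = statistics.stdev(means)
inside = sum(1 for m in means if abs(m - 0.5) <= sd)
print(round(inside / len(means), 2))
```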
Kurtosis[8] is the property of a bell-shaped curve which describes whether it is flatter or ‘peakier’ than the normal distribution. Skewness we have already met. These are discussed in the Explanations to exercise 4b. They describe deviations from the normal distribution, but they are not very important statistics.
Binomial is the distribution you meet when an event can have one of two outcomes (heads – tails, alive – dead, etc.) or values, with probabilities p and q. If p = q, the frequency distribution is symmetrical and bell shaped, but if p ≠ q, the distribution will be skewed. The sum of the two possible probabilities, (p + q), must equal 1. That is to say, it is a certainty (probability = 1) that one of the two outcomes will occur. We’ll consider this in detail in the next section on probability.
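The binomial probabilities can be computed from first principles with nothing more than the counting function in Python's math module; the coin-tossing numbers are my own illustration:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k 'successes' in n trials, each with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# p = q = 0.5 (e.g. heads in 4 coin tosses): symmetrical
print([round(binom_pmf(k, 4, 0.5), 4) for k in range(5)])
# [0.0625, 0.25, 0.375, 0.25, 0.0625]

# p != q: skewed
print([round(binom_pmf(k, 4, 0.2), 4) for k in range(5)])
# [0.4096, 0.4096, 0.1536, 0.0256, 0.0016]
```

Each list of probabilities sums to 1, since one of the possible outcomes is certain to occur.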
Multinomial distributions are cases where more than 2 outcomes are possible, for example the throws of a die, which can have 6 outcomes with probabilities p, q, r, s, t and u; (p + q + r + s + t + u) = 1, that is, again, the probabilities of all the outcomes add up to 1. The probabilities of each outcome may be equal, as is the case with an honest die, or they may be unequal, like the winning probabilities of horses in a race.
The Poisson distribution is a distribution of discrete events in time or space,
such as the number of goals in a soccer match or the number of lamp-posts per
kilometre. It describes events which we can count, but whose
non-occurrence we cannot count. If you really want to see the mathematical expression for this,
look in Appendix 3, but I would not
bother. In that equation, μ
is the mean (average number of occurrences in unit time, or distance etc…), but
in the Data Analysis Pack it is called λ
(lambda).
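If you would rather compute the Poisson probabilities yourself than look up the formula in Appendix 3, a minimal sketch (the mean of 5 matches Figure 6 below):

```python
from math import exp, factorial

def poisson_pmf(k, m):
    # P(k events) = e^(-m) * m^k / k!, where m (Excel's lambda) is the mean
    return exp(-m) * m ** k / factorial(k)

m = 5  # e.g. an average of 5 telegraph poles per kilometre
probs = [poisson_pmf(k, m) for k in range(20)]

for k in range(10):
    print(k, round(probs[k], 3))

# The probabilities over all possible counts sum to (essentially) 1.
print(round(sum(probs), 4))
```

Note that with a mean of 5 the counts 4 and 5 are equally the most probable, a quirk of the Poisson when the mean is a whole number.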
Figure 6. A Poisson distribution with a mean of 5. This
could be the probability of a given number of telegraph poles per kilometre, or
such.
Exercise 3: Distributions and descriptive statistics
a) Open the file (Distrib.xls). The columns in the file are:
A: Subject number;
B: Uniform distribution;
C: Normal distribution;
D: Binomial distribution (no. of blue eyed people in a sample of 5);
E: Poisson distribution with λ (mean) set to 3.
Have a look at the graphs, on sheet 2 (Graphs) of this
file. First look at the scatter of the points, then at the frequency histograms
for each type of distribution.
b) Now let’s look at the Descriptive Statistics function
in Excel.
Still in Distrib.xls,
open the data analysis pack and select descriptive statistics (Tools/Data
Analysis/Descriptive Statistics). We’ll do the stats on the normal distribution
values. So:
Select the tick box for Labels in First Row, select or
enter the data range C1 to C101, put the output in a new worksheet ply, select ‘summary
statistics’ and a confidence level of, say, 95%. If you wish, select, say, the
5th (or another) smallest and largest value and click OK. The results
will be on Sheet 3, which you could rename ‘Descriptives’.
Have a good look at what you have
calculated. Read my comments in (Explanations, 4b).
Before we discuss further the binomial and multinomial
distributions, let us think a little about probability, the subject that should
be dear to the heart of every honest punter. It all started when the good Duke of Burgundy,
who was the sort of high roller who gets the red-carpet treatment at every
casino, commissioned the mathematician Blaise Pascal to determine his chances
of winning at roulette. These original
foundations were greatly expanded and refined by Fermat, Huygens,
Bernoulli and other great mathematicians.
To go back to the good Duke: if you look at the roulette wheel,
the European wheel has 37 slots (0 and 1 to 36), and the American, 38 (0, 00,
1–36). The chances, therefore, of winning on a single number bet are 1 in 37 or 1
in 38, respectively. Yet they only pay 35
to 1 for a win. So what are the Duke’s
chances of emerging a winner from the casino? In the long run, about the chance of a snowball
in hell, and that is the reason the casino can offer shows and cheap eats. That little edge (about 2.7% on the European
wheel and 5.3% on the U.S. wheel) pays for it. Except probably for blackjack, the house edge
is even greater for other games.
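The house edge quoted above is easy to verify; this little sketch simply computes the expected loss per unit staked on a single-number bet:

```python
def house_edge(slots):
    # A single-number bet pays 35 to 1; only 1 of `slots` outcomes wins.
    p_win = 1 / slots
    expected_profit = 35 * p_win - 1 * (1 - p_win)
    return -expected_profit  # the casino's take, per unit staked

print(round(house_edge(37) * 100, 1))  # European wheel (37 slots): 2.7
print(round(house_edge(38) * 100, 1))  # American wheel (38 slots): 5.3
```

The extra 00 slot on the American wheel nearly doubles the casino's take.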
Now, to get serious, let us define probability as the
likelihood of the occurrence of a random event. Toss a coin and either a head or a tail will
come up. The probability that one
or the other of the two outcomes occurs is a certainty, and we call a certainty:
p = 1
With an unbiased coin the probability of heads is equal to
that of tails, and we can write:
p_{(head)} = p_{(tail)} = 0.5
And the probability that it will come up a head or a tail is the sum of the two probabilities:
p_{(head or tail)} = 0.5 + 0.5 = 1
We can now take the next step and say that if the
probability of heads is p_{h}, then the probability of not heads is (1
– p_{h}) that is, the probability of tails (p_{t}) = (1 – p_{h}).
If you look at a die, there are six possible outcomes and
the total probability of the outcome being one of the six is 1. The probability
of any one outcome is, of course 1/6 = 0.1667.
These outcomes, head or tail, or the possible results of a
single throw of the die are mutually exclusive.
With a die the probability of a throw of 1 is 0.1667, then
the probability of throwing any other number, 2 or 3 or 4 or 5 or 6, is (1 –
0.1667) = 0.8333.
What is the probability of throwing less than 4? Well,
that is the same as saying the probability of a 1 or a 2 or a 3.
So, p_{<4} = p_{1} + p_{2} + p_{3} = 0.1667
+ 0.1667 + 0.1667 = 0.5.
If we look at the probability of outcomes A or B,
then we sum the probabilities of each:
p_{(A or B)}
= p_{A} + p_{B}
Now let us consider sequences of events. What is the
probability of tossing 2 heads in a row? The probability of throwing a head is
0.5. Then we flip again and the probability of getting heads again is 0.5. So
in half the cases we were successful on the first flip and in half of those we
shall be successful on the second. So the probability of heads twice in a row
is half of one half,
p_{2h} =
0.5 x 0.5 = 0.25
For situations where we are calculating the probability of
event A and event B both occurring, we multiply the probabilities, provided the
events are independent, as successive coin tosses are.
We flip the coin 4 times. What are the possible outcomes?
The first can come down heads or tails, so can the second, etc… Let us count the
possibilities:
4 heads – there is only 1 way this can happen, four throws
of heads.
p_{4h} =
0.5 x 0.5 x 0.5 x 0.5 = 0.0625
3 heads, 1 tails, in any order: Let us consider the
sequence of 3 heads, 1 tails:
p_{3h1t} =
0.5^{3} x 0.5 = 0.0625
but if we are not particular in what order the 3 heads and
1 tails occur, it can happen 4 ways: the 1st, the 2nd,
the 3rd or the 4th toss can be the tails. So the
probability of 3 heads and 1 tails in any order is
4 x 0.0625 = 0.25
2 heads, 2 tails. – this can happen any one of 6 ways,
work it out… h,h,t,t, h,t,h,t, h,t,t,h, t,h,t,h, t,t,h,h, t,h,h,t. The total
probability, any order, is
6 x 0.5^{2}
x 0.5^{2} = 0.375
1 heads, 3 tails – 4 ways, is similar to 3 heads, 1 tails,
or 0.25.
4 tails – 1 way, 0.0625
There are altogether 16 possible sequences after 4 tosses
(2^{4}), and if we look at the probabilities, they are:
4 heads – 1/16 = 0.0625
3 heads, 1 tails – 4/16 = 0.25
2 heads, 2 tails – 6/16 = 0.375
1 heads, 3 tails – 4/16 = 0.25
4 tails – 1/16 = 0.0625
Sum the probabilities and they come to 1! I told you so: if
you sum all the possible outcomes, p = 1.
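Rather than counting the 16 sequences by hand, you can let the computer enumerate them (a sketch in Python):

```python
from itertools import product
from collections import Counter

# All 2^4 = 16 equally likely sequences of four tosses.
sequences = list(product("HT", repeat=4))
counts = Counter(seq.count("H") for seq in sequences)

for heads in (4, 3, 2, 1, 0):
    print(heads, "heads:", counts[heads], "/ 16 =", counts[heads] / 16)

# The probabilities of all possible outcomes sum to 1.
print(sum(counts.values()) / 16)
```

The counts 1, 4, 6, 4, 1 are exactly the ways listed in the text.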
A quicker way to express all this is to say that if p is
the probability of one outcome and q is the other, then p + q = 1 and p = (1 – q), and
the probability of any outcome can be calculated by expanding the function (p +
q)^{n}, where n = the number of trials.
For 3 trials: (p + q)^{3} = p^{3} + 3p^{2}q + 3pq^{2} + q^{3}
Meaning that for three trials of an experiment whose two
possible outcomes are p and q, results of either 3 p’s or 3 q’s can occur one
way each, while 2 p’s and a q can occur three ways, as can 2 q’s and a p. If p has a
probability of 0.5, as with coin tosses, the probability of three heads or of three
tails will each be 0.5^{3} or 0.125; 2 heads and 1 tails in any order
will have a probability of 3 x 0.5 x 0.5 x 0.5 = 0.375, as will 2 tails and 1 heads.
Sum these probabilities for a check, and they do come to 1.0.
The calculation is similar if the probabilities of the
two outcomes are not the same. The probability of one p and 2 q’s in 3 tries
will be 3 times p times q^{2}. For any number of trials, n, and any
probabilities p and q (remember, p + q = 1) the calculations are
similar. The general expression for calculating the probability of k occurrences
of the first outcome in n trials is
p_{(k)} = [n!/(k!(n – k)!)] p^{k}q^{(n – k)}
where, again, n is the number of trials and the bracketed term counts the
number of orders in which the k occurrences can arise.
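The general expression is easy to compute; this sketch reproduces the worked numbers from the text (the p = 0.3 example is an arbitrary illustration of unequal probabilities):

```python
from math import comb

def prob(n, k, p):
    # (n choose k) * p^k * q^(n-k), with q = 1 - p
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Three coin tosses (p = 0.5) reproduce the numbers in the text:
print(prob(3, 3, 0.5))            # three heads: 0.125
print(prob(3, 2, 0.5))            # 2 heads, 1 tails in any order: 0.375

# Unequal probabilities work the same way, e.g. p = 0.3:
print(round(prob(3, 1, 0.3), 3))  # one p and two q's: 3 * 0.3 * 0.7^2 = 0.441
```

`math.comb(n, k)` is exactly the n!/(k!(n − k)!) term, the number of orders in which the k occurrences can arise.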
Now let us complicate matters. You toss your coin and it
comes down heads. If you toss once more, what are the odds of having both
tosses come down heads? Well, if you had been asked beforehand what the odds were of getting two
heads in a row, you know the probability would be 0.25, but this is not what
you were asked. Having the first toss in the past has made it a certainty. So
what is the probability of heads on the next toss? Well, you know it is 0.5! The
coin has no memory. If the first toss had been tails, what would be the
probability of two heads after the next toss? Zero, of course.
Come let’s
play poker! You don’t know how? How did you spend your time at Uni?
Anyway, the
simplest form of the game consists of being dealt 5 cards. You make a bet and
you can throw away up to 4 cards to improve your hand and a corresponding
number will be re-dealt to you. This is the ‘draw’. Further betting follows.
The aim is to get the best hand and win the pot. But let us get back to
probabilities and work out the probability of holding a hand with 3 aces
after the draw. This can come about in 4 ways:
1. No aces dealt, but 3 in the 4 cards of the draw.
2. 1 ace dealt, 2 more in the draw.
3. 2 aces dealt, 1 more in the draw.
4. 3 aces dealt, none in the draw.
Case 1. You have seen 5 cards; there are 47 left unknown and
there are 4 aces among them. The chance of a card in the draw being an ace is
4/47 = 0.085[9]. You draw 4 cards and need to know the probability of receiving
3 aces, which is
4 x 0.085^{3} x 0.915 = 0.002
Case 2. One ace dealt, and you draw 4 for 2 more. The chance of an
ace in the 1st 5 cards is about 0.279. The chance of each card in the
draw being an ace is 3/47. You draw 4 more and need 2 aces, a probability of
0.021[10]. That makes the probability of this case 0.006.
Case 3. The chances of 2 aces in 5 cards are
about 0.047. Then you have to draw 1 more ace in 3 cards: 0.117, giving this case a
probability of 0.005.
Case 4. The probability of being dealt 3 aces in the first 5 cards is 0.004.
We’ll take the simple case of not considering improving to 4 aces.
The overall probability of ending up with 3 aces is the sum
of the 4 ways in which this can be done, that is, about 0.017.
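The four cases can be checked in a few lines. This sketch reproduces the approximate binomial working used above (treating each unseen card as an independent 4/52 or k/47 chance rather than counting hands exactly); it also multiplies case 1 by the chance of being dealt no aces, which the shortcut above leaves out, and the total still comes to about 0.017:

```python
from math import comb

def binom(n, k, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of k aces among the 5 dealt cards (binomial approximation,
# each card treated as an independent 4/52 chance, as in the text).
dealt = [binom(5, k, 4 / 52) for k in range(4)]

# Keep the aces, draw replacement cards, and ask for exactly 3 - k more
# aces; 4 - k of the 47 unseen cards are aces.
draws = {0: 4, 1: 4, 2: 3, 3: 2}   # cards drawn in each case
total = 0.0
for k in range(4):
    case = dealt[k] * binom(draws[k], 3 - k, (4 - k) / 47)
    print("Case", k + 1, "->", round(case, 3))
    total += case
print("Total:", round(total, 3))
```

An exact count with hypergeometric arithmetic gives a somewhat smaller figure, but for back-of-envelope purposes the approximation serves.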
Exercise 4. Probability exercises:
a) If the probability of blue eyes is 0.3, what is the probability of taking a sample of 5 persons and finding 4 with blue eyes? (Explanations\Exercise 4.doc)
b) You throw 2 dice. What is the probability of a double 6?
c) You throw 2 dice. What is the probability of throwing 7?
d) You throw a die twice. What is the probability of the two throws totaling 7?
e) You throw a die twice. The first throw is 3. What is the probability of a total of 7 when you take the second throw?
Odds are another way of presenting probability. Familiar
territory for us, mug punters. The odds
are the ratio of the probability that the event of interest will occur to the
probability that it will not. The probability that a throw of a die produces a
6 is 1/6 = 0.1667; the odds are 1 to 5. Using as an example the association
between hay fever and eczema in children[11]:
                 Eczema, Yes   Eczema, No    Total
Hay Fever, Yes        141           928       1069
Hay Fever, No         420         13525      13945
Total                 561         14453      15014
What is the probability of hay fever in children, with or
without eczema?
Probability of hay fever in children with eczema: 141/561 =
0.251, about 1 in 4. The odds are 141 to 420, about 1 to 3.
Probability of hay fever in children without eczema: 928/14453 =
0.064, about 1 in 16. The odds are 928 to 13525, about 1 to 14 or 15.
A number of different comparisons can be made:
The difference between the proportions or probabilities: 141/561 – 928/14453 =
0.187, or 18.7 percentage points. A child with eczema has an 18.7 percentage-point
higher probability of having hay fever than a child without.
The ratio of the proportions: (141/561)/(928/14453) = 3.914. This is the relative risk.
A child with eczema has almost 4 times the probability of hay fever of a child without.
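All of these comparisons can be computed directly from the table (a sketch in Python; the odds for each group are printed as well):

```python
# The hay fever / eczema table from the text.
hf_ecz, hf_no_ecz = 141, 928          # hay fever present
no_hf_ecz, no_hf_no_ecz = 420, 13525  # hay fever absent

p_with = hf_ecz / (hf_ecz + no_hf_ecz)              # P(hay fever | eczema)
p_without = hf_no_ecz / (hf_no_ecz + no_hf_no_ecz)  # P(hay fever | no eczema)

print(round(p_with, 3))               # 0.251, about 1 in 4
print(round(p_without, 3))            # 0.064, about 1 in 16
print(round(p_with - p_without, 3))   # difference in proportions: 0.187
print(round(p_with / p_without, 3))   # relative risk: 3.914

print(round(hf_ecz / no_hf_ecz, 2))        # odds with eczema: about 1 to 3
print(round(hf_no_ecz / no_hf_no_ecz, 3))  # odds without: about 1 to 14 or 15
```

Working from the raw counts like this avoids rounding errors creeping into the derived ratios.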
To do effective research you must set a precise goal for
your study. It should be a definite, defined and limited objective. For
example, “to compare the heights of normal Sydney school-boys with those of
girls of the same age, in age cohorts, between 10 and 13 years” – not “to study
the height of 10 to 13 year old boys and girls”. Work out what is the best way
to get your data, then, to record it and to evaluate it. This involves:
1. Selection of the study population.
a. You could use the classes for appropriate ages in co-educational schools, preferably not confined to one geographic area.
b. Obtain permission from education authorities, schools, parents and relevant Ethics Committees.
c. Define “normal” in order to screen the children according to your definition.
2. Decide how many children you will need for the study. This is not an easy question and we shall deal with it when we discuss power and sample size.
3. Define your method of measurement.
4. Prepare the study record.
5. Define the reporting statistics and your method of evaluation. The reporting statistic is the principal variable upon which you focus your study. In this case it would be height, in centimeters, but in other studies it could be a categorical outcome variable, such as alive or dead, or a scale measure such as pain score. In any case, it needs to be chosen before starting the study. You can also define other variables (secondary objectives) which you wish to record and evaluate, for example the children’s weight, but the principal reporting statistic must be defined before the study is started.
6. And now, finally, you can get on with it!
Of course, you will make certain that the data are
representative of the population of interest, not only of a sub-set. This is never easy, and the better you do it,
the wider will be the application of the study. The limitation imposed by the selection of the study population
on how widely the conclusions apply is usually called external validity. If all your children came from Sydney schools,
the conclusion may not apply to Perth children, and even less to Kabul.
From our point of view, the important thing will be to
decide at the end of our data gathering and evaluation whether the differences
we found between the two groups (boys and girls) are likely to happen by chance
in populations of the same mean height, or whether they show that if we were to
look at the whole population of boys and girls in the relevant age groups, we
could expect them to differ. In
statspeak: are the differences between the heights of boys and girls, in age
groups between 10 and 13 years of age, significant?
In this context, significant means
unlikely to arise by chance, and so ‘true’ or representative for the population
under consideration. This is usually expressed as a p (probability) value; this
p is the probability of making an error in
accepting the result as true or valid, which is also called type 1
error. If p = 0.05, that means that
such a result will arise by random chance once in 20 experiments, and there is a
1 in 20 chance of error if we accept the result from the sample as being
representative of the population. Conversely, in 19 instances out of 20, our
conclusion of accepting the result as representative will be correct. The probability which is chosen as significant
is usually denoted α,
and an α
of 0.05 is often accepted as the upper limit of a significant result. That is, if p < 0.05, the result is
accepted as significant, at least statistically. This does not guarantee, however, that it is
of practical importance. It is important
to realise that this is a very arbitrary limit: the probability of type 1 error
is not very different between studies whose outcome probabilities are 0.049 and
0.051, yet the former will usually be called significant and the latter not!
This concept is so important that we should look at it
from a different perspective. If we take
two samples from a single population, their means are likely to be different,
but what interests us is whether the two samples we compared in
our research, say 12 year old boys versus girls, come from a single population
(girls and boys are of the same height) or from two different populations with different
distributions of heights. So we look for the probability that two samples from
the same population have a difference as great as we found, or greater. If this
probability is less than 0.05, we think it likely that we are dealing with two
distinct populations.
If p > 0.05 the result is often arbitrarily discarded as not
significant, which is sometimes confused with proof that there is no difference
due to the tested treatment or factor. This is not necessarily true. If you
are comparing two samples, this ‘not significant’ result can be due to:
1. The difference between the means being too small;
2. The scatter of values within the groups being too large;
3. The sample being too small.
The cure may be larger samples or better selection of
subjects, to reduce scatter. This is important:
“Lack of proof
of difference is not proof of lack of difference”
To translate this into English, the fact that a study
fails to show a significant result does not prove that there is no significant
difference. Perhaps a study twice the
size would have yielded a significant result. Before you set out on a study, you need to
estimate the number of subjects needed. This
we shall discuss in some detail later. It is also wise to plan, before you start your
investigation, whether you will examine the results at any time before the
conclusion of the study. This may protect you from going on with a study of a
treatment which is worse than the control, or from trialing on well after the
value of the treatment can be accepted. If
you do not plan these reviews ahead, you may be accused of opportunism if your
study is terminated early, or of wasting resources and even lives if it is
continued too long.
There are two types of studies: observational and experimental. In the former
we just measure and record (as we did with heights in the example above);
in the latter we ‘manipulate’ the sample (say, treat one group and give placebo to
another).
Observational studies are of the type: how long does a light
globe last? We record the time to failure of each globe in our sample, we
analyze our data and we derive conclusions. Measuring survival after diagnosis of a
disease is statistically a similar exercise. What happens to blood pressure with age? What is the normal plasma creatinine? No testing of a treatment is involved, no
manipulation of a study group is done and no specific comparison group is used.
Experimental studies are those where we do something to our sample(s). We compare treatment A with no treatment, or
with treatment B, and this can involve comparisons of more than 2 groups (for
example, treatment A, dose 1; treatment A, dose 2; and a control group). What is very important is that the only difference between the compared
groups is the tested factor (the treatment or whatever). There are several strategies to help with
this.
1. Use tightly
selected groups: we test Viagra against placebo in males between the ages
of 40 and 50, who are in good health but have erectile problems. We exclude the obese, diabetics and those
with a history of psychiatric problems. Tight
selection criteria may result in some difficulty in amassing sufficient
subjects, but tend to reduce the scatter within groups, which may help reach a
significant result. They may, of course,
also limit the applicability of the result (external validity); in our
example, the result would not necessarily be transferable to 70 year olds with
heart disease.
2. Use large
groups. The larger the sample, the more it will resemble the population but
large studies are usually expensive and take a long time to complete.
3. Paired
studies. Subject 2 is paired with subject 1 on the basis of similar
characteristics (age, sex, body weight, blood pressure, whatever is relevant…).
This type of study may have other
advantages, if pairs are matched, the difference in the study variable between
the two subjects is subjected to statistical testing rather than the mean
results obtained from groups. The number of subjects needed to produce a
significant result is often reduced by this strategy.
4. Crossover
studies. Each subject receives all studied treatments, at some time. This
eliminates differences between study and control groups completely except for
the studied factor but other problems exist. The effect of treatment 1 may
persist into the time period of treatment 2, called the carry-over effect,
or the sequence of the treatments
may affect the outcome.
5. Latin
square is a method of organizing trials to eliminate the effects of
unwanted factors. We want to try 3 analgesic methods after an operation
(treatments A, B, and C), but we know that with the passage of time pain gets less;
therefore the 3rd method is likely to seem more effective than the
first. We need to eliminate this effect.
Each subject then receives all three treatments, but in an order which makes
certain that each treatment is used first as often as it is second or third:
            Treatment order
Subject    1st    2nd    3rd
   1        A      B      C
   2        A      C      B
   3        B      A      C
   4        B      C      A
   5        C      A      B
   6        C      B      A
Of course, subjects must be
assigned randomly and without bias to the treatment sequences.
6. Sequential
designs are trial designs whose objective is to stop a trial as soon as a
proposed degree of probability (p value) is reached or a pre-set probability of
‘no difference’ is reached. This method is economical of resources but can only
be used where outcomes are available immediately. Decisions take the form of “A
preferred’, ‘B preferred’ or ‘no difference” and the outcome of each pair of
subjects is entered on a special chart.
7. Replication. The
repeated measurement of a variable in subjects is called replication.
For example, the effect of a drug on blood pressure, measured monthly for a
year. It is also a form of replication when the same subjects are exposed to
both/all treatments as in crossover studies.
8. Stratification can be used to overcome imbalance between study groups. For example, the
treated group has more females than males whereas the reverse is true of the
controls. Reporting the results for the two sexes separately may make
interpretation easier.
9. Other
designs. There are many complex designs possible, which may be appropriate
to some circumstances. The best design
is always the simplest one, which gives the greatest certainty (lowest p value)
with the minimum expenditure of resources.
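To take strategy 5 as an example, the six treatment orders of the Latin square can be generated and assigned in a few lines (a sketch in Python; the seed is arbitrary and only makes the shuffle reproducible):

```python
from itertools import permutations
import random

random.seed(42)  # fixed seed so the assignment is reproducible

treatments = "ABC"
# The six possible orders of three treatments: across the set, each
# treatment appears first, second and third exactly twice.
sequences = list(permutations(treatments))

# Check the balance property.
for position in range(3):
    for t in treatments:
        assert sum(seq[position] == t for seq in sequences) == 2

# Assign the six subjects randomly to the six sequences.
subjects = list(range(1, 7))
random.shuffle(subjects)
for subject, seq in sorted(zip(subjects, sequences)):
    print("Subject", subject, ":", " -> ".join(seq))
```

Shuffling the subjects, rather than the sequences, keeps the balance of the square intact while still assigning subjects to sequences at random.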
The objective of most studies is to draw conclusions
concerning a population, although we only observe a subset of the population. One way to ensure that the sample is
representative of its parent population is to sample randomly, meaning that every member of the population has an equal
chance of being in the selected sample. This is often difficult, sometimes impossible.
Let us assume that you are trialing a drug in Sydney, Australia. Your sample is unlikely to contain individuals
from Uzbekistan or Patagonia, whose response to the drug may differ from those
in your sample. You should state clearly
in your report what the sampled population was (from Sydney, Australia)
and how you selected the individuals in the sample. This may again limit the applicability or: external
validity[12].
Internal validity refers to the elimination of confounding factors in your
sample. Neither is usually a ‘yes – no’
quantity, but a gradation of good to bad.
But is your sample even representative of your locality? Did you ask the local family medical
practitioners to refer patients for your study? Well, you have just excluded those who visit
natural therapy practitioners and those who do not use the services of a
physician. These could conceivably bias
your sample, if they react differently from those included.
Let us look at different ways of sampling. The TV rating companies deliberately sample in
a non-random manner. They tailor their
sample to the demographics of interest rather than universality. Their interest is focused on target groups of
their advertisers. Those who are hard to
influence or have no spending power are of lesser interest. They also need to be able to point to specific
groups which advertisers wish to target. This is targeted
sampling, and it has parallels in drug
testing. If you are testing a drug to
reduce blood-sugar levels, you may wish to target diabetics, not normals.
The most common sampling technique, however, is to grab what
is available. Send the clipboards to the
local mall, or ask patients presenting in the E.R. to participate, ignoring
considerations of randomization. This is
convenience sampling, so common in
bio-medical research. When reading
research papers I am deeply suspicious of studies which do not describe their
sampling method.
The objective of a study or trial is often to see if some
variable (treatment, or procedure) is better than no treatment or another
treatment. It is desirable that the only difference between the study groups is
the treatment. All other factors need to be kept similar as far as
possible between the two (or more) comparison groups. The not treated (or old
treatment) group is called the control group. When reporting such
studies it is good practice to supply data to demonstrate that the groups were
similar in characteristics, such as severity of disease, age, sex, and whatever
else may be relevant. As we have already noted, crossover studies are ideal in
eliminating differences. Sometimes comparison is made between the studied group
and some other group not included in the study such as a ‘similar’ group
studied last year (historical control) or a ‘similar’ group studied by another
worker. Because we cannot be certain of the similarity or dissimilarity of the
study and control groups, such comparisons have less credibility than
randomised control studies.
As it is essential that the two (or several) groups be
similar, bias[13]
must not be introduced in the selection of subjects. To this end, study subjects should be assigned
at random to the study groups (treated, placebo etc…) at the time of
entry into the study. There are accepted
ways of achieving this, and they must be planned before beginning the study. Often tables of random numbers are used: say,
even numbers go to group 1 and odd to group 2. Next subject, next number… This is fine, but
you can end up with dissimilar numbers in the two groups, which reduces the power of the study. One convenient way to eliminate this
inconvenience is to randomize in pairs, so that you assign subjects two at a
time: if the random number is odd, the first subject goes into group 1 and the second
into group 2, and if the random number is even, vice versa. The random numbers
generated by Excel are not truly random, but very close; they are generated by a
complex algorithm which takes its start point from the date and time (unless
you specify another seed for the
random number generator). By specifying the same seed it is possible to produce
the same sequence of numbers more than once, so if you want fresh random numbers it is
best to leave the seed to Excel. There
is also a web site which will generate all sorts of random numbers for you,
including equal groups[14].
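Pair-wise randomization is easy to sketch in code (Python here rather than Excel; the seed and the 20 subjects are arbitrary choices for illustration):

```python
import random

random.seed(7)  # a fixed seed reproduces the same assignment; omit it in practice

assignments = []
# Randomize in pairs: one random draw decides the order within each pair,
# so the two groups always end up the same size.
for pair in range(10):                 # 20 subjects, taken two at a time
    if random.random() < 0.5:          # 'odd' -> first subject to group 1
        assignments += [1, 2]
    else:                              # 'even' -> vice versa
        assignments += [2, 1]

print(assignments)
print("Group 1:", assignments.count(1), " Group 2:", assignments.count(2))
```

However the random draws fall, the two groups come out with exactly 10 subjects each, which is the point of the pairing.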
Bias in a scientific study is defined as: “Any process which
tends to produce results or conclusions that differ systematically from the
truth”.
To avoid the dreaded bias, the evaluator must not know
whether the subject belonged to the treatment or control group. Only then is the evaluation safe from prejudices
and preconceptions. This is the reason
for placebo treatments, as treated v. untreated groups would otherwise be
obvious. It is also possible for
subjects to introduce bias, especially if the response to treatment is reported
rather than measured, such as “how severe was the pain?”. For this reason it is best if the subjects are
also left ignorant as to whether they are in the treatment or placebo group. The greatest weight is always placed on
research where the subjects are assigned randomly to each study group and the
subjects, as well as the evaluators, do not know to which group they belong.
This then is the randomized, double blind study, the holy grail of
comparative studies. If there is a third
person involved, such as a treating doctor or a person making a measurement,
blinding should also be applied to him/her, making the study ‘triple blind’ but
this term is not often used.
Exercise 5. Random Numbers
Look at the file random.xls. The first column (A) is 20 random assignments of 0 or 1. Note: not necessarily 10 of each. The second (Col. B) is random integers between 0 and 20. Excel can generate random numbers by a function (=RAND()) or by a more complex program in Data Analysis.
a) How could you count the exact number of 0’s and 1’s if you had a large data set?
Use the Random Number Generator in the Data Analysis Pack
to:
b) Generate a set of 100 random numbers in a uniform distribution, between 1 and 100, with 2 decimal places. Put these in Col. C.
c) How do you generate integers, rather than decimals?
d) Generate a normally distributed set of 100 numbers with a mean of 50 and an SD of 17. Put this in Col. D of the spreadsheet. How many values lie outside the 3 SD range?
e) Generate a set of 100 random ‘0’s and ‘1’s, in about equal numbers, to assign 100 subjects to study and control groups. How many 0’s and 1’s did your set contain? Delete this set when finished.
Save your file with the new sets of random numbers in Col. C and D.
How could you get exactly 50 each of 0’s and 1’s, in random order? (Explanations)
In
general, the quantitation of measurable data such as height, plasma rhubarb concentration
or duration is not a problem. Standards
of measurement are available and their precision is known, or at least can be
estimated. Variables which are not
measured directly but are derived from other measured values, for example
cardiac output (calculated from temperature or absorbance), are less
straightforward; but cardiac output at least has physical existence, even if the
indirect measure makes some assumptions and uses approximations.
This
is not the case when we are faced with assessments which lack a physical
reality but are constructs or subjective estimates. Examples would be pain scores, anxiety or
depression scales, or indices of quality or satisfaction. These are usually ordinal variables, where
each value has a defined meaning and there is a logical progression, but a
mathematical relation between the scores cannot be established. Proof is required that these ordinal variables
measure what we want them to measure, and do so in some ordered and inclusive
way.
Linear
analog scales are an exception to the ordinal nature of such measurements or
variables. They are probably familiar to
you: say a 100 mm line where one end is identified as “no pain” and the other
as “as much pain as you can imagine”, or whatever is appropriate for your variable.
Your subject marks his/her assessment on
this line. A slider on a scale is a
slightly more sophisticated version of the same. This yields an assessment which is a length
and is therefore an interval variable. If
the distribution of the resultant variable is normal, or reasonably close to it,
it is appropriate to use parametric methods to evaluate the
data. This is generally accepted, even
though proof that the scale is indeed linear is usually lacking.
But to return to the question of validating scales and scores: this is complex and has a number of facets.
1. Face validity: Does the scale indeed measure what we want it to measure? A pain score should measure pain, not disability, distress or something else.
2. Content validity: Does the scale measure all aspects of the variable? For example, in a phobia scale we ask questions about dogs, cats, spiders and scorpions, but forget heights or crowding.
3. Construct validity: Tests whether the scale has the relationships we think appropriate for it to have. Does our depression scale relate appropriately to suicide attempts? Does the anxiety scale we constructed increase in a group of students before an exam and decrease after it?
4. Internal consistency: Refers to the relationship of the questions within the scoring system. They should relate sensibly to one another and not measure the same thing over and over. An example would be QOLRAD, a 25 question scoring system designed to quantify the impact of gastrointestinal problems on quality of life.
5. Test-retest reliability: Does the scoring system yield consistent results on retesting, if the tested variable has not changed?
6. Observer reliability: The test result should vary little in the hands of different observers.
Obviously, validation of a scale or index is complex, and it is all too common
to see the bland statement “validated scale” or score, with or without a
reference or any details of what validation has been performed.
Hypothesis testing
Hypothesis testing is the most common way of analysing experimental
results. A hypothesis is proposed, usually “that the treated and the control
groups come from the same population”, which means that any difference between
them has arisen by chance or, more precisely, by sampling error only. This is
called a null hypothesis, abbreviated H_0, and implies that the treatment has
made no difference. We then calculate the probability that our results would
arise if this hypothesis were true. The lower this probability, the less
likely it is that the proposition (the null hypothesis) is true. The threshold
probability below which we reject the null hypothesis is called α, in
statspeak. What we actually test when comparing groups is how likely it is
that the two (or more) samples are from the same population and differ only
because of sampling error. The alternative hypothesis to H_0 is usually called
H_1. In this case, it would state that the two samples do not come from the
same population. The two hypotheses are, of course, mutually exclusive: the
samples either do or do not come from the same population. Thus two types of
error can be made: you can accept a false result or reject a true one.
Type 1 error is rejecting the null hypothesis when it is true. Another way of saying the
same thing is to say that we do not accept that the groups come from the same
population when in fact they do. We have then rejected a true hypothesis. That
happens when the probability of tested groups coming from the same population
and therefore there being no difference between them is low (p<0.05 is
usually accepted as low enough) but this result was a fluke (well, sampling
error, if you like) and in reality, the samples do come from the same
population. If we were to draw more samples or repeat the study, the results
would most likely not show such a significant difference.
Type 2 error is the converse, accepting the null hypothesis when it is false
and rejecting a finding as not significant when the two groups are really from
different populations. An illustration may make this clearer, because
understanding this is important.
Figure X
The curves represent the frequency distribution of the means
of samples drawn from two normally distributed populations. The first
population has a mean of 0, the second, 2; both have SD of 1. The mean of our
sample was 1 and its SD 1 also. The probability of that sample coming from the
first population is quite low – p < 0.001, but let us suppose it does, which
is obviously possible. On the basis of the low p value we reject this. We have
rejected a true null hypothesis, a type 1 error. If the sample came from the
second population, equally possible, the null hypothesis would be false. Were
we to accept this, it would be a type 2 error.
The chance of making a type 2 error (accepting a false null
hypothesis or in other words, failing to find a significant difference) is
commonly the result of having too few subjects. The likelihood of type 2 error
can (and should be) calculated, when planning a study. It is related to four
factors:
1. The number of subjects in the samples – larger samples reduce the
probability of such errors.
2. The difference between the sample means – the smaller the difference, the
more likely the error.
3. The scatter (SD) within the groups – big scatter makes it difficult to
demonstrate a difference.
4. The desired level of significance (α) – the smaller α, the less likely we
are to demonstrate a significant difference at that level.
The power of a study is the probability of not rejecting a true finding, which
equals (1 – the probability of a type 2 error). Another way of thinking about
this is to ask: if there really is a difference of a postulated size between
the two groups, how likely are we to find it? Statspeak calls the probability
of a type 2 error β, so power = (1 – β). Without calculating the numbers
needed to ensure that this error is not likely, it is possible to undertake
quite hopeless studies, because the number of subjects needed to show a
significant result is too large for the resources available. We shall look at
ways of calculating the power of a study, or of estimating the numbers needed
to reach a certain power, a little later.
‘Significant’ is the real buzz-word in statistics. When you have estimated the
probability of the difference between the means of the sample groups arising
by chance variation in samples drawn from the same population, you have
obtained a probability, which says that the difference will happen
fortuitously[16] 5 times, once, or maybe 0.1 times in 100 attempts. It is
customary to designate a probability of 0.05 (5 times in 100, or 1 in 20) as
being good enough odds to regard the difference as worth thinking about, or
significant! This magic ‘0.05’ is quite arbitrary and may or may not be
appropriate. What it says is how likely the difference between the two groups
is to arise by chance in two samples from the same population. To carry it
further, if the chance is 1:100 that the two samples are drawn from the same
population, then in 99 instances out of 100 they came from different
populations. This can be interpreted, if we are considering a treated and a
control group, as meaning that the treatment is really likely to have made a
difference. It says nothing, however, about how important the difference
between the two samples may be.
Let me illustrate: You design a clever, large, cross-over
study to compare the efficacy of two anti-hypertensive drugs, Oldreliable,
costing $10 a month and Newhizz, at $50. You evaluate the results and find that
the mean blood pressure reduction by Newhizz is 3 mm Hg greater than with
Oldreliable, p = 0.01, a highly significant difference from the statistical view.
Whether the slightly greater reduction in blood pressure is worth the expense
is not answered by your statistical evaluation.
Although we now understand what is meant by “statistically significant”, we
shall consider how we find the answer to this question later.
The calculation of confidence intervals (CI) is both an alternative and
complementary to significance testing. For example, the 95% confidence
interval of a sample mean is the range which would include 95 of the sample
means if 100 samples were drawn from the population. Conversely, this interval
is 95% probable to include the population mean. This is the range which is
consistent with the likely smallest and largest effect of a treatment. This
information is not given by significance testing.
If you have 2 samples and the 95% CI’s for their means do
not overlap it is at least 95% probable that they do not come from the same
population. Unlike the t-test alone, which is used to test for significance of
the difference between 2 sample means, the CI gives you a clearer idea whether
a non-significant difference is because of big scatter within the groups
(indicated by a large CI) or a small difference in the means. On the other
hand, the CI does not precisely answer
the question of what is the probability of a type 1 error. CI’s are
increasingly used in the medical literature and there is an easy to read book
on the subject, which gives many examples. The book
also contains a useful check-list, used by referees to evaluate the statistical
soundness of papers submitted to the British Medical Journal and the Medical
Journal of Australia. There is a good argument for including both a p value and
CI’s in reporting trial results.
As the 95% CI of a mean is approximately t × SD/√n (that is, t × SEM), the 95%
confidence range is the mean ± CI. For a 95% interval (p = 0.05) and a sample
size of 30, t is approximately 2. It is also evident, therefore, that to halve
this interval the number of subjects needs to be increased 4-fold, and to
reduce the interval to 1/3, by a factor of 9.
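As a rough sketch of this arithmetic (the data below are illustrative, not from any file in this book), the CI of a mean can be computed with scipy supplying the critical t value:

```python
import math
import statistics
from scipy import stats

# Hypothetical measurements (n = 30); any interval data would do
data = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6, 5.2, 4.8, 5.0,
        4.4, 5.5, 4.9, 5.1, 4.7, 5.0, 4.8, 5.3, 4.6, 5.2,
        4.9, 5.0, 4.5, 5.4, 4.8, 5.1, 4.7, 5.0, 4.9, 5.2]
n = len(data)
mean = statistics.mean(data)
sem = statistics.stdev(data) / math.sqrt(n)   # SD uses divisor (n - 1)
t_crit = stats.t.ppf(0.975, df=n - 1)         # about 2.05 for n = 30
ci = t_crit * sem
print(f"mean = {mean:.3f}, 95% range = {mean - ci:.3f} to {mean + ci:.3f}")
```

Note that quadrupling n halves `sem`, and with it the interval, exactly as stated above.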
This may well be the first commandment of statistics:
Look at thy data!
Before crunching any numbers, look at your data, by whatever type of graph is
appropriate. This may not sound scientific but it is important and useful.
There are three reasons for this:
1. The first reason is to look for anomalies, such as data
entry errors or data points which are unlikely to belong to the data set you
are considering. For example, you are looking for the average weight of some
boys, weighed to the nearest 1/10th of a kilo. Maybe you noticed that the 10th value was
missing a decimal point, but if you had 250 numbers, it would not have been so
obvious. Had you gone ahead with calculating the mean, it would have been 64.1
instead of 41.8. Significant errors can be easily detected by graphing the
data.
2. The second is to help you develop models for your data:
The relationship between pulmonary ventilation and
arterial CO_{2} content may serve as an example.
It is obvious that the two variables are closely related,
but equally obvious that the relation is not a simple linear one. By using
Excel’s graphic trend line function, you can pick what looks the best fit and
the devilishly clever program will even give you an equation relating the
variables. In this case, near enough to arterial CO_2 = 39/ventilation.
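To sketch how such a model can be fitted outside Excel, here is a least-squares fit of the reciprocal model to hypothetical ventilation/CO_2 readings constructed to lie near the relation above (the data and variable names are illustrative, not measurements):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical readings: ventilation (L/min) and arterial CO2 (mm Hg),
# constructed to lie near the reciprocal curve described in the text
ventilation = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
co2 = np.array([19.8, 9.6, 6.6, 4.8, 4.0, 3.2])

def model(v, a):
    # the reciprocal model: CO2 = a / ventilation
    return a / v

(a_fit,), _ = curve_fit(model, ventilation, co2)
print(f"fitted constant a = {a_fit:.1f}")
```

The fitted constant comes out close to the 39 quoted in the text, which is just what Excel's trend-line function does behind the scenes.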
3. The third is to check a model’s assumptions.
You propose to use a t-test to see if the wonder drug made
a difference? Well, you know the data need to be normally distributed. Most
statistical packages offer tests for this purpose (Excel does not), but a look
will usually tell you.
The
data are the subjects’ ages from the Eyeball.xls file and the graph is a
histogram. An assumption of normality looks decidedly iffy.
Learn to use the Excel graph function in general and to
graph your data in various ways, such as scatterplot, column chart or histogram.
These are 3 series of random numbers, created by Excel’s Random Number Generator. Create a duplicate
set of each (Copy/Paste) and sort this second set in ascending order
(Data/Sort).
1. Plot the distribution of the numbers in each of the series, first as a
scattergram, then as a column chart.
2. Now plot the sorted numbers, using a scattergram or a column chart; it does
not matter which.
Have a good look at the
3 plots of each dataset. Have a think about them before you look at my example
and read my comments.
Now we are getting into it! The descriptive statistics of
a sample describe it with a minimum of numbers. We have met many of these terms
already, when we looked at the normal distribution. The ‘Descriptive
Statistics’ of Excel include the following:
1. Mean is the arithmetic average of the values in the sample. You know
already that this is sometimes the most representative value for the
measurements in the sample.
2. Median is the middle value of the set: as many values below as above.
Outliers (occasional values much above or below the others in the set)
influence this less than the mean. Where you have 1 or 2 weird high or low
values in your sample, they may influence the mean quite a lot, but not the
median or mode.
3. Mode is the most commonly found value in the sample. This also is not
influenced by outliers, but a data set can have more than one mode.
1. Variance is the mean squared deviation from the mean. (This is again just
revision.) Do you recall why we square the deviations, and why we divide by
(n − 1) and not n in getting the mean of the deviations? Because when we use a
sample, not a population, to derive this measure, the fact that we sampled
gives a degree of uncertainty, so we divide by the degrees of freedom, which
is 1 less than the number in the sample. The sum of all the squared residuals
is usually called the total sum of squares, TSS, or just the sum of squares,
SS. This is often used in averaging variances of several groups.
2. Standard deviation is the measure of scatter of the values in a normally
distributed sample. It is meaningless if the data are not normally
distributed. It is independent of sign (pos/neg), that is, we are looking at
dispersion regardless of whether the value is larger or smaller than the mean.
It is the square root of the variance.
3. Standard error (of the mean) is for samples what the standard deviation is
for the individual values. The sample mean ± 2 SE has a 95% probability of
containing the population mean. Conversely, draw a number of samples and their
means ± 2 SE will contain the population mean 95% of the time. Its value
depends on the SD of the sample (dispersion) and the size of the sample.
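The same descriptives can be sketched in a few lines of Python (illustrative numbers; the statistics module divides the variance by n − 1, as discussed above):

```python
import math
import statistics

sample = [3, 5, 5, 6, 7, 8, 9, 11, 12, 14]     # illustrative data
n = len(sample)
mean = statistics.mean(sample)          # arithmetic average
median = statistics.median(sample)      # middle value
mode = statistics.mode(sample)          # most common value
variance = statistics.variance(sample)  # mean squared deviation, divisor (n - 1)
sd = statistics.stdev(sample)           # square root of the variance
sem = sd / math.sqrt(n)                 # standard error of the mean
print(mean, median, mode, round(variance, 2), round(sd, 2), round(sem, 2))
```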
Normality of the Data
Neither skewness nor kurtosis is commonly used to decide whether the samples
are near enough to a normal distribution for use of parametric tests. In
general, if the sample contains more than 30 values, and does not obviously
look other than normal by the eye-ball test, parametrics are probably OK. If
the sample is small or looks suspiciously non-normal, the Kolmogorov–Smirnov
single-sample test or the Shapiro–Wilk test is used. These you will find in
large stats programs, but they are not included in Excel’s Data Analysis Pack.
They compare your sample to a normal distribution and determine whether there
is significant deviation from the normal distribution in your sample. We
shall, however, build a spreadsheet later to provide our own test for
normality of data using the NORMDIST function in Excel.
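Neither test is in Excel, but as a sketch of what they report, scipy offers Shapiro–Wilk (the samples here are randomly generated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=50, scale=10, size=40)   # drawn from a normal
skewed_sample = rng.exponential(scale=10, size=40)      # clearly non-normal

w_n, p_n = stats.shapiro(normal_sample)   # p usually well above 0.05
w_s, p_s = stats.shapiro(skewed_sample)   # p should fall below 0.05
print(f"normal sample p = {p_n:.3f}, skewed sample p = {p_s:.4f}")
```

A p value below 0.05 means significant deviation from normality, i.e. think twice before using parametric tests.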
Exercise 7. Descriptives
Open Valsalva.xls. Find the
descriptive statistics for SBP before, minimum and end-valsalva. Are they
really different? If you look at the confidence intervals, you will get a feel,
but the answer to that question is not yet in our power. In the next section,
or better, in the section on analysis of variance, we shall learn how to find out. (Explanations)
Remember, the t-test, like other parametric tests (z-test, analysis of
variance, regression and correlation), needs interval (continuous) data and
the assumption that the data are (at least approximately) normally
distributed. This assumption is not necessarily a tight one, but have a look
at the plot of the data and make sure that it is not obviously non-normal. If
in doubt, test the data. The t-test is the test used to estimate whether there
is a significant difference between groups in the means of a variable. If the
groups are large (>30), the test becomes the same as comparison of the means
of two normally distributed groups (z distribution), and it is also in effect
identical to analysis of variance for 2 groups. It can, however, safely be
used for groups > 30.
The theory is the following:
We make the null hypothesis that the samples come from the same population. We
estimate the population SEM from the variances of the samples. We know that it
is 95% probable that the samples do not come from the same population if the
difference in their means is greater than about 2 SEM (1.96 for very large
samples). We estimate the ‘t’ statistic for our study, which is the difference
of the means divided by the estimated population SEM. We used to look this
statistic up in tables and see if its probability is greater or less than some
level which we set, usually at 0.05, and call significant.
The larger our sample size, the smaller the difference between the means needs
to be for t to exceed a given value for the desired probability, so our tables
show the required t value for the size of the sample, and we size the sample
in degrees of freedom. This is the number of independent values the variable
can assume. We know the mean, thus if we know all values but 1, we can work
that one out. Thus it is not independent, and the degrees of freedom in this
instance is n − 1 for each sample. That means that we need to look up t for
(n_1 + n_2 − 2) degrees of freedom to find the probability of the t value.
Excel saves us looking up the t-tables and not only calculates the t-value but
gives its probability, p, when we do a t-test.
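As a sketch of that arithmetic (hypothetical groups), the t statistic computed from the pooled variance agrees with what a package reports:

```python
import math
from scipy import stats

group1 = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 12.8, 11.7]   # hypothetical data
group2 = [13.5, 12.9, 14.1, 13.2, 13.8, 13.0, 14.4, 13.6]
n1, n2 = len(group1), len(group2)
m1, m2 = sum(group1) / n1, sum(group2) / n2
v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)   # sample variances
v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
# pool the variances, weighting by degrees of freedom (n - 1 each)
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_diff = math.sqrt(pooled * (1 / n1 + 1 / n2))
t_manual = (m1 - m2) / se_diff
# scipy's equal-variance t-test gives the same t, plus its p value
t_pkg, p = stats.ttest_ind(group1, group2)
print(f"t = {t_manual:.3f}, p = {p:.4f}, df = {n1 + n2 - 2}")
```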
T-tests differ slightly depending on whether the two groups are independent,
for example contain different subjects, or paired, when they may represent
changes in the same subject (BP before and after treatment) or closely matched
paired subjects. In the first instance (unpaired groups) the difference
between the means of the two groups is tested. In the second (paired t-test),
it is the difference between the paired values which is tested. In Excel there
are also 2 possible tests for unpaired data, depending on whether you assume
the same variance in the two groups or whether this assumption is not made. If
the SD is not very different in the two groups, the variance is likely to be
roughly the same. If it is very different, then use the t-test for different
variances, which uses an estimate for the population SEM based on both sample
SDs (see App. 3). In fact, the difference between the tests is usually small.
It is probably safest to quote the ‘worst case’ result. Alternatively, do the
Excel F test (different from the F ratio!) to determine whether the variances
are significantly different or not, and then use the appropriate t-test.
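A sketch of that sequence with hypothetical data: compute the variance-ratio F, take its two-tailed p from the F distribution, then pick the matching t-test:

```python
from scipy import stats

group_a = [5.1, 5.4, 4.9, 5.6, 5.2, 5.0, 5.3, 5.5]   # small scatter
group_b = [6.8, 4.1, 7.9, 3.6, 8.4, 4.9, 7.2, 5.7]   # large scatter
va = stats.tvar(group_a)   # sample variances (divisor n - 1)
vb = stats.tvar(group_b)
f_stat = max(va, vb) / min(va, vb)
df = len(group_a) - 1      # both groups are the same size here
p_f = 2 * stats.f.sf(f_stat, df, df)   # two-tailed F-test p value
equal_var = p_f > 0.05
# use Welch's unequal-variance test when the F-test rejects equality
t, p = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(f"F = {f_stat:.1f}, p(F) = {p_f:.4f}, equal_var = {equal_var}")
```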
When doing a t-test, you can postulate a difference between
the means of the two groups. This makes the question you ask “Is the difference
between them significantly different from the postulated value?” If you do not
postulate a difference, you are simply asking whether the means are
significantly different.
The other thing to
decide is whether to use a one-tailed or two-tailed test. This
depends on the hypothesis. Did you ask “is treatment 2 different from treatment
1?” or “is treatment 2 better than treatment 1?” In the first instance we are
testing how different the treatments are. It is immaterial whether result A is
larger or smaller than result B. In the second example, we want to know whether
the BP fall is greater after treatment A than after treatment B. If it is less,
(A<B) then we are not interested. If A>B then a one-tailed test will be
appropriate. The decision is one of logic, not numbers. A one-tailed test will
be significant, with a smaller ‘t’ value than that required for a two-tailed
test.
Analysing Rates and Proportions
Where we are dealing with a sample with only 2 possible
values (say, two party preferred votes) pollsters ask 1000 individuals and
eliminating the ‘don’t knows, we end up with 350 for party_{1} and 400
for party_{2}. The pollsters report that support for party_{1 }is
at 46.6%. But we know that this is not necessarily the gospel for the whole
population. We are dealing with a sample, which is not likely to be identical
to the proportion in whole population. In fact, to have a 95% probability of
including the population proportion, we need a range of 2 SE about the sample
proportion.
It can be shown that the SE of a proportion is √(p(1 − p)/n), so that in this
case the SE = 0.018 and 2 SE = 0.036. Therefore, the population support for
party_1 is 95% probably between 43.0% and 50.3%, among those who know their
preference. When the pollsters want to report honestly, a range of uncertainty
must be given, and this range will be greater with smaller samples. And to be
really meaningful, they also need to report the sample size and the
proportions of non-responders and ‘don’t knows’.
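A quick sketch of the proportion-SE formula applied to the polling figures above:

```python
import math

supporters, sample = 350, 750   # the poll after excluding 'don't knows'
p = supporters / sample
se = math.sqrt(p * (1 - p) / sample)   # SE of a proportion
low, high = p - 2 * se, p + 2 * se     # the 2 SE (approx. 95%) range
print(f"p = {p:.3f}, SE = {se:.4f}, range = {low:.3f} to {high:.3f}")
```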
Exercise 8. T-tests
Open T-tests.xls. This is a spreadsheet containing some data from
Valsalva.xls, sorted by sex.
1. Do the ‘pre-test’ and ‘minimum’ SBP values for males and females have
‘different’ (read significantly different) variances? Do the F test for
variances on each set of BP values, male versus female.
2. Are the SBP values significantly different for males v. females on either
set?
3. Are the pre-test and minimum SBPs different using all subjects? Do t-tests
for paired samples, as the two variables come from the same subjects,
equivalent to before and after.
We have already mentioned power, as applied to statistics. If β is the
probability of a type 2 error (rejecting a true result), power is (1 − β),
which is the probability of not rejecting (accepting) a true result. As we
said, power is related to the difference we want to detect, the probability of
type 1 error which we are willing to accept (α), the numbers in the study and
the scatter of values in the study groups. The smaller we make α, the higher
the risk of a type 2 error (β). By looking at the algorithm for the t
statistic, we can work out how to estimate the power of a study, and the
numbers required to reach a given power.
SEM, here, refers to the SE of the differences between the means, and its
estimation can be complicated, but it will involve sample sizes and standard
deviations.
Excel has neither a function for working out sample size nor one for
estimating β. There are web sites which will do it for you (do a search on
Google.com for ‘power’), or use the program PS.EXE.
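As Excel offers nothing here, this is a minimal sketch using the common normal-approximation formula for subjects per group (the function name and example figures are hypothetical):

```python
import math
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect a difference `delta`
    between two means, given within-group SD (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a two-tailed alpha of 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    return math.ceil(2 * (sd / delta) ** 2 * (z_alpha + z_beta) ** 2)

# e.g. a 5 mm Hg difference in BP with a within-group SD of 10 mm Hg
print(n_per_group(delta=5, sd=10))   # 63 per group
```

Note how the required numbers grow with the square of SD/delta: halving the detectable difference quadruples the sample size.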
We are faced with determining the probability of whether a value X comes from
a normal population described by the mean μ and SD σ. A simple way of doing
this is to express the deviation of X from μ in ‘standard’ units (units of the
SD) by dividing the difference of the value from the mean by the SD:
z = (X − μ)/σ.
We can now look up a table of ‘area under the normal curve’ and see. For
example, for z = 2, we see that the probability of a value lying between the
mean and z is 0.477. So the one-tailed probability of z > 2 is 0.023 and,
two-tailed, 0.046, or roughly 5%. Similarly, z > 3 has a probability of < 1%.
This is the same as saying that the probability of a value outside mean ± 2 SD
is 5% and outside mean ± 3 SD, 1%.
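The same areas can be read off in scipy rather than a printed table; a brief sketch:

```python
from scipy.stats import norm

z = 2.0
mean_to_z = norm.cdf(z) - 0.5     # area between the mean and z: ~0.477
one_tailed = norm.sf(z)           # P(Z > 2): ~0.023
two_tailed = 2 * norm.sf(z)       # ~0.046, roughly 5%
print(f"{mean_to_z:.3f}, {one_tailed:.3f}, {two_tailed:.3f}")
```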
This is called ‘normalizing’ the distribution. Z is called the standard normal
variate, standard normal deviate or just standard score. Tables exist, as we
said, but Excel gives p values directly anyway. The Z distribution tables
assume a mean of 0 and deviations in units of the SD. This test is not used
much in our line of work, but would be very useful if you have to scale
results, as they do in some exams. An example would be two groups where
different assessors have given different marks, but you believe that in fact
the average performance is the same for both. By ‘normalizing’ the individual
results, and then re-converting them to marks using a single mean, say
(mean_1 + mean_2)/2, and an estimate of the combined SD, you correct the marks
to the same scale.
The z-statistic is also useful for ‘significance’ testing of ratios or
proportions. Say treatment 1, tested in 61 subjects, has a mortality of 8 and
treatment 2, in 67 subjects, 10. The mortality rates are 8/61 = 0.131 and
10/67 = 0.149. The best overall estimate of mortality as a proportion or rate
would be p = (8 + 10)/(61 + 67) = 0.141, and its SD is √(p(1 − p)) (see
analysing rates and proportions). The SE of the difference in rates would be
√(p(1 − p)(1/n_1 + 1/n_2)) = 0.062, giving z = (0.149 − 0.131)/0.062 ≈ 0.29.
This is well short of the critical value of 1.96 for p = 0.05, and we can
conclude that it is reasonably safe to accept the null hypothesis.
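The whole calculation, with the mortality figures above, can be sketched thus:

```python
import math

deaths1, n1 = 8, 61     # treatment 1
deaths2, n2 = 10, 67    # treatment 2
p1, p2 = deaths1 / n1, deaths2 / n2
pooled = (deaths1 + deaths2) / (n1 + n2)        # best overall estimate
se_diff = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se_diff
print(f"z = {z:.2f}")   # well short of the critical 1.96
```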
Because there is a small
difference between the standard normal distribution and the distribution of the
z-statistic, a correction called Yates correction for continuity should
be applied. Probably Excel does this, but I could find no reference to it.
The z-test becomes equivalent of
the t-test for groups greater than 30.
If there are more than 2 groups or more than 2 possible outcomes, we shall
need to try another way, called the Chi-square test or contingency tables,
which could be used to test significance here as well. We shall consider this
when we deal with non-parametric methods. In any case, I have not used the z-test
very often, but standardizing scores is useful and it also enables us to
determine the significance of proportions.
This is an exercise in scaling marks. The two columns of raw
marks represent the assessment of two markers on two classes. We think the
classes are similar in performance, but this is not reflected by the marks. So
we want to scale them to the same mean and scatter, or SD.
To do this, you need to find the mean mark and the variance for each class,
then work out a joint mean and a joint SD.
Next you convert each mark into a z-score and re-convert them to marks, using
the joint mean and SD.
Does this sound complicated? Well, it is, and it is not very important, but
have a try. Have a look then at my effort, in Examples\Scaling.xls
and have a look at Explanations\Exercise
8.
Seventy-four patients were treated with chemotherapy after surgery and the
5-year survival was 58. Sixty-seven were treated with surgery alone and 36
survived 5 years. Is this difference significant? My calculations
are in Examples\Survival.xls
and the explanation in Explanations\Exercise
8
What if we have more than 2 groups to compare, say the systolic blood pressure
pre-test, minimum and end-Valsalva? We could do t-tests between each pair of
groups, 3 in all, but remember that we expect that if we set α at 0.05, in
about 1 case in 20 we are likely to accept a spurious result (type 1 error).
So if we do 3 t-tests, the chances are roughly 14% that a fortuitous
difference will fool us. If we had 4 groups, we could do 6 t-tests, and the
chance of at least one fortuitous ‘significant’ difference would be > 26%. Not
good enough!
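The arithmetic behind those percentages is just the complement of "no fluke in k tries"; a one-line sketch:

```python
alpha = 0.05
for k in (3, 6):   # 3 t-tests for 3 groups, 6 for 4 groups
    fwer = 1 - (1 - alpha) ** k   # chance of at least one spurious 'hit'
    print(f"{k} t-tests: {fwer:.1%}")
```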
Analysis of variance (or single-factor ANOVA) is a better way to go. If
the null hypothesis is true (no differences between the three value sets),
then the sample means are just those of 3 random samples drawn from the same
population. So if we use two different methods of estimating the variability
of the sample means, the results should be the same. The two methods available
for this are to estimate the variance of the means from the sample variances,
or to estimate it from the means themselves. If indeed the means come from the
same population, the two results should be fairly similar. If, however, one
(or several) of the groups does not come from the same population, the
variance estimated from the sample means will be larger than that estimated
from the sample variances, which need not differ even if the means do!
We therefore estimate a variance for all the samples from
the variances of each of the groups. This is the ‘within groups variance’.
We also estimate a variance from the group means, the ‘among or between
groups variance’. As we have said, the two should be the same, more or
less, if all the sample groups are from the same population. Thus the ratio of
the means of the two types of variance, tells us whether the total variance is
largely from scatter within the groups or whether the means of the groups are
different and the scatter within each is a relatively small contributor. The F
ratio is the ratio of the ‘among’ or between groups variance to the within
groups variance. There are tables for the critical value of F depending on the
number of groups and the total number of subjects in the samples. Of course,
Excel saves us having to look this up and gives us a ‘p’ value.
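Outside Excel, the whole computation is one call; a sketch with made-up systolic pressures for three groups:

```python
from scipy import stats

# Hypothetical SBP readings for three groups (mm Hg)
g1 = [120, 125, 130, 118, 127]
g2 = [122, 128, 124, 126, 121]
g3 = [135, 140, 138, 142, 137]
f_ratio, p = stats.f_oneway(g1, g2, g3)   # between- over within-groups variance
print(f"F = {f_ratio:.1f}, p = {p:.2g}")
```

Here g3 clearly stands apart, so the between-groups variance dwarfs the within-groups variance and F is large.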
Like the t-test, analysis of variance (ANOVA) depends on a number of
assumptions for its validity:
1. That the samples are drawn from normal populations.
2. That the variance of the samples is the same, or at least not too
different. It has been suggested that variances which differ by less than
3-fold can be accommodated. Excel’s F-test will only handle two variables, but
multiple comparisons would probably not be unreasonable here. If the variances
appear very different, another test needs to be found.
3. That the samples are independent.
If the data do not meet the 3 requirements, the non-parametric equivalent of
ANOVA, the Kruskal–Wallis test, can be used.
The ANOVA tells us whether there are significant differences between the
groups. Now it remains to be decided which groups are different or not
different from which. This is not necessarily easy. The commonest but not
necessarily best way to avoid the dangers of multiple comparisons (the danger
of a chance ‘significantly different’ result) is the Bonferroni test,
which can also be called t-tests with Bonferroni correction. This takes the p
value desired (say < 0.05) and divides it by the number of comparisons, here
3. So if we do 3 t-tests, p < 0.017 (0.05/3) between the two tested groups is
required to judge a group different at the 0.05 level. In practice, the
correction is easy to apply: all you have to do is enter 0.017 instead of 0.05
in the Alpha box on the t-test panel of Excel, or whatever alpha you want,
divided by the number of tests you do.
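A sketch of the corrected comparisons (the groups here are hypothetical; with real before-and-after Valsalva data the paired test would be the one to use):

```python
from itertools import combinations
from scipy import stats

groups = {   # hypothetical SBP values for three conditions
    "pre": [120, 125, 130, 118, 127],
    "min": [100, 104, 98, 103, 101],
    "end": [122, 127, 124, 129, 125],
}
alpha_adj = 0.05 / 3   # Bonferroni: divide alpha by the number of comparisons
for (na, da), (nb, db) in combinations(groups.items(), 2):
    t, p = stats.ttest_ind(da, db)
    verdict = "different" if p < alpha_adj else "not different"
    print(f"{na} vs {nb}: p = {p:.4f} -> {verdict} at corrected alpha")
```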
Many consider this too conservative and use other tests, but
Excel does not provide them. (The Student–Newman–Keuls test is
considered by many to be the best). Plotting the means of the groups with ± 2
SEM or the 95% confidence intervals will give you a good idea of which
differences are significant and which are not, and also the relative importance
of the differences in the means and the scatter, anyway.
We called this ‘one-way’ anova, because there is a more complex test, called
multiple analysis of variance, which compares multiple factors in more than 2
groups.
Two factor ANOVA
This is an extension of the analysis of variance to permit ‘stratification’ of
the variable according to some characteristic (grouping into levels by some
characteristic such as age or whatever), which reduces the variance. Another
possible use is to look at the effect and interaction of two categorical
grouping variables on the study variable. The example in the SPSS manual is
that of marks for an essay, assessed by same-sex and opposite-sex markers, who
had been shown photographs of the authors. The authors were grouped into
attractive, so-so and oh, yuck! groups. The analysis could answer three
questions.
1. Whether attractiveness related to marks,
2. Whether the marker’s sex influenced the scores,
3. Whether there is interaction between the effects of attractiveness and the
sex of the assessor.
By the way, the conclusion was that the essay marks were
influenced by the writers’ looks, if they were attractive and marked by someone
of the opposite sex.
The test can be extended to more than 2 grouping variables.
Repeated Measures Anova (or Two-way Anova
with Replication, in Excel):
If we look at pulse rates in subjects, after exercise, at
1, 5 and 10 minutes, it is obvious that the measurements are not independent of
one another, so it fails to meet the independent-samples criterion of either
simple or multiple anova. This is
analogous to the paired t-test situation but for more than 2 groups.
Surprisingly, Excel includes a two-way anova with replication, but not a
one-way repeated measures design. The two-way test, however, can be used to perform
a simple ANOVA with repeated measures, by simply omitting the second grouping
factor.
Do an analysis of variance (ANOVA) on each set, then t-test group 1 v group 2,
group 1 v group 3 and group 2 v group 3.
Open Anova.xls and do a scatter plot of the data, showing
each group separately. Plot the means of the 3 groups and their 95% confidence
interval as error bars (separate plot). This is not quite straightforward! If
you need help, look in Explanations.
This data purports to be pulse rates 1, 5 and 10 min after exercise, in 50
subjects. First we want to see if the mean pulse rates are different at the 3
observation times. That is a one-way anova with replication, but the program
we shall use is two-factor anova with replications, and we shall simply enter
the count of all our data points in the “Rows per Sample” window, in this
case, 50. Hint: as input range, you need to include the col. with Old and
Older, as well as the col. labels: a1:d51.
1. Is there significant difference between the pulse rates at the 3 times?
2. Which groups are different? In fact, how can you tell?
3. If you work out the CI_95 for the means of the groups, they suggest that 1
is different from 2 or 3, but 2 and 3 are not significantly different. This is
not true. Why?
Discriminating on the grounds of age, we shall divide our subjects into 2
groups of 25, old and older, and repeat the test. Now our “Rows per Sample”
becomes 25.
1. How do you interpret these results?
If you could answer these questions without reference to the
Explanations,
you are pretty damn bright!
Regression deals with the relation between two variables, for example the
height and weight of persons, and describes this relation. Correlation
expresses how ‘tight’ such a relation is, meaning how predictable one variable
is from the other.
Relations between variables have 3 features. First is the nature of the
relationship: does y grow as x increases, or does it diminish? When x is 0, is
y also 0, or not? Second, what is the magnitude of the relation, or how
tight or good the relation between the two variables is; and third,
reliability. The last means how probable it is that other samples from the
population would demonstrate a similar relation. Therefore, when looking at a
regression, we need to answer the three questions:
1. What is the relation between the two variables?
This is answered by the regression equation. (Nature)
2. How strong is the relation between the two
variables? The correlation coefficient and its p value answer this. (Magnitude)
3. What is the likely range of the population
values for the statistics (regression intercept and coefficient, and the
correlation coefficient)? This we find out from the SE or CI estimates of the
statistics. (Reliability)
Regression explores the
nature of the relation: how y relates to x. This relation can be linear,
the only type which we shall explore in detail, or something else. Linear
regression applies to the situation where the relationship between the two
variables is best described by a straight line.
Such a relation is
described by the equation:
y = a + bx
where a is the intercept
(sometimes called the constant), the value of y when x = 0, and b is
the coefficient, which describes the rate of change of y with x. If it
is positive, y grows as x increases; if negative, y diminishes. This regression
line, or trend line as it is sometimes called, always passes through
the point described by the means of x and y, and is the line of
best fit, where the sum of the squared residuals is least.
In the diagram, the residuals
are the vertical bars, from the estimated values of y, ŷ1 and ŷ2, to the measured values, Y1
and Y2. In other words, a residual is the
difference between the actual value of y for a given value of x and the value
predicted by the regression equation. When the sum of the squares of these
residuals (the error) is least, we have found the line of best fit.
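For the curious, the line of best fit can also be computed outside Excel. A minimal Python sketch (on invented data) shows the least-squares estimates and the residuals, and confirms that the fitted line passes through the point given by the means of x and y:

```python
# Least-squares fit of y = a + b*x and its residuals, on invented data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

b, a = np.polyfit(x, y, 1)       # coefficient (slope) and intercept
predicted = a + b * x
residuals = y - predicted        # the vertical bars in the diagram

print(f"y = {a:.2f} + {b:.2f}x")
# The line minimises the sum of squared residuals and passes through
# the means of x and y (prints True):
print(np.isclose(a + b * x.mean(), y.mean()))
```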
For continuous variables,
we use the method called Pearson product moment regression, which only
applies to linear regression. Three criteria need to be fulfilled for the
procedure to be valid:
1. The y values must be distributed normally about the regression line.
In English, this implies that y values closer to the trend line (regression
line) are more frequent than those further away.
2. The variance of y must be constant for all values of x.
3. The x,y pairs must be independent; for example, they must not contain
multiple values from the same subject.
These assumptions are
quite robust, meaning that they only need to be approximately correct
(except for the last). Usually an eye-ball test is sufficient to tell us
whether the data set is acceptable. If the x-y plot leaves you in doubt, look
at the residuals, which should be randomly distributed about the trend line. We
shall look at these plots soon. The fact that the correlation coefficient
is reasonably high, or that the correlation is significant, even highly
significant, must not fool you into believing that the relation is linear.
If any of these three
assumptions is grossly violated, then the non-parametric equivalent of the
regression procedure, the Spearman rank order correlation, should be
used. This test is not included in the Excel Data Analysis Pack.
One of the two variables,
x, by convention, is designated the independent or explanatory
variable. The corresponding y is the dependent variable.
Which is which, is a logical, not mathematical decision. If you are looking at
the relation of family income to spending, it is logical to designate income as
the independent variable, on which spending depends. If you are exploring the
relation between height and weight, it depends on your question or postulate.
If this is “are taller people heavier?”, make height the independent
variable. If you asked whether “heavier people are taller”, choose weight as x.
We now come to the correlation
coefficient. We already know that this is the indicator of the
goodness/tightness of the relation, that is, the predictability of y from x, or
vice versa. The abbreviation used is r.
In mathsspeak:
r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)^{2} Σ(y - ȳ)^{2}]
This is a value between -1 and +1: 0 means no correlation, and ±1 perfect
correlation. The square of r, r^{2},
is the fraction of the total variance explained by the regression. Of course,
the greater this is, the better the fit. If r^{2} is 0.8, then 80% of the variation in y is due to the
variation in x and 20% to other causes, such as measurement error, individual
variation or whatever.
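As a quick illustration (again in Python, with made-up numbers), r and r^{2} can be computed directly:

```python
# Correlation coefficient r and r^2 for a nearly linear set of invented data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]      # Pearson correlation coefficient
print(f"r = {r:.3f}, r^2 = {r**2:.3f}")
# r^2 is the fraction of the variance of y explained by the regression;
# here it is close to 1 because the data are almost perfectly linear.
```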
Some Practical Points:
We have noted that a valid
regression procedure requires that the residuals (the differences between the
actual and predicted values of y) be normally distributed about the regression
line. This is not difficult. Tick ‘Residuals’ and ‘Residual Plot’ on the
Regression procedure and look at the residual plot, which is the plot of the y
values deviations from their regression estimates (the residuals) against the
corresponding values of x. If they are randomly distributed about the zero line
and seem to have the same scatter for all values of x, then probably two of the
assumptions, normal distribution and equal variance are acceptable. This is an
important application of the eye-ball test. If in serious doubt, do a histogram
of the residuals and see if this looks anything like a normal distribution. As
a last resort, if you think it is worthwhile, you can test the residuals for
significant difference from the normal distribution by the Kolmogorov–Smirnov
test (if you have it in a stats package) or by Chi-square, which we
discuss later.
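Outside Excel, the Kolmogorov–Smirnov check is a one-liner. Here is a hedged sketch in Python with scipy, using artificial residuals (built from exact normal quantiles) in place of the residuals the Regression tool would list:

```python
# K-S test of residuals against the normal distribution. The 'residuals'
# here are artificial (exact normal quantiles), standing in for the
# residuals Excel's Regression tool would list.
import numpy as np
from scipy import stats

q = np.linspace(0.005, 0.995, 100)
residuals = stats.norm.ppf(q, loc=0.0, scale=2.0)

# Standardise, then compare with the standard normal distribution
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, p = stats.kstest(z, "norm")
print(f"K-S statistic = {ks_stat:.3f}, p = {p:.3f}")
# A large p: no significant departure from normality.
```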
At this point you should
look at the RegressionTypes.xls
file in Examples. In this file I have included a linear regression (worksheet
labeled ‘Linear’) where the residual plot and the histogram of the residuals
are done.
The next sheet is labeled
‘Log’, a regression where
y = a + b·log(x)
This is not a very
important relation for us, but I have drawn it to your attention for two
reasons:
1. The value of r^{2}
for a linear regression for this data is 0.82, a very good value. The
significance is p<<0.001, highly significant, and if you look at the
linear trend line in the x,y plot, it is not such a bad fit. It could well
masquerade as a linear regression.
2. The second reason is that when we find out that this is not a linear
regression after all, all is not lost, because we can turn it into a linear
relation.
How can we tell that this
is not a real linear regression, but something pretending? You only have to
look at the residual plot. If you only had the data from x = 300 to x = 1000,
it would indeed be hard to tell without the residual plot. But if you look at
this, well it is obvious that the residuals are not randomly distributed about
0. Something is fishy and you can track this down by fitting other than linear
trend lines. Aren’t computers great? I would hate to have to do this by
calculator or paper and pencil.
What can we do about
this? Well, you see the equation above: if you regress y not against x, but
against log(x) (col. A is no longer x, but log(x)), you will get a linear relation. If
you look in ‘Linearized Log’, this is exactly what I have done, and the
residuals look a lot more random and the histogram is a lot closer to normal
than it was.
There are many other
non-linear relations, which are possible, but we shall not deal with them. In
any case, the trick to deal with them is to turn the relation into a linear one
before you do a regression analysis.
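The linearising trick is worth one small demonstration. With synthetic data that follow y = a + b·log(x) exactly (invented numbers, not the ‘Log’ worksheet), regressing y against log(x) recovers the straight-line constants:

```python
# Linearising a logarithmic relation: regress y on log(x) instead of x.
import numpy as np

x = np.array([10.0, 30.0, 100.0, 300.0, 1000.0])
y = 5.0 + 2.0 * np.log(x)            # an exactly logarithmic relation

b, a = np.polyfit(np.log(x), y, 1)   # fit y = a + b*log(x)
print(f"a = {a:.2f}, b = {b:.2f}")   # recovers a = 5, b = 2
```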
Look at the output of the Regression procedure, in ‘Linear’.
1. Multiple R: This is the correlation coefficient, properly called
the Pearson (product moment) correlation coefficient. There is another, called
the Spearman (a non-parametric) correlation coefficient, which we have already
mentioned. R can vary from 0 to 1; 0 being no correlation and 1, perfect
correlation. Perfect correlation does not mean identity (y = x); it means that
you can predict y from x with no error.
2. R square: Obviously, the square of the correlation
coefficient. This, as you know, is the fraction of the variance of y which is
explained by the regression. The remainder of the variation (the residuals) is
not explained and includes measurement error, etc. In this instance, r^{2}
= 0.98, so there is little unexplained variation in y.
3. Adjusted R square: An estimate of the
population R^{2}. Very useful when comparing different models, such as a
straight-line relationship with, say, a logarithmic relation. As it is an
estimate of the population value, it is probably a more realistic estimate of
the explained variance.
4. Standard error: The square root of the
mean residual sum of squares, that is, of the variance not explained by the
regression; it is the same as the SD of the residuals, the amount y
differs from the predicted value. It is in the same units as y.
5. Observations: simply the number of value pairs.
6.ANOVA tests whether the slope of the regression line
differs from 0. Of course we know that if there is no relation between x and y,
the regression line will have a 0 slope – saying that y does not grow or
diminish as x changes.
7. Intercept: The value of y when x = 0. The p value tests
whether this statistic is different from 0, and the 95% confidence intervals are
the likely limits of the population value. If they straddle 0, the intercept is
not significantly different from 0 at the 5% level and the null hypothesis
cannot be rejected. Where you have logical reason to do so,
the intercept can be forced to 0.
8. The coefficient: The coefficient defines
the rate of growth of y as x changes. If it is negative, then y gets less as x
increases in value. In the equation
y = a + bx
if b = 0, then y = a
and y has no relation to x.
In the
‘Linear’ worksheet, the values for the intercept and the coefficient are given.
The intercept is tested: is it different from 0? In this instance the p value,
the probability of finding an intercept of –3.14 when the true population
value is 0, is 0.8, so we conclude that the intercept is not significantly
different from 0. The coefficient is also tested for difference from 0. In this
instance it is 1.002, and the p value is 3.8 x 10^{-41}, or
infinitesimal. This simply means that the chance of the coefficient being 0 is
practically nil.
Confidence
interval: The 95% confidence intervals tell the same story. The limits
of the intercept, -30 to 24 well and truly overlap 0. Those of the
coefficient, 0.96 to 1.05 are well away from 0 and obviously represent a
significant coefficient. So, now let’s do some practice.
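The same output items (coefficient, intercept, r^{2}, p value and confidence interval) can be reproduced in Python with scipy.stats.linregress; the data below are invented. The CI arithmetic is the usual slope ± t × SE, with n − 2 degrees of freedom:

```python
# Regression statistics like Excel's output, on invented data.
import numpy as np
from scipy import stats

x = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
y = np.array([12, 19, 33, 38, 52, 59, 71, 82], dtype=float)

res = stats.linregress(x, y)
print(f"intercept = {res.intercept:.2f}, coefficient = {res.slope:.2f}")
print(f"r^2 = {res.rvalue ** 2:.3f}, p (slope differs from 0) = {res.pvalue:.1e}")

# 95% CI for the coefficient: slope +/- t * SE, with n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, len(x) - 2)
lo, hi = res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr
print(f"95% CI for the coefficient: {lo:.2f} to {hi:.2f}")
```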
Copy Pre-test SBP, DBP and Pulse Rate (not Pulse
Pressure) to a new file, which you can save as Regression.xls.
Plot pre-test SBP (as the x value) versus DBP. On a
separate graph, plot SBP v pulse rate. Insert regression lines on both plots
(right-click on a data point, choose ‘Add Trendline’ and choose ‘Linear’). Try
to form an idea whether these correlations are worth pursuing.
The eyeball test of the plots says that it is probably
worth doing the regressions, but we shall see what the R^{2}’s are.
Do a regression estimation on both relationships, anyway.
Place a tick in the box labeled ‘Residual Plots’ and ‘Line Fit Plots’. This
will cause Excel to list the deviation of each y from the predicted value and
plot the deviations against x. It will also list the residuals, should you want
them for a histogram.
Do a histogram of the residuals. When using the Histogram
procedure you do not have to build a table of the bins, Excel will do them.
Make sure that the cell in the ‘Bin’ box is such that nothing will be
overwritten and for convenience in plotting, format the numbers in the Bin
column to no decimal places. This makes the plots of the histograms less
cluttered.
What is the intercept and the coefficient of columns 1
and 2? What about cols. 3 and 4? Plot these data with cols.1 and 3 respectively
as x’s and 2 and 4 as y’s, then use the regression program.
Remember, the Excel
regression tool defines the straight line which best represents our data. In
other words, it is the line which, when we calculate the predicted values of y
for any value of x, gives the least error compared to our actual values. Put
another way, it is the line which minimizes the sum of the squared residuals. As
we have observed, this line always passes through the point which represents
the mean values of x and y. It is defined by its slope (coefficient) and its
intercept (constant) on the y axis.
Plot the data in columns 5 and 6. Do a regression
analysis, tick the Residual plots and the Line fit plots. What does the
residual plot suggest? What does the line-fit plot tell you? Then try the
correlation tool. What does this tell you?
The final table shows the
statistics of the intercept and the coefficient, identified here
by the column number. The t statistic
you know, and the significance is based on this. What was actually tested is
whether the intercept differs from 0, and it does. The coefficient, on the other
hand, does not differ significantly from 0. You can conclude that there is no
significant correlation between x and y. The lower and upper 95% confidence
intervals give the range within which it is 95% probable that the true
population value of the slope coefficient lies. It can be set to levels other
than 95%. If this range includes 0, the value of the statistic is not
significantly different from 0 at the 0.05 level. Why the CI is printed twice
in the output table is a mystery.
A Case of Non-linear Regression.
The partial pressure of CO_{2} in the end-tidal
exhaled air is inversely proportional to ventilation of the lung alveoli (as
ventilation increases, CO_{2} decreases). In an experiment, we measured
ventilation and end-tidal CO_{2}. The inverse proportional relation is
described by
P_{ACO2} = k / V_{A}
where P_{ACO2} stands for the
alveolar partial pressure of CO_{2}, V_{A} is the alveolar ventilation and k is a constant which depends
on the measurement units.
Plot Ventilation as x and P_{ECO2} as y and put
in a linear trend line. This just does not look right, as there is a decided
curve to the relation. Now do a regression estimation on P_{ECO2} versus ventilation,
with residual and line-fit plots. The numbers come up highly significant,
p<0.001 and R^{2} = 0.84, but the linear trend line just does not
look right. The residuals are not distributed randomly.
What could you do to improve the regression relation?
What I did you can see in the Examples
file, and I explain it in Explanations.
This is a special case of
two variables x and y, where the results recorded by two instruments should
be identical, not just related, in order for the instruments to be used
interchangeably. The purpose of these comparison studies is to identify
differences between the two methods, not to point up similarities. This was the
subject of a widely quoted paper by two statisticians (Bland and Altman,
Lancet, 1986; 307-310), who wrote this readable paper for health care workers,
not other statisticians. They criticized regression as a method of comparison
and offered alternatives, which are well suited to the quick appreciation of
the differences.
Let us compare SBP
measured by invasive arterial measurement and by a BP cuff. There is the true
blood pressure, and then there are the pressures measured by the two
instruments. Both methods have some measurement error. Sometimes we have a
time-tested and well validated method which we may choose to consider the ‘gold
standard’ with which we compare the other method, but as usually both methods
contain some measurement error, our best estimate of the actual measurement is
the mean of the two methods. Differences between the instruments can be random,
as in measurement error; systematic, where the error or bias is
fixed, such as one thermometer reading 5° higher than another; or
proportional, where the error has some relation to the value of the
measurement, such as when one machine reads 10% lower than another. Any
combination of the three types of error can be present. The objective of a good
comparison method is to define which types are present and their magnitude.
To do this with
regression analysis is not quite straightforward, but it can be done provided
care is used in interpretation. The real drawback of the method is that in
linear regression no allowance can be made for random error in the method
chosen as the independent or explanatory variable.
There are 4 worksheets in the file (Instruments.xls). Start with
‘Random’. This is two sets of identical measurements, with added random
‘noise’, which serves as measurement error.
Do a regression
analysis, and plot the x and y values and the residuals.
Next do the same with ‘Fixed’, where 8 mm Hg has been
subtracted from the cuff values, as if the cuff read consistently a little
lower than the invasive pressure. The random error is still there.
In ‘Proportional’, the cuff values are multiplied by 1.3
as if the cuff were over-reading by 30%. Do the regression with x-y plot and
residuals.
Finally ‘Fixed+Proportional’ combines the two types of
error. Do the regression with x-y plot and residuals. Now think about your
results, and then look at the Explanations, Exercise
13.
Now let us look at the
method suggested by Bland and Altman, sometimes called Analysis of Differences. Basically, this
involves plotting the differences between the two methods against their mean.
The X axis is the mean of the pair of readings, (X_{1} + X_{2})/2,
where X_{1} and X_{2} are the two measurements because this is
the best estimate, which we have of the true value of the variable. On the Y
axis is the difference between the pair, X_{1} – X_{2}. This
results in a plot where the differences are easy to see and inspection may
reveal fixed and/or proportional bias. This is assisted by showing the
0-difference line and lines showing ± 2SD of the differences. This is an
estimate of the range which should contain about 95% of the observations and is
sometimes called the limits of agreement. The 95% CI of the mean
difference can be calculated and they suggest that if that range overlaps 0,
the two methods can be used interchangeably. The exact type and magnitude of
the bias can be detected by using linear regression on the differences versus
the mean value of the paired observations, and testing for differences of the
intercept and the coefficient from 0. Here identity has a coefficient of 0, not
1 as in the case of the regression method, because it is the difference between
the readings which we are testing.
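A Bland–Altman analysis is only a few lines of arithmetic. The Python sketch below uses invented paired pressures to show the mean/difference columns, the bias (mean difference) and the ± 2SD limits of agreement described above:

```python
# Bland-Altman quantities for two instruments, on invented paired readings.
import numpy as np

invasive = np.array([118, 125, 132, 140, 121, 136, 128, 145], dtype=float)
cuff     = np.array([115, 124, 129, 135, 120, 131, 127, 139], dtype=float)

mean_pair = (invasive + cuff) / 2   # best estimate of the true pressure (x axis)
diff_pair = invasive - cuff         # X1 - X2 (y axis)

bias = diff_pair.mean()             # the mean difference
sd = diff_pair.std(ddof=1)
lower, upper = bias - 2 * sd, bias + 2 * sd   # limits of agreement
print(f"bias = {bias:.2f} mm Hg, limits of agreement {lower:.2f} to {upper:.2f}")
```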
This is the same data file as Instruments.xls but we
shall analyse it differently. On each worksheet (Random, Fixed, Proportional
and Fixed+Proportional) create columns for the mean of the two measurements and
their difference.
Plot these columns on a scattergram.
Do the descriptive statistics on the difference column,
which you have created.
Do a regression analysis with mean as the x value and the
difference as y. You may wish to do a Residual Plot, which will give you a
visual indication whether the linear fit is OK.
Tabulate your results for the 4 worksheets, for a good
overview. Do the mean and the difference separately, with mean, SE, SD upper
and lower 95% CI and add the intercept and the coefficient as well from your
regression analysis.
Probably the best method
for comparison of two sets of measurements is the least products regression,
a fairly complicated method used by chemists which has not, to my knowledge,
penetrated the health care sector. (See Ludbrook J. Clin and Exper Pharmacol
and Physiol 1997; 24:191-203 if you are really keen.) It takes into account
errors in both measurements, yet yields all the information of the regressions.
We
have so far discussed statistics applicable to normally distributed continuous
(interval) variables. They are important, but there are ordinal and nominal
variables, as well as continuous variables which are not normally distributed. At least
theoretically, these data should not be analyzed by the parametric methods
which we have discussed. In practical fact, in many instances the parametric
and non-parametric methods come up with the same, or at least very similar,
answers. As parametric tests are more sensitive (by about 5% only), they
should be used where the scatter of the data does not deviate significantly
from a normal distribution and other criteria permit. For those cases where
data are not suitable for parametric treatment, we have non-parametric
equivalents for most methods for testing differences between means. These tests
depend on ranking the data and test for differences between the mean ranks.
There are even equivalents to the parametric (Pearson) correlation
coefficient – the Spearman correlation coefficient and Kendall’s
Tau. As, however, we have no knowledge of the distribution of our data, we
shall have no equivalent to the SD, so we shall not be able to say that we
expect 68% of the values to fall within the ±1SD limits and so on.
Nonetheless, whatever the underlying distribution of the data may be, if we
draw samples, the distribution of the sample means will be normal. This leads
to a whole new class of tests, called re-sampling, upon which we shall
touch.
Data                                        Parametric                                      Non-parametric
2 independent samples                       t-test                                          Mann-Whitney U test (also called Wilcoxon)
2 paired samples                            t-test for paired samples                       Sign test, Wilcoxon signed-rank test
2 or more independent samples               one-way analysis of variance                    Kruskal-Wallis test
2 or more samples, with repeated measures   one-way analysis of variance with replication   Friedman test
Relationship between two variables          Pearson’s product moment correlation            Spearman’s correlation, Kendall’s Tau
The table above gives the
non-parametric equivalents of the parametric tests we have met.
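None of these tests is in Excel's Analysis Pack, but all exist in scipy. A quick sketch on invented data matches rows of the table to function calls:

```python
# The non-parametric tests from the table, as scipy calls, on invented data.
from scipy import stats

a = [3, 5, 4, 6, 8, 7, 5]
b = [9, 11, 10, 12, 9, 13, 10]
c = [4, 6, 5, 7, 6, 8, 5]

u_stat, p_u = stats.mannwhitneyu(a, b)     # 2 independent samples
w_stat, p_w = stats.wilcoxon(a, b)         # 2 paired samples
h_stat, p_k = stats.kruskal(a, b, c)       # 2 or more independent samples
rho, p_rho = stats.spearmanr(a, b)         # rank correlation
tau, p_tau = stats.kendalltau(a, b)

print(f"Mann-Whitney p = {p_u:.4f}, Wilcoxon p = {p_w:.4f}, "
      f"Kruskal-Wallis p = {p_k:.4f}")
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```

scipy.stats.friedmanchisquare covers the Friedman row of the table.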
There are many other
non-parametric tests for various purposes, which we shall not discuss but there
is one, which is so useful that we need to have it in our cupboard. This is the
Chi-square (X^{2}) test.
This test can be used to
compare our data or results with any expected results. We can test whether a
set of numbers is compatible with a distribution, such as the normal
distribution, or whether a politician had special appeal for one sex or the
other. As long as we can postulate ‘expected’ results, we can compare them with
the actual numbers in our data set, whether the expected number is based on a
theoretical distribution or on a hypothesis.
The algorithm for
chi-square is simple:
X^{2} = Σ (O - E)^{2} / E
where O stands for the
observed and E for the expected number. If there are only two observed values
and two expected (the minimum), a significant error creeps in, because the X^{2}
distribution, which will tell us the probability, is continuous, but the values
for O and E are not; they are usually integers. Therefore a correction, called
Yates’ correction, is applied, and the formula reads
X^{2} = Σ (|O - E| - 0.5)^{2} / E
The vertical bars around
O - E mean absolute value. This correction is only applied to 2x2 tables.
To find the probability
of our result, given the expected values, we need to know the degrees of
freedom for X^{2}, which is (number of columns - 1) x (number of
rows - 1).
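The two formulas are easy to compute by hand; here they are in plain Python for a hypothetical 2x2 table (the observed counts are invented, and the expected counts come from its row and column totals):

```python
# Chi-square with and without Yates' correction for a hypothetical 2x2 table.
# Observed counts (rows: group 1 and 2; columns: outcome yes/no), flattened:
observed = [30, 20, 15, 35]
# Expected counts from row totals (50, 50) and column totals (45, 55):
expected = [22.5, 27.5, 22.5, 27.5]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Yates' continuity correction, for 2x2 tables only:
chi2_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

print(f"chi-square = {chi2:.2f}, with Yates' correction = {chi2_yates:.2f}")
```

As expected, the corrected statistic is a little smaller, so the corrected test is slightly more conservative.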
The X^{2}
statistic has its own distribution, which we can access via the CHITEST
function in Excel. So we are now equipped to develop our X^{2}
spreadsheet. Let us get some data, from the Electoral Commission.
At the 2001 Federal
election, on a 2-party preferred basis, the Coalition polled 1,811,019 and the
ALP 1,662,190 votes in NSW. In Bennelong, the PM’s electorate, the corresponding
numbers were 44,855 and 32,883. Did John Winston Howard do better than the State
average? Is the difference in the state, between the two parties, statistically
significant or likely to arise by chance?
Exercise 15, Chi-square tests.
Let us ask two questions:
1.Is the Coalition’s vote in Bennelong different
from the NSW average?
2.Is the Coalition’s vote in NSW different from
the ALP’s?
We can make a table here, in Word, as the numbers are in
front of us. By the way, Word will sum numbers for us: look in Table\Formula.

            Coalition    ALP          Row Total
NSW         1,811,019    1,662,190    3,473,209
Bennelong   44,855       32,883       77,738
Open Excel and
insert the table into a new spreadsheet, which you will save later as
Chi-square.xls in the Data folder. Now
let us answer the two questions.
1. Is JWH’s vote different from the State average?
We know the actual (observed) votes in Bennelong. The expected vote, based on
the State average, will be:
a. Coalition: 77738*1,811,019/3,473,209 = 40535
b. ALP: 77738*1,662,190/3,473,209 = 37203
So now it only remains to make
our contingency table with rows for the observed and expected values and
columns for the Coalition and the ALP (it would not matter if you did it the
other way).
Finally, all you need to do is
to use the =CHITEST function, get the syntax from the Excel help.
Check the result by setting up
your own Chi-square test, using the function (Observed-Expected)^{2}/Expected,
for each cell, summing them to get the Chi-square statistic and then look up
the probability by using the =CHIDIST(Chi-square, Degrees of freedom) function.
2.Now the second question. Is the difference in
the State vote significant? Well you have the observed values and if there were
no difference, the votes would be split 50:50 so there are the expected values.
Voila!
A second example:
Now let us examine the sort of cars people drive. We’ll
group them into small, medium and large and we shall note the number of each as
100 pass us, in three locations: a rich suburb, a poor suburb and the CBD.
              Small   Medium   Large   Row Total
Rich           27      38       35       100
Poor           48      21       31       100
CBD            21      25       54       100
Column Total   96      84      120       300
The question is whether this pattern is likely to have arisen
by chance. So let us transfer this data to a new worksheet in our spreadsheet
and set up our ‘expected’ values, which must be estimated from our
observations. If there were no differences between the locations, then there
should be 32 (96/3) small cars, 28 medium and 40 large ones in each location.
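If you want to check your spreadsheet against code, scipy's chi2_contingency computes the expected table, the chi-square statistic, the degrees of freedom and the p value in one call from the observed counts:

```python
# Chi-square test on the car-size table, using scipy.
from scipy.stats import chi2_contingency

observed = [[27, 38, 35],    # rich suburb
            [48, 21, 31],    # poor suburb
            [21, 25, 54]]    # CBD

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.5f}")
print(expected[0])           # expected counts per location: 32, 28 and 40
```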
Statistics packages with good non-parametric tests
contain the one-sample Kolmogorov–Smirnov test. This test
compares the data to a known distribution, such as the normal, and if the
difference is not significant, we can regard the data as near enough to being
normally distributed. Excel does not have this test in its data analysis
package. We have to make our own, which
we can do with the =Chitest() and =Normdist() functions of Excel,
although, in fact, the eyeball test is sometimes sufficient for anyone except
reviewers of scientific papers.
A normal distribution is
described by its mean and standard deviation. If we use the Descriptive
Statistics function on our data, we can obtain values for these, for our
numbers. These can then be used to produce the ‘expected’ values, for a normal
distribution for comparison with our actual data.
Let us look at an
example, say, survival after diagnosis of Martian swoof disease; the gathered
data for 100 unfortunate sufferers is on the Swoof worksheet in Normtest.xls.
Let us set up H_{o}, that our data does not differ significantly from a
normal distribution.
Exercise 13. Is the distribution normal?
Open Normtest.xls and use the Swoof worksheet. The
numbers represent survival in years, after diagnosis.
The first thing to do is to run Descriptive Statistics to
find the mean and SD of our data. Next make a histogram of the survival times,
you can use the ‘bins’ (intervals) Excel makes automatically when you leave
‘bin range’ blank. The reported ‘Frequency’ will become your ‘observed’ data.
Now we get to the complicated bit: we have to set up the
‘expected’ (normally distributed) values. We’ll use the =NORMDIST(x,mean,SD,cumulative)
function. Let me explain the syntax.
1. x is the test value, in this case the upper
limit of the ‘bin’.
2. The mean and SD we have just calculated for our
data.
3. Cumulative means the area under the
normal curve, or sum total probability, up to the value x. We have to
set it to ‘true’ if we want the probability up to that limit (as we do here; this
is called the cumulative distribution function (CDF)) and to ‘false’ if we
just want the frequency at that particular value.
Now this is the good bit: we want the cumulative distribution
function for each bin, separately. That means finding the CDF for the bin and
subtracting the CDF of the bins below, to find the actual number of values
expected in that bin. The marked area between 2 and 3 would represent the
fraction of all the values which lie between the two numbers, and to find this
area we need to subtract the area to the left of the line at 2 from the area to
the left of the line at 3.
For convenience, I
set up an adjacent column with 0 opposite the lowest bin and the top value of
the bin below in the other cells. You need to replace ‘More’ in the final bin
with a number which is greater than the maximum in your data. Finally, to
convert the probability function, a fraction of the total, into the expected
number, you have to multiply it by the number of values in your sample, in this
case, 100. If this is clear as mud, have a look in the worked examples.
The syntax in the ‘Expected’ column is:
The first value: =100*(NORMDIST(bin1,mean,SD,TRUE) - 0)
The next: =100*(NORMDIST(bin2,mean,SD,TRUE) - NORMDIST(bin1,mean,SD,TRUE))
and so on for each bin, until
The final bin: =100*(NORMDIST(bin11,mean,SD,TRUE) - NORMDIST(bin10,mean,SD,TRUE))
Remember to insert the ‘$’ in the cell references for the mean and SD or you
will get some funny numbers. Type in the
first two, but then copy and paste saves a lot of time. The sum of these
‘expected’ values should be about 100 (there may be a rounding error).
The degrees of freedom are equal to the number of cells (11)
- the number of estimated parameters (mean and SD, which is 2) - 1, that is,
11 - 2 - 1 = 8, multiplied by the number of columns (2) - 1 = 1.
I put in a column of
the individual Chi-sq. terms which sum to 4.18. The p value of the H_{o}
(the distribution does not differ from normal) is =chidist(chi-square,DF) =
0.8407 and you can accept the H_{0} with some alacrity.
The =chitest function
does not quite work here, because it does not allow for the estimated
parameters and would use DF = (c-1)*(r-1) = 10*1.
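The whole worksheet procedure condenses to a few lines of Python if you have scipy. The sketch below bins simulated 'survival times' (not the Swoof data), builds the expected counts from the normal CDF with the end bins stretched to cover the tails (like the ‘More’ bin in Excel), and uses the corrected degrees of freedom, bins − 2 estimated parameters − 1:

```python
# Chi-square goodness-of-fit to a normal distribution, mirroring the
# worksheet: simulated data stand in for the Swoof survival times.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=8.0, scale=2.0, size=100)   # 100 'survival times'

counts, edges = np.histogram(data, bins=8)
mean, sd = data.mean(), data.std(ddof=1)

# Expected count per bin = n * (CDF(upper) - CDF(lower)); stretch the end
# bins to cover the whole distribution
cdf = stats.norm.cdf(edges, loc=mean, scale=sd)
cdf[0], cdf[-1] = 0.0, 1.0
expected = 100 * np.diff(cdf)

chi2 = ((counts - expected) ** 2 / expected).sum()
df = len(counts) - 2 - 1          # bins - estimated parameters - 1
p = stats.chi2.sf(chi2, df)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.3f}")
```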
Do the 2^{nd}
and 3^{rd} examples, for practice and see what conclusions you reach.
As a matter of interest, you can do comparisons with
other distributions such as binomial or Poisson in Excel, using similar
methods. There are a few remarks in Explanations.
This section is based on
National Prescribing Service News, 25/2002: “Making sense of evidence-based
jargon”. The data used are quoted from the Women’s Health Initiative paper
(JAMA, 2002; 288:321-33) and present the effect of hormone replacement therapy
on the incidence of osteoporotic fractures and cardiac events. Two groups were studied,
those on HRT (N = 8506) and placebo controls (N = 8102). The percentages of
osteoporotic fractures and cardiovascular events are used to illustrate the use
of certain statistics like relative risk, absolute risk, risk reduction,
number needed to treat and number needed to harm.
Annualised Event Rate         Treated (8506)   Control (8102)
Osteoporotic fracture rate    1.47 %           1.91 %
The first question that
arises is whether the difference is significant. The way to clarify this is a
contingency table and a Chi-square test. Let us turn the percent incidence
into actual (annual) events:
Exercise 15, Another contingency table
We shall check the significance of the numbers by
building a contingency table and doing a Chi square test.
Fractures in the treated group: 8506*1.41/100 = 120
Control group: 8102*1.91/100 = 155
It is easiest now to go to a spreadsheet: HRT.xls, a two-way
table, showing the number with and without fractures in the treated and the
control groups. Next you need to make a table for the expected values, if the
two groups came from the same population (H_{0}) – this is a simple
proportion problem, which we have examined in the last section. Make up your
contingency table:
              Fracture    No Fracture    Row total
Treated       125         8381           8506
Control       155         7947           8102
Column total  280         16328          16608
There were 280 fractures among 16,608 subjects; how many
do you expect among 8506 and 8102, respectively? This is very easy in the
spreadsheet.
Then finally you just invoke the CHITEST function
of Excel with the actual and expected value ranges
(=chitest(actual value range, expected value range)) and up pops the two-tailed p-value!
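If you want to cross-check the spreadsheet outside Excel, here is a minimal Python sketch of the same contingency-table calculation. It rebuilds the counts from the annualised rates quoted above, so the rounded numbers may differ by an event or two from any table you construct by hand:

```python
import math

def chi2_2x2(a, b, c, d):
    """Chi-square test for the 2x2 table [[a, b], [c, d]] (one degree of freedom)."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    chi2 = 0.0
    for obs, r, k in ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)):
        exp = rows[r] * cols[k] / n     # expected count under H0
        chi2 += (obs - exp) ** 2 / exp
    # With 1 DF, chi-square is the square of a standard normal deviate,
    # so the two-tailed p-value is erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

treated = round(8506 * 1.47 / 100)   # fractures in the HRT group
control = round(8102 * 1.91 / 100)   # fractures in the placebo group
chi2, p = chi2_2x2(treated, 8506 - treated, control, 8102 - control)
print(round(chi2, 2), round(p, 3))   # roughly: 4.93 0.026
```

Excel's =CHITEST gives the same p-value directly from the actual and expected ranges.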
Now that we know that the
difference is significant at p < 0.03, it is worth continuing to calculate
the relative risk or hazard ratio. The risk of fracture with HRT
is 1.47 % and in the placebo group it was 1.91 %. The relative risk is
Relative risk = 1.47/1.91 = 0.77
The reduction in risk is
0.23 or 23 %. Looks good, but 23 % of what? The absolute risk is what we
should look at! This was reduced from 1.91 % to 1.47 % per annum, which is a
reduction of 0.44 % or 0.0044 per subject per year.
It certainly sounds
better to say that the annual risk of osteoporotic fracture is reduced by 23 %
by HRT than to say that the absolute risk reduction is 0.44 % per annum, or
even that “the risk of osteoporotic fracture is reduced by 0.0044 per subject
per year”!
The number-needed-to-treat
(NNT) is the number of subjects you would need to treat, over a specified time, to
reduce the incidence of the adverse outcome by 1. This is readily calculated:
the absolute annual rate
of reduction of fractures is 0.0044, and we postulated that we want to produce a
reduction of 1 fracture, therefore
NNT = 1/0.0044 ≈ 227
In this instance,
therefore, to prevent 1 fracture per annum, we need to treat 227 patients with
HRT. When we discuss producing an adverse event by a treatment, the equivalent
of NNT is number-needed-to-harm (NNH).
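All of these jargon statistics come from the same two rates, which makes them easy to check. A quick sketch in Python, using the rates quoted above:

```python
risk_treated = 0.0147   # annualised osteoporotic fracture rate on HRT
risk_control = 0.0191   # annualised rate in the placebo group

rr = risk_treated / risk_control    # relative risk
rrr = 1 - rr                        # relative risk reduction
arr = risk_control - risk_treated   # absolute risk reduction
nnt = 1 / arr                       # number needed to treat

print(f"RR = {rr:.2f}, RRR = {rrr:.0%}, ARR = {arr:.2%}, NNT = {nnt:.0f}")
# RR = 0.77, RRR = 23%, ARR = 0.44%, NNT = 227
```

Note how the NNT is simply the reciprocal of the absolute risk reduction; the NNH in the next exercise is calculated the same way, from the absolute risk increase.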
From the same source, we
can look at the HRT-induced increase in rate of cardiac events.
Exercise 16. The number-needed-to-harm.
The data for cardiovascular events in the same groups of
subjects are in NNH.xls. Check whether the difference is significant, then work
out the relative risk, the relative risk reduction, the absolute risks, the change in
the absolute risk and the NNH.
Why did I start off using an annualised percent incidence
rather than the actual number of events? (This is important; see Explanations.)
There are many
statistical tests, useful or arcane, which we have not discussed. The ones
which spring to mind include
Multiple linear regression, of
the type y = a + b_{1}x_{1} + b_{2}x_{2} + …, where x_{1}, x_{2}
are different variables.
Partial correlation analysis,
where a correlation is adjusted for one or more variables. For example, the
correlation between salary and age, adjusted for experience.
Non-linear regression, where the fitted model is a curve rather than a straight line.
Curve fitting, where a hypothetical mathematical
model is fitted to your data or possible models are compared.
Survival (life) tables.
There are plenty of others.
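To give a flavour of the first of these, here is a small Python sketch of multiple linear regression, fitted by solving the normal equations directly (the data are invented so that the true coefficients are known in advance):

```python
# Invented data, constructed so that y = 1 + 2*x1 + 3*x2 exactly;
# the fit should recover those coefficients.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1 + 2*u + 3*v for u, v in zip(x1, x2)]

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Design matrix columns: intercept, x1, x2; normal equations (X'X) b = X'y.
X = [[1.0, u, v] for u, v in zip(x1, x2)]
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]
a, b1, b2 = solve(XtX, Xty)
print(round(a, 4), round(b1, 4), round(b2, 4))  # 1.0 2.0 3.0
```

In practice you would let a statistics package (or Excel's Regression tool, for a single predictor at a time) do this, but it is worth seeing that nothing more mysterious than least squares is going on underneath.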
Excel statistical functions, which we have not used
Excel is rich in
mathematical functions, which include many of statistical use. To have a look
open Excel help and search the index on ‘function’.
Of the procedures in the
Data Analysis Pack, there are many which we have not explored. I shall list
their uses if you are curious:
Anova: Familiar to
you.
Correlation: How
strong is the relation between x and y?
Covariance: This is
a sort of combination of analysis of variance and regression. An example would
be if company A owns shares in B. The share price of B would affect the shares
in A. If you did an analysis of shares in A, you could include shares B as a
factor.
Descriptive Statistics:
Familiar ground.
Exponential Smoothing:
A method of taking out short term fluctuations in a series.
F-test Two-Sample for
Variances: Do the variances of the two groups differ significantly?
Fourier Analysis:
An analysis of power versus frequency in a waveform.
Histogram: You have
used this.
Moving Average: A
smoothing technique for curves.
Random Number Generator:
Not quite random, but near enough. Be sure to use a different ‘seed’ if you
don’t want to repeat the previous number series.
Rank and Percentile: Ranks
in descending order and assigns a percent value (% of largest in the series).
Regression: An old
friend.
Sampling: Takes a
sample of a given size from your data. It would be much more useful if you
could get it to take a set number of samples.
t-Test: Three types,
one for paired data, two for independent groups with equal or unequal
variances.
z-Test: Like a t-test,
but population variances need to be known.
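For the last two entries, the practical difference is where the variance comes from. A minimal z-test sketch in Python, with made-up numbers, assuming the population variances really are known:

```python
import math

def two_sample_z(mean1, mean2, var1, var2, n1, n2):
    """Two-sample z-test with known population variances."""
    se = math.sqrt(var1 / n1 + var2 / n2)   # standard error of the difference
    z = (mean1 - mean2) / se
    p = math.erfc(abs(z) / math.sqrt(2))    # two-tailed p-value
    return z, p

# Invented example: two groups of 50, means 10 and 9, known variance 4 in each.
z, p = two_sample_z(10.0, 9.0, 4.0, 4.0, 50, 50)
print(round(z, 2), round(p, 4))  # 2.5 0.0124
```

When the variances have to be estimated from the samples themselves, the t-test takes over, with its wider tails at small sample sizes.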
Statistics in Practice. SM Gore and DG Altman. Brit. Med.
Assoc. London, 1982. ISBN 0727900854. A collection of relevant articles from
the BMJ. I used this a lot in designing this course.
Statistics at Square One. TDV Swinscow. Brit. Med. Assoc.
London, 1978. ISBN 0727900358. A good intro. Limited in scope.
Statistics with Confidence. MJ Gardner and DG
Altman. Brit. Med. Assoc. London, 1989. ISBN 07279022209. Confidence intervals
and statistical guidelines. Shows an alternative approach to probability
estimation. All the material is available as DOS computer programs.
Primer of Biostatistics. SA Glantz. McGraw-Hill, 1989, ISBN
0070233721. A very good approach with little jargon or intimidating
mathematical terms. Good examples, enough to make your own spreadsheet. I have
used this book more than any other to design this course.
Medical Uses of Statistics. Eds JC Bailar and F Mosteller.
NEJM Books. ISBN 0910133166. Good discussion, illustrated by examples good and
bad from the NEJM. Not much numeric stuff.
Statistical Methods. GW Snedecor and WG Cochran. Iowa State
Univ. Press 1987. ISBN 0818315606. Big, dry, hard to read, but apart from
survival and a few other things it’s all in there in the mass of numbers and
algorithms.
Epidemiology, Biostatistics and Preventive Medicine. JF Jekel,
JG Elmore and DL Katz. WB Saunders & Co, Philadelphia, 1996. ISBN 0721652581.
Clear, simple approach to statistics used in Medicine. If you buy a book (apart
from mine) get this.
There is plenty of interesting statistical material out
there. For a start, 2 discussion groups: sci.stat.consult
and sci.stat.edu. Although rather high powered for me, I have had good help
with problems on both.
Then there are stats textbooks, calculators and courses on
line:
1. http://www.psychstat.smsu.edu/sbk00.htm
– Introductory Statistics: Concepts, Models and Applications. A text and
course, which is a little lighthearted and easy to digest.
2. Whatis.exe,
another calculator, from the same source as Epicalc
3. Dataplot: A great big stats program from the US
National Institute of Standards. Does everything, if you can work out how. 77
Meg! Has a big manual, and it is needed. www.itl.nist.gov/div898/software/dataplot.html
4. Modstat,
a small, DOS-based stats program, which has impressed me very much. Designed by
a statistician teaching elementary classes; it has good guides and does most
things. Not quite free, at USD 22.00, from RCKnodt@aol.com.
Old, but was still there last I looked.
6. Simple
Interactive Statistical Analysis. A site which has a good sample size
calculator, as well as a whole host of other statistical calculators. http://home.clara.net/sisa/samsize.htm
8. http://www.randomizer.org
– a random number generator on the web, which will give you all sorts of
randomized groupings for more complex research formats (Latin sq., crossovers,
multiple groups etc.).
9. XLStatistics
is a large suite of macro-driven worksheets, headed by a good index page which
chooses tests on the basis of the type of data to be analysed. It is free for
students and educators from http://www.deakin.edu.au/~rodneyc/xlstats.htm.
It includes most common non-parametric tests and is altogether a sophisticated
package, which I recommend. Great value.
Appendix 3.
Some Mathematics That
You Don’t Need to Know off By Heart.
SD (standard deviation) of a population, where m is
the population mean:
SD = √( Σ(x – m)^{2} / N )
[3]
Line Plots. Reese AR. Significance, June 2006, 85-87.
[4] A
useful guide to the Excel Data Analysis package, and for statistics in Excel is
“Data Analysis Using Excel” by Michael R. Middleton, Duxbury, CA. ISBN
053435968-X. There are several other similar books.
[5] I
wish this were original, but I read this somewhere.
[7]
Called Gaussian, because the German mathematician, Carl Friedrich Gauss, did not
describe it. It may have been first described by Abraham de Moivre, Gauss’
senior by about 100 years.
[8]
Kurtosis and skewness are historical statistics. Pearson, about 100 years ago,
said that any probability distribution can be described by 4 characteristics:
The mean, the variance, skewness and kurtosis. This is probably not universally
true and of no great importance to you or me, but I thought I had better
explain the terms as you are bound to meet them.
[9]
We are making a rough calculation here, because the chance of the first card
being an ace is 4/47, but the second is 3/46 if the first was an ace, and 4/46
if it was not; and if the first 2 cards did not contain an ace, your chances
have become 0. We are also neglecting the small probability of 4 aces.
[10]
The quick and easy way to do these calculations is to use Excel’s =BINOMDIST()
function. Look up the syntax in Excel Help.
[11]
Bland JM, Altman DG. The odds ratio. British Medical Journal 2000; 320:1468.
[13]
Bias is defined as any process, at any stage of inference which tends to
produce results or conclusions that differ systematically from the truth. D.L.
Sackett, in “Bias in Analytic Research”, J Chron Dis 1979; 32:51-63, identifies
35 sources of bias in sampling and measurement.
[15]
This section is based on: Validating scales and indexes. Bland JM, Altman DG.
Brit Med J 324:606; 2002.
[16]
This is one of my hobby horses: fortuitous means ‘by chance’, not ‘by good
fortune’. Sports commentators please learn this! I know... it is evolving to
mean “by good luck” but must we encourage bad usage?