Bias is often impossible to avoid in practice and must be taken into account when statistical calculations are performed.
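Example: To estimate the number, $N$, of nesting birds, scientists catch 100, tag them and release them. The fraction of birds with tags is $\frac{100}{N}$. Later, they catch another 100 and count the number with tags. Suppose $n$ of these birds have tags, so the probability of a randomly picked bird from this sample having a tag is $\frac{n}{100}$.
Equating these two fractions gives $\frac{100}{N} = \frac{n}{100}$ so that $nN = 10000$ and $N = \frac{10000}{n}$. In fact $n$ is a random variable since it will vary between samples, so $\frac{10000}{n}$ is an estimate for $N$. We write $\hat{N} = \frac{10000}{n}$.
The expected value for $n$ is $\frac{10000}{N}$, but it is possible that $n = 0$, so that $\hat{N} = \infty$ using this estimator. In fact, $n$ is obviously at most 100, so that $\hat{N} \ge 100$ whatever the true value of $N$, and the estimator $\hat{N}$ is biased.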
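The bias of the tag and recapture estimate can be seen in a small simulation. This is only a sketch: it writes $N$ for the true number of birds and $n$ for the number of tagged birds in the second catch, and the function names, the true population of 500 and the number of trials are all assumptions made for illustration.

```python
import math
import random


def estimate_population(tagged_in_recapture, first_catch=100, second_catch=100):
    """Return first_catch * second_catch / tagged_in_recapture."""
    if tagged_in_recapture == 0:
        # No tagged birds were recaptured, so the estimate is infinite.
        return math.inf
    return first_catch * second_catch / tagged_in_recapture


def mean_estimate(true_population, trials=5000, seed=0):
    """Average the estimate over many simulated second catches."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Birds numbered 0..99 carry tags; catch 100 birds without replacement.
        recaptured = rng.sample(range(true_population), 100)
        tagged = sum(1 for bird in recaptured if bird < 100)
        total += estimate_population(tagged)
    return total / trials


print(estimate_population(20))  # 10000 / 20
print(mean_estimate(500))       # average estimate when the true number is 500
```

In runs like this the average estimate tends to come out above the true population, which is the bias described above.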
More subtle examples of bias are given by considering the mode and median as estimators for the mean.
Suppose we have 100 people. 80 of the people are labelled with a 1 and 20 are labelled with a 0 (probably signifying, like me, that their net wealth is zero).
The mode is 1 but the mean is $0.8 \times 1 + 0.2 \times 0 = 0.8$.
The bias of an estimator $\hat{\theta}$ for a parameter $\theta$ is $E(\hat{\theta}) - \theta$.
The bias of the mode as an estimator for the mean is $1 - 0.8 = 0.2$.
The 100 people are lined up in numerical order. First in line are those twenty people labelled with a zero, and then the 80 people labelled with a 1.
The median is obviously 1, but the mean is 0.8, as before.
The bias of the median as an estimate for the mean is $1 - 0.8 = 0.2$.
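The mode and median calculations for the 100 labelled people can be checked with Python's standard `statistics` module. The variable names are chosen here for illustration only.

```python
from statistics import mean, median, mode

# 20 people labelled 0 and 80 people labelled 1.
labels = [0] * 20 + [1] * 80

population_mean = mean(labels)
mode_bias = mode(labels) - population_mean      # mode minus mean
median_bias = median(labels) - population_mean  # median minus mean

print(population_mean, round(mode_bias, 3), round(median_bias, 3))
```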
If $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ drawn from any population with mean $\mu$ and variance $\sigma^2$, then the sample mean $\bar{X}$ has expected value $\mu$ and variance $\frac{\sigma^2}{n}$. If lots of samples, all of size $n$, are taken from the population, then the sample means are approximately normally distributed, $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, and the goodness of the fit improves with increasing $n$. The underlying distribution of the population is arbitrary.
The central limit theorem is used for sampling when the sample size is ‘large’, in practice over 50.
The central limit theorem can then be used to analyse the sample means, perform hypothesis testing and construct confidence intervals using the normal distribution.
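A quick simulation gives a feel for the central limit theorem. The sketch below assumes an exponential population with mean 1 and variance 1 (a strongly skewed distribution picked only for illustration), takes many samples of size 50, and compares the mean and variance of the sample means with the population mean and the population variance divided by the sample size.

```python
import random


def sample_means(sample_size, num_samples, seed=1):
    """Return the means of num_samples samples from an exponential population."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        # expovariate(1.0) has mean 1 and variance 1.
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means


means = sample_means(sample_size=50, num_samples=2000)
mean_of_means = sum(means) / len(means)
variance_of_means = sum((m - mean_of_means) ** 2 for m in means) / (len(means) - 1)

print(mean_of_means)      # should be close to 1
print(variance_of_means)  # should be close to 1 / 50 = 0.02
```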
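A sample of size 100 is taken from a population with mean 40 and variance 20. Find the probability that the sample mean is larger than 45.
$\bar{X} \sim N\left(40, \frac{20}{100}\right) = N(40, 0.2)$ and so $P(\bar{X} > 45) = P\left(Z > \frac{45 - 40}{\sqrt{0.2}}\right) = P(Z > 11.18)$, so $P(\bar{X} > 45) \approx 0$.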
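The same probability can be found with `statistics.NormalDist` from the Python standard library. This is a minimal sketch using the figures from the example (mean 40, variance 20, sample size 100); the variable names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

population_mean = 40
population_variance = 20
sample_size = 100

# By the central limit theorem the sample mean is roughly N(40, 20 / 100).
sample_mean_dist = NormalDist(
    mu=population_mean, sigma=sqrt(population_variance / sample_size)
)

z = (45 - population_mean) / sample_mean_dist.stdev
p_above_45 = 1 - sample_mean_dist.cdf(45)

print(round(z, 2))  # standardised value of 45
print(p_above_45)   # negligibly small
```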
To answer this we can find a confidence interval: if, for example, we construct a 90% confidence interval, we can say that if we take many samples and construct the interval from each one, then 90% of the time the true mean will lie in the confidence interval.
If it is known that the underlying distribution is normal, and we know the true or population standard deviation, then we can use the expression for a normal confidence interval:
$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
where $\bar{x}$ is the mean of the sample, $n$ is the sample size, and $\sigma$ is the population standard deviation.
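Suppose then we have the sample 2.3, 4.2, 5.3, 2.4, 2.6, 4.7 for lengths of French snails and we know that the lengths of snails are normally distributed with a standard deviation, $\sigma$, of 1.7.
The mean of our sample is $\bar{x} = \frac{21.5}{6} = 3.58$. We need to find the value of $z_{\alpha/2}$ corresponding to a confidence interval of 90% or 0.9. This means a rejection region of area $\frac{\alpha}{2} = 0.05$ at the upper and lower ends. From the Normal tables, for $\Phi(z_{\alpha/2}) = 0.95$, $z_{\alpha/2} = 1.645$. The confidence interval is then
$3.58 \pm 1.645 \times \frac{1.7}{\sqrt{6}} = 3.58 \pm 1.14$, that is, $(2.44, 4.72)$.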
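The normal confidence interval for the snail lengths can be reproduced as follows. The helper name `z_interval` is hypothetical; the critical value comes from `NormalDist().inv_cdf` rather than from tables, so it differs very slightly from 1.645.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_interval(sample, sigma, confidence):
    """Two-sided confidence interval for the mean with known sigma."""
    # Upper-tail critical value, e.g. about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    centre = mean(sample)
    margin = z * sigma / sqrt(len(sample))
    return centre - margin, centre + margin


lengths = [2.3, 4.2, 5.3, 2.4, 2.6, 4.7]
low, high = z_interval(lengths, sigma=1.7, confidence=0.90)
print(round(low, 2), round(high, 2))
```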
Suppose instead that we didn't know the population standard deviation, but we knew that the underlying distribution of the lengths was normal. We can find an estimate for the standard deviation, called the sample standard deviation, from the original sample. We label it $s$:
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{8.588}{5}} = 1.311$
Now though, since we estimated $\sigma$, we must use the t distribution with $(n - 1) = 5$ degrees of freedom, $t_{0.05, 5} = 2.015$. The confidence interval is
$3.58 \pm 2.015 \times \frac{1.311}{\sqrt{6}} = 3.58 \pm 1.08$, that is, $(2.51, 4.66)$.
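The t interval for the same snail sample can be checked in a few lines. The Python standard library has no t distribution, so this sketch simply takes the table value 2.015 quoted above as a constant; the variable names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

# t value for 5 degrees of freedom and 0.05 in each tail, taken from tables.
T_CRITICAL = 2.015

lengths = [2.3, 4.2, 5.3, 2.4, 2.6, 4.7]

x_bar = mean(lengths)
s = stdev(lengths)  # sample standard deviation, divides by n - 1
margin = T_CRITICAL * s / sqrt(len(lengths))
low, high = x_bar - margin, x_bar + margin

print(round(s, 3), round(low, 2), round(high, 2))
```

The interval is a little narrower than the normal one here only because the sample standard deviation happens to be smaller than 1.7; for the same standard deviation the t interval is always wider.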
We can perform a hypothesis test, suppose at the 5% level:
$H_0$: There is no difference in pass rates for men and women
$H_1$: There is a difference in pass rates for men and women
We can draw up a contingency table to show all the outcomes for all the subjects.
This is the OBSERVED table:

         Pass   Fail   Total
Men        14     43      57
Women      31     28      59
Total      45     71     116

If there were no difference between the pass rates for men and women, we would expect the number of men who pass to be, for example, equal to $\frac{45 \times 57}{116} = 22.11$.
In general, to find the expected numbers in the table, given no difference in pass rates for men and women, we find $\frac{\text{row total} \times \text{column total}}{\text{grand total}}$. We obtain the EXPECTED table:

         Pass                  Fail                  Total
Men      (45*57)/116=22.11     (71*57)/116=34.89        57
Women    (45*59)/116=22.89     (71*59)/116=36.11        59
Total    45                    71                      116

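We now find:
$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(14 - 22.11)^2}{22.11} + \frac{(43 - 34.89)^2}{34.89} + \frac{(31 - 22.89)^2}{22.89} + \frac{(28 - 36.11)^2}{36.11} \approx 9.56$
The distribution of $\chi^2$ is a $\chi^2$ distribution with $(2 - 1)(2 - 1) = 1$ degree of freedom.
From the $\chi^2$ tables, $\chi^2_{0.05, 1} = 3.841 < 9.56$, so we reject $H_0$. From the table, women have a higher pass rate.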
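The expected table and the test statistic can be computed directly from the observed counts. This is a sketch under stated assumptions: it uses the plain sum of (observed - expected)^2 / expected with no continuity correction, takes the 5% critical value for 1 degree of freedom (3.841) from tables, and the function names are made up for illustration.

```python
def expected_counts(observed):
    """Expected counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(observed):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: men, women. Columns: pass, fail.
observed = [[14, 43], [31, 28]]
CRITICAL_VALUE = 3.841  # chi-squared, 5% level, 1 degree of freedom

statistic = chi_squared(observed)
print(round(statistic, 2), statistic > CRITICAL_VALUE)
```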
A simple example is shown in calculating the mean of a sample. Suppose $x_1, x_2, \ldots, x_n$ is a sample from a population.
Let $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
The residuals $x_i - \bar{x}$ are estimates of the errors $x_i - \mu$, where $\mu$ is the true mean of the population.
The sum of the residuals (unlike the sum of the errors) is necessarily 0. If we know the values of any $n - 1$ of the residuals, we can find the last residual, so that even though there are $n$ residuals, only $n - 1$ of these are free to vary. For this reason we say that the residuals of a sample of size $n$ have $n - 1$ degrees of freedom.
For the same reason, the sample standard deviation $s$ includes an expression $n - 1$ in the denominator.
We can write $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.
$S_{xx}$ is the sum of the squares of the residuals and the residuals only have $n - 1$ degrees of freedom, so that $s^2 = \frac{S_{xx}}{n - 1}$.
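A short check makes the lost degree of freedom concrete. The sample values below are an assumption chosen for illustration; any numbers would do.

```python
sample = [2.3, 4.2, 5.3, 2.4, 2.6, 4.7]
n = len(sample)

x_bar = sum(sample) / n
residuals = [x - x_bar for x in sample]

# The residuals sum to zero, so the last one is fixed by the other n - 1.
recovered_last = -sum(residuals[:-1])

# Sum of squared residuals divided by the n - 1 degrees of freedom.
sample_variance = sum(r ** 2 for r in residuals) / (n - 1)

print(round(sum(residuals), 10))
print(round(recovered_last, 4), round(residuals[-1], 4))
print(round(sample_variance, 4))
```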
The bias of an estimator for a statistical parameter is the difference between the expected value of the estimator and the true value of the parameter. If the estimator for a population parameter $\theta$ is $\hat{\theta}$ then
$\text{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$
We are usually interested in selecting the estimator with the smallest bias, since it will be closer to the actual value of the population parameter more often on average than any other estimator.
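Suppose we are interested in finding an estimator for the population mean, $\mu$. We have a choice of estimators:
$\hat{\mu}_1$ [formula missing], where $X_1$ and $X_2$ are individual measurements from the population
$\hat{\mu}_2$ [formula missing], where $X_1$ and $X_2$ are individual measurements from the population
Each of these estimators is unbiased, since $E(\hat{\mu}_1) = E(\hat{\mu}_2) = \mu$.
If the variance of $X$ is $\sigma^2$ then the variance of $\hat{\mu}_1$ is [formula missing]
and the variance of $\hat{\mu}_2$ is [formula missing]
Of these estimators for $\mu$, the one with the smaller variance is the best estimator.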
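To illustrate choosing between unbiased estimators by variance, consider two hypothetical estimators built from two independent measurements: the plain average, with weights 1/2 and 1/2, and a lopsided average, with weights 1/3 and 2/3. These weights are assumptions made here for illustration and are not necessarily the estimators intended above. For independent measurements with common variance $\sigma^2$, a weighted sum has variance $\sigma^2$ times the sum of the squared weights.

```python
from fractions import Fraction


def is_unbiased(weights):
    """A weighted sum of measurements is unbiased for the mean if the weights sum to 1."""
    return sum(weights) == 1


def variance_factor(weights):
    """Variance of the weighted sum as a multiple of sigma^2 (independent measurements)."""
    return sum(w * w for w in weights)


equal_weights = [Fraction(1, 2), Fraction(1, 2)]    # (X1 + X2) / 2
unequal_weights = [Fraction(1, 3), Fraction(2, 3)]  # (X1 + 2 * X2) / 3

for weights in (equal_weights, unequal_weights):
    print(is_unbiased(weights), variance_factor(weights))
```

Both are unbiased, but the equal weights give variance $\frac{1}{2}\sigma^2$ against $\frac{5}{9}\sigma^2$, so under these assumptions the plain average would be preferred.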
To estimate these values we take a sample $x_1, x_2, \ldots, x_n$ of size $n$.
An estimate for the population mean can be quickly found from the mean of the sample.
The sample mean is a good estimator for the population mean. It is unbiased, so that $E(\bar{x}) = \mu$, and of all unbiased estimators for the mean that are linear in the sample values, the sample mean has the smallest variance.
Using the sample variance $\frac{1}{n} \sum (x_i - \bar{x})^2$ as an estimator for the population variance is not such a good idea. The sample variance is biased, so $E\left(\frac{1}{n} \sum (x_i - \bar{x})^2\right) = \frac{n - 1}{n} \sigma^2 \ne \sigma^2$.
Instead we use the estimator $s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$, which is not biased.
For the sample
an estimator for the mean is
and an estimator for the variance is
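The bias of the divide-by-$n$ variance can be seen by simulation. The sketch below assumes a normal population with variance 1 and samples of size 5, both chosen only for illustration, and averages the two variance estimates over many samples.

```python
import random


def average_variance_estimates(sample_size=5, trials=20000, seed=2):
    """Average the divide-by-n and divide-by-(n - 1) variance estimates."""
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        x_bar = sum(sample) / sample_size
        sum_of_squares = sum((x - x_bar) ** 2 for x in sample)
        total_n += sum_of_squares / sample_size
        total_n_minus_1 += sum_of_squares / (sample_size - 1)
    return total_n / trials, total_n_minus_1 / trials


biased, unbiased = average_variance_estimates()
print(biased)    # should be near (5 - 1) / 5 = 0.8
print(unbiased)  # should be near the true variance, 1
```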
There exists a similar formula for the variance. The pooled sample variance $s_p^2$ of $k$ surveys, each of size $n_i$ and sample variance $s_i^2$, is more complicated:
$s_p^2 = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{\sum_{i=1}^{k} (n_i - 1)}$.
The standard deviation is the square root of this.
The pooled standard deviation is most useful when used in the two sample t test, when conducting hypothesis tests for the means of two samples.
In this case
$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$
The pooled variance is the same as if the two samples had been pooled and the pooled data used to find the variance, with each value measured from its own sample mean, with the difference that the denominator in the root is not $n_1 + n_2 - 1$ but $n_1 + n_2 - 2$. This is because one degree of freedom is lost for each sample.
Proof:
Example: Find the pooled standard deviation of two samples, one of size 40 with sample standard deviation 7 and one of size 55 with sample standard deviation 12.
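One way to work the example is sketched below, reading the second sample as having size 55 and sample standard deviation 12. The function name `pooled_sd` is hypothetical; it weights each sample variance by its degrees of freedom, sample size minus 1, and divides by the total degrees of freedom.

```python
from math import sqrt


def pooled_sd(sizes, standard_deviations):
    """Pooled standard deviation of several samples."""
    numerator = sum((n - 1) * s ** 2 for n, s in zip(sizes, standard_deviations))
    denominator = sum(n - 1 for n in sizes)
    return sqrt(numerator / denominator)


# (39 * 49 + 54 * 144) / 93 = 9687 / 93, then take the square root.
print(round(pooled_sd([40, 55], [7, 12]), 2))
```

The result, about 10.21, lies between the two sample standard deviations and closer to 12, since the larger sample carries more weight.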
Collecting primary data can be time consuming and expensive, but it has several important advantages.
The collection method can be tailored to the purpose.
The accuracy of the data is easier to assess.
Primary data often needs to be collected by companies deciding on possible demand for a new product, or the location of a new store.
Secondary data is data not collected by or on behalf of the person who is going to use it. Often these days, data can be looked up on the internet. The person or organisation using the data has no direct connection with the gathering of the data.
A lot of government data is freely published and widely available. This includes census surveys and economic surveys. Public opinion polls also tend to be made freely available. Secondary data is especially convenient if it comes as a spreadsheet, because it is then easy to use and analyse.
Secondary data is often published regularly, allowing trends to be analysed.
Secondary data is cheap – often free – and widely available, but has several disadvantages.
The collection method may not be known and the accuracy of the data can be hard to assess. It can be especially hard to account for bias.
It may come in a hard to handle form. If a survey is in the form of worded answers, the data is not easy to manipulate.
The procedure for taking a sample is:
1. Decide on the categories into which a population can be divided
2. Decide on the number to be sampled in each category
3. Collect the data from individuals in each category until the required number of individuals from each category has been sampled
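The three steps above can be sketched in code. Everything here is hypothetical: the function name `quota_sample`, the age categories, the quotas and the list of passers-by are invented purely to show the mechanics.

```python
def quota_sample(individuals, category_of, quotas):
    """Take individuals in arrival order until every category's quota is filled."""
    selected = {category: [] for category in quotas}
    for person in individuals:
        category = category_of(person)
        # Accept the person only if their category still has room.
        if category in selected and len(selected[category]) < quotas[category]:
            selected[category].append(person)
        # Stop as soon as all quotas are met.
        if all(len(selected[c]) == quotas[c] for c in quotas):
            break
    return selected


# Hypothetical stream of (name, category) pairs met by the interviewer.
passers_by = [
    ("Ann", "under 30"),
    ("Bob", "30 and over"),
    ("Cat", "under 30"),
    ("Dan", "under 30"),
    ("Eve", "30 and over"),
]
quotas = {"under 30": 2, "30 and over": 1}

sample = quota_sample(passers_by, lambda person: person[1], quotas)
print(sample)
```

Note that anyone arriving after a quota is full is simply skipped, which is one reason the resulting sample is not random.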
Quota sampling is used when it is not possible to use random methods, or when a random sample would return useless data – would you collect data from men on which brand of washing powder they use?
Advantages
1. can be carried out with small sample sizes
2. costs are minimized
3. quick and easy to conduct
Disadvantages
1. since the sample is not random, it is difficult to estimate the error
2. individuals may not be put into the correct quota
3. non-responses are not recorded
4. bias may be introduced