WHAT IS STATISTICS?

o The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling

o The subject of statistics can be divided into descriptive statistics (describing data) and inferential statistics (drawing conclusions from data). (Source: dictionary.com)

WHY SHOULD WE STUDY STATISTICS?

Descriptive Statistics: To describe a phenomenon

o Summary and presentation of data

Inferential Statistics: To draw conclusions

o Making statements or predictions about the population based on information from a sample

POPULATION & SAMPLE

POPULATION: the group of all objects or individuals of interest.

o All York Students

o Canadians

SAMPLE: a subset of the population.

o 40 York students chosen at random

o People interviewed for the latest election poll

o We refer to the individual components of a sample as "observations"

PARAMETERS AND STATISTICS

Very generally we can say that:

o Populations are described by PARAMETERS

o Samples are described by STATISTICS

For example:

Parameter: the average hair length of all domestic cats (reflects the true value for the population)

Statistic: the average hair length of cats in my sample (it's an estimate)

Statistical inference: the process of drawing a conclusion about the population based on the sample (with a stated level of confidence or significance)
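The parameter/statistic distinction can be sketched in a few lines of Python. This is only an illustration: the population below is simulated (10,000 hypothetical cats with made-up hair lengths), and the names `population`, `parameter` and `statistic` are ours, not part of the course material.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: hair length (cm) of 10,000 cats. Simulated, not real data.
population = [rng.gauss(3.0, 0.8) for _ in range(10_000)]

# Parameter: computed from the whole population (usually unknown in practice)
parameter = statistics.mean(population)

# Statistic: computed from a sample of 40 cats; it estimates the parameter
sample = rng.sample(population, 40)
statistic = statistics.mean(sample)

print(f"parameter = {parameter:.3f}, statistic = {statistic:.3f}")
```

The two numbers will be close but not equal; a different sample would give a different statistic, while the parameter stays fixed.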

FINAL DEFINITIONS

A variable is a characteristic of a population or sample.

o student grades, height, income, etc.

Variables have values

o student marks (0..100)

Data are the observed values of a variable.

o student marks: {67, 74, 71, 83, 93, 55, 48}
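As a small illustration of descriptive statistics, the student marks listed above can be summarized with Python's standard `statistics` module. The choice of summaries (mean, median, sample standard deviation) is ours; the slides do not prescribe any.

```python
import statistics

# Observed values of the variable "student marks" (the data listed above)
marks = [67, 74, 71, 83, 93, 55, 48]

mean_mark = statistics.mean(marks)      # arithmetic average
median_mark = statistics.median(marks)  # middle value once the data are sorted
spread = statistics.stdev(marks)        # sample standard deviation

print(f"n = {len(marks)}, mean = {mean_mark:.2f}, median = {median_mark}, stdev = {spread:.2f}")
```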

OBTAINING THE DATA

We have a phenomenon of interest and we would like to collect data to study it further

o We can directly collect the data: this is called PRIMARY DATA.

o We can use data collected by others (e.g. Statistics Canada; market research companies; etc.): this is called SECONDARY DATA

HOW DO WE COLLECT PRIMARY DATA?

1. By observation

2. By experiment

3. By survey

These methods differ mainly in how much control the researcher exercises and, as a result, in the strength of the inferences that can be made

DECISIONS INVOLVED IN SAMPLING

Sample Population

o From which population do we sample?

o Why is this important? What do we have to consider?

Sample Size

o How large should the sample be?

Sampling Method

o How should we pick the sample out of the population?

SAMPLE SIZE DEPENDS ON

o The size of the population

The sample size will INCREASE with the population size (the effect is small once the population is much larger than the sample)

o The variation in the population

The sample size will INCREASE with the variation

o The amount of error that can be tolerated

The sample size will DECREASE as the tolerated error grows

o The amount of resources available

The sample size will INCREASE with resources
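The first three relationships can be checked with a standard textbook formula for estimating a mean: n0 = (z * sigma / E)^2, followed by the finite population correction n = n0 / (1 + (n0 - 1) / N). The slides do not give a formula, so treat this as a sketch under that assumption; the function name `required_sample_size` and its parameters are hypothetical, and resources are not modelled (in practice they act as a cap on n).

```python
import math

def required_sample_size(sigma, margin, z=1.96, population_size=None):
    """Sample size needed to estimate a mean to within +/- margin.

    sigma: population standard deviation (the variation)
    margin: tolerated error E
    z: critical value (1.96 corresponds to roughly 95% confidence)
    population_size: if given, apply the finite population correction
    """
    n0 = (z * sigma / margin) ** 2
    if population_size is not None:
        n0 = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n0)

print(required_sample_size(sigma=10, margin=2))                             # baseline
print(required_sample_size(sigma=20, margin=2))                             # more variation
print(required_sample_size(sigma=10, margin=4))                             # more tolerated error
print(required_sample_size(sigma=10, margin=2, population_size=1_000))      # small population
print(required_sample_size(sigma=10, margin=2, population_size=1_000_000))  # large population
```

Doubling the variation roughly quadruples the required sample, while moving from a population of 1,000 to 1,000,000 changes it only modestly.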

HOW TO CREATE THE SAMPLE

There are several statistical sampling methods you can use:

1. Simple Random Sample

2. Stratified Random Sample

3. Cluster Sample

SIMPLE RANDOM SAMPLE (SRS)

Each subject is equally likely to be chosen (more precisely, every possible sample of the chosen size is equally likely)

o Like raffles, drawing from a hat, etc.

o Subject choice is determined by random numbers
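A minimal sketch of a simple random sample, using `random.sample` from the Python standard library to draw without replacement. The list of 500 student IDs is a made-up sampling frame for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the draw is repeatable

# Hypothetical sampling frame: 500 student IDs
population = [f"student_{i}" for i in range(1, 501)]

# Draw 40 students without replacement; every student has the same chance
sample = rng.sample(population, 40)

print(sample[:5])
```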

STRATIFIED RANDOM SAMPLE

The population is divided into mutually exclusive subgroups called strata

o e.g. age, gender, home type

Within strata, the sampling is random (simple)

Advantages: Ensures the sample has the same structure as the population (when each stratum is sampled in proportion to its size)

Inferences can also be made about the subcategories
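Stratified sampling with proportional allocation could look like the sketch below. The helper `stratified_sample`, the two strata, and the 10% sampling fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Take a simple random sample of the same fraction from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # group units by stratum
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)     # proportional allocation
        sample.extend(rng.sample(members, k))  # SRS within the stratum
    return sample

rng = random.Random(1)

# Hypothetical population: 300 undergraduates and 100 graduate students
population = [("undergrad", i) for i in range(300)] + [("grad", i) for i in range(100)]

sample = stratified_sample(population, lambda unit: unit[0], 0.10, rng)
undergrads = sum(1 for unit in sample if unit[0] == "undergrad")
grads = sum(1 for unit in sample if unit[0] == "grad")
print(undergrads, grads)
```

The sample keeps the 3:1 ratio of the population, which is the "same structure" advantage described above.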

CLUSTER SAMPLING

The population is divided into groups, called clusters

o e.g. geographical regions, classrooms in a school

Each cluster ideally has the same characteristics as the population

We use simple random sampling to select only a few clusters

We then use either simple random or stratified sampling within each cluster
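The two stages can be sketched as follows, here with simple random sampling at both stages. The school of 20 classrooms with 25 students each, and the choice of 4 classrooms and 5 students per classroom, are invented for the example.

```python
import random

rng = random.Random(2)

# Hypothetical school: 20 classrooms (clusters) of 25 students each
clusters = {
    f"room_{c}": [f"room_{c}_student_{s}" for s in range(25)]
    for c in range(20)
}

# Stage 1: simple random sample of a few clusters
chosen_rooms = rng.sample(sorted(clusters), 4)

# Stage 2: simple random sample of students within each chosen cluster
sample = []
for room in chosen_rooms:
    sample.extend(rng.sample(clusters[room], 5))

print(chosen_rooms)
print(len(sample))
```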

SAMPLING ERRORS

A sampling error refers to the difference between the sample statistic and the population parameter

Example: survey shows 51% of students work when in fact only 50.42% work

We will learn how to deal with this error in later classes
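To see sampling error in action, the sketch below builds a hypothetical population of 10,000 students in which exactly 50.42% work (matching the example above), then surveys 400 of them. The population size and sample size are assumptions chosen for illustration.

```python
import random

rng = random.Random(7)

# Hypothetical population of 10,000 students: 1 = works, 0 = does not work
population = [1] * 5042 + [0] * 4958
parameter = sum(population) / len(population)  # true proportion, 0.5042

# Survey a simple random sample of 400 students
sample = rng.sample(population, 400)
statistic = sum(sample) / len(sample)          # sample proportion

sampling_error = statistic - parameter
print(f"parameter = {parameter:.4f}, statistic = {statistic:.4f}, error = {sampling_error:+.4f}")
```

Changing the seed changes the sample and therefore the error, even though nothing went wrong in collecting the data.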

NON-SAMPLING ERRORS

A non-sampling error refers to errors in data acquisition: inaccuracies and mistakes; less-than-truthful responses

Non-response bias: the people who respond differ systematically from those who do not (e.g. only people with a certain agenda respond to the survey)

Selection bias: the sampling procedure systematically excludes or under-represents part of the population