# 1 One-Variable Statistics: Basics

# Terminology: Individuals/Population/Variables/Samples

Oddly enough, it is often a lack of clarity about *who* [or *what*] *you are looking at* which makes a lie out of statistics. Here are the terms, then, to keep straight:

Note that while the term “individuals” sounds like it is talking about people, the individuals in a study could be things, even abstract things like events.

Examples

In a study of an election, the individuals would be *the voters*. But if you are going to make an accurate prediction of who will win the election, it is important to be more precise about what exactly is the population of all of those individuals [voters] that you intend to study, be it *all eligible voters*, *all registered voters*, *the people who actually voted*, *etc*.

Examples

In a study of a particular coin, the individuals would be *flips of that coin*, and the population might be something like *all the flips ever done in the past and all that will ever be done in the future*. These individuals are quite abstract, and in fact it is impossible ever to get your hands on all of them (the ones in the future, for example).

Examples

Suppose we’re interested in studying whether doing more homework helps students do better in their studies. So shouldn’t the individuals be the students? Well, which students? How about we look only at college students. Which college students? OK, how about students at 4-year colleges and universities in the United States, over the last five years – after all, things might be different in other countries and other historical periods.

Wait, a particular student might sometimes do a lot of homework and sometimes do very little. And what exactly does “do better in their studies” mean? So maybe we should look at each student in each class they take, then we can look at the homework they did for that class and the success they had in it.

Therefore, the individuals in this study would be *individual experiences that students in US 4-year colleges and universities had in the last five years*, and the population of the study would essentially be the collection of all the names on all class rosters of courses in the last five years at all US 4-year colleges and universities.

When doing an actual scientific study, we are usually not interested so much in the individuals themselves, but rather in *variables*.

A **variable** in a statistical study is the answer to a question the researcher is asking about each individual. There are two types:

- A **categorical variable** is one whose values have a finite number of possibilities.
- A **quantitative variable** is one whose values are numbers (so, potentially an infinite number of possibilities).

The variable is something which (as the name says) *varies*, in the sense that it can have a different value for each individual in the population (although that is not necessary).

Examples

In the election example above, the variable would most likely be *who they voted for*, a categorical variable whose only possible values are “Mickey Mouse” and “Daffy Duck” (or whoever the names on the ballot were).

Examples

In another example above, the variable most likely would be *what face of the coin was facing up after the flip*, a categorical variable with values “heads” and “tails.”

Examples

There are several variables we might use in another example above. One might be *how many homework problems did the student do in that course.* Another could be *how many hours total did the student spend doing homework over that whole semester, for that course*. Both of those would be quantitative variables.

A categorical variable for the same population would be *what letter grade did the student get in the course*, which has possible values **A**, **A-**, **B+**, …, **D-**, **F**.

In many [most?] interesting studies, the population is too large for it to be practical to go observe the values of some interesting variable for every individual. Sometimes it is not just impractical, but actually impossible: think of the example we gave of all the flips of the coin, even the ones in the future. So instead, we often work with a sample.

A **sample** is a subset of a population under study.

Often we use the symbol [latex]N[/latex] to indicate the size of a whole population and the symbol [latex]n[/latex] for the size of a sample; as we have said, usually [latex]n < N[/latex].

Later we shall discuss how to pick a good sample, and how much we can learn about a population from looking at the values of a variable of interest only for the individuals in a sample. For the rest of this chapter, however, let's just consider what to do with these sample values.
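To make the notation concrete, here is a minimal Python sketch of a sample as a subset of a population. The population of ID numbers, the sizes, and the use of a simple random draw are all assumptions made purely for illustration; how to choose a *good* sample is a separate question.

```python
import random

# Hypothetical population: ID numbers standing in for N = 1000 individuals.
population = list(range(1000))
N = len(population)

# Draw n = 25 distinct individuals. The fixed seed only makes the sketch repeatable.
rng = random.Random(0)
sample = rng.sample(population, 25)
n = len(sample)

print(f"N = {N}, n = {n}")
```

Every member of `sample` is also a member of `population`, and `n` is much smaller than `N`, which is the usual situation.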

# Visual Representation of Data, I: Categorical Variables

Suppose we have a population and a variable in which we are interested. We get a sample, which could be large or small, and look at the values of our variable for the individuals in that sample. We shall informally refer to this collection of values as a *dataset*.

In this section, we suppose also that the variable we are looking at is categorical. Then we can summarize the dataset by telling

- which categorical values we saw for the individuals in the sample, and
- how often we saw each of those values.

There are two ways we can make pictures of this information: *bar charts* and *pie charts*.

## Bar Charts I: Frequency Charts

We can place the values which we saw for individuals in the sample along the $x$-axis of a graph, and over each such label make a box whose height indicates how many individuals had that value -- the **frequency** of occurrence of that value.

This is called a **bar chart.** As with all graphs, you should *always label all axes*. The $x$-axis will be labeled with some description of the variable in question, the $y$-axis label will always be "frequency" (or some synonym like "count" or "number of times").

Examples

In an example above, suppose we took a sample consisting of the next 10 flips of our coin. Suppose further that 4 of the flips came up heads -- write it as "H" -- and 6 came up tails, "T". Then the corresponding bar chart would have a bar of height 4 over "H" and a bar of height 6 over "T".
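The frequencies for such a sample can be tallied with a few lines of Python. This is only a sketch: the particular order of the flips is made up, and the text rendering is a stand-in for a real chart.

```python
from collections import Counter

# The 10 flips from the example: 4 heads and 6 tails (the order is made up).
flips = ["H", "T", "T", "H", "T", "T", "H", "T", "H", "T"]

# Frequency of each categorical value.
frequency = Counter(flips)

# One row per value, one '#' per occurrence: a bar chart turned on its side.
for value in sorted(frequency):
    print(f"{value} | {'#' * frequency[value]} ({frequency[value]})")
```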

## Bar Charts II: Relative Frequency Charts

There is a variant of the above kind of bar chart which actually looks nearly the same but changes the labels on the $y$-axis. That is, instead of making the height of each bar be how many times each categorical value occurred, we could make it be *what fraction of the sample had that categorical value* -- the **relative frequency**. This fraction is often displayed as a percentage.
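As a small sketch, relative frequencies are just the counts divided by the sample size. The data here are the 4 heads and 6 tails of the coin-flip example; the variable names are illustrative.

```python
from collections import Counter

# 4 heads and 6 tails, as in the coin-flip example.
flips = ["H"] * 4 + ["T"] * 6

counts = Counter(flips)
total = sum(counts.values())

# Relative frequency = frequency divided by the sample size.
relative_frequency = {value: count / total for value, count in counts.items()}

for value, fraction in relative_frequency.items():
    print(f"{value}: {fraction:.0%}")
```

The relative frequencies always add up to 1 (that is, 100%), whatever the sample size.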

## Bar Charts III: Cautions

Notice that with bar charts (of either frequency or relative frequency) the variable values along the $x$-axis *can appear in any order whatsoever*. This means that any conclusion you draw from looking at the bar chart must not depend upon that order. For example, it would be foolish to say that the bar chart for the coin flips in the example above "shows an increasing trend," since it would make just as much sense to put the bars in the other order and then "show a decreasing trend" -- both are meaningless.

## Pie Charts

Another way to make a picture with categorical data is to use the fractions from a relative frequency bar chart, but not for the heights of bars, instead for the sizes of wedges of a pie.
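One way to turn those fractions into wedge sizes is to give each value its share of the full 360 degrees. The sketch below assumes the coin-flip counts from the earlier example.

```python
# Counts from the coin-flip example: 4 heads, 6 tails.
counts = {"H": 4, "T": 6}
total = sum(counts.values())

# Each wedge gets the same fraction of 360 degrees as its relative frequency.
wedge_degrees = {value: 360 * count / total for value, count in counts.items()}

for value, degrees in wedge_degrees.items():
    print(f"{value}: {degrees} degrees")
```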

# Visual Representation of Data, II: Quantitative Variables

Now suppose we have a population and a *quantitative* variable in which we are interested. We get a sample, which could be large or small, and look at the values of our variable for the individuals in that sample. There are two ways we tend to make pictures of datasets like this: *stem-and-leaf plots* and *histograms*.

## Stem-and-leaf Plots

One somewhat old-fashioned way to handle a modest amount of quantitative data produces something between simply a list of all the data values and a graph. It's not a bad technique to know about in case one has to write down a dataset by hand, but it is very tedious -- and quite unnecessary, if one uses modern electronic tools instead -- if the dataset has more than a couple dozen values.

The easiest case of this technique is where the data are all whole numbers in the range $0$ to $99$. In that case, one can take off the tens place of each number -- call it the **stem** -- and put it on the left side of a vertical bar, and then line up all the ones places -- each is a **leaf** -- to the right of that stem. The whole thing is called a **stem-and-leaf plot** or, sometimes, just a **stemplot**.
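The construction can be sketched in Python for whole numbers between 0 and 99. The data values are made up, and printing stems that have no leaves is a layout choice of this sketch, not something the text prescribes.

```python
from collections import defaultdict

# Made-up whole-number data in the range 0 to 99.
data = [42, 17, 35, 48, 41, 9, 35, 23, 50, 12]

# Group the ones digits (leaves) by their tens digit (stem), in increasing order.
rows = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)  # tens place, ones place
    rows[stem].append(leaf)

# Print every stem from smallest to largest, including stems with no leaves.
lines = []
for stem in range(min(rows), max(rows) + 1):
    leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
    lines.append(f"{stem} | {leaves}".rstrip())

print("\n".join(lines))
```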

## [Frequency] Histograms

The most important visual representation of quantitative data is a **histogram**. A histogram actually looks a lot like a stem-and-leaf plot, except turned on its side and with each row of numbers turned into a vertical bar, as in a bar chart. The height of each of these bars would be how many data values are in the corresponding row.

## [Relative Frequency] Histograms

Just as we could have bar charts with absolute or relative frequencies, we can have histograms of either kind.
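As a rough sketch of what goes into either kind of histogram, the code below sorts made-up data into bins and reports both the count and the fraction of the sample in each bin. The bin width of 10 and the convention that a bin includes its left edge but not its right edge are assumptions of this sketch.

```python
# Made-up quantitative data.
data = [42, 17, 35, 48, 41, 9, 35, 23, 50, 12]
bin_width = 10  # assumed bin width

# Count how many values fall in each bin [left_edge, left_edge + bin_width).
counts = {}
for value in data:
    left_edge = (value // bin_width) * bin_width
    counts[left_edge] = counts.get(left_edge, 0) + 1

n = len(data)
for left_edge in sorted(counts):
    count = counts[left_edge]
    # Frequency, then relative frequency as a percentage.
    print(f"[{left_edge}, {left_edge + bin_width}): {count} ({count / n:.0%})")
```

A frequency histogram would use `count` as the bar height; a relative frequency histogram would use `count / n`. The shapes are identical and only the $y$-axis labels differ.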

## How to Talk About Histograms

Histograms of course tell us what the data values are -- the location of a bar along the $x$-axis is the value of the variable -- and how many of them have each particular value -- the height of the bar tells how many data values are in that bin. This is also given a technical name

# Numerical Descriptions of Data, I: Measures of the Center

Oddly enough, there are several measures of *central tendency*, as ways to define the middle of a dataset are called. There is different work to be done to calculate each of them, and they have different uses, strengths, and weaknesses.

## Mode

Let's first discuss probably the simplest measure of central tendency, and in fact one which was foreshadowed by terms like "unimodal."
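Taking the mode to be the value (or values) occurring most often in the dataset, a minimal Python sketch might look like this. The function name `modes` and the sample data are illustrative only.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([3, 5, 5, 7, 9]))  # one mode
print(modes([1, 1, 2, 2, 3]))  # two values tie for most frequent
```

Returning a list rather than a single number is a design choice here: it keeps ties visible instead of silently picking one of the tied values.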

## Mean

The next measure of central tendency, and certainly the one heard most often in the press, is simply the average. However, in statistics, this is given a different name.
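In code, the everyday average is the sum of the data values divided by how many there are. The sketch below uses made-up scores.

```python
def mean(values):
    """Add up all the data values and divide by how many there are."""
    return sum(values) / len(values)


scores = [70, 80, 85, 90, 100]  # made-up data
print(mean(scores))  # 85.0
```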

## Median

Our third measure of central tendency is not the result of arithmetic, but instead of putting the data values in increasing order.
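A sketch of that idea in Python: sort the values and take the one in the middle. For a dataset with an even number of values there is no single middle value; this sketch assumes the common convention of averaging the two middle values.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    # Assumed convention for even n: average the two central values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([7, 1, 5]))
print(median([7, 1, 5, 3]))
```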

## Strengths and Weaknesses of These Measures of Central Tendency

The weakest of the three measures above is the mode. Yes, it is nice to know which value happened most often in a dataset (or which values all happened equally often and more often than all other values). But this often does not tell us much about the overall structure of the data.

# Numerical Descriptions of Data, II: Measures of Spread

## Range

The simplest -- and least useful -- measure of the spread of some data is literally how much space on the $x$-axis the histogram takes up. To define this, first a bit of convenient notation:

## Quartiles and the $IQR$

Let's try to find a substitute for the range which is not so sensitive to outliers. We want to see not how far apart the maximum and minimum of the whole dataset are, but instead how far apart the typical larger values and the typical smaller values in the dataset are. How can we measure these typical larger and smaller values? One way is to define them in terms of the typical -- central -- value of the upper half of the data and the typical value of the lower half of the data. Here is the definition we shall use for that concept:
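The idea of taking the central value of each half can be sketched as follows. One detail is an assumption of this sketch rather than something stated above: when the number of data values is odd, the overall median is left out of both halves. The names `quartiles`, `q1`, and `q3` are illustrative.

```python
from statistics import median


def quartiles(values):
    """Return (q1, q3): the medians of the lower and upper halves of the data.

    Assumption: for an odd number of values, the overall median belongs to
    neither half. Needs at least two data values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(lower), median(upper)


data = [1, 3, 4, 6, 8, 9, 12]  # made-up data
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)
```

Other textbooks and software packages use slightly different conventions for the halves, so quartile values can differ a little between tools.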

## Variance and Standard Deviation

We've seen a crude measure of spread, the range, which is like the crude measure "mode" of central tendency. We've also seen a better measure of spread, the $IQR$, which, like the median, is insensitive to outliers (and is built out of medians). It seems that, to fill out the parallel triple of measures, there should be a measure of spread which is similar to the mean. Let's try to build one.
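One common way to build such a measure is to average the squared distances of the data values from the mean, and then take a square root to get back to the original units. The sketch below is written under an explicit assumption about the divisor: with `sample=True` it divides by $n-1$ (the usual choice for a sample), and with `sample=False` it divides by $n$. Which convention applies in a given setting should be checked against the definitions in use.

```python
import math


def variance(values, sample=True):
    """Average squared deviation from the mean.

    Assumption: divide by n - 1 for a sample, by n for a whole population.
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor


def standard_deviation(values, sample=True):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5
print(variance(data, sample=False), standard_deviation(data, sample=False))
print(variance(data), standard_deviation(data))
```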

## Strengths and Weaknesses of These Measures of Spread

We have already said that **the range is extremely sensitive to outliers**.
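A quick numerical illustration, taking the range to be the largest value minus the smallest: adding a single atypical value changes the range drastically. The data are made up.

```python
def value_range(values):
    """Largest data value minus smallest data value."""
    return max(values) - min(values)


typical = [61, 64, 65, 67, 70]   # made-up data
with_outlier = typical + [250]   # the same data plus one atypical value

print(value_range(typical))
print(value_range(with_outlier))
```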

## A Formal Definition of Outliers -- the $1.5\,IQR$ Rule

So far, we have said that outliers are simply data that are *atypical*. We need a precise definition that can be carefully checked. What we will use is a formula (well, actually two formulæ) that describes the idea of an outlier being *far away from the rest of the data*.
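As usually stated, the $1.5\,IQR$ rule calls a value an outlier if it lies more than $1.5$ times the $IQR$ below the first quartile or above the third quartile. The sketch below assumes that form of the rule, and also assumes quartiles computed as medians of the lower and upper halves (with the overall median left out of both halves when the count is odd).

```python
from statistics import median


def outliers(values):
    """Values below q1 - 1.5*IQR or above q3 + 1.5*IQR (needs at least two values)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])                  # median of the lower half
    q3 = median(ordered[len(ordered) - half:])   # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in ordered if x < low_fence or x > high_fence]


print(outliers([1, 3, 4, 6, 8, 9, 30]))
print(outliers([1, 3, 4, 6, 8, 9, 12]))
```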

## The Five-Number Summary and Boxplots

We have seen that numerical summaries of quantitative data can be very useful for quickly understanding (some things about) the data. It is therefore convenient for a nice package of several of these
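Assuming the five-number summary in the heading is the usual one (minimum, first quartile, median, third quartile, maximum), it can be sketched as below. The quartile convention (medians of the lower and upper halves, excluding the overall median when the count is odd) is again an assumption of the sketch.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, q1, median, q3, maximum); needs at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]


print(five_number_summary([1, 3, 4, 6, 8, 9, 12]))
```

These five numbers are exactly what a boxplot draws: the box runs from the first to the third quartile, with a line at the median.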

# Exercises


A product development manager at the campus bookstore wants to make sure that the backpacks being sold there are strong enough to carry the heavy books students carry around campus. The manager decides she will collect some data on how heavy are the bags/packs/suitcases students are carrying around at the moment, by stopping the next 100 people she meets at the center of campus and measuring.

What are the individuals in this study? What is the population? Is there a sample -- what is it? What is the variable? What kind of variable is this?