# The Brief: A Blog about the LSAT, Law School and Beyond

Often, you want to use data and graphs to figure out what the relation between two variables is. For example, you may wonder whether your scores improve as you spend more time studying. Or you may wonder how years of education affect one’s lifetime earnings. Scatterplots allow you to plot one variable against another in order to determine what the relationship between them is.

Here’s an example of a scatterplot:

Here, we can see how the number of coffees consumed (on the x-axis, i.e. the bottom axis) affects the number of words one writes (on the y-axis, i.e. the axis on the left-hand side). (You may wonder, sensibly enough, how one can consume non-integer quantities of coffee. We presume that means people drank a partial cup of coffee.)

(Part of) the data table corresponding to this graph looks like:

| Coffees | Words written |
| --- | --- |
| 0 | 290 |
| 1 | 360 |
| 1.8 | 520 |
| 2.6 | 560 |
| 4.4 | 470 |

In the left-hand column, we have the variable for the x-axis (namely the number of coffees one drank) and in the right-hand column, we have the variable for the y-axis (the number of words one wrote).

Now, you may be asked to interpret the graph above. So, for example, you may get something like:

__Example 1__

The above graph comes from a study of how coffee affects literary output. The researchers asked 17 people to drink as much coffee as they liked and recorded how many words they wrote in the next hour. How many people drank 1 or fewer cups of coffee?

__Example 2__

Of the people who drank two or more cups of coffee, how many wrote more than 500 words?

Sometimes, you will see a scatterplot that also has a “trend line” like so:

The trend line is an attempt to infer, from the available data, what the general pattern looks like. Generally, it will be a straight line chosen (by some algorithm) to be an optimal fit for the data.
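
To make "chosen by some algorithm" a little more concrete, here is a minimal sketch of one common choice, the least-squares line. This is an assumption on our part: the post does not say which algorithm produced the trend line in the graph. The function name `fit_trend_line` is ours, and the data are only the five rows shown in the partial table above, so the resulting line will not necessarily match the one drawn in the graph.

```python
def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (how x and y vary together) / (how much x varies on its own).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The five rows from the partial data table (not the full data set).
coffees = [0, 1, 1.8, 2.6, 4.4]
words = [290, 360, 520, 560, 470]

slope, intercept = fit_trend_line(coffees, words)
print(f"words ~ {slope:.1f} * coffees + {intercept:.1f}")
```

Once you have a slope and an intercept, reading a prediction off the trend line is just plugging a value of x into `slope * x + intercept`.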

__Example 3__

According to the trend line in our graph, approximately how many words will someone who drank 2 cups of coffee write?

Finally, scatterplots will often use time as the variable on the x-axis. This is because we are often interested in knowing how some variable (e.g. value of a share, net worth, world record for running a marathon) changes with time. We say that **time plots** are the scatterplots that use time as a variable. Here is an example:

The S&P 500 is an index of, roughly speaking, the share value of the 500 largest publicly traded companies. Here, we can see that it has grown considerably over the past 20 or so years.

__Example 4__

By (approximately) how much has the S&P 500 increased from 1/1/99 to 1/1/12?

__Example 5__

By (approximately) what percentage has the S&P 500 increased from 1/1/95 to 1/1/18?

**Univariate vs. Bivariate**

Some graphs only use one variable. Those graphs are called **univariate**. Other graphs use two variables; they are called **bivariate**.

How can a graph use only one variable? Well, consider the histogram. It tells you how many observations fall into certain brackets. So, for example:

tells you that 4 students scored between 90 and 100; 3 between 80 and 89; and so on. The raw data for this kind of graph just looks like:

| Score |
| --- |
| 92 |
| 94 |
| 93 |
| 100 |
| 82 |
| 83 |
| 85 |
| 72 |
| 62 |

Here, we just record different observations of a *single* variable. Thus, we can call such graphs **univariate**. Other examples of univariate graphs include circle graphs and bar graphs.

By contrast, scatterplots involve two variables. See, for example:

Whose data look like:

| Coffees | Words written |
| --- | --- |
| 0 | 290 |
| 1 | 360 |
| 1.8 | 520 |
| 2.6 | 560 |
| 4.4 | 470 |
| 4.2 | 492 |
| 5.3 | 455 |
| 4.9 | 462 |
| 0.8 | 310 |
| 0.1 | 270 |
| 1.3 | 280 |
| 0.5 | 255 |
| 2.1 | 540 |
| 2.4 | 510 |
| 3.2 | 572 |
| 2.9 | 580 |
| 4.7 | 400 |

Thus we see that scatterplots need two variables, one for the x-axis and another for the y-axis. That is why we call them **bivariate**. And since time plots are just a special kind of scatterplot (namely one that uses time as a variable), time plots are also bivariate.

We have already talked about what the mean, median, and mode are. But in the examples we discussed previously, we gave you the data and asked you to find the mean/median/mode. But you can also sometimes, to a limited degree, get information about the mean/median/mode just from the graph of the data. For example:

__Example 1__

Which of the following two quantities is larger?

Average number of words written

A. The average number of words written is larger

B. is larger

C. They are equal

D. It cannot be determined from the information given

Now, sometimes you will be asked to manipulate the data/graph given before estimating the mean/median/mode. So, for example:

__Example 2__

Suppose the researchers for the above graph miscounted the number of words each participant wrote. They accidentally multiplied the number of words written by each participant by 2. Fix this error in their data by dividing the observations in the above graph by 2 to get the new average. Now, which of the following two quantities is larger?

New average of words written

A. The average number of words written is larger

B. is larger

C. They are equal

D. It cannot be determined from the information given

Now, we can generally find the median from a graph as well:

__Example 3__

Find the median number of words written in the above scatterplot.

And again, we can modify the given data and find the new median:

__Example 4__

Suppose that, this time, the researchers under-counted the number of words each participant wrote. Find the median of the corrected data, multiplying the number of words written by each participant by 2.

And finally, we could try to find the mode of the above graph. It is somewhat hard to tell whether some of the points are the same on the above graph, so I will cheat and just tell you that there is no mode; every value occurs just once. But on the GRE, rest assured that if the question asks you to find the mode from the graph, the relevant points will be fairly clearly marked. Then, it is just a matter of counting up how many observations each value has (e.g. how many people wrote 500 words; wrote 550; etc.).

**Finding quartiles/percentiles**

Now, it will not always be possible to find the quartile of a graph. For example, if you are given a circle graph:

and asked to find the various quartiles, the question simply makes no sense. But of course, in a box plot:

the quartiles just correspond to where the lines of the box are.

Now, it is not really feasible to read off, say, what the 96th percentile looks like, just based off of a graph. But some important facts to keep in mind are that the highest value in your graph will be greater than or equal to the 99th percentile (since all of the observations will be less than or equal to that observation’s value). Similarly, the 1st percentile will be greater than or equal to the lowest value of the graph. Thus, knowing the maximum and minimum (which you can read off of a graph) can give you some idea of the limits of your percentiles.

Circle graphs (also called pie charts) let you see the relative amounts of different categories. For example, if you run a local grocery store and want to see where your sales are coming from (perhaps because you are considering whether to re-allocate floor space), you might look at a chart like the following:

and conclude that as most of your sales come from produce, you may want to allocate more space to new kinds of produce.

Now, when reading a circle graph, the percentages of the different sections will generally be labelled as in the above. So the circle graph above tells you that in January of 2010, 23% of all sales were from frozen foods, 10% of all sales were from pharmaceuticals, and so on.

Now, if you are given the actual value (as opposed to the proportion) of any category, you can find the value for each category. So, for example:

__Example 1__

Suppose that in January of 2010, the grocery store sold x

x = 100000.7000 worth of canned food.
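
The general recipe here (one known value plus the percentages gives you every other value) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `category_values` is ours, the 23% and 10% shares are the two mentioned in the text, and the dollar figure is made up for the example rather than read from the chart.

```python
def category_values(shares, known_category, known_value):
    """Given percentage shares and one category's value, infer the others."""
    # If a category is p% of the total, then total = value * 100 / p.
    total = known_value * 100 / shares[known_category]
    return {name: total * pct / 100 for name, pct in shares.items()}


# Shares quoted in the text for January 2010 (other categories omitted).
shares = {"frozen foods": 23, "pharmaceuticals": 10}

# Hypothetical figure: suppose pharmaceuticals brought in 1000.
values = category_values(shares, "pharmaceuticals", 1000)
print(values)
```

The key step is recovering the total first; every other category is then just its percentage of that total.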


We can also use circle graphs to determine various trends. For example, by comparing the following charts:

we can conclude that the proportion of revenue from canned foods shrank drastically from January 2010 to January 2011.

__Example 2__

From January 2010 to January 2011, which category grew the most as a proportion of total sales?

Now, if we know the actual value of some category in 2010 and the actual value of some category in 2011, we can calculate the value of each category in 2010 and 2011. Thus, we can calculate the absolute increase in revenue for any particular category as follows:

__Example 3__

Suppose in January 2010, the total amount of goods sold was 6,000. Find the absolute change from 2010 to 2011 in the amount of dairy products sold.

A box plot is so named for its iconic shape:

(They can also be laid out horizontally). The parts of the graph correspond to:

Looking at a box plot can give you a quick sense of what the distribution of the data looks like. For example:

__Example 1__

Find the 25th percentile, 50th percentile, 75th percentile, range, and median for the below boxplot:


A histogram looks a lot like a bar graph and, indeed, you can think of it as a special kind of bar graph. But instead of having just any kind of category (as a bar graph does), histograms have, as their categories, certain ranges of values. So, for example, suppose the students in your class earn the following scores on their test:

$$100, 98, 68, 88, 79, 92, 90, 85, 86$$

Now, those numbers are kind of unwieldy. So to get a sense of how many students are acing your class (getting an A) or failing your class (getting an F), you might categorize their test scores according to certain ranges. So put all the 90-100 scores together in one category, all the 80 - 89 scores in the same category, and so on. Graphing this, we would get the following:

This graph tells us that 4 students scored between 90 and 100; 3 students between 80 and 89; and so on.
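
If you want to check the counts behind a histogram like this, the grouping step is easy to sketch. The scores are the nine listed above; the two lower ranges (70 to 79 and 60 to 69) are our assumption that the pattern of the first two ranges simply continues.

```python
scores = [100, 98, 68, 88, 79, 92, 90, 85, 86]

# Score ranges, inclusive on both ends. The last two ranges are assumed
# to continue the pattern of the first two.
bins = [(90, 100), (80, 89), (70, 79), (60, 69)]

# For each range, count how many scores fall inside it.
counts = {
    f"{low}-{high}": sum(low <= s <= high for s in scores)
    for low, high in bins
}
print(counts)  # {'90-100': 4, '80-89': 3, '70-79': 1, '60-69': 1}
```

Each bar of the histogram is just one of these counts drawn as a height.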

__Example 1__

In the following histogram, approximately how many students scored a 170 or higher?


A bar graph tells you how many objects in a given category you have. For example, in the following bar graph:

Each bar tells you how many children fall into that height range. So, for example, there are 2 children between 0 and 4 feet; there are 7 between 4 feet 1 inch and 4 feet 6 inches, and so on.

__Example 1__

How many children are between 4 feet 7 inches and 6 feet? How many children are taller than 6 feet?

We can also have segmented bar graphs like the following:

These bar graphs allow us to compare different groups, in this case the group of 1 year olds against the group of 2 year olds, 3 year olds, and so on. We can thus see how their preferences shift over time.

To read the graph, the horizontal axis tells us which categories we have (here, the different ages 1 to 5), and the vertical axis tells us how many members of that category (e.g. how many 1 year-olds) chose each superhero, in accordance with the key to the right. Thus, the left-most blue bar tells us how many 1 year-olds have "Mighty Man" as their favorite superhero. The right-most yellow bar tells us how many 5 year-olds have "Valiant Vanessa" as their favorite hero, and so on.

__Example 2__

Which superhero gains the most fans as the children mature from 1 year old to 5 years old?

__Example 3__

Which superhero has the most supporters overall (i.e. across all the age groups)?


In this post, we’ll define the mean, median, and mode. Along the way, we’ll work through an example of how to find each of them. We’ll also talk about why the median responds less to outliers than the mean does, a fact which can sometimes be crucial in solving a GRE problem.

First, some definitions. The **mean** (or **average**) is found by adding up all of the values of some variable and then dividing by the number of observations. So to go back to our lemonade example:

| Date | Lemonades Sold |
| --- | --- |
| 7/2/19 | 1 |
| 7/3/19 | 2 |
| 7/4/19 | 5 |
| 7/5/19 | 2 |
| 7/6/19 | 3 |
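
As a quick check of the definition, here is the mean of the five lemonade values computed directly: add them up, then divide by how many there are. The variable names are ours.

```python
lemonades_sold = [1, 2, 5, 2, 3]

# Mean = sum of the values / number of observations.
mean = sum(lemonades_sold) / len(lemonades_sold)
print(mean)  # 2.6
```

So over these five days, the stand sold 13 lemonades in total, or 2.6 per day on average.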

Now, finding the median is a little trickier. Imagine lining up all of the values for your variable from least to greatest. Then, we find the one in the middle. So for example, if we have the data:

Then, we order it from least to greatest to get:

And then we pick the middle value, namely .

Sometimes, if there is a long string of values, it will be hard to see which value is the middle one. So take the data:

We order it from least to greatest:

$$1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 7, 8, 9$$

And then we simply cross off the numbers at either end, one pair at a time, to get:

~~1~~, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 7, 8, ~~9~~

~~1, 1,~~ 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 7, ~~8, 9~~

...

~~1, 1, 1, 1, 2, 2, 3,~~ 3, ~~3, 3, 4, 6, 7, 8, 9~~

And thus the median is 3.

Now, sometimes we will have an even number of observations. So suppose we have the data:

$$1, 2, 3, 4$$

Crossing through the numbers on the end, we get:

~~1,~~ 2, 3, ~~4~~

But now, it seems, we are stuck. For the median is a single number - not two numbers! And yet if we cross out any more, we will eliminate all of our numbers.

In these cases, we say that the median is the *average* of the middle two numbers. Thus, the median in the above case is $\frac{2 + 3}{2} = 2.5$.

So in summary, the **median** is either the middle number or, if you have two middle numbers, the average of those two middle numbers. Now, we try to find the median for our lemonade example:

| Date | Lemonades Sold |
| --- | --- |
| 7/2/19 | 1 |
| 7/3/19 | 2 |
| 7/4/19 | 5 |
| 7/5/19 | 2 |
| 7/6/19 | 3 |
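
The summary above translates almost word for word into code. The sketch below is our own small helper (Python's standard library also offers `statistics.median`); it sorts the values, then returns either the middle one or the average of the two middle ones.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value exists.
        return ordered[mid]
    # Even number of observations: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 2, 5, 2, 3]))  # 2   (sorted: 1, 2, 2, 3, 5)
print(median([1, 2, 3, 4]))     # 2.5
```

For the lemonade data, sorting gives 1, 2, 2, 3, 5, and the middle value is 2.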

Finally, the **mode** is the most straightforward of the three: it is the value that occurs most often. So, if you have the data:

Then the mode is .

Now, a set of numbers can have more than one mode, as in this example:

Here, since both and occur three times, they are both modes.

Often, to find the mode it can help to put the numbers in ascending order (so you can see how often certain values are repeated). To return to our lemonade example:

| Date | Lemonades Sold |
| --- | --- |
| 7/2/19 | 1 |
| 7/3/19 | 2 |
| 7/4/19 | 5 |
| 7/5/19 | 2 |
| 7/6/19 | 3 |
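
Counting how often each value appears is exactly what the mode asks for, so a tally does the job. The helper name `modes` is ours; it returns a list because, as noted above, a data set can have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs the maximum number of times."""
    counts = Counter(values)
    highest = max(counts.values())
    # Note: if every value occurs once, this returns all of them; some
    # conventions would instead say such data has no mode.
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 5, 2, 3]))  # [2]
```

In the lemonade data, 2 is the only value that appears twice, so it is the mode.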

Finally, one property of the median is that it is "more resilient to outliers." What does this mean? First, let's be clear on what an outlier is: an **outlier** is a data value that is very far from many of the other observations. So suppose you have some data on how many days in a month it rains:

But one month, you have constant downpours and so you get as a new observation. Your new data is:

But the is quite far from all of the other observations. If we were to graph this data on a boxplot:

We would see that sticks out like a sore thumb. That's how you know it is an outlier. (There are more precise, formal definitions of an outlier, e.g. more than 1.5 times the interquartile range outside the 1st or 3rd quartiles, but you won't need to know those kinds of definitions for the GRE).

Now, when we say that the median is "more resilient" to outliers, we mean that if we add an outlier to our data, the median is less affected than the mean. So the change to the median will be less than the change to the mean. Let's confirm that this is the case in our above example. To do so, we will compute the old mean/median, the new mean/median and compare them.

The old mean is:

The old median is: $\frac{14 + 15}{2} = 14.5$, since both 14 and 15 are in the middle of the ordered data.

The new mean is: .

The new median is: $15$.

So we can see that the new mean is higher than the old mean, whereas the median only increased by $0.5$, and so the median changed less than the mean did, as expected.

Now, you might wonder why the median is more resilient than the mean to outliers. You won't need to know this for the GRE, but it might help in remembering which one is more resilient to outliers. The reason is that when you are adding a new value to the data set, the median doesn't really care how large/small the new number is. All that matters is whether it is larger than the previous median or smaller. If it is larger than the previous median, then the new median moves a number to the right. If it is smaller, the new median moves a number to the left. But the mean cares about how large/small the new number is. So if the new number is an outlier, then it is, by definition, really large or really small compared to the other numbers. So the mean responds a lot to this new number whereas the median just moves a number to the right/left. That's why outliers affect the mean more than they do the median.
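
You can see this effect for yourself with a few lines of code. The numbers below are hypothetical rainy-day counts that we made up for illustration (they are not the data from this post); the only thing that matters is that the last value is far from the rest.

```python
from statistics import mean, median

# Hypothetical rainy-day counts (made up for this sketch).
rainy_days = [12, 13, 14, 15, 16, 17]
# Add one month of constant downpours as an outlier.
with_outlier = rainy_days + [30]

mean_shift = mean(with_outlier) - mean(rainy_days)
median_shift = median(with_outlier) - median(rainy_days)
print(f"mean shifted by {mean_shift:.2f}, median by {median_shift:.2f}")
```

Try replacing 30 with 300: the median shift stays exactly the same, while the mean shift grows, which is the resilience described above.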

__Practice Problems__

In this post, we'll talk about ranges, quartiles, and percentiles. These are all ways of getting a sense of what the overall distribution looks like and what the possible outcomes look like.

Let's start with the range. The **range** is the difference between the largest and the smallest value in your data. So suppose you have the following data:

| Date | Sold |
| --- | --- |
| 7/2/19 | 1 |
| 7/3/19 | 2 |
| 7/4/19 | 5 |
| 7/5/19 | 2 |
| 7/6/19 | 3 |
| 7/7/19 | 2 |
| 7/8/19 | 5 |
| 7/9/19 | 4 |
| 7/10/19 | 3 |
| 7/11/19 | 3 |

Then, the range is $5 - 1 = 4$. Thus, the range tells you the interval over which your data is distributed.
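
The same calculation in code, using the ten values from the table (variable names are ours):

```python
sold = [1, 2, 5, 2, 3, 2, 5, 4, 3, 3]

# Range = largest value - smallest value.
data_range = max(sold) - min(sold)
print(data_range)  # 4
```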

Quartiles are more complicated. As the name suggests, quartiles are a way of dividing up the data into four parts. So, if we have the following data:

The **quartiles** are the points that divide up the data into four segments, each with the same number of observations. For the above data, we would get:

You can see how in each section, we have exactly three points which is one fourth of our total data (made up of 12 points).

More formally, the **first quartile** is the value that separates the bottom 25% of the data from the top 75%; the **second quartile** is the median and splits the data in half; the **third quartile** separates the bottom 75% of the data from the top 25%.

Now how can we find the quartiles?

Well, we already know how to find the second quartile since, recall, that's just the median! And if we know what the median is, we can separate the data into two halves: the lower half is made up of all the values less than our median, whereas the higher half is all the values greater than our median.

Then, we find the median of the lower half and that will be the first quartile value, and the median of the upper half will be the third quartile value! This makes some intuitive sense since 25% is exactly at the middle between 0% and 50%, which is the data that makes up our lower half, whereas 75% is the middle of 50% and 100%, which is the data that makes up our upper half.

__Example 1__

Find the first, second, and third quartiles of:

$$21, 13, 37, 45, 5, 1, 9, 17, 33, 41, 25, 29$$
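
Here is a minimal sketch of the "median of each half" method described above, applied to this data. The helper name `quartiles` is ours. One assumption to flag: when the number of observations is odd, this sketch leaves the middle value out of both halves, which is one common convention but not the only one.

```python
from statistics import median


def quartiles(values):
    """Return (first, second, third) quartiles via the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # values below the median position
    upper = ordered[(n + 1) // 2:]   # values above the median position
    return median(lower), median(ordered), median(upper)


data = [21, 13, 37, 45, 5, 1, 9, 17, 33, 41, 25, 29]
print(quartiles(data))  # (11.0, 23.0, 35.0)
```

Sorted, the data are 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45. The median is the average of 21 and 25, the lower half is the first six values, and the upper half is the last six.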


In the next few posts, we will introduce different kinds of graphs. These graphs allow you to visually represent data in ways that emphasize different aspects of the data. In preparation, let's look at some data represented with different graphs:

Table of lemonades sold:

| Date | Sold |
| --- | --- |
| 7/2/19 | 1 |
| 7/3/19 | 2 |
| 7/4/19 | 5 |
| 7/5/19 | 2 |
| 7/6/19 | 3 |
| 7/7/19 | 2 |
| 7/8/19 | 5 |
| 7/9/19 | 4 |
| 7/10/19 | 3 |
| 7/11/19 | 3 |

Bar graph of lemonades sold on a given day:

Circle graph of lemonades sold:

Box plot of lemonades sold:

Scatterplot of lemonades sold each day:

In subsequent posts, we will talk more about how to read each of these graphs.

Talk of "data" is ubiquitous -- what does that word mean in the context of the GRE? We will think about **data** as observations of given variables. Now what does that mean?

Well, suppose we are trying to help our child increase her earnings from a lemonade stand. Some days, she sells a lot and some days she sells almost nothing. To figure out why, we might start keeping track of how much she sells on a given day. So we would start making a table that looks like:

| Date | Lemonades Sold |
| --- | --- |
| 7/2/19 | 1 |
| 7/3/19 | 2 |
| 7/4/19 | 5 |
| 7/5/19 | 2 |
| 7/6/19 | 3 |

Now, each of the rows in the table is an observation. And we have our variables at the top of the columns: the date and the number of lemonades sold. And more generally: **variables** are just the characteristics that we keep track of, while an **observation** is a set of values for our variables on some particular occasion.

Now, a very simple way to keep track of data is via a **frequency distribution**. This is a table that records, on the left-hand side, possible values of a given variable, and on the right-hand side, it records how often those values appeared. Applied to the above table, we would get:

| Lemonades Sold | Number of Days (when that many lemonades were sold) |
| --- | --- |
| 1 | 1 |
| 2 | 2 |
| 3 | 1 |
| 4 | 0 |
| 5 | 1 |

This table answers questions like "How often did my child sell three lemonades?" To find the answer, we go to the "Lemonades Sold" column and look for the row with three lemonades sold. In that row, the right-hand column (corresponding to the number of days) says one. So there was one day where the child sold three lemonades.
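
Building a frequency distribution is just a tally, which can be sketched with Python's `collections.Counter`. The data are the five lemonade values from the first table; the variable names are ours.

```python
from collections import Counter

lemonades_sold = [1, 2, 5, 2, 3]
counts = Counter(lemonades_sold)

# Walk through every possible value from the smallest to the largest,
# including values that never occurred (a Counter gives 0 for those).
for value in range(min(lemonades_sold), max(lemonades_sold) + 1):
    print(value, counts[value])
```

Looking up `counts[3]` answers the question from the text: there was one day on which three lemonades were sold.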

And, in addition to a frequency distribution, we can also create a **relative frequency distribution** which records, on the left-hand side, possible values of a given variable, and on the right-hand side, the percentage of all observations where that value occurred. So, in the above example, we would get:

| Lemonades Sold | Proportion of Days |
| --- | --- |
| 1 | 20% |
| 2 | 40% |
| 3 | 20% |
| 4 | 0% |
| 5 | 20% |

since the total number of days is $5$, and $\frac{1}{5} = 20\%$ and $\frac{2}{5} = 40\%$, and so on through the table.
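
To turn counts into percentages, divide each count by the total number of observations. A minimal sketch with the same five lemonade values (variable names are ours):

```python
from collections import Counter

lemonades_sold = [1, 2, 5, 2, 3]
counts = Counter(lemonades_sold)
total = len(lemonades_sold)  # 5 days in all

# Percentage of days on which each possible value occurred.
relative = {value: 100 * counts[value] / total for value in range(1, 6)}

for value, pct in relative.items():
    print(f"{value}: {pct:.0f}%")
```

A handy sanity check is that the percentages in a relative frequency distribution always add up to 100%.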

Now, to really get a handle on what is driving her lemonade sales, we should probably add some more variables (e.g. daily temperature, day of week) and collect some more observations:

| Date | Lemonades Sold | Day of Week | Temperature (Fahrenheit) |
| --- | --- | --- | --- |
| 7/2/19 | 1 | Tuesday | 68 |
| 7/3/19 | 2 | Wednesday | 73 |
| 7/4/19 | 5 | Thursday | 75 |
| 7/5/19 | 2 | Friday | 70 |
| 7/6/19 | 3 | Saturday | 71 |
| 7/7/19 | 2 | Sunday | 71 |
| 7/8/19 | 5 | Monday | 78 |
| 7/9/19 | 4 | Tuesday | 75 |
| 7/10/19 | 3 | Wednesday | 72 |
| 7/11/19 | 3 | Thursday | 73 |

Here are some practice problems on the above concepts:

**Practice Problems:**

- Using the above data, construct a frequency table that tells you how often a certain number of lemonades was sold: