4.1: Presenting Categorical Data Graphically
Learning Objectives
Upon completion of this section, you should be able to
- Create a frequency table
- Create bar graphs
- Create pie charts
- Identify common graphical mistakes
Frequency Table
Categorical, or qualitative, data are pieces of information that allow us to classify the objects under investigation into various categories. For example, this could be the color of the car a person drives, the zip code where they reside, or the education level they have attained. We usually begin working with categorical data by summarizing the data in a frequency table. The frequency table is organized by identifying the categories found in the data and then counting how many observations there are in each category.
Frequency Table
A frequency table is a table with two columns or two rows. One column (or row) lists the categories, and the other lists the frequencies with which the items in the categories occur (how many items fit into each category).
Example 1
The earned grade of 21 randomly selected students enrolled in mathematics courses at PCC is given below. Organize the data into a frequency table.
Solution
In this frequency table we will use two rows (often done in textbooks and articles as it uses less vertical space on a page).
The first row will represent the grade classification and the second row counts how many students in the sample of 21 earned that grade.
Grade | A | B | C | D | F | I | W |
---|---|---|---|---|---|---|---|
Frequency | 2 | 8 | 5 | 3 | 1 | 1 | 1 |
Alternatively, we could have made the table with two columns, where the first column would be the grade and the second column would be the frequency.
In the example below, if we had listed out the original data, the list of colors would appear as a long jumble of words in no particular order, and it would be hard to see how often each color was observed.
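The counting step in Example 1 can also be sketched in Python. This is only an illustration: the original list of 21 grades is not reproduced in this section, so the `grades` list below is a hypothetical one constructed to match the counts in the table.

```python
from collections import Counter

# HYPOTHETICAL raw data: the original list of 21 grades is not shown in
# this section, so this list was built to match the counts in the table.
grades = ["B", "A", "C", "B", "W", "B", "D", "C", "B", "F", "B",
          "C", "I", "B", "D", "A", "C", "B", "D", "C", "B"]

# Count how many times each category occurs in the data.
counts = Counter(grades)

# List the categories in a fixed order and look up each frequency.
categories = ["A", "B", "C", "D", "F", "I", "W"]
frequency_table = {grade: counts[grade] for grade in categories}

for grade, freq in frequency_table.items():
    print(f"{grade}: {freq}")
```

A quick check on any frequency table is that the frequencies add up to the number of observations (here, 21).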
Example 2
An insurance company determines vehicle insurance premiums based on known risk factors. If a person is considered a higher risk, their premiums will be higher. One potential factor is the color of your car. The insurance company believes that people with some color cars are more likely to get in accidents. To research this, they examine police reports for recent total-loss collisions. The data is summarized in the frequency table below.
Color | Frequency |
---|---|
Blue | 25 |
Green | 52 |
Red | 41 |
White | 36 |
Black | 39 |
Grey | 23 |
I would caution you against interpreting this data at this point. One thing we did not examine was the number of cars of each color on the road. Hypothetically speaking, if there were twice as many Green cars as Blue cars on the road, the difference between the number of Green cars and the number of Blue cars in accidents might be explained by that difference in the number of cars of each color on the road.
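To make the caution concrete, here is a small Python sketch that compares collisions per 1,000 cars instead of raw counts. The collision counts come from the table above, but the numbers of cars on the road are made up for illustration (chosen so that there are twice as many Green cars as Blue cars); they are not part of the insurance data.

```python
# Total-loss collision counts taken from the frequency table above.
collisions = {"Blue": 25, "Green": 52}

# HYPOTHETICAL numbers of cars on the road (not from the source data):
# twice as many Green cars as Blue cars.
cars_on_road = {"Blue": 10_000, "Green": 20_000}

# Collisions per 1,000 cars of each color.
rate_per_1000 = {
    color: 1000 * collisions[color] / cars_on_road[color]
    for color in collisions
}

for color, rate in rate_per_1000.items():
    print(f"{color}: {rate:.1f} collisions per 1,000 cars")
```

Under these assumed road counts, the two rates are nearly identical even though Green has about twice as many collisions, which is exactly why the raw frequencies alone should not be over-interpreted.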
Bar Graphs
Sometimes we need an even more intuitive way of displaying data. This is where charts and graphs come in. There are many, many ways of displaying data graphically, but we will concentrate on one very useful type of graph called a bar graph. In this section we will work with bar graphs that display categorical data; the next section will be devoted to bar graphs that display quantitative data.
Bar graph
A bar graph is a graph that displays a bar for each category with the length of each bar indicating the frequency of that category.
To construct a bar graph, we need to draw a vertical axis and a horizontal axis. The vertical direction will have a scale and measure the frequency of each category; the horizontal axis has no scale in this instance. On the horizontal axis we will create bars to represent each category. It is important that the bars are evenly spaced and each have the same width. The construction of a bar chart is most easily described by use of an example.
Example 3
Start by identifying the frequency as the vertical axis and the categories (color of car) as the horizontal axis. On the vertical axis, we look at our data to determine the largest value, since the bar graph must be able to display a bar of that height. Using our car data above, we see the highest frequency is 52, so our vertical axis needs to go from 0 to at least 52, but we might as well use 0 to 55 so that we can put a hash mark every 5 units. For each category you will create a bar going from 0 to the frequency value in the table (as shown below):
Notice that the height of each bar is determined by the frequency of the corresponding color. The horizontal gridlines are a nice touch, but not necessary. In practice, you will find it useful to draw bar graphs using graph paper, so the gridlines will already be in place, or using technology. Instead of gridlines, we might also list the frequencies at the top of each bar, like this:
These types of graphs are typically easy to create in a spreadsheet program, like Excel or Google Sheets.
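For readers who prefer code to a spreadsheet, the following Python sketch mirrors the steps of Example 3 using only the standard library. It rounds the largest frequency up to the next multiple of 5 to choose the top of the frequency axis, then prints a rough text version of the bar graph. The bars are drawn horizontally here purely for convenience in a text display, and the variable names are illustrative.

```python
import math

# Car color frequencies from Example 2.
frequencies = {"Blue": 25, "Green": 52, "Red": 41,
               "White": 36, "Black": 39, "Grey": 23}

tick_step = 5  # spacing between hash marks on the frequency axis

# Round the largest frequency up to the next multiple of the tick step.
axis_max = math.ceil(max(frequencies.values()) / tick_step) * tick_step
ticks = list(range(0, axis_max + 1, tick_step))
print("Axis ticks:", ticks)

# Draw one bar per category; each '#' stands for one observation,
# and the frequency is listed at the end of the bar.
for color, freq in frequencies.items():
    print(f"{color:<6}| {'#' * freq} {freq}")
```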
In the above example, our chart might benefit from being reordered from largest to smallest frequency values. This arrangement can make it easier to compare similar values in the chart, even without gridlines. When we arrange the categories in decreasing frequency order like this, it is called a Pareto chart.
Video Summary of Examples (4 mins 19 secs – CC)
Pareto chart
A Pareto chart is a bar graph ordered from highest to lowest frequency.
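The reordering in this definition amounts to sorting the categories by frequency in descending order. A minimal Python sketch using the car color data from Example 2:

```python
# Car color frequencies from Example 2.
frequencies = {"Blue": 25, "Green": 52, "Red": 41,
               "White": 36, "Black": 39, "Grey": 23}

# Sort the (category, frequency) pairs from highest to lowest frequency.
pareto_order = sorted(frequencies.items(),
                      key=lambda item: item[1],
                      reverse=True)

for color, freq in pareto_order:
    print(f"{color}: {freq}")
```

The resulting order (Green, Red, Black, White, Blue, Grey) is the left-to-right order of the bars in the Pareto chart.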
Example 4
Transforming our bar graph from earlier into a Pareto chart, we get:
Video Solution Example 3 (1 mins 58 secs – CC)
Example 5
In some cases the bar graph categories may have an implied order (like dates or quarters), and we would not want to rearrange those categories because the graphic would lose its meaning. Take the example below from Statista about the revenue for Zoom. As you read this graph, you see the progression of Zoom's revenue growth through time. Reordering from largest to smallest would lose the meaning behind the visual representation of the revenue growth through time and the huge increase in the 2021 fiscal year report (Zoom’s 2021 fiscal year started in February 2020). That huge increase in growth was explained by the start of the Covid-19 pandemic.
You will find more infographics at Statista
Example 6
In a survey, adults were asked whether they personally worried about a variety of environmental concerns. The number (out of 1012 surveyed) indicating that they worried “a great deal” about some selected concerns is summarized below.
Environmental Issue | Frequency |
---|---|
Pollution of drinking water | 597 |
Contamination of soil and water by toxic waste | 526 |
Air pollution | 455 |
Global warming | 354 |
Construct a bar graph for the data.
Solution
Since the bars are ordered by frequency from greatest to least, we can call this a Pareto chart.
To show relative sizes, it is common to use a pie chart. In a pie chart, a circle is divided into wedges, with one wedge for each category. The size of each wedge, as a share of the whole circle, matches the frequency of that category as a share of all of the data. If one category represents 25% of the data, then the wedge for that category would be 25% of the circle.
Relative Frequency
A relative frequency refers to the proportion of times we observe that category item within a collection of data compared to the total number of observations in that data.
Relative Frequency of a category item = (# of times an observation from a category is observed) / (total number of observations)
Values between 0 and 1: Relative frequency is a proportion and must lie between 0 (never happens) and 1 (always happens).
Pie Chart
A pie chart is a circle with wedges that represent each category's relative frequency.
Example 7
In the insurance company example from above construct a relative frequency table along with a pie chart to represent the car color data that was provided.
Color | Frequency |
---|---|
Blue | 25 |
Green | 52 |
Red | 41 |
White | 36 |
Black | 39 |
Grey | 23 |
Solution
To create a relative frequency table for the vehicle color data, we start by adding a new column to our original frequency table and title it Relative Frequency. The relative frequency for each car color is then found by:
$$\text{Relative Frequency of a color} = \frac{\text{frequency of that color}}{\text{total number of observations}}$$
For example, to find the Blue relative frequency, first find the total of our frequency column (25+52+41+36+39+23=216) and then calculate the relative frequency:
$$\frac{25}{216} \approx 0.116 = 11.6\%$$
Now do this for each vehicle color:
Color | Frequency | Relative Frequency |
---|---|---|
Blue | 25 | 11.6% |
Green | 52 | 24.1% |
Red | 41 | 19.0% |
White | 36 | 16.7% |
Black | 39 | 18.1% |
Grey | 23 | 10.6% |
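The whole relative frequency column can be reproduced with a few lines of Python. This is a sketch of the calculation described above (divide each frequency by the total, then express it as a percent rounded to one decimal place), not a required step in any particular software package.

```python
# Car color frequencies from Example 2.
frequencies = {"Blue": 25, "Green": 52, "Red": 41,
               "White": 36, "Black": 39, "Grey": 23}

# Total number of observations.
total = sum(frequencies.values())

# Relative frequency = category frequency / total number of observations.
relative = {color: freq / total for color, freq in frequencies.items()}

# Express each relative frequency as a percent, rounded to one decimal.
percent = {color: round(100 * r, 1) for color, r in relative.items()}

for color in frequencies:
    print(f"{color:<6} {frequencies[color]:>3} {percent[color]:>5}%")
```

Because each percent is rounded, the displayed values may add up to slightly more or less than 100% even though the unrounded relative frequencies add up to exactly 1.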
To construct the pie chart, we divide a circle into wedges, one for each color (category), where the relative frequency is the percent of the circle filled by that color. For our vehicle color data, a pie chart might look like this:
When looking at the above pie chart, you may have a hard time determining which wedge is the largest, second largest, and so on. Pie charts often benefit from including frequencies or relative frequencies (percents) next to the pie slices. Listing the category names in a legend is often helpful; most programs can also attach the names directly to the slices (as seen below).
Video Solution Example 5 (4 mins 49 secs – CC)
The pie chart below shows the percentage of voters supporting each candidate running for a local senate seat. If there are 20,000 voters in the district, the pie chart shows that about 11% of those, about 2,200 voters, support Reeves.
Take note that without the percentages labeled on the graph it would be hard to determine if Ellison had indeed received more votes than Douglas. By including the percentages we give both a visual and numeric way to compare different groups in the pie chart.
Video Explanation (1 mins 1 secs – CC)
Pie charts look nice, but are harder to draw by hand than bar charts since to draw them accurately we would need to compute the angle each wedge cuts out of the circle, then measure the angle with a protractor. Computers are much better suited to drawing pie charts. Common software programs like Excel or Google Sheets are able to create bar graphs, pie charts, and other graph types.
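If you do want to draw a pie chart by hand, the angle for each wedge is the category's relative frequency multiplied by the 360 degrees in a full circle. The Python sketch below computes those angles for the car color data; it illustrates the arithmetic only and is not tied to any particular charting program.

```python
# Car color frequencies from Example 2.
frequencies = {"Blue": 25, "Green": 52, "Red": 41,
               "White": 36, "Black": 39, "Grey": 23}

total = sum(frequencies.values())

# Wedge angle in degrees = relative frequency * 360.
angles = {color: 360 * freq / total for color, freq in frequencies.items()}

for color, angle in angles.items():
    print(f"{color:<6} {angle:6.1f} degrees")
```

The angles should add up to 360 degrees, which is a useful check before reaching for the protractor.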
Try it Now 1
Create a bar graph and a pie chart to illustrate the grades on a history exam below.
A: 12 students, B: 19 students, C: 14 students, D: 4 students, F: 5 students
Answer
Start with creating a frequency table and adding the relative frequency column (for the pie chart).
Grade | Frequency | Relative Frequency |
---|---|---|
A | 12 | 22.2% |
B | 19 | 35.2% |
C | 14 | 25.9% |
D | 4 | 7.4% |
F | 5 | 9.3% |
Both charts are given below.
Be aware we only did the relative frequency as an exercise to show where the numbers in the pie chart came from. Typical software packages do not require you to do that step.
Common Mistakes on Graphs
Video Summary of Bad Graphs (3 mins 3 secs – CC)
Don’t get fancy with graphs! People sometimes add features to graphs that don’t help to convey their information. For example, 3-dimensional bar charts like the one shown on the right are usually not as effective as their two-dimensional counterparts. This chart makes it very challenging to determine the heights of the bars due to the horizontal axis being skewed. It would be really challenging to determine if there were more blue or more black cars involved in a total-loss collision.
Here is another way that fanciness can lead to trouble. Instead of plain bars, it is tempting to substitute meaningful images. This type of graph is called a pictogram.
Pictogram
A pictogram is a statistical graphic in which the size of the picture is intended to represent the frequencies or size of the values being represented.
Example 8
A labor union might produce the graph to the right to show the difference between the average manager salary and the average worker salary.
Looking at the picture, it would be reasonable to guess that manager salaries are 4 times as large as worker salaries, since the area of the manager bag looks about 4 times as large. However, manager salaries are in fact only twice as large as worker salaries, which was reflected in the picture by making the manager bag twice as tall.
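The arithmetic behind this distortion can be sketched in Python, assuming the picture is enlarged proportionally (both height and width are multiplied by the same factor). The function names below are illustrative and not part of any library.

```python
import math

def area_factor(linear_scale):
    """Factor by which a picture's area grows when its height AND width
    are both multiplied by linear_scale (assumes proportional scaling)."""
    return linear_scale ** 2

def linear_scale_for(value_ratio):
    """Linear scale to apply to a picture so that its AREA, rather than
    its height, grows by value_ratio."""
    return math.sqrt(value_ratio)

# Drawing the manager bag twice as tall (and twice as wide):
print(area_factor(2))

# Scale that would make the bag's area exactly twice as large instead:
print(round(linear_scale_for(2), 3))
```

Under this assumption, doubling the height makes the area 4 times as large, while scaling by about 1.414 would make the area match the actual 2-to-1 salary ratio. Using plain bars of equal width avoids the problem entirely.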
Try it Now 2
Carefully examine the 2011 State of the Union address graphic given below. Does anything seem wrong? What caused the error?
Hint 1
Visually something is not right. Look at the circles that are created. Does anything seem odd based on the size of the circle and the sizes of the numbers being compared?
Answer
This type of distortion can be intentional or unintentional, as in the 2011 State of the Union address graphic shown above. [Image Source dy/dan blog] The error in the State of the Union image comes from using the diameter as the “height” of each circle, which makes the area disproportionately larger than it should be. As a rough estimate, we could fit six circles the size of China's inside the United States circle, yet the values given show that the United States GDP is not six times the size of China's (it is in fact less than three times the size of China's).
Another distortion in bar charts results from setting the baseline to a value other than zero. The baseline is the bottom of the vertical axis, representing the least number of cases that could have occurred in a category. Normally, this number should be zero. There are times where setting this number higher is needed to show differences in values on the graph, but other times this change dramatically changes the message from the data as shown in the next example.
Example 9
Compare the two graphs below showing support for same-sex marriage rights from a poll taken in December 2008. The difference in the vertical scale on the first graph suggests a different story than the true differences in percentages; the second graph makes it look like twice as many people oppose marriage rights as support it.
In the above example we saw that by changing this vertical axis we are allowing for a different story to be told. On the flip side it is sometimes helpful to do this to allow for the examination of close differences between groups.
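The effect of moving the baseline can be quantified with a short Python sketch. The poll percentages below are hypothetical values chosen for illustration (the actual figures from the graphs are not reproduced in this text), and `apparent_ratio` is an illustrative helper, not a library function.

```python
def apparent_ratio(a, b, baseline=0):
    """Ratio of the DRAWN height of bar b to the drawn height of bar a
    when the vertical axis starts at `baseline` instead of zero."""
    return (b - baseline) / (a - baseline)

# HYPOTHETICAL poll results in percent (not the actual poll values).
support, oppose = 45, 55

# With the baseline at zero, the bars reflect the true ratio.
print(round(apparent_ratio(support, oppose), 2))

# With the baseline raised to 35, the "oppose" bar is drawn twice as tall.
print(apparent_ratio(support, oppose, baseline=35))
```

With these assumed values, the true ratio is about 1.22, but raising the baseline to 35 makes one bar look twice as tall as the other.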
Try it Now 3
A poll was taken asking people if they agreed with the positions of the 4 candidates for a county office. The poll found that 42% agreed with Nguyen’s position, 35% agreed with McKee’s position, 52% agreed with Brown’s position, and 64% agreed with Jones’s position.
Does the pie chart above present a good representation of this data? Explain.
Answer
A pie chart is inappropriate when a respondent can give an answer that falls into multiple categories (as in this case). You can see this is incorrect since the percentages do not add to 100%. A better approach for this visual would be a bar chart (where you can put the relative frequency on the vertical axis).
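The check described in this answer is easy to automate. The Python sketch below adds up the poll percentages and tests whether they total 100%. Note that a total of 100% is a necessary condition for a pie chart, not a guarantee that one is appropriate.

```python
# Percent of respondents agreeing with each candidate's position.
agree_percent = {"Nguyen": 42, "McKee": 35, "Brown": 52, "Jones": 64}

total_percent = sum(agree_percent.values())

# A pie chart only makes sense if the categories split the whole,
# so the percentages must add up to 100.
sums_to_whole = total_percent == 100

print(f"Total: {total_percent}%")
print("Percentages add to 100%:", sums_to_whole)
```

Here the total is 193%, which confirms that respondents could agree with more than one candidate.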
Exercises
Please work on all the problems listed below for homework. You may ask questions in the discussion forum (it is also a great place to compare answers with your classmates).
- The table below shows scores on a Math test.
80 50 50 90 70 70 100 60 70 80 70 50 90 100 80 70 30 80 80 70 100 60 60 50
- Treat the scores 30, 40, 50, 60, 70, 80, 90, and 100 as categories. Complete the frequency table for the Math test scores.
Score | Frequency |
---|---|
30 | |
40 | |
50 | |
60 | |
70 | |
80 | |
90 | |
100 | |
- Construct a bar graph of the data
- Construct a pie chart of the data
Answer
- Frequency table:
Score | Frequency |
---|---|
30 | 1 |
40 | 0 |
50 | 4 |
60 | 3 |
70 | 6 |
80 | 5 |
90 | 2 |
100 | 3 |
- Bar Graph: This is technically a histogram (something you will see in a later section).
- Pie Chart
- A group of adults were asked what type (model) of cars they had in their household.
Type (model) of cars in your household
Ford Kia Jeep Ford Toyota Toyota Chevy Honda Ford Toyota Honda Chevy Kia Chrysler Honda Jeep Ford Ford Toyota Kia Ford Toyota Chevy Toyota
- Complete the frequency table for the car model data
- Construct a bar graph of the data
- Construct a pie chart of the data
Answer
- Frequency table:
Model | Frequency |
---|---|
Ford | 6 |
Kia | 3 |
Jeep | 2 |
Toyota | 6 |
Chevy | 3 |
Honda | 3 |
Chrysler | 1 |
- Bar Graph
- Pie Chart
- A group of adults were asked how many children they have in their families. The bar graph below shows the number of adults who indicated each number of children.
- How many adults were questioned?
- What percentage of the adults questioned had 0 children?
Answer
- The total number of adults from the bar graph is 15.
- 5 of the 15 adults had 0 children:
$$\frac{5}{15} \approx 0.333 = 33.3\%$$
- Jasmine was interested in how many days it would take an order of a single movie from Netflix to arrive at her door. The graph below shows the data she collected. The frequency represents orders of a single movie.
- How many movies in all did she order?
- What percentage of the movies arrived in one day? Round to the nearest tenth.
Answer
- The total number of movies ordered:
- The percentage of movies that arrived in one day:
- The bar graph below shows the percentage of students who received each letter grade on their last English paper. The class contains 20 students. What number of students earned an A on their paper?
Answer
The bar graph shows the percent of the class earning each grade. Looking at the bar for A, it appears that 25% of the class earned an A. The class size was 20, so 25% of the 20 students earned an A:
$$0.25 \times 20 = 5$$
This shows 5 students earned an A.
- Kori categorized her spending for this month into four categories: Rent, Food, Fun, and Other. The percents she spent in each category are pictured here. If she spent a total of $2600 this month, how much did she spend on rent?
Answer
From the pie chart we see rent represents 26% of her spending. To find the amount spent on rent, find 26% of 2600 (total spent):
$$0.26 \times 2600 = 676$$
Kori spent $676 on rent.
- A graph appears below showing the number of adults and children who prefer each type of soda. There were 130 adults and kids surveyed. Discuss some ways in which the graph below could be improved.
Answer
It is hard to make comparisons on 3-d graphs because it can be difficult to determine the height of each bar. It would be better to turn this into a 2-d bar graph. The graph is also misleading in that the y-axis values do not start at 0, so the height difference between two bars is magnified and seems larger than it actually is numerically.
- The graph below shows the number of complaints for six different airlines as reported to the US Department of Transportation in February 2013. Alaska, Pinnacle, and Airtran Airlines have far fewer complaints reported than American, Delta, and United. Can we conclude that American, Delta, and United are the worst airline carriers since they have the most complaints?
Answer
You cannot assume that the numbers of complaints reflect the quality of the airlines. The airlines shown with the greatest number of complaints could be the ones with the most passengers. You must consider the appropriateness of methods for presenting data; in this case displaying totals is misleading as the categories where the data was pulled from (airlines) are not of equal sizes. A more appropriate choice would be to compare the percent of complaints for an airline as it takes into consideration the total number of passengers to compute that percent.
- Below is a frequency table that shows the number of Covid cases (in thousands) in some Arizona counties on May 7, 2021 (source: azdhs.gov/covid19/data/index.php).
Arizona Covid Cases
Arizona County | Covid Cases in Thousands |
---|---|
Maricopa | 540 |
Pima | 115 |
Pinal | 52 |
Yuma | 37 |
Mohave | 23 |
Yavapai | 19 |
- Construct a bar graph to represent the data for the number of Covid cases (in thousands) for the Arizona counties.
- What danger is there in comparing the values for each county directly against each other?
- From the bar graph it seems clear Maricopa has many more cases of Covid when compared to Yuma (about 15 times as many). If you factor in the population for each county, we can get a better understanding of the penetration of Covid-19. According to the recent 2019 census, Maricopa has a population of 4,485,414 and Yuma has a population of 209,468. Which county has a higher percent of Covid cases?
Attributions
This page contains modified content from David Lippman, “Math In Society, 2nd Edition.” Licensed under CC BY-SA 4.0.
This page contains modified content from “Collecting Data” by Foster et al., LibreTexts is licensed under CC BY-NC-SA 4.0.
This page contains modified content from “OpenStax Introductory Statistics” by Barbara Illowsky, Susan Dean. Licensed under CC BY 4.0.
This page contains content by Robert Foth, Math Faculty, Pima Community College, 2021. Licensed under CC BY 4.0.
The survey data for Example 6 is from Gallup Poll. March 5-8, 2009. http://www.pollingreport.com/enviro.htm
The survey data for Example 9 is from CNN/Opinion Research Corporation Poll. Dec 19-21, 2008, from http://www.pollingreport.com/civil.htm