Creating Good Charts
General
Principles of Graphic Display
A graphical chart provides a visual display
of data that otherwise would be presented in a table; a table, one that would
otherwise be presented in text. Ideally, a chart should convey ideas about the
data that would not be readily apparent if they were displayed in a table or as
text.
The three standards for tabular display of
data -- the efficient display of meaningful and unambiguous data
-- apply to charts as well. As with tables, it is crucial to good
charting to choose meaningful data, to clearly define what the numbers
represent, and to present the data in a manner that allows the reader to
quickly grasp what the data mean. As with tabular display, data ambiguity
in charts arises from the failure to precisely define just what the data
represent. Every dot on a scatterplot, every point on a time series line,
every bar on a bar chart represents a number (actually, in the case of a
scatterplot, two numbers). It is the job of the chart’s text to tell the
reader just what each of those numbers represents.
Designing good charts, however, presents
more challenges than tabular display as it draws on the talents of both the
scientist and the artist. You have to know and understand your data, but
you also need a good sense of how the reader will visualize the chart’s
graphical elements.
Two problems arise in charting that are
less common when data are displayed in tables. Poor, or deliberately
deceptive, choices in graphic design can provide a distorted picture of the numbers
and the relationships they represent. A more common problem is that charts are
often designed in ways that hide what the data might tell us, or that distract
the reader from quickly discerning the meaning of the evidence presented in the
chart. Each of these problems is illustrated in the two classic texts on data
presentation: Darrell Huff’s How to Lie with Statistics (1993) and
Edward Tufte’s The Visual Display of Quantitative Information (1983).
Huff’s little paperback, first published in
1954 and reissued many times thereafter, condemned graphical representations of
data that “lied”. Here, the two numbers, one 3 times the magnitude of the
other, are represented by two cows, one 27 times larger than the
other, resulting in a Lie Factor of 9.

Figure 1: Graphical distortion of data
SOURCE: Darrell Huff. 1993. How to Lie with Statistics. WW Norton & Co, 72.
Here the figure depicts the increase in the number of milk cows in the
United States, from 8 million in 1860 to 25 million in 1936. The larger cow is
thus represented as three times the height of the 1860 cow. But she is also three
times as wide, thus taking up nine times the area of the page. Moreover, the
graphic is a depiction of a three-dimensional figure: when we take the depth of
the cow into account, she is twenty-seven times larger in 1936. Later, Tufte
developed the “Lie Factor”: a numerical measure of the data distortion. Here,
representing numbers that differ in magnitude by a factor of 3 with images that
differ in size by a factor of 27 produces a “Lie Factor” of 9.
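The arithmetic behind the cow example can be sketched in a few lines of Python. This is only an illustration of the simple ratio-of-ratios used in the text (Tufte's own definition is stated in terms of effect sizes); the function name lie_factor is an assumption made for this sketch.

```python
def lie_factor(shown_ratio: float, data_ratio: float) -> float:
    """Ratio of the change shown in the graphic to the change in the data."""
    if data_ratio == 0:
        raise ValueError("data_ratio must be non-zero")
    return shown_ratio / data_ratio


# Cow example from the text: every dimension of the picture is scaled by 3.
height = width = depth = 3
data_ratio = 3                          # the text treats 25 million / 8 million as about 3
area_ratio = height * width             # space taken up on the page
volume_ratio = height * width * depth   # apparent size of the 3-D figure

print("Lie factor by area:  ", lie_factor(area_ratio, data_ratio))
print("Lie factor by volume:", lie_factor(volume_ratio, data_ratio))
```

A value of 1 would mean the picture changes by exactly as much as the data; anything well above 1 exaggerates the change.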
Such visual distortions are not as common
as they once were, but modern computer technology has made possible all sorts
of new ways of lying with charts.
Edward Tufte would second Emperor Joseph
II’s famous complaint to a young composer: “too many notes, Mozart.” Tufte’s unique
contribution to the art of chart design was to stress the virtue of efficient data
presentation. His fundamental rule of efficient graphical design is to minimize
the ratio of ink-to-data by minimizing or eliminating any elements from the
chart that do not aid in conveying what the numbers mean. Tufte’s advice to
those who would chart is essentially the same advice offered by Strunk and
White to would-be writers:
"A sentence should contain no
unnecessary words, a paragraph no unnecessary sentences for the same reason
that a drawing should contain no unnecessary lines and a machine no unnecessary
parts." (23)
Just as the purpose of any statistic is to
simplify, to represent in one number a larger set of numbers, the purpose of a
chart is to simplify numerical comparisons: to represent several numerical
comparisons in a single graphic. The most common errors in chart design are to
include elements in the graphical display that have nothing to do with the
presentation of the numerical comparisons. Below we will see how the standard
applies to the components of charts in general.
The Components of a Chart
There are three basic components to most
charts:
- the labeling that defines the data: the title, axis titles and labels, legends defining separate data series, and notes (often, to indicate the data source),
- scales defining the range of the Y (and sometimes the X) axis, and
- the graphical elements that represent the data: the bars in bar charts, the lines in time series plots, the points in scatterplots, or the slices of a pie chart.
Titles. In journalistic writing a chart title will sometimes
state the conclusion the writer would have the reader draw from the chart. If
figure 2 were used in a Governors State University press release, the title,
“Tuition and Fees Lowest at GSU” might be appropriate. In academic writing,
the title should be used to define the data series, as is shown in figure 2,
without imposing a data interpretation on the reader. Often, the units of
measurement are specified at the end of the title after a colon or in
parentheses in a subtitle (e.g. “constant dollars”, “% of GDP”, or “billions of
US dollars”).
Axis titles. Axis titles should be brief and should not be used at
all if the information merely repeats what is clear from the title and axis
labels. It would be redundant to repeat the phrase “Tuition and fees” in the Y
axis of figure 2, and the X-axis title, “University”, is completely unnecessary.
If the title of the chart has the subtitle “% of GDP”, it is not necessary to
repeat either the phrase or the word “percent” in the axis title.
Axis scale and data labels. The value or magnitude of the main graphical
elements of the chart is defined by the axis scale, by individual data labels,
or by both. Avoid using too many numbers to define the data
points. A chart that labels the value of each individual data point does not
need labeling on the y axis. If it seems necessary to label every value in a
chart, consider that a table is probably a more efficient way of presenting the
data.
Legends. Legends are used in charts with more than one data
series. They should not be placed on the outside of the chart in a way that
reduces the plot area, the amount of space given to represent the data. In
figure 2, the legend is placed inside the chart (although some think that
detracts from the main graphical elements); it could also be placed at the
bottom of the chart (where the unnecessary “University” now stands).
Gridlines. If used at all, gridlines should use as little ink
as possible so as to not overwhelm the main graphical elements of the chart.
The source. Specifying the source of the data is important for
proper academic citation, but it can also give knowledgeable readers, who
are often familiar with common data sources, important insights into the
reliability and validity of the data. For example, knowing that crime statistics
come from the FBI rather than the National Crime Victimization Survey can be
a crucial bit of information.
Other chart elements. The amount of ink given over to the non-data
elements of a chart that are not necessary for defining the meaning and values
of the data should be kept to an absolute minimum. Plot area borders and plot
area shading are unnecessary. Keep the shading of the graphical elements
simple and always avoid using unnecessary 3-D effects. In most of the charts
that follow, even the vertical line defining the Y-axis has been removed,
following the commendable charting standards of The Economist magazine.
When Graphic Design Goes Badly
The most general standards of charting data
are thus the following:
To see what happens when these rules are
violated, consider figure 3, taken from Robert Putnam’s Bowling Alone
(where it is labeled figure 47), a work that contains many good and bad
examples of graphical data display (and unfortunately, no tables at all). In
just one chart, Putnam violates the three fundamental rules of data
presentation: the chart does not depict meaningful data; the data it does
depict are ambiguous, and the chart design is seriously inefficient. One can’t
accuse Putnam of distorting the data only because his main conclusions are not
derived from the data presented in the chart.

Figure 3: Very bad graphical display
Of these, let’s consider the inefficiency first: the first thing you notice
about the chart is that the graphical elements are represented in three
dimensions. On both efficiency and truthfulness this is unfortunate; the 3-D
effect is entirely unnecessary and in this case serves to distort the visual
representation of the data. Had not the data labels been shown on the top of each
bar, it would not be readily apparent that column A is in fact bigger than
column F, or that C is the same size as B. In addition the chart suffers from
what might be called “numbering inefficiency”: Putnam uses 13 numbers to represent
6 data points. Eliminating the 3-D, as shown in figure 4, offers a more exact
representation of the data with a lot less ink.
There are two problems of ambiguous data in the chart. The first is partly resolved
in Putnam’s text, where it is explained that bar E is the percentage of women
who are homemakers out of concern for their kids while bar A is the percentage
of women who are working full-time because they need the money. It’s not quite
clear what the numbers for those who work part-time mean. In the case of bar
C, for example, are the women working only part-time because of the kids, or
are they not full-time homemakers because of the money?
The other ambiguity, however, is not for
the lack of proper labeling. If one looks at the chart quickly, the first
impression one would get is that only 11% of women who work
full-time do so for reasons of personal satisfaction. But that is not the
case. Look at the Y-axis title. Or notice that all of the percentages add up
to 100. Of all the women in the survey, 11% were in the single category of
“employed full-time for reasons of personal satisfaction.” This is not what
one expects in a bar chart, but given the data Putnam has decided to display,
there isn’t a whole lot that can be done with the chart to fix it.
Still, we have to ask, “what does this
chart mean?” In particular, what data do the arrows on the bars represent?
A critical standard of good charting is
that the chart should be self-explanatory. That there are problems with this
chart becomes apparent as soon as one encounters Putnam’s page and
a half of accompanying text devoted, not to explaining the significance of the
data, but to explaining what the elements of the chart represent. A careful
reading of the text tells us that there are basically three conclusions Putnam
would have us draw from this chart:
· Over time (the 1980s and 90s), more women are working.
· They are doing so less for reasons of personal satisfaction and more out of necessity (i.e., to earn money).
· Correspondingly, there has been a significant decline in the number of women who choose to be homemakers for reasons of personal satisfaction.
These three conclusions are directly
relevant to Putnam’s general thesis: that over time there has been a decline in
social capital (adults are spending less time raising children and developing
the social capital of future generations) driven in part by the demands of the
expanding work force.
Based on the textual discussion that Putnam offers, it becomes clear that the most
meaningful data are represented in the chart, not by the height of the bars, but
by the direction of the arrows on the bars. Recall that as a general rule data
presentations that include more than one time point provide for much more
meaningful analyses than cross-sectional or single time point presentations. Although
most of the data analysis in Bowling Alone is time series data, in this
case Putnam averages 21 years of data down to single data points represented by
the chart’s bars, with the time series change represented by directional
arrows. Thus, the most meaningful comparison in the chart, the comparison
that supports the conclusion that Putnam seeks to draw from the data, is not
that bar A is higher than bar B or F, but that the arrow for bar A is going up
while the arrow for bar F is going down.
The crucial comparison is made directly in figure 5, based on the data presented in
the textual discussion. Moreover, it directly illustrates several points that
neither the text nor the original chart makes clear: in 1978, a plurality of
women were homemakers who stayed home out of personal satisfaction; in 1999, women who
worked full time for financial reasons were the plurality.
Note also that figure 5 simplifies the data presentation by eliminating the
ambiguous part-time category: for part-timers, is “personal satisfaction” the
reason for not staying at home or the reason for not working full-time? And it
clarifies that the “necessity” refers to “kids” in the case of homemakers and
to “money” in the case of full time workers.
Types of Charts
Most charts are a variation on one of four basic types: pie charts, bar charts, time
series charts and scatterplots. Choosing the right type of chart depends on
the characteristics of the data and the relationships you want displayed.
Pie Charts
Pie charts are used to represent the
distribution of the categorical components of a single variable. Note that as
a general rule, multivariate comparisons provide for more meaningful analysis
than do single variable distributions and for this and other reasons pie charts
should be rarely used, if at all.
Rules for pie charts:
· Avoid using pie charts.
· Use pie charts only for data that add up to some meaningful total.
· Never ever use three dimensional pie charts; they are even worse than two dimensional pies.
· Avoid forcing comparisons across more than one pie chart.
Pie charts should rarely be used. Pie
charts usually contain more ink than is necessary to display the data and the
slices provide for a poor representation of the magnitude of the data points.
Do you remember as a kid trying to decide which slice of your birthday cake was
the largest? It is more difficult for the eye to discern the relative size of
pie slices than it is to assess relative bar length. Forcing the reader
to draw comparisons across the two pie charts shown in figure 6 is also a
bad idea: without looking at the data label percentages,
one cannot easily determine whether the FY 2000 slices are larger or smaller
than the corresponding FY 2007 slices.
3-D pie charts are even worse, as they also
add a visual distortion of the data (in figure 7, the thick 3-D band exaggerates
the size of the corporate income tax slice).
All the information in the pie charts above can be conveyed more precisely and with
far less ink in the simple bar chart shown in figure 8.
Nevertheless, people like pie charts. Readers expect to see one or two pie
charts similar to those in figure 6 at the very beginning of an annual agency
budget report. But it would be a big mistake to rely on several pie charts for
the primary data analysis in a report.
For those who would ignore all the advice given here and insist that good charts
must look pretty, the most recent version of the Microsoft Excel charting
software (in Office 2007, beta) will satisfy all your foolish desires: 3-D pie
charts that gleam and glisten like Christmas tree ornaments, to say nothing
about what you can do with the 3-D pie chart’s cousins, the donut, cylinder,
cone, radar and pyramid charts.
As a general rule, 3-D charts are not a good idea even when the data are three
dimensional. In theory they provide for a precise representation of data, but
it is rare that they provide a basis for drawing a simple conclusion.
Bar Charts:
Bar charts typically display the
relationship between one or more categorical variables with one or more
quantitative variables represented by the length of the bars. The categorical
variables are usually defined by the categories displayed on the X-axis and, if
there is more than one data series, by the legend.
Rules for bar charts:
· Minimize the ink; do not use 3-D effects.
· Sort the data on the most significant variable.
· Use rotated bar charts if there are more than 8 to 10 categories.
· Place legends inside or below the plot area.
· With more than one data series, beware of scaling distortions.
Bar charts often contain little data, a lot of ink, and rarely reveal ideas
that cannot be presented much more simply in a table.
Minimizing the ink-to-data ratio is especially important in the case of bar charts.
Never use a 3-D bar chart. Keep the gridlines faint. Display no more than
seven numbers on the Y-axis scale. If there are fewer than five bars,
consider using data labels rather than a Y-axis scale; it doesn't make sense to
use a five-numbered scale when the exact values can be shown with four
numbers.
Look at figure 9 and you can quickly grasp the main point (the United
States has the highest child poverty rate among developed nations),
but then spend some time with it and you'll discover other interesting things.
Note, for example, the differences in child and elderly poverty across nations, or
that the three countries at the top, with the lowest child poverty rates, are
Scandinavian countries; five of the seven countries with the highest child poverty
rates are English-speaking countries.
As with tables, sorting the data on the
most significant variable greatly eases the interpretation of the data. The
data in figure 9 are sorted on the child rather than the elderly poverty rates only
because most of the research on the topic has focused on child poverty. Note
also that if the sorted variable represents time, time should always go from
left to right and on the X-axis.
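As a small illustration of sorting on the most significant variable before charting, the sketch below orders some rows by child poverty rate. The country names and rates are made-up placeholders, not the figure 9 data, and the helper sort_on is an assumption made for this example.

```python
# Hypothetical rows: (country, child poverty %, elderly poverty %).
# These values are invented for illustration only.
rates = [
    ("Country A", 14.2, 9.8),
    ("Country B", 3.1, 6.4),
    ("Country C", 21.9, 20.5),
    ("Country D", 7.6, 11.0),
]


def sort_on(rows, column):
    """Return the rows ordered by the value in the given column index."""
    return sorted(rows, key=lambda row: row[column])


# Sort on the child poverty rate (column 1), the variable of main interest.
by_child = sort_on(rates, 1)
for country, child, elderly in by_child:
    print(f"{country:<10} child {child:5.1f}   elderly {elderly:5.1f}")
```

Plotting the bars in this order lets the reader rank the cases at a glance; the secondary series simply follows along.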
One variation of the bar chart, the stacked
bar chart, should be used with caution, especially when there is no implicit
order to the categories (i.e., when the categorical variable is nominal rather
than ordinal) that make up the bar, as is the case in figure 10. Note how
difficult it is to discern the differences in the size of the components on the
upper parts of the bar. The same difficulty occurs with stacked line and area
charts.
The stacked bar chart works best when the primary comparisons are to be made
across the data series represented at the bottom of the bar. Thus, placing
the “teachers” data series at the bottom of the bars in figure 11 (and sorting
the data on that series) forces the reader’s attention on the crucial
comparison and the obvious conclusion: American teachers are fortunate to have
such a large supervisory and support staff.
One common bar charting mistake is placing the legend on the right-hand side
of the plot area (shown in figure 12). Placing the legend inside the plot area,
as in figure 9, or horizontally under the chart title (as in figure 11)
maximizes the size of the area given over to displaying the data.
Scaling effects occur when a bar chart (or a line chart, as we will see) displays two
data series with numbers of substantially different magnitudes: the variation
in the data series containing the smaller numbers is visually suppressed. Figure 12, for example,
depicts the increase in the labor force participation rate (the percent of the
adult population in the labor force) from 60% in 1970 to 67% in 2000, and the
increase in the unemployment rate from 5.3% to 7.1%. The immediate visual
impression the chart gives is that the labor force participation rate is larger
than the unemployment rate (a relatively meaningless comparison), while the
important variation in the unemployment rate (a relative increase of about 34%) is hardly
noticeable. Including an additional bar representing the sum of the other
bars in a chart (as shown in figure 13) has the same effect of reducing the
variation in the main graphical elements.
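The scaling effect in the labor force example can be checked with a little arithmetic. The sketch below uses the four rates quoted above and compares the change in percentage points with the relative change; the helper name pct_change is an assumption made for this illustration.

```python
def pct_change(start: float, end: float) -> float:
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100.0


# Rates quoted in the text (percent), 1970 and 2000.
series = {
    "Labor force participation": (60.0, 67.0),
    "Unemployment": (5.3, 7.1),
}

for name, (start, end) in series.items():
    points = end - start
    relative = pct_change(start, end)
    print(f"{name}: {points:+.1f} points, {relative:+.1f}% relative change")
```

On a shared axis the participation bars dominate because their values are larger, yet the relative change in the unemployment rate is roughly three times as large.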
To see what happens when most of the bar charting rules are violated, consider
the example in figure 14, produced by the Illinois Board of Higher Education
(IBHE), (conflict of interest disclosure) my employer.

Figure 14: A really bad bar chart.
SOURCE: IBHE 2002.
It’s not just the 3-D. Look carefully at the X-axis. Using comparable data
(the only available data: Fall headcounts rather than 12 month headcounts),
eliminating the 3-D effects, sorting time from left to right, removing the
community college data series, and adjusting the bottom of the scale, we see
something in figure 15 that the IBHE chart obscured: private institution
enrollments are responding to public demand for higher education; public
universities are not.
Note, however, that some would object to not using a zero base for the Y-axis scale
in figure 15, but I don’t think that the depiction is all that unfair. It is
fair to say, I think, that private institutions have accounted for most of the
growth in university and college enrollments in the state, a disparity that
would appear even more dramatic if annual change measures were depicted as in
figure 16, with a zero base.
Time Series Line Charts:
The time series chart is one of the most
efficient means of displaying large amounts of data in ways that provide for
meaningful analysis. The typical time series line chart is a
scatterplot chart with time represented on the X-axis and lines connecting the
data points.
Rules for Time Series (Line) Charts
- Time is almost always displayed on the X-axis from left to right.
- Display as much data with as little ink as possible.
- Make sure the reader can clearly distinguish the lines for separate data series.
- Beware of scaling effects.
- When displaying fiscal or monetary data over time, it is often best to use deflated data (e.g., inflation-adjusted or % of GDP).
Scaling effects. When two variables with numbers of different
magnitudes are graphed on the same chart, the variable with the larger scale
will generally appear to have a greater degree of variation; the smaller-scale
variable will appear relatively "flat" even though the percentage
change is the same. In figure 18, ABCorp’s stock seems to be growing much
faster than XYZCOM's, yet the rate of increase is identical.
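A quick numerical sketch shows why two series with the same growth rate look so different on a shared axis. The starting prices and the 10% growth rate are hypothetical values chosen for illustration; they are not the data behind figure 18.

```python
import math


def growth_series(start: float, rate: float, periods: int) -> list[float]:
    """Values of a series growing at a constant rate per period."""
    return [start * (1 + rate) ** t for t in range(periods + 1)]


# Hypothetical stock prices: same 10% growth per period, different magnitudes.
abcorp = growth_series(100.0, 0.10, 5)
xyzcom = growth_series(10.0, 0.10, 5)

abs_abc = abcorp[-1] - abcorp[0]   # change in price units
abs_xyz = xyzcom[-1] - xyzcom[0]
pct_abc = (abcorp[-1] / abcorp[0] - 1) * 100
pct_xyz = (xyzcom[-1] / xyzcom[0] - 1) * 100

print(f"Absolute change: ABCorp {abs_abc:.2f}, XYZCOM {abs_xyz:.2f}")
print(f"Percent change:  ABCorp {pct_abc:.1f}%, XYZCOM {pct_xyz:.1f}%")
print("Same growth rate:", math.isclose(pct_abc, pct_xyz))
```

The line for the larger series rises ten times as far on the page even though both grow by the same percentage.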
When the differences in scale are so great
as to eliminate most of the perceived variation in the smaller-scale variable,
using a second scale, displayed on the right-hand side as in figure 19, is
sometimes preferable, although this may make the interpretation of the graph
more complicated.
Many who have written about graphical distortion condemn the use of two-scale
charts because the relative sizes of the two scales are completely
arbitrary. This is true; had job approval and unemployment been plotted
on the same 0 to 90 Y-axis scale, the unemployment rate would be an almost flat
line at the bottom of the chart.
One solution to trendlines of different
magnitudes is to rescale the variables, calculating the percentage change from
a base year, but note that the selection of the base year can produce
dramatically different results.
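Rescaling to percentage change from a base year, and the sensitivity to the choice of base year, can be sketched as follows. The series values are hypothetical and the function name index_to_base is an assumption made for this example.

```python
def index_to_base(series: dict[int, float], base_year: int) -> dict[int, float]:
    """Express each value as percentage change from the base-year value."""
    base = series[base_year]
    return {year: (value / base - 1.0) * 100.0 for year, value in series.items()}


# Hypothetical series with a dip in 2001 (invented values).
values = {2000: 50.0, 2001: 40.0, 2002: 60.0}

from_2000 = index_to_base(values, 2000)
from_2001 = index_to_base(values, 2001)

print("2002 relative to 2000:", round(from_2000[2002], 1), "%")
print("2002 relative to 2001:", round(from_2001[2002], 1), "%")
```

Choosing the low point of the series as the base makes the later growth look far more dramatic, which is why the base year should be stated and defensible.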
When several time series lines are printed
in black and white, it is sometimes difficult to separate out the different
trend lines. Mixing solid, dotted, and dashed lines for each variable may
solve this problem, although it is sometimes difficult to distinguish between
dotted and dashed lines.
Scatterplots
The two-dimensional scatterplot is the most
efficient medium for the graphical display of data. A simple scatterplot
will tell you more about the relationship between two interval-level variables
than any other method of presenting or summarizing such data.
Rules for Scatterplots
- Use two interval-level variables.
- Fully define the variables with the axis titles.
- The chart title should identify the two variables and the cases (e.g., cities or states).
- If there is an implied causal relationship between the variables, place the independent variable (the one that causes the other) on the X-axis and the dependent variable (the one that may be caused by the other) on the Y-axis.
- Scale the axes to maximize the use of the plot area for displaying the data points.
- It’s a good idea to add data labels to identify the cases.
With good labeling of the variables and
cases and common-sense scaling of the X and Y-axes, there's not a lot that can
go wrong with a scatterplot, although extreme outliers on one or more of the
variables can obscure patterns in the data.
In figure 20, TV viewing is the independent
variable. (If you were trying to predict which types of students watch the most
TV, the axes would be reversed.) The scatterplot contains two optional
plotting features: a regression trendline denoting the linear relationship
between the two variables and the use of State postal ID data labels to
indicate each state's position on the chart (these labels require a special
add-in to the Excel program). Although the chart suffers from overlapping
data labels, the interpretation is straightforward; the higher the percentage
of students in a state watching more than 6 hours of TV each day, the lower the
state's math scores.
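The regression trendline mentioned above is an ordinary least-squares line. A minimal sketch of how its slope and intercept are computed is given below; the five data points are invented for illustration and are not the figure 20 data, and the function name least_squares_line is an assumption.

```python
def least_squares_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Slope and intercept of the ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical cases: % of students watching 6+ hours of TV, mean math score.
tv_pct = [10.0, 15.0, 20.0, 25.0, 30.0]
math_score = [285.0, 280.0, 276.0, 270.0, 264.0]

slope, intercept = least_squares_line(tv_pct, math_score)
print(f"score = {intercept:.1f} + ({slope:.2f}) * tv_pct")
```

A negative slope corresponds to the downward-sloping trendline described in the text: more heavy TV viewing goes with lower scores.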
Boxplots
John W. Tukey invented the boxplot as a
convenient method of displaying the distribution
of interval-level variables.
Rules for Boxplots:
· A simple boxplot plots the median and four quartiles of data for an interval-level variable.
· Boxplots are best used for comparing the distribution of the same variable for two or more groups or two or more time points.
· Boxplots are an excellent means of displaying how a single case compares to a large number of other cases.

Figure 21: Components of a boxplot
The simple boxplot, as shown in figure 21, displays the four quartiles of the
data, with the "box" comprising the two middle quartiles, separated
by the median. The upper and lower quartiles are represented by the
single lines extending from the box. More detailed versions of the boxplot
restrict the “whiskers” on the plot to 1.5 times the size of the boxes and plot
the higher or lower values (outliers) as individual points. Some versions also
plot the mean in addition to the median.
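The numbers behind the more detailed boxplot can be sketched with Python's standard library. This is one common convention, not necessarily the one used for the figures here: quartiles are computed with statistics.quantiles using the "inclusive" method, and whiskers stop at the most extreme values within 1.5 box lengths of the box. The sample values are invented.

```python
import statistics


def boxplot_stats(values: list[float]) -> dict:
    """Quartiles, whisker ends and outliers under the 1.5 x box-length rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    box_length = q3 - q1                      # interquartile range
    lower_fence = q1 - 1.5 * box_length
    upper_fence = q3 + 1.5 * box_length
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],     # most extreme values still inside the fences
        "whisker_high": inside[-1],
        "outliers": outliers,         # plotted as individual points
    }


# Hypothetical sample with one extreme value.
stats = boxplot_stats([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Other quartile conventions give slightly different box edges for small samples, so the method used should be kept consistent across the boxplots being compared.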
A single boxplot box (as in figure 21) rarely
reveals much about the data, and graphs of single variable data distributions
(using stem-and-leaf or histogram charts) rarely offer a more detailed graphic
representation of the data distribution. The real advantages of the boxplot
graphic come through, however, in single charts using several boxplots to
compare the distribution of a variable across groups or over time. An
especially useful elaboration of the boxplot graph involves plotting an
individual case over the boxplot to compare single cases to the
overall distribution (see figure 22).
Thus, figure 22 displays the percentage
Democratic vote for the 50 states over the past seven presidential
elections. Labeling a single case, we can see that the Democratic vote in
Nevada has moved steadily higher relative to the other states. One can easily
imagine applying the same plotting strategy in a variety of other
settings, for example, comparing one school district's test scores to the
distribution of test scores across other school districts.
Notes on data sources:
Higher education. The higher education data in figures 2, 14, 15 and
16 are compiled by the Illinois Board of Higher Education and are readily
available on the Board’s website. Most states have a similar governing board
for higher education, but the governance structure varies from state to state.
Most colleges and universities have an institutional research department
responsible for compiling data and preparing reports on enrollments, tuition
and fees, staffing, expenditures and student academic performance (which is
then forwarded to the governing boards). Often the data are presented in an
annual data profile.
Federal government revenue. (figures 6, 7, 8, and 10)
The president’s Office of Management and Budget
submits the proposed federal budget (for the budget year beginning October 1)
to Congress in January of each year. The actual
budget documents and spreadsheet files are available on the White House, the
Office of Management and Budget, and the Government Printing Office websites.
Poverty. The poverty data shown in figure 9 were obtained
from the Luxembourg Income Study website and are described in the Poverty
chapter that follows.
Presidential approval. (FIGURES 17, 19) Many polling agencies regularly
conduct surveys asking the more or less standard presidential approval
question: “How would you rate President Bush's performance on the job:
excellent – good – fair or poor?” The Gallup poll website has the most
complete historical data on presidential approval and would be the best source
for comparing several administrations’ approval data, but access to their data
requires a subscription fee. The “Professor PollKatz Pool of Polls” website
http://www.pollkatz.homestead.com/
contains time series charts (but not the actual data) on presidential approval
surveys conducted by 15 polling organizations. The Pollingreport.com website
is an excellent source for political polling data, reporting data from several
polling organizations.
Unemployment. The unemployment measure used in figure 19 is the
standard monthly Bureau of Labor Statistics measure available on the Bureau’s
website. It measures the percent of the labor force (those employed plus those
looking for work) who are unemployed and actively seeking employment.
Education. The Organization for Economic Cooperation and
Development (OECD) is an excellent source of governmental and social data for
the world’s developed nations. The staffing data in figure 11 were reported in
one of their annual reports on education. The TV viewing and math score data
in figure 20 were obtained from the National Center for Education Statistics
website. As discussed in the Education chapter, this is a prime source for US
educational statistics.
Elections. There is no US government agency responsible for
compiling even federal election data, although the Clerk of the House (of
Representatives) does report official tabulations for presidential and
congressional elections and the Census Bureau does do a post-congressional
election survey on voter turnout. Congressional Quarterly’s biennial America
Votes is the most comprehensive source of US election data. Figure 22 is
based on the Congressional Quarterly data obtained from the Census Bureau’s
Statistical Abstract.
Notes to Excel users:
All of the charts shown in this chapter
were prepared using the Microsoft Excel charting software, but some of the
charts required modifications to the normal charting options offered by Excel.
Figure 2c contains two levels of category
labels on the X-axis that are easily done with Excel but not explained in the
documentation. The trick is to incorporate an additional label series in the
data range, as shown below:

Figure 22: Specifying two levels of labels in the data range
To produce scatterplots with case labeling, as in figure 15, use one of
several add-ins or macros freely available on the Internet. The “J-Walk Chart
Tools” add-in allows users to specify a data range to label any chart’s data
points and includes other options for controlling chart and text size.
Nor does Excel offer a boxplot (also called
“box and whiskers”) chart type as shown in figure 17, although Microsoft does
have on-line instructions that explain how to modify existing chart types to
derive a boxplot. More simply, Jon Peltier provides a “box chart maker”, an Excel
add-in that greatly simplifies the process. A number of other add-ins and
work-arounds to extend the Excel charting capabilities are freely available on
the Internet.
Excel provides two different approaches for
formatting the X-axis for time series line graphs. Excel line graphs treat
the X-axis as if “X” is not a numerical variable, much the same way that the
bar chart graphs define the X-axis. The X-axis labels and the data points are
positioned between the tick marks and the sequencing of the data points depends
on the order of the cases in the Excel spreadsheet (see figure 19).

Figure 23: Excel line chart
In most situations, Excel’s “XY scatter-chart-with-lines” has several
advantages over the line chart. As shown in figure 20, the axis is treated as
a numerical variable, the axis labels are placed under the axis tick marks, and
any gaps in the time series are correctly spaced.

Figure 24: Excel XY Scatter chart with lines
The disadvantage of the scatter-with-lines chart is that it offers little
control over the labeling of the axis: the tick marks and labels must start
with the minimum value of the series and are evenly spaced. If the data series
starts in 1959 (e.g., most US poverty data) and you wish to have tick marks
every five years, the axis labels will be the series beginning 1959-1964-1969….
rather than 1960-1965-1970.
 
Figure 21: Excel XY scatter: X-axis is a labeled data series
To get around this, a) eliminate the X-axis labels, b) plot an extra data
series with all the values set to the axis’ minimum value, c) use a “+” sign
as the data markers, and d) use the J-Walk Chart tools to label the data
points.
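The labeling problem described above comes down to where the tick positions start. The sketch below is not Excel code; it simply contrasts ticks counted from the series minimum with ticks placed on round multiples of the step, which are the positions the extra labeled data series would need. Both function names are assumptions made for this illustration.

```python
def ticks_from_minimum(first_year: int, last_year: int, step: int) -> list[int]:
    """Ticks that start at the series minimum, as the default axis does."""
    return list(range(first_year, last_year + 1, step))


def ticks_on_multiples(first_year: int, last_year: int, step: int) -> list[int]:
    """Ticks placed on round multiples of the step within the data range."""
    start = -(-first_year // step) * step   # round up to the next multiple
    return list(range(start, last_year + 1, step))


print(ticks_from_minimum(1959, 1980, 5))
print(ticks_on_multiples(1959, 1980, 5))
```

The second list gives the X values at which the extra series, plotted at the axis minimum with "+" markers, would carry its year labels.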
Also available on the Internet are a variety
of macros and instructions for modifying the scaling of the Y-axis (Excel does
a really poor job when it comes to log scaled axes).
One would have hoped that Microsoft would
incorporate such functions into the newer versions of the Excel software. Unfortunately,
the Microsoft developers seem to have chosen to go in a different direction
with the latest version of the software (Office 2007), incorporating all sorts
of new chart styles that involve all sorts of advanced 3-D effects, shading, glow,
soft edges and shadows.
References:
Huff, Darrell. 1993. How to Lie with Statistics (WW Norton & Co).
Putnam, Robert D. 2000. Bowling Alone (Simon and Schuster).
Tufte, Edward. 1983. The Visual Display of Quantitative Information (Cheshire, Connecticut: Graphics Press).
Other useful books on graphing data:
Few, Stephen. 2004. Show Me the Numbers: Designing Tables and Graphs to Enlighten (Analytics Press).
Jones, Gerald E. 2000. How to Lie With Charts (iUniverse.com).
Kosslyn, Stephen M. 1994. Elements of Graph Design (NY: W. H. Freeman).
Miller, Jane E. 2004. “Creating Effective Charts,” The Chicago Guide to Writing about Numbers (University of Chicago Press). Chapter 7.
Wallgren, Anders, et al. 1996. Graphing Statistics & Data (Sage Publications).
Wainer, Howard. 1997. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot (Mahwah, NJ: Lawrence Erlbaum Associates).