Good Charts

Presenting Data: Tabular and graphic display of social indicators: Gary Klass
Illinois State University
© 2002

Note: The website will be discontinued shortly, to be replaced by the Just Plain Data Analysis site

Creating Good Charts

General Principles of Graphic Display

A graphical chart provides a visual display of data that otherwise would be presented in a table; a table, one that would otherwise be presented in text. Ideally, a chart should convey ideas about the data that would not be readily apparent if they were displayed in a table or as text.

The three standards for tabular display of data -- the efficient display of meaningful and unambiguous data -- apply to charts as well. As with tables, it is crucial to good charting to choose meaningful data, to clearly define what the numbers represent, and to present the data in a manner that allows the reader to quickly grasp what the data mean. As with tabular display, data ambiguity in charts arises from the failure to precisely define just what the data represent. Every dot on a scatterplot, every point on a time series line, every bar on a bar chart represents a number (actually, in the case of a scatterplot, two numbers). It is the job of the chart’s text to tell the reader just what each of those numbers represents.

Designing good charts, however, presents more challenges than tabular display as it draws on the talents of both the scientist and the artist. You have to know and understand your data, but you also need a good sense of how the reader will visualize the chart’s graphical elements.

Two problems arise in charting that are less common when data are displayed in tables. Poor, or deliberately deceptive, choices in graphic design can provide a distorted picture of the numbers and relationships they represent. A more common problem is that charts are often designed in ways that hide what the data might tell us, or that distract the reader from quickly discerning the meaning of the evidence presented in the chart. Each of these problems is illustrated in the two classic texts on data presentation: Darrell Huff’s How to Lie with Statistics (1993) and Edward Tufte’s The Visual Display of Quantitative Information (1983).

Huff’s little paperback, first published in 1954 and reissued many times thereafter, condemned graphical representations of data that “lied”. In the example shown in figure 1, two numbers, one 3 times the magnitude of the other, are represented by two cows, one 27 times larger than the other, resulting in a Lie Factor of 9.

Figure 1: Graphical distortion of data
SOURCE: Darrell Huff. 1993. How to Lie with Statistics. W. W. Norton & Co., 72.

Here the figure depicts the increase in the number of milk cows in the United States, from 8 million in 1860 to 25 million in 1936. The larger cow is thus drawn three times the height of the 1860 cow. But she is also three times as wide, thus taking up nine times the area on the page. Moreover, the graphic depicts a three-dimensional figure: when we take the depth of the cow into account, she is twenty-seven times larger in 1936. Later, Tufte developed the “Lie Factor”: a numerical measure of data distortion. Here, representing numbers that differ by a factor of 3 with images that differ in size by a factor of 27 produces a “Lie Factor” of 9.
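The arithmetic behind the cow example can be sketched in a few lines of Python. This is only an illustration of the ratio-of-ratios calculation used in the paragraph above (visual change divided by data change); the function names are invented for this sketch, and the data ratio of 3 is the text's rounded figure.

```python
def visual_ratio(linear_scale: float, dimensions: int) -> float:
    """How many times larger an image appears when each dimension is scaled."""
    return linear_scale ** dimensions


def lie_factor(visual: float, data: float) -> float:
    """Change shown in the graphic divided by the change in the data."""
    return visual / data


data_ratio = 3.0  # the text's rounded figure: about 25 million vs. 8 million cows

height_only = visual_ratio(3.0, 1)  # a bar that is only taller
as_area = visual_ratio(3.0, 2)      # an image that is taller and wider
as_volume = visual_ratio(3.0, 3)    # a drawing of a three-dimensional cow

print(lie_factor(height_only, data_ratio))  # 1.0, no distortion
print(lie_factor(as_area, data_ratio))      # 3.0
print(lie_factor(as_volume, data_ratio))    # 9.0
```

A Lie Factor of 1 means the graphic changes in proportion to the data; scaling a picture in two or three dimensions pushes the factor well above 1 even though the height alone is drawn honestly.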

Such visual distortions are not as common as they once were, but modern computer technology has made possible all sorts of new ways of lying with charts.

Edward Tufte would second Emperor Joseph II’s famous complaint to a young composer: “too many notes, Mozart.” Tufte’s unique contribution to the art of chart design was to stress the virtue of efficient data presentation. His fundamental rule of efficient graphical design is to minimize the ratio of ink-to-data by minimizing or eliminating any elements from the chart that do not aid in conveying what the numbers mean. Tufte’s advice to those who would chart is essentially the same advice offered by Strunk and White to would-be writers:

"A sentence should contain no unnecessary words, a paragraph no unnecessary sentences for the same reason that a drawing should contain no unnecessary lines and a machine no unnecessary parts." (23)

Just as the purpose of any statistic is to simplify, to represent in one number a larger set of numbers, the purpose of a chart is to simplify numerical comparisons: to represent several numerical comparisons in a single graphic. The most common error in chart design is to include elements in the graphical display that have nothing to do with the presentation of the numerical comparisons. Below we will see how this standard applies to the components of charts in general.

The Components of a Chart

There are three basic components to most charts:

· the labeling that defines the data: the title, axis titles and labels, legends defining separate data series, and notes (often, to indicate the data source),

· scales defining the range of the Y (and sometimes the X) axis, and

· the graphical elements that represent the data: the bars in bar charts, the lines in a time series plot, the points in scatterplots, or the slices of a pie chart.

Figure 2: Components of a chart

Titles. In journalistic writing a chart title will sometimes state the conclusion the writer would have the reader draw from the chart. If figure 2 were used in a Governors State University press release, the title, “Tuition and Fees Lowest at GSU” might be appropriate. In academic writing, the title should be used to define the data series, as is shown in figure 2, without imposing a data interpretation on the reader. Often, the units of measurement are specified at the end of the title after a colon or in parentheses in a subtitle (e.g. “constant dollars”, “% of GDP”, or “billions of US dollars”).

Axis titles. Axis titles should be brief and should not be used at all if the information merely repeats what is clear from the title and axis labels. It would be redundant to repeat the phrase “Tuition and fees” in the Y axis of figure 2, and the X-axis title, “University”, is completely unnecessary. If the title of the chart has the subtitle “% of GDP”, it is not necessary to repeat either the phrase or the word “percent” in the axis title.

Axis scale and data labels. The value or magnitude of the main graphical elements of the chart is defined by the axis scale, by individual data labels, or by both. Avoid using too many numbers to define the data points. A chart that labels the value of each individual data point does not need labeling on the Y-axis. If it seems necessary to label every value in a chart, consider that a table is probably a more efficient way of presenting the data.

Legends. Legends are used in charts with more than one data series. They should not be placed on the outside of the chart in a way that reduces the plot area, the amount of space given to represent the data. In figure 2, the legend is placed inside the chart (although some think that detracts from the main graphical elements); it could also be placed at the bottom of the chart (where the unnecessary “University” now stands).

Gridlines. If used at all, gridlines should use as little ink as possible so as to not overwhelm the main graphical elements of the chart.

The source. Specifying the source of the data is important for proper academic citation, but it can also give knowledgeable readers, who are often familiar with common data sources, important insights into the reliability and validity of the data. For example, knowing that crime statistics come from the FBI rather than the National Crime Victimization Survey can be a crucial bit of information.

Other chart elements. The amount of ink given over to the non-data elements of a chart that are not necessary for defining the meaning and values of the data should be kept to an absolute minimum. Plot area borders and plot area shading are unnecessary. Keep the shading of the graphical elements simple and always avoid using unnecessary 3-D effects. In most of the charts that follow, even the vertical line defining the Y-axis has been removed, following the commendable charting standards of The Economist magazine.

When Graphic Design Goes Badly.

The most general standards of charting data are thus the following:

Present meaningful data.

Define the data unambiguously.

Do not distort the data.

Present the data efficiently.

To see what happens when these rules are violated, consider figure 3, taken from Robert Putnam’s Bowling Alone (where it is labeled figure 47), a work that contains many good and bad examples of graphical data display (and unfortunately, no tables at all). In just one chart, Putnam violates the three fundamental rules of data presentation: the chart does not depict meaningful data; the data it does depict are ambiguous, and the chart design is seriously inefficient. One can’t accuse Putnam of distorting the data only because his main conclusions are not derived from the data presented in the chart.

Figure 3: Very bad graphical display

Of these, let’s consider the inefficiency first: the first thing you notice about the chart is that the graphical elements are represented in three dimensions. On both efficiency and truthfulness this is unfortunate; the 3-D effect is entirely unnecessary and in this case serves to distort the visual representation of the data. Had not the data labels been shown on the top of each bar, it would not be readily apparent that column A is in fact bigger than column F, or that C is the same size as B. In addition the chart suffers from what might be called “numbering inefficiency”: Putnam uses 13 numbers to represent 6 data points. Eliminating the 3-D, as shown in figure 4, offers a more exact representation of the data with a lot less ink.

Figure 4: Revised chart, without 3-D effects.

There are two problems of ambiguous data in the chart. The first is partly resolved in Putnam’s text, where it is explained that bar E is the percentage of women who are homemakers out of concern for their kids while bar A is the percentage of women who are working full-time because they need the money. It’s not quite clear what the numbers for those who work part-time mean. In the case of bar C, for example, are the women working only part-time because of the kids, or are they not full-time homemakers because of the money?

The other ambiguity, however, is not for the lack of proper labeling. If one looks at the chart quickly, the first impression is that only 11% of women who work full-time do so for reasons of personal satisfaction. But that is not the case. Look at the Y-axis title. Or notice that all of the percentages add up to 100. Of all the women in the survey, 11% were in the single category of “employed full-time for reasons of personal satisfaction.” This is not what one expects in a bar chart, but given the data Putnam has decided to display, there isn’t a whole lot that can be done with the chart to fix it.

Still, we have to ask, “what does this chart mean?” In particular, what data do the arrows on the bars represent?

A critical standard of good charting is that the chart should be self-explanatory. That there are problems with this chart becomes apparent as soon as one encounters Putnam’s page and a half of accompanying text devoted, not to explaining the significance of the data, but to explaining what the elements of the chart represent. A careful reading of the text tells us that there are basically three conclusions Putnam would have us draw from this chart:

· Over time (the 1980s and 90s), more women are working.

· They are doing so less for reasons of personal satisfaction and more out of necessity (i.e., to earn money).

· Correspondingly, there has been a significant decline in the number of women who choose to be homemakers for reasons of personal satisfaction.

These three conclusions are directly relevant to Putnam’s general thesis: that over time there has been a decline in social capital (adults are spending less time raising children and developing the social capital of future generations) driven in part by the demands of the expanding work force.

Based on the textual discussion that Putnam offers, it becomes clear that the most meaningful data are represented in the chart, not by the height of the bars, but by the direction of the arrows on the bars. Recall that as a general rule data presentations that include more than one time point provide for much more meaningful analyses than cross sectional or single time point presentations. Although most of the data analysis in Bowling Alone is time series data, in this case Putnam averages 21 years of data down to single data points represented by the chart’s bars, with the time series change represented by directional arrows. Thus, the most meaningful comparison in the chart (the comparison that supports the conclusion that Putnam seeks to draw from the data) is not that bar A is higher than bar B or F, but that the arrow for bar A is going up while the arrow for bar F is going down.

The crucial comparison is made directly in figure 5, based on the data presented in the textual discussion. Moreover, it directly illustrates several points that neither the text nor the original chart makes clear: in 1978, a plurality of women were homemakers who stayed home out of personal satisfaction; in 1999, women who worked full time for financial reasons were the plurality.

Figure 5: Revised chart, with data from text.

Note also that figure 5 simplifies the data presentation by eliminating the ambiguous part-time category: for part-timers, is “personal satisfaction” the reason for not staying at home or the reason for not working full-time? And it clarifies that the “necessity” refers to “kids” in the case of homemakers and to “money” in the case of full time workers.

Types of Charts

Most charts are a variation on one of four basic types: pie charts, bar charts, time series charts and scatterplots. Choosing the right type of chart depends on the characteristics of the data and the relationships you want displayed.

Pie Charts

Pie charts are used to represent the distribution of the categorical components of a single variable. Note that as a general rule, multivariate comparisons provide for more meaningful analysis than do single variable distributions and for this and other reasons pie charts should be rarely used, if at all.

Rules for pie charts:

· Avoid using pie charts.

· Use pie charts only for data that add up to some meaningful total.

· Never ever use three dimensional pie charts; they are even worse than two dimensional pies.

· Avoid forcing comparisons across more than one pie chart.

Figure 6: Comparing two pie charts

Pie charts should rarely be used. Pie charts usually contain more ink than is necessary to display the data and the slices provide for a poor representation of the magnitude of the data points. Do you remember as a kid trying to decide which slice of your birthday cake was the largest? It is more difficult for the eye to discern the relative size of pie slices than it is to assess relative bar length. Forcing the reader to draw comparisons across the two pie charts shown in figure 6 is also a bad idea: without looking at the data label percentages, one cannot easily determine whether the FY 2000 slices are larger or smaller than the corresponding FY 2007 slices.

3-D pie charts are even worse, as they also add a visual distortion of the data (in figure 7, the thick 3-D band exaggerates the size of the corporate income tax slice).

Figure 7: Exploding 3-D pie charts.

All the information in the pie charts above can be conveyed more precisely and with far less ink in the simple bar chart shown in figure 8.

Figure 8: Bar charts are better than pie charts.

Nevertheless, people like pie charts. Readers expect to see one or two pie charts similar to those in figure 6 at the very beginning of an annual agency budget report. But it would be a big mistake to rely on several pie charts for the primary data analysis in a report.

For those who would ignore all the advice given here and insist that good charts must look pretty, the most recent version of the Microsoft Excel charting software (in Office 2007, beta) will satisfy all your foolish desires: 3-D pie charts that gleam and glisten like Christmas tree ornaments, to say nothing about what you can do with the 3-D pie chart’s cousins, the donut, cylinder, cone, radar and pyramid charts.

As a general rule 3-D charts are not a good idea even when the data are three dimensional. In theory they provide for a precise representation of data, but it is rare that they provide a basis for drawing a simple conclusion.

Bar Charts:

Bar charts typically display the relationship between one or more categorical variables and one or more quantitative variables represented by the length of the bars. The categorical variables are usually defined by the categories displayed on the X-axis and, if there is more than one data series, by the legend.

Rules for bar charts:

· Minimize the ink; do not use 3-D effects.

· Sort the data on the most significant variable.

· Use rotated bar charts if there are more than 8 to 10 categories.

· Place legends inside or below the plot area.

· With more than one data series, beware of scaling distortions.

Bar charts often contain little data, a lot of ink, and rarely reveal ideas that cannot be presented much more simply in a table. Minimizing the ink-to-data ratio is especially important in the case of bar charts. Never use a 3-D bar chart. Keep the gridlines faint. Display no more than seven numbers on the Y-axis scale. If there are fewer than five bars, consider using data labels rather than a Y-axis scale; it doesn't make sense to use a five-numbered scale when the exact values can be shown with four numbers.

Figure 9: Rotated bar chart, two data series

Look at figure 9 and you can quickly grasp the main point: the United States has the highest child poverty rate among developed nations. Spend some time with it, though, and you'll discover other interesting things. Note, for example, the differences in child and elderly poverty across nations, or that the three countries at the top, with the lowest child poverty rates, are Scandinavian countries; five of the seven countries with the highest child poverty are English-speaking countries.

As with tables, sorting the data on the most significant variable greatly eases the interpretation of the data. The data in figure 9 are sorted on the child rather than the elderly poverty rates only because most of the research on the topic has focused on child poverty. Note also that if the sorted variable represents time, time should always run from left to right along the X-axis.

One variation of the bar chart, the stacked bar chart, should be used with caution, especially when there is no implicit order to the categories (i.e., when the categorical variable is nominal rather than ordinal) that make up the bar, as is the case in figure 10. Note how difficult it is to discern the differences in the size of the components on the upper parts of the bar. The same difficulty occurs with stacked line and area charts.

Figure 10: Stacked bar chart with nominal categories

The stacked bar chart works best when the primary comparisons are to be made across the data series represented at the bottom of the bar. Thus, placing the “teachers” data series at the bottom of the bars in figure 11 (and sorting the data on that series) forces the reader’s attention on the crucial comparison and the obvious conclusion: American teachers are fortunate to have such a large supervisory and support staff.

Figure 11: Stacked (100%) bar chart

One common bar charting mistake is placing the legend on the right-hand side of the plot area (as shown in figure 12). Placing the legend inside the plot area, as in figure 9, or horizontally under the chart title, as in figure 11, maximizes the size of the area given over to displaying the data.

Figure 12: Scaling effects in a bar chart with two data series

Scaling effects occur when a bar chart (or a line chart, as we will see) displays two data series with numbers of substantially different magnitudes, obscuring the variation in the data series containing the smaller numbers. Figure 12, for example, depicts the increase in the labor force participation rate (the percent of the adult population in the labor force) from 60% in 1970 to 67% in 2000, and the increase in the unemployment rate from 5.3% to 7.1%. The immediate visual impression the chart gives is that the labor force participation rate is larger than the unemployment rate (a relatively meaningless comparison), while the important variation in the unemployment rate (an increase of more than 30%) is hardly noticeable. Including an additional bar representing the sum of the other bars in a chart (as shown in figure 13) has the same effect of reducing the apparent variation in the main graphical elements.
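A quick calculation shows why the scaling effect matters here. The sketch below uses only the four numbers quoted in the paragraph above and compares each series' change in percentage points (what the bar heights show) with its relative change (what the reader usually cares about); the helper name percent_change is invented for this illustration.

```python
def percent_change(start: float, end: float) -> float:
    """Relative change from start to end, expressed as a percentage."""
    return (end - start) / start * 100.0


# (first value, last value) as quoted in the text for figure 12
series = {
    "Labor force participation rate": (60.0, 67.0),
    "Unemployment rate": (5.3, 7.1),
}

for name, (start, end) in series.items():
    points = end - start
    relative = percent_change(start, end)
    print(f"{name}: +{points:.1f} points, {relative:.0f}% relative change")
```

The participation rate gains more points, so its bars grow more on a shared axis, yet the unemployment rate's relative change is roughly three times as large.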

Figure 13: Scaling effect in a bar chart.

To see what happens when most of the bar charting rules are violated, consider the example in figure 14, produced by the Illinois Board of Higher Education (IBHE), which is (conflict of interest disclosure) my employer.

Figure 14: A really bad bar chart.
SOURCE: IBHE 2002.

It’s not just the 3-D. Look carefully at the X-axis. Using comparable data (the only available data: Fall headcounts rather than 12 month headcounts), eliminating the 3-D effects, sorting time from left to right, removing the community college data series, and adjusting the bottom of the scale, we see something in figure 15 that the IBHE chart obscured: private institution enrollments are responding to public demand for higher education; public university enrollments are not.

Figure 15: Revised enrollment chart

Some would object to not using a zero base for the Y-axis scale in figure 15, but I don’t think that the depiction is all that unfair. It is fair to say, I think, that private institutions have accounted for most of the growth in university and college enrollments in the state, a disparity that would appear even more dramatic if annual change measures were depicted as in figure 16, with a zero base.

Figure 16: Bar chart with annual change data

Time Series Line Charts:

The time series chart is one of the most efficient means of displaying large amounts of data in ways that provide for meaningful analysis. The typical time series line chart is a scatterplot chart with time represented on the X-axis and lines connecting the data points.

Rules for Time Series (Line) Charts

· Time is almost always displayed on the X-axis from left to right.

· Display as much data with as little ink as possible.

· Make sure the reader can clearly distinguish the lines for separate data series.

· Beware of scaling effects.

· When displaying fiscal or monetary data over time, it is often best to use deflated data (e.g., inflation-adjusted or % of GDP).

Figure 17: Presidential approval: time series trend with annotations

Scaling effects. When two variables with numbers of different magnitudes are graphed on the same chart, the variable with the larger scale will generally appear to have a greater degree of variation; the smaller-scale variable will appear relatively "flat" even though the percentage change is the same. In figure 18, ABCorp’s stock seems to be growing much faster than XYZCOM's, yet the rate of increase is identical.

Figure 18: Illustration of scaling distortion

When the differences in scale are so great as to eliminate most of the perceived variation in the smaller-scale variable, using a second scale, displayed on the right-hand side as in figure 19, is sometimes preferable, although this may make the interpretation of the graph more complicated.

Figure 19: Time series chart with second Y-axis

Many who have written about graphical distortion condemn the use of two-scale charts because the relative sizes of the two scales are completely arbitrary. This is true; had job approval and unemployment been plotted on the same 0 to 90 Y-axis scale, the unemployment rate would be an almost flat line at the bottom of the chart.

One solution to trendlines of different magnitudes is to rescale the variables, calculating the percentage change from a base year, but note that the selection of the base year can produce dramatically different results.
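As a rough sketch of this rescaling, the Python below converts two series to an index in which the base period equals 100. The prices are invented purely for illustration (they are not the data behind figure 18), and index_to_base is a hypothetical helper name.

```python
def index_to_base(values, base_position=0):
    """Rescale a series so that the value at base_position becomes 100."""
    base = values[base_position]
    return [round(v * 100 / base, 1) for v in values]


# Hypothetical prices, invented for this example: the second stock is
# one tenth the price of the first but grows at exactly the same rate.
abcorp = [50.0, 60.0, 75.0, 100.0]
xyzcom = [5.0, 6.0, 7.5, 10.0]

print(index_to_base(abcorp))  # [100.0, 120.0, 150.0, 200.0]
print(index_to_base(xyzcom))  # [100.0, 120.0, 150.0, 200.0]

# Choosing the last period as the base tells a different-looking story.
print(index_to_base(abcorp, base_position=3))  # [50.0, 60.0, 75.0, 100.0]
```

Plotted on a common axis, the two indexed series coincide, which removes the scaling distortion. The last line shows the caveat about base years: the same data read as "doubled" from one base and "was at half" from another.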

When several time series lines are printed in black and white, it is sometimes difficult to separate out the different trend lines. Mixing solid, dotted, and dashed lines for each variable may solve this problem, although it is sometimes difficult to distinguish between dotted and dashed lines.

Scatterplots

The two-dimensional scatterplot is the most efficient medium for the graphical display of data. A simple scatterplot will tell you more about the relationship between two interval-level variables than any other method of presenting or summarizing such data.

Rules for Scatterplots

· Use two interval-level variables.

· Fully define the variables with the axis titles.

· The chart title should identify the two variables and the cases (e.g., cities or states).

· If there is an implied causal relationship between the variables, place the independent variable (the one that causes the other) on the X-axis and the dependent variable (the one that may be caused by the other) on the Y-axis.

· Scale the axes to maximize the use of the plot area for displaying the data points.

· It’s a good idea to add data labels to identify the cases.

With good labeling of the variables and cases and common-sense scaling of the X and Y-axes, there's not a lot that can go wrong with a scatterplot, although extreme outliers on one or more of the variables can obscure patterns in the data.

Figure 20: Scatterplot with data labels and trendline

In figure 20, TV viewing is the independent variable. (If you were trying to predict which types of students watch the most TV, the axes would be reversed.) The scatterplot contains two optional plotting features: a regression trendline denoting the linear relationship between the two variables and the use of State postal ID data labels to indicate each state's position on the chart (these labels require a special add-in to the Excel program). Although the chart suffers from overlapping data labels, the interpretation is straightforward; the higher the percentage of students in a state watching more than 6 hours of TV each day, the lower the state's math scores.
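For readers curious about what a regression trendline computes, here is a minimal ordinary least-squares sketch in Python. The five data points are invented for illustration and are not the state data plotted in figure 20; linear_trend is a hypothetical helper, not a function from Excel or any charting package.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values: percent of students watching 6+ hours of TV (x)
# and average math score (y) for five made-up cases.
tv_pct = [10.0, 15.0, 20.0, 25.0, 30.0]
math_score = [285.0, 280.0, 272.0, 268.0, 260.0]

slope, intercept = linear_trend(tv_pct, math_score)
print(f"slope = {slope:.2f}, intercept = {intercept:.1f}")
```

A negative slope corresponds to the downward-sloping trendline described above: the independent variable goes on the X-axis, and the line summarizes how the dependent variable changes with it.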

Boxplots

John W. Tukey invented the boxplot as a convenient method of displaying the distribution of interval-level variables.

Rules for Boxplots:

· A simple boxplot plots the median and four quartiles of data for an interval level variable.

· Boxplots are best used for comparing the distribution of the same variable for two or more groups or two or more time points.

· Boxplots are an excellent means of displaying how a single case compares to a large number of other cases.

Figure 21: Components of a boxplot

The simple boxplot, as shown in figure 21, displays the four quartiles of the data, with the "box" comprising the two middle quartiles, separated by the median. The upper and lower quartiles are represented by the single lines extending from the box. More detailed versions of the boxplot restrict the “whiskers” on the plot to 1.5 times the size of the boxes and plot the higher or lower values (outliers) as individual points. Some versions also plot the mean in addition to the median.
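The numbers behind a boxplot can be sketched with Python's standard library. The data are invented for illustration, and the quartile method chosen here ("inclusive") is an assumption: statistical packages use several slightly different quartile conventions, so results may differ at the margins from those of any particular charting tool.

```python
from statistics import quantiles


def boxplot_summary(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers for the detailed boxplot variant."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    box = q3 - q1                    # height of the box (interquartile range)
    low_fence = q1 - whisker * box   # whiskers may not extend past the fences
    high_fence = q3 + whisker * box
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical values with one extreme case
summary = boxplot_summary([40, 42, 45, 47, 50, 52, 55, 58, 90])
print(summary)
```

In this example the box runs from 45 to 55 with the median at 50, the whiskers stop at 40 and 58, and the value 90 is plotted separately as an outlier.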

A single boxplot box (as in figure 21) rarely reveals much about the data, and graphs of single variable data distributions (using stem-and-leaf or histogram charts) rarely offer a more detailed graphic representation of the data distribution. The real advantages of the boxplot come through, however, in single charts that use several boxplots to compare the distribution of a variable across groups or over time. An especially useful elaboration involves plotting an individual case over the boxplots to compare that case to the overall distribution (see figure 22).

Figure 22: Comparing boxplots, with labels for individual cases (Nevada)

Thus, figure 22 displays the percentage Democratic vote for the 50 states over the past seven presidential elections. Labeling a single case, we can see that the Democratic vote in Nevada has moved steadily higher relative to the other states. One can easily imagine applying the same plotting strategy in a variety of other settings, for example, comparing one school district's test scores to the distribution of test scores across other school districts.

Notes on data sources:

Higher education. The higher education data in figures 2, 14, 15 and 16 are compiled by the Illinois Board of Higher Education and are readily available on the Board’s website. Most states have a similar governing board for higher education, but the governance structure varies from state to state. Most colleges and universities have an institutional research department responsible for compiling data and preparing reports on enrollments, tuition and fees, staffing, expenditures and student academic performance (which are then forwarded to the governing boards). Often the data are presented in an annual data profile.

Federal government revenue (figures 6, 7, 8, and 10). The president’s Office of Management and Budget submits the proposed federal budget (for the budget year beginning October 1) to Congress in January of each year. The actual budget documents and spreadsheet files are available on the White House, Office of Management and Budget, and Government Printing Office websites.

Poverty. The poverty data shown in figure 9 were obtained from the Luxembourg Income Study website and are described in the Poverty chapter that follows.

Presidential approval (figures 17, 19). Many polling agencies regularly conduct surveys asking the more or less standard presidential approval question: “How would you rate President Bush's performance on the job: excellent – good – fair or poor?” The Gallup poll website has the most complete historical data on presidential approval and would be the best source for comparing several administrations’ approval data, but access to their data requires a subscription fee. The “Professor PollKatz Pool of Polls” website http://www.pollkatz.homestead.com/ contains time series charts (but not the actual data) on presidential approval surveys conducted by 15 polling organizations. The Pollingreport.com website is an excellent source for political polling data, reporting data from several polling organizations.

Unemployment. The unemployment measure used in figure 19 is the standard monthly Bureau of Labor Statistics measure available on the Bureau’s website. It measures the percent of the labor force (those employed or looking for work) who are unemployed and actively seeking employment.

Education. The Organization for Economic Cooperation and Development (OECD) is an excellent source of governmental and social data for the world’s developed nations. The staffing data in figure 11 were reported in one of their annual reports on education. The TV viewing and math score data in figure 20 were obtained from the National Center for Educational Statistics website. As discussed in the Education chapter, this is a prime source for US educational statistics.

Elections. There is no US government agency responsible for compiling even federal election data, although the Clerk of the House (of Representatives) does report official tabulations for presidential and congressional elections and the Census Bureau does do a post-congressional election survey on voter turnout. Congressional Quarterly’s biennial America Votes is the most comprehensive source of US election data. Figure 22 is based on the Congressional Quarterly data obtained from the Census Bureau’s Statistical Abstract.

Notes to Excel users:

All of the charts shown in this chapter were prepared using the Microsoft Excel charting software, but some of the charts required modifications to the normal charting options offered by Excel.

Figure 2c contains two levels of category labels on the X-axis that are easily done with Excel but not explained in the documentation. The trick is to incorporate an additional label series in the data range, as shown below:

Figure 22: Specifying two levels of labels in the data range

To produce scatterplots with case labeling, as in figure 15, use one of several add-ins or macros freely available on the Internet. The “J-Walk Chart Tools” add-in allows users to specify a data range to label any chart’s data points and includes other options for controlling chart and text size.

Nor does Excel offer a boxplot (also called “box and whiskers”) chart type as shown in figure 17, although Microsoft does have on-line instructions that explain how to modify existing chart types to derive a boxplot. More simply, Jon Peltier provides a “box chart maker”, an Excel add-in that greatly simplifies the process. A number of other add-ins and work-arounds to extend the Excel charting capabilities are freely available on the Internet.

Excel provides two different approaches for formatting the X-axis for time series line graphs. Excel line graphs treat the X-axis as if “X” is not a numerical variable, much the same way that bar charts define the X-axis. The X-axis labels and the data points are positioned between the tick marks, and the sequencing of the data points depends on the order of the cases in the Excel spreadsheet (see figure 19).

Figure 23: Excel line chart

In most situations, Excel’s “XY scatter-chart-with-lines” has several advantages over the line chart. As shown in figure 20, the axis is treated as a numerical variable, the axis labels are placed under the axis tick marks, and any gaps in the time series are correctly spaced.

Figure 24: Excel XY Scatter chart with lines
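The difference between the two axis treatments can be sketched outside Excel. The snippet below is only an illustration of the idea, not Excel's internal logic; the function names and the sample years are hypothetical.

```python
def category_axis_positions(years):
    """Line-chart style: each case takes the next slot, in spreadsheet order."""
    return list(range(len(years)))


def value_axis_positions(years):
    """XY-scatter style: each case is placed at its numeric distance from the first X value."""
    return [year - years[0] for year in years]


# Hypothetical series in which 1962 and 1963 are missing.
years = [1959, 1960, 1961, 1964, 1965]
print(category_axis_positions(years))  # [0, 1, 2, 3, 4]
print(value_axis_positions(years))     # [0, 1, 2, 5, 6]
```

On the category axis the three-year gap disappears because the points are evenly spaced by row; on the value axis the gap is preserved.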

The disadvantage of the scatter-with-lines chart is that it offers little control over the labeling of the axis: the tick marks and labels must start with the minimum value of the series and are evenly spaced. If the data series starts in 1959 (e.g., most US poverty data) and you wish to have tick marks every five years, the axis labels will be the series beginning 1959-1964-1969…. rather than 1960-1965-1970.
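A short sketch makes the labeling problem concrete. The helper names are hypothetical, and the end year of 1980 is chosen only for illustration.

```python
def ticks_from_minimum(start, end, step):
    """Ticks that begin at the series minimum, as described in the text."""
    return list(range(start, end + 1, step))


def ticks_on_round_years(start, end, step):
    """Ticks on multiples of `step`, beginning at the first multiple >= start."""
    first = -(-start // step) * step  # round start up to a multiple of step
    return list(range(first, end + 1, step))


print(ticks_from_minimum(1959, 1980, 5))    # [1959, 1964, 1969, 1974, 1979]
print(ticks_on_round_years(1959, 1980, 5))  # [1960, 1965, 1970, 1975, 1980]
```

The second sequence is the one most readers expect to see on a time axis.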

Figure 21: Excel XY scatter: X-axis is a labeled data series

To get around this, a) eliminate the X-axis labels, b) plot an extra data series with all the values set to the axis’ minimum value, c) use a “+” sign as the data markers, and d) use the J-Walk Chart Tools to label the data points.
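The rows for that extra series could be generated along the following lines before pasting them into the worksheet. This is a sketch under assumptions: it reads "the axis' minimum value" as the minimum of the Y-axis, so that the "+" markers sit on the axis line, and the function name, the 1959 to 1975 range, and the minimum of 0 are hypothetical.

```python
def label_series(start, end, step, axis_minimum):
    """Build (x, y, label) rows for the extra labeling series.

    x falls on round multiples of `step`; y is pinned to the assumed Y-axis
    minimum so the markers stand in for tick marks.
    """
    first = -(-start // step) * step  # first multiple of step at or after start
    return [(year, axis_minimum, str(year)) for year in range(first, end + 1, step)]


for row in label_series(1959, 1975, 5, axis_minimum=0):
    print(row)
# (1960, 0, '1960')
# (1965, 0, '1965')
# (1970, 0, '1970')
# (1975, 0, '1975')
```

The third column is the text that a labeling add-in would attach to each marker.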

Also available on the Internet are a variety of macros and instructions for modifying the scaling of the Y-axis (Excel does a really poor job when it comes to log scaled axes).

One would have hoped that Microsoft would incorporate such functions into the newer versions of the Excel software. Unfortunately, the Microsoft developers seem to have chosen to go in a different direction with the latest version of the software (Office 2007), incorporating all sorts of new chart styles that involve all sorts of advanced 3-D effects, shading, glow, soft edges and shadows.

References:

Huff, Darrell. 1993. How to Lie with Statistics (W. W. Norton & Co).

Putnam, Robert D. 2000. Bowling Alone (Simon and Schuster).

Tufte, Edward. The Visual Display of Quantitative Information (Cheshire, Connecticut: Graphics Press, 1983).

Other useful books on graphing data:

Few, Stephen. 2004. Show Me the Numbers: Designing Tables and Graphs to Enlighten (Analytics Press).

Jones, Gerald E. How to Lie With Charts (iUniverse.com, 2000)

Kosslyn, Stephen M. Elements of Graph Design (NY: W. H. Freeman, 1994).

Miller, Jane E. 2004. “Creating Effective Charts,” The Chicago Guide to Writing about Numbers, (University of Chicago Press). Chapter 7.

Wallgren, Anders, et al. Graphing Statistics & Data (Sage Publications, 1996).

Wainer, Howard. 1997. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. (Mahwah, NJ: Lawrence Erlbaum Associates).